There are few, if any, app dev folks out there who enjoy
working with a visual designer such as SSIS.
While the designer actually outputs code in the form of XML, that
code is generally not very navigable and is certainly difficult to modify. This is a major difference from ASP.NET, in
which, if you wanted, you could do most of the work in the code page.
However, SSIS was designed with a different purpose in mind. It was not created to be
a WYSIWYG interface for designing UIs, but rather to provide an easy
way for SQL developers to implement repetitive ETL tasks. The arguments against SSIS (versus rolling your
own code or coding with a data integration library) run along the lines of:
· I can’t see the code / I don’t know what it’s doing
· I can’t do diffs in source control against a visual design
· I can do the same thing in <Language of Choice>
· I can’t see the code
· There are no “performance improvements” in using SSIS
· Still can’t see the code
The predominant argument, really,
is that a user working with the tool for the first time cannot see the
inner workings of what SSIS is doing. If
you are used to writing every line of code for an application, this is probably
a scary prospect, as it will feel like a loss of control. Though, I do not really view it much
differently than having a framework that was designed for ETL. Whether I call an Execute SQL Task in SSIS or
use the System.Data.SqlClient namespace, what difference does it make? Both frameworks obscure the mundane tasks of
getting from here to there.
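To make the point concrete, here is a rough sketch of what "calling a framework" looks like either way. I'm using Python's sqlite3 module as a stand-in for System.Data.SqlClient (the document is about .NET and T-SQL; this is just an illustrative analogue): the connect/execute/commit plumbing is handled for you whether the caller is a line of code or an SSIS task.

```python
import sqlite3

# Whether it's an SSIS Execute SQL Task or a data-access library,
# the mundane plumbing (connect, execute, clean up) is hidden from us.
# sqlite3 stands in for System.Data.SqlClient here; the shape is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, name TEXT)")
conn.execute("INSERT INTO staging VALUES (?, ?)", (1, "first row"))
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(row_count)  # 1
conn.close()
```

Either way, you never see the socket handling, the wire protocol, or the transaction bookkeeping; you just asked the framework to run a statement.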
What SSIS is good at is these repetitive
ETL tasks. To get a file with FTP and
insert its contents into a database table -- it’s a few clicks
and some configuration to set up the connections and tasks (even to make them
configurable per environment). Why would
I write that code line by line? It has
been figured out, and now we can simply use a tool that has been designed for
the purpose, over and over again. And if
you are doing ETL, you will have all sorts of tasks like this that you will do over and over again.
If developers do not trust the
tool but are forced to use it, they will roll their own ETL code inside it. For example, they will use a .NET script for
every source in a data flow (possibly not even accessing a shared
connection). Is that simpler than
plopping a stored proc into an OLE DB Source?
No…no, not really at all.
In the spirit of working around
the tool, within the tool, and rolling your own code, I have seen all sorts of
painful efforts: script sources, script
destinations, ActiveX scripts…. I am not
saying scripts should not be used – there is plenty of opportunity to use them
appropriately. Though, if you see a
package that is littered with scripts, SSIS is probably not being leveraged
for all it has in the box.
In terms of performance, SSIS
probably does not do much compared to .NET or SQL code. SSIS is using the same framework anyway; a Bulk Insert Task is just a BULK INSERT in T-SQL. What SSIS offers out of the box is a whole
lot that is important for moving tons of data.
First, SSIS uses memory buffers to chunk out data in a data flow. This means that it will work with, say, 10,000
rows in a buffer, rather than loading the entire dataset at once. This is tremendously important when we are
dealing with large volumes. Without this
chunking, and without committing rows in batches, the log file will grow quickly, and this
can even put tremendous pressure on TempDB.
I have seen clients use procs to move millions of rows between temp
tables for their nightly ETL, which led to a TempDB that was several orders of magnitude
larger than the destination database.
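The buffering idea is simple to sketch. Again using Python's sqlite3 as a neutral stand-in (SSIS and SQL Server are the real subject here), this is the batched-commit pattern: commit every N rows instead of holding one giant transaction open, which is roughly what a data flow's fixed-size buffers buy you for free.

```python
import sqlite3

def load_in_batches(conn, rows, batch_size=10_000):
    """Insert rows, committing every batch_size rows -- analogous to how
    an SSIS data flow moves data through fixed-size memory buffers
    instead of one giant transaction that bloats the log and TempDB."""
    cur = conn.cursor()
    pending = 0
    for row in rows:
        cur.execute("INSERT INTO target (id) VALUES (?)", (row,))
        pending += 1
        if pending >= batch_size:
            conn.commit()   # release the batch; the log stays small
            pending = 0
    conn.commit()           # flush the final partial batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER)")
load_in_batches(conn, range(25_000), batch_size=10_000)
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 25000
```

The function names and batch size are mine for illustration; the point is only that committing in portions bounds the amount of uncommitted work the engine must track at any moment.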
Second, SSIS employs a pipeline
concept that allows a developer to redirect flows of data. One could use a Conditional Split to channel
data in different directions, a Multicast to send the same set of data in
multiple directions, or a Union All to combine data from various external sources. This is a handy bit that is very easy to
follow in a data flow. Compare a one-page
data flow to a 200+ line proc that is pouring data back and forth between
temp tables, or code that is dumping tables between arrays. Pipelines combined with memory buffers are a
very efficient way of moving data around in these more complex scenarios.
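If the SSIS components are unfamiliar, here is a toy Python sketch of the two routing ideas mentioned above. The function names are mine, not SSIS APIs; they just mimic what a Conditional Split and a Multicast do to a set of rows.

```python
def conditional_split(rows, predicate):
    """Route each row down one of two outputs, like an SSIS Conditional Split."""
    matched, unmatched = [], []
    for row in rows:
        (matched if predicate(row) else unmatched).append(row)
    return matched, unmatched

def multicast(rows, n):
    """Hand the same row set to n downstream consumers, like an SSIS Multicast."""
    return [list(rows) for _ in range(n)]

rows = [{"id": i, "amount": i * 10} for i in range(6)]

# Split: amounts 30, 40, 50 go one way; 0, 10, 20 go the other.
big, small = conditional_split(rows, lambda r: r["amount"] >= 30)
print(len(big), len(small))         # 3 3

# Multicast: every consumer gets the full row set.
copies = multicast(rows, 2)
print(len(copies), len(copies[0]))  # 2 6
```

In a real data flow these branches run against buffered chunks rather than whole lists, which is exactly why the pipeline stays efficient at volume.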
If developers can break away
from the comfort zone of the code they know, and use SSIS more out
of the box, the experience will be much better and the packages will perform
better. Or, better said: if your
package looks like a schematic for a motherboard, delete it and start over.
So what’s wrong with SSIS?
Doing diffs in source control stinks.
The code files are difficult to
compare, and they contain changes that are relative to the machine
they were built on. That means that
there are a lot of false positives when doing a diff. If the visual design could be compared alongside
the code, and the machine-specific stuff were excluded, we would be much better
able to diff packages. I’m thinking of
the new tool in Office 2013 called Spreadsheet Compare, which allows a user to compare two sheets both
visually and with a list of changes. Here
is a post about Spreadsheet Compare:
http://www.online-tech-tips.com/ms-office-tips/compare-two-excel-files-using-spreadsheet-compare/
Objects can obscure other objects.
There are challenges in visual
designers that you just do not see in text code. One thing that has come back to haunt me from time
to time is that objects can overlay each other in the designer. I can have an Execute SQL Task sit directly
on top of another task. Or I can have an
OLE DB destination sit on top of another OLE DB destination. One large sequence container can obscure a
lot of objects. And sometimes an object
can be seen through a sequence container but not really be in it (I have no
idea how to reproduce that one). You
would never overlay text with other text in your code, so why should an object
be allowed to overlay another object?
While SSIS has these frustrating
weak points, it is important to consider the right tool for the job and to account
for who will maintain the code. I would
not use SSIS for the sake of SSIS. If it’s
not an ETL job, maybe SSIS is not the right choice. If the task is to move lots of data, then SSIS
might be a good choice, provided the creator and those who will have to
maintain the code over time know how, or can learn, to use SSIS appropriately.
Labels: ETL, SSIS