ETL Development for App Devs

There are few, if any, App Dev folks out there who enjoy working with a visual designer such as SSIS.  While the designer does output code in the form of XML, that code is generally not very navigable and is certainly difficult to modify.  This is a major difference from ASP.NET, where, if you wanted, you could do most of the work in the code-behind.

However, SSIS was designed with a different purpose in mind.  It was not created to be a WYSIWYG interface for designing UIs, but rather to provide an easy way for SQL developers to implement repetitive ETL tasks.  The arguments against SSIS (vs. rolling your own code or coding with a data integration library) run along the lines of:
- I can’t see the code / I don’t know what it’s doing
- I can’t do diffs in source control against a visual design
- I can do the same thing in <Language of Choice>
- I can’t see the code
- There are no “performance improvements” in using SSIS
- Still can’t see the code

The predominant argument, really, is that a user working with the tool for the first time cannot see the inner workings of what SSIS is doing.  If you are used to writing every line of code for an application, this is probably a scary prospect, as it feels like a loss of control.  I do not view it much differently than using any framework that was designed for ETL, though.  Whether I call an Execute SQL Task in SSIS or use the System.Data.SqlClient namespace, what difference does it make?  Both frameworks obscure the mundane tasks of getting from here to there.

What SSIS is good at are these repetitive ETL tasks.  To get a file over FTP and insert its contents into a database table takes a few clicks and some configuration to set up the connections and tasks (even to make them configurable per environment).  Why would I write that code line by line?  It has been figured out, and now we can simply use a tool designed for the purpose, over and over again.  And if you are doing ETL, you will have all sorts of tasks like this that you will do over and over again.
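For a sense of what the hand-rolled version of that file load looks like, here is a minimal T-SQL sketch.  The file path, staging table, and delimiters are hypothetical, and the FTP step, error handling, and per-environment configuration are not even shown -- all things the SSIS FTP Task and Flat File Source give you in a few clicks:

```sql
-- Hand-rolled equivalent of an SSIS flat-file load (names/paths hypothetical).
-- Fetching the file via FTP, archiving it, and handling bad rows would all be
-- additional code; SSIS provides each of those as a configurable component.
BULK INSERT dbo.StagingOrders
FROM 'C:\etl\inbound\orders.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip the header row
    TABLOCK                -- allows minimal logging under the right recovery model
);
```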

If developers do not trust the tool but are forced to use it, they will roll their own ETL code inside it.  For example, they will use a .NET script for every source in a data flow (possibly not even accessing a shared connection).  Is that simpler than plopping a stored proc into an OLE DB Source?  No…no, not really at all.

In the spirit of working around the tool, within the tool, and rolling your own code, I have seen all sorts of painful efforts:  script sources, script destinations, ActiveX scripts….  I am not saying scripts should never be used – there is plenty of opportunity to use them appropriately.  But if you see a package that is littered with scripts, SSIS is probably not being leveraged for all it has in the box.

In terms of performance, SSIS probably does not do much better than .NET or SQL code – it is using the same framework underneath.  A Bulk Insert Task is just a BULK INSERT in T-SQL.  What SSIS offers out of the box, though, matters a great deal when moving tons of data.  First, SSIS uses memory buffers to chunk data through a data flow.  That means it works with, say, 10k rows in a buffer, rather than loading the entire dataset at once.  This is tremendously important when dealing with large volumes.  Without that buffering, and without committing rows in batches, the log file grows quickly, and this can put tremendous pressure on TempDB.  I have seen clients use procs to move millions of rows between temp tables for their nightly ETL, which led to a TempDB several orders of magnitude larger than the destination database.
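If you are stuck in T-SQL, the closest you can get to that buffering is to commit in batches yourself.  A rough sketch, with hypothetical table names, of the pattern that keeps the log in check (each statement commits on its own, so the log can truncate between batches under simple recovery):

```sql
-- Move rows in buffer-sized chunks instead of one giant transaction.
-- dbo.StagingOrders and dbo.Orders are hypothetical; column lists must match.
DECLARE @BatchSize INT = 10000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.StagingOrders
    OUTPUT DELETED.* INTO dbo.Orders;   -- move one chunk per statement/commit

    IF @@ROWCOUNT < @BatchSize BREAK;   -- last (partial) chunk was just moved
END
```

This is, of course, exactly the bookkeeping the data flow engine does for you for free.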

Second, SSIS employs a pipeline concept that allows a developer to redirect flows of data.  One can use a Conditional Split to channel data in different directions, a Multicast to send the same set of data in multiple directions, or a Union All to combine data from various external sources.  This is a handy bit that is very clear to understand in a data flow.  Compare a one-page data flow to a 200+ line proc pouring data back and forth between temp tables, or code dumping tables between arrays.  Pipelines combined with memory buffers are a very efficient way of moving data around in these more complex scenarios.
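To see how those procs grow to 200+ lines, here is what a single hand-rolled "conditional split" looks like in T-SQL (table and column names hypothetical).  In a data flow this is one split component with two arrows; in a proc, the source query gets repeated for every branch:

```sql
-- Hand-rolled conditional split: the same source poured into two destinations
-- by a predicate.  All object names here are hypothetical.
INSERT INTO dbo.DomesticOrders (OrderID, Amount, Country)
SELECT OrderID, Amount, Country
FROM dbo.StagingOrders
WHERE Country = 'US';

INSERT INTO dbo.InternationalOrders (OrderID, Amount, Country)
SELECT OrderID, Amount, Country
FROM dbo.StagingOrders
WHERE Country <> 'US';
```

Add a multicast and a couple of unions, and the line count (and the number of passes over the data) multiplies accordingly.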

If a developer can break away from the comfort zone of the code they know and use SSIS more out of the box, the experience will be much better and the code will perform better.  Or, better said:  if your package looks like a schematic for a motherboard, delete it and start over.

So what’s wrong with SSIS?

Doing diffs in source control stinks
The code files are difficult to compare, and they contain machine-specific values relative to the machine they were built on.  That means there are a lot of false positives when doing a diff.  If the visual design were comparable alongside the code, and the machine-specific values were excluded, we might be better able to compare versions.  I'm thinking of the new tool in Office 2013 called Spreadsheet Compare, which lets a user compare two sheets both visually and with a list of changes.  Here is a post about Spreadsheet Compare: http://www.online-tech-tips.com/ms-office-tips/compare-two-excel-files-using-spreadsheet-compare/

Objects can obscure other objects. 
There are challenges in visual designers that you just do not see in text code.  One thing that has come back to haunt me from time to time is that objects can overlay each other in the designer.  I can have an Execute SQL Task sit directly on top of another task, or an OLE DB destination sit on top of another OLE DB destination.  One large Sequence Container can obscure a lot of objects.  And sometimes an object can be seen through a Sequence Container but not actually be in it (I have no idea how to reproduce that one).  You would never overlay text with other text in your code, so why should an object be allowed to overlay another object?

While SSIS has these frustrating weak points, it is important to consider the right tool for the job and to account for who will maintain the code.  I would not use SSIS for the sake of SSIS.  If it's not an ETL job, maybe SSIS is not the right choice.  If the task is to move lots of data, then SSIS might be a good choice, provided the creator and those who will maintain the code over time know how, or can learn, to use SSIS appropriately.
