SSIS Design Considerations

Written by Bob Blackburn

Most major programming languages or development tools can do everything you need to do. However, each has things it does very well and some things it is very poorly. This article will look at some middle ground and help you design your packages for better performance with some simple design changes.

We will look at three common tasks and determine the better option which does not cost anymore development time. The three tasks are data conversion, replacing null foreign keys and increasing I/O throughput.

Data Conversion

If you are used to programming in SQL, you don’t have to worry about a lot of common data type conversions. I.e. non-Unicode to Unicode or among most numeric data types. The database engine performs them for you. However, SSIS enforces data type matching. So, your first inclination may be to use the Data Conversion transformation. Why else is it there? It is a synchronous streaming transformation. It has to be fast. Let’s compare this against a CAST in the SELECT statement.

Coding for Select VS Data Conversion

These are similar levels of effort to add the conversion logic. Now, we will process 5.3 million rows of data. Both used SSIS Source and Destinations.

Results:

Cast: 27.18 seconds
Data Conversion: 43.97 seconds
Cost: 62% longer

Casting during the SELECT is much faster. Of course there will be times when you need to use a data conversion. But, if you have the option, opt for the CAST.

Processing null Foreign Keys

This is very common in Data Warehouse processing. You want to set the Foreign Key to an inferred member (default record) if no record exists. This way you avoid null keys and having to use Left Joins. SSIS has an error output on the Lookup transformation. If you only suspect a very small percentage (2% in this example) of missing Foreign Keys, sending them to an Error path seems logical. You will avoid having to do null update logic on 98% of your output.

Our test:

Inline path to handle unmatched vs Error path
Default missing key: ISNULL(USStateID) ? 0 : USStateID

Inline

Error Path

Results

Inline: 01:09.516
Error Path: 01:19.953
Cost: 15%
Can be higher if larger buffer is created.

The reason it is higher is because SSIS must create a new buffer for the Error Path. This overhead is more than processing inline. It is also a few more steps to program. It is better to set the Lookup Match Error to Ignore and handle the test for all records inline.

Balanced Data Distributor

This add-on transformation is provided by Microsoft. It can be found here:

SQL 2008: http://www.microsoft.com/en-us/download/details.aspx?id=4123
SQL 2012: http://www.microsoft.com/en-us/download/details.aspx?id=30147

This transform takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading.

The transformations that follow the output must be duplicated for each thread. So, develop your logic then add the BDD.

Here is an example with four threads:

Processing

Core i7 laptop
5.3 million rows

Results:

Single Thread: 2:45
Double Threaded: 1:27
4 Threads: 0:46
7 Threads: 0:25
8 Threads: 0:30

As you can see, once you exceed the number of processors, performance will start to degrade. But, when I/O is your major bottleneck, this will significantly reduce your load times.

Summary

By simple design choices you can implement immediately, you can increase performance. There is much more to SSIS performance; but, getting the little things right puts you at a good starting point for advanced performance tuning.