Written by Bob Blackburn
Most major programming languages or development tools can do everything you need to do. However, each has things it does very well and some things it is very poorly. This article will look at some middle ground and help you design your packages for better performance with some simple design changes.
We will look at three common tasks and determine the better option which does not cost anymore development time. The three tasks are data conversion, replacing null foreign keys and increasing I/O throughput.
If you are used to programming in SQL, you don’t have to worry about a lot of common data type conversions. I.e. non-Unicode to Unicode or among most numeric data types. The database engine performs them for you. However, SSIS enforces data type matching. So, your first inclination may be to use the Data Conversion transformation. Why else is it there? It is a synchronous streaming transformation. It has to be fast. Let’s compare this against a CAST in the SELECT statement.
Coding for Select VS Data Conversion
These are similar levels of effort to add the conversion logic. Now, we will process 5.3 million rows of data. Both used SSIS Source and Destinations.
- Cast: 27.18 seconds
- Data Conversion: 43.97 seconds
- Cost: 62% longer
Casting during the SELECT is much faster. Of course there will be times when you need to use a data conversion. But, if you have the option, opt for the CAST.
Processing null Foreign Keys
This is very common in Data Warehouse processing. You want to set the Foreign Key to an inferred member (default record) if no record exists. This way you avoid null keys and having to use Left Joins. SSIS has an error output on the Lookup transformation. If you only suspect a very small percentage (2% in this example) of missing Foreign Keys, sending them to an Error path seems logical. You will avoid having to do null update logic on 98% of your output.
- Inline path to handle unmatched vs Error path
- Default missing key: ISNULL(USStateID) ? 0 : USStateID
The reason it is higher is because SSIS must create a new buffer for the Error Path. This overhead is more than processing inline. It is also a few more steps to program. It is better to set the Lookup Match Error to Ignore and handle the test for all records inline.
- Inline: 01:09.516
- Error Path: 01:19.953
- Cost: 15%
- Can be higher if larger buffer is created.
Balanced Data Distributor
This add-on transformation is provided by Microsoft. It can be found here:
SQL 2008: http://www.microsoft.com/en-us/download/details.aspx?id=4123
SQL 2012: http://www.microsoft.com/en-us/download/details.aspx?id=30147
This transform takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading.
The transformations that follow the output must be duplicated for each thread. So, develop your logic then add the BDD.
Here is an example with four threads:
- Core i7 laptop
- 5.3 million rows
- Single Thread: 2:45
- Double Threaded: 1:27
- 4 Threads: 0:46
- 7 Threads: 0:25
- 8 Threads: 0:30
As you can see, once you exceed the number of processors, performance will start to degrade. But, when I/O is your major bottleneck, this will significantly reduce your load times.
By simple design choices you can implement immediately, you can increase performance. There is much more to SSIS performance; but, getting the little things right puts you at a good starting point for advanced performance tuning.