How to Optimize Bulk Insert Operations in SSIS

2015-01-08

I see this question a lot on the forums:  “how do I optimize SSIS for bulk insert operations?”  It might be asked in some roundabout ways:
·  Why does it take so long to load this file?
·  How can I load data faster to Azure?
·  How do I insert millions of rows efficiently?
·  What flags should I set on the bulk insert task?

This post is going to cover the basics of what every ETL developer should know about optimizing inserts.  Let’s get a couple of things out of the way first, though.

When should you use the Bulk Insert Task in SSIS?

That’s easy, never.  You read that right, never.  My colleague, Craig, advised me not to speak in absolutes, but this is a pretty safe one.  Here are the reasons why:

·  The Bulk Insert Task only reads from flat files, and the file has to match the destination table exactly – there is no way to transform, convert or cleanse the data on the way in.
·  Error handling is all or nothing.  One bad row fails the whole task; you cannot redirect offending rows the way a dataflow can.
·  A dataflow with an OLE DB Destination in fast-load mode gives you the same bulk-load performance with all of the flexibility the task lacks.

What should the dataflow look like before the insert component?

As simple as possible: a source, as few transformations as you can get away with, and the destination.  In particular, avoid fully blocking transformations such as Sort and Aggregate, which hold the entire dataset in memory; push that work into the source query, where the database can do it far more cheaply.

Getting Started on the Dataflow Task

The dataflow configuration is important, but it’s not everything.  Actually, much of the slowness during a bulk insert is caused by what’s going on in the database.  Database settings, indexes, constraints and partitions must all be considered.  We are going to set up this task to get minimal logging, which can tremendously boost performance.  I’ll measure performance along the way using a 300 MB raw file (3.8 million rows, 500 bytes per row), inserting into various versions of a table with various settings of the database and SSIS.  Let’s start with the database:

·  Set the recovery model to Simple or Bulk-Logged – bulk inserts into a database in Full recovery are always fully logged.
·  Take a table lock during the load (the Table lock checkbox on the destination, i.e. TABLOCK).
·  Pre-size the data and log files so the load does not stall on file growth.

The first two of these steps are absolutely necessary for minimal logging; the full set of prerequisites is spelled out in The Data Loading Performance Guide referenced at the end of this post.
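
Changing the recovery model for a test is a single statement.  A minimal sketch, assuming a hypothetical database named BulkDemo (Bulk-Logged also qualifies for minimal logging; Simple just keeps the log from growing between backups on a test box):

    -- Simple or Bulk-Logged recovery is a prerequisite for minimal logging
    ALTER DATABASE BulkDemo SET RECOVERY SIMPLE;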

Heaps

Heaps are a great way to get data inserted quickly and in parallel.  However, there are a few considerations if this is going to happen with minimal logging:

·  The destination must really be a heap – no clustered index, and any nonclustered indexes should be dropped before the load and rebuilt afterwards.
·  Every destination writing to the table must request a table lock (TABLOCK).
·  The table cannot be marked for replication.
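
For reference, here is the kind of heap I am loading into.  The names are hypothetical and the CHAR(496) payload just pads the row out to roughly the 500 bytes per row in my test file:

    -- A heap is simply a table with no clustered index
    CREATE TABLE dbo.HeapTarget
    (
        id      INT NOT NULL,
        payload CHAR(496) NOT NULL  -- pads the row to ~500 bytes
    );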

This is how it performed with different settings on my local machine:

Only the last two tests had minimal logging, with huge gains over the others.  The last used a parallel loading approach that takes advantage of the BU lock type for heaps.  Here is how that works.  Suppose we have a column, "id", that increments with each row in the source file.  We can use a conditional split to separate the rows into 3 paths, with “id % 3 == 0” as the first case expression, “id % 3 == 1” as the second and so on.  All three destinations have TABLOCK on and will get an even distribution of rows:
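
You can see the same effect outside of SSIS with a rough T-SQL equivalent, assuming the source file has already been split into three hypothetical files.  Run each statement in its own session; on a heap, bulk loads with TABLOCK take BU locks, and BU locks are compatible with one another, so all three loads proceed in parallel:

    -- Session 1
    BULK INSERT dbo.HeapTarget FROM 'C:\load\part0.dat' WITH (TABLOCK);
    -- Session 2
    BULK INSERT dbo.HeapTarget FROM 'C:\load\part1.dat' WITH (TABLOCK);
    -- Session 3
    BULK INSERT dbo.HeapTarget FROM 'C:\load\part2.dat' WITH (TABLOCK);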



Clustered Indexes


Clustered indexes are a different animal and often come with additional baggage – foreign keys and more indexes.  We can still get minimal logging, but this time it matters whether there is already data in the table (unlike a heap).  Parallel loading directly into the table is not possible, but we can use partitioning to our advantage: load staging tables in parallel, then switch the partitions back together.  That doesn't apply in all that many scenarios, so I am not going to test it here.  But here are the basic considerations for everything else:

·  If the table is empty, you can get minimal logging with a table lock, and the data should arrive pre-sorted in clustered key order – use the Order hint so the engine knows it is.
·  If the table already has data, the insert is fully logged under the normal rules; trace flag 610 relaxes this so that newly allocated pages are still minimally logged.
·  Drop or disable nonclustered indexes and foreign keys before the load and rebuild them afterwards.
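
Trace flag 610 can be enabled server-wide while the load runs and turned off afterwards.  A minimal sketch:

    -- Minimal logging for inserts into indexed tables (SQL Server 2008 and later)
    DBCC TRACEON (610, -1);
    -- ... run the load ...
    DBCC TRACEOFF (610, -1);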

The Order hint is not exposed in the editor of the OLEDB destination, but you can type it into the FastLoadOptions property in the Properties window here:
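
For example, assuming the clustered key is the id column, the FastLoadOptions value would look like this:

    TABLOCK,ORDER(id ASC)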


SSIS has two parameters related to batch size:  Rows per Batch and Maximum Insert Commit Size.  Ignore the first.  In SSIS, if the max commit size is greater than the buffer size, each buffer is committed.  If you set this to 0, the whole result set is committed in a single batch.  Note that this could bloat the log, so use this setting with caution.  TF 610 lets you commit in smaller batches and stay minimally logged, because it does not care whether rows already exist; without it, every batch after the first is inserting into a table that now has data and is fully logged.  This is how you should set it to get a single batch:
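
In text form, the fast-load settings on the OLE DB Destination come out like this (everything else stays at its default):

    Data access mode:            Table or view - fast load
    Table lock:                  checked
    Rows per batch:              (blank)
    Maximum insert commit size:  0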



Here are the performance results:


The take-aways:

·  Skip the Bulk Insert Task; use a dataflow with the OLE DB Destination in fast-load mode.
·  Minimal logging needs Simple or Bulk-Logged recovery plus a table lock.
·  Heaps load fastest, and with TABLOCK on every destination they can be loaded in parallel.
·  For clustered indexes, sort the input and use the Order hint; reach for trace flag 610 when the table already has data.
·  Set Maximum Insert Commit Size to 0 for a single batch, but watch the log.


Reference
The Data Loading Performance Guide
http://technet.microsoft.com/en-us/library/dd425070%28v=sql.100%29.aspx


Mark Wojciechowicz
