2015-01-08
I see this question a lot on the forums: “how do I optimize SSIS for bulk insert
operations?” It might be asked in some
roundabout ways:
- Why does it take so long to load this file?
- How can I load data faster to Azure?
- How do I insert millions of rows efficiently?
- What flags should I set on the bulk insert task?
This post is going to cover the
basics of what every ETL developer should know about optimizing inserts. Let’s get a couple of things out of the way first,
though.
When should you use the Bulk Insert Task in SSIS?
That’s easy: never. You read that right, never. My colleague, Craig, advised me not to speak in
absolutes, but this is a pretty safe one.
Here are the reasons why:
- Data flows are faster, as proven here.
- You cannot exclude columns without a format file.
- If you use a format file, that’s one more artifact to deploy, which increases complexity and risk.
- Creating format files is really kooky and you need to google the syntax every time you do it.
- You cannot add columns (unless you use a format file and a default constraint – not very useful).
- You cannot transform columns in flight.
- DATE data types are extremely fussy.
- The target table includes the database name (the way the SSIS component works), which cripples portability to other databases (unless you override this with an expression).
- They never go right the first time – seriously, I have never gotten one right on the first go.
What should the dataflow look like before the insert component?
- It should have a source and very little else. In fact, it’s a really good idea to stage
your data before the insert.
- Do all the heavy cleansing and data manipulation ahead of time. That way, you will not tie up a connection to your destination while you are busy manipulating data in the pipeline.
- Leverage the database for sort operations (see the sketch below)
- Eliminate blocking and partially blocking components
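For example, instead of a Sort component (which is fully blocking), push the sort into the source query and tell SSIS about it in the source's advanced editor (IsSorted = true on the output, SortKeyPosition on the key column). A minimal sketch, assuming a hypothetical staging table keyed on id:

    -- Sorting in the database avoids a fully blocking Sort component
    SELECT id, col1, col2
    FROM dbo.Staging
    ORDER BY id;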
Getting Started on the Dataflow Task
The dataflow configuration is important, but it’s not
everything. Actually, much of the slowness
during a bulk insert is caused by what’s going on in the database. Database settings, indexes, constraints and
partitions must all be considered. We are going to set up this task to get minimal logging, which can tremendously boost performance. I’ll measure performance along the way using a 300 MB raw file (3.8 million rows, 500 bytes per row) inserted into various versions of a table under various database and SSIS settings. Let’s start with the database:
- Make sure the database is in Simple or Bulk
Logged recovery model
- Make sure that backups and other maintenance
tasks are not occurring during your ETL run
- Turn on Trace Flag 610 – this is done at the server level as a startup parameter (applies only to clustered indexes)
The first two of these steps are absolutely necessary for minimal logging.
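If you prefer to script the first and third of these, a minimal sketch looks like this (ETL_Demo is a hypothetical database name; DBCC TRACEON with -1 enables the flag globally until the next restart, while the -T610 startup parameter makes it stick):

    -- Minimal logging requires Simple or Bulk Logged recovery
    ALTER DATABASE ETL_Demo SET RECOVERY BULK_LOGGED;

    -- Enable TF 610 instance-wide for the current uptime
    DBCC TRACEON (610, -1);

    -- Confirm the flag is active
    DBCC TRACESTATUS (610);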
Heaps
Heaps are a great way to get data inserted quickly and
in parallel. However, there are a few
considerations if this is going to happen with minimal logging.
- There must be no indexes on the table at
all. Simply disabling them will not do
the trick.
- You must use the TABLOCK hint
- You can still parallel load a heap with the TABLOCK hint on. It will take a special
lock (BU) which is compatible with other BU locks.
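For reference, the same pattern in straight T-SQL would be an insert with the TABLOCK hint – the table names here are hypothetical:

    -- TargetHeap has no indexes at all; TABLOCK takes a BU lock,
    -- which is compatible with other BU locks, so several sessions
    -- can run this same insert in parallel
    INSERT INTO dbo.TargetHeap WITH (TABLOCK)
    SELECT id, col1, col2
    FROM dbo.Staging;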
This is how it performed with different settings on my local
machine:
Only the last two tests had minimal logging, with huge gains over the others. The last test used a parallel loading approach, taking advantage of the BU lock type for heaps. This is how that worked: suppose we have a column, "id", that increments with each row in the source file. We can use a conditional split to separate the rows into 3 paths, with “id % 3 == 0” as the first case expression, “id % 3 == 1” as the second, and so on. All three destinations have TABLOCK on and will get an even distribution of rows:
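The case expressions in the conditional split, in SSIS expression syntax:

    Case 1: id % 3 == 0
    Case 2: id % 3 == 1
    Case 3: id % 3 == 2   (or simply route the remainder to the default output)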
Clustered indexes are a different
animal and often come with additional issues – foreign keys and more indexes. We can still get minimal logging, but this
time it matters whether there is already data in the table (unlike a heap). Parallel loading is not possible directly
into the table, but we can use partitioning to our advantage, load the data
into staging tables in parallel, and then switch the partitions back together. That approach doesn't apply in many scenarios, so I am not going to test it here. But here
are the basic considerations for everything else:
- Drop indexes before the load (when these are recreated they will be minimally logged – a much faster operation all told, especially in a staging / partition-switching scenario). A T-SQL sketch of this whole pattern follows the list.
- Uncheck the Check Constraints flag in the OLEDB destination (note that you will need to check the constraints after the table is loaded or they will not be trusted).
- The Check Constraints flag validates foreign keys, which can cause multiple sort operations as the keys get validated before the insert. I explained that here.
- If you do not have TF 610 turned on:
- Use the TABLOCK hint
- Use the ORDER hint (this takes the column names
of the clustered index). Note that the
incoming data must be in the order of the clustered index
- Make sure there are no rows in the table. Even a single row in the table will cause
full logging
- Insert all the rows in a single batch. If you do not, only the first batch will be
minimally logged. More on how to do that
in SSIS below.
- If you do have TF 610 turned on, then the hints
do not matter, but the data should still arrive in the order of the clustered index. I have ignored this advice in my testing, because I find that there are still performance gains from using the hints. With TF 610, the table can contain data and the inserts do not
need to be in a single batch. Consequently, turning this flag on can usually
boost performance without code changes.
It should be tested, of course. :)
- Compression causes extra overhead in order to
compress the data. If the table does not
truly need it, do not use compression as a reflex to conserve
disk space. In a staging table, it is not
worth the downstream benefit that you would gain on the reading side.
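Pulling the index and constraint advice together, here is a minimal T-SQL sketch of the pattern – the table, index and column names are hypothetical:

    -- Drop nonclustered indexes before the load; recreating them
    -- afterwards is minimally logged and faster overall
    DROP INDEX IX_Target_Col1 ON dbo.Target;

    -- ... run the SSIS dataflow here, with TABLOCK (and ORDER if no TF 610) ...

    -- Recreate the index once the rows are in
    CREATE NONCLUSTERED INDEX IX_Target_Col1 ON dbo.Target (Col1);

    -- Re-check any constraints skipped during the load, otherwise
    -- the optimizer will treat them as not trusted
    ALTER TABLE dbo.Target WITH CHECK CHECK CONSTRAINT ALL;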
The ORDER hint is not exposed in
the editor of the OLEDB destination, but you can type it into the properties
window here:
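For example, assuming the clustered index is on a single column named Id, the value typed into the FastLoadOptions property would look like:

    TABLOCK,ORDER(Id ASC)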
SSIS has two parameters related to
batch size: Rows Per Batch and Maximum
Insert Commit Size. Ignore the
first. In SSIS, if the max commit size
is greater than the buffer size, each buffer is committed as its own batch. If you set it to 0, the whole result set is
committed in a single batch. Note that
this could bloat the log, so use this setting with caution. TF 610 can still commit in smaller batches
because it does not care whether rows already exist.
This is how you should set it to get a single batch:
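In the OLEDB destination editor, that means (the underlying property is FastLoadMaxInsertCommitSize, if you are setting it through an expression):

    Rows per batch:              <blank>
    Maximum insert commit size:  0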
Here are
the performance results:
The take-aways:
- Minimal logging can give a huge performance
boost for bulk insert operations
- Use TF 610 if you are inserting into a clustered index and there is data in the table. That test didn't have the best performance, but it still beat many of the runs against a table with no data.
- Use partition switching if each load can be for
a single partition and you cannot use TF 610
- Check constraints and recreate indexes
after the load
- Commit in a single batch where practical
Reference
The Data Loading Performance Guide
Mark Wojciechowicz
Labels: SSIS