Bulk Insert vs. Flat File Source

2014-03-06
Mark Wojciechowicz

I have often relied on Bulk Insert tasks to load data from flat files. They are super-fast and, with a format file, they are pretty flexible about the shape of the source data and of the destination table. However, I thought I would do some benchmarking between the Bulk Insert task and a Flat File Source.

I set up a source file with 10 million rows and 3 columns: a bigint, a date (stored as a string in yyyymmdd format) and string data. The Bulk Insert package is a Bulk Insert task with default settings, except that I turned the TABLOCK hint on, which allows minimal logging in SQL Server (before that change, the task ran in 2.5 minutes).
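The Bulk Insert task is essentially a wrapper around the T-SQL BULK INSERT statement, so the run amounts to roughly the sketch below. The file and format file paths are made up for illustration; the destination heap is defined further down.

-- Rough T-SQL equivalent of the Bulk Insert task used here (paths are hypothetical)
BULK INSERT [dbo].[bulkInsertCompare]
FROM 'C:\temp\source_10m.txt'
WITH (
    FORMATFILE = 'C:\temp\source_10m.fmt',
    TABLOCK   -- the hint that allows minimally logged inserts into the heap
);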



The flat file package is just a Flat File Source and an OLE DB Destination (I know, shame on me for using a deprecated feature, but ODBC is just so bloated). Note that in an OLE DB Destination, the TABLOCK hint is on by default. Also, to deal with the date column, which is really stored as a string, I turned fast parse on in the advanced editor.
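As a side note on that date column: the source values are eight-character yyyymmdd strings, a locale-neutral format that fast parse can handle. Just to illustrate the data (this is not what fast parse does internally), the same string converts in T-SQL with style 112:

SELECT CONVERT(date, '20140306', 112) AS trandate;   -- returns 2014-03-06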



Each time, the data is loaded into an empty heap:

CREATE TABLE [dbo].[bulkInsertCompare] (
    [id] [bigint] NULL,
    [trandate] [date] NULL,
    [someData] [varchar](50) NULL
);
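One simple way to empty the heap between runs (any reset that leaves an empty heap would do):

TRUNCATE TABLE [dbo].[bulkInsertCompare];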

Each package was executed from SSISDB. Initially, the performance was very close, both in the range of 30s. I then did some further tuning. First, I turned the fast parse option on for the integer column as well, which won us about 3 seconds. Next, I looked at batch size for both components. After fiddling with a variety of settings, I settled on 100k rows per batch for the Bulk Insert task. The flat file package got worse or similar results with various batch sizes, so I left the setting blank (a single batch).
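Since both packages were run from SSISDB, the durations can be pulled straight from the catalog; a sketch of that query, with hypothetical package names:

SELECT TOP (10)
    e.package_name,
    DATEDIFF(MILLISECOND, e.start_time, e.end_time) / 1000.0 AS duration_seconds
FROM SSISDB.catalog.executions AS e
WHERE e.package_name IN (N'BulkInsert.dtsx', N'FlatFileSource.dtsx')   -- hypothetical names
ORDER BY e.start_time DESC;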

Then I considered the data flow buffer settings. In our simple scenario, we have a single execution path: one Flat File Source to an OLE DB Destination. SSIS allocates physical buffers during execution; these are blocks of memory that hold the data as it moves through the pipeline. At most, there can be 5 physical buffers on an execution path (this is an internal setting). This means that if the downstream components are slower than the upstream components and 5 physical buffers have already been allocated, SSIS will block more rows from being added to the execution path, slowing everything down.

A good practice is to make the buffers larger rather than smaller to cope with this limitation. By default, DefaultBufferMaxRows is set to 10k rows and DefaultBufferSize to 10 MB (the max value is 100 MB; note that the setting is specified in bytes – why anyone needs byte precision, I have no idea). I tried a number of settings, some of which performed better and some worse than the original load. What is important to consider is that the package will eventually be deployed to a production server with far greater resources than your local machine, so performance testing on your machine is only directional. Performance impacts must be tested on a server, which will have much more memory and CPU available.
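Because the setting is in bytes, a quick back-of-the-envelope helps when raising DefaultBufferMaxRows. The ~80 bytes per pipeline row below is an estimate for this three-column row shape, not a measured figure:

-- How many ~80-byte rows fit under the 100 MB DefaultBufferSize cap?
SELECT (100 * 1024 * 1024) / 80 AS approx_rows_per_buffer;   -- 1,310,720 rows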

In the 4th run, I settled on setting DefaultBufferMaxRows to 1,000,000 (with DefaultBufferSize raised to 80 MB to hold them). These rows are very small, so memory had little impact. The results are pretty astounding – it almost cut the time in half! The results below are in seconds.


                 T1        T2        T3 [1]    T4 [2]
Bulk Insert      33.016    39.016    39.015    27
Flat file        30.781    32.141    27.531    18.656

[1] Turned on fast parse for the integer column

[2] Bulk Insert batch size was set to 100k. Flat file was configured with DefaultBufferMaxRows = 1,000,000 and DefaultBufferSize = 80 MB
