Bulk Insert vs. Flat File Source

2014-03-06
Mark Wojciechowicz

I have often relied on Bulk Insert tasks to load data from flat files. They are super-fast and, with a format file, they are pretty flexible about the shape of the source data and of the destination table. However, I thought I would do some benchmarking between the Bulk Insert task and a Flat File Source.

I set up a source file with 10 million rows and 3 columns: a bigint, a date (stored as a string in yyyymmdd format) and string data. The Bulk Insert package is a Bulk Insert task with default settings, except that I turned the TABLOCK hint on, which allows minimal logging in SQL Server (before that change, the task ran in 2.5 minutes).
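The Bulk Insert task is essentially a wrapper around the T-SQL BULK INSERT statement, so the run amounts to roughly the sketch below. The file and format file paths are made up for illustration; the destination heap is defined further down.

-- Rough T-SQL equivalent of the Bulk Insert task used here (paths are hypothetical)
BULK INSERT [dbo].[bulkInsertCompare]
FROM 'C:\temp\source_10m.txt'
WITH (
    FORMATFILE = 'C:\temp\source_10m.fmt',
    TABLOCK   -- the hint that allows minimally logged inserts into the heap
);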



The flat file package is just a Flat File Source and an OLE DB Destination (I know, shame on me for using a deprecated feature, but ODBC is just so bloated). Note that in an OLE DB Destination, the TABLOCK hint is on by default. Also, to deal with the date column, which is really stored as a string, I turned fast parse on in the advanced editor.
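As a side note on that date column: the source values are eight-character yyyymmdd strings, a locale-neutral format that fast parse can handle. Just to illustrate the data (this is not what fast parse does internally), the same string converts in T-SQL with style 112:

SELECT CONVERT(date, '20140306', 112) AS trandate;   -- returns 2014-03-06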



Each time, the data is loaded into an empty heap:

CREATE TABLE [dbo].[bulkInsertCompare] (
    [id] [bigint] NULL,
    [trandate] [date] NULL,
    [someData] [varchar](50) NULL
);
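One simple way to empty the heap between runs (any reset that leaves an empty heap would do):

TRUNCATE TABLE [dbo].[bulkInsertCompare];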

Each package was executed from SSISDB. Initially, the performance was very close, both in the range of 30s. I then did some further tuning. First, I turned the fast parse option on for the integer column as well, which won us about 3 seconds. Next, I looked at batch size for both components. After fiddling with a variety of settings, I settled on 100k rows per batch for the Bulk Insert task. The flat file package got worse or similar results with various batch sizes, so I left the setting blank (a single batch).
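Since both packages were run from SSISDB, the durations can be pulled straight from the catalog; a sketch of that query, with hypothetical package names:

SELECT TOP (10)
    e.package_name,
    DATEDIFF(MILLISECOND, e.start_time, e.end_time) / 1000.0 AS duration_seconds
FROM SSISDB.catalog.executions AS e
WHERE e.package_name IN (N'BulkInsert.dtsx', N'FlatFileSource.dtsx')   -- hypothetical names
ORDER BY e.start_time DESC;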

Then I considered the data flow buffer settings. In our simple scenario, we have a single execution path: one Flat File Source to an OLE DB Destination. SSIS allocates physical buffers during execution; these are blocks of memory that hold the data as it moves through the pipeline. At most, there can be 5 physical buffers on an execution path (this is an internal setting). This means that if the downstream components are slower than the upstream components and 5 physical buffers have already been allocated, SSIS will block more rows from being added to the execution path, slowing everything down.

A good practice is to make the buffers larger rather than smaller to cope with this limitation. By default, DefaultBufferMaxRows is set to 10k rows and DefaultBufferSize to 10 MB (the max value is 100 MB; note that the setting is specified in bytes – why anyone needs byte precision, I have no idea). I tried a number of settings, some of which performed better and some worse than the original load. What is important to consider is that the package will eventually be deployed to a production server with far greater resources than your local machine, so performance testing on your machine is only directional. Performance impacts must be tested on a server, which will have much more memory and CPU available.
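Because the setting is in bytes, a quick back-of-the-envelope helps when raising DefaultBufferMaxRows. The ~80 bytes per pipeline row below is an estimate for this three-column row shape, not a measured figure:

-- How many ~80-byte rows fit under the 100 MB DefaultBufferSize cap?
SELECT (100 * 1024 * 1024) / 80 AS approx_rows_per_buffer;   -- 1,310,720 rows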

In the 4th run, I settled on setting DefaultBufferMaxRows to 1,000,000 (with DefaultBufferSize raised to 80 MB to hold them). These rows are very small, so memory had little impact. The results are pretty astounding – it almost cut the time in half! The results below are in seconds.


                 T1        T2        T3 [1]    T4 [2]
Bulk Insert      33.016    39.016    39.015    27
Flat file        30.781    32.141    27.531    18.656

[1] Turned on fast parse for the integer column

[2] Bulk Insert batch size was set to 100k. Flat file was configured with DefaultBufferMaxRows = 1,000,000 and DefaultBufferSize = 80 MB
