2014-03-06
Mark Wojciechowicz
I have often relied on Bulk Insert tasks to load data from
flat files. They are super-fast and,
with a format file, they are pretty flexible in terms of what the source data
looks like and the form of the destination table. However, I thought I would do some
benchmarking between the Bulk Insert task and a Flat File Source.
I set up a source file with 10 million rows and 3 columns: bigint, date (stored as a string in yyyymmdd) and string data. The bulk insert package is a Bulk Insert task with default settings, except that I turned the TABLOCK hint on, which allows minimal logging on SQL Server (before that, the task ran in 2.5 minutes).
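The Bulk Insert task is essentially a wrapper around the T-SQL BULK INSERT statement. As a rough hand-written sketch of this configuration (the file path and delimiters are illustrative assumptions, not the actual values I used):

    BULK INSERT dbo.bulkInsertCompare
    FROM 'C:\temp\sourceFile.txt'      -- illustrative path
    WITH (
        FIELDTERMINATOR = ',',         -- assumed column delimiter
        ROWTERMINATOR = '\n',          -- assumed row terminator
        TABLOCK                        -- table lock, enables minimal logging into a heap
    );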
The flat file package is just a Flat File Source and an OLE DB destination (I know, shame on me for using a deprecated feature, but ODBC is just so bloated). Note that in an OLE DB destination, the TABLOCK hint is on by default. Also, to deal with the date column, which is really in a string format, I turned fast parse on in the advanced options of the Flat File Source.
Each time the data is loaded into an empty heap:
CREATE TABLE [dbo].[bulkInsertCompare] (
    [id] [bigint] NULL,
    [trandate] [date] NULL,
    [someData] [varchar](50) NULL
)
Each package was executed from SSISDB. Initially, the performance was very close, both in the range of 30 seconds. I then did some further tuning. First, I set the fast parse option on for the integer column as well, which won us about 3 seconds. Next, I looked at batch size for both packages. After fiddling with a variety of settings, I settled on 100k rows per batch for the Bulk Insert task. The flat file package seemed to just get worse or show similar results with various batch sizes (set on the OLE DB destination), so I left it blank (a single batch).
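For reference, the batch size on the Bulk Insert task corresponds to the BATCHSIZE option of the underlying BULK INSERT statement; a sketch of the tuned version (path and delimiters again illustrative) would be roughly:

    BULK INSERT dbo.bulkInsertCompare
    FROM 'C:\temp\sourceFile.txt'      -- illustrative path
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n',
        TABLOCK,
        BATCHSIZE = 100000             -- commit every 100k rows
    );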
Then I considered the data flow buffer settings. In our simple scenario, we have a single execution path: one Flat File Source to an OLE DB destination. SSIS will allocate physical buffers during execution. These buffers are blocks of memory used to store the data as it moves through the pipeline. At most, there can be 5 physical buffers on an execution path (this is an internal setting). This means that if downstream components are slower than the upstream components and 5 physical buffers have already been allocated, SSIS will block more rows from being added to the execution path, slowing everything down.
A good practice is to make the buffers larger, rather than smaller, to cope with this buffer limitation. By default, DefaultBufferMaxRows is set to 10,000 rows and DefaultBufferSize is set to 10 MB (the max value is 100 MB; note that the setting is specified in bytes, though why anyone needs byte precision, I have no idea).
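To put those defaults in perspective, a rough back-of-envelope estimate (my numbers, not measured from the pipeline): a row here is about 8 bytes for the bigint, a few bytes for the date, and 50 bytes for the string, call it 60-65 bytes in the buffer. At the default DefaultBufferMaxRows of 10,000, a buffer only ever holds roughly 0.6 MB of data, nowhere near the 10 MB DefaultBufferSize, so for narrow rows like these the row cap is the limiting factor.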
I tried a number of settings, some of which performed better and some worse than the original load. What is important to consider is that the package will eventually be deployed to a production server with far greater resources than your local machine, so performance testing on your machine is only directional. Performance impacts must be tested on a server, which will have much more memory and CPU available.
In the 4th run, I settled on setting DefaultBufferMaxRows to 1,000,000. These rows are very small, so memory had little impact. The results are pretty astounding: it almost cut the time in half! The results below are in seconds.
| Package     | T1     | T2     | T3 ¹   | T4 ²   |
| Bulk Insert | 33.016 | 39.016 | 39.015 | 27     |
| Flat file   | 30.781 | 32.141 | 27.531 | 18.656 |

¹ Turned on fast parse for the integer column
² Bulk Insert batch size was set to 100k. Flat File was configured with DefaultBufferMaxRows = 1,000,000 and DefaultBufferSize set to 80 MB
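As a rough sanity check on those settings (again an estimate, using the same 60-65 bytes per row as above): 1,000,000 rows works out to about 60-65 MB per buffer, which fits comfortably under the 80 MB DefaultBufferSize, so the larger row cap can actually be used.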
Labels: Flatfile, SSIS