Have your SSIS Fast Load (Bulk Load) and Row by Row Error Messages too
SSIS OLEDB Destination
In the SSIS OLEDB Destination component there are several Data Access options to select from. For this post we will only be discussing “Table or view” and “Table or view – fast load”. When you choose the fast load option, you are also given a few other options to set. Rows per batch and Maximum insert commit size have always confused me because they sound similar, so I dug around a bit for the definitions.
Since fast load uses a BULK INSERT, Rows Per Batch and Maximum Insert Commit Size are actually arguments for the BULK INSERT operation. Maximum Insert Commit Size equates to the BATCHSIZE argument, and Rows Per Batch equals… ROWS_PER_BATCH. See this post on MSDN for a bit more.
Knowing that, we can define the OLEDB Destination fast load options using the definitions from BOL:
Rows Per Batch (ROWS_PER_BATCH)
“Indicates the approximate number of rows of data in the data file. By default, all the data in the data file is sent to the server as a single transaction, and the number of rows in the batch is unknown to the query optimizer. If you specify ROWS_PER_BATCH (with a value > 0) the server uses this value to optimize the bulk-import operation. The value specified for ROWS_PER_BATCH should be approximately the same as the actual number of rows.”
Maximum Insert Commit Size (BATCHSIZE)
“Specifies the number of rows in a batch. Each batch is copied to the server as one transaction. If this fails, SQL Server commits or rolls back the transaction for every batch. By default, all data in the specified data file is one batch.”
So by changing the Maximum Insert Commit Size, we can change the number of rows that will be committed at one time. SSIS will continue to BULK INSERT batches of this size until all of the rows in the pipeline have been inserted (or failed). And, if I understand the BOL entry correctly, Rows Per Batch is actually just an optimizer hint to get maximum performance out of the BULK INSERT operation. That is a topic for another discussion!
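To make the mapping concrete, here is a rough T-SQL sketch of what those two options translate to (the file path here is just a placeholder, not from this package):

-- Maximum Insert Commit Size maps to BATCHSIZE: commit every
-- 10,000 rows as its own transaction instead of committing the
-- whole file as a single transaction.
BULK INSERT dbo.FactClaim
FROM 'C:\data\claims.dat'
WITH (BATCHSIZE = 10000);

-- Rows Per Batch maps to ROWS_PER_BATCH: an optimizer hint giving
-- the approximate total number of rows in the file.
BULK INSERT dbo.FactClaim
FROM 'C:\data\claims.dat'
WITH (ROWS_PER_BATCH = 1000000);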
Why would you ever NOT use fast load?
With the fast load option, if ANY row in a transaction fails (constraints, PK violations, etc.) that entire transaction will fail to commit. This means that all of the rows that were part of that BULK INSERT operation – the number of rows being set by the Maximum Insert Commit Size – will fail and/or be passed to the OLEDB Destination error row path.
Unfortunately, this means that if you leave the default value for Maximum Insert Commit Size (2147483647) and try to insert 1 million records, and 1 record fails a constraint, all 1 million records will be sent to the error output. Even worse, you can’t get an Error Description from the fast load/BULK INSERT operation. So you would get 1 million records in your error pipeline, not know what the error was, and not know which row caused the error.
Getting the benefits of fast load while still capturing error rows and descriptions
To get the benefits of a fast load but still get the detailed error information of the non-fast load, row by row insert, you can chain 2 OLEDB Destinations together and change the Maximum Insert Commit Size.
In the screen shot above, we are using the first OLEDB Destination with the fast load option into our FactClaim fact table, and we have the Maximum Insert Commit Size set to 10,000 rows. If 1 row in this batch fails, those 10,000 rows will go to the row by row insert – but if none of those 10,000 rows have errors, they will be committed using BULK INSERT via the fast load. So if we were trying to load 1 million records, we would of course end up with 100 BULK INSERT transactions (assuming no errors), but depending on your required design this may be more acceptable than having the entire 1 million records fail or being forced to use non-fast load.
A side benefit of changing the Maximum Insert Commit Size is that it can help to manage tempdb and log space, because smaller batches are committed at one time – the whole 1 million rows would not need to be rolled back if the last row fails.
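If you want to watch that effect while the package runs, the standard DBCC SQLPERF command reports transaction log usage (a general observation tip on my part, not something from the original package):

-- Reports the percentage of each database's transaction log in use;
-- run it during the load to see smaller batches keep log usage down.
DBCC SQLPERF(LOGSPACE);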
Getting the exact error row and error description
Once you send the smaller set of rows from the failed batch to the second OLEDB Destination with non-fast load, SSIS will try to insert those rows one at a time. In this way, only the exact rows that fail will be sent to this second OLEDB Destination’s error output. Thus only the rows that actually fail the insert are left uninserted.
We can then use a script component to add an ErrorDescription column, and capture the actual error description with code similar to the below.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Translate the error code column coming from the second OLEDB
    // Destination's error output into a human-readable description.
    Row.ErrorDescription = ComponentMetaData.GetErrorDescription(Row.OLEDBDestination1ErrorCode);
}
Finally we can get a row count of the actual errors, and store those error rows, with error description, to a flat file or a table.
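If you choose a table for the error rows, something like the sketch below would work; the table and column names here are my own placeholders (aside from ErrorDescription), not from the original package.

-- Hypothetical error-logging table: one row per failed insert, with
-- the description produced by the script component.
CREATE TABLE dbo.FactClaimLoadErrors
(
    ErrorCode        int            NULL,
    ErrorDescription nvarchar(4000) NULL,
    LoadDate         datetime2(0)   NOT NULL DEFAULT SYSDATETIME()
    -- ...plus the original pipeline columns for the failed row
);

-- Row count of the actual errors from the load:
SELECT COUNT(*) AS ErrorRows FROM dbo.FactClaimLoadErrors;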
By using this method, we can get most of the benefit of using the fast load option (fast loading data with 0 errors in one BULK INSERT would likely be faster), but still be able to capture the exact error row and description.