The first time you had to load a Slowly Changing Dimension table with SSIS, you probably thought, "No problem, there is a transformation for that." Let's take a look at the pros and cons and see how it compares to using a hash function.
Why use an SCD Transformation?
- Source system with no timestamp
- Change Data Capture not available
- The wizard creates the necessary code

Why not?
- Customization has to be done again if the wizard is rerun
- Performance is on Slow
Comparison with Hash Function
Setup has a few more steps than the wizard.

Test scenario:
- Source table with 64,000 records of name and address data
- A small number of new records (0.05%)
- Type 1 SCD: change the data, keep no history
SCD Wizard setup:
- Staging table with new source records
- Target table to be updated
- Execute the wizard to map the input and output tables

SCD Wizard results: execution time almost 23 minutes.
SCD with Hash Setup
- Add a Hashcode nvarchar(255) column to the target table
- Populate hashcode with a hash of the data fields
- Create an empty update table in the Staging DB that is a copy of the target table
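Building the empty copy can be sketched with a SELECT ... INTO that copies the structure but no rows. The t_customer and u_customer names below match the update script later in this article; how you qualify them across the production and staging databases depends on your connection setup.

```sql
-- Sketch: build the update table as an empty structural copy of the target.
-- Qualify t_customer / u_customer with database names as your setup requires.
select *
into u_customer        -- update table in the Staging DB
from t_customer        -- target (production) table
where 1 = 0;           -- false predicate: copy structure only, no rows
```

Note that SELECT ... INTO copies columns and types but not indexes or constraints; for a work table that is usually all you need.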
hashbytes('SHA1', FirstName + LastName + Address1 + StateCode + City
    + cast(ZipCode as nvarchar(10)) + Comment)
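The one-time backfill of the new column can be sketched as follows, assuming the t_customer target table named elsewhere in this article. One caveat: in SQL Server, + concatenation returns NULL if any operand is NULL, so nullable columns may need ISNULL wrappers before hashing.

```sql
-- Sketch: one-time backfill of the hashcode column on the target table.
-- Assumes the t_customer table used in this article; wrap nullable columns
-- in ISNULL(...) or the concatenation (and thus the hash) comes back NULL.
update t_customer
set hashcode = cast(hashbytes('SHA1',
        FirstName + LastName + Address1 + StateCode + City
        + cast(ZipCode as nvarchar(10)) + Comment) as nvarchar(255));
```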
The target table now contains a hash code for each row. When selecting from the source table, calculate the hash code the same way:
,cast(hashbytes('SHA1', FirstName + LastName + Address1 + StateCode
    + City + cast(ZipCode as nvarchar(10)) + Comment)
    as nvarchar(255)) as hashcodealpha
order by custID
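In context, that computed column sits inside the data-flow source query. A complete version might look like the following, where src_customer is a hypothetical stand-in for the source table, which this article never names.

```sql
-- Sketch of the full source query; src_customer is a hypothetical
-- placeholder for the actual source table name.
select custID
      ,FirstName
      ,LastName
      ,Address1
      ,StateCode
      ,City
      ,ZipCode
      ,Comment
      ,cast(hashbytes('SHA1', FirstName + LastName + Address1 + StateCode
           + City + cast(ZipCode as nvarchar(10)) + Comment)
           as nvarchar(255)) as hashcodealpha
from src_customer
order by custID;
```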
[Figure: Control flow for the hash SCD package]

[Figure: Data flow for the hash SCD package]
In the Lookup transformation, get the ID and hash code from the production table. On a lookup error (no matching ID in the production table), perform an insert.

In the Conditional Split transformation, if the hash codes are different, write the row to the update table; if they are the same, ignore the record.
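The same routing logic can be expressed set-based in T-SQL, shown here only as a sketch of what the data flow is doing: src_customer is a hypothetical staged copy of the source query's rows (including hashcodealpha), while t_customer and u_customer match the tables used in this article.

```sql
-- Sketch: set-based equivalent of the Lookup + Conditional Split logic.
-- src_customer is a hypothetical staged copy of the source rows.

-- Lookup miss (no matching ID in production) -> insert new rows.
insert into t_customer (custID, FirstName, LastName, Address1,
                        StateCode, City, ZipCode, Comment, hashcode)
select s.custID, s.FirstName, s.LastName, s.Address1,
       s.StateCode, s.City, s.ZipCode, s.Comment, s.hashcodealpha
from src_customer s
where not exists (select 1 from t_customer t where t.custID = s.custID);

-- Lookup hit with a different hash -> stage the row for update.
insert into u_customer (custID, FirstName, LastName, Address1,
                        StateCode, City, ZipCode, Comment)
select s.custID, s.FirstName, s.LastName, s.Address1,
       s.StateCode, s.City, s.ZipCode, s.Comment
from src_customer s
join t_customer t on t.custID = s.custID
where t.hashcode <> s.hashcodealpha;
```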
Execute the update script for all changed records:

update t
set FirstName = u.FirstName
   ,LastName = u.LastName
   ,Address1 = u.Address1
   ,StateCode = u.StateCode
   ,City = u.City
   ,ZipCode = u.ZipCode
   ,Comment = u.Comment
   ,hashcode = cast(hashbytes('SHA1', u.FirstName + u.LastName
        + u.Address1 + u.StateCode + u.City
        + cast(u.ZipCode as nvarchar(10)) + u.Comment) as nvarchar(255))
from t_customer t
join u_customer u
  on t.custID = u.custID;
Hash SCD results: execution time just over 3 seconds.
As you can see, a little extra setup with an update table and a hash function can increase your throughput by a factor of more than 400: on this small table, the update went from almost 23 minutes down to just over 3 seconds. The hash also reduces the number of fields you have to compare to one, so on a very wide dimension table the proportional savings will be even greater.
Give it a try. It will
provide immediate benefits.