The first time you had to load a Slowly Changing Dimension table with SSIS, you probably thought, "No problem, there is a transformation for that." Let's take a look at the pros and cons and see how it compares to using a hash function.
Why use an SCD Transformation?
- Source system with no timestamp
- Change Data Capture not available
- The wizard creates the necessary code

Why not?
- Customization has to be done again if the wizard is rerun
- Performance is on Slow
Comparison with Hash Function
Setup has a few more steps than the wizard.

Test scenario:
- Source table with 64,000 records of name and address data
- A small number of new records (0.05%)
- Type 1 SCD: change the data, keep no history
SCD Wizard setup:
- Staging table with new source records
- Target table to be updated
- Execute the wizard to map the input and output tables

SCD Wizard results: execution time almost 23 minutes.
SCD with Hash Setup
- Add a Hashcode nvarchar(255) column to the target table
- Populate hashcode with a hash of the data fields
- Create an empty update table in the Staging DB that is a copy of the target table
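Building the empty copy can be sketched with a SELECT ... INTO that copies the structure but no rows. The t_customer and u_customer names below match the update script later in this article; how you qualify them across the production and staging databases depends on your connection setup.

```sql
-- Sketch: build the update table as an empty structural copy of the target.
-- Qualify t_customer / u_customer with database names as your setup requires.
select *
into u_customer        -- update table in the Staging DB
from t_customer        -- target (production) table
where 1 = 0;           -- false predicate: copy structure only, no rows
```

Note that SELECT ... INTO copies columns and types but not indexes or constraints; for a work table that is usually all you need.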
hashbytes('SHA1', FirstName + LastName + Address1 + StateCode + City
    + cast(ZipCode as nvarchar(10)) + Comment)
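The one-time backfill of the new column can be sketched as follows, assuming the t_customer target table named elsewhere in this article. One caveat: in SQL Server, + concatenation returns NULL if any operand is NULL, so nullable columns may need ISNULL wrappers before hashing.

```sql
-- Sketch: one-time backfill of the hashcode column on the target table.
-- Assumes the t_customer table used in this article; wrap nullable columns
-- in ISNULL(...) or the concatenation (and thus the hash) comes back NULL.
update t_customer
set hashcode = cast(hashbytes('SHA1',
        FirstName + LastName + Address1 + StateCode + City
        + cast(ZipCode as nvarchar(10)) + Comment) as nvarchar(255));
```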
The target table now contains a hash code for each row. When selecting from the source table, calculate the hash code the same way:
,cast(hashbytes('SHA1', FirstName + LastName + Address1 + StateCode
    + City + cast(ZipCode as nvarchar(10)) + Comment)
    as nvarchar(255)) as hashcodealpha
order by custID
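In context, that computed column sits inside the data-flow source query. A complete version might look like the following, where src_customer is a hypothetical stand-in for the source table, which this article never names.

```sql
-- Sketch of the full source query; src_customer is a hypothetical
-- placeholder for the actual source table name.
select custID
      ,FirstName
      ,LastName
      ,Address1
      ,StateCode
      ,City
      ,ZipCode
      ,Comment
      ,cast(hashbytes('SHA1', FirstName + LastName + Address1 + StateCode
           + City + cast(ZipCode as nvarchar(10)) + Comment)
           as nvarchar(255)) as hashcodealpha
from src_customer
order by custID;
```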
[Figure: Control flow for the hash SCD package]

[Figure: Data flow for the hash SCD package]
In the Lookup transformation, get the ID and hash code from the production table. On a lookup error (no matching ID in the production table), perform an insert.

In the Conditional Split transformation, if the hash codes are different, write the row to the update table; if they are the same, ignore the record.
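The same routing logic can be expressed set-based in T-SQL, shown here only as a sketch of what the data flow is doing: src_customer is a hypothetical staged copy of the source query's rows (including hashcodealpha), while t_customer and u_customer match the tables used in this article.

```sql
-- Sketch: set-based equivalent of the Lookup + Conditional Split logic.
-- src_customer is a hypothetical staged copy of the source rows.

-- Lookup miss (no matching ID in production) -> insert new rows.
insert into t_customer (custID, FirstName, LastName, Address1,
                        StateCode, City, ZipCode, Comment, hashcode)
select s.custID, s.FirstName, s.LastName, s.Address1,
       s.StateCode, s.City, s.ZipCode, s.Comment, s.hashcodealpha
from src_customer s
where not exists (select 1 from t_customer t where t.custID = s.custID);

-- Lookup hit with a different hash -> stage the row for update.
insert into u_customer (custID, FirstName, LastName, Address1,
                        StateCode, City, ZipCode, Comment)
select s.custID, s.FirstName, s.LastName, s.Address1,
       s.StateCode, s.City, s.ZipCode, s.Comment
from src_customer s
join t_customer t on t.custID = s.custID
where t.hashcode <> s.hashcodealpha;
```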
Execute the update script for all changed records:

update t
set FirstName = u.FirstName
   ,LastName = u.LastName
   ,Address1 = u.Address1
   ,StateCode = u.StateCode
   ,City = u.City
   ,ZipCode = u.ZipCode
   ,Comment = u.Comment
   ,hashcode = cast(hashbytes('SHA1', u.FirstName + u.LastName
        + u.Address1 + u.StateCode + u.City
        + cast(u.ZipCode as nvarchar(10)) + u.Comment) as nvarchar(255))
from t_customer t
join u_customer u
  on t.custID = u.custID;
Hash SCD results: execution time just over 3 seconds.
As you can see, a little extra setup with an update table and a hash function can increase your throughput by a factor of more than 400: on this small table, the update went from almost 23 minutes down to just over 3 seconds. The hash also reduces the number of fields you have to compare to one, so on a very wide dimension table the proportional savings will be even greater.
Give it a try. It will
provide immediate benefits.