Dynamically Generate Unknown/-1 Dimension Records for Your Data Warehouse

In a Dimensional Model Data Warehouse, ideally, no Fact record will have a NULL value for any of its Dimension Keys.  Unfortunately source transactional systems are not always kind enough to follow this rule.  For example you might have a FactOrderReturns table that is populated with 1 record for each order that customers return to your store.  To analyze what causes customers to return items, you also have a DimReturnReason dimension.  On your retail website, you might have a drop down list for the user to select a return reason, but it may not be mandatory. Instead of having the ReturnReasonKey in FactOrderReturns be NULL, you probably want to populate ReturnReasonKey in your DW with a key for an "Unknown" Dimension member - usually a Key of -1.

In order to facilitate this, you will need to insert an Unknown record into all of your Dimension tables.  Since I find myself writing SQL statements to insert one of these records into every dimension on every project, I used some dynamic SQL, XML, and system tables to dynamically generate insert scripts for every dimension in my DW.  This way, every time the DW is deployed, even if new dimensions have been added, every Dimension will have an Unknown record by default.

Here is the full script if you want to jump right to it, but below I will walk through each step.


IF OBJECT_ID(N'tempdb..#InsertValues') IS NOT NULL
BEGIN
    DROP TABLE #InsertValues
END

  
-- Build table with all columns and values to insert based on datatypes
SELECT
s.name +'.'+ t.name AS TableName
, c.name AS ColumnName
,CASE
            WHEN pks.column_id IS NOT NULL THEN 'PK'
            ELSE 'Not PK'
END AS IsPK
,t1.name AS DataType
,c.*
,CASE
            WHEN c.name = 'IsCurrentRow' THEN '1'  -- Assumes IsCurrentRow is the name of the flag to identify the current record for a Business Key
            WHEN t1.name LIKE '%date%' THEN '''1/1/1900'''
            WHEN t1.name LIKE '%int%' THEN '-1'
            WHEN t1.name = 'bit' THEN '0'
            WHEN t1.name LIKE '%char%' THEN CONCAT('''',(LEFT('Unknown',c.max_length/2)),'''')  --Return max # of characters for length surrounded by quotes. max_length/2 because max_length is in bytes, not characters
            ELSE '0'
END AS Value
INTO #InsertValues
FROM sys.tables t
INNER JOIN sys.all_columns c ON c.object_id = t.object_id
LEFT JOIN (SELECT i.object_id AS TableObjectId,ic.column_id
                                     FROM sys.indexes i  --object id is table
                                    INNER JOIN sys.index_columns ic ON ic.object_id = i.object_id
                                                AND ic.index_id = i.index_id
                                    WHERE i.is_primary_key = 1 ) pks ON pks.TableObjectId = t.object_id
                                                AND pks.column_id = c.column_id
INNER JOIN sys.schemas s ON s.schema_id = t.schema_id
INNER JOIN sys.types t1 ON t1.system_type_id = c.system_type_id
            AND t1.user_type_id = c.user_type_id
WHERE s.name = 'Dim'
AND t.name <> 'DimDate' --Exclude DimDate since it has a different "Unknown" record



--Get list of tables
DECLARE @Tables VARCHAR(max);

SELECT @Tables = stuff((SELECT DISTINCT ',[' + iv.TableName + ']'
                               FROM   #InsertValues iv
                               FOR xml path('')), 1, 1, '');


DECLARE @query NVARCHAR(max);
-- Build INSERT INTO statements with all columns and values.  Place within IF EXISTS to only insert if -1 record does not already exist
SET @query = 'SELECT ''IF NOT EXISTS (SELECT 1 FROM '' + data.TableName + '' WHERE '' +  iv.ColumnName + '' = -1) BEGIN SET IDENTITY_INSERT '' + data.TableName + '' ON;'' + ''INSERT INTO '' + data.TableName + '' (''+ColumnNames+'')'' + ''VALUES(''+ [Values] +'')'' + '' SET IDENTITY_INSERT '' + data.TableName + '' OFF; PRINT ''''Inserted -1 record for '' + data.TableName + '' '''' END ELSE PRINT ''''-1 record for '' + data.TableName + '' already exists. '''' ''
FROM   (SELECT TableName,
               stuff((SELECT '','' + b.ColumnName + ''''
                      FROM   #InsertValues b
                      WHERE  a.TableName = b.TableName
                      FOR xml path('''')), 1, 1, '''') [ColumnNames]
                                                            ,stuff((SELECT '','' + b.Value + ''''
                      FROM   #InsertValues b
                      WHERE  a.TableName = b.TableName
                      FOR xml path('''')), 1, 1, '''') [Values]

        FROM   #InsertValues a
        GROUP  BY TableName) AS data
                        INNER JOIN #InsertValues iv ON iv.TableName = data.TableName AND iv.IsPk =''PK'' '
                        ;

--Run SQL
EXEC(@query);


So if you run the above script, and have all of your dimensions in a schema called Dim, you should get a result set that is a series of INSERT statements, one for each dimension, inserting a -1 Unknown Key with default values in every column based on it's data type.  Lets step through it and see how it works.

The first bit of code simply checks if the temp table #InsertValues already exists, and if so, drops it.  The next SELECT statement builds a table (and inserts into #InsertValues) that contains 1 record for every column in every table in the "Dim" schema; except for DimDate which was intentionally excluded because it has a custom Unknown record.  The SELECT statement is returning all of the columns from sys.all_columns, but below are the only columns of interest.


So here, we are returning the TableName, ColumnName, if the column is a Primary Key or not, the data type, the max_length of the column, and the default value we want to insert into that column.  

We determine the "Value" based on the DataType and max_length columns via the CASE statement.

Max_length is divided by 2 because it contains the length in bytes, not characters.  This is necessary because in the case of the nvarchar data type, each character takes up 2 bytes, but in the LEFT function, we want the number of characters the field can contain so it can trim the string "Unknown" to fit if necessary.  Also, remember that the results of this will be used in dynamic SQL – thus the multiple single quotes.

The FROM and JOINs are from various system tables.  The sub SELECT for the LEFT JOIN is used to find the Primary Key of each table - which we later assume we want to set to -1.


Next, we use FOR XML PATH to generate a single column list of the tables.

You can read about FOR XML PATH here - https://msdn.microsoft.com/en-us/library/ms189885.aspx - but basically it turns a rowset into nodes within an XML document with each column as an element.  By specifying PATH(‘’) we are making the outer most node “blank”, and by concatenating ‘,[‘ and ‘]’ we produce a single column comma separated list.  Lastly the STUFF function is used to remove the first comma, leaving us a clean comma delimited list in one column.  *This use of FOR XML can be very handy for lots of other things!



Finally we use dynamic SQL and another FOR XML PATH to build 1 row for each table in the Dim schema that contains an INSERT statement with all of the previously generated columns and values.
When the @query SQL is executed, we get the results below –
Then you can just copy/paste the whole result set in SSMS and execute all the statements to insert all of your Unknown/-1 records.  

Hopefully this script will save you some time when building a DW!

Labels: , ,