In a dimensionally modeled Data Warehouse, ideally no Fact record will have a
NULL value for any of its Dimension Keys. Unfortunately, source transactional
systems are not always kind enough to follow this rule. For example, you might
have a FactOrderReturns table that is populated with one record for each order
that customers return to your store. To analyze what causes customers to return
items, you also have a DimReturnReason dimension. On your retail website, you
might have a drop-down list for the user to select a return reason, but it may
not be mandatory. Instead of leaving ReturnReasonKey in FactOrderReturns NULL,
you probably want to populate ReturnReasonKey in your DW with the key of an
"Unknown" Dimension member, usually a Key of -1.

To facilitate this, you will need to insert an Unknown record into all of your
Dimension tables. Since I find myself writing SQL statements to insert one of
these records into every dimension on every project, I used some dynamic SQL,
XML, and system tables to generate insert scripts for every dimension in my DW.
This way, every time the DW is deployed, even if new dimensions have been
added, every Dimension will have an Unknown record by default.

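To make the goal concrete, here is a minimal, self-contained sketch of the pattern for a single dimension. It uses temp tables so it can be run anywhere; the column names ReturnReasonName and IsCurrentRow, the sample orders, and the assumption that the key is an IDENTITY column are all hypothetical and only for illustration.

```sql
-- Hypothetical stand-in for Dim.DimReturnReason (assumes an IDENTITY surrogate key).
CREATE TABLE #DimReturnReason (
    ReturnReasonKey  INT IDENTITY(1,1) PRIMARY KEY,
    ReturnReasonName NVARCHAR(50) NOT NULL,
    IsCurrentRow     BIT NOT NULL
);

-- Insert the Unknown member with an explicit key of -1.
SET IDENTITY_INSERT #DimReturnReason ON;
INSERT INTO #DimReturnReason (ReturnReasonKey, ReturnReasonName, IsCurrentRow)
VALUES (-1, N'Unknown', 1);
SET IDENTITY_INSERT #DimReturnReason OFF;

-- A regular member, keyed by the IDENTITY property as usual.
INSERT INTO #DimReturnReason (ReturnReasonName, IsCurrentRow)
VALUES (N'Damaged', 1);

-- Fact lookup: fall back to -1 when the source has no return reason.
SELECT
     src.OrderId
    ,COALESCE(rr.ReturnReasonKey, -1) AS ReturnReasonKey
FROM (VALUES (1001, N'Damaged'), (1002, NULL)) AS src(OrderId, ReturnReason)
LEFT JOIN #DimReturnReason rr
    ON rr.ReturnReasonName = src.ReturnReason
   AND rr.IsCurrentRow = 1;

DROP TABLE #DimReturnReason;
```

Order 1002 has no return reason, so its lookup finds no match and it gets the -1 key instead of a NULL.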
Here is the full script if you want to jump right to it, but below I will walk
through each step.

IF OBJECT_ID(N'tempdb..#InsertValues') IS NOT NULL
BEGIN
    DROP TABLE #InsertValues;
END

-- Build a table with all columns and the values to insert, based on data types
SELECT
     s.name + '.' + t.name AS TableName
    ,c.name AS ColumnName
    ,CASE
        WHEN pks.column_id IS NOT NULL THEN 'PK'
        ELSE 'Not PK'
     END AS IsPK
    ,t1.name AS DataType
    ,c.*
    ,CASE
        -- Assumes IsCurrentRow is the name of the flag that identifies
        -- the current record for a Business Key
        WHEN c.name = 'IsCurrentRow' THEN '1'
        WHEN t1.name LIKE '%date%' THEN '''1/1/1900'''
        -- tinyint cannot hold -1, so it is handled before the general int rule
        WHEN t1.name = 'tinyint' THEN '0'
        WHEN t1.name LIKE '%int%' THEN '-1'
        WHEN t1.name = 'bit' THEN '0'
        -- Return 'Unknown' trimmed to fit the column, surrounded by quotes.
        -- max_length / 2 because max_length is in bytes, not characters.
        -- (max) columns report max_length = -1, so they get the full string.
        WHEN t1.name LIKE '%char%' THEN
            CONCAT('''', LEFT('Unknown', CASE WHEN c.max_length = -1 THEN 7
                                              ELSE c.max_length / 2 END), '''')
        ELSE '0'
     END AS Value
INTO #InsertValues
FROM sys.tables t
INNER JOIN sys.all_columns c
    ON c.object_id = t.object_id
LEFT JOIN (
        SELECT i.object_id AS TableObjectId, ic.column_id -- object_id is the table
        FROM sys.indexes i
        INNER JOIN sys.index_columns ic
            ON ic.object_id = i.object_id
           AND ic.index_id = i.index_id
        WHERE i.is_primary_key = 1
    ) pks
    ON pks.TableObjectId = t.object_id
   AND pks.column_id = c.column_id
INNER JOIN sys.schemas s
    ON s.schema_id = t.schema_id
INNER JOIN sys.types t1
    ON t1.system_type_id = c.system_type_id
   AND t1.user_type_id = c.user_type_id
WHERE s.name = 'Dim'
  AND t.name <> 'DimDate'; -- Exclude DimDate since it has a different "Unknown" record

-- Get list of tables
DECLARE @Tables VARCHAR(MAX);
SELECT @Tables = STUFF((SELECT DISTINCT ',[' + iv.TableName + ']'
                        FROM #InsertValues iv
                        FOR XML PATH('')), 1, 1, '');

DECLARE @query NVARCHAR(MAX);

-- Build INSERT INTO statements with all columns and values. Each one is wrapped
-- in IF NOT EXISTS so it only inserts if the -1 record does not already exist.
-- Both column lists are ordered by column_id so names and values line up.
SET @query = 'SELECT
    ''IF NOT EXISTS (SELECT 1 FROM '' + data.TableName + '' WHERE '' + iv.ColumnName + '' = -1) ''
    + ''BEGIN SET IDENTITY_INSERT '' + data.TableName + '' ON; ''
    + ''INSERT INTO '' + data.TableName + '' ('' + ColumnNames + '') ''
    + ''VALUES ('' + [Values] + ''); ''
    + ''SET IDENTITY_INSERT '' + data.TableName + '' OFF; ''
    + ''PRINT ''''Inserted -1 record for '' + data.TableName + '''''' END ''
    + ''ELSE PRINT ''''-1 record for '' + data.TableName + '' already exists.''''''
FROM (SELECT TableName
            ,STUFF((SELECT '','' + b.ColumnName
                    FROM #InsertValues b
                    WHERE a.TableName = b.TableName
                    ORDER BY b.column_id
                    FOR XML PATH('''')), 1, 1, '''') AS [ColumnNames]
            ,STUFF((SELECT '','' + b.Value
                    FROM #InsertValues b
                    WHERE a.TableName = b.TableName
                    ORDER BY b.column_id
                    FOR XML PATH('''')), 1, 1, '''') AS [Values]
      FROM #InsertValues a
      GROUP BY TableName) AS data
INNER JOIN #InsertValues iv
    ON iv.TableName = data.TableName
   AND iv.IsPK = ''PK''';

-- Run SQL
EXEC(@query);

If you run the above script and have all of your dimensions in a schema called
Dim, you should get a result set that is a series of INSERT statements, one for
each dimension, each inserting a -1 Unknown Key with default values in every
column based on its data type. Let's step through it and see how it works.

The first bit of code simply checks if the temp table #InsertValues already
exists and, if so, drops it. The next SELECT statement builds a table (by
inserting into #InsertValues) that contains one record for every column in
every table in the "Dim" schema, except for DimDate, which was intentionally
excluded because it has a custom Unknown record. The SELECT statement returns
all of the columns from sys.all_columns, but only a few are of interest: the
TableName, the ColumnName, whether the column is a Primary Key or not, the data
type, the max_length of the column, and the default value we want to insert
into that column.

We determine the "Value" based on the DataType and max_length columns via the
CASE expression. Max_length is divided by 2 because it contains the length in
bytes, not characters. This is necessary because with the nvarchar data type
each character takes up 2 bytes, but the LEFT function needs the number of
characters the field can contain so it can trim the string "Unknown" to fit if
necessary. For single-byte varchar columns this halves the usable length, which
is conservative but still fits. Also, remember that the results of this will be
used to build SQL statements as text, hence the multiple single quotes.

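The bytes-versus-characters point is easy to check against the catalog. The sketch below is only an illustration: #LengthDemo and its columns are made-up names, and it reads tempdb's metadata because that is where temp tables live.

```sql
-- Throwaway temp table, used only to inspect the catalog metadata.
CREATE TABLE #LengthDemo (
    VarcharCol  VARCHAR(10),
    NvarcharCol NVARCHAR(10),
    ShortCol    NVARCHAR(3),
    MaxCol      NVARCHAR(MAX)
);

SELECT
     c.name       AS ColumnName
    ,t.name       AS DataType
    ,c.max_length AS MaxLengthBytes   -- 10, 20, 6 and -1 for the columns above
    ,CASE WHEN c.max_length > 0
          THEN CONCAT('''', LEFT('Unknown', c.max_length / 2), '''')
     END          AS QuotedValue      -- ShortCol yields 'Unk', quotes included
FROM tempdb.sys.columns c
INNER JOIN tempdb.sys.types t
    ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('tempdb..#LengthDemo')
ORDER BY c.column_id;

DROP TABLE #LengthDemo;
```

NVARCHAR(10) reports 20 bytes, which is why the division by 2 is needed. An NVARCHAR(MAX) column reports -1, so it is worth deciding explicitly what such columns should receive.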
The FROM and JOINs pull from various system catalog views. The subquery in the
LEFT JOIN is used to find the Primary Key column of each table, which we later
assume we want to set to -1.

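If you want to see what that subquery finds before running the whole script, a standalone version along these lines can help. It is a sketch that assumes the same 'Dim' schema name used in the script; key_ordinal is included only to make composite keys visible.

```sql
-- List the primary key column(s) of every table in the Dim schema.
SELECT
     s.name + '.' + t.name AS TableName
    ,c.name                AS PrimaryKeyColumn
    ,ic.key_ordinal        AS PositionInKey
FROM sys.indexes i
INNER JOIN sys.index_columns ic
    ON ic.object_id = i.object_id
   AND ic.index_id = i.index_id
INNER JOIN sys.columns c
    ON c.object_id = ic.object_id
   AND c.column_id = ic.column_id
INNER JOIN sys.tables t
    ON t.object_id = i.object_id
INNER JOIN sys.schemas s
    ON s.schema_id = t.schema_id
WHERE i.is_primary_key = 1
  AND s.name = 'Dim'
ORDER BY TableName, ic.key_ordinal;
```

A table that shows up more than once here has a composite Primary Key, and the generator would emit one statement per key column for it, so those tables are worth checking by hand.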
Next, we use FOR XML PATH to generate a single-column list of the tables. You
can read about FOR XML PATH here:
https://msdn.microsoft.com/en-us/library/ms189885.aspx
Basically, it turns a rowset into nodes within an XML document, with each
column as an element. By specifying PATH('') we make the outermost node
"blank", and because the concatenated expression (',[' + TableName + ']') has
no column name, no element tags are emitted either, so we get a single-column,
comma-separated list. Lastly, the STUFF function is used to remove the first
comma, leaving us a clean comma-delimited list in one column. This use of
FOR XML can be very handy for lots of other things!

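Here is the same trick in isolation, as a small sketch that needs no tables at all. The three table names are made up for the example.

```sql
-- FOR XML PATH('') concatenates the rows; STUFF removes the leading comma.
SELECT STUFF((SELECT ',[' + v.TableName + ']'
              FROM (VALUES ('DimCustomer'), ('DimProduct'), ('DimStore')) AS v(TableName)
              ORDER BY v.TableName
              FOR XML PATH('')), 1, 1, '') AS TableList;
-- TableList: [DimCustomer],[DimProduct],[DimStore]

-- On SQL Server 2017 or later, STRING_AGG produces the same list more directly.
SELECT STRING_AGG('[' + v.TableName + ']', ',')
           WITHIN GROUP (ORDER BY v.TableName) AS TableList
FROM (VALUES ('DimCustomer'), ('DimProduct'), ('DimStore')) AS v(TableName);
```

One thing to keep in mind with the FOR XML approach is that the characters &, < and > in the data come back as XML entities, which does not matter for table and column names like these.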
Finally, we use dynamic SQL and another FOR XML PATH to build one row for each
table in the Dim schema, containing an INSERT statement with all of the
previously generated columns and values. When the @query SQL is executed, each
row of the result set holds one of these statements. Then you can just
copy/paste the whole result set in SSMS and execute all the statements to
insert all of your Unknown/-1 records.

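If you would rather not copy and paste, one possible alternative is to capture the generated statements in a table variable and execute them in a loop. The sketch below is not part of the original script: @Statements, Stmt and the loop variables are hypothetical names, and two PRINT statements stand in for the generated INSERTs so the example runs on its own.

```sql
-- Stand-in for the generator output: two harmless PRINT statements.
-- With the real script you could fill this table with
--   INSERT INTO @Statements (Stmt) EXEC(@query);
DECLARE @Statements TABLE (
    Id   INT IDENTITY(1,1) PRIMARY KEY,
    Stmt NVARCHAR(MAX) NOT NULL
);

INSERT INTO @Statements (Stmt)
VALUES (N'PRINT ''first generated statement'';'),
       (N'PRINT ''second generated statement'';');

DECLARE @Id INT = 1, @MaxId INT, @Stmt NVARCHAR(MAX);
SELECT @MaxId = MAX(Id) FROM @Statements;

-- Execute each captured statement in turn.
WHILE @Id <= @MaxId
BEGIN
    SELECT @Stmt = Stmt FROM @Statements WHERE Id = @Id;
    EXEC sys.sp_executesql @Stmt;
    SET @Id += 1;
END
```

Running the statements one at a time keeps each dimension's IF NOT EXISTS check and PRINT message separate, which makes a deployment log easier to read.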
Hopefully this script will save you some time when building a DW!