Random, Repeatable Sampling of Dimension and Fact Records with SQL Server


Sometimes, whether you are first building and populating a data warehouse or working with one that already holds millions of records, you may want a better sampling of records to review than a simple TOP 1000.  You might need this for true statistical analysis or just to spot-check some records.



There are a few different ways to get a random (or random-ish) sampling of rows, each with its own pros and cons for ease of use and performance.  In this post I will discuss some of the methods I have come across and why they work (or don't work) the way they do.  These techniques should all work in SQL Server 2005 and up.
 
TABLESAMPLE

The TABLESAMPLE clause can be added to almost any query after the FROM clause in order to return a “sample” number or percentage of rows.  The syntax is:

SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT) ;
This is a quick and dirty way to get a sampling of rows, but note that it is a "sampling" and is not truly random at the row level.  The distinction matters because the TABLESAMPLE clause actually selects pages at random and then returns all the rows from those pages.  This means you will end up with clusters of rows, depending on how the rows are dispersed across pages.  For the same reason, this method will usually return a different number of records each time it is run.

The benefits of using this clause are that it can be added to just about any SELECT statement (it can't be used with derived tables, table-valued functions, tables from linked servers, rowset functions, or OPENXML) and that it adds relatively little overhead.
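TABLESAMPLE can also take a row count instead of a percentage.  Internally, SQL Server converts the row count into an approximate percentage of pages to read, so the exact number of rows returned is still not guaranteed.  A sketch against the same AdventureWorks table:

```sql
-- Request roughly 500 rows; because sampling still happens at the page
-- level, the actual count returned will vary from run to run.
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (500 ROWS) ;
```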
 
NEWID()

If you want to generate a truly random sample of data on a row level basis, your best bet is to use the NEWID() function to generate a random GUID.  You can use the random GUID in an ORDER BY clause with TOP to get a random selection of the desired number of rows.  For example:

SELECT TOP 1000 *
FROM MyTable
ORDER BY NEWID()
A disadvantage of this method is that SQL Server must generate a GUID for every row in the table at runtime and then sort the entire set by those GUIDs, which can be expensive on large tables.  Interestingly, in SQL Server you cannot use the RAND() function for this purpose, because RAND() generates the same "random" value for every row returned in the batch, whereas NEWID() generates a unique value for each row.
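You can see this behavior for yourself with a quick side-by-side query (MyTable here stands in for any table with a few rows):

```sql
-- RAND() is evaluated once per query, so every row shows the same value;
-- NEWID() is evaluated once per row, so every row gets a distinct GUID.
SELECT TOP 5
    RAND()  AS SameEveryRow,
    NEWID() AS DifferentEveryRow
FROM MyTable ;
```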

What if you want the results to be repeatable?

You may find yourself in a situation where you want to make sure that your sampling is repeatable if you re-run the query, or if you pass the query off to someone else.  To maintain repeatability, you can try one of the following techniques.
 
TABLESAMPLE Repeatable Option

In order to have the result set repeatable, you can add the REPEATABLE option to the TABLESAMPLE clause along with an arbitrary seed value.  The syntax is:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT)
   REPEATABLE (205) ;

In this example 205 is the seed value.  Each time 205 is used, assuming there have been no changes to the table, the same pages, and thus rows, will be returned.

Planning Ahead - Creating a RandomNumber Column at Record Insertion

Another option that I've recently come across is to set up your Fact and Dimension tables with a designated RandomNumber column that is populated as each row is inserted.  There will of course be a very slight cost to generating a random number on every insert, but it will likely be negligible.

Assuming that your ETL process will be inserting more than one record at a time, you will again want to avoid the RAND() function, because it would insert the same "random" number for every record in a single INSERT statement.  Instead, we can again use the NEWID() function.  But for the purpose of a RandomNumber column we don't want a GUID; we want a numeric value, preferably a decimal between 0 and 1, so we can use it to select a percentage of rows.  To do this, you can use something like:

CAST(CHECKSUM(NEWID()) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

This expression is a bit tricky.  Give the BOL link below credit for it, but here is an explanation of what it's actually doing: NEWID() produces a random GUID for each row, and CHECKSUM() reduces that GUID to a 32-bit integer that may be negative.  ANDing the result with 0x7fffffff clears the sign bit, leaving a nonnegative integer between 0 and 2,147,483,647.  Casting that to float and dividing by 0x7fffffff (the largest 32-bit signed integer) scales the value to a decimal between 0 and 1.

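A quick way to convince yourself that the expression always lands between 0 and 1 is to generate a handful of values at once (sys.objects is used here just as a convenient built-in row source):

```sql
-- Each row triggers its own NEWID(), so each row gets its own
-- value in the range [0, 1].
SELECT TOP 10
    CAST(CHECKSUM(NEWID()) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) AS RandomNumber
FROM sys.objects ;
```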
You can use the above expression in your INSERT statement, or you can even use it as a column default! Using this technique you could put a RandomNumber column in every Data Warehouse table and default it to a truly random value between 0 and 1.
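As a sketch of the column-default approach (the table and column names here are placeholders, not from the original post):

```sql
CREATE TABLE dbo.FactSales
(
    SalesKey     int IDENTITY(1,1) PRIMARY KEY,
    Amount       money NOT NULL,
    -- Populated automatically on every insert with a value between 0 and 1;
    -- the default is evaluated per row, even in multi-row inserts.
    RandomNumber float NOT NULL
        DEFAULT (CAST(CHECKSUM(NEWID()) & 0x7fffffff AS float) / CAST(0x7fffffff AS int))
);

-- The ETL never has to mention the column:
INSERT INTO dbo.FactSales (Amount) VALUES (19.99), (24.50);
```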

Now that you have your RandomNumber column populated for your table, you can write a simple query like this:

SELECT *
FROM MyTable
WHERE RandomNumber <= 0.1

And you will get back a repeatable, truly random sampling of approximately 10% of the rows.  If you wanted approximately 20% of the table, you would use WHERE RandomNumber <= 0.2, and so forth.  I say "approximately" because the number of records in the result depends on the randomly generated values, but as the full table size increases, a sample retrieved in this way will continue to approach the percentage used in the WHERE clause.

One last thing to note: the above query will only return a repeatable result set until more rows are added to the table; after that, the result set may grow if any new rows' RandomNumber values also meet the criteria.  To ensure a query always returns repeatable results, you can further refine it like so:

SELECT *
FROM MyTable
WHERE RandomNumber <= 0.1
AND MyIdentityColumn <= [CurrentMaxIdValue]

Here [CurrentMaxIdValue] is a value that stays static going forward, ensuring that even after new rows are added, the sampling query will always return the exact same result set.
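One way to pin that value down, assuming MyIdentityColumn is the table's identity key, is to capture the current maximum once and then hard-code it into the saved sampling query (the literal 1048576 below is a hypothetical captured value, not from the original post):

```sql
-- Run once and note the result...
SELECT MAX(MyIdentityColumn) AS CurrentMaxIdValue
FROM MyTable ;

-- ...then bake that number into the query you share.  Rows inserted later
-- will have higher identity values and can never satisfy the second
-- predicate, so the sample stays frozen.
SELECT *
FROM MyTable
WHERE RandomNumber <= 0.1
AND MyIdentityColumn <= 1048576 ;
```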

For further reading about the TABLESAMPLE clause please see - http://technet.microsoft.com/en-us/library/ms189108%28v=SQL.105%29.aspx
