Sometimes while you are first building and populating a data warehouse, or after it is populated with millions of records, you may want a better sampling of records to review than a simple TOP 1000, whether for true statistical analysis or just to spot-check some records.
There are a few different ways to get a random (or random-ish) sampling of rows, with various pros and cons for ease of use and performance. In this post I will discuss some of the methods I have come across and why they work (or don't work) the way they do. These techniques should all work in SQL Server 2005 and up.
TABLESAMPLE
The TABLESAMPLE clause can be added to almost any query after the FROM clause in order to return a "sample" number or percentage of rows. The syntax is:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT);
This is a quick and dirty way to get a sampling of rows, but you'll note it's a "sampling" and is not truly random at the row level. The reason for this distinction is that the TABLESAMPLE clause actually selects random pages and then returns all of the rows from those pages. This means you will end up with clusters of rows, depending on how the rows are distributed across pages. For the same reason, this method will usually return a different number of records each time it is run.
The benefits of using this clause are that it can be added to just about any SELECT statement (though it can't be used with derived tables, table-valued functions, tables from linked servers, rowset functions, or OPENXML) and that it adds relatively minimal overhead.
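If you would rather ask for an approximate row count than a percentage, TABLESAMPLE also accepts a ROWS argument. SQL Server converts the row count to a percentage internally, and sampling still happens at the page level, so the actual number of rows returned will vary from run to run:

```sql
-- Request roughly 500 rows; the count returned is approximate
-- because whole pages are sampled, not individual rows.
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (500 ROWS);
```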
NEWID()
If you want to generate a truly random sample of data on a row-level basis, your best bet is to use the NEWID() function to generate a random GUID for each row. You can use the random GUID in an ORDER BY clause with TOP to get a random selection of the desired number of rows. For example:
SELECT TOP 1000 *
FROM MyTable
ORDER BY NEWID();
A disadvantage of this method is that SQL Server must generate a GUID at runtime for every row in the table and then sort the entire table by those GUIDs, which can get expensive on large tables. Interestingly, in SQL Server you are not able to use the RAND() function for this purpose, because RAND() generates the same "random" value for every row returned in the batch, whereas NEWID() generates a unique value for each row.
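A quick way to see the difference for yourself (the table name here is just a placeholder):

```sql
-- RAND() is evaluated once per query, so every row shows the same value;
-- NEWID() is evaluated once per row, so every row shows a unique GUID.
SELECT TOP 5
       RAND()  AS SameEveryRow,
       NEWID() AS UniquePerRow
FROM MyTable;
```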
What if you want the results to be repeatable?
You may find yourself in a situation where you want to make sure that your sampling is repeatable if you re-run the query, or if you pass the query off to someone else. To maintain repeatability, you can try one of the following techniques.
TABLESAMPLE Repeatable Option
In order to make the result set repeatable, you can add the REPEATABLE option to the TABLESAMPLE clause along with an arbitrary seed value. The syntax is:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT)
REPEATABLE (205);
In this example, 205 is the seed value. Each time 205 is used, assuming there have been no changes to the table, the same pages, and thus the same rows, will be returned.
Planning Ahead - Creating a RandomNumber Column at Record Insertion
Another option that I've recently come across is to set up your fact and dimension tables with a designated RandomNumber column that is populated with each inserted row. There will of course be a very slight cost to generating a random number on every insert, but it will likely be negligible.
Assuming that your ETL process inserts more than one record at a time, you will again want to avoid the RAND() function, because it would insert the same "random" number for every record in a single INSERT statement. Instead, we can again use the NEWID() function. But for the purpose of a RandomNumber column, we don't want a GUID; we want a numeric value, preferably a decimal between 0 and 1, so we can use it to select a percentage of rows. To do this, you can use something like:
CAST(CHECKSUM(NEWID()) & 0x7fffffff AS float) / CAST(0x7fffffff AS int)
This expression is a bit tricky (credit to SQL Server Books Online for it), but here is an explanation of what it's actually doing:
- Use CHECKSUM to generate an integer value from the NEWID() GUID.
- Use the bitwise AND (&) operator to mask the checksum with the hexadecimal value 0x7fffffff (which, I found out, is the maximum signed value for the int datatype). This ensures that the numerator will always be a non-negative integer <= the maximum signed int value (2,147,483,647).
- Divide by the maximum signed int value (2,147,483,647 again) to produce a decimal value between 0 and 1.
You can use the above expression in your INSERT statement, or you can even use it as a column default! Using this technique, you could put a RandomNumber column in every data warehouse table and default it to a truly random value between 0 and 1.
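As a sketch of the column-default approach (the table, column, and constraint names here are placeholders; adjust types and naming to fit your schema):

```sql
-- Add a RandomNumber column that defaults to a random value between
-- 0 and 1 for every inserted row. NEWID() is evaluated per row, so
-- multi-row inserts still get distinct values.
ALTER TABLE dbo.FactSales
ADD RandomNumber float NOT NULL
    CONSTRAINT DF_FactSales_RandomNumber
    DEFAULT (CAST(CHECKSUM(NEWID()) & 0x7fffffff AS float) / CAST(0x7fffffff AS int));
```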
Now that you have your RandomNumber column populated for your table, you can write a simple query like this:
SELECT *
FROM MyTable
WHERE RandomNumber <= 0.1
And you will return a repeatable, truly random sampling of approximately 10% of the rows. If you wanted to return approximately 20% of the table, you would use WHERE RandomNumber <= 0.2, and so forth. I say "approximately" because the number of records in the result will depend on the randomly generated values, but as the full table size increases, a sample retrieved this way will continue to approach the percentage used in the WHERE clause.
One last thing to note: the above query will only return a repeatable result set until more rows are added to the table; after that, the result set may grow if the new rows' RandomNumber values also meet the criteria. To ensure a query always returns repeatable results, you can further refine it like so:
SELECT *
FROM MyTable
WHERE RandomNumber <= 0.1
  AND MyIdentityColumn <= [CurrentMaxIdValue]
Here [CurrentMaxIdValue] is a static value going forward, which ensures that even after new rows are added, the sampling query will still return the exact same result set.
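One way to pin that value down (a sketch; MyIdentityColumn and the table name are placeholders) is to capture the current maximum identity value once and hard-code it into the sampling query you share:

```sql
-- Capture the current high-water mark once...
DECLARE @CurrentMaxIdValue int;
SELECT @CurrentMaxIdValue = MAX(MyIdentityColumn)
FROM MyTable;

-- ...then any sample filtered by it ignores rows inserted later.
SELECT *
FROM MyTable
WHERE RandomNumber <= 0.1
  AND MyIdentityColumn <= @CurrentMaxIdValue;
```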
Labels: Data Warehouse, Random Numbers, SSMS, TSQL