When I go into a client, I often see two approaches to error handling which are opposite extremes. The client will usually have either no exception handling at all, or every process wrapped in exception handling that carefully ignores errors, logs them away for safekeeping, and quietly continues. The latter is, surprisingly, worse than the former.
On the one hand, the ETL is at the whim of whatever is thrown its way – connection issues, bad data, or other anomalies. On the other, no one ever knows that exceptions are happening, and the users of the system go on their merry way, not realizing that their data warehouse is dry-rotting like a curtain, until one day they give it a tug to let in the sun and it falls to pieces. How easily trust is lost when there are apparent gaps in the data from lost rows.
Let me give a specific example. I was working on a package which used a script component to parse a flat file, catching all exceptions and sending them to an error output. The error output carried a vague hint of what the error was and which column was affected, but it had no trace information about the source file, the affected row, or the offending value. I should also note here that a script component should never be used to replace the equivalent functionality in an out-of-the-box component (e.g., the Flat File Source). This is an important aspect of exception handling – keep it simple, but not too simple.
So why was this catch-all methodology being employed? I found that the source files would sometimes contain invalid values from user input. In fact, there were just three fields that would sometimes go wrong, and only two cases in which the data should have been rejected. Rejected and reported back to the business, mind you.
Exception handling should appear in exactly two places: where an error is expected (making a web request call, for example) and there is some meaningful behavior to be done in the case of a failure (such as retrying after a delay), and at a system boundary, in which case you need to make a decision about how you are going to expose the error to the outside world.
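As a rough sketch of those two places in Python (the function names and retry policy here are my own illustration, not anything from the package in question): the expected-failure case earns a try/except because retrying is a meaningful response, and the system boundary decides how the failure is exposed.

```python
import time

def with_retry(operation, attempts=3, delay=0.0, retry_on=(OSError,)):
    # Expected-failure case: the error is anticipated and retrying is a
    # meaningful response, so a try/except belongs here.
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: let the boundary decide
            time.sleep(delay)

def run_etl(load_batch, log):
    # System-boundary case: the top of the ETL decides how errors are
    # exposed to the outside world -- log with context and fail visibly,
    # rather than quietly continuing.
    try:
        load_batch()
        return True
    except Exception as exc:
        log(f"ETL failed: {exc!r}")
        return False
```

Everything in between those two places simply lets exceptions propagate.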
If we apply this same methodology to my example problem, we would want exception handling for our three bad-data cases, and that's it. Since this file is produced from another database, as long as we match the source data types, there will be no need to protect the system boundaries. In other cases, we might find that the source system is poorly designed and uses a 255-length string for every column. That would be a good case for enabling exception handling around the conversions to what the values actually should be.
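Here is a minimal Python sketch of that conversion step; the column names and formats are hypothetical stand-ins, but the point is that every failed conversion is captured with the source file, row, column, and offending value – exactly the trace information the vague error output above was missing.

```python
from datetime import datetime

def convert_row(source_file, row_num, raw, errors):
    # Convert wide, untyped strings to the types the values should be.
    # Column names and formats are hypothetical examples.
    converters = {
        "id": int,
        "amount": float,
        "posted": lambda s: datetime.strptime(s, "%Y-%m-%d"),
    }
    converted = {}
    for column, convert in converters.items():
        try:
            converted[column] = convert(raw[column])
        except (ValueError, KeyError) as exc:
            # Capture full trace context: file, row, column, and value.
            errors.append({
                "file": source_file,
                "row": row_num,
                "column": column,
                "value": raw.get(column),
                "error": str(exc),
            })
            return None  # reject the row; fix it in the source system
    return converted
```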
So what do we do with the exceptions anyway?
In his book, Data Quality: The Accuracy Dimension, Jack Olson offers four possible approaches for dealing with data quality problems, and I think they apply well to any exception handling in an ETL:
- Accept the error – log it and report it, but keep the original value in the case of a data quality problem
- Try to correct it – e.g., put in retries for a problematic connection, like every FTP connection – they always seem to be spotty
- Reject the row – I have seen approaches that quarantine records so they can be massaged by the business and reloaded, but I think it is far better to correct problems in the source system. That way, you will never see the problem again should you need to reload the data.
- Stop the
ETL – sometimes all you can do is rely on manual intervention to fix a
problem and that is when this approach makes sense.
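These four options can be sketched as an explicit policy; the failure kinds below are hypothetical, but the design point stands: any error you did not anticipate should stop the ETL rather than be quietly swallowed.

```python
from enum import Enum

class ErrorAction(Enum):
    ACCEPT = "accept"    # log and report, but keep the original value
    CORRECT = "correct"  # e.g. retry a flaky connection
    REJECT = "reject"    # drop the row; fix it in the source system
    STOP = "stop"        # halt the ETL for manual intervention

# Hypothetical policy mapping known failure kinds to one of the four actions.
POLICY = {
    "stale_dimension_lookup": ErrorAction.ACCEPT,
    "connection_reset": ErrorAction.CORRECT,
    "invalid_user_input": ErrorAction.REJECT,
}

def decide(kind):
    # Anything unanticipated stops the ETL instead of being ignored.
    return POLICY.get(kind, ErrorAction.STOP)
```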
Combining these two ideas, limiting exception handling to the places where it belongs and deliberately choosing what to do with each exception, we can create ETLs that are very effective at handling exceptions and letting us know when they happen.
Labels: Exception Handling, SSIS