2015-01-05
Have you ever tried the ProperCase() function in SSIS expressions? Yeah, no, it doesn’t actually exist, but one
can dream. In fact, if they ever do add
such a function, I hope that they can improve upon a common shortcoming in the
.NET framework.
First, let’s state what the problem is:
Proper Names ought to be in Proper Case for improved readability. UPPERCASE IS DIFFICULT TO READ, ESPECIALLY
WHEN THERE ARE LONG STRINGS OF TEXT.
THERE ARE NO PLACES FOR THE EYES TO PAUSE AS EACH LETTER HAS THE SAME
EMPHASIS AND HEIGHT. lowercase is
informal, fine for an email address, but not so good for proper names. Any of these items imported through the ETL,
might soon wind up on a report – and that is not the time for fixing them. As we process data in the ETL, we should
think about conforming it to one look and feel to improve the experience in
downstream applications.
To demonstrate, the problems of handling case, I will use the following
dataset:
ALL UPPER CASE
all lower case
the acronym ABC Co.
Maybe, we are getting an excel file from Betty in accounting, but she’s
had the capslock stuck on for about the past 20 years. Or maybe we are sourcing from a database that
supports a web application. However,
that clever programmer, who designed it, also believes in conforming data, so
they wrapped all the inputs in UPPER().
Or maybe we just have a free form input field that is getting data in
every manner of case from all sorts of users.
Whatever the issue, we can try to solve this problem with .NET in a
script component in our SSIS package, but there are some limitations.
In vb, we can use the StrConv function:
Row.MyCol =
StrConv(Row.MyCol, VbStrConv.ProperCase)
This will proper case everything, but it will not handle acronyms -
they get proper cased too. So, "the
acronym ABC Co." becomes "The Acronym Abc Co." It looks a little weird. On the other hand, C# does not have StrConv,
but we can use TextInfo in the System.Globablization class:
CultureInfo ci = Thread.CurrentThread.CurrentCulture;
TextInfo text = ci.TextInfo;
Row.MyCol =
text.ToTitleCase(Row.MyCol);
It's a few more lines of code, but it will leave everything that was in
Uppercase, alone. So, "the acronym
ABC Co." becomes "The Acronym ABC Co." This is great for acronyms, but not so good
if everything is already in
uppercase. In fact, it will look like it
is not working at all. "ALL UPPER
CASE" will be "ALL UPPER CASE."
So, last, we will come up with a tweak that will take advantage of this
odd behavior, by adding a list of exceptions for the acronyms we want to leave
be. We will use regex to search and
replace those strings that we want to remain uppercase. This makes sense in an ETL in which we are
trying to solve a domain specific data quality problem and not all possible
case issues everywhere. It means that we
will need just a short list of exceptions.
This is the full code of what we put inside the script component
transformation:
using System.Globalization;
using System.Threading;
using System.Text.RegularExpressions;
public override void
Input0_ProcessInputRow(Input0Buffer Row)
{
Row.MyCol = ProperCase(Row.MyCol, @"ABC|LBL");
// the second parameter will take any
valid regex
}
public static string
ProperCase(string convertMe, string
acronyms)
{
CultureInfo ci = Thread.CurrentThread.CurrentCulture;
TextInfo text = ci.TextInfo;
convertMe = convertMe.ToLower(); //c# does not propercase anything that is in all upper case
Regex regex = new Regex(acronyms,
RegexOptions.IgnoreCase); //add pipe delimited values for each
expression that should not be left upper case
convertMe = regex.Replace(convertMe, m => m.Value.ToUpper()); //upper case those expressions
convertMe =
text.ToTitleCase(convertMe);
return convertMe;
}
The function ProperCase(), takes the string that we are converting and
a pipe delimited list of exceptions as input.
It converts the target string to lower case and then upper cases any
exceptions that it finds. Finally, it
converts the working string to proper case using the TextInfo class.
Below are all the versions of input for comparison:
Mark Wojciechowicz
Labels: Data Quality, SSIS