Customizing ProperCase in SSIS

2015-01-05
Have you ever tried the ProperCase() function in SSIS expressions?  Yeah, no, it doesn’t actually exist, but one can dream.  In fact, if they ever do add such a function, I hope that they can improve upon a common shortcoming in the .NET framework. 
First, let’s state what the problem is:  Proper Names ought to be in Proper Case for improved readability.  UPPERCASE IS DIFFICULT TO READ, ESPECIALLY WHEN THERE ARE LONG STRINGS OF TEXT.  THERE ARE NO PLACES FOR THE EYES TO PAUSE AS EACH LETTER HAS THE SAME EMPHASIS AND HEIGHT.  lowercase is informal, fine for an email address, but not so good for proper names.  Any of these items imported through the ETL, might soon wind up on a report – and that is not the time for fixing them.  As we process data in the ETL, we should think about conforming it to one look and feel to improve the experience in downstream applications.
To demonstrate, the problems of handling case, I will use the following dataset:
ALL UPPER CASE
all lower case
the acronym ABC Co.
Maybe, we are getting an excel file from Betty in accounting, but she’s had the capslock stuck on for about the past 20 years.  Or maybe we are sourcing from a database that supports a web application.  However, that clever programmer, who designed it, also believes in conforming data, so they wrapped all the inputs in UPPER().  Or maybe we just have a free form input field that is getting data in every manner of case from all sorts of users.  Whatever the issue, we can try to solve this problem with .NET in a script component in our SSIS package, but there are some limitations. 
In vb, we can use the StrConv function:
Row.MyCol = StrConv(Row.MyCol, VbStrConv.ProperCase)
This will proper case everything, but it will not handle acronyms - they get proper cased too.  So, "the acronym ABC Co." becomes "The Acronym Abc Co."  It looks a little weird.  On the other hand, C# does not have StrConv, but we can use TextInfo in the System.Globablization class:
CultureInfo ci = Thread.CurrentThread.CurrentCulture;
TextInfo text = ci.TextInfo;
Row.MyCol = text.ToTitleCase(Row.MyCol);
It's a few more lines of code, but it will leave everything that was in Uppercase, alone.  So, "the acronym ABC Co." becomes "The Acronym ABC Co."  This is great for acronyms, but not so good if everything is already in uppercase.  In fact, it will look like it is not working at all.  "ALL UPPER CASE" will be "ALL UPPER CASE." 
So, last, we will come up with a tweak that will take advantage of this odd behavior, by adding a list of exceptions for the acronyms we want to leave be.  We will use regex to search and replace those strings that we want to remain uppercase.  This makes sense in an ETL in which we are trying to solve a domain specific data quality problem and not all possible case issues everywhere.  It means that we will need just a short list of exceptions.
This is the full code of what we put inside the script component transformation:
using System.Globalization;
using System.Threading;
using System.Text.RegularExpressions;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{

    Row.MyCol = ProperCase(Row.MyCol, @"ABC|LBL");
    // the second parameter will take any valid regex
}

public static string ProperCase(string convertMe, string acronyms)
{
    CultureInfo ci = Thread.CurrentThread.CurrentCulture;
    TextInfo text = ci.TextInfo;
    convertMe = convertMe.ToLower(); //c# does not propercase anything that is in all upper case

    Regex regex = new Regex(acronyms, RegexOptions.IgnoreCase); //add pipe delimited values for each expression that should not be left upper case
    convertMe = regex.Replace(convertMe, m => m.Value.ToUpper()); //upper case those expressions
    convertMe = text.ToTitleCase(convertMe);

    return convertMe;
}
The function ProperCase(), takes the string that we are converting and a pipe delimited list of exceptions as input.  It converts the target string to lower case and then upper cases any exceptions that it finds.  Finally, it converts the working string to proper case using the TextInfo class. 

Below are all the versions of input for comparison:
Mark Wojciechowicz

Labels: ,