Ajilius currently supports three types of data quality screens on data being loaded to the warehouse:

  • Data Type
  • Range/s
  • Regex

We’ve previously posted about type and range validation, but we recently had an enquiry about the use of Regular Expressions (regex) for data validation. Let’s build an example based on Postal Code validation.

The Person.Address table in AdventureWorks2014 contains a Postal Code column. Addresses are international, with many different formats of Postal Code. For the purposes of this demonstration, we are going to validate the column against Canadian Postal Codes. We’ll use a regular expression taken from the Regular Expressions Cookbook, 2nd. Edition

    ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$

We’ll start from the point where we have imported the Person.Address metadata into our repository:

regex1

Click Change in the right-side panel, then modify the postal_code column. Scroll down to the Data Quality section, and make the following changes:

regex2

 

Save your changes, go back to the Load List, and select the Scripts option for the load_person_address table.

Notice the new section of script which has been generated. This is the validator that will be applied to this column at load time.

    rs = new AjiliusResultSet (rs)
    rs.setValidator('postal_code','text','regex','^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$','1')

Now run the script, and watch the results at the bottom left of the screen:

regex3

As you can see, only 1090 rows passed the validation. And when we view the contents of the table, we see that these rows do match the Canadian Postal Code format defined in our regular expression:

regex4

A word of caution, regex validation is slower than type and range checking. Without regex validation, that same table loaded in 0.749 seconds, and the difference was due entirely to the regex algorithm. If you have a choice, use range checking instead.

Ajilius. Better data screens.

 

Leave a Reply