Data Quality with Regular Expressions

Ajilius currently supports three types of data quality screens on data being loaded to the warehouse:

  • Data Type
  • Range(s)
  • Regex

We’ve previously posted about type and range validation, but we recently had an enquiry about the use of Regular Expressions (regex) for data validation. Let’s build an example based on Postal Code validation.

The Person.Address table in AdventureWorks2014 contains a Postal Code column. Addresses are international, with many different formats of Postal Code. For the purposes of this demonstration, we are going to validate the column against Canadian Postal Codes. We’ll use a regular expression taken from the Regular Expressions Cookbook, 2nd Edition:

    ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
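
If you want to try the expression outside Ajilius before applying it, a quick Python check works well. This sketch is illustrative only, and the sample values are invented:

    import re

    # Canadian Postal Code pattern from the Regular Expressions Cookbook, 2nd Edition
    PATTERN = re.compile(r'^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$')

    # Invented samples: two valid codes, a US ZIP, and a code with the excluded letter D
    for value in ['K1A 0B1', 'V5K0A1', '90210', 'D1A 0B1']:
        print(value, bool(PATTERN.match(value)))   # True, True, False, False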

We’ll start from the point where we have imported the Person.Address metadata into our repository:

[Screenshot: Person.Address metadata imported into the repository]

Click Change in the right-side panel, then modify the postal_code column. Scroll down to the Data Quality section, and make the following changes:

[Screenshot: Data Quality settings for the postal_code column]

Save your changes, go back to the Load List, and select the Scripts option for the load_person_address table.

Notice the new section of script which has been generated. This is the validator that will be applied to this column at load time.

    rs = new AjiliusResultSet(rs)
    rs.setValidator('postal_code','text','regex','^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$','1')
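
The AjiliusResultSet internals are not public, but conceptually the validator wraps the result set and screens each value as rows stream through. Here is a minimal Python sketch of the idea, with invented names:

    import re

    class ValidatingResultSet:
        """Illustrative stand-in for AjiliusResultSet: screens columns with regexes."""
        def __init__(self, rows):
            self.rows = rows          # iterable of dicts, one per source row
            self.patterns = {}        # column name -> compiled regex

        def set_validator(self, column, pattern):
            self.patterns[column] = re.compile(pattern)

        def __iter__(self):
            for row in self.rows:
                # A row passes only when every screened column matches its pattern
                if all(p.match(str(row[c] or '')) for c, p in self.patterns.items()):
                    yield row

    rs = ValidatingResultSet([{'postal_code': 'K1A 0B1'}, {'postal_code': '90210'}])
    rs.set_validator('postal_code', r'^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$')
    print([r['postal_code'] for r in rs])   # ['K1A 0B1']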

Now run the script, and watch the results at the bottom left of the screen:

[Screenshot: load progress showing validation results]

As you can see, only 1090 rows passed the validation. And when we view the contents of the table, we see that these rows do match the Canadian Postal Code format defined in our regular expression:

[Screenshot: table contents showing Canadian-format postal codes]

A word of caution: regex validation is slower than type and range checking. Without regex validation, that same table loaded in 0.749 seconds, and the difference in load time was due entirely to the regex algorithm. If you have a choice, use range checking instead.

Ajilius. Better data screens.

Query-Based Load Dependencies

The new Ajilius scheduler uses metadata to figure out the dependencies and run sequence of your ELT jobs.

Custom code, however, can currently break these dependencies.

Here’s an example, contrived from a customer query earlier today:

[Screenshot: custom load query referencing load.load_from_table and load.load_earlier_table]

In this query there is a dependency between the tables load.load_from_table and load.load_earlier_table. Unless load.load_earlier_table is populated before load.load_from_table is processed, duplicate rows may be selected.

In the long term we will use a SQL parser to extract these dependencies from custom queries, but we’re still working on the evaluation of parsers.

Meanwhile, these dependencies can be resolved through multiple steps in the scheduler. Here is an example of scheduler parameters that would ensure the correct sequence of loads in a full data warehouse refresh:

    -w <dwname> -b reset
    -w <dwname> -l load_earlier_table
    -w <dwname> -f all

This could also be written as:

    -w <dwname> -b reset -l load_earlier_table -f all

These commands ensure that load_earlier_table is the first table processed by the scheduler, so that it is available when needed by load_from_table.

Remember, any time you’re not sure how to do something using Ajilius, you are most welcome to contact us and discuss.

PostgreSQL message encoding

We ran into an interesting error this week, where a job was ending with an encoding error on the completion message.

Our first thought was that the database encoding did not match the value in the metadata, but no, both were set to UTF-8.

This is an Italian customer, and after much head-scratching, we discovered that the configuration value for lc_messages was set to ‘Italian_Italy.1252’. Ajilius expects messages returned by the server to be in the same encoding as the data warehouse.

In this case, the DBMS had been installed with settings from the Italian locale on the server, but the data warehouse database had been created with UTF-8 encoding.

We will now recommend that servers are initialised with initdb using the preferred encoding.

In this case, the customer changed the value of lc_messages in postgresql.conf, and the problem was resolved.
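
If you hit the same symptom, it is easy to compare the message locale with the database encoding before editing postgresql.conf. Here is a quick diagnostic sketch using psycopg2; the connection parameters are placeholders:

    import psycopg2

    # Placeholder connection details; substitute your warehouse settings
    conn = psycopg2.connect(host='localhost', dbname='warehouse', user='ajilius')
    cur = conn.cursor()

    cur.execute('SHOW lc_messages')
    print('lc_messages:', cur.fetchone()[0])        # e.g. 'Italian_Italy.1252'

    cur.execute('SHOW server_encoding')
    print('server_encoding:', cur.fetchone()[0])    # e.g. 'UTF8'

    conn.close()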

http://www.postgresql.org/docs/9.4/static/locale.html

http://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-LC-MESSAGES

Delimited file formats

Delimited files, often referred to as “CSV” or text files, represent tables of data in a simple text format.

Rows are separated by newline characters, or by carriage-return and newline pairs.

Columns within rows are separated by delimiters. The delimiters supported by Ajilius are comma, tab, semi-colon and vertical bar (‘|’). All rows must contain all columns, although empty columns may be represented as consecutive delimiters.

Text fields containing delimiters must be quoted using single (‘) or double (“) quote characters.

Where a text field contains a character used as a delimiter, that character must be preceded by an escape character, typically backslash (\). If the text field contains the escape character itself, it must be preceded by another escape character (i.e. ‘\\’).

Delimited files must have column names contained in the first row of the file.

The file may be encoded in any format supported by Ajilius (link).

Detection of encoding, delimiters, quotation and escape characters is automatically performed on file access. You don’t need to specify anything other than the file name.
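
Python’s standard library can reproduce this kind of detection, if you are curious how delimiter sniffing works in principle. This is not the Ajilius implementation, and the file name is a placeholder:

    import csv

    # Sniff the dialect from a sample of the file, then read it with detected settings
    with open('sales.csv', newline='') as f:         # placeholder file name
        dialect = csv.Sniffer().sniff(f.read(4096), delimiters=',;\t|')
        f.seek(0)
        reader = csv.DictReader(f, dialect=dialect)  # first row supplies column names
        rows = list(reader)

    print('delimiter:', repr(dialect.delimiter), 'quote:', repr(dialect.quotechar))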

Ajilius. Flexible delimited file formats.

New feature: load streams

Before today, loads to an Ajilius data warehouse were single-streamed. That is, one process extracted and loaded one table. You could load hundreds of thousands of rows per second, and you could load different tables in parallel, but if you had one very large table there was nothing you could do to make it faster.

We’ve just added a feature called “Load Streams” that enables you to load large tables in parallel. This is specifically designed for MPP target databases, but may also speed up SMP operations.

The Load Table screen now has an option to select the number of streams that will be loaded in parallel. This is a number between 1 and a maximum set for the target platform. On MPP platforms, this maximum is the number of nodes, while SMP platforms typically benefit from a number between 4 and 8.

On extract, data will be split into the number of streams you defined. Depending on the platform, the extract may be split in round-robin fashion or, on selected MPP platforms, by the column(s) you have defined as a distribution key.
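
As a rough sketch of the round-robin case (the stream count and load function are invented for illustration, not Ajilius internals):

    from concurrent.futures import ThreadPoolExecutor

    STREAMS = 4   # invented value; Ajilius caps this per target platform

    def load_stream(stream_id, rows):
        # Placeholder for the real bulk-load call against the target database
        print(f'stream {stream_id}: loading {len(rows)} rows')

    rows = list(range(1000))                               # stand-in for extracted rows
    buckets = [rows[i::STREAMS] for i in range(STREAMS)]   # round-robin split

    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        for i, bucket in enumerate(buckets):
            pool.submit(load_stream, i, bucket)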

Because there is overhead in splitting the data, and in handling concurrent writes in the DBMS, the improvement is not linear. On SMP platforms, large tables on SSD storage can load orders of magnitude faster than single-streamed loads, while on MPP platforms the load time can approach the single-stream time divided by the number of streams.

Ajilius. We work harder to make your loads faster.

Analytics tools: Too clever?

I’ve recently been trialling a mix of analytics tools, including Power BI, Qlik, MicroStrategy, Tableau and others. One “feature” that has been annoying me is their attempt to automatically form relationships between tables, usually by matching column names. Bad idea.

I’ve had to spend far too long correcting models that have automatically matched the Product, Customer and Location dimensions, simply because each had ID as the name of its surrogate key.

When I have a table named Customer, a suitable name for a surrogate key is ID. Not CustomerID. It is the Identity of the Customer, not the CustomerIdentity of the Customer.

When I reference that table from another, I might use role names such as PurchasingCustomerID or ReferringCustomerID. Role, table, column. Only in the most trivial implementations might I refer to it as simply CustomerID.

Further, there is no requirement that columns be uniquely named in a schema, nor that they follow any specific pattern. Especially when dealing with data models from legacy business systems, the practice of automatic matching is counter-productive.

Automatic matching is particularly wrong in dimensional modelling, where the use of surrogate keys is standard practice. One of the properties of a surrogate key is that it is a meaningless identifier. From a business perspective, my view of the CustomerID might be “BIGSTORE001”; the surrogate value “3498” has no business meaning at all.

I suggest that if you can’t determine a relationship from the existence of a foreign key, then you should probably stop trying. Build a flexible relationship editor and skip the automatic matching, because otherwise I will spend more time finding and fixing tool errors than I would have spent building the model properly in the first place.

It makes me wonder how many complex analytical models have errors introduced by this practice.

Automatic relationship management may be a feature which demonstrates well, based on carefully selected sample data, but which fails in the real world. I’d like to see it disappear.

New feature: Load Mask

There are times when you need to load multiple instances of data. A common example is found in retail, where you might receive one end-of-day file of sales transactions from each store, all of which need to be loaded into one warehouse table.

We recently faced a more interesting case, where a SaaS company hosting identical application databases for many customers wanted to aggregate the same table from all of those databases.

We’ve now implemented a feature in Ajilius that makes iteration super-easy.

A combination of Database and Table Masks enables you to set wildcards over which a table load will iterate at run time. You do your metadata design using one instance of the load table as the source, then simply define the mask patterns to be used at run time.
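
To picture how a mask drives iteration, here is a minimal sketch using a file pattern. The glob pattern and file names are invented, not Ajilius mask syntax:

    import glob

    # One load table definition, iterated over every file matching the mask
    for path in sorted(glob.glob('sales_*.csv')):   # e.g. sales_store01.csv, sales_store02.csv
        print('loading', path)                      # placeholder for the real load step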

Multiple Excel files? No problem. Multiple text files? Easy. Multiple tables? Simple. Same table from many databases? No sweat.

Ajilius. Helpful data warehouse automation.

Handling SCD0 and SCD6

Most ETL and data warehouse automation products define a slowly changing dimension at the table level. DIM_PRODUCT, for example, may be defined as a type-2 slowly changing dimension, with changes to the PRODUCT_NAME and PRODUCT_CATEGORY columns triggering new dimension rows.

When we were designing Ajilius, we realised that this traditional approach is very limiting, particularly when handling dimensions of types 0, 4 and 6. To refresh: a type-0 dimension has values that may never change, the common example being an original-value column such as PRODUCT_ORIGINAL_PRICE. A type-6 dimension combines elements of type-2 and type-3, in that it may have some columns whose previous values are recorded, and some columns which trigger new dimension rows.

The “may have some columns” expression in the last sentence was our “Ah-Ha!” moment. Slowly changing dimensions should actually be recorded at the dimension attribute level, rather than the table level.

Ajilius enables you to set a change-type value for each non-key column in the dimension. By default we set it to SCD1, but you can change it to any other value through the dimension editor.

  • SCD0 (value never changes)
  • SCD1 (value changes in place without history being recorded)
  • SCD2 (value creates a new dimension row when it changes)
  • SCD3 (value has current- and previous-version recorded in the same dimension row)
  • SCD4 (value has historic versions recorded in a history outrigger)
  • SCD6 (a combination of 0 + 1 + 2 + 3 attributes in the same row)

To the best of our knowledge, Ajilius is the only data warehouse automation product that correctly supports the generation of DDL and DML to create and process all of these types of slowly changing dimension.
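
As a rough illustration of attribute-level change handling, the sketch below classifies incoming changes by each column’s SCD type. The column names and SCD assignments are invented, and this is not Ajilius’s generated code:

    # Invented per-column change types for an illustrative DIM_PRODUCT
    SCD_TYPES = {
        'product_original_price': 0,   # value never changes
        'product_description':    1,   # update in place, no history
        'product_category':       2,   # change triggers a new dimension row
        'product_name':           3,   # current value shifts to a previous-value column
    }

    def classify_changes(current, incoming):
        """Decide how each changed attribute should be processed."""
        actions = {
            0: 'reject the change',
            1: 'update in place',
            2: 'create a new dimension row',
            3: 'move current value to the previous-value column, then update',
        }
        return [(col, actions[scd])
                for col, scd in SCD_TYPES.items()
                if current[col] != incoming[col]]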

Virtual loads and dimensions

Some data warehouse automation products do a good job on greenfield projects, but make it very difficult to integrate into an existing data warehouse architecture. Some don’t do it at all, while others require that you reverse engineer and rebuild the ETL / ELT processing before you can integrate new tables. We’re different.

Ajilius makes it easy to integrate with an existing data warehouse through Virtual Tables.

The purpose of a Virtual Table is to provide a mechanism to create Ajilius metadata over an existing table, without having to re-load or re-process its data. There are two types of Virtual Tables in Ajilius: Virtual Loads and Virtual Dimensions.

Virtual Loads exist to define data that has already been loaded into the warehouse by other processes. This is useful where the table is to be processed in conjunction with data completely controlled by Ajilius, and also in drip-feeding scenarios.

Virtual Dimensions enable you to integrate complete dimensions that have been created and maintained by external processes. If your existing warehouse has already built the perfect SCD2 dimension, there is no need to re-design and re-write that table just to fit it into our metadata structures.

Once defined, Virtual Tables work exactly the same as any other Ajilius load or dimension table.

Virtual Tables are a powerful concept when you are adding the power of Ajilius to an existing data warehouse environment.