Data Profiling Patterns

Patterns in data are a great way to summarise the type and content of a column.

Our data profiling feature creates patterns in the following manner:

  • Values are converted to a string representation.
  • Alphabetic characters are converted to an ‘x’.
  • Characters in the range 0-9 are converted to a ‘9’.
  • All other characters are left as-is.

Here are some examples of pattern conversions:

  • 2010-09-23 to ‘9999-99-99’
  • 09/07/2016 to ‘99/99/9999’
  • Ajilius to ‘xxxxxxx’
  • (03) 5432 9876 to ‘(99) 9999 9999’

We find that the frequency of pattern occurrence is more valuable than the frequency of value occurrence. For example, it is more useful to know that a column contains only values with a pattern ‘9999-99-99’ than to see a long list of dates, each occurring only a handful of times.
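
As an illustration, here is a minimal Python sketch of the conversion and of counting pattern frequencies. It relies on Python’s notion of an alphabetic character, which may differ at the margins from our implementation:

    from collections import Counter

    def value_pattern(value):
        """Collapse a value into its profiling pattern:
        letters -> 'x', digits 0-9 -> '9', everything else unchanged."""
        out = []
        for ch in str(value):
            if ch.isalpha():
                out.append("x")
            elif "0" <= ch <= "9":
                out.append("9")
            else:
                out.append(ch)
        return "".join(out)

    column = ["2010-09-23", "09/07/2016", "Ajilius", "(03) 5432 9876"]
    print(Counter(value_pattern(v) for v in column))
    # Each pattern occurs once: 9999-99-99, 99/99/9999, xxxxxxx, (99) 9999 9999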

Ajilius. Better profiling for better data warehouses.


Announcement: Ajilius 2.3

I’m proud to announce that, today, we achieved our goals for Ajilius 2.3.

The hero feature of this release is data profiling. We now profile data sources faster than Trifacta, with more valuable information than Pentaho, and with more variety than SSIS. We give you the real data you need to make quality decisions about the content of your data sources.

To deliver this feature, we first added persistent metadata caching to Ajilius, as discussed in this earlier post.

Now, we’ve completed the feature by implementing the profiling and presentation features.

You profile a source table from the Extract Source Tables screen. The following screenshot shows that we are about to profile the Chinook Customer table.

[Screenshot: profile1]

We see any previous profile that is cached for this table, and we can refresh the profile at any time by pressing the Profile button.

[Screenshot: profile2]

We profile any number of rows, at a rate of around 4 million rows per minute. Every column is profiled in every row:

  • For columns of less than 64 characters in length, we profile up to 1,000,000 discrete values per column.
  • For columns of up to 256 characters in length, we profile up to 1,000,000 discrete patterns per column.
  • For columns of up to 4,000 characters in length, we profile the minimum and maximum values in the column.
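
In practice a cap like this means abandoning discrete tracking once it is passed. A rough Python sketch, with illustrative names rather than our actual implementation:

    MAX_DISCRETE = 1_000_000  # cap on tracked discrete values per column

    def profile_values(values, max_len=64, cap=MAX_DISCRETE):
        """Count discrete values up to a cap; beyond it, stop tracking."""
        counts = {}
        capped = False
        for v in values:
            s = str(v)
            if capped or len(s) >= max_len:
                continue
            counts[s] = counts.get(s, 0) + 1
            if len(counts) > cap:
                counts.clear()  # too many distinct values to be useful
                capped = True
        return counts, capped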

Not only do we profile values and patterns, we examine your data for characters that might cause problems in your data warehouse. Null values? Got it. Control characters? Check. Extended ASCII characters? That too. Unicode characters? Again, check.
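
As a rough illustration of the kind of check involved, here is one way a screen might flag those character classes. The class boundaries are assumptions for the example, not our exact rules:

    def character_flags(text):
        """Flag character classes that can cause downstream problems.
        SQL NULLs are detected separately, before any character scan."""
        return {
            "control":  any(ord(c) < 32 and c not in "\t\n\r" for c in text),
            "extended": any(128 <= ord(c) <= 255 for c in text),
            "unicode":  any(ord(c) > 255 for c in text),
        }

    print(character_flags("caf\u00e9"))        # é is extended ASCII
    print(character_flags("price \u20ac100"))  # € is beyond extended ASCII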

This is real, valuable profiling data for data warehouse professionals. And it is now included in your Ajilius licence.

So, once again, Ajilius provides real value through the addition of the features you need.

Ajilius. The real innovators in data warehouse automation.


Enhancement: Reset Incremental Loads

We like to say that Ajilius is driven by our users, and here is another example.

Release 2.2.16, due Sunday, will include a feature to reset incremental loads, allowing a load to be rerun even after it succeeds. One driver for this feature is detection of a source system error that does not affect the running of a load, but does affect the data content of the load.

The “ajilius” schema of your data warehouse contains a table named “ajilius_control”. This table holds one row for each column for which incremental values are tracked. We have modified this table to include a “previous” and “latest” value.

On successfully processing a load, we update the “previous” value with the contents of the “latest” value (the value that was used to process the load), and overwrite the “latest” value with the maximum column value from the load.
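
In outline, that bookkeeping, together with the reset described below, looks like this simplified Python sketch (names are ours, not the product’s):

    def on_load_success(control, max_loaded_value):
        """After a successful load: remember the value that drove this
        load, then advance to the highest value just loaded."""
        control["previous"] = control["latest"]
        control["latest"] = max_loaded_value

    def reset_control(control):
        """Reset: wind 'latest' back so the same load can be rerun."""
        control["latest"] = control["previous"]

    ctl = {"previous": None, "latest": "2016-09-01"}
    on_load_success(ctl, "2016-09-07")  # a load completes successfully
    reset_control(ctl)                  # source error found; rerun needed
    print(ctl)  # {'previous': '2016-09-01', 'latest': '2016-09-01'}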

Now, you can select a menu option Schedule | Reset Controls:

[Screenshot: reset1]

On selecting this option, you will be shown a screen listing all tables for which incremental controls have been defined:

[Screenshot: reset2]

When you press the Reset button, the control values of any checked table will be reset to their previous state. This enables the incremental load to be repeated.

Ajilius. Tipping the hat to Steve 😉


Data Quality Reject Limits

Ajilius 2.2.16, due for release later this week, will include a feature to enable a reject threshold to be set for data quality screens.

As shown in the following screen, you can set a reject limit; if the number of rejects exceeds that limit, the job will be cancelled:

[Screenshot: reject_limit1]

If the limit is exceeded, an error message and an exception will be generated. Here is how that appears during interactive testing of load scripts:

[Screenshot: reject_limit2]

When rejects occur below the reject limit, the job will succeed, but a warning message will be placed in the processing log. Here is an example from a test batch:

[Screenshot: reject_limit3]
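
The decision itself reduces to a simple threshold check, along these lines (the exception and logger names are illustrative, not the product’s):

    import logging

    log = logging.getLogger("load_job")

    class RejectLimitExceeded(Exception):
        """Raised to cancel the job when the reject threshold is passed."""

    def check_rejects(reject_count, reject_limit):
        """Cancel the job above the limit; warn, but succeed, below it."""
        if reject_count > reject_limit:
            log.error("%d rejects exceeded limit %d; job cancelled",
                      reject_count, reject_limit)
            raise RejectLimitExceeded(reject_count)
        if reject_count:
            log.warning("%d rows rejected, within limit %d",
                        reject_count, reject_limit)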

Ajilius. Fine tuning data quality.

Data Quality with Regular Expressions

Ajilius currently supports three types of data quality screens on data being loaded to the warehouse:

  • Data Type
  • Ranges
  • Regex

We’ve previously posted about type and range validation, but we recently had an enquiry about the use of Regular Expressions (regex) for data validation. Let’s build an example based on Postal Code validation.

The Person.Address table in AdventureWorks2014 contains a Postal Code column. Addresses are international, with many different formats of Postal Code. For the purposes of this demonstration, we are going to validate the column against Canadian Postal Codes. We’ll use a regular expression taken from the Regular Expressions Cookbook, 2nd Edition:

    ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
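
If you would like to try the expression outside Ajilius first, a quick Python check with made-up sample values behaves as expected:

    import re

    CA_POSTCODE = re.compile(r"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$")

    for code in ["K1A 0B1", "V6B4Y8", "D1A 0B1", "90210"]:
        print(code, bool(CA_POSTCODE.match(code)))
    # K1A 0B1 True, V6B4Y8 True, D1A 0B1 False (D is excluded), 90210 False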

We’ll start from the point where we have imported the Person.Address metadata into our repository:

[Screenshot: regex1]

Click Change in the right-side panel, then modify the postal_code column. Scroll down to the Data Quality section, and make the following changes:

[Screenshot: regex2]


Save your changes, go back to the Load List, and select the Scripts option for the load_person_address table.

Notice the new section of script which has been generated. This is the validator that will be applied to this column at load time.

    rs = new AjiliusResultSet(rs)
    rs.setValidator('postal_code','text','regex','^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$','1')

Now run the script, and watch the results at the bottom left of the screen:

[Screenshot: regex3]

As you can see, only 1090 rows passed the validation. And when we view the contents of the table, we see that these rows do match the Canadian Postal Code format defined in our regular expression:

[Screenshot: regex4]

A word of caution: regex validation is slower than type and range checking. Without regex validation, that same table loaded in 0.749 seconds, and the difference was due entirely to the regex algorithm. If you have a choice, use range checking instead.

Ajilius. Better data screens.


Enhancement: Cached Metadata

Ajilius enables browsing of source system metadata, to identify and load data into the warehouse. Here is a typical display:

[Screenshot: metadata01]

A click on a table, on the left side, shows the columns for that table on the right side.

Until now, each update of this screen represented a round trip to the database. That was fine for a data source located close to the Ajilius server, but users in hybrid environments reported slow screen updates when pulling cloud metadata to on-premise Ajilius. This was particularly apparent with large Oracle systems, which add thousands of system table entries to a metadata set.

We’ve now added a feature to cache metadata within Ajilius. This means the delay of a metadata update is incurred only once, and subsequent interactions are as fast as an on-premise solution.
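
The behaviour is essentially a read-through cache: pay the round trip once, then serve every browse locally. A minimal sketch, with names of our own choosing:

    class MetadataCache:
        """Fetch source metadata once, serve later browses locally."""

        def __init__(self, fetch_from_source):
            self._fetch = fetch_from_source  # slow, remote call
            self._cache = None

        def refresh(self):
            """The only slow operation; rerun it on demand."""
            self._cache = self._fetch()

        def tables(self):
            if self._cache is None:
                self.refresh()   # first touch pays the round trip
            return self._cache   # every later call is local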

The context menu for a data source now contains an option to Refresh Metadata:

[Screenshot: metadata02]

When you select this option, you will be prompted with a screen warning that a metadata update might be slow. In this case, “slow” means a few seconds.

[Screenshot: metadata03]

On running this refresh, an internal metadata cache is added to your data warehouse metadata repository, and subsequent calls to browse metadata or load metadata into the warehouse will be drawn from this cache.

You may refresh the cache at any time.

An added bonus is that once your data source metadata is cached, you no longer need to be connected to the source whilst working with source metadata. That’s a great feature for companies with high-risk data, and for people who like to take their work home with them 🙂

Ajilius. More flexible metadata.


Multi-generational upgrades

Lately we’ve found that our pace of delivery has outstripped the ability of some users to keep up with upgrades.

We have been expecting users to apply each upgrade as it is issued. In practice, that hasn’t always been possible. Emails might have been missed, or user priorities might have been elsewhere, and this has sometimes led to situations where support was needed to work through the issues of upgrading multiple generations at a time.

We also had a problem this week with a customer who restored a metadata repository from a backup that was several months old. This meant that their metadata was out of step with their current application and repositories.

We’re fixing that problem this week. Release 2.2.14 will bring a new method of versioning upgrades. The version of the repository will be recorded, and the upgrade patches will include multiple generations of upgrade history.

On applying a patch to Ajilius, any repository that is not at the current level will be upgraded, even if every repository is at a different release.

Only one patch will be required to bring your metadata up to the latest version, even if it is several generations old. Further, any metadata that has been restored from a backup will also be brought up to date, simply by rerunning the upgrade process.
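
In effect, this is an ordered migration chain keyed off the recorded repository version. A Python sketch, with hypothetical version numbers and no-op upgrade steps:

    def version_key(v):
        return tuple(int(part) for part in v.split("."))

    MIGRATIONS = [            # hypothetical per-generation upgrade steps
        ("2.2.12", lambda repo: None),
        ("2.2.13", lambda repo: None),
        ("2.2.14", lambda repo: None),
    ]

    def upgrade(repo):
        """Apply, in order, every migration newer than the repository's
        recorded version; one run brings even an old backup up to date."""
        for version, step in MIGRATIONS:
            if version_key(version) > version_key(repo["version"]):
                step(repo)
                repo["version"] = version  # record each level as we pass

    repo = {"version": "2.2.12"}  # e.g. restored from an old backup
    upgrade(repo)                 # applies 2.2.13 and 2.2.14 in order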

Ajilius. Flexible Upgrades.

Enhancement: MongoDB Data Source

Release 2.2.12 includes support for MongoDB as a data source.

MongoDB is a popular NoSQL DBMS, and Ajilius is the first data warehouse automation platform to support MongoDB as a first-class citizen.

Want proof? Here’s a shot of Ajilius displaying data from the MongoDB-hosted TPC-H database:

[Screenshot: mongodb2]


More? Here’s Ajilius showing data from the Restaurants sample database:

[Screenshot: mongodb3]

Ajilius makes working with MongoDB as simple as working with any other data source we support.

Ajilius. MongoDB to your data warehouse.

Enhancement: Log to Data Warehouse

This week’s Release 2.2.12 includes a feature to write ELT log entries to the data warehouse.

Previous releases logged ELT performance to a log file. This file could be named through the Warehouse Edit screen, and defaulted to ajilius.log.

Now, in addition to the log file, a table AJILIUS_LOG is created in the AJILIUS schema of the data warehouse. On job completion, as long as a data warehouse connection is available, the results of the job are inserted into this table.

The logged columns are:

  • log_stamp / Timestamp
  • log_script / Script name
  • log_elapsed / Elapsed time in seconds
  • log_status / Job status, 0=success
  • log_files / Number of files processed during job
  • log_inserts / Number of rows inserted during job
  • log_updates / Number of rows updated during job
  • log_deletes / Number of rows deleted during job
  • log_message / Descriptive error message if log_status<>0
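
As an example of putting the table to use, a query along these lines (run here through a generic Python DB-API connection) will surface recent failures:

    FAILED_JOBS = """
        SELECT log_stamp, log_script, log_elapsed, log_message
        FROM ajilius.ajilius_log
        WHERE log_status <> 0
        ORDER BY log_stamp DESC
    """

    def recent_failures(connection, limit=20):
        """Return the most recent failed jobs from AJILIUS_LOG."""
        cursor = connection.cursor()
        cursor.execute(FAILED_JOBS)
        return cursor.fetchmany(limit)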

Ajilius. Keeping track of performance.