Data Quality Reject Limits

Ajilius 2.2.16, due for release later this week, will include a feature to enable a reject threshold to be set for data quality screens.

As shown in the following screen, you can set a reject limit. If the number of rejected rows exceeds this limit, the job will be cancelled:

reject_limit1

If the limit is exceeded, an error message and exception will be generated. This is an example of how that appears during interactive testing of load scripts:

reject_limit2

When the number of rejects is below the reject limit, the job will succeed, but a warning message will be written to the processing log. Here is an example from a test batch:

reject_limit3
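
The behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the Ajilius implementation: the function name, the exception class and the return values are hypothetical, and it assumes that a reject count exactly equal to the limit is still within the limit.

```python
import logging

logger = logging.getLogger("load")


class RejectLimitExceeded(Exception):
    """Hypothetical exception raised when rejects exceed the limit."""


def check_reject_limit(rejects: int, limit: int) -> str:
    """Return 'ok' or 'warning'; raise if the reject count exceeds the limit."""
    if rejects > limit:
        # Beyond the limit: the job is cancelled with an error.
        raise RejectLimitExceeded(f"{rejects} rejects exceeds limit of {limit}")
    if rejects > 0:
        # Some rejects, but within the limit: succeed with a warning in the log.
        # Assumption: a count equal to the limit is treated as within the limit.
        logger.warning("%d rows rejected (limit %d)", rejects, limit)
        return "warning"
    return "ok"
```

Calling check_reject_limit(3, 10) returns "warning", while check_reject_limit(11, 10) raises the exception.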

Ajilius. Fine tuning data quality.

Data Quality with Regular Expressions

Ajilius currently supports three types of data quality screens on data being loaded to the warehouse:

  • Data Type
  • Range/s
  • Regex

We’ve previously posted about type and range validation, but we recently had an enquiry about the use of Regular Expressions (regex) for data validation. Let’s build an example based on Postal Code validation.

The Person.Address table in AdventureWorks2014 contains a Postal Code column. Addresses are international, with many different formats of Postal Code. For the purposes of this demonstration, we are going to validate the column against Canadian Postal Codes. We’ll use a regular expression taken from the Regular Expressions Cookbook, 2nd Edition:

    ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
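
To see what this expression accepts and rejects, here is a small Python sketch using the standard re module. The sample values and the helper name are our own, chosen for illustration; they are not taken from AdventureWorks.

```python
import re

# The expression quoted above, compiled once for reuse.
CANADIAN_POSTAL_CODE = re.compile(
    r"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$"
)


def is_canadian_postal_code(value: str) -> bool:
    """Return True if the value matches the Canadian Postal Code pattern."""
    return CANADIAN_POSTAL_CODE.match(value) is not None


# Illustrative sample values (hypothetical, not from the source table).
samples = ["K1A 0B1", "V6B4Y8", "98011", "D1A 0B1", "W1A 0B1"]
results = {value: is_canadian_postal_code(value) for value in samples}
for value, ok in results.items():
    print(f"{value!r}: {'pass' if ok else 'reject'}")
```

The first two samples pass, with or without the optional space. The others are rejected: 98011 has the wrong shape, D1A 0B1 contains a letter excluded by the lookahead, and W1A 0B1 starts with a letter outside the first character class.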

We’ll start from the point where we have imported the Person.Address metadata into our repository:

regex1

Click Change in the right-side panel, then modify the postal_code column. Scroll down to the Data Quality section, and make the following changes:

regex2

 

Save your changes, go back to the Load List, and select the Scripts option for the load_person_address table.

Notice the new section of script which has been generated. This is the validator that will be applied to this column at load time.

    rs = new AjiliusResultSet (rs)
    rs.setValidator('postal_code','text','regex','^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$','1')

Now run the script, and watch the results at the bottom left of the screen:

regex3

As you can see, only 1090 rows passed the validation. When we view the contents of the table, we see that these rows match the Canadian Postal Code format defined in our regular expression:

regex4

A word of caution: regex validation is slower than type and range checking. Without regex validation, that same table loaded in 0.749 seconds, and the difference was due entirely to the regex algorithm. If you have a choice, use range checking instead.

Ajilius. Better data screens.

 

Enhancement: Cached Metadata

Ajilius enables browsing of source system metadata, to identify and load data into the warehouse. Here is a typical display:

metadata01

A click on a table, on the left side, shows the columns for that table on the right side.

Until now, each update of this screen represented a round trip to the database. That was fine for a data source located close to the Ajilius server, but users in hybrid environments reported slow screen updates when pulling cloud metadata to on-premise Ajilius. This was particularly apparent with large Oracle systems, which add thousands of system table entries to a metadata set.

We’ve now added a feature to cache metadata within Ajilius. This means the delay of updating metadata happens only once, and subsequent interactions are as fast as an on-premise solution.

The context menu for a data source now contains an option to Refresh Metadata:

metadata02

When you select this option, you will be prompted with a screen warning that a metadata update might be slow. In this case, “slow” means a few seconds.

metadata03

On running this refresh, an internal metadata cache is added to your data warehouse metadata repository, and subsequent calls to browse metadata or load metadata into the warehouse will be drawn from this cache.

You may refresh the cache at any time.
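
The idea can be sketched as a simple explicit-refresh cache. This is a minimal illustration under our own assumptions, not the Ajilius code: the class, its methods and the fetch function are hypothetical, and the real cache lives in the metadata repository rather than in memory.

```python
from typing import Callable, Dict, List, Optional


class MetadataCache:
    """Hypothetical cache of table -> column names for one data source."""

    def __init__(self, fetch: Callable[[], Dict[str, List[str]]]):
        self._fetch = fetch  # the slow round trip to the source system
        self._tables: Optional[Dict[str, List[str]]] = None

    def refresh(self) -> None:
        """Pull all metadata from the source once and keep it locally."""
        self._tables = self._fetch()

    def columns(self, table: str) -> List[str]:
        """Answer from the cache; no connection to the source is needed."""
        if self._tables is None:
            raise RuntimeError("metadata has not been refreshed yet")
        return self._tables[table]


calls = []


def slow_fetch() -> Dict[str, List[str]]:
    # Stand-in for the remote query; records each round trip.
    calls.append(1)
    return {"person_address": ["address_id", "city", "postal_code"]}


cache = MetadataCache(slow_fetch)
cache.refresh()
cache.columns("person_address")
cache.columns("person_address")
print(len(calls))  # one round trip, however often the table is browsed
```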

An added bonus is that once your data source metadata is cached, you no longer need to be connected to the source whilst working with source metadata. That’s a great feature for companies with high-risk data, and for people who like to take their work home with them 🙂

Ajilius. More flexible metadata.

 

Multi-generational upgrades

Lately we’ve found that our pace of delivery has outstripped the ability of some users to keep up with upgrades.

We have been expecting users to apply each upgrade as it is issued. In practice, that hasn’t always been possible. Emails might have been missed, or user priorities might have been elsewhere, and the result has sometimes been that support was needed to work through the issues of upgrading multiple generations at a time.

We also had a problem this week with a customer who restored a metadata repository from a backup that was several months old. This meant that their metadata was out of step with their current application and repositories.

We’re fixing that problem this week. Release 2.2.14 will bring a new method of versioning upgrades. The version of the repository will be recorded, and the upgrade patches will include multiple generations of upgrade history.

On applying a patch to Ajilius, any repository that is not at the current level will be upgraded, even if every repository is at a different release.

Only one patch will be required to bring your metadata up to the latest version, even if it is several generations old. Further, any metadata that has been restored from a backup will also be brought up to date, simply by rerunning the upgrade process.
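
As a rough sketch of how a versioned, multi-generation upgrade can work, consider the following Python. It is an illustration under our own assumptions rather than the actual patch code: the repository is modelled as a dictionary, and the step functions and version tuples are hypothetical.

```python
from typing import Callable, List, Tuple

Version = Tuple[int, int, int]
Step = Tuple[Version, Callable[[dict], None]]


def apply_upgrades(repo: dict, steps: List[Step]) -> List[Version]:
    """Apply every step newer than the repository's recorded version, in order."""
    applied = []
    for version, step in sorted(steps, key=lambda s: s[0]):
        if version > repo["version"]:
            step(repo)
            repo["version"] = version  # record the new level after each step
            applied.append(version)
    return applied


# Hypothetical upgrade history carried inside a single patch.
steps: List[Step] = [
    ((2, 2, 12), lambda r: r["features"].append("mongodb")),
    ((2, 2, 13), lambda r: r["features"].append("placeholder")),
    ((2, 2, 14), lambda r: r["features"].append("versioned_upgrades")),
]

old_repo = {"version": (2, 2, 11), "features": []}
first_run = apply_upgrades(old_repo, steps)
second_run = apply_upgrades(old_repo, steps)  # rerunning changes nothing
```

Because each repository records its own version, the same patch can be applied to repositories at different levels, and rerunning it against an up-to-date repository is harmless.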

Ajilius. Flexible Upgrades.

Enhancement: MongoDB Data Source

Release 2.2.12 includes support for MongoDB as a data source.

MongoDB is a popular NoSQL DBMS, and Ajilius is the first data warehouse automation platform to support MongoDB as a first-class citizen.

Want proof? Here’s a shot of Ajilius displaying data from the MongoDB-hosted TPC-H database:

mongodb2

 

More? Here’s Ajilius showing data from the Restaurants sample database:

mongodb3

Ajilius makes working with MongoDB as simple as working with any other data source we support.

Ajilius. MongoDB to your data warehouse.

Enhancement: Log to Data Warehouse

This week’s Release 2.2.12 includes a feature to write ELT log entries to the data warehouse.

Previous releases logged ELT performance to a log file. This file could be named through the Warehouse Edit screen, and defaulted to ajilius.log.

Now, in addition to the log file, a table AJILIUS_LOG is created in the AJILIUS schema of the data warehouse. On job completion, as long as a data warehouse connection is available, the results of the job will be inserted into this table.

The logged columns are:

  • log_stamp / Timestamp
  • log_script / Script name
  • log_elapsed / Elapsed time in seconds
  • log_status / Job status, 0=success
  • log_files / Number of files processed during job
  • log_inserts / Number of rows inserted during job
  • log_updates / Number of rows updated during job
  • log_deletes / Number of rows deleted during job
  • log_message / Descriptive error message if log_status<>0
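
To give a feel for how this table might be queried, here is a sketch using Python’s built-in sqlite3 module. The column names come from the list above; the column types, the omission of the AJILIUS schema, and the two sample rows are our own assumptions, made up purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names follow the list above; the types are assumptions.
conn.execute(
    """
    CREATE TABLE ajilius_log (
        log_stamp   TEXT,
        log_script  TEXT,
        log_elapsed REAL,
        log_status  INTEGER,
        log_files   INTEGER,
        log_inserts INTEGER,
        log_updates INTEGER,
        log_deletes INTEGER,
        log_message TEXT
    )
    """
)

# Made-up sample rows, not real job results.
rows = [
    ("2016-01-01 02:00:00", "load_example_a", 1.0, 0, 0, 100, 0, 0, None),
    ("2016-01-01 02:00:05", "load_example_b", 2.0, 1, 0, 0, 0, 0, "illustrative failure"),
]
conn.executemany("INSERT INTO ajilius_log VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Find jobs that did not succeed (log_status <> 0).
failures = conn.execute(
    "SELECT log_script, log_message FROM ajilius_log "
    "WHERE log_status <> 0 ORDER BY log_stamp"
).fetchall()
print(failures)
```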

Ajilius. Keeping track of performance.

 

Enhancement: User / Tech Documentation

This week’s download for Release 2.2.9 brings separation of User and Technical documentation.

Previously, Sources, Tables and Columns had one documentation panel. For tables and columns, that documentation was carried forward to subsequent tables in the ELT process. While editable, it gave rise to misleading documentation if a technical note from a previous table was overlooked.

We’ve now provided separate tabs for User and Tech(nical) documentation.

UserTechDoc

User documentation is carried from table to table and from column to column. Technical documentation is specific to the source, table or column for which it is defined.

Also, we’ve provided a feature to tune the size of the documentation panel. User Preferences now contains a field for Documentation Lines. Modify this value to increase or decrease the number of lines displayed by the documentation panel, tuning the fit for screen size and zoom level.

DocLines

Keep the requests coming. Your ideas make Ajilius better for every user.

Ajilius. Listening to our users.

Enhancement: Load Metadata Columns

Release 2.2.8, out today, includes the ability to add metadata columns to load tables.

The columns which may be added are:

  • DateTime
  • GUID
  • Hash
  • Server
  • Database
  • Table

The availability of these columns is dependent on the capabilities of the source DBMS. Where a given column is not supported by the data source, an empty string will be supplied in its place.

Metadata columns may be added from the Column List screen for a given table. When adding a column, all you need to do is select the appropriate Column Role. Column names, data types, etc. will be added automatically.
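
For readers who want a picture of what such columns contain, here is a small Python sketch that enriches a row with the six values. It is not how Ajilius generates them: the meta_ column names, the SHA-256 hash over sorted column values, and the function itself are hypothetical choices made for this example.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def add_metadata_columns(row: dict, server: str, database: str, table: str) -> dict:
    """Return a copy of the row with illustrative metadata columns added."""
    # Assumption: the hash covers all source values in a stable column order.
    payload = "|".join(str(row[key]) for key in sorted(row))
    enriched = dict(row)
    enriched.update(
        {
            "meta_datetime": datetime.now(timezone.utc).isoformat(),
            "meta_guid": str(uuid.uuid4()),
            "meta_hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
            "meta_server": server,
            "meta_database": database,
            "meta_table": table,
        }
    )
    return enriched


row = {"city": "Ottawa", "postal_code": "K1A 0B1"}
first = add_metadata_columns(row, "server01", "AdventureWorks2014", "Person.Address")
second = add_metadata_columns(row, "server01", "AdventureWorks2014", "Person.Address")
```

The hash is the same for both calls because it depends only on the row contents, while the GUID differs on every call.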

Ajilius. Enriching source data.

Query Based Load Dependencies

The new Ajilius scheduler uses metadata to figure out the dependencies and run sequence of your ELT jobs.

Custom code can currently break these dependencies, because a custom query may reference tables that the metadata does not know about.

Here’s an example, contrived from a customer query earlier today:

 

cblDependencies

 

In this query there is a dependency between the tables load.load_from_table and load.load_earlier_table. If load.load_earlier_table has not been populated before load.load_from_table is processed, duplicate rows may be selected.

In the long term we will use a SQL parser to extract these dependencies from custom queries, but we’re still working on the evaluation of parsers.

Meanwhile, these dependencies can be resolved through multiple steps in the scheduler. Here is an example of scheduler parameters that would ensure the correct sequence of loads in a full data warehouse refresh:

    -w <dwname> -b reset
    -w <dwname> -l load_earlier_table
    -w <dwname> -f all

This could also be written as:

    -w <dwname> -b reset -l load_earlier_table -f all

These commands will ensure that load_earlier_table is the first table processed by the scheduler, and it will be available when needed by load_from_table.
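
The ordering problem itself is a topological sort. The sketch below uses Python’s standard graphlib module (Python 3.9 or later) to show how a declared dependency yields the right run sequence. It illustrates the concept only and says nothing about how the Ajilius scheduler is implemented; the dependency dictionary is written by hand here, which is exactly the information a SQL parser would need to extract.

```python
from graphlib import TopologicalSorter

# Each key depends on the tables in its set.
# Declared by hand for this example; table names are from the query above.
dependencies = {
    "load_from_table": {"load_earlier_table"},
}

# static_order() yields every table after the tables it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['load_earlier_table', 'load_from_table']
```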

Remember, any time you’re not sure how to do something using Ajilius, you are most welcome to contact us and discuss.