Version 2.1c – Snowflake TIME Data Type

On Wednesday night we received notification of the latest changes to the Snowflake Elastic Data Warehouse. Of particular interest was the implementation of a TIME data type. Here we are on Friday night, and we’re releasing support for this latest feature.

We were among the customers who requested this data type early in the Snowflake development process, so we’re delighted to see it implemented.

Having a TIME data type makes it easy to implement dimensions that incorporate concepts like business/after hours, and morning/lunch/afternoon periods. Want to know the blocks of time in which most of your transactions occur across the week? You need a time dimension.

Without a TIME data type, Ajilius represents time values using a character string in the format HH:MM:SS. This works for building the star schema, but it doesn’t support analytic queries involving time arithmetic. Now we can store and calculate time values in Snowflake data warehouses.
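
As a rough illustration of what the new type enables, here’s a minimal Snowflake SQL sketch; the table and column names are ours for the example, not Ajilius-generated DDL:

-- Illustrative only: hypothetical table and column names.
CREATE TABLE dim_time (
    time_key     INTEGER      NOT NULL,
    time_of_day  TIME         NOT NULL,   -- native TIME value, e.g. '14:35:00'
    day_period   VARCHAR(20)  NOT NULL    -- e.g. 'Morning', 'Lunch', 'Afternoon'
);

-- Classify each time into a period and do simple time arithmetic,
-- neither of which an HH:MM:SS character string supports directly.
SELECT time_of_day,
       CASE
           WHEN HOUR(time_of_day) BETWEEN 9 AND 16 THEN 'Business hours'
           ELSE 'After hours'
       END AS business_period,
       DATEDIFF('minute', '09:00:00'::TIME, time_of_day) AS minutes_since_open
FROM dim_time;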

The installer package for V2.1c also updates the Snowflake connector to the latest version.

Ajilius. Keeping up with Snowflake.


Version 2.1 – Data Quality Automation

The hero feature of Ajilius 2.1 is Data Quality Automation.

This is yet another unique feature brought to data warehouse automation by Ajilius.

In the 2.1 release, Ajilius adds three types of data quality screens to the extract process, illustrated in the sketch after this list:

  • Data type validation, where values are tested for conformance to the column data type.
  • Range validation, where values are tested for set and range boundaries.
  • Regex validation, where values are tested against regular expressions.
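
To give a feel for what the screens check, here is an illustrative Snowflake SQL sketch with hypothetical staging table and column names; it is not the Ajilius implementation, just the equivalent predicates:

-- Hypothetical staging table and columns, for illustration only.
SELECT order_id,
       order_qty,
       customer_email,
       -- Data type screen: does the raw value convert to the target type?
       (TRY_CAST(order_qty AS INTEGER) IS NULL)                   AS fails_type_check,
       -- Range screen: is the value inside the permitted boundaries?
       (TRY_CAST(order_qty AS INTEGER) NOT BETWEEN 1 AND 9999)    AS fails_range_check,
       -- Regex screen: does the value match the required pattern?
       (NOT REGEXP_LIKE(customer_email, '^[^@]+@[^@]+\\.[^@]+$')) AS fails_regex_check
FROM stage_orders;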

In Version 2.3 (due September 2016) we will be adding Lookup validation to data quality rules, to check the existence of values in data warehouse tables.

Rows that fail validation are logged to an error file, along with the reason(s) for rejection.

A new return code from the extract job signals that validation errors have occurred, enabling the scheduler to choose whether to continue, or to suspend the batch pending user remediation of the errors.

And once again, we’re adding this as a standard feature of the Ajilius platform. If you’re licensed for Ajilius, upgrade to the latest version and you can immediately identify and screen data quality problems before they hit your data warehouse.

Ajilius. Committed to innovation in data warehouse automation.

Version 2.1 – Pivotal Greenplum

Ajilius is pleased to announce full support for Pivotal Greenplum in Version 2.1, available now.

Greenplum is an open source MPP data warehouse, available on-premises and in the cloud. Based on an earlier version of PostgreSQL, Greenplum will shortly be upgraded to the latest PostgreSQL code base for even faster loads and transformations.

The advantage of Greenplum is that, being open source, it gives anyone the opportunity to use a real MPP data warehouse platform. All you need is hardware or cloud capacity, with realisable savings of hundreds of thousands of dollars over commercial offerings.

You now have a great, free scalability path when your workload or data grows beyond PostgreSQL’s capabilities. With the unique “3 click migration” feature of Ajilius, you can move your entire data warehouse at any time with just a few clicks of the mouse.

Ajilius. Keeping the value in data warehouse automation.

Surrogate issues with Azure SQL DW

2017-02-14: Ajilius has a new CTAS engine in Release 2.4.0 that fully supports optimised surrogate keys across both PDW and Azure SQL Data Warehouse. We’d still like to see an IDENTITY column, or equivalent, on these platforms, but we’re processing hundreds of millions of rows using our current techniques and we’re satisfied with our solution.

Surrogate keys are fundamental to the success of a dimensional data warehouse. These are the keys that uniquely identify a dimension row. They are typically integer values, because integers compress well and compare quickly.

We’ve been using window functions to handle surrogate key generation in Azure SQL Data Warehouse. This was the recommended approach on PDW, then APS, and has now been well documented in a recent paper from the SQL Server Customer Advisory Team.
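
For context, the window-function pattern looks roughly like the following sketch, with hypothetical table and column names; the new keys are offset by the current maximum so they continue the existing sequence:

-- Illustrative T-SQL sketch of window-function surrogate key assignment.
INSERT INTO dim_customer (customer_key, customer_bk, customer_name)
SELECT
    m.max_key + ROW_NUMBER() OVER (ORDER BY s.customer_bk) AS customer_key,
    s.customer_bk,
    s.customer_name
FROM stage_customer AS s
CROSS JOIN (SELECT ISNULL(MAX(customer_key), 0) AS max_key
            FROM dim_customer) AS m;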

On reading this paper, I was a little concerned by the following comment:

NOTE: In SQL DW or (APS), the row_number function generally invokes broadcast data movement operations for dimension tables. This data movement cost is very high in SQL DW. For smaller increment of data, assigning surrogate key this way may work fine but for historical and large data loads this process may take a very long time. In some cases, it may not work due to tempdb size limitations. Our advice is to run the row_number function in smaller chunks of data.

I wasn’t so worried about the performance issue in daily DW processing, but the tempdb issue had not occurred to me before. Is it serious? Maybe, maybe not. But now that it has been identified as an issue, we need to do something about it.

We’re working with another vendor at the moment – not named due to NDA constraints – where we also face a restriction that the working set for window functions must fit on a single node. That, too, is a potential problem when loading large and frequently changing dimensions.

In other words, the commonly recommended approach to surrogate key generation on at least two DW platforms introduces potential problems with larger data sets, which are exactly the type of data sets we are working with. It is time to look at alternative approaches.

We don’t face this problem on Redshift or Snowflake, because they both support automatically generated identifiers. Redshift uses syntax like ‘integer identity(0,1) primary key’, while Snowflake uses ‘integer autoincrement’. The two platforms we’re adding in upcoming releases of Ajilius also support this feature.
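
In DDL terms, with illustrative table and column names:

-- Redshift: the platform assigns the surrogate via an IDENTITY column.
CREATE TABLE dim_customer (
    customer_key   INTEGER IDENTITY(0,1) PRIMARY KEY,
    customer_bk    VARCHAR(50),
    customer_name  VARCHAR(200)
);

-- Snowflake: the equivalent uses AUTOINCREMENT.
CREATE TABLE dim_customer (
    customer_key   INTEGER AUTOINCREMENT,
    customer_bk    VARCHAR(50),
    customer_name  VARCHAR(200)
);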

If Microsoft did what customers have been asking for since the first releases of PDW, they’d give us identity or sequence columns in Azure SQL Data Warehouse. But since that isn’t happening right now, we’re looking at two options to replace the window method of creating surrogate keys. The first is to create row IDs early in the extract pipeline; the second is to use hash values, generated at extract time or at the point the surrogate is required.

Row IDs are attractive in a serial pipeline, but have limitations when we want to run multiple extracts or streams in parallel, because we face overlapping IDs in the merged data set. The benefit of deriving surrogate keys from row IDs is that we would retain the advantages of an integer value.
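
To picture the overlap problem: two parallel streams each number their rows from 1, so the merged set would need something like a stream-aware offset before a row ID could serve as a surrogate. A purely hypothetical SQL sketch, with illustrative names:

-- Each stream numbers its own rows from 1, so raw row IDs collide on merge.
-- A hypothetical composite scheme keeps them distinct by reserving a range per stream.
SELECT stream_id,
       row_id,
       CAST(stream_id AS BIGINT) * 1000000000 + row_id AS candidate_key  -- assumes < 1e9 rows per stream
FROM (
    SELECT 1 AS stream_id, row_id, customer_bk FROM stage_customer_stream_1
    UNION ALL
    SELECT 2 AS stream_id, row_id, customer_bk FROM stage_customer_stream_2
) AS merged;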

Hash values are attractive because they can be generated in a highly parallel way. Their weaknesses are their size and poorer comparison performance, as well as the risk of hash collisions, which could assign the same surrogate value to different business keys.
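
As a sketch of the hash option, again with illustrative names (T-SQL), the surrogate is derived deterministically from the business key rather than from row order:

-- Hypothetical staging table; derives a surrogate directly from the business key.
SELECT
    customer_bk,
    -- Fold a cryptographic hash of the business key into a BIGINT surrogate.
    -- Wider than an IDENTITY value, and collisions, while unlikely, are possible.
    CAST(HASHBYTES('SHA2_256', customer_bk) AS BIGINT) AS customer_key,
    customer_name
FROM stage_customer;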

We’re just wrapping up the testing for V2.1; resolving this question will be high on the priority list for our next release. Let us know your preferences and suggestions.