Surrogate issues with Azure SQL DW

2017-02-14: Ajilius has a new CTAS engine in Release 2.4.0 that fully supports optimised surrogate keys across both PDW and Azure SQL Data Warehouse. We’d still like to see an IDENTITY column, or equivalent, on these platforms, but we’re processing hundreds of millions of rows using our current techniques and we’re satisfied with our solution.

Surrogate keys are fundamental to the success of a dimensional data warehouse. These are the keys that uniquely identify a dimension row. They are typically integer values, because integers compress well and compare quickly.

We’ve been using window functions to handle surrogate key generation in Azure SQL Data Warehouse. This was the recommended approach on PDW, then APS, and has now been well documented in a recent paper from the SQL Server Customer Advisory Team.
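To make the window-function technique concrete, here is a minimal Python sketch of the logic it implements: take the highest surrogate key already in the dimension, then add a row number to each new business key. The function name and the sample customer keys are hypothetical, and this illustrates the idea only; it is not the SQL that Ajilius generates.

```python
def assign_surrogate_keys(existing, incoming):
    """Return a copy of `existing` extended with keys for new business keys.

    `existing` maps business key -> surrogate key; `incoming` is an iterable
    of business keys from the latest load. Both names are illustrative.
    """
    # Equivalent of SELECT MAX(surrogate_key) on the dimension (0 if empty).
    offset = max(existing.values(), default=0)

    # Equivalent of ROW_NUMBER() OVER (ORDER BY business_key) on new rows only.
    # Note that the sort needs every new business key in one place.
    new_keys = sorted(set(incoming) - set(existing))

    assigned = dict(existing)
    for row_number, business_key in enumerate(new_keys, start=1):
        assigned[business_key] = offset + row_number
    return assigned


dimension = {"CUST-001": 1, "CUST-002": 2}
result = assign_surrogate_keys(dimension, ["CUST-003", "CUST-002", "CUST-004"])
```

Existing business keys keep their surrogate values, and only the unseen ones are numbered, continuing from the current maximum.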

On reading this paper, I was a little concerned by the following comment:

NOTE: In SQL DW or (APS), the row_number function generally invokes broadcast data movement operations for dimension tables. This data movement cost is very high in SQL DW. For smaller increment of data, assigning surrogate key this way may work fine but for historical and large data loads this process may take a very long time. In some cases, it may not work due to tempdb size limitations. Our advice is to run the row_number function in smaller chunks of data.

I wasn’t so worried about performance issues in daily DW processing, but the tempdb issue had not occurred to me before. Is it serious? Maybe, maybe not. But now that it has been identified as an issue, we need to do something about it.

We’re working with another vendor at the moment – not named due to NDA constraints – where we also face a restriction that the working set for window functions needs to fit in one node. That, too, is a potential problem when loading large and frequently changing dimensions.

In other words, the commonly recommended approach for surrogate key generation on at least two DW platforms introduces potential problems in larger data sets, which are exactly the type of data sets with which we are working. It is time to look at alternative approaches.

We don’t face this problem on Redshift or Snowflake, because they both support automatically generated identifiers. Redshift uses syntax like ‘integer identity(0,1) primary key’, while Snowflake uses ‘integer autoincrement’. The two platforms we’re adding in the upcoming releases of Ajilius also support this feature.

If Microsoft did what customers have been asking for since the first releases of PDW, they’d give us identity or sequence columns in Azure SQL Data Warehouse. But since that isn’t happening right now, we’re looking at two options to replace the window method of creating surrogate keys. The first option is to create row IDs early in the extract pipeline; the second is to use hash values, generated either at extract or at the point where the surrogate is required.

Row IDs are attractive in a serial pipeline, but they have limitations when we want to run multiple extracts or streams in parallel, because IDs generated independently by each stream can overlap in the merged data set. The benefit of deriving surrogate keys from row IDs is that we would keep the advantages of an integer value.
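The overlap problem is easy to reproduce. The sketch below shows two independent extract streams that each count from 1, then one possible workaround in which each stream steps by the total number of streams. The function names and the interleaving scheme are assumptions made for illustration, not a description of what Ajilius does.

```python
from itertools import count, islice


def naive_row_ids():
    """Each stream numbers its own rows from 1, unaware of the others."""
    return count(1)


def interleaved_row_ids(stream_index, stream_count):
    """Hypothetical workaround: stream i yields i+1, i+1+n, i+1+2n, ...

    With n streams, no two streams can ever produce the same ID.
    """
    return count(stream_index + 1, stream_count)


# Two naive streams produce the same IDs, so the merged set has duplicates.
naive_a = list(islice(naive_row_ids(), 3))
naive_b = list(islice(naive_row_ids(), 3))
overlap = set(naive_a) & set(naive_b)

# Two interleaved streams stay disjoint without any coordination at run time.
stream_0 = list(islice(interleaved_row_ids(0, 2), 3))
stream_1 = list(islice(interleaved_row_ids(1, 2), 3))
```

The trade-off in this sketch is that the number of streams must be fixed up front, and the resulting keys are no longer dense.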

Hash values are attractive because they can be generated in a highly parallel fashion. Their weaknesses are their size, their poor comparison performance, and the risk of hash collision, which could create the same surrogate value for different business keys.
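As a rough illustration of the hash option and its collision risk, the sketch below derives a 64-bit integer from a business key and estimates the chance of any collision using the standard birthday approximation. The choice of SHA-256 truncated to 8 bytes, and both function names, are assumptions for the example; other hash functions and widths would work the same way.

```python
import hashlib
import math


def hash_surrogate(business_key: str) -> int:
    """Derive a signed 64-bit integer from a business key.

    Deterministic, so any stream or node computes the same value without
    coordination. Truncating SHA-256 to 8 bytes is an illustrative choice.
    """
    digest = hashlib.sha256(business_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def collision_probability(row_count: int, bits: int = 64) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = row_count * (row_count - 1) / (2.0 * 2 ** bits)
    return -math.expm1(-exponent)


key_a = hash_surrogate("CUST-001")
key_b = hash_surrogate("CUST-002")
p_64 = collision_probability(10 ** 8)           # 100 million keys, 64-bit hash
p_32 = collision_probability(10 ** 8, bits=32)  # same volume, 32-bit hash
```

Under this approximation, 100 million distinct business keys hashed to 64 bits carry a collision chance of a little under 0.03%, while a 32-bit hash at the same volume is all but certain to collide. That is the size-versus-risk trade-off in numbers: narrower hashes compare faster but collide sooner.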

We’re just wrapping up the testing for V2.1, and resolving this question will be high on the priority list for our next release. Let us know your preferences and suggestions.