Azure SQL or Azure SQL Data Warehouse?

Customers committed to the Microsoft stack often ask which product they should target for their cloud data warehouse – Azure SQL or Azure SQL Data Warehouse.

Today I heard a Senior Program Manager from the Azure SQL Data Warehouse team give the following answer:

  • If your data is measured in gigabytes, or your fact tables have fewer than 100 million rows, we’re probably not the right solution

I also read an interesting article the other day suggesting that 70% of analytics data sets – the data normally found in a data warehouse – are less than 1TB. Yes, the article is a little old, but that 70% figure was at the 500GB mark, which means it could double and still hold true today.

In other words, the typical SME-scale data warehouse is probably better off on Azure SQL. If you don’t have tens of millions of customers or products, and you don’t transact hundreds of millions of line items per year, you should probably start your data warehouse on Azure SQL.

Ajilius supports both Azure SQL and Azure SQL Data Warehouse, with three-click migration between them. That means you’re not locked in by your choice of database, and you can easily change your mind if your data grows beyond initial expectations.

You can’t go wrong building your Microsoft data warehouse with Ajilius.


New Script Form

The Ajilius script forms used to look like this:

Now, in Ajilius 3.5.0, we’ve moved to a tabbed presentation of scripts:

There are three main benefits from this new interface:

  • Scripts are more readable
  • You can choose between a CREATE and ALTER script for a table
  • We’ve got room to show all the scripts for a table, including some in upcoming releases

Please note, if the ALTER tab is empty, it means that the table metadata matches its deployment, and no alteration is required.

Ajilius. Simpler scripting.

Snowflake-Snowflake Data Loads

When we built the Snowflake adapter for Ajilius, around two years ago, we saw Snowflake as a Data Warehouse target.

We’re now seeing many Snowflake customers using its capabilities as a Data Lake as well as a Data Warehouse. Customers are landing their data in one Snowflake database, then migrating and structuring it for analysis and reporting.

We have responded to this shift by implementing cross-database queries for data ingestion by Ajilius.

Previously, customers could extract data from one Snowflake database, then reload it to another. If your Ajilius instance was running on-premise, this carried a significant overhead as the data was extracted, transferred to your Ajilius instance, transferred back to the Snowflake host, and reloaded.

Now, a new load type of “Cross-Database” can be selected for a table. When the table load job is executed, we create a three-part name query instead of an extract/load query.
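As a rough illustration of the idea (the function and target schema name are hypothetical, not Ajilius internals), the cross-database load replaces an extract/load pair with a single INSERT…SELECT built from a three-part object name:

```python
def cross_database_load_sql(src_db, src_schema, table, tgt_schema):
    """Build a hypothetical Snowflake cross-database load statement.

    Instead of extracting data to the client and re-loading it with
    PUT/COPY, the rows move entirely inside Snowflake, addressed by a
    three-part name: database.schema.table.
    """
    source = f"{src_db}.{src_schema}.{table}"
    target = f"{tgt_schema}.{table}"
    return f"INSERT INTO {target} SELECT * FROM {source};"

sql = cross_database_load_sql(
    "SNOWFLAKE_SAMPLE_DATA", "TPCDS_SF100TCL", "CALL_CENTER", "LOAD")
```

Because the statement runs server-side, the data never leaves Snowflake, which is where the on-premise round-trip saving comes from.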

Here’s how it works.

Create a data source in the normal way. Let’s use the SNOWFLAKE_SAMPLE_DATA that comes with a new account:

After refreshing metadata, we can see the tables and columns from the demonstration database. Now we can load the CALL_CENTER table:

Set the characteristics of the table. The Load Method will default to Bulk Import, which uses the Snowflake PUT/COPY method. Change this to Cross-Database Query:

Now, when you generate scripts for this table, you’ll see that a cross-database query is used instead of a PUT/COPY. Note that some columns have been dropped in order to show the full query.

Of course, incremental loads are also supported for cross-database loads.

Ajilius. Faster loads for Snowflake-Snowflake.


Documentation belongs …


I recently had to review some old ETL code where the developer was a firm believer in documentation. Unfortunately, his belief was that documentation belonged in Confluence.

Not that I’ve got anything against Confluence – it’s a good product – but I got very tired and very grumpy about having to look up wiki page after wiki page to figure out what was in his twisted little brain when the code was written.

Ajilius saves developers from having to search in unreasonable places for documentation. Not only do we capture both user and technical documentation for all artifacts, but we put technical documentation right where it belongs … IN THE CODE.

Here’s a screen shot of the technical documentation panel for a table:

… and here’s how it surfaces in code, neatly translated from HTML to text:

Ajilius. Better data warehouses … and better documentation.

Ajilius 3: SQL + Scripts

In Ajilius 1, we used Python as a scripting language. Ajilius 2 switched to Java and Groovy. We generated scripts that mixed the scripting language and SQL to emulate stored procedures.

We got pushback on this approach from a number of prospects, probably because we didn’t always do a good job of explaining the relationship between the scripting language and the SQL. Prospects saw the Groovy syntax, for example, and thought that they would have to learn a new language to build their data warehouse.

Looking back, we can see why.

Here is an Ajilius 2 script:

Here is the Ajilius 3 script for the same table:

As you can see, the SQL script in our new version is immediately familiar to a data warehouse developer.

This isn’t to say that we’re abandoning scripting, because Ajilius 3 introduces this feature as SQL + Scripts.

You’ll still have the option to export and package the SQL for execution as Groovy scripts, Stored Procedures (where supported), PowerShell scripts and more. The difference to earlier versions is that this is an option, not the standard.

An added bonus of this approach is that scripts run through the Ajilius scheduler no longer incur the overhead of instantiating the scripting engine and loading/compiling the script before execution. As a result, you get much faster execution of your transformation processes.

Ajilius. Listening to customer feedback to make us better.


Enhancement: Surrogate Key Names

Ajilius V1 used a fixed naming convention for surrogate keys. The dimension surrogate was named “id”, and the fact foreign key was named “<dimension>_id”.

While this might have been correct from a data modelling perspective, it didn’t work well with modern BI tools which expect the surrogate and foreign keys to have the same name.

In V2 we modified the convention to use the same name, which we constructed as “<dimension>_key” in each case. This worked well with BI tools, but Kimball modellers were more used to seeing column names like “store_key” instead of “dim_store_key”.

Now we’re releasing an enhancement that enables you to set the surrogate key name when you create a dimension, and to modify it like any other column name.

Update to version 2.2.7, and you’ll be able to use this feature.

Ajilius. Responding to user feedback.


Enhancement: Batch Processor

When we designed Ajilius, we expected that customers would have an existing batch scheduling solution in place. Scripts generated by Ajilius would be deployed to production servers, and scheduled for execution by the operations team.

It hasn’t worked that way in practice. To our surprise, most customers to date have had no existing scheduler in place. Our most common enhancement request has been to provide this functionality.

Well, we’ve done it. Ajilius now incorporates a batch scheduler for dependency-based execution of data warehouse scripts.

It is driven by a command line processor that supports the following arguments.

-w / --warehouse
Generate scripts from the warehouse_name metadata

-b / --batch
Execute batch-level operations

-s / --source
Load all tables from named source/s

-l / --load
Load all tables, or named tables

-t / --transform
Transform all tables, or named tables

-d / --dimension
Process all dimensions, or named dimensions

-f / --fact
Process all facts, or named facts
The -t, -d and -f options process all dependencies for the selected tables. For example, the arguments:

-w chinook -b reset -f all

will conduct a full end-to-end load of the “chinook” warehouse.

You might choose to only load two specific fact tables, in which case the arguments might look like:

-w chinook -b reset -f fact_sale fact_budget

Multiple commands can be included in the one command line, or jobs may be separately scheduled. Steps that have been completed will be retained between jobs, until they are reset by the “-b reset” option. For example:

-w chinook -b reset
-w chinook -l all
-w chinook -t all
-w chinook -d all
-w chinook -f all

If executed as a series of jobs, this batch would reset the internal dependency log, load all load tables, transform all stage tables, process all dimensions, then process all facts.

It might, alternatively, have been written as:

-w chinook -b reset -l all -t all -d all -f all

But given that all dependencies are processed for each table, it could be simplified to:

-w chinook -b reset -f all
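For readers who like to see flag handling spelled out, here is a purely illustrative sketch of how a command line like the one above could be parsed. The flag names come from the list earlier in this post; everything else (the use of Python’s argparse, the prog name) is an assumption for illustration only, not how the Ajilius processor is actually implemented:

```python
import argparse

# Hypothetical parser mirroring the documented Ajilius batch flags.
parser = argparse.ArgumentParser(prog="ajilius")
parser.add_argument("-w", "--warehouse", required=True,
                    help="warehouse metadata to generate scripts from")
parser.add_argument("-b", "--batch",
                    help="batch-level operation, e.g. reset")
parser.add_argument("-s", "--source", nargs="+",
                    help="load all tables from named source/s")
parser.add_argument("-l", "--load", nargs="+",
                    help="load all tables, or named tables")
parser.add_argument("-t", "--transform", nargs="+",
                    help="transform all tables, or named tables")
parser.add_argument("-d", "--dimension", nargs="+",
                    help="process all dimensions, or named dimensions")
parser.add_argument("-f", "--fact", nargs="+",
                    help="process all facts, or named facts")

# The full end-to-end load from the example above.
args = parser.parse_args("-w chinook -b reset -f all".split())
```

Note that -f takes one or more names, which is how both "-f all" and "-f fact_sale fact_budget" fit the same flag.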

The batch scheduler will be in Release 2.2.6, available from Monday, July 25.

Ajilius. Now with enhanced scheduling.


Integrated Upgrades

One of our principles in making enhancements to Ajilius is drawn from a quote attributed to Albert Einstein:

Everything should be made as simple as possible, but no simpler.

Our new Java version made the installation process much simpler, and the patch process as well. Now, the full release of 2.2 simplifies the process of upgrades to our metadata databases.

When we need to update the metadata databases, we’ll send you a file to be placed in the Ajilius program folder. Simply restart Ajilius, the update will be applied, and your application will continue as normal.

Ajilius. As simple as possible.


Unlike some competitor products, Ajilius does not “phone home”. We do not automatically download any files, and send no information back from your installation. Your metadata and credentials are secure within your site at all times.

Ajilius & Java

We’ve been a little quiet, lately.

The reason is that, for the last few months, we’ve had a team of people working across three countries to complete something exciting: a new version of Ajilius that shifts us from a Python platform to Java. For the last month in particular, we’ve been on an incredible sprint to deliver the Java version by the end of May.

This wasn’t an easy decision to make, as Python was critical to our first year of operation. We couldn’t have built Ajilius as quickly as we did on any other platform, and we grew to like Python as a development language.

Sadly, though, we found two problems with Python – distribution and database drivers.

Distribution management for Python is handled by a utility called PIP. Our installation scripts used PIP to acquire (and update) our library dependencies. Unfortunately, PIP struggled behind firewalls, and a number of corporate and government customers were unwilling to make the changes necessary for it to work.

That meant we had to bundle dependencies ourselves, which became a nightmare of dependencies-on-dependencies, and a maintenance and distribution headache for our team. If you’ve heard of DLL-hell in the Windows world, Python’s site-packages hell is at least three layers closer to Lucifer himself.

The second problem was database drivers. While Snowflake released a high-quality Python driver for their DBMS, most other databases are supported by the Python community, which means that features can lag well behind current vendor releases. A great example is SQL Server, where Python’s interfaces are wildly variable in their support. One doesn’t support Unicode. One doesn’t support Integrated Authentication. None support AlwaysOn intents. None support the Bulk Copy API. And it wasn’t just a SQL Server problem: as we supported more and more databases, finding current drivers became a support nightmare.

Both of these problems meant we needed an alternative. After much searching and testing, we settled on the combination of Java and JDBC, and after several man-years of effort we finally have a Java version ready for beta testing.

Java solves the packaging and distribution problem because we can distribute Ajilius with the library JAR files already set up. Once you have installed Java, our installation process is as simple as unzipping a folder into a known location. Maintaining current versions of library dependencies is as simple as copying a file. Further, we no longer have to go through a platform-dependent compilation and packaging step as we did with our Python version; the code and packaging are identical across our supported platforms of Windows, Linux and OSX.

Java and JDBC have also enabled us to solve our database driver problem. Every mainstream DBMS has a vendor-supported JDBC driver available; we no longer have to test multiple community drivers, then make compromises on the availability of features vs. the quality of the driver. Even better, we now have drivers available for popular application platforms and web services including CRM, Marketing, Accounting, Collaboration, Social Networking and more. We’ll be talking a lot more about this in coming posts.

So what is the benefit for you, the customer and partner?

Firstly, you get a much faster installation experience that is more familiar to your operations team. Install Java, copy the Ajilius folder, and you’re done. It can even be fully automated for internal software packaging.

Secondly, updates are now much easier to deploy. You’ll get more features, faster.

Finally, the incredible breadth of JDBC drivers means that we’ll have connectors for more data sources than you can possibly imagine, both cloud and on-premise.

Our demonstration servers have now been switched to Java, and we’ll be ready for on-premise evaluations by the end of the week. Migration packages for existing customers will follow by the end of the month.

Now, we move on with the development of the huge list of features that will be coming to Ajilius in the next twelve months. Hang on, it is going to be one heck of a ride!


Surrogate issues with Azure SQL DW

2017-02-14: Ajilius has a new CTAS engine in Release 2.4.0 that fully supports optimised surrogate keys across both PDW and Azure SQL Data Warehouse. We’d still like to see an IDENTITY column, or equivalent, on these platforms, but we’re processing hundreds of millions of rows using our current techniques and we’re satisfied with our solution.

Surrogate keys are fundamental to the success of a dimensional data warehouse. These are the keys that uniquely identify a dimension row. They are typically integer values, because they compress and compare at high performance.

We’ve been using window functions to handle surrogate key generation in Azure SQL Data Warehouse. This was the recommended approach on PDW, then APS, and has now been well documented in a recent paper from the SQL Server Customer Advisory Team.
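The window-function approach assigns each new dimension row the next sequential integer after the current maximum surrogate key, typically via ROW_NUMBER() in SQL. As a toy illustration of that assignment logic only (function and key names are invented for this sketch, not Ajilius code):

```python
def assign_surrogate_keys(new_business_keys, current_max_key):
    """Emulate the SQL pattern
        ROW_NUMBER() OVER (ORDER BY business_key) + max(existing key):
    each new dimension row receives the next sequential integer
    surrogate after the highest key already in the dimension."""
    keys = {}
    for offset, bk in enumerate(sorted(new_business_keys), start=1):
        keys[bk] = current_max_key + offset
    return keys

# Three new business keys arriving when the dimension's max key is 100.
surrogates = assign_surrogate_keys(["CC-3", "CC-1", "CC-2"],
                                   current_max_key=100)
```

In a distributed DBMS the ordering step is what forces the data movement the SQLCAT note below warns about: every row must pass through a single ordered numbering operation.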

On reading this paper, I was a little concerned to read the following comment:

NOTE: In SQL DW or (APS), the row_number function generally invokes broadcast data movement operations for dimension tables. This data movement cost is very high in SQL DW. For smaller increment of data, assigning surrogate key this way may work fine but for historical and large data loads this process may take a very long time. In some cases, it may not work due to tempdb size limitations. Our advice is to run the row_number function in smaller chunks of data.

I wasn’t so worried about the performance issue in daily DW processing, but the tempdb issue had not occurred to me before. Is it serious? Maybe, maybe not. But having been identified as an issue, we need to do something about it.

We’re working with another vendor at the moment – not named due to NDA constraints – where we also face a restriction that the working set for window functions needs to fit in one node. That, too, is a potential problem when loading large and frequently changing dimensions.

In other words, the commonly recommended approach to surrogate key generation on at least two DW platforms introduces potential problems on larger data sets – exactly the type of data sets with which we are working. It is time to look at alternative approaches.

We don’t face this problem on Redshift or Snowflake, because they both support automatically generated identifiers. Redshift uses syntax like ‘integer identity(0,1) primary key’, while Snowflake uses ‘integer autoincrement’. The two platforms we’re adding in upcoming releases of Ajilius also support this feature.

If Microsoft did what customers have been asking for since the first releases of PDW, they’d give us identity or sequence columns in Azure SQL Data Warehouse. But since that isn’t happening right now, we’re looking at two options to replace the window method of creating surrogate keys. The first is to create row IDs early in the extract pipeline; the second is to use hash values, either at extract time or at the point of surrogate requirement.

Row IDs are attractive in a serial pipeline, but have limitations when we want to run multiple extracts or streams in parallel, as we face issues of overlapping IDs in a merged data set. The advantage of deriving surrogate keys from row IDs is that we would retain the benefits of an integer value.
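One well-known way to avoid the overlap problem is to stripe the integer space: stream s of N total streams emits s, s+N, s+2N, and so on, so no two streams can ever produce the same ID. This is a sketch of the idea only, not a statement of how Ajilius will implement it:

```python
def striped_ids(stream, total_streams, count):
    """Generate non-overlapping integer row IDs for one of N parallel
    extract streams by striding through the integer space: stream s
    produces s, s + N, s + 2N, ..."""
    return [stream + total_streams * i for i in range(count)]

a = striped_ids(0, 2, 3)  # first 3 IDs for stream 0 of 2
b = striped_ids(1, 2, 3)  # first 3 IDs for stream 1 of 2
```

The cost of striping is gaps in the key sequence when streams are unbalanced, which is harmless for surrogate keys since only uniqueness matters, not density.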

Hash values are attractive because they can be generated in a highly parallel fashion. Their weaknesses are their size and poor comparison performance, as well as the risk of hash collisions, which could assign the same surrogate value to different business keys.
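A hash-derived surrogate might look like the following sketch (the function, the choice of SHA-256, and the 63-bit truncation are illustrative assumptions, not a description of any Ajilius implementation). It is deterministic, so parallel loaders agree on a key without coordination, but the value is wide and distinct business keys can, rarely, collide:

```python
import hashlib

def hash_surrogate(business_key: str) -> int:
    """Derive a positive 63-bit integer surrogate from a business key.

    Deterministic: every loader computes the same key for the same
    input, with no central sequence. The truncation to 63 bits keeps
    the value inside a signed BIGINT, at the cost of a small,
    calculable collision risk."""
    digest = hashlib.sha256(business_key.encode("utf-8")).digest()
    # Take the first 8 bytes and drop the sign bit.
    return int.from_bytes(digest[:8], "big") >> 1

k1 = hash_surrogate("store-042")
k2 = hash_surrogate("store-042")  # same input, same surrogate
```

The collision risk is the birthday-bound trade-off referred to above: acceptable for some dimension sizes, unacceptable for others, which is why this remains an open question rather than a decided design.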

We’re just wrapping up the testing for V2.1; resolving this question will be high on the priority list for our next release. Let us know your preferences and suggestions.