Integrated Upgrades

One of our principles in making enhancements to Ajilius is drawn from a quote attributed to Albert Einstein:

Everything should be made as simple as possible, but no simpler.

Our new Java version made the installation process much simpler, and the patch process as well. Now, the full release of 2.2 simplifies upgrades to our metadata databases.

When we need to update the metadata databases, we’ll send you a file to be placed in the Ajilius program folder. Simply restart Ajilius: the update will be applied, and your application will continue as normal.

Ajilius. As simple as possible.

Note:

Unlike some competitor products, Ajilius does not “phone home”. We do not automatically download any files, and send no information back from your installation. Your metadata and credentials are secure within your site at all times.

Release 2.2.b4

All users should have received an email regarding the availability of 2.2.b4.

This release aligns the dimension surrogate key default value across platforms.

In earlier versions, some databases implemented the default row as ID=0, some as ID=1. The migration process took care of differences, but it was still confusing for some users.

Now, all default rows carry ID=1.
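
For example, a default dimension row might now look like this on every platform (the table and columns are our own illustration, not Ajilius-generated DDL):

    -- Hypothetical default dimension row, keyed as ID=1 on all platforms.
    INSERT INTO dim_customer (customer_key, customer_bk, customer_name)
    VALUES (1, 'UNKNOWN', 'Default row for unmatched keys');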

We’ve also rolled in a couple of patches around SQL Server data sources. In particular, you no longer need to specify an extract folder if you’re using the direct load (default) option.

We expect that this will be the second last beta, and we’ll move to a full release next week.

Let us know if you didn’t receive the upgrade email, and we’ll respond with the download location.

Ajilius & Java

We’ve been a little quiet, lately.

The reason is that, for the last few months, we’ve had a team of people working across three countries to complete something exciting: a new version of Ajilius that shifts us from a Python platform to Java. For the last month in particular, we’ve been on an incredible sprint to deliver the Java version by the end of May.

This wasn’t an easy decision to make, as Python was critical to our first year of operation. We couldn’t have built Ajilius as quickly as we did on any other platform, and we grew to like Python as a development language.

Sadly, though, we found two problems with Python – distribution and database drivers.

Distribution management for Python is handled by a utility called pip. Our installation scripts used pip to acquire (and update) our library dependencies. Unfortunately, pip struggled behind firewalls, and a number of corporate and government customers were unwilling to make the changes necessary for it to work.

That meant we had to bundle dependencies ourselves, which became a nightmare of dependencies-on-dependencies, and a maintenance and distribution headache for our team. If you’ve heard of DLL-hell in the Windows world, Python’s site-packages hell is at least three layers closer to Lucifer himself.

The second problem was database drivers. While Snowflake released a great quality Python driver for their DBMS, most other databases are supported by the Python community, which means that features can lag well behind current vendor releases. A great example is SQL Server, where Python’s interfaces vary widely in their support. One doesn’t support Unicode. One doesn’t support Integrated Authentication. None support AlwaysOn intents. None support the Bulk Copy API. It wasn’t just a SQL Server problem: as we supported more and more databases, finding current drivers became a support nightmare.

Both of these problems meant we needed an alternative. After much searching and testing, we settled on the combination of Java and JDBC, and after several man-years of effort we finally have a Java version ready for beta testing.

Java solves the packaging and distribution problem because we can distribute Ajilius with the library JAR files already set up. Once you have installed Java, our installation process is as simple as unzipping a folder into a known location. Maintaining current versions of library dependencies is as simple as copying a file. Further, we no longer have to go through a platform-dependent compilation and packaging step as we did with our Python version; the code and packaging are identical across our supported platforms of Windows, Linux and OS X.

Java and JDBC have also enabled us to solve our database driver problem. Every mainstream DBMS has a vendor-supported JDBC driver available, so we no longer have to test multiple community drivers and compromise between feature availability and driver quality. Even better, we now have drivers available for popular application platforms and web services including CRM, Marketing, Accounting, Collaboration, Social Networking and more. We’ll be talking a lot more about this in coming posts.

So what is the benefit for you, the customer and partner?

Firstly, you get a much faster installation experience, one that is more familiar to your operations team. Install Java, copy the Ajilius folder, and you’re done. It can even be fully automated for internal software packaging.

Secondly, updates are now much easier to deploy. You’ll get more features, faster.

Finally, the incredible breadth of JDBC drivers means that we’ll have connectors for more data sources than you can possibly imagine, both cloud and on-premise.

Our demonstration servers have now been switched to Java, and we’ll be ready for on-premise evaluations by the end of the week. Migration packages for existing customers will follow by the end of the month.

Now, we move on with the development of the huge list of features that will be coming to Ajilius in the next twelve months. Hang on, it is going to be one heck of a ride!


Running Totals

A customer asked if it was possible to compute a table of cumulative transaction values by day. There were business reasons why this could not be left to a BI tool, so we worked out the following method.

  1. Create a persistent stage table T1 to hold the values to be accumulated. Completely anonymised, these values were TX_DATE, DIM1, DIM2, QUANTITY.
  2. From that persistent table, create a GROUP BY stage table T2 with the columns DIM1, DIM2 and SUM(QUANTITY). Discard the TX_DATE column.
  3. Add a derived column, a date named AT_DATE, with the transform CURRENT DATE.

The last step – a derived column – was the key to the solution.
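
In rough SQL terms, the pattern looks like this. Column names follow the anonymised example above; the assumption that T2 persists each day’s snapshot is ours:

    -- Each run appends a dated snapshot of the running totals to T2.
    INSERT INTO t2 (dim1, dim2, quantity, at_date)
    SELECT dim1,
           dim2,
           SUM(quantity) AS quantity,  -- cumulative total of all rows persisted in T1 so far
           CURRENT_DATE  AS at_date    -- the derived column that dates the snapshot
    FROM t1
    GROUP BY dim1, dim2;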

We made one change to validation rules to enable this to work, and we are now able to meet the customer’s requirement.

Version 2.1c – Snowflake TIME Data Type

On Wednesday night we got notification of the latest changes in Snowflake Elastic Data Warehouse. Of particular interest was the implementation of a TIME data type. Here we are on Friday night, and we’re releasing support for this latest feature.

We were one of the customers to request this data type early in the Snowflake development process, so we’re delighted to see it implemented.

Having a TIME data type makes it easy to implement dimensions that incorporate concepts like business/after hours, and morning/lunch/afternoon periods. Want to know the blocks of time in which most of your transactions occur across the week? You need a time dimension.

Without a TIME data type, Ajilius represents time values using a character string in the format HH:MM:SS. This works for building the star schema, but it doesn’t support analytic queries involving time arithmetic. Now we can store and calculate time values in Snowflake data warehouses.
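
As a hedged sketch (our own table and column names, not from the release), a native TIME column supports period classifications directly:

    -- Hypothetical business/after hours classification using the TIME type.
    SELECT tx_id,
           CASE
             WHEN tx_time BETWEEN '09:00:00' AND '17:00:00' THEN 'Business hours'
             ELSE 'After hours'
           END AS trading_period
    FROM fact_sales;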

The installer package for V2.1c also updates the Snowflake connector to the latest version.

Ajilius. Keeping up with Snowflake.


Version 2.1 – Data Quality Automation

The hero feature of Ajilius 2.1 is Data Quality Automation.

This is yet another unique feature brought to data warehouse automation by Ajilius.

In the 2.1 release, Ajilius adds three types of data quality screens to the extract process, sketched in SQL after this list:

  • Data type validation, where values are tested for conformance to the column data type.
  • Range validation, where values are tested for set and range boundaries.
  • Regex validation, where values are tested against regular expressions.
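
Here is that sketch, in Snowflake-flavoured SQL. The table, columns and rules are hypothetical illustrations, not Ajilius internals:

    -- Hypothetical rejection logic for the three screen types.
    SELECT *
    FROM (
        SELECT order_id,
               CASE
                 WHEN TRY_CAST(order_qty AS INTEGER) IS NULL
                   THEN 'Data type: order_qty is not an integer'
                 WHEN order_date NOT BETWEEN '2000-01-01' AND CURRENT_DATE
                   THEN 'Range: order_date is out of bounds'
                 WHEN email NOT RLIKE '^[^@]+@[^@]+\\.[^@]+$'
                   THEN 'Regex: email is malformed'
               END AS reject_reason
        FROM stage_orders
    ) AS screened
    WHERE reject_reason IS NOT NULL;  -- rejected rows go to the error file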

In Version 2.3 (due September 2016) we will be adding Lookup validation to data quality rules, to check the existence of values in data warehouse tables.

Rows breaking validation are logged to an error file, along with the reason(s) for row rejection.

A new return code from the extract job signals that validation errors have occurred, enabling the scheduler to choose to continue or suspend the batch pending user remediation of the errors.

And once again, we’re adding this as a standard feature of the Ajilius platform. If you’re licensed for Ajilius, upgrade to the latest version and you can immediately identify and screen data quality problems before they hit your data warehouse.

Ajilius. Committed to innovation in data warehouse automation.

Version 2.1 – Pivotal Greenplum

Ajilius is pleased to announce full support for Pivotal Greenplum in Version 2.1, available now.

Greenplum is an open source MPP data warehouse, available on-premise and in the cloud. Based on an earlier version of PostgreSQL, Greenplum will shortly be upgraded to the latest PostgreSQL code base for even faster loads and transformations.

The advantage of Greenplum is that, being open source, it gives anyone the opportunity to use a real MPP data warehouse platform. All you need is hardware or cloud capacity, with realisable savings of hundreds of thousands of dollars over commercial offerings.

You now have a great, free scalability path when your workload or data grows to exceed PostgreSQL capabilities. With the unique “3 click migration” feature of Ajilius, you can move your entire data warehouse at any time with just a few clicks of the mouse.

Ajilius. Keeping the value in data warehouse automation.

Surrogate issues with Azure SQL DW

Update 2017-02-14: Ajilius has a new CTAS engine in Release 2.4.0 that fully supports optimised surrogate keys across both PDW and Azure SQL Data Warehouse. We’d still like to see an IDENTITY column, or equivalent, on these platforms, but we’re processing hundreds of millions of rows using our current techniques and we’re satisfied with our solution.

Surrogate keys are fundamental to the success of a dimensional data warehouse. These are the keys that uniquely identify a dimension row. They are typically integer values, because they compress and compare at high performance.

We’ve been using window functions to handle surrogate key generation in Azure SQL Data Warehouse. This was the recommended approach on PDW, then APS, and has now been well documented in a recent paper from the SQL Server Customer Advisory Team.
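
For illustration, a minimal sketch of that pattern, with hypothetical table and column names: new dimension rows take surrogate keys that continue from the current maximum.

    -- Window-function surrogate key assignment, as commonly recommended.
    INSERT INTO dim_customer (customer_key, customer_bk, customer_name)
    SELECT m.max_key + ROW_NUMBER() OVER (ORDER BY s.customer_bk) AS customer_key,
           s.customer_bk,
           s.customer_name
    FROM stage_customer AS s
    CROSS JOIN (SELECT COALESCE(MAX(customer_key), 0) AS max_key
                FROM dim_customer) AS m;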

On reading this paper, I was a little concerned to read the following comment:

NOTE: In SQL DW or (APS), the row_number function generally invokes broadcast data movement operations for dimension tables. This data movement cost is very high in SQL DW. For smaller increment of data, assigning surrogate key this way may work fine but for historical and large data loads this process may take a very long time. In some cases, it may not work due to tempdb size limitations. Our advice is to run the row_number function in smaller chunks of data.

I wasn’t so worried about the performance issue in daily DW processing, but the tempdb issue had not occurred to me before. Is it serious? Maybe, maybe not. But having been identified as an issue, we need to do something about it.

We’re working with another vendor at the moment – not named due to NDA constraints – where we also face a restriction that the working set for window functions needs to fit in one node. That, too, is a potential problem when loading large and frequently changing dimensions.

In other words, the commonly recommended approach for surrogate key generation on at least two DW platforms introduces potential problems in larger data sets. Which are exactly the type of data sets with which we are working. It is time to look at alternative approaches.

We don’t face this problem on Redshift or Snowflake, because they both support automatically generated identifiers. Redshift uses syntax like ‘integer identity(0,1) primary key’, while Snowflake uses ‘integer autoincrement’. The two platforms we’re adding in the immediately upcoming releases of Ajilius also support this feature.
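
In DDL terms, the syntax quoted above looks roughly like this (the table is illustrative):

    -- Redshift: the identity column is populated automatically on load.
    CREATE TABLE dim_product (
        product_key INTEGER IDENTITY(0,1) PRIMARY KEY,
        product_bk  VARCHAR(30)
    );

    -- Snowflake: the equivalent DDL uses AUTOINCREMENT.
    CREATE TABLE dim_product (
        product_key INTEGER AUTOINCREMENT,
        product_bk  VARCHAR(30)
    );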

If Microsoft did what customers have been asking for since the first releases of PDW, they’d give us identity or sequence columns in Azure SQL Data Warehouse. But since that isn’t happening right now, we’re looking at two options to replace the window method of creating surrogate keys. The first option is to create row IDs early in the extract pipeline; the second is to use hash values, either at extract or at the point of surrogate requirement.

Row IDs are attractive in a serial pipeline, but they have limitations when we want to run multiple extracts or streams in parallel, as we face overlapping IDs in a merged data set. The attraction of deriving surrogate keys from row IDs is that we would retain the benefits of an integer value.

Hash values are attractive because they can be generated in parallel. Their weaknesses are their size and poor comparison performance, as well as the risk of hash collisions, which could assign the same surrogate value to different business keys.

We’re just wrapping up the testing for V2.1; resolving this question will be high on the priority list for our next release. Let us know your preferences and suggestions.

Data quality performance

We’re introducing data quality screens in V2.1, to be released at the pgDayAsia conference in March. In this release, data quality screens implement data type, range and expression testing on selected tables and columns.

The last week has been spent running performance tests, to identify the overhead added by these screens.

On average, we currently process 500,000 values per second.

For example, if your daily fact extract has 10 million rows, and each row comprises 20 columns, you have a total of 200 million values to screen. At 500,000 values per second, a full data quality screen of every column would add a little under seven minutes to your batch.
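
The arithmetic, as a quick check (written as a SELECT for platforms that allow queries without a FROM clause):

    -- 10 million rows x 20 columns = 200 million values;
    -- 200 million / 500,000 per second = 400 seconds, a little under 7 minutes.
    SELECT 10000000 * 20          AS values_to_screen,
           10000000 * 20 / 500000 AS screen_seconds;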

To put that number in scale, 10 million is roughly the number of bets placed at the TAB on Melbourne Cup day. Or around the number of sales transactions done by a major department store chain in one week. Or the total number of motor vehicles sold in one year, world-wide, by Toyota Motor Company. In other words, 10 million rows is a LOT of data, and we’re going to validate it completely, screening every row and column in your extract so that no bad data is loaded, in less than 10 minutes.

Wow!

And even though I wrote “Wow!” about that validation, we are working to make it even faster by release day.

Ajilius. Putting the quality in data warehouse automation.

Ajilius loves Snowflake

Along with Amazon Redshift and Azure SQL Data Warehouse, Ajilius does cloud data warehouse automation for Snowflake Elastic Data Warehouse.

We don’t just support Snowflake; in a short space of time it has become a favourite cloud data warehouse platform. We had a great time working with the Snowflake team during the development of their Python adapter.

Here are some of the things that we love about Snowflake.

Price

Our Snowflake development and demonstration platform costs around AUD 300 per month, on a lousy exchange rate. We pay a monthly storage cost, then pay for just the processing we need, when we need it.

For the smallest development machines Redshift may be slightly cheaper, but once you scale to multi-terabyte production workloads the advantage shifts to Snowflake unless you are prepared to commit to three-year reserved instances.

Given the rate of change in the cloud data warehouse market, we believe that long-term commitments are not in the interests of most customers, and Snowflake has a price/performance advantage.

Microsoft Azure SQL Data Warehouse is still in preview, and we can’t comment on comparative pricing at this time.

Scale

Scaling our Snowflake platform takes just seconds. In comparison, we’ve seen cluster resizing on Amazon take many minutes, and we’ve seen it take longer in customer sites.

Snowflake instances can scale from 1 to 128 8-core nodes. That is a huge amount of compute power, making Snowflake suitable for workloads of any size. At the lower end, we see Snowflake as an ideal platform for mid-market customers, as its entry point and pricing model are so flexible.

We do a lot of Ajilius work without incurring any processing costs. This is because DDL operations are performed on the database, not the warehouse (see Features), and we don’t need to start a warehouse until we start actually loading, selecting or modifying data. The majority of our development and test work is done on a single node, with occasional scaling for performance tests.

Support

I’ve never had better support from a data warehouse company, especially given that we were not known to the vendor and were not spending huge amounts of money. From sales, to pre-sales, to support, and even right into engineering, we’ve had amazing engagement from every level of the company.

Snowflake people respond to emails, pick up the phone, and respond to support requests with speed. We’ve never waited more than a couple of hours for a response to an issue, and that response has always been highly relevant, never of the “have you unplugged and plugged in the keyboard” variety.

The Snowflake team is knowledgeable, enthusiastic, and committed to success.

Features

One intriguing feature of Snowflake is its avoidance of distribution keys, partitions, etc., in the database. This avoids one of the big design challenges present in both Redshift and Azure, where the wrong distribution method can really damage your performance. One day I’ll have a beer with Snowflake’s designers and figure out how this works, but for now, all I know is that it works well.

Better described as “quirky” is Snowflake’s terminology of database and warehouse. A “database” is a collection of schemas and data. A “warehouse” is the compute configuration that works on databases; to me, they’d have been better off using a name like “server”. A powerful feature is the ability for multiple “warehouses” to act on a “database”, with different configuration settings. For example, an ETL warehouse might run briefly at very high scale to compress the ELT window, while a Browse warehouse might run for a long time at low scale, refreshing data used by BI tools like Tableau and Qlik Sense.
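
For example (the warehouse names, sizes and database are our own illustration, using standard Snowflake DDL):

    -- Two differently sized warehouses can serve the same database.
    CREATE WAREHOUSE etl_wh    WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 60;
    CREATE WAREHOUSE browse_wh WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 600;

    USE WAREHOUSE etl_wh;   -- heavy ELT at high scale
    USE DATABASE sales_dw;  -- the same database the Browse warehouse also reads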

Another feature we love is the Snowflake administrative console, where we can not only administer databases, warehouses and users, but also review performance history and execute ad-hoc queries. The user interface for the console is a work of art; it is the first cloud-based data warehouse where I’ve not felt the need to find another administration tool.

What’s Missing?

Not much.

All the basic data types are there, all the basic SQL statements are there, you get JDBC, ODBC and Python interfaces, and the documentation is excellent. There could be a few more examples in the documentation for some of the more obscure features of the product, but it is updated frequently.

Regarding data types, I’ve always been puzzled why data warehouse vendors avoid geospatial data. After all, map-based data is a major feature of the current generation of visualisation tools, yet it is lacking from most cloud data warehouse platforms. I’d like to think Snowflake will get around to this feature soon.

If I were being picky, I’d also call out the absence of a TIME data type. We work around it by using date/time functions to extract time portions of timestamp fields into text fields, but a native TIME type would be helpful.

The only real pain we experience is that Snowflake is currently restricted to Amazon US data centres. That has no impact on warehouse performance, but our connection times and data transfer rates are a little slower than I’d like. We can co-locate Ajilius instances in their AWS data centre for fast Snowflake connections, but if your data is on-premise in Australia, you’re going to incur a penalty when moving terabytes into Snowflake. I’m assured that data centres in other parts of the world are on their way.

Recommendation

Try it. You’ll like it.

Ajilius makes it easy to build Snowflake data warehouses from your on-premise and cloud data. Let us know if you’d like a deeper discussion of Snowflake and Ajilius.