Analytics tools: Too clever?

I’ve recently been trialling a mix of analytics tools, including Power BI, Qlik, MicroStrategy, Tableau and others. One “feature” that has been annoying me is their attempt to automatically form relationships between tables, usually by matching column names. Bad idea.

I’ve had to spend far too long correcting models that have automatically matched the Product, Customer, and Location dimensions, simply because each had ID as the name of its surrogate key.

When I have a table named Customer, a suitable name for a surrogate key is ID. Not CustomerID. It is the Identity of the Customer, not the CustomerIdentity of the Customer.

When I reference that table from another, I might use role names such as PurchasingCustomerID or ReferringCustomerID. Role, table, column. Only in the most trivial implementations might I refer to it as simply CustomerID.
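
A minimal sketch of that convention, with illustrative table and column names (SalesOrder and Name are invented for the example):

    CREATE TABLE Customer (
        ID    BIGINT NOT NULL,  -- the Identity of the Customer
        Name  VARCHAR(100)
    );

    CREATE TABLE SalesOrder (
        ID                    BIGINT NOT NULL,  -- the Identity of the SalesOrder
        PurchasingCustomerID  BIGINT NOT NULL,  -- role + table + column
        ReferringCustomerID   BIGINT            -- a second role over the same Customer table
    );

Both role columns reference Customer.ID, yet neither shares its name; a tool matching on column names misses them entirely, while happily joining every table whose surrogate key is called ID.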

Further, there is no requirement that columns be uniquely named in a schema, nor that they follow any specific pattern. Especially when dealing with data models from legacy business systems, the practice of automatic matching is counter-productive.

Automatic matching is particularly wrong in dimensional modelling, where the use of surrogate keys is standard practice. One of the properties of a surrogate key is that it is a meaningless identifier. From a business perspective, my view of the CustomerID might be “BIGSTORE001”, while the surrogate value “3498” carries no meaning at all.
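
To make that concrete, here is a hypothetical dimension row (table and column names invented for illustration):

    INSERT INTO DIM_CUSTOMER (ID, CUSTOMER_BK, CUSTOMER_NAME)
    VALUES (3498, 'BIGSTORE001', 'Big Store Pty Ltd');

The business key BIGSTORE001 means something to the business; the surrogate 3498 exists only to join facts to the dimension, and is the last thing you want a tool guessing relationships from.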

I suggest that if you can’t determine a relationship from the existence of a foreign key, then you should probably stop trying. Build a flexible relationship editor and skip the automatic matching, because I’m going to spend more time finding and fixing tool errors than I would spend building the model properly in the first place.

It makes me wonder how many complex analytical models have errors introduced by this practice.

Automatic relationship management may be a feature which demonstrates well, based on carefully selected sample data, but which fails in the real world. I’d like to see it disappear.

New competitor: Optimal BI

It is great to see new entrants bringing new approaches to the data warehouse automation market. The days when one or two players had the market to themselves are drawing to a close …

This time it is Optimal BI, from New Zealand, with a product named Optimal Data Engine (ODE). It is a Data Vault product, evolving from consulting assignments, but not much more detail is available at this time.

Follow their blog for new developments. http://optimalbi.com/blog/2015/07/03/ode-the-start-of-a-journey/

The old dinosaurs had better get started on evolution!

New feature: Load Mask

There are times when you need to load multiple instances of data. A common example is found in retail, where you might receive one end-of-day file of sales transactions from each store, all of which need to be loaded into one warehouse table.

We recently faced a more interesting case, where a SaaS company hosting identical application databases for many customers wanted to aggregate the same table from all of those databases.

We’ve now implemented a feature in Ajilius that makes iteration super-easy.

A combination of Database and Table Masks enables you to set wildcards over which a table load will iterate at run time. You do your metadata design using one instance of the load table as the source, then simply define the mask patterns to be used at run time.
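
As a purely hypothetical illustration of the concept (this is not Ajilius syntax), a Table Mask behaves like a SQL LIKE pattern over the source catalogue, with the same generated load running once per match:

    -- Find every source table matching the hypothetical mask 'SALES_%' ...
    SELECT table_name
    FROM   information_schema.tables
    WHERE  table_name LIKE 'SALES_%';

    -- ... then run the same generated load once per matching table, e.g.:
    -- INSERT INTO stage_sales SELECT ... FROM SALES_001;
    -- INSERT INTO stage_sales SELECT ... FROM SALES_002;

A Database Mask applies the same idea one level up, iterating the load over every matching database.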

Multiple Excel files? No problem. Multiple text files? Easy. Multiple tables? Simple. Same table from many databases? No sweat.

Ajilius. Helpful data warehouse automation.

Suspending Hadoop DW

We’re temporarily suspending work on Hadoop as a target platform for dimensional data warehouses.

Six to twelve months ago the future of the platform looked bright, with SQL-on-Hadoop vendors bringing out new versions at a rapid pace.

Lately, that pace has slowed to a crawl. We still don’t have widespread implementation of an UPDATE statement, and that makes it difficult to process slowly changing dimensions and accumulating snapshot fact tables.
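
To see why UPDATE matters, here is the standard type-2 expiry step, sketched with illustrative table and column names; without an UPDATE statement, the current row cannot be closed off in place:

    -- Expire the current version of the changed dimension member ...
    UPDATE dim_customer
       SET row_end_date = CURRENT_DATE,
           is_current   = 'N'
     WHERE customer_bk  = 'BIGSTORE001'
       AND is_current   = 'Y';

    -- ... then insert the new version as a fresh row.
    INSERT INTO dim_customer
        (customer_bk, customer_name, row_start_date, row_end_date, is_current)
    VALUES
        ('BIGSTORE001', 'Big Store Pty Ltd', CURRENT_DATE, DATE '9999-12-31', 'Y');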

We’ve been working around this lack by reprocessing the data outside Hadoop. This meant reading and rewriting entire tables, and as the size of our test warehouses grew, it became clear that this was not a better solution than using an RDBMS.

When more complete SQL-on-Hadoop implementations become available we will revisit this decision. Until then, Hadoop will continue to be a supported data source for Ajilius.

Price changes

Working with our first customers and partners has exposed some issues with our initial pricing strategy.

  1. Member Edition (our free version) customers required more support hours than customers of our paid editions.
  2. The difference between the Member and Subscriber Editions was ambiguous when considering licensed, supported versions of Open Source databases.
  3. Subscriber Edition had insufficient margin to make it attractive to resellers.
  4. Both Subscriber and Sponsor Editions were perceived as “too cheap” by their target customers.

Accordingly, we’ve revised the pricing for Ajilius to take these issues into account.

From May 1, we will remove the Member Edition from our price list. Existing customers will retain their right to support, and access to all new versions, free of charge.

A new Evaluation Edition will be introduced. This will be a full-featured version of Ajilius, but with a 30-day time limit.

The annual licence fee for Subscriber Edition will increase to USD 5,000, and for Sponsor Edition to USD 50,000.

Remember, Ajilius is site licensed. That means you have the right to use Ajilius on any number of servers, by any number of developers, creating any number of data warehouses, on any number of data warehouse platforms.

Even with our price increase, you’ll still save tens of thousands of dollars over competing platforms, and still get to 100% ROI in a matter of days.

Ajilius. Committed to business value.

Handling SCD0 and SCD6

Most ETL and data warehouse automation products define a slowly changing dimension at the table level. DIM_PRODUCT, for example, may be defined as a type-2 slowly changing dimension, with changes to the PRODUCT_NAME and PRODUCT_CATEGORY columns triggering new dimension rows.

When we were designing Ajilius, we realised that this traditional approach is very limiting, particularly when handling dimensions of types 0, 4 and 6. To refresh: a type-0 dimension has values that never change, the common example being an original-value column such as PRODUCT_ORIGINAL_PRICE; a type-4 dimension records historic values in a separate history table; and a type-6 dimension combines elements of type 2 and type 3, in that it may have some columns whose previous values are recorded, and some columns which trigger new dimension rows.

The “may have some columns” expression in the last sentence was our “Ah-Ha!” moment. Slowly changing behaviour should actually be recorded at the dimension attribute level, rather than the table level.

Ajilius enables you to set a change-type value for each non-key column in the dimension. By default we set it to SCD1, but you can change it to any other value through the dimension editor.

  • SCD0 (value never changes)
  • SCD1 (value changes in place without history being recorded)
  • SCD2 (value creates a new dimension row when it changes)
  • SCD3 (value has current- and previous-version recorded in the same dimension row)
  • SCD4 (value has historic versions recorded in a history outrigger)
  • SCD6 (a combination of 0 + 1 + 2 + 3 attributes in the same row)
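
As a sketch of what attribute-level change types mean for the generated DDL (all names here are illustrative, not actual Ajilius output):

    CREATE TABLE DIM_PRODUCT (
        ID                      BIGINT,        -- surrogate key
        PRODUCT_BK              VARCHAR(20),   -- business key
        PRODUCT_ORIGINAL_PRICE  DECIMAL(9,2),  -- SCD0: never changes
        PRODUCT_NAME            VARCHAR(100),  -- SCD1: overwritten in place
        PRODUCT_CATEGORY        VARCHAR(50),   -- SCD2: change creates a new row
        PRODUCT_CATEGORY_PRIOR  VARCHAR(50),   -- SCD3: previous value kept alongside
        ROW_START_DATE          DATE,          -- type-2 housekeeping columns
        ROW_END_DATE            DATE,
        ROW_IS_CURRENT          CHAR(1)
    );

Because type 0, 1, 2 and 3 attributes coexist in the one row, this table is a type-6 dimension in exactly the sense described above.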

To the best of our knowledge, Ajilius is the only data warehouse automation product that correctly supports the generation of DDL and DML to create and process all of these types of slowly changing dimension.

Measuring load speeds

We recently had a situation where our load speeds were reported as being much slower than a competitor’s. This surprised me, because I knew that our loader could saturate the network from the source server, and I wondered how our competitor could be faster.

Luckily, the evaluator liked Ajilius, and did a little digging on our behalf. It turned out that the culprit was not our performance, but the competitor’s measurement technique.

When we load data into the warehouse from a source database, there are basically four steps that we need to perform:

  • Query
  • Extract
  • Load
  • Commit

The Query step is where we execute a query on the remote data source, such as “select c1,c2,c3 from t1 where c1 > 9467843”. The Extract step is where we transfer the results of that query to the loader. The Load step moves those rows into the warehouse. Finally, the Commit step commits the load transaction(s). Depending on the source and warehouse, Ajilius may overlap one or more of these steps.

When we measure load performance we put a timer call before the Query, and again after the Commit. The elapsed time is the total time taken to extract and load the required data from the source system to the warehouse. This represents real-world performance, the type you need to measure if batch windows are important to you.

Our competitor had a different view of the world. Their measurement of performance was to take the time immediately before the Load step, and immediately after it. They claimed that this was a measurement of “load” performance. I guess they’re technically correct, but knowing just that one part of the job doesn’t help you to assess performance against real-world requirements.
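
To put some invented numbers on the difference: if Query and Extract together take 110 seconds, the Load step 8 seconds, and the Commit 2 seconds, the job’s real cost against a batch window is 120 seconds, yet timing the Load step alone reports a flattering 8 seconds.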

When the customer repeated the tests, this time measuring the elapsed time for the whole job, the results were virtually neck and neck. I’m not surprised because, as I said earlier, I knew we were capable of saturating the relatively slow network in the customer’s development lab.

Ajilius performance tests? Always welcome.

WhereScape faking Data Vault?

Update (2016-07-09): It was true then, but it isn’t true now. WhereScape is now working with Dan Linstedt, the creator of the Data Vault methodology, to deliver automated Data Vault 2.0 solutions.

https://www.wherescape.com/blog/blog-posts/2016/july/wherescape-partners-with-data-vault-inventor-dan-linstedt/

I’m a little confused by recent claims about Data Vault and WhereScape Red.

Every WhereScape demonstration I have seen has been Load, Stage, Dimension, Fact, Cube. The tutorial is Load, Stage, Dimension, Fact, Cube. The training is Load, Stage, Dimension, Fact, Cube. The UI boils down to Load, Stage, Dimension, Fact, Cube.

Does that look like Data Vault to you?

Nowhere do I see Hub, Link and Satellite. Just Load, Stage, Dimension, Fact, Cube.

If it looks like Dimensional, presents like Dimensional, trains like Dimensional, and is documented like Dimensional, then there is a pretty good chance it IS Dimensional, and any claim to the contrary is just faking it.

If you want Data Vault, you probably need to be looking at BIReady or Quipu.

We’re proud of the fact that Ajilius is a data warehouse automation company based on Dimensional Modelling. We believe in doing one thing well, not trying to be everything to everyone. If you want a Dimensional data warehouse, buy the product that is firmly committed to this technique.

Ajilius. Keeping it real in data warehouse automation.

DWA vs ETL

A common question is the difference (or similarity) between Data Warehouse Automation and traditional ETL tools.

I like to use an example from my iPad – the difference between the apps Mortgage Calc and Numbers.

Numbers is a spreadsheet. You can edit rows and columns of data, and create formulae using that data.

Mortgage Calc is an app that calculates mortgage payments.

Now, I could write a mortgage calculator in Numbers. I could possibly make it look like the Mortgage Calc app rather than a spreadsheet. But which calculations do I use? Which tax rules apply? Are there stamp duties payable? In other words, I have to do a lot of research, a lot of programming, and a lot of testing to make sure I’ve got the basics right. And I’ve also got to maintain that spreadsheet as the rules change.

With Mortgage Calc, I’ve paid a few dollars for an application that has saved me many hours of research and development, and which I’m trusting to give me accurate calculations. In this case, Mortgage Calc is better than Numbers, because it does one job, and does it well.

That is the difference between DWA and ETL. An ETL tool is a general purpose programming environment for moving and transforming data between systems. It provides components, in one form or another, which you put together to accomplish one or more tasks.

DWA, on the other hand, is built to do just one task: building the code associated with a data warehouse. Ajilius builds dimensional data warehouses. We build transactional, periodic snapshot and accumulating snapshot fact tables; type 0, 1, 2 and 3 slowly changing dimensions; and move data from multiple data sources into a consolidated presentation layer.

You could do all of that with an ETL tool, but it would be like writing a complex mortgage calculator in a spreadsheet – time consuming, not well understood, and prone to error.

Ajilius generates fast, error-free code that can be easily migrated between data warehouse platforms at the press of a button.

That’s the advantage of tools like Ajilius. We deliver business value, faster.

Our join editor sucks

I’ve spent today reviewing and discussing our alpha-test feedback.

The best feedback related to the browser UI. The worst related to the join editor.

The join editor is used to define the joins between staging tables. We need to know how the business keys from the table being joined relate to the data already in the table.

The way the current version works is that you choose the ‘Join Another Table’ option from the Stage Table menu, then select the table you want to join, and finally choose the join type (e.g. inner, left outer) and the join columns. You’d repeat this for as many tables as you wanted to join.
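
For context, the end result of those choices is an ordinary staging join on business keys; a hypothetical sketch, with invented table and column names:

    SELECT s.*,
           c.customer_name
    FROM   stage_sales s
    LEFT OUTER JOIN stage_customer c
        ON c.customer_bk = s.customer_bk;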

Users gave solid feedback that this did not represent their use cases. It forces the user to make early decisions about the sequence in which tables will be added, doesn’t handle deletion of tables from the join well, and fragments the user’s mental model of the join structure.

We worked through a few alternatives this afternoon, decided on an approach, and now Minh (our developer in Vietnam) is due to have it completed by the end of the week. We’ll go back to the alpha users for a re-check on this feature, but then everything should be clear for the beta.