New feature: load streams

Before today, loads to an Ajilius data warehouse were single streamed. That is, one process extracted and loaded one table. You could load hundreds of thousands of rows per second, you could load different tables in parallel, but if you had one very large table there was nothing you could do to make it faster.

We’ve just added a feature called “Load Streams”, that enables you to parallel load large tables. This is specifically designed for MPP target databases, but may also speed up SMP operations.

The Load Table screen now has an option to select a number of streams that will be loaded in parallel. This is a number between 1, and a maximum set for the target platform. On MPP platforms, this number is the number of nodes, while SMP platforms typically benefit from a number between 4 and 8.

On extract, data will be split into the number of streams you defined. Depending on the platform, the extract may be split in a round-robin fashion or by the column/s you have defined as a distribution key on selected MPP platforms.

Because there is overhead in splitting the data, and in handling concurrent writes in the DBMS, the improvement is not linear. For SMP platforms, large tables on SSD storage can be orders of magnitude faster than single threaded loads, while MPP platforms can approach the fraction represented by the number of streams.

Ajilius. We work harder to make your loads faster.

Analytics tools: Too clever?

I’ve recently been trialling a mix of analytics tools, including PowerBI, Qlik, Microstrategy, Tableau and others. One “feature” that has been annoying me is their attempt to automatically form relationships between tables, usually by matching column names. Bad idea.

I’ve had to spend far too long correcting models that have automatically matched the Product, Customer, Location, dimensions, simply because they had ID as the name of their surrogate key.

When I have a table named Customer, a suitable name for a surrogate key is ID. Not CustomerID. It is the Identity of the Customer, not the CustomerIdentity of the Customer.

When I reference that table from another, I might use role names such as PurchasingCustomerID or ReferringCustomerID. Role, table, column. Only in the most trivial implementations might I refer to it as simply CustomerID.

Further, there is no requirement that columns be uniquely named in a schema, nor that they follow any specific pattern. Especially when dealing with data models from legacy business systems, the practice of automatic matching is counter-productive.

Automatic matching is particularly wrong in dimensional modelling, where the use of surrogate keys is standard practice. One of the properties of a surrogate key is that it is a meaningless identifier. From a business perspective, my view of the CustomerID might be “BIGSTORE001”, and the surrogate value “3498” has no value at all.

I suggest that if you can’t determine a relationship from the existence of a foreign key, then you should probably stop trying. Build a flexible relationship editor and skip the automatic matching, as I’m going to have to spend more time finding and fixing tool errors than building it properly in the first place.

It causes me to wonder how many complex analytical models have errors introduced by this practice?

Automatic relationship management may be a feature which demonstrates well, based on carefully selected sample data, but which fails in the real world. I’d like to see it disappear.

New competitor: Optimal BI

It is great to see new entrants, bringing new approaches, to the data warehouse automation market. The days when one or two players had the market to themselves are drawing to a close …

This time it is Optimal BI, from New Zealand, with a product named Optimal Data Engine (ODE). It is a data vault product, evolving from consulting assignments, but not much more detail available at this time.

Follow their blog for new developments. http://optimalbi.com/blog/2015/07/03/ode-the-start-of-a-journey/

The old dinosaurs better get started on evolution!