Measuring load speeds

We recently had a situation where our load speeds were reported as being much slower than a competitor’s. This surprised me, because I knew that our loader could saturate the network from the source server, and I wondered how our competitor could be faster.

Luckily, the evaluator liked Ajilius, and did a little digging on our behalf. It turned out that the culprit was not our performance, but the competitor’s measurement technique.

When we load data into the warehouse from a source database, there are basically four steps that we need to perform:

  • Query
  • Extract
  • Load
  • Commit

The Query step is where we execute a query on the remote data source, such as “select c1,c2,c3 from t1 where c1 > 9467843”. The Extract step is where we transfer the results of that query to the loader. The Load step moves those rows into the warehouse. Finally, the Commit step commits the load transaction(s). Depending on the source and warehouse, Ajilius may overlap one or more of those steps.

When we measure load performance we put a timer call before the Query, and again after the Commit. The elapsed time is the total time taken to extract and load the required data from the source system to the warehouse. This represents real-world performance, the type you need to measure if batch windows are important to you.
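
To make that concrete, here is a minimal sketch of where the timer calls sit, assuming a Python loader and a DB-API driver such as psycopg2 (the three-column INSERT simply matches the example query above; connection strings and table names are placeholders):

    import time
    import psycopg2  # assumed driver; any DB-API connection works the same way

    def timed_load(source_dsn, warehouse_dsn, query, target_table):
        """Time a load end to end: Query + Extract + Load + Commit."""
        src = psycopg2.connect(source_dsn)
        dwh = psycopg2.connect(warehouse_dsn)

        started = time.monotonic()            # timer BEFORE the Query step

        src_cur = src.cursor()
        src_cur.execute(query)                # Query: runs on the remote source
        rows = src_cur.fetchall()             # Extract: pull results to the loader

        dwh_cur = dwh.cursor()
        dwh_cur.executemany(                  # Load: move the rows into the warehouse
            "INSERT INTO " + target_table + " VALUES (%s, %s, %s)", rows)
        dwh.commit()                          # Commit: finish the load transaction(s)

        return time.monotonic() - started     # timer AFTER the Commit step

Everything between those two timer calls counts towards the elapsed time.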

Our competitor had a different view of the world. Their measurement of performance was to take the time immediately before the Load step, and immediately after it. They claimed that this was a measurement of “load” performance. I guess they’re technically correct, but knowing just that one part of the job doesn’t help you to assess performance against real-world requirements.

When the customer repeated the tests, this time measuring the elapsed time for the whole job, the results were virtually neck and neck. I’m not surprised because, as I said earlier, I knew we were capable of saturating the relatively slow network in the customer’s development lab.

Ajilius performance tests? Always welcome.

WhereScape faking Data Vault?

Update (2016-07-09): It was true then, but it isn’t true now. WhereScape is now working with Dan Linstedt, the creator of the Data Vault methodology, to deliver automated Data Vault 2.0 solutions.

https://www.wherescape.com/blog/blog-posts/2016/july/wherescape-partners-with-data-vault-inventor-dan-linstedt/

I’m a little confused by recent claims about Data Vault and WhereScape Red.

Every WhereScape demonstration I have seen has been Load, Stage, Dimension, Fact, Cube. The tutorial is Load, Stage, Dimension, Fact, Cube. The training is Load, Stage, Dimension, Fact, Cube. The UI boils down to Load, Stage, Dimension, Fact, Cube.

Does that look like Data Vault to you?

Nowhere do I see Hub, Link and Satellite. Just Load, Stage, Dimension, Fact, Cube.

If it looks like Dimensional, presents like Dimensional, trains like Dimensional, and is documented like Dimensional, then there is a pretty good chance it IS Dimensional, and any claim to the contrary is just faking it.

If you want Data Vault, you probably need to be looking at BIReady or Quipu.

We’re proud of the fact that Ajilius is a data warehouse automation company based on Dimensional Modelling. We believe in doing one thing well, not trying to be everything to everyone. If you want a Dimensional data warehouse, buy the product that is firmly committed to this technique.

Ajilius. Keeping it real in data warehouse automation.

DWA vs ETL

A common question is the difference (or similarity) between Data Warehouse Automation and traditional ETL tools.

I like to use an example from my iPad – the difference between the apps Mortgage Calc and Pages.

Pages is Apple’s document app, but its tables work like a spreadsheet: you can edit rows and columns of data, and create formulae using that data.

Mortgage Calc is an app that calculates mortgage payments.

Now, I could write a mortgage calculator in Pages. I could possibly make it look like the Mortgage Calc app rather than a spreadsheet. But which calculations do I use? Which tax rules apply? Are there stamp duties payable? In other words, I have to do a lot of research, a lot of programming, and a lot of testing to make sure I’ve got the basics right. And I’ve also got to maintain that spreadsheet as the rules change.

With Mortgage Calc, I’ve paid a few dollars for an application that has saved me many hours of research and development, and which I’m trusting to give me accurate calculations. In this case, Mortgage Calc is better than Pages, because it does one job, and does it well.

That is the difference between DWA and ETL. An ETL tool is a general purpose programming environment for moving and transforming data between systems. It provides components, in one form or another, which you put together to accomplish one or more tasks.

DWA, on the other hand, is built to do just one task, which is building the code associated with a data warehouse. Ajilius builds dimensional data warehouses. We build transactional, periodic snapshot and accumulating snapshot fact tables; Type 0, 1, 2 and 3 slowly changing dimensions; and move data from multiple data sources into a consolidated presentation layer.
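
As a toy illustration of what “building the code” means, here is a sketch that renders a generic Type 2 expire-and-insert statement pair from a table name and a column list. To be clear, this is not Ajilius output, just the shape of the idea, using PostgreSQL-style UPDATE ... FROM syntax and hypothetical staging table names:

    # Render a generic Type 2 SCD update from metadata. The UPDATE expires the
    # current row for any changed business key; the INSERT then adds a fresh
    # current row for both new and just-expired keys.
    def scd2_sql(dim, key, attrs):
        cols = ", ".join([key] + attrs)
        changed = " OR ".join(f"d.{a} <> s.{a}" for a in attrs)
        return f"""
    UPDATE {dim} d
       SET current_flag = 'N', end_date = CURRENT_DATE
      FROM stage s
     WHERE d.{key} = s.{key} AND d.current_flag = 'Y' AND ({changed});

    INSERT INTO {dim} ({cols}, start_date, end_date, current_flag)
    SELECT {cols}, CURRENT_DATE, NULL, 'Y'
      FROM stage s
     WHERE NOT EXISTS (SELECT 1 FROM {dim} d
                       WHERE d.{key} = s.{key} AND d.current_flag = 'Y');
    """

    print(scd2_sql("dim_customer", "customer_id", ["name", "city"]))

A DWA tool generates, tests and maintains that kind of pattern for you; with an ETL tool, you assemble it yourself, component by component.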

You could do all of that with an ETL tool, but it would be like writing a complex mortgage calculator in a spreadsheet – time consuming, not well understood, and prone to error.

Ajilius generates fast, error-free code that can be migrated between data warehouse platforms at the press of a button.

That’s the advantage of tools like Ajilius. We deliver business value, faster.

Our join editor sucks

I’ve spent today reviewing and discussing our alpha-test feedback.

The best feedback related to the browser UI. The worst feedback to the join editor.

The join editor is used to define the joins between staging tables. We need to know how the business keys from the table being joined relate to the data already in the table.

The way the current version works is that you choose the ‘Join Another Table’ option from the Stage Table menu, then select the table you want to join, and finally choose the join type (e.g. inner or left outer) and the join columns. You’d repeat this for as many tables as you wanted to join.

Users gave solid feedback that this did not represent their use cases. It forces the user to make early decisions about the sequence in which tables will be added, doesn’t handle deletion of tables from the join well, and fragments the user’s mental model of the join structure.

We worked through a few alternatives this afternoon, decided on an approach, and now Minh (our Vietnam developer) is due to have it completed by the end of the week. We’ll go back to the alpha users for a re-check on this feature, but then everything should be clear for the beta.

Be a DW hero with PostgreSQL

PostgreSQL and Ajilius can make you a DW Hero.

Lots of organisations don’t know about PostgreSQL, or are afraid that it might not perform. Here are eight simple steps to prove the value of PostgreSQL for data warehousing.

(1) Develop on SQL Server

Your IT management has probably asked you to deliver your data warehouse on SQL Server. There is nothing technically wrong with that decision, but it is going to cost a lot of money if you take it into production. Never mind; humour them, and go ahead and implement your development environment. We emphasise development, because using MSDN or SQL Server Developer Edition licences will have a negligible cost.

(2) Develop using Ajilius

Design and build your data warehouse using Ajilius. Use the power of data warehouse automation to generate a fully scripted, high performance data warehouse. Get all your user and technical documentation, and start testing user queries and reports. You’ll save hundreds of hours of development time using Ajilius, and eliminate the risks of bad ETL.

(3) Set up a PostgreSQL server

Download PostgreSQL and set up a separate development server, running on your choice of operating system. Create an empty database, and record the server name, database name, user-id and password.
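
Before moving on, it is worth confirming those details actually work. A minimal check, assuming the psycopg2 driver (every connection value below is a placeholder):

    import psycopg2  # assumed driver; any PostgreSQL client will do

    # Fill in the details recorded above; these values are placeholders.
    conn = psycopg2.connect(host="pgdev01", dbname="warehouse_dev",
                            user="ajilius", password="secret")
    cur = conn.cursor()
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])  # confirms the server is reachable and responding
    conn.close()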

(4) Clone your Metadata Repository

Go to the Warehouse List screen, and select the Clone Warehouse option. An independent copy of the metadata repository will be created under your chosen name.

(5) Change your Warehouse target

Go to your Warehouse List screen, and select the Change Warehouse option. Now select PostgreSQL as your target type, and enter the server details that you recorded in step (3).

(6) Generate Scripts

Go to your Warehouse List screen, and select the Generate Scripts option. Check each of the Create, Update, Schedule and Migrate script options, and enter the directory where scripts are to be written.

(7) Deploy

You now have a 100% compatible version of your original data warehouse, running on PostgreSQL. Because you used Ajilius to build your data warehouse, you are guaranteed portability across data warehouse platforms. All your extracts and loads have been fully replicated, your data has been migrated, and you can repeat your test queries and reports.

(8) Profit!

Here is where you become a DW Hero.

Demonstrate to your project stakeholders their data warehouse running on PostgreSQL. Discuss the cost savings that will come from running on this platform. Show that you’ve already done the migration, in just minutes of work and a few hours of processing time. Imagine proving that you can save hundreds of thousands of dollars in production licensing, for no cost to your organisation.

That’s HEROIC!

We’re the only data warehouse automation company that fully supports PostgreSQL as a first-class data warehouse platform, and we’ve done so from the very first lines of code we wrote.

Read more about Ajilius, and the power of the PostgreSQL Data Warehouse.

Virtual loads and dimensions

Some data warehouse automation products do a good job on green-fields projects, but make it very difficult to integrate into an existing data warehouse architecture. Some don’t do it at all, while others require that you reverse engineer and rebuild the ETL / ELT processing before you can integrate new tables. We’re different.

Ajilius makes it easy to integrate with an existing data warehouse through Virtual Tables.

The purpose of a Virtual Table is to provide a mechanism to create Ajilius metadata over an existing table, without having to re-load or re-process its data. There are two types of Virtual Tables used in Ajilius – Virtual Loads and Virtual Dimensions.

Virtual Loads exist to define data that has already been loaded into the warehouse by other processes. Use a Virtual Load where the table is to be processed in conjunction with data completely controlled by Ajilius; it is also useful in drip-feeding scenarios.

Virtual Dimensions enable you to integrate complete dimensions that have been created and maintained by external processes. If your existing warehouse has already built the perfect SCD2 dimension, there is no need to re-design and re-write that table just to fit it into our metadata structures.

Once defined, Virtual Tables work exactly the same as any other Ajilius load or dimension table.
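
To make the idea concrete, here is a purely hypothetical sketch of the sort of information that registering a Virtual Dimension might capture; the real Ajilius metadata schema is not shown in this post:

    # Hypothetical sketch only: illustrates the concept, not the actual schema.
    # The point is that the existing table is registered, not rebuilt.
    virtual_dimension = {
        "table_name": "dim_customer",       # existing SCD2 table, maintained externally
        "table_type": "VIRTUAL_DIMENSION",  # no load or transform code is generated
        "surrogate_key": "customer_key",
        "business_key": "customer_id",
        "scd2_columns": ["name", "city"],   # history tracked by the external process
    }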

Virtual Tables are a powerful concept when you are adding the power of Ajilius to an existing data warehouse environment.

PostgreSQL data sources

PostgreSQL can be a data source for any data warehouse built using Ajilius.

If you are using PostgreSQL as your DBMS for ERP (e.g. xTuple or Odoo), HR (Advantec, perhaps) or CRM (such as OroCRM, Bitnami or Tryton), then Ajilius can easily bring data from these systems into your data warehouse.

We offer two sourcing mechanisms for PostgreSQL data, optimised depending on the target platform.

Low latency applications, such as near real-time retail transaction warehouses, are best supported by direct connections from the target database to the source. We use Foreign Data Wrappers from a PostgreSQL warehouse, or the PGNP OLEDB connection for Linked Servers from Microsoft SQL Server.
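
The Foreign Data Wrapper setup is standard PostgreSQL DDL. A minimal sketch, run from Python against the warehouse database (server, schema and credential names are placeholders, and a ‘staging’ schema is assumed to already exist):

    import psycopg2

    # Standard postgres_fdw setup for the low-latency path.
    ddl = """
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SERVER erp_src FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'erp-host', dbname 'erp');
    CREATE USER MAPPING FOR CURRENT_USER SERVER erp_src
        OPTIONS (user 'reader', password 'secret');
    IMPORT FOREIGN SCHEMA public FROM SERVER erp_src INTO staging;
    """
    with psycopg2.connect("dbname=warehouse") as conn:
        conn.cursor().execute(ddl)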

Where latency is not an issue, it can sometimes be faster to use bulk copy processes to extract and load via disk files. In this case we use the COPY bulk export utility.
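
A minimal sketch of that bulk path, again assuming psycopg2 (table and file names are placeholders):

    import psycopg2

    # Export from the source with COPY, then load the file into the warehouse.
    with psycopg2.connect("dbname=erp") as src, open("orders.csv", "w") as f:
        src.cursor().copy_expert("COPY orders TO STDOUT WITH (FORMAT csv)", f)

    with psycopg2.connect("dbname=warehouse") as dwh, open("orders.csv") as f:
        dwh.cursor().copy_expert("COPY load_orders FROM STDIN WITH (FORMAT csv)", f)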

Extracts from PostgreSQL are also required for cloud-hosted warehouses, such as Redshift, drawing data from on-premise systems. Ajilius is perfectly able to handle this requirement due to its scripted nature.

Building data warehouses from PostgreSQL data sources just got a lot easier.

PostgreSQL and Ajilius

PostgreSQL is a great option for dimensional data warehouses. Using Ajilius to build and deploy your PostgreSQL data warehouse helps you to get the most out of this exciting platform.

The biggest advantage of PostgreSQL is that it is free. At a time when competing databases cost tens of thousands of dollars per core, and typically require 8 or more cores in production, PostgreSQL could cut more than $100,000 from your software budget.

Over the next week, we will publish a series of posts that describe how you can use Ajilius and PostgreSQL to maximum effect, wrapping up with some advice that could make you a DW Hero in your organisation.

Stay tuned!

Column compromises

Sometimes you need to trade off one feature against another. A good example is that of column names, where we trade off size against portability.

Column names, more generally described as identifiers, vary in their permitted length and format by DBMS. Lengths range from Oracle’s 30 characters, to 63 characters in PostgreSQL, and 128 characters in SQL Server, Impala and Vertica.

An early design decision for Ajilius was that Oracle was not an important target platform for us. In fact, across the data warehouse automation market, we are not aware of any vendor that prioritises Oracle, nor has a significant volume of customers on that platform.

PostgreSQL, however, is a very important platform for Ajilius, and the 63 character limit in column names became our maximum. Should a source system column name exceed 63 characters, it will be truncated to this limit.

When truncating column names, or when merging tables during the staging process, it is possible that duplicate column names may exist in the source metadata. For example, if you were building a combined table from ‘order’ and ‘order_line’ tables, it is probable that an attribute named ‘order_id’ would exist in both of them. Our strategy for dealing with duplicate names is to add a unique suffix of the format ‘_nn’, where ‘nn’ is an incrementing number, to make the column name unique. We would probably then edit the metadata for the table to delete the column flagged as a duplicate.
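
A minimal sketch of those two rules together, truncation to 63 characters and the ‘_nn’ duplicate suffix:

    # Clip each name to 63 characters, then suffix any duplicate with '_nn'.
    MAX_LEN = 63

    def unique_names(columns):
        seen, result = {}, []
        for name in columns:
            name = name[:MAX_LEN]
            if name in seen:
                seen[name] += 1
                suffix = "_%02d" % seen[name]
                name = name[:MAX_LEN - len(suffix)] + suffix
            else:
                seen[name] = 0
            result.append(name)
        return result

    # 'order' joined to 'order_line': both carry an order_id column.
    print(unique_names(["order_id", "line_no", "order_id"]))
    # -> ['order_id', 'line_no', 'order_id_01']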

When considering identifier names, we also had to take into account the issue of reserved words. When we import metadata from a data source it may have column names that are reserved words in the target DBMS. There are two strategies for dealing with reserved words in Ajilius: prefixing and delimiting.

Ajilius maintains a combined dictionary of reserved words for all target databases. When importing metadata from a source, we prefix the string ‘reserved_’ to any metadata column name which matches a reserved word. This makes it easy to spot conflicting column names, and choose better alternatives when editing the metadata. There are over 640 reserved words across our target databases. The longest reserved word is ‘current_transform_group_for_type’, which is 32 characters, so prefixed names still fit comfortably within our limit.

Most data warehouse platforms support delimited identifiers. These typically use quotes, or character pairs like [], to mark the start and end of the identifier name. While this method works, we’re not great fans of its cluttering of the DDL and DML, nor of forcing database users to write queries that include delimiters. That said, we do offer a user option to use delimited identifiers in the warehouse.

PostgreSQL influenced identifier names in one more way, through its quirk of folding non-delimited identifiers to lower case. If you keep the default option to use non-delimited column names, we will force all identifiers to lower case. Source column names will have embedded spaces (and special characters) converted to an underscore. Camel case names such as ‘OrderShipDate’ will be automatically translated to ‘order_ship_date’.
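
Putting the reserved-word and case-folding rules together, here is a minimal sketch; the RESERVED set below is a tiny stand-in for the full dictionary of 640+ words:

    import re

    # A tiny stand-in for the combined reserved-word dictionary.
    RESERVED = {"order", "user", "select", "current_transform_group_for_type"}

    def clean_identifier(name):
        # Camel case to snake case: 'OrderShipDate' -> 'Order_Ship_Date'
        name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
        # Spaces and special characters become underscores; fold to lower case.
        name = re.sub(r"[^A-Za-z0-9]+", "_", name).lower()
        # Flag any clash with a target-DBMS reserved word for later editing.
        if name in RESERVED:
            name = "reserved_" + name
        return name

    print(clean_identifier("OrderShipDate"))  # order_ship_date
    print(clean_identifier("Order"))          # reserved_order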

By understanding the constraints of our target data warehouse platforms, we are able to keep our promise of data warehouse portability.

External metadata

The term ‘metadata’ describes the information that is used by Ajilius to create a data warehouse and its ETL / ELT code. Metadata is the descriptive information about the tables, columns, data sources and other features of our warehouse. When Ajilius creates a table, or a warehouse, or a batch update procedure, it reads the metadata and generates code suitable for the target platform.

Ajilius stores its metadata separately from the data warehouse. Because Ajilius supports multiple data warehouses, each warehouse has its own repository. We store that metadata in a database for each warehouse. A common repository stores information that Ajilius uses for all warehouses.

The main benefit of external metadata is that it makes portability simple. The data warehouse market is increasingly fluid, and the ability to take advantage of new platform offerings could be a big advantage. Hadoop, for example, is offering increasing capability at very low prices, and the ability to quickly switch between an RDBMS and Hadoop using Ajilius could save you hundreds of thousands of dollars.

If you’re interested in migration as a feature, make sure that you ask your DW Automation provider to demonstrate this capability against an existing warehouse. No-one comes close to our “three-click migration”.

Another benefit is that we don’t need to migrate metadata into production. The concept of separation between development and production isn’t seen as important by some of our competitors, but we’ve seen scenarios where production data warehouses are hosted on managed platforms to which the DW developers do not have deployment rights. Ajilius makes it easy to generate scripts which can be tested and scheduled by a production operations team, independent of the development platform.

Being a metadata-driven product isn’t enough in today’s data warehouse market. Unless that metadata helps you to migrate across platforms and integrate with your operations framework, you’ve bought just another ETL tool, not a modern data warehouse automation platform.