New feature: load streams

Before today, loads to an Ajilius data warehouse were single streamed. That is, one process extracted and loaded one table. You could load hundreds of thousands of rows per second, you could load different tables in parallel, but if you had one very large table there was nothing you could do to make it faster.

We’ve just added a feature called “Load Streams”, that enables you to parallel load large tables. This is specifically designed for MPP target databases, but may also speed up SMP operations.

The Load Table screen now has an option to select a number of streams that will be loaded in parallel. This is a number between 1, and a maximum set for the target platform. On MPP platforms, this number is the number of nodes, while SMP platforms typically benefit from a number between 4 and 8.

On extract, data will be split into the number of streams you defined. Depending on the platform, the extract may be split in a round-robin fashion or by the column/s you have defined as a distribution key on selected MPP platforms.

Because there is overhead in splitting the data, and in handling concurrent writes in the DBMS, the improvement is not linear. For SMP platforms, large tables on SSD storage can be orders of magnitude faster than single threaded loads, while MPP platforms can approach the fraction represented by the number of streams.

Ajilius. We work harder to make your loads faster.

New feature: Load Mask

There are times when you need to load multiple instances of data. A common example is found in retail, where you might receive one end-of-day file of sales transactions from each store that need to be loaded into one warehouse table.

We recently faced a more interesting case, where a SAAS company hosting identical application databases for many customers wanted to aggregate the same table from all of those databases.

We’ve now implemented a feature in Ajilius that makes iteration super-easy.

A combination of Database and Table Masks enable you to set wildcards over which a table load will iterate at run time. You do your metadata design using one instance of the load table as the source, then simply define the mask patterns to be used at run time.

Multiple Excel files? No problem. Multiple text files? Easy. Multiple tables? Simple. Same table from many databases? No sweat.

Ajilius. Helpful data warehouse automation.

Suspending Hadoop DW

We’re temporarily suspending work on Hadoop as a target platform for dimensional data warehouses.

Six to twelve months ago the future of the platform looked bright, with SQL-on-Hadoop vendors bringing out new versions at a rapid pace.

Lately, that pace has slowed to a crawl. We still don’t have wide-spread implementation of an UPDATE statement, and that makes it difficult to process slowly changing dimensions, and accumulating snapshot fact tables.

We’ve been working around this lack by reprocessing the data outside Hadoop. This meant reading and rewriting entire tables, and as the size of our test warehouses grew, it became clear that this was not a better solution than using an RDBMS.

When more complete SQL-on-Hadoop implementations become available we will revisit this decision. Until then, Hadoop will continue to be a supported data source for Ajilius.

Measuring load speeds

We recently had a situation where our load speeds were reported as being much slower than a competitor. This surprised me, because I knew that our loader could saturate the network from the source server, and I wondered how our competitor could be faster.

Luckily, the evaluator liked Ajilius, and did a little digging on our behalf. It turned out that the culprit was not our performance, but the competitor’s measurement technique.

When we load data into the warehouse from a source database, there are basically four steps that we need to perform:

  • Query
  • Extract
  • Load
  • Commit

The Query step is where we execute a query on the remote data source, such as “select c1,c2,c3 from t1 where c1 > 9467843”. The extract step is where we transfer the results of that query to the loader. The Load step moves those rows into the warehouse. Finally, the Commit step commits the load transaction/s. Depending on the source and warehouse, Ajilius may overlap one or more of those steps.

When we measure load performance we put a timer call before the Query, and again after the Commit. The elapsed time is the total time taken to extract and load the required data from the source system to the warehouse. This represents real-world performance, the type you need to measure if batch windows are important to you.

Our competitor had a different view of the world. Their measurement of performance was to take the time immediately before the Load step, and immediately after it. They claimed that this was a measurement of “load” performance. I guess they’re technically correct, but knowing just that one part of the job doesn’t help you to assess performance against real-world requirements.

When the customer repeated the tests, this time measuring the elapsed time for the whole job, the results were virtually neck and neck. I’m not surprised because, as I said earlier, I knew we were capable of saturating the relatively slow network in the customer’s development lab.

Ajilius performance tests? Always welcome.

Virtual loads and dimensions

Some data warehouse automation products do a good job on green-fields projects, but make it very difficult to integrate into an existing data warehouse architecture. Some don’t do it at all, while others require that you reverse engineer and rebuild the ETL / ELT processing before you can integrate new tables. We’re different.

Ajilius makes it easy to integrate with an existing data warehouse through Virtual Tables.

The purpose of a Virtual Table is to provide a mechanism to create Ajilius metadata over an existing table, without having to re-load or re-process its data. There are two types of Virtual Tables used in Ajilius – Virtual Loads and Virtual Dimensions.

Virtual Loads exist to define data that has already been loaded into the warehouse by other processes. This scenario is used where the table is to be processed in conjunction with data completely controlled by Ajilius, and may also be used in drip-feeding scenarios.

Virtual Dimensions enable you to integrate complete dimensions that have been created and maintained by external processes. If your existing warehouse has already built the perfect SCD2 dimension, there is no need to re-design and re-write that table just to fit it into our metadata structures.

Once defined, Virtual Tables work exactly the same as any other Ajilius load or dimension table.

Virtual Tables are a powerful concept when you are adding the power of Ajilius to an existing data warehouse environment.

Publishing our roadmap

Some companies treat their product development roadmap like it was a state secret. Others don’t even have a roadmap, sometimes not even a vision beyond their first release.

We’re different. We believe you should be fully aware of our production direction and development objectives, so that you can make informed decisions about whether Ajilius is going to satisfy your future needs.

There are two things to keep in mind when reading our roadmap:

  • The roadmap represents our best thinking at a point in time. It is subject to change. The further out it gets in the release cycle, the more likely it is to change.
  • The roadmap is not a guarantee of delivery. If you are interested in a specific feature, please keep in touch with us to make sure it is on track.

We’ll update the roadmap from time to time, but we’ll keep it published on our Bookshelf page.

What is ‘modern’ data warehouse automation?

Our tag-line for Ajilius is “Modern Data Warehouse Automation.

Is it (Modern) Data Warehouse Automation, or (Modern Data Warehouse) Automation?

The answer is “Both”.

Ajilius is a modern software product. You can use Ajilius on desktops, tablets and phones. It runs on Linux, Macintosh and Windows. Deploy Ajilius on-premise or in the cloud. Multiple languages are supported for international customers.

Ajilius fully supports the Modern Data Warehouse. We can build data warehouses on relational databases, cloud platforms, and Hadoop. Ajilius can source from databases, files, web services and streams.

Ajilius delivers the fully development cycle for a data warehouse. Loads, staging tables, screens, dimensions, facts, cubes, schedules and documentation. We’re fully metadata-driven, using specialised templates to generate all scripts and artefacts. With just a few clicks you can migrate from development to production, or from one DW platform to another, with all code fully scripted for automated deployment in segregated data centers.

You’ll also be delighted by our modern approach to licensing. No per-user, per warehouse shennanigans. Just one, simple rule: if you pay for your target database, you pay for Ajilius. And the amount you pay is just a fraction of the maintenance of competing products. Of course, we offer full training and support to get you up and running.

We aim to be #1 in data warehouse automation. Don’t settle for less.