Delimited file formats

Delimited files, often referred to as “CSV” or text files, represent tables of data in a simple text format.

Rows are separated by newline characters, or by newline and carriage-return pairs.

Columns within rows are separated by delimiters. The delimiters supported by Ajilius are comma, tab, semi-colon and vertical bar (‘|’). All rows must contain all columns, although empty columns may be represented as consecutive delimiters.

Text fields containing delimiters must be enquoted using single (‘) or double (“) quote characters.

Where a text field contains a character used as a delimiter, that character must be preceded by an escape character, typically backslash (\). If the text field contains the escape character, it must be preceded by another escape character (ie, ‘\\’).

Delimited files must have column names contained in the first row of the file.

The file may be encoded in any format supported by Ajilius (link).

Detection of encoding, delimiters, quotation and escape characters is automatically performed on file access. You don’t need to specify anything other than the file name.

Ajilius. Flexible delimited file formats.

Handling SCD0 and SCD6

Most ETL and data warehouse automation products define a slowly changing dimension at the table. DIM_PRODUCT, for example, may be defined as a type-2 slowly changing dimension, with changes to the PRODUCT_NAME and PRODUCT_CATEGORY triggering new dimension rows.

When we were designing Ajilius, we realised that this traditional approach is very limiting, particularly when handling dimensions of type 0, 4 and 6. To refresh, a type-0 dimension has values that may never change, with the common example of an original-value column, such as PRODUCT_ORIGINAL_PRICE. A type-6 dimension combines elements of type-2 and type-3, in that it may have some columns that have previous values recorded, and some columns which trigger new dimension rows.

The “may have some columns” expression in the last sentence was our “Ah-Ha!” moment. Slowly changing dimensions should actually be recorded at the dimension attribute level, rather than the table level.

Ajilius enables you to set a change-type value for each non-key column in the dimension. By default we set it to SCD1, but you can change it to any other value through the dimension editor.

  • SCD0 (value never changes)
  • SCD1 (value changes in place without history being recorded)
  • SCD2 (value creates a new dimension row when it changes)
  • SCD3 (value has current- and previous-version recorded in the same dimension row)
  • SCD4 (value has historic versions recorded in a history outrigger)
  • SCD6 (a combination of 0 + 1 + 2 + 3 attributes in the same row)

To the best of our knowledge, Ajilius is the only data warehouse automation product that correctly supports the generation of DDL and DML to create and process all of these types of slowly changing dimension.

Our join editor sucks

I’ve spent today reviewing and discussing our alpha-test feedback.

The best feedback related to the browser UI. The worst feedback to the join editor.

The join editor is used to define the joins between staging tables. We need to know how the business keys from the table being joined relate to the data already in the table.

The way the current version works is that you choose the ‘Join Another Table’ option from the Stage Table menu, then select the table you want to join, finally choosing the join type (ie, inner, left outer) and the join columns. You’d repeat this for as many tables as you wanted to join.

Users gave solid feedback that this did not represent their use cases. It forces the user to make early decisions about the sequence in which tables will be added, doesn’t handle deletion of tables from the join well, and fragments the user’s mental model of the join structure.

We worked through a few alternatives this afternoon, decided on an approach, and now Minh (our Vietnam developer) is due to have it completed by the end of the week. We’ll go back to the alpha users for a re-check on this feature, but then everything should be clear for the beta.

Virtual loads and dimensions

Some data warehouse automation products do a good job on green-fields projects, but make it very difficult to integrate into an existing data warehouse architecture. Some don’t do it at all, while others require that you reverse engineer and rebuild the ETL / ELT processing before you can integrate new tables. We’re different.

Ajilius makes it easy to integrate with an existing data warehouse through Virtual Tables.

The purpose of a Virtual Table is to provide a mechanism to create Ajilius metadata over an existing table, without having to re-load or re-process its data. There are two types of Virtual Tables used in Ajilius – Virtual Loads and Virtual Dimensions.

Virtual Loads exist to define data that has already been loaded into the warehouse by other processes. This scenario is used where the table is to be processed in conjunction with data completely controlled by Ajilius, and may also be used in drip-feeding scenarios.

Virtual Dimensions enable you to integrate complete dimensions that have been created and maintained by external processes. If your existing warehouse has already built the perfect SCD2 dimension, there is no need to re-design and re-write that table just to fit it into our metadata structures.

Once defined, Virtual Tables work exactly the same as any other Ajilius load or dimension table.

Virtual Tables are a powerful concept when you are adding the power of Ajilius to an existing data warehouse environment.

Column compromises

Sometimes you need to trade off one feature against another. A good example is that of column names, where we trade off size against portability.

Column names, more generally described as identifiers, have their permitted length and format vary by DBMS. Lengths range from Oracle’s 30 characters, to 63 characters in PostgreSQL, and 128 characters in SQL Server, Impala and Vertica.

An early design decision for Ajilius was that Oracle was not an important target platform for us. In fact, across the data warehouse automation market, we are not aware of any vendor that prioritises Oracle, nor has a significant volume of customers on that platform.

PostgreSQL, however, is a very important platform for Ajilius, and the 63 character limit in column names became our maximum. Should a source system column name exceed 63 characters, it will be truncated to this limit.

When truncating column names, or when merging tables during the staging process, it is possible that duplicate column names may exist in the source metadata. For example, if you were building a combined table from ‘order’ and ‘order_line’ tables, it is probably that an attribute named ‘order_id’ would exist in both of them. Our strategy for dealing with duplicate names is to add a unique suffix of the format ‘_nn’, where ‘nn’ is an incrementing number, to make the column name unique. We would probably then edit the metadata for the table to delete the column flagged as a duplicate.

When considering identifier names, we also had to take into account the issue of reserved words. When we import metadata from a data source it may have column names that are reserved words in the target DBMS. There are two strategies for dealing with reserved words in Ajilius: prefixing and delimiting.

Ajilius maintains a combined dictionary of reserved words for all target databases. When importing metadata from a source, we prefix the string ‘reserved_’ to any metadata column name which matches a reserved word. This makes it easy to spot conflicting column names, and choose better alternatives when editing the metadata. There are over 640 reserved words across our target databases. The longest reserved word is ‘current_transform_group_for_type’, which is 32 characters, so prefixed names still fit comfortably within our limit.

Most data warehouse platforms support delimited identifiers. These typically use quotes, or character pairs like [], to mark the start and end of the identifier name. While this method works, we’re not great fans of its cluttering of the DDL and DML, nor forcing database users to write queries that include delimiters. That said, we do offer a user option to use delimited identifiers in the warehouse.

PostgreSQL had more influence on identifier names through its quirk of converting non-delimited identifiers to lower case. If you keep the default option to use non-delimited column names, we will force all identifiers to lower case. Source column names will have embedded spaces (and special characters) converted to an underscore. Camel case names such as ‘OrderShipDate’ will be automatically translated to ‘order_ship_date’.

By understanding the constraints of our target data warehouse platforms, we are able to keep our promise of data warehouse portability.

External metadata

The term ‘metadata’ describes the information that is used by Ajilius to create a data warehouse and its ETL / ELT code. Metadata is the descriptive information about the tables, columns, data sources and other features of our warehouse. When Ajilius creates a table, or a warehouse, or a batch update procedure, it reads the metadata and generates code suitable for the target platform.

Ajilius stores its metadata separate from the data warehouse. Because Ajilius supports multiple data warehouses, each warehouse has its own repository. We store that metadata in a database for each warehouse. A common repository stores information that Ajilius uses for all warehouses.

The main benefit of external metadata is that it makes portability simple. The data warehouse market is increasingly fluid, and the ability to take advantage of new platform offerings could be a big advantage. Hadoop, for example, is offering increasing capability at very low prices, and the ability to quickly switch between an RDBMS and Hadoop using Ajilius could save you hundreds of thousands of dollars.

If you’re interested in migration as a feature, make sure that you ask your DW Automation provider to demonstrate this capability against an existing warehouse. No-one comes close to our “three-click migration”.

Another benefit is that we don’t need to migrate metadata into production. The concept of separation between development and production isn’t seen as important by some of our competitors, but we’ve seen scenarios where production data warehouses are hosted on managed platforms to which the DW developers do not have deployment rights. Ajilius makes it easy to generate scripts which can be tested and scheduled by a production operations team, independent of the development platform.

Being a metadata-driven product isn’t enough in today’s data warehouse market. Unless that metadata helps you to migrate across platforms and integrate with your operations framework, you’ve bought just another ETL tool, not a modern data warehouse automation platform.

Publishing our roadmap

Some companies treat their product development roadmap like it was a state secret. Others don’t even have a roadmap, sometimes not even a vision beyond their first release.

We’re different. We believe you should be fully aware of our production direction and development objectives, so that you can make informed decisions about whether Ajilius is going to satisfy your future needs.

There are two things to keep in mind when reading our roadmap:

  • The roadmap represents our best thinking at a point in time. It is subject to change. The further out it gets in the release cycle, the more likely it is to change.
  • The roadmap is not a guarantee of delivery. If you are interested in a specific feature, please keep in touch with us to make sure it is on track.

We’ll update the roadmap from time to time, but we’ll keep it published on our Bookshelf page.

Portable, scalable technology

The technology stack that we’ve chosen for Ajilius had to meet the following requirements:

  1. Same code base on Windows, Linux and OSX
  2. Cloud and on-premise deployment
  3. Easy to deploy and upgrade
  4. Unicode for international languages
  5. Wide range of supporting libraries
  6. Modern, browser-based user interfaces
  7. Agile development

C++ Windows Forms applications were definitely off the table.

Our first decision was a programming language. Based on what we knew, and the advice of people we trusted, it came down to a choice between C# and Python. We prefer the C# language to Python. Mono, and Microsoft’s opening of .NET, would have let C# meet most of our criteria, except for #3. We still don’t know how heavy the load of upgrades and patches to Linux and OS X platforms will be with .Net, but if it is like Windows we didn’t want to find out.

Python absolutely won on criteria like same code base (#1), cloud and on-premise (#2), easy to deploy and upgrade (#3), and wide range of supporting libraries (#5). Python was our choice, and we’re using Python 3.4, the current version.

The next decision was the user interface approach. We looked at a wide range of UI approaches, such as Dojo, jQuery and others, but they all suffered from errors during testing or poor performance at different times. Our decision was to avoid a Javascript front-end, and generate HTML. We chose to base our user interface on the Twitter Bootstrap model, as it gave us good looking, responsive user interfaces that were based on standard HTML5.

The final decision was the DBMS to use for our metadata repository. Keeping deployment and upgrading (#3) in mind, we didn’t want a heavy-weight DBMS to be installed and configured. Our concurrent user numbers are low, our transaction rates are low, and we made the decision to hold our metadata in SQLite3.

Since making these decisions we’ve encountered two common objections – Python is slow, and SQLite3 can’t support multiple users. Both statements are wrong.

If we were inverting very large matrices, or time-stretching minutes of audio in memory using FFTs, Python might not be our first choice of language. But in Ajilius, we do very little CPU processing, we’re mostly pushing data to and from the network and the database. We’re IO bound, not CPU bound. Python is more than adequate for this job.

We agree that SQLite3 is not designed for high volume concurrency, but it is definitely capable of supporting concurrent users with just one constraint – only one user may write to the database at a time. That doesn’t mean only one user is connected, just that the transactions of other users will wait until the writer has finished, and potentially time out in long wait scenarios. We’re keeping transactions short, we use optimistic locking strategies, and we just don’t have long waits. In our experience, typical implementations of data warehouse automation software have 2-10 users. We’ve tested Ajilius with up to 20 simulated users of the one repository, and it has worked just fine, and we’ve got higher volume tests scheduled for later in the development cycle.

So there it is. A modern software stack that lets us deliver data warehouse automation any where, with minimal dependencies and operational overheads for our customers.

The benefits of browser

Someone I used to respect once told me that you couldn’t build a data warehouse automation tool in a web browser. That person was living in a 20 year old bubble.

My epiphany on the capabilities of browser-based software came around 10 years ago, when I first saw an early version of Microsoft’s Outlook Web Access. It was a browser-based application that looked and worked almost exactly the same as Outlook. And now, look at the email, contact and calendar capabilities of Office365 – truly amazing user interfaces, delivered to any mainstream web browser.

Some people think that browsers can’t handle interaction with complex graphics. Have you seen Canva lately? Or Gliffy? Or the online versions of PowerPoint and Keynote? They’re doing graphics online that are as good (or better) as I was doing with desktop-based Visio, SmartDraw and Publisher just a few years ago.

Browsers have more benefits for a product like Ajilius:

Any device
Today’s web browsers run on desktops, notebooks, tablets and phones. On Windows, Linux and OS X, Android and iOS. The browser has become a universal user interface for all types of software. Responsive user interface design means that Ajilius adapts its layout to the constraints of the device, making it usable from large desktop display to pocket sized phone.

No client installs
Packaging, installing and maintaining desktop software is a major cost for most IT departments. Browser-based software, especially when designed to work with all major browsers, eliminates that cost and effort. Assuming the presence of a modern browser, deploying Ajilius in a corporate environment requires nothing installed or changed on end user devices.

Modern user interfaces
Using a browser makes it easy to support modern developments in user interface technology. We’re fully touch-driven, for example, which means you can use Ajilius just as effectively on an iPad as you can on a Windows desktop. We’re accessible, meaning that people with low vision or motor skill issues can still use our software. And we’re multi-lingual, because the browser lets us quickly deal with the user interface issues that come from supporting the world’s written languages.

The days when software was automatically desktop are over. The desktop as a device may not be dying, but software designed for the desktop is last century’s product.

Ajilius, being browser based, is a modern data warehouse automation product.

Why star schema?

There are currently three main schools of thought around building a data warehouse:

  • 3NF
  • Data Vault
  • Star Schema

We’ll leave it to Wikipedia to describe the main principles of each method.

We designed Ajilius to deliver star schema data warehouses for one reason: deliver value faster.

Over the years there have been many studies of the rates of failure in data warehouse projects. Gartner once reported that more than 50% of DW projects fail. Cutter put the figure at 41%. Bill Inmon wrote that it might go as high as 70-80%. Conning, studying the US Insurance industry, speculated that the rate of failure might be as high as 90%.

This terrible rate of failure is due to just one issue: the business owners of the data warehouse did not get timely value. Projects took years before delivering data. New data took so long to be loaded that the need for it had passed. The effort to acquire new data, and restructure existing data, were was too long and too expensive.

So, which data warehouse method best addresses this problem?

3NF might deliver quickly, but it has a track record of building ‘enterprise’ data warehouses. When we see the word ‘enterprise’ on a software project, we know three things – 1) it is expensive, 2) it will take a long time, and 3) it will probably never achieve its goals. The problem is largely one of scope and architecture. In attempting to build the one, true, repository of everything, the project quickly gets bogged down in analysis and design. It is not uncommon to see 3NF projects staffed by teams of architects and data modellers for months and years, often behind closed doors, with the only users of the warehouse being its overworked ETL coders.

We’ve never worked on a Data Vault project, so what we’ll give are our impressions from reading about it online. Firstly, there’s too much of a smell of snake oil around it. Any time we see experienced DW practitioners being told “Of course you don’t understand, you haven’t done the training”, it looks more like Scientology than Data Warehousing. That prejudices us against it. Secondly, we understand that end users are not supposed to directly access the Data Vault warehouse, but use a separate 3NF or star schema layer instead. We’d like to study this one day, because the Data Vault data structures don’t look like they make it easy to actually build 3NF or star schemas on the front end, we suspect some very complex joins and filters are required in the transformation.

Star schemas have, to our minds, three main advantages over 3NF and Data Vault warehouses:

  1. There is a clearly documented and widely available body of knowledge around designing and building star schema databases. There are books (see our Bookshelf for some favourites), tutorials and courses available from a wide range of sources; and skilled practitioners for most data warehouse platforms.
  2. Star schema data structures are a great fit for modern analytics and visualisation products. That means the design, construction and use of the data warehouse, by both developer and user, fit within the same conceptual framework. Also, star schemas have proven methods for dealing with issues like grain, history and aggregation; and for exposing the solutions in an end-user friendly manner
  3. Star schemas are a great fit for agile data warehouse methodologies. We see this as the key to overcoming the terrible failure rates described above, as agile projects engage the end-users in the process, and deliver usable iterations faster. With methodologies like BEAM, supported by tools like Ajilius, projects can been delivering usable data to end-users in days, not the months and years of other approaches.

We chose to base Ajilius on star schemas because we want to deliver business value in the fastest way possible. We want end users to be analysing and visualising our data in hours and days, not months and years.

Delivering value, faster, is the key to data warehouse success.