Dependency diagrams

With Version 1.4 due by the end of December, we’ve just made our final implementation decision regarding dependency diagrams.

Over the past month, the team has produced literally hundreds of diagrams, testing fourteen different approaches to diagram generation.

The short list of implementation technologies came down to three Javascript libraries – Mermaid,VisJS and Cytoscape.

Mermaid’s Dagre layout algorithm produced the best technical appearance in diagrams, but its layout engine experienced severe difficulties when diagrams became more complex. Development is lagging, and is dependent on an unsupported core library.

VisJS was good, but it suffered from hierarchy layout problems, as well as labelling conflicts in diagrams. I think VisJS is more generally powerful, but fell back in our use case.

The best compromise solution was Cytoscape. It is in active (and enthusiastic) development, has a growing selection of layout algorithms, does the best job of laying out our diagrams, and supports features we need such as custom colours and tooltips.

Here’s a sample of what you can expect with the coming release of V1.4.

DependencyDiagram

Is your metadata secure?

Are your databases at risk from your data warehouse automation product?

We’ve seen DWA where user credentials for production systems are held in clear text!

If you’re using a data warehouse automation product other than Ajilius, use a query tool to open your metadata database. For example, if your product uses SQL Server, use Management Studio to open the metadata. Now look at the metadata tables which store data warehouse and data source credentials. Can you read them? If you can, so can anyone who reads that database.

All Ajilius user passwords – that is, the passwords that you use to access Ajilius – are one-way hashed. That means you can’t read the password in our database, and we can’t reverse or recover your password, even if we have your database. Our hashing is based on the SHA256 algorithm for world-class data protection.

All credentials, to your data warehouses and your data sources, are fully encrypted. The user-id and password for each connection are encrypted using the AES256 algorithm, an incredibly strong algorithm used by many government security organisations.

Database query tools with authorised access to the metadata database only see a long string of characters like “4e3d3cc78416…” and not your real credentials.

Even if someone gets a physical copy of your metadata, and browses the disk file, they can’t read your database credentials.

Ajilius. We won’t risk your database security.

Delimited file formats

Delimited files, often referred to as “CSV” or text files, represent tables of data in a simple text format.

Rows are separated by newline characters, or by newline and carriage-return pairs.

Columns within rows are separated by delimiters. The delimiters supported by Ajilius are comma, tab, semi-colon and vertical bar (‘|’). All rows must contain all columns, although empty columns may be represented as consecutive delimiters.

Text fields containing delimiters must be enquoted using single (‘) or double (“) quote characters.

Where a text field contains a character used as a delimiter, that character must be preceded by an escape character, typically backslash (\). If the text field contains the escape character, it must be preceded by another escape character (ie, ‘\\’).

Delimited files must have column names contained in the first row of the file.

The file may be encoded in any format supported by Ajilius (link).

Detection of encoding, delimiters, quotation and escape characters is automatically performed on file access. You don’t need to specify anything other than the file name.

Ajilius. Flexible delimited file formats.

Handling SCD0 and SCD6

Most ETL and data warehouse automation products define a slowly changing dimension at the table. DIM_PRODUCT, for example, may be defined as a type-2 slowly changing dimension, with changes to the PRODUCT_NAME and PRODUCT_CATEGORY triggering new dimension rows.

When we were designing Ajilius, we realised that this traditional approach is very limiting, particularly when handling dimensions of type 0, 4 and 6. To refresh, a type-0 dimension has values that may never change, with the common example of an original-value column, such as PRODUCT_ORIGINAL_PRICE. A type-6 dimension combines elements of type-2 and type-3, in that it may have some columns that have previous values recorded, and some columns which trigger new dimension rows.

The “may have some columns” expression in the last sentence was our “Ah-Ha!” moment. Slowly changing dimensions should actually be recorded at the dimension attribute level, rather than the table level.

Ajilius enables you to set a change-type value for each non-key column in the dimension. By default we set it to SCD1, but you can change it to any other value through the dimension editor.

  • SCD0 (value never changes)
  • SCD1 (value changes in place without history being recorded)
  • SCD2 (value creates a new dimension row when it changes)
  • SCD3 (value has current- and previous-version recorded in the same dimension row)
  • SCD4 (value has historic versions recorded in a history outrigger)
  • SCD6 (a combination of 0 + 1 + 2 + 3 attributes in the same row)

To the best of our knowledge, Ajilius is the only data warehouse automation product that correctly supports the generation of DDL and DML to create and process all of these types of slowly changing dimension.

Our join editor sucks

I’ve spent today reviewing and discussing our alpha-test feedback.

The best feedback related to the browser UI. The worst feedback to the join editor.

The join editor is used to define the joins between staging tables. We need to know how the business keys from the table being joined relate to the data already in the table.

The way the current version works is that you choose the ‘Join Another Table’ option from the Stage Table menu, then select the table you want to join, finally choosing the join type (ie, inner, left outer) and the join columns. You’d repeat this for as many tables as you wanted to join.

Users gave solid feedback that this did not represent their use cases. It forces the user to make early decisions about the sequence in which tables will be added, doesn’t handle deletion of tables from the join well, and fragments the user’s mental model of the join structure.

We worked through a few alternatives this afternoon, decided on an approach, and now Minh (our Vietnam developer) is due to have it completed by the end of the week. We’ll go back to the alpha users for a re-check on this feature, but then everything should be clear for the beta.

Virtual loads and dimensions

Some data warehouse automation products do a good job on green-fields projects, but make it very difficult to integrate into an existing data warehouse architecture. Some don’t do it at all, while others require that you reverse engineer and rebuild the ETL / ELT processing before you can integrate new tables. We’re different.

Ajilius makes it easy to integrate with an existing data warehouse through Virtual Tables.

The purpose of a Virtual Table is to provide a mechanism to create Ajilius metadata over an existing table, without having to re-load or re-process its data. There are two types of Virtual Tables used in Ajilius – Virtual Loads and Virtual Dimensions.

Virtual Loads exist to define data that has already been loaded into the warehouse by other processes. This scenario is used where the table is to be processed in conjunction with data completely controlled by Ajilius, and may also be used in drip-feeding scenarios.

Virtual Dimensions enable you to integrate complete dimensions that have been created and maintained by external processes. If your existing warehouse has already built the perfect SCD2 dimension, there is no need to re-design and re-write that table just to fit it into our metadata structures.

Once defined, Virtual Tables work exactly the same as any other Ajilius load or dimension table.

Virtual Tables are a powerful concept when you are adding the power of Ajilius to an existing data warehouse environment.

Column compromises

Sometimes you need to trade off one feature against another. A good example is that of column names, where we trade off size against portability.

Column names, more generally described as identifiers, have their permitted length and format vary by DBMS. Lengths range from Oracle’s 30 characters, to 63 characters in PostgreSQL, and 128 characters in SQL Server, Impala and Vertica.

An early design decision for Ajilius was that Oracle was not an important target platform for us. In fact, across the data warehouse automation market, we are not aware of any vendor that prioritises Oracle, nor has a significant volume of customers on that platform.

PostgreSQL, however, is a very important platform for Ajilius, and the 63 character limit in column names became our maximum. Should a source system column name exceed 63 characters, it will be truncated to this limit.

When truncating column names, or when merging tables during the staging process, it is possible that duplicate column names may exist in the source metadata. For example, if you were building a combined table from ‘order’ and ‘order_line’ tables, it is probably that an attribute named ‘order_id’ would exist in both of them. Our strategy for dealing with duplicate names is to add a unique suffix of the format ‘_nn’, where ‘nn’ is an incrementing number, to make the column name unique. We would probably then edit the metadata for the table to delete the column flagged as a duplicate.

When considering identifier names, we also had to take into account the issue of reserved words. When we import metadata from a data source it may have column names that are reserved words in the target DBMS. There are two strategies for dealing with reserved words in Ajilius: prefixing and delimiting.

Ajilius maintains a combined dictionary of reserved words for all target databases. When importing metadata from a source, we prefix the string ‘reserved_’ to any metadata column name which matches a reserved word. This makes it easy to spot conflicting column names, and choose better alternatives when editing the metadata. There are over 640 reserved words across our target databases. The longest reserved word is ‘current_transform_group_for_type’, which is 32 characters, so prefixed names still fit comfortably within our limit.

Most data warehouse platforms support delimited identifiers. These typically use quotes, or character pairs like [], to mark the start and end of the identifier name. While this method works, we’re not great fans of its cluttering of the DDL and DML, nor forcing database users to write queries that include delimiters. That said, we do offer a user option to use delimited identifiers in the warehouse.

PostgreSQL had more influence on identifier names through its quirk of converting non-delimited identifiers to lower case. If you keep the default option to use non-delimited column names, we will force all identifiers to lower case. Source column names will have embedded spaces (and special characters) converted to an underscore. Camel case names such as ‘OrderShipDate’ will be automatically translated to ‘order_ship_date’.

By understanding the constraints of our target data warehouse platforms, we are able to keep our promise of data warehouse portability.

External metadata

The term ‘metadata’ describes the information that is used by Ajilius to create a data warehouse and its ETL / ELT code. Metadata is the descriptive information about the tables, columns, data sources and other features of our warehouse. When Ajilius creates a table, or a warehouse, or a batch update procedure, it reads the metadata and generates code suitable for the target platform.

Ajilius stores its metadata separate from the data warehouse. Because Ajilius supports multiple data warehouses, each warehouse has its own repository. We store that metadata in a database for each warehouse. A common repository stores information that Ajilius uses for all warehouses.

The main benefit of external metadata is that it makes portability simple. The data warehouse market is increasingly fluid, and the ability to take advantage of new platform offerings could be a big advantage. Hadoop, for example, is offering increasing capability at very low prices, and the ability to quickly switch between an RDBMS and Hadoop using Ajilius could save you hundreds of thousands of dollars.

If you’re interested in migration as a feature, make sure that you ask your DW Automation provider to demonstrate this capability against an existing warehouse. No-one comes close to our “three-click migration”.

Another benefit is that we don’t need to migrate metadata into production. The concept of separation between development and production isn’t seen as important by some of our competitors, but we’ve seen scenarios where production data warehouses are hosted on managed platforms to which the DW developers do not have deployment rights. Ajilius makes it easy to generate scripts which can be tested and scheduled by a production operations team, independent of the development platform.

Being a metadata-driven product isn’t enough in today’s data warehouse market. Unless that metadata helps you to migrate across platforms and integrate with your operations framework, you’ve bought just another ETL tool, not a modern data warehouse automation platform.

Publishing our roadmap

Some companies treat their product development roadmap like it was a state secret. Others don’t even have a roadmap, sometimes not even a vision beyond their first release.

We’re different. We believe you should be fully aware of our production direction and development objectives, so that you can make informed decisions about whether Ajilius is going to satisfy your future needs.

There are two things to keep in mind when reading our roadmap:

  • The roadmap represents our best thinking at a point in time. It is subject to change. The further out it gets in the release cycle, the more likely it is to change.
  • The roadmap is not a guarantee of delivery. If you are interested in a specific feature, please keep in touch with us to make sure it is on track.

We’ll update the roadmap from time to time, but we’ll keep it published on our Bookshelf page.

Portable, scalable technology

The technology stack that we’ve chosen for Ajilius had to meet the following requirements:

  1. Same code base on Windows, Linux and OSX
  2. Cloud and on-premise deployment
  3. Easy to deploy and upgrade
  4. Unicode for international languages
  5. Wide range of supporting libraries
  6. Modern, browser-based user interfaces
  7. Agile development

C++ Windows Forms applications were definitely off the table.

Our first decision was a programming language. Based on what we knew, and the advice of people we trusted, it came down to a choice between C# and Python. We prefer the C# language to Python. Mono, and Microsoft’s opening of .NET, would have let C# meet most of our criteria, except for #3. We still don’t know how heavy the load of upgrades and patches to Linux and OS X platforms will be with .Net, but if it is like Windows we didn’t want to find out.

Python absolutely won on criteria like same code base (#1), cloud and on-premise (#2), easy to deploy and upgrade (#3), and wide range of supporting libraries (#5). Python was our choice, and we’re using Python 3.4, the current version.

The next decision was the user interface approach. We looked at a wide range of UI approaches, such as Dojo, jQuery and others, but they all suffered from errors during testing or poor performance at different times. Our decision was to avoid a Javascript front-end, and generate HTML. We chose to base our user interface on the Twitter Bootstrap model, as it gave us good looking, responsive user interfaces that were based on standard HTML5.

The final decision was the DBMS to use for our metadata repository. Keeping deployment and upgrading (#3) in mind, we didn’t want a heavy-weight DBMS to be installed and configured. Our concurrent user numbers are low, our transaction rates are low, and we made the decision to hold our metadata in SQLite3.

Since making these decisions we’ve encountered two common objections – Python is slow, and SQLite3 can’t support multiple users. Both statements are wrong.

If we were inverting very large matrices, or time-stretching minutes of audio in memory using FFTs, Python might not be our first choice of language. But in Ajilius, we do very little CPU processing, we’re mostly pushing data to and from the network and the database. We’re IO bound, not CPU bound. Python is more than adequate for this job.

We agree that SQLite3 is not designed for high volume concurrency, but it is definitely capable of supporting concurrent users with just one constraint – only one user may write to the database at a time. That doesn’t mean only one user is connected, just that the transactions of other users will wait until the writer has finished, and potentially time out in long wait scenarios. We’re keeping transactions short, we use optimistic locking strategies, and we just don’t have long waits. In our experience, typical implementations of data warehouse automation software have 2-10 users. We’ve tested Ajilius with up to 20 simulated users of the one repository, and it has worked just fine, and we’ve got higher volume tests scheduled for later in the development cycle.

So there it is. A modern software stack that lets us deliver data warehouse automation any where, with minimal dependencies and operational overheads for our customers.