Column compromises

Sometimes you need to trade off one feature against another. A good example is that of column names, where we trade off size against portability.

Column names, more generally described as identifiers, have their permitted length and format vary by DBMS. Lengths range from Oracle’s 30 characters, to 63 characters in PostgreSQL, and 128 characters in SQL Server, Impala and Vertica.

An early design decision for Ajilius was that Oracle was not an important target platform for us. In fact, across the data warehouse automation market, we are not aware of any vendor that prioritises Oracle, nor has a significant volume of customers on that platform.

PostgreSQL, however, is a very important platform for Ajilius, and the 63 character limit in column names became our maximum. Should a source system column name exceed 63 characters, it will be truncated to this limit.

When truncating column names, or when merging tables during the staging process, it is possible that duplicate column names may exist in the source metadata. For example, if you were building a combined table from ‘order’ and ‘order_line’ tables, it is probably that an attribute named ‘order_id’ would exist in both of them. Our strategy for dealing with duplicate names is to add a unique suffix of the format ‘_nn’, where ‘nn’ is an incrementing number, to make the column name unique. We would probably then edit the metadata for the table to delete the column flagged as a duplicate.

When considering identifier names, we also had to take into account the issue of reserved words. When we import metadata from a data source it may have column names that are reserved words in the target DBMS. There are two strategies for dealing with reserved words in Ajilius: prefixing and delimiting.

Ajilius maintains a combined dictionary of reserved words for all target databases. When importing metadata from a source, we prefix the string ‘reserved_’ to any metadata column name which matches a reserved word. This makes it easy to spot conflicting column names, and choose better alternatives when editing the metadata. There are over 640 reserved words across our target databases. The longest reserved word is ‘current_transform_group_for_type’, which is 32 characters, so prefixed names still fit comfortably within our limit.

Most data warehouse platforms support delimited identifiers. These typically use quotes, or character pairs like [], to mark the start and end of the identifier name. While this method works, we’re not great fans of its cluttering of the DDL and DML, nor forcing database users to write queries that include delimiters. That said, we do offer a user option to use delimited identifiers in the warehouse.

PostgreSQL had more influence on identifier names through its quirk of converting non-delimited identifiers to lower case. If you keep the default option to use non-delimited column names, we will force all identifiers to lower case. Source column names will have embedded spaces (and special characters) converted to an underscore. Camel case names such as ‘OrderShipDate’ will be automatically translated to ‘order_ship_date’.

By understanding the constraints of our target data warehouse platforms, we are able to keep our promise of data warehouse portability.