I’ve recently been trialling a mix of analytics tools, including PowerBI, Qlik, Microstrategy, Tableau and others. One “feature” that has been annoying me is their attempt to automatically form relationships between tables, usually by matching column names. Bad idea.
I’ve had to spend far too long correcting models that have automatically matched the Product, Customer, Location, dimensions, simply because they had ID as the name of their surrogate key.
When I have a table named Customer, a suitable name for a surrogate key is ID. Not CustomerID. It is the Identity of the Customer, not the CustomerIdentity of the Customer.
When I reference that table from another, I might use role names such as PurchasingCustomerID or ReferringCustomerID. Role, table, column. Only in the most trivial implementations might I refer to it as simply CustomerID.
Further, there is no requirement that columns be uniquely named in a schema, nor that they follow any specific pattern. Especially when dealing with data models from legacy business systems, the practice of automatic matching is counter-productive.
Automatic matching is particularly wrong in dimensional modelling, where the use of surrogate keys is standard practice. One of the properties of a surrogate key is that it is a meaningless identifier. From a business perspective, my view of the CustomerID might be “BIGSTORE001”, and the surrogate value “3498” has no value at all.
I suggest that if you can’t determine a relationship from the existence of a foreign key, then you should probably stop trying. Build a flexible relationship editor and skip the automatic matching, as I’m going to have to spend more time finding and fixing tool errors than building it properly in the first place.
It causes me to wonder how many complex analytical models have errors introduced by this practice?
Automatic relationship management may be a feature which demonstrates well, based on carefully selected sample data, but which fails in the real world. I’d like to see it disappear.