We’re introducing data quality screens in V2.1, to be released at the pgDayAsia conference in March. In this release, data quality screens implement data type, range and expression testing on selected tables and columns.
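To make the three kinds of test concrete, here is a minimal sketch in Python of what a type, range and expression screen on a single value might look like. It is an illustration only, not the Ajilius implementation: the function `screen_value` and all of its parameters are hypothetical names invented for this example.

```python
from typing import Any, Callable, List, Optional


def screen_value(
    value: Any,
    expected_type: type,
    minimum: Optional[Any] = None,
    maximum: Optional[Any] = None,
    expression: Optional[Callable[[Any], bool]] = None,
) -> List[str]:
    """Return the names of the checks a value fails (empty list means clean).

    Hypothetical helper for illustration; not part of any real product API.
    """
    # Data type test: stop here, since range and expression checks
    # are not meaningful on a value of the wrong type.
    if not isinstance(value, expected_type):
        return ["type"]

    failures: List[str] = []

    # Range test: the value must lie within [minimum, maximum] when bounds are given.
    if (minimum is not None and value < minimum) or (
        maximum is not None and value > maximum
    ):
        failures.append("range")

    # Expression test: an arbitrary rule that must evaluate to True.
    if expression is not None and not expression(value):
        failures.append("expression")

    return failures
```

For example, `screen_value(150, int, minimum=0, maximum=100)` reports a range failure, while `screen_value(42, int, minimum=0, maximum=100)` comes back clean.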
We spent the last week running performance tests to measure the overhead these screens add.
On average, we currently process 500,000 values per second.
For example, if your daily fact extract has 10 million rows, and each row comprises 20 columns, you have a total of 200 million values to screen. At 500,000 values per second, that is 400 seconds, so a full data quality screen of every column would add just under seven minutes to your batch.
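If you want to estimate the overhead for your own extract, the arithmetic is easy to script. The sketch below assumes the 500,000 values per second average quoted above holds for your data; the function name `screening_overhead_seconds` is made up for this example, and real throughput will vary with hardware and the screens you select.

```python
def screening_overhead_seconds(
    rows: int, columns: int, values_per_second: float = 500_000
) -> float:
    """Estimate the seconds added by screening every column of every row.

    The default rate is the average quoted in this post, used here as an assumption.
    """
    if values_per_second <= 0:
        raise ValueError("values_per_second must be positive")
    total_values = rows * columns  # every column of every row is screened
    return total_values / values_per_second


# The example from this post: 10 million rows of 20 columns.
seconds = screening_overhead_seconds(10_000_000, 20)
print(f"{seconds:.0f} s, about {seconds / 60:.1f} min")  # 400 s, about 6.7 min
```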
To put that number in perspective, 10 million is roughly the number of bets placed at the TAB on Melbourne Cup day. Or around the number of sales transactions done by a major department store chain in one week. Or the total number of motor vehicles sold worldwide in one year by Toyota Motor Corporation. In other words, 10 million rows is a LOT of data, and we’re going to validate all of it with full data screening, guaranteeing that no bad data is loaded from any row or column in your extract, in less than 10 minutes.
And even though I wrote “Wow!” about that validation, we are working to make it even faster by release day.
Ajilius. Putting the quality in data warehouse automation.