Take the SAS language challenge

Can your data warehouse automation solution correctly handle international character sets?

I was reading an interesting blog post from SAS at the weekend. It shows how they visualise data containing character sets from many languages around the world.

http://blogs.sas.com/content/sastraining/2017/05/12/map-of-idioms-from-around-the-world/

The blog discusses loading and visualising data that looks like this:

Given that the data is contained in a spreadsheet, I thought it would be fun to see how that data was handled by Ajilius.

Create a data source that points to the spreadsheet:

Ingest the metadata from the spreadsheet:

Run the load script, and here is the result:

There it is. Fast, full-fidelity representation of the SAS data. If you’re building a data warehouse in Korea, or Armenia, or Israel, or any other country with a non-ASCII language, then Ajilius is your best solution for data warehouse automation.

Ajilius. Data warehouse automation for the global community.

 

Terminal Emulators

When your software lives in the cloud, you get used to the world of terminal emulators. We run over 50 cloud servers on 5 different cloud platforms, and the right terminal emulator is a real boost to productivity.

When I was using OSX as my development platform, I used iTerm2 as my terminal. Far better than the standard Mac terminal program, it offers features like paned windows, search, autocomplete, hotkeys, and more.

On switching back to Windows I was stumped. PuTTY was the most often recommended terminal, but it just felt ancient after iTerm2, and I started looking for alternatives. After much searching, I found the wonder that is ZOC Terminal.

Tabbed windows! 3270! 5250! Windows and OSX! This product isn’t just a PuTTY replacement, but an Attachmate replacement as well.

The file transfer feature means that I never have to scratch my head trying to remember SCP syntax. There’s a full scripting language built in, which we’re using to automate some of our deployment processes. Think of something you want to do in a terminal session, and it is probably in the extensive feature list.

ZOC Terminal is a paid, commercial product. Its productivity and usability gains are worth every cent.

Strongly recommended, if you’re deploying Ajilius on a cloud / Linux environment.

Display Settings

Something often overlooked by new users is the ability to adjust the size of tables and editors to match their screen size and browser zoom level.

Not everyone likes small, squinty fonts … especially those of us on the wrong side of 50 … and Ajilius makes it easy to adjust.

Click on your user name (top right of navigation bar) and select the User Preferences option:

Then, use the Table Rows and Documentation Lines options to adjust the display. Table Rows adjusts the number of rows displayed in tables, and Documentation Lines adjusts the number of lines displayed in the User and Tech Notes fields.

Ajilius. Readable data warehouse automation.

Custom Driver Settings

This week we’re adding custom driver settings to Ajilius.

While our driver settings are chosen for optimal performance, there have been times when customers have asked for the ability to tailor connections to specific requirements.

We’re now supporting that request through two settings:

  • Connection Parameters. These are extensions to the connection string used by JDBC drivers. Specify connection string parameters in a single line, including delimiters after every parameter.
  • Driver Properties. These are JDBC properties which modify the behaviour of drivers through an API call instead of the connection string. Specify properties as a combination of property-value pairs, with one property per row.

 

As an example, the connection string for the Salesforce driver can be modified to pin the Salesforce API to version 36 and reduce the timeout setting to 30 seconds.
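As an illustration, the connection-string addition for such a tweak might look like this (the parameter names here are our assumptions, not the driver’s documented syntax — check the driver documentation for the exact format):

```
Connection Parameters:  APIVersion=36;Timeout=30;

Driver Properties:      APIVersion   36
                        Timeout      30
```

Note that Connection Parameters go on a single line with a delimiter after every parameter, while Driver Properties are entered one property-value pair per row, as described above.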

Please use custom driver settings with extreme care, as invalid settings could cause your ELT jobs to fail. We recommend you check with us about the specific format requirements of each driver before making changes.

Ajilius. Customisable automation.

 

SharePoint Data Warehouse

Reference data – the common lookup tables that drive many data warehouses – often has no clear home in an organisation. It usually gets dumped in spreadsheets, XML files or text files, and ends up in a mess. Recently we’ve found that SharePoint Online makes a great repository for DW reference data. It can be managed and maintained in a clean, multi-user environment, then neatly integrated into the data warehouse.

This was supposed to be a post about the relative merits of SharePoint and Google Sheets for the maintenance of reference data. Unfortunately, Google’s approach to browser-based OAuth authentication doesn’t play well on servers, so we’re temporarily shelving Google Sheets as an Ajilius data source until we figure out a workable solution.

On with SharePoint!

Adding a SharePoint Online data source to Ajilius is a breeze. Simply enter your site, user name and password:

sharepoint01

Refresh your SharePoint metadata, and you’ll see the libraries, lists and other data sources that can be loaded into Ajilius. Here we’ve created a simple list in which we can maintain details of public holidays.

sharepoint02

You’ll notice that there are many metadata columns added by SharePoint. In fact, we’re only interested in three columns, and we’ll clean that up once we’ve imported the metadata.

Import your metadata just like any other Ajilius data source, by selecting the Load Metadata option from the context menu for the SharePoint.PublicHolidays table. Here is the screen to complete the metadata:

sharepoint03

When added, you’ll see a screen like this, showing the columns for which metadata has been imported:

sharepoint04

Let’s clean up those extra columns we don’t need. Click the Change link at the top of the right-hand panel, and you’ll be taken to the column list.

sharepoint05

We’re going to use the Delete link, highlighted in the previous picture, to quickly delete a large number of columns from the table. When you click it, you will see a simplified form of the column mapper. Mark a column for deletion by changing its drop-down to Delete, either with the mouse or, in most browsers, by tabbing into the drop-down and pressing the ‘D’ key.

sharepoint06

When you’ve deleted the columns you don’t need, you should be left with Date, Store and Title columns. Clean up their metadata with some better descriptions, set Date and Store as the business keys for the table, and it should look like this:

sharepoint07

Next, use the Scripts option from the load_public_holiday context menu, and Create and Load your new table. Here is how your screen should look after loading the data.

sharepoint08

And when we view the data, we can see that it has been successfully loaded from our source in SharePoint Online.

sharepoint09

SharePoint Online makes a great multi-user tool for managing the small reference data sets that often get overlooked in the data warehouse governance process. Ajilius now includes a SharePoint connector as a standard feature, at no additional cost.

Ajilius. Now with SharePoint.

 

 

 

Persistent Staging Tutorial

It almost qualifies as a Frequently Asked Question in demonstrations:

How do you handle persistent staging?

Persistent Staging is typically used to keep data from source systems in its original state.

One common reason for doing this is to have a full set of source system data available to support future changes that you may wish to make to star schemas. Having the data already in the data warehouse makes it simple to recreate or modify the star schema at will.

There are two types of persistent staging supported by Ajilius:

  • Latest
  • Historic
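The two modes can be sketched in Python. This is a simplified illustration of the concept, not the Ajilius implementation; the table, key and timestamp column names are made up for the example:

```python
from datetime import datetime, timezone

def stage_latest(store, batch, key):
    """'Latest' persistent staging: upsert each source row,
    keeping only the most recent version per business key."""
    for row in batch:
        store[row[key]] = row          # overwrite any earlier version

def stage_historic(history, batch, key):
    """'Historic' persistent staging: append every version,
    stamped with its load time, so all past states are preserved."""
    load_ts = datetime.now(timezone.utc)
    for row in batch:
        history.append({**row, "load_ts": load_ts})

# Day 1: product 42 arrives priced at 10.00
latest, history = {}, []
day1 = [{"product_id": 42, "price": 10.00}]
stage_latest(latest, day1, "product_id")
stage_historic(history, day1, "product_id")

# Day 2: the price changes to 12.50
day2 = [{"product_id": 42, "price": 12.50}]
stage_latest(latest, day2, "product_id")
stage_historic(history, day2, "product_id")

# 'Latest' now holds one row per key; 'Historic' holds every version.
```

Either way, the source data stays in the warehouse, so a star schema can be rebuilt or extended later without going back to the source system.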

Read the tutorial to discover exactly how they work.

Persistent Staging Tutorial

Data Quality with Regular Expressions

Ajilius currently supports three types of data quality screens on data being loaded to the warehouse:

  • Data Type
  • Ranges
  • Regex

We’ve previously posted about type and range validation, but we recently had an enquiry about the use of Regular Expressions (regex) for data validation. Let’s build an example based on Postal Code validation.

The Person.Address table in AdventureWorks2014 contains a Postal Code column. Addresses are international, with many different formats of Postal Code. For the purposes of this demonstration, we are going to validate the column against Canadian Postal Codes. We’ll use a regular expression taken from the Regular Expressions Cookbook, 2nd Edition:

    ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
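You can test the pattern outside Ajilius before applying it as a screen. In Python, for example, it behaves like this:

```python
import re

# Canadian postal code pattern from the Regular Expressions Cookbook.
# The lookahead rejects the letters D, F, I, O, Q and U, which are
# never used in Canadian postal codes; the body enforces the
# A1A 1A1 shape, with an optional space in the middle.
CANADIAN_POSTCODE = re.compile(
    r"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$"
)

def is_canadian_postcode(value: str) -> bool:
    return CANADIAN_POSTCODE.match(value) is not None

# 'K1A 0B1' is valid; 'D1A 0B1' fails the lookahead; '90210' is US-style.
```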

We’ll start from the point where we have imported the Person.Address metadata into our repository:

regex1

Click Change in the right-side panel, then modify the postal_code column. Scroll down to the Data Quality section, and make the following changes:

regex2

 

Save your changes, go back to the Load List, and select the Scripts option for the load_person_address table.

Notice the new section of script which has been generated. This is the validator that will be applied to this column at load time.

    rs = new AjiliusResultSet (rs)
    rs.setValidator('postal_code','text','regex','^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$','1')

Now run the script, and watch the results at the bottom left of the screen:

regex3

As you can see, only 1090 rows passed the validation. And when we view the contents of the table, we see that these rows do match the Canadian Postal Code format defined in our regular expression:

regex4

A word of caution: regex validation is slower than type and range checking. Without regex validation, the same table loaded in 0.749 seconds; the difference was due entirely to the regex algorithm. If you have a choice, use range checking instead.

Ajilius. Better data screens.

 

Query Based Load Dependencies

The new Ajilius scheduler uses metadata to figure out the dependencies and run sequence of your ELT jobs.

Custom code, however, can currently break these dependencies.

Here’s an example, contrived from a customer query earlier today:

 

cblDependencies

 

In this query there is a dependency between the tables load.load_from_table and load.load_earlier_table. Unless load.load_earlier_table is populated before load.load_from_table is processed, duplicate rows may be selected.

In the long term we will use a SQL parser to extract these dependencies from custom queries, but we’re still working on the evaluation of parsers.

Meanwhile, these dependencies can be resolved through multiple steps in the scheduler. Here is an example of scheduler parameters that would ensure the correct sequence of loads in a full data warehouse refresh:

    -w <dwname> -b reset
    -w <dwname> -l load_earlier_table
    -w <dwname> -f all

This could also be written as:

    -w <dwname> -b reset -l load_earlier_table -f all

These commands will ensure that load_earlier_table is the first table processed by the scheduler, and it will be available when needed by load_from_table.
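The underlying idea is ordinary dependency ordering. A simplified sketch (not the Ajilius scheduler itself, and with made-up table names) shows how adding the manual edge guarantees the sequence:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies the metadata already knows, plus the edge the custom
# query introduces: load_from_table must run after load_earlier_table.
deps = {
    "load_earlier_table": set(),
    "load_from_table": {"load_earlier_table"},  # the manual edge
    "load_other_table": set(),
}

# static_order() yields each table only after all its predecessors.
run_order = list(TopologicalSorter(deps).static_order())
```

Until custom queries are parsed automatically, the scheduler steps shown above are the way to express that extra edge.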

Remember, any time you’re not sure how to do something using Ajilius, you are most welcome to contact us and discuss.

PostgreSQL message encoding

We ran into an interesting error this week, where a job was ending with an encoding error on the completion message.

Our first thought was that the database encoding did not match the value in the metadata, but no, both were set to UTF-8.

This is an Italian customer, and after much head-scratching, we discovered that the configuration value for lc_messages was set to ‘Italian_Italy.1252’. Ajilius expects messages returned by the server to be in the same encoding as the data warehouse.

In this case, the DBMS had been installed with settings from the Italian locale on the server, but the data warehouse database had been created with UTF-8 encoding.

We will now recommend that servers are initialised with initdb using the preferred encoding.

In this case, the customer changed the value of lc_messages in postgresql.conf, and the problem was resolved.
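In postgresql.conf, that change amounts to a line along these lines (the exact locale name depends on the platform; the value shown here is illustrative):

```
# postgresql.conf
lc_messages = 'C'        # or e.g. 'en_US.UTF-8'; was 'Italian_Italy.1252'
```

A reload of the server configuration is needed for the new value to take effect.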

http://www.postgresql.org/docs/9.4/static/locale.html

http://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-LC-MESSAGES

Delimited file formats

Delimited files, often referred to as “CSV” or text files, represent tables of data in a simple text format.

Rows are separated by newline characters, or by carriage-return and newline pairs.

Columns within rows are separated by delimiters. The delimiters supported by Ajilius are comma, tab, semi-colon and vertical bar (‘|’). All rows must contain all columns, although empty columns may be represented as consecutive delimiters.

Text fields containing delimiters must be enquoted using single (‘) or double (“) quote characters.

Where a text field contains a character used as a delimiter, that character must be preceded by an escape character, typically backslash (\). If the text field contains the escape character itself, it too must be preceded by an escape character (i.e. ‘\\’).

Delimited files must have column names contained in the first row of the file.

The file may be encoded in any format supported by Ajilius (link).

Detection of encoding, delimiters, quotation and escape characters is automatically performed on file access. You don’t need to specify anything other than the file name.
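In Python terms, the rules above correspond roughly to reading with settings like these (assuming a comma delimiter, double-quote enquoting and backslash escapes; Ajilius detects these for you, so this is only a sketch of the format):

```python
import csv
import io

# A file following the rules above: column names in the first row,
# text fields enquoted, and the quote character backslash-escaped
# inside a field.
data = 'name,comment\nAlice,"said \\"hi\\", then left"\nBob,plain\n'

reader = csv.reader(
    io.StringIO(data),
    delimiter=",",
    quotechar='"',
    escapechar="\\",
    doublequote=False,   # quotes are escaped, not doubled
)
header = next(reader)    # ['name', 'comment']
rows = list(reader)
```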

Ajilius. Flexible delimited file formats.