Working with multiple columns

How to use DTL to combine several columns from your original data into one category in InfoSum's Global Schema.

In the previous tutorial, we saw how to rewrite the value of a column using DTL's string functions. Now, we'll take it a step further: we'll use DTL to tweak the structure of your dataset, merging several columns into one.

The scenario

Let's suppose that your dataset contains each person's date of birth, with separate columns for the day, month and year. For example:

Name Day Month Year
Jane Smith 15 Jan 1973
John Doe 28 May 1986

 

To keep this tutorial simple, we'll assume that your data is "clean". So:

  • the year is always a 4-digit number
  • the month is always a three-letter abbreviation: Jan, Feb and so on
  • the day is always appropriate for the month (no 31 February or similar).

Now let's suppose that your collaborator's dataset contains their customers' ages, to the nearest year. So, to match up against their data, you need to work out each person's age from their date of birth.

What's more, you need to convert your original three columns - one each for the day, month and year - into just one column, containing each person's age. 

Introducing years_since

Fortunately, DTL offers a function to help with this task. It's called years_since, and as the name suggests, it calculates the number of complete years since some date in the past.

Here's an example of how you use it:

$age = years_since ("15-Jan-1973", "%d-%b-%Y")

You can see that years_since takes two parameters.

  • The first is the date you want to work with (in our case, the date of birth), represented as a single string.
  • The second is a format string, which explains how the date is written. We won't go into that here, because it's not the focus of this tutorial; you can read more about it in the DTL functions specification. For now, you just need to know that the format string "%d-%b-%Y" corresponds to the style we've shown, 15-Jan-1973.

So, the good news is that years_since is perfect for converting the date of birth you have into the age you need. But there's a problem: years_since expects the date to be written as a single string, and at the moment you have it split into three separate columns.

Let's see how to fix that.

Step 1: Assign all the columns to a category

Before you can begin writing any DTL, we need to assign the three date-of-birth columns to a single category. We've already talked about this process in the Bunker and the Script Editor - but this is a complex example, so let's walk through it step by step.

The secret is to assign all three of your original columns (Day, Month and Year) to a single category, Age. You do that like this:

  • Click the settings button next to one of the original columns (Day, say).
  • Hold down the SHIFT key.
  • Click the settings button next to the remaining two columns (Month and Year).
  • Only then, select Assign category from the menu and assign them to the category called Age.

Once you've done this, you'll see a lot of red warnings saying Age category expects 1 or 2 column(s) but got 3 instead. That's fine - it's exactly the problem we're about to fix using DTL.

Step 2: Calculate the age

As we saw above, to pass the date into years_since, we need to write it in a particular format: day-month-year, with the components separated by hyphens.

If you've worked through the previous tutorial, this should be simplicity itself. We just need to use the `backtick`syntax to get the values of each column, and the + operator to join strings together. Like this:

$date_string = `Day` + "-" + `Month` + `Year`

Now that we have the date in the format we need, we can use the years_since function to calculate the age, just as we described above. Let's stash the result in a variable called $age:

$age = years_since ($date_string, "%d-%b-%Y")

Step 3: Store the result in a new column

We've now got the value we want in a variable - but DTL variables disappear as soon as the script finishes. To "save" your result, you have to store it in a column. We'll make a new column, called Age, for that purpose.

Instead of creating a new column, you could re-use an existing one by overwriting its value. But creating a new column makes it clearer what's going on - particularly when you look at the results of your transformation in your Bunker's web-based UI.

The only way to make a new column in DTL is to copy an existing one. And, as always when we're changing the underlying data, we'll use a transformation to do the job.

The transformation we need is called copy. As you'll see from the DTL transformations specification, it takes two parameters, both strings: the name of an exiting column, and the name of a new column to create.

So here's how we copy the Day column into a new column called Age. (We chose Day more or less at random; we're going to overwrite the new column's contents, so we could have copied any of the three existing columns.)

copy ("Day", "Age")

With that done, we just need to use the set transformation to fill in the value we previously calculated. If you need a reminder of how the set transformation works, glance back at the previous tutorial.

set ("Age", $age)

Step 4: Remove the original columns

Although we now have the age stored in a column, we're not quite finished yet. If you've read the three rules to remember article, you'll know that our transformation has to end with the correct number of columns - otherwise our Bunker won't know which column contains the data we want to use.

In this case, we need to end up with just the one column, Age. So we have to delete the three original columns - Day, Month and Year - which, now we've calculated Age, we don't need any more.

We can use the drop transformation to do this. It's very simple: it takes just one parameter, which is the name of the column we want to delete.

Here's how we drop the three original columns Day, Month and Year:

drop ("Day")
drop ("Month")
drop ("Year")

The full script

Pulling it all together, here's the complete script.

// Build a string representing the date of birth
$date_string = `Day` + "-" + `Month` + `Year`

// Calculate the age from the date of birth
$age = years_since ($date_string, "%d-%b-%Y")

// Make a new column to contain the result
copy ("Day", "Age")

// Store the result in the new column
set ("Age", $age)

// Delete the original columns
drop ("Day")
drop ("Month")
drop ("Year")

Try it out!

You can run the script yourself using our tutorial data set, which you can download as a CSV file (coming soon). Here's how to try it out:

  • Just as you did in the previous tutorial, create a project and upload the tutorial data to a Bunker, accepting the recommended categories.
  • Refer back to Step 1 above, and assign the three columns Day, Month and Year to the single category Age.
  • Every row will be showing red warnings, because the data you've selected doesn't yet meet the requirements for the Age category. That's what our script will fix.
  • Using the drop-down on any of the original three columns, select Transformations. Switch to the Editor tab and paste the sample code in, then press Save.
  • See how there's a new Age column added by your transformation, and that the three original columns have been removed - they'll simply say Dropped.
  • See how the red warnings have disappeared, and that each person's age is correctly recorded. If you want to to check, you can hover your mouse over the original columns (where it says Dropped) to see the pre-transformation values.

What's next?

We've seen now how a DTL script can combine several columns into one. In the third and final tutorial, we'll do the opposite - using a regular expression to split a single original column into several parts. Read tutorial three now.