Applying regular expressions using match_regex

How to use DTL's most powerful feature, analysing and rewriting data using regular expressions.

In this last tutorial we'll look at match_regex, the most powerful transformation you'll find in DTL. With match_regex, you can leverage the full flexibility of regular expressions ("regexes" for short), using them to match and extract parts of a data field.

In a nutshell, the match_regex transformation:

  • looks at a column containing string data
  • uses a regex to match one or more parts of that column's value
  • creates new columns to hold those matched values.

You can then refer to the new columns later in your script - for example to perform if tests on the extracted values, or combine them in a different way to make a new key.

The scenario

To show you how match_regex works, let's imagine that you want to rewrite email addresses in a particular way. Your database contains email addresses like john@example.com and you need to rewrite them to be like example.com:john.

Original Rewritten
john@example.com example.com:john
jane.doe@gmail.com gmail.com:jane.doe
nobody@nowhere.org nowhere.org:nobody

 

Why would you ever want to do that? Well, perhaps you're matching against another company's database, which has a particularly unusual key. We admit, though, this is a fairly contrived use case - real-world examples tend to be a bit too detailed to make for a good tutorial.

We'll work through this tutorial in four steps:

  1. write a regex to break up the email address
  2. use match_regex to apply the regex
  3. stitch the components of the address back together in the new format
  4. clean up after ourselves (and we'll explain why we have to do that).

Step 1: Writing the regex

Email addresses make for a nice simple example, because they have a very predictable format.

  • There's always exactly one @ sign.
  • Before the @ sign comes what's technically called the local part.
  • After the @ sign comes the domain.

So it's simple to write a regex which captures the local part and the domain (and throws away the @ sign itself). Here it is:

(.*)@(.*)

There isn't room in this short tutorial to explain the detail of regexes. If the regex above doesn't immediately make sense to you, take a look at an online resource like this one.

Step 2: Using match_regex

As we said above, the match_regex transformation:

  • looks at a column containing string data
  • uses a regex to match one or more parts of that column's value
  • creates new columns to hold those matched values.

Here's the syntax:

match_regex ("original-column-name",
["new-column-name-1", "new-column-name-2",...],
"regex")

Here's how we use it in our example:

match_regex ("Email Address",
["Local Part", "Domain"],
"(.*)@(.*)")

We're taking the column called Email Address, and using it to extract two new columns called Local Part and Domain. The final parameter is the regex which we built in Step 1.

Technically speaking, that second parameter is an array of strings, where each string is the name of a new column. The number of strings in the array must equal the number of things captured by the regex - you'll get an error if it doesn't. If your regex is capturing more than you want it to, non-capturing groups might help.

As always, remember that when you're writing the name of a column, it goes inside regular "double quote marks". That goes for the name of the original column "Email Address", and the names of the new columns "Local Part" and "Domain" too.

Step 3: Stitching it back together

Now that we've split up the email address into new columns, it's simple to put them back together again in a different way. Remember, we're imagining that we want to rewrite the address to the format domain:local-part - so for example john@example.com becomes example.com:john.

Let's store the result back in the Email Address column. That just needs a set transformation, like we've used several times before, together with the + operator for string concatenation.

set ("Email Address", `Domain` + ":" + `Local Part`)

This time we're writing `Domain` and `Local Part` inside backticks, because it's the value of those columns we want to combine.

Step 4: Cleaning up after ourselves

We're nearly done, but there's one last detail to consider.

Remember how, in addition to the original column Email Address, we created new columns called Local Part and Domain? We need to delete those now, otherwise our Bunker won't know which of the three columns contains the result.

We had a similar situation back in the previous tutorial, and we saw how to use the drop transformation to delete a column. This time round, since we've stored the result back into the original Email Address column, we need to drop the two columns we created with match_regex. Like this:

drop ("Local Part")
drop ("Domain")

A few categorisers do expect to receive more than one column. For example, the Date of Birth categoriser expects three columns, for the year, month and day. See three rules to remember for more information.

The full script

Putting it all together, here's the complete script.

// Extract the local-part and domain, store them in new columns
match_regex ("Email Address",
["Local Part", "Domain"],
"(.*)@(.*)")

// Re-combine the components, store back in the email-address column
set ("Email Address", `Domain` + ":" + `Local Part`)

// Clean up temporary columns
drop ("Local Part")
drop ("Domain")

Try it out!

You can try out the script yourself using the Email Address column of the tutorial data set. If you've worked through the previous two tutorials, you'll know how to do this now, so we'll keep it brief:

  • Import the data set in the usual way.
  • Using the drop-down above the Email Address column, select Assign Category and assign the email address to Custom Key 1.
  • Now, using the drop-down again, select Transformations. Switch to the Editor tab and paste the sample code in.
  • Press Save and you'll see the rewritten results in the Email Address column.