Data processing concepts

Before you upload data to a Bunker, it's a good idea to know how it will be processed (and your options for configuring that processing).

Understanding these concepts will help you get the maximum value from your data. 

Your Bunker will analyse your uploaded data, and can often generate most of the necessary configuration automatically. So, while it's a good idea to understand what it's doing, you probably won't need to work step-by-step through everything we discuss below.

Normalising data

To generate aggregated statistics, InfoSum's software matches your data to your partner's, identifying which rows in each dataset refer to the same people. But almost certainly, you and your partner use different schemas - so the datasets you have won't match up directly.

To solve this problem, you need to normalise your data. When you normalise your data, you map it onto InfoSum's standard schema, developed by our data scientists to reflect common practices in the industry.

You and your partner both need to normalise your own datasets, working entirely within your separate Bunkers. You will never need to access each other's data.

Categories

The first step in normalising your data is to apply categories. When you categorise your data, you tell your Bunker what each column in your dataset means.

As a rough approximation, you can think of each category as a column in InfoSum's standard schema. But don't take that too literally - as we will see, categories have many capabilities that ordinary database columns don't.

InfoSum's standard schema has a pre-defined set of categories, reflecting information commonly found in our customers' datasets. Here are just a few examples of pre-defined categories:

  • Name
  • Age
  • Market Value Of Property
  • Date Of Vehicle Insurance Renewal

You can also use "custom" categories for data which doesn't fit in any of the pre-defined categories.

Several columns in your original data can map onto a single category. For example, you might have individual columns for Street Address, City and Postal Code (or zip code). All of these together would map onto a single category, Address. Your Bunker takes care of retaining and interpreting the address's internal structure.
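The many-to-one mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not InfoSum's configuration format: the column names, the dictionary layout and the group_by_category helper are all hypothetical.

```python
# Hypothetical illustration only: this is not InfoSum's configuration format.
# Each original column name maps to the category it belongs to.
column_to_category = {
    "first_name": "Name",
    "street_address": "Address",
    "city": "Address",
    "postal_code": "Address",
    "age": "Age",
}


def group_by_category(mapping):
    """Return {category: [columns]}, keeping the original column order."""
    grouped = {}
    for column, category in mapping.items():
        grouped.setdefault(category, []).append(column)
    return grouped


categories = group_by_category(column_to_category)
print(categories["Address"])  # ['street_address', 'city', 'postal_code']
```

Three separate columns end up under the single Address category, while Name and Age each take one column.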

Properties

For a few categories, you can configure properties to help your Bunker understand your original schema.

For example, we've already said that you might have several columns which together represent an Address. In most countries, addresses include a postal code (or zip code), which narrows the location down to a small area, and in some countries to a single street.

So the Address category comes with a Postal Code property, which tells your Bunker which of the original columns contains the code. When you map a group of columns to the Address category, you also configure that property.
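As a rough sketch of how a property relates to the columns in a category, consider the Python below. The CategoryMapping class, its field names and the validate method are assumptions made for illustration; they don't reflect how a Bunker stores its configuration.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryMapping:
    """Hypothetical model of one category and the columns mapped to it."""
    category: str
    columns: list
    # Property name -> the original column that plays that role.
    properties: dict = field(default_factory=dict)

    def validate(self):
        """Check that every property points at a column in this mapping."""
        for name, column in self.properties.items():
            if column not in self.columns:
                raise ValueError(
                    f"Property {name!r} refers to unmapped column {column!r}"
                )
        return True


address = CategoryMapping(
    category="Address",
    columns=["street_address", "city", "postal_code"],
    properties={"Postal Code": "postal_code"},
)
address.validate()
```

The property adds no new data. It only says which of the already-mapped columns holds the postal code.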

Transformations

Sometimes, your Bunker may need more help mapping your original columns onto your chosen category. For example, perhaps you have a "customer number" column, and the customer number contains the customer's date of birth. You can map that onto the Date Of Birth category - but your Bunker won't know how to extract the date from the customer number.

Transformations are the solution. For each category, you can configure a series of changes - such as "remove the first letter" or "change this word into that one" - which your Bunker will apply to your original data.

You can set up simple transformations using an online wizard. For more complicated tasks, you can write transformation scripts using InfoSum's specialist programming language, the Data Transformation Language (DTL).
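To make the idea of "a series of changes" concrete, here is a minimal sketch in Python. It is not DTL, and it doesn't show the wizard's output. The customer number format ("C" followed by the date of birth as YYYYMMDD, then a suffix) and all function names are hypothetical.

```python
from datetime import datetime

# Python sketch of a transformation chain. This is NOT DTL syntax.
# Assumed (hypothetical) customer number format: "C19850214-0042".


def remove_first_letter(value):
    return value[1:]


def take_first_eight(value):
    return value[:8]


def parse_compact_date(value):
    return datetime.strptime(value, "%Y%m%d").date()


# The steps run in order; each receives the previous step's output.
DATE_OF_BIRTH_STEPS = [remove_first_letter, take_first_eight, parse_compact_date]


def apply_steps(value, steps):
    for step in steps:
        value = step(value)
    return value


dob = apply_steps("C19850214-0042", DATE_OF_BIRTH_STEPS)
print(dob.isoformat())  # 1985-02-14
```

Each step is small and does one thing, which is what makes a chain of simple changes enough to extract the date of birth from the customer number.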

Representations

As each piece of data is mapped into a category, it's automatically converted into one or more representations. Every category has at least one representation, and many have more than one.

At its simplest, a representation is a standardised way of formatting values. For example, you and your partner might use different date formats in your original data, but the standardised representation of a date will be the same for both of you.
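The date example can be sketched as follows. ISO 8601 is used here only as a stand-in for the standardised representation (the format a Bunker actually uses isn't specified in this guide), and the to_standard_date helper is hypothetical.

```python
from datetime import datetime


def to_standard_date(value, source_format):
    """Parse a date in its original format and return it as YYYY-MM-DD.

    ISO 8601 is an assumed stand-in for the standardised representation.
    """
    return datetime.strptime(value, source_format).date().isoformat()


# You store dates as day/month/year; your partner uses month-day-year.
yours = to_standard_date("14/02/1985", "%d/%m/%Y")
partners = to_standard_date("02-14-1985", "%m-%d-%Y")

print(yours == partners)  # True
```

Once both values are in the same representation, they can be compared directly even though the original formats differed.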

But some representations do more than that - reflecting interpretation or analysis of an item of data. For example, the category called Job Title has representations called Fine and Granular. In this way, the complex and multi-faceted information contained in a job title is broken down into specific details, which can then be accessed in queries and reports.

Personally Identifiable Information (PII)

Personally Identifiable Information (PII) is data which can identify an individual person. Regulators expect you to treat PII with particular care. Within your Bunker, PII is special for two reasons.

Firstly, it's PII which lets your Bunker and your partner's Bunker match up rows which represent the same individual. As we'll see below, the Bunkers use advanced techniques to compare the data without actually revealing the PII to each other.

Secondly, your Bunker ensures that PII can't be revealed in reports or queries. This is an important way to stay within the law: because your Bunker won't reveal PII, you can be confident that you are only sharing anonymous, aggregated information.

When you categorise your data, you're also helping your Bunker understand what is and isn't PII. For example, data categorised as a Name is obviously PII, whereas data categorised as an Occupation isn't.

Some categories use representations to present PII and non-PII versions of the data. For example, a full address is clearly PII, but the town or city a person lives in isn't PII. So the Address category has a Post Town representation, which hides most of the address and reveals only the name of the town or city. Even though your Bunker won't allow access to the full address, you can access the Post Town representation.
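One way to picture this is to give each representation a flag saying whether it is PII, and to expose only the unflagged ones. The Python below is a sketch under that assumption. The "Full" representation name, the dictionary layout and both helper functions are hypothetical, and the address is made up.

```python
def address_representations(address):
    """Build hypothetical PII and non-PII representations of an address."""
    full = ", ".join([address["street"], address["town"], address["postal_code"]])
    return {
        "Full": {"value": full, "pii": True},
        "Post Town": {"value": address["town"], "pii": False},
    }


def reportable(representations):
    """Keep only the representations that are not PII."""
    return {
        name: rep["value"]
        for name, rep in representations.items()
        if not rep["pii"]
    }


reps = address_representations(
    {"street": "1 Example Road", "town": "Southampton", "postal_code": "SO14 0AA"}
)
print(reportable(reps))  # {'Post Town': 'Southampton'}
```

The full address is still held, but only the Post Town value passes the filter.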

Keys

Once you've configured your categorisations and transformations, your Bunker generates keys for each row. Keys are anonymised versions of the row's PII, used solely to match up records in your and your partner's Bunkers.

Two Bunkers holding the same PII will generate the same key, so they can use their keys to identify records which relate to the same person. But it isn't possible to use a key to recover the original information. So, even in the unlikely event that a key were compromised, this wouldn't reveal the PII.
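This guide doesn't describe how keys are actually generated, so the Python below is only a sketch of the general idea, using a keyed one-way hash (HMAC-SHA-256) as an assumed technique. The shared secret, the normalisation rule and the make_key function are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret shared by both parties. An assumption for this sketch,
# not a description of how Bunkers are configured.
SHARED_SECRET = b"example-secret-agreed-out-of-band"


def make_key(pii_value, secret=SHARED_SECRET):
    """Return a one-way key for a PII value (sketch, assumed technique)."""
    # Normalise first, so trivial formatting differences don't change the key.
    normalised = pii_value.strip().lower()
    return hmac.new(secret, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


your_key = make_key("Jane.Doe@example.com ")
partner_key = make_key("jane.doe@example.com")
other_key = make_key("john.roe@example.com")

print(your_key == partner_key)  # True
print(your_key == other_key)    # False
```

The same normalised input always yields the same key, which is what allows matching, while the hash can't be run backwards to recover the input. The sketch uses a keyed hash rather than a plain one because a plain hash of a guessable value, such as an email address, can be reproduced by anyone who guesses the value.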