How to normalize data 2.0
Once you’ve created your recordset, the next step is to prepare your data for publishing. We call this process data normalization. During normalization, you can modify your original columns, and the imported data is mapped onto our Global Schema, which ensures that all parties using the InfoSum platform have their data in the same format.
Table of contents
What happens during normalization?
Normalizing an Insights or Activation Bunker
1. Select the recordset you want to normalize
2. Configure your normalization settings
a. Map your data to the Global Schema and assign keys and data types
b. (optional) Modify your columns, and map addresses (US only)
c. Remove columns you do not need to normalize
d. Save your Normalization config for future use
What happens during normalization?
Normalization plays an important role in ensuring the security of your data. During normalization, direct identifiers (such as emails) are irreversibly converted to pseudonymized, salted keys. The process first lowercases each value and removes any leading or trailing spaces, then converts the raw PII to a SHA-256 hash, which is then further encrypted and salted. By the end of the normalization process, no translatable identifier information is stored within an InfoSum Bunker.
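The order of operations described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function names and the salt value are hypothetical, and a keyed hash (HMAC) stands in for the platform's own encryption and salting step, whose details are not covered in this article.

```python
import hashlib
import hmac


def standardize(raw: str) -> str:
    """Remove leading/trailing spaces and lowercase the value."""
    return raw.strip().lower()


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as hexadecimal text."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def pseudonymize(raw: str, salt: bytes) -> str:
    """Standardize, hash, then apply a salted keyed hash.

    The HMAC step is an assumption used for illustration only; it is
    a stand-in for the platform's further encryption and salting.
    """
    digest = sha256_hex(standardize(raw))
    return hmac.new(salt, digest.encode("ascii"), hashlib.sha256).hexdigest()


# Hypothetical salt, for demonstration only.
salt = b"hypothetical-salt"
key_a = pseudonymize("  Jane.Doe@Example.com ", salt)
key_b = pseudonymize("jane.doe@example.com", salt)
print(key_a == key_b)  # both inputs standardize to the same value
```

Because standardization happens before hashing, two differently formatted copies of the same email produce the same key, which is what makes matching across datasets possible.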
For details of any formatting required for your raw data, see data formatting for normalization.
We recommend bunkering identifier data in its raw format wherever possible to avoid discrepancies in standardization across partners.
Normalizing an Insights or Activation Bunker
You will use the same UI to normalize both types of Bunkers. One normalization task can be published in both an Insights and an Activation Bunker.
For activation use cases, you will likely need to have the same identifiers in both an Insights and an Activation Bunker. At the prepare stage, you can select different columns to publish to each Bunker (e.g. the Activation Bunker most likely doesn’t need attribute information).
If you want to normalize data for an Activation Bunker, you will need to mark at least one key column as exportable by toggling the ‘export col’ column (the second-to-last column on the screen) during normalization. An export column is retained in its original form during normalization so that the results of an activation query can be exported. For example, the export column might contain a customer number or an email address that will be used during activation.
If you don’t select any export columns, the normalized file can only be published to an Insights Bunker.
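The export column rule can be expressed as a small check. The sketch below is illustrative only: the `Column` class and `publishable_bunkers` function are hypothetical names, not part of the platform, and they simply encode the rule stated above.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical description of one column in a normalization config."""
    name: str
    is_key: bool = False
    is_export: bool = False


def publishable_bunkers(columns):
    """Return the Bunker types a normalized file could be published to.

    An Activation Bunker needs at least one key column that is also
    marked as an export column; otherwise only Insights is available.
    """
    targets = ["Insights"]
    if any(col.is_key and col.is_export for col in columns):
        targets.append("Activation")
    return targets


columns = [
    Column("email", is_key=True, is_export=True),
    Column("age_band"),
]
print(publishable_bunkers(columns))
```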
Steps to normalize data
1. Select the recordset you want to normalize
To begin the normalization process, go to your Cloud Vault and select the recordset you want to normalize.
On the right-hand side of the screen, a details panel will appear. Click the Normalise button.
2. Configure your normalization settings
The platform will now ask you if you’d like to reuse an existing configuration or if you’d like to create a new one.
If you’re normalizing an updated version of a file you’ve previously imported, reusing an existing configuration is the fastest way to get your data normalized. Select the name of the saved configuration from the dropdown and give the output a name that is relevant to this normalization task.
If the data you're bringing closely matches an existing configuration, you can click Modify Configuration and edit that configuration to fit your new data.
This article explains how to edit an existing configuration and save it as a new config.
If you’re bringing in new data or are unsure, click the Create new Config button and then Continue to column selection.
There are four steps to creating a brand new normalization configuration: mapping to the Global Schema and identifying keys, making any data modifications (optional), removing columns you do not need, and saving your normalization configuration.
Normalizing using JSON
Instead of going through the dropdown UI, you can edit the JSON file for that normalization, but we only recommend this option for advanced users. Please reach out to your InfoSum representative for more information. If our support team is helping you with a complex normalization config, they will likely give you a JSON file that you can paste into the editor.
1. Map your data to the Global Schema and assign keys and data types
For more information on the normalization process, the Global Schema, and how best to format your data, please read our data formatting for normalization support article.
Using the Global Schema
On the mappings screen, the platform will automatically assign a column to the Global Schema and mark it as a key wherever it recognizes the column name. For any missed or incorrect mappings, you can click the gray/blue pencil icon and correct the assignment. Some columns may require an additional mapping (e.g. postcode in the UK).
You can find a list of the Global Schema keys here
Assigning PII as a Key
A key is an irreversibly encrypted identifier that is used for matching or activation. To set a column as either a key or a category, select the toggle as appropriate under the heading Key.
Please note that all columns marked as keys will be automatically standardized with two modifications: trim whitespaces and lowercase.
Using Custom Keys
When normalizing your data, the names of the keys used for joining to your collaboration partners' datasets need to be consistent across the different datasets. The purpose of the Global Schema is to standardize these names and make joining simpler. However, if you're uploading a custom key, you will need to ensure the names are exactly the same across all datasets.
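Before sharing a custom key name with a partner, it can help to check for near misses such as a difference in capitalization. The sketch below is a hypothetical helper, not a platform feature: it assumes you have each dataset's key names as plain lists and reports names that match only after trimming and case-folding.

```python
def key_name_near_misses(ours, theirs):
    """Find key names that look the same but are not exactly equal.

    Returns (our_name, their_name) pairs that match only once spaces
    are trimmed and case is ignored, so would not be treated as the
    same custom key.
    """
    theirs_by_folded = {name.strip().casefold(): name for name in theirs}
    near_misses = []
    for name in ours:
        other = theirs_by_folded.get(name.strip().casefold())
        if other is not None and other != name:
            near_misses.append((name, other))
    return near_misses


# Hypothetical custom key names from two datasets.
print(key_name_near_misses(["loyalty_id", "email"], ["Loyalty_ID", "email"]))
```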
Note for Activation Bunkers
If you’d like to use this dataset for an Activation Bunker, you’ll need to set the output columns using the toggle under the heading Export columns. Remember that a key is used for joining, but a column needs to be set as an export column if it needs to be exported from the platform.
Assign/confirm data types
Confirm that your data type is registered correctly. If you have mapped keys to the global schema, they will be automatically categorized as the right data type.
- String: a combination of letters (with optional numbers or symbols) that will be treated as a word. Most of your attributes will be strings unless they are related to time, currency or other numbers.
- Multi-value string: a column in which each cell contains multiple independent string data points (e.g. two emails in one cell). The platform will identify this automatically if the cell contains multiple values.
- Integer: a whole number with no decimal point.
- Float (decimal): a number that has a decimal point.
If your data includes date/time values, please use the modifications functionality to parse them (see next section).
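To check your data types before importing, you can classify a sample of each column yourself. The sketch below is only an approximation of the four types listed above and is not how the platform detects them: the `infer_type` function is hypothetical, and the `|` delimiter for multi-value cells is an assumption made for illustration.

```python
import re

INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?([0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)")


def infer_type(values, delimiter="|"):
    """Classify a column sample as integer, float, multi-value string or string.

    The delimiter used to spot multi-value cells is an assumption.
    Empty cells are ignored; an all-empty column is treated as string.
    """
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return "string"
    if all(INTEGER.fullmatch(c) for c in cells):
        return "integer"
    if all(DECIMAL.fullmatch(c) for c in cells):
        return "float"
    if any(delimiter in c for c in cells):
        return "multi-value string"
    return "string"


print(infer_type(["12", "7", "-3"]))
print(infer_type(["a@example.com|b@example.com", "c@example.com"]))
```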
2. (optional) Modify your columns and use Address mapper (US only)
You also have the option to apply some basic column modifications to ensure that the data you publish to a Bunker is in the most useful format for your intended use case. For example, you might create a new multi-value column from the input of multiple columns or, if your data includes date/time values, use this function to parse them.
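As an illustration of the first example, the sketch below combines several source columns into one multi-value cell. It shows the idea only and does not reproduce the platform's modification: the `combine_columns` function, the column names and the `|` delimiter are all hypothetical.

```python
def combine_columns(row, source_columns, delimiter="|"):
    """Join the non-empty values of several columns into one cell.

    Values are trimmed, blanks are skipped and repeats are dropped,
    keeping the first occurrence. The delimiter is an assumption.
    """
    combined = []
    for column in source_columns:
        value = (row.get(column) or "").strip()
        if value and value not in combined:
            combined.append(value)
    return delimiter.join(combined)


# Hypothetical input row with two email columns.
row = {"email_1": "a@example.com", "email_2": " b@example.com "}
print(combine_columns(row, ["email_1", "email_2"]))
```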
Please note that all columns marked as keys will be automatically standardized with two modifications: trim whitespaces and lowercase.
This article lists the modification options and how to apply them.
To use the US address mapper, please follow the instructions in this article.
3. Remove columns you do not need to normalize
During normalization, you might have created new columns and no longer need your original columns. Please remove any columns that aren’t needed by clicking the bin icon at the end of each line.
You can recover any columns you’ve deleted by clicking the black + button at the top and selecting ‘re-add input columns’.
4. Save your Normalization config for future use
On the final screen, the platform will ask you to name the configuration you’ve just created (which will allow you to skip the previous three steps next time and automate your data onboarding) and to declare an output name, which can be viewed on the next screen.
If any of your columns contains date/time values without timezone information, you will need to confirm the timezone for your data at this stage.
If any of your columns are mapped to a phone number Global Schema key, you will need to select the country code at this stage. We can only support one country code override for the whole dataset.
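The reason a timezone has to be confirmed is that a date/time value without an offset is ambiguous. The sketch below shows the general idea using Python's standard library: a value that already carries an offset keeps it, while a value without one receives the default you supply. The function name and sample values are hypothetical, and this is not the platform's own parsing logic.

```python
from datetime import datetime, timezone


def parse_with_default_tz(text, fmt, default_tz):
    """Parse a date/time string and attach a timezone if it has none."""
    parsed = datetime.strptime(text, fmt)
    if parsed.tzinfo is None:
        # No offset in the source value, so apply the confirmed default.
        parsed = parsed.replace(tzinfo=default_tz)
    return parsed


naive = parse_with_default_tz(
    "2024-03-01 09:30:00", "%Y-%m-%d %H:%M:%S", timezone.utc
)
aware = parse_with_default_tz(
    "2024-03-01 09:30:00+0100", "%Y-%m-%d %H:%M:%S%z", timezone.utc
)
print(naive.isoformat())
print(aware.isoformat())
```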
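Because only one country code can be applied to the whole dataset, it is worth checking how your phone numbers would look with that single code in place. The sketch below is a rough illustration under stated assumptions, not the platform's behaviour: the `apply_country_code` function is hypothetical, it assumes numbers that already start with `+` keep their own code, and it does not handle national trunk prefixes.

```python
import re


def apply_country_code(raw, default_code):
    """Reduce a phone number to digits and prefix a country code.

    Assumption: a number that already starts with "+" carries its own
    country code; every other number receives the dataset-wide default.
    """
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    return "+" + default_code + digits


# Hypothetical values with "1" as the dataset-wide country code.
print(apply_country_code("(212) 555-0100", "1"))
print(apply_country_code("+44 20 7946 0958", "1"))
```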
Once you click Create new config and Normalize, the platform will take you to the Tasks screen where you can view the progress of the normalization process.
Next step
Once you've normalized your data in the platform, the next step is to Prepare and Publish your Data.