Sigma Normalization
Once you’ve created your recordset, it is time to ready your data for publishing. We call this process data normalization. During data normalization, your original imported data is mapped onto our Global Schema. This addresses the obvious problem that two separate imported Bunkers are likely to use different schemas. There is also a range of UI-based tools available to tidy up any messy data.
This process plays an important role in ensuring the security of your data. During normalization, direct identifiers are irreversibly converted to pseudonymized, salted keys. For details of how imported data is mapped to the Global Schema on the InfoSum Platform, including any formatting required of your raw data, see normalization rules.
You will need to perform a series of steps to normalize your data. You can complete these tasks using your Bunker's web-based UI:
- assign columns to keys and categories
- assign keys and categories to the Global Schema
- define the data type
When normalizing your data, the names of the keys used for joining to collaboration partners’ datasets need to be consistent across the different datasets. The purpose of the Global Schema is to assist in the standardization of names to make joining simpler. However, if you’re uploading a custom key, you will need to ensure the names are exactly the same across all datasets.
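To illustrate why exact names matter for custom keys, the following sketch compares the join key names declared for several datasets and reports any name that is not shared by all of them. The dataset names, key names, and the `mismatched_keys` helper are hypothetical and are not part of the InfoSum Platform; the sketch assumes every listed key is one you intend to join on.

```python
def mismatched_keys(datasets: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per dataset, the key names not present in every dataset."""
    if not datasets:
        return {}
    # Names that appear in all datasets, compared exactly (case sensitive).
    shared = set.intersection(*datasets.values())
    return {
        name: keys - shared
        for name, keys in datasets.items()
        if keys - shared
    }


# Hypothetical key names from two collaboration partners.
declared = {
    "brand_dataset": {"email", "loyalty_id"},
    "publisher_dataset": {"email", "Loyalty_ID"},
}

report = mismatched_keys(declared)
print(report)
```

Here the two custom key names differ only in letter case, so both are reported as mismatched, while the shared `email` name is not.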
If you are importing data to be published to an activation dataset, you will also need to select at least one output column to make it available in identity queries. See assign output data for the task steps.
Note that you can supply either raw or SHA256-hashed data. If raw data is supplied, it will be hashed (using SHA256) and salted during normalization. If SHA256-hashed data is supplied, a salt will be added during normalization. Because both routes are supported, you can collaborate with other Bunkers that may have used the alternative setup. However, to increase match rates, we advise supplying raw identifiers: SHA256 hashing is case sensitive, and InfoSum’s normalizer standardizes the case of raw values, which is not possible for values that arrive already hashed.
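The case-sensitivity point can be shown with a short sketch using Python's standard `hashlib`. The `standardize` function below is an assumption made for illustration (trim and lowercase); it is not InfoSum's actual normalizer, and the platform's salting step is deliberately left out.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA256 hex digest of a UTF-8 encoded string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def standardize(value: str) -> str:
    # Assumed standardization for illustration only: trim and lowercase.
    return value.strip().lower()


# The same email address, as two parties might have recorded it.
raw_a = "Alice@Example.com"
raw_b = "alice@example.com"

# Hashing before any standardization: the digests differ, so no match.
prehashed_match = sha256_hex(raw_a) == sha256_hex(raw_b)

# Standardizing the raw values first: the digests are identical.
standardized_match = sha256_hex(standardize(raw_a)) == sha256_hex(standardize(raw_b))

print(prehashed_match)     # False
print(standardized_match)  # True
```

If you do supply pre-hashed data, applying consistent casing before hashing on your side avoids this kind of mismatch.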
To begin the normalization process, go to your Cloud Vault and select the recordset you want to normalize. A details panel will appear on the right-hand side of the screen. Click the Normalise button.
The platform will now ask whether you’d like to reuse an existing configuration or create a new one. If you’re normalizing an updated version of a file you’ve previously imported, using an existing configuration can be a fast way to get your data normalized. Select the name of the saved configuration from the dropdown and give the output a name that is relevant to this normalization task.
However, if you’re bringing in new data or are unsure, click the Create new Config button and then Continue to column selection.
There are three steps to creating a brand new normalization configuration. On the first page, you’re asked which columns you want to normalize. Note that you can’t add columns later on, but you can choose which columns to publish when you publish your data, so if you’re unsure it is better to select more at this stage. If you’re 100% sure which columns you need, select only these (as having more columns will mean a longer normalization task time). The platform will automatically select all columns from the recordset for you, but you can deselect the Use all columns for selection toggle and manually select the columns you want. The columns you’ve selected will be displayed in the right-hand box. Once you’re happy with the selection, click Continue to mapping.
On the mappings screen, the platform will automatically assign columns to the Global Schema where it recognizes a column name. For any missed or incorrect mappings, you can click the gray/blue pencil icon and correct the assignment. Some columns may require an additional mapping (e.g., postcode in the UK). To set a column as either a key or a category, select the toggle as appropriate under the heading Key. If you’d like to use this dataset for an Activation Bunker, you’ll need to set the output columns using the toggle under the heading Output columns. Remember that a key is used for joining, but a column needs to be set as an output column if it needs to be exported from the platform. Finally, set the data type of the column as either string or integer.
NB: if you set any column as an output column, the platform will not allow you to publish categories to an Insight Bunker. Please only select output columns when publishing to an Activation Bunker.
On the final screen, the platform asks you to name the configuration you’ve just created (which will allow you to skip these three steps next time) and to declare an output name, which can be viewed on the next screen.
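The mapping settings and the output-column rule described above can be modeled as a small data structure. This is only a sketch for reasoning about a configuration before you build it in the UI: the `ColumnMapping` class, its field names, the example column names, and the `suitable_bunker` helper are all hypothetical and do not correspond to any InfoSum API or file format.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ColumnMapping:
    column: str             # column name in the recordset
    schema_field: str       # Global Schema field or custom name (hypothetical values)
    is_key: bool            # True = key (used for joining), False = category
    is_output: bool = False # True = column is exported from the platform
    data_type: Literal["string", "integer"] = "string"


def suitable_bunker(mappings: list[ColumnMapping]) -> str:
    """Simplified reading of the rule: output columns imply an Activation Bunker."""
    if any(m.is_output for m in mappings):
        return "activation"
    return "insight"


# Hypothetical configuration with one key, one category, and one output column.
mappings = [
    ColumnMapping("email_address", "email", is_key=True),
    ColumnMapping("age_band", "age", is_key=False),
    ColumnMapping("customer_ref", "customer_ref", is_key=False, is_output=True),
]

print(suitable_bunker(mappings))      # activation
print(suitable_bunker(mappings[:2]))  # insight
```

Keeping the output flag separate from the key flag mirrors the distinction in the UI: a key is used for joining, while an output column is what gets exported.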
Once you click Create new config and Normalize, the platform will take you to the Tasks screen where you can view the progress of the normalization process.
Once you’ve normalized your data in the platform, the next step is to Prepare and Publish your Data.