Normalizing a dataset
After you import a dataset to InfoSum Platform, it must be normalized (and then published) before it can be referenced in queries.
We call this process data normalization. During data normalization, your original imported data is mapped onto our Global Schema. This solves a common problem: two separately imported datasets are likely to use different schemas. A range of UI-based tools is also available to tidy up any messy data.
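To illustrate why a shared schema matters, the sketch below maps two datasets with different column names onto one set of shared category names. All column names and the mappings are hypothetical examples, not InfoSum's actual Global Schema or normalization mechanism.

```python
# Conceptual sketch only: shows why two imported datasets need a shared schema.
# The column names and category names below are hypothetical.

# Two imported datasets that describe the same attributes differently.
dataset_a = [{"e_mail": "ann@example.com", "yob": "1985"}]
dataset_b = [{"EmailAddress": "ann@example.com", "birth_year": 1985}]

# Per-dataset mapping from source columns to shared category names.
MAPPING_A = {"e_mail": "email", "yob": "year_of_birth"}
MAPPING_B = {"EmailAddress": "email", "birth_year": "year_of_birth"}

def normalize(rows, mapping):
    """Rename each row's columns to the shared schema's category names."""
    return [
        {mapping[col]: value for col, value in row.items() if col in mapping}
        for row in rows
    ]

norm_a = normalize(dataset_a, MAPPING_A)
norm_b = normalize(dataset_b, MAPPING_B)
# Both datasets now use identical keys, so they can be compared in queries.
```

Once both datasets use the same category names, queries can match records across them without knowing anything about the original column layouts.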
This process plays an important role in ensuring the security of your data. During normalization, direct identifiers are irreversibly converted to anonymized keys. As a result, even if the Bunker holding your dataset were somehow compromised, the identity of any individual would not be revealed. For details of how imported data is mapped to the Global Schema on InfoSum Platform, including any formatting required for your raw data, see normalization rules.
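The one-way property of this conversion can be illustrated with a salted hash. This is only a conceptual sketch assuming SHA-256 with a fixed salt; InfoSum's actual key-derivation scheme is not documented here.

```python
# Conceptual sketch of irreversible key conversion, assuming a salted
# SHA-256 hash. This is NOT InfoSum's actual scheme; it only illustrates
# the one-way property of anonymized keys.
import hashlib

SALT = b"example-salt"  # hypothetical; a real system manages salts securely

def to_anonymized_key(direct_identifier: str) -> str:
    """Map a direct identifier (e.g. an email address) to an irreversible key."""
    digest = hashlib.sha256(SALT + direct_identifier.lower().encode("utf-8"))
    return digest.hexdigest()

# The same identifier always yields the same key, so matching still works,
# but the key cannot be reversed to recover the original identifier.
key = to_anonymized_key("Ann@Example.com")
```

Because hashing is deterministic, two datasets that hold the same individual produce the same key and can still be matched, yet the key alone reveals nothing about the person.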
To normalize your data, complete the following tasks using your Bunker's web-based UI:
- Assign columns to categories
- Set up category mappings
- Use the transformation tools
- Test with a dry run
- Normalize and publish
If you are importing an activation dataset, you will also need to select an output column to make it available in identity queries. See assign output data for the task steps.
For advanced transformations, you can use InfoSum's custom scripting language, the Data Transformation Language (DTL). DTL lets you apply complex logic to your imported dataset, clean up messy data, and convert formats to match those defined in the Global Schema.
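DTL syntax itself is not shown here. The Python sketch below only illustrates the kind of cleanup such a transformation might perform: coercing mixed date formats into one canonical form. The source formats and the ISO target format are assumptions for the example, not requirements of the Global Schema.

```python
# Illustrative cleanup transformation (not DTL): coerce mixed date formats
# into a single canonical ISO form. The formats below are hypothetical.
from datetime import datetime

SOURCE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]

def clean_date(raw: str):
    """Try each known source format; return an ISO date string, or None."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

cleaned = [clean_date(v) for v in ["03/12/1985", "1985-12-03", "not a date"]]
```

Returning `None` for unparseable values, rather than guessing, keeps messy records visible so they can be reviewed instead of silently corrupted.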