Data normalisation overview
Normalisation addresses a common problem: two separately imported datasets are likely to use different schemas. Normalising both datasets to the common Global Schema allows them to be used together in statistical analysis and other queries.
Additionally, normalisation plays an important role in keeping your data secure. As part of normalisation, direct identifiers (values that can be used to identify specific people) are irreversibly converted to anonymised keys, and the original data is then permanently deleted. So, even if the Bunker holding your dataset were somehow compromised, this would not reveal the identity of any individuals.
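To illustrate the idea of irreversible conversion, here is a minimal sketch using a salted one-way hash. This is a generic example only: InfoSum's actual key-generation scheme is not described in this document, and the salt, function names, and canonicalisation rules below are assumptions for illustration.

```python
import hashlib

# Assumed fixed salt; a real deployment would manage this securely.
SALT = b"example-salt"

def anonymise(direct_identifier: str) -> str:
    """Convert a direct identifier (e.g. an email address) to an anonymised key.

    Hypothetical example of one-way anonymisation, not InfoSum's actual method.
    """
    normalised = direct_identifier.strip().lower()  # canonicalise before hashing
    return hashlib.sha256(SALT + normalised.encode("utf-8")).hexdigest()

key = anonymise("Alice@Example.com")
# The same identifier (after canonicalisation) always yields the same key,
# but the key cannot be reversed to recover the original value.
assert key == anonymise("alice@example.com")
```

Because the hash is one-way, matching between datasets still works (equal identifiers produce equal keys) while the original identifier cannot be recovered from the key.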
You will need to perform a series of steps to prepare your data for normalisation. You can complete these tasks using your Bunker's web-based UI:
- assign categories to your original data columns
- test with a dry run to highlight any mapping problems
- configure mappings and transformations to tidy up any messy data
- publish the dataset to make it available to reference in queries
If you are importing an identity dataset, you will also need to select and publish an output column to make it available in identity queries.
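The first two steps above can be sketched as follows. This is a hedged, in-memory illustration of the general workflow; the category names, example schema, and check logic are assumptions, not InfoSum's actual Global Schema or API.

```python
# Hypothetical target schema: category name -> expected type.
GLOBAL_SCHEMA = {"email": str, "age": int}

dataset = [
    {"contact": "alice@example.com", "age": "34"},
    {"contact": "bob@example.com", "age": "unknown"},
]

# Step 1: assign categories to the original data columns.
categories = {"contact": "email", "age": "age"}

# Step 2: dry run — report rows that fail to map cleanly to the schema.
def dry_run(rows, categories, schema):
    problems = []
    for i, row in enumerate(rows):
        for column, category in categories.items():
            try:
                schema[category](row[column])  # attempt the type conversion
            except (ValueError, KeyError):
                problems.append((i, column, row.get(column)))
    return problems

print(dry_run(dataset, categories, GLOBAL_SCHEMA))
# Flags the "unknown" age value, which a mapping or transformation
# (step 3) would need to tidy up before publishing (step 4).
```

A dry run like this surfaces messy values early, so you can configure transformations before publishing the dataset.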
For more advanced transformations, you can use InfoSum's custom scripting language, the Data Transformation Language (DTL). DTL is a powerful tool for applying complex logic to your transformations: you can clean up messy data, convert formats to match those defined in the Global Schema, and apply a range of other transformations to your imported dataset.