How to normalise a dataset
This quick start covers the normalisation process, which is the next step after importing a dataset.
This process ensures your dataset is correctly mapped to our Global Schema, so it can later be compared to other datasets. You can explain the meaning of any columns which weren't processed by the import wizard, and use mapping and transformation tools to clean up any inconsistencies.
Depending on the cleanliness of the original dataset, some or all of the columns may have been assigned. Any unassigned columns can be found by scrolling right.
To categorise a column, click on the Settings button next to the column name, then Assign Category. From here, you can search for a category or scroll through the list, then click Assign.
If there isn't a relevant category available, you can create a custom category (string or integer) or, if the column contains unique identifiers, a custom key.
A custom category can later be used for aggregated analysis, whereas a custom key can be used to match identities across datasets (with matching keys).
Check category mappings
When a column is assigned to a category, each data point is converted. If one or more data points cannot be converted, a warning will appear, and those data points will be excluded unless the warning is resolved.
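As a rough illustration of the conversion step (not InfoSum's implementation), the Python sketch below tries to convert each value in a column to an integer and collects the values that fail, which is roughly what a warning represents. The function name and the choice of an integer category are assumptions made for this example.

```python
def convert_column(values):
    """Try to convert each value to an int; collect failures as warnings."""
    converted, warnings = [], []
    for index, value in enumerate(values):
        try:
            converted.append(int(value.strip()))
        except ValueError:
            # A value that cannot be converted would be excluded
            # from the normalised data unless it is resolved.
            warnings.append((index, value))
    return converted, warnings


converted, warnings = convert_column(["34", " 27", "n/a", "51"])
print(converted)  # values that converted cleanly
print(warnings)   # (row index, original value) pairs that need attention
```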
Depending on the complexity of the change required, you can use mapping and transformation tools to resolve warnings.
Mapping tools are a suitable solution when a column contains one of a number of specific values. For example, mapping variant values to Female ensures these values are converted and included in queries.
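The idea behind a mapping can be sketched in Python as a simple lookup table. This is only an illustration under assumed inputs: the raw values ("F", "female", "M") and the Male entry are hypothetical, and the platform's mapping tool is configured through its interface rather than written as code.

```python
# Hypothetical raw values mapped to the category values they should become.
GENDER_MAPPING = {
    "f": "Female",
    "female": "Female",
    "m": "Male",
    "male": "Male",
}


def apply_mapping(values, mapping):
    """Map each value via a case-insensitive lookup; report what is left over."""
    mapped, unmapped = [], []
    for value in values:
        key = value.strip().lower()
        if key in mapping:
            mapped.append(mapping[key])
        else:
            # Values with no mapping would still raise a warning.
            unmapped.append(value)
    return mapped, unmapped


mapped, unmapped = apply_mapping(["F", "female", "M", "unknown"], GENDER_MAPPING)
```

Anything that ends up in unmapped corresponds to a value that still needs a mapping before it can be included in queries.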
Transformation tools can configure a series of changes to the column, such as "remove the first letter" or "change this word into that one".
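To make the notion of "a series of changes" concrete, here is a minimal Python sketch that chains the two example steps above. It is not DTL and does not reflect how the transformation tools are implemented; the helper names and the sample values are assumptions for illustration only.

```python
import re


def remove_first_letter(value):
    """Drop the first character of the value."""
    return value[1:]


def replace_word(old, new):
    """Build a step that swaps one whole word for another."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")

    def transform(value):
        return pattern.sub(new, value)

    return transform


def apply_transformations(values, steps):
    """Apply each step, in order, to every value in the column."""
    result = []
    for value in values:
        for step in steps:
            value = step(value)
        result.append(value)
    return result


steps = [remove_first_letter, replace_word("Utd", "United")]
cleaned = apply_transformations(["#Leeds Utd", "#Newcastle Utd"], steps)
```

The order of the steps matters: each step receives the output of the one before it.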
For more complicated tasks, transformation scripts can be written using InfoSum's Data Transformation Language (DTL).
Test with a dry run, then publish dataset
Lastly, a dry run is recommended to test against a sample of your dataset. This will help you spot and resolve any further problems with the data and improve the quality of your normalised data.
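Conceptually, a dry run applies the conversion to a sample of rows and reports what would fail, without changing anything. The Python sketch below shows that idea under stated assumptions: the function name, the sample size, and the report fields are hypothetical and are not part of the platform.

```python
import random


def dry_run(rows, convert, sample_size=100, seed=0):
    """Run a conversion over a random sample and summarise the failures."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    failures = []
    for row in sample:
        try:
            convert(row)
        except ValueError:
            failures.append(row)
    return {
        "sampled": len(sample),
        "failed": len(failures),
        "examples": failures[:5],  # a few failing values to inspect
    }


report = dry_run(["1", "2", "x", "4"], int, sample_size=10)
```

A report with no failures suggests the full run is likely to go smoothly, although a sample cannot guarantee that every row will convert.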
Once this test runs smoothly, select Normalise and head to the Bunker dashboard to publish the dataset. It will then be available to connect to other datasets and reference in queries.