Normalising your data

Depending on how clean the imported data is, some or all of its columns may already have been assigned to a Global Schema category during the import.

InfoSum’s Global Schema defines a standard set of categories and keys, which can be used to compare datasets from diverse original sources. This addresses the obvious problem that two separate datasets are likely to use different schemas.
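To make the idea concrete, here is a minimal Python sketch. It is not InfoSum's implementation. Two hypothetical datasets use different column names, and a mapping for each dataset assigns its columns to shared categories so that the rows become comparable. The column names, the category names ("email", "income") and the helper to_shared_schema are all assumptions made for illustration.

```python
# Illustrative only: the column names and category names below are
# hypothetical and are not taken from the Global Schema.
def to_shared_schema(row, column_to_category):
    """Rename a row's columns to shared categories, dropping unmapped columns."""
    return {
        column_to_category[column]: value
        for column, value in row.items()
        if column in column_to_category
    }


# Two sources describing the same person with different schemas.
retailer_row = {"e_mail": "a@example.com", "yearly_income": "32000", "notes": "vip"}
bank_row = {"EmailAddress": "a@example.com", "Salary": "32000"}

# One mapping per dataset, both pointing at the same shared categories.
retailer_mapping = {"e_mail": "email", "yearly_income": "income"}
bank_mapping = {"EmailAddress": "email", "Salary": "income"}

retailer_normalised = to_shared_schema(retailer_row, retailer_mapping)
bank_normalised = to_shared_schema(bank_row, bank_mapping)
```

After the mapping, both rows use the same keys, so they can be compared directly even though the original column names differed.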

Using the Normalise tab in the bunker, you can explain the meaning of any columns that weren’t processed during the import by assigning columns to categories. You can then use category mapping and transformation tools to clean up any inconsistencies.

Scroll right to find any unassigned columns. To assign a column, click the Settings button next to the column name, then select Assign Category. Select the columns you want to assign to a category and click the NEXT button. Search for a relevant category or scroll through the list, then click the SAVE button. If no relevant category is available, you can create a custom category.

Once every column is mapped to a category, a warning appears for any data point that is not in the form expected for its category. For example, if the income values contain a pound sign, a red flag appears. You can then use the transformation tools to correct the affected values, for example “remove the £” or “change this word into that one”.
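The following Python sketch shows what these two kinds of transformation amount to. It is only an illustration under stated assumptions: the platform's transformation tools are configured in the Normalise tab, and the function names, the sample values and the rule that an income must be a whole number are all hypothetical.

```python
import re


def find_unexpected(values):
    """Return the values that are not plain whole numbers (assumed income format)."""
    return [v for v in values if not re.fullmatch(r"\d+", v)]


def remove_pound_sign(value):
    """Equivalent of "remove the £": strip the sign and surrounding whitespace."""
    return value.replace("£", "").strip()


def replace_word(value, old, new):
    """Equivalent of "change this word into that one" for exact matches."""
    return new if value == old else value


incomes = ["32000", "£45000", "£ 18500"]
flagged_before = find_unexpected(incomes)      # values that would be flagged
cleaned = [remove_pound_sign(v) for v in incomes]
flagged_after = find_unexpected(cleaned)       # nothing left to flag

genders = ["M", "Male", "F"]
relabelled = [replace_word(v, "M", "Male") for v in genders]
```

In this example the two values containing a pound sign are flagged before the transformation and none are flagged afterwards.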

For more information, see the section on the process of normalising data.

At any time, you can use a dry run to test, against a representative sample, how successfully your data has been mapped to the Global Schema. This can help you spot and resolve any remaining problems and improve the quality of the data. Once you’re happy with the results, select Normalise and go to the Bunker dashboard.
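Conceptually, a dry run checks a sample of rows against the expected form of each category and reports how much of the sample passes. The Python sketch below illustrates that idea only. The dry_run function, the validators and the sample rows are hypothetical and do not describe how the platform performs the check.

```python
import random
import re


def dry_run(rows, validators, sample_size=100, seed=0):
    """Check a sample of rows and return the share of valid values per category."""
    rng = random.Random(seed)
    # Use every row when the dataset is small, otherwise a random sample.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    if not sample:
        return {}
    report = {}
    for category, is_valid in validators.items():
        valid = sum(1 for row in sample if is_valid(row.get(category, "")))
        report[category] = valid / len(sample)
    return report


# Hypothetical validation rules for two categories.
validators = {
    "income": lambda v: re.fullmatch(r"\d+", v) is not None,
    "email": lambda v: "@" in v,
}

rows = [
    {"email": "a@example.com", "income": "32000"},
    {"email": "b@example.com", "income": "£45000"},
    {"email": "not-an-email", "income": "18500"},
    {"email": "c@example.com", "income": "27000"},
]

report = dry_run(rows, validators)
```

A low share for a category points to values that still need a transformation before you normalise.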

Next steps

You now need to publish your dataset to make it available to reference in queries.