Preparing & publishing a Dataset
Once you’ve normalized your data in the platform, you can prepare it for publishing to a Dataset. The prepare step creates all of the indexes that enable the InfoSum platform to run fast queries. This is a critical step in ensuring that InfoSum never moves any personally identifiable data during collaborations.
You will have to:
- Create or select a Dataset to publish to
- Select the normalized recordset you wish to prepare
- Choose which keys, attributes, and output columns to publish
- Set the rounding and redaction thresholds
- Publish your Dataset
Create or select a Dataset to publish to
Navigate to the Dataset page inside Data management. Select or create a Dataset to publish to.
There are three statuses for Datasets on this screen: “Ready”, “Prepared”, and “Published”.
Be aware that if you prepare and publish to a Dataset with a status of “Prepared” or “Published”, and you are not doing an incremental update, your new data will overwrite the data already in that Dataset.
How to create a new Dataset
If you have no Datasets available or don’t want to overwrite data that is already in a Dataset, you can click “New Dataset” to create a new Dataset.
Please follow the instructions on this page to create a new Dataset.
Select the normalized recordset you wish to prepare
Once you've selected your Dataset, click 'prepare' on the Dataset details panel. Please ensure you are in the Cloud Vault where you normalized your data.
Overwrite or incremental update?
When you select the recordset, you will see two options at the bottom of the page:
- Click “Prepare” to overwrite the data in your Dataset with your new recordset
- Click “Incremental prepare” to append your prepared recordset to the existing data in your Dataset. To use this option, you’ll need:
  - A date column in your recordset that can be used as a creation timestamp to set a retention window. If there are no date columns, this option will not be clickable
  - A new recordset that contains all keys and attributes already present in the Dataset (it can also contain additional ones)
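The two requirements for an incremental prepare can be sketched as a simple pre-check. This is only an illustration: the function name `can_incremental_prepare` and its inputs are hypothetical and are not part of the InfoSum platform, which performs its own validation.

```python
from datetime import date, datetime


def can_incremental_prepare(new_columns, existing_columns, sample_row):
    """Check the two incremental-prepare requirements (hypothetical helper).

    new_columns / existing_columns: names of keys and attributes.
    sample_row: dict of column name -> example value from the new recordset.
    Returns (ok, reasons), where reasons lists any unmet requirement.
    """
    reasons = []

    # Requirement 1: a date column that can act as the creation timestamp.
    date_columns = [
        name for name, value in sample_row.items()
        if isinstance(value, (date, datetime))
    ]
    if not date_columns:
        reasons.append("no date column available for the retention window")

    # Requirement 2: every key/attribute already in the Dataset must be present.
    # Extra columns in the new recordset are allowed.
    missing = sorted(set(existing_columns) - set(new_columns))
    if missing:
        reasons.append("missing columns: " + ", ".join(missing))

    return (not reasons, reasons)


ok, reasons = can_incremental_prepare(
    new_columns={"email", "age", "created"},
    existing_columns={"email", "age"},
    sample_row={"email": "a@example.com", "age": 30, "created": date(2024, 1, 1)},
)
print(ok, reasons)
```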
Select the keys, attributes, and output columns to publish
At this stage, you will be asked which columns you wish to include in the final published version of your Dataset. Three tabs are available at this step: “Keys”, “Attributes”, and “Output columns”. Select the columns you wish to prepare and publish now, noting the following figures for each:
- Rows shows the number of rows in the Dataset that contain this key or attribute
- Values shows the number of total values across all the rows. This number might be higher than the number of rows if there are multi-value columns
- The fill rate shows the percentage of rows that contain a value for that column. If the fill rate is zero, the column will be highlighted in red
Your fill rate can be under 100% even when the number of values matches or exceeds the total number of rows. For example, with a multi-value key or attribute, you could have:
- 6k total rows in your Dataset
- 4k rows that contain a certain attribute
- 8k values (it's a multi-value attribute that contains two values per row)
- Your fill rate will be 66% (rows with at least one value / total rows, here 4k / 6k)
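The relationship between rows, values, and fill rate can be reproduced with a short sketch. The function `column_stats` and the sample data are hypothetical, and the example is scaled down from thousands of rows to single rows. The ratio is shown unrounded, since how the platform rounds the displayed percentage is not stated here.

```python
def column_stats(rows, column):
    """Return (rows_with_value, total_values, fill_rate_percent) for a column.

    Each row is a dict; a multi-value column holds a list of values.
    """
    total_rows = len(rows)
    # Rows: how many rows contain at least one value for this column.
    rows_with_value = sum(1 for row in rows if row.get(column))
    # Values: total number of values across all rows.
    total_values = sum(len(row.get(column) or []) for row in rows)
    # Fill rate: rows with at least one value / total rows.
    fill_rate = 100 * rows_with_value / total_rows if total_rows else 0.0
    return rows_with_value, total_values, fill_rate


# 6 rows in total; 4 of them hold a two-value attribute, 2 are empty.
rows = [{"interests": ["a", "b"]}] * 4 + [{"interests": []}] * 2
rows_with_value, total_values, fill_rate = column_stats(rows, "interests")
print(rows_with_value, total_values, round(fill_rate, 1))  # 4 8 66.7
```

The number of values (8) exceeds the total number of rows (6), yet one third of the rows are still empty, which is why the fill rate stays below 100%.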
Incremental updates
If you are preparing for an incremental publish, you will be asked to provide a datetime column to use as the starting point of your retention period and to specify said period in days.
The platform will use this column to expire records that are outside your retention period every time that you publish new data.
Because expiry runs when you publish, we strongly recommend that you set up an automation after publishing to ensure that your records are expired on a regular cadence.
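As a rough illustration of how a retention period behaves, the sketch below keeps only records whose timestamp falls inside the window. The function `expire_records`, the column name `created`, and the choice to keep records exactly on the cutoff are assumptions made for this example; the platform's own expiry logic may differ in its details.

```python
from datetime import datetime, timedelta


def expire_records(records, timestamp_column, retention_days, now):
    """Return the records that are still inside the retention period.

    A record is kept when its timestamp is no older than `retention_days`
    before `now` (boundary handling is an assumption of this sketch).
    """
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r[timestamp_column] >= cutoff]


now = datetime(2024, 6, 30)
records = [
    {"id": 1, "created": datetime(2024, 6, 25)},  # 5 days old: kept
    {"id": 2, "created": datetime(2024, 5, 1)},   # 60 days old: expired
]
kept = expire_records(records, "created", retention_days=30, now=now)
print([r["id"] for r in kept])  # [1]
```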
Set the rounding and redaction thresholds
Finally, confirm that you’re happy with the rounding and redaction thresholds for this Dataset. If you’d like to change them, click edit and make the changes before clicking “Run Prepare” to continue the process.
- Rounding defines the increment that every result is rounded down to. For example, if the threshold is set to 100, a result of 2,563,975 rows would be reported as 2,563,900.
- Redaction defines the minimum size of a group. For example, if the threshold is set to 100, a result of 87 rows would not be reported.
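The effect of the two thresholds on a single result can be sketched as follows. The function `report_count` is hypothetical. The order of operations (redaction checked before rounding) and the treatment of a count exactly equal to the redaction threshold are assumptions, as the text above does not specify them.

```python
def report_count(count, rounding=100, redaction=100):
    """Apply redaction and rounding thresholds to a raw result count.

    Returns None when the result is redacted, otherwise the count
    rounded down to the nearest multiple of `rounding`.
    """
    # Redaction: groups smaller than the threshold are not reported.
    if count < redaction:
        return None
    # Rounding: round down to the nearest multiple of the threshold.
    return (count // rounding) * rounding


print(report_count(2_563_975))  # 2563900
print(report_count(87))         # None
```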
The Dataset will now be prepared for publishing
Now the indexes will be created and the Bunker will be prepared. The details panel displays more information about the prepared Bunker that is ready for publishing.
Publish the Dataset
Once the prepare stage has completed, the button text in the green info panel will change to “Publish”. Click this button to publish the Dataset, ready for collaboration. Once published, the label will turn green to indicate that the publish has been successful.
| Important note |
| A Dataset stays in a prepared state for only 36 hours before it is terminated. Ensure that you publish the Dataset before the 36 hours expire. |
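If you track prepare times in your own tooling, the 36-hour window is simple to compute. The sketch below is an illustration only: the function names are hypothetical, and it assumes the window starts at the moment the prepare stage completes.

```python
from datetime import datetime, timedelta

# A prepared Dataset is terminated after 36 hours (from the note above).
PREPARED_LIFETIME = timedelta(hours=36)


def publish_deadline(prepared_at):
    """Return the latest time at which the prepared Dataset can be published."""
    return prepared_at + PREPARED_LIFETIME


def hours_remaining(prepared_at, now):
    """Return the hours left before the deadline, never below zero."""
    remaining = publish_deadline(prepared_at) - now
    return max(0.0, remaining.total_seconds() / 3600)


prepared_at = datetime(2024, 6, 1, 9, 0)
print(publish_deadline(prepared_at))                               # 2024-06-02 21:00:00
print(hours_remaining(prepared_at, datetime(2024, 6, 2, 9, 0)))    # 12.0
```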