How to automate data imports
Automation enables you to schedule the regular update of a published dataset. This feature removes the need to manually import, normalise and publish a new version of the same dataset.
You can use automation to update a dataset that is imported through a connector, such as MySQL, S3 or Google Cloud Storage. You cannot automate the import of a file from your computer, such as a CSV file.
If you do not have automation enabled and would like to use it, please get in touch with your InfoSum representative.
Automation is configured and controlled from inside the Bunker. If you have previously imported a dataset using a connector, open that Bunker, locate the Automation toggle in the top right-hand corner and select Edit. The dialog shown below will appear.
Here, you can define how often the connector should re-import, normalise and publish the dataset. You can either use the UI or, if you require more customisation, use the advanced settings to specify the schedule in 'cron' format.
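If you have not written a cron schedule before, the sketch below may help you check what an expression means before you enter it. It assumes the common five-field layout (minute, hour, day of month, month, day of week, with 0 meaning Sunday); the exact syntax accepted by the advanced settings is not confirmed here, so treat this as an illustration only. The function names `parse_cron` and `next_run` are hypothetical helpers, not part of the platform.

```python
from datetime import datetime, timedelta

# Allowed range for each standard cron field:
# minute, hour, day of month, month, day of week (0 = Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field, low, high):
    """Expand one field such as '*', '*/15', '1-5' or '0,30' into a set of ints."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = int(base)
            # 'N/step' means "from N to the top of the range"; plain 'N' is one value.
            end = high if "/" in part else start
        values.update(range(start, end + 1, step))
    return values


def parse_cron(expr):
    """Parse a five-field cron expression into a list of five sets."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    return [expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)]


def next_run(expr, after):
    """Return the first minute strictly after `after` that matches `expr`.

    Simplification: every field must match. Classic cron treats a restricted
    day-of-month and day-of-week as alternatives; that rule is not modelled.
    """
    minutes, hours, days, months, weekdays = parse_cron(expr)
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # search at most one year ahead
        cron_weekday = (candidate.weekday() + 1) % 7  # Python: Monday = 0
        if (candidate.minute in minutes and candidate.hour in hours
                and candidate.day in days and candidate.month in months
                and cron_weekday in weekdays):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError("no matching run within one year")


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)  # a Monday
    print(next_run("0 2 * * *", now))   # every day at 02:00
    print(next_run("30 6 * * 1", now))  # every Monday at 06:30
```

Under these assumptions, `0 2 * * *` describes a daily run at 02:00 and `30 6 * * 1` a weekly run on Mondays at 06:30. Once automation is enabled, hovering over the Automation toggle remains the authoritative way to see when the next run is due.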
Click Save, then use the Automation toggle to enable automation. You can then hover over this toggle to see when the next run is due and check details of the schedule.
Enabling automation will lock down the ability to manually import, normalise and publish a dataset. This means that you will not be able to import a dataset from another source, or change the normalisation configuration.
You can disable automation at any time by using the same toggle you used to switch it on.
Candidate, published and background datasets
When a dataset has been automatically imported, it appears in the Dashboard, as shown below, and in the Publish tab. You can see a range of information, such as the size of the dataset and its current state.
The state of this dataset is Published, which means that it is ready to be queried.
There are three dataset states, which are used in every Bunker but are particularly relevant when using automation:
- A candidate dataset is a dataset which is either in the process of import and normalisation, or has been imported but hasn't been published.
- A published dataset is the dataset which will be analysed when referenced in all subsequent queries.
- A background dataset is a dataset which was previously published and was replaced by a newer dataset, but may still be queried if a query was initiated before the newer dataset was published.
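The relationship between the three states can be sketched as a small state model. This is an illustration of the definitions above, not the platform's implementation: the `Bunker` class, its methods and the version names are all hypothetical, and what happens to older background datasets over time is not described here, so the sketch leaves them in place.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"    # importing/normalising, or imported but not published
    PUBLISHED = "published"    # used by all subsequent queries
    BACKGROUND = "background"  # replaced, but may still serve in-flight queries


@dataclass
class Bunker:
    """Hypothetical model of the dataset versions held in one Bunker."""
    versions: dict = field(default_factory=dict)  # version name -> State

    def import_dataset(self, name):
        # A newly imported version starts life as a candidate.
        self.versions[name] = State.CANDIDATE

    def publish(self, name):
        if self.versions.get(name) is not State.CANDIDATE:
            raise ValueError(f"{name!r} is not a candidate dataset")
        # The version that was published until now moves to the background.
        for other, state in self.versions.items():
            if state is State.PUBLISHED:
                self.versions[other] = State.BACKGROUND
        self.versions[name] = State.PUBLISHED

    def published(self):
        # Return the name of the version that new queries would use, if any.
        for name, state in self.versions.items():
            if state is State.PUBLISHED:
                return name
        return None


if __name__ == "__main__":
    bunker = Bunker()
    bunker.import_dataset("v1")
    bunker.publish("v1")
    bunker.import_dataset("v2")  # v2 is a candidate; v1 is still published
    bunker.publish("v2")         # v2 becomes published; v1 moves to background
    print({name: state.value for name, state in bunker.versions.items()})
```

In this model, each automated run corresponds to one `import_dataset` call followed by one `publish` call, which is why at most one version is in the published state at any time.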
If you switch to the Publish tab, you will see the state of each dataset version, alongside logs for all previously published datasets.
You can click on any of these rows to bring up a detailed view of the publish data and the Global Schema version, alongside the categories and key fill rates.