Automating periodic imports
Automation enables you to schedule regular updates of a published dataset, removing the need to manually import, normalize, and publish a new version of the same dataset.
You can use automation to update a dataset through a connector, such as Amazon S3, Google Cloud Storage, MySQL, or SFTP. You cannot automate the import of a file from your computer, such as a CSV file.
If you do not have automation enabled and would like to use it, please get in touch with your InfoSum representative.
All automated imports use the same details as your original manual import, that is:
- the same credentials, path or bucket name, and filename as your SFTP or cloud storage import.
- the current configuration from the Bunker.
Automated import times are specified in UTC, but are reported in your local time on the Platform.
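To see how a UTC schedule maps to your local time, you can convert it with Python's standard zoneinfo module. This is an illustration only; the date, time, and timezone below are hypothetical, not Platform values.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical example: a run scheduled for 02:00 UTC on 15 Jan 2024,
# viewed from the America/New_York timezone (UTC-5 in January).
scheduled_utc = datetime(2024, 1, 15, 2, 0, tzinfo=ZoneInfo("UTC"))
local = scheduled_utc.astimezone(ZoneInfo("America/New_York"))

print(scheduled_utc.strftime("%Y-%m-%d %H:%M %Z"))  # 2024-01-15 02:00 UTC
print(local.strftime("%Y-%m-%d %H:%M %Z"))          # 2024-01-14 21:00 EST
```

Note that the local date can differ from the UTC date, so a "daily" run may appear to land on the previous or next day in the Platform's reporting.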
Enabling automation
Automation is configured and controlled from inside a Bunker. If you have previously imported a dataset using a connector, open that Bunker, locate the Automation toggle in the top right-hand corner, and select Edit; the dialog shown below will appear.
Here, you can define how often the connector should re-import, normalize, and publish the dataset. You can either use the UI or, if you need more control, use the advanced settings to specify a schedule in 'cron' format.
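For reference, the standard cron format uses five space-separated fields: minute, hour, day of month, month, and day of week. The examples below follow that convention; the exact syntax the Platform accepts may differ, so check the advanced settings dialog.

```
# minute  hour  day-of-month  month  day-of-week
  0       3     *             *      *            # every day at 03:00 UTC
  30      2     *             *      1            # every Monday at 02:30 UTC
```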
Click Save, then use the Automation toggle to enable automation. You can then hover over this toggle to see when the next run is due and check details of the schedule.
Enabling automation locks the ability to manually import, normalize, and publish a dataset: you will not be able to import a dataset from another source, or change the normalization configuration.
You can disable automation at any time by using the same toggle you used to switch it on.
Draft, published and background datasets
When a dataset has been automatically imported, it is shown in the Dashboard and in the Publish tab, along with information such as its size and current state.
The state of this dataset is Published, which means that it is ready to be queried.
There are three types of dataset state, which are used in every Bunker but are particularly relevant when using automation:
- A draft dataset is one that is either being imported and normalized, or has been imported but not yet published.
- A published dataset is the version that will be analyzed when referenced in all subsequent queries.
- A background dataset was previously published and has been replaced by a newer version, but may still be queried if a query was initiated before the newer version was published.
If you switch to the Publish tab, you will see the state of each dataset version, alongside logs for all previously published datasets.
You can click any of these rows to bring up a detailed view of the publish data and the Global Schema version, alongside the categories and key fill rates.