Using incremental imports
Important note: This functionality isn't available as standard to all users. Please contact your InfoSum representative to learn more about how to gain access.
Incremental imports allow you to import updates to an existing dataset. This means that you do not need to re-import the whole file when you want to update your dataset.
To enable incremental imports, you will need to select this option when creating a dataset. You cannot change the import type once the Bunker is created. Incremental imports enable you to:
- Add new records.
- Delete existing records.
- Modify existing records.
- Expire existing records (i.e. the record is not removed until the next publish).
For the steps to do this, see Creating an incremental import dataset and Importing an incremental dataset below.
You can perform incremental imports using any data connector (e.g. S3, local CSV file, etc.).
Incremental imports can be added manually or using automation.
You can automate incremental imports from any data connector except local CSV upload. In addition, the SFTP data connector has an option to select how multiple files are handled during automated or manual import. For more details, see Data connector for SFTP: incremental imports option.
Creating an incremental import dataset
The steps are the same as for creating a dataset until the Import type options appear.
Select the type of import you want:
- A standard import, which replaces all data in the Bunker.
- An incremental import, which updates data in the Bunker.
Note: Import type can only be set when you create the Bunker.
Under Import Type, select Incremental and click Next.
The remaining steps are the same as for creating a dataset.
Importing an incremental dataset
To import incremental data, first you will need to import the full dataset and then import the incremental (delta) updates.
Importing a full dataset
The steps are the same as for Importing a dataset until after you select Accept Preview Configuration. With Incremental imports, you are taken to the screen below.
You have two options:
- Full refresh - this is the Bunker’s standard import type. Selecting this option replaces the existing dataset with a new one.
- Delta update - this incrementally updates the existing dataset.
As you are importing for the first time, select Full refresh to import your complete dataset. Click Continue on the warning message that appears.
You will need to assign a configuration to the dataset. A configuration is a stored process for adapting a dataset into the Global Schema.
As this is your first time creating an incremental import, select Create a new Config to go to the incremental import wizard. Here you will see an additional Primary Key and TTL tab for incremental imports, which you will need to complete in addition to any categories and Platform keys in the Categories and Custom categories tabs.
The Primary Key and TTL tab tells the Bunker how to update records and when to expire them.
- Primary Key - the sole purpose of the Primary Key is to allow incremental imports to match updates against records in the published dataset. Unlike a normal Platform key, the Primary Key is not used to match datasets on the Platform. It can be assigned to any column and does not have to be a key or a category, but its values must be unique. The Primary Key can be stored in the Bunker and does not need to be published. If you re-import a record with the same Primary Key, the existing record is replaced with the new record.
- TTL (MS) - Time To Live (TTL) specifies how long, in milliseconds (MS), a record stays in the InfoSum system. TTL is not a standard key in the Global Schema, and the column containing the TTL can be given any name (e.g. "TTL" in the example screenshot). When creating your TTL column, the data type must be set to integer. Any record that exceeds its TTL is not removed from the dataset until the next publish. This means that:
- If you do not re-publish a dataset, records with expired TTLs will remain indefinitely.
- When you re-publish a dataset, any records whose TTL has expired are removed.
For example, if a record’s TTL is one day and your next import is in one week, the record will not be removed until that import is published, one week later. Some examples of TTL values you can set are listed below, followed by a short sketch of the millisecond arithmetic:
- 0 (milliseconds) - to delete an existing record.
- 1 (milliseconds) - to expire a record immediately after it is published. Note: the expired record will not be removed from the dataset until after the next publish.
- 2147483647 (milliseconds) - to update or add a new record with a TTL of 24.855 days.
- 9223372036854775807 (milliseconds) - the maximum TTL value supported by the Platform (approximately 292 million years).
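The millisecond values above are easier to reason about with a little arithmetic. The following sketch is plain Python and is purely illustrative; nothing about it is required by the Platform. It shows the conversions behind the figures quoted in this list:

```python
# TTL values are plain integers in milliseconds; these helpers make the
# figures quoted above easier to verify.
MS_PER_DAY = 24 * 60 * 60 * 1000        # 86,400,000 milliseconds in a day
MS_PER_YEAR = int(365.25 * MS_PER_DAY)  # using a Julian year of 365.25 days

def days_to_ttl_ms(days: float) -> int:
    """Convert a retention period in days to a TTL in milliseconds."""
    return int(days * MS_PER_DAY)

print(days_to_ttl_ms(1))                   # 86400000   -> one day
print(days_to_ttl_ms(30))                  # 2592000000 -> thirty days
print(2147483647 / MS_PER_DAY)             # ~24.855 days (the value above)
print(9223372036854775807 // MS_PER_YEAR)  # roughly 292 million years
```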
Click on Accept Wizard Settings. You will now use your Bunker's web interface to normalize your data before publishing it. The screenshot below shows how the primary key selected earlier in the Primary Key and TTL tab is not assigned as a Platform key.
Please see the data normalization section for guides on how to cleanse, transform and standardize your dataset. Once this stage is complete, you will need to publish your dataset to make it available for queries.
Once published, you can select Incremental stats in the Publish tab to see details for the dataset.
This shows the number of imported records that were added, updated, deleted and expired, and the total number of rows.
Importing a delta update
When importing a delta update, the steps are the same as for importing a dataset until you reach the Incremental Upload screen, where you need to select the Delta update option.
You are taken to the Normalize screen where you can use your Bunker's web interface to normalize your data before publishing it.
Please note:
- Delta updates automatically use the same normalization config settings as the full refresh.
- To delete a record, re-import it with the same Primary Key and a TTL of 0. Any record with a TTL of 0 will be immediately removed from the index as part of the normalize process (see the sketch after this list).
- Any record with a TTL of 1 will be expired one millisecond after publication. These records are removed only after the next publish.
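As an illustration of these rules, the sketch below writes a small delta file containing an update, a deletion (TTL of 0), an expiry (TTL of 1) and a new record. The user_id and ttl column names are examples only; use whichever columns your configuration assigns as the Primary Key and TTL.

```python
import csv

# Illustrative delta file: "user_id" stands in for whichever column your
# configuration uses as the Primary Key, and "ttl" for the integer TTL column.
THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000

delta_rows = [
    # Same Primary Key as an existing record + new values -> record is updated.
    {"user_id": "u-1001", "email": "new-address@example.com", "ttl": THIRTY_DAYS_MS},
    # Same Primary Key + TTL of 0 -> record is removed during normalization.
    {"user_id": "u-1002", "email": "", "ttl": 0},
    # Same Primary Key + TTL of 1 -> record expires just after publication,
    # but is only removed from the dataset at the next publish.
    {"user_id": "u-1003", "email": "", "ttl": 1},
    # A Primary Key not seen before -> record is added.
    {"user_id": "u-2000", "email": "brand-new@example.com", "ttl": THIRTY_DAYS_MS},
]

with open("delta_update.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "email", "ttl"])
    writer.writeheader()
    writer.writerows(delta_rows)
```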
Please see the data normalization section for guides on how to cleanse, transform and standardize your dataset. Once this stage is complete, you will need to publish your dataset to make it available for queries.
Once published, you can select Incremental stats in the Publish tab to see the changes to the full dataset.
This shows the number of records that were added, updated, deleted and expired by the incremental import.
Data connector for SFTP: incremental imports option
This section describes the “Automation: Incremental imports” option only. You can ignore this option if you are not uploading incremental imports. For the full steps to use the SFTP data connector, see the SFTP data connector guide.
When incremental imports are enabled, the SFTP connector form contains an additional option:
- Automation: Incremental imports - Select how multiple files are handled for incremental imports. See below for more details.
You can select whether to ingest all files or only new files added since the last successful incremental import.
The option to ingest only new files is designed for periodic automatic imports, but can also be used for manual imports. It means you can continually add files to your SFTP server, and only the files added after the last successful incremental import will be ingested into the Bunker.
To avoid an automated import processing files that are still being transferred onto the SFTP server at the time of the import, the SFTP connector only processes files that are accompanied by a file of the same name with a .done suffix. For example, a file named toProcess.csv added to the SFTP server is only processed if the file toProcess.csv.done also exists, indicating that toProcess.csv is complete and ready to be processed.
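As an illustration, the sketch below uploads a data file and then its .done marker using the third-party paramiko library; the hostname, credentials and remote paths are placeholders. Uploading the marker only after the data file has finished transferring ensures the connector never ingests an incomplete file.

```python
import paramiko

# Placeholder connection details - substitute your own SFTP host and credentials.
HOST, PORT = "sftp.example.com", 22
USERNAME, PASSWORD = "upload-user", "********"

transport = paramiko.Transport((HOST, PORT))
transport.connect(username=USERNAME, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

try:
    # 1. Upload the data file itself.
    sftp.put("toProcess.csv", "/uploads/toProcess.csv")

    # 2. Create an empty marker file locally, then upload it. Because the
    #    marker only appears after the data file has fully transferred, the
    #    connector will not pick up a partially transferred file.
    open("toProcess.csv.done", "w").close()
    sftp.put("toProcess.csv.done", "/uploads/toProcess.csv.done")
finally:
    sftp.close()
    transport.close()
```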