Expanded support for very large published datasets
Important note: This functionality isn't available as standard to all users. Please contact your InfoSum representative to learn more about how to gain access.
InfoSum Platform now supports a larger file ingestion capability of up to 1,999 columns per published dataset, an increase from the previous limit of 250 columns. You will need to specify that your dataset has up to 2,000 columns when provisioning a Bunker. The following limits apply to very large datasets:
- The maximum number of rows is 350 million
- The maximum number of columns is 1,999
- Large datasets cannot be automated
- Large datasets cannot be used for incremental updates
Creating a very large dataset
The steps are the same as for creating a dataset, except that you will need to select the new Dataset Size option shown below.
Under Import data’s maximum number of columns, select Up to 2000.
Click Next. The remaining steps are the same as for creating a dataset. You can then import your large dataset into InfoSum Platform.
Importing a large dataset
The steps are the same as for importing a dataset. Due to the increased size of the dataset, this process can take longer than for a standard dataset. For example:
- Importing into a Bunker takes approximately 13 hours for a dataset of 250 million rows and 1,999 columns.
- Normalization for 250 million rows and 1,999 columns (7 keys, with the remaining columns binary) can take up to 3 days. If your dataset contains addresses mapped to the Global schema, normalization can take an additional day to process.
Once an import, normalization or dry run has started, it will continue to run if you exit the Bunker. When re-entering the Bunker, progress will automatically be displayed for an import or dry run. For a normalization, you will need to select the Publish tab to see the progress.
Important: To ensure that data is imported correctly, always wait for the moving bar at the top of the screen to disappear. This can take up to 5 minutes for large datasets. For example, clicking the Normalize button while the bar is still displayed can cause normalization to fail.
Once the import to a Bunker is complete, you can launch the Category Wizard to bulk assign columns to categories. Activation Bunkers have an additional option to bulk assign output columns. See assign output data for more information.
Normalizing and publishing very large datasets
Normalizing a very large dataset is done using List View instead of Spreadsheet View.
You can test your dataset with a dry run. Next, you can normalize and publish your dataset to make it available for queries.