Importing data using Amazon S3
This training guide explains step-by-step how you can directly import a dataset from your Amazon S3 bucket.
Workflow
Here is the workflow for uploading your dataset to InfoSum:
- Creating a dataset
- Accessing a Bunker
- Importing your dataset
- Previewing your data
- Configuring your data
- Normalizing your data
- Publishing your dataset
Some important concepts
Before you begin uploading your data, there are some concepts you’ll need to be familiar with:
- Dataset - A dataset is a single database table on a customer’s InfoSum site that contains a group of records, each with keys and categories. The combination of keys and categories is a dataset.
- Key - Keys are Personally Identifiable Information (PII) such as mobile phone number, email or IP address which will be used for matching with other datasets.
- Category - Categories are specific types of information contained in a dataset such as age, gender, lapsed customer or existing customer.
Creating a dataset
Before starting, please make sure you are logged into your InfoSum Platform account.
Under Data, click the New Dataset button on the Datasets tab.
You will then be taken to the form below, where you can create a standard or large dataset depending on your file size. Standard datasets can contain up to 45 million rows and large datasets can contain 50 million or more rows.
Select the type of dataset you want and click Next. You will be taken to the form below.
Choose the cloud instance location and provider you would like for your Bunker and click Next.
Note: You will skip the above form if you do not have the option to select a location and provider for your Bunker.
You will be asked to select the type of dataset you want to create.
Select the type of dataset you want to create and click Next.
Note: You will skip the above form if activation settings are not enabled for your account.
You have two options:
- Insight datasets are used to see the match between datasets and create insights from the intersection between datasets. Insight data is always anonymised and aggregated, so individual records are never shown.
- Activation datasets are used to push your audience to an activation partner once you find your target audience.
You will then be taken to the form below.
Complete the fields on this form as follows:
- Add a Private ID - this is a name you give your dataset that only you can see.
- Add a Public Name - this is the name the other party will see when you grant them permissions to your dataset. In the above example, the other party will only see the public name “Demo Company”, but you will see the private ID “ExistingCustomers”. This allows you to give your dataset a meaningful name that will not be shared with another party.
- Add a Public Description - this optionally allows you to add a description for the dataset.
- Select a Project (if required) - this is a user-scoped label associated with a number of datasets.
- Select a Team - this is a collection of users within a company that can access the dataset.
- Dataset Expiry - by default, the dataset expires in 2-3 days. If you want to keep your dataset for longer, tick Do not expire; otherwise your dataset will expire and you won't be able to access it.
Click Create and you will be taken back to the Datasets tab on the InfoSum Platform. This is your home for viewing, editing and deleting your datasets.
You can see it is your dataset by hovering your mouse over the blue-filled icon. If the dataset belongs to another party, this icon will appear as a blue outline icon.
When you create a dataset, the Platform automatically creates a corresponding Bunker. A Bunker is a private cloud instance that only the dataset owner can access.
Use InfoSum Platform to run queries and to activate, but use InfoSum Bunker to upload your file, normalize and publish. Only you, and no-one else, can access your Bunker.
Accessing a Bunker
Switch to the Datasets tab in the Platform, if you're not there already.
Find the row representing the dataset you've just created (refer to the dataset’s public name if you're not sure).
Click the Access button and you will be redirected from the Platform to the Bunker. Every Bunker has:
- a User Interface
- a unique domain name
- a unique IP address
Once you are in the Bunker dashboard, you can grab the IP address and give this to your IT team if you need to whitelist the IP address.
Importing your dataset
Select Import from the left-hand menu and you will be taken to a selection of import connectors. For example, you can import your data from your Google Cloud bucket, your S3 bucket, your SFTP server or directly from your database servers. The following steps show you how to import a CSV file from your S3 bucket.
Click the Connect button for S3 and enter your credentials as shown below.
The above form contains the following fields:
- Access Key ID: Customer needs to authenticate using their AWS credentials
- Access Secret Key: Customer needs to authenticate using their AWS credentials
- Bucket Name: Specify the bucket name (no leading s3 identifier, e.g. "bucket-name" not "s3://bucket-name")
- Prefix: Optionally add an extra path in this box (for example, SubFolder/NextFolder/)
- GPG Encryption: GPG public key to encrypt the file
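The Bucket Name and Prefix rules above can be checked before you fill in the form. The following is a minimal sketch, not part of InfoSum: the function name split_s3_location is hypothetical, and the trailing slash on the prefix is an assumption based on the SubFolder/NextFolder/ example.

```python
def split_s3_location(location: str) -> tuple[str, str]:
    """Split an S3 location into (bucket_name, prefix).

    Accepts "bucket/path" or "s3://bucket/path". Returns the bare bucket
    name and a prefix ending in "/" (or "" when there is no path).
    """
    # Drop the scheme, because the Bucket Name field expects no s3:// identifier.
    if location.startswith("s3://"):
        location = location[len("s3://"):]
    bucket, _, path = location.partition("/")
    if not bucket:
        raise ValueError("no bucket name found")
    prefix = path.strip("/")
    # Assumption: the Prefix field takes a folder path with a trailing slash.
    return bucket, f"{prefix}/" if prefix else ""


print(split_s3_location("s3://bucket-name/SubFolder/NextFolder"))
# ('bucket-name', 'SubFolder/NextFolder/')
```

The first element goes in Bucket Name and the second in Prefix.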
You can import a normal file or an encrypted file.
If you are importing an encrypted file, click on the GPG key field and use the GPG public key provided here to encrypt your file. Every Bunker has a unique public and private key. The encrypted file will be decrypted using the Bunker’s private key when you upload the file.
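If you encrypt the file yourself before placing it in your S3 bucket, one way to do it is with the GnuPG command-line tool. The sketch below assumes you have saved the Bunker's GPG public key to a local file and that a recent GnuPG 2.x release (which provides the --recipient-file option) is installed. The file names and function names are hypothetical.

```python
import subprocess


def gpg_encrypt_command(public_key_file: str, source: str, target: str) -> list[str]:
    """Build a gpg command that encrypts `source` to the key in `public_key_file`."""
    return [
        "gpg",
        "--batch", "--yes",                    # no prompts; overwrite target if present
        "--recipient-file", public_key_file,   # encrypt to the single key in this file
        "--output", target,
        "--encrypt", source,
    ]


def encrypt_for_bunker(public_key_file: str, source: str, target: str) -> None:
    """Run gpg; raises CalledProcessError if gpg exits with an error."""
    subprocess.run(gpg_encrypt_command(public_key_file, source, target), check=True)


# Hypothetical file names, shown without running gpg:
print(" ".join(gpg_encrypt_command("bunker_public_key.asc", "customers.csv", "customers.csv.gpg")))
```

Calling encrypt_for_bunker with the same three arguments would produce customers.csv.gpg, which you can then upload to your bucket.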
When you have completed all required fields, click Connect. You will be taken to a page where you can download files from your S3 bucket.
Note: If you are experiencing slower than expected import/export speeds and you're using a VPN or firewall that can block data upload or download, please refer to whitelisting IP addresses.
The above form shows all the files available within the selected S3 bucket and contains the following fields:
- Key: File name within the S3 bucket
- Field Delimiter: Delimiter used to separate values in the file(s)
- This file is gpg encrypted: Enable this option if you are uploading an encrypted file. When you click Download, the Bunker will decrypt the file using the Bunker private key.
Copy a file name into the Key field and click Download. Note: you can only import a single file.
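If you are not sure which Field Delimiter your file uses, you can inspect its header line before importing. This is a deliberately naive sketch: it counts candidate characters and ignores quoting. The function name and the list of candidates are assumptions, not part of the Bunker.

```python
def guess_delimiter(header_line: str, candidates: str = ",;\t|") -> str:
    """Return the candidate delimiter that occurs most often in the header line."""
    counts = {d: header_line.count(d) for d in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no candidate delimiter found in header")
    return best


print(guess_delimiter("email;age;gender"))
# ;
```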
Previewing your data
Next, click Connect. A subset of your data appears as a preview.
You can upload either:
- a single value column (that is, one email per row for a single user), or
- a multi-value column (that is, multiple emails or hobbies per row for a single user). You could, for example, include “sports”, “travel” and “reading” as a single user’s hobbies.
Enabling multi-value columns
In preview settings, the Platform will show the delimiter used in the multi-value column. If it’s not the correct delimiter, select the correct delimiter from the dropdown list.
For each multi-value column, click on the toggle next to the column header and enable the multi-value columns option.
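To picture what a multi-value column looks like in a source file, here is a small sketch that parses a CSV sample in which the hobbies column holds several values per row. The pipe character as the multi-value delimiter, the sample data and the function name are all assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical sample: fields separated by commas, hobbies separated by "|".
SAMPLE = (
    "email,hobbies\n"
    "ann@example.com,sports|travel|reading\n"
    "bob@example.com,chess\n"
)


def read_multi_value(text: str, column: str, value_delimiter: str = "|") -> list[dict]:
    """Parse CSV text and split one column into a list of values per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Split the multi-value cell and drop empty fragments.
        row[column] = [v.strip() for v in row[column].split(value_delimiter) if v.strip()]
        rows.append(row)
    return rows


rows = read_multi_value(SAMPLE, "hobbies")
print(rows[0]["hobbies"])
# ['sports', 'travel', 'reading']
```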
Configuring your data
When you're happy with the preview, click Accept Preview Config. Your file will now be imported into InfoSum Platform.
Next, you will need to assign a config to the dataset. A config is a stored process for adapting your dataset into our Global Schema. Every customer needs to map their source file to the Global Schema. Some of the benefits of mapping to the Global Schema are:
- Each source dataset has its own language. For example, “mobile phone number” may be called “Contact” in one dataset and “Daytime mobile number” in another. Both datasets have the same column, but it is named differently, so everything needs to be standardized.
- Some datasets use different formats. For example, “gender” in one dataset may be classified 1,2,3 and as a,b,c in another. The Global Schema lets everyone map “gender” values so they are presented in a standard way, that is, either “Male”, “Female” or “Other”. The benefit of this is that you don't need to transform your data in your source file. You can simply assign your values to Global Schema values in InfoSum Bunker.
- For some data, such as age or salary range, the Global Schema will create representations of the data so you don’t need to create 10 different types of age columns in your source file. All you need is one “Age” column and the Global Schema will create the representations. This means that users can filter audiences or see insights by any of those representations.
Creating a config tells InfoSum Platform to remember how your dataset was manipulated and make it a repeatable process. This enables you to quickly publish another dataset without needing to manually set up the same configuration.
You can save a config so that next time you can load an existing config. Configs sit on the user level so any time you create a Bunker, you can use your saved config. For example, if I select the “Training config” you can see that I already have 9 matched categories.
Select Create a new config to go to the Import Wizard.
The Import Wizard has automatically picked some of my source columns and assigned them to the Global Schema because they have the same name as in the Global Schema. Any columns in your dataset not picked here will be in custom categories.
A custom category is a category which you have defined yourself (as opposed to one defined in InfoSum's Global Schema).
There are two reasons why a category will not be picked for the Global Schema:
- the category is not in the Global Schema, or
- the column name is not the same as in the Global Schema (for example, “device_id” and “mobile_phone_number” are in the Global Schema as “Mobile Advertising ID (e.g. AAID, IDFA)” and “Mobile Phone Number” respectively).
If your column is not in the Global Schema, you can add it as a custom category by selecting it here. Since I know that “device_id” and “mobile_phone_number” are categories in the Global Schema, I won’t select them as custom categories. The other columns are all custom columns so I have selected some of them here.
If you are unsure which categories to select, you can do this in the next step.
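The name-matching behaviour described above can be pictured as a simple partition of your column names. This sketch is not how the Import Wizard is implemented. The schema set is a hypothetical subset for illustration, and exact, case-sensitive name equality is an assumption.

```python
# Hypothetical subset of Global Schema category names, for illustration only.
GLOBAL_SCHEMA = {"Email", "Gender", "Age", "Mobile Phone Number"}


def split_columns(source_columns, schema=GLOBAL_SCHEMA):
    """Partition source columns into exact-name matches and leftovers."""
    matched = [c for c in source_columns if c in schema]
    unmatched = [c for c in source_columns if c not in schema]
    return matched, unmatched


print(split_columns(["Gender", "Age", "device_id", "mobile_phone_number"]))
# (['Gender', 'Age'], ['device_id', 'mobile_phone_number'])
```

The unmatched columns are the ones you would either assign to a Global Schema category by hand or keep as custom categories.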
Normalizing your data
Click Accept Wizard Settings to go to the Normalize page.
After assigning columns to categories, you may need to map the values in the original Gender column (such as 0 and 1) to the Male and Female values in the Global Schema. You can see that my Gender category is in 0s and 1s so the Global Schema cannot recognise it.
Mapping a category
Instead of transforming your data before you upload it, you can map your data to the Global Schema values so that “1” becomes “Female” and “0” becomes “Male”.
Note: Mapping only supports Global Schema categories. Custom Categories require transformation either by using the transformation tools or by performing the transformation prior to upload.
To set up mappings for a category, click the settings icon above your original data column, and select Mappings from the drop-down menu.
If you don't see the Mappings option, this means that the category you've selected doesn't support mappings. Mappings are only suitable for categories such as Gender that contain values selected from a predefined list of options.
The Mappings dialog appears where you can map “1” to “Female” and click Map.
Next, map “0” to “Male”.
Click Map to see your mappings:
Next, select the Extended Gender tab and repeat the previous steps to map the "0" and "1" gender values.
Click Save to see the values mapped correctly for the Gender category.
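Conceptually, the mapping you have just set up is a lookup table from source values to Global Schema values. The sketch below illustrates the idea only. How the Bunker treats values with no mapping is not described here, so returning None for them is an assumption.

```python
# The mapping from this guide: "1" becomes "Female" and "0" becomes "Male".
GENDER_MAPPING = {"1": "Female", "0": "Male"}


def apply_mapping(values, mapping):
    """Return mapped values; source values without a mapping become None."""
    return [mapping.get(v.strip()) for v in values]


print(apply_mapping(["0", "1", "1", "2"], GENDER_MAPPING))
# ['Male', 'Female', 'Female', None]
```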
Viewing Representations
If you scroll to the age source column, you will see that the Global Schema has automatically created six representations from the age source data. Creating representations from a single age column is another benefit of the Global Schema.
Assigning categories
After mapping to either our Global Schema or a custom category, any columns that appear in blue are not assigned to the Global Schema.
To assign a column to the Global Schema, click on the settings button next to the column name.
Select Assign category.
Click Next and select a category from the drop-down list.
This assigns your “mobile_phone_number” column to the Global Schema Mobile Phone Number category.
Next, select the local phone region in category properties, GB in this example.
Click Save to assign the category to the Global Schema.
If there isn't a relevant Global Schema category available, you can create a custom category. This lets you use categories beyond what is included in the Global Schema. For example, if you have an internal ID or flag that you want to use.
To create a custom category, open up the Assign Category dialog as before and click Next.
Then select the Custom Category option and an additional settings area will appear. You will now need to give the custom category a name and specify the type of data. Two custom categories in different datasets can only be matched if they have the same name, so this stage may require some coordination with other users.
If the column used for the custom category is an identifier, such as a Customer ID, you will need to select 'is key' for it to be used later on to match keys across datasets.
Viewing keys
When you have finished assigning categories and cleaning up your data, you can select Keys under Normalize to see the keys that can be used for matching with other datasets.
Click Normalize to run the normalization process.
What happens to your data during normalization?
We have a standard normalization process. For each Global Schema key there is a set of validations, which we check and then we hash the key. The same normalization checks, validations and hashing are applied to all datasets.
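As a rough illustration of the validate-then-hash pattern, here is a sketch for an email key. The specific checks and the hashing algorithm InfoSum applies are not stated in this guide. The lowercase/trim cleaning, the simple pattern check and the use of SHA-256 are all assumptions.

```python
import hashlib
import re
from typing import Optional

# Deliberately simple shape check: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_email_key(raw: str) -> Optional[str]:
    """Return a SHA-256 hex digest of the cleaned email, or None if invalid."""
    cleaned = raw.strip().lower()
    if not EMAIL_PATTERN.match(cleaned):
        return None  # fails validation, so it cannot be used as a key
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


# Differently formatted copies of the same address produce the same hash.
print(normalize_email_key(" Ann@Example.com ") == normalize_email_key("ann@example.com"))
# True
```

Applying the same steps to every dataset is what makes hashed keys from different datasets comparable.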
Saving your config
Once you’ve imported and normalized the dataset, you can save the config and reuse it within a Bunker.
To save your config, click on the Configs tab, and click Save. A dialog box appears where you can review and name the configuration.
Click Save to save the config. The next time you import a dataset to this Bunker, you will be able to retrieve this saved config to reduce the time it takes to publish.
Publishing your dataset
Once your dataset has been normalized, you can publish it. Any draft datasets are deleted after eight days.
Click Publish and your dataset is available on the Platform, where it is ready for sending permissions or running queries.
Click on your dataset to see the stats of the dataset. Click on the Key tab to see the fill rates for each key. For example, 90% distinct records for Mobile Phone Number means that 90% of records have a unique Mobile Phone Number and 10% have no Mobile Phone Number recorded.
Click on the Category Stats tab to see the fill rate for your dataset’s categories.
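To make the key statistics concrete, the sketch below computes a fill rate and a distinct rate for one column. The exact definitions the Platform uses are not given in this guide, so treating both as fractions of all records is an assumption. The phone numbers are made-up sample data.

```python
def key_stats(values):
    """Return (fill_rate, distinct_rate) as fractions of all records."""
    total = len(values)
    if total == 0:
        return 0.0, 0.0
    # A record counts as filled when the value is neither None nor empty.
    filled = [v for v in values if v not in (None, "")]
    return len(filled) / total, len(set(filled)) / total


# Ten records: nine unique numbers and one record with no number.
phones = [f"0770090000{i}" for i in range(9)] + [""]
print(key_stats(phones))
# (0.9, 0.9)
```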
You have now successfully published your dataset.