Importing data using Amazon S3 Cross-Account
Important note
This functionality isn't available as standard to all users. Please contact your InfoSum representative to learn more about how to gain access.
This training guide explains step-by-step how you can directly import a dataset from your Amazon S3 cross-account.
Workflow
Here is the workflow for uploading your dataset to InfoSum:
- Creating a dataset
- Accessing a Bunker
- Configuring your S3 cross-account
- Importing your dataset
- Previewing your data
- Configuring your data
- Normalizing your data
- Publishing your dataset
Some important concepts
Before you begin uploading your data, there are some concepts you’ll need to be familiar with:
- Dataset - A dataset is a single database table on a customer's InfoSum site containing a group of records, each made up of keys and categories.
- Key - Keys are Personally Identifiable Information (PII) such as mobile phone number, email or IP address which will be used for matching with other datasets.
- Category - Categories are specific types of information contained in a dataset such as age, gender, lapsed customer or existing customer.
Creating a dataset
Before starting, please make sure you are logged into your InfoSum Platform account.
Under Data, click the New Dataset button on the Datasets tab.
You will then be taken to the form below, where you can create a standard or large dataset depending on your file size. Standard datasets can contain up to 45 million rows; large datasets can contain 50 million rows or more.
Select the type of dataset you want and click Next. You will be taken to the form below.
Choose the cloud instance location and provider you would like for your Bunker and click Next.
Note: You will skip the above form if you do not have the option to select a location and supplier for your Bunker.
You will be asked to select the type of dataset you want to create.
Select the type of dataset you want to create and click Next.
Note: You will skip the above form if activation settings are not enabled for your account.
You have two options:
- Insight datasets are used to see the match between datasets and to create insights from their intersection. Insight data is always anonymized and numbers are always aggregated; individual records are never shown.
- Activation datasets are used to push your audience to an activation partner once you find your target audience.
You will then be taken to the form below.
Complete the fields on this form as follows:
- Add a Private ID - this is a name you give your dataset that only you can see.
- Add a Public Name - this is the name the other party will see when you grant them permissions to your dataset. In the above example, the other party will only see the public name “Demo Company”, but you will see the private ID “ExistingCustomers”. This allows you to give your dataset a meaningful name that will not be shared with another party.
- Add a Public Description - this optionally allows you to add a description for the dataset.
- Select a Project (if required) - this is a user-scoped label associated with a number of datasets.
- Select a Team - this is a collection of users within a company that can access the dataset.
- Dataset Expiry - by default, your dataset will expire in 2-3 days, after which you won't be able to access it. If you want to keep your dataset for longer, tick Do not expire.
Click Create and you will be taken back to the Datasets tab on the InfoSum Platform. This is your home for viewing, editing and deleting your datasets.
You can see that a dataset is yours by hovering your mouse over the blue-filled icon. If the dataset belongs to another party, this icon will appear as a blue outline icon.
When you create a dataset, the Platform automatically creates a corresponding Bunker. A Bunker is a private cloud instance that only the dataset owner can access.
Use InfoSum Platform to run queries and to activate, but use InfoSum Bunker to upload your file, normalize and publish. Only you, and no-one else, can access your Bunker.
Accessing a Bunker
Switch to the Datasets tab in the Platform, if you're not there already.
Find the row representing the dataset you've just created (refer to the dataset’s public name if you're not sure).
Click the Access button and you will be redirected from the Platform to the Bunker. Every Bunker has:
- a User Interface
- a unique domain name
- a unique IP address
Once you are in the Bunker dashboard, you can copy the IP address and give it to your IT team if you need to whitelist it.
Configuring your S3 cross-account
Before you can export data to the InfoSum Platform, you will need to configure your Amazon S3 cross-account.
First, create an Amazon S3 bucket. For the steps to do this, see Creating a bucket - Amazon Simple Storage Service. An S3 bucket must contain at least one compatible file for import to the InfoSum Platform to work.
Once you have created an Amazon S3 bucket, you will need to do the following to import into InfoSum Platform:
- Create a policy
- Associate the policy with a role, a user and a bucket
- Configure AWS Identity and Access Management (IAM) for the InfoSum S3 cross-account data connector
The sections below describe the steps to do this.
Creating a policy
The AWS policy, when attached to a Bunker, defines the Bunker's permissions. You will need to create the policy manually, as AWS has no built-in policy for importing/exporting data to InfoSum. To create an AWS policy for S3:
Go to the Identity and Access Management (IAM) Dashboard by searching for IAM in AWS.
Select Policies from the Access Management section.
In the Policies window, click the Create Policy button.
Choose the S3 service (this is the service the policy relates to).
Select the permissions required for InfoSum Platform to access files in the S3 bucket.
The table below shows the minimum permissions from the List, Read and Write access levels you need to select to allow InfoSum Platform to complete the S3 import or export:
| Access Level | Permission |
| --- | --- |
| List | ListAllMyBuckets, ListBucket |
| Read | GetObject |
| Write | DeleteObject, PutObject |
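As an illustration, the minimum permissions in the table above correspond to an IAM policy document like the one sketched below (built in Python for clarity; the bucket name `my-infosum-bucket` is a placeholder, and you may want to scope the resources more tightly):

```python
import json

def build_s3_policy(bucket: str) -> dict:
    """Sketch of a minimum-permission policy for the InfoSum S3 import/export."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListAllMyBuckets applies account-wide
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": "*",
            },
            {
                # ListBucket applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": "arn:aws:s3:::" + bucket,
            },
            {
                # Read and Write permissions apply to objects within the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": "arn:aws:s3:::" + bucket + "/*",
            },
        ],
    }

print(json.dumps(build_s3_policy("my-infosum-bucket"), indent=2))
```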
Click the Next: Tags button. You can add extra information to the policy here, but you do not need to add anything.
Click the Next: Review button to review the policy.
Give the policy a name (for example, S3-Cross-Account) and click Create policy. You have now created the policy permissions.
To review the new policy, go to the IAM Dashboard, select Policies from the Access Management section and click on the policy name. Here you can check that the policy has the correct Write, Read and List permissions.
Next, you will need to create a role, which you will associate with the policy.
Associating the policy with a role, a user and a bucket
Note: If you are pushing data to an S3 bucket, you will need to create a separate role in AWS for the push connector.
Go to IAM Dashboard and select Roles from the Access Management section.
In the Roles window, click the Create Role button.
Select the S3 service from the list, as shown below.
Scroll down and select S3 as your use case (Do not select S3 Batch Operations):
Click the Next: Permissions button and select the S3 permissions policy you created earlier, i.e. “S3-Cross-Account” in this example.
Click the Next: Tags button. You can add extra information to the role here, but you do not need to add anything.
Click the Next: Review button and give the role a name and optionally add a description of the role.
Check that the policy and trusted entities are correct and then click the Create Role button.
The role now appears in the list of roles, which shows the role's trusted entity as the AWS Service: S3. Click on the role and you can see it is attached to the S3-Cross-Account policy and S3 buckets.
Configuring AWS IAM for the InfoSum S3 cross-account data connector
You will need to configure AWS IAM to obtain the correct field values to use when importing/exporting S3 cross-account files into or from InfoSum Platform.
User ARN
The User ARN requested by the InfoSum S3 cross-account data connector is not the User ARN shown in a user’s AWS account. The correct ARN to use is the Role ARN shown in IAM > Roles > Summary in AWS.
Session Name
You will need to add the S3 session name in AWS. To do this:
Go to the IAM Dashboard and select Roles from the Access Management section.
Select the new role and then select the Trust relationships tab.
If you are importing data to the Platform, click Edit Trust Policy and replace the text with the InfoSum trust relationship policy shown below.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:sts::134928160093:assumed-role/InfoSumImportConnector/ChangeThis"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "d68b1db43ed52e7e0297dbaca2bad20f40d4faaf92b5349891f08fddde530e23"
        }
      }
    }
  ]
}
```
Replace the ChangeThis section of the above trust policy with a session name, which can be anything, e.g. InfoSumDemo. This is the session name that you will need to add to the Session Name field in the InfoSum S3 cross-account data connector.
If you are pushing data to an S3 bucket, you will also need to change the connector name in the Principal field from InfoSumImportConnector to InfoSumPushConnector, as shown below.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:sts::134928160093:assumed-role/InfoSumPushConnector/ChangeThis"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "d68b1db43ed52e7e0297dbaca2bad20f40d4faaf92b5349891f08fddde530e23"
        }
      }
    }
  ]
}
```
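For illustration only, the relationship between the connector name, the session name and the Principal ARN can be sketched as follows (the `principal_arn` helper is hypothetical, not part of the InfoSum or AWS tooling; the account ID is the one from the trust policies above):

```python
def principal_arn(connector: str, session_name: str) -> str:
    """Fill in the ChangeThis session name in the trust policy's Principal ARN.

    connector is "InfoSumImportConnector" for imports to the Platform, or
    "InfoSumPushConnector" for pushes to an S3 bucket.
    """
    return f"arn:aws:sts::134928160093:assumed-role/{connector}/{session_name}"

# Using the example session name from this guide
print(principal_arn("InfoSumImportConnector", "InfoSumDemo"))
# -> arn:aws:sts::134928160093:assumed-role/InfoSumImportConnector/InfoSumDemo
```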
External ID
The external ID in the AWS trust policy must match the one shown in the InfoSum S3 cross-account data connector; by default the two will be different.
To fix this, copy the external ID from the InfoSum S3 cross-account data connector and use it to replace the sts:ExternalId value in the AWS trust policy text.
Click the Update Trust Policy button. The trust policy displays the session name and the external ID to use in the InfoSum S3 Cross-Account import/export.
This trusted entity will be allowed to access the Amazon S3 bucket, provided the external ID within the InfoSum Platform is the same as in AWS.
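As a sanity check, you could verify the match programmatically. This sketch assumes the trust-policy shape shown above; the `external_ids_match` helper and the truncated ID `d68b1db4...` are illustrative placeholders:

```python
import json

def external_ids_match(trust_policy_json: str, connector_external_id: str) -> bool:
    """Return True if every statement's sts:ExternalId matches the connector's ID."""
    policy = json.loads(trust_policy_json)
    for statement in policy["Statement"]:
        condition = statement.get("Condition", {}).get("StringEquals", {})
        if condition.get("sts:ExternalId") != connector_external_id:
            return False
    return True

# A cut-down sample policy in the same shape as the trust policy above
sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:sts::134928160093:assumed-role/InfoSumImportConnector/InfoSumDemo"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "d68b1db4..."}},
    }],
})

print(external_ids_match(sample_policy, "d68b1db4..."))  # True
```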
Amazon S3 bucket
To find the S3 bucket name to use in the InfoSum S3 cross-account data connector:
In AWS, search for S3 services.
Click on S3 in services to display a list of S3 buckets.
Click on the S3 bucket to display details for the bucket.
Prefix
If your S3 bucket contains folders, you will need to specify the folder(s) to use in the Prefix field of the InfoSum S3 cross-account data connector.
The next section describes how you can use the session name, external ID, bucket name and prefix provided in AWS to complete the InfoSum S3 cross-account data connector fields.
Importing your dataset
Select Import from the left-hand menu and you will be taken to a selection of import connectors. For example, you can import your data from your Google Cloud bucket, your S3 bucket, your SFTP file or directly from your database servers. The following steps show you how to import a CSV or GZIP file from your S3 Cross-Account bucket.
Click the Connect button for the S3 Cross-Account and enter your credentials as shown below.
The above form contains the following fields:
- Principal: the ARN in this field must be authorized by the customer to assume the user role supplied in the User ARN field below
- External ID: an InfoSum-generated ID per user email domain. Customers should use this external ID, along with the Principal, as extra validation when allowing their user role to be assumed
- User ARN: the customer needs to create a user role with a permissions policy that allows reading from their S3 bucket. Note: this is not the User ARN shown in the user's AWS account. The correct ARN to use is the Role ARN shown in IAM > Roles > Summary in AWS
- Session Name: the user-defined session name
- Bucket Name: the bucket name with no leading s3 identifier, e.g. "bucket-name" not "s3://bucket-name"
- Prefix: optionally add an extra path in this box, e.g. SubFolder/NextFolder/
- GPG Encryption: the GPG public key used to encrypt the file
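To illustrate the Bucket Name and Prefix conventions, this hypothetical helper (not part of the InfoSum tooling) splits a full S3 URI into the two values the form expects:

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an S3 URI into (bucket name, prefix) as the connector form expects."""
    path = uri.removeprefix("s3://")   # the form wants no leading s3:// identifier
    bucket, _, prefix = path.partition("/")
    return bucket, prefix

print(split_s3_uri("s3://bucket-name/SubFolder/NextFolder/"))
# -> ('bucket-name', 'SubFolder/NextFolder/')
```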
You can import a normal file or an encrypted file.
If you are importing an encrypted file, click on the GPG key field and use the GPG public key provided here to encrypt your file. Every Bunker has a unique public and private key. The encrypted file will be decrypted using the Bunker’s private key when you upload the file.
When you have completed all required fields, click Connect. You will be taken to a page where you can download files from your S3 Cross-Account.
The above form shows all the files available within the selected S3 bucket and contains the following fields:
- Key: the file name(s) within the S3 bucket, separated by the delimiter specified in the Key Delimiter field
- Key Delimiter: the delimiter used to separate file names in the Key field
- File is GZIP compressed: Select this field if you are importing compressed GZIP (.gz) files. When you click Download, the Bunker will decompress the files. You can ignore this field if you are not importing compressed GZIP files
- Field Delimiter: Delimiter used to separate values in the file(s)
- If you are uploading an encrypted file, enable the "This file is gpg encrypted" option. When you click Download, the Bunker will decrypt the file using the Bunker private key.
Working with multiple files
- You can specify any number of files.
- All files must have the same structure.
- Clicking on a filename overwrites the current contents of the Key field. For this reason, we recommend listing multiple files in a text editor and pasting them into the Key field.
Copy the file names into the Key field and click Download.
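The relationship between the Key and Key Delimiter fields can be sketched as follows (the `split_keys` helper and the file names are illustrative, and the comma delimiter is an assumption):

```python
def split_keys(key_field: str, key_delimiter: str = ",") -> list[str]:
    """Split the Key field's contents into individual file names."""
    return [name.strip() for name in key_field.split(key_delimiter) if name.strip()]

# Three files listed in the Key field, separated by the Key Delimiter
print(split_keys("customers_jan.csv,customers_feb.csv,customers_mar.csv"))
# -> ['customers_jan.csv', 'customers_feb.csv', 'customers_mar.csv']
```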
Previewing your data
Next, click Connect. A subset of your data appears as a preview.
You can upload either:
- a single value column (that is, one email per row for a single user), or
- a multi-value column (that is, multiple emails or hobbies per row for a single user). You could, for example, include “sports”, “travel” and “reading” as a single user’s hobbies.
Enabling multi-value columns
In preview settings, the Platform will show the delimiter used in the multi-value column. If it’s not the correct delimiter, select the correct delimiter from the dropdown list.
For each multi-value column, click on the toggle next to the column header and enable the multi-value columns option.
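As a sketch of what a multi-value column contains, assuming a semicolon delimiter (use whatever delimiter your preview settings actually show):

```python
def split_multi_value(cell: str, delimiter: str = ";") -> list[str]:
    """Split one multi-value cell into its individual values."""
    return [value.strip() for value in cell.split(delimiter) if value.strip()]

# A single user's hobbies stored in one multi-value cell
print(split_multi_value("sports;travel;reading"))
# -> ['sports', 'travel', 'reading']
```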
Configuring your data
When you're happy with the preview, click Accept Preview Config. Your file will now be imported into InfoSum Platform.
Next, you will need to assign a config to the dataset. A config is a stored process for adapting your dataset into our Global Schema. Every customer needs to map their source file to the Global Schema. Some of the benefits of mapping to the Global Schema are:
- Each source dataset has its own language. For example, “mobile phone number” may be called “Contact” in one dataset and “Daytime mobile number” in another. Both datasets have the same column but defined in a different way so we need to standardize everything.
- Some datasets use different formats. For example, “gender” may be classified as 1, 2, 3 in one dataset and as a, b, c in another. The Global Schema lets everyone map their “gender” values so they are presented in a standard way, that is, as “Male”, “Female” or “Other”. The benefit is that you don't need to transform your data in your source file; you can simply assign your values to Global Schema values in the InfoSum Bunker.
- For some data, such as age or salary range, the Global Schema will create representations of the data so you don't need to create 10 different types of age columns in your source file. All you need is one “Age” column and the Global Schema will create the representations. This means that users can filter audiences or view insights by all of those representations.
Creating a config tells InfoSum Platform to remember how your dataset was manipulated and make it a repeatable process. This enables you to quickly publish another dataset without needing to manually set up the same configuration.
You can save a config so that next time you can load it as an existing config. Configs sit at the user level, so any time you create a Bunker you can use your saved configs. For example, if I select the “Training config”, I can see that I already have 9 matched categories.
Select Create a new config to go to the Import Wizard.
The Import Wizard has automatically picked some of my source columns and assigned them to the Global Schema because they have the same names as in the Global Schema. Any columns in your dataset not picked here will be treated as custom categories.
A custom category is a category which you have defined yourself (as opposed to one defined in InfoSum's Global Schema).
There are two reasons why a category will not be picked for the Global Schema:
- the category is not in the Global Schema, or
- the column name is not the same as in the Global Schema (for example, “device_id” and “mobile_phone_number” are in the Global Schema as “Mobile Advertising ID (e.g. AAID, IDFA)” and “Mobile Phone Number” respectively.)
If your column is not in the Global Schema, you can add it as a custom category by selecting it here. Since I know that “device_id” and “mobile_phone_number” are categories in the Global Schema, I won't select them as custom categories. The other columns are all custom columns, so I have selected some of them here.
If you are unsure which categories to select, you can do this in the next step.
Normalizing your data
Click Accept Wizard Settings to go to the Normalize page.
After assigning columns to categories, you may need to map the values in the original Gender column (such as 0 and 1) to the Male and Female values in the Global Schema. In this example, the Gender category contains 0s and 1s, so the Global Schema cannot recognize them.
Mapping a category
Instead of transforming your data before you upload it, you can map your data to the Global Schema values so that “1” becomes “Female” and “0” becomes “Male”.
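The mapping described above amounts to a simple value lookup, sketched here for illustration (the 0/1 codes follow the example in this guide; your source values may differ):

```python
# Source-value -> Global Schema value, as set up in the Mappings dialog
GENDER_MAPPING = {"1": "Female", "0": "Male"}

def map_values(column: list[str], mapping: dict[str, str]) -> list[str]:
    """Present each source value as its mapped Global Schema value."""
    return [mapping.get(value, value) for value in column]

print(map_values(["1", "0", "1"], GENDER_MAPPING))
# -> ['Female', 'Male', 'Female']
```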
Note: Mapping only supports Global Schema categories. Custom Categories require transformation either by using the transformation tools or by performing the transformation prior to upload.
To set up mappings for a category, click the settings icon above your original data column, and select Mappings from the drop-down menu.
If you don't see the Mappings option, this means that the category you've selected doesn't support mappings. Mappings are only suitable for categories such as Gender that contain values selected from a predefined list of options.
The Mappings dialog appears where you can map “1” to “Female” and click Map.
Next, map “0” to “Male”.
Click Map to see your mappings:
Next, select the Extended Gender tab and repeat the previous steps to map the "0" and "1" gender values.
Click Save to see the values mapped correctly for the Gender category.
Viewing Representations
If you scroll to the age source column, you will see that the Global Schema has automatically created six representations from the age source data. Creating representations from a single age column is another benefit of the Global Schema.
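The idea behind representations can be illustrated with a hypothetical age-banding function. The bands below are placeholders for illustration, not the actual representations the Global Schema derives:

```python
def age_band(age: int) -> str:
    """Derive one hypothetical banded representation from a raw age value."""
    bands = [(17, "Under 18"), (24, "18-24"), (34, "25-34"),
             (44, "35-44"), (54, "45-54")]
    for upper, label in bands:
        if age <= upper:
            return label
    return "55+"

# One "Age" source column yields a derived band for every record
print([age_band(a) for a in [21, 40, 67]])
# -> ['18-24', '35-44', '55+']
```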
Assigning categories
After mapping to either our Global Schema or a custom category, any columns that appear in blue are not assigned to the Global Schema.
To assign a column to the Global Schema, click on the settings button next to the column name.
Select Assign category.
Click Next and select a category from the drop-down list.
This assigns your “mobile_phone_number” column to the Global Schema Mobile Phone Number category.
Next, select the local phone region in category properties, GB in this example.
Click Save to assign the category to the Global Schema.
If there isn't a relevant Global Schema category available, you can create a custom category. This lets you use categories beyond what is included in the Global Schema. For example, if you have an internal ID or flag that you want to use.
To create a custom category, open the Assign Category dialog as before and click Next.
Then select the Custom Category option and an additional settings area will appear. You will now need to give the custom category a name and specify the type of data. Two custom categories in different datasets can only be matched if they have the same name, so this stage may require some coordination with other users.
If the column used for the custom category is an identifier, such as a Customer ID, you will need to select 'is key' for it to be used later on to match keys across datasets.
Viewing keys
When you have finished assigning categories and cleaning up your data, you can select Keys under Normalize to see the keys that can be used for matching with other datasets.
Click Normalize to run the normalization process.
What happens to your data during normalization?
We have a standard normalization process. For each Global Schema key there is a set of validations, which we check and then we hash the key. The same normalization checks, validations and hashing are applied to all datasets.
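As a rough illustration of that shape (clean, validate, then hash), here is a hypothetical email-key example. The actual validations and hash function InfoSum applies are not specified here; SHA-256 and this minimal check are assumptions for illustration only:

```python
import hashlib

def normalize_email(raw: str):
    """Clean an email key, apply a minimal validation, then hash it."""
    email = raw.strip().lower()
    # Placeholder validation -- real email validation is far stricter
    if "@" not in email or email.startswith("@") or email.endswith("@"):
        return None  # fails validation, so no key is produced
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

# The same address always produces the same hashed key, enabling matching
print(normalize_email("  Alice@Example.com "))
print(normalize_email("not-an-email"))  # None
```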
Saving your config
Once you’ve imported and normalized the dataset, you can save the config and reuse it within a Bunker.
To save your config, click on the Configs tab, and click Save. A dialog box appears where you can review and name the configuration.
Click Save to save the config. When you’re next importing a dataset to this Bunker, you will be able to retrieve this saved config to speed up the time before it's published.
Publishing your dataset
Once your dataset has been normalized, you can publish it. Any draft datasets are deleted after eight days.
Click Publish and your dataset is available on the Platform, where it is ready for sending permissions or running queries.
Click on your dataset to see its stats. Click on the Key tab to see the fill rate for each key. For example, 90% distinct records for Mobile Phone Number means that 90% of records have a unique Mobile Phone Number and 10% have no value for that key.
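For illustration, a fill rate is simply the share of records that have a value for a given key or category, as in this sketch:

```python
def fill_rate(values: list[str]) -> float:
    """Percentage of records with a non-empty value (empty strings count as missing)."""
    filled = sum(1 for value in values if value.strip())
    return round(100 * filled / len(values), 1)

# 3 of 5 records have a Mobile Phone Number value
print(fill_rate(["07700900001", "07700900002", "", "07700900004", ""]))  # 60.0
```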
Click on the Category Stats tab to see the fill rate for your dataset’s categories.
You have now successfully published your dataset.