Importing data using SFTP
This training guide explains step-by-step how you can directly import a dataset from your SFTP server.
Here is the workflow for uploading your dataset to InfoSum:
- Creating a dataset
- Accessing a Bunker
- Hosting an SFTP server in AWS (AWS Transfer Family)
- Importing your dataset
- Previewing your data
- Configuring your data
- Normalizing your data
- Publishing your dataset
Some important concepts
Before you begin uploading your data, there are some concepts you’ll need to be familiar with:
- Dataset - A dataset is a single database table on a customer’s InfoSum site. It contains a group of records, each made up of keys and categories.
- Key - Keys are Personally Identifiable Information (PII) such as mobile phone number, email or IP address which will be used for matching with other datasets.
- Category - Categories are specific types of information contained in a dataset such as age, gender, lapsed customer or existing customer.
Creating a dataset
To create a dataset, click the New Dataset button on the Datasets tab.
You will then be taken to the form below, where you can create a standard or large dataset depending on your file size. Standard datasets can contain up to 45 million rows and large datasets can contain 50 million or more rows.
Select the type of dataset you want and click Next. You will be taken to the form below.
Choose the cloud instance location and provider you would like for your Bunker and click Next.
Note: You will skip the above form if you do not have the option to select a location and provider for your Bunker.
You will then be asked to select the type of dataset you want to create.
Select a type and click Next.
Note: You will skip the above form if activation settings are not enabled for your account.
You have two options:
- Insight datasets are used to see the match between datasets and create insights from the intersection between datasets. Insight data is always anonymised and aggregated, so individual records are never shown.
- Activation datasets are used to push your audience to an activation partner once you find your target audience.
You will then be taken to the form below.
Complete the fields on this form as follows:
- Add a Private ID - this is a name you give your dataset that only you can see.
- Add a Public Name - this is the name the other party will see when you grant them permissions to your dataset. In the above example, the other party will only see the public name “Demo Company”, but you will see the private ID “ExistingCustomers”. This allows you to give your dataset a meaningful name that will not be shared with another party.
- Add a Public Description - this optionally allows you to add a description for the dataset.
- Select a Project (if required) - this is a user-scoped label associated with a number of datasets.
- Select a Team - this is a collection of users within a company that can access the dataset.
- The Dataset Expiry date is set to 2-3 days from creation. If you want to keep your dataset for longer, tick Do not expire; otherwise your dataset will expire and you won't be able to access it.
Click Create and you will be taken back to the Datasets tab on the InfoSum Platform. This is your home for viewing, editing and deleting your datasets.
You can see it is your dataset by hovering your mouse over the blue-filled icon. If the dataset belongs to another party, this icon will appear as a blue outline icon.
When you create a dataset, the Platform automatically creates a corresponding Bunker. A Bunker is a private cloud instance that only the dataset owner can access.
Use InfoSum Platform to run queries and to activate, but use InfoSum Bunker to upload your file, normalize and publish. Only you, and no-one else, can access your Bunker.
Accessing a Bunker
Switch to the Datasets tab in the Platform, if you're not there already.
Find the row representing the dataset you've just created (refer to the dataset’s public name if you're not sure).
Click the Access button and you will be redirected from the Platform to the Bunker. Every Bunker has:
- a User Interface
- a unique domain name
- a unique IP address
Once you are in the Bunker dashboard, you can grab the IP address and give this to your IT team if you need to whitelist the IP address.
Hosting an SFTP server in AWS (AWS Transfer Family)
You can host an SFTP server in AWS to import or export data on the InfoSum Platform using the SFTP data connector.
First, create an SFTP-enabled server in AWS and add a user to the server. For the steps to do this, see:
- Creating an SFTP server in AWS
- Adding a user to the SFTP server in AWS
Note: For detailed steps of how to create an SFTP server within AWS Transfer Family, please refer to the AWS Tutorial: Getting started with AWS Transfer Family.
Once you have created the SFTP server in AWS, you will need to authenticate the server and assign it the correct access for import to InfoSum Platform to work. For the steps to do this, see:
- Creating an SSH key to authenticate with the AWS Amazon Transfer Family SFTP server
- Adjusting AWS IAM Policy to allow access
Creating an SFTP server in AWS
Sign in to the AWS Transfer Family console.
Click Create Server on the main page of the AWS console. You are taken to the Choose Protocols page.
Ensure the SFTP option is selected and click Next. You are taken to the Choose an identity provider page.
Ensure that the Service managed option is selected and click Next. You are taken to the Choose an endpoint page.
Ensure the Publicly accessible option is selected and click Next. You are taken to the Choose a domain page.
Ensure that the Amazon S3 option is selected and click Next. You are taken to the Configure additional details page.
In the CloudWatch logging section, ensure the Create a new role option is selected.
Leave the Security policy and all other fields as is and scroll to the bottom of the page.
Click Next. You are taken to the Review and create page.
Click Create server. Your newly created SFTP server appears in the list of servers on the main Servers page.
Next, you will need to add a user to the server (see the next section).
Adding a user to the SFTP server in AWS
On the main Servers page, double-click the Server ID for your newly-created SFTP server. You are taken to the set-up page for the selected server.
Click Add user. You are taken to the Add user page.
Complete these fields as follows:
- Username - Type a name for the new user.
- Role - Select AWSTransferLoggingAccess from the drop-down list
Leave all other fields as is and scroll to the bottom of the page.
Click Add. You are taken to the main server page which now shows the newly-added user. The next section includes the steps to find details of the SSH public key for a user.
Creating an SSH key to authenticate with the AWS Amazon Transfer Family SFTP server
On your local machine (and on any other machine from which you will access the SFTP server), type the following at the command prompt:
ssh-keygen -P "" -f <filename>
This generates two keys:
- a private key, and
- a public key
The private key will have the name you specified using the -f flag. The public key will have the same name suffixed with .pub.
To share your details with the transfer server, view the public key generated in the previous step using the less command, then copy its contents:
less <filename>.pub
Alternatively, open the public key file in a text editor of your choice. Copy the contents and paste this into the SSH Public Keys field of the SFTP user that you created in AWS Transfer Family.
Adjusting AWS IAM Policy to allow access
You may encounter an issue whereby you have created the SFTP server, but cannot perform any actions on it (such as running the list directory contents command). This is usually because, by default, the SFTP server has no access to the attached S3 bucket. We recommend that you create a specific IAM policy for AWS Transfer Family that allows access to both S3 and Transfer Family commands, alongside a policy that allows full control of both the attached S3 bucket and SFTP server.
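As a rough illustration of what such a policy can look like, the sketch below builds a policy document as a Python dictionary and prints it as JSON. The bucket name is a placeholder, the helper name is hypothetical, and the listed actions are a deliberately narrow subset rather than the full control mentioned above, so treat it as a starting point and not as InfoSum's or AWS's recommended policy.

```python
import json


def build_sftp_bucket_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting access to a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "AllowListingOfBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [bucket_arn],
            },
            {
                # Object operations apply to the keys inside the bucket.
                "Sid": "AllowObjectAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "example-sftp-bucket" is a placeholder, not a real bucket.
policy = build_sftp_bucket_policy("example-sftp-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor and attached to the role used by the SFTP user, after replacing the placeholder bucket name.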
Importing your dataset
Select Import from the left-hand menu and you will be taken to a selection of import connectors. For example, you can import your data from your Google Cloud bucket, your S3 bucket, your SFTP server or directly from your database servers. The following steps show you how to import a CSV file from an SFTP server.
Click Connect and a form will appear as shown below to enter your credentials.
The above form contains four sections:
- Connection - Fields: Host, Port, Path
- Authentication - Fields: Username, Password or Private Key Pem
- Host Verification - Fields: Authorized Keys or Known Hosts or Public Key Pem
- GPG Encryption - GPG public key used to encrypt the file
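To make the relationship between these fields concrete, here is a minimal sketch that models the form as a Python data class and checks it before submission. The class and field names are hypothetical (they simply mirror the labels on the form), and the rules it checks (a password or a private key, exactly one host verification key) are assumptions drawn from this guide, not the Bunker's actual validation logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SftpConnection:
    # Connection
    host: str
    path: str
    # Authentication
    username: str
    port: int = 22  # conventional SSH/SFTP port
    password: Optional[str] = None
    private_key_pem: Optional[str] = None
    # Host verification (enter only one)
    authorized_keys: Optional[str] = None
    known_hosts: Optional[str] = None
    public_key_pem: Optional[str] = None

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the form looks complete."""
        found = []
        if not self.host:
            found.append("Host is required")
        if not 1 <= self.port <= 65535:
            found.append("Port must be between 1 and 65535")
        if not (self.password or self.private_key_pem):
            found.append("Provide a password or a private key")
        host_keys = [self.authorized_keys, self.known_hosts, self.public_key_pem]
        if sum(1 for key in host_keys if key) != 1:
            found.append("Provide exactly one host verification key")
        return found


# All values below are placeholders.
conn = SftpConnection(
    host="sftp.example.com",
    path="/uploads",
    username="demo-user",
    private_key_pem="<private key>",
    known_hosts="<known_hosts entry>",
)
print(conn.problems())  # []
```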
In the Bunker UI, you may need to enter up to three different keys, each serving a different purpose. They are described below.
Private Key Pem:
You can ignore this field if you are establishing a connection using a password.
This is the user's SSH private key, which authenticates the user in place of a password. It is one half of a public/private key pair.
If you are establishing a password-less connection using an SSH key, you will need to add the public SSH key to the authorized keys file on your server and put the private SSH key in the Private Key Pem field in the UI.
Host Verification Key:
You will need to enter a host public key in one of the formats below (you only need to enter one):
- Host Public Keys (OpenSSH authorized_keys format) - "Authorized Keys" in Bunker UI
- Host Public Keys (OpenSSH known_hosts format) - "Known Hosts" in Bunker UI
- Host Public Keys (PEM Format) - "Public Key PEM" in Bunker UI. Currently, we only support PKIX format for public keys. The PEM block with "PUBLIC KEY" will go to this field.
Please note this key is NOT the same as the public part of the user SSH key. It is a public key associated with your server, not with your user.
You can find this key in one of two ways.
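1) Your IT team can look it up on the server (probably in the /etc/ssh directory), where there will be a number of host key files. On a typical OpenSSH installation these include, for example:
ssh_host_rsa_key.pub
ssh_host_ecdsa_key.pub
ssh_host_ed25519_key.pub
The contents of one of these .pub files can be put straight into the "Authorized Keys" field on the Bunker UI. An ecdsa file typically has the following format:
ecdsa-sha2-nistp256 <base64-encoded public key> <optional comment>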
2) Alternatively, connect to your server over SSH once from your local machine so that its host key is recorded in your local known_hosts file, then run "ssh-keygen -F <hostname>" to look up the public key for that host. This command searches known_hosts for the host; it does not generate a new key pair.
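The known_hosts and authorized_keys formats differ mainly in the leading host field. As an illustration only, the following sketch strips that field from a plain known_hosts entry so that the remainder has the "key type, base64 key" shape used by the "Authorized Keys" field. The function name is hypothetical and the key payload is a dummy value generated in the code, not a real host key.

```python
import base64
import binascii


def known_hosts_to_authorized_key(line: str) -> str:
    """Drop the host field from a known_hosts entry, leaving '<keytype> <base64 key>'."""
    fields = line.split()
    if fields and fields[0].startswith("@"):
        # Skip an optional marker such as @cert-authority.
        fields = fields[1:]
    if len(fields) < 3:
        raise ValueError("expected: <hosts> <keytype> <base64 key>")
    _hosts, key_type, key_b64 = fields[:3]
    try:
        # Reject entries whose key field is not well-formed base64.
        base64.b64decode(key_b64, validate=True)
    except binascii.Error as exc:
        raise ValueError("key field is not valid base64") from exc
    return f"{key_type} {key_b64}"


# Dummy payload so the example runs without exposing a real key.
placeholder = base64.b64encode(b"not a real host key").decode("ascii")
entry = f"sftp.example.com ssh-ed25519 {placeholder}"
print(known_hosts_to_authorized_key(entry))
```

If you paste a known_hosts entry into the "Known Hosts" field instead, no conversion is needed; the sketch only shows how the two formats relate.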
GPG Public Key:
You can ignore this field if you are not uploading an encrypted file.
Your Bunker will generate a public/private key pair. You can use the GPG public key provided to you in the UI for encrypting your file.
When you have completed all the required fields, click Connect to take you to the Download page, where you can select the files to import.
Click on a filename in the table to copy it to the File name box and select Download.
If you are uploading an encrypted file, enable the "This file is gpg encrypted" option. When you click Download, the Bunker will decrypt the file using the Bunker private key.
Previewing your data
Next, click Connect. A subset of your data appears as a preview.
You can upload either:
- a single value column (that is, one email per row for a single user), or
- a multi-value column (that is, multiple emails or hobbies per row for a single user). You could, for example, include “sports”, “travel” and “reading” as a single user’s hobbies.
Enabling multi-value columns
In preview settings, the Platform will show the delimiter used in the multi-value column. If it’s not the correct delimiter, select the correct delimiter from the dropdown list.
For each multi-value column, click on the toggle next to the column header and enable the multi-value columns option.
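To picture what a multi-value column means for the data, the sketch below reads a small CSV sample and splits one column on its delimiter. The sample rows, the pipe delimiter and the function name are assumptions made for illustration; the Bunker performs this step for you once the option is enabled.

```python
import csv
import io

# Hypothetical sample: the "hobbies" column holds several values per row.
SAMPLE = (
    "email,hobbies\n"
    "anna@example.com,sports|travel|reading\n"
    "ben@example.com,cooking\n"
)


def read_rows(text: str, multi_value_columns: dict) -> list:
    """Parse CSV text, splitting each named column on its delimiter."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, delimiter in multi_value_columns.items():
            cell = row[column]
            # An empty cell becomes an empty list rather than [""].
            row[column] = [v.strip() for v in cell.split(delimiter)] if cell else []
        rows.append(row)
    return rows


rows = read_rows(SAMPLE, {"hobbies": "|"})
print(rows[0]["hobbies"])  # ['sports', 'travel', 'reading']
```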
Configuring your data
When you're happy with the preview, click Accept Preview Config. Your file will now be imported into InfoSum Platform.
Next, you will need to assign a config to the dataset. A config is a stored process for adapting your dataset into our Global Schema. Every customer needs to map their source file to the Global Schema. Some of the benefits of mapping to the Global Schema are:
- Each source dataset has its own language. For example, “mobile phone number” may be called “Contact” in one dataset and “Daytime mobile number” in another. Both datasets contain the same column, but it is named differently, so the names need to be standardized.
- Some datasets use different formats. For example, “gender” in one dataset may be classified 1,2,3 and as a,b,c in another. The Global Schema lets everyone map “gender” values so they are presented in a standard way, that is, either “Male”, “Female” or “Other”. The benefit of this is that you don't need to transform your data in your source file; you can simply assign your values to Global Schema values in InfoSum Bunker.
- For some data, such as age or salary range, the Global Schema will create representations of the data so you don’t need to create 10 different types of age columns in your source file. All you need is one “Age” column and the Global Schema will create the representations. This means that users can filter audiences or see insights by any of those representations.
Creating a config tells InfoSum Platform to remember how your dataset was manipulated and make it a repeatable process. This enables you to quickly publish another dataset without needing to manually set up the same configuration.
You can save a config so that next time you can load an existing config. Configs sit on the user level so any time you create a Bunker, you can use your saved config. For example, if I select the “Training config” you can see that I already have 9 matched categories.
Select Create a new config to go to the Import Wizard.
The Import Wizard has automatically picked some of my source columns and assigned them to the Global Schema because they have the same name as in the Global Schema. Any columns in your dataset not picked here will be in custom categories.
A custom category is a category which you have defined yourself (as opposed to one defined in InfoSum's Global Schema).
There are two reasons why a category will not be picked for the Global Schema:
- the category is not in the Global Schema, or
- the column name is not the same as in the Global Schema (for example, “device_id” and “mobile_phone_number” are in the Global Schema as “Mobile Advertising ID (e.g. AAID, IDFA)” and “Mobile Phone Number” respectively).
If your column is not in the Global Schema, you can add it as a custom category by selecting it here. Since I know that “device_id” and “mobile_phone_number” are categories in the Global Schema, I won’t select them as custom categories. The other columns are all custom columns so I have selected some of them here.
If you are unsure which categories to select, you can do this in the next step.
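The name-based matching described in this section can be pictured with a short sketch. The schema list is a small illustrative subset, the column names are made up, and the case-insensitive comparison is an assumption; the Import Wizard's real matching rules are not documented here.

```python
# Illustrative subset of Global Schema category names (assumption).
GLOBAL_SCHEMA = [
    "Gender",
    "Age",
    "Mobile Phone Number",
    "Mobile Advertising ID (e.g. AAID, IDFA)",
]


def normalise_name(name: str) -> str:
    """Compare names without regard to case or surrounding whitespace."""
    return name.strip().lower()


def auto_assign(source_columns, schema_names):
    """Split source columns into name matches and columns left for manual assignment."""
    lookup = {normalise_name(name): name for name in schema_names}
    matched, unmatched = {}, []
    for column in source_columns:
        target = lookup.get(normalise_name(column))
        if target:
            matched[column] = target
        else:
            unmatched.append(column)
    return matched, unmatched


columns = ["gender", "age", "mobile_phone_number", "device_id", "loyalty_tier"]
matched, unmatched = auto_assign(columns, GLOBAL_SCHEMA)
print(matched)    # {'gender': 'Gender', 'age': 'Age'}
print(unmatched)  # ['mobile_phone_number', 'device_id', 'loyalty_tier']
```

In this sketch “mobile_phone_number” and “device_id” stay unmatched because their names differ from the schema names, which mirrors why they need to be assigned manually in the next step.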
Normalizing your data
Click Accept Wizard Settings to go to the Normalize page.
After assigning columns to categories, you may need to map the values in the original Gender column (such as 0 and 1) to the Male and Female values in the Global Schema. You can see that my Gender category is in 0s and 1s so the Global Schema cannot recognise it.
Mapping a category
Instead of transforming your data before you upload it, you can map your data to the Global Schema values so that “1” becomes “Female” and “0” becomes “Male”.
Note: Mapping only supports Global Schema categories. Custom Categories require transformation either by using the transformation tools or by performing the transformation prior to upload.
To set up mappings for a category, click the settings icon above your original data column, and select Mappings from the drop-down menu.
If you don't see the Mappings option, this means that the category you've selected doesn't support mappings. Mappings are only suitable for categories such as Gender that contain values selected from a predefined list of options.
The Mappings dialog appears where you can map “1” to “Female” and click Map.
Next, map “0” to “Male”.
Click Map to see your mappings:
Next, select the Extended Gender tab and repeat the previous steps to map the "0" and "1" gender values.
Click Save to see the values mapped correctly for the Gender category.
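Conceptually, a mapping is a lookup from each source value to a Global Schema value. The sketch below shows the idea for the Gender example; the function name and the handling of unmapped values are assumptions, not a description of how the Bunker implements mappings.

```python
# Source value -> Global Schema value, as set up in the Mappings dialog.
GENDER_MAPPING = {"0": "Male", "1": "Female"}


def apply_mapping(values, mapping):
    """Return the mapped values plus the set of source values that had no mapping."""
    mapped, unmapped = [], set()
    for value in values:
        key = value.strip()
        if key in mapping:
            mapped.append(mapping[key])
        else:
            # Keep the row but flag the value so it can be reviewed.
            mapped.append(None)
            unmapped.add(key)
    return mapped, unmapped


mapped, unmapped = apply_mapping(["1", "0", "0", "2"], GENDER_MAPPING)
print(mapped)    # ['Female', 'Male', 'Male', None]
print(unmapped)  # {'2'}
```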
If you scroll to the age source column, you will see that the Global Schema has automatically created six representations from the age source data. Creating representations from a single age column is another benefit of the Global Schema.
After mapping to either our Global Schema or a custom category, any columns that appear in blue are not assigned to the Global Schema.
To assign a column to the Global Schema, click on the settings button next to the column name.
Select Assign category.
Click Next and select a category from the drop-down list.
This assigns your “mobile_phone_number” column to the Global Schema Mobile Phone Number category.
Next, select the local phone region in category properties, GB in this example.
Click Save to assign the category to the Global Schema.
If there isn't a relevant Global Schema category available, you can create a custom category. This lets you use categories beyond what is included in the Global Schema, for example an internal ID or flag that you want to use.
To create a custom category, open up the Assign Category dialog as before and click Next.
Then select the Custom Category option and an additional settings area will appear. You will now need to give the custom category a name and specify the type of data. Two custom categories in different datasets can only be matched if they have the same name, so this stage may require some coordination with other users.
If the column used for the custom category is an identifier, such as a Customer ID, you will need to select 'is key' for it to be used later on to match keys across datasets.
When you have finished assigning categories and cleaning up your data, you can select Keys under Normalize to see the keys that can be used for matching with other datasets.
Click Normalize to run the normalization process.
What happens to your data during normalization?
We have a standard normalization process. For each Global Schema key there is a set of validations; we run these checks and then hash the key. The same normalization checks, validations and hashing are applied to all datasets.
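The validate-then-hash pattern can be sketched as follows for an email key. Everything specific here is an assumption made for illustration: the cleaning steps, the loose email pattern and the choice of SHA-256 are not taken from InfoSum's documentation, and the real process may differ.

```python
import hashlib
import re
from typing import Optional

# Deliberately loose check: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalise_email_key(raw: str) -> Optional[str]:
    """Clean, validate and hash an email; return None if validation fails."""
    cleaned = raw.strip().lower()
    if not EMAIL_PATTERN.match(cleaned):
        return None
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


a = normalise_email_key("  Anna@Example.com ")
b = normalise_email_key("anna@example.com")
print(a == b)                               # True
print(normalise_email_key("not-an-email"))  # None
```

Because both parties apply the same cleaning and hashing, differently formatted copies of the same email produce the same hashed key and can be matched without the raw value being compared.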
Saving your config
Once you’ve imported and normalized the dataset, you can save the config and reuse it within a Bunker.
To save your config, click on the Configs tab, and click Save. A dialog box appears where you can review and name the configuration.
Click Save to save the config. When you next import a dataset to this Bunker, you will be able to retrieve this saved config to reduce the time it takes to publish.
Publishing your dataset
Once your dataset has been normalized, you can publish it. Any draft datasets are deleted after eight days.
Click Publish and your dataset is available on the Platform, where it is ready for sending permissions or running queries.
Click on your dataset to see the stats of the dataset. Click on the Key tab to see the fill rates for each key. For example, 90% distinct records for Mobile Phone Number means that 90% of records have a unique Mobile Phone Number and the remaining 10% do not.
Click on the Category Stats tab to see the fill rate for your dataset’s categories.
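As a rough guide to reading these statistics, the sketch below computes two simple measures for a key column: the share of records with any value, and the number of distinct values relative to the number of records. The definitions, the function name and the sample phone numbers are assumptions; the Platform's exact calculations may differ.

```python
def key_stats(values):
    """Return the populated share and the distinct-value share of a column."""
    total = len(values)
    populated = [v for v in values if v not in (None, "")]
    return {
        "fill_rate": len(populated) / total if total else 0.0,
        "distinct_rate": len(set(populated)) / total if total else 0.0,
    }


# Ten hypothetical records: two are empty and one number appears twice.
phones = [
    "+447700900001", "+447700900002", "+447700900002", "", None,
    "+447700900003", "+447700900004", "+447700900005",
    "+447700900006", "+447700900007",
]
stats = key_stats(phones)
print(stats)  # {'fill_rate': 0.8, 'distinct_rate': 0.7}
```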
You have now successfully published your dataset.