Importing data using Google Cloud Platform
This training guide explains step-by-step how you can directly import a dataset from your Google Cloud Platform account.
Workflow
Here is the workflow for uploading your dataset to InfoSum:
- Creating a dataset
- Accessing a Bunker
- Configuring Google Cloud Platform
- Importing your dataset
- Previewing your data
- Configuring your data
- Normalizing your data
- Publishing your dataset
Some important concepts
Before you begin uploading your data, there are some concepts you’ll need to be familiar with:
- Dataset - A dataset is a single database table on a customer’s InfoSum site. It contains a group of records, and its columns are made up of keys and categories.
- Key - Keys are Personally Identifiable Information (PII) such as mobile phone number, email or IP address which will be used for matching with other datasets.
- Category - Categories are specific types of information contained in a dataset such as age, gender, lapsed customer or existing customer.
Creating a dataset
Before starting, please make sure you are logged into your InfoSum Platform account.
Under Data, click the New Dataset button on the Datasets tab.
You will then be taken to the form below, where you can create a standard or large dataset depending on your file size. Standard datasets can contain up to 45 million rows and large datasets can contain 50 million or more rows.
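As a rough illustration of the size limits quoted above, the following Python sketch counts the data rows in a local CSV file and suggests a dataset type. The function names and the assumption that the file has a single header row are hypothetical. This guide does not say which type applies between 45 and 50 million rows, so the sketch leaves that case undecided.

```python
import csv

STANDARD_MAX_ROWS = 45_000_000  # "up to 45 million rows" (from this guide)
LARGE_MIN_ROWS = 50_000_000     # "50 million or more rows" (from this guide)


def suggest_dataset_type(row_count: int) -> str:
    """Return 'standard', 'large', or 'check with InfoSum' for a row count."""
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    if row_count <= STANDARD_MAX_ROWS:
        return "standard"
    if row_count >= LARGE_MIN_ROWS:
        return "large"
    # The guide does not cover the range between the two limits.
    return "check with InfoSum"


def count_data_rows(path: str) -> int:
    """Count rows in a CSV file, assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)
```

For example, suggest_dataset_type(count_data_rows("customers.csv")) would return "standard" for a file with a few thousand rows, where "customers.csv" is a placeholder file name.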
Select the type of dataset you want and click Next. You will be taken to the form below.
Choose the cloud instance location and provider you would like for your Bunker and click Next.
Note: You will skip the above form if you do not have the option to select a location and provider for your Bunker.
You will be asked to select the type of dataset you want to create.
Select the type of dataset you want to create and click Next.
Note: You will skip the above form if activation settings are not enabled for your account.
You have two options:
- Insight datasets are used to see the match between datasets and create insights from the intersection between datasets. Insight data is always anonymised and aggregated, so individual records are never shown.
- Activation datasets are used to push your audience to an activation partner once you find your target audience.
You will then be taken to the form below.
Complete the fields on this form as follows:
- Add a Private ID - this is a name you give your dataset that only you can see.
- Add a Public Name - this is the name the other party will see when you grant them permissions to your dataset. In the above example, the other party will only see the public name “Demo Company”, but you will see the private ID “ExistingCustomers”. This allows you to give your dataset a meaningful name that will not be shared with another party.
- Add a Public Description - this optionally allows you to add a description for the dataset.
- Select a Project (if required) - this is a user-scoped label associated with a number of datasets.
- Select a Team - this is a collection of users within a company that can access the dataset.
- Check the Dataset Expiry date - by default, the dataset will expire in 2-3 days. If you want to keep your dataset for longer, tick Do not expire; otherwise your dataset will expire and you won't be able to access it.
Click Create and you will be taken back to the Datasets tab on the InfoSum Platform. This is your home for viewing, editing and deleting your datasets.
You can see it is your dataset by hovering your mouse over the blue-filled icon. If the dataset belongs to another party, this icon will appear as a blue outline icon.
When you create a dataset, the Platform automatically creates a corresponding Bunker. A Bunker is a private cloud instance that only the dataset owner can access.
Use InfoSum Platform to run queries and to activate, but use InfoSum Bunker to upload your file, normalize and publish. Only you, and no-one else, can access your Bunker.
Accessing a Bunker
Switch to the Datasets tab in the Platform, if you're not there already.
Find the row representing the dataset you've just created (refer to the dataset’s public name if you're not sure).
Click the Access button and you will be redirected from the Platform to the Bunker. Every Bunker has:
- a User Interface
- a unique domain name
- a unique IP address
Once you are in the Bunker dashboard, you can grab the IP address and give this to your IT team if you need to whitelist the IP address.
Configuring Google Cloud Platform
Before you can import data into InfoSum Platform, you will need to configure your Google Cloud Platform account.
First, you will need a Google Cloud Platform account. To set up an account, or access an existing account, go to the URL: console.cloud.google.com
Once you have created your Google Cloud Platform account, you will need to do the following to import into InfoSum Platform:
- Create a project
- Create a bucket for your project
- Create a service account for your project
- Associate a key with a service account
The sections below describe the steps to do this.
Creating a project
Go to the console.cloud.google.com URL for your account.
Select Browser and click the New Project button in the Menu bar. Alternatively, select a project and click the New Project button in the dialog box that opens.
The new project window opens, where you can change the new project name and browse and select the location of an associated Google organization.
Note: personal accounts cannot be associated with a Google organization.
Click the Create button to create the project.
Next, you will need to create a bucket for your project.
Creating a bucket for your project
Select Browser from the Cloud Storage menu for your newly-created project.
Select Create Bucket.
Enter a name for the bucket from the window that opens.
Click Continue to choose where your data will be stored, based on your own location.
Select the location type and location, for example, Multi-region and EU.
Note: You can see a breakdown of monthly cost estimates in the right-hand pane of the window.
Click Continue and select the default storage for your data. This can be left as Standard.
Click Continue and choose how to control access to objects, which can be left as is.
Click Continue to open advanced settings, which can be left as is.
Click Create and the bucket and its selected settings appear in the Browser list.
Click on the bucket to open the details window, where you can drag and drop files into the bucket.
Next, you will need to create a service account. This account is used to create the JSON file for the Google Cloud Platform import connector file drop that you will see on the InfoSum Platform, here:
Creating a service account for your project
Select Service Accounts from the IAM & Admin menu in the Google Cloud Platform.
Select Create Service Account.
This opens the window shown below. Give the service account a name - this name is used to auto-complete the randomly generated Service account... field underneath.
Click Create and Continue to create the account and grant service account access to the project.
Select the Storage Object Viewer role or above to be able to use this account. Storage Object Viewer is the lowest level role that can use this service account.
Click Continue to grant specific users access to this account. There is no need to add any users here because, when you later create a key for the account, the JSON file is downloaded to your computer and can be given to any users that need it.
Click Done and the account appears in the list of Service Accounts.
Next, you will need to associate a key with the account as it has no keys.
Associating a key with a service account
Select Manage Keys from the Actions menu for the newly-created service account.
Select Create new key from the Add Key menu to create a new key for this account.
The Create private key dialog box opens.
Select JSON as the key type and click Create.
Note: InfoSum Platform only accepts JSON keys.
The following message appears after the JSON file is downloaded to your computer:
Click Close. If you go to the list of service accounts, the newly-created service account now shows the key associated with this account.
An example JSON key is shown below, which contains the project ID, private key ID, private key, client email, client ID, and authorization/token URLs.
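For reference, a service account key downloaded from Google Cloud Platform is a JSON object containing the fields checked in the sketch below. The helper function name and the placeholder values are illustrative assumptions and are not part of InfoSum or Google tooling. The check only confirms that the expected fields are present before you upload the file to the connector.

```python
import json

# Fields found in a Google Cloud service account key file.
EXPECTED_FIELDS = (
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
)


def missing_key_fields(key_text: str) -> list[str]:
    """Return the expected fields that are absent or empty in a JSON key file."""
    key = json.loads(key_text)
    problems = [name for name in EXPECTED_FIELDS if not key.get(name)]
    if key.get("type") not in (None, "service_account"):
        problems.append("type (expected 'service_account')")
    return problems


# Placeholder values only; a real key file contains your own project details.
example_key = json.dumps({
    "type": "service_account",
    "project_id": "my-example-project",
    "private_key_id": "<private key id>",
    "private_key": "<private key in PEM format>",
    "client_email": "my-account@my-example-project.iam.gserviceaccount.com",
    "client_id": "<client id>",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
})

print(missing_key_fields(example_key))  # prints []
```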
You can complete the InfoSum Google Cloud Storage data connector fields using the downloaded JSON key file and bucket name provided in Google Cloud Platform, as described in the next section.
Importing your dataset
Select Import from the left-hand menu and you will be taken to a selection of import connectors. For example, you can import your data from your Google Cloud bucket, your S3 bucket, your SFTP file or directly from your database servers. The following steps show you how to import a GCS dataset.
Click the Connect button for the Google Cloud Storage connector. You can complete the Google Cloud Storage data connector fields using the downloaded JSON key file and bucket name provided in Google Cloud Platform.
Import the service account credential (JSON) file.
In the Bunker UI, you may need to enter the GPG key, which you can find here.
GPG Key:
You can ignore this field if you are not uploading an encrypted file.
Your Bunker will generate a public/private key pair. You can use the GPG public key provided to you in the UI for encrypting your file.
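If you do encrypt your file before placing it in the bucket, the step could look like the following sketch. It assumes that GnuPG is installed on your machine, that you have saved the Bunker's public key from the UI to a local file, and that your GnuPG 2.x version supports the --recipient-file option. The function and file names are hypothetical, and none of this is InfoSum tooling.

```python
import subprocess


def build_gpg_encrypt_command(public_key_file: str, source_file: str) -> list[str]:
    """Build a gpg command that encrypts source_file to the key in public_key_file."""
    return [
        "gpg",
        "--output", source_file + ".gpg",      # encrypted copy to upload
        "--encrypt",
        "--recipient-file", public_key_file,   # assumed: Bunker public key saved locally
        source_file,
    ]


def encrypt_file(public_key_file: str, source_file: str) -> None:
    """Run the command; this requires GnuPG to be installed locally."""
    subprocess.run(build_gpg_encrypt_command(public_key_file, source_file), check=True)
```

Calling encrypt_file("bunker_public.asc", "customers.csv") would write customers.csv.gpg next to the source file, which you could then place in your bucket.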
Click on Connect and you will be taken to the Connect stage.
When you specify the bucket, you will be taken to a list of files in the selected Google bucket, which you can select for download.
In the Object field, specify the file name(s) within the bucket, separating each file name with a comma.
Working with multiple files
- You can specify any number of files. There is no limit to the number of files you can download.
- Filenames must be separated with a comma.
- All files must have the same structure.
- Clicking on a filename replaces the current contents of the Object field with that filename. For this reason, we recommend listing multiple files in a text editor and then copying and pasting the list into the Object field.
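The rules above can be sketched in Python as a small helper that builds the comma-separated text for the Object field and checks that the files share one structure. The function names are hypothetical, and treating "same structure" as "identical header row" is an assumption made for illustration.

```python
def build_object_field(filenames: list[str]) -> str:
    """Join file names with commas, as the Object field expects."""
    cleaned = [name.strip() for name in filenames if name.strip()]
    for name in cleaned:
        if "," in name:
            # A comma inside a name would be read as a separator.
            raise ValueError(f"file name contains a comma: {name!r}")
    return ",".join(cleaned)


def same_structure(headers: list[list[str]]) -> bool:
    """Return True if every file has the same header row (assumed meaning)."""
    return all(header == headers[0] for header in headers[1:])


print(build_object_field(["customers_1.csv", " customers_2.csv ", ""]))
# prints customers_1.csv,customers_2.csv
```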
If you are uploading an encrypted file, enable the This file is gpg encrypted option. When you click Download, the Bunker will decrypt the file using the Bunker private key.
Next, click Download.
In the Field Delimiter field, select the delimiter used to separate values in the file(s), then click Connect.
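If you are not sure which delimiter your file uses, a quick local check such as the sketch below can help before you choose a value in the Field Delimiter field. The candidate list and function name are assumptions. The check simply counts characters per line, so it can be misled by quoted fields that contain a delimiter.

```python
from typing import Optional

CANDIDATES = [",", ";", "\t", "|"]  # assumed set of likely delimiters


def guess_delimiter(sample_lines: list[str]) -> Optional[str]:
    """Return the candidate that occurs the same number of times on every line."""
    best, best_count = None, 0
    for delimiter in CANDIDATES:
        counts = {line.count(delimiter) for line in sample_lines if line.strip()}
        if len(counts) == 1:  # consistent across all non-empty lines
            count = counts.pop()
            if count > best_count:
                best, best_count = delimiter, count
    return best


sample = ["email|age|hobbies", "a@example.com|34|sports,travel"]
print(repr(guess_delimiter(sample)))  # prints '|'
```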
Previewing your data
After you click Connect, a subset of your data appears as a preview.
You can upload either:
- a single-value column (that is, one email per row for a single user), or
- a multi-value column (that is, multiple emails or hobbies per row for a single user). You could, for example, include “sports”, “travel” and “reading” as a single user’s hobbies.
Enabling multi-value columns
In preview settings, the Platform will show the delimiter used in the multi-value column. If it’s not the correct delimiter, select the correct delimiter from the dropdown list.
For each multi-value column, click on the toggle next to the column header and enable the multi-value columns option.
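To make the idea of a multi-value column concrete, the sketch below splits the cells of one row into lists. The "|" delimiter, the column names, and the function name are assumptions for illustration; your file may use a different delimiter.

```python
def split_multi_value(cell: str, delimiter: str = "|") -> list[str]:
    """Split one multi-value cell into its trimmed, non-empty values."""
    return [value.strip() for value in cell.split(delimiter) if value.strip()]


row = {"email": "a@example.com|b@example.com", "hobbies": "sports|travel|reading"}
multi_value_columns = {"email", "hobbies"}

# Split only the columns that are flagged as multi-value.
parsed = {
    column: split_multi_value(cell) if column in multi_value_columns else cell
    for column, cell in row.items()
}
print(parsed["hobbies"])  # prints ['sports', 'travel', 'reading']
```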
Configuring your data
When you're happy with the preview, click Accept Preview Config. Your file will now be imported into InfoSum Platform.
Next, you will need to assign a config to the dataset. A config is a stored process for adapting your dataset into our Global Schema. Every customer needs to map their source file to the Global Schema. Some of the benefits of mapping to the Global Schema are:
- Each source dataset has its own language. For example, “mobile phone number” may be called “Contact” in one dataset and “Daytime mobile number” in another. Both datasets contain the same column, but it is named differently, so the names need to be standardized.
- Some datasets use different formats. For example, “gender” may be coded as 1, 2, 3 in one dataset and as a, b, c in another. The Global Schema lets everyone map “gender” values so they are presented in a standard way, that is, either “Male”, “Female” or “Other”. The benefit of this is that you don't need to transform the data in your source file; you can simply assign your values to Global Schema values in InfoSum Bunker.
- For some data, such as age or salary range, the Global Schema will create representations of the data so you don’t need to create 10 different types of age columns in your source file. All you need is one “Age” column and the Global Schema will create the representations. This means that users can filter audiences or see insights by any of those representations.
Creating a config tells InfoSum Platform to remember how your dataset was manipulated and make it a repeatable process. This enables you to quickly publish another dataset without needing to manually set up the same configuration.
You can save a config so that next time you can load an existing config. Configs sit on the user level so any time you create a Bunker, you can use your saved config. For example, if I select the “Training config” you can see that I already have 9 matched categories.
Select Create a new config to go to the Import Wizard.
The Import Wizard has automatically picked some of my source columns and assigned them to the Global Schema because they have the same name as in the Global Schema. Any columns in your dataset not picked here will be in custom categories.
A custom category is a category which you have defined yourself (as opposed to one defined in InfoSum's Global Schema).
There are two reasons why a category will not be picked for the Global Schema:
- the category is not in the Global Schema, or
- the column name is not the same as in the Global Schema (for example, “device_id” and “mobile_phone_number” are in the Global Schema as “Mobile Advertising ID (e.g. AAID, IDFA)” and “Mobile Phone Number” respectively).
If your column is not in the Global Schema, you can add it as a custom category by selecting it here. Since I know that “device_id” and “mobile_phone_number” are categories in the Global Schema, I won’t select them as custom categories. The other columns are all custom columns so I have selected some of them here.
If you are unsure which categories to select, you can do this in the next step.
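The name-based matching described in this section can be pictured with the sketch below. It is not InfoSum's implementation: the schema excerpt is a hypothetical list built from names mentioned in this guide, and the case-insensitive comparison is an assumption.

```python
# Hypothetical excerpt; the real Global Schema is defined by InfoSum.
GLOBAL_SCHEMA_NAMES = [
    "Gender",
    "Age",
    "Mobile Phone Number",
    "Mobile Advertising ID (e.g. AAID, IDFA)",
]


def split_columns(source_columns: list[str]) -> tuple[dict[str, str], list[str]]:
    """Return (columns matched by name, columns left for manual assignment)."""
    lookup = {name.lower(): name for name in GLOBAL_SCHEMA_NAMES}
    matched, unmatched = {}, []
    for column in source_columns:
        if column.lower() in lookup:  # assumed: names compared without regard to case
            matched[column] = lookup[column.lower()]
        else:
            unmatched.append(column)
    return matched, unmatched


matched, unmatched = split_columns(
    ["Gender", "age", "device_id", "mobile_phone_number", "loyalty_tier"]
)
print(matched)    # prints {'Gender': 'Gender', 'age': 'Age'}
print(unmatched)  # prints ['device_id', 'mobile_phone_number', 'loyalty_tier']
```

In this sketch, “device_id” and “mobile_phone_number” are left unmatched because their names differ from the Global Schema names, so they would need to be assigned manually, while a column such as the hypothetical “loyalty_tier” would become a custom category.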
Normalizing your data
Click Accept Wizard Settings to go to the Normalize page.
After assigning columns to categories, you may need to map the values in the original Gender column (such as 0 and 1) to the Male and Female values in the Global Schema. You can see that my Gender category is in 0s and 1s so the Global Schema cannot recognise it.
Mapping a category
Instead of transforming your data before you upload it, you can map your data to the Global Schema values so that “1” becomes “Female” and “0” becomes “Male”.
Note: Mapping only supports Global Schema categories. Custom Categories require transformation either by using the transformation tools or by performing the transformation prior to upload.
To set up mappings for a category, click the settings icon above your original data column, and select Mappings from the drop-down menu.
If you don't see the Mappings option, this means that the category you've selected doesn't support mappings. Mappings are only suitable for categories such as Gender that contain values selected from a predefined list of options.
The Mappings dialog appears where you can map “1” to “Female” and click Map.
Next, map “0” to “Male”.
Click Map to see your mappings:
Next, select the Extended Gender tab and repeat the previous steps to map the "0" and "1" gender values.
Click Save to see the values mapped correctly for the Gender category.
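Conceptually, the mapping you have just set up behaves like the lookup in the sketch below. The function name is hypothetical and the Bunker applies mappings for you; the sketch only shows the effect on a few sample values, including one value that has no mapping.

```python
from typing import Optional

GENDER_MAPPING = {"1": "Female", "0": "Male"}  # the mapping used in this guide


def apply_mapping(
    values: list[str], mapping: dict[str, str]
) -> tuple[list[Optional[str]], set[str]]:
    """Return the mapped values (None where unmapped) and the set of unmapped inputs."""
    mapped = [mapping.get(value.strip()) for value in values]
    unmapped = {value.strip() for value in values if value.strip() not in mapping}
    return mapped, unmapped


mapped, unmapped = apply_mapping(["0", "1", "1", "2"], GENDER_MAPPING)
print(mapped)    # prints ['Male', 'Female', 'Female', None]
print(unmapped)  # prints {'2'}
```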
Viewing Representations
If you scroll to the age source column, you will see that the Global Schema has automatically created six representations from the age source data. Creating representations from a single age column is another benefit of the Global Schema.
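As an illustration of what a representation is, the sketch below derives banded versions of a single age value. The band widths and labels are hypothetical and are not the six representations that the Global Schema actually creates.

```python
def age_band(age: int, width: int) -> str:
    """Return the band of the given width that contains the age, e.g. '30-39'."""
    if age < 0 or width <= 0:
        raise ValueError("age must be >= 0 and width must be > 0")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def example_representations(age: int) -> dict[str, str]:
    """Build a few hypothetical representations from one age value."""
    return {f"{width}-year band": age_band(age, width) for width in (5, 10)}


print(example_representations(34))
# prints {'5-year band': '30-34', '10-year band': '30-39'}
```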
Assigning categories
After mapping to either our Global Schema or a custom category, any columns that appear in blue are not assigned to the Global Schema.
To assign a column to the Global Schema, click on the settings button next to the column name.
Select Assign category.
Click Next and select a category from the drop-down list.
This assigns your “mobile_phone_number” column to the Global Schema Mobile Phone Number category.
Next, select the local phone region in category properties, GB in this example.
Click Save to assign the category to the Global Schema.
If there isn't a relevant Global Schema category available, you can create a custom category. This lets you use categories beyond what is included in the Global Schema, for example an internal ID or flag that you want to use.
To create a custom category, open up the Assign Category dialog as before and click Next.
Then select the Custom Category option and an additional settings area will appear. You will now need to give the custom category a name and specify the type of data. Two custom categories in different datasets can only be matched if they have the same name, so this stage may require some coordination with other users.
If the column used for the custom category is an identifier, such as a Customer ID, you will need to select 'is key' for it to be used later on to match keys across datasets.
Viewing keys
When you have finished assigning categories and cleaning up your data, you can select Keys under Normalize to see the keys that can be used for matching with other datasets.
Click Normalize to run the normalization process.
What happens to your data during normalization?
We have a standard normalization process. For each Global Schema key, we run a set of validations and then hash the key. The same normalization checks, validations and hashing are applied to all datasets.
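The general pattern of "validate, then hash" can be sketched as follows. This guide does not describe InfoSum's actual validation rules or hashing scheme, so everything here is an assumption chosen for illustration: the loose email pattern, the trimming and lower-casing, and the use of SHA-256.

```python
import hashlib
import re
from typing import Optional

# Deliberately loose pattern; real validation rules are likely stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_and_hash_email(raw: str) -> Optional[str]:
    """Trim and lower-case an email, validate it, and return its SHA-256 hex digest."""
    cleaned = raw.strip().lower()
    if not EMAIL_PATTERN.match(cleaned):
        return None  # fails validation, so it is not hashed
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


# Two spellings of the same address produce the same hash after normalization.
print(normalize_and_hash_email(" Alice@Example.com ") == normalize_and_hash_email("alice@example.com"))
# prints True
```

Applying the same steps to every dataset is what allows two parties' keys to be compared after hashing, which is why the process needs to be identical for all datasets.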
Saving your config
Once you’ve imported and normalized the dataset, you can save the config and reuse it within a Bunker.
To save your config, click on the Configs tab, and click Save. A dialog box appears where you can review and name the configuration.
Click Save to save the config. When you next import a dataset to this Bunker, you will be able to retrieve this saved config and so reduce the time it takes to publish.
Publishing your dataset
Once your dataset has been normalized, you can publish it. Any draft datasets are deleted after eight days.
Click Publish and your dataset is available on the Platform, where it is ready for sending permissions or running queries.
Click on your dataset to see the stats of the dataset. Click on the Key tab to see the fill rates for each key. For example, a figure of 90% distinct records for Mobile Phone Number means that 90% of records have a unique Mobile Phone Number and 10% have no value for this key.
Click on the Category Stats tab to see the fill rate for your dataset’s categories.
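If you want to sanity-check these figures against your source file, a calculation along the lines of the sketch below may help. The exact definitions used by the Platform are not given in this guide, so the sketch assumes that the fill rate is the share of records with a non-empty value and that the distinct rate is the number of different non-empty values divided by the number of records.

```python
def key_stats(values: list[str]) -> dict[str, float]:
    """Return fill rate and distinct rate as percentages of all records."""
    total = len(values)
    if total == 0:
        return {"fill_rate": 0.0, "distinct_rate": 0.0}
    filled = [value.strip() for value in values if value and value.strip()]
    return {
        "fill_rate": 100.0 * len(filled) / total,
        "distinct_rate": 100.0 * len(set(filled)) / total,
    }


# Ten records: nine different phone numbers and one empty value.
phone_numbers = [f"0770090000{i}" for i in range(9)] + [""]
print(key_stats(phone_numbers))
# prints {'fill_rate': 90.0, 'distinct_rate': 90.0}
```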
You have now successfully published your dataset.