Data formatting for normalization 2.0
Data preparation overview
When you import a dataset to InfoSum Platform, you will be provided with a range of tools to normalize the data and map it into our Global Schema. Here are some basic principles to help you prepare your data before import:
- Our platform works best with one row per customer/customer-level data
- Include human-readable column names and data to allow for easy partner analysis
- Using multi-value attributes reduces the number of columns needed and creates more interesting insights
- Please import raw data where possible, as inconsistencies with hashing done off-platform can prevent or reduce matches
- For data providers/publishers: We recommend maintaining an off-the-shelf data schema that you can make available to all partners, so your data can be productized with ease. Focus on making the data usable and driving your customers to value quickly. Custom projects that require very specific slices of data or greater granularity can be generated separately for customers who want more.
- Please ensure your data stays within the character limits of our platform
How normalization works
Normalization plays an important role in ensuring the security of your data. During normalization, direct identifiers (such as emails) are irreversibly converted to pseudonymized, salted keys. Our normalization process begins by lowercasing values and removing any leading or trailing spaces, then converts the raw PII to SHA256, which is further encrypted and salted. By the end of the normalization process, no translatable identifier information is stored within an Insight Bunker.
We recommend bunkering identifier data in raw format wherever possible to avoid any discrepancies in standardization across partners.
Using the Global Schema
Keys can be mapped to InfoSum's Global Schema, which defines a standard set of identifiers that can be used to compare datasets from diverse original sources.
Multi-language support
Column headers are used to map columns to the Global Schema, and they are recognized in English only. If your column names are in another language, you can still map them manually to the Global Schema or add them as custom keys (if you're uploading a custom key, you will need to ensure the names are exactly the same across all partner datasets).
See this section for a list of keys included in our Global Schema. Any keys that are not included can be added to your dataset as a custom key.
Should I import hashed data?
We recommend importing raw data into your Cloud Vault. All columns marked as Key will be double-hashed and salted during normalization, making them irreversibly encrypted. Data can be imported using GPG encryption for added security.
If you are using the Global Schema:
- You must always import raw identifiers for MAID, UDPRN, IP Address, home & mobile phone numbers, Zip9, Zip5
- The rest of the identifiers can be imported hashed as long as all parties use the same rules and hashing methodology
If you are creating custom keys:
- You can still import the data in raw format and we recommend this to avoid any normalization errors
- If you need to import the data pre-hashed, you must set the same column name during normalization and ensure that all organizations collaborating use the same rules and hashing methodology. Case and salt inconsistencies will prevent matches.
Pre-hashing data
The onboarding steps need to be completed in this order: 1) Normalization > 2) Hashing. Normalization and hashing can happen directly on the platform if you are onboarding raw data. Please note that not all Global Schema keys can be imported hashed as mentioned above.
All parties involved in the collaboration must agree on whether to import raw or hashed matching keys and what hashing is done. If some parties only have hashed data and others only raw data, the parties with raw data must hash their matching keys before publishing them to their Bunkers, or matching will be impossible. This can be done:
- Before import: Data can be onboarded pre-hashed to SHA256 standard, but it needs to be normalized first (all lowercase and any spaces removed) to ensure that after hashing the values still match.
- During the normalization process: raw keys can have an additional forced hash applied in the platform
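As a sketch, pre-hashing off-platform before import might look like the following in Python. The function name is illustrative; the normalization rules are the ones described above (lowercase, leading/trailing spaces removed, then SHA256):

```python
import hashlib

def normalize_and_hash(value: str) -> str:
    """Normalize an identifier BEFORE hashing so that all parties
    produce identical SHA256 values for the same raw identifier."""
    normalized = value.strip().lower()   # normalize first...
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()  # ...then hash

# The same email in different casings hashes to the same key:
a = normalize_and_hash("  Jane.Doe@Example.com ")
b = normalize_and_hash("jane.doe@example.com")
assert a == b
```

Hashing without normalizing first is the most common cause of missed matches: `SHA256("Jane@Example.com")` and `SHA256("jane@example.com")` are completely different values.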
Salting data
We do not recommend uploading salted data as this is not necessary due to the decentralized nature of our platform.
If salt is required, the onboarding steps need to be completed in this order: 1) Salt > 2) Normalize > 3) Hash.
- Data can be onboarded with a salt that all parties agree on to ensure that encrypted values match.
- If you’d like to onboard data that is already salted and hashed, please ensure that you have completed the normalization process before hashing (all lowercase and any spaces removed).
- Please consult with your collaboration partners before using any salt. Few companies in our ecosystem salt their datasets, and all parties would need to use the exact same salt.
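If all parties do agree on a salt, the Salt > Normalize > Hash order above could be sketched like this. The salt value and its placement before the identifier are assumptions you would need to agree with your partners:

```python
import hashlib

def salt_normalize_hash(value: str, salt: str) -> str:
    # 1) Salt: prepend the agreed salt (placement is an assumption --
    #    every party must use the identical salt and placement)
    salted = salt + value
    # 2) Normalize: lowercase and remove leading/trailing spaces
    normalized = salted.strip().lower()
    # 3) Hash to SHA256 hexadecimal
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

shared_salt = "example-salt"  # hypothetical agreed salt
k1 = salt_normalize_hash("User@Example.com", shared_salt)
k2 = salt_normalize_hash("user@example.com", shared_salt)
assert k1 == k2  # partners using the same salt produce matching keys
```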
Structuring identifiers
The list below outlines the identifiers commonly used in the Platform. During the normalization process, these identifiers are converted into keys, which are used to match rows in a query. Keys can be both deterministically (e.g. Email) and probabilistically (e.g. Full name and phone number) matched during a query.
All keys can be multi-value keys (multiple values in one cell).
Email address
What email address format does InfoSum require?
Email addresses can be provided either as raw data or in SHA256 hexadecimal format, and must be in a single column. If you use SHA256 format, ensure all email addresses are lowercase with leading/trailing white space removed before you convert them to SHA256.
How does InfoSum validate emails?
If the email is in raw format, it is checked against RFC 2822. If the email is in SHA256 format, the hash is checked for correct length.
How does InfoSum normalize emails?
If plain/raw emails are provided (not in SHA256 format), each address is converted to lowercase and leading/trailing white space is removed before hashing.
Phone Number
Always provided in raw format. We will validate that the phone number is valid and in E.164 format.
Both mobile and home phone numbers can be imported in separate columns.
After mapping to the Global Schema you must specify the phone number region. You will need to select the two-letter country code for the region your phone numbers are for from a dropdown. We currently only support one country code per dataset when mapping to the Global Schema.
If you are including phone numbers from different countries, please contact your InfoSum representative.
If you wish to onboard multiple phone numbers per user, please make sure to create one multi-value column. During the recordset creation process you can identify your multi-value delimiter so the platform can recognize them.
Alternatively you can create a multi-value column during normalization by using the ‘create multi-value’ modification step.
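As an illustration, converting raw numbers to E.164 and joining several numbers into one multi-value cell might look like this. The 10-digit US assumption and the `|` delimiter are hypothetical; declare your actual delimiter at recordset creation:

```python
def to_e164(raw: str, country_code: str = "1") -> str:
    """Hypothetical sketch: strip formatting characters and prefix the
    country code. Assumes a US ('1') 10-digit national number; real
    E.164 validation is stricter than this."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 10:          # national number without country code
        digits = country_code + digits
    return "+" + digits

# Multiple numbers per user joined into one multi-value cell; '|' is an
# assumed delimiter -- use whatever you declare at recordset creation.
numbers = ["(415) 555-0100", "415-555-0199"]
cell = "|".join(to_e164(n) for n in numbers)
print(cell)  # +14155550100|+14155550199
```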
NOTE: When activating phone numbers to Meta, you need a second column that includes the country, so please ensure your dataset has that information if you plan to use the Meta destination.
Name
Names are not included as a standard key in the Global Schema; they can be imported as custom keys. Collaborators must agree on formatting and column name.
Postal address
Below are some general guidelines for addresses. If you are in the US, please refer to the US address mapper page. We’re currently working to bring the UK address mapper to the normalizer.
If using the address mapper please only include UK or US addresses in one dataset.
For other addresses, a range of address columns can be imported. Each datapoint should be split into individual columns, e.g. street in one column, town in another column, etc.
You can then create custom keys by concatenating the different columns during normalization.
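A concatenation sketch, assuming hypothetical column names and a `|` separator that all partners would need to agree on (order, separator, and casing must match exactly across datasets):

```python
def address_key(row: dict) -> str:
    """Illustrative custom key built from separate address columns.
    Column names and the '|' separator are assumptions."""
    parts = [row["street"], row["town"], row["postcode"]]
    # lowercase and trim spacing so all partners normalize identically
    return "|".join(p.strip().lower() for p in parts)

row = {"street": "10 Downing Street", "town": "London", "postcode": "SW1A 2AA"}
print(address_key(row))  # 10 downing street|london|sw1a 2aa
```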
Postcode/Zipcode
Always provided in raw format
For US addresses, please provide Zip5 or Zip9 (if using the US address mapper, these two keys will be created automatically during the normalization process).
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format
Mobile ID
Both Android's Advertising ID (AAID) and Apple's Advertising Identifier (IDFA) can be used.
IP Address
There is no format enforcement, but most clients provide this in IPv4 or IPv6.
E.g.
IPv4: 116.61.80.61, 30.161.132.202, 137.143.254.196, 62.158.243.253
IPv6: 2001:0000:130F:0000:0000:09C0:876A:130B
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format
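A quick way to check which format your IP values are in before import, using Python's standard library:

```python
import ipaddress

def ip_kind(value: str) -> str:
    """Classify a string as IPv4, IPv6, or invalid using the stdlib."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return "invalid"
    return "IPv4" if addr.version == 4 else "IPv6"

assert ip_kind("116.61.80.61") == "IPv4"
assert ip_kind("2001:0000:130F:0000:0000:09C0:876A:130B") == "IPv6"
assert ip_kind("not-an-ip") == "invalid"
```

Running a check like this over your dataset before import makes it easy to confirm with your partner that both sides hold the same address family in the same format.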
Social media handles
Identifiers from most social media platforms can be used.
E.g. Social Media: Twitter Handle
Social Media: Facebook ID
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format.
Note: Twitter/X only accepts handles without a leading ‘@’, so we recommend uploading them without the initial symbol.
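A minimal sketch of stripping the leading ‘@’ before upload; the lowercasing step is an assumption you would agree with your partners:

```python
def clean_handle(handle: str) -> str:
    # Remove surrounding whitespace and any leading '@' so handles
    # match across partners; lowercasing is an assumed convention.
    return handle.strip().lstrip("@").lower()

assert clean_handle("@InfoSum") == "infosum"
assert clean_handle("infosum") == "infosum"
```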
Structuring attributes
An attribute is non-unique information in a dataset (such as age or marital status) that can be compared and analyzed.
A column or several columns can be combined in the Bunker during load to make an attribute. Attributes are retained in raw form in the dataset and are used to filter results and to collect anonymized statistics.
Age/DOB
To add age or date of birth, we recommend adding just the year (age or birth year) as one column. At the normalizer stage, set the data type to integer and use the ‘Integer Buckets’ modification setting to create useful representations of your customers’ ages, for example:
1994-1999
25-30
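As an illustration, an integer-bucket step similar to the platform's ‘Integer Buckets’ modification could be sketched like this; the bucket width and bounds are assumptions:

```python
def age_bucket(age: int, width: int = 5, lo: int = 0, hi: int = 100) -> str:
    """Map an integer to a labelled range, roughly mirroring what an
    integer-bucket modification produces. Width/bounds are assumed."""
    age = max(lo, min(age, hi))          # clamp outliers into range
    start = (age // width) * width       # floor to the bucket boundary
    return f"{start}-{start + width - 1}"

print(age_bucket(27))  # 25-29
```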
Cost/Price/Revenue (currency)
Currency symbols are not supported as part of the column name nor as part of the data values, so please ensure that you remove them. Due to our focus on privacy, you cannot perform the calculations that you would usually perform on currency figures.
We recommend:
- During the data normalization process, use the integer bucket modification to create useful representations of your data that can be surfaced for analysis. Think about what’s most interesting for you to do with that data, how your customers purchase the product, the individual product you’re adding data about, and the type of analysis that you want to do on the platform
- Full numbers and bucket representations are also great for greater than/lower than filtering
If your analysis will focus heavily on cost efficiencies, please speak to our team for a more tailored recommendation.
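A sketch of cleaning currency values before import; the symbols handled and the rounding rule are assumptions:

```python
def clean_amount(value: str) -> int:
    """Strip currency symbols and thousands separators, returning a
    whole-number amount ready for the integer bucket modification.
    The symbol list and rounding behaviour are assumptions."""
    cleaned = value.replace("$", "").replace("£", "").replace(",", "").strip()
    return round(float(cleaned))

assert clean_amount("$1,249.50") == 1250
assert clean_amount("£99") == 99
```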
Ratios or percentages E.g. share of wallet
Percentage symbols (%) are not supported as part of the column name nor as part of the data values (the platform requires integers or floating points without symbols in the cell value).
During data normalization, you will need to map the numbers to the integer or floating point category.
We recommend keeping percentages as integers as this will allow you to use the integer bucket modification to create useful representations of your data that can be surfaced for analysis and greater than/lower than filtering.
Dates and timestamps
You can import dates and timestamps into your Bunker. To be able to use them as date formats for insights and measurement, you will need to modify your column to mark it as date-time during the normalization process.
We support all standard time and date formats. You can also include the time zone in the data itself or add it during normalization.
Please note that the segment builder and IQL tool don’t support filtering on dates at this stage. This is on our roadmap, so please reach out to your InfoSum representative if you’re interested in this functionality.
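As an example of preparing a consistent date-time value before marking the column as date-time during normalization; the source format and UTC time zone here are assumptions:

```python
from datetime import datetime, timezone

# Parse a source timestamp and emit a consistent ISO 8601 value with
# an explicit time zone. The input format and UTC are assumptions --
# use whatever your source system actually produces.
raw = "03/21/2024 14:30"  # hypothetical source format
dt = datetime.strptime(raw, "%m/%d/%Y %H:%M").replace(tzinfo=timezone.utc)
print(dt.isoformat())  # 2024-03-21T14:30:00+00:00
```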
Interests & behaviours
These might include attributes such as Characteristics, Content consumption, Contextual, Hobbies etc
Using multi-value attributes reduces the number of columns needed and creates more interesting insights by grouping similar interests or interest fields into one column. E.g. instead of one column per type of sport, create a ‘favorite sports’ column where you can list the sport or sports.
You can join individual single-value columns into a multi-value column using the modifications feature during normalization.
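A sketch of joining single-value columns into one multi-value column off-platform, mirroring what the ‘create multi-value’ modification does; the column names, ‘yes’ flags, and ‘|’ delimiter are hypothetical:

```python
# Hypothetical per-sport flag columns for one customer row, collapsed
# into a single multi-value 'favorite sports' cell.
row = {"likes_tennis": "yes", "likes_golf": "no", "likes_soccer": "yes"}

favorites = "|".join(
    col[len("likes_"):]                 # keep just the sport name
    for col, flag in row.items()
    if flag == "yes"                    # include only flagged interests
)
print(favorites)  # tennis|soccer
```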
It can be challenging to provide relevant insights without enough detail, but going to the highest level of granularity makes analysis more cumbersome. We suggest you include just two levels of granularity, for example:
- Favourite sports > type of sport (multi-value) > sports club
- TV channel > Type of program > program title
- Music > Music genre > Artist
You may also wish to upload your content organized following a common media language, such as IAB categories, if you’re already using these, as it can reduce the data preparation needed on your side. Following these categories (names and format) can create more transferable insights for brands (e.g. they can activate the insight programmatically or as part of a media plan).
Protecting PII during normalization
Never publish PII as attribute data
This can happen when the Platform treats key information as category information. When importing a data file, some keys are automatically mapped to the Global Schema. These automatically mapped keys are never classified as attributes. However, when creating a custom key, you must explicitly tell the system that the column data is a key, or it will be classified as an attribute. Failure to do this can result in personal data being published to a dataset.
You can do this on the normalization screen by turning on the ‘key’ toggle (second column from the left).
Never publish PII as meta-data
This can happen when the Platform unintentionally treats personal data as header data. It can result in personal data being published to a dataset or saved to a configuration file.
You can ensure that personal data is not classified as meta-data when creating the recordset by turning off the ‘Files have column headers’ toggle to tell the platform that your file does not have column headers.
However, we recommend that, where possible, your files have human-readable headers.
List of Global Schema Keys
This list contains the standard keys that are defined in the Global Schema. Any keys that are not included can be added to your dataset as a custom key.
- Mobile Advertising ID (e.g. AAID, IDFA)
- UDPRN
- Cookie ID
- IPv4/IPv6
- Mobile Phone Number
- Home Phone Number
- Zip9
- Zip5
- Epsilon CORE ID
- Experian LUID
- Experian PID
- NetID
- Tapad ID
- Transunion ID