Data formatting for normalization 2.0
Data preparation overview
When you import a dataset to InfoSum Platform, you will be provided with a range of tools to normalize the data and map it into our Global Schema. Here are some basic principles to help you prepare your data before import:
- Our platform works best with one row per customer/customer-level data
- Include human-readable column names and data to allow for easy partner analysis
- Using multi-value attributes reduces the number of columns needed and creates more interesting insights
- Please import raw data where possible, as inconsistencies with hashing done off-platform can prevent or reduce matches
- For data providers/publishers: We recommend maintaining an off-the-shelf data schema that you can make available to all partners, so your data can be productized with ease. Focus on making the data usable and driving your customers to value quickly. Custom projects that require very specific slices of data or greater granularity can be generated separately for customers who want more.
- Please ensure your data stays within the character limits of our platform
How normalization works
Normalization plays an important role in ensuring the security of your data. During normalization, direct identifiers (such as emails) are irreversibly converted to pseudonymized, salted keys. Our normalization process begins by lowercasing values and removing any leading or trailing spaces, then converts the raw PII to SHA256, which is further encrypted and salted. By the end of the normalization process, no translatable identifier information is stored within an Insight Bunker.
We recommend bunkering identifier data in raw format wherever possible to avoid any discrepancies in standardization across partners.
Using the Global Schema
Keys can be mapped to InfoSum's Global Schema, which defines a standard set of identifiers that can be used to compare datasets from diverse original sources.
Multi-language support
Column headers are used to map columns to the Global Schema, and they are recognized in English only. If your column names are in another language, you can still map them manually to the Global Schema or add them as custom keys (if you're uploading a custom key, you will need to ensure the names are exactly the same across all partner datasets).
See this section for a list of keys included in our Global Schema. Any keys that are not included can be added to your dataset as a custom key.
Should I import hashed data?
We recommend importing raw data into your Cloud Vault. All columns marked as Key will be double-hashed and salted during normalization, making them irreversibly encrypted. Data can be imported using GPG encryption for added security.
If you are using the Global Schema:
- You must always import raw identifiers for MAID, UDPRN, IP Address, home & mobile phone numbers, Zip9, Zip5
- The rest of the identifiers can be imported hashed as long as all parties use the same rules and hashing methodology
If you are creating custom keys:
- You can still import the data in raw format and we recommend this to avoid any normalization errors
- If you need to import the data pre-hashed, you must set the same column name during normalization and ensure that all organizations collaborating use the same rules and hashing methodology. Case and salt inconsistencies will prevent matches.
Pre-hashing data
The onboarding steps need to be completed in this order: 1) Normalization > 2) Hashing. Normalization and hashing can happen directly on the platform if you are onboarding raw data. Please note that not all Global Schema keys can be imported hashed as mentioned above.
All parties involved in the collaboration must agree on whether to import raw or hashed matching keys and what hashing is done. If some parties only have hashed data and others only raw data, the parties with raw data must hash their matching keys before publishing them to their Bunkers, or matching will be impossible. This can be done:
- Before import: Data can be onboarded pre-hashed to SHA256 standard, but it needs to be normalized first (all lowercase and any spaces removed) to ensure that after hashing the values still match.
- During the normalization process: raw keys can have an additional forced hash applied in the platform
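As a sketch, pre-hashing off-platform before import might look like the following in Python. The function name is illustrative; the normalization rules are the ones described above (lowercase, leading/trailing spaces removed, then SHA256):

```python
import hashlib

def normalize_and_hash(value: str) -> str:
    """Normalize an identifier BEFORE hashing so that all parties
    produce identical SHA256 values for the same raw identifier."""
    normalized = value.strip().lower()   # normalize first...
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()  # ...then hash

# The same email in different casings hashes to the same key:
a = normalize_and_hash("  Jane.Doe@Example.com ")
b = normalize_and_hash("jane.doe@example.com")
assert a == b
```

Hashing without normalizing first is the most common cause of missed matches: `SHA256("Jane@Example.com")` and `SHA256("jane@example.com")` are completely different values.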
Salting data
We do not recommend uploading salted data as this is not necessary due to the decentralized nature of our platform.
If salt is required, the onboarding steps need to be completed in this order: 1) Salt > 2) Normalize > 3) Hash.
- Data can be onboarded with a salt that all parties agree on to ensure that encrypted values match.
- If you’d like to onboard data that is already salted and hashed, please ensure that you have completed the normalization process before hashing (all lowercase and any spaces removed).
- Please consult with your collaboration partners before using any salt. Few companies in our ecosystem salt their datasets, and all parties would need to use the exact same salt.
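If all parties do agree on a salt, the Salt > Normalize > Hash order above could be sketched like this. The salt value and its placement before the identifier are assumptions you would need to agree with your partners:

```python
import hashlib

def salt_normalize_hash(value: str, salt: str) -> str:
    # 1) Salt: prepend the agreed salt (placement is an assumption --
    #    every party must use the identical salt and placement)
    salted = salt + value
    # 2) Normalize: lowercase and remove leading/trailing spaces
    normalized = salted.strip().lower()
    # 3) Hash to SHA256 hexadecimal
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

shared_salt = "example-salt"  # hypothetical agreed salt
k1 = salt_normalize_hash("User@Example.com", shared_salt)
k2 = salt_normalize_hash("user@example.com", shared_salt)
assert k1 == k2  # partners using the same salt produce matching keys
```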
Structuring identifiers
The list below outlines the identifiers commonly used in the Platform. During the normalization process, these identifiers are converted into keys, which are used to match rows in a query. Keys can be both deterministically (e.g. Email) and probabilistically (e.g. Full name and phone number) matched during a query.
All keys can be multi-value keys (multiple values in one cell).
Email address
What email address format does InfoSum require?
Email addresses can be provided either as raw data or in SHA256 hexadecimal format, and must be in a single column. If you use SHA256 format, ensure all email addresses are lowercase with leading/trailing white space removed before you convert them to SHA256.
How does InfoSum validate emails?
If the email is in raw format, it is checked against RFC 2822. If the email is in SHA256 format, the hash is checked for correct length.
How does InfoSum normalize emails?
If plain/raw emails are provided (not in SHA256 format), each address is converted to lowercase and leading/trailing white space is removed before hashing.
Phone Number
Always provided in raw format. We will validate that the phone number is valid and in E.164 format.
Both mobile and home phone numbers can be imported in separate columns.
After mapping to the Global Schema you must specify the phone number region. You will need to select the two-letter country code for the region your phone numbers are for from a dropdown. We currently only support one country code per dataset when mapping to the Global Schema.
If you are including phone numbers from different countries, please contact your InfoSum representative.
If you wish to onboard multiple phone numbers per user, please make sure to create one multi-value column. During the recordset creation process you can identify your multi-value delimiter so the platform can recognize them.
Alternatively you can create a multi-value column during normalization by using the ‘create multi-value’ modification step.
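As an illustration, converting raw numbers to E.164 and joining several numbers into one multi-value cell might look like this. The 10-digit US assumption and the `|` delimiter are hypothetical; declare your actual delimiter at recordset creation:

```python
def to_e164(raw: str, country_code: str = "1") -> str:
    """Hypothetical sketch: strip formatting characters and prefix the
    country code. Assumes a US ('1') 10-digit national number; real
    E.164 validation is stricter than this."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 10:          # national number without country code
        digits = country_code + digits
    return "+" + digits

# Multiple numbers per user joined into one multi-value cell; '|' is an
# assumed delimiter -- use whatever you declare at recordset creation.
numbers = ["(415) 555-0100", "415-555-0199"]
cell = "|".join(to_e164(n) for n in numbers)
print(cell)  # +14155550100|+14155550199
```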
NOTE: When activating phone numbers to Meta, you need a second column that includes the country, so please ensure your dataset has that information if you plan to use the Meta destination.
Name
Names are not included as a standard key in the Global Schema; they can be imported as custom keys. Collaborators must agree on formatting and column name.
Postal address
Below are some general guidelines for addresses. If you are in the US, please refer to the US address mapper page. We’re currently working to bring the UK address mapper to the normalizer.
If using the address mapper please only include UK or US addresses in one dataset.
For other addresses, a range of address columns can be imported. Each datapoint should be split into individual columns, e.g. street in one column, town in another column, etc.
You can then create custom keys by concatenating the different columns during normalization.
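A concatenation sketch, assuming hypothetical column names and a `|` separator that all partners would need to agree on (order, separator, and casing must match exactly across datasets):

```python
def address_key(row: dict) -> str:
    """Illustrative custom key built from separate address columns.
    Column names and the '|' separator are assumptions."""
    parts = [row["street"], row["town"], row["postcode"]]
    # lowercase and trim spacing so all partners normalize identically
    return "|".join(p.strip().lower() for p in parts)

row = {"street": "10 Downing Street", "town": "London", "postcode": "SW1A 2AA"}
print(address_key(row))  # 10 downing street|london|sw1a 2aa
```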
Postcode/Zipcode
Always provided in raw format
For US addresses, please provide Zip5 or Zip9 (if using the US address mapper, these two keys will be created automatically during the normalization process).
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format
Mobile ID
Both Android's Advertising ID (AAID) and Apple's Advertising Identifier (IDFA) can be used.
IP Address
There is no format enforcement, but most clients provide this in IPv4 or IPv6.
E.g.
IPv4: 116.61.80.61, 30.161.132.202, 137.143.254.196, 62.158.243.253
IPv6: 2001:0000:130F:0000:0000:09C0:876A:130B
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format
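A quick way to check which format your IP values are in before import, using Python's standard library:

```python
import ipaddress

def ip_kind(value: str) -> str:
    """Classify a string as IPv4, IPv6, or invalid using the stdlib."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return "invalid"
    return "IPv4" if addr.version == 4 else "IPv6"

assert ip_kind("116.61.80.61") == "IPv4"
assert ip_kind("2001:0000:130F:0000:0000:09C0:876A:130B") == "IPv6"
assert ip_kind("not-an-ip") == "invalid"
```

Running a check like this over your dataset before import makes it easy to confirm with your partner that both sides hold the same address family in the same format.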
Social media handles
Identifiers from most social media platforms can be used.
E.g. Social Media: Twitter Handle
Social Media: Facebook ID
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format.
Note: Twitter/X only accepts handles without a leading ‘@’, so we recommend uploading them without the initial symbol.
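A minimal sketch of stripping the leading ‘@’ before upload; the lowercasing step is an assumption you would agree with your partners:

```python
def clean_handle(handle: str) -> str:
    # Remove surrounding whitespace and any leading '@' so handles
    # match across partners; lowercasing is an assumed convention.
    return handle.strip().lstrip("@").lower()

assert clean_handle("@InfoSum") == "infosum"
assert clean_handle("infosum") == "infosum"
```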
Structuring attributes
An attribute is non-unique information in a dataset (such as age or marital status) that can be compared and analyzed.
A column or several columns can be combined in the Bunker during load to make an attribute. Attributes are retained in raw form in the dataset and are used to filter results and to collect anonymized statistics.
Age/DOB
To add age or date of birth, we recommend adding just the year (age or birth year) as one column. At the normalizer stage, set the data type to integer and use the ‘Integer Buckets’ modification setting to create useful representations of your customers’ ages, for example:
1994-1999
25-30
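As an illustration, an integer-bucket step similar to the platform's ‘Integer Buckets’ modification could be sketched like this; the bucket width and bounds are assumptions:

```python
def age_bucket(age: int, width: int = 5, lo: int = 0, hi: int = 100) -> str:
    """Map an integer to a labelled range, roughly mirroring what an
    integer-bucket modification produces. Width/bounds are assumed."""
    age = max(lo, min(age, hi))          # clamp outliers into range
    start = (age // width) * width       # floor to the bucket boundary
    return f"{start}-{start + width - 1}"

print(age_bucket(27))  # 25-29
```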
Cost/Price/Revenue (currency)
Currency symbols are not supported as part of the column name nor as part of the data values, so please ensure that you remove them. Due to our focus on privacy, you cannot perform the calculations that you would usually perform on currency figures.
We recommend:
- During the data normalization process, use the integer bucket modification to create useful representations of your data that can be surfaced for analysis. Think about what’s most interesting for you to do with that data, how your customers purchase the product, the individual product you’re adding data about, and the type of analysis that you want to do on the platform
- Full numbers and bucket representations are also great for greater than/lower than filtering
If your analysis will focus heavily on cost efficiencies, please speak to our team for a more tailored recommendation.
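A sketch of cleaning currency values before import; the symbols handled and the rounding rule are assumptions:

```python
def clean_amount(value: str) -> int:
    """Strip currency symbols and thousands separators, returning a
    whole-number amount ready for the integer bucket modification.
    The symbol list and rounding behaviour are assumptions."""
    cleaned = value.replace("$", "").replace("£", "").replace(",", "").strip()
    return round(float(cleaned))

assert clean_amount("$1,249.50") == 1250
assert clean_amount("£99") == 99
```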
Ratios or percentages E.g. share of wallet
Percentage symbols (%) are not supported as part of the column name nor as part of the data values (the platform requires integers or floating points without symbols in the cell value).
During data normalization, you will need to map the numbers to the integer or floating point category.
We recommend keeping percentages as integers as this will allow you to use the integer bucket modification to create useful representations of your data that can be surfaced for analysis and greater than/lower than filtering.
Dates and timestamps
You can import dates and timestamps into your Bunker. To be able to use them as date formats for insights and measurement, you will need to modify your column to mark it as date-time during the normalization process.
We support all standard time and date formats. You can also include the time zone in the data itself or add it during normalization.
Please note that the segment builder and IQL tool don’t support filtering on dates at this stage. This is on our roadmap, so please reach out to your InfoSum representative if you’re interested in this functionality.
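As an example of preparing a consistent date-time value before marking the column as date-time during normalization; the source format and UTC time zone here are assumptions:

```python
from datetime import datetime, timezone

# Parse a source timestamp and emit a consistent ISO 8601 value with
# an explicit time zone. The input format and UTC are assumptions --
# use whatever your source system actually produces.
raw = "03/21/2024 14:30"  # hypothetical source format
dt = datetime.strptime(raw, "%m/%d/%Y %H:%M").replace(tzinfo=timezone.utc)
print(dt.isoformat())  # 2024-03-21T14:30:00+00:00
```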
Interests & behaviours
These might include attributes such as Characteristics, Content consumption, Contextual, Hobbies etc
Using multi-value attributes reduces the number of columns needed and creates more interesting insights by grouping similar interests or interest fields into one column. E.g. instead of one column per type of sport, create a ‘favorite sports’ column where you can list the sport or sports.
You can join individual single-value columns into a multi-value column using the modifications feature during normalization.
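A sketch of joining single-value columns into one multi-value column off-platform, mirroring what the ‘create multi-value’ modification does; the column names, ‘yes’ flags, and ‘|’ delimiter are hypothetical:

```python
# Hypothetical per-sport flag columns for one customer row, collapsed
# into a single multi-value 'favorite sports' cell.
row = {"likes_tennis": "yes", "likes_golf": "no", "likes_soccer": "yes"}

favorites = "|".join(
    col[len("likes_"):]                 # keep just the sport name
    for col, flag in row.items()
    if flag == "yes"                    # include only flagged interests
)
print(favorites)  # tennis|soccer
```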
It can be challenging to provide relevant insights without enough detail, but going to the highest level of granularity makes analysis more cumbersome. We suggest you include just two levels of granularity, for example:
- Favourite sports > type of sport (multi-value) > sports club
- TV channel > Type of program > program title
- Music > Music genre > Artist
You may also wish to upload your content organized following a common media language, such as IAB categories, if you’re already using these, as it can reduce the data preparation needed on your side. Following these categories (names and format) can create more transferable insights for brands (e.g. they can activate the insight programmatically or as part of a media plan).
Protecting PII during normalization
Never publish PII as attribute data
This can happen when the Platform treats key information as category information. When importing a data file, some keys are automatically mapped to the Global Schema. These automatically mapped keys are never classified as attributes. However, when creating a custom key, you must explicitly tell the system that the column data is a key, or it will be classified as an attribute. Failure to do this can result in personal data being published to a dataset.
You can do this on the normalization screen by turning on the ‘key’ toggle (second column from the left).
Never publish PII as meta-data
This can happen when the Platform unintentionally treats personal data as header data. It can result in personal data being published to a dataset or saved to a configuration file.
You can ensure that personal data is not classified as meta-data when creating the recordset by turning off the ‘Files have column headers’ toggle to tell the platform that your file does not have column headers.
However, we recommend that, where possible, your files have human-readable headers.
List of Global Schema Keys
This list contains the standard keys that are defined in the Global Schema. Any keys that are not included can be added to your dataset as a custom key.
- Mobile Advertising ID (e.g. AAID, IDFA)
- UDPRN
- Cookie ID
- IPv4/IPv6
- Mobile Phone Number
- Home Phone Number
- Zip9
- Zip5
- Epsilon CORE ID
- Experian LUID
- Experian PID
- NetID
- Tapad ID
- Transunion ID