Data formatting for normalization
When you import a dataset to InfoSum Platform, you will be provided with a range of tools to normalize the data and map it into our Global Schema. Here are some basic principles to help you prepare your data before import:
- Our platform works best with customer-level data: one row per customer
- Include human-readable column names and data to allow for easy partner analysis
- Please import raw data where possible, as inconsistencies in hashing done off-platform can prevent or reduce matches
- Multi-value attributes reduce the number of columns you need and create more interesting insights
- For data providers/publishers: we recommend maintaining an off-the-shelf data schema that you can make available to everyone, so you can productize with ease. Focus on making the data usable and driving your customers to value quickly. Custom projects that require very specific slices of data or more granularity can be run separately for customers who want more.
How normalization works
Our Bunker normalization technology has been purpose-built so that users can feel comfortable bunkering raw identifiers, including emails. Our normalization process begins by lowercasing values and removing any leading or trailing spaces, then hashing the raw PII with SHA256 before it is further encrypted and salted. By the end of the normalization process, there is no translatable identifier information stored within an InfoSum Bunker.
We recommend bunkering raw-format identifier data wherever possible to avoid any discrepancies in standardization across partners.
Using the Global Schema
Keys can also be mapped to InfoSum's Global Schema, which defines a standard set of keys that can be used to compare datasets from diverse original sources.
Multi-language support
Column headers are used to map columns to the Global Schema, and they will be recognized in English only. If your column names are in another language, you can still map them manually to the Global Schema or add them as custom keys. (If you're uploading a custom key, you will need to ensure the names are exactly the same across all partner datasets.)
See the List of Global Schema Keys at the end of this page for the keys included in our Global Schema. Any keys that are not included can be added to your dataset as a custom key.
Should I import hashed data?
We recommend importing raw data into your Cloud Vault. Data can be encrypted with GPG encryption for added security.
When using pre-hashed data, ensure all collaborating organizations follow the same rules; inconsistencies in case or salt will prevent matches.
The rules below should be followed if hashing happens before data import.
Pre-hashing data
The onboarding steps need to be completed in this order: 1) Normalization > 2) Hashing.
Normalization and hashing can happen in the platform directly if you are onboarding raw data.
Data can be onboarded pre-hashed to SHA256 standard, but it needs to be normalized first (all lower case and any spaces removed) to ensure that after hashing the values still match.
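As a minimal sketch of that order in Python (the normalize_and_hash helper below is our illustration, not a platform API; we assume "any spaces removed" means all whitespace, not just leading/trailing):

```python
import hashlib

def normalize_and_hash(identifier: str) -> str:
    """Normalize first, then hash, so equal raw values give equal digests."""
    # 1) Normalize: lowercase and remove any spaces
    normalized = identifier.strip().lower().replace(" ", "")
    # 2) Hash: SHA256, hexadecimal output
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# "JANE.DOE@Example.com " and "jane.doe@example.com" yield the same hash
assert normalize_and_hash("JANE.DOE@Example.com ") == normalize_and_hash("jane.doe@example.com")
```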
Salting data
We do not recommend uploading salted data as this is not necessary due to the decentralized nature of our platform.
If salt is required, the onboarding steps need to be completed in this order: 1) Salt > 2) Normalize > 3) Hash.
- Data can be onboarded with a salt that all parties agree on to ensure that encrypted values match.
- If you’d like to onboard data that is already salted and hashed, please ensure that you have completed the normalization process before hashing (all lower case and any spaces removed).
- Please consult with your collaboration partners before using any salt. Few companies in our ecosystem salt their datasets, and all parties would need to use the exact same salt.
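If a shared salt is agreed, the earlier sketch extends to the salt > normalize > hash order. Note that prepending the salt is our assumption (this document does not fix its position), so agree that detail with your partners, and pick a salt that is already lowercase with no spaces so the normalization step does not alter it:

```python
import hashlib

AGREED_SALT = "examplesalt"  # hypothetical; all parties must use the exact same salt

def salt_normalize_hash(identifier: str, salt: str = AGREED_SALT) -> str:
    # 1) Salt (prepended here by assumption), 2) Normalize, 3) Hash
    salted = salt + identifier
    normalized = salted.strip().lower().replace(" ", "")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```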
Normalizing identifiers into keys
The list below outlines the identifiers commonly used in the Platform. During the normalization process, these identifiers are converted into keys, which are used to match rows in a query. Keys can be matched either deterministically (e.g. email) or probabilistically (e.g. full name and DOB) during a query.
A note on multi-value columns
If using the Global Schema, you can only import multi-value keys for a limited set of identifiers.
Email address
What email address format does InfoSum require?
Email addresses can be provided in either SHA256 hexadecimal format or as raw data, and must be in a single column. If you use SHA256 format, ensure all email addresses are lowercase with leading/trailing white space removed before you convert them to SHA256.

How does InfoSum validate emails?
If the email is in raw format, it must match RFC 2822. If the email is in SHA256 format, InfoSum checks that the hash has the correct length.

How does InfoSum normalize emails?
If plain/raw emails are provided (not in SHA256 format), each email address is converted to lowercase and leading/trailing white space is removed before hashing.
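As a rough pre-import sanity check you could classify each value before upload (a sketch, not the platform's actual validator; the email regex is a deliberately simplified stand-in for full RFC 2822 validation):

```python
import re

SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplified, not full RFC 2822

def classify_email_value(value: str) -> str:
    """Guess whether a column value is a raw email, a SHA256 digest, or neither."""
    v = value.strip().lower()
    if SHA256_HEX.match(v):
        return "sha256"
    if SIMPLE_EMAIL.match(v):
        return "raw"
    return "invalid"
```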
Phone Number
Always provided in raw format, as a valid phone number in E.164 format.
Both mobile and home phone numbers can be imported in separate columns.
NOTE: When activating phone numbers to Meta, you need a second column that includes the country, so please ensure your dataset has that information if you plan to use the Meta destination.
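A minimal sketch of converting raw numbers to E.164 using the open-source phonenumbers package (our choice of tool, not a platform requirement; the default_region is an assumption used only for numbers without a country prefix):

```python
from typing import Optional

import phonenumbers  # pip install phonenumbers

def to_e164(raw: str, default_region: str = "GB") -> Optional[str]:
    """Return the number in E.164 format, or None if it cannot be parsed or validated."""
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

print(to_e164("020 7946 0958"))  # +442079460958
```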
Name
Always provided in raw format. Requires a first name (forename) and last name (surname).
Upload either a single column as input or two columns.
Note: We do not support middle names or suffixes. However, these could be imported as custom categories if you want to use them outside of our Global Schema.
You can map the name columns in your spreadsheet to the “Forename” and “Surname” properties of the “Name” category in the Global Schema.
The “Name” category cannot be used as a key in its own right and must be used in combination with a second category to form a key. When used alongside DOB, the platform will automatically create compound keys (bar email).
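If your source has a single full-name column, a naive split into forename and surname might look like the sketch below (an illustration only; compound surnames such as "van der Berg" need more care, and middle tokens are simply dropped because the Global Schema does not support middle names):

```python
def split_full_name(full_name: str) -> tuple[str, str]:
    """Naively split one name column into (forename, surname)."""
    parts = full_name.strip().split()
    if not parts:
        return "", ""
    if len(parts) == 1:
        return parts[0], ""
    # First token as forename, last token as surname; middle names dropped
    return parts[0], parts[-1]

print(split_full_name("Jane Doe"))  # ('Jane', 'Doe')
```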
Postal address
Below are some general guidelines for addresses. If you are in the UK or US, please refer to the UK address mapper or US address mapper pages.
A range of address columns can be imported. Each datapoint should be split into its own column, i.e. street in one column, town in another column, etc.
Postcode/Zipcode
Always provided in raw format
For US addresses, please provide ZIP5 or ZIP9. (If using the US address mapper, these two keys will be created automatically during the normalization process.)
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format
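For example, if your partner holds only five-digit ZIPs, you might reduce ZIP+4 values to ZIP5 before import (a sketch; the US address mapper performs this step for you automatically):

```python
def to_zip5(zipcode: str) -> str:
    """Reduce a US ZIP or ZIP+4 value to its five-digit form."""
    return zipcode.strip().replace("-", "")[:5]

print(to_zip5("94105-1804"))  # '94105'
```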
Mobile ID
Both Android's Advertising ID (AAID) and Apple's Advertising Identifier (IDFA) can be used.
IP Address
There is no format enforcement, but most clients provide this in IPv4 or IPv6.
E.g.
IPv4: 116.61.80.61, 30.161.132.202, 137.143.254.196, 62.158.243.253
IPv6: 2001:0000:130F:0000:0000:09C0:876A:130B
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format
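Python's standard ipaddress module can both validate addresses and normalize them to a canonical string, which helps partners agree on a single format (e.g. compressed, lowercase IPv6):

```python
import ipaddress
from typing import Optional

def canonical_ip(value: str) -> Optional[str]:
    """Return the canonical string form of a valid IP address, or None."""
    try:
        addr = ipaddress.ip_address(value.strip())
    except ValueError:
        return None
    return str(addr)  # IPv6 comes back compressed and lowercase

print(canonical_ip("116.61.80.61"))                             # '116.61.80.61'
print(canonical_ip("2001:0000:130F:0000:0000:09C0:876A:130B"))  # '2001:0:130f::9c0:876a:130b'
```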
Social media handles
Identifiers from most social media platforms can be used.
E.g. Social Media: Twitter Handle
Social Media: Facebook ID
To use this as a key for matching please ensure that you and your partner are providing the same information in the same format.
Note: Twitter/X only accepts handles without a leading ‘@’, so we recommend uploading these already without the initial symbol.
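Stripping the symbol ahead of upload is a one-liner, e.g.:

```python
def normalize_handle(handle: str) -> str:
    """Drop surrounding whitespace and a leading '@' from a social handle."""
    return handle.strip().removeprefix("@")  # removeprefix needs Python 3.9+

print(normalize_handle("@infosum"))  # 'infosum'
```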
Structuring attributes
An attribute is non-unique information in a dataset (such as age or marital status) that can be compared and analyzed.
A column, or several columns combined in the Bunker during load, can make an attribute. Attributes are values retained in raw form in the dataset; they are used to filter results and are collected into anonymized statistics.
DOB (Date of Birth)
Mapping DOB to the Global Schema lets you leverage some of the most useful existing groupings, such as 5- or 10-year age bins.
Always provided in raw format.
Must be in three columns, each with a separate input value for "yyyy", "mm", and "dd".
You can map the DOB columns in your spreadsheet to the "Day", “Month” and ”Year" properties in the “Date of Birth” category in the Global Schema using the steps below.
Create three DOB columns in your spreadsheet, each with a separate input value for "dd", "mm", and "yyyy". Import your data and create a recordset. Then, at the normalization stage, map each column to the Global Schema category.
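A minimal pandas sketch of that preparation step (the column names and date format are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"dob": ["1990-07-14", "1985-12-03"]})  # hypothetical source column

dob = pd.to_datetime(df["dob"], format="%Y-%m-%d")
df["dob_day"] = dob.dt.day      # "dd"
df["dob_month"] = dob.dt.month  # "mm"
df["dob_year"] = dob.dt.year    # "yyyy"

df.drop(columns=["dob"]).to_csv("dob_split.csv", index=False)
```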
Cost/Price/Revenue (currency)
The platform does not recognize currency symbols or currency as a data format, so please ensure that you remove currency symbols. Due to our focus on privacy, you cannot perform the calculations that you would usually perform on currency figures.
We recommend:
- Use range buckets, and think strategically about what is most interesting to do with that data: how your customers purchase the product, the individual product you’re adding data about, and the type of analysis you want to run on the platform
- Full numbers or ranges are also great for greater-than/less-than filtering
If your analysis will focus heavily on cost efficiency, please speak to our team for a more tailored recommendation.
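A minimal pandas sketch of pre-bucketing a currency column (the bucket boundaries are illustrative; the same approach works for the percentage ranges discussed next):

```python
import pandas as pd

# Hypothetical spend column; currency symbols already removed
df = pd.DataFrame({"annual_spend": [12.50, 430.00, 1875.99]})

bins = [0, 50, 250, 1000, float("inf")]
labels = ["0-50", "50-250", "250-1000", "1000+"]
df["spend_bucket"] = pd.cut(df["annual_spend"], bins=bins, labels=labels)

print(df)
```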
Ratios or percentages E.g. share of wallet
The platform does not recognize % signs or percentages as a data format.
We recommend:
- Include ranges (e.g. 0-5%) - this pre-bucketing helps with generating usable insights
- Full numbers or ranges are also great for greater than/lower than filtering
Dates and timestamps
The platform does not recognize dates as a format. To prioritize privacy, we don't allow the use of dates in the usual manner, as they could assist in re-identifying individuals.
- Option 1 (string): Import as a string when it’s a static value (‘show me all the users where date and time is X’)
- Option 2 (integer): If the values are variable and you care about ranges, consider whether you need the time at all; the date alone may be sufficient (measurement might need both, but for ‘when did someone purchase a ticket?’ the date is probably enough)
- Option 3 (integer): If you want range filtering rather than chaining OR conditions over individual values, distill timestamps down to dates but keep them as integers so ranges still work
We recommend retaining only the essential information. For example, using just the month or the combination of month and year may be sufficient to generate actionable insights.
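For example, distilling a timestamp down to a sortable year-month integer (a sketch; the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"purchased_at": ["2024-03-15 10:24:00", "2024-11-02 08:01:30"]})

ts = pd.to_datetime(df["purchased_at"])
# Keep only what the analysis needs: year and month as an integer (yyyymm)
df["purchase_yyyymm"] = ts.dt.year * 100 + ts.dt.month

print(df["purchase_yyyymm"].tolist())  # [202403, 202411] — supports range filtering
```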
Interests & behaviours
These might include attributes such as Characteristics, Content consumption, Contextual, Hobbies, etc.
Multi-value attributes reduce the number of columns you need and create more interesting insights: group similar interests or interest fields into one column. E.g. instead of one column per type of sport, create a ‘favorite sports’ column where you can list the sport or sports.
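A sketch of collapsing one-column-per-sport flags into a single multi-value column (the column names and the comma delimiter are our assumptions; check which delimiter your import expects):

```python
import pandas as pd

# Hypothetical one-column-per-sport layout...
df = pd.DataFrame({
    "likes_football": [True, False],
    "likes_tennis": [True, True],
    "likes_golf": [False, True],
})

sports = {"likes_football": "football", "likes_tennis": "tennis", "likes_golf": "golf"}

# ...collapsed into one multi-value 'favorite_sports' column
df["favorite_sports"] = df.apply(
    lambda row: ",".join(name for col, name in sports.items() if row[col]), axis=1
)
print(df["favorite_sports"].tolist())  # ['football,tennis', 'tennis,golf']
```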
Too little granularity makes it hard to provide relevant insights, while the highest level of granularity makes analysis more cumbersome. We suggest you include just two levels of granularity, for example:
- Favourite sports > type of sport (multi-value) > sports club
- TV channel > Type of program > program title
- Music > Music genre > Artist
You may also wish to upload your content organized following a common media language, such as IAB categories, if you’re using these already, as this can reduce the data preparation needed on your side. Following these categories (names and format) can create more transferable insights for brands (e.g. they can activate the insight programmatically or as part of a media plan).
Protecting PII during normalization
Never publish PII as attribute data
This can happen when the Platform treats key information as category information. When importing a data file, some keys are automatically mapped to the Global Schema. These automatically mapped keys are never classified as attributes. However, when creating a custom category, the user needs to explicitly tell the system that the column data is a key or it will be classified as an attribute. Failure to do this can result in personal data being published to a dataset.
You can do this at the normalization screen by turning on the ‘key’ toggle (second column from the left).
Never publish PII as meta-data
This can happen when the Platform unintentionally treats personal data as header data. It can result in personal data being published to a dataset or saved to a configuration file.
You can ensure that personal data is not classified as meta-data when creating the recordset by turning off the ‘Files have column headers’ toggle, to tell the platform that your file does not have column headers.
However, we recommend that, where possible, your files have human-readable headers.
List of Global Schema Keys
This list contains (in alphabetical order) the standard keys that are defined in the Global Schema. Any keys that are not included can be added to your dataset as a custom key.
- Address
- Adform ID
- Adobe ID
- Amobee ID
- Beeswax ID
- BritePool ID
- Cookie ID
- CriteoID
- Epsilon CORE ID
- Experian LUID
- Experian PID
- Fabrick ID by Neustar
- Google ID
- ID+ by Zeotap
- ID5
- Index ID
- IP Address (supporting IPv4 or IPv6 format)
- Kinesso ID
- LiveIntent nonID
- MediaMath ID
- Merkury ID by Merkle
- Mobile Advertising ID (AAID & IDFA)
- Name (Forename & Surname)
- NetID
- NextRoll ID
- OpenX ID
- Panorama ID by Lotame
- Parrable ID
- Permutive ID
- Phone number
- Pubmatic ID
- Quantcast ID
- RampID by LiveRamp
- Shared ID
- Social Media (there are more than 10 keys, including Twitter Handle & Facebook ID)
- Tapad ID
- The Trade Desk ID
- Throtle ID
- Transunion ID
- Unified ID 2.0
- Xandr ID