Sum and Avg IQL function
You can use IQL to calculate the Sum and Avg across the intersections of different datasets. This is particularly useful for insights or measurement.
This article explains how IQL calculates Sum and Average. There is also a Sum and Avg Query library available with a section of metrics that can be calculated, depending on the data that is available for querying.
When running Sum and Avg queries you will be limited by a privacy query budget, meaning there will be a limit to the number of queries you can run. This is to protect the privacy of individuals contained in the datasets.
Functionality in Beta |
Please note that this functionality is in Beta and not available to all users. Please reach out to your InfoSum representative if you wish to use it. Additionally, there are some functionality restrictions:
|
Table of Contents
Preserving Privacy with Sum and Average
Understanding the Confidence Score
We limit queries with a Privacy Budget
The SUM Function
Description
SUM returns three metrics: the SUM, USER AVERAGE, and Distinct rows. It also displays the remaining Privacy Budget for each dataset and a Confidence score to show the impact of differential privacy due to data outliers.
SUM represents the total of all populated values in the provided attribute. It does not de-duplicate based on keys - one individual in the data can contribute several times to the SUM. The SUM is calculated only over valid rows (i.e. not empty keys).
USER AVERAGE represents the deduplicated user average which is calculated as the SUM divided by the DISTINCT ROWS. The user average does not take into account the number of times a user contributed to the SUM, and so it represents the average across users and not across datapoints.
DISTINCT ROWS represents the de-duplicated count on a key (the same as running a Count query)
Syntax
SELECT SUM(Datasetname.Column)
FROM Dataset1 INTERSECT Dataset2
WHERE Dataset1.category = 'Value'
Please note all WHERE clauses are optional.
Please note you don't need to run an intersection to calculate a SUM.
To see examples of how to use these queries to run useful measurement and insights metrics, please read the Sum and Avg query library.
Preserving Privacy with Sum and Average
All IQL functions in our platform have differential privacy applied to them. This is how we calculate Sum and Avg to ensure that privacy isn’t compromised.
We remove outliers
Outliers are values or individuals that are significantly different to the rest of the data. These outliers are easy to identify using very few Sum queries, so need to be removed & clamped when querying.
We use interquartile range methodology (IQR) to identify these outliers. This method uses quartiles, which divide your data into four equal parts. The IQR is the difference between the first quartile (Q1) and the third quartile (Q3). Outliers are defined as values that fall below Q1 - 1.5IQR or above Q3 + 1.5IQR.
Where possible where a value exceeds the upper or lower bounds of the quartile, we replace the outlier value with the max or minimum bounds. This approach is taken rather than excluding the data because it maintains a bounded sensitivity while retaining as much information as possible from the data.
In some cases, where a user contributes to the sum at a frequency that would make it an outlier using the same approach as above, we exclude the user to keep the data usable and representative of the average across unique users.
We add noise
As with any other aggregated results served by the platform, we add +/- 2% noise to the final calculations. The amount will depend on the size of the data and the distribution of the values/outliers.
Understanding the Confidence Score
A confidence score is also displayed alongside the results to give you an indication of how the data quality (presence of outliers and, therefore, the need to add noise) affects the calculation. It will be displayed as a % and color-coded:
-
High confidence
- 100-95%, shown in green
- Clean data with low numbers of outliners, the calculation is within standard platform noise.
-
Medium confidence
- 94-75%, shown in amber
- Multiple outliers have been identified and either clamped or removed. The result has been altered more than IQL query standard
-
Low confidence
- <75%, shown in red
- The data presented a large number of outliers that put users at risk and the calculation has been amended significantly
We limit queries with a Privacy Budget
A privacy budget is a restriction on the number of SUM queries that a company can run on a dataset to ensure that subsequent analysis doesn’t allow for the reidentification of individuals. It limits the ability to run too many SUM queries and it does not apply to any other IQL functions.
The budget is set at a company to dataset relationship for datasets you have permission to use. You do not consume budget when running queries on your own datasets.
The current SUM Privacy Budget has been set at 10 queries a week and resets weekly. This is based on statistical standards, and it’s been set as a default for this BETA release as it provides a good balance between privacy and usability. If you are interested in the maths behind the Privacy Budget please reach out to your InfoSum representative.
You will be able to see the Budget you have left before running a SUM query. A modal will display how many SUM queries there are left in the budget for each of the datasets involved in the query.