When you run a query using InfoSum Platform, the results returned contain an element of approximation.
In typical use cases, results will be accurate to within plus or minus 1%, a margin which is unlikely to affect any decision-making based on the results. In some unusual circumstances, however, the approximation may be more significant. InfoSum Platform offers quality metrics to help you gauge and minimise this effect.
Approximation occurs for three distinct reasons.
- Some approximation is inherent in the algorithms used by InfoSum Platform. By allowing this small measure of approximation, InfoSum Platform can respond to queries far more quickly, and with far lower use of expensive computational resources.
- Some approximation arises because the original data was itself approximate. For example, this can happen if a dataset indicates age ranges, rather than precise ages or dates of birth.
- Some approximation is deliberately introduced by privacy controls. This is an important safety measure, which prevents accidental or malicious re-identification of specific individuals by comparing the results of similar queries.
The specific datasets you use in your query affect the degree of approximation. If quality metrics reveal that the approximation is unacceptable, you may be able to reduce it by selecting different datasets.
Quality metrics are all expressed as a number between 0 and 1, with 1 representing a better result (that is, a lower level of approximation).
This section describes the three separate sources of approximation in detail, along with the quality metrics you can use to monitor them.
Approximation inherent in algorithms
When a dataset is normalised on InfoSum Platform, direct identifiers (which identify specific individuals) are irreversibly anonymised.
As well as guarding individuals' personal data, this process significantly reduces the space occupied by this data, and prepares it for efficient querying. In turn, this helps InfoSum Platform respond quickly to queries and reduces costs by limiting the use of resources.
As an inherent property of this algorithm, it is possible for two individuals to be anonymised to the same value. (This effect is similar to a hash collision in a traditional approach.) Those individuals' records will then be incorrectly linked, resulting in a degree of approximation in the ultimate query results.
InfoSum Platform dynamically tunes the algorithm to keep this approximation to an acceptable level. This small degree of approximation is a necessary trade-off for the benefits of reduced resource usage and faster query times.
Because this tuning delivers a fixed level of approximation, it has a proportionately greater impact if the number of individuals involved in a query (the audience size) is small. A query which reports on a million individuals will show negligible approximation, while a query which reports on just a hundred individuals will show significant approximation - even if the hundred individuals are selected from datasets containing millions of rows.
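The arithmetic behind this proportional effect can be sketched as follows. The absolute error of 5 individuals is a purely hypothetical figure chosen for illustration; it is not a value published for InfoSum Platform, and `relative_error` is not a platform function.

```python
def relative_error(absolute_error: float, audience_size: int) -> float:
    """Express a fixed absolute error as a fraction of the audience size."""
    if audience_size <= 0:
        raise ValueError("audience_size must be positive")
    return absolute_error / audience_size


# Hypothetical fixed error, in individuals. Assumed for illustration only.
ASSUMED_ABSOLUTE_ERROR = 5

for audience_size in (100, 10_000, 1_000_000):
    share = relative_error(ASSUMED_ABSOLUTE_ERROR, audience_size)
    print(f"audience of {audience_size:>9,}: error is about {share:.4%} of the result")
```

Under this assumption, the same error is 5% of a hundred-person audience but only 0.0005% of a million-person audience.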
The quality metric Model Accuracy helps you gauge the impact of this effect on a specific query. Like all quality metrics, it is a value between 0 and 1. A value above 0.95 suggests there is limited impact, while a value below 0.90 indicates that you may need to take this effect into account.
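If you record quality metrics alongside your results, the guidance above could be applied with a small helper like the one below. This is a minimal sketch: `interpret_metric` is a hypothetical function, not part of InfoSum Platform, and the "borderline" label for values between 0.90 and 0.95 is an assumption, since the guidance only covers values above 0.95 and below 0.90.

```python
def interpret_metric(value: float, good: float = 0.95, poor: float = 0.90) -> str:
    """Classify a quality metric using the 0.95 and 0.90 guidance values."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("quality metrics are expressed as a number between 0 and 1")
    if value > good:
        return "limited impact"
    if value < poor:
        return "take into account"
    # Assumption: the guidance does not name this middle band.
    return "borderline"


print(interpret_metric(0.97))
print(interpret_metric(0.92))
print(interpret_metric(0.85))
```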
Approximation arising from mismatched bins
Some approximation may also result from the format of the original data, and how well it matches the query you submit.
It's easiest to understand this by way of an example. Suppose you run an insight query which reports on age distribution, grouping ages into 10-year bins (0-10 years old, 10-20 years old, 20-30 years old and so on).
If the dataset you query contains exact ages, or dates of birth, then each individual can be definitively assigned to one of these bins. There is no approximation from this process.
But suppose the original data itself grouped ages into bins - this time with different boundaries (say 0-18, 18-30, 31-45 and so on). In this case, if the original data said that a particular person is aged 18-30, there is no way to tell for sure whether they belong in the 10-20 or 20-30 query bin.
InfoSum Platform resolves this problem by choosing the most closely-matching bin (20-30 in this example). However, this clearly introduces an element of approximation. If quality metrics indicate that this is happening, you may be able to improve your results by using a different dataset containing more specific data.
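One way to picture "most closely-matching" is to pick the query bin that overlaps the source bin the most. The sketch below makes that assumption; the actual matching rule used by InfoSum Platform is not described here and may differ. The function names are hypothetical.

```python
Bin = tuple[int, int]  # (lower bound, upper bound) in years


def overlap(a: Bin, b: Bin) -> int:
    """Return the number of years shared by two age ranges (0 if none)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def closest_query_bin(source_bin: Bin, query_bins: list[Bin]) -> Bin:
    """Pick the query bin with the largest overlap (assumed matching rule)."""
    return max(query_bins, key=lambda query_bin: overlap(source_bin, query_bin))


query_bins = [(0, 10), (10, 20), (20, 30), (30, 40)]

# A source record says only that the person is aged 18-30.
print(closest_query_bin((18, 30), query_bins))  # (20, 30)
```

The 18-30 source bin shares 2 years with the 10-20 query bin and 10 years with the 20-30 query bin, so 20-30 is chosen. Anyone who is actually 18 or 19 is counted in the wrong bin, which is the approximation described above.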
The quality metric Representation Score helps you gauge the impact of this effect on a specific query. Like all quality metrics, it is a value between 0 and 1. A value above 0.95 suggests there is limited impact, while a value below 0.90 indicates that you may need to take this effect into account.
Approximation arising from privacy controls
Privacy controls are described in detail in a separate document. If you haven't already, it's a good idea to read that document to help you understand the following. In a nutshell, privacy controls add a deliberate degree of error to statistical results. This small error margin averts some scenarios where individual people could be identified by a series of carefully-crafted queries.
This form of approximation is an intentional and valuable product feature. However, you still need to consider its impact on query results. There are two specific scenarios you can monitor using quality metrics.
Firstly, if more than one dataset is involved in a query, the overall thresholds applied are the highest of those configured. So, if your query incorporates a single dataset with a significantly higher threshold, then that one dataset will increase the level of approximation. You may be able to improve your results by avoiding this dataset or substituting another dataset with a lower configured threshold.
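The "highest of those configured" rule can be sketched as below. The dataset names and threshold values are invented for illustration, and `Thresholds` and `effective_thresholds` are hypothetical names rather than part of InfoSum Platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    rounding: int
    redaction: int


def effective_thresholds(per_dataset: dict[str, Thresholds]) -> Thresholds:
    """Return the highest rounding and redaction thresholds across datasets."""
    if not per_dataset:
        raise ValueError("at least one dataset is required")
    return Thresholds(
        rounding=max(t.rounding for t in per_dataset.values()),
        redaction=max(t.redaction for t in per_dataset.values()),
    )


# Hypothetical configuration: one dataset has much higher thresholds.
configured = {
    "dataset_a": Thresholds(rounding=100, redaction=100),
    "dataset_b": Thresholds(rounding=100, redaction=100),
    "dataset_c": Thresholds(rounding=1000, redaction=500),
}

print(effective_thresholds(configured))

# Leaving out dataset_c lowers the thresholds applied to the whole query.
without_c = {name: t for name, t in configured.items() if name != "dataset_c"}
print(effective_thresholds(without_c))
```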
The quality metrics Rounding Similarity and Redaction Similarity measure this effect for the rounding and redaction thresholds, respectively. Like all quality metrics, these are values between 0 and 1. A value of 1 indicates that the thresholds for each dataset are exactly the same. If the value is less than 0.80, you may want to consider using alternative datasets.
Secondly, because privacy control thresholds are set as absolute numbers, they will have a disproportionate effect if any given statistic is small. As an extreme example, if the rounding threshold is 1,000 but a particular statistic would be 500, then that statistic will not appear in results at all.
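The disproportionate effect on small statistics can be illustrated with a short sketch. It assumes that rounding means rounding down to the nearest multiple of the threshold, which is consistent with the example above but is an assumption here; see the privacy controls documentation for the exact behaviour. `round_down` is a hypothetical helper.

```python
def round_down(statistic: int, rounding_threshold: int) -> int:
    """Round a statistic down to a multiple of the threshold (assumed behaviour)."""
    if rounding_threshold <= 0:
        raise ValueError("rounding_threshold must be positive")
    return (statistic // rounding_threshold) * rounding_threshold


ROUNDING_THRESHOLD = 1_000

for statistic in (500, 1_500, 150_500):
    reported = round_down(statistic, ROUNDING_THRESHOLD)
    lost = (statistic - reported) / statistic
    print(f"true value {statistic:>7,} -> reported {reported:>7,} ({lost:.1%} lost)")
```

In each case the absolute loss is the same 500, but it removes the smallest statistic entirely while barely affecting the largest.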
The quality metric Result Precision (across all bins) measures this effect across all statistics generated by your query, and consolidates the result into a single number between 0 and 1. A value above 0.95 suggests there is limited impact, while a value below 0.90 indicates that you may need to take this effect into account.
The quality metric Result Precision (across non-zero bins) is an alternative measure, which ignores any statistics equal to zero when producing the consolidated result.