Privacy controls

Privacy controls are safeguards built into InfoSum Platform, designed to prevent accidental or deliberate identification of individuals through aggregated queries.

In a nutshell, privacy controls do two things:

  • prevent queries from reporting on very small numbers of individuals (because those individuals might then be identified)
  • add a small, intentional error margin to all results. This margin is insignificant when aggregated statistics are used for their intended purpose, but it stops anyone from manipulating the input data to reveal information about individuals.

These controls are an important precaution: they guard privacy and help ensure regulatory compliance. For your protection, InfoSum Platform applies privacy controls automatically.

You may notice privacy controls at work if you experiment with your own datasets. For example, a query might say that you have no customers in Scotland, even though you know there are a small number of people in your database who live there. This page explains why this happens, and why it isn't an error.

Why privacy controls are needed

Suppose a particular individual, John Smith, lives in London. And suppose you are collaborating on a data project with someone who secretly wants to know whether John Smith is on your customer list.

Without privacy controls, they might persuade you to take part in a project and upload your customer database into your Bunker. In their Bunker, they could upload a database of their own, listing thousands of people who live in Glasgow and just one person who lives in London - John Smith.

Then, they could ask for statistics on where the individuals who appear in both your databases live. If this report revealed that there was an individual who lived in London, then - because John Smith is the only London resident in their database - they would know that John Smith is in your database, too.

Privacy controls prevent this leak of information. In this case, the result for London - like any other very small result - is rounded down to zero. Your collaborator would gain no information about John Smith.

How privacy controls work

InfoSum Platform applies three privacy controls: noise, rounding, and redaction. Although the controls are applied in that order, it's easiest to understand them by considering them the other way round.

Redaction

Redaction controls the minimum size of any group of individuals which may be reported on as part of a query. Redaction is applied to each bin created during a query. A bin is a grouping used for aggregated reports; for example, if your query produces statistics based on people's ages, then the bins might cover age ranges 0-17, 18-30, and so on.

By default, each dataset has a minimum bin redaction threshold of 100 individuals, which the dataset's owner can increase. If a bin contains fewer individuals than that threshold, the result is "redacted" and reported as zero.

It is important to understand that redaction is applied separately for each dataset. For example, suppose a query asks for the number of individuals grouped by age, totaled across two datasets A and B. Suppose dataset A contains 900 individuals aged 0-17, and its redaction threshold is 1000. Those 900 individuals will be redacted and will not be included in the query result, irrespective of the contents of dataset B.
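The per-dataset behaviour can be sketched in a few lines of Python. This is only an illustration of the rule described above, not InfoSum Platform's implementation: the function names are hypothetical, the figures for dataset B are invented for the example, and rounding and noise are left out.

```python
def redact_bin(count: int, threshold: int) -> int:
    """Return the count unchanged, or 0 if it is below the dataset's threshold."""
    return count if count >= threshold else 0


def combined_bin_total(bins) -> int:
    """Total one bin across datasets, redacting each dataset's contribution first.

    `bins` is a list of (count, threshold) pairs, one pair per dataset.
    """
    return sum(redact_bin(count, threshold) for count, threshold in bins)


# Dataset A: 900 individuals aged 0-17, threshold 1000 (the example above).
# Dataset B: 5,000 individuals aged 0-17, threshold 100 (hypothetical figures).
age_0_17 = [(900, 1000), (5000, 100)]
print(combined_bin_total(age_0_17))  # 5000: dataset A's 900 are redacted
```

Dataset A contributes nothing to the total, whatever dataset B contains.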

Additionally, when a query involves more than one dataset, the whole query is redacted if the total number of individuals it would report on is lower than the highest redaction threshold among the datasets involved.

For example, suppose datasets A and B both contain millions of records, but there are only 500 individuals who appear in both datasets. Suppose also that one of the datasets has a redaction threshold of 1000. A user submits a query requesting aggregated demographic information for users who appear in both A and B. This query will be redacted completely - because it would report on just 500 individuals in total, which is below the redaction threshold of one of the datasets.
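As a rough sketch of this whole-query check, assuming the rule is exactly as stated above (the function name is hypothetical, and the second threshold of 100 is assumed for the example):

```python
from typing import Sequence


def query_is_redacted(total_individuals: int, thresholds: Sequence[int]) -> bool:
    """True if the query's total falls below the highest redaction threshold."""
    return total_individuals < max(thresholds)


# 500 individuals appear in both datasets; one dataset has a threshold of 1000
# and the other is assumed to use the default of 100.
print(query_is_redacted(500, [1000, 100]))  # True
```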

Rounding

Rounding controls the precision of the values returned by a query. Because the values returned are always "round numbers", an attacker cannot gain information by looking at precise counts.

As with the bin redaction threshold, each dataset has a default rounding threshold of 100 individuals, which the dataset's owner can increase. Results are rounded down to a multiple of this threshold. When a query involves more than one dataset, results are rounded down to a multiple of the highest of the rounding thresholds.

For example, suppose dataset A has a rounding threshold of 50 and dataset B has a rounding threshold of 20. The highest of these is 50, so the statistics returned will always be a multiple of 50.

If, say, there are 1,070 people in the query result who live in London, this will be rounded down to 1,050. If there are 49 people who live in Scotland, this will be rounded down to 0.
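The arithmetic in this example can be checked with a short Python sketch. The function name is hypothetical and the code only mirrors the rule as described on this page:

```python
from typing import Sequence


def round_down(value: int, thresholds: Sequence[int]) -> int:
    """Round down to a multiple of the highest rounding threshold."""
    step = max(thresholds)
    return (value // step) * step


thresholds = [50, 20]  # dataset A and dataset B
print(round_down(1070, thresholds))  # 1050
print(round_down(49, thresholds))    # 0
```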

Noise

Noise defends against a category of attacks built around changing the input data incrementally and observing the change in the results. Without noise, a malicious user could gradually add rows to a dataset until they see an aggregated value move over the rounding threshold - which would reveal the original value before rounding.
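To see why rounding alone is not enough, the following Python sketch simulates that attack against a system with rounding but no noise. Everything here is hypothetical: `run_query` stands in for whatever the attacker uses to re-run a query after adding rows, and the count of 1,070 with a rounding threshold of 50 is borrowed from the rounding example.

```python
def round_down(value: int, step: int) -> int:
    """Round down to a multiple of step."""
    return (value // step) * step


def recover_true_count(run_query, step: int) -> int:
    """Add rows one at a time until the rounded result changes, then work backwards.

    `run_query(extra_rows)` returns the reported result after `extra_rows`
    matching rows have been added to the attacker's input.
    """
    baseline = run_query(0)
    extra_rows = 0
    while run_query(extra_rows) == baseline:
        extra_rows += 1
    # The result has just reached baseline + step, so the original count was:
    return baseline + step - extra_rows


# A noise-free system holding 1,070 matching individuals, rounding threshold 50.
def noise_free_query(extra_rows: int) -> int:
    return round_down(1070 + extra_rows, 50)


print(recover_true_count(noise_free_query, 50))  # 1070
```

The reported value stays at 1,050 until 30 rows have been added, which tells the attacker that the unrounded count was 1,070.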

To apply noise, InfoSum Platform modifies every value by a small but unpredictable amount. Repeating exactly the same input produces the same noise (and therefore the same result). Any change to the input, however, produces a different amount of noise, so the result varies in an unpredictable way.
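This page does not describe how the noise is generated. One way to get noise that is repeatable for identical input but unpredictable otherwise is to derive it from a keyed hash of the input, as in the Python sketch below. This is purely an assumption for illustration: the secret key, the noise range, and the function name are all hypothetical and do not describe InfoSum Platform's actual mechanism.

```python
import hashlib
import hmac
import json

# Hypothetical secret known only to the platform, so users cannot predict the noise.
SECRET_KEY = b"example-only-secret"


def deterministic_noise(query_input: dict, max_noise: int = 5) -> int:
    """Derive an offset in [-max_noise, max_noise] from a keyed hash of the input."""
    canonical = json.dumps(query_input, sort_keys=True).encode("utf-8")
    digest = hmac.new(SECRET_KEY, canonical, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % (2 * max_noise + 1) - max_noise


original = {"query": "count by city", "rows": 1070}
changed = {"query": "count by city", "rows": 1071}

# The same input always yields the same offset.
print(deterministic_noise(original) == deterministic_noise(original))  # True
# A changed input gets a freshly derived offset that cannot be predicted from the first.
print(deterministic_noise(original), deterministic_noise(changed))
```

Because the offset is re-derived whenever the input changes, an attacker adding rows one at a time can no longer tell whether a change in the result reflects the underlying count or the noise.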

Although noise is a form of deliberate inaccuracy, it is too small to be significant when aggregated statistics are used in the intended way. It will only impact cases where, intentionally or accidentally, a query might reveal information about identifiable individuals.

If you prefer, you can disable the injection of noise by using 100% accuracy queries.