Navigating Privacy in the Age of Big Compute

30 May 2024

Look, I understand that compliance is NOT security.

But privacy interacts with security in a really unique way: data that includes personally identifiable information requires the highest standard of security. And the meaning of privacy has forever changed: with big compute, combinations of fully “anonymized” datasets can easily be used to re-identify individuals.

We need to change our mindset if we are going to preserve privacy online.

Compute, specifically big compute, unlocks patterns in high-dimensional data: sparse informational vectors become dense in personally identifiable patterns. How many individuals, or groups with similar characteristics, can be uniquely picked out of a dataset is quantitatively measured by unicity.

Unicity is sometimes used in everyday English to describe embodied kindness and openness.

Unicity in mathematics refers to the uniqueness of a mathematical object: usually that there is only one object fulfilling the given properties, or that all objects of a given class are equivalent.

Unicity Distance in cryptography is not the focus today, but it may help to elucidate the idea: it tells us how much ciphertext is required before the encryption key can be uniquely recovered, assuming the attacker knows the encryption algorithm and has access to both the ciphertext and some statistics about the plaintext. Basically, it lets you calculate how big the haystack needs to be to find a needle, before you go digging.
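
For intuition, here is a back-of-the-envelope sketch of the classic formula (unicity distance = key entropy divided by plaintext redundancy), assuming English text with roughly 3.2 bits of redundancy per character; the numbers are illustrative, not part of any formal result in this post:

```python
# Unicity distance: U = H(K) / D
# H(K): key entropy in bits; D: plaintext redundancy in bits per character.
# English text has ~4.7 bits/char of raw capacity but only ~1.5 bits/char
# of actual entropy, so D is roughly 3.2 bits/char (an assumption here).

def unicity_distance(key_bits: float, redundancy_per_char: float = 3.2) -> float:
    """Approximate ciphertext length (in characters) after which the
    key is, in principle, uniquely determined."""
    return key_bits / redundancy_per_char

print(unicity_distance(128))  # ~40 characters for a 128-bit key
```

Roughly forty characters of English ciphertext is, in principle, enough haystack to pin down a 128-bit key; the point is that the size of the haystack is calculable before you go digging.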

This idea of measuring unicity in large data sets was first made famous by a study which found that over 90% of people could be uniquely re-identified in the Netflix Prize data set. In the authors’ words, they “demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.”

Robust De-anonymization of Large Sparse Datasets

In 2021, I was reminded again that “The risk of re-identification remains high even in country-scale location datasets.” This came from my own institution, the National Institutes of Health.

I had been doing signal processing studies on the human brain, seeing if we could change brain networks without conscious awareness. Spoilers: you totally can. That data may seem pretty sensitive, highly identifiable individual data - but there are data sets much more dangerous than that. Like your known Netflix usage.

Medical research funded by the US Government requires those data sets to be made openly available to the public when privacy can be reasonably preserved. But when you calculate the risk of re-identification not just of an individual within the data set, but in combination with other easily available data sets from the nearby geographical area, things get tricky.

It’s worth reading the whole summary:

“Although anonymous data are not considered personal data, recent research has shown how individuals can often be re-identified. Scholars have argued that previous findings apply only to small-scale datasets and that privacy is preserved in large-scale datasets. Using 3 months of location data, we (1) show the risk of re-identification to decrease slowly with dataset size, (2) approximate this decrease with a simple model taking into account three population-wide marginal distributions, and (3) prove that unicity is convex and obtain a linear lower bound. Our estimates show that 93% of people would be uniquely identified in a dataset of 60M people using four points of auxiliary information, with a lower bound at 22%. This lower bound increases to 87% when five points are available. Taken together, our results show how the privacy of individuals is very unlikely to be preserved even in country-scale location datasets.”

This is the gold that hackers usually mine for in healthcare, finance, and government records. They need four golden auxiliary data points, and they can find the individual.

This isn't finding a needle in a haystack.

It’s finding a specific needle in a stack of needles.

All I need is three months of location data about that needle, and bingo, I got it.
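
To make that concrete, here is a minimal sketch of how unicity is typically estimated empirically; the code and the trace format are my own illustration, not taken from the paper: sample individuals, draw a few (location, time) points from each trace as the auxiliary information, and count how often those points match exactly one person in the whole dataset.

```python
import random

def estimate_unicity(traces, points=4, samples=1000, seed=0):
    """Estimate the fraction of individuals uniquely identified by
    `points` randomly chosen observations from their own trace.

    traces: dict mapping person_id -> set of (location, hour) tuples.
    """
    rng = random.Random(seed)
    ids = list(traces)
    unique = trials = 0
    for _ in range(samples):
        person = rng.choice(ids)
        trace = traces[person]
        if len(trace) < points:
            continue
        trials += 1
        aux = set(rng.sample(sorted(trace), points))
        # How many people in the dataset contain all the auxiliary points?
        matches = sum(1 for pid in ids if aux <= traces[pid])
        if matches == 1:
            unique += 1
    return unique / trials if trials else 0.0

# Toy usage (real studies use months of country-scale location data):
# traces = {"alice": {("cell_17", 9), ("cell_04", 13)}, "bob": {("cell_17", 9)}}
# print(estimate_unicity(traces, points=1))
```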

Unicity in data sets is a massive blind spot for most organizations.

It should be a major compliance issue, but it’s a blind spot there too.

It’s a major security risk until we learn to observe it.

I just took the IAPP AI Governance Training. It’s the new standard, established in April 2024, for understanding global regulation around privacy concerns in Artificial Intelligence. I’ve got a technical background, and I wanted to use that training to get inside the minds of the lawyers, regulators and compliance officers I often interact with. I’m super pleased with how it sums up the current regulatory landscape, and I like that the certification requires updating your training on the subject every year: in this regulatory landscape, things move fast.

I’d like to focus for a moment on what I wish AI governance professionals understood.

I wish we had covered the technical advancements in Privacy Enhancing Technologies that you would need to consider if you have a data set that is at high risk of unicity. I wish we had covered any known, quantitative measurements to reduce the risk of unicity in small or large data sets. I wish we had covered unicity, period.

I wish we had covered what makes Privacy Enhancing Technologies (PETs) unique: all the way down to primitives in the Linux kernel, this technology has been specifically designed with privacy protection in mind. PETs can mitigate both compliance and security risks for high-risk data sets, all at once.

Security risks are often reviewed in the form of threat modeling. It’s a speculative calculation that multiplies three factors: the type of threat (an insider, a supply chain vulnerability), the magnitude of impact (to stakeholders, to end users, to business reputation) and the likelihood.

RISK = THREAT x IMPACT x LIKELIHOOD.

Let’s focus on likelihood: I tend to calculate that as the known/perceived asset value, and even put a proposed price tag on intellectual property like algorithms. This is important. You should evaluate your algorithmic IP like it is your product, because particularly in AI, it absolutely is your product.
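
As a purely illustrative sketch of that formula in practice (the 1-5 scales and the example ratings below are my own assumptions, not taken from any framework), with likelihood driven by the perceived value of the asset:

```python
def risk_score(threat: int, impact: int, likelihood: int) -> int:
    """RISK = THREAT x IMPACT x LIKELIHOOD, each factor rated 1-5."""
    for factor in (threat, impact, likelihood):
        if not 1 <= factor <= 5:
            raise ValueError("each factor must be rated 1-5")
    return threat * impact * likelihood

# Example: model extraction against a generative model whose algorithmic
# IP is the product itself. Likelihood is rated high precisely because
# the perceived asset value is high.
print(risk_score(threat=4, impact=5, likelihood=4))  # 80 out of a possible 125
```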

This also focuses your attention clearly in your threat model. If your business is specifically creating intellectual property around generative algorithms, traditional methods of security won’t work.

Let me explain why:

We are really good at encrypting data now.

It is, unfortunately, practically impossible to compute on conventionally encrypted data: to use it, you have to decrypt it somewhere.

If your business relies on compute (and it probably does if you have read this far), then you are responsible for making decisions about the privacy-motivated security threats to your surface area. Privacy is the one part of technology where compliance may actually be wholly aligned with security.

Back to that pesky encrypted data: there are a few good reasons why it might be encrypted. My favorite real use case for the PET Confidential Computing is in the fight against global human trafficking.

There have always been good people in the world fighting for the rights and freedoms of the victims of this globally distributed problem. Traditionally, OSINT techniques would be used to identify the locations of databases, often a corpus of photographic or videographic material, that you were legally NOT allowed to store and hold as evidence, because the goal is to prevent those records from ever having a new distribution vector.

This created a problem, as predators could easily move information around online, centralizing and decentralizing their architecture as needed. Those fighting the problem did not have the same flexibility.

Reasonable regulation, unfortunate secondary effects.

Now, Confidential Computing gives us a fair fight in the Hope for Justice Private Data Exchange: a demonstration of how to centralize those extremely high-risk records and protect the data in use by performing computation in a hardware-based, attested Trusted Execution Environment, where the data will only ever be observed by algorithms, not human eyes.

And it gets better. Because we are so good at encryption, this could now become part of a large, federated data ecosystem. Organizations around the world are able to pool their records and use the magic of just four golden auxiliary measures to surface potentially identifying information, not just about individuals, but about locations and patterns of movement. A fair fight, where privacy is preserved by an isolated execution environment: only algorithmic eyes will ever see those images again.
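
The mechanism that makes this trustworthy is attested key release. Here is a conceptual sketch, with hypothetical helper functions and field names standing in for a real attestation verifier and key management service; nothing below is tied to a specific TEE vendor or product:

```python
# Conceptual sketch only: a data owner's key broker releases the corpus
# decryption key to an enclave only after checking attestation evidence.
# The evidence fields and verification logic are illustrative placeholders,
# not a real vendor API.

EXPECTED_MEASUREMENT = "sha384:APPROVED_ANALYSIS_PIPELINE_HASH"

def verify_evidence_signature(evidence: dict) -> bool:
    """Placeholder: a real verifier checks the hardware vendor's
    signature chain over the evidence. Here we only check its shape."""
    return {"measurement", "signature"} <= evidence.keys()

def release_key(evidence: dict, key_store: dict) -> bytes:
    """Hand out the decryption key only to an enclave whose measured
    code matches the approved analysis pipeline."""
    if not verify_evidence_signature(evidence):
        raise PermissionError("attestation evidence is not genuine")
    if evidence["measurement"] != EXPECTED_MEASUREMENT:
        raise PermissionError("enclave is not running the approved code")
    # Only now does the encrypted corpus become computable, inside the
    # enclave, where no human operator can observe the records.
    return key_store["corpus-decryption-key"]
```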

Unicity is not some great evil.

Unicity is a tool, a really good tool. Unicity replaces your blind spot with a calculation. Take a look at your own organization’s first attempts at AI Conformity Assessment: risk management, data governance, and cybersecurity practices. Think beyond the current regulation to the total risk that your system may actually represent to end users, and start threat modeling for a data-dense world. Let’s get this right.

I learned so much in the days we spent covering every framework in AI regulation. Based on the Framework of Regulation provided in the AIGP training, here is my current recommendation for how to handle this in any medium- to large-sized organisation.

Prioritising Current Frameworks for AI Governance

An Enriched AI Governance Framework

Comprehensive Risk Management (NIST AI RMF)

  1. Structured Risk Management Process:
    • Identify Risks: Conduct thorough risk assessments to identify potential AI-related risks.
    • Assess Risks: Evaluate the severity and likelihood of identified risks.
    • Manage Risks: Implement strategies to mitigate identified risks.
    • Monitor and Update: Continuously monitor AI systems for new risks and update risk management strategies accordingly.

Ethical AI Development (OECD AI Principles)

  1. Ethical Considerations:
    • Human-Centric Design: Ensure AI systems prioritize human input and address human needs and experiences.
    • Transparency and Explainability: Provide clear and understandable information about how AI systems make decisions.
    • Accountability: Establish clear accountability for the actions and outcomes of AI systems.

Regulatory Compliance (GDPR, EU AI Act)

  1. Data Protection and Privacy:
    • GDPR Compliance: Implement measures to protect personal data, including data minimization and anonymization.
    • EU AI Act: Classify AI systems by risk and ensure compliance with specific requirements for high-risk AI systems.
    • Data Impact Assessments: Conduct Data Protection Impact Assessments (DPIAs) and AI conformity assessments to evaluate privacy risks.

Technical Considerations

  1. Privacy-Enhancing Technologies (PETs):
    • Differential Privacy: Implement differential privacy to ensure data privacy while analyzing group patterns (a minimal sketch follows this list).
    • Federated Learning: Use federated learning to train AI models on decentralized data without sharing individual data points.
    • Homomorphic Encryption: Employ homomorphic encryption to perform computations on encrypted data.
  2. Unicity and Re-identification Risks:
    • Measure Unicity: Quantitatively measure the risk of re-identification in datasets to ensure privacy.
    • Monitor and Reduce Unicity: Continuously monitor the unicity of datasets and implement strategies to reduce it.
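
To make the differential privacy item above concrete, here is a minimal sketch of the Laplace mechanism for a counting query; it is illustrative only, and in production you should use a vetted differential privacy library rather than hand-rolled noise:

```python
import math
import random

def laplace_count(true_count: int, epsilon: float, seed=None) -> float:
    """Release a count with epsilon-differential privacy by adding
    Laplace noise. A counting query has sensitivity 1: adding or
    removing one person changes the true count by at most 1."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon              # noise scale = sensitivity / epsilon
    u = rng.random() - 0.5             # uniform on [-0.5, 0.5)
    noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Example: publish how many records share four auxiliary attributes,
# without revealing whether any one individual is in the count.
print(laplace_count(true_count=137, epsilon=0.5))
```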

Try to Measure Impact Over Time on Your Implementation

  1. Establish a Central Governance Body: Create a dedicated team responsible for AI governance, ensuring compliance with GDPR, EU AI Act, NIST AI RMF, and OECD AI Principles.
  2. Develop Integrated Policies and Procedures: Create policies that integrate the principles of all four regulatory frameworks, focusing on data protection, risk management, transparency, and accountability.
  3. Leverage Technology for Compliance: Use advanced technologies, such as privacy-enhancing technologies (PETs) and AI monitoring tools, to support compliance and risk management efforts.
  4. Stay Ahead of Regulatory Change: Monitor regulatory changes and advancements in AI governance, ensuring the governance framework evolves with new developments. Keep a regulatory horizon line, but start thinking about this problem differently while you still can. Consider all of the ways that we can actually do responsible compute.

If we want to identify individuals, let’s make those surface areas secure.

If we don’t want to identify individuals, implement a way to monitor the ongoing risk of re-identification in your system’s outputs.

Lower levels of unicity in public and breached datasets would be great for all of us. It’s a data hygiene practice your team can adopt, one that gives a quantitative measure of the risk of convergent data usage by a privacy-motivated adversary. We absolutely can, and must, raise the bar on protecting personal data from re-identification. We can only start doing that if we measure it in our own data. If you are serious about privacy enhancing technologies and the changing tides of regulation in compute, send me an interesting question about it. If your systems necessarily engage with high risk data in training, you might also care about Unlearning in AI, or Security Threats to High Impact LLMs.