By the end of today’s lecture, you will learn:

- the three scenarios for protecting data confidentiality: data at rest, data in transit, and data in use;
- how cryptographic secret sharing works, via xor-based and additive constructions;
- how secret sharing enables secure multi-party computation, illustrated by the Boston wage gap study.
Today, we will begin Unit 4 of the course on secure data analysis. This unit continues our investigation into data confidentiality (and we will sometimes also discuss data integrity).
We can sub-divide data confidentiality into three different scenarios.
Protecting data at rest, such as encrypting data on your own local hard drive. This involves encryption from Alice to future-Alice, in order to stop someone who steals her laptop or phone from reading her data.
Protecting data in transit, for instance by using end-to-end encrypted messengers like Signal, WhatsApp, or iMessage. This involves encryption from Alice to Bob, in order to stop anyone else on the Internet from reading their messages. (But note that Alice and Bob implicitly trust each other with the message contents.)
Both of these scenarios can be achieved using authenticated encryption, the tool we designed in Unit 3.
Protecting data in use, so that Alice can outsource her computation to Bob while still not allowing Bob himself to read the data!
This may sound counterintuitive to you, or perhaps even impossible. After all, if you want someone to conduct data analysis for you, don’t you have to give them your data? That’s how modern cloud computing works.
We will spend the next two weeks exploring this problem. My goal is to convince you that it is possible to conduct data science without data sharing.
To solve this problem, we will need a new data confidentiality tool besides encryption. Encryption has an all-or-nothing property: anyone who knows the symmetric key (e.g., Alice or Bob) can read everything, whereas anyone without the key (e.g., Mallory) learns nothing.
Instead, we want a new system where the people you’re working with and the people you don’t fully trust are the same people. In other words, what happens when Bob is also Mallory?
“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”
– Ben Adida
As our starting point into this journey, we will look at the one-time pad in a new way.
Recall that a one-time pad uses the xor function \(\oplus\) to protect a secret message \(P\): encryption computes the ciphertext \(C = P \oplus K\) using a uniformly random key \(K\), and decryption recovers the message by performing:
\[ P = K \oplus C.\]
If the attacker sees the ciphertext \(C\) but not the key \(K\), then the cipher is perfectly secret.
The adversary has learned nothing about the secret message, in a probabilistic sense.
Concretely, an adversary who only sees \(C\) but not \(K\) cannot distinguish between any two possible values \(P_0\) and \(P_1\) of the secret message.
\[ \Pr[\text{secret is } P_0\text{, given ciphertext } C] = \Pr[\text{secret is } P_1\text{, given ciphertext } C].\]
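To make this concrete, here is a minimal Python sketch of a one-time pad over bytes. The function name `otp_encrypt` is my own invention for illustration; only the standard-library `secrets` module is assumed.

```python
import secrets

def otp_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Compute C = P xor K, byte by byte. The key must be as long as the
    message and used only once."""
    assert len(plaintext) == len(key)
    return bytes(p ^ k for p, k in zip(plaintext, key))

P = b"attack at dawn"
K = secrets.token_bytes(len(P))   # uniformly random key
C = otp_encrypt(P, K)

# Decryption is the same xor operation: P = K xor C.
assert otp_encrypt(C, K) == P
```

Notice that the xor relation is completely symmetric in \(K\) and \(C\); that symmetry is exactly what the next thought experiment exploits.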
Here’s an interesting thought experiment: what if we switch the roles of \(K\) and \(C\)? That is, what if Alice published the key \(K\) and kept the ciphertext \(C\) hidden, so the attacker sees \(K\) but not \(C\)?
With a one-time pad, this scenario should never happen. But it’s illustrative to think about it anyway.
Amazingly, this is also perfectly secret!
This insight leads to the idea of cryptographic secret sharing!
Secret sharing enables a group of parties to jointly hold a secret, but in such a way that no individual person or small coalition in the group knows the secret.
Each person holds only a single share of the secret – where the word “share” here is used in the sense of a stock share, not in the sense of sharing data.
Some people prefer the term secret splitting instead, as it is a more accurate description of what’s happening.
Example. Think of a lockbox with multiple keys. Inside the box is a secret message that nobody knows, and the box can only be opened when enough of the keyholders come together.
Informally, a \((t, n)\)-secret sharing scheme splits the secret \(s\) into \(n\) shares, such that any \(t − 1\) of the shares reveal no information about \(s\), while any \(t\) shares allow complete reconstruction of the secret \(s\).
A formal definition follows.
Definition. Starting from a secret \(s\), a cryptographic secret sharing scheme is defined by two algorithms:

split\((s, t, n)\) is a randomized algorithm that splits a secret into a vector of \(n\) shares \(\vec{s} = [s_1, s_2, \ldots, s_n]\), such that any subset of shares of size \(t\) can reconstruct the original secret.

reconstruct\((s_{i_1}, \ldots, s_{i_t})\) is a deterministic algorithm that takes \(t\) distinct secret shares (i.e., a sub-vector of \(\vec{s}\)) and calculates the original secret \(s\).
The scheme must satisfy two security properties.
Correctness states that if you split any secret \(s\), then reconstruct always recovers the original secret \(s\).
Perfect privacy is defined similarly to the perfect secrecy of a one-time pad: for any two possible secrets \(s\) and \(s'\), and for any vector of \(t-1\) or fewer secret shares \(\vec{x} = (x_{i_1}, x_{i_2}, \ldots, x_{i_{t-1}})\), it is the case that:
\[ \Pr[x_{i_j} = s_{i_j} \text{ for all } j \mid \vec{s} = \text{split}(s, t, n)] = \Pr[x_{i_j} = s'_{i_j} \text{ for all } j \mid \vec{s}' = \text{split}(s', t, n)].\]
In words: perfect privacy says that the shares held by a coalition of \(t-1\) parties have an equal probability of being shares of \(s\) or shares of \(s'\).
Today, we will construct two (related) secret sharing algorithms.
Scenario. Suppose that Susie has a secret \(s\) that she wants to split between two of her friends Alice and Bob. However, Susie does not trust either Alice or Bob enough to hand over her secret.
Instead, she wants each of her friends to hold onto one share of the secret on their local hard drives. This way, neither Alice nor Bob alone learns anything about the secret, but together they can recover it whenever Susie needs it back.
Here is a protocol that allows Susie to accomplish this goal. Suppose that her secret \(s\) is \(\ell\) bits in length. Susie can split it into two shares as follows.

1. Sample \(s_1\) uniformly at random from all length-\(\ell\) bitstrings.
2. Compute \(s_2 \gets s \oplus s_1\) (the xor operation).

To reconstruct the secret later, calculate \(s \gets s_1 \oplus s_2\).
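Here is a short Python sketch of this 2-out-of-2 xor-based scheme over bytes. The function names `split2` and `reconstruct2` are illustrative, not from any library.

```python
import secrets

def split2(secret: bytes) -> tuple[bytes, bytes]:
    """Split `secret` into two shares satisfying s = s1 xor s2."""
    s1 = secrets.token_bytes(len(secret))            # uniformly random share
    s2 = bytes(x ^ y for x, y in zip(secret, s1))    # s2 = s xor s1
    return s1, s2

def reconstruct2(s1: bytes, s2: bytes) -> bytes:
    """Recover the secret by xor-ing the two shares back together."""
    return bytes(x ^ y for x, y in zip(s1, s2))

s1, s2 = split2(b"Susie's secret")
assert reconstruct2(s1, s2) == b"Susie's secret"
# Alice stores s1 and Bob stores s2; each share alone is a uniformly
# random bitstring that reveals nothing about the secret.
```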
I claim that this is a secret sharing scheme among \(n = 2\) parties that requires a threshold of \(t = 2\) of them to reconstruct.
Correctness should be easy to verify. Both the split and reconstruct algorithms rely on the fact that \(s = s_1 \oplus s_2\).
Perfect privacy follows from an argument similar to the one for a one-time pad. There is a subtle but essential probability step needed to make this rigorous.
\(s_1\) has the uniform distribution over all length-\(\ell\) bitstrings, by construction. So it looks like an OTP key.
\(s_2\) on its own also has the uniform distribution over all length-\(\ell\) bitstrings! This is because a uniformly-random bitstring xored with a constant bitstring has a uniform distribution.
For example, suppose that the secret \(s = \texttt{111..1}\). Then \(s_2 = \texttt{111..1} \oplus s_1\) is the result of taking a uniformly-random bitstring and flipping all of its bits. The result also has the uniform distribution.
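Here is a quick empirical sanity check of that claim (a sketch, not a proof): we xor a fixed constant with many uniformly random 8-bit values and observe that the results are spread evenly. The \(\ell = 8\) example and variable names are my own choices.

```python
import secrets
from collections import Counter

SECRET = 0b11111111            # the constant s = 111...1 from the example
counts = Counter()
for _ in range(100_000):
    s1 = secrets.randbelow(256)    # uniform 8-bit share
    s2 = SECRET ^ s1               # flip all bits of s1
    counts[s2] += 1

# Each of the 256 possible values of s2 appears roughly 100000/256 ≈ 390 times.
print(min(counts.values()), max(counts.values()))
```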
We can also construct a secret sharing scheme using addition, rather than the xor operation. Here’s an alternative approach for Susie to split her secret:

1. Sample \(s_1\) uniformly at random from \(\{0, 1, \ldots, 2^\ell - 1\}\).
2. Compute \(s_2 \gets s - s_1 \bmod{2^\ell}\).

To reconstruct the secret later, calculate \(s \gets s_1 + s_2 \bmod{2^\ell}\).
Correctness and privacy hold for essentially the same reasons as before. (Do you see why?)
Also, both the boolean and arithmetic secret sharing approaches immediately generalize to the \(n\) party setting. Susie can split her secret \(s\) among \(n\) parties by sampling shares such that:
\[ s = s_1 + s_2 + \cdots + s_n \bmod{2^\ell}.\]
Subsequently, all \(n\) parties must honestly participate in order to reconstruct the original secret \(s\).
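Below is a Python sketch of the \(n\)-party additive scheme. The names `split_additive` and `reconstruct_additive` are illustrative; only the standard `secrets` module is assumed.

```python
import secrets

def split_additive(s: int, n: int, l: int) -> list[int]:
    """Split s into n shares with s = s1 + s2 + ... + sn (mod 2**l)."""
    modulus = 2 ** l
    shares = [secrets.randbelow(modulus) for _ in range(n - 1)]
    last = (s - sum(shares)) % modulus   # forces the shares to sum to s
    return shares + [last]

def reconstruct_additive(shares: list[int], l: int) -> int:
    """Recover the secret by adding all n shares modulo 2**l."""
    return sum(shares) % (2 ** l)

shares = split_additive(42, n=5, l=32)
assert reconstruct_additive(shares, l=32) == 42
# Any n-1 of the shares are jointly uniform, so they reveal nothing about s.
```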
But hang on a second. I started today by discussing three types of data protection: data at rest, data in transit, and data in use.
So far, we have just discussed secret sharing as a new way for Susie to protect data at rest. Granted, the data is stored at rest with Alice and Bob rather than on Susie’s own laptop… but still the data is being protected at rest.
What I would ideally like to do is to perform data analysis on the data while it is inside the box.
Let me introduce this new way to conduct data science by way of a story – a true story.
In 2013, then-Boston Mayor Tom Menino commissioned a working group to write a report about how to “make Boston the premier city for working women.”
The Mayor’s office wrote a report, and they established a new group called the Boston Women’s Workforce Council that brought together leading women in the state from government, industry, and academia to find and implement ways to close the gender and racial wage gap.
The Council established a Compact – or pledge – that has been signed by more than 250 companies throughout the greater Boston area. The companies promise to do three things; I want to draw your attention to the part about contributing data toward measuring the city-wide gender and racial wage gap.
[Image: the text of the Compact pledge. Source: Boston Women’s Workforce Council]
The text of the pledge reveals the tension between data confidentiality and data analysis.
- Contributing their employee data anonymously. Individual companies don’t want to publish their salary data.
- Measurement to create a snapshot of the gender/racial wage gap in Greater Boston. Everyone is interested in learning about the overall state of the gender and racial wage gap, as part of their efforts to close the gap and move toward a world of 100% equal pay for equal work.
Question. How can the companies resolve this tension? How can they perform data analysis while preserving data confidentiality?
The BWWC initially tried to find a trusted external party to hold onto the data and compute the city-wide wage gap. However, by 2014 it was clear that this wasn’t going to work, for two reasons.
But this story isn’t over. The Boston Women’s Workforce Council has run the data analysis to calculate the city-wide gender and racial wage gap. In fact they’ve done it 5 times now: in 2016, 2017, 2019, 2021, and 2023.
So how did they do it? The answer lies in a new cryptographic primitive that we will study for the next week: secure multi-party computation.
Several of us at Boston University have worked with the Boston Women’s Workforce Council and the city to design, build, and deploy a web-based survey form. This webpage allows companies to contribute their data toward the wage gap calculation without anyone learning their salary data – not even us!
Here’s what our website looks like. The intention is for someone in (say) the HR department of each participating company to input the number and total annual compensation of all employees in each group of gender, race/ethnicity, and job category.
The form is designed to follow the format of a U.S. Equal Employment Opportunity Commission form that all large employers must fill out every year. So it is a format that our target audience is already familiar with.
So far, this looks like an ordinary website. The only special part about it is what happens after you click on the “submit” button.
With a normal website, clicking “submit” would cause the data to be sent over to the server. It would likely be encrypted in transit, but it would be viewable by the people running the server (in this case, us at BU).
But this website is different. When you click on the “submit” button, the data itself doesn’t go anywhere!
Instead, we use cryptographic secret sharing to split the data between BU and the Boston Women’s Workforce Council.
The objective is to learn a single spreadsheet containing the sum across all spreadsheets provided by the hundreds of companies. That is: we want to know the city-wide total for each cell in the spreadsheet, without learning any information about individual employees or employers.
Let’s see how this works using a smaller-scale version of the wage gap project. For now, let’s suppose that there are only three participating companies, each reporting a single number (one cell of the spreadsheet), and two computing parties: BU and the Council.

Let’s call the employers’ individual answers \(a\), \(b\), and \(c\), respectively. Our objective is for the City of Boston to learn the sum \(a + b + c\), and nothing else. Pictorially:
|  | Company 1 | Company 2 | Company 3 | Sum |
|---|---|---|---|---|
| Secret data | \(a\) | \(b\) | \(c\) |  |
| BU’s shares |  |  |  |  |
| Council’s shares |  |  |  |  |
Technically, everyone is interested in the average, not the sum. But since the number of companies is known, it’s straightforward to convert from one to the other: just divide by 3.
First, each company visits the website and splits their secret data between BU and the Council. Remember that these shares (e.g., \(a_1\) and \(a_2\)) have a uniform distribution subject to the fact that they sum to the real secret (e.g., \(a = a_1 + a_2 \bmod{2^\ell}\)). So, neither BU nor the Council has learned anything so far, because each holds only a random number.
|  | Company 1 | Company 2 | Company 3 | Sum |
|---|---|---|---|---|
| Secret data | \(a\) | \(b\) | \(c\) |  |
| BU’s shares | \(a_1\) | \(b_1\) | \(c_1\) |  |
| Council’s shares | \(a_2\) | \(b_2\) | \(c_2\) |  |
Our overall goal is for the City of Boston to be able to compute \(a + b + c\), which is the sum of all six secret shares.
Here comes the key insight: because addition is commutative and associative, it doesn’t matter whether we reconstruct the three secrets first and then add them, or add the shares first and then reconstruct the sum.
And the latter option preserves privacy!
Second, BU and the Council add up the three secret shares that they possess, and send these results over to the City.
Each of these sums is uniformly random on its own and reveals nothing about \(a\), \(b\), or \(c\) individually, so it preserves the companies’ privacy.
|  | Company 1 | Company 2 | Company 3 | Sum |
|---|---|---|---|---|
| Secret data | \(a\) | \(b\) | \(c\) |  |
| BU’s shares | \(a_1\) | \(b_1\) | \(c_1\) | \(a_1 + b_1 + c_1\) |
| Council’s shares | \(a_2\) | \(b_2\) | \(c_2\) | \(a_2 + b_2 + c_2\) |
Finally, the City of Boston can add the numbers on the right to reconstruct the sum \(a + b + c\), which is the desired city-wide aggregate statistic.
This is what our web-based software does behind the scenes. It just does it for each of the 100+ participating companies, and for each of the 600+ cells in the spreadsheet.
|  | Company 1 | Company 2 | Company 3 | Sum |
|---|---|---|---|---|
| Secret data | \(a\) | \(b\) | \(c\) | \(a + b + c\) |
| BU’s shares | \(a_1\) | \(b_1\) | \(c_1\) | \(a_1 + b_1 + c_1\) |
| Council’s shares | \(a_2\) | \(b_2\) | \(c_2\) | \(a_2 + b_2 + c_2\) |
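Here is a toy end-to-end run of this protocol in Python, with three companies, two computing parties, and a single spreadsheet cell. All names and the example salaries are illustrative; this is a sketch of the idea, not the deployed BWWC software.

```python
import secrets

L = 32
MOD = 2 ** L

def split2_additive(s: int) -> tuple[int, int]:
    """Split s into two additive shares modulo 2**L."""
    s1 = secrets.randbelow(MOD)
    return s1, (s - s1) % MOD

# Step 1: each company splits its secret between BU and the Council.
company_data = {"a": 95_000, "b": 72_000, "c": 88_000}   # hypothetical values
bu_shares, council_shares = [], []
for value in company_data.values():
    s1, s2 = split2_additive(value)
    bu_shares.append(s1)        # sent to BU
    council_shares.append(s2)   # sent to the Council

# Step 2: each computing party adds its own shares locally.
bu_sum = sum(bu_shares) % MOD             # uniformly random on its own
council_sum = sum(council_shares) % MOD   # uniformly random on its own

# Step 3: the City reconstructs only the aggregate statistic.
city_total = (bu_sum + council_sum) % MOD
assert city_total == sum(company_data.values())
```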
Great! We now know how to perform data analysis over data that we cannot see … well, as long as the only data analysis we want to perform is addition.
The good news is that you can perform cryptographically secure multi-party computation of any function!
Concretely, consider the following setting:
- There are \(m\) people who possess sensitive data \(x_1, x_2, \ldots, x_m\).
- They want to outsource this data to a collection of computing parties \(P_1, P_2, \ldots, P_n\). (The computing parties could be the same people as the data holders, or they could be different companies like cloud providers.)
- Everyone agrees upon a data analysis function \(f\) to perform, and everyone assumes that fewer than \(t\) of the \(n\) parties are adversarial.
Secure multi-party computation (often abbreviated as MPC) allows the parties to jointly compute the output \(f(x_1, x_2, \ldots, x_m)\) without revealing anything else about the sensitive data to the computing parties or to anyone else.
Now that we understand how MPC works informally, this is usually where I would provide a scientific definition. I’m not going to do that this time though, because the formal definition of MPC is somewhat complicated. You’ll read about it in this week’s required reading.
Instead, I reproduce below some text from a recent bill in the U.S. Congress that provides a legal definition of MPC. (Emphasis added by me.)
The term “secure multi-party computation” means a computerized system that enables different participating entities in possession of private sets of data to link and aggregate their data sets for the exclusive purpose of performing a finite number of pre-approved computations without transferring or otherwise revealing any private data to each other or anyone else. – 2022 U.S. Senate bill S.3952
We can consider MPC against two kinds of adversaries (the same ones we considered for encryption): a passive adversary like Eve, who follows the protocol but tries to learn extra information from what she observes, and an active adversary like Mallory, who may deviate from the protocol arbitrarily.
The protocol I showed you today only protects against a passive eavesdropper Eve.
If, for example, we at BU tampered with the data, then we could cause the output of the calculation to be incorrect. That is: we can break data integrity. Even so, there’s no way for us to break data confidentiality.
|  | Company 1 | Company 2 | Company 3 | Sum |
|---|---|---|---|---|
| Secret data | \(a\) | \(b\) | \(c\) | \(a + b + c \color{red}{+ 10}\) |
| BU’s shares | \(a_1\) | \(b_1\) | \(c_1\) | \(a_1 + b_1 + c_1 \color{red}{+ 10}\) |
| Council’s shares | \(a_2\) | \(b_2\) | \(c_2\) | \(a_2 + b_2 + c_2\) |
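To make the integrity failure concrete, here is a tiny continuation of the earlier aggregation sketch (reusing its variables; again purely illustrative): if BU adds 10 to its local sum, the City’s output shifts by exactly 10, while the companies’ inputs stay hidden.

```python
# If BU tampers with its local sum, the reconstructed total is wrong
# by exactly the same offset. Integrity breaks; confidentiality does not.
tampered_bu_sum = (bu_sum + 10) % MOD
tampered_total = (tampered_bu_sum + council_sum) % MOD
assert tampered_total == (sum(company_data.values()) + 10) % MOD
```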
We will look next time at how to protect integrity against an active adversary Mallory.
DS 653 — Lecture 19