Lecture 19: Secure Multi-Party Computation

Learning outcomes

By the end of today’s lecture, you will learn:

  • The distinction between protecting data at rest, in transit, and in use.
  • The concept of secure multi-party computation (or MPC for short).
  • How to construct MPC from cryptographic secret sharing.

19.1 Protecting data in use

Today, we will begin Unit 4 of the course on secure data analysis. This unit continues our investigation into data confidentiality (and we will sometimes also discuss data integrity).

We can sub-divide data confidentiality into three different scenarios.

  • Protecting data at rest, such as encrypting data on your own local hard drive. This involves encryption from Alice to future-Alice, in order to stop someone who steals her laptop or phone from reading her data.

  • Protecting data in transit, for instance by using end-to-end encrypted messengers like Signal, WhatsApp, or iMessage. This involves encryption from Alice to Bob, in order to stop anyone else on the Internet from reading their messages. (But note that Alice and Bob implicitly trust each other with the message contents.)

    Both of these two scenarios can be achieved using authenticated encryption, the tool we designed in Unit 3.

  • Protecting data in use, so that Alice can outsource her computation to Bob while still not allowing Bob himself to read the data!

This may sound counterintuitive to you, or perhaps even impossible. After all, if you want someone to conduct data analysis for you, don’t you have to give them your data? That’s how modern cloud computing works.

We will spend the next two weeks exploring this problem. My goal is to convince you that it is possible to conduct data science without data sharing.

Looking ahead

To solve this problem, we will need a new data confidentiality tool besides encryption. Encryption has an all-or-nothing property: anyone who knows the symmetric key (e.g., Alice or Bob) can read everything, whereas anyone without the key (e.g., Mallory) learns nothing.

Instead, we want a new system where the people you’re working with and the people you don’t fully trust are the same people. In other words, what happens when Bob is also Mallory?

“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”

– Ben Adida

19.2 Cryptographic secret sharing

As the starting point of this journey, we will look at the one-time pad in a new way.

Recall that a one-time pad uses the xor function \(\oplus\) to protect a secret message \(P\). Encryption computes the ciphertext \(C = K \oplus P\); equivalently, since xor is its own inverse,

\[ P = K \oplus C.\]

If the attacker sees the ciphertext \(C\) but not the key \(K\), then the cipher is perfectly secret.

  • The adversary has learned nothing about the secret message, in a probabilistic sense.

  • Concretely, an adversary who only sees \(C\) but not \(K\) cannot distinguish between any two possible values \(P_0\) and \(P_1\) of the secret message.

    \[ \Pr[\text{secret is } P_0\text{, given ciphertext } C] = \Pr[\text{secret is } P_1\text{, given ciphertext } C].\]

Here’s an interesting thought experiment: what if we switch the roles of \(K\) and \(C\)? That is, what if:

  • Mallory sees the key \(K\) but
  • Mallory does not see the ciphertext \(C\)?

With a one-time pad, this scenario should never happen. But it’s illustrative to think about it anyway.

Amazingly, this is also perfectly secret!

  • In the equation \(P = K \oplus C\), we typically think of \(K\) as a “mask” that hides \(C\).
  • But it is equally valid to think about \(C\) as a mask that hides \(K\).
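To make this symmetry concrete, here is a tiny Python sketch of the one-time pad with the two masks swapped. It is purely illustrative: the xor_bytes helper is defined here for the example, not taken from any library.

```python
import secrets

def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bitwise xor of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))

P = b"attack at dawn!!"            # a 16-byte secret message
K = secrets.token_bytes(len(P))    # a uniformly random one-time key
C = xor_bytes(K, P)                # the one-time pad ciphertext

# Decryption: xor is its own inverse, so P = K xor C.
assert xor_bytes(K, C) == P

# Role swap: the same equation reads "C masks K" just as well as
# "K masks C" -- whichever one Mallory is missing acts as the mask.
assert xor_bytes(C, K) == P
```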

Secret sharing, informally

This amazing insight leads to the idea of cryptographic secret sharing!

  • Secret sharing enables a group of parties to jointly hold a secret, but in such a way that no individual person or small coalition in the group knows the secret.

  • Each person holds only a single share of the secret – where the word “share” here is used in the sense of a stock share, not in the sense of sharing data.

  • Some people prefer the term secret splitting instead, as it is a more accurate description of what’s happening.

Example. Think of a lockbox with multiple keys. Inside the box is a secret message that nobody knows.

  • If enough keys were inserted in the right places, the box would open and the secret would be revealed.
  • Otherwise, the box remains closed and nobody has access to the secret.

Defining secret sharing

Informally, a \((t, n)\)-secret sharing scheme splits the secret \(s\) into \(n\) shares, such that any \(t-1\) of the shares reveal no information about \(s\), while any \(t\) shares allow complete reconstruction of the secret \(s\).

A formal definition follows.

Definition. Starting from a secret \(s\), a cryptographic secret sharing scheme is defined by two algorithms:

  1. split\((s,t,n)\) is a randomized algorithm that splits a secret into a vector of \(n\) shares \([s_1, s_2, \ldots, s_n]\), such that any subset of shares of size \(t\) can reconstruct the original secret.

  2. reconstruct\((s_{i_1}, \ldots, s_{i_t})\) is a deterministic algorithm that takes \(t\) distinct secret shares (i.e., a sub-vector of \(\vec{s}\)) and calculates the original secret \(s\).

The scheme must satisfy two security properties.

  • Correctness states that if you split any secret \(s\), then reconstruct always recovers the original secret \(s\).

  • Perfect privacy is defined similarly to the perfect secrecy of a one-time pad: for any two possible secrets \(s\) and \(s'\), and for any vector of \(t-1\) or fewer candidate shares \(\vec{x} = (x_{i_1}, x_{i_2}, \ldots, x_{i_{t-1}})\), it is the case that:

    \[ \Pr[s_{i_j} = x_{i_j} \text{ for all } j \mid \vec{s} = \text{split}(s, t, n)] = \Pr[s_{i_j} = x_{i_j} \text{ for all } j \mid \vec{s} = \text{split}(s', t, n)].\]

    In words: perfect privacy says that the shares held by a coalition of \(t-1\) parties have an equal probability of being shares of \(s\) or shares of \(s'\).

Boolean secret sharing

Today, we will construct two (related) secret sharing algorithms.

Scenario. Suppose that Susie has a secret \(s\) that she wants to split between her two friends, Alice and Bob. However, Susie does not trust either Alice or Bob enough to hand over her secret.

Instead, she wants each of her friends to hold onto one share of the secret on their local hard drives. This way:

  • If Susie loses her secret in the future, she can recover it by talking to both of them.
  • If either Alice or Bob is actually evil (or has their hard drive corrupted), the secret is still protected.

Here is a protocol that allows Susie to accomplish this goal. Suppose that her secret \(s\) is \(\ell\) bits in length. Susie can split it into two shares as follows.

  1. Choose the share \(s_1\) uniformly at random among all bitstrings of length \(\ell\).
  2. Calculate \(s_2 \gets s \oplus s_1\) (where \(\oplus\) denotes the boolean-xor operation).
  3. Send \(s_1\) to Alice, and send \(s_2\) to Bob.

To reconstruct the secret later, calculate \(s \gets s_1 \oplus s_2\).
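Here is a minimal Python sketch of this two-party boolean scheme, assuming the secret is a byte string. The names xor_split and xor_reconstruct are my own labels for the two algorithms, not a standard API.

```python
import secrets

def xor_split(s: bytes) -> tuple[bytes, bytes]:
    """Split a secret into two xor shares (here t = n = 2)."""
    s1 = secrets.token_bytes(len(s))          # uniformly random share
    s2 = bytes(a ^ b for a, b in zip(s, s1))  # s2 = s xor s1
    return s1, s2

def xor_reconstruct(s1: bytes, s2: bytes) -> bytes:
    """Recombine both shares: s = s1 xor s2."""
    return bytes(a ^ b for a, b in zip(s1, s2))

secret = b"my deepest secret"
alice_share, bob_share = xor_split(secret)  # one share to each friend
assert xor_reconstruct(alice_share, bob_share) == secret
```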

I claim that this is a secret sharing scheme among \(n = 2\) parties that requires a threshold of \(t = 2\) of them to reconstruct.

  • Correctness should be easy to verify. Both the split and reconstruct algorithms rely on the fact that \(s = s_1 \oplus s_2\).

  • Perfect privacy follows from an argument similar to the one for a one-time pad.

    • From Alice’s perspective: she knows \(s_1\) but not \(s_2\) or \(s\). To her, the scheme looks just like a one-time pad where \(s_1\) is akin to the ciphertext, and the unknown \(s_2\) acts as the key that perfectly hides the message \(s\).
    • From Bob’s perspective: the situation is the same! He knows \(s_2\), but to him this looks just like an OTP ciphertext, and the unknown \(s_1\) acts as the key that perfectly hides the message.

There is a subtle but essential probability fact needed to make this argument rigorous.

  • \(s_1\) has the uniform distribution over all length-\(\ell\) bitstrings, by construction. So it looks like an OTP key.

  • \(s_2\) on its own also has the uniform distribution over all length-\(\ell\) bitstrings! This is because a uniformly-random bitstring xored with a constant bitstring has a uniform distribution.

    For example, suppose that the secret \(s = \texttt{111..1}\). Then \(s_2 = \texttt{111..1} \oplus s_1\) is the result of taking a uniformly-random bitstring and flipping all of its bits. The result also has the uniform distribution.
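You can check this exhaustively at a toy length, say \(\ell = 3\). The following brute-force sketch fixes a secret \(s\) and enumerates every equally likely share \(s_1\); the resulting \(s_2 = s \oplus s_1\) hits each possible value exactly once, i.e., it is uniform.

```python
from collections import Counter

ELL = 3        # toy share length, small enough to enumerate everything
s = 0b111      # a fixed secret

# For each equally likely random share s1, compute s2 = s xor s1.
counts = Counter(s ^ s1 for s1 in range(2**ELL))

# Every possible value of s2 appears exactly once, so s2 on its own
# is uniform -- no matter which secret s we started from.
assert all(counts[v] == 1 for v in range(2**ELL))
```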

Arithmetic secret sharing

We can also construct a secret sharing scheme using addition, rather than the xor operation. Here’s an alternative approach for Susie to split her secret:

  1. Choose the share \(s_1\) uniformly at random among all numbers that are \(\ell\) bits long (i.e., the numbers 0 to \(2^\ell-1\)).
  2. Calculate \(s_2 \gets (s - s_1) \bmod 2^\ell\).
  3. Send \(s_1\) to Alice, and send \(s_2\) to Bob.

To reconstruct the secret later, calculate \(s \gets (s_1 + s_2) \bmod 2^\ell\).

Correctness and privacy hold for essentially the same reasons as before. (Do you see why?)

Also, both the boolean and arithmetic secret sharing approaches immediately generalize to the \(n\)-party setting (see the sketch below). Susie can split her secret \(s\) among \(n\) parties by sampling shares such that:

\[ s = s_1 + s_2 + \cdots + s_n \bmod{2^\ell}.\]

Subsequently, all \(n\) parties must honestly participate in order to reconstruct the original secret \(s\).
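Here is a Python sketch of the \(n\)-party arithmetic scheme. The names additive_split and additive_reconstruct are illustrative, and the code assumes the secret is a nonnegative integer below \(2^\ell\).

```python
import secrets

def additive_split(s: int, n: int, ell: int) -> list[int]:
    """Split s into n shares that sum to s modulo 2**ell."""
    mod = 2**ell
    shares = [secrets.randbelow(mod) for _ in range(n - 1)]
    # Choose the final share so that all n shares sum to s mod 2**ell.
    shares.append((s - sum(shares)) % mod)
    return shares

def additive_reconstruct(shares: list[int], ell: int) -> int:
    """All n shares are required; any n-1 of them look uniformly random."""
    return sum(shares) % (2**ell)

shares = additive_split(s=42, n=5, ell=32)
assert additive_reconstruct(shares, ell=32) == 42
```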

Secret sharing \(\to\) secure computing

But hang on a second. I started today by discussing three types of data protection: data at rest, data in transit, and data in use.

So far, we have just discussed secret sharing as a new way for Susie to protect data at rest. Granted, the data is stored at rest with Alice and Bob rather than on Susie’s own laptop… but still the data is being protected at rest.

What I would ideally like to do is to perform data analysis on the data while it is inside the box.

[Image source: NASA]

19.3 Measuring the wage gap in Boston

Let me introduce this new way to conduct data science by way of a story – a true story.

In 2013, then-Boston Mayor Tom Menino commissioned a working group to write a report about how to “make Boston the premier city for working women.”

The Mayor’s office wrote a report, and they established a new group called the Boston Women’s Workforce Council that brought together leading women in the state from government, industry, and academia to find and implement ways to close the gender and racial wage gap.

The Council established a Compact – or pledge – that has been signed by more than 250 companies throughout the greater Boston area. The companies promise to do three things; I want to draw your attention to the part about contributing data toward measuring the city-wide gender and racial wage gap.

[Image source: Boston Women’s Workforce Council]

The text of the pledge reveals the tension between data confidentiality and data analysis.

  • Contributing their employee data anonymously. Individual companies don’t want to publish their salary data.

  • Measurement to create a snapshot of the gender/racial wage gap in Greater Boston. Everyone is interested in learning about the overall state of the gender and racial wage gap, as part of their efforts to close the gap and move toward a world of 100% equal pay for equal work.

Question. How can the companies resolve this tension? How can they perform data analysis while preserving data confidentiality?

The BWWC initially tried to find a trusted external party to hold onto the data and compute the city-wide wage gap. However, by 2014 it was clear that this wasn’t going to work, for two reasons.

  1. None of the companies wanted to hand over their data to another company.
  2. Nobody wanted to be the group that collected salary data from across the city. The liability risk was too great: if the data were ever breached, they might be sued by hundreds of companies and thousands of city employees, all at the same time.

But this story isn’t over. The Boston Women’s Workforce Council has run the data analysis to calculate the city-wide gender and racial wage gap. In fact they’ve done it 5 times now: in 2016, 2017, 2019, 2021, and 2023.

So how did they do it? The answer lies in a new cryptographic primitive that we will study for the next week: secure multi-party computation.

19.4 Secure addition

Several of us at Boston University have worked with the Boston Women’s Workforce Council and the city to design, build, and deploy a web-based survey form. This webpage allows companies to contribute their data toward the wage gap calculation without anyone learning their salary data – not even us!

Here’s what our website looks like. The intention is for someone in (say) the HR department of each participating company to input the number and total annual compensation of all employees in each group of gender, race/ethnicity, and job category.

The form is designed to follow the format of a U.S. Equal Employment Opportunity Commission form that all large employers must fill out every year. So it is a format that our target audience is already familiar with.

[Image: the 100talent.org website in 2017]

So far, this looks like an ordinary website. The only special part about it is what happens after you click on the “submit” button.

With a normal website, clicking “submit” would cause the data to be sent over to the server. It would likely be encrypted in transit, but it would be viewable by the people running the server (in this case, us at BU).

But this website is different. When you click on the “submit” button, the data itself doesn’t go anywhere!

Instead, we use cryptographic secret sharing to split the data between BU and the Boston Women’s Workforce Council.

The objective is to learn a single spreadsheet containing the sum across all spreadsheets provided by the hundreds of companies. That is: we want to know the city-wide total for each cell in the spreadsheet, without learning any information about individual employees or employers.

Let’s see how this works using a smaller-scale version of the wage gap project. For now, let’s suppose that:

  • There are only 3 companies that participate in the data analysis.
  • We only want to calculate one cell within the spreadsheet. The first one happens to ask for the number of female Latina executives at the company.

Setup

Let’s call the employers’ individual answers \(a\), \(b\), and \(c\), respectively. Our objective is for:

  • The city of Boston to learn only the sum of these numbers \(a + b + c\).
  • Everyone else (including us at BU) to learn nothing.

[Diagram: the three companies hold secret values \(a\), \(b\), and \(c\).]

Technically, everyone is interested in the average, not the sum. But since the number of companies is known, it’s straightforward to convert from one to the other: just divide by 3.

Contributing data

First, each company visits the website and splits their secret data between BU and the Council. Remember that these shares (e.g., \(a_1\) and \(a_2\)) have a uniform distribution subject to the fact that they sum to the real secret (e.g., \(a = a_1 + a_2 \bmod{2^\ell}\)). So, neither BU nor the Council has learned anything so far on its own, because each of them holds only a random number.

[Diagram: each company splits its secret into two shares; BU holds \(a_1\), \(b_1\), \(c_1\) and the Council holds \(a_2\), \(b_2\), \(c_2\).]

Our overall goal is for the City of Boston to be able to compute \(a + b + c\), which is the sum of all six secret shares.

Here comes the key insight: because addition is commutative, it doesn’t matter whether we

  • first add the numbers vertically and then horizontally, or
  • first add the numbers horizontally and then vertically.

And the latter option preserves privacy!

Aggregating data

Second, BU and the Council add up the three secret shares that they possess, and send these results over to the City.

Each of these numbers individually is uniformly random, and it doesn’t reveal anything about \(a\), \(b\), or \(c\) individually, so it preserves the companies’ privacy.

[Diagram: BU computes \(a_1 + b_1 + c_1\), the Council computes \(a_2 + b_2 + c_2\), and each sends its sum to the City.]

Finally, the City of Boston can add the numbers on the right to reconstruct the sum \(a + b + c\), which is the desired city-wide aggregate statistic.

This is what our web-based software does behind the scenes. It just does it for each of the 100+ participating companies, and for each of the 600+ cells in the spreadsheet.

[Diagram: the City adds the two column sums \(a_1 + b_1 + c_1\) and \(a_2 + b_2 + c_2\) to reconstruct \(a + b + c\).]
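Putting the pieces together, here is a minimal end-to-end simulation of the secure sum in Python. It is a sketch under simplifying assumptions: three made-up inputs, a single spreadsheet cell, and every party (the companies, BU, the Council, and the City) played by one process.

```python
import secrets

MOD = 2**32   # arithmetic shares live in the integers mod 2**32

def split(x: int) -> tuple[int, int]:
    """A company splits its value between BU and the Council."""
    x1 = secrets.randbelow(MOD)   # BU's share: uniformly random
    x2 = (x - x1) % MOD           # Council's share: the masked value
    return x1, x2

# Made-up inputs for the three companies.
a, b, c = 4, 7, 2

a1, a2 = split(a)
b1, b2 = split(b)
c1, c2 = split(c)

# Each server first adds its own shares "horizontally" ...
bu_sum      = (a1 + b1 + c1) % MOD
council_sum = (a2 + b2 + c2) % MOD

# ... then the City adds the two results "vertically" to reconstruct.
city_total = (bu_sum + council_sum) % MOD
assert city_total == a + b + c   # correct as long as the true sum < MOD
```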

19.5 Defining secure multi-party computation

Great! We now know how to perform data analysis over data that we cannot see … well, as long as the only data analysis we want to perform is addition.

The good news is that you can perform cryptographically secure multi-party computation of any function!

Concretely, consider the following setting:

  • There are \(m\) people who possess sensitive data \(x_1, x_2, \ldots, x_m\).

  • They want to outsource this data to a collection of computing parties \(P_1, P_2, \ldots, P_n\). (The computing parties could be the same people as the data holders, or they could be different companies like cloud providers.)

  • Everyone agrees upon a data analysis function \(f\) to perform, and everyone assumes that fewer than \(t\) of the \(n\) parties are adversarial.

Secure multi-party computation (often abbreviated as MPC) allows the parties to:

  • Work together to calculate \(y = f(x_1, \ldots, x_m)\) in such a way that
  • No coalition of up to \(t-1\) computing parties can learn anything about the sensitive input data – or at least, nothing beyond what can be inferred from the output \(y\), in a Bayesian sense.

Now that we understand how MPC works informally, this is usually where I would provide a scientific definition. I’m not going to do that this time though, because the formal definition of MPC is somewhat complicated. You’ll read about it in this week’s required reading.

Instead, I reproduce below some text from a recent bill in the U.S. Congress that provides a legal definition of MPC. (Emphasis added by me.)

The term “secure multi-party computation” means a computerized system that enables different participating entities in possession of private sets of data to link and aggregate their data sets for the exclusive purpose of performing a finite number of pre-approved computations without transferring or otherwise revealing any private data to each other or anyone else. – 2022 U.S. Senate bill S.3952

We can consider MPC against two kinds of adversaries (the same ones we considered for encryption):

  • The passive eavesdropper Eve. Remember that Eve promises to follow the rules of a cryptographic protocol, but she tries to learn sensitive data along the way.
  • The actively-malicious adversary Mallory. She doesn’t follow any rules, and will do whatever she can to learn or tamper with the data.

The protocol I showed you today only protects against a passive eavesdropper Eve.

If, for example, we at BU tampered with the data, then we could cause the output of the calculation to be incorrect. That is: we could break data integrity. Even so, there’s no way for us to break data confidentiality.

[Diagram: if BU adds 10 to its share sum \(a_1 + b_1 + c_1\), the City reconstructs \(a + b + c + 10\) instead of the true total.]

We will look next time at how to protect integrity against an active adversary Mallory.