By the end of today’s lecture, you will learn:
In the first half of this course, we designed blockchains and broadcast protocols that provide strong integrity and availability. They are decentralized protocols that allow for outsourcing or federating data among a large number \(n\) of parties (e.g., servers on the Internet).
Fundamentally, blockchains and broadcasts achieve decentralization based on the principle of replication. The idea is that if you want information to live for a long time (either on the Internet or between a set of Byzantine generals), then you should store it in many computers or many people’s minds.
Replication is a powerful technique. But on its own, it seems incompatible with confidentiality. After all, if you give your data to \(n\) servers, how can you protect your data against them?
In this unit of the course, we began to study the concept of secure multi-party computation, or MPC.
MPC is a powerful tool that allows you to distribute your data among \(n\) parties without giving your data to any single one of them! If a threshold number \(t\) of the parties come together, then they can jointly reconstruct the data, but any smaller coalition cannot read your data.
MPC is based on the idea of randomness, rather than replication. The idea is to give each party a share of your secret: each share on its own is just a random number, but the shares are chosen so that jointly they can be used to reconstruct your data.
We have seen two examples of secure multi-party computation so far.
In Lecture 4, we saw how a group of \(n\) people can produce a threshold Schnorr signature indicating that \(t\) of the \(n\) people approve the message.
In the last lecture, we showed that it is possible for \(n\) parties to collectively perform secure addition.
| | \(a\) | \(b\) | \(c\) | \(a + b + c\) |
|---|---|---|---|---|
| Party 1 | \(a_1\) | \(b_1\) | \(c_1\) | \(a_1 + b_1 + c_1\) |
| Party 2 | \(a_2\) | \(b_2\) | \(c_2\) | \(a_2 + b_2 + c_2\) |
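The table above can be sketched in a few lines of Python. This is a minimal illustration (not from the lecture itself): the function names and the choice of modulus \(2^{64}\) are mine.

```python
import secrets

MOD = 2 ** 64  # all arithmetic is mod 2^ell, with ell = 64 here

def split(x, n):
    """Split secret x into n additive shares that sum to x mod MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    """Add up all of the shares to recover the secret."""
    return sum(shares) % MOD

def add_locally(*share_vectors):
    """Each party sums its own shares of a, b, c: no communication needed."""
    return [sum(col) % MOD for col in zip(*share_vectors)]
```

For example, `reconstruct(add_locally(split(a, 2), split(b, 2), split(c, 2)))` returns \((a + b + c) \bmod 2^{64}\), matching the last column of the table.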
Today, we will extend MPC in three ways.
We will securely compute any mathematical function \(f\) over secret shares of the data, and then only reconstruct the final result while keeping confidential all inputs and intermediate variables.
We will add integrity: that is, we will provide security against an actively malicious adversary Mallory, rather than a passive Eve.
We will add availability: that is, we will make sure that all computing parties have shares of the data, and they know that everyone else has shares too.
Secure multi-party computation protocols have three steps:

1. The data holders `split` their secret inputs into shares and distribute them among the computing parties.
2. The computing parties jointly compute the desired function over the shares.
3. The parties `reconstruct` only the final output.
Steps 1 and 3 come directly from the `split` and `reconstruct` operations of the secret sharing construction.
The main challenge is in step 2. So far, we only know how to add secrets by adding their corresponding secret shares.
The ideas that we have already shown for secure addition easily extend to cover all linear functions!
For instance, suppose that rather than calculating \(a + b + c\), we actually wanted to compute \(2a + 3b - c + 10\).
Since we are using additive secret sharing, and scalar multiplication distributes over addition, each party can just compute its share of the result locally. (Note that the public constant \(+10\) is added by only one party, so that it is counted exactly once in the reconstructed sum.)
| | \(a\) | \(b\) | \(c\) | \(2a + 3b - c + 10\) |
|---|---|---|---|---|
| Party 1 | \(a_1\) | \(b_1\) | \(c_1\) | \(2a_1 + 3b_1 - c_1 + 10\) |
| Party 2 | \(a_2\) | \(b_2\) | \(c_2\) | \(2a_2 + 3b_2 - c_2 + 0\) |
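Here is a minimal Python sketch of this local linear evaluation (names and the modulus \(2^{64}\) are my own, chosen for illustration). Note how the constant is contributed by exactly one party.

```python
import secrets

MOD = 2 ** 64

def split(x, n):
    """Split secret x into n additive shares that sum to x mod MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

def linear_share(i, a_i, b_i, c_i):
    """Party i's local share of 2a + 3b - c + 10.

    The public constant 10 is added by party 0 only, so that it is
    counted exactly once when the shares are summed.
    """
    const = 10 if i == 0 else 0
    return (2 * a_i + 3 * b_i - c_i + const) % MOD
```

Summing the two parties' outputs mod \(2^{64}\) yields \(2a + 3b - c + 10\), with no communication during the computation.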
Next, let’s try to multiply two secrets. Suppose that two data holders input \(a\) and \(b\), respectively. Can we somehow calculate secret shares of the product \(c = a \cdot b\)?
| | \(a\) | \(b\) | \(c = a \cdot b\) |
|---|---|---|---|
| Party 1 | \(a_1\) | \(b_1\) | \(\color{red}{??}\) |
| Party 2 | \(a_2\) | \(b_2\) | \(\color{red}{??}\) |
This is not as easy to compute. This time we cannot rely solely on the commutativity of addition; we need to consider the distributive law as well.
Recall the relationships between the secret variables and their additive shares (all equations will be mod \(2^\ell\), but from now on I’m going to omit writing the modulus for brevity): \[\begin{align*} a &= a_1 + a_2 \\ b &= b_1 + b_2. \end{align*}\]
Hence, we want to compute: \[\begin{align*} c &= a \cdot b \\ &= (a_1 + a_2) (b_1 + b_2) \\ &= a_1 b_1 + a_1 b_2 + a_2 b_1 + a_2 b_2. \end{align*}\]
BU knows how to compute the term \(a_1 b_1\) on its own, and the Council knows how to compute \(a_2 b_2\) on its own. But it’s unclear how to compute the cross-terms \(a_1 b_2\) and \(a_2 b_1\) – that is, it’s unclear how to calculate these terms without the computing parties sending their shares to each other, but that would break data confidentiality.
I’ll show you today one way to resolve this problem. It relies on two useful insights.
Insight #1: let’s add a third computing server.
| | \(a\) | \(b\) | \(c = a \cdot b\) |
|---|---|---|---|
| Party 1 | \(a_1\) | \(b_1\) | \(\color{red}{??}\) |
| Party 2 | \(a_2\) | \(b_2\) | \(\color{red}{??}\) |
| Party 3 | \(a_3\) | \(b_3\) | \(\color{red}{??}\) |
At first glance, this might seem to make the calculation even harder to perform. Now the distributive law says that we must calculate:
\[\begin{align*} c &= (a_1 + a_2 + a_3) (b_1 + b_2 + b_3) \\ &= a_1 b_1 + a_1 b_2 + a_1 b_3 \\ &\quad + a_2 b_1 + a_2 b_2 + a_2 b_3 \\ &\quad + a_3 b_1 + a_3 b_2 + a_3 b_3. \\ \end{align*}\]
Insight #2: Let’s give each computing party two of the three shares. This is called replicated secret sharing.
| | \(a\) | \(b\) | \(c = a \cdot b\) |
|---|---|---|---|
| Party 1 | \(a_1, a_2\) | \(b_1, b_2\) | \(c_1 = a_1b_1 + a_1b_2 + a_2b_1\) |
| Party 2 | \(a_2, a_3\) | \(b_2, b_3\) | \(c_2 = a_2b_2 + a_2b_3 + a_3b_2\) |
| Party 3 | \(a_3, a_1\) | \(b_3, b_1\) | \(c_3 = a_3b_3 + a_3b_1 + a_1b_3\) |
The punchline: there now exists a computing server that can calculate each of the 9 terms in this product! So the three computing servers can calculate \(a \cdot b\) by working together, in such a way that nobody learns either \(a\) or \(b\).
\[\begin{align*} c &= (a_1 + a_2 + a_3) (b_1 + b_2 + b_3) \\ &= \color{red}{a_1 b_1} + \color{red}{a_1 b_2} + \color{gray}{a_1 b_3} \\ &\quad + \; \color{red}{a_2 b_1} + \color{blue}{a_2 b_2} + \color{blue}{a_2 b_3} \\ &\quad + \; \color{gray}{a_3 b_1} + \color{blue}{a_3 b_2} + \color{gray}{a_3 b_3}. \\ \end{align*}\]
There’s just one downside here: the computing parties have two shares of \(a\) and \(b\), but only one share of \(c\). So we cannot continue and multiply \(c\) times some other secret.
This is easy to fix with some communication: each computing party gives its share to one other party.
Now our invariant is maintained: each computing party starts with 2 out of 3 shares of each input, and completes multiplication with 2 out of 3 shares of each output.
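The whole multiply-then-reshare flow can be sketched as follows. This is my own minimal illustration (function names and the modulus \(2^{64}\) are assumptions, not from the lecture), with the communication step modeled as a simple function call.

```python
import secrets

MOD = 2 ** 64

def rss_split(x):
    """Replicated secret sharing: 3 shares, where party i holds the
    pair (x_i, x_{i+1}) with indices taken mod 3."""
    s = [secrets.randbelow(MOD) for _ in range(2)]
    s.append((x - sum(s)) % MOD)
    return [(s[i], s[(i + 1) % 3]) for i in range(3)]

def rss_reconstruct(pairs):
    """The first components of the three pairs are the three distinct shares."""
    return sum(p[0] for p in pairs) % MOD

def rss_multiply(a_pairs, b_pairs):
    """Party i locally computes c_i = a_i*b_i + a_i*b_{i+1} + a_{i+1}*b_i.
    Across the three parties, the c_i cover all 9 cross-terms of
    (a_1 + a_2 + a_3)(b_1 + b_2 + b_3)."""
    c = []
    for i in range(3):
        (a1, a2), (b1, b2) = a_pairs[i], b_pairs[i]
        c.append((a1 * b1 + a1 * b2 + a2 * b1) % MOD)
    return c

def reshare(c):
    """Communication step: each party sends its share of c to one neighbor,
    restoring the invariant that everyone holds 2 of the 3 shares."""
    return [(c[i], c[(i + 1) % 3]) for i in range(3)]
```

After `reshare`, the output pairs have the same shape as the input pairs, so another multiplication can follow immediately.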
To summarize, we have constructed a secure multiplication scheme with:
So far, we have seen one way to perform secure multi-party computation of \(+\) and \(\times\).
Importantly, \(+\) and \(\times\) form a Turing-complete set of gates! That is, we can write any data analysis as a circuit with \(+\) and \(\times\) gates. (This may not be the fastest way to compute \(f\), but it will always work in principle.)
For instance, given the circuit below along with secret shares of \(s\), \(t\), and \(x\), it is possible for the 3 computing parties to work together to calculate shares of the result, and then `reconstruct` only the output \(y\).
Hence, the protocol we have constructed today can do a secure computation of any function, as long as:
It turns out that you can also perform MPC in the dishonest majority setting, and against an actively-malicious adversary Mallory.
Recall from last time that our current protocol is only secure against a passive eavesdropper Eve. That’s because a malicious computing party Mallory can tamper with the result by adding some number to her own share.
| | \(a\) | \(b\) | \(c = a \cdot b \color{red}{+ 10}\) |
|---|---|---|---|
| Party 1 | \(a_1, a_2\) | \(b_1, b_2\) | \(c_1 = a_1b_1 + a_1b_2 + a_2b_1 \color{red}{+ 10}\) |
| Party 2 | \(a_2, a_3\) | \(b_2, b_3\) | \(c_2 = a_2b_2 + a_2b_3 + a_3b_2\) |
| Party 3 | \(a_3, a_1\) | \(b_3, b_1\) | \(c_3 = a_3b_3 + a_3b_1 + a_1b_3\) |
Here is one way to stop Mallory: we can add a fourth computing party for redundancy.
| | \(a\) | \(b\) | \(c = a \cdot b\) |
|---|---|---|---|
| Party 1 | \(a_1, a_2, a_3\) | \(b_1, b_2, b_3\) | \(c_1\) |
| Party 2 | \(a_2, a_3, a_4\) | \(b_2, b_3, b_4\) | \(c_2\) |
| Party 3 | \(a_3, a_4, a_1\) | \(b_3, b_4, b_1\) | \(c_3\) |
| Party 4 | \(a_4, a_1, a_2\) | \(b_4, b_1, b_2\) | \(c_4\) |
The math for multiplication has gotten a bit more complicated, so I won’t write out explicitly how the shares of \(c\) are calculated.
\[\begin{align*} c &= (a_1 + a_2 + a_3 + a_4) (b_1 + b_2 + b_3 + b_4) \\ &= a_1 b_1 + a_1 b_2 + a_1 b_3 + a_1 b_4 \\ &\quad + \; a_2 b_1 + a_2 b_2 + a_2 b_3 + a_2 b_4 \\ &\quad + \; a_3 b_1 + a_3 b_2 + a_3 b_3 + a_3 b_4 \\ &\quad + \; a_4 b_1 + a_4 b_2 + a_4 b_3 + a_4 b_4 \end{align*}\]
Fortunately, the punchline is simple:
Let’s take a step back and admire what we have done so far.
For the rest of today’s lecture, I want to discuss a different – and more foundational – way to build MPC that is secure against Mallory.
Recall how additive secret sharing works. Say you have secret data \(x\) and convert it into a collection of \(n\) secret shares:
\[ x = x_1 + x_2 + \cdots + x_n \bmod{p}.\]
For now, suppose you don’t even want to compute over this data; you just want to outsource the data to several cloud providers and reconstruct it at some later date.
So, you can disperse the message among the cloud providers by giving share \(x_i\) to provider \(i\).
Then, many months or years later, you can retrieve the secret data by fetching all shares and adding them together.
This provides excellent confidentiality: the cloud providers can perform their job of holding onto your data for a long time, but no coalition of \(n-1\) or fewer of them can ever learn the data.
However, if one party provides an incorrect share \(x_i^* = x_i + \Delta\), then your entire secret is going to be incorrect by \(\Delta\):
\[x_1 + x_2 + \cdots + (x_i + \Delta) + \cdots + x_n = x + \Delta.\]
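A quick sketch of the attack in Python (the helper names and the specific prime modulus are my own choices for illustration):

```python
import secrets

MOD = 2 ** 31 - 1  # a prime modulus p, matching the mod-p notation above

def split(x, n):
    """Split secret x into n additive shares that sum to x mod MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    return sum(shares) % MOD

def tamper(shares, i, delta):
    """Provider i returns x_i + delta instead of x_i. Reconstruction still
    runs, but silently yields x + delta, with no hint of who cheated."""
    out = list(shares)
    out[i] = (out[i] + delta) % MOD
    return out
```

Running `reconstruct(tamper(split(x, n), i, delta))` gives \(x + \Delta\) for any choice of \(i\), which is exactly the problem: the output reveals nothing about which provider misbehaved.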
This is bad news for you: it means your data is corrupted, and even worse there’s no way for you to pinpoint which cloud provider is to blame.
Depending on how the economics of this hypothetical “private multi-cloud outsourcing” service works, it could also be bad news for the honest cloud providers: maybe you refuse to pay them as a result of the data corruption, even though they properly did their job and held onto your data for a long time.
Basically what we want is something like filecoin.io, which is a “decentralized storage network” based on “an open-source cloud storage marketplace, protocol, and incentive layer.” (We’re not going to explain the entirety of how filecoin works in today’s lecture, but we will describe some of the main ideas.)
In other words, this problem is very similar to the Byzantine generals question. You are a leader and you’re providing information to many other parties.
The main differences here are that:
But the other trust questions from the Byzantine generals problem remain. You don’t necessarily trust the other parties entirely, and they don’t entirely trust you either.
Question. How can we provide each party with a verifiability guarantee? That is:
One way to solve this problem is to use Merkle trees. Recall that Merkle trees allow for uploading many files to the cloud, so that the cloud can prove that each file is part of the overall collection.
Concretely, here is a protocol that you can follow to disperse the shares and later retrieve the message.
In dispersal:

1. The leader splits the message into additive shares \(x_1, \ldots, x_n\) and builds a Merkle tree over the shares, with root \(h\).
2. The leader sends each cloud provider \(i\) its share \(x_i\), the root \(h\), and the Merkle proof \(\pi_i\) showing that \(x_i\) belongs to the tree.
In retrieval:

1. The leader fetches from each provider its share \(x_i\) and proof \(\pi_i\), and verifies each proof against the root \(h\).
2. The leader adds the shares together to reconstruct the message.
Question. There is actually a bug in this protocol. Can you find it?
One problem is that the leader might use different Merkle tree roots \(h\) with each cloud provider. So even if the provider is honest, if the leader is malicious then they could send one cloud provider the wrong root \(h^*\) and later accuse the provider of cheating during the retrieval phase.
We can fix this problem using techniques from earlier parts of this course! Here are two recourses:

- without a PKI, the leader can use a Byzantine broadcast protocol to deliver the root \(h\), so that every provider is guaranteed to hold the same root, or
- with a PKI, the leader can sign the root \(h\), so that each provider can later prove which root the leader actually sent it.
So here is one way to fix the protocol.
In dispersal:

1. The leader broadcasts the Merkle root \(h\) to all cloud providers, so that everyone agrees on a single root.
2. The leader sends each provider its share \(x_i\) and proof \(\pi_i\), as before.
Retrieval can work the same as before. And in fact, the leader doesn’t even need to store the Merkle root \(h\) because it can just request it from all providers later: if there is an honest majority or a PKI, then the leader can distinguish the correct \(h\) from an incorrect one sent by a faulty cloud provider.
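Here is a self-contained sketch of the Merkle-based dispersal machinery in Python. The helper names, the 16-byte share encoding, and the power-of-two restriction on \(n\) are my simplifications; a real protocol would also add domain separation between leaves and internal nodes.

```python
import hashlib
import secrets

MOD = 2 ** 64

def H(data):
    return hashlib.sha256(data).digest()

def split(x, n):
    """Additive secret sharing of x into n shares mod MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

def build_tree(shares):
    """All levels of a Merkle tree over the shares (n a power of 2)."""
    level = [H(s.to_bytes(16, "big")) for s in shares]
    levels = [level]
    while len(level) > 1:
        level = [H(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[-1][0] is the root h

def proof_for(levels, i):
    """The proof pi_i: sibling hashes on the path from leaf i to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[i ^ 1])  # sibling of the current node
        i //= 2
    return path

def verify(root, share, i, path):
    """Check share i against the root h using its Merkle proof."""
    node = H(share.to_bytes(16, "big"))
    for sibling in path:
        node = H(node + sibling) if i % 2 == 0 else H(sibling + node)
        i //= 2
    return node == root
```

During dispersal the leader runs `build_tree` and hands provider \(i\) the triple \((x_i, h, \pi_i)\); during retrieval anyone holding \(h\) can run `verify` to catch a provider that returns a bad share.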
So far, this technique works to disperse secret data and then retrieve it later.
Let’s bring back the idea of computing over secrets.
Suppose the leader dispersed two secrets \(x\) and \(y\) using this protocol… or maybe two different leaders dispersed two different secrets, such as their company’s payroll data.
Now, we only want to retrieve the overall sum \(x + y\) of the two secrets, rather than retrieving \(x\) and \(y\) individually.
And as we saw earlier today, if we can take the sum of two secrets, then it’s pretty easy to extend the idea in order to perform any linear algebra operation over several secrets.
Can we do this with the Merkle-tree construction? Well… not quite: hash functions do not mix with arithmetic, so there is no way to combine the proofs for \(x\) and \(y\) into a proof for the sum \(x + y\). So let’s replace the Merkle tree with something else.
Using the discrete log problem, we can create a new way to commit and prove properties of the secrets that is amenable to computation.
A Merkle tree is a special case of a kind of cryptographic primitive that we discussed at the beginning of the semester: a commitment scheme. Recall that commitments are a digital analog to putting a written message inside of an envelope and placing it on the table.
Once a sender `commit`s to a message, everyone knows that the message is inside of the envelope, but nobody learns the message until it is `open`ed. (This is related to, but not the same thing as, encryption of a message.)

Commitments have two security guarantees.

- Hiding: the commitment \(c\) reveals nothing about the committed message.
- Binding: the sender cannot reveal a different message in the `open` step than the message \(m\) that it originally used during the `commit` routine.

When Mallory is the sender, binding protects the receiver Bob:

| Mallory | | Bob |
|---|---|---|
| | \(\quad \overset{c}{\longrightarrow} \quad\) | |

When Mallory is the receiver, hiding protects the sender Alice:

| Alice | | Mallory |
|---|---|---|
| | \(\overset{c}{\longrightarrow}\) | |
Let’s construct a commitment scheme. As is the norm in this course by now, we will construct commitments based on the hardness of some math problem, so that they are easy to compute but hard to invert.
Specifically, we will use the hardness of the discrete log problem. Remember that this problem states that for a uniformly random choice of \(x\):
\[ (g, x) \mapsto h = g^x \text{ is easy to compute, but } (g, h) \mapsto x = \log_g(h) \text{ is hard to compute}\]
Question. How can we construct a commitment scheme from discrete logs?
For today, we will only consider a simplified version of the problem where the message \(m\) to be committed is itself chosen uniformly at random.
Here is a commitment scheme designed by Paul Feldman in 1987. As a starting point, assume that all parties agree upon some base element \(g\) (e.g., it is specified in some open standard).
- `commit`(m): create the commitment \(c = g^m\).
- `open`(c, m): reveal \(m\), so that the receiver can calculate \(g^m\) itself and check that it is equal to \(c\).

This commitment scheme achieves:

- binding, since each commitment \(c\) corresponds to a unique exponent \(m\) in a prime-order group, and
- hiding, computationally and only for uniformly random messages: recovering \(m\) from \(c = g^m\) is exactly the discrete log problem.
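In code, the scheme is just modular exponentiation. The group parameters below are toy values of my own choosing for illustration only; a real deployment would use a standardized prime-order group.

```python
# Toy parameters for illustration only: a real deployment would use a
# standardized prime-order group, not a Mersenne prime with base 3.
P = 2 ** 127 - 1
G = 3

def commit(m):
    """commit(m): the Feldman-style commitment c = g^m mod p."""
    return pow(G, m, P)

def open_check(c, m):
    """open(c, m): the receiver recomputes g^m and compares it to c."""
    return pow(G, m, P) == c
```

Note that `commit` is deterministic, which is why hiding only holds here when the message itself is uniformly random.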
We can use commitments to resolve the conundrum from before! As a reminder: we wanted to find a way to disperse several secrets, say \(x\) and \(y\), and then provide the option for the cloud providers to retrieve only the sum \(x + y\).
Specifically, we will use Feldman commitments to construct a new verifiable secret sharing scheme. The key insights here are that:

- each cloud provider can check its own share against a public commitment, so misbehavior by the leader or a provider can be detected, and
- the commitments are “linear”: given commitments to two secrets, anyone can compute a commitment to their sum.
Technically, this isn’t linearity in the sense of a linear algebra course like DS 121, because we need to multiply the commitments in order to add the underlying messages.
So let’s give this property a fancy new name: additive homomorphism. We say that Feldman and Pedersen commitments are additively homomorphic, because given several commitments it is possible (somehow) to compute the commitment to the sum of the underlying messages.
Here is a new verifiable secret sharing scheme that we can use to disperse a secret over \(n\) cloud providers. As a starting point, we assume that there are \(n\) public constants \(g_1, g_2, \ldots, g_n\) that everybody knows before the protocol starts; importantly, they have nothing to do with our secrets.
In dispersal:

1. The leader splits the secret into additive shares \(x = x_1 + x_2 + \cdots + x_n\).
2. The leader computes the commitment vector \(\vec{c} = [g_1^{x_1}, g_2^{x_2}, \ldots, g_n^{x_n}]\) and broadcasts it to all cloud providers.
3. The leader sends share \(x_i\) to provider \(i\), who checks that \(g_i^{x_i}\) equals the \(i\)-th entry of \(\vec{c}\).
Retrieving \(x\) works similarly as before:

1. The leader fetches each share \(x_i\) and checks it against the public commitment vector, which pinpoints any provider that returns an incorrect share.
2. The leader adds the verified shares to recover \(x\).
Now let’s do secure addition of secrets. Suppose that two secrets \(x\) and \(y\) have been dispersed to the same set of cloud providers (even by different leaders!), with corresponding commitment vectors \(\vec{c}_x = [g_i^{x_i} : i \in \{1, \ldots, n\}]\) and \(\vec{c}_y = [g_i^{y_i}: i \in \{1, \ldots, n\}]\).
The cloud providers can form a verifiable secret sharing of their sum \(z = x + y\) by:
This works because the new commitment corresponding to \(z_i\) equals \(g_i^{x_i} \times g_i^{y_i} = g_i^{x_i + y_i}\), which is a valid Feldman commitment that will properly open and verify during the retrieval stage.
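From the point of view of a single provider \(i\) (writing its public base \(g_i\) simply as `G`), the local computation looks like this. As before, the group parameters are toy values for illustration only.

```python
# Toy group parameters, for illustration only.
P = 2 ** 127 - 1
G = 3  # playing the role of provider i's public base g_i

def commit(m):
    return pow(G, m, P)

def add_committed(x_i, y_i, c_x_i, c_y_i):
    """Provider i's new share of z = x + y, plus its matching commitment.

    Additive homomorphism: g^{x_i} * g^{y_i} = g^{x_i + y_i} mod P, so the
    new commitment can be computed publicly from the old ones."""
    z_i = x_i + y_i              # the provider's share of z
    c_z_i = (c_x_i * c_y_i) % P  # the commitment to z_i
    return z_i, c_z_i
```

Anyone holding the commitment vectors \(\vec{c}_x\) and \(\vec{c}_y\) can compute \(\vec{c}_z\) entrywise, without talking to the providers at all.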
Today, we have built protocols based on verifiable secret sharing that allow a collection of parties to perform data science with confidentiality, integrity, and availability.
In the remainder of today’s lecture, let me share a few assorted thoughts that (a) convey some extensions of this basic construction and (b) show how powerful the idea of verifiable secret sharing can be.
One small downside of our construction is that we have to disperse a vector of commitments \(\vec{c}\).
By comparison, the Merkle tree looks nicer: the root \(h\) is independent of \(n\), and each individual proof \(\pi_i\) has size \(\log(n)\).
The good news is that it is possible to combine all of the commitments into a single element that has constant size, independent of both the message length and the number of providers!
The idea is simple and again is based on the hardness of discrete logarithms.
Rather than computing \(\vec{c} = [g_1^{x_1}, g_2^{x_2}, \ldots, g_n^{x_n}]\), we will instead compute a single commitment:
\[\vec{c} = g_1^{x_1} \times g_2^{x_2} \times \cdots \times g_n^{x_n}.\]
This is called a vector commitment. The math of how to open such a commitment is not too difficult but beyond the scope of this course; it uses something called “bilinear maps.”
Relatedly: it is also possible to build a polynomial commitment, which means that you commit to a polynomial with something like \(c = g^{p(x)}\), and then can open the polynomial at a single point of your choice. This is useful when combined with the next comment…
We have shown today how to compose:
Notice that these constructions have two different thresholds at play here, of how many cloud providers need to collude in order to break different guarantees:

- the privacy threshold \(t\): how many providers must collude in order to learn the secret, and
- the fault threshold \(f\): how many faulty providers the protocol can tolerate while still guaranteeing agreement and correct retrieval.
Without a PKI, we know that \(f < n/3\). But the privacy threshold \(t\) can actually be larger than this.
That is, we can protect the confidentiality of the data even in circumstances where the attacker controls enough parties that agreement breaks and we can no longer guarantee correct retrieval of the data.
This is a subtle point, and the papers on verifiable secret sharing missed it for decades!
A dual-threshold verifiable secret sharing scheme supports \(t > f + 1\). The highest privacy threshold possible is \(t = n\). We constructed such a scheme today: the privacy threshold was \(t = n\) but the agreement threshold might be lower if we don’t have a PKI.
In today’s lecture, I’ve discussed protocols that have clearly-defined rounds – that is, I’ve assumed that the network is synchronous. It is important that the cloud providers receive their shares after a short delay, so that they can complain instantly and publicly if they haven’t received shares that are consistent with the commitment.
One could ask the same question in the asynchronous setting. That is, we could ask for asynchronous verifiable secret sharing. Constructing this primitive is more subtle than what we did today, for several reasons:
To my knowledge, the most efficient asynchronous verifiable secret sharing scheme is one that former BU PhD student Nicolas Alhaddad, former BU undergrad Ziling Yang, and I built last year. It’s also dual-threshold. If you’re interested, here is the paper containing all of the details: https://eprint.iacr.org/2024/326.pdf.
DS 653 — Lecture 20