By the end of today’s lecture, you will learn:
A hash function \(H: \{0, 1\}^* \to \{0, 1\}^n\) is an efficiently-computable function that receives an input bitstring of arbitrary length, and provides an output bitstring that contains a fixed number of bits \(n\).
It satisfies the following security guarantees (and more):
One wayness, aka preimage resistance: if we sample \(y \gets \{0, 1\}^n\) uniformly at random from the set of all bitstrings of length \(n\), then it is practically infeasible for Mallory to find any preimage \(x\) such that \(H(x) = y\).
Collision resistance: it is practically infeasible for Mallory to find two different inputs \(x_1\) and \(x_2\) such that \(H(x) = H(x')\).
Due to the birthday problem, there is an upper bound on how strong a hash function can be. For a hash function \(H: \{0, 1\}^* \to \{0, 1\}^n\) whose output is \(n\) bits long, the best we can hope for is that:
One application of hash functions is to improve integrity of file outsourcing. Suppose Alice wants to store a large, important file \(f\) on a cloud service like Dropbox or Google Drive. Then she can:
Due to collision resistance, even if the malicious adversary Mallory is running the cloud provider, she cannot find a new file \(f'\) that is different from Alice’s original file \(f\) and yet still hashes to the same string \(h\).
What if Alice wants to upload many files \(f_1, f_2, \ldots, f_k\) to the cloud?
Fortunately, there is another approach that still allows Alice to verify one file at a time. Before uploading the files, Alice must construct a Merkle tree as follows.
(This data structure is named after cryptographer Ralph Merkle.)
The nice part about Merkle trees is that:
To construct a proof that (for example) a claimed file \(f_3^*\) is in the set, the cloud provider Mallory sends the following \(\log(k)\) hash digests to Alice.
I put an asterisk on the received file \(f_3^*\) and the received hash digests \(h_4^*, h_{1,2}^*, h_{5,8}^*\) because Alice does not yet know that the cloud provider Mallory is telling the truth. That is, these might potentially be different than the variables that Alice used when she initially constructed the Merkle tree.
To test whether the received file \(f_3^*\) is correct, Alice can “fill in the blanks” of the path of the Merkle tree from the leaf up to the root.
\[\begin{align} h_{3}^* &= H(f_3^*) \\ h_{3,4}^* &= H(h_3^*,h_4^*) \\ h_{1,4}^* &= H(h_{1,2}^*,h_{3, 4}^*) \\ h_{1,8}^* &= H(h_{1,4}^*,h_{5,8}^*) \\ \end{align}\]
Remember that the real file \(f_3\) contributed toward \(\log(n)\) hash digests. For instance, if we “trace” the path of the file \(f_3\), here are all of the variables that it influenced.
\[\begin{align*} h_{3} &= H(f_3) \\ h_{3,4} &= H(h_3,h_4) \\ h_{1,4} &= H(h_{1,2},h_{3, 4}) \\ h_{1,8} &= H(h_{1,4},h_{5,8}) \\ \end{align*}\]
Because Alice has stored \(h_{1,8}\), she can test whether \(h_{1,8}^* = h_{1,8}\). If this equality holds:
From the final equation: it must be true that \(h_{1,4}^* = h_{1,4}\) and \(h_{5,8}^* = h_{5,8}\). Otherwise, \(H(h_{1,4},h_{5,8}) = H(h_{1,4}^*,h_{5,8}^*)\) so Mallory has broken collision resistance.
From the second-to-last equation: it must also be true that \(h_{1,2}^* = h_{1,2}\) and \(h_{3,4}^* = h_{3,4}\). Otherwise, \(H(h_{1,2},h_{3,4}) = H(h_{1,2}^*,h_{3,4}^*)\) so Mallory has broken collision resistance.
…and so on, until eventually the first equation tells us that \(f_3 = f_3^*\), as desired!
In this way: hash functions provide a succinct yet tamper-evident way to represent large volumes of data.
Let’s use hash functions to design another cryptographic primitive: a message commitment scheme.
I’ll motivate this new cryptographic primitive with an example from the paper Predicting the winner of the 2008 US Presidential Elections using a Sony PlayStation 3.
Let’s rewind time and imagine that the date is currently November 30, 2007. At that time, here were some of the leading candidates for the 2008 U.S. presidential election.
John Edwards | Al Gore | John McCain | Barack Obama | Mitt Romney |
---|---|---|---|---|
[Images from Wikipedia]
Here’s the scenario:
There is a clear problem here: if Alice won’t reveal the image \(x\) until after November 2008, then how does Bob know that she knew the answer beforehand?
Question. Can we design a way to make this bet that:
In the physical world, here is one way to solve this question:
In the digital world, here is a potential way to solve this problem using cryptographic hash functions.
Intuitively, this proposed construction looks appealing because:
Unfortunately, the above argument is not correct.
Question. Why not?
Instead, a message commitment scheme is the appropriate cryptographic tool to solve this problem.
Commitments are related to, but not the same thing as, encryption of a message. We will discuss encryption in a future lecture.
Commitments are a digital analog to putting a written message inside of an envelope and placing it on the table. Everyone knows that the message is inside of the envelope, but nobody can read the message until the envelope is opened.
Definition. A message commitment scheme contains two algorithms.
Commit
(m; r): given a message \(m\) and some randomness \(r\), construct a commitment \(c\).Open
(c, m, r): given the commitment \(c\) and the inputs used to create it, this algorithm should convince the receiver that \(m\) is the correct message contained inside of the commitment.These algorithms satisfy two security properties: binding and hiding.
Let’s precisely specify two guarantees provided by message commitment schemes — and also, by envelopes. To do so, we introduce our malicious adversary Mallory in place of either Alice or Bob.
Commit
step, it is infeasible for a malicious committer to be able to Open
the commitment to two different messages \(m_1\) and \(m_2\) and have the receiver be convinced by both of them.Mallory | Bob | |
---|---|---|
|
\(\overset{c}{\longrightarrow}\) |
|
Alice | Mallory | |
---|---|---|
|
\(\overset{c}{\longrightarrow}\) |
|
Hash functions provide the binding property thanks to collision resistance. But on their own, they do not provide hiding.
Fortunately, there is a simple way to use cryptographic hash functions in order to build a commitment scheme that is both hiding and binding.
Construction of message commitment from a cryptographic hash function \(H\).
Commit
(m; r): produce the commitment \(c = H(m, r)\). In words, hash the message along with a random string that has nothing to do with the message.Open
(c, m, r): reveal everything, and let the receiver check for themselves whether \(c \overset{?}{=} H(m, r)\).The hiding property stems from Bob’s unawareness of the randomness \(r\). Since he does not know \(r\), he cannot do a brute-force attack of all images to see which one hashes to the value \(c\).
Later in the semester, we will see other constructions and applications of commitment schemes.
Let’s define one more cryptographic primitive today: digital signatures. They are the digital version of signing a check. Within the CIA triad, digital signatures provide integrity.
A major concern in the subject [of cryptography] is that of authentication. How do you know some data is correct (data authentication), and how do you know an entity is who they claim to be (entity authentication).
– Nigel Smart, Cryptography Made Simple
Recall that a digital signature must satisfy two requirements:
To understand the unforgeability guarantee better, it helps to introduce our adversary Mallory.
Alice | Bob | |
---|---|---|
|
\(\longrightarrow\) |
|
Alice | Mallory | Bob | ||
---|---|---|---|---|
|
\(\longrightarrow\) |
|
\(\longrightarrow\) |
|
We want to ensure that any attempt by Mallory to tamper with the messages from Alice to Bob will be detected:
No matter how many messages Alice sends. Even if she sends 10, 100, or 1000 messages \(M_1, M_2, \ldots, M_{1000}\) (or more!), Mallory cannot forge any other message that Alice hasn’t sent.
No matter which messages Alice sends. That is, for any choice of the 1000 messages \(M_1, M_2, \ldots, M_{1000}\) that Alice sent, Mallory cannot create a signature for message 1001.
Here is our formal definition.
Definition. A digital signature contains three algorithms:
KeyGen
: takes no input, produces a secret key \(sk\) and a public key \(pk\)Sign
\((sk, M)\): produces a signature \(\sigma\)Verify
\((pk, M, \sigma)\): returns true or falseAll three algorithms should be efficiently computable. Moreover, the digital signature must satisfy two requirements: correctness and unforgeability.
Correctness means that for any secret/public keypair produced by KeyGen, it is the case that any signature made via the Sign
algorithm is subsequently accepted by Verify
.
We formalize the unforgeability guarantee by considering a hypothetical game involving
This game proceeds in two phases.
Sign
\((sk, M)\). Mallory can repeat this step as often as she wants.Mallory wins the game if:
Verify
\((pk, M^*) =\) true.If it is practically infeasible for Mallory to win this game, then we say that the digital signature scheme is “existentially unforgeable under a chosen message attack,” often abbreviated EU-CMA.
In words, this means that no matter how many signed messages that Mallory has seen Alice make in the past, and no matter what the message contents were, it is practically impossible for Mallory to forge a new message that Alice never signed herself.
It is helpful to have a definition in hand before we try to construct a digital signature, so that we have a way to test whether a candidate construction is good.
Let’s try to construct a digital signature. It turns out that this is challenging to do. Remember that we need the following properties to hold:
So: we need to find a math problem that is difficult to solve but easy to verify.
Unfortunately, we do not know for certain whether any such math problems exist! If you can find one, the Clay Math Institute will give you a prize of $1 million. (This is called the \(\textit{P}\) vs. \(\textit{NP}\) question.)
So we do the next best thing: build signature schemes from math problems that we believe to have this property of “difficult to solve but easy to check.” One such problem is factoring.
In the late 1970s, Rivest, Shamir, and Adleman discovered how to use factoring as the basis of a digital signature scheme. It is named RSA in their honor. It works using modular (or clock) arithmetic.
Here is a slightly oversimplified (and incorrect) version of their digital signature algorithm.
KeyGen
has no inputs, and does the following:
The remaining two operations use modular arithmetic, or clock arithmetic.
As a reminder: when given two integers \(A\) and \(B\), the notation \(A \bmod{B}\) means to calculate the remainder that occurs when you divide \(A\) by \(B\).
Here’s our first attempt at defining the rest of the RSA algorithm.
Sign
\((sk, M) = M^{d} \bmod{N}\)Verify
\((pk, M, \sigma)\) returns true if and only if \(\sigma^{e} = M \bmod{N}\)Why does this (almost) work? It turns out that:
Math details
There is a theorem from number theory called Euler’s theorem, which states that for any integer \(m\): \[m^{(p-1)(q-1)} \bmod{N} = 1.\] As a result, the exponents \(d\) and \(e\) will “cancel out” as long as the product \(de\) is 1 more than some multiple of \((p-1)(q-1)\), or in other words, if there is some integer \(k\) such that \(de = k(p-1)(q-1) + 1\). This is because \[ \begin{align*} m^{de} \bmod{N} &= m^{k(p-1)(q-1) + 1} \bmod{N} \\ &= \left(m^{(p-1)(q-1)} \right)^k \cdot m^1 \bmod{N} \\ &= 1^k \cdot m \bmod{N} \\ &= m \bmod{N}. \end{align*}\] Hence, the special exponents \(d\) and \(e\) must be connected by the equation \(de \equiv 1 \bmod{(p-1)(q-1)}\). Since Alice knows \(p\) and \(q\), for any public exponent \(e\) it is possible for her to calculate the secret exponent \(d\) using modular arithmetic.
There’s a bug in the current construction though. Suppose Alice gives Mallory the signatures of two messages \(M_1\) and \(M_2\):
\[ \sigma_1 = M_1^{d} \bmod{N} \qquad \sigma_2 = M_2^{d} \bmod{N}\]
From this, Mallory could produce the signature corresponding to the message \(M^* = M_1 \cdot M_2 \bmod{N}\) because:
\[ \begin{align*} \operatorname{Sign}(sk, M^*) &= (M_1 \cdot M_2)^{d} \bmod{N} \\ & = M_1^{d} \cdot M_2^{d} \bmod{N} \\ & = \sigma_1 \cdot \sigma_2 \bmod{N}. \end{align*}\]
Importantly: the last equation is something Mallory can calculate even without knowing \(sk\).
This is our first example of a cryptanalysis attack! Even though a forgery would take a lot of time if you tried to factor \(N\), we found an alternative approach that sidestepped the hard math problem and that succeeds much faster.
To fix this problem, the real RSA signature scheme also uses a cryptographic hash function \(H\).
KeyGen
generates two primes \(p\), \(q\) and calculate \(N\). Then find two exponents \(e\) and \(d\) that “cancel out.” The secret key is \(sk = (p, q, d)\) and the public key is \(pk = (N, e)\).Sign
\((sk, M) = H(M)^{d} \bmod{N}\)Verify
\((pk, M, \sigma)\) returns true
if and only if \(\sigma^{e} = H(M) \bmod{N}\)In this new construction, the Sign
algorithm starts by applying a hash function \(H\) to the message \(M\). The hash function breaks the algebraic structure that caused the problem above.
In fact, the hash function provides two benefits:
The overall new construction does satisfy existential unforgeability under a chosen message attack, as far as we know. (Explaining why this is the case is out of scope for this course; if you’re interested to see the math, it is covered in CS 538.)
The hash function allows us to sign a message \(M\) of arbitrary length, since \(H\) maps any input bitstring onto a fixed-length number that we can use as the base for the subsequent exponentiation.
The RSA digital signature scheme (and a related technique to build encryption) are sometimes used in practice today.
For example, every time you visit the website bu.edu
, your computer uses the SHA-256 and RSA algorithms that we have discussed in this course. (You can see an image like this one by clicking on the lock icon 🔒 in your web browser.)
However, RSA is less commonly used today. The reason for this is performance, not security.
RSA keys and signatures are small, but not that small. As shown in the image above, the RSA keys and signatures used by bu.edu
are 2048 bits (or 256 bytes) long. This is because factoring is a hard problem, but it’s not a really hard problem, so we have to pick very large numbers for it to be infeasible by modern computers.
As a result: RSA is fast, but not that fast. Signing and verification take around 1 millisecond on a modern laptop or phone, give or take.
These costs may not sound big, and indeed a single signature is small and quick to compute. But if you run a web server and calculate RSA for every client that visits your website, then the costs add up quickly.
Instead, the most common standard is called the digital signature algorithm, or DSA.
We will study the digital signature algorithm next time.
DS 653 — Lecture 3