import math
range(344,367)) / (366**23) math.prod(
0.49367698818054007
By the end of today’s lecture, you will learn:
“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”
– Ben Adida
Cryptography uses math, computer science, and data science in order to design systems that help people to get things done safely in the digital world. In every topic we explore in this course, we will provide (some or all of) the following three objectives.
The first part of this course is about secure ledgers, or currencies without centralization. Our starting point into cryptocurrencies is to build a digital version of a banking system, where people can transfer money to each other electronically over the Internet without the need for a single trusted banking authority. We consider the following problem:
In the physical world, Alice could write a check to Bob. Last time, we started to discuss a cryptographic primitive that provides a digital equivalent to the act of “signing a check.” This primitive is (appropriately enough) called a digital signature.
Today, we will study a special function that is needed to build a digital signature and other crypto tools. This special function is called a cryptographic hash function, and we will have many uses for it throughout the semester.
To motivate today’s discussion, let’s consider the following problem:
Question. Can we design the cloud storage service in a way that provides an integrity guarantee to Alice? Specifically, she wants to be convinced that the cloud service provider cannot cheat later when revealing \(x\).
What would be great is if there was a function \(H\) that we could apply to the file \(x\) that:
Producing a small output of size \(n\) is easy to do. We could:
However, all of these options have an obvious drawback: if Alice only saves the function output \(H(x)\), then the cloud provider can cheat later. Specifically, it can reveal a different file with the same \(n\)-bit prefix, or sum, or image after the linear transformation.
Depending on the file type and format, it might be the case that we get lucky. For instance, perhaps \(x\) is the only file in the world that maps to the fingerprint \(H(x)\).
But in cryptography, we are not in the business of relying on luck. Instead, let’s make our own luck.
Recall that our goal is to “compress” a large image \(x\) into a small (say, constant-sized) output \(y = H(x)\) while making sure that Alice doesn’t change the image afterward.
Because \(H\) has infinitely many inputs and only finitely many outputs, it must be the case that there are collisions – that is, two inputs that map to the same output string.
But: we are going to insist that collisions are computationally infeasible to find. This guarantee should hold even if we give the code of \(H\) to our adversary. This property is called collision resistance.
This guarantee should hold even if we give the code of \(H\) to our adversary.
Here is our adversary in this course. Her name is Mallory, because she is malicious and doesn’t follow the rules of any crypto protocol.
Mallory is intended to be a stand-in for any kind of real-world adversary who you might be concerned about: your internet service provider, cloud provider, government, or anyone else who might want to ruin your day.
Mallory has a substantial (though not infinite) amount of computing power at her disposal.
She is also very smart. Specifically, we must assume that she is a better cryptanalyst than we are. This is to account for Schneier’s law.
“Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can’t break.”
– Bruce Schneier
Even though Mallory is mathematically savvy and computationally powerful, we insist that she must still not be able to “break” our function \(H\).
Often called a cryptographer’s Swiss Army knife, a hash function can underlie many different cryptographic schemes: aside from producing a document’s digest to be digitally signed—one of the most common applications.
Let us give example applications: code-signing systems, computer forensics, [and protecting] passwords.
– Aumasson, Meier, Phan, and Henzen, The Hash Function BLAKE.
At a high level, a cryptographic hash function is a mathematical function, which we typically denote by \(H\), that has two seemingly-contradictory requirements.
You can think of a hash function as a gigantic function that translates between two languages:
Moreover, the outputs have a fixed, known length, no matter how short or long the input messages are. As a toy example: the follow truth table was produced by mapping the top 3000 English words to random 4-letter strings using random.org.
Input | Output |
---|---|
abandon | nieh |
ability | btol |
able | ykwf |
about | fkay |
\(\vdots\) | \(\vdots\) |
zipper | jgxm |
zone | pnjo |
zoo | ansm |
While this toy example used English-language words as the possible inputs, in reality we want a hash function to accept any data type. To be as generic as possible, we will consider the inputs and outputs to be bitstrings
. Then, we can apply \(H\) to Alice’s file, in any data format.
Using the language of bitstrings, we can now define a hash \(H\) as we stated previously: it’s a single mathematical function that can accept inputs of any size and that always returns an output of a fixed length.
Definition. A hash function \(H: \{0, 1\}^* \to \{0, 1\}^n\) is an efficiently-computable function that
Note that there is nothing hidden or secret here: the code of \(H\) is public knowledge and anyone can compute it.
The output \(H(x)\) is often called the hash digest corresponding to the input \(x\).
Next, we need to specify precisely the problem that Mallory cannot break. In fact we will define multiple such problems. As we go down the list:
Definition. A cryptographic hash function \(H\) satisfies the following properties.
One-wayness, aka preimage resistance: if we sample \(y \gets \{0, 1\}^n\) uniformly at random from the set of all bitstrings of length \(n\), then it is practically infeasible for Mallory to find any preimage \(x\) such that \(H(x) = y\).
(Note: remember that there are infinitely many such preimages, yet Mallory cannot find even a single one!)
Collision resistance: it is practically infeasible for Mallory to find two different inputs \(x_1\) and \(x_2\) such that \(H(x) = H(x')\).
(Note: if a function is collision resistant then it must be one-way, but the converse is not necessarily true. Do you see why that is?)
If an adversary [Mallory] has not explicitly queried the oracle on some point \(x\), then the value of \(H(x)\) is completely random… at least as far as [Mallory] is concerned.
– Jon Katz and Yehuda Lindell, Introduction to Modern Cryptography
Note that there is a limit to how infeasible it is to find a collision. After all, we already said that collisions must exist. And the output space is finite, so if we have enough time and computing power, it must be possible to find a collision.
Exactly how difficult can we hope that the problem might be? That is, even with the best possible hash function, what is a blunt and brute force attack that will always succeed?
Let’s step away from hash functions for a moment, and look at a seemingly-unrelated question. Let \(K\) be the number of people in this classroom.
Question. What is the probability that two people in the room have the same birthday?
(For the purpose of this question, suppose that everyone’s birthday is sampled uniformly at random from the set of 366 possible choices. This is not a valid assumption for many reasons, e.g., leap years, but let’s go with it for now.)
It might intuitively seem like the answer should be \(\frac{K}{366}\).
But that is incorrect! It would be the answer to a different problem:
What is the probability that at least one of \(K\) students in this classroom has the same birthday as Prof. Varia does?
The distinction between these two questions (and their corresponding answers) is called the birthday paradox.
The answer to the original question depends on the number of pairs of people in the classroom.
So the answer to the question is closer to \({K \choose 2} / 366 \approx K^2 / (2 \cdot 366)\) than it is to \(K / 366\).
But \({K \choose 2} / 366\) is not exactly right either. The reason is the principle of inclusion-exclusion: we would be overcounting in scenarios where many people have the same birthday.
Instead, it is easier to think about the converse problem. Suppose \(K = 23\), and let’s imagine that the \(K\) people walk into the classroom, one at a time.
What is the probability that all of them have different birthdays?
As a result, the overall probability that everybody’s birthday is distinct is small:
\[ 1 \times \frac{365}{366} \times \frac{364}{366} \times \cdots \times \frac{344}{366}\]
We can calculate this expression using Python.
import math
range(344,367)) / (366**23) math.prod(
0.49367698818054007
As a result, if there are 23 people in the classroom, then it is more likely than not that two of us have the same birthday.
Here’s a graph that shows the probability of two people having the same birthday in a room of \(K\) people.
Generalizing from the specific case of birthdays (where there are \(K = 366\) options): when drawing with replacement from a set of size \(K\), then:
\[ E[\text{\# of attempts until the first collision}] \approx 1.25 \sqrt{K}.\]
This approximation works well even for fairly small choices of \(K\), such as \(K = 366\). And it works really well for larger \(K\).
import math
1.25 * math.sqrt(366)
23.91390808713624
Moreover, the distribution is tightly correlated around this expected value. There is less than a 15% chance that:
Returning back to hash functions: the key insight from the birthday problem is that:
The best-possible security that we can hope for a hash function \(H: \{0, 1\}^* \to \{0, 1\}^n\) to achieve is related to the length of the output space \(n\).
Example. Suppose that we build a hash function whose output length is \(n = 3\) bits. Let’s try to break one-wayness and collision resistance using a simple guess-and-check approach.
To break one-wayness for (say) \(y = 000\), suppose we just select inputs \(x\) at random and compute the hash function until we find an input where \(H(x) = y\).
Now let’s try to break collision resistance. Once again, suppose that we aimlessly select inputs \(x\) at random and build a table of all \((x, H(x))\) pairs until we find two inputs with the same digest.
So in order to have a secure hash function, it is necessary (though not sufficient) to choose a large output length \(K\). A common choice in cryptography is to choose \(n = 256\) bits, so that \(K = 2^{256}\).
Extrapolating from the analysis above, our brute-force attack succeeds at breaking:
These are both big numbers. And they’re very different scales of big-ness. Remember that \(2^{256}\) is not two times as big as \(2^{128}\). It is instead \(2^{128}\) times as big!
In fact, both of them are so big as to be practically impossible for anyone to execute, with the computing power available today or in the near future.
To give a sense of how big the number \(2^{128}\) is:
As a useful heuristic: there are about \(2^{25}\) seconds in one year. So this calculation would take \(2^{58 - 25} = 2^{23} \approx 8\) million years
(Admittedly, this calculation discounts the effect of Moore’s law. But hopefully it gives you a sense of the vast scale of computing power involved here.)
The number \(2^{256}\) is much, much larger than this. As we discussed last time: even running a for
loop over \(2^{256}\) possible preimages would require (1) more time than the expected remaining length of the universe and (2) approximately as much energy as the sun.
So far, we have talked about the best-possible security that a hash function can hope to achieve. But is this hope actually achievable?
The good news: the worldwide crypto community believes the answer is yes!
There is some bad news, however.
So instead: we evaluate hash functions empirically, based on how well they hold up against the worldwide cryptanalysis community.
Informally, any possible instantiation of low-level crypto building blocks like hash functions must satisfy three properties:
Simple: It must be implementable in a small number of lines of code, and run quickly on modern computers (e.g., millions, billions, trillions, … of times per second).
Makes no sense: It survives the gauntlet of cryptanalysis attempts by hundreds of cryptanalysts over many years.
Simple to see why it makes no sense: There is some mathematical reason to believe that nobody will come up with a clever cryptanalysis algorithm tomorrow, in order to adhere to Schneier’s law.
In particular, we do not rely on security by obscurity – that is, simply making the code of a hash function so complicated that nobody can understand it.
Instead, we want the hash function to be hard to break even by people who understand exactly what it is doing.
In more detail, all cryptographic primitives must satisfy Kerckhoffs’s principle.
[Source: Auguste Kerckhoffs, La Cryptographie Militaire, 1883 (more details available on Wikipedia)]
The U.S. National Institute of Standards and Technology (NIST) develops standards for all cryptographic building blocks: hash functions, digital signatures, encryption schemes, and more.
To date, NIST has developed three standards for hash functions (and also subsequently revoked one of them). These algorithms are called the Secure Hash Algorithms, or SHA for short. There have been three iterations of hash functions: SHA-1, SHA-2, and SHA-3.
SHA-1 was initially developed in 1995 by the U.S. National Security Agency. It had an output length of \(n = 160\) bits, so by the birthday bound at best we can hope that it takes \(2^{80}\) time to find a collision.
But through a combination of incredibly clever math to design a more clever collision-finding algorithm and a lot of computing power to execute it, researchers have broken SHA-1 and you should never use it!
Example. Here are two PDF files (source: shattered.io).
good.pdf | bad.pdf |
---|---|
It turns out that they have the same hash digest using the SHA-1 hash function.
from hashlib import sha1
# read the contents of two files: good.pdf (a check mark) and bad.pdf (an X)
with open('images/good.pdf', 'rb') as file:
= file.read()
good
with open('images/bad.pdf', 'rb') as file:
= file.read()
bad
# verify that the strings are different
assert(good != bad)
# outputs are 40 hex characters, or 160 bits in length
= sha1(good).hexdigest()
hashGood = sha1(bad).hexdigest()
hashBad
# observe that the hash function outputs are the same
print(hashGood)
print(hashBad)
assert(hashGood == hashBad)
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a
So which hash functions should you use?
SHA-2 was initially developed in 2001, also by the U.S. National Security Agency. It is the most popular hash function algorithm today. It offers a choice of four possible output lengths: 224, 256, 384, or 512 bits.
SHA-3 was standardized in 2015 after a multi-year open competition. Its goal was to design a hash function in a fundamentally different way as SHA-2, in order to have a “failsafe” plan in case someone finds a collision in SHA-2.
To the best of our collective knowledge as a crypto community, for both of these functions are secure.
In particular, the best known algorithms to break collision resistance require about \(2^{n/2}\) work to break, when the hash function is set to have an output length of \(n\) bits. In other words: no known attack on collision resistance for SHA-2 or SHA-3 works much better than a brute-force attack via the birthday bound.
from hashlib import sha256
# input is a bytestring (notice the 'b' before the string) of arbitrary length
print(sha256(b'Hello world!').hexdigest())
# output is 64 hex characters, or 256 bits in length
c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a
We will see many uses of hash functions throughout this course. For example, next time we will see how hash functions can be used within the construction of digital signatures.
For today, let’s go back to the digital fingerprinting application from the beginning of the lecture. Recall that Alice wants to store a large file \(f\) on a cloud service like Dropbox or Google Drive. The file is important to Alice, so when she retrieves it later, she wants to be able to check that the file hasn’t been altered or corrupted on the cloud.
How can Alice do this? Here’s one common approach:
This approach is both simple and secure.