Lecture 2: Cryptographic hash functions

Announcements

  • Reading: David Wong’s Hash functions and security
  • Survey: complete on Gradescope by Friday at 10pm
  • Homework 1: will be released this week and due next Friday 1/31 at 10pm
  • Office hours: no office hours today, will start OH tomorrow at 1:15-2:25pm in the CDS 13th floor conference room (1324)

Learning outcomes

By the end of today’s lecture, you will learn:

  • The security guarantees of hash functions: one-wayness, collision resistance, and acting as a random oracle.
  • How the one-wayness and collision resistance of a hash function depend on its output length, and why cryptographers are satisfied with output lengths like \(n = 256\) bits.
  • How hash functions can be used to authenticate (sets of) files.
  • The goals and security guarantees of message commitments.

Review from last lecture

“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”

– Ben Adida

Cryptography uses math, computer science, and data science to design systems that help people get things done safely in the digital world. Every topic we explore in this course will aim for some or all of the following three objectives.

  • Confidentiality, meaning to keep data or people’s identities hidden from others.
  • Integrity, which means preventing raw data or the data science results from being altered, or at least allowing for detection of tampering.
  • Availability, or ensuring that data science remains accessible and censorship-resistant.

The first part of this course is about secure ledgers, or currencies without centralization. Our starting point into cryptocurrencies is to build a digital version of a banking system, where people can transfer money to each other electronically over the Internet without the need for a single trusted banking authority. We consider the following problem:

  • There are two people, who (following the typical convention in crypto) we will call Alice and Bob.
  • Alice would like to transfer $1 to Bob.

In the physical world, Alice could write a check to Bob. Last time, we started to discuss a cryptographic primitive that provides a digital equivalent to the act of “signing a check.” This primitive is (appropriately enough) called a digital signature.

Today, we will study a special function that is needed to build a digital signature and other crypto tools. This special function is called a cryptographic hash function, and we will have many uses for it throughout the semester.

2.1 Digital fingerprinting

To motivate today’s discussion, let’s consider the following problem:

  • Alice wants to store a large file \(x\) on a cloud service like Dropbox or Google Drive.
  • The file is very important to Alice. So when she retrieves it later, she wants to be able to verify that the file hasn’t been altered or corrupted on the cloud.

Question. Can we design the cloud storage service in a way that provides an integrity guarantee to Alice? Specifically, she wants to be convinced that the cloud service provider cannot cheat later when revealing \(x\).

What would be great is if there was a function \(H\) that we could apply to the file \(x\) that:

  • Can act as a “fingerprint,” in the sense that it uniquely specifies the file.
  • Produces a small output, so that it’s easy for Alice to save \(H(x)\) on her computer, even if the entire file \(x\) is too large to save.

Producing a small output of size \(n\) is easy to do. We could:

  • Truncate the file and keep just the first \(n\) bits,
  • Add together all length-\(n\) chunks of the file,
  • Apply a linear transformation to the file
  • …and much more.

However, all of these options have an obvious drawback: if Alice only saves the function output \(H(x)\), then the cloud provider can cheat later. Specifically, it can reveal a different file with the same \(n\)-bit prefix, or sum, or image after the linear transformation.
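To see the drawback concretely, here is a short sketch (not from the lecture; the file contents are made up) showing that two of these naive fingerprints are trivial to fool:

```python
def truncate_fingerprint(data: bytes, n_bytes: int = 4) -> bytes:
    """Naive fingerprint: keep only the first n_bytes of the file."""
    return data[:n_bytes]

def checksum_fingerprint(data: bytes) -> int:
    """Naive fingerprint: add together all the bytes of the file."""
    return sum(data) % 2**32

# Any two files with the same 4-byte prefix share a "fingerprint"...
original = b"Pay Bob $1. Signed, Alice."
forged = b"Pay Mallory $1000000. Signed, Alice."
assert original != forged
assert truncate_fingerprint(original) == truncate_fingerprint(forged)

# ...and swapping two bytes leaves a byte-sum checksum unchanged.
check1 = b"Pay Bob $19"
check2 = b"Pay Bob $91"
assert check1 != check2
assert checksum_fingerprint(check1) == checksum_fingerprint(check2)
```

In both cases Mallory can produce a different, meaningful file with the same fingerprint, which is exactly the cheating we want to rule out.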

Depending on the file type and format, it might be the case that we get lucky. For instance, perhaps \(x\) is the only file in the world that maps to the fingerprint \(H(x)\).

But in cryptography, we are not in the business of relying on luck. Instead, let’s make our own luck.

Collision resistance

Recall that our goal is to “compress” a large file \(x\) into a small (say, constant-sized) output \(y = H(x)\) while making sure that the cloud provider cannot change the file afterward.

Because \(H\) has infinitely many inputs and only finitely many outputs, it must be the case that there are collisions – that is, two inputs that map to the same output string.

But: we are going to insist that collisions are computationally infeasible to find. This guarantee should hold even if we give the code of \(H\) to our adversary. This property is called collision resistance.


Meet Mallory

Here is our adversary in this course. Her name is Mallory, because she is malicious and doesn’t follow the rules of any crypto protocol.

Mallory is intended to be a stand-in for any kind of real-world adversary who you might be concerned about: your internet service provider, cloud provider, government, or anyone else who might want to ruin your day.

Mallory has a substantial (though not infinite) amount of computing power at her disposal.

She is also very smart. Specifically, we must assume that she is a better cryptanalyst than we are. This is to account for Schneier’s law.

“Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can’t break.”

– Bruce Schneier

Even though Mallory is mathematically savvy and computationally powerful, we insist that she must still not be able to “break” our function \(H\).

2.2 Cryptographic hash functions

Often called a cryptographer’s Swiss Army knife, a hash function can underlie many different cryptographic schemes aside from producing a document’s digest to be digitally signed—one of the most common applications.

Let us give example applications: code-signing systems, computer forensics, [and protecting] passwords.

– Aumasson, Meier, Phan, and Henzen, The Hash Function BLAKE.

Swiss Army knife

At a high level, a cryptographic hash function is a mathematical function, which we typically denote by \(H\), that has two seemingly-contradictory requirements.

  • It should be public and deterministic, so everyone can compute it quickly, and yet
  • It should be unpredictable and random-looking, so that nobody can understand how its outputs are connected to its inputs.

You can think of a hash function as a gigantic table that translates between two languages:

  • The inputs have some mathematical or English-language structure that makes sense to us.
  • The outputs are a new kind of “foreign language” that is incomprehensible to all of us on Earth.

Klingon-to-English dictionary (source)

Moreover, the outputs have a fixed, known length, no matter how short or long the input messages are. As a toy example: the following truth table was produced by mapping the top 3000 English words to random 4-letter strings using random.org.

Input       Output
abandon     nieh
ability     btol
able        ykwf
about       fkay
\(\vdots\)  \(\vdots\)
zipper      jgxm
zone        pnjo
zoo         ansm

While this toy example used English-language words as the possible inputs, in reality we want a hash function to accept any data type. To be as generic as possible, we will consider the inputs and outputs to be bitstrings. Then, we can apply \(H\) to Alice’s file, in any data format.

Defining hash functions

Using the language of bitstrings, we can now define a hash function \(H\) as we stated previously: it’s a single mathematical function that accepts inputs of any size and always returns an output of a fixed length.

Definition. A hash function \(H: \{0, 1\}^* \to \{0, 1\}^n\) is an efficiently-computable function that

  • Receives an input bitstring of arbitrary length, and
  • Provides an output bitstring that contains a fixed number of bits \(n\).

Note that there is nothing hidden or secret here: the code of \(H\) is public knowledge and anyone can compute it.

The output \(H(x)\) is often called the hash digest corresponding to the input \(x\).
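Both parts of the definition can be checked directly in Python using the standard library’s hashlib. (This sketch uses SHA-256, a concrete hash function we will meet later in the lecture, which has \(n = 256\).)

```python
from hashlib import sha256

# The output length is fixed at n = 256 bits (32 bytes), whether the
# input is two bytes or ten million bytes...
tiny = sha256(b"hi").digest()
huge = sha256(b"x" * 10_000_000).digest()
assert len(tiny) == len(huge) == 32

# ...and the function is public and deterministic: anyone who hashes
# the same input gets the same digest.
assert sha256(b"hi").digest() == tiny
```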

Next, we need to specify precisely the problem that Mallory cannot break. In fact we will define multiple such problems. As we go down the list:

  • The problems become easier and easier to break, so\(\ldots\)
  • If Mallory still cannot break them, then the hash function \(H\) must be really strong.

Definition. A cryptographic hash function \(H\) satisfies the following properties.

  • One-wayness, aka preimage resistance: if we sample \(y \gets \{0, 1\}^n\) uniformly at random from the set of all bitstrings of length \(n\), then it is practically infeasible for Mallory to find any preimage \(x\) such that \(H(x) = y\).

    (Note: remember that there are infinitely many such preimages, yet Mallory cannot find even a single one!)

  • Collision resistance: it is practically infeasible for Mallory to find two different inputs \(x_1\) and \(x_2\) such that \(H(x_1) = H(x_2)\).

    (Note: if a function is collision resistant then it must be one-way, but the converse is not necessarily true. Do you see why that is?)

  • Puzzle friendliness: (this property is complicated to describe; we will come back to it later)
  • Random oracle: \(H\) acts like a randomly-generated truth table would.

If an adversary [Mallory] has not explicitly queried the oracle on some point \(x\), then the value of \(H(x)\) is completely random… at least as far as [Mallory] is concerned.

– Jon Katz and Yehuda Lindell, Introduction to Modern Cryptography
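As a quick illustration of this random-looking behavior (a sketch using SHA-256; the exact bit count is simply whatever these two inputs happen to produce): changing even one character of the input yields a digest that agrees with the old one on only about half of its bits, just as two independent random 256-bit strings would.

```python
from hashlib import sha256

# Two inputs that differ in a single character...
d1 = sha256(b"Hello world!").hexdigest()
d2 = sha256(b"Hello world?").hexdigest()
print(d1)
print(d2)

# ...produce digests that look completely unrelated. Count the bit
# positions where the two 256-bit digests differ.
differing = bin(int(d1, 16) ^ int(d2, 16)).count("1")
print(differing)  # for a random-looking function, this should be near 128
```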

Note that there is a limit to how infeasible it is to find a collision. After all, we already said that collisions must exist. And the output space is finite, so if we have enough time and computing power, it must be possible to find a collision.

Exactly how difficult can we hope that the problem might be? That is, even with the best possible hash function, what is a blunt, brute-force attack that will always succeed?

2.3 The birthday problem

Let’s step away from hash functions for a moment, and look at a seemingly-unrelated question. Let \(K\) be the number of people in this classroom.

Question. What is the probability that two people in the room have the same birthday?

(For the purpose of this question, suppose that everyone’s birthday is sampled uniformly at random from the set of 366 possible choices. This is not a valid assumption for many reasons, e.g., leap years, but let’s go with it for now.)

It might intuitively seem like the answer should be \(\frac{K}{366}\).

But that is incorrect! It would be the answer to a different problem:

What is the probability that at least one of \(K\) students in this classroom has the same birthday as Prof. Varia does?

The distinction between these two questions (and their corresponding answers) is called the birthday paradox.

The answer to the original question depends on the number of pairs of people in the classroom.

  • If birthdays are sampled uniformly at random, each pair has a \(\frac{1}{366}\) chance of having the same birthday.
  • There are \(K \choose 2\) pairs of people in the room.

So the answer to the question is closer to \({K \choose 2} / 366 \approx K^2 / (2 \cdot 366)\) than it is to \(K / 366\).

But \({K \choose 2} / 366\) is not exactly right either. The reason is the principle of inclusion-exclusion: we would be overcounting in scenarios where many people have the same birthday.

Instead, it is easier to think about the complementary event. Suppose \(K = 23\), and let’s imagine that the \(K\) people walk into the classroom, one at a time.

What is the probability that all of them have different birthdays?

  • The first person certainly has a different birthday than everyone else in the room (since there is nobody else in the room yet).
  • The second person has probability 365/366 of having a different birthday than the first person.
  • Conditioned on the first two people having distinct birthdays, the third person has probability 364/366 of having a different birthday than both of them.
  • …and so on, until the 23rd person to enter the room has a 344/366 probability of having a birthday that is different than the other 22 people already in the room.

As a result, the overall probability that everybody’s birthday is distinct is small:

\[ 1 \times \frac{365}{366} \times \frac{364}{366} \times \cdots \times \frac{344}{366}\]

We can calculate this expression using Python.

import math
# probability that all 23 birthdays are distinct: (366 * 365 * ... * 344) / 366**23
math.prod(range(344, 367)) / (366**23)
0.49367698818054007

As a result, if there are 23 people in the classroom, then it is more likely than not that two of us have the same birthday.

Here’s a graph that shows the probability of two people having the same birthday in a room of \(K\) people.

Birthday paradox
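The numbers behind this graph can be reproduced in a few lines of Python (a sketch; the function name is our own):

```python
import math

def p_shared_birthday(k: int, days: int = 366) -> float:
    """Probability that at least two of k people share a birthday."""
    p_all_distinct = math.prod((days - i) / days for i in range(k))
    return 1 - p_all_distinct

# The probability climbs quickly once k passes ~sqrt(366)
for k in (10, 23, 50, 70):
    print(k, round(p_shared_birthday(k), 3))
```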

Generalizing from the specific case of birthdays: when drawing with replacement from a set of size \(K\) (for birthdays, \(K = 366\); note that \(K\) now counts the possible values, not the people), then:

\[ E[\text{\# of attempts until the first collision}] \approx 1.25 \sqrt{K}.\]

This approximation works well even for fairly small choices of \(K\), such as \(K = 366\). And it works really well for larger \(K\).

import math
# expected number of samples until the first collision, for K = 366
1.25 * math.sqrt(366)
23.91390808713624

Moreover, the distribution is tightly concentrated around this expected value. There is less than a 15% chance that:

  • The first collision has already occurred when drawing \(\frac{1}{2} \cdot 1.25 \sqrt{K}\) samples.
  • The first collision has not yet occurred when drawing \(2 \cdot 1.25 \sqrt{K}\) samples.
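These claims are easy to sanity-check by simulation (a sketch, with names of our own choosing):

```python
import math
import random

def draws_until_collision(k: int, rng: random.Random) -> int:
    """Draw uniformly from k values, with replacement, until a repeat."""
    seen = set()
    while True:
        v = rng.randrange(k)
        if v in seen:
            return len(seen) + 1  # total number of draws, including the repeat
        seen.add(v)

rng = random.Random(2024)
k = 366
trials = [draws_until_collision(k, rng) for _ in range(10_000)]
average = sum(trials) / len(trials)
print(average)                 # roughly 1.25 * sqrt(366), i.e. about 24
print(1.25 * math.sqrt(k))
```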

Comparing preimage and collision resistance

Returning back to hash functions: the key insight from the birthday problem is that:

The best-possible security that we can hope for a hash function \(H: \{0, 1\}^* \to \{0, 1\}^n\) to achieve is determined by its output length \(n\).

Example. Suppose that we build a hash function whose output length is \(n = 3\) bits. Let’s try to break one-wayness and collision resistance using a simple guess-and-check approach.

To break one-wayness for (say) \(y = 000\), suppose we just select inputs \(x\) at random and compute the hash function until we find an input where \(H(x) = y\).

  • Each sample of \(x\) will satisfy the equation with probability 1/8.
  • So in expectation, we will find an input \(x\) such that \(H(x) = y\) after 8 attempts.
  • There is a chance that we get unlucky and it could take much longer. Probabilistically: the number of attempts until the first success follows the distribution of a geometric random variable.
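This guess-and-check attack is easy to run. As a sketch (the 3-bit toy hash here is our own construction for illustration, made by truncating SHA-256):

```python
from hashlib import sha256

def toy_hash(x: bytes) -> int:
    """A toy hash with n = 3: keep only the first 3 bits of SHA-256(x)."""
    return sha256(x).digest()[0] >> 5

target = 0b000
attempts = 0
while True:
    attempts += 1
    x = str(attempts).encode()   # guess the inputs "1", "2", "3", ...
    if toy_hash(x) == target:
        break
print(attempts)  # each guess succeeds with probability 1/8, so ~8 on average
```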

Now let’s try to break collision resistance. Once again, suppose that we aimlessly select inputs \(x\) at random and build a table of all \((x, H(x))\) pairs until we find two inputs with the same digest.

  • In the worst case, we can select 9 inputs, and by the pigeonhole principle two of them must map to the same output, since there are only \(2^3 = 8\) possible outputs.
  • In expectation, the birthday bound says that the expected number of trials is around \(1.25 \cdot \sqrt{2^n} \approx 4\).
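A sketch of this tabulation attack, using a toy 3-bit hash made by truncating SHA-256 (our own construction for illustration):

```python
from hashlib import sha256

def toy_hash(x: bytes) -> int:
    """A toy hash with n = 3: keep only the first 3 bits of SHA-256(x)."""
    return sha256(x).digest()[0] >> 5

table = {}   # digest -> first input seen with that digest
i = 0
while True:
    i += 1
    x = str(i).encode()
    y = toy_hash(x)
    if y in table:
        x1, x2 = table[y], x   # collision found!
        break
    table[y] = x

# By pigeonhole, this loop stops after at most 2**3 + 1 = 9 inputs.
print(i, x1, x2)
assert x1 != x2 and toy_hash(x1) == toy_hash(x2)
```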

Choosing larger \(n\)

So in order to have a secure hash function, it is necessary (though not sufficient) to choose a large output length \(n\). A common choice in cryptography is to choose \(n = 256\) bits, so that \(K = 2^{256}\).

Extrapolating from the analysis above, our brute-force attack succeeds at breaking:

  • One-wayness with good probability by an attacker who tries around \(2^{256}\) inputs.
  • Collision resistance with good probability by an attacker who tries around \(\sqrt{2^{256}} = 2^{128}\) inputs.

These are both big numbers. And they’re very different scales of big-ness. Remember that \(2^{256}\) is not two times as big as \(2^{128}\). It is instead \(2^{128}\) times as big!

In fact, both of them are so big as to be practically impossible for anyone to execute, with the computing power available today or in the near future.

To give a sense of how big the number \(2^{128}\) is:

  • The cryptocurrency Bitcoin gives people money to run a hash function called SHA-256 over and over and over again.
  • The world has collectively devoted enough computing power to this task to run more than 1 billion trillion SHA-256 hashes per second (source: blockchain.info). So that’s about \(10^{21} \approx 2^{70}\) computations per second.
  • To compute the hash function \(2^{128}\) times, it would take the entire world (with our current computing power) about \(2^{128} / 2^{70} = 2^{58}\) seconds.

As a useful heuristic: there are about \(2^{25}\) seconds in one year. So this calculation would take \(2^{58 - 25} = 2^{33} \approx 8\) billion years.

(Admittedly, this calculation discounts the effect of Moore’s law. But hopefully it gives you a sense of the vast scale of computing power involved here.)

The number \(2^{256}\) is much, much larger than this. As we discussed last time: even running a for loop over \(2^{256}\) possible preimages would require (1) more time than the expected remaining length of the universe and (2) approximately as much energy as the sun.

2.4 Hash function constructions

So far, we have talked about the best-possible security that a hash function can hope to achieve. But is this hope actually achievable?

The good news: the worldwide crypto community believes the answer is yes!

There is some bad news, however.

  • I cannot mathematically and analytically prove to you that any cryptographic hash function actually meets the definition of one-wayness or collision resistance.
  • That would require fundamental breakthroughs in computer science, like proving \(P \neq NP\).

So instead: we evaluate hash functions empirically, based on how well they hold up against the worldwide cryptanalysis community.

Design goals

Informally, any possible instantiation of low-level crypto building blocks like hash functions must satisfy three properties:

  1. Simple: It must be implementable in a small number of lines of code, and run quickly on modern computers (e.g., millions, billions, trillions, … of times per second).

  2. Makes no sense: It survives the gauntlet of cryptanalysis attempts by hundreds of cryptanalysts over many years.

  3. Simple to see why it makes no sense: There is some mathematical reason to believe that nobody will come up with a clever cryptanalysis algorithm tomorrow, in order to adhere to Schneier’s law.

In particular, we do not rely on security by obscurity – that is, simply making the code of a hash function so complicated that nobody can understand it.

Instead, we want the hash function to be hard to break even by people who understand exactly what it is doing.

In more detail, all cryptographic primitives must satisfy Kerckhoffs’s principle.

  1. The system must be practically, if not mathematically, indecipherable;
  2. It should not require secrecy, and it should not be a problem if it falls into enemy hands;
  3. It must be possible to communicate and remember the key without using written notes, and correspondents must be able to change or modify it at will;
  4. It must be applicable to telegraph communications;
  5. It must be portable, and should not require several persons to handle or operate;
  6. Lastly, given the circumstances in which it is to be used, the system must be easy to use and should not be stressful to use or require its users to know and comply with a long list of rules.

[Source: Auguste Kerckhoffs, La Cryptographie Militaire, 1883 (more details available on Wikipedia)]

Hash function standards

The U.S. National Institute of Standards and Technology (NIST) develops standards for all cryptographic building blocks: hash functions, digital signatures, encryption schemes, and more.

  • Officially, these standards are only required for use within the U.S. federal government.
  • Unofficially, their standards are often used in commercial applications around the world.

To date, NIST has developed three standards for hash functions (and also subsequently revoked one of them). These algorithms are called the Secure Hash Algorithms, or SHA for short. There have been three iterations of hash functions: SHA-1, SHA-2, and SHA-3.

SHA-1

SHA-1 was initially developed in 1995 by the U.S. National Security Agency. It had an output length of \(n = 160\) bits, so by the birthday bound at best we can hope that it takes \(2^{80}\) time to find a collision.

But through a combination of incredibly clever math (to design a faster collision-finding algorithm) and a lot of computing power (to execute it), researchers have broken SHA-1, and you should never use it!

Example. Here are two PDF files (source: shattered.io).

good.pdf (a check mark) and bad.pdf (an X)

It turns out that they have the same hash digest using the SHA-1 hash function.

from hashlib import sha1

# read the contents of two files: good.pdf (a check mark) and bad.pdf (an X)
with open('images/good.pdf', 'rb') as file:
    good = file.read()

with open('images/bad.pdf', 'rb') as file:
    bad = file.read()

# verify that the strings are different
assert(good != bad)

# outputs are 40 hex characters, or 160 bits in length
hashGood = sha1(good).hexdigest()
hashBad  = sha1(bad).hexdigest()

# observe that the hash function outputs are the same
print(hashGood)
print(hashBad)
assert(hashGood == hashBad)
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a

SHA-2 and SHA-3

So which hash functions should you use?

  • SHA-2 was initially developed in 2001, also by the U.S. National Security Agency. It is the most popular hash function algorithm today. It offers a choice of four possible output lengths: 224, 256, 384, or 512 bits.

  • SHA-3 was standardized in 2015 after a multi-year open competition. Its goal was to design a hash function in a fundamentally different way from SHA-2, in order to have a “failsafe” plan in case someone finds a collision in SHA-2.

To the best of our collective knowledge as a crypto community, both of these functions are secure.

In particular, the best known algorithms for breaking collision resistance require about \(2^{n/2}\) work, when the hash function is set to have an output length of \(n\) bits. In other words: no known attack on collision resistance for SHA-2 or SHA-3 works much better than a brute-force attack via the birthday bound.

from hashlib import sha256

# input is a bytestring (notice the 'b' before the string) of arbitrary length
print(sha256(b'Hello world!').hexdigest())
# output is 64 hex characters, or 256 bits in length
c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a

2.5 Back to digital fingerprinting

We will see many uses of hash functions throughout this course. For example, next time we will see how hash functions can be used within the construction of digital signatures.

For today, let’s go back to the digital fingerprinting application from the beginning of the lecture. Recall that Alice wants to store a large file \(f\) on a cloud service like Dropbox or Google Drive. The file is important to Alice, so when she retrieves it later, she wants to be able to check that the file hasn’t been altered or corrupted on the cloud.

How can Alice do this? Here’s one common approach:

  • Compute the hash of the file \(h = \texttt{SHA256}(f)\) before you upload the file \(f\) to the cloud. Store \(h\) on your own computer. (Remember that \(h\) is small, only 32 bytes in size.)
  • When you later download a file \(f'\), check that \(\texttt{SHA256}(f') \overset{?}{=} h\) in order to verify that \(f'\) is indeed the same as the file \(f\) that you previously uploaded.
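The two steps above can be sketched in a few lines (the file contents here are made up):

```python
from hashlib import sha256

# Step 1: before uploading, compute and keep the 32-byte digest.
f = b"...the contents of Alice's large, important file..."
h = sha256(f).digest()
assert len(h) == 32

# Step 2: after downloading f_prime, accept it only if the digests match.
f_prime = f                      # an honest cloud returns the file unchanged
assert sha256(f_prime).digest() == h

tampered = f + b"!"              # any modification changes the digest
assert sha256(tampered).digest() != h
```

Note that Alice only needs to store \(h\), not the file itself, to run this check.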

This approach is both simple and secure.

  • Modern hash functions are very quick to compute, even on large files that are gigabytes in size.
  • Due to collision resistance, even if Mallory is running the cloud provider, she cannot find a new file \(f'\) that is different from Alice’s original file \(f\) and yet still hashes to the same string \(h\).