Lecture 14: Data Encryption

Announcements

Test 2 is this Thursday during lecture
This week’s reading: Boneh-Shoup, pages 4-18
Challenge problems 3-4 have been released, and are due on Sunday 3/30

Learning outcomes

By the end of today’s lecture, you will learn:

The power of encryption to protect the confidentiality of data at rest and in transit.
The security objectives of symmetric key encryption.
How to construct a one-time pad, and why it is rarely used in practice today.

Review

Cryptology comes from two Greek word parts:

kryptos, meaning secret or hidden
logy, meaning the study of

So the goal of cryptology is to study the art of keeping secrets.

More generally, cryptographic systems provide some or all of the objectives in the CIA triad:

Confidentiality, meaning to keep data or people’s identities hidden from others.
Integrity, which means preventing raw data or the data science results from being altered, or at least allowing for detection of tampering.
Availability, or ensuring that data science remains accessible and censorship-resistant.

So far, we have completed two units of this course:

Crypto foundations and their use in the design of currencies without centralization.
Algorithmic foundations of agreement without trust.

Collectively, we have designed cryptographic tools that protect data integrity and availability.

14.1 Encryption

Starting today, we will study encryption and its role in protecting data confidentiality.

“Cryptography is about communication in the presence of an adversary.”

– Prof. Ron Rivest

Remember that a digital signature does not provide data confidentiality. Instead, its goal is unforgeability.

In this example, the 64-byte signature is added to the message, but the original message contents remain in the clear.

# using libsodium
from nacl.signing import SigningKey, VerifyKey

# Bob's key pair
secret_key = SigningKey.generate()
public_key = secret_key.verify_key

# Bob's message and its signature
message = b"Hello, Alice!"
signed_message = secret_key.sign(message)
assert public_key.verify(signed_message), "invalid signature"

print("Length of original message:", len(message), "bytes")
print("Length of signed message:  ", len(signed_message), "bytes\n")
print("Signed message:", signed_message[:5] + b"..." + signed_message[-15:])

Length of original message: 13 bytes
Length of signed message:   77 bytes

Signed message: b'\x8ai4\xf8\xec...,\x07Hello, Alice!'

Digital signatures allow:

Bob to “lock” his message using his secret key.
Anyone to “unlock” the message using the corresponding public key (to verify whether the message was previously locked by Bob).

Informal description of encryption and digital signatures (source: idea-instructions.com)

But what if we turn this picture around? What if:

Anyone can lock a message using the public key
Only Bob can unlock the message with his secret key

as shown in the left half of the figure?

The result is a public key encryption scheme! This new primitive does protect data confidentiality.

from nacl.public import PrivateKey, PublicKey, SealedBox

# Bob's key pair
secret_key = PrivateKey.generate()
public_key = secret_key.public_key
sealed_box = SealedBox(public_key)

# Alice's message and its encryption
message = b"Hello, Bob!"
encrypted_message = sealed_box.encrypt(message)

print(encrypted_message)

b"\xb0X\xd8\xe8\xa3\x01[od\xf5\xb5\x90\xec\xcfX!3D\x1c\x8b+\x02$\x97)'\t\xa3\x8aa\xa70p\x06\xbf\xa8\xc7\xcd\\\x93\xb9\x05\x8f9\xbc\xcd\x83#6O\x05\xf1n\x1c\x959'5\xa1"

Also as we will see, encryption schemes are randomized: if you encrypt the same message twice, you will get different results. Hence, encryption requires good sources of randomness.

encrypted_message = sealed_box.encrypt(message)
print(encrypted_message)

b'\xd2=\xb3\x03C\xa7\xf9\xfa9\xce\xa08\xfa\xa1\x16\xf2\xdaPQ\xbd\x18\xf0\xe1Ib(\xd3#\x91)\x8e\x06\xb7L\xee\x04\x1e\xb6\xcfZE\xd1.2N\x99p\x18\xb0\x164C\xa9\x1f\x03\xda9V~'

The goal of Part 3 of this course is to study encryption and its role in:

Protecting data at rest on your laptop or phone
Protecting data in transit over the Internet or messaging app

The power of encryption

Historically, encryption schemes have been designed and used by governments. It played a significant role in both of the world wars in the 20th century.

World War I: Zimmerman telegram (source: BBC)

World War II: Enigma machine (source: Wikipedia)

Today, encryption allows for confidential data exchange over the Internet. Alice can send a chat message, email, file, or database across the internet so that only Bob can read it.

Alice		Mallory		Bob
	\(\longrightarrow\)		\(\longrightarrow\)

Remember that “Mallory” is just a stand-in for whoever Alice and Bob might consider to be their adversary.

Question. In the real world, who might be trying to snoop in on their conversation?

Since the Internet is just “the world passing notes in a classroom,” this might be a more realistic picture of the Internet.

Even though the entities in the middle of the picture have significantly more computing power, encryption allows Alice and Bob to communicate securely because they have one thing that the others do not: a cryptographic key.

As a result, encryption is a powerful tool with immense social consequences.

“Cryptography rearranges power: it configures who can do what, from what.”

– Prof. Phillip Rogaway (UC Davis)

In particular, having the secret key gives someone the power to decrypt and read a message.

Encryption on the Internet is often performed in one of two ways.

Client-server encryption	End-to-end encryption
Alice encrypts her message with the tech company’s key (and hence, they can read it). Separately, the tech company encrypts a message to Bob.	Alice encrypts her message directly to Bob. While the message passes through the server, the tech company cannot read it.

14.2 Symmetric key cryptography

It is possible to construct public key encryption based on hard math problems like factoring and discrete logarithms.

But we will not study public key encryption in this course. In fact, I’m going to make a bold claim: even though it exists, in nearly all data science applications you should not use public key encryption!

I say this for two reasons:

Public key encryption allows anyone on the Internet to send you messages. In many applications though, you know exactly who is going to send you a message, so we can build cryptosystems that leverage this fact.
Losing control of the secret key for an encryption scheme has immediate, irrecoverable consequences.

Security over time

Let’s think about what happens in practice if Alice wants to connect to a website like kaggle.com.

Alice		kaggle.com
	\(\quad \Longleftrightarrow \quad\)

Question. What happens if the website loses control of its secret key?

With a digital signature scheme, Mallory can forge messages as the website in the future. In response, the website can revoke its key – that is, tell the world not to trust it anymore.
With a public key encryption scheme, Mallory can read messages sent to the website in the past. There is no way to recover from this data breach.

One key, rather than two

Fortunately, we do not need something as sophisticated as public key encryption, which allows anyone to send a message to the recipient.

In this case:

Alice and the website know that they are talking to each other!
When they first establish a connection, they can use Diffie-Hellman key exchange to agree on a secret that both of them know.

So we can use a simpler primitive: a symmetric key cryptosystem, which has only one key that can turn the lock to the left and to the right.

A message authentication code allows two parties to send messages to each other with integrity. It is the symmetric-key version of a signature scheme, where each party can both sign and verify.
A symmetric key encryption allows two parties to send messages to each other with confidentiality. Each party can both encrypt and decrypt.

Looking ahead: these primitives are nearly always used together. So for convenience, we will construct a combined primitive called authenticated encryption that performs both of these tasks at once.

“Confidentiality xor authenticity is not possible. If you don’t have both, often you don’t have either.”

– Prof. Matthew Green (Johns Hopkins)

For today though, let’s focus on encryption by itself.

Goals and non-goals

Let’s discuss the objectives of symmetric key encryption.

Symmetric key encryption must:

Hide the contents of Alice and Bob’s messages
Even if the adversary Mallory has full control over the network and can add, drop, alter, or re-order packets
Optionally: even if Mallory takes over Alice or Bob’s computer for some time, we might still want to provide confidentiality at other times

However, encryption on its own is not designed to hide “metadata.”

Encryption doesn’t provide anonymity: Mallory knows that Alice and Bob are communicating
Encryption doesn’t hide message length: Mallory sees how much data is flowing across the wire

Companies that use encryption sometimes forget about this fact!

Example 1. Here is a real interaction between Sen. Ron Wyden and Match, the parent company of Tinder.

Letter from Sen. Ron Wyden (link to source)

Response from Match Group, Inc. (link to source)

Example 2. A company called Voatz tried to build an internet-based voting system for political elections… which is actually a terrible idea for many reasons that we will discuss toward the end of the semester. But they failed at the very first step:

“The length of the encrypted packet clearly leaks which candidate was selected.”

– Michael A. Specter, James Koppel, and Daniel Weitzner (link to source)

14.3 Defining symmetric encryption

Like signatures, an encryption scheme has three methods: KeyGen, Enc, and Dec.

KeyGen: Alice and Bob jointly sample a symmetric key \(k\) from a distribution \(K\) (say, a uniformly random bytestring of fixed length)
Enc\((k, P; r) \to C\): Alice transforms plaintext message \(P\) into a ciphertext \(C\), with one-time randomness \(R\)
Dec\((k, C) \to P\): Bob transforms the ciphertext back into the original plaintext message \(P\)

Encryption satisfies two properties.

Correctness: Dec undoes the action of Enc. That is, for any key \(k\), message \(P\), and randomness \(R\) it holds that \[\text{Dec}(k, \text{Enc}(k, P; R)) = P.\]
Secrecy (informally): If an attacker Mallory sees the ciphertext \(C\) but not the symmetric key \(k\), then she has no chance of learning any part of the plaintext message \(P\).

Question. How can we make the security requirement precise? In particular:

What does it mean to learn “part” of the message \(P\)?
How can we take advantage of the randomness \(R\) when defining and constructing an encryption scheme?

Perfect secrecy

Today, we will study one way to formalize the secrecy guarantee.

Definition. An encryption scheme satisfies perfect secrecy if for any two messages \(P_0\) and \(P_1\), and for any ciphertext \(C\),

\[ \Pr[\text{Enc}(k, P_0) = C] = \Pr[\text{Enc}(k, P_1) = C],\]

where this probability is taken over the random choice of the key \(k \gets K\).

Here is another way to think about perfect secrecy.

If KeyGen samples keys uniformly at random, then perfect secrecy says that for any two messages \(P_0\) and \(P_1\), there are the same number of keys that will map \(P_0\) to \(C\) and that will map \(P_1\) to \(C\).
As a consequence, no amount of computing power is ever enough to recover the real message! Even if Mallory had the energy of the sun, it would not help.
Even a brute force attack will fail! Or more precisely, it willl find the correct message but also all possible incorrect messages, and will have no way to detect the difference.

14.4 One-time pads

Let’s build an encryption scheme that satisfies perfect secrecy.

Recall that Mallory can add, drop, alter, or re-order Alice’s messages. This is admittedly a powerful adversary to thwart.
So for today, we will consider a simpler adversary that only acts as a passive eavesdropper but does not change the contents of Alice’s messages. We call this adversary Eve.

Question. How can Alice encode messages so Mallory cannot read them?

Option 1: substitute each character (source: Wikipedia)

Option 2: substitute each word or phrase

Caesar cipher

Let’s try to build a cipher that substitutes one character of the message at a time.

Our first attempt will be an idea that has been used for thousands of years; in particular, Julius Caesar used it in his private correspondences.
It will not be a secure encryption scheme, but it will provide insights on how to build one.

The idea is to choose a key between 0 and 25, and rotate all characters by this amount. This figure shows the cipher with symmetric key \(k = 3\).

one \(\mapsto\) lkb
two \(\mapsto\) qtl

Question. Put yourself in the role of Eve. Can you break this encoding?

If Eve observes the ciphertext \(C\) for a known plaintext message \(P\), then she learns \(k\)
Even if Eve knows a single character of \(P\) and \(C\), she can encipher a new message: three \(\mapsto\) qeobb

The takeaway message is: reuse of the key leads to a “frequency attack.”

Cryptogram puzzle (image source: cryptograms.org)

Fixing the cipher

Let’s improve the Caesar cipher in two ways:

To improve security, give each character its own key
To make the cipher computer-friendly, let’s think about bits instead of characters.

The xor logical operation measures whether 2 inputs are identical. Given two bits \(a\) and \(b\), we can calculate their logical xor \(a \oplus b\) with this truth table.

\(\bigoplus\)	0	1
0	0	1
1	1	0

The one-time pad cipher has a simple construction: the ciphertext is the XOR of the key and the message.

OTP “masks” private message by applying Caesar cipher independently to each bit
Since XOR is a lossless function, this encryption scheme is invertible – in fact, XOR is its own inverse.

Here is an example:

      message  0110 0110 1001
  XOR key      0011 1100 1110
      -----------------------
      cipher   0101 1010 0111

Theorem. The one-time pad satisfies perfect secrecy!

There is a proof of this theorem in Section 2.1 of Boneh-Shoup, but I encourage you to try to prove this claim on your own before looking there. Remember that perfect secrecy says that for any two plaintext messages \(P_0\) and \(P_1\) and any ciphertext \(C\),

\[ \Pr_{k \gets K}[\text{Enc}(k, P_0) = C] = \Pr_{k \gets K}[\text{Enc}(k, P_1) = C].\]

Downsides of one-time pad

Even though it satisfies perfect secrecy, the one-time pad is rarely used on its own in practice today for a few reasons:

The symmetric key \(k\) must be as long as the plaintext message \(P\).
Reusing the same key for two or more messages breaks perfect security, and brings back the cryptograms issue.
The cipher is malleable! While OTP is safe against a passive eavesdropper Eve, an active attacker Mallory can manipulate the ciphertext in transit and cause the same change in the decrypted plaintext.

      message  0110 0110 1001               cipher   0101 1010 1000
  XOR key      0011 1100 1110           XOR key      0011 1100 1110
      -----------------------               -----------------------
      cipher   0101 1010 0111               message  0110 0110 0110

Next week, we will design pseudorandom functions and use them to build ciphers that:

Can encrypt many long messages from a single, short key
Combine encryption and message authentication into a single primitive, so as to protect our data even against an active attacker Mallory.