Crypto for Data Science

Syllabus Overview

Instructors

Meetings

  • Lecture: Tuesday and Thursday, 11:00am-12:15pm in CAS 213
  • Discussion labs: Monday, 9:05-9:55am and 10:10-11:00am in IEC B10
  • Office hours are posted on the course website. Currently:
    • Thurs 3:30-5pm for Mayank
    • Wed 3:30-5pm + Fri 11:30am-1pm for Tejovan

Websites

Test dates (mark your calendars now!)

  • Tuesday, February 18 in lab (this day follows a Monday schedule)
  • Thursday, March 20 in lecture
  • Monday, April 7 in lab
  • Tuesday, April 29 in lecture

Final exam: Yes, at the date and time specified in the BU Final Exam Schedule. Tentatively, our class is slated to have a final exam on Tuesday, May 6 at 12-2pm.


Course description

CDS DS 653 investigates techniques for performing trustworthy data analyses without a trusted party, and for conducting data science without data sharing.

  • The first half of the course investigates cryptocurrencies, the blockchain technology underpinning them, and the incentives for each participant.
  • The second half of the course focuses on privacy and anonymity using advanced tools from cryptography.

For undergraduate DS majors: this course satisfies the DS methodology elective in the “scalable & trustworthy DS” category, but it does not satisfy any Hub units.

Learning objectives

There are five parts to this course, each with their own learning objectives.

  1. Secure ledgers (currencies without centralization): using hash functions and digital signatures to authenticate people and data over the Internet, and exploring the mechanics of cryptocurrencies, blockchains, and mining
  2. Secure consensus (agreement without trust): reaching agreement over the Internet using distributed algorithms
  3. Secure transport (messaging without meeting): seeing how encryption protects data on the internet, and how secure messaging applications provide end-to-end encryption and pre- and post-compromise security.
  4. Secure analysis (data science without data sharing): using crypto to work with data you cannot see, or to prove compliance with a stated rule, policy, or regulation
  5. Crypto and society: discussing (in)appropriate uses of the tech we learn in this course

The full course schedule is available on the Piazza Course Schedule page. The schedule contains the required textbook reading, links to the lecture notes and videos, and assignments due each week. The schedule is subject to small changes, so please review the schedule weekly.

Prerequisites

There are a few prereqs for this course.

  • Mathematical maturity with linear algebra and probability concepts at the level of DS 121 and 122, or equivalent
  • Familiarity with algorithms at the level of DS 320, CS 330, or equivalent

Note that no prior background in cryptography is necessary.

If you do not meet the prerequisites of this course, please talk with me after today’s lecture.


Course grading

This course uses a non-traditional grading system. The upshot for you is:

  • There are limited opportunities for partial credit in the grading rubrics, but…
  • There are opportunities to learn from mistakes, retry work, and demonstrate growth.
  • Read the syllabus to understand the policies and procedures used in this course!

A typical week of this course has a lab on Monday, lectures on Tuesday and Thursday, and homework due on Friday evening at 10pm. Your course grade is based on satisfactory completion of the following:

  • Homework: 10 reading- and programming-based assignments that are auto-graded. You can submit the homework to Gradescope as many times as you want before the due date of Friday at 10pm (or by the late deadline of Sunday 10pm, if you have remaining late hours).

  • Worksheets: 10 in-class assignments that you complete during Monday discussion labs for participation credit. If you miss a lab, you can turn in the worksheet later in the week, but then it is graded for correctness.

  • Tests: 4 tests in this course, each with 9 questions. Each question is graded on a scale of: Satisfactory, Revisable, Try Again. You can retry questions after the test for partial credit. (Collectively the homework, worksheets, and tests form your Base Grade in this course.)

  • Challenge problems: 6 open-ended problems that you must complete by yourself. These are optional, and can raise your course grade. You can submit each question twice: once on the original due date, and then a second time to address any remaining issues.

  • Final exam containing 17 questions. It can raise or lower your final course grade by one step (e.g., from a B+ to an A- or B).

Please read the syllabus carefully. You are responsible for adhering to course policies at all times, especially the academic code of conduct, plagiarism, AI, and collaboration policies. Also, your final grade is not the average of the grades on each assessment; the actual procedure is described in the syllabus.

The syllabus can be downloaded from a link at the top of the Piazza course information page. The summary provided in these lecture notes should not be considered as a substitute for the syllabus itself.

After you read the syllabus, you must complete the course survey by Friday, January 24.

Lecture 1: Course Overview

“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”

– Ben Adida

1.1 Introduction to crypto(graphy)

This is a course about crypto… which is admittedly a term with many different meanings to different people. It’s also an abbreviation with (at least) three different endings.

  • Cryptocurrencies
  • Cryptography
  • Cryptology

We will get to the topic of cryptocurrencies in a few minutes. But first, here is a question to start us off in this course: what is cryptology?

Cryptology comes from two Greek word parts:

  • kryptos, meaning secret or hidden
  • logy, meaning the study of

So the goal of cryptology is to study the art of keeping secrets.

In its modern form, crypto uses hard math problems as a way to encode data in order to restrict who can use it, how they can use it, and sometimes even when and where they can access it.

Cryptology can be divided into two sub-fields:

  • Cryptography: the art of making codes.
  • Cryptanalysis: the art of breaking codes.

Both cryptography and cryptanalysis are challenging, multi-disciplinary tasks. They also reinforce each other, due to Schneier’s law.

“Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can’t break.”

– Bruce Schneier

In this course we will almost exclusively focus on the defensive, cryptographic side. Even though I won’t say too much about cryptanalysis in this course, it is important to know that the cryptographic primitives that I will show you in this course have been vetted through decades of cryptanalysts by thousands of people, collectively spanning many centuries or millennia of people-years. No single one of us can compete with that.

“Don’t roll your own crypto.”

– (Every cryptographer)

Crypto is a scientific field at the intersection of many disciplines. To accomplish their goals, cryptographers use:

  • Algorithms techniques in their protocol designs.
  • Engineering techniques to produce vetted, secure software implementations of their protocols.
  • Complexity theory to demonstrate security reductions from new protocols to more well-studied ones.
  • Mathematics for cryptanalysis.

This class will primarily make use of the first two skills: algorithms and software engineering.

While we won’t delve into the specific type of math used in cryptanalysis, nevertheless we will often use concepts from algebra and probability.


Crypto and data science

This is also a course about how crypto can be used within data science.

It’s likely that you have already used crypto in all of your data science projects, perhaps without explicitly thinking about it.

Lock icon on kaggle.com

In this image, the lock icon in the web browser means that my computer has a protected connection to the kaggle.com servers.

At a high level, this means that nobody else on the internet can

  • see which dataset I am downloading from kaggle.com; in other words, the data remains confidential, and
  • tamper with the dataset in transit; in other words, data integrity is maintained.

(Note though that the kaggle.com servers can see what you are doing on their website. We will see later in the semester how even this can potentially be avoided.)

Evolution of the Internet

Using crypto is an essential part of providing security over the Internet, which has always been designed as an open network. There is no physical wire that directly connects your computer with the kaggle.com server. Instead, the data flows through several other computers along the way.

Here is a picture of all of the computer servers connected to the Internet in 1968.

The Internet in 1968 (source: pwnallthethings on Twitter)

Of course, the Internet is much larger now.

Growth of the Internet over time (source: Wikipedia)

But the design of the Internet has fundamentally remained the same.

“The Internet is just the world passing notes in a classroom.”

– Jon Stewart

Getting the crypto right is essential here.

  • If the system is vulnerable to decoding or tampering by unauthorized parties, then it can lead to universal, covert breaches of security. Moreover, everyone will have a false sense of security.

  • If the system is too onerous or slow to use, then people will ignore it and move to other, less secure options.


Goals of cryptography

We will see throughout this course that crypto can be used in many ways besides network transmissions on the Internet.

Still though, this example highlights a general theme of all of our applications:

Crypto is a social science that masquerades as a mathematical science.

Our goal is to use the tools from math, computer science, and data science in order to design systems that help people.

At a high level, cryptographic systems attempt to provide (some or all of) the following three objectives.

  • Confidentiality, meaning to keep data or people’s identities hidden from others.
  • Integrity, which means preventing raw data or the data science results from being altered, or at least allowing for detection of tampering.
  • Availability, or ensuring that data science remains accessible and censorship-resistant.

These three security objectives are sometimes referred to as the “CIA triad.”

Admittedly, these three objectives are both underspecified (lacking mathematical rigor) and overloaded (the terms have many different sub-meanings).

There are many different flavors of confidentiality, integrity, and availability that cryptography can provide… in fact, far more than we have time to cover in this course.

Types of security guarantees (source: Handbook of Applied Cryptography, page 4)

(Note: BU courses CS 357 and CS 538 also cover different aspects of cryptography and its application in modern digital systems. This course has a partial overlap with them, but we will cover many topics that they don’t and vice-versa. You are welcome to register for those courses as well, but they are not required knowledge.)


Crypto and policy

Because crypto is a scientific field that has social impacts, it is probably inevitable that crypto has clashed with law and policy over the past five decades.

  • On the one hand, crypto offers ways to upend existing norms, laws, and rules… in both good and bad ways.

  • On the other hand, the use of crypto is influenced and regulated by the law and politicians.

“Cryptography rearranges power: it configures who can do what, from what. This makes cryptography an inherently political tool, and it confers on the field an intrinsically moral dimension.”

– Phillip Rogaway

We will see aspects of the two-way connection between crypto and policy throughout the semester, and we will spend the last few weeks focusing exclusively on it.


1.2 Introduction to crypto(currencies)

[Note: This section of the lecture is based on Prof. Abishek Jain’s slides for JHU CS 601.641.]

Pausing the cryptography discussion for a moment, let’s provide an overview of cryptocurrencies like Bitcoin and Ethereum. We will spend the next few weeks delving into their design.

From the start, it is worthwhile to state what this course is NOT about.

This is not a finance course on cryptocurrencies. You should not expect to be taught how to invest in cryptocurrencies or how to become a billionaire overnight (or ever, for that matter).

In fact, the cryptocurrency market is still in its infancy and is incredibly volatile. Investing in cryptocurrencies is a good way to lose money. So this course doesn’t offer any investment advice, except in this sentence: I advise you not to invest in cryptocurrencies.

Still though, cryptocurrencies are a worthy topic of study in order to understand their capabilities, potential, and limitations.

In the first unit of this course, we will explore the following topics about cryptocurrencies.

  • Understanding the mechanics of blockchains
  • Understanding why current implementations work
  • Understanding the necessary cryptographic background
  • Exploring applications of blockchains to cryptocurrencies and beyond
  • Understanding limitations of current blockchains

Let’s start at the top of the list. What is a blockchain? It is:

  • A distributed ledger or database.
  • Used for building decentralized cryptocurrencies such as Bitcoin.
  • Capable of improving many other applications such as distributed Domain Name system (DNS), Public-Key Infrastructure (PKI), stock trade databases, etc.
  • The basis of lots of exciting academic research and industry startups.

Transferring digital money

Blockchains are a data structure to keep track of transactions. Here is the motivating scenario for this unit:

  • There are two people, who (following the typical convention in crypto) we will call Alice and Bob.
  • Alice would like to transfer $1 to Bob.

Let’s think about

  • The functionality of how this could work, i.e., how it can work if everyone is honest.
  • The security concerns, i.e., ways this could go wrong if someone is malicious.

Let’s start by exploring how money transfers work in the world of physical banking. Suppose Alice writes a check to Bob. What happens?

Alice writes a physical check to Bob

If everyone is honest, the money is transferred because a bank performs the bookkeeping task of recording every account and its balance.

19th century ledger (source: Wikipedia)

What could go wrong?

Now let’s think about what happens if each party is malicious.

We will do this often throughout the course. Remember that crypto is about getting things done in the presence of an adversary. So, we will develop the skill of `thinking like an adversary,’ in order to design secure cryptosystems.

For this single transaction, Alice doesn’t really care whether Bob is malicious or not. That’s because she isn’t expecting to receive anything of value.

In the other direction: even if Bob trusts the bank (for now), he may be suspicious that Alice is trying to deceive him. He needs to be able to verify the following three properties.

  1. Alice properly signed the check: To accomplish this goal, Bob can verify the signature himself, say by comparing the check against Alice’s identification card.

  2. Alice possesses $1 in her bank account: Bob can ask the bank to verify that Alice’s account balance is sufficient for the check to clear. Note that Bob doesn’t even need to learn Alice’s balance, just whether it is greater than $1 or not.

  3. Alice does not double spend the money by writing a check to someone else at the same time: This is the toughest property to achieve, because neither Bob nor the bank know yet whether Alice has overdrafted her account. They will only discover this later; if so, Alice and Bob can use the bank and the legal system to settle their dispute.

By contrast, the digital world is diverse and international. There is no single entity that everyone in the world trusts, nor is there a single legal system to identify and prosecute thieves and money launderers.

So, if we want a digital system of money, then we need a different approach altogether.

It will take us a few lectures to build up the tools for a digital system of transferring money. First, we need to understand two essential crypto tools: digital signatures and hash functions.


1.3 Introduction to Digital Signatures

Today, we only focus on property #1: building a digital version of signing a check.

  • The digital equivalent to “signing a check” is called, appropriately enough, a digital signature scheme.
  • The most common standard is called the digital signature algorithm, or DSA.

A digital signature allows Alice, when given any message, to append a sequence of bits called a signature that satisfies two properties (which we will state informally for now).

  1. Correctness: Everyone in the world can verify whether Alice created the signature.
  2. Unforgeability: Only Alice can produce the signature, and it is bound to the message \(M\).

Also, there are (at least) two properties that digital signatures do not achieve on their own.

  • Receiver authentication: Because a signature can be verified by anyone, on its own it will not tell Bob that he was the intended target of the message. If Alice wants to convey that Bob is the intended target, she should do so in the message contents \(M\).

  • Thwarting replay attacks: If Bob has a digital signature, he can show it to others (e.g., a bank) as many times as he wants. To solve this issue, we will have to be clever in thinking about the format of the message \(M\) – for instance, perhaps it should contain a unique serial number, so the bank can detect if the same check is being cashed twice.


Cryptographic keys

To satisfy the unforgeability goal, Alice must have something that the rest of the world does not. We call this special something a “secret key.”

Informally, a digital signature allows Alice to “lock” her message using her secret key.

To satisfy the correctness goal, there is a corresponding public key that anyone can use to verify whether a message was previously “locked” by Alice.

Informal description of encryption and digital signatures (source: idea-instructions.com)

In more detail, a secret key is a long, random sequence of bytes. One common choice is to pick a key that is \(32\) bytes (or \(32 \times 8 = 256\) bits) in length.

For example, here is a secret key (also known as a private key):

# first install PyCryptoDome with, e.g., "pip install pycryptodome"
from Crypto.PublicKey import ECC
secretKey = ECC.generate(curve='P-256')

print(secretKey.export_key(format='PEM'))
-----BEGIN PRIVATE KEY-----
MIGHAgEAMBMGByqGSM49AgEGCCqGSM49AwEHBG0wawIBAQQgOoFxs2aWP2uQXOrT
pnRDYLyhSRjlB1ptYlkXJUuRCBahRANCAAQIlzVRFeiRx8qRiKxjBzOBuWAbqeQQ
Vh5vkA7vbSrmimx3kfT0Epf70DNUZIV1AwjTbZe+1NMqbXDqPA3EUzEw
-----END PRIVATE KEY-----

This code example is just for illustrative purposes. Do not export a secret key that you are using for any legitimate purpose! It must remain secret, after all. And definitely do not use this specific key that I just posted on the Internet.

As the name suggests, only Alice knows the secret key; she never shares it with anyone. (Hence my warning above.) Additionally, the key is long enough that it is incredibly unlikely that anyone else in the world will be able to guess it by chance.

As a result, knowing the key identifies Alice, in the sense that if you meet a stranger on the Internet who can has the ability to sign a new message that can be verified with Alice’s public key, then this Internet stranger must be Alice! (…Or someone who has compromised Alice’s computer and stolen her keys.)

To give you a sense of the scale here, let’s imagine that I tried to forge Alice’s key in a brute-force manner, simply by guessing every single 32-byte long string and trying to use it to “lock” the message just as Alice did.

Concretely, suppose I create a for loop that iterates from:

from binascii import hexlify
print(hexlify(32 * b'\x00')) # 32 copies of the byte value 0 (printed in hex)
b'0000000000000000000000000000000000000000000000000000000000000000'

to this:

print(hexlify(32 * b'\xff')) # 32 copies of the byte value 255 (printed in hex)
b'ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff'

The set of possible secret keys is finite – concretely, there are \(2^{256}\) possible options. So, it is theoretically possible to enumerate over all of them.

But it isn’t practically feasible to do so. - Running this for loop on my laptop would require more time than the expected remaining length of the universe. - And simply flipping all of these bits back and forth in RAM would consume approximately the energy of the sun.

Granted, surely there might be more effective ways for an attacker to try to break our digital signature scheme (i.e., to forge a signature) that are smarter than this guess-and-check method. But our hope is that any method of forging signatures is infeasible in practice.

We will discuss next week how to formalize this guarantee, and then discuss a few ways of achieving it.


1.4 Representations of data

For the rest of today’s lecture, let’s explore how computers — and specifically, Python programs — represent data. Throughout this course, we will typically represent data as abstractly as possible: as a sequence of bits of a particular length.

  • A bit is a single 0 or 1 value. That is, a bit is an element of the set \(\{0, 1\}\).
  • A byte is a sequence of eight bits. In other words, a byte is an element of the set \(\{0, 1\}^8\). (Note that there is a one-to-one mapping between bytes and integers in the range 0-255.)
# Table of all 256 values of a byte, in decimal and binary formats

print("byte  bits")
for x in range(9): print(format(x, '#4d'), "", format(x, '#010b'))
print(" ...")
for x in range(248,256): print(format(x, '#4d'), "", format(x, '#010b'))
byte  bits
   0  0b00000000
   1  0b00000001
   2  0b00000010
   3  0b00000011
   4  0b00000100
   5  0b00000101
   6  0b00000110
   7  0b00000111
   8  0b00001000
 ...
 248  0b11111000
 249  0b11111001
 250  0b11111010
 251  0b11111011
 252  0b11111100
 253  0b11111101
 254  0b11111110
 255  0b11111111
  • A bitstring is an arbitrary-length sequence of bits. The set of all bitstrings of length \(n\) is denoted as \(\{0, 1\}^n\). The set of bitstrings of any possible length is denoted as \(\{0, 1\}^*\).

    (We will typically, though not always, consider bitstrings that are a multiple of 8 bits; i.e., that can be decomposed into bytes with no “leftover” bits.)

Within Python, you can represent a value of type byte using a string with the letter b in front of it. For example, here is a byte that holds the ASCII value of the character A.

s = b'A'
print(s)
type(s)
b'A'
bytes

Some (but not all!) bytes correspond to printable characters in the American character encoding standard, or ASCII.

print("byte  ASCII character")
for x in range(4): print(format(x, '#4d'), chr(x))
print(" ...")
for x in range(65,70): print(format(x, '#4d'), chr(x))
print(" ...")
for x in range(251,256): print(format(x, '#4d'), chr(x))
byte  ASCII character
   0 
   1 
   2 
   3 
 ...
  65 A
  66 B
  67 C
  68 D
  69 E
 ...
 251 û
 252 ü
 253 ý
 254 þ
 255 ÿ

Representations of numbers

Python natively represents numbers in the following formats:

Data type Base # bits encoded in each symbol
bin 2 1
int 10 about 3.32
hex 16 4

The hexadecimal, or base 16, representation of a number uses the symbols 0-9 and a-f to denote the values 0 through 15.

By convention, Python places a 0x prefix in front of a number to remind you that it is written in hex format.

print("int   hex")
for x in range(16):
    print(format(x, '#3d'), " ", format(x, '#04x'))
print("...")
for x in range(240,256):
    print(format(x, '#3d'), " ", format(x, '#04x'))
int   hex
  0   0x00
  1   0x01
  2   0x02
  3   0x03
  4   0x04
  5   0x05
  6   0x06
  7   0x07
  8   0x08
  9   0x09
 10   0x0a
 11   0x0b
 12   0x0c
 13   0x0d
 14   0x0e
 15   0x0f
...
240   0xf0
241   0xf1
242   0xf2
243   0xf3
244   0xf4
245   0xf5
246   0xf6
247   0xf7
248   0xf8
249   0xf9
250   0xfa
251   0xfb
252   0xfc
253   0xfd
254   0xfe
255   0xff

Converting between bytestrings and numbers

The Python libraries binascii and PyCrypto provide commands to convert between raw bytes and numbers in bin, int, or hex formats. ( You can install PyCrypto on your computer using the command “pip install pycryptodome.”)

from Crypto.Util.number import bytes_to_long, long_to_bytes # requires PyCrypto
from binascii import hexlify, unhexlify

t = bytes_to_long(s)
print(t)
print(type(t))
65
<class 'int'>
u = hexlify(s)
print(u)
print(type(u))
b'41'
<class 'bytes'>
v = bin(t)
print(v)
print(type(v))
0b1000001
<class 'str'>

Note that Python uses the prefix 0b to denote a number in binary format.

The base64 library and base58 library offer additional options.

Base # bits encoded in each symbol
32 5
58 about 5.86
64 6

(There are other valid bases in which to represent numbers, although these are all the representations we will see in this course)