What brings you to Androma?

This course unifies three fundamental problems in information science through Shannon's elegant theoretical framework: how to encode data efficiently, how to protect it against noise, and how to keep it secret. Coding theory and cryptography might seem like disparate topics, but they share a common mathematical foundation rooted in entropy and information. The first half of the course establishes this foundation, beginning with the quantitative measure of information itself—entropy—and showing how it governs the limits of lossless compression. From there, we move to the practical challenge of transmitting information over noisy channels, where Shannon's theorems reveal the fundamental trade-off between data rate and error correction, and introduce the notion of channel capacity as the ultimate speed limit for reliable communication. The second half translates these principles into concrete constructions and new domains. Chapters 5 and 6 move from Shannon's existence proofs to explicit algebraic codes—linear codes, Hamming codes, cyclic codes, and BCH codes—that achieve near-optimal performance and admit efficient decoding. These chapters show how abstract algebra provides the machinery to build codes that actually work in practice. The final chapters pivot toward cryptography, where the framework shifts subtly: instead of protecting against random noise, we protect against a sophisticated adversary. Classical and modern cryptography are presented as natural extensions of information theory, beginning with the information-theoretic security of the one-time pad and progressing to stream ciphers, public-key systems, and protocols where computational hardness, rather than information-theoretic impossibility, provides the security guarantee. Throughout, the course emphasises the deep connections between these three threads. Entropy measures redundancy in source coding and randomness in cryptographic keys. Channel coding and cryptographic protocols both concern reliable communication in adversarial or noisy settings. The algebraic structures — linear codes, finite fields, number theory — appear repeatedly, each time in service of a different goal: correcting errors, achieving computational security, or enabling efficient protocol design. # 1. Noiseless Coding and Entropy This chapter opens the course by asking a deceptively simple question: given a source producing symbols with known probabilities, what is the most efficient way to encode those symbols as strings over a coding alphabet? Efficiency means minimising the expected length of encoded messages. The answer, due to Shannon, turns out to depend on a single number — the *entropy* of the source — which simultaneously serves as a measure of the source's randomness and as a fundamental lower bound on how concisely any decoder can operate. We develop the theory from first principles: prefix-free codes and Kraft's inequality, Shannon's noiseless coding theorem, Huffman's algorithm for achieving optimality, and the extension to joint and conditional entropy. ## Codes and Decipherability Suppose you receive the binary string $0001$ and you know it was produced by a code with $c(1) = 0$, $c(2) = 1$, $c(3) = 00$, $c(4) = 01$. How do you decode it? Is it $c(3)c(2)$ (since $00 \cdot 1 = 0001$), or $c(1)c(1)c(4)$ (since $0 \cdot 0 \cdot 01 = 0001$), or something else? There is no way to tell — the encoding is ambiguous. Any useful code must guarantee that every encoded string has a *unique* decoding. The question of when this is possible, and at what cost, is what this section makes precise. The basic setup is two finite alphabets: a *source alphabet* $A$ and a *coding alphabet* $B$. We write $A^* = \bigcup_{n \geq 0} A^n$ for the set of all finite strings over $A$, including the empty string. The *concatenation* of strings $x = x_1 \cdots x_r$ and $y = y_1 \cdots y_s$ is $xy = x_1 \cdots x_r y_1 \cdots y_s$. [definition: Code] Let $A$ and $B$ be finite alphabets. A **code** is a function $c : A \to B^*$. The strings $c(a)$ for $a \in A$ are called **codewords**. A code with $|A| = m$ and $|B| = a$ is called an **$a$-ary code of size $m$**. [/definition] Any code $c : A \to B^*$ extends naturally to a map $c^* : A^* \to B^*$ by concatenation: $c^*(x_1 \cdots x_n) = c(x_1) \cdots c(x_n)$. [example: Greek Fire Code] Take $A = \{\alpha, \beta, \ldots, \omega\}$ (the Greek alphabet, 24 letters) and $B = \{1, 2, 3, 4, 5\}$. The code maps $\alpha \mapsto 11$, $\beta \mapsto 12$, and so on up to $\omega \mapsto 54$. A codeword $xy$ is transmitted by holding up $x$ torches and then $y$ torches nearby. This is a 5-ary code of size 24. [/example] The Greek fire code is a *block code*: every codeword has the same length, so the decoder simply reads off fixed-width chunks. A different — and in many ways more natural — way to build an unambiguous code is to use variable-length codewords separated by a dedicated marker, as the next example shows. [example: Dictionary Code] Take $A$ = the words of a dictionary and $B = \{A, B, \ldots, Z, \omega\}$ where $\omega$ is a special space character. The code $c$ sends each word to its spelling followed by a space. A message $x_1 \cdots x_n \in A^*$ is then encoded as $c(x_1) \cdots c(x_n) \in B^*$. [/example] For a code to be useful in practice, it must be possible to uniquely recover the original message from its encoding. [definition: Decipherable Code] A code $c : A \to B^*$ is **decipherable** if the induced map $c^* : A^* \to B^*$ is injective — that is, each string in $B^*$ corresponds to at most one message in $A^*$. [/definition] Injectivity of $c$ itself is necessary but not sufficient for decipherability. The following example illustrates the gap. [example: Injective but Not Decipherable] Let $A = \{1, 2, 3, 4\}$, $B = \{0, 1\}$, and define $c(1) = 0$, $c(2) = 1$, $c(3) = 00$, $c(4) = 01$. Then $c$ is injective. However, $c^*(114) = 0001 = c^*(312) = c^*(144)$, so the string $0001$ cannot be uniquely decoded. The code is not decipherable. [/example] Three natural classes of codes are always decipherable (given that $c$ is injective): - A **block code** assigns every symbol in $A$ a codeword of the same fixed length. The Greek fire code is an example. - A **comma code** reserves one letter of $B$ to mark the end of each codeword, as the dictionary code does with the space character. [definition: Prefix-Free Code] A code $c : A \to B^*$ is **prefix-free** if no codeword is a prefix of any other: whenever $c(a) = c(b) \cdot z$ for some $z \in B^*$, it follows that $z$ is empty (equivalently, $a = b$). Block codes and comma codes are both special cases of prefix-free codes. [/definition] Prefix-free codes are sometimes called **instantaneous** because a received string can be decoded symbol-by-symbol as it arrives, without waiting for future characters to determine where one codeword ends and the next begins. The moment you see a codeword completed, you can output the corresponding symbol — no lookahead required. This on-the-fly decodability is the reason prefix-free codes are the natural object to study. ## Prefix-Free Codes and Kraft's Inequality The fundamental question about prefix-free codes is: for which sequences of word lengths $(l_1, \ldots, l_m)$ does a prefix-free code exist? Not every length sequence is achievable. For instance, with a binary coding alphabet ($a = 2$) and four codewords of lengths $(1, 1, 2)$, any attempt to build a prefix-free code fails: assigning the two length-1 codewords exhausts all of $\{0, 1\}$, so there is nothing available for a length-2 codeword that avoids being an extension of an existing codeword. This failure has a clean arithmetic explanation: $2^{-1} + 2^{-1} + 2^{-2} = 1.25 > 1$. The precise characterisation is Kraft's inequality. [definition: Kraft's Inequality] Let $|A| = m$, $|B| = a$. A code $c : A \to B^*$ with word lengths $l_1, \ldots, l_m$ satisfies **Kraft's inequality** if \begin{align*} \sum_{i=1}^{m} a^{-l_i} \leq 1. \end{align*} [/definition] [quotetheorem:1626] [citeproof:1626] [remark: The Construction is Explicit] The inductive proof is constructive: list all strings of $B^*$ in lexicographic order by length, and greedily pick the next available string of the required length as the next codeword, skipping any that have an already-assigned codeword as a prefix. [/remark] It is worth pausing to appreciate what the hypothesis of prefix-freeness buys. The theorem is an *exact* characterisation: every length sequence satisfying Kraft's inequality is realised by some prefix-free code, and every prefix-free code's length sequence satisfies Kraft's inequality. In particular, Kraft's inequality being *tight* (sum equal to 1) corresponds to a code that perfectly tiles the infinite binary tree — no length capacity is wasted. Relaxing prefix-freeness to mere injectivity does not help: an injective code need not satisfy Kraft's inequality (it could assign two length-1 codewords to a binary alphabet, which blocks all extensions), but injective codes cannot encode arbitrary sequences unambiguously. This motivates asking whether decipherability — a weaker condition than prefix-freeness — also forces Kraft's inequality. Kraft's inequality is also necessary for *all* decipherable codes, not just prefix-free ones. This is McMillan's theorem. [quotetheorem:1627] [citeproof:1627] The key hypothesis in McMillan's theorem is decipherability, not prefix-freeness. Why is this the right condition? A code might be prefix-free but have poor expected length; a code might be neither prefix-free nor injective but accidentally satisfy Kraft's inequality for incidental reasons. Decipherability is exactly the operational condition we care about: the ability to uniquely decode any encoded message. The theorem says that *any* operationally useful code — no matter how cleverly designed — must satisfy the same arithmetic constraint that prefix-free codes satisfy. The $R$-th power trick is what drives the proof: it amplifies the Kraft sum to a regime where the counting argument (at most $a^l$ strings of length $l$) becomes decisive. Without decipherability, $b_l$ could exceed $a^l$ — the count of sequences of $R$ codewords summing to length $l$ could outstrip the available strings — and the argument collapses. Combining these two theorems yields a clean corollary that justifies restricting attention to prefix-free codes from now on. [quotetheorem:1628] [citeproof:1628] The upshot is fundamental: when designing optimal codes, we lose nothing by insisting on the prefix-free property. ## Entropy and Shannon's Noiseless Coding Theorem We now attach probabilities to the source. Suppose $X$ is a random variable taking values $\mu_1, \ldots, \mu_m$ (the elements of $A$) with probabilities $p_1, \ldots, p_m$. The goal is to find a prefix-free code $c : A \to B^*$ minimising the expected codeword length $\mathbb{E}[S] = \sum_{i=1}^{m} p_i l_i$. Before defining entropy formally, here is the intuition. Entropy should measure the "randomness" of $X$, or equivalently, how much information we gain on average when we observe $X$. If $X$ is concentrated on a single value, there is no surprise — entropy should be zero. If $X$ is uniform over many values, each outcome is genuinely surprising — entropy should be large. [definition: Entropy] Let $X$ be a random variable taking values $x_1, \ldots, x_n$ with probabilities $p_1, \ldots, p_n$. The **entropy** of $X$ is \begin{align*} H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i, \end{align*} measured in **bits**. By convention, $0 \log_2 0 = 0$ (consistent with the limit $p \log p \to 0$ as $p \to 0^+$). [/definition] Before working out entropy for specific distributions, it is worth developing some intuition: entropy is not the average surprise but a particular weighted average of log-surprises, and the weights are the probabilities themselves. The biased-coin calculation below makes this concrete and ties the abstract definition to the expected number of fair flips needed to simulate the source. [example: Biased Distribution] Suppose $(p_1, p_2, p_3, p_4) = (1/2, 1/4, 1/8, 1/8)$. Think of this as a biased coin game: flip once; if heads ($p = 1/2$) output $x_1$ and stop; otherwise flip again; and so on. Then \begin{align*} H(X) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = \frac{7}{4} \text{ bits}. \end{align*} The expected number of fair coin flips to simulate one draw of $X$ is exactly $7/4$. This distribution is less random than the uniform one over four symbols (which has entropy $2$ bits), and the lower entropy reflects that: $x_1$ arriving half the time creates a predictable pattern that compresses well. [/example] The biased-coin case above generalises to a function of a single parameter — the probability of heads — that appears so often it deserves its own name and graph. Tracing the entropy as this parameter varies from $0$ to $1$ reveals the shape that underlies most of the qualitative intuition about entropy. [example: Binary Entropy Function] For a biased coin with $\mathbb{P}(H) = p$ and $\mathbb{P}(T) = 1-p$, the entropy is the **binary entropy function** \begin{align*} h(p) = -p \log_2 p - (1-p) \log_2(1-p). \end{align*} At the extremes $h(0) = h(1) = 0$: a coin that always lands heads (or always tails) is perfectly predictable — observing it gives you no information. This is not surprising; it would be alarming if a deterministic coin had positive entropy. At $p = 1/2$, $h(1/2) = 1$ bit: one fair flip carries exactly one bit of information. Computing $h'(p) = \log_2 \frac{1-p}{p}$, the function is strictly increasing on $(0,1/2)$ and strictly decreasing on $(1/2,1)$, making $p = 1/2$ the unique maximum. The function is also strictly concave on $(0,1)$, which is a special case of the concavity of entropy as a function of the distribution — a fact that will reappear when we prove that conditioning reduces entropy. [/example] The key inequality relating entropy to other distributions is Gibbs' inequality, sometimes called the log-sum inequality. [quotetheorem:1629] [citeproof:1629] An immediate consequence is that entropy is maximised by the uniform distribution. [quotetheorem:1630] [citeproof:1630] The proof is a one-line application of Gibbs, but the statement deserves a moment of commentary: it tells us exactly which distribution is "most random" and by how much entropy exceeds this maximum for any other distribution. [remark: Why Uniform Maximises Entropy] This theorem has a clean intuitive reading: entropy measures how uncertain we are about the outcome of $X$, and we are most uncertain when all outcomes are equally likely. A distribution concentrated on a few values gives away partial information ("it's probably $x_1$"), which reduces entropy. The uniform distribution withholds all such structure — every outcome is as likely as every other — and so it carries the maximum possible uncertainty, $\log_2 n$ bits. Equivalently, the uniform distribution is the *hardest* source to compress: there is no pattern to exploit. [/remark] We are now ready for the central theorem of noiseless coding. For a source $X$ with values in an alphabet $A$ of size $m$ and a coding alphabet $B$ of size $a$, call a code **optimal** if it achieves the minimum expected word length $\mathbb{E}[S] = \sum_{i=1}^{m} p_i l_i$ over all decipherable codes. [quotetheorem:1631] [citeproof:1631] Shannon's noiseless coding theorem is, when you first encounter it, genuinely surprising. The entropy $H(X)$ was defined as an abstract measure of randomness. There is no obvious reason why the minimum achievable expected codeword length — a purely operational quantity, the best any algorithm can do — should be controlled by this particular formula. Yet Shannon showed they are the same thing, up to a 1-bit additive term that cannot be removed in general. This is not a computational statement about specific algorithms: it bounds every decipherable code simultaneously. It is equally important to note what the theorem does *not* say. It does not construct an optimal code — the ceiling construction achieves the upper bound but typically leaves a gap up to 1 bit. It does not give a finite-length rate statement: the guarantee is per-symbol for a source emitting single draws; for blocks of $n$ symbols the gap shrinks to $1/n$, but that is a separate argument. And it does not tell you what happens with a noisy channel — that requires the more sophisticated Shannon channel coding theorem, which is the subject of later chapters. [remark: When Equality Holds] The lower bound is achieved ($\mathbb{E}[S] = H(X)/\log a$) if and only if $p_i = a^{-l_i}$ for each $i$, i.e. each probability is a negative integer power of $a$. Such "dyadic" distributions are rare in practice, but the gap is at most 1 regardless. [/remark] ### Shannon–Fano Coding The upper bound in Shannon's theorem is constructive: given $p_1, \ldots, p_m$, set $l_i = \lceil -\log_a p_i \rceil$ and build a prefix-free code by greedily assigning codewords in order of increasing length (Kraft's inequality guarantees this is always possible). [example: Shannon-Fano Encoding] Take $a = 2$ and $(p_1, \ldots, p_5) = (0.4, 0.2, 0.2, 0.1, 0.1)$. The ceiling word lengths and an assignment are: | $i$ | $p_i$ | $l_i = \lceil -\log_2 p_i \rceil$ | codeword | |---|---|---|---| | 1 | 0.4 | 2 | 00 | | 2 | 0.2 | 3 | 010 | | 3 | 0.2 | 3 | 011 | | 4 | 0.1 | 4 | 1000 | | 5 | 0.1 | 4 | 1001 | The expected length is $\mathbb{E}[S] = 2(0.4) + 3(0.2) + 3(0.2) + 4(0.1) + 4(0.1) = 2.8$ bits. The entropy in base 2 is $H(X) \approx 2.12$ bits, consistent with $H(X) \leq \mathbb{E}[S] < H(X) + 1$. [/example] ## Huffman Coding Shannon's theorem guarantees that optimal codes exist, but it does not tell us how to build them. The Shannon–Fano construction from the previous section is a natural attempt: assign codeword length $\lceil -\log_a p_i \rceil$ to symbol $i$ and build a prefix-free code. For the distribution $(0.4, 0.2, 0.2, 0.1, 0.1)$, this gives expected length $2.8$ bits. But consider the same distribution through a different lens: the symbol with probability $0.4$ receives a length-2 codeword, even though it appears in $40\%$ of messages. You might suspect it deserves a length-1 codeword — and you would be right. Huffman coding does exactly this and achieves expected length $2.2$ bits, beating Shannon–Fano by $0.6$ bits. This is not a numerical quirk: Shannon–Fano is provably suboptimal whenever the probabilities are not negative integer powers of $a$. The mistake Shannon–Fano makes is rounding each length independently; Huffman's insight is that lengths must be chosen *jointly* to satisfy Kraft's inequality as a system. Huffman gave a recursive algorithm that always produces an optimal prefix-free code. The idea is a greedy merging strategy. At each step, merge the two least probable symbols into a single "meta-symbol," build a code for the reduced alphabet, then split the meta-symbol back by appending 0 and 1 to its codeword. ### The Huffman Algorithm Fix $|B| = 2$ (the binary case). Given source probabilities $p_1 \geq p_2 \geq \cdots \geq p_m$: - **Base case** ($m = 2$): assign codewords $0$ and $1$. - **Inductive step** ($m > 2$): form a new source with symbols $\mu_1, \ldots, \mu_{m-2}, \nu$ and probabilities $p_1, \ldots, p_{m-2}, p_{m-1} + p_m$. Build a Huffman code $c_{m-1}$ for this reduced source. Then set $c_m(\mu_i) = c_{m-1}(\mu_i)$ for $i \leq m-2$, and $c_m(\mu_{m-1}) = c_{m-1}(\nu) \cdot 0$, $c_m(\mu_m) = c_{m-1}(\nu) \cdot 1$. The resulting code is always prefix-free. [example: Huffman Coding] Revisiting the source with $(p_1, \ldots, p_5) = (0.4, 0.2, 0.2, 0.1, 0.1)$. We merge the two smallest probabilities at each stage: **Step 1:** Merge $p_4 = 0.1$ and $p_5 = 0.1$ to get probability $0.2$. Sorted: $\{0.4, 0.2, 0.2, 0.2\}$. **Step 2:** Merge two of the $0.2$'s to get $0.4$. Sorted: $\{0.4, 0.4, 0.2\}$. **Step 3:** Merge $0.4$ and $0.2$ to get $0.6$. Sorted: $\{0.6, 0.4\}$. **Step 4:** Merge to get $1.0$; assign codewords $0$ and $1$. Reading back: the code for symbol 1 (probability 0.4) is $1$, giving expected length $\mathbb{E}[S] = 0.4 \cdot 1 + 0.2 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 4 + 0.1 \cdot 4 = 2.2$ bits. This is shorter than the Shannon–Fano code's $2.8$ bits. [/example] [quotetheorem:1632] [citeproof:1632] The proof above relied on a structural fact about optimal codes that we now state and prove properly. The fact that any optimal code can be put into a canonical form compatible with the Huffman structure is not obvious: different optimal codes may assign codewords in different ways, and we need to know that at least one of them has the paired-sibling structure the induction requires. [quotetheorem:1633] [citeproof:1633] With the structure theorem in hand, a natural follow-up question is whether Huffman's algorithm produces *the* optimal code or merely *an* optimal code. The two possibilities differ because different tie-breaking choices during the merging steps can yield genuinely different trees. [remark: Non-Uniqueness of Huffman Codes] Huffman codes are not unique: different tie-breaking choices (when merging symbols of equal probability) yield different codes. However, all Huffman codes for a given source have the same expected word length. Moreover, not every optimal code is a Huffman code — but every optimal code has the same word length distribution as some Huffman code. [/remark] Why is merging the *two least probable* symbols the right move? Intuitively, the two rarest symbols will receive the longest codewords in any optimal code (by the Structure theorem: more probable symbols get shorter codewords). Since they will be siblings at the bottom of the code tree anyway, we may as well decide their shared parent first and recurse on the reduced problem. Any other merging strategy would incorrectly assign a long codeword to a more probable symbol, increasing expected length. It is also worth asking what Huffman coding does *not* give. First, for non-binary alphabets ($|B| > 2$), the algorithm generalises but requires a careful initialisation to ensure the final step merges exactly $a$ symbols — sometimes dummy symbols of probability 0 must be added. Second, in the worst case the gap between $\mathbb{E}[S]$ and $H(X)/\log_2 a$ can approach 1 bit per symbol; for sources with many symbols of similar probability, this 1-bit overhead matters in practice. Third, Huffman codes encode each source symbol independently: they cannot exploit statistical dependencies between consecutive symbols. *Arithmetic coding* overcomes this limitation by encoding entire messages as single intervals, approaching entropy arbitrarily closely even without dyadic probabilities — but that is a topic for a more advanced treatment. ## Joint and Conditional Entropy The entropy $H(X)$ captures the uncertainty in a single random variable, but it can mislead when two variables are observed together. To see why, consider the pair $(X, Y)$ distributed uniformly over $\{(0,0), (1,1)\}$. Each marginal is a fair coin flip, so $H(X) = H(Y) = 1$ bit. Naively adding these gives $2$ bits as the cost of encoding the pair — but the pair $(X,Y)$ takes only two values, so $H(X,Y) = 1$ bit. The single-variable entropies overestimate by exactly $1$ bit, because knowing $X$ tells you $Y$ for free: the two variables are perfectly correlated. The lesson is that $H(X) + H(Y)$ is the correct encoding cost only when $X$ and $Y$ are independent. In general, joint entropy accounts for dependencies between variables and can be strictly less. [definition: Joint Entropy] The **joint entropy** of random variables $X$ and $Y$ is \begin{align*} H(X, Y) = -\sum_{x \in A} \sum_{y \in B} \mathbb{P}(X = x, Y = y) \log_2 \mathbb{P}(X = x, Y = y). \end{align*} This extends to any finite collection of random variables in the natural way. [/definition] The definition treats $(X,Y)$ as a single random variable taking values in $A \times B$, and applies the entropy formula to its joint distribution. The question is how $H(X,Y)$ relates to $H(X)$ and $H(Y)$ separately. The answer is given by subadditivity. [quotetheorem:1634] [citeproof:1634] The proof is short, but the inequality deserves unpacking because the *gap* between the two sides is itself a quantity of central importance in information theory. It measures precisely how much $X$ and $Y$ "share," and it has its own name. [remark: Interpretation of Subadditivity] The inequality $H(X,Y) \leq H(X) + H(Y)$ says that two correlated random variables together carry less uncertainty than they would if they were independent. The "missing" entropy $H(X) + H(Y) - H(X, Y) \geq 0$ is called the **mutual information** $I(X;Y)$, which measures how much knowing one variable reduces uncertainty about the other. In the extreme example above — $(X,Y)$ uniform on $\{(0,0),(1,1)\}$ — the mutual information is $1$ bit, the full entropy of either marginal. Mutual information will reappear as a central quantity when we study channel capacity. [/remark] To go further, we need to formalise what it means for uncertainty to decrease when we observe one of the variables. This is the role of conditional entropy. [definition: Conditional Entropy] The **conditional entropy** of $Y$ given $X$ is \begin{align*} H(Y \mid X) = -\sum_{x \in A} \sum_{y \in B} \mathbb{P}(X = x, Y = y) \log_2 \mathbb{P}(Y = y \mid X = x). \end{align*} Equivalently, $H(Y \mid X) = \sum_{x} \mathbb{P}(X = x) \cdot H(Y \mid X = x)$, the expectation over $x$ of the entropy of the conditional distribution of $Y$. [/definition] The second form of the definition makes the operational meaning transparent: $H(Y \mid X)$ is the average remaining uncertainty about $Y$ after $X$ has been observed. For each realisation $X = x$, we compute the entropy of the conditional distribution $\mathbb{P}(Y = \cdot \mid X = x)$, then average over $x$. If $X$ and $Y$ are independent, observing $X$ gives no information about $Y$, and $H(Y \mid X) = H(Y)$. If $Y$ is a deterministic function of $X$, then $H(Y \mid X) = 0$: once $X$ is known, $Y$ is determined. The chain rule ties these pieces together. For any jointly distributed $X$ and $Y$, we can decompose the joint uncertainty into two sequential steps: first incur the cost $H(X)$ of learning $X$, then incur the residual cost $H(Y \mid X)$ of learning $Y$ given $X$. [quotetheorem:1635] [citeproof:1635] The chain rule is a decomposition, not an inequality. It says that the joint uncertainty is accounted for exactly by the cost of observing $X$ first, then the residual cost of observing $Y$. Crucially, the decomposition is asymmetric: $H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$, so both orderings give the same total, but the intermediate quantities differ unless $X$ and $Y$ are independent. Subadditivity — $H(X,Y) \leq H(X) + H(Y)$ — is an immediate consequence: by the chain rule, $H(X,Y) = H(X) + H(Y \mid X)$, and since conditioning reduces entropy ($H(Y \mid X) \leq H(Y)$, because averaging over observed information cannot increase uncertainty), the inequality follows. These concepts — joint entropy, conditional entropy, the chain rule, and mutual information — are not isolated tools; they form the vocabulary for describing what a noisy channel can and cannot transmit. In the next chapter, we ask: when a source sends a symbol across a channel that corrupts the signal with some probability, how much information gets through? The answer is the channel capacity, defined as the maximum mutual information $I(X;Y)$ over all input distributions $\mathbb{P}(X = x)$. Shannon's channel coding theorem then shows that reliable communication is possible if and only if the transmission rate falls below this capacity. All the machinery built in this chapter will be indispensable there. # 2. Error-Correcting Codes: Hamming Codes and Bounds Chapter 1 dealt with noiseless coding — the problem of compressing messages over a perfect channel. Real channels, however, are imperfect: bits flip, signals degrade, and a transmitted word may arrive corrupted. This chapter addresses the complementary problem of *error-correction*: how to encode messages so that even after some bits are flipped in transit, the original message can be recovered. We introduce the binary symmetric channel, define Hamming distance, and study Hamming's original 1950 construction. We then derive two fundamental bounds — the Hamming (sphere-packing) bound and the Gilbert–Varshamov (GV) bound — which together constrain how efficient an error-correcting code can be, and close with simple constructions that produce new codes from old. ## Noisy Channels and Decoding Rules When a message is transmitted across a physical channel, individual bits may be flipped with some probability. The simplest model is the **binary symmetric channel** (BSC): a channel over the binary alphabet $\{0,1\}$ in which each transmitted bit is independently flipped with probability $p \in [0,1]$, regardless of its value. To protect against such errors, we encode each message as a *codeword* — a binary string of length $n$ — before transmission, and use only a small subset $C \subset \{0,1\}^n$ of all possible strings as valid codewords. [definition: Binary Code] A **binary $[n,m]$-code** is a subset $C \subset \{0,1\}^n$ with $|C| = m$. The integer $n$ is the **length** of the code and the elements of $C$ are called **codewords**. Using the code, we transmit one of $m$ messages by making $n$ uses of the channel. Since $1 \le m \le 2^n$, the **information rate** satisfies \begin{align*} 0 \le \frac{1}{n} \log m \le 1. \end{align*} [/definition] The central geometric quantity for binary codes is Hamming distance, which measures how many bit positions two strings differ in. [definition: Hamming Distance] For $x, y \in \{0,1\}^n$, the **Hamming distance** is \begin{align*} d(x, y) = |\{i : 1 \le i \le n,\, x_i \ne y_i\}|. \end{align*} [/definition] [quotetheorem:1636] This is immediate from the definitions; the triangle inequality holds because errors on disjoint positions combine additively. The metric structure is not merely a formality: it allows us to import geometric intuition — in particular, the notion of spheres centred at codewords — and the language of sphere-packing, which will be the key to the Hamming bound. Decoding regions will later be interpreted as Voronoi cells in this metric. Once a word $x \in \{0,1\}^n$ is received, we must decide which codeword was sent. Three natural decoding rules present themselves: [definition: Decoding Rules] Let $C \subset \{0,1\}^n$ be a code and let $x \in \{0,1\}^n$ be a received word. - The **ideal observer rule** decodes $x$ as the codeword $c \in C$ that maximises $\mathbb{P}(c \text{ sent} \mid x \text{ received})$. - The **maximum likelihood rule** decodes $x$ as the codeword $c \in C$ that maximises $\mathbb{P}(x \text{ received} \mid c \text{ sent})$. - The **minimum distance rule** decodes $x$ as the codeword $c \in C$ minimising $d(x, c)$. [/definition] The three rules are not always equivalent, but they agree in the most natural regime: [quotetheorem:1637] [citeproof:1637] The proof illustrates a recurring theme: conditions on $p$ that make the channel useful translate directly into conditions on the decoding rule. The regime $p < 1/2$ is precisely the regime in which the channel is non-trivially informative, and it is this regime we always work in. [remark: Degenerate Cases] If $p = 1/2$ then every bit is equally likely to be flipped or not, and the channel carries no information — the code is called **useless**. If $p = 0$ the channel is perfect and the code is called **lossless**. In all remaining cases ($0 < p < 1/2$) minimum distance decoding is optimal under equal priors. [/remark] Minimum distance decoding is the standard choice throughout the course, as it is natural and independent of the prior. [example: Ideal Observer vs Minimum Distance] Suppose codewords $000$ and $111$ are used with probabilities $\alpha = 9/10$ and $\beta = 1/10$ respectively, transmitted over a BSC with $p = 1/4$. Suppose $110$ is received. Then $d(110, 000) = 2$ and $d(110, 111) = 1$, so minimum distance decoding yields $111$. Computing directly: \begin{align*} \mathbb{P}(000 \text{ sent} \mid 110) &= \frac{\alpha p^2(1-p)}{\alpha p^2(1-p) + (1-\alpha)p(1-p)^2} = \frac{3}{4}, \end{align*} so the ideal observer decodes as $000$. The two rules disagree because the prior is far from uniform. [/example] ## Error Detection and Correction Suppose you receive a word $r$ that is not in the code $C$. You know an error happened — but is the received word uniquely closest to one codeword? And if not, what can you do? The answers depend on a single parameter: the minimum distance of the code. [definition: Error Detection and Correction] A code $C$ is: - **$d$-error detecting** if changing at most $d$ digits of any codeword never produces another codeword; equivalently, every pair of distinct codewords has Hamming distance greater than $d$. - **$e$-error correcting** if, whenever a received word $x \in \{0,1\}^n$ differs from some codeword in at most $e$ places, that codeword is uniquely determined — so minimum distance decoding recovers it correctly. [/definition] To make the connection between these capabilities and the code's structure precise, we need to measure how far apart codewords are — the minimum distance crystallises this into a single number. [definition: Minimum Distance] The **minimum distance** of a code $C$ is \begin{align*} d(C) = \min_{\substack{c_1, c_2 \in C \\ c_1 \ne c_2}} d(c_1, c_2). \end{align*} A code of length $n$, size $m$, and minimum distance $d$ is called an **$[n, m, d]$-code**. [/definition] The following theorem makes the trade-off precise: larger minimum distance buys more error-correcting power, but at the cost of fewer codewords. [quotetheorem:1638] [citeproof:1638] The theorem makes precise the trade-off at the heart of coding theory: increasing $d$ improves error-correcting capability but forces $m$ to decrease (fewer codewords fit with large mutual separation). ## Hamming's Original Code Before developing the general theory, we examine the three basic code families that anchor it. [example: Fundamental Code Families] (a) The **repetition code** of length $n$ has exactly two codewords: $00\ldots0$ and $11\ldots1$. It is an $[n, 2, n]$-code, so it is $(n-1)$-error detecting and $\lfloor \frac{n-1}{2} \rfloor$-error correcting. The information rate is only $\frac{1}{n}$, which tends to zero — a high price for strong error correction. (b) The **simple parity check code** is $C = \{(x_1,\ldots,x_n) \in \{0,1\}^n : \sum_{i=1}^n x_i = 0\}$, where arithmetic is in $\mathbb{F}_2$. This is an $[n, 2^{n-1}, 2]$-code with information rate $(n-1)/n \to 1$. It detects any single error but cannot correct errors. The minimum distance is $2$ because flipping any one bit of a codeword changes the parity and produces a non-codeword, but two simultaneous flips restore the parity. (c) **Hamming's original code** (1950) is a $1$-error correcting binary $[7, 16, 3]$-code, constructed as follows. [/example] Hamming's construction is one of the great early examples in coding theory: it achieves a near-optimal balance between redundancy and error correction. [definition: Hamming's [7,16,3]-Code] Working over $\mathbb{F}_2$, define $C \subset \mathbb{F}_2^7$ by the three linear constraints \begin{align*} c_1 + c_3 + c_5 + c_7 &= 0, \\ c_2 + c_3 + c_6 + c_7 &= 0, \\ c_4 + c_5 + c_6 + c_7 &= 0. \end{align*} The bits $c_3, c_5, c_6, c_7$ are free information bits, while $c_1, c_2, c_4$ are **check digits** determined by the constraints. Thus $|C| = 2^4 = 16$. [/definition] The code is decoded via the **syndrome**. Given a received word $x \in \mathbb{F}_2^7$, compute \begin{align*} z_1 &= x_1 + x_3 + x_5 + x_7, \\ z_2 &= x_2 + x_3 + x_6 + x_7, \\ z_4 &= x_4 + x_5 + x_6 + x_7. \end{align*} If $x \in C$, then $(z_1, z_2, z_4) = (0,0,0)$. If exactly one bit was flipped — say $x = c + e_i$ where $e_i$ is the $i$th standard basis vector — then $(z_1, z_2, z_4) = (z_1(e_i), z_2(e_i), z_4(e_i))$, and the integer $z_1 + 2z_2 + 4z_4$ (computed over $\mathbb{Z}$, not $\mathbb{F}_2$) equals precisely $i$. One checks this for each $1 \le i \le 7$: the three check equations are designed so that the binary representations of $1$ through $7$ each appear as the syndrome of $e_i$. This allows us to correct the error by flipping position $i$. [example: Hamming Code Parameters] The two codewords $0000000$ and $1110000$ lie in $C$ and have Hamming distance $3$, so $d(C) \le 3$. Since the code is $1$-error correcting, Lemma [Minimum Distance Determines Error Capability] gives $d(C) \ge 3$. Hence $d(C) = 3$, confirming this is a $[7, 16, 3]$-code. The same theorem tells us the code is $2$-error detecting. [/example] ## The Hamming Bound and Perfect Codes To analyse how good a code can be, we count using a geometric object: the Hamming ball. [definition: Hamming Ball] For $x \in \mathbb{F}_2^n$ and $r \ge 0$, the **closed Hamming ball** of radius $r$ centred at $x$ is \begin{align*} B(x, r) = \{y \in \mathbb{F}_2^n : d(x, y) \le r\}. \end{align*} Its size depends only on $n$ and $r$: \begin{align*} V(n, r) := |B(x, r)| = \sum_{i=0}^{r} \binom{n}{i}. \end{align*} [/definition] If a code is $e$-error correcting, then no two codewords can have their radius-$e$ balls overlap — otherwise a received word in the intersection could plausibly have come from either codeword, and decoding would be ambiguous. To see why this constraint bites, consider trying to pack $M = 30$ codewords in $\{0,1\}^7$ with minimum distance $d = 3$ (so each Hamming ball has radius $e = 1$). Each ball contains $1 + 7 = 8$ points, so 30 disjoint balls would cover $30 \times 8 = 240$ points — already exceeding $2^7 = 128$. The space simply does not have room, forcing $M \le 128/8 = 16$. This sphere-packing observation gives a sharp upper bound. [quotetheorem:1639] [citeproof:1639] A code meeting this bound exactly is called perfect: every word in $\mathbb{F}_2^n$ lies within distance $e$ of exactly one codeword, so the balls tile the space. [definition: Perfect Code] An $e$-error correcting code $C$ of length $n$ is **perfect** if \begin{align*} |C| = \frac{2^n}{V(n, e)}, \end{align*} equivalently, if for every $x \in \mathbb{F}_2^n$ there exists a unique $c \in C$ with $d(x, c) \le e$. For a perfect code, any pattern of $e+1$ errors will necessarily be decoded incorrectly. [/definition] Perfect codes are remarkably constrained objects. The condition $|C| = 2^n / V(n,e)$ forces the balls to tile $\mathbb{F}_2^n$ without gaps or overlaps — a strong combinatorial requirement that most parameter pairs $(n,e)$ cannot satisfy. [example: Perfect Codes] (a) Hamming's $[7, 16, 3]$-code is $1$-error correcting and $\frac{2^7}{V(7,1)} = \frac{128}{1+7} = 16 = |C|$, so it is perfect. (b) The binary repetition code of odd length $n$ is $\lfloor \frac{n-1}{2} \rfloor$-error correcting and perfect. [/example] These two families are essentially the only 'obvious' perfect codes. As the following remark explains, perfect codes are the exception rather than the rule. [remark: Perfection is Rare] If $2^n / V(n, e)$ is not an integer, no perfect $e$-error correcting code of length $n$ can exist. The converse is false: the condition $2^n / V(n, e) \in \mathbb{Z}$ is necessary but not sufficient. For instance, $n = 90$, $e = 2$ satisfies the integrality condition but no perfect code of those parameters exists. [/remark] ## The Gilbert–Varshamov Bound and the Function $A(n,d)$ Having bounded codes from above via sphere-packing, we want a lower bound guaranteeing that good codes exist. We first define the central combinatorial quantity of the subject. [definition: $A(n,d)$] Let $A(n, d) = \max\{m : \exists\, [n, m, d]\text{-code}\}$ be the maximum number of codewords in a binary code of length $n$ and minimum distance at least $d$. [/definition] The values $A(n, d)$ are generally unknown, but several cases follow immediately from our examples: - $A(n, 1) = 2^n$ (the full code $C = \mathbb{F}_2^n$), - $A(n, n) = 2$ (repetition code), - $A(n, 2) = 2^{n-1}$ (simple parity check code). The following monotonicity lemma is useful: [quotetheorem:1640] [citeproof:1640] The inequality $A(n, d+1) \le A(n, d)$ reflects a natural trade-off: demanding a larger minimum distance — at least $d+1$ rather than $d$ — can only reduce the number of codewords that fit. Defining $A(n,d)$ as the maximum over codes with minimum distance *at least* $d$ (rather than *exactly* $d$) makes this monotonicity clean and gives us a well-defined non-increasing function of $d$. As a corollary, $A(n, d) = \max\{m : \exists\, [n, m, d']\text{-code for some } d' \ge d\}$: we may always assume the minimum distance is exactly $d$. The main theorem of this section sandwiches $A(n, d)$ between two explicit bounds. [quotetheorem:1641] [citeproof:1641] The GV bound guarantees that good codes exist; the Hamming bound limits how good they can be. Together they sandwich $A(n,d)$ and leave a gap whose size is still not fully understood. [example: Bounds in Practice] For $n = 10$, $d = 3$: $V(10, 1) = 11$ and $V(10, 2) = 1 + 10 + 45 = 56$. The theorem gives \begin{align*} \frac{1024}{56} \approx 18.3 \le A(10, 3) \le \frac{1024}{11} \approx 93. \end{align*} So $19 \le A(10, 3) \le 93$. Refining this took decades: the exact value $A(10, 3) = 72$ was only established in 1999. [/example] ## Asymptotics and the Binary Entropy Function For fixed error rate $\delta \in (0, 1/2)$, we study how the maximum information rate $\frac{1}{n} \log A(n, \lfloor n\delta \rfloor)$ behaves as $n \to \infty$. The answer is governed by the binary entropy function. [definition: Binary Entropy Function] For $\delta \in (0, 1)$, the **binary entropy function** is \begin{align*} H(\delta) = -\delta \log \delta - (1 - \delta) \log(1 - \delta), \end{align*} where logarithms are in base $2$. [/definition] The function $H(\delta)$ is concave, symmetric about $1/2$, with $H(0) = H(1) = 0$ and $H(1/2) = 1$. It measures the entropy of a Bernoulli($\delta$) random variable. [quotetheorem:1642] [citeproof:1642] The matching Hamming asymptotic upper bound states that $\frac{1}{n} \log A(n, \lfloor n\delta \rfloor) \le 1 - H(\delta/2)$ for all $\delta \in (0, 1/2)$; this follows from $V(n, \lfloor n\delta/2 \rfloor) \approx 2^{nH(\delta/2)}$ by the same entropy argument. The condition $\delta < 1/2$ is essential: for $\delta \ge 1/2$ a BSC with error rate $\delta$ is useless (or you can simply complement all bits), so the problem becomes degenerate. The GV and Hamming asymptotic bounds do not in general coincide — the gap between $1 - H(\delta)$ (GV lower) and $1 - H(\delta/2)$ (Hamming upper) remains open for most $\delta$, and closing it requires Shannon's capacity theorem, which we turn to in the next chapter. [explanation: Interpreting the Bounds] The GV theorem guarantees that, for large $n$, there exist codes of rate approaching $1 - H(\delta)$ that correct a fraction $\delta$ of errors. The Hamming upper bound $1 - H(\delta/2)$ is strictly larger for $\delta > 0$, so the two bounds do not meet — the region between them is the **Hamming–GV gap**, and closing it asymptotically is one of the central open problems in coding theory. The rate $1 - H(\delta)$ is the combinatorial analogue of Shannon's channel capacity; in the next chapter we prove the capacity theorem for the binary symmetric channel and see why Shannon's argument gives a matching achievability result. [/explanation] ## Constructing New Codes from Old Given an $[n, m, d]$-code $C$, three simple constructions produce new codes. These are building blocks for larger and more efficient families. ### Parity Check Extension The **parity check extension** $C^+$ appends one extra bit to each codeword, chosen so that the total number of ones is even: \begin{align*} C^+ = \left\{ \left(c_1, \ldots, c_n, \sum_{i=1}^n c_i\right) : (c_1, \ldots, c_n) \in C \right\}. \end{align*} This is an $[n+1, m, d']$-code. If $d$ is odd, the extra bit increases the minimum distance: $d' = d + 1$. If $d$ is even, $d' = d$. In both cases, $d \le d' \le d + 1$. ### Punctured Code **Puncturing** deletes the $i$th coordinate from every codeword: \begin{align*} C^- = \{(c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n) : (c_1, \ldots, c_n) \in C\}. \end{align*} If $d \ge 2$, this is an $[n-1, m, d']$-code with $d - 1 \le d' \le d$. The minimum distance may decrease by $1$ because two codewords that agreed in position $i$ and differed in $d$ others now differ in $d$ others after puncturing (distance preserved), but two codewords that differed in position $i$ and $d - 1$ others now differ in only $d - 1$ others (distance decreases by 1). The size $m$ is preserved because puncturing is not injective only when all codewords agree in position $i$, but in that case one coordinate was redundant. ### Shortened Code **Shortening** at position $i$ with value $\alpha \in \mathbb{F}_2$ retains only the codewords with $c_i = \alpha$ and then deletes position $i$: \begin{align*} C' = \{(c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n) : (c_1, \ldots, c_{i-1}, \alpha, c_{i+1}, \ldots, c_n) \in C\}. \end{align*} This is an $[n-1, m', d']$-code with $d' \ge d$ (removing a coordinate can only decrease distances, but we have restricted to codewords agreeing in position $i$, so the minimum distance among the retained codewords is at least $d$). For a suitable choice of $\alpha$, the retained set has size $m' \ge m/2$. [remark: Uses of These Constructions] These three operations are not merely theoretical exercises. Parity check extension converts a code of odd minimum distance to one of even minimum distance; puncturing and shortening reduce parameters in controlled ways. Combined with algebraic constructions (linear codes, cyclic codes), they generate the rich families studied later in the course. [/remark] # 3. Shannon's First Coding Theorem and the Kelly Criterion The Noiseless Coding Theorem from the previous chapter tells us the optimal expected codeword length for encoding a single random variable. But real communication involves long streams of data, not isolated symbols. This chapter asks a sharper question: how efficiently can we encode a source that produces symbols one after another, and what is the fundamental limit on this compression rate? The answer, Shannon's First Coding Theorem, identifies entropy as the information rate of a memoryless source. We then turn to a striking application outside communication: the Kelly criterion, which shows how entropy governs the optimal growth rate of a gambler's bankroll. ## Sources and Reliable Encoding Suppose a source $X_1, X_2, \ldots$ produces symbols from a finite alphabet $A$, one per unit time. We want to compress this stream: store or transmit it using as few bits per symbol as possible. But what does efficient compression of an infinite stream even mean? We cannot insist on perfectly recovering every possible sequence — some sequences are so rare that we can simply declare failure on them without paying a cost. This leads to the right notion: encoding at a rate $r$ is reliable if the probability of failure vanishes as the block length grows. [definition: Source and Bernoulli Source] A **source** is a sequence of random variables $X_1, X_2, \ldots$ taking values in a finite alphabet $A$. A source is **Bernoulli** (or memoryless) if $X_1, X_2, \ldots$ are independent and identically distributed; we write $(X; X_n)$ to denote this. [/definition] With this terminology, reliable encoding at rate $r$ is the precise answer to our compression question. When we encode long sequences from a source, we do not need to encode every possible string in $A^n$. If some strings are overwhelmingly more likely than others, we can afford to encode only the probable ones and declare failure (with vanishing probability) on the rest. [definition: Reliable Encoding at Rate r] A source $X_1, X_2, \ldots$ is **reliably encodable at rate $r$** if there exists a sequence of subsets $(A_n)_{n \geq 1}$ with $A_n \subseteq A^n$ such that: 1. $\displaystyle\lim_{n \to \infty} \frac{\log |A_n|}{n} = r$; 2. $\displaystyle\lim_{n \to \infty} \mathbb{P}((X_1, \ldots, X_n) \in A_n) = 1$. The **information rate** $H$ of the source is the infimum of all reliable encoding rates. [/definition] Condition (1) says the set $A_n$ has roughly $2^{nr}$ elements: we need about $nr$ bits to index elements of $A_n$. Condition (2) says the source lands in $A_n$ with high probability. So the information rate is the minimum bit rate per symbol needed for asymptotically reliable compression. [remark: Immediate Bounds] The information rate satisfies $0 \leq H \leq \log |A|$. The upper bound is achieved when all $|A|^n$ strings are equally likely (we cannot compress at all); the lower bound is achieved when the source is deterministic (it always outputs the same symbol). [/remark] ## The Asymptotic Equipartition Property The key insight behind Shannon's theorem is that for large $n$, the probability of a source sequence concentrates on a much smaller "typical" set. To make this precise we need a probability-theoretic tool. **Setup for Bernoulli sources.** Given a source $X_1, X_2, \ldots$ with values in $A$, write $X^{(n)} = (X_1, \ldots, X_n)$. Its probability mass function is $p_{X^{(n)}} : A^n \to [0,1]$. Composing, we get a random variable $p(X^{(n)}) = p_{X^{(n)}} \circ X^{(n)} : \Omega \to [0,1]$, the probability of the observed sequence. For a Bernoulli source, the iid structure gives $p(X_1, \ldots, X_n) = p(X_1) \cdots p(X_n)$, so \begin{align*} -\frac{1}{n} \log p(X_1, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i). \end{align*} The Weak Law of Large Numbers (from IA Probability) tells us that if $X_1, X_2, \ldots$ are iid real-valued random variables with finite expectation $\mathbb{E}[X]$, then \begin{align*} \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{\mathbb{P}} \mathbb{E}[X]. \end{align*} Applying this to the iid variables $-\log p(X_i)$: \begin{align*} -\frac{1}{n} \log p(X_1, \ldots, X_n) \xrightarrow{\mathbb{P}} \mathbb{E}[-\log p(X_1)] = H(X_1). \end{align*} This convergence in probability is the Asymptotic Equipartition Property for Bernoulli sources. [definition: Asymptotic Equipartition Property] A source $X_1, X_2, \ldots$ satisfies the **Asymptotic Equipartition Property (AEP)** with constant $H \geq 0$ if \begin{align*} -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \xrightarrow{\mathbb{P}} H \quad \text{as } n \to \infty. \end{align*} [/definition] [example: Biased Coin] Toss a biased coin with $\mathbb{P}(H) = p$ independently. After $N$ tosses, a typical sequence has approximately $pN$ heads and $(1-p)N$ tails. The probability of any specific such sequence is \begin{align*} p^{pN}(1-p)^{(1-p)N} = 2^{N(p \log p + (1-p) \log(1-p))} = 2^{-NH(X)}, \end{align*} where $H(X) = -p\log p - (1-p)\log(1-p)$ is the entropy of a single toss. The AEP says that atypical sequences have vanishing probability: with high probability, the observed sequence has probability close to $2^{-NH(X)}$. [/example] The AEP is equivalent to a structural property of the sample space that is more useful for coding arguments. [quotetheorem:1643] The proof that these conditions are equivalent is straightforward from the definition of convergence in probability and is non-examinable. The key point is that $T_n$ is simultaneously large in probability and small in cardinality: condition (2) forces \begin{align*} 1 \geq \mathbb{P}(T_n) \geq |T_n| \cdot 2^{-n(H+\varepsilon)}, \end{align*} so $|T_n| \leq 2^{n(H+\varepsilon)}$. The typical set contains at most exponentially $2^{nH}$ sequences, far fewer than all $|A|^n$ sequences, yet it captures almost all the probability. ## Shannon's First Coding Theorem The AEP tells us that for large $n$, almost all the probability sits on a typical set of size roughly $2^{nH}$. This raises the central question of the chapter directly: given a source with AEP constant $H$, can we compress it to rate $H$ — that is, can we reliably encode it using only $nH$ bits per block? And if so, is $H$ the best we can do, or could some clever scheme compress below $H$? The answer is a clean dichotomy: rate $H + \varepsilon$ is always achievable (just encode the typical set), but rate $H - \varepsilon$ always fails. The threshold is sharp. [quotetheorem:1644] [citeproof:1644] The theorem is elegant: it says the AEP not only guarantees compression to rate $H + \varepsilon$ (by encoding only typical sequences), but also that no scheme can do better than rate $H$. The entropy is both achievable and optimal. [explanation: What the Theorem Requires and What It Does Not Say] The AEP is not decoration — it is load-bearing. The proof of the lower bound (no reliable encoding below rate $H$) uses both the upper and lower bounds on $p(x_1,\ldots,x_n)$ for typical sequences. For a source that does not satisfy the AEP, there may be no typical set in the required sense: probability could remain spread across an exponentially large set of sequences without concentrating on any tractable family. In that case, the theorem's lower bound breaks down and the information rate may be hard to characterise. It is also important to note what the theorem does **not** say. It gives no information about finite block lengths: the rates $H \pm \varepsilon$ are asymptotic, and for any fixed $n$ the optimal encoding at length $n$ may differ substantially from $H$. The theorem says nothing about redundancy — the precise gap between the optimal $n$-block code and the limiting rate $H$. Characterising that gap is the subject of second-order coding theory. Finally, the theorem concerns a noiseless channel: we assumed all encoded bits are received without error. The analogue for noisy channels is Shannon's Second (Channel Coding) Theorem, which we will reach later in the course. [/explanation] ## The Information Rate of Bernoulli Sources Shannon's First Coding Theorem is conditional on a hypothesis: the source must satisfy the AEP. But which sources actually do? The theorem is only useful if the AEP holds for the sources we care about in practice. This is the gap this section closes. The simplest and most important class — Bernoulli (memoryless) sources — do satisfy the AEP, and the proof is essentially just the Weak Law of Large Numbers. [quotetheorem:1645] [citeproof:1645] [explanation: Why iid Is Essential, and What Happens Beyond] The proof uses independence in a critical way: the factorisation $p(X_1,\ldots,X_n) = \prod_{i=1}^n p(X_i)$ allows us to write $-\frac{1}{n}\log p(X^{(n)})$ as an average of iid terms, to which the WLLN applies directly. For a **Markov source** — where $X_{n+1}$ depends on $X_n$ but not on earlier symbols — the iid structure is absent, and the factorisation fails. Nevertheless, the AEP still holds for Markov chains under mild conditions (irreducibility and aperiodicity), with the entropy rate of the chain playing the role of $H$. More generally, the AEP holds for any **stationary ergodic** source, with $H$ being the entropy rate $\lim_{n\to\infty} \frac{1}{n}H(X_1,\ldots,X_n)$. This is the **Shannon–McMillan–Breiman theorem**, a significant result that we do not prove here. The iid case is the entry point, but the conclusion is far more general. [/explanation] [quotetheorem:1646] [citeproof:1646] This is the deepest result so far. It says that entropy measures not just the average uncertainty of a single symbol, but also the minimum number of bits per symbol needed to store arbitrarily long outputs of a memoryless source. A coin with $\mathbb{P}(H) = 3/4$ generates sequences at information rate $H(3/4, 1/4) \approx 0.811$ bits per symbol — any compression below this rate must fail with non-negligible probability. ### Reduction to Block Sources The route to the information rate upper bound passes through two intermediate results that are instructive in their own right. They show how block encoding improves efficiency, and illustrate that the Noiseless Coding Theorem from Chapter 2 does almost all the work. [quotetheorem:1647] [citeproof:1647] Combining with the Noiseless Coding Theorem (which gives $\mathbb{E}[l_1] < H(X_1) + 1$ for an optimal code), the information rate is less than $H(X_1) + 1$. **Block encoding.** We can do better by encoding in blocks. Group the source as \begin{align*} \underbrace{X_1, \ldots, X_N}_{Y_1},\ \underbrace{X_{N+1}, \ldots, X_{2N}}_{Y_2},\ \ldots \end{align*} so $Y_1, Y_2, \ldots$ are iid with values in $A^N$. Since each $Y_j$ is a block of $N$ independent symbols, $H(Y_1) = NH(X_1)$. Applying the Noiseless Coding Theorem to the $Y_j$ source gives information rate less than $H(Y_1) + 1 = NH(X_1) + 1$, measured in bits per block. Per symbol this is $H(X_1) + 1/N$. Since $N$ is arbitrary, the information rate is at most $H(X_1)$. Together with the AEP lower bound, this gives $H = H(X_1)$. ## The Kelly Criterion The connection between entropy and long-run growth rates reappears in a striking context: repeated gambling. The Kelly criterion asks: what fraction of your bankroll should you bet at each step to maximise long-run growth? **Setup.** A coin is tossed $n$ times independently, with $\mathbb{P}(H) = p \in (0,1)$. Before each toss, given current bankroll $X_k$, a gambler bets fraction $w$ of their bankroll on heads, where $0 \leq w < 1$. If heads: the bet pays at rate $u > 0$, so the bankroll becomes $X_k(wu + (1-w))$. If tails: the bet is lost, so the bankroll becomes $X_k(1-w)$. Setting $X_0 = 1$: \begin{align*} X_{n+1} = \begin{cases} X_n(wu + (1-w)) & \text{if the $(n+1)$th toss is heads,} \\ X_n(1-w) & \text{if the $(n+1)$th toss is tails.} \end{cases} \end{align*} Let $Y_{n+1} = X_{n+1}/X_n$ be the bankroll multiplier at step $n+1$. Then $Y_1, Y_2, \ldots$ are iid, and \begin{align*} \log X_n = \sum_{i=1}^{n} \log Y_i. \end{align*} The quantity $\frac{1}{n}\log X_n$ is the average log-growth rate per toss. [quotetheorem:1648] [citeproof:1648] Two aspects of this bound deserve comment. First, the finite variance assumption $\sigma^2 < \infty$ is needed for Chebyshev to apply: it ensures that fluctuations around the mean are controlled by the second moment. In the gambling setup, $\log Y_1$ takes only two values ($\log(1 + (u-1)w)$ and $\log(1-w)$), so finite variance is automatic — but in more general settings it would need to be verified. Second, the theorem gives convergence **in probability**: for any fixed $\delta > 0$, $\mathbb{P}(|\frac{1}{n}\log X_n - \mu| \geq \delta) \to 0$. This is weaker than **almost sure convergence** ($\frac{1}{n}\log X_n \to \mu$ with probability 1), which also holds here by the Strong Law of Large Numbers. The distinction matters: almost sure convergence says the log-growth rate stabilises along every trajectory (except a null set), whereas in-probability convergence only says the probability of deviation vanishes. The practical implication is the same — maximise $\mu$ — but the almost-sure statement is the stronger foundation. This lemma shows that $\frac{1}{n}\log X_n$ concentrates around $\mu = \mathbb{E}[\log Y_1]$. To maximise the long-run growth rate, we should choose $w$ to maximise $\mu$. [quotetheorem:1649] [explanation: Structural Remarks on the Kelly Bet] Three structural points clarify the scope of this theorem. **Why $u > 0$ is essential.** The theorem requires the bet pays a positive amount on a win. If the payout ratio were $u \leq 0$, the bet would have non-positive expected return in every scenario — there would be no reason to bet at all, and the theorem's formula for $w^*$ would be meaningless or negative. **The optimal fraction lies in $[0,1)$.** When $up > 1$, we have $w^* = \frac{up-1}{u-1}$. Since $p < 1$, we get $up - 1 < u - 1$, so $w^* < 1$. And $w^* > 0$ follows from $up > 1$. So the Kelly bet always stays strictly below the full bankroll: you never bet everything. **Growth-optimal, not expected-utility-optimal.** The Kelly criterion maximises the long-run growth rate $\mathbb{E}[\log Y]$, which is the same as maximising $\mathbb{E}[\log X_n]/n$ as $n \to \infty$. This is the right objective for a gambler who wants to maximise wealth over a long horizon. However, a risk-averse gambler with a concave utility function $u(x)$ other than $\log x$ will in general prefer a smaller bet than $w^*$: Kelly overbets relative to the risk-averse optimum. The theorem says nothing about expected-utility maximisation for alternative utility functions. [/explanation] The proof is a standard calculus optimisation that the course does not cover in lectures (see Körner's book for details); a sketch is included for completeness. [citeproof:1649] [remark: Entropy in the Optimal Growth Rate] Write $q = 1 - p$. At the optimal $w^*$ when $up > 1$: \begin{align*} \mathbb{E}[\log Y] = p\log p + q\log q + \log u - q\log(u-1). \end{align*} The term $-(p\log p + q\log q) = H(p, q)$ is exactly the entropy of the coin. The optimal growth rate depends on entropy: a more uncertain coin (higher $H$) contributes more to growth at the optimal bet. [/remark] **Kelly's criterion** is the principle that a gambler making a long sequence of bets should maximise $\mathbb{E}[\log Y_1]$ at each step. The concentration lemma justifies this: since $\frac{1}{n}\log X_n \xrightarrow{\mathbb{P}} \mathbb{E}[\log Y_1]$, maximising the expectation of the log-multiplier is the same as maximising the long-run growth rate of the bankroll. [explanation: Over- and Under-Betting] Kelly's criterion identifies a precise optimal fraction $w^*$. Betting less than $w^*$ still yields growth (when $up > 1$), but at a slower exponential rate. More strikingly, betting more than a certain critical proportion causes the bankroll to tend to zero: the law of large numbers works against the over-bettor, because the log-multiplier $\mathbb{E}[\log Y_1]$ becomes negative beyond the critical level, and $X_n = e^{n \cdot \frac{1}{n}\log X_n} \to 0$. To make this concrete: take $p = 0.6$, $u = 2$ (a favourable bet: 60% win probability, winnings double the stake). Then $w^* = \frac{2 \cdot 0.6 - 1}{2 - 1} = 0.2$. The optimal log-growth rate is $\mathbb{E}[\log Y_1] \approx 0.6\log(1.2) + 0.4\log(0.8) \approx 0.020$ bits per toss. Now suppose the gambler bets $w = 0.5$ instead, reasoning that the bet is favourable so more is better. The log-growth rate becomes $0.6\log(1.5) + 0.4\log(0.5) \approx 0.6(0.585) + 0.4(-1) = 0.351 - 0.4 = -0.049$ bits per toss — negative. Despite winning 60% of tosses and getting double the stake on each win, the bankroll tends to zero exponentially fast. This quantifies the asymmetry: over-betting by a factor of 2.5 turns a positive-expectation bet into a bankroll-destroying strategy. This asymmetry — moderate overbetting is bad, aggressive overbetting is catastrophic — is the practical content of the Kelly criterion. The connection to entropy is more than aesthetic: the AEP and the WLLN are the same tool applied in two different domains, and the "typical set" of a communication source corresponds directly to the "typical trajectory" of a gambler's bankroll. [/explanation] # 4. Channel Capacity and Shannon's Second Coding Theorem The central question of information theory is not whether errors can be corrected, but how much information can be pushed through a noisy channel while keeping the error probability as small as we please. Earlier chapters established that good error-correcting codes exist and quantified their distance properties. This chapter identifies the fundamental limit — the *capacity* of a channel — and proves Shannon's Second Coding Theorem: reliable transmission is possible at every rate below capacity, and impossible at every rate above it. The key conceptual tool is *mutual information*, a measure of how much the channel's output reveals about its input. ## Reliable Transmission and Channel Capacity How efficiently can we use a noisy channel? To make this precise, we need to distinguish two things: the *rate* at which information is sent, and how reliably it arrives. Recall from the previous chapter that a **code** $C$ of length $n$ is a function $C : \mathcal{M} \to A^n$, where $\mathcal{M}$ is the message set and $A$ is the input alphabet, together with a decoding rule $B^n \to \mathcal{M}$. The size of $C$ is $m = |\mathcal{M}|$, and the information rate is \begin{align*} \rho(C) = \frac{1}{n} \log_2 m. \end{align*} The worst-case error probability is $\hat{e}(C) = \max_{x \in \mathcal{M}} \mathbb{P}(\text{error} \mid x \text{ sent})$. [definition: Reliable Transmission Rate] A channel can **transmit reliably at rate $R$** if there exists a sequence of codes $(C_n)_{n=1}^{\infty}$, with $C_n$ a code of length $n$, such that \begin{align*} \lim_{n \to \infty} \rho(C_n) = R \quad \text{and} \quad \lim_{n \to \infty} \hat{e}(C_n) = 0. \end{align*} The **operational capacity** of the channel is the supremum of all rates at which reliable transmission is possible. [/definition] The operational capacity captures what the channel can actually do in practice. But computing it directly — searching over all possible code sequences — is intractable. The breakthrough of information theory is that it equals a clean, computable quantity defined via entropy. Before reaching that, it is instructive to check that the capacity of a binary symmetric channel (BSC) is positive. This is not immediately obvious: with error probability $p > 0$, every use of the channel introduces noise, so one might fear that reliable transmission requires sending so redundantly that the rate collapses to zero. [quotetheorem:1650] The hypothesis $p < 1/4$ is not the sharpest possible — we will later show the capacity is $1 - H(p) > 0$ for all $p < 1/2$ — but the proof here is elementary, using only the Gilbert–Shannon–Varshamov (GSV) bound. [citeproof:1650] The law-of-large-numbers step deserves to be isolated as a standalone result, because it recurs in exactly this one-sided form — bounding the probability that the error count exceeds $n(p+\varepsilon)$ for a fixed $\varepsilon > 0$ — throughout the rest of the chapter. Both Fano's inequality and the converse proof rely on errors concentrating below a threshold, not merely on the symmetric statement that errors concentrate near their mean. [quotetheorem:1651] [citeproof:1651] It is worth noting why the one-sided form — bounding the upper tail $\mathbb{P}(\text{errors} \geq n(p+\varepsilon))$ rather than both tails — is the version we actually need. In the converse proof, Fano's inequality produces an upper bound on $H(X \mid Y)$ that involves the error rate $p$; as $p$ grows large, this bound becomes vacuous. What matters is that $p$ cannot be small while $R > C$, and the argument works by showing the one-sided concentration forces $p$ to stay bounded away from zero. The symmetric WLLN would not give a cleaner conclusion here. ## Conditional Entropy and Mutual Information To measure what a channel actually communicates, we need to compare the uncertainty about the input before and after observing the output. This requires *conditional entropy*. Suppose the channel has input alphabet $A$ and output alphabet $B$, with random variables $X$ and $Y$ taking values in $A$ and $B$ respectively. [definition: Conditional Entropy] The **conditional entropy of $X$ given $Y$** is \begin{align*} H(X \mid Y = y) &= -\sum_{x \in A} \mathbb{P}(X = x \mid Y = y) \log \mathbb{P}(X = x \mid Y = y), \\ H(X \mid Y) &= \sum_{y \in B} \mathbb{P}(Y = y)\, H(X \mid Y = y). \end{align*} [/definition] The quantity $H(X \mid Y)$ measures the average residual uncertainty about $X$ once $Y$ is known. It satisfies $H(X \mid Y) \geq 0$, with equality precisely when $X$ is a deterministic function of $Y$. The following chain rule for entropy is fundamental — it expresses joint entropy in terms of individual and conditional entropies. [quotetheorem:1635] [citeproof:1635] The chain rule is the workhorse identity of this chapter. It converts statements about joint entropy into statements involving conditional entropy, and it is precisely the tool that will let us establish the symmetry $I(X;Y) = I(Y;X)$, derive Fano's inequality, and carry out the converse proof of Shannon's Second Coding Theorem. [example: Dice and Parity] Roll a fair die. Let $X$ be the outcome and $Y = 0$ if $X$ is even, $Y = 1$ if $X$ is odd. Then: - $H(X, Y) = H(X) = \log_2 6$, since $Y$ is determined by $X$. - $H(Y) = \log_2 2 = 1$. - $H(X \mid Y) = H(X, Y) - H(Y) = \log_2 6 - 1 = \log_2 3$. This makes sense: knowing the parity halves the possibilities, leaving three equally likely outcomes. - $H(Y \mid X) = H(X,Y) - H(X) = 0$, confirming that $Y$ is fully determined by $X$. [/example] The chain rule connects joint and conditional entropies through a clean additive decomposition; the next result sharpens this into an inequality by comparing $H(X \mid Y)$ with the unconditional $H(X)$. The key observation is that averaging over $Y$ can only reduce uncertainty about $X$ — it can never introduce new uncertainty that was not already present. An immediate corollary is that conditioning never increases entropy. [quotetheorem:1652] [citeproof:1652] This inequality has the intuitive meaning: observing $Y$ can only reduce (or maintain) our uncertainty about $X$. For channel coding, it says that the output can only tell us things about the input, never increase what we didn't know. We will also need a chain-rule inequality for joint conditioning, which bounds $H(X \mid Y)$ by conditioning on an additional variable. [quotetheorem:1653] [citeproof:1653] Equality holds in the inequality $H(X \mid Y) \leq H(X \mid Y, Z) + H(Z)$ if and only if $Z$ is independent of $Y$ given $X$ — that is, the triple $(X, Y, Z)$ satisfies $Z \perp Y \mid X$. In practice, the version with $H(Z)$ rather than $H(Z \mid Y)$ is the form we actually need for Fano's inequality, because $Z$ will be the error indicator whose marginal entropy $H(Z) = H(p)$ has a clean closed form, whereas conditioning $Z$ on $Y$ would complicate the bound without tightening it. ### Fano's Inequality The Joint Conditioning Inequality leads to one of the most important results in information theory: Fano's inequality, which quantifies how much the error probability constrains conditional entropy. It is the key tool for proving that reliable transmission above capacity is impossible. [quotetheorem:1654] The hypothesis that $X$ and $Y$ share the same alphabet is natural in the coding context: $X$ is the transmitted codeword (or message) and $Y$ is the decoded output. Then $p$ is the error probability, $H(p)$ measures the information needed to learn *whether* an error occurred, and $p \log_2(m-1)$ bounds the information needed to identify *which* error occurred, given that one did. The theorem does not say that $H(X \mid Y)$ is small when $p$ is small — it gives an upper bound. The conclusion is that if $p \to 0$, then $H(X \mid Y) \to 0$ as well, since $H(p) \to 0$ and $p \log_2(m-1) \to 0$. [citeproof:1654] The bound is most useful in the regime where $p$ is small. To see it in action, consider the simplest possible alphabet. [example: Fano Bound for Binary Alphabet] For a BSC with error probability $p$ and alphabet $A = \{0,1\}$, we have $m = 2$ and Fano's inequality gives $H(X \mid Y) \leq H(p) + p \cdot 0 = H(p)$. This has a direct interpretation: knowing the output $Y$, the only residual uncertainty about $X$ is whether an error occurred, which contributes at most $H(p)$ bits. [/example] ## Mutual Information and Information Capacity Having measured residual uncertainty via conditional entropy, the *reduction* in uncertainty due to observing the channel output is the natural measure of how much information the channel transmits. [definition: Mutual Information] For random variables $X$ and $Y$, the **mutual information** is \begin{align*} I(X; Y) := H(X) - H(X \mid Y). \end{align*} [/definition] By the conditioning inequality, $I(X; Y) \geq 0$, with equality if and only if $X$ and $Y$ are independent. The symmetry $I(X; Y) = I(Y; X)$ follows from the chain rule: both equal $H(X) + H(Y) - H(X,Y)$. For a discrete memoryless channel (DMC) with input alphabet $A$ and output alphabet $B$, we model the input as a random variable $X$ with some distribution $(p_1, \ldots, p_m)$ on $A$, and write $Y$ for the resulting output. The mutual information $I(X; Y)$ depends on the chosen input distribution as well as the channel matrix. [definition: Information Capacity] The **information capacity** of a DMC is \begin{align*} C = \max_X I(X; Y), \end{align*} where the maximum is taken over all probability distributions on the input alphabet $A$. [/definition] Three points about this definition deserve emphasis. First, the maximum is indeed attained: the map $(p_1, \ldots, p_m) \mapsto I(X; Y)$ is continuous on the compact simplex $\{(p_1, \ldots, p_m) \in [0,1]^m : p_1 + \cdots + p_m = 1\}$. Second, the information capacity depends only on the channel matrix — not on any particular input distribution. Third, computing $C$ is in principle a constrained optimisation problem; for symmetric channels like the BSC and BEC, the symmetry forces the optimal distribution to be uniform, making the computation tractable. ### Capacity of the Binary Symmetric Channel [example: BSC Capacity] For a BSC with error probability $p$, let $X$ satisfy $\mathbb{P}(X=0) = \alpha$ and $\mathbb{P}(X=1) = 1-\alpha$. The output satisfies \begin{align*} \mathbb{P}(Y=0) = \alpha(1-p) + (1-\alpha)p, \quad \mathbb{P}(Y=1) = (1-\alpha)(1-p) + \alpha p. \end{align*} Using the representation $I(X;Y) = H(Y) - H(Y \mid X)$: given $X$, the output is $\text{Bernoulli}(p)$ regardless of the input value, so $H(Y \mid X) = H(p)$ for any $\alpha$. Thus \begin{align*} C = \max_\alpha I(X;Y) = \max_\alpha H(Y) - H(p) = 1 - H(p), \end{align*} where $H(Y)$ achieves its maximum value of 1 bit when $\alpha = 1/2$, making $Y$ uniform on $\{0,1\}$. Hence the BSC capacity is $C = 1 - H(p) = 1 + p\log_2 p + (1-p)\log_2(1-p)$. [/example] The BSC capacity $1 - H(p)$ is the benchmark against which we measure achievability in the rest of the chapter. Before moving to the BEC, it is worth pausing to note that the computation above used the representation $I(X;Y) = H(Y) - H(Y \mid X)$ — a deliberate choice that made the conditional term constant across all input distributions. The following remark records why the two representations of $I(X;Y)$ are equivalent and when each is more convenient. [remark: Choice of I(X;Y) Representation] The formula $I(X;Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)$ gives two equivalent representations. For the BSC, $H(Y \mid X) = H(p)$ is easier to compute; for the BEC, $H(X \mid Y)$ turns out to be cleaner. The choice is a matter of which conditional entropy is simpler. [/remark] ### Capacity of the Binary Erasure Channel For the BEC, the $H(X \mid Y)$ representation of mutual information turns out to be the more natural choice. The reason is structural: the BEC either delivers a bit perfectly or erases it entirely, so the residual uncertainty $H(X \mid Y)$ collapses to zero on non-erased outputs and reduces to $H(\alpha)$ on erased ones, giving an immediate factoring over the erasure probability. The calculation below makes this transparent. [example: BEC Capacity] For a binary erasure channel (BEC) with erasure probability $p$, let $\mathbb{P}(X=0) = \alpha$. The output takes values in $\{0, *, 1\}$ with $\mathbb{P}(Y=0) = \alpha(1-p)$, $\mathbb{P}(Y=*) = p$, $\mathbb{P}(Y=1) = (1-\alpha)(1-p)$. Computing $H(X \mid Y)$ directly: $H(X \mid Y=0) = H(X \mid Y=1) = 0$ (no uncertainty when received), and $H(X \mid Y=*) = H(\alpha)$ (erasure gives no information). So \begin{align*} H(X \mid Y) = p \cdot H(\alpha). \end{align*} Therefore \begin{align*} C = \max_\alpha I(X;Y) = \max_\alpha (H(\alpha) - pH(\alpha)) = (1-p) \max_\alpha H(\alpha) = 1-p, \end{align*} attained at $\alpha = 1/2$. The BEC capacity $1-p$ has a transparent interpretation: a fraction $p$ of transmitted bits are erased and carry no information; the remaining fraction $1-p$ are received perfectly. [/example] ## Shannon's Second Coding Theorem Why should a correlation statistic — the maximum of $I(X;Y)$ over all input distributions, computed purely from the channel's transition probabilities — equal an operational limit computed by optimising over all possible code sequences of all possible lengths? The two quantities are defined in completely different terms: one is a single number derived from the channel matrix via entropy, the other is the supremum of rates achievable by sequences of encoding and decoding maps with vanishing error. There is no a priori reason they should agree, and the content of Shannon's Second Coding Theorem is precisely that they do. [quotetheorem:1655] The proof proceeds in two parts: one direction (operational $\leq$ informational) holds for all DMCs; the other (operational $\geq$ informational) is proved here for the BSC. We treat each direction separately. ### Capacity of the nth Extension A key stepping stone is computing the information capacity of the channel used $n$ times in succession. A single DMC has capacity $C$; its $n$th extension uses the channel $n$ times independently, with input alphabet $A^n$ and output alphabet $B^n$, governed by \begin{align*} \mathbb{P}(y_1,\ldots,y_n \mid x_1,\ldots,x_n) = \prod_{i=1}^n \mathbb{P}(y_i \mid x_i). \end{align*} [quotetheorem:1656] [citeproof:1656] The upper bound $nC$ is tight, meaning $n$ uses of the channel deliver exactly $nC$ bits of capacity — capacity scales linearly with the number of uses. ### Operational Capacity is at Most the Information Capacity [quotetheorem:1657] This is the *converse* of Shannon's theorem. It says no clever coding scheme can beat the information capacity. The proof is an elegant application of Fano's inequality. [citeproof:1657] The algebraic core of the argument is worth summarising in plain terms before interpreting the result: Fano's inequality converts the hypothesis $e(C_n) \to 0$ into an upper bound on $H(X \mid Y)$, the capacity theorem converts $I(X;Y)$ into a lower bound via $H(X)$, and combining these forces a contradiction the moment $R$ exceeds $C$. The next explanation unpacks what the converse does and does not guarantee. [explanation: What the Converse Does and Does Not Say] The converse establishes a hard ceiling: no sequence of codes can sustain rate $R > C$ with vanishing error. But it says nothing about what happens exactly at $R = C$. The theorem guarantees reliable transmission only *strictly* below capacity; at the boundary, the error probability need not vanish. This is a general feature of capacity results in information theory — the boundary is typically not achievable. The converse also does not say that any particular code fails — it says that *no* code sequence can work. There might be codes of length $n$ with rate slightly above $C$ and small (but bounded-away-from-zero) error. [/explanation] ### Achievability for the BSC The harder direction — showing that every rate $R < C$ is achievable — requires constructing a code sequence with $e(C_n) \to 0$. The proof uses the *method of random coding*: rather than explicitly constructing a good code, we show that a *randomly chosen* code works on average, and therefore at least one good code must exist. [quotetheorem:1658] [citeproof:1658] The probabilistic argument is striking: we never construct a specific good code. The existence proof works by showing that the *average* code over a random ensemble has vanishing error, which forces the existence of at least one code with that property. Shannon's theorem guarantees that good codes exist but offers no guidance on how to find them. ### Lifting Average Error to Worst-Case Error The achievability result above controls only the *average* error rate $e(C_n)$. Reliable transmission requires the *worst-case* error $\hat{e}(C_n) \to 0$. The following argument closes this gap. [quotetheorem:1659] [citeproof:1659] The half-deletion trick is elegant: by removing the worst half of codewords, we convert an average-case guarantee into a worst-case one at the cost of halving the code size. This costs only $1/n$ bits in the rate, which is asymptotically negligible. ## Summary and Significance Combining the converse (operational $\leq C$) with achievability (operational $\geq C$), we obtain: [remark: BSC Capacity] A BSC with error probability $p \in [0, 1/2)$ has operational capacity $1 - H(p)$. For any rate $R < 1 - H(p)$, reliable transmission is possible; for any $R > 1 - H(p)$, it is not. [/remark] Shannon's Second Coding Theorem has a profound and perhaps counterintuitive character. The information capacity $C = \max_X I(X;Y)$ is determined entirely by the channel's transition probabilities — it is an intrinsic property of the channel, independent of block length or any coding strategy. Shannon showed that this static quantity exactly characterises the dynamic phenomenon of long-block reliable transmission. [remark: Existence vs Construction] Shannon's theorem is an existence result. The random coding proof shows that codes achieving capacity exist but gives no procedure for constructing them explicitly. The search for capacity-achieving codes that are also computationally efficient (both for encoding and decoding) is a central problem in modern coding theory. Turbo codes and low-density parity-check (LDPC) codes, developed in the 1990s, are among the most successful practical approaches, approaching capacity for specific channels. [/remark] # 5. Linear Codes The earlier chapters of this course built codes from combinatorial and probabilistic principles — Hamming's original construction, the Singleton bound, perfect codes — but the methods were often ad hoc. This chapter imposes algebraic structure. By requiring a code to be a vector subspace of $\mathbb{F}_2^n$, we unlock two things simultaneously: efficient encoding via matrix multiplication, and efficient decoding via the parity-check matrix and syndrome lookup. We then study the dual code, establish the fundamental dimension formula, and meet the Reed-Muller family, which provides an infinite recursive construction achieving a wide range of parameters. ## Linear Codes and Why Structure Helps How do you decode a received word efficiently when your code has $2^k$ codewords? Brute-force comparison against all codewords takes time exponential in $k$. The answer — which motivates this entire chapter — is to restrict attention to codes that are closed under addition. When the code is a vector subspace, decoding becomes linear algebra, and the work shrinks dramatically. [definition: Linear Code] A code $C \subseteq \mathbb{F}_2^n$ is **linear** if: - $0 \in C$; - whenever $x, y \in C$, we have $x + y \in C$. Equivalently, $C$ is an $\mathbb{F}_2$-vector subspace of $\mathbb{F}_2^n$. The **rank** of $C$ is its dimension as an $\mathbb{F}_2$-vector space. A linear code of length $n$ and rank $k$ is called an $(n, k)$-code. If the minimum distance is $d$, it is an $(n, k, d)$-code. [/definition] Since $C$ has dimension $k$, it has exactly $2^k$ codewords, so an $(n, k)$-code is an $[n, 2^k]$-code in the earlier notation. The information rate is $k/n$: the fraction of transmitted bits that carry genuine information. The weight of a vector measures how many positions are nonzero, and for linear codes this is directly tied to minimum distance. The key insight — which fails completely for nonlinear codes — is that the difference of two codewords is again a codeword, so finding the minimum distance reduces to finding the minimum weight nonzero codeword. [definition: Weight] The **weight** of $x \in \mathbb{F}_2^n$ is $w(x) = d(x, 0)$, the Hamming distance from $x$ to the zero word. [/definition] [quotetheorem:1660] The reason is that for any two distinct codewords $x, y \in C$, the difference $x - y = x + y$ (since we are in characteristic 2) is itself a codeword, and $d(x, y) = w(x + y)$. So the minimum of $d(x, y)$ over distinct pairs equals the minimum of $w(z)$ over nonzero $z \in C$. This is a distinctly linear-code fact: without the closure property, differences of codewords need not be codewords, and the correspondence breaks down entirely. ## Parity-Check Codes and the Inner Product To build linear codes systematically, we use the standard inner product on $\mathbb{F}_2^n$. [definition: Inner Product on $\mathbb{F}_2^n$] For $x, y \in \mathbb{F}_2^n$, define \begin{align*} x \cdot y = \sum_{i=1}^n x_i y_i \in \mathbb{F}_2. \end{align*} This operation is symmetric and bilinear. It is not positive definite: $x \cdot x = w(x) \bmod 2$, so a nonzero vector can be self-orthogonal. [/definition] This inner product is the engine of parity checking. A 'parity condition' imposed on a word $x$ is simply the requirement that $p \cdot x = 0$ for some fixed check vector $p \in \mathbb{F}_2^n$: the positions where $p$ is 1 must together have an even sum. Imposing several such conditions simultaneously defines a parity-check code. Given a set $P \subseteq \mathbb{F}_2^n$ of check vectors, the corresponding parity-check code is all words that pass every check in $P$. [definition: Parity-Check Code] Let $P \subseteq \mathbb{F}_2^n$. The **parity-check code** defined by $P$ is \begin{align*} C = \{x \in \mathbb{F}_2^n : p \cdot x = 0 \text{ for all } p \in P\}. \end{align*} [/definition] This construction immediately produces examples. Taking $P = \{(1, 1, \ldots, 1)\}$ gives the simple parity-check code: all words with an even number of 1s. Taking $P$ to be the three rows \begin{align*} (1,0,1,0,1,0,1), \quad (0,1,1,0,0,1,1), \quad (0,0,0,1,1,1,1) \end{align*} gives Hamming's original $[7, 16, 3]$-code. Any parity-check code is automatically linear, since if $p \cdot x = 0$ and $p \cdot y = 0$, then $p \cdot (x + y) = 0$ by bilinearity. ## Dual Codes and the Dimension Formula Given a linear code $C$, how do we test membership efficiently? Checking whether $x \in C$ by comparing against all $2^k$ codewords is out of the question for large $k$. What we want is a minimal system of linear equations — the fewest parity checks that together characterise exactly the codewords of $C$. The natural object encoding this system is the dual code. [definition: Dual Code] Let $C \subseteq \mathbb{F}_2^n$ be a linear code. The **dual code** is \begin{align*} C^\perp = \{x \in \mathbb{F}_2^n : x \cdot y = 0 \text{ for all } y \in C\}. \end{align*} [/definition] Since $C^\perp$ is the parity-check code defined by the set $C$, it is automatically linear. A notable subtlety: $C \cap C^\perp$ is not necessarily $\{0\}$. In ordinary Euclidean geometry, a subspace and its orthogonal complement intersect only at the origin. Over $\mathbb{F}_2$, a nonzero vector can be orthogonal to itself (precisely when its weight is even), so a codeword of $C$ may also belong to $C^\perp$. [quotetheorem:1690] [citeproof:1690] The theorem says nothing about whether $C$ and $C^\perp$ are disjoint — only that their dimensions partition $n$. A code can well satisfy $C \subseteq C^\perp$ (a self-orthogonal code) or even $C = C^\perp$ (a self-dual code). The bound just requires $k \leq n - k$, i.e., $k \leq n/2$, for self-orthogonality to be possible. [quotetheorem:1691] [citeproof:1691] This corollary is genuinely useful: the parity-check codes and the linear codes are the same family, just described in two complementary ways. The linearity of $C$ is essential here. The statement $(C^\perp)^\perp = C$ is meaningless for nonlinear codes: if $C$ is not a subspace, then $C^\perp$ as defined (vectors orthogonal to every element of $C$) is still a subspace, but there is no reason for $(C^\perp)^\perp$ to recover $C$ rather than the subspace spanned by $C$. In a finite-dimensional real inner product space, the analogous identity $V = (V^\perp)^\perp$ holds automatically because the inner product is non-degenerate and the ambient space is well-behaved. Over $\mathbb{F}_2$, the geometry is more delicate — self-orthogonal vectors exist — yet the identity still holds, precisely because the dimension formula gives exact control over sizes. Practically, the corollary means we can freely pass between the generator-matrix description $G$ and the parity-check-matrix description $H$ of the same code: each is the generator matrix of the dual of the other. ## Generator Matrices and Standard Form How do we encode messages concretely using a linear code? Given a code $C$ of rank $k$, any basis gives a $k \times n$ matrix $G$ whose rows span $C$, and encoding becomes a matrix multiplication. But there is an obvious question: different bases give different matrices, and it is not clear which to prefer. The answer is a 'standard form' in which the first $k$ columns form the identity — this makes encoding systematic (message bits appear verbatim) and connects $G$ directly to the parity-check matrix $H$. [definition: Generator Matrix and Parity-Check Matrix] Let $C$ be an $(n, k)$-code. (i) A **generator matrix** $G$ for $C$ is a $k \times n$ matrix whose rows form a basis of $C$. (ii) A **parity-check matrix** $H$ for $C$ is a generator matrix for $C^\perp$. It is an $(n-k) \times n$ matrix. The codewords of $C$ are exactly the row vectors $\{yG : y \in \mathbb{F}_2^k\}$, and equivalently $C = \{x \in \mathbb{F}_2^n : Hx = 0\}$ (where $x$ is a column vector). [/definition] The codewords can thus be read off from $G$ (as linear combinations of rows) or characterised by $H$ (as solutions to a system of linear equations). Both descriptions are useful and complementary. Encoding using $G$ is constructive — given a message $y \in \mathbb{F}_2^k$, the codeword is $yG$. Error detection using $H$ is a test — compute $Hx$ and check whether it is zero. We will see that $H$ is the key tool for both minimum distance analysis and syndrome decoding. [definition: Equivalent Codes] Codes $C_1, C_2 \subseteq \mathbb{F}_2^n$ are **equivalent** if there exists a permutation $\sigma \in S_n$ such that applying $\sigma$ to each codeword of $C_1$ (reordering coordinates) gives $C_2$. [/definition] Equivalence preserves length, dimension, and minimum distance, so all combinatorial parameters are invariant. The advantage of working up to equivalence is that we can always bring the generator matrix into a convenient normal form. [quotetheorem:1661] [citeproof:1661] When $G = (I_k \mid B)$, encoding a message $y \in \mathbb{F}_2^k$ (as a row vector) produces $yG = (y \mid yB)$: the first $k$ bits reproduce the message, and the last $n - k$ bits are check digits computed by $yB$. This is systematic encoding. Two remarks on the role of equivalence here. First, the standard form is only achievable up to code equivalence: bringing $G$ to $(I_k \mid B)$ may require permuting columns, which changes the code. Since equivalence preserves all combinatorial parameters (length, rank, minimum distance), working up to equivalence is without loss of generality for any question about those parameters. Second, if $G$ is not in standard form — that is, the pivot columns are not in positions $1, \ldots, k$ — there is no simple closed-form expression for $H$ in terms of the entries of $G$. The formula $H = (B^\top \mid I_{n-k})$ depends on the specific structure of standard form. This is one reason we work with standard form: it makes the $G \leftrightarrow H$ relationship explicit. The relationship between $G$ and $H$ in standard form is explicit. [quotetheorem:1662] [citeproof:1662] The pattern $H = (B^\top \mid I_{n-k})$ has a natural interpretation: the entry $B^\top_{ij}$ records whether the $j$-th parity-check bit depends on the $i$-th message bit. The identity block $I_{n-k}$ on the right simply says that each parity-check bit appears in exactly one check equation — the one that defines it. This modularity is the hallmark of systematic codes. Note also what fails when $G$ is not in standard form: without the identity block $I_k$ sitting in a known position, there is no analogous closed form for $H$. Standard form is the key that makes the $G \leftrightarrow H$ correspondence algebraically transparent. ## Minimum Distance via the Parity-Check Matrix For a code of rank $k = 100$, the $2^{100}$ codewords cannot be listed, let alone compared pairwise. Computing the minimum distance by brute force is completely infeasible. Can we instead read $d$ directly off the parity-check matrix $H$, without enumerating codewords? The following theorem says yes: the minimum distance is encoded in the linear-independence structure of the columns of $H$. [quotetheorem:1663] The proof interprets $Hx = 0$ (the condition for $x$ to be a codeword): $x$ is a codeword with $w(x) = d$ precisely when there are $d$ columns of $H$ that sum to zero. So the minimum weight nonzero codeword corresponds to the smallest linearly dependent set of columns of $H$. The theorem does not assert that $H$ can be read off from $d$ alone — many different matrices can give the same $d$ — but it shows that improving minimum distance is equivalent to making columns of $H$ more independent. This criterion also clarifies the computational savings we identified earlier. Finding the minimum distance by listing all $|C| = 2^k$ codewords and computing $w(x)$ for each is $O(2^k)$ — the Minimum Distance Equals Minimum Weight theorem already reduces this from $O(|C|^2)$ pairwise comparisons to $O(|C|)$ weight computations. But for $k = 100$, even $O(2^k)$ is infeasible. The parity-check-matrix criterion replaces this with a question about column independence in an $(n-k) \times n$ matrix, which is a finite combinatorial problem independent of $2^k$. For nonlinear codes, neither reduction applies: the differences of codewords need not be codewords, so there is no analogue of 'minimum weight equals minimum distance', and no parity-check matrix to examine. ## Syndrome Decoding The parity-check matrix makes decoding algorithmic. Suppose we receive a word $x \in \mathbb{F}_2^n$ and want to correct errors. [definition: Syndrome] Let $C$ be an $(n, k)$-code with parity-check matrix $H$. The **syndrome** of $x \in \mathbb{F}_2^n$ is $Hx \in \mathbb{F}_2^{n-k}$. [/definition] The key observation is that if a codeword $c$ is transmitted and an error pattern $z$ is added, so that $x = c + z$ is received, then \begin{align*} Hx = H(c + z) = Hc + Hz = 0 + Hz = Hz. \end{align*} The syndrome depends only on the error pattern, not on which codeword was sent. This is the algebraic miracle that makes syndrome decoding work. For a code that is $e$-error correcting, the decoding procedure runs as follows. Before any transmission begins, precompute a lookup table of all syndromes $Hz$ for all error patterns $z$ with $w(z) \leq e$. When a word $x$ is received, compute $Hx$ and look it up in the table. If $Hx = Hz$ for some stored pattern $z$, then $H(x - z) = 0$, so $x - z$ is a codeword; decode $x$ as $c = x - z$. The efficiency gain is substantial: instead of comparing $x$ against all $2^k$ codewords, we compute one matrix–vector product and perform a lookup. The lookup table has at most $\sum_{i=0}^e \binom{n}{i}$ entries. For small $e$ and moderate $n$, this is manageable even when $2^k$ is astronomical. ## Hamming Codes Revisited Which parity-check matrices give 1-error-correcting perfect codes? The column-independence criterion tells us: we need any two columns to be linearly independent (ensuring $d \geq 3$), and the total number of columns must be as large as possible (for perfectness). Over $\mathbb{F}_2$, the answer is purely algebraic: take every nonzero vector as a column. [definition: Hamming Code] For $d \geq 1$, let $n = 2^d - 1$. Let $H$ be the $d \times n$ matrix whose columns are all nonzero elements of $\mathbb{F}_2^d$ (in any order). The **Hamming $(n, n-d)$-code** is the linear code with parity-check matrix $H$. [/definition] The original Hamming code is the case $d = 3$, giving a $[7, 4]$-code. The columns of $H$ are all 7 nonzero binary vectors of length 3. [quotetheorem:1664] [citeproof:1664] The theorem says $d(C) = 3$ — 1-error correcting — but says nothing about error detection. A $(n, n-d)$-Hamming code can actually detect 2 errors (since $d = 3 > 2$), but can only correct 1. The theorem also does not claim that Hamming codes are the unique perfect 1-error-correcting binary codes, though this does turn out to be true for the given parameters (up to equivalence). Syndrome decoding for Hamming codes is particularly elegant. The syndrome $Hx$ is a binary vector of length $d$, which we can read as a binary integer from 1 to $2^d - 1 = n$. If $x = c + e_i$ (a single error in position $i$), then $Hx = He_i$, which is the $i$-th column of $H$. If we arrange the columns of $H$ so that the $i$-th column encodes the binary representation of $i$, then the syndrome directly tells us which position is in error. This is the original insight of Hamming's 1950 paper. ## Reed-Muller Codes Hamming codes have minimum distance exactly 3, regardless of the parameter $d$. This is a fundamental limitation of the family: we cannot produce a 2-error-correcting code by scaling or modifying the Hamming construction. If we want a family where both $d$ and $n$ can be tuned — trading rate for error-correction — we need a genuinely different construction. Reed-Muller codes are the resolution: a two-parameter family where the minimum distance is $2^{d-r}$, giving us any power of 2 as a target distance. The Hamming codes are one member of a much richer family; the Reed-Muller codes provide a two-parameter collection, with a beautiful recursive structure built on set theory. The construction begins by identifying $\mathbb{F}_2^n$ with characteristic functions on a set. Take $X = \mathbb{F}_2^d$, so $|X| = n = 2^d$. There is a natural bijection between subsets of $X$ and vectors in $\mathbb{F}_2^n$: a subset $A \subseteq X$ corresponds to its indicator function $\mathbf{1}_A \in \mathbb{F}_2^n$ (the vector whose $i$-th entry is 1 iff $p_i \in A$). Under this bijection, symmetric difference of sets corresponds to vector addition in $\mathbb{F}_2^n$, and intersection corresponds to the coordinate-wise product $x \wedge y = (x_1 y_1, \ldots, x_n y_n)$. We now define specific vectors: let $v_0 = (1, 1, \ldots, 1)$ be the all-ones vector (corresponding to $X$ itself), and for $1 \leq i \leq d$, let $v_i = \mathbf{1}_{H_i}$ where $H_i = \{p \in X : p_i = 0\}$ is the hyperplane in $\mathbb{F}_2^d$ where the $i$-th coordinate is zero. [definition: Reed-Muller Code] Let $0 \leq r \leq d$. The **Reed-Muller code** $\mathrm{RM}(d, r)$ of order $r$ and length $2^d$ is the linear code spanned by $v_0$ and all wedge products $v_{i_1} \wedge \cdots \wedge v_{i_s}$ for $s \leq r$ and $1 \leq i_1 < i_2 < \cdots < i_s \leq d$. By convention, the empty wedge product is $v_0$. [/definition] [example: Reed-Muller Codes for $d = 3$] Take $X = \mathbb{F}_2^3 = \{p_1, \ldots, p_8\}$ listed in binary order $(000, 001, 010, 011, 100, 101, 110, 111)$. The generating vectors are: | Vector | $000$ | $001$ | $010$ | $011$ | $100$ | $101$ | $110$ | $111$ | |---|---|---|---|---|---|---|---|---| | $v_0$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | $v_1$ | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | | $v_2$ | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | | $v_3$ | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | | $v_1 \wedge v_2$ | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | | $v_1 \wedge v_3$ | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | | $v_2 \wedge v_3$ | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | | $v_1 \wedge v_2 \wedge v_3$ | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | The special cases: - $\mathrm{RM}(3, 0)$: span of $v_0$ — a repetition code of length 8; - $\mathrm{RM}(3, 1)$: span of $v_0, v_1, v_2, v_3$ — equivalent to a parity-check extension of Hamming's $(7, 4)$-code; - $\mathrm{RM}(3, 2)$: an $(8, 7)$-code, the simple parity-check code of length 8; - $\mathrm{RM}(3, 3) = \mathbb{F}_2^8$: the full space. [/example] The rank of $\mathrm{RM}(d, r)$ is computed by finding a basis formed from the wedge products. [quotetheorem:1665] [citeproof:1665] Part (i) is the surprising core: the wedge products of the $v_i$ form a basis for the entire ambient space. The $\binom{d}{s}$ products of exactly $s$ vectors from $\{v_1, \ldots, v_d\}$ behave like the degree-$s$ monomials in a polynomial ring, and their linear independence mirrors the independence of distinct monomials. Part (ii) does not say that other vectors span $\mathrm{RM}(d, r)$ more efficiently — the formula $\sum_{s=0}^r \binom{d}{s}$ is tight. ## The Bar Product Construction The minimum distance of Reed-Muller codes is computed not directly but via a powerful recursive decomposition. [definition: Bar Product] Let $C_1, C_2 \subseteq \mathbb{F}_2^n$ be linear codes with $C_2 \subseteq C_1$. The **bar product** is \begin{align*} C_1 \mid C_2 = \{(x \mid x + y) : x \in C_1,\, y \in C_2\}, \end{align*} a linear code of length $2n$. [/definition] The bar product concatenates a vector with a modified copy of itself. The containment condition $C_2 \subseteq C_1$ is needed to ensure that the set is a well-defined vector subspace (without it, the set may not be closed under addition). The rank and minimum distance of the bar product are determined precisely by the two input codes. [quotetheorem:1666] [citeproof:1666] The distance formula $\min\{2\, d(C_1), d(C_2)\}$ has a geometric interpretation: words of the bar product whose second half differs from the first (i.e., $y \neq 0$) achieve distance at least $d(C_2)$; words with matching halves (i.e., $y = 0$) are double-length repetitions and achieve distance at least $2\, d(C_1)$. The smaller of these two lower bounds determines the minimum distance. ## Recursive Structure and Minimum Distance of Reed-Muller Codes The bar product is not just an abstract construction — it is exactly how Reed-Muller codes are assembled from smaller ones. [quotetheorem:1667] [citeproof:1667] The theorem gives a complete picture of the Reed-Muller family. At one extreme, $r = 0$ gives a repetition code (minimum distance $2^d$, rank 1); at the other, $r = d$ gives the full space (minimum distance 1, rank $2^d$). For intermediate $r$, the rank is $\sum_{s=0}^r \binom{d}{s}$ and the minimum distance is $2^{d-r}$. As $r$ increases, rank grows and distance falls — the familiar trade-off. The recursive structure (i) is also the key to efficient encoding and decoding of Reed-Muller codes. Encoding a message for $\mathrm{RM}(d, r)$ splits the message into a part for $\mathrm{RM}(d-1, r)$ and a part for $\mathrm{RM}(d-1, r-1)$, encodes each recursively, and applies the bar product. Decoding runs a similar recursion, using the structure of each half to recover errors at each level. The result is a family of codes that can be decoded in $O(n \log n)$ time — a remarkable achievement given that their minimum distances grow as powers of 2. [remark: Reed-Muller Codes and Error-Correcting Capability] Since $d(\mathrm{RM}(d, r)) = 2^{d-r}$, the code is $\lfloor (2^{d-r} - 1)/2 \rfloor$-error correcting. For $r = 1$ (first-order Reed-Muller codes), this is $\lfloor (2^{d-1} - 1)/2 \rfloor$, which grows with $d$. First-order Reed-Muller codes were used in the Mariner 9 spacecraft (1971–1972) for image transmission from Mars, where robust error correction over a noisy deep-space channel was critical. The code $\mathrm{RM}(5, 1)$ has length 32, rank 6, and minimum distance 16 — capable of correcting up to 7 errors per block. [/remark] # 6. Cyclic and BCH Codes The codes we have studied so far — Hamming codes, Reed-Muller codes — were constructed by specifying a parity-check matrix and checking that it had the right combinatorial properties. This chapter introduces a more algebraic approach: we build codes directly as ideals in a polynomial ring, and the algebraic structure does the work of guaranteeing distance bounds. The reward is the BCH family of codes, which can correct any prescribed number of errors and admit a concrete decoding algorithm. The chapter develops the theory in three stages: the structure of cyclic codes via generator polynomials, the BCH bound relating algebraic roots to minimum distance, and the decoding algorithm that recovers the error pattern from syndrome information. ## Cyclic Codes and the Polynomial Ring Why should we care whether a code is closed under cyclic shifts? At first glance, the cyclic condition $(a_0, a_1, \ldots, a_{n-1}) \in C \implies (a_{n-1}, a_0, \ldots, a_{n-2}) \in C$ looks like a mere symmetry property — aesthetically pleasing but with no apparent structural payoff. The payoff is that cyclic codes are precisely the ideals in a certain quotient ring, and the theory of ideals in polynomial rings over fields gives us complete control over their structure. Every cyclic code of length $n$ turns out to be generated by a single polynomial dividing $X^n - 1$, and this single polynomial encodes the dimension, the parity-check matrix, and (for BCH codes) the minimum distance. We work with binary codes, so our alphabet is $\mathbb{F}_2$. Recall that for any field $F$ the polynomial ring $F[X]$ has a division algorithm: given $f, g \in F[X]$ with $g \neq 0$, there exist unique $q, r \in F[X]$ with $f = qg + r$ and $\deg r < \deg g$. This is the key fact that makes ideals in $F[X]$ so tractable. The bridge between words and polynomials is the identification \begin{align*} \pi: \mathbb{F}_2[X]/(X^n - 1) &\longrightarrow \mathbb{F}_2^n, \\ a_0 + a_1 X + \cdots + a_{n-1}X^{n-1} + (X^n - 1) &\longmapsto (a_0, a_1, \ldots, a_{n-1}). \end{align*} Multiplication by $X$ in the quotient ring corresponds to a cyclic shift of the coefficient vector: if $g(X) = a_0 + a_1 X + \cdots + a_{n-1}X^{n-1}$, then $Xg(X) \equiv a_{n-1} + a_0 X + \cdots + a_{n-2}X^{n-1} \pmod{X^n - 1}$. This observation drives the whole theory. [definition: Cyclic Code] A linear code $C \subseteq \mathbb{F}_2^n$ is **cyclic** if \begin{align*} (a_0, a_1, \ldots, a_{n-1}) \in C \implies (a_{n-1}, a_0, a_1, \ldots, a_{n-2}) \in C. \end{align*} [/definition] Under the identification $\pi$, a subset $C \subseteq \mathbb{F}_2^n$ is a cyclic code if and only if $\pi^{-1}(C)$ is an ideal in $\mathbb{F}_2[X]/(X^n-1)$. [quotetheorem:1668] [citeproof:1668] From now on we freely identify $C$ with $\pi^{-1}(C)$ and call both "$C$". The ideal characterisation is not a renaming; it is a structural statement with consequences. Because $\mathbb{F}_2[X]$ is a Euclidean domain, every ideal is principal, and this property descends to the quotient $\mathbb{F}_2[X]/(X^n - 1)$ whenever $X^n - 1$ factors without repeated irreducible components. When $n$ is odd this holds, so each of the $2^k$-element cyclic codes collapses to a single generating polynomial — a drastic simplification. For even $n$, however, $X^n - 1$ has repeated factors (at $n = 2$ we get $(X-1)^2$), the quotient ring is no longer a direct product of fields, and the classification of cyclic codes is more delicate. This is the first reason to restrict to odd $n$ throughout the chapter. ## Generator Polynomials The key theorem says that every cyclic code of length $n$ is controlled by a single polynomial — its generator polynomial — and this polynomial must divide $X^n - 1$. [quotetheorem:1669] [citeproof:1669] The theorem does not say every polynomial dividing $X^n - 1$ gives a genuinely useful code. Different divisors give codes with different dimensions and distances, and choosing $g(X)$ wisely is the design problem. Importantly, the theorem also fails for even $n$ in a subtle way: if $n$ is even then $X^n - 1$ factors with repeated roots in $\mathbb{F}_2[X]$ (for instance $X^2 - 1 = (X-1)^2$), and the ideal structure of the quotient ring becomes more complicated. ### Dimension and the Generator Matrix Given the generator polynomial $g(X) = a_0 + a_1 X + \cdots + a_k X^k$ of degree $k$, the codewords are $f(X)g(X)$ as $f$ ranges over polynomials of degree at most $n - k - 1$. This gives an explicit basis. [quotetheorem:1670] [citeproof:1670] The degree bound $\deg f \leq n-k-1$ is sharp: any $f$ of higher degree reduces mod $X^n-1$ to a polynomial of degree at most $n-1$ that is still a multiple of $g$, so the basis is not enlarged. The theorem immediately fixes the dimension at $n - \deg g$, which reconciles the ideal picture with the usual $\dim C = k$ convention for an $(n, k)$-code. It also provides a concrete encoding: a message polynomial $m(X)$ of degree at most $n-k-1$ maps to the codeword $m(X) g(X)$, and the basis of shifts shows exactly how the Toeplitz-like structure of the generator matrix arises. Writing out the basis vectors as rows gives the generator matrix $G$, a $(n-k) \times n$ matrix in which each row is a cyclic shift of the previous one: \begin{align*} G = \begin{pmatrix} a_0 & a_1 & \cdots & a_k & 0 & \cdots & 0 \\ 0 & a_0 & \cdots & a_{k-1} & a_k & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & 0 & \cdots & a_0 & a_1 & \cdots & a_k \end{pmatrix}. \end{align*} This Toeplitz-like structure is a direct consequence of the cyclic property, and it means that encoding a message $(m_0, \ldots, m_{n-k-1})$ amounts to computing the polynomial $\sum m_i X^i$ times $g(X)$. ### The Parity Check Polynomial Since $g(X) \mid X^n - 1$, we can write $g(X) h(X) = X^n - 1$ for some polynomial $h(X) = b_0 + b_1 X + \cdots + b_{n-k} X^{n-k}$ of degree $n - k$. The polynomial $h(X)$ is called the **parity check polynomial** of $C$. [quotetheorem:1671] [citeproof:1671] Notice that the roles of $g$ and $h$ are symmetric up to reversing coefficients: the dual code $C^\perp$ corresponds to the reversed polynomial $\check{h}(X) = X^{n-k} h(1/X)$. This reciprocity reflects the duality between encoding and syndrome checking. [example: The Hamming (7,4)-Code as a Cyclic Code] Over $\mathbb{F}_2$, the factorisation $X^7 - 1 = (X+1)(X^3 + X + 1)(X^3 + X^2 + 1)$ holds. Taking $g(X) = X^3 + X + 1$ gives a cyclic code of length $7$ with $\dim C = 7 - 3 = 4$. The parity check polynomial is $h(X) = (X+1)(X^3 + X^2 + 1) = X^4 + X^2 + X + 1$. The parity check matrix built from $h$ is \begin{align*} H = \begin{pmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{pmatrix}, \end{align*} whose columns are exactly the nonzero elements of $\mathbb{F}_2^3$. This is the Hamming $(7,4)$-code — so cyclic codes and Hamming codes are not separate families; one is a special case of the other. [/example] The example reveals something important: the Hamming code, defined earlier through a linear-algebraic condition on columns, turns out to have a polynomial generator with clean divisibility properties. The cyclic perspective unifies what might otherwise seem like an ad hoc construction. ### Counting Cyclic Codes of Odd Length When $n$ is odd, $\gcd(n, 2) = 1$, so $X^n - 1$ has no repeated factors over $\mathbb{F}_2$. [quotetheorem:1672] The proof invokes two classical facts about polynomials over fields of positive characteristic: $X^n - 1$ is separable when $\gcd(n, \operatorname{char} F) = 1$, and every polynomial over a field factors uniquely into irreducibles. Given the distinct irreducible factorisation, subsets of $\{f_1, \ldots, f_t\}$ correspond bijectively to divisors of $X^n - 1$, hence to cyclic codes. Each divisor of $X^n - 1$ is a product of some subset of the irreducible factors, and each such divisor gives a distinct cyclic code. The extreme cases are $g(X) = 1$ (the whole space $\mathbb{F}_2^n$) and $g(X) = X^n - 1$ (the zero code $\{0\}$). The interesting codes lie between these extremes, and the challenge is to choose $g$ so that the resulting code has large minimum distance. This is exactly the problem the BCH construction addresses. ## BCH Codes and the Design Distance The BCH construction answers a question that generator polynomials alone cannot: given that we want a code that corrects $t$ errors, how do we choose $g(X)$? The insight is to specify $g(X)$ by prescribing which elements of an extension field it must vanish on, rather than writing down its coefficients directly. ### Fields and Roots of Unity To make sense of "roots of $g(X)$ in an extension field," we need a few facts about finite fields. These results are proved in the Part II Galois Theory course; we state them here as the algebraic foundation for BCH codes. If $p$ is prime and $f(X) \in \mathbb{F}_p[X]$ is irreducible of degree $d$, then $\mathbb{F}_p[X]/(f)$ is a field of order $p^d$. More generally, for every prime power $q = p^\alpha$ there exists a field $\mathbb{F}_q$ of order $q$, unique up to isomorphism. The multiplicative group $\mathbb{F}_q^\times = \mathbb{F}_q \setminus \{0\}$ is cyclic; a generator of this group is called a **primitive element** of $\mathbb{F}_q$. Now fix an odd integer $n$ and choose $r \geq 1$ such that $2^r \equiv 1 \pmod{n}$ (this exists because $\gcd(n, 2) = 1$). Set $K = \mathbb{F}_{2^r}$. The $n$-th roots of unity in $K$ form the subgroup $\mu_n(K) = \{x \in K : x^n = 1\} \leq K^\times$. Since $K^\times$ is cyclic of order $2^r - 1$ and $n \mid (2^r - 1)$, the group $\mu_n(K)$ is cyclic of order $n$. A generator of $\mu_n(K)$ is called a **primitive $n$-th root of unity** and is denoted $\alpha$. ### Defining Sets and the Generator Polynomial [definition: Cyclic Code with Defining Set] Let $A \subseteq \mu_n(K)$ be a subset of $n$-th roots of unity. The **cyclic code with defining set $A$** is \begin{align*} C = \{ f(X) \pmod{X^n - 1} : f \in \mathbb{F}_2[X],\; f(a) = 0 \text{ for all } a \in A \}. \end{align*} The generator polynomial $g(X)$ of $C$ is the monic polynomial of least degree over $\mathbb{F}_2$ that vanishes on all $a \in A$, equivalently the least common multiple of the minimal polynomials of the elements of $A$ over $\mathbb{F}_2$. [/definition] The phrase "minimal polynomial over $\mathbb{F}_2$" is important: the elements of $A$ live in the extension field $K$, not in $\mathbb{F}_2$, so we need to pull back to $\mathbb{F}_2[X]$ via minimal polynomials. Different elements of $A$ may share a minimal polynomial (if they are conjugate over $\mathbb{F}_2$), which is why we take the lcm rather than the product. [definition: BCH Code] Let $\alpha \in \mu_n(K)$ be a primitive $n$-th root of unity and let $\delta \geq 2$ be an integer. The **BCH code** (after Bose, Ray-Chaudhuri, and Hocquenghem) **with design distance $\delta$** is the cyclic code with defining set $\{\alpha, \alpha^2, \ldots, \alpha^{\delta-1}\}$. [/definition] The name "design distance" anticipates the main theorem: $\delta$ is not quite the minimum distance, but it is a guaranteed lower bound. The actual minimum distance can be larger. ## The BCH Bound The central theorem of this chapter guarantees that specifying $\delta - 1$ consecutive powers of $\alpha$ as roots forces the minimum distance to be at least $\delta$. The proof uses a classical determinant identity. [quotetheorem:1673] The Vandermonde identity is established by an induction on $m$: factor out $\prod_i (x_i - x_1)$ by row-reducing the first column, then apply the inductive hypothesis to the resulting $(m-1) \times (m-1)$ Vandermonde matrix. Since all $x_i$ are distinct, every factor $(x_i - x_j)$ is nonzero, so the product is nonzero. This means any square Vandermonde matrix with distinct nodes has nonzero determinant — equivalently, any set of rows from a Vandermonde matrix are linearly independent. [quotetheorem:1674] [citeproof:1674] Two points deserve emphasis. First, the matrix $M$ used in the proof has entries in the extension field $K$, not in $\mathbb{F}_2$; it is not itself a parity check matrix for $C$ (whose entries must be in $\mathbb{F}_2$). The proof works because column independence over $K$ implies column independence over $\mathbb{F}_2$, but $M$ is an analytical tool, not a structural one. Second, the BCH bound says $d(C) \geq \delta$, not $d(C) = \delta$. The actual minimum distance can exceed the design distance — the BCH construction is conservative, and stronger distance lower bounds sometimes hold. [example: BCH Bound for the Hamming (7,4)-Code] Take $n = 7$, $K = \mathbb{F}_{2^3}$, and let $\beta$ be a root of $X^3 + X + 1 \in \mathbb{F}_2[X]$. Since $\beta^3 = \beta + 1$, one computes $\beta^6 = \beta^2 + 1$, and checking directly gives $(\beta^2)^3 + (\beta^2) + 1 = \beta^6 + \beta^2 + 1 = (\beta^2 + 1) + \beta^2 + 1 = 0$. So $\beta^2$ is also a root of $X^3 + X + 1$, and the BCH code with defining set $\{\beta, \beta^2\}$ has generator polynomial $g(X) = X^3 + X + 1$. This is the Hamming $(7,4)$-code. The BCH theorem guarantees $d \geq 3$, and since a Hamming code has minimum distance exactly $3$, the bound is tight in this case. [/example] The example also illustrates why the $\delta - 1$ consecutive powers $\alpha, \alpha^2, \ldots$ in the defining set matter: it is exactly this consecutiveness that provides the Vandermonde structure the proof exploits. A BCH code whose defining set lacks consecutive powers would not be covered by the bound. With the bound in hand and a code whose distance we can therefore control by design, the remaining question is operational: given a received word, can we actually compute the error? The next section develops the decoding algorithm. ## Decoding BCH Codes The BCH bound tells us that a code with design distance $\delta$ can correct $t = \lfloor(\delta-1)/2\rfloor$ errors. Now we need a practical procedure to do so. The decoding algorithm for BCH codes is one of the most satisfying results in the theory: the algebra that defines the code also provides the machinery to decode it. ### Setup and Error Locator Polynomial We work with a BCH code $C$ of length $n$ having defining set $\{\alpha, \alpha^2, \ldots, \alpha^{\delta-1}\}$, where $\alpha$ is a primitive $n$-th root of unity in $K = \mathbb{F}_{2^r}$. We transmit a codeword $c(X) \in C$ and receive $r(X) = c(X) + e(X)$, where $e(X)$ is the error polynomial with at most $t$ nonzero coefficients. Write $E = \{i : e_i = 1, 0 \leq i \leq n-1\}$ for the **error location set**. Since $c(\alpha^j) = 0$ for $1 \leq j \leq \delta - 1$, we have $r(\alpha^j) = e(\alpha^j)$ for those indices. The values $\{r(\alpha^j) : 1 \leq j \leq 2t\}$ are called the **syndromes** and can be computed directly from the received word. [definition: Error Locator Polynomial] The **error locator polynomial** is \begin{align*} \sigma(X) = \prod_{i \in E} (1 - \alpha^i X) \in K[X]. \end{align*} [/definition] The roots of $\sigma(X)$ are $\alpha^{-i}$ for $i \in E$, so knowing $\sigma(X)$ amounts to knowing the error positions. The central theorem characterises $\sigma(X)$ purely in terms of the received word. [quotetheorem:1675] [citeproof:1675] The theorem is the theoretical heart of BCH decoding. It reduces the problem — recover the error pattern from the received word — to finding the polynomial of least degree satisfying a linear congruence. This is a computable problem: we are looking for the minimal-degree $\sigma$ such that $\sigma(X) \sum_{j=1}^{2t} S_j X^j \equiv \omega(X) \pmod{X^{2t+1}}$ with $\deg \omega < \deg \sigma$, and minimal-degree solutions of such congruences can be found by the extended Euclidean algorithm or the Berlekamp–Massey procedure. What remains is to translate the abstract characterisation into an explicit step-by-step algorithm that a computer can run. ### The Decoding Algorithm Given the theorem, the decoding procedure is as follows. Receive $r(X)$. 1. **Compute syndromes.** Evaluate $S_j = r(\alpha^j)$ for $1 \leq j \leq 2t$ by substituting elements of $K$ into $r(X)$. 2. **Set up linear equations.** Write $\sigma(X) = 1 + \sigma_1 X + \cdots + \sigma_t X^t$ with unknown coefficients $\sigma_1, \ldots, \sigma_t \in K$. The congruence condition $\sigma(X) \sum_{j=1}^{2t} S_j X^j \equiv \omega(X) \pmod{X^{2t+1}}$, for degree $i$ terms with $t+1 \leq i \leq 2t$, yields the system \begin{align*} \sum_{j=0}^{t} \sigma_j S_{i-j} = 0, \quad t+1 \leq i \leq 2t, \end{align*} where $\sigma_0 = 1$. These are $t$ linear equations in $\sigma_1, \ldots, \sigma_t$ over $K$. 3. **Solve over $K$.** Find all solutions and retain those of minimum degree (i.e., with the fewest nonzero coefficients among $\sigma_1, \ldots, \sigma_t$). 4. **Find error positions.** Compute $E = \{0 \leq i \leq n-1 : \sigma(\alpha^{-i}) = 0\}$ and check that $|E| = \deg \sigma$. If this fails, more than $t$ errors occurred and the algorithm signals a decoding failure. 5. **Correct the errors.** Set $e(X) = \sum_{i \in E} X^i$ and decode as $c(X) = r(X) + e(X)$. Step 4 is sometimes called the **Chien search**: it tests all $n$ possible error positions by evaluating $\sigma$ at each $\alpha^{-i}$, which is efficient because multiplying by $\alpha^{-1}$ is cheap in $K$. ### Structure of the Linear System The system of equations in step 2 is worth examining more carefully. Writing it in matrix form: \begin{align*} \begin{pmatrix} S_t & S_{t-1} & \cdots & S_1 \\ S_{t+1} & S_t & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_{2t-1} & S_{2t-2} & \cdots & S_t \end{pmatrix} \begin{pmatrix} \sigma_1 \\ \sigma_2 \\ \vdots \\ \sigma_t \end{pmatrix} = \begin{pmatrix} S_{t+1} \\ S_{t+2} \\ \vdots \\ S_{2t} \end{pmatrix}, \end{align*} this is the **Newton identity** system, and the coefficient matrix is a **Hankel matrix** (constant on anti-diagonals). When the number of errors $|E|$ is strictly less than $t$, the system is underdetermined, but the uniqueness theorem guarantees we find the right $\sigma$ by taking the solution of minimum degree. In practice, the Berlekamp-Massey algorithm solves this system efficiently by finding the shortest linear feedback shift register that generates the syndrome sequence. [example: BCH Decoding with One Error] Take the Hamming $(7,4)$-code realised as a BCH code with $n = 7$, design distance $\delta = 3$ (so $t = 1$). Work over $K = \mathbb{F}_8 = \mathbb{F}_2[\beta]$ with $\beta^3 + \beta + 1 = 0$, and set $\alpha = \beta$, so $\alpha$ is a primitive $7$th root of unity. Suppose we receive $r(X) = X^5$, corresponding to a single bit set in position $5$. The syndrome is $S_1 = r(\alpha) = \alpha^5 = \beta^5$. For a single error at position $i$, the error locator is $\sigma(X) = 1 + X_1 X$ where $X_1 = \alpha^i$. The Newton identity at degree $1$ gives $\sigma_1 + S_1 = 0$, so over $\mathbb{F}_2$ we have $\sigma_1 = S_1 = \beta^5$ (since $-1 = 1$). The root of $\sigma(X) = 1 + \beta^5 X$ is $X = \beta^{-5} = \beta^{7-5} = \beta^2$. Thus $\alpha^{-i} = \beta^2$, giving $\alpha^i = \beta^{-2} = \beta^{5}$, and therefore $i = 5$. The correction step gives $c(X) = r(X) + X^5 = 0$, the all-zero codeword. [/example] The example illustrates how the decoding algorithm succeeds even when the syndrome computation and field arithmetic require working in $K = \mathbb{F}_8$ rather than $\mathbb{F}_2$. All intermediate calculations happen in the extension field; only the final codeword lives in $\mathbb{F}_2^n$. ### Reed-Solomon Codes and Practical Applications When the code length equals $n = 2^r - 1$ (the full order of $K^\times$) and $\alpha$ is a primitive element of $K$ (not just an $n$-th root of unity), the BCH code becomes a **Reed-Solomon code**. In this case the BCH bound is tight: the minimum distance equals the design distance $\delta$. Reed-Solomon codes are maximum distance separable and achieve the Singleton bound $d \leq n - k + 1$ with equality. The practical utility of BCH codes extends far beyond theoretical examples. CD players use BCH codes (specifically a concatenated scheme involving Reed-Solomon codes) to correct the burst errors introduced by scratches and manufacturing defects. The error-correcting capability is what allows audio to play without distortion even when the disc is imperfect. This application, mentioned in the source lectures, illustrates that the algebraic machinery developed in this chapter — polynomial rings, extension fields, Vandermonde determinants — underlies technology that operates billions of times per day. The generator polynomial formalism also extends naturally to allow **burst error correction**, where errors are not independent but clustered in contiguous positions. Cyclic codes are particularly well-suited to burst correction because a cyclic shift of a burst error pattern is still a burst error pattern, and the algebra handles both simultaneously. # 7. Classical Cryptography, One-Time Pad, and Stream Ciphers The central question of cryptography is not how to send a message, but how to send a message that an eavesdropper cannot read. Unlike coding theory, where the adversary is noise, here the adversary is intelligent: she can observe every ciphertext, knows the encryption scheme, and may even be able to choose which plaintexts she encrypts. This chapter develops the mathematical foundations for reasoning about what secrecy means, when it is achievable, and at what cost. We begin with classical examples, pass through Shannon's information-theoretic framework, and arrive at the one-time pad and its practical approximation via linear feedback shift registers. ## What Can an Enemy Learn from a Ciphertext? Suppose Alice wants to send a secret message to Bob in the presence of Eve, who intercepts every transmission. The naive hope is that Eve simply cannot read the ciphertext. But "cannot read" is not a precise mathematical statement. To make progress, we need a formal model. [definition: Cryptosystem] A **cryptosystem** consists of finite sets $\mathcal{M}$ (message space), $\mathcal{K}$ (key space), $\mathcal{C}$ (ciphertext space), and functions \begin{align*} e : \mathcal{M} \times \mathcal{K} \to \mathcal{C}, \qquad d : \mathcal{C} \times \mathcal{K} \to \mathcal{M} \end{align*} satisfying $d(e(m, k), k) = m$ for all $m \in \mathcal{M}$ and $k \in \mathcal{K}$. [/definition] The condition $d(e(m,k),k) = m$ is the basic correctness requirement: the intended receiver, who knows $k$, can always recover the message. Eve knows the functions $e$ and $d$ (Kerckhoffs's principle: the security of the system should rest on the key, not on the secrecy of the algorithm), but does not know which key was used. Three classical examples illustrate the range of key spaces: [example: Classical Substitution Ciphers] Take $\mathcal{M} = \mathcal{C} = \{A, B, \ldots, Z\}^*$, the set of all finite strings over the 26-letter alphabet $\Sigma$. **(i) Simple substitution.** $\mathcal{K}$ is the set of all permutations of $\Sigma$ (there are $26!$ of them). A key $\sigma \in \mathcal{K}$ replaces each letter $x$ by $\sigma(x)$, and decryption applies $\sigma^{-1}$. **(ii) Vigenère cipher.** Identify $\Sigma$ with $\mathbb{Z}/26\mathbb{Z}$. Fix $d \in \mathbb{N}$; the key space is $\mathcal{K} = \Sigma^d$. A key $(k_0, \ldots, k_{d-1})$ is written out repeatedly below the message, and letters are added modulo 26. For instance, with key LEMON and plaintext ATTACKATDAWN: \begin{align*} \text{plaintext:} & \quad \text{A T T A C K A T D A W N} \\ \text{key:} & \quad \text{L E M O N L E M O N L E} \\ \text{ciphertext:} & \quad \text{L X F O P V E F R N H R.} \end{align*} **(iii) Caesar cipher.** The special case $d = 1$ of the Vigenère cipher: a single letter shift applied uniformly to every position in the message. The simple substitution and Vigenère ciphers both fail at the first level of attack for messages in natural language: the statistical structure of English (letter frequencies, common digraphs) survives encryption and can be exploited to recover the key without knowing it directly. [/example] ## Three Levels of Attack The strength of a cryptosystem cannot be assessed in isolation; it depends on what resources the attacker has available. The standard hierarchy has three levels of increasing power. **Level 1 (Ciphertext-only).** Eve observes some ciphertext and must recover the plaintext or the key. This is the weakest attack, but the most common: it covers passive eavesdropping. **Level 2 (Known-plaintext).** Eve possesses a substantial collection of plaintext-ciphertext pairs $(m, e(m, k))$ and uses them to determine $k$, after which she can decrypt all future messages. This models situations where Eve can observe pairs of messages and their encrypted versions — for example, if certain standard phrases are known to appear at the start of every message. **Level 3 (Chosen-plaintext).** Eve can choose any plaintext $m$ and obtain $e(m, k)$. This is the strongest attack. In modern applications, security against Level 3 is the standard requirement, because an adversary who interacts with the encryption oracle over time may learn to distinguish ciphertexts. Simple substitution fails at Level 1 for long English texts because the frequency distribution of ciphertext letters mirrors the known frequency distribution of English. The Vigenère cipher with period $d$ fails at Level 2 with $O(d)$ known plaintext-ciphertext pairs, since knowledge of $d$ positions reveals the key directly. ## Perfect Secrecy and Its Cost The information-theoretic approach asks: can we design a system where the ciphertext reveals nothing about the plaintext, regardless of how much Eve knows? Shannon's formalization makes this precise. Model the key $K$ and message $M$ as independent random variables taking values in $\mathcal{K}$ and $\mathcal{M}$ respectively, with some distributions. The ciphertext $C := e(M, K)$ is then also a random variable. [definition: Perfect Secrecy] A cryptosystem $(\mathcal{M}, \mathcal{K}, \mathcal{C})$ has **perfect secrecy** if $M \perp K$ (the message and key are independent) and $H(M \mid C) = H(M)$, i.e. the message $M$ and ciphertext $C$ are independent random variables. [/definition] Perfect secrecy means that observing the ciphertext gives Eve no information about the message whatsoever: her posterior distribution over messages is identical to her prior. The hypothesis that $M$ and $K$ are independent is essential — it captures the assumption that the key is chosen without reference to the message, which is standard in practice. [quotetheorem:1676] [citeproof:1676] This constraint has a fundamental implication: a one-time pad key must be at least as long as the message. This observation motivates the search for efficient measures of how far a system falls short of perfect secrecy. ## Message and Key Equivocation Perfect secrecy is all-or-nothing: either observing the ciphertext tells Eve nothing, or the system does not achieve perfect secrecy. But real cryptosystems often fall somewhere in between — they leak some information, but not all of it. How do we quantify partial leakage? How much does a ciphertext actually reveal about the message? That is precisely what equivocation measures: the residual uncertainty about $M$ that remains after $C$ is observed. [definition: Equivocation] Given a cryptosystem with random variables $M$, $K$, $C$: 1. The **message equivocation** is $H(M \mid C)$, the conditional entropy of the message given the ciphertext. 2. The **key equivocation** is $H(K \mid C)$, the conditional entropy of the key given the ciphertext. [/definition] High message equivocation means Eve still faces substantial uncertainty about the message after seeing the ciphertext. A system with perfect secrecy has message equivocation equal to $H(M)$ — the maximum possible. A system with message equivocation zero has been completely broken. The following lemma establishes a fundamental ordering between the two equivocations. [quotetheorem:1677] [citeproof:1677] The content of this result is that key equivocation is always at least as large as message equivocation. To achieve high message secrecy, the key must remain uncertain to Eve — she cannot break the message without first breaking the key. Three further observations sharpen the picture. First, the independence hypothesis $M \perp K$ is essential: without it, the right-hand side $H(K \mid C)$ can be strictly smaller than the true equivocation $H(M \mid C)$, so the bound would fail. Second, the inequality $H(M \mid C) \leq H(K \mid C)$ is often strict; when it is, the gap equals exactly the mutual information $I(M; C) - I(K; C)$ contributing to the two sides, and in particular when $H(K \mid C) = H(M \mid C) + \Delta$ for $\Delta > 0$, the key carries more residual uncertainty than the message. Third, the inequality motivates the concept of **unicity distance**: as more ciphertext is observed, both equivocations decrease; the unicity distance is the value of $n$ at which $H(K \mid C^{(n)})$ first approaches zero, meaning the key is essentially determined and all secrecy is lost. ## Unicity Distance: How Much Ciphertext Does It Take? In practice, Eve intercepts multiple messages all encrypted with the same key. How many ciphertexts are needed before the key is essentially determined? [definition: Unicity Distance] Let $M^{(n)} = (M_1, \ldots, M_n)$ be a sequence of $n$ independent messages encrypted with the same key $K$ to produce ciphertexts $C^{(n)} = (C_1, \ldots, C_n)$. The **unicity distance** is the least $n$ such that $H(K \mid C^{(n)}) = 0$, i.e. the smallest number of encrypted messages required to uniquely determine the key. [/definition] To compute the unicity distance explicitly, we use three simplifying assumptions: - All keys are equally likely: $H(K) = \log |\mathcal{K}|$. - The message source has a well-defined per-symbol entropy: $H(M^{(n)}) \approx nH$ for large $n$, where $H$ is the per-symbol entropy. - All ciphertext sequences are equally likely: $H(C^{(n)}) = n \log |\mathcal{A}|$, where $\mathcal{A}$ is the common alphabet. Under these assumptions, since $K$ and $M^{(n)}$ are independent: \begin{align*} H(K \mid C^{(n)}) &= H(K, C^{(n)}) - H(C^{(n)}) = H(K, M^{(n)}) - H(C^{(n)}) \\ &= H(K) + H(M^{(n)}) - H(C^{(n)}) \\ &\approx \log |\mathcal{K}| + nH - n \log |\mathcal{A}|. \end{align*} Setting this to zero, the unicity distance is: \begin{align*} U = \frac{\log |\mathcal{K}|}{\log |\mathcal{A}| - H}. \end{align*} Writing $R = 1 - H/\log |\mathcal{A}|$ for the **redundancy** of the message source, this becomes $U = \log |\mathcal{K}| / (R \log |\mathcal{A}|)$. [example: Interpretating the Unicity Distance] The unicity distance increases when the key space $|\mathcal{K}|$ is large (many possible keys) or the redundancy $R$ is small (the source uses nearly all of its capacity, as in compressed data). English text has redundancy roughly $R \approx 0.75$ (its entropy per letter is about $1.5$ bits while the maximum is $\log_2 26 \approx 4.7$ bits). A Caesar cipher has $|\mathcal{K}| = 26$, giving $U \approx 1/0.75 \approx 1.3$: even a single ciphertext message almost certainly determines the key. A simple substitution cipher has $|\mathcal{K}| = 26!$, giving $U \approx \log_2(26!)/(0.75 \times \log_2 26) \approx 88.38/3.525 \approx 25.1$: about 25 letters of ciphertext suffice to determine the key. In both cases, the redundancy of English is the enemy of secrecy. A boundary case: if $R = 0$ (the source has maximum entropy, e.g. uniformly random messages), then $U = \infty$, and the key can never be determined from ciphertext alone. This is consistent with perfect secrecy being achievable only when the message source is completely unpredictable. [/example] ## Feedback Shift Registers as Key-Stream Generators The one-time pad is information-theoretically optimal, but it requires a key as long as the message. Sharing such a key is impractical: distributing a gigabyte of random key material every time you want to send a gigabyte of data defeats the purpose of secure communication. Can we instead generate a long pseudo-random key stream deterministically from a short shared seed — something that looks random to an eavesdropper but is reproducible by Alice and Bob who know the seed? Feedback shift registers are the minimal answer: a compact register of $d$ bits, iterated by a feedback rule, whose output stream has period up to $2^d$. [definition: Feedback Shift Register] A **feedback shift register (FSR)** of length $d$ is a map $f : \mathbb{F}_2^d \to \mathbb{F}_2^d$ given by \begin{align*} f(x_0, x_1, \ldots, x_{d-1}) = (x_1, x_2, \ldots, x_{d-1}, C(x_0, x_1, \ldots, x_{d-1})) \end{align*} where $C : \mathbb{F}_2^d \to \mathbb{F}_2$ is the **feedback function**. Given an initial fill $(y_0, y_1, \ldots, y_{d-1}) \in \mathbb{F}_2^d$, the **stream** associated to this fill is the infinite sequence $(y_0, y_1, y_2, \ldots)$ with \begin{align*} y_n = C(y_{n-d}, y_{n-d+1}, \ldots, y_{n-1}) \quad \text{for all } n \geq d. \end{align*} [/definition] The map $f$ shifts the register contents left by one position and feeds the new value in on the right. The state space is $\mathbb{F}_2^d$, which has $2^d$ elements. Since the state space is finite, any stream produced by an FSR must eventually repeat — but how quickly? [quotetheorem:1678] [citeproof:1678] The maximum possible period of an FSR stream is $2^d$ — achieved when the state sequence itself is periodic before repeating the initial state. For practical stream cipher design, this maximum is desirable: a long period before repetition means the key stream is harder to predict. Two refinements are worth noting. First, the finiteness of $\mathbb{F}_2^d$ is essential to the argument: for an FSR over an infinite state space, the pigeonhole argument fails and the stream need not be eventually periodic. Second, 'eventually periodic' is not the same as 'purely periodic': the stream is purely periodic (i.e., periodic from index $0$) only when the initial state itself lies on a cycle of the state transition map. For example, certain nonlinear FSRs have a pre-period — a sequence of states leading into a cycle — so the stream differs in its transient phase from the periodic tail. Third, the bound $2^d$ on the period is not sharp for arbitrary FSRs; a nonlinear feedback function may achieve a strictly smaller maximum period. The LFSR specialisation tightens this: because the all-zero state maps to itself under any linear recurrence, it cannot lie on a non-trivial cycle, so the maximum period for a LFSR is $2^d - 1$ (over the remaining states), and a primitive auxiliary polynomial achieves this maximum. ## Linear Feedback Shift Registers We want to analyse the period of an FSR — specifically, to identify which feedback functions produce streams of maximum length. For a general nonlinear feedback function, the state transition graph is a combinatorial mess: cycles of different lengths can coexist, and determining the period requires tracing every orbit. Linearity tames this completely. When the feedback function is linear, the state space $\mathbb{F}_2^d$ becomes a vector space, and the state transitions are governed by a linear recurrence. This brings finite-field algebra to bear: the period of the stream is controlled by the arithmetic of the auxiliary polynomial, and the theory of primitive polynomials over $\mathbb{F}_2$ gives a clean characterisation of maximum-period LFSRs. [definition: Linear Feedback Shift Register] A **linear feedback shift register (LFSR)** of length $d$ is an FSR with linear feedback function: \begin{align*} C(x_0, x_1, \ldots, x_{d-1}) = \sum_{i=0}^{d-1} a_i x_i, \quad a_i \in \mathbb{F}_2 \end{align*} (usually with $a_0 = 1$ to ensure the register can run backwards). The stream produced by a LFSR satisfies the linear recurrence \begin{align*} y_n = \sum_{i=0}^{d-1} a_i y_{n-d+i} \quad \text{for all } n \geq d, \end{align*} over $\mathbb{F}_2$. The **auxiliary polynomial** is $P(X) = X^d + a_{d-1}X^{d-1} + \cdots + a_1 X + a_0 \in \mathbb{F}_2[X]$, and the **feedback polynomial** is $\check{P}(X) = a_0 X^d + a_1 X^{d-1} + \cdots + a_{d-1}X + 1 = X^d P(X^{-1})$. [/definition] The recurrence $y_n = \sum_{i=0}^{d-1} a_i y_{n-d+i}$ is the characteristic relation for $P$: the stream $(y_n)$ satisfies the recurrence precisely when $P$ is its auxiliary polynomial. The generating function of the stream encodes this algebraically. [definition: Generating Function] A sequence $(y_0, y_1, \ldots)$ of elements of $\mathbb{F}_2$ has **generating function** $G(X) = \sum_{j=0}^{\infty} y_j X^j \in \mathbb{F}_2[[X]]$. [/definition] [quotetheorem:1679] [citeproof:1679] The connection to rational functions over $\mathbb{F}_2[[X]]$ is more than an algebraic curiosity: recovering the LFSR from its stream is equivalent to writing the generating function as a ratio of two polynomials, which is the same structural problem as decoding BCH codes. This parallel motivated the Berlekamp-Massey method. Three consequences of the rational form deserve emphasis. First, the degree bound $\deg A < d$ is necessary: if we allowed $\deg A \geq d$, we could absorb any polynomial multiple of $\check{P}$ into $A$, giving spurious freedom in the numerator — equivalently, the initial state of the register would not be uniquely prescribed by the recurrence. Requiring $\deg A < d$ pins down the numerator to exactly the $d$ initial conditions $y_0, \ldots, y_{d-1}$. Second, when $P$ is the true minimal polynomial of the sequence (the polynomial of smallest degree annihilating the recurrence), the correspondence $G(X) \leftrightarrow A(X)/\check{P}(X)$ is bijective: each initial fill gives a distinct $A$. Third, if $P$ is merely an annihilating polynomial (one that kills the recurrence, but not necessarily the minimal one), then $A$ is not uniquely determined — there may be several numerators giving the same stream. This ambiguity motivates the Berlekamp-Massey algorithm, whose goal is to find the minimal annihilating polynomial and thereby the shortest LFSR generating the observed stream. ### Solutions of the LFSR Recurrence Over $\mathbb{C}$, the general solution to a linear recurrence with characteristic polynomial $P$ is a linear combination of terms $\alpha^n$, $n\alpha^n$, $\ldots$, $n^{t-1}\alpha^n$ for each root $\alpha$ of $P$ of multiplicity $t$. Over $\mathbb{F}_2$, the situation requires two modifications: one must work in a splitting field $K$ for $P(X) \in \mathbb{F}_2[X]$, and the terms $n^i \alpha^n$ must be replaced by $\binom{n}{i}\alpha^n$ (using binomial coefficients modulo 2), because in characteristic 2, squaring collapses: $n^2 = n$ in $\mathbb{F}_2$ but $\binom{n}{i}$ is the correct substitute for "polynomial in $n$" over fields of characteristic 2. ### Combining LFSR Streams It is natural to ask whether combining two LFSR streams produces a more complex stream, or one that is still LFSR-generated. The answer determines the security of stream ciphers built from multiple LFSRs. [quotetheorem:1680] [citeproof:1680] The theorem has a sobering implication for cryptographic design. Adding two LFSR streams of lengths $M$ and $N$ gives a stream that is itself LFSR-generated, of length at most $M + N$. This offers no security advantage over a single LFSR of length $M + N$ — the adversary can treat the combined output as a single LFSR stream and apply the Berlekamp-Massey method directly. Multiplying two streams produces a stream of length at most $MN$, which looks more promising on the surface. But $x_n y_n = 0$ whenever either factor is zero, and since bits of an LFSR stream are each $0$ or $1$ with roughly equal probability, $x_n y_n = 0$ approximately $75\%$ of the time — the output is heavily biased and statistically far from random. The conclusion is that purely linear constructions, even when combined, do not escape the reach of linear algebra. Truly secure stream ciphers require nonlinear components, but general nonlinear FSRs are hard to analyse and may be vulnerable to attacks their designers did not anticipate. ## The Berlekamp-Massey Method The Berlekamp-Massey method answers the following concrete question: given the output stream $(x_n)_{n \geq 1}$ of an unknown LFSR, find the length $d$ and coefficients $a_0, \ldots, a_{d-1}$ such that \begin{align*} x_n + \sum_{i=1}^d a_{d-i} x_{n-i} = 0 \quad \text{for all } n \geq d. \end{align*} This is a system of linear equations over $\mathbb{F}_2$. The key insight is that $d$ is unknown, so we must determine it simultaneously with the coefficients. The method proceeds by looking successively at the Hankel matrices \begin{align*} A_r = \begin{pmatrix} x_r & x_{r-1} & \cdots & x_0 \\ x_{r+1} & x_r & \cdots & x_1 \\ \vdots & & \ddots & \vdots \\ x_{2r} & x_{2r-1} & \cdots & x_r \end{pmatrix} \in \mathbb{F}_2^{(r+1)\times(r+1)}, \end{align*} starting from $r = 0$. The algorithm at each step tests whether $d = r$: if $\det A_r \neq 0$ over $\mathbb{F}_2$, then $d \neq r$ (the system has no solution for this value of $d$), and we increment $r$. If $\det A_r = 0$, we attempt to solve the system under the hypothesis $d = r$ and verify the candidate coefficients against further terms of the stream. If the candidate fails, we know $d > r$ and increment. The method terminates when the candidate passes verification. [example: Consequence for LFSR-Based Stream Ciphers] A stream cipher that uses an LFSR of length $d$ to generate its key stream fails at Level 2 (known-plaintext attack). If Eve obtains $2d$ bits of plaintext and the corresponding ciphertext bits, she can XOR them to recover $2d$ bits of the key stream, then apply Berlekamp-Massey to determine $d$ and the coefficients completely. With the LFSR fully recovered, she can generate the entire future key stream and decrypt all subsequent messages. This vulnerability applies to any LFSR-based cipher, regardless of how long the LFSR is. Length alone does not provide security against a known-plaintext attacker. [/example] ## The One-Time Pad With the LFSR background in place, we can now understand both the ideal solution and its practical limitations. The setting for stream ciphers over $\mathbb{F}_2$: a plaintext stream $p_0, p_1, p_2, \ldots$ and a key stream $k_0, k_1, k_2, \ldots$ are added bitwise to produce the ciphertext stream $z_0, z_1, z_2, \ldots$ where $z_n = p_n + k_n$ for all $n \geq 0$. Decryption is identical: $p_n = z_n + k_n$. [definition: One-Time Pad] A **one-time pad** is the stream cipher in which the key stream $(k_n)_{n \geq 0}$ consists of independent and identically distributed uniform random variables over $\mathbb{F}_2$: each $k_i$ equals $0$ or $1$ with probability $1/2$, independently. The key stream must be used only once and then discarded. [/definition] The one-time pad has perfect secrecy. Given any ciphertext bit $z_n = p_n + k_n$, the key bit $k_n$ is uniform and independent of $p_n$, so $z_n$ is uniform over $\{0, 1\}$ regardless of the value of $p_n$. More formally: for any plaintext sequence $m$ and ciphertext sequence $c$ of the same length, $\mathbb{P}(C = c \mid M = m) = 2^{-n} = \mathbb{P}(C = c)$, so $M$ and $C$ are independent. The one-time pad is the unique solution to the perfect secrecy problem in the following sense: any cryptosystem with perfect secrecy must have $|\mathcal{K}| \geq |\mathcal{M}|$, and the one-time pad achieves equality (for fixed-length messages) while being operationally simple. Its cost is the key stream itself. ### The Key Distribution Problem The one-time pad has two practical obstacles that no engineering solution can bypass: First, generating a truly random key stream is difficult. Pseudo-random generators are deterministic and hence not truly random; physical random number generators are slow and expensive. A key stream that is not truly random weakens the secrecy guarantee. Second, and more fundamentally, the key stream must be shared with the receiver in advance. This requires a secure channel. But if Alice and Bob already have a secure channel, why not use it to send the message directly? The one-time pad relocates the problem of secure communication to the problem of secure key distribution — it does not solve it. ## Stream Ciphers via LFSR Key Streams An LFSR of length $d$ can produce a pseudo-random key stream of period up to $2^d - 1$ from a seed of just $d$ bits. The natural question is: can we XOR this stream with a message to build a practical cipher that approximates the one-time pad? The construction is straightforward — Alice and Bob agree on an LFSR and an initial fill, then encrypt by XOR. But something goes wrong as soon as the attacker gets lucky enough to observe both plaintext and ciphertext bits: XORing them reveals the key stream directly, and the Berlekamp-Massey algorithm can then reconstruct the entire LFSR from just $2d$ known bits. The security of the scheme collapses not because the key stream is short, but because it is linear. [definition: Stream Cipher] A **stream cipher** is a cryptosystem in which a short shared key (the initial fill of an FSR) is expanded into a long key stream by the FSR, which is then used to encrypt via bitwise XOR. [/definition] Stream ciphers have several practical advantages over block ciphers for certain applications. They are fast, requiring only a single XOR per bit; they are error-tolerant, since a single bit error in the ciphertext corrupts only one bit of the plaintext rather than an entire block; and they can encrypt and decrypt "on the fly," processing each bit as it arrives rather than waiting for a complete block. The maximum period of an FSR of length $d$ is $2^d$, achieved when the state sequence cycles through all $2^d$ possible states. For a LFSR of length $d$, the maximum achievable period is $2^d - 1$ (the all-zero state maps to itself under a linear recurrence, so it is excluded from any non-trivial cycle). A LFSR achieving this maximum is called a **maximal-length LFSR** or **m-sequence generator**; its auxiliary polynomial must be a primitive polynomial over $\mathbb{F}_2$. [remark: LFSR Stream Ciphers and Level 2 Security] As noted earlier, LFSR-based stream ciphers fail at Level 2 due to Berlekamp-Massey: with $2d$ known plaintext bits, Eve recovers the LFSR completely. The fundamental reason is that the LFSR recurrence is linear, and linear structure is precisely what Berlekamp-Massey exploits. Secure stream ciphers must introduce nonlinearity, either through nonlinear feedback functions, nonlinear output functions applied to LFSR states, or irregular clocking schemes. [/remark] ## Symmetric vs Asymmetric Cryptosystems Stream ciphers are examples of **symmetric cryptosystems**: encryption and decryption use the same key (or one is easily derived from the other). Both Alice and Bob must know the same secret, which must be shared before communication begins. This shared-secret requirement is the central limitation of symmetric cryptography: before Alice and Bob can communicate privately, they need a secure channel to exchange the key. For large networks with many participants, managing shared secrets becomes combinatorially expensive: $n$ users require up to $\binom{n}{2}$ distinct keys if every pair needs to communicate privately. The resolution — **asymmetric** or **public-key cryptography** — is developed in subsequent chapters. The insight is that certain mathematical operations are easy to perform but hard to reverse (trapdoor functions), allowing a public key to be shared openly while a private key remains secret. The one-time pad and stream ciphers, despite their limitations, establish the information-theoretic standard against which all practical systems are measured. # 8. Public-Key Cryptography Every symmetric cipher studied so far shares a fundamental weakness: Alice and Bob must somehow agree on a secret key before they can communicate privately. Over an insecure channel this seems paradoxical — how can they agree on a secret without already having a shared secret? Public-key (asymmetric) cryptography resolves this paradox by exploiting mathematical problems that are easy to pose but believed to be hard to solve. This chapter develops the theory of asymmetric ciphers, grounds it in the number-theoretic results that underpin their security, and analyses the two main examples: the Rabin and RSA cryptosystems. The chapter connects to the broader theme of computational hardness that has run implicitly through the course: security is no longer unconditional (as with the one-time pad) but rests on complexity-theoretic assumptions. ## Why Symmetric Key Exchange Is a Bottleneck Imagine two parties who have never met and wish to begin encrypted communication. Any symmetric cipher requires them to first agree on a key; but transmitting that key over the open channel defeats the purpose of encryption. For $n$ parties who all need to communicate pairwise, the key management problem requires $\binom{n}{2}$ separate secure key exchanges — an operational nightmare at scale. The insight behind asymmetric cryptography is that secrecy and the ability to encrypt need not be the same capability. If Alice can publish a *public key* that anyone can use to encrypt messages to her, while keeping a *private key* that only she can use to decrypt, the key-exchange problem disappears entirely. Bob encrypts using Alice's public key; Alice decrypts using her private key. Knowing the encryption algorithm, the decryption algorithm, and the public key should still leave an eavesdropper unable to read the message or recover the private key. This is called **Level 3 secrecy** in the course's classification. The mathematical mechanism that makes this possible is a **trapdoor one-way function**: a function that is easy to compute in one direction but hard to invert without extra information (the trapdoor). The private key is precisely that trapdoor. ## Computational Hardness and Polynomial Time To make the notion of "hard" precise, the course adopts the standard complexity-theoretic benchmark. [definition: Polynomial-Time Algorithm] An algorithm runs in **polynomial time** if the number of elementary operations it performs on an input is at most $c \cdot (\text{input size})^d$ for some constants $c, d > 0$. The input size is measured in bits: an integer $N$ has input size $\lfloor \log_2 N \rfloor + 1$. [/definition] The definition matters because the same integer $N$ that looks "small" has an exponentially large value relative to the number of bits needed to write it down. An algorithm running in time $O(\sqrt{N})$ is exponential in the input size $B = \log_2 N$, since $\sqrt{N} = 2^{B/2}$. Several fundamental operations are known to run in polynomial time, and these form the building blocks of cryptographic constructions: - Integer arithmetic ($+$, $-$, $\times$, division with remainder); - Computing $\gcd(a,b)$ via the Euclidean algorithm; - Modular exponentiation: computing $x^\alpha \pmod{N}$ by repeated squaring in $O(\log \alpha)$ multiplications; - Primality testing (AKS algorithm, Agrawal–Kayal–Saxena 2002). In contrast, two problems central to public-key cryptography are not known to admit polynomial-time algorithms, and the best known algorithms run in **sub-exponential but super-polynomial time**: **Factoring.** Given a product $N = pq$ of two large primes, find $p$ and $q$. Trial division up to $\sqrt{N}$ runs in $O(\sqrt{N}) = O(2^{B/2})$. The best known general method, the **number field sieve**, runs in time \begin{align*} O\!\left(\exp\!\left(c(\log N)^{1/3}(\log \log N)^{2/3}\right)\right), \end{align*} where $c$ is an explicit constant. This is dramatically better than trial division but still grows faster than any polynomial in $B = \log_2 N$. **Discrete logarithm.** Let $p$ be a large prime and $g$ a primitive root modulo $p$, meaning $g$ generates the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$. Given $x \in (\mathbb{Z}/p\mathbb{Z})^\times$, find the exponent $a$ such that $x \equiv g^a \pmod{p}$. Shanks' **baby-step giant-step** algorithm solves this in $O(\sqrt{p}\,\log p)$ — again exponential in the bit-length of $p$. Both problems are believed to be hard (no polynomial-time algorithm is known), but neither has been *proved* hard. The security of every cryptosystem in this chapter rests on this unproven assumption. ## Modular Arithmetic Background The cryptosystems below rest on two classical results from number theory. Recall that Euler's totient function is $\phi(n) = |(\mathbb{Z}/n\mathbb{Z})^\times|$, the number of integers in $\{1, \ldots, n\}$ coprime to $n$. [quotetheorem:1681] The Euler–Fermat theorem is the engine behind RSA correctness: if $de \equiv 1 \pmod{\phi(N)}$ then $m^{de} = m^{k\phi(N)+1} \equiv m \pmod{N}$, so encryption followed by decryption recovers the plaintext. The theorem requires $\gcd(a, N) = 1$; without it the statement fails — for instance, $2^{\phi(6)} = 2^2 = 4 \not\equiv 1 \pmod 6$. When $N = pq$ and $m$ is a random message, the probability that $p \mid m$ or $q \mid m$ is negligibly small for large primes. A subtler point is that $\phi(N)$ is not always the smallest universal exponent: the Carmichael function $\lambda(N)$ can be strictly smaller (for $N = 15$, $\lambda(15) = 4$ while $\phi(15) = 8$), which is why some RSA variants use $\lambda(N)$ in place of $\phi(N)$. A concrete consequence of Fermat's little theorem allows efficient square root extraction modulo primes of a special form. [quotetheorem:1682] [citeproof:1682] The hypothesis $p \equiv 3 \pmod{4}$ is not merely a convenience. For primes $p \equiv 1 \pmod{4}$, the exponent $k$ is not an integer in the formula $d^k$, so the argument fails and square root extraction requires a different algorithm (Tonelli–Shanks). The theorem also says nothing about which of the two square roots $\pm d^k$ it produces; both are valid, and knowing which one corresponds to the original plaintext requires additional structure in the message format. ## The Rabin Cryptosystem Given the background — fast modular exponentiation, hard factoring, root-extraction aided by the $p \equiv 3 \pmod 4$ formula — what is the simplest trapdoor function we can build? Squaring modulo $N = pq$ is an obvious candidate: the forward operation is a single multiplication, while inverting it (extracting a square root modulo a composite) is provably as hard as factoring. The Rabin cryptosystem makes this concrete. **Setup.** Choose two large distinct primes $p, q \equiv 3 \pmod{4}$. Keep $p, q$ secret. The public key is $N = pq$; the private key is $(p, q)$. **Encryption.** The message space and ciphertext space are both $(\mathbb{Z}/N\mathbb{Z})^\times$ (with the convention that $m > \sqrt{N}$ for disambiguation). Encrypt $m$ as \begin{align*} c \equiv m^2 \pmod{N}. \end{align*} **Decryption.** Given $c$ and the private key $(p,q)$, the receiver applies the square root theorem to find $x_1$ with $x_1^2 \equiv c \pmod{p}$ and $x_2$ with $x_2^2 \equiv c \pmod{q}$ (both $p$ and $q$ satisfy $p, q \equiv 3 \pmod 4$). The Chinese Remainder Theorem then produces $x$ satisfying \begin{align*} x \equiv x_1 \pmod{p}, \qquad x \equiv x_2 \pmod{q}, \end{align*} which gives a solution to $x^2 \equiv c \pmod{N}$. Running the Euclidean algorithm on $p$ and $q$ gives $r, s$ with $rp + sq = 1$, and the explicit formula is $x = (sq)x_1 + (rp)x_2 \pmod{N}$. There is, however, a structural obstacle: square roots modulo $N$ are not unique. [quotetheorem:1683] [citeproof:1683] The hypothesis that $p$ and $q$ are *distinct* odd primes is essential to the four-root count. If $p = q$ the modulus $N = p^2$ is a prime power, and the group $\left(\mathbb{Z}/p^2\mathbb{Z}\right)^\times$ has a different structure: a quadratic residue modulo $p^2$ with $\gcd(d, p) = 1$ has at most two square roots (not four), because the CRT decomposition into independent factors is unavailable. Similarly, if $p = 2$ the group $(\mathbb{Z}/2\mathbb{Z})^\times$ is trivial and the square-root count changes; the argument for part (i) uses odd order of $p$ in an essential way. The four-root structure is precisely the hook for the Rabin chosen-ciphertext attack. Suppose an attacker has access to a decryption oracle. They choose $r \in (\mathbb{Z}/N\mathbb{Z})^\times$ at random, compute $c = r^2 \bmod N$, and submit $c$ to the oracle. The oracle returns one of the four square roots of $c$; call it $r'$. Since the four roots split into two pairs $\{\pm r \bmod N\}$ and $\{\pm s \bmod N\}$ where $s \not\equiv \pm r \pmod{N}$, the probability is $1/2$ that $r' \not\equiv \pm r \pmod{N}$. When this occurs, $N \mid (r' - r)(r' + r)$ but $N \nmid (r' \pm r)$, so $\gcd(r' - r, N)$ yields a non-trivial factor of $N$. This attack IS the reduction from the security proof: the reduction constructs an oracle from an assumed decryption algorithm and then runs exactly this argument. The existence of four square roots means that decryption produces four candidates: only one is the intended plaintext. In practice, Rabin's scheme is implemented with **message redundancy** (e.g.\ the last 64 bits of $m$ repeat a known pattern), so the receiver can identify the correct plaintext among the four candidates. This ambiguity is not a flaw in the mathematical design; it is a known cost that the protocol accounts for. The decisive property of Rabin's cryptosystem is that breaking it is provably as hard as factoring. [quotetheorem:1684] [citeproof:1684] This is a remarkable guarantee. Rabin is the only major cryptosystem whose security is provably equivalent to a standard hard problem. The proof gives a precise reduction: any efficient decryption algorithm immediately yields an efficient factoring algorithm, and vice versa. Two caveats. The reduction requires a *perfect* decryption oracle — partial or noisy oracles are not covered by the argument. More consequentially, the reduction itself reveals the CCA weakness: an attacker with oracle access can submit a random $r^2 \bmod N$, receive one of the four roots (which is not always $\pm r$), and compute $\gcd(r - r', N)$ to extract $p$ or $q$. So Rabin is provably secure under passive (ciphertext-only) attack but provably insecure under chosen-ciphertext attack — a sharp security divide that RSA handles more gracefully in practice through padding schemes. ## Factoring via a Known Multiple of $\phi(N)$ Can we always extract the factorisation from $\phi(N)$? The answer for $\phi(N)$ itself is yes — computing $(p-1)(q-1) = N + 1 - (p+q)$ gives $p + q$ and then the quadratic $X^2 - (p+q) X + N$ has roots $p$ and $q$. But an attacker rarely learns $\phi(N)$ exactly; more realistically they might learn a *multiple* of it, e.g. the value $ed - 1$ from a recovered private key. Does knowing only a multiple suffice? Perhaps surprisingly, yes — and this is the content of the following theorem, which underlies both the Rabin and RSA security analyses. Suppose $m = 2^a b$ with $b$ odd and $a \geq 1$ is a known multiple of $\phi(N)$. Define the set \begin{align*} X = \left\{ x \in (\mathbb{Z}/N\mathbb{Z})^\times : \mathrm{ord}_p(x^b) \neq \mathrm{ord}_q(x^b) \right\}, \end{align*} where $\mathrm{ord}_p(y)$ denotes the multiplicative order of $y$ in $(\mathbb{Z}/p\mathbb{Z})^\times$. [quotetheorem:1685] [citeproof:1685] Part (ii) says that a uniformly random element of $(\mathbb{Z}/N\mathbb{Z})^\times$ lands in $X$ with probability at least $1/2$. Combined with part (i), this means that given any multiple of $\phi(N)$, a randomised algorithm finds a non-trivial factor of $N$ in expected two trials. The theorem does not say that *every* element leads to a factor — elements outside $X$ yield trivial gcds — and this is why the probabilistic sampling is essential. The bound $|X| \geq \phi(N)/2$ is load-bearing: it ensures that random sampling hits $X$ with probability at least $1/2$ on each trial, which is what makes the expected number of trials bounded (exactly $2$). If the bound were, say, $|X| \geq 1$, random sampling could in principle require many attempts. The algorithm is a **Las Vegas algorithm**: it always produces a correct answer when it halts, but the number of steps is random — here with geometric distribution of parameter $\geq 1/2$, giving an expected run time of two iterations of the $\gcd$ computation. The forward connection to RSA is direct. If an attacker manages to recover the private key $d$ from the public key $(N, e)$, then $de - 1$ is a known multiple of $\phi(N)$. This theorem then provides a randomised algorithm that factors $N$ in expected constant many modular exponentiations. In other words: recovering the private key *reduces to* factoring, because any key-recovery algorithm immediately yields a factoring algorithm via this theorem. The theorem is the engine that drives the proof of RSA Key Recovery Reduces to Factoring. ## The RSA Cryptosystem Rabin provides provable security but at the cost of an ambiguous decryption: a ciphertext has four square roots, and the legitimate receiver must disambiguate using message redundancy. Can we redesign the scheme so that decryption returns a unique plaintext, even at the price of weaker security guarantees? The answer is yes: use an exponent $e$ that is a bijection on $(\mathbb{Z}/N\mathbb{Z})^\times$, rather than the squaring map. This is the RSA cryptosystem (Rivest–Shamir–Adleman, 1977; independently Clifford Cocks at GCHQ, 1973), by far the most widely deployed public-key cryptosystem. **Key generation.** Choose large distinct primes $p$ and $q$ at random and keep them secret. Set $N = pq$. Choose $e$ with $\gcd(e, \phi(N)) = 1$; run the extended Euclidean algorithm to find $d$ satisfying $de \equiv 1 \pmod{\phi(N)}$. [definition: RSA Public and Private Keys] The **public key** is the pair $(N, e)$, where $e$ is the encrypting exponent. The **private key** is the pair $(N, d)$, where $d$ is the decrypting exponent. Both are ordered pairs, not arithmetic expressions. [/definition] **Encryption.** Given plaintext $m \in (\mathbb{Z}/N\mathbb{Z})^\times$, compute \begin{align*} c \equiv m^e \pmod{N}. \end{align*} **Decryption.** Given ciphertext $c$, compute \begin{align*} x \equiv c^d \pmod{N}. \end{align*} Correctness follows from Euler–Fermat: $c^d \equiv m^{ed} \equiv m^{k\phi(N)+1} \equiv m \pmod{N}$, since $de = k\phi(N) + 1$ for some integer $k$. The same Euler–Fermat application requires $\gcd(m, N) = 1$, which fails only when $p \mid m$ or $q \mid m$; for random messages modulo $N$, this probability is $O(1/p + 1/q)$, negligible in practice. [example: Boundary Case — Choosing $e$] The requirement $\gcd(e, \phi(N)) = 1$ is essential. If $\gcd(e, \phi(N)) = g > 1$, then no decrypting exponent $d$ exists, because $de \equiv 1 \pmod{\phi(N)}$ has no solution when $g \nmid 1$. In practice, $e = 65537 = 2^{16} + 1$ is a standard choice: it is prime (so $\gcd(e, \phi(N)) = 1$ as long as $\phi(N)$ is not a multiple of $e$, which happens with negligible probability for random large $p, q$), and its binary representation has only two 1-bits, making modular exponentiation by $e$ very fast. [/example] The security of RSA rests on the following reduction, which mirrors the Rabin theorem. [quotetheorem:1686] [citeproof:1686] [remark: What RSA Security Does Not Guarantee] The theorem shows that recovering the *private key* is as hard as factoring. It does not show that *decrypting individual messages* is as hard as factoring — it is unknown whether an algorithm that decrypts ciphertexts (without recovering $d$) would necessarily yield a factoring algorithm. This is an important open question in cryptography; the analogous statement for Rabin *is* known to hold, which is one reason Rabin has a stronger provable security guarantee. [/remark] [example: Toy RSA Cycle] Take $p = 11$, $q = 23$, so $N = pq = 253$ and $\phi(N) = (p-1)(q-1) = 10 \cdot 22 = 220$. Choose encrypting exponent $e = 3$; since $\gcd(3, 220) = 1$, a decrypting exponent exists. To find $d$, solve $3d \equiv 1 \pmod{220}$: the extended Euclidean algorithm gives $d = 147$, since $3 \times 147 = 441 = 2 \times 220 + 1$. Encrypt message $m = 7$: \begin{align*} c \equiv 7^3 \equiv 343 \equiv 343 - 253 \equiv 90 \pmod{253}. \end{align*} Decrypt ciphertext $c = 90$: \begin{align*} x \equiv 90^{147} \pmod{253}. \end{align*} By Euler–Fermat, $90^{220} \equiv 1 \pmod{253}$ (since $\gcd(90, 253) = 1$). Write $147 = 0 \cdot 220 + 147$, so the computation reduces to $90^{147} \bmod 253$, which evaluates to $7$ — recovering the original plaintext. The round trip $m \mapsto m^e \mapsto m^{ed} = m^{441} = m^{2 \cdot 220 + 1} \equiv m^1 = m$ works because $ed = 1 + 2\phi(N)$, making Euler–Fermat exact. [/example] RSA also has an important practical limitation: it is substantially slower than symmetric ciphers like AES or even the one-time pad, because modular exponentiation of large numbers is computationally expensive compared to XOR or table lookups. In practice, RSA is used to encrypt a short symmetric key, which is then used for bulk encryption — a hybrid approach. ## Shamir's Three-Pass Protocol and Diffie–Hellman Before analysing RSA, the course develops an illuminating example of key exchange using modular exponentiation, which leads naturally to the Diffie–Hellman protocol. [example: Shamir's Three-Pass Protocol] Let the alphabet be $\mathbb{Z}_p$ for a large prime $p$. Alice and Bob each choose private exponents: Alice chooses $a \in \mathbb{Z}_{p-1}^\times$ and finds $a'$ with $aa' \equiv 1 \pmod{p-1}$; Bob chooses $b$ and finds $b'$ similarly. To send message $m \in \mathbb{Z}_p$, the parties exchange three messages over a public channel: \begin{align*} m \xrightarrow{A} m^a \pmod{p} \xrightarrow{B} m^{ab} \pmod{p} \xrightarrow{A} m^{aba'} \pmod{p} \xrightarrow{B} m^{aba'b'} \pmod{p}. \end{align*} By Fermat's little theorem, $aba'b' \equiv bb' \equiv 1 \pmod{p-1}$, so the final value is $m^1 \equiv m \pmod{p}$. No shared key is ever established — each party "locks" and "unlocks" the message independently, like two padlocks on a box. The critical observation is that this protocol does *not* require Alice and Bob to communicate a key. However, it requires three rounds of communication and rests on the hardness of the discrete logarithm problem: an eavesdropper sees $g^a$ and $g^{ab}$ but not $a$ or $b$ individually. [/example] The three-pass protocol is conceptually elegant but operationally cumbersome. The Diffie–Hellman protocol achieves the same goal — establishing a shared secret over a public channel — in two messages. [definition: Diffie–Hellman Key Exchange] Fix a large prime $p$ and a primitive root $g$ modulo $p$ (both public). Alice and Bob each independently choose private integers $\alpha$ and $\beta$ respectively. They exchange: \begin{align*} A \to B:& \quad g^\alpha \pmod{p}, \\ B \to A:& \quad g^\beta \pmod{p}. \end{align*} Both parties compute the shared secret $k = g^{\alpha\beta} \pmod{p}$: Alice computes $(g^\beta)^\alpha$ and Bob computes $(g^\alpha)^\beta$. An eavesdropper $E$ sees $g, p, g^\alpha, g^\beta$ but must find $g^{\alpha\beta}$ — the **Diffie–Hellman problem**. [/definition] The Diffie–Hellman problem is conjectured to be as hard as the discrete logarithm problem: if one could solve discrete logarithms, one could compute $\alpha$ from $g^\alpha$ and then compute $g^{\alpha\beta}$. Whether the converse holds — whether breaking Diffie–Hellman implies solving discrete logarithms — remains an open problem. [example: The Gap Between Problems] Consider a group where the Diffie–Hellman problem has a dedicated structural attack that does not require computing discrete logarithms. In such a setting, key exchange over Diffie–Hellman would be broken while discrete logarithm extraction remains hard. No such group is known for the multiplicative groups $(\mathbb{Z}/p\mathbb{Z})^\times$ used in practice, but this shows that "as hard as discrete log" and "equivalent to discrete log" are genuinely different claims. The course's statement that breaking D–H is "as difficult as the discrete logarithm problem" is a conjecture, not a theorem. [/example] ## Putting the Systems Together: Comparing Rabin and RSA Both Rabin and RSA encrypt via modular exponentiation (with exponent $2$ and $e$ respectively), use a composite modulus $N = pq$ as the public key, and rely on the difficulty of factoring for security. But they differ sharply in what has been *proved*. For Rabin, the equivalence between decryption and factoring is a theorem. For RSA, only the equivalence between *key recovery* and factoring is proved; whether decrypting a single RSA ciphertext is as hard as factoring remains open. In this sense Rabin has a stronger provable security guarantee, while RSA enjoys wider deployment due to its deterministic decryption (no four-way ambiguity) and its greater flexibility in the choice of public exponent. A common thread through both systems is the role of the Chinese Remainder Theorem: decryption always proceeds by solving the problem modulo $p$ and modulo $q$ separately, then combining via CRT. The factorisation of $N$ is what enables this "divide and conquer" — without knowing $p$ and $q$, the CRT decomposition is unavailable, and the cryptosystem is secure. # 9. Cryptographic Protocols The earlier chapters of this course developed the machinery for transmitting information reliably and for encrypting it so that adversaries cannot read it. But secure communication between two parties raises subtler questions that encryption alone does not resolve: how can several people share a secret so that no single person holds too much power? How can Alice prove to Bob that a message came from her, and not from an impersonator? And how can Alice commit to a choice — a bit, a vote, a prediction — in a way that she cannot later revoke or deny, yet Bob cannot peek at early? This chapter addresses these three questions through the theory of cryptographic protocols. ## Secret Sharing and the Threshold Problem Imagine a Faculty whose leader knows a secret integer $S$ needed to access a secure bunker. The leader may be absent or incapacitated, so the Faculty wants any $k$ of its $n$ members to be able to reconstruct $S$ together, but no group of $k-1$ members should be able to do so even by pooling everything they know. A naive solution — giving each member a piece of $S$ — fails immediately: any subset who together hold enough pieces reconstructs $S$, but a single member holding one piece already learns something about it. The challenge is to share $S$ so that fewer than $k$ members learn absolutely nothing about its value. Shamir's scheme solves this problem using polynomial interpolation over a finite field, achieving a much stronger guarantee than naive splitting: any $k-1$ shareholders have zero information about $S$ in the information-theoretic sense. ### Shamir's Secret-Sharing Scheme We want to split a secret integer $S$ among $n$ players so that any $k$ of them can reconstruct $S$ together, but any group of fewer than $k$ players learns absolutely nothing about its value — not even probabilistic information. What mathematical object achieves exactly this threshold behaviour? A polynomial of degree $k-1$ over a finite field is the key: it is completely determined by any $k$ of its values, yet any $k-1$ values leave one full degree of freedom. If we hide $S$ as the constant term $f(0)$ of a random degree-$(k-1)$ polynomial $f$, then $k$ evaluation points pin down $f(0)$ exactly, while $k-1$ points are consistent with every possible value of $f(0)$ — any secret is equally plausible. Fix a prime $p$ larger than both $n$ and $S$, and work in the finite field $\mathbb{F}_p = \mathbb{Z}/p\mathbb{Z}$. [definition: Shamir Secret-Sharing Scheme] Let $p$ be a prime with $p > n$ and $0 \leq S < p$. The **$(k, n)$-Shamir secret-sharing scheme** is defined as follows. **Share distribution.** The dealer picks $k-1$ coefficients $a_1, \ldots, a_{k-1}$ uniformly at random from $\mathbb{F}_p$ and forms the polynomial \begin{align*} f(x) &= S + a_1 x + a_2 x^2 + \cdots + a_{k-1} x^{k-1} \in \mathbb{F}_p[x]. \end{align*} The **share** of participant $i$ (for $1 \leq i \leq n$) is the pair $(i,\, f(i) \bmod p)$. **Reconstruction.** Given the shares $(i_1, f(i_1)), \ldots, (i_k, f(i_k))$ of any $k$ distinct participants, the secret is recovered as \begin{align*} S &= f(0) = \sum_{j=1}^{k} f(i_j) \prod_{\substack{l=1 \\ l \neq j}}^{k} \frac{-i_l}{i_j - i_l} \pmod{p}, \end{align*} the Lagrange interpolation formula evaluated at $x = 0$. [/definition] The reconstruction formula warrants a moment's attention. A polynomial of degree at most $k-1$ over a field is completely determined by any $k$ of its values, since specifying $k$ points $(x_j, y_j)$ with distinct $x_j$ yields a system of $k$ equations in $k$ unknowns that has a unique solution — this is the content of Lagrange interpolation. Because $f$ has degree $k-1$ and $f(0) = S$, any $k$ shares suffice to recover $S$ exactly. All arithmetic is performed in $\mathbb{F}_p$, so division in the Lagrange formula means multiplication by the modular inverse, which exists because $p$ is prime and $i_j - i_l \not\equiv 0 \pmod{p}$ for distinct participants $i_j \neq i_l$ (using $p > n$). [quotetheorem:1687] [citeproof:1687] Several remarks sharpen the scope of this theorem. First, the hypothesis $p > n$ is essential: without it, two participants $i, j \leq n$ could satisfy $i \equiv j \pmod{p}$, colliding at the same evaluation point. The scheme breaks entirely — reconstruction fails and security is void. Second, the theorem assumes **honest-but-curious** collusion: all $k-1$ adversarial shareholders submit their genuine shares during reconstruction and merely pool information. A **malicious** shareholder who submits a false share during reconstruction can cause the reconstructed value to differ from $S$ without detection. This vulnerability motivates the theory of *verifiable secret sharing*, in which each share is accompanied by a zero-knowledge proof of correctness that can be checked before reconstruction. Third, the theorem guarantees information-theoretic security for the secret $S$ given the shares — it does not protect against an adversary who can influence the *distribution* from which $S$ is drawn, nor does it address what happens once $S$ is reconstructed and used. [example: Boundary Case — Exactly $k$ Shareholders] Consider the $(2, 3)$ scheme with $p = 7$ and secret $S = 3$. The dealer picks $a_1$ at random, say $a_1 = 5$, giving $f(x) = 3 + 5x \pmod{7}$. The shares are $(1, 1), (2, 6), (3, 4)$. Any two shareholders can reconstruct: using shares $(1,1)$ and $(3,4)$, Lagrange interpolation gives \begin{align*} f(0) &= 1 \cdot \frac{0-3}{1-3} + 4 \cdot \frac{0-1}{3-1} \equiv 1 \cdot \frac{-3}{-2} + 4 \cdot \frac{-1}{2} \pmod{7}. \end{align*} Working in $\mathbb{F}_7$: $(-2)^{-1} \equiv 3$ and $2^{-1} \equiv 4$, so $f(0) \equiv 1 \cdot (-3)(3) + 4 \cdot (-1)(4) \equiv -9 - 16 \equiv -25 \equiv 3 \pmod{7}$, which recovers $S = 3$. A single shareholder holding $(1, 1)$ alone cannot determine $S$: for any claimed secret $S' \in \mathbb{F}_7$, there is a unique line through $(0, S')$ and $(1, 1)$, so all seven values of $S'$ remain equally consistent. [/example] ## Signatures and Authentication Encryption solves the problem of secrecy, but it does not resolve a different class of threats. Suppose Alice sends Bob a message $m$. Bob can verify that no third party read the message if it was encrypted, but how can he be sure the message actually came from Alice and not from an impersonator Eve? And how can he later prove to a judge that Alice sent the message, when Alice might deny it? These are the goals of **authenticity** and **non-repudiation** respectively. A first attempt using RSA already shows both the power and the pitfalls of the idea. Alice holds private key $(N, d)$ and public key $(N, e)$. She can "sign" a message $m$ by computing $m^d \pmod{N}$: anyone with the public key can verify this is from Alice by checking that $(m^d)^e \equiv m \pmod{N}$, and no one without knowledge of $d$ can produce this value. But this naive approach is immediately broken. [example: Homomorphism Attack on Naive RSA Signing] The bank sends encrypted pairs $(M_1, M_2)$ where $M_1$ is the client's name and $M_2$ is the transfer amount, encoding them as $(M_1^e \bmod N,\, M_2^e \bmod N)$. An attacker who observes the ciphertext $(Z_1, Z_2)$ for a £100 transfer can send the modified message $(Z_1,\, Z_2^3)$ to the bank. Because RSA is multiplicatively homomorphic — $(Z_2^3)^d = (M_2^e)^{3d} = M_2^3 \pmod{N}$ — the bank decrypts to find the transfer amount is $M_2^3$, i.e. £$100^3 = 10^6$. The attacker has become a millionaire without breaking RSA. A separate attack is simple copying: replay the valid signed message $(Z_1, Z_2)$ repeatedly. This is defeated by including a timestamp in the signed content. [/example] [example: Existential Forgery of Naive RSA Signatures] An adversary who knows Alice's public key $(N, e)$ can produce valid-looking signed messages without knowing $d$. Choose any $s \in \{1, \ldots, N-1\}$ at random, then compute $m = s^e \bmod N$. The pair $(m, s)$ satisfies the verification equation $s^e \equiv m \pmod{N}$, so it looks like Alice signed $m$. The adversary has no control over what $m$ is — it is a random element of $\mathbb{Z}/N\mathbb{Z}$ — but if any such random-looking $m$ is interpretable as a valid message, the scheme is completely broken. [/example] Both attacks share a common structural cause: the adversary exploits the raw algebraic properties of RSA exponentiation — its multiplicativity in the homomorphism attack, and the absence of any constraint on message format in the existential forgery. The remedy must break both exploits simultaneously, and there is a single intervention that achieves this: rather than signing the message $m$ directly, Alice signs its hash $h(m)$, where $h : M \to \{1, \ldots, N-1\}$ is a publicly known collision-resistant hash function. A **collision** is a pair $x_1 \neq x_2$ with $h(x_1) = h(x_2)$; the requirement is that finding collisions is computationally infeasible. Signing the hash breaks the multiplicative structure exploited by the homomorphism attack (since $h(M_2^3) \neq h(M_2)^3$ in general), and makes existential forgery meaningless (the adversary can still choose an $s$ and compute $m = s^e$, but finding a message with $h(m') = m$ for some meaningful $m'$ requires inverting the hash, which is assumed hard). The general framework captures this precisely. [definition: Digital Signature Scheme] A **digital signature scheme** consists of: - a message space $M$, a signature space $\Sigma$, a private key space $K_{\mathrm{priv}}$, and a public key space $K_{\mathrm{pub}}$; - a **signing function** $s : M \times K_{\mathrm{priv}} \to \Sigma$ that takes a message and a private key to produce a signature; - a **verification function** $v : M \times \Sigma \times K_{\mathrm{pub}} \to \{\mathrm{accept}, \mathrm{reject}\}$ that uses the corresponding public key to check whether a putative signature is valid. A message $m$ is **signed** as the pair $(m, s(m, k))$ where $k \in K_{\mathrm{priv}}$ is A's private key. B verifies the signature by computing $v(m, s(m,k), k_{\mathrm{pub}})$ using A's published public key $k_{\mathrm{pub}} \in K_{\mathrm{pub}}$. [/definition] The signed message is not the message of the sender but the signature of the content: the goal is to certify what was sent, not merely who the sender is. The examples above reveal a fundamental limitation: naive RSA signing is not just weak — it is actively dangerous, enabling an attacker to manufacture valid-looking signed messages or to scale transaction amounts without breaking any hard mathematical problem. Both vulnerabilities stem from the raw algebraic structure of exponentiation modulo $N$. Hashing before signing destroys this structure, but it raises a new question: can we do something more radical? Instead of patching RSA encryption to perform authentication, can we repurpose the RSA key pair so that the roles of the private and public key are simply reversed — the private key signs and the public key verifies, rather than the public key encrypting and the private key decrypting? The answer is yes, and this reversal is conceptually the entirety of RSA signatures. The same mathematical operation $m \mapsto m^d \bmod N$ that decrypts a ciphertext, when applied to a message (or its hash) with the *private* key, produces a value that anyone holding the public key can verify was produced by the key holder — because only someone knowing $d$ could have computed a value whose $e$-th power recovers $m$. ### RSA Signatures With private key $(N, d)$ and public key $(N, e)$, Alice signs $m$ as $(m,\, m^d \bmod N)$. Bob verifies by checking $s^e \equiv m \pmod{N}$. In practice, Alice signs $h(m)$ rather than $m$ itself, producing $(m,\, h(m)^d \bmod N)$, and Bob verifies $s^e \equiv h(m) \pmod{N}$. The homomorphism attack and existential forgery are both mitigated by using a suitable hash. ### The ElGamal Signature Scheme A fundamentally different approach bases security on the difficulty of the **discrete logarithm problem**: given $g, y, p$ with $y \equiv g^u \pmod{p}$, find $u$. Unlike RSA, ElGamal uses a fresh random value $k$ for each signature, introducing probabilistic signing. [definition: ElGamal Signature Scheme] Alice chooses a large prime $p$, a primitive root $g$ modulo $p$, and a secret $u$ with $1 < u < p$. She publishes the **public key** $(p, g, y)$ where $y \equiv g^u \pmod{p}$; her **private key** is $u$. Let $h : M \to \{1, \ldots, p-1\}$ be a hash function. **Signing.** To sign a message $m$, Alice chooses a random $k$ with $1 < k < p-2$ and $\gcd(k, p-1) = 1$. She computes: \begin{align*} r &\equiv g^k \pmod{p}, \qquad 1 \leq r \leq p-1, \\ s &\equiv k^{-1}(h(m) - ur) \pmod{p-1}, \qquad 1 \leq s \leq p-2. \end{align*} The **signature** on $m$ is the pair $(r, s)$. **Verification.** Bob accepts the signature $(r, s)$ on message $m$ if and only if \begin{align*} g^{h(m)} &\equiv y^r r^s \pmod{p}. \end{align*} [/definition] The verification equation holds for legitimately signed messages because: \begin{align*} y^r r^s &\equiv g^{ur} \cdot g^{ks} \equiv g^{ur + ks} \pmod{p}. \end{align*} Since $s$ was chosen so that $h(m) \equiv ur + ks \pmod{p-1}$ and $g^{p-1} \equiv 1 \pmod{p}$ by Fermat's little theorem, we get $g^{ur+ks} \equiv g^{h(m)} \pmod{p}$. The requirement $\gcd(k, p-1) = 1$ ensures that the equation $h(m) \equiv ur + ks \pmod{p-1}$ can be solved for $s$, as the following lemma guarantees. [quotetheorem:1688] [citeproof:1688] This theorem has a useful sharpness to notice before applying it. The divisibility condition $\gcd(a, m) \mid b$ is not merely sufficient but genuinely necessary: for example, $2x \equiv 1 \pmod{4}$ has no solution, since $\gcd(2, 4) = 2$ does not divide $1$. When solutions do exist, they form an arithmetic progression modulo $m$ — there are exactly $\gcd(a, m)$ of them, spaced $m / \gcd(a, m)$ apart. In the ElGamal context, we need to solve $h(m) \equiv ur + ks \pmod{p-1}$ for $s$, which is a linear congruence in $s$ with coefficient $k$. For this to have a unique solution — so that the signing step is well-defined — we need $\gcd(k, p-1) = 1$, i.e. $k$ must be invertible modulo $p-1$. This is why $k$ is chosen uniformly from $(\mathbb{Z}/(p-1))^\times$, the group of units modulo $p-1$. Since $\gcd(k, p-1) = 1$, taking $a = k$ and $m = p-1$ gives $\gcd(a,m) = 1$, so for any right-hand side there is a unique solution for $s$ modulo $p-1$. ### Reusing the Nonce $k$ is Fatal The most critical implementation requirement is that Alice must choose a fresh, independent $k$ for every signature. Reusing $k$ exposes her private key entirely. [example: Private Key Recovery from Repeated Nonce] Suppose Alice signs two distinct messages $m_1, m_2$ using the same $k$, producing signatures $(r, s_1)$ and $(r, s_2)$ (note $r = g^k \bmod p$ is identical since $k$ is the same). The signing equations give: \begin{align*} h(m_1) &\equiv ur + ks_1 \pmod{p-1}, \\ h(m_2) &\equiv ur + ks_2 \pmod{p-1}. \end{align*} Subtracting: $h(m_1) - h(m_2) \equiv k(s_1 - s_2) \pmod{p-1}$. Set $d = \gcd(s_1 - s_2, p-1)$. By the lemma, this congruence in $k$ has $d$ solutions modulo $p-1$; Alice's $k$ is one of them, and the correct one is identified by the condition $r \equiv g^k \pmod{p}$. Once $k$ is known, the equation $h(m_1) \equiv ur + ks_1 \pmod{p-1}$ gives $ur \equiv h(m_1) - ks_1 \pmod{p-1}$, which has $\gcd(r, p-1)$ solutions for $u$; again the correct one is identified by $y \equiv g^u \pmod{p}$. The attacker recovers Alice's private key without solving any discrete logarithm. [/example] This attack is not a theoretical curiosity. The Sony PlayStation 3 was broken in 2010 precisely because its implementation of a signature scheme used a fixed rather than random nonce, allowing the console's private signing key to be extracted from two signed messages. ### The Digital Signature Algorithm ElGamal signatures are inconveniently large: each signature is a pair $(r, s)$ where both components have the same bit-length as $p$, typically 1024 to 2048 bits. Transmitting and storing such signatures is expensive, and for many applications — smart cards, embedded systems, digital certificates — this overhead is prohibitive. Can we shrink signatures without weakening discrete-logarithm security? The observation is that the security of ElGamal rests on the hardness of discrete log in $\mathbb{Z}_p^\times$, but the *signature values themselves* do not need to live in this full group. If $p-1$ has a large prime factor $q$ of only 160 bits, then the subgroup of $\mathbb{Z}_p^\times$ generated by an element $g$ of order $q$ has the same discrete-log hardness (by restriction) but is much smaller. By performing the final reduction modulo $q$ rather than modulo $p-1$, signatures shrink to pairs of 160-bit numbers — a factor of six or more compression for a 1024-bit $p$ — with no loss of security. The Digital Signature Algorithm (DSA) implements exactly this idea. It is a variant of ElGamal developed by the NSA, working modulo a smaller subgroup to achieve shorter signatures. [definition: Digital Signature Algorithm (DSA)] **Setup.** The public parameters $(p, q, g)$ are constructed as follows: - $p$ is a prime of exactly $N$ bits ($512 \leq N \leq 1024$, $N$ a multiple of 64); - $q$ is a 160-bit prime dividing $p - 1$; - $g \equiv h^{(p-1)/q} \pmod{p}$ where $h$ is a primitive root modulo $p$, so $g$ has order $q$ in $\mathbb{Z}_p^\times$. Alice chooses a **private key** $x$ with $1 < x < q$ and publishes **public key** $y \equiv g^x \pmod{p}$. **Signing.** To sign message $m$ (with $0 \leq m < q$), Alice chooses random $k$ with $1 < k < q$ and computes: \begin{align*} s_1 &\equiv (g^k \bmod p) \pmod{q}, \\ s_2 &\equiv k^{-1}(m + xs_1) \pmod{q}. \end{align*} The signature is the pair $(s_1, s_2)$. **Verification.** Bob computes $w \equiv s_2^{-1} \pmod{q}$, then: \begin{align*} u_1 &\equiv mw \pmod{q}, \quad u_2 \equiv s_1 w \pmod{q}, \\ v &\equiv (g^{u_1} y^{u_2} \bmod p) \pmod{q}. \end{align*} Bob accepts the signature if and only if $v = s_1$. [/definition] The double reduction modulo $q$ — first reducing $g^k \bmod p$ modulo $q$ to get $s_1$ — is the key structural difference from ElGamal. Since $q$ has only 160 bits, signatures are much shorter than raw ElGamal signatures. [quotetheorem:1689] [citeproof:1689] The theorem states only that honest signers pass verification. It does not say that forging a signature is hard — that hardness rests on the difficulty of computing discrete logarithms in $\mathbb{Z}_p^\times$, which is assumed but not proved here. Several structural features of DSA deserve explicit attention. First, the hypothesis $q \mid p-1$ is essential: it is what allows $g = h^{(p-1)/q} \bmod p$ to have order exactly $q$ in $\mathbb{Z}_p^\times$. Without this divisibility, the exponent $(p-1)/q$ is not an integer, and the construction of a subgroup of order $q$ collapses; the reduction modulo $q$ at the end of the signing step would then lose information in an uncontrolled way rather than performing a well-defined projection onto a subgroup. Second, $q$ must be large enough to resist square-root attacks on the discrete logarithm problem in the subgroup — specifically, baby-step giant-step and Pollard's rho run in $O(\sqrt{q})$ time, so $q$ must have at least 160 bits (and in modern standards, 256 bits) to provide adequate security. The whole point of DSA is to achieve short signatures, but choosing $q$ too small defeats this: the shorter signatures come at the cost of the very security they were designed to preserve. Third, as with ElGamal, **freshness of $k$ is essential**: if the same nonce $k$ is used to sign two different messages $m_1$ and $m_2$, the resulting signatures $(s_1, s_2)$ and $(s_1', s_2')$ share $s_1 = s_1'$ (since $r = g^k \bmod p \bmod q$ is identical), and the same subtraction argument recovers $k$ from $s_2 - s_2'$, then $x$ from the signing equation. This is not a theoretical curiosity: the Sony PlayStation 3 was broken in 2010 precisely because its DSA implementation used a fixed rather than random nonce, allowing the console's private signing key to be extracted from two publicly observed signed messages. ## Commitment Schemes and Bit Commitment Signatures address the question of who sent a message. A commitment scheme addresses a different problem: how can Alice lock in a choice that she cannot later change, while keeping it hidden from Bob until she chooses to reveal it? The motivating scenarios make the dual requirement precise. In a coin toss between parties in separate rooms, Alice must commit to heads or tails before seeing what Bob called, but Bob must not know her call until after he has declared his. For selling racing tips, the tipster must prove the tip was fixed before the race, but not reveal it until after payment. For online polling, each vote must be locked in before the tallying process begins, yet remain hidden from other voters until the poll closes. [definition: Bit Commitment Scheme] A **bit commitment scheme** is a protocol between parties A and B with two phases: - **Commit phase:** A computes and sends to B a **commitment** $c$ to a bit $m \in \mathbb{F}_2$. After this phase, $c$ is fixed. - **Reveal phase:** A sends B additional information that allows B to determine $m$ from $c$. The scheme must satisfy: - **Hiding:** B cannot determine $m$ from $c$ alone (before the reveal phase); - **Binding:** A cannot, after the commit phase, produce a reveal that convinces B that her original choice was $m' \neq m$. [/definition] The two properties are in tension: a perfectly hiding scheme (where $c$ conveys no information at all about $m$) and a perfectly binding scheme (where $c$ uniquely determines $m$) cannot coexist in a setting where A has unbounded computational power, because an unbounded A could always search over all possible $m$ and find which one is consistent with $c$. In practice one or both properties hold computationally. ### Commitment via Public-Key Cryptography [example: Public-Key Commitment] Alice's encryption and decryption functions $e_A, d_A$ form a public-key cryptosystem, with $e_A$ published and $d_A$ secret. - **Commit:** Alice computes $c = e_A(m)$ and sends it to Bob. Since $e_A$ is a secure cipher, Bob cannot decrypt $c$ to find $m$. - **Reveal:** Alice sends Bob her private key $d_A$. Bob computes $d_A(c) = d_A(e_A(m)) = m$ to recover her choice. He also verifies that $d_A$ and $e_A$ are genuine inverses, confirming Alice has not sent a false private key in an attempt to rewrite history. The hiding property follows from the security of the cryptosystem. The binding property holds because $e_A(m)$ is sent before $d_A$ is revealed: at commit time, the ciphertext $c$ already determines $m$ once the key is known, and Alice cannot later change $c$. However, if the cryptosystem is not perfectly secure, a computationally unbounded Bob could break the commitment's hiding. [/example] ### Commitment via Coding Theory The second scheme is more elegant: it uses the physical properties of a noisy channel to enforce binding without any computational assumptions, at the cost of a probabilistic reveal. [example: Coding-Theoretic Commitment] Alice and Bob have two channels: a **noisy channel** (a binary symmetric channel $\mathrm{BSC}(p)$ with $0 < p < 1/2$, corrupting each bit independently with probability $p$ and controlled by neither party) and a **clear channel** (noiseless). - **Setup:** Bob publishes a binary linear code $C \subseteq \mathbb{F}_2^N$ with minimum distance $d$. Alice publishes a random non-trivial linear map $\theta : C \to \mathbb{F}_2$. - **Commit:** To commit to a bit $m \in \mathbb{F}_2$, Alice chooses a uniformly random codeword $c \in C$ satisfying $\theta(c) = m$ and sends $c$ over the **noisy channel**. Bob receives $r = c + e$ where $e \in \mathbb{F}_2^N$ encodes the channel errors. - **Why Bob cannot determine $m$:** The expected Hamming distance $d(r, c) = d(e, 0)$ is $Np$. Parameters are chosen so $Np \gg d$, meaning $r$ is far from the nearest codeword boundary: Bob cannot decode $r$ to recover $c$, so he cannot compute $\theta(c) = m$. - **Reveal:** Alice sends $c$ over the **clear channel**. Bob checks that $d(c, r) \approx Np$ (within expected fluctuations). If so, he accepts $\theta(c)$ as Alice's genuine commitment. - **Why Alice cannot cheat:** After sending $c$ through the noisy channel, Alice knows $c$ but not the exact error pattern $e$ — the channel is memoryless and independent of her actions. She knows only that $r$ is within $\approx Np$ of $c$ in Hamming distance. If she tries to reveal a different codeword $c' \neq c$ through the clear channel, she must ensure $d(c', r) \approx Np$. Since $d(c', c) \geq d$ (the minimum distance of $C$), the triangle inequality gives $d(c', r) \geq d(c, c') - d(c, r) \geq d - Np$. For $d \gg Np$, this is much larger than $Np$, so Bob will detect the discrepancy and reject. Parameters $N$ and $d$ are chosen so that the probability of the legitimate error count $d(c, r)$ deviating too far from $Np$ is negligible. If this rare event does occur, Alice and Bob simply repeat the protocol. [/example] The coding-theoretic scheme achieves something the public-key scheme cannot: its binding property holds unconditionally, not merely computationally. Alice cannot cheat even with unlimited computing power, because her inability to predict $e$ is physical rather than computational. The hiding property, on the other hand, is information-theoretic only in the sense that $r$ is far from $c$ in Hamming distance; it relies on the parameters being chosen so that the code's structure prevents decoding. The interplay between the minimum distance $d$, the noise level $Np$, and the block length $N$ is the quantitative heart of the scheme's security analysis.

Created by admin on 4/24/2026 | Last updated on 4/24/2026

What brings you to Androma?

Start with a route through the knowledge graph.

Cambridge II Coding and Cryptography

Sign in to Androma

Check your inbox

One last step

Cambridge II Coding and Cryptography

Prerequisites

Rate this page