[proofplan]
We prove by induction on the alphabet size $m$ that the Huffman code $c_m$ for a source $X_m$ with probabilities $p_1 \geq \cdots \geq p_m$ has expected length no larger than that of any other prefix-free code. The base case $m = 2$ is immediate. For the inductive step, the two lowest-probability symbols $\mu_{m-1}, \mu_m$ are merged into a single symbol $\nu$ of weight $p_{m-1} + p_m$ to form the reduced source $X_{m-1}$; the Huffman code $c_m$ is obtained from $c_{m-1}$ by appending $0$ and $1$ to the codeword of $\nu$, which gives the key identity $\mathbb{E}[S_m] = \mathbb{E}[S_{m-1}] + p_{m-1} + p_m$. Given any competing optimal prefix-free code $c'_m$, the [Structure of Optimal Codes](/theorems/1633) theorem lets us arrange (via relabelling) that its two longest codewords are siblings; collapsing them produces a prefix-free code for the reduced source with the same length identity. The inductive hypothesis applied to the reduced source closes the argument. [McMillan's inequality](/theorems/???) ensures that restricting to prefix-free codes is no loss of generality, since any decipherable code can be matched by a prefix-free code with the same word lengths.
[/proofplan]
[step:Reduce to prefix-free codes via McMillan's inequality]
Let $X_m$ be a source with values $\mu_1, \ldots, \mu_m$ and probabilities $p_1 \geq p_2 \geq \cdots \geq p_m > 0$. We may assume WLOG that all $p_i > 0$; symbols of zero probability contribute nothing to expected length and are discarded.
Let $c$ be any decipherable code for $X_m$ with word lengths $l_1, \ldots, l_m$. By [McMillan's inequality](/theorems/???), $\sum_{i=1}^m a^{-l_i} \leq 1$, so by [Kraft's inequality](/theorems/???) there exists a prefix-free code $\tilde c$ with the same word lengths $l_1, \ldots, l_m$ and therefore the same expected length $\mathbb{E}[\tilde S] = \mathbb{E}[S]$. Hence the minimum expected length over decipherable codes equals the minimum over prefix-free codes, and it suffices to prove that a Huffman code is optimal among prefix-free codes.
For notational simplicity we take $a = 2$ (binary Huffman); the general case is identical with $0, 1$ replaced by any pair of distinct letters in $B$.
[/step]
[step:Set up induction on the alphabet size and dispose of the base case]
We prove by induction on $m \geq 2$ the statement:
$P(m)$: *For every source $X_m$ on $m$ symbols, the Huffman code $c_m$ for $X_m$ is optimal among prefix-free binary codes, i.e., $\mathbb{E}[S_m] \leq \mathbb{E}[S'_m]$ for every prefix-free binary code $c'_m$ for $X_m$.*
**Base case $m = 2$.** A prefix-free binary code for a two-symbol source must assign a codeword of length $\geq 1$ to each symbol; the only such codes (up to relabelling of $0 \leftrightarrow 1$) have codewords of length exactly $1$, namely $\{0, 1\}$. So every prefix-free binary code on two symbols has expected length $p_1 \cdot 1 + p_2 \cdot 1 = 1$, and the Huffman code (which produces exactly this assignment) is optimal.
[/step]
[step:Define the reduced source $X_{m-1}$ and the inductive Huffman code $c_m$]
**Inductive step.** Assume $P(m-1)$ holds. Let $X_m$ have probabilities $p_1 \geq \cdots \geq p_m$. Define the reduced source $X_{m-1}$ on the symbols $\mu_1, \ldots, \mu_{m-2}, \nu$ with probability distribution
\begin{align*}
\mathbb{P}(X_{m-1} = \mu_i) &= p_i \quad (1 \leq i \leq m-2), \\
\mathbb{P}(X_{m-1} = \nu) &= p_{m-1} + p_m.
\end{align*}
These sum to $1$ because $\sum_{i=1}^m p_i = 1$, so $X_{m-1}$ is a valid source on $m - 1$ symbols.
Let $c_{m-1}$ be a Huffman code for $X_{m-1}$, and let $y \in \{0, 1\}^*$ denote the codeword $c_{m-1}(\nu)$. The Huffman construction then defines the code $c_m$ for $X_m$ by
\begin{align*}
c_m(\mu_i) &= c_{m-1}(\mu_i) \quad (1 \leq i \leq m-2), \\
c_m(\mu_{m-1}) &= y \cdot 0, \\
c_m(\mu_m) &= y \cdot 1.
\end{align*}
One checks that $c_m$ is prefix-free: the only new codewords are $y \cdot 0$ and $y \cdot 1$; neither is a prefix of the other, and neither is a prefix of any $c_{m-1}(\mu_i) = c_m(\mu_i)$ because $c_{m-1}$ is prefix-free and $y = c_{m-1}(\nu)$ is therefore not a prefix of any $c_{m-1}(\mu_i)$ for $i \leq m - 2$ — so its extensions $y \cdot 0, y \cdot 1$ are not prefixes either.
[/step]
[step:Establish the length identity $\mathbb{E}[S_m] = \mathbb{E}[S_{m-1}] + p_{m-1} + p_m$]
Let $\ell_i = |c_{m-1}(\mu_i)|$ for $i \leq m - 2$ and $\ell_\nu = |y|$. Then the word lengths of $c_m$ are
\begin{align*}
|c_m(\mu_i)| &= \ell_i \quad (i \leq m-2), \\
|c_m(\mu_{m-1})| &= |c_m(\mu_m)| = \ell_\nu + 1.
\end{align*}
Therefore
\begin{align*}
\mathbb{E}[S_m]
&= \sum_{i=1}^{m-2} p_i \ell_i + p_{m-1}(\ell_\nu + 1) + p_m(\ell_\nu + 1) \\
&= \sum_{i=1}^{m-2} p_i \ell_i + (p_{m-1} + p_m) \ell_\nu + p_{m-1} + p_m \\
&= \mathbb{E}[S_{m-1}] + p_{m-1} + p_m,
\end{align*}
where the last equality uses the definition of $X_{m-1}$: its expected length under $c_{m-1}$ is
\begin{align*}
\mathbb{E}[S_{m-1}] = \sum_{i=1}^{m-2} p_i \ell_i + (p_{m-1} + p_m) \ell_\nu.
\end{align*}
[/step]
[step:Reshape a competing optimal code so its two longest codewords are siblings]
Let $c'_m$ be any optimal prefix-free binary code for $X_m$, with word lengths $l'_1, \ldots, l'_m$ and expected length $\mathbb{E}[S'_m]$. We apply [Structure of Optimal Codes](/theorems/1633), which asserts that (i) the codeword lengths of $c'_m$ can be ordered so that $l'_i \leq l'_j$ whenever $p_i > p_j$, and (ii) among all codewords of $c'_m$ of maximal length, two of them differ only in their final digit.
We verify the hypotheses of that theorem: $c'_m$ is prefix-free by assumption, and $c'_m$ is optimal by assumption. The theorem applies.
By part (i), WLOG (after relabelling symbols of equal probability, which does not change the distribution) the two codewords of maximal length are those of the two least-probable symbols $\mu_{m-1}$ and $\mu_m$. By part (ii), WLOG (after swapping codewords within the pair of siblings at maximal depth, which does not change the multiset of word lengths and hence preserves optimality and the expected length) these two codewords are $c'_m(\mu_{m-1}) = y' \cdot 0$ and $c'_m(\mu_m) = y' \cdot 1$ for some common prefix $y' \in \{0, 1\}^*$.
[guided]
We need to massage $c'_m$ into a form compatible with the inductive structure — namely, its two longest codewords should sit at the bottom of the coding tree as siblings (sharing a common parent), and they should correspond to the two least-probable symbols. Neither reshaping changes the expected length, so they are WLOG; both rely on [Structure of Optimal Codes](/theorems/1633).
*First reshape: longest codewords correspond to least-probable symbols.* Part (i) of Structure of Optimal Codes says: if $p_i > p_j$ then $l'_i \leq l'_j$. Contrapositively, the symbols with the largest word lengths have the smallest probabilities. Among the symbols achieving the maximum length, we may freely permute their codewords — this is a bijection on codewords that preserves prefix-freeness (the multiset of codewords is unchanged) and does not change the expected length (a symbol of probability $p$ still receives a codeword of that maximum length). So we may assume the two codewords of maximum length belong to $\mu_{m-1}$ and $\mu_m$.
*Second reshape: two longest codewords are siblings.* Part (ii) of Structure of Optimal Codes says: among the codewords of $c'_m$ of maximum length, two differ only in their last digit — i.e., are of the form $w \cdot 0$ and $w \cdot 1$ for some $w$. We swap the codewords assigned to $\mu_{m-1}, \mu_m$ with the two that form this sibling pair (this is again a permutation within the maximum-length class, so preserves prefix-freeness and expected length). After this swap, $c'_m(\mu_{m-1}) = y' \cdot 0$ and $c'_m(\mu_m) = y' \cdot 1$ for some prefix $y'$.
The reshaping is lossless in expected length, so proving $\mathbb{E}[S_m] \leq \mathbb{E}[S'_m]$ for the reshaped $c'_m$ suffices for the original.
[/guided]
[/step]
[step:Collapse the sibling pair to obtain a prefix-free code $c'_{m-1}$ for the reduced source]
Define a code $c'_{m-1}$ for $X_{m-1}$ by
\begin{align*}
c'_{m-1}(\mu_i) &= c'_m(\mu_i) \quad (1 \leq i \leq m-2), \\
c'_{m-1}(\nu) &= y'.
\end{align*}
We check $c'_{m-1}$ is prefix-free. The codewords $c'_{m-1}(\mu_i) = c'_m(\mu_i)$ for $i \leq m - 2$ are pairwise prefix-incomparable because $c'_m$ is prefix-free. It remains to check that $y'$ is prefix-incomparable with each $c'_m(\mu_i)$ ($i \leq m - 2$):
- If $y'$ were a prefix of $c'_m(\mu_i)$, then $y' \cdot 0 = c'_m(\mu_{m-1})$ would differ from $c'_m(\mu_i)$ only in its first extension beyond $y'$; in particular the initial $|y'|$ characters of $c'_m(\mu_i)$ equal $y'$. Since $c'_m$ is prefix-free and $c'_m(\mu_{m-1}) = y' \cdot 0$ has length $|y'| + 1$, we would need $c'_m(\mu_i)$ not to be $y' \cdot 0$ itself, i.e., $c'_m(\mu_i)$ has length $\geq |y'| + 1$ and begins with $y' \cdot b$ for some $b \in \{0, 1\}$. If $b = 0$, then $y' \cdot 0 = c'_m(\mu_{m-1})$ is a prefix of $c'_m(\mu_i)$, violating prefix-freeness of $c'_m$. If $b = 1$, then $y' \cdot 1 = c'_m(\mu_m)$ is a prefix of $c'_m(\mu_i)$, again violating prefix-freeness. So $y'$ is not a prefix of $c'_m(\mu_i)$.
- Conversely, if $c'_m(\mu_i)$ were a prefix of $y'$, then $c'_m(\mu_i)$ would be a prefix of $y' \cdot 0 = c'_m(\mu_{m-1})$, violating prefix-freeness of $c'_m$.
So $y'$ is prefix-incomparable with every $c'_m(\mu_i)$, and $c'_{m-1}$ is prefix-free.
[/step]
[step:Apply the length identity to $c'_m$ and finish by the inductive hypothesis]
Exactly as in the earlier length-identity step, the word lengths of $c'_{m-1}$ relate to those of $c'_m$ by
\begin{align*}
|c'_{m-1}(\mu_i)| &= |c'_m(\mu_i)| \quad (i \leq m-2), \\
|c'_{m-1}(\nu)| &= |y'| = |c'_m(\mu_{m-1})| - 1 = |c'_m(\mu_m)| - 1,
\end{align*}
so
\begin{align*}
\mathbb{E}[S'_m] = \mathbb{E}[S'_{m-1}] + p_{m-1} + p_m,
\end{align*}
where $\mathbb{E}[S'_{m-1}]$ is the expected length of $c'_{m-1}$ under the source $X_{m-1}$.
By the inductive hypothesis $P(m-1)$, the Huffman code $c_{m-1}$ is optimal for $X_{m-1}$, so
\begin{align*}
\mathbb{E}[S_{m-1}] \leq \mathbb{E}[S'_{m-1}].
\end{align*}
Adding $p_{m-1} + p_m$ to both sides and using the two length identities:
\begin{align*}
\mathbb{E}[S_m] = \mathbb{E}[S_{m-1}] + p_{m-1} + p_m \leq \mathbb{E}[S'_{m-1}] + p_{m-1} + p_m = \mathbb{E}[S'_m].
\end{align*}
This establishes $P(m)$: the Huffman code $c_m$ has expected length no larger than any other optimal prefix-free code (and hence, by the reduction in the first step, no larger than any decipherable code). The induction is complete.
[/step]