[proofplan]
We bootstrap off the average-error version (the [Achievability Theorem for the BSC](/theorems/1658)). Pick a rate $R'$ strictly between $R$ and $1 - H(p)$, produce a sequence of codes $(C'_n)$ of size $\lfloor 2^{nR'} \rfloor$ with average error $e(C'_n) \to 0$, then expurgate: delete the worst half of each $C'_n$, keeping the half with the lowest conditional error probabilities. A Markov-style argument shows the surviving codewords each have error at most $2 e(C'_n)$, so the maximum error goes to zero. The expurgated code has size $\lceil \lfloor 2^{nR'} \rfloor / 2 \rceil$, which is eventually at least $\lfloor 2^{nR} \rfloor$ because $R' > R$; we trim down to exactly that size by discarding extra codewords, which only lowers the maximum error.
[/proofplan]
[step:Choose an intermediate rate $R'$ with $R < R' < 1 - H(p)$]
By the hypothesis $R < 1 - H(p)$, the open interval $(R, 1 - H(p))$ is non-empty; pick any
\begin{align*}
R' \in (R, 1 - H(p)).
\end{align*}
By the [Achievability Theorem for the BSC](/theorems/1658) applied at rate $R'$, there exists a sequence $(C'_n)_{n \ge 1}$ of codes with $C'_n \subseteq \{0,1\}^n$, $|C'_n| = m'_n := \lfloor 2^{nR'} \rfloor$, and average error probability $e(C'_n) \to 0$.
[guided]
Why introduce $R'$ rather than working directly at rate $R$? The expurgation halves the code size, which costs one bit of rate. Starting the construction at rate $R$ would leave us with a code of rate $R - 1/n$, eventually below $R$. We need a cushion: a rate $R'$ strictly bigger than $R$. The constraint $R' < 1 - H(p)$ is needed so that the average-error theorem applies. Since $R < 1 - H(p)$ strictly, there is room in between; any $R'$ in the open interval $(R, 1 - H(p))$ will do.
[/guided]
[/step]
[step:Order each $C'_n$ by conditional error probability and delete the worst half]
Fix $n$ and abbreviate $C' = C'_n$ with $m' = m'_n$. For each codeword $c \in C'$, let
\begin{align*}
e(c; C') := \mathbb{P}(\text{decoder output} \ne c \mid c \text{ sent using } C'),
\end{align*}
so that $e(C') = \frac{1}{m'} \sum_{c \in C'} e(c; C')$. Order the codewords as $C' = \{c_{(1)}, c_{(2)}, \ldots, c_{(m')}\}$ with
\begin{align*}
e(c_{(1)}; C') \le e(c_{(2)}; C') \le \cdots \le e(c_{(m')}; C').
\end{align*}
Let $k := \lceil m'/2 \rceil$ and define the expurgated code
\begin{align*}
C'' := \{c_{(1)}, c_{(2)}, \ldots, c_{(k)}\}.
\end{align*}
Then $|C''| = k = \lceil m'/2 \rceil$.
[/step]
[step:Bound the worst-case error of $C''$ by twice the average error of $C'$]
We show
\begin{align*}
\hat{e}(C''; \text{decoder of } C') \le 2\, e(C').
\end{align*}
Suppose for contradiction that $e(c_{(j)}; C') > 2 e(C')$ for some $j \le k = \lceil m'/2 \rceil$. By the ordering,
\begin{align*}
e(c_{(i)}; C') \ge e(c_{(j)}; C') > 2 e(C') \qquad \text{for all } i \ge j.
\end{align*}
The number of such indices $i$ is $m' - j + 1 \ge m' - k + 1 \ge m'/2$. Hence
\begin{align*}
e(C') = \frac{1}{m'} \sum_{i=1}^{m'} e(c_{(i)}; C') \ge \frac{1}{m'} \sum_{i=j}^{m'} e(c_{(i)}; C') > \frac{1}{m'} \cdot \frac{m'}{2} \cdot 2 e(C') = e(C'),
\end{align*}
a contradiction. So every codeword of $C''$ has conditional error at most $2 e(C')$.
Crucially, the decoder used for $C''$ is still the decoder originally specified for $C'$ (minimum-distance on $C'$), restricted to the event that a codeword of $C''$ was sent. This decoder may output an element of $C' \setminus C''$ — if so, it counts as a decoding error for $C''$. Thus the bound $e(c; C') \le 2 e(C')$ directly controls the error probability on $C''$, giving $\hat{e}(C'') \le 2 e(C')$.
Since $e(C'_n) \to 0$ by Step 1, we have $\hat{e}(C''_n) \le 2 e(C'_n) \to 0$.
[guided]
The argument is a Markov-type tail bound turned deterministic by the ordering. We want: at least half of the codewords have error $\le 2 e(C')$. Suppose instead that strictly more than half had error $> 2 e(C')$. Then the average over all codewords — which is a mean of $m'$ numbers — would be strictly greater than $(m'/2) \cdot 2 e(C') / m' = e(C')$, contradicting that the average is exactly $e(C')$. Equivalently, if we sort the errors in ascending order, the median-or-below indices $i \le \lceil m'/2 \rceil$ are forced to have error $\le 2 e(C')$, so the first $k = \lceil m'/2 \rceil$ of them do.
Why is the decoder question slightly delicate? When we delete codewords, the decoder is not automatically redefined. The natural choice is to keep using the decoder inherited from $C'$, which maps $\{0,1\}^n \to C' \cup \{\bot\}$. Restricted to the sender of a kept codeword $c \in C''$, the event "decoder outputs $c$" is exactly the event "no error under the original decoder". So the conditional error probability $e(c; C')$ is unchanged when we restrict attention to $C''$-senders. A decoding into some element of $C' \setminus C''$ is still an error for the $C''$-code. We do not need to replace the decoder with a minimum-distance decoder over $C''$: any such replacement can only **decrease** the error, so the bound $\hat{e}(C'') \le 2 e(C')$ still holds.
[/guided]
[/step]
[step:Verify that $C''$ has size at least $\lfloor 2^{nR} \rfloor$ for large $n$]
We have $|C''_n| = \lceil m'_n / 2 \rceil = \lceil \lfloor 2^{nR'} \rfloor / 2 \rceil$. For $n$ large enough,
\begin{align*}
\left\lceil \frac{\lfloor 2^{nR'} \rfloor}{2} \right\rceil \ge \frac{2^{nR'} - 1}{2} \ge 2^{nR' - 1} - 1.
\end{align*}
We need this to be at least $\lfloor 2^{nR} \rfloor$. Since $\lfloor 2^{nR} \rfloor \le 2^{nR}$, it suffices to show $2^{nR' - 1} - 1 \ge 2^{nR}$ for all large $n$, or equivalently
\begin{align*}
2^{nR'-1} \ge 2^{nR} + 1.
\end{align*}
Because $R' > R$, we have $2^{n(R' - R) - 1} \to \infty$ as $n \to \infty$. Concretely, pick $N$ large enough that $2^{n(R' - R) - 1} \ge 2$ for all $n \ge N$; then $2^{nR'-1} = 2^{nR} \cdot 2^{n(R' - R) - 1} \ge 2 \cdot 2^{nR} \ge 2^{nR} + 1$. So for all $n \ge N$, $|C''_n| \ge \lfloor 2^{nR} \rfloor$.
[/step]
[step:Pass to a subcode of exact size $\lfloor 2^{nR} \rfloor$]
For $n \ge N$, let $C_n$ be an arbitrary subset of $C''_n$ of size exactly $\lfloor 2^{nR} \rfloor$; this is possible by Step 4. For $n < N$, set $C_n$ to be any $\lfloor 2^{nR} \rfloor$-element subset of $\{0,1\}^n$ (finitely many such $n$ do not affect the limit).
We verify that $\hat{e}(C_n) \to 0$. For $n \ge N$, $C_n \subseteq C''_n$ and the decoder on $C_n$ is minimum-distance on $C_n$. For any $c \in C_n$,
\begin{align*}
\mathbb{P}(\text{decoder on } C_n \text{ outputs } \ne c \mid c \text{ sent}) \le \mathbb{P}(\text{decoder on } C' \text{ outputs } \ne c \mid c \text{ sent}) \le \hat{e}(C''_n),
\end{align*}
because shrinking the codebook from $C'$ to $C_n$ removes candidate decoder outputs, so the event "nearest codeword (in $C_n$) equals $c$" is a superset of "nearest codeword (in $C'$) equals $c$". Hence $\hat{e}(C_n) \le \hat{e}(C''_n) \le 2 e(C'_n) \to 0$.
[guided]
Removing codewords from a code can only help the worst-case error under minimum-distance decoding. The minimum-distance decoder on $C_n$ outputs the nearest element of $C_n$; if $c$ was the nearest in $C'$ among all the competitors, it is still the nearest in $C_n \subset C'$. So every "correct decoding in $C'$" remains a correct decoding in $C_n$. Erroneous decodings may become more common (because new elements of $C_n$ may tie or beat the true codeword, but now nothing new is added), or rather, they may shift — but on the event where $C'$-decoding succeeded, $C_n$-decoding also succeeds. Hence the worst-case error probability of $C_n$ is at most that of $C''_n$, and passes to zero.
(A clean way to see this: the event "minimum-distance decoder on $C_n$ outputs $c$" contains the event "minimum-distance decoder on $C'$ outputs $c$" when $c \in C_n \subseteq C'$ and we have a unique-minimum decoder, because fewer competitors means fewer ways for another codeword to be closer.)
[/guided]
[/step]
[step:Combine the estimates to conclude]
By Steps 4 and 5, $(C_n)_{n \ge 1}$ is a sequence of codes with $C_n \subseteq \{0,1\}^n$, $|C_n| = \lfloor 2^{nR} \rfloor$, and
\begin{align*}
\hat{e}(C_n) \le 2\, e(C'_n) \to 0 \qquad \text{as } n \to \infty.
\end{align*}
This is the required sequence with maximum error probability tending to zero, completing the proof.
[/step]