[proofplan]
We use the probabilistic method: draw a code $C_n$ uniformly at random from all size-$m$ subsets of $\{0,1\}^n$ and show that the expected average error probability tends to zero, so some deterministic code achieves at most the expectation. The decoder is minimum-distance (equivalently, nearest-neighbour in the Hamming metric). We fix a small $\varepsilon > 0$ with $p + \varepsilon < 1/2$ and $R < 1 - H(p+\varepsilon)$ — possible by continuity of $H$ — and decode by checking whether a Hamming ball of radius $r := \lfloor n(p+\varepsilon) \rfloor$ around the received word $Y$ contains exactly one codeword. Errors come from two sources: the channel producing more than $r$ flips (controlled by a law-of-large-numbers estimate), or a random codeword $c_j$ with $j \ne i$ falling into the ball (controlled by the union bound combined with the entropy estimate $V(n,r) \le 2^{nH(p+\varepsilon)}$). Both terms vanish as $n \to \infty$ exactly because $R < 1 - H(p+\varepsilon)$.
[/proofplan]
[step:Choose $\varepsilon > 0$ so that $p + \varepsilon < 1/2$ and $R < 1 - H(p+\varepsilon)$]
The binary entropy function
\begin{align*}
H: [0,1] &\to [0,1] \\
q &\mapsto -q\log_2 q - (1-q)\log_2(1-q),
\end{align*}
with the conventions $0 \log_2 0 = 0$, is continuous on $[0,1]$ and strictly increasing on $[0, 1/2]$. By hypothesis $p < 1/2$ and $R < 1 - H(p)$, so the map $q \mapsto 1 - H(q) - R$ is continuous and strictly positive at $q = p$. Hence there exists $\varepsilon > 0$ with
\begin{align*}
p + \varepsilon < 1/2 \qquad \text{and} \qquad R < 1 - H(p+\varepsilon).
\end{align*}
Fix such an $\varepsilon$ for the rest of the proof and set $r_n := \lfloor n(p + \varepsilon) \rfloor$ and $m_n := \lfloor 2^{nR} \rfloor$. When $n$ is understood we write simply $r$ and $m$.
[guided]
Why a "cushion" $\varepsilon$ rather than working directly with $p$? The decoding ball needs radius large enough that the channel's actual error count — a $\operatorname{Bin}(n,p)$ random variable — lies inside it with high probability; radius exactly $np$ would put the fluctuations right on the edge. Taking radius $n(p+\varepsilon)$ leaves room for fluctuations of order $\sqrt{n}$ and makes the tail estimate quantitative. On the other hand we cannot enlarge the ball arbitrarily, because the false-codeword term (Step 4) grows like $V(n,r)/2^n \approx 2^{-n(1 - H(r/n))}$, which still needs to be smaller than $2^{-nR}$ to beat the union bound over $m-1$ competitors. The constraint $R < 1 - H(p+\varepsilon)$ is exactly the statement that this remains true after the cushion is added. Continuity of $H$ at $p$ guarantees that a sufficiently small $\varepsilon$ works.
[/guided]
[/step]
[step:Define the random code and the minimum-distance decoder]
Let $\mathcal{C}_{n,m}$ denote the set of all $m$-element subsets of $\{0,1\}^n$ with $|\mathcal{C}_{n,m}| = \binom{2^n}{m}$. Let $C = \{c_1, \ldots, c_m\}$ be drawn uniformly from $\mathcal{C}_{n,m}$, where the enumeration $c_1, \ldots, c_m$ is a uniformly random ordering. Let $I$ be an index drawn uniformly from $\{1, \ldots, m\}$ independently of $C$, let $c_I$ be sent over the BSC, and let $Y \in \{0,1\}^n$ be the received word. Thus, conditionally on $(C, I)$, the channel output satisfies
\begin{align*}
Y = c_I \oplus Z, \qquad Z \sim \operatorname{Bin}(n, p) \text{ coordinatewise},
\end{align*}
where $\oplus$ is componentwise addition modulo $2$ and $Z$ is independent of $(C, I)$.
The minimum-distance decoder is the map
\begin{align*}
D : \{0,1\}^n \times \mathcal{C}_{n,m} &\to \{0,1\}^n \cup \{\bot\} \\
(y, C) &\mapsto \begin{cases} c & \text{if there is a unique } c \in C \text{ with } d_H(y, c) \text{ minimal,} \\ \bot & \text{otherwise,} \end{cases}
\end{align*}
where $d_H$ is the Hamming distance. We declare a decoding error whenever $D(Y, C) \ne c_I$.
The average probability of error for a fixed code $C$ is
\begin{align*}
e(C) := \frac{1}{m}\sum_{i=1}^m \mathbb{P}(D(Y, C) \ne c_i \mid c_i \text{ sent}) = \mathbb{P}(D(Y, C) \ne c_I \mid C),
\end{align*}
and we want to show $\mathbb{E}[e(C)] = \mathbb{P}(D(Y, C) \ne c_I) \to 0$ as $n \to \infty$. Once this is established, a deterministic code $C_n^*$ with $e(C_n^*) \le \mathbb{E}[e(C)]$ exists by the probabilistic method — if every realisation of $C$ had $e(C) > \mathbb{E}[e(C)]$, the average could not equal $\mathbb{E}[e(C)]$.
[/step]
[step:Bound the error probability using the ball-decoding surrogate]
Consider the event $\mathcal{S}_i := \{B(Y, r) \cap C = \{c_i\}\}$, where $B(y, r) := \{x \in \{0,1\}^n : d_H(x,y) \le r\}$ is the Hamming ball of radius $r$. On $\mathcal{S}_I$ the received word $Y$ has exactly one codeword within distance $r$; any other codeword lies at distance $> r$, so minimum-distance decoding returns $c_I$. Hence
\begin{align*}
\mathbb{P}(D(Y,C) \ne c_I) \le \mathbb{P}(\mathcal{S}_I^c) = \mathbb{P}\big(c_I \notin B(Y,r) \ \text{or}\ B(Y,r) \cap C \supsetneq \{c_I\}\big).
\end{align*}
By the union bound,
\begin{align*}
\mathbb{P}(D(Y,C) \ne c_I) \le \underbrace{\mathbb{P}(c_I \notin B(Y, r))}_{\text{Term I}} + \underbrace{\mathbb{P}(B(Y,r) \cap C \supsetneq \{c_I\})}_{\text{Term II}}.
\end{align*}
[guided]
Why dominate the actual decoding error by this ball-based event? Minimum-distance decoding outputs the unique nearest codeword. If $B(Y,r)$ contains exactly $c_I$ and no other codeword, then $c_I$ is within distance $r$ of $Y$ while every other codeword is at distance strictly greater than $r$, so $c_I$ is unambiguously the nearest, and decoding succeeds. The converse is not true — decoding may succeed even if $B(Y,r)$ misses $c_I$ or contains extra codewords, provided $c_I$ still happens to be nearest. That asymmetry is fine: we are looking for a sufficient condition for correct decoding. The complementary event $\mathcal{S}_I^c$ is the union of "channel made more than $r$ errors" (the ball misses $c_I$) and "some confusion codeword is also in the ball". The union bound splits these.
[/guided]
[/step]
[step:Bound Term I via the weak law of large numbers for the channel noise]
The event $\{c_I \notin B(Y, r)\}$ means $d_H(c_I, Y) > r$, i.e.\ the number of bit flips produced by the channel exceeds $r = \lfloor n(p+\varepsilon) \rfloor$. Let $N_n := d_H(c_I, Y) = \sum_{j=1}^n Z_j$ where $Z_j$ are i.i.d.\ $\operatorname{Bernoulli}(p)$ independent of $(C, I)$. Then $\mathbb{E}[N_n/n] = p$ and by the weak law of large numbers,
\begin{align*}
\mathbb{P}\big(c_I \notin B(Y,r)\big) = \mathbb{P}\big(N_n > n(p+\varepsilon) - 1\big) \le \mathbb{P}\big(N_n/n \ge p + \varepsilon/2\big) \to 0
\end{align*}
as $n \to \infty$, where the intermediate inequality uses $n(p+\varepsilon) - 1 \ge n(p + \varepsilon/2)$ for $n \ge 2/\varepsilon$.
[guided]
The flip count $N_n$ is the sum of $n$ i.i.d.\ Bernoulli$(p)$ random variables, so the weak law of large numbers tells us $N_n / n \to p$ in probability. Concretely, for any fixed $\delta > 0$, $\mathbb{P}(|N_n/n - p| > \delta) \to 0$. We want to control $\mathbb{P}(N_n > n(p+\varepsilon) - 1)$. For all $n \ge 2/\varepsilon$, we have $n(p + \varepsilon) - 1 \ge n(p + \varepsilon/2)$, so this probability is at most $\mathbb{P}(N_n/n \ge p + \varepsilon/2)$, which tends to $0$ by the law of large numbers with $\delta = \varepsilon/2$. The independence of $Z$ from $(C,I)$ ensures we can compute this probability without conditioning on the code realisation.
[/guided]
[/step]
[step:Bound Term II using exchangeability and the Hamming ball volume]
Let $V(n,r) := \#B(0,r) = \sum_{k=0}^r \binom{n}{k}$ be the volume of a Hamming ball of radius $r$ in $\{0,1\}^n$; by translation invariance of $d_H$, every Hamming ball of radius $r$ in $\{0,1\}^n$ has this cardinality.
Decompose
\begin{align*}
\mathbb{P}(B(Y,r) \cap C \supsetneq \{c_I\}) &= \mathbb{P}\!\left(\bigcup_{j \ne I} \{c_j \in B(Y,r)\}\right) \le \sum_{j=1}^{m} \sum_{i \ne j} \frac{1}{m}\, \mathbb{P}(c_j \in B(Y, r) \mid I = i) \\
&\le (m-1) \cdot \max_{i \ne j} \mathbb{P}(c_j \in B(Y,r) \mid I = i),
\end{align*}
using the law of total probability over $I$ and the union bound over $j \ne i$.
We claim that for any $i \ne j$,
\begin{align*}
\mathbb{P}(c_j \in B(Y, r) \mid I = i) = \frac{V(n,r)}{2^n}.
\end{align*}
To see this, condition on $(c_i, Y) = (a, y)$; then $c_j$ is distributed uniformly over $\{0,1\}^n \setminus \{c_1, \ldots, c_{j-1}, c_{j+1}, \ldots\}$ — but crucially, among the ordered distinct sequences drawn, swapping $c_j$ with any fixed element of $\{0,1\}^n \setminus \{a\}$ preserves the uniform distribution on $m$-element subsets. More concretely, since $(c_1, \ldots, c_m)$ is a uniform random ordering of a uniform $m$-subset, the marginal of $c_j$ given $(c_i, I = i, Y) = (a, i, y)$ is uniform on $\{0,1\}^n \setminus \{a\}$. Thus
\begin{align*}
\mathbb{P}(c_j \in B(y, r) \mid c_i = a, I = i, Y = y) = \frac{\#(B(y,r) \setminus \{a\})}{2^n - 1} \le \frac{V(n,r)}{2^n - 1} \le \frac{V(n,r)}{2^n - V(n,r)}.
\end{align*}
For all sufficiently large $n$ the ball volume satisfies $V(n,r) \le 2^n/2$ (see the entropy bound below), so $V(n,r)/(2^n - V(n,r)) \le 2 V(n,r)/2^n$. Incorporating the factor of $2$ into constants we do not track,
\begin{align*}
\mathbb{P}(c_j \in B(Y,r) \mid I = i) \le \frac{2\, V(n,r)}{2^n}.
\end{align*}
Combining with the union bound,
\begin{align*}
\text{Term II} \le (m-1) \cdot \frac{2\, V(n,r)}{2^n} \le \frac{2 \cdot 2^{nR} \cdot V(n,r)}{2^n}.
\end{align*}
We now invoke the entropy bound for binomial sums: for $0 \le r \le n/2$,
\begin{align*}
V(n, r) = \sum_{k=0}^{r} \binom{n}{k} \le 2^{n H(r/n)}.
\end{align*}
Since $r/n \le p + \varepsilon < 1/2$ and $H$ is increasing on $[0, 1/2]$,
\begin{align*}
V(n,r) \le 2^{n H(p + \varepsilon)}.
\end{align*}
Therefore
\begin{align*}
\text{Term II} \le \frac{2 \cdot 2^{nR} \cdot 2^{nH(p+\varepsilon)}}{2^n} = 2 \cdot 2^{n\bigl(R - (1 - H(p+\varepsilon))\bigr)}.
\end{align*}
By Step 1, the exponent $R - (1 - H(p+\varepsilon))$ is strictly negative, so $\text{Term II} \to 0$ as $n \to \infty$.
[guided]
Two ideas drive this step. First, by symmetry of the random-code construction, for $i \ne j$ the marginal distribution of $c_j$ given $(c_i, I = i, Y)$ is uniform on $\{0,1\}^n \setminus \{c_i\}$. Concretely: the joint distribution of $(c_1, \ldots, c_m)$ is invariant under permutations of coordinates $\ne i$, and given $c_i$ the remaining $m-1$ codewords form a uniform random injection of $\{1, \ldots, m\} \setminus \{i\}$ into $\{0,1\}^n \setminus \{c_i\}$. Hence each $c_j$ is individually uniform on $\{0,1\}^n \setminus \{c_i\}$ — the reasoning is identical to drawing cards one at a time from a shuffled deck: the second card is uniform on the $51$ non-first cards. Second, we need to estimate the probability that a uniform element of $\{0,1\}^n \setminus \{c_i\}$ lies in $B(Y,r)$; this probability is $\#(B(Y,r) \setminus \{c_i\})/(2^n - 1) \le V(n,r)/(2^n-1)$, and we absorb the $-1$ into a constant factor $\le 2$.
Why does the entropy bound $V(n,r) \le 2^{nH(r/n)}$ hold? It comes from noting $\binom{n}{k} \le 2^{nH(k/n)}$ for any $k$, which follows from the inequality
\begin{align*}
\binom{n}{k} p^k (1-p)^{n-k} \le \max_{x \in [0,1]} x^k (1-x)^{n-k} \cdot \text{sum constraints},
\end{align*}
or more directly from Stirling. Summing over $k \le r \le n/2$ and using that $H$ is maximised at the endpoint yields the stated bound. The rate condition $R < 1 - H(p+\varepsilon)$ is **exactly** what we need to make the resulting exponent $R + H(p+\varepsilon) - 1$ negative, so Term II decays exponentially.
[/guided]
[/step]
[step:Conclude and extract a deterministic code]
Combining Steps 3, 4, and 5,
\begin{align*}
\mathbb{P}(D(Y, C) \ne c_I) \le \text{Term I} + \text{Term II} \to 0 \qquad \text{as } n \to \infty.
\end{align*}
For each $n$, since $\mathbb{P}(D(Y,C) \ne c_I) = \mathbb{E}[e(C)]$, there exists at least one realisation $C_n \in \mathcal{C}_{n,m}$ with $e(C_n) \le \mathbb{E}[e(C)]$. Choose such a $C_n$. Then $|C_n| = m = \lfloor 2^{nR} \rfloor$ and $e(C_n) \to 0$. This is the required sequence of codes, completing the proof.
[/step]