[proofplan]
Both equivalences reduce to monotonicity statements on simple scalar functions. For (a), Bayes' theorem rewrites the posterior probability $\mathbb{P}(c \text{ sent} \mid x \text{ received})$ as a product of the prior $\mathbb{P}(c \text{ sent})$, the likelihood $\mathbb{P}(x \text{ received} \mid c \text{ sent})$, and a factor independent of $c$; when the prior is uniform the factor multiplying the likelihood is the same for every codeword, so the maximisers coincide. For (b), on a [binary symmetric channel](/page/Binary%20Symmetric%20Channel) the likelihood of receiving $x$ given $c$ was sent depends on $(x, c)$ only through the Hamming distance $r = d(x, c)$, and factors as $(1-p)^n (p/(1-p))^r$. Because $p < 1/2$ forces $p/(1-p) < 1$, the likelihood is strictly decreasing in $r$, so maximising likelihood is the same as minimising Hamming distance.
[/proofplan]
[step:Set up the three decoding rules as maximisation problems]
Let $n$ denote the blocklength and $C \subseteq \mathbb{F}_2^n$ the code. Denote the three decoding rules as functions
\begin{align*}
D_{\text{io}}, \; D_{\text{ml}}, \; D_{\text{md}} : \mathbb{F}_2^n &\to C
\end{align*}
defined by
\begin{align*}
D_{\text{io}}(x) &\in \arg\max_{c \in C} \mathbb{P}(c \text{ sent} \mid x \text{ received}), \\
D_{\text{ml}}(x) &\in \arg\max_{c \in C} \mathbb{P}(x \text{ received} \mid c \text{ sent}), \\
D_{\text{md}}(x) &\in \arg\min_{c \in C} d(x, c),
\end{align*}
where ties are broken by a fixed rule (the claim is that the argmax/argmin sets themselves coincide). To prove equivalence of two rules it suffices to show that the two objective functions attain their extrema on the same subset of $C$ for every received word $x \in \mathbb{F}_2^n$.
[/step]
[step:Apply Bayes' theorem to the posterior under a uniform prior]
Fix a received word $x \in \mathbb{F}_2^n$ with $\mathbb{P}(x \text{ received}) > 0$ (if the marginal is zero, $x$ cannot occur and the rule is vacuous). By [Bayes' theorem](/theorems/???),
\begin{align*}
\mathbb{P}(c \text{ sent} \mid x \text{ received}) = \frac{\mathbb{P}(c \text{ sent}) \cdot \mathbb{P}(x \text{ received} \mid c \text{ sent})}{\mathbb{P}(x \text{ received})}
\end{align*}
for every $c \in C$. Under the hypothesis of (a), the prior is uniform: $\mathbb{P}(c \text{ sent}) = 1/|C|$ for every $c \in C$. Substituting,
\begin{align*}
\mathbb{P}(c \text{ sent} \mid x \text{ received}) = \frac{1}{|C| \cdot \mathbb{P}(x \text{ received})} \cdot \mathbb{P}(x \text{ received} \mid c \text{ sent}).
\end{align*}
The prefactor $\frac{1}{|C| \cdot \mathbb{P}(x \text{ received})}$ depends on $x$ but not on $c$, and is strictly positive. Hence the map $c \mapsto \mathbb{P}(c \text{ sent} \mid x \text{ received})$ is a strictly positive constant multiple of $c \mapsto \mathbb{P}(x \text{ received} \mid c \text{ sent})$, so the two functions on $C$ have identical argmax sets:
\begin{align*}
\arg\max_{c \in C} \mathbb{P}(c \text{ sent} \mid x \text{ received}) = \arg\max_{c \in C} \mathbb{P}(x \text{ received} \mid c \text{ sent}).
\end{align*}
This establishes $D_{\text{io}}(x) = D_{\text{ml}}(x)$ for every $x$, proving (a).
[guided]
We want to show that maximising the posterior over the codebook is the same as maximising the likelihood. The natural tool is Bayes' theorem, which relates posterior and likelihood via the prior and the marginal:
\begin{align*}
\mathbb{P}(c \text{ sent} \mid x \text{ received}) = \frac{\mathbb{P}(c \text{ sent}) \cdot \mathbb{P}(x \text{ received} \mid c \text{ sent})}{\mathbb{P}(x \text{ received})}.
\end{align*}
Why might maximising these two functions give different answers in general? The prior $\mathbb{P}(c \text{ sent})$ depends on $c$, so the posterior weights each codeword differently from the likelihood. A codeword with small likelihood but very large prior could dominate.
But our hypothesis is precisely that the prior is uniform: $\mathbb{P}(c \text{ sent}) = 1/|C|$ for every $c \in C$. So the prior contributes the same factor to every posterior, and Bayes gives
\begin{align*}
\mathbb{P}(c \text{ sent} \mid x \text{ received}) = \underbrace{\frac{1}{|C| \cdot \mathbb{P}(x \text{ received})}}_{\text{independent of } c} \cdot \mathbb{P}(x \text{ received} \mid c \text{ sent}).
\end{align*}
The marginal $\mathbb{P}(x \text{ received})$ is non-zero (otherwise $x$ could never be received and we would not be decoding it) and also independent of $c$, so the prefactor is a strictly positive constant as $c$ varies over $C$. Multiplying a function by a strictly positive constant does not change its argmax:
\begin{align*}
\arg\max_{c \in C} \mathbb{P}(c \text{ sent} \mid x \text{ received}) = \arg\max_{c \in C} \mathbb{P}(x \text{ received} \mid c \text{ sent}).
\end{align*}
Consequently the ideal observer and maximum likelihood rules, which pick codewords from these two argmax sets respectively, produce the same decoded codewords. This is exactly the statement of (a).
[/guided]
[/step]
[step:Compute the BSC likelihood as a monotone function of Hamming distance]
Let $x, c \in \mathbb{F}_2^n$ and write $r = d(x, c) \in \{0, 1, \dots, n\}$, i.e., $r$ coordinates of $x$ differ from $c$. On the [binary symmetric channel](/page/Binary%20Symmetric%20Channel) with error probability $p \in [0, 1]$, each coordinate is flipped independently with probability $p$ and kept with probability $1-p$. Hence the likelihood of receiving $x$ given that $c$ was sent is
\begin{align*}
\mathbb{P}(x \text{ received} \mid c \text{ sent}) = p^r (1-p)^{n-r}.
\end{align*}
Assuming $0 < p < 1$ (the degenerate cases $p \in \{0, 1\}$ being noise-free or deterministic inversion, for which the result is immediate), we factor
\begin{align*}
p^r (1-p)^{n-r} = (1-p)^n \left( \frac{p}{1-p} \right)^r.
\end{align*}
[/step]
[step:Use $p < 1/2$ to convert the likelihood into a decreasing function of distance]
The hypothesis $p < 1/2$ is equivalent to $p < 1 - p$, hence $p/(1-p) < 1$. For a base $\rho \in (0, 1)$, the map
\begin{align*}
\varphi : \{0, 1, \dots, n\} &\to (0, \infty) \\
r &\mapsto \rho^r
\end{align*}
is strictly decreasing in $r$. With $\rho = p/(1-p)$, the factor $(1-p)^n > 0$ is independent of $(x, c)$, so
\begin{align*}
\mathbb{P}(x \text{ received} \mid c \text{ sent}) = (1-p)^n \cdot \varphi(d(x, c))
\end{align*}
is a strictly positive constant times a strictly decreasing function of $d(x, c)$. Therefore for any $c_1, c_2 \in C$,
\begin{align*}
\mathbb{P}(x \text{ received} \mid c_1 \text{ sent}) > \mathbb{P}(x \text{ received} \mid c_2 \text{ sent}) \iff d(x, c_1) < d(x, c_2).
\end{align*}
Maximising the left-hand expression over $c \in C$ is therefore equivalent to minimising $d(x, c)$ over $c \in C$. Equivalently, the argmax of the likelihood over $C$ equals the argmin of Hamming distance over $C$:
\begin{align*}
\arg\max_{c \in C} \mathbb{P}(x \text{ received} \mid c \text{ sent}) = \arg\min_{c \in C} d(x, c).
\end{align*}
This establishes $D_{\text{ml}}(x) = D_{\text{md}}(x)$ for every $x$, proving (b).
[guided]
We want to show that, under the condition $p < 1/2$, maximising the likelihood over the code is equivalent to minimising Hamming distance. The key observation is that the BSC likelihood depends on the pair $(x, c)$ only through the single scalar $r = d(x, c)$. Why? Because the channel is memoryless and symmetric: each coordinate independently experiences a flip with probability $p$ and is kept with probability $1-p$. The probability of the observed pattern of $r$ flips and $n - r$ non-flips is therefore $p^r (1-p)^{n-r}$.
To see the monotonicity cleanly, factor out the constant:
\begin{align*}
\mathbb{P}(x \text{ received} \mid c \text{ sent}) = p^r (1-p)^{n-r} = (1-p)^n \left( \frac{p}{1-p} \right)^r.
\end{align*}
Now the likelihood is a strictly positive constant $(1-p)^n$ times $(p/(1-p))^r$. Whether $(p/(1-p))^r$ increases or decreases with $r$ depends on whether the base $p/(1-p)$ is greater or less than $1$. The hypothesis $p < 1/2$ is exactly the hypothesis $p < 1-p$, which is exactly $p/(1-p) < 1$. So the base lies strictly between $0$ and $1$, and $r \mapsto (p/(1-p))^r$ is a strictly decreasing function.
What would have gone wrong without $p < 1/2$? If $p > 1/2$, then $p/(1-p) > 1$ and the likelihood would be strictly *increasing* in Hamming distance — maximum likelihood would pick the codeword *farthest* from $x$, the opposite of minimum distance decoding. (This is the intuitive statement that a noisier-than-fair coin inverts the inference.) If $p = 1/2$, the likelihood is constant in $r$ and the BSC conveys no information.
Given $p \in (0, 1/2)$, the equivalence
\begin{align*}
\mathbb{P}(x \text{ received} \mid c_1 \text{ sent}) > \mathbb{P}(x \text{ received} \mid c_2 \text{ sent}) \iff d(x, c_1) < d(x, c_2)
\end{align*}
holds because multiplying a strictly decreasing function by the positive constant $(1-p)^n$ preserves strict monotonicity. Taking this pointwise inequality over all pairs $c_1, c_2 \in C$ shows that the argmax of the likelihood coincides with the argmin of Hamming distance:
\begin{align*}
\arg\max_{c \in C} \mathbb{P}(x \text{ received} \mid c \text{ sent}) = \arg\min_{c \in C} d(x, c).
\end{align*}
This is the claim (b).
[/guided]
[/step]
[step:Combine the two reductions to conclude]
Parts (a) and (b) have each been established. Combining: if all codewords are equally likely to be sent *and* $p < 1/2$, then the ideal observer rule, the maximum likelihood rule, and the minimum distance rule all agree on every received word. This completes the proof.
[/step]