[proofplan]
We express the misclassification risk $R(h) = \mathbb{P}(Y \neq h(X))$ as an iterated expectation using the tower property, then decompose the inner conditional expectation into contributions from each label. Because $h(X)$ is determined by $X$, minimising the outer expectation reduces to a pointwise minimisation of the conditional misclassification probability at each $x$. Comparing the two cases $h(x) = 1$ and $h(x) = -1$ shows that the optimal decision boundary occurs at $\eta(x) = 1/2$.
[/proofplan]
[step:Express the risk as an iterated expectation using the tower property]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be the underlying probability space and let $(X, Y)$ be a random pair with $X$ taking values in $\mathcal{X}$ and $Y \in \{-1, 1\}$. Define the regression function $\eta : \mathcal{X} \to [0,1]$ by $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$. For any measurable classifier $h : \mathcal{X} \to \{-1, 1\}$, the misclassification risk is
\begin{align*}
R(h) = \mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}}].
\end{align*}
By the tower property of conditional expectation, applied with the $\sigma$-algebra $\sigma(X)$ generated by $X$ (the indicator is bounded, hence integrable), we obtain
\begin{align*}
R(h) = \mathbb{E}\bigl[\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X]\bigr].
\end{align*}
[guided]
The risk $R(h) = \mathbb{P}(Y \neq h(X))$ is the expectation of the indicator $\mathbb{1}_{\{Y \neq h(X)\}}$. The tower property of conditional expectation states that for any integrable random variable $Z$ and any sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$,
\begin{align*}
\mathbb{E}[Z] = \mathbb{E}[\mathbb{E}[Z \mid \mathcal{G}]].
\end{align*}
We apply this with $Z = \mathbb{1}_{\{Y \neq h(X)\}}$ and $\mathcal{G} = \sigma(X)$, the $\sigma$-algebra generated by $X$. Since $\mathbb{1}_{\{Y \neq h(X)\}} \in \{0, 1\}$, it is bounded and hence integrable, so the tower property applies. This gives
\begin{align*}
R(h) = \mathbb{E}\bigl[\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X]\bigr].
\end{align*}
Why condition on $X$? Because $h(X)$ is a function of $X$ alone, conditioning on $X$ makes $h(X)$ a known constant inside the inner expectation. The only remaining randomness comes from $Y$, which is governed by the conditional distribution $\mathbb{P}(Y = \cdot \mid X)$. This separation is what allows us to optimise $h$ pointwise.
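As a small illustration of the iterated expectation, consider a hypothetical two-point example; the values below are assumed purely for illustration and are not part of the proof. Suppose $\mathcal{X} = \{a, b\}$ with $\mathbb{P}(X = a) = 0.4$, $\mathbb{P}(X = b) = 0.6$, $\eta(a) = 0.9$ and $\eta(b) = 0.2$, where $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$, and take the constant classifier $h \equiv 1$. The inner conditional expectation is evaluated at each point first and then averaged over the distribution of $X$:
% Illustrative values (assumed): P(X=a)=0.4, P(X=b)=0.6, eta(a)=0.9, eta(b)=0.2, h = 1 everywhere.
\begin{align*}
\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X = a] &= \mathbb{P}(Y = -1 \mid X = a) = 1 - 0.9 = 0.1, \\
\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X = b] &= \mathbb{P}(Y = -1 \mid X = b) = 1 - 0.2 = 0.8, \\
R(h) &= 0.4 \cdot 0.1 + 0.6 \cdot 0.8 = 0.52.
\end{align*}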
[/guided]
[/step]
[step:Decompose the inner conditional expectation into contributions from each label]
Conditioning on $X = x$ makes $h(X) = h(x)$ a constant. The event $\{Y \neq h(X)\}$ then depends only on $Y$, which takes values in $\{-1, 1\}$. We decompose by cases:
\begin{align*}
\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X = x] &= \mathbb{P}(Y \neq h(x) \mid X = x) \\
&= \mathbb{1}_{\{h(x) = -1\}} \cdot \mathbb{P}(Y = 1 \mid X = x) + \mathbb{1}_{\{h(x) = 1\}} \cdot \mathbb{P}(Y = -1 \mid X = x) \\
&= \mathbb{1}_{\{h(x) = -1\}} \cdot \eta(x) + \mathbb{1}_{\{h(x) = 1\}} \cdot (1 - \eta(x)).
\end{align*}
The second equality holds because $h(x) \in \{-1, 1\}$, so exactly one of the indicators $\mathbb{1}_{\{h(x) = -1\}}$ and $\mathbb{1}_{\{h(x) = 1\}}$ equals $1$, and the misclassification event $\{Y \neq h(x)\}$ selects the opposite class from the one predicted. The third equality uses $\mathbb{P}(Y = -1 \mid X = x) = 1 - \eta(x)$.
[guided]
Since $h(x)$ is either $-1$ or $1$, exactly one of the two indicators is active at each $x$. If $h(x) = -1$, then $Y \neq h(x)$ iff $Y = 1$, which has conditional probability $\eta(x)$. If $h(x) = 1$, then $Y \neq h(x)$ iff $Y = -1$, which has conditional probability $1 - \eta(x)$. This gives the decomposition
\begin{align*}
\mathbb{P}(Y \neq h(x) \mid X = x) = \mathbb{1}_{\{h(x) = -1\}} \cdot \eta(x) + \mathbb{1}_{\{h(x) = 1\}} \cdot (1 - \eta(x)).
\end{align*}
The key observation is that this quantity depends on $x$ only through $\eta(x)$ and the choice of $h(x)$. Since the outer expectation $\mathbb{E}[\cdot]$ integrates this over the marginal distribution of $X$, minimising $R(h)$ reduces to choosing $h(x)$ to minimise this expression at each $x$ separately. This pointwise reduction is possible because $h(X)$ is $\sigma(X)$-measurable, so it acts as a constant inside the conditional expectation, and because $h$ is otherwise unconstrained: the label chosen at one value of $x$ places no restriction on the label chosen at another.
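To see the decomposition at work, here is a brief numerical sketch. Suppose, as an assumed value for illustration only, that $\eta(x) = 0.7$ at some point $x$, so that $1 - \eta(x) = 0.3$. Substituting each label into the indicator formula gives
% Illustrative value (assumed): eta(x) = 0.7.
\begin{align*}
h(x) = -1: &\quad \mathbb{P}(Y \neq h(x) \mid X = x) = 1 \cdot 0.7 + 0 \cdot 0.3 = 0.7, \\
h(x) = 1: &\quad \mathbb{P}(Y \neq h(x) \mid X = x) = 0 \cdot 0.7 + 1 \cdot 0.3 = 0.3.
\end{align*}
In each case only the indicator matching the chosen label survives, and the surviving term is the conditional probability of the opposite label.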
[/guided]
[/step]
[step:Minimise pointwise by comparing the two label choices at each $x$]
At each $x \in \mathcal{X}$, the conditional misclassification probability takes one of two values:
\begin{align*}
\mathbb{P}(Y \neq h(x) \mid X = x) = \begin{cases} \eta(x) & \text{if } h(x) = -1, \\ 1 - \eta(x) & \text{if } h(x) = 1. \end{cases}
\end{align*}
To minimise $R(h) = \mathbb{E}[\mathbb{P}(Y \neq h(X) \mid X)]$, it suffices to minimise the integrand pointwise: expectation is monotone, so a classifier whose conditional error is no larger than that of any competitor at every $x$ also has no larger risk. At each $x$:
- If $\eta(x) > 1/2$, then $1 - \eta(x) < \eta(x)$, so setting $h(x) = 1$ yields the smaller conditional error $1 - \eta(x)$.
- If $\eta(x) < 1/2$, then $\eta(x) < 1 - \eta(x)$, so setting $h(x) = -1$ yields the smaller conditional error $\eta(x)$.
- If $\eta(x) = 1/2$, then both choices give conditional error $1/2$, so either label is optimal.
Therefore the Bayes classifier
\begin{align*}
h_0(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2, \\ -1 & \text{otherwise} \end{cases}
\end{align*}
minimises $\mathbb{P}(Y \neq h(x) \mid X = x)$ at every $x \in \mathcal{X}$. Since $\eta$ is measurable, the set $\{x : \eta(x) > 1/2\}$ is measurable, so $h_0$ is itself a measurable classifier. By monotonicity of expectation, pointwise minimisation of the integrand implies minimisation of its expectation, so $h_0$ minimises $R(h)$ over all measurable classifiers $h : \mathcal{X} \to \{-1, 1\}$. The tie-breaking rule at $\eta(x) = 1/2$ is arbitrary because both labels achieve the same conditional misclassification probability $1/2$.
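As a worked consequence, the conditional error of $h_0$ at $x$ is the smaller of the two case values, $\min\{\eta(x), 1 - \eta(x)\}$, so the minimal risk can be written in closed form. The numbers in the second line are a hypothetical two-point example, assumed only for illustration: $\mathcal{X} = \{a, b\}$ with $\mathbb{P}(X = a) = 0.4$, $\mathbb{P}(X = b) = 0.6$, $\eta(a) = 0.9$ and $\eta(b) = 0.2$.
% Illustrative values (assumed): P(X=a)=0.4, P(X=b)=0.6, eta(a)=0.9, eta(b)=0.2.
\begin{align*}
R(h_0) &= \mathbb{E}\bigl[\min\{\eta(X), 1 - \eta(X)\}\bigr], \\
R(h_0) &= 0.4 \cdot \min\{0.9, 0.1\} + 0.6 \cdot \min\{0.2, 0.8\} = 0.04 + 0.12 = 0.16.
\end{align*}
In this example $h_0(a) = 1$ and $h_0(b) = -1$, and the risk is at most $1/2$, as it must be since $\min\{\eta, 1 - \eta\} \leq 1/2$ everywhere.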
[guided]
We are choosing $h(x) \in \{-1, 1\}$ to minimise a quantity that depends only on $\eta(x)$ and $h(x)$. This is a discrete optimisation over two values, so we simply compare:
- Choosing $h(x) = 1$ incurs conditional error $1 - \eta(x)$.
- Choosing $h(x) = -1$ incurs conditional error $\eta(x)$.
Setting $h(x) = 1$ is better when $1 - \eta(x) < \eta(x)$, i.e., when $\eta(x) > 1/2$. Setting $h(x) = -1$ is better when $\eta(x) < 1/2$. When $\eta(x) = 1/2$, both choices give the same error of $1/2$.
Why does pointwise minimisation imply global minimisation of $R(h)$? Because $R(h) = \mathbb{E}[\mathbb{P}(Y \neq h(X) \mid X)]$ is the expectation (with respect to the marginal distribution of $X$) of a function of $X$ taking values in $[0, 1]$, and expectation is monotone: if $g_1(x) \leq g_2(x)$ for all $x$, then $\mathbb{E}[g_1(X)] \leq \mathbb{E}[g_2(X)]$. Since the Bayes classifier $h_0$ achieves the smallest possible value of $\mathbb{P}(Y \neq h(x) \mid X = x)$ at every $x$, no other classifier can have smaller risk.
The conclusion is that
\begin{align*}
h_0(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2, \\ -1 & \text{otherwise} \end{cases}
\end{align*}
minimises $R(h)$ over all measurable classifiers, and the choice at the boundary $\eta(x) = 1/2$ is immaterial.
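The comparison can also be made quantitative. Where $h(x) = h_0(x)$ the two conditional errors coincide; where they differ, the gap is $\eta(x) - (1 - \eta(x))$ if $\eta(x) > 1/2$ and $(1 - \eta(x)) - \eta(x)$ otherwise, which is $|2\eta(x) - 1|$ in both cases. Taking expectations gives the first line below. The second line checks it on a hypothetical two-point example, assumed only for illustration: $\mathbb{P}(X = a) = 0.4$, $\mathbb{P}(X = b) = 0.6$, $\eta(a) = 0.9$, $\eta(b) = 0.2$, and $h \equiv 1$, which disagrees with $h_0$ only at $b$.
% Illustrative values (assumed): P(X=a)=0.4, P(X=b)=0.6, eta(a)=0.9, eta(b)=0.2, h = 1 everywhere.
\begin{align*}
R(h) - R(h_0) &= \mathbb{E}\bigl[\,|2\eta(X) - 1| \cdot \mathbb{1}_{\{h(X) \neq h_0(X)\}}\bigr] \geq 0, \\
R(h) - R(h_0) &= 0.6 \cdot |2 \cdot 0.2 - 1| = 0.36.
\end{align*}
This agrees with the direct calculation $(0.4 \cdot 0.1 + 0.6 \cdot 0.8) - (0.4 \cdot 0.1 + 0.6 \cdot 0.2) = 0.52 - 0.16 = 0.36$. The factor $|2\eta(x) - 1|$ vanishes when $\eta(x) = 1/2$, which is another way to see that the choice at the boundary does not affect the risk.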
[/guided]
[/step]