[proofplan]
We first separate the prediction error into the test noise $\varepsilon_0$ and the estimation error $f_0(X_0)-\hat f_S(X_0)$. The mixed term vanishes because the training sample is independent of the test observation and the test noise has conditional mean zero given $X_0$. We then condition on $X_0$ and decompose the conditional mean squared estimation error into the squared conditional bias and the conditional variance. Taking expectations gives the irreducible noise, squared bias, and variance terms.
[/proofplan]
[step:Split the prediction error into noise and estimation terms]Define the real-valued random variable
\begin{align*}
Z:=\hat f_S(X_0).
\end{align*}
By the model relation $Y_0=f_0(X_0)+\varepsilon_0$, we have
\begin{align*}
Y_0-Z=\varepsilon_0+\bigl(f_0(X_0)-Z\bigr).
\end{align*}
Since all terms are square-integrable, expanding the square and taking expectations gives
\begin{align*}
\mathbb E[(Y_0-Z)^2]
&=
\mathbb E[\varepsilon_0^2]
+2\,\mathbb E\left[\varepsilon_0\bigl(f_0(X_0)-Z\bigr)\right]
+\mathbb E\left[\bigl(f_0(X_0)-Z\bigr)^2\right].
\end{align*}[/step]
[guided]Set
\begin{align*}
Z:=\hat f_S(X_0).
\end{align*}
This notation isolates the random prediction at the test point. The randomness in $Z$ comes from both the random test feature $X_0$ and the independent training sample $S$.
Using the identity $Y_0=f_0(X_0)+\varepsilon_0$, the prediction error is
\begin{align*}
Y_0-Z=\varepsilon_0+\bigl(f_0(X_0)-Z\bigr).
\end{align*}
The first term is the new test noise, while the second term is the error made in estimating the regression function value $f_0(X_0)$. Because $Y_0$, $f_0(X_0)$, and $Z$ are square-integrable, every term in the square expansion is integrable. Therefore
\begin{align*}
\mathbb E[(Y_0-Z)^2]
&=
\mathbb E[\varepsilon_0^2]
+2\,\mathbb E\left[\varepsilon_0\bigl(f_0(X_0)-Z\bigr)\right]
+\mathbb E\left[\bigl(f_0(X_0)-Z\bigr)^2\right].
\end{align*}
The rest of the proof identifies these three terms.[/guided]
[step:Show that the mixed noise and estimation term vanishes]Since $S$ is independent of $(X_0,Y_0)$ and $\varepsilon_0=Y_0-f_0(X_0)$ is measurable with respect to $\sigma(X_0,Y_0)$, independence gives
\begin{align*}
\mathbb E[\varepsilon_0\mid X_0,S]
=
\mathbb E[\varepsilon_0\mid X_0]
=
0
\end{align*}
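almost surely. The random variable $f_0(X_0)-Z$ is measurable with respect to $\sigma(X_0,S)$, and its product with $\varepsilon_0$ is integrable by the Cauchy--Schwarz inequality, so the tower property, after pulling this factor out of the inner conditional expectation, gives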
\begin{align*}
\mathbb E\left[\varepsilon_0\bigl(f_0(X_0)-Z\bigr)\right]
&=
\mathbb E\left[
\bigl(f_0(X_0)-Z\bigr)\mathbb E[\varepsilon_0\mid X_0,S]
\right] \\
&=0.
\end{align*}[/step]
[guided]We need to prove that the cross term contributes nothing. The factor $f_0(X_0)-Z$ is determined once $X_0$ and the training sample $S$ are known, so it is measurable with respect to $\sigma(X_0,S)$.
The noise $\varepsilon_0=Y_0-f_0(X_0)$ is measurable with respect to $\sigma(X_0,Y_0)$. Since $S$ is independent of $(X_0,Y_0)$, conditioning additionally on $S$ does not change the conditional mean of $\varepsilon_0$ once $X_0$ is known. Hence
\begin{align*}
\mathbb E[\varepsilon_0\mid X_0,S]
=
\mathbb E[\varepsilon_0\mid X_0]
=
0
\end{align*}
almost surely. Now use the defining property of conditional expectation with the $\sigma(X_0,S)$-measurable multiplier $f_0(X_0)-Z$:
\begin{align*}
\mathbb E\left[\varepsilon_0\bigl(f_0(X_0)-Z\bigr)\right]
&=
\mathbb E\left[
\bigl(f_0(X_0)-Z\bigr)\mathbb E[\varepsilon_0\mid X_0,S]
\right] \\
&=0.
\end{align*}
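In more detail, the product $\varepsilon_0\bigl(f_0(X_0)-Z\bigr)$ is integrable by the Cauchy--Schwarz inequality, since both factors are square-integrable. The first equality can therefore be read as the tower property followed by pulling the $\sigma(X_0,S)$-measurable factor out of the inner conditional expectation:
\begin{align*}
\mathbb E\left[\varepsilon_0\bigl(f_0(X_0)-Z\bigr)\right]
&=
\mathbb E\left[
\mathbb E\left[\varepsilon_0\bigl(f_0(X_0)-Z\bigr)\mid X_0,S\right]
\right] \\
&=
\mathbb E\left[
\bigl(f_0(X_0)-Z\bigr)\mathbb E[\varepsilon_0\mid X_0,S]
\right].
\end{align*}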
Thus the test noise is orthogonal to the estimation error in $L^2$: their product has mean zero.[/guided]
[step:Identify the irreducible noise term]
Because $\mathbb E[\varepsilon_0\mid X_0]=0$, the conditional variance assumption gives
\begin{align*}
\mathbb E[\varepsilon_0^2\mid X_0]
=
\operatorname{Var}(\varepsilon_0\mid X_0)
=
\sigma^2
\end{align*}
almost surely. Taking expectations yields
\begin{align*}
\mathbb E[\varepsilon_0^2]=\sigma^2.
\end{align*}
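Spelled out, this combines the formula for the conditional variance with the tower property, where $\varepsilon_0^2$ is integrable because $\varepsilon_0$ is square-integrable:
\begin{align*}
\mathbb E[\varepsilon_0^2]
&=
\mathbb E\left[\mathbb E[\varepsilon_0^2\mid X_0]\right] \\
&=
\mathbb E\left[
\operatorname{Var}(\varepsilon_0\mid X_0)
+\bigl(\mathbb E[\varepsilon_0\mid X_0]\bigr)^2
\right] \\
&=
\mathbb E[\sigma^2+0]
=
\sigma^2.
\end{align*}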
[/step]
[step:Decompose the conditional estimation error into bias and variance]Define the conditional mean prediction
\begin{align*}
m:\Omega&\to\mathbb R,\\
\omega&\mapsto \mathbb E[Z\mid X_0](\omega).
\end{align*}
Then $m$ is $\sigma(X_0)$-measurable. Write
\begin{align*}
f_0(X_0)-Z
=
\bigl(f_0(X_0)-m\bigr)+\bigl(m-Z\bigr).
\end{align*}
Expanding the square, taking conditional expectations given $X_0$, and pulling out the $\sigma(X_0)$-measurable factor $f_0(X_0)-m$ gives
\begin{align*}
\mathbb E\left[\bigl(f_0(X_0)-Z\bigr)^2\mid X_0\right]
&=
\bigl(f_0(X_0)-m\bigr)^2
+2\bigl(f_0(X_0)-m\bigr)\mathbb E[m-Z\mid X_0] \\
&\quad
+\mathbb E\left[(m-Z)^2\mid X_0\right].
\end{align*}
Since $m=\mathbb E[Z\mid X_0]$, we have
\begin{align*}
\mathbb E[m-Z\mid X_0]=m-\mathbb E[Z\mid X_0]=0.
\end{align*}
Also, by the definition of conditional variance,
\begin{align*}
\mathbb E\left[(m-Z)^2\mid X_0\right]
=
\operatorname{Var}(Z\mid X_0).
\end{align*}
Therefore
\begin{align*}
\mathbb E\left[\bigl(f_0(X_0)-Z\bigr)^2\mid X_0\right]
=
\bigl(f_0(X_0)-\mathbb E[Z\mid X_0]\bigr)^2
+
\operatorname{Var}(Z\mid X_0).
\end{align*}[/step]
[guided]The remaining term is the squared error of the fitted rule around the true regression function value. We split it into a systematic part and a centered random part.
Define
\begin{align*}
m:\Omega&\to\mathbb R,\\
\omega&\mapsto \mathbb E[Z\mid X_0](\omega).
\end{align*}
The random variable $m$ is the average prediction at the observed test feature $X_0$, where the averaging is over the training randomness conditional on $X_0$. It is $\sigma(X_0)$-measurable by definition of conditional expectation.
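For a pointwise reading of $m$, suppose in addition that $(s,x)\mapsto\hat f_s(x)$ is jointly measurable. This is an assumption not stated above, and the function $\bar f$ below is notation introduced only for this remark. Since $S$ is independent of $X_0$, one can then integrate out the training sample with the test feature held fixed. For almost every $x$ under the law of $X_0$,
\begin{align*}
\bar f(x):=\mathbb E\bigl[\hat f_S(x)\bigr],
\qquad
m=\mathbb E[Z\mid X_0]=\bar f(X_0)
\quad\text{almost surely}.
\end{align*}
Under this assumption, $f_0(X_0)-m$ is the pointwise bias $f_0(x)-\bar f(x)$ evaluated at $x=X_0$.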
Now decompose
\begin{align*}
f_0(X_0)-Z
=
\bigl(f_0(X_0)-m\bigr)+\bigl(m-Z\bigr).
\end{align*}
The first term is the conditional bias at $X_0$. The second term is centered conditional on $X_0$, because
\begin{align*}
\mathbb E[m-Z\mid X_0]
=
m-\mathbb E[Z\mid X_0]
=
0.
\end{align*}
Expanding the square under the conditional expectation, and using that $f_0(X_0)-m$ is $\sigma(X_0)$-measurable, gives
\begin{align*}
\mathbb E\left[\bigl(f_0(X_0)-Z\bigr)^2\mid X_0\right]
&=
\bigl(f_0(X_0)-m\bigr)^2
+2\bigl(f_0(X_0)-m\bigr)\mathbb E[m-Z\mid X_0] \\
&\quad
+\mathbb E\left[(m-Z)^2\mid X_0\right].
\end{align*}
The middle term is zero by the centering just proved. The final term is exactly the conditional variance of $Z$ given $X_0$, since $m=\mathbb E[Z\mid X_0]$:
\begin{align*}
\mathbb E\left[(m-Z)^2\mid X_0\right]
=
\operatorname{Var}(Z\mid X_0).
\end{align*}
Thus
\begin{align*}
\mathbb E\left[\bigl(f_0(X_0)-Z\bigr)^2\mid X_0\right]
=
\bigl(f_0(X_0)-\mathbb E[Z\mid X_0]\bigr)^2
+
\operatorname{Var}(Z\mid X_0).
\end{align*}[/guided]
[step:Take expectations to obtain the decomposition]
Taking expectations in the conditional identity from the previous step gives
\begin{align*}
\mathbb E\left[\bigl(f_0(X_0)-Z\bigr)^2\right]
=
\mathbb E\left[\bigl(f_0(X_0)-\mathbb E[Z\mid X_0]\bigr)^2\right]
+
\mathbb E\left[\operatorname{Var}(Z\mid X_0)\right].
\end{align*}
Combining this identity with the expansion of $\mathbb E[(Y_0-Z)^2]$, the vanishing mixed term, and $\mathbb E[\varepsilon_0^2]=\sigma^2$, we obtain
\begin{align*}
\mathbb E\left[(Y_0-\hat f_S(X_0))^2\right]
&=
\sigma^2
+\mathbb E\left[\left(f_0(X_0)-\mathbb E[\hat f_S(X_0)\mid X_0]\right)^2\right] \\
&\quad
+\mathbb E\left[\operatorname{Var}(\hat f_S(X_0)\mid X_0)\right].
\end{align*}
This is the claimed bias–variance decomposition.
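As a sanity check, consider a hypothetical special case that is not part of the setting above. Suppose $f_0\equiv\mu$ is constant, the training sample is $S=(Y_1,\dots,Y_n)$ with $Y_i=\mu+\varepsilon_i$, the noises $\varepsilon_1,\dots,\varepsilon_n$ are uncorrelated with mean zero and variance $\sigma^2$, and $\hat f_S$ is the sample mean. The symbols $\mu$, $n$, $Y_i$, and $\varepsilon_i$ are introduced only for this illustration. Using the independence of $S$ and $X_0$ for the conditional statements,
\begin{align*}
Z&=\frac1n\sum_{i=1}^n Y_i=\mu+\frac1n\sum_{i=1}^n\varepsilon_i,\\
\mathbb E[Z\mid X_0]&=\mu,\\
\operatorname{Var}(Z\mid X_0)&=\frac{\sigma^2}{n}.
\end{align*}
The squared bias term is therefore zero, and the decomposition predicts
\begin{align*}
\mathbb E\left[(Y_0-Z)^2\right]
=
\sigma^2+0+\frac{\sigma^2}{n}.
\end{align*}
This agrees with the direct computation: $Y_0-Z=\varepsilon_0-\frac1n\sum_{i=1}^n\varepsilon_i$, and $\varepsilon_0$ is uncorrelated with the training noises, so the two variances add.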
[/step]