[proofplan]
We decompose the squared prediction error into an irreducible conditional-noise part and an approximation part involving only the conditional mean. The key point is that the residual $Y-m$ is orthogonal in $L^2$ to every $\sigma(X)$-measurable square-integrable random variable, hence in particular to every affine function of $X$ and to $m$ itself. This makes minimizing $\mathbb E[(Y-Z)^2]$ over affine predictors exactly the same problem as projecting $m$ onto $\operatorname{span}\{1,X\}$. The final equivalence follows from the elementary fact that a point is its own projection onto a subspace exactly when it already belongs to that subspace.
[/proofplan]
[step:Define the affine subspace and the conditional residual]
Let
\begin{align*}
\mathcal A := \operatorname{span}\{1,X\}
= \{a+bX : a,b \in \mathbb R\} \subset L^2(\Omega,\mathcal F,\mathbb P).
\end{align*}
Since $X \in L^2(\Omega,\mathcal F,\mathbb P)$ and $1 \in L^2(\Omega,\mathcal F,\mathbb P)$, every $Z \in \mathcal A$ belongs to $L^2(\Omega,\mathcal F,\mathbb P)$. Also $m \in L^2(\Omega,\mathcal F,\mathbb P)$, because conditional expectation is an $L^2$ contraction:
\begin{align*}
\mathbb E[m^2] \leq \mathbb E[Y^2] < \infty.
\end{align*}
Define the residual random variable
\begin{align*}
r:\Omega &\to \mathbb R \\
\omega &\mapsto Y(\omega)-m(\omega).
\end{align*}
Then $r \in L^2(\Omega,\mathcal F,\mathbb P)$.
[/step]
[step:Show the conditional residual is orthogonal to every square-integrable function of $X$]Let $W \in L^2(\Omega,\mathcal F,\mathbb P)$ be $\sigma(X)$-measurable. We prove
\begin{align*}
\mathbb E[rW]=0.
\end{align*}
For each $n \in \mathbb N$, define the bounded $\sigma(X)$-measurable truncation
\begin{align*}
W_n:\Omega &\to \mathbb R \\
\omega &\mapsto \max\{-n,\min\{W(\omega),n\}\}.
\end{align*}
By the defining property of conditional expectation, since $W_n$ is bounded and $\sigma(X)$-measurable,
\begin{align*}
\mathbb E[YW_n] = \mathbb E[mW_n],
\end{align*}
and hence
\begin{align*}
\mathbb E[rW_n]=0.
\end{align*}
Moreover $W_n \to W$ pointwise and $|W_n-W| \leq 2|W|$. Since $r,W \in L^2(\Omega,\mathcal F,\mathbb P)$, the [Cauchy-Schwarz inequality](/theorems/432) gives $rW \in L^1(\Omega,\mathcal F,\mathbb P)$, and
\begin{align*}
|r(W_n-W)| \leq 2|r||W| \in L^1(\Omega,\mathcal F,\mathbb P).
\end{align*}
The [dominated convergence theorem](/theorems/4) gives
\begin{align*}
\lim_{n\to\infty}\mathbb E[rW_n] = \mathbb E[rW].
\end{align*}
Therefore $\mathbb E[rW]=0$.[/step]
[guided]The defining identity of conditional expectation is stated for bounded $\sigma(X)$-measurable test random variables. Since we need to test against square-integrable random variables such as $m-Z$, which need not be bounded, we first reduce to the bounded case by truncation.
Let $W \in L^2(\Omega,\mathcal F,\mathbb P)$ be $\sigma(X)$-measurable. For each $n \in \mathbb N$, define
\begin{align*}
W_n:\Omega &\to \mathbb R \\
\omega &\mapsto \max\{-n,\min\{W(\omega),n\}\}.
\end{align*}
Each $W_n$ is bounded and $\sigma(X)$-measurable because it is obtained from $W$ by composing with the continuous truncation map $t \mapsto \max\{-n,\min\{t,n\}\}$. By the defining property of $m=\mathbb E[Y\mid \sigma(X)]$,
\begin{align*}
\mathbb E[YW_n]=\mathbb E[mW_n].
\end{align*}
Subtracting the right-hand side from the left-hand side gives
\begin{align*}
\mathbb E[(Y-m)W_n]=\mathbb E[rW_n]=0.
\end{align*}
It remains to pass from $W_n$ to $W$. We have $W_n \to W$ pointwise and $|W_n-W|\leq 2|W|$. Since $r,W\in L^2(\Omega,\mathcal F,\mathbb P)$, the Cauchy-Schwarz inequality gives
\begin{align*}
\mathbb E[|rW|] \leq \mathbb E[r^2]^{1/2}\mathbb E[W^2]^{1/2}<\infty.
\end{align*}
Thus $2|r||W|$ is an integrable dominating random variable, and
\begin{align*}
|r(W_n-W)|\leq 2|r||W|.
\end{align*}
By dominated convergence,
\begin{align*}
\mathbb E[rW]=\lim_{n\to\infty}\mathbb E[rW_n]=0.
\end{align*}
This proves that the conditional residual $r=Y-m$ is orthogonal in $L^2$ to every square-integrable $\sigma(X)$-measurable random variable.[/guided]
[step:Decompose the risk for every affine predictor]Fix $Z \in \mathcal A$. Since $Z$ is a real linear combination of $1$ and $X$, it is $\sigma(X)$-measurable. Also $m$ is $\sigma(X)$-measurable by definition of conditional expectation, so $m-Z$ is $\sigma(X)$-measurable and belongs to $L^2(\Omega,\mathcal F,\mathbb P)$. Applying the orthogonality just proved with $W=m-Z$ gives
\begin{align*}
\mathbb E[r(m-Z)] = 0.
\end{align*}
Using $Y-Z=r+(m-Z)$, expanding the square, and using this orthogonality,
\begin{align*}
\mathbb E[(Y-Z)^2]
&= \mathbb E[(r+(m-Z))^2] \\
&= \mathbb E[r^2] + 2\mathbb E[r(m-Z)] + \mathbb E[(m-Z)^2] \\
&= \mathbb E[r^2] + \mathbb E[(m-Z)^2].
\end{align*}
The first term $\mathbb E[r^2]$ is independent of $Z$.[/step]
[guided]We now compare the least-squares risk for predicting $Y$ with the least-squares risk for approximating $m$. Fix an arbitrary affine predictor $Z \in \mathcal A$. By definition of $\mathcal A$, there exist $a,b\in\mathbb R$ such that $Z=a+bX$. Therefore $Z$ is $\sigma(X)$-measurable. The conditional mean $m$ is also $\sigma(X)$-measurable, so the difference $m-Z$ is $\sigma(X)$-measurable. Since $m,Z\in L^2(\Omega,\mathcal F,\mathbb P)$, also $m-Z\in L^2(\Omega,\mathcal F,\mathbb P)$.
The previous step applies with $W=m-Z$, giving
\begin{align*}
\mathbb E[r(m-Z)] = 0.
\end{align*}
Now use the identity
\begin{align*}
Y-Z = (Y-m)+(m-Z)=r+(m-Z).
\end{align*}
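Expanding the square, which is legitimate because $r$ and $m-Z$ both lie in $L^2(\Omega,\mathcal F,\mathbb P)$ and so every term below is integrable, gives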
\begin{align*}
\mathbb E[(Y-Z)^2]
&= \mathbb E[(r+(m-Z))^2] \\
&= \mathbb E[r^2] + 2\mathbb E[r(m-Z)] + \mathbb E[(m-Z)^2] \\
&= \mathbb E[r^2] + \mathbb E[(m-Z)^2].
\end{align*}
The term $\mathbb E[r^2]$ does not depend on the affine predictor $Z$. Thus the only part of the risk that can be changed by choosing $Z$ is the squared distance from $Z$ to the conditional mean $m$.[/guided]
[step:Identify affine regression with projection of the conditional mean]
Let $\widehat Z \in \mathcal A$. Both sides of the risk decomposition contain the same finite constant $\mathbb E[r^2]$, so for every $Z \in \mathcal A$ the inequality
\begin{align*}
\mathbb E[(Y-\widehat Z)^2] \leq \mathbb E[(Y-Z)^2]
\end{align*}
holds if and only if
\begin{align*}
\mathbb E[(m-\widehat Z)^2] \leq \mathbb E[(m-Z)^2].
\end{align*}
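Therefore the minimizers of $Z \mapsto \mathbb E[(Y-Z)^2]$ over $\mathcal A$ are exactly the minimizers of $Z \mapsto \mathbb E[(m-Z)^2]$ over $\mathcal A$. Since $\mathcal A$ is a finite-dimensional, hence closed, subspace of $L^2(\Omega,\mathcal F,\mathbb P)$, such a minimizer exists and is unique up to almost sure equality. This is precisely the assertion that the population affine regression predictor is the $L^2$-projection of $m$ onto $\mathcal A$.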
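As an illustration, the projection can be written out explicitly under the additional assumption $\operatorname{Var}(X)>0$, which is used only in this sketch. The coefficient names $a^\ast,b^\ast$ are introduced here for illustration and do not appear elsewhere. Writing the projection as $a^\ast+b^\ast X$, the normal equations $\mathbb E[m-a^\ast-b^\ast X]=0$ and $\mathbb E[(m-a^\ast-b^\ast X)X]=0$ give
\begin{align*}
b^\ast &= \frac{\operatorname{Cov}(X,m)}{\operatorname{Var}(X)} = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}, \\
a^\ast &= \mathbb E[m]-b^\ast\,\mathbb E[X] = \mathbb E[Y]-b^\ast\,\mathbb E[X].
\end{align*}
The second equality in each line uses $\mathbb E[r]=0$ and $\mathbb E[rX]=0$, which are the orthogonality relations for $W=1$ and $W=X$. The coefficients computed from $m$ and from $Y$ therefore coincide, in agreement with the equivalence of the two minimization problems.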
[/step]
[step:Characterize when the affine predictor equals the conditional mean]
If $m \in \mathcal A$, then choosing $Z=m$ gives
\begin{align*}
\inf_{Z\in\mathcal A}\mathbb E[(m-Z)^2]=0,
\end{align*}
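so every $L^2$-projection $\widehat Z$ of $m$ onto $\mathcal A$ satisfies $\mathbb E[(m-\widehat Z)^2]=0$ and therefore equals $m$ almost surely. By the previous step, the same holds for every affine regression predictor.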
Conversely, if an affine regression predictor $\widehat Z$ equals $m$ almost surely, then $\widehat Z\in\mathcal A$ by definition of affine regression, and hence $m\in\mathcal A$ as an element of $L^2(\Omega,\mathcal F,\mathbb P)$. Thus the affine regression predictor agrees with the conditional mean almost surely if and only if $m\in\operatorname{span}\{1,X\}$.
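A concrete case where the two predictors differ may help. The distributional assumptions here are hypothetical and chosen only for illustration: suppose $X$ is standard normal and $Y=X^2+\varepsilon$, where $\varepsilon$ is independent of $X$ with $\mathbb E[\varepsilon]=0$ and $\mathbb E[\varepsilon^2]<\infty$. Then $m=X^2$, and using $\mathbb E[X]=\mathbb E[X^3]=0$, $\mathbb E[X^2]=1$, $\mathbb E[X^4]=3$, for all $a,b\in\mathbb R$,
\begin{align*}
\mathbb E[(m-a-bX)^2]
&= \mathbb E[X^4] - 2a\,\mathbb E[X^2] - 2b\,\mathbb E[X^3] + a^2 + 2ab\,\mathbb E[X] + b^2\,\mathbb E[X^2] \\
&= 3 - 2a + a^2 + b^2 \\
&= 2 + (a-1)^2 + b^2.
\end{align*}
The minimum is attained at $a=1$, $b=0$, so in this example the affine regression predictor is the constant $1$, and $\mathbb E[(m-1)^2]=2>0$. Hence the predictor does not equal $m$ almost surely, consistent with $X^2\notin\operatorname{span}\{1,X\}$.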
[/step]