[proofplan]
We view prediction by $g(X)$ as approximation of $Y$ by a square-integrable $\sigma(X)$-measurable random variable. The conditional expectation $Z=\mathbb E[Y\mid\sigma(X)]$ has the defining orthogonality property against bounded $\sigma(X)$-measurable test variables, and we extend this orthogonality to every square-integrable competitor by truncation. Expanding the square then gives a Pythagorean identity, from which the minimization and the minimum risk follow.
[/proofplan]
[step:Place the conditional expectation and each competitor in the same $L^2$ space]
Let $\mathcal H:=\sigma(X)$ denote the sub-$\sigma$-algebra of $\mathcal F$ generated by the random vector $X$. Since $Y \in L^2(\Omega,\mathcal F,\mathbb P)$ and $\mathbb P(\Omega)=1$, we also have $Y \in L^1(\Omega,\mathcal F,\mathbb P)$. Let
\begin{align*}
Z:\Omega &\to \mathbb R
\end{align*}
be a version of $\mathbb E[Y\mid \mathcal H]$.
By [Jensen's inequality](/theorems/9) for conditional expectation, applied to the convex function $t\mapsto t^2$, we have
\begin{align*}
Z^2 \leq \mathbb E[Y^2\mid \mathcal H] \quad \mathbb P\text{-a.s.}
\end{align*}
Taking expectations gives
\begin{align*}
\mathbb E[Z^2] \leq \mathbb E[Y^2] < \infty.
\end{align*}
Thus $Z \in L^2(\Omega,\mathcal H,\mathbb P)$.
For any $g \in \mathcal G_X$, define the competitor
\begin{align*}
H_g:\Omega &\to \mathbb R \\
\omega &\mapsto g(X(\omega)).
\end{align*}
By the definition of $\mathcal G_X$, $H_g \in L^2(\Omega,\mathcal F,\mathbb P)$. Since $g$ is $\mathcal B(\mathbb R^p)$-measurable and $X$ generates $\mathcal H$, the composition $H_g=g\circ X$ is $\mathcal H$-measurable. Hence $H_g \in L^2(\Omega,\mathcal H,\mathbb P)$.
[guided]
Let $\mathcal H:=\sigma(X)$ be the information contained in the random vector $X$. A predictor of the form $g(X)$ only uses this information, so the natural comparison class is the space of square-integrable $\mathcal H$-measurable random variables.
First define
\begin{align*}
Z:\Omega &\to \mathbb R
\end{align*}
to be a version of $\mathbb E[Y\mid\mathcal H]$. Since $Y\in L^2(\Omega,\mathcal F,\mathbb P)$ and $\mathbb P$ is a probability measure, $Y$ is also integrable, so this conditional expectation is defined. We also need $Z$ itself to be square-integrable, because the risk contains $(Y-Z)^2$. [Jensen's inequality](/theorems/1977) for conditional expectation, applied to the convex function $t\mapsto t^2$, gives
\begin{align*}
Z^2 \leq \mathbb E[Y^2\mid\mathcal H] \quad \mathbb P\text{-a.s.}
\end{align*}
Taking expectations on both sides yields
\begin{align*}
\mathbb E[Z^2] \leq \mathbb E[\mathbb E[Y^2\mid\mathcal H]] = \mathbb E[Y^2] < \infty.
\end{align*}
Thus $Z\in L^2(\Omega,\mathcal H,\mathbb P)$.
Now fix $g\in\mathcal G_X$ and define
\begin{align*}
H_g:\Omega &\to \mathbb R \\
\omega &\mapsto g(X(\omega)).
\end{align*}
The definition of $\mathcal G_X$ gives $H_g\in L^2(\Omega,\mathcal F,\mathbb P)$. Moreover, because $g$ is Borel measurable and $X$ is the map generating $\mathcal H=\sigma(X)$, the composition $g\circ X$ is $\mathcal H$-measurable. Therefore $H_g$ lies in the same [Hilbert space](/page/Hilbert%20Space) $L^2(\Omega,\mathcal H,\mathbb P)$ as $Z$.
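As a sanity check rather than part of the proof, the Jensen bound can be verified exactly on a small example. The sketch below assumes a hypothetical four-point probability space with equally likely outcomes; the data and the helper names `outcomes`, `cond_exp` and `expect` are illustrative and do not come from the theorem. On such a space, conditioning on $\sigma(X)$ amounts to averaging over each level set $\{X=x\}$.

```python
from fractions import Fraction

# Hypothetical finite probability space: four equally likely outcomes.
# Each entry is (X(omega), Y(omega)); the numbers are illustrative only.
outcomes = [(0, 1), (0, 3), (1, 2), (1, 6)]
prob = Fraction(1, len(outcomes))


def cond_exp(values):
    """Average `values` over each level set {X = x}, i.e. E[. | sigma(X)]."""
    result = []
    for x, _ in outcomes:
        block = [v for (xi, _), v in zip(outcomes, values) if xi == x]
        result.append(sum(block, Fraction(0)) / len(block))
    return result


def expect(values):
    """Expectation under the uniform measure on `outcomes`."""
    return sum(prob * v for v in values)


Y = [Fraction(y) for _, y in outcomes]
Z = cond_exp(Y)                            # a version of E[Y | sigma(X)]
Y2_given_H = cond_exp([y * y for y in Y])  # E[Y^2 | sigma(X)]

# Pointwise Jensen bound Z^2 <= E[Y^2 | sigma(X)].
jensen_holds = all(z * z <= c for z, c in zip(Z, Y2_given_H))

# Second moments (E[Z^2], E[Y^2]) after taking expectations.
second_moments = (expect([z * z for z in Z]), expect([y * y for y in Y]))
print(Z, jensen_holds, second_moments)
```

Exact rational arithmetic is used so that the inequalities are checked without rounding error. In this example $\mathbb E[Z^2]=10$ and $\mathbb E[Y^2]=25/2$.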
[/guided]
[/step]
[step:Extend the conditional expectation orthogonality to square-integrable predictors]
We claim that for every $V\in L^2(\Omega,\mathcal H,\mathbb P)$,
\begin{align*}
\mathbb E[(Y-Z)V]=0.
\end{align*}
For each $n\in\mathbb N$, define the truncated random variable
\begin{align*}
V_n:\Omega &\to \mathbb R \\
\omega &\mapsto \max\{-n,\min\{V(\omega),n\}\}.
\end{align*}
Each $V_n$ is bounded and $\mathcal H$-measurable. By the defining property of $Z=\mathbb E[Y\mid\mathcal H]$,
\begin{align*}
\mathbb E[YV_n]=\mathbb E[ZV_n],
\end{align*}
so
\begin{align*}
\mathbb E[(Y-Z)V_n]=0.
\end{align*}
Also $V_n\to V$ in $L^2(\Omega,\mathcal H,\mathbb P)$ by dominated convergence, since $V_n\to V$ pointwise and $|V_n-V|^2\leq |V|^2$ with $\mathbb E[V^2]<\infty$. Because $Y-Z\in L^2(\Omega,\mathcal F,\mathbb P)$, the [Cauchy-Schwarz inequality](/theorems/432) gives
\begin{align*}
|\mathbb E[(Y-Z)(V-V_n)]|
\leq
\mathbb E[(Y-Z)^2]^{1/2}\mathbb E[(V-V_n)^2]^{1/2}
\to 0.
\end{align*}
Therefore
\begin{align*}
\mathbb E[(Y-Z)V]
=
\lim_{n\to\infty}\mathbb E[(Y-Z)V_n]
=
0.
\end{align*}
[guided]
The defining property of conditional expectation says that $Y$ and $Z$ have the same integrals against bounded $\mathcal H$-measurable test variables. Our competitors, however, need only be square-integrable, not bounded. We therefore approximate an arbitrary square-integrable $\mathcal H$-measurable random variable by bounded truncations.
Let $V\in L^2(\Omega,\mathcal H,\mathbb P)$. For each $n\in\mathbb N$, define
\begin{align*}
V_n:\Omega &\to \mathbb R \\
\omega &\mapsto \max\{-n,\min\{V(\omega),n\}\}.
\end{align*}
Then $V_n$ is $\mathcal H$-measurable because it is obtained from $V$ by composing with the continuous truncation map $t\mapsto \max\{-n,\min\{t,n\}\}$. It is also bounded by $n$.
Since $Z=\mathbb E[Y\mid\mathcal H]$, the defining identity for conditional expectation gives
\begin{align*}
\mathbb E[YV_n]=\mathbb E[ZV_n].
\end{align*}
Both products are integrable because $V_n$ is bounded and $Y,Z\in L^1(\Omega,\mathcal F,\mathbb P)$. Hence
\begin{align*}
\mathbb E[(Y-Z)V_n]=0.
\end{align*}
It remains to pass from $V_n$ to $V$. We have $V_n\to V$ pointwise and
\begin{align*}
|V_n-V|^2 \leq |V|^2.
\end{align*}
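Both facts can be read off from a short case analysis, sketched here using only the definition of $V_n$. Truncation changes $V(\omega)$ only where $|V(\omega)|>n$, and there it moves the value to the nearer endpoint of $[-n,n]$:
\begin{align*}
|V_n-V|
=
\begin{cases}
0, & |V|\leq n, \\
|V|-n, & |V|>n,
\end{cases}
\qquad\text{so}\qquad
0\leq |V_n-V|\leq |V|.
\end{align*}
In particular, for each fixed $\omega$ we have $V_n(\omega)=V(\omega)$ as soon as $n\geq |V(\omega)|$, which is the pointwise convergence.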
Since $V\in L^2$, dominated convergence gives $V_n\to V$ in $L^2(\Omega,\mathcal H,\mathbb P)$. Also $Y-Z\in L^2$ because both $Y$ and $Z$ are in $L^2$. Applying the Cauchy-Schwarz inequality to the product $(Y-Z)(V-V_n)$ gives
\begin{align*}
|\mathbb E[(Y-Z)(V-V_n)]|
\leq
\mathbb E[(Y-Z)^2]^{1/2}\mathbb E[(V-V_n)^2]^{1/2}
\to 0.
\end{align*}
Therefore the bounded-test-variable orthogonality extends to $V$:
\begin{align*}
\mathbb E[(Y-Z)V]
=
\lim_{n\to\infty}\mathbb E[(Y-Z)V_n]
=
0.
\end{align*}
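The truncation argument can be illustrated numerically. The following is a minimal sketch on a hypothetical six-point probability space with equally likely outcomes; the data, the function `h` defining $V=h(X)$, and the helper names are assumptions made for illustration only. For each $n$ it records the cross term $\mathbb E[(Y-Z)V_n]$, the squared $L^2$ error $\mathbb E[(V-V_n)^2]$, and whether the domination $|V_n-V|\leq |V|$ holds at every outcome.

```python
from fractions import Fraction

# Hypothetical finite probability space with equally likely outcomes (X, Y).
outcomes = [(0, 1), (0, 3), (1, 2), (1, 6), (2, -4), (2, 0)]
prob = Fraction(1, len(outcomes))


def expect(values):
    """Expectation under the uniform measure on `outcomes`."""
    return sum(prob * v for v in values)


def cond_mean(x):
    """E[Y | X = x], computed by averaging Y over the level set {X = x}."""
    ys = [Fraction(y) for xi, y in outcomes if xi == x]
    return sum(ys) / len(ys)


def truncate(v, n):
    """The truncation map t -> max{-n, min{t, n}}."""
    return max(-n, min(v, n))


def h(x):
    """Illustrative choice of V = h(X); any function of X is sigma(X)-measurable."""
    return Fraction(5) * x - 4


Y = [Fraction(y) for _, y in outcomes]
Z = [cond_mean(x) for x, _ in outcomes]
V = [h(x) for x, _ in outcomes]

rows = []
for n in range(1, 8):
    V_n = [truncate(v, n) for v in V]
    cross = expect([(y - z) * vn for y, z, vn in zip(Y, Z, V_n)])
    l2_error = expect([(v - vn) ** 2 for v, vn in zip(V, V_n)])
    dominated = all(abs(vn - v) <= abs(v) for v, vn in zip(V, V_n))
    rows.append((n, cross, l2_error, dominated))

for row in rows:
    print(row)
```

On a finite space the truncation becomes exact once $n\geq\max|V|$, so the error reaches zero after finitely many steps. For unbounded $V$ the error only tends to zero, which is why the proof needs dominated convergence.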
[/guided]
[/step]
[step:Expand the squared risk and cancel the cross term]
Fix $g\in\mathcal G_X$ and let $H_g=g\circ X$. Define
\begin{align*}
V_g:\Omega &\to \mathbb R \\
\omega &\mapsto Z(\omega)-H_g(\omega).
\end{align*}
Since $Z,H_g\in L^2(\Omega,\mathcal H,\mathbb P)$, we have $V_g\in L^2(\Omega,\mathcal H,\mathbb P)$. The orthogonality from the previous step gives
\begin{align*}
\mathbb E[(Y-Z)V_g]=0.
\end{align*}
Using the identity $Y-H_g=(Y-Z)+V_g$, we expand:
\begin{align*}
\mathbb E[(Y-H_g)^2]
&=
\mathbb E[((Y-Z)+V_g)^2] \\
&=
\mathbb E[(Y-Z)^2]+2\mathbb E[(Y-Z)V_g]+\mathbb E[V_g^2] \\
&=
\mathbb E[(Y-Z)^2]+\mathbb E[(Z-H_g)^2].
\end{align*}
Since $\mathbb E[(Z-H_g)^2]\geq 0$, it follows that
\begin{align*}
\mathbb E[(Y-Z)^2]\leq \mathbb E[(Y-g(X))^2].
\end{align*}
[guided]
Fix a predictor $g\in\mathcal G_X$ and write its realized prediction as $H_g=g\circ X$. We compare $H_g$ to the conditional mean $Z$ by defining
\begin{align*}
V_g:\Omega &\to \mathbb R \\
\omega &\mapsto Z(\omega)-H_g(\omega).
\end{align*}
Both $Z$ and $H_g$ are square-integrable and $\mathcal H$-measurable, so $V_g\in L^2(\Omega,\mathcal H,\mathbb P)$. The orthogonality proved in the previous step applies to this particular choice of $V_g$, giving
\begin{align*}
\mathbb E[(Y-Z)V_g]=0.
\end{align*}
Now decompose the prediction error:
\begin{align*}
Y-H_g = (Y-Z)+(Z-H_g)=(Y-Z)+V_g.
\end{align*}
Squaring and taking expectations gives
\begin{align*}
\mathbb E[(Y-H_g)^2]
&=
\mathbb E[((Y-Z)+V_g)^2] \\
&=
\mathbb E[(Y-Z)^2]+2\mathbb E[(Y-Z)V_g]+\mathbb E[V_g^2].
\end{align*}
The middle term vanishes by orthogonality, so
\begin{align*}
\mathbb E[(Y-H_g)^2]
=
\mathbb E[(Y-Z)^2]+\mathbb E[(Z-H_g)^2].
\end{align*}
This is the Pythagorean identity for the projection of $Y$ onto the closed subspace of $L^2$ consisting of $\mathcal H$-measurable random variables. Since the last term is non-negative,
\begin{align*}
\mathbb E[(Y-Z)^2]\leq \mathbb E[(Y-g(X))^2].
\end{align*}
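The decomposition can be checked exactly on a toy example. This is a minimal sketch assuming a hypothetical four-point probability space with equally likely outcomes and an arbitrary competitor `g`; neither is taken from the theorem. It computes the cross term and the three second moments that appear in the identity.

```python
from fractions import Fraction

# Hypothetical finite probability space with equally likely outcomes (X, Y).
outcomes = [(0, 1), (0, 3), (1, 2), (1, 6)]
prob = Fraction(1, len(outcomes))


def expect(values):
    """Expectation under the uniform measure on `outcomes`."""
    return sum(prob * v for v in values)


def cond_mean(x):
    """E[Y | X = x], computed by averaging Y over the level set {X = x}."""
    ys = [Fraction(y) for xi, y in outcomes if xi == x]
    return sum(ys) / len(ys)


def g(x):
    """An arbitrary competitor; any function of x would do here."""
    return Fraction(3) * x + 1


Y = [Fraction(y) for _, y in outcomes]
Z = [cond_mean(x) for x, _ in outcomes]
H_g = [g(x) for x, _ in outcomes]

# Cross term E[(Y - Z)(Z - H_g)], expected to vanish by orthogonality.
cross = expect([(y - z) * (z - h) for y, z, h in zip(Y, Z, H_g)])
risk_g = expect([(y - h) ** 2 for y, h in zip(Y, H_g)])  # E[(Y - H_g)^2]
risk_Z = expect([(y - z) ** 2 for y, z in zip(Y, Z)])    # E[(Y - Z)^2]
gap = expect([(z - h) ** 2 for z, h in zip(Z, H_g)])     # E[(Z - H_g)^2]
print(cross, risk_g, risk_Z, gap)
```

Here the cross term is $0$ and the risks satisfy $3 = 5/2 + 1/2$, matching $\mathbb E[(Y-H_g)^2]=\mathbb E[(Y-Z)^2]+\mathbb E[(Z-H_g)^2]$.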
[/guided]
[/step]
[step:Identify the minimizers and the minimum value]
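Let $m\in\mathcal G_X$ satisfy $m(X)=Z$ $\mathbb P$-a.s. Such an $m$ exists: $Z$ is $\sigma(X)$-measurable, so the Doob-Dynkin lemma yields a Borel function $m$ with $Z=m(X)$, and $m\in\mathcal G_X$ because $Z\in L^2(\Omega,\mathcal H,\mathbb P)$. Applying the identity from the previous step with $g=m$ gives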
\begin{align*}
\mathbb E[(Y-m(X))^2]
=
\mathbb E[(Y-Z)^2].
\end{align*}
For every $g\in\mathcal G_X$, the same identity gives
\begin{align*}
\mathbb E[(Y-g(X))^2]
=
\mathbb E[(Y-Z)^2]+\mathbb E[(Z-g(X))^2]
\geq
\mathbb E[(Y-Z)^2].
\end{align*}
Therefore every such $m$ minimizes the squared prediction risk, and the minimum value is
\begin{align*}
\inf_{g\in\mathcal G_X}\mathbb E[(Y-g(X))^2]
=
\mathbb E[(Y-Z)^2]
=
\mathbb E[(Y-\mathbb E[Y\mid X])^2].
\end{align*}
This proves the theorem.
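As an illustration under assumptions that are not part of the theorem, consider an additive-noise model: suppose $Y=f(X)+\varepsilon$, where $f$ is Borel measurable with $f(X)\in L^2$, and $\varepsilon\in L^2$ satisfies $\mathbb E[\varepsilon\mid X]=0$ $\mathbb P$-a.s. Then $f\in\mathcal G_X$, and the identities above specialize to
\begin{align*}
\mathbb E[Y\mid X]
&=
f(X)+\mathbb E[\varepsilon\mid X]
=
f(X) \quad \mathbb P\text{-a.s.}, \\
\mathbb E[(Y-g(X))^2]
&=
\mathbb E[\varepsilon^2]+\mathbb E[(f(X)-g(X))^2]
\quad \text{for every } g\in\mathcal G_X.
\end{align*}
In this model the minimum risk is the noise level $\mathbb E[\varepsilon^2]$, attained at $g=f$.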
[/step]