[proofplan]
The theorem is equivalent to the statement that $\mathbb{E}[X \mid \mathcal{G}]$ is the orthogonal projection of $X$ onto the closed linear subspace $L^2(\Omega, \mathcal{G}, \mathbb{P})$ of the real Hilbert space $L^2(\Omega, \mathcal{F}, \mathbb{P})$, equipped with the inner product $(U, V)_{L^2} = \mathbb{E}[UV]$. The proof has four steps. First, the conditional Jensen inequality establishes that $\mathbb{E}[X \mid \mathcal{G}]$ itself belongs to $L^2(\Omega, \mathcal{G}, \mathbb{P})$. Second, the "taking out what is known" and tower properties of conditional expectation yield the key orthogonality: the residual $X - \mathbb{E}[X \mid \mathcal{G}]$ satisfies $\mathbb{E}[(X - \mathbb{E}[X \mid \mathcal{G}])\,W] = 0$ for every $W \in L^2(\Omega, \mathcal{G}, \mathbb{P})$. Third, a Pythagorean expansion of $\mathbb{E}[(X - Z)^2]$ produces a non-negative correction term $\mathbb{E}[(\mathbb{E}[X \mid \mathcal{G}] - Z)^2]$, giving the inequality. Fourth, equality holds precisely when this correction term vanishes, which forces $Z = \mathbb{E}[X \mid \mathcal{G}]$ $\mathbb{P}$-almost surely.
[/proofplan]
[step:Verify that $\mathbb{E}[X \mid \mathcal{G}]$ belongs to $L^2(\Omega, \mathcal{G}, \mathbb{P})$ via the conditional Jensen inequality]
Since $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space and $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$, the [Cauchy-Schwarz Inequality](/theorems/432) gives $\mathbb{E}[|X|] = \mathbb{E}[|X| \cdot 1] \leq \|X\|_{L^2(\Omega,\mathcal{F},\mathbb{P})}\|1\|_{L^2(\Omega,\mathcal{F},\mathbb{P})} = \|X\|_{L^2(\Omega,\mathcal{F},\mathbb{P})} < \infty$, so $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$. The function $\varphi: \mathbb{R} \to \mathbb{R}$, $t \mapsto t^2$, is convex, and $\mathbb{E}[\varphi(X)] = \mathbb{E}[X^2] < \infty$ since $X \in L^2$. Applying the [Conditional Jensen Inequality](/theorems/1149) to $\varphi$ and the sub-$\sigma$-algebra $\mathcal{G}$:
\begin{align*}
\bigl(\mathbb{E}[X \mid \mathcal{G}]\bigr)^2 \leq \mathbb{E}[X^2 \mid \mathcal{G}] \quad \mathbb{P}\text{-a.s.}
\end{align*}
Taking expectations of both sides and applying the [Tower Property of Conditional Expectation](/theorems/1150):
\begin{align*}
\mathbb{E}\!\left[\bigl(\mathbb{E}[X \mid \mathcal{G}]\bigr)^2\right] \leq \mathbb{E}\!\left[\mathbb{E}[X^2 \mid \mathcal{G}]\right] = \mathbb{E}[X^2] < \infty.
\end{align*}
Since $\mathbb{E}[X \mid \mathcal{G}]$ is $\mathcal{G}$-measurable by definition and has finite second moment, $\mathbb{E}[X \mid \mathcal{G}] \in L^2(\Omega, \mathcal{G}, \mathbb{P})$.
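As a sanity check on a finite example (the sample space, measure, and random variable here are illustrative assumptions, not part of the theorem), take $\Omega = \{1,2,3,4\}$ with the uniform probability measure, $X(\omega) = \omega$, and $\mathcal{G} = \sigma(\{1,2\},\{3,4\})$. Conditional expectations given $\mathcal{G}$ are then block averages: $\mathbb{E}[X \mid \mathcal{G}]$ equals $\tfrac{3}{2}$ on $\{1,2\}$ and $\tfrac{7}{2}$ on $\{3,4\}$, while $\mathbb{E}[X^2 \mid \mathcal{G}]$ equals $\tfrac{5}{2}$ and $\tfrac{25}{2}$ on the same blocks. The conditional Jensen bound and its integrated form read:
\begin{align*}
\tfrac{9}{4} \leq \tfrac{5}{2} \ \text{on } \{1,2\}, \qquad \tfrac{49}{4} \leq \tfrac{25}{2} \ \text{on } \{3,4\}, \qquad \mathbb{E}\!\left[\bigl(\mathbb{E}[X \mid \mathcal{G}]\bigr)^2\right] = \tfrac{1}{2}\!\left(\tfrac{9}{4} + \tfrac{49}{4}\right) = \tfrac{29}{4} \leq \tfrac{15}{2} = \mathbb{E}[X^2].
\end{align*}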
[/step]
[step:Show that the residual $X - \mathbb{E}[X \mid \mathcal{G}]$ is orthogonal to every element of $L^2(\Omega, \mathcal{G}, \mathbb{P})$]
Let $W \in L^2(\Omega, \mathcal{G}, \mathbb{P})$. By the [Cauchy-Schwarz Inequality](/theorems/432):
\begin{align*}
\mathbb{E}[|XW|] \leq \|X\|_{L^2(\Omega,\mathcal{F},\mathbb{P})}\,\|W\|_{L^2(\Omega,\mathcal{G},\mathbb{P})} < \infty,
\end{align*}
so $XW \in L^1(\Omega, \mathcal{F}, \mathbb{P})$. Since $W$ is $\mathcal{G}$-measurable and both $X$ and $XW$ are integrable (the former by the first step), the "taking out what is known" property of conditional expectation (cf. [Basic Properties of Conditional Expectation](/theorems/1148)) gives:
\begin{align*}
\mathbb{E}[XW \mid \mathcal{G}] = W\,\mathbb{E}[X \mid \mathcal{G}] \quad \mathbb{P}\text{-a.s.}
\end{align*}
Taking expectations and applying the [Tower Property of Conditional Expectation](/theorems/1150) to the left-hand side:
\begin{align*}
\mathbb{E}[XW] = \mathbb{E}\!\left[\mathbb{E}[XW \mid \mathcal{G}]\right] = \mathbb{E}\!\left[W\,\mathbb{E}[X \mid \mathcal{G}]\right].
\end{align*}
By linearity of expectation:
\begin{align*}
\mathbb{E}\!\left[\bigl(X - \mathbb{E}[X \mid \mathcal{G}]\bigr)W\right] = \mathbb{E}[XW] - \mathbb{E}\!\left[\mathbb{E}[X \mid \mathcal{G}]\cdot W\right] = \mathbb{E}[XW] - \mathbb{E}[XW] = 0.
\end{align*}
[guided]
We want to show the residual $X - \mathbb{E}[X \mid \mathcal{G}]$ is uncorrelated with every $\mathcal{G}$-measurable square-integrable random variable — that is, $\mathbb{E}[(X - \mathbb{E}[X \mid \mathcal{G}])\,W] = 0$ for all $W \in L^2(\Omega, \mathcal{G}, \mathbb{P})$. In Hilbert space language this says the error is orthogonal to the entire subspace $L^2(\Omega, \mathcal{G}, \mathbb{P})$, which is the defining property of an orthogonal projection. Intuitively, $\mathbb{E}[X \mid \mathcal{G}]$ has already "extracted" everything in $X$ that is visible through the $\sigma$-algebra $\mathcal{G}$; the leftover residual cannot be detected by any $\mathcal{G}$-measurable probe $W$.
Let $W \in L^2(\Omega, \mathcal{G}, \mathbb{P})$.
**Step 1: Verify that $XW$ and $\mathbb{E}[X \mid \mathcal{G}] \cdot W$ are integrable.** This is needed before we can manipulate expectations. By the [Cauchy-Schwarz Inequality](/theorems/432) applied to $(\Omega, \mathcal{F}, \mathbb{P})$:
\begin{align*}
\mathbb{E}[|XW|] \leq \|X\|_{L^2(\Omega,\mathcal{F},\mathbb{P})}\,\|W\|_{L^2(\Omega,\mathcal{G},\mathbb{P})} < \infty,
\end{align*}
so $XW \in L^1(\Omega, \mathcal{F}, \mathbb{P})$. Since $\mathbb{E}[X \mid \mathcal{G}] \in L^2(\Omega, \mathcal{G}, \mathbb{P})$ (established in the previous step), the same Cauchy-Schwarz bound gives $\mathbb{E}[X \mid \mathcal{G}] \cdot W \in L^1(\Omega, \mathcal{G}, \mathbb{P})$.
**Step 2: Reduce to the defining identity of conditional expectation.** By linearity:
\begin{align*}
\mathbb{E}\!\left[\bigl(X - \mathbb{E}[X \mid \mathcal{G}]\bigr)W\right] = \mathbb{E}[XW] - \mathbb{E}\!\left[\mathbb{E}[X \mid \mathcal{G}]\cdot W\right].
\end{align*}
It suffices to show $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \cdot W] = \mathbb{E}[XW]$.
**Step 3: Apply "taking out what is known" and the tower property.** We invoke two properties: "taking out what is known" from [Basic Properties of Conditional Expectation](/theorems/1148), and the [Tower Property of Conditional Expectation](/theorems/1150).
- **Taking out what is known**: since $W$ is $\mathcal{G}$-measurable and $X, XW \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, we have
\begin{align*}
\mathbb{E}[XW \mid \mathcal{G}] = W\,\mathbb{E}[X \mid \mathcal{G}] \quad \mathbb{P}\text{-a.s.}
\end{align*}
The reason this holds is that from the perspective of $\mathcal{G}$, the value of $W$ is already determined, so it factors out of the conditional expectation of $X$.
- **Tower property**: for any $Y \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}]] = \mathbb{E}[Y]$.
Chaining these:
\begin{align*}
\mathbb{E}\!\left[\mathbb{E}[X \mid \mathcal{G}]\cdot W\right] = \mathbb{E}\!\left[\mathbb{E}[XW \mid \mathcal{G}]\right] = \mathbb{E}[XW].
\end{align*}
**Conclusion.** Substituting back:
\begin{align*}
\mathbb{E}\!\left[\bigl(X - \mathbb{E}[X \mid \mathcal{G}]\bigr)W\right] = \mathbb{E}[XW] - \mathbb{E}[XW] = 0.
\end{align*}
This orthogonality is the key structural fact. It says that the prediction error of the conditional expectation is invisible to every square-integrable $\mathcal{G}$-measurable random variable: no such $W$ has a nonzero inner product with it. The Pythagorean expansion in the next step converts this orthogonality into the minimization inequality.
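To see the cancellation concretely, consider an illustrative finite example (an assumption made here for exposition only): $\Omega = \{1,2,3,4\}$ with the uniform probability measure, $X(\omega) = \omega$, and $\mathcal{G} = \sigma(\{1,2\},\{3,4\})$. Then $\mathbb{E}[X \mid \mathcal{G}]$ is $\tfrac{3}{2}$ on $\{1,2\}$ and $\tfrac{7}{2}$ on $\{3,4\}$, so the residual $X - \mathbb{E}[X \mid \mathcal{G}]$ takes the values $-\tfrac{1}{2}, \tfrac{1}{2}, -\tfrac{1}{2}, \tfrac{1}{2}$ at $\omega = 1, 2, 3, 4$. Every $\mathcal{G}$-measurable $W$ is constant on each block, say $W = a$ on $\{1,2\}$ and $W = b$ on $\{3,4\}$ with hypothetical constants $a, b \in \mathbb{R}$, and therefore:
\begin{align*}
\mathbb{E}\!\left[\bigl(X - \mathbb{E}[X \mid \mathcal{G}]\bigr)W\right] = \tfrac{1}{4}\!\left[a\bigl(-\tfrac{1}{2}\bigr) + a\bigl(\tfrac{1}{2}\bigr) + b\bigl(-\tfrac{1}{2}\bigr) + b\bigl(\tfrac{1}{2}\bigr)\right] = 0.
\end{align*}
The residual averages to zero on each block of $\mathcal{G}$, which is exactly why no blockwise-constant $W$ can detect it.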
[/guided]
[/step]
[step:Decompose $X - Z$ into the residual and correction, then apply the Pythagorean identity to obtain the inequality]
Define the residual and correction:
\begin{align*}
\varepsilon &:= X - \mathbb{E}[X \mid \mathcal{G}] \in L^2(\Omega, \mathcal{F}, \mathbb{P}), \\
\delta &:= \mathbb{E}[X \mid \mathcal{G}] - Z \in L^2(\Omega, \mathcal{G}, \mathbb{P}).
\end{align*}
By construction $X - Z = \varepsilon + \delta$. Since $\varepsilon, \delta \in L^2(\Omega, \mathcal{F}, \mathbb{P})$, the [Cauchy-Schwarz Inequality](/theorems/432) gives $\mathbb{E}[|\varepsilon\delta|] \leq \|\varepsilon\|_{L^2}\|\delta\|_{L^2} < \infty$, so $\varepsilon\delta \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ and the bilinear expansion is valid:
\begin{align*}
\mathbb{E}[(X - Z)^2] = \mathbb{E}[(\varepsilon + \delta)^2] = \mathbb{E}[\varepsilon^2] + 2\,\mathbb{E}[\varepsilon\delta] + \mathbb{E}[\delta^2].
\end{align*}
Since $\delta \in L^2(\Omega, \mathcal{G}, \mathbb{P})$, the preceding step applies with $W := \delta$ to give $\mathbb{E}[\varepsilon\delta] = 0$. Since $\mathbb{E}[\delta^2] \geq 0$:
\begin{align*}
\mathbb{E}[(X - Z)^2] = \mathbb{E}[\varepsilon^2] + \mathbb{E}[\delta^2] \geq \mathbb{E}[\varepsilon^2] = \mathbb{E}\!\left[(X - \mathbb{E}[X \mid \mathcal{G}])^2\right].
\end{align*}
This is the claimed inequality.
[guided]
The strategy is to write $X - Z$ as a sum of two parts: the irreducible error $\varepsilon = X - \mathbb{E}[X \mid \mathcal{G}]$ (the error of the optimal predictor, which no $\mathcal{G}$-measurable estimate can improve) and the correction $\delta = \mathbb{E}[X \mid \mathcal{G}] - Z$ (the gap between the optimal predictor and the chosen estimator $Z$). The orthogonality established in the preceding step means these two parts are perpendicular in $L^2$, so the total squared error splits as the sum of the two squared errors — the Pythagorean theorem in the Hilbert space $L^2(\Omega, \mathcal{F}, \mathbb{P})$.
**Setting up the decomposition.** Define:
\begin{align*}
\varepsilon &:= X - \mathbb{E}[X \mid \mathcal{G}], \\
\delta &:= \mathbb{E}[X \mid \mathcal{G}] - Z.
\end{align*}
We check membership in the relevant $L^2$ spaces. Since $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$ and $\mathbb{E}[X \mid \mathcal{G}] \in L^2(\Omega, \mathcal{G}, \mathbb{P}) \subseteq L^2(\Omega, \mathcal{F}, \mathbb{P})$ (from the first step), $\varepsilon$ belongs to $L^2(\Omega, \mathcal{F}, \mathbb{P})$. Since $\mathbb{E}[X \mid \mathcal{G}], Z \in L^2(\Omega, \mathcal{G}, \mathbb{P})$ and that space is a vector space, $\delta \in L^2(\Omega, \mathcal{G}, \mathbb{P})$. In particular, $\delta$ is $\mathcal{G}$-measurable. By construction, $\varepsilon + \delta = X - Z$.
**Validity of the bilinear expansion.** To expand $\mathbb{E}[(\varepsilon + \delta)^2]$ we need the cross term $\mathbb{E}[\varepsilon\delta]$ to be finite. By the [Cauchy-Schwarz Inequality](/theorems/432):
\begin{align*}
\mathbb{E}[|\varepsilon\delta|] \leq \|\varepsilon\|_{L^2(\Omega,\mathcal{F},\mathbb{P})}\,\|\delta\|_{L^2(\Omega,\mathcal{G},\mathbb{P})} < \infty,
\end{align*}
so $\varepsilon\delta \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ and the expansion is justified:
\begin{align*}
\mathbb{E}[(X - Z)^2] = \mathbb{E}[(\varepsilon + \delta)^2] = \mathbb{E}[\varepsilon^2] + 2\,\mathbb{E}[\varepsilon\delta] + \mathbb{E}[\delta^2].
\end{align*}
**Killing the cross term.** The cross term vanishes by orthogonality: since $\delta \in L^2(\Omega, \mathcal{G}, \mathbb{P})$, we apply the result of the preceding step with $W := \delta$ to obtain $\mathbb{E}[\varepsilon\delta] = \mathbb{E}[(X - \mathbb{E}[X \mid \mathcal{G}])\,\delta] = 0$. In Hilbert space terms, $\varepsilon$ and $\delta$ are orthogonal elements of $L^2(\Omega, \mathcal{F}, \mathbb{P})$, so the Pythagorean identity holds: $\|\varepsilon + \delta\|_{L^2}^2 = \|\varepsilon\|_{L^2}^2 + \|\delta\|_{L^2}^2$.
**Concluding the inequality.** Since $\mathbb{E}[\delta^2] \geq 0$:
\begin{align*}
\mathbb{E}[(X - Z)^2] = \mathbb{E}[\varepsilon^2] + \mathbb{E}[\delta^2] \geq \mathbb{E}[\varepsilon^2] = \mathbb{E}\!\left[(X - \mathbb{E}[X \mid \mathcal{G}])^2\right].
\end{align*}
The mean-square error of any square-integrable $\mathcal{G}$-measurable estimator $Z$ is at least the mean-square error of the conditional expectation. The extra cost of using $Z$ rather than $\mathbb{E}[X \mid \mathcal{G}]$ is precisely $\mathbb{E}[\delta^2] = \mathbb{E}[(\mathbb{E}[X \mid \mathcal{G}] - Z)^2]$, the squared $L^2$-distance between $Z$ and the optimal predictor.
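A small worked instance may help fix the decomposition (the example is an illustrative assumption, not part of the theorem). Take $\Omega = \{1,2,3,4\}$ with the uniform probability measure, $X(\omega) = \omega$, $\mathcal{G} = \sigma(\{1,2\},\{3,4\})$, and the hypothetical estimator $Z = 1$ on $\{1,2\}$, $Z = 4$ on $\{3,4\}$. Here $\mathbb{E}[X \mid \mathcal{G}]$ is $\tfrac{3}{2}$ on $\{1,2\}$ and $\tfrac{7}{2}$ on $\{3,4\}$, so $\varepsilon$ takes the values $-\tfrac{1}{2}, \tfrac{1}{2}, -\tfrac{1}{2}, \tfrac{1}{2}$ at $\omega = 1,2,3,4$, and $\delta$ equals $\tfrac{1}{2}$ on $\{1,2\}$ and $-\tfrac{1}{2}$ on $\{3,4\}$. Computing each term directly:
\begin{align*}
\mathbb{E}[(X - Z)^2] &= \tfrac{1}{4}\,(0 + 1 + 1 + 0) = \tfrac{1}{2}, \\
\mathbb{E}[\varepsilon^2] &= \tfrac{1}{4}\cdot 4 \cdot \tfrac{1}{4} = \tfrac{1}{4}, \\
\mathbb{E}[\delta^2] &= \tfrac{1}{2}\!\left(\tfrac{1}{4} + \tfrac{1}{4}\right) = \tfrac{1}{4},
\end{align*}
so $\mathbb{E}[(X - Z)^2] = \mathbb{E}[\varepsilon^2] + \mathbb{E}[\delta^2]$ holds with $\tfrac{1}{2} = \tfrac{1}{4} + \tfrac{1}{4}$, and the excess error of $Z$ over the conditional expectation is $\tfrac{1}{4}$.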
[/guided]
[/step]
[step:Conclude that equality holds if and only if $Z = \mathbb{E}[X \mid \mathcal{G}]$ $\mathbb{P}$-almost surely]
The preceding step established the identity
\begin{align*}
\mathbb{E}[(X - Z)^2] = \mathbb{E}\!\left[(X - \mathbb{E}[X \mid \mathcal{G}])^2\right] + \mathbb{E}\!\left[(\mathbb{E}[X \mid \mathcal{G}] - Z)^2\right].
\end{align*}
Equality $\mathbb{E}[(X - Z)^2] = \mathbb{E}[(X - \mathbb{E}[X \mid \mathcal{G}])^2]$ holds if and only if $\mathbb{E}[(\mathbb{E}[X \mid \mathcal{G}] - Z)^2] = 0$. Since $(\mathbb{E}[X \mid \mathcal{G}] - Z)^2 \geq 0$ $\mathbb{P}$-a.s., its expectation is zero if and only if $(\mathbb{E}[X \mid \mathcal{G}] - Z)^2 = 0$ $\mathbb{P}$-a.s., which is equivalent to $Z = \mathbb{E}[X \mid \mathcal{G}]$ $\mathbb{P}$-almost surely. This establishes both the inequality and the characterization of equality, completing the proof.
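The equality case can be read off explicitly in an illustrative finite example (assumed here only for exposition): let $\Omega = \{1,2,3,4\}$ carry the uniform probability measure, $X(\omega) = \omega$, and $\mathcal{G} = \sigma(\{1,2\},\{3,4\})$, so that $\mathbb{E}[X \mid \mathcal{G}]$ is $\tfrac{3}{2}$ on $\{1,2\}$ and $\tfrac{7}{2}$ on $\{3,4\}$. A general $\mathcal{G}$-measurable $Z$ has the form $Z = a$ on $\{1,2\}$ and $Z = b$ on $\{3,4\}$ for hypothetical constants $a, b \in \mathbb{R}$, and the identity above becomes:
\begin{align*}
\mathbb{E}[(X - Z)^2] = \tfrac{1}{4} + \tfrac{1}{2}\!\left[\bigl(a - \tfrac{3}{2}\bigr)^2 + \bigl(b - \tfrac{7}{2}\bigr)^2\right].
\end{align*}
The minimum value $\tfrac{1}{4}$ is attained if and only if $a = \tfrac{3}{2}$ and $b = \tfrac{7}{2}$, that is, if and only if $Z = \mathbb{E}[X \mid \mathcal{G}]$.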
[/step]