[guided]The strategy is to write $X - Z$ as a sum of two parts: the irreducible error $\varepsilon = X - \mathbb{E}[X \mid \mathcal{G}]$ (the error of the optimal predictor, which no $\mathcal{G}$-measurable estimate can improve) and the correction $\delta = \mathbb{E}[X \mid \mathcal{G}] - Z$ (the gap between the optimal predictor and the chosen estimator $Z$). By the orthogonality established in the preceding step, these two parts are perpendicular in $L^2$, so the mean-square error of $Z$ splits as the sum of the two mean-square terms $\mathbb{E}[\varepsilon^2]$ and $\mathbb{E}[\delta^2]$. This is the Pythagorean theorem in the Hilbert space $L^2(\Omega, \mathcal{F}, \mathbb{P})$.
**Setting up the decomposition.** Define:
\begin{align*}
\varepsilon &:= X - \mathbb{E}[X \mid \mathcal{G}], \\
\delta &:= \mathbb{E}[X \mid \mathcal{G}] - Z.
\end{align*}
We check membership in the relevant $L^2$ spaces. Since $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$ and $\mathbb{E}[X \mid \mathcal{G}] \in L^2(\Omega, \mathcal{G}, \mathbb{P}) \subseteq L^2(\Omega, \mathcal{F}, \mathbb{P})$ (from the previous step), $\varepsilon$ belongs to $L^2(\Omega, \mathcal{F}, \mathbb{P})$. Since $\mathbb{E}[X \mid \mathcal{G}], Z \in L^2(\Omega, \mathcal{G}, \mathbb{P})$ and that space is closed under subtraction, $\delta \in L^2(\Omega, \mathcal{G}, \mathbb{P})$. In particular, $\delta$ is $\mathcal{G}$-measurable. By construction, $\varepsilon + \delta = X - Z$.
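**Validity of the bilinear expansion.** To expand $\mathbb{E}[(\varepsilon + \delta)^2]$ by linearity of expectation, each of $\varepsilon^2$, $\delta^2$, and $\varepsilon\delta$ must be integrable. The two squares are integrable because $\varepsilon, \delta \in L^2$. For the cross term, the [Cauchy-Schwarz Inequality](/theorems/432) gives: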
\begin{align*}
\mathbb{E}[|\varepsilon\delta|] \leq \|\varepsilon\|_{L^2(\Omega,\mathcal{F},\mathbb{P})}\,\|\delta\|_{L^2(\Omega,\mathcal{G},\mathbb{P})} < \infty,
\end{align*}
so $\varepsilon\delta \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ and the expansion is justified:
\begin{align*}
\mathbb{E}[(X - Z)^2] = \mathbb{E}[(\varepsilon + \delta)^2] = \mathbb{E}[\varepsilon^2] + 2\,\mathbb{E}[\varepsilon\delta] + \mathbb{E}[\delta^2].
\end{align*}
**Killing the cross term.** The cross term vanishes by orthogonality: since $\delta \in L^2(\Omega, \mathcal{G}, \mathbb{P})$, we apply the result of the preceding step with $W := \delta$ to obtain $\mathbb{E}[\varepsilon\delta] = \mathbb{E}[(X - \mathbb{E}[X \mid \mathcal{G}])\,\delta] = 0$. In Hilbert space terms, $\varepsilon$ and $\delta$ are orthogonal elements of $L^2(\Omega, \mathcal{F}, \mathbb{P})$, so the Pythagorean identity holds: $\|\varepsilon + \delta\|_{L^2}^2 = \|\varepsilon\|_{L^2}^2 + \|\delta\|_{L^2}^2$.
**Concluding the inequality.** Since $\mathbb{E}[\delta^2] \geq 0$:
\begin{align*}
\mathbb{E}[(X - Z)^2] = \mathbb{E}[\varepsilon^2] + \mathbb{E}[\delta^2] \geq \mathbb{E}[\varepsilon^2] = \mathbb{E}\!\left[(X - \mathbb{E}[X \mid \mathcal{G}])^2\right].
\end{align*}
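As a sanity check, the decomposition can be verified exactly on a small finite probability space. The sketch below is illustrative only: the six-point uniform space, the two-block partition generating $\mathcal{G}$, the values of `X` and `Z`, and the helper names `expectation` and `cond_exp` are all assumptions chosen for the example, not part of the proof. Because the measure is uniform, the conditional expectation reduces to the plain average over each block.

```python
from fractions import Fraction

# Hypothetical finite probability space: Omega = {0, ..., 5}, uniform measure.
prob = Fraction(1, 6)

# G is generated by this partition; a random variable is G-measurable
# exactly when it is constant on each block.
blocks = [(0, 1, 2), (3, 4, 5)]

X = [Fraction(v) for v in (1, 2, 6, 4, 4, 10)]
Z = [Fraction(v) for v in (2, 2, 2, 7, 7, 7)]  # arbitrary G-measurable estimator


def expectation(values):
    """Expectation under the uniform measure."""
    return sum(prob * v for v in values)


def cond_exp(values):
    """E[values | G]: the block average (valid because the measure is uniform)."""
    out = [Fraction(0)] * len(values)
    for block in blocks:
        avg = sum(values[w] for w in block) / len(block)
        for w in block:
            out[w] = avg
    return out


cond = cond_exp(X)
eps = [x - c for x, c in zip(X, cond)]      # irreducible error
delta = [c - z for c, z in zip(cond, Z)]    # correction, constant on blocks

cross = expectation([e * d for e, d in zip(eps, delta)])
mse_Z = expectation([(x - z) ** 2 for x, z in zip(X, Z)])
mse_opt = expectation([e ** 2 for e in eps])
excess = expectation([d ** 2 for d in delta])

print(cross)                   # 0
print(mse_Z, mse_opt, excess)  # 22/3 19/3 1
```

Exact rational arithmetic is used so that the cross term is exactly zero and the identity $\mathbb{E}[(X - Z)^2] = \mathbb{E}[\varepsilon^2] + \mathbb{E}[\delta^2]$ holds without floating-point tolerance; here $22/3 = 19/3 + 1$.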
The mean-square error of any $\mathcal{G}$-measurable estimator $Z \in L^2(\Omega, \mathcal{G}, \mathbb{P})$ is at least the mean-square error of the conditional expectation. The extra cost of using $Z$ rather than $\mathbb{E}[X \mid \mathcal{G}]$ is precisely $\mathbb{E}[\delta^2] = \mathbb{E}[(\mathbb{E}[X \mid \mathcal{G}] - Z)^2]$, the squared $L^2$-distance between $Z$ and the optimal predictor. In particular, equality holds if and only if $Z = \mathbb{E}[X \mid \mathcal{G}]$ almost surely.[/guided]