Conditional Expectation as the $L^2$ Risk Minimizer

Conditional Expectation as the $L^2$ Risk Minimizer (Theorem #4424)

Proof

[proofplan] We view prediction by $g(X)$ as an approximation of $Y$ by a square-integrable $\sigma(X)$-measurable random variable. The conditional expectation $Z=\mathbb E[Y\mid\sigma(X)]$ has the defining orthogonality property against bounded $\sigma(X)$-measurable test variables, and we extend this orthogonality to every square-integrable competitor by truncation. Expanding the square then gives a Pythagorean identity, from which the minimization and the minimum risk follow. [/proofplan] [step:Place the conditional expectation and each competitor in the same $L^2$ space] Let $\mathcal H:=\sigma(X)$ denote the sub-$\sigma$-algebra of $\mathcal F$ generated by the random vector $X$. Since $Y \in L^2(\Omega,\mathcal F,\mathbb P)$ and $\mathbb P(\Omega)=1$, we also have $Y \in L^1(\Omega,\mathcal F,\mathbb P)$. Let \begin{align*} Z:\Omega &\to \mathbb R \end{align*} be a version of $\mathbb E[Y\mid \mathcal H]$. By [Jensen's inequality](/theorems/9) for conditional expectation, applied to the convex function $t\mapsto t^2$, we have \begin{align*} Z^2 \leq \mathbb E[Y^2\mid \mathcal H] \quad \mathbb P\text{-a.s.} \end{align*} Taking expectations and using the tower property gives \begin{align*} \mathbb E[Z^2] \leq \mathbb E[\mathbb E[Y^2\mid \mathcal H]] = \mathbb E[Y^2] < \infty, \end{align*} that is, conditional expectation is contractive on $L^2$. Since $Z$ is $\mathcal H$-measurable by definition, $Z \in L^2(\Omega,\mathcal H,\mathbb P)$. For any $g \in \mathcal G_X$, define the competitor \begin{align*} H_g:\Omega &\to \mathbb R \\ \omega &\mapsto g(X(\omega)). \end{align*} By the definition of $\mathcal G_X$, $H_g \in L^2(\Omega,\mathcal F,\mathbb P)$. Since $g$ is $\mathcal B(\mathbb R^p)$-measurable and $X$ generates $\mathcal H$, the composition $H_g=g\circ X$ is $\mathcal H$-measurable. Hence $H_g \in L^2(\Omega,\mathcal H,\mathbb P)$. [guided] Let $\mathcal H:=\sigma(X)$ be the information contained in the random vector $X$. 
A predictor of the form $g(X)$ only uses this information, so the natural comparison class is the space of square-integrable $\mathcal H$-measurable random variables. First define \begin{align*} Z:\Omega &\to \mathbb R \end{align*} to be a version of $\mathbb E[Y\mid\mathcal H]$. Since $Y\in L^2(\Omega,\mathcal F,\mathbb P)$ and $\mathbb P$ is a probability measure, $Y$ is also integrable, so this conditional expectation is defined. We also need $Z$ itself to be square-integrable, because the risk contains $(Y-Z)^2$. [Jensen's inequality](/theorems/1977) for conditional expectation, applied to the convex function $t\mapsto t^2$, gives \begin{align*} Z^2 \leq \mathbb E[Y^2\mid\mathcal H] \quad \mathbb P\text{-a.s.} \end{align*} Taking expectations on both sides yields \begin{align*} \mathbb E[Z^2] \leq \mathbb E[\mathbb E[Y^2\mid\mathcal H]] = \mathbb E[Y^2] < \infty. \end{align*} Thus $Z\in L^2(\Omega,\mathcal H,\mathbb P)$. Now fix $g\in\mathcal G_X$ and define \begin{align*} H_g:\Omega &\to \mathbb R \\ \omega &\mapsto g(X(\omega)). \end{align*} The definition of $\mathcal G_X$ gives $H_g\in L^2(\Omega,\mathcal F,\mathbb P)$. Moreover, because $g$ is Borel measurable and $X$ is the map generating $\mathcal H=\sigma(X)$, the composition $g\circ X$ is $\mathcal H$-measurable. Therefore $H_g$ lies in the same [Hilbert space](/page/Hilbert%20Space) $L^2(\Omega,\mathcal H,\mathbb P)$ as $Z$. [/guided] [/step] [step:Extend the conditional expectation orthogonality to square-integrable predictors] We claim that for every $V\in L^2(\Omega,\mathcal H,\mathbb P)$, \begin{align*} \mathbb E[(Y-Z)V]=0. \end{align*} For each $n\in\mathbb N$, define the truncated random variable \begin{align*} V_n:\Omega &\to \mathbb R \\ \omega &\mapsto \max\{-n,\min\{V(\omega),n\}\}. \end{align*} Each $V_n$ is bounded and $\mathcal H$-measurable. 
By the defining property of $Z=\mathbb E[Y\mid\mathcal H]$, \begin{align*} \mathbb E[YV_n]=\mathbb E[ZV_n], \end{align*} so \begin{align*} \mathbb E[(Y-Z)V_n]=0. \end{align*} Also $V_n\to V$ in $L^2(\Omega,\mathcal H,\mathbb P)$, since $|V_n-V|^2\leq |V|^2$ and $V_n\to V$ pointwise. Because $Y-Z\in L^2(\Omega,\mathcal F,\mathbb P)$, the [Cauchy-Schwarz inequality](/theorems/432) gives \begin{align*} |\mathbb E[(Y-Z)(V-V_n)]| \leq \mathbb E[(Y-Z)^2]^{1/2}\mathbb E[(V-V_n)^2]^{1/2} \to 0. \end{align*} Therefore \begin{align*} \mathbb E[(Y-Z)V] = \lim_{n\to\infty}\mathbb E[(Y-Z)V_n] = 0. \end{align*} [guided] The defining property of conditional expectation says that $Y$ and $Z$ have the same integrals against bounded $\mathcal H$-measurable test variables. Our competitors, however, need only be square-integrable, not bounded. We therefore approximate an arbitrary square-integrable $\mathcal H$-measurable random variable by bounded truncations. Let $V\in L^2(\Omega,\mathcal H,\mathbb P)$. For each $n\in\mathbb N$, define \begin{align*} V_n:\Omega &\to \mathbb R \\ \omega &\mapsto \max\{-n,\min\{V(\omega),n\}\}. \end{align*} Then $V_n$ is $\mathcal H$-measurable because it is obtained from $V$ by composing with the continuous truncation map $t\mapsto \max\{-n,\min\{t,n\}\}$. It is also bounded by $n$. Since $Z=\mathbb E[Y\mid\mathcal H]$, the defining identity for conditional expectation gives \begin{align*} \mathbb E[YV_n]=\mathbb E[ZV_n]. \end{align*} Both products are integrable because $V_n$ is bounded and $Y,Z\in L^1(\Omega,\mathcal F,\mathbb P)$. Hence \begin{align*} \mathbb E[(Y-Z)V_n]=0. \end{align*} It remains to pass from $V_n$ to $V$. We have $V_n\to V$ pointwise and \begin{align*} |V_n-V|^2 \leq |V|^2. \end{align*} Since $V\in L^2$, dominated convergence gives $V_n\to V$ in $L^2(\Omega,\mathcal H,\mathbb P)$. Also $Y-Z\in L^2$ because both $Y$ and $Z$ are in $L^2$. 
Applying the Cauchy-Schwarz inequality to the product $(Y-Z)(V-V_n)$ gives \begin{align*} |\mathbb E[(Y-Z)(V-V_n)]| \leq \mathbb E[(Y-Z)^2]^{1/2}\mathbb E[(V-V_n)^2]^{1/2} \to 0. \end{align*} Therefore the bounded-test-variable orthogonality extends to $V$: \begin{align*} \mathbb E[(Y-Z)V] = \lim_{n\to\infty}\mathbb E[(Y-Z)V_n] = 0. \end{align*} [/guided] [/step] [step:Expand the squared risk and cancel the cross term] Fix $g\in\mathcal G_X$ and let $H_g=g\circ X$. Define \begin{align*} V_g:\Omega &\to \mathbb R \\ \omega &\mapsto Z(\omega)-H_g(\omega). \end{align*} Since $Z,H_g\in L^2(\Omega,\mathcal H,\mathbb P)$, we have $V_g\in L^2(\Omega,\mathcal H,\mathbb P)$. The orthogonality from the previous step gives \begin{align*} \mathbb E[(Y-Z)V_g]=0. \end{align*} Using the identity $Y-H_g=(Y-Z)+V_g$, we expand: \begin{align*} \mathbb E[(Y-H_g)^2] &= \mathbb E[((Y-Z)+V_g)^2] \\ &= \mathbb E[(Y-Z)^2]+2\mathbb E[(Y-Z)V_g]+\mathbb E[V_g^2] \\ &= \mathbb E[(Y-Z)^2]+\mathbb E[(Z-H_g)^2]. \end{align*} Since $\mathbb E[(Z-H_g)^2]\geq 0$, it follows that \begin{align*} \mathbb E[(Y-Z)^2]\leq \mathbb E[(Y-g(X))^2]. \end{align*} [guided] Fix a predictor $g\in\mathcal G_X$ and write its realized prediction as $H_g=g\circ X$. We compare $H_g$ to the conditional mean $Z$ by defining \begin{align*} V_g:\Omega &\to \mathbb R \\ \omega &\mapsto Z(\omega)-H_g(\omega). \end{align*} Both $Z$ and $H_g$ are square-integrable and $\mathcal H$-measurable, so $V_g\in L^2(\Omega,\mathcal H,\mathbb P)$. The orthogonality proved in the previous step applies to this particular choice of $V_g$, giving \begin{align*} \mathbb E[(Y-Z)V_g]=0. \end{align*} Now decompose the prediction error: \begin{align*} Y-H_g = (Y-Z)+(Z-H_g)=(Y-Z)+V_g. \end{align*} Squaring and taking expectations gives \begin{align*} \mathbb E[(Y-H_g)^2] &= \mathbb E[((Y-Z)+V_g)^2] \\ &= \mathbb E[(Y-Z)^2]+2\mathbb E[(Y-Z)V_g]+\mathbb E[V_g^2]. 
\end{align*} The middle term vanishes by orthogonality, so \begin{align*} \mathbb E[(Y-H_g)^2] = \mathbb E[(Y-Z)^2]+\mathbb E[(Z-H_g)^2]. \end{align*} This is the Pythagorean identity for the projection of $Y$ onto the closed subspace of $L^2$ consisting of $\mathcal H$-measurable random variables. Since the last term is non-negative, \begin{align*} \mathbb E[(Y-Z)^2]\leq \mathbb E[(Y-g(X))^2]. \end{align*} [/guided] [/step] [step:Identify the minimizers and the minimum value] Let $m\in\mathcal G_X$ satisfy $m(X)=Z$ $\mathbb P$-a.s. Applying the identity from the previous step with $g=m$ gives \begin{align*} \mathbb E[(Y-m(X))^2] = \mathbb E[(Y-Z)^2]. \end{align*} For every $g\in\mathcal G_X$, the same identity gives \begin{align*} \mathbb E[(Y-g(X))^2] = \mathbb E[(Y-Z)^2]+\mathbb E[(Z-g(X))^2] \geq \mathbb E[(Y-Z)^2]. \end{align*} Therefore every such $m$ minimizes the squared prediction risk, and the minimum value is \begin{align*} \inf_{g\in\mathcal G_X}\mathbb E[(Y-g(X))^2] = \mathbb E[(Y-Z)^2] = \mathbb E[(Y-\mathbb E[Y\mid X])^2]. \end{align*} This proves the theorem. [/step]
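
The Pythagorean identity and the resulting risk comparison can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the proof. The toy model is an assumption chosen for illustration: $X\sim N(0,1)$ and $Y=\sin(X)+\varepsilon$ with $\varepsilon\sim N(0,0.5^2)$ independent of $X$, so that $\mathbb E[Y\mid X]=\sin(X)$. The competitor $g(x)=0.5x$ and the function name `simulate` are likewise hypothetical.

```python
import math
import random


def m(x):
    """Conditional mean in the assumed toy model: E[Y | X = x] = sin(x)."""
    return math.sin(x)


def g(x):
    """An arbitrary competing predictor (hypothetical choice)."""
    return 0.5 * x


def simulate(n=100_000, seed=0):
    """Estimate the terms of the expansion
    E[(Y - g(X))^2] = E[(Y - Z)^2] + 2 E[(Y - Z)(Z - g(X))] + E[(Z - g(X))^2]
    by sample averages, where Z = m(X).
    """
    rng = random.Random(seed)
    risk_m = risk_g = gap = cross = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)            # X ~ N(0, 1)
        y = m(x) + rng.gauss(0.0, 0.5)     # Y = sin(X) + noise, sd 0.5
        z = m(x)                           # Z = E[Y | X]
        h = g(x)                           # H_g = g(X)
        risk_m += (y - z) ** 2             # squared error of the conditional mean
        risk_g += (y - h) ** 2             # squared error of the competitor
        gap += (z - h) ** 2                # squared distance between Z and H_g
        cross += (y - z) * (z - h)         # cross term, zero in expectation
    return {
        "risk_m": risk_m / n,
        "risk_g": risk_g / n,
        "gap": gap / n,
        "cross": cross / n,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.4f}")
```

In this sketch the sample average of the cross term is only approximately zero, since orthogonality holds in expectation rather than sample by sample. The estimated risk of $m$ should be close to the noise variance $0.25$, and the estimated risk of $g$ should exceed it by roughly the estimated value of $\mathbb E[(Z-g(X))^2]$.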
