Bias-Variance Decomposition for Squared Loss — Statement & Proof

Bias-Variance Decomposition for Squared Loss (Theorem # 4425)

Theorem

Proof

[proofplan] Both identities come from the same [orthogonal decomposition](/theorems/436) in $L^2$. For the constant predictor, decompose $Y-a$ into the centered fluctuation $Y-\mu$ and the deterministic bias $\mu-a$, then expand the square and use $\mathbb E[Y-\mu]=0$. For the conditional identity, decompose $Y-g(X)$ into the residual $Y-\mathbb E[Y\mid X]$ and the $\sigma(X)$-measurable bias $\mathbb E[Y\mid X]-g(X)$; the mixed term vanishes by the defining orthogonality of conditional expectation. [/proofplan]

[step:Expand the constant predictor around the mean] Define the centered random variable $Z:\Omega\to\mathbb R$ by $Z:=Y-\mu$, and define the constant $c:=\mu-a\in\mathbb R$. Since $Y\in L^2(\Omega,\mathcal F,\mathbb P)$ and constants are square-integrable on a probability space, $Z\in L^2(\Omega,\mathcal F,\mathbb P)$. We have \begin{align*} Y-a = (Y-\mu)+(\mu-a)=Z+c. \end{align*} Expanding the square and using linearity of expectation gives \begin{align*} \mathbb E[(Y-a)^2] &=\mathbb E[(Z+c)^2] \\ &=\mathbb E[Z^2]+2c\,\mathbb E[Z]+c^2. \end{align*} By the definition of $\mu$, \begin{align*} \mathbb E[Z]=\mathbb E[Y-\mu]=\mathbb E[Y]-\mu=0. \end{align*} Also, \begin{align*} \mathbb E[Z^2]=\mathbb E[(Y-\mu)^2]=\operatorname{Var}(Y). \end{align*} Substituting these two identities into the expansion, and recalling $c=\mu-a$, yields \begin{align*} \mathbb E[(Y-a)^2]=\operatorname{Var}(Y)+(\mu-a)^2. \end{align*} [/step]

[step:Decompose the conditional squared loss into residual and bias terms] Let $\mathcal G:=\sigma(X)$ be the $\sigma$-algebra generated by $X$. Define \begin{align*} M:\Omega&\to\mathbb R \\ \omega&\mapsto \mathbb E[Y\mid \mathcal G](\omega), \end{align*} choosing a fixed version of the conditional expectation. Since $Y\in L^2(\Omega,\mathcal F,\mathbb P)$, [Jensen's inequality](/theorems/9) for conditional expectation gives $M\in L^2(\Omega,\mathcal G,\mathbb P)$. Define the residual random variable $R:\Omega\to\mathbb R$ and the bias random variable $B:\Omega\to\mathbb R$ by \begin{align*} R:=Y-M, \qquad B:=M-g(X). \end{align*} Since $Y$, $M$, and $g(X)$ are square-integrable, $R,B\in L^2(\Omega,\mathcal F,\mathbb P)$, and \begin{align*} Y-g(X)=R+B. \end{align*} Expanding the square, where $RB$ is integrable by the [Cauchy-Schwarz inequality](/theorems/432), gives \begin{align*} \mathbb E[(Y-g(X))^2] = \mathbb E[R^2]+2\mathbb E[RB]+\mathbb E[B^2]. \end{align*} [/step]

[step:Show the conditional mixed term vanishes] The random variable $B=M-g(X)$ is $\mathcal G$-measurable and belongs to $L^2(\Omega,\mathcal G,\mathbb P)$. Since $R=Y-\mathbb E[Y\mid\mathcal G]$, the defining orthogonality property of conditional expectation gives \begin{align*} \mathbb E[R H]=0 \end{align*} for every bounded $\mathcal G$-measurable random variable $H:\Omega\to\mathbb R$. The truncations \begin{align*} B_n:=\max\{-n,\min\{B,n\}\}, \qquad n\in\mathbb N, \end{align*} are bounded and $\mathcal G$-measurable, so $\mathbb E[R B_n]=0$ for every $n\in\mathbb N$. The [Cauchy-Schwarz inequality](/theorems/432) gives \begin{align*} \mathbb E[|R(B_n-B)|] \le \mathbb E[R^2]^{1/2}\mathbb E[(B_n-B)^2]^{1/2}. \end{align*} Since $B_n\to B$ pointwise and $|B_n|\le |B|$, we have $(B_n-B)^2\le 4B^2$ with $B^2$ integrable, so the [dominated convergence theorem](/theorems/4) gives $\mathbb E[(B_n-B)^2]\to 0$. Consequently \begin{align*} |\mathbb E[RB]| = |\mathbb E[RB]-\mathbb E[RB_n]| \le \mathbb E[|R(B_n-B)|] \to 0 \qquad (n\to\infty), \end{align*} and hence $\mathbb E[RB]=0$. [/step]

[step:Identify the residual term with expected conditional variance] By the definition of conditional variance relative to $\mathcal G=\sigma(X)$, \begin{align*} \operatorname{Var}(Y\mid X) = \mathbb E[(Y-M)^2\mid \mathcal G] = \mathbb E[R^2\mid \mathcal G]. \end{align*} Taking expectations and using the [tower property of conditional expectation](/theorems/1150) gives \begin{align*} \mathbb E[\operatorname{Var}(Y\mid X)] = \mathbb E[\mathbb E[R^2\mid \mathcal G]] = \mathbb E[R^2]. \end{align*} Also, since $M=\mathbb E[Y\mid X]$, \begin{align*} \mathbb E[B^2] = \mathbb E[(M-g(X))^2] = \mathbb E[(\mathbb E[Y\mid X]-g(X))^2]. \end{align*} Substituting $\mathbb E[RB]=0$ and the two identifications above into the expansion gives \begin{align*} \mathbb E[(Y-g(X))^2] = \mathbb E[\operatorname{Var}(Y\mid X)] + \mathbb E[(\mathbb E[Y\mid X]-g(X))^2]. \end{align*} This is the desired conditional [bias-variance decomposition](/theorems/1424). [/step]
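
As a sanity check on the two identities proved above, the following minimal sketch verifies them exactly on a small finite probability space using rational arithmetic. The joint distribution `pmf`, the constant `a`, and the predictor `g` are hypothetical choices made only for illustration; they are not part of the theorem. On a finite space every random variable is square-integrable, so the hypotheses of the proof hold trivially.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y); the values are illustrative only.
pmf = {
    (0, 1): F(1, 8),
    (0, 3): F(3, 8),
    (1, 2): F(1, 4),
    (1, 6): F(1, 4),
}

def expect(f):
    """Return E[f(X, Y)] under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())

def cond_mean(x0):
    """Return E[Y | X = x0] = E[Y 1{X = x0}] / P(X = x0)."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return sum(p * y for (x, y), p in pmf.items() if x == x0) / px

def g(x):
    """Hypothetical predictor; any function of x would do."""
    return F(2) + x

a = F(5, 2)                                   # hypothetical constant predictor
mu = expect(lambda x, y: y)                   # mu = E[Y]
var_y = expect(lambda x, y: (y - mu) ** 2)    # Var(Y)

# Constant-predictor identity: E[(Y-a)^2] = Var(Y) + (mu-a)^2.
lhs_const = expect(lambda x, y: (y - a) ** 2)
rhs_const = var_y + (mu - a) ** 2

# Conditional identity, with R = Y - M and B = M - g(X), where M = E[Y | X].
cross = expect(lambda x, y: (y - cond_mean(x)) * (cond_mean(x) - g(x)))  # E[RB]
exp_cond_var = expect(lambda x, y: (y - cond_mean(x)) ** 2)              # E[R^2]
bias_sq = expect(lambda x, y: (cond_mean(x) - g(x)) ** 2)                # E[B^2]
lhs_cond = expect(lambda x, y: (y - g(x)) ** 2)

print(lhs_const == rhs_const)
print(cross == 0)
print(lhs_cond == exp_cond_var + bias_sq)
```

Because `Fraction` arithmetic is exact, the comparisons test the identities themselves rather than a floating-point approximation. Changing `g` or `a` alters the bias terms but leaves `cross` equal to zero, which mirrors the orthogonality step of the proof.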

