Bias-Variance Decomposition — Statement & Proof

Bias-Variance Decomposition (Theorem # 1424)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The proof is a direct algebraic expansion of the squared error. We add and subtract the mean $\mathbb{E}_\theta[\hat\theta]$ inside the square, expand the resulting binomial, and identify the three terms. The cross term vanishes because the centred estimator $\hat\theta - \mathbb{E}_\theta[\hat\theta]$ has mean zero, and the factor it is paired with, $\mathbb{E}_\theta[\hat\theta] - \theta$, is a deterministic scalar that can be pulled out of the expectation. The two surviving terms are exactly the variance of $\hat\theta$ and the square of its bias. [/proofplan] [step:Decompose the error into a stochastic and a deterministic piece] Let $\hat\theta: \mathcal{X}^n \to \Theta$ be a measurable estimator based on a sample $X_1, \ldots, X_n$ from a model parameterised by $\theta \in \Theta \subseteq \mathbb{R}$, and assume $\mathbb{E}_\theta[\hat\theta^2] < \infty$ so that all relevant moments exist. Define \begin{align*} m(\theta) &:= \mathbb{E}_\theta[\hat\theta], \\ Z(\theta) &:= \hat\theta - m(\theta), \\ b(\theta) &:= m(\theta) - \theta. \end{align*} Then $Z(\theta)$ is a random variable with $\mathbb{E}_\theta[Z(\theta)] = 0$ by construction, and $b(\theta)$ is the deterministic bias of $\hat\theta$ at $\theta$. The estimation error decomposes as \begin{align*} \hat\theta - \theta = Z(\theta) + b(\theta). \end{align*} [guided] We begin with the mean squared error \begin{align*} \operatorname{MSE}_\theta(\hat\theta) = \mathbb{E}_\theta[(\hat\theta - \theta)^2]. \end{align*} The expression $(\hat\theta - \theta)^2$ mixes two kinds of deviation: the random fluctuation of $\hat\theta$ around its own mean, and the systematic offset between that mean and the true parameter. We want to separate them. The standard device is to insert the mean of $\hat\theta$: \begin{align*} \hat\theta - \theta = \underbrace{\bigl(\hat\theta - \mathbb{E}_\theta[\hat\theta]\bigr)}_{\text{random, mean zero}} + \underbrace{\bigl(\mathbb{E}_\theta[\hat\theta] - \theta\bigr)}_{\text{deterministic bias}}. \end{align*} To keep notation clean, name the pieces. Let $\hat\theta: \mathcal{X}^n \to \Theta$ be the estimator and assume $\mathbb{E}_\theta[\hat\theta^2] < \infty$ so that the variance, the mean, and the MSE are all finite — this is the integrability hypothesis that makes every expectation below well-defined. Define \begin{align*} m(\theta) &:= \mathbb{E}_\theta[\hat\theta], \\ Z(\theta) &:= \hat\theta - m(\theta), \\ b(\theta) &:= m(\theta) - \theta. \end{align*} Here $m(\theta)$ is the expected value of $\hat\theta$ when the data are generated by $\mathbb{P}_\theta$, $Z(\theta)$ is the centred version of $\hat\theta$, and $b(\theta)$ is the bias. By linearity of expectation, $\mathbb{E}_\theta[Z(\theta)] = \mathbb{E}_\theta[\hat\theta] - m(\theta) = 0$. And since $m(\theta)$ and $\theta$ are both non-random (under the fixed true parameter $\theta$), $b(\theta)$ is deterministic. We have \begin{align*} \hat\theta - \theta = \bigl(\hat\theta - m(\theta)\bigr) + \bigl(m(\theta) - \theta\bigr) = Z(\theta) + b(\theta). \end{align*} [/guided] [/step] [step:Expand the square and take expectations term by term] Squaring the decomposition from the previous step and taking the expectation under $\mathbb{P}_\theta$: \begin{align*} \mathbb{E}_\theta\bigl[(\hat\theta - \theta)^2\bigr] &= \mathbb{E}_\theta\bigl[(Z(\theta) + b(\theta))^2\bigr] \\ &= \mathbb{E}_\theta\bigl[Z(\theta)^2\bigr] + 2\,b(\theta)\,\mathbb{E}_\theta\bigl[Z(\theta)\bigr] + b(\theta)^2. \end{align*} The second equality uses linearity of expectation and the fact that $b(\theta)$ is a constant (with respect to the randomness under $\mathbb{P}_\theta$), so it factors out of each expectation. [guided] We now compute $\mathbb{E}_\theta[(\hat\theta - \theta)^2]$ using the decomposition $\hat\theta - \theta = Z(\theta) + b(\theta)$ from the previous step. Squaring the binomial, \begin{align*} (\hat\theta - \theta)^2 = Z(\theta)^2 + 2\,Z(\theta)\,b(\theta) + b(\theta)^2. \end{align*} Next apply $\mathbb{E}_\theta[\cdot]$. Linearity of expectation lets us move it past the sum, and the key observation is that $b(\theta)$ is a deterministic scalar: it does not depend on the data, so it behaves like a constant inside the expectation. Therefore $\mathbb{E}_\theta[Z(\theta)\,b(\theta)] = b(\theta)\,\mathbb{E}_\theta[Z(\theta)]$, and $\mathbb{E}_\theta[b(\theta)^2] = b(\theta)^2$. Collecting, \begin{align*} \mathbb{E}_\theta\bigl[(\hat\theta - \theta)^2\bigr] = \mathbb{E}_\theta[Z(\theta)^2] + 2\,b(\theta)\,\mathbb{E}_\theta[Z(\theta)] + b(\theta)^2. \end{align*} This has three terms; we examine each next. [/guided] [/step] [step:Identify the cross term as zero] By the construction of $Z(\theta) = \hat\theta - m(\theta)$ in the first step, $\mathbb{E}_\theta[Z(\theta)] = 0$. Therefore the middle term vanishes: \begin{align*} 2\,b(\theta)\,\mathbb{E}_\theta[Z(\theta)] = 0. \end{align*} [guided] The cross term is $2\,b(\theta)\,\mathbb{E}_\theta[Z(\theta)]$. Recall that $Z(\theta) = \hat\theta - m(\theta)$ and $m(\theta) = \mathbb{E}_\theta[\hat\theta]$, so \begin{align*} \mathbb{E}_\theta[Z(\theta)] = \mathbb{E}_\theta[\hat\theta] - m(\theta) = m(\theta) - m(\theta) = 0. \end{align*} This is the entire reason for centring: subtracting the mean makes the first moment vanish, and the cross term therefore drops out. Without this cancellation, MSE would not split cleanly into variance and squared bias — it would also carry a covariance-like interaction term. Here the cross term is exactly zero. [/guided] [/step] [step:Identify the remaining terms as variance and squared bias] The first surviving term is \begin{align*} \mathbb{E}_\theta\bigl[Z(\theta)^2\bigr] = \mathbb{E}_\theta\bigl[(\hat\theta - \mathbb{E}_\theta[\hat\theta])^2\bigr] = \operatorname{Var}_\theta(\hat\theta), \end{align*} by the definition of variance. The second surviving term is \begin{align*} b(\theta)^2 = \bigl(\mathbb{E}_\theta[\hat\theta] - \theta\bigr)^2 = \operatorname{bias}_\theta(\hat\theta)^2, \end{align*} by the definition of bias. Substituting into the expansion from the second step and using the vanishing of the cross term established in the third step, \begin{align*} \mathbb{E}_\theta\bigl[(\hat\theta - \theta)^2\bigr] = \operatorname{Var}_\theta(\hat\theta) + \operatorname{bias}_\theta(\hat\theta)^2. \end{align*} This is the bias-variance decomposition, and completes the proof. [/step]

Explore Further

Consistency of the MLE Statistics MLE Admissibility and Minimaxity in the Gaussian Model Statistics Total Sum of Squares Decomposition Statistics t-Distribution of Normalised Coefficient Estimate Statistics Scalar Asymptotic Normality Statistics Chi-Squared Distribution of RSS Statistics Bootstrap Consistency Statistics Bayes Risk Bounds Maximal Risk Statistics

What brings you to Androma?

Start with a route through the knowledge graph.

Bias-Variance Decomposition (Theorem # 1424)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Bias-Variance Decomposition (Theorem # 1424)

Discussion

Proof

Explore Further