Unbiasedness of the Ordinary Least Squares Error Variance Estimator

Unbiasedness of the Ordinary Least Squares Error Variance Estimator (Theorem # 4437)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We express the ordinary least squares residual vector as the projection of the error vector onto the orthogonal complement of the column space of $X$. The residual sum of squares then becomes a quadratic form $\varepsilon^\top M\varepsilon$, where $M$ is the residual-maker matrix. Expanding this quadratic form entry by entry under conditional expectation gives $\mathbb{E}[\varepsilon^\top M\varepsilon \mid X]=\sigma^2\operatorname{tr}(M)$. Finally, the trace of $M$ is $n-p$, so dividing by $n-p$ gives conditional unbiasedness. [/proofplan] [step:Define the projection and residual-maker matrices] Since $\operatorname{rank}(X)=p$ almost surely, the matrix $X^\top X \in \mathbb{R}^{p\times p}$ is invertible almost surely. Define the hat matrix $H \in \mathbb{R}^{n\times n}$ and residual-maker matrix $M \in \mathbb{R}^{n\times n}$ by \begin{align*} H = X(X^\top X)^{-1}X^\top, \qquad M = I_n - H. \end{align*} The matrix $H$ is symmetric because \begin{align*} H^\top = \left(X(X^\top X)^{-1}X^\top\right)^\top = X\left((X^\top X)^{-1}\right)^\top X^\top = X(X^\top X)^{-1}X^\top = H, \end{align*} where $X^\top X$ is symmetric and hence $(X^\top X)^{-1}$ is symmetric. Also, \begin{align*} H^2 &= X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\ &= X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\ &= X(X^\top X)^{-1}X^\top = H. \end{align*} Thus $H$ is an idempotent symmetric projection matrix, and $M=I_n-H$ is also symmetric and idempotent. [guided] Because $X$ has full column rank $p$, the Gram matrix $X^\top X$ is invertible: if $v \in \mathbb{R}^p$ satisfies $(X^\top X)v=0$, then \begin{align*} 0 = v^\top X^\top Xv = |Xv|^2, \end{align*} so $Xv=0$, and full column rank gives $v=0$. Therefore $X^\top X$ is nonsingular. Define \begin{align*} H = X(X^\top X)^{-1}X^\top, \qquad M = I_n - H. \end{align*} The matrix $H$ is the [orthogonal projection](/theorems/437) onto the column space of $X$, and $M$ is the projection onto its orthogonal complement. We verify the algebraic facts needed later. First, \begin{align*} H^\top = \left(X(X^\top X)^{-1}X^\top\right)^\top = X\left((X^\top X)^{-1}\right)^\top X^\top = X(X^\top X)^{-1}X^\top = H, \end{align*} because $X^\top X$ and its inverse are symmetric. Second, \begin{align*} H^2 &= X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\ &= X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\ &= X(X^\top X)^{-1}X^\top = H. \end{align*} So $H$ is symmetric and idempotent. Since $M=I_n-H$, we also have $M^\top=M$ and \begin{align*} M^2 = (I_n-H)^2 = I_n-2H+H^2 = I_n-H = M. \end{align*} These identities are what allow the residual norm to be written as a single quadratic form. [/guided] [/step] [step:Rewrite the residual sum of squares as a quadratic form in the error] The fitted value is \begin{align*} X\hat{\beta} = X(X^\top X)^{-1}X^\top y = Hy. \end{align*} Therefore the residual vector $r \in \mathbb{R}^n$ defined by $r=y-X\hat{\beta}$ satisfies \begin{align*} r = y-Hy = My. \end{align*} Since \begin{align*} HX = X(X^\top X)^{-1}X^\top X = X, \end{align*} we have $MX=0$. Using $y=X\beta+\varepsilon$, this gives \begin{align*} r = My = M(X\beta+\varepsilon) = MX\beta+M\varepsilon = M\varepsilon. \end{align*} Because $M$ is symmetric and idempotent, \begin{align*} \operatorname{RSS} = |r|^2 = (M\varepsilon)^\top(M\varepsilon) = \varepsilon^\top M^\top M\varepsilon = \varepsilon^\top M\varepsilon. \end{align*} [guided] We now connect least squares residuals to the error vector. The ordinary least squares fitted value is \begin{align*} X\hat{\beta} = X(X^\top X)^{-1}X^\top y = Hy. \end{align*} Thus the residual vector $r \in \mathbb{R}^n$ is \begin{align*} r = y-X\hat{\beta} = y-Hy = My. \end{align*} The reason this is useful is that $M$ kills every vector in the column space of $X$. Indeed, \begin{align*} HX = X(X^\top X)^{-1}X^\top X = X, \end{align*} and hence \begin{align*} MX = (I_n-H)X = X-HX = 0. \end{align*} Since $y=X\beta+\varepsilon$, applying $M$ gives \begin{align*} r = My = M(X\beta+\varepsilon) = MX\beta+M\varepsilon = M\varepsilon. \end{align*} So the residuals depend only on the component of the error vector orthogonal to the column space of $X$. Finally, by definition, \begin{align*} \operatorname{RSS} = |r|^2. \end{align*} Using $r=M\varepsilon$, symmetry of $M$, and idempotence of $M$, we obtain \begin{align*} \operatorname{RSS} = (M\varepsilon)^\top(M\varepsilon) = \varepsilon^\top M^\top M\varepsilon = \varepsilon^\top M^2\varepsilon = \varepsilon^\top M\varepsilon. \end{align*} This is the form to which the conditional variance assumption applies. [/guided] [/step] [step:Compute the conditional expectation of the quadratic form] Write $M=(M_{ij})_{1\leq i,j\leq n}$ and $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top$. Since $M$ is a measurable function of $X$, its entries are fixed under conditioning on $X$. Expanding the quadratic form, \begin{align*} \mathbb{E}[\varepsilon^\top M\varepsilon \mid X] &= \mathbb{E}\left[\sum_{i=1}^n\sum_{j=1}^n M_{ij}\varepsilon_i\varepsilon_j \,\middle|\, X\right] \\ &= \sum_{i=1}^n\sum_{j=1}^n M_{ij}\mathbb{E}[\varepsilon_i\varepsilon_j \mid X]. \end{align*} The conditional mean assumption gives $\mathbb{E}[\varepsilon_i\mid X]=0$ for each $i$, and the conditional variance assumption gives \begin{align*} \mathbb{E}[\varepsilon_i\varepsilon_j \mid X] = \operatorname{Cov}(\varepsilon_i,\varepsilon_j\mid X) = \sigma^2\mathbb{1}_{\{i=j\}}. \end{align*} Therefore \begin{align*} \mathbb{E}[\varepsilon^\top M\varepsilon \mid X] &= \sum_{i=1}^n\sum_{j=1}^n M_{ij}\sigma^2\mathbb{1}_{\{i=j\}} \\ &= \sigma^2\sum_{i=1}^n M_{ii} = \sigma^2\operatorname{tr}(M). \end{align*} [guided] The conditional expectation is computed entry by entry. Write \begin{align*} M=(M_{ij})_{1\leq i,j\leq n}, \qquad \varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top. \end{align*} Since $M=I_n-X(X^\top X)^{-1}X^\top$ is determined by $X$, each entry $M_{ij}$ is measurable with respect to the information contained in $X$. Therefore, inside a conditional expectation given $X$, the entries of $M$ behave like known coefficients. Expanding the quadratic form gives \begin{align*} \varepsilon^\top M\varepsilon = \sum_{i=1}^n\sum_{j=1}^n M_{ij}\varepsilon_i\varepsilon_j. \end{align*} Taking conditional expectation and pulling out the $X$-measurable coefficients, \begin{align*} \mathbb{E}[\varepsilon^\top M\varepsilon \mid X] &= \mathbb{E}\left[\sum_{i=1}^n\sum_{j=1}^n M_{ij}\varepsilon_i\varepsilon_j \,\middle|\, X\right] \\ &= \sum_{i=1}^n\sum_{j=1}^n M_{ij}\mathbb{E}[\varepsilon_i\varepsilon_j \mid X]. \end{align*} Now use the assumptions on the conditional mean and conditional variance. Because $\mathbb{E}[\varepsilon\mid X]=0$, each component satisfies $\mathbb{E}[\varepsilon_i\mid X]=0$. Hence \begin{align*} \mathbb{E}[\varepsilon_i\varepsilon_j\mid X] = \operatorname{Cov}(\varepsilon_i,\varepsilon_j\mid X). \end{align*} The assumption $\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_n$ says exactly that this conditional covariance is $\sigma^2$ when $i=j$ and $0$ when $i\neq j$, so \begin{align*} \mathbb{E}[\varepsilon_i\varepsilon_j \mid X] = \sigma^2\mathbb{1}_{\{i=j\}}. \end{align*} Substituting this into the expanded quadratic form, \begin{align*} \mathbb{E}[\varepsilon^\top M\varepsilon \mid X] &= \sum_{i=1}^n\sum_{j=1}^n M_{ij}\sigma^2\mathbb{1}_{\{i=j\}} \\ &= \sigma^2\sum_{i=1}^n M_{ii} = \sigma^2\operatorname{tr}(M). \end{align*} Thus the conditional expected residual sum of squares is controlled by the trace of the residual-maker matrix. [/guided] [/step] [step:Evaluate the trace of the residual-maker matrix] Using the cyclic invariance of trace for compatible finite-dimensional matrices, \begin{align*} \operatorname{tr}(H) &= \operatorname{tr}\left(X(X^\top X)^{-1}X^\top\right) \\ &= \operatorname{tr}\left((X^\top X)^{-1}X^\top X\right) \\ &= \operatorname{tr}(I_p) = p. \end{align*} Therefore \begin{align*} \operatorname{tr}(M) = \operatorname{tr}(I_n-H) = \operatorname{tr}(I_n)-\operatorname{tr}(H) = n-p. \end{align*} Combining this with the previous step, \begin{align*} \mathbb{E}[\operatorname{RSS}\mid X] = \mathbb{E}[\varepsilon^\top M\varepsilon\mid X] = \sigma^2(n-p). \end{align*} Since $p<n$, division by $n-p$ is valid, and hence \begin{align*} \mathbb{E}[\hat{\sigma}^2\mid X] = \mathbb{E}\left[\frac{\operatorname{RSS}}{n-p}\,\middle|\,X\right] = \frac{1}{n-p}\mathbb{E}[\operatorname{RSS}\mid X] = \sigma^2. \end{align*} This proves that $\hat{\sigma}^2$ is conditionally unbiased for $\sigma^2$. [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Expectation

Explore Further

Expectation Definition Factorisation Criterion for Independence Probability Theory Independence Is Preserved Under Complements Probability Theory Properties of the Discrete Conditional Expectation Conditional Expectation Extreme Information Levels Probability & Statistics Bonferroni Inequalities Probability Theory Cauchy-Schwarz Inequality for Covariance Probability Theory Properties of Variance Probability Theory Skorokhod Embedding Theorem Brownian Motion Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.