Gauss–Markov — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We prove that the ordinary least squares estimator $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ has the smallest covariance matrix (in the positive semi-definite Löwner order) among all linear unbiased estimators of $\beta$. The strategy is a covariance decomposition. An arbitrary linear estimator $\tilde{\beta} = CY$ is decomposed as OLS plus a correction, $C = (X^\top X)^{-1}X^\top + D$, and unbiasedness is translated into the orthogonality condition $DX = \mathbf{0}$. The cross-terms in the covariance $\sigma^2 CC^\top$ then vanish, leaving $\operatorname{Cov}(\tilde{\beta}) = \operatorname{Cov}(\hat{\beta}) + \sigma^2 DD^\top$. Since $DD^\top$ is positive semi-definite, OLS is the unique minimiser up to $D = \mathbf{0}$. [/proofplan] [step:Set up the model and the candidate estimator] We work in the classical linear model: \begin{align*} Y: \Omega \to \mathbb{R}^n, \qquad Y = X\beta + \varepsilon, \end{align*} where $X \in \mathbb{R}^{n \times p}$ is a deterministic design matrix of full column rank $p \leq n$ (so $X^\top X \in \mathbb{R}^{p \times p}$ is invertible), $\beta \in \mathbb{R}^p$ is the unknown parameter, and $\varepsilon: \Omega \to \mathbb{R}^n$ is the error vector satisfying the Gauss–Markov conditions \begin{align*} \mathbb{E}[\varepsilon] = \mathbf{0}, \qquad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n, \end{align*} with $\sigma^2 > 0$. No normality is assumed. The **ordinary least squares** estimator is $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. We consider an arbitrary competing **linear** estimator \begin{align*} \tilde{\beta}: \Omega &\to \mathbb{R}^p \\ \omega &\mapsto CY(\omega), \end{align*} where $C \in \mathbb{R}^{p \times n}$ is a deterministic matrix. Our goal is to show $\operatorname{Cov}(\tilde{\beta}) \succeq \operatorname{Cov}(\hat{\beta})$ in the Löwner order (i.e. the difference is positive semi-definite) whenever $\tilde{\beta}$ is unbiased, with equality only if $\tilde{\beta} = \hat{\beta}$. [/step] [step:Translate unbiasedness of $\tilde{\beta}$ into the matrix identity $CX = I_p$] By linearity of expectation, \begin{align*} \mathbb{E}[\tilde{\beta}] = \mathbb{E}[CY] = C\,\mathbb{E}[Y] = C X\beta. \end{align*} Unbiasedness of $\tilde{\beta}$ means $\mathbb{E}[\tilde{\beta}] = \beta$ for **every** $\beta \in \mathbb{R}^p$. That is, \begin{align*} CX\beta = \beta \qquad \text{for all } \beta \in \mathbb{R}^p, \end{align*} which is equivalent to the matrix identity \begin{align*} CX = I_p. \end{align*} [guided] The condition "$\tilde{\beta}$ is unbiased for $\beta$" means $\mathbb{E}_{\beta}[\tilde{\beta}] = \beta$ for all $\beta$ — the estimator's expected value is the true parameter, whatever the true parameter happens to be. Computing the expectation: \begin{align*} \mathbb{E}[\tilde{\beta}] = \mathbb{E}[CY] = \mathbb{E}[C(X\beta + \varepsilon)] = CX\beta + C\,\mathbb{E}[\varepsilon] = CX\beta, \end{align*} using $\mathbb{E}[\varepsilon] = \mathbf{0}$ (a Gauss–Markov condition). So the unbiasedness requirement is \begin{align*} CX\beta = \beta \qquad \text{for all } \beta \in \mathbb{R}^p. \end{align*} This is a linear identity in $\beta$. Two linear maps $\mathbb{R}^p \to \mathbb{R}^p$ agree on all vectors iff they are equal as matrices, so the condition is $CX = I_p$. **Geometric interpretation.** Think of $C: \mathbb{R}^n \to \mathbb{R}^p$ as a linear map and $X: \mathbb{R}^p \to \mathbb{R}^n$ as its "right inverse candidate." Unbiasedness says $C$ is a left inverse of $X$: $C \circ X = \operatorname{id}_{\mathbb{R}^p}$. Equivalently, $C$ restricted to $\operatorname{Range}(X) \subseteq \mathbb{R}^n$ inverts $X$ (the action of $C$ on the complement of $\operatorname{Range}(X)$ is not constrained by this condition — which is where the flexibility to vary $C$ lives, and where OLS attains its optimality). [/guided] [/step] [step:Decompose $C$ as OLS plus a correction $D$ and derive $DX = \mathbf{0}$] Define the correction matrix \begin{align*} D := C - (X^\top X)^{-1} X^\top \in \mathbb{R}^{p \times n}, \end{align*} so that $C = (X^\top X)^{-1} X^\top + D$. Multiplying the unbiasedness identity $CX = I_p$ on the right-hand side using this decomposition, \begin{align*} I_p = CX = \bigl[(X^\top X)^{-1}X^\top + D\bigr]X = (X^\top X)^{-1}(X^\top X) + DX = I_p + DX, \end{align*} which forces \begin{align*} DX = \mathbf{0}_{p \times p}. \end{align*} [/step] [step:Expand the covariance of $\tilde{\beta}$ and show the cross-terms vanish] By the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) covariance rule (the covariance-transformation part does not require normality, only the Gauss–Markov second-moment conditions), \begin{align*} \operatorname{Cov}(\tilde{\beta}) = \operatorname{Cov}(CY) = C\,\operatorname{Cov}(Y)\,C^\top = C(\sigma^2 I_n)C^\top = \sigma^2 C C^\top, \end{align*} using $\operatorname{Cov}(Y) = \operatorname{Cov}(\varepsilon) = \sigma^2 I_n$ (the Gauss–Markov second condition and the fact that adding a constant vector $X\beta$ does not change covariance). Substituting $C = (X^\top X)^{-1} X^\top + D$ and expanding, \begin{align*} CC^\top &= \bigl[(X^\top X)^{-1} X^\top + D\bigr]\bigl[(X^\top X)^{-1} X^\top + D\bigr]^\top \\ &= (X^\top X)^{-1} X^\top \bigl[(X^\top X)^{-1}X^\top\bigr]^\top + (X^\top X)^{-1} X^\top D^\top + D \bigl[(X^\top X)^{-1}X^\top\bigr]^\top + DD^\top. \end{align*} We evaluate each term: **Term 1.** Using $[(X^\top X)^{-1}]^\top = (X^\top X)^{-1}$ (since $X^\top X$ is symmetric), \begin{align*} (X^\top X)^{-1} X^\top [(X^\top X)^{-1}X^\top]^\top = (X^\top X)^{-1} X^\top X (X^\top X)^{-1} = (X^\top X)^{-1}. \end{align*} **Term 2.** Using $DX = \mathbf{0}$, taking the transpose gives $X^\top D^\top = \mathbf{0}$, so \begin{align*} (X^\top X)^{-1} X^\top D^\top = (X^\top X)^{-1}\cdot \mathbf{0} = \mathbf{0}. \end{align*} **Term 3.** Again using $DX = \mathbf{0}$, \begin{align*} D\bigl[(X^\top X)^{-1} X^\top\bigr]^\top = D X (X^\top X)^{-1} = \mathbf{0}\cdot (X^\top X)^{-1} = \mathbf{0}. \end{align*} **Term 4.** Remains as $DD^\top$. Combining, \begin{align*} \operatorname{Cov}(\tilde{\beta}) = \sigma^2 C C^\top = \sigma^2\bigl[(X^\top X)^{-1} + DD^\top\bigr]. \end{align*} [guided] The decomposition $C = (X^\top X)^{-1}X^\top + D$ is not chosen arbitrarily: it splits $C$ into "the OLS part" and "everything else." The point is that the orthogonality condition $DX = \mathbf{0}$ — which is exactly the unbiasedness constraint, once we have committed to this decomposition — causes the cross-terms to drop. Let us see this cross-term vanishing from a more structural angle. Expanding $CC^\top$ bilinearly, \begin{align*} CC^\top = \underbrace{(X^\top X)^{-1}X^\top \cdot [(X^\top X)^{-1}X^\top]^\top}_{\text{OLS}\times\text{OLS}} + \underbrace{(X^\top X)^{-1}X^\top \cdot D^\top}_{\text{cross 1}} + \underbrace{D\cdot [(X^\top X)^{-1}X^\top]^\top}_{\text{cross 2}} + \underbrace{D D^\top}_{\text{correction}}. \end{align*} **OLS × OLS term.** Simplifies using $(A^\top X A)^{-1} = \ldots$ algebra: \begin{align*} (X^\top X)^{-1}X^\top \cdot X (X^\top X)^{-1} = (X^\top X)^{-1} (X^\top X) (X^\top X)^{-1} = (X^\top X)^{-1}. \end{align*} This is already the covariance of $\hat{\beta}$ divided by $\sigma^2$: indeed $\operatorname{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1}$. **Cross term 1.** We need $(X^\top X)^{-1}X^\top D^\top = (X^\top X)^{-1}(DX)^\top = (X^\top X)^{-1}\mathbf{0}^\top = \mathbf{0}$. The bridge is $X^\top D^\top = (DX)^\top$, and $DX = \mathbf{0}$ by unbiasedness. **Cross term 2.** Similarly $D\cdot X(X^\top X)^{-1} = (DX)(X^\top X)^{-1} = \mathbf{0}$. Both cross terms vanish because $DX = \mathbf{0}$ — and $DX = \mathbf{0}$ is precisely the constraint imposed by unbiasedness (Step 3). This is the heart of the proof: **unbiasedness forces the correction $D$ to be orthogonal to the column space of $X$, which is exactly the space where $\hat{\beta}$ "lives," and orthogonality eliminates cross-covariance.** **Correction term.** $DD^\top$ is left as-is. It is a $p \times p$ symmetric matrix, and, crucially, it is positive semi-definite (see next step). Putting it all together, \begin{align*} \operatorname{Cov}(\tilde{\beta}) = \sigma^2(X^\top X)^{-1} + \sigma^2 DD^\top = \operatorname{Cov}(\hat{\beta}) + \sigma^2 DD^\top. \end{align*} Any competing unbiased linear estimator has a covariance that is "OLS covariance plus a positive semi-definite correction." [/guided] [/step] [step:Identify $\sigma^2(X^\top X)^{-1}$ as $\operatorname{Cov}(\hat{\beta})$ and establish the Löwner inequality] Specialising the covariance formula above to $D = \mathbf{0}$ (i.e. $C = (X^\top X)^{-1}X^\top$, so $\tilde{\beta} = \hat{\beta}$) gives \begin{align*} \operatorname{Cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}. \end{align*} Therefore for any unbiased linear estimator $\tilde{\beta}$, \begin{align*} \operatorname{Cov}(\tilde{\beta}) - \operatorname{Cov}(\hat{\beta}) = \sigma^2 DD^\top. \end{align*} The matrix $DD^\top \in \mathbb{R}^{p \times p}$ is positive semi-definite: for any $v \in \mathbb{R}^p$, \begin{align*} v^\top DD^\top v = (D^\top v)^\top (D^\top v) = \|D^\top v\|^2 \geq 0. \end{align*} Multiplying by $\sigma^2 > 0$ preserves positive semi-definiteness, so $\sigma^2 DD^\top \succeq 0$. This gives the Löwner inequality \begin{align*} \operatorname{Cov}(\tilde{\beta}) \succeq \operatorname{Cov}(\hat{\beta}). \end{align*} [/step] [step:Establish the equality case and conclude uniqueness] Equality $\operatorname{Cov}(\tilde{\beta}) = \operatorname{Cov}(\hat{\beta})$ in the Löwner order holds iff $\sigma^2 DD^\top = \mathbf{0}$, iff $DD^\top = \mathbf{0}$. By the identity $v^\top DD^\top v = \|D^\top v\|^2$ from the previous step, $DD^\top = \mathbf{0}$ iff $D^\top v = \mathbf{0}$ for all $v \in \mathbb{R}^p$, iff $D^\top = \mathbf{0}$, iff $D = \mathbf{0}$. Thus equality holds precisely when $C = (X^\top X)^{-1} X^\top$, i.e. $\tilde{\beta} = \hat{\beta}$ almost surely. Combining the previous two steps: for every linear unbiased estimator $\tilde{\beta}$ of $\beta$, \begin{align*} \operatorname{Cov}(\tilde{\beta}) \succeq \operatorname{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1}, \end{align*} with equality iff $\tilde{\beta} = \hat{\beta}$. This establishes $\hat{\beta}$ as the **unique** best linear unbiased estimator (BLUE) of $\beta$, completing the proof. [guided] We have shown $\operatorname{Cov}(\tilde{\beta}) - \operatorname{Cov}(\hat{\beta}) = \sigma^2 DD^\top \succeq 0$. To complete the uniqueness claim we analyse when equality holds. **Step A.** $\sigma^2 DD^\top = \mathbf{0}$ iff $DD^\top = \mathbf{0}$, since $\sigma^2 > 0$. **Step B.** $DD^\top = \mathbf{0}$ iff $D = \mathbf{0}$. The forward direction is the interesting one: if $DD^\top = \mathbf{0}$, then $v^\top DD^\top v = 0$ for all $v$, but $v^\top DD^\top v = \|D^\top v\|^2$, so $D^\top v = \mathbf{0}$ for every $v \in \mathbb{R}^p$. This means $D^\top$ is the zero linear map, hence $D = \mathbf{0}$. The converse is trivial. **Step C.** $D = \mathbf{0}$ iff $C = (X^\top X)^{-1}X^\top$, by definition of $D$. And this is iff $\tilde{\beta} = CY = (X^\top X)^{-1}X^\topY = \hat{\beta}$. So the "best" in "best linear unbiased estimator" is sharp: among all linear estimators that are unbiased for every $\beta$, the OLS estimator $\hat{\beta}$ strictly dominates (in the Löwner order) any competitor that is not itself OLS. **What this means in practice.** For any fixed linear functional $c^\top \beta$ of the parameter (with $c \in \mathbb{R}^p$), the variance of the plug-in estimator $c^\top \tilde{\beta}$ is \begin{align*} \operatorname{Var}(c^\top \tilde{\beta}) = c^\top \operatorname{Cov}(\tilde{\beta})c \geq c^\top \operatorname{Cov}(\hat{\beta})c = \operatorname{Var}(c^\top \hat{\beta}). \end{align*} So OLS is optimal not just for estimating $\beta$ itself but for any linear contrast of the regression coefficients — predictions, differences of coefficients, etc. — simultaneously, uniformly across all such contrasts. This uniform optimality is the practical content of Gauss–Markov. **What is NOT assumed.** Normality of $\varepsilon$ is not used anywhere in this proof. The only requirements on $\varepsilon$ are the first two moments: $\mathbb{E}[\varepsilon] = \mathbf{0}$ and $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$. This is a considerable strengthening over, e.g., maximum likelihood optimality under normality. [/guided] [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Gauss–Markov (Theorem # 1437)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Gauss–Markov (Theorem # 1437)

Discussion

Proof

Explore Further