[proofplan]
We prove that the ordinary least squares estimator $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ has the smallest covariance matrix (in the positive semi-definite Löwner order) among all linear unbiased estimators of $\beta$. The strategy is a covariance decomposition. An arbitrary linear estimator $\tilde{\beta} = CY$ is decomposed as OLS plus a correction, $C = (X^\top X)^{-1}X^\top + D$, and unbiasedness is translated into the orthogonality condition $DX = \mathbf{0}$. The cross-terms in the covariance $\sigma^2 CC^\top$ then vanish, leaving $\operatorname{Cov}(\tilde{\beta}) = \operatorname{Cov}(\hat{\beta}) + \sigma^2 DD^\top$. Since $DD^\top$ is positive semi-definite, OLS is the unique minimiser up to $D = \mathbf{0}$.
[/proofplan]
[step:Set up the model and the candidate estimator]
We work in the classical linear model:
\begin{align*}
Y: \Omega \to \mathbb{R}^n, \qquad Y = X\beta + \varepsilon,
\end{align*}
where $X \in \mathbb{R}^{n \times p}$ is a deterministic design matrix of full column rank $p \leq n$ (so $X^\top X \in \mathbb{R}^{p \times p}$ is invertible), $\beta \in \mathbb{R}^p$ is the unknown parameter, and $\varepsilon: \Omega \to \mathbb{R}^n$ is the error vector satisfying the Gauss–Markov conditions
\begin{align*}
\mathbb{E}[\varepsilon] = \mathbf{0}, \qquad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n,
\end{align*}
with $\sigma^2 > 0$. No normality is assumed.
The **ordinary least squares** estimator is $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. We consider an arbitrary competing **linear** estimator
\begin{align*}
\tilde{\beta}: \Omega &\to \mathbb{R}^p \\
\omega &\mapsto CY(\omega),
\end{align*}
where $C \in \mathbb{R}^{p \times n}$ is a deterministic matrix. Our goal is to show $\operatorname{Cov}(\tilde{\beta}) \succeq \operatorname{Cov}(\hat{\beta})$ in the Löwner order (i.e. the difference is positive semi-definite) whenever $\tilde{\beta}$ is unbiased, with equality only if $\tilde{\beta} = \hat{\beta}$.
[/step]
[step:Translate unbiasedness of $\tilde{\beta}$ into the matrix identity $CX = I_p$]
By linearity of expectation,
\begin{align*}
\mathbb{E}[\tilde{\beta}] = \mathbb{E}[CY] = C\,\mathbb{E}[Y] = C X\beta.
\end{align*}
Unbiasedness of $\tilde{\beta}$ means $\mathbb{E}[\tilde{\beta}] = \beta$ for **every** $\beta \in \mathbb{R}^p$. That is,
\begin{align*}
CX\beta = \beta \qquad \text{for all } \beta \in \mathbb{R}^p,
\end{align*}
which is equivalent to the matrix identity
\begin{align*}
CX = I_p.
\end{align*}
[guided]
The condition "$\tilde{\beta}$ is unbiased for $\beta$" means $\mathbb{E}_{\beta}[\tilde{\beta}] = \beta$ for all $\beta$ — the estimator's expected value is the true parameter, whatever the true parameter happens to be.
Computing the expectation:
\begin{align*}
\mathbb{E}[\tilde{\beta}] = \mathbb{E}[CY] = \mathbb{E}[C(X\beta + \varepsilon)] = CX\beta + C\,\mathbb{E}[\varepsilon] = CX\beta,
\end{align*}
using $\mathbb{E}[\varepsilon] = \mathbf{0}$ (a Gauss–Markov condition). So the unbiasedness requirement is
\begin{align*}
CX\beta = \beta \qquad \text{for all } \beta \in \mathbb{R}^p.
\end{align*}
This is a linear identity in $\beta$. Two linear maps $\mathbb{R}^p \to \mathbb{R}^p$ agree on all vectors iff they are equal as matrices, so the condition is $CX = I_p$.
**Geometric interpretation.** Think of $C: \mathbb{R}^n \to \mathbb{R}^p$ as a linear map and $X: \mathbb{R}^p \to \mathbb{R}^n$ as its "right inverse candidate." Unbiasedness says $C$ is a left inverse of $X$: $C \circ X = \operatorname{id}_{\mathbb{R}^p}$. Equivalently, $C$ restricted to $\operatorname{Range}(X) \subseteq \mathbb{R}^n$ inverts $X$ (the action of $C$ on the complement of $\operatorname{Range}(X)$ is not constrained by this condition — which is where the flexibility to vary $C$ lives, and where OLS attains its optimality).
[/guided]
[/step]
[step:Decompose $C$ as OLS plus a correction $D$ and derive $DX = \mathbf{0}$]
Define the correction matrix
\begin{align*}
D := C - (X^\top X)^{-1} X^\top \in \mathbb{R}^{p \times n},
\end{align*}
so that $C = (X^\top X)^{-1} X^\top + D$.
Multiplying the unbiasedness identity $CX = I_p$ on the right-hand side using this decomposition,
\begin{align*}
I_p = CX = \bigl[(X^\top X)^{-1}X^\top + D\bigr]X = (X^\top X)^{-1}(X^\top X) + DX = I_p + DX,
\end{align*}
which forces
\begin{align*}
DX = \mathbf{0}_{p \times p}.
\end{align*}
[/step]
[step:Expand the covariance of $\tilde{\beta}$ and show the cross-terms vanish]
By the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) covariance rule (the covariance-transformation part does not require normality, only the Gauss–Markov second-moment conditions),
\begin{align*}
\operatorname{Cov}(\tilde{\beta}) = \operatorname{Cov}(CY) = C\,\operatorname{Cov}(Y)\,C^\top = C(\sigma^2 I_n)C^\top = \sigma^2 C C^\top,
\end{align*}
using $\operatorname{Cov}(Y) = \operatorname{Cov}(\varepsilon) = \sigma^2 I_n$ (the Gauss–Markov second condition and the fact that adding a constant vector $X\beta$ does not change covariance).
Substituting $C = (X^\top X)^{-1} X^\top + D$ and expanding,
\begin{align*}
CC^\top &= \bigl[(X^\top X)^{-1} X^\top + D\bigr]\bigl[(X^\top X)^{-1} X^\top + D\bigr]^\top \\
&= (X^\top X)^{-1} X^\top \bigl[(X^\top X)^{-1}X^\top\bigr]^\top + (X^\top X)^{-1} X^\top D^\top + D \bigl[(X^\top X)^{-1}X^\top\bigr]^\top + DD^\top.
\end{align*}
We evaluate each term:
**Term 1.** Using $[(X^\top X)^{-1}]^\top = (X^\top X)^{-1}$ (since $X^\top X$ is symmetric),
\begin{align*}
(X^\top X)^{-1} X^\top [(X^\top X)^{-1}X^\top]^\top = (X^\top X)^{-1} X^\top X (X^\top X)^{-1} = (X^\top X)^{-1}.
\end{align*}
**Term 2.** Using $DX = \mathbf{0}$, taking the transpose gives $X^\top D^\top = \mathbf{0}$, so
\begin{align*}
(X^\top X)^{-1} X^\top D^\top = (X^\top X)^{-1}\cdot \mathbf{0} = \mathbf{0}.
\end{align*}
**Term 3.** Again using $DX = \mathbf{0}$,
\begin{align*}
D\bigl[(X^\top X)^{-1} X^\top\bigr]^\top = D X (X^\top X)^{-1} = \mathbf{0}\cdot (X^\top X)^{-1} = \mathbf{0}.
\end{align*}
**Term 4.** Remains as $DD^\top$.
Combining,
\begin{align*}
\operatorname{Cov}(\tilde{\beta}) = \sigma^2 C C^\top = \sigma^2\bigl[(X^\top X)^{-1} + DD^\top\bigr].
\end{align*}
[guided]
The decomposition $C = (X^\top X)^{-1}X^\top + D$ is not chosen arbitrarily: it splits $C$ into "the OLS part" and "everything else." The point is that the orthogonality condition $DX = \mathbf{0}$ — which is exactly the unbiasedness constraint, once we have committed to this decomposition — causes the cross-terms to drop.
Let us see this cross-term vanishing from a more structural angle. Expanding $CC^\top$ bilinearly,
\begin{align*}
CC^\top = \underbrace{(X^\top X)^{-1}X^\top \cdot [(X^\top X)^{-1}X^\top]^\top}_{\text{OLS}\times\text{OLS}} + \underbrace{(X^\top X)^{-1}X^\top \cdot D^\top}_{\text{cross 1}} + \underbrace{D\cdot [(X^\top X)^{-1}X^\top]^\top}_{\text{cross 2}} + \underbrace{D D^\top}_{\text{correction}}.
\end{align*}
**OLS × OLS term.** Simplifies using $(A^\top X A)^{-1} = \ldots$ algebra:
\begin{align*}
(X^\top X)^{-1}X^\top \cdot X (X^\top X)^{-1} = (X^\top X)^{-1} (X^\top X) (X^\top X)^{-1} = (X^\top X)^{-1}.
\end{align*}
This is already the covariance of $\hat{\beta}$ divided by $\sigma^2$: indeed $\operatorname{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1}$.
**Cross term 1.** We need $(X^\top X)^{-1}X^\top D^\top = (X^\top X)^{-1}(DX)^\top = (X^\top X)^{-1}\mathbf{0}^\top = \mathbf{0}$. The bridge is $X^\top D^\top = (DX)^\top$, and $DX = \mathbf{0}$ by unbiasedness.
**Cross term 2.** Similarly $D\cdot X(X^\top X)^{-1} = (DX)(X^\top X)^{-1} = \mathbf{0}$.
Both cross terms vanish because $DX = \mathbf{0}$ — and $DX = \mathbf{0}$ is precisely the constraint imposed by unbiasedness (Step 3). This is the heart of the proof: **unbiasedness forces the correction $D$ to be orthogonal to the column space of $X$, which is exactly the space where $\hat{\beta}$ "lives," and orthogonality eliminates cross-covariance.**
**Correction term.** $DD^\top$ is left as-is. It is a $p \times p$ symmetric matrix, and, crucially, it is positive semi-definite (see next step).
Putting it all together,
\begin{align*}
\operatorname{Cov}(\tilde{\beta}) = \sigma^2(X^\top X)^{-1} + \sigma^2 DD^\top = \operatorname{Cov}(\hat{\beta}) + \sigma^2 DD^\top.
\end{align*}
Any competing unbiased linear estimator has a covariance that is "OLS covariance plus a positive semi-definite correction."
[/guided]
[/step]
[step:Identify $\sigma^2(X^\top X)^{-1}$ as $\operatorname{Cov}(\hat{\beta})$ and establish the Löwner inequality]
Specialising the covariance formula above to $D = \mathbf{0}$ (i.e. $C = (X^\top X)^{-1}X^\top$, so $\tilde{\beta} = \hat{\beta}$) gives
\begin{align*}
\operatorname{Cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}.
\end{align*}
Therefore for any unbiased linear estimator $\tilde{\beta}$,
\begin{align*}
\operatorname{Cov}(\tilde{\beta}) - \operatorname{Cov}(\hat{\beta}) = \sigma^2 DD^\top.
\end{align*}
The matrix $DD^\top \in \mathbb{R}^{p \times p}$ is positive semi-definite: for any $v \in \mathbb{R}^p$,
\begin{align*}
v^\top DD^\top v = (D^\top v)^\top (D^\top v) = \|D^\top v\|^2 \geq 0.
\end{align*}
Multiplying by $\sigma^2 > 0$ preserves positive semi-definiteness, so $\sigma^2 DD^\top \succeq 0$. This gives the Löwner inequality
\begin{align*}
\operatorname{Cov}(\tilde{\beta}) \succeq \operatorname{Cov}(\hat{\beta}).
\end{align*}
[/step]
[step:Establish the equality case and conclude uniqueness]
Equality $\operatorname{Cov}(\tilde{\beta}) = \operatorname{Cov}(\hat{\beta})$ in the Löwner order holds iff $\sigma^2 DD^\top = \mathbf{0}$, iff $DD^\top = \mathbf{0}$. By the identity $v^\top DD^\top v = \|D^\top v\|^2$ from the previous step, $DD^\top = \mathbf{0}$ iff $D^\top v = \mathbf{0}$ for all $v \in \mathbb{R}^p$, iff $D^\top = \mathbf{0}$, iff $D = \mathbf{0}$.
Thus equality holds precisely when $C = (X^\top X)^{-1} X^\top$, i.e. $\tilde{\beta} = \hat{\beta}$ almost surely.
Combining the previous two steps: for every linear unbiased estimator $\tilde{\beta}$ of $\beta$,
\begin{align*}
\operatorname{Cov}(\tilde{\beta}) \succeq \operatorname{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1},
\end{align*}
with equality iff $\tilde{\beta} = \hat{\beta}$. This establishes $\hat{\beta}$ as the **unique** best linear unbiased estimator (BLUE) of $\beta$, completing the proof.
[guided]
We have shown $\operatorname{Cov}(\tilde{\beta}) - \operatorname{Cov}(\hat{\beta}) = \sigma^2 DD^\top \succeq 0$. To complete the uniqueness claim we analyse when equality holds.
**Step A.** $\sigma^2 DD^\top = \mathbf{0}$ iff $DD^\top = \mathbf{0}$, since $\sigma^2 > 0$.
**Step B.** $DD^\top = \mathbf{0}$ iff $D = \mathbf{0}$. The forward direction is the interesting one: if $DD^\top = \mathbf{0}$, then $v^\top DD^\top v = 0$ for all $v$, but $v^\top DD^\top v = \|D^\top v\|^2$, so $D^\top v = \mathbf{0}$ for every $v \in \mathbb{R}^p$. This means $D^\top$ is the zero linear map, hence $D = \mathbf{0}$. The converse is trivial.
**Step C.** $D = \mathbf{0}$ iff $C = (X^\top X)^{-1}X^\top$, by definition of $D$. And this is iff $\tilde{\beta} = CY = (X^\top X)^{-1}X^\topY = \hat{\beta}$.
So the "best" in "best linear unbiased estimator" is sharp: among all linear estimators that are unbiased for every $\beta$, the OLS estimator $\hat{\beta}$ strictly dominates (in the Löwner order) any competitor that is not itself OLS.
**What this means in practice.** For any fixed linear functional $c^\top \beta$ of the parameter (with $c \in \mathbb{R}^p$), the variance of the plug-in estimator $c^\top \tilde{\beta}$ is
\begin{align*}
\operatorname{Var}(c^\top \tilde{\beta}) = c^\top \operatorname{Cov}(\tilde{\beta})c \geq c^\top \operatorname{Cov}(\hat{\beta})c = \operatorname{Var}(c^\top \hat{\beta}).
\end{align*}
So OLS is optimal not just for estimating $\beta$ itself but for any linear contrast of the regression coefficients — predictions, differences of coefficients, etc. — simultaneously, uniformly across all such contrasts. This uniform optimality is the practical content of Gauss–Markov.
**What is NOT assumed.** Normality of $\varepsilon$ is not used anywhere in this proof. The only requirements on $\varepsilon$ are the first two moments: $\mathbb{E}[\varepsilon] = \mathbf{0}$ and $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$. This is a considerable strengthening over, e.g., maximum likelihood optimality under normality.
[/guided]
[/step]