[proofplan]
The proof rewrites the ordinary least squares estimation error as a linear transformation of the noise whose matrix is fixed once we condition on the design matrix $X$. Since $X$ has full column rank almost surely, $X^\top X$ is almost surely invertible and the matrix $A=(X^\top X)^{-1}X^\top$ is determined by $X$. We then compute the conditional covariance of $A\varepsilon$ directly from the definition and substitute the hypothesis $\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_n$.
[/proofplan]
[step:Express the estimation error as a conditional linear transformation of the noise]Let $(\Omega,\mathcal F,\mathbb P)$ denote the underlying probability space, and let $\sigma(X)\subset\mathcal F$ denote the sigma-algebra generated by the random matrix $X:\Omega\to\mathbb R^{n\times p}$. Define the full-measure event
\begin{align*}
E:=\{\omega\in\Omega:\operatorname{rank}X(\omega)=p\}.
\end{align*}
For every $\omega\in E$, the matrix $X(\omega)^\top X(\omega)$ is invertible. Define the random matrix
\begin{align*}
A:\Omega &\to \mathbb R^{p\times n}, \\
\omega &\mapsto
\begin{cases}
(X(\omega)^\top X(\omega))^{-1}X(\omega)^\top, & \omega\in E, \\
0, & \omega\notin E.
\end{cases}
\end{align*}
The event $E$ is $\sigma(X)$-measurable, and inversion is continuous, hence Borel, on the open set of invertible $p\times p$ matrices, so $A$ is $\sigma(X)$-measurable. On $E$, using $y=X\beta+\varepsilon$, we compute
\begin{align*}
\hat{\beta}
&= (X^\top X)^{-1}X^\top (X\beta+\varepsilon) \\
&= (X^\top X)^{-1}X^\top X\beta + (X^\top X)^{-1}X^\top \varepsilon \\
&= \beta + A\varepsilon.
\end{align*}
Therefore
\begin{align*}
\hat{\beta}-\beta = A\varepsilon
\end{align*}
almost surely.[/step]
[guided]The first point is that the ordinary least squares matrix is well-defined on a full-measure event. Let $(\Omega,\mathcal F,\mathbb P)$ be the underlying probability space, and let $\sigma(X)\subset\mathcal F$ denote the sigma-algebra generated by the random matrix $X:\Omega\to\mathbb R^{n\times p}$. Define
\begin{align*}
E:=\{\omega\in\Omega:\operatorname{rank}X(\omega)=p\}.
\end{align*}
The hypothesis $\operatorname{rank}X=p$ almost surely says $\mathbb P(E)=1$. For every $\omega\in E$, the columns of $X(\omega)$ are linearly independent, so $X(\omega)^\top X(\omega)$ is a positive definite $p\times p$ matrix and hence invertible.
Define
\begin{align*}
A:\Omega &\to \mathbb R^{p\times n}, \\
\omega &\mapsto
\begin{cases}
(X(\omega)^\top X(\omega))^{-1}X(\omega)^\top, & \omega\in E, \\
0, & \omega\notin E.
\end{cases}
\end{align*}
This definition assigns an arbitrary value on the null set $\Omega\setminus E$, which does not affect any almost-sure identity. The event $E$ is determined by $X$, and the matrix operations used in the formula are Borel on the set where the inverse exists, so $A$ is $\sigma(X)$-measurable. On $E$, substituting the model equation $y=X\beta+\varepsilon$ into the definition of $\hat{\beta}$ gives
\begin{align*}
\hat{\beta}
&= (X^\top X)^{-1}X^\top y \\
&= (X^\top X)^{-1}X^\top (X\beta+\varepsilon) \\
&= (X^\top X)^{-1}X^\top X\beta + (X^\top X)^{-1}X^\top \varepsilon \\
&= \beta + A\varepsilon.
\end{align*}
Thus the estimation error is exactly
\begin{align*}
\hat{\beta}-\beta = A\varepsilon
\end{align*}
on the event $E$, and hence almost surely. This identity is the algebraic core of the proof: conditional on $X$, the randomness in $\hat{\beta}$ comes only from $\varepsilon$, transformed by the $\sigma(X)$-measurable matrix $A$.[/guided]
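As a sanity check on the identity $\hat{\beta}-\beta=A\varepsilon$, consider a hypothetical toy design that is not part of the theorem: $n=2$, $p=1$, and $X=(1,2)^\top$ deterministic, so that $E=\Omega$ and $y=(\beta+\varepsilon_1,\,2\beta+\varepsilon_2)^\top$. The computation below is only an illustrative sketch under these assumptions.
\begin{align*}
X^\top X &= 1^2+2^2 = 5, \\
A &= (X^\top X)^{-1}X^\top = \tfrac{1}{5}\begin{pmatrix} 1 & 2 \end{pmatrix}, \\
\hat{\beta} &= A y = \tfrac{1}{5}\bigl((\beta+\varepsilon_1)+2(2\beta+\varepsilon_2)\bigr) \\
&= \beta+\tfrac{1}{5}(\varepsilon_1+2\varepsilon_2) = \beta + A\varepsilon.
\end{align*}
In this toy case the estimation error is the fixed linear combination $\tfrac{1}{5}(\varepsilon_1+2\varepsilon_2)$ of the noise coordinates, as the general identity predicts.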
[step:Compute the conditional covariance of the transformed noise]Let
\begin{align*}
m:\Omega &\to \mathbb R^n, &
m &= \mathbb E[\varepsilon\mid X].
\end{align*}
Since $A$ is $\sigma(X)$-measurable and finite-valued, and since $\varepsilon$ has finite conditional second moments given $X$ (this is implicit in the hypothesis that $\operatorname{Var}(\varepsilon\mid X)$ exists), the product $A\varepsilon$ also has finite conditional second moments given $X$. Therefore
\begin{align*}
\mathbb E[\hat{\beta}\mid X]
&= \mathbb E[\beta+A\varepsilon\mid X] \\
&= \beta + A\,\mathbb E[\varepsilon\mid X] \\
&= \beta + A m.
\end{align*}
Hence
\begin{align*}
\hat{\beta}-\mathbb E[\hat{\beta}\mid X]
&= A(\varepsilon-m).
\end{align*}
Using the definition of conditional variance for vector-valued random variables,
\begin{align*}
\operatorname{Var}(\hat{\beta}\mid X)
&= \mathbb E\left[
(\hat{\beta}-\mathbb E[\hat{\beta}\mid X])
(\hat{\beta}-\mathbb E[\hat{\beta}\mid X])^\top
\mid X
\right] \\
&= \mathbb E\left[
A(\varepsilon-m)(\varepsilon-m)^\top A^\top
\mid X
\right].
\end{align*}
Because $A$ and $A^\top$ are $\sigma(X)$-measurable, they factor out of the conditional expectation:
\begin{align*}
\operatorname{Var}(\hat{\beta}\mid X)
&= A\,\mathbb E\left[
(\varepsilon-m)(\varepsilon-m)^\top
\mid X
\right]A^\top \\
&= A\,\operatorname{Var}(\varepsilon\mid X)\,A^\top.
\end{align*}[/step]
[guided]We now compute the conditional covariance, taking care to center both random vectors correctly. Define the conditional mean of the noise by
\begin{align*}
m:\Omega &\to \mathbb R^n, &
m &= \mathbb E[\varepsilon\mid X].
\end{align*}
The matrix $A$ is $\sigma(X)$-measurable and finite almost surely, so once we condition on $X$, it behaves as a known matrix. Because $\operatorname{Var}(\varepsilon\mid X)$ exists, the noise vector $\varepsilon$ has finite second conditional moments; multiplying by the finite matrix $A$ preserves finite conditional second moments for $A\varepsilon$. Therefore conditional linearity gives
\begin{align*}
\mathbb E[\hat{\beta}\mid X]
&= \mathbb E[\beta+A\varepsilon\mid X] \\
&= \beta + A\,\mathbb E[\varepsilon\mid X] \\
&= \beta + A m.
\end{align*}
Subtracting this conditional mean from $\hat{\beta}=\beta+A\varepsilon$ gives
\begin{align*}
\hat{\beta}-\mathbb E[\hat{\beta}\mid X]
&= \beta+A\varepsilon-(\beta+Am) \\
&= A(\varepsilon-m).
\end{align*}
By definition, the conditional variance matrix of the $\mathbb R^p$-valued random vector $\hat{\beta}$ is
\begin{align*}
\operatorname{Var}(\hat{\beta}\mid X)
&= \mathbb E\left[
(\hat{\beta}-\mathbb E[\hat{\beta}\mid X])
(\hat{\beta}-\mathbb E[\hat{\beta}\mid X])^\top
\mid X
\right].
\end{align*}
Substituting the centered identity just obtained,
\begin{align*}
\operatorname{Var}(\hat{\beta}\mid X)
&= \mathbb E\left[
A(\varepsilon-m)(\varepsilon-m)^\top A^\top
\mid X
\right].
\end{align*}
Since $A$ and $A^\top$ are functions of $X$, they are $\sigma(X)$-measurable. They can therefore be factored out of the conditional expectation, giving
\begin{align*}
\operatorname{Var}(\hat{\beta}\mid X)
&= A\,\mathbb E\left[
(\varepsilon-m)(\varepsilon-m)^\top
\mid X
\right]A^\top \\
&= A\,\operatorname{Var}(\varepsilon\mid X)\,A^\top.
\end{align*}
This is the conditional covariance transformation rule, derived directly from the definition.[/guided]
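To see the transformation rule at work before any homoskedasticity assumption is used, take a hypothetical toy case that is not part of the theorem: $n=2$, $p=1$, $X=(1,2)^\top$ deterministic, for which $A=\tfrac{1}{5}\begin{pmatrix} 1 & 2 \end{pmatrix}$, and suppose purely for illustration that $\operatorname{Var}(\varepsilon\mid X)=\operatorname{diag}(\tau_1^2,\tau_2^2)$. The symbols $\tau_1,\tau_2$ are assumptions introduced only for this sketch.
\begin{align*}
A\,\operatorname{Var}(\varepsilon\mid X)\,A^\top
&= \tfrac{1}{5}\begin{pmatrix} 1 & 2 \end{pmatrix}
\begin{pmatrix} \tau_1^2 & 0 \\ 0 & \tau_2^2 \end{pmatrix}
\tfrac{1}{5}\begin{pmatrix} 1 \\ 2 \end{pmatrix} \\
&= \tfrac{1}{25}\bigl(\tau_1^2+4\tau_2^2\bigr).
\end{align*}
This agrees with the scalar conditional variance of $A\varepsilon=\tfrac{1}{5}(\varepsilon_1+2\varepsilon_2)$ when the two noise coordinates are conditionally uncorrelated, which is what the diagonal form of $\operatorname{Var}(\varepsilon\mid X)$ encodes.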
[step:Substitute the conditional homoskedasticity hypothesis and simplify the matrix product]
By hypothesis,
\begin{align*}
\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_n
\end{align*}
almost surely. Therefore
\begin{align*}
\operatorname{Var}(\hat{\beta}\mid X)
&= A(\sigma^2 I_n)A^\top \\
&= \sigma^2 AA^\top.
\end{align*}
On the full-measure event where $\operatorname{rank}X=p$, we have $A=(X^\top X)^{-1}X^\top$. Since $X^\top X$ is symmetric, so is its inverse, and therefore
\begin{align*}
A^\top
&= X(X^\top X)^{-1}.
\end{align*}
Thus
\begin{align*}
AA^\top
&= (X^\top X)^{-1}X^\top X(X^\top X)^{-1} \\
&= (X^\top X)^{-1}.
\end{align*}
Combining these identities yields
\begin{align*}
\operatorname{Var}(\hat{\beta}\mid X)
= \sigma^2 (X^\top X)^{-1}
\end{align*}
almost surely, which is the desired formula.
[/step]
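The final formula can be checked by hand on a hypothetical toy design that is not part of the theorem: $n=2$, $p=1$, and $X=(1,2)^\top$ deterministic, with $\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_2$ as in the hypothesis. The following is an illustrative sketch under these assumptions.
\begin{align*}
X^\top X &= 5, &
A &= \tfrac{1}{5}\begin{pmatrix} 1 & 2 \end{pmatrix}, \\
AA^\top &= \tfrac{1}{25}\,(1^2+2^2) = \tfrac{1}{5} = (X^\top X)^{-1}, &
\operatorname{Var}(\hat{\beta}\mid X) &= \sigma^2 AA^\top = \tfrac{\sigma^2}{5}.
\end{align*}
The same value follows from the scalar computation $\operatorname{Var}\bigl(\tfrac{1}{5}(\varepsilon_1+2\varepsilon_2)\mid X\bigr)=\tfrac{1}{25}(\sigma^2+4\sigma^2)=\tfrac{\sigma^2}{5}$, using that the noise coordinates are conditionally uncorrelated with common variance $\sigma^2$.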