[proofplan]
We express the ordinary least squares residual vector as the projection of the error vector onto the orthogonal complement of the column space of $X$. The residual sum of squares then becomes a quadratic form $\varepsilon^\top M\varepsilon$, where $M$ is the residual-maker matrix. Expanding this quadratic form entry by entry under conditional expectation gives $\mathbb{E}[\varepsilon^\top M\varepsilon \mid X]=\sigma^2\operatorname{tr}(M)$. Finally, the trace of $M$ is $n-p$, so dividing by $n-p$ gives conditional unbiasedness.
[/proofplan]
[step:Define the projection and residual-maker matrices]
Since $\operatorname{rank}(X)=p$ almost surely, the matrix $X^\top X \in \mathbb{R}^{p\times p}$ is invertible almost surely. Define the hat matrix $H \in \mathbb{R}^{n\times n}$ and residual-maker matrix $M \in \mathbb{R}^{n\times n}$ by
\begin{align*}
H = X(X^\top X)^{-1}X^\top,
\qquad
M = I_n - H.
\end{align*}
The matrix $H$ is symmetric because
\begin{align*}
H^\top
=
\left(X(X^\top X)^{-1}X^\top\right)^\top
=
X\left((X^\top X)^{-1}\right)^\top X^\top
=
X(X^\top X)^{-1}X^\top
=
H,
\end{align*}
where $X^\top X$ is symmetric and hence $(X^\top X)^{-1}$ is symmetric. Also,
\begin{align*}
H^2
&=
X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\
&=
X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\
&=
X(X^\top X)^{-1}X^\top
=
H.
\end{align*}
Thus $H$ is an idempotent symmetric projection matrix, and $M=I_n-H$ is also symmetric and idempotent.
[guided]
Because $X$ has full column rank $p$, the Gram matrix $X^\top X$ is invertible: if $v \in \mathbb{R}^p$ satisfies $(X^\top X)v=0$, then
\begin{align*}
0
=
v^\top X^\top Xv
=
|Xv|^2,
\end{align*}
so $Xv=0$, and full column rank gives $v=0$. Therefore $X^\top X$ is nonsingular.
Define
\begin{align*}
H = X(X^\top X)^{-1}X^\top,
\qquad
M = I_n - H.
\end{align*}
The matrix $H$ is the [orthogonal projection](/theorems/437) onto the column space of $X$, and $M$ is the projection onto its orthogonal complement. We verify the algebraic facts needed later. First,
\begin{align*}
H^\top
=
\left(X(X^\top X)^{-1}X^\top\right)^\top
=
X\left((X^\top X)^{-1}\right)^\top X^\top
=
X(X^\top X)^{-1}X^\top
=
H,
\end{align*}
because $X^\top X$ and its inverse are symmetric. Second,
\begin{align*}
H^2
&=
X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\
&=
X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\
&=
X(X^\top X)^{-1}X^\top
=
H.
\end{align*}
So $H$ is symmetric and idempotent. Since $M=I_n-H$, we also have $M^\top=M$ and
\begin{align*}
M^2
=
(I_n-H)^2
=
I_n-2H+H^2
=
I_n-H
=
M.
\end{align*}
These identities are what allow the residual norm to be written as a single quadratic form.
[/guided]
[/step]
[step:Rewrite the residual sum of squares as a quadratic form in the error]
The fitted value is
\begin{align*}
X\hat{\beta}
=
X(X^\top X)^{-1}X^\top y
=
Hy.
\end{align*}
Therefore the residual vector $r \in \mathbb{R}^n$ defined by $r=y-X\hat{\beta}$ satisfies
\begin{align*}
r
=
y-Hy
=
My.
\end{align*}
Since
\begin{align*}
HX
=
X(X^\top X)^{-1}X^\top X
=
X,
\end{align*}
we have $MX=0$. Using $y=X\beta+\varepsilon$, this gives
\begin{align*}
r
=
My
=
M(X\beta+\varepsilon)
=
MX\beta+M\varepsilon
=
M\varepsilon.
\end{align*}
Because $M$ is symmetric and idempotent,
\begin{align*}
\operatorname{RSS}
=
|r|^2
=
(M\varepsilon)^\top(M\varepsilon)
=
\varepsilon^\top M^\top M\varepsilon
=
\varepsilon^\top M\varepsilon.
\end{align*}
[guided]
We now connect least squares residuals to the error vector. The ordinary least squares fitted value is
\begin{align*}
X\hat{\beta}
=
X(X^\top X)^{-1}X^\top y
=
Hy.
\end{align*}
Thus the residual vector $r \in \mathbb{R}^n$ is
\begin{align*}
r
=
y-X\hat{\beta}
=
y-Hy
=
My.
\end{align*}
The reason this is useful is that $M$ kills every vector in the column space of $X$. Indeed,
\begin{align*}
HX
=
X(X^\top X)^{-1}X^\top X
=
X,
\end{align*}
and hence
\begin{align*}
MX
=
(I_n-H)X
=
X-HX
=
0.
\end{align*}
Since $y=X\beta+\varepsilon$, applying $M$ gives
\begin{align*}
r
=
My
=
M(X\beta+\varepsilon)
=
MX\beta+M\varepsilon
=
M\varepsilon.
\end{align*}
So the residuals depend only on the component of the error vector orthogonal to the column space of $X$.
Finally, by definition,
\begin{align*}
\operatorname{RSS}
=
|r|^2.
\end{align*}
Using $r=M\varepsilon$, symmetry of $M$, and idempotence of $M$, we obtain
\begin{align*}
\operatorname{RSS}
=
(M\varepsilon)^\top(M\varepsilon)
=
\varepsilon^\top M^\top M\varepsilon
=
\varepsilon^\top M^2\varepsilon
=
\varepsilon^\top M\varepsilon.
\end{align*}
This is the form to which the conditional variance assumption applies.
[/guided]
[/step]
[step:Compute the conditional expectation of the quadratic form]
Write $M=(M_{ij})_{1\leq i,j\leq n}$ and $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top$. Since $M$ is a measurable function of $X$, its entries are fixed under conditioning on $X$. Expanding the quadratic form,
\begin{align*}
\mathbb{E}[\varepsilon^\top M\varepsilon \mid X]
&=
\mathbb{E}\left[\sum_{i=1}^n\sum_{j=1}^n M_{ij}\varepsilon_i\varepsilon_j \,\middle|\, X\right] \\
&=
\sum_{i=1}^n\sum_{j=1}^n M_{ij}\mathbb{E}[\varepsilon_i\varepsilon_j \mid X].
\end{align*}
The conditional mean assumption gives $\mathbb{E}[\varepsilon_i\mid X]=0$ for each $i$, and the conditional variance assumption gives
\begin{align*}
\mathbb{E}[\varepsilon_i\varepsilon_j \mid X]
=
\operatorname{Cov}(\varepsilon_i,\varepsilon_j\mid X)
=
\sigma^2\mathbb{1}_{\{i=j\}}.
\end{align*}
Therefore
\begin{align*}
\mathbb{E}[\varepsilon^\top M\varepsilon \mid X]
&=
\sum_{i=1}^n\sum_{j=1}^n M_{ij}\sigma^2\mathbb{1}_{\{i=j\}} \\
&=
\sigma^2\sum_{i=1}^n M_{ii}
=
\sigma^2\operatorname{tr}(M).
\end{align*}
[guided]
The conditional expectation is computed entry by entry. Write
\begin{align*}
M=(M_{ij})_{1\leq i,j\leq n},
\qquad
\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top.
\end{align*}
Since $M=I_n-X(X^\top X)^{-1}X^\top$ is determined by $X$, each entry $M_{ij}$ is measurable with respect to the information contained in $X$. Therefore, inside a conditional expectation given $X$, the entries of $M$ behave like known coefficients.
Expanding the quadratic form gives
\begin{align*}
\varepsilon^\top M\varepsilon
=
\sum_{i=1}^n\sum_{j=1}^n M_{ij}\varepsilon_i\varepsilon_j.
\end{align*}
Taking conditional expectation and pulling out the $X$-measurable coefficients,
\begin{align*}
\mathbb{E}[\varepsilon^\top M\varepsilon \mid X]
&=
\mathbb{E}\left[\sum_{i=1}^n\sum_{j=1}^n M_{ij}\varepsilon_i\varepsilon_j \,\middle|\, X\right] \\
&=
\sum_{i=1}^n\sum_{j=1}^n M_{ij}\mathbb{E}[\varepsilon_i\varepsilon_j \mid X].
\end{align*}
Now use the assumptions on the conditional mean and conditional variance. Because $\mathbb{E}[\varepsilon\mid X]=0$, each component satisfies $\mathbb{E}[\varepsilon_i\mid X]=0$. Hence
\begin{align*}
\mathbb{E}[\varepsilon_i\varepsilon_j\mid X]
=
\operatorname{Cov}(\varepsilon_i,\varepsilon_j\mid X).
\end{align*}
The assumption $\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_n$ says exactly that this conditional covariance is $\sigma^2$ when $i=j$ and $0$ when $i\neq j$, so
\begin{align*}
\mathbb{E}[\varepsilon_i\varepsilon_j \mid X]
=
\sigma^2\mathbb{1}_{\{i=j\}}.
\end{align*}
Substituting this into the expanded quadratic form,
\begin{align*}
\mathbb{E}[\varepsilon^\top M\varepsilon \mid X]
&=
\sum_{i=1}^n\sum_{j=1}^n M_{ij}\sigma^2\mathbb{1}_{\{i=j\}} \\
&=
\sigma^2\sum_{i=1}^n M_{ii}
=
\sigma^2\operatorname{tr}(M).
\end{align*}
Thus the conditional expected residual sum of squares is controlled by the trace of the residual-maker matrix.
[/guided]
[/step]
[step:Evaluate the trace of the residual-maker matrix]
Using the cyclic invariance of trace for compatible finite-dimensional matrices,
\begin{align*}
\operatorname{tr}(H)
&=
\operatorname{tr}\left(X(X^\top X)^{-1}X^\top\right) \\
&=
\operatorname{tr}\left((X^\top X)^{-1}X^\top X\right) \\
&=
\operatorname{tr}(I_p)
=
p.
\end{align*}
Therefore
\begin{align*}
\operatorname{tr}(M)
=
\operatorname{tr}(I_n-H)
=
\operatorname{tr}(I_n)-\operatorname{tr}(H)
=
n-p.
\end{align*}
Combining this with the previous step,
\begin{align*}
\mathbb{E}[\operatorname{RSS}\mid X]
=
\mathbb{E}[\varepsilon^\top M\varepsilon\mid X]
=
\sigma^2(n-p).
\end{align*}
Since $p<n$, division by $n-p$ is valid, and hence
\begin{align*}
\mathbb{E}[\hat{\sigma}^2\mid X]
=
\mathbb{E}\left[\frac{\operatorname{RSS}}{n-p}\,\middle|\,X\right]
=
\frac{1}{n-p}\mathbb{E}[\operatorname{RSS}\mid X]
=
\sigma^2.
\end{align*}
This proves that $\hat{\sigma}^2$ is conditionally unbiased for $\sigma^2$.
[/step]