[proofplan]
The residual vector is obtained by applying the projection $I_n-H$ to the noise vector $\varepsilon$, because $I_n-H$ annihilates the column space of $X$. We show that $I_n-H$ is the [orthogonal projection](/theorems/437) onto an $(n-p)$-dimensional subspace, choose an orthonormal basis adapted to that projection, and express the squared residual norm in those coordinates. Orthogonal invariance of the centered Gaussian distribution then turns the normalized residual sum of squares into a sum of squares of $n-p$ independent standard normal random variables. The expectation identity follows from the mean of the chi-squared distribution and the definition of $\hat{\sigma}^2$.
[/proofplan]
[step:Show that the hat matrix is the orthogonal projection onto the column space of $X$]
Since $\operatorname{rank}(X)=p$, the matrix $X^\top X \in \mathbb{R}^{p \times p}$ is invertible. Indeed, if $v \in \mathbb{R}^p$ satisfies $X^\top Xv=0$, then
\begin{align*}
0 = v^\top X^\top Xv = |Xv|^2,
\end{align*}
so $Xv=0$, and the injectivity of $X:\mathbb{R}^p \to \mathbb{R}^n$ gives $v=0$.
The matrix $H=X(X^\top X)^{-1}X^\top$ is symmetric because $(X^\top X)^{-1}$ is symmetric, and it is idempotent since
\begin{align*}
H^2
&= X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\
&= X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\
&= X(X^\top X)^{-1}X^\top \\
&= H.
\end{align*}
For every $a \in \mathbb{R}^p$,
\begin{align*}
H(Xa)=X(X^\top X)^{-1}X^\top Xa = Xa,
\end{align*}
so $H$ acts as the identity on $\operatorname{Range}(X)$. Also, every $Hv$ belongs to $\operatorname{Range}(X)$, hence $\operatorname{Range}(H)\subseteq \operatorname{Range}(X)$. Therefore $\operatorname{Range}(H)=\operatorname{Range}(X)$.
Since $H$ is symmetric and idempotent, it is the orthogonal projection onto $\operatorname{Range}(X)$. Consequently,
\begin{align*}
P := I_n-H
\end{align*}
is the orthogonal projection onto $\operatorname{Range}(X)^\perp$.
[guided]
The first point is that $H$ is well-defined. Since $X$ has rank $p$, the [linear map](/page/Linear%20Map) $X:\mathbb{R}^p \to \mathbb{R}^n$ is injective. To prove that $X^\top X$ is invertible, take $v \in \mathbb{R}^p$ with $X^\top Xv=0$. Multiplying on the left by $v^\top$ gives
\begin{align*}
0 = v^\top X^\top Xv = |Xv|^2.
\end{align*}
Thus $Xv=0$, and injectivity gives $v=0$. Hence the kernel of $X^\top X$ is $\{0\}$, so $X^\top X$ is invertible.
Now verify that $H$ is a projection. The matrix $X^\top X$ is symmetric, so its inverse is symmetric. Therefore
\begin{align*}
H^\top
= \bigl(X(X^\top X)^{-1}X^\top\bigr)^\top
= X(X^\top X)^{-1}X^\top
= H.
\end{align*}
For idempotence, multiply directly:
\begin{align*}
H^2
&= X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\
&= X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\
&= X(X^\top X)^{-1}X^\top \\
&= H.
\end{align*}
It remains to identify the subspace onto which $H$ projects. If $a \in \mathbb{R}^p$, then
\begin{align*}
H(Xa)=X(X^\top X)^{-1}X^\top Xa=Xa,
\end{align*}
so every vector in the column space $\operatorname{Range}(X)$ is fixed by $H$. Conversely, $Hv$ is visibly a linear combination of the columns of $X$ for every $v \in \mathbb{R}^n$, so $\operatorname{Range}(H)\subseteq \operatorname{Range}(X)$. Hence $\operatorname{Range}(H)=\operatorname{Range}(X)$.
A symmetric idempotent matrix is the orthogonal projection onto its range: idempotence makes it a projection, and symmetry makes its kernel orthogonal to its range. Thus $H$ projects orthogonally onto $\operatorname{Range}(X)$, and
\begin{align*}
P:=I_n-H
\end{align*}
projects orthogonally onto $\operatorname{Range}(X)^\perp$.
[/guided]
[/step]
[step:Express the residual vector as the projection of the noise]
Using the definition of $\hat{\beta}$,
\begin{align*}
X\hat{\beta}
= X(X^\top X)^{-1}X^\top Y
= HY.
\end{align*}
Thus
\begin{align*}
e = Y-X\hat{\beta} = (I_n-H)Y = PY.
\end{align*}
Since $Y=X\beta+\varepsilon$ and $PX=(I_n-H)X=0$, we obtain
\begin{align*}
e = P(X\beta+\varepsilon)=PX\beta+P\varepsilon=P\varepsilon.
\end{align*}
Therefore
\begin{align*}
\operatorname{RSS}=|e|^2=|P\varepsilon|^2.
\end{align*}
[/step]
[step:Diagonalize the residual projection in an orthonormal basis]
Because $\operatorname{Range}(X)$ has dimension $p$, its orthogonal complement $\operatorname{Range}(X)^\perp$ has dimension $n-p$. Choose an orthonormal basis
\begin{align*}
q_1,\dots,q_{n-p}
\end{align*}
of $\operatorname{Range}(X)^\perp$ and an orthonormal basis
\begin{align*}
q_{n-p+1},\dots,q_n
\end{align*}
of $\operatorname{Range}(X)$. Let $Q \in \mathbb{R}^{n \times n}$ be the orthogonal matrix whose columns are $q_1,\dots,q_n$.
Since $P$ is the orthogonal projection onto $\operatorname{Range}(X)^\perp$, we have
\begin{align*}
Pq_i =
\begin{cases}
q_i, & 1 \le i \le n-p,\\
0, & n-p+1 \le i \le n.
\end{cases}
\end{align*}
Equivalently, with $0_{(n-p)\times p}$, $0_{p\times(n-p)}$, and $0_{p\times p}$ denoting zero matrices of the indicated sizes,
\begin{align*}
Q^\top P Q
=
\begin{pmatrix}
I_{n-p} & 0_{(n-p)\times p} \\
0_{p\times(n-p)} & 0_{p\times p}
\end{pmatrix}.
\end{align*}
Since $Q$ is orthogonal, Euclidean norm is preserved by $Q^\top$, and therefore
\begin{align*}
|P\varepsilon|^2
&= |Q^\top P\varepsilon|^2 \\
&= |Q^\top PQQ^\top \varepsilon|^2.
\end{align*}
Define the random vector
\begin{align*}
Z: (\Omega,\mathcal{F}) &\to (\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n)) \\
\omega &\mapsto \frac{1}{\sigma}Q^\top \varepsilon(\omega).
\end{align*}
Then
\begin{align*}
\frac{|P\varepsilon|^2}{\sigma^2}
=
\left|
\begin{pmatrix}
I_{n-p} & 0 \\
0 & 0_p
\end{pmatrix}
Z
\right|^2
=
\sum_{i=1}^{n-p} Z_i^2,
\end{align*}
where $Z_i:\Omega \to \mathbb{R}$ denotes the $i$-th coordinate random variable of $Z$.
[guided]
The purpose of this step is to choose coordinates in which the projection $P=I_n-H$ has the simplest possible form. Since $\operatorname{Range}(X)$ has dimension $p$, the dimension formula for orthogonal complements in $\mathbb{R}^n$ gives
\begin{align*}
\dim \operatorname{Range}(X)^\perp = n-p.
\end{align*}
Choose an orthonormal basis $q_1,\dots,q_{n-p}$ for $\operatorname{Range}(X)^\perp$ and then an orthonormal basis $q_{n-p+1},\dots,q_n$ for $\operatorname{Range}(X)$. Together these form an orthonormal basis of $\mathbb{R}^n$. Let $Q \in \mathbb{R}^{n \times n}$ be the matrix with these vectors as columns; then $Q^\top Q=I_n$.
Because $P$ is the orthogonal projection onto $\operatorname{Range}(X)^\perp$, it fixes the first $n-p$ basis vectors and kills the last $p$ basis vectors:
\begin{align*}
Pq_i =
\begin{cases}
q_i, & 1 \le i \le n-p,\\
0, & n-p+1 \le i \le n.
\end{cases}
\end{align*}
Writing this action in the basis determined by $Q$ gives the following block matrix, where $0_{(n-p)\times p}$, $0_{p\times(n-p)}$, and $0_{p\times p}$ denote zero matrices of the indicated sizes:
\begin{align*}
Q^\top P Q
=
\begin{pmatrix}
I_{n-p} & 0_{(n-p)\times p} \\
0_{p\times(n-p)} & 0_{p\times p}
\end{pmatrix}.
\end{align*}
Now introduce normalized coordinates for the noise. Define
\begin{align*}
Z: (\Omega,\mathcal{F}) &\to (\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n)) \\
\omega &\mapsto \frac{1}{\sigma}Q^\top \varepsilon(\omega).
\end{align*}
Since $Q$ is orthogonal, applying $Q^\top$ does not change Euclidean length. Thus
\begin{align*}
\frac{|P\varepsilon|^2}{\sigma^2}
&= \left|\frac{1}{\sigma}Q^\top P\varepsilon\right|^2 \\
&= \left|Q^\top P Q \frac{1}{\sigma}Q^\top\varepsilon\right|^2 \\
&=
\left|
\begin{pmatrix}
I_{n-p} & 0 \\
0 & 0_p
\end{pmatrix}
Z
\right|^2 \\
&=
\sum_{i=1}^{n-p} Z_i^2.
\end{align*}
This computation isolates exactly the $n-p$ noise coordinates that survive after projection onto the residual subspace.
[/guided]
[/step]
[step:Identify the surviving coordinates as independent standard normal variables]
Since $\varepsilon \sim \mathcal{N}(0,\sigma^2 I_n)$ and $Q$ is orthogonal, the affine transformation rule for Gaussian random vectors gives
\begin{align*}
Z=\frac{1}{\sigma}Q^\top\varepsilon \sim \mathcal{N}(0,I_n),
\end{align*}
because
\begin{align*}
\mathbb{E}[Z]=\frac{1}{\sigma}Q^\top 0=0
\end{align*}
and
\begin{align*}
\operatorname{Cov}(Z)
=
\frac{1}{\sigma^2}Q^\top(\sigma^2 I_n)Q
=
Q^\top Q
=
I_n.
\end{align*}
For a Gaussian random vector with covariance matrix $I_n$, the coordinate random variables $Z_1,\dots,Z_n$ are independent and each has distribution $\mathcal{N}(0,1)$. Hence
\begin{align*}
\frac{\operatorname{RSS}}{\sigma^2}
=
\frac{|P\varepsilon|^2}{\sigma^2}
=
\sum_{i=1}^{n-p} Z_i^2
\sim \chi^2_{n-p}.
\end{align*}
[guided]
We now use the defining stability property of Gaussian random vectors under linear transformations. The random vector $\varepsilon$ has distribution $\mathcal{N}(0,\sigma^2 I_n)$, and $Z$ is obtained from $\varepsilon$ by the linear map $\sigma^{-1}Q^\top:\mathbb{R}^n \to \mathbb{R}^n$. Therefore $Z$ is Gaussian. Its mean is
\begin{align*}
\mathbb{E}[Z]=\frac{1}{\sigma}Q^\top \mathbb{E}[\varepsilon]=\frac{1}{\sigma}Q^\top 0=0.
\end{align*}
Its covariance matrix is
\begin{align*}
\operatorname{Cov}(Z)
&=
\mathbb{E}\bigl[ZZ^\top\bigr] \\
&=
\frac{1}{\sigma^2}Q^\top \mathbb{E}\bigl[\varepsilon\varepsilon^\top\bigr]Q \\
&=
\frac{1}{\sigma^2}Q^\top(\sigma^2 I_n)Q \\
&=
Q^\top Q \\
&=
I_n.
\end{align*}
Thus
\begin{align*}
Z \sim \mathcal{N}(0,I_n).
\end{align*}
For a centered Gaussian random vector, zero covariance between two coordinates implies independence of those coordinates. Since the covariance matrix is $I_n$, the coordinates $Z_1,\dots,Z_n$ are independent, and each coordinate has variance $1$ and mean $0$. Hence
\begin{align*}
Z_i \sim \mathcal{N}(0,1)
\end{align*}
for every $1 \le i \le n$.
The definition of a chi-squared random variable with $n-p$ degrees of freedom is the sum of squares of $n-p$ independent standard normal random variables. From the previous coordinate computation,
\begin{align*}
\frac{\operatorname{RSS}}{\sigma^2}
=
\sum_{i=1}^{n-p}Z_i^2.
\end{align*}
Therefore
\begin{align*}
\frac{\operatorname{RSS}}{\sigma^2}\sim \chi^2_{n-p}.
\end{align*}
[/guided]
[/step]
[step:Compute the expectation of the variance estimator]
Let $W:\Omega \to \mathbb{R}$ denote the random variable
\begin{align*}
W := \frac{\operatorname{RSS}}{\sigma^2}.
\end{align*}
The previous step gives $W\sim \chi^2_{n-p}$, so
\begin{align*}
\mathbb{E}[W]=n-p.
\end{align*}
Using $\operatorname{RSS}=\sigma^2 W$ and $\hat{\sigma}^2=\operatorname{RSS}/(n-p)$, we get
\begin{align*}
\mathbb{E}[\hat{\sigma}^2]
&=
\mathbb{E}\left[\frac{\operatorname{RSS}}{n-p}\right] \\
&=
\frac{\sigma^2}{n-p}\mathbb{E}[W] \\
&=
\frac{\sigma^2}{n-p}(n-p) \\
&=
\sigma^2.
\end{align*}
This proves both asserted conclusions.
[/step]