[proofplan]
We stack $\hat\beta$ and the residual vector $R = (I_n - P)Y$ into a single vector $V = DY$. Since $Y$ is multivariate normal and $D$ is deterministic, $V$ is also multivariate normal. We then compute the off-diagonal block of $\operatorname{Cov}(V)$ and show it vanishes: this reduces to the identity $X^\top(I_n - P) = 0$, which says residuals are orthogonal to the column space of $X$. For jointly normal vectors, zero cross-covariance implies independence, so $\hat\beta$ is independent of $R$, hence of any function of $R$ — in particular of $\mathrm{RSS} = R^\top R$.
[/proofplan]
[step:Stack $\hat\beta$ and the residual vector into a single linear function of $Y$]
Let $P = X(X^\top X)^{-1} X^\top$ be the hat matrix, $C := (X^\top X)^{-1} X^\top \in \mathbb{R}^{p \times n}$ so that $\hat\beta = CY$, and let $R = (I_n - P)Y$ be the residual vector. Define the stacked vector and transformation matrix as maps
\begin{align*}
D &: \mathbb{R}^n \to \mathbb{R}^{p+n}, & V &: \Omega \to \mathbb{R}^{p+n} \\
y &\mapsto \begin{pmatrix} C \\ I_n - P \end{pmatrix} y, & \omega &\mapsto DY(\omega) = \begin{pmatrix} \hat\beta(\omega) \\ R(\omega) \end{pmatrix}.
\end{align*}
Here $D \in \mathbb{R}^{(p+n) \times n}$ is a deterministic matrix and $V = DY$ is a random vector on the underlying probability space.
[guided]
Our goal is to prove independence of $\hat\beta$ (a $p$-vector) and $\mathrm{RSS}$ (a scalar). A direct proof via moment generating functions or density factorisation is unwieldy in the normal linear model. The standard technique is instead to exploit a special feature of multivariate normals: **for jointly normal vectors, independence is equivalent to zero cross-covariance**. The proof will therefore run in three moves, followed by a short concluding step:
1. Package $\hat\beta$ and $R$ together as one linear function of $Y$ (this step).
2. Use multivariate normality of $Y$ to conclude the package is jointly normal (next step).
3. Compute the cross-covariance block and show it vanishes (third step).
Why prove independence of $\hat\beta$ and $R$ rather than of $\hat\beta$ and $\mathrm{RSS}$ directly? Because $\mathrm{RSS} = R^\top R$ is a (measurable) function of $R$, and independence is preserved under such measurable functions: if $U$ is independent of $W$, then $U$ is independent of $f(W)$ for any measurable $f$. So it is enough — and much more natural — to prove the vector-vector independence $\hat\beta \perp\!\!\!\perp R$, then invoke this elementary property at the end.
Packaging into a single vector $V = DY$ is just notation: we line up the linear maps that extract $\hat\beta$ and $R$ from $Y$ into one tall matrix $D$. Reading off the two blocks of $D$:
- top block $C = (X^\top X)^{-1} X^\top$, representing $\hat\beta = CY$ (the least squares formula);
- bottom block $I_n - P$, representing $R = (I_n - P)Y$ (the projection onto the residual subspace).
Stacking produces a $(p+n) \times n$ matrix whose action on $Y$ returns both quantities simultaneously.
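For a concrete picture, here is a small illustrative instance (the intercept-only design below is an assumption chosen for simplicity, not part of the argument): take $n = 2$, $p = 1$ and $X = (1, 1)^\top$, so that $X^\top X = 2$. Then
\begin{align*}
C &= \tfrac12 \begin{pmatrix} 1 & 1 \end{pmatrix}, & P &= \tfrac12 \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, & D &= \begin{pmatrix} C \\ I_2 - P \end{pmatrix} = \tfrac12 \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix},
\end{align*}
and hence, writing $\bar Y = (Y_1 + Y_2)/2$,
\begin{align*}
V = DY = \tfrac12 \begin{pmatrix} Y_1 + Y_2 \\ Y_1 - Y_2 \\ Y_2 - Y_1 \end{pmatrix} = \begin{pmatrix} \bar Y \\ Y_1 - \bar Y \\ Y_2 - \bar Y \end{pmatrix}.
\end{align*}
The first coordinate is $\hat\beta$ and the last two are the residuals.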
[/guided]
[/step]
[step:Conclude that $V$ is multivariate normal]
Under the normal linear model, $Y \sim N_n(X\beta, \sigma^2 I_n)$. Affine transformations of multivariate normals are multivariate normal: for any deterministic matrix $B \in \mathbb{R}^{m \times n}$ and vector $b \in \mathbb{R}^m$, the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) theorem (used here in its general form, for an arbitrary deterministic $B$ rather than only an orthogonal one) gives
\begin{align*}
BY + b &\sim N_m\!\big(B(X\beta) + b,\; B(\sigma^2 I_n) B^\top\big) = N_m\!\big(BX\beta + b,\; \sigma^2 B B^\top\big).
\end{align*}
Applying this with $B = D$ and $b = \mathbf{0}$, the stacked vector $V = DY$ is multivariate normal:
\begin{align*}
V &\sim N_{p+n}\!\big(D X\beta,\; \sigma^2 D D^\top\big).
\end{align*}
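As a sanity check, the mean vector can be expanded blockwise. The sketch below uses only $CX = (X^\top X)^{-1} X^\top X = I_p$ and $PX = X(X^\top X)^{-1} X^\top X = X$:
\begin{align*}
D X\beta &= \begin{pmatrix} C X \beta \\ (I_n - P) X \beta \end{pmatrix} = \begin{pmatrix} \beta \\ X\beta - X\beta \end{pmatrix} = \begin{pmatrix} \beta \\ \mathbf{0}_n \end{pmatrix}.
\end{align*}
So the top block recovers the unbiasedness of $\hat\beta$, and the residual vector has mean zero.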
[/step]
[step:Compute the cross-covariance block and show it vanishes]
Write the covariance of $V$ in block form, partitioned into its first $p$ and last $n$ coordinates:
\begin{align*}
\operatorname{Cov}(V) &= \sigma^2 D D^\top = \sigma^2 \begin{pmatrix} C \\ I_n - P \end{pmatrix} \begin{pmatrix} C^\top & (I_n - P)^\top \end{pmatrix} = \sigma^2 \begin{pmatrix} C C^\top & C (I_n - P)^\top \\ (I_n - P) C^\top & (I_n - P)(I_n - P)^\top \end{pmatrix}.
\end{align*}
The off-diagonal block of interest is $\sigma^2 C (I_n - P)^\top$. Substituting $C = (X^\top X)^{-1} X^\top$ and using symmetry of $I_n - P$ (established in the proof of the [Chi-Squared Distribution of RSS](/theorems/1443)):
\begin{align*}
C (I_n - P)^\top &= (X^\top X)^{-1} X^\top (I_n - P) = (X^\top X)^{-1} \big(X^\top - X^\top P\big).
\end{align*}
Now
\begin{align*}
X^\top P &= X^\top X (X^\top X)^{-1} X^\top = X^\top,
\end{align*}
so $X^\top (I_n - P) = X^\top - X^\top = \mathbf{0} \in \mathbb{R}^{p \times n}$. Therefore
\begin{align*}
C (I_n - P)^\top &= (X^\top X)^{-1} \cdot \mathbf{0} = \mathbf{0}_{p \times n},
\end{align*}
so the full off-diagonal block vanishes:
\begin{align*}
\operatorname{Cov}(\hat\beta, R) &= \sigma^2 C (I_n - P)^\top = \mathbf{0}_{p \times n}.
\end{align*}
[guided]
We computed the cross-covariance block $\operatorname{Cov}(\hat\beta, R) = \sigma^2 C (I_n - P)^\top$ and need to show it is the zero matrix. Substituting the formula $C = (X^\top X)^{-1} X^\top$ and using that $I_n - P$ is symmetric:
\begin{align*}
C (I_n - P)^\top = (X^\top X)^{-1} X^\top (I_n - P).
\end{align*}
Everything now comes down to the identity $X^\top (I_n - P) = \mathbf{0}$. This has both an algebraic and a geometric reading.
*Algebraic.* We multiply out:
\begin{align*}
X^\top (I_n - P) = X^\top - X^\top P = X^\top - X^\top \cdot X(X^\top X)^{-1} X^\top = X^\top - (X^\top X)(X^\top X)^{-1} X^\top = X^\top - I_p X^\top = \mathbf{0}.
\end{align*}
The collapse happens because $(X^\top X)(X^\top X)^{-1} = I_p$, the same cancellation that makes $C$ a left inverse of $X$, i.e. $CX = I_p$.
*Geometric.* The matrix $P$ is the orthogonal projection onto $\operatorname{Range}(X) \subset \mathbb{R}^n$, so $I_n - P$ is the orthogonal projection onto $\operatorname{Range}(X)^\perp$. The columns of $X$ lie in $\operatorname{Range}(X)$, so applying $I_n - P$ to each of them gives zero, which is exactly the identity $(I_n - P) X = 0$ (already noted in the proof of [Chi-Squared Distribution of RSS](/theorems/1443)). Transposing:
\begin{align*}
X^\top (I_n - P)^\top = 0 \quad\Longleftrightarrow\quad X^\top (I_n - P) = 0 \quad \text{(since } I_n - P \text{ symmetric)}.
\end{align*}
This is the matrix form of the **normal equations**: applied to $Y$, it reads $X^\top R = \mathbf{0}$, that is, the residuals are orthogonal to the column space of $X$ and hence to every predictor.
Substituting back, left-multiplying the zero matrix by $(X^\top X)^{-1}$ still gives the zero matrix, so
\begin{align*}
C (I_n - P)^\top = (X^\top X)^{-1} \cdot 0 = 0_{p \times n},
\end{align*}
and the cross-covariance block vanishes.
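With the off-diagonal blocks gone, the remaining diagonal blocks can also be simplified. The following is a supplementary computation rather than a step the proof relies on; it uses the symmetry of $I_n - P$ together with $P^2 = P$, which follows by multiplying out $X(X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top$:
\begin{align*}
C C^\top &= (X^\top X)^{-1} X^\top X (X^\top X)^{-1} = (X^\top X)^{-1}, \\
(I_n - P)(I_n - P)^\top &= I_n - 2P + P^2 = I_n - P,
\end{align*}
so that
\begin{align*}
\operatorname{Cov}(V) = \sigma^2 \begin{pmatrix} (X^\top X)^{-1} & \mathbf{0}_{p \times n} \\ \mathbf{0}_{n \times p} & I_n - P \end{pmatrix}.
\end{align*}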
[/guided]
[/step]
[step:Conclude independence of $\hat\beta$ and $\mathrm{RSS}$]
Since $V = (\hat\beta^\top, R^\top)^\top$ is multivariate normal (Step 2) and its cross-covariance block is zero (Step 3), the two sub-vectors are independent: for jointly normal vectors, zero cross-covariance is equivalent to independence. Therefore
\begin{align*}
\hat\beta &\perp\!\!\!\perp R.
\end{align*}
Independence is preserved under measurable functions of either side: for any Borel-measurable $f: \mathbb{R}^n \to \mathbb{R}$, $\hat\beta$ is independent of $f(R)$. Taking $f(r) := r^\top r$ gives $f(R) = R^\top R = \mathrm{RSS}$, and hence
\begin{align*}
\hat\beta &\perp\!\!\!\perp \mathrm{RSS}.
\end{align*}
Finally, both $\hat\sigma^2 = \mathrm{RSS}/n$ (the MLE) and $\tilde\sigma^2 = \mathrm{RSS}/(n-p)$ (the unbiased estimator) are deterministic functions of $\mathrm{RSS}$, so $\hat\beta$ is independent of each of them. This completes the proof.
[guided]
We assemble the three pieces:
*Joint normality.* Step 2 established $V = (\hat\beta^\top, R^\top)^\top \sim N_{p+n}(D X\beta, \sigma^2 D D^\top)$.
*Zero cross-covariance.* Step 3 showed the $p \times n$ off-diagonal block of $\operatorname{Cov}(V)$ is zero.
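*Implication.* For a jointly normal vector $V = (U^\top, W^\top)^\top$, the blocks $U$ and $W$ are independent if and only if $\operatorname{Cov}(U, W) = 0$. The "if" direction is the non-trivial one: zero covariance does not imply independence in general, but it does under joint normality. Writing $\mu$ and $\Sigma$ for the mean and covariance of $V$, the joint characteristic function $t \mapsto \exp\!\big(i t^\top \mu - \tfrac12 t^\top \Sigma t\big)$ factorises into a function of the $U$-coordinates of $t$ times a function of the $W$-coordinates exactly when $\Sigma$ is block-diagonal. (The characteristic function is the right tool here rather than a density: $\operatorname{Cov}(V) = \sigma^2 D D^\top$ has rank at most $n < p + n$, so $V$ has no density on $\mathbb{R}^{p+n}$.) We conclude $\hat\beta \perp\!\!\!\perp R$.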
*Function of an independent variable stays independent.* The last piece is the elementary fact that if $U \perp\!\!\!\perp W$ and $f$ is a Borel-measurable function on the codomain of $W$, then $U \perp\!\!\!\perp f(W)$. This follows from the definition of independence via $\sigma$-algebras: $\sigma(f(W)) \subseteq \sigma(W)$, and independence between $\sigma$-algebras passes to sub-$\sigma$-algebras. Applying this with $f(r) = r^\top r$:
\begin{align*}
\mathrm{RSS} = R^\top R = f(R) \quad\Longrightarrow\quad \hat\beta \perp\!\!\!\perp \mathrm{RSS}.
\end{align*}
The estimators $\hat\sigma^2 = \mathrm{RSS}/n$ and $\tilde\sigma^2 = \mathrm{RSS}/(n - p)$ are deterministic functions of $\mathrm{RSS}$, so $\hat\beta \perp\!\!\!\perp \hat\sigma^2$ and $\hat\beta \perp\!\!\!\perp \tilde\sigma^2$ follow immediately. This is the independence property that powers the derivation of the $t$-distribution for normalised coefficients.
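As an illustration (the intercept-only model here is an assumed special case, not part of the proof), take $X = \mathbf{1}_n$ and $\beta = \mu$ with $n \ge 2$, so $p = 1$. Then
\begin{align*}
\hat\beta = \bar Y = \frac1n \sum_{i=1}^n Y_i, \qquad \mathrm{RSS} = \sum_{i=1}^n (Y_i - \bar Y)^2, \qquad \tilde\sigma^2 = \frac{\mathrm{RSS}}{n-1},
\end{align*}
and the result reduces to the classical independence of the sample mean and the sample variance. Combined with $\bar Y \sim N(\mu, \sigma^2/n)$ and $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-1}$ (the [Chi-Squared Distribution of RSS](/theorems/1443) result, assumed here in the form with $n - p$ degrees of freedom), this gives
\begin{align*}
\frac{\sqrt{n}\,(\bar Y - \mu)}{\tilde\sigma} \sim t_{n-1}.
\end{align*}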
[/guided]
[/step]