[proofplan]
We rewrite least squares using the [orthogonal projection](/theorems/437) $P$ onto $\operatorname{col}(X)$. An orthogonal change of coordinates in $\mathbb{R}^n$ splits the noise matrix into its fitted component and its residual component. Because the rows of the noise matrix are independent Gaussian vectors with common covariance $\Sigma$, orthogonal transformations preserve the joint Gaussian law and make the two coordinate blocks independent. The residual sum of squares is then the cross-product of exactly $n-q$ independent $\mathcal{N}_p(0,\Sigma)$ row vectors, which is the defining construction of the Wishart law.
[/proofplan]
[step:Express the fitted and residual matrices through the projection onto $\operatorname{col}(X)$]
Let $I_n \in \mathbb{R}^{n \times n}$ denote the identity matrix. Define the matrix
\begin{align*}
P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n}.
\end{align*}
Since $X$ has rank $q$, the matrix $X^\top X$ is invertible, and $P$ is the orthogonal projection of $\mathbb{R}^n$ onto $\operatorname{col}(X)$. The least-squares fitted matrix is
\begin{align*}
X\hat{B}
&= X(X^\top X)^{-1}X^\top Y \\
&= P Y.
\end{align*}
Therefore the residual matrix is
\begin{align*}
R := Y-X\hat{B} = (I_n-P)Y.
\end{align*}
Using $Y=XB+E$ and $PX= X$, we get
\begin{align*}
R
&= (I_n-P)(XB+E) \\
&= (I_n-P)E.
\end{align*}
Thus
\begin{align*}
S = R^\top R = E^\top(I_n-P)E.
\end{align*}
[guided]
The least-squares estimator is designed so that the fitted values are the orthogonal projection of $Y$ onto the column space of $X$. We make this projection explicit. Let $I_n \in \mathbb{R}^{n \times n}$ denote the identity matrix, and define
\begin{align*}
P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n}.
\end{align*}
The hypothesis $\operatorname{rank}(X)=q$ implies that the Gram matrix $X^\top X$ is invertible. Hence $P$ is well-defined, symmetric, idempotent, and has range $\operatorname{col}(X)$; it is the orthogonal projection onto $\operatorname{col}(X)$.
Now compute the fitted matrix:
\begin{align*}
X\hat{B}
&= X(X^\top X)^{-1}X^\top Y \\
&= P Y.
\end{align*}
Therefore the residual matrix $R \in \mathbb{R}^{n \times p}$ is
\begin{align*}
R := Y-X\hat{B} = (I_n-P)Y.
\end{align*}
Substitute the model equation $Y=XB+E$:
\begin{align*}
R
&= (I_n-P)(XB+E) \\
&= (I_n-P)XB + (I_n-P)E.
\end{align*}
Since every column of $XB$ lies in $\operatorname{col}(X)$, the projection $P$ fixes $XB$, so $(I_n-P)XB=0$. Hence
\begin{align*}
R = (I_n-P)E.
\end{align*}
Consequently the residual sum of squares matrix is
\begin{align*}
S = R^\top R = E^\top(I_n-P)^\top(I_n-P)E.
\end{align*}
Because $I_n-P$ is symmetric and idempotent, this simplifies to
\begin{align*}
S = E^\top(I_n-P)E.
\end{align*}
[/guided]
[/step]
[step:Choose orthogonal coordinates adapted to the fitted and residual subspaces]
Choose an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ whose first $q$ rows form an orthonormal basis of $\operatorname{col}(X)$ and whose last $n-q$ rows form an orthonormal basis of $\operatorname{col}(X)^\perp$. Write
\begin{align*}
Q =
\begin{pmatrix}
Q_1 \\
Q_2
\end{pmatrix},
\end{align*}
where $Q_1 \in \mathbb{R}^{q \times n}$ and $Q_2 \in \mathbb{R}^{(n-q) \times n}$. Then
\begin{align*}
P = Q_1^\top Q_1,
\qquad
I_n-P = Q_2^\top Q_2.
\end{align*}
Therefore
\begin{align*}
S = E^\top Q_2^\top Q_2 E = (Q_2E)^\top(Q_2E).
\end{align*}
Also, since $P=Q_1^\top Q_1$, the fitted matrix $PY$ is determined by $Q_1Y$, while the residual sum of squares $S$ is determined by $Q_2Y=Q_2E$.
[guided]
The projection $P$ separates the data into two orthogonal directions: the fitted directions inside $\operatorname{col}(X)$ and the residual directions inside $\operatorname{col}(X)^\perp$. To make that separation algebraic, choose an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ adapted to this direct sum decomposition. Write
\begin{align*}
Q =
\begin{pmatrix}
Q_1 \\
Q_2
\end{pmatrix},
\end{align*}
where $Q_1 \in \mathbb{R}^{q \times n}$ has rows forming an orthonormal basis of $\operatorname{col}(X)$, and $Q_2 \in \mathbb{R}^{(n-q) \times n}$ has rows forming an orthonormal basis of $\operatorname{col}(X)^\perp$.
Because $Q_1$ contains an orthonormal basis of $\operatorname{col}(X)$, the matrix $Q_1^\top Q_1$ is the orthogonal projection onto $\operatorname{col}(X)$. Since $P$ is also the orthogonal projection onto $\operatorname{col}(X)$, we have
\begin{align*}
P = Q_1^\top Q_1.
\end{align*}
Likewise, the orthogonal projection onto $\operatorname{col}(X)^\perp$ is
\begin{align*}
I_n-P = Q_2^\top Q_2.
\end{align*}
Substituting this identity into the expression for $S$ gives
\begin{align*}
S
&= E^\top(I_n-P)E \\
&= E^\top Q_2^\top Q_2 E \\
&= (Q_2E)^\top(Q_2E).
\end{align*}
This is the central reduction: the residual sum of squares is the cross-product of the bottom block of the orthogonally transformed noise matrix.
The same decomposition also identifies the source of independence. The fitted matrix $PY$ is determined by $Q_1Y$, because $P=Q_1^\top Q_1$. The residual matrix and hence $S$ are determined by $Q_2Y$. Since $Q_2X=0$, we have
\begin{align*}
Q_2Y = Q_2(XB+E)=Q_2E.
\end{align*}
Thus the fitted and residual objects depend on disjoint orthogonal blocks of the transformed noise.
[/guided]
[/step]
[step:Show that the orthogonally transformed noise has independent Gaussian rows]
Define the transformed noise matrix
\begin{align*}
Z := Q E \in \mathbb{R}^{n \times p}.
\end{align*}
For indices $r,s \in \{1,\dots,n\}$, define the Kronecker delta $\delta_{rs} \in \{0,1\}$ by $\delta_{rs}=1$ if $r=s$ and $\delta_{rs}=0$ if $r \ne s$. For each row index $r \in \{1,\dots,n\}$, let $Z_r \in \mathbb{R}^{1 \times p}$ denote the $r$th row of $Z$. For any column vectors $a_1,\dots,a_n \in \mathbb{R}^p$, the linear combination $\sum_{r=1}^n Z_r a_r$ is a real-valued Gaussian random variable because it is a linear combination of the jointly Gaussian entries of $E$. Thus the rows of $Z$ are jointly Gaussian.
For $r,s \in \{1,\dots,n\}$, compute the covariance between the column vectors $Z_r^\top$ and $Z_s^\top$:
\begin{align*}
\operatorname{Cov}(Z_r^\top,Z_s^\top)
&= \sum_{i=1}^n \sum_{j=1}^n Q_{ri}Q_{sj}\operatorname{Cov}(E_i^\top,E_j^\top) \\
&= \sum_{i=1}^n Q_{ri}Q_{si}\Sigma \\
&= (QQ^\top)_{rs}\Sigma \\
&= \delta_{rs}\Sigma.
\end{align*}
By the [Gaussian independence criterion for jointly Gaussian random vectors](/page/Multivariate%20Normal%20Distribution), zero cross-covariance between distinct Gaussian blocks implies independence. Hence $Z_1,\dots,Z_n$ are independent row vectors, and $Z_r^\top \sim \mathcal{N}_p(0,\Sigma)$ for each $r$.
[guided]
We now verify the probabilistic effect of multiplying the noise matrix by $Q$. Define
\begin{align*}
Z := Q E \in \mathbb{R}^{n \times p}.
\end{align*}
For indices $r,s \in \{1,\dots,n\}$, define the Kronecker delta $\delta_{rs} \in \{0,1\}$ by $\delta_{rs}=1$ if $r=s$ and $\delta_{rs}=0$ if $r \ne s$. For each $r \in \{1,\dots,n\}$, let $Z_r \in \mathbb{R}^{1 \times p}$ be the $r$th row of $Z$. Since $Z=QE$, the row $Z_r$ is
\begin{align*}
Z_r = \sum_{i=1}^n Q_{ri}E_i.
\end{align*}
Each $E_i^\top$ is Gaussian, and the rows $E_1,\dots,E_n$ are independent. Therefore every finite linear combination of entries of $Z$ is a finite linear combination of jointly Gaussian random variables, hence is Gaussian. Thus the rows $Z_1,\dots,Z_n$ are jointly Gaussian.
We use the [Gaussian independence criterion for jointly Gaussian random vectors](/page/Multivariate%20Normal%20Distribution): for jointly Gaussian random vectors, independence is equivalent to zero cross-covariance between the corresponding blocks. The hypothesis needed for this criterion is joint Gaussianity, which was verified above for the rows $Z_1,\dots,Z_n$. We compute the cross-covariance explicitly. For $r,s \in \{1,\dots,n\}$,
\begin{align*}
\operatorname{Cov}(Z_r^\top,Z_s^\top)
&= \operatorname{Cov}\left(\sum_{i=1}^n Q_{ri}E_i^\top,\sum_{j=1}^n Q_{sj}E_j^\top\right) \\
&= \sum_{i=1}^n \sum_{j=1}^n Q_{ri}Q_{sj}\operatorname{Cov}(E_i^\top,E_j^\top).
\end{align*}
The original rows are independent and each has covariance $\Sigma$, so
\begin{align*}
\operatorname{Cov}(E_i^\top,E_j^\top)
=
\begin{cases}
\Sigma, & i=j, \\
0, & i \ne j.
\end{cases}
\end{align*}
Therefore
\begin{align*}
\operatorname{Cov}(Z_r^\top,Z_s^\top)
&= \sum_{i=1}^n Q_{ri}Q_{si}\Sigma \\
&= (QQ^\top)_{rs}\Sigma.
\end{align*}
Since $Q$ is orthogonal, $QQ^\top=I_n$, so
\begin{align*}
\operatorname{Cov}(Z_r^\top,Z_s^\top)=\delta_{rs}\Sigma.
\end{align*}
Thus different rows of $Z$ have zero cross-covariance, and because they are jointly Gaussian, they are independent. For $r=s$, the same formula gives covariance $\Sigma$, and the mean is zero because each original row $E_i$ has mean zero. Hence
\begin{align*}
Z_r^\top \sim \mathcal{N}_p(0,\Sigma)
\end{align*}
for every $r \in \{1,\dots,n\}$.
[/guided]
[/step]
[step:Identify the residual cross-product as a Wishart random matrix]
Write
\begin{align*}
Z =
\begin{pmatrix}
Z_{\mathrm{fit}} \\
Z_{\mathrm{res}}
\end{pmatrix},
\end{align*}
where $Z_{\mathrm{fit}} := Q_1E \in \mathbb{R}^{q \times p}$ and $Z_{\mathrm{res}} := Q_2E \in \mathbb{R}^{(n-q) \times p}$. By the previous step, the rows of $Z_{\mathrm{res}}$ are independent and each row has transpose distributed as $\mathcal{N}_p(0,\Sigma)$. Since
\begin{align*}
S = (Q_2E)^\top(Q_2E) = Z_{\mathrm{res}}^\top Z_{\mathrm{res}},
\end{align*}
the defining construction of the Wishart distribution gives
\begin{align*}
S \sim W_p(\Sigma,n-q).
\end{align*}
[guided]
We now match the residual cross-product with the definition of the Wishart distribution. Write
\begin{align*}
Z =
\begin{pmatrix}
Z_{\mathrm{fit}} \\
Z_{\mathrm{res}}
\end{pmatrix},
\end{align*}
where
\begin{align*}
Z_{\mathrm{fit}} := Q_1E \in \mathbb{R}^{q \times p},
\qquad
Z_{\mathrm{res}} := Q_2E \in \mathbb{R}^{(n-q) \times p}.
\end{align*}
The previous step proved that the rows of $Z=QE$ are independent and that each row transpose has distribution $\mathcal{N}_p(0,\Sigma)$. Therefore the $(n-q)$ rows of $Z_{\mathrm{res}}$ are independent, and each row transpose has distribution $\mathcal{N}_p(0,\Sigma)$.
The residual sum of squares was reduced earlier to the cross-product of the residual block:
\begin{align*}
S = (Q_2E)^\top(Q_2E) = Z_{\mathrm{res}}^\top Z_{\mathrm{res}}.
\end{align*}
By the defining construction of the Wishart distribution, the cross-product of $n-q$ independent $\mathcal{N}_p(0,\Sigma)$ row vectors has distribution $W_p(\Sigma,n-q)$. Hence
\begin{align*}
S \sim W_p(\Sigma,n-q).
\end{align*}
[/guided]
[/step]
[step:Separate the estimator from the residual sum of squares]
The estimator $\hat{B}$ is determined by $PY$, because
\begin{align*}
\hat{B}=(X^\top X)^{-1}X^\top(PY),
\end{align*}
as $X^\top P=X^\top$. The fitted component $PY$ is determined by $Q_1Y$, while $S$ is determined by $Q_2Y=Q_2E$. Now
\begin{align*}
Q_1Y = Q_1XB + Q_1E,
\qquad
Q_2Y = Q_2E.
\end{align*}
The random matrices $Q_1E$ and $Q_2E$ consist of disjoint blocks of the independent Gaussian rows of $Z=QE$, so they are independent. Adding the deterministic matrix $Q_1XB$ to $Q_1E$ preserves independence from $Q_2E$. Therefore $Q_1Y$ and $Q_2Y$ are independent, and hence $\hat{B}$ and $S$ are independent.
[guided]
It remains to prove independence. The estimator $\hat{B}$ depends only on the fitted part of $Y$. Indeed, since $P$ is symmetric and $PX=X$, we have $X^\top P=X^\top$, and therefore
\begin{align*}
(X^\top X)^{-1}X^\top(PY)
&= (X^\top X)^{-1}X^\top Y \\
&= \hat{B}.
\end{align*}
Thus $\hat{B}$ is determined by $PY$. Because $P=Q_1^\top Q_1$, the matrix $PY$ is determined by $Q_1Y$.
The residual sum of squares is determined by the residual block:
\begin{align*}
S = (Q_2Y)^\top(Q_2Y),
\end{align*}
since $Q_2Y=Q_2E$. Now decompose the transformed data:
\begin{align*}
Q_1Y = Q_1XB + Q_1E,
\qquad
Q_2Y = Q_2XB + Q_2E.
\end{align*}
Because the rows of $Q_2$ span $\operatorname{col}(X)^\perp$, we have $Q_2X=0$, so
\begin{align*}
Q_2Y = Q_2E.
\end{align*}
The previous step showed that the rows of $Z=QE$ are independent Gaussian row vectors. Hence the top block $Q_1E$ and the bottom block $Q_2E$ are independent random matrices. Adding the deterministic matrix $Q_1XB$ to $Q_1E$ does not change independence from $Q_2E$. Therefore $Q_1Y$ and $Q_2Y$ are independent.
Finally, $\hat{B}$ is a measurable function of $Q_1Y$, and $S$ is a measurable function of $Q_2Y$. [Measurable functions](/page/Measurable%20Functions) of independent random objects are independent. Hence $\hat{B}$ and $S$ are independent. This completes the proof.
[/guided]
[/step]