[proofplan]
We first derive the constrained least-squares estimator under the restriction $R\beta=r$ and show that the increase in residual sum of squares is exactly the Wald quadratic form. Then we compute the distribution of $R\hat{\beta}-r$ under $H_0$ and standardise it to obtain a $\chi^2_q$ random variable. Finally, we show that this numerator is independent of the residual variance estimator, whose scaled form has the $\chi^2_{n-p}$ distribution, and the quotient is therefore $F_{q,n-p}$.
[/proofplan]
[step:Express the constrained residual sum of squares as a quadratic form in $R\hat{\beta}-r$]
Let $A \in \mathbb{R}^{p \times p}$ denote the inverse Gram matrix
\begin{align*}
A = (X^\top X)^{-1},
\end{align*}
and let $V \in \mathbb{R}^{q \times q}$ denote the restriction covariance matrix without the factor $\sigma^2$:
\begin{align*}
V = RAR^\top.
\end{align*}
Since $X$ has full column rank, $X^\top X$ is positive definite, so $A$ is positive definite. Since $R$ has full row rank, for every nonzero $u \in \mathbb{R}^q$,
\begin{align*}
u^\top V u
=
(R^\top u)^\top A(R^\top u)
>
0,
\end{align*}
because $R^\top u \neq 0$. Hence $V$ is positive definite and invertible.
Define the unrestricted residual sum of squares $S_U: \Omega \to [0,\infty)$ by
\begin{align*}
S_U = |Y-X\hat{\beta}|^2,
\end{align*}
and define the restricted residual sum of squares $S_R: \Omega \to [0,\infty)$ by
\begin{align*}
S_R = \min\{|Y-Xb|^2 : b \in \mathbb{R}^p,\ Rb=r\}.
\end{align*}
The affine constraint set is nonempty: the matrix $R \in \mathbb{R}^{q \times p}$ has full row rank, so the [linear map](/page/Linear%20Map) $b \mapsto Rb$ from $\mathbb{R}^p$ to $\mathbb{R}^q$ has range dimension $q$ and is surjective; hence there exists $b_0 \in \mathbb{R}^p$ with $Rb_0=r$.
For each $b \in \mathbb{R}^p$, expanding around $\hat{\beta}$ gives
\begin{align*}
|Y-Xb|^2
&=
|Y-X\hat{\beta}+X(\hat{\beta}-b)|^2 \\
&=
|Y-X\hat{\beta}|^2
+2(Y-X\hat{\beta})^\top X(\hat{\beta}-b)
+(\hat{\beta}-b)^\top X^\top X(\hat{\beta}-b).
\end{align*}
The normal equations $X^\top(Y-X\hat{\beta})=0$ annihilate the cross term, so
\begin{align*}
|Y-Xb|^2
=
S_U+(b-\hat{\beta})^\top X^\top X(b-\hat{\beta}).
\end{align*}
We minimize this expression subject to $Rb=r$. The Lagrange multiplier equations for the map
\begin{align*}
L: \mathbb{R}^p \times \mathbb{R}^q &\to \mathbb{R} \\
(b,\lambda) &\mapsto (b-\hat{\beta})^\top X^\top X(b-\hat{\beta})+2\lambda^\top(Rb-r)
\end{align*}
are
\begin{align*}
X^\top X(b-\hat{\beta})+R^\top\lambda &= 0, \\
Rb-r &= 0.
\end{align*}
The first equation gives $b=\hat{\beta}-AR^\top\lambda$. Substituting this into the constraint gives
\begin{align*}
R\hat{\beta}-RAR^\top\lambda-r=0,
\end{align*}
hence
\begin{align*}
\lambda = V^{-1}(R\hat{\beta}-r).
\end{align*}
Therefore the restricted least-squares estimator $\tilde{\beta}: \Omega \to \mathbb{R}^p$ is
\begin{align*}
\tilde{\beta}
=
\hat{\beta}-AR^\top V^{-1}(R\hat{\beta}-r).
\end{align*}
Using the quadratic expansion at $b=\tilde{\beta}$,
\begin{align*}
S_R-S_U
&=
(\tilde{\beta}-\hat{\beta})^\top X^\top X(\tilde{\beta}-\hat{\beta}) \\
&=
(R\hat{\beta}-r)^\top V^{-1}RA X^\top X AR^\top V^{-1}(R\hat{\beta}-r) \\
&=
(R\hat{\beta}-r)^\top V^{-1}RAR^\top V^{-1}(R\hat{\beta}-r) \\
&=
(R\hat{\beta}-r)^\top V^{-1}(R\hat{\beta}-r).
\end{align*}
Since $V=R(X^\top X)^{-1}R^\top$, the partial $F$-statistic
\begin{align*}
F=\frac{(S_R-S_U)/q}{S_U/(n-p)}
\end{align*}
is exactly
\begin{align*}
F
=
\frac{1}{q}
\frac{(R\hat{\beta}-r)^\top\left(R(X^\top X)^{-1}R^\top\right)^{-1}(R\hat{\beta}-r)}
{\hat{\sigma}^2}.
\end{align*}
[guided]
The goal of this step is to connect the geometric residual-sum-of-squares definition of the partial $F$-test with the matrix expression in the theorem.
Define
\begin{align*}
A=(X^\top X)^{-1},
\qquad
V=RAR^\top.
\end{align*}
Because $X$ has full column rank, $X^\top X$ is positive definite and therefore invertible. Thus $A$ is positive definite. Because $R$ has full row rank, $R^\top u=0$ implies $u=0$ for $u \in \mathbb{R}^q$. Hence, for $u \neq 0$,
\begin{align*}
u^\top V u
=
(R^\top u)^\top A(R^\top u)
>
0.
\end{align*}
So $V$ is positive definite and invertible. This verifies that the inverse appearing in the Wald statistic is well-defined.
Now define
\begin{align*}
S_U=|Y-X\hat{\beta}|^2,
\qquad
S_R=\min\{|Y-Xb|^2:b\in\mathbb{R}^p,\ Rb=r\}.
\end{align*}
The set over which $S_R$ is minimized is not empty. Indeed, since $R \in \mathbb{R}^{q \times p}$ has full row rank, the linear map $b \mapsto Rb$ from $\mathbb{R}^p$ to $\mathbb{R}^q$ has range dimension $q$, so it is surjective. Because $r \in \mathbb{R}^q$, there exists $b_0 \in \mathbb{R}^p$ satisfying $Rb_0=r$.
The basic least-squares identity is obtained by expanding around the unrestricted minimizer $\hat{\beta}$. For $b \in \mathbb{R}^p$,
\begin{align*}
|Y-Xb|^2
&=
|Y-X\hat{\beta}+X(\hat{\beta}-b)|^2 \\
&=
|Y-X\hat{\beta}|^2
+2(Y-X\hat{\beta})^\top X(\hat{\beta}-b)
+(\hat{\beta}-b)^\top X^\top X(\hat{\beta}-b).
\end{align*}
The middle term vanishes because the normal equations for ordinary least squares are
\begin{align*}
X^\top(Y-X\hat{\beta})=0.
\end{align*}
Therefore
\begin{align*}
|Y-Xb|^2
=
S_U+(b-\hat{\beta})^\top X^\top X(b-\hat{\beta}).
\end{align*}
Thus the restricted problem is the problem of minimizing the positive definite quadratic form
\begin{align*}
(b-\hat{\beta})^\top X^\top X(b-\hat{\beta})
\end{align*}
over the affine set $\{b\in\mathbb{R}^p:Rb=r\}$. Introduce a Lagrange multiplier $\lambda \in \mathbb{R}^q$ and define
\begin{align*}
L: \mathbb{R}^p \times \mathbb{R}^q &\to \mathbb{R} \\
(b,\lambda) &\mapsto (b-\hat{\beta})^\top X^\top X(b-\hat{\beta})+2\lambda^\top(Rb-r).
\end{align*}
The stationarity equations are
\begin{align*}
X^\top X(b-\hat{\beta})+R^\top\lambda &=0,\\
Rb-r&=0.
\end{align*}
The first equation gives
\begin{align*}
b=\hat{\beta}-AR^\top\lambda.
\end{align*}
Substitution into $Rb=r$ gives
\begin{align*}
R\hat{\beta}-RAR^\top\lambda-r=0.
\end{align*}
Since $V=RAR^\top$ is invertible,
\begin{align*}
\lambda=V^{-1}(R\hat{\beta}-r).
\end{align*}
Therefore the restricted estimator is
\begin{align*}
\tilde{\beta}
=
\hat{\beta}-AR^\top V^{-1}(R\hat{\beta}-r).
\end{align*}
Now compute the increase in residual sum of squares:
\begin{align*}
S_R-S_U
&=
(\tilde{\beta}-\hat{\beta})^\top X^\top X(\tilde{\beta}-\hat{\beta}) \\
&=
(R\hat{\beta}-r)^\top V^{-1}RA X^\top X AR^\top V^{-1}(R\hat{\beta}-r).
\end{align*}
Since $A=(X^\top X)^{-1}$, we have $AX^\top XA=A$, so
\begin{align*}
S_R-S_U
&=
(R\hat{\beta}-r)^\top V^{-1}RAR^\top V^{-1}(R\hat{\beta}-r) \\
&=
(R\hat{\beta}-r)^\top V^{-1}V V^{-1}(R\hat{\beta}-r) \\
&=
(R\hat{\beta}-r)^\top V^{-1}(R\hat{\beta}-r).
\end{align*}
Since $\hat{\sigma}^2=S_U/(n-p)$, the residual-sum-of-squares form
\begin{align*}
F=\frac{(S_R-S_U)/q}{S_U/(n-p)}
\end{align*}
becomes exactly
\begin{align*}
F
=
\frac{1}{q}
\frac{(R\hat{\beta}-r)^\top\left(R(X^\top X)^{-1}R^\top\right)^{-1}(R\hat{\beta}-r)}
{\hat{\sigma}^2}.
\end{align*}
[/guided]
[/step]
[step:Standardise the restriction estimator to obtain a $\chi^2_q$ numerator]
Since
\begin{align*}
\hat{\beta}
=
(X^\top X)^{-1}X^\top Y
=
\beta+(X^\top X)^{-1}X^\top\varepsilon,
\end{align*}
the random vector $\hat{\beta}: \Omega \to \mathbb{R}^p$ is Gaussian with mean $\beta$ and covariance matrix $\sigma^2 A$ by the standard affine-transformation property of Gaussian random vectors. Hence
\begin{align*}
R\hat{\beta}-r
\end{align*}
is Gaussian with mean $R\beta-r$ and covariance matrix $\sigma^2 V$. Under $H_0$, this mean is zero.
Let $V^{-1/2} \in \mathbb{R}^{q \times q}$ denote the positive definite inverse square root of $V$, and define
\begin{align*}
Z: \Omega \to \mathbb{R}^q,
\qquad
Z=\sigma^{-1}V^{-1/2}(R\hat{\beta}-r).
\end{align*}
Under $H_0$, $Z \sim \mathcal{N}(0,I_q)$. Define the numerator statistic $Q: \Omega \to \mathbb{R}$ by
\begin{align*}
Q
=
\frac{(R\hat{\beta}-r)^\top V^{-1}(R\hat{\beta}-r)}{\sigma^2}.
\end{align*}
Then
\begin{align*}
Q
=
Z^\top Z
=
\sum_{j=1}^q Z_j^2,
\end{align*}
so $Q$ has the $\chi^2_q$ distribution by the definition of the chi-squared distribution as a sum of squares of independent standard normal random variables.
[guided]
We now identify the numerator distribution. The ordinary least-squares estimator is an affine function of the Gaussian error vector:
\begin{align*}
\hat{\beta}
=
(X^\top X)^{-1}X^\top Y
=
(X^\top X)^{-1}X^\top(X\beta+\varepsilon)
=
\beta+(X^\top X)^{-1}X^\top\varepsilon.
\end{align*}
By the standard affine-transformation property of Gaussian random vectors, an affine image of a Gaussian vector is Gaussian. Applying that fact to the affine map $e \mapsto \beta+(X^\top X)^{-1}X^\top e$ from $\mathbb{R}^n$ to $\mathbb{R}^p$ shows that $\hat{\beta}$ is Gaussian. Its mean is
\begin{align*}
\mathbb{E}[\hat{\beta}]
=
\beta+(X^\top X)^{-1}X^\top\mathbb{E}[\varepsilon]
=
\beta,
\end{align*}
and its covariance matrix is
\begin{align*}
\operatorname{Cov}(\hat{\beta})
&=
(X^\top X)^{-1}X^\top(\sigma^2 I_n)X(X^\top X)^{-1} \\
&=
\sigma^2(X^\top X)^{-1}.
\end{align*}
Thus $R\hat{\beta}-r$ is Gaussian with mean $R\beta-r$ and covariance
\begin{align*}
\sigma^2R(X^\top X)^{-1}R^\top
=
\sigma^2V.
\end{align*}
Under the null hypothesis $H_0:R\beta=r$, the mean is zero.
Because $V$ is positive definite, it has a positive definite inverse square root $V^{-1/2}$. Define the standardised random vector
\begin{align*}
Z: \Omega \to \mathbb{R}^q,
\qquad
Z=\sigma^{-1}V^{-1/2}(R\hat{\beta}-r).
\end{align*}
Under $H_0$, its mean is zero and its covariance matrix is
\begin{align*}
\operatorname{Cov}(Z)
&=
\sigma^{-2}V^{-1/2}(\sigma^2V)V^{-1/2} \\
&=
I_q.
\end{align*}
Hence $Z\sim\mathcal{N}(0,I_q)$. Define the numerator statistic $Q: \Omega \to \mathbb{R}$ by
\begin{align*}
Q
=
\frac{(R\hat{\beta}-r)^\top V^{-1}(R\hat{\beta}-r)}{\sigma^2}.
\end{align*}
Then the quadratic form in the numerator satisfies
\begin{align*}
Q
=
Z^\top Z
=
\sum_{j=1}^q Z_j^2.
\end{align*}
Since the coordinates $Z_1,\dots,Z_q$ are independent standard normal random variables, this sum has the $\chi^2_q$ distribution by the definition of the chi-squared distribution.
[/guided]
[/step]
[step:Show that the residual variance gives an independent $\chi^2_{n-p}$ denominator]
Define the projection matrix $P_X \in \mathbb{R}^{n \times n}$ onto the column space of $X$ by
\begin{align*}
P_X=X(X^\top X)^{-1}X^\top,
\end{align*}
and define the residual projection matrix $M_X \in \mathbb{R}^{n \times n}$ by
\begin{align*}
M_X=I_n-P_X.
\end{align*}
Then
\begin{align*}
Y-X\hat{\beta}=M_XY=M_X\varepsilon,
\end{align*}
because $M_XX\beta=0$. The matrix $M_X$ is symmetric and idempotent, and its rank is $n-p$.
By the spectral theorem for real symmetric matrices, applied to the symmetric idempotent matrix $M_X$, there is an orthonormal basis of eigenvectors and every eigenvalue is either $0$ or $1$ because $\lambda^2=\lambda$ for each eigenvalue $\lambda$.
Choose an orthogonal matrix $U \in \mathbb{R}^{n \times n}$ such that
\begin{align*}
UM_XU^\top
=
\begin{pmatrix}
I_{n-p} & 0 \\
0 & 0_p
\end{pmatrix}.
\end{align*}
Define $\xi: \Omega \to \mathbb{R}^n$ by
\begin{align*}
\xi=\sigma^{-1}U\varepsilon.
\end{align*}
Since $U$ is orthogonal and $\varepsilon\sim\mathcal{N}(0,\sigma^2I_n)$, we have $\xi\sim\mathcal{N}(0,I_n)$. Therefore
\begin{align*}
\frac{(n-p)\hat{\sigma}^2}{\sigma^2}
=
\frac{|M_X\varepsilon|^2}{\sigma^2}
=
\xi^\top
\begin{pmatrix}
I_{n-p} & 0 \\
0 & 0_p
\end{pmatrix}
\xi
=
\sum_{j=1}^{n-p}\xi_j^2
\end{align*}
has the $\chi^2_{n-p}$ distribution.
Moreover, $\hat{\beta}-\beta=A X^\top\varepsilon$ depends only on $P_X\varepsilon$, while $Y-X\hat{\beta}=M_X\varepsilon$ depends only on $M_X\varepsilon$. Since $P_XM_X=0$ and $\varepsilon$ is Gaussian with covariance $\sigma^2I_n$, the Gaussian random vectors $P_X\varepsilon$ and $M_X\varepsilon$ have zero cross-covariance. By the standard independence criterion for jointly Gaussian random vectors, zero cross-covariance implies independence. Thus the numerator variable $Q$ is independent of $\hat{\sigma}^2$.
[guided]
The residual variance estimator is built from the part of $Y$ orthogonal to the column space of $X$. Define
\begin{align*}
P_X=X(X^\top X)^{-1}X^\top,
\qquad
M_X=I_n-P_X.
\end{align*}
The matrix $P_X$ is the [orthogonal projection](/theorems/437) onto $\operatorname{Range}(X)$, and $M_X$ is the orthogonal projection onto its orthogonal complement. Since
\begin{align*}
X\hat{\beta}=P_XY,
\end{align*}
we have
\begin{align*}
Y-X\hat{\beta}
=
(I_n-P_X)Y
=
M_XY.
\end{align*}
Using $Y=X\beta+\varepsilon$ and $M_XX=0$,
\begin{align*}
Y-X\hat{\beta}
=
M_X(X\beta+\varepsilon)
=
M_X\varepsilon.
\end{align*}
The matrix $M_X$ is symmetric and idempotent. Its rank is $n-p$ because $P_X$ has rank $p$ and projects onto the $p$-dimensional column space of $X$. By the spectral theorem for real symmetric matrices, the symmetric matrix $M_X$ is orthogonally diagonalizable. Since $M_X^2=M_X$, every eigenvalue $\lambda$ of $M_X$ satisfies $\lambda^2=\lambda$, hence $\lambda \in \{0,1\}$. The number of eigenvalues equal to $1$ is the rank of $M_X$, namely $n-p$, so there exists an orthogonal matrix $U \in \mathbb{R}^{n\times n}$ such that
\begin{align*}
UM_XU^\top
=
\begin{pmatrix}
I_{n-p} & 0 \\
0 & 0_p
\end{pmatrix}.
\end{align*}
Define
\begin{align*}
\xi: \Omega \to \mathbb{R}^n,
\qquad
\xi=\sigma^{-1}U\varepsilon.
\end{align*}
Because $U$ is orthogonal and $\varepsilon\sim\mathcal{N}(0,\sigma^2I_n)$, the vector $\xi$ has distribution $\mathcal{N}(0,I_n)$. Hence its coordinates are independent standard normal random variables. Therefore
\begin{align*}
\frac{(n-p)\hat{\sigma}^2}{\sigma^2}
&=
\frac{|Y-X\hat{\beta}|^2}{\sigma^2} \\
&=
\frac{|M_X\varepsilon|^2}{\sigma^2} \\
&=
\xi^\top
\begin{pmatrix}
I_{n-p} & 0 \\
0 & 0_p
\end{pmatrix}
\xi \\
&=
\sum_{j=1}^{n-p}\xi_j^2.
\end{align*}
This has the $\chi^2_{n-p}$ distribution.
It remains to check independence from the numerator. The numerator depends on $R\hat{\beta}-r$, and
\begin{align*}
\hat{\beta}-\beta=A X^\top\varepsilon.
\end{align*}
Thus it depends only on the projection of $\varepsilon$ onto $\operatorname{Range}(X)$, namely $P_X\varepsilon$. The residual vector is $M_X\varepsilon$, which lies in the orthogonal complement of $\operatorname{Range}(X)$. Their cross-covariance is
\begin{align*}
\operatorname{Cov}(P_X\varepsilon,M_X\varepsilon)
=
P_X(\sigma^2I_n)M_X
=
\sigma^2P_XM_X
=
0.
\end{align*}
Since $(P_X\varepsilon,M_X\varepsilon)$ is jointly Gaussian, the standard jointly Gaussian independence criterion applies: zero cross-covariance implies independence. Therefore the quadratic-form numerator is independent of $\hat{\sigma}^2$.
[/guided]
[/step]
[step:Form the quotient and identify the $F_{q,n-p}$ distribution]
Under $H_0$, define the random variables $Q: \Omega \to \mathbb{R}$ and $W: \Omega \to \mathbb{R}$ by
\begin{align*}
Q=\frac{(R\hat{\beta}-r)^\top V^{-1}(R\hat{\beta}-r)}{\sigma^2},
\qquad
W=\frac{(n-p)\hat{\sigma}^2}{\sigma^2}.
\end{align*}
The previous steps show that $Q\sim\chi^2_q$, $W\sim\chi^2_{n-p}$, and $Q$ and $W$ are independent. Therefore
\begin{align*}
\frac{Q/q}{W/(n-p)}
\sim
F_{q,n-p}.
\end{align*}
Finally,
\begin{align*}
\frac{Q/q}{W/(n-p)}
&=
\frac{
\frac{1}{q}(R\hat{\beta}-r)^\top V^{-1}(R\hat{\beta}-r)/\sigma^2
}{
\hat{\sigma}^2/\sigma^2
} \\
&=
\frac{1}{q}
\frac{(R\hat{\beta}-r)^\top V^{-1}(R\hat{\beta}-r)}
{\hat{\sigma}^2}.
\end{align*}
Substituting $V=R(X^\top X)^{-1}R^\top$ gives the displayed Wald statistic, so the statistic has the $F_{q,n-p}$ distribution under $H_0$.
[/step]