[proofplan]
We decompose the prediction error into the new error and the fitted-coefficient error. The fitted-coefficient error is a linear combination of the training errors and therefore is normal with covariance $h_0\Sigma$, independent of the new error. The residual covariance estimator is independent of the fitted part and has a Wishart distribution, so the studentised prediction error has the Hotelling $T^2$ distribution with scale factor $1+h_0$. Translating that distributional identity into $F$ quantiles gives the stated prediction region.
[/proofplan]
[step:Decompose the prediction error into independent normal components]
Let $E \in \mathbb{R}^{n \times p}$ denote the error matrix whose $i$-th row is $\varepsilon_i^\top$. Then
\begin{align*}
Y = XB + E.
\end{align*}
Since $X$ has full column rank,
\begin{align*}
\hat B - B
&= (X^\top X)^{-1}X^\top E.
\end{align*}
Therefore
\begin{align*}
Y_0-\hat\mu_0
&= B^\top x_0+\varepsilon_0-\hat B^\top x_0 \\
&= \varepsilon_0-(\hat B-B)^\top x_0 \\
&= \varepsilon_0-E^\top X(X^\top X)^{-1}x_0.
\end{align*}
Define $a \in \mathbb{R}^n$ by
\begin{align*}
a := X(X^\top X)^{-1}x_0.
\end{align*}
Then
\begin{align*}
E^\top X(X^\top X)^{-1}x_0 = E^\top a = \sum_{i=1}^n a_i\varepsilon_i.
\end{align*}
Because $\varepsilon_1,\dots,\varepsilon_n$ are independent $\mathcal{N}_p(0,\Sigma)$ random vectors, the random vector $E^\top a$ is normal with mean $0$ and covariance
\begin{align*}
\operatorname{Cov}(E^\top a)
&= \sum_{i=1}^n a_i^2\Sigma
= (a^\top a)\Sigma.
\end{align*}
Moreover,
\begin{align*}
a^\top a
&= x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\
&= x_0^\top (X^\top X)^{-1}x_0
= h_0.
\end{align*}
Thus $E^\top a \sim \mathcal{N}_p(0,h_0\Sigma)$. Since $\varepsilon_0$ is independent of the training errors, $\varepsilon_0$ and $E^\top a$ are independent. Hence
\begin{align*}
Y_0-\hat\mu_0 \sim \mathcal{N}_p(0,(1+h_0)\Sigma).
\end{align*}
[guided]
The point of this step is to identify the exact covariance of the prediction error. The fitted mean $\hat\mu_0$ is random because $\hat B$ is estimated from the training sample, so the prediction error has two sources: the new error $\varepsilon_0$ and the estimation error in $\hat B$.
Let $E \in \mathbb{R}^{n \times p}$ be the matrix with $i$-th row $\varepsilon_i^\top$. Since $Y=XB+E$, the least-squares estimator satisfies
\begin{align*}
\hat B-B
&= (X^\top X)^{-1}X^\top Y-B \\
&= (X^\top X)^{-1}X^\top(XB+E)-B \\
&= B+(X^\top X)^{-1}X^\top E-B \\
&= (X^\top X)^{-1}X^\top E.
\end{align*}
Therefore
\begin{align*}
Y_0-\hat\mu_0
&= B^\top x_0+\varepsilon_0-\hat B^\top x_0 \\
&= \varepsilon_0-(\hat B-B)^\top x_0 \\
&= \varepsilon_0-E^\top X(X^\top X)^{-1}x_0.
\end{align*}
Define the coefficient vector $a \in \mathbb{R}^n$ by
\begin{align*}
a := X(X^\top X)^{-1}x_0.
\end{align*}
Then the estimation part is
\begin{align*}
E^\top a = \sum_{i=1}^n a_i\varepsilon_i.
\end{align*}
This is a linear combination of independent multivariate normal random vectors, hence it is multivariate normal. Its mean is $0$, and its covariance is computed using independence:
\begin{align*}
\operatorname{Cov}(E^\top a)
&= \operatorname{Cov}\left(\sum_{i=1}^n a_i\varepsilon_i\right) \\
&= \sum_{i=1}^n a_i^2\operatorname{Cov}(\varepsilon_i) \\
&= \sum_{i=1}^n a_i^2\Sigma \\
&= (a^\top a)\Sigma.
\end{align*}
Now compute the scalar $a^\top a$:
\begin{align*}
a^\top a
&= x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\
&= x_0^\top (X^\top X)^{-1}x_0 \\
&= h_0.
\end{align*}
Thus $E^\top a \sim \mathcal{N}_p(0,h_0\Sigma)$. The new error $\varepsilon_0$ is independent of all training errors, so it is independent of $E^\top a$. Adding independent normal vectors adds their covariance matrices, and the minus sign does not change the covariance. Hence
\begin{align*}
Y_0-\hat\mu_0 \sim \mathcal{N}_p(0,\Sigma+h_0\Sigma)
= \mathcal{N}_p(0,(1+h_0)\Sigma).
\end{align*}
[/guided]
[/step]
[step:Identify the independent Wishart residual covariance]
Define the projection matrix $H \in \mathbb{R}^{n \times n}$ and residual projection matrix $M \in \mathbb{R}^{n \times n}$ by
\begin{align*}
H := X(X^\top X)^{-1}X^\top,\qquad M := I_n-H.
\end{align*}
Then
\begin{align*}
Y-X\hat B = MY = M(XB+E)=ME,
\end{align*}
because $MX=0$. Hence
\begin{align*}
\nu S = E^\top M E.
\end{align*}
The matrix $M$ is symmetric idempotent with rank $\nu=n-q$. Choose a matrix $C \in \mathbb{R}^{\nu \times n}$ whose rows form an orthonormal basis of $\operatorname{Range}(M)$, so that
\begin{align*}
CC^\top=I_\nu,\qquad C^\top C=M.
\end{align*}
Then
\begin{align*}
\nu S = E^\top C^\top C E = (CE)^\top(CE).
\end{align*}
The rows of $CE$ are independent $\mathcal{N}_p(0,\Sigma)$ random vectors, so
\begin{align*}
\nu S \sim \operatorname{Wishart}_p(\Sigma,\nu).
\end{align*}
The vector $E^\top a$ depends only on the projection of $E$ onto $\operatorname{Range}(H)$, while $ME$ depends only on the projection of $E$ onto $\operatorname{Range}(M)$. Since $H$ and $M$ are orthogonal projections with $HM=0$, the corresponding Gaussian projections are independent. Therefore $Y_0-\hat\mu_0$ is independent of $S$.
[guided]
The residual covariance estimator must be independent of the prediction error after the covariance structure has been separated. This is the normal-theory fact that makes the exact $F$ calibration possible.
Define
\begin{align*}
H := X(X^\top X)^{-1}X^\top,\qquad M := I_n-H.
\end{align*}
The matrix $H$ is the [orthogonal projection](/theorems/437) onto the column space of $X$, and $M$ is the orthogonal projection onto its orthogonal complement. Since $HX=X$, we have $MX=0$. Therefore
\begin{align*}
Y-X\hat B
&= Y-X(X^\top X)^{-1}X^\top Y \\
&= (I_n-H)Y \\
&= MY \\
&= M(XB+E) \\
&= ME.
\end{align*}
Thus
\begin{align*}
\nu S = (Y-X\hat B)^\top(Y-X\hat B)=E^\top M E.
\end{align*}
Because $X$ has rank $q$, the projection $H$ has rank $q$, and hence $M=I_n-H$ has rank $\nu=n-q$. Choose $C \in \mathbb{R}^{\nu \times n}$ whose rows form an orthonormal basis for $\operatorname{Range}(M)$. Then
\begin{align*}
CC^\top=I_\nu,\qquad C^\top C=M.
\end{align*}
Consequently,
\begin{align*}
\nu S=E^\top M E=E^\top C^\top C E=(CE)^\top(CE).
\end{align*}
Each row of $CE$ is a fixed linear combination of the independent Gaussian rows of $E$. The orthonormality condition $CC^\top=I_\nu$ gives covariance $\Sigma$ for each row and zero cross-covariance between distinct rows. Since the joint distribution is Gaussian, zero cross-covariance gives independence. Hence the rows of $CE$ are independent $\mathcal{N}_p(0,\Sigma)$ random vectors, and so
\begin{align*}
\nu S \sim \operatorname{Wishart}_p(\Sigma,\nu).
\end{align*}
It remains to justify independence from the prediction error. The training part of the prediction error is $E^\top a$, where
\begin{align*}
a=X(X^\top X)^{-1}x_0.
\end{align*}
This vector $a$ lies in $\operatorname{Range}(X)=\operatorname{Range}(H)$. Thus $E^\top a$ depends only on the $H$-projection of $E$. The residual matrix $ME$ depends only on the $M$-projection of $E$. Since $H$ and $M$ are orthogonal projections and $HM=0$, these two Gaussian projections are independent. The new error $\varepsilon_0$ is also independent of the training errors. Therefore $Y_0-\hat\mu_0$ is independent of $S$.
[/guided]
[/step]
[step:Studentise the prediction error and obtain the Hotelling statistic]
Define
\begin{align*}
Z := \frac{1}{\sqrt{1+h_0}}(Y_0-\hat\mu_0).
\end{align*}
From the preceding steps,
\begin{align*}
Z \sim \mathcal{N}_p(0,\Sigma),\qquad \nu S \sim \operatorname{Wishart}_p(\Sigma,\nu),
\end{align*}
and $Z$ is independent of $S$. By the defining Hotelling $T^2$ distributional identity for an independent normal vector and Wishart covariance estimator,
\begin{align*}
\frac{\nu-p+1}{p\nu}Z^\top S^{-1}Z \sim F_{p,\nu-p+1}.
\end{align*}
Since
\begin{align*}
Z^\top S^{-1}Z
=
\frac{1}{1+h_0}(Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0),
\end{align*}
we obtain
\begin{align*}
\frac{\nu-p+1}{p\nu(1+h_0)}
(Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0)
\sim F_{p,\nu-p+1}.
\end{align*}
[/step]
[step:Choose the quantile to get exact coverage]
Let $f_{p,\nu-p+1;1-\alpha}$ be the $(1-\alpha)$-quantile of $F_{p,\nu-p+1}$. From the distributional identity just proved,
\begin{align*}
&\mathbb{P}\left(
\frac{\nu-p+1}{p\nu(1+h_0)}
(Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0)
\le f_{p,\nu-p+1;1-\alpha}
\right) \\
&\qquad = 1-\alpha.
\end{align*}
Multiplying both sides of the inequality by the positive scalar
\begin{align*}
(1+h_0)\frac{p\nu}{\nu-p+1}
\end{align*}
gives
\begin{align*}
\mathbb{P}\left(
(Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0)
\le
(1+h_0)\frac{p\nu}{\nu-p+1}f_{p,\nu-p+1;1-\alpha}
\right)
=1-\alpha.
\end{align*}
This is exactly the event $Y_0 \in \mathcal{P}_{1-\alpha}$, so
\begin{align*}
\mathbb{P}(Y_0 \in \mathcal{P}_{1-\alpha})=1-\alpha.
\end{align*}
[/step]