[proofplan]
We identify the distribution of the scalar estimator $x_0^\top\hat{\beta}$ and the residual variance estimator under the normal linear model. The scalar estimator is normal with mean $x_0^\top\beta$ and variance $\sigma^2 x_0^\top(X^\top X)^{-1}x_0$, while the scaled residual variance has a chi-squared distribution with $n-p$ degrees of freedom and is independent of $x_0^\top\hat{\beta}$. Dividing the centered scalar estimator by its estimated standard error therefore produces a Student $t$ pivot. The confidence interval follows by rewriting the central probability statement for this pivot.
[/proofplan]
[step:Compute the normal distribution of the scalar least squares estimator]
Since $X$ has rank $p$, the matrix $X^\top X \in \mathbb{R}^{p \times p}$ is invertible. Define the matrix
\begin{align*}
A := (X^\top X)^{-1}X^\top \in \mathbb{R}^{p \times n}.
\end{align*}
Then $\hat{\beta}=AY$, and using $Y=X\beta+\varepsilon$ gives
\begin{align*}
\hat{\beta}
=
A X\beta + A\varepsilon
=
\beta + A\varepsilon,
\end{align*}
because $AX=(X^\top X)^{-1}X^\top X=I_p$.
Let
\begin{align*}
Z:\Omega &\to \mathbb{R} \\
\omega &\mapsto x_0^\top\hat{\beta}(\omega)
\end{align*}
be the scalar least squares estimator of $x_0^\top\beta$. Since $Z=x_0^\top\beta+x_0^\top A\varepsilon$ is an affine transformation of the Gaussian vector $\varepsilon$, it is Gaussian. Its mean is
\begin{align*}
\mathbb{E}[Z]
=
x_0^\top\beta+x_0^\top A\mathbb{E}[\varepsilon]
=
x_0^\top\beta,
\end{align*}
and its variance is
\begin{align*}
\operatorname{Var}(Z)
&=
x_0^\top A(\sigma^2 I_n)A^\top x_0 \\
&=
\sigma^2 x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\
&=
\sigma^2 x_0^\top (X^\top X)^{-1}x_0.
\end{align*}
Because $X^\top X$ is positive definite and $x_0\neq 0$, the number
\begin{align*}
s_0 := \sqrt{x_0^\top(X^\top X)^{-1}x_0}
\end{align*}
is strictly positive. Hence
\begin{align*}
\frac{Z-x_0^\top\beta}{\sigma s_0}
\sim
\mathcal{N}(0,1).
\end{align*}
[guided]
The first task is to reduce the vector estimator $\hat{\beta}$ to the one-dimensional estimator that appears in the interval. Because $X$ has full column rank, $X^\top X$ is invertible, so the ordinary least squares estimator is well-defined. Define
\begin{align*}
A := (X^\top X)^{-1}X^\top \in \mathbb{R}^{p \times n}.
\end{align*}
Then $\hat{\beta}=AY$. Substituting the model equation $Y=X\beta+\varepsilon$ gives
\begin{align*}
\hat{\beta}
=
A(X\beta+\varepsilon)
=
AX\beta+A\varepsilon
=
\beta+A\varepsilon,
\end{align*}
because $AX=(X^\top X)^{-1}X^\top X=I_p$.
Now define the scalar random variable
\begin{align*}
Z:\Omega &\to \mathbb{R} \\
\omega &\mapsto x_0^\top\hat{\beta}(\omega).
\end{align*}
This is the estimator of the scalar mean response $x_0^\top\beta$. Since
\begin{align*}
Z
=
x_0^\top\beta+x_0^\top A\varepsilon,
\end{align*}
and $\varepsilon$ is Gaussian, $Z$ is Gaussian because affine transformations of Gaussian random vectors are Gaussian.
We compute its parameters directly. Since $\mathbb{E}[\varepsilon]=0$,
\begin{align*}
\mathbb{E}[Z]
=
x_0^\top\beta+x_0^\top A\mathbb{E}[\varepsilon]
=
x_0^\top\beta.
\end{align*}
For the variance, use $\operatorname{Cov}(\varepsilon)=\sigma^2 I_n$:
\begin{align*}
\operatorname{Var}(Z)
&=
x_0^\top A(\sigma^2 I_n)A^\top x_0 \\
&=
\sigma^2 x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\
&=
\sigma^2 x_0^\top (X^\top X)^{-1}x_0.
\end{align*}
The assumption $x_0\neq 0$ matters here: since $X^\top X$ is positive definite, its inverse is positive definite, so
\begin{align*}
s_0 := \sqrt{x_0^\top(X^\top X)^{-1}x_0}
\end{align*}
is strictly positive. Therefore the standardized statistic satisfies
\begin{align*}
\frac{Z-x_0^\top\beta}{\sigma s_0}
\sim
\mathcal{N}(0,1).
\end{align*}
[/guided]
[/step]
[step:Use the residual variance distribution and independence]
Let
\begin{align*}
P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n}
\end{align*}
be the [orthogonal projection](/theorems/437) matrix onto the column space of $X$. Then
\begin{align*}
Y-X\hat{\beta}
=
(I_n-P)Y
=
(I_n-P)\varepsilon,
\end{align*}
because $(I_n-P)X\beta=0$. Under the normal linear model,
\begin{align*}
\frac{|Y-X\hat{\beta}|^2}{\sigma^2}
=
\frac{\varepsilon^\top(I_n-P)\varepsilon}{\sigma^2}
\sim
\chi^2_{n-p}.
\end{align*}
Equivalently,
\begin{align*}
V
:=
\frac{(n-p)\hat{\sigma}^2}{\sigma^2}
=
\frac{|Y-X\hat{\beta}|^2}{\sigma^2}
\sim
\chi^2_{n-p}.
\end{align*}
Moreover, $P\varepsilon$ and $(I_n-P)\varepsilon$ are independent Gaussian random vectors because their covariance is
\begin{align*}
\operatorname{Cov}(P\varepsilon,(I_n-P)\varepsilon)
=
\sigma^2P(I_n-P)
=
0.
\end{align*}
Since $\hat{\beta}$ is a function of $P\varepsilon$ and $\hat{\sigma}^2$ is a function of $(I_n-P)\varepsilon$, the random variables $Z=x_0^\top\hat{\beta}$ and $V$ are independent.
[guided]
The residual variance estimator depends only on the component of the noise orthogonal to the column space of $X$. Define the matrix
\begin{align*}
P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n}.
\end{align*}
This matrix is the orthogonal projection onto $\operatorname{Range}(X)$: it is symmetric, idempotent, and satisfies $PX=X$. Hence
\begin{align*}
Y-X\hat{\beta}
=
Y-X(X^\top X)^{-1}X^\top Y
=
(I_n-P)Y.
\end{align*}
Substituting $Y=X\beta+\varepsilon$ gives
\begin{align*}
Y-X\hat{\beta}
=
(I_n-P)X\beta+(I_n-P)\varepsilon
=
(I_n-P)\varepsilon,
\end{align*}
because $(I_n-P)X=0$.
The matrix $I_n-P$ is also symmetric and idempotent, so it is the orthogonal projection onto the orthogonal complement of $\operatorname{Range}(X)$. Since $\operatorname{rank}(X)=p$, the projection $I_n-P$ has rank $n-p$. Therefore the quadratic form in the Gaussian vector $\varepsilon/\sigma \sim \mathcal{N}(0,I_n)$ satisfies
\begin{align*}
\frac{|Y-X\hat{\beta}|^2}{\sigma^2}
=
\frac{\varepsilon^\top(I_n-P)\varepsilon}{\sigma^2}
\sim
\chi^2_{n-p}.
\end{align*}
By the definition of $\hat{\sigma}^2$, this is the same as
\begin{align*}
V
:=
\frac{(n-p)\hat{\sigma}^2}{\sigma^2}
=
\frac{|Y-X\hat{\beta}|^2}{\sigma^2}
\sim
\chi^2_{n-p}.
\end{align*}
We also need independence, because the Student $t$ statistic is formed by dividing a standard normal variable by the square root of an independent chi-squared variable divided by its degrees of freedom. The estimator $\hat{\beta}$ depends on the projected noise $P\varepsilon$, while the residuals depend on the orthogonal projected noise $(I_n-P)\varepsilon$. Their covariance is
\begin{align*}
\operatorname{Cov}(P\varepsilon,(I_n-P)\varepsilon)
=
P(\sigma^2 I_n)(I_n-P)
=
\sigma^2P(I_n-P)
=
0.
\end{align*}
Because these two random vectors are jointly Gaussian, zero covariance implies independence. Thus $\hat{\beta}$, and hence $Z=x_0^\top\hat{\beta}$, is independent of $\hat{\sigma}^2$ and of $V$.
[/guided]
[/step]
[step:Form the Student $t$ pivot]
Define the random variable
\begin{align*}
T:\Omega &\to \mathbb{R} \\
\omega &\mapsto
\frac{x_0^\top\hat{\beta}(\omega)-x_0^\top\beta}
{\hat{\sigma}(\omega)\sqrt{x_0^\top(X^\top X)^{-1}x_0}}.
\end{align*}
Using $s_0=\sqrt{x_0^\top(X^\top X)^{-1}x_0}$, write
\begin{align*}
T
=
\frac{(Z-x_0^\top\beta)/(\sigma s_0)}
{\sqrt{V/(n-p)}}.
\end{align*}
The numerator is $\mathcal{N}(0,1)$, the denominator is the square root of an independent $\chi^2_{n-p}$ random variable divided by $n-p$, and therefore
\begin{align*}
T \sim t_{n-p}.
\end{align*}
[guided]
We now assemble the two distributional facts into a pivot, meaning a statistic whose distribution does not depend on the unknown parameters $\beta$ and $\sigma$. Define
\begin{align*}
T:\Omega &\to \mathbb{R} \\
\omega &\mapsto
\frac{x_0^\top\hat{\beta}(\omega)-x_0^\top\beta}
{\hat{\sigma}(\omega)\sqrt{x_0^\top(X^\top X)^{-1}x_0}}.
\end{align*}
Let
\begin{align*}
s_0 := \sqrt{x_0^\top(X^\top X)^{-1}x_0}.
\end{align*}
From the first step,
\begin{align*}
\frac{Z-x_0^\top\beta}{\sigma s_0}
\sim
\mathcal{N}(0,1).
\end{align*}
From the second step,
\begin{align*}
V := \frac{(n-p)\hat{\sigma}^2}{\sigma^2}
\sim
\chi^2_{n-p},
\end{align*}
and $V$ is independent of $Z$. Therefore
\begin{align*}
T
=
\frac{(Z-x_0^\top\beta)/(\sigma s_0)}
{\sqrt{V/(n-p)}}.
\end{align*}
This is exactly the defining construction of a Student $t$ random variable with $n-p$ degrees of freedom: a standard normal random variable divided by the square root of an independent chi-squared random variable divided by its degrees of freedom. Hence
\begin{align*}
T \sim t_{n-p}.
\end{align*}
[/guided]
[/step]
[step:Rearrange the central probability statement to obtain the confidence interval]
By the definition of the quantile $t_{n-p,1-\alpha/2}$ and the symmetry of the Student $t_{n-p}$ distribution,
\begin{align*}
\mathbb{P}\left(
-t_{n-p,1-\alpha/2}
\leq
T
\leq
t_{n-p,1-\alpha/2}
\right)
=
1-\alpha.
\end{align*}
Substituting the definition of $T$ gives
\begin{align*}
\mathbb{P}\left(
-t_{n-p,1-\alpha/2}
\leq
\frac{x_0^\top\hat{\beta}-x_0^\top\beta}
{\hat{\sigma}s_0}
\leq
t_{n-p,1-\alpha/2}
\right)
=
1-\alpha.
\end{align*}
Since $\hat{\sigma}s_0\geq 0$ and $\hat{\sigma}>0$ with probability one under the full-rank normal linear model with $n>p$, this event is equivalent to
\begin{align*}
x_0^\top\hat{\beta}
-
t_{n-p,1-\alpha/2}\hat{\sigma}s_0
\leq
x_0^\top\beta
\leq
x_0^\top\hat{\beta}
+
t_{n-p,1-\alpha/2}\hat{\sigma}s_0.
\end{align*}
Replacing $s_0$ by its definition gives exactly
\begin{align*}
x_0^\top\hat{\beta}
-
t_{n-p,1-\alpha/2}\,\hat{\sigma}\sqrt{x_0^\top(X^\top X)^{-1}x_0}
\leq
x_0^\top\beta
\leq
x_0^\top\hat{\beta}
+
t_{n-p,1-\alpha/2}\,\hat{\sigma}\sqrt{x_0^\top(X^\top X)^{-1}x_0}.
\end{align*}
Therefore the displayed random interval has coverage probability $1-\alpha$ for $x_0^\top\beta$, which proves the theorem.
[/step]