[proofplan]
We split the prediction error into two independent normal pieces: the estimation error $x^{*\top}(\hat\beta - \beta)$, a function of the training data $Y$, and the future noise $\varepsilon^*$, independent of $Y$ by hypothesis. Computing the mean and variance of each piece and summing gives $\hat Y^* - Y^* \sim N(0, \sigma^2(\tau^2 + 1))$. For the $t$-distribution claim we standardise by $\sigma$, replace $\sigma$ by its unbiased estimate $\tilde\sigma$, and use that $\hat\beta$, $\mathrm{RSS}$, and $\varepsilon^*$ are pairwise independent to build an independent $\chi^2_{n-p}$ denominator — the same construction as the $t$-distribution proof for normalised coefficients.
[/proofplan]
[step:Decompose the prediction error into an estimation error and an independent future noise]
The predicted and true responses at the new covariate vector $x^* \in \mathbb{R}^p$ are
\begin{align*}
\hat Y^* = x^{*\top} \hat\beta, \qquad Y^* = x^{*\top} \beta + \varepsilon^*,
\end{align*}
with $\varepsilon^* \sim N(0, \sigma^2)$ independent of the training data $Y$. Subtracting,
\begin{align*}
\hat Y^* - Y^* &= x^{*\top} \hat\beta - x^{*\top}\beta - \varepsilon^* = x^{*\top}(\hat\beta - \beta) - \varepsilon^*.
\end{align*}
Define the two components as the maps
\begin{align*}
U &: \Omega \to \mathbb{R}, & \omega &\mapsto x^{*\top}(\hat\beta(\omega) - \beta), \\
V &: \Omega \to \mathbb{R}, & \omega &\mapsto \varepsilon^*(\omega),
\end{align*}
so that $\hat Y^* - Y^* = U - V$.
*Independence of $U$ and $V$.* The estimator $\hat\beta$ is a measurable function of $Y$ (explicitly $\hat\beta = (X^\top X)^{-1} X^\top Y$), so $U = x^{*\top}(\hat\beta - \beta)$ is a measurable function of $Y$. By hypothesis $\varepsilon^*$ is independent of $Y$, and independence is preserved under measurable transformations, so $U \perp\!\!\!\perp V$.
[/step]
[step:Compute the mean and variance of the estimation error $U$]
By [Distributional Properties of the Normal Linear Model](/theorems/1445) part (1), $\hat\beta \sim N_p(\beta, \sigma^2 (X^\top X)^{-1})$, so $\hat\beta - \beta \sim N_p(\mathbf{0}, \sigma^2(X^\top X)^{-1})$. The quantity $U = x^{*\top}(\hat\beta - \beta)$ is a fixed linear combination of a multivariate normal, hence normal by the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) theorem applied to the $1 \times p$ matrix $x^{*\top}$:
\begin{align*}
U &\sim N\!\big(x^{*\top} \cdot \mathbf{0},\; x^{*\top}\, \sigma^2(X^\top X)^{-1}\, x^*\big) = N\!\big(0,\; \sigma^2 \tau^2\big),
\end{align*}
where $\tau^2 := x^{*\top}(X^\top X)^{-1} x^*$ is the leverage of $x^*$.
[guided]
We want the mean and variance of $U = x^{*\top}(\hat\beta - \beta)$. Because $\hat\beta$ is multivariate normal, any linear combination of its entries is univariate normal; we only need to read off its two parameters.
*Mean.* By linearity of expectation,
\begin{align*}
\mathbb{E}[U] = x^{*\top} \mathbb{E}[\hat\beta - \beta] = x^{*\top}(\beta - \beta) = 0.
\end{align*}
*Variance.* For a linear combination $a^\top Z$ of a random vector $Z$ with covariance $\Sigma$, $\operatorname{Var}(a^\top Z) = a^\top \Sigma a$. Here $a = x^*$ and $\Sigma = \sigma^2 (X^\top X)^{-1}$, so
\begin{align*}
\operatorname{Var}(U) = x^{*\top} \sigma^2(X^\top X)^{-1} x^* = \sigma^2 \cdot x^{*\top}(X^\top X)^{-1} x^* = \sigma^2 \tau^2,
\end{align*}
where $\tau^2$ is the hat matrix leverage at the new point — it measures how "far" $x^*$ is from the centre of mass of the training design in the geometry defined by $X^\top X$, and it is always non-negative because $(X^\top X)^{-1}$ is positive definite (established in the proof of [t-Distribution of Normalised Coefficient Estimate](/theorems/1446)). Therefore $U \sim N(0, \sigma^2 \tau^2)$.
*Normality.* Formally, $U$ is normal because a linear combination of a multivariate normal is normal by the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) theorem (applied to the $1 \times p$ matrix $x^{*\top}$, not restricted to orthogonal matrices — the theorem covers general matrices). Combining the three pieces:
\begin{align*}
U \sim N(0, \sigma^2 \tau^2).
\end{align*}
[/guided]
[/step]
[step:Compute the distribution of $\hat Y^* - Y^* = U - V$]
We have $U \sim N(0, \sigma^2 \tau^2)$ (Step 2), $V = \varepsilon^* \sim N(0, \sigma^2)$ (hypothesis), and $U \perp\!\!\!\perp V$ (Step 1). The difference of two independent normals is normal, with mean equal to the difference of means and variance equal to the **sum** of variances:
\begin{align*}
\hat Y^* - Y^* = U - V &\sim N\!\big(0 - 0,\; \sigma^2 \tau^2 + \sigma^2\big) = N\!\big(0,\; \sigma^2(\tau^2 + 1)\big).
\end{align*}
This proves the first claim of the theorem.
[guided]
Three facts combine to give the distribution of $U - V$.
1. *Linearity for means.* $\mathbb{E}[U - V] = \mathbb{E}[U] - \mathbb{E}[V] = 0 - 0 = 0$.
2. *Variance of a sum (or difference) of independent variables.* For independent $U$ and $V$, $\operatorname{Var}(U - V) = \operatorname{Var}(U) + \operatorname{Var}(-V) = \operatorname{Var}(U) + \operatorname{Var}(V) = \sigma^2 \tau^2 + \sigma^2 = \sigma^2(\tau^2 + 1)$. The negative sign on $V$ does not flip the variance — variances add under independence regardless of sign.
3. *Independent normals sum to a normal.* This is the stability of the normal family under independent sums: if $U \sim N(\mu_1, \sigma_1^2)$ and $V \sim N(\mu_2, \sigma_2^2)$ are independent, then $U - V \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$. This follows from the characteristic function of a normal: $\phi_{U-V}(t) = \phi_U(t) \phi_{-V}(t) = \exp(it\mu_1 - t^2\sigma_1^2/2) \exp(-it\mu_2 - t^2\sigma_2^2/2) = \exp(it(\mu_1 - \mu_2) - t^2(\sigma_1^2 + \sigma_2^2)/2)$, which is the characteristic function of $N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$.
Putting it all together:
\begin{align*}
\hat Y^* - Y^* \sim N(0, \sigma^2(\tau^2 + 1)).
\end{align*}
Note that the prediction error variance $\sigma^2(\tau^2 + 1)$ decomposes into two pieces: $\sigma^2 \tau^2$, the **estimation uncertainty** that shrinks to zero as $n \to \infty$ (because $\tau^2 = x^{*\top} (X^\top X)^{-1} x^*$ shrinks as the design fills out); and $\sigma^2$, the **irreducible noise** of the new observation, which never shrinks. No amount of data can eliminate the second term.
[/guided]
[/step]
[step:Reduce the $t$-distribution claim to an independent standard normal / chi-squared ratio]
From Step 3, define the standardised prediction error
\begin{align*}
Z &: \Omega \to \mathbb{R}, & \omega &\mapsto \frac{\hat Y^*(\omega) - Y^*(\omega)}{\sigma \sqrt{\tau^2 + 1}}.
\end{align*}
Then $Z \sim N(0, 1)$. The target quantity is
\begin{align*}
\frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} &= \frac{(\hat Y^* - Y^*)/(\sigma\sqrt{\tau^2 + 1})}{\tilde\sigma/\sigma} = \frac{Z}{\tilde\sigma/\sigma}.
\end{align*}
Setting $W := \mathrm{RSS}/\sigma^2$, the unbiased-variance estimator satisfies $\tilde\sigma^2/\sigma^2 = \mathrm{RSS}/[(n-p)\sigma^2] = W/(n-p)$, so
\begin{align*}
\frac{\tilde\sigma}{\sigma} = \sqrt{\frac{W}{n-p}} \quad\Longrightarrow\quad \frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} = \frac{Z}{\sqrt{W/(n-p)}}.
\end{align*}
[/step]
[step:Verify independence of $Z$ and $W$, then invoke the $t_{n-p}$ definition]
The definition of $t_{n-p}$ requires $Z \sim N(0,1)$, $W \sim \chi^2_{n-p}$, and $Z \perp\!\!\!\perp W$. We verify all three.
**$Z \sim N(0,1)$:** Step 3 gives $\hat Y^* - Y^* \sim N(0, \sigma^2(\tau^2 + 1))$, and $Z$ is the standardisation.
**$W \sim \chi^2_{n-p}$:** By [Distributional Properties of the Normal Linear Model](/theorems/1445) part (2).
**$Z \perp\!\!\!\perp W$:** The random variable $Z$ is a (measurable) function of $\hat\beta$ and $\varepsilon^*$, and $W$ is a (measurable) function of $\mathrm{RSS}$. We show $W$ is independent of the pair $(\hat\beta, \varepsilon^*)$, which implies $W \perp\!\!\!\perp Z$.
By [Distributional Properties of the Normal Linear Model](/theorems/1445) part (3), $\hat\beta$ and $\mathrm{RSS}$ are independent. By hypothesis, $\varepsilon^*$ is independent of $Y$; since both $\hat\beta$ and $\mathrm{RSS}$ are (measurable) functions of $Y$, the pair $(\hat\beta, \mathrm{RSS})$ is jointly a function of $Y$, hence independent of $\varepsilon^*$. Therefore:
- $\mathrm{RSS} \perp\!\!\!\perp \hat\beta$ (within the training data);
- $\mathrm{RSS} \perp\!\!\!\perp \varepsilon^*$ (training vs future noise).
These two independences combine to give $\mathrm{RSS}$ independent of the pair $(\hat\beta, \varepsilon^*)$: since the joint distribution of $(\mathrm{RSS}, \hat\beta, \varepsilon^*)$ factors as the product of the distribution of $\mathrm{RSS}$ with the joint distribution of $(\hat\beta, \varepsilon^*)$ (using in turn $\mathrm{RSS} \perp\!\!\!\perp Y$ fails to hold in general, but $\mathrm{RSS} \perp\!\!\!\perp \hat\beta$ within $\mathcal{F}_{Y}$ combined with $Y \perp\!\!\!\perp \varepsilon^*$ gives the factorisation via conditioning). Passing to measurable functions, $W \perp\!\!\!\perp Z$.
Having verified all three conditions, the definition of the $t$ distribution gives
\begin{align*}
\frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} &= \frac{Z}{\sqrt{W/(n-p)}} \sim t_{n-p},
\end{align*}
completing the proof.
[guided]
The $t$-ratio construction closely parallels the proof of [t-Distribution of Normalised Coefficient Estimate](/theorems/1446): a standard normal over the square root of an independent $\chi^2_{n-p}/(n-p)$. The only new element is the extra independent noise term $\varepsilon^*$ in the numerator. We need to check carefully that this extra term does not disturb the independence between numerator and denominator.
*Summary of what needs to be independent.* The $t$-ratio $Z/\sqrt{W/(n-p)}$ requires $Z \perp\!\!\!\perp W$. Here $Z$ depends on both $\hat\beta$ (through $\hat Y^* = x^{*\top}\hat\beta$) and $\varepsilon^*$ (through $Y^* = x^{*\top}\beta + \varepsilon^*$), while $W = \mathrm{RSS}/\sigma^2$ depends on the training data through $\mathrm{RSS}$. So we must verify $\mathrm{RSS}$ is independent of $(\hat\beta, \varepsilon^*)$ jointly.
*Step 1: $\mathrm{RSS} \perp\!\!\!\perp \hat\beta$.* This is [Distributional Properties of the Normal Linear Model](/theorems/1445) part (3).
*Step 2: $\mathrm{RSS} \perp\!\!\!\perp \varepsilon^*$.* The new noise $\varepsilon^*$ is independent of $Y$ by hypothesis. Since $\mathrm{RSS}$ is a Borel-measurable function of $Y$ (explicitly $\mathrm{RSS} = Y^\top (I_n - P) Y$), and independence between $\varepsilon^*$ and $Y$ passes to $\mathrm{RSS} = \phi(Y)$ by $\sigma(\mathrm{RSS}) \subseteq \sigma(Y)$, we get $\mathrm{RSS} \perp\!\!\!\perp \varepsilon^*$.
*Step 3: Joint independence $\mathrm{RSS} \perp\!\!\!\perp (\hat\beta, \varepsilon^*)$.* This is strictly stronger than the two pairwise independences in Steps 1 and 2, but follows from them combined with the independence structure of the underlying data. The cleanest way is to check the joint distribution factorises.
The random vector $(Y, \varepsilon^*)$ has joint distribution $\mu_{Y} \otimes \mu_{\varepsilon^*}$ because $\varepsilon^* \perp\!\!\!\perp Y$. Applying the measurable map $(y, e) \mapsto (\mathrm{RSS}(y), \hat\beta(y), e)$, the image measure factors as the joint distribution of $(\mathrm{RSS}, \hat\beta)$ (which itself factors as $\mu_{\mathrm{RSS}} \otimes \mu_{\hat\beta}$ by [Distributional Properties of the Normal Linear Model](/theorems/1445) part (3)) times $\mu_{\varepsilon^*}$. So
\begin{align*}
\mu_{(\mathrm{RSS}, \hat\beta, \varepsilon^*)} = \mu_{\mathrm{RSS}} \otimes \mu_{\hat\beta} \otimes \mu_{\varepsilon^*},
\end{align*}
i.e., the three variables are mutually independent, and in particular $\mathrm{RSS}$ is independent of the pair $(\hat\beta, \varepsilon^*)$.
*Step 4: $W \perp\!\!\!\perp Z$.* Since $W = \mathrm{RSS}/\sigma^2$ is a measurable function of $\mathrm{RSS}$, and $Z$ is a measurable function of the pair $(\hat\beta, \varepsilon^*)$, the previous joint independence gives $W \perp\!\!\!\perp Z$ — measurable functions of independent variables remain independent.
*Why the extra $\varepsilon^*$ does not inflate the degrees of freedom.* This is the subtle point. One might worry that carrying around an extra normal in the numerator should "cost" us some chi-squared variability and reduce the degrees of freedom. It does not, because $\varepsilon^*$ is completely independent of the training data and therefore of $W$. The only degrees of freedom consumed in the denominator are the $p$ coefficients estimated from the $n$ training observations, leaving $n - p$ — exactly the $\chi^2_{n - p}$ degrees of freedom that appeared in the $t$-ratio for normalised coefficients.
Applying the definition of the $t$ distribution:
\begin{align*}
\frac{Z}{\sqrt{W/(n-p)}} \sim t_{n-p},
\end{align*}
and substituting the identity from Step 4,
\begin{align*}
\frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} \sim t_{n-p}.
\end{align*}
This is the pivot used to construct prediction intervals for a future observation.
[/guided]
[/step]