Prediction Error Distribution — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We split the prediction error into two independent normal pieces: the estimation error $x^{*\top}(\hat\beta - \beta)$, a function of the training data $Y$, and the future noise $\varepsilon^*$, independent of $Y$ by hypothesis. Computing the mean and variance of each piece and summing gives $\hat Y^* - Y^* \sim N(0, \sigma^2(\tau^2 + 1))$. For the $t$-distribution claim we standardise by $\sigma$, replace $\sigma$ by its unbiased estimate $\tilde\sigma$, and use that $\hat\beta$, $\mathrm{RSS}$, and $\varepsilon^*$ are pairwise independent to build an independent $\chi^2_{n-p}$ denominator — the same construction as the $t$-distribution proof for normalised coefficients. [/proofplan] [step:Decompose the prediction error into an estimation error and an independent future noise] The predicted and true responses at the new covariate vector $x^* \in \mathbb{R}^p$ are \begin{align*} \hat Y^* = x^{*\top} \hat\beta, \qquad Y^* = x^{*\top} \beta + \varepsilon^*, \end{align*} with $\varepsilon^* \sim N(0, \sigma^2)$ independent of the training data $Y$. Subtracting, \begin{align*} \hat Y^* - Y^* &= x^{*\top} \hat\beta - x^{*\top}\beta - \varepsilon^* = x^{*\top}(\hat\beta - \beta) - \varepsilon^*. \end{align*} Define the two components as the maps \begin{align*} U &: \Omega \to \mathbb{R}, & \omega &\mapsto x^{*\top}(\hat\beta(\omega) - \beta), \\ V &: \Omega \to \mathbb{R}, & \omega &\mapsto \varepsilon^*(\omega), \end{align*} so that $\hat Y^* - Y^* = U - V$. *Independence of $U$ and $V$.* The estimator $\hat\beta$ is a measurable function of $Y$ (explicitly $\hat\beta = (X^\top X)^{-1} X^\top Y$), so $U = x^{*\top}(\hat\beta - \beta)$ is a measurable function of $Y$. By hypothesis $\varepsilon^*$ is independent of $Y$, and independence is preserved under measurable transformations, so $U \perp\!\!\!\perp V$. [/step] [step:Compute the mean and variance of the estimation error $U$] By [Distributional Properties of the Normal Linear Model](/theorems/1445) part (1), $\hat\beta \sim N_p(\beta, \sigma^2 (X^\top X)^{-1})$, so $\hat\beta - \beta \sim N_p(\mathbf{0}, \sigma^2(X^\top X)^{-1})$. The quantity $U = x^{*\top}(\hat\beta - \beta)$ is a fixed linear combination of a multivariate normal, hence normal by the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) theorem applied to the $1 \times p$ matrix $x^{*\top}$: \begin{align*} U &\sim N\!\big(x^{*\top} \cdot \mathbf{0},\; x^{*\top}\, \sigma^2(X^\top X)^{-1}\, x^*\big) = N\!\big(0,\; \sigma^2 \tau^2\big), \end{align*} where $\tau^2 := x^{*\top}(X^\top X)^{-1} x^*$ is the leverage of $x^*$. [guided] We want the mean and variance of $U = x^{*\top}(\hat\beta - \beta)$. Because $\hat\beta$ is multivariate normal, any linear combination of its entries is univariate normal; we only need to read off its two parameters. *Mean.* By linearity of expectation, \begin{align*} \mathbb{E}[U] = x^{*\top} \mathbb{E}[\hat\beta - \beta] = x^{*\top}(\beta - \beta) = 0. \end{align*} *Variance.* For a linear combination $a^\top Z$ of a random vector $Z$ with covariance $\Sigma$, $\operatorname{Var}(a^\top Z) = a^\top \Sigma a$. Here $a = x^*$ and $\Sigma = \sigma^2 (X^\top X)^{-1}$, so \begin{align*} \operatorname{Var}(U) = x^{*\top} \sigma^2(X^\top X)^{-1} x^* = \sigma^2 \cdot x^{*\top}(X^\top X)^{-1} x^* = \sigma^2 \tau^2, \end{align*} where $\tau^2$ is the hat matrix leverage at the new point — it measures how "far" $x^*$ is from the centre of mass of the training design in the geometry defined by $X^\top X$, and it is always non-negative because $(X^\top X)^{-1}$ is positive definite (established in the proof of [t-Distribution of Normalised Coefficient Estimate](/theorems/1446)). Therefore $U \sim N(0, \sigma^2 \tau^2)$. *Normality.* Formally, $U$ is normal because a linear combination of a multivariate normal is normal by the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) theorem (applied to the $1 \times p$ matrix $x^{*\top}$, not restricted to orthogonal matrices — the theorem covers general matrices). Combining the three pieces: \begin{align*} U \sim N(0, \sigma^2 \tau^2). \end{align*} [/guided] [/step] [step:Compute the distribution of $\hat Y^* - Y^* = U - V$] We have $U \sim N(0, \sigma^2 \tau^2)$ (Step 2), $V = \varepsilon^* \sim N(0, \sigma^2)$ (hypothesis), and $U \perp\!\!\!\perp V$ (Step 1). The difference of two independent normals is normal, with mean equal to the difference of means and variance equal to the **sum** of variances: \begin{align*} \hat Y^* - Y^* = U - V &\sim N\!\big(0 - 0,\; \sigma^2 \tau^2 + \sigma^2\big) = N\!\big(0,\; \sigma^2(\tau^2 + 1)\big). \end{align*} This proves the first claim of the theorem. [guided] Three facts combine to give the distribution of $U - V$. 1. *Linearity for means.* $\mathbb{E}[U - V] = \mathbb{E}[U] - \mathbb{E}[V] = 0 - 0 = 0$. 2. *Variance of a sum (or difference) of independent variables.* For independent $U$ and $V$, $\operatorname{Var}(U - V) = \operatorname{Var}(U) + \operatorname{Var}(-V) = \operatorname{Var}(U) + \operatorname{Var}(V) = \sigma^2 \tau^2 + \sigma^2 = \sigma^2(\tau^2 + 1)$. The negative sign on $V$ does not flip the variance — variances add under independence regardless of sign. 3. *Independent normals sum to a normal.* This is the stability of the normal family under independent sums: if $U \sim N(\mu_1, \sigma_1^2)$ and $V \sim N(\mu_2, \sigma_2^2)$ are independent, then $U - V \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$. This follows from the characteristic function of a normal: $\phi_{U-V}(t) = \phi_U(t) \phi_{-V}(t) = \exp(it\mu_1 - t^2\sigma_1^2/2) \exp(-it\mu_2 - t^2\sigma_2^2/2) = \exp(it(\mu_1 - \mu_2) - t^2(\sigma_1^2 + \sigma_2^2)/2)$, which is the characteristic function of $N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$. Putting it all together: \begin{align*} \hat Y^* - Y^* \sim N(0, \sigma^2(\tau^2 + 1)). \end{align*} Note that the prediction error variance $\sigma^2(\tau^2 + 1)$ decomposes into two pieces: $\sigma^2 \tau^2$, the **estimation uncertainty** that shrinks to zero as $n \to \infty$ (because $\tau^2 = x^{*\top} (X^\top X)^{-1} x^*$ shrinks as the design fills out); and $\sigma^2$, the **irreducible noise** of the new observation, which never shrinks. No amount of data can eliminate the second term. [/guided] [/step] [step:Reduce the $t$-distribution claim to an independent standard normal / chi-squared ratio] From Step 3, define the standardised prediction error \begin{align*} Z &: \Omega \to \mathbb{R}, & \omega &\mapsto \frac{\hat Y^*(\omega) - Y^*(\omega)}{\sigma \sqrt{\tau^2 + 1}}. \end{align*} Then $Z \sim N(0, 1)$. The target quantity is \begin{align*} \frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} &= \frac{(\hat Y^* - Y^*)/(\sigma\sqrt{\tau^2 + 1})}{\tilde\sigma/\sigma} = \frac{Z}{\tilde\sigma/\sigma}. \end{align*} Setting $W := \mathrm{RSS}/\sigma^2$, the unbiased-variance estimator satisfies $\tilde\sigma^2/\sigma^2 = \mathrm{RSS}/[(n-p)\sigma^2] = W/(n-p)$, so \begin{align*} \frac{\tilde\sigma}{\sigma} = \sqrt{\frac{W}{n-p}} \quad\Longrightarrow\quad \frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} = \frac{Z}{\sqrt{W/(n-p)}}. \end{align*} [/step] [step:Verify independence of $Z$ and $W$, then invoke the $t_{n-p}$ definition] The definition of $t_{n-p}$ requires $Z \sim N(0,1)$, $W \sim \chi^2_{n-p}$, and $Z \perp\!\!\!\perp W$. We verify all three. **$Z \sim N(0,1)$:** Step 3 gives $\hat Y^* - Y^* \sim N(0, \sigma^2(\tau^2 + 1))$, and $Z$ is the standardisation. **$W \sim \chi^2_{n-p}$:** By [Distributional Properties of the Normal Linear Model](/theorems/1445) part (2). **$Z \perp\!\!\!\perp W$:** The random variable $Z$ is a (measurable) function of $\hat\beta$ and $\varepsilon^*$, and $W$ is a (measurable) function of $\mathrm{RSS}$. We show $W$ is independent of the pair $(\hat\beta, \varepsilon^*)$, which implies $W \perp\!\!\!\perp Z$. By [Distributional Properties of the Normal Linear Model](/theorems/1445) part (3), $\hat\beta$ and $\mathrm{RSS}$ are independent. By hypothesis, $\varepsilon^*$ is independent of $Y$; since both $\hat\beta$ and $\mathrm{RSS}$ are (measurable) functions of $Y$, the pair $(\hat\beta, \mathrm{RSS})$ is jointly a function of $Y$, hence independent of $\varepsilon^*$. Therefore: - $\mathrm{RSS} \perp\!\!\!\perp \hat\beta$ (within the training data); - $\mathrm{RSS} \perp\!\!\!\perp \varepsilon^*$ (training vs future noise). These two independences combine to give $\mathrm{RSS}$ independent of the pair $(\hat\beta, \varepsilon^*)$: since the joint distribution of $(\mathrm{RSS}, \hat\beta, \varepsilon^*)$ factors as the product of the distribution of $\mathrm{RSS}$ with the joint distribution of $(\hat\beta, \varepsilon^*)$ (using in turn $\mathrm{RSS} \perp\!\!\!\perp Y$ fails to hold in general, but $\mathrm{RSS} \perp\!\!\!\perp \hat\beta$ within $\mathcal{F}_{Y}$ combined with $Y \perp\!\!\!\perp \varepsilon^*$ gives the factorisation via conditioning). Passing to measurable functions, $W \perp\!\!\!\perp Z$. Having verified all three conditions, the definition of the $t$ distribution gives \begin{align*} \frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} &= \frac{Z}{\sqrt{W/(n-p)}} \sim t_{n-p}, \end{align*} completing the proof. [guided] The $t$-ratio construction closely parallels the proof of [t-Distribution of Normalised Coefficient Estimate](/theorems/1446): a standard normal over the square root of an independent $\chi^2_{n-p}/(n-p)$. The only new element is the extra independent noise term $\varepsilon^*$ in the numerator. We need to check carefully that this extra term does not disturb the independence between numerator and denominator. *Summary of what needs to be independent.* The $t$-ratio $Z/\sqrt{W/(n-p)}$ requires $Z \perp\!\!\!\perp W$. Here $Z$ depends on both $\hat\beta$ (through $\hat Y^* = x^{*\top}\hat\beta$) and $\varepsilon^*$ (through $Y^* = x^{*\top}\beta + \varepsilon^*$), while $W = \mathrm{RSS}/\sigma^2$ depends on the training data through $\mathrm{RSS}$. So we must verify $\mathrm{RSS}$ is independent of $(\hat\beta, \varepsilon^*)$ jointly. *Step 1: $\mathrm{RSS} \perp\!\!\!\perp \hat\beta$.* This is [Distributional Properties of the Normal Linear Model](/theorems/1445) part (3). *Step 2: $\mathrm{RSS} \perp\!\!\!\perp \varepsilon^*$.* The new noise $\varepsilon^*$ is independent of $Y$ by hypothesis. Since $\mathrm{RSS}$ is a Borel-measurable function of $Y$ (explicitly $\mathrm{RSS} = Y^\top (I_n - P) Y$), and independence between $\varepsilon^*$ and $Y$ passes to $\mathrm{RSS} = \phi(Y)$ by $\sigma(\mathrm{RSS}) \subseteq \sigma(Y)$, we get $\mathrm{RSS} \perp\!\!\!\perp \varepsilon^*$. *Step 3: Joint independence $\mathrm{RSS} \perp\!\!\!\perp (\hat\beta, \varepsilon^*)$.* This is strictly stronger than the two pairwise independences in Steps 1 and 2, but follows from them combined with the independence structure of the underlying data. The cleanest way is to check the joint distribution factorises. The random vector $(Y, \varepsilon^*)$ has joint distribution $\mu_{Y} \otimes \mu_{\varepsilon^*}$ because $\varepsilon^* \perp\!\!\!\perp Y$. Applying the measurable map $(y, e) \mapsto (\mathrm{RSS}(y), \hat\beta(y), e)$, the image measure factors as the joint distribution of $(\mathrm{RSS}, \hat\beta)$ (which itself factors as $\mu_{\mathrm{RSS}} \otimes \mu_{\hat\beta}$ by [Distributional Properties of the Normal Linear Model](/theorems/1445) part (3)) times $\mu_{\varepsilon^*}$. So \begin{align*} \mu_{(\mathrm{RSS}, \hat\beta, \varepsilon^*)} = \mu_{\mathrm{RSS}} \otimes \mu_{\hat\beta} \otimes \mu_{\varepsilon^*}, \end{align*} i.e., the three variables are mutually independent, and in particular $\mathrm{RSS}$ is independent of the pair $(\hat\beta, \varepsilon^*)$. *Step 4: $W \perp\!\!\!\perp Z$.* Since $W = \mathrm{RSS}/\sigma^2$ is a measurable function of $\mathrm{RSS}$, and $Z$ is a measurable function of the pair $(\hat\beta, \varepsilon^*)$, the previous joint independence gives $W \perp\!\!\!\perp Z$ — measurable functions of independent variables remain independent. *Why the extra $\varepsilon^*$ does not inflate the degrees of freedom.* This is the subtle point. One might worry that carrying around an extra normal in the numerator should "cost" us some chi-squared variability and reduce the degrees of freedom. It does not, because $\varepsilon^*$ is completely independent of the training data and therefore of $W$. The only degrees of freedom consumed in the denominator are the $p$ coefficients estimated from the $n$ training observations, leaving $n - p$ — exactly the $\chi^2_{n - p}$ degrees of freedom that appeared in the $t$-ratio for normalised coefficients. Applying the definition of the $t$ distribution: \begin{align*} \frac{Z}{\sqrt{W/(n-p)}} \sim t_{n-p}, \end{align*} and substituting the identity from Step 4, \begin{align*} \frac{\hat Y^* - Y^*}{\tilde\sigma \sqrt{\tau^2 + 1}} \sim t_{n-p}. \end{align*} This is the pivot used to construct prediction intervals for a future observation. [/guided] [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Prediction Error Distribution (Theorem # 1447)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Prediction Error Distribution (Theorem # 1447)

Discussion

Proof

Explore Further