Exact Normal-Theory Prediction Region for a New Multivariate Linear-Model Observation

Exact Normal-Theory Prediction Region for a New Multivariate Linear-Model Observation (Theorem # 4068)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We decompose the prediction error into the new error and the fitted-coefficient error. The fitted-coefficient error is a linear combination of the training errors and therefore is normal with covariance $h_0\Sigma$, independent of the new error. The residual covariance estimator is independent of the fitted part and has a Wishart distribution, so the studentised prediction error has the Hotelling $T^2$ distribution with scale factor $1+h_0$. Translating that distributional identity into $F$ quantiles gives the stated prediction region. [/proofplan] [step:Decompose the prediction error into independent normal components] Let $E \in \mathbb{R}^{n \times p}$ denote the error matrix whose $i$-th row is $\varepsilon_i^\top$. Then \begin{align*} Y = XB + E. \end{align*} Since $X$ has full column rank, \begin{align*} \hat B - B &= (X^\top X)^{-1}X^\top E. \end{align*} Therefore \begin{align*} Y_0-\hat\mu_0 &= B^\top x_0+\varepsilon_0-\hat B^\top x_0 \\ &= \varepsilon_0-(\hat B-B)^\top x_0 \\ &= \varepsilon_0-E^\top X(X^\top X)^{-1}x_0. \end{align*} Define $a \in \mathbb{R}^n$ by \begin{align*} a := X(X^\top X)^{-1}x_0. \end{align*} Then \begin{align*} E^\top X(X^\top X)^{-1}x_0 = E^\top a = \sum_{i=1}^n a_i\varepsilon_i. \end{align*} Because $\varepsilon_1,\dots,\varepsilon_n$ are independent $\mathcal{N}_p(0,\Sigma)$ random vectors, the random vector $E^\top a$ is normal with mean $0$ and covariance \begin{align*} \operatorname{Cov}(E^\top a) &= \sum_{i=1}^n a_i^2\Sigma = (a^\top a)\Sigma. \end{align*} Moreover, \begin{align*} a^\top a &= x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\ &= x_0^\top (X^\top X)^{-1}x_0 = h_0. \end{align*} Thus $E^\top a \sim \mathcal{N}_p(0,h_0\Sigma)$. Since $\varepsilon_0$ is independent of the training errors, $\varepsilon_0$ and $E^\top a$ are independent. Hence \begin{align*} Y_0-\hat\mu_0 \sim \mathcal{N}_p(0,(1+h_0)\Sigma). \end{align*} [guided] The point of this step is to identify the exact covariance of the prediction error. The fitted mean $\hat\mu_0$ is random because $\hat B$ is estimated from the training sample, so the prediction error has two sources: the new error $\varepsilon_0$ and the estimation error in $\hat B$. Let $E \in \mathbb{R}^{n \times p}$ be the matrix with $i$-th row $\varepsilon_i^\top$. Since $Y=XB+E$, the least-squares estimator satisfies \begin{align*} \hat B-B &= (X^\top X)^{-1}X^\top Y-B \\ &= (X^\top X)^{-1}X^\top(XB+E)-B \\ &= B+(X^\top X)^{-1}X^\top E-B \\ &= (X^\top X)^{-1}X^\top E. \end{align*} Therefore \begin{align*} Y_0-\hat\mu_0 &= B^\top x_0+\varepsilon_0-\hat B^\top x_0 \\ &= \varepsilon_0-(\hat B-B)^\top x_0 \\ &= \varepsilon_0-E^\top X(X^\top X)^{-1}x_0. \end{align*} Define the coefficient vector $a \in \mathbb{R}^n$ by \begin{align*} a := X(X^\top X)^{-1}x_0. \end{align*} Then the estimation part is \begin{align*} E^\top a = \sum_{i=1}^n a_i\varepsilon_i. \end{align*} This is a linear combination of independent multivariate normal random vectors, hence it is multivariate normal. Its mean is $0$, and its covariance is computed using independence: \begin{align*} \operatorname{Cov}(E^\top a) &= \operatorname{Cov}\left(\sum_{i=1}^n a_i\varepsilon_i\right) \\ &= \sum_{i=1}^n a_i^2\operatorname{Cov}(\varepsilon_i) \\ &= \sum_{i=1}^n a_i^2\Sigma \\ &= (a^\top a)\Sigma. \end{align*} Now compute the scalar $a^\top a$: \begin{align*} a^\top a &= x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\ &= x_0^\top (X^\top X)^{-1}x_0 \\ &= h_0. \end{align*} Thus $E^\top a \sim \mathcal{N}_p(0,h_0\Sigma)$. The new error $\varepsilon_0$ is independent of all training errors, so it is independent of $E^\top a$. Adding independent normal vectors adds their covariance matrices, and the minus sign does not change the covariance. Hence \begin{align*} Y_0-\hat\mu_0 \sim \mathcal{N}_p(0,\Sigma+h_0\Sigma) = \mathcal{N}_p(0,(1+h_0)\Sigma). \end{align*} [/guided] [/step] [step:Identify the independent Wishart residual covariance] Define the projection matrix $H \in \mathbb{R}^{n \times n}$ and residual projection matrix $M \in \mathbb{R}^{n \times n}$ by \begin{align*} H := X(X^\top X)^{-1}X^\top,\qquad M := I_n-H. \end{align*} Then \begin{align*} Y-X\hat B = MY = M(XB+E)=ME, \end{align*} because $MX=0$. Hence \begin{align*} \nu S = E^\top M E. \end{align*} The matrix $M$ is symmetric idempotent with rank $\nu=n-q$. Choose a matrix $C \in \mathbb{R}^{\nu \times n}$ whose rows form an orthonormal basis of $\operatorname{Range}(M)$, so that \begin{align*} CC^\top=I_\nu,\qquad C^\top C=M. \end{align*} Then \begin{align*} \nu S = E^\top C^\top C E = (CE)^\top(CE). \end{align*} The rows of $CE$ are independent $\mathcal{N}_p(0,\Sigma)$ random vectors, so \begin{align*} \nu S \sim \operatorname{Wishart}_p(\Sigma,\nu). \end{align*} The vector $E^\top a$ depends only on the projection of $E$ onto $\operatorname{Range}(H)$, while $ME$ depends only on the projection of $E$ onto $\operatorname{Range}(M)$. Since $H$ and $M$ are orthogonal projections with $HM=0$, the corresponding Gaussian projections are independent. Therefore $Y_0-\hat\mu_0$ is independent of $S$. [guided] The residual covariance estimator must be independent of the prediction error after the covariance structure has been separated. This is the normal-theory fact that makes the exact $F$ calibration possible. Define \begin{align*} H := X(X^\top X)^{-1}X^\top,\qquad M := I_n-H. \end{align*} The matrix $H$ is the [orthogonal projection](/theorems/437) onto the column space of $X$, and $M$ is the orthogonal projection onto its orthogonal complement. Since $HX=X$, we have $MX=0$. Therefore \begin{align*} Y-X\hat B &= Y-X(X^\top X)^{-1}X^\top Y \\ &= (I_n-H)Y \\ &= MY \\ &= M(XB+E) \\ &= ME. \end{align*} Thus \begin{align*} \nu S = (Y-X\hat B)^\top(Y-X\hat B)=E^\top M E. \end{align*} Because $X$ has rank $q$, the projection $H$ has rank $q$, and hence $M=I_n-H$ has rank $\nu=n-q$. Choose $C \in \mathbb{R}^{\nu \times n}$ whose rows form an orthonormal basis for $\operatorname{Range}(M)$. Then \begin{align*} CC^\top=I_\nu,\qquad C^\top C=M. \end{align*} Consequently, \begin{align*} \nu S=E^\top M E=E^\top C^\top C E=(CE)^\top(CE). \end{align*} Each row of $CE$ is a fixed linear combination of the independent Gaussian rows of $E$. The orthonormality condition $CC^\top=I_\nu$ gives covariance $\Sigma$ for each row and zero cross-covariance between distinct rows. Since the joint distribution is Gaussian, zero cross-covariance gives independence. Hence the rows of $CE$ are independent $\mathcal{N}_p(0,\Sigma)$ random vectors, and so \begin{align*} \nu S \sim \operatorname{Wishart}_p(\Sigma,\nu). \end{align*} It remains to justify independence from the prediction error. The training part of the prediction error is $E^\top a$, where \begin{align*} a=X(X^\top X)^{-1}x_0. \end{align*} This vector $a$ lies in $\operatorname{Range}(X)=\operatorname{Range}(H)$. Thus $E^\top a$ depends only on the $H$-projection of $E$. The residual matrix $ME$ depends only on the $M$-projection of $E$. Since $H$ and $M$ are orthogonal projections and $HM=0$, these two Gaussian projections are independent. The new error $\varepsilon_0$ is also independent of the training errors. Therefore $Y_0-\hat\mu_0$ is independent of $S$. [/guided] [/step] [step:Studentise the prediction error and obtain the Hotelling statistic] Define \begin{align*} Z := \frac{1}{\sqrt{1+h_0}}(Y_0-\hat\mu_0). \end{align*} From the preceding steps, \begin{align*} Z \sim \mathcal{N}_p(0,\Sigma),\qquad \nu S \sim \operatorname{Wishart}_p(\Sigma,\nu), \end{align*} and $Z$ is independent of $S$. By the defining Hotelling $T^2$ distributional identity for an independent normal vector and Wishart covariance estimator, \begin{align*} \frac{\nu-p+1}{p\nu}Z^\top S^{-1}Z \sim F_{p,\nu-p+1}. \end{align*} Since \begin{align*} Z^\top S^{-1}Z = \frac{1}{1+h_0}(Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0), \end{align*} we obtain \begin{align*} \frac{\nu-p+1}{p\nu(1+h_0)} (Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0) \sim F_{p,\nu-p+1}. \end{align*} [/step] [step:Choose the quantile to get exact coverage] Let $f_{p,\nu-p+1;1-\alpha}$ be the $(1-\alpha)$-quantile of $F_{p,\nu-p+1}$. From the distributional identity just proved, \begin{align*} &\mathbb{P}\left( \frac{\nu-p+1}{p\nu(1+h_0)} (Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0) \le f_{p,\nu-p+1;1-\alpha} \right) \\ &\qquad = 1-\alpha. \end{align*} Multiplying both sides of the inequality by the positive scalar \begin{align*} (1+h_0)\frac{p\nu}{\nu-p+1} \end{align*} gives \begin{align*} \mathbb{P}\left( (Y_0-\hat\mu_0)^\top S^{-1}(Y_0-\hat\mu_0) \le (1+h_0)\frac{p\nu}{\nu-p+1}f_{p,\nu-p+1;1-\alpha} \right) =1-\alpha. \end{align*} This is exactly the event $Y_0 \in \mathcal{P}_{1-\alpha}$, so \begin{align*} \mathbb{P}(Y_0 \in \mathcal{P}_{1-\alpha})=1-\alpha. \end{align*} [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Orthogonal Projection

Definitions & Concepts

Distribution

Explore Further

Distribution Definition Orthogonal Projection Theorem #437 Bonferroni Simultaneous Confidence Intervals for Linear Contrasts of a Multivariate Normal Mean probability Covariance Matrix Is Positive Semidefinite probability Wiener–Khinchin Theorem probability Scheffe Simultaneous Confidence Intervals for Linear Combinations of a Multivariate Mean probability Positive Definiteness of the Autocovariance Function probability Rayleigh–Ritz Variance Maximisation Characterisation probability Monotonicity and Triangle Inequality for Expectation probability Root-Pattern Sensitivity of Classical One-Way MANOVA Criteria probability

What brings you to Androma?

Start with a route through the knowledge graph.