Confidence Interval for the Mean Response in the Normal Linear Model

Confidence Interval for the Mean Response in the Normal Linear Model (Theorem # 4444)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We identify the distribution of the scalar estimator $x_0^\top\hat{\beta}$ and the residual variance estimator under the normal linear model. The scalar estimator is normal with mean $x_0^\top\beta$ and variance $\sigma^2 x_0^\top(X^\top X)^{-1}x_0$, while the scaled residual variance has a chi-squared distribution with $n-p$ degrees of freedom and is independent of $x_0^\top\hat{\beta}$. Dividing the centered scalar estimator by its estimated standard error therefore produces a Student $t$ pivot. The confidence interval follows by rewriting the central probability statement for this pivot. [/proofplan] [step:Compute the normal distribution of the scalar least squares estimator] Since $X$ has rank $p$, the matrix $X^\top X \in \mathbb{R}^{p \times p}$ is invertible. Define the matrix \begin{align*} A := (X^\top X)^{-1}X^\top \in \mathbb{R}^{p \times n}. \end{align*} Then $\hat{\beta}=AY$, and using $Y=X\beta+\varepsilon$ gives \begin{align*} \hat{\beta} = A X\beta + A\varepsilon = \beta + A\varepsilon, \end{align*} because $AX=(X^\top X)^{-1}X^\top X=I_p$. Let \begin{align*} Z:\Omega &\to \mathbb{R} \\ \omega &\mapsto x_0^\top\hat{\beta}(\omega) \end{align*} be the scalar least squares estimator of $x_0^\top\beta$. Since $Z=x_0^\top\beta+x_0^\top A\varepsilon$ is an affine transformation of the Gaussian vector $\varepsilon$, it is Gaussian. Its mean is \begin{align*} \mathbb{E}[Z] = x_0^\top\beta+x_0^\top A\mathbb{E}[\varepsilon] = x_0^\top\beta, \end{align*} and its variance is \begin{align*} \operatorname{Var}(Z) &= x_0^\top A(\sigma^2 I_n)A^\top x_0 \\ &= \sigma^2 x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\ &= \sigma^2 x_0^\top (X^\top X)^{-1}x_0. \end{align*} Because $X^\top X$ is positive definite and $x_0\neq 0$, the number \begin{align*} s_0 := \sqrt{x_0^\top(X^\top X)^{-1}x_0} \end{align*} is strictly positive. Hence \begin{align*} \frac{Z-x_0^\top\beta}{\sigma s_0} \sim \mathcal{N}(0,1). \end{align*} [guided] The first task is to reduce the vector estimator $\hat{\beta}$ to the one-dimensional estimator that appears in the interval. Because $X$ has full column rank, $X^\top X$ is invertible, so the ordinary least squares estimator is well-defined. Define \begin{align*} A := (X^\top X)^{-1}X^\top \in \mathbb{R}^{p \times n}. \end{align*} Then $\hat{\beta}=AY$. Substituting the model equation $Y=X\beta+\varepsilon$ gives \begin{align*} \hat{\beta} = A(X\beta+\varepsilon) = AX\beta+A\varepsilon = \beta+A\varepsilon, \end{align*} because $AX=(X^\top X)^{-1}X^\top X=I_p$. Now define the scalar random variable \begin{align*} Z:\Omega &\to \mathbb{R} \\ \omega &\mapsto x_0^\top\hat{\beta}(\omega). \end{align*} This is the estimator of the scalar mean response $x_0^\top\beta$. Since \begin{align*} Z = x_0^\top\beta+x_0^\top A\varepsilon, \end{align*} and $\varepsilon$ is Gaussian, $Z$ is Gaussian because affine transformations of Gaussian random vectors are Gaussian. We compute its parameters directly. Since $\mathbb{E}[\varepsilon]=0$, \begin{align*} \mathbb{E}[Z] = x_0^\top\beta+x_0^\top A\mathbb{E}[\varepsilon] = x_0^\top\beta. \end{align*} For the variance, use $\operatorname{Cov}(\varepsilon)=\sigma^2 I_n$: \begin{align*} \operatorname{Var}(Z) &= x_0^\top A(\sigma^2 I_n)A^\top x_0 \\ &= \sigma^2 x_0^\top (X^\top X)^{-1}X^\top X(X^\top X)^{-1}x_0 \\ &= \sigma^2 x_0^\top (X^\top X)^{-1}x_0. \end{align*} The assumption $x_0\neq 0$ matters here: since $X^\top X$ is positive definite, its inverse is positive definite, so \begin{align*} s_0 := \sqrt{x_0^\top(X^\top X)^{-1}x_0} \end{align*} is strictly positive. Therefore the standardized statistic satisfies \begin{align*} \frac{Z-x_0^\top\beta}{\sigma s_0} \sim \mathcal{N}(0,1). \end{align*} [/guided] [/step] [step:Use the residual variance distribution and independence] Let \begin{align*} P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n} \end{align*} be the [orthogonal projection](/theorems/437) matrix onto the column space of $X$. Then \begin{align*} Y-X\hat{\beta} = (I_n-P)Y = (I_n-P)\varepsilon, \end{align*} because $(I_n-P)X\beta=0$. Under the normal linear model, \begin{align*} \frac{|Y-X\hat{\beta}|^2}{\sigma^2} = \frac{\varepsilon^\top(I_n-P)\varepsilon}{\sigma^2} \sim \chi^2_{n-p}. \end{align*} Equivalently, \begin{align*} V := \frac{(n-p)\hat{\sigma}^2}{\sigma^2} = \frac{|Y-X\hat{\beta}|^2}{\sigma^2} \sim \chi^2_{n-p}. \end{align*} Moreover, $P\varepsilon$ and $(I_n-P)\varepsilon$ are independent Gaussian random vectors because their covariance is \begin{align*} \operatorname{Cov}(P\varepsilon,(I_n-P)\varepsilon) = \sigma^2P(I_n-P) = 0. \end{align*} Since $\hat{\beta}$ is a function of $P\varepsilon$ and $\hat{\sigma}^2$ is a function of $(I_n-P)\varepsilon$, the random variables $Z=x_0^\top\hat{\beta}$ and $V$ are independent. [guided] The residual variance estimator depends only on the component of the noise orthogonal to the column space of $X$. Define the matrix \begin{align*} P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n}. \end{align*} This matrix is the orthogonal projection onto $\operatorname{Range}(X)$: it is symmetric, idempotent, and satisfies $PX=X$. Hence \begin{align*} Y-X\hat{\beta} = Y-X(X^\top X)^{-1}X^\top Y = (I_n-P)Y. \end{align*} Substituting $Y=X\beta+\varepsilon$ gives \begin{align*} Y-X\hat{\beta} = (I_n-P)X\beta+(I_n-P)\varepsilon = (I_n-P)\varepsilon, \end{align*} because $(I_n-P)X=0$. The matrix $I_n-P$ is also symmetric and idempotent, so it is the orthogonal projection onto the orthogonal complement of $\operatorname{Range}(X)$. Since $\operatorname{rank}(X)=p$, the projection $I_n-P$ has rank $n-p$. Therefore the quadratic form in the Gaussian vector $\varepsilon/\sigma \sim \mathcal{N}(0,I_n)$ satisfies \begin{align*} \frac{|Y-X\hat{\beta}|^2}{\sigma^2} = \frac{\varepsilon^\top(I_n-P)\varepsilon}{\sigma^2} \sim \chi^2_{n-p}. \end{align*} By the definition of $\hat{\sigma}^2$, this is the same as \begin{align*} V := \frac{(n-p)\hat{\sigma}^2}{\sigma^2} = \frac{|Y-X\hat{\beta}|^2}{\sigma^2} \sim \chi^2_{n-p}. \end{align*} We also need independence, because the Student $t$ statistic is formed by dividing a standard normal variable by the square root of an independent chi-squared variable divided by its degrees of freedom. The estimator $\hat{\beta}$ depends on the projected noise $P\varepsilon$, while the residuals depend on the orthogonal projected noise $(I_n-P)\varepsilon$. Their covariance is \begin{align*} \operatorname{Cov}(P\varepsilon,(I_n-P)\varepsilon) = P(\sigma^2 I_n)(I_n-P) = \sigma^2P(I_n-P) = 0. \end{align*} Because these two random vectors are jointly Gaussian, zero covariance implies independence. Thus $\hat{\beta}$, and hence $Z=x_0^\top\hat{\beta}$, is independent of $\hat{\sigma}^2$ and of $V$. [/guided] [/step] [step:Form the Student $t$ pivot] Define the random variable \begin{align*} T:\Omega &\to \mathbb{R} \\ \omega &\mapsto \frac{x_0^\top\hat{\beta}(\omega)-x_0^\top\beta} {\hat{\sigma}(\omega)\sqrt{x_0^\top(X^\top X)^{-1}x_0}}. \end{align*} Using $s_0=\sqrt{x_0^\top(X^\top X)^{-1}x_0}$, write \begin{align*} T = \frac{(Z-x_0^\top\beta)/(\sigma s_0)} {\sqrt{V/(n-p)}}. \end{align*} The numerator is $\mathcal{N}(0,1)$, the denominator is the square root of an independent $\chi^2_{n-p}$ random variable divided by $n-p$, and therefore \begin{align*} T \sim t_{n-p}. \end{align*} [guided] We now assemble the two distributional facts into a pivot, meaning a statistic whose distribution does not depend on the unknown parameters $\beta$ and $\sigma$. Define \begin{align*} T:\Omega &\to \mathbb{R} \\ \omega &\mapsto \frac{x_0^\top\hat{\beta}(\omega)-x_0^\top\beta} {\hat{\sigma}(\omega)\sqrt{x_0^\top(X^\top X)^{-1}x_0}}. \end{align*} Let \begin{align*} s_0 := \sqrt{x_0^\top(X^\top X)^{-1}x_0}. \end{align*} From the first step, \begin{align*} \frac{Z-x_0^\top\beta}{\sigma s_0} \sim \mathcal{N}(0,1). \end{align*} From the second step, \begin{align*} V := \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}, \end{align*} and $V$ is independent of $Z$. Therefore \begin{align*} T = \frac{(Z-x_0^\top\beta)/(\sigma s_0)} {\sqrt{V/(n-p)}}. \end{align*} This is exactly the defining construction of a Student $t$ random variable with $n-p$ degrees of freedom: a standard normal random variable divided by the square root of an independent chi-squared random variable divided by its degrees of freedom. Hence \begin{align*} T \sim t_{n-p}. \end{align*} [/guided] [/step] [step:Rearrange the central probability statement to obtain the confidence interval] By the definition of the quantile $t_{n-p,1-\alpha/2}$ and the symmetry of the Student $t_{n-p}$ distribution, \begin{align*} \mathbb{P}\left( -t_{n-p,1-\alpha/2} \leq T \leq t_{n-p,1-\alpha/2} \right) = 1-\alpha. \end{align*} Substituting the definition of $T$ gives \begin{align*} \mathbb{P}\left( -t_{n-p,1-\alpha/2} \leq \frac{x_0^\top\hat{\beta}-x_0^\top\beta} {\hat{\sigma}s_0} \leq t_{n-p,1-\alpha/2} \right) = 1-\alpha. \end{align*} Since $\hat{\sigma}s_0\geq 0$ and $\hat{\sigma}>0$ with probability one under the full-rank normal linear model with $n>p$, this event is equivalent to \begin{align*} x_0^\top\hat{\beta} - t_{n-p,1-\alpha/2}\hat{\sigma}s_0 \leq x_0^\top\beta \leq x_0^\top\hat{\beta} + t_{n-p,1-\alpha/2}\hat{\sigma}s_0. \end{align*} Replacing $s_0$ by its definition gives exactly \begin{align*} x_0^\top\hat{\beta} - t_{n-p,1-\alpha/2}\,\hat{\sigma}\sqrt{x_0^\top(X^\top X)^{-1}x_0} \leq x_0^\top\beta \leq x_0^\top\hat{\beta} + t_{n-p,1-\alpha/2}\,\hat{\sigma}\sqrt{x_0^\top(X^\top X)^{-1}x_0}. \end{align*} Therefore the displayed random interval has coverage probability $1-\alpha$ for $x_0^\top\beta$, which proves the theorem. [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Orthogonal Projection

Definitions & Concepts

Distribution

Explore Further

Distribution Definition Orthogonal Projection Theorem #437 Continuity of Probability Probability Theory Mean Response and Prediction Intervals in the Gaussian Linear Model Probability & Statistics Conditioning and Independence Conditional Expectation Conditional Expectation on a Finite Partition Probability Theory Hilbert Projection Theorem for Closed Linear Subspaces of $L^2$ Probability & Statistics Backwards Martingale Convergence Theorem Martingale Theory Independence of Fitted Values and Residuals in the Normal Linear Model Probability & Statistics Sum Law by Convolution Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.