Sampling Distribution of the Least Squares Coefficient Matrix in the Multivariate Linear Model

Sampling Distribution of the Least Squares Coefficient Matrix in the Multivariate Linear Model (Theorem # 4063)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The least squares estimator is first written as the true coefficient matrix plus a fixed linear transformation of the error matrix. Since the error matrix has independent Gaussian rows with common covariance $\Sigma$, applying the column-stacking [vectorization](/page/Vectorization) map to the transformed error matrix gives a deterministic linear image of a [multivariate normal distribution](/page/Multivariate%20Normal%20Distribution). Computing the vectorized covariance with Kronecker product identities identifies the [matrix normal distribution](/page/Matrix%20Normal%20Distribution) row covariance as $(X^\top X)^{-1}$ and the column covariance as $\Sigma$, and the scalar contrast formula follows from the same covariance computation. [/proofplan] [step:Express the estimation error as a fixed linear transformation of the noise] Let $n,p,q \in \mathbb{N}$, $X \in \mathbb{R}^{n \times q}$, $B \in \mathbb{R}^{q \times p}$, $Y \in \mathbb{R}^{n \times p}$, and $E \in \mathbb{R}^{n \times p}$ be the matrices from the statement, so that $Y=XB+E$ and $E \sim MN_{n \times p}(0,I_n,\Sigma)$. Because $X$ has full column rank, the matrix $X^\top X \in \mathbb{R}^{q \times q}$ is invertible. Define \begin{align*} A := (X^\top X)^{-1}X^\top \in \mathbb{R}^{q \times n}. \end{align*} Using the model identity $Y = XB + E$, we compute \begin{align*} \hat B &= (X^\top X)^{-1}X^\top Y \\ &= (X^\top X)^{-1}X^\top(XB + E) \\ &= (X^\top X)^{-1}X^\top X B + (X^\top X)^{-1}X^\top E \\ &= B + AE. \end{align*} Therefore \begin{align*} \hat B - B = AE. \end{align*} [/step] [step:Show that every scalar contrast is Gaussian] Let $a \in \mathbb{R}^q$ and $c \in \mathbb{R}^p$ be fixed. Define the deterministic vector \begin{align*} u := A^\top a \in \mathbb{R}^n. \end{align*} Write the rows of $E$ as random vectors \begin{align*} E_i := (E_{i1},\dots,E_{ip})^\top \in \mathbb{R}^p, \qquad 1 \leq i \leq n. \end{align*} By the hypothesis $E \sim MN_{n \times p}(0,I_n,\Sigma)$, the random vectors $E_1,\dots,E_n$ are independent and each satisfies $E_i \sim N_p(0,\Sigma)$. The scalar contrast of the estimation error is \begin{align*} a^\top(\hat B - B)c &= a^\top A E c \\ &= u^\top E c \\ &= \sum_{i=1}^n u_i\, E_i^\top c. \end{align*} For each $i$, the scalar random variable $E_i^\top c$ is Gaussian with mean $0$ and variance $c^\top \Sigma c$. Since the rows $E_i$ are independent, the random variables $E_i^\top c$ are independent. Hence $a^\top(\hat B-B)c$ is a finite linear combination of independent Gaussian random variables, and is therefore Gaussian. [guided] The goal is to prove a matrix normal statement. A convenient way to do this is to test all scalar bilinear contrasts $a^\top(\hat B-B)c$, because these contrasts determine the Gaussian law and its row and column covariance structure. Fix $a \in \mathbb{R}^q$ and $c \in \mathbb{R}^p$. We already proved that $\hat B-B = AE$, where \begin{align*} A := (X^\top X)^{-1}X^\top \in \mathbb{R}^{q \times n}. \end{align*} Define \begin{align*} u := A^\top a \in \mathbb{R}^n. \end{align*} If $E_i := (E_{i1},\dots,E_{ip})^\top \in \mathbb{R}^p$ denotes the $i$-th row of $E$ written as a column vector, then the matrix product $u^\top E c$ expands row by row as \begin{align*} a^\top(\hat B - B)c &= a^\top A E c \\ &= u^\top E c \\ &= \sum_{i=1}^n u_i\, E_i^\top c. \end{align*} The hypothesis $E \sim MN_{n \times p}(0,I_n,\Sigma)$ means exactly that $E_1,\dots,E_n$ are independent and that each row has distribution $N_p(0,\Sigma)$. For a fixed vector $c \in \mathbb{R}^p$, each scalar projection $E_i^\top c$ is a one-dimensional Gaussian random variable with mean $0$ and variance $c^\top \Sigma c$. Independence of the row vectors implies independence of these scalar projections. Therefore the displayed sum is a finite deterministic linear combination of independent Gaussian random variables, so it is Gaussian. [/guided] [/step] [step:Compute the variance of the scalar contrast] Since each $E_i^\top c$ has mean $0$, the scalar contrast has mean \begin{align*} \mathbb{E}\left[a^\top(\hat B-B)c\right] = \sum_{i=1}^n u_i\, \mathbb{E}[E_i^\top c] = 0. \end{align*} Using independence of the rows of $E$, we obtain \begin{align*} \operatorname{Var}\left(a^\top(\hat B-B)c\right) &= \operatorname{Var}\left(\sum_{i=1}^n u_i\, E_i^\top c\right) \\ &= \sum_{i=1}^n u_i^2\, \operatorname{Var}(E_i^\top c) \\ &= \sum_{i=1}^n u_i^2\, c^\top \Sigma c \\ &= (u^\top u)\, c^\top \Sigma c. \end{align*} Now \begin{align*} u^\top u &= a^\top A A^\top a \\ &= a^\top (X^\top X)^{-1}X^\top X (X^\top X)^{-1}a \\ &= a^\top (X^\top X)^{-1}a. \end{align*} Thus \begin{align*} a^\top(\hat B-B)c \sim N\left(0,\, a^\top(X^\top X)^{-1}a\, c^\top\Sigma c\right). \end{align*} [guided] We now compute the parameters of the Gaussian scalar contrast. From the previous step, \begin{align*} a^\top(\hat B-B)c = \sum_{i=1}^n u_i\,E_i^\top c, \end{align*} where $u=A^\top a \in \mathbb{R}^n$ is deterministic and the random variables $E_i^\top c$ are independent Gaussian random variables with mean $0$ and variance $c^\top\Sigma c$. Linearity of expectation gives \begin{align*} \mathbb{E}\left[a^\top(\hat B-B)c\right] = \sum_{i=1}^n u_i\,\mathbb{E}[E_i^\top c] = 0. \end{align*} For the variance, independence removes the mixed covariance terms, so \begin{align*} \operatorname{Var}\left(a^\top(\hat B-B)c\right) &= \operatorname{Var}\left(\sum_{i=1}^n u_i\,E_i^\top c\right) \\ &= \sum_{i=1}^n u_i^2\,\operatorname{Var}(E_i^\top c) \\ &= \sum_{i=1}^n u_i^2\,c^\top\Sigma c \\ &= (u^\top u)c^\top\Sigma c. \end{align*} It remains to rewrite the deterministic factor $u^\top u$ in terms of $X$. Since $u=A^\top a$ and $A=(X^\top X)^{-1}X^\top$, direct multiplication gives \begin{align*} u^\top u &= a^\top A A^\top a \\ &= a^\top (X^\top X)^{-1}X^\top X (X^\top X)^{-1}a \\ &= a^\top (X^\top X)^{-1}a. \end{align*} Therefore the scalar contrast has distribution \begin{align*} a^\top(\hat B-B)c \sim N\left(0,\,a^\top(X^\top X)^{-1}a\,c^\top\Sigma c\right). \end{align*} [/guided] [/step] [step:Identify the matrix normal distribution of the estimator by vectorizing the error] Let $\operatorname{vec}: \mathbb{R}^{q \times p} \to \mathbb{R}^{qp}$ denote the column-stacking [vectorization](/page/Vectorization) [linear map](/page/Linear%20Map). Let $I_p \in \mathbb{R}^{p \times p}$ denote the identity matrix. By the definition of the [matrix normal distribution](/page/Matrix%20Normal%20Distribution) applied to $E \sim MN_{n \times p}(0,I_n,\Sigma)$, the vectorized error satisfies \begin{align*} \operatorname{vec}(E) \sim N_{np}\left(0,\Sigma \otimes I_n\right), \end{align*} where the Kronecker product is taken with the same column-stacking convention for $\operatorname{vec}$. The vectorization identity for left multiplication gives \begin{align*} \operatorname{vec}(\hat B-B) &= \operatorname{vec}(AE) \\ &= (I_p \otimes A)\operatorname{vec}(E). \end{align*} By the deterministic linear image property of the [multivariate normal distribution](/page/Multivariate%20Normal%20Distribution), since $I_p \otimes A$ is deterministic and linear, $\operatorname{vec}(\hat B-B)$ is multivariate Gaussian with mean \begin{align*} (I_p \otimes A)0 = 0 \end{align*} and covariance \begin{align*} (I_p \otimes A)(\Sigma \otimes I_n)(I_p \otimes A)^\top &= (I_p \Sigma I_p^\top) \otimes (A I_n A^\top) \\ &= \Sigma \otimes A A^\top \\ &= \Sigma \otimes (X^\top X)^{-1}. \end{align*} The first equality uses the mixed-product identity for the [Kronecker product](/page/Kronecker%20Product). The last equality uses the computation \begin{align*} A A^\top &= (X^\top X)^{-1}X^\top X (X^\top X)^{-1} \\ &= (X^\top X)^{-1}. \end{align*} Therefore, by the definition of the matrix normal distribution, \begin{align*} \hat B-B \sim MN_{q \times p}\left(0,(X^\top X)^{-1},\Sigma\right). \end{align*} Adding the deterministic matrix $B \in \mathbb{R}^{q \times p}$ shifts the mean and leaves the covariance structure unchanged, so \begin{align*} \hat B \sim MN_{q \times p}\left(B,(X^\top X)^{-1},\Sigma\right). \end{align*} The scalar contrast formula proved in the preceding step is the corresponding one-dimensional marginal statement. [guided] We must prove the full matrix normal distribution, not only the distribution of each rank-one scalar contrast. The standard way to do this is to vectorize the random matrix and prove that the resulting vector is multivariate Gaussian with the covariance required by the matrix normal definition. Let $\operatorname{vec}: \mathbb{R}^{q \times p} \to \mathbb{R}^{qp}$ be the column-stacking [vectorization](/page/Vectorization) linear map, and let $I_p \in \mathbb{R}^{p \times p}$ be the identity matrix. The hypothesis $E \sim MN_{n \times p}(0,I_n,\Sigma)$ means, by the definition of the [matrix normal distribution](/page/Matrix%20Normal%20Distribution) under this column-stacking convention, that \begin{align*} \operatorname{vec}(E) \sim N_{np}\left(0,\Sigma \otimes I_n\right). \end{align*} The estimator error is $\hat B-B=AE$, where $A=(X^\top X)^{-1}X^\top \in \mathbb{R}^{q \times n}$. Vectorizing this identity and using the vectorization rule for left multiplication gives \begin{align*} \operatorname{vec}(\hat B-B) &= \operatorname{vec}(AE) \\ &= (I_p \otimes A)\operatorname{vec}(E). \end{align*} This is the crucial strengthening of the scalar-contrast argument: it proves joint Gaussianity because $\operatorname{vec}(\hat B-B)$ is a deterministic linear image of the Gaussian vector $\operatorname{vec}(E)$. Now compute the mean and covariance. The mean is \begin{align*} \mathbb{E}[\operatorname{vec}(\hat B-B)] &= (I_p \otimes A)\mathbb{E}[\operatorname{vec}(E)] \\ &= (I_p \otimes A)0 \\ &= 0. \end{align*} For the covariance, use the covariance transformation formula for a deterministic linear map of a [multivariate normal distribution](/page/Multivariate%20Normal%20Distribution) and the mixed-product identity for [Kronecker products](/page/Kronecker%20Product): \begin{align*} \operatorname{Cov}(\operatorname{vec}(\hat B-B)) &= (I_p \otimes A)(\Sigma \otimes I_n)(I_p \otimes A)^\top \\ &= (I_p \otimes A)(\Sigma \otimes I_n)(I_p^\top \otimes A^\top) \\ &= (I_p\Sigma I_p^\top) \otimes (A I_n A^\top) \\ &= \Sigma \otimes A A^\top. \end{align*} It remains to identify $A A^\top$. Since $A=(X^\top X)^{-1}X^\top$ and $X^\top X$ is invertible, direct multiplication gives \begin{align*} A A^\top &= (X^\top X)^{-1}X^\top X (X^\top X)^{-1} \\ &= (X^\top X)^{-1}. \end{align*} Thus \begin{align*} \operatorname{vec}(\hat B-B) \sim N_{qp}\left(0,\Sigma \otimes (X^\top X)^{-1}\right). \end{align*} By the definition of $MN_{q \times p}(M,U,\Sigma)$, this is exactly \begin{align*} \hat B-B \sim MN_{q \times p}\left(0,(X^\top X)^{-1},\Sigma\right). \end{align*} Finally, adding the deterministic matrix $B \in \mathbb{R}^{q \times p}$ shifts the mean from $0$ to $B$ without changing covariance, so \begin{align*} \hat B \sim MN_{q \times p}\left(B,(X^\top X)^{-1},\Sigma\right). \end{align*} The scalar contrast formula has already been computed from this same covariance structure. [/guided] [/step]

Prerequisites (0/4 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

test

Definitions & Concepts

Explore Further

Distribution Definition Expectation Definition Linear Map Definition test Theorem #89 Nelson's Strict Stationarity Theorem for the GARCH(1,1) Process probability Yule-Walker Equations for a Causal Autoregressive Process probability Moment Generating Function of a Multivariate Normal Distribution probability One-Sample Gaussian Mean Test with Known Covariance probability Scheffe Simultaneous Confidence Intervals for Linear Combinations of a Multivariate Mean probability Tail Dependence Coefficients of the Bivariate Student $t$ Copula probability Ljung-Box Portmanteau Test probability Beveridge–Nelson Decomposition probability

What brings you to Androma?

Start with a route through the knowledge graph.