Wishart Distribution of the Residual Sum of Squares in the Multivariate Linear Model

Wishart Distribution of the Residual Sum of Squares in the Multivariate Linear Model (Theorem # 4064)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We rewrite least squares using the [orthogonal projection](/theorems/437) $P$ onto $\operatorname{col}(X)$. An orthogonal change of coordinates in $\mathbb{R}^n$ splits the noise matrix into its fitted component and its residual component. Because the rows of the noise matrix are independent Gaussian vectors with common covariance $\Sigma$, orthogonal transformations preserve the joint Gaussian law and make the two coordinate blocks independent. The residual sum of squares is then the cross-product of exactly $n-q$ independent $\mathcal{N}_p(0,\Sigma)$ row vectors, which is the defining construction of the Wishart law. [/proofplan] [step:Express the fitted and residual matrices through the projection onto $\operatorname{col}(X)$] Let $I_n \in \mathbb{R}^{n \times n}$ denote the identity matrix. Define the matrix \begin{align*} P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n}. \end{align*} Since $X$ has rank $q$, the matrix $X^\top X$ is invertible, and $P$ is the orthogonal projection of $\mathbb{R}^n$ onto $\operatorname{col}(X)$. The least-squares fitted matrix is \begin{align*} X\hat{B} &= X(X^\top X)^{-1}X^\top Y \\ &= P Y. \end{align*} Therefore the residual matrix is \begin{align*} R := Y-X\hat{B} = (I_n-P)Y. \end{align*} Using $Y=XB+E$ and $PX= X$, we get \begin{align*} R &= (I_n-P)(XB+E) \\ &= (I_n-P)E. \end{align*} Thus \begin{align*} S = R^\top R = E^\top(I_n-P)E. \end{align*} [guided] The least-squares estimator is designed so that the fitted values are the orthogonal projection of $Y$ onto the column space of $X$. We make this projection explicit. Let $I_n \in \mathbb{R}^{n \times n}$ denote the identity matrix, and define \begin{align*} P := X(X^\top X)^{-1}X^\top \in \mathbb{R}^{n \times n}. \end{align*} The hypothesis $\operatorname{rank}(X)=q$ implies that the Gram matrix $X^\top X$ is invertible. Hence $P$ is well-defined, symmetric, idempotent, and has range $\operatorname{col}(X)$; it is the orthogonal projection onto $\operatorname{col}(X)$. Now compute the fitted matrix: \begin{align*} X\hat{B} &= X(X^\top X)^{-1}X^\top Y \\ &= P Y. \end{align*} Therefore the residual matrix $R \in \mathbb{R}^{n \times p}$ is \begin{align*} R := Y-X\hat{B} = (I_n-P)Y. \end{align*} Substitute the model equation $Y=XB+E$: \begin{align*} R &= (I_n-P)(XB+E) \\ &= (I_n-P)XB + (I_n-P)E. \end{align*} Since every column of $XB$ lies in $\operatorname{col}(X)$, the projection $P$ fixes $XB$, so $(I_n-P)XB=0$. Hence \begin{align*} R = (I_n-P)E. \end{align*} Consequently the residual sum of squares matrix is \begin{align*} S = R^\top R = E^\top(I_n-P)^\top(I_n-P)E. \end{align*} Because $I_n-P$ is symmetric and idempotent, this simplifies to \begin{align*} S = E^\top(I_n-P)E. \end{align*} [/guided] [/step] [step:Choose orthogonal coordinates adapted to the fitted and residual subspaces] Choose an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ whose first $q$ rows form an orthonormal basis of $\operatorname{col}(X)$ and whose last $n-q$ rows form an orthonormal basis of $\operatorname{col}(X)^\perp$. Write \begin{align*} Q = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix}, \end{align*} where $Q_1 \in \mathbb{R}^{q \times n}$ and $Q_2 \in \mathbb{R}^{(n-q) \times n}$. Then \begin{align*} P = Q_1^\top Q_1, \qquad I_n-P = Q_2^\top Q_2. \end{align*} Therefore \begin{align*} S = E^\top Q_2^\top Q_2 E = (Q_2E)^\top(Q_2E). \end{align*} Also, since $P=Q_1^\top Q_1$, the fitted matrix $PY$ is determined by $Q_1Y$, while the residual sum of squares $S$ is determined by $Q_2Y=Q_2E$. [guided] The projection $P$ separates the data into two orthogonal directions: the fitted directions inside $\operatorname{col}(X)$ and the residual directions inside $\operatorname{col}(X)^\perp$. To make that separation algebraic, choose an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ adapted to this direct sum decomposition. Write \begin{align*} Q = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix}, \end{align*} where $Q_1 \in \mathbb{R}^{q \times n}$ has rows forming an orthonormal basis of $\operatorname{col}(X)$, and $Q_2 \in \mathbb{R}^{(n-q) \times n}$ has rows forming an orthonormal basis of $\operatorname{col}(X)^\perp$. Because $Q_1$ contains an orthonormal basis of $\operatorname{col}(X)$, the matrix $Q_1^\top Q_1$ is the orthogonal projection onto $\operatorname{col}(X)$. Since $P$ is also the orthogonal projection onto $\operatorname{col}(X)$, we have \begin{align*} P = Q_1^\top Q_1. \end{align*} Likewise, the orthogonal projection onto $\operatorname{col}(X)^\perp$ is \begin{align*} I_n-P = Q_2^\top Q_2. \end{align*} Substituting this identity into the expression for $S$ gives \begin{align*} S &= E^\top(I_n-P)E \\ &= E^\top Q_2^\top Q_2 E \\ &= (Q_2E)^\top(Q_2E). \end{align*} This is the central reduction: the residual sum of squares is the cross-product of the bottom block of the orthogonally transformed noise matrix. The same decomposition also identifies the source of independence. The fitted matrix $PY$ is determined by $Q_1Y$, because $P=Q_1^\top Q_1$. The residual matrix and hence $S$ are determined by $Q_2Y$. Since $Q_2X=0$, we have \begin{align*} Q_2Y = Q_2(XB+E)=Q_2E. \end{align*} Thus the fitted and residual objects depend on disjoint orthogonal blocks of the transformed noise. [/guided] [/step] [step:Show that the orthogonally transformed noise has independent Gaussian rows] Define the transformed noise matrix \begin{align*} Z := Q E \in \mathbb{R}^{n \times p}. \end{align*} For indices $r,s \in \{1,\dots,n\}$, define the Kronecker delta $\delta_{rs} \in \{0,1\}$ by $\delta_{rs}=1$ if $r=s$ and $\delta_{rs}=0$ if $r \ne s$. For each row index $r \in \{1,\dots,n\}$, let $Z_r \in \mathbb{R}^{1 \times p}$ denote the $r$th row of $Z$. For any column vectors $a_1,\dots,a_n \in \mathbb{R}^p$, the linear combination $\sum_{r=1}^n Z_r a_r$ is a real-valued Gaussian random variable because it is a linear combination of the jointly Gaussian entries of $E$. Thus the rows of $Z$ are jointly Gaussian. For $r,s \in \{1,\dots,n\}$, compute the covariance between the column vectors $Z_r^\top$ and $Z_s^\top$: \begin{align*} \operatorname{Cov}(Z_r^\top,Z_s^\top) &= \sum_{i=1}^n \sum_{j=1}^n Q_{ri}Q_{sj}\operatorname{Cov}(E_i^\top,E_j^\top) \\ &= \sum_{i=1}^n Q_{ri}Q_{si}\Sigma \\ &= (QQ^\top)_{rs}\Sigma \\ &= \delta_{rs}\Sigma. \end{align*} By the [Gaussian independence criterion for jointly Gaussian random vectors](/page/Multivariate%20Normal%20Distribution), zero cross-covariance between distinct Gaussian blocks implies independence. Hence $Z_1,\dots,Z_n$ are independent row vectors, and $Z_r^\top \sim \mathcal{N}_p(0,\Sigma)$ for each $r$. [guided] We now verify the probabilistic effect of multiplying the noise matrix by $Q$. Define \begin{align*} Z := Q E \in \mathbb{R}^{n \times p}. \end{align*} For indices $r,s \in \{1,\dots,n\}$, define the Kronecker delta $\delta_{rs} \in \{0,1\}$ by $\delta_{rs}=1$ if $r=s$ and $\delta_{rs}=0$ if $r \ne s$. For each $r \in \{1,\dots,n\}$, let $Z_r \in \mathbb{R}^{1 \times p}$ be the $r$th row of $Z$. Since $Z=QE$, the row $Z_r$ is \begin{align*} Z_r = \sum_{i=1}^n Q_{ri}E_i. \end{align*} Each $E_i^\top$ is Gaussian, and the rows $E_1,\dots,E_n$ are independent. Therefore every finite linear combination of entries of $Z$ is a finite linear combination of jointly Gaussian random variables, hence is Gaussian. Thus the rows $Z_1,\dots,Z_n$ are jointly Gaussian. We use the [Gaussian independence criterion for jointly Gaussian random vectors](/page/Multivariate%20Normal%20Distribution): for jointly Gaussian random vectors, independence is equivalent to zero cross-covariance between the corresponding blocks. The hypothesis needed for this criterion is joint Gaussianity, which was verified above for the rows $Z_1,\dots,Z_n$. We compute the cross-covariance explicitly. For $r,s \in \{1,\dots,n\}$, \begin{align*} \operatorname{Cov}(Z_r^\top,Z_s^\top) &= \operatorname{Cov}\left(\sum_{i=1}^n Q_{ri}E_i^\top,\sum_{j=1}^n Q_{sj}E_j^\top\right) \\ &= \sum_{i=1}^n \sum_{j=1}^n Q_{ri}Q_{sj}\operatorname{Cov}(E_i^\top,E_j^\top). \end{align*} The original rows are independent and each has covariance $\Sigma$, so \begin{align*} \operatorname{Cov}(E_i^\top,E_j^\top) = \begin{cases} \Sigma, & i=j, \\ 0, & i \ne j. \end{cases} \end{align*} Therefore \begin{align*} \operatorname{Cov}(Z_r^\top,Z_s^\top) &= \sum_{i=1}^n Q_{ri}Q_{si}\Sigma \\ &= (QQ^\top)_{rs}\Sigma. \end{align*} Since $Q$ is orthogonal, $QQ^\top=I_n$, so \begin{align*} \operatorname{Cov}(Z_r^\top,Z_s^\top)=\delta_{rs}\Sigma. \end{align*} Thus different rows of $Z$ have zero cross-covariance, and because they are jointly Gaussian, they are independent. For $r=s$, the same formula gives covariance $\Sigma$, and the mean is zero because each original row $E_i$ has mean zero. Hence \begin{align*} Z_r^\top \sim \mathcal{N}_p(0,\Sigma) \end{align*} for every $r \in \{1,\dots,n\}$. [/guided] [/step] [step:Identify the residual cross-product as a Wishart random matrix] Write \begin{align*} Z = \begin{pmatrix} Z_{\mathrm{fit}} \\ Z_{\mathrm{res}} \end{pmatrix}, \end{align*} where $Z_{\mathrm{fit}} := Q_1E \in \mathbb{R}^{q \times p}$ and $Z_{\mathrm{res}} := Q_2E \in \mathbb{R}^{(n-q) \times p}$. By the previous step, the rows of $Z_{\mathrm{res}}$ are independent and each row has transpose distributed as $\mathcal{N}_p(0,\Sigma)$. Since \begin{align*} S = (Q_2E)^\top(Q_2E) = Z_{\mathrm{res}}^\top Z_{\mathrm{res}}, \end{align*} the defining construction of the Wishart distribution gives \begin{align*} S \sim W_p(\Sigma,n-q). \end{align*} [guided] We now match the residual cross-product with the definition of the Wishart distribution. Write \begin{align*} Z = \begin{pmatrix} Z_{\mathrm{fit}} \\ Z_{\mathrm{res}} \end{pmatrix}, \end{align*} where \begin{align*} Z_{\mathrm{fit}} := Q_1E \in \mathbb{R}^{q \times p}, \qquad Z_{\mathrm{res}} := Q_2E \in \mathbb{R}^{(n-q) \times p}. \end{align*} The previous step proved that the rows of $Z=QE$ are independent and that each row transpose has distribution $\mathcal{N}_p(0,\Sigma)$. Therefore the $(n-q)$ rows of $Z_{\mathrm{res}}$ are independent, and each row transpose has distribution $\mathcal{N}_p(0,\Sigma)$. The residual sum of squares was reduced earlier to the cross-product of the residual block: \begin{align*} S = (Q_2E)^\top(Q_2E) = Z_{\mathrm{res}}^\top Z_{\mathrm{res}}. \end{align*} By the defining construction of the Wishart distribution, the cross-product of $n-q$ independent $\mathcal{N}_p(0,\Sigma)$ row vectors has distribution $W_p(\Sigma,n-q)$. Hence \begin{align*} S \sim W_p(\Sigma,n-q). \end{align*} [/guided] [/step] [step:Separate the estimator from the residual sum of squares] The estimator $\hat{B}$ is determined by $PY$, because \begin{align*} \hat{B}=(X^\top X)^{-1}X^\top(PY), \end{align*} as $X^\top P=X^\top$. The fitted component $PY$ is determined by $Q_1Y$, while $S$ is determined by $Q_2Y=Q_2E$. Now \begin{align*} Q_1Y = Q_1XB + Q_1E, \qquad Q_2Y = Q_2E. \end{align*} The random matrices $Q_1E$ and $Q_2E$ consist of disjoint blocks of the independent Gaussian rows of $Z=QE$, so they are independent. Adding the deterministic matrix $Q_1XB$ to $Q_1E$ preserves independence from $Q_2E$. Therefore $Q_1Y$ and $Q_2Y$ are independent, and hence $\hat{B}$ and $S$ are independent. [guided] It remains to prove independence. The estimator $\hat{B}$ depends only on the fitted part of $Y$. Indeed, since $P$ is symmetric and $PX=X$, we have $X^\top P=X^\top$, and therefore \begin{align*} (X^\top X)^{-1}X^\top(PY) &= (X^\top X)^{-1}X^\top Y \\ &= \hat{B}. \end{align*} Thus $\hat{B}$ is determined by $PY$. Because $P=Q_1^\top Q_1$, the matrix $PY$ is determined by $Q_1Y$. The residual sum of squares is determined by the residual block: \begin{align*} S = (Q_2Y)^\top(Q_2Y), \end{align*} since $Q_2Y=Q_2E$. Now decompose the transformed data: \begin{align*} Q_1Y = Q_1XB + Q_1E, \qquad Q_2Y = Q_2XB + Q_2E. \end{align*} Because the rows of $Q_2$ span $\operatorname{col}(X)^\perp$, we have $Q_2X=0$, so \begin{align*} Q_2Y = Q_2E. \end{align*} The previous step showed that the rows of $Z=QE$ are independent Gaussian row vectors. Hence the top block $Q_1E$ and the bottom block $Q_2E$ are independent random matrices. Adding the deterministic matrix $Q_1XB$ to $Q_1E$ does not change independence from $Q_2E$. Therefore $Q_1Y$ and $Q_2Y$ are independent. Finally, $\hat{B}$ is a measurable function of $Q_1Y$, and $S$ is a measurable function of $Q_2Y$. [Measurable functions](/page/Measurable%20Functions) of independent random objects are independent. Hence $\hat{B}$ and $S$ are independent. This completes the proof. [/guided] [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Orthogonal Projection

Definitions & Concepts

Distribution

Explore Further

Distribution Definition Orthogonal Projection Theorem #437 Conditional Variance Forecast Formula for a Stationary GARCH(1,1) Process probability Independence of the Sample Mean and Sample Covariance for Multivariate Normal Samples probability Herglotz Representation Theorem probability Affine Stability of the Multivariate Normal Distribution probability Dickey-Fuller Unit Root Limiting Distribution probability Linear Pairwise Decision Boundaries for Sample LDA probability One-Sample Gaussian Mean Test with Known Covariance probability Orthogonality of Innovations probability

What brings you to Androma?

Start with a route through the knowledge graph.