Best Linear Predictor and Residual Covariance Formula

Best Linear Predictor and Residual Covariance Formula (Theorem # 4002)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We first separate the intercept from the linear coefficient by centering $X$ and $Y$. For a fixed matrix $B$, the optimal intercept is forced by the mean-zero condition on the residual. After centering, the problem becomes a strictly convex quadratic minimization over matrices, whose normal equation is $B\Sigma_{YY}=\Sigma_{XY}$. Solving this equation gives the best linear coefficient, and substituting it into the covariance of the centered residual gives the stated Schur complement formula. [/proofplan] [step:Center the variables and optimize the intercept for a fixed linear coefficient] Define the centered random vectors \begin{align*} \widetilde X: \Omega &\to \mathbb R^p, & \omega &\mapsto X(\omega)-\mu_X, \\ \widetilde Y: \Omega &\to \mathbb R^q, & \omega &\mapsto Y(\omega)-\mu_Y. \end{align*} Then $\mathbb E[\widetilde X]=0$ and $\mathbb E[\widetilde Y]=0$. Fix $B \in \mathbb R^{p \times q}$. For $a \in \mathbb R^p$, define \begin{align*} c := a-\mu_X+B\mu_Y \in \mathbb R^p. \end{align*} Then \begin{align*} X-a-BY &= \widetilde X - B\widetilde Y - c. \end{align*} Let \begin{align*} Z_B: \Omega &\to \mathbb R^p \\ \omega &\mapsto \widetilde X(\omega)-B\widetilde Y(\omega). \end{align*} Since $\mathbb E[Z_B]=0$, expanding the Euclidean square gives \begin{align*} \mathbb E[|X-a-BY|^2] &= \mathbb E[|Z_B-c|^2] \\ &= \mathbb E[|Z_B|^2]-2c^\top \mathbb E[Z_B]+|c|^2 \\ &= \mathbb E[|Z_B|^2]+|c|^2. \end{align*} Thus, for this fixed $B$, the unique minimizing intercept is obtained by setting $c=0$, namely \begin{align*} a = \mu_X-B\mu_Y. \end{align*} Therefore it remains to minimize \begin{align*} \Phi: \mathbb R^{p \times q} &\to \mathbb R \\ B &\mapsto \mathbb E[|\widetilde X-B\widetilde Y|^2]. \end{align*} [guided] The intercept exists precisely to handle the means. To make that precise, we introduce the centered variables \begin{align*} \widetilde X: \Omega &\to \mathbb R^p, & \omega &\mapsto X(\omega)-\mu_X, \\ \widetilde Y: \Omega &\to \mathbb R^q, & \omega &\mapsto Y(\omega)-\mu_Y. \end{align*} Both have expectation zero. Now fix a matrix $B \in \mathbb R^{p \times q}$. The affine predictor $a+BY$ can be rewritten around the means: \begin{align*} X-a-BY &= X-a-B(\widetilde Y+\mu_Y) \\ &= (X-\mu_X)-B\widetilde Y-(a-\mu_X+B\mu_Y). \end{align*} Define \begin{align*} c := a-\mu_X+B\mu_Y \in \mathbb R^p \end{align*} and \begin{align*} Z_B: \Omega &\to \mathbb R^p \\ \omega &\mapsto \widetilde X(\omega)-B\widetilde Y(\omega). \end{align*} Then $X-a-BY=Z_B-c$. Since $\mathbb E[\widetilde X]=0$ and $\mathbb E[\widetilde Y]=0$, we have $\mathbb E[Z_B]=0$. Therefore \begin{align*} \mathbb E[|X-a-BY|^2] &= \mathbb E[|Z_B-c|^2] \\ &= \mathbb E[|Z_B|^2]-2c^\top \mathbb E[Z_B]+|c|^2 \\ &= \mathbb E[|Z_B|^2]+|c|^2. \end{align*} The only part depending on the intercept is $|c|^2$, which is minimized uniquely when $c=0$. Hence the optimal intercept for this fixed $B$ is \begin{align*} a=\mu_X-B\mu_Y. \end{align*} So the affine prediction problem reduces to the centered linear prediction problem of minimizing \begin{align*} \Phi: \mathbb R^{p \times q} &\to \mathbb R \\ B &\mapsto \mathbb E[|\widetilde X-B\widetilde Y|^2]. \end{align*} [/guided] [/step] [step:Expand the centered mean squared error as a quadratic function of $B$] Using $|v|^2=\operatorname{tr}(vv^\top)$ for $v \in \mathbb R^p$ and linearity of expectation, \begin{align*} \Phi(B) &= \mathbb E[\operatorname{tr}((\widetilde X-B\widetilde Y)(\widetilde X-B\widetilde Y)^\top)] \\ &= \operatorname{tr}(\Sigma_{XX}) -2\operatorname{tr}(B\Sigma_{YX}) +\operatorname{tr}(B\Sigma_{YY}B^\top). \end{align*} Indeed, \begin{align*} \mathbb E[\widetilde X\widetilde X^\top]&=\Sigma_{XX},& \mathbb E[\widetilde X\widetilde Y^\top]&=\Sigma_{XY},& \mathbb E[\widetilde Y\widetilde X^\top]&=\Sigma_{YX},& \mathbb E[\widetilde Y\widetilde Y^\top]&=\Sigma_{YY}. \end{align*} [/step] [step:Solve the normal equation for the unique minimizing matrix] Let $B \in \mathbb R^{p \times q}$ and let $H \in \mathbb R^{p \times q}$ be an arbitrary perturbation. From the [quadratic formula](/theorems/1301) for $\Phi$, \begin{align*} \Phi(B+H)-\Phi(B) &= -2\operatorname{tr}(H\Sigma_{YX}) +\operatorname{tr}(B\Sigma_{YY}H^\top) +\operatorname{tr}(H\Sigma_{YY}B^\top) +\operatorname{tr}(H\Sigma_{YY}H^\top). \end{align*} Since $\Sigma_{YY}$ is symmetric, cyclicity of trace gives \begin{align*} \operatorname{tr}(B\Sigma_{YY}H^\top) +\operatorname{tr}(H\Sigma_{YY}B^\top) =2\operatorname{tr}(B\Sigma_{YY}H^\top), \end{align*} and since $\operatorname{tr}(H\Sigma_{YX})=\operatorname{tr}(\Sigma_{XY}H^\top)$, we obtain \begin{align*} \Phi(B+H)-\Phi(B) = 2\operatorname{tr}((B\Sigma_{YY}-\Sigma_{XY})H^\top) +\operatorname{tr}(H\Sigma_{YY}H^\top). \end{align*} At a minimizer, the first-order term must vanish for every $H$, hence \begin{align*} B\Sigma_{YY}=\Sigma_{XY}. \end{align*} Because $\Sigma_{YY}$ is positive definite, it is invertible, so the unique solution is \begin{align*} B_*=\Sigma_{XY}\Sigma_{YY}^{-1}. \end{align*} Conversely, if $B_*$ is defined this way, then $B_*\Sigma_{YY}=\Sigma_{XY}$, and therefore for every $B \in \mathbb R^{p \times q}$, with $H:=B-B_*$, \begin{align*} \Phi(B)-\Phi(B_*) &=\operatorname{tr}(H\Sigma_{YY}H^\top). \end{align*} Writing $h_i \in \mathbb R^q$ for the $i$-th row of $H$, this trace is \begin{align*} \operatorname{tr}(H\Sigma_{YY}H^\top) = \sum_{i=1}^p h_i^\top \Sigma_{YY}h_i. \end{align*} Positive definiteness of $\Sigma_{YY}$ implies each summand is nonnegative and that the sum is zero only when every $h_i=0$. Thus $B_*$ is the unique minimizer of $\Phi$. [guided] We now minimize the centered objective over matrices. Let $B \in \mathbb R^{p \times q}$ be a candidate coefficient and let $H \in \mathbb R^{p \times q}$ be an arbitrary perturbation. The purpose of $H$ is to test whether changing $B$ in any matrix direction can lower the objective. Using the quadratic expansion from the previous step, \begin{align*} \Phi(B+H)-\Phi(B) &= -2\operatorname{tr}(H\Sigma_{YX}) +\operatorname{tr}(B\Sigma_{YY}H^\top) +\operatorname{tr}(H\Sigma_{YY}B^\top) +\operatorname{tr}(H\Sigma_{YY}H^\top). \end{align*} Because $\Sigma_{YY}$ is a covariance matrix, it is symmetric. Therefore trace cyclicity gives \begin{align*} \operatorname{tr}(H\Sigma_{YY}B^\top) = \operatorname{tr}(B\Sigma_{YY}H^\top). \end{align*} Also, since $\Sigma_{YX}=\Sigma_{XY}^\top$, \begin{align*} \operatorname{tr}(H\Sigma_{YX}) = \operatorname{tr}(\Sigma_{XY}H^\top). \end{align*} Substituting these two trace identities yields \begin{align*} \Phi(B+H)-\Phi(B) = 2\operatorname{tr}((B\Sigma_{YY}-\Sigma_{XY})H^\top) +\operatorname{tr}(H\Sigma_{YY}H^\top). \end{align*} At a minimizer, the coefficient of the first-order perturbation must vanish in every direction $H$. Equivalently, \begin{align*} \operatorname{tr}((B\Sigma_{YY}-\Sigma_{XY})H^\top)=0 \end{align*} for every $H \in \mathbb R^{p \times q}$. Taking $H=B\Sigma_{YY}-\Sigma_{XY}$ shows that this is possible only if \begin{align*} B\Sigma_{YY}-\Sigma_{XY}=0. \end{align*} Thus the normal equation is \begin{align*} B\Sigma_{YY}=\Sigma_{XY}. \end{align*} The hypothesis that $\Sigma_{YY}$ is positive definite implies that $\Sigma_{YY}$ is invertible, so the normal equation has the unique solution \begin{align*} B_*=\Sigma_{XY}\Sigma_{YY}^{-1}. \end{align*} It remains to check that this critical point is genuinely the global minimizer. Let $B \in \mathbb R^{p \times q}$ be arbitrary and define \begin{align*} H:=B-B_* \in \mathbb R^{p \times q}. \end{align*} Since $B_*\Sigma_{YY}=\Sigma_{XY}$, the first-order term disappears, and we get \begin{align*} \Phi(B)-\Phi(B_*) &=\operatorname{tr}(H\Sigma_{YY}H^\top). \end{align*} If $h_i \in \mathbb R^q$ denotes the $i$-th row of $H$, then \begin{align*} \operatorname{tr}(H\Sigma_{YY}H^\top) = \sum_{i=1}^p h_i^\top \Sigma_{YY}h_i. \end{align*} Positive definiteness of $\Sigma_{YY}$ gives $h_i^\top\Sigma_{YY}h_i \ge 0$ for each $i$, with equality only when $h_i=0$. Hence $\Phi(B)\ge \Phi(B_*)$, and equality holds only when every row $h_i$ is zero, namely only when $B=B_*$. This proves that $B_*$ is the unique minimizing matrix. [/guided] [/step] [step:Recover the optimal intercept] From the intercept optimization with $B=B_*$, the unique optimal intercept is \begin{align*} a_*=\mu_X-B_*\mu_Y. \end{align*} Thus the unique best affine predictor of $X$ from $Y$ is \begin{align*} \omega \mapsto a_*+B_*Y(\omega). \end{align*} [/step] [step:Compute the residual covariance] Define the optimal centered residual \begin{align*} \widetilde R_*: \Omega &\to \mathbb R^p \\ \omega &\mapsto \widetilde X(\omega)-B_*\widetilde Y(\omega). \end{align*} Since $a_*=\mu_X-B_*\mu_Y$, we have $R_*=\widetilde R_*$. Also, \begin{align*} \mathbb E[R_*] &= \mathbb E[\widetilde X]-B_*\mathbb E[\widetilde Y] =0. \end{align*} Therefore \begin{align*} \operatorname{Cov}(R_*) &= \mathbb E[R_*R_*^\top] \\ &= \mathbb E[(\widetilde X-B_*\widetilde Y)(\widetilde X-B_*\widetilde Y)^\top] \\ &= \Sigma_{XX}-B_*\Sigma_{YX}-\Sigma_{XY}B_*^\top+B_*\Sigma_{YY}B_*^\top. \end{align*} Using $B_*\Sigma_{YY}=\Sigma_{XY}$ and $B_*^\top=\Sigma_{YY}^{-1}\Sigma_{YX}$, we compute \begin{align*} B_*\Sigma_{YX} &=\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}, \\ \Sigma_{XY}B_*^\top &=\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}, \\ B_*\Sigma_{YY}B_*^\top &=\Sigma_{XY}B_*^\top =\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}. \end{align*} Substituting these three identities gives \begin{align*} \operatorname{Cov}(R_*) &= \Sigma_{XX} -\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX} -\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX} +\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX} \\ &= \Sigma_{XX}-\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}. \end{align*} This is exactly \begin{align*} \operatorname{Cov}(X)-\operatorname{Cov}(X,Y)\operatorname{Cov}(Y)^{-1}\operatorname{Cov}(Y,X), \end{align*} which completes the proof. [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

test

Definitions & Concepts

Expectation

Explore Further

Expectation Definition test Theorem #89 Sampling Alias Formula probability Wishart Principal Block Marginal Theorem probability Consistency of Sample Principal Component Analysis probability One-Sample Gaussian Mean Test with Known Covariance probability Law of the Unconscious Statistician probability Orthogonality of Population Canonical Variates probability Conditional Expectation as the $L^2$ Projection probability Monotonicity and Triangle Inequality for Expectation probability

What brings you to Androma?

Start with a route through the knowledge graph.