[proofplan]
We first separate the intercept from the linear coefficient by centering $X$ and $Y$. For a fixed matrix $B$, the optimal intercept is forced by the mean-zero condition on the residual. After centering, the problem becomes a strictly convex quadratic minimization over matrices, whose normal equation is $B\Sigma_{YY}=\Sigma_{XY}$. Solving this equation gives the best linear coefficient, and substituting it into the covariance of the centered residual gives the stated Schur complement formula.
[/proofplan]
[step:Center the variables and optimize the intercept for a fixed linear coefficient]
Define the centered random vectors
\begin{align*}
\widetilde X: \Omega &\to \mathbb R^p, &
\omega &\mapsto X(\omega)-\mu_X, \\
\widetilde Y: \Omega &\to \mathbb R^q, &
\omega &\mapsto Y(\omega)-\mu_Y.
\end{align*}
Then $\mathbb E[\widetilde X]=0$ and $\mathbb E[\widetilde Y]=0$.
Fix $B \in \mathbb R^{p \times q}$. For $a \in \mathbb R^p$, define
\begin{align*}
c := a-\mu_X+B\mu_Y \in \mathbb R^p.
\end{align*}
Then
\begin{align*}
X-a-BY
&= \widetilde X - B\widetilde Y - c.
\end{align*}
Let
\begin{align*}
Z_B: \Omega &\to \mathbb R^p \\
\omega &\mapsto \widetilde X(\omega)-B\widetilde Y(\omega).
\end{align*}
Since $\mathbb E[Z_B]=0$, expanding the Euclidean square gives
\begin{align*}
\mathbb E[|X-a-BY|^2]
&= \mathbb E[|Z_B-c|^2] \\
&= \mathbb E[|Z_B|^2]-2c^\top \mathbb E[Z_B]+|c|^2 \\
&= \mathbb E[|Z_B|^2]+|c|^2.
\end{align*}
Thus, for this fixed $B$, the unique minimizing intercept is obtained by setting $c=0$, namely
\begin{align*}
a = \mu_X-B\mu_Y.
\end{align*}
Therefore it remains to minimize
\begin{align*}
\Phi: \mathbb R^{p \times q} &\to \mathbb R \\
B &\mapsto \mathbb E[|\widetilde X-B\widetilde Y|^2].
\end{align*}
[guided]
The intercept exists precisely to handle the means. To make that precise, we introduce the centered variables
\begin{align*}
\widetilde X: \Omega &\to \mathbb R^p, &
\omega &\mapsto X(\omega)-\mu_X, \\
\widetilde Y: \Omega &\to \mathbb R^q, &
\omega &\mapsto Y(\omega)-\mu_Y.
\end{align*}
Both have expectation zero. Now fix a matrix $B \in \mathbb R^{p \times q}$. The affine predictor $a+BY$ can be rewritten around the means:
\begin{align*}
X-a-BY
&= X-a-B(\widetilde Y+\mu_Y) \\
&= (X-\mu_X)-B\widetilde Y-(a-\mu_X+B\mu_Y).
\end{align*}
Define
\begin{align*}
c := a-\mu_X+B\mu_Y \in \mathbb R^p
\end{align*}
and
\begin{align*}
Z_B: \Omega &\to \mathbb R^p \\
\omega &\mapsto \widetilde X(\omega)-B\widetilde Y(\omega).
\end{align*}
Then $X-a-BY=Z_B-c$. Since $\mathbb E[\widetilde X]=0$ and $\mathbb E[\widetilde Y]=0$, we have $\mathbb E[Z_B]=0$. Therefore
\begin{align*}
\mathbb E[|X-a-BY|^2]
&= \mathbb E[|Z_B-c|^2] \\
&= \mathbb E[|Z_B|^2]-2c^\top \mathbb E[Z_B]+|c|^2 \\
&= \mathbb E[|Z_B|^2]+|c|^2.
\end{align*}
The only part depending on the intercept is $|c|^2$, which is minimized uniquely when $c=0$. Hence the optimal intercept for this fixed $B$ is
\begin{align*}
a=\mu_X-B\mu_Y.
\end{align*}
So the affine prediction problem reduces to the centered linear prediction problem of minimizing
\begin{align*}
\Phi: \mathbb R^{p \times q} &\to \mathbb R \\
B &\mapsto \mathbb E[|\widetilde X-B\widetilde Y|^2].
\end{align*}
[/guided]
[/step]
[step:Expand the centered mean squared error as a quadratic function of $B$]
Using $|v|^2=\operatorname{tr}(vv^\top)$ for $v \in \mathbb R^p$ and linearity of expectation,
\begin{align*}
\Phi(B)
&= \mathbb E[\operatorname{tr}((\widetilde X-B\widetilde Y)(\widetilde X-B\widetilde Y)^\top)] \\
&= \operatorname{tr}(\Sigma_{XX})
-2\operatorname{tr}(B\Sigma_{YX})
+\operatorname{tr}(B\Sigma_{YY}B^\top).
\end{align*}
Indeed,
\begin{align*}
\mathbb E[\widetilde X\widetilde X^\top]&=\Sigma_{XX},&
\mathbb E[\widetilde X\widetilde Y^\top]&=\Sigma_{XY},&
\mathbb E[\widetilde Y\widetilde X^\top]&=\Sigma_{YX},&
\mathbb E[\widetilde Y\widetilde Y^\top]&=\Sigma_{YY}.
\end{align*}
[/step]
[step:Solve the normal equation for the unique minimizing matrix]
Let $B \in \mathbb R^{p \times q}$ and let $H \in \mathbb R^{p \times q}$ be an arbitrary perturbation. From the [quadratic formula](/theorems/1301) for $\Phi$,
\begin{align*}
\Phi(B+H)-\Phi(B)
&= -2\operatorname{tr}(H\Sigma_{YX})
+\operatorname{tr}(B\Sigma_{YY}H^\top)
+\operatorname{tr}(H\Sigma_{YY}B^\top)
+\operatorname{tr}(H\Sigma_{YY}H^\top).
\end{align*}
Since $\Sigma_{YY}$ is symmetric, cyclicity of trace gives
\begin{align*}
\operatorname{tr}(B\Sigma_{YY}H^\top)
+\operatorname{tr}(H\Sigma_{YY}B^\top)
=2\operatorname{tr}(B\Sigma_{YY}H^\top),
\end{align*}
and since $\operatorname{tr}(H\Sigma_{YX})=\operatorname{tr}(\Sigma_{XY}H^\top)$, we obtain
\begin{align*}
\Phi(B+H)-\Phi(B)
=
2\operatorname{tr}((B\Sigma_{YY}-\Sigma_{XY})H^\top)
+\operatorname{tr}(H\Sigma_{YY}H^\top).
\end{align*}
At a minimizer, the first-order term must vanish for every $H$, hence
\begin{align*}
B\Sigma_{YY}=\Sigma_{XY}.
\end{align*}
Because $\Sigma_{YY}$ is positive definite, it is invertible, so the unique solution is
\begin{align*}
B_*=\Sigma_{XY}\Sigma_{YY}^{-1}.
\end{align*}
Conversely, if $B_*$ is defined this way, then $B_*\Sigma_{YY}=\Sigma_{XY}$, and therefore for every $B \in \mathbb R^{p \times q}$, with $H:=B-B_*$,
\begin{align*}
\Phi(B)-\Phi(B_*)
&=\operatorname{tr}(H\Sigma_{YY}H^\top).
\end{align*}
Writing $h_i \in \mathbb R^q$ for the $i$-th row of $H$, this trace is
\begin{align*}
\operatorname{tr}(H\Sigma_{YY}H^\top)
=
\sum_{i=1}^p h_i^\top \Sigma_{YY}h_i.
\end{align*}
Positive definiteness of $\Sigma_{YY}$ implies each summand is nonnegative and that the sum is zero only when every $h_i=0$. Thus $B_*$ is the unique minimizer of $\Phi$.
[guided]
We now minimize the centered objective over matrices. Let $B \in \mathbb R^{p \times q}$ be a candidate coefficient and let $H \in \mathbb R^{p \times q}$ be an arbitrary perturbation. The purpose of $H$ is to test whether changing $B$ in any matrix direction can lower the objective.
Using the quadratic expansion from the previous step,
\begin{align*}
\Phi(B+H)-\Phi(B)
&= -2\operatorname{tr}(H\Sigma_{YX})
+\operatorname{tr}(B\Sigma_{YY}H^\top)
+\operatorname{tr}(H\Sigma_{YY}B^\top)
+\operatorname{tr}(H\Sigma_{YY}H^\top).
\end{align*}
Because $\Sigma_{YY}$ is a covariance matrix, it is symmetric. Therefore trace cyclicity gives
\begin{align*}
\operatorname{tr}(H\Sigma_{YY}B^\top)
=
\operatorname{tr}(B\Sigma_{YY}H^\top).
\end{align*}
Also, since $\Sigma_{YX}=\Sigma_{XY}^\top$,
\begin{align*}
\operatorname{tr}(H\Sigma_{YX})
=
\operatorname{tr}(\Sigma_{XY}H^\top).
\end{align*}
Substituting these two trace identities yields
\begin{align*}
\Phi(B+H)-\Phi(B)
=
2\operatorname{tr}((B\Sigma_{YY}-\Sigma_{XY})H^\top)
+\operatorname{tr}(H\Sigma_{YY}H^\top).
\end{align*}
At a minimizer, the coefficient of the first-order perturbation must vanish in every direction $H$. Equivalently,
\begin{align*}
\operatorname{tr}((B\Sigma_{YY}-\Sigma_{XY})H^\top)=0
\end{align*}
for every $H \in \mathbb R^{p \times q}$. Taking $H=B\Sigma_{YY}-\Sigma_{XY}$ shows that this is possible only if
\begin{align*}
B\Sigma_{YY}-\Sigma_{XY}=0.
\end{align*}
Thus the normal equation is
\begin{align*}
B\Sigma_{YY}=\Sigma_{XY}.
\end{align*}
The hypothesis that $\Sigma_{YY}$ is positive definite implies that $\Sigma_{YY}$ is invertible, so the normal equation has the unique solution
\begin{align*}
B_*=\Sigma_{XY}\Sigma_{YY}^{-1}.
\end{align*}
It remains to check that this critical point is genuinely the global minimizer. Let $B \in \mathbb R^{p \times q}$ be arbitrary and define
\begin{align*}
H:=B-B_* \in \mathbb R^{p \times q}.
\end{align*}
Since $B_*\Sigma_{YY}=\Sigma_{XY}$, the first-order term disappears, and we get
\begin{align*}
\Phi(B)-\Phi(B_*)
&=\operatorname{tr}(H\Sigma_{YY}H^\top).
\end{align*}
If $h_i \in \mathbb R^q$ denotes the $i$-th row of $H$, then
\begin{align*}
\operatorname{tr}(H\Sigma_{YY}H^\top)
=
\sum_{i=1}^p h_i^\top \Sigma_{YY}h_i.
\end{align*}
Positive definiteness of $\Sigma_{YY}$ gives $h_i^\top\Sigma_{YY}h_i \ge 0$ for each $i$, with equality only when $h_i=0$. Hence $\Phi(B)\ge \Phi(B_*)$, and equality holds only when every row $h_i$ is zero, namely only when $B=B_*$. This proves that $B_*$ is the unique minimizing matrix.
[/guided]
[/step]
[step:Recover the optimal intercept]
From the intercept optimization with $B=B_*$, the unique optimal intercept is
\begin{align*}
a_*=\mu_X-B_*\mu_Y.
\end{align*}
Thus the unique best affine predictor of $X$ from $Y$ is
\begin{align*}
\omega \mapsto a_*+B_*Y(\omega).
\end{align*}
[/step]
[step:Compute the residual covariance]
Define the optimal centered residual
\begin{align*}
\widetilde R_*: \Omega &\to \mathbb R^p \\
\omega &\mapsto \widetilde X(\omega)-B_*\widetilde Y(\omega).
\end{align*}
Since $a_*=\mu_X-B_*\mu_Y$, we have $R_*=\widetilde R_*$. Also,
\begin{align*}
\mathbb E[R_*]
&= \mathbb E[\widetilde X]-B_*\mathbb E[\widetilde Y]
=0.
\end{align*}
Therefore
\begin{align*}
\operatorname{Cov}(R_*)
&= \mathbb E[R_*R_*^\top] \\
&= \mathbb E[(\widetilde X-B_*\widetilde Y)(\widetilde X-B_*\widetilde Y)^\top] \\
&= \Sigma_{XX}-B_*\Sigma_{YX}-\Sigma_{XY}B_*^\top+B_*\Sigma_{YY}B_*^\top.
\end{align*}
Using $B_*\Sigma_{YY}=\Sigma_{XY}$ and $B_*^\top=\Sigma_{YY}^{-1}\Sigma_{YX}$, we compute
\begin{align*}
B_*\Sigma_{YX}
&=\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}, \\
\Sigma_{XY}B_*^\top
&=\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}, \\
B_*\Sigma_{YY}B_*^\top
&=\Sigma_{XY}B_*^\top
=\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}.
\end{align*}
Substituting these three identities gives
\begin{align*}
\operatorname{Cov}(R_*)
&= \Sigma_{XX}
-\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}
-\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}
+\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX} \\
&= \Sigma_{XX}-\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}.
\end{align*}
This is exactly
\begin{align*}
\operatorname{Cov}(X)-\operatorname{Cov}(X,Y)\operatorname{Cov}(Y)^{-1}\operatorname{Cov}(Y,X),
\end{align*}
which completes the proof.
[/step]