[proofplan]
We regard the seemingly unrelated regression system as one stacked linear model with nonspherical covariance matrix $\Omega$. Whitening by a symmetric inverse square root of $\Omega$ reduces the model to an ordinary homoscedastic linear model, and a direct Gauss-Markov projection argument proves that GLS has minimal covariance among all linear unbiased estimators. Since equation-by-equation OLS is one admissible linear unbiased estimator, GLS has covariance no larger than OLS. Finally, an explicit two-equation system with correlated errors and different regressor spaces shows that the improvement can be strict.
[/proofplan]
[step:Whiten the stacked regression system]
Since $\Sigma$ is symmetric positive definite and $I_n$ is symmetric positive definite, the Kronecker product
\begin{align*}
\Omega := \Sigma \otimes I_n
\end{align*}
is symmetric positive definite. Hence there is a symmetric positive definite matrix $\Omega^{-1/2} \in \mathbb{R}^{mn \times mn}$ such that
\begin{align*}
\Omega^{-1/2}\Omega\Omega^{-1/2}=I_{mn}.
\end{align*}
Define the transformed response, design matrix, and error vector by
\begin{align*}
\tilde{y} &:= \Omega^{-1/2}y \in \mathbb{R}^{mn},\\
\tilde{X} &:= \Omega^{-1/2}X \in \mathbb{R}^{mn \times k},\\
\tilde{\varepsilon} &:= \Omega^{-1/2}\varepsilon \in \mathbb{R}^{mn}.
\end{align*}
Then
\begin{align*}
\tilde{y}=\tilde{X}\beta+\tilde{\varepsilon},
\end{align*}
and
\begin{align*}
\mathbb{E}[\tilde{\varepsilon}]
&=\Omega^{-1/2}\mathbb{E}[\varepsilon]
=0,\\
\operatorname{Cov}(\tilde{\varepsilon})
&=\Omega^{-1/2}\operatorname{Cov}(\varepsilon)\Omega^{-1/2}
=\Omega^{-1/2}\Omega\Omega^{-1/2}
=I_{mn}.
\end{align*}
Because $X$ has full column rank and $\Omega^{-1/2}$ is invertible, $\tilde{X}$ has full column rank. Therefore $\tilde{X}^\top\tilde{X}$ is invertible.
[/step]
[step:Identify GLS as ordinary least squares after whitening]
The ordinary least squares estimator in the transformed model is
\begin{align*}
\hat{\beta}_{\mathrm{OLS},\sim}
:=
(\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\tilde{y}.
\end{align*}
Using the definitions of $\tilde{X}$ and $\tilde{y}$, we compute
\begin{align*}
\tilde{X}^\top\tilde{X}
&=
X^\top\Omega^{-1/2}\Omega^{-1/2}X
=
X^\top\Omega^{-1}X,\\
\tilde{X}^\top\tilde{y}
&=
X^\top\Omega^{-1/2}\Omega^{-1/2}y
=
X^\top\Omega^{-1}y.
\end{align*}
Hence
\begin{align*}
\hat{\beta}_{\mathrm{OLS},\sim}
=
(X^\top\Omega^{-1}X)^{-1}X^\top\Omega^{-1}y
=
\hat{\beta}_{\mathrm{GLS}}.
\end{align*}
[/step]
[step:Prove the covariance minimality of GLS]
Let $A \in \mathbb{R}^{k \times mn}$ satisfy $AX=I_k$, so that $Ay$ is a linear unbiased estimator of $\beta$. Define
\begin{align*}
B := A\Omega^{1/2} \in \mathbb{R}^{k \times mn}.
\end{align*}
Since $\tilde{X}=\Omega^{-1/2}X$, the condition $AX=I_k$ is equivalent to
\begin{align*}
B\tilde{X}=A\Omega^{1/2}\Omega^{-1/2}X=AX=I_k.
\end{align*}
Let
\begin{align*}
B_0 := (\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top \in \mathbb{R}^{k \times mn}
\end{align*}
be the ordinary least squares matrix in the whitened model. Then $B_0\tilde{X}=I_k$. Define
\begin{align*}
C := B-B_0 \in \mathbb{R}^{k \times mn}.
\end{align*}
Then
\begin{align*}
C\tilde{X}=B\tilde{X}-B_0\tilde{X}=I_k-I_k=0.
\end{align*}
The covariance of $Ay$ is
\begin{align*}
\operatorname{Cov}(Ay)
&=
A\Omega A^\top\\
&=
A\Omega^{1/2}\Omega^{1/2}A^\top\\
&=
BB^\top.
\end{align*}
The covariance of $\hat{\beta}_{\mathrm{GLS}}=B_0\tilde{y}$ is
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
&=
B_0B_0^\top\\
&=
(\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\tilde{X}(\tilde{X}^\top\tilde{X})^{-1}\\
&=
(\tilde{X}^\top\tilde{X})^{-1}.
\end{align*}
Since $C\tilde{X}=0$, we also have
\begin{align*}
CB_0^\top
&=
C\tilde{X}(\tilde{X}^\top\tilde{X})^{-1}
=0.
\end{align*}
Therefore
\begin{align*}
\operatorname{Cov}(Ay)-\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
&=
BB^\top-B_0B_0^\top\\
&=
(B_0+C)(B_0+C)^\top-B_0B_0^\top\\
&=
B_0C^\top+CB_0^\top+CC^\top\\
&=
CC^\top.
\end{align*}
For every $v \in \mathbb{R}^k$,
\begin{align*}
v^\top CC^\top v = |C^\top v|^2 \geq 0,
\end{align*}
so $CC^\top$ is positive semidefinite. Thus $\hat{\beta}_{\mathrm{GLS}}$ is best linear unbiased among all linear unbiased estimators using the full covariance matrix $\Omega$.
[guided]
The key point is that whitening has reduced the problem to the ordinary homoscedastic Gauss-Markov geometry. We nevertheless write the argument directly, because it also gives the exact covariance difference.
Let $A \in \mathbb{R}^{k \times mn}$ be any matrix satisfying $AX=I_k$. This condition is precisely the linear unbiasedness condition: since $\mathbb{E}[y]=X\beta$, we have
\begin{align*}
\mathbb{E}[Ay]=AX\beta=\beta
\end{align*}
for every $\beta \in \mathbb{R}^k$.
To compare $Ay$ with the whitened least squares estimator, define
\begin{align*}
B := A\Omega^{1/2} \in \mathbb{R}^{k \times mn}.
\end{align*}
Because $\tilde{X}=\Omega^{-1/2}X$, the unbiasedness condition becomes
\begin{align*}
B\tilde{X}=A\Omega^{1/2}\Omega^{-1/2}X=AX=I_k.
\end{align*}
Thus $B$ is a linear unbiased estimator matrix for the transformed homoscedastic model.
Now define the least squares matrix in the transformed model by
\begin{align*}
B_0 := (\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top \in \mathbb{R}^{k \times mn}.
\end{align*}
Since $\tilde{X}$ has full column rank, $\tilde{X}^\top\tilde{X}$ is invertible, and
\begin{align*}
B_0\tilde{X}
=
(\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\tilde{X}
=
I_k.
\end{align*}
Set
\begin{align*}
C := B-B_0.
\end{align*}
Then $C$ measures the deviation of the arbitrary unbiased estimator from least squares. Since both $B$ and $B_0$ are unbiased in the transformed model,
\begin{align*}
C\tilde{X}=B\tilde{X}-B_0\tilde{X}=I_k-I_k=0.
\end{align*}
This orthogonality relation is the whole Gauss-Markov argument.
We compute the covariance of the arbitrary estimator:
\begin{align*}
\operatorname{Cov}(Ay)
&=
A\operatorname{Cov}(y)A^\top\\
&=
A\Omega A^\top\\
&=
A\Omega^{1/2}\Omega^{1/2}A^\top\\
&=
BB^\top.
\end{align*}
The GLS estimator is $B_0\tilde{y}$, so because $\operatorname{Cov}(\tilde{y})=\operatorname{Cov}(\tilde{\varepsilon})=I_{mn}$, its covariance is
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
&=
B_0 I_{mn} B_0^\top\\
&=
B_0B_0^\top\\
&=
(\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\tilde{X}(\tilde{X}^\top\tilde{X})^{-1}\\
&=
(\tilde{X}^\top\tilde{X})^{-1}.
\end{align*}
The cross terms vanish because
\begin{align*}
CB_0^\top
=
C\tilde{X}(\tilde{X}^\top\tilde{X})^{-1}
=
0.
\end{align*}
Taking transposes also gives $B_0C^\top=0$. Therefore
\begin{align*}
\operatorname{Cov}(Ay)-\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
&=
BB^\top-B_0B_0^\top\\
&=
(B_0+C)(B_0+C)^\top-B_0B_0^\top\\
&=
B_0C^\top+CB_0^\top+CC^\top\\
&=
CC^\top.
\end{align*}
Finally, $CC^\top$ is positive semidefinite because for every vector $v \in \mathbb{R}^k$,
\begin{align*}
v^\top CC^\top v=|C^\top v|^2 \geq 0.
\end{align*}
Thus no linear unbiased estimator can have smaller covariance than GLS in the positive semidefinite order.
[/guided]
[/step]
[step:Compare GLS with equation-by-equation ordinary least squares]
The equation-by-equation ordinary least squares estimator is
\begin{align*}
\hat{\beta}_{\mathrm{OLS}}
=
(X^\top X)^{-1}X^\top y.
\end{align*}
Because $X$ has full column rank,
\begin{align*}
(X^\top X)^{-1}X^\top X=I_k.
\end{align*}
Thus $\hat{\beta}_{\mathrm{OLS}}$ is linear unbiased. Applying the covariance minimality result with
\begin{align*}
A := (X^\top X)^{-1}X^\top
\end{align*}
gives
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{OLS}})
-
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
\end{align*}
positive semidefinite.
[/step]
[step:Exhibit strict gain when errors are correlated and regressor spaces differ]
It remains to justify the strict-efficiency assertion. It is enough to give one valid seemingly unrelated regression system with correlated equation errors and different regressor spaces for which GLS and equation-by-equation OLS have different covariance matrices.
Take $m=2$, $n=2$, $k_1=1$, and $k_2=1$. Define
\begin{align*}
X_1
&:=
\begin{pmatrix}
1\\
1
\end{pmatrix}
\in \mathbb{R}^{2 \times 1},\\
X_2
&:=
\begin{pmatrix}
1\\
0
\end{pmatrix}
\in \mathbb{R}^{2 \times 1}.
\end{align*}
Both matrices have full column rank, and their column spaces differ because $\operatorname{Range}(X_1)=\operatorname{span}\{(1,1)^\top\}$ while $\operatorname{Range}(X_2)=\operatorname{span}\{(1,0)^\top\}$. Let $\rho \in (-1,1)\setminus\{0\}$ and define
\begin{align*}
\Sigma
:=
\begin{pmatrix}
1 & \rho\\
\rho & 1
\end{pmatrix}.
\end{align*}
The eigenvalues of $\Sigma$ are $1+\rho$ and $1-\rho$, both positive because $|\rho|<1$. Hence $\Sigma$ is symmetric positive definite, and its nonzero off-diagonal entries mean that the equation errors are correlated.
For this system, define
\begin{align*}
X:=\operatorname{diag}(X_1,X_2) \in \mathbb{R}^{4 \times 2},
\qquad
\Omega:=\Sigma \otimes I_2 \in \mathbb{R}^{4 \times 4}.
\end{align*}
A direct block computation gives
\begin{align*}
X^\top X
&=
\begin{pmatrix}
2 & 0\\
0 & 1
\end{pmatrix},
\end{align*}
so the covariance of equation-by-equation OLS is
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{OLS}})
&=
(X^\top X)^{-1}X^\top\Omega X(X^\top X)^{-1}.
\end{align*}
The first coefficient belongs to the first equation. Its OLS variance is the $(1,1)$ entry of this matrix. Since the first equation error variance is $1$, this entry is
\begin{align*}
\operatorname{Var}((\hat{\beta}_{\mathrm{OLS}})_1)
=
(X_1^\top X_1)^{-1}
=
\frac{1}{2}.
\end{align*}
For GLS, using
\begin{align*}
\Sigma^{-1}
=
\frac{1}{1-\rho^2}
\begin{pmatrix}
1 & -\rho\\
-\rho & 1
\end{pmatrix},
\end{align*}
we obtain
\begin{align*}
X^\top\Omega^{-1}X
=
\frac{1}{1-\rho^2}
\begin{pmatrix}
X_1^\top X_1 & -\rho X_1^\top X_2\\
-\rho X_2^\top X_1 & X_2^\top X_2
\end{pmatrix}
=
\frac{1}{1-\rho^2}
\begin{pmatrix}
2 & -\rho\\
-\rho & 1
\end{pmatrix}.
\end{align*}
Therefore
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
=
(X^\top\Omega^{-1}X)^{-1}.
\end{align*}
Inverting the displayed $2 \times 2$ matrix gives
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
=
\frac{1-\rho^2}{2-\rho^2}
\begin{pmatrix}
1 & \rho\\
\rho & 2
\end{pmatrix}.
\end{align*}
Thus the GLS variance of the first coefficient is
\begin{align*}
\operatorname{Var}((\hat{\beta}_{\mathrm{GLS}})_1)
=
\frac{1-\rho^2}{2-\rho^2}.
\end{align*}
Since $\rho \neq 0$ and $|\rho|<1$,
\begin{align*}
\frac{1-\rho^2}{2-\rho^2}
<
\frac{1}{2},
\end{align*}
because
\begin{align*}
2(1-\rho^2)<2-\rho^2
\iff
-\rho^2<0.
\end{align*}
Thus the GLS variance of the first coefficient is strictly smaller than the corresponding equation-by-equation OLS variance. Consequently,
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{OLS}})
-
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
\end{align*}
is not the zero matrix. Since the previous step already proved it is positive semidefinite, the efficiency gain is strict for this correlated-error system with different regressor spaces.
[guided]
The general covariance comparison already proves that GLS is never worse than equation-by-equation OLS. To show that the comparison can be strict, we build a two-equation, two-observation example where the two equation design spaces are different and the cross-equation error covariance is nonzero.
Let
\begin{align*}
X_1
&:=
\begin{pmatrix}
1\\
1
\end{pmatrix}
\in \mathbb{R}^{2 \times 1},
&
X_2
&:=
\begin{pmatrix}
1\\
0
\end{pmatrix}
\in \mathbb{R}^{2 \times 1}.
\end{align*}
The first equation has the constant regressor, while the second equation uses only the first observation direction. The column spaces differ because
\begin{align*}
\operatorname{Range}(X_1)=\operatorname{span}\{(1,1)^\top\},
\qquad
\operatorname{Range}(X_2)=\operatorname{span}\{(1,0)^\top\}.
\end{align*}
Both $X_1$ and $X_2$ have full column rank because neither matrix is the zero matrix.
Choose a correlation parameter $\rho \in (-1,1)\setminus\{0\}$ and put
\begin{align*}
\Sigma
:=
\begin{pmatrix}
1 & \rho\\
\rho & 1
\end{pmatrix}.
\end{align*}
The eigenvalues of $\Sigma$ are $1+\rho$ and $1-\rho$, both positive because $|\rho|<1$. Thus $\Sigma$ is symmetric positive definite. Its off-diagonal entries are nonzero because $\rho \neq 0$, so the equation errors are correlated.
Define the stacked design matrix and full covariance matrix by
\begin{align*}
X:=\operatorname{diag}(X_1,X_2) \in \mathbb{R}^{4 \times 2},
\qquad
\Omega:=\Sigma \otimes I_2 \in \mathbb{R}^{4 \times 4}.
\end{align*}
For equation-by-equation OLS, the covariance formula is
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{OLS}})
&=
(X^\top X)^{-1}X^\top\Omega X(X^\top X)^{-1}.
\end{align*}
Since the first coefficient belongs to the first equation, its OLS variance is the $(1,1)$ entry of this matrix. The first equation error variance is $1$, and
\begin{align*}
X_1^\top X_1=2.
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}((\hat{\beta}_{\mathrm{OLS}})_1)
=
(X_1^\top X_1)^{-1}
=
\frac{1}{2}.
\end{align*}
This calculation avoids component notation for the error vector and uses only the covariance matrix already defined.
Now compute the GLS variance. Since
\begin{align*}
\Sigma^{-1}
=
\frac{1}{1-\rho^2}
\begin{pmatrix}
1 & -\rho\\
-\rho & 1
\end{pmatrix},
\end{align*}
we have
\begin{align*}
\Omega^{-1}=\Sigma^{-1}\otimes I_2.
\end{align*}
Using the block form of $X=\operatorname{diag}(X_1,X_2)$, the information matrix is
\begin{align*}
X^\top\Omega^{-1}X
=
\frac{1}{1-\rho^2}
\begin{pmatrix}
X_1^\top X_1 & -\rho X_1^\top X_2\\
-\rho X_2^\top X_1 & X_2^\top X_2
\end{pmatrix}.
\end{align*}
The required products are
\begin{align*}
X_1^\top X_1 &= 2,\\
X_1^\top X_2 &= 1,\\
X_2^\top X_1 &= 1,\\
X_2^\top X_2 &= 1.
\end{align*}
Therefore
\begin{align*}
X^\top\Omega^{-1}X
=
\frac{1}{1-\rho^2}
\begin{pmatrix}
2 & -\rho\\
-\rho & 1
\end{pmatrix}.
\end{align*}
Since the GLS covariance matrix is the inverse of this information matrix,
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
=
(X^\top\Omega^{-1}X)^{-1}.
\end{align*}
The inverse of the displayed $2 \times 2$ matrix is computed using the determinant $2-\rho^2$:
\begin{align*}
\operatorname{Cov}(\hat{\beta}_{\mathrm{GLS}})
=
\frac{1-\rho^2}{2-\rho^2}
\begin{pmatrix}
1 & \rho\\
\rho & 2
\end{pmatrix}.
\end{align*}
Thus the first diagonal entry is
\begin{align*}
\operatorname{Var}((\hat{\beta}_{\mathrm{GLS}})_1)
=
\frac{1-\rho^2}{2-\rho^2}.
\end{align*}
This is strictly smaller than the OLS variance:
\begin{align*}
\frac{1-\rho^2}{2-\rho^2}
<
\frac{1}{2}
\end{align*}
because
\begin{align*}
2(1-\rho^2)<2-\rho^2
\iff
-\rho^2<0,
\end{align*}
and $\rho \neq 0$.
Thus, in this concrete SUR system, the off-diagonal covariance supplies useful information precisely because the two equations do not have the same regressor space. GLS uses that information through $\Omega^{-1}$, while equation-by-equation OLS ignores it. The covariance improvement is therefore strict.
[/guided]
[/step]