[proofplan]
The estimator error splits into a deterministic conditional bias term, caused by the projection away from the row span of $X$, and a random conditional noise term, caused by $\varepsilon$. We expand the quadratic risk using this decomposition. The conditional mean-zero assumption eliminates the cross terms, and the conditional covariance assumption evaluates the remaining quadratic form in $\varepsilon$ as a trace.
[/proofplan]
custom_env
admin
[step:Decompose the estimator error into projection bias and noise]Let $\Omega_X := \{\omega : X(\omega)X(\omega)^\top \text{ is invertible}\}$ denote the event on which the Gram matrix is invertible. By hypothesis, $\mathbb{P}(\Omega_X)=1$, and all identities below are asserted on $\Omega_X$ before taking conditional expectations with respect to $X$.
Define the $X$-measurable matrix $A_X \in \mathbb{R}^{d \times n}$ by
\begin{align*}
A_X := X^\top(XX^\top)^{-1}.
\end{align*}
Then $\hat{\beta}=A_XY$. Using $Y=X\beta+\varepsilon$ gives
\begin{align*}
\hat{\beta}=A_X(X\beta+\varepsilon).
\end{align*}
Expanding the matrix product and using the definitions of $A_X$ and $P_X$ gives
\begin{align*}
\hat{\beta}=X^\top(XX^\top)^{-1}X\beta + X^\top(XX^\top)^{-1}\varepsilon.
\end{align*}
Thus
\begin{align*}
\hat{\beta}=P_X\beta + A_X\varepsilon .
\end{align*}
Therefore
\begin{align*}
\hat{\beta}-\beta = -(I_d-P_X)\beta + A_X\varepsilon .
\end{align*}
Define the $X$-measurable vector
\begin{align*}
b_X := -(I_d-P_X)\beta \in \mathbb{R}^d .
\end{align*}
Then
\begin{align*}
\hat{\beta}-\beta = b_X + A_X\varepsilon .
\end{align*}[/step]
custom_env
admin
[guided]The estimator $\hat{\beta}$ is linear in $Y$, so we first rewrite it in terms of the signal part $X\beta$ and the noise part $\varepsilon$. Let $\Omega_X := \{\omega : X(\omega)X(\omega)^\top \text{ is invertible}\}$ denote the event on which the Gram matrix is invertible. By hypothesis, $\mathbb{P}(\Omega_X)=1$, and the following algebra is performed on $\Omega_X$, where $(XX^\top)^{-1}$ is well-defined. Define the $X$-measurable matrix
\begin{align*}
A_X := X^\top(XX^\top)^{-1} \in \mathbb{R}^{d \times n}.
\end{align*}
This notation is useful because $\hat{\beta}=A_XY$. Substituting the model equation $Y=X\beta+\varepsilon$ gives
\begin{align*}
\hat{\beta}=A_X(X\beta+\varepsilon).
\end{align*}
Expanding the two summands and using the definitions of $A_X$ and $P_X$ gives
\begin{align*}
\hat{\beta}=X^\top(XX^\top)^{-1}X\beta + X^\top(XX^\top)^{-1}\varepsilon.
\end{align*}
Therefore
\begin{align*}
\hat{\beta}=P_X\beta + A_X\varepsilon .
\end{align*}
After subtracting $\beta$, the error becomes
\begin{align*}
\hat{\beta}-\beta=P_X\beta-\beta+A_X\varepsilon.
\end{align*}
Since $P_X\beta-\beta=-(I_d-P_X)\beta$, this is
\begin{align*}
\hat{\beta}-\beta=-(I_d-P_X)\beta + A_X\varepsilon .
\end{align*}
The first term is fully determined once $X$ is fixed; it is the component of $\beta$ not recovered from the row span of $X$. Define this conditional bias vector by
\begin{align*}
b_X := -(I_d-P_X)\beta \in \mathbb{R}^d .
\end{align*}
Thus the fundamental decomposition is
\begin{align*}
\hat{\beta}-\beta = b_X + A_X\varepsilon .
\end{align*}[/guided]
custom_env
admin
[step:Expand the quadratic risk and identify the cross terms]By the definition of $R_\beta$ and the decomposition above,
\begin{align*}
R_\beta(\hat{\beta})=(\hat{\beta}-\beta)^\top\Sigma(\hat{\beta}-\beta).
\end{align*}
Substituting $\hat{\beta}-\beta=b_X+A_X\varepsilon$ gives
\begin{align*}
R_\beta(\hat{\beta})=(b_X+A_X\varepsilon)^\top\Sigma(b_X+A_X\varepsilon).
\end{align*}
Expanding the quadratic form gives
\begin{align*}
R_\beta(\hat{\beta})=b_X^\top\Sigma b_X + b_X^\top\Sigma A_X\varepsilon + \varepsilon^\top A_X^\top\Sigma b_X + \varepsilon^\top A_X^\top\Sigma A_X\varepsilon .
\end{align*}
By the added integrability hypothesis in the theorem statement, each of the four scalar random variables in the displayed expansion is integrable. Hence the conditional expectations below are ordinary $L^1$ conditional expectations. The assumption $\mathbb E[\varepsilon\varepsilon^\top\mid X]=\sigma^2I_n$ also implies $\mathbb E[\varepsilon_i^2\mid X]<\infty$ for each coordinate $\varepsilon_i$. Taking [conditional expectation](/page/Conditional%20Expectation) with respect to $X$ and using that $b_X$, $A_X$, and $\Sigma$ are $X$-measurable finite-dimensional coefficients,
\begin{align*}
\mathbb{E}[b_X^\top\Sigma A_X\varepsilon \mid X]=b_X^\top\Sigma A_X \mathbb{E}[\varepsilon \mid X]=0.
\end{align*}
Similarly,
\begin{align*}
\mathbb{E}[\varepsilon^\top A_X^\top\Sigma b_X \mid X]=\mathbb{E}[\varepsilon^\top \mid X]A_X^\top\Sigma b_X=0.
\end{align*}
Hence
\begin{align*}
\mathbb{E}[R_\beta(\hat{\beta}) \mid X]
=
b_X^\top\Sigma b_X
+
\mathbb{E}[\varepsilon^\top A_X^\top\Sigma A_X\varepsilon \mid X].
\end{align*}[/step]
custom_env
admin
[guided]After the estimator-error decomposition, the risk is the quadratic form
\begin{align*}
R_\beta(\hat{\beta})=(\hat{\beta}-\beta)^\top\Sigma(\hat{\beta}-\beta).
\end{align*}
Substituting $\hat{\beta}-\beta=b_X+A_X\varepsilon$ gives
\begin{align*}
R_\beta(\hat{\beta})=(b_X+A_X\varepsilon)^\top\Sigma(b_X+A_X\varepsilon).
\end{align*}
Expanding this finite-dimensional quadratic form gives
\begin{align*}
R_\beta(\hat{\beta})=b_X^\top\Sigma b_X + b_X^\top\Sigma A_X\varepsilon + \varepsilon^\top A_X^\top\Sigma b_X + \varepsilon^\top A_X^\top\Sigma A_X\varepsilon .
\end{align*}
The two middle summands are the cross terms. The added integrability hypothesis in the theorem statement says that each scalar [random variable](/page/Random%20Variable) in this four-term expansion is integrable, so every conditional expectation in this step is an ordinary $L^1$ conditional expectation. Once $X$ is fixed, the matrices $b_X$, $A_X$, and $\Sigma$ are deterministic finite-dimensional coefficients. The hypothesis $\mathbb E[\varepsilon\mid X]=0$ then gives
\begin{align*}
\mathbb{E}[b_X^\top\Sigma A_X\varepsilon \mid X]=b_X^\top\Sigma A_X\mathbb{E}[\varepsilon\mid X]=0
\end{align*}
and
\begin{align*}
\mathbb{E}[\varepsilon^\top A_X^\top\Sigma b_X \mid X]=\mathbb{E}[\varepsilon^\top\mid X]A_X^\top\Sigma b_X=0.
\end{align*}
Therefore
\begin{align*}
\mathbb{E}[R_\beta(\hat{\beta}) \mid X]
=
b_X^\top\Sigma b_X
+
\mathbb{E}[\varepsilon^\top A_X^\top\Sigma A_X\varepsilon \mid X].
\end{align*}[/guided]
custom_env
admin
[step:Evaluate the conditional quadratic form as a trace]Define the $X$-measurable matrix
\begin{align*}
M_X := A_X^\top\Sigma A_X \in \mathbb{R}^{n \times n}.
\end{align*}
Because $M_X$ is a finite-dimensional $X$-measurable matrix and each conditional second moment $\mathbb E[\varepsilon_i^2\mid X]$ is finite, the quadratic form $\varepsilon^\top M_X\varepsilon$ is conditionally integrable by the finite expansion below and the inequality $|\varepsilon_i\varepsilon_j|\leq (\varepsilon_i^2+\varepsilon_j^2)/2$. Writing $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top$ and $M_X=((M_X)_{ij})_{1 \leq i,j \leq n}$, we have
\begin{align*}
\varepsilon^\top M_X\varepsilon
=
\sum_{i=1}^n\sum_{j=1}^n (M_X)_{ij}\varepsilon_i\varepsilon_j .
\end{align*}
Since $M_X$ is $X$-measurable, conditional linearity gives
\begin{align*}
\mathbb{E}[\varepsilon^\top M_X\varepsilon \mid X]=\sum_{i=1}^n\sum_{j=1}^n (M_X)_{ij}\mathbb{E}[\varepsilon_i\varepsilon_j \mid X].
\end{align*}
Using $\mathbb E[\varepsilon\varepsilon^\top\mid X]=\sigma^2 I_n$, this becomes
\begin{align*}
\mathbb{E}[\varepsilon^\top M_X\varepsilon \mid X]=\sum_{i=1}^n\sum_{j=1}^n (M_X)_{ij}(\sigma^2 I_n)_{ij}.
\end{align*}
Since $(I_n)_{ij}=0$ for $i\neq j$ and $(I_n)_{ii}=1$, the double sum reduces to the diagonal sum
\begin{align*}
\mathbb{E}[\varepsilon^\top M_X\varepsilon \mid X]=\sigma^2\sum_{i=1}^n (M_X)_{ii}.
\end{align*}
By the definition of trace,
\begin{align*}
\mathbb{E}[\varepsilon^\top M_X\varepsilon \mid X]=\sigma^2\operatorname{tr}(M_X).
\end{align*}
This computation uses only finite sums and the conditional covariance matrix; no symmetry assumption on $\Sigma$ is needed.
Now
\begin{align*}
M_X=\left(X^\top(XX^\top)^{-1}\right)^\top \Sigma X^\top(XX^\top)^{-1}.
\end{align*}
Because $XX^\top$ is symmetric, its inverse $(XX^\top)^{-1}$ is symmetric. Hence
\begin{align*}
M_X=(XX^\top)^{-1}X\Sigma X^\top(XX^\top)^{-1}.
\end{align*} Therefore
\begin{align*}
\operatorname{tr}(M_X)=\operatorname{tr}\left((XX^\top)^{-1}X\Sigma X^\top(XX^\top)^{-1}\right).
\end{align*}
Define the finite matrices
\begin{align*}
C_X := (XX^\top)^{-1}X \in \mathbb{R}^{n \times d}
\end{align*}
and
\begin{align*}
D_X := \Sigma X^\top(XX^\top)^{-1} \in \mathbb{R}^{d \times n}.
\end{align*}
For finite matrices with compatible dimensions, the identity $\operatorname{tr}(C_XD_X)=\operatorname{tr}(D_XC_X)$ follows by expanding the diagonal sums:
By the definition of trace,
\begin{align*}
\operatorname{tr}(C_XD_X)=\sum_{i=1}^n\sum_{a=1}^d (C_X)_{ia}(D_X)_{ai}.
\end{align*}
Since both sums are finite, we may interchange their order and commute the scalar factors inside each summand. Thus
\begin{align*}
\operatorname{tr}(C_XD_X)=\sum_{a=1}^d\sum_{i=1}^n (D_X)_{ai}(C_X)_{ia}.
\end{align*}
Applying the definition of trace again gives
\begin{align*}
\operatorname{tr}(C_XD_X)=\operatorname{tr}(D_XC_X).
\end{align*}
Since $C_XD_X=(XX^\top)^{-1}X\Sigma X^\top(XX^\top)^{-1}$ and $D_XC_X=\Sigma X^\top(XX^\top)^{-2}X$, we obtain
\begin{align*}
\operatorname{tr}(M_X)=\operatorname{tr}\left(\Sigma X^\top(XX^\top)^{-2}X\right).
\end{align*}[/step]
custom_env
admin
[guided]Define the $X$-measurable matrix
\begin{align*}
M_X := A_X^\top\Sigma A_X \in \mathbb{R}^{n \times n}.
\end{align*}
We need to evaluate the conditional expectation of $\varepsilon^\top M_X\varepsilon$. Write $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top$ and $M_X=((M_X)_{ij})_{1 \leq i,j \leq n}$. Since this is a finite-dimensional quadratic form,
\begin{align*}
\varepsilon^\top M_X\varepsilon
=
\sum_{i=1}^n\sum_{j=1}^n (M_X)_{ij}\varepsilon_i\varepsilon_j .
\end{align*}
After conditioning on $X$, the entries $(M_X)_{ij}$ are fixed coefficients, so conditional linearity gives
\begin{align*}
\mathbb{E}[\varepsilon^\top M_X\varepsilon \mid X]=\sum_{i=1}^n\sum_{j=1}^n (M_X)_{ij}\mathbb{E}[\varepsilon_i\varepsilon_j \mid X].
\end{align*}
The covariance assumption says that the conditional second-moment matrix is $\sigma^2I_n$, hence
\begin{align*}
\mathbb{E}[\varepsilon_i\varepsilon_j \mid X]=(\sigma^2I_n)_{ij}.
\end{align*}
Substitution gives
\begin{align*}
\mathbb{E}[\varepsilon^\top M_X\varepsilon \mid X]=\sum_{i=1}^n\sum_{j=1}^n (M_X)_{ij}(\sigma^2I_n)_{ij}.
\end{align*}
Only the diagonal entries survive, because $(I_n)_{ij}=0$ for $i\neq j$ and $(I_n)_{ii}=1$. Therefore
\begin{align*}
\mathbb{E}[\varepsilon^\top M_X\varepsilon \mid X]=\sigma^2\sum_{i=1}^n (M_X)_{ii}=\sigma^2\operatorname{tr}(M_X).
\end{align*}
Finally, using $A_X=X^\top(XX^\top)^{-1}$ and the symmetry of $(XX^\top)^{-1}$,
\begin{align*}
M_X=(XX^\top)^{-1}X\Sigma X^\top(XX^\top)^{-1}.
\end{align*}
Set
\begin{align*}
C_X := (XX^\top)^{-1}X \in \mathbb{R}^{n \times d}
\end{align*}
and
\begin{align*}
D_X := \Sigma X^\top(XX^\top)^{-1} \in \mathbb{R}^{d \times n}.
\end{align*}
Then $C_XD_X=M_X$ and $D_XC_X=\Sigma X^\top(XX^\top)^{-2}X$. Expanding the diagonal sums proves $\operatorname{tr}(C_XD_X)=\operatorname{tr}(D_XC_X)$ for these finite matrices, so
\begin{align*}
\operatorname{tr}(M_X)=\operatorname{tr}\left(\Sigma X^\top(XX^\top)^{-2}X\right).
\end{align*}[/guided]
custom_env
admin
[step:Combine the bias and variance terms]Since $b_X=-(I_d-P_X)\beta$,
\begin{align*}
b_X^\top\Sigma b_X=\bigl(-(I_d-P_X)\beta\bigr)^\top\Sigma\bigl(-(I_d-P_X)\beta\bigr).
\end{align*}
The two minus signs cancel in the scalar quadratic form, so
\begin{align*}
b_X^\top\Sigma b_X=\bigl((I_d-P_X)\beta\bigr)^\top\Sigma\bigl((I_d-P_X)\beta\bigr).
\end{align*}
Combining this identity with the conditional quadratic-form computation gives
\begin{align*}
\mathbb{E}\left[R_\beta(\hat{\beta}) \mid X\right]
=
\bigl((I_d-P_X)\beta\bigr)^\top \Sigma \bigl((I_d-P_X)\beta\bigr)
+
\sigma^2 \operatorname{tr}\left(\Sigma X^\top(XX^\top)^{-2}X\right).
\end{align*}
The added integrability hypothesis ensures that this identity is an equality of ordinary conditional expectations. This is the desired conditional [bias-variance decomposition](/theorems/1845).[/step]
custom_env
admin
[guided]It remains to rewrite the deterministic term using the definition of the conditional bias vector. Since
\begin{align*}
b_X=-(I_d-P_X)\beta,
\end{align*}
we have
\begin{align*}
b_X^\top\Sigma b_X=\bigl(-(I_d-P_X)\beta\bigr)^\top\Sigma\bigl(-(I_d-P_X)\beta\bigr).
\end{align*}
The two minus signs cancel in the scalar quadratic form, giving
\begin{align*}
b_X^\top\Sigma b_X=\bigl((I_d-P_X)\beta\bigr)^\top\Sigma\bigl((I_d-P_X)\beta\bigr).
\end{align*}
The preceding trace computation gives
\begin{align*}
\mathbb{E}[\varepsilon^\top A_X^\top\Sigma A_X\varepsilon\mid X]
=
\sigma^2\operatorname{tr}\left(\Sigma X^\top(XX^\top)^{-2}X\right).
\end{align*}
Substituting these two identities into the conditional risk expansion yields
\begin{align*}
\mathbb{E}\left[R_\beta(\hat{\beta}) \mid X\right]
=
\bigl((I_d-P_X)\beta\bigr)^\top \Sigma \bigl((I_d-P_X)\beta\bigr)
+
\sigma^2 \operatorname{tr}\left(\Sigma X^\top(XX^\top)^{-2}X\right).
\end{align*}
The integrability assumption in the theorem statement ensures that this is an equality of ordinary conditional expectations, not only a formal conditional second-moment calculation for fixed $X$.[/guided]