[proofplan]
We first isolate the random perturbation $E := \widehat{\Sigma}-\Sigma$ and apply the sub-Gaussian sample covariance concentration theorem to bound $\|E\|_{\mathrm{op}}$ with probability at least $1-e^{-t}$. The stated smallness assumption guarantees that this perturbation is at most one quarter of the population eigengap $\Delta_r$, so by Weyl's inequality the empirical top-$r$ spectral cluster remains separated from the rest of the spectrum by at least $\Delta_r/2$. Davis--Kahan's projector perturbation estimate then converts the operator-norm covariance error into a Frobenius-norm bound for the difference of the spectral projectors. Finally, we substitute the covariance concentration bound and absorb universal numerical constants into $C_K$.
[/proofplan]
[step:Introduce the covariance perturbation and its high probability operator norm bound]
Define the sample covariance matrix by
\begin{align*}
\widehat{\Sigma}:=\frac{1}{n}\sum_{i=1}^{n}X_iX_i^\top \in \mathbb{R}^{d \times d}.
\end{align*}
Define the random symmetric perturbation matrix by
\begin{align*}
E := \widehat{\Sigma}-\Sigma \in \mathbb{R}^{d \times d}.
\end{align*}
The effective rank $r_{\mathrm{eff}}(\Sigma)$ is the quantity defined in the theorem statement, namely $r_{\mathrm{eff}}(\Sigma)=\operatorname{tr}(\Sigma)/\|\Sigma\|_{\mathrm{op}}$ when $\Sigma \neq 0$, and $r_{\mathrm{eff}}(0)=0$. By the covariance-adapted sub-Gaussian sample covariance concentration theorem, there exists a constant $A_K>0$, depending only on $K$, such that for every $t \ge 1$, with probability at least $1-e^{-t}$,
\begin{align*}
\|E\|_{\mathrm{op}}
\le
A_K\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
The hypotheses of this concentration theorem are satisfied because the theorem statement assumes that $X_1,\dots,X_n$ are independent identically distributed mean-zero random vectors in $\mathbb{R}^d$, assumes the covariance-adapted $K$-sub-Gaussian condition
\begin{align*}
\|\langle X_1,u\rangle\|_{\psi_2}
\le K\bigl(\mathbb{E}\langle X_1,u\rangle^2\bigr)^{1/2}
\end{align*}
for every vector $u \in \mathbb{R}^d$, defines $\Sigma=\mathbb{E}[X_1X_1^\top]$ as their covariance matrix, and defines $\widehat{\Sigma}=n^{-1}\sum_{i=1}^n X_iX_i^\top$ as the associated sample covariance matrix.
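As a sanity check on the sub-Gaussian hypothesis, consider the Gaussian case $X_1 \sim N(0,\Sigma)$; this is an illustration and not part of the theorem statement. We assume here the common normalization $\|Z\|_{\psi_2}=\inf\{s>0:\mathbb{E}\exp(Z^2/s^2)\le 2\}$; other conventions change only the numerical value of $K$. For a vector $u$ with $\sigma_u^2:=u^\top\Sigma u>0$, the variable $\langle X_1,u\rangle$ has distribution $N(0,\sigma_u^2)$, and for $s^2>2\sigma_u^2$,
\begin{align*}
\mathbb{E}\exp\bigl(\langle X_1,u\rangle^2/s^2\bigr)
=
\Bigl(1-\frac{2\sigma_u^2}{s^2}\Bigr)^{-1/2}.
\end{align*}
The right-hand side is decreasing in $s$ and equals $2$ exactly when $s^2=8\sigma_u^2/3$, so
\begin{align*}
\|\langle X_1,u\rangle\|_{\psi_2}
=
\sqrt{8/3}\,\bigl(\mathbb{E}\langle X_1,u\rangle^2\bigr)^{1/2}.
\end{align*}
Under this normalization the hypothesis would therefore hold with $K=\sqrt{8/3}$; the case $\sigma_u=0$ is trivial because then $\langle X_1,u\rangle=0$ almost surely.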
[guided]
We want to compare the spectral projectors of $\widehat{\Sigma}$ and $\Sigma$, so we first name the empirical covariance matrix and then name the additive perturbation between the two covariance matrices. Define
\begin{align*}
\widehat{\Sigma}:=\frac{1}{n}\sum_{i=1}^{n}X_iX_i^\top \in \mathbb{R}^{d \times d}.
\end{align*}
Define
\begin{align*}
E := \widehat{\Sigma}-\Sigma \in \mathbb{R}^{d \times d}.
\end{align*}
Both $\widehat{\Sigma}$ and $\Sigma$ are symmetric positive semidefinite matrices, hence $E$ is symmetric. The effective rank used below is the quantity from the theorem statement:
\begin{align*}
r_{\mathrm{eff}}(\Sigma)=\frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}}
\end{align*}
when $\Sigma \neq 0$, with $r_{\mathrm{eff}}(0)=0$.
The input estimate needed for the perturbation argument is an operator-norm bound for $E$. We apply the covariance-adapted sub-Gaussian sample covariance concentration theorem. Its hypotheses require independent identically distributed mean-zero random vectors, the covariance-adapted sub-Gaussian moment comparison, their covariance matrix, and the corresponding empirical covariance matrix. These are exactly the objects in the theorem statement: the vectors are $X_1,\dots,X_n$, and the covariance-adapted $K$-sub-Gaussian hypothesis in the statement gives, for every vector $u \in \mathbb{R}^d$,
\begin{align*}
\|\langle X_1,u\rangle\|_{\psi_2}
\le K\bigl(\mathbb{E}\langle X_1,u\rangle^2\bigr)^{1/2}.
\end{align*}
The covariance matrix is
\begin{align*}
\Sigma=\mathbb{E}[X_1X_1^\top],
\end{align*}
and the empirical covariance matrix is
\begin{align*}
\widehat{\Sigma}=\frac{1}{n}\sum_{i=1}^{n}X_iX_i^\top.
\end{align*}
Therefore there exists a constant $A_K>0$, depending only on $K$, such that for every $t \ge 1$, with probability at least $1-e^{-t}$,
\begin{align*}
\|E\|_{\mathrm{op}}
\le
A_K\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
This is the only probabilistic input in the proof. Everything after this point is deterministic matrix perturbation theory on the event where this bound holds.
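To see which of the two terms dominates, a short worked simplification may help; the abbreviation $x$ and the regime $x \le 1$ are introduced here for illustration only. Write $x:=(r_{\mathrm{eff}}(\Sigma)+t)/n$. If $x\le 1$, then $x\le\sqrt{x}$, so on the same event
\begin{align*}
\|E\|_{\mathrm{op}}
\le
A_K\|\Sigma\|_{\mathrm{op}}\bigl(\sqrt{x}+x\bigr)
\le
2A_K\|\Sigma\|_{\mathrm{op}}\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}.
\end{align*}
For instance, if $\Sigma=I_d$, then $\operatorname{tr}(\Sigma)=d$ and $\|\Sigma\|_{\mathrm{op}}=1$, so $r_{\mathrm{eff}}(\Sigma)=d$ and the bound reads $\|E\|_{\mathrm{op}}\le 2A_K\sqrt{(d+t)/n}$ whenever $d+t\le n$.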
[/guided]
[/step]
[step:Use the smallness condition to preserve the top spectral cluster]
Let $\lambda_1(\Sigma) \ge \cdots \ge \lambda_d(\Sigma)$ denote the eigenvalues of $\Sigma$ listed in non-increasing order with multiplicity, and let $\lambda_1(\widehat\Sigma) \ge \cdots \ge \lambda_d(\widehat\Sigma)$ denote the eigenvalues of $\widehat\Sigma$ listed in non-increasing order with multiplicity. The population eigengap in the theorem statement is
\begin{align*}
\Delta_r := \lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)>0.
\end{align*}
Let $\Omega_t$ denote the event on which the preceding covariance concentration bound holds. Choose the final constant $C_K$ so that $C_K \ge 2A_K$. On $\Omega_t$, the covariance concentration bound gives
\begin{align*}
\|E\|_{\mathrm{op}}
\le
A_K\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
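Since $A_K \le C_K/2$, and since the smallness condition assumed in the theorem statement reads $C_K\|\Sigma\|_{\mathrm{op}}\bigl(\sqrt{(r_{\mathrm{eff}}(\Sigma)+t)/n}+(r_{\mathrm{eff}}(\Sigma)+t)/n\bigr)\le \Delta_r/2$, we obtain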
\begin{align*}
\|E\|_{\mathrm{op}}
\le
\frac{C_K}{2}\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right)
\le \frac{\Delta_r}{4}.
\end{align*}
Thus the perturbation size is at most one quarter of the population separation between $\lambda_r(\Sigma)$ and $\lambda_{r+1}(\Sigma)$. By Weyl's eigenvalue perturbation inequality $|\lambda_j(\widehat\Sigma)-\lambda_j(\Sigma)|\le\|E\|_{\mathrm{op}}$, applied to the real symmetric pair $(\Sigma,\widehat\Sigma)$ with $j=r$ and $j=r+1$, the empirical gap satisfies
\begin{align*}
\lambda_r(\widehat\Sigma)-\lambda_{r+1}(\widehat\Sigma)
\ge
\lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)-2\|E\|_{\mathrm{op}}
\ge \frac{\Delta_r}{2}.
\end{align*}
In particular, the empirical top-$r$ spectral cluster is separated from the rest of the spectrum, so the top-$r$ spectral projector $\widehat P_r$ of $\widehat{\Sigma}=\Sigma+E$ is unambiguously the perturbation of the isolated top-$r$ spectral projector $P_r$ of $\Sigma$.
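The factor $2$ lost in the Weyl step cannot be improved in general, as the following deterministic toy example suggests; the matrices are chosen purely for illustration and are not tied to any particular sample. Take $d=2$, $r=1$, and
\begin{align*}
\Sigma=\begin{pmatrix}2&0\\0&1\end{pmatrix},
\qquad
E=\begin{pmatrix}-\tfrac14&0\\0&\tfrac14\end{pmatrix},
\qquad
\Sigma+E=\begin{pmatrix}\tfrac74&0\\0&\tfrac54\end{pmatrix}.
\end{align*}
Then $\Delta_1=1$ and $\|E\|_{\mathrm{op}}=\tfrac14=\Delta_1/4$, while the perturbed gap is
\begin{align*}
\lambda_1(\Sigma+E)-\lambda_2(\Sigma+E)
=\tfrac74-\tfrac54
=\tfrac12
=\Delta_1-2\|E\|_{\mathrm{op}},
\end{align*}
so both inequalities in the empirical gap bound above hold with equality in this example.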
[/step]
[step:Apply Davis Kahan to convert covariance error into projector error]
On the event $\Omega_t$, the matrices $\Sigma$ and $\widehat{\Sigma}=\Sigma+E$ are real symmetric, and the projector $P_r$ corresponds to the isolated population spectral cluster $\{\lambda_1(\Sigma),\dots,\lambda_r(\Sigma)\}$ separated from $\{\lambda_{r+1}(\Sigma),\dots,\lambda_d(\Sigma)\}$ by the gap $\Delta_r=\lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)$. The preceding step gives $\|E\|_{\mathrm{op}}\le \Delta_r/4$, so the perturbation is strictly smaller than the separating population gap. Hence the Davis--Kahan sin theta theorem applies to the real symmetric pair $(\Sigma,\widehat{\Sigma})$, with unperturbed cluster $\{\lambda_1(\Sigma),\dots,\lambda_r(\Sigma)\}$, perturbed top-$r$ cluster $\{\lambda_1(\widehat\Sigma),\dots,\lambda_r(\widehat\Sigma)\}$, and separating denominator $\Delta_r$. We use the Frobenius projector form of Davis--Kahan: if $A$ and $A+H$ are real symmetric matrices, $P$ and $\widehat P$ are the spectral projectors onto two corresponding $r$-dimensional spectral clusters, and the unperturbed cluster is separated from the complementary spectrum by $\delta>0$ with $\|H\|_{\mathrm{op}}<\delta$, then there is a numerical constant $B>0$ such that
\begin{align*}
\|\widehat P-P\|_F
\le
\frac{B\sqrt r}{\delta}\|H\|_{\mathrm{op}}.
\end{align*}
Here the factor $\sqrt r$ is the conversion from an operator-norm perturbation estimate for the sine-angle operator to the Frobenius norm on an operator of rank at most $r$. Applying this statement with $A=\Sigma$, $H=E$, $P=P_r$, $\widehat P=\widehat P_r$, and $\delta=\Delta_r$ gives
\begin{align*}
\|\widehat P_r-P_r\|_F
\le
\frac{B\sqrt r}{\Delta_r}\|E\|_{\mathrm{op}}.
\end{align*}
Substituting the operator-norm bound for $E$ on $\Omega_t$ gives
\begin{align*}
\|\widehat P_r-P_r\|_F
&\le
\frac{B\sqrt r}{\Delta_r}
A_K\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
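For orientation, here is a sketch of where the factor $\sqrt r$ and one admissible value of $B$ can come from. It assumes the classical operator-norm form of the Davis--Kahan $\sin\Theta$ theorem, in which the denominator is the separation between the population cluster and the complementary empirical eigenvalues; the diagonal matrix $\Theta$ of principal angles between the ranges of $P_r$ and $\widehat P_r$ is notation introduced here. Since both projectors have rank $r$,
\begin{align*}
\|\widehat P_r-P_r\|_F^2
=2r-2\operatorname{tr}(\widehat P_rP_r)
=2\|\sin\Theta\|_F^2
\le 2r\,\|\sin\Theta\|_{\mathrm{op}}^2.
\end{align*}
By Weyl's inequality, $\lambda_{r+1}(\widehat\Sigma)\le\lambda_{r+1}(\Sigma)+\|E\|_{\mathrm{op}}$, so $\lambda_r(\Sigma)-\lambda_{r+1}(\widehat\Sigma)\ge\Delta_r-\|E\|_{\mathrm{op}}\ge 3\Delta_r/4$ whenever $\|E\|_{\mathrm{op}}\le\Delta_r/4$. Under the assumed form of the theorem this gives
\begin{align*}
\|\sin\Theta\|_{\mathrm{op}}
&\le
\frac{\|E\|_{\mathrm{op}}}{\Delta_r-\|E\|_{\mathrm{op}}}
\le
\frac{4}{3}\,\frac{\|E\|_{\mathrm{op}}}{\Delta_r},
\\
\|\widehat P_r-P_r\|_F
&\le
\sqrt{2r}\,\|\sin\Theta\|_{\mathrm{op}}
\le
\frac{4\sqrt2}{3}\,\frac{\sqrt r}{\Delta_r}\,\|E\|_{\mathrm{op}}.
\end{align*}
So $B=4\sqrt2/3$ would be admissible under these assumptions. The value is not optimized, and other formulations of Davis--Kahan lead to different numerical constants.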
[guided]
Now we use deterministic perturbation theory. The reason projectors are the right object is that an eigenspace has no preferred orthonormal basis: individual eigenvectors may change sign or rotate inside a repeated eigenspace, but the orthogonal projector onto the eigenspace is uniquely defined.
We apply the Davis--Kahan sin theta theorem. The theorem requires two real symmetric matrices, an isolated spectral cluster for the unperturbed matrix, and a perturbation small compared with the separating spectral gap. Here the unperturbed matrix is $\Sigma$, the perturbed matrix is
\begin{align*}
\widehat{\Sigma}=\Sigma+E,
\end{align*}
and both matrices are real symmetric. The relevant isolated cluster for $\Sigma$ is the set of its top $r$ eigenvalues, separated from the remaining eigenvalues by
\begin{align*}
\Delta_r=\lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)>0.
\end{align*}
We now reproduce the deterministic separation argument inside the guided proof. On $\Omega_t$, the concentration bound gives
\begin{align*}
\|E\|_{\mathrm{op}}
\le
A_K\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
The final constant is chosen with $C_K \ge 2A_K$, so $A_K \le C_K/2$. Combining this with the concentration bound gives
\begin{align*}
\|E\|_{\mathrm{op}}
\le
\frac{C_K}{2}\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
The smallness assumption in the theorem statement gives
\begin{align*}
C_K\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right)
\le \frac{\Delta_r}{2}.
\end{align*}
Dividing this last inequality by $2$ and combining it with the preceding bound on $\|E\|_{\mathrm{op}}$ gives
\begin{align*}
\|E\|_{\mathrm{op}}\le \frac{\Delta_r}{4}.
\end{align*}
This is the quantitative reason the empirical spectral subspace cannot mix with the rest of the spectrum.
We next apply Weyl's eigenvalue perturbation inequality to the real symmetric matrices $\Sigma$ and $\widehat\Sigma=\Sigma+E$. For each index $j \in \{1,\dots,d\}$, Weyl's inequality gives
\begin{align*}
|\lambda_j(\widehat\Sigma)-\lambda_j(\Sigma)|\le \|E\|_{\mathrm{op}}.
\end{align*}
Using this first with $j=r$ and then with $j=r+1$, we obtain
\begin{align*}
\lambda_r(\widehat\Sigma)-\lambda_{r+1}(\widehat\Sigma)
\ge
\bigl(\lambda_r(\Sigma)-\|E\|_{\mathrm{op}}\bigr)
-
\bigl(\lambda_{r+1}(\Sigma)+\|E\|_{\mathrm{op}}\bigr).
\end{align*}
The right-hand side equals $\Delta_r-2\|E\|_{\mathrm{op}}$, and the bound $\|E\|_{\mathrm{op}}\le \Delta_r/4$ gives
\begin{align*}
\lambda_r(\widehat\Sigma)-\lambda_{r+1}(\widehat\Sigma)
\ge \frac{\Delta_r}{2}.
\end{align*}
Thus the empirical top-$r$ cluster remains separated from the remaining empirical eigenvalues, and the empirical projector $\widehat P_r$ is the spectral projector corresponding to that separated cluster.
Davis--Kahan therefore gives a numerical constant $B>0$ such that
\begin{align*}
\|\widehat P_r-P_r\|_F
\le
\frac{B\sqrt r}{\Delta_r}\|E\|_{\mathrm{op}}.
\end{align*}
The factor $\sqrt r$ appears because the estimate is stated in Frobenius norm for rank-$r$ spectral projectors, while the perturbation is controlled in operator norm. Substituting the covariance concentration estimate for $\|E\|_{\mathrm{op}}$ on $\Omega_t$ gives
\begin{align*}
\|\widehat P_r-P_r\|_F
&\le
\frac{B\sqrt r}{\Delta_r}
A_K\|\Sigma\|_{\mathrm{op}}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
This is already the desired inequality, up to replacing the product of constants $BA_K$ by a single constant depending only on $K$.
[/guided]
[/step]
[step:Absorb constants and conclude the high probability bound]
We now specify the constant used in the theorem statement. Since the theorem asserts only the existence of a suitable constant, we fix from the start a single constant $C_K>0$, depending only on $K$ and used both in the smallness assumption and in the conclusion, large enough that
\begin{align*}
C_K \ge 2A_K
\end{align*}
and
\begin{align*}
C_K \ge BA_K.
\end{align*}
Then on $\Omega_t$,
\begin{align*}
\|\widehat P_r-P_r\|_F
\le
\frac{C_K\sqrt r\,\|\Sigma\|_{\mathrm{op}}}{\Delta_r}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
Since $\mathbb{P}(\Omega_t)\ge 1-e^{-t}$ by the covariance concentration theorem, the displayed bound holds with probability at least $1-e^{-t}$. This proves the claimed principal subspace error estimate.
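As a rough reading of the final bound, one can solve it for the sample size; the target accuracy $\varepsilon>0$ and the regime $n\ge r_{\mathrm{eff}}(\Sigma)+t$ are assumptions introduced here for illustration. In that regime $(r_{\mathrm{eff}}(\Sigma)+t)/n\le\sqrt{(r_{\mathrm{eff}}(\Sigma)+t)/n}$, so on $\Omega_t$,
\begin{align*}
\|\widehat P_r-P_r\|_F
\le
\frac{2C_K\sqrt r\,\|\Sigma\|_{\mathrm{op}}}{\Delta_r}
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}.
\end{align*}
The right-hand side is at most $\varepsilon$ as soon as
\begin{align*}
n
\ge
\frac{4C_K^2\,r\,\|\Sigma\|_{\mathrm{op}}^2\,\bigl(r_{\mathrm{eff}}(\Sigma)+t\bigr)}{\Delta_r^2\,\varepsilon^2},
\end{align*}
provided the smallness condition of the theorem also holds for this $n$.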
[guided]
The previous step gave the desired estimate except that the multiplicative constant was written as $BA_K$, while the theorem statement uses a single constant $C_K$ depending only on the sub-Gaussian parameter $K$. Because the theorem asserts only the existence of such a constant, we may fix one sufficiently large value at the outset and use it both in the smallness assumption and in the conclusion. Let $C_K>0$ be any constant, depending only on $K$, such that
\begin{align*}
C_K \ge 2A_K
\end{align*}
and
\begin{align*}
C_K \ge BA_K.
\end{align*}
The first inequality is the one used in the smallness argument to force $\|E\|_{\mathrm{op}}\le \Delta_r/4$. The second inequality is the one used to absorb the product of the Davis--Kahan numerical constant and the covariance concentration constant.
On the event $\Omega_t$, the Davis--Kahan step and the covariance concentration estimate give
\begin{align*}
\|\widehat P_r-P_r\|_F
&\le
\frac{BA_K\sqrt r\,\|\Sigma\|_{\mathrm{op}}}{\Delta_r}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
Since $C_K \ge BA_K$, this implies
\begin{align*}
\|\widehat P_r-P_r\|_F
\le
\frac{C_K\sqrt r\,\|\Sigma\|_{\mathrm{op}}}{\Delta_r}
\left(
\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}
+
\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}
\right).
\end{align*}
Finally, the event $\Omega_t$ was defined to be the event on which the covariance concentration theorem holds, and that theorem gives
\begin{align*}
\mathbb{P}(\Omega_t)\ge 1-e^{-t}.
\end{align*}
Therefore the displayed principal subspace error bound holds with probability at least $1-e^{-t}$. This is exactly the asserted conclusion.
[/guided]
[/step]