Principal Subspace Error Bound for Sub-Gaussian Sample Covariance Matrices

Theorem

Proof

[proofplan] We first isolate the random perturbation $E := \widehat{\Sigma}-\Sigma$ and apply the sub-Gaussian sample covariance concentration theorem to bound $\|E\|_{\mathrm{op}}$ with probability at least $1-e^{-t}$. The stated smallness assumption guarantees that this perturbation is at most one quarter of the population eigengap $\Delta_r$, so the empirical top-$r$ spectral cluster remains separated from the rest of the spectrum by at least $\Delta_r/2$. Davis--Kahan's projector perturbation estimate then converts the operator-norm covariance error into a Frobenius-norm bound for the difference of the spectral projectors. Finally, we substitute the covariance concentration bound and absorb universal numerical constants into $C_K$. [/proofplan] [step:Introduce the covariance perturbation and its high probability operator norm bound] Define the sample covariance matrix by \begin{align*} \widehat{\Sigma}:=\frac{1}{n}\sum_{i=1}^{n}X_iX_i^\top \in \mathbb{R}^{d \times d}. \end{align*} Define the random symmetric perturbation matrix by \begin{align*} E := \widehat{\Sigma}-\Sigma \in \mathbb{R}^{d \times d}. \end{align*} The effective rank $r_{\mathrm{eff}}(\Sigma)$ is the quantity defined in the theorem statement, namely $r_{\mathrm{eff}}(\Sigma)=\operatorname{tr}(\Sigma)/\|\Sigma\|_{\mathrm{op}}$ when $\Sigma \neq 0$, and $r_{\mathrm{eff}}(0)=0$. By the covariance-adapted sub-Gaussian sample covariance concentration theorem, there exists a constant $A_K>0$, depending only on $K$, such that for every $t \ge 1$, with probability at least $1-e^{-t}$, \begin{align*} \|E\|_{\mathrm{op}} \le A_K\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). 
\end{align*} The hypotheses of this concentration theorem are satisfied because the theorem statement assumes that $X_1,\dots,X_n$ are independent identically distributed mean-zero random vectors in $\mathbb{R}^d$, assumes the covariance-adapted $K$-sub-Gaussian condition \begin{align*} \|\langle X_1,u\rangle\|_{\psi_2} \le K\bigl(\mathbb{E}\langle X_1,u\rangle^2\bigr)^{1/2} \end{align*} for every vector $u \in \mathbb{R}^d$, defines $\Sigma=\mathbb{E}[X_1X_1^\top]$ as their covariance matrix, and defines $\widehat{\Sigma}=n^{-1}\sum_{i=1}^n X_iX_i^\top$ as the associated sample covariance matrix. [guided] We want to compare the spectral projectors of $\widehat{\Sigma}$ and $\Sigma$, so we first name the empirical covariance matrix and then name the additive perturbation between the two covariance matrices. Define \begin{align*} \widehat{\Sigma}:=\frac{1}{n}\sum_{i=1}^{n}X_iX_i^\top \in \mathbb{R}^{d \times d}. \end{align*} Define \begin{align*} E := \widehat{\Sigma}-\Sigma \in \mathbb{R}^{d \times d}. \end{align*} Both $\widehat{\Sigma}$ and $\Sigma$ are symmetric positive semidefinite matrices, hence $E$ is symmetric. The effective rank used below is the quantity from the theorem statement: \begin{align*} r_{\mathrm{eff}}(\Sigma)=\frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}} \end{align*} when $\Sigma \neq 0$, with $r_{\mathrm{eff}}(0)=0$. The input estimate needed for the perturbation argument is an operator-norm bound for $E$. We apply the covariance-adapted sub-Gaussian sample covariance concentration theorem. Its hypotheses require independent identically distributed mean-zero random vectors, the covariance-adapted sub-Gaussian moment comparison, their covariance matrix, and the corresponding empirical covariance matrix. 
These are exactly the objects in the theorem statement: the vectors are $X_1,\dots,X_n$, and the covariance-adapted $K$-sub-Gaussian hypothesis in the statement gives, for every vector $u \in \mathbb{R}^d$, \begin{align*} \|\langle X_1,u\rangle\|_{\psi_2} \le K\bigl(\mathbb{E}\langle X_1,u\rangle^2\bigr)^{1/2}. \end{align*} The covariance matrix is \begin{align*} \Sigma=\mathbb{E}[X_1X_1^\top], \end{align*} and the empirical covariance matrix is \begin{align*} \widehat{\Sigma}=\frac{1}{n}\sum_{i=1}^{n}X_iX_i^\top. \end{align*} Therefore there exists a constant $A_K>0$, depending only on $K$, such that for every $t \ge 1$, with probability at least $1-e^{-t}$, \begin{align*} \|E\|_{\mathrm{op}} \le A_K\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} This is the only probabilistic input in the proof. Everything after this point is deterministic matrix perturbation theory on the event where this bound holds. [/guided] [/step] [step:Use the smallness condition to preserve the top spectral cluster] Let $\lambda_1(\Sigma) \ge \cdots \ge \lambda_d(\Sigma)$ denote the eigenvalues of $\Sigma$ listed in non-increasing order with multiplicity, and let $\lambda_1(\widehat\Sigma) \ge \cdots \ge \lambda_d(\widehat\Sigma)$ denote the eigenvalues of $\widehat\Sigma$ listed in non-increasing order with multiplicity. The population eigengap in the theorem statement is \begin{align*} \Delta_r := \lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)>0. \end{align*} Let $\Omega_t$ denote the event on which the preceding covariance concentration bound holds. Choose the final constant $C_K$ so that $C_K \ge 2A_K$. On $\Omega_t$, the covariance concentration bound gives \begin{align*} \|E\|_{\mathrm{op}} \le A_K\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). 
\end{align*} Since $A_K \le C_K/2$, the assumed smallness condition implies \begin{align*} \|E\|_{\mathrm{op}} \le \frac{C_K}{2}\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right) \le \frac{\Delta_r}{4}. \end{align*} Thus the perturbation size is at most one quarter of the population separation between $\lambda_r(\Sigma)$ and $\lambda_{r+1}(\Sigma)$. By the Weyl eigenvalue perturbation inequality applied to the real symmetric pair $(\Sigma,\widehat\Sigma)$, the empirical gap satisfies \begin{align*} \lambda_r(\widehat\Sigma)-\lambda_{r+1}(\widehat\Sigma) \ge \lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)-2\|E\|_{\mathrm{op}} \ge \frac{\Delta_r}{2}. \end{align*} In particular, the empirical top-$r$ spectral cluster is separated from the rest of the spectrum, so the top-$r$ spectral projector $\widehat P_r$ of $\widehat{\Sigma}=\Sigma+E$ is unambiguously the perturbation of the isolated top-$r$ spectral projector $P_r$ of $\Sigma$. [/step] [step:Apply Davis Kahan to convert covariance error into projector error] On the event $\Omega_t$, the matrices $\Sigma$ and $\widehat{\Sigma}=\Sigma+E$ are real symmetric, and the projector $P_r$ corresponds to the isolated population spectral cluster $\{\lambda_1(\Sigma),\dots,\lambda_r(\Sigma)\}$ separated from $\{\lambda_{r+1}(\Sigma),\dots,\lambda_d(\Sigma)\}$ by the gap $\Delta_r=\lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)$. The preceding step gives $\|E\|_{\mathrm{op}}\le \Delta_r/4$, so the perturbation is strictly smaller than the separating population gap. Hence the Davis--Kahan sin theta theorem applies to the real symmetric pair $(\Sigma,\widehat{\Sigma})$, with unperturbed cluster $\{\lambda_1(\Sigma),\dots,\lambda_r(\Sigma)\}$, perturbed top-$r$ cluster $\{\lambda_1(\widehat\Sigma),\dots,\lambda_r(\widehat\Sigma)\}$, and separating denominator $\Delta_r$. 
We use the Frobenius projector form of Davis--Kahan: if $A$ and $A+H$ are real symmetric matrices, $P$ and $\widehat P$ are the spectral projectors onto two corresponding $r$-dimensional spectral clusters, and the unperturbed cluster is separated from the complementary spectrum by $\delta>0$ with $\|H\|_{\mathrm{op}}<\delta$, then there is a numerical constant $B>0$ such that \begin{align*} \|\widehat P-P\|_F \le \frac{B\sqrt r}{\delta}\|H\|_{\mathrm{op}}. \end{align*} Here the factor $\sqrt r$ is the conversion from an operator-norm perturbation estimate for the sine-angle operator to the Frobenius norm on an operator of rank at most $r$. Applying this statement with $A=\Sigma$, $H=E$, $P=P_r$, $\widehat P=\widehat P_r$, and $\delta=\Delta_r$ gives \begin{align*} \|\widehat P_r-P_r\|_F \le \frac{B\sqrt r}{\Delta_r}\|E\|_{\mathrm{op}}. \end{align*} Substituting the operator-norm bound for $E$ on $\Omega_t$ gives \begin{align*} \|\widehat P_r-P_r\|_F &\le \frac{B\sqrt r}{\Delta_r} A_K\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} [guided] Now we use deterministic perturbation theory. The reason projectors are the right object is that an eigenspace has no preferred [orthonormal basis](/page/Orthonormal%20Basis): individual eigenvectors may change sign or rotate inside a repeated eigenspace, but the orthogonal projector onto the eigenspace is uniquely defined. We apply the Davis--Kahan sin theta theorem. The theorem requires two real symmetric matrices, an isolated spectral cluster for the unperturbed matrix, and a perturbation small compared with the separating spectral gap. Here the unperturbed matrix is $\Sigma$, the perturbed matrix is \begin{align*} \widehat{\Sigma}=\Sigma+E, \end{align*} and both matrices are real symmetric. 
The relevant isolated cluster for $\Sigma$ is the set of its top $r$ eigenvalues, separated from the remaining eigenvalues by \begin{align*} \Delta_r=\lambda_r(\Sigma)-\lambda_{r+1}(\Sigma)>0. \end{align*} We first repeat the deterministic separation argument from the preceding step. On $\Omega_t$, the concentration bound gives \begin{align*} \|E\|_{\mathrm{op}} \le A_K\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} The final constant is chosen with $C_K \ge 2A_K$, so $A_K \le C_K/2$. Combining this with the concentration bound gives \begin{align*} \|E\|_{\mathrm{op}} \le \frac{C_K}{2}\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} The smallness assumption in the theorem statement gives \begin{align*} C_K\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right) \le \frac{\Delta_r}{2}. \end{align*} Dividing this last inequality by $2$ and combining it with the preceding display gives \begin{align*} \|E\|_{\mathrm{op}}\le \frac{\Delta_r}{4}. \end{align*} This is the quantitative reason the empirical spectral subspace cannot mix with the rest of the spectrum. We next apply Weyl's eigenvalue perturbation inequality to the real symmetric matrices $\Sigma$ and $\widehat\Sigma=\Sigma+E$. For each index $j \in \{1,\dots,d\}$, Weyl's inequality gives \begin{align*} |\lambda_j(\widehat\Sigma)-\lambda_j(\Sigma)|\le \|E\|_{\mathrm{op}}. \end{align*} Using this first with $j=r$ and then with $j=r+1$, we obtain \begin{align*} \lambda_r(\widehat\Sigma)-\lambda_{r+1}(\widehat\Sigma) \ge \bigl(\lambda_r(\Sigma)-\|E\|_{\mathrm{op}}\bigr) - \bigl(\lambda_{r+1}(\Sigma)+\|E\|_{\mathrm{op}}\bigr). 
\end{align*} The right-hand side equals $\Delta_r-2\|E\|_{\mathrm{op}}$, and the bound $\|E\|_{\mathrm{op}}\le \Delta_r/4$ gives \begin{align*} \lambda_r(\widehat\Sigma)-\lambda_{r+1}(\widehat\Sigma) \ge \frac{\Delta_r}{2}. \end{align*} Thus the empirical top-$r$ cluster remains separated from the remaining empirical eigenvalues, and the empirical projector $\widehat P_r$ is the spectral projector corresponding to that separated cluster. Davis--Kahan therefore gives a numerical constant $B>0$ such that \begin{align*} \|\widehat P_r-P_r\|_F \le \frac{B\sqrt r}{\Delta_r}\|E\|_{\mathrm{op}}. \end{align*} The factor $\sqrt r$ appears because the estimate is stated in Frobenius norm for rank-$r$ spectral projectors, while the perturbation is controlled in operator norm. Substituting the covariance concentration estimate for $\|E\|_{\mathrm{op}}$ on $\Omega_t$ gives \begin{align*} \|\widehat P_r-P_r\|_F &\le \frac{B\sqrt r}{\Delta_r} A_K\|\Sigma\|_{\mathrm{op}} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} This is already the desired inequality, up to replacing the product of constants $BA_K$ by a single constant depending only on $K$. [/guided] [/step] [step:Absorb constants and conclude the high probability bound] We now specify the constant used in the theorem statement. Since the theorem asserts only the existence of a suitable constant, we fix from the start a single constant $C_K>0$, depending only on $K$, large enough that \begin{align*} C_K \ge 2A_K \end{align*} and \begin{align*} C_K \ge BA_K. \end{align*} Then on $\Omega_t$, \begin{align*} \|\widehat P_r-P_r\|_F \le \frac{C_K\sqrt r\,\|\Sigma\|_{\mathrm{op}}}{\Delta_r} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} Since $\mathbb{P}(\Omega_t)\ge 1-e^{-t}$ by the covariance concentration theorem, the displayed bound holds with probability at least $1-e^{-t}$. 
This proves the claimed principal subspace error estimate. [guided] The previous step gave the desired estimate except that the multiplicative constant was written as $BA_K$, while the theorem statement uses a single constant $C_K$ depending only on the sub-Gaussian parameter $K$. Because the theorem asserts only the existence of such a constant, we fix one sufficiently large value from the start. Let $C_K>0$ be any constant, depending only on $K$, such that \begin{align*} C_K \ge 2A_K \end{align*} and \begin{align*} C_K \ge BA_K. \end{align*} The first inequality is the one used in the smallness argument to force $\|E\|_{\mathrm{op}}\le \Delta_r/4$. The second inequality is the one used to absorb the product of the Davis--Kahan numerical constant and the covariance concentration constant. On the event $\Omega_t$, the Davis--Kahan step and the covariance concentration estimate give \begin{align*} \|\widehat P_r-P_r\|_F &\le \frac{BA_K\sqrt r\,\|\Sigma\|_{\mathrm{op}}}{\Delta_r} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} Since $C_K \ge BA_K$, this implies \begin{align*} \|\widehat P_r-P_r\|_F \le \frac{C_K\sqrt r\,\|\Sigma\|_{\mathrm{op}}}{\Delta_r} \left( \sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}} + \frac{r_{\mathrm{eff}}(\Sigma)+t}{n} \right). \end{align*} Finally, the event $\Omega_t$ was defined to be the event on which the covariance concentration theorem holds, and that theorem gives \begin{align*} \mathbb{P}(\Omega_t)\ge 1-e^{-t}. \end{align*} Therefore the displayed principal subspace error bound holds with probability at least $1-e^{-t}$. This is exactly the asserted conclusion. [/guided] [/step]
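
The proof attributes the factor $\sqrt r$ to converting an operator-norm estimate into a Frobenius-norm one. A short worked computation may make this concrete. It is a sketch only: the principal angles $\theta_1,\dots,\theta_r$ between the ranges of $P_r$ and $\widehat P_r$ are notation introduced here for illustration and do not appear in the theorem statement.

```latex
\begin{align*}
\|\widehat P_r-P_r\|_F^2
  &= \operatorname{tr}(\widehat P_r)+\operatorname{tr}(P_r)
     -2\operatorname{tr}(\widehat P_rP_r) \\
  &= 2r-2\sum_{j=1}^{r}\cos^2\theta_j \\
  &= 2\sum_{j=1}^{r}\sin^2\theta_j \\
  &\le 2r\max_{1\le j\le r}\sin^2\theta_j .
\end{align*}
```

Taking square roots gives $\|\widehat P_r-P_r\|_F \le \sqrt{2r}\,\max_j \sin\theta_j$, so any bound on the largest sine of order $\|E\|_{\mathrm{op}}/\Delta_r$ yields a Frobenius bound of the form $B\sqrt r\,\|E\|_{\mathrm{op}}/\Delta_r$.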

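The deterministic part of the argument (Weyl's inequality followed by Davis--Kahan) can be sanity-checked numerically. The following is a minimal sketch, not part of the proof. The helper names `top_r_projector` and `subspace_check`, the diagonal test covariance, and the Gaussian sample are all illustrative assumptions. The value $B = 2\sqrt{2}$ is also an assumption: it is the constant one obtains from the Yu--Wang--Samworth variant of Davis--Kahan, whereas the proof above leaves $B$ unspecified.

```python
import numpy as np


def top_r_projector(S, r):
    """Orthogonal projector onto the top-r eigenspace of a symmetric matrix S."""
    # np.linalg.eigh returns eigenvalues in ascending order,
    # so the last r eigenvector columns span the top-r eigenspace.
    _, vecs = np.linalg.eigh(S)
    U = vecs[:, -r:]
    return U @ U.T


def subspace_check(Sigma, X, r):
    """Compute the quantities used in the Weyl and Davis-Kahan steps."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n                 # sample covariance, mean-zero model
    E = Sigma_hat - Sigma                   # perturbation matrix
    err_op = np.linalg.norm(E, 2)           # operator (spectral) norm of E
    lam = np.linalg.eigvalsh(Sigma)[::-1]   # eigenvalues, non-increasing order
    lam_hat = np.linalg.eigvalsh(Sigma_hat)[::-1]
    gap = lam[r - 1] - lam[r]               # population eigengap Delta_r
    gap_hat = lam_hat[r - 1] - lam_hat[r]   # empirical eigengap
    proj_err = np.linalg.norm(
        top_r_projector(Sigma_hat, r) - top_r_projector(Sigma, r), "fro"
    )
    B = 2.0 * np.sqrt(2.0)                  # assumed Davis-Kahan constant (see lead-in)
    return {
        "err_op": err_op,
        "weyl_max_shift": np.max(np.abs(lam_hat - lam)),
        "gap": gap,
        "gap_hat": gap_hat,
        "proj_err": proj_err,
        "dk_bound": B * np.sqrt(r) * err_op / gap,
    }


# Illustrative setup (assumed): diagonal covariance with a clear gap after r = 3.
rng = np.random.default_rng(0)
d, r, n = 20, 3, 5000
variances = np.concatenate([[10.0, 9.0, 8.0], np.ones(d - r)])
Sigma = np.diag(variances)
X = rng.standard_normal((n, d)) * np.sqrt(variances)  # rows have covariance Sigma

out = subspace_check(Sigma, X, r)
for key, value in out.items():
    print(f"{key}: {value:.4f}")
```

In this sketch `weyl_max_shift` should never exceed `err_op`, `gap_hat` should be at least `gap - 2 * err_op`, and `proj_err` should stay below `dk_bound`. These comparisons mirror the three deterministic inequalities used in the proof, and they hold for any sample, not only for a favourable draw.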