Minimax Rates for Bounded Gaussian Covariance Estimation (Theorem #5898)

Proof

[proofplan] We prove the benchmark by matching explicit upper and lower bounds. The upper bounds use the empirical covariance estimator together with the zero estimator: the empirical covariance gives the parametric rates, while the zero estimator gives the bounded-risk truncation imposed by $0 \preceq \Sigma \preceq I_d$. The lower bounds are obtained from finite covariance subfamilies inside $\Theta_d$: a spectrally controlled Wigner-sign family gives the Frobenius rate, a spherical rank-one family gives the operator rate when $n \ge d$, and the same spherical packing with a fixed spike size gives the constant operator lower bound when $n < d$. [/proofplan] [step:Compute the Frobenius risk of the empirical covariance] Let $\hat\Sigma_n: (\mathbb{R}^d)^n \to \mathbb{R}^{d \times d}$ be the empirical covariance estimator defined by \begin{align*} \hat\Sigma_n(x_1,\dots,x_n) &= \frac{1}{n}\sum_{i=1}^n x_i x_i^\top . \end{align*} For independent $X_1,\dots,X_n \sim \mathcal N(0,\Sigma)$, the summands $X_iX_i^\top-\Sigma$ are independent, mean-zero random matrices, so cross terms vanish when expanding the squared Frobenius norm, and the risk equals $\frac{1}{n}\mathbb E_\Sigma[\|X_1X_1^\top-\Sigma\|_F^2] = \frac{1}{n}\left(\mathbb E_\Sigma[|X_1|^4]-\operatorname{tr}(\Sigma^2)\right)$. The Gaussian fourth-moment identity $\mathbb E_\Sigma[|X_1|^4]=(\operatorname{tr}\Sigma)^2+2\operatorname{tr}(\Sigma^2)$ therefore gives \begin{align*} \mathbb E_\Sigma[\|\hat\Sigma_n - \Sigma\|_F^2] = \frac{1}{n}\left((\operatorname{tr}\Sigma)^2 + \operatorname{tr}(\Sigma^2)\right). \end{align*} Since $0 \preceq \Sigma \preceq I_d$, we have $\operatorname{tr}\Sigma \le d$ and $\operatorname{tr}(\Sigma^2) \le d$, hence \begin{align*} \sup_{\Sigma \in \Theta_d}\mathbb E_\Sigma[\|\hat\Sigma_n - \Sigma\|_F^2] \le \frac{d(d+1)}{n}. \end{align*} The zero estimator $\hat\Sigma_0: (\mathbb{R}^d)^n \to \mathbb{R}^{d \times d}$, defined by $\hat\Sigma_0(x_1,\dots,x_n)=0$, has risk $\|\Sigma\|_F^2=\operatorname{tr}(\Sigma^2)$ at every $\Sigma$, which is maximized over $\Theta_d$ at $\Sigma=I_d$: \begin{align*} \sup_{\Sigma \in \Theta_d}\mathbb E_\Sigma[\|\hat\Sigma_0 - \Sigma\|_F^2] = \sup_{\Sigma \in \Theta_d}\operatorname{tr}(\Sigma^2) = d. 
\end{align*} Taking the better of these two estimators gives \begin{align*} \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_F^2] \le \min\left\{d,\frac{d(d+1)}{n}\right\}. \end{align*} [/step] [step:Lower bound the Frobenius risk by a spectrally bounded sign packing] We use the following finite-packing fact. There are universal constants $a,b,c>0$ and, for each $d\ge2$, a set $\mathcal A_d$ of symmetric $d\times d$ matrices with zero diagonal and entries in $\{-1,1\}$ off the diagonal such that $|\mathcal A_d|\ge \exp(a d^2)$, $\|A\|_{\mathrm{op}}\le b\sqrt d$ for every $A\in\mathcal A_d$, and \begin{align*} \|A-B\|_F^2 \ge c d^2 \end{align*} for distinct $A,B\in\mathcal A_d$. To obtain it, apply the [Hamming and Gilbert-Varshamov Bounds](/theorems/5738) to the $d(d-1)/2$ off-diagonal sign coordinates, giving an exponentially large Hamming-separated family. A random symmetric sign matrix has $\|A\|_{\mathrm{op}}\le b\sqrt d$ with probability at least $3/4$ for a universal $b>0$, while the Gilbert-Varshamov family has separation at least a fixed positive fraction of the coordinates; averaging over random sign translations and then discarding matrices outside the operator-norm event leaves a subfamily of cardinality at least $\exp(ad^2)$ after reducing $a>0$, with the same Frobenius separation up to reducing $c>0$. Let $C_1>0$ be the universal constant such that the Gaussian covariance KL estimate below is bounded by $C_1 n\|\Sigma_A-\Sigma_B\|_F^2$ whenever all eigenvalues of the covariance matrices lie in $[1/4,3/4]$. Since $\|A-B\|_F^2\le 4d^2$ for matrices in $\mathcal A_d$, define $C_2=4C_1$ and $C_3=C_2$. Choose a constant $\gamma>0$ small enough that $C_3\gamma^2\le a/8$ and $\gamma\le 1/(4b)$. Set \begin{align*} \delta &= \gamma\min\left\{\frac{1}{\sqrt d},\frac{1}{\sqrt n}\right\}. \end{align*} For each $A\in\mathcal A_d$, define the covariance matrix \begin{align*} \Sigma_A &= \frac{1}{2}I_d + \delta A. 
\end{align*} Since $\|\delta A\|_{\mathrm{op}}\le 1/4$, each eigenvalue of $\Sigma_A$ lies in $[1/4,3/4]$, so $\Sigma_A\in\Theta_d$. For each $A\in\mathcal A_d$, let $P_A$ denote the probability law $\mathcal N(0,\Sigma_A)$ on $\mathbb R^d$. For distinct $A,B\in\mathcal A_d$, \begin{align*} \|\Sigma_A-\Sigma_B\|_F^2 = \delta^2\|A-B\|_F^2 \ge c\delta^2 d^2. \end{align*} The Kullback-Leibler divergence between $n$ samples from $\mathcal N(0,\Sigma_A)$ and $\mathcal N(0,\Sigma_B)$ is \begin{align*} D_{\mathrm{KL}}(P_A^{\otimes n}\|P_B^{\otimes n}) = \frac{n}{2}\left(\operatorname{tr}(\Sigma_B^{-1}\Sigma_A-I_d)-\log\det(\Sigma_B^{-1}\Sigma_A)\right). \end{align*} Because the eigenvalues of $\Sigma_A$ and $\Sigma_B$ lie in $[1/4,3/4]$, [Taylor's theorem](/theorems/827) for $t-1-\log t$ on $[1/3,3]$ gives \begin{align*} D_{\mathrm{KL}}(P_A^{\otimes n}\|P_B^{\otimes n}) \le C_1 n\|\Sigma_A-\Sigma_B\|_F^2 \le C_2 n\delta^2 d^2 \le C_3\gamma^2 d^2 \le \frac{a}{8}d^2. \end{align*} The [Fano Inequality](/theorems/1654) testing argument applied to the uniform prior on $\mathcal A_d$ yields \begin{align*} \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_F^2] \ge C_4\delta^2 d^2 \ge C_5\min\left\{d,\frac{d^2}{n}\right\}. \end{align*} Since $d(d+1)$ and $d^2$ are comparable for $d\ge2$, this is the desired Frobenius lower bound. [guided] The lower bound must use nonzero covariance matrices inside $\Theta_d$, because the zero covariance itself is easy to estimate. The role of the sign packing is to create many covariance matrices that are separated in Frobenius norm but remain uniformly bounded in operator norm. More precisely, the packing supplies universal constants $a,b,c>0$ and a set $\mathcal A_d$ of symmetric $d\times d$ matrices with zero diagonal and off-diagonal entries in $\{-1,1\}$ such that $|\mathcal A_d|\ge \exp(ad^2)$, $\|A\|_{\mathrm{op}}\le b\sqrt d$ for every $A\in\mathcal A_d$, and $\|A-B\|_F^2\ge cd^2$ whenever $A\ne B$. 
This is obtained from the [Hamming and Gilbert-Varshamov Bounds](/theorems/5738) on the off-diagonal sign coordinates, together with the random sign-translation pruning argument and the standard operator-norm bound for random symmetric sign matrices. The spectral pruning condition $\|A\|_{\mathrm{op}}\le b\sqrt d$ is what permits perturbations of size comparable to $d^{-1/2}$ while keeping $\frac12 I_d+\delta A$ positive semidefinite and bounded above by $I_d$. Let $C_1>0$ be the universal constant in the KL estimate below, define $C_2=4C_1$, and set $C_3=C_2$. Choose $\gamma>0$ small enough that $C_3\gamma^2\le a/8$ and $\gamma\le 1/(4b)$. Define \begin{align*} \delta &= \gamma\min\left\{\frac{1}{\sqrt d},\frac{1}{\sqrt n}\right\}, \end{align*} and, for $A\in\mathcal A_d$, define \begin{align*} \Sigma_A &= \frac{1}{2}I_d+\delta A. \end{align*} Since $A$ is symmetric, its eigenvalues are real, and the operator norm bound gives $\|\delta A\|_{\mathrm{op}}\le 1/4$. Therefore every eigenvalue of $\Sigma_A$ belongs to $[1/4,3/4]$, proving $0\preceq \Sigma_A\preceq I_d$, hence $\Sigma_A\in\Theta_d$. For each $A\in\mathcal A_d$, let $P_A$ denote the probability law $\mathcal N(0,\Sigma_A)$ on $\mathbb R^d$. The Frobenius separation is inherited directly from the packing: \begin{align*} \|\Sigma_A-\Sigma_B\|_F^2 =\delta^2\|A-B\|_F^2 \ge c\delta^2 d^2. \end{align*} The information distance is small because Gaussian covariance models are locally quadratic in the covariance matrix. For one observation, the covariance Kullback-Leibler formula is \begin{align*} D_{\mathrm{KL}}(\mathcal N(0,\Sigma_A)\|\mathcal N(0,\Sigma_B)) =\frac12\left(\operatorname{tr}(\Sigma_B^{-1}\Sigma_A-I_d)-\log\det(\Sigma_B^{-1}\Sigma_A)\right), \end{align*} and independence multiplies this quantity by $n$. Since all eigenvalues stay in a fixed compact subset of $(0,\infty)$, Taylor's theorem bounds the scalar expression $t-1-\log t$ by a universal multiple of $(t-1)^2$. 
Hence \begin{align*} D_{\mathrm{KL}}(P_A^{\otimes n}\|P_B^{\otimes n}) \le C_1n\|\Sigma_A-\Sigma_B\|_F^2 \le C_2n\delta^2d^2. \end{align*} The definition of $\delta$ gives $n\delta^2\le \gamma^2$, so the KL divergence is at most $C_3\gamma^2d^2\le ad^2/8$, while the logarithm of the packing size is at least $ad^2$. The [Fano Inequality](/theorems/1654) testing argument therefore forces a constant probability of confusing two separated covariance matrices. Multiplying that testing error by the squared separation gives \begin{align*} \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_F^2] \ge C_4\delta^2d^2 \ge C_5\min\left\{d,\frac{d^2}{n}\right\}. \end{align*} Because $d(d+1)\asymp d^2$ for $d\ge2$, this is the asserted Frobenius minimax lower rate. [/guided] [/step] [step:Bound the operator risk from above] Let $\hat\Sigma_n$ be the empirical covariance estimator from the first step. For universal constants $C_6,C_7>0$, the standard Gaussian sample covariance concentration inequality gives \begin{align*} \mathbb E_\Sigma[\|\hat\Sigma_n-\Sigma\|_{\mathrm{op}}^2] \le C_6\|\Sigma\|_{\mathrm{op}}^2\left(\sqrt{\frac{r(\Sigma)}{n}}+\frac{r(\Sigma)}{n}\right)^2, \end{align*} where the effective rank is defined by $r(\Sigma)=\operatorname{tr}(\Sigma)/\|\Sigma\|_{\mathrm{op}}$ when $\Sigma\ne0$, and $r(0)=0$. Since $0\preceq\Sigma\preceq I_d$, we have $\|\Sigma\|_{\mathrm{op}}\le1$ and $r(\Sigma)\le d$, so \begin{align*} \sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma_n-\Sigma\|_{\mathrm{op}}^2] \le C_6\left(\sqrt{\frac dn}+\frac dn\right)^2. \end{align*} The zero estimator satisfies \begin{align*} \sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma_0-\Sigma\|_{\mathrm{op}}^2] \le 1. \end{align*} Taking the better of the two estimators yields \begin{align*} \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_{\mathrm{op}}^2] \le C_7\left(\left(\sqrt{\frac dn}+\frac dn\right)^2\wedge1\right). 
\end{align*} [/step] [step:Lower bound the operator risk when $n\ge d$] Let $\mathbb S^{d-1}=\{v\in\mathbb R^d: |v|=1\}$. Choose a universal angular separation parameter $\rho\in(0,1)$ and a maximal subset $V\subset\mathbb S^{d-1}$ such that $|v-w|\ge\rho$ for distinct $v,w\in V$. The standard spherical packing bound gives $|V|\ge\exp(c_1d)$ for a universal constant $c_1>0$. Since $|v-w|^2=2-2v\cdot w\ge\rho^2$, and replacing $w$ by $-w$ if necessary gives the same projector, the packing may be chosen so that $|v\cdot w|\le 1-c_2^2$ for a universal $c_2\in(0,1]$. Because $1-(1-c_2^2)^2=c_2^2(2-c_2^2)\ge c_2^2$, it follows that \begin{align*} \|vv^\top-ww^\top\|_{\mathrm{op}} = \sqrt{1-(v\cdot w)^2}\ge c_2 \end{align*} for distinct $v,w\in V$. Set $\varepsilon=c_3\sqrt{d/n}$ with $c_3\in(0,1]$ sufficiently small, and, for $v\in V$, define \begin{align*} \Sigma_v &= \frac{1}{2}I_d + \frac{\varepsilon}{2}vv^\top. \end{align*} Since $n\ge d$ gives $\varepsilon\le c_3\le1$, every eigenvalue of $\Sigma_v$ lies in $[1/2,1]$, so $\Sigma_v\in\Theta_d$. For each $v\in V$, let $P_v$ denote the probability law $\mathcal N(0,\Sigma_v)$ on $\mathbb R^d$. The operator separation is \begin{align*} \|\Sigma_v-\Sigma_w\|_{\mathrm{op}}^2 =\frac{\varepsilon^2}{4}\|vv^\top-ww^\top\|_{\mathrm{op}}^2 \ge c_4\frac dn. \end{align*} The same Gaussian covariance KL formula and Taylor bound used above give \begin{align*} D_{\mathrm{KL}}(P_v^{\otimes n}\|P_w^{\otimes n}) \le C_8 n\|\Sigma_v-\Sigma_w\|_F^2 \le C_9n\varepsilon^2 \le c_1d/8. \end{align*} The [Fano Inequality](/theorems/1654) testing argument therefore implies \begin{align*} \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_{\mathrm{op}}^2] \ge C_{10}\frac dn. \end{align*} When $n\ge d$, this is comparable to $\left(\sqrt{d/n}+d/n\right)^2\wedge1$, since $d/n\le\left(\sqrt{d/n}+d/n\right)^2\wedge1\le 4d/n$ in that regime. [guided] The operator-norm lower bound uses rank-one spikes because the operator norm is sensitive to a single difficult direction. We first choose a spherical packing $V\subset\mathbb S^{d-1}$ with $|V|\ge\exp(c_1d)$ and $\|vv^\top-ww^\top\|_{\mathrm{op}}\ge c_2$ for distinct $v,w\in V$. 
The projector identity \begin{align*} \|vv^\top-ww^\top\|_{\mathrm{op}} = \sqrt{1-(v\cdot w)^2} \end{align*} shows exactly why angular separation on the sphere becomes operator-norm separation between rank-one projectors. Set $\varepsilon=c_3\sqrt{d/n}$ with $c_3>0$ sufficiently small, and define \begin{align*} \Sigma_v &= \frac{1}{2}I_d + \frac{\varepsilon}{2}vv^\top. \end{align*} Because $0\preceq vv^\top\preceq I_d$ and $n\ge d$ gives $\varepsilon\le c_3$, choosing $c_3\le1$ gives $0\preceq\Sigma_v\preceq I_d$, so $\Sigma_v\in\Theta_d$. The separation is \begin{align*} \|\Sigma_v-\Sigma_w\|_{\mathrm{op}}^2 =\frac{\varepsilon^2}{4}\|vv^\top-ww^\top\|_{\mathrm{op}}^2 \ge c_4\frac dn. \end{align*} For the information bound, the Gaussian covariance KL formula and the same Taylor estimate used in the Frobenius step give \begin{align*} D_{\mathrm{KL}}(P_v^{\otimes n}\|P_w^{\otimes n}) \le C_8 n\|\Sigma_v-\Sigma_w\|_F^2. \end{align*} Since $vv^\top-ww^\top$ has Frobenius norm bounded by a universal constant, this is at most $C_9n\varepsilon^2=C_9c_3^2d$. Choosing $c_3>0$ small enough makes this at most $c_1d/8$, while $\log|V|\ge c_1d$. Thus [Fano Inequality](/theorems/1654) gives a constant testing error, and multiplying by the squared operator separation yields the lower bound $C_{10}d/n$. For $n\ge d$, the quantity $d/n$ is comparable to $\left(\sqrt{d/n}+d/n\right)^2\wedge1$. [/guided] [/step] [step:Lower bound the operator risk when $n<d$] Use the same spherical packing construction as in the preceding step, but choose a fixed spike size $\varepsilon_0\in(0,1/4]$ small enough that $C_9\varepsilon_0^2\le c_1/8$, where $C_9$ is the universal constant in the KL bound below. For each $v\in V$, define $\Sigma_v=I_d/2+\varepsilon_0 vv^\top/2$, and let $P_v$ denote the probability law $\mathcal N(0,\Sigma_v)$ on $\mathbb R^d$. Since the eigenvalues of $\Sigma_v$ lie in $[1/2,1/2+\varepsilon_0/2]$, each $\Sigma_v$ belongs to $\Theta_d$. 
For distinct $v,w\in V$, the operator separation satisfies \begin{align*} \|\Sigma_v-\Sigma_w\|_{\mathrm{op}}^2 =\frac{\varepsilon_0^2}{4}\|vv^\top-ww^\top\|_{\mathrm{op}}^2 \ge \frac{\varepsilon_0^2c_2^2}{4}. \end{align*} The same Gaussian covariance KL formula and Taylor bound give \begin{align*} D_{\mathrm{KL}}(P_v^{\otimes n}\|P_w^{\otimes n}) \le C_9n\varepsilon_0^2 < C_9d\varepsilon_0^2 \le \frac{c_1}{8}d. \end{align*} Because $\log |V|\ge c_1d$, the [Fano Inequality](/theorems/1654) testing argument yields \begin{align*} \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_{\mathrm{op}}^2] \ge c_{13}, \end{align*} where $c_{13}>0$ is universal. Since $n<d$ implies $\left(\sqrt{d/n}+d/n\right)^2\wedge1=1$, this gives the required lower bound in the high-dimensional regime. [guided] In the regime $n<d$, the target rate is constant because the expression $\left(\sqrt{d/n}+d/n\right)^2\wedge1$ equals $1$. We therefore keep the same spherical packing $V$ but do not shrink the spike with $n$: choose $\varepsilon_0\in(0,1/4]$ small enough that $C_9\varepsilon_0^2\le c_1/8$, and define \begin{align*} \Sigma_v &= \frac{1}{2}I_d + \frac{\varepsilon_0}{2}vv^\top. \end{align*} The eigenvalues of $\Sigma_v$ lie in $[1/2,1/2+\varepsilon_0/2]$, hence $\Sigma_v\in\Theta_d$. For distinct $v,w\in V$, the projector separation gives \begin{align*} \|\Sigma_v-\Sigma_w\|_{\mathrm{op}}^2 =\frac{\varepsilon_0^2}{4}\|vv^\top-ww^\top\|_{\mathrm{op}}^2 \ge \frac{\varepsilon_0^2c_2^2}{4}. \end{align*} The KL divergence is bounded by \begin{align*} D_{\mathrm{KL}}(P_v^{\otimes n}\|P_w^{\otimes n}) \le C_9n\varepsilon_0^2 < C_9d\varepsilon_0^2 \le \frac{c_1}{8}d. \end{align*} Thus [Fano Inequality](/theorems/1654) again gives a constant probability of error over a packing whose logarithmic size is at least $c_1d$. Multiplying by the constant squared separation yields a universal lower bound $c_{13}>0$. 
[/guided] [/step] [step:Combine the four estimates] The Frobenius upper and lower bounds give universal constants $0<c<C<\infty$ such that \begin{align*} c\min\left\{d,\frac{d(d+1)}{n}\right\} \le \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_F^2] \le C\min\left\{d,\frac{d(d+1)}{n}\right\}. \end{align*} The operator upper bound and the rank-one packing lower bounds for $n\ge d$ and $n<d$ similarly give \begin{align*} c\left(\left(\sqrt{\frac dn}+\frac dn\right)^2\wedge1\right) \le \inf_{\hat\Sigma}\sup_{\Sigma\in\Theta_d}\mathbb E_\Sigma[\|\hat\Sigma-\Sigma\|_{\mathrm{op}}^2] \le C\left(\left(\sqrt{\frac dn}+\frac dn\right)^2\wedge1\right). \end{align*} This proves both asserted benchmark rates over $\Theta_d$. [/step]
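
The closed-form Frobenius risk from the first step and the two benchmark rates can be sanity-checked numerically. The following is a minimal sketch in plain Python and is not part of the original proof. It assumes a diagonal covariance, so that sampling needs no matrix factorization, and the function names `frobenius_risk_exact`, `frobenius_risk_mc`, `frobenius_rate`, and `operator_rate` are hypothetical helpers introduced only for illustration.

```python
import math
import random


def frobenius_risk_exact(diag_sigma, n):
    """Closed form ((tr S)^2 + tr(S^2)) / n for S = diag(diag_sigma)."""
    trace = sum(diag_sigma)
    trace_sq = sum(s * s for s in diag_sigma)
    return (trace ** 2 + trace_sq) / n


def frobenius_risk_mc(diag_sigma, n, reps=20000, seed=0):
    """Monte Carlo estimate of E||S_hat - S||_F^2 for S = diag(diag_sigma).

    Assumption for this sketch: S is diagonal, so each coordinate is an
    independent centered Gaussian with variance diag_sigma[j].
    """
    rng = random.Random(seed)
    d = len(diag_sigma)
    scale = [math.sqrt(s) for s in diag_sigma]
    total = 0.0
    for _ in range(reps):
        # Accumulate sum_i x_i x_i^T over n independent draws.
        acc = [[0.0] * d for _ in range(d)]
        for _ in range(n):
            x = [scale[j] * rng.gauss(0.0, 1.0) for j in range(d)]
            for j in range(d):
                for k in range(d):
                    acc[j][k] += x[j] * x[k]
        # Squared Frobenius distance between acc / n and the diagonal target.
        for j in range(d):
            for k in range(d):
                target = diag_sigma[j] if j == k else 0.0
                total += (acc[j][k] / n - target) ** 2
    return total / reps


def frobenius_rate(d, n):
    """Benchmark Frobenius rate min{d, d(d+1)/n}."""
    return min(d, d * (d + 1) / n)


def operator_rate(d, n):
    """Benchmark operator rate (sqrt(d/n) + d/n)^2 wedge 1."""
    return min((math.sqrt(d / n) + d / n) ** 2, 1.0)


if __name__ == "__main__":
    sigma, n = [1.0, 0.5], 5
    print("closed form :", frobenius_risk_exact(sigma, n))
    print("monte carlo :", frobenius_risk_mc(sigma, n))
    print("rates (d=3, n=100):", frobenius_rate(3, 100), operator_rate(3, 100))
```

For $\Sigma=I_d$ the closed form equals $d(d+1)/n$, the worst case used in the Frobenius upper bound, and the Monte Carlo estimate should agree with the closed form up to simulation error. The operator rate function also makes the regime split visible: it returns $1$ whenever $n<d$ and stays between $d/n$ and $4d/n$ whenever $n\ge d$.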
