Frobenius Minimax Lower Bound for Bounded-Spectrum Gaussian Covariance Estimation

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We prove the lower bound by embedding a large finite packing inside $\mathcal{C}_p(m,M)$ and applying Fano's method. The packing consists of covariance matrices whose off-diagonal signs vary in many independent directions, while the perturbation amplitude is chosen so that all spectra remain in $[m,M]$. Frobenius separation accumulates over order $p^2$ entries, whereas the Gaussian Kullback-Leibler divergence grows like $n$ times the squared perturbation size. Choosing the perturbation size as $(n^{-1} \wedge p^{-1})^{1/2}$ gives the scale $p^2/n \wedge p$. [/proofplan] [step:Construct a large bounded-norm packing of sign matrices] For $p \geq 2$, let $\mathcal{S}_p$ denote the set of symmetric matrices $A \in \mathbb{R}^{p \times p}$ such that $A_{ii}=0$ for every $i \in \{1,\dots,p\}$ and $A_{ij} \in \{-1,1\}$ for every $1 \leq i < j \leq p$. There exist constants $\gamma>0$, $\beta>0$, and $K>0$, independent of $p$, and matrices \begin{align*} A_1,\dots,A_N \in \mathcal{S}_p \end{align*} such that \begin{align*} N \geq \exp(\gamma p^2). \end{align*} For every $j \in \{1,\dots,N\}$, \begin{align*} \|A_j\|_{\mathrm{op}} \leq K\sqrt{p}. \end{align*} For every distinct pair $j,k \in \{1,\dots,N\}$, \begin{align*} \|A_j-A_k\|_F^2 \geq \beta p^2. \end{align*} Indeed, identify each matrix in $\mathcal{S}_p$ with its upper-triangular sign vector in $\{-1,1\}^d$, where $d=p(p-1)/2$. Let $\rho$ denote the Hamming metric on $\{-1,1\}^d$. By the standard symmetric Rademacher matrix operator-norm estimate, obtained for example from the noncommutative Khintchine moment bound and Markov's inequality, there is an absolute constant $K>0$ such that, for a uniformly sampled $A\in\mathcal{S}_p$, \begin{align*} \mathbb{P}\left(\|A\|_{\mathrm{op}} \leq K\sqrt{p}\right) \geq \frac{1}{2}. \end{align*} Define the good set \begin{align*} G_p:=\left\{A\in\mathcal{S}_p:\|A\|_{\mathrm{op}}\leq K\sqrt p\right\}. \end{align*} The preceding probability bound gives $|G_p|\geq 2^{d-1}$. We now run the [Gilbert-Varshamov packing argument](/page/Gilbert-Varshamov%20Bound) inside $G_p$, not in the full cube. Let $B_d(v,r)$ denote the Hamming ball in $\{-1,1\}^d$ centred at $v$ with radius $r$, and set $r_d:=\lfloor d/8\rfloor$. The greedy algorithm chooses a point of $G_p$, removes from $G_p$ all points in its Hamming ball of radius $r_d$, and repeats until no point remains. Since each removed set is contained in a Hamming ball of the full cube, \begin{align*} |B_d(v,r_d)| \leq \sum_{q=0}^{\lfloor d/8\rfloor}\binom{d}{q} \leq \exp\left(H\left(\frac18\right)d\right), \end{align*} where $H(t):=-t\log t-(1-t)\log(1-t)$ is the binary entropy function. Hence the number $N$ of selected matrices satisfies \begin{align*} N \geq \frac{|G_p|}{\exp(H(1/8)d)} \geq \exp\left(\left(\log 2-H\left(\frac18\right)\right)d-\log 2\right). \end{align*} Since $\log 2-H(1/8)>0$, after decreasing an absolute constant $\gamma>0$ and treating the finitely many small dimensions by direct selection and another decrease of $\gamma$, this gives $N\geq\exp(\gamma p^2)$ for all $p\geq2$. The greedy construction also gives $\rho(A_j,A_k)>d/8$ for distinct selected matrices. If two upper-triangular sign vectors differ in one coordinate, then the associated symmetric matrices differ by $2$ in both entries $(i,j)$ and $(j,i)$, contributing $2^2+2^2=8$ to the squared Frobenius norm. Therefore \begin{align*} \|A_j-A_k\|_F^2 \geq 8\cdot \frac{d}{8} = \frac{p(p-1)}{2}. \end{align*} For $p\geq2$, this is at least $p^2/4$ after reducing the constant, so one may take $\beta=1/4$. Because every selected matrix lies in $G_p$, it also satisfies $\|A_j\|_{\mathrm{op}}\leq K\sqrt p$. This proves the asserted packing. [guided] The goal of this step is to create many directions in which covariance matrices can differ. If we used only diagonal perturbations, we would get only $p$ independent coordinates. To obtain the Frobenius scale $p^2/n$, we need order $p^2$ independent signs, so we vary the off-diagonal entries. Let $\mathcal{S}_p$ be the set of symmetric zero-diagonal sign matrices: \begin{align*} \mathcal{S}_p = \{A \in \mathbb{R}^{p \times p} : A=A^\top,\ A_{ii}=0\text{ for every }i\in\{1,\dots,p\},\ A_{ij}\in\{-1,1\}\text{ for every }1\leq i<j\leq p\}. \end{align*} Each matrix is determined by the $d=p(p-1)/2$ signs above the diagonal, so $\mathcal{S}_p$ is the same combinatorial object as the discrete cube $\{-1,1\}^d$. Let $\rho$ be the Hamming metric on this cube. We need two properties at once: large Frobenius separation and bounded operator norm. The delicate point is that we cannot first take an arbitrary Gilbert-Varshamov packing and then intersect it with the bounded-operator-norm event, because the chosen packing might avoid that event. Instead, we first identify many good matrices and then run the greedy packing argument inside that good set. A standard symmetric Rademacher matrix operator-norm estimate, obtainable from the noncommutative Khintchine moment inequality followed by Markov's inequality, gives an absolute constant $K>0$ such that a uniformly random $A\in\mathcal{S}_p$ satisfies \begin{align*} \mathbb{P}\left(\|A\|_{\mathrm{op}} \leq K\sqrt p\right) \geq \frac12. \end{align*} Define \begin{align*} G_p:=\left\{A\in\mathcal{S}_p:\|A\|_{\mathrm{op}}\leq K\sqrt p\right\}. \end{align*} Since the uniform measure on $\mathcal{S}_p$ assigns mass $2^{-d}$ to each sign vector, the probability estimate implies $|G_p|\geq 2^{d-1}$. Now apply the [Gilbert-Varshamov packing argument](/page/Gilbert-Varshamov%20Bound) to $G_p$. Let $B_d(v,r)$ be the Hamming ball in $\{-1,1\}^d$ with centre $v$ and radius $r$, and set $r_d:=\lfloor d/8\rfloor$. The greedy algorithm selects one point of $G_p$, deletes all remaining good points within Hamming distance $r_d$, and repeats. Every deleted cluster is contained in a full-cube Hamming ball, whose size is bounded by the binary entropy estimate \begin{align*} |B_d(v,r_d)| \leq \sum_{q=0}^{\lfloor d/8\rfloor}\binom{d}{q} \leq \exp\left(H\left(\frac18\right)d\right), \end{align*} where $H(t):=-t\log t-(1-t)\log(1-t)$ is the binary entropy function. Therefore the number $N$ of selected matrices is at least \begin{align*} N \geq \frac{|G_p|}{\exp(H(1/8)d)} \geq \exp\left(\left(\log 2-H\left(\frac18\right)\right)d-\log 2\right). \end{align*} Because $\log 2-H(1/8)>0$, decreasing an absolute constant $\gamma>0$ and absorbing finitely many small values of $p$ by direct selection gives $N\geq\exp(\gamma p^2)$ for every $p\geq2$. The selected matrices have pairwise Hamming distance greater than $d/8$. If two sign vectors differ in one upper-triangular coordinate, then the associated symmetric matrices differ in both entries $(i,j)$ and $(j,i)$ by $2$, so the squared Frobenius contribution is $2^2+2^2=8$. Hence, for distinct selected matrices, \begin{align*} \|A_j-A_k\|_F^2 \geq 8\cdot\frac{d}{8} = \frac{p(p-1)}{2}. \end{align*} For $p\geq2$, this is bounded below by a fixed multiple of $p^2$; after decreasing the absolute constant, we write $\|A_j-A_k\|_F^2\geq\beta p^2$. Since the greedy construction selected only points of $G_p$, every selected matrix also satisfies $\|A_j\|_{\mathrm{op}}\leq K\sqrt p$. [/guided] [/step] [step:Embed the packing inside the bounded-spectrum covariance class] Set \begin{align*} \lambda := \frac{m+M}{2}, \qquad r := \frac{M-m}{2}. \end{align*} Choose a constant $a=a(m,M)>0$ satisfying all three constraints \begin{align*} aK \leq \frac{r}{2}, \qquad \frac{aK}{\lambda}\leq \frac12, \qquad \frac{a^2}{2\lambda^2}\leq \frac{\gamma}{8}. \end{align*} The first constraint keeps the perturbations inside the spectrum interval, the second justifies the scalar logarithm bound in the Kullback-Leibler computation, and the third gives the Fano divergence condition. For $n,p \in \mathbb{N}$ with $p \geq 2$, define \begin{align*} \varepsilon := a\left(\frac{1}{n}\wedge \frac{1}{p}\right)^{1/2}. \end{align*} For each $j \in \{1,\dots,N\}$, define the covariance matrix \begin{align*} \Sigma_j := \lambda I_p + \varepsilon A_j. \end{align*} Since $\|A_j\|_{\mathrm{op}}\leq K\sqrt p$ and $\varepsilon \leq a p^{-1/2}$, \begin{align*} \|\varepsilon A_j\|_{\mathrm{op}} \leq aK \leq \frac{r}{2}. \end{align*} Hence every eigenvalue of $\Sigma_j$ lies in \begin{align*} \left[\lambda-\frac{r}{2},\lambda+\frac{r}{2}\right] \subset [m,M]. \end{align*} Therefore $\Sigma_j \in \mathcal{C}_p(m,M)$ for every $j$. [/step] [step:Compute Frobenius separation at the target scale] For $j\neq k$, the construction gives \begin{align*} \|\Sigma_j-\Sigma_k\|_F^2 = \varepsilon^2 \|A_j-A_k\|_F^2 \geq \beta \varepsilon^2 p^2. \end{align*} By the definition of $\varepsilon$, \begin{align*} \varepsilon^2p^2 = a^2p^2 \left( \frac{1}{n}\wedge \frac{1}{p} \right) = a^2 \left( \frac{p^2}{n}\wedge p \right). \end{align*} Thus the finite family $\{\Sigma_1,\dots,\Sigma_N\}$ is separated in squared Frobenius norm by at least \begin{align*} s^2 := \beta a^2 \left( \frac{p^2}{n}\wedge p \right). \end{align*} [/step] [step:Bound the Gaussian Kullback-Leibler divergences] Let $P_j$ denote the joint law of $(X_1,\dots,X_n)$ when each $X_i$ has distribution $\mathcal{N}_p(0,\Sigma_j)$. Let $P_0$ denote the joint law when each $X_i$ has distribution $\mathcal{N}_p(0,\lambda I_p)$. By the [Gaussian Kullback-Leibler divergence formula](/page/Gaussian%20Kullback-Leibler%20Divergence), for one observation the Kullback-Leibler divergence from $\mathcal{N}_p(0,\Sigma_j)$ to $\mathcal{N}_p(0,\lambda I_p)$ is \begin{align*} \operatorname{KL} \left( \mathcal{N}_p(0,\Sigma_j), \mathcal{N}_p(0,\lambda I_p) \right) = \frac{1}{2} \left[ \operatorname{tr}\left(\lambda^{-1}\Sigma_j\right) -p -\log\det\left(\lambda^{-1}\Sigma_j\right) \right]. \end{align*} Let $\mu_{j,1},\dots,\mu_{j,p}$ be the eigenvalues of $A_j$. Since $A_j$ has zero diagonal, $\operatorname{tr}(A_j)=0$, and therefore \begin{align*} \sum_{\ell=1}^p \mu_{j,\ell}=0. \end{align*} The eigenvalues of $\lambda^{-1}\Sigma_j$ are \begin{align*} 1+\frac{\varepsilon}{\lambda}\mu_{j,\ell}, \qquad \ell=1,\dots,p. \end{align*} Because $\varepsilon\leq ap^{-1/2}$, $|\mu_{j,\ell}|\leq\|A_j\|_{\mathrm{op}}\leq K\sqrt p$, and $aK/\lambda\leq1/2$, we have \begin{align*} \left| \frac{\varepsilon}{\lambda}\mu_{j,\ell} \right| \leq \frac{aK}{\lambda} \leq \frac{1}{2} \end{align*} for every $j$ and $\ell$. Using the scalar inequality \begin{align*} x-\log(1+x) \leq x^2 \quad \text{for } |x|\leq \frac12, \end{align*} we obtain The preceding eigenvalue representation gives \begin{align*} \operatorname{KL}(P_j,P_0) = \frac{n}{2} \sum_{\ell=1}^p \left[ \frac{\varepsilon}{\lambda}\mu_{j,\ell} - \log\left(1+\frac{\varepsilon}{\lambda}\mu_{j,\ell}\right) \right]. \end{align*} Applying the scalar inequality term by term yields \begin{align*} \operatorname{KL}(P_j,P_0) \leq \frac{n\varepsilon^2}{2\lambda^2} \sum_{\ell=1}^p \mu_{j,\ell}^2. \end{align*} Since $A_j$ is real symmetric, the sum of the squared eigenvalues equals the squared Frobenius norm, so \begin{align*} \operatorname{KL}(P_j,P_0) \leq \frac{n\varepsilon^2}{2\lambda^2} \|A_j\|_F^2. \end{align*} Since $A_j$ has $p(p-1)$ nonzero entries, each equal to $\pm 1$, \begin{align*} \|A_j\|_F^2=p(p-1)\leq p^2. \end{align*} Also $n\varepsilon^2 \leq a^2$. Hence \begin{align*} \operatorname{KL}(P_j,P_0) \leq \frac{a^2}{2\lambda^2}p^2. \end{align*} The collected constraint $a^2/(2\lambda^2)\leq\gamma/8$ gives \begin{align*} \max_{1\leq j\leq N}\operatorname{KL}(P_j,P_0) \leq \frac{\gamma}{8}p^2 \leq \frac{1}{8}\log N. \end{align*} [/step] [step:Apply Fano's testing lower bound] We use the following finite testing consequence of [Fano's inequality](/page/Fano%27s%20Inequality): if probability measures $P_1,\dots,P_N$ satisfy \begin{align*} \max_{1\leq j\leq N}\operatorname{KL}(P_j,P_0) \leq \frac{1}{8}\log N \end{align*} and parameters $\theta_1,\dots,\theta_N$ are pairwise separated by $\|\theta_j-\theta_k\|^2\geq s^2$, then every measurable estimator $\widehat{\theta}:\mathcal{Y}\to\Theta$, where $(\mathcal{Y},\mathcal{A})$ is the observation space and $\Theta$ is the normed parameter space containing $\theta_1,\dots,\theta_N$, satisfies \begin{align*} \sup_{1\leq j\leq N}\mathbb{E}_{P_j}\left[\|\widehat{\theta}-\theta_j\|^2\right]\geq\frac{s^2}{16}. \end{align*} Applying this statement with $\theta_j=\Sigma_j$, $\Theta=\mathbb{R}^{p\times p}$ equipped with $\|\cdot\|_F$, and the separation $s^2$ computed above gives Since $\{\Sigma_1,\dots,\Sigma_N\}\subset \mathcal{C}_p(m,M)$, \begin{align*} \inf_{\widetilde{\Sigma}} \sup_{\Sigma\in\mathcal{C}_p(m,M)} \mathbb{E}_{\Sigma} \left[ \|\widetilde{\Sigma}-\Sigma\|_F^2 \right] \geq \inf_{\widetilde{\Sigma}} \sup_{1\leq j\leq N} \mathbb{E}_{\Sigma_j} \left[ \|\widetilde{\Sigma}-\Sigma_j\|_F^2 \right]. \end{align*} The Fano lower bound and the value of $s^2$ give \begin{align*} \inf_{\widetilde{\Sigma}} \sup_{\Sigma\in\mathcal{C}_p(m,M)} \mathbb{E}_{\Sigma} \left[ \|\widetilde{\Sigma}-\Sigma\|_F^2 \right] \geq \frac{\beta a^2}{16} \left( \frac{p^2}{n}\wedge p \right). \end{align*} [guided] The finite family $\{\Sigma_1,\dots,\Sigma_N\}$ is a subset of the full parameter class $\mathcal{C}_p(m,M)$. Therefore a lower bound over this finite family is automatically a lower bound over the whole class. Fano's method converts a multi-way testing problem into an estimation lower bound. The separation condition says that if an estimator is close to the true covariance in Frobenius norm, then it must identify which packing element generated the data. The Kullback-Leibler condition says that the distributions are too statistically close, on average, for reliable identification. Here the parameter points are $\theta_j=\Sigma_j$ for $j\in\{1,\dots,N\}$, the parameter space is $\Theta=\mathbb{R}^{p\times p}$, and the loss norm is the Frobenius norm. From the previous step, \begin{align*} \|\Sigma_j-\Sigma_k\|_F^2 \geq s^2 = \beta a^2 \left( \frac{p^2}{n}\wedge p \right) \end{align*} for all $j\neq k$. The Kullback-Leibler computation also gave \begin{align*} \max_{1\leq j\leq N}\operatorname{KL}(P_j,P_0) \leq \frac18 \log N. \end{align*} Thus [Fano's inequality](/theorems/1654) applies and yields \begin{align*} \inf_{\widetilde{\Sigma}} \sup_{1\leq j\leq N} \mathbb{E}_{\Sigma_j} \left[ \|\widetilde{\Sigma}-\Sigma_j\|_F^2 \right] \geq \frac{s^2}{16}. \end{align*} Substituting the value of $s^2$ gives \begin{align*} \inf_{\widetilde{\Sigma}} \sup_{1\leq j\leq N} \mathbb{E}_{\Sigma_j} \left[ \|\widetilde{\Sigma}-\Sigma_j\|_F^2 \right] \geq \frac{\beta a^2}{16} \left( \frac{p^2}{n}\wedge p \right). \end{align*} Since every $\Sigma_j$ belongs to $\mathcal{C}_p(m,M)$, the same lower bound holds for the supremum over $\mathcal{C}_p(m,M)$. [/guided] [/step] [step:Handle the one-dimensional case and collect constants] For $p=1$, the class is the interval of variances $[m,M]$. Choose two variances \begin{align*} \sigma_0^2 := \frac{m+M}{2}, \end{align*} and \begin{align*} \sigma_1^2 := \sigma_0^2 + b\left(n^{-1/2}\wedge 1\right), \end{align*} where $b=b(m,M)>0$ is small enough that $\sigma_1^2\in[m,M]$. Let $\widetilde{\sigma}^2:\mathbb{R}^n\to\mathbb{R}$ denote an arbitrary measurable estimator of the scalar variance from the one-dimensional sample. The same two-point [Le Cam testing argument](/page/Le%20Cam%27s%20Method), together with the [Gaussian Kullback-Leibler divergence formula](/page/Gaussian%20Kullback-Leibler%20Divergence), gives \begin{align*} \inf_{\widetilde{\sigma}^2} \sup_{\sigma^2\in[m,M]} \mathbb{E}_{\sigma^2} \left[ |\widetilde{\sigma}^2-\sigma^2|^2 \right] \geq c_1(m,M)\left(\frac{1}{n}\wedge 1\right). \end{align*} This is exactly the asserted bound when $p=1$. Combining the $p\geq 2$ bound with the $p=1$ bound and setting \begin{align*} c(m,M) := \min\left\{ \frac{\beta a^2}{16}, c_1(m,M) \right\} \end{align*} proves that, for every $n,p\in\mathbb{N}$, \begin{align*} \inf_{\widetilde{\Sigma}} \sup_{\Sigma\in\mathcal{C}_p(m,M)} \mathbb{E}_{\Sigma} \left[ \|\widetilde{\Sigma}-\Sigma\|_F^2 \right] \geq c(m,M) \left( \frac{p^2}{n}\wedge p \right). \end{align*} This is the desired Frobenius minimax lower bound over the bounded-spectrum Gaussian covariance class. [/step]

Prerequisites (0/6 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.