Minimax Lower Bound for Bounded-Spectrum Gaussian Covariance Estimation (Theorem #5939)

Proof

[proofplan] We embed a finite testing problem into the covariance class by perturbing a scalar covariance matrix in many rank-one directions. A spherical packing gives exponentially many unit vectors whose rank-one projectors are separated in operator norm. The Kullback-Leibler divergence between any two of the corresponding Gaussian product measures is at most of order $n\lambda^2$, where $\lambda$ is the perturbation size, and this bound does not depend on the packing cardinality. With the perturbation size chosen of order $\sqrt{p/n}\wedge 1$, the divergences stay below a small multiple of the logarithm of the packing cardinality, so [Fano's inequality](/theorems/1654) forces a nontrivial testing error, and the [testing-to-estimation reduction](/theorems/5895) converts that error into an operator-norm risk lower bound. [/proofplan] [step:Build many separated rank-one perturbations inside the spectrum class] Set \begin{align*} a:=\frac{m+M}{2}, \qquad r:=\frac{M-m}{4}. \end{align*} For each unit vector $u\in\mathbb R^p$, define the rank-one [orthogonal projection](/theorems/437) $P_u:\mathbb R^p\to\mathbb R^p$ by \begin{align*} P_u(x):=(x\cdot u)u \quad \text{for every } x\in\mathbb R^p. \end{align*} Equivalently, in matrix notation, $P_u=uu^\top$. We use the following elementary spherical packing fact: there exist unit vectors $u_1,\dots,u_N\in\mathbb R^p$ such that \begin{align*} N\ge \exp(\beta p) \end{align*} for a universal constant $\beta>0$, and \begin{align*} |u_i\cdot u_j|\le \frac12 \end{align*} for all $i\ne j$. For such $i\ne j$, the operator norm of the difference of the corresponding rank-one projections is \begin{align*} \|P_{u_i}-P_{u_j}\|_{\mathrm{op}} = \sqrt{1-|u_i\cdot u_j|^2} \ge \frac{\sqrt 3}{2}. \end{align*} Let $\lambda\in(0,r]$ be chosen later. For each $j\in\{1,\dots,N\}$, define \begin{align*} \Sigma_j:=aI_p+\lambda P_{u_j}. \end{align*} The eigenvalues of $\Sigma_j$ are $a+\lambda$ in the direction $\operatorname{span}\{u_j\}$ and $a$ on its orthogonal complement. Since $\lambda\le r$, we have \begin{align*} m < a-r \le a \le a+\lambda \le a+r < M. 
\end{align*} Thus $\Sigma_j\in\mathcal C_p(m,M)$ for every $j$. Moreover, for $i\ne j$, \begin{align*} \|\Sigma_i-\Sigma_j\|_{\mathrm{op}} = \lambda\|P_{u_i}-P_{u_j}\|_{\mathrm{op}} \ge \frac{\sqrt 3}{2}\lambda. \end{align*} [guided] The point of using rank-one perturbations is that they create many covariances while keeping the information distance small. We start from the scalar matrix $aI_p$, which lies in the middle of the spectral interval $[m,M]$, and add a small positive perturbation in one direction. Define \begin{align*} a:=\frac{m+M}{2}, \qquad r:=\frac{M-m}{4}. \end{align*} The number $r$ is a spectral safety margin: if $0<\lambda\le r$, then $a+\lambda$ is still strictly below $M$, while $a$ is strictly above $m$. For a unit vector $u\in\mathbb R^p$, define the map $P_u:\mathbb R^p\to\mathbb R^p$ by \begin{align*} P_u(x):=(x\cdot u)u \quad \text{for every } x\in\mathbb R^p. \end{align*} This is the orthogonal projection onto the line $\operatorname{span}\{u\}$. Its matrix is $uu^\top$, and its only eigenvalues are $1$ on $\operatorname{span}\{u\}$ and $0$ on $\operatorname{span}\{u\}^{\perp}$. Choose unit vectors $u_1,\dots,u_N\in\mathbb R^p$ with \begin{align*} N\ge \exp(\beta p), \qquad |u_i\cdot u_j|\le \frac12 \quad\text{for }i\ne j, \end{align*} where $\beta>0$ is a universal constant. This is the standard volumetric packing construction on the Euclidean unit sphere. For each $j$, define \begin{align*} \Sigma_j:=aI_p+\lambda P_{u_j}. \end{align*} Since $P_{u_j}$ has eigenvalue $1$ in the direction $u_j$ and eigenvalue $0$ on $u_j^\perp$, the covariance matrix $\Sigma_j$ has eigenvalue $a+\lambda$ in the direction $u_j$ and eigenvalue $a$ on $u_j^\perp$. Therefore, if $\lambda\le r$, \begin{align*} m < a-r \le a \le a+\lambda \le a+r < M. \end{align*} So every $\Sigma_j$ belongs to $\mathcal C_p(m,M)$. Finally, the separation of the directions gives separation of the covariance matrices. 
For rank-one orthogonal projections, \begin{align*} \|P_{u_i}-P_{u_j}\|_{\mathrm{op}} = \sqrt{1-|u_i\cdot u_j|^2}. \end{align*} Since $|u_i\cdot u_j|\le 1/2$, this gives \begin{align*} \|\Sigma_i-\Sigma_j\|_{\mathrm{op}} = \lambda\|P_{u_i}-P_{u_j}\|_{\mathrm{op}} \ge \frac{\sqrt 3}{2}\lambda. \end{align*} Thus the parameter set contains exponentially many covariances separated by order $\lambda$ in operator norm. [/guided] [/step] [step:Bound the Gaussian product divergences] For $j\in\{1,\dots,N\}$, let $\mathbb P_j$ denote the joint law of $(X_1,\dots,X_n)$ when $X_1,\dots,X_n$ are independent with common distribution $\mathcal N(0,\Sigma_j)$. For two positive definite matrices $\Sigma,\Gamma\in\mathbb R^{p\times p}$, the Kullback-Leibler divergence between the centred Gaussian product laws is \begin{align*} D_{\mathrm{KL}}\left(\mathcal N(0,\Sigma)^{\otimes n}\,\middle\|\,\mathcal N(0,\Gamma)^{\otimes n}\right) = \frac n2 \left( \operatorname{tr}(\Gamma^{-1}\Sigma-I_p) - \log\det(\Gamma^{-1}\Sigma) \right). \end{align*} Suppose now that the eigenvalues of $\Sigma$ and $\Gamma$ lie in $[m,M]$, so that the eigenvalues $t_1,\dots,t_p$ of $\Gamma^{-1/2}\Sigma\Gamma^{-1/2}$ lie in $[m/M,M/m]$. The scalar inequality \begin{align*} t-1-\log t\le L_{m,M}(t-1)^2 \end{align*} holds for every $t\in[m/M,M/m]$, where \begin{align*} L_{m,M}:=\sup_{t\in[m/M,M/m]} \frac{t-1-\log t}{(t-1)^2} \end{align*} with the value at $t=1$ interpreted as $1/2$. The right-hand side of the divergence formula equals $\frac n2\sum_{k=1}^p(t_k-1-\log t_k)$. Applying the scalar inequality to each $t_k$ and using \begin{align*} \sum_{k=1}^p (t_k-1)^2 = \|\Gamma^{-1/2}(\Sigma-\Gamma)\Gamma^{-1/2}\|_F^2 \le \|\Gamma^{-1}\|_{\mathrm{op}}^2\|\Sigma-\Gamma\|_F^2 \le \frac{1}{m^2}\|\Sigma-\Gamma\|_F^2 \end{align*} gives \begin{align*} D_{\mathrm{KL}}\left(\mathcal N(0,\Sigma)^{\otimes n}\,\middle\|\,\mathcal N(0,\Gamma)^{\otimes n}\right) \le \frac{nL_{m,M}}{2m^2}\|\Sigma-\Gamma\|_F^2. \end{align*} For $\Sigma_i-\Sigma_j=\lambda(P_{u_i}-P_{u_j})$, we have \begin{align*} \|P_{u_i}-P_{u_j}\|_F^2 = 2-2(u_i\cdot u_j)^2 \le 2. \end{align*} Hence, for every $i,j$, \begin{align*} D_{\mathrm{KL}}(\mathbb P_i\|\mathbb P_j) \le \frac{nL_{m,M}}{m^2}\lambda^2. 
\end{align*} [/step] [step:Choose the perturbation size so Fano applies] Let \begin{align*} \kappa := \min\left\{ r,\, \frac{m}{4}\sqrt{\frac{\beta}{L_{m,M}}} \right\} \end{align*} and set \begin{align*} \lambda:=\kappa\left(\sqrt{\frac{p}{n}}\wedge 1\right). \end{align*} Then $\lambda\le r$, so the covariance matrices constructed above remain in $\mathcal C_p(m,M)$. Since $\lambda^2\le \kappa^2(p/n)$, the divergence bound gives \begin{align*} D_{\mathrm{KL}}(\mathbb P_i\|\mathbb P_j) \le \frac{nL_{m,M}}{m^2}\lambda^2 \le \frac{L_{m,M}\kappa^2}{m^2}p \le \frac{\beta}{16}p. \end{align*} Because $N\ge \exp(\beta p)$, we have $\log N\ge \beta p$, and therefore \begin{align*} \max_{i,j}D_{\mathrm{KL}}(\mathbb P_i\|\mathbb P_j) \le \frac{1}{16}\log N. \end{align*} By [Fano's inequality](/theorems/1654), every measurable testing rule \begin{align*} \widehat J:(\mathbb R^p)^n\to\{1,\dots,N\} \end{align*} satisfies \begin{align*} \sup_{j\in\{1,\dots,N\}} \mathbb P_j(\widehat J\ne j) \ge \alpha \end{align*} for a universal constant $\alpha>0$. Indeed, in its standard form Fano's inequality bounds the left-hand side below by $1-(\max_{i,j}D_{\mathrm{KL}}(\mathbb P_i\|\mathbb P_j)+\log 2)/\log N\ge 15/16-\log 2/\log N$, so one may take $\alpha=7/16$ whenever $N\ge 4$. [/step] [step:Convert testing error into estimation risk] Let \begin{align*} \widetilde\Sigma:(\mathbb R^p)^n\to\mathbb R^{p\times p} \end{align*} be any measurable estimator. From it, define the nearest-neighbour testing rule $\widehat J:(\mathbb R^p)^n\to\{1,\dots,N\}$ by \begin{align*} \widehat J(x):=\min\operatorname*{argmin}_{1\le k\le N} \|\widetilde\Sigma(x)-\Sigma_k\|_{\mathrm{op}} \quad \text{for every } x\in(\mathbb R^p)^n, \end{align*} where the minimum is used only to break ties. If the true index is $j$ and $\widehat J(x)\ne j$, then by the definition of nearest neighbour and the triangle inequality, \begin{align*} \|\Sigma_j-\Sigma_{\widehat J(x)}\|_{\mathrm{op}} \le \|\Sigma_j-\widetilde\Sigma(x)\|_{\mathrm{op}} + \|\widetilde\Sigma(x)-\Sigma_{\widehat J(x)}\|_{\mathrm{op}} \le 2\|\widetilde\Sigma(x)-\Sigma_j\|_{\mathrm{op}}. 
\end{align*} Since distinct parameter points are separated by at least $(\sqrt 3/2)\lambda$, the event $\{\widehat J\ne j\}$ implies \begin{align*} \|\widetilde\Sigma-\Sigma_j\|_{\mathrm{op}} \ge \frac{\sqrt 3}{4}\lambda. \end{align*} Therefore, \begin{align*} \mathbb E_j\left[\|\widetilde\Sigma-\Sigma_j\|_{\mathrm{op}}\right] \ge \frac{\sqrt 3}{4}\lambda\,\mathbb P_j(\widehat J\ne j). \end{align*} Taking the supremum over $j$ and using the Fano lower bound, \begin{align*} \sup_{1\le j\le N} \mathbb E_j\left[\|\widetilde\Sigma-\Sigma_j\|_{\mathrm{op}}\right] \ge \frac{\sqrt 3}{4}\alpha\lambda. \end{align*} Since $\{\Sigma_1,\dots,\Sigma_N\}\subset\mathcal C_p(m,M)$, it follows that \begin{align*} \sup_{\Sigma\in\mathcal C_p(m,M)} \mathbb E_\Sigma\left[\|\widetilde\Sigma-\Sigma\|_{\mathrm{op}}\right] \ge \frac{\sqrt 3}{4}\alpha\kappa \left(\sqrt{\frac{p}{n}}\wedge 1\right). \end{align*} Finally, because the estimator $\widetilde\Sigma$ was arbitrary, taking the infimum over all measurable estimators gives the desired bound with \begin{align*} c(m,M):=\frac{\sqrt 3}{4}\alpha\kappa>0. \end{align*} [/step]
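
The construction can be sanity-checked numerically on a small instance. The following is a minimal sketch in plain Python, not part of the proof: the values $m=1$, $M=3$, $p=4$, $n=100$, $\lambda=0.1$, the two unit vectors, and all helper names (`sigma`, `sigma_inv`, `op_norm_sym`, `kl_product`, `L_const`) are illustrative assumptions. Since the packing constant $\beta$ is not made explicit above, $\lambda$ is picked by hand inside $(0,r]$ rather than through $\kappa$, and $L_{m,M}$ is only approximated on a grid. The script compares the operator-norm separation of two perturbed covariances with $\lambda\sqrt{1-(u\cdot v)^2}$ and the Gaussian product divergence with the bound $nL_{m,M}\lambda^2/m^2$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sigma(a, lam, u):
    """Sigma = a*I_p + lam*u u^T for a unit vector u (rank-one perturbation)."""
    p = len(u)
    return [[(a if i == k else 0.0) + lam * u[i] * u[k] for k in range(p)]
            for i in range(p)]

def sigma_inv(a, lam, u):
    """Inverse of a*I_p + lam*u u^T via the Sherman-Morrison formula."""
    p = len(u)
    g = lam / (a * (a + lam))
    return [[(1.0 / a if i == k else 0.0) - g * u[i] * u[k] for k in range(p)]
            for i in range(p)]

def matvec(A, x):
    return [dot(row, x) for row in A]

def op_norm_sym(A, iters=50):
    """Operator norm of a symmetric matrix by power iteration on A^2."""
    x = [1.0 / (k + 1) for k in range(len(A))]  # generic starting vector
    for _ in range(iters):
        y = matvec(A, matvec(A, x))
        ny = math.sqrt(dot(y, y))
        if ny == 0.0:
            return 0.0
        x = [yi / ny for yi in y]
    Ax = matvec(A, x)
    return math.sqrt(dot(Ax, Ax))

def kl_product(a, lam, u, v, n):
    """KL(N(0,Sigma_u)^n || N(0,Sigma_v)^n) from the trace/log-det formula.

    Both covariances have determinant a^(p-1)*(a+lam), so the log-det term is 0.
    """
    p = len(u)
    S, Ginv = sigma(a, lam, u), sigma_inv(a, lam, v)
    trace = sum(dot(Ginv[i], [S[k][i] for k in range(p)]) for i in range(p))
    return 0.5 * n * (trace - p)

def L_const(m, M, grid=2000):
    """Grid approximation of sup over [m/M, M/m] of (t-1-log t)/(t-1)^2."""
    def g(t):
        x = t - 1.0
        return 0.5 if abs(x) < 1e-8 else (x - math.log1p(x)) / (x * x)
    lo, hi = m / M, M / m
    return max(g(lo + (hi - lo) * k / grid) for k in range(grid + 1))

# Illustrative instance (all numbers are assumptions, not from the theorem).
m, M, n = 1.0, 3.0, 100
a, r = (m + M) / 2, (M - m) / 4
lam = 0.1                                   # any value in (0, r]
u = [1.0, 0.0, 0.0, 0.0]
v = [0.5, math.sqrt(3) / 2, 0.0, 0.0]       # unit vector with u.v = 1/2
c = dot(u, v)

Su, Sv = sigma(a, lam, u), sigma(a, lam, v)
diff = [[x - y for x, y in zip(ru, rv)] for ru, rv in zip(Su, Sv)]
sep = op_norm_sym(diff)
kl = kl_product(a, lam, u, v, n)
bound = n * L_const(m, M) * lam ** 2 / m ** 2

print("separation:", sep, "predicted:", lam * math.sqrt(1 - c * c))
print("KL:", kl, "upper bound:", bound)
```

In this instance the divergence sits well below the bound, which is expected: the bound is stated uniformly over all pairs of covariances with spectrum in $[m,M]$, not only over rank-one perturbations of $aI_p$.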
