Sparse Gaussian Mean Minimax Lower Bound — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We construct a finite packing of $k$-sparse vectors whose nonzero entries all have the same amplitude $a$. The Gilbert-Varshamov argument gives exponentially many supports separated in Hamming distance, hence the corresponding mean vectors are separated in Euclidean distance. Choosing $a^2$ proportional to $\sigma^2 \log(d/k)$ keeps the average Kullback-Leibler divergence small compared with the logarithm of the packing size, so Fano's testing reduction forces squared-error risk of order $a^2 k$. The radius assumption places the whole packing inside $\Theta_k(R)$, and the unrestricted lower bound follows by monotonicity of the parameter class. [/proofplan] [step:Construct a separated family of $k$-sparse binary vectors] Let \begin{align*} \mathcal{S}_{d,k} := \{v \in \{0,1\}^d : |\operatorname{supp}(v)| = k\} \end{align*} be the set of binary vectors with exactly $k$ ones. Let $d_H: \{0,1\}^d \times \{0,1\}^d \to \{0,1,\dots,d\}$ denote Hamming distance: \begin{align*} d_H(u,v) := |\{j \in \{1,\dots,d\} : u_j \neq v_j\}|. \end{align*} [claim:Gilbert-Varshamov packing for sparse supports] There is a universal constant $c_{\mathrm{GV}} > 0$ and a subset $\mathcal{V} \subset \mathcal{S}_{d,k}$ such that \begin{align*} d_H(u,v) \geq \frac{k}{2} \end{align*} for all distinct $u,v \in \mathcal{V}$, and \begin{align*} \log |\mathcal{V}| \geq c_{\mathrm{GV}} k \log\left(\frac{d}{k}\right). \end{align*} [/claim] [proof] This is the constant-weight form of the [Gilbert-Varshamov bound](/page/Gilbert-Varshamov%20Bound), in its endpoint-uniform formulation: there is a numerical constant $c_{\mathrm{GV}}>0$ such that for every pair of integers $d,k$ with $1\leq k\leq d/2$, the constant-weight binary space $\mathcal{S}_{d,k}$ contains a subset $\mathcal{V}$ with minimum Hamming distance at least $k/2$ and logarithmic cardinality at least $c_{\mathrm{GV}}k\log(d/k)$. The hypotheses of this formulation are satisfied because the theorem statement now assumes $d,k\in\mathbb N$ and $1\leq k\leq d/2$. Applying the bound gives a subset $\mathcal{V}\subset\mathcal{S}_{d,k}$ such that distinct $u,v\in\mathcal{V}$ satisfy \begin{align*} d_H(u,v)\geq \frac{k}{2}, \end{align*} and \begin{align*} \log |\mathcal{V}| \geq c_{\mathrm{GV}}k\log\left(\frac{d}{k}\right). \end{align*} The constant $c_{\mathrm{GV}}$ is universal in the cited endpoint-uniform statement, so no additional endpoint loss is hidden when $d/k=2$. This proves the claim. [/proof] [guided] The goal of this step is to produce many possible supports, but with a fixed amount of separation between any two supports. We work inside \begin{align*} \mathcal{S}_{d,k} := \{v \in \{0,1\}^d : |\operatorname{supp}(v)| = k\}, \end{align*} where each vector records a support of size $k$. For $u,v\in\{0,1\}^d$, the Hamming distance is defined by \begin{align*} d_H(u,v) := |\{j\in\{1,\dots,d\}:u_j\neq v_j\}|. \end{align*} We use the constant-weight form of the [Gilbert-Varshamov bound](/page/Gilbert-Varshamov%20Bound), specifically the endpoint-uniform version for integers $d,k$ with $1\leq k\leq d/2$. This result applies to binary words of length $d$ with exactly $k$ ones and gives a large subcollection whose pairwise Hamming distances are bounded below. Its hypotheses are met here because the theorem statement assumes $d,k\in\mathbb N$ and $1\leq k\leq d/2$, so $\mathcal{S}_{d,k}$ is a nonempty constant-weight binary space and the requested distance threshold $\lceil k/2\rceil$ is within the admissible constant-weight range. Applying this endpoint-uniform bound with minimum Hamming distance $\lceil k/2\rceil$ gives a subset $\mathcal{V}\subset\mathcal{S}_{d,k}$ and a universal constant $c_{\mathrm{GV}}>0$ such that distinct $u,v\in\mathcal{V}$ satisfy \begin{align*} d_H(u,v)\geq \frac{k}{2}, \end{align*} and the number of selected supports obeys \begin{align*} \log|\mathcal{V}| \geq c_{\mathrm{GV}}k\log\left(\frac{d}{k}\right). \end{align*} The important point is that the constant in this cited constant-weight packing theorem is uniform for every ratio $d/k\geq2$. Thus no separate absorption of endpoint constants is needed when $d/k=2$. This proves the packing claim in the uniform form required for the later Fano argument. [/guided] [/step] [step:Convert the support packing into a Euclidean packing inside $\Theta_k(R)$] Let $c_1 > 0$ be a universal constant to be chosen below, and define the amplitude $a > 0$ by \begin{align*} a^2 := c_1 \sigma^2 \log\left(\frac{d}{k}\right). \end{align*} For each $v \in \mathcal{V}$, define the mean vector $\theta_v \in \mathbb{R}^d$ by \begin{align*} (\theta_v)_j := a v_j, \qquad j \in \{1,\dots,d\}. \end{align*} Then $\theta_v$ is $k$-sparse and \begin{align*} |\theta_v|^2 = a^2 |v|^2 = a^2 k. \end{align*} Choose $C_0 \geq c_1$. The hypothesis on $R$ gives \begin{align*} |\theta_v|^2 = c_1 \sigma^2 k \log\left(\frac{d}{k}\right) \leq R^2, \end{align*} so $\theta_v \in \Theta_k(R)$ for every $v \in \mathcal{V}$. If $u,v \in \mathcal{V}$ are distinct, then \begin{align*} |\theta_u - \theta_v|^2 = a^2 d_H(u,v) \geq \frac{a^2 k}{2}. \end{align*} Thus the finite set \begin{align*} \Theta_{\mathcal{V}} := \{\theta_v : v \in \mathcal{V}\} \end{align*} is a Euclidean packing in $\Theta_k(R)$ with squared separation at least $a^2 k/2$. [guided] We now turn the separated supports into separated Gaussian means. Let $c_1>0$ be a universal constant, to be fixed when we compare Kullback-Leibler divergence with entropy, and define the amplitude $a>0$ by \begin{align*} a^2 := c_1\sigma^2\log\left(\frac{d}{k}\right). \end{align*} For each $v \in \mathcal{V}$, define the map from supports to mean vectors by declaring $\theta_v \in \mathbb{R}^d$ coordinatewise: \begin{align*} (\theta_v)_j := a v_j, \qquad j \in \{1,\dots,d\}. \end{align*} Since $v$ has exactly $k$ nonzero coordinates and each nonzero coordinate equals $1$, the vector $\theta_v$ has exactly $k$ nonzero coordinates and each nonzero coordinate equals $a$. Therefore $\theta_v$ is $k$-sparse and \begin{align*} |\theta_v|^2 = \sum_{j=1}^d a^2 v_j^2 = a^2 k. \end{align*} The radius condition is used precisely here. If we choose the universal constant in the theorem so that $C_0 \geq c_1$, then the hypothesis \begin{align*} R^2\geq C_0\sigma^2 k\log\left(\frac{d}{k}\right) \end{align*} implies \begin{align*} |\theta_v|^2 = c_1\sigma^2 k\log\left(\frac{d}{k}\right) \leq R^2. \end{align*} Thus every $\theta_v$ belongs to the restricted sparse class $\Theta_k(R)$. It remains to check separation in Euclidean distance. If $u,v \in \mathcal{V}$ are distinct, then the coordinates where $u$ and $v$ differ contribute exactly $a^2$ each to the squared Euclidean distance, while the other coordinates contribute $0$. Hence \begin{align*} |\theta_u-\theta_v|^2 = \sum_{j=1}^d a^2(u_j-v_j)^2 = a^2d_H(u,v) \geq \frac{a^2k}{2}. \end{align*} Therefore \begin{align*} \Theta_{\mathcal{V}} := \{\theta_v : v \in \mathcal{V}\} \end{align*} is a finite Euclidean packing contained in $\Theta_k(R)$ with squared separation at least $a^2k/2$. [/guided] [/step] [step:Bound the Gaussian Kullback-Leibler divergences from the null model] Let $P_\theta$ denote the distribution $\mathcal{N}(\theta,\sigma^2 I_d)$ on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, and let $P_0 := P_{\theta_0}$ with $\theta_0 = 0 \in \mathbb{R}^d$. For each $v \in \mathcal{V}$, the [Kullback-Leibler divergence formula for Gaussian measures](/page/Kullback-Leibler%20Divergence) with common covariance $\sigma^2 I_d$ gives \begin{align*} D(P_{\theta_v}\|P_0) = \frac{|\theta_v|^2}{2\sigma^2} = \frac{a^2 k}{2\sigma^2} = \frac{c_1}{2} k \log\left(\frac{d}{k}\right). \end{align*} Choose \begin{align*} c_1 := \frac{c_{\mathrm{GV}}}{16}. \end{align*} Since $\log |\mathcal{V}| \geq c_{\mathrm{GV}} k \log(d/k)$, we have \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} D(P_{\theta_v}\|P_0) \leq \frac{1}{32}\log |\mathcal{V}|. \end{align*} [guided] We need the Fano argument to see many hypotheses but only a controlled amount of information distinguishing them from a reference distribution. Let $P_\theta$ denote the probability distribution $\mathcal{N}(\theta,\sigma^2I_d)$ on the measurable space $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, and let $P_0 := P_{\theta_0}$ for the zero vector $\theta_0=0\in\mathbb{R}^d$. By the [Kullback-Leibler divergence formula for Gaussian measures](/page/Kullback-Leibler%20Divergence), Gaussian measures with the same covariance matrix $\sigma^2I_d$ satisfy \begin{align*} D(P_\theta\|P_{\theta'}) = \frac{|\theta-\theta'|^2}{2\sigma^2}. \end{align*} The common covariance hypothesis is satisfied because every distribution in the finite family has covariance $\sigma^2I_d$. Applying this formula with $\theta=\theta_v$ and $\theta'=0$ gives, for every $v\in\mathcal{V}$, \begin{align*} D(P_{\theta_v}\|P_0) = \frac{|\theta_v|^2}{2\sigma^2} = \frac{a^2k}{2\sigma^2} = \frac{c_1}{2}k\log\left(\frac{d}{k}\right). \end{align*} This value does not depend on $v$, so averaging over $\mathcal{V}$ leaves the same bound. Now choose \begin{align*} c_1 := \frac{c_{\mathrm{GV}}}{16}. \end{align*} Since the support-packing step proved \begin{align*} \log|\mathcal{V}| \geq c_{\mathrm{GV}}k\log\left(\frac{d}{k}\right), \end{align*} we obtain \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\theta_v}\|P_0) = \frac{c_{\mathrm{GV}}}{32}k\log\left(\frac{d}{k}\right) \leq \frac{1}{32}\log|\mathcal{V}|. \end{align*} This is exactly the small-average-divergence condition needed for Fano's method. [/guided] [/step] [step:Apply Fano's testing reduction to force squared error] We use the finite-packing form of [Fano's inequality](/page/Fano%27s%20Inequality) with a reference distribution. This reference-measure version follows from the usual mutual-information form because, for a uniform index $M_0$ on $M$ and observation $X\sim P_{\vartheta_{M_0}}$, the mutual information satisfies \begin{align*} I(M_0;X) \leq \frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q) \end{align*} for every probability measure $Q$ dominating the family, by the convexity identity comparing the mixture law with $Q$. Therefore, if $\{\vartheta_m:m\in M\}$ is a finite subset of a parameter space, if \begin{align*} |\vartheta_m-\vartheta_{m'}|^2\geq 4s^2 \end{align*} for all distinct $m,m'\in M$, and if \begin{align*} \frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q) \leq \alpha\log|M| \end{align*} for some probability measure $Q$ and some $\alpha\in(0,1/8]$, then every measurable estimator $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$ satisfies \begin{align*} \sup_{m\in M}\mathbb{E}_{\vartheta_m}\left[|\hat{\theta}-\vartheta_m|^2\right] \geq \frac{s^2}{2}. \end{align*} Indeed, compose $\hat{\theta}$ with a nearest-neighbour decoder into the packing. The separation condition implies that a decoding error can occur only if $|\hat{\theta}-\vartheta_m|\geq s$, and [Fano's inequality](/theorems/1654) lower-bounds the maximal decoding error probability under the displayed average divergence hypothesis. For the packing $\Theta_{\mathcal{V}}$, the squared separation is at least $a^2k/2$. Hence we may take \begin{align*} s^2 := \frac{a^2 k}{8}. \end{align*} The divergence estimate from the previous step gives the Fano hypothesis with reference measure $Q=P_0$ and $\alpha = 1/32$. Therefore The finite packing is contained in $\Theta_k(R)$, so restricting the supremum to $\Theta_{\mathcal{V}}$ gives \begin{align*} \inf_{\hat{\theta}} \sup_{\theta \in \Theta_k(R)} \mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq \inf_{\hat{\theta}} \sup_{v \in \mathcal{V}} \mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right]. \end{align*} Fano's testing reduction gives \begin{align*} \inf_{\hat{\theta}} \sup_{v \in \mathcal{V}} \mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right] \geq \frac{s^2}{2}. \end{align*} By the definition $s^2=a^2k/8$, \begin{align*} \frac{s^2}{2} = \frac{a^2k}{16} = \frac{c_1}{16}\sigma^2 k\log\left(\frac{d}{k}\right). \end{align*} Thus the desired lower bound holds with \begin{align*} c := \frac{c_1}{16} = \frac{c_{\mathrm{GV}}}{256}. \end{align*} [guided] We apply the finite-packing form of [Fano's inequality](/page/Fano%27s%20Inequality) to convert the packing into a minimax lower bound. The form needed here permits an external reference probability measure $Q$: if a finite family $\{\vartheta_m:m\in M\}$ has pairwise squared Euclidean separation at least $4s^2$ and satisfies \begin{align*} \frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q) \leq \alpha\log|M| \end{align*} for some $\alpha\in(0,1/8]$, then every measurable estimator $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$ has squared-error risk at least $s^2/2$ on one member of the family. Why is this the same Fano mechanism? If $M_0$ is uniform on $M$ and $X\sim P_{\vartheta_{M_0}}$, the usual mutual-information version of Fano applies to the induced testing problem. The information term obeys \begin{align*} I(M_0;X) \leq \frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q), \end{align*} so the displayed average divergence is sufficient. The reduction from estimation to testing is by nearest-neighbour decoding: if $|\hat{\theta}-\vartheta_m|<s$, then the separation condition makes $\vartheta_m$ the unique nearest packing point, so estimation error below $s$ prevents testing error; Fano's inequality then supplies the testing-error lower bound. Let $M:=\mathcal{V}$ and let $\vartheta_v:=\theta_v$ for $v\in\mathcal{V}$. The Euclidean packing step proved \begin{align*} |\theta_u-\theta_v|^2\geq \frac{a^2k}{2} \end{align*} for distinct $u,v\in\mathcal{V}$. Therefore we set \begin{align*} s^2 := \frac{a^2k}{8}, \end{align*} so that $4s^2=a^2k/2$, exactly matching the required separation hypothesis. The KL step proved \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\theta_v}\|P_0) \leq \frac{1}{32}\log|\mathcal{V}|, \end{align*} so the divergence hypothesis holds with reference measure $Q=P_0$ and $\alpha=1/32\in(0,1/8]$. Hence Fano's method gives \begin{align*} \inf_{\hat{\theta}}\sup_{v\in\mathcal{V}}\mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right] \geq \frac{s^2}{2}. \end{align*} Because $\Theta_{\mathcal{V}}\subset\Theta_k(R)$, taking the supremum over all of $\Theta_k(R)$ can only increase the risk: \begin{align*} \inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq \inf_{\hat{\theta}}\sup_{v\in\mathcal{V}}\mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right]. \end{align*} Combining these inequalities and substituting $s^2=a^2k/8$ and $a^2=c_1\sigma^2\log(d/k)$ yields \begin{align*} \inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq \frac{c_1}{16}\sigma^2k\log\left(\frac{d}{k}\right). \end{align*} Thus the restricted minimax lower bound holds with \begin{align*} c := \frac{c_1}{16}=\frac{c_{\mathrm{GV}}}{256}. \end{align*} [/guided] [/step] [step:Pass from the bounded sparse class to the unrestricted sparse class] Since $\Theta_k(R) \subset \Theta_k$, for every measurable estimator $\hat{\theta}: \mathbb{R}^d \to \mathbb{R}^d$, \begin{align*} \sup_{\theta \in \Theta_k} \mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq \sup_{\theta \in \Theta_k(R)} \mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]. \end{align*} Taking the infimum over all measurable estimators on both sides gives \begin{align*} \inf_{\hat{\theta}} \sup_{\theta \in \Theta_k} \mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq \inf_{\hat{\theta}} \sup_{\theta \in \Theta_k(R)} \mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq c\sigma^2 k \log\left(\frac{d}{k}\right). \end{align*} This proves both assertions. [guided] The final step uses monotonicity of the supremum over parameter classes. The restricted class is contained in the unrestricted class: \begin{align*} \Theta_k(R) \subset \Theta_k. \end{align*} Therefore, for every measurable estimator $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$, the supremum of the risk over the larger class is at least the supremum over the smaller class: \begin{align*} \sup_{\theta\in\Theta_k}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq \sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]. \end{align*} Now take the infimum over all measurable estimators on both sides. Since the inequality holds estimator by estimator, it remains true after taking infima: \begin{align*} \inf_{\hat{\theta}}\sup_{\theta\in\Theta_k}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq \inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]. \end{align*} The restricted lower bound proved in the preceding step gives \begin{align*} \inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right] \geq c\sigma^2k\log\left(\frac{d}{k}\right). \end{align*} Combining the last two displayed inequalities proves the same lower bound over $\Theta_k$, and therefore proves both assertions of the theorem. [/guided] [/step]

Prerequisites (0/4 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Fano's Inequality

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.

Sparse Gaussian Mean Minimax Lower Bound (Theorem # 5903)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Sparse Gaussian Mean Minimax Lower Bound (Theorem # 5903)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further