[proofplan]
We build a finite packing inside $\Theta_s$ by taking scaled $s$-sparse binary vectors. The Gilbert-Varshamov packing gives exponentially many such vectors with pairwise Hamming distance of order $s$, hence squared Euclidean separation of order $a^2s$ after scaling by an amplitude $a>0$. We choose $a^2$ proportional to $\sigma^2\log(d/s)$ so that the Gaussian Kullback-Leibler divergences from the zero-mean model remain a small fraction of the logarithm of the packing cardinality. [Fano's inequality](/theorems/1654) then forces a constant probability of misidentifying the packing point, and the [testing-to-estimation reduction](/theorems/5895) converts that testing error into squared-error risk.
[/proofplan]
custom_env
admin
[step:Construct a large sparse Hamming packing]
Let $\mathcal S_s \subset \{0,1\}^d$ denote the set of binary vectors with exactly $s$ nonzero coordinates:
\begin{align*}
\mathcal S_s := \left\{v \in \{0,1\}^d : \sum_{j=1}^d v_j = s\right\}.
\end{align*}
For $u,v \in \{0,1\}^d$, define the Hamming distance
\begin{align*}
d_H(u,v) := |\{j \in \{1,\dots,d\} : u_j \ne v_j\}|.
\end{align*}
By the Gilbert-Varshamov sparse packing bound (citing a result not yet in the wiki: Sparse Gilbert-Varshamov Bound), there are universal constants $c_{\mathrm{GV}}>0$ and a subset $\mathcal V \subset \mathcal S_s$ such that
\begin{align*}
|\mathcal V| \ge \exp\!\left(c_{\mathrm{GV}}s\log(d/s)\right)
\end{align*}
and, for all distinct $u,v \in \mathcal V$,
\begin{align*}
d_H(u,v) \ge \frac{s}{2}.
\end{align*}
Let $M := |\mathcal V|$.
[/step]
custom_env
admin
[step:Scale the packing into the parameter space and compute its separation]
Let $a>0$ be an amplitude to be chosen later. For each $v \in \mathcal V$, define the parameter vector
\begin{align*}
\theta_v \in \mathbb R^d,
\qquad
(\theta_v)_j := a v_j
\quad \text{for } j \in \{1,\dots,d\}.
\end{align*}
Since every $v \in \mathcal V$ has exactly $s$ nonzero coordinates, each $\theta_v$ belongs to $\Theta_s$.
For distinct $u,v \in \mathcal V$, the squared Euclidean distance between $\theta_u$ and $\theta_v$ is
\begin{align*}
|\theta_u-\theta_v|^2 = \sum_{j=1}^d a^2(u_j-v_j)^2.
\end{align*}
Because $u_j,v_j \in \{0,1\}$ for each $j \in \{1,\dots,d\}$, the factor $(u_j-v_j)^2$ equals $1$ exactly on the coordinates where $u_j \ne v_j$ and equals $0$ otherwise. Hence
\begin{align*}
|\theta_u-\theta_v|^2 = a^2 d_H(u,v).
\end{align*}
Using the Hamming separation of the packing gives
\begin{align*}
|\theta_u-\theta_v|^2 \ge \frac{a^2s}{2}.
\end{align*}
Thus the finite family $\{\theta_v : v \in \mathcal V\}$ is separated in squared Euclidean distance by at least $a^2s/2$.
[/step]
custom_env
admin
[step:Choose the amplitude so the Gaussian information is small]Let $P_0 := \mathcal N(0,\sigma^2 I_d)$ and, for $v \in \mathcal V$, let $P_v := P_{\theta_v}=\mathcal N(\theta_v,\sigma^2 I_d)$. The Kullback-Leibler divergence from $P_v$ to $P_0$ is
\begin{align*}
D_{\mathrm{KL}}(P_v\,\|\,P_0)
=
\frac{|\theta_v|^2}{2\sigma^2}.
\end{align*}
Indeed, the two Gaussian measures have the same covariance matrix $\sigma^2 I_d$, and their Lebesgue densities differ only by the mean shift. Since $|\theta_v|^2=a^2s$, this gives
\begin{align*}
D_{\mathrm{KL}}(P_v\,\|\,P_0)
=
\frac{a^2s}{2\sigma^2}.
\end{align*}
Choose
\begin{align*}
a^2 := \frac{c_{\mathrm{GV}}}{16}\,\sigma^2\log(d/s).
\end{align*}
Then, for every $v \in \mathcal V$,
\begin{align*}
D_{\mathrm{KL}}(P_v\,\|\,P_0)
=
\frac{c_{\mathrm{GV}}}{32}s\log(d/s)
\le
\frac{1}{32}\log M,
\end{align*}
because $\log M \ge c_{\mathrm{GV}}s\log(d/s)$.[/step]
custom_env
admin
[guided]The amplitude $a$ controls two competing quantities. Larger $a$ makes the parameters easier to separate in squared loss, because the separation is proportional to $a^2s$. But larger $a$ also makes the distributions easier to distinguish statistically, because the Kullback-Leibler divergence from the zero-mean Gaussian grows like $a^2s/\sigma^2$.
For each $v \in \mathcal V$, define $P_v := \mathcal N(\theta_v,\sigma^2 I_d)$ and $P_0 := \mathcal N(0,\sigma^2 I_d)$. Since both Gaussian measures have covariance matrix $\sigma^2 I_d$, the Gaussian Kullback-Leibler formula gives
\begin{align*}
D_{\mathrm{KL}}(P_v\,\|\,P_0)
=
\frac{|\theta_v|^2}{2\sigma^2}.
\end{align*}
Here $|\theta_v|^2=a^2s$, because $\theta_v$ has exactly $s$ nonzero coordinates and each nonzero coordinate equals $a$. Hence
\begin{align*}
D_{\mathrm{KL}}(P_v\,\|\,P_0)
=
\frac{a^2s}{2\sigma^2}.
\end{align*}
We now choose $a$ so that this information is only a small fraction of the logarithm of the packing size. Since the Gilbert-Varshamov packing satisfies
\begin{align*}
\log M \ge c_{\mathrm{GV}}s\log(d/s),
\end{align*}
set
\begin{align*}
a^2 := \frac{c_{\mathrm{GV}}}{16}\,\sigma^2\log(d/s).
\end{align*}
Substitution gives
\begin{align*}
D_{\mathrm{KL}}(P_v\,\|\,P_0)
=
\frac{c_{\mathrm{GV}}}{32}s\log(d/s)
\le
\frac{1}{32}\log M.
\end{align*}
This is the exact information condition needed for Fano's inequality: the distributions indexed by the packing are far apart as parameters, but still too close statistically to identify reliably.[/guided]
custom_env
admin
[step:Convert estimation risk into testing error on the packing]
Let $\hat{\theta}: \mathbb R^d \to \mathbb R^d$ be an arbitrary measurable estimator. Define the induced decoder
\begin{align*}
\hat v: \mathbb R^d \to \mathcal V
\end{align*}
by choosing, for each $x \in \mathbb R^d$, an element $\hat v(x) \in \mathcal V$ satisfying
\begin{align*}
|\hat{\theta}(x)-\theta_{\hat v(x)}|
=
\min_{w \in \mathcal V} |\hat{\theta}(x)-\theta_w|.
\end{align*}
Ties are broken by a fixed ordering of the finite set $\mathcal V$.
Let
\begin{align*}
\rho^2 := \min_{\substack{u,v \in \mathcal V\\u\ne v}}|\theta_u-\theta_v|^2.
\end{align*}
From the separation estimate, $\rho^2 \ge a^2s/2$. If $X \sim P_v$ and $\hat v(X)\ne v$, then the nearest-packing-point rule implies
\begin{align*}
|\hat{\theta}(X)-\theta_v|
\ge
\frac{\rho}{2}.
\end{align*}
Therefore
\begin{align*}
\mathbb E_v[|\hat{\theta}(X)-\theta_v|^2]
\ge
\frac{\rho^2}{4}\,\mathbb P_v(\hat v(X)\ne v)
\ge
\frac{a^2s}{8}\,\mathbb P_v(\hat v(X)\ne v).
\end{align*}
Taking the supremum over $v \in \mathcal V$ and then the infimum over estimators gives
\begin{align*}
\mathfrak M(\Theta_s,|\cdot|^2)
\ge
\frac{a^2s}{8}
\inf_{\tilde v}
\sup_{v \in \mathcal V}
\mathbb P_v(\tilde v(X)\ne v),
\end{align*}
where the infimum is over all measurable decoders $\tilde v:\mathbb R^d \to \mathcal V$.
[/step]
custom_env
admin
[step:Apply Fano's inequality to the sparse Gaussian testing problem]
By Fano's inequality in the form with a common reference measure (citing a result not yet in the wiki: Fano Inequality with Average Kullback-Leibler Radius), if
\begin{align*}
\frac{1}{M}\sum_{v \in \mathcal V}D_{\mathrm{KL}}(P_v\,\|\,P_0)
\le
\frac{1}{32}\log M,
\end{align*}
then there is a universal constant $c_F>0$ such that
\begin{align*}
\inf_{\tilde v}
\sup_{v \in \mathcal V}
\mathbb P_v(\tilde v(X)\ne v)
\ge
c_F.
\end{align*}
The information condition holds by the preceding step, because every term in the average is at most $\frac{1}{32}\log M$. Hence
\begin{align*}
\mathfrak M(\Theta_s,|\cdot|^2)
\ge
\frac{c_F a^2s}{8}.
\end{align*}
Substituting the chosen value of $a^2$ yields
\begin{align*}
\mathfrak M(\Theta_s,|\cdot|^2)
\ge
\frac{c_F c_{\mathrm{GV}}}{128}\,\sigma^2 s\log(d/s).
\end{align*}
Thus the desired lower bound holds with the universal constant
\begin{align*}
c := \frac{c_F c_{\mathrm{GV}}}{128}.
\end{align*}
[/step]