[proofplan]
We construct a finite packing of $k$-sparse vectors whose nonzero entries all have the same amplitude $a$. The Gilbert-Varshamov argument gives exponentially many supports separated in Hamming distance, hence the corresponding mean vectors are separated in Euclidean distance. Choosing $a^2$ proportional to $\sigma^2 \log(d/k)$ keeps the average Kullback-Leibler divergence small compared with the logarithm of the packing size, so Fano's testing reduction forces squared-error risk of order $a^2 k$. The radius assumption places the whole packing inside $\Theta_k(R)$, and the unrestricted lower bound follows by monotonicity of the parameter class.
[/proofplan]
[step:Construct a separated family of $k$-sparse binary vectors]
Let
\begin{align*}
\mathcal{S}_{d,k} := \{v \in \{0,1\}^d : |\operatorname{supp}(v)| = k\}
\end{align*}
be the set of binary vectors with exactly $k$ ones. Let $d_H: \{0,1\}^d \times \{0,1\}^d \to \{0,1,\dots,d\}$ denote Hamming distance:
\begin{align*}
d_H(u,v) := |\{j \in \{1,\dots,d\} : u_j \neq v_j\}|.
\end{align*}
[claim:Gilbert-Varshamov packing for sparse supports]
There is a universal constant $c_{\mathrm{GV}} > 0$ and a subset $\mathcal{V} \subset \mathcal{S}_{d,k}$ such that
\begin{align*}
d_H(u,v) \geq \frac{k}{2}
\end{align*}
for all distinct $u,v \in \mathcal{V}$, and
\begin{align*}
\log |\mathcal{V}| \geq c_{\mathrm{GV}} k \log\left(\frac{d}{k}\right).
\end{align*}
[/claim]
[proof]
This is the constant-weight form of the [Gilbert-Varshamov bound](/page/Gilbert-Varshamov%20Bound), in its endpoint-uniform formulation: there is a numerical constant $c_{\mathrm{GV}}>0$ such that for every pair of integers $d,k$ with $1\leq k\leq d/2$, the constant-weight binary space $\mathcal{S}_{d,k}$ contains a subset $\mathcal{V}$ with minimum Hamming distance at least $k/2$ and logarithmic cardinality at least $c_{\mathrm{GV}}k\log(d/k)$. The hypotheses of this formulation are satisfied because the theorem statement now assumes $d,k\in\mathbb N$ and $1\leq k\leq d/2$. Applying the bound gives a subset $\mathcal{V}\subset\mathcal{S}_{d,k}$ such that distinct $u,v\in\mathcal{V}$ satisfy
\begin{align*}
d_H(u,v)\geq \frac{k}{2},
\end{align*}
and
\begin{align*}
\log |\mathcal{V}|
\geq
c_{\mathrm{GV}}k\log\left(\frac{d}{k}\right).
\end{align*}
The constant $c_{\mathrm{GV}}$ is universal in the cited endpoint-uniform statement, so no additional endpoint loss is hidden when $d/k=2$. This proves the claim.
[/proof]
[guided]
The goal of this step is to produce many possible supports, but with a fixed amount of separation between any two supports. We work inside
\begin{align*}
\mathcal{S}_{d,k} := \{v \in \{0,1\}^d : |\operatorname{supp}(v)| = k\},
\end{align*}
where each vector records a support of size $k$. For $u,v\in\{0,1\}^d$, the Hamming distance is defined by
\begin{align*}
d_H(u,v) := |\{j\in\{1,\dots,d\}:u_j\neq v_j\}|.
\end{align*}
We use the constant-weight form of the [Gilbert-Varshamov bound](/page/Gilbert-Varshamov%20Bound), specifically the endpoint-uniform version for integers $d,k$ with $1\leq k\leq d/2$. This result applies to binary words of length $d$ with exactly $k$ ones and gives a large subcollection whose pairwise Hamming distances are bounded below. Its hypotheses are met here because the theorem statement assumes $d,k\in\mathbb N$ and $1\leq k\leq d/2$, so $\mathcal{S}_{d,k}$ is a nonempty constant-weight binary space and the requested distance threshold $\lceil k/2\rceil$ is within the admissible constant-weight range.
Applying this endpoint-uniform bound with minimum Hamming distance $\lceil k/2\rceil$ gives a subset $\mathcal{V}\subset\mathcal{S}_{d,k}$ and a universal constant $c_{\mathrm{GV}}>0$ such that distinct $u,v\in\mathcal{V}$ satisfy
\begin{align*}
d_H(u,v)\geq \frac{k}{2},
\end{align*}
and the number of selected supports obeys
\begin{align*}
\log|\mathcal{V}|
\geq
c_{\mathrm{GV}}k\log\left(\frac{d}{k}\right).
\end{align*}
The important point is that the constant in this cited constant-weight packing theorem is uniform for every ratio $d/k\geq2$. Thus no separate absorption of endpoint constants is needed when $d/k=2$. This proves the packing claim in the uniform form required for the later Fano argument.
[/guided]
[/step]
[step:Convert the support packing into a Euclidean packing inside $\Theta_k(R)$]
Let $c_1 > 0$ be a universal constant to be chosen below, and define the amplitude $a > 0$ by
\begin{align*}
a^2 := c_1 \sigma^2 \log\left(\frac{d}{k}\right).
\end{align*}
For each $v \in \mathcal{V}$, define the mean vector $\theta_v \in \mathbb{R}^d$ by
\begin{align*}
(\theta_v)_j := a v_j,
\qquad j \in \{1,\dots,d\}.
\end{align*}
Then $\theta_v$ is $k$-sparse and
\begin{align*}
|\theta_v|^2 = a^2 |v|^2 = a^2 k.
\end{align*}
Choose $C_0 \geq c_1$. The hypothesis on $R$ gives
\begin{align*}
|\theta_v|^2 = c_1 \sigma^2 k \log\left(\frac{d}{k}\right)
\leq R^2,
\end{align*}
so $\theta_v \in \Theta_k(R)$ for every $v \in \mathcal{V}$.
If $u,v \in \mathcal{V}$ are distinct, then
\begin{align*}
|\theta_u - \theta_v|^2
=
a^2 d_H(u,v)
\geq
\frac{a^2 k}{2}.
\end{align*}
Thus the finite set
\begin{align*}
\Theta_{\mathcal{V}} := \{\theta_v : v \in \mathcal{V}\}
\end{align*}
is a Euclidean packing in $\Theta_k(R)$ with squared separation at least $a^2 k/2$.
[guided]
We now turn the separated supports into separated Gaussian means. Let $c_1>0$ be a universal constant, to be fixed when we compare Kullback-Leibler divergence with entropy, and define the amplitude $a>0$ by
\begin{align*}
a^2 := c_1\sigma^2\log\left(\frac{d}{k}\right).
\end{align*}
For each $v \in \mathcal{V}$, define the map from supports to mean vectors by declaring $\theta_v \in \mathbb{R}^d$ coordinatewise:
\begin{align*}
(\theta_v)_j := a v_j,
\qquad j \in \{1,\dots,d\}.
\end{align*}
Since $v$ has exactly $k$ nonzero coordinates and each nonzero coordinate equals $1$, the vector $\theta_v$ has exactly $k$ nonzero coordinates and each nonzero coordinate equals $a$. Therefore $\theta_v$ is $k$-sparse and
\begin{align*}
|\theta_v|^2
=
\sum_{j=1}^d a^2 v_j^2
=
a^2 k.
\end{align*}
The radius condition is used precisely here. If we choose the universal constant in the theorem so that $C_0 \geq c_1$, then the hypothesis
\begin{align*}
R^2\geq C_0\sigma^2 k\log\left(\frac{d}{k}\right)
\end{align*}
implies
\begin{align*}
|\theta_v|^2
=
c_1\sigma^2 k\log\left(\frac{d}{k}\right)
\leq
R^2.
\end{align*}
Thus every $\theta_v$ belongs to the restricted sparse class $\Theta_k(R)$.
It remains to check separation in Euclidean distance. If $u,v \in \mathcal{V}$ are distinct, then the coordinates where $u$ and $v$ differ contribute exactly $a^2$ each to the squared Euclidean distance, while the other coordinates contribute $0$. Hence
\begin{align*}
|\theta_u-\theta_v|^2
=
\sum_{j=1}^d a^2(u_j-v_j)^2
=
a^2d_H(u,v)
\geq
\frac{a^2k}{2}.
\end{align*}
Therefore
\begin{align*}
\Theta_{\mathcal{V}} := \{\theta_v : v \in \mathcal{V}\}
\end{align*}
is a finite Euclidean packing contained in $\Theta_k(R)$ with squared separation at least $a^2k/2$.
[/guided]
[/step]
[step:Bound the Gaussian Kullback-Leibler divergences from the null model]
Let $P_\theta$ denote the distribution $\mathcal{N}(\theta,\sigma^2 I_d)$ on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, and let $P_0 := P_{\theta_0}$ with $\theta_0 = 0 \in \mathbb{R}^d$. For each $v \in \mathcal{V}$, the [Kullback-Leibler divergence formula for Gaussian measures](/page/Kullback-Leibler%20Divergence) with common covariance $\sigma^2 I_d$ gives
\begin{align*}
D(P_{\theta_v}\|P_0)
=
\frac{|\theta_v|^2}{2\sigma^2}
=
\frac{a^2 k}{2\sigma^2}
=
\frac{c_1}{2} k \log\left(\frac{d}{k}\right).
\end{align*}
Choose
\begin{align*}
c_1 := \frac{c_{\mathrm{GV}}}{16}.
\end{align*}
Since $\log |\mathcal{V}| \geq c_{\mathrm{GV}} k \log(d/k)$, we have
\begin{align*}
\frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} D(P_{\theta_v}\|P_0)
\leq
\frac{1}{32}\log |\mathcal{V}|.
\end{align*}
[guided]
We need the Fano argument to see many hypotheses but only a controlled amount of information distinguishing them from a reference distribution. Let $P_\theta$ denote the probability distribution $\mathcal{N}(\theta,\sigma^2I_d)$ on the measurable space $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, and let $P_0 := P_{\theta_0}$ for the zero vector $\theta_0=0\in\mathbb{R}^d$.
By the [Kullback-Leibler divergence formula for Gaussian measures](/page/Kullback-Leibler%20Divergence), Gaussian measures with the same covariance matrix $\sigma^2I_d$ satisfy
\begin{align*}
D(P_\theta\|P_{\theta'})
=
\frac{|\theta-\theta'|^2}{2\sigma^2}.
\end{align*}
The common covariance hypothesis is satisfied because every distribution in the finite family has covariance $\sigma^2I_d$. Applying this formula with $\theta=\theta_v$ and $\theta'=0$ gives, for every $v\in\mathcal{V}$,
\begin{align*}
D(P_{\theta_v}\|P_0)
=
\frac{|\theta_v|^2}{2\sigma^2}
=
\frac{a^2k}{2\sigma^2}
=
\frac{c_1}{2}k\log\left(\frac{d}{k}\right).
\end{align*}
This value does not depend on $v$, so averaging over $\mathcal{V}$ leaves the same bound.
Now choose
\begin{align*}
c_1 := \frac{c_{\mathrm{GV}}}{16}.
\end{align*}
Since the support-packing step proved
\begin{align*}
\log|\mathcal{V}| \geq c_{\mathrm{GV}}k\log\left(\frac{d}{k}\right),
\end{align*}
we obtain
\begin{align*}
\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\theta_v}\|P_0)
=
\frac{c_{\mathrm{GV}}}{32}k\log\left(\frac{d}{k}\right)
\leq
\frac{1}{32}\log|\mathcal{V}|.
\end{align*}
This is exactly the small-average-divergence condition needed for Fano's method.
[/guided]
[/step]
[step:Apply Fano's testing reduction to force squared error]
We use the finite-packing form of [Fano's inequality](/page/Fano%27s%20Inequality) with a reference distribution. This reference-measure version follows from the usual mutual-information form because, for a uniform index $M_0$ on $M$ and observation $X\sim P_{\vartheta_{M_0}}$, the mutual information satisfies
\begin{align*}
I(M_0;X)
\leq
\frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q)
\end{align*}
for every probability measure $Q$ dominating the family, by the convexity identity comparing the mixture law with $Q$. Therefore, if $\{\vartheta_m:m\in M\}$ is a finite subset of a parameter space, if
\begin{align*}
|\vartheta_m-\vartheta_{m'}|^2\geq 4s^2
\end{align*}
for all distinct $m,m'\in M$, and if
\begin{align*}
\frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q)
\leq
\alpha\log|M|
\end{align*}
for some probability measure $Q$ and some $\alpha\in(0,1/8]$, then every measurable estimator $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$ satisfies
\begin{align*}
\sup_{m\in M}\mathbb{E}_{\vartheta_m}\left[|\hat{\theta}-\vartheta_m|^2\right]
\geq
\frac{s^2}{2}.
\end{align*}
Indeed, compose $\hat{\theta}$ with a nearest-neighbour decoder into the packing. The separation condition implies that a decoding error can occur only if $|\hat{\theta}-\vartheta_m|\geq s$, and [Fano's inequality](/theorems/1654) lower-bounds the maximal decoding error probability under the displayed average divergence hypothesis.
For the packing $\Theta_{\mathcal{V}}$, the squared separation is at least $a^2k/2$. Hence we may take
\begin{align*}
s^2 := \frac{a^2 k}{8}.
\end{align*}
The divergence estimate from the previous step gives the Fano hypothesis with reference measure $Q=P_0$ and $\alpha = 1/32$. Therefore
The finite packing is contained in $\Theta_k(R)$, so restricting the supremum to $\Theta_{\mathcal{V}}$ gives
\begin{align*}
\inf_{\hat{\theta}}
\sup_{\theta \in \Theta_k(R)}
\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
\inf_{\hat{\theta}}
\sup_{v \in \mathcal{V}}
\mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right].
\end{align*}
Fano's testing reduction gives
\begin{align*}
\inf_{\hat{\theta}}
\sup_{v \in \mathcal{V}}
\mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right]
\geq
\frac{s^2}{2}.
\end{align*}
By the definition $s^2=a^2k/8$,
\begin{align*}
\frac{s^2}{2}
=
\frac{a^2k}{16}
=
\frac{c_1}{16}\sigma^2 k\log\left(\frac{d}{k}\right).
\end{align*}
Thus the desired lower bound holds with
\begin{align*}
c := \frac{c_1}{16} = \frac{c_{\mathrm{GV}}}{256}.
\end{align*}
[guided]
We apply the finite-packing form of [Fano's inequality](/page/Fano%27s%20Inequality) to convert the packing into a minimax lower bound. The form needed here permits an external reference probability measure $Q$: if a finite family $\{\vartheta_m:m\in M\}$ has pairwise squared Euclidean separation at least $4s^2$ and satisfies
\begin{align*}
\frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q)
\leq
\alpha\log|M|
\end{align*}
for some $\alpha\in(0,1/8]$, then every measurable estimator $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$ has squared-error risk at least $s^2/2$ on one member of the family. Why is this the same Fano mechanism? If $M_0$ is uniform on $M$ and $X\sim P_{\vartheta_{M_0}}$, the usual mutual-information version of Fano applies to the induced testing problem. The information term obeys
\begin{align*}
I(M_0;X)
\leq
\frac{1}{|M|}\sum_{m\in M}D(P_{\vartheta_m}\|Q),
\end{align*}
so the displayed average divergence is sufficient. The reduction from estimation to testing is by nearest-neighbour decoding: if $|\hat{\theta}-\vartheta_m|<s$, then the separation condition makes $\vartheta_m$ the unique nearest packing point, so estimation error below $s$ prevents testing error; Fano's inequality then supplies the testing-error lower bound.
Let $M:=\mathcal{V}$ and let $\vartheta_v:=\theta_v$ for $v\in\mathcal{V}$. The Euclidean packing step proved
\begin{align*}
|\theta_u-\theta_v|^2\geq \frac{a^2k}{2}
\end{align*}
for distinct $u,v\in\mathcal{V}$. Therefore we set
\begin{align*}
s^2 := \frac{a^2k}{8},
\end{align*}
so that $4s^2=a^2k/2$, exactly matching the required separation hypothesis. The KL step proved
\begin{align*}
\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\theta_v}\|P_0)
\leq
\frac{1}{32}\log|\mathcal{V}|,
\end{align*}
so the divergence hypothesis holds with reference measure $Q=P_0$ and $\alpha=1/32\in(0,1/8]$.
Hence Fano's method gives
\begin{align*}
\inf_{\hat{\theta}}\sup_{v\in\mathcal{V}}\mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right]
\geq
\frac{s^2}{2}.
\end{align*}
Because $\Theta_{\mathcal{V}}\subset\Theta_k(R)$, taking the supremum over all of $\Theta_k(R)$ can only increase the risk:
\begin{align*}
\inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
\inf_{\hat{\theta}}\sup_{v\in\mathcal{V}}\mathbb{E}_{\theta_v}\left[|\hat{\theta}-\theta_v|^2\right].
\end{align*}
Combining these inequalities and substituting $s^2=a^2k/8$ and $a^2=c_1\sigma^2\log(d/k)$ yields
\begin{align*}
\inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
\frac{c_1}{16}\sigma^2k\log\left(\frac{d}{k}\right).
\end{align*}
Thus the restricted minimax lower bound holds with
\begin{align*}
c := \frac{c_1}{16}=\frac{c_{\mathrm{GV}}}{256}.
\end{align*}
[/guided]
[/step]
[step:Pass from the bounded sparse class to the unrestricted sparse class]
Since $\Theta_k(R) \subset \Theta_k$, for every measurable estimator $\hat{\theta}: \mathbb{R}^d \to \mathbb{R}^d$,
\begin{align*}
\sup_{\theta \in \Theta_k}
\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
\sup_{\theta \in \Theta_k(R)}
\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right].
\end{align*}
Taking the infimum over all measurable estimators on both sides gives
\begin{align*}
\inf_{\hat{\theta}}
\sup_{\theta \in \Theta_k}
\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
\inf_{\hat{\theta}}
\sup_{\theta \in \Theta_k(R)}
\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
c\sigma^2 k \log\left(\frac{d}{k}\right).
\end{align*}
This proves both assertions.
[guided]
The final step uses monotonicity of the supremum over parameter classes. The restricted class is contained in the unrestricted class:
\begin{align*}
\Theta_k(R) \subset \Theta_k.
\end{align*}
Therefore, for every measurable estimator $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$, the supremum of the risk over the larger class is at least the supremum over the smaller class:
\begin{align*}
\sup_{\theta\in\Theta_k}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right].
\end{align*}
Now take the infimum over all measurable estimators on both sides. Since the inequality holds estimator by estimator, it remains true after taking infima:
\begin{align*}
\inf_{\hat{\theta}}\sup_{\theta\in\Theta_k}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
\inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right].
\end{align*}
The restricted lower bound proved in the preceding step gives
\begin{align*}
\inf_{\hat{\theta}}\sup_{\theta\in\Theta_k(R)}\mathbb{E}_{\theta}\left[|\hat{\theta}-\theta|^2\right]
\geq
c\sigma^2k\log\left(\frac{d}{k}\right).
\end{align*}
Combining the last two displayed inequalities proves the same lower bound over $\Theta_k$, and therefore proves both assertions of the theorem.
[/guided]
[/step]