[guided]The statistical lower bound now comes from converting estimation into a finite testing problem. Suppose an estimator $\hat{\beta}$ had prediction error much smaller than the squared separation $\delta^2$ on every packed point. Then we could identify which packed point generated the data by choosing the nearest packed parameter to $\hat{\beta}$ in prediction norm. This would contradict Fano's inequality.
Let $V$ be uniformly distributed on $\mathcal{V}$, and given $V=v$, let the data have law $P_{\beta_v}$. Fano's inequality in testing form with a fixed reference law states that, for any estimator $\widehat{V}$ of $V$,
\begin{align*}
\mathbb{P}(\widehat{V}\neq V)
\geq
1-\frac{\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0)+\log 2}{\log|\mathcal{V}|}.
\end{align*}
Its hypotheses are satisfied here: $\mathcal{V}$ is finite, $V$ has the uniform prior, and the divergences to $P_0$ are finite by the Gaussian KL computation. The previous information calculation gives
\begin{align*}
\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0)
\leq
\frac{1}{16}\log|\mathcal{V}|.
\end{align*}
When $\log|\mathcal{V}|\geq 4\log 2$, Fano gives
\begin{align*}
\mathbb{P}(\widehat{V}\neq V)
\geq
1-\frac{1}{16}-\frac{1}{4}
=
\frac{11}{16}
>
\frac{1}{2}.
\end{align*}
It remains to handle the case $\log|\mathcal{V}|<4\log 2$ without referring back to the exact proof. The entropy lower bound implies
\begin{align*}
k\log(d/k)<\frac{4\log 2}{c_{\mathrm{VG}}}.
\end{align*}
Because $k\leq d/2$, there exist two disjoint $k$-element subsets $A,B\subset\{1,\dots,d\}$. Define two vectors $\beta_A,\beta_B\in\mathbb{R}^d$ by the coordinate rules $(\beta_A)_j=a\mathbb{1}_A(j)$ and $(\beta_B)_j=a\mathbb{1}_B(j)$ for every $j\in\{1,\dots,d\}$. Each vector is $k$-sparse, and the same radius calculation as above gives $\beta_A,\beta_B\in\mathcal{B}_k(R)$. Since $A$ and $B$ are disjoint, $\beta_A-\beta_B$ has $2k$ nonzero coordinates, each of absolute value $a$, so
\begin{align*}
|\beta_A-\beta_B|^2=2a^2k.
\end{align*}
Applying the lower sparse eigenvalue assumption to the $2k$-sparse vector $\beta_A-\beta_B$ gives
\begin{align*}
|\Sigma^{1/2}(\beta_A-\beta_B)|^2
\geq
2\kappa_-a^2k
=:
\Delta^2.
\end{align*}
The same Gaussian KL computation applied to the pair $\beta_A,\beta_B$ gives
\begin{align*}
D(P_{\beta_A}\|P_{\beta_B})
=
\frac{n}{2\sigma^2}(\beta_A-\beta_B)^\top\Sigma(\beta_A-\beta_B).
\end{align*}
The vector $\beta_A-\beta_B$ is $2k$-sparse, so the upper sparse eigenvalue assumption yields
\begin{align*}
D(P_{\beta_A}\|P_{\beta_B})
\leq
\frac{n\kappa_+}{2\sigma^2}|\beta_A-\beta_B|^2
=
\gamma k\log(d/k)
<
\frac{4\gamma\log 2}{c_{\mathrm{VG}}}.
\end{align*}
If we choose $\gamma\leq c_{\mathrm{VG}}/(16\log 2)$, then $D(P_{\beta_A}\|P_{\beta_B})\leq 1/4$. Pinsker's inequality gives
\begin{align*}
\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}\leq \sqrt{\frac{1}{2}D(P_{\beta_A}\|P_{\beta_B})}\leq \sqrt{\frac{1}{8}}.
\end{align*}
Le Cam's two-point testing bound therefore gives, for every test $\widehat{T}\in\{A,B\}$,
\begin{align*}
\max\left\{P_{\beta_A}(\widehat{T}=B),P_{\beta_B}(\widehat{T}=A)\right\}
\geq
\frac{1-\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}}{2}
\geq
\frac{1-\sqrt{1/8}}{2}
>
\frac{1}{4}.
\end{align*}
Given any regression estimator $\hat{\beta}$, decode to whichever of $\beta_A$ and $\beta_B$ is nearer in the norm $b\mapsto |\Sigma^{1/2}b|$, breaking ties deterministically. If the squared prediction loss is less than $\Delta^2/4$ at the true point, the triangle inequality forces this two-point decoder to be correct. Hence a decoding error implies squared loss at least $\Delta^2/4$, and so
\begin{align*}
\sup_{\beta\in\mathcal{B}_k(R)}\mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right]
\geq
\frac{\Delta^2}{4}\cdot\frac{1}{4}
=
\frac{\Delta^2}{16}
=
\frac{\gamma}{8}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}.
\end{align*}
Thus the bounded-entropy case is already covered after decreasing the final universal constant to $\gamma/16$. In the nontrivial packing case used below, every decoder satisfies
\begin{align*}
\mathbb{P}(\widehat{V}\neq V)\geq \frac{1}{2}.
\end{align*}
We now construct a decoder from an arbitrary regression estimator. Let $\hat{\beta}: (\mathbb{R}^d\times\mathbb{R})^n\to\mathbb{R}^d$ be measurable. Define $\widehat{V}_{\hat{\beta}}:(\mathbb{R}^d\times\mathbb{R})^n \to \mathcal{V}$ as follows. For a sample $z=((X_1,Y_1),\dots,(X_n,Y_n))\in(\mathbb{R}^d\times\mathbb{R})^n$, set
\begin{align*}
\widehat{V}_{\hat{\beta}}(z)
:=
\operatorname*{argmin}_{v\in\mathcal{V}} |\Sigma^{1/2}(\hat{\beta}(z)-\beta_v)|,
\end{align*}
breaking ties by a fixed ordering of $\mathcal{V}$. This map is measurable because $\mathcal{V}$ is finite and each displayed distance is a measurable function of the data.
If the true index is $V=v$ and the estimator satisfies
\begin{align*}
|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2<\frac{\delta^2}{4},
\end{align*}
then $\hat{\beta}$ is within distance $\delta/2$ of the true packed point. Since every other packed point is at least distance $\delta$ from $\beta_v$, the triangle inequality gives, for every $w\neq v$,
\begin{align*}
|\Sigma^{1/2}(\hat{\beta}-\beta_w)| \geq |\Sigma^{1/2}(\beta_v-\beta_w)| - |\Sigma^{1/2}(\hat{\beta}-\beta_v)| > \delta-\frac{\delta}{2} = \frac{\delta}{2}.
\end{align*}
Thus the nearest-neighbour decoder must return $v$. Equivalently,
\begin{align*}
\{\widehat{V}_{\hat{\beta}}\neq V\}
\subset
\left\{|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right\}.
\end{align*}
Using this implication and averaging over the uniform prior on $\mathcal{V}$,
\begin{align*}
\sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathbb{E}_{\beta_v}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2\right].
\end{align*}
Averaging over the uniform prior identifies the right-hand side as
\begin{align*}
\mathbb{E}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\right].
\end{align*}
Using the elementary bound $Z\geq t\mathbb{1}_{\{Z\geq t\}}$ for the nonnegative [random variable](/page/Random%20Variable) $Z:=|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2$ and the threshold $t:=\delta^2/4$, we get
\begin{align*}
\mathbb{E}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\right] \geq \frac{\delta^2}{4} \mathbb{P}\left(|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right).
\end{align*}
The event inclusion gives
\begin{align*}
\frac{\delta^2}{4} \mathbb{P}\left(|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right) \geq \frac{\delta^2}{4}\mathbb{P}(\widehat{V}_{\hat{\beta}}\neq V) \geq \frac{\delta^2}{8}.
\end{align*}
Finally, from the previous step,
\begin{align*}
\delta^2
=
\frac{\gamma}{2}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}.
\end{align*}
Therefore
\begin{align*}
\sup_{\beta\in\mathcal{B}_k(R)}
\mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right]
\geq
\frac{\gamma}{16}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}.
\end{align*}
Since $\hat{\beta}$ was arbitrary, taking the infimum over all measurable estimators proves the restricted minimax lower bound.[/guided]