[proofplan]
The lower bound is a metric testing argument. We construct many $k$-sparse vectors whose supports are well separated, scale their common amplitude so that the induced Gaussian measures have small Kullback-Leibler divergence, and then apply an elementary Fano testing bound. The lower restricted eigenvalue turns coefficient separation into prediction separation. For the upper bound we use the constrained least-squares estimator over $\Theta_k$ and control the largest Gaussian projection over all coordinate subspaces of dimension at most $2k$.
[/proofplan]
[step:Fix the restricted eigenvalue normalization used throughout]
For $1\le m\le d$, the restricted lower prediction eigenvalue of $X$ is
\begin{align*}
\kappa_-(m;X):=\inf\left\{\frac{|Xu|^2}{n|u|^2}:u\in\mathbb R^d\setminus\{0\},\ |\operatorname{supp}u|\le m\right\}.
\end{align*}
The restricted upper prediction eigenvalue of $X$ is
\begin{align*}
\kappa_+(m;X):=\sup\left\{\frac{|Xu|^2}{n|u|^2}:u\in\mathbb R^d\setminus\{0\},\ |\operatorname{supp}u|\le m\right\}.
\end{align*}
Thus every $2k$-sparse vector $u\in\mathbb R^d$ satisfies
\begin{align*}
a_0|u|^2\le \frac{1}{n}|Xu|^2\le A_0|u|^2.
\end{align*}
[/step]
[step:Build a sparse Hamming packing with logarithmic cardinality]
We first record the combinatorial packing used in the lower bound. There exist a universal constant $c_1>0$ and vectors $z_1,\dots,z_M \in \{0,1\}^d$ such that $|\operatorname{supp} z_j| = k$ for every $j$, such that
\begin{align*}
|z_j-z_\ell|^2 \ge \frac{k}{2}
\quad\text{whenever }j\ne \ell,
\end{align*}
and such that
\begin{align*}
\log M \ge c_1 k\log(d/k).
\end{align*}
Let
\begin{align*}
\mathcal{H}:=\{z\in\{0,1\}^d: |\operatorname{supp}z|=k\}
\end{align*}
denote the family of constant-weight support indicators. We use the following [constant-weight Gilbert-Varshamov packing estimate](/page/Gilbert-Varshamov%20Bound): there is a universal constant $c_1>0$ such that, whenever $1\le k\le d/2$, the family $\mathcal{H}$ contains a subset $\mathcal{P}$ satisfying $|z-w|^2\ge k/2$ for all distinct $z,w\in\mathcal{P}$ and
\begin{align*}
|\mathcal{P}|\ge \exp\{c_1 k\log(d/k)\}.
\end{align*}
The hypotheses of this estimate are exactly $1\le k\le d/2$ and constant weight $k$, both of which hold here. Enumerating $\mathcal{P}$ as $z_1,\dots,z_M$ gives the required vectors.
[guided]
The goal is to produce many possible sparse coefficient vectors, all with the same sparsity $k$, so that no two are too close. We work first with support indicators rather than amplitudes. Define
\begin{align*}
\mathcal{H}:=\{z\in\{0,1\}^d: |\operatorname{supp}z|=k\}.
\end{align*}
Thus each $z\in\mathcal{H}$ records a $k$-element subset of $\{1,\dots,d\}$.
For vectors in $\{0,1\}^d$, the Hamming distance is exactly $|z-w|^2$, since each differing coordinate contributes $1$ to the squared Euclidean distance. The [constant-weight Gilbert-Varshamov packing estimate](/page/Gilbert-Varshamov%20Bound) says that, under the hypotheses $1\le k\le d/2$ and constant weight $k$, there is a universal constant $c_1>0$ and a subset $\mathcal{P}\subset\mathcal{H}$ with pairwise Hamming distance at least $k/2$ such that
\begin{align*}
|\mathcal{P}|\ge\exp\{c_1 k\log(d/k)\}.
\end{align*}
These hypotheses hold because the theorem assumes $1\le k\le d/2$ and every element of $\mathcal{H}$ has exactly $k$ nonzero coordinates. Enumerating $\mathcal{P}$ as $z_1,\dots,z_M$ gives the required packing.
[/guided]
[/step]
[step:Scale the packing so prediction distances are large and divergences are small]
For each $\beta\in\Theta_k$, let $P_\beta$ denote the law on $\mathbb R^n$ of the observation vector $Y=X\beta+\varepsilon$, where $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$. Define
\begin{align*}
r^2 := \frac{c_2\sigma^2\log M}{A_0nk},
\end{align*}
where $c_2>0$ is a universal constant to be chosen below, and define $\beta_j:=rz_j\in\Theta_k$ for $j=1,\dots,M$.
For $j\ne \ell$, the vector $\beta_j-\beta_\ell$ is $2k$-sparse. By the lower restricted eigenvalue bound,
\begin{align*}
\frac{1}{n}|X(\beta_j-\beta_\ell)|^2
\ge
a_0|\beta_j-\beta_\ell|^2
=
a_0r^2|z_j-z_\ell|^2
\ge
\frac{a_0r^2k}{2}.
\end{align*}
Set
\begin{align*}
s^2:=\frac{a_0r^2k}{2}.
\end{align*}
Then all distinct elements of the packing are separated by prediction distance at least $s$ in the loss metric.
For each $j$, the Kullback-Leibler divergence between $P_{\beta_j}$ and $P_0$ is
\begin{align*}
D_{\mathrm{KL}}(P_{\beta_j}\|P_0)
=
\frac{|X\beta_j|^2}{2\sigma^2}
\le
\frac{nA_0|\beta_j|^2}{2\sigma^2}
=
\frac{nA_0r^2k}{2\sigma^2}
=
\frac{c_2}{2}\log M.
\end{align*}
Choosing $c_2\le 1/8$ gives
\begin{align*}
\frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_{\beta_j}\|P_0)
\le
\frac{1}{16}\log M.
\end{align*}
[guided]
For each $\beta\in\Theta_k$, $P_\beta$ denotes the law on $\mathbb R^n$ of $Y=X\beta+\varepsilon$ with $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$. We choose the amplitude
\begin{align*}
r^2 := \frac{c_2\sigma^2\log M}{A_0nk},
\end{align*}
where $c_2>0$ will be fixed small, and set $\beta_j:=rz_j$ for $j=1,\dots,M$. Since each $z_j$ has exactly $k$ nonzero entries, each $\beta_j$ belongs to $\Theta_k$.
If $j\ne \ell$, then $\beta_j-\beta_\ell$ has support contained in $\operatorname{supp}z_j\cup\operatorname{supp}z_\ell$, so it is $2k$-sparse. The definition of $\kappa_-(2k;X)$ and the hypothesis $\kappa_-(2k;X)\ge a_0$ give
\begin{align*}
\frac{1}{n}|X(\beta_j-\beta_\ell)|^2
\ge
a_0|\beta_j-\beta_\ell|^2
=
a_0r^2|z_j-z_\ell|^2
\ge
\frac{a_0r^2k}{2}.
\end{align*}
Thus, with
\begin{align*}
s^2:=\frac{a_0r^2k}{2},
\end{align*}
the packing is separated by at least $s$ in the prediction loss metric.
The same amplitude must keep the statistical models close enough for [Fano's inequality](/theorems/1654). For centered Gaussian measures with common covariance $\sigma^2I_n$ and means $X\beta_j$ and $0$, the Kullback-Leibler divergence is
\begin{align*}
D_{\mathrm{KL}}(P_{\beta_j}\|P_0)
=
\frac{|X\beta_j|^2}{2\sigma^2}.
\end{align*}
Since $\beta_j$ is $k$-sparse, it is also $2k$-sparse, so the definition of $\kappa_+(2k;X)$ and the hypothesis $\kappa_+(2k;X)\le A_0$ imply
\begin{align*}
D_{\mathrm{KL}}(P_{\beta_j}\|P_0)
\le
\frac{nA_0|\beta_j|^2}{2\sigma^2}
=
\frac{nA_0r^2k}{2\sigma^2}
=
\frac{c_2}{2}\log M.
\end{align*}
Choosing $c_2\le 1/8$ makes the average divergence satisfy
\begin{align*}
\frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_{\beta_j}\|P_0)
\le
\frac{1}{16}\log M.
\end{align*}
This is the exact small-divergence condition needed for the Fano step.
[/guided]
[/step]
[step:Convert Fano testing error into minimax prediction risk]
We use the following elementary form of [Fano's inequality](/page/Fano%27s%20Inequality): if $P_1,\dots,P_M$ are probability measures and $P_0$ is another probability measure satisfying
\begin{align*}
\frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_j\|P_0)\le \alpha\log M
\quad\text{with }0<\alpha<1/8,
\end{align*}
then every measurable test $\psi:\mathbb{R}^n\to\{1,\dots,M\}$ satisfies
\begin{align*}
\sup_{1\le j\le M}P_j(\psi\ne j)\ge c_3,
\end{align*}
where $c_3>0$ is a universal constant. This follows from the standard entropy proof of Fano's inequality applied to the uniform prior on $\{1,\dots,M\}$.
Let $\hat\beta:\mathbb{R}^n\to\mathbb{R}^d$ be any estimator. Define a measurable nearest-packing selector $\psi_{\hat\beta}:\mathbb{R}^n\to\{1,\dots,M\}$ by choosing an index minimizing
\begin{align*}
j\mapsto |X(\hat\beta(Y)-\beta_j)|.
\end{align*}
If $\psi_{\hat\beta}(Y)\ne j$, then the triangle inequality and the packing separation imply
\begin{align*}
|X(\hat\beta(Y)-\beta_j)|
\ge
\frac{\sqrt{n}s}{2}.
\end{align*}
Therefore, under $P_{\beta_j}$,
\begin{align*}
\mathbb E_{\beta_j}\left[\frac{1}{n}|X(\hat\beta-\beta_j)|^2\right]
\ge
\frac{s^2}{4}P_{\beta_j}(\psi_{\hat\beta}\ne j).
\end{align*}
Taking the supremum over $j$ and then the infimum over $\hat\beta$ gives
\begin{align*}
\inf_{\hat\beta}\sup_{\beta\in\Theta_k}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right]
\ge
\frac{c_3s^2}{4}.
\end{align*}
By the definition of $s^2$,
\begin{align*}
\frac{c_3s^2}{4}=\frac{c_3a_0r^2k}{8}.
\end{align*}
By the definition of $r^2$,
\begin{align*}
\frac{c_3a_0r^2k}{8}=\frac{c_3a_0c_2}{8A_0}\sigma^2\frac{\log M}{n}.
\end{align*}
Since $\log M\ge c_1k\log(d/k)$, the lower bound follows with
\begin{align*}
c:=\frac{c_1c_2c_3a_0}{8A_0}.
\end{align*}
[guided]
We now convert separation of the finite packing into a lower bound for every estimator. The [Fano inequality](/page/Fano%27s%20Inequality) form used here requires probability measures $P_1,\dots,P_M$ and a reference measure $P_0$ satisfying
\begin{align*}
\frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_j\|P_0)\le \alpha\log M
\quad\text{with }0<\alpha<1/8.
\end{align*}
The previous step verified this hypothesis for $P_j=P_{\beta_j}$ with $\alpha=1/16$. Therefore every measurable test $\psi:\mathbb R^n\to\{1,\dots,M\}$ satisfies
\begin{align*}
\sup_{1\le j\le M}P_{\beta_j}(\psi\ne j)\ge c_3,
\end{align*}
where $c_3>0$ is universal.
Let $\hat\beta:\mathbb R^n\to\mathbb R^d$ be any estimator. From it we build a test by nearest-neighbour decoding in prediction norm: define $\psi_{\hat\beta}:\mathbb R^n\to\{1,\dots,M\}$ to be an index minimizing
\begin{align*}
j\mapsto |X(\hat\beta(Y)-\beta_j)|,
\end{align*}
with a fixed tie-breaking rule. If the true index is $j$ but $\psi_{\hat\beta}(Y)=\ell\ne j$, then the nearest-neighbour property gives
\begin{align*}
|X(\hat\beta(Y)-\beta_\ell)|\le |X(\hat\beta(Y)-\beta_j)|.
\end{align*}
Using the triangle inequality and the packing separation,
\begin{align*}
\sqrt{n}s
\le
|X(\beta_j-\beta_\ell)|
\le
|X(\hat\beta(Y)-\beta_j)|+|X(\hat\beta(Y)-\beta_\ell)|
\le
2|X(\hat\beta(Y)-\beta_j)|.
\end{align*}
Hence on the event $\{\psi_{\hat\beta}(Y)\ne j\}$,
\begin{align*}
\frac{1}{n}|X(\hat\beta(Y)-\beta_j)|^2\ge \frac{s^2}{4}.
\end{align*}
Taking expectation under $P_{\beta_j}$ gives
\begin{align*}
\mathbb E_{\beta_j}\left[\frac{1}{n}|X(\hat\beta-\beta_j)|^2\right]
\ge
\frac{s^2}{4}P_{\beta_j}(\psi_{\hat\beta}\ne j).
\end{align*}
Now take the supremum over $j$ and use the Fano lower bound on testing error:
\begin{align*}
\sup_{1\le j\le M}\mathbb E_{\beta_j}\left[\frac{1}{n}|X(\hat\beta-\beta_j)|^2\right]
\ge
\frac{c_3s^2}{4}.
\end{align*}
Since $\{\beta_1,\dots,\beta_M\}\subset\Theta_k$, this lower bounds the supremum over all $\beta\in\Theta_k$. Taking the infimum over estimators and substituting
\begin{align*}
s^2=\frac{a_0r^2k}{2},
\qquad
r^2=\frac{c_2\sigma^2\log M}{A_0nk},
\end{align*}
yields
\begin{align*}
\inf_{\hat\beta}\sup_{\beta\in\Theta_k}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right]
\ge
\frac{c_3a_0c_2}{8A_0}\sigma^2\frac{\log M}{n}.
\end{align*}
Finally $\log M\ge c_1k\log(d/k)$, so the lower bound holds with
\begin{align*}
c:=\frac{c_1c_2c_3a_0}{8A_0}.
\end{align*}
[/guided]
[/step]
[step:Control constrained least squares by Gaussian projections]
For each support set $S\subset\{1,\dots,d\}$ with $|S|\le k$, let $X_S$ denote the submatrix of $X$ formed by the columns indexed by $S$, and let $X_S^+$ denote its Moore-Penrose pseudoinverse. Define $b_S:\mathbb{R}^n\to\Theta_k$ by placing $X_S^+Y$ on the coordinates in $S$ and placing $0$ on $S^c$. Among the finitely many supports $S$ with $|S|\le k$, choose one minimizing $|Y-Xb_S(Y)|^2$, with lexicographic tie-breaking on the ordered support. This gives a measurable estimator $\hat\beta_{\mathrm{cls}}:\mathbb{R}^n\to\Theta_k$, since each $b_S$ is linear and the finite lexicographic comparison is made by measurable inequalities. For $\beta\in\Theta_k$, let $\varepsilon:=Y-X\beta$. Since $\beta$ is feasible,
\begin{align*}
|Y-X\hat\beta_{\mathrm{cls}}|^2\le |Y-X\beta|^2.
\end{align*}
With $\Delta:=\hat\beta_{\mathrm{cls}}-\beta$, this becomes
\begin{align*}
|\varepsilon-X\Delta|^2\le |\varepsilon|^2,
\end{align*}
and hence
\begin{align*}
|X\Delta|^2\le 2\varepsilon\cdot X\Delta.
\end{align*}
The vector $\Delta$ is $2k$-sparse. For each set $S\subset\{1,\dots,d\}$ with $|S|\le 2k$, let $V_S:=\operatorname{Range}(X_S)\subset\mathbb{R}^n$, where $X_S$ is the submatrix of $X$ formed by the columns indexed by $S$, and let $\Pi_S:\mathbb{R}^n\to V_S$ be the Euclidean [orthogonal projection](/theorems/437). If $S=\operatorname{supp}\Delta$, then $X\Delta\in V_S$, so
\begin{align*}
|X\Delta|^2
\le
2|\Pi_S\varepsilon|\,|X\Delta|
\end{align*}
by the [Cauchy-Schwarz inequality](/theorems/432) in $\mathbb{R}^n$. Therefore, defining the maximum over all subsets of size at most $2k$ in prose to avoid a row break in the subscript,
\begin{align*}
|X\Delta|^2\le4\max_{S\subset\{1,\dots,d\}:\ |S|\le 2k}|\Pi_S\varepsilon|^2.
\end{align*}
[guided]
The constrained least-squares estimator is the natural upper-bound estimator because it searches only over the true sparsity class. Define
\begin{align*}
\hat\beta_{\mathrm{cls}}:\mathbb{R}^n\to\Theta_k
\end{align*}
to be any minimizer of $b\mapsto |Y-Xb|^2$ over $b\in\Theta_k$. Since the true $\beta$ belongs to $\Theta_k$, it is one of the candidates in this minimization. Hence
\begin{align*}
|Y-X\hat\beta_{\mathrm{cls}}|^2\le |Y-X\beta|^2.
\end{align*}
Write the noise vector as $\varepsilon:=Y-X\beta$ and define the estimation error
\begin{align*}
\Delta:=\hat\beta_{\mathrm{cls}}-\beta.
\end{align*}
Then $Y-X\hat\beta_{\mathrm{cls}}=\varepsilon-X\Delta$, so the previous inequality becomes
\begin{align*}
|\varepsilon-X\Delta|^2\le |\varepsilon|^2.
\end{align*}
Expanding the square in the Euclidean [inner product](/page/Inner%20Product) on $\mathbb{R}^n$ gives
\begin{align*}
|\varepsilon|^2-2\varepsilon\cdot X\Delta+|X\Delta|^2\le |\varepsilon|^2,
\end{align*}
and cancelling $|\varepsilon|^2$ yields
\begin{align*}
|X\Delta|^2\le 2\varepsilon\cdot X\Delta.
\end{align*}
Now $\hat\beta_{\mathrm{cls}}$ and $\beta$ are both $k$-sparse, so $\Delta$ is $2k$-sparse. For each set $S\subset\{1,\dots,d\}$ with $|S|\le 2k$, define $V_S:=\operatorname{Range}(X_S)\subset\mathbb{R}^n$, where $X_S$ is the submatrix of $X$ with columns indexed by $S$, and let $\Pi_S:\mathbb{R}^n\to V_S$ denote the Euclidean orthogonal projection. If $S=\operatorname{supp}\Delta$, then $X\Delta\in V_S$. Therefore
\begin{align*}
\varepsilon\cdot X\Delta=(\Pi_S\varepsilon)\cdot X\Delta.
\end{align*}
Applying the Cauchy-Schwarz inequality in $\mathbb{R}^n$ gives
\begin{align*}
|X\Delta|^2
\le
2|\Pi_S\varepsilon|\,|X\Delta|.
\end{align*}
If $|X\Delta|=0$, the desired bound is immediate. Otherwise divide by $|X\Delta|$ and square:
\begin{align*}
|X\Delta|^2\le 4|\Pi_S\varepsilon|^2.
\end{align*}
Since $S$ is one of the subsets of $\{1,\dots,d\}$ with cardinality at most $2k$, enlarging from this single support to the maximum over all such supports gives
\begin{align*}
|X\Delta|^2\le4\max_{T\subset\{1,\dots,d\}:\ |T|\le 2k}|\Pi_T\varepsilon|^2.
\end{align*}
This is the key deterministic inequality: the prediction error is controlled by the largest Gaussian projection onto a coordinate-generated subspace of dimension at most $2k$.
[/guided]
[/step]
[step:Bound the maximum projected noise energy]
For a fixed set $S\subset\{1,\dots,d\}$ with $|S|\le2k$, the subspace $V_S$ has dimension at most $2k$, and
\begin{align*}
|\Pi_S\varepsilon|^2/\sigma^2
\end{align*}
is stochastically dominated by a chi-square [random variable](/page/Random%20Variable) with $2k$ degrees of freedom. This follows because $\Pi_S\varepsilon$ is a centered Gaussian vector in the subspace $V_S$ with covariance operator bounded by $\sigma^2\Pi_S$. The number of sets $S$ with $|S|\le2k$ is at most $N$, where
\begin{align*}
N:=\sum_{m=0}^{2k}\binom{d}{m}.
\end{align*}
Using $2k\le d$ and the elementary binomial bound $\binom{d}{m}\le (ed/m)^m$ for $1\le m\le 2k$, there is a universal constant $c_4>0$ such that
\begin{align*}
N\le \exp\{c_4k\log(ed/k)\}.
\end{align*}
We now derive the needed maximal bound from a chi-square tail estimate. For a fixed $S$, the [Laurent-Massart chi-square tail bound](/page/Laurent-Massart%20Chi-Square%20Tail%20Bound) gives, for every $u>0$,
\begin{align*}
\mathbb P_\beta\left(|\Pi_S\varepsilon|^2>\sigma^2(2k+2\sqrt{2ku}+2u)\right)\le e^{-u}.
\end{align*}
Taking a union bound over the $N$ supports gives, for every $u>0$,
\begin{align*}
\mathbb P_\beta\left(\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2>\sigma^2(2k+2\sqrt{2k(u+\log N)}+2(u+\log N))\right)\le e^{-u}.
\end{align*}
Using $2\sqrt{ab}\le a+b$ with $a=2k$ and $b=u+\log N$, there is a universal constant $c_6>0$ such that, for every $u>0$,
\begin{align*}
\mathbb P_\beta\left(\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2>c_6\sigma^2(k+\log N+u)\right)\le e^{-u}.
\end{align*}
For the nonnegative random variable
\begin{align*}
Z:=\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2,
\end{align*}
set $t_0:=c_6\sigma^2(k+\log N)$. The tail identity for nonnegative random variables gives
\begin{align*}
\mathbb E_\beta[Z]
\le
t_0+\int_{t_0}^\infty \mathbb P_\beta(Z>t)\,d\mathcal{L}^1(t),
\end{align*}
where $\mathcal{L}^1$ denotes one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure). With the change of variables $t=t_0+c_6\sigma^2u$, this becomes
\begin{align*}
\mathbb E_\beta[Z]
\le
c_6\sigma^2(k+\log N)+c_6\sigma^2\int_0^\infty e^{-u}\,d\mathcal{L}^1(u).
\end{align*}
Since $\int_0^\infty e^{-u}\,d\mathcal{L}^1(u)=1$ and $\log N\le c_4k\log(ed/k)$, after enlarging constants there is a universal constant $c_5>0$ such that
\begin{align*}
\mathbb E_\beta\left[\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2\right]
\le
c_5\sigma^2 k\log(ed/k).
\end{align*}
Combining this with the deterministic inequality from the previous step yields
\begin{align*}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{cls}}-\beta)|^2\right]\le4c_5\sigma^2\frac{k\log(ed/k)}{n}.
\end{align*}
Taking the supremum over $\beta\in\Theta_k$ and then using that the minimax risk is at most the risk of $\hat\beta_{\mathrm{cls}}$ gives the upper bound with $C:=4c_5$.
[guided]
We must bound the random variable
\begin{align*}
Z:=\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2,
\end{align*}
where $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$ and $\Pi_S$ is the Euclidean orthogonal projection onto $V_S=\operatorname{Range}(X_S)$. For a fixed $S$, the subspace $V_S$ has dimension at most $|S|\le2k$. Therefore $|\Pi_S\varepsilon|^2/\sigma^2$ is stochastically dominated by a chi-square random variable with $2k$ degrees of freedom.
The number of possible supports is
\begin{align*}
N:=\sum_{m=0}^{2k}\binom{d}{m}.
\end{align*}
Since $2k\le d$, the binomial estimate $\binom{d}{m}\le (ed/m)^m$ for $1\le m\le2k$ implies, after absorbing the finite sum into a universal constant, that
\begin{align*}
N\le \exp\{c_4k\log(ed/k)\}
\end{align*}
for a universal constant $c_4>0$.
For a fixed $S$, the [Laurent-Massart chi-square tail bound](/page/Laurent-Massart%20Chi-Square%20Tail%20Bound) gives, for every $v>0$,
\begin{align*}
\mathbb P_\beta\left(|\Pi_S\varepsilon|^2>\sigma^2(2k+2\sqrt{2kv}+2v)\right)\le e^{-v}.
\end{align*}
Apply this with $v=u+\log N$ and take a union bound over the $N$ supports. For every $u>0$,
\begin{align*}
\mathbb P_\beta\left(Z>\sigma^2(2k+2\sqrt{2k(u+\log N)}+2(u+\log N))\right)\le e^{-u}.
\end{align*}
Using $2\sqrt{ab}\le a+b$ with $a=2k$ and $b=u+\log N$, there is a universal constant $c_6>0$ such that
\begin{align*}
\mathbb P_\beta\left(Z>c_6\sigma^2(k+\log N+u)\right)\le e^{-u}.
\end{align*}
Set $t_0:=c_6\sigma^2(k+\log N)$. The tail identity for nonnegative random variables gives
\begin{align*}
\mathbb E_\beta[Z]
\le
t_0+\int_{t_0}^\infty \mathbb P_\beta(Z>t)\,d\mathcal L^1(t).
\end{align*}
With $t=t_0+c_6\sigma^2u$, the measure transforms as $d\mathcal L^1(t)=c_6\sigma^2\,d\mathcal L^1(u)$ and the domain $[t_0,\infty)$ becomes $[0,\infty)$. Hence
\begin{align*}
\mathbb E_\beta[Z]
\le
c_6\sigma^2(k+\log N)+c_6\sigma^2\int_0^\infty e^{-u}\,d\mathcal L^1(u)
=
c_6\sigma^2(k+\log N+1).
\end{align*}
Since $\log N\le c_4k\log(ed/k)$ and $k\ge1$, enlarging constants gives
\begin{align*}
\mathbb E_\beta[Z]
\le
c_5\sigma^2k\log(ed/k)
\end{align*}
for a universal constant $c_5>0$. Combining this with the deterministic inequality
\begin{align*}
|X(\hat\beta_{\mathrm{cls}}-\beta)|^2\le4Z
\end{align*}
gives
\begin{align*}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{cls}}-\beta)|^2\right]
\le
4c_5\sigma^2\frac{k\log(ed/k)}{n}.
\end{align*}
Taking the supremum over $\beta\in\Theta_k$ and comparing the minimax risk to this particular estimator proves the upper bound with $C:=4c_5$.
[/guided]
[/step]
[step:Combine the two bounds]
The preceding lower-bound argument gives
\begin{align*}
\inf_{\hat\beta}\sup_{\beta\in\Theta_k}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right]
\ge
c\sigma^2\frac{k\log(d/k)}{n},
\end{align*}
with $c>0$ depending only on $a_0$ and $A_0$. The constrained least-squares argument gives
\begin{align*}
\inf_{\hat\beta}\sup_{\beta\in\Theta_k}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right]
\le
C\sigma^2\frac{k\log(ed/k)}{n},
\end{align*}
with a universal constant $C>0$, hence in particular with $C$ depending only on $a_0$ and $A_0$. This proves the claimed minimax sparse prediction rate.
[guided]
The lower-bound steps proved that every estimator has prediction risk at least
\begin{align*}
c\sigma^2\frac{k\log(d/k)}{n},
\end{align*}
where
\begin{align*}
c=\frac{c_1c_2c_3a_0}{8A_0}>0.
\end{align*}
Thus $c$ depends only on $a_0$ and $A_0$, because $c_1,c_2,c_3$ are universal constants. The upper-bound steps constructed the constrained least-squares estimator $\hat\beta_{\mathrm{cls}}$ and proved that, for every $\beta\in\Theta_k$,
\begin{align*}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{cls}}-\beta)|^2\right]
\le
C\sigma^2\frac{k\log(ed/k)}{n}
\end{align*}
with $C:=4c_5$, a universal constant. Since the minimax risk is the infimum over all estimators, it is no larger than the risk of this particular estimator. Combining these two inequalities gives exactly
\begin{align*}
c\sigma^2\frac{k\log(d/k)}{n}
\le
\inf_{\hat\beta}\sup_{\beta\in\Theta_k}\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right]
\le
C\sigma^2\frac{k\log(ed/k)}{n}.
\end{align*}
This is the claimed minimax sparse prediction rate.
[/guided]
[/step]