Minimax Prediction Rate over Sparse Linear Models

Minimax Prediction Rate over Sparse Linear Models (Theorem # 5912)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The lower bound is a metric testing argument. We construct many $k$-sparse vectors whose supports are well separated, scale their common amplitude so that the induced Gaussian measures have small Kullback-Leibler divergence, and then apply an elementary Fano testing bound. The lower restricted eigenvalue turns coefficient separation into prediction separation. For the upper bound we use the constrained least-squares estimator over $\Theta_k$ and control the largest Gaussian projection over all coordinate subspaces of dimension at most $2k$. [/proofplan] [step:Fix the restricted eigenvalue normalization used throughout] For $1\le m\le d$, the restricted lower prediction eigenvalue of $X$ is \begin{align*} \kappa_-(m;X):=\inf\left\{\frac{|Xu|^2}{n|u|^2}:u\in\mathbb R^d\setminus\{0\},\ |\operatorname{supp}u|\le m\right\}. \end{align*} The restricted upper prediction eigenvalue of $X$ is \begin{align*} \kappa_+(m;X):=\sup\left\{\frac{|Xu|^2}{n|u|^2}:u\in\mathbb R^d\setminus\{0\},\ |\operatorname{supp}u|\le m\right\}. \end{align*} Thus every $2k$-sparse vector $u\in\mathbb R^d$ satisfies \begin{align*} a_0|u|^2\le \frac{1}{n}|Xu|^2\le A_0|u|^2. \end{align*} [/step] [step:Build a sparse Hamming packing with logarithmic cardinality] We first record the combinatorial packing used in the lower bound. There exist a universal constant $c_1>0$ and vectors $z_1,\dots,z_M \in \{0,1\}^d$ such that $|\operatorname{supp} z_j| = k$ for every $j$, such that \begin{align*} |z_j-z_\ell|^2 \ge \frac{k}{2} \quad\text{whenever }j\ne \ell, \end{align*} and such that \begin{align*} \log M \ge c_1 k\log(d/k). \end{align*} Let \begin{align*} \mathcal{H}:=\{z\in\{0,1\}^d: |\operatorname{supp}z|=k\} \end{align*} denote the family of constant-weight support indicators. We use the following [constant-weight Gilbert-Varshamov packing estimate](/page/Gilbert-Varshamov%20Bound): there is a universal constant $c_1>0$ such that, whenever $1\le k\le d/2$, the family $\mathcal{H}$ contains a subset $\mathcal{P}$ satisfying $|z-w|^2\ge k/2$ for all distinct $z,w\in\mathcal{P}$ and \begin{align*} |\mathcal{P}|\ge \exp\{c_1 k\log(d/k)\}. \end{align*} The hypotheses of this estimate are exactly $1\le k\le d/2$ and constant weight $k$, both of which hold here. Enumerating $\mathcal{P}$ as $z_1,\dots,z_M$ gives the required vectors. [guided] The goal is to produce many possible sparse coefficient vectors, all with the same sparsity $k$, so that no two are too close. We work first with support indicators rather than amplitudes. Define \begin{align*} \mathcal{H}:=\{z\in\{0,1\}^d: |\operatorname{supp}z|=k\}. \end{align*} Thus each $z\in\mathcal{H}$ records a $k$-element subset of $\{1,\dots,d\}$. For vectors in $\{0,1\}^d$, the Hamming distance is exactly $|z-w|^2$, since each differing coordinate contributes $1$ to the squared Euclidean distance. The [constant-weight Gilbert-Varshamov packing estimate](/page/Gilbert-Varshamov%20Bound) says that, under the hypotheses $1\le k\le d/2$ and constant weight $k$, there is a universal constant $c_1>0$ and a subset $\mathcal{P}\subset\mathcal{H}$ with pairwise Hamming distance at least $k/2$ such that \begin{align*} |\mathcal{P}|\ge\exp\{c_1 k\log(d/k)\}. \end{align*} These hypotheses hold because the theorem assumes $1\le k\le d/2$ and every element of $\mathcal{H}$ has exactly $k$ nonzero coordinates. Enumerating $\mathcal{P}$ as $z_1,\dots,z_M$ gives the required packing. [/guided] [/step] [step:Scale the packing so prediction distances are large and divergences are small] For each $\beta\in\Theta_k$, let $P_\beta$ denote the law on $\mathbb R^n$ of the observation vector $Y=X\beta+\varepsilon$, where $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$. Define \begin{align*} r^2 := \frac{c_2\sigma^2\log M}{A_0nk}, \end{align*} where $c_2>0$ is a universal constant to be chosen below, and define $\beta_j:=rz_j\in\Theta_k$ for $j=1,\dots,M$. For $j\ne \ell$, the vector $\beta_j-\beta_\ell$ is $2k$-sparse. By the lower restricted eigenvalue bound, \begin{align*} \frac{1}{n}|X(\beta_j-\beta_\ell)|^2 \ge a_0|\beta_j-\beta_\ell|^2 = a_0r^2|z_j-z_\ell|^2 \ge \frac{a_0r^2k}{2}. \end{align*} Set \begin{align*} s^2:=\frac{a_0r^2k}{2}. \end{align*} Then all distinct elements of the packing are separated by prediction distance at least $s$ in the loss metric. For each $j$, the Kullback-Leibler divergence between $P_{\beta_j}$ and $P_0$ is \begin{align*} D_{\mathrm{KL}}(P_{\beta_j}\|P_0) = \frac{|X\beta_j|^2}{2\sigma^2} \le \frac{nA_0|\beta_j|^2}{2\sigma^2} = \frac{nA_0r^2k}{2\sigma^2} = \frac{c_2}{2}\log M. \end{align*} Choosing $c_2\le 1/8$ gives \begin{align*} \frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_{\beta_j}\|P_0) \le \frac{1}{16}\log M. \end{align*} [guided] For each $\beta\in\Theta_k$, $P_\beta$ denotes the law on $\mathbb R^n$ of $Y=X\beta+\varepsilon$ with $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$. We choose the amplitude \begin{align*} r^2 := \frac{c_2\sigma^2\log M}{A_0nk}, \end{align*} where $c_2>0$ will be fixed small, and set $\beta_j:=rz_j$ for $j=1,\dots,M$. Since each $z_j$ has exactly $k$ nonzero entries, each $\beta_j$ belongs to $\Theta_k$. If $j\ne \ell$, then $\beta_j-\beta_\ell$ has support contained in $\operatorname{supp}z_j\cup\operatorname{supp}z_\ell$, so it is $2k$-sparse. The definition of $\kappa_-(2k;X)$ and the hypothesis $\kappa_-(2k;X)\ge a_0$ give \begin{align*} \frac{1}{n}|X(\beta_j-\beta_\ell)|^2 \ge a_0|\beta_j-\beta_\ell|^2 = a_0r^2|z_j-z_\ell|^2 \ge \frac{a_0r^2k}{2}. \end{align*} Thus, with \begin{align*} s^2:=\frac{a_0r^2k}{2}, \end{align*} the packing is separated by at least $s$ in the prediction loss metric. The same amplitude must keep the statistical models close enough for [Fano's inequality](/theorems/1654). For centered Gaussian measures with common covariance $\sigma^2I_n$ and means $X\beta_j$ and $0$, the Kullback-Leibler divergence is \begin{align*} D_{\mathrm{KL}}(P_{\beta_j}\|P_0) = \frac{|X\beta_j|^2}{2\sigma^2}. \end{align*} Since $\beta_j$ is $k$-sparse, it is also $2k$-sparse, so the definition of $\kappa_+(2k;X)$ and the hypothesis $\kappa_+(2k;X)\le A_0$ imply \begin{align*} D_{\mathrm{KL}}(P_{\beta_j}\|P_0) \le \frac{nA_0|\beta_j|^2}{2\sigma^2} = \frac{nA_0r^2k}{2\sigma^2} = \frac{c_2}{2}\log M. \end{align*} Choosing $c_2\le 1/8$ makes the average divergence satisfy \begin{align*} \frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_{\beta_j}\|P_0) \le \frac{1}{16}\log M. \end{align*} This is the exact small-divergence condition needed for the Fano step. [/guided] [/step] [step:Convert Fano testing error into minimax prediction risk] We use the following elementary form of [Fano's inequality](/page/Fano%27s%20Inequality): if $P_1,\dots,P_M$ are probability measures and $P_0$ is another probability measure satisfying \begin{align*} \frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_j\|P_0)\le \alpha\log M \quad\text{with }0<\alpha<1/8, \end{align*} then every measurable test $\psi:\mathbb{R}^n\to\{1,\dots,M\}$ satisfies \begin{align*} \sup_{1\le j\le M}P_j(\psi\ne j)\ge c_3, \end{align*} where $c_3>0$ is a universal constant. This follows from the standard entropy proof of Fano's inequality applied to the uniform prior on $\{1,\dots,M\}$. Let $\hat\beta:\mathbb{R}^n\to\mathbb{R}^d$ be any estimator. Define a measurable nearest-packing selector $\psi_{\hat\beta}:\mathbb{R}^n\to\{1,\dots,M\}$ by choosing an index minimizing \begin{align*} j\mapsto |X(\hat\beta(Y)-\beta_j)|. \end{align*} If $\psi_{\hat\beta}(Y)\ne j$, then the triangle inequality and the packing separation imply \begin{align*} |X(\hat\beta(Y)-\beta_j)| \ge \frac{\sqrt{n}s}{2}. \end{align*} Therefore, under $P_{\beta_j}$, \begin{align*} \mathbb E_{\beta_j}\left[\frac{1}{n}|X(\hat\beta-\beta_j)|^2\right] \ge \frac{s^2}{4}P_{\beta_j}(\psi_{\hat\beta}\ne j). \end{align*} Taking the supremum over $j$ and then the infimum over $\hat\beta$ gives \begin{align*} \inf_{\hat\beta}\sup_{\beta\in\Theta_k} \mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right] \ge \frac{c_3s^2}{4}. \end{align*} By the definition of $s^2$, \begin{align*} \frac{c_3s^2}{4}=\frac{c_3a_0r^2k}{8}. \end{align*} By the definition of $r^2$, \begin{align*} \frac{c_3a_0r^2k}{8}=\frac{c_3a_0c_2}{8A_0}\sigma^2\frac{\log M}{n}. \end{align*} Since $\log M\ge c_1k\log(d/k)$, the lower bound follows with \begin{align*} c:=\frac{c_1c_2c_3a_0}{8A_0}. \end{align*} [guided] We now convert separation of the finite packing into a lower bound for every estimator. The [Fano inequality](/page/Fano%27s%20Inequality) form used here requires probability measures $P_1,\dots,P_M$ and a reference measure $P_0$ satisfying \begin{align*} \frac{1}{M}\sum_{j=1}^M D_{\mathrm{KL}}(P_j\|P_0)\le \alpha\log M \quad\text{with }0<\alpha<1/8. \end{align*} The previous step verified this hypothesis for $P_j=P_{\beta_j}$ with $\alpha=1/16$. Therefore every measurable test $\psi:\mathbb R^n\to\{1,\dots,M\}$ satisfies \begin{align*} \sup_{1\le j\le M}P_{\beta_j}(\psi\ne j)\ge c_3, \end{align*} where $c_3>0$ is universal. Let $\hat\beta:\mathbb R^n\to\mathbb R^d$ be any estimator. From it we build a test by nearest-neighbour decoding in prediction norm: define $\psi_{\hat\beta}:\mathbb R^n\to\{1,\dots,M\}$ to be an index minimizing \begin{align*} j\mapsto |X(\hat\beta(Y)-\beta_j)|, \end{align*} with a fixed tie-breaking rule. If the true index is $j$ but $\psi_{\hat\beta}(Y)=\ell\ne j$, then the nearest-neighbour property gives \begin{align*} |X(\hat\beta(Y)-\beta_\ell)|\le |X(\hat\beta(Y)-\beta_j)|. \end{align*} Using the triangle inequality and the packing separation, \begin{align*} \sqrt{n}s \le |X(\beta_j-\beta_\ell)| \le |X(\hat\beta(Y)-\beta_j)|+|X(\hat\beta(Y)-\beta_\ell)| \le 2|X(\hat\beta(Y)-\beta_j)|. \end{align*} Hence on the event $\{\psi_{\hat\beta}(Y)\ne j\}$, \begin{align*} \frac{1}{n}|X(\hat\beta(Y)-\beta_j)|^2\ge \frac{s^2}{4}. \end{align*} Taking expectation under $P_{\beta_j}$ gives \begin{align*} \mathbb E_{\beta_j}\left[\frac{1}{n}|X(\hat\beta-\beta_j)|^2\right] \ge \frac{s^2}{4}P_{\beta_j}(\psi_{\hat\beta}\ne j). \end{align*} Now take the supremum over $j$ and use the Fano lower bound on testing error: \begin{align*} \sup_{1\le j\le M}\mathbb E_{\beta_j}\left[\frac{1}{n}|X(\hat\beta-\beta_j)|^2\right] \ge \frac{c_3s^2}{4}. \end{align*} Since $\{\beta_1,\dots,\beta_M\}\subset\Theta_k$, this lower bounds the supremum over all $\beta\in\Theta_k$. Taking the infimum over estimators and substituting \begin{align*} s^2=\frac{a_0r^2k}{2}, \qquad r^2=\frac{c_2\sigma^2\log M}{A_0nk}, \end{align*} yields \begin{align*} \inf_{\hat\beta}\sup_{\beta\in\Theta_k} \mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right] \ge \frac{c_3a_0c_2}{8A_0}\sigma^2\frac{\log M}{n}. \end{align*} Finally $\log M\ge c_1k\log(d/k)$, so the lower bound holds with \begin{align*} c:=\frac{c_1c_2c_3a_0}{8A_0}. \end{align*} [/guided] [/step] [step:Control constrained least squares by Gaussian projections] For each support set $S\subset\{1,\dots,d\}$ with $|S|\le k$, let $X_S$ denote the submatrix of $X$ formed by the columns indexed by $S$, and let $X_S^+$ denote its Moore-Penrose pseudoinverse. Define $b_S:\mathbb{R}^n\to\Theta_k$ by placing $X_S^+Y$ on the coordinates in $S$ and placing $0$ on $S^c$. Among the finitely many supports $S$ with $|S|\le k$, choose one minimizing $|Y-Xb_S(Y)|^2$, with lexicographic tie-breaking on the ordered support. This gives a measurable estimator $\hat\beta_{\mathrm{cls}}:\mathbb{R}^n\to\Theta_k$, since each $b_S$ is linear and the finite lexicographic comparison is made by measurable inequalities. For $\beta\in\Theta_k$, let $\varepsilon:=Y-X\beta$. Since $\beta$ is feasible, \begin{align*} |Y-X\hat\beta_{\mathrm{cls}}|^2\le |Y-X\beta|^2. \end{align*} With $\Delta:=\hat\beta_{\mathrm{cls}}-\beta$, this becomes \begin{align*} |\varepsilon-X\Delta|^2\le |\varepsilon|^2, \end{align*} and hence \begin{align*} |X\Delta|^2\le 2\varepsilon\cdot X\Delta. \end{align*} The vector $\Delta$ is $2k$-sparse. For each set $S\subset\{1,\dots,d\}$ with $|S|\le 2k$, let $V_S:=\operatorname{Range}(X_S)\subset\mathbb{R}^n$, where $X_S$ is the submatrix of $X$ formed by the columns indexed by $S$, and let $\Pi_S:\mathbb{R}^n\to V_S$ be the Euclidean [orthogonal projection](/theorems/437). If $S=\operatorname{supp}\Delta$, then $X\Delta\in V_S$, so \begin{align*} |X\Delta|^2 \le 2|\Pi_S\varepsilon|\,|X\Delta| \end{align*} by the [Cauchy-Schwarz inequality](/theorems/432) in $\mathbb{R}^n$. Therefore, defining the maximum over all subsets of size at most $2k$ in prose to avoid a row break in the subscript, \begin{align*} |X\Delta|^2\le4\max_{S\subset\{1,\dots,d\}:\ |S|\le 2k}|\Pi_S\varepsilon|^2. \end{align*} [guided] The constrained least-squares estimator is the natural upper-bound estimator because it searches only over the true sparsity class. Define \begin{align*} \hat\beta_{\mathrm{cls}}:\mathbb{R}^n\to\Theta_k \end{align*} to be any minimizer of $b\mapsto |Y-Xb|^2$ over $b\in\Theta_k$. Since the true $\beta$ belongs to $\Theta_k$, it is one of the candidates in this minimization. Hence \begin{align*} |Y-X\hat\beta_{\mathrm{cls}}|^2\le |Y-X\beta|^2. \end{align*} Write the noise vector as $\varepsilon:=Y-X\beta$ and define the estimation error \begin{align*} \Delta:=\hat\beta_{\mathrm{cls}}-\beta. \end{align*} Then $Y-X\hat\beta_{\mathrm{cls}}=\varepsilon-X\Delta$, so the previous inequality becomes \begin{align*} |\varepsilon-X\Delta|^2\le |\varepsilon|^2. \end{align*} Expanding the square in the Euclidean [inner product](/page/Inner%20Product) on $\mathbb{R}^n$ gives \begin{align*} |\varepsilon|^2-2\varepsilon\cdot X\Delta+|X\Delta|^2\le |\varepsilon|^2, \end{align*} and cancelling $|\varepsilon|^2$ yields \begin{align*} |X\Delta|^2\le 2\varepsilon\cdot X\Delta. \end{align*} Now $\hat\beta_{\mathrm{cls}}$ and $\beta$ are both $k$-sparse, so $\Delta$ is $2k$-sparse. For each set $S\subset\{1,\dots,d\}$ with $|S|\le 2k$, define $V_S:=\operatorname{Range}(X_S)\subset\mathbb{R}^n$, where $X_S$ is the submatrix of $X$ with columns indexed by $S$, and let $\Pi_S:\mathbb{R}^n\to V_S$ denote the Euclidean orthogonal projection. If $S=\operatorname{supp}\Delta$, then $X\Delta\in V_S$. Therefore \begin{align*} \varepsilon\cdot X\Delta=(\Pi_S\varepsilon)\cdot X\Delta. \end{align*} Applying the Cauchy-Schwarz inequality in $\mathbb{R}^n$ gives \begin{align*} |X\Delta|^2 \le 2|\Pi_S\varepsilon|\,|X\Delta|. \end{align*} If $|X\Delta|=0$, the desired bound is immediate. Otherwise divide by $|X\Delta|$ and square: \begin{align*} |X\Delta|^2\le 4|\Pi_S\varepsilon|^2. \end{align*} Since $S$ is one of the subsets of $\{1,\dots,d\}$ with cardinality at most $2k$, enlarging from this single support to the maximum over all such supports gives \begin{align*} |X\Delta|^2\le4\max_{T\subset\{1,\dots,d\}:\ |T|\le 2k}|\Pi_T\varepsilon|^2. \end{align*} This is the key deterministic inequality: the prediction error is controlled by the largest Gaussian projection onto a coordinate-generated subspace of dimension at most $2k$. [/guided] [/step] [step:Bound the maximum projected noise energy] For a fixed set $S\subset\{1,\dots,d\}$ with $|S|\le2k$, the subspace $V_S$ has dimension at most $2k$, and \begin{align*} |\Pi_S\varepsilon|^2/\sigma^2 \end{align*} is stochastically dominated by a chi-square [random variable](/page/Random%20Variable) with $2k$ degrees of freedom. This follows because $\Pi_S\varepsilon$ is a centered Gaussian vector in the subspace $V_S$ with covariance operator bounded by $\sigma^2\Pi_S$. The number of sets $S$ with $|S|\le2k$ is at most $N$, where \begin{align*} N:=\sum_{m=0}^{2k}\binom{d}{m}. \end{align*} Using $2k\le d$ and the elementary binomial bound $\binom{d}{m}\le (ed/m)^m$ for $1\le m\le 2k$, there is a universal constant $c_4>0$ such that \begin{align*} N\le \exp\{c_4k\log(ed/k)\}. \end{align*} We now derive the needed maximal bound from a chi-square tail estimate. For a fixed $S$, the [Laurent-Massart chi-square tail bound](/page/Laurent-Massart%20Chi-Square%20Tail%20Bound) gives, for every $u>0$, \begin{align*} \mathbb P_\beta\left(|\Pi_S\varepsilon|^2>\sigma^2(2k+2\sqrt{2ku}+2u)\right)\le e^{-u}. \end{align*} Taking a union bound over the $N$ supports gives, for every $u>0$, \begin{align*} \mathbb P_\beta\left(\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2>\sigma^2(2k+2\sqrt{2k(u+\log N)}+2(u+\log N))\right)\le e^{-u}. \end{align*} Using $2\sqrt{ab}\le a+b$ with $a=2k$ and $b=u+\log N$, there is a universal constant $c_6>0$ such that, for every $u>0$, \begin{align*} \mathbb P_\beta\left(\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2>c_6\sigma^2(k+\log N+u)\right)\le e^{-u}. \end{align*} For the nonnegative random variable \begin{align*} Z:=\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2, \end{align*} set $t_0:=c_6\sigma^2(k+\log N)$. The tail identity for nonnegative random variables gives \begin{align*} \mathbb E_\beta[Z] \le t_0+\int_{t_0}^\infty \mathbb P_\beta(Z>t)\,d\mathcal{L}^1(t), \end{align*} where $\mathcal{L}^1$ denotes one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure). With the change of variables $t=t_0+c_6\sigma^2u$, this becomes \begin{align*} \mathbb E_\beta[Z] \le c_6\sigma^2(k+\log N)+c_6\sigma^2\int_0^\infty e^{-u}\,d\mathcal{L}^1(u). \end{align*} Since $\int_0^\infty e^{-u}\,d\mathcal{L}^1(u)=1$ and $\log N\le c_4k\log(ed/k)$, after enlarging constants there is a universal constant $c_5>0$ such that \begin{align*} \mathbb E_\beta\left[\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2\right] \le c_5\sigma^2 k\log(ed/k). \end{align*} Combining this with the deterministic inequality from the previous step yields \begin{align*} \mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{cls}}-\beta)|^2\right]\le4c_5\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} Taking the supremum over $\beta\in\Theta_k$ and then using that the minimax risk is at most the risk of $\hat\beta_{\mathrm{cls}}$ gives the upper bound with $C:=4c_5$. [guided] We must bound the random variable \begin{align*} Z:=\max_{S\subset\{1,\dots,d\}:\ |S|\le2k}|\Pi_S\varepsilon|^2, \end{align*} where $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$ and $\Pi_S$ is the Euclidean orthogonal projection onto $V_S=\operatorname{Range}(X_S)$. For a fixed $S$, the subspace $V_S$ has dimension at most $|S|\le2k$. Therefore $|\Pi_S\varepsilon|^2/\sigma^2$ is stochastically dominated by a chi-square random variable with $2k$ degrees of freedom. The number of possible supports is \begin{align*} N:=\sum_{m=0}^{2k}\binom{d}{m}. \end{align*} Since $2k\le d$, the binomial estimate $\binom{d}{m}\le (ed/m)^m$ for $1\le m\le2k$ implies, after absorbing the finite sum into a universal constant, that \begin{align*} N\le \exp\{c_4k\log(ed/k)\} \end{align*} for a universal constant $c_4>0$. For a fixed $S$, the [Laurent-Massart chi-square tail bound](/page/Laurent-Massart%20Chi-Square%20Tail%20Bound) gives, for every $v>0$, \begin{align*} \mathbb P_\beta\left(|\Pi_S\varepsilon|^2>\sigma^2(2k+2\sqrt{2kv}+2v)\right)\le e^{-v}. \end{align*} Apply this with $v=u+\log N$ and take a union bound over the $N$ supports. For every $u>0$, \begin{align*} \mathbb P_\beta\left(Z>\sigma^2(2k+2\sqrt{2k(u+\log N)}+2(u+\log N))\right)\le e^{-u}. \end{align*} Using $2\sqrt{ab}\le a+b$ with $a=2k$ and $b=u+\log N$, there is a universal constant $c_6>0$ such that \begin{align*} \mathbb P_\beta\left(Z>c_6\sigma^2(k+\log N+u)\right)\le e^{-u}. \end{align*} Set $t_0:=c_6\sigma^2(k+\log N)$. The tail identity for nonnegative random variables gives \begin{align*} \mathbb E_\beta[Z] \le t_0+\int_{t_0}^\infty \mathbb P_\beta(Z>t)\,d\mathcal L^1(t). \end{align*} With $t=t_0+c_6\sigma^2u$, the measure transforms as $d\mathcal L^1(t)=c_6\sigma^2\,d\mathcal L^1(u)$ and the domain $[t_0,\infty)$ becomes $[0,\infty)$. Hence \begin{align*} \mathbb E_\beta[Z] \le c_6\sigma^2(k+\log N)+c_6\sigma^2\int_0^\infty e^{-u}\,d\mathcal L^1(u) = c_6\sigma^2(k+\log N+1). \end{align*} Since $\log N\le c_4k\log(ed/k)$ and $k\ge1$, enlarging constants gives \begin{align*} \mathbb E_\beta[Z] \le c_5\sigma^2k\log(ed/k) \end{align*} for a universal constant $c_5>0$. Combining this with the deterministic inequality \begin{align*} |X(\hat\beta_{\mathrm{cls}}-\beta)|^2\le4Z \end{align*} gives \begin{align*} \mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{cls}}-\beta)|^2\right] \le 4c_5\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} Taking the supremum over $\beta\in\Theta_k$ and comparing the minimax risk to this particular estimator proves the upper bound with $C:=4c_5$. [/guided] [/step] [step:Combine the two bounds] The preceding lower-bound argument gives \begin{align*} \inf_{\hat\beta}\sup_{\beta\in\Theta_k} \mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right] \ge c\sigma^2\frac{k\log(d/k)}{n}, \end{align*} with $c>0$ depending only on $a_0$ and $A_0$. The constrained least-squares argument gives \begin{align*} \inf_{\hat\beta}\sup_{\beta\in\Theta_k} \mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right] \le C\sigma^2\frac{k\log(ed/k)}{n}, \end{align*} with a universal constant $C>0$, hence in particular with $C$ depending only on $a_0$ and $A_0$. This proves the claimed minimax sparse prediction rate. [guided] The lower-bound steps proved that every estimator has prediction risk at least \begin{align*} c\sigma^2\frac{k\log(d/k)}{n}, \end{align*} where \begin{align*} c=\frac{c_1c_2c_3a_0}{8A_0}>0. \end{align*} Thus $c$ depends only on $a_0$ and $A_0$, because $c_1,c_2,c_3$ are universal constants. The upper-bound steps constructed the constrained least-squares estimator $\hat\beta_{\mathrm{cls}}$ and proved that, for every $\beta\in\Theta_k$, \begin{align*} \mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{cls}}-\beta)|^2\right] \le C\sigma^2\frac{k\log(ed/k)}{n} \end{align*} with $C:=4c_5$, a universal constant. Since the minimax risk is the infimum over all estimators, it is no larger than the risk of this particular estimator. Combining these two inequalities gives exactly \begin{align*} c\sigma^2\frac{k\log(d/k)}{n} \le \inf_{\hat\beta}\sup_{\beta\in\Theta_k}\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta-\beta)|^2\right] \le C\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} This is the claimed minimax sparse prediction rate. [/guided] [/step]

Prerequisites (0/11 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

Explore Further

Expectation Definition Random Variable Definition Event Definition Matrix Definition Set Definition test Theorem #89 Cauchy-Schwarz Inequality Theorem #432 Orthogonal Projection Theorem #437 Cauchy-Schwarz Inequality for Covariance Theorem #5018 Fano's Inequality Theorem #1654 Triangle Inequality For Inner Product Spaces Theorem #433 Chebyshev's Inequality Probability Theory Minimax Risk for the Unrestricted Gaussian Location Model Probability & Statistics Countable Additivity of Conditional Probability Probability Theory Assouad's Lemma for Coordinatewise Support Recovery Probability & Statistics Singular Value Decomposition Formula for Ridge Regression Probability & Statistics Discrete Factorisation Criterion Probability Theory Unbiasedness of Ordinary Least Squares Under Strict Exogeneity Probability & Statistics Limiting Events Are Measurable Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.