Minimax Euclidean Risk for Sparse Linear Regression with Gaussian Design

Minimax Euclidean Risk for Sparse Linear Regression with Gaussian Design (Theorem # 5913)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We work on the high-probability event on which the Gaussian design has restricted sparse eigenvalues of order $n$ up to sparsity $2k$. On that event, the empirical prediction norm $v \mapsto n^{-1/2}|Xv|$ and the Euclidean norm $v \mapsto |v|$ are equivalent for every $2k$-sparse vector. The lower bound follows by applying a sparse Fano testing argument on a $k$-sparse packing and using the upper sparse eigenvalue control to keep Kullback-Leibler divergences below a fixed fraction of the packing entropy. The upper bound uses the constrained least-squares estimator over the $k$-sparse parameter set, applies the prediction oracle bound at scale $\sigma^2 k\log(ed/k)$ for the unaveraged prediction loss $|X(\hat\beta-\beta)|^2$, and converts prediction error into Euclidean error using the lower sparse eigenvalue control. [/proofplan] [step:Condition on the restricted eigenvalue event for the Gaussian design] For an integer $s \in \{1,\dots,d\}$, define the lower restricted eigenvalue of $X$ at sparsity $s$ by \begin{align*} \kappa_-(s;X) &:= \inf\left\{\frac{|Xv|^2}{|v|^2}: v \in \mathbb R^d,\ 0<|\{j:v_j\neq 0\}|\leq s\right\}. \end{align*} Define the upper restricted eigenvalue of $X$ at sparsity $s$ by \begin{align*} \kappa_+(s;X) &:= \sup\left\{\frac{|Xv|^2}{|v|^2}: v \in \mathbb R^d,\ 0<|\{j:v_j\neq 0\}|\leq s\right\}. \end{align*} Because the theorem statement takes $X$ to have independent $\mathcal N(0,1)$ entries, the [Gaussian restricted eigenvalue bound for sparse vectors](/page/Gaussian%20Restricted%20Eigenvalue%20Bound) applies in its unnormalized form. The distributional hypothesis is verified directly: the entries of $X$ are independent centered Gaussian random variables with variance $1$, and $X$ is independent of the noise vector $\varepsilon$. The bound gives universal constants $a,b,c \in (0,\infty)$ and $C_0 \in (0,\infty)$ such that, whenever \begin{align*} n \geq C_0 k \log\left(\frac{ed}{k}\right), \end{align*} the event \begin{align*} \mathcal E_X := \{an \leq \kappa_-(2k;X) \text{ and } \kappa_+(2k;X) \leq bn\} \end{align*} satisfies \begin{align*} \mathbb P(\mathcal E_X) \geq 1 - 2\exp\left(-c k \log\left(\frac{ed}{k}\right)\right). \end{align*} For the rest of the proof, fix a deterministic matrix $X \in \mathcal E_X$. Define the $k$-sparse parameter set \begin{align*} \Theta_k := \{\theta\in\mathbb R^d:|\{j:\theta_j\neq0\}|\leq k\}. \end{align*} If $\beta,\gamma \in \Theta_k$, then $\beta-\gamma$ is $2k$-sparse. Hence the definition of $\mathcal E_X$ gives \begin{align*} an|\beta-\gamma|^2 \leq |X(\beta-\gamma)|^2 \leq bn|\beta-\gamma|^2. \end{align*} [/step] [step:Prove the Euclidean minimax lower bound by sparse testing] Let $\mathcal A$ denote the class of measurable estimators $\hat\beta:\mathbb R^n\times\mathbb R^{n\times d}\to\mathbb R^d$. Define \begin{align*} R_X := \inf_{\hat\beta\in\mathcal A}\sup_{\beta\in\Theta_k} \mathbb E_\beta\left[|\hat\beta(Y,X)-\beta|^2\mid X\right]. \end{align*} Use the following standard [Varshamov-Gilbert packing](/page/Varshamov-Gilbert%20Bound) and reference-measure form of [Fano inequality](/page/Fano%27s%20Inequality) for sparse Gaussian regression: if a finite parameter set has pairwise squared Euclidean separation at least $r^2$ and every corresponding law has Kullback-Leibler divergence at most $(1/16)\log M$ from a common reference law, where $M$ is the cardinality of the finite set, then the minimax squared Euclidean risk over that finite set is at least a universal constant times $r^2$. There are universal constants $c_2,c_3,c_4 \in (0,\infty)$ such that, for each $1\leq k\leq d/2$, there is a set $\mathcal U\subset\{0,1\}^d$ with $|u|_0=k$ for every $u\in\mathcal U$, \begin{align*} \log |\mathcal U| \geq c_2 k\log\left(\frac{ed}{k}\right), \end{align*} and $|u-v|^2\geq c_3 k$ whenever $u,v\in\mathcal U$ and $u\neq v$. Define \begin{align*} \delta^2 := \alpha\sigma^2\frac{\log(ed/k)}{n}, \qquad \mathcal V := \{\delta u : u\in\mathcal U\}\subset\Theta_k, \end{align*} where $\alpha>0$ is chosen so that $\alpha b/2\leq c_2/16$. For each $\theta=\delta u\in\mathcal V$, \begin{align*} |\theta|^2=k\delta^2, \end{align*} so the upper restricted eigenvalue bound gives \begin{align*} D_{\mathrm{KL}}\!\left(\mathcal N(X\theta,\sigma^2 I_n)\,\middle\|\,\mathcal N(0,\sigma^2 I_n)\right) = \frac{|X\theta|^2}{2\sigma^2} \leq \frac{bnk\delta^2}{2\sigma^2} = \frac{\alpha b}{2}k\log\left(\frac{ed}{k}\right) \leq \frac{1}{16}\log|\mathcal V|. \end{align*} Moreover distinct $\theta,\eta\in\mathcal V$ satisfy \begin{align*} |\theta-\eta|^2 = \delta^2|u-v|^2 \geq c_3 k\delta^2. \end{align*} [Fano's inequality](/theorems/1654) therefore gives \begin{align*} R_X \geq c_4 c_3 k\delta^2 = c_4c_3\alpha\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} Setting $c_1:=c_4c_3\alpha$ proves the lower bound. [guided] This guided block rewrites the lower-bound step; the upper-bound estimator argument is proved in the following steps. We now give the finite testing construction explicitly. The goal is to find many $k$-sparse vectors that are well separated in Euclidean distance, while the corresponding Gaussian laws remain close enough in Kullback-Leibler divergence for Fano's inequality to apply. By the [Varshamov-Gilbert packing](/page/Varshamov-Gilbert%20Bound), there are universal constants $c_2,c_3\in(0,\infty)$ and a set $\mathcal U\subset\{0,1\}^d$ such that each $u\in\mathcal U$ has exactly $k$ nonzero coordinates, \begin{align*} \log |\mathcal U| \geq c_2 k\log\left(\frac{ed}{k}\right), \end{align*} and, for $u\neq v$ in $\mathcal U$, \begin{align*} |u-v|^2\geq c_3k. \end{align*} Choose a scale $\delta>0$ by \begin{align*} \delta^2 := \alpha\sigma^2\frac{\log(ed/k)}{n}, \end{align*} where $\alpha>0$ will be fixed below, and define the finite parameter set \begin{align*} \mathcal V := \{\delta u : u\in\mathcal U\}\subset\Theta_k. \end{align*} This declaration ensures that every element of $\mathcal V$ is $k$-sparse, hence admissible. It also gives a precise separation estimate: if $\theta=\delta u$ and $\eta=\delta v$ are distinct elements of $\mathcal V$, then \begin{align*} |\theta-\eta|^2 = \delta^2|u-v|^2 \geq c_3k\delta^2. \end{align*} Next we control the information distance. Under parameter $\theta\in\mathcal V$, the conditional law of $Y$ given $X$ is the Gaussian measure $\mathcal N(X\theta,\sigma^2I_n)$. The Kullback-Leibler divergence from the law at $0$ is \begin{align*} D_{\mathrm{KL}}\!\left(\mathcal N(X\theta,\sigma^2 I_n)\,\middle\|\,\mathcal N(0,\sigma^2 I_n)\right) = \frac{|X\theta|^2}{2\sigma^2}. \end{align*} Because $\theta$ is $k$-sparse and $X\in\mathcal E_X$, the upper restricted eigenvalue bound gives \begin{align*} \frac{|X\theta|^2}{2\sigma^2} \leq \frac{bn|\theta|^2}{2\sigma^2}. \end{align*} Since $\theta=\delta u$ and $|u|^2=k$, this becomes \begin{align*} \frac{bn|\theta|^2}{2\sigma^2} = \frac{bnk\delta^2}{2\sigma^2} = \frac{\alpha b}{2}k\log\left(\frac{ed}{k}\right). \end{align*} Choose $\alpha$ so that $\alpha b/2\leq c_2/16$. Then \begin{align*} D_{\mathrm{KL}}\!\left(\mathcal N(X\theta,\sigma^2 I_n)\,\middle\|\,\mathcal N(0,\sigma^2 I_n)\right) \leq \frac{1}{16}\log|\mathcal V|. \end{align*} The reference-measure form of Fano's inequality applies because all conditional Gaussian laws indexed by $\mathcal V$ have Kullback-Leibler divergence at most $(1/16)\log |\mathcal V|$ from the common reference law $\mathcal N(0,\sigma^2I_n)$. It gives a universal constant $c_4>0$ such that every estimator has worst-case squared Euclidean risk at least $c_4$ times the squared separation scale. Hence \begin{align*} R_X \geq c_4 c_3 k\delta^2 = c_4c_3\alpha\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} With $c_1:=c_4c_3\alpha$, this is the desired lower bound. [/guided] [/step] [step:Choose the constrained least-squares estimator for the upper bound] Define the constrained least-squares estimator as a measurable map \begin{align*} \hat\beta_{\mathrm{LS}}: \mathbb R^n\times\mathbb R^{n\times d} &\to \Theta_k. \end{align*} For each $(y,X)\in\mathbb R^n\times\mathbb R^{n\times d}$, choose \begin{align*} \hat\beta_{\mathrm{LS}}(y,X) \in \operatorname*{argmin}_{\theta\in\Theta_k}|y-X\theta|^2, \end{align*} with an arbitrary measurable minimizer selected if there is more than one minimizer. We justify that such a measurable choice exists on the fixed event $\mathcal E_X$. For each support set $S\subset\{1,\dots,d\}$ with $|S|\leq k$, the restriction of the objective to the coordinate subspace \begin{align*} L_S:=\{\theta\in\mathbb R^d:\theta_j=0\text{ for every }j\notin S\} \end{align*} is a continuous quadratic function of the coordinates in $S$. Since $X\in\mathcal E_X$ and $|S|\leq k\leq2k$, the lower restricted eigenvalue bound gives coercivity on $L_S$; hence the minimum on $L_S$ is attained. There are finitely many such support sets, so taking the support with the smallest attained residual gives a global minimizer over $\Theta_k$. The finite-support construction and a fixed deterministic tie-breaking rule give a Borel measurable minimizer as a function of $(y,X)$. Since $\hat\beta_{\mathrm{LS}}(Y,X)\in\Theta_k$ and $\beta\in\Theta_k$, the difference \begin{align*} \hat\Delta := \hat\beta_{\mathrm{LS}}(Y,X)-\beta \end{align*} is $2k$-sparse. On $\mathcal E_X$, the lower restricted eigenvalue bound gives \begin{align*} |\hat\Delta|^2 \leq \frac{1}{an}|X\hat\Delta|^2. \end{align*} Taking [conditional expectation](/page/Conditional%20Expectation) under parameter $\beta$ gives \begin{align*} \mathbb E_\beta\left[|\hat\beta_{\mathrm{LS}}(Y,X)-\beta|^2\mid X\right] \leq \frac{1}{an} \mathbb E_\beta\left[|X(\hat\beta_{\mathrm{LS}}(Y,X)-\beta)|^2\mid X\right]. \end{align*} [/step] [step:Apply the sparse prediction oracle bound and convert to Euclidean risk] The [sparse prediction oracle bound for constrained least squares](/page/Sparse%20Prediction%20Oracle%20Bound) applies to the conditional Gaussian model $Y=X\beta+\varepsilon$ with fixed deterministic design $X$ and noise $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$. Its hypotheses match the present setup: after conditioning on the deterministic matrix $X$, the noise vector is centered Gaussian with covariance matrix $\sigma^2I_n$; the estimator is the least-squares minimizer constrained to the $k$-sparse set $\Theta_k$; the true parameter $\beta$ belongs to the same set $\Theta_k$; and the cited fixed-design result is uniform over deterministic matrices $X$. It gives a universal constant $c_5>0$ such that, for every $\beta\in\Theta_k$, \begin{align*} \mathbb E_\beta\left[|X(\hat\beta_{\mathrm{LS}}(Y,X)-\beta)|^2\mid X\right] \leq c_5\sigma^2 k\log\left(\frac{ed}{k}\right). \end{align*} Combining this estimate with the lower restricted eigenvalue conversion from the previous step yields \begin{align*} \mathbb E_\beta\left[|\hat\beta_{\mathrm{LS}}(Y,X)-\beta|^2\mid X\right] \leq \frac{c_5}{a}\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} Taking the supremum over $\beta\in\Theta_k$ and then using that the minimax risk is no larger than the risk of the particular estimator $\hat\beta_{\mathrm{LS}}$, we obtain \begin{align*} R_X \leq \frac{c_5}{a}\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} [/step] [step:Combine the two bounds on the high-probability event] For every fixed $X\in\mathcal E_X$, the lower and upper bounds proved above give \begin{align*} c_1\sigma^2\frac{k\log(ed/k)}{n} \leq \inf_{\hat\beta\in\mathcal A}\sup_{\beta\in\Theta_k} \mathbb E_\beta\left[|\hat\beta(Y,X)-\beta|^2\mid X\right] \leq \frac{c_5}{a}\sigma^2\frac{k\log(ed/k)}{n}. \end{align*} Set \begin{align*} A_0 := c_1, \qquad A_1 := \frac{c_5}{a}, \qquad c_0 := c. \end{align*} Choose the universal sample-size constant $C$ in the theorem statement so that $C\geq C_0$, where $C_0$ is the constant from the restricted-eigenvalue event. Then the hypothesis $n\geq Ck\log(ed/k)$ implies $n\geq C_0k\log(ed/k)$, so $\mathcal E_X$ has the probability bound proved in the first step. Since $\mathbb P(\mathcal E_X)\geq 1-2\exp(-c_0 k\log(ed/k))$, the asserted high-probability minimax Euclidean rate follows. [/step]

Prerequisites (0/6 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Fano's Inequality

Definitions & Concepts

Explore Further

What brings you to Androma?

Start with a route through the knowledge graph.