Necessary Signal Strength for Exact Support Recovery in Sparse Gaussian Linear Regression

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We argue by contradiction through a many-way testing reduction. On a restricted-eigenvalue event, fix a common baseline support of size $k_\ell-1$ and vary one additional active coordinate among at least order $d_\ell$ possible locations. Exact support recovery on the whole sparse class would identify this varying coordinate with vanishing error. A self-contained Fano bound shows this is impossible unless the average Kullback-Leibler divergence between the induced Gaussian laws is at least a constant multiple of the logarithm of the number of alternatives; the restricted-eigenvalue upper bound controls that divergence by $n_\ell a_\ell^2/\sigma^2$. No lower restricted-eigenvalue estimate, sample-size estimate, or distributional property of the Gaussian design is used after conditioning on $X\in E_\ell$; those hypotheses serve only to place the deterministic obstruction inside the surrounding high-probability random-design setting. [/proofplan] [step:Build a many-alternative subfamily inside the sparse parameter class] Fix $\ell$ and write $n:=n_\ell$, $d:=d_\ell$, $k:=k_\ell$, $a:=a_\ell$, and $X:=X_\ell$. Let $E:=E_\ell$. Work on an arbitrary realization of $X$ in $E$. For $\gamma\in\mathbb R^d$, define its support by \begin{align*} S(\gamma):=\{i\in\{1,\dots,d\}:\gamma_i\ne 0\}, \end{align*} and define the sparse signal class by \begin{align*} \Theta_k(a):=\{\gamma\in\mathbb R^d: |S(\gamma)|=k \text{ and } |\gamma_i|\ge a \text{ for every } i\in S(\gamma)\}. \end{align*} If $k=1$, set $B:=\varnothing$. If $k\ge 2$, set \begin{align*} B:=\{1,\dots,k-1\}. \end{align*} Define the candidate coordinate set \begin{align*} J:=\{k,k+1,\dots,d\}. \end{align*} Then \begin{align*} M:=|J|=d-k+1\ge d/2, \end{align*} because $k\le d/2$. For each $j\in J$, define the parameter vector $\beta_j\in\mathbb R^d$ as follows: $(\beta_j)_i=a$ when $i\in B\cup\{j\}$, and $(\beta_j)_i=0$ when $i\notin B\cup\{j\}$. Then $S(\beta_j)=B\cup\{j\}$ and $|S(\beta_j)|=k$, so $\beta_j\in\Theta_k(a)$ for every $j\in J$. Let $P_j^X$ denote the conditional law of $y$ given $X$ under parameter $\beta_j$. Thus \begin{align*} P_j^X=\mathcal N(X\beta_j,\sigma^2 I_n) \end{align*} as a probability measure on $\mathbb R^n$ equipped with its Borel $\sigma$-algebra. [guided] For any vector $\gamma\in\mathbb R^d$, its support is \begin{align*} S(\gamma):=\{i\in\{1,\dots,d\}:\gamma_i\ne 0\}. \end{align*} The sparse signal class in this fixed dimension is \begin{align*} \Theta_k(a):=\{\gamma\in\mathbb R^d: |S(\gamma)|=k \text{ and } |\gamma_i|\ge a \text{ for every } i\in S(\gamma)\}. \end{align*} Thus $\Theta_k(a)$ contains only vectors with exactly $k$ active coordinates and all nonzero entries at least $a$ in absolute value. Therefore we cannot use the simpler alternatives $0$ and $a e_j$ when $k>1$, because $0\notin\Theta_k(a)$ and $a e_j$ has support size $1$ rather than $k$. The repair is to keep $k-1$ active coordinates fixed and make only the last active coordinate vary. If $k=1$, no baseline coordinates are needed, so $B=\varnothing$. If $k\ge 2$, we take \begin{align*} B=\{1,\dots,k-1\}. \end{align*} The remaining candidate coordinates are \begin{align*} J=\{k,k+1,\dots,d\}. \end{align*} Since $k\le d/2$, the number of candidates satisfies \begin{align*} M=|J|=d-k+1\ge d/2. \end{align*} For each $j\in J$, define $\beta_j:\{1,\dots,d\}\to\mathbb R$ componentwise by setting $(\beta_j)_i=a$ when $i\in B\cup\{j\}$ and $(\beta_j)_i=0$ when $i\notin B\cup\{j\}$. Equivalently, $\beta_j$ has coefficient $a$ on every baseline coordinate in $B$ and coefficient $a$ at the one varying coordinate $j$. Hence \begin{align*} S(\beta_j)=B\cup\{j\},\qquad |S(\beta_j)|=k,\qquad \min_{i\in S(\beta_j)}|(\beta_j)_i|=a. \end{align*} Thus every $\beta_j$ lies in $\Theta_k(a)$. For this fixed design matrix $X$, the observation model under $\beta_j$ is Gaussian: \begin{align*} y=X\beta_j+\varepsilon,\qquad \varepsilon\sim\mathcal N(0,\sigma^2 I_n). \end{align*} We denote this conditional law by \begin{align*} P_j^X=\mathcal N(X\beta_j,\sigma^2 I_n). \end{align*} The testing problem is now: from one draw $y\sim P_j^X$, identify which $j\in J$ generated the observation. [/guided] [/step] [step:Convert support recovery into identification of the varying coordinate] Let \begin{align*} \hat S: \mathbb R^n\times\mathbb R^{n\times d} &\to 2^{\{1,\dots,d\}} \end{align*} be any measurable estimator of the support. For the fixed design $X$, define the induced coordinate estimator \begin{align*} \hat j_X: \mathbb R^n &\to J\cup\{\ast\} \end{align*} as follows. If $\hat S(y,X)\setminus B$ is a singleton contained in $J$, then $\hat j_X(y)$ is the unique element of $\hat S(y,X)\setminus B$. In all other cases, set $\hat j_X(y):=\ast$. Here $\ast$ is an auxiliary failure symbol not belonging to $J$. If $\hat S(y,X)=S(\beta_j)=B\cup\{j\}$, then $\hat j_X(y)=j$. Therefore, for every $j\in J$, \begin{align*} P_j^X(\hat j_X(y)\ne j) \le P_j^X(\hat S(y,X)\ne S(\beta_j)). \end{align*} Consequently, with \begin{align*} R_X:=\sup_{\beta\in\Theta_k(a)} \mathbb P_\beta(\hat S(y,X)\ne S(\beta)\mid X), \end{align*} we have \begin{align*} \max_{j\in J}P_j^X(\hat j_X(y)\ne j)\le R_X. \end{align*} [/step] [step:Apply a finite Fano bound to the induced testing problem] [claim:Fano lower bound for finitely many Gaussian experiments] Let $J$ be a finite set with $M:=|J|\ge 2$. For each $j\in J$, let $P_j$ be a probability measure on a common measurable space $(\mathcal Y,\mathcal A)$. Let the measurable estimator $\tilde j$ be the map \begin{align*} \tilde j: \mathcal Y &\to J\cup\{\ast\}. \end{align*} Fix $j_0\in J$. Define $D_{\mathrm{KL}}(P\,\|\,Q)$ to be the Kullback-Leibler divergence of a probability measure $P$ from a probability measure $Q$, with value $+\infty$ if $P$ is not absolutely continuous with respect to $Q$. Then \begin{align*} \max_{j\in J}P_j(\tilde j\ne j) \ge 1-\frac{\frac{1}{M}\sum_{j\in J}D_{\mathrm{KL}}(P_j\,\|\,P_{j_0})+\log 2}{\log M}. \end{align*} [/claim] [proof] Let $V$ be a $J$-valued [random variable](/page/Random%20Variable) with the uniform law on $J$, and conditional on the event $\{V=j\}$ let $Y$ be a $\mathcal Y$-valued random variable with law $P_j$. Let $Q$ be the marginal law of $Y$, so \begin{align*} Q=\frac{1}{M}\sum_{m\in J}P_m. \end{align*} Define the mutual information $I(V;Y)$ by \begin{align*} I(V;Y) := \frac{1}{M}\sum_{j\in J}D_{\mathrm{KL}}(P_j\,\|\,Q). \end{align*} If at least one term $D_{\mathrm{KL}}(P_j\,\|\,P_{j_0})$ is infinite, the desired bound follows after substituting $+\infty$ on the right-hand side. Hence assume all these divergences are finite. Then $P_j\ll P_{j_0}$ for every $j\in J$, and therefore $Q\ll P_{j_0}$. Let $p_j:=dP_j/dP_{j_0}$ and $q:=dQ/dP_{j_0}$. By the definition of $Q$, \begin{align*} q=\frac{1}{M}\sum_{m\in J}p_m \end{align*} $P_{j_0}$-almost everywhere. Expanding the average divergence relative to $P_{j_0}$ gives \begin{align*} \frac{1}{M}\sum_{j\in J}D_{\mathrm{KL}}(P_j\,\|\,P_{j_0}) &= \frac{1}{M}\sum_{j\in J}\int_{\mathcal Y} p_j\log p_j\,dP_{j_0}. \end{align*} Since $p_j=q\,(dP_j/dQ)$ $P_{j_0}$-almost everywhere on the set where $q>0$, and $p_j=0$ where $q=0$, we obtain First, using $p_j=q\,(dP_j/dQ)$ $P_{j_0}$-almost everywhere on $\{q>0\}$ and $p_j=0$ on $\{q=0\}$ gives \begin{align*} \frac{1}{M}\sum_{j\in J}\int_{\mathcal Y} p_j\log p_j\,dP_{j_0} = \frac{1}{M}\sum_{j\in J}\int_{\mathcal Y} p_j\log\left(\frac{dP_j}{dQ}\right)\,dP_{j_0} + \frac{1}{M}\sum_{j\in J}\int_{\mathcal Y}p_j\log q\,dP_{j_0}. \end{align*} Since $q=M^{-1}\sum_{j\in J}p_j$, this becomes \begin{align*} \frac{1}{M}\sum_{j\in J}\int_{\mathcal Y} p_j\log p_j\,dP_{j_0} = \frac{1}{M}\sum_{j\in J}D_{\mathrm{KL}}(P_j\,\|\,Q) + \int_{\mathcal Y}q\log q\,dP_{j_0}. \end{align*} By the definitions of $I(V;Y)$ and $D_{\mathrm{KL}}(Q\,\|\,P_{j_0})$, we conclude \begin{align*} \frac{1}{M}\sum_{j\in J}\int_{\mathcal Y} p_j\log p_j\,dP_{j_0} = I(V;Y)+D_{\mathrm{KL}}(Q\,\|\,P_{j_0}). \end{align*} Because $D_{\mathrm{KL}}(Q\,\|\,P_{j_0})\ge 0$, this identity yields the required information bound \begin{align*} I(V;Y)\le \frac{1}{M}\sum_{j\in J}D_{\mathrm{KL}}(P_j\,\|\,P_{j_0}). \end{align*} Let $\hat V:=\tilde j(Y)$. Define the error probability $p_e:=\mathbb P(\hat V\ne V)$. Since $\ast\notin J$, every outcome with $\tilde j(Y)=\ast$ is counted as an error, so $\hat V$ is a finite-valued decision rule whose error event is precisely $\{\hat V\ne V\}$. Let $H(V)$ denote the Shannon entropy of $V$, and let $H(V\mid \hat V)$ denote the conditional Shannon entropy of $V$ given $\hat V$. We prove the finite-alphabet entropy estimate needed here. Let $E_0$ be the Bernoulli random variable defined by $E_0=1$ on $\{\hat V\ne V\}$ and $E_0=0$ on $\{\hat V=V\}$. Conditional on $\hat V$ and $E_0=0$, the value of $V$ is determined by $V=\hat V$. Conditional on $\hat V$ and $E_0=1$, the value of $V$ belongs to at most $M$ possibilities. Hence Since adding $E_0$ cannot decrease the information available to describe the pair, we have \begin{align*} H(V\mid \hat V) \le H(E_0,V\mid \hat V). \end{align*} The finite-alphabet [chain rule for entropy](/theorems/1635) gives \begin{align*} H(E_0,V\mid \hat V) = H(E_0\mid \hat V)+H(V\mid \hat V,E_0). \end{align*} The Bernoulli entropy is bounded by $\log 2$, and the conditional entropy of $V$ is $0$ on $\{E_0=0\}$ and at most $\log M$ on $\{E_0=1\}$. Therefore \begin{align*} H(V\mid \hat V) \le \log 2+p_e\log M. \end{align*} Moreover, $H(V)=\log M$. Because $\hat V$ is a measurable function of $Y$, conditioning on the full observation $Y$ leaves no more uncertainty than conditioning on $\hat V$ alone, so \begin{align*} H(V\mid \hat V)\ge H(V\mid Y)=H(V)-I(V;Y)=\log M-I(V;Y). \end{align*} Combining the preceding upper and lower bounds on $H(V\mid \hat V)$ gives \begin{align*} p_e \ge 1-\frac{I(V;Y)+\log 2}{\log M}. \end{align*} Finally, \begin{align*} p_e = \frac{1}{M}\sum_{j\in J}P_j(\tilde j\ne j) \le \max_{j\in J}P_j(\tilde j\ne j). \end{align*} Combining this with the information bound proves the claim. [/proof] Choose a fixed index $j_0\in J$. Applying the claim with $P_j=P_j^X$ and $\tilde j=\hat j_X$ gives \begin{align*} R_X \ge 1-\frac{\frac{1}{M}\sum_{j\in J}D_{\mathrm{KL}}(P_j^X\,\|\,P_{j_0}^X)+\log 2}{\log M}. \end{align*} [/step] [step:Bound the Gaussian divergences by the restricted eigenvalue upper bound] Let $\kappa_+>0$ denote the fixed upper restricted-eigenvalue constant on the event $E$, meaning that every $2k$-sparse vector $v\in\mathbb R^d$ satisfies \begin{align*} \frac{1}{n}\|Xv\|_2^2\le \kappa_+\|v\|_2^2. \end{align*} For $j\in J$, both $P_j^X$ and $P_{j_0}^X$ are Gaussian measures on $\mathbb R^n$ with common covariance matrix $\sigma^2 I_n$. We derive the Kullback-Leibler formula directly. If $\mu_j:=X\beta_j$ and $\mu_{j_0}:=X\beta_{j_0}$ denote the two mean vectors in $\mathbb R^n$, then the densities of $P_j^X$ and $P_{j_0}^X$ with respect to $n$-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) $\mathcal L^n$ satisfy \begin{align*} \log\frac{dP_j^X}{dP_{j_0}^X}(y) = -\frac{1}{2\sigma^2}|y-\mu_j|^2+\frac{1}{2\sigma^2}|y-\mu_{j_0}|^2. \end{align*} Taking expectation with respect to $P_j^X$ and using $\mathbb E_j[y-\mu_j]=0$ and $\mathbb E_j[y-\mu_{j_0}]=\mu_j-\mu_{j_0}$ in the cross term gives \begin{align*} D_{\mathrm{KL}}(P_j^X\,\|\,P_{j_0}^X) = \frac{1}{2\sigma^2}|\mu_j-\mu_{j_0}|^2 = \frac{1}{2\sigma^2}\|X(\beta_j-\beta_{j_0})\|_2^2. \end{align*} The vector $\beta_j-\beta_{j_0}$ has support contained in $\{j,j_0\}$, so it is $2$-sparse and hence $2k$-sparse because $k\ge 1$. Since $X\in E$, the upper restricted-eigenvalue bound gives \begin{align*} \frac{1}{n}\|X(\beta_j-\beta_{j_0})\|_2^2 \le \kappa_+\|\beta_j-\beta_{j_0}\|_2^2. \end{align*} If $j=j_0$, the right-hand side is $0$. If $j\ne j_0$, then $\beta_j-\beta_{j_0}$ has entries $a$ at $j$ and $-a$ at $j_0$, so \begin{align*} \|\beta_j-\beta_{j_0}\|_2^2=2a^2. \end{align*} Thus, for every $j\in J$, \begin{align*} D_{\mathrm{KL}}(P_j^X\,\|\,P_{j_0}^X) \le \frac{\kappa_+ n a^2}{\sigma^2}. \end{align*} Substituting this bound into the Fano inequality yields \begin{align*} R_X \ge 1-\frac{\kappa_+ n a^2/\sigma^2+\log 2}{\log M}. \end{align*} [guided] The only quantitative input from the design matrix is the upper restricted-eigenvalue bound. The constant used for that bound is $\kappa_+>0$, defined by the condition that every $2k$-sparse vector $v\in\mathbb R^d$ satisfies \begin{align*} \frac{1}{n}\|Xv\|_2^2\le \kappa_+\|v\|_2^2. \end{align*} For two different alternatives $\beta_j$ and $\beta_{j_0}$, the Gaussian laws have the same covariance $\sigma^2 I_n$ and different means $X\beta_j$ and $X\beta_{j_0}$. Let \begin{align*} \mu_j:=X\beta_j,\qquad \mu_{j_0}:=X\beta_{j_0} \end{align*} denote these two mean vectors in $\mathbb R^n$. We compute the divergence from the Gaussian densities with respect to $\mathcal L^n$. For $y\in\mathbb R^n$, \begin{align*} \log\frac{dP_j^X}{dP_{j_0}^X}(y) = -\frac{1}{2\sigma^2}|y-\mu_j|^2+\frac{1}{2\sigma^2}|y-\mu_{j_0}|^2. \end{align*} Under $P_j^X$, write $y=\mu_j+\varepsilon$ with $\varepsilon\sim\mathcal N(0,\sigma^2 I_n)$. Then $y-\mu_{j_0}=\varepsilon+(\mu_j-\mu_{j_0})$, and the expectation of the cross term is zero because $\mathbb E_j[\varepsilon]=0$. Hence \begin{align*} D_{\mathrm{KL}}(P_j^X\,\|\,P_{j_0}^X) = \frac{1}{2\sigma^2}|\mu_j-\mu_{j_0}|^2 = \frac{1}{2\sigma^2}\|X(\beta_j-\beta_{j_0})\|_2^2. \end{align*} Now we check that the restricted-eigenvalue hypothesis applies to $\beta_j-\beta_{j_0}$. The baseline coordinates in $B$ cancel, because every $\beta_j$ has the same value $a$ on $B$. Therefore $\beta_j-\beta_{j_0}$ is supported only on the two varying coordinates $j$ and $j_0$. Since $k\ge 1$, every $2$-sparse vector is $2k$-sparse. Hence, on $E$, \begin{align*} \frac{1}{n}\|X(\beta_j-\beta_{j_0})\|_2^2 \le \kappa_+\|\beta_j-\beta_{j_0}\|_2^2. \end{align*} If $j\ne j_0$, then the difference vector has one coordinate equal to $a$ and one coordinate equal to $-a$, so \begin{align*} \|\beta_j-\beta_{j_0}\|_2^2=a^2+a^2=2a^2. \end{align*} If $j=j_0$, the difference is zero, and the same upper bound remains valid. Therefore \begin{align*} D_{\mathrm{KL}}(P_j^X\,\|\,P_{j_0}^X) \le \frac{1}{2\sigma^2}\,n\kappa_+(2a^2) = \frac{\kappa_+ n a^2}{\sigma^2}. \end{align*} [Fano's inequality](/theorems/1654) then forces \begin{align*} R_X \ge 1-\frac{\kappa_+ n a^2/\sigma^2+\log 2}{\log M}. \end{align*} This inequality expresses the core obstruction: if $n a^2/\sigma^2$ is much smaller than $\log M$, then the observations do not carry enough information to identify the active coordinate among $M$ possibilities. [/guided] [/step] [step:Use uniform vanishing recovery error on the restricted-eigenvalue event to force the signal lower bound] For each realization $X\in E_\ell$, define \begin{align*} R_\ell(X):=\sup_{\beta\in\Theta_{k_\ell}(a_\ell)}\mathbb P_\beta\!\left(\hat S_\ell(y,X)\ne S(\beta)\mid X_\ell=X\right). \end{align*} By the uniform conditional recovery hypothesis in the theorem statement, where "on $E_\ell$" is interpreted as uniform over every design realization belonging to $E_\ell$, \begin{align*} \sup_{X\in E_\ell}R_\ell(X)\to 0. \end{align*} The assumption $\mathbb P(E_\ell)\to 1$ ensures that this uniform conditional statement is imposed on a typical set of Gaussian designs; the deterministic lower bound below only uses that the chosen realization $X$ belongs to $E_\ell$. The preceding steps show that, for every $X\in E_\ell$ and all sufficiently large $\ell$ with $M_\ell:=d_\ell-k_\ell+1\ge 2$, \begin{align*} R_\ell(X)\ge 1-\frac{\kappa_+ n_\ell a_\ell^2/\sigma^2+\log 2}{\log M_\ell}. \end{align*} Since $M_\ell\ge d_\ell/2$ and $d_\ell\to\infty$, \begin{align*} \log M_\ell\ge \frac{1}{2}\log d_\ell \end{align*} for all sufficiently large $\ell$. The [uniform convergence](/page/Uniform%20Convergence) gives $\sup_{X\in E_\ell}R_\ell(X)\le 1/2$ for all sufficiently large $\ell$, and hence, for each $X\in E_\ell$, \begin{align*} \frac{\kappa_+ n_\ell a_\ell^2}{\sigma^2}+\log 2\ge \frac{1}{2}\log M_\ell. \end{align*} For all sufficiently large $\ell$, $\log M_\ell\ge 4\log 2$, so \begin{align*} \frac{\kappa_+ n_\ell a_\ell^2}{\sigma^2}\ge \frac{1}{4}\log M_\ell\ge \frac{1}{8}\log d_\ell. \end{align*} Thus \begin{align*} a_\ell\ge \frac{1}{\sqrt{8\kappa_+}}\,\sigma\sqrt{\frac{\log d_\ell}{n_\ell}} \end{align*} for all sufficiently large $\ell$. This is the claimed necessary signal-strength lower bound, with a constant depending only on the restricted-eigenvalue upper bound. [guided] The previous step gave a deterministic lower bound for the worst-case conditional support recovery error on every design realization $X\in E_\ell$: \begin{align*} R_\ell(X)\ge 1-\frac{\kappa_+ n_\ell a_\ell^2/\sigma^2+\log 2}{\log M_\ell}, \end{align*} where \begin{align*} M_\ell:=d_\ell-k_\ell+1. \end{align*} The recovery hypothesis says that this same quantity must vanish uniformly over $X\in E_\ell$: \begin{align*} \sup_{X\in E_\ell}R_\ell(X)\to 0. \end{align*} Thus, for all sufficiently large $\ell$, every $X\in E_\ell$ satisfies $R_\ell(X)\le 1/2$. Combining this upper bound with the deterministic Fano lower bound gives \begin{align*} \frac{\kappa_+ n_\ell a_\ell^2}{\sigma^2}+\log 2\ge \frac{1}{2}\log M_\ell. \end{align*} Now use the size of the testing family. Since $k_\ell\le d_\ell/2$, we have \begin{align*} M_\ell=d_\ell-k_\ell+1\ge d_\ell/2. \end{align*} Because $d_\ell\to\infty$, for all sufficiently large $\ell$, \begin{align*} \log M_\ell\ge \frac{1}{2}\log d_\ell \end{align*} and also \begin{align*} \log M_\ell\ge 4\log 2. \end{align*} The latter inequality lets us absorb the additive $\log 2$ term: \begin{align*} \frac{\kappa_+ n_\ell a_\ell^2}{\sigma^2} \ge \frac{1}{2}\log M_\ell-\log 2 \ge \frac{1}{4}\log M_\ell \ge \frac{1}{8}\log d_\ell. \end{align*} Solving this inequality for $a_\ell$ yields \begin{align*} a_\ell\ge \frac{1}{\sqrt{8\kappa_+}}\,\sigma\sqrt{\frac{\log d_\ell}{n_\ell}} \end{align*} for all sufficiently large $\ell$. This proves the necessary signal-strength lower bound, and the displayed constant depends only on the restricted-eigenvalue upper bound $\kappa_+$. [/guided] [/step]

Prerequisites (0/7 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.