Coordinatewise Asymptotic Normality of the Debiased Lasso (Theorem #5584)

Theorem


For each $n$, let $Y_n=X_n\beta_n^*+\varepsilon_n$ with fixed design $X_n\in\mathbb R^{n\times p_n}$, columns satisfying $\|X_{n,k}\|_2^2/n=1$, noise $\varepsilon_n\sim\mathcal N(0,\sigma^2 I_n)$ with $\sigma>0$, and support size $s_n=|\operatorname{supp}(\beta_n^*)|$. Let $\hat\beta_n\in\mathbb{R}^{p_n}$ be any solution of the Lasso problem \begin{align*} \hat\beta_n\in\operatorname*{argmin}_{\beta\in\mathbb{R}^{p_n}}\left\{\frac{1}{2n}\|Y_n-X_n\beta\|_2^2+\lambda_n\|\beta\|_1\right\}, \end{align*} with $\lambda_n\asymp\sigma\sqrt{\log p_n/n}$, and let $\hat\Theta_n$ be built from $X_n$ by nodewise Lasso with $\lambda_{n,k}\asymp\sqrt{\log p_n/n}$. Write $\hat\theta_{n,j}^\top:=e_j^\top\hat\Theta_n$ for the $j$th row of $\hat\Theta_n$. Define the debiased Lasso estimator $\hat b_n\in\mathbb{R}^{p_n}$ by \begin{align*} \hat b_n:=\hat\beta_n+\hat\Theta_n\frac{X_n^\top(Y_n-X_n\hat\beta_n)}{n}. \end{align*}
Assume there are constants $\kappa>0$, $0<c<C<\infty$, and $A<\infty$ independent of $n$ such that the following deterministic design and nodewise conditions hold for all sufficiently large $n$:
1. the empirical Gram matrix $\hat\Sigma_n=X_n^\top X_n/n$ satisfies the restricted eigenvalue lower bound \begin{align*} \frac{\|X_n\delta\|_2}{\sqrt n}\ge \kappa \|\delta_S\|_2 \end{align*} for every set $S\subset\{1,\dots,p_n\}$ with $|S|\le s_n$ and every $\delta\in\mathbb R^{p_n}$ satisfying $\|\delta_{S^c}\|_1\le 3\|\delta_S\|_1$;
2. for the fixed coordinate $j$, the nodewise residual variance satisfies \begin{align*} c\le \hat\tau_{n,j}^2\le C \end{align*} and the nodewise approximate inverse row satisfies \begin{align*} \|e_j^\top(I_{p_n}-\hat\Theta_n\hat\Sigma_n)\|_\infty \le A\sqrt{\frac{\log p_n}{n}}; \end{align*}
3. the variance factor is nondegenerate: \begin{align*} c\le \hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}\le C. \end{align*}
Assume also that the Lasso oracle rate holds with probability tending to $1$: \begin{align*} \|\hat\beta_n-\beta_n^*\|_1\le A s_n\sqrt{\frac{\log p_n}{n}}. \end{align*}
If \begin{align*} \frac{s_n\log p_n}{\sqrt n}\to 0, \end{align*} then for each fixed coordinate $j$, \begin{align*} \frac{\hat b_{n,j}-\beta_{n,j}^*}{\sigma(\hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}/n)^{1/2}} \xrightarrow{d} \mathcal N(0,1). \end{align*} If, in addition, $\hat\sigma$ is an estimator of $\sigma$ with $\hat\sigma/\sigma\xrightarrow{\mathbb P}1$, the same conclusion holds with $\hat\sigma$ replacing $\sigma$.

Discussion
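The debiasing step in the theorem is a single linear correction of the Lasso output. The following is a minimal sketch of that step in plain Python, not an implementation of the full procedure: the function name `debias`, the toy design, the "Lasso output" `beta_hat`, and the matrix `theta_hat` are all illustrative assumptions. In particular, the toy design is low-dimensional with $\hat\Sigma_n=I$, so `theta_hat` can be taken as the exact inverse, which is not available in the high-dimensional regime the theorem targets.

```python
def debias(X, Y, beta_hat, theta_hat):
    """Return beta_hat + theta_hat X^T (Y - X beta_hat) / n, using plain lists."""
    n, p = len(X), len(X[0])
    # Residual vector Y - X beta_hat (length n).
    resid = [Y[i] - sum(X[i][k] * beta_hat[k] for k in range(p)) for i in range(n)]
    # Scaled score X^T resid / n (length p).
    grad = [sum(X[i][k] * resid[i] for i in range(n)) / n for k in range(p)]
    # One-step correction through the (approximate) inverse theta_hat.
    return [beta_hat[j] + sum(theta_hat[j][k] * grad[k] for k in range(p))
            for j in range(p)]


# Toy orthogonal design (assumption): n = 4, p = 2, each column has squared norm n.
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
Y = [2.0, 0.5, 1.5, 1.0]
beta_hat = [1.0, 0.0]                   # hypothetical shrunken Lasso output
theta_hat = [[1.0, 0.0], [0.0, 1.0]]    # exact inverse of X^T X / n = I here

b_hat = debias(X, Y, beta_hat, theta_hat)
```

When `theta_hat` is an exact inverse of $\hat\Sigma_n$, the correction cancels the dependence on `beta_hat` and the result coincides with the least-squares estimate $X_n^\top Y_n/n$ for this design; the remainder term $R_n$ in the proof measures how far one is from that ideal case.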

Proof

[proofplan] We decompose the debiased Lasso coordinate into a leading Gaussian score term plus a random approximation-error remainder. The leading term is exactly standard normal after normalisation because $\hat\Theta_n$ is a deterministic function of the fixed design $X_n$. The [nodewise approximate inverse bound](/theorems/5581) and the assumed Lasso $\ell^1$ oracle rate show that the coordinate remainder is $o_{\mathbb{P}}(n^{-1/2})$ under the stated sparsity scaling. The restricted eigenvalue, column-normalisation, tuning, nodewise-construction, and nodewise-variance hypotheses are not used again directly; in this theorem they serve as the deterministic context in which the separately assumed oracle and approximate-inverse bounds are typically established. The known-variance result follows by Slutsky's theorem, and the feasible result follows by applying Slutsky's theorem once more to the multiplicative factor $\sigma/\hat\sigma$. [/proofplan]
[step:Expand the debiased estimator into a Gaussian score and a coordinate remainder] Recall that, for each sufficiently large $n$, the debiased Lasso estimator $\hat b_n\in\mathbb{R}^{p_n}$ is \begin{align*} \hat b_n:=\hat\beta_n+\hat\Theta_n\frac{X_n^\top(Y_n-X_n\hat\beta_n)}{n}. \end{align*} Define the score vector \begin{align*} W_n := \hat\Theta_n \frac{X_n^\top \varepsilon_n}{n} \in \mathbb{R}^{p_n} \end{align*} and the remainder vector \begin{align*} R_n := (I_{p_n}-\hat\Theta_n\hat\Sigma_n)(\hat\beta_n-\beta_n^*) \in \mathbb{R}^{p_n}. \end{align*} Using $Y_n-X_n\hat\beta_n=X_n(\beta_n^*-\hat\beta_n)+\varepsilon_n$, we compute first \begin{align*} \hat b_n-\beta_n^* = \hat\beta_n-\beta_n^* + \hat\Theta_n \frac{X_n^\top X_n(\beta_n^*-\hat\beta_n)}{n} + \hat\Theta_n \frac{X_n^\top\varepsilon_n}{n}. \end{align*} Since $\hat\Sigma_n=X_n^\top X_n/n$, this becomes \begin{align*} \hat b_n-\beta_n^* = \hat\beta_n-\beta_n^* - \hat\Theta_n\hat\Sigma_n(\hat\beta_n-\beta_n^*) + \hat\Theta_n \frac{X_n^\top\varepsilon_n}{n}. \end{align*} By the definitions of $R_n$ and $W_n$, we obtain \begin{align*} \hat b_n-\beta_n^* = R_n + W_n. \end{align*} Taking the $j$th coordinate, and writing $\hat\theta_{n,j}^\top=e_j^\top\hat\Theta_n$ for the $j$th row of $\hat\Theta_n$, gives \begin{align*} \hat b_{n,j}-\beta_{n,j}^* = \hat\theta_{n,j}^\top\frac{X_n^\top\varepsilon_n}{n} + e_j^\top(I_{p_n}-\hat\Theta_n\hat\Sigma_n)(\hat\beta_n-\beta_n^*). \end{align*} [/step]
[step:Identify the exact normal law of the leading score term] Define the coordinate variance factor \begin{align*} v_{n,j} := \hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}. \end{align*} By hypothesis, $c \le v_{n,j} \le C$, so $v_{n,j} > 0$. Define the scalar [random variable](/page/Random%20Variable) \begin{align*} G_{n,j} := \hat\theta_{n,j}^\top\frac{X_n^\top\varepsilon_n}{n}. \end{align*} Since $X_n$ and $\hat\theta_{n,j}$ are deterministic and $\varepsilon_n \sim \mathcal{N}(0,\sigma^2 I_n)$, the scalar $G_{n,j}$ is Gaussian with mean $0$. Its variance is computed by the covariance formula for a deterministic linear functional of a Gaussian vector: \begin{align*} \operatorname{Var}(G_{n,j}) = \operatorname{Var}\left(\frac{1}{n}\hat\theta_{n,j}^\top X_n^\top\varepsilon_n\right). \end{align*} Using $\operatorname{Var}(\varepsilon_n)=\sigma^2 I_n$, we get \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{1}{n^2}\hat\theta_{n,j}^\top X_n^\top \operatorname{Var}(\varepsilon_n) X_n\hat\theta_{n,j}. \end{align*} Therefore \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{\sigma^2}{n^2}\hat\theta_{n,j}^\top X_n^\top X_n\hat\theta_{n,j}. \end{align*} Since $\hat\Sigma_n=X_n^\top X_n/n$, this equals \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{\sigma^2}{n}\hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}. \end{align*} By the definition of $v_{n,j}$, we conclude \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{\sigma^2 v_{n,j}}{n}. \end{align*} Therefore \begin{align*} Z_{n,j} := \frac{G_{n,j}}{\sigma(v_{n,j}/n)^{1/2}} \sim \mathcal{N}(0,1) \end{align*} for every sufficiently large $n$.
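The variance identity $\operatorname{Var}(G_{n,j})=\sigma^2 v_{n,j}/n$ is purely algebraic, so it can be sanity-checked on a small example. The sketch below computes the variance two ways: directly as $\sigma^2\|X_n\hat\theta_{n,j}/n\|_2^2$, and through the formula $\sigma^2\hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}/n$. The function name `score_variances`, the toy design, and the vector `theta_j` are hypothetical choices made only for illustration.

```python
import math


def score_variances(X, theta_j, sigma):
    """Return Var(G) computed two ways, for G = theta_j^T X^T eps / n with eps ~ N(0, sigma^2 I)."""
    n, p = len(X), len(X[0])
    # Direct route: G = a^T eps with a = X theta_j / n, so Var(G) = sigma^2 * ||a||_2^2.
    a = [sum(X[i][k] * theta_j[k] for k in range(p)) / n for i in range(n)]
    var_direct = sigma ** 2 * sum(ai * ai for ai in a)
    # Formula route: sigma^2 * v / n with v = theta_j^T Sigma_hat theta_j.
    gram = [[sum(X[i][k] * X[i][l] for i in range(n)) / n for l in range(p)]
            for k in range(p)]
    v = sum(theta_j[k] * gram[k][l] * theta_j[l] for k in range(p) for l in range(p))
    var_formula = sigma ** 2 * v / n
    return var_direct, var_formula


# Toy design (assumption): n = 4, p = 2, each column has squared norm n.
X = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
theta_j = [1.2, -0.4]   # hypothetical row of an approximate inverse
var_direct, var_formula = score_variances(X, theta_j, sigma=2.0)
agree = math.isclose(var_direct, var_formula, rel_tol=1e-12)
```

The two routes agree up to floating-point rounding for any design and any deterministic `theta_j`; the column normalisation plays no role in this particular identity.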
[guided] The purpose of this step is to isolate the term whose distribution we can compute exactly. Define \begin{align*} v_{n,j} := \hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}. \end{align*} The theorem assumes $c \le v_{n,j} \le C$, so the normalising denominator \begin{align*} \sigma(v_{n,j}/n)^{1/2} \end{align*} is positive and finite. Now define the leading coordinate score \begin{align*} G_{n,j} := \hat\theta_{n,j}^\top\frac{X_n^\top\varepsilon_n}{n}. \end{align*} This is a linear functional of the Gaussian vector $\varepsilon_n$. Because $X_n$ is fixed and $\hat\theta_{n,j}$ is constructed only from $X_n$, the vector $X_n\hat\theta_{n,j}/n \in \mathbb{R}^n$ is deterministic. Hence $G_{n,j}$ is a centered Gaussian scalar. Its variance is computed directly from $\operatorname{Var}(\varepsilon_n)=\sigma^2 I_n$. First, \begin{align*} \operatorname{Var}(G_{n,j}) = \operatorname{Var}\left(\frac{1}{n}\hat\theta_{n,j}^\top X_n^\top\varepsilon_n\right). \end{align*} The covariance formula for a deterministic linear functional gives \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{1}{n^2}\hat\theta_{n,j}^\top X_n^\top \operatorname{Var}(\varepsilon_n) X_n\hat\theta_{n,j}. \end{align*} Substituting $\operatorname{Var}(\varepsilon_n)=\sigma^2 I_n$, we obtain \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{\sigma^2}{n^2}\hat\theta_{n,j}^\top X_n^\top X_n\hat\theta_{n,j}. \end{align*} Because $\hat\Sigma_n=X_n^\top X_n/n$, this is \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{\sigma^2}{n}\hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}. \end{align*} Finally, the definition $v_{n,j}=\hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}$ gives \begin{align*} \operatorname{Var}(G_{n,j}) = \frac{\sigma^2 v_{n,j}}{n}. \end{align*} Thus the denominator in the theorem is exactly the standard deviation of $G_{n,j}$. After division by that standard deviation, we obtain \begin{align*} Z_{n,j} := \frac{G_{n,j}}{\sigma(v_{n,j}/n)^{1/2}} \sim \mathcal{N}(0,1). 
\end{align*} This exact normality is the reason the proof reduces to showing that the debiasing remainder is negligible. [/guided] [/step]
[step:Bound the coordinate remainder at the standard-error scale] Define the high-probability event \begin{align*} E_n := \left\{ \|\hat\beta_n-\beta_n^*\|_1 \le A s_n\sqrt{\frac{\log p_n}{n}} \right\}. \end{align*} By assumption, $\mathbb{P}(E_n) \to 1$. By the definition of $R_n$, the coordinate remainder satisfies \begin{align*} |e_j^\top R_n| = \left|e_j^\top(I_{p_n}-\hat\Theta_n\hat\Sigma_n)(\hat\beta_n-\beta_n^*)\right|. \end{align*} By the [duality inequality between $\ell^\infty$ and $\ell^1$ norms](/page/Holder%20Inequality), \begin{align*} |e_j^\top R_n| \le \|e_j^\top(I_{p_n}-\hat\Theta_n\hat\Sigma_n)\|_\infty \|\hat\beta_n-\beta_n^*\|_1. \end{align*} On $E_n$, the nodewise approximate inverse bound and the defining inequality for $E_n$ give \begin{align*} |e_j^\top R_n| \le A\sqrt{\frac{\log p_n}{n}} \cdot A s_n\sqrt{\frac{\log p_n}{n}}. \end{align*} Hence, on $E_n$, \begin{align*} |e_j^\top R_n| \le A^2\frac{s_n\log p_n}{n}. \end{align*} Since $v_{n,j}\ge c$, we have on $E_n$ \begin{align*} \left|\frac{e_j^\top R_n}{\sigma(v_{n,j}/n)^{1/2}}\right| \le \frac{A^2 s_n\log p_n/n}{\sigma(c/n)^{1/2}}. \end{align*} Equivalently, \begin{align*} \left|\frac{e_j^\top R_n}{\sigma(v_{n,j}/n)^{1/2}}\right| \le \frac{A^2}{\sigma c^{1/2}} \frac{s_n\log p_n}{\sqrt n}. \end{align*} The right-hand side is deterministic and tends to $0$ by the hypothesis $s_n\log p_n/\sqrt n\to 0$. To pass from this eventwise bound to convergence in probability, fix $\eta>0$. For all sufficiently large $n$, the right-hand side is at most $\eta$, so the normalised remainder can exceed $\eta$ only on $E_n^c$, and therefore \begin{align*} \mathbb{P}\left( \left|\frac{e_j^\top R_n}{\sigma(v_{n,j}/n)^{1/2}}\right|>\eta \right) \le \mathbb{P}(E_n^c). \end{align*} Since $\mathbb{P}(E_n^c)=1-\mathbb{P}(E_n)\to 0$, it follows that \begin{align*} \frac{e_j^\top R_n}{\sigma(v_{n,j}/n)^{1/2}} \xrightarrow{\mathbb{P}} 0. 
\end{align*} [/step] [step:Combine exact normality and negligible remainder] From the expansion in the first step, \begin{align*} \frac{\hat b_{n,j}-\beta_{n,j}^*}{\sigma(v_{n,j}/n)^{1/2}} = Z_{n,j} + \frac{e_j^\top R_n}{\sigma(v_{n,j}/n)^{1/2}}. \end{align*} The normalized score sequence satisfies $Z_{n,j}\sim\mathcal{N}(0,1)$ for every sufficiently large $n$; hence $(Z_{n,j})$ converges in distribution to $\mathcal{N}(0,1)$. The normalized remainder satisfies \begin{align*} \frac{e_j^\top R_n}{\sigma(v_{n,j}/n)^{1/2}} \xrightarrow{\mathbb{P}} 0 \end{align*} by the previous step. Thus the two hypotheses of [Slutsky's Theorem](/page/Slutsky%27s%20Theorem) are met: one summand converges in distribution and the other converges in probability to the constant $0$. Therefore \begin{align*} \frac{\hat b_{n,j}-\beta_{n,j}^*} {\sigma(v_{n,j}/n)^{1/2}} \xrightarrow{d} \mathcal{N}(0,1). \end{align*} Substituting $v_{n,j}=\hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}$ gives the stated known-variance conclusion. [/step] [step:Replace the noise level by a consistent estimator] Assume now that $\hat\sigma/\sigma \xrightarrow{\mathbb{P}} 1$. Let $h:(0,\infty)\to(0,\infty)$ be the continuous map $h(t)=1/t$. Since $\sigma>0$ and the limit is the positive constant $1$, the [Continuous Mapping Theorem](/page/Continuous%20Mapping%20Theorem) applied to $\hat\sigma/\sigma$ gives \begin{align*} \frac{\sigma}{\hat\sigma}=h\left(\frac{\hat\sigma}{\sigma}\right) \xrightarrow{\mathbb{P}} h(1)=1. \end{align*} Using the known-variance statistic from the previous step, write \begin{align*} \frac{\hat b_{n,j}-\beta_{n,j}^*}{\hat\sigma(v_{n,j}/n)^{1/2}} = \left(\frac{\hat b_{n,j}-\beta_{n,j}^*}{\sigma(v_{n,j}/n)^{1/2}}\right)\left(\frac{\sigma}{\hat\sigma}\right). \end{align*} The first factor converges in distribution to $\mathcal{N}(0,1)$ by the known-variance result, and the second factor converges in probability to the constant $1$. 
Therefore [Slutsky's Theorem](/page/Slutsky%27s%20Theorem), now applied to the product of these two factors, yields \begin{align*} \frac{\hat b_{n,j}-\beta_{n,j}^*} {\hat\sigma(v_{n,j}/n)^{1/2}} \xrightarrow{d} \mathcal{N}(0,1). \end{align*} Again substituting $v_{n,j}=\hat\theta_{n,j}^\top\hat\Sigma_n\hat\theta_{n,j}$ gives the feasible conclusion. This completes the proof. [/step]
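The remainder step of the proof rests on two elementary facts: the $\ell^\infty$/$\ell^1$ duality bound, and the decay of $A^2 s_n\log p_n/(\sigma c^{1/2}\sqrt n)$. The sketch below illustrates both in plain Python. The function names, the sample vectors, the constants, and the regime $s_n=5$, $p_n=n^2$ are hypothetical choices for illustration only; they are not taken from the theorem.

```python
import math


def holder_bound(row, delta):
    """Return (|row . delta|, max|row| * sum|delta|): both sides of the l-inf / l-1 duality bound."""
    lhs = abs(sum(r * d for r, d in zip(row, delta)))
    rhs = max(abs(r) for r in row) * sum(abs(d) for d in delta)
    return lhs, rhs


def normalized_remainder_bound(A, s, p, n, sigma, c):
    """Return A^2 / (sigma * sqrt(c)) * s * log(p) / sqrt(n), the bound on the normalised remainder."""
    return A ** 2 / (sigma * math.sqrt(c)) * s * math.log(p) / math.sqrt(n)


# Duality bound on an arbitrary pair of vectors (illustrative values).
lhs, rhs = holder_bound([0.1, -0.3, 0.05], [1.0, 2.0, -4.0])

# Hypothetical regime: fixed sparsity s = 5 and p = n^2, so s * log(p) / sqrt(n) tends to 0.
sample_sizes = (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6)
bounds = [normalized_remainder_bound(A=1.0, s=5, p=n ** 2, n=n, sigma=1.0, c=0.5)
          for n in sample_sizes]
```

In this regime the bound shrinks only at rate $\log n/\sqrt n$, which suggests that the remainder may still be non-negligible at moderate sample sizes even though it vanishes asymptotically.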


