Adaptive Lasso Oracle Property (Theorem # 5571)
Theorem
Consider a fixed-dimensional triangular linear model $Y_n=X_n\beta^*+\varepsilon_n$, where $p$ is fixed, $X_n\in\mathbb{R}^{n\times p}$ is deterministic, $\varepsilon_{n,i}$ are i.i.d. with mean $0$ and variance $\sigma^2$, and
\begin{align*}
\frac{1}{n}X_n^\top X_n \to C
\end{align*}
for a positive definite matrix $C$. Writing $x_{ni}\in\mathbb{R}^p$ for the $i$-th row of $X_n$, assume also the leverage condition
\begin{align*}
\max_{1\le i\le n}\frac{|x_{ni}|^2}{n}\to 0.
\end{align*}
Let $S = \{j : \beta^*_j \ne 0\}$. Suppose the pilot estimator satisfies $\sqrt{n}(\tilde{\beta}_n-\beta^*)=O_{\mathbb P}(1)$, assume $\mathbb P(\min_{1\le j\le p}|\tilde{\beta}_{n,j}|>0)\to 1$, the adaptive weights are $w_{n,j}=|\tilde{\beta}_{n,j}|^{-\gamma}$ with $\gamma>0$ on this event, and the tuning parameter satisfies
\begin{align*}
\lambda_n\sqrt{n} \to 0,
\end{align*}
and
\begin{align*}
\lambda_n n^{(1+\gamma)/2} \to \infty.
\end{align*}
Define the adaptive Lasso estimator $\hat{\beta}^{\mathrm{AL}}_n$ to be the unique minimizer, when unique, of
\begin{align*}
Q_n(\beta)
:=
\frac{1}{2n}|Y_n-X_n\beta|^2
+
\lambda_n\sum_{j=1}^p w_{n,j}|\beta_j|,
\qquad
\beta\in\mathbb R^p.
\end{align*}
Then the adaptive Lasso estimator $\hat{\beta}^{\mathrm{AL}}_n$ satisfies
\begin{align*}
\mathbb P(\{j:\hat{\beta}^{\mathrm{AL}}_{n,j}\ne 0\}=S) \to 1,
\end{align*}
and
\begin{align*}
\sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S) \xrightarrow{d}
\mathcal N(0,\sigma^2 C_{SS}^{-1}).
\end{align*}
Probability & Statistics
Discussion
This theorem states an oracle property for the adaptive Lasso under suitable weighting and regularity assumptions. It captures simultaneous variable selection consistency and asymptotic behavior matching the oracle active model.
Proof
[proofplan]
The proof constructs the adaptive Lasso solution by first solving the penalized least-squares problem restricted to the true active coordinates $S$ and setting the inactive coordinates equal to zero. On the active coordinates, the adaptive penalty is negligible at the $n^{-1/2}$ scale because the pilot weights converge to finite nonzero limits and $\lambda_n\sqrt{n}\to 0$, so the restricted estimator has the same first-order expansion as ordinary least squares on $S$. On the inactive coordinates, root-$n$ consistency of the pilot makes the adaptive weights grow like $n^{\gamma/2}$, and the condition $\lambda_n n^{(1+\gamma)/2}\to\infty$ makes the KKT penalty threshold dominate the random score. The KKT conditions then identify the restricted estimator as the full adaptive Lasso estimator with probability tending to one, giving both exact support recovery and the oracle normal limit.
[/proofplan]
[step:Fix the zero-dimensional convention when there are no active coordinates]
Throughout the proof, if $S=\varnothing$, we use the standard zero-dimensional convention: $\mathbb{R}^{|S|}=\{0\}$, $C_{SS}$ is the unique $0\times 0$ positive definite matrix, $C_{SS}^{-1}$ is its inverse, and $\mathcal{N}(0,\sigma^2 C_{SS}^{-1})$ is the probability law on the zero-dimensional [vector space](/page/Vector%20Space). Under this convention, all assertions restricted to the active coordinate space are vacuous identities, and the asymptotic normality assertion on that space is automatic once the adaptive Lasso estimator is identified with the all-zero candidate. The only substantive work in the case $S=\varnothing$ is therefore the inactive-coordinate KKT verification.
[/step]
[step:Record the score expansion and the adaptive weight behaviour]
Define the empirical Gram matrix
\begin{align*}
C_n := \frac{1}{n}X_n^\top X_n \in \mathbb{R}^{p \times p}
\end{align*}
and the normalized score vector
\begin{align*}
Z_n := \frac{1}{\sqrt{n}}X_n^\top \varepsilon_n \in \mathbb{R}^p.
\end{align*}
By the deterministic-design leverage hypothesis,
\begin{align*}
\max_{1\le i\le n}\frac{|x_{ni}|^2}{n}\to 0,
\end{align*}
where $x_{ni}\in\mathbb{R}^p$ denotes the $i$-th row of $X_n$. To verify the [multivariate Lindeberg central limit theorem for triangular arrays](/page/Multivariate%20Lindeberg%20Central%20Limit%20Theorem) through the [Cramer-Wold device](/page/Cramer-Wold%20Device), fix $a\in\mathbb{R}^p$ and define scalar summands
\begin{align*}
U_{n,i}:=\frac{a^\top x_{ni}}{\sqrt{n}}\varepsilon_i\in\mathbb{R},
\qquad 1\le i\le n.
\end{align*}
Their variances satisfy
\begin{align*}
\sum_{i=1}^n \operatorname{Var}(U_{n,i})
=
\sigma^2 a^\top C_n a
\to
\sigma^2 a^\top C a.
\end{align*}
Let $h_n:=\max_{1\le i\le n}|a^\top x_{ni}|/\sqrt{n}$. Then $h_n\le |a|\max_{1\le i\le n}|x_{ni}|/\sqrt{n}\to 0$. For every $\eta>0$,
The bound
\begin{align*}
\sum_{i=1}^n \mathbb{E}\left[U_{n,i}^2\mathbb{1}_{\{|U_{n,i}|>\eta\}}\right]
\le
\frac{1}{n}\sum_{i=1}^n (a^\top x_{ni})^2
\mathbb{E}\left[\varepsilon_i^2\mathbb{1}_{\{h_n|\varepsilon_i|>\eta\}}\right]
\end{align*}
holds by the definition of $h_n$. Since the errors are i.i.d., the right-hand side equals
\begin{align*}
a^\top C_n a\,\mathbb{E}\left[\varepsilon_1^2\mathbb{1}_{\{|\varepsilon_1|>\eta/h_n\}}\right]
\to 0,
\end{align*}
where the last convergence uses $\mathbb{E}[\varepsilon_1^2]<\infty$ and $\eta/h_n\to\infty$. The scalar Lindeberg condition holds for every $a$, so the [Cramer-Wold device](/page/Cramer-Wold%20Device) gives
\begin{align*}
Z_n \xrightarrow{d} Z,
\qquad
Z \sim \mathcal{N}(0,\sigma^2 C).
\end{align*}
For each $j \in S$, since $\beta^*_j \ne 0$ and $\tilde{\beta}_{n,j}\xrightarrow{\mathbb{P}}\beta^*_j$, the [continuous mapping theorem](/theorems/1847) gives
\begin{align*}
w_{n,j}=|\tilde{\beta}_{n,j}|^{-\gamma}
\xrightarrow{\mathbb{P}}
|\beta^*_j|^{-\gamma}.
\end{align*}
If $S^c\ne\varnothing$, define the weight-definition event
\begin{align*}
A_n:=\left\{\min_{1\le j\le p}|\tilde{\beta}_{n,j}|>0\right\}.
\end{align*}
By hypothesis, $\mathbb{P}(A_n)\to1$. Off $A_n$, assign the weights arbitrary positive values; this makes $w_{n,j}$ a well-defined positive [random variable](/page/Random%20Variable) for every $j$ and does not affect any probability limit because $\mathbb{P}(A_n^c)\to0$. For each $j \in S^c$, since $\beta^*_j=0$ and $\sqrt{n}\tilde{\beta}_{n,j}=O_{\mathbb{P}}(1)$, on $A_n$ we have
\begin{align*}
n^{-\gamma/2}w_{n,j}
=
|\sqrt{n}\tilde{\beta}_{n,j}|^{-\gamma}.
\end{align*}
This is large enough in the following sense: for every $M>0$,
\begin{align*}
\mathbb{P}\left(A_n\cap\left\{\min_{j\in S^c}w_{n,j}\ge n^{\gamma/2}M^{-\gamma}\right\}\right)
\ge
\mathbb{P}\left(\max_{j\in S^c}|\sqrt{n}\tilde{\beta}_{n,j}|\le M\right)-\mathbb{P}(A_n^c).
\end{align*}
Since the maximum over the fixed nonempty set $S^c$ is $O_{\mathbb{P}}(1)$, it follows that
\begin{align*}
\sqrt{n}\lambda_n \min_{j\in S^c} w_{n,j}
\xrightarrow{\mathbb{P}}
\infty.
\end{align*}
If $S^c=\varnothing$, there are no inactive weights and all later inactive-coordinate KKT inequalities are vacuous.
[guided]
The proof needs two different effects from the adaptive weights. On active coordinates, the pilot estimator converges to a nonzero number, so the weight stays bounded. On inactive coordinates, the pilot estimator is only of order $n^{-1/2}$, so its reciprocal power is of order $n^{\gamma/2}$.
First define the two quantities that appear in every first-order condition. The empirical Gram matrix is
\begin{align*}
C_n := \frac{1}{n}X_n^\top X_n \in \mathbb{R}^{p \times p}.
\end{align*}
The normalized score vector is
\begin{align*}
Z_n := \frac{1}{\sqrt{n}}X_n^\top \varepsilon_n \in \mathbb{R}^p.
\end{align*}
The vector $Z_n$ is the normalized least-squares score at the true parameter. The deterministic design assumptions imply that its covariance converges:
\begin{align*}
\operatorname{Cov}(Z_n)
=
\frac{1}{n}X_n^\top \operatorname{Cov}(\varepsilon_n)X_n
=
\sigma^2 C_n
\to
\sigma^2 C.
\end{align*}
We now verify the Lindeberg condition using the leverage assumption in the theorem statement. Let $x_{ni}\in\mathbb{R}^p$ denote the $i$-th row of $X_n$. To use the [Cramer-Wold device](/page/Cramer-Wold%20Device), fix $a\in\mathbb{R}^p$ and define scalar triangular-array summands
\begin{align*}
U_{n,i}:=\frac{a^\top x_{ni}}{\sqrt{n}}\varepsilon_i\in\mathbb{R},
\qquad 1\le i\le n.
\end{align*}
Their variance sum is
\begin{align*}
\sum_{i=1}^n \operatorname{Var}(U_{n,i})
=
\sigma^2 a^\top C_n a
\to
\sigma^2 a^\top C a.
\end{align*}
Define $h_n:=\max_{1\le i\le n}|a^\top x_{ni}|/\sqrt{n}$. The leverage assumption gives
\begin{align*}
h_n\le |a|\max_{1\le i\le n}\frac{|x_{ni}|}{\sqrt{n}}\to 0.
\end{align*}
For every $\eta>0$, the definition of $h_n$ gives
\begin{align*}
\sum_{i=1}^n \mathbb{E}\left[U_{n,i}^2\mathbb{1}_{\{|U_{n,i}|>\eta\}}\right]
\le
\frac{1}{n}\sum_{i=1}^n (a^\top x_{ni})^2
\mathbb{E}\left[\varepsilon_i^2\mathbb{1}_{\{h_n|\varepsilon_i|>\eta\}}\right].
\end{align*}
Since the errors are identically distributed, this equals
\begin{align*}
a^\top C_n a\,\mathbb{E}\left[\varepsilon_1^2\mathbb{1}_{\{|\varepsilon_1|>\eta/h_n\}}\right]
\to 0,
\end{align*}
because $\mathbb{E}[\varepsilon_1^2]<\infty$ and $\eta/h_n\to\infty$. Thus every one-dimensional projection satisfies the scalar Lindeberg condition, and the [multivariate Lindeberg central limit theorem for triangular arrays](/page/Multivariate%20Lindeberg%20Central%20Limit%20Theorem), through the [Cramer-Wold device](/page/Cramer-Wold%20Device), gives
\begin{align*}
Z_n \xrightarrow{d} Z,
\qquad
Z \sim \mathcal{N}(0,\sigma^2 C).
\end{align*}
Now consider the active coordinates. If $j\in S$, then $\beta^*_j\ne 0$. Since $\sqrt{n}(\tilde{\beta}_n-\beta^*)=O_{\mathbb{P}}(1)$, in particular $\tilde{\beta}_{n,j}\xrightarrow{\mathbb{P}}\beta^*_j$. The map $t\mapsto |t|^{-\gamma}$ is continuous at every nonzero $t$, so
\begin{align*}
w_{n,j}=|\tilde{\beta}_{n,j}|^{-\gamma}
\xrightarrow{\mathbb{P}}
|\beta^*_j|^{-\gamma}.
\end{align*}
Thus active weights remain bounded in probability.
For inactive coordinates the conclusion is different. If $j\in S^c$, then $\beta^*_j=0$, and root-$n$ consistency gives
\begin{align*}
\sqrt{n}\tilde{\beta}_{n,j}=O_{\mathbb{P}}(1).
\end{align*}
Let
\begin{align*}
A_n:=\left\{\min_{1\le j\le p}|\tilde{\beta}_{n,j}|>0\right\}
\end{align*}
be the event on which the formula $w_{n,j}=|\tilde{\beta}_{n,j}|^{-\gamma}$ defines all adaptive weights. By hypothesis, $\mathbb{P}(A_n)\to1$. Off $A_n$, assign the weights arbitrary positive values. This extension is only a definitional device: every later comparison that uses the reciprocal-power formula is intersected with $A_n$, and the exceptional event has probability tending to zero. On $A_n$,
\begin{align*}
w_{n,j}
=
|\tilde{\beta}_{n,j}|^{-\gamma}
=
n^{\gamma/2}|\sqrt{n}\tilde{\beta}_{n,j}|^{-\gamma}.
\end{align*}
Because $p$ is fixed, the random variable $\max_{j\in S^c}|\sqrt{n}\tilde{\beta}_{n,j}|$ is also $O_{\mathbb{P}}(1)$. Therefore, for every fixed $M>0$,
\begin{align*}
\mathbb{P}\left(A_n\cap\left\{\min_{j\in S^c}w_{n,j}\ge n^{\gamma/2}M^{-\gamma}\right\}\right)
\ge
\mathbb{P}\left(\max_{j\in S^c}|\sqrt{n}\tilde{\beta}_{n,j}|\le M\right)-\mathbb{P}(A_n^c).
\end{align*}
Choosing $M$ large makes the right-hand probability close to $1$, while the deterministic factor satisfies
\begin{align*}
\sqrt{n}\lambda_n n^{\gamma/2}M^{-\gamma}
=
M^{-\gamma}\lambda_n n^{(1+\gamma)/2}
\to
\infty.
\end{align*}
Hence
\begin{align*}
\sqrt{n}\lambda_n \min_{j\in S^c}w_{n,j}
\xrightarrow{\mathbb{P}}
\infty.
\end{align*}
This is the mechanism behind variable selection: the inactive penalty threshold is much larger than the random score, while the active penalty is asymptotically negligible.
[/guided]
[/step]
[step:Solve the penalized problem on the true active coordinates]
Define the restricted adaptive Lasso estimator
\begin{align*}
\bar{\beta}_{n,S}
\in
\operatorname*{arg\,min}_{b\in \mathbb{R}^{|S|}}
\left\{
\frac{1}{2n}|Y_n-X_{n,S}b|^2
+
\lambda_n\sum_{j\in S}w_{n,j}|b_j|
\right\},
\qquad
\bar{\beta}_{n,S^c}:=0.
\end{align*}
These two coordinate definitions define the full vector $\bar{\beta}_n\in\mathbb{R}^p$.
Here $X_{n,S}\in\mathbb{R}^{n\times |S|}$ is the submatrix of $X_n$ with columns indexed by $S$. Since $C$ is positive definite, $C_{SS}$ is positive definite, and $C_{n,SS}\to C_{SS}$ implies that $C_{n,SS}$ is positive definite for all sufficiently large $n$. Hence, for all sufficiently large $n$, the restricted objective is strictly convex and the displayed argmin consists of a unique vector, which we denote by $\bar{\beta}_{n,S}$.
Let $\tau_{n,S}\in[-1,1]^{|S|}$ be a subgradient vector for the active penalty at $\bar{\beta}_{n,S}$, so $\tau_{n,j}=\operatorname{sgn}(\bar{\beta}_{n,j})$ if $\bar{\beta}_{n,j}\ne0$ and $\tau_{n,j}\in[-1,1]$ if $\bar{\beta}_{n,j}=0$. The subgradient optimality condition for the restricted convex problem is
\begin{align*}
0
=
-\frac{1}{n}X_{n,S}^\top(Y_n-X_{n,S}\bar{\beta}_{n,S})
+
\lambda_n W_{n,S}\tau_{n,S},
\end{align*}
where $W_{n,S}\in\mathbb{R}^{|S|\times |S|}$ is the diagonal matrix with entries $w_{n,j}$ for $j\in S$. Using $Y_n=X_{n,S}\beta^*_S+\varepsilon_n$, this becomes
\begin{align*}
C_{n,SS}(\bar{\beta}_{n,S}-\beta^*_S)
=
\frac{1}{n}X_{n,S}^\top\varepsilon_n
-
\lambda_n W_{n,S}\tau_{n,S}.
\end{align*}
Multiplying by $\sqrt{n}$ gives
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
=
C_{n,SS}^{-1}Z_{n,S}
-
\sqrt{n}\lambda_n C_{n,SS}^{-1}W_{n,S}\tau_{n,S}.
\end{align*}
Since $C_{n,SS}^{-1}\to C_{SS}^{-1}$, $W_{n,S}=O_{\mathbb{P}}(1)$, $|\tau_{n,S}|\le \sqrt{|S|}$, and $\sqrt{n}\lambda_n\to 0$,
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
=
C_{SS}^{-1}Z_{n,S}+o_{\mathbb{P}}(1).
\end{align*}
Therefore
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
\xrightarrow{d}
\mathcal{N}(0,\sigma^2 C_{SS}^{-1}).
\end{align*}
In particular, $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$, so
\begin{align*}
\mathbb{P}(\bar{\beta}_{n,j}\ne 0\text{ for every }j\in S)\to 1.
\end{align*}
[guided]
The restricted problem is solved only on the coordinates where the true coefficient is nonzero. Define
\begin{align*}
\bar{\beta}_{n,S}
\in
\operatorname*{arg\,min}_{b\in \mathbb{R}^{|S|}}
\left\{
\frac{1}{2n}|Y_n-X_{n,S}b|^2
+
\lambda_n\sum_{j\in S}w_{n,j}|b_j|
\right\},
\qquad
\bar{\beta}_{n,S^c}:=0.
\end{align*}
The matrix $X_{n,S}\in\mathbb{R}^{n\times |S|}$ is obtained from $X_n$ by keeping only the columns indexed by $S$. Since $C$ is positive definite, the principal submatrix $C_{SS}$ is positive definite. Because $C_{n,SS}\to C_{SS}$, the matrices $C_{n,SS}$ are positive definite for all sufficiently large $n$, so the restricted objective is strictly convex for those $n$. Therefore the displayed argmin consists of a unique vector for all sufficiently large $n$, and this unique vector is denoted by $\bar{\beta}_{n,S}$.
The important point is to avoid assuming in advance that the restricted minimizer has no zero active coordinates. The penalty is nondifferentiable at zero, so we use subgradients. Let $\tau_{n,S}\in[-1,1]^{|S|}$ be a subgradient vector for the active absolute-value penalty at $\bar{\beta}_{n,S}$: for each $j\in S$, $\tau_{n,j}=\operatorname{sgn}(\bar{\beta}_{n,j})$ if $\bar{\beta}_{n,j}\ne0$, while $\tau_{n,j}\in[-1,1]$ if $\bar{\beta}_{n,j}=0$. The subgradient optimality condition for the restricted convex problem is
\begin{align*}
0
=
-\frac{1}{n}X_{n,S}^\top(Y_n-X_{n,S}\bar{\beta}_{n,S})
+
\lambda_n W_{n,S}\tau_{n,S},
\end{align*}
where $W_{n,S}\in\mathbb{R}^{|S|\times |S|}$ is diagonal with diagonal entries $w_{n,j}$ for $j\in S$.
Substitute the model identity $Y_n=X_{n,S}\beta^*_S+\varepsilon_n$. The optimality condition becomes
\begin{align*}
0
=
-\frac{1}{n}X_{n,S}^\top\{X_{n,S}\beta^*_S+\varepsilon_n-X_{n,S}\bar{\beta}_{n,S}\}
+
\lambda_n W_{n,S}\tau_{n,S}.
\end{align*}
Expanding the braces and collecting the terms containing $\bar{\beta}_{n,S}-\beta^*_S$ gives
\begin{align*}
0
=
C_{n,SS}(\bar{\beta}_{n,S}-\beta^*_S)
-
\frac{1}{n}X_{n,S}^\top\varepsilon_n
+
\lambda_n W_{n,S}\tau_{n,S}.
\end{align*}
Thus
\begin{align*}
C_{n,SS}(\bar{\beta}_{n,S}-\beta^*_S)
=
\frac{1}{n}X_{n,S}^\top\varepsilon_n
-
\lambda_n W_{n,S}\tau_{n,S}.
\end{align*}
Multiplying by $\sqrt{n}$ and using $Z_{n,S}=X_{n,S}^\top\varepsilon_n/\sqrt{n}$ gives
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
=
C_{n,SS}^{-1}Z_{n,S}
-
\sqrt{n}\lambda_n C_{n,SS}^{-1}W_{n,S}\tau_{n,S}.
\end{align*}
The first term is the ordinary least-squares fluctuation on the true active variables. The second term is the adaptive penalty bias. On active coordinates, $W_{n,S}=O_{\mathbb{P}}(1)$ by the weight convergence from the first step, and $|\tau_{n,S}|\le\sqrt{|S|}$ because each component lies in $[-1,1]$. Since $\sqrt{n}\lambda_n\to0$ and $C_{n,SS}^{-1}\to C_{SS}^{-1}$, the penalty bias is $o_{\mathbb{P}}(1)$. Therefore
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
=
C_{SS}^{-1}Z_{n,S}+o_{\mathbb{P}}(1).
\end{align*}
Since $Z_{n,S}\xrightarrow{d}\mathcal{N}(0,\sigma^2 C_{SS})$, [Slutsky's theorem](/page/Slutsky%27s%20Theorem) yields
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
\xrightarrow{d}
\mathcal{N}(0,\sigma^2 C_{SS}^{-1}).
\end{align*}
This gives $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$. Because every coordinate of $\beta^*_S$ is nonzero, convergence in probability implies
\begin{align*}
\mathbb{P}(\bar{\beta}_{n,j}\ne 0\text{ for every }j\in S)\to 1.
\end{align*}
[/guided]
[/step]
[step:Show the inactive KKT inequalities hold with probability tending to one]
If $S^c=\varnothing$, there are no inactive KKT inequalities to verify. Suppose $S^c\ne\varnothing$. For $j\in\{1,\dots,p\}$, let $X_{n,j}\in\mathbb R^n$ denote the $j$th column of $X_n$. For $j\in S^c$, define the inactive score at the restricted estimator by
\begin{align*}
R_{n,j}
:=
\frac{1}{n}X_{n,j}^\top(Y_n-X_{n,S}\bar{\beta}_{n,S})
\in \mathbb{R}.
\end{align*}
Using $Y_n=X_{n,S}\beta^*_S+\varepsilon_n$,
\begin{align*}
R_{n,j}
=
\frac{1}{n}X_{n,j}^\top\varepsilon_n
-
\frac{1}{n}X_{n,j}^\top X_{n,S}(\bar{\beta}_{n,S}-\beta^*_S).
\end{align*}
The first term is $O_{\mathbb{P}}(n^{-1/2})$ because it is a coordinate of $Z_n/\sqrt{n}$ and $Z_n=O_{\mathbb{P}}(1)$. The second term is $O_{\mathbb{P}}(n^{-1/2})$ because $X_{n,j}^\top X_{n,S}/n$ converges and $\bar{\beta}_{n,S}-\beta^*_S=O_{\mathbb{P}}(n^{-1/2})$. Since $S^c$ is fixed and nonempty,
\begin{align*}
\max_{j\in S^c}|R_{n,j}|=O_{\mathbb{P}}(n^{-1/2}).
\end{align*}
From the inactive weight lower bound established above,
\begin{align*}
\sqrt{n}\lambda_n\min_{j\in S^c}w_{n,j}\xrightarrow{\mathbb{P}}\infty.
\end{align*}
Combining these two estimates gives
\begin{align*}
\mathbb{P}\left(\max_{j\in S^c}|R_{n,j}|\le \lambda_n\min_{j\in S^c}w_{n,j}\right)\to 1.
\end{align*}
Equivalently, with probability tending to one,
\begin{align*}
|R_{n,j}|\le \lambda_n w_{n,j}
\qquad
\text{for every }j\in S^c.
\end{align*}
[guided]
There are two cases. If $S^c=\varnothing$, then every coordinate is active and there are no inactive KKT inequalities to check. Now suppose $S^c\ne\varnothing$. For each $j\in\{1,\dots,p\}$, let $X_{n,j}\in\mathbb R^n$ denote the $j$th column of $X_n$. For each inactive coordinate $j\in S^c$, define
\begin{align*}
R_{n,j}
:=
\frac{1}{n}X_{n,j}^\top(Y_n-X_{n,S}\bar{\beta}_{n,S})
\in \mathbb{R}.
\end{align*}
This is exactly the least-squares score in coordinate $j$ evaluated at the candidate vector that equals $\bar{\beta}_{n,S}$ on $S$ and zero on $S^c$.
Substitute $Y_n=X_{n,S}\beta^*_S+\varepsilon_n$ to separate noise from estimation error:
\begin{align*}
R_{n,j}
=
\frac{1}{n}X_{n,j}^\top\varepsilon_n
-
\frac{1}{n}X_{n,j}^\top X_{n,S}(\bar{\beta}_{n,S}-\beta^*_S).
\end{align*}
The first term is $O_{\mathbb{P}}(n^{-1/2})$ because
\begin{align*}
\frac{1}{n}X_{n,j}^\top\varepsilon_n
=
\frac{1}{\sqrt{n}}(Z_n)_j,
\end{align*}
and $Z_n=O_{\mathbb{P}}(1)$ follows from its convergence in distribution. The second term is also $O_{\mathbb{P}}(n^{-1/2})$: the deterministic row vector $X_{n,j}^\top X_{n,S}/n$ converges as a subblock of $C_n$, while the restricted estimator expansion gives $\bar{\beta}_{n,S}-\beta^*_S=O_{\mathbb{P}}(n^{-1/2})$. Since the inactive set is fixed and nonempty,
\begin{align*}
\max_{j\in S^c}|R_{n,j}|=O_{\mathbb{P}}(n^{-1/2}).
\end{align*}
The inactive adaptive weights were shown earlier to satisfy
\begin{align*}
\sqrt{n}\lambda_n\min_{j\in S^c}w_{n,j}\xrightarrow{\mathbb{P}}\infty.
\end{align*}
This means the penalty threshold $\lambda_n w_{n,j}$ is larger than the random inactive score by a diverging factor at the $n^{-1/2}$ scale. Combining the score bound and the weight lower bound gives
\begin{align*}
\mathbb{P}\left(\max_{j\in S^c}|R_{n,j}|\le \lambda_n\min_{j\in S^c}w_{n,j}\right)\to 1.
\end{align*}
Therefore, with probability tending to one,
\begin{align*}
|R_{n,j}|\le \lambda_n w_{n,j}
\qquad
\text{for every }j\in S^c.
\end{align*}
These are exactly the KKT inequalities required for all inactive coefficients to remain equal to zero.
[/guided]
[/step]
[step:Use the KKT conditions to identify the full adaptive Lasso solution]
Define the full adaptive Lasso criterion $Q_n:\mathbb{R}^p\to\mathbb{R}$ by
\begin{align*}
Q_n(\beta)
:=
\frac{1}{2n}|Y_n-X_n\beta|^2
+
\lambda_n\sum_{j=1}^p w_{n,j}|\beta_j|,
\qquad
\beta\in\mathbb{R}^p.
\end{align*}
This criterion is convex. Let $\lambda_{\min}(C)>0$ denote the smallest eigenvalue of $C$. Since $C_n\to C$, the smallest eigenvalue of $C_n$ is at least $\lambda_{\min}(C)/2$ for all sufficiently large $n$, so the quadratic part has positive definite Hessian $C_n$ for all such $n$. Hence $Q_n$ is strictly convex for all sufficiently large $n$. For those $n$, $Q_n$ is also coercive because
\begin{align*}
Q_n(\beta)
\ge
\frac{\lambda_{\min}(C)}{4}|\beta|^2
-
\frac{1}{n}|X_n^\top Y_n|\,|\beta|
+
\frac{1}{2n}|Y_n|^2,
\end{align*}
and the positive quadratic term dominates the linear term as $|\beta|\to\infty$. Therefore $Q_n$ has a global minimizer, and strict convexity makes that minimizer unique. The subgradient of $|t|$ is $\operatorname{sgn}(t)$ when $t\ne0$ and the interval $[-1,1]$ when $t=0$. For a convex function on $\mathbb{R}^p$, the condition $0\in\partial Q_n(\beta)$ is necessary and sufficient for global optimality. Hence the [Karush-Kuhn-Tucker conditions](/page/Karush-Kuhn-Tucker%20Conditions) for $\beta\in\mathbb{R}^p$ to minimize $Q_n$ are
\begin{align*}
\frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta)
=
\lambda_n w_{n,j}\operatorname{sgn}(\beta_j)
\quad
\text{if }\beta_j\ne 0,
\end{align*}
and
\begin{align*}
\left|\frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta)\right|
\le
\lambda_n w_{n,j}
\quad
\text{if }\beta_j=0.
\end{align*}
The restricted estimator $\bar{\beta}_n$ satisfies the equality conditions on $S$ with probability tending to one, because $\mathbb{P}(\bar{\beta}_{n,j}\ne0\text{ for every }j\in S)\to1$ and the restricted subgradient equations then reduce to the displayed sign equations. If $S^c\ne\varnothing$, the previous step gives the inequality conditions on $S^c$ with probability tending to one; if $S^c=\varnothing$, those inequalities are vacuous. Hence
\begin{align*}
\mathbb{P}(\hat{\beta}^{\mathrm{AL}}_n=\bar{\beta}_n)\to 1.
\end{align*}
Since $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$ and every coordinate of $\beta^*_S$ is nonzero,
\begin{align*}
\mathbb{P}(\bar{\beta}_{n,j}\ne 0\text{ for every }j\in S)\to 1.
\end{align*}
Together with $\bar{\beta}_{n,S^c}=0$, this yields
\begin{align*}
\mathbb{P}\left(\{j:\hat{\beta}^{\mathrm{AL}}_{n,j}\ne 0\}=S\right)\to 1.
\end{align*}
Finally, because $\mathbb{P}(\hat{\beta}^{\mathrm{AL}}_n=\bar{\beta}_n)\to 1$, the difference
\begin{align*}
\sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S)
-
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
\end{align*}
is equal to zero with probability tending to one and hence converges to $0$ in probability. Since the restricted estimator satisfies
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
\xrightarrow{d}
\mathcal{N}(0,\sigma^2 C_{SS}^{-1}),
\end{align*}
[Slutsky's theorem](/page/Slutsky%27s%20Theorem) gives
\begin{align*}
\sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S)
\xrightarrow{d}
\mathcal{N}(0,\sigma^2 C_{SS}^{-1}).
\end{align*}
This proves both the selection consistency and the oracle asymptotic normality.
[guided]
We now check that the candidate vector $\bar{\beta}_n$ is actually the full adaptive Lasso minimizer. Define the full adaptive Lasso criterion $Q_n:\mathbb{R}^p\to\mathbb{R}$ by
\begin{align*}
Q_n(\beta)
:=
\frac{1}{2n}|Y_n-X_n\beta|^2
+
\lambda_n\sum_{j=1}^p w_{n,j}|\beta_j|,
\qquad
\beta\in\mathbb{R}^p.
\end{align*}
Let $\hat{\beta}^{\mathrm{AL}}_n\in\mathbb{R}^p$ denote the full adaptive Lasso minimizer of $Q_n$ when this minimizer is unique. It is convex because it is the sum of a convex quadratic function and weighted absolute-value penalties. Let $\lambda_{\min}(C)>0$ denote the smallest eigenvalue of $C$. Since $C_n=X_n^\top X_n/n\to C$, eigenvalue convergence for symmetric matrices gives $\lambda_{\min}(C_n)\to\lambda_{\min}(C)$. Therefore $\lambda_{\min}(C_n)\ge \lambda_{\min}(C)/2$ for all sufficiently large $n$, so the quadratic part is strictly convex for those $n$. We also need existence, not just uniqueness. For sufficiently large $n$,
\begin{align*}
Q_n(\beta)
\ge
\frac{\lambda_{\min}(C)}{4}|\beta|^2
-
\frac{1}{n}|X_n^\top Y_n|\,|\beta|
+
\frac{1}{2n}|Y_n|^2.
\end{align*}
The right-hand side tends to $\infty$ as $|\beta|\to\infty$, so $Q_n$ is coercive. Since $Q_n$ is continuous on $\mathbb{R}^p$, coercivity gives a global minimizer, and strict convexity makes that minimizer unique.
The subgradient of the absolute-value function is $\operatorname{sgn}(t)$ when $t\ne0$ and the interval $[-1,1]$ when $t=0$. Since $Q_n$ is convex on $\mathbb{R}^p$, the subgradient condition $0\in\partial Q_n(\beta)$ is necessary and sufficient for $\beta$ to be a global minimizer. Therefore the [Karush-Kuhn-Tucker conditions](/page/Karush-Kuhn-Tucker%20Conditions) for a vector $\beta\in\mathbb{R}^p$ to minimize $Q_n$ are the following coordinatewise conditions:
\begin{align*}
\frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta)
=
\lambda_n w_{n,j}\operatorname{sgn}(\beta_j)
\quad
\text{if }\beta_j\ne 0,
\end{align*}
and
\begin{align*}
\left|\frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta)\right|
\le
\lambda_n w_{n,j}
\quad
\text{if }\beta_j=0.
\end{align*}
The restricted estimator satisfies the equality conditions on $S$ with probability tending to one: the previous active-coordinate analysis gives $\mathbb{P}(\bar{\beta}_{n,j}\ne0\text{ for every }j\in S)\to1$, and on that event the restricted subgradient equations are exactly the displayed sign equations. On $S^c$, either $S^c=\varnothing$ and there is nothing to verify, or $S^c\ne\varnothing$ and the inactive score inequalities from the previous step give
\begin{align*}
\left|\frac{1}{n}X_{n,j}^\top(Y_n-X_n\bar{\beta}_n)\right|
\le
\lambda_n w_{n,j}
\qquad
\text{for every }j\in S^c
\end{align*}
with probability tending to one. Thus $\bar{\beta}_n$ satisfies the full KKT system with probability tending to one. By uniqueness of the minimizer for large $n$,
\begin{align*}
\mathbb{P}(\hat{\beta}^{\mathrm{AL}}_n=\bar{\beta}_n)\to 1.
\end{align*}
Now support recovery follows from the construction. The vector $\bar{\beta}_n$ is zero on $S^c$ by definition, and $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$ with every coordinate of $\beta^*_S$ nonzero. Hence
\begin{align*}
\mathbb{P}\left(\{j:\hat{\beta}^{\mathrm{AL}}_{n,j}\ne 0\}=S\right)\to 1.
\end{align*}
Finally, replacing $\bar{\beta}_n$ by $\hat{\beta}^{\mathrm{AL}}_n$ changes the active-coordinate fluctuation only on an event whose probability tends to zero. Equivalently,
\begin{align*}
\sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S)
-
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
\xrightarrow{\mathbb{P}}
0.
\end{align*}
Since
\begin{align*}
\sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S)
\xrightarrow{d}
\mathcal{N}(0,\sigma^2 C_{SS}^{-1}),
\end{align*}
[Slutsky's theorem](/page/Slutsky%27s%20Theorem) gives
\begin{align*}
\sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S)
\xrightarrow{d}
\mathcal{N}(0,\sigma^2 C_{SS}^{-1}).
\end{align*}
This proves both exact variable selection and the oracle asymptotic normality.
[/guided]
[/step]
Prerequisites (0/4 completed)
Prerequisites Graph
Interactive dependency map showing how this theorem builds on foundational concepts
Loading dependency graph...
Theorem
Definition
Current
Requires
Definitions & Concepts
Explore Further
Distribution
Definition
Variance
Definition
Event
Definition
Random Variable
Definition
Interval Probabilities from the Distribution Function
Probability Theory
Minimality of Generated Sigma-Algebras
Probability & Statistics
Existence of Nonmeasurable Subsets of the Real Line
Probability & Statistics
Stirling's Formula
Probability Theory
Optional Stopping Theorem
Martingale Theory
Doob's Upcrossing Inequality
Martingale Theory
Independence and Conditional Probability
Probability Theory
Properties of Conditional Expectation
Probability Theory
Probability & Statistics
Area