Androma — The Home of Mathematics on the Internet

custom_env admin

[step:Record the score expansion and the adaptive weight behaviour]Define the empirical Gram matrix \begin{align*} C_n := \frac{1}{n}X_n^\top X_n \in \mathbb{R}^{p \times p} \end{align*} and the normalized score vector \begin{align*} Z_n := \frac{1}{\sqrt{n}}X_n^\top \varepsilon_n \in \mathbb{R}^p. \end{align*} By the deterministic-design leverage hypothesis, \begin{align*} \max_{1\le i\le n}\frac{|x_{ni}|^2}{n}\to 0, \end{align*} where $x_{ni}\in\mathbb{R}^p$ denotes the $i$-th row of $X_n$. To verify the [multivariate Lindeberg central limit theorem for triangular arrays](/page/Multivariate%20Lindeberg%20Central%20Limit%20Theorem) through the [Cramer-Wold device](/page/Cramer-Wold%20Device), fix $a\in\mathbb{R}^p$ and define scalar summands \begin{align*} U_{n,i}:=\frac{a^\top x_{ni}}{\sqrt{n}}\varepsilon_i\in\mathbb{R}, \qquad 1\le i\le n. \end{align*} Their variances satisfy \begin{align*} \sum_{i=1}^n \operatorname{Var}(U_{n,i}) = \sigma^2 a^\top C_n a \to \sigma^2 a^\top C a. \end{align*} Let $h_n:=\max_{1\le i\le n}|a^\top x_{ni}|/\sqrt{n}$. Then $h_n\le |a|\max_{1\le i\le n}|x_{ni}|/\sqrt{n}\to 0$. For every $\eta>0$, The bound \begin{align*} \sum_{i=1}^n \mathbb{E}\left[U_{n,i}^2\mathbb{1}_{\{|U_{n,i}|>\eta\}}\right] \le \frac{1}{n}\sum_{i=1}^n (a^\top x_{ni})^2 \mathbb{E}\left[\varepsilon_i^2\mathbb{1}_{\{h_n|\varepsilon_i|>\eta\}}\right] \end{align*} holds by the definition of $h_n$. Since the errors are i.i.d., the right-hand side equals \begin{align*} a^\top C_n a\,\mathbb{E}\left[\varepsilon_1^2\mathbb{1}_{\{|\varepsilon_1|>\eta/h_n\}}\right] \to 0, \end{align*} where the last convergence uses $\mathbb{E}[\varepsilon_1^2]<\infty$ and $\eta/h_n\to\infty$. The scalar Lindeberg condition holds for every $a$, so the [Cramer-Wold device](/page/Cramer-Wold%20Device) gives \begin{align*} Z_n \xrightarrow{d} Z, \qquad Z \sim \mathcal{N}(0,\sigma^2 C). \end{align*} For each $j \in S$, since $\beta^*_j \ne 0$ and $\tilde{\beta}_{n,j}\xrightarrow{\mathbb{P}}\beta^*_j$, the [continuous mapping theorem](/theorems/1847) gives \begin{align*} w_{n,j}=|\tilde{\beta}_{n,j}|^{-\gamma} \xrightarrow{\mathbb{P}} |\beta^*_j|^{-\gamma}. \end{align*} If $S^c\ne\varnothing$, define the weight-definition event \begin{align*} A_n:=\left\{\min_{1\le j\le p}|\tilde{\beta}_{n,j}|>0\right\}. \end{align*} By hypothesis, $\mathbb{P}(A_n)\to1$. Off $A_n$, assign the weights arbitrary positive values; this makes $w_{n,j}$ a well-defined positive [random variable](/page/Random%20Variable) for every $j$ and does not affect any probability limit because $\mathbb{P}(A_n^c)\to0$. For each $j \in S^c$, since $\beta^*_j=0$ and $\sqrt{n}\tilde{\beta}_{n,j}=O_{\mathbb{P}}(1)$, on $A_n$ we have \begin{align*} n^{-\gamma/2}w_{n,j} = |\sqrt{n}\tilde{\beta}_{n,j}|^{-\gamma}. \end{align*} This is large enough in the following sense: for every $M>0$, \begin{align*} \mathbb{P}\left(A_n\cap\left\{\min_{j\in S^c}w_{n,j}\ge n^{\gamma/2}M^{-\gamma}\right\}\right) \ge \mathbb{P}\left(\max_{j\in S^c}|\sqrt{n}\tilde{\beta}_{n,j}|\le M\right)-\mathbb{P}(A_n^c). \end{align*} Since the maximum over the fixed nonempty set $S^c$ is $O_{\mathbb{P}}(1)$, it follows that \begin{align*} \sqrt{n}\lambda_n \min_{j\in S^c} w_{n,j} \xrightarrow{\mathbb{P}} \infty. \end{align*} If $S^c=\varnothing$, there are no inactive weights and all later inactive-coordinate KKT inequalities are vacuous.[/step]

custom_env admin

[guided]The proof needs two different effects from the adaptive weights. On active coordinates, the pilot estimator converges to a nonzero number, so the weight stays bounded. On inactive coordinates, the pilot estimator is only of order $n^{-1/2}$, so its reciprocal power is of order $n^{\gamma/2}$. First define the two quantities that appear in every first-order condition. The empirical Gram matrix is \begin{align*} C_n := \frac{1}{n}X_n^\top X_n \in \mathbb{R}^{p \times p}. \end{align*} The normalized score vector is \begin{align*} Z_n := \frac{1}{\sqrt{n}}X_n^\top \varepsilon_n \in \mathbb{R}^p. \end{align*} The vector $Z_n$ is the normalized least-squares score at the true parameter. The deterministic design assumptions imply that its covariance converges: \begin{align*} \operatorname{Cov}(Z_n) = \frac{1}{n}X_n^\top \operatorname{Cov}(\varepsilon_n)X_n = \sigma^2 C_n \to \sigma^2 C. \end{align*} We now verify the Lindeberg condition using the leverage assumption in the theorem statement. Let $x_{ni}\in\mathbb{R}^p$ denote the $i$-th row of $X_n$. To use the [Cramer-Wold device](/page/Cramer-Wold%20Device), fix $a\in\mathbb{R}^p$ and define scalar triangular-array summands \begin{align*} U_{n,i}:=\frac{a^\top x_{ni}}{\sqrt{n}}\varepsilon_i\in\mathbb{R}, \qquad 1\le i\le n. \end{align*} Their variance sum is \begin{align*} \sum_{i=1}^n \operatorname{Var}(U_{n,i}) = \sigma^2 a^\top C_n a \to \sigma^2 a^\top C a. \end{align*} Define $h_n:=\max_{1\le i\le n}|a^\top x_{ni}|/\sqrt{n}$. The leverage assumption gives \begin{align*} h_n\le |a|\max_{1\le i\le n}\frac{|x_{ni}|}{\sqrt{n}}\to 0. \end{align*} For every $\eta>0$, the definition of $h_n$ gives \begin{align*} \sum_{i=1}^n \mathbb{E}\left[U_{n,i}^2\mathbb{1}_{\{|U_{n,i}|>\eta\}}\right] \le \frac{1}{n}\sum_{i=1}^n (a^\top x_{ni})^2 \mathbb{E}\left[\varepsilon_i^2\mathbb{1}_{\{h_n|\varepsilon_i|>\eta\}}\right]. \end{align*} Since the errors are identically distributed, this equals \begin{align*} a^\top C_n a\,\mathbb{E}\left[\varepsilon_1^2\mathbb{1}_{\{|\varepsilon_1|>\eta/h_n\}}\right] \to 0, \end{align*} because $\mathbb{E}[\varepsilon_1^2]<\infty$ and $\eta/h_n\to\infty$. Thus every one-dimensional projection satisfies the scalar Lindeberg condition, and the [multivariate Lindeberg central limit theorem for triangular arrays](/page/Multivariate%20Lindeberg%20Central%20Limit%20Theorem), through the [Cramer-Wold device](/page/Cramer-Wold%20Device), gives \begin{align*} Z_n \xrightarrow{d} Z, \qquad Z \sim \mathcal{N}(0,\sigma^2 C). \end{align*} Now consider the active coordinates. If $j\in S$, then $\beta^*_j\ne 0$. Since $\sqrt{n}(\tilde{\beta}_n-\beta^*)=O_{\mathbb{P}}(1)$, in particular $\tilde{\beta}_{n,j}\xrightarrow{\mathbb{P}}\beta^*_j$. The map $t\mapsto |t|^{-\gamma}$ is continuous at every nonzero $t$, so \begin{align*} w_{n,j}=|\tilde{\beta}_{n,j}|^{-\gamma} \xrightarrow{\mathbb{P}} |\beta^*_j|^{-\gamma}. \end{align*} Thus active weights remain bounded in probability. For inactive coordinates the conclusion is different. If $j\in S^c$, then $\beta^*_j=0$, and root-$n$ consistency gives \begin{align*} \sqrt{n}\tilde{\beta}_{n,j}=O_{\mathbb{P}}(1). \end{align*} Let \begin{align*} A_n:=\left\{\min_{1\le j\le p}|\tilde{\beta}_{n,j}|>0\right\} \end{align*} be the event on which the formula $w_{n,j}=|\tilde{\beta}_{n,j}|^{-\gamma}$ defines all adaptive weights. By hypothesis, $\mathbb{P}(A_n)\to1$. Off $A_n$, assign the weights arbitrary positive values. This extension is only a definitional device: every later comparison that uses the reciprocal-power formula is intersected with $A_n$, and the exceptional event has probability tending to zero. On $A_n$, \begin{align*} w_{n,j} = |\tilde{\beta}_{n,j}|^{-\gamma} = n^{\gamma/2}|\sqrt{n}\tilde{\beta}_{n,j}|^{-\gamma}. \end{align*} Because $p$ is fixed, the random variable $\max_{j\in S^c}|\sqrt{n}\tilde{\beta}_{n,j}|$ is also $O_{\mathbb{P}}(1)$. Therefore, for every fixed $M>0$, \begin{align*} \mathbb{P}\left(A_n\cap\left\{\min_{j\in S^c}w_{n,j}\ge n^{\gamma/2}M^{-\gamma}\right\}\right) \ge \mathbb{P}\left(\max_{j\in S^c}|\sqrt{n}\tilde{\beta}_{n,j}|\le M\right)-\mathbb{P}(A_n^c). \end{align*} Choosing $M$ large makes the right-hand probability close to $1$, while the deterministic factor satisfies \begin{align*} \sqrt{n}\lambda_n n^{\gamma/2}M^{-\gamma} = M^{-\gamma}\lambda_n n^{(1+\gamma)/2} \to \infty. \end{align*} Hence \begin{align*} \sqrt{n}\lambda_n \min_{j\in S^c}w_{n,j} \xrightarrow{\mathbb{P}} \infty. \end{align*} This is the mechanism behind variable selection: the inactive penalty threshold is much larger than the random score, while the active penalty is asymptotically negligible.[/guided]

custom_env admin

[step:Solve the penalized problem on the true active coordinates]Define the restricted adaptive Lasso estimator \begin{align*} \bar{\beta}_{n,S} \in \operatorname*{arg\,min}_{b\in \mathbb{R}^{|S|}} \left\{ \frac{1}{2n}|Y_n-X_{n,S}b|^2 + \lambda_n\sum_{j\in S}w_{n,j}|b_j| \right\}, \qquad \bar{\beta}_{n,S^c}:=0. \end{align*} These two coordinate definitions define the full vector $\bar{\beta}_n\in\mathbb{R}^p$. Here $X_{n,S}\in\mathbb{R}^{n\times |S|}$ is the submatrix of $X_n$ with columns indexed by $S$. Since $C$ is positive definite, $C_{SS}$ is positive definite, and $C_{n,SS}\to C_{SS}$ implies that $C_{n,SS}$ is positive definite for all sufficiently large $n$. Hence, for all sufficiently large $n$, the restricted objective is strictly convex and the displayed argmin consists of a unique vector, which we denote by $\bar{\beta}_{n,S}$. Let $\tau_{n,S}\in[-1,1]^{|S|}$ be a subgradient vector for the active penalty at $\bar{\beta}_{n,S}$, so $\tau_{n,j}=\operatorname{sgn}(\bar{\beta}_{n,j})$ if $\bar{\beta}_{n,j}\ne0$ and $\tau_{n,j}\in[-1,1]$ if $\bar{\beta}_{n,j}=0$. The subgradient optimality condition for the restricted convex problem is \begin{align*} 0 = -\frac{1}{n}X_{n,S}^\top(Y_n-X_{n,S}\bar{\beta}_{n,S}) + \lambda_n W_{n,S}\tau_{n,S}, \end{align*} where $W_{n,S}\in\mathbb{R}^{|S|\times |S|}$ is the diagonal matrix with entries $w_{n,j}$ for $j\in S$. Using $Y_n=X_{n,S}\beta^*_S+\varepsilon_n$, this becomes \begin{align*} C_{n,SS}(\bar{\beta}_{n,S}-\beta^*_S) = \frac{1}{n}X_{n,S}^\top\varepsilon_n - \lambda_n W_{n,S}\tau_{n,S}. \end{align*} Multiplying by $\sqrt{n}$ gives \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) = C_{n,SS}^{-1}Z_{n,S} - \sqrt{n}\lambda_n C_{n,SS}^{-1}W_{n,S}\tau_{n,S}. \end{align*} Since $C_{n,SS}^{-1}\to C_{SS}^{-1}$, $W_{n,S}=O_{\mathbb{P}}(1)$, $|\tau_{n,S}|\le \sqrt{|S|}$, and $\sqrt{n}\lambda_n\to 0$, \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) = C_{SS}^{-1}Z_{n,S}+o_{\mathbb{P}}(1). \end{align*} Therefore \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) \xrightarrow{d} \mathcal{N}(0,\sigma^2 C_{SS}^{-1}). \end{align*} In particular, $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$, so \begin{align*} \mathbb{P}(\bar{\beta}_{n,j}\ne 0\text{ for every }j\in S)\to 1. \end{align*}[/step]

custom_env admin

[guided]The restricted problem is solved only on the coordinates where the true coefficient is nonzero. Define \begin{align*} \bar{\beta}_{n,S} \in \operatorname*{arg\,min}_{b\in \mathbb{R}^{|S|}} \left\{ \frac{1}{2n}|Y_n-X_{n,S}b|^2 + \lambda_n\sum_{j\in S}w_{n,j}|b_j| \right\}, \qquad \bar{\beta}_{n,S^c}:=0. \end{align*} The matrix $X_{n,S}\in\mathbb{R}^{n\times |S|}$ is obtained from $X_n$ by keeping only the columns indexed by $S$. Since $C$ is positive definite, the principal submatrix $C_{SS}$ is positive definite. Because $C_{n,SS}\to C_{SS}$, the matrices $C_{n,SS}$ are positive definite for all sufficiently large $n$, so the restricted objective is strictly convex for those $n$. Therefore the displayed argmin consists of a unique vector for all sufficiently large $n$, and this unique vector is denoted by $\bar{\beta}_{n,S}$. The important point is to avoid assuming in advance that the restricted minimizer has no zero active coordinates. The penalty is nondifferentiable at zero, so we use subgradients. Let $\tau_{n,S}\in[-1,1]^{|S|}$ be a subgradient vector for the active absolute-value penalty at $\bar{\beta}_{n,S}$: for each $j\in S$, $\tau_{n,j}=\operatorname{sgn}(\bar{\beta}_{n,j})$ if $\bar{\beta}_{n,j}\ne0$, while $\tau_{n,j}\in[-1,1]$ if $\bar{\beta}_{n,j}=0$. The subgradient optimality condition for the restricted convex problem is \begin{align*} 0 = -\frac{1}{n}X_{n,S}^\top(Y_n-X_{n,S}\bar{\beta}_{n,S}) + \lambda_n W_{n,S}\tau_{n,S}, \end{align*} where $W_{n,S}\in\mathbb{R}^{|S|\times |S|}$ is diagonal with diagonal entries $w_{n,j}$ for $j\in S$. Substitute the model identity $Y_n=X_{n,S}\beta^*_S+\varepsilon_n$. The optimality condition becomes \begin{align*} 0 = -\frac{1}{n}X_{n,S}^\top\{X_{n,S}\beta^*_S+\varepsilon_n-X_{n,S}\bar{\beta}_{n,S}\} + \lambda_n W_{n,S}\tau_{n,S}. \end{align*} Expanding the braces and collecting the terms containing $\bar{\beta}_{n,S}-\beta^*_S$ gives \begin{align*} 0 = C_{n,SS}(\bar{\beta}_{n,S}-\beta^*_S) - \frac{1}{n}X_{n,S}^\top\varepsilon_n + \lambda_n W_{n,S}\tau_{n,S}. \end{align*} Thus \begin{align*} C_{n,SS}(\bar{\beta}_{n,S}-\beta^*_S) = \frac{1}{n}X_{n,S}^\top\varepsilon_n - \lambda_n W_{n,S}\tau_{n,S}. \end{align*} Multiplying by $\sqrt{n}$ and using $Z_{n,S}=X_{n,S}^\top\varepsilon_n/\sqrt{n}$ gives \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) = C_{n,SS}^{-1}Z_{n,S} - \sqrt{n}\lambda_n C_{n,SS}^{-1}W_{n,S}\tau_{n,S}. \end{align*} The first term is the ordinary least-squares fluctuation on the true active variables. The second term is the adaptive penalty bias. On active coordinates, $W_{n,S}=O_{\mathbb{P}}(1)$ by the weight convergence from the first step, and $|\tau_{n,S}|\le\sqrt{|S|}$ because each component lies in $[-1,1]$. Since $\sqrt{n}\lambda_n\to0$ and $C_{n,SS}^{-1}\to C_{SS}^{-1}$, the penalty bias is $o_{\mathbb{P}}(1)$. Therefore \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) = C_{SS}^{-1}Z_{n,S}+o_{\mathbb{P}}(1). \end{align*} Since $Z_{n,S}\xrightarrow{d}\mathcal{N}(0,\sigma^2 C_{SS})$, [Slutsky's theorem](/page/Slutsky%27s%20Theorem) yields \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) \xrightarrow{d} \mathcal{N}(0,\sigma^2 C_{SS}^{-1}). \end{align*} This gives $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$. Because every coordinate of $\beta^*_S$ is nonzero, convergence in probability implies \begin{align*} \mathbb{P}(\bar{\beta}_{n,j}\ne 0\text{ for every }j\in S)\to 1. \end{align*}[/guided]

custom_env admin

[guided]There are two cases. If $S^c=\varnothing$, then every coordinate is active and there are no inactive KKT inequalities to check. Now suppose $S^c\ne\varnothing$. For each $j\in\{1,\dots,p\}$, let $X_{n,j}\in\mathbb R^n$ denote the $j$th column of $X_n$. For each inactive coordinate $j\in S^c$, define \begin{align*} R_{n,j} := \frac{1}{n}X_{n,j}^\top(Y_n-X_{n,S}\bar{\beta}_{n,S}) \in \mathbb{R}. \end{align*} This is exactly the least-squares score in coordinate $j$ evaluated at the candidate vector that equals $\bar{\beta}_{n,S}$ on $S$ and zero on $S^c$. Substitute $Y_n=X_{n,S}\beta^*_S+\varepsilon_n$ to separate noise from estimation error: \begin{align*} R_{n,j} = \frac{1}{n}X_{n,j}^\top\varepsilon_n - \frac{1}{n}X_{n,j}^\top X_{n,S}(\bar{\beta}_{n,S}-\beta^*_S). \end{align*} The first term is $O_{\mathbb{P}}(n^{-1/2})$ because \begin{align*} \frac{1}{n}X_{n,j}^\top\varepsilon_n = \frac{1}{\sqrt{n}}(Z_n)_j, \end{align*} and $Z_n=O_{\mathbb{P}}(1)$ follows from its convergence in distribution. The second term is also $O_{\mathbb{P}}(n^{-1/2})$: the deterministic row vector $X_{n,j}^\top X_{n,S}/n$ converges as a subblock of $C_n$, while the restricted estimator expansion gives $\bar{\beta}_{n,S}-\beta^*_S=O_{\mathbb{P}}(n^{-1/2})$. Since the inactive set is fixed and nonempty, \begin{align*} \max_{j\in S^c}|R_{n,j}|=O_{\mathbb{P}}(n^{-1/2}). \end{align*} The inactive adaptive weights were shown earlier to satisfy \begin{align*} \sqrt{n}\lambda_n\min_{j\in S^c}w_{n,j}\xrightarrow{\mathbb{P}}\infty. \end{align*} This means the penalty threshold $\lambda_n w_{n,j}$ is larger than the random inactive score by a diverging factor at the $n^{-1/2}$ scale. Combining the score bound and the weight lower bound gives \begin{align*} \mathbb{P}\left(\max_{j\in S^c}|R_{n,j}|\le \lambda_n\min_{j\in S^c}w_{n,j}\right)\to 1. \end{align*} Therefore, with probability tending to one, \begin{align*} |R_{n,j}|\le \lambda_n w_{n,j} \qquad \text{for every }j\in S^c. \end{align*} These are exactly the KKT inequalities required for all inactive coefficients to remain equal to zero.[/guided]

custom_env admin

[step:Use the KKT conditions to identify the full adaptive Lasso solution]Define the full adaptive Lasso criterion $Q_n:\mathbb{R}^p\to\mathbb{R}$ by \begin{align*} Q_n(\beta) := \frac{1}{2n}|Y_n-X_n\beta|^2 + \lambda_n\sum_{j=1}^p w_{n,j}|\beta_j|, \qquad \beta\in\mathbb{R}^p. \end{align*} This criterion is convex. Let $\lambda_{\min}(C)>0$ denote the smallest eigenvalue of $C$. Since $C_n\to C$, the smallest eigenvalue of $C_n$ is at least $\lambda_{\min}(C)/2$ for all sufficiently large $n$, so the quadratic part has positive definite Hessian $C_n$ for all such $n$. Hence $Q_n$ is strictly convex for all sufficiently large $n$. For those $n$, $Q_n$ is also coercive because \begin{align*} Q_n(\beta) \ge \frac{\lambda_{\min}(C)}{4}|\beta|^2 - \frac{1}{n}|X_n^\top Y_n|\,|\beta| + \frac{1}{2n}|Y_n|^2, \end{align*} and the positive quadratic term dominates the linear term as $|\beta|\to\infty$. Therefore $Q_n$ has a global minimizer, and strict convexity makes that minimizer unique. The subgradient of $|t|$ is $\operatorname{sgn}(t)$ when $t\ne0$ and the interval $[-1,1]$ when $t=0$. For a convex function on $\mathbb{R}^p$, the condition $0\in\partial Q_n(\beta)$ is necessary and sufficient for global optimality. Hence the [Karush-Kuhn-Tucker conditions](/page/Karush-Kuhn-Tucker%20Conditions) for $\beta\in\mathbb{R}^p$ to minimize $Q_n$ are \begin{align*} \frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta) = \lambda_n w_{n,j}\operatorname{sgn}(\beta_j) \quad \text{if }\beta_j\ne 0, \end{align*} and \begin{align*} \left|\frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta)\right| \le \lambda_n w_{n,j} \quad \text{if }\beta_j=0. \end{align*} The restricted estimator $\bar{\beta}_n$ satisfies the equality conditions on $S$ with probability tending to one, because $\mathbb{P}(\bar{\beta}_{n,j}\ne0\text{ for every }j\in S)\to1$ and the restricted subgradient equations then reduce to the displayed sign equations. If $S^c\ne\varnothing$, the previous step gives the inequality conditions on $S^c$ with probability tending to one; if $S^c=\varnothing$, those inequalities are vacuous. Hence \begin{align*} \mathbb{P}(\hat{\beta}^{\mathrm{AL}}_n=\bar{\beta}_n)\to 1. \end{align*} Since $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$ and every coordinate of $\beta^*_S$ is nonzero, \begin{align*} \mathbb{P}(\bar{\beta}_{n,j}\ne 0\text{ for every }j\in S)\to 1. \end{align*} Together with $\bar{\beta}_{n,S^c}=0$, this yields \begin{align*} \mathbb{P}\left(\{j:\hat{\beta}^{\mathrm{AL}}_{n,j}\ne 0\}=S\right)\to 1. \end{align*} Finally, because $\mathbb{P}(\hat{\beta}^{\mathrm{AL}}_n=\bar{\beta}_n)\to 1$, the difference \begin{align*} \sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S) - \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) \end{align*} is equal to zero with probability tending to one and hence converges to $0$ in probability. Since the restricted estimator satisfies \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) \xrightarrow{d} \mathcal{N}(0,\sigma^2 C_{SS}^{-1}), \end{align*} [Slutsky's theorem](/page/Slutsky%27s%20Theorem) gives \begin{align*} \sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S) \xrightarrow{d} \mathcal{N}(0,\sigma^2 C_{SS}^{-1}). \end{align*} This proves both the selection consistency and the oracle asymptotic normality.[/step]

custom_env admin

[guided]We now check that the candidate vector $\bar{\beta}_n$ is actually the full adaptive Lasso minimizer. Define the full adaptive Lasso criterion $Q_n:\mathbb{R}^p\to\mathbb{R}$ by \begin{align*} Q_n(\beta) := \frac{1}{2n}|Y_n-X_n\beta|^2 + \lambda_n\sum_{j=1}^p w_{n,j}|\beta_j|, \qquad \beta\in\mathbb{R}^p. \end{align*} Let $\hat{\beta}^{\mathrm{AL}}_n\in\mathbb{R}^p$ denote the full adaptive Lasso minimizer of $Q_n$ when this minimizer is unique. It is convex because it is the sum of a convex quadratic function and weighted absolute-value penalties. Let $\lambda_{\min}(C)>0$ denote the smallest eigenvalue of $C$. Since $C_n=X_n^\top X_n/n\to C$, eigenvalue convergence for symmetric matrices gives $\lambda_{\min}(C_n)\to\lambda_{\min}(C)$. Therefore $\lambda_{\min}(C_n)\ge \lambda_{\min}(C)/2$ for all sufficiently large $n$, so the quadratic part is strictly convex for those $n$. We also need existence, not just uniqueness. For sufficiently large $n$, \begin{align*} Q_n(\beta) \ge \frac{\lambda_{\min}(C)}{4}|\beta|^2 - \frac{1}{n}|X_n^\top Y_n|\,|\beta| + \frac{1}{2n}|Y_n|^2. \end{align*} The right-hand side tends to $\infty$ as $|\beta|\to\infty$, so $Q_n$ is coercive. Since $Q_n$ is continuous on $\mathbb{R}^p$, coercivity gives a global minimizer, and strict convexity makes that minimizer unique. The subgradient of the absolute-value function is $\operatorname{sgn}(t)$ when $t\ne0$ and the interval $[-1,1]$ when $t=0$. Since $Q_n$ is convex on $\mathbb{R}^p$, the subgradient condition $0\in\partial Q_n(\beta)$ is necessary and sufficient for $\beta$ to be a global minimizer. Therefore the [Karush-Kuhn-Tucker conditions](/page/Karush-Kuhn-Tucker%20Conditions) for a vector $\beta\in\mathbb{R}^p$ to minimize $Q_n$ are the following coordinatewise conditions: \begin{align*} \frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta) = \lambda_n w_{n,j}\operatorname{sgn}(\beta_j) \quad \text{if }\beta_j\ne 0, \end{align*} and \begin{align*} \left|\frac{1}{n}X_{n,j}^\top(Y_n-X_n\beta)\right| \le \lambda_n w_{n,j} \quad \text{if }\beta_j=0. \end{align*} The restricted estimator satisfies the equality conditions on $S$ with probability tending to one: the previous active-coordinate analysis gives $\mathbb{P}(\bar{\beta}_{n,j}\ne0\text{ for every }j\in S)\to1$, and on that event the restricted subgradient equations are exactly the displayed sign equations. On $S^c$, either $S^c=\varnothing$ and there is nothing to verify, or $S^c\ne\varnothing$ and the inactive score inequalities from the previous step give \begin{align*} \left|\frac{1}{n}X_{n,j}^\top(Y_n-X_n\bar{\beta}_n)\right| \le \lambda_n w_{n,j} \qquad \text{for every }j\in S^c \end{align*} with probability tending to one. Thus $\bar{\beta}_n$ satisfies the full KKT system with probability tending to one. By uniqueness of the minimizer for large $n$, \begin{align*} \mathbb{P}(\hat{\beta}^{\mathrm{AL}}_n=\bar{\beta}_n)\to 1. \end{align*} Now support recovery follows from the construction. The vector $\bar{\beta}_n$ is zero on $S^c$ by definition, and $\bar{\beta}_{n,S}\xrightarrow{\mathbb{P}}\beta^*_S$ with every coordinate of $\beta^*_S$ nonzero. Hence \begin{align*} \mathbb{P}\left(\{j:\hat{\beta}^{\mathrm{AL}}_{n,j}\ne 0\}=S\right)\to 1. \end{align*} Finally, replacing $\bar{\beta}_n$ by $\hat{\beta}^{\mathrm{AL}}_n$ changes the active-coordinate fluctuation only on an event whose probability tends to zero. Equivalently, \begin{align*} \sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S) - \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) \xrightarrow{\mathbb{P}} 0. \end{align*} Since \begin{align*} \sqrt{n}(\bar{\beta}_{n,S}-\beta^*_S) \xrightarrow{d} \mathcal{N}(0,\sigma^2 C_{SS}^{-1}), \end{align*} [Slutsky's theorem](/page/Slutsky%27s%20Theorem) gives \begin{align*} \sqrt{n}(\hat{\beta}^{\mathrm{AL}}_{n,S}-\beta^*_S) \xrightarrow{d} \mathcal{N}(0,\sigma^2 C_{SS}^{-1}). \end{align*} This proves both exact variable selection and the oracle asymptotic normality.[/guided]

custom_env admin

What brings you to Androma?

Start with a route through the knowledge graph.

Attributions & Verification

Proof

Verification Progress

Contributors

Who Can Verify

Quick Actions

Sign in to Androma

Check your inbox

One last step

Attributions & Verification

Proof

Verification Progress

Contributors

Who Can Verify

Quick Actions

Raw Attribution Data