Androma — The Home of Mathematics on the Internet

[guided]The oracle estimator is obtained by fitting least squares on the true support $S$ and setting all inactive coordinates to zero. The first thing to check is therefore that the least-squares residual is orthogonal to the active columns. We define \begin{align*} r^{\mathrm{or}}:=Y-X\hat\beta^{\mathrm{or}}. \end{align*} Using $Y=X_S\beta^*_S+\varepsilon$ and $\hat\beta^{\mathrm{or}}_{S^c}=0$, we get \begin{align*} r^{\mathrm{or}} = \varepsilon-X_S(\hat\beta^{\mathrm{or}}_S-\beta^*_S). \end{align*} The oracle definition gives \begin{align*} \hat\beta^{\mathrm{or}}_S-\beta^*_S = \Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon, \end{align*} so \begin{align*} r^{\mathrm{or}} = \varepsilon-X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon. \end{align*} Now multiply by $X_S^\top/n$: \begin{align*} \frac{1}{n}X_S^\top r^{\mathrm{or}}=\frac{1}{n}X_S^\top\varepsilon-\frac{1}{n}X_S^\top X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon. \end{align*} Since $\Sigma_S=X_S^\top X_S/n$, this becomes \begin{align*} \frac{1}{n}X_S^\top r^{\mathrm{or}}=\frac{1}{n}X_S^\top\varepsilon-\Sigma_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon=0. \end{align*} This proves that the least-squares score vanishes along the active coordinates. Because the preceding step placed the oracle active coordinates themselves in the flat SCAD region, the penalty derivative at those coordinates is also zero, so stationarity holds on $S$. For inactive coordinates, the issue is different: $\hat\beta^{\mathrm{or}}_j=0$, so stationarity means the least-squares score must lie inside the subdifferential interval $[-\lambda,\lambda]$ of $p_{\lambda,a}(|\cdot|)$ at zero. Define \begin{align*} g_{S^c}:=\frac{1}{n}X_{S^c}^\top r^{\mathrm{or}}. \end{align*} Substituting the residual expression gives \begin{align*} g_{S^c} = \frac{1}{n}X_{S^c}^\top\varepsilon - \frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon. \end{align*} The first term is controlled directly by the score bound. 
The second term is controlled by multiplying the inactive-active correlation matrix against the active score vector. Using the maximum row-sum matrix norm, \begin{align*} \|g_{S^c}\|_\infty \le \left\|\frac{1}{n}X_{S^c}^\top\varepsilon\right\|_\infty + \left\|\frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}\right\|_\infty \left\|\frac{1}{n}X_S^\top\varepsilon\right\|_\infty. \end{align*} The score bound controls both active and inactive subvectors of $X^\top\varepsilon/n$, and the inactive-active correlation bound controls the matrix factor, so \begin{align*} \|g_{S^c}\|_\infty \le \frac{\eta\lambda}{4} + \frac{1-\eta}{2}\cdot\frac{\eta\lambda}{4} \le \lambda. \end{align*} Hence every inactive coordinate satisfies \begin{align*} \left|\frac{1}{n}X_j^\top r^{\mathrm{or}}\right|\le \lambda. \end{align*} This is exactly the condition that the linear term in an inactive perturbation can be dominated by the SCAD penalty near zero.[/guided]
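
The orthogonality argument above can be checked numerically. The following is a minimal sketch in plain Python; the design matrix, noise vector, and true coefficients are made-up toy values (assumptions, not taken from the proof), and the helper names are hypothetical. It fits least squares on two active columns, confirms that $X_S^\top r^{\mathrm{or}}/n$ vanishes up to rounding, and evaluates the inactive score $g_{S^c}$ for one inactive column.

```python
# Toy check: the oracle residual is orthogonal to the active columns.
# All numbers below are illustrative assumptions.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt] for row in A]

def matvec(A, v):
    return [sum(x * y for x, y in zip(row, v)) for row in A]

def solve_2x2(M, b):
    """Solve M z = b for a 2x2 matrix M by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [
        (b[0] * M[1][1] - M[0][1] * b[1]) / det,
        (M[0][0] * b[1] - M[1][0] * b[0]) / det,
    ]

n = 4
X_S = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # active columns
X_inactive = [[0.5], [-1.0], [0.25], [1.0]]             # one inactive column
beta_star_S = [2.0, -1.0]                               # true active coefficients
eps = [0.1, -0.2, 0.05, 0.15]                           # noise vector

# Y = X_S beta*_S + eps
Y = [xb + e for xb, e in zip(matvec(X_S, beta_star_S), eps)]

XtS = transpose(X_S)
Sigma_S = [[v / n for v in row] for row in matmul(XtS, X_S)]

# Oracle fit on the support: Sigma_S beta = X_S^T Y / n
beta_or_S = solve_2x2(Sigma_S, [v / n for v in matvec(XtS, Y)])

# Check the identity beta_or_S - beta*_S = Sigma_S^{-1} X_S^T eps / n
delta = [b - bs for b, bs in zip(beta_or_S, beta_star_S)]
oracle_formula = solve_2x2(Sigma_S, [v / n for v in matvec(XtS, eps)])

# Residual, active score (should vanish), and inactive score g_{S^c}
r_or = [y - xb for y, xb in zip(Y, matvec(X_S, beta_or_S))]
active_score = [v / n for v in matvec(XtS, r_or)]
g_inactive = [v / n for v in matvec(transpose(X_inactive), r_or)]

print("active score:", active_score)
print("inactive score:", g_inactive)
```

Whether the inactive score lies in $[-\lambda,\lambda]$ depends on the chosen $\lambda$ and on the score and correlation bounds; the toy data only illustrate the quantities involved.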

[step:Use sparse curvature to dominate SCAD concavity]Let $h\in\mathbb R^p$ satisfy \begin{align*} |\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S|\le Ms, \qquad |h|\le \rho, \end{align*} where \begin{align*} 0<\rho\le \rho_0. \end{align*} Set \begin{align*} T:=\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S. \end{align*} Then $\operatorname{supp}(h)\subseteq T$ and $|T|\le Ms$, so the sparse curvature bound gives \begin{align*} \frac{1}{n}|Xh|^2\ge \kappa |h|^2. \end{align*} We use two one-dimensional SCAD lower bounds. First, for every $t\ge 0$, the derivative satisfies \begin{align*} p'_{\lambda,a}(t)\ge \lambda-\frac{t}{a-1}. \end{align*} Indeed, on $0\le t\le\lambda$ this says $\lambda\ge\lambda-t/(a-1)$. On $\lambda<t\le a\lambda$, we compute \begin{align*} \frac{a\lambda-t}{a-1}-\left(\lambda-\frac{t}{a-1}\right)=\frac{\lambda}{a-1}>0, \end{align*} so the inequality holds on the middle branch. On $t>a\lambda$, it says $0\ge\lambda-t/(a-1)$, which follows from $t>a\lambda$ and $a>2$. Hence, for every $j\in S^c$, since $\hat\beta^{\mathrm{or}}_j=0$ and $p_{\lambda,a}(0)=0$, integration over $[0,|h_j|]$ with respect to one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) gives \begin{align*} p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)=p_{\lambda,a}(|h_j|)\ge \lambda |h_j|-\frac{h_j^2}{2(a-1)}. \end{align*} Second, define the even SCAD penalty map $q_{\lambda,a}:\mathbb R\to\mathbb R$ by \begin{align*} q_{\lambda,a}(u):=p_{\lambda,a}(|u|). \end{align*} Define also $F_{\lambda,a}:\mathbb R\to\mathbb R$ by \begin{align*} F_{\lambda,a}(u):=q_{\lambda,a}(u)+\frac{u^2}{2(a-1)}. \end{align*} We verify convexity of $F_{\lambda,a}$ directly from its one-sided derivatives. 
We use the elementary one-dimensional fact that a continuous piecewise $C^1$ function is convex if its one-sided derivative is nondecreasing across the open smooth pieces and the left derivative at each breakpoint is at most the right derivative. On the intervals $(-\infty,-a\lambda)$, $(-a\lambda,-\lambda)$, $(-\lambda,0)$, $(0,\lambda)$, $(\lambda,a\lambda)$, and $(a\lambda,\infty)$, the derivative of $q_{\lambda,a}$ is respectively \begin{align*} 0,\quad -\frac{a\lambda+u}{a-1},\quad -\lambda,\quad \lambda,\quad \frac{a\lambda-u}{a-1},\quad 0. \end{align*} Therefore the derivative of $F_{\lambda,a}$ is respectively \begin{align*} \frac{u}{a-1},\quad -\frac{a\lambda}{a-1},\quad -\lambda+\frac{u}{a-1},\quad \lambda+\frac{u}{a-1},\quad \frac{a\lambda}{a-1},\quad \frac{u}{a-1}. \end{align*} Each displayed formula is nondecreasing on its interval. At the breakpoints $-a\lambda$, $-\lambda$, $0$, $\lambda$, and $a\lambda$, the left one-sided derivative is at most the right one-sided derivative; at $0$ the jump is from $-\lambda$ to $\lambda$, and at the other four breakpoints the two one-sided derivatives coincide. Hence $F'_{\lambda,a}$ is nondecreasing in the one-sided sense on $\mathbb R$, which proves that $F_{\lambda,a}$ is convex. For $j\in S$, the bound $|\hat\beta^{\mathrm{or}}_j|\ge a\lambda$ places $\hat\beta^{\mathrm{or}}_j$ in the flat SCAD region, so $q'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=0$ and therefore \begin{align*} F'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=\frac{\hat\beta^{\mathrm{or}}_j}{a-1}. \end{align*} Convexity of $F_{\lambda,a}$ at the point $\hat\beta^{\mathrm{or}}_j$ gives \begin{align*} F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j+h_j)-F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j) \ge \frac{\hat\beta^{\mathrm{or}}_j}{a-1}h_j. \end{align*} Expanding the definition of $F_{\lambda,a}$ and cancelling the linear term $\hat\beta^{\mathrm{or}}_j h_j/(a-1)$ from both sides yields \begin{align*} p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)\ge -\frac{h_j^2}{2(a-1)}. 
\end{align*} Using $Y-X(\hat\beta^{\mathrm{or}}+h)=r^{\mathrm{or}}-Xh$, the objective difference is \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})=\frac{1}{2n}|Xh|^2-\frac{1}{n}(r^{\mathrm{or}})^\top Xh+\sum_{j=1}^p\left[p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)\right]. \end{align*} Define $g_j:=X_j^\top r^{\mathrm{or}}/n$ for each $j\in\{1,\dots,p\}$. The active linear term vanishes because $X_S^\top r^{\mathrm{or}}/n=0$, while the inactive linear term is bounded below by $-\sum_{j\in S^c}|g_j||h_j|$. Combining the active and inactive penalty lower bounds gives \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})\ge\frac{1}{2n}|Xh|^2-\sum_{j\in S^c}|g_j||h_j|+\sum_{j\in S^c}\lambda |h_j|-\frac{1}{2(a-1)}\sum_{j=1}^p h_j^2. \end{align*} Since $|g_j|\le\lambda$ for every $j\in S^c$, the inactive linear loss is dominated by the inactive SCAD linear gain, and hence \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})\ge\frac{1}{2n}|Xh|^2-\frac{1}{2(a-1)}|h|^2. \end{align*} The sparse curvature bound then gives \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})\ge\frac{1}{2}\left(\kappa-\frac{1}{a-1}\right)|h|^2\ge 0. \end{align*} The last inequality uses $\kappa>1/(a-1)$. Therefore $Q_n(\hat\beta^{\mathrm{or}}+h)\ge Q_n(\hat\beta^{\mathrm{or}})$ for every $h$ such that $\hat\beta^{\mathrm{or}}+h\in\mathcal N_S(\rho)$.[/step]
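
The two one-dimensional SCAD facts used in this step can be spot-checked on a grid. The sketch below is plain Python and assumes the standard three-branch SCAD penalty with $p_{\lambda,a}(0)=0$, which is the antiderivative of the derivative branches $\lambda$, $(a\lambda-t)/(a-1)$, $0$ quoted in the text; the values $\lambda=1$, $a=3.7$ and the function names are illustrative assumptions. A grid check is evidence, not a proof.

```python
# Grid check of the SCAD derivative bound, the integrated bound,
# and midpoint convexity of F(u) = p(|u|) + u^2 / (2 (a - 1)).

def scad_derivative(t, lam, a):
    """SCAD derivative p'(t) for t >= 0 (the value lam is used at t = 0)."""
    if t <= lam:
        return lam
    if t <= a * lam:
        return (a * lam - t) / (a - 1)
    return 0.0

def scad_penalty(t, lam, a):
    """SCAD penalty p(t) for t >= 0: the integral of scad_derivative on [0, t]."""
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def F(u, lam, a):
    """Convexified penalty F(u) = p(|u|) + u^2 / (2 (a - 1))."""
    return scad_penalty(abs(u), lam, a) + u * u / (2 * (a - 1))

lam, a = 1.0, 3.7   # illustrative parameters with a > 2
tol = 1e-12         # slack for floating-point rounding

t_grid = [0.01 * i for i in range(0, 1001)]     # t in [0, 10]
u_grid = [0.1 * i for i in range(-80, 81)]      # u in [-8, 8]

# p'(t) >= lam - t / (a - 1) for t >= 0
derivative_bound_ok = all(
    scad_derivative(t, lam, a) >= lam - t / (a - 1) - tol for t in t_grid
)

# p(t) >= lam * t - t^2 / (2 (a - 1)), the integrated form used on S^c
penalty_bound_ok = all(
    scad_penalty(t, lam, a) >= lam * t - t * t / (2 * (a - 1)) - tol
    for t in t_grid
)

# Midpoint convexity of F on all grid pairs
F_convex_ok = all(
    F((u + v) / 2, lam, a) <= (F(u, lam, a) + F(v, lam, a)) / 2 + tol
    for u in u_grid
    for v in u_grid
)

print(derivative_bound_ok, penalty_bound_ok, F_convex_ok)
```

The added quadratic has curvature $1/(a-1)$, which exactly offsets the most negative slope of $p'_{\lambda,a}$ on the middle branch; this is why $F_{\lambda,a}$ is linear on $(\lambda,a\lambda)$ rather than strictly convex there.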

[guided]The point of this step is to compare the positive quadratic curvature of the least-squares loss with the possible negative curvature of the SCAD penalty. Let $h\in\mathbb R^p$ satisfy \begin{align*} |\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S|\le Ms, \qquad |h|\le \rho. \end{align*} Define \begin{align*} T:=\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S. \end{align*} Then $\operatorname{supp}(h)\subseteq T$ and $|T|\le Ms$, so the sparse curvature hypothesis applies to $h$ and gives \begin{align*} \frac{1}{n}|Xh|^2\ge \kappa |h|^2. \end{align*} For inactive coordinates $j\in S^c$, we have $\hat\beta^{\mathrm{or}}_j=0$. Let $\mathcal L^1$ denote one-dimensional Lebesgue measure on $\mathbb R$. The SCAD derivative lower bound $p'_{\lambda,a}(t)\ge \lambda-t/(a-1)$ for $t\ge0$ gives, after integrating over $[0,|h_j|]$ with respect to $\mathcal L^1$, \begin{align*} p_{\lambda,a}(|h_j|)-p_{\lambda,a}(0) \ge \lambda |h_j|-\frac{h_j^2}{2(a-1)}. \end{align*} For active coordinates, define $q_{\lambda,a}:\mathbb R\to\mathbb R$ by $q_{\lambda,a}(u):=p_{\lambda,a}(|u|)$ and define $F_{\lambda,a}:\mathbb R\to\mathbb R$ by \begin{align*} F_{\lambda,a}(u):=q_{\lambda,a}(u)+\frac{u^2}{2(a-1)}. \end{align*} We now justify the convexity assertion rather than treating it as a black box. The elementary criterion we use is this: a continuous piecewise $C^1$ function on $\mathbb R$ is convex if its one-sided derivative is nondecreasing on each smooth interval and the left derivative at every breakpoint is at most the right derivative. On the intervals $(-\infty,-a\lambda)$, $(-a\lambda,-\lambda)$, $(-\lambda,0)$, $(0,\lambda)$, $(\lambda,a\lambda)$, and $(a\lambda,\infty)$, the derivative of $q_{\lambda,a}$ is respectively \begin{align*} 0,\quad -\frac{a\lambda+u}{a-1},\quad -\lambda,\quad \lambda,\quad \frac{a\lambda-u}{a-1},\quad 0. 
\end{align*} Adding the derivative of $u^2/(2(a-1))$ gives the corresponding derivative values of $F_{\lambda,a}$: \begin{align*} \frac{u}{a-1},\quad -\frac{a\lambda}{a-1},\quad -\lambda+\frac{u}{a-1},\quad \lambda+\frac{u}{a-1},\quad \frac{a\lambda}{a-1},\quad \frac{u}{a-1}. \end{align*} Each expression is nondecreasing on its own interval. Checking the junctions $-a\lambda$, $-\lambda$, $0$, $\lambda$, and $a\lambda$, the left derivative never exceeds the right derivative; the only jump, at $0$, goes upward from $-\lambda$ to $\lambda$. Thus $F'_{\lambda,a}$ is nondecreasing in the one-sided derivative sense, so the criterion applies and $F_{\lambda,a}$ is convex. Since $|\hat\beta^{\mathrm{or}}_j|\ge a\lambda$ for $j\in S$, the SCAD part is flat at $\hat\beta^{\mathrm{or}}_j$, hence $q'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=0$ and \begin{align*} F'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=\frac{\hat\beta^{\mathrm{or}}_j}{a-1}. \end{align*} Convexity gives \begin{align*} F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j+h_j)-F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j) \ge \frac{\hat\beta^{\mathrm{or}}_j}{a-1}h_j. \end{align*} After expanding $F_{\lambda,a}$ and cancelling the linear term $\hat\beta^{\mathrm{or}}_j h_j/(a-1)$ from both sides, this is exactly \begin{align*} p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|) \ge -\frac{h_j^2}{2(a-1)}. \end{align*} Now expand the objective. Since $Y-X(\hat\beta^{\mathrm{or}}+h)=r^{\mathrm{or}}-Xh$, \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})=\frac{1}{2n}|Xh|^2-\frac{1}{n}(r^{\mathrm{or}})^\top Xh+\sum_{j=1}^p\left[p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)\right]. \end{align*} Define $g_j:=X_j^\top r^{\mathrm{or}}/n$ for $j\in\{1,\dots,p\}$. The active normal equations give $g_j=0$ for $j\in S$, while the previous stationarity step gives $|g_j|\le\lambda$ for $j\in S^c$. 
Combining the penalty bounds with the objective expansion gives \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}}) \ge \frac{1}{2n}|Xh|^2-\sum_{j\in S^c}|g_j||h_j|+\sum_{j\in S^c}\lambda |h_j|-\frac{1}{2(a-1)}|h|^2. \end{align*} Because $|g_j|\le\lambda$ on $S^c$, the inactive linear loss is dominated by the inactive SCAD linear gain. Hence \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}}) \ge \frac{1}{2n}|Xh|^2-\frac{1}{2(a-1)}|h|^2. \end{align*} Finally the sparse curvature bound gives \begin{align*} Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}}) \ge \frac{1}{2}\left(\kappa-\frac{1}{a-1}\right)|h|^2 \ge 0, \end{align*} because $\kappa>1/(a-1)$. Thus every admissible sparse perturbation has nonnegative objective increase.[/guided]
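
The objective expansion and the active-coordinate penalty bound can also be checked numerically. The sketch below is plain Python and assumes the objective $Q_n(\beta)=\frac{1}{2n}|Y-X\beta|^2+\sum_j p_{\lambda,a}(|\beta_j|)$ with the standard three-branch SCAD penalty, which is consistent with the expansion in the text; the data, base point, perturbation, and function names are made-up illustrations. The expansion is an algebraic identity, so it is tested at an arbitrary base point rather than at an actual oracle fit.

```python
# Check (1) the expansion of Q_n(beta + h) - Q_n(beta) and
# (2) p(|b + h|) - p(|b|) >= -h^2 / (2 (a - 1)) whenever |b| >= a * lam.

def scad_penalty(t, lam, a):
    """Standard SCAD penalty p(t) for t >= 0, with p(0) = 0."""
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def matvec(A, v):
    return [sum(x * y for x, y in zip(row, v)) for row in A]

def objective(X, Y, beta, lam, a):
    """Q_n(beta) = |Y - X beta|^2 / (2n) + sum_j p(|beta_j|)."""
    n = len(Y)
    resid = [y - xb for y, xb in zip(Y, matvec(X, beta))]
    loss = sum(r * r for r in resid) / (2 * n)
    return loss + sum(scad_penalty(abs(b), lam, a) for b in beta)

lam, a = 0.5, 3.7                      # illustrative parameters
X = [[1.0, 0.0, 0.5],
     [1.0, 1.0, -1.0],
     [1.0, 2.0, 0.25],
     [1.0, 3.0, 1.0]]
Y = [2.1, 0.8, 0.05, -0.85]
beta = [2.0, -1.9, 0.0]                # arbitrary base point (toy values)
h = [0.1, -0.05, 0.2]                  # arbitrary perturbation
n = len(Y)

# Left side: direct difference of objective values
beta_plus_h = [b + d for b, d in zip(beta, h)]
lhs = objective(X, Y, beta_plus_h, lam, a) - objective(X, Y, beta, lam, a)

# Right side: quadratic term, linear term, and penalty differences
r = [y - xb for y, xb in zip(Y, matvec(X, beta))]
Xh = matvec(X, h)
quadratic = sum(v * v for v in Xh) / (2 * n)
linear = sum(ri * vi for ri, vi in zip(r, Xh)) / n
penalty_diff = sum(
    scad_penalty(abs(b + d), lam, a) - scad_penalty(abs(b), lam, a)
    for b, d in zip(beta, h)
)
rhs = quadratic - linear + penalty_diff

# Active-coordinate bound on a grid of perturbations, for |b| >= a * lam
h_grid = [0.01 * i for i in range(-500, 501)]
active_points = [a * lam, a * lam + 0.5, -a * lam, -2 * a * lam]
active_bound_ok = all(
    scad_penalty(abs(b + d), lam, a) - scad_penalty(abs(b), lam, a)
    >= -d * d / (2 * (a - 1)) - 1e-9
    for b in active_points
    for d in h_grid
)

print(abs(lhs - rhs), active_bound_ok)
```

At $b=a\lambda$ the active bound is attained with equality for perturbations that land on the middle SCAD branch, so the small tolerance in the comparison is needed to absorb rounding.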
