Oracle Property for the SCAD Penalized Least-Squares Estimator (Theorem # 5580)
Theorem
Consider the fixed-design linear model $Y=X\beta^*+\varepsilon$, where $\beta^*$ has support $S$ of size $s$, and let $\Sigma_S=X_S^\top X_S/n$. Consider the SCAD criterion with parameters $\lambda>0$ and $a>2$. Assume the columns of $X$ are normalised so that $|X_j|^2/n=1$, where $|\cdot|$ denotes the Euclidean norm, and that
\begin{align*}
\lambda_{\min}(\Sigma_S)\ge \kappa_S>0.
\end{align*}
For a vector, let $\|\cdot\|_\infty$ denote the maximum-entry norm, and for a matrix $A=(A_{ij})$, let $\|A\|_\infty:=\max_i\sum_j |A_{ij}|$ denote the maximum row-sum operator norm. Fix constants $\eta\in(0,1)$ and $M\ge 2$. Suppose that, on an event $\mathcal E_n$ with $\mathbb P(\mathcal E_n)\ge 1-\delta_n$, the score bound
\begin{align*}
\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty\le \frac{\eta\lambda}{4},
\end{align*}
the oracle estimation bound
\begin{align*}
\left\|\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon\right\|_\infty\le r_n,
\end{align*}
the inactive-active correlation bound
\begin{align*}
\left\|\frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}\right\|_\infty\le \frac{1-\eta}{2},
\end{align*}
and the sparse curvature bound
\begin{align*}
\frac{1}{n}|X\delta|^2\ge \kappa |\delta|^2
\end{align*}
all hold, the last for every $\delta\in\mathbb R^p$ supported on a set of size at most $M s$, where $\kappa>1/(a-1)$. If the active coefficients satisfy the beta-min condition
\begin{align*}
\min_{j\in S}|\beta_j^*|\ge a\lambda+r_n,
\end{align*}
then there exists $\rho>0$ such that, on $\mathcal E_n$, the oracle estimator $\hat\beta^{\mathrm{or}}$ is a local minimiser of the SCAD criterion relative to the sparse neighbourhood
\begin{align*}
\mathcal N_S(\rho)
=\{\beta\in\mathbb R^p:|\operatorname{supp}(\beta)\cup S|\le Ms
\text{ and }|\beta-\hat\beta^{\mathrm{or}}|\le \rho\}.
\end{align*}
Consequently, with probability at least $1-\delta_n$, the oracle estimator $\hat\beta^{\mathrm{or}}$ is a sparse local minimiser of the SCAD criterion.
Probability & Statistics
Discussion
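This theorem gives conditions, holding on a high-probability event, under which the oracle estimator is a sparse local minimiser of the SCAD criterion. The beta-min condition places the active coordinates in the flat region of the penalty, score control yields stationarity on the inactive coordinates, and sparse curvature dominates the concavity of SCAD; together these yield the oracle property for this nonconvex regularisation.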
Proof
[proofplan]
We work on the event $\mathcal E_n$. First we use the oracle normal equations to show that every active oracle coordinate lies in the flat region of the SCAD penalty, so the penalty contributes no active-coordinate first-order term near the oracle estimator. Next we compute the oracle residual and use the score and inactive-active correlation bounds to verify the zero-coordinate stationarity inequality on $S^c$. Finally we compare the SCAD objective at $\hat\beta^{\mathrm{or}}+h$ and at $\hat\beta^{\mathrm{or}}$ for every sufficiently small sparse perturbation $h$, using the sparse curvature lower bound to dominate the maximum concavity of SCAD.
[/proofplan]
[step:Define the SCAD objective and the oracle estimator]
Fix an outcome in $\mathcal E_n$. We first justify that $\Sigma_S\in\mathbb R^{s\times s}$ is invertible from the sparse curvature hypothesis. If $v\in\mathbb R^s$ is nonzero and $\delta\in\mathbb R^p$ is the vector with $\delta_S=v$ and $\delta_{S^c}=0$, then $\delta$ is supported on $S$, hence on a set of size at most $Ms$. Therefore
\begin{align*}
v^\top\Sigma_S v
=\frac{1}{n}|X_Sv|^2
=\frac{1}{n}|X\delta|^2
\ge \kappa |\delta|^2
=\kappa |v|^2
>0.
\end{align*}
Thus $\Sigma_S$ is positive definite and hence invertible.
Define the SCAD penalty $p_{\lambda,a}:[0,\infty)\to[0,\infty)$ by $p_{\lambda,a}(0)=0$ and by the one-sided derivative formula
\begin{align*}
p'_{\lambda,a}(t)=\lambda \quad\text{for }0<t\le \lambda.
\end{align*}
For $\lambda<t\le a\lambda$, define
\begin{align*}
p'_{\lambda,a}(t)=\frac{a\lambda-t}{a-1}.
\end{align*}
For $t>a\lambda$, define
\begin{align*}
p'_{\lambda,a}(t)=0.
\end{align*}
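As a reference sketch (a routine integration of the three derivative branches above starting from $p_{\lambda,a}(0)=0$, not an additional assumption), the penalty itself has the closed form
\begin{align*}
p_{\lambda,a}(t)=
\begin{cases}
\lambda t, & 0\le t\le\lambda,\\
\dfrac{2a\lambda t-t^2-\lambda^2}{2(a-1)}, & \lambda<t\le a\lambda,\\
\dfrac{(a+1)\lambda^2}{2}, & t>a\lambda.
\end{cases}
\end{align*}
The branches agree at $t=\lambda$ (value $\lambda^2$) and at $t=a\lambda$ (value $(a+1)\lambda^2/2$), so $p_{\lambda,a}$ is continuous and is constant beyond $a\lambda$.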
Define the SCAD objective $Q_n:\mathbb R^p\to\mathbb R$ by
\begin{align*}
Q_n(\beta):=\frac{1}{2n}|Y-X\beta|^2+\sum_{j=1}^p p_{\lambda,a}(|\beta_j|).
\end{align*}
Define the oracle estimator $\hat\beta^{\mathrm{or}}\in\mathbb R^p$ by setting its active coordinates to
\begin{align*}
\hat\beta^{\mathrm{or}}_S
:=\beta^*_S+\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
Set its inactive coordinates to
\begin{align*}
\hat\beta^{\mathrm{or}}_{S^c}:=0.
\end{align*}
[/step]
[step:Place the oracle active coordinates in the flat SCAD region]
Define the active oracle error vector $e_S\in\mathbb R^s$ by
\begin{align*}
e_S:=\hat\beta^{\mathrm{or}}_S-\beta^*_S
=
\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
By the oracle estimation bound,
\begin{align*}
\|e_S\|_\infty\le r_n.
\end{align*}
For every $j\in S$, the beta-min condition gives
\begin{align*}
|\hat\beta^{\mathrm{or}}_j|
\ge |\beta_j^*|-|\hat\beta^{\mathrm{or}}_j-\beta_j^*|
\ge a\lambda+r_n-r_n
=
a\lambda.
\end{align*}
Thus every active oracle coordinate lies at or beyond the SCAD flat threshold. At the endpoint $t=a\lambda$, the middle branch gives $p'_{\lambda,a}(a\lambda)=(a\lambda-a\lambda)/(a-1)=0$, while for $t>a\lambda$ the final branch gives $p'_{\lambda,a}(t)=0$. Hence $p'_{\lambda,a}(t)=0$ for every $t\ge a\lambda$.
Choose
\begin{align*}
\rho_0:=\frac{\lambda}{2}.
\end{align*}
Then $\rho_0>0$. The argument below will not require the active-coordinate penalty to remain constant throughout the whole neighbourhood; it only uses that each active oracle coordinate itself lies in the flat SCAD region and then controls any possible movement into the concave region by the global one-dimensional SCAD curvature bound.
[/step]
[step:Verify active normal equations and inactive stationarity]
Define the residual vector $r^{\mathrm{or}}\in\mathbb R^n$ by
\begin{align*}
r^{\mathrm{or}}:=Y-X\hat\beta^{\mathrm{or}}.
\end{align*}
Since $Y=X_S\beta^*_S+\varepsilon$ and $\hat\beta^{\mathrm{or}}_{S^c}=0$,
\begin{align*}
r^{\mathrm{or}}
=
\varepsilon-X_S(\hat\beta^{\mathrm{or}}_S-\beta^*_S)
=
\varepsilon-X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
The active normal equations follow from the definition of $\Sigma_S$. First,
\begin{align*}
\frac{1}{n}X_S^\top r^{\mathrm{or}}
=
\frac{1}{n}X_S^\top\varepsilon
-
\frac{1}{n}X_S^\top X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
Since $X_S^\top X_S/n=\Sigma_S$, this becomes
\begin{align*}
\frac{1}{n}X_S^\top r^{\mathrm{or}}
=
\frac{1}{n}X_S^\top\varepsilon
-
\Sigma_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon
=
0.
\end{align*}
For inactive coordinates, define $g_{S^c}\in\mathbb R^{p-s}$ by
\begin{align*}
g_{S^c}:=\frac{1}{n}X_{S^c}^\top r^{\mathrm{or}}.
\end{align*}
Using the residual identity,
\begin{align*}
g_{S^c}
=
\frac{1}{n}X_{S^c}^\top\varepsilon
-
\frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
Taking the maximum norm and using the score bound and the inactive-active correlation bound gives
\begin{align*}
\|g_{S^c}\|_\infty
\le
\left\|\frac{1}{n}X_{S^c}^\top\varepsilon\right\|_\infty
+
\left\|\frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}\right\|_\infty
\left\|\frac{1}{n}X_S^\top\varepsilon\right\|_\infty.
\end{align*}
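The second term uses the standard compatibility of the maximum row-sum norm with the maximum-entry norm. As a brief check, for a generic matrix $A$ and vector $v$ of matching dimensions (symbols introduced only for this illustration),
\begin{align*}
\|Av\|_\infty
=\max_i\Bigl|\sum_j A_{ij}v_j\Bigr|
\le \max_i\sum_j |A_{ij}|\,\max_k|v_k|
=\|A\|_\infty\|v\|_\infty,
\end{align*}
applied here with $A=\frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}$ and $v=\frac{1}{n}X_S^\top\varepsilon$.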
Therefore
\begin{align*}
\|g_{S^c}\|_\infty
\le
\frac{\eta\lambda}{4}
+
\frac{1-\eta}{2}\cdot\frac{\eta\lambda}{4}
\le \lambda.
\end{align*}
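The final inequality in fact holds with room to spare. A short worked computation, using only $\eta\in(0,1)$:
\begin{align*}
\frac{\eta\lambda}{4}+\frac{1-\eta}{2}\cdot\frac{\eta\lambda}{4}
=\frac{\eta\lambda}{4}\cdot\frac{3-\eta}{2}
=\frac{\eta(3-\eta)}{8}\lambda
<\frac{\lambda}{4},
\end{align*}
since $\eta(3-\eta)-2=-(1-\eta)(2-\eta)<0$ for $\eta\in(0,1)$.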
Thus
\begin{align*}
\left|\frac{1}{n}X_j^\top r^{\mathrm{or}}\right|\le \lambda
\end{align*}
for every $j\in S^c$, which is the SCAD zero-coordinate stationarity inequality.
[guided]
The oracle estimator is obtained by fitting least squares on the true support $S$ and setting all inactive coordinates to zero. The first thing to check is therefore that the least-squares residual is orthogonal to the active columns. We define
\begin{align*}
r^{\mathrm{or}}:=Y-X\hat\beta^{\mathrm{or}}.
\end{align*}
Using $Y=X_S\beta^*_S+\varepsilon$ and $\hat\beta^{\mathrm{or}}_{S^c}=0$, we get
\begin{align*}
r^{\mathrm{or}}
=
\varepsilon-X_S(\hat\beta^{\mathrm{or}}_S-\beta^*_S).
\end{align*}
The oracle definition gives
\begin{align*}
\hat\beta^{\mathrm{or}}_S-\beta^*_S
=
\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon,
\end{align*}
so
\begin{align*}
r^{\mathrm{or}}
=
\varepsilon-X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
Now multiply by $X_S^\top/n$:
\begin{align*}
\frac{1}{n}X_S^\top r^{\mathrm{or}}=\frac{1}{n}X_S^\top\varepsilon-\frac{1}{n}X_S^\top X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
Since $X_S^\top X_S/n=\Sigma_S$, this becomes
\begin{align*}
\frac{1}{n}X_S^\top r^{\mathrm{or}}=\frac{1}{n}X_S^\top\varepsilon-\Sigma_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon=0.
\end{align*}
This proves stationarity along active coordinates. Because the preceding step placed the oracle active coordinates themselves in the flat SCAD region, the penalty derivative at those coordinates is also zero.
For inactive coordinates, the issue is different: $\hat\beta^{\mathrm{or}}_j=0$, so stationarity means the least-squares score must lie inside the subdifferential interval $[-\lambda,\lambda]$ of $p_{\lambda,a}(|\cdot|)$ at zero. Define
\begin{align*}
g_{S^c}:=\frac{1}{n}X_{S^c}^\top r^{\mathrm{or}}.
\end{align*}
Substituting the residual expression gives
\begin{align*}
g_{S^c}
=
\frac{1}{n}X_{S^c}^\top\varepsilon
-
\frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}\frac{1}{n}X_S^\top\varepsilon.
\end{align*}
The first term is controlled directly by the score bound. The second term is controlled by multiplying the inactive-active correlation matrix against the active score vector. Using the maximum row-sum matrix norm,
\begin{align*}
\|g_{S^c}\|_\infty
\le
\left\|\frac{1}{n}X_{S^c}^\top\varepsilon\right\|_\infty
+
\left\|\frac{1}{n}X_{S^c}^\top X_S\Sigma_S^{-1}\right\|_\infty
\left\|\frac{1}{n}X_S^\top\varepsilon\right\|_\infty.
\end{align*}
The score bound controls both active and inactive subvectors of $X^\top\varepsilon/n$, and the inactive-active correlation bound controls the matrix factor, so
\begin{align*}
\|g_{S^c}\|_\infty
\le
\frac{\eta\lambda}{4}
+
\frac{1-\eta}{2}\cdot\frac{\eta\lambda}{4}
\le \lambda.
\end{align*}
Hence every inactive coordinate satisfies
\begin{align*}
\left|\frac{1}{n}X_j^\top r^{\mathrm{or}}\right|\le \lambda.
\end{align*}
This is exactly the condition that the linear term in an inactive perturbation can be dominated by the SCAD penalty near zero.
[/guided]
[/step]
[step:Use sparse curvature to dominate SCAD concavity]
Let $h\in\mathbb R^p$ satisfy
\begin{align*}
|\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S|\le Ms,
\qquad
|h|\le \rho,
\end{align*}
where
\begin{align*}
0<\rho\le \rho_0.
\end{align*}
Set
\begin{align*}
T:=\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S.
\end{align*}
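Since $\hat\beta^{\mathrm{or}}$ is supported on $S$, every $j\notin S$ with $h_j\ne 0$ lies in $\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)$. Hence $\operatorname{supp}(h)\subseteq T$ and $|T|\le Ms$, so the sparse curvature bound gives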
\begin{align*}
\frac{1}{n}|Xh|^2\ge \kappa |h|^2.
\end{align*}
We use two one-dimensional SCAD lower bounds. First, for every $t\ge 0$, the derivative satisfies
\begin{align*}
p'_{\lambda,a}(t)\ge \lambda-\frac{t}{a-1}.
\end{align*}
Indeed, on $0\le t\le\lambda$ this says $\lambda\ge\lambda-t/(a-1)$. On $\lambda<t\le a\lambda$, we compute
\begin{align*}
\frac{a\lambda-t}{a-1}-\left(\lambda-\frac{t}{a-1}\right)=\frac{\lambda}{a-1}>0,
\end{align*}
so the inequality holds on the middle branch. On $t>a\lambda$, it says $0\ge\lambda-t/(a-1)$, which holds because $t/(a-1)>a\lambda/(a-1)>\lambda$. Hence, for every $j\in S^c$, since $\hat\beta^{\mathrm{or}}_j=0$ and $p_{\lambda,a}(0)=0$, integration over $[0,|h_j|]$ with respect to one-dimensional Lebesgue measure gives
\begin{align*}
p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)=p_{\lambda,a}(|h_j|)\ge \lambda |h_j|-\frac{h_j^2}{2(a-1)}.
\end{align*}
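Spelled out, the integration step reads as follows; this is a sketch of the computation already used above, with no new assumptions.
\begin{align*}
p_{\lambda,a}(|h_j|)
=\int_0^{|h_j|}p'_{\lambda,a}(t)\,dt
\ge\int_0^{|h_j|}\Bigl(\lambda-\frac{t}{a-1}\Bigr)dt
=\lambda|h_j|-\frac{h_j^2}{2(a-1)}.
\end{align*}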
Second, define the even SCAD penalty map $q_{\lambda,a}:\mathbb R\to\mathbb R$ by
\begin{align*}
q_{\lambda,a}(u):=p_{\lambda,a}(|u|).
\end{align*}
Define also $F_{\lambda,a}:\mathbb R\to\mathbb R$ by
\begin{align*}
F_{\lambda,a}(u):=q_{\lambda,a}(u)+\frac{u^2}{2(a-1)}.
\end{align*}
We verify convexity of $F_{\lambda,a}$ directly from its one-sided derivatives. We use the elementary one-dimensional fact that a continuous piecewise $C^1$ function is convex if its one-sided derivative is nondecreasing across the open smooth pieces and the left derivative at each breakpoint is at most the right derivative. On the intervals $(-\infty,-a\lambda)$, $(-a\lambda,-\lambda)$, $(-\lambda,0)$, $(0,\lambda)$, $(\lambda,a\lambda)$, and $(a\lambda,\infty)$, the derivative of $q_{\lambda,a}$ is respectively
\begin{align*}
0,\quad -\frac{a\lambda+u}{a-1},\quad -\lambda,\quad \lambda,\quad \frac{a\lambda-u}{a-1},\quad 0.
\end{align*}
Therefore the derivative of $F_{\lambda,a}$ is respectively
\begin{align*}
\frac{u}{a-1},\quad -\frac{a\lambda}{a-1},\quad -\lambda+\frac{u}{a-1},\quad \lambda+\frac{u}{a-1},\quad \frac{a\lambda}{a-1},\quad \frac{u}{a-1}.
\end{align*}
Each displayed formula is nondecreasing on its interval. At the breakpoints $-a\lambda$, $-\lambda$, $0$, $\lambda$, and $a\lambda$, the left one-sided derivative is at most the right one-sided derivative; at $0$ the jump is from $-\lambda$ to $\lambda$. Hence $F'_{\lambda,a}$ is nondecreasing in the one-sided sense on $\mathbb R$, which proves that $F_{\lambda,a}$ is convex. For $j\in S$, the bound $|\hat\beta^{\mathrm{or}}_j|\ge a\lambda$ places $\hat\beta^{\mathrm{or}}_j$ in the flat SCAD region, so $q'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=0$ and therefore
\begin{align*}
F'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=\frac{\hat\beta^{\mathrm{or}}_j}{a-1}.
\end{align*}
Convexity of $F_{\lambda,a}$ at the point $\hat\beta^{\mathrm{or}}_j$ gives
\begin{align*}
F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j+h_j)-F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)
\ge
\frac{\hat\beta^{\mathrm{or}}_j}{a-1}h_j.
\end{align*}
Expanding the definition of $F_{\lambda,a}$ and cancelling the linear term $\hat\beta^{\mathrm{or}}_j h_j/(a-1)$ from both sides yields
\begin{align*}
p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)\ge -\frac{h_j^2}{2(a-1)}.
\end{align*}
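For completeness, the cancellation can be written out. Abbreviating $b:=\hat\beta^{\mathrm{or}}_j$ and $c:=1/(a-1)$ (shorthand used only in this display), the convexity inequality is
\begin{align*}
q_{\lambda,a}(b+h_j)+\frac{c}{2}(b+h_j)^2-q_{\lambda,a}(b)-\frac{c}{2}b^2\ge c\,b\,h_j.
\end{align*}
Since $\frac{c}{2}(b+h_j)^2-\frac{c}{2}b^2=c\,b\,h_j+\frac{c}{2}h_j^2$, the term $c\,b\,h_j$ appears on both sides and cancels, leaving
\begin{align*}
q_{\lambda,a}(b+h_j)-q_{\lambda,a}(b)\ge-\frac{c}{2}h_j^2=-\frac{h_j^2}{2(a-1)}.
\end{align*}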
Using $Y-X(\hat\beta^{\mathrm{or}}+h)=r^{\mathrm{or}}-Xh$, the objective difference is
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})=\frac{1}{2n}|Xh|^2-\frac{1}{n}(r^{\mathrm{or}})^\top Xh+\sum_{j=1}^p\left[p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)\right].
\end{align*}
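The least-squares part of this identity comes from expanding the square; as a quick check, with $r^{\mathrm{or}}=Y-X\hat\beta^{\mathrm{or}}$ as defined earlier,
\begin{align*}
\frac{1}{2n}|r^{\mathrm{or}}-Xh|^2-\frac{1}{2n}|r^{\mathrm{or}}|^2
=\frac{1}{2n}|Xh|^2-\frac{1}{n}(r^{\mathrm{or}})^\top Xh.
\end{align*}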
Define $g_j:=X_j^\top r^{\mathrm{or}}/n$ for each $j\in\{1,\dots,p\}$. The active linear term vanishes because $X_S^\top r^{\mathrm{or}}/n=0$, while the inactive linear term is bounded below by $-\sum_{j\in S^c}|g_j||h_j|$. Combining the active and inactive penalty lower bounds gives
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})\ge\frac{1}{2n}|Xh|^2-\sum_{j\in S^c}|g_j||h_j|+\sum_{j\in S^c}\lambda |h_j|-\frac{1}{2(a-1)}\sum_{j=1}^p h_j^2.
\end{align*}
Since $|g_j|\le\lambda$ for every $j\in S^c$, the inactive linear loss is dominated by the inactive SCAD linear gain, and hence
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})\ge\frac{1}{2n}|Xh|^2-\frac{1}{2(a-1)}|h|^2.
\end{align*}
The sparse curvature bound then gives
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})\ge\frac{1}{2}\left(\kappa-\frac{1}{a-1}\right)|h|^2\ge 0.
\end{align*}
The last inequality uses $\kappa>1/(a-1)$. Therefore $Q_n(\hat\beta^{\mathrm{or}}+h)\ge Q_n(\hat\beta^{\mathrm{or}})$ for every $h$ such that $\hat\beta^{\mathrm{or}}+h\in\mathcal N_S(\rho)$.
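To give a sense of scale, here is an illustrative instance; the values $a=3.7$ (a commonly used default for SCAD) and $\kappa=0.5$ are assumptions made for this example only and are not part of the theorem. The curvature requirement then becomes
\begin{align*}
\kappa>\frac{1}{a-1}=\frac{1}{2.7}\approx 0.370,
\end{align*}
which $\kappa=0.5$ satisfies, and the final bound reads $Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})\ge\frac{1}{2}\left(0.5-\frac{1}{2.7}\right)|h|^2\approx 0.065\,|h|^2$.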
[guided]
The point of this step is to compare the positive quadratic curvature of the least-squares loss with the possible negative curvature of the SCAD penalty. Let $h\in\mathbb R^p$ satisfy
\begin{align*}
|\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S|\le Ms,
\qquad
|h|\le \rho.
\end{align*}
Define
\begin{align*}
T:=\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)\cup S.
\end{align*}
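Since $\hat\beta^{\mathrm{or}}$ is supported on $S$, every $j\notin S$ with $h_j\ne 0$ lies in $\operatorname{supp}(\hat\beta^{\mathrm{or}}+h)$. Hence $\operatorname{supp}(h)\subseteq T$ and $|T|\le Ms$, so the sparse curvature hypothesis applies to $h$ and gives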
\begin{align*}
\frac{1}{n}|Xh|^2\ge \kappa |h|^2.
\end{align*}
For inactive coordinates $j\in S^c$, we have $\hat\beta^{\mathrm{or}}_j=0$. Let $\mathcal L^1$ denote one-dimensional Lebesgue measure on $\mathbb R$. The SCAD derivative lower bound $p'_{\lambda,a}(t)\ge \lambda-t/(a-1)$ for $t\ge0$ gives, after integrating over $[0,|h_j|]$ with respect to $\mathcal L^1$,
\begin{align*}
p_{\lambda,a}(|h_j|)-p_{\lambda,a}(0)
\ge
\lambda |h_j|-\frac{h_j^2}{2(a-1)}.
\end{align*}
For active coordinates, define $q_{\lambda,a}:\mathbb R\to\mathbb R$ by $q_{\lambda,a}(u):=p_{\lambda,a}(|u|)$ and define $F_{\lambda,a}:\mathbb R\to\mathbb R$ by
\begin{align*}
F_{\lambda,a}(u):=q_{\lambda,a}(u)+\frac{u^2}{2(a-1)}.
\end{align*}
We now justify the convexity assertion rather than treating it as a black box. The elementary criterion we use is this: a continuous piecewise $C^1$ function on $\mathbb R$ is convex if its one-sided derivative is nondecreasing on each smooth interval and the left derivative at every breakpoint is at most the right derivative. On the intervals $(-\infty,-a\lambda)$, $(-a\lambda,-\lambda)$, $(-\lambda,0)$, $(0,\lambda)$, $(\lambda,a\lambda)$, and $(a\lambda,\infty)$, the derivative of $q_{\lambda,a}$ is respectively
\begin{align*}
0,\quad -\frac{a\lambda+u}{a-1},\quad -\lambda,\quad \lambda,\quad \frac{a\lambda-u}{a-1},\quad 0.
\end{align*}
Adding the derivative of $u^2/(2(a-1))$ gives the corresponding derivative values of $F_{\lambda,a}$:
\begin{align*}
\frac{u}{a-1},\quad -\frac{a\lambda}{a-1},\quad -\lambda+\frac{u}{a-1},\quad \lambda+\frac{u}{a-1},\quad \frac{a\lambda}{a-1},\quad \frac{u}{a-1}.
\end{align*}
Each expression is nondecreasing on its own interval. Checking the junctions $-a\lambda$, $-\lambda$, $0$, $\lambda$, and $a\lambda$, the left derivative never exceeds the right derivative; the only jump at $0$ goes upward from $-\lambda$ to $\lambda$. Thus $F'_{\lambda,a}$ is nondecreasing in the one-sided derivative sense, which is the elementary one-dimensional convexity criterion used here. Hence $F_{\lambda,a}$ is convex. Since $|\hat\beta^{\mathrm{or}}_j|\ge a\lambda$ for $j\in S$, the SCAD part is flat at $\hat\beta^{\mathrm{or}}_j$, hence $q'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=0$ and
\begin{align*}
F'_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)=\frac{\hat\beta^{\mathrm{or}}_j}{a-1}.
\end{align*}
Convexity gives
\begin{align*}
F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j+h_j)-F_{\lambda,a}(\hat\beta^{\mathrm{or}}_j)
\ge
\frac{\hat\beta^{\mathrm{or}}_j}{a-1}h_j.
\end{align*}
After expanding $F_{\lambda,a}$, this is exactly
\begin{align*}
p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)
\ge
-\frac{h_j^2}{2(a-1)}.
\end{align*}
Now expand the objective. Since $Y-X(\hat\beta^{\mathrm{or}}+h)=r^{\mathrm{or}}-Xh$,
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})=\frac{1}{2n}|Xh|^2-\frac{1}{n}(r^{\mathrm{or}})^\top Xh+\sum_{j=1}^p\left[p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j+h_j|)-p_{\lambda,a}(|\hat\beta^{\mathrm{or}}_j|)\right].
\end{align*}
Define $g_j:=X_j^\top r^{\mathrm{or}}/n$ for $j\in\{1,\dots,p\}$. The active normal equations give $g_j=0$ for $j\in S$, while the previous stationarity step gives $|g_j|\le\lambda$ for $j\in S^c$. Combining the penalty bounds with the objective expansion gives
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})
\ge
\frac{1}{2n}|Xh|^2-\sum_{j\in S^c}|g_j||h_j|+\sum_{j\in S^c}\lambda |h_j|-\frac{1}{2(a-1)}|h|^2.
\end{align*}
Because $|g_j|\le\lambda$ on $S^c$, the inactive linear loss is dominated by the inactive SCAD linear gain. Hence
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})
\ge
\frac{1}{2n}|Xh|^2-\frac{1}{2(a-1)}|h|^2.
\end{align*}
Finally the sparse curvature bound gives
\begin{align*}
Q_n(\hat\beta^{\mathrm{or}}+h)-Q_n(\hat\beta^{\mathrm{or}})
\ge
\frac{1}{2}\left(\kappa-\frac{1}{a-1}\right)|h|^2
\ge 0,
\end{align*}
because $\kappa>1/(a-1)$. Thus every admissible sparse perturbation has nonnegative objective increase.
[/guided]
[/step]
[step:Conclude sparse local minimality and the probability statement]
Choose any
\begin{align*}
\rho\in(0,\rho_0].
\end{align*}
The preceding step proves that, on $\mathcal E_n$,
\begin{align*}
Q_n(\beta)\ge Q_n(\hat\beta^{\mathrm{or}})
\end{align*}
for every $\beta\in\mathcal N_S(\rho)$. Hence $\hat\beta^{\mathrm{or}}$ is a local minimiser of the SCAD criterion relative to $\mathcal N_S(\rho)$.
Since this conclusion holds for every outcome in $\mathcal E_n$ and $\mathbb P(\mathcal E_n)\ge 1-\delta_n$, the event that $\hat\beta^{\mathrm{or}}$ is a sparse local minimiser of $Q_n$ has probability at least $1-\delta_n$. This completes the proof.
[/step]