Local Restricted Strong Convexity for Folded-Concave Penalized Least Squares

Local Restricted Strong Convexity for Folded-Concave Penalized Least Squares (Theorem # 5579)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The proof separates the curvature of the objective into the squared-loss contribution and the penalty contribution. On an $m$-sparse coordinate subspace, the [sparse curvature](/page/Sparse%20Curvature) hypothesis gives a lower quadratic curvature bound for the least-squares loss. The coordinatewise [maximum concavity](/page/Maximum%20Concavity) hypothesis says that adding $\frac{\zeta}{2}s^2$ makes the even scalar penalty $s \mapsto p_\lambda(|s|)$ convexified on $\mathbb{R}$, so the penalty can destroy at most $\zeta$ units of curvature per coordinate. Combining these two estimates gives the restricted chord curvature inequality with parameter $\kappa-\zeta$; when $\kappa>\zeta$, this parameter is positive and gives genuine [strong convexity](/page/Strong%20Convexity) along every line segment whose active coordinates have size at most $m$. [/proofplan] [step:Fix the sparse coordinate subspace containing the line segment] The theorem statement declares $n,p,m \in \mathbb{N}$, the design matrix $X \in \mathbb{R}^{n \times p}$, the response vector $Y \in \mathbb{R}^n$, the squared loss $\ell: \mathbb{R}^p \to \mathbb{R}$, and the objective $Q: \mathbb{R}^p \to \mathbb{R}$. Let $\beta,\beta' \in \mathbb{R}^p$ satisfy \begin{align*} |\operatorname{supp}(\beta) \cup \operatorname{supp}(\beta')| \leq m. \end{align*} Define the active index set \begin{align*} S := \operatorname{supp}(\beta) \cup \operatorname{supp}(\beta') \subseteq \{1,\dots,p\}. \end{align*} For every $t \in [0,1]$, define the line segment point \begin{align*} \beta_t := (1-t)\beta + t\beta' \in \mathbb{R}^p. \end{align*} Since both $\beta$ and $\beta'$ vanish outside $S$, the vector $\beta_t$ also vanishes outside $S$. Hence \begin{align*} \operatorname{supp}(\beta_t) \subseteq S \end{align*} for every $t \in [0,1]$. Define the displacement vector $\delta \in \mathbb{R}^p$ by \begin{align*} \delta := \beta' - \beta. \end{align*} Then $\operatorname{supp}(\delta) \subseteq S$, so $|\operatorname{supp}(\delta)| \leq m$. [/step] [step:Use sparse curvature to bound the squared loss from below along the chord] For $t \in [0,1]$, expand the squared loss at $\beta_t=\beta+t\delta$: \begin{align*} \ell(\beta_t) = \frac{1}{2n}|Y-X\beta-tX\delta|^2. \end{align*} Expanding the Euclidean square gives \begin{align*} \ell(\beta_t) = \frac{1}{2n}|Y-X\beta|^2 - \frac{t}{n}(Y-X\beta)\cdot X\delta + \frac{t^2}{2n}|X\delta|^2. \end{align*} Similarly, first write \begin{align*} \ell(\beta') = \frac{1}{2n}|Y-X\beta-X\delta|^2. \end{align*} Expanding the Euclidean square gives \begin{align*} \ell(\beta') = \frac{1}{2n}|Y-X\beta|^2 - \frac{1}{n}(Y-X\beta)\cdot X\delta + \frac{1}{2n}|X\delta|^2. \end{align*} Taking the convex combination $(1-t)\ell(\beta)+t\ell(\beta')$ and subtracting $\ell(\beta_t)$ gives \begin{align*} (1-t)\ell(\beta)+t\ell(\beta')-\ell(\beta_t) = \frac{t(1-t)}{2n}|X\delta|^2. \end{align*} Because $|\operatorname{supp}(\delta)| \leq m$, the sparse curvature hypothesis applies to $\delta$ and yields \begin{align*} \frac{1}{n}|X\delta|^2 \geq \kappa|\delta|^2. \end{align*} Therefore \begin{align*} \ell(\beta_t) \leq (1-t)\ell(\beta)+t\ell(\beta')-\frac{\kappa}{2}t(1-t)|\delta|^2. \end{align*} [guided] The squared loss is a quadratic function of $\beta$, so along any line it has an exact one-dimensional quadratic expansion. We use this exact identity because the strong convexity inequality is also a statement about the gap between the value at a convex combination and the convex combination of endpoint values. Define \begin{align*} S := \operatorname{supp}(\beta)\cup\operatorname{supp}(\beta') \subseteq \{1,\dots,p\}. \end{align*} The theorem hypotheses give $|S|\leq m$. Define the displacement vector $\delta \in \mathbb{R}^p$ by \begin{align*} \delta := \beta' - \beta. \end{align*} Since $\beta$ and $\beta'$ are both supported in $S$, their difference $\delta$ is supported in $S$ as well. Hence $|\operatorname{supp}(\delta)| \leq m$, which is precisely the sparsity condition needed to apply the sparse curvature hypothesis. For $t \in [0,1]$, the point on the line segment from $\beta$ to $\beta'$ is the vector $\beta_t \in \mathbb{R}^p$ defined by \begin{align*} \beta_t = \beta + t\delta. \end{align*} Expanding the Euclidean square in $\mathbb{R}^n$ gives \begin{align*} \ell(\beta_t) = \frac{1}{2n}|Y-X(\beta+t\delta)|^2. \end{align*} Since $X(\beta+t\delta)=X\beta+tX\delta$, this is \begin{align*} \ell(\beta_t) = \frac{1}{2n}|Y-X\beta-tX\delta|^2. \end{align*} Expanding the Euclidean square in $\mathbb{R}^n$ gives \begin{align*} \ell(\beta_t) = \frac{1}{2n}|Y-X\beta|^2 - \frac{t}{n}(Y-X\beta)\cdot X\delta + \frac{t^2}{2n}|X\delta|^2. \end{align*} At the endpoint $t=1$, the same expansion first gives \begin{align*} \ell(\beta') = \frac{1}{2n}|Y-X\beta-X\delta|^2. \end{align*} Expanding the Euclidean square gives \begin{align*} \ell(\beta') = \frac{1}{2n}|Y-X\beta|^2 - \frac{1}{n}(Y-X\beta)\cdot X\delta + \frac{1}{2n}|X\delta|^2. \end{align*} Now compare $\ell(\beta_t)$ with the chord value $(1-t)\ell(\beta)+t\ell(\beta')$. The constant terms and the linear terms in $t$ cancel, leaving exactly \begin{align*} (1-t)\ell(\beta)+t\ell(\beta')-\ell(\beta_t) = \frac{t(1-t)}{2n}|X\delta|^2. \end{align*} This identity says that the curvature of the squared loss along this line is measured by $\frac{1}{n}|X\delta|^2$. Since $\delta$ is supported on at most $m$ coordinates, the sparse curvature assumption gives \begin{align*} \frac{1}{n}|X\delta|^2 \geq \kappa|\delta|^2. \end{align*} Substituting this lower bound into the exact chord identity yields \begin{align*} \ell(\beta_t) \leq (1-t)\ell(\beta)+t\ell(\beta')-\frac{\kappa}{2}t(1-t)|\delta|^2. \end{align*} Thus the loss contributes at least $\kappa$ units of strong convexity along the sparse line segment. [/guided] [/step] [step:Use maximum concavity to bound the penalty loss of curvature] The theorem statement declares the folded-concave penalty as a map $p_\lambda: [0,\infty) \to \mathbb{R}$ and assumes coordinatewise maximum concavity at most $\zeta$. Thus the scalar convexification $g: \mathbb{R} \to \mathbb{R}$ defined by \begin{align*} g(s) = p_\lambda(|s|)+\frac{\zeta}{2}s^2 \end{align*} is convex on $\mathbb{R}$. For each coordinate $j \in \{1,\dots,p\}$, convexity of $g$ applied to the two [real numbers](/page/Real%20Numbers) $\beta_j$ and $\beta'_j$ gives \begin{align*} g((1-t)\beta_j+t\beta'_j) \leq (1-t)g(\beta_j)+tg(\beta'_j). \end{align*} Expanding the definition of $g$ and using the scalar identity \begin{align*} (1-t)\beta_j^2+t(\beta'_j)^2-((1-t)\beta_j+t\beta'_j)^2 = t(1-t)(\beta'_j-\beta_j)^2, \end{align*} we obtain \begin{align*} p_\lambda(|(\beta_t)_j|) \leq (1-t)p_\lambda(|\beta_j|) +t p_\lambda(|\beta'_j|) +\frac{\zeta}{2}t(1-t)(\beta'_j-\beta_j)^2. \end{align*} Summing this inequality over $j=1,\dots,p$ gives \begin{align*} \sum_{j=1}^p p_\lambda(|(\beta_t)_j|) \leq (1-t)\sum_{j=1}^p p_\lambda(|\beta_j|)+t\sum_{j=1}^p p_\lambda(|\beta'_j|)+\frac{\zeta}{2}t(1-t)|\delta|^2. \end{align*} [guided] The penalty is separable across coordinates, so it is enough to understand what happens to one coordinate. The theorem statement declares the penalty as a map $p_\lambda: [0,\infty) \to \mathbb{R}$ and assumes coordinatewise maximum concavity at most $\zeta$. This means that the scalar convexification $g: \mathbb{R} \to \mathbb{R}$ defined by \begin{align*} g(s) = p_\lambda(|s|)+\frac{\zeta}{2}s^2 \end{align*} is convex on $\mathbb{R}$. This coordinatewise formulation is the hypothesis needed here: the coordinates $\beta_j$ and $\beta'_j$ may have either sign, so convexity only of the radial map on $[0,\infty)$ would not by itself justify applying convexity between the real numbers $\beta_j$ and $\beta'_j$. Therefore, for every $j \in \{1,\dots,p\}$ and every $t \in [0,1]$, convexity of $g$ applied to the two real numbers $\beta_j$ and $\beta'_j$ gives \begin{align*} g((1-t)\beta_j+t\beta'_j) \leq (1-t)g(\beta_j)+tg(\beta'_j). \end{align*} Substituting the definition of $g$ into this inequality gives \begin{align*} p_\lambda(|(1-t)\beta_j+t\beta'_j|)+\frac{\zeta}{2}((1-t)\beta_j+t\beta'_j)^2 \leq (1-t)p_\lambda(|\beta_j|)+t p_\lambda(|\beta'_j|)+\frac{\zeta}{2}\bigl((1-t)\beta_j^2+t(\beta'_j)^2\bigr). \end{align*} Moving the quadratic term on the left to the right and using \begin{align*} (1-t)\beta_j^2+t(\beta'_j)^2-((1-t)\beta_j+t\beta'_j)^2 = t(1-t)(\beta'_j-\beta_j)^2 \end{align*} yields \begin{align*} p_\lambda(|(\beta_t)_j|) \leq (1-t)p_\lambda(|\beta_j|)+t p_\lambda(|\beta'_j|)+\frac{\zeta}{2}t(1-t)(\beta'_j-\beta_j)^2. \end{align*} Finally, summing over all coordinates gives \begin{align*} \sum_{j=1}^p p_\lambda(|(\beta_t)_j|) \leq (1-t)\sum_{j=1}^p p_\lambda(|\beta_j|)+t\sum_{j=1}^p p_\lambda(|\beta'_j|)+\frac{\zeta}{2}t(1-t)|\delta|^2. \end{align*} This is the precise sense in which the penalty can remove at most $\zeta$ units of quadratic curvature along the chord. [/guided] [/step] [step:Combine the loss and penalty estimates to obtain the restricted chord curvature inequality] By definition of $Q$, we have \begin{align*} Q(\beta_t) = \ell(\beta_t)+\sum_{j=1}^p p_\lambda(|(\beta_t)_j|). \end{align*} Adding the two estimates gives \begin{align*} Q(\beta_t) \leq (1-t)\ell(\beta)+t\ell(\beta')-\frac{\kappa}{2}t(1-t)|\delta|^2 +(1-t)\sum_{j=1}^p p_\lambda(|\beta_j|)+t\sum_{j=1}^p p_\lambda(|\beta'_j|)+\frac{\zeta}{2}t(1-t)|\delta|^2. \end{align*} Collecting the endpoint terms and the quadratic terms gives \begin{align*} Q(\beta_t) \leq (1-t)Q(\beta)+tQ(\beta')-\frac{\kappa-\zeta}{2}t(1-t)|\delta|^2. \end{align*} Since $\delta=\beta'-\beta$, this is \begin{align*} Q((1-t)\beta+t\beta') \leq (1-t)Q(\beta)+tQ(\beta')-\frac{\kappa-\zeta}{2}t(1-t)|\beta'-\beta|^2. \end{align*} This is the restricted chord curvature inequality with parameter $\kappa-\zeta$ along the segment from $\beta$ to $\beta'$. The same argument also applies to any two vectors $u,v\in\mathbb{R}^p$ with $\operatorname{supp}(u)\cup\operatorname{supp}(v)\subseteq S$, because then $\operatorname{supp}(v-u)\subseteq S$ and $|S|\leq m$. Hence the restriction of $Q$ to the coordinate subspace supported on $S$ has strong convexity modulus at least $\kappa-\zeta$. If $\kappa>\zeta$, then $\kappa-\zeta>0$, so the displayed inequality is the strong convexity inequality on each sparse line segment with positive modulus at least $\kappa-\zeta$. Hence $Q$ is strongly convex along every line segment joining sparse vectors whose union of supports has cardinality at most $m$. This proves the claimed local restricted strong convexity. [guided] The objective is the sum of the squared loss and the coordinatewise penalty. By definition, \begin{align*} Q(\beta_t) = \ell(\beta_t)+\sum_{j=1}^p p_\lambda(|(\beta_t)_j|). \end{align*} The loss estimate proved above is \begin{align*} \ell(\beta_t) \leq (1-t)\ell(\beta)+t\ell(\beta')-\frac{\kappa}{2}t(1-t)|\delta|^2. \end{align*} The penalty estimate proved above is \begin{align*} \sum_{j=1}^p p_\lambda(|(\beta_t)_j|) \leq (1-t)\sum_{j=1}^p p_\lambda(|\beta_j|)+t\sum_{j=1}^p p_\lambda(|\beta'_j|)+\frac{\zeta}{2}t(1-t)|\delta|^2. \end{align*} Adding these two inequalities is valid because both have the same left-side point $\beta_t$ and the same chord parameter $t$. The endpoint loss terms and endpoint penalty terms combine back into $Q(\beta)$ and $Q(\beta')$, while the quadratic coefficients combine by \begin{align*} -\frac{\kappa}{2}+\frac{\zeta}{2}=-\frac{\kappa-\zeta}{2}. \end{align*} Therefore \begin{align*} Q(\beta_t) \leq (1-t)Q(\beta)+tQ(\beta')-\frac{\kappa-\zeta}{2}t(1-t)|\delta|^2. \end{align*} Since $\beta_t=(1-t)\beta+t\beta'$ and $\delta=\beta'-\beta$, this becomes \begin{align*} Q((1-t)\beta+t\beta') \leq (1-t)Q(\beta)+tQ(\beta')-\frac{\kappa-\zeta}{2}t(1-t)|\beta'-\beta|^2. \end{align*} This proves the restricted chord curvature inequality with parameter $\kappa-\zeta$ for the chosen pair. The same proof applies to arbitrary $u,v\in\mathbb{R}^p$ supported in the coordinate subspace indexed by $S$, because $\operatorname{supp}(v-u)\subseteq S$ and $|S|\leq m$ are the only support facts used in the loss estimate. Therefore the entire restriction of $Q$ to that coordinate subspace has strong convexity modulus at least $\kappa-\zeta$. To call this genuine strong convexity, the modulus must be positive. The additional hypothesis $\kappa>\zeta$ gives $\kappa-\zeta>0$, and then the displayed inequality is exactly the strong convexity inequality along every sparse line segment under consideration. [/guided] [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

Real Numbers Definition Separable Definition Bonferroni Inequalities Probability Theory Chi-Squared Distribution of the Residual Sum of Squares Probability & Statistics Linearity and Positivity Probability & Statistics Unbiasedness of $K$-Fold Cross-Validation Risk for Fold-Trained Predictors Probability & Statistics Conditional Variance Formula for the Ordinary Least Squares Estimator Probability & Statistics Existence and Uniqueness of Conditional Expectation Conditional Expectation Lasso Basic Inequality Probability & Statistics Splitting Property of the Poisson Distribution Poisson Processes Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.