Karush-Kuhn-Tucker Conditions for the Group Lasso (Theorem # 5572)

Proof

[proofplan] We prove the result by converting minimization of the convex group Lasso objective into non-negativity of every one-sided directional derivative. The squared loss contributes a linear directional derivative involving the groupwise residual correlations, while each group norm contributes either an ordinary derivative in the active case or the Euclidean norm of the perturbation in the inactive case. Testing directions supported on one group gives the active equations and inactive inequalities, and the same formulas imply the directional derivative is non-negative in every direction, giving the converse. [/proofplan] [step:Define the objective and compute its directional derivative] Let $F:\mathbb{R}^p\to\mathbb{R}$ be the group Lasso objective defined by \begin{align*} F(\beta):=\frac{1}{2n}|Y-X\beta|^2+\lambda\sum_{G\in\mathcal{G}}a_G|\beta_G|. \end{align*} Fix $\hat{\beta}\in\mathbb{R}^p$, and define the residual vector \begin{align*} \hat{r}:=Y-X\hat{\beta}\in\mathbb{R}^n. \end{align*} For each $G\in\mathcal{G}$, define the residual correlation vector \begin{align*} c_G:=\frac{1}{n}X_G^\top\hat{r}\in\mathbb{R}^{|G|}. \end{align*} Let $h\in\mathbb{R}^p$ be a direction, and write $h_G\in\mathbb{R}^{|G|}$ for its group subvector. The one-sided directional derivative of the squared loss at $\hat{\beta}$ in the direction $h$ is computed as follows. The identity $Y-X(\hat{\beta}+th)=\hat{r}-tXh$ gives \begin{align*} \lim_{t\downarrow 0}\frac{\frac{1}{2n}|Y-X(\hat{\beta}+th)|^2-\frac{1}{2n}|Y-X\hat{\beta}|^2}{t}=\lim_{t\downarrow 0}\frac{\frac{1}{2n}|\hat{r}-tXh|^2-\frac{1}{2n}|\hat{r}|^2}{t}. \end{align*} Expanding the Euclidean square and taking the limit yields \begin{align*} \lim_{t\downarrow 0}\frac{\frac{1}{2n}|\hat{r}-tXh|^2-\frac{1}{2n}|\hat{r}|^2}{t}=-\frac{1}{n}(X^\top\hat{r})\cdot h. \end{align*} Using the group decomposition of $h$ and the definition of $c_G$, this equals \begin{align*} -\sum_{G\in\mathcal{G}}c_G\cdot h_G. 
\end{align*} For each group $G\in\mathcal{G}$, let $\gamma_G:\mathbb{R}^{|G|}\to\mathbb{R}$ be the map defined by \begin{align*} \gamma_G(u):=|u|. \end{align*} Its one-sided directional derivative at $\hat{\beta}_G$ in the direction $h_G$ has the following two forms. If $\hat{\beta}_G\neq 0$, then \begin{align*} \gamma_G'(\hat{\beta}_G;h_G)=\frac{\hat{\beta}_G}{|\hat{\beta}_G|}\cdot h_G. \end{align*} If $\hat{\beta}_G=0$, then \begin{align*} \gamma_G'(\hat{\beta}_G;h_G)=|h_G|. \end{align*} Therefore \begin{align*} F'(\hat{\beta};h)=\sum_{G\in\mathcal{G}}\left(-c_G\cdot h_G+\lambda a_G\,\gamma_G'(\hat{\beta}_G;h_G)\right). \end{align*} [guided] The point of introducing $c_G$ is to isolate the part of the loss gradient that belongs to group $G$. We define \begin{align*} \hat{r}:=Y-X\hat{\beta}\in\mathbb{R}^n, \qquad c_G:=\frac{1}{n}X_G^\top\hat{r}\in\mathbb{R}^{|G|}. \end{align*} Now perturb $\hat{\beta}$ in an arbitrary direction $h\in\mathbb{R}^p$. The residual changes from $\hat{r}$ to $\hat{r}-tXh$, so expanding the square gives \begin{align*} \frac{1}{2n}|\hat{r}-tXh|^2 = \frac{1}{2n}|\hat{r}|^2-\frac{t}{n}(X^\top\hat{r})\cdot h+\frac{t^2}{2n}|Xh|^2. \end{align*} Dividing by $t>0$ and letting $t\downarrow 0$ yields \begin{align*} \lim_{t\downarrow 0}\frac{\frac{1}{2n}|Y-X(\hat{\beta}+th)|^2-\frac{1}{2n}|Y-X\hat{\beta}|^2}{t} = -\frac{1}{n}(X^\top\hat{r})\cdot h = -\sum_{G\in\mathcal{G}}c_G\cdot h_G. \end{align*} For the penalty, we examine the Euclidean norm on each group separately. Define $\gamma_G:\mathbb{R}^{|G|}\to\mathbb{R}$ by \begin{align*} \gamma_G(u):=|u|. \end{align*} If $\hat{\beta}_G\neq 0$, the Euclidean norm is differentiable at $\hat{\beta}_G$, and its derivative in the direction $h_G$ is \begin{align*} \gamma_G'(\hat{\beta}_G;h_G)=\frac{\hat{\beta}_G}{|\hat{\beta}_G|}\cdot h_G. 
\end{align*} If $\hat{\beta}_G=0$, then directly from the definition of the one-sided directional derivative, \begin{align*} \gamma_G'(0;h_G) = \lim_{t\downarrow 0}\frac{|th_G|-0}{t} = |h_G|. \end{align*} Adding the loss and penalty contributions gives \begin{align*} F'(\hat{\beta};h)=\sum_{G\in\mathcal{G}}\left(-c_G\cdot h_G+\lambda a_G\,\gamma_G'(\hat{\beta}_G;h_G)\right). \end{align*} This formula is the whole mechanism behind the KKT conditions: active groups produce linear equalities, while inactive groups produce Euclidean-ball inequalities. [/guided] [/step] [step:Reduce minimization to non-negative directional derivatives] The objective $F$ is convex: the squared loss is convex as the composition of a convex Euclidean square with an affine map, each map $\beta\mapsto |\beta_G|$ is convex by the triangle inequality in $\mathbb{R}^{|G|}$, and the coefficients satisfy $\lambda a_G\geq 0$. We use the following elementary convexity criterion: a convex function $F:\mathbb{R}^p\to\mathbb{R}$ is minimized at $\hat{\beta}$ if and only if $F'(\hat{\beta};h)\geq 0$ for every $h\in\mathbb{R}^p$. Indeed, if $\hat{\beta}$ minimizes $F$, then for every $h\in\mathbb{R}^p$ and every $t>0$, \begin{align*} \frac{F(\hat{\beta}+th)-F(\hat{\beta})}{t}\geq 0, \end{align*} so $F'(\hat{\beta};h)\geq 0$. Conversely, suppose $F'(\hat{\beta};h)\geq 0$ for every $h\in\mathbb{R}^p$. For any $\beta\in\mathbb{R}^p$, set $h:=\beta-\hat{\beta}$. Convexity of the map $\varphi:[0,1]\to\mathbb{R}$ defined by \begin{align*} \varphi(t):=F(\hat{\beta}+th) \end{align*} implies that the difference quotient \begin{align*} t\mapsto \frac{\varphi(t)-\varphi(0)}{t} \end{align*} is nondecreasing on $(0,1]$. Hence \begin{align*} F(\beta)-F(\hat{\beta}) = \varphi(1)-\varphi(0) \geq \lim_{t\downarrow 0}\frac{\varphi(t)-\varphi(0)}{t} = F'(\hat{\beta};h) \geq 0. \end{align*} Thus $\hat{\beta}$ minimizes $F$. 
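
The closed-form expression for $F'(\hat{\beta};h)$ used in this criterion can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: it compares the formula with a finite-difference quotient on a small made-up instance. The function names (`objective`, `directional_derivative`), the data, and the assumption that the groups partition the coordinate indices are all illustrative choices.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def residual(X, Y, beta):
    """Residual vector Y - X beta for a row-major list-of-lists X."""
    return [Y[i] - sum(X[i][j] * beta[j] for j in range(len(beta)))
            for i in range(len(Y))]

def objective(X, Y, beta, groups, lam, weights):
    """Group Lasso objective F(beta) = |Y - X beta|^2 / (2n) + lam * sum_G a_G |beta_G|."""
    n = len(Y)
    r = residual(X, Y, beta)
    loss = sum(ri * ri for ri in r) / (2 * n)
    penalty = sum(a * norm([beta[j] for j in G]) for G, a in zip(groups, weights))
    return loss + lam * penalty

def directional_derivative(X, Y, beta, h, groups, lam, weights):
    """Closed-form one-sided directional derivative F'(beta; h) from the proof."""
    n = len(Y)
    r = residual(X, Y, beta)
    total = 0.0
    for G, a in zip(groups, weights):
        c = [sum(X[i][j] * r[i] for i in range(n)) / n for j in G]  # c_G
        b = [beta[j] for j in G]                                    # beta_G
        hg = [h[j] for j in G]                                      # h_G
        linear = -sum(ci * hi for ci, hi in zip(c, hg))
        nb = norm(b)
        if nb > 0:   # active group: the norm is differentiable at beta_G
            pen = sum(bi * hi for bi, hi in zip(b, hg)) / nb
        else:        # inactive group: derivative of the norm at 0 is |h_G|
            pen = norm(hg)
        total += linear + lam * a * pen
    return total

# Illustrative data (hypothetical): one active group {0, 1}, one inactive group {2}.
X = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
Y = [1.0, 2.0]
beta = [1.0, -1.0, 0.0]
h = [0.3, -0.2, 0.7]
groups, weights, lam = [[0, 1], [2]], [1.0, 1.0], 0.5

t = 1e-6
shifted = [b + t * d for b, d in zip(beta, h)]
fd = (objective(X, Y, shifted, groups, lam, weights)
      - objective(X, Y, beta, groups, lam, weights)) / t
exact = directional_derivative(X, Y, beta, h, groups, lam, weights)
print(fd, exact)
```

The two printed values should agree up to an error of order $t$, since the only neglected terms are the quadratic remainder of the loss and the second-order remainder of the norm on the active group.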
[/step] [step:Derive the active group equation] Assume $\hat{\beta}$ minimizes $F$, so $F'(\hat{\beta};h)\geq 0$ for every $h\in\mathbb{R}^p$. Fix a group $G\in\mathcal{G}$ such that $\hat{\beta}_G\neq 0$, and let $v\in\mathbb{R}^{|G|}$. Define $h\in\mathbb{R}^p$ by \begin{align*} h_G:=v, \qquad h_H:=0 \quad \text{for every } H\in\mathcal{G}\setminus\{G\}. \end{align*} Using the directional derivative formula gives \begin{align*} 0 \leq F'(\hat{\beta};h) = \left(-c_G+\lambda a_G\frac{\hat{\beta}_G}{|\hat{\beta}_G|}\right)\cdot v. \end{align*} Applying the same inequality to the direction with group component $-v$ gives \begin{align*} 0 \leq -\left(-c_G+\lambda a_G\frac{\hat{\beta}_G}{|\hat{\beta}_G|}\right)\cdot v. \end{align*} Therefore \begin{align*} \left(-c_G+\lambda a_G\frac{\hat{\beta}_G}{|\hat{\beta}_G|}\right)\cdot v=0 \end{align*} for every $v\in\mathbb{R}^{|G|}$. Taking $v$ equal to the vector inside the parentheses gives \begin{align*} -c_G+\lambda a_G\frac{\hat{\beta}_G}{|\hat{\beta}_G|}=0. \end{align*} Thus \begin{align*} \frac{1}{n}X_G^\top(Y-X\hat{\beta}) = \lambda a_G\frac{\hat{\beta}_G}{|\hat{\beta}_G|}. \end{align*} [/step] [step:Derive the inactive group inequality] Assume again that $\hat{\beta}$ minimizes $F$. Fix a group $G\in\mathcal{G}$ such that $\hat{\beta}_G=0$, and let $v\in\mathbb{R}^{|G|}$. Define $h\in\mathbb{R}^p$ by \begin{align*} h_G:=v, \qquad h_H:=0 \quad \text{for every } H\in\mathcal{G}\setminus\{G\}. \end{align*} The directional derivative formula gives \begin{align*} 0 \leq F'(\hat{\beta};h) = -c_G\cdot v+\lambda a_G |v|. \end{align*} If $c_G=0$, then $|c_G|\leq \lambda a_G$. If $c_G\neq 0$, choose \begin{align*} v:=c_G\in\mathbb{R}^{|G|}. \end{align*} Then \begin{align*} 0 \leq -|c_G|^2+\lambda a_G |c_G|, \end{align*} and division by $|c_G|>0$ yields \begin{align*} |c_G|\leq \lambda a_G. \end{align*} Substituting the definition of $c_G$ gives \begin{align*} \left|\frac{1}{n}X_G^\top(Y-X\hat{\beta})\right| \leq \lambda a_G. 
\end{align*} [/step] [step:Use the groupwise conditions to prove optimality] Conversely, assume that $\hat{\beta}\in\mathbb{R}^p$ satisfies the displayed active equations and inactive inequalities for every $G\in\mathcal{G}$. Let $h\in\mathbb{R}^p$ be arbitrary. For an active group $G$ with $\hat{\beta}_G\neq 0$, the assumed equation gives \begin{align*} -c_G\cdot h_G+\lambda a_G\frac{\hat{\beta}_G}{|\hat{\beta}_G|}\cdot h_G=0. \end{align*} For an inactive group $G$ with $\hat{\beta}_G=0$, the assumed inequality and the [Cauchy-Schwarz inequality](/theorems/???) in $\mathbb{R}^{|G|}$ give \begin{align*} -c_G\cdot h_G+\lambda a_G|h_G| \geq -|c_G|\,|h_G|+\lambda a_G|h_G| = (\lambda a_G-|c_G|)|h_G| \geq 0. \end{align*} Summing over all groups and using the directional derivative formula gives \begin{align*} F'(\hat{\beta};h)\geq 0. \end{align*} Since $h\in\mathbb{R}^p$ was arbitrary, the convexity criterion implies that $\hat{\beta}$ minimizes $F$. Replacing $\hat{\beta}$ by $\hat{\beta}^{\mathrm{GL}}$ gives exactly the stated characterization of group Lasso estimators. [/step]
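
As an illustration of the characterization, the sketch below checks the active equations and inactive inequalities for a candidate vector. It is a hedged example under stated assumptions: the function name `check_group_lasso_kkt`, the tolerance `tol`, and the toy data are hypothetical, and the groups are assumed to partition the coordinate indices. The design is chosen so that $X^\top X/n$ is the identity, in which case the single-group minimizer is the group soft-thresholding of $z=X^\top Y/n$.

```python
import math

def check_group_lasso_kkt(X, Y, beta, groups, lam, weights, tol=1e-8):
    """Return True if beta satisfies the group Lasso KKT conditions up to tol.

    Active group (beta_G != 0):  c_G = lam * a_G * beta_G / |beta_G|
    Inactive group (beta_G = 0): |c_G| <= lam * a_G
    where c_G = X_G^T (Y - X beta) / n.
    """
    n, p = len(Y), len(beta)
    r = [Y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for G, a in zip(groups, weights):
        c = [sum(X[i][j] * r[i] for i in range(n)) / n for j in G]
        b = [beta[j] for j in G]
        nb = math.sqrt(sum(x * x for x in b))
        if nb > 0:
            # Active equation, checked coordinate by coordinate.
            if max(abs(ci - lam * a * bi / nb) for ci, bi in zip(c, b)) > tol:
                return False
        else:
            # Inactive inequality on the Euclidean norm of c_G.
            if math.sqrt(sum(x * x for x in c)) > lam * a + tol:
                return False
    return True

# Toy design with X^T X / n = I, so z = X^T Y / n = (3, 4) and |z| = 5.
X = [[1.0, 1.0], [1.0, -1.0]]
Y = [7.0, -1.0]
groups, weights = [[0, 1]], [1.0]

# lam = 1: group soft-thresholding gives (1 - 1/5) * z = (2.4, 3.2), an active group.
active_ok = check_group_lasso_kkt(X, Y, [2.4, 3.2], groups, 1.0, weights)
# lam = 6: |z| = 5 <= 6, so the zero vector satisfies the inactive inequality.
inactive_ok = check_group_lasso_kkt(X, Y, [0.0, 0.0], groups, 6.0, weights)
# lam = 1: the zero vector violates |c_G| <= lam * a_G, so it is not a minimizer.
zero_fails = check_group_lasso_kkt(X, Y, [0.0, 0.0], groups, 1.0, weights)
print(active_ok, inactive_ok, zero_fails)
```

A checker of this kind is a verification tool only: it confirms or refutes optimality of a given candidate but does not compute a minimizer.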
