Elastic Net Grouping Effect for Two Predictors — Statement & Proof

Elastic Net Grouping Effect for Two Predictors (Theorem # 5569)

Proof

[proofplan] We use the first-order optimality equations for the two active coefficients. Since $\hat{\beta}_1$ and $\hat{\beta}_2$ are nonzero and have the same sign, the two $\ell^1$ subgradient contributions are equal and cancel after subtraction. The standardized Gram matrix converts the fitted-value term into $(1-\rho)(\hat{\beta}_1-\hat{\beta}_2)$, while the ridge penalty contributes $\lambda_2(\hat{\beta}_1-\hat{\beta}_2)$. Solving the resulting scalar identity gives the stated estimate. [/proofplan] [step:Write the active first-order equations and cancel the common sign term] Let $Y \in \mathbb{R}^n$ denote the response vector. Let $X: \mathbb{R}^2 \to \mathbb{R}^n$ denote the design [linear map](/page/Linear%20Map) whose columns are $X_1,X_2 \in \mathbb{R}^n$, so that \begin{align*} X\beta = X_1\beta_1 + X_2\beta_2 \end{align*} for every $\beta=(\beta_1,\beta_2) \in \mathbb{R}^2$. Let $\lambda_1 \ge 0$ and $\lambda_2 > 0$ denote the $\ell^1$ and ridge tuning parameters in the elastic net objective, so that $\hat{\beta}$ minimizes the objective function $Q: \mathbb{R}^2 \to \mathbb{R}$ defined by \begin{align*} Q(\beta) := \frac{1}{2n}|Y-X\beta|^2 + \lambda_1\bigl(|\beta_1|+|\beta_2|\bigr) + \frac{\lambda_2}{2}|\beta|^2. \end{align*} Define the residual vector $r \in \mathbb{R}^n$ by $r := Y - X\hat{\beta}$. Since there are only two predictors ($p=2$), this gives \begin{align*} r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2. \end{align*} Since $\hat{\beta}_1\hat{\beta}_2>0$, both coefficients are nonzero (active) and have the same sign. Define $s \in \{-1,1\}$ by \begin{align*} s := \operatorname{sgn}(\hat{\beta}_1) = \operatorname{sgn}(\hat{\beta}_2). \end{align*} The function $Q$ is convex because it is the sum of a convex quadratic loss, the convex $\ell^1$ penalty, and the convex ridge penalty. 
For an active coordinate $j \in \{1,2\}$, the map $t \mapsto Q(\hat{\beta}+t e_j)$ is differentiable at $t=0$, where $e_j \in \mathbb{R}^2$ is the $j$th standard basis vector, because $\hat{\beta}_j \ne 0$ makes $|\beta_j|$ differentiable at $\hat{\beta}_j$. Since $\hat{\beta}$ minimizes $Q$, the necessary one-dimensional first-order condition gives \begin{align*} -\frac{1}{n}X_j^\top r + \lambda_1 s + \lambda_2 \hat{\beta}_j = 0. \end{align*} Thus \begin{align*} \frac{1}{n}X_j^\top r = \lambda_1 s + \lambda_2 \hat{\beta}_j. \end{align*} Subtracting the equation for $j=2$ from the equation for $j=1$ cancels the common $\lambda_1s$ term and yields \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \lambda_2(\hat{\beta}_1-\hat{\beta}_2). \end{align*} [guided] The only non-smooth part of the elastic net objective is the $\ell^1$ term. We first name all the objects used in the first-order equations. Let $Y \in \mathbb{R}^n$ be the response vector. Let $X: \mathbb{R}^2 \to \mathbb{R}^n$ be the design linear map with columns $X_1,X_2 \in \mathbb{R}^n$, so $X\beta=X_1\beta_1+X_2\beta_2$ for $\beta=(\beta_1,\beta_2) \in \mathbb{R}^2$. Let $\lambda_1 \ge 0$ and $\lambda_2 > 0$ be the $\ell^1$ and ridge tuning parameters in the elastic net objective \begin{align*} Q(\beta) := \frac{1}{2n}|Y-X\beta|^2 + \lambda_1\bigl(|\beta_1|+|\beta_2|\bigr) + \frac{\lambda_2}{2}|\beta|^2, \end{align*} which is minimized by $\hat{\beta}$. The hypothesis $\hat{\beta}_1\hat{\beta}_2>0$ is exactly what makes the $\ell^1$ term harmless here: both coefficients are active and have the same sign. Define $r := Y - X\hat{\beta} \in \mathbb{R}^n$. Since there are only two predictors ($p=2$), this gives \begin{align*} r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2. \end{align*} Define the common sign $s \in \{-1,1\}$ by \begin{align*} s := \operatorname{sgn}(\hat{\beta}_1) = \operatorname{sgn}(\hat{\beta}_2). \end{align*} The objective $Q$ is convex: the squared-error term is convex, the $\ell^1$ term is convex, and the ridge term is convex. 
Because $\hat{\beta}_j \ne 0$, the derivative of $|\beta_j|$ at $\hat{\beta}_j$ is $\operatorname{sgn}(\hat{\beta}_j)=s$. Thus the one-variable function $t \mapsto Q(\hat{\beta}+t e_j)$ is differentiable at $t=0$, where $e_j \in \mathbb{R}^2$ is the $j$th standard basis vector. Since $\hat{\beta}$ minimizes $Q$, this differentiable one-variable function has a minimum at $t=0$, so its derivative at $0$ is zero. Therefore, for each active coordinate $j \in \{1,2\}$, the first-order optimality equation is \begin{align*} -\frac{1}{n}X_j^\top r + \lambda_1 s + \lambda_2 \hat{\beta}_j = 0. \end{align*} Equivalently, \begin{align*} \frac{1}{n}X_j^\top r = \lambda_1 s + \lambda_2 \hat{\beta}_j. \end{align*} Now subtract the equation with $j=2$ from the equation with $j=1$. The $\lambda_1s$ terms are identical, so they cancel: \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \lambda_2(\hat{\beta}_1-\hat{\beta}_2). \end{align*} This cancellation is the grouping mechanism: if two active coefficients have the same sign, the lasso part does not separate them; only the ridge term and the correlation structure remain. [/guided] [/step] [step:Rewrite the residual equation using the two-column Gram matrix] Using $r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2$, we expand the scalar product as \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - \frac{1}{n}(X_1-X_2)^\top X_1\,\hat{\beta}_1 - \frac{1}{n}(X_1-X_2)^\top X_2\,\hat{\beta}_2. \end{align*} The standardization assumptions give the first Gram difference as \begin{align*} \frac{1}{n}(X_1-X_2)^\top X_1 = \frac{|X_1|^2}{n} - \frac{X_2^\top X_1}{n} = 1-\rho. \end{align*} They give the second Gram difference as \begin{align*} \frac{1}{n}(X_1-X_2)^\top X_2 = \frac{X_1^\top X_2}{n} - \frac{|X_2|^2}{n} = \rho-1. \end{align*} Hence \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - (1-\rho)(\hat{\beta}_1-\hat{\beta}_2). 
\end{align*} Combining this identity with the previous step gives \begin{align*} \frac{1}{n}(X_1-X_2)^\top Y = (1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2). \end{align*} [guided] The first-order equation from the previous step contains the residual $r$, but the desired estimate is stated in terms of $Y$ and the difference $\hat{\beta}_1-\hat{\beta}_2$. We therefore substitute the residual identity \begin{align*} r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2 \end{align*} into the scalar product with $X_1-X_2$. Distributing the Euclidean [inner product](/page/Inner%20Product) gives \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - \frac{1}{n}(X_1-X_2)^\top X_1\,\hat{\beta}_1 - \frac{1}{n}(X_1-X_2)^\top X_2\,\hat{\beta}_2. \end{align*} Now the standardization hypotheses are used. Since $|X_1|^2/n=1$ and $X_1^\top X_2/n=\rho$, symmetry of the Euclidean inner product gives \begin{align*} \frac{1}{n}(X_1-X_2)^\top X_1 = \frac{|X_1|^2}{n} - \frac{X_2^\top X_1}{n} = 1-\rho. \end{align*} Similarly, since $|X_2|^2/n=1$ and $X_1^\top X_2/n=\rho$, \begin{align*} \frac{1}{n}(X_1-X_2)^\top X_2 = \frac{X_1^\top X_2}{n} - \frac{|X_2|^2}{n} = \rho-1 = -(1-\rho). \end{align*} Substituting these two Gram differences into the expansion gives \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - (1-\rho)\hat{\beta}_1 + (1-\rho)\hat{\beta}_2. \end{align*} Factoring the last two terms produces \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - (1-\rho)(\hat{\beta}_1-\hat{\beta}_2). \end{align*} The previous step proved \begin{align*} \frac{1}{n}(X_1-X_2)^\top r = \lambda_2(\hat{\beta}_1-\hat{\beta}_2). \end{align*} Equating the two expressions for the same scalar and moving the Gram term to the right-hand side gives \begin{align*} \frac{1}{n}(X_1-X_2)^\top Y = (1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2). 
\end{align*} This identity is the exact algebraic form of the grouping effect: the coefficient difference is multiplied by the ridge strength plus the decorrelation factor $1-\rho$. [/guided] [/step] [step:Take absolute values and use positivity of the denominator] By the [Cauchy-Schwarz inequality](/page/Cauchy-Schwarz%20Inequality) applied in the Euclidean inner product on $\mathbb{R}^n$ to $X_1/\sqrt{n}$ and $X_2/\sqrt{n}$, \begin{align*} |\rho| = \left|\frac{X_1^\top X_2}{n}\right| \le \left(\frac{|X_1|^2}{n}\right)^{1/2}\left(\frac{|X_2|^2}{n}\right)^{1/2} = 1. \end{align*} Therefore $\rho \le 1$, and since $\lambda_2>0$, we get $1-\rho+\lambda_2 \ge \lambda_2 >0$. Taking absolute values in \begin{align*} (1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2) = \frac{1}{n}(X_1-X_2)^\top Y \end{align*} and dividing by the positive scalar $1-\rho+\lambda_2$ gives \begin{align*} |\hat{\beta}_1-\hat{\beta}_2| = \frac{|(X_1-X_2)^\top Y|}{n(1-\rho+\lambda_2)}. \end{align*} In particular, the left-hand side is bounded above by the right-hand side, which is the desired grouping estimate; here the bound holds with equality. [guided] It remains to justify that the scalar multiplying $\hat{\beta}_1-\hat{\beta}_2$ is positive, because only then may we divide without changing the estimate. We first bound $\rho$. Applying the [Cauchy-Schwarz inequality](/page/Cauchy-Schwarz%20Inequality) in the Euclidean inner product on $\mathbb{R}^n$ to the vectors $X_1/\sqrt{n}$ and $X_2/\sqrt{n}$ gives \begin{align*} |\rho| = \left|\frac{X_1^\top X_2}{n}\right| \le \left(\frac{|X_1|^2}{n}\right)^{1/2}\left(\frac{|X_2|^2}{n}\right)^{1/2}. \end{align*} The standardization assumptions say $|X_1|^2/n=1$ and $|X_2|^2/n=1$, so the right-hand side equals $1$. Hence $|\rho|\le 1$, and in particular $\rho\le 1$. Since $\lambda_2>0$, we have \begin{align*} 1-\rho+\lambda_2 \ge \lambda_2 > 0. \end{align*} Now return to the scalar identity proved in the previous step: \begin{align*} (1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2) = \frac{1}{n}(X_1-X_2)^\top Y. 
\end{align*} Taking absolute values and using positivity of $1-\rho+\lambda_2$ gives \begin{align*} (1-\rho+\lambda_2)|\hat{\beta}_1-\hat{\beta}_2| = \frac{|(X_1-X_2)^\top Y|}{n}. \end{align*} Dividing by $1-\rho+\lambda_2$ yields \begin{align*} |\hat{\beta}_1-\hat{\beta}_2| = \frac{|(X_1-X_2)^\top Y|}{n(1-\rho+\lambda_2)}. \end{align*} In particular, the left-hand side is bounded above by the right-hand side, which is exactly the claimed grouping estimate; here the bound holds with equality. [/guided] [/step]
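
The identity $(1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2) = \frac{1}{n}(X_1-X_2)^\top Y$ from the proof can be checked numerically. The following is a minimal sketch, not part of the original page. It assumes NumPy, synthetic data, and a hand-written coordinate descent solver for the objective $Q$ above. The function names, the seed, and the values of $n$, $\lambda_1$, $\lambda_2$ are illustrative assumptions. The data are built so that both fitted coefficients are expected to be positive, which is the hypothesis $\hat{\beta}_1\hat{\beta}_2>0$.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sgn(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_two_predictors(X, Y, lam1, lam2, sweeps=500):
    """Coordinate descent for
    Q(b) = |Y - Xb|^2 / (2n) + lam1 * (|b1| + |b2|) + (lam2 / 2) * |b|^2,
    assuming each column of X satisfies |X_j|^2 / n = 1."""
    n = X.shape[0]
    beta = np.zeros(2)
    for _ in range(sweeps):
        for j in range(2):
            k = 1 - j
            # Partial residual with coordinate j removed from the fit.
            partial = Y - X[:, k] * beta[k]
            z = X[:, j] @ partial / n
            # Exact one-dimensional minimizer (uses |X_j|^2 / n = 1).
            beta[j] = soft_threshold(z, lam1) / (1.0 + lam2)
    return beta

# Synthetic, strongly correlated predictors (illustrative assumption).
rng = np.random.default_rng(0)
n = 200
common = rng.standard_normal(n)
X = np.column_stack([
    common + 0.3 * rng.standard_normal(n),
    common + 0.3 * rng.standard_normal(n),
])
# Standardize so that |X_j|^2 / n = 1, as the theorem assumes.
X /= np.sqrt((X ** 2).sum(axis=0) / n)
Y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

lam1, lam2 = 0.1, 0.5
beta = elastic_net_two_predictors(X, Y, lam1, lam2)

rho = X[:, 0] @ X[:, 1] / n
lhs = (X[:, 0] - X[:, 1]) @ Y / n
rhs = (1.0 - rho + lam2) * (beta[0] - beta[1])

print("beta:", beta)
print("rho:", rho)
print("(X1 - X2)^T Y / n:", lhs)
print("(1 - rho + lam2) * (beta1 - beta2):", rhs)
```

If the two fitted coefficients share a sign, the last two printed numbers should agree up to solver tolerance. Note that the scaling of $\lambda_1$ and $\lambda_2$ here follows the objective $Q$ in the proof, which may differ from the parameterization used by library solvers.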