[proofplan]
We use the first-order optimality equations for the two active coefficients. Since $\hat{\beta}_1$ and $\hat{\beta}_2$ are nonzero and have the same sign, the two $\ell^1$ subgradient contributions are equal and cancel after subtraction. The standardized Gram matrix converts the fitted-value term into $(1-\rho)(\hat{\beta}_1-\hat{\beta}_2)$, while the ridge penalty contributes $\lambda_2(\hat{\beta}_1-\hat{\beta}_2)$. Solving the resulting scalar identity gives the stated estimate.
[/proofplan]
[step:Write the active first-order equations and cancel the common sign term]
Let $Y \in \mathbb{R}^n$ denote the response vector. Let $X: \mathbb{R}^2 \to \mathbb{R}^n$ denote the design [linear map](/page/Linear%20Map) whose columns are $X_1,X_2 \in \mathbb{R}^n$, so that
\begin{align*}
X\beta = X_1\beta_1 + X_2\beta_2
\end{align*}
for every $\beta=(\beta_1,\beta_2) \in \mathbb{R}^2$. Let $\lambda_1 \ge 0$ denote the $\ell^1$ tuning parameter and $\lambda_2>0$ the ridge tuning parameter of the elastic net, so that $\hat{\beta}$ minimizes the objective function $Q: \mathbb{R}^2 \to \mathbb{R}$ defined by
\begin{align*}
Q(\beta) := \frac{1}{2n}|Y-X\beta|^2 + \lambda_1\bigl(|\beta_1|+|\beta_2|\bigr) + \frac{\lambda_2}{2}|\beta|^2.
\end{align*}
Define the residual vector $r \in \mathbb{R}^n$ by $r := Y - X\hat{\beta}$. Since $p=2$, this gives
\begin{align*}
r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2.
\end{align*}
Since $\hat{\beta}_1\hat{\beta}_2>0$, both active coefficients are nonzero and have the same sign. Define $s \in \{-1,1\}$ by
\begin{align*}
s := \operatorname{sgn}(\hat{\beta}_1) = \operatorname{sgn}(\hat{\beta}_2).
\end{align*}
The function $Q$ is convex because it is the sum of a convex quadratic loss, the convex $\ell^1$ penalty, and the convex ridge penalty. For an active coordinate $j \in \{1,2\}$, the map $t \mapsto Q(\hat{\beta}+t e_j)$ is differentiable at $t=0$, where $e_j \in \mathbb{R}^2$ is the $j$th standard basis vector, because $\hat{\beta}_j \ne 0$ makes $|\beta_j|$ differentiable at $\hat{\beta}_j$. Since $\hat{\beta}$ minimizes $Q$, the necessary one-dimensional first-order condition gives
\begin{align*}
-\frac{1}{n}X_j^\top r + \lambda_1 s + \lambda_2 \hat{\beta}_j = 0.
\end{align*}
Thus
\begin{align*}
\frac{1}{n}X_j^\top r = \lambda_1 s + \lambda_2 \hat{\beta}_j.
\end{align*}
Subtracting the equation for $j=2$ from the equation for $j=1$ cancels the common $\lambda_1s$ term and yields
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r
= \lambda_2(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
[guided]
The only non-smooth part of the elastic net objective is the $\ell^1$ term. We first name all the objects used in the first-order equations. Let $Y \in \mathbb{R}^n$ be the response vector. Let $X: \mathbb{R}^2 \to \mathbb{R}^n$ be the design linear map with columns $X_1,X_2 \in \mathbb{R}^n$, so $X\beta=X_1\beta_1+X_2\beta_2$ for $\beta=(\beta_1,\beta_2) \in \mathbb{R}^2$. Let $\lambda_1 \ge 0$ be the $\ell^1$ tuning parameter in the elastic net objective
\begin{align*}
Q(\beta) := \frac{1}{2n}|Y-X\beta|^2 + \lambda_1\bigl(|\beta_1|+|\beta_2|\bigr) + \frac{\lambda_2}{2}|\beta|^2,
\end{align*}
which is minimized by $\hat{\beta}$. The hypothesis $\hat{\beta}_1\hat{\beta}_2>0$ is exactly what makes the $\ell^1$ term harmless here: both coefficients are active and have the same sign. Define $r := Y - X\hat{\beta} \in \mathbb{R}^n$. Since $p=2$, this gives
\begin{align*}
r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2.
\end{align*}
Define the common sign $s \in \{-1,1\}$ by
\begin{align*}
s := \operatorname{sgn}(\hat{\beta}_1) = \operatorname{sgn}(\hat{\beta}_2).
\end{align*}
The objective $Q$ is convex: the squared-error term is convex, the $\ell^1$ term is convex, and the ridge term is convex. Because $\hat{\beta}_j \ne 0$, the derivative of $|\beta_j|$ at $\hat{\beta}_j$ is $\operatorname{sgn}(\hat{\beta}_j)=s$. Thus the one-variable function $t \mapsto Q(\hat{\beta}+t e_j)$ is differentiable at $t=0$, where $e_j \in \mathbb{R}^2$ is the $j$th standard basis vector. Since $\hat{\beta}$ minimizes $Q$, this differentiable one-variable function has a minimum at $t=0$, so its derivative at $0$ is zero. Therefore, for each active coordinate $j \in \{1,2\}$, the first-order optimality equation is
\begin{align*}
-\frac{1}{n}X_j^\top r + \lambda_1 s + \lambda_2 \hat{\beta}_j = 0.
\end{align*}
Equivalently,
\begin{align*}
\frac{1}{n}X_j^\top r = \lambda_1 s + \lambda_2 \hat{\beta}_j.
\end{align*}
Now subtract the equation with $j=2$ from the equation with $j=1$. The $\lambda_1s$ terms are identical, so they cancel:
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r
= \lambda_2(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
This cancellation is the grouping mechanism: if two active coefficients have the same sign, the lasso part does not separate them; only the ridge term and the correlation structure remain.
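As a check on the first-order equation used above, the derivative of $t \mapsto Q(\hat{\beta}+te_j)$ at $t=0$ can be computed term by term. This is only a worked sketch under the standing assumption $\hat{\beta}_j \ne 0$. It uses $X(\hat{\beta}+te_j) = X\hat{\beta}+tX_j$, so the residual along this line is $r-tX_j$:
\begin{align*}
\frac{d}{dt}\Big|_{t=0}\,\frac{1}{2n}|r-tX_j|^2 &= -\frac{1}{n}X_j^\top r, \\
\frac{d}{dt}\Big|_{t=0}\,\lambda_1|\hat{\beta}_j+t| &= \lambda_1\operatorname{sgn}(\hat{\beta}_j) = \lambda_1 s, \\
\frac{d}{dt}\Big|_{t=0}\,\frac{\lambda_2}{2}(\hat{\beta}_j+t)^2 &= \lambda_2\hat{\beta}_j.
\end{align*}
The penalty terms belonging to the other coordinate do not depend on $t$, so the sum of these three lines is the full derivative, and setting it to zero recovers $-\frac{1}{n}X_j^\top r + \lambda_1 s + \lambda_2\hat{\beta}_j = 0$.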
[/guided]
[/step]
[step:Rewrite the residual equation using the two-column Gram matrix]
Using $r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2$, we expand the scalar product as
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - \frac{1}{n}(X_1-X_2)^\top X_1\,\hat{\beta}_1 - \frac{1}{n}(X_1-X_2)^\top X_2\,\hat{\beta}_2.
\end{align*}
The standardization assumptions $|X_1|^2/n = |X_2|^2/n = 1$ and $X_1^\top X_2/n = \rho$ give the first Gram difference as
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top X_1 = \frac{|X_1|^2}{n} - \frac{X_2^\top X_1}{n} = 1-\rho.
\end{align*}
They give the second Gram difference as
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top X_2 = \frac{X_1^\top X_2}{n} - \frac{|X_2|^2}{n} = \rho-1.
\end{align*}
Hence
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - (1-\rho)(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
Combining this identity with the previous step gives
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top Y = (1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
[guided]
The first-order equation from the previous step contains the residual $r$, but the desired estimate is stated in terms of $Y$ and the difference $\hat{\beta}_1-\hat{\beta}_2$. We therefore substitute the residual identity
\begin{align*}
r = Y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2
\end{align*}
into the scalar product with $X_1-X_2$. Distributing the Euclidean [inner product](/page/Inner%20Product) gives
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - \frac{1}{n}(X_1-X_2)^\top X_1\,\hat{\beta}_1 - \frac{1}{n}(X_1-X_2)^\top X_2\,\hat{\beta}_2.
\end{align*}
Now the standardization hypotheses are used. Since $|X_1|^2/n=1$ and $X_1^\top X_2/n=\rho$, symmetry of the Euclidean inner product gives
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top X_1 = \frac{|X_1|^2}{n} - \frac{X_2^\top X_1}{n} = 1-\rho.
\end{align*}
Similarly, since $|X_2|^2/n=1$ and $X_1^\top X_2/n=\rho$,
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top X_2 = \frac{X_1^\top X_2}{n} - \frac{|X_2|^2}{n} = \rho-1 = -(1-\rho).
\end{align*}
Substituting these two Gram differences into the expansion gives
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - (1-\rho)\hat{\beta}_1 + (1-\rho)\hat{\beta}_2.
\end{align*}
Factoring the last two terms produces
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r = \frac{1}{n}(X_1-X_2)^\top Y - (1-\rho)(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
The previous step proved
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top r = \lambda_2(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
Equating the two expressions for the same scalar and moving the Gram term to the right-hand side gives
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top Y = (1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
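This identity is the exact algebraic form of the grouping effect: the coefficient difference $\hat{\beta}_1-\hat{\beta}_2$, multiplied by the sum of the ridge strength $\lambda_2$ and the decorrelation factor $1-\rho$, equals the difference $\frac{1}{n}X_1^\top Y-\frac{1}{n}X_2^\top Y$ of the two scaled inner products with the response.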
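The identity can also be checked numerically on a small example. The following is a minimal sketch, not part of the proof: it minimizes $Q$ by cyclic coordinate descent with soft-thresholding, a standard solver that is swapped in here purely for illustration. The data `x1`, `x2`, `y`, the tuning values `lam1`, `lam2`, and the helper names are all hypothetical choices. The coordinate update assumes $|X_j|^2/n = 1$, which the chosen columns satisfy.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sgn(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def elastic_net_two_columns(x1, x2, y, lam1, lam2, sweeps=500):
    """Cyclic coordinate descent for
    Q(b) = |y - x1*b1 - x2*b2|^2 / (2n) + lam1*(|b1| + |b2|) + lam2/2 * |b|^2,
    assuming |x1|^2 / n = |x2|^2 / n = 1 (illustrative helper, not from the source).
    """
    n = len(y)
    rho = dot(x1, x2) / n
    c1 = dot(x1, y) / n
    c2 = dot(x2, y) / n
    b1 = b2 = 0.0
    for _ in range(sweeps):
        # Exact one-dimensional minimizer in each coordinate.
        b1 = soft_threshold(c1 - rho * b2, lam1) / (1.0 + lam2)
        b2 = soft_threshold(c2 - rho * b1, lam1) / (1.0 + lam2)
    return b1, b2


# Hypothetical data: both columns have squared norm n = 4, and rho = 0.5.
x1 = [1.0, 1.0, 1.0, 1.0]
x2 = [1.0, 1.0, 1.0, -1.0]
y = [3.0, 2.0, 4.0, 1.0]
lam1, lam2 = 0.1, 0.5

n = len(y)
rho = dot(x1, x2) / n
b1, b2 = elastic_net_two_columns(x1, x2, y, lam1, lam2)

# Left and right sides of (1/n)(X1 - X2)^T Y = (1 - rho + lam2)(b1 - b2).
lhs = dot([a - b for a, b in zip(x1, x2)], y) / n
rhs = (1.0 - rho + lam2) * (b1 - b2)
print(b1, b2, lhs, rhs)
```

For this data both fitted coefficients are positive (approximately $1.325$ and $0.825$), so the same-sign hypothesis holds and the two sides of the identity agree up to floating-point error.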
[/guided]
[/step]
[step:Take absolute values and use positivity of the denominator]
By the [Cauchy-Schwarz inequality](/page/Cauchy-Schwarz%20Inequality) applied in the Euclidean inner product on $\mathbb{R}^n$ to $X_1/\sqrt{n}$ and $X_2/\sqrt{n}$,
\begin{align*}
|\rho| = \left|\frac{X_1^\top X_2}{n}\right| \le \left(\frac{|X_1|^2}{n}\right)^{1/2}\left(\frac{|X_2|^2}{n}\right)^{1/2} = 1.
\end{align*}
In particular $\rho \le 1$, so the hypothesis $\lambda_2>0$ gives $1-\rho+\lambda_2 \ge \lambda_2 >0$. Taking absolute values in
\begin{align*}
(1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2) = \frac{1}{n}(X_1-X_2)^\top Y
\end{align*}
and dividing by the positive scalar $1-\rho+\lambda_2$ gives
\begin{align*}
|\hat{\beta}_1-\hat{\beta}_2| = \frac{|(X_1-X_2)^\top Y|}{n(1-\rho+\lambda_2)}.
\end{align*}
In particular, $|\hat{\beta}_1-\hat{\beta}_2| \le \frac{|(X_1-X_2)^\top Y|}{n(1-\rho+\lambda_2)}$, which is the desired grouping estimate; the argument shows that it holds with equality.
[guided]
It remains to justify that the scalar $1-\rho+\lambda_2$ multiplying $\hat{\beta}_1-\hat{\beta}_2$ is positive: only then can it be pulled out of the absolute value unchanged and divided out. We first bound $\rho$. Applying the [Cauchy-Schwarz inequality](/page/Cauchy-Schwarz%20Inequality) in the Euclidean inner product on $\mathbb{R}^n$ to the vectors $X_1/\sqrt{n}$ and $X_2/\sqrt{n}$ gives
\begin{align*}
|\rho| = \left|\frac{X_1^\top X_2}{n}\right| \le \left(\frac{|X_1|^2}{n}\right)^{1/2}\left(\frac{|X_2|^2}{n}\right)^{1/2}.
\end{align*}
The standardization assumptions say $|X_1|^2/n=1$ and $|X_2|^2/n=1$, so the right-hand side equals $1$. Hence $|\rho|\le 1$, and in particular $\rho\le 1$. Since $\lambda_2>0$, we have
\begin{align*}
1-\rho+\lambda_2 \ge \lambda_2 > 0.
\end{align*}
Now return to the scalar identity proved in the previous step:
\begin{align*}
(1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2) = \frac{1}{n}(X_1-X_2)^\top Y.
\end{align*}
Taking absolute values and using positivity of $1-\rho+\lambda_2$ gives
\begin{align*}
(1-\rho+\lambda_2)|\hat{\beta}_1-\hat{\beta}_2| = \frac{|(X_1-X_2)^\top Y|}{n}.
\end{align*}
Dividing by $1-\rho+\lambda_2$ yields
\begin{align*}
|\hat{\beta}_1-\hat{\beta}_2| = \frac{|(X_1-X_2)^\top Y|}{n(1-\rho+\lambda_2)}.
\end{align*}
In particular, $|\hat{\beta}_1-\hat{\beta}_2| \le \frac{|(X_1-X_2)^\top Y|}{n(1-\rho+\lambda_2)}$, which is exactly the claimed grouping estimate; the argument shows that it holds with equality.
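One way to read the estimate, sketched here as a consequence rather than as part of the stated result, is to bound the numerator in terms of $\rho$. The standardization assumptions give
\begin{align*}
\frac{|X_1-X_2|^2}{n} = \frac{|X_1|^2}{n} - \frac{2X_1^\top X_2}{n} + \frac{|X_2|^2}{n} = 2(1-\rho),
\end{align*}
so the Cauchy-Schwarz inequality applied to $X_1-X_2$ and $Y$ yields $|(X_1-X_2)^\top Y| \le \sqrt{2n(1-\rho)}\,|Y|$, and hence
\begin{align*}
|\hat{\beta}_1-\hat{\beta}_2| \le \frac{\sqrt{2(1-\rho)}}{1-\rho+\lambda_2}\cdot\frac{|Y|}{\sqrt{n}}.
\end{align*}
In the extreme case $\rho=1$ the first display forces $X_1=X_2$, and the bound reduces to $\hat{\beta}_1=\hat{\beta}_2$: perfectly correlated columns with same-sign active coefficients receive identical coefficients.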
[/guided]
[/step]