Slow-Rate Lasso Prediction Inequality — Statement & Proof

Slow-Rate Lasso Prediction Inequality (Theorem # 5562)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We compare the Lasso objective at the minimizer $\hat{\beta}$ with its value at the true coefficient vector $\beta^*$. After substituting $Y=X\beta^*+\varepsilon$, the comparison gives a deterministic basic inequality involving the prediction error, the score vector $X^\top\varepsilon/n$, and the difference of $\ell^1$ penalties. On the score event, the stochastic term is controlled by the $\ell^\infty$-$\ell^1$ duality estimate, and the remaining penalty terms are rearranged using the triangle inequality. [/proofplan] [step:Use optimality of $\hat{\beta}$ to derive the basic inequality] Define the estimation error vector $\delta \in \mathbb{R}^p$ by \begin{align*} \delta := \hat{\beta}-\beta^* . \end{align*} Since $\hat{\beta}$ minimizes the Lasso objective over $\mathbb{R}^p$, evaluating the objective at the comparison point $\beta^*$ gives \begin{align*} \frac{1}{2n}|Y-X\hat{\beta}|^2+\lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}|Y-X\beta^*|^2+\lambda\|\beta^*\|_1 . \end{align*} Using $Y=X\beta^*+\varepsilon$ and $\delta=\hat{\beta}-\beta^*$, we have \begin{align*} Y-X\hat{\beta} = \varepsilon-X\delta, \qquad Y-X\beta^* = \varepsilon . \end{align*} Substituting these identities gives \begin{align*} \frac{1}{2n}|\varepsilon-X\delta|^2+\lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}|\varepsilon|^2+\lambda\|\beta^*\|_1 . \end{align*} Expanding the Euclidean square, \begin{align*} |\varepsilon-X\delta|^2 = |\varepsilon|^2 -2\varepsilon^\top X\delta + |X\delta|^2 . \end{align*} After cancelling $|\varepsilon|^2/(2n)$ from both sides, we obtain \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{1}{n}\varepsilon^\top X\delta + \lambda\|\beta^*\|_1 - \lambda\|\hat{\beta}\|_1 . \end{align*} [guided] The point of starting from optimality is that it gives an inequality without differentiability assumptions or uniqueness of the minimizer. Since $\hat{\beta}$ is a minimizer of \begin{align*} \beta \mapsto \frac{1}{2n}|Y-X\beta|^2+\lambda\|\beta\|_1 , \end{align*} its objective value is no larger than the value at any competitor. We choose the specific competitor $\beta^*$, because the model equation gives an exact simplification there: \begin{align*} \frac{1}{2n}|Y-X\hat{\beta}|^2+\lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}|Y-X\beta^*|^2+\lambda\|\beta^*\|_1 . \end{align*} Define $\delta := \hat{\beta}-\beta^*$. The model equation $Y=X\beta^*+\varepsilon$ gives \begin{align*} Y-X\hat{\beta} = X\beta^*+\varepsilon-X\hat{\beta} = \varepsilon-X(\hat{\beta}-\beta^*) = \varepsilon-X\delta, \end{align*} and also \begin{align*} Y-X\beta^*=\varepsilon . \end{align*} Therefore \begin{align*} \frac{1}{2n}|\varepsilon-X\delta|^2+\lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}|\varepsilon|^2+\lambda\|\beta^*\|_1 . \end{align*} Now expand the squared Euclidean norm. For vectors $a,b \in \mathbb{R}^n$, \begin{align*} |a-b|^2=|a|^2-2a^\top b+|b|^2 . \end{align*} With $a=\varepsilon$ and $b=X\delta$, this gives \begin{align*} |\varepsilon-X\delta|^2 = |\varepsilon|^2-2\varepsilon^\top X\delta+|X\delta|^2 . \end{align*} Substituting this expansion and cancelling the common term $|\varepsilon|^2/(2n)$ yields \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{1}{n}\varepsilon^\top X\delta + \lambda\|\beta^*\|_1 - \lambda\|\hat{\beta}\|_1 . \end{align*} This is the basic inequality: the left-hand side is the prediction error, while the right-hand side contains one stochastic score term and one deterministic penalty difference. [/guided] [/step] [step:Control the score term on $\mathcal{E}_{\lambda}$] Define the score vector $s \in \mathbb{R}^p$ by \begin{align*} s := \frac{1}{n}X^\top\varepsilon . \end{align*} Then \begin{align*} \frac{1}{n}\varepsilon^\top X\delta = s^\top\delta . \end{align*} Using the finite-dimensional [$\ell^\infty$-$\ell^1$ duality estimate](/page/Dual%20Norm), \begin{align*} s^\top\delta \leq |s^\top\delta| \leq \sum_{j=1}^p |s_j|\,|\delta_j| \leq \|s\|_\infty \|\delta\|_1 . \end{align*} On $\mathcal{E}_{\lambda}$, $\|s\|_\infty \leq \lambda/2$, so \begin{align*} \frac{1}{n}\varepsilon^\top X\delta \leq \frac{\lambda}{2}\|\delta\|_1 . \end{align*} Substituting this into the basic inequality gives \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{\lambda}{2}\|\delta\|_1 + \lambda\|\beta^*\|_1 - \lambda\|\hat{\beta}\|_1 . \end{align*} [guided] We isolate the stochastic term by naming the score vector. Define $s \in \mathbb{R}^p$ by \begin{align*} s := \frac{1}{n}X^\top\varepsilon . \end{align*} Then associativity of matrix multiplication and the definition of $s$ give \begin{align*} \frac{1}{n}\varepsilon^\top X\delta = \left(\frac{1}{n}X^\top\varepsilon\right)^\top\delta = s^\top\delta . \end{align*} We now apply the finite-dimensional [$\ell^\infty$-$\ell^1$ duality estimate](/page/Dual%20Norm) to the vectors $s,\delta \in \mathbb{R}^p$. This estimate is the coordinate inequality obtained by bounding each coefficient of $s$ by its maximum absolute value: \begin{align*} s^\top\delta \leq |s^\top\delta| \leq \sum_{j=1}^p |s_j|\,|\delta_j| \leq \|s\|_\infty \sum_{j=1}^p |\delta_j| = \|s\|_\infty \|\delta\|_1 . \end{align*} The score event $\mathcal{E}_{\lambda}$ is precisely the event on which $\|s\|_\infty \leq \lambda/2$. Therefore, on $\mathcal{E}_{\lambda}$, \begin{align*} \frac{1}{n}\varepsilon^\top X\delta \leq \frac{\lambda}{2}\|\delta\|_1 . \end{align*} Substituting this bound into the basic inequality from the previous step gives \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{\lambda}{2}\|\delta\|_1 + \lambda\|\beta^*\|_1 - \lambda\|\hat{\beta}\|_1 . \end{align*} This is the point where randomness leaves the argument: after restricting to $\mathcal{E}_{\lambda}$, the remaining estimate is deterministic. [/guided] [/step] [step:Absorb the penalty terms using the triangle inequality] The [triangle inequality](/page/Triangle%20Inequality) in $\mathbb{R}^p$ gives \begin{align*} \|\delta\|_1 = \|\hat{\beta}-\beta^*\|_1 \leq \|\hat{\beta}\|_1+\|\beta^*\|_1 . \end{align*} Therefore substituting the triangle inequality estimate into the previous bound gives \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{\lambda}{2}\left(\|\hat{\beta}\|_1+\|\beta^*\|_1\right) + \lambda\|\beta^*\|_1 - \lambda\|\hat{\beta}\|_1 . \end{align*} Collecting the coefficients of $\|\hat{\beta}\|_1$ and $\|\beta^*\|_1$, this becomes \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{3\lambda}{2}\|\beta^*\|_1 - \frac{\lambda}{2}\|\hat{\beta}\|_1 . \end{align*} Since $\lambda>0$ and $\|\hat{\beta}\|_1 \geq 0$, the last term is non-positive, hence \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{3\lambda}{2}\|\beta^*\|_1 . \end{align*} Multiplying by $2$ yields the sharper bound \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 3\lambda\|\beta^*\|_1 . \end{align*} Since $3\lambda\|\beta^*\|_1 \leq 4\lambda\|\beta^*\|_1$, the stated inequality follows: \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 4\lambda\|\beta^*\|_1 . \end{align*} [guided] We now remove the unknown error norm $\|\delta\|_1$ from the bound by comparing it to the two coefficient vectors already present. The [triangle inequality](/page/Triangle%20Inequality) in $\mathbb{R}^p$ gives \begin{align*} \|\delta\|_1 = \|\hat{\beta}-\beta^*\|_1 \leq \|\hat{\beta}\|_1+\|\beta^*\|_1 . \end{align*} Substituting this into the estimate obtained after controlling the score term yields \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{\lambda}{2}\left(\|\hat{\beta}\|_1+\|\beta^*\|_1\right) + \lambda\|\beta^*\|_1 - \lambda\|\hat{\beta}\|_1 . \end{align*} Now collect the coefficients of the two penalty norms. The coefficient of $\|\hat{\beta}\|_1$ is $\lambda/2-\lambda=-\lambda/2$, and the coefficient of $\|\beta^*\|_1$ is $\lambda/2+\lambda=3\lambda/2$. Hence \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{3\lambda}{2}\|\beta^*\|_1 - \frac{\lambda}{2}\|\hat{\beta}\|_1 . \end{align*} Because $\lambda>0$ and $\|\hat{\beta}\|_1 \geq 0$, the term $-(\lambda/2)\|\hat{\beta}\|_1$ is non-positive. Dropping a non-positive term from the right-hand side preserves the upper bound: \begin{align*} \frac{1}{2n}|X\delta|^2 \leq \frac{3\lambda}{2}\|\beta^*\|_1 . \end{align*} Multiplying both sides by $2$ gives \begin{align*} \frac{1}{n}|X\delta|^2 \leq 3\lambda\|\beta^*\|_1 . \end{align*} Finally, since $\delta=\hat{\beta}-\beta^*$ and $3\lambda\|\beta^*\|_1 \leq 4\lambda\|\beta^*\|_1$, we obtain the stated slow-rate prediction inequality: \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 4\lambda\|\beta^*\|_1 . \end{align*} [/guided] [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Triangle Inequality For Inner Product Spaces

Definitions & Concepts

Event

Explore Further

Event Definition Triangle Inequality For Inner Product Spaces Theorem #433 Radon-Nikodym Theorem (Probabilistic) Martingale Theory Cauchy-Schwarz Inequality for Covariance Probability Theory Affine Regression as Projection of the Conditional Mean Probability & Statistics Continuity of Probability Probability Theory MGF of a Sum Probability Theory GLM Score Equations Probability & Statistics Weak Stirling Probability Theory Discrete Convolution Formula Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.

Slow-Rate Lasso Prediction Inequality (Theorem # 5562)

Discussion

Proof

Prerequisites (0/2 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Slow-Rate Lasso Prediction Inequality (Theorem # 5562)

Discussion

Proof

Prerequisites (0/2 completed)

Prerequisites Graph

Explore Further