[proofplan]
The general inequality is precisely the optimality inequality satisfied by the minimizer $\hat{\beta}$ of the Lasso objective. To obtain the specialized basic inequality, we compare $\hat{\beta}$ with the true coefficient vector $\beta^*$ and substitute the model identity $Y = X\beta^* + \varepsilon$. Expanding the Euclidean square cancels the common noise term $\frac{1}{2n}|\varepsilon|^2$ and leaves the prediction error bounded by the empirical noise inner product plus the penalty difference.
[/proofplan]
[step:Use the minimizing property of $\hat{\beta}$ to compare against an arbitrary vector]
Define the Lasso objective to be the function $Q_\lambda: \mathbb{R}^p \to \mathbb{R}$ given, for each $b \in \mathbb{R}^p$, by
\begin{align*}
Q_\lambda(b) := \frac{1}{2n}|Y-Xb|^2 + \lambda\|b\|_1.
\end{align*}
Since $\hat{\beta}$ is a minimizer of $Q_\lambda$ over $\mathbb{R}^p$, for every $\beta \in \mathbb{R}^p$ one has
\begin{align*}
Q_\lambda(\hat{\beta}) \le Q_\lambda(\beta).
\end{align*}
Substituting the definition of $Q_\lambda$ gives
\begin{align*}
\frac{1}{2n}|Y-X\hat{\beta}|^2 + \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|Y-X\beta|^2 + \lambda\|\beta\|_1.
\end{align*}
[guided]
The only input needed for the general inequality is the meaning of the phrase "$\hat{\beta}$ is a Lasso solution." We make this precise by introducing the objective function $Q_\lambda: \mathbb{R}^p \to \mathbb{R}$ defined, for each $b \in \mathbb{R}^p$, by
\begin{align*}
Q_\lambda(b) := \frac{1}{2n}|Y-Xb|^2 + \lambda\|b\|_1.
\end{align*}
The hypothesis says exactly that $\hat{\beta}$ minimizes this function over all candidate coefficient vectors $b \in \mathbb{R}^p$. Therefore, if $\beta \in \mathbb{R}^p$ is any comparator, the value of the objective at $\hat{\beta}$ cannot exceed the value of the objective at $\beta$:
\begin{align*}
Q_\lambda(\hat{\beta}) \le Q_\lambda(\beta).
\end{align*}
Expanding both sides using the definition of $Q_\lambda$ gives
\begin{align*}
\frac{1}{2n}|Y-X\hat{\beta}|^2 + \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|Y-X\beta|^2 + \lambda\|\beta\|_1.
\end{align*}
This proves the comparator form of the basic inequality.
[/guided]
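As a numerical sanity check of the comparator inequality, the following sketch computes an approximate Lasso solution by proximal gradient descent (ISTA) and compares its objective value with that of sampled comparators. The solver, the synthetic data, and all names (`lasso_objective`, `soft_threshold`, `ista`) are illustrative assumptions and not part of the proof. Because ISTA returns only an approximate minimizer, the comparison allows a small tolerance.

```python
import numpy as np

def lasso_objective(X, Y, b, lam):
    """Q_lambda(b) = (1/(2n)) |Y - X b|^2 + lam * ||b||_1."""
    n = X.shape[0]
    r = Y - X @ b
    return r @ r / (2 * n) + lam * np.abs(b).sum()

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, Y, lam, n_iter=5000):
    """Approximate minimizer of Q_lambda via proximal gradient steps."""
    n, p = X.shape
    # Step size 1/L, with L the Lipschitz constant of the gradient
    # of the smooth part b -> (1/(2n)) |Y - X b|^2.
    L = np.linalg.norm(X, 2) ** 2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - Y) / n
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 0.1              # hypothetical problem sizes
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)           # no model for Y is needed at this step

beta_hat = ista(X, Y, lam)
q_hat = lasso_objective(X, Y, beta_hat, lam)

# Comparators: random vectors and small perturbations of beta_hat.
comparators = np.vstack([
    rng.standard_normal((200, p)),
    beta_hat + 0.01 * rng.standard_normal((200, p)),
])
gaps = np.array([lasso_objective(X, Y, b, lam) - q_hat for b in comparators])

# Q(beta_hat) <= Q(beta) for every sampled beta, up to solver tolerance.
all_hold = bool(np.all(gaps >= -1e-8))
print("comparator inequality holds for all sampled beta:", all_hold)
```

Sampling comparators cannot prove the inequality; it only illustrates that no sampled vector achieves a smaller objective value than the computed $\hat{\beta}$.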
[/step]
[step:Specialize the comparator inequality to $\beta^*$]
Assume now that $Y = X\beta^*+\varepsilon$ for some $\beta^* \in \mathbb{R}^p$ and $\varepsilon \in \mathbb{R}^n$. Applying the comparator inequality with $\beta = \beta^*$ gives
\begin{align*}
\frac{1}{2n}|Y-X\hat{\beta}|^2 + \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|Y-X\beta^*|^2 + \lambda\|\beta^*\|_1.
\end{align*}
Using $Y-X\beta^*=\varepsilon$ and
\begin{align*}
Y-X\hat{\beta}
= X\beta^*+\varepsilon-X\hat{\beta}
= \varepsilon-X(\hat{\beta}-\beta^*),
\end{align*}
we obtain
\begin{align*}
\frac{1}{2n}|\varepsilon-X(\hat{\beta}-\beta^*)|^2 + \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|\varepsilon|^2 + \lambda\|\beta^*\|_1.
\end{align*}
[guided]
We now apply the comparator inequality with the comparator equal to the true coefficient vector. Since the theorem assumes $\beta^* \in \mathbb{R}^p$, it is an admissible choice of comparator, so the optimality inequality proved in the previous step gives
\begin{align*}
\frac{1}{2n}|Y-X\hat{\beta}|^2 + \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|Y-X\beta^*|^2 + \lambda\|\beta^*\|_1.
\end{align*}
The model identity $Y=X\beta^*+\varepsilon$ converts both residuals into expressions involving the noise vector and the prediction error. First,
\begin{align*}
Y-X\beta^* = \varepsilon.
\end{align*}
Second,
\begin{align*}
Y-X\hat{\beta}
= X\beta^*+\varepsilon-X\hat{\beta}
= \varepsilon-X(\hat{\beta}-\beta^*).
\end{align*}
Substituting these two identities into the comparator inequality yields
\begin{align*}
\frac{1}{2n}|\varepsilon-X(\hat{\beta}-\beta^*)|^2 + \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|\varepsilon|^2 + \lambda\|\beta^*\|_1.
\end{align*}
This is the point where the statistical model assumption is used: it replaces the abstract residual comparison by an inequality involving the estimation error $\hat{\beta}-\beta^*$.
[/guided]
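The two residual identities used in this step can be checked numerically on synthetic data. The sketch below assumes a hypothetical sparse $\beta^*$, Gaussian noise, and an arbitrary vector `b` standing in for $\hat{\beta}$; none of these choices come from the theorem, and the identities hold for any vector in place of $\hat{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 5, 0.2                               # hypothetical sizes
X = rng.standard_normal((n, p))
beta_star = np.array([1.5, 0.0, -2.0, 0.0, 0.0])     # hypothetical true coefficients
eps = 0.5 * rng.standard_normal(n)                   # hypothetical noise vector
Y = X @ beta_star + eps                              # model identity

b = rng.standard_normal(p)    # arbitrary vector standing in for beta_hat

# First identity: Y - X beta* = eps.
res_star_ok = bool(np.allclose(Y - X @ beta_star, eps))

# Second identity: Y - X b = eps - X (b - beta*).
res_b_ok = bool(np.allclose(Y - X @ b, eps - X @ (b - beta_star)))

# Hence the right-hand side of the comparator inequality at beta* equals
# (1/(2n)) |eps|^2 + lam * ||beta*||_1.
q_star = np.sum((Y - X @ beta_star) ** 2) / (2 * n) + lam * np.abs(beta_star).sum()
rhs = eps @ eps / (2 * n) + lam * np.abs(beta_star).sum()
rhs_ok = bool(np.isclose(q_star, rhs))

print(res_star_ok, res_b_ok, rhs_ok)
```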
[/step]
[step:Expand the square and isolate the prediction error]
Let
\begin{align*}
\Delta := \hat{\beta}-\beta^* \in \mathbb{R}^p.
\end{align*}
Using the Euclidean identity $|a-b|^2 = |a|^2 - 2a^\top b + |b|^2$ with $a=\varepsilon \in \mathbb{R}^n$ and $b=X\Delta \in \mathbb{R}^n$, the previous inequality becomes
\begin{align*}
\frac{1}{2n}\left(|\varepsilon|^2 - 2\varepsilon^\top X\Delta + |X\Delta|^2\right)
+ \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|\varepsilon|^2 + \lambda\|\beta^*\|_1.
\end{align*}
Subtracting $\frac{1}{2n}|\varepsilon|^2$ from both sides gives
\begin{align*}
\frac{1}{2n}|X\Delta|^2 - \frac{1}{n}\varepsilon^\top X\Delta
+ \lambda\|\hat{\beta}\|_1
\le
\lambda\|\beta^*\|_1.
\end{align*}
Moving the empirical noise term and the penalty term to the right-hand side yields
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\le
\frac{1}{n}\varepsilon^\top X\Delta + \lambda(\|\beta^*\|_1-\|\hat{\beta}\|_1).
\end{align*}
Finally, substituting back $\Delta=\hat{\beta}-\beta^*$ gives
\begin{align*}
\frac{1}{2n}|X(\hat{\beta}-\beta^*)|^2
\le
\frac{1}{n}\varepsilon^\top X(\hat{\beta}-\beta^*) + \lambda(\|\beta^*\|_1-\|\hat{\beta}\|_1).
\end{align*}
This is the claimed specialized Lasso basic inequality.
[guided]
Define the estimation error vector $\Delta \in \mathbb{R}^p$ by
\begin{align*}
\Delta := \hat{\beta}-\beta^*.
\end{align*}
The specialized inequality from the previous step is
\begin{align*}
\frac{1}{2n}|\varepsilon-X\Delta|^2 + \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|\varepsilon|^2 + \lambda\|\beta^*\|_1.
\end{align*}
We expand the square using the Euclidean identity $|a-b|^2=|a|^2-2a^\top b+|b|^2$, applied with $a=\varepsilon \in \mathbb{R}^n$ and $b=X\Delta \in \mathbb{R}^n$. This gives
\begin{align*}
\frac{1}{2n}\left(|\varepsilon|^2 - 2\varepsilon^\top X\Delta + |X\Delta|^2\right)
+ \lambda\|\hat{\beta}\|_1
\le
\frac{1}{2n}|\varepsilon|^2 + \lambda\|\beta^*\|_1.
\end{align*}
The same term $\frac{1}{2n}|\varepsilon|^2$ appears on both sides, so subtracting it from both sides gives
\begin{align*}
\frac{1}{2n}|X\Delta|^2 - \frac{1}{n}\varepsilon^\top X\Delta
+ \lambda\|\hat{\beta}\|_1
\le
\lambda\|\beta^*\|_1.
\end{align*}
Now move the empirical noise term and the penalty term to the right-hand side:
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\le
\frac{1}{n}\varepsilon^\top X\Delta + \lambda(\|\beta^*\|_1-\|\hat{\beta}\|_1).
\end{align*}
Finally, substitute the definition $\Delta=\hat{\beta}-\beta^*$ back into the display. We obtain
\begin{align*}
\frac{1}{2n}|X(\hat{\beta}-\beta^*)|^2
\le
\frac{1}{n}\varepsilon^\top X(\hat{\beta}-\beta^*) + \lambda(\|\beta^*\|_1-\|\hat{\beta}\|_1),
\end{align*}
which is the specialized Lasso basic inequality.
[/guided]
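The specialized basic inequality can also be illustrated numerically. The following sketch simulates data from the model $Y = X\beta^* + \varepsilon$, computes an approximate Lasso solution with a simple proximal gradient (ISTA) loop, and evaluates both sides of the inequality. The solver, the data-generating choices, and the names `ista`, `objective`, and `slack` are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise shrinkage, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, Y, lam, n_iter=5000):
    """Approximate Lasso solution by proximal gradient descent."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n     # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        b = soft_threshold(b - X.T @ (X @ b - Y) / (n * L), lam / L)
    return b

def objective(X, Y, b, lam):
    """Lasso objective (1/(2n)) |Y - X b|^2 + lam * ||b||_1."""
    r = Y - X @ b
    return r @ r / (2 * X.shape[0]) + lam * np.abs(b).sum()

rng = np.random.default_rng(2)
n, p, lam = 80, 20, 0.1                   # hypothetical sizes and penalty level
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.0, 0.5]          # hypothetical sparse truth
eps = rng.standard_normal(n)
Y = X @ beta_star + eps

beta_hat = ista(X, Y, lam)
delta = beta_hat - beta_star

# Left-hand side: (1/(2n)) |X Delta|^2.
lhs = np.sum((X @ delta) ** 2) / (2 * n)
# Right-hand side: (1/n) eps^T X Delta + lam (||beta*||_1 - ||beta_hat||_1).
rhs = eps @ (X @ delta) / n + lam * (np.abs(beta_star).sum() - np.abs(beta_hat).sum())

slack = rhs - lhs
objective_gap = objective(X, Y, beta_star, lam) - objective(X, Y, beta_hat, lam)
print("basic inequality holds:", bool(slack >= -1e-10))
print("slack equals objective gap:", bool(np.isclose(slack, objective_gap)))
```

The second check reflects the algebra of this step: rearranging the expanded inequality shows that the slack `rhs - lhs` coincides with the objective gap between $\beta^*$ and $\hat{\beta}$, so the derivation uses only that $\hat{\beta}$ achieves an objective value no larger than that of $\beta^*$.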
[/step]