[proofplan]
The proof uses only the defining optimality of the Lasso estimator. We compare the Lasso objective at the minimizer $\hat{\beta}$ with its value at the feasible point $\beta^*$, substitute the model identity $Y = X\beta^* + \varepsilon$, and expand the resulting squared Euclidean norm. After canceling the common noise term $|\varepsilon|^2$, the remaining terms rearrange to the desired deterministic inequality.
[/proofplan]
[step:Compare the Lasso objective at $\hat{\beta}$ and $\beta^*$]
Define the Lasso objective $Q: \mathbb{R}^p \to \mathbb{R}$ by
\begin{align*}
Q(b) := \frac{1}{2n}|Y-Xb|^2 + \lambda |b|_1.
\end{align*}
Since $\hat{\beta}$ minimizes $Q$ over $\mathbb{R}^p$ and $\beta^* \in \mathbb{R}^p$ is an admissible comparison point, we have
\begin{align*}
\frac{1}{2n}|Y-X\hat{\beta}|^2 + \lambda |\hat{\beta}|_1 \leq \frac{1}{2n}|Y-X\beta^*|^2 + \lambda |\beta^*|_1.
\end{align*}
[/step]
[step:Substitute the linear model and expand the squared norm]
Define the prediction-error vector $r \in \mathbb{R}^n$ by
\begin{align*}
r := X(\hat{\beta}-\beta^*).
\end{align*}
Using $Y = X\beta^* + \varepsilon$, we first compute
\begin{align*}
Y-X\hat{\beta}=X\beta^*+\varepsilon-X\hat{\beta}.
\end{align*}
Since $X\beta^*-X\hat{\beta}=-X(\hat{\beta}-\beta^*)$, this gives
\begin{align*}
Y-X\hat{\beta}=\varepsilon-X(\hat{\beta}-\beta^*).
\end{align*}
By the definition of $r$, we therefore have
\begin{align*}
Y-X\hat{\beta}=\varepsilon-r.
\end{align*}
At the comparison point $\beta^*$, the residual is
\begin{align*}
Y-X\beta^*=\varepsilon.
\end{align*}
Therefore the optimality inequality becomes
\begin{align*}
\frac{1}{2n}|\varepsilon-r|^2+\lambda|\hat{\beta}|_1 \leq \frac{1}{2n}|\varepsilon|^2+\lambda|\beta^*|_1.
\end{align*}
Expanding the Euclidean square first gives
\begin{align*}
|\varepsilon-r|^2=(\varepsilon-r)^\top(\varepsilon-r).
\end{align*}
Multiplying out the Euclidean [inner product](/page/Inner%20Product) yields
\begin{align*}
(\varepsilon-r)^\top(\varepsilon-r)=|\varepsilon|^2-2\varepsilon^\top r+|r|^2.
\end{align*}
Thus
\begin{align*}
|\varepsilon-r|^2=|\varepsilon|^2-2\varepsilon^\top r+|r|^2.
\end{align*}
[guided]
The comparison inequality contains the residual $Y-X\hat{\beta}$, but the theorem is stated in terms of the prediction error $X(\hat{\beta}-\beta^*)$. We therefore name that vector. Define $r \in \mathbb{R}^n$ by
\begin{align*}
r := X(\hat{\beta}-\beta^*).
\end{align*}
The model identity $Y=X\beta^*+\varepsilon$ converts the Lasso residual into noise minus prediction error. First,
\begin{align*}
Y-X\hat{\beta}=X\beta^*+\varepsilon-X\hat{\beta}.
\end{align*}
Since $X\beta^*-X\hat{\beta}=-X(\hat{\beta}-\beta^*)$, this becomes
\begin{align*}
Y-X\hat{\beta}=\varepsilon-X(\hat{\beta}-\beta^*).
\end{align*}
By the definition of $r$, this is
\begin{align*}
Y-X\hat{\beta}=\varepsilon-r.
\end{align*}
At the comparison point $\beta^*$, the residual is just the noise:
\begin{align*}
Y-X\beta^*=\varepsilon.
\end{align*}
Substituting these two identities into the objective comparison gives
\begin{align*}
\frac{1}{2n}|\varepsilon-r|^2+\lambda|\hat{\beta}|_1 \leq \frac{1}{2n}|\varepsilon|^2+\lambda|\beta^*|_1.
\end{align*}
Now expand the square using the Euclidean inner product on $\mathbb{R}^n$:
\begin{align*}
|\varepsilon-r|^2=(\varepsilon-r)^\top(\varepsilon-r).
\end{align*}
The inner product expansion gives
\begin{align*}
(\varepsilon-r)^\top(\varepsilon-r)=\varepsilon^\top\varepsilon-2\varepsilon^\top r+r^\top r.
\end{align*}
Using $\varepsilon^\top\varepsilon=|\varepsilon|^2$ and $r^\top r=|r|^2$, we get
\begin{align*}
|\varepsilon-r|^2=|\varepsilon|^2-2\varepsilon^\top r+|r|^2.
\end{align*}
This expansion is the only algebraic point in the proof: it separates the residual norm into the pure noise term, the stochastic inner-product term, and the deterministic prediction-error term.
[/guided]
[/step]
[step:Cancel the common noise term and rearrange]
Substituting the expansion into the preceding inequality yields
\begin{align*}
\frac{1}{2n}\bigl(|\varepsilon|^2-2\varepsilon^\top r+|r|^2\bigr)+\lambda|\hat{\beta}|_1 \leq \frac{1}{2n}|\varepsilon|^2+\lambda|\beta^*|_1.
\end{align*}
Subtracting $\frac{1}{2n}|\varepsilon|^2$ from both sides gives
\begin{align*}
-\frac{1}{n}\varepsilon^\top r+\frac{1}{2n}|r|^2+\lambda|\hat{\beta}|_1 \leq \lambda|\beta^*|_1.
\end{align*}
Moving the inner-product term and the penalty term to the right-hand side gives
\begin{align*}
\frac{1}{2n}|r|^2 \leq \frac{1}{n}\varepsilon^\top r+\lambda\bigl(|\beta^*|_1-|\hat{\beta}|_1\bigr).
\end{align*}
Finally substituting $r=X(\hat{\beta}-\beta^*)$ gives
\begin{align*}
\frac{1}{2n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}\varepsilon^\top X(\hat{\beta}-\beta^*)+\lambda\bigl(|\beta^*|_1-|\hat{\beta}|_1\bigr),
\end{align*}
which is the desired inequality.
[/step]