[proofplan]
We bound $f(\beta_s) - f(\hat{\beta})$ at each iterate using the subgradient inequality, then apply a polarisation identity to express the resulting inner product as a telescoping difference of squared distances plus a step-size correction. The non-expansiveness of the projection ([Projection Theorem](/theorems/1985), part (ii)) ensures the projection step does not increase the distance to $\hat{\beta}$. Summing the per-step bounds over $s = 1, \ldots, k$ produces a telescoping sum, which we bound using the diameter constraint $\|\beta\|_2 \leq R$. Finally, Jensen's inequality applied to the convex function $f$ converts the bound on the average $\frac{1}{k}\sum_{s=1}^k f(\beta_s)$ into a bound on $f(\bar{\beta})$, and the choice $\eta = 2R/(L\sqrt{k})$ balances the two terms.
[/proofplan]
[step:Bound $f(\beta_s) - f(\hat{\beta})$ via the subgradient inequality]
At each iterate $\beta_s \in C$, select a subgradient $g_s \in \partial f(\beta_s)$. The [Non-Emptiness of the Subdifferential](/theorems/1987) guarantees that $\partial f(\beta_s) \neq \varnothing$ since $f$ is convex. By the definition of the subdifferential, for any $y \in \mathbb{R}^p$:
\begin{align*}
f(y) \geq f(\beta_s) + g_s^\top(y - \beta_s).
\end{align*}
Taking $y = \hat{\beta}$ and rearranging:
\begin{align*}
f(\beta_s) - f(\hat{\beta}) \leq g_s^\top(\beta_s - \hat{\beta}).
\end{align*}
[/step]
[step:Rewrite the inner product $g_s^\top(\beta_s - \hat{\beta})$ as a telescoping expression using the polarisation identity]
The projected gradient descent update is $z_{s+1} = \beta_s - \eta g_s$ followed by $\beta_{s+1} = \pi_C(z_{s+1})$. From the first relation, $g_s = \frac{1}{\eta}(\beta_s - z_{s+1})$, so
\begin{align*}
g_s^\top(\beta_s - \hat{\beta}) = \frac{1}{\eta}(\beta_s - z_{s+1})^\top(\beta_s - \hat{\beta}).
\end{align*}
Apply the polarisation identity $2a^\top b = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2$ with $a = \beta_s - z_{s+1}$ and $b = \beta_s - \hat{\beta}$. Computing $a - b = \hat{\beta} - z_{s+1}$:
\begin{align*}
2(\beta_s - z_{s+1})^\top(\beta_s - \hat{\beta}) = \|\beta_s - z_{s+1}\|_2^2 + \|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2.
\end{align*}
Dividing by $2\eta$:
\begin{align*}
g_s^\top(\beta_s - \hat{\beta}) = \frac{1}{2\eta}\bigl(\|\beta_s - z_{s+1}\|_2^2 + \|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2\bigr).
\end{align*}
Since $z_{s+1} = \beta_s - \eta g_s$, the first term satisfies $\|\beta_s - z_{s+1}\|_2^2 = \eta^2 \|g_s\|_2^2$. Substituting:
\begin{align*}
f(\beta_s) - f(\hat{\beta}) \leq \frac{\eta}{2}\|g_s\|_2^2 + \frac{1}{2\eta}\bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2\bigr).
\end{align*}
[guided]
Why use the polarisation identity? The subgradient inequality gives us $g_s^\top(\beta_s - \hat{\beta})$ -- an inner product between the gradient direction and the error. This is not directly summable in a useful way. The polarisation identity rewrites it as a difference of squared distances $\|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2$ plus the squared step $\|\beta_s - z_{s+1}\|_2^2$. The difference of squared distances will telescope when summed over $s$, while the squared step is controlled by the gradient bound.
This technique -- sometimes called the "three-point identity" in optimisation -- is the standard approach for analysing first-order methods on convex problems. It converts a linear inner-product bound into a quadratic telescoping bound.
[/guided]
[/step]
[step:Apply non-expansiveness of the projection to tighten the bound]
Since $\beta_{s+1} = \pi_C(z_{s+1})$ and $\hat{\beta} \in C$ (so $\hat{\beta} = \pi_C(\hat{\beta})$), the non-expansiveness property of the projection ([Projection Theorem](/theorems/1985), part (ii)) gives:
\begin{align*}
\|\beta_{s+1} - \hat{\beta}\|_2 = \|\pi_C(z_{s+1}) - \pi_C(\hat{\beta})\|_2 \leq \|z_{s+1} - \hat{\beta}\|_2.
\end{align*}
Squaring: $\|\beta_{s+1} - \hat{\beta}\|_2^2 \leq \|z_{s+1} - \hat{\beta}\|_2^2$, so $-\|z_{s+1} - \hat{\beta}\|_2^2 \leq -\|\beta_{s+1} - \hat{\beta}\|_2^2$. Substituting into the per-step bound:
\begin{align*}
f(\beta_s) - f(\hat{\beta}) \leq \frac{\eta}{2}\|g_s\|_2^2 + \frac{1}{2\eta}\bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|\beta_{s+1} - \hat{\beta}\|_2^2\bigr).
\end{align*}
[guided]
This is the step where the projection earns its keep. The pre-projection point $z_{s+1}$ may lie outside $C$, so $\|z_{s+1} - \hat{\beta}\|_2$ could be large. Projecting back onto $C$ can only bring us closer to $\hat{\beta}$ (since $\hat{\beta} \in C$), so $\|\beta_{s+1} - \hat{\beta}\|_2 \leq \|z_{s+1} - \hat{\beta}\|_2$. This replacement is what makes the sum telescope in $\|\beta_s - \hat{\beta}\|_2^2$ rather than in $\|z_s - \hat{\beta}\|_2^2$, which would not telescope because the update $z_{s+1} = \beta_s - \eta g_s$ involves $\beta_s$, not $z_s$.
[/guided]
[/step]
[step:Sum over $s = 1, \ldots, k$ and telescope]
Summing the per-step bound over $s = 1, \ldots, k$:
\begin{align*}
\sum_{s=1}^k \bigl[f(\beta_s) - f(\hat{\beta})\bigr] \leq \frac{\eta}{2}\sum_{s=1}^k \|g_s\|_2^2 + \frac{1}{2\eta}\sum_{s=1}^k \bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|\beta_{s+1} - \hat{\beta}\|_2^2\bigr).
\end{align*}
The second sum telescopes:
\begin{align*}
\sum_{s=1}^k \bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|\beta_{s+1} - \hat{\beta}\|_2^2\bigr) = \|\beta_1 - \hat{\beta}\|_2^2 - \|\beta_{k+1} - \hat{\beta}\|_2^2 \leq \|\beta_1 - \hat{\beta}\|_2^2.
\end{align*}
Since $\beta_1, \hat{\beta} \in C$ and $\sup_{\beta \in C}\|\beta\|_2 \leq R$, the triangle inequality gives $\|\beta_1 - \hat{\beta}\|_2 \leq \|\beta_1\|_2 + \|\hat{\beta}\|_2 \leq 2R$, so $\|\beta_1 - \hat{\beta}\|_2^2 \leq 4R^2$. By the gradient bound, $\|g_s\|_2 \leq L$ for each $s$. Dividing by $k$:
\begin{align*}
\frac{1}{k}\sum_{s=1}^k \bigl[f(\beta_s) - f(\hat{\beta})\bigr] \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}.
\end{align*}
[/step]
[step:Apply Jensen's inequality and optimise over $\eta$ to obtain the final bound]
The averaged iterate is $\bar{\beta} := \frac{1}{k}\sum_{s=1}^k \beta_s$. Since each $\beta_s \in C$ and $C$ is convex, the convex combination $\bar{\beta} \in C$. Since $f$ is convex, [Jensen's Inequality](/theorems/1977) applied to the convex function $f$ and the uniform distribution over $\{\beta_1, \ldots, \beta_k\}$ gives:
\begin{align*}
f(\bar{\beta}) = f\!\left(\frac{1}{k}\sum_{s=1}^k \beta_s\right) \leq \frac{1}{k}\sum_{s=1}^k f(\beta_s).
\end{align*}
Therefore
\begin{align*}
f(\bar{\beta}) - f(\hat{\beta}) \leq \frac{1}{k}\sum_{s=1}^k f(\beta_s) - f(\hat{\beta}) = \frac{1}{k}\sum_{s=1}^k \bigl[f(\beta_s) - f(\hat{\beta})\bigr] \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}.
\end{align*}
To minimise the right-hand side over $\eta > 0$, take the derivative with respect to $\eta$ and set it to zero:
\begin{align*}
\frac{L^2}{2} - \frac{2R^2}{\eta^2 k} = 0 \quad \implies \quad \eta^2 = \frac{4R^2}{L^2 k} \quad \implies \quad \eta = \frac{2R}{L\sqrt{k}}.
\end{align*}
Substituting $\eta = 2R/(L\sqrt{k})$ into the bound:
\begin{align*}
f(\bar{\beta}) - f(\hat{\beta}) \leq \frac{L^2}{2} \cdot \frac{2R}{L\sqrt{k}} + \frac{2R^2}{\frac{2R}{L\sqrt{k}} \cdot k} = \frac{LR}{\sqrt{k}} + \frac{2R^2 L\sqrt{k}}{2Rk} = \frac{LR}{\sqrt{k}} + \frac{LR}{\sqrt{k}} = \frac{2LR}{\sqrt{k}}.
\end{align*}
This is the stated convergence rate.
[guided]
The two terms in the bound $\frac{\eta L^2}{2} + \frac{2R^2}{\eta k}$ capture a fundamental trade-off in gradient methods:
- The term $\frac{\eta L^2}{2}$ is the **per-step noise** from taking a finite-length step. Larger step sizes $\eta$ mean each individual step overshoots more, accumulating a larger error from the squared gradient norm.
- The term $\frac{2R^2}{\eta k}$ is the **initial distance penalty** amortised over $k$ steps. Smaller step sizes $\eta$ mean slower progress toward the optimum, so it takes more steps to traverse the distance from $\beta_1$ to $\hat{\beta}$.
The optimal step size $\eta = 2R/(L\sqrt{k})$ balances these two effects, yielding $O(1/\sqrt{k})$ convergence. This rate is tight for first-order methods on Lipschitz convex functions: no algorithm using only subgradient evaluations can achieve a better rate without additional assumptions (such as strong convexity or smoothness).
Note that the step size depends on the total number of iterations $k$, which must be chosen in advance. This is characteristic of the "constant step size" variant; diminishing step-size schedules $\eta_s \to 0$ with $\sum \eta_s = \infty$ achieve the same asymptotic rate without foreknowledge of $k$, but the analysis is slightly different.
[/guided]
[/step]