Convergence of Projected Gradient Descent — Statement & Proof

Convergence of Projected Gradient Descent (Theorem # 1990)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We bound $f(\beta_s) - f(\hat{\beta})$ at each iterate using the subgradient inequality, then apply a polarisation identity to express the resulting inner product as a telescoping difference of squared distances plus a step-size correction. The non-expansiveness of the projection ([Projection Theorem](/theorems/1985), part (ii)) ensures the projection step does not increase the distance to $\hat{\beta}$. Summing the per-step bounds over $s = 1, \ldots, k$ produces a telescoping sum, which we bound using the diameter constraint $\|\beta\|_2 \leq R$. Finally, Jensen's inequality applied to the convex function $f$ converts the bound on the average $\frac{1}{k}\sum_{s=1}^k f(\beta_s)$ into a bound on $f(\bar{\beta})$, and the choice $\eta = 2R/(L\sqrt{k})$ balances the two terms. [/proofplan] [step:Bound $f(\beta_s) - f(\hat{\beta})$ via the subgradient inequality] At each iterate $\beta_s \in C$, select a subgradient $g_s \in \partial f(\beta_s)$. The [Non-Emptiness of the Subdifferential](/theorems/1987) guarantees that $\partial f(\beta_s) \neq \varnothing$ since $f$ is convex. By the definition of the subdifferential, for any $y \in \mathbb{R}^p$: \begin{align*} f(y) \geq f(\beta_s) + g_s^\top(y - \beta_s). \end{align*} Taking $y = \hat{\beta}$ and rearranging: \begin{align*} f(\beta_s) - f(\hat{\beta}) \leq g_s^\top(\beta_s - \hat{\beta}). \end{align*} [/step] [step:Rewrite the inner product $g_s^\top(\beta_s - \hat{\beta})$ as a telescoping expression using the polarisation identity] The projected gradient descent update is $z_{s+1} = \beta_s - \eta g_s$ followed by $\beta_{s+1} = \pi_C(z_{s+1})$. From the first relation, $g_s = \frac{1}{\eta}(\beta_s - z_{s+1})$, so \begin{align*} g_s^\top(\beta_s - \hat{\beta}) = \frac{1}{\eta}(\beta_s - z_{s+1})^\top(\beta_s - \hat{\beta}). \end{align*} Apply the polarisation identity $2a^\top b = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2$ with $a = \beta_s - z_{s+1}$ and $b = \beta_s - \hat{\beta}$. Computing $a - b = \hat{\beta} - z_{s+1}$: \begin{align*} 2(\beta_s - z_{s+1})^\top(\beta_s - \hat{\beta}) = \|\beta_s - z_{s+1}\|_2^2 + \|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2. \end{align*} Dividing by $2\eta$: \begin{align*} g_s^\top(\beta_s - \hat{\beta}) = \frac{1}{2\eta}\bigl(\|\beta_s - z_{s+1}\|_2^2 + \|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2\bigr). \end{align*} Since $z_{s+1} = \beta_s - \eta g_s$, the first term satisfies $\|\beta_s - z_{s+1}\|_2^2 = \eta^2 \|g_s\|_2^2$. Substituting: \begin{align*} f(\beta_s) - f(\hat{\beta}) \leq \frac{\eta}{2}\|g_s\|_2^2 + \frac{1}{2\eta}\bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2\bigr). \end{align*} [guided] Why use the polarisation identity? The subgradient inequality gives us $g_s^\top(\beta_s - \hat{\beta})$ -- an inner product between the gradient direction and the error. This is not directly summable in a useful way. The polarisation identity rewrites it as a difference of squared distances $\|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2$ plus the squared step $\|\beta_s - z_{s+1}\|_2^2$. The difference of squared distances will telescope when summed over $s$, while the squared step is controlled by the gradient bound. This technique -- sometimes called the "three-point identity" in optimisation -- is the standard approach for analysing first-order methods on convex problems. It converts a linear inner-product bound into a quadratic telescoping bound. [/guided] [/step] [step:Apply non-expansiveness of the projection to tighten the bound] Since $\beta_{s+1} = \pi_C(z_{s+1})$ and $\hat{\beta} \in C$ (so $\hat{\beta} = \pi_C(\hat{\beta})$), the non-expansiveness property of the projection ([Projection Theorem](/theorems/1985), part (ii)) gives: \begin{align*} \|\beta_{s+1} - \hat{\beta}\|_2 = \|\pi_C(z_{s+1}) - \pi_C(\hat{\beta})\|_2 \leq \|z_{s+1} - \hat{\beta}\|_2. \end{align*} Squaring: $\|\beta_{s+1} - \hat{\beta}\|_2^2 \leq \|z_{s+1} - \hat{\beta}\|_2^2$, so $-\|z_{s+1} - \hat{\beta}\|_2^2 \leq -\|\beta_{s+1} - \hat{\beta}\|_2^2$. Substituting into the per-step bound: \begin{align*} f(\beta_s) - f(\hat{\beta}) \leq \frac{\eta}{2}\|g_s\|_2^2 + \frac{1}{2\eta}\bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|\beta_{s+1} - \hat{\beta}\|_2^2\bigr). \end{align*} [guided] This is the step where the projection earns its keep. The pre-projection point $z_{s+1}$ may lie outside $C$, so $\|z_{s+1} - \hat{\beta}\|_2$ could be large. Projecting back onto $C$ can only bring us closer to $\hat{\beta}$ (since $\hat{\beta} \in C$), so $\|\beta_{s+1} - \hat{\beta}\|_2 \leq \|z_{s+1} - \hat{\beta}\|_2$. This replacement is what makes the sum telescope in $\|\beta_s - \hat{\beta}\|_2^2$ rather than in $\|z_s - \hat{\beta}\|_2^2$, which would not telescope because the update $z_{s+1} = \beta_s - \eta g_s$ involves $\beta_s$, not $z_s$. [/guided] [/step] [step:Sum over $s = 1, \ldots, k$ and telescope] Summing the per-step bound over $s = 1, \ldots, k$: \begin{align*} \sum_{s=1}^k \bigl[f(\beta_s) - f(\hat{\beta})\bigr] \leq \frac{\eta}{2}\sum_{s=1}^k \|g_s\|_2^2 + \frac{1}{2\eta}\sum_{s=1}^k \bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|\beta_{s+1} - \hat{\beta}\|_2^2\bigr). \end{align*} The second sum telescopes: \begin{align*} \sum_{s=1}^k \bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|\beta_{s+1} - \hat{\beta}\|_2^2\bigr) = \|\beta_1 - \hat{\beta}\|_2^2 - \|\beta_{k+1} - \hat{\beta}\|_2^2 \leq \|\beta_1 - \hat{\beta}\|_2^2. \end{align*} Since $\beta_1, \hat{\beta} \in C$ and $\sup_{\beta \in C}\|\beta\|_2 \leq R$, the triangle inequality gives $\|\beta_1 - \hat{\beta}\|_2 \leq \|\beta_1\|_2 + \|\hat{\beta}\|_2 \leq 2R$, so $\|\beta_1 - \hat{\beta}\|_2^2 \leq 4R^2$. By the gradient bound, $\|g_s\|_2 \leq L$ for each $s$. Dividing by $k$: \begin{align*} \frac{1}{k}\sum_{s=1}^k \bigl[f(\beta_s) - f(\hat{\beta})\bigr] \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}. \end{align*} [/step] [step:Apply Jensen's inequality and optimise over $\eta$ to obtain the final bound] The averaged iterate is $\bar{\beta} := \frac{1}{k}\sum_{s=1}^k \beta_s$. Since each $\beta_s \in C$ and $C$ is convex, the convex combination $\bar{\beta} \in C$. Since $f$ is convex, [Jensen's Inequality](/theorems/1977) applied to the convex function $f$ and the uniform distribution over $\{\beta_1, \ldots, \beta_k\}$ gives: \begin{align*} f(\bar{\beta}) = f\!\left(\frac{1}{k}\sum_{s=1}^k \beta_s\right) \leq \frac{1}{k}\sum_{s=1}^k f(\beta_s). \end{align*} Therefore \begin{align*} f(\bar{\beta}) - f(\hat{\beta}) \leq \frac{1}{k}\sum_{s=1}^k f(\beta_s) - f(\hat{\beta}) = \frac{1}{k}\sum_{s=1}^k \bigl[f(\beta_s) - f(\hat{\beta})\bigr] \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}. \end{align*} To minimise the right-hand side over $\eta > 0$, take the derivative with respect to $\eta$ and set it to zero: \begin{align*} \frac{L^2}{2} - \frac{2R^2}{\eta^2 k} = 0 \quad \implies \quad \eta^2 = \frac{4R^2}{L^2 k} \quad \implies \quad \eta = \frac{2R}{L\sqrt{k}}. \end{align*} Substituting $\eta = 2R/(L\sqrt{k})$ into the bound: \begin{align*} f(\bar{\beta}) - f(\hat{\beta}) \leq \frac{L^2}{2} \cdot \frac{2R}{L\sqrt{k}} + \frac{2R^2}{\frac{2R}{L\sqrt{k}} \cdot k} = \frac{LR}{\sqrt{k}} + \frac{2R^2 L\sqrt{k}}{2Rk} = \frac{LR}{\sqrt{k}} + \frac{LR}{\sqrt{k}} = \frac{2LR}{\sqrt{k}}. \end{align*} This is the stated convergence rate. [guided] The two terms in the bound $\frac{\eta L^2}{2} + \frac{2R^2}{\eta k}$ capture a fundamental trade-off in gradient methods: - The term $\frac{\eta L^2}{2}$ is the **per-step noise** from taking a finite-length step. Larger step sizes $\eta$ mean each individual step overshoots more, accumulating a larger error from the squared gradient norm. - The term $\frac{2R^2}{\eta k}$ is the **initial distance penalty** amortised over $k$ steps. Smaller step sizes $\eta$ mean slower progress toward the optimum, so it takes more steps to traverse the distance from $\beta_1$ to $\hat{\beta}$. The optimal step size $\eta = 2R/(L\sqrt{k})$ balances these two effects, yielding $O(1/\sqrt{k})$ convergence. This rate is tight for first-order methods on Lipschitz convex functions: no algorithm using only subgradient evaluations can achieve a better rate without additional assumptions (such as strong convexity or smoothness). Note that the step size depends on the total number of iterations $k$, which must be chosen in advance. This is characteristic of the "constant step size" variant; diminishing step-size schedules $\eta_s \to 0$ with $\sum \eta_s = \infty$ achieve the same asymptotic rate without foreknowledge of $k$, but the analysis is slightly different. [/guided] [/step]

Explore Further

Symmetrization Bound Machine Learning Universal Approximation Machine Learning SGD Convergence Machine Learning Local Minima Are Global for Convex Functions Machine Learning Rademacher Bound via VC Dimension Machine Learning Rademacher Complexity of ℓ₂-Constrained Class Machine Learning Regression Function Minimises Squared-Error Risk Machine Learning Zhang–Bartlett Machine Learning

What brings you to Androma?

Start with a route through the knowledge graph.

Convergence of Projected Gradient Descent (Theorem # 1990)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Convergence of Projected Gradient Descent (Theorem # 1990)

Discussion

Proof

Explore Further