SGD Convergence — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The proof parallels the deterministic [Convergence of Projected Gradient Descent](/theorems/1990) analysis, replacing the true subgradient with a stochastic estimate and taking conditional expectations at each step. The stochastic subgradient $\tilde{g}_s$ satisfies $\mathbb{E}[\tilde{g}_s \mid \beta_s] \in \partial f(\beta_s)$, which provides the subgradient inequality in expectation. The polarisation identity, non-expansiveness of the projection, and telescoping sum proceed exactly as in the deterministic case. The stochastic gradient's second moment bound $\mathbb{E}[\|\tilde{g}_s\|_2^2 \mid \beta_s] \leq L^2$ replaces the deterministic bound $\|g_s\|_2 \leq L$, and taking full expectations yields the same $O(1/\sqrt{k})$ rate. [/proofplan] [step:Establish the subgradient inequality in conditional expectation] At step $s$, the SGD algorithm draws an independent sample $U_s$, selects $\tilde{g}_s \in \partial \tilde{f}(\beta_s; U_s)$, and forms $z_{s+1} = \beta_s - \eta \tilde{g}_s$ followed by $\beta_{s+1} = \pi_C(z_{s+1})$. Define $g_s := \mathbb{E}[\tilde{g}_s \mid \beta_s]$. Since $\tilde{f}(\cdot; u)$ is convex for each $u$ and $f(\beta) = \mathbb{E}\tilde{f}(\beta; U)$, a standard interchange of expectation and subdifferential gives $g_s \in \partial f(\beta_s)$. (The interchange is valid because $\tilde{f}(\cdot; U)$ is convex and finite-valued on $\mathbb{R}^p$, which ensures the subdifferential of the expectation contains the expectation of the subdifferentials.) By the definition of the subdifferential: \begin{align*} f(\beta_s) - f(\hat{\beta}) \leq g_s^\top(\beta_s - \hat{\beta}) = \mathbb{E}[\tilde{g}_s^\top(\beta_s - \hat{\beta}) \mid \beta_s], \end{align*} where the equality uses the fact that $\beta_s$ is measurable with respect to its own conditioning $\sigma$-algebra, so it can be taken out of the conditional expectation. [guided] The passage from deterministic to stochastic gradient descent hinges on the observation that the stochastic subgradient $\tilde{g}_s$ is an unbiased estimator of a true subgradient $g_s \in \partial f(\beta_s)$. Why does this suffice? The subgradient inequality $f(\beta_s) - f(\hat{\beta}) \leq g_s^\top(\beta_s - \hat{\beta})$ is a deterministic statement about the convex function $f$ and its subdifferential. The stochasticity enters only through the identity $g_s = \mathbb{E}[\tilde{g}_s \mid \beta_s]$, which allows us to replace $g_s^\top(\beta_s - \hat{\beta})$ by $\mathbb{E}[\tilde{g}_s^\top(\beta_s - \hat{\beta}) \mid \beta_s]$. Crucially, $\beta_s$ depends on the past samples $U_1, \ldots, U_{s-1}$ but is independent of $U_s$. This independence is what makes $\mathbb{E}[\tilde{g}_s \mid \beta_s]$ a well-defined conditional expectation over the fresh randomness $U_s$, with $\beta_s$ acting as a frozen parameter. [/guided] [/step] [step:Apply the polarisation identity and non-expansiveness to obtain a per-step bound] Following the same algebraic steps as in the [Convergence of Projected Gradient Descent](/theorems/1990) proof, use the update $z_{s+1} = \beta_s - \eta \tilde{g}_s$ to write $\tilde{g}_s = \frac{1}{\eta}(\beta_s - z_{s+1})$. The polarisation identity $2a^\top b = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2$ with $a = \beta_s - z_{s+1}$ and $b = \beta_s - \hat{\beta}$ gives: \begin{align*} \tilde{g}_s^\top(\beta_s - \hat{\beta}) = \frac{1}{2\eta}\bigl(\|\beta_s - z_{s+1}\|_2^2 + \|\beta_s - \hat{\beta}\|_2^2 - \|z_{s+1} - \hat{\beta}\|_2^2\bigr). \end{align*} Since $\|\beta_s - z_{s+1}\|_2^2 = \eta^2\|\tilde{g}_s\|_2^2$ and the non-expansiveness of the projection ([Projection Theorem](/theorems/1985), part (ii)) gives $\|\beta_{s+1} - \hat{\beta}\|_2^2 = \|\pi_C(z_{s+1}) - \pi_C(\hat{\beta})\|_2^2 \leq \|z_{s+1} - \hat{\beta}\|_2^2$ (using $\hat{\beta} \in C$): \begin{align*} \tilde{g}_s^\top(\beta_s - \hat{\beta}) \leq \frac{\eta}{2}\|\tilde{g}_s\|_2^2 + \frac{1}{2\eta}\bigl(\|\beta_s - \hat{\beta}\|_2^2 - \|\beta_{s+1} - \hat{\beta}\|_2^2\bigr). \end{align*} [/step] [step:Take conditional and full expectations] Take the conditional expectation $\mathbb{E}[\cdot \mid \beta_s]$ on both sides of the per-step bound. From the first step, $f(\beta_s) - f(\hat{\beta}) \leq \mathbb{E}[\tilde{g}_s^\top(\beta_s - \hat{\beta}) \mid \beta_s]$, so: \begin{align*} f(\beta_s) - f(\hat{\beta}) \leq \frac{\eta}{2}\mathbb{E}[\|\tilde{g}_s\|_2^2 \mid \beta_s] + \frac{1}{2\eta}\bigl(\|\beta_s - \hat{\beta}\|_2^2 - \mathbb{E}[\|\beta_{s+1} - \hat{\beta}\|_2^2 \mid \beta_s]\bigr). \end{align*} Now take the full expectation $\mathbb{E}[\cdot]$: \begin{align*} \mathbb{E}[f(\beta_s)] - f(\hat{\beta}) \leq \frac{\eta}{2}\mathbb{E}[\|\tilde{g}_s\|_2^2] + \frac{1}{2\eta}\bigl(\mathbb{E}[\|\beta_s - \hat{\beta}\|_2^2] - \mathbb{E}[\|\beta_{s+1} - \hat{\beta}\|_2^2]\bigr), \end{align*} where the tower property $\mathbb{E}[\mathbb{E}[\cdot \mid \beta_s]] = \mathbb{E}[\cdot]$ has been applied. [guided] The conditional expectation step is the heart of the stochastic analysis. Conditioning on $\beta_s$ freezes the current iterate, so the randomness comes only from $U_s$ (through $\tilde{g}_s$). Two quantities require conditioning: 1. The stochastic gradient norm: $\mathbb{E}[\|\tilde{g}_s\|_2^2 \mid \beta_s]$ is the conditional second moment, which is bounded by $L^2$ by hypothesis. 2. The next-iterate distance: $\mathbb{E}[\|\beta_{s+1} - \hat{\beta}\|_2^2 \mid \beta_s]$ appears because $\beta_{s+1}$ depends on the random draw $U_s$. Taking full expectations via the tower property converts everything to unconditional expectations, producing the same telescoping structure as in the deterministic case. The full-expectation step uses the tower property: $\mathbb{E}[\|\beta_s - \hat{\beta}\|_2^2 - \mathbb{E}[\|\beta_{s+1} - \hat{\beta}\|_2^2 \mid \beta_s]] = \mathbb{E}[\|\beta_s - \hat{\beta}\|_2^2] - \mathbb{E}[\|\beta_{s+1} - \hat{\beta}\|_2^2]$, preserving the telescoping form. [/guided] [/step] [step:Sum over $s = 1, \ldots, k$, telescope, and apply Jensen's inequality] Sum over $s = 1, \ldots, k$: \begin{align*} \sum_{s=1}^k \bigl[\mathbb{E}[f(\beta_s)] - f(\hat{\beta})\bigr] \leq \frac{\eta}{2}\sum_{s=1}^k \mathbb{E}[\|\tilde{g}_s\|_2^2] + \frac{1}{2\eta}\bigl(\mathbb{E}[\|\beta_1 - \hat{\beta}\|_2^2] - \mathbb{E}[\|\beta_{k+1} - \hat{\beta}\|_2^2]\bigr). \end{align*} By the second moment bound, $\mathbb{E}[\|\tilde{g}_s\|_2^2] = \mathbb{E}[\mathbb{E}[\|\tilde{g}_s\|_2^2 \mid \beta_s]] \leq L^2$. The telescoping sum satisfies $\mathbb{E}[\|\beta_1 - \hat{\beta}\|_2^2] - \mathbb{E}[\|\beta_{k+1} - \hat{\beta}\|_2^2] \leq \mathbb{E}[\|\beta_1 - \hat{\beta}\|_2^2]$. Since $\beta_1 \in C$ (a deterministic initialisation or a point in $C$) and $\hat{\beta} \in C$ with $\sup_{\beta \in C}\|\beta\|_2 \leq R$, the triangle inequality gives $\|\beta_1 - \hat{\beta}\|_2 \leq 2R$, so $\mathbb{E}[\|\beta_1 - \hat{\beta}\|_2^2] \leq 4R^2$. Dividing by $k$: \begin{align*} \frac{1}{k}\sum_{s=1}^k \mathbb{E}[f(\beta_s)] - f(\hat{\beta}) \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}. \end{align*} The averaged iterate is $\bar{\beta} = \frac{1}{k}\sum_{s=1}^k \beta_s$. Since $f$ is convex, [Jensen's Inequality](/theorems/1977) gives $f(\bar{\beta}) \leq \frac{1}{k}\sum_{s=1}^k f(\beta_s)$. Taking expectations: \begin{align*} \mathbb{E}[f(\bar{\beta})] - f(\hat{\beta}) \leq \frac{1}{k}\sum_{s=1}^k \mathbb{E}[f(\beta_s)] - f(\hat{\beta}) \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}. \end{align*} Substituting the optimal step size $\eta = 2R/(L\sqrt{k})$ (which balances $\frac{\eta L^2}{2} = \frac{LR}{\sqrt{k}}$ and $\frac{2R^2}{\eta k} = \frac{LR}{\sqrt{k}}$): \begin{align*} \mathbb{E}[f(\bar{\beta})] - f(\hat{\beta}) \leq \frac{LR}{\sqrt{k}} + \frac{LR}{\sqrt{k}} = \frac{2LR}{\sqrt{k}}. \end{align*} [guided] The structure of this proof mirrors the deterministic projected gradient descent analysis almost exactly. The only differences are: 1. **Stochastic gradients**: The deterministic subgradient bound $\|g_s\|_2 \leq L$ is replaced by the second moment bound $\mathbb{E}[\|\tilde{g}_s\|_2^2 \mid \beta_s] \leq L^2$. Note that the second moment bound is slightly stronger than just assuming $\|\tilde{g}_s\|_2 \leq L$ almost surely -- it also covers the case where $\tilde{g}_s$ has large norm with small probability, as long as the expected squared norm is bounded. 2. **Expectations throughout**: Every inequality in the deterministic proof becomes an inequality in expectation. The telescoping structure survives because the tower property of conditional expectation preserves additive structure. 3. **Same rate**: The final bound $\mathbb{E}[f(\bar{\beta})] - f(\hat{\beta}) \leq 2LR/\sqrt{k}$ is identical to the deterministic rate. Stochastic gradient descent achieves the same asymptotic convergence as batch gradient descent, despite using only a single sample per step. The cost is that the bound holds in expectation rather than deterministically, and the constant may be larger if $\mathbb{E}[\|\tilde{g}_s\|_2^2]$ is much larger than $\|g_s\|_2^2$ (due to variance in the stochastic gradient). [/guided] [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

SGD Convergence (Theorem # 1992)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

SGD Convergence (Theorem # 1992)

Discussion

Proof

Explore Further