Contraction Lemma — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We show that the Lipschitz map $\phi$ applied to Rademacher sums can be "contracted" coordinate by coordinate. The argument proceeds by induction on the coordinate index: at each step $i$, we condition on all Rademacher variables except $\varepsilon_i$, exploit the symmetry $\varepsilon_i = \pm 1$ to decouple the supremum into two independent optimisations, apply the Lipschitz bound to replace $\phi(y_i h(x_i))$ by $L \cdot h(x_i)$, and recombine. After processing all $n$ coordinates, every $\phi$ term has been replaced by a scaled linear term, yielding $\hat{R}(\mathcal{F}(z_{1:n})) \leq L\,\hat{R}(\mathcal{H}(x_{1:n}))$. [/proofplan] [step:Establish the one-coordinate replacement claim] [claim:One-Coordinate Contraction] For any index $i \in \{1, \ldots, n\}$ and any function $A : \mathcal{H} \times \{-1,1\}^{n-1} \to \mathbb{R}$ (where $\varepsilon_{-i}$ denotes the Rademacher vector with the $i$-th coordinate removed), \begin{align*} \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{1}{n}\varepsilon_i \phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\}\right] \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_i h(x_i) + A(h, \varepsilon_{-i})\right\}\right]. \end{align*} [/claim] [proof] Condition on $\varepsilon_{-i}$. Since $\varepsilon_i$ takes values $+1$ and $-1$ each with probability $1/2$, the conditional expectation of the left-hand side equals \begin{align*} \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{-\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\}. \end{align*} Since the two suprema are taken independently, we may bound this by optimising over pairs $(h, g) \in \mathcal{H} \times \mathcal{H}$: \begin{align*} &\leq \frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{1}{n}\bigl[\phi(y_i h(x_i)) - \phi(y_i g(x_i))\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}. \end{align*} By the Lipschitz condition on $\phi$ with constant $L$ (which applies because $|y_i h(x_i)| \leq r$ and $|y_i g(x_i)| \leq r$ since $|y_i| = 1$ and $|h(x_i)| \leq r$): \begin{align*} \phi(y_i h(x_i)) - \phi(y_i g(x_i)) \leq |\phi(y_i h(x_i)) - \phi(y_i g(x_i))| \leq L|y_i h(x_i) - y_i g(x_i)| = L|h(x_i) - g(x_i)|, \end{align*} where the final equality uses $|y_i| = 1$. By symmetry of the supremum over $(h, g)$ -- swapping the roles of $h$ and $g$ replaces $|h(x_i) - g(x_i)|$ by $h(x_i) - g(x_i)$ without loss -- we obtain \begin{align*} \frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{L}{n}\bigl[h(x_i) - g(x_i)\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}. \end{align*} Separating the supremum over the pair into two independent suprema (each over a single element of $\mathcal{H}$): \begin{align*} = \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}h(x_i) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{g \in \mathcal{H}}\!\left\{-\frac{L}{n}g(x_i) + A(g, \varepsilon_{-i})\right\}. \end{align*} This is the conditional expectation over $\varepsilon_i = \pm 1$ of $\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_i h(x_i) + A(h, \varepsilon_{-i})\right\}$. Taking full expectations over $\varepsilon_{-i}$ proves the claim. [/proof] [guided] The strategy is a decoupling argument: condition on everything except $\varepsilon_i$, split the resulting coin flip into its two outcomes, and exploit the Lipschitz bound on the difference that appears. Condition on $\varepsilon_{-i}$ (the Rademacher vector with the $i$-th coordinate removed). Since $\varepsilon_i$ is an independent Rademacher variable taking values $+1$ and $-1$ each with probability $1/2$, the conditional expectation of the left-hand side becomes \begin{align*} \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{-\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\}. \end{align*} Why do the two suprema decouple? Because they are taken independently -- the $\varepsilon_i = +1$ case optimises over its own $h$, and the $\varepsilon_i = -1$ case optimises over a potentially different $h$. We exploit this by introducing two independent optimisation variables $(h, g) \in \mathcal{H} \times \mathcal{H}$: \begin{align*} &\leq \frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{1}{n}\bigl[\phi(y_i h(x_i)) - \phi(y_i g(x_i))\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}. \end{align*} Now we apply the Lipschitz condition. We must verify that the arguments of $\phi$ lie in $[-r, r]$: since $|y_i| = 1$ and $|h(x_i)| \leq r$ by the definition $r = \sup_{x, h} |h(x)|$, we have $|y_i h(x_i)| = |h(x_i)| \leq r$, and similarly $|y_i g(x_i)| \leq r$. The Lipschitz bound gives \begin{align*} \phi(y_i h(x_i)) - \phi(y_i g(x_i)) \leq |\phi(y_i h(x_i)) - \phi(y_i g(x_i))| \leq L|y_i h(x_i) - y_i g(x_i)| = L|h(x_i) - g(x_i)|, \end{align*} where the final equality uses $|y_i| = 1$. How do we remove the absolute value? The key observation is a symmetry argument: the supremum over $(h, g) \in \mathcal{H} \times \mathcal{H}$ is free to choose whichever pair it likes, so it can always arrange $h(x_i) \geq g(x_i)$, making $|h(x_i) - g(x_i)| = h(x_i) - g(x_i)$. Therefore \begin{align*} \frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{L}{n}\bigl[h(x_i) - g(x_i)\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}. \end{align*} Finally, we re-separate the joint supremum into two independent suprema. The expression $\frac{L}{n}h(x_i) + A(h, \varepsilon_{-i})$ depends only on $h$, and $-\frac{L}{n}g(x_i) + A(g, \varepsilon_{-i})$ depends only on $g$, so \begin{align*} = \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}h(x_i) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{g \in \mathcal{H}}\!\left\{-\frac{L}{n}g(x_i) + A(g, \varepsilon_{-i})\right\}. \end{align*} Reading the right-hand side as the conditional expectation over a Rademacher $\varepsilon_i = \pm 1$: the $\varepsilon_i = +1$ term gives $\sup_h\!\left\{\frac{L}{n}h(x_i) + A\right\}$ and the $\varepsilon_i = -1$ term gives $\sup_g\!\left\{-\frac{L}{n}g(x_i) + A\right\}$, which is exactly the conditional expectation of $\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_i h(x_i) + A(h, \varepsilon_{-i})\right\}$. Taking the full expectation over $\varepsilon_{-i}$ proves the claim. [/guided] [/step] [step:Iterate the one-coordinate claim over all $n$ coordinates] The empirical Rademacher complexity of the composed class is \begin{align*} \hat{R}(\mathcal{F}(z_{1:n})) = \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i \phi(y_i h(x_i))\right]. \end{align*} Apply the one-coordinate contraction claim with $i = 1$ and \begin{align*} A(h, \varepsilon_{-1}) = \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j)). \end{align*} The claim gives \begin{align*} \hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j))\right\}\right]. \end{align*} Now apply the claim with $i = 2$ and the updated remainder $A(h, \varepsilon_{-2}) = \frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=3}^n \varepsilon_j \phi(y_j h(x_j))$. This replaces $\varepsilon_2 \phi(y_2 h(x_2))$ by $L\varepsilon_2 h(x_2)$. Continuing inductively for $i = 3, 4, \ldots, n$, after all $n$ applications every $\phi(y_i h(x_i))$ term has been replaced by $L \cdot h(x_i)$: \begin{align*} \hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{L}{n}\sum_{i=1}^n \varepsilon_i h(x_i)\right] = L\,\hat{R}(\mathcal{H}(x_{1:n})). \end{align*} [guided] We perform the replacement one coordinate at a time, beginning with $i = 1$. The empirical Rademacher complexity of the composed class is \begin{align*} \hat{R}(\mathcal{F}(z_{1:n})) = \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i \phi(y_i h(x_i))\right]. \end{align*} To apply the one-coordinate contraction claim at $i = 1$, we must identify the remainder function $A$. Set \begin{align*} A(h, \varepsilon_{-1}) = \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j)). \end{align*} This is a valid choice because $A$ depends on $h$ and on $\varepsilon_{-1} = (\varepsilon_2, \ldots, \varepsilon_n)$ but not on $\varepsilon_1$, which is exactly what the claim requires. The claim gives \begin{align*} \hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j))\right\}\right]. \end{align*} Now apply the claim at $i = 2$. The updated remainder is \begin{align*} A(h, \varepsilon_{-2}) = \frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=3}^n \varepsilon_j \phi(y_j h(x_j)), \end{align*} which depends on $h$ and on $(\varepsilon_1, \varepsilon_3, \ldots, \varepsilon_n)$ but not on $\varepsilon_2$. Why is this still a valid remainder function? Because the claim applies to an arbitrary function $A : \mathcal{H} \times \{-1,1\}^{n-1} \to \mathbb{R}$ -- it imposes no structural requirement on $A$ beyond measurability. So the fact that some coordinates already carry $L \cdot h$ terms rather than $\phi$ terms is irrelevant. The claim replaces $\varepsilon_2 \phi(y_2 h(x_2))$ by $L\varepsilon_2 h(x_2)$: \begin{align*} \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_1 h(x_1) + \frac{L}{n}\varepsilon_2 h(x_2) + \frac{1}{n}\sum_{j=3}^n \varepsilon_j \phi(y_j h(x_j))\right\}\right]. \end{align*} Continuing inductively for $i = 3, 4, \ldots, n$, at each step the remainder $A$ contains the already-replaced coordinates $\frac{L}{n}\sum_{j<i}\varepsilon_j h(x_j)$ and the not-yet-replaced coordinates $\frac{1}{n}\sum_{j>i}\varepsilon_j \phi(y_j h(x_j))$. The claim is applied with this $A$, replacing $\varepsilon_i \phi(y_i h(x_i))$ by $L\varepsilon_i h(x_i)$. After all $n$ applications: \begin{align*} \hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{L}{n}\sum_{i=1}^n \varepsilon_i h(x_i)\right] = L\,\hat{R}(\mathcal{H}(x_{1:n})). \end{align*} The final equality is the definition of the empirical Rademacher complexity of $\mathcal{H}$ evaluated at $x_{1:n}$. [/guided] [/step] [step:Deduce the population-level bound $R_n(\mathcal{F}) \leq L\, R_n(\mathcal{H})$] The population Rademacher complexity is $R_n(\mathcal{F}) = \mathbb{E}_{Z_{1:n}}[\hat{R}(\mathcal{F}(Z_{1:n}))]$, where the outer expectation is over the i.i.d. data $Z_1, \ldots, Z_n$. Applying the empirical bound pointwise for every realisation of the data: \begin{align*} R_n(\mathcal{F}) = \mathbb{E}_{Z_{1:n}}\bigl[\hat{R}(\mathcal{F}(Z_{1:n}))\bigr] \leq \mathbb{E}_{Z_{1:n}}\bigl[L\,\hat{R}(\mathcal{H}(X_{1:n}))\bigr] = L\,\mathbb{E}_{Z_{1:n}}\bigl[\hat{R}(\mathcal{H}(X_{1:n}))\bigr] = L\, R_n(\mathcal{H}), \end{align*} where the linearity of expectation pulls $L$ outside and we use the fact that the population Rademacher complexity of $\mathcal{H}$ is $R_n(\mathcal{H}) = \mathbb{E}_{X_{1:n}}[\hat{R}(\mathcal{H}(X_{1:n}))]$. This completes the proof. [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Contraction Lemma (Theorem # 1981)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Contraction Lemma (Theorem # 1981)

Discussion

Proof

Explore Further