Rademacher Contraction Inequality — Statement & Proof

Rademacher Contraction Inequality (Theorem # 9825)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We first prove the inequality for a finite subclass by identifying each function with its vector of values at the fixed sample points. The finite proof is a coordinate-by-coordinate contraction argument: replacing one coordinate $t_i$ by $\phi_i(t_i)$ cannot increase the Rademacher average because the two-point oscillation created by flipping the sign $\varepsilon_i$ is controlled by the Lipschitz bound. Iterating this replacement over all coordinates gives the finite-class inequality with the sharp constant $L$. The general statement follows by applying the finite result to finite subclasses and passing to the relevant supremum convention. [/proofplan] [step:Reduce the finite-subclass statement to a finite vector comparison] Let $\mathcal G\subset\mathcal F$ be a nonempty finite subclass. Define the finite nonempty set \begin{align*} T_{\mathcal G}:=\{(f(x_1),\dots,f(x_n))\in\mathbb R^n:f\in\mathcal G\}. \end{align*} Define the map \begin{align*} \Phi: T_{\mathcal G} \to \mathbb R^n \end{align*} by \begin{align*} \Phi(t) = (\phi_1(t_1),\dots,\phi_n(t_n)) \end{align*} for each $t=(t_1,\dots,t_n)\in T_{\mathcal G}$. Then \begin{align*} \sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i)) = \sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i), \end{align*} and \begin{align*} \sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i f(x_i) = \sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i t_i. \end{align*} It is therefore enough, for finite subclasses, to prove that every nonempty finite set $T\subset\mathbb R^n$ satisfies \begin{align*} \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i)\right] \le L\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i t_i\right]. \end{align*} [guided] The fixed points $x_1,\dots,x_n$ mean that every function $f:S\to\mathbb R$ contributes only the finite vector of sample values \begin{align*} (f(x_1),\dots,f(x_n))\in\mathbb R^n. \end{align*} For a nonempty finite subclass $\mathcal G\subset\mathcal F$, we name the corresponding finite nonempty vector set \begin{align*} T_{\mathcal G}:=\{(f(x_1),\dots,f(x_n))\in\mathbb R^n:f\in\mathcal G\}. \end{align*} This is the right reduction because the Rademacher sums only see the values of $f$ at the sample points. If $t=(t_1,\dots,t_n)\in T_{\mathcal G}$ comes from $f\in\mathcal G$, then $t_i=f(x_i)$ for each $i$. Thus the transformed Rademacher sum becomes \begin{align*} \sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i)) = \sum_{i=1}^{n}\varepsilon_i\phi_i(t_i). \end{align*} Taking the supremum over $f\in\mathcal G$ is the same as taking the supremum over all sample-value vectors $t\in T_{\mathcal G}$, so \begin{align*} \sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i)) = \sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i). \end{align*} The same identification without the functions $\phi_i$ gives \begin{align*} \sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i f(x_i) = \sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i t_i. \end{align*} So the finite-subclass theorem is exactly a finite-dimensional comparison for a finite set $T\subset\mathbb R^n$. [/guided] [/step] [step:Prove the one-coordinate contraction estimate] Let $T\subset\mathbb R^n$ be finite and nonempty. Fix an index $k\in\{1,\dots,n\}$. Let $a:T\to\mathbb R$ be an arbitrary function, and let $\phi:\mathbb R\to\mathbb R$ be $L$-Lipschitz. We claim that \begin{align*} \frac{1}{2}\sup_{t\in T}(a(t)+\phi(t_k)) + \frac{1}{2}\sup_{t\in T}(a(t)-\phi(t_k)) \le \frac{1}{2}\sup_{t\in T}(a(t)+L t_k) + \frac{1}{2}\sup_{t\in T}(a(t)-L t_k). \end{align*} Because $T$ is finite, choose $s,u\in T$ such that \begin{align*} \sup_{t\in T}(a(t)+\phi(t_k))=a(s)+\phi(s_k) \end{align*} and \begin{align*} \sup_{t\in T}(a(t)-\phi(t_k))=a(u)-\phi(u_k). \end{align*} Then the left-hand side is \begin{align*} \frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}. \end{align*} Since $\phi$ is $L$-Lipschitz, \begin{align*} \phi(s_k)-\phi(u_k)\le L|s_k-u_k|. \end{align*} If $s_k\ge u_k$, then \begin{align*} \frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\} \le \frac{1}{2}\{a(s)+L s_k+a(u)-L u_k\}, \end{align*} which is bounded by the right-hand side. If $s_k<u_k$, then \begin{align*} \frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\} \le \frac{1}{2}\{a(s)-L s_k+a(u)+L u_k\}, \end{align*} which is again bounded by the right-hand side. This proves the one-coordinate estimate. [guided] The purpose of this step is to isolate a single sign flip. We freeze every part of the Rademacher sum except the $k$-th coordinate and compare the average over the two values of the sign $\varepsilon_k$. The arbitrary function $a:T\to\mathbb R$ represents all frozen terms, while $t_k$ is the one coordinate still being contracted. Because $T$ is finite and nonempty, both suprema are attained. Choose $s,u\in T$ such that \begin{align*} \sup_{t\in T}(a(t)+\phi(t_k))=a(s)+\phi(s_k) \end{align*} and \begin{align*} \sup_{t\in T}(a(t)-\phi(t_k))=a(u)-\phi(u_k). \end{align*} Therefore the left-hand side of the desired one-coordinate inequality is \begin{align*} \frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}. \end{align*} The Lipschitz hypothesis gives \begin{align*} \phi(s_k)-\phi(u_k)\le |\phi(s_k)-\phi(u_k)|\le L|s_k-u_k|. \end{align*} Now the sign of $s_k-u_k$ tells us which of the two linear perturbations should receive $s$ and which should receive $u$. If $s_k\ge u_k$, then $|s_k-u_k|=s_k-u_k$, so \begin{align*} \frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\} \le \frac{1}{2}\{a(s)+L s_k+a(u)-L u_k\}. \end{align*} Since $s,u\in T$, the last expression is bounded by \begin{align*} \frac{1}{2}\sup_{t\in T}(a(t)+L t_k) + \frac{1}{2}\sup_{t\in T}(a(t)-L t_k). \end{align*} If $s_k<u_k$, then $|s_k-u_k|=u_k-s_k$, and the same computation with the two signs interchanged gives \begin{align*} \frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\} \le \frac{1}{2}\{a(s)-L s_k+a(u)+L u_k\}, \end{align*} which is again bounded by \begin{align*} \frac{1}{2}\sup_{t\in T}(a(t)+L t_k) + \frac{1}{2}\sup_{t\in T}(a(t)-L t_k). \end{align*} Thus replacing the nonlinear coordinate $\phi(t_k)$ by the linear coordinate $L t_k$ cannot decrease the two-sign average. [/guided] [/step] [step:Replace the transformed coordinates one at a time] For each $m\in\{0,1,\dots,n\}$, define the map \begin{align*} \Psi_m: T \to \mathbb R^n \end{align*} by \begin{align*} \Psi_m(t) = (L t_1,\dots,L t_m,\phi_{m+1}(t_{m+1}),\dots,\phi_n(t_n)) \end{align*} for each $t=(t_1,\dots,t_n)\in T$. Thus $\Psi_0(t)=(\phi_1(t_1),\dots,\phi_n(t_n))$ and $\Psi_n(t)=(L t_1,\dots,L t_n)$. Fix $m\in\{0,\dots,n-1\}$. Condition on all Rademacher variables except $\varepsilon_{m+1}$. For a fixed choice of those other signs, define $a:T\to\mathbb R$ by \begin{align*} a(t):=\sum_{i=1}^{m}\varepsilon_i L t_i+\sum_{i=m+2}^{n}\varepsilon_i\phi_i(t_i). \end{align*} Applying the one-coordinate estimate with $k=m+1$ and $\phi=\phi_{m+1}$ gives \begin{align*} \mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right] \le \mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right]. \end{align*} Taking expectation over the remaining signs yields \begin{align*} \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right] \le \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right]. \end{align*} Iterating this inequality from $m=0$ to $m=n-1$ gives \begin{align*} \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i)\right] \le \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i L t_i\right]. \end{align*} Since $L\ge0$, the last expression equals \begin{align*} L\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i t_i\right]. \end{align*} [guided] For each $m\in\{0,1,\dots,n\}$, define the map \begin{align*} \Psi_m:T\to\mathbb R^n,\qquad t\mapsto (L t_1,\dots,L t_m,\phi_{m+1}(t_{m+1}),\dots,\phi_n(t_n)) \end{align*} for each $t=(t_1,\dots,t_n)\in T$. The map $\Psi_m$ records the state after the first $m$ coordinates have been replaced by their linear contracted versions. Thus $\Psi_0(t)=(\phi_1(t_1),\dots,\phi_n(t_n))$, while $\Psi_n(t)=(L t_1,\dots,L t_n)$. Fix $m\in\{0,\dots,n-1\}$. We compare the averages at stages $m$ and $m+1$ by conditioning on every Rademacher sign except $\varepsilon_{m+1}$. For a fixed choice of the remaining signs, define the function $a:T\to\mathbb R$ by \begin{align*} a(t):=\sum_{i=1}^{m}\varepsilon_i L t_i+\sum_{i=m+2}^{n}\varepsilon_i\phi_i(t_i). \end{align*} This function contains exactly the frozen part of the Rademacher sum. The only variable part left is the two-sign average in the coordinate $m+1$. Since $\phi_{m+1}$ is $L$-Lipschitz by the theorem hypothesis, applying the one-coordinate estimate with $k=m+1$ and $\phi=\phi_{m+1}$ gives \begin{align*} \mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right] \le \mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right]. \end{align*} Now take expectation over the other signs. Since $T$ is finite, all displayed suprema are maxima over finitely many real-valued random variables, hence measurable and integrable on the finite Rademacher [probability space](/page/Probability%20Space). Therefore conditioning and iterated expectation are legitimate, and we obtain \begin{align*} \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right] \le \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right]. \end{align*} Iterating this inequality for $m=0,1,\dots,n-1$ replaces the transformed coordinates one at a time and gives \begin{align*} \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i)\right] \le \mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i L t_i\right]. \end{align*} Because $L\ge0$, multiplication by $L$ commutes with the supremum over $T$, so the right-hand side is \begin{align*} L\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i t_i\right]. \end{align*} [/guided] [/step] [step:Pass from finite subclasses to the stated class] For every nonempty finite subclass $\mathcal G\subset\mathcal F$, the preceding finite-class argument gives \begin{align*} \mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))\right] \le L\mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i)\right]. \end{align*} The factor $1/n$ is harmless because it is positive and common to both sides. Under the nonempty finite-subclass approximation convention, take the supremum over nonempty finite subclasses $\mathcal G\subset\mathcal F$ on both sides. This gives \begin{align*} \sup_{\varnothing\ne\mathcal G\subset\mathcal F\,\mathrm{finite}}\mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))\right] \le L\sup_{\varnothing\ne\mathcal G\subset\mathcal F\,\mathrm{finite}}\mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i)\right]. \end{align*} If, in addition, there is an increasing sequence of nonempty finite subclasses $\mathcal G_m\subset\mathcal F$ for which the two displayed full suprema are measurable, integrable, and equal pointwise to the increasing limits of the corresponding suprema over $\mathcal G_m$, define \begin{align*} Y_m:=\sup_{f\in\mathcal G_m}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i)) \end{align*} and \begin{align*} Z_m:=\sup_{f\in\mathcal G_m}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i). \end{align*} The sequences $(Y_m)$ and $(Z_m)$ are increasing pointwise, and $Y_1$ and $Z_1$ are integrable because $\mathcal G_1$ is finite and the Rademacher probability space has finitely many sign outcomes. Hence the nonnegative sequences $(Y_m-Y_1)$ and $(Z_m-Z_1)$ satisfy the hypotheses of the [monotone convergence theorem](/theorems/509). Applying the monotone convergence theorem to these shifted sequences, and then adding back the finite constants $\mathbb E_\varepsilon[Y_1]$ and $\mathbb E_\varepsilon[Z_1]$, gives convergence of the expectations to the expectations of the stated full suprema, and therefore \begin{align*} \mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))\right] \le L\mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i)\right]. \end{align*} This is precisely the alternative formulation stated in the theorem. This completes the proof. [/step]

Prerequisites (0/4 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

Supremum

Explore Further

Supremum Definition Monotone Convergence Theorem Theorem #509 Monotone Convergence Theorem For Sequences Theorem #626 Monotone Convergence Theorem for Sequences Theorem #743 Marchenko-Pastur Density from the Stieltjes Transform Equation Probability & Statistics Moments from the MGF Probability Theory Wald Form of the Linear Restriction F-Test Probability & Statistics Asymptotic Equicontinuity Criterion for Empirical Processes Probability & Statistics Hoeffding Decomposition for U-Statistics Probability & Statistics AMISE-Optimal Bandwidth for Kernel Density Estimation Probability & Statistics Rosenbaum-Rubin Propensity Score Theorem Probability & Statistics $L^2$ Integrability Implies $L^1$ Integrability Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.

Rademacher Contraction Inequality (Theorem # 9825)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Rademacher Contraction Inequality (Theorem # 9825)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further