[proofplan]
We first prove the inequality for a finite subclass by identifying each function with its vector of values at the fixed sample points. The finite proof is a coordinate-by-coordinate contraction argument: replacing one coordinate $t_i$ by $\phi_i(t_i)$ cannot increase the Rademacher average because the two-point oscillation created by flipping the sign $\varepsilon_i$ is controlled by the Lipschitz bound. Iterating this replacement over all coordinates gives the finite-class inequality with the sharp constant $L$. The general statement follows by applying the finite result to finite subclasses and passing to the relevant supremum convention.
[/proofplan]
[step:Reduce the finite-subclass statement to a finite vector comparison]
Let $\mathcal G\subset\mathcal F$ be a nonempty finite subclass. Define the finite nonempty set
\begin{align*}
T_{\mathcal G}:=\{(f(x_1),\dots,f(x_n))\in\mathbb R^n:f\in\mathcal G\}.
\end{align*}
Define the map
\begin{align*}
\Phi: T_{\mathcal G} \to \mathbb R^n
\end{align*}
by
\begin{align*}
\Phi(t) = (\phi_1(t_1),\dots,\phi_n(t_n))
\end{align*}
for each $t=(t_1,\dots,t_n)\in T_{\mathcal G}$.
Then
\begin{align*}
\sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))
=
\sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i),
\end{align*}
and
\begin{align*}
\sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i f(x_i)
=
\sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i t_i.
\end{align*}
It is therefore enough, for finite subclasses, to prove that every nonempty finite set $T\subset\mathbb R^n$ satisfies
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i)\right]
\le
L\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i t_i\right].
\end{align*}
[guided]
The fixed points $x_1,\dots,x_n$ mean that every function $f:S\to\mathbb R$ contributes only the finite vector of sample values
\begin{align*}
(f(x_1),\dots,f(x_n))\in\mathbb R^n.
\end{align*}
For a nonempty finite subclass $\mathcal G\subset\mathcal F$, we name the corresponding finite nonempty vector set
\begin{align*}
T_{\mathcal G}:=\{(f(x_1),\dots,f(x_n))\in\mathbb R^n:f\in\mathcal G\}.
\end{align*}
This is the right reduction because the Rademacher sums only see the values of $f$ at the sample points. If $t=(t_1,\dots,t_n)\in T_{\mathcal G}$ comes from $f\in\mathcal G$, then $t_i=f(x_i)$ for each $i$. Thus the transformed Rademacher sum becomes
\begin{align*}
\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))
=
\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i).
\end{align*}
Taking the supremum over $f\in\mathcal G$ is the same as taking the supremum over all sample-value vectors $t\in T_{\mathcal G}$, so
\begin{align*}
\sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))
=
\sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i).
\end{align*}
The same identification without the functions $\phi_i$ gives
\begin{align*}
\sup_{f\in\mathcal G}\sum_{i=1}^{n}\varepsilon_i f(x_i)
=
\sup_{t\in T_{\mathcal G}}\sum_{i=1}^{n}\varepsilon_i t_i.
\end{align*}
So the finite-subclass theorem is exactly a finite-dimensional comparison for a finite set $T\subset\mathbb R^n$.
[/guided]
[/step]
[step:Prove the one-coordinate contraction estimate]
Let $T\subset\mathbb R^n$ be finite and nonempty. Fix an index $k\in\{1,\dots,n\}$. Let $a:T\to\mathbb R$ be an arbitrary function, and let $\phi:\mathbb R\to\mathbb R$ be $L$-Lipschitz. We claim that
\begin{align*}
\frac{1}{2}\sup_{t\in T}(a(t)+\phi(t_k))
+
\frac{1}{2}\sup_{t\in T}(a(t)-\phi(t_k))
\le
\frac{1}{2}\sup_{t\in T}(a(t)+L t_k)
+
\frac{1}{2}\sup_{t\in T}(a(t)-L t_k).
\end{align*}
Because $T$ is finite, choose $s,u\in T$ such that
\begin{align*}
\sup_{t\in T}(a(t)+\phi(t_k))=a(s)+\phi(s_k)
\end{align*}
and
\begin{align*}
\sup_{t\in T}(a(t)-\phi(t_k))=a(u)-\phi(u_k).
\end{align*}
Then the left-hand side is
\begin{align*}
\frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}.
\end{align*}
Since $\phi$ is $L$-Lipschitz,
\begin{align*}
\phi(s_k)-\phi(u_k)\le L|s_k-u_k|.
\end{align*}
If $s_k\ge u_k$, then
\begin{align*}
\frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}
\le
\frac{1}{2}\{a(s)+L s_k+a(u)-L u_k\},
\end{align*}
which is bounded by the right-hand side. If $s_k<u_k$, then
\begin{align*}
\frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}
\le
\frac{1}{2}\{a(s)-L s_k+a(u)+L u_k\},
\end{align*}
which is again bounded by the right-hand side. This proves the one-coordinate estimate.
[guided]
The purpose of this step is to isolate a single sign flip. We freeze every part of the Rademacher sum except the $k$-th coordinate and compare the average over the two values of the sign $\varepsilon_k$. The arbitrary function $a:T\to\mathbb R$ represents all frozen terms, while $t_k$ is the one coordinate still being contracted.
Because $T$ is finite and nonempty, both suprema are attained. Choose $s,u\in T$ such that
\begin{align*}
\sup_{t\in T}(a(t)+\phi(t_k))=a(s)+\phi(s_k)
\end{align*}
and
\begin{align*}
\sup_{t\in T}(a(t)-\phi(t_k))=a(u)-\phi(u_k).
\end{align*}
Therefore the left-hand side of the desired one-coordinate inequality is
\begin{align*}
\frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}.
\end{align*}
The Lipschitz hypothesis gives
\begin{align*}
\phi(s_k)-\phi(u_k)\le |\phi(s_k)-\phi(u_k)|\le L|s_k-u_k|.
\end{align*}
Now the sign of $s_k-u_k$ tells us which of the two linear perturbations should receive $s$ and which should receive $u$. If $s_k\ge u_k$, then $|s_k-u_k|=s_k-u_k$, so
\begin{align*}
\frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}
\le
\frac{1}{2}\{a(s)+L s_k+a(u)-L u_k\}.
\end{align*}
Since $s,u\in T$, the last expression is bounded by
\begin{align*}
\frac{1}{2}\sup_{t\in T}(a(t)+L t_k)
+
\frac{1}{2}\sup_{t\in T}(a(t)-L t_k).
\end{align*}
If $s_k<u_k$, then $|s_k-u_k|=u_k-s_k$, and the same computation with the two signs interchanged gives
\begin{align*}
\frac{1}{2}\{a(s)+a(u)+\phi(s_k)-\phi(u_k)\}
\le
\frac{1}{2}\{a(s)-L s_k+a(u)+L u_k\},
\end{align*}
which is again bounded by
\begin{align*}
\frac{1}{2}\sup_{t\in T}(a(t)+L t_k)
+
\frac{1}{2}\sup_{t\in T}(a(t)-L t_k).
\end{align*}
Thus replacing the nonlinear coordinate $\phi(t_k)$ by the linear coordinate $L t_k$ cannot decrease the two-sign average.
[/guided]
[/step]
[step:Replace the transformed coordinates one at a time]
For each $m\in\{0,1,\dots,n\}$, define the map
\begin{align*}
\Psi_m: T \to \mathbb R^n
\end{align*}
by
\begin{align*}
\Psi_m(t) = (L t_1,\dots,L t_m,\phi_{m+1}(t_{m+1}),\dots,\phi_n(t_n))
\end{align*}
for each $t=(t_1,\dots,t_n)\in T$.
Thus $\Psi_0(t)=(\phi_1(t_1),\dots,\phi_n(t_n))$ and $\Psi_n(t)=(L t_1,\dots,L t_n)$.
Fix $m\in\{0,\dots,n-1\}$. Condition on all Rademacher variables except $\varepsilon_{m+1}$. For a fixed choice of those other signs, define $a:T\to\mathbb R$ by
\begin{align*}
a(t):=\sum_{i=1}^{m}\varepsilon_i L t_i+\sum_{i=m+2}^{n}\varepsilon_i\phi_i(t_i).
\end{align*}
Applying the one-coordinate estimate with $k=m+1$ and $\phi=\phi_{m+1}$ gives
\begin{align*}
\mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right]
\le
\mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right].
\end{align*}
Taking expectation over the remaining signs yields
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right]
\le
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right].
\end{align*}
Iterating this inequality from $m=0$ to $m=n-1$ gives
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i)\right]
\le
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i L t_i\right].
\end{align*}
Since $L\ge0$, the last expression equals
\begin{align*}
L\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i t_i\right].
\end{align*}
[guided]
For each $m\in\{0,1,\dots,n\}$, define the map
\begin{align*}
\Psi_m:T\to\mathbb R^n,\qquad t\mapsto (L t_1,\dots,L t_m,\phi_{m+1}(t_{m+1}),\dots,\phi_n(t_n))
\end{align*}
for each $t=(t_1,\dots,t_n)\in T$. The map $\Psi_m$ records the state after the first $m$ coordinates have been replaced by their linear contracted versions. Thus $\Psi_0(t)=(\phi_1(t_1),\dots,\phi_n(t_n))$, while $\Psi_n(t)=(L t_1,\dots,L t_n)$.
Fix $m\in\{0,\dots,n-1\}$. We compare the averages at stages $m$ and $m+1$ by conditioning on every Rademacher sign except $\varepsilon_{m+1}$. For a fixed choice of the remaining signs, define the function $a:T\to\mathbb R$ by
\begin{align*}
a(t):=\sum_{i=1}^{m}\varepsilon_i L t_i+\sum_{i=m+2}^{n}\varepsilon_i\phi_i(t_i).
\end{align*}
This function contains exactly the frozen part of the Rademacher sum. The only variable part left is the two-sign average in the coordinate $m+1$. Since $\phi_{m+1}$ is $L$-Lipschitz by the theorem hypothesis, applying the one-coordinate estimate with $k=m+1$ and $\phi=\phi_{m+1}$ gives
\begin{align*}
\mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right]
\le
\mathbb E_{\varepsilon_{m+1}}\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right].
\end{align*}
Now take expectation over the other signs. Since $T$ is finite, all displayed suprema are maxima over finitely many real-valued random variables, hence measurable and integrable on the finite Rademacher [probability space](/page/Probability%20Space). Therefore conditioning and iterated expectation are legitimate, and we obtain
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_m(t))_i\right]
\le
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i(\Psi_{m+1}(t))_i\right].
\end{align*}
Iterating this inequality for $m=0,1,\dots,n-1$ replaces the transformed coordinates one at a time and gives
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i\phi_i(t_i)\right]
\le
\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i L t_i\right].
\end{align*}
Because $L\ge0$, multiplication by $L$ commutes with the supremum over $T$, so the right-hand side is
\begin{align*}
L\mathbb E_\varepsilon\left[\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i t_i\right].
\end{align*}
[/guided]
[/step]
[step:Pass from finite subclasses to the stated class]
For every nonempty finite subclass $\mathcal G\subset\mathcal F$, the preceding finite-class argument gives
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))\right]
\le
L\mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i)\right].
\end{align*}
The factor $1/n$ is harmless because it is positive and common to both sides.
Under the nonempty finite-subclass approximation convention, take the supremum over nonempty finite subclasses $\mathcal G\subset\mathcal F$ on both sides. This gives
\begin{align*}
\sup_{\varnothing\ne\mathcal G\subset\mathcal F\,\mathrm{finite}}\mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))\right]
\le
L\sup_{\varnothing\ne\mathcal G\subset\mathcal F\,\mathrm{finite}}\mathbb E_\varepsilon\left[\sup_{f\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i)\right].
\end{align*}
If, in addition, there is an increasing sequence of nonempty finite subclasses $\mathcal G_m\subset\mathcal F$ for which the two displayed full suprema are measurable, integrable, and equal pointwise to the increasing limits of the corresponding suprema over $\mathcal G_m$, define
\begin{align*}
Y_m:=\sup_{f\in\mathcal G_m}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))
\end{align*}
and
\begin{align*}
Z_m:=\sup_{f\in\mathcal G_m}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i).
\end{align*}
The sequences $(Y_m)$ and $(Z_m)$ are increasing pointwise, and $Y_1$ and $Z_1$ are integrable because $\mathcal G_1$ is finite and the Rademacher probability space has finitely many sign outcomes. Hence the nonnegative sequences $(Y_m-Y_1)$ and $(Z_m-Z_1)$ satisfy the hypotheses of the [monotone convergence theorem](/theorems/509). Applying the monotone convergence theorem to these shifted sequences, and then adding back the finite constants $\mathbb E_\varepsilon[Y_1]$ and $\mathbb E_\varepsilon[Z_1]$, gives convergence of the expectations to the expectations of the stated full suprema, and therefore
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\phi_i(f(x_i))\right]
\le
L\mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i)\right].
\end{align*}
This is precisely the alternative formulation stated in the theorem. This completes the proof.
[/step]