[proofplan]
We show that the Lipschitz map $\phi$ applied to Rademacher sums can be "contracted" coordinate by coordinate. The argument proceeds by induction on the coordinate index: at each step $i$, we condition on all Rademacher variables except $\varepsilon_i$, exploit the symmetry $\varepsilon_i = \pm 1$ to decouple the supremum into two independent optimisations, apply the Lipschitz bound to replace $\phi(y_i h(x_i))$ by $L \cdot h(x_i)$, and recombine. After processing all $n$ coordinates, every $\phi$ term has been replaced by a scaled linear term, yielding $\hat{R}(\mathcal{F}(z_{1:n})) \leq L\,\hat{R}(\mathcal{H}(x_{1:n}))$.
[/proofplan]
[step:Establish the one-coordinate replacement claim]
[claim:One-Coordinate Contraction]
For any index $i \in \{1, \ldots, n\}$ and any function $A : \mathcal{H} \times \{-1,1\}^{n-1} \to \mathbb{R}$ (where $\varepsilon_{-i}$ denotes the Rademacher vector with the $i$-th coordinate removed),
\begin{align*}
\mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{1}{n}\varepsilon_i \phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\}\right] \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_i h(x_i) + A(h, \varepsilon_{-i})\right\}\right].
\end{align*}
[/claim]
[proof]
Condition on $\varepsilon_{-i}$. Since $\varepsilon_i$ takes values $+1$ and $-1$ each with probability $1/2$, the conditional expectation of the left-hand side equals
\begin{align*}
\frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{-\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\}.
\end{align*}
Since the two suprema are taken independently, we may bound this by optimising over pairs $(h, g) \in \mathcal{H} \times \mathcal{H}$:
\begin{align*}
&\leq \frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{1}{n}\bigl[\phi(y_i h(x_i)) - \phi(y_i g(x_i))\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}.
\end{align*}
By the Lipschitz condition on $\phi$ with constant $L$ (which applies because $|y_i h(x_i)| \leq r$ and $|y_i g(x_i)| \leq r$ since $|y_i| = 1$ and $|h(x_i)| \leq r$):
\begin{align*}
\phi(y_i h(x_i)) - \phi(y_i g(x_i)) \leq |\phi(y_i h(x_i)) - \phi(y_i g(x_i))| \leq L|y_i h(x_i) - y_i g(x_i)| = L|h(x_i) - g(x_i)|,
\end{align*}
where the final equality uses $|y_i| = 1$. By symmetry of the supremum over $(h, g)$ -- swapping the roles of $h$ and $g$ replaces $|h(x_i) - g(x_i)|$ by $h(x_i) - g(x_i)$ without loss -- we obtain
\begin{align*}
\frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{L}{n}\bigl[h(x_i) - g(x_i)\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}.
\end{align*}
Separating the supremum over the pair into two independent suprema (each over a single element of $\mathcal{H}$):
\begin{align*}
= \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}h(x_i) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{g \in \mathcal{H}}\!\left\{-\frac{L}{n}g(x_i) + A(g, \varepsilon_{-i})\right\}.
\end{align*}
This is the conditional expectation over $\varepsilon_i = \pm 1$ of $\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_i h(x_i) + A(h, \varepsilon_{-i})\right\}$. Taking full expectations over $\varepsilon_{-i}$ proves the claim.
[/proof]
[guided]
The strategy is a decoupling argument: condition on everything except $\varepsilon_i$, split the resulting coin flip into its two outcomes, and exploit the Lipschitz bound on the difference that appears.
Condition on $\varepsilon_{-i}$ (the Rademacher vector with the $i$-th coordinate removed). Since $\varepsilon_i$ is an independent Rademacher variable taking values $+1$ and $-1$ each with probability $1/2$, the conditional expectation of the left-hand side becomes
\begin{align*}
\frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{-\frac{1}{n}\phi(y_i h(x_i)) + A(h, \varepsilon_{-i})\right\}.
\end{align*}
Why do the two suprema decouple? Because they are taken independently -- the $\varepsilon_i = +1$ case optimises over its own $h$, and the $\varepsilon_i = -1$ case optimises over a potentially different $h$. We exploit this by introducing two independent optimisation variables $(h, g) \in \mathcal{H} \times \mathcal{H}$:
\begin{align*}
&\leq \frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{1}{n}\bigl[\phi(y_i h(x_i)) - \phi(y_i g(x_i))\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}.
\end{align*}
Now we apply the Lipschitz condition. We must verify that the arguments of $\phi$ lie in $[-r, r]$: since $|y_i| = 1$ and $|h(x_i)| \leq r$ by the definition $r = \sup_{x, h} |h(x)|$, we have $|y_i h(x_i)| = |h(x_i)| \leq r$, and similarly $|y_i g(x_i)| \leq r$. The Lipschitz bound gives
\begin{align*}
\phi(y_i h(x_i)) - \phi(y_i g(x_i)) \leq |\phi(y_i h(x_i)) - \phi(y_i g(x_i))| \leq L|y_i h(x_i) - y_i g(x_i)| = L|h(x_i) - g(x_i)|,
\end{align*}
where the final equality uses $|y_i| = 1$. How do we remove the absolute value? The key observation is a symmetry argument: the supremum over $(h, g) \in \mathcal{H} \times \mathcal{H}$ is free to choose whichever pair it likes, so it can always arrange $h(x_i) \geq g(x_i)$, making $|h(x_i) - g(x_i)| = h(x_i) - g(x_i)$. Therefore
\begin{align*}
\frac{1}{2}\sup_{h, g \in \mathcal{H}}\!\left\{\frac{L}{n}\bigl[h(x_i) - g(x_i)\bigr] + A(h, \varepsilon_{-i}) + A(g, \varepsilon_{-i})\right\}.
\end{align*}
Finally, we re-separate the joint supremum into two independent suprema. The expression $\frac{L}{n}h(x_i) + A(h, \varepsilon_{-i})$ depends only on $h$, and $-\frac{L}{n}g(x_i) + A(g, \varepsilon_{-i})$ depends only on $g$, so
\begin{align*}
= \frac{1}{2}\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}h(x_i) + A(h, \varepsilon_{-i})\right\} + \frac{1}{2}\sup_{g \in \mathcal{H}}\!\left\{-\frac{L}{n}g(x_i) + A(g, \varepsilon_{-i})\right\}.
\end{align*}
Reading the right-hand side as the conditional expectation over a Rademacher $\varepsilon_i = \pm 1$: the $\varepsilon_i = +1$ term gives $\sup_h\!\left\{\frac{L}{n}h(x_i) + A\right\}$ and the $\varepsilon_i = -1$ term gives $\sup_g\!\left\{-\frac{L}{n}g(x_i) + A\right\}$, which is exactly the conditional expectation of $\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_i h(x_i) + A(h, \varepsilon_{-i})\right\}$. Taking the full expectation over $\varepsilon_{-i}$ proves the claim.
[/guided]
[/step]
[step:Iterate the one-coordinate claim over all $n$ coordinates]
The empirical Rademacher complexity of the composed class is
\begin{align*}
\hat{R}(\mathcal{F}(z_{1:n})) = \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i \phi(y_i h(x_i))\right].
\end{align*}
Apply the one-coordinate contraction claim with $i = 1$ and
\begin{align*}
A(h, \varepsilon_{-1}) = \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j)).
\end{align*}
The claim gives
\begin{align*}
\hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j))\right\}\right].
\end{align*}
Now apply the claim with $i = 2$ and the updated remainder $A(h, \varepsilon_{-2}) = \frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=3}^n \varepsilon_j \phi(y_j h(x_j))$. This replaces $\varepsilon_2 \phi(y_2 h(x_2))$ by $L\varepsilon_2 h(x_2)$. Continuing inductively for $i = 3, 4, \ldots, n$, after all $n$ applications every $\phi(y_i h(x_i))$ term has been replaced by $L \cdot h(x_i)$:
\begin{align*}
\hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{L}{n}\sum_{i=1}^n \varepsilon_i h(x_i)\right] = L\,\hat{R}(\mathcal{H}(x_{1:n})).
\end{align*}
[guided]
We perform the replacement one coordinate at a time, beginning with $i = 1$. The empirical Rademacher complexity of the composed class is
\begin{align*}
\hat{R}(\mathcal{F}(z_{1:n})) = \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i \phi(y_i h(x_i))\right].
\end{align*}
To apply the one-coordinate contraction claim at $i = 1$, we must identify the remainder function $A$. Set
\begin{align*}
A(h, \varepsilon_{-1}) = \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j)).
\end{align*}
This is a valid choice because $A$ depends on $h$ and on $\varepsilon_{-1} = (\varepsilon_2, \ldots, \varepsilon_n)$ but not on $\varepsilon_1$, which is exactly what the claim requires. The claim gives
\begin{align*}
\hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=2}^n \varepsilon_j \phi(y_j h(x_j))\right\}\right].
\end{align*}
Now apply the claim at $i = 2$. The updated remainder is
\begin{align*}
A(h, \varepsilon_{-2}) = \frac{L}{n}\varepsilon_1 h(x_1) + \frac{1}{n}\sum_{j=3}^n \varepsilon_j \phi(y_j h(x_j)),
\end{align*}
which depends on $h$ and on $(\varepsilon_1, \varepsilon_3, \ldots, \varepsilon_n)$ but not on $\varepsilon_2$. Why is this still a valid remainder function? Because the claim applies to an arbitrary function $A : \mathcal{H} \times \{-1,1\}^{n-1} \to \mathbb{R}$ -- it imposes no structural requirement on $A$ beyond measurability. So the fact that some coordinates already carry $L \cdot h$ terms rather than $\phi$ terms is irrelevant. The claim replaces $\varepsilon_2 \phi(y_2 h(x_2))$ by $L\varepsilon_2 h(x_2)$:
\begin{align*}
\leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}}\!\left\{\frac{L}{n}\varepsilon_1 h(x_1) + \frac{L}{n}\varepsilon_2 h(x_2) + \frac{1}{n}\sum_{j=3}^n \varepsilon_j \phi(y_j h(x_j))\right\}\right].
\end{align*}
Continuing inductively for $i = 3, 4, \ldots, n$, at each step the remainder $A$ contains the already-replaced coordinates $\frac{L}{n}\sum_{j<i}\varepsilon_j h(x_j)$ and the not-yet-replaced coordinates $\frac{1}{n}\sum_{j>i}\varepsilon_j \phi(y_j h(x_j))$. The claim is applied with this $A$, replacing $\varepsilon_i \phi(y_i h(x_i))$ by $L\varepsilon_i h(x_i)$. After all $n$ applications:
\begin{align*}
\hat{R}(\mathcal{F}(z_{1:n})) \leq \mathbb{E}\!\left[\sup_{h \in \mathcal{H}} \frac{L}{n}\sum_{i=1}^n \varepsilon_i h(x_i)\right] = L\,\hat{R}(\mathcal{H}(x_{1:n})).
\end{align*}
The final equality is the definition of the empirical Rademacher complexity of $\mathcal{H}$ evaluated at $x_{1:n}$.
[/guided]
[/step]
[step:Deduce the population-level bound $R_n(\mathcal{F}) \leq L\, R_n(\mathcal{H})$]
The population Rademacher complexity is $R_n(\mathcal{F}) = \mathbb{E}_{Z_{1:n}}[\hat{R}(\mathcal{F}(Z_{1:n}))]$, where the outer expectation is over the i.i.d. data $Z_1, \ldots, Z_n$. Applying the empirical bound pointwise for every realisation of the data:
\begin{align*}
R_n(\mathcal{F}) = \mathbb{E}_{Z_{1:n}}\bigl[\hat{R}(\mathcal{F}(Z_{1:n}))\bigr] \leq \mathbb{E}_{Z_{1:n}}\bigl[L\,\hat{R}(\mathcal{H}(X_{1:n}))\bigr] = L\,\mathbb{E}_{Z_{1:n}}\bigl[\hat{R}(\mathcal{H}(X_{1:n}))\bigr] = L\, R_n(\mathcal{H}),
\end{align*}
where the linearity of expectation pulls $L$ outside and we use the fact that the population Rademacher complexity of $\mathcal{H}$ is $R_n(\mathcal{H}) = \mathbb{E}_{X_{1:n}}[\hat{R}(\mathcal{H}(X_{1:n}))]$. This completes the proof.
[/step]