[proofplan]
We prove the statement by deriving the usual [sub-Gaussian tail bound](/theorems/1953) for each centered empirical average and then applying the union bound over the finite class $\mathcal F$. The sample-size hypothesis is exactly the algebraic condition that makes the union-bound error at most $\beta$. Taking complements then gives the desired high-probability uniform estimate.
[/proofplan]
[step:Define the centered empirical average for each function]
Let $(\Omega,\mathcal A,\mathbb P)$ be the probability space on which the random variables $Z_1,\dots,Z_n$ are defined. For each $f\in\mathcal F$, define the [random variable](/page/Random%20Variable) $A_f:\Omega\to\mathbb R$ by
\begin{align*}
A_f(\omega)
=
\frac{1}{n}\sum_{i=1}^n \left(f(Z_i(\omega))-\mathbb E[f(Z_i)]\right).
\end{align*}
Each $A_f$ is measurable as a finite linear combination of measurable real-valued random variables. Since $\mathcal F$ is finite, the map $\omega\mapsto \sup_{f\in\mathcal F}|A_f(\omega)|$ is measurable as the pointwise maximum of finitely many measurable real-valued functions. Therefore the event in the theorem is
\begin{align*}
\left\{\omega\in\Omega:\sup_{f\in\mathcal F}|A_f(\omega)|\le \varepsilon\right\}.
\end{align*}
Its complement is
\begin{align*}
\left\{\omega\in\Omega:\sup_{f\in\mathcal F}|A_f(\omega)|> \varepsilon\right\}
=
\bigcup_{f\in\mathcal F}\{\omega\in\Omega:|A_f(\omega)|>\varepsilon\}.
\end{align*}
[/step]
[step:Derive the two-sided sub-Gaussian tail bound for one empirical average]
Fix $f\in\mathcal F$. The hypotheses of the theorem give that the centered real-valued random variables $f(Z_i)-\mathbb E[f(Z_i)]$, for $1\le i\le n$, are independent and satisfy, for every parameter $\theta\in\mathbb R$,
\begin{align*}
\mathbb E\left[\exp\left(\theta\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right]
\le
\exp\left(\frac{\sigma^2\theta^2}{2}\right).
\end{align*}
For $\lambda>0$, independence gives
\begin{align*}
\mathbb E\left[\exp\left(\lambda n A_f\right)\right]=\mathbb E\left[\exp\left(\lambda\sum_{i=1}^n \left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right].
\end{align*}
Since the centered random variables $f(Z_i)-\mathbb E[f(Z_i)]$ are independent and each satisfies the sub-Gaussian moment-generating-function bound with variance proxy $\sigma^2$, this equals the product of the one-variable exponential moments and is bounded by
\begin{align*}
\prod_{i=1}^n \mathbb E\left[\exp\left(\lambda\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right]\le \exp\left(\frac{n\sigma^2\lambda^2}{2}\right).
\end{align*}
By [Markov's Inequality](/theorems/1126) applied to the nonnegative random variable $\exp(\lambda nA_f)$,
\begin{align*}
\mathbb P(A_f>\varepsilon)=\mathbb P\left(\exp(\lambda nA_f)>\exp(\lambda n\varepsilon)\right).
\end{align*}
[Markov's Inequality](/theorems/1126) and the preceding exponential-moment estimate give
\begin{align*}
\mathbb P(A_f>\varepsilon)\le \exp(-\lambda n\varepsilon)\mathbb E\left[\exp(\lambda nA_f)\right]\le \exp\left(-\lambda n\varepsilon+\frac{n\sigma^2\lambda^2}{2}\right).
\end{align*}
Choosing $\lambda=\varepsilon/\sigma^2$ yields
\begin{align*}
\mathbb P(A_f>\varepsilon)
\le
\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).
\end{align*}
Applying the same argument to $-A_f$ is valid because the uniform sub-Gaussian hypothesis gives the same moment-generating-function bound for every real parameter. Equivalently, for $\lambda>0$ we apply the bound with parameter $-\lambda$ to each centered variable $f(Z_i)-\mathbb E[f(Z_i)]$. Hence
\begin{align*}
\mathbb P(A_f<-\varepsilon)
\le
\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).
\end{align*}
Therefore
\begin{align*}
\mathbb P(|A_f|>\varepsilon)
\le
2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).
\end{align*}
[guided]
Fix one function $f\in\mathcal F$. The goal is first to control the deviation of a single centered empirical average; only after that will we take the union over all functions. Define the random variable $A_f:\Omega\to\mathbb R$ by
\begin{align*}
A_f(\omega)
=
\frac{1}{n}\sum_{i=1}^n \left(f(Z_i(\omega))-\mathbb E[f(Z_i)]\right).
\end{align*}
Equivalently,
\begin{align*}
nA_f
=
\sum_{i=1}^n \left(f(Z_i)-\mathbb E[f(Z_i)]\right).
\end{align*}
The hypotheses of the theorem give that the centered variables $f(Z_i)-\mathbb E[f(Z_i)]$ are independent and that, for every $\theta\in\mathbb R$ and every $1\le i\le n$,
\begin{align*}
\mathbb E\left[\exp\left(\theta\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right]
\le
\exp\left(\frac{\sigma^2\theta^2}{2}\right).
\end{align*}
For $\lambda>0$, the exponential moment of this sum factors because the centered random variables are independent. First,
\begin{align*}
\mathbb E\left[\exp\left(\lambda n A_f\right)\right]=\mathbb E\left[\exp\left(\lambda\sum_{i=1}^n \left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right].
\end{align*}
Independence turns the exponential moment of the sum into the product of the exponential moments:
\begin{align*}
\mathbb E\left[\exp\left(\lambda n A_f\right)\right]=\prod_{i=1}^n \mathbb E\left[\exp\left(\lambda\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right].
\end{align*}
The sub-Gaussian hypothesis says that each factor is bounded by $\exp(\sigma^2\lambda^2/2)$. Multiplying the $n$ identical upper bounds gives
\begin{align*}
\mathbb E\left[\exp\left(\lambda n A_f\right)\right]\le\exp\left(\frac{n\sigma^2\lambda^2}{2}\right).
\end{align*}
Now apply [Markov's Inequality](/theorems/1126) to the nonnegative random variable $\exp(\lambda nA_f)$. Since $\lambda>0$, the event $A_f>\varepsilon$ is the same as the event $\exp(\lambda nA_f)>\exp(\lambda n\varepsilon)$. Hence
\begin{align*}
\mathbb P(A_f>\varepsilon)=\mathbb P\left(\exp(\lambda nA_f)>\exp(\lambda n\varepsilon)\right).
\end{align*}
[Markov's Inequality](/theorems/1126) and the exponential-moment estimate then give
\begin{align*}
\mathbb P(A_f>\varepsilon)\le\exp(-\lambda n\varepsilon)\mathbb E\left[\exp(\lambda nA_f)\right]\le\exp\left(-\lambda n\varepsilon+\frac{n\sigma^2\lambda^2}{2}\right).
\end{align*}
The exponent is minimized by choosing
\begin{align*}
\lambda=\frac{\varepsilon}{\sigma^2},
\end{align*}
which is positive because $\varepsilon>0$ and $\sigma>0$. Substituting this value gives
\begin{align*}
\mathbb P(A_f>\varepsilon)
\le
\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).
\end{align*}
The same calculation applies to $-A_f$, but it is worth spelling out why the hypothesis permits this. The uniform sub-Gaussian moment-generating-function bound is assumed for every real parameter. Thus, when estimating $\mathbb E[\exp(\lambda(-nA_f))]$ with $\lambda>0$, we apply the original bound to each centered variable $f(Z_i)-\mathbb E[f(Z_i)]$ with parameter $-\lambda$. This gives the same estimate for the negative average, and therefore
\begin{align*}
\mathbb P(A_f<-\varepsilon)
\le
\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).
\end{align*}
Taking the union of the upper-tail and lower-tail events gives the two-sided bound
\begin{align*}
\mathbb P(|A_f|>\varepsilon)
\le
2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).
\end{align*}
[/guided]
[/step]
[step:Apply the union bound over the finite class]
Using the complement representation from the first step,
\begin{align*}
\mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right)=\mathbb P\left(\bigcup_{f\in\mathcal F}\{|A_f|>\varepsilon\}\right).
\end{align*}
The [Finite-Class Union Bound](/theorems/1108) over the finite class $\mathcal F$ and the two-sided bound from the previous step give
\begin{align*}
\mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right)\le\sum_{f\in\mathcal F}\mathbb P(|A_f|>\varepsilon)\le 2|\mathcal F|\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).
\end{align*}
[/step]
[step:Use the sample-size hypothesis to make the failure probability at most $\beta$]
Since $\mathcal F$ is nonempty, $|\mathcal F|\ge 1$, so the quantity
\begin{align*}
\frac{2|\mathcal F|}{\beta}
\end{align*}
is positive and the following division by $|\mathcal F|$ is legitimate. The assumed lower bound on $n$ implies
\begin{align*}
\frac{n\varepsilon^2}{2\sigma^2}
\ge
\log\frac{2|\mathcal F|}{\beta}.
\end{align*}
Exponentiating the negative of both sides gives
\begin{align*}
\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right)
\le
\frac{\beta}{2|\mathcal F|}.
\end{align*}
Substituting this into the union-bound estimate yields
\begin{align*}
\mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right)
\le
2|\mathcal F|\cdot \frac{\beta}{2|\mathcal F|}
=
\beta.
\end{align*}
Therefore, since the events $\{\sup_{f\in\mathcal F}|A_f|\le\varepsilon\}$ and $\{\sup_{f\in\mathcal F}|A_f|>\varepsilon\}$ are complements,
\begin{align*}
\mathbb P\left(\sup_{f\in\mathcal F}|A_f|\le\varepsilon\right)=1-\mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right)\ge 1-\beta.
\end{align*}
Replacing $A_f$ by its definition gives
\begin{align*}
\mathbb P\left(\sup_{f\in\mathcal F}\left|\frac{1}{n}\sum_{i=1}^n f(Z_i)-\frac{1}{n}\sum_{i=1}^n\mathbb E[f(Z_i)]\right|\le \varepsilon\right)\ge 1-\beta,
\end{align*}
which is the desired conclusion.
[/step]