[proofplan]
The proof first reduces the empirical deviation to a two-sample deviation by introducing an independent ghost sample. Conditional on the pooled $2n$ sample points, the class $\mathcal C$ induces at most $\Pi_{\mathcal C}(2n)$ distinct binary labelings. For each fixed labeling, a random permutation argument converts the two-sample difference into a Rademacher average, and [Hoeffding's inequality](/theorems/1962) controls its tail. A union bound gives the [first inequality](/theorems/2897), and the Sauer-Shelah lemma gives the polynomial growth bound when $V(\mathcal C)<\infty$.
[/proofplan]
[step:Dispose of the empty class and the small deviation range]
If $\mathcal C=\varnothing$, then the supremum over $\mathcal C$ is interpreted as $0$, while $\Pi_{\mathcal C}(2n)=0$, so both displayed bounds are immediate. Hence assume $\mathcal C\ne\varnothing$. Then $\Pi_{\mathcal C}(2n)\ge 1$.
If
\begin{align*}
8\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{32}\right)\ge 1,
\end{align*}
then the first inequality follows from the elementary bound that probabilities are at most $1$. We may therefore assume throughout the proof of the first inequality that
\begin{align*}
8\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{32}\right)<1.
\end{align*}
In particular,
\begin{align*}
nt^2>32\log 8>2.
\end{align*}
[/step]
[step:Introduce a ghost empirical measure and symmetrise the deviation]
Let $Y_1,\dots,Y_n:(\Omega',\mathcal A',\mathbb P')\to(S,\mathcal S)$ be independent identically distributed random variables with common distribution $P$, independent of $X_1,\dots,X_n$. Define the ghost empirical set function
\begin{align*}
P_n':\mathcal C\to[0,1]
\end{align*}
by
\begin{align*}
P_n'(C):=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i).
\end{align*}
Since $\mathcal C$ is countable, the random variables
\begin{align*}
\sup_{C\in\mathcal C}|P_n(C)-P(C)|
\end{align*}
and
\begin{align*}
\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|
\end{align*}
are measurable as suprema of countably many measurable real-valued random variables.
For each fixed $C\in\mathcal C$, the random variables $\mathbb{1}_C(Y_1),\dots,\mathbb{1}_C(Y_n)$ are independent Bernoulli random variables with mean $P(C)$. Hence
\begin{align*}
\operatorname{Var}(P_n'(C))=\frac{P(C)(1-P(C))}{n}\le \frac{1}{4n}.
\end{align*}
By [Chebyshev's inequality](/theorems/1126), for every $C\in\mathcal C$,
\begin{align*}
\mathbb P'\left(|P_n'(C)-P(C)|>\frac{t}{2}\right)
\le \frac{1}{nt^2}.
\end{align*}
Since $nt^2>2$, this implies
\begin{align*}
\mathbb P'\left(|P_n'(C)-P(C)|\le \frac{t}{2}\right)\ge \frac{1}{2}.
\end{align*}
Let
\begin{align*}
A:=\left\{\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right\}.
\end{align*}
Choose an enumeration $\mathcal C=\{D_1,D_2,\dots\}$, allowing repetitions if $\mathcal C$ is finite. For each $\omega\in A$, let $k(\omega)$ be the least index $k\in\mathbb N$ such that
\begin{align*}
|P_n(D_k)(\omega)-P(D_k)|>t,
\end{align*}
and define $C_\omega:=D_{k(\omega)}$. This measurable first-choice rule is available because $\mathcal C$ is countable and the supremum exceeds $t$. On the event
\begin{align*}
\left\{|P_n'(C_\omega)-P(C_\omega)|\le \frac{t}{2}\right\},
\end{align*}
the [reverse triangle inequality](/theorems/2300) gives
\begin{align*}
|P_n(C_\omega)-P_n'(C_\omega)|
\ge |P_n(C_\omega)-P(C_\omega)|-|P_n'(C_\omega)-P(C_\omega)|>\frac{t}{2}.
\end{align*}
Therefore
\begin{align*}
\mathbb P'\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right)\ge \frac{1}{2}
\end{align*}
for every fixed original sample in $A$. Integrating this conditional lower bound over the original sample gives
\begin{align*}
\mathbb P(A)
\le 2(\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right).
\end{align*}
[guided]
The purpose of the ghost sample is to replace the unknown expectation $P(C)$ by an empirical object with the same mean. We introduce independent random variables $Y_1,\dots,Y_n$ with the same law $P$ as the $X_i$, and define the ghost empirical set function $P_n':\mathcal C\to[0,1]$ by
\begin{align*}
P_n'(C):=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i).
\end{align*}
The countability of $\mathcal C$ is used here in a concrete way: the suprema over $C\in\mathcal C$ are countable suprema of measurable random variables, hence measurable.
For a fixed set $C\in\mathcal C$, the variables $\mathbb{1}_C(Y_i)$ are independent Bernoulli random variables with expectation $P(C)$. Thus
\begin{align*}
\operatorname{Var}(P_n'(C))=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb 1_C(Y_i)\right)
=\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(\mathbb 1_C(Y_i)).
\end{align*}
Since $\operatorname{Var}(\mathbb 1_C(Y_i))=P(C)(1-P(C))\le 1/4$, we obtain
\begin{align*}
\operatorname{Var}(P_n'(C))\le \frac{1}{4n}.
\end{align*}
Applying Chebyshev's inequality to the centered real-valued [random variable](/page/Random%20Variable) $P_n'(C)-P(C)$ gives
\begin{align*}
\mathbb P'\left(|P_n'(C)-P(C)|>\frac{t}{2}\right)
\le \frac{\operatorname{Var}(P_n'(C))}{t^2/4}
\le \frac{1}{nt^2}.
\end{align*}
Because the small-deviation case has already been removed, we have $nt^2>2$, and hence
\begin{align*}
\mathbb P'\left(|P_n'(C)-P(C)|\le \frac{t}{2}\right)\ge \frac{1}{2}.
\end{align*}
Now suppose the original empirical measure is far from $P$ for some set $C$. More precisely, if
\begin{align*}
\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t,
\end{align*}
then countability lets us choose a set $C_\omega\in\mathcal C$ depending on the realised original sample such that
\begin{align*}
|P_n(C_\omega)-P(C_\omega)|>t.
\end{align*}
Whenever the ghost empirical measure is within $t/2$ of $P(C_\omega)$, the original and ghost empirical measures must be separated by more than $t/2$:
\begin{align*}
|P_n(C_\omega)-P_n'(C_\omega)|
\ge |P_n(C_\omega)-P(C_\omega)|-|P_n'(C_\omega)-P(C_\omega)|>\frac{t}{2}.
\end{align*}
Thus, conditional on every original sample for which the empirical deviation exceeds $t$, the ghost sample produces a two-sample deviation exceeding $t/2$ with probability at least $1/2$. Integrating this conditional statement over the original sample yields
\begin{align*}
\mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right)
\le 2(\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right).
\end{align*}
[/guided]
[/step]
[step:Condition on the pooled sample and count the induced labelings]
For fixed points $z_1,\dots,z_{2n}\in S$, define the trace set
\begin{align*}
\mathcal T(z_1,\dots,z_{2n}):=\{(\mathbb{1}_C(z_1),\dots,\mathbb{1}_C(z_{2n})):C\in\mathcal C\}\subset\{0,1\}^{2n}.
\end{align*}
By the definition of the growth function,
\begin{align*}
|\mathcal T(z_1,\dots,z_{2n})|\le \Pi_{\mathcal C}(2n).
\end{align*}
Let $Z_1,\dots,Z_{2n}$ denote the pooled sample defined by
\begin{align*}
Z_i:=X_i\quad\text{for }1\le i\le n,
\end{align*}
and
\begin{align*}
Z_{n+i}:=Y_i\quad\text{for }1\le i\le n.
\end{align*}
Conditional on $Z_1,\dots,Z_{2n}$, the class $\mathcal C$ induces at most $\Pi_{\mathcal C}(2n)$ distinct vectors
\begin{align*}
(\mathbb{1}_C(Z_1),\dots,\mathbb{1}_C(Z_{2n})).
\end{align*}
[/step]
[step:Introduce independent swaps and bound each fixed labeling by Hoeffding]
Let $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$ be a [probability space](/page/Probability%20Space) carrying independent Rademacher random variables $\varepsilon_1,\dots,\varepsilon_n:\Omega_\varepsilon\to\{-1,1\}$, independent of the original and ghost samples. Here $\mathbb P_\varepsilon$ denotes probability with respect to the Rademacher variables alone. For a fixed trace vector $a=(a_1,\dots,a_{2n})\in\{0,1\}^{2n}$, define the real-valued Rademacher sum $R_a:\Omega_\varepsilon\to\mathbb R$ by
\begin{align*}
R_a:=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(a_i-a_{n+i}).
\end{align*}
Since each coefficient $a_i-a_{n+i}$ lies in $\{-1,0,1\}$, the summands $\varepsilon_i(a_i-a_{n+i})$ are independent, mean-zero, and take values in $[-1,1]$. Hoeffding's inequality for bounded independent sums gives
\begin{align*}
\mathbb P_\varepsilon\left(|R_a|>\frac{t}{2}\right)
\le 2\exp\left(-\frac{nt^2}{8}\right).
\end{align*}
For $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$ define swapped samples $X_i^\varepsilon,Y_i^\varepsilon:(\Omega\times\Omega'\times\Omega_\varepsilon,\mathcal A\otimes\mathcal A'\otimes\mathcal A_\varepsilon)\to(S,\mathcal S)$ by setting $X_i^\varepsilon=X_i$ and $Y_i^\varepsilon=Y_i$ on $\{\varepsilon_i=1\}$, and $X_i^\varepsilon=Y_i$ and $Y_i^\varepsilon=X_i$ on $\{\varepsilon_i=-1\}$. Because $(X_i,Y_i)$ has product law $P\otimes P$ and is exchangeable under the coordinate swap, and because the pairs are independent, the process
\begin{align*}
\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i)\right)_{C\in\mathcal C}
\end{align*}
has the same law under $\mathbb P\otimes\mathbb P'$ as
\begin{align*}
\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right)_{C\in\mathcal C}
\end{align*}
has under $\mathbb P\otimes\mathbb P'\otimes\mathbb P_\varepsilon$.
Conditional on the realised ordered pooled sample $(Z_1,\dots,Z_{2n})$, the swapped difference for a set $C\in\mathcal C$ is
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(\mathbb{1}_C(Z_i)-\mathbb{1}_C(Z_{n+i})).
\end{align*}
The possible coefficient vectors are exactly the trace vectors in $\mathcal T(Z_1,\dots,Z_{2n})$, whose cardinality is at most $\Pi_{\mathcal C}(2n)$. Applying the preceding Hoeffding bound to each trace vector and taking a union bound over this finite trace set gives, for every realised pooled sample,
\begin{align*}
\mathbb P_\varepsilon\left(\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right|>\frac{t}{2}\right)
\le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right).
\end{align*}
Integrating over the pooled sample and using the equality in law under independent swaps yields
\begin{align*}
(\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right)
\le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right).
\end{align*}
[guided]
The delicate point is that the ordered pooled sample itself is fixed after conditioning, so the randomness must come from an additional independent swap mechanism. We place independent Rademacher variables $\varepsilon_1,\dots,\varepsilon_n$ on a probability space $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$ and let $\mathbb P_\varepsilon$ mean probability over these signs only. For a trace vector $a=(a_1,\dots,a_{2n})\in\{0,1\}^{2n}$, define the real-valued map $R_a:\Omega_\varepsilon\to\mathbb R$ by
\begin{align*}
R_a:=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(a_i-a_{n+i}).
\end{align*}
The variables $\varepsilon_i(a_i-a_{n+i})$ are independent because the signs are independent, have mean zero because each $\varepsilon_i$ has mean zero, and lie in $[-1,1]$ because $a_i-a_{n+i}\in\{-1,0,1\}$. Hoeffding's inequality for bounded independent sums therefore gives
\begin{align*}
\mathbb P_\varepsilon\left(|R_a|>\frac{t}{2}\right)
\le 2\exp\left(-\frac{nt^2}{8}\right).
\end{align*}
We now connect this artificial sign process to the original two-sample process. For each $i$, define $X_i^\varepsilon$ and $Y_i^\varepsilon$ by leaving $(X_i,Y_i)$ unchanged when $\varepsilon_i=1$ and swapping the two coordinates when $\varepsilon_i=-1$. Since $X_i$ and $Y_i$ are independent with the same distribution $P$, the pair $(X_i,Y_i)$ has law $P\otimes P$, which is invariant under the coordinate swap. The pairs are independent across $i$, so the whole swapped array has the same law as the original array. Hence the two processes indexed by $\mathcal C$,
\begin{align*}
\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i)\right)_{C\in\mathcal C}
\end{align*}
and
\begin{align*}
\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right)_{C\in\mathcal C}
\end{align*}
have the same distribution.
After conditioning on $Z_1,\dots,Z_{2n}$, no randomness remains in the labels $\mathbb{1}_C(Z_j)$; the only randomness is in the signs. For a fixed $C$, the swapped difference becomes
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(\mathbb{1}_C(Z_i)-\mathbb{1}_C(Z_{n+i})).
\end{align*}
As $C$ varies, the coefficient vectors are drawn from the finite trace set $\mathcal T(Z_1,\dots,Z_{2n})$, which has cardinality at most $\Pi_{\mathcal C}(2n)$. We apply the fixed-vector Hoeffding bound to every vector in this trace set and take a union bound. This gives, conditionally on every realised pooled sample,
\begin{align*}
\mathbb P_\varepsilon\left(\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right|>\frac{t}{2}\right)
\le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right).
\end{align*}
Finally, integrating this conditional inequality and using the equality in law between the original and swapped two-sample processes yields
\begin{align*}
(\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right)
\le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right).
\end{align*}
[/guided]
[/step]
[step:Combine symmetrisation with the finite-labeling bound]
Using the symmetrisation estimate from the ghost-sample step and the conditional finite-labeling estimate, we obtain
\begin{align*}
\mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right)
\le 4\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right).
\end{align*}
Since $nt^2\ge 0$, we have
\begin{align*}
4\exp\left(-\frac{nt^2}{8}\right)
=8\exp\left(-\frac{nt^2}{32}\right)\cdot \frac{1}{2}\exp\left(-\frac{3nt^2}{32}\right)
\le 8\exp\left(-\frac{nt^2}{32}\right).
\end{align*}
Thus the desired bound follows:
\begin{align*}
\mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right)
\le 8\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{32}\right).
\end{align*}
Together with the small-deviation case handled at the beginning, this proves the first assertion for every $t>0$.
[/step]
[step:Apply Sauer-Shelah to obtain the VC-dimension consequence]
Assume $V(\mathcal C)=v<\infty$ with $v\ge 1$, and let $n\ge v$. Fix arbitrary points $x_1,\dots,x_{2n}\in S$ and set $E:=\{x_1,\dots,x_{2n}\}$. Repeated points can only reduce $|E|$, so $|E|\le 2n$. Applying the [Sauer-Shelah Lemma][citetheorem:1969] to the finite set system induced by $\mathcal C$ on $E$ gives
\begin{align*}
\left|\{C\cap E:C\in\mathcal C\}\right|\le \sum_{k=0}^{v}\binom{|E|}{k}\le \sum_{k=0}^{v}\binom{2n}{k}.
\end{align*}
Taking the supremum over all $x_1,\dots,x_{2n}\in S$ gives
\begin{align*}
\Pi_{\mathcal C}(2n)\le \sum_{k=0}^{v}\binom{2n}{k}.
\end{align*}
Since $v\le n\le 2n$, the standard binomial estimate in the Sauer-Shelah bound gives
\begin{align*}
\sum_{k=0}^{v}\binom{2n}{k}\le \left(\frac{2en}{v}\right)^v.
\end{align*}
Substituting this estimate into the first inequality yields
\begin{align*}
\mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right)
\le 8\left(\frac{2en}{v}\right)^v\exp\left(-\frac{nt^2}{32}\right).
\end{align*}
This is the claimed finite-VC-dimension form of the inequality.
[/step]