[proofplan]
We prove the estimate by bounding packing numbers for the pseudometric $d_Q$. A maximal separated family of sets is first interpreted as a separated family of indicator functions, so its pairwise $L^2(Q)$ distances are lower bounds on the $Q$-measures of symmetric differences. The Pollard discretisation lemma converts this measure separation into a bound by finite traces, and the Sauer-Shelah lemma bounds the number of possible traces of a VC class on a finite sample. Finally we pass from packing to covering and absorb harmless numerical factors into one universal constant.
[/proofplan]
[step:Reduce covering to packing in the $L^2(Q)$ pseudometric]
Define the pseudometric
\begin{align*}
d_Q:\mathcal C\times\mathcal C\to[0,\infty),\qquad d_Q(C,D):=Q(C\triangle D)^{1/2}.
\end{align*}
This is exactly the $L^2(Q)$ distance between the indicator functions $\mathbb 1_C$ and $\mathbb 1_D$, because
\begin{align*}
\|\mathbb 1_C-\mathbb 1_D\|_{L^2(Q)}^2
=\int_S |\mathbb 1_C(x)-\mathbb 1_D(x)|^2\,dQ(x)
=Q(C\triangle D).
\end{align*}
For $\delta>0$, let $M(\delta,\mathcal C,d_Q)$ denote the supremum of the cardinalities of finite $\delta$-separated subsets of $\mathcal C$ with respect to $d_Q$. If every finite $\varepsilon$-separated subset of $\mathcal C$ has cardinality at most $B$, then a maximality argument gives
\begin{align*}
N(\varepsilon,\mathcal C,L^2(Q))\le B.
\end{align*}
Indeed, if no finite $\varepsilon$-net with at most $B$ points existed, one could inductively choose $B+1$ points with pairwise $d_Q$-distance greater than $\varepsilon$, contradicting the finite packing bound. It is therefore enough to prove the stated upper bound for every finite $\varepsilon$-separated subfamily of $\mathcal C$.
[/step]
[step:Apply Pollard's packing bound to the associated symmetric-difference class]
Let $m\in\mathbb N$, and let $C_1,\dots,C_m\in\mathcal C$ be a finite $\varepsilon$-separated family in $d_Q$. For distinct indices $i,j\in\{1,\dots,m\}$, separation gives
\begin{align*}
Q(C_i\triangle C_j)\ge \varepsilon^2.
\end{align*}
We use Pollard's packing estimate for VC classes: if $\mathcal C$ has VC dimension $v<\infty$, $Q$ is a probability measure on $(S,\mathcal A)$, and $0<\alpha<1$, then every finite family $A_1,\dots,A_r\in\mathcal C$ satisfying $Q(A_a\triangle A_b)>\alpha$ for all $a\ne b$ has cardinality
\begin{align*}
r\le 2(v+1)(4e)^v\alpha^{-v}.
\end{align*}
This is Pollard's probabilistic discretisation argument: one samples finitely many points from $Q$, shows that an $L^1(Q)$-separated family must create many distinct traces with positive probability, and then bounds those traces by the Sauer-Shelah lemma [citetheorem:1969].
Fix $\alpha$ with $0<\alpha<\varepsilon^2$. Because $0<\varepsilon<1$, we have $0<\alpha<1$. Since $Q(C_i\triangle C_j)\ge \varepsilon^2>\alpha$ for all distinct $i,j$, the family $C_1,\dots,C_m$ satisfies the strict separation hypothesis in Pollard's estimate. Hence
\begin{align*}
m\le 2(v+1)(4e)^v\alpha^{-v}.
\end{align*}
Letting $\alpha\uparrow\varepsilon^2$ gives
\begin{align*}
m\le 2(v+1)(4e)^v\varepsilon^{-2v}.
\end{align*}
[guided]
Let $m\in\mathbb N$, and let $C_1,\dots,C_m\in\mathcal C$ be a finite family satisfying $d_Q(C_i,C_j)>\varepsilon$ for all distinct indices $i,j\in\{1,\dots,m\}$. The metric in the theorem is an $L^2(Q)$ metric on indicator functions. Squaring it converts the problem into the symmetric-difference metric
\begin{align*}
\rho_Q:\mathcal C\times\mathcal C\to[0,1],\qquad \rho_Q(C,D):=Q(C\triangle D).
\end{align*}
Indeed, by the definition of $d_Q$,
\begin{align*}
\rho_Q(C_i,C_j)=d_Q(C_i,C_j)^2\ge \varepsilon^2
\end{align*}
for all distinct $i,j$.
The external ingredient is Pollard's packing estimate for VC classes. Its hypotheses are: the class has finite VC dimension, the measure is a probability measure on the underlying measurable space, and the separation parameter $\alpha$ satisfies $0<\alpha<1$. These are available here because $V(\mathcal C)=v<\infty$, $Q$ is a probability measure on $(S,\mathcal A)$, and any choice $0<\alpha<\varepsilon^2$ also satisfies $0<\alpha<1$ since $0<\varepsilon<1$. The conclusion of Pollard's estimate is that every finite family in $\mathcal C$ whose pairwise symmetric differences have $Q$-measure strictly larger than $\alpha$ has cardinality at most
\begin{align*}
2(v+1)(4e)^v\alpha^{-v}.
\end{align*}
The finite-trace mechanism behind the estimate is that a random sample converts measure separation into separation of traces on a finite set, while the Sauer-Shelah lemma [citetheorem:1969] bounds the number of traces of a VC class of dimension $v$.
Now fix $\alpha$ with $0<\alpha<\varepsilon^2$. Since $Q(C_i\triangle C_j)\ge\varepsilon^2>\alpha$ for all distinct $i,j$, the strict separation hypothesis in Pollard's estimate is satisfied. Therefore
\begin{align*}
m\le 2(v+1)(4e)^v\alpha^{-v}.
\end{align*}
This inequality holds for every $0<\alpha<\varepsilon^2$. Passing to the limit as $\alpha\uparrow\varepsilon^2$ gives
\begin{align*}
m\le 2(v+1)(4e)^v\varepsilon^{-2v}.
\end{align*}
[/guided]
[/step]
[step:Absorb numerical factors into the stated universal constant]
From the previous step, every finite $\varepsilon$-separated subfamily of $\mathcal C$ has cardinality at most $2(v+1)(4e)^v\varepsilon^{-2v}$. The finite packing-to-covering reduction from the first step therefore gives
\begin{align*}
N(\varepsilon,\mathcal C,L^2(Q))
\le 2(v+1)(4e)^v\varepsilon^{-2v}.
\end{align*}
Since
\begin{align*}
2(4e)^v\le (4e)^{v+1}
\end{align*}
for every $v\ge 0$, the estimate holds with the universal constant $K=1$:
\begin{align*}
N(\varepsilon,\mathcal C,L^2(Q))
\le (v+1)(4e)^{v+1}\varepsilon^{-2v}.
\end{align*}
Allowing an arbitrary universal constant $K\ge 1$ gives exactly the stated form.
[/step]