Conditional Dudley Entropy Bound for Rademacher Processes (Theorem # 9850)
Theorem
Let $\mathbb N=\{1,2,\dots\}$. Let $n\in\mathbb N$, let $(S,\mathcal S)$ be a measurable space, and let $\mathcal F$ be a nonempty class of [measurable functions](/page/Measurable%20Functions) $f:S\to\mathbb R$. Let $x_1,\dots,x_n\in S$ be a realised sample satisfying
\begin{align*}
\sup_{f\in\mathcal F}\max_{1\le i\le n}|f(x_i)|<\infty.
\end{align*}
Define the empirical pseudometric $d_n:\mathcal F\times\mathcal F\to[0,\infty)$ by
\begin{align*}
d_n(f,g):=\left(\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-g(x_i))^2\right)^{1/2}
\end{align*}
for $f,g\in\mathcal F$, and define
\begin{align*}
\operatorname{diam}(\mathcal F,d_n):=\sup_{f,g\in\mathcal F}d_n(f,g).
\end{align*}
For $\varepsilon>0$, let $N(\mathcal F,d_n,\varepsilon)$ denote the least cardinality of a finite subset $A\subset\mathcal F$ such that for every $f\in\mathcal F$ there exists $a\in A$ with $d_n(f,a)\le\varepsilon$, with the convention that $N(\mathcal F,d_n,\varepsilon)=\infty$ if no finite such set exists. Let $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$ be a [probability space](/page/Probability%20Space) carrying independent Rademacher random variables $\varepsilon_1,\dots,\varepsilon_n:\Omega_\varepsilon\to\{-1,1\}$, meaning that
\begin{align*}
\mathbb P_\varepsilon(\varepsilon_i=1)=\mathbb P_\varepsilon(\varepsilon_i=-1)=\frac{1}{2}
\end{align*}
for every $1\le i\le n$. For each $f\in\mathcal F$, define the conditional Rademacher process $R_n(f):\Omega_\varepsilon\to\mathbb R$ by
\begin{align*}
R_n(f):=\frac{1}{\sqrt n}\sum_{i=1}^{n}\varepsilon_i f(x_i).
\end{align*}
Assume that $\mathcal F$ is pointwise measurable for this realised sample in the following sense: there is a countable subclass $\mathcal F_0\subset\mathcal F$ such that the [random variable](/page/Random%20Variable) $\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|$ is equal to $\sup_{f,g\in\mathcal F_0}|R_n(f)-R_n(g)|$ on $\Omega_\varepsilon$, and for every $\varepsilon>0$ the covering number $N(\mathcal F,d_n,\varepsilon)$ is equal to $N(\mathcal F_0,d_n,\varepsilon)$. Then there exists a universal constant $C>0$ such that
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|\right]
\le C\int_0^{\operatorname{diam}(\mathcal F,d_n)}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon).
\end{align*}
The right-hand side is interpreted as $+\infty$ when the entropy integral is infinite.
Knowledge Status
Probability & Statistics
Discussion
A result from [empirical process theory](/page/Empirical%20Process%20Theory) concerning conditional, dudley, entropy, bound, used to organize [uniform convergence](/page/Uniform%20Convergence), [weak convergence](/page/Weak%20Convergence), and complexity arguments in probability.
Proof
[proofplan]
Condition on the realised sample $x_1,\dots,x_n$, so the empirical pseudometric $d_n$ is deterministic and the only remaining randomness is carried by the Rademacher signs. First we prove directly that each increment $R_n(f)-R_n(g)$ is sub-Gaussian with variance proxy $d_n(f,g)^2$. We then use a successive-net chaining argument on the countable separable subclass supplied by the measurability hypothesis, bounding each chaining level by a finite sub-Gaussian maximum estimate. Summing over dyadic scales and comparing the sum with the entropy integral gives the desired conditional Dudley bound.
[/proofplan]
[step:Reduce the problem to the countable separating subclass]
By the pointwise measurability hypothesis, there is a countable subclass $\mathcal F_0\subset\mathcal F$ such that
\begin{align*}
\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|
=
\sup_{f,g\in\mathcal F_0}|R_n(f)-R_n(g)|
\end{align*}
as a [random variable](/page/Random%20Variable) on $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$, and
\begin{align*}
N(\mathcal F,d_n,\varepsilon)=N(\mathcal F_0,d_n,\varepsilon)
\end{align*}
for every $\varepsilon>0$. Therefore it is enough to prove the asserted estimate with $\mathcal F$ replaced by $\mathcal F_0$. Replacing $\mathcal F$ by $\mathcal F_0$, we assume from now on that $\mathcal F$ is countable.
The bounded empirical envelope assumption gives, for all $f,g\in\mathcal F$,
\begin{align*}
d_n(f,g)\le 2\sup_{h\in\mathcal F}\max_{1\le i\le n}|h(x_i)|<\infty.
\end{align*}
Thus
\begin{align*}
D:=\operatorname{diam}(\mathcal F,d_n)
\end{align*}
is a finite number. If $D=0$, then $f(x_i)=g(x_i)$ for every $f,g\in\mathcal F$ and every $1\le i\le n$, so $R_n(f)=R_n(g)$ for all $f,g\in\mathcal F$ and the left-hand side is $0$. Hence the result is immediate in this case. We assume henceforth that $D>0$.
If
\begin{align*}
\int_0^D \sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon)=+\infty,
\end{align*}
the claimed inequality is immediate. We therefore assume that this entropy integral is finite. Since the function
\begin{align*}
\varepsilon\mapsto N(\mathcal F,d_n,\varepsilon)
\end{align*}
is nonincreasing as $\varepsilon$ increases, finiteness of the integral implies that $N(\mathcal F,d_n,\varepsilon)<\infty$ for every $\varepsilon>0$.
[/step]
[step:Prove sub-Gaussian tails for every Rademacher increment]
Fix $f,g\in\mathcal F$. Define the [real numbers](/page/Real%20Numbers) $a_i(f,g)$ for $1\le i\le n$ by
\begin{align*}
a_i(f,g):=\frac{f(x_i)-g(x_i)}{\sqrt n}.
\end{align*}
Then
\begin{align*}
R_n(f)-R_n(g)=\sum_{i=1}^{n}\varepsilon_i a_i(f,g).
\end{align*}
For every $\lambda\in\mathbb R$, independence of $\varepsilon_1,\dots,\varepsilon_n$ and the elementary bound $\cosh u\le e^{u^2/2}$ give
\begin{align*}
\mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right]
=
\prod_{i=1}^{n}\mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right].
\end{align*}
For each $i$,
\begin{align*}
\mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right]
=
\cosh(\lambda a_i(f,g))
\le
\exp\left(\frac{\lambda^2a_i(f,g)^2}{2}\right).
\end{align*}
Multiplying the inequalities yields
\begin{align*}
\mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right]
\le
\exp\left(\frac{\lambda^2}{2}\sum_{i=1}^{n}a_i(f,g)^2\right).
\end{align*}
By the definition of $d_n$,
\begin{align*}
\sum_{i=1}^{n}a_i(f,g)^2=d_n(f,g)^2.
\end{align*}
Therefore
\begin{align*}
\mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right]
\le
\exp\left(\frac{\lambda^2d_n(f,g)^2}{2}\right).
\end{align*}
[guided]
The purpose of this step is to identify the metric that controls the process. Fix $f,g\in\mathcal F$. The increment of the process is
\begin{align*}
R_n(f)-R_n(g)=\frac{1}{\sqrt n}\sum_{i=1}^{n}\varepsilon_i(f(x_i)-g(x_i)).
\end{align*}
To put this into the standard form of a Rademacher sum, define
\begin{align*}
a_i(f,g):=\frac{f(x_i)-g(x_i)}{\sqrt n}
\end{align*}
for $1\le i\le n$. Then
\begin{align*}
R_n(f)-R_n(g)=\sum_{i=1}^{n}\varepsilon_i a_i(f,g).
\end{align*}
We now compute the exponential moment. Since the random variables $\varepsilon_1,\dots,\varepsilon_n$ are independent under $\mathbb P_\varepsilon$, the expectation of the product factors:
\begin{align*}
\mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right]
=
\prod_{i=1}^{n}\mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right].
\end{align*}
For a Rademacher random variable $\varepsilon_i$,
\begin{align*}
\mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right]
=
\frac{e^{\lambda a_i(f,g)}+e^{-\lambda a_i(f,g)}}{2}
=
\cosh(\lambda a_i(f,g)).
\end{align*}
The elementary inequality $\cosh u\le e^{u^2/2}$ gives
\begin{align*}
\mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right]
\le
\exp\left(\frac{\lambda^2a_i(f,g)^2}{2}\right).
\end{align*}
Multiplying these bounds over $1\le i\le n$ gives
\begin{align*}
\mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right]
\le
\exp\left(\frac{\lambda^2}{2}\sum_{i=1}^{n}a_i(f,g)^2\right).
\end{align*}
The sum in the exponent is exactly the empirical squared distance:
\begin{align*}
\sum_{i=1}^{n}a_i(f,g)^2
=
\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-g(x_i))^2
=
d_n(f,g)^2.
\end{align*}
Thus every increment is sub-Gaussian with scale $d_n(f,g)$:
\begin{align*}
\mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right]
\le
\exp\left(\frac{\lambda^2d_n(f,g)^2}{2}\right).
\end{align*}
This is exactly why the normalization $n^{-1/2}$ in $R_n$ matches the normalization in $d_n$.
[/guided]
[/step]
[step:Bound finite maxima of sub-Gaussian increments]
Let $I$ be a finite nonempty set, and let $(Z_i)_{i\in I}$ be real-valued random variables on $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$. Assume that there is a number $\sigma\ge 0$ such that, for every $i\in I$ and every $\lambda\in\mathbb R$,
\begin{align*}
\mathbb E_\varepsilon[e^{\lambda Z_i}]
\le
e^{\lambda^2\sigma^2/2}.
\end{align*}
Then
\begin{align*}
\mathbb E_\varepsilon\left[\max_{i\in I}|Z_i|\right]
\le
\sigma\sqrt{2\log(2|I|)}.
\end{align*}
Indeed, if $\sigma=0$, the exponential moment bound forces $Z_i=0$ $\mathbb P_\varepsilon$-a.s. for every $i\in I$, so the estimate is immediate. Assume $\sigma>0$. For any $\lambda>0$, monotonicity of the exponential and the union bound inside the exponential sum give
\begin{align*}
\exp\left(\lambda\max_{i\in I}|Z_i|\right)
\le
\sum_{i\in I}e^{\lambda Z_i}+\sum_{i\in I}e^{-\lambda Z_i}.
\end{align*}
Taking expectations and using the exponential moment assumption with $\lambda$ and $-\lambda$ gives
\begin{align*}
\mathbb E_\varepsilon\left[\exp\left(\lambda\max_{i\in I}|Z_i|\right)\right]
\le
2|I|e^{\lambda^2\sigma^2/2}.
\end{align*}
[Jensen's inequality](/theorems/9) applied to the concave logarithm gives
\begin{align*}
\lambda\mathbb E_\varepsilon\left[\max_{i\in I}|Z_i|\right]
\le
\log(2|I|)+\frac{\lambda^2\sigma^2}{2}.
\end{align*}
Choosing
\begin{align*}
\lambda:=\frac{\sqrt{2\log(2|I|)}}{\sigma}
\end{align*}
yields
\begin{align*}
\mathbb E_\varepsilon\left[\max_{i\in I}|Z_i|\right]
\le
\sigma\sqrt{2\log(2|I|)}.
\end{align*}
[/step]
[step:Construct dyadic nets and telescope every process value]
For each integer $k\ge 0$, define
\begin{align*}
\delta_k:=2^{-k}D.
\end{align*}
Because $N(\mathcal F,d_n,\delta_k)<\infty$, choose a finite $\delta_k$-net $T_k\subset\mathcal F$ such that
\begin{align*}
|T_k|=N(\mathcal F,d_n,\delta_k).
\end{align*}
For each $k\ge 0$, choose a nearest-net map $\pi_k:\mathcal F\to T_k$ satisfying
\begin{align*}
d_n(f,\pi_k(f))\le\delta_k
\end{align*}
for every $f\in\mathcal F$. Since $\delta_0=D$, the set $T_0$ may be chosen to contain a single element; fix this choice and denote its element by $f_*$. Then $\pi_0(f)=f_*$ for every $f\in\mathcal F$.
For every $f\in\mathcal F$ and every integer $m\ge 1$, the telescoping identity gives
\begin{align*}
R_n(\pi_m(f))-R_n(f_*)
=
\sum_{k=1}^{m}\left(R_n(\pi_k(f))-R_n(\pi_{k-1}(f))\right).
\end{align*}
Also,
\begin{align*}
d_n(f,\pi_m(f))\le\delta_m\to 0.
\end{align*}
The increment sub-Gaussian estimate implies convergence of $R_n(\pi_m(f))$ to $R_n(f)$ in $L^1(\mathbb P_\varepsilon)$. Indeed, applying the finite-maximum bound to the singleton family containing $R_n(f)-R_n(\pi_m(f))$ gives
\begin{align*}
\mathbb E_\varepsilon[|R_n(f)-R_n(\pi_m(f))|]
\le
d_n(f,\pi_m(f))\sqrt{2\log 2}
\le
\delta_m\sqrt{2\log 2}.
\end{align*}
Enumerate the countable class as $\mathcal F=\{h_j:j\in J\}$, where $J\subset\mathbb N$ is nonempty, and set $\mathcal F^{(q)}:=\{h_j:j\in J,\ j\le q\}$ for $q\in\mathbb N$. Fix $q$. Since $\mathcal F^{(q)}$ is finite and $R_n(\pi_m(h))-R_n(h)\to0$ in $L^1(\mathbb P_\varepsilon)$ for each $h\in\mathcal F^{(q)}$, the maximum over $\mathcal F^{(q)}$ also converges in $L^1(\mathbb P_\varepsilon)$. Therefore
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)|\right]
=
\lim_{m\to\infty}\mathbb E_\varepsilon\left[\sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)|\right].
\end{align*}
Using the finite telescoping identity and the triangle inequality in $\mathbb R$, for every $m\ge1$,
\begin{align*}
\sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)|
\le
\sum_{k=1}^{m}\sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))|.
\end{align*}
Taking expectations and then letting $m\to\infty$, the [monotone convergence theorem](/theorems/509) for the increasing partial sums on the right gives
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)|\right]
\le
\sum_{k=1}^{\infty}
\mathbb E_\varepsilon\left[
\sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))|
\right].
\end{align*}
Finally, the sequence $\sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)|$ increases pointwise to $\sup_{h\in\mathcal F}|R_n(h)-R_n(f_*)|$. Applying the monotone convergence theorem once more yields
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{h\in\mathcal F}|R_n(h)-R_n(f_*)|\right]
\le
\sum_{k=1}^{\infty}
\mathbb E_\varepsilon\left[
\sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))|
\right].
\end{align*}
[guided]
The delicate point is that the convergence $R_n(\pi_m(f))\to R_n(f)$ is first obtained for each fixed $f$, while the desired estimate contains a supremum over the whole countable class. We handle this by taking finite suprema first and only then passing to the countable supremum.
For a fixed $f\in\mathcal F$, the net property gives $d_n(f,\pi_m(f))\le\delta_m$ and $\delta_m\to0$. Applying the finite-maximum estimate to the singleton family containing the increment $R_n(f)-R_n(\pi_m(f))$ gives
\begin{align*}
\mathbb E_\varepsilon[|R_n(f)-R_n(\pi_m(f))|]
\le
d_n(f,\pi_m(f))\sqrt{2\log 2}
\le
\delta_m\sqrt{2\log 2}.
\end{align*}
Thus $R_n(\pi_m(f))\to R_n(f)$ in $L^1(\mathbb P_\varepsilon)$.
Now enumerate the countable class as $\mathcal F=\{h_j:j\in J\}$, where $J\subset\mathbb N$ is nonempty, and define $\mathcal F^{(q)}:=\{h_j:j\in J,\ j\le q\}$. For fixed $q$, only finitely many functions are involved. Since each term converges in $L^1$, the finite maximum also converges in $L^1$:
\begin{align*}
\sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)|
\to
\sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)|
\end{align*}
in $L^1(\mathbb P_\varepsilon)$. Hence the expectations converge.
For each $m\ge1$, the telescoping identity gives
\begin{align*}
R_n(\pi_m(h))-R_n(f_*)
=
\sum_{k=1}^{m}\left(R_n(\pi_k(h))-R_n(\pi_{k-1}(h))\right).
\end{align*}
Taking absolute values, then the supremum over $h\in\mathcal F^{(q)}$, and then enlarging the supremum to all of $\mathcal F$, we obtain
\begin{align*}
\sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)|
\le
\sum_{k=1}^{m}\sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))|.
\end{align*}
Taking expectations and letting $m\to\infty$, the monotone convergence theorem applies to the increasing partial sums on the right because each summand is nonnegative. This gives the bound for the finite class $\mathcal F^{(q)}$. Finally, as $q\to\infty$, the finite suprema increase pointwise to the countable supremum over $\mathcal F$, so another application of monotone convergence gives the desired expectation-of-supremum inequality.
[/guided]
[/step]
[step:Estimate each chaining level by its covering number]
Fix an integer $k\ge 1$. Define the finite set of admissible pairs
\begin{align*}
I_k:=\{(u,v)\in T_k\times T_{k-1}: \text{there exists } f\in\mathcal F \text{ with } u=\pi_k(f) \text{ and } v=\pi_{k-1}(f)\}.
\end{align*}
Then
\begin{align*}
|I_k|\le |T_k||T_{k-1}|.
\end{align*}
For every $(u,v)\in I_k$, there exists $f\in\mathcal F$ such that $u=\pi_k(f)$ and $v=\pi_{k-1}(f)$. By the triangle inequality for the pseudometric $d_n$,
\begin{align*}
d_n(u,v)\le d_n(u,f)+d_n(f,v).
\end{align*}
Using the defining properties of $\pi_k$ and $\pi_{k-1}$ gives
\begin{align*}
d_n(u,v)\le \delta_k+\delta_{k-1}=3\delta_k.
\end{align*}
The sub-Gaussian increment estimate and the finite-maximum estimate therefore imply
\begin{align*}
\mathbb E_\varepsilon\left[
\sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))|
\right]
\le
3\delta_k\sqrt{2\log(2|I_k|)}.
\end{align*}
Since $\delta_{k-1}>\delta_k$, every $\delta_k$-net is also a $\delta_{k-1}$-net after enlarging the radius, and monotonicity of covering numbers gives
\begin{align*}
|T_{k-1}|\le |T_k|.
\end{align*}
Thus
\begin{align*}
\log(2|I_k|)
\le
\log(2|T_k|^2)
\le
2\log(2|T_k|).
\end{align*}
As $|T_k|=N(\mathcal F,d_n,\delta_k)$, we obtain
\begin{align*}
\mathbb E_\varepsilon\left[
\sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))|
\right]
\le
6\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}.
\end{align*}
[guided]
At level $k$, the chaining approximation replaces each $f$ by two nearby net points, one at scale $\delta_k$ and one at the coarser scale $\delta_{k-1}$. We need to control the random variables
\begin{align*}
R_n(\pi_k(f))-R_n(\pi_{k-1}(f))
\end{align*}
uniformly over $f\in\mathcal F$.
First define the finite set of pairs that can actually occur:
\begin{align*}
I_k:=\{(u,v)\in T_k\times T_{k-1}: \text{there exists } f\in\mathcal F \text{ with } u=\pi_k(f) \text{ and } v=\pi_{k-1}(f)\}.
\end{align*}
This set is finite because $T_k$ and $T_{k-1}$ are finite, and it satisfies
\begin{align*}
|I_k|\le |T_k||T_{k-1}|.
\end{align*}
Now fix a pair $(u,v)\in I_k$. By definition of $I_k$, there exists $f\in\mathcal F$ such that $u=\pi_k(f)$ and $v=\pi_{k-1}(f)$. The triangle inequality for $d_n$ gives
\begin{align*}
d_n(u,v)\le d_n(u,f)+d_n(f,v).
\end{align*}
The map $\pi_k$ was chosen so that $d_n(f,\pi_k(f))\le\delta_k$, and $\pi_{k-1}$ was chosen so that $d_n(f,\pi_{k-1}(f))\le\delta_{k-1}$. Therefore
\begin{align*}
d_n(u,v)\le\delta_k+\delta_{k-1}.
\end{align*}
Because $\delta_{k-1}=2\delta_k$, this becomes
\begin{align*}
d_n(u,v)\le 3\delta_k.
\end{align*}
The previous sub-Gaussian increment estimate applies to every random variable $R_n(u)-R_n(v)$ with $(u,v)\in I_k$, with scale at most $3\delta_k$. Applying the finite-maximum estimate to the finite family indexed by $I_k$ gives
\begin{align*}
\mathbb E_\varepsilon\left[
\sup_{(u,v)\in I_k}|R_n(u)-R_n(v)|
\right]
\le
3\delta_k\sqrt{2\log(2|I_k|)}.
\end{align*}
Since every pair generated by some $f$ belongs to $I_k$, the supremum over $f\in\mathcal F$ is bounded by the supremum over $(u,v)\in I_k$:
\begin{align*}
\mathbb E_\varepsilon\left[
\sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))|
\right]
\le
3\delta_k\sqrt{2\log(2|I_k|)}.
\end{align*}
It remains only to express this in terms of covering numbers. Since $|I_k|\le |T_k||T_{k-1}|$ and $|T_{k-1}|\le |T_k|$, we have
\begin{align*}
\log(2|I_k|)
\le
\log(2|T_k|^2).
\end{align*}
For every nonempty finite set $T_k$, the elementary estimate
\begin{align*}
\log(2|T_k|^2)\le 2\log(2|T_k|)
\end{align*}
holds. Since $|T_k|=N(\mathcal F,d_n,\delta_k)$, the level-$k$ estimate is
\begin{align*}
\mathbb E_\varepsilon\left[
\sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))|
\right]
\le
6\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}.
\end{align*}
This is the chaining mechanism: the random fluctuation at scale $\delta_k$ is controlled by the scale $\delta_k$ times the square root of the logarithm of the number of net points needed at that scale.
[/guided]
[/step]
[step:Compare the dyadic chaining sum with the entropy integral]
Combining the preceding two steps gives
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}|R_n(f)-R_n(f_*)|\right]
\le
6\sum_{k=1}^{\infty}\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}.
\end{align*}
We first record a diameter consequence. If $0<r<D/2$, then $N(\mathcal F,d_n,r)\ge2$. Indeed, if $N(\mathcal F,d_n,r)=1$, then there exists $a\in\mathcal F$ such that $d_n(f,a)\le r$ for every $f\in\mathcal F$, and the triangle inequality for $d_n$ gives $d_n(f,g)\le2r<D$ for every $f,g\in\mathcal F$, contradicting the definition of $D$ as the supremum of all such distances.
For every $k\ge1$ and every $\varepsilon\in(\delta_{k+1},\delta_k)$, we have $\varepsilon<D/2$, so
\begin{align*}
N(\mathcal F,d_n,\varepsilon)\ge2.
\end{align*}
Since covering numbers are nonincreasing as the radius increases and $\varepsilon\le\delta_k$, we also have
\begin{align*}
N(\mathcal F,d_n,\delta_k)\le N(\mathcal F,d_n,\varepsilon).
\end{align*}
Therefore, for $\mathcal L^1$-a.e. $\varepsilon\in(\delta_{k+1},\delta_k)$,
\begin{align*}
\log(2N(\mathcal F,d_n,\delta_k))
\le
\log(2N(\mathcal F,d_n,\varepsilon))
\le
2\log N(\mathcal F,d_n,\varepsilon).
\end{align*}
Using $\delta_k=2(\delta_k-\delta_{k+1})$, we obtain
\begin{align*}
\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}
\le
2\sqrt2\int_{\delta_{k+1}}^{\delta_k}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon).
\end{align*}
Summing over $k\ge1$ and using that the intervals $(\delta_{k+1},\delta_k)$ are disjoint subsets of $(0,D)$ gives
\begin{align*}
\sum_{k=1}^{\infty}\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}
\le
2\sqrt2\int_0^D\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon).
\end{align*}
Thus we may take
\begin{align*}
C_1:=2\sqrt2.
\end{align*}
Consequently
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}|R_n(f)-R_n(f_*)|\right]
\le
6C_1\int_0^D\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon).
\end{align*}
[guided]
The goal is to turn the dyadic sum produced by chaining into the entropy integral. The only subtlety is the extra factor $2$ inside $\log(2N)$.
First observe that small balls cannot cover the class with one centre when the diameter is $D>0$. If $0<r<D/2$ and $N(\mathcal F,d_n,r)=1$, then some $a\in\mathcal F$ satisfies $d_n(f,a)\le r$ for every $f\in\mathcal F$. For any $f,g\in\mathcal F$, the triangle inequality gives
\begin{align*}
d_n(f,g)\le d_n(f,a)+d_n(a,g)\le2r<D.
\end{align*}
This contradicts the definition
\begin{align*}
D=\sup_{f,g\in\mathcal F}d_n(f,g).
\end{align*}
Hence $N(\mathcal F,d_n,r)\ge2$ for every $0<r<D/2$.
Now fix $k\ge1$. On the interval $(\delta_{k+1},\delta_k)$ we have $\varepsilon<D/2$, so $N(\mathcal F,d_n,\varepsilon)\ge2$. Also $\varepsilon\le\delta_k$, and covering numbers decrease when the radius increases. Therefore
\begin{align*}
N(\mathcal F,d_n,\delta_k)\le N(\mathcal F,d_n,\varepsilon).
\end{align*}
Since $N(\mathcal F,d_n,\varepsilon)\ge2$, we have $\log 2\le\log N(\mathcal F,d_n,\varepsilon)$, and hence
\begin{align*}
\log(2N(\mathcal F,d_n,\delta_k))
\le
\log(2N(\mathcal F,d_n,\varepsilon))
=
\log 2+\log N(\mathcal F,d_n,\varepsilon)
\le
2\log N(\mathcal F,d_n,\varepsilon).
\end{align*}
Taking square roots gives
\begin{align*}
\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}
\le
\sqrt2\sqrt{\log N(\mathcal F,d_n,\varepsilon)}.
\end{align*}
Finally $\delta_k=2(\delta_k-\delta_{k+1})$, so integrating the last pointwise bound over $(\delta_{k+1},\delta_k)$ yields
\begin{align*}
\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}
\le
2\sqrt2\int_{\delta_{k+1}}^{\delta_k}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon).
\end{align*}
The dyadic intervals are disjoint and lie in $(0,D)$, so summing over $k\ge1$ gives the comparison with constant $C_1=2\sqrt2$.
[/guided]
[/step]
[step:Pass from one anchored supremum to the pair supremum]
For every $f,g\in\mathcal F$, the triangle inequality in $\mathbb R$ gives
\begin{align*}
|R_n(f)-R_n(g)|
\le
|R_n(f)-R_n(f_*)|+|R_n(g)-R_n(f_*)|.
\end{align*}
Taking the supremum over $f,g\in\mathcal F$ yields
\begin{align*}
\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|
\le
2\sup_{h\in\mathcal F}|R_n(h)-R_n(f_*)|.
\end{align*}
Taking $\mathbb P_\varepsilon$-expectations and using the anchored estimate from the preceding step gives
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|\right]
\le
12C_1\int_0^D\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon).
\end{align*}
With
\begin{align*}
C:=12C_1,
\end{align*}
and with $D=\operatorname{diam}(\mathcal F,d_n)$, this is precisely
\begin{align*}
\mathbb E_\varepsilon\left[\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|\right]
\le
C\int_0^{\operatorname{diam}(\mathcal F,d_n)}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon).
\end{align*}
The constant $C$ is universal because every constant introduced above came only from the Rademacher exponential moment estimate, the finite-maximum estimate, and the dyadic comparison. This proves the theorem.
[/step]
Prerequisites (0/5 completed)
Prerequisites Graph
Interactive dependency map showing how this theorem builds on foundational concepts
Loading dependency graph...
Theorem
Definition
Current
Requires
Explore Further
Variance
Definition
Triangle Inequality For Inner Product Spaces
Theorem #433
Monotone Convergence Theorem
Theorem #509
Monotone Convergence Theorem For Sequences
Theorem #626
Monotone Convergence Theorem for Sequences
Theorem #743
Minimality of Generated Sigma-Algebras
Probability & Statistics
Orthogonality of Ordinary Least Squares Residuals
Probability & Statistics
Countable Subadditivity
Probability Theory
Coordinate Characterisation of Product Measurability
Probability & Statistics
Nonidentifiability Under Hidden Confounding
Probability & Statistics
Existence and Autocovariance of Absolutely Summable Linear Processes
Probability & Statistics
Tensorization Theorem for Logarithmic Sobolev Inequalities
Probability & Statistics
Asymptotic Normality of Two-Sample Linear Rank Statistics
Probability & Statistics
Probability & Statistics
Area