Conditional Dudley Entropy Bound for Rademacher Processes

Theorem

Edit Issues Pull Requests Attributions Admin

Let $\mathbb N=\{1,2,\dots\}$. Let $n\in\mathbb N$, let $(S,\mathcal S)$ be a measurable space, and let $\mathcal F$ be a nonempty class of [measurable functions](/page/Measurable%20Functions) $f:S\to\mathbb R$. Let $x_1,\dots,x_n\in S$ be a realised sample satisfying \begin{align*} \sup_{f\in\mathcal F}\max_{1\le i\le n}|f(x_i)|<\infty. \end{align*} Define the empirical pseudometric $d_n:\mathcal F\times\mathcal F\to[0,\infty)$ by \begin{align*} d_n(f,g):=\left(\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-g(x_i))^2\right)^{1/2} \end{align*} for $f,g\in\mathcal F$, and define \begin{align*} \operatorname{diam}(\mathcal F,d_n):=\sup_{f,g\in\mathcal F}d_n(f,g). \end{align*} For $\varepsilon>0$, let $N(\mathcal F,d_n,\varepsilon)$ denote the least cardinality of a finite subset $A\subset\mathcal F$ such that for every $f\in\mathcal F$ there exists $a\in A$ with $d_n(f,a)\le\varepsilon$, with the convention that $N(\mathcal F,d_n,\varepsilon)=\infty$ if no finite such set exists. Let $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$ be a [probability space](/page/Probability%20Space) carrying independent Rademacher random variables $\varepsilon_1,\dots,\varepsilon_n:\Omega_\varepsilon\to\{-1,1\}$, meaning that \begin{align*} \mathbb P_\varepsilon(\varepsilon_i=1)=\mathbb P_\varepsilon(\varepsilon_i=-1)=\frac{1}{2} \end{align*} for every $1\le i\le n$. For each $f\in\mathcal F$, define the conditional Rademacher process $R_n(f):\Omega_\varepsilon\to\mathbb R$ by \begin{align*} R_n(f):=\frac{1}{\sqrt n}\sum_{i=1}^{n}\varepsilon_i f(x_i). \end{align*} Assume that $\mathcal F$ is pointwise measurable for this realised sample in the following sense: there is a countable subclass $\mathcal F_0\subset\mathcal F$ such that the [random variable](/page/Random%20Variable) $\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|$ is equal to $\sup_{f,g\in\mathcal F_0}|R_n(f)-R_n(g)|$ on $\Omega_\varepsilon$, and for every $\varepsilon>0$ the covering number $N(\mathcal F,d_n,\varepsilon)$ is equal to $N(\mathcal F_0,d_n,\varepsilon)$. Then there exists a universal constant $C>0$ such that \begin{align*} \mathbb E_\varepsilon\left[\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|\right] \le C\int_0^{\operatorname{diam}(\mathcal F,d_n)}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon). \end{align*} The right-hand side is interpreted as $+\infty$ when the entropy integral is infinite.

Discussion

Proof

[proofplan] Condition on the realised sample $x_1,\dots,x_n$, so the empirical pseudometric $d_n$ is deterministic and the only remaining randomness is carried by the Rademacher signs. First we prove directly that each increment $R_n(f)-R_n(g)$ is sub-Gaussian with variance proxy $d_n(f,g)^2$. We then use a successive-net chaining argument on the countable separable subclass supplied by the measurability hypothesis, bounding each chaining level by a finite sub-Gaussian maximum estimate. Summing over dyadic scales and comparing the sum with the entropy integral gives the desired conditional Dudley bound. [/proofplan] [step:Reduce the problem to the countable separating subclass] By the pointwise measurability hypothesis, there is a countable subclass $\mathcal F_0\subset\mathcal F$ such that \begin{align*} \sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)| = \sup_{f,g\in\mathcal F_0}|R_n(f)-R_n(g)| \end{align*} as a [random variable](/page/Random%20Variable) on $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$, and \begin{align*} N(\mathcal F,d_n,\varepsilon)=N(\mathcal F_0,d_n,\varepsilon) \end{align*} for every $\varepsilon>0$. Therefore it is enough to prove the asserted estimate with $\mathcal F$ replaced by $\mathcal F_0$. Replacing $\mathcal F$ by $\mathcal F_0$, we assume from now on that $\mathcal F$ is countable. The bounded empirical envelope assumption gives, for all $f,g\in\mathcal F$, \begin{align*} d_n(f,g)\le 2\sup_{h\in\mathcal F}\max_{1\le i\le n}|h(x_i)|<\infty. \end{align*} Thus \begin{align*} D:=\operatorname{diam}(\mathcal F,d_n) \end{align*} is a finite number. If $D=0$, then $f(x_i)=g(x_i)$ for every $f,g\in\mathcal F$ and every $1\le i\le n$, so $R_n(f)=R_n(g)$ for all $f,g\in\mathcal F$ and the left-hand side is $0$. Hence the result is immediate in this case. We assume henceforth that $D>0$. If \begin{align*} \int_0^D \sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon)=+\infty, \end{align*} the claimed inequality is immediate. We therefore assume that this entropy integral is finite. Since the function \begin{align*} \varepsilon\mapsto N(\mathcal F,d_n,\varepsilon) \end{align*} is nonincreasing as $\varepsilon$ increases, finiteness of the integral implies that $N(\mathcal F,d_n,\varepsilon)<\infty$ for every $\varepsilon>0$. [/step] [step:Prove sub-Gaussian tails for every Rademacher increment] Fix $f,g\in\mathcal F$. Define the [real numbers](/page/Real%20Numbers) $a_i(f,g)$ for $1\le i\le n$ by \begin{align*} a_i(f,g):=\frac{f(x_i)-g(x_i)}{\sqrt n}. \end{align*} Then \begin{align*} R_n(f)-R_n(g)=\sum_{i=1}^{n}\varepsilon_i a_i(f,g). \end{align*} For every $\lambda\in\mathbb R$, independence of $\varepsilon_1,\dots,\varepsilon_n$ and the elementary bound $\cosh u\le e^{u^2/2}$ give \begin{align*} \mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right] = \prod_{i=1}^{n}\mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right]. \end{align*} For each $i$, \begin{align*} \mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right] = \cosh(\lambda a_i(f,g)) \le \exp\left(\frac{\lambda^2a_i(f,g)^2}{2}\right). \end{align*} Multiplying the inequalities yields \begin{align*} \mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right] \le \exp\left(\frac{\lambda^2}{2}\sum_{i=1}^{n}a_i(f,g)^2\right). \end{align*} By the definition of $d_n$, \begin{align*} \sum_{i=1}^{n}a_i(f,g)^2=d_n(f,g)^2. \end{align*} Therefore \begin{align*} \mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right] \le \exp\left(\frac{\lambda^2d_n(f,g)^2}{2}\right). \end{align*} [guided] The purpose of this step is to identify the metric that controls the process. Fix $f,g\in\mathcal F$. The increment of the process is \begin{align*} R_n(f)-R_n(g)=\frac{1}{\sqrt n}\sum_{i=1}^{n}\varepsilon_i(f(x_i)-g(x_i)). \end{align*} To put this into the standard form of a Rademacher sum, define \begin{align*} a_i(f,g):=\frac{f(x_i)-g(x_i)}{\sqrt n} \end{align*} for $1\le i\le n$. Then \begin{align*} R_n(f)-R_n(g)=\sum_{i=1}^{n}\varepsilon_i a_i(f,g). \end{align*} We now compute the exponential moment. Since the random variables $\varepsilon_1,\dots,\varepsilon_n$ are independent under $\mathbb P_\varepsilon$, the expectation of the product factors: \begin{align*} \mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right] = \prod_{i=1}^{n}\mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right]. \end{align*} For a Rademacher random variable $\varepsilon_i$, \begin{align*} \mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right] = \frac{e^{\lambda a_i(f,g)}+e^{-\lambda a_i(f,g)}}{2} = \cosh(\lambda a_i(f,g)). \end{align*} The elementary inequality $\cosh u\le e^{u^2/2}$ gives \begin{align*} \mathbb E_\varepsilon\left[\exp(\lambda\varepsilon_i a_i(f,g))\right] \le \exp\left(\frac{\lambda^2a_i(f,g)^2}{2}\right). \end{align*} Multiplying these bounds over $1\le i\le n$ gives \begin{align*} \mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right] \le \exp\left(\frac{\lambda^2}{2}\sum_{i=1}^{n}a_i(f,g)^2\right). \end{align*} The sum in the exponent is exactly the empirical squared distance: \begin{align*} \sum_{i=1}^{n}a_i(f,g)^2 = \frac{1}{n}\sum_{i=1}^{n}(f(x_i)-g(x_i))^2 = d_n(f,g)^2. \end{align*} Thus every increment is sub-Gaussian with scale $d_n(f,g)$: \begin{align*} \mathbb E_\varepsilon\left[\exp\left(\lambda(R_n(f)-R_n(g))\right)\right] \le \exp\left(\frac{\lambda^2d_n(f,g)^2}{2}\right). \end{align*} This is exactly why the normalization $n^{-1/2}$ in $R_n$ matches the normalization in $d_n$. [/guided] [/step] [step:Bound finite maxima of sub-Gaussian increments] Let $I$ be a finite nonempty set, and let $(Z_i)_{i\in I}$ be real-valued random variables on $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$. Assume that there is a number $\sigma\ge 0$ such that, for every $i\in I$ and every $\lambda\in\mathbb R$, \begin{align*} \mathbb E_\varepsilon[e^{\lambda Z_i}] \le e^{\lambda^2\sigma^2/2}. \end{align*} Then \begin{align*} \mathbb E_\varepsilon\left[\max_{i\in I}|Z_i|\right] \le \sigma\sqrt{2\log(2|I|)}. \end{align*} Indeed, if $\sigma=0$, the exponential moment bound forces $Z_i=0$ $\mathbb P_\varepsilon$-a.s. for every $i\in I$, so the estimate is immediate. Assume $\sigma>0$. For any $\lambda>0$, monotonicity of the exponential and the union bound inside the exponential sum give \begin{align*} \exp\left(\lambda\max_{i\in I}|Z_i|\right) \le \sum_{i\in I}e^{\lambda Z_i}+\sum_{i\in I}e^{-\lambda Z_i}. \end{align*} Taking expectations and using the exponential moment assumption with $\lambda$ and $-\lambda$ gives \begin{align*} \mathbb E_\varepsilon\left[\exp\left(\lambda\max_{i\in I}|Z_i|\right)\right] \le 2|I|e^{\lambda^2\sigma^2/2}. \end{align*} [Jensen's inequality](/theorems/9) applied to the concave logarithm gives \begin{align*} \lambda\mathbb E_\varepsilon\left[\max_{i\in I}|Z_i|\right] \le \log(2|I|)+\frac{\lambda^2\sigma^2}{2}. \end{align*} Choosing \begin{align*} \lambda:=\frac{\sqrt{2\log(2|I|)}}{\sigma} \end{align*} yields \begin{align*} \mathbb E_\varepsilon\left[\max_{i\in I}|Z_i|\right] \le \sigma\sqrt{2\log(2|I|)}. \end{align*} [/step] [step:Construct dyadic nets and telescope every process value] For each integer $k\ge 0$, define \begin{align*} \delta_k:=2^{-k}D. \end{align*} Because $N(\mathcal F,d_n,\delta_k)<\infty$, choose a finite $\delta_k$-net $T_k\subset\mathcal F$ such that \begin{align*} |T_k|=N(\mathcal F,d_n,\delta_k). \end{align*} For each $k\ge 0$, choose a nearest-net map $\pi_k:\mathcal F\to T_k$ satisfying \begin{align*} d_n(f,\pi_k(f))\le\delta_k \end{align*} for every $f\in\mathcal F$. Since $\delta_0=D$, the set $T_0$ may be chosen to contain a single element; fix this choice and denote its element by $f_*$. Then $\pi_0(f)=f_*$ for every $f\in\mathcal F$. For every $f\in\mathcal F$ and every integer $m\ge 1$, the telescoping identity gives \begin{align*} R_n(\pi_m(f))-R_n(f_*) = \sum_{k=1}^{m}\left(R_n(\pi_k(f))-R_n(\pi_{k-1}(f))\right). \end{align*} Also, \begin{align*} d_n(f,\pi_m(f))\le\delta_m\to 0. \end{align*} The increment sub-Gaussian estimate implies convergence of $R_n(\pi_m(f))$ to $R_n(f)$ in $L^1(\mathbb P_\varepsilon)$. Indeed, applying the finite-maximum bound to the singleton family containing $R_n(f)-R_n(\pi_m(f))$ gives \begin{align*} \mathbb E_\varepsilon[|R_n(f)-R_n(\pi_m(f))|] \le d_n(f,\pi_m(f))\sqrt{2\log 2} \le \delta_m\sqrt{2\log 2}. \end{align*} Enumerate the countable class as $\mathcal F=\{h_j:j\in J\}$, where $J\subset\mathbb N$ is nonempty, and set $\mathcal F^{(q)}:=\{h_j:j\in J,\ j\le q\}$ for $q\in\mathbb N$. Fix $q$. Since $\mathcal F^{(q)}$ is finite and $R_n(\pi_m(h))-R_n(h)\to0$ in $L^1(\mathbb P_\varepsilon)$ for each $h\in\mathcal F^{(q)}$, the maximum over $\mathcal F^{(q)}$ also converges in $L^1(\mathbb P_\varepsilon)$. Therefore \begin{align*} \mathbb E_\varepsilon\left[\sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)|\right] = \lim_{m\to\infty}\mathbb E_\varepsilon\left[\sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)|\right]. \end{align*} Using the finite telescoping identity and the triangle inequality in $\mathbb R$, for every $m\ge1$, \begin{align*} \sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)| \le \sum_{k=1}^{m}\sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))|. \end{align*} Taking expectations and then letting $m\to\infty$, the [monotone convergence theorem](/theorems/509) for the increasing partial sums on the right gives \begin{align*} \mathbb E_\varepsilon\left[\sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)|\right] \le \sum_{k=1}^{\infty} \mathbb E_\varepsilon\left[ \sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))| \right]. \end{align*} Finally, the sequence $\sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)|$ increases pointwise to $\sup_{h\in\mathcal F}|R_n(h)-R_n(f_*)|$. Applying the monotone convergence theorem once more yields \begin{align*} \mathbb E_\varepsilon\left[\sup_{h\in\mathcal F}|R_n(h)-R_n(f_*)|\right] \le \sum_{k=1}^{\infty} \mathbb E_\varepsilon\left[ \sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))| \right]. \end{align*} [guided] The delicate point is that the convergence $R_n(\pi_m(f))\to R_n(f)$ is first obtained for each fixed $f$, while the desired estimate contains a supremum over the whole countable class. We handle this by taking finite suprema first and only then passing to the countable supremum. For a fixed $f\in\mathcal F$, the net property gives $d_n(f,\pi_m(f))\le\delta_m$ and $\delta_m\to0$. Applying the finite-maximum estimate to the singleton family containing the increment $R_n(f)-R_n(\pi_m(f))$ gives \begin{align*} \mathbb E_\varepsilon[|R_n(f)-R_n(\pi_m(f))|] \le d_n(f,\pi_m(f))\sqrt{2\log 2} \le \delta_m\sqrt{2\log 2}. \end{align*} Thus $R_n(\pi_m(f))\to R_n(f)$ in $L^1(\mathbb P_\varepsilon)$. Now enumerate the countable class as $\mathcal F=\{h_j:j\in J\}$, where $J\subset\mathbb N$ is nonempty, and define $\mathcal F^{(q)}:=\{h_j:j\in J,\ j\le q\}$. For fixed $q$, only finitely many functions are involved. Since each term converges in $L^1$, the finite maximum also converges in $L^1$: \begin{align*} \sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)| \to \sup_{h\in\mathcal F^{(q)}}|R_n(h)-R_n(f_*)| \end{align*} in $L^1(\mathbb P_\varepsilon)$. Hence the expectations converge. For each $m\ge1$, the telescoping identity gives \begin{align*} R_n(\pi_m(h))-R_n(f_*) = \sum_{k=1}^{m}\left(R_n(\pi_k(h))-R_n(\pi_{k-1}(h))\right). \end{align*} Taking absolute values, then the supremum over $h\in\mathcal F^{(q)}$, and then enlarging the supremum to all of $\mathcal F$, we obtain \begin{align*} \sup_{h\in\mathcal F^{(q)}}|R_n(\pi_m(h))-R_n(f_*)| \le \sum_{k=1}^{m}\sup_{h\in\mathcal F}|R_n(\pi_k(h))-R_n(\pi_{k-1}(h))|. \end{align*} Taking expectations and letting $m\to\infty$, the monotone convergence theorem applies to the increasing partial sums on the right because each summand is nonnegative. This gives the bound for the finite class $\mathcal F^{(q)}$. Finally, as $q\to\infty$, the finite suprema increase pointwise to the countable supremum over $\mathcal F$, so another application of monotone convergence gives the desired expectation-of-supremum inequality. [/guided] [/step] [step:Estimate each chaining level by its covering number] Fix an integer $k\ge 1$. Define the finite set of admissible pairs \begin{align*} I_k:=\{(u,v)\in T_k\times T_{k-1}: \text{there exists } f\in\mathcal F \text{ with } u=\pi_k(f) \text{ and } v=\pi_{k-1}(f)\}. \end{align*} Then \begin{align*} |I_k|\le |T_k||T_{k-1}|. \end{align*} For every $(u,v)\in I_k$, there exists $f\in\mathcal F$ such that $u=\pi_k(f)$ and $v=\pi_{k-1}(f)$. By the triangle inequality for the pseudometric $d_n$, \begin{align*} d_n(u,v)\le d_n(u,f)+d_n(f,v). \end{align*} Using the defining properties of $\pi_k$ and $\pi_{k-1}$ gives \begin{align*} d_n(u,v)\le \delta_k+\delta_{k-1}=3\delta_k. \end{align*} The sub-Gaussian increment estimate and the finite-maximum estimate therefore imply \begin{align*} \mathbb E_\varepsilon\left[ \sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))| \right] \le 3\delta_k\sqrt{2\log(2|I_k|)}. \end{align*} Since $\delta_{k-1}>\delta_k$, every $\delta_k$-net is also a $\delta_{k-1}$-net after enlarging the radius, and monotonicity of covering numbers gives \begin{align*} |T_{k-1}|\le |T_k|. \end{align*} Thus \begin{align*} \log(2|I_k|) \le \log(2|T_k|^2) \le 2\log(2|T_k|). \end{align*} As $|T_k|=N(\mathcal F,d_n,\delta_k)$, we obtain \begin{align*} \mathbb E_\varepsilon\left[ \sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))| \right] \le 6\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}. \end{align*} [guided] At level $k$, the chaining approximation replaces each $f$ by two nearby net points, one at scale $\delta_k$ and one at the coarser scale $\delta_{k-1}$. We need to control the random variables \begin{align*} R_n(\pi_k(f))-R_n(\pi_{k-1}(f)) \end{align*} uniformly over $f\in\mathcal F$. First define the finite set of pairs that can actually occur: \begin{align*} I_k:=\{(u,v)\in T_k\times T_{k-1}: \text{there exists } f\in\mathcal F \text{ with } u=\pi_k(f) \text{ and } v=\pi_{k-1}(f)\}. \end{align*} This set is finite because $T_k$ and $T_{k-1}$ are finite, and it satisfies \begin{align*} |I_k|\le |T_k||T_{k-1}|. \end{align*} Now fix a pair $(u,v)\in I_k$. By definition of $I_k$, there exists $f\in\mathcal F$ such that $u=\pi_k(f)$ and $v=\pi_{k-1}(f)$. The triangle inequality for $d_n$ gives \begin{align*} d_n(u,v)\le d_n(u,f)+d_n(f,v). \end{align*} The map $\pi_k$ was chosen so that $d_n(f,\pi_k(f))\le\delta_k$, and $\pi_{k-1}$ was chosen so that $d_n(f,\pi_{k-1}(f))\le\delta_{k-1}$. Therefore \begin{align*} d_n(u,v)\le\delta_k+\delta_{k-1}. \end{align*} Because $\delta_{k-1}=2\delta_k$, this becomes \begin{align*} d_n(u,v)\le 3\delta_k. \end{align*} The previous sub-Gaussian increment estimate applies to every random variable $R_n(u)-R_n(v)$ with $(u,v)\in I_k$, with scale at most $3\delta_k$. Applying the finite-maximum estimate to the finite family indexed by $I_k$ gives \begin{align*} \mathbb E_\varepsilon\left[ \sup_{(u,v)\in I_k}|R_n(u)-R_n(v)| \right] \le 3\delta_k\sqrt{2\log(2|I_k|)}. \end{align*} Since every pair generated by some $f$ belongs to $I_k$, the supremum over $f\in\mathcal F$ is bounded by the supremum over $(u,v)\in I_k$: \begin{align*} \mathbb E_\varepsilon\left[ \sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))| \right] \le 3\delta_k\sqrt{2\log(2|I_k|)}. \end{align*} It remains only to express this in terms of covering numbers. Since $|I_k|\le |T_k||T_{k-1}|$ and $|T_{k-1}|\le |T_k|$, we have \begin{align*} \log(2|I_k|) \le \log(2|T_k|^2). \end{align*} For every nonempty finite set $T_k$, the elementary estimate \begin{align*} \log(2|T_k|^2)\le 2\log(2|T_k|) \end{align*} holds. Since $|T_k|=N(\mathcal F,d_n,\delta_k)$, the level-$k$ estimate is \begin{align*} \mathbb E_\varepsilon\left[ \sup_{f\in\mathcal F}|R_n(\pi_k(f))-R_n(\pi_{k-1}(f))| \right] \le 6\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}. \end{align*} This is the chaining mechanism: the random fluctuation at scale $\delta_k$ is controlled by the scale $\delta_k$ times the square root of the logarithm of the number of net points needed at that scale. [/guided] [/step] [step:Compare the dyadic chaining sum with the entropy integral] Combining the preceding two steps gives \begin{align*} \mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}|R_n(f)-R_n(f_*)|\right] \le 6\sum_{k=1}^{\infty}\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))}. \end{align*} We first record a diameter consequence. If $0<r<D/2$, then $N(\mathcal F,d_n,r)\ge2$. Indeed, if $N(\mathcal F,d_n,r)=1$, then there exists $a\in\mathcal F$ such that $d_n(f,a)\le r$ for every $f\in\mathcal F$, and the triangle inequality for $d_n$ gives $d_n(f,g)\le2r<D$ for every $f,g\in\mathcal F$, contradicting the definition of $D$ as the supremum of all such distances. For every $k\ge1$ and every $\varepsilon\in(\delta_{k+1},\delta_k)$, we have $\varepsilon<D/2$, so \begin{align*} N(\mathcal F,d_n,\varepsilon)\ge2. \end{align*} Since covering numbers are nonincreasing as the radius increases and $\varepsilon\le\delta_k$, we also have \begin{align*} N(\mathcal F,d_n,\delta_k)\le N(\mathcal F,d_n,\varepsilon). \end{align*} Therefore, for $\mathcal L^1$-a.e. $\varepsilon\in(\delta_{k+1},\delta_k)$, \begin{align*} \log(2N(\mathcal F,d_n,\delta_k)) \le \log(2N(\mathcal F,d_n,\varepsilon)) \le 2\log N(\mathcal F,d_n,\varepsilon). \end{align*} Using $\delta_k=2(\delta_k-\delta_{k+1})$, we obtain \begin{align*} \delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))} \le 2\sqrt2\int_{\delta_{k+1}}^{\delta_k}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon). \end{align*} Summing over $k\ge1$ and using that the intervals $(\delta_{k+1},\delta_k)$ are disjoint subsets of $(0,D)$ gives \begin{align*} \sum_{k=1}^{\infty}\delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))} \le 2\sqrt2\int_0^D\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon). \end{align*} Thus we may take \begin{align*} C_1:=2\sqrt2. \end{align*} Consequently \begin{align*} \mathbb E_\varepsilon\left[\sup_{f\in\mathcal F}|R_n(f)-R_n(f_*)|\right] \le 6C_1\int_0^D\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon). \end{align*} [guided] The goal is to turn the dyadic sum produced by chaining into the entropy integral. The only subtlety is the extra factor $2$ inside $\log(2N)$. First observe that small balls cannot cover the class with one centre when the diameter is $D>0$. If $0<r<D/2$ and $N(\mathcal F,d_n,r)=1$, then some $a\in\mathcal F$ satisfies $d_n(f,a)\le r$ for every $f\in\mathcal F$. For any $f,g\in\mathcal F$, the triangle inequality gives \begin{align*} d_n(f,g)\le d_n(f,a)+d_n(a,g)\le2r<D. \end{align*} This contradicts the definition \begin{align*} D=\sup_{f,g\in\mathcal F}d_n(f,g). \end{align*} Hence $N(\mathcal F,d_n,r)\ge2$ for every $0<r<D/2$. Now fix $k\ge1$. On the interval $(\delta_{k+1},\delta_k)$ we have $\varepsilon<D/2$, so $N(\mathcal F,d_n,\varepsilon)\ge2$. Also $\varepsilon\le\delta_k$, and covering numbers decrease when the radius increases. Therefore \begin{align*} N(\mathcal F,d_n,\delta_k)\le N(\mathcal F,d_n,\varepsilon). \end{align*} Since $N(\mathcal F,d_n,\varepsilon)\ge2$, we have $\log 2\le\log N(\mathcal F,d_n,\varepsilon)$, and hence \begin{align*} \log(2N(\mathcal F,d_n,\delta_k)) \le \log(2N(\mathcal F,d_n,\varepsilon)) = \log 2+\log N(\mathcal F,d_n,\varepsilon) \le 2\log N(\mathcal F,d_n,\varepsilon). \end{align*} Taking square roots gives \begin{align*} \sqrt{\log(2N(\mathcal F,d_n,\delta_k))} \le \sqrt2\sqrt{\log N(\mathcal F,d_n,\varepsilon)}. \end{align*} Finally $\delta_k=2(\delta_k-\delta_{k+1})$, so integrating the last pointwise bound over $(\delta_{k+1},\delta_k)$ yields \begin{align*} \delta_k\sqrt{\log(2N(\mathcal F,d_n,\delta_k))} \le 2\sqrt2\int_{\delta_{k+1}}^{\delta_k}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon). \end{align*} The dyadic intervals are disjoint and lie in $(0,D)$, so summing over $k\ge1$ gives the comparison with constant $C_1=2\sqrt2$. [/guided] [/step] [step:Pass from one anchored supremum to the pair supremum] For every $f,g\in\mathcal F$, the triangle inequality in $\mathbb R$ gives \begin{align*} |R_n(f)-R_n(g)| \le |R_n(f)-R_n(f_*)|+|R_n(g)-R_n(f_*)|. \end{align*} Taking the supremum over $f,g\in\mathcal F$ yields \begin{align*} \sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)| \le 2\sup_{h\in\mathcal F}|R_n(h)-R_n(f_*)|. \end{align*} Taking $\mathbb P_\varepsilon$-expectations and using the anchored estimate from the preceding step gives \begin{align*} \mathbb E_\varepsilon\left[\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|\right] \le 12C_1\int_0^D\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon). \end{align*} With \begin{align*} C:=12C_1, \end{align*} and with $D=\operatorname{diam}(\mathcal F,d_n)$, this is precisely \begin{align*} \mathbb E_\varepsilon\left[\sup_{f,g\in\mathcal F}|R_n(f)-R_n(g)|\right] \le C\int_0^{\operatorname{diam}(\mathcal F,d_n)}\sqrt{\log N(\mathcal F,d_n,\varepsilon)}\,d\mathcal L^1(\varepsilon). \end{align*} The constant $C$ is universal because every constant introduced above came only from the Rademacher exponential moment estimate, the finite-maximum estimate, and the dyadic comparison. This proves the theorem. [/step]

Prerequisites (0/5 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

Variance

What brings you to Androma?

Start with a route through the knowledge graph.