Vapnik-Chervonenkis Uniform Law of Large Numbers

Vapnik-Chervonenkis Uniform Law of Large Numbers (Theorem # 9822)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We bound the uniform empirical error in outer expectation and then convert the bound to outer probability. Symmetrisation reduces the empirical error to a Rademacher process indexed by the same class of sets. Conditional on the realised sample, only the finitely many traces of $\mathcal C$ on the observed points matter; Sauer-Shelah bounds their number polynomially in $n$, and Massart's finite-class lemma bounds the conditional Rademacher supremum. The resulting deterministic upper bound tends to zero, and Markov's inequality gives the desired convergence. [/proofplan] [step:Define the outer error and symmetrise it] For $n\in\mathbb N$, define the possibly nonmeasurable extended real map \begin{align*} Z_n:\Omega&\to[0,\infty] \end{align*} \begin{align*} \omega&\mapsto \sup_{C\in\mathcal C}|P_n(C)(\omega)-P(C)|. \end{align*} If $\mathcal C=\varnothing$, then the convention in the theorem statement gives $Z_n=0$ for every $n\in\mathbb N$, so both outer-probability and ordinary-probability conclusions follow immediately. Hence, for the rest of the proof, assume $\mathcal C\neq\varnothing$. Let $\mathbb E^*$ denote outer expectation with respect to $\mathbb P$. Since $\mathbb 1_C:S\to\mathbb R$ is measurable and bounded for every $C\in\mathcal C$, every member of the indicator class \begin{align*} \mathcal F_{\mathcal C}:=\{\mathbb 1_C:S\to\mathbb R:C\in\mathcal C\} \end{align*} is $P$-integrable. Applying the [[Symmetrization Inequality for Empirical Processes](/theorems/9851)][citetheorem:9823] to $\mathcal F_{\mathcal C}$ gives \begin{align*} \mathbb E^*[Z_n]\le 2\,\mathbb E^*\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(X_i)\right|\right], \end{align*} where $\varepsilon_1,\dots,\varepsilon_n$ are independent Rademacher random variables, independent of $X_1,\dots,X_n$. [guided] The object \begin{align*} Z_n(\omega)=\sup_{C\in\mathcal C}|P_n(C)(\omega)-P(C)| \end{align*} may fail to be $\mathcal F$-measurable because $\mathcal C$ need not be countable. This is why we use outer expectation $\mathbb E^*$ and later outer probability $\mathbb P^*$. The symmetrisation theorem applies to the function class \begin{align*} \mathcal F_{\mathcal C}:=\{\mathbb 1_C:S\to\mathbb R:C\in\mathcal C\}. \end{align*} For each $C\in\mathcal C$, the function $\mathbb 1_C$ is measurable because $C\in\mathcal A$, and it is $P$-integrable because \begin{align*} \int_S |\mathbb 1_C(x)|\,dP(x)=P(C)\le 1. \end{align*} Thus the hypotheses of the [Symmetrization Inequality for Empirical Processes][citetheorem:9823] are satisfied for this bounded indicator class, with outer expectation used for possibly nonmeasurable suprema. Therefore \begin{align*} \mathbb E^*[Z_n]\le 2\,\mathbb E^*\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(X_i)\right|\right]. \end{align*} This replaces the unknown centering $P(C)$ by random signs. The advantage is that, once the sample points are fixed, the class $\mathcal C$ contributes only through its finitely many membership patterns on those sample points. [/guided] [/step] [step:Replace the class by its traces on the realised sample] Fix $n\in\mathbb N$ and fix a deterministic vector $x=(x_1,\dots,x_n)\in S^n$. Define the trace vector set \begin{align*} T(x):=\{(\mathbb 1_C(x_1),\dots,\mathbb 1_C(x_n)):C\in\mathcal C\}\subset\{0,1\}^n. \end{align*} Then, conditional on $X_1=x_1,\dots,X_n=x_n$, \begin{align*} \sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(x_i)\right| = \frac{1}{n}\sup_{t\in T(x)}\left|\sum_{i=1}^{n}\varepsilon_i t_i\right|. \end{align*} Let $D(x):=\{x_1,\dots,x_n\}\subset S$ be the set of distinct observed points. The map $C\mapsto C\cap D(x)$ determines the vector $(\mathbb 1_C(x_1),\dots,\mathbb 1_C(x_n))$, so \begin{align*} |T(x)|\le |\{C\cap D(x):C\in\mathcal C\}|\le \Pi_{\mathcal C}(n), \end{align*} where $\Pi_{\mathcal C}(n)$ is the growth function of $\mathcal C$. [/step] [step:Bound the number of traces by Sauer-Shelah] Set $v:=V(\mathcal C)$. By hypothesis, $v<\infty$. The [Sauer-Shelah Lemma][citetheorem:1969] gives, for every $n\in\mathbb N$, \begin{align*} \Pi_{\mathcal C}(n)\le \sum_{j=0}^{v}{n\choose j}. \end{align*} If $v=0$, this gives $\Pi_{\mathcal C}(n)\le 1$ for every $n$. If $v\ge 1$, then \begin{align*} \Pi_{\mathcal C}(n)\le \sum_{j=0}^{v}{n\choose j}\le (n+1)^v \end{align*} for every $n\in\mathbb N$, since each [binomial coefficient](/page/Binomial%20Coefficient) is at most $n^j$ and the sum has $v+1$ terms, which is bounded by $(n+1)^v$ for $v\ge 1$. [/step] [step:Apply Massart's finite-class lemma conditionally] For the fixed sample vector $x\in S^n$, let $\mathbb E_{\varepsilon}$ denote expectation with respect to the product distribution of the independent Rademacher variables $\varepsilon_1,\dots,\varepsilon_n$, with $x$ held fixed. Define \begin{align*} T_{\pm}(x):=T(x)\cup\{-t:t\in T(x)\}\subset\mathbb R^n. \end{align*} Then \begin{align*} \sup_{t\in T(x)}\left|\sum_{i=1}^{n}\varepsilon_i t_i\right| = \sup_{u\in T_{\pm}(x)}\sum_{i=1}^{n}\varepsilon_i u_i. \end{align*} Every $u\in T_{\pm}(x)$ satisfies $|u|\le\sqrt n$, and \begin{align*} |T_{\pm}(x)|\le 2|T(x)|. \end{align*} Applying [[Massart Finite-Class Lemma](/theorems/9826)][citetheorem:9826] conditionally on the sample yields \begin{align*} \mathbb E_{\varepsilon}\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(x_i)\right|\right] \le \sqrt{\frac{2\log(2|T(x)|)}{n}}. \end{align*} Using the trace bound from the preceding steps, we obtain the deterministic estimate \begin{align*} \mathbb E_{\varepsilon}\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(x_i)\right|\right] \le \sqrt{\frac{2\log(2\Pi_{\mathcal C}(n))}{n}}. \end{align*} The right-hand side is independent of the fixed vector $x$. Therefore this pointwise conditional bound controls the outer expectation over the joint law of $(X_1,\dots,X_n,\varepsilon_1,\dots,\varepsilon_n)$: \begin{align*} \mathbb E^*\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(X_i)\right|\right] \le \sqrt{\frac{2\log(2\Pi_{\mathcal C}(n))}{n}}. \end{align*} [guided] After the sample points $x_1,\dots,x_n$ are fixed, the random quantity depends on $C$ only through the vector \begin{align*} (\mathbb 1_C(x_1),\dots,\mathbb 1_C(x_n)). \end{align*} Thus the index set is the finite subset \begin{align*} T(x)=\{(\mathbb 1_C(x_1),\dots,\mathbb 1_C(x_n)):C\in\mathcal C\}\subset\{0,1\}^n. \end{align*} We want to bound a supremum of absolute values. Massart's lemma is stated for a supremum of linear Rademacher sums without absolute values, so we symmetrise the finite set itself by defining \begin{align*} T_{\pm}(x):=T(x)\cup\{-t:t\in T(x)\}. \end{align*} Then \begin{align*} \sup_{t\in T(x)}\left|\sum_{i=1}^{n}\varepsilon_i t_i\right| = \sup_{u\in T_{\pm}(x)}\sum_{i=1}^{n}\varepsilon_i u_i. \end{align*} The hypotheses of [Massart Finite-Class Lemma][citetheorem:9826] are now verified. The set $T_{\pm}(x)$ is finite. For every $u\in T_{\pm}(x)$, each coordinate $u_i$ belongs to $\{-1,0,1\}$, hence \begin{align*} |u|^2=\sum_{i=1}^{n}u_i^2\le n, \end{align*} so $|u|\le\sqrt n$. Massart's lemma gives \begin{align*} \mathbb E_{\varepsilon}\left[\sup_{u\in T_{\pm}(x)}\sum_{i=1}^{n}\varepsilon_i u_i\right] \le \sqrt n\,\sqrt{2\log |T_{\pm}(x)|}. \end{align*} Dividing by $n$ and using $|T_{\pm}(x)|\le 2|T(x)|$ gives \begin{align*} \mathbb E_{\varepsilon}\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(x_i)\right|\right] \le \sqrt{\frac{2\log(2|T(x)|)}{n}}. \end{align*} Finally, repeated sample points do not increase the number of possible membership vectors: the vector is determined by the trace $C\cap\{x_1,\dots,x_n\}$ on the distinct observed points. Therefore $|T(x)|\le\Pi_{\mathcal C}(n)$, and hence \begin{align*} \mathbb E_{\varepsilon}\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(x_i)\right|\right] \le \sqrt{\frac{2\log(2\Pi_{\mathcal C}(n))}{n}}. \end{align*} Because this upper bound is deterministic and holds for every fixed sample vector $x\in S^n$, it also bounds the outer expectation after substituting $x_i=X_i$ and averaging over the sample together with the Rademacher signs: \begin{align*} \mathbb E^*\left[\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\mathbb 1_C(X_i)\right|\right] \le \sqrt{\frac{2\log(2\Pi_{\mathcal C}(n))}{n}}. \end{align*} [/guided] [/step] [step:Show that the outer expectation tends to zero] Combining the symmetrisation estimate with the conditional finite-class bound gives \begin{align*} \mathbb E^*[Z_n]\le 2\sqrt{\frac{2\log(2\Pi_{\mathcal C}(n))}{n}}. \end{align*} If $v=0$, then $\Pi_{\mathcal C}(n)\le 1$, and therefore \begin{align*} \mathbb E^*[Z_n]\le 2\sqrt{\frac{2\log 2}{n}}\to 0. \end{align*} If $v\ge 1$, then $\Pi_{\mathcal C}(n)\le (n+1)^v$, and hence \begin{align*} \mathbb E^*[Z_n]\le 2\sqrt{\frac{2\log(2(n+1)^v)}{n}}. \end{align*} Since \begin{align*} \frac{\log(2(n+1)^v)}{n}=\frac{\log 2+v\log(n+1)}{n}\to 0, \end{align*} we have \begin{align*} \mathbb E^*[Z_n]\to 0. \end{align*} [/step] [step:Convert the expectation bound into outer probability convergence] Fix $\eta>0$. By Markov's inequality for outer expectation applied to the nonnegative map $Z_n$, we have \begin{align*} \mathbb P^*(Z_n>\eta)\le \frac{\mathbb E^*[Z_n]}{\eta}. \end{align*} Since $\mathbb E^*[Z_n]\to 0$, it follows that \begin{align*} \mathbb P^*\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>\eta\right)\to 0. \end{align*} This proves convergence in outer probability. [/step] [step:Recover ordinary convergence in probability under measurability] Assume now that, for every $n\in\mathbb N$, the map $Z_n:\Omega\to[0,\infty]$ is $\mathcal F$-measurable. Then outer probability agrees with ordinary probability on the event $\{Z_n>\eta\}$ for every $\eta>0$, because this event belongs to $\mathcal F$. Therefore \begin{align*} \mathbb P(Z_n>\eta)=\mathbb P^*(Z_n>\eta)\to 0 \end{align*} for every $\eta>0$. This is exactly \begin{align*} \sup_{C\in\mathcal C}|P_n(C)-P(C)|\xrightarrow{\mathbb P}0. \end{align*} [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

Distribution Definition Supremum Definition Davidson-Szarek Gaussian Singular Value Bounds Probability & Statistics Law of Total Expectation Probability Theory Independence from the Germ Field Brownian Motion Logarithmic Sobolev Inequality Implies Poincare Inequality Probability & Statistics Chernoff Bound in Legendre Transform Form Probability & Statistics Asymptotic Normality of Sample Quantiles Probability & Statistics Additive Chernoff-Hoeffding Bound for Bernoulli Sums Probability & Statistics Failure of Positivity Implies Nonidentification of the Stratum-Specific Potential-Outcome Law Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.