Vapnik-Chervonenkis Inequality — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The proof first reduces the empirical deviation to a two-sample deviation by introducing an independent ghost sample. Conditional on the pooled $2n$ sample points, the class $\mathcal C$ induces at most $\Pi_{\mathcal C}(2n)$ distinct binary labelings. For each fixed labeling, a random permutation argument converts the two-sample difference into a Rademacher average, and [Hoeffding's inequality](/theorems/1962) controls its tail. A union bound gives the [first inequality](/theorems/2897), and the Sauer-Shelah lemma gives the polynomial growth bound when $V(\mathcal C)<\infty$. [/proofplan] [step:Dispose of the empty class and the small deviation range] If $\mathcal C=\varnothing$, then the supremum over $\mathcal C$ is interpreted as $0$, while $\Pi_{\mathcal C}(2n)=0$, so both displayed bounds are immediate. Hence assume $\mathcal C\ne\varnothing$. Then $\Pi_{\mathcal C}(2n)\ge 1$. If \begin{align*} 8\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{32}\right)\ge 1, \end{align*} then the first inequality follows from the elementary bound that probabilities are at most $1$. We may therefore assume throughout the proof of the first inequality that \begin{align*} 8\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{32}\right)<1. \end{align*} In particular, \begin{align*} nt^2>32\log 8>2. \end{align*} [/step] [step:Introduce a ghost empirical measure and symmetrise the deviation] Let $Y_1,\dots,Y_n:(\Omega',\mathcal A',\mathbb P')\to(S,\mathcal S)$ be independent identically distributed random variables with common distribution $P$, independent of $X_1,\dots,X_n$. Define the ghost empirical set function \begin{align*} P_n':\mathcal C\to[0,1] \end{align*} by \begin{align*} P_n'(C):=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i). \end{align*} Since $\mathcal C$ is countable, the random variables \begin{align*} \sup_{C\in\mathcal C}|P_n(C)-P(C)| \end{align*} and \begin{align*} \sup_{C\in\mathcal C}|P_n(C)-P_n'(C)| \end{align*} are measurable as suprema of countably many measurable real-valued random variables. For each fixed $C\in\mathcal C$, the random variables $\mathbb{1}_C(Y_1),\dots,\mathbb{1}_C(Y_n)$ are independent Bernoulli random variables with mean $P(C)$. Hence \begin{align*} \operatorname{Var}(P_n'(C))=\frac{P(C)(1-P(C))}{n}\le \frac{1}{4n}. \end{align*} By [Chebyshev's inequality](/theorems/1126), for every $C\in\mathcal C$, \begin{align*} \mathbb P'\left(|P_n'(C)-P(C)|>\frac{t}{2}\right) \le \frac{1}{nt^2}. \end{align*} Since $nt^2>2$, this implies \begin{align*} \mathbb P'\left(|P_n'(C)-P(C)|\le \frac{t}{2}\right)\ge \frac{1}{2}. \end{align*} Let \begin{align*} A:=\left\{\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right\}. \end{align*} Choose an enumeration $\mathcal C=\{D_1,D_2,\dots\}$, allowing repetitions if $\mathcal C$ is finite. For each $\omega\in A$, let $k(\omega)$ be the least index $k\in\mathbb N$ such that \begin{align*} |P_n(D_k)(\omega)-P(D_k)|>t, \end{align*} and define $C_\omega:=D_{k(\omega)}$. This measurable first-choice rule is available because $\mathcal C$ is countable and the supremum exceeds $t$. On the event \begin{align*} \left\{|P_n'(C_\omega)-P(C_\omega)|\le \frac{t}{2}\right\}, \end{align*} the [reverse triangle inequality](/theorems/2300) gives \begin{align*} |P_n(C_\omega)-P_n'(C_\omega)| \ge |P_n(C_\omega)-P(C_\omega)|-|P_n'(C_\omega)-P(C_\omega)|>\frac{t}{2}. \end{align*} Therefore \begin{align*} \mathbb P'\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right)\ge \frac{1}{2} \end{align*} for every fixed original sample in $A$. Integrating this conditional lower bound over the original sample gives \begin{align*} \mathbb P(A) \le 2(\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right). \end{align*} [guided] The purpose of the ghost sample is to replace the unknown expectation $P(C)$ by an empirical object with the same mean. We introduce independent random variables $Y_1,\dots,Y_n$ with the same law $P$ as the $X_i$, and define the ghost empirical set function $P_n':\mathcal C\to[0,1]$ by \begin{align*} P_n'(C):=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i). \end{align*} The countability of $\mathcal C$ is used here in a concrete way: the suprema over $C\in\mathcal C$ are countable suprema of measurable random variables, hence measurable. For a fixed set $C\in\mathcal C$, the variables $\mathbb{1}_C(Y_i)$ are independent Bernoulli random variables with expectation $P(C)$. Thus \begin{align*} \operatorname{Var}(P_n'(C))=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb 1_C(Y_i)\right) =\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(\mathbb 1_C(Y_i)). \end{align*} Since $\operatorname{Var}(\mathbb 1_C(Y_i))=P(C)(1-P(C))\le 1/4$, we obtain \begin{align*} \operatorname{Var}(P_n'(C))\le \frac{1}{4n}. \end{align*} Applying Chebyshev's inequality to the centered real-valued [random variable](/page/Random%20Variable) $P_n'(C)-P(C)$ gives \begin{align*} \mathbb P'\left(|P_n'(C)-P(C)|>\frac{t}{2}\right) \le \frac{\operatorname{Var}(P_n'(C))}{t^2/4} \le \frac{1}{nt^2}. \end{align*} Because the small-deviation case has already been removed, we have $nt^2>2$, and hence \begin{align*} \mathbb P'\left(|P_n'(C)-P(C)|\le \frac{t}{2}\right)\ge \frac{1}{2}. \end{align*} Now suppose the original empirical measure is far from $P$ for some set $C$. More precisely, if \begin{align*} \sup_{C\in\mathcal C}|P_n(C)-P(C)|>t, \end{align*} then countability lets us choose a set $C_\omega\in\mathcal C$ depending on the realised original sample such that \begin{align*} |P_n(C_\omega)-P(C_\omega)|>t. \end{align*} Whenever the ghost empirical measure is within $t/2$ of $P(C_\omega)$, the original and ghost empirical measures must be separated by more than $t/2$: \begin{align*} |P_n(C_\omega)-P_n'(C_\omega)| \ge |P_n(C_\omega)-P(C_\omega)|-|P_n'(C_\omega)-P(C_\omega)|>\frac{t}{2}. \end{align*} Thus, conditional on every original sample for which the empirical deviation exceeds $t$, the ghost sample produces a two-sample deviation exceeding $t/2$ with probability at least $1/2$. Integrating this conditional statement over the original sample yields \begin{align*} \mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right) \le 2(\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right). \end{align*} [/guided] [/step] [step:Condition on the pooled sample and count the induced labelings] For fixed points $z_1,\dots,z_{2n}\in S$, define the trace set \begin{align*} \mathcal T(z_1,\dots,z_{2n}):=\{(\mathbb{1}_C(z_1),\dots,\mathbb{1}_C(z_{2n})):C\in\mathcal C\}\subset\{0,1\}^{2n}. \end{align*} By the definition of the growth function, \begin{align*} |\mathcal T(z_1,\dots,z_{2n})|\le \Pi_{\mathcal C}(2n). \end{align*} Let $Z_1,\dots,Z_{2n}$ denote the pooled sample defined by \begin{align*} Z_i:=X_i\quad\text{for }1\le i\le n, \end{align*} and \begin{align*} Z_{n+i}:=Y_i\quad\text{for }1\le i\le n. \end{align*} Conditional on $Z_1,\dots,Z_{2n}$, the class $\mathcal C$ induces at most $\Pi_{\mathcal C}(2n)$ distinct vectors \begin{align*} (\mathbb{1}_C(Z_1),\dots,\mathbb{1}_C(Z_{2n})). \end{align*} [/step] [step:Introduce independent swaps and bound each fixed labeling by Hoeffding] Let $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$ be a [probability space](/page/Probability%20Space) carrying independent Rademacher random variables $\varepsilon_1,\dots,\varepsilon_n:\Omega_\varepsilon\to\{-1,1\}$, independent of the original and ghost samples. Here $\mathbb P_\varepsilon$ denotes probability with respect to the Rademacher variables alone. For a fixed trace vector $a=(a_1,\dots,a_{2n})\in\{0,1\}^{2n}$, define the real-valued Rademacher sum $R_a:\Omega_\varepsilon\to\mathbb R$ by \begin{align*} R_a:=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(a_i-a_{n+i}). \end{align*} Since each coefficient $a_i-a_{n+i}$ lies in $\{-1,0,1\}$, the summands $\varepsilon_i(a_i-a_{n+i})$ are independent, mean-zero, and take values in $[-1,1]$. Hoeffding's inequality for bounded independent sums gives \begin{align*} \mathbb P_\varepsilon\left(|R_a|>\frac{t}{2}\right) \le 2\exp\left(-\frac{nt^2}{8}\right). \end{align*} For $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$ define swapped samples $X_i^\varepsilon,Y_i^\varepsilon:(\Omega\times\Omega'\times\Omega_\varepsilon,\mathcal A\otimes\mathcal A'\otimes\mathcal A_\varepsilon)\to(S,\mathcal S)$ by setting $X_i^\varepsilon=X_i$ and $Y_i^\varepsilon=Y_i$ on $\{\varepsilon_i=1\}$, and $X_i^\varepsilon=Y_i$ and $Y_i^\varepsilon=X_i$ on $\{\varepsilon_i=-1\}$. Because $(X_i,Y_i)$ has product law $P\otimes P$ and is exchangeable under the coordinate swap, and because the pairs are independent, the process \begin{align*} \left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i)\right)_{C\in\mathcal C} \end{align*} has the same law under $\mathbb P\otimes\mathbb P'$ as \begin{align*} \left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right)_{C\in\mathcal C} \end{align*} has under $\mathbb P\otimes\mathbb P'\otimes\mathbb P_\varepsilon$. Conditional on the realised ordered pooled sample $(Z_1,\dots,Z_{2n})$, the swapped difference for a set $C\in\mathcal C$ is \begin{align*} \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(\mathbb{1}_C(Z_i)-\mathbb{1}_C(Z_{n+i})). \end{align*} The possible coefficient vectors are exactly the trace vectors in $\mathcal T(Z_1,\dots,Z_{2n})$, whose cardinality is at most $\Pi_{\mathcal C}(2n)$. Applying the preceding Hoeffding bound to each trace vector and taking a union bound over this finite trace set gives, for every realised pooled sample, \begin{align*} \mathbb P_\varepsilon\left(\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right|>\frac{t}{2}\right) \le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right). \end{align*} Integrating over the pooled sample and using the equality in law under independent swaps yields \begin{align*} (\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right) \le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right). \end{align*} [guided] The delicate point is that the ordered pooled sample itself is fixed after conditioning, so the randomness must come from an additional independent swap mechanism. We place independent Rademacher variables $\varepsilon_1,\dots,\varepsilon_n$ on a probability space $(\Omega_\varepsilon,\mathcal A_\varepsilon,\mathbb P_\varepsilon)$ and let $\mathbb P_\varepsilon$ mean probability over these signs only. For a trace vector $a=(a_1,\dots,a_{2n})\in\{0,1\}^{2n}$, define the real-valued map $R_a:\Omega_\varepsilon\to\mathbb R$ by \begin{align*} R_a:=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(a_i-a_{n+i}). \end{align*} The variables $\varepsilon_i(a_i-a_{n+i})$ are independent because the signs are independent, have mean zero because each $\varepsilon_i$ has mean zero, and lie in $[-1,1]$ because $a_i-a_{n+i}\in\{-1,0,1\}$. Hoeffding's inequality for bounded independent sums therefore gives \begin{align*} \mathbb P_\varepsilon\left(|R_a|>\frac{t}{2}\right) \le 2\exp\left(-\frac{nt^2}{8}\right). \end{align*} We now connect this artificial sign process to the original two-sample process. For each $i$, define $X_i^\varepsilon$ and $Y_i^\varepsilon$ by leaving $(X_i,Y_i)$ unchanged when $\varepsilon_i=1$ and swapping the two coordinates when $\varepsilon_i=-1$. Since $X_i$ and $Y_i$ are independent with the same distribution $P$, the pair $(X_i,Y_i)$ has law $P\otimes P$, which is invariant under the coordinate swap. The pairs are independent across $i$, so the whole swapped array has the same law as the original array. Hence the two processes indexed by $\mathcal C$, \begin{align*} \left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i)\right)_{C\in\mathcal C} \end{align*} and \begin{align*} \left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right)_{C\in\mathcal C} \end{align*} have the same distribution. After conditioning on $Z_1,\dots,Z_{2n}$, no randomness remains in the labels $\mathbb{1}_C(Z_j)$; the only randomness is in the signs. For a fixed $C$, the swapped difference becomes \begin{align*} \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(\mathbb{1}_C(Z_i)-\mathbb{1}_C(Z_{n+i})). \end{align*} As $C$ varies, the coefficient vectors are drawn from the finite trace set $\mathcal T(Z_1,\dots,Z_{2n})$, which has cardinality at most $\Pi_{\mathcal C}(2n)$. We apply the fixed-vector Hoeffding bound to every vector in this trace set and take a union bound. This gives, conditionally on every realised pooled sample, \begin{align*} \mathbb P_\varepsilon\left(\sup_{C\in\mathcal C}\left|\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(X_i^\varepsilon)-\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_C(Y_i^\varepsilon)\right|>\frac{t}{2}\right) \le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right). \end{align*} Finally, integrating this conditional inequality and using the equality in law between the original and swapped two-sample processes yields \begin{align*} (\mathbb P\otimes\mathbb P')\left(\sup_{C\in\mathcal C}|P_n(C)-P_n'(C)|>\frac{t}{2}\right) \le 2\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right). \end{align*} [/guided] [/step] [step:Combine symmetrisation with the finite-labeling bound] Using the symmetrisation estimate from the ghost-sample step and the conditional finite-labeling estimate, we obtain \begin{align*} \mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right) \le 4\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{8}\right). \end{align*} Since $nt^2\ge 0$, we have \begin{align*} 4\exp\left(-\frac{nt^2}{8}\right) =8\exp\left(-\frac{nt^2}{32}\right)\cdot \frac{1}{2}\exp\left(-\frac{3nt^2}{32}\right) \le 8\exp\left(-\frac{nt^2}{32}\right). \end{align*} Thus the desired bound follows: \begin{align*} \mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right) \le 8\Pi_{\mathcal C}(2n)\exp\left(-\frac{nt^2}{32}\right). \end{align*} Together with the small-deviation case handled at the beginning, this proves the first assertion for every $t>0$. [/step] [step:Apply Sauer-Shelah to obtain the VC-dimension consequence] Assume $V(\mathcal C)=v<\infty$ with $v\ge 1$, and let $n\ge v$. Fix arbitrary points $x_1,\dots,x_{2n}\in S$ and set $E:=\{x_1,\dots,x_{2n}\}$. Repeated points can only reduce $|E|$, so $|E|\le 2n$. Applying the [Sauer-Shelah Lemma][citetheorem:1969] to the finite set system induced by $\mathcal C$ on $E$ gives \begin{align*} \left|\{C\cap E:C\in\mathcal C\}\right|\le \sum_{k=0}^{v}\binom{|E|}{k}\le \sum_{k=0}^{v}\binom{2n}{k}. \end{align*} Taking the supremum over all $x_1,\dots,x_{2n}\in S$ gives \begin{align*} \Pi_{\mathcal C}(2n)\le \sum_{k=0}^{v}\binom{2n}{k}. \end{align*} Since $v\le n\le 2n$, the standard binomial estimate in the Sauer-Shelah bound gives \begin{align*} \sum_{k=0}^{v}\binom{2n}{k}\le \left(\frac{2en}{v}\right)^v. \end{align*} Substituting this estimate into the first inequality yields \begin{align*} \mathbb P\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>t\right) \le 8\left(\frac{2en}{v}\right)^v\exp\left(-\frac{nt^2}{32}\right). \end{align*} This is the claimed finite-VC-dimension form of the inequality. [/step]

Prerequisites (0/3 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Chebyshev's Inequality

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.

Vapnik-Chervonenkis Inequality (Theorem # 9831)

Discussion

Proof

Prerequisites (0/3 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Vapnik-Chervonenkis Inequality (Theorem # 9831)

Discussion

Proof

Prerequisites (0/3 completed)

Prerequisites Graph

Explore Further