Vapnik-Chervonenkis Generalization Theorem — Statement & Proof

Vapnik-Chervonenkis Generalization Theorem (Theorem # 9869)

Theorem

Edit Issues Pull Requests Attributions Admin

Let $\mathbb N=\{1,2,\dots\}$. Let $(\mathcal X,\mathcal E)$ be a measurable space, let $\mathcal C\subset\mathcal E$ be a class of measurable subsets of $\mathcal X$ with finite VC dimension $V(\mathcal C)=V\in\mathbb N\cup\{0\}$, let $P$ be a probability measure on $(\mathcal X,\mathcal E)$, and let \begin{align*} X_1,\dots,X_n:(\Omega,\mathcal F,\mathbb P)\to(\mathcal X,\mathcal E) \end{align*} be independent identically distributed random variables with common distribution $P$, where $n\in\mathbb N$. For each $C\in\mathcal C$, define \begin{align*} P(C):=\int_{\mathcal X}\mathbb 1_C(x)\,dP(x) \end{align*} and \begin{align*} P_n(C):=\frac{1}{n}\sum_{i=1}^{n}\mathbb 1_C(X_i). \end{align*} Define $A(V,n)$ by $A(0,n):=1$, and, for $V\ge 1$, \begin{align*} A(V,n):=m\log\left(\frac{en}{m}\right), \end{align*} where $m:=\min\{V,n\}$. Then there exists a universal constant $K>0$ such that, for every $t>0$, \begin{align*} \mathbb P^*\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|\le K\sqrt{\frac{A(V,n)+t}{n}}\right)\ge 1-e^{-t}, \end{align*} where $\mathbb P^*$ denotes the outer probability induced by $\mathbb P$. If the displayed supremum is measurable, then $\mathbb P^*$ may be replaced by ordinary probability $\mathbb P$. Consequently, let $(\mathcal Z,\mathcal A)$ be a measurable space, let $\mathcal H$ be a class of $\mathcal A$-$\mathcal B(\{0,1\})$ measurable classifiers $h:\mathcal Z\to\{0,1\}$, and let \begin{align*} (Z_1,Y_1),\dots,(Z_n,Y_n):(\Omega,\mathcal F,\mathbb P)\to(\mathcal Z\times\{0,1\},\mathcal A\otimes\mathcal P(\{0,1\})) \end{align*} be independent identically distributed random variables with the same distribution as a random pair $(Z,Y)$. For each $h\in\mathcal H$, define the loss set \begin{align*} L_h:=\{(z,y)\in\mathcal Z\times\{0,1\}:h(z)\ne y\}, \end{align*} the risk \begin{align*} R(h):=\mathbb P((Z,Y)\in L_h), \end{align*} and the empirical risk \begin{align*} R_n(h):=\frac{1}{n}\sum_{i=1}^{n}\mathbb 1_{L_h}(Z_i,Y_i). \end{align*} Assume that the loss class \begin{align*} \mathcal L_{\mathcal H}:=\{L_h:h\in\mathcal H\} \end{align*} has finite VC dimension $V<\infty$. Let $\hat h$ be an $\mathcal H$-valued selector, possibly depending on the sample, such that pointwise \begin{align*} R_n(\hat h)\le \inf_{h\in\mathcal H}R_n(h)+\varepsilon \end{align*} for some $\varepsilon\ge 0$. Then, for every $t>0$, \begin{align*} \mathbb P^*\left(R(\hat h)-\inf_{h\in\mathcal H}R(h)\le 2K\sqrt{\frac{A(V,n)+t}{n}}+\varepsilon\right)\ge 1-e^{-t}. \end{align*}

Discussion

Proof

[proofplan] We first prove a high-probability uniform deviation inequality for a VC class of sets. The core inputs are symmetrisation, the finite trace bound supplied by the Sauer-Shelah lemma, and Hoeffding-type concentration for the finitely many traces on the sample. These ingredients give a tail bound of the form $\exp\{A(V,n)-cnu^2\}$, and the constants are absorbed into one universal constant $K$. The classification statement is then a deterministic comparison on the event where all losses have small empirical deviation. [/proofplan] [step:Reduce the VC deviation bound to a standard finite-trace tail estimate] For $u>0$, define the outer event \begin{align*} E_u:=\left\{\sup_{C\in\mathcal C}|P_n(C)-P(C)|>u\right\}. \end{align*} By the [[Vapnik-Chervonenkis Inequality](/theorems/9831)][citetheorem:9831], whose proof combines the [[Symmetrization Inequality for Empirical Processes](/theorems/9851)][citetheorem:9851], the [Sauer-Shelah Lemma][citetheorem:1969], and [[Hoeffding's inequality](/theorems/1962)][citetheorem:9838], there exist universal constants $a_0,a_1>0$ such that, for every $u>0$, \begin{align*} \mathbb P^*(E_u)\le a_0\exp\left(A(V,n)-a_1nu^2\right). \end{align*} [guided] We do not reprove the full VC tail estimate here; instead we invoke the [Vapnik-Chervonenkis Inequality][citetheorem:9831], which packages the symmetrization argument, the Sauer-Shelah trace bound, and Hoeffding's inequality into a single statement. Its hypotheses are satisfied because $\mathcal C \subset \mathcal E$ is a class of measurable sets and $\mathcal C$ has finite VC dimension $V$. The theorem gives universal constants $a_0,a_1>0$ such that, for every $u>0$, \begin{align*} \mathbb P^*(E_u)\le a_0\exp\left(A(V,n)-a_1nu^2\right). \end{align*} The role of the auxiliary ingredients is exactly to justify that the coefficient of $A(V,n)$ is absorbed into the universal constant $a_0$ rather than tracked separately. [/guided] [/step] [step:Choose the universal constant so the tail probability is at most $e^{-t}$] Let $a_0,a_1>0$ be the universal constants from the preceding step. Choose a universal constant $K>0$ satisfying \begin{align*} a_1K^2\ge 2 \end{align*} and \begin{align*} a_1K^2\ge 1+\max\{\log a_0,0\}. \end{align*} For $t>0$, define \begin{align*} u_t:=K\sqrt{\frac{A(V,n)+t}{n}}. \end{align*} Then \begin{align*} A(V,n)-a_1nu_t^2=A(V,n)-a_1K^2(A(V,n)+t). \end{align*} Since the repaired definition of $A(V,n)$ gives $A(V,n)\ge 1$, the choice of $K$ gives \begin{align*} \log a_0+A(V,n)-a_1nu_t^2\le -t. \end{align*} Therefore the VC tail estimate gives \begin{align*} \mathbb P^*\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|>u_t\right)\le e^{-t}. \end{align*} Equivalently, \begin{align*} \mathbb P^*\left(\sup_{C\in\mathcal C}|P_n(C)-P(C)|\le K\sqrt{\frac{A(V,n)+t}{n}}\right)\ge 1-e^{-t}. \end{align*} This proves the set-class assertion. If the supremum is measurable, the same argument is an ordinary probability statement, so $\mathbb P^*$ may be replaced by $\mathbb P$. [/step] [step:Apply the uniform VC bound to the loss class] Consider the measurable space \begin{align*} (\mathcal Z\times\{0,1\},\mathcal A\otimes\mathcal P(\{0,1\})). \end{align*} For each $h\in\mathcal H$, the map \begin{align*} (z,y)\mapsto \mathbb 1_{\{h(z)\ne y\}} \end{align*} is measurable because $h$ is measurable and $\{0,1\}$ has the discrete $\sigma$-algebra. Hence $L_h\in\mathcal A\otimes\mathcal P(\{0,1\})$, and the class \begin{align*} \mathcal L_{\mathcal H}:=\{L_h:h\in\mathcal H\} \end{align*} is a class of measurable sets with VC dimension $V<\infty$. Applying the set-class result to $\mathcal L_{\mathcal H}$ and to the sample \begin{align*} (Z_1,Y_1),\dots,(Z_n,Y_n) \end{align*} gives the outer-probability event \begin{align*} G_t:=\left\{\sup_{h\in\mathcal H}|R_n(h)-R(h)|\le K\sqrt{\frac{A(V,n)+t}{n}}\right\} \end{align*} with \begin{align*} \mathbb P^*(G_t)\ge 1-e^{-t}. \end{align*} Define the deviation radius \begin{align*} \Delta_t:=K\sqrt{\frac{A(V,n)+t}{n}}. \end{align*} On $G_t$, for every $h\in\mathcal H$, \begin{align*} R(h)\le R_n(h)+\Delta_t \end{align*} and \begin{align*} R_n(h)\le R(h)+\Delta_t. \end{align*} [/step] [step:Compare the risk of the approximate empirical minimizer with any fixed comparator] Work on the event $G_t$. Let $h\in\mathcal H$ be arbitrary. By the first deviation inequality applied to $\hat h$, \begin{align*} R(\hat h)\le R_n(\hat h)+\Delta_t. \end{align*} By the assumed approximate empirical minimization property, \begin{align*} R_n(\hat h)\le R_n(h)+\varepsilon. \end{align*} By the second deviation inequality applied to $h$, \begin{align*} R_n(h)\le R(h)+\Delta_t. \end{align*} Combining the three inequalities gives \begin{align*} R(\hat h)\le R(h)+2\Delta_t+\varepsilon. \end{align*} Since $h\in\mathcal H$ was arbitrary, taking the infimum over $h\in\mathcal H$ yields \begin{align*} R(\hat h)-\inf_{h\in\mathcal H}R(h)\le 2\Delta_t+\varepsilon. \end{align*} Substituting the definition of $\Delta_t$ gives \begin{align*} R(\hat h)-\inf_{h\in\mathcal H}R(h)\le 2K\sqrt{\frac{A(V,n)+t}{n}}+\varepsilon. \end{align*} The preceding paragraph proves the pointwise inclusion \begin{align*} G_t\subseteq\left\{R(\hat h)-\inf_{h\in\mathcal H}R(h)\le 2K\sqrt{\frac{A(V,n)+t}{n}}+\varepsilon\right\}. \end{align*} By monotonicity of outer probability, this inclusion and $\mathbb P^*(G_t)\ge 1-e^{-t}$ imply the claimed outer-probability bound. [/step]

Prerequisites (0/3 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Vapnik-Chervonenkis Inequality

Definitions & Concepts

Explore Further

What brings you to Androma?

Start with a route through the knowledge graph.

Vapnik-Chervonenkis Generalization Theorem (Theorem # 9869)

Discussion

Proof

Prerequisites (0/3 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Vapnik-Chervonenkis Generalization Theorem (Theorem # 9869)

Discussion

Proof

Prerequisites (0/3 completed)

Prerequisites Graph

Explore Further