Rademacher Generalization Bound for Empirical Risk Minimization

Theorem


Let $\mathbb N=\{1,2,\dots\}$. Let $n\in\mathbb N$, let $(\mathcal Z,\mathcal A)$ be a measurable space, and let $P$ be a probability measure on $(\mathcal Z,\mathcal A)$. Let $Z_1,\dots,Z_n:(\Omega,\mathcal F,\mathbb P)\to(\mathcal Z,\mathcal A)$ be independent identically distributed random variables with common distribution $P$. For every [measurable function](/page/Measurable%20Function) $h:\mathcal Z\to\mathbb R$ with $P|h|<\infty$, define \begin{align*} P h:=\int_{\mathcal Z}h(z)\,dP(z) \end{align*} and \begin{align*} P_n h:=\frac{1}{n}\sum_{i=1}^{n}h(Z_i). \end{align*} Let $\mathcal H$ be a nonempty pointwise measurable class of $\mathcal A$-$\mathcal B(\mathbb R)$ [measurable functions](/page/Measurable%20Functions) $h:\mathcal Z\to[0,1]$, and define \begin{align*} \mathcal H^\pm:=\mathcal H\cup\{-h:h\in\mathcal H\}. \end{align*} For every nonempty class $\mathcal G$ of real-valued functions on $\mathcal Z$ and every tuple $(z_1,\dots,z_n)\in\mathcal Z^n$, define the empirical Rademacher complexity \begin{align*} \mathfrak R_n(\mathcal G;z_1,\dots,z_n):=\mathbb E_\varepsilon\left[\sup_{g\in\mathcal G}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i g(z_i)\right], \end{align*} where $\varepsilon_1,\dots,\varepsilon_n$ are independent Rademacher random variables on an auxiliary [probability space](/page/Probability%20Space) $(\Omega_\varepsilon,\mathcal F_\varepsilon,\mathbb P_\varepsilon)$, and $\mathbb E_\varepsilon$ denotes expectation with respect to $\mathbb P_\varepsilon$. Then, for every $t>0$, with $\mathbb P$-probability at least $1-e^{-t}$, \begin{align*} \sup_{h\in\mathcal H}|(P_n-P)h|\le 2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]+\sqrt{\frac{t}{2n}}. \end{align*} Consequently, let $(\mathcal X,\mathcal E)$ and $(\mathcal Y,\mathcal B)$ be measurable spaces, let $\mathcal Z=\mathcal X\times\mathcal Y$, and let $\mathcal A=\mathcal E\otimes\mathcal B$. 
Let $\mathcal F_0$ be a nonempty class of measurable predictors $f:(\mathcal X,\mathcal E)\to(\mathcal Y,\mathcal B)$, let $\ell:(\mathcal Y\times\mathcal Y,\mathcal B\otimes\mathcal B)\to([0,1],\mathcal B([0,1]))$ be a measurable loss function, and for every $f\in\mathcal F_0$ define \begin{align*} \ell_f:\mathcal X\times\mathcal Y\to[0,1],\qquad (x,y)\mapsto \ell(f(x),y). \end{align*} Assume that \begin{align*} \mathcal H:=\{\ell_f:f\in\mathcal F_0\} \end{align*} is pointwise measurable. If $\hat f$ is any possibly nonmeasurable $\mathcal F_0$-valued selector satisfying \begin{align*} P_n\ell_{\hat f}=\inf_{f\in\mathcal F_0}P_n\ell_f \end{align*} whenever it is evaluated, then, for every $t>0$, \begin{align*} \mathbb P^*\left(P\ell_{\hat f}-\inf_{f\in\mathcal F_0}P\ell_f\le 2\left(2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]+\sqrt{\frac{t}{2n}}\right)\right)\ge 1-e^{-t}. \end{align*}
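As an illustrative aside, and not part of the theorem's statement, the following Python sketch evaluates the empirical Rademacher complexity $\mathfrak R_n(\mathcal G;z_1,\dots,z_n)$ exactly for a finite class at a small fixed sample by enumerating all $2^n$ sign vectors, and evaluates the right-hand side of the uniform deviation bound. The names `empirical_rademacher`, `symmetrize`, and `deviation_bound`, and the representation of a class as a list of value rows, are assumptions made for this sketch. The theorem uses the expectation of the empirical complexity over the sample, whereas the code only computes it at one realised sample, so the value passed to `deviation_bound` must be supplied or estimated separately.

```python
import itertools
import math


def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class at a fixed sample.

    values[k][i] stands for g_k(z_i): one row per function, one column per
    sample point. All 2^n sign vectors are enumerated, so this is only
    practical for small n.
    """
    n = len(values[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        # sup over the class of (1/n) * sum_i eps_i * g(z_i)
        total += max(sum(s * v for s, v in zip(signs, row)) for row in values) / n
    return total / 2 ** n


def symmetrize(values):
    """Rows of H together with their negatives, i.e. H^± evaluated on the sample."""
    return values + [[-v for v in row] for row in values]


def deviation_bound(expected_rademacher, n, t):
    """Right-hand side 2 * E[R_n(H^±)] + sqrt(t / (2n)) of the theorem."""
    return 2.0 * expected_rademacher + math.sqrt(t / (2.0 * n))


# Hypothetical class containing only the constant function h = 1, with n = 2.
constant_class = [[1, 1]]
print(empirical_rademacher(constant_class))              # 0.0
print(empirical_rademacher(symmetrize(constant_class)))  # 0.5
```

The example shows why the bound is stated for $\mathcal H^\pm$ rather than $\mathcal H$: a single function has empirical Rademacher complexity $0$, while the symmetrized class $\{h,-h\}$ does not.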

Discussion

Proof

[proofplan] We first convert the absolute uniform empirical deviation over $\mathcal H$ into a one-sided supremum over the symmetrized class $\mathcal H^\pm$. The expectation of this supremum is bounded by twice the expected empirical Rademacher complexity of $\mathcal H^\pm$ through the symmetrization inequality. A bounded-differences argument then upgrades the expected bound to a high-probability bound, using the fact that the original functions take values in $[0,1]$. Finally, the empirical-risk-minimization statement follows from the deterministic inequality that excess risk is at most twice the uniform deviation. [/proofplan] [step:Rewrite the absolute deviation as a one-sided supremum over the symmetrized class] Define the map $S:\Omega\to[0,\infty)$ by \begin{align*} S(\omega):=\sup_{h\in\mathcal H}|(P_n-P)h(\omega)|. \end{align*} Since $\mathcal H$ is pointwise measurable, the usual countable separability reduction for pointwise measurable classes shows that $S$ is measurable. For every $\omega\in\Omega$ and every $h\in\mathcal H$, \begin{align*} |(P_n-P)h(\omega)|=\max\{(P_n-P)h(\omega),(P_n-P)(-h)(\omega)\}. \end{align*} Since $\mathcal H^\pm=\mathcal H\cup\{-h:h\in\mathcal H\}$, it follows that \begin{align*} S=\sup_{g\in\mathcal H^\pm}(P_n-P)g. \end{align*} Every $g\in\mathcal H^\pm$ is measurable and satisfies $|g|\le 1$, so $P|g|<\infty$. [guided] The absolute value is the reason for introducing $\mathcal H^\pm$. For a fixed function $h\in\mathcal H$, the quantity $(P_n-P)h$ can be positive or negative, and taking absolute values is the same as allowing the sign of the function to change. More precisely, for every sample point $\omega\in\Omega$, \begin{align*} |(P_n-P)h(\omega)|=\max\{(P_n-P)h(\omega),-(P_n-P)h(\omega)\}. \end{align*} By linearity of $P_n$ and $P$ on bounded [measurable functions](/page/Measurable%20Functions), \begin{align*} -(P_n-P)h(\omega)=(P_n-P)(-h)(\omega).
\end{align*} Thus the largest absolute deviation over $h\in\mathcal H$ is exactly the largest one-sided deviation over the enlarged class containing both $h$ and $-h$: \begin{align*} \sup_{h\in\mathcal H}|(P_n-P)h(\omega)|=\sup_{g\in\mathcal H^\pm}(P_n-P)g(\omega). \end{align*} The measurability of this supremum is the point of the pointwise measurability assumption: it permits replacing the supremum over $\mathcal H$ by a supremum over a countable pointwise-dense subclass, so that the supremum is a measurable [random variable](/page/Random%20Variable). Finally, because each $h$ takes values in $[0,1]$, every $g\in\mathcal H^\pm$ takes values in $[-1,1]$, and hence $P|g|<\infty$. [/guided] [/step] [step:Bound the expected uniform deviation by symmetrization] We apply the [[Symmetrization Inequality for Empirical Processes](/theorems/9851)][citetheorem:9851] to the class $\mathcal H^\pm$. Pointwise measurability passes from $\mathcal H$ to $\mathcal H^\pm$ because the countable pointwise-dense subclass for $\mathcal H$ together with its negatives is countable and pointwise dense in $\mathcal H^\pm$. Each $g\in\mathcal H^\pm$ is measurable and satisfies $|g|\le 1$, so $P|g|<\infty$ and the empirical and population averages are integrable. The same countable reduction makes the suprema in the symmetrization inequality measurable; boundedness by $2$ gives finite expectation. Therefore \begin{align*} \mathbb E[S]\le 2\mathbb E\left[\mathbb E_\varepsilon\left[\sup_{g\in\mathcal H^\pm}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i g(Z_i)\right]\right]. \end{align*} By the definition of $\mathfrak R_n$, \begin{align*} \mathbb E[S]\le 2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]. \end{align*} [guided] We now turn the deterministic rewriting from the previous step into an expectation bound. 
The theorem we use is the [Symmetrization Inequality for Empirical Processes][citetheorem:9851], applied with the function class $\mathcal H^\pm$ and the sample $Z_1,\dots,Z_n$. We verify its hypotheses. First, $\mathcal H^\pm$ consists of measurable functions because every element is either $h$ or $-h$ for some measurable $h\in\mathcal H$. Second, every $g\in\mathcal H^\pm$ satisfies $|g|\le 1$, and therefore \begin{align*} P|g|=\int_{\mathcal Z}|g(z)|\,dP(z)\le 1<\infty. \end{align*} Third, the relevant suprema are measurable. Indeed, pointwise measurability of $\mathcal H$ gives a countable subclass whose pointwise evaluations determine suprema over $\mathcal H$; adjoining the negatives of this countable subclass gives a countable pointwise-dense subclass of $\mathcal H^\pm$. Thus the supremum over $\mathcal H^\pm$ is the supremum of a countable family of measurable random variables. Since the summands are bounded by $1$ in absolute value, these suprema have finite expectation. The symmetrization inequality therefore gives \begin{align*} \mathbb E[S]\le 2\mathbb E\left[\mathbb E_\varepsilon\left[\sup_{g\in\mathcal H^\pm}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i g(Z_i)\right]\right]. \end{align*} The inner expectation is exactly the empirical Rademacher complexity of $\mathcal H^\pm$ at the realised sample. Hence, by the definition of $\mathfrak R_n$, \begin{align*} \mathbb E[S]\le 2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]. \end{align*} [/guided] [/step] [step:Apply bounded differences to the original bounded class] Define the deterministic function $\Phi:\mathcal Z^n\to[0,\infty)$ by \begin{align*} \Phi(z_1,\dots,z_n):=\sup_{h\in\mathcal H}\left|\frac{1}{n}\sum_{i=1}^{n}h(z_i)-P h\right|. \end{align*} Then $S=\Phi(Z_1,\dots,Z_n)$. Let $(z_1,\dots,z_n)\in\mathcal Z^n$, let $(z_1',\dots,z_n')\in\mathcal Z^n$, and assume that these two tuples differ only in the $j$-th coordinate for some $j\in\{1,\dots,n\}$. 
For each $h\in\mathcal H$, \begin{align*} \left|\left(\frac{1}{n}\sum_{i=1}^{n}h(z_i)-P h\right)-\left(\frac{1}{n}\sum_{i=1}^{n}h(z_i')-P h\right)\right|=\frac{1}{n}|h(z_j)-h(z_j')|. \end{align*} Since $0\le h\le 1$, the right-hand side is at most $1/n$. Taking suprema and using $\bigl|\sup_{h}|a_h|-\sup_{h}|b_h|\bigr|\le \sup_{h}|a_h-b_h|$ for two bounded real families indexed by the same set (a consequence of the reverse triangle inequality) gives \begin{align*} |\Phi(z_1,\dots,z_n)-\Phi(z_1',\dots,z_n')|\le \frac{1}{n}. \end{align*} Thus $\Phi$ has bounded differences with constants $1/n,\dots,1/n$. By [McDiarmid's bounded differences inequality](/theorems/6072) applied to the independent variables $Z_1,\dots,Z_n$, \begin{align*} \mathbb P\left(S-\mathbb E[S]\ge u\right)\le \exp(-2nu^2) \end{align*} for every $u>0$. Taking \begin{align*} u:=\sqrt{\frac{t}{2n}} \end{align*} gives \begin{align*} \mathbb P\left(S\le \mathbb E[S]+\sqrt{\frac{t}{2n}}\right)\ge 1-e^{-t}. \end{align*} Using the expectation bound from the previous step, we obtain \begin{align*} \mathbb P\left(S\le 2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]+\sqrt{\frac{t}{2n}}\right)\ge 1-e^{-t}. \end{align*} This is the first asserted inequality. [guided] The concentration step must be done with the original class $\mathcal H$, not with $\mathcal H^\pm$, because the sharp Lipschitz constant uses the range $[0,1]$. Define $\Phi:\mathcal Z^n\to[0,\infty)$ by \begin{align*} \Phi(z_1,\dots,z_n):=\sup_{h\in\mathcal H}\left|\frac{1}{n}\sum_{i=1}^{n}h(z_i)-P h\right|. \end{align*} Then the random variable we want to control is exactly \begin{align*} S=\Phi(Z_1,\dots,Z_n). \end{align*} We verify the bounded-differences hypothesis. Fix two deterministic samples $(z_1,\dots,z_n)$ and $(z_1',\dots,z_n')$ that differ only at coordinate $j$. For a fixed $h\in\mathcal H$, the population term $P h$ is unchanged, and all empirical summands except the $j$-th one cancel.
Hence \begin{align*} \left|\left(\frac{1}{n}\sum_{i=1}^{n}h(z_i)-P h\right)-\left(\frac{1}{n}\sum_{i=1}^{n}h(z_i')-P h\right)\right|=\frac{1}{n}|h(z_j)-h(z_j')|. \end{align*} Because every $h\in\mathcal H$ takes values in $[0,1]$, we have $|h(z_j)-h(z_j')|\le 1$, so the displayed difference is at most $1/n$. Passing from a fixed $h$ to the supremum cannot increase the coordinate sensitivity beyond this common bound, and therefore \begin{align*} |\Phi(z_1,\dots,z_n)-\Phi(z_1',\dots,z_n')|\le \frac{1}{n}. \end{align*} We now apply McDiarmid's bounded differences inequality, whose conclusion says that a [measurable function](/page/Measurable%20Function) of independent variables with coordinate sensitivities $c_1,\dots,c_n$ satisfies \begin{align*} \mathbb P\left(\Phi(Z_1,\dots,Z_n)-\mathbb E[\Phi(Z_1,\dots,Z_n)]\ge u\right)\le \exp\left(-\frac{2u^2}{\sum_{i=1}^{n}c_i^2}\right). \end{align*} Here $c_i=1/n$ for every $i$, so \begin{align*} \sum_{i=1}^{n}c_i^2=\sum_{i=1}^{n}\frac{1}{n^2}=\frac{1}{n}. \end{align*} Thus \begin{align*} \mathbb P\left(S-\mathbb E[S]\ge u\right)\le \exp(-2nu^2). \end{align*} Choosing \begin{align*} u=\sqrt{\frac{t}{2n}} \end{align*} makes the right-hand side equal to $e^{-t}$. Combining this concentration estimate with the symmetrization estimate \begin{align*} \mathbb E[S]\le 2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right] \end{align*} gives \begin{align*} \mathbb P\left(S\le 2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]+\sqrt{\frac{t}{2n}}\right)\ge 1-e^{-t}. \end{align*} This is exactly the asserted high-probability uniform deviation bound. [/guided] [/step] [step:Convert the uniform deviation bound into the ERM excess-risk bound] For the ERM consequence, set \begin{align*} B_t:=2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]+\sqrt{\frac{t}{2n}}. \end{align*} By the first part, \begin{align*} \mathbb P\left(\sup_{f\in\mathcal F_0}|(P_n-P)\ell_f|\le B_t\right)\ge 1-e^{-t}. 
\end{align*} Fix a sample point $\omega\in\Omega$ for which \begin{align*} \sup_{f\in\mathcal F_0}|(P_n-P)\ell_f(\omega)|\le B_t \end{align*} and for which the selector $\hat f(\omega)$ is evaluated and satisfies the empirical minimization identity. Since \begin{align*} P\ell_{\hat f(\omega)}\le P_n\ell_{\hat f(\omega)}(\omega)+B_t \end{align*} and \begin{align*} P_n\ell_{\hat f(\omega)}(\omega)=\inf_{f\in\mathcal F_0}P_n\ell_f(\omega), \end{align*} we have \begin{align*} P\ell_{\hat f(\omega)}\le \inf_{f\in\mathcal F_0}P_n\ell_f(\omega)+B_t. \end{align*} For every $f\in\mathcal F_0$, \begin{align*} P_n\ell_f(\omega)\le P\ell_f+B_t. \end{align*} Taking the infimum over $f\in\mathcal F_0$ gives \begin{align*} \inf_{f\in\mathcal F_0}P_n\ell_f(\omega)\le \inf_{f\in\mathcal F_0}P\ell_f+B_t. \end{align*} Therefore \begin{align*} P\ell_{\hat f(\omega)}-\inf_{f\in\mathcal F_0}P\ell_f\le 2B_t. \end{align*} Thus the event \begin{align*} \left\{\sup_{f\in\mathcal F_0}|(P_n-P)\ell_f|\le B_t\right\} \end{align*} is contained in the outer event \begin{align*} \left\{P\ell_{\hat f}-\inf_{f\in\mathcal F_0}P\ell_f\le 2B_t\right\}. \end{align*} Taking outer probability preserves this lower bound, and hence \begin{align*} \mathbb P^*\left(P\ell_{\hat f}-\inf_{f\in\mathcal F_0}P\ell_f\le 2B_t\right)\ge 1-e^{-t}. \end{align*} Substituting the definition of $B_t$ proves the stated ERM bound. [guided] The ERM part is deterministic once the uniform deviation event is known. Define \begin{align*} B_t:=2\mathbb E\left[\mathfrak R_n(\mathcal H^\pm;Z_1,\dots,Z_n)\right]+\sqrt{\frac{t}{2n}}. \end{align*} The first part of the theorem gives \begin{align*} \mathbb P\left(\sup_{f\in\mathcal F_0}|(P_n-P)\ell_f|\le B_t\right)\ge 1-e^{-t}. \end{align*} Fix a sample point $\omega\in\Omega$ on this event and suppose the selector $\hat f(\omega)$ is evaluated and satisfies the empirical minimization identity. 
The uniform deviation bound applied to $\ell_{\hat f(\omega)}$ gives \begin{align*} P\ell_{\hat f(\omega)}\le P_n\ell_{\hat f(\omega)}(\omega)+B_t. \end{align*} Since $\hat f(\omega)$ is an empirical risk minimizer, \begin{align*} P_n\ell_{\hat f(\omega)}(\omega)=\inf_{f\in\mathcal F_0}P_n\ell_f(\omega). \end{align*} For each fixed $f\in\mathcal F_0$, the same uniform deviation event gives \begin{align*} P_n\ell_f(\omega)\le P\ell_f+B_t. \end{align*} Taking the infimum over $f\in\mathcal F_0$ preserves the inequality, so \begin{align*} \inf_{f\in\mathcal F_0}P_n\ell_f(\omega)\le \inf_{f\in\mathcal F_0}P\ell_f+B_t. \end{align*} Combining the first two displays with the last one yields \begin{align*} P\ell_{\hat f(\omega)}-\inf_{f\in\mathcal F_0}P\ell_f\le 2B_t. \end{align*} Thus the high-probability uniform deviation event is contained in the outer event appearing in the theorem statement. Taking outer probability is necessary because $\hat f$ may be nonmeasurable, and it preserves the lower bound from the measurable event. Therefore \begin{align*} \mathbb P^*\left(P\ell_{\hat f}-\inf_{f\in\mathcal F_0}P\ell_f\le 2B_t\right)\ge 1-e^{-t}. \end{align*} Substituting the displayed definition of $B_t$ gives the asserted ERM bound. [/guided] [/step]
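The deterministic final step of the proof can be checked numerically on a finite predictor class. The sketch below is only an illustration under stated assumptions: the function name `erm_check` and the risk values are hypothetical, with `population_risk[k]` playing the role of $P\ell_f$ and `empirical_risk[k]` the role of $P_n\ell_f$ for the $k$-th predictor.

```python
def erm_check(population_risk, empirical_risk):
    """Return (excess_risk, uniform_deviation) for a finite predictor class.

    population_risk[k] stands for P l_f and empirical_risk[k] for P_n l_f of
    the k-th predictor; both lists are hypothetical inputs.
    """
    # Empirical risk minimizer: an index attaining the smallest empirical risk.
    k_hat = min(range(len(empirical_risk)), key=empirical_risk.__getitem__)
    # Excess risk of the ERM relative to the best population risk in the class.
    excess = population_risk[k_hat] - min(population_risk)
    # Uniform deviation sup_f |(P_n - P) l_f| over the finite class.
    deviation = max(abs(e - p) for e, p in zip(empirical_risk, population_risk))
    return excess, deviation


# Made-up risks for three predictors; the ERM picks the first one.
excess, deviation = erm_check([0.30, 0.25, 0.40], [0.22, 0.27, 0.35])
# The argument in the last step guarantees excess <= 2 * deviation.
assert excess <= 2 * deviation
```

Here the empirical minimizer is not the population minimizer, yet its excess risk stays below twice the uniform deviation, as the chain of inequalities in the last step requires.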

Prerequisites

Theorems

Symmetrization Inequality for Empirical Processes
