Expected Risk Bound via Rademacher Complexity — Statement & Proof

Expected Risk Bound via Rademacher Complexity (Theorem # 1967)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The excess risk is bounded by the expected supremum of the empirical process $G = \sup_{h \in \mathcal{H}}\{R(h) - \hat{R}(h)\}$. We rewrite this supremum in the form required by the [Symmetrization Bound](/theorems/1965), then invoke symmetrisation to obtain a bound in terms of Rademacher complexity of the negated loss class $-F$. A sign-symmetry argument for Rademacher variables shows $\mathcal{R}_n(-F) = \mathcal{R}_n(F)$, completing the proof. [/proofplan] [step:Bound the expected excess risk by the expected supremum of the empirical process] Let $G := \sup_{h \in \mathcal{H}}\{R(h) - \hat{R}(h)\}$. For the ERM $\hat{h}$, the ERM property $\hat{R}(\hat{h}) \leq \hat{R}(h^*)$ gives \begin{align*} R(\hat{h}) - R(h^*) = \bigl[R(\hat{h}) - \hat{R}(\hat{h})\bigr] + \bigl[\hat{R}(\hat{h}) - \hat{R}(h^*)\bigr] + \bigl[\hat{R}(h^*) - R(h^*)\bigr] \leq \bigl[R(\hat{h}) - \hat{R}(\hat{h})\bigr] + \bigl[\hat{R}(h^*) - R(h^*)\bigr]. \end{align*} Since $R(\hat{h}) - \hat{R}(\hat{h}) \leq G$ and $\hat{R}(h^*) - R(h^*) \leq 0$ in expectation (by unbiasedness of $\hat{R}(h^*)$), taking expectations: \begin{align*} \mathbb{E}[R(\hat{h}) - R(h^*)] \leq \mathbb{E}[R(\hat{h}) - \hat{R}(\hat{h})] + \underbrace{\mathbb{E}[\hat{R}(h^*) - R(h^*)]}_{= 0} \leq \mathbb{E}[G]. \end{align*} The second term vanishes because $\mathbb{E}[\hat{R}(h^*)] = R(h^*)$ by linearity of expectation and the fact that $h^*$ is non-random. [/step] [step:Rewrite $\mathbb{E}[G]$ as the expected supremum of a centred empirical process and apply the Symmetrization Bound] For each $f \in F$, write $f(Z_i) = \ell(h(X_i), Y_i)$ where $Z_i = (X_i, Y_i)$. Then $R(h) = \mathbb{E}[f(Z)]$ and $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n f(Z_i)$, so \begin{align*} G = \sup_{f \in F}\left\{\mathbb{E}[f(Z)] - \frac{1}{n}\sum_{i=1}^n f(Z_i)\right\} = \sup_{f \in F}\frac{1}{n}\sum_{i=1}^n\bigl\{\mathbb{E}[f(Z_i)] - f(Z_i)\bigr\}. \end{align*} Negating inside the supremum, this is equivalently \begin{align*} G = \sup_{g \in -F}\frac{1}{n}\sum_{i=1}^n\bigl\{g(Z_i) - \mathbb{E}[g(Z_i)]\bigr\}, \end{align*} where $-F := \{-f : f \in F\}$ is the negated loss class. This is exactly the form to which the [Symmetrization Bound](/theorems/1965) applies. The hypotheses of the Symmetrization Bound are satisfied: $-F$ is a class of real-valued functions on $\mathcal{Z} = \mathcal{X} \times \{-1, 1\}$, and $Z_1, \ldots, Z_n$ are i.i.d. Applying the bound: \begin{align*} \mathbb{E}[G] \leq 2\mathcal{R}_n(-F). \end{align*} [guided] The sign reversal is the key algebraic step. The empirical process $G$ takes the form $\sup_{f \in F}\{\mathbb{E}f - \hat{\mathbb{E}}_n f\}$, but the Symmetrization Bound controls $\sup_{g \in \mathcal{G}}\{\hat{\mathbb{E}}_n g - \mathbb{E}g\}$ (empirical minus population). Introducing $g = -f$ converts one form into the other: $\mathbb{E}f - \hat{\mathbb{E}}_n f = \hat{\mathbb{E}}_n(-f) - \mathbb{E}(-f) = \hat{\mathbb{E}}_n g - \mathbb{E}g$. The Symmetrization Bound requires i.i.d. data and a class of real-valued functions. Both conditions are met: the data $(X_i, Y_i)$ are i.i.d. by assumption, and each $g \in -F$ is a measurable function from $\mathcal{Z}$ to $\mathbb{R}$. [/guided] [/step] [step:Use Rademacher sign symmetry to replace $\mathcal{R}_n(-F)$ by $\mathcal{R}_n(F)$] By definition, the Rademacher complexity of a function class $\mathcal{G}$ is \begin{align*} \mathcal{R}_n(\mathcal{G}) = \mathbb{E}\!\left[\sup_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i g(Z_i)\right], \end{align*} where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Rademacher random variables (taking values $\pm 1$ with equal probability), independent of $Z_1, \ldots, Z_n$. For the negated class $-F$: \begin{align*} \mathcal{R}_n(-F) = \mathbb{E}\!\left[\sup_{f \in F} \frac{1}{n}\sum_{i=1}^n \varepsilon_i \bigl(-f(Z_i)\bigr)\right] = \mathbb{E}\!\left[\sup_{f \in F} \frac{1}{n}\sum_{i=1}^n (-\varepsilon_i) f(Z_i)\right]. \end{align*} Since $\varepsilon_i$ are symmetric about zero, the vector $(-\varepsilon_1, \ldots, -\varepsilon_n)$ has the same joint distribution as $(\varepsilon_1, \ldots, \varepsilon_n)$. Therefore \begin{align*} \mathbb{E}\!\left[\sup_{f \in F} \frac{1}{n}\sum_{i=1}^n (-\varepsilon_i) f(Z_i)\right] = \mathbb{E}\!\left[\sup_{f \in F} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(Z_i)\right] = \mathcal{R}_n(F). \end{align*} Combining with the bound from the previous step: \begin{align*} \mathbb{E}[R(\hat{h})] - R(h^*) \leq \mathbb{E}[G] \leq 2\mathcal{R}_n(-F) = 2\mathcal{R}_n(F), \end{align*} which is the stated inequality. [guided] The sign-symmetry argument is worth spelling out carefully. Each $\varepsilon_i$ takes values $\pm 1$ with equal probability, so $-\varepsilon_i$ has the same distribution as $\varepsilon_i$. Since $\varepsilon_1, \ldots, \varepsilon_n$ are independent, the joint distribution of $(-\varepsilon_1, \ldots, -\varepsilon_n)$ equals the joint distribution of $(\varepsilon_1, \ldots, \varepsilon_n)$. Any measurable function of $(\varepsilon_1, \ldots, \varepsilon_n)$ -- including $\sup_{f \in F}\frac{1}{n}\sum_i \varepsilon_i f(Z_i)$ -- therefore has the same expectation when $\varepsilon_i$ is replaced by $-\varepsilon_i$ throughout. This symmetry is the reason Rademacher complexity is insensitive to negation of the function class: $\mathcal{R}_n(-\mathcal{G}) = \mathcal{R}_n(\mathcal{G})$ for any class $\mathcal{G}$. It is a purely distributional property of Rademacher random variables and does not require any structure on $\mathcal{G}$. [/guided] [/step]

Explore Further

Subdifferential at Points of Differentiability Machine Learning Hoeffding's Inequality Machine Learning Universal Approximation Machine Learning Symmetrization Bound Machine Learning Sub-Gaussian Tail Bound Machine Learning Ensemble Variance Formula Machine Learning Contraction Lemma Machine Learning Convergence of Projected Gradient Descent Machine Learning

What brings you to Androma?

Start with a route through the knowledge graph.

Expected Risk Bound via Rademacher Complexity (Theorem # 1967)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Expected Risk Bound via Rademacher Complexity (Theorem # 1967)

Discussion

Proof

Explore Further