Rademacher Complexity of ℓ₂-Constrained Class

Rademacher Complexity of ℓ₂-Constrained Class (Theorem # 1982)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We evaluate the supremum over the $\ell_2$-ball in closed form using the Cauchy--Schwarz inequality, then apply Jensen's inequality (concavity of $\sqrt{\cdot}$) to pass from $\mathbb{E}\|v\|_2$ to $(\mathbb{E}\|v\|_2^2)^{1/2}$. The cross terms in $\mathbb{E}\|\sum_i \varepsilon_i x_i\|_2^2$ vanish by independence and mean-zero of the Rademacher variables, leaving the sum $\sum_i \|x_i\|_2^2$. The population-level bound follows by taking expectations over the data and using the i.i.d. assumption. [/proofplan] [step:Evaluate the supremum over $\|\beta\|_2 \leq \lambda$ in closed form] The empirical Rademacher complexity is \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) = \mathbb{E}\!\left[\sup_{\beta:\,\|\beta\|_2 \leq \lambda} \frac{1}{n}\sum_{i=1}^n \varepsilon_i x_i^\top \beta\right]. \end{align*} Since the sum is linear in $\beta$, we may write \begin{align*} \sup_{\beta:\,\|\beta\|_2 \leq \lambda} \frac{1}{n}\sum_{i=1}^n \varepsilon_i x_i^\top \beta = \sup_{\beta:\,\|\beta\|_2 \leq \lambda} \frac{1}{n}\beta^\top\!\left(\sum_{i=1}^n \varepsilon_i x_i\right). \end{align*} Define $v := \sum_{i=1}^n \varepsilon_i x_i \in \mathbb{R}^p$. The Cauchy--Schwarz inequality in $\mathbb{R}^p$ gives $\beta^\top v \leq \|\beta\|_2 \|v\|_2$ for all $\beta, v \in \mathbb{R}^p$, with equality when $\beta$ is a positive scalar multiple of $v$. Since $\|\beta\|_2 \leq \lambda$, the supremum is attained at $\beta^* = \lambda v / \|v\|_2$ (when $v \neq 0$; the case $v = 0$ gives a supremum of $0$), yielding \begin{align*} \sup_{\beta:\,\|\beta\|_2 \leq \lambda} \frac{1}{n}\beta^\top v = \frac{\lambda}{n}\|v\|_2 = \frac{\lambda}{n}\!\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2. \end{align*} Therefore \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) = \frac{\lambda}{n}\,\mathbb{E}\!\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2. \end{align*} [guided] Why can we evaluate the supremum exactly? The constraint set $\{\beta : \|\beta\|_2 \leq \lambda\}$ is the closed $\ell_2$-ball of radius $\lambda$, and the objective $\beta \mapsto \beta^\top v$ is a linear functional. The maximum of a linear functional over an $\ell_2$-ball is a classical duality result: $\sup_{\|\beta\|_2 \leq \lambda} \beta^\top v = \lambda\|v\|_2$. This is the content of the Cauchy--Schwarz inequality: $\beta^\top v \leq \|\beta\|_2\|v\|_2 \leq \lambda\|v\|_2$, with equality achieved by choosing $\beta = \lambda v/\|v\|_2$. [/guided] [/step] [step:Apply Jensen's inequality to pass from $\mathbb{E}\|v\|_2$ to $(\mathbb{E}\|v\|_2^2)^{1/2}$] The function $t \mapsto \sqrt{t}$ is concave on $[0, \infty)$. By [Jensen's Inequality](/theorems/1977) applied to the random variable $\|v\|_2^2 = \|\sum_{i=1}^n \varepsilon_i x_i\|_2^2$ (which is non-negative) and the concave function $\sqrt{\cdot}$: \begin{align*} \mathbb{E}\!\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2 = \mathbb{E}\!\left[\sqrt{\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2^2}\right] \leq \sqrt{\mathbb{E}\!\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2^2}. \end{align*} Substituting into the expression from the previous step: \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) \leq \frac{\lambda}{n}\left(\mathbb{E}\!\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2^2\right)^{1/2}. \end{align*} [guided] Jensen's inequality for concave functions reverses the direction: $\mathbb{E}[\varphi(Z)] \leq \varphi(\mathbb{E}[Z])$ when $\varphi$ is concave. Here $\varphi(t) = \sqrt{t}$ is concave (since $\varphi''(t) = -1/(4t^{3/2}) < 0$ for $t > 0$), so $\mathbb{E}[\sqrt{Z}] \leq \sqrt{\mathbb{E}[Z]}$ for any non-negative random variable $Z$. We apply this with $Z = \|\sum_i \varepsilon_i x_i\|_2^2$. [/guided] [/step] [step:Compute $\mathbb{E}\|\sum_{i=1}^n \varepsilon_i x_i\|_2^2$ by expanding and eliminating cross terms] Expand the squared norm: \begin{align*} \mathbb{E}\!\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2^2 = \mathbb{E}\!\left[\sum_{i=1}^n \sum_{j=1}^n \varepsilon_i \varepsilon_j\, x_i^\top x_j\right] = \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}[\varepsilon_i \varepsilon_j]\, x_i^\top x_j. \end{align*} Since $\varepsilon_1, \ldots, \varepsilon_n$ are independent Rademacher variables (each taking values $\pm 1$ with equal probability), their moments satisfy: \begin{align*} \mathbb{E}[\varepsilon_i \varepsilon_j] = \begin{cases} \mathbb{E}[\varepsilon_i^2] = 1 & \text{if } i = j, \\ \mathbb{E}[\varepsilon_i]\,\mathbb{E}[\varepsilon_j] = 0 & \text{if } i \neq j, \end{cases} \end{align*} where the $i \neq j$ case uses independence and $\mathbb{E}[\varepsilon_i] = 0$. Therefore all cross terms vanish and \begin{align*} \mathbb{E}\!\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|_2^2 = \sum_{i=1}^n x_i^\top x_i = \sum_{i=1}^n \|x_i\|_2^2. \end{align*} [guided] This computation is the analogue of the variance of a sum of independent mean-zero random vectors: $\operatorname{Var}(\sum_i Z_i) = \sum_i \operatorname{Var}(Z_i)$ when the $Z_i$ are independent with mean zero. Here each $Z_i = \varepsilon_i x_i$ has $\mathbb{E}[Z_i] = x_i \mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\|Z_i\|_2^2] = \|x_i\|_2^2 \mathbb{E}[\varepsilon_i^2] = \|x_i\|_2^2$. The cross terms $\mathbb{E}[Z_i^\top Z_j] = x_i^\top x_j \mathbb{E}[\varepsilon_i \varepsilon_j] = 0$ for $i \neq j$ by independence and the mean-zero property. [/guided] [/step] [step:Assemble the empirical bound] Substituting the result of the previous step: \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) \leq \frac{\lambda}{n}\left(\sum_{i=1}^n \|x_i\|_2^2\right)^{1/2}, \end{align*} which is the first stated bound. [/step] [step:Derive the population-level bound by taking expectations over the data] Suppose $X_1, \ldots, X_n$ are i.i.d. copies of a random vector $X \in \mathbb{R}^p$. The population Rademacher complexity is $R_n(\mathcal{H}) = \mathbb{E}_{X_{1:n}}[\hat{R}(\mathcal{H}(X_{1:n}))]$. We return to the exact expression from the first step (before applying Jensen) and take the full expectation over both $\varepsilon$ and $X$: \begin{align*} R_n(\mathcal{H}) = \frac{\lambda}{n}\,\mathbb{E}_{\varepsilon, X}\!\left\|\sum_{i=1}^n \varepsilon_i X_i\right\|_2. \end{align*} Applying Jensen's inequality as before (concavity of $\sqrt{\cdot}$, now over the joint randomness): \begin{align*} R_n(\mathcal{H}) \leq \frac{\lambda}{n}\left(\mathbb{E}_{\varepsilon, X}\!\left\|\sum_{i=1}^n \varepsilon_i X_i\right\|_2^2\right)^{1/2}. \end{align*} By the same cross-term cancellation (the $\varepsilon_i$ are independent of everything and of each other, with $\mathbb{E}[\varepsilon_i] = 0$): \begin{align*} \mathbb{E}_{\varepsilon, X}\!\left\|\sum_{i=1}^n \varepsilon_i X_i\right\|_2^2 = \sum_{i=1}^n \mathbb{E}\!\left[\|X_i\|_2^2\right] = n\,\mathbb{E}\!\left[\|X\|_2^2\right], \end{align*} where the last equality uses the identical distribution of $X_1, \ldots, X_n$. Therefore \begin{align*} R_n(\mathcal{H}) \leq \frac{\lambda}{n}\left(n\,\mathbb{E}\!\left[\|X\|_2^2\right]\right)^{1/2} = \frac{\lambda\,(\mathbb{E}\|X\|_2^2)^{1/2}}{\sqrt{n}}, \end{align*} which is the second stated bound. This completes the proof. [guided] The population-level bound exhibits the standard $1/\sqrt{n}$ rate: the Rademacher complexity of the $\ell_2$-constrained linear class decreases at rate $n^{-1/2}$. The factor $\lambda(\mathbb{E}\|X\|_2^2)^{1/2}$ reflects the interaction between the complexity of the hypothesis class (controlled by the radius $\lambda$ of the $\ell_2$-ball) and the scale of the data (controlled by $\mathbb{E}\|X\|_2^2$). A larger $\ell_2$ constraint or higher-dimensional data increases the Rademacher complexity, while more data points decrease it -- this is the bias-complexity tradeoff quantified in a single formula. [/guided] [/step]

Explore Further

Empirical Rademacher Bound via Counting Machine Learning Subgradient Calculus Machine Learning Projection Theorem Machine Learning Zhang–Bartlett Machine Learning Fundamental Theorem of Statistical Learning Theory Machine Learning Bias-Variance Decomposition for Decision Trees Machine Learning Hoeffding's Lemma Machine Learning Properties of Convex Functions Machine Learning

What brings you to Androma?

Start with a route through the knowledge graph.