Rademacher Complexity of ℓ₁-Constrained Class

Rademacher Complexity of ℓ₁-Constrained Class (Theorem # 1984)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We use the $\ell_1$--$\ell_\infty$ duality (Holder's inequality) to reduce the supremum over the $\ell_1$-ball to the $\ell_\infty$-norm of the Rademacher average vector. We then bound the expected $\ell_\infty$-norm by the [Expected Maximum of Sub-Gaussians](/theorems/1963) bound, noting that for each coordinate $j$ the Rademacher sum $\frac{1}{n}\sum_{i=1}^n \varepsilon_i x_{ij}$ and its negation are both sub-Gaussian, giving $2p$ sub-Gaussian random variables whose maximum controls the $\ell_\infty$-norm. [/proofplan] [step:Apply Holder's inequality to evaluate the supremum over the $\ell_1$-ball] The empirical Rademacher complexity is \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) = \mathbb{E}\!\left[\sup_{\beta:\,\|\beta\|_1 \leq \lambda} \frac{1}{n}\sum_{i=1}^n \varepsilon_i x_i^\top \beta\right] = \mathbb{E}\!\left[\sup_{\beta:\,\|\beta\|_1 \leq \lambda} \beta^\top\!\left(\frac{1}{n}\sum_{i=1}^n \varepsilon_i x_i\right)\right]. \end{align*} Define the random vector $v := \frac{1}{n}\sum_{i=1}^n \varepsilon_i x_i \in \mathbb{R}^p$. By Holder's inequality with exponents $1$ and $\infty$ (since $\frac{1}{1} + \frac{1}{\infty} = 1$), for any $\beta, v \in \mathbb{R}^p$: \begin{align*} \beta^\top v \leq \|\beta\|_1 \|v\|_\infty. \end{align*} The supremum over $\|\beta\|_1 \leq \lambda$ is attained: let $j^* = \arg\max_j |v_j|$, and set $\beta^* = \lambda\, \operatorname{sgn}(v_{j^*})\, e_{j^*}$ (where $e_{j^*}$ is the $j^*$-th standard basis vector). Then $\|\beta^*\|_1 = \lambda$ and $(\beta^*)^\top v = \lambda |v_{j^*}| = \lambda\|v\|_\infty$. Therefore \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) = \lambda\,\mathbb{E}\!\left[\|v\|_\infty\right] = \lambda\,\mathbb{E}\!\left[\max_{j=1,\ldots,p}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i x_{ij}\right|\right]. \end{align*} [guided] The $\ell_1$--$\ell_\infty$ duality is the analogue of the $\ell_2$ self-duality used in the [Rademacher Complexity of $\ell_2$-Constrained Class](/theorems/1982). For the $\ell_2$-ball, the supremum of a linear functional reduces to the $\ell_2$-norm of the coefficient vector (via Cauchy--Schwarz). For the $\ell_1$-ball, it reduces to the $\ell_\infty$-norm (via Holder). This explains why the $\ell_1$-constrained bound depends on $\max_j$ (coordinate-wise quantities) rather than on $\sum_i \|x_i\|_2^2$ (an aggregate): the $\ell_1$-ball is "spiky", so the worst-case hypothesis concentrates its weight on a single coordinate. [/guided] [/step] [step:Rewrite the expected $\ell_\infty$-norm as the expected maximum of $2p$ sub-Gaussian variables] Write \begin{align*} \max_{j=1,\ldots,p}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i x_{ij}\right| = \max_{j=1,\ldots,p}\max\!\left\{\frac{1}{n}\sum_{i=1}^n \varepsilon_i x_{ij},\; -\frac{1}{n}\sum_{i=1}^n \varepsilon_i x_{ij}\right\} = \max_{k=1,\ldots,2p} W_k, \end{align*} where, for $j = 1, \ldots, p$, we define $W_j := \frac{1}{n}\sum_{i=1}^n \varepsilon_i x_{ij}$ and $W_{p+j} := -\frac{1}{n}\sum_{i=1}^n \varepsilon_i x_{ij}$. Each $W_k$ is mean-zero: $\mathbb{E}[W_j] = \frac{1}{n}\sum_{i=1}^n x_{ij}\,\mathbb{E}[\varepsilon_i] = 0$, and similarly $\mathbb{E}[W_{p+j}] = 0$. For the sub-Gaussian parameter of $W_j$: the summands $\varepsilon_i x_{ij}$ are independent (since the $\varepsilon_i$ are independent and the $x_{ij}$ are fixed constants). Each $\varepsilon_i x_{ij}$ takes values in $[-|x_{ij}|, |x_{ij}|]$, so by [Hoeffding's Lemma](/theorems/1956), $\varepsilon_i x_{ij}$ is sub-Gaussian with parameter $(2|x_{ij}|)/2 = |x_{ij}|$. By [Sub-Gaussian Stability Under Linear Combinations](/theorems/1960) applied with weights $\gamma_i = 1/n$ and parameters $\sigma_i = |x_{ij}|$: \begin{align*} W_j \text{ is sub-Gaussian with parameter } \sigma_j := \left(\sum_{i=1}^n \frac{x_{ij}^2}{n^2}\right)^{1/2} = \frac{1}{n}\left(\sum_{i=1}^n x_{ij}^2\right)^{1/2}. \end{align*} Since $W_{p+j} = -W_j$ and negation preserves the sub-Gaussian property with the same parameter, $W_{p+j}$ is also sub-Gaussian with parameter $\sigma_j$. Define \begin{align*} \sigma_{\max} := \max_{j=1,\ldots,p} \sigma_j = \max_{j=1,\ldots,p} \frac{1}{n}\left(\sum_{i=1}^n x_{ij}^2\right)^{1/2}. \end{align*} Since a sub-Gaussian random variable with parameter $\sigma_j \leq \sigma_{\max}$ is also sub-Gaussian with parameter $\sigma_{\max}$, all $2p$ variables $W_1, \ldots, W_{2p}$ are sub-Gaussian with the common parameter $\sigma_{\max}$. [/step] [step:Apply the expected maximum bound and simplify] By the [Expected Maximum of Sub-Gaussians](/theorems/1963) applied to the $2p$ mean-zero sub-Gaussian random variables $W_1, \ldots, W_{2p}$, each with parameter $\sigma_{\max}$ (independence is not required): \begin{align*} \mathbb{E}\!\left[\max_{k=1,\ldots,2p} W_k\right] \leq \sigma_{\max}\sqrt{2\log(2p)}. \end{align*} Substituting into the expression from the first step: \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) \leq \lambda \cdot \sigma_{\max} \cdot \sqrt{2\log(2p)} = \lambda \cdot \max_{j=1,\ldots,p}\left(\frac{1}{n}\sum_{i=1}^n x_{ij}^2\right)^{1/2} \cdot \frac{\sqrt{2\log(2p)}}{\sqrt{n}} \cdot \sqrt{n}. \end{align*} Expanding $\sigma_{\max}$ explicitly: \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) \leq \lambda \cdot \frac{1}{n}\left(\max_{j=1,\ldots,p}\sum_{i=1}^n x_{ij}^2\right)^{1/2} \cdot \sqrt{2\log(2p)} = \lambda \cdot \max_{j=1,\ldots,p}\left(\frac{1}{n}\sum_{i=1}^n x_{ij}^2\right)^{1/2} \cdot \frac{\sqrt{2\log(2p)}}{\sqrt{n}}, \end{align*} where in the last step we used $\frac{1}{n}(\sum_i x_{ij}^2)^{1/2} = \frac{1}{\sqrt{n}}(\frac{1}{n}\sum_i x_{ij}^2)^{1/2}$. This is the first stated bound. When $x_{ij} \in [-C, C]$ for all $i, j$, we have $\frac{1}{n}\sum_{i=1}^n x_{ij}^2 \leq \frac{1}{n} \cdot nC^2 = C^2$ for each $j$, so $\max_j (\frac{1}{n}\sum_i x_{ij}^2)^{1/2} \leq C$. Substituting: \begin{align*} \hat{R}(\mathcal{H}(x_{1:n})) \leq \frac{\lambda C\sqrt{2\log(2p)}}{\sqrt{n}}, \end{align*} which is the second stated bound. This completes the proof. [guided] The key difference from the $\ell_2$-constrained bound is the logarithmic dependence on $p$. The $\ell_2$-constrained class has Rademacher complexity $O(\lambda(\mathbb{E}\|X\|_2^2)^{1/2}/\sqrt{n})$, which depends on $p$ only through $\|X\|_2^2$. The $\ell_1$-constrained class pays a $\sqrt{\log p}$ factor (from the maximum of $2p$ sub-Gaussians), but its bound depends on the coordinate-wise second moment $\max_j \frac{1}{n}\sum_i x_{ij}^2$ rather than the full squared norm. When $p$ is large but each coordinate is bounded by $C$, the $\ell_1$ bound scales as $O(\lambda C\sqrt{\log p/n})$, which can be much smaller than the $\ell_2$ bound if $\|X\|_2^2 \approx pC^2$ grows with $p$. This explains the practical effectiveness of $\ell_1$-regularisation (LASSO) in high-dimensional settings: the $\ell_1$ constraint produces a hypothesis class with better statistical properties when $p \gg n$. [/guided] [/step]

Explore Further

VC Dimension Bound for Sign Classifiers Machine Learning Calibration via Differentiability at Zero Machine Learning Symmetrization Bound Machine Learning Sub-Gaussian Stability Under Linear Combinations Machine Learning Ensemble Variance Formula Machine Learning Rademacher Complexity of ℓ₂-Constrained Class Machine Learning Non-Emptiness of the Subdifferential Machine Learning Bayes Classifier Machine Learning

What brings you to Androma?

Start with a route through the knowledge graph.