Sample Size Bound for Uniform Estimation over a Finite Sub-Gaussian Class

Sample Size Bound for Uniform Estimation over a Finite Sub-Gaussian Class (Theorem # 6069)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We prove the statement by deriving the usual [sub-Gaussian tail bound](/theorems/1953) for each centered empirical average and then applying the union bound over the finite class $\mathcal F$. The sample-size hypothesis is exactly the algebraic condition that makes the union-bound error at most $\beta$. Taking complements then gives the desired high-probability uniform estimate. [/proofplan] [step:Define the centered empirical average for each function] Let $(\Omega,\mathcal A,\mathbb P)$ be the probability space on which the random variables $Z_1,\dots,Z_n$ are defined. For each $f\in\mathcal F$, define the [random variable](/page/Random%20Variable) $A_f:\Omega\to\mathbb R$ by \begin{align*} A_f(\omega) = \frac{1}{n}\sum_{i=1}^n \left(f(Z_i(\omega))-\mathbb E[f(Z_i)]\right). \end{align*} Each $A_f$ is measurable as a finite linear combination of measurable real-valued random variables. Since $\mathcal F$ is finite, the map $\omega\mapsto \sup_{f\in\mathcal F}|A_f(\omega)|$ is measurable as the pointwise maximum of finitely many measurable real-valued functions. Therefore the event in the theorem is \begin{align*} \left\{\omega\in\Omega:\sup_{f\in\mathcal F}|A_f(\omega)|\le \varepsilon\right\}. \end{align*} Its complement is \begin{align*} \left\{\omega\in\Omega:\sup_{f\in\mathcal F}|A_f(\omega)|> \varepsilon\right\} = \bigcup_{f\in\mathcal F}\{\omega\in\Omega:|A_f(\omega)|>\varepsilon\}. \end{align*} [/step] [step:Derive the two-sided sub-Gaussian tail bound for one empirical average] Fix $f\in\mathcal F$. The hypotheses of the theorem give that the centered real-valued random variables $f(Z_i)-\mathbb E[f(Z_i)]$, for $1\le i\le n$, are independent and satisfy, for every parameter $\theta\in\mathbb R$, \begin{align*} \mathbb E\left[\exp\left(\theta\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right] \le \exp\left(\frac{\sigma^2\theta^2}{2}\right). \end{align*} For $\lambda>0$, independence gives \begin{align*} \mathbb E\left[\exp\left(\lambda n A_f\right)\right]=\mathbb E\left[\exp\left(\lambda\sum_{i=1}^n \left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right]. \end{align*} Since the centered random variables $f(Z_i)-\mathbb E[f(Z_i)]$ are independent and each satisfies the sub-Gaussian moment-generating-function bound with variance proxy $\sigma^2$, this equals the product of the one-variable exponential moments and is bounded by \begin{align*} \prod_{i=1}^n \mathbb E\left[\exp\left(\lambda\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right]\le \exp\left(\frac{n\sigma^2\lambda^2}{2}\right). \end{align*} By [Markov's Inequality](/theorems/1126) applied to the nonnegative random variable $\exp(\lambda nA_f)$, \begin{align*} \mathbb P(A_f>\varepsilon)=\mathbb P\left(\exp(\lambda nA_f)>\exp(\lambda n\varepsilon)\right). \end{align*} [Markov's Inequality](/theorems/1126) and the preceding exponential-moment estimate give \begin{align*} \mathbb P(A_f>\varepsilon)\le \exp(-\lambda n\varepsilon)\mathbb E\left[\exp(\lambda nA_f)\right]\le \exp\left(-\lambda n\varepsilon+\frac{n\sigma^2\lambda^2}{2}\right). \end{align*} Choosing $\lambda=\varepsilon/\sigma^2$ yields \begin{align*} \mathbb P(A_f>\varepsilon) \le \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \end{align*} Applying the same argument to $-A_f$ is valid because the uniform sub-Gaussian hypothesis gives the same moment-generating-function bound for every real parameter. Equivalently, for $\lambda>0$ we apply the bound with parameter $-\lambda$ to each centered variable $f(Z_i)-\mathbb E[f(Z_i)]$. Hence \begin{align*} \mathbb P(A_f<-\varepsilon) \le \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \end{align*} Therefore \begin{align*} \mathbb P(|A_f|>\varepsilon) \le 2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \end{align*} [guided] Fix one function $f\in\mathcal F$. The goal is first to control the deviation of a single centered empirical average; only after that will we take the union over all functions. Define the random variable $A_f:\Omega\to\mathbb R$ by \begin{align*} A_f(\omega) = \frac{1}{n}\sum_{i=1}^n \left(f(Z_i(\omega))-\mathbb E[f(Z_i)]\right). \end{align*} Equivalently, \begin{align*} nA_f = \sum_{i=1}^n \left(f(Z_i)-\mathbb E[f(Z_i)]\right). \end{align*} The hypotheses of the theorem give that the centered variables $f(Z_i)-\mathbb E[f(Z_i)]$ are independent and that, for every $\theta\in\mathbb R$ and every $1\le i\le n$, \begin{align*} \mathbb E\left[\exp\left(\theta\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right] \le \exp\left(\frac{\sigma^2\theta^2}{2}\right). \end{align*} For $\lambda>0$, the exponential moment of this sum factors because the centered random variables are independent. First, \begin{align*} \mathbb E\left[\exp\left(\lambda n A_f\right)\right]=\mathbb E\left[\exp\left(\lambda\sum_{i=1}^n \left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right]. \end{align*} Independence turns the exponential moment of the sum into the product of the exponential moments: \begin{align*} \mathbb E\left[\exp\left(\lambda n A_f\right)\right]=\prod_{i=1}^n \mathbb E\left[\exp\left(\lambda\left(f(Z_i)-\mathbb E[f(Z_i)]\right)\right)\right]. \end{align*} The sub-Gaussian hypothesis says that each factor is bounded by $\exp(\sigma^2\lambda^2/2)$. Multiplying the $n$ identical upper bounds gives \begin{align*} \mathbb E\left[\exp\left(\lambda n A_f\right)\right]\le\exp\left(\frac{n\sigma^2\lambda^2}{2}\right). \end{align*} Now apply [Markov's Inequality](/theorems/1126) to the nonnegative random variable $\exp(\lambda nA_f)$. Since $\lambda>0$, the event $A_f>\varepsilon$ is the same as the event $\exp(\lambda nA_f)>\exp(\lambda n\varepsilon)$. Hence \begin{align*} \mathbb P(A_f>\varepsilon)=\mathbb P\left(\exp(\lambda nA_f)>\exp(\lambda n\varepsilon)\right). \end{align*} [Markov's Inequality](/theorems/1126) and the exponential-moment estimate then give \begin{align*} \mathbb P(A_f>\varepsilon)\le\exp(-\lambda n\varepsilon)\mathbb E\left[\exp(\lambda nA_f)\right]\le\exp\left(-\lambda n\varepsilon+\frac{n\sigma^2\lambda^2}{2}\right). \end{align*} The exponent is minimized by choosing \begin{align*} \lambda=\frac{\varepsilon}{\sigma^2}, \end{align*} which is positive because $\varepsilon>0$ and $\sigma>0$. Substituting this value gives \begin{align*} \mathbb P(A_f>\varepsilon) \le \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \end{align*} The same calculation applies to $-A_f$, but it is worth spelling out why the hypothesis permits this. The uniform sub-Gaussian moment-generating-function bound is assumed for every real parameter. Thus, when estimating $\mathbb E[\exp(\lambda(-nA_f))]$ with $\lambda>0$, we apply the original bound to each centered variable $f(Z_i)-\mathbb E[f(Z_i)]$ with parameter $-\lambda$. This gives the same estimate for the negative average, and therefore \begin{align*} \mathbb P(A_f<-\varepsilon) \le \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \end{align*} Taking the union of the upper-tail and lower-tail events gives the two-sided bound \begin{align*} \mathbb P(|A_f|>\varepsilon) \le 2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \end{align*} [/guided] [/step] [step:Apply the union bound over the finite class] Using the complement representation from the first step, \begin{align*} \mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right)=\mathbb P\left(\bigcup_{f\in\mathcal F}\{|A_f|>\varepsilon\}\right). \end{align*} The [Finite-Class Union Bound](/theorems/1108) over the finite class $\mathcal F$ and the two-sided bound from the previous step give \begin{align*} \mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right)\le\sum_{f\in\mathcal F}\mathbb P(|A_f|>\varepsilon)\le 2|\mathcal F|\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \end{align*} [/step] [step:Use the sample-size hypothesis to make the failure probability at most $\beta$] Since $\mathcal F$ is nonempty, $|\mathcal F|\ge 1$, so the quantity \begin{align*} \frac{2|\mathcal F|}{\beta} \end{align*} is positive and the following division by $|\mathcal F|$ is legitimate. The assumed lower bound on $n$ implies \begin{align*} \frac{n\varepsilon^2}{2\sigma^2} \ge \log\frac{2|\mathcal F|}{\beta}. \end{align*} Exponentiating the negative of both sides gives \begin{align*} \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right) \le \frac{\beta}{2|\mathcal F|}. \end{align*} Substituting this into the union-bound estimate yields \begin{align*} \mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right) \le 2|\mathcal F|\cdot \frac{\beta}{2|\mathcal F|} = \beta. \end{align*} Therefore, since the events $\{\sup_{f\in\mathcal F}|A_f|\le\varepsilon\}$ and $\{\sup_{f\in\mathcal F}|A_f|>\varepsilon\}$ are complements, \begin{align*} \mathbb P\left(\sup_{f\in\mathcal F}|A_f|\le\varepsilon\right)=1-\mathbb P\left(\sup_{f\in\mathcal F}|A_f|>\varepsilon\right)\ge 1-\beta. \end{align*} Replacing $A_f$ by its definition gives \begin{align*} \mathbb P\left(\sup_{f\in\mathcal F}\left|\frac{1}{n}\sum_{i=1}^n f(Z_i)-\frac{1}{n}\sum_{i=1}^n\mathbb E[f(Z_i)]\right|\le \varepsilon\right)\ge 1-\beta, \end{align*} which is the desired conclusion. [/step]

Prerequisites (0/3 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

Random Variable Definition Variance Definition Event Definition Karush-Kuhn-Tucker Conditions for the Group Lasso Probability & Statistics Time Inversion of Brownian Motion Brownian Motion Characterization of Distribution Functions Probability Theory Restricted Isometry Property Implies Injectivity on Sparse Vectors Probability & Statistics Bernoulli Chernoff Bound Probability & Statistics Cone Condition for the Dantzig Selector Probability & Statistics Joint Law Criterion for Independence Probability Theory Unbiasedness of $K$-Fold Cross-Validation Risk for Fold-Trained Predictors Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.