Additive Chernoff-Hoeffding Bound for Bernoulli Sums

Additive Chernoff-Hoeffding Bound for Bernoulli Sums (Theorem # 6049)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We prove the upper-tail estimate by the exponential moment method. First we establish a sharp moment-generating bound for each centered Bernoulli variable, using a one-variable calculus estimate. Independence then factors the exponential moment of $S_n-\mu$, and optimizing over the exponential parameter gives the bound $e^{-2t^2/n}$. Applying the same argument to $\mu-S_n$ gives the lower tail, and the two-sided and sample-proportion estimates follow by a union bound and the substitution $t=n\varepsilon$. [/proofplan] [step:Bound the centered exponential moment of one Bernoulli variable] Fix $i \in \{1,\dots,n\}$. Since $Y_i\sim \operatorname{Ber}(p_i)$, the [random variable](/page/Random%20Variable) $Y_i$ is a Bernoulli random variable with parameter $p_i$. On the probability space $(\Omega,\mathcal F,\mathbb P)$ from the statement, define the centered Bernoulli random variable $X_i:\Omega \to \mathbb R$ by \begin{align*} X_i(\omega)=Y_i(\omega)-p_i. \end{align*} For every $\lambda \in \mathbb R$, we claim that \begin{align*} \mathbb E[e^{\lambda X_i}] \le e^{\lambda^2/8}. \end{align*} If $p_i \in \{0,1\}$, then $Y_i=p_i$ almost surely, so $X_i=0$ almost surely and $\mathbb E[e^{\lambda X_i}]=1\le e^{\lambda^2/8}$. Assume therefore that $0<p_i<1$. Define the function $h_i:\mathbb R \to \mathbb R$ by \begin{align*} h_i(u)=\log(1-p_i+p_i e^u)-p_i u. \end{align*} Since $Y_i$ takes the value $1$ with probability $p_i$ and the value $0$ with probability $1-p_i$, the moment-generating function of $X_i$ is \begin{align*} \mathbb E[e^{\lambda X_i}]=p_i e^{\lambda(1-p_i)}+(1-p_i)e^{-\lambda p_i}. \end{align*} Also, \begin{align*} p_i e^{\lambda(1-p_i)}+(1-p_i)e^{-\lambda p_i}=e^{\log(1-p_i+p_i e^\lambda)-p_i\lambda}=e^{h_i(\lambda)}. \end{align*} We compute \begin{align*} h_i'(u)=\frac{p_i e^u}{1-p_i+p_i e^u}-p_i. \end{align*} Also, \begin{align*} h_i''(u)=\frac{p_i(1-p_i)e^u}{(1-p_i+p_i e^u)^2}. \end{align*} Define $q_i:\mathbb R \to (0,1)$ by \begin{align*} q_i(u)=\frac{p_i e^u}{1-p_i+p_i e^u}. \end{align*} Then \begin{align*} h_i''(u)=q_i(u)(1-q_i(u))\le \frac14, \end{align*} because $q(1-q)\le 1/4$ for every $q\in[0,1]$. Also $h_i(0)=0$ and $h_i'(0)=0$. Let $\mathcal L^1$ denote one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $\mathbb R$. Since $h_i$ is twice continuously differentiable on $\mathbb R$, [Taylor's Theorem with Integral Remainder](/theorems/189) gives, for $\lambda\ge 0$, \begin{align*} h_i(\lambda)=\int_0^\lambda (\lambda-u)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_0^\lambda (\lambda-u)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}. \end{align*} For $\lambda<0$, the same formula is rewritten over the non-oriented interval $[\lambda,0]$ as \begin{align*} h_i(\lambda)=\int_\lambda^0 (u-\lambda)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_\lambda^0 (u-\lambda)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}. \end{align*} Hence \begin{align*} \mathbb E[e^{\lambda X_i}]=e^{h_i(\lambda)}\le e^{\lambda^2/8}. \end{align*} [guided] The goal of this step is to replace each Bernoulli random variable by a uniform exponential estimate that does not depend on its parameter $p_i$. Work on the probability space $(\Omega,\mathcal F,\mathbb P)$ from the statement. Fix $i \in \{1,\dots,n\}$ and define $X_i:\Omega \to \mathbb R$ by \begin{align*} X_i(\omega)=Y_i(\omega)-p_i. \end{align*} This is the centered version of $Y_i$, since $\mathbb E[X_i]=0$. We prove that for every $\lambda\in\mathbb R$, \begin{align*} \mathbb E[e^{\lambda X_i}]\le e^{\lambda^2/8}. \end{align*} If $p_i=0$ or $p_i=1$, then $Y_i=p_i$ almost surely, so $X_i=0$ almost surely. Therefore $\mathbb E[e^{\lambda X_i}]=1$, and the estimate follows from $1\le e^{\lambda^2/8}$. Now assume $0<p_i<1$. The exponential moment can be computed from the two possible values of $Y_i$: \begin{align*} \mathbb E[e^{\lambda X_i}]=\mathbb E[e^{\lambda(Y_i-p_i)}]. \end{align*} Since $Y_i$ takes only the values $0$ and $1$, \begin{align*} \mathbb E[e^{\lambda(Y_i-p_i)}]=p_i e^{\lambda(1-p_i)}+(1-p_i)e^{-\lambda p_i}. \end{align*} We want to show that this is at most $e^{\lambda^2/8}$. Taking logarithms is useful because products of exponential moments will later become sums. Define the function $h_i:\mathbb R \to \mathbb R$ by \begin{align*} h_i(u)=\log(1-p_i+p_i e^u)-p_i u. \end{align*} Then \begin{align*} \mathbb E[e^{\lambda X_i}] =e^{h_i(\lambda)}. \end{align*} We estimate $h_i$ by bounding its second derivative. Differentiating gives \begin{align*} h_i'(u)=\frac{p_i e^u}{1-p_i+p_i e^u}-p_i. \end{align*} Differentiating once more gives \begin{align*} h_i''(u)=\frac{p_i(1-p_i)e^u}{(1-p_i+p_i e^u)^2}. \end{align*} Define $q_i:\mathbb R \to (0,1)$ by \begin{align*} q_i(u)=\frac{p_i e^u}{1-p_i+p_i e^u}. \end{align*} With this notation, \begin{align*} h_i''(u)=q_i(u)(1-q_i(u)). \end{align*} For every $q\in[0,1]$, \begin{align*} q(1-q)\le \frac14, \end{align*} so $h_i''(u)\le 1/4$ for every $u\in\mathbb R$. Since $h_i(0)=0$ and $h_i'(0)=0$, and since $\mathcal L^1$ denotes one-dimensional Lebesgue measure on $\mathbb R$, the function $h_i$ is twice continuously differentiable on $\mathbb R$, so [Taylor's Theorem](/theorems/827) with Integral Remainder gives, for $\lambda\ge 0$, \begin{align*} h_i(\lambda)=\int_0^\lambda (\lambda-u)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_0^\lambda (\lambda-u)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}. \end{align*} For $\lambda<0$, we write the same remainder formula over $[\lambda,0]$ as \begin{align*} h_i(\lambda)=\int_\lambda^0 (u-\lambda)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_\lambda^0 (u-\lambda)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}. \end{align*} Therefore \begin{align*} \mathbb E[e^{\lambda X_i}] =e^{h_i(\lambda)} \le e^{\lambda^2/8}. \end{align*} This is the special Bernoulli form of the sub-Gaussian estimate that drives the Chernoff-Hoeffding bound. [/guided] [/step] [step:Factor the exponential moment of the centered sum using independence] For $\lambda\in\mathbb R$, define the random variable $Z_\lambda:\Omega \to \mathbb R$ on the same probability space by \begin{align*} Z_\lambda(\omega)=e^{\lambda(S_n(\omega)-\mu)}. \end{align*} Since \begin{align*} S_n-\mu=\sum_{i=1}^{n}(Y_i-p_i)=\sum_{i=1}^{n}X_i, \end{align*} we have \begin{align*} Z_\lambda =\prod_{i=1}^{n} e^{\lambda X_i}. \end{align*} The random variables $Y_1,\dots,Y_n$ are [independent](/page/Independence%20%28Probability%20Theory%29), and each $e^{\lambda X_i}$ is a measurable function of $Y_i$, so $e^{\lambda X_1},\dots,e^{\lambda X_n}$ are independent. Hence \begin{align*} \mathbb E[e^{\lambda(S_n-\mu)}]=\prod_{i=1}^{n}\mathbb E[e^{\lambda X_i}]. \end{align*} Using the one-variable estimate for each factor, \begin{align*} \prod_{i=1}^{n}\mathbb E[e^{\lambda X_i}]\le \prod_{i=1}^{n} e^{\lambda^2/8}=e^{n\lambda^2/8}. \end{align*} [/step] [step:Optimize the exponential Markov bound for the upper tail] Fix $t>0$ and $\lambda>0$. This is the exponential form of [Markov's Inequality](/theorems/514), applied to the non-negative random variable $e^{\lambda(S_n-\mu)}$. We spell out the pointwise domination. On the event $\{S_n-\mu\ge t\}$, monotonicity of the exponential function gives \begin{align*} e^{\lambda(S_n-\mu)}\ge e^{\lambda t}. \end{align*} Therefore \begin{align*} \mathbb 1_{\{S_n-\mu\ge t\}} \le e^{-\lambda t}e^{\lambda(S_n-\mu)}. \end{align*} Taking expectations yields \begin{align*} \mathbb P(S_n-\mu\ge t)=\mathbb E[\mathbb 1_{\{S_n-\mu\ge t\}}]. \end{align*} Taking expectations in the preceding pointwise inequality gives \begin{align*} \mathbb E[\mathbb 1_{\{S_n-\mu\ge t\}}]\le e^{-\lambda t}\mathbb E[e^{\lambda(S_n-\mu)}]. \end{align*} Using the exponential-moment bound from the previous step, \begin{align*} e^{-\lambda t}\mathbb E[e^{\lambda(S_n-\mu)}]\le \exp\left(-\lambda t+\frac{n\lambda^2}{8}\right). \end{align*} Choose \begin{align*} \lambda:=\frac{4t}{n}. \end{align*} Since $t>0$ and $n\in\mathbb N$, this choice satisfies $\lambda>0$, and substitution gives \begin{align*} -\lambda t+\frac{n\lambda^2}{8} =-\frac{4t^2}{n}+\frac{n}{8}\frac{16t^2}{n^2} =-\frac{2t^2}{n}. \end{align*} Thus \begin{align*} \mathbb P(S_n-\mu\ge t)\le e^{-2t^2/n}. \end{align*} [/step] [step:Repeat the argument for the lower tail and combine the two estimates] For each $i\in\{1,\dots,n\}$, define $\widetilde X_i:\Omega \to \mathbb R$ by \begin{align*} \widetilde X_i(\omega)=p_i-Y_i(\omega). \end{align*} Since $\widetilde X_i=-X_i$, the exponential moment estimate from the first step gives \begin{align*} \mathbb E[e^{\lambda\widetilde X_i}]=\mathbb E[e^{-\lambda X_i}]. \end{align*} Applying the first-step estimate with parameter $-\lambda$ gives \begin{align*} \mathbb E[e^{-\lambda X_i}]\le e^{\lambda^2/8}. \end{align*} This holds for every $\lambda\in\mathbb R$. Moreover, because the random variables $Y_1,\dots,Y_n$ are [independent](/page/Independence%20%28Probability%20Theory%29) and each $e^{\lambda\widetilde X_i}$ is a measurable function of $Y_i$, the random variables $e^{\lambda\widetilde X_1},\dots,e^{\lambda\widetilde X_n}$ are independent. Therefore the same factorization and exponential Markov argument applies to \begin{align*} \mu-S_n=\sum_{i=1}^{n}\widetilde X_i \end{align*} and gives, for every $t>0$, \begin{align*} \mathbb P(\mu-S_n\ge t)\le e^{-2t^2/n}. \end{align*} Since \begin{align*} \{|S_n-\mu|\ge t\} = \{S_n-\mu\ge t\}\cup\{\mu-S_n\ge t\}, \end{align*} the [finite subadditivity property of probability measures](/theorems/1106) gives \begin{align*} \mathbb P(|S_n-\mu|\ge t)\le \mathbb P(S_n-\mu\ge t)+\mathbb P(\mu-S_n\ge t). \end{align*} Using the upper-tail and lower-tail estimates, \begin{align*} \mathbb P(S_n-\mu\ge t)+\mathbb P(\mu-S_n\ge t)\le e^{-2t^2/n}+e^{-2t^2/n}=2e^{-2t^2/n}. \end{align*} [/step] [step:Substitute $t=n\varepsilon$ to obtain the sample-proportion form] Let $\varepsilon>0$ and define \begin{align*} t:=n\varepsilon. \end{align*} Then $t>0$. The inequality \begin{align*} \left|\frac{S_n}{n}-\frac{\mu}{n}\right|\ge \varepsilon \end{align*} is equivalent to \begin{align*} |S_n-\mu|\ge n\varepsilon, \end{align*} and, by the definition of $t$, this is equivalent to \begin{align*} |S_n-\mu|\ge t. \end{align*} Applying the two-sided estimate with this value of $t$ gives \begin{align*} \mathbb P\left(\left|\frac{S_n}{n}-\frac{\mu}{n}\right|\ge \varepsilon\right)=\mathbb P(|S_n-\mu|\ge n\varepsilon). \end{align*} Applying the two-sided estimate with $t=n\varepsilon$ gives \begin{align*} \mathbb P(|S_n-\mu|\ge n\varepsilon)\le 2e^{-2(n\varepsilon)^2/n}=2e^{-2n\varepsilon^2}. \end{align*} This is the stated equivalent formulation. [/step]

Prerequisites (0/3 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

Lebesgue Measure Definition Random Variable Definition Event Definition Le Cam Two-Point Method Probability & Statistics Martingale Difference Bound under the Bounded Differences Condition Probability & Statistics Joint Law Criterion for Independence Probability Theory Jensen's Inequality Probability Theory Second-Moment Criterion for Contiguity Probability & Statistics Tensorization of Hellinger Affinity Probability & Statistics Cramer's Theorem Large Deviations Gambler's Ruin Recurrence Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.