Bernoulli Chernoff Bound — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We prove the two one-sided estimates by exponential tilting. For the upper tail, we use the elementary Markov estimate for the non-negative [random variable](/page/Random%20Variable) $e^{\lambda X}$ with $\lambda > 0$, then optimize the resulting exponent over positive $\lambda$. For the lower tail, the same argument is applied with $\lambda < 0$, where the exponential is decreasing, and the same stationary equation gives the optimizer. Substituting the optimizer into the Legendre-type expression gives exactly the Bernoulli relative entropy $D(a\|p)$. [/proofplan] [step:Compute the exponential moment of the Bernoulli variable] Let $(\Omega,\mathcal F,\mathbb P)$ denote the probability space on which $X$ is defined. For every integrable real-valued random variable $Y: \Omega \to \mathbb R$, let $\mathbb E[Y]$ denote expectation with respect to $\mathbb P$, defined by \begin{align*} \mathbb E[Y] := \int_\Omega Y\,d\mathbb P. \end{align*} For every $\lambda \in \mathbb R$, define the moment generating function $M_X: \mathbb R \to (0,\infty)$ by \begin{align*} M_X(\lambda) = \mathbb E[e^{\lambda X}]. \end{align*} Since $X$ takes the value $1$ with probability $p$ and the value $0$ with probability $1-p$, we have \begin{align*} M_X(\lambda)= e^{\lambda \cdot 0}\mathbb P(X=0) + e^{\lambda \cdot 1}\mathbb P(X=1). \end{align*} Since $\mathbb P(X=0)=1-p$ and $\mathbb P(X=1)=p$, this becomes \begin{align*} M_X(\lambda)=1-p+p e^\lambda. \end{align*} Define the logarithmic moment generating function $\Lambda_X: \mathbb R \to \mathbb R$ by \begin{align*} \Lambda_X(\lambda) = \log M_X(\lambda) = \log(1-p+p e^\lambda). \end{align*} [/step] [step:Bound the upper tail by optimizing over positive exponential tilts] Fix $a \in (p,1)$. Let $\lambda > 0$. Since the map $t \mapsto e^{\lambda t}$ is increasing on $\mathbb R$, the event $\{X \ge a\}$ is contained in the event $\{e^{\lambda X} \ge e^{\lambda a}\}$. Therefore $\mathbb P(X \ge a) \le \mathbb P(e^{\lambda X} \ge e^{\lambda a})$. The elementary Markov estimate proved below gives \begin{align*} \mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le e^{-\lambda a}\mathbb E[e^{\lambda X}]. \end{align*} Using the definition of $\Lambda_X$, we obtain \begin{align*} \mathbb P(X \ge a) \le \exp\{\Lambda_X(\lambda)-\lambda a\}. \end{align*} The [second inequality](/theorems/2136) follows because $e^{\lambda X} \ge 0$ and \begin{align*} e^{\lambda a}\mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le \mathbb E[e^{\lambda X}]. \end{align*} Define $\Phi_a: \mathbb R \to \mathbb R$ by \begin{align*} \Phi_a(\lambda) = \lambda a - \Lambda_X(\lambda). \end{align*} Then \begin{align*} \Phi_a'(\lambda) = a - \frac{p e^\lambda}{1-p+p e^\lambda}. \end{align*} The stationary equation $\Phi_a'(\lambda)=0$ is \begin{align*} a = \frac{p e^\lambda}{1-p+p e^\lambda}, \end{align*} which is equivalent to \begin{align*} e^\lambda = \frac{a(1-p)}{p(1-a)}. \end{align*} Define \begin{align*} \lambda_a = \log\frac{a(1-p)}{p(1-a)}. \end{align*} Because $a>p$, we have $a(1-p)>p(1-a)$, hence $\lambda_a>0$. Moreover, \begin{align*} \Phi_a''(\lambda) = -\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2} < 0, \end{align*} so $\Phi_a$ is strictly concave and the stationary point $\lambda_a$ is the unique maximizer on $\mathbb R$, hence also on $(0,\infty)$. Substituting $\lambda_a$ gives \begin{align*} 1-p+p e^{\lambda_a}= 1-p + p\frac{a(1-p)}{p(1-a)}. \end{align*} Cancelling the factor $p$ in the second term and putting the terms over the common denominator $1-a$ gives \begin{align*} 1-p+p e^{\lambda_a}=\frac{1-p}{1-a}. \end{align*} Therefore \begin{align*} \Phi_a(\lambda_a)= a\log\frac{a(1-p)}{p(1-a)} - \log\frac{1-p}{1-a}. \end{align*} Separating logarithms and collecting the coefficients of $\log\frac{1-p}{1-a}$ gives \begin{align*} \Phi_a(\lambda_a)= a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}. \end{align*} By the definition of $D(a\|p)$, this is \begin{align*} \Phi_a(\lambda_a)=D(a\|p). \end{align*} Using the preceding exponential estimate with $\lambda=\lambda_a$ yields \begin{align*} \mathbb P(X \ge a)\le \exp\{-\Phi_a(\lambda_a)\}. \end{align*} Since $\Phi_a(\lambda_a)=D(a\|p)$, this is \begin{align*} \mathbb P(X \ge a)\le \exp\{-D(a\|p)\}. \end{align*} [guided] We want an upper bound for $\mathbb P(X \ge a)$ that decays exponentially in a rate function. The standard device is to replace the event involving $X$ by an event involving $e^{\lambda X}$, because exponential moments are easy to compute for a Bernoulli random variable. Fix $a \in (p,1)$ and choose $\lambda > 0$. Since $t \mapsto e^{\lambda t}$ is increasing, whenever $X \ge a$ we also have $e^{\lambda X} \ge e^{\lambda a}$. Hence \begin{align*} \{X \ge a\} \subseteq \{e^{\lambda X} \ge e^{\lambda a}\}. \end{align*} The random variable $e^{\lambda X}: \Omega \to (0,\infty)$ is non-negative. Therefore \begin{align*} e^{\lambda a}\mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le \mathbb E[e^{\lambda X}], \end{align*} because the expectation over the event $\{e^{\lambda X} \ge e^{\lambda a}\}$ already contributes at least $e^{\lambda a}$ times the probability of that event. Combining these two observations, first we get \begin{align*} \mathbb P(X \ge a)\le \mathbb P(e^{\lambda X} \ge e^{\lambda a}). \end{align*} Then the elementary Markov estimate gives \begin{align*} \mathbb P(e^{\lambda X} \ge e^{\lambda a})\le e^{-\lambda a}\mathbb E[e^{\lambda X}]. \end{align*} Using $\mathbb E[e^{\lambda X}]=\exp\{\Lambda_X(\lambda)\}$, this yields \begin{align*} \mathbb P(X \ge a)\le \exp\{\Lambda_X(\lambda)-\lambda a\}. \end{align*} Now we optimize the exponent. Define $\Phi_a: \mathbb R \to \mathbb R$ by \begin{align*} \Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda). \end{align*} Since $\Lambda_X(\lambda)=\log(1-p+p e^\lambda)$, differentiation gives \begin{align*} \Phi_a'(\lambda) = a-\frac{p e^\lambda}{1-p+p e^\lambda}. \end{align*} The optimal exponential tilt should make this derivative vanish, so we solve \begin{align*} a=\frac{p e^\lambda}{1-p+p e^\lambda}. \end{align*} Multiplying by $1-p+p e^\lambda$ and collecting the $e^\lambda$ terms gives \begin{align*} a(1-p)+ap e^\lambda = p e^\lambda, \end{align*} hence \begin{align*} a(1-p)=p(1-a)e^\lambda, \end{align*} and therefore \begin{align*} e^\lambda=\frac{a(1-p)}{p(1-a)}. \end{align*} Define \begin{align*} \lambda_a=\log\frac{a(1-p)}{p(1-a)}. \end{align*} Because $a>p$, the inequality $a(1-p)>p(1-a)$ holds, so $\lambda_a>0$. This matters because the upper-tail argument requires the exponential map $t \mapsto e^{\lambda t}$ to be increasing, which is why we restricted to $\lambda>0$. To confirm that this stationary point really is the optimizer, compute the second derivative: \begin{align*} \Phi_a''(\lambda) = -\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2}. \end{align*} Since $p \in (0,1)$ and $e^\lambda>0$, this quantity is strictly negative for every $\lambda \in \mathbb R$. Thus $\Phi_a$ is strictly concave, and its stationary point $\lambda_a$ is the unique maximizer. It remains to compute the value of the maximum. Using the defining formula for $\lambda_a$, \begin{align*} 1-p+p e^{\lambda_a}=1-p+p\frac{a(1-p)}{p(1-a)}. \end{align*} Cancelling $p$ in the second term gives \begin{align*} 1-p+p e^{\lambda_a}=1-p+\frac{a(1-p)}{1-a}. \end{align*} Putting the terms over the common denominator $1-a$ gives \begin{align*} 1-p+p e^{\lambda_a}=\frac{(1-p)(1-a)+a(1-p)}{1-a}. \end{align*} The numerator simplifies to $1-p$, so \begin{align*} 1-p+p e^{\lambda_a}=\frac{1-p}{1-a}. \end{align*} Therefore \begin{align*} \Phi_a(\lambda_a)=a\lambda_a-\log(1-p+p e^{\lambda_a}). \end{align*} Using the formulas for $\lambda_a$ and $1-p+p e^{\lambda_a}$, this becomes \begin{align*} \Phi_a(\lambda_a)=a\log\frac{a(1-p)}{p(1-a)}-\log\frac{1-p}{1-a}. \end{align*} Separating logarithms gives \begin{align*} \Phi_a(\lambda_a)=a\log\frac{a}{p}+a\log\frac{1-p}{1-a}-\log\frac{1-p}{1-a}. \end{align*} Combining the last two terms gives \begin{align*} \Phi_a(\lambda_a)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}. \end{align*} By the definition of $D(a\|p)$, \begin{align*} \Phi_a(\lambda_a)=D(a\|p). \end{align*} Taking $\lambda=\lambda_a$ in the exponential estimate gives \begin{align*} \mathbb P(X \ge a)\le \exp\{\Lambda_X(\lambda_a)-\lambda_a a\}. \end{align*} Since $\Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda)$, the exponent is $-\Phi_a(\lambda_a)$, so \begin{align*} \mathbb P(X \ge a)\le \exp\{-\Phi_a(\lambda_a)\}. \end{align*} Using $\Phi_a(\lambda_a)=D(a\|p)$, we conclude \begin{align*} \mathbb P(X \ge a)\le \exp\{-D(a\|p)\}. \end{align*} This proves the upper-tail bound. [/guided] [/step] [step:Bound the lower tail by optimizing over negative exponential tilts] Fix $a \in (0,p)$. Let $\lambda < 0$. Since the map $t \mapsto e^{\lambda t}$ is decreasing on $\mathbb R$, the event $\{X \le a\}$ is contained in the event $\{e^{\lambda X} \ge e^{\lambda a}\}$. As before, \begin{align*} \mathbb P(X \le a)\le \mathbb P(e^{\lambda X} \ge e^{\lambda a}). \end{align*} Since $e^{\lambda X}: \Omega \to (0,\infty)$ is non-negative and $e^{\lambda a}>0$, the elementary Markov estimate gives \begin{align*} e^{\lambda a}\mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le \mathbb E[e^{\lambda X}]. \end{align*} Dividing by $e^{\lambda a}$ gives \begin{align*} \mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le e^{-\lambda a}\mathbb E[e^{\lambda X}]. \end{align*} Using the definition of $\Lambda_X$, we obtain \begin{align*} \mathbb P(X \le a)\le \exp\{\Lambda_X(\lambda)-\lambda a\}. \end{align*} Equivalently, \begin{align*} \mathbb P(X \le a) \le \exp\{-\Phi_a(\lambda)\}, \end{align*} where $\Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda)$. Differentiating the already defined function $\Phi_a: \mathbb R \to \mathbb R$ gives \begin{align*} \Phi_a'(\lambda)=a-\frac{p e^\lambda}{1-p+p e^\lambda}. \end{align*} Solving $\Phi_a'(\lambda)=0$ gives \begin{align*} e^\lambda=\frac{a(1-p)}{p(1-a)}. \end{align*} Define \begin{align*} \lambda_a = \log\frac{a(1-p)}{p(1-a)}. \end{align*} Because $a<p$, we have $a(1-p)<p(1-a)$, hence $\lambda_a<0$. The second derivative formula \begin{align*} \Phi_a''(\lambda) = -\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2} < 0 \end{align*} shows that this stationary point is the unique maximizer of $\Phi_a$, in particular over $(-\infty,0)$. Substituting the definition of $\lambda_a$ into $M_X$ gives \begin{align*} 1-p+p e^{\lambda_a}=1-p+p\frac{a(1-p)}{p(1-a)}. \end{align*} Cancelling $p$ in the second term and putting the terms over the common denominator $1-a$ gives \begin{align*} 1-p+p e^{\lambda_a}=\frac{1-p}{1-a}. \end{align*} Therefore \begin{align*} \Phi_a(\lambda_a)=a\log\frac{a(1-p)}{p(1-a)} - \log\frac{1-p}{1-a}. \end{align*} Separating logarithms and collecting terms gives \begin{align*} \Phi_a(\lambda_a)=a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}. \end{align*} By the definition of $D(a\|p)$, this is \begin{align*} \Phi_a(\lambda_a)=D(a\|p). \end{align*} Using the lower-tail exponential estimate with $\lambda=\lambda_a$ gives \begin{align*} \mathbb P(X \le a) \le \exp\{-D(a\|p)\}. \end{align*} [guided] Fix $a \in (0,p)$. For the lower tail we choose a negative tilt, so let $\lambda<0$. The reason for changing the sign is monotonicity: the map $t\mapsto e^{\lambda t}$ is decreasing on $\mathbb R$, so $X\le a$ implies $e^{\lambda X}\ge e^{\lambda a}$. Hence \begin{align*} \{X\le a\}\subseteq \{e^{\lambda X}\ge e^{\lambda a}\}. \end{align*} The random variable $e^{\lambda X}:\Omega\to(0,\infty)$ is non-negative, and the threshold $e^{\lambda a}$ is positive. Therefore the elementary Markov estimate gives \begin{align*} e^{\lambda a}\mathbb P(e^{\lambda X}\ge e^{\lambda a})\le \mathbb E[e^{\lambda X}]. \end{align*} Dividing by $e^{\lambda a}$ and using the definition of $\Lambda_X$ gives \begin{align*} \mathbb P(X\le a)\le \exp\{\Lambda_X(\lambda)-\lambda a\}. \end{align*} Equivalently, for $\Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda)$, \begin{align*} \mathbb P(X\le a)\le \exp\{-\Phi_a(\lambda)\}. \end{align*} Now optimize over negative $\lambda$. Since $\Lambda_X(\lambda)=\log(1-p+p e^\lambda)$, differentiation gives \begin{align*} \Phi_a'(\lambda)=a-\frac{p e^\lambda}{1-p+p e^\lambda}. \end{align*} The stationary equation $\Phi_a'(\lambda)=0$ is equivalent to \begin{align*} e^\lambda=\frac{a(1-p)}{p(1-a)}. \end{align*} Define \begin{align*} \lambda_a=\log\frac{a(1-p)}{p(1-a)}. \end{align*} Because $a<p$, we have $a(1-p)<p(1-a)$, so $\lambda_a<0$; thus the optimizer lies in the range of tilts allowed for the lower-tail argument. The second derivative is \begin{align*} \Phi_a''(\lambda)=-\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2}<0. \end{align*} Since $p\in(0,1)$ and $e^\lambda>0$, this proves that $\Phi_a$ is strictly concave and that $\lambda_a$ is its unique maximizer, in particular over $(-\infty,0)$. It remains to evaluate the exponent at the maximizing tilt. Substituting $e^{\lambda_a}=\frac{a(1-p)}{p(1-a)}$ gives \begin{align*} 1-p+p e^{\lambda_a}=1-p+p\frac{a(1-p)}{p(1-a)}. \end{align*} After cancelling $p$ in the second term and using the common denominator $1-a$, we obtain \begin{align*} 1-p+p e^{\lambda_a}=\frac{1-p}{1-a}. \end{align*} Therefore \begin{align*} \Phi_a(\lambda_a)=a\log\frac{a(1-p)}{p(1-a)}-\log\frac{1-p}{1-a}. \end{align*} Separating logarithms gives \begin{align*} \Phi_a(\lambda_a)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}. \end{align*} By the definition of $D(a\|p)$, \begin{align*} \Phi_a(\lambda_a)=D(a\|p). \end{align*} Taking $\lambda=\lambda_a$ in the lower-tail exponential estimate gives \begin{align*} \mathbb P(X\le a)\le \exp\{-D(a\|p)\}. \end{align*} This proves the lower-tail bound. [/guided] [/step] [step:Conclude both Bernoulli Chernoff estimates] The previous two steps prove that for every $a \in (p,1)$, \begin{align*} \mathbb P(X \ge a) \le \exp\{-D(a\|p)\}, \end{align*} and for every $a \in (0,p)$, \begin{align*} \mathbb P(X \le a) \le \exp\{-D(a\|p)\}. \end{align*} These are exactly the asserted upper-tail and lower-tail Bernoulli Chernoff bounds. [/step]

Prerequisites (0/3 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.

Bernoulli Chernoff Bound (Theorem # 6045)

Discussion

Proof

Prerequisites (0/3 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Bernoulli Chernoff Bound (Theorem # 6045)

Discussion

Proof

Prerequisites (0/3 completed)

Prerequisites Graph

Explore Further