[proofplan]
We prove the two one-sided estimates by exponential tilting. For the upper tail, we use the elementary Markov estimate for the non-negative [random variable](/page/Random%20Variable) $e^{\lambda X}$ with $\lambda > 0$, then optimize the resulting exponent over positive $\lambda$. For the lower tail, the same argument is applied with $\lambda < 0$, where the exponential is decreasing, and the same stationary equation gives the optimizer. Substituting the optimizer into the Legendre-type expression gives exactly the Bernoulli relative entropy $D(a\|p)$.
[/proofplan]
[step:Compute the exponential moment of the Bernoulli variable]
Let $(\Omega,\mathcal F,\mathbb P)$ denote the probability space on which $X$ is defined. For every integrable real-valued random variable $Y: \Omega \to \mathbb R$, let $\mathbb E[Y]$ denote expectation with respect to $\mathbb P$, defined by
\begin{align*}
\mathbb E[Y] := \int_\Omega Y\,d\mathbb P.
\end{align*}
For every $\lambda \in \mathbb R$, define the moment generating function $M_X: \mathbb R \to (0,\infty)$ by
\begin{align*}
M_X(\lambda) = \mathbb E[e^{\lambda X}].
\end{align*}
Since $X$ takes the value $1$ with probability $p$ and the value $0$ with probability $1-p$, we have
\begin{align*}
M_X(\lambda)= e^{\lambda \cdot 0}\mathbb P(X=0) + e^{\lambda \cdot 1}\mathbb P(X=1).
\end{align*}
Since $\mathbb P(X=0)=1-p$ and $\mathbb P(X=1)=p$, this becomes
\begin{align*}
M_X(\lambda)=1-p+p e^\lambda.
\end{align*}
Define the logarithmic moment generating function $\Lambda_X: \mathbb R \to \mathbb R$ by
\begin{align*}
\Lambda_X(\lambda) = \log M_X(\lambda) = \log(1-p+p e^\lambda).
\end{align*}
[/step]
[step:Bound the upper tail by optimizing over positive exponential tilts]
Fix $a \in (p,1)$. Let $\lambda > 0$. Since the map $t \mapsto e^{\lambda t}$ is increasing on $\mathbb R$, the event $\{X \ge a\}$ is contained in the event $\{e^{\lambda X} \ge e^{\lambda a}\}$. Therefore $\mathbb P(X \ge a) \le \mathbb P(e^{\lambda X} \ge e^{\lambda a})$. The elementary Markov estimate proved below gives
\begin{align*}
\mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le e^{-\lambda a}\mathbb E[e^{\lambda X}].
\end{align*}
Using the definition of $\Lambda_X$, we obtain
\begin{align*}
\mathbb P(X \ge a) \le \exp\{\Lambda_X(\lambda)-\lambda a\}.
\end{align*}
The [second inequality](/theorems/2136) follows because $e^{\lambda X} \ge 0$ and
\begin{align*}
e^{\lambda a}\mathbb P(e^{\lambda X} \ge e^{\lambda a})
\le \mathbb E[e^{\lambda X}].
\end{align*}
Define $\Phi_a: \mathbb R \to \mathbb R$ by
\begin{align*}
\Phi_a(\lambda) = \lambda a - \Lambda_X(\lambda).
\end{align*}
Then
\begin{align*}
\Phi_a'(\lambda)
= a - \frac{p e^\lambda}{1-p+p e^\lambda}.
\end{align*}
The stationary equation $\Phi_a'(\lambda)=0$ is
\begin{align*}
a = \frac{p e^\lambda}{1-p+p e^\lambda},
\end{align*}
which is equivalent to
\begin{align*}
e^\lambda = \frac{a(1-p)}{p(1-a)}.
\end{align*}
Define
\begin{align*}
\lambda_a = \log\frac{a(1-p)}{p(1-a)}.
\end{align*}
Because $a>p$, we have $a(1-p)>p(1-a)$, hence $\lambda_a>0$.
Moreover,
\begin{align*}
\Phi_a''(\lambda)
= -\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2} < 0,
\end{align*}
so $\Phi_a$ is strictly concave and the stationary point $\lambda_a$ is the unique maximizer on $\mathbb R$, hence also on $(0,\infty)$.
Substituting $\lambda_a$ gives
\begin{align*}
1-p+p e^{\lambda_a}= 1-p + p\frac{a(1-p)}{p(1-a)}.
\end{align*}
Cancelling the factor $p$ in the second term and putting the terms over the common denominator $1-a$ gives
\begin{align*}
1-p+p e^{\lambda_a}=\frac{1-p}{1-a}.
\end{align*}
Therefore
\begin{align*}
\Phi_a(\lambda_a)= a\log\frac{a(1-p)}{p(1-a)} - \log\frac{1-p}{1-a}.
\end{align*}
Separating logarithms and collecting the coefficients of $\log\frac{1-p}{1-a}$ gives
\begin{align*}
\Phi_a(\lambda_a)= a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}.
\end{align*}
By the definition of $D(a\|p)$, this is
\begin{align*}
\Phi_a(\lambda_a)=D(a\|p).
\end{align*}
Using the preceding exponential estimate with $\lambda=\lambda_a$ yields
\begin{align*}
\mathbb P(X \ge a)\le \exp\{-\Phi_a(\lambda_a)\}.
\end{align*}
Since $\Phi_a(\lambda_a)=D(a\|p)$, this is
\begin{align*}
\mathbb P(X \ge a)\le \exp\{-D(a\|p)\}.
\end{align*}
[guided]
We want an upper bound for $\mathbb P(X \ge a)$ that decays exponentially in a rate function. The standard device is to replace the event involving $X$ by an event involving $e^{\lambda X}$, because exponential moments are easy to compute for a Bernoulli random variable.
Fix $a \in (p,1)$ and choose $\lambda > 0$. Since $t \mapsto e^{\lambda t}$ is increasing, whenever $X \ge a$ we also have $e^{\lambda X} \ge e^{\lambda a}$. Hence
\begin{align*}
\{X \ge a\} \subseteq \{e^{\lambda X} \ge e^{\lambda a}\}.
\end{align*}
The random variable $e^{\lambda X}: \Omega \to (0,\infty)$ is non-negative. Therefore
\begin{align*}
e^{\lambda a}\mathbb P(e^{\lambda X} \ge e^{\lambda a})
\le \mathbb E[e^{\lambda X}],
\end{align*}
because the expectation over the event $\{e^{\lambda X} \ge e^{\lambda a}\}$ already contributes at least $e^{\lambda a}$ times the probability of that event. Combining these two observations, first we get
\begin{align*}
\mathbb P(X \ge a)\le \mathbb P(e^{\lambda X} \ge e^{\lambda a}).
\end{align*}
Then the elementary Markov estimate gives
\begin{align*}
\mathbb P(e^{\lambda X} \ge e^{\lambda a})\le e^{-\lambda a}\mathbb E[e^{\lambda X}].
\end{align*}
Using $\mathbb E[e^{\lambda X}]=\exp\{\Lambda_X(\lambda)\}$, this yields
\begin{align*}
\mathbb P(X \ge a)\le \exp\{\Lambda_X(\lambda)-\lambda a\}.
\end{align*}
Now we optimize the exponent. Define $\Phi_a: \mathbb R \to \mathbb R$ by
\begin{align*}
\Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda).
\end{align*}
Since $\Lambda_X(\lambda)=\log(1-p+p e^\lambda)$, differentiation gives
\begin{align*}
\Phi_a'(\lambda)
= a-\frac{p e^\lambda}{1-p+p e^\lambda}.
\end{align*}
The optimal exponential tilt should make this derivative vanish, so we solve
\begin{align*}
a=\frac{p e^\lambda}{1-p+p e^\lambda}.
\end{align*}
Multiplying by $1-p+p e^\lambda$ and collecting the $e^\lambda$ terms gives
\begin{align*}
a(1-p)+ap e^\lambda = p e^\lambda,
\end{align*}
hence
\begin{align*}
a(1-p)=p(1-a)e^\lambda,
\end{align*}
and therefore
\begin{align*}
e^\lambda=\frac{a(1-p)}{p(1-a)}.
\end{align*}
Define
\begin{align*}
\lambda_a=\log\frac{a(1-p)}{p(1-a)}.
\end{align*}
Because $a>p$, the inequality $a(1-p)>p(1-a)$ holds, so $\lambda_a>0$. This matters because the upper-tail argument requires the exponential map $t \mapsto e^{\lambda t}$ to be increasing, which is why we restricted to $\lambda>0$.
To confirm that this stationary point really is the optimizer, compute the second derivative:
\begin{align*}
\Phi_a''(\lambda)
= -\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2}.
\end{align*}
Since $p \in (0,1)$ and $e^\lambda>0$, this quantity is strictly negative for every $\lambda \in \mathbb R$. Thus $\Phi_a$ is strictly concave, and its stationary point $\lambda_a$ is the unique maximizer.
It remains to compute the value of the maximum. Using the defining formula for $\lambda_a$,
\begin{align*}
1-p+p e^{\lambda_a}=1-p+p\frac{a(1-p)}{p(1-a)}.
\end{align*}
Cancelling $p$ in the second term gives
\begin{align*}
1-p+p e^{\lambda_a}=1-p+\frac{a(1-p)}{1-a}.
\end{align*}
Putting the terms over the common denominator $1-a$ gives
\begin{align*}
1-p+p e^{\lambda_a}=\frac{(1-p)(1-a)+a(1-p)}{1-a}.
\end{align*}
The numerator simplifies to $1-p$, so
\begin{align*}
1-p+p e^{\lambda_a}=\frac{1-p}{1-a}.
\end{align*}
Therefore
\begin{align*}
\Phi_a(\lambda_a)=a\lambda_a-\log(1-p+p e^{\lambda_a}).
\end{align*}
Using the formulas for $\lambda_a$ and $1-p+p e^{\lambda_a}$, this becomes
\begin{align*}
\Phi_a(\lambda_a)=a\log\frac{a(1-p)}{p(1-a)}-\log\frac{1-p}{1-a}.
\end{align*}
Separating logarithms gives
\begin{align*}
\Phi_a(\lambda_a)=a\log\frac{a}{p}+a\log\frac{1-p}{1-a}-\log\frac{1-p}{1-a}.
\end{align*}
Combining the last two terms gives
\begin{align*}
\Phi_a(\lambda_a)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}.
\end{align*}
By the definition of $D(a\|p)$,
\begin{align*}
\Phi_a(\lambda_a)=D(a\|p).
\end{align*}
Taking $\lambda=\lambda_a$ in the exponential estimate gives
\begin{align*}
\mathbb P(X \ge a)\le \exp\{\Lambda_X(\lambda_a)-\lambda_a a\}.
\end{align*}
Since $\Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda)$, the exponent is $-\Phi_a(\lambda_a)$, so
\begin{align*}
\mathbb P(X \ge a)\le \exp\{-\Phi_a(\lambda_a)\}.
\end{align*}
Using $\Phi_a(\lambda_a)=D(a\|p)$, we conclude
\begin{align*}
\mathbb P(X \ge a)\le \exp\{-D(a\|p)\}.
\end{align*}
This proves the upper-tail bound.
[/guided]
[/step]
[step:Bound the lower tail by optimizing over negative exponential tilts]
Fix $a \in (0,p)$. Let $\lambda < 0$. Since the map $t \mapsto e^{\lambda t}$ is decreasing on $\mathbb R$, the event $\{X \le a\}$ is contained in the event $\{e^{\lambda X} \ge e^{\lambda a}\}$. As before,
\begin{align*}
\mathbb P(X \le a)\le \mathbb P(e^{\lambda X} \ge e^{\lambda a}).
\end{align*}
Since $e^{\lambda X}: \Omega \to (0,\infty)$ is non-negative and $e^{\lambda a}>0$, the elementary Markov estimate gives
\begin{align*}
e^{\lambda a}\mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le \mathbb E[e^{\lambda X}].
\end{align*}
Dividing by $e^{\lambda a}$ gives
\begin{align*}
\mathbb P(e^{\lambda X} \ge e^{\lambda a}) \le e^{-\lambda a}\mathbb E[e^{\lambda X}].
\end{align*}
Using the definition of $\Lambda_X$, we obtain
\begin{align*}
\mathbb P(X \le a)\le \exp\{\Lambda_X(\lambda)-\lambda a\}.
\end{align*}
Equivalently,
\begin{align*}
\mathbb P(X \le a) \le \exp\{-\Phi_a(\lambda)\},
\end{align*}
where $\Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda)$.
Differentiating the already defined function $\Phi_a: \mathbb R \to \mathbb R$ gives
\begin{align*}
\Phi_a'(\lambda)=a-\frac{p e^\lambda}{1-p+p e^\lambda}.
\end{align*}
Solving $\Phi_a'(\lambda)=0$ gives
\begin{align*}
e^\lambda=\frac{a(1-p)}{p(1-a)}.
\end{align*}
Define
\begin{align*}
\lambda_a = \log\frac{a(1-p)}{p(1-a)}.
\end{align*}
Because $a<p$, we have $a(1-p)<p(1-a)$, hence $\lambda_a<0$. The second derivative formula
\begin{align*}
\Phi_a''(\lambda)
= -\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2} < 0
\end{align*}
shows that this stationary point is the unique maximizer of $\Phi_a$, in particular over $(-\infty,0)$.
Substituting the definition of $\lambda_a$ into $M_X$ gives
\begin{align*}
1-p+p e^{\lambda_a}=1-p+p\frac{a(1-p)}{p(1-a)}.
\end{align*}
Cancelling $p$ in the second term and putting the terms over the common denominator $1-a$ gives
\begin{align*}
1-p+p e^{\lambda_a}=\frac{1-p}{1-a}.
\end{align*}
Therefore
\begin{align*}
\Phi_a(\lambda_a)=a\log\frac{a(1-p)}{p(1-a)} - \log\frac{1-p}{1-a}.
\end{align*}
Separating logarithms and collecting terms gives
\begin{align*}
\Phi_a(\lambda_a)=a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}.
\end{align*}
By the definition of $D(a\|p)$, this is
\begin{align*}
\Phi_a(\lambda_a)=D(a\|p).
\end{align*}
Using the lower-tail exponential estimate with $\lambda=\lambda_a$ gives
\begin{align*}
\mathbb P(X \le a)
\le \exp\{-D(a\|p)\}.
\end{align*}
[guided]
Fix $a \in (0,p)$. For the lower tail we choose a negative tilt, so let $\lambda<0$. The reason for changing the sign is monotonicity: the map $t\mapsto e^{\lambda t}$ is decreasing on $\mathbb R$, so $X\le a$ implies $e^{\lambda X}\ge e^{\lambda a}$. Hence
\begin{align*}
\{X\le a\}\subseteq \{e^{\lambda X}\ge e^{\lambda a}\}.
\end{align*}
The random variable $e^{\lambda X}:\Omega\to(0,\infty)$ is non-negative, and the threshold $e^{\lambda a}$ is positive. Therefore the elementary Markov estimate gives
\begin{align*}
e^{\lambda a}\mathbb P(e^{\lambda X}\ge e^{\lambda a})\le \mathbb E[e^{\lambda X}].
\end{align*}
Dividing by $e^{\lambda a}$ and using the definition of $\Lambda_X$ gives
\begin{align*}
\mathbb P(X\le a)\le \exp\{\Lambda_X(\lambda)-\lambda a\}.
\end{align*}
Equivalently, for $\Phi_a(\lambda)=\lambda a-\Lambda_X(\lambda)$,
\begin{align*}
\mathbb P(X\le a)\le \exp\{-\Phi_a(\lambda)\}.
\end{align*}
Now optimize over negative $\lambda$. Since $\Lambda_X(\lambda)=\log(1-p+p e^\lambda)$, differentiation gives
\begin{align*}
\Phi_a'(\lambda)=a-\frac{p e^\lambda}{1-p+p e^\lambda}.
\end{align*}
The stationary equation $\Phi_a'(\lambda)=0$ is equivalent to
\begin{align*}
e^\lambda=\frac{a(1-p)}{p(1-a)}.
\end{align*}
Define
\begin{align*}
\lambda_a=\log\frac{a(1-p)}{p(1-a)}.
\end{align*}
Because $a<p$, we have $a(1-p)<p(1-a)$, so $\lambda_a<0$; thus the optimizer lies in the range of tilts allowed for the lower-tail argument. The second derivative is
\begin{align*}
\Phi_a''(\lambda)=-\frac{p(1-p)e^\lambda}{(1-p+p e^\lambda)^2}<0.
\end{align*}
Since $p\in(0,1)$ and $e^\lambda>0$, this proves that $\Phi_a$ is strictly concave and that $\lambda_a$ is its unique maximizer, in particular over $(-\infty,0)$.
It remains to evaluate the exponent at the maximizing tilt. Substituting $e^{\lambda_a}=\frac{a(1-p)}{p(1-a)}$ gives
\begin{align*}
1-p+p e^{\lambda_a}=1-p+p\frac{a(1-p)}{p(1-a)}.
\end{align*}
After cancelling $p$ in the second term and using the common denominator $1-a$, we obtain
\begin{align*}
1-p+p e^{\lambda_a}=\frac{1-p}{1-a}.
\end{align*}
Therefore
\begin{align*}
\Phi_a(\lambda_a)=a\log\frac{a(1-p)}{p(1-a)}-\log\frac{1-p}{1-a}.
\end{align*}
Separating logarithms gives
\begin{align*}
\Phi_a(\lambda_a)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}.
\end{align*}
By the definition of $D(a\|p)$,
\begin{align*}
\Phi_a(\lambda_a)=D(a\|p).
\end{align*}
Taking $\lambda=\lambda_a$ in the lower-tail exponential estimate gives
\begin{align*}
\mathbb P(X\le a)\le \exp\{-D(a\|p)\}.
\end{align*}
This proves the lower-tail bound.
[/guided]
[/step]
[step:Conclude both Bernoulli Chernoff estimates]
The previous two steps prove that for every $a \in (p,1)$,
\begin{align*}
\mathbb P(X \ge a) \le \exp\{-D(a\|p)\},
\end{align*}
and for every $a \in (0,p)$,
\begin{align*}
\mathbb P(X \le a) \le \exp\{-D(a\|p)\}.
\end{align*}
These are exactly the asserted upper-tail and lower-tail Bernoulli Chernoff bounds.
[/step]