[proofplan]
We prove both tail estimates by applying the exponential moment method to the binomial [random variable](/page/Random%20Variable) $S_n$. First we compute the moment generating function of $S_n$ directly from its binomial probability mass function, using the [Binomial Theorem](/theorems/750). Then we use the exponential form of [Markov's Inequality](/theorems/514) for $e^{\lambda S_n}$, choosing $\lambda>0$ for the upper tail and $\lambda<0$ for the lower tail. The remaining optimization is one-dimensional and gives exactly the Bernoulli relative entropy $D(a \mid p)$.
[/proofplan]
[step:Compute the exponential moment of the binomial random variable]
For each $\lambda \in \mathbb{R}$, define the exponential moment function $M: \mathbb{R} \to (0,\infty)$ by $M(\lambda) = \mathbb{E}[e^{\lambda S_n}]$.
Since $S_n \sim \operatorname{Bin}(n,p)$, for each $k \in \{0,1,\dots,n\}$ the binomial probability mass function gives
\begin{align*}
\mathbb{P}(S_n=k)=\binom{n}{k}p^k(1-p)^{n-k}.
\end{align*}
Therefore, by the definition of expectation for a finite-valued random variable and the [Binomial Theorem](/theorems/750),
\begin{align*}
M(\lambda)=\sum_{k=0}^{n} e^{\lambda k}\mathbb{P}(S_n=k)=\sum_{k=0}^{n} \binom{n}{k}(pe^\lambda)^k(1-p)^{n-k}=(1-p+pe^\lambda)^n.
\end{align*}
Define also the logarithmic Bernoulli moment function $\Lambda: \mathbb{R} \to \mathbb{R}$ by $\Lambda(\lambda)=\log(1-p+pe^\lambda)$.
Then
\begin{align*}
\log M(\lambda)=n\Lambda(\lambda).
\end{align*}
[/step]
[step:Bound the upper tail by an exponential moment]
Assume $a>p$. Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space on which the random variable $S_n: \Omega \to \{0,1,\dots,n\}$ is defined. Let $\lambda \in (0,\infty)$. Define the non-negative random variable $Y_\lambda: \Omega \to (0,\infty)$ by $Y_\lambda(\omega)=e^{\lambda S_n(\omega)}$.
The following estimate is the exponential form of [Markov's Inequality](/theorems/514), with the verification written out. On the event $\{S_n\ge na\}$, since $\lambda>0$, we have $Y_\lambda\ge e^{\lambda na}$. Hence
\begin{align*}
\mathbb{E}[Y_\lambda] \ge \int_{\{S_n\ge na\}} Y_\lambda\,d\mathbb{P}(\omega) \ge e^{\lambda na}\mathbb{P}(S_n\ge na).
\end{align*}
Using the exponential moment computed above,
\begin{align*}
\mathbb{P}(S_n\ge na) \le e^{-\lambda na}\mathbb{E}[e^{\lambda S_n}] = \exp\{n(\Lambda(\lambda)-\lambda a)\}.
\end{align*}
Thus
\begin{align*}
\mathbb{P}(S_n\ge na)
\le \exp\left\{-n\sup_{\lambda>0}(\lambda a-\Lambda(\lambda))\right\}.
\end{align*}
[guided]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space on which the random variable $S_n: \Omega \to \{0,1,\dots,n\}$ is defined. The upper-tail event $\{S_n\ge na\}$ is controlled by exponentiating $S_n$ with a positive parameter; this is the standard exponential form of [Markov's Inequality](/theorems/514). The reason for taking $\lambda>0$ is monotonicity: if $S_n(\omega)\ge na$, then
\begin{align*}
e^{\lambda S_n(\omega)}\ge e^{\lambda na}.
\end{align*}
Define $Y_\lambda: \Omega \to (0,\infty)$ by $Y_\lambda(\omega)=e^{\lambda S_n(\omega)}$.
This is a non-negative random variable. Integrating $Y_\lambda$ over all of $\Omega$ and then restricting the domain of integration to the event $\{S_n\ge na\}$ gives
\begin{align*}
\mathbb{E}[Y_\lambda]=\int_\Omega Y_\lambda\,d\mathbb{P}(\omega)\ge \int_{\{S_n\ge na\}}Y_\lambda\,d\mathbb{P}(\omega).
\end{align*}
On this restricted event, $Y_\lambda\ge e^{\lambda na}$, so
\begin{align*}
\int_{\{S_n\ge na\}}Y_\lambda\,d\mathbb{P}(\omega) \ge \int_{\{S_n\ge na\}}e^{\lambda na}\,d\mathbb{P}(\omega) = e^{\lambda na}\mathbb{P}(S_n\ge na).
\end{align*}
Combining the two inequalities gives
\begin{align*}
\mathbb{P}(S_n\ge na)\le e^{-\lambda na}\mathbb{E}[e^{\lambda S_n}].
\end{align*}
We now compute the exponential moment inside this guided argument so the step is self-contained. Since $S_n \sim \operatorname{Bin}(n,p)$, for each $k \in \{0,1,\dots,n\}$,
\begin{align*}
\mathbb{P}(S_n=k)=\binom{n}{k}p^k(1-p)^{n-k}.
\end{align*}
Using the definition of expectation for a finite-valued random variable and the [Binomial Theorem](/theorems/750),
\begin{align*}
\mathbb{E}[e^{\lambda S_n}]
=\sum_{k=0}^{n}e^{\lambda k}\mathbb{P}(S_n=k)
=\sum_{k=0}^{n}\binom{n}{k}(pe^\lambda)^k(1-p)^{n-k}
=(1-p+pe^\lambda)^n
=\exp\{n\Lambda(\lambda)\}.
\end{align*}
Therefore
\begin{align*}
\mathbb{P}(S_n\ge na) \le \exp\{n(\Lambda(\lambda)-\lambda a)\}.
\end{align*}
Since this estimate is valid for every $\lambda>0$, we may optimize over all such $\lambda$:
\begin{align*}
\mathbb{P}(S_n\ge na)
\le \exp\left\{-n\sup_{\lambda>0}(\lambda a-\Lambda(\lambda))\right\}.
\end{align*}
[/guided]
[/step]
[step:Optimize the exponential rate for the upper tail]
Define $G_a: \mathbb{R} \to \mathbb{R}$ by $G_a(\lambda)=\lambda a-\Lambda(\lambda)$.
The derivative is
\begin{align*}
G_a'(\lambda)
=a-\frac{pe^\lambda}{1-p+pe^\lambda}.
\end{align*}
The unique critical point is the real number $\lambda_a$ defined by
\begin{align*}
\lambda_a=\log\frac{a(1-p)}{p(1-a)}.
\end{align*}
Since $a>p$, we have $\lambda_a>0$. Also,
\begin{align*}
G_a''(\lambda)
=-\frac{pe^\lambda(1-p)}{(1-p+pe^\lambda)^2}<0,
\end{align*}
so $G_a$ is strictly concave and $\lambda_a$ is the global maximizer. Substituting $e^{\lambda_a}=\frac{a(1-p)}{p(1-a)}$ gives
\begin{align*}
1-p+pe^{\lambda_a}=1-p+\frac{a(1-p)}{1-a}=\frac{1-p}{1-a}.
\end{align*}
Therefore
\begin{align*}
G_a(\lambda_a)=a\log\frac{a(1-p)}{p(1-a)}-\log\frac{1-p}{1-a}=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}=D(a \mid p).
\end{align*}
Hence
\begin{align*}
\mathbb{P}(S_n\ge na)\le \exp\{-nD(a \mid p)\}.
\end{align*}
[/step]
[step:Bound and optimize the lower tail with a negative exponential parameter]
Assume $a<p$. Let $\lambda \in (-\infty,0)$. Since $\lambda<0$, on the event $\{S_n\le na\}$ we have $e^{\lambda S_n}\ge e^{\lambda na}$. Applying the same exponential form of [Markov's Inequality](/theorems/514) to the non-negative random variable $Y_\lambda: \Omega \to (0,\infty)$, defined by $Y_\lambda(\omega)=e^{\lambda S_n(\omega)}$, gives
\begin{align*}
\mathbb{P}(S_n\le na) \le e^{-\lambda na}\mathbb{E}[e^{\lambda S_n}] = \exp\{n(\Lambda(\lambda)-\lambda a)\}.
\end{align*}
Thus
\begin{align*}
\mathbb{P}(S_n\le na)
\le \exp\left\{-n\sup_{\lambda<0}G_a(\lambda)\right\}.
\end{align*}
The same derivative computation gives the unique maximizer
\begin{align*}
\lambda_a=\log\frac{a(1-p)}{p(1-a)}.
\end{align*}
Since $a<p$, we have $\lambda_a<0$, so this maximizer lies in the allowed range $\lambda<0$. The same substitution as above yields
\begin{align*}
G_a(\lambda_a)=D(a \mid p).
\end{align*}
Consequently
\begin{align*}
\mathbb{P}(S_n\le na)\le \exp\{-nD(a \mid p)\}.
\end{align*}
This proves both asserted binomial Chernoff bounds.
[/step]