[proofplan]
We prove the upper-tail estimate by the exponential moment method. First we establish a sharp moment-generating bound for each centered Bernoulli variable, using a one-variable calculus estimate. Independence then factors the exponential moment of $S_n-\mu$, and optimizing over the exponential parameter gives the bound $e^{-2t^2/n}$. Applying the same argument to $\mu-S_n$ gives the lower tail, and the two-sided and sample-proportion estimates follow by a union bound and the substitution $t=n\varepsilon$.
[/proofplan]
[step:Bound the centered exponential moment of one Bernoulli variable]
Fix $i \in \{1,\dots,n\}$. Since $Y_i\sim \operatorname{Ber}(p_i)$, the [random variable](/page/Random%20Variable) $Y_i$ is a Bernoulli random variable with parameter $p_i$. On the probability space $(\Omega,\mathcal F,\mathbb P)$ from the statement, define the centered Bernoulli random variable $X_i:\Omega \to \mathbb R$ by
\begin{align*}
X_i(\omega)=Y_i(\omega)-p_i.
\end{align*}
For every $\lambda \in \mathbb R$, we claim that
\begin{align*}
\mathbb E[e^{\lambda X_i}] \le e^{\lambda^2/8}.
\end{align*}
If $p_i \in \{0,1\}$, then $Y_i=p_i$ almost surely, so $X_i=0$ almost surely and $\mathbb E[e^{\lambda X_i}]=1\le e^{\lambda^2/8}$. Assume therefore that $0<p_i<1$. Define the function $h_i:\mathbb R \to \mathbb R$ by
\begin{align*}
h_i(u)=\log(1-p_i+p_i e^u)-p_i u.
\end{align*}
Since $Y_i$ takes the value $1$ with probability $p_i$ and the value $0$ with probability $1-p_i$, the moment-generating function of $X_i$ is
\begin{align*}
\mathbb E[e^{\lambda X_i}]=p_i e^{\lambda(1-p_i)}+(1-p_i)e^{-\lambda p_i}.
\end{align*}
Also,
\begin{align*}
p_i e^{\lambda(1-p_i)}+(1-p_i)e^{-\lambda p_i}=e^{\log(1-p_i+p_i e^\lambda)-p_i\lambda}=e^{h_i(\lambda)}.
\end{align*}
We compute
\begin{align*}
h_i'(u)=\frac{p_i e^u}{1-p_i+p_i e^u}-p_i.
\end{align*}
Also,
\begin{align*}
h_i''(u)=\frac{p_i(1-p_i)e^u}{(1-p_i+p_i e^u)^2}.
\end{align*}
Define $q_i:\mathbb R \to (0,1)$ by
\begin{align*}
q_i(u)=\frac{p_i e^u}{1-p_i+p_i e^u}.
\end{align*}
Then
\begin{align*}
h_i''(u)=q_i(u)(1-q_i(u))\le \frac14,
\end{align*}
because $q(1-q)\le 1/4$ for every $q\in[0,1]$. Also $h_i(0)=0$ and $h_i'(0)=0$. Let $\mathcal L^1$ denote one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $\mathbb R$. Since $h_i$ is twice continuously differentiable on $\mathbb R$, [Taylor's Theorem with Integral Remainder](/theorems/189) gives, for $\lambda\ge 0$,
\begin{align*}
h_i(\lambda)=\int_0^\lambda (\lambda-u)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_0^\lambda (\lambda-u)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}.
\end{align*}
For $\lambda<0$, the same formula is rewritten over the non-oriented interval $[\lambda,0]$ as
\begin{align*}
h_i(\lambda)=\int_\lambda^0 (u-\lambda)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_\lambda^0 (u-\lambda)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}.
\end{align*} Hence
\begin{align*}
\mathbb E[e^{\lambda X_i}]=e^{h_i(\lambda)}\le e^{\lambda^2/8}.
\end{align*}
[guided]
The goal of this step is to replace each Bernoulli random variable by a uniform exponential estimate that does not depend on its parameter $p_i$. Work on the probability space $(\Omega,\mathcal F,\mathbb P)$ from the statement. Fix $i \in \{1,\dots,n\}$ and define $X_i:\Omega \to \mathbb R$ by
\begin{align*}
X_i(\omega)=Y_i(\omega)-p_i.
\end{align*}
This is the centered version of $Y_i$, since $\mathbb E[X_i]=0$.
We prove that for every $\lambda\in\mathbb R$,
\begin{align*}
\mathbb E[e^{\lambda X_i}]\le e^{\lambda^2/8}.
\end{align*}
If $p_i=0$ or $p_i=1$, then $Y_i=p_i$ almost surely, so $X_i=0$ almost surely. Therefore $\mathbb E[e^{\lambda X_i}]=1$, and the estimate follows from $1\le e^{\lambda^2/8}$.
Now assume $0<p_i<1$. The exponential moment can be computed from the two possible values of $Y_i$:
\begin{align*}
\mathbb E[e^{\lambda X_i}]=\mathbb E[e^{\lambda(Y_i-p_i)}].
\end{align*}
Since $Y_i$ takes only the values $0$ and $1$,
\begin{align*}
\mathbb E[e^{\lambda(Y_i-p_i)}]=p_i e^{\lambda(1-p_i)}+(1-p_i)e^{-\lambda p_i}.
\end{align*}
We want to show that this is at most $e^{\lambda^2/8}$. Taking logarithms is useful because products of exponential moments will later become sums. Define the function $h_i:\mathbb R \to \mathbb R$ by
\begin{align*}
h_i(u)=\log(1-p_i+p_i e^u)-p_i u.
\end{align*}
Then
\begin{align*}
\mathbb E[e^{\lambda X_i}]
=e^{h_i(\lambda)}.
\end{align*}
We estimate $h_i$ by bounding its second derivative. Differentiating gives
\begin{align*}
h_i'(u)=\frac{p_i e^u}{1-p_i+p_i e^u}-p_i.
\end{align*}
Differentiating once more gives
\begin{align*}
h_i''(u)=\frac{p_i(1-p_i)e^u}{(1-p_i+p_i e^u)^2}.
\end{align*}
Define $q_i:\mathbb R \to (0,1)$ by
\begin{align*}
q_i(u)=\frac{p_i e^u}{1-p_i+p_i e^u}.
\end{align*}
With this notation,
\begin{align*}
h_i''(u)=q_i(u)(1-q_i(u)).
\end{align*}
For every $q\in[0,1]$,
\begin{align*}
q(1-q)\le \frac14,
\end{align*}
so $h_i''(u)\le 1/4$ for every $u\in\mathbb R$. Since $h_i(0)=0$ and $h_i'(0)=0$, and since $\mathcal L^1$ denotes one-dimensional Lebesgue measure on $\mathbb R$, the function $h_i$ is twice continuously differentiable on $\mathbb R$, so [Taylor's Theorem](/theorems/827) with Integral Remainder gives, for $\lambda\ge 0$,
\begin{align*}
h_i(\lambda)=\int_0^\lambda (\lambda-u)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_0^\lambda (\lambda-u)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}.
\end{align*}
For $\lambda<0$, we write the same remainder formula over $[\lambda,0]$ as
\begin{align*}
h_i(\lambda)=\int_\lambda^0 (u-\lambda)h_i''(u)\,d\mathcal L^1(u)\le \frac14\int_\lambda^0 (u-\lambda)\,d\mathcal L^1(u)=\frac{\lambda^2}{8}.
\end{align*}
Therefore
\begin{align*}
\mathbb E[e^{\lambda X_i}]
=e^{h_i(\lambda)}
\le e^{\lambda^2/8}.
\end{align*}
This is the special Bernoulli form of the sub-Gaussian estimate that drives the Chernoff-Hoeffding bound.
[/guided]
[/step]
[step:Factor the exponential moment of the centered sum using independence]
For $\lambda\in\mathbb R$, define the random variable $Z_\lambda:\Omega \to \mathbb R$ on the same probability space by
\begin{align*}
Z_\lambda(\omega)=e^{\lambda(S_n(\omega)-\mu)}.
\end{align*}
Since
\begin{align*}
S_n-\mu=\sum_{i=1}^{n}(Y_i-p_i)=\sum_{i=1}^{n}X_i,
\end{align*}
we have
\begin{align*}
Z_\lambda
=\prod_{i=1}^{n} e^{\lambda X_i}.
\end{align*}
The random variables $Y_1,\dots,Y_n$ are [independent](/page/Independence%20%28Probability%20Theory%29), and each $e^{\lambda X_i}$ is a measurable function of $Y_i$, so $e^{\lambda X_1},\dots,e^{\lambda X_n}$ are independent. Hence
\begin{align*}
\mathbb E[e^{\lambda(S_n-\mu)}]=\prod_{i=1}^{n}\mathbb E[e^{\lambda X_i}].
\end{align*}
Using the one-variable estimate for each factor,
\begin{align*}
\prod_{i=1}^{n}\mathbb E[e^{\lambda X_i}]\le \prod_{i=1}^{n} e^{\lambda^2/8}=e^{n\lambda^2/8}.
\end{align*}
[/step]
[step:Optimize the exponential Markov bound for the upper tail]
Fix $t>0$ and $\lambda>0$. This is the exponential form of [Markov's Inequality](/theorems/514), applied to the non-negative random variable $e^{\lambda(S_n-\mu)}$. We spell out the pointwise domination. On the event $\{S_n-\mu\ge t\}$, monotonicity of the exponential function gives
\begin{align*}
e^{\lambda(S_n-\mu)}\ge e^{\lambda t}.
\end{align*}
Therefore
\begin{align*}
\mathbb 1_{\{S_n-\mu\ge t\}}
\le e^{-\lambda t}e^{\lambda(S_n-\mu)}.
\end{align*}
Taking expectations yields
\begin{align*}
\mathbb P(S_n-\mu\ge t)=\mathbb E[\mathbb 1_{\{S_n-\mu\ge t\}}].
\end{align*}
Taking expectations in the preceding pointwise inequality gives
\begin{align*}
\mathbb E[\mathbb 1_{\{S_n-\mu\ge t\}}]\le e^{-\lambda t}\mathbb E[e^{\lambda(S_n-\mu)}].
\end{align*}
Using the exponential-moment bound from the previous step,
\begin{align*}
e^{-\lambda t}\mathbb E[e^{\lambda(S_n-\mu)}]\le \exp\left(-\lambda t+\frac{n\lambda^2}{8}\right).
\end{align*}
Choose
\begin{align*}
\lambda:=\frac{4t}{n}.
\end{align*}
Since $t>0$ and $n\in\mathbb N$, this choice satisfies $\lambda>0$, and substitution gives
\begin{align*}
-\lambda t+\frac{n\lambda^2}{8}
=-\frac{4t^2}{n}+\frac{n}{8}\frac{16t^2}{n^2}
=-\frac{2t^2}{n}.
\end{align*}
Thus
\begin{align*}
\mathbb P(S_n-\mu\ge t)\le e^{-2t^2/n}.
\end{align*}
[/step]
[step:Repeat the argument for the lower tail and combine the two estimates]
For each $i\in\{1,\dots,n\}$, define $\widetilde X_i:\Omega \to \mathbb R$ by
\begin{align*}
\widetilde X_i(\omega)=p_i-Y_i(\omega).
\end{align*} Since $\widetilde X_i=-X_i$, the exponential moment estimate from the first step gives
\begin{align*}
\mathbb E[e^{\lambda\widetilde X_i}]=\mathbb E[e^{-\lambda X_i}].
\end{align*}
Applying the first-step estimate with parameter $-\lambda$ gives
\begin{align*}
\mathbb E[e^{-\lambda X_i}]\le e^{\lambda^2/8}.
\end{align*}
This holds for every $\lambda\in\mathbb R$. Moreover, because the random variables $Y_1,\dots,Y_n$ are [independent](/page/Independence%20%28Probability%20Theory%29) and each $e^{\lambda\widetilde X_i}$ is a measurable function of $Y_i$, the random variables $e^{\lambda\widetilde X_1},\dots,e^{\lambda\widetilde X_n}$ are independent. Therefore the same factorization and exponential Markov argument applies to
\begin{align*}
\mu-S_n=\sum_{i=1}^{n}\widetilde X_i
\end{align*}
and gives, for every $t>0$,
\begin{align*}
\mathbb P(\mu-S_n\ge t)\le e^{-2t^2/n}.
\end{align*}
Since
\begin{align*}
\{|S_n-\mu|\ge t\}
=
\{S_n-\mu\ge t\}\cup\{\mu-S_n\ge t\},
\end{align*}
the [finite subadditivity property of probability measures](/theorems/1106) gives
\begin{align*}
\mathbb P(|S_n-\mu|\ge t)\le \mathbb P(S_n-\mu\ge t)+\mathbb P(\mu-S_n\ge t).
\end{align*}
Using the upper-tail and lower-tail estimates,
\begin{align*}
\mathbb P(S_n-\mu\ge t)+\mathbb P(\mu-S_n\ge t)\le e^{-2t^2/n}+e^{-2t^2/n}=2e^{-2t^2/n}.
\end{align*}
[/step]
[step:Substitute $t=n\varepsilon$ to obtain the sample-proportion form]
Let $\varepsilon>0$ and define
\begin{align*}
t:=n\varepsilon.
\end{align*}
Then $t>0$. The inequality
\begin{align*}
\left|\frac{S_n}{n}-\frac{\mu}{n}\right|\ge \varepsilon
\end{align*}
is equivalent to
\begin{align*}
|S_n-\mu|\ge n\varepsilon,
\end{align*}
and, by the definition of $t$, this is equivalent to
\begin{align*}
|S_n-\mu|\ge t.
\end{align*}
Applying the two-sided estimate with this value of $t$ gives
\begin{align*}
\mathbb P\left(\left|\frac{S_n}{n}-\frac{\mu}{n}\right|\ge \varepsilon\right)=\mathbb P(|S_n-\mu|\ge n\varepsilon).
\end{align*}
Applying the two-sided estimate with $t=n\varepsilon$ gives
\begin{align*}
\mathbb P(|S_n-\mu|\ge n\varepsilon)\le 2e^{-2(n\varepsilon)^2/n}=2e^{-2n\varepsilon^2}.
\end{align*}
This is the stated equivalent formulation.
[/step]