[proofplan]
Set $Y_i=f(X_i)$. Then the empirical average is given by
\begin{align*}
P_n f=\frac{1}{n}\sum_{i=1}^{n}Y_i.
\end{align*}
We first prove the bounded centered [moment generating function](/page/Moment%20Generating%20Function) estimate that underlies Bernstein's inequality, using the exponential series and the bound $|Y_i|\le M$. Exponential Markov's inequality and independence then convert this moment [generating function](/page/Generating%20Function) bound into a one-sided tail estimate. Optimizing the exponential parameter gives the stated denominator, and applying the same argument to $-f$ gives the two-sided bound by the union bound.
[/proofplan]
[step:Transfer the hypotheses from $f$ to the sampled variables]
For each $1\le i\le n$, define the real-valued [random variable](/page/Random%20Variable) $Y_i:(\Omega,\mathcal A)\to(\mathbb R,\mathcal B(\mathbb R))$ by
\begin{align*}
Y_i(\omega):=f(X_i(\omega)).
\end{align*}
Write $\mathbb E[\cdot]$ for expectation with respect to $\mathbb P$, and write $\mathcal B(\mathbb R)$ for the Borel $\sigma$-algebra on $\mathbb R$. Since $f$ is $\mathcal S$-measurable and $X_i$ is $\mathcal A/\mathcal S$-measurable, each $Y_i$ is $\mathcal A/\mathcal B(\mathbb R)$-measurable. Since the random variables $X_1,\dots,X_n$ are independent and identically distributed with common distribution $P$, the random variables $Y_1,\dots,Y_n$ are independent and identically distributed with common law $\mathbb P\circ Y_1^{-1}=P\circ f^{-1}$.
The integrability follows from $|f|\le M$ $P$-a.e. Hence, for every $1\le i\le n$,
\begin{align*}
\mathbb E[Y_i]=\int_\Omega f(X_i(\omega))\,d\mathbb P(\omega)=\int_S f(x)\,dP(x)=P f=0.
\end{align*}
Also,
\begin{align*}
|Y_i|\le M
\end{align*}
$\mathbb P$-a.s., and
\begin{align*}
\mathbb E[Y_i^2]=\int_\Omega f(X_i(\omega))^2\,d\mathbb P(\omega)=\int_S f(x)^2\,dP(x)=P f^2\le \sigma^2.
\end{align*}
Finally, by the definition of $P_n f$,
\begin{align*}
P_n f=\frac{1}{n}\sum_{i=1}^{n}Y_i.
\end{align*}
[/step]
[step:Prove the Bernstein moment generating function estimate]
Let $Y:\Omega\to\mathbb R$ be any real-valued random variable satisfying $\mathbb E[Y]=0$, $|Y|\le M$ $\mathbb P$-a.s., and $\mathbb E[Y^2]\le \sigma^2$. We claim that, for every $\lambda\in(0,3/M)$,
\begin{align*}
\mathbb E[\exp(\lambda Y)]\le \exp\left(\frac{\lambda^2\sigma^2}{2(1-\lambda M/3)}\right).
\end{align*}
To prove this, fix $\lambda\in(0,3/M)$. For every real number $u$ with $|u|<3$, the exponential series and the estimate $k!\ge 2\cdot 3^{k-2}$ for all integers $k\ge 2$ give
\begin{align*}
e^u-1-u=\sum_{k=2}^{\infty}\frac{u^k}{k!}\le \sum_{k=2}^{\infty}\frac{|u|^k}{k!}\le \frac{u^2}{2}\sum_{j=0}^{\infty}\left(\frac{|u|}{3}\right)^j=\frac{u^2}{2(1-|u|/3)}.
\end{align*}
Apply this with $u=\lambda Y(\omega)$ on the full-probability event where $|Y(\omega)|\le M$. Since $|\lambda Y(\omega)|\le \lambda M<3$, we obtain
\begin{align*}
e^{\lambda Y}\le 1+\lambda Y+\frac{\lambda^2Y^2}{2(1-\lambda M/3)}
\end{align*}
$\mathbb P$-a.s. Taking expectations with respect to $\mathbb P$ and using $\mathbb E[Y]=0$ and $\mathbb E[Y^2]\le\sigma^2$ gives
\begin{align*}
\mathbb E[e^{\lambda Y}]\le 1+\frac{\lambda^2\sigma^2}{2(1-\lambda M/3)}.
\end{align*}
Using $1+r\le e^r$ for $r\ge 0$, with $r=\lambda^2\sigma^2/(2(1-\lambda M/3))$, proves the claimed estimate.
[guided]
We need a moment generating function estimate that uses exactly the two quantitative hypotheses available: a variance bound and a uniform bound. Let $Y:\Omega\to\mathbb R$ be a real-valued random variable such that $\mathbb E[Y]=0$, $|Y|\le M$ $\mathbb P$-a.s., and $\mathbb E[Y^2]\le\sigma^2$. Fix $\lambda\in(0,3/M)$. The restriction $\lambda M<3$ is chosen so that the exponential series can be dominated by a geometric series.
For a real number $u$ with $|u|<3$, expand the exponential series:
\begin{align*}
e^u-1-u=\sum_{k=2}^{\infty}\frac{u^k}{k!}.
\end{align*}
Taking absolute values term-by-term gives
\begin{align*}
e^u-1-u\le \sum_{k=2}^{\infty}\frac{|u|^k}{k!}.
\end{align*}
For every integer $k\ge 2$, the elementary factorial bound $k!\ge 2\cdot 3^{k-2}$ holds: it is an equality at $k=2$, and multiplying by $k+1\ge 3$ preserves the induction step. Therefore
\begin{align*}
\sum_{k=2}^{\infty}\frac{|u|^k}{k!}\le \sum_{k=2}^{\infty}\frac{|u|^k}{2\cdot 3^{k-2}}=\frac{u^2}{2}\sum_{j=0}^{\infty}\left(\frac{|u|}{3}\right)^j=\frac{u^2}{2(1-|u|/3)}.
\end{align*}
Now substitute $u=\lambda Y(\omega)$. On the event where $|Y(\omega)|\le M$, which has probability one, we have $|\lambda Y(\omega)|\le \lambda M<3$. Hence
\begin{align*}
e^{\lambda Y}\le 1+\lambda Y+\frac{\lambda^2Y^2}{2(1-\lambda M/3)}
\end{align*}
$\mathbb P$-a.s. The denominator was enlarged from $1-|\lambda Y|/3$ to $1-\lambda M/3$ because $|\lambda Y|\le\lambda M$, and this gives a deterministic bound.
Taking expectations with respect to $\mathbb P$ is now legitimate because the right-hand side is integrable: $Y$ is bounded by $M$, and $Y^2$ is integrable. We obtain
\begin{align*}
\mathbb E[e^{\lambda Y}]\le 1+\lambda\mathbb E[Y]+\frac{\lambda^2\mathbb E[Y^2]}{2(1-\lambda M/3)}.
\end{align*}
The centering assumption removes the linear term, and the variance bound controls the quadratic term:
\begin{align*}
\mathbb E[e^{\lambda Y}]\le 1+\frac{\lambda^2\sigma^2}{2(1-\lambda M/3)}.
\end{align*}
Finally, since $1+r\le e^r$ for every $r\ge 0$, with
\begin{align*}
r:=\frac{\lambda^2\sigma^2}{2(1-\lambda M/3)},
\end{align*}
we conclude
\begin{align*}
\mathbb E[\exp(\lambda Y)]\le \exp\left(\frac{\lambda^2\sigma^2}{2(1-\lambda M/3)}\right).
\end{align*}
This is the Bernstein moment generating function estimate.
[/guided]
[/step]
[step:Convert the moment generating function bound into a one-sided tail bound]
Let
\begin{align*}
S_n:=\sum_{i=1}^{n}Y_i.
\end{align*}
Fix $t>0$ and $\lambda\in(0,3/M)$. Since the function $x\mapsto e^{\lambda x}$ is increasing on $\mathbb R$,
\begin{align*}
\mathbb P(P_n f\ge t)=\mathbb P(S_n\ge nt).
\end{align*}
Therefore,
\begin{align*}
\mathbb P(S_n\ge nt)=\mathbb P(e^{\lambda S_n}\ge e^{\lambda nt}).
\end{align*}
For the nonnegative random variable $e^{\lambda S_n}$, Markov's inequality follows from the definition of expectation of a nonnegative random variable applied to the event $\{e^{\lambda S_n}\ge e^{\lambda nt}\}$:
\begin{align*}
\mathbb E[e^{\lambda S_n}]\ge e^{\lambda nt}\mathbb P(e^{\lambda S_n}\ge e^{\lambda nt}).
\end{align*}
Thus
\begin{align*}
\mathbb P(P_n f\ge t)\le e^{-\lambda nt}\mathbb E[e^{\lambda S_n}].
\end{align*}
By independence of $Y_1,\dots,Y_n$ and the moment generating function estimate from the previous step applied to each $Y_i$,
\begin{align*}
\mathbb E[e^{\lambda S_n}]=\prod_{i=1}^{n}\mathbb E[e^{\lambda Y_i}]\le \exp\left(\frac{n\lambda^2\sigma^2}{2(1-\lambda M/3)}\right).
\end{align*}
Consequently, for every $\lambda\in(0,3/M)$,
\begin{align*}
\mathbb P(P_n f\ge t)\le \exp\left(-n\lambda t+\frac{n\lambda^2\sigma^2}{2(1-\lambda M/3)}\right).
\end{align*}
[/step]
[step:Optimize the exponential parameter]
If $\sigma=0$, then $\mathbb E[Y_i^2]=0$ for every $i$, so $Y_i=0$ $\mathbb P$-a.s. for every $i$. Hence $P_n f=0$ $\mathbb P$-a.s., and for every $t>0$,
\begin{align*}
\mathbb P(P_n f\ge t)=0\le \exp\left(-\frac{nt^2}{2\sigma^2+2Mt/3}\right).
\end{align*}
Assume now that $\sigma>0$. Define
\begin{align*}
\lambda_*:=\frac{t}{\sigma^2+Mt/3}.
\end{align*}
Since $\sigma^2>0$ and $t>0$, we have $\lambda_*>0$ and
\begin{align*}
\lambda_*M=\frac{Mt}{\sigma^2+Mt/3}<3.
\end{align*}
Thus $\lambda_*\in(0,3/M)$, and the previous step applies with $\lambda=\lambda_*$. Moreover,
\begin{align*}
1-\frac{\lambda_*M}{3}=\frac{\sigma^2}{\sigma^2+Mt/3}.
\end{align*}
Therefore
\begin{align*}
-n\lambda_*t+\frac{n\lambda_*^2\sigma^2}{2(1-\lambda_*M/3)}=-\frac{nt^2}{\sigma^2+Mt/3}+\frac{nt^2}{2(\sigma^2+Mt/3)}=-\frac{nt^2}{2\sigma^2+2Mt/3}.
\end{align*}
Substitution gives
\begin{align*}
\mathbb P(P_n f\ge t)\le \exp\left(-\frac{nt^2}{2\sigma^2+2Mt/3}\right).
\end{align*}
[/step]
[step:Apply the one-sided bound to $f$ and $-f$]
Define $g:S\to\mathbb R$ by
\begin{align*}
g(x):=-f(x).
\end{align*}
Then $g$ is $\mathcal S$-measurable,
\begin{align*}
P g=-P f=0,
\end{align*}
\begin{align*}
|g(x)|=|f(x)|\le M
\end{align*}
for $P$-a.e. $x\in S$, and
\begin{align*}
P g^2=P f^2\le \sigma^2.
\end{align*}
Applying the one-sided estimate to $g$ gives
\begin{align*}
\mathbb P(P_n g\ge t)\le \exp\left(-\frac{nt^2}{2\sigma^2+2Mt/3}\right).
\end{align*}
Since $P_n g=-P_n f$, this is
\begin{align*}
\mathbb P(P_n f\le -t)\le \exp\left(-\frac{nt^2}{2\sigma^2+2Mt/3}\right).
\end{align*}
The event $\{|P_n f|\ge t\}$ is contained in the union of the two events $\{P_n f\ge t\}$ and $\{P_n f\le -t\}$. By finite subadditivity of the probability measure $\mathbb P$,
\begin{align*}
\mathbb P(|P_n f|\ge t)\le \mathbb P(P_n f\ge t)+\mathbb P(P_n f\le -t)\le 2\exp\left(-\frac{nt^2}{2\sigma^2+2Mt/3}\right).
\end{align*}
This proves both asserted inequalities.
[/step]