[proofplan]
We first prove the one-variable exponential moment estimate that uses only the upper bound $X_i\le b$, the zero mean, and the variance. Independence then multiplies the one-variable estimates into an exponential moment bound for the sum. A direct estimate for the nonnegative [random variable](/page/Random%20Variable) $e^{\lambda S_n}$ converts this moment bound into a tail bound, and the remaining work is the explicit optimization in the exponential parameter. The degenerate case $V=0$ is handled separately because the displayed formula divides by $V$ in the argument of $h$.
[/proofplan]
[step:Reduce the zero variance case to an almost sure identity]
Assume first that $V=0$. For each $1\le i\le n$, set $\sigma_i^2:=\operatorname{Var}(X_i)$. Since each $\sigma_i^2\ge 0$ and $\sum_{i=1}^n\sigma_i^2=0$, we have $\sigma_i^2=0$ for every $i$. Because $\mathbb E[X_i]=0$, this gives
\begin{align*}
\mathbb E[X_i^2]=\operatorname{Var}(X_i)+(\mathbb E[X_i])^2=0.
\end{align*}
Since $X_i^2\ge 0$, it follows that $X_i=0$ almost surely. The finite intersection of the probability-one events $\{X_i=0\}$ has probability one, so
\begin{align*}
\sum_{i=1}^n X_i=0
\end{align*}
almost surely. Hence $\mathbb P(\sum_{i=1}^n X_i\ge t)=0$ for $t>0$ and equals $1$ for $t=0$.
[/step]
[step:Prove the bounded variance exponential moment estimate]
Assume now that $V>0$. For $1\le i\le n$, define $\sigma_i^2:=\operatorname{Var}(X_i)$. We claim that for every $\lambda\ge 0$ and every $1\le i\le n$,
\begin{align*}
\mathbb E[e^{\lambda X_i}]
\le
\exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right).
\end{align*}
Fix $\lambda\ge 0$ and define $\mu:=\lambda b$. Let $\mathcal L^1$ denote one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $[0,1]$ with its Borel $\sigma$-algebra. Define the auxiliary function $q:\mathbb R\to[0,\infty)$ by setting $q(0):=1/2$ and, for $y\ne0$, setting
\begin{align*}
q(y):=\frac{e^y-1-y}{y^2}.
\end{align*}
The identity
\begin{align*}
e^y-1-y
=
y^2\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s)
\end{align*}
shows that
\begin{align*}
q(y)=\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s).
\end{align*}
Therefore $q$ is increasing on $\mathbb R$, because for $y_1\le y_2$ the integrand satisfies $(1-s)e^{sy_1}\le (1-s)e^{sy_2}$ for every $s\in[0,1]$. Since $\mu u\le\mu$, we obtain
\begin{align*}
e^{\mu u}-1-\mu u
\le
u^2(e^\mu-1-\mu).
\end{align*}
Define the real-valued random variable $U_i:\Omega\to\mathbb R$ by
\begin{align*}
U_i(\omega):=\frac{X_i(\omega)}{b}.
\end{align*}
Since $X_i\le b$ almost surely, $U_i\le 1$ almost surely. Applying the preceding scalar inequality with $u=U_i$ gives
\begin{align*}
e^{\lambda X_i}
\le
1+\lambda X_i+\frac{X_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)
\end{align*}
almost surely. Taking expectations and using $\mathbb E[X_i]=0$ and $\mathbb E[X_i^2]=\sigma_i^2$ gives
\begin{align*}
\mathbb E[e^{\lambda X_i}]
\le
1+\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right).
\end{align*}
Since $1+a\le e^a$ for every $a\ge 0$, and $e^{\lambda b}-1-\lambda b\ge 0$, the claimed estimate follows.
[guided]
The key point is to dominate $e^{\lambda x}$ on the whole half-line $x\le b$ by a quadratic expression whose linear term disappears after expectation and whose quadratic term becomes the variance. Fix $\lambda\ge 0$ and set $\mu:=\lambda b$. We first prove the scalar inequality
\begin{align*}
e^{\mu u}
\le
1+\mu u+u^2(e^\mu-1-\mu)
\end{align*}
for every real number $u\le 1$.
Let $\mathcal L^1$ denote one-dimensional Lebesgue measure on $[0,1]$ with its Borel $\sigma$-algebra. To prove the scalar inequality, define $q:\mathbb R\to[0,\infty)$ by setting $q(0):=1/2$ and, for $y\ne0$, setting
\begin{align*}
q(y):=\frac{e^y-1-y}{y^2}.
\end{align*}
The useful representation of $q$ is obtained from the integral remainder formula for the exponential function:
\begin{align*}
e^y-1-y
=
y^2\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s).
\end{align*}
Thus
\begin{align*}
q(y)=\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s).
\end{align*}
If $y_1\le y_2$, then $e^{sy_1}\le e^{sy_2}$ for every $s\in[0,1]$, and hence
\begin{align*}
(1-s)e^{sy_1}\le (1-s)e^{sy_2}.
\end{align*}
Integrating with respect to $\mathcal L^1$ over $[0,1]$ gives $q(y_1)\le q(y_2)$, so $q$ is increasing.
Now take any $u\le 1$. Since $\mu\ge 0$, we have $\mu u\le\mu$. The monotonicity of $q$ gives
\begin{align*}
\frac{e^{\mu u}-1-\mu u}{(\mu u)^2}
\le
\frac{e^\mu-1-\mu}{\mu^2}
\end{align*}
when $\mu u\ne0$ and $\mu\ne0$, with the same conclusion by continuity in the cases where one of these quantities is zero. Multiplying by $\mu^2u^2$ yields
\begin{align*}
e^{\mu u}-1-\mu u
\le
u^2(e^\mu-1-\mu),
\end{align*}
which is the desired scalar inequality.
We apply this scalar inequality to the normalized random variable $U_i:\Omega\to\mathbb R$ defined by
\begin{align*}
U_i(\omega):=\frac{X_i(\omega)}{b}.
\end{align*}
The hypothesis $X_i\le b$ almost surely is exactly what gives $U_i\le1$ almost surely. Therefore
\begin{align*}
e^{\lambda X_i}
\le
1+\lambda X_i+\frac{X_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)
\end{align*}
almost surely. Taking expectations is legitimate because $X_i^2$ is integrable by the finite variance hypothesis, and $e^{\lambda X_i}\le e^{\lambda b}$ almost surely. Using $\mathbb E[X_i]=0$ and
\begin{align*}
\mathbb E[X_i^2]=\operatorname{Var}(X_i)+(\mathbb E[X_i])^2=\sigma_i^2
\end{align*}
gives
\begin{align*}
\mathbb E[e^{\lambda X_i}]
\le
1+\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right).
\end{align*}
Finally $e^{\lambda b}-1-\lambda b\ge0$, since $e^r\ge1+r$ for $r\ge0$, and $1+a\le e^a$ for $a\ge0$. Hence
\begin{align*}
\mathbb E[e^{\lambda X_i}]
\le
\exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right).
\end{align*}
[/guided]
[/step]
[step:Multiply the one-variable estimates for the sum]
Define the partial sum random variable $S_n:\Omega\to\mathbb R$ by
\begin{align*}
S_n(\omega):=\sum_{i=1}^n X_i(\omega).
\end{align*}
For every $\lambda\ge0$, the independence of $X_1,\dots,X_n$ implies the independence of $e^{\lambda X_1},\dots,e^{\lambda X_n}$. Since $X_i\le b$ almost surely for each $i$, each factor satisfies $0\le e^{\lambda X_i}\le e^{\lambda b}$ almost surely and is integrable. Hence
\begin{align*}
\mathbb E[e^{\lambda S_n}]=\mathbb E\left[\prod_{i=1}^n e^{\lambda X_i}\right].
\end{align*}
Independence of the integrable nonnegative random variables $e^{\lambda X_1},\dots,e^{\lambda X_n}$ gives
\begin{align*}
\mathbb E\left[\prod_{i=1}^n e^{\lambda X_i}\right]=\prod_{i=1}^n \mathbb E[e^{\lambda X_i}].
\end{align*}
Applying the one-variable estimate to each factor gives
\begin{align*}
\prod_{i=1}^n \mathbb E[e^{\lambda X_i}]\le\prod_{i=1}^n \exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right).
\end{align*}
Using $V=\sum_{i=1}^n\sigma_i^2$, the product is
\begin{align*}
\prod_{i=1}^n \exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right)=\exp\left(\frac{V}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right).
\end{align*}
[/step]
[step:Convert the exponential moment bound into a tail bound]
Fix $t\ge0$ and $\lambda\ge0$. Since $e^{\lambda S_n}$ is nonnegative and the event $\{S_n\ge t\}$ implies $e^{\lambda S_n}\ge e^{\lambda t}$, we have
\begin{align*}
e^{\lambda t}\mathbb P(S_n\ge t)
\le
\mathbb E[e^{\lambda S_n}].
\end{align*}
Combining this with the preceding exponential moment estimate gives
\begin{align*}
\mathbb P(S_n\ge t)
\le
\exp\left(-\lambda t+\frac{V}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right)
\end{align*}
for every $\lambda\ge0$.
[/step]
[step:Optimize the exponential parameter]
Define $\Phi:[0,\infty)\to\mathbb R$ by
\begin{align*}
\Phi(\lambda):=-\lambda t+\frac{V}{b^2}\left(e^{\lambda b}-1-\lambda b\right).
\end{align*}
If $t=0$, choosing $\lambda=0$ gives
\begin{align*}
\mathbb P(S_n\ge0)\le 1
=
\exp\left(-\frac{V}{b^2}h(0)\right),
\end{align*}
because $h(0)=0$.
Assume now that $t>0$. Differentiating $\Phi$ gives
\begin{align*}
\Phi'(\lambda)=-t+\frac{V}{b}\left(e^{\lambda b}-1\right).
\end{align*}
Differentiating once more gives
\begin{align*}
\Phi''(\lambda)=V e^{\lambda b}>0.
\end{align*}
Thus $\Phi$ is strictly convex, and its unique minimizer $\lambda_*$ is determined by
\begin{align*}
e^{\lambda_* b}=1+\frac{bt}{V}.
\end{align*}
Hence
\begin{align*}
\lambda_*=\frac{1}{b}\log\left(1+\frac{bt}{V}\right).
\end{align*}
Define $r\in(0,\infty)$ by
\begin{align*}
r:=\frac{bt}{V}.
\end{align*}
Then
\begin{align*}
\frac{t}{b}=\frac{V}{b^2}r
\end{align*}
and
\begin{align*}
\lambda_* = \frac{1}{b}\log(1+r).
\end{align*}
Substituting these identities into $\Phi$ gives
\begin{align*}
\Phi(\lambda_*)=-\frac{V}{b^2}r\log(1+r)+\frac{V}{b^2}\left(r-\log(1+r)\right).
\end{align*}
Factoring $-V/b^2$ yields
\begin{align*}
\Phi(\lambda_*)=-\frac{V}{b^2}\left[(1+r)\log(1+r)-r\right].
\end{align*}
By the definition $h(r)=(1+r)\log(1+r)-r$, this becomes
\begin{align*}
\Phi(\lambda_*)=-\frac{V}{b^2}h\left(\frac{bt}{V}\right).
\end{align*}
Therefore
\begin{align*}
\mathbb P(S_n\ge t)
\le
\exp\left(-\frac{V}{b^2}h\left(\frac{bt}{V}\right)\right).
\end{align*}
Since $S_n=\sum_{i=1}^n X_i$, this is exactly the asserted Bennett bound.
[/step]