[proofplan]
We pass to the product law of the independent coordinates and form the Doob martingale obtained by revealing the coordinates one at a time. The bounded differences assumption implies that the $i$-th martingale increment has conditional range length at most $c_i$, even though the coordinate spaces and laws may be different. A conditional Hoeffding estimate then bounds the exponential moment of each increment, and iterating these bounds gives a sub-Gaussian moment generating function. Markov's inequality and optimization over the exponential parameter give the upper tail, the same argument applied to $-Z$ gives the lower tail, and the two-sided bound follows by the union bound.
[/proofplan]
[step:Move the argument to the product probability space]
Let $(\Omega,\mathcal F,\mathbb P)$ be the original probability space on which the random variables $X_1,\dots,X_n$ are defined.
Let $\mu_i:=\mathbb P\circ X_i^{-1}$ be the law of $X_i$ on $(\mathcal X_i,\mathcal A_i)$, and define the product measurable space
\begin{align*}
(\mathcal X,\mathcal A):=\left(\prod_{i=1}^n \mathcal X_i,\bigotimes_{i=1}^n \mathcal A_i\right).
\end{align*}
Since $X_1,\dots,X_n$ are independent, the law of
\begin{align*}
X:=(X_1,\dots,X_n):(\Omega,\mathcal F)\to(\mathcal X,\mathcal A)
\end{align*}
is the product probability measure
\begin{align*}
\mu:=\bigotimes_{i=1}^n \mu_i.
\end{align*}
Because $f:(\mathcal X,\mathcal A)\to\mathbb R$ is measurable by hypothesis and the law of $X$ is $\mu$, integrability of $Z=f(X)$ gives
\begin{align*}
\int_{\mathcal X}|f(y)|\,d\mu(y)=\mathbb E[|f(X)|]<\infty.
\end{align*}
Thus $f\in L^1(\mathcal X,\mathcal A,\mu)$. It is therefore enough to prove the estimates for the coordinate map
\begin{align*}
Y:\mathcal X\to\mathcal X,\qquad y\mapsto y,
\end{align*}
on the probability space $(\mathcal X,\mathcal A,\mu)$, since $f(Y)$ and $Z$ have the same distribution and the same expectation. For the rest of the proof, write
\begin{align*}
Y=(Y_1,\dots,Y_n),
\end{align*}
where $Y_i:\mathcal X\to\mathcal X_i$ is the $i$-th coordinate projection.
[/step]
[step:Construct the Doob martingale by revealing coordinates]
For each $i\in\{0,\dots,n\}$, define the sub-$\sigma$-algebra
\begin{align*}
\mathcal F_i:=\sigma(Y_1,\dots,Y_i),
\end{align*}
with $\mathcal F_0:=\{\varnothing,\mathcal X\}$. Define the Doob martingale
\begin{align*}
M_i:\mathcal X\to\mathbb R,\qquad M_i:=\mathbb E_\mu[f(Y)\mid \mathcal F_i].
\end{align*}
Then $(M_i)_{i=0}^n$ is an integrable martingale with respect to $(\mathcal F_i)_{i=0}^n$, and
\begin{align*}
M_0=\mathbb E_\mu[f(Y)],\qquad M_n=f(Y).
\end{align*}
Define the martingale increments
\begin{align*}
D_i:\mathcal X\to\mathbb R,\qquad D_i:=M_i-M_{i-1}
\end{align*}
for $i\in\{1,\dots,n\}$. Then
\begin{align*}
f(Y)-\mathbb E_\mu[f(Y)]=\sum_{i=1}^n D_i.
\end{align*}
[/step]
[step:Bound each martingale increment by the corresponding coordinate oscillation]
Fix $i\in\{1,\dots,n\}$. By [Fubini's theorem](/theorems/2961), applied to the integrable function $f\in L^1(\mathcal X,\mathcal A,\mu)$ on the finite product probability space, the following section integrals are finite for $\bigotimes_{j=1}^i\mu_j$-almost every prefix. Choose measurable versions on the remaining null set arbitrarily. Define the measurable map $g_i:\prod_{j=1}^i\mathcal X_j\to\mathbb R$ by
\begin{align*}
g_i(a_1,\dots,a_i)
:=
\int_{\prod_{j=i+1}^n\mathcal X_j}
f(a_1,\dots,a_i,u_{i+1},\dots,u_n)
\,d\left(\bigotimes_{j=i+1}^n\mu_j\right)(u_{i+1},\dots,u_n),
\end{align*}
with the convention that, when $i=n$, the integral over the empty product is the value of $f(a_1,\dots,a_n)$. The standard product-space formula for [conditional expectation](/page/Conditional%20Expectation), justified by [Fubini's theorem](/theorems/2961), gives
\begin{align*}
M_i=g_i(Y_1,\dots,Y_i)
\end{align*}
almost surely. Similarly,
\begin{align*}
M_{i-1}=
\int_{\mathcal X_i} g_i(Y_1,\dots,Y_{i-1},v_i)\,d\mu_i(v_i)
\end{align*}
almost surely.
For $\bigotimes_{j=1}^{i-1}\mu_j$-almost every prefix $(a_1,\dots,a_{i-1})$, Fubini's theorem applied on $\mathcal X_i\times\mathcal X_i\times\prod_{j=i+1}^n\mathcal X_j$ gives, for $\mu_i\otimes\mu_i$-almost every pair $(v_i,w_i)$,
\begin{align*}
|g_i(a_1,\dots,a_{i-1},v_i)-g_i(a_1,\dots,a_{i-1},w_i)|\le c_i.
\end{align*}
This follows by integrating the bounded differences inequality in the future coordinates against $\bigotimes_{j=i+1}^n\mu_j$. Hence, for almost every fixed prefix, the essential range with respect to $\mu_i$ of the function $v_i\mapsto g_i(a_1,\dots,a_{i-1},v_i)$ has length at most $c_i$.
Define $A_i:\mathcal X\to\mathbb R\cup\{-\infty\}$ to be the conditional essential infimum of $M_i$ given $\mathcal F_{i-1}$, and define $B_i:\mathcal X\to\mathbb R\cup\{\infty\}$ to be the conditional essential supremum of $M_i$ given $\mathcal F_{i-1}$. By definition of conditional essential infimum and conditional essential supremum, $A_i$ and $B_i$ are $\mathcal F_{i-1}$-measurable, and
\begin{align*}
A_i\le M_i\le B_i
\end{align*}
almost surely. The preceding fiberwise essential-range estimate gives
\begin{align*}
B_i-A_i\le c_i
\end{align*}
almost surely, because conditional essential bounds over $\mathcal F_{i-1}$ are computed by fixing the revealed prefix outside the Fubini null set and then taking essential bounds over the remaining $i$-th coordinate. Since $M_i$ is finite almost surely and this conditional essential range has finite length, the bounds $A_i$ and $B_i$ may be chosen finite almost surely after changing them on a null set. Since $M_{i-1}=\mathbb E_\mu[M_i\mid\mathcal F_{i-1}]$, the conditional average also satisfies
\begin{align*}
A_i\le M_{i-1}\le B_i
\end{align*}
almost surely. Therefore $D_i=M_i-M_{i-1}$ has conditional mean
\begin{align*}
\mathbb E_\mu[D_i\mid\mathcal F_{i-1}]=0
\end{align*}
and, conditionally on $\mathcal F_{i-1}$, lies in the $\mathcal F_{i-1}$-measurable interval $[A_i-M_{i-1},B_i-M_{i-1}]$, whose length is at most $c_i$.
[guided]
The key point is that the future coordinates are still averaged against their original product law after the first $i$ coordinates have been revealed. This is exactly where independence is used, and no identical-distribution assumption is involved.
Fix $i\in\{1,\dots,n\}$. Once the first $i$ coordinates are prescribed as
\begin{align*}
a=(a_1,\dots,a_i)\in\prod_{j=1}^i\mathcal X_j,
\end{align*}
the only remaining randomness is in the future coordinates. Because $f\in L^1(\mathcal X,\mathcal A,\mu)$ and $\mu$ is a finite product probability measure, [Fubini's theorem](/theorems/2961) implies that these future-coordinate sections are integrable for almost every prefix. We choose measurable finite versions on the exceptional null set and encode the remaining average by the measurable map $g_i:\prod_{j=1}^i\mathcal X_j\to\mathbb R$ defined by
\begin{align*}
g_i(a_1,\dots,a_i)
:=
\int_{\prod_{j=i+1}^n\mathcal X_j}
f(a_1,\dots,a_i,u_{i+1},\dots,u_n)
\,d\left(\bigotimes_{j=i+1}^n\mu_j\right)(u_{i+1},\dots,u_n).
\end{align*}
When $i=n$, this means simply
\begin{align*}
g_n(a_1,\dots,a_n)=f(a_1,\dots,a_n).
\end{align*}
Since the joint law is the product measure $\mu=\bigotimes_{j=1}^n\mu_j$, the product-space formula for [conditional expectation](/page/Conditional%20Expectation), again justified by Fubini's theorem, gives
\begin{align*}
M_i=\mathbb E_\mu[f(Y)\mid\mathcal F_i]
=g_i(Y_1,\dots,Y_i)
\end{align*}
almost surely. Conditioning on one fewer coordinate gives
\begin{align*}
M_{i-1}
=
\int_{\mathcal X_i} g_i(Y_1,\dots,Y_{i-1},v_i)\,d\mu_i(v_i)
\end{align*}
almost surely.
Now fix a prefix $(a_1,\dots,a_{i-1})$ outside the null set supplied by Fubini's theorem. The bounded differences hypothesis says that replacing only the $i$-th coordinate changes $f$ by at most $c_i$, uniformly over all future coordinates on the product space. Thus Fubini's theorem, applied to the non-negative measurable function obtained from the absolute difference of the two future-coordinate sections, gives for $\mu_i\otimes\mu_i$-almost every pair $(v_i,w_i)$ the integrated inequality
\begin{align*}
|g_i(a_1,\dots,a_{i-1},v_i)-g_i(a_1,\dots,a_{i-1},w_i)|\le c_i.
\end{align*}
This almost-everywhere pairwise estimate is exactly the statement needed for essential ranges: as the $i$-th coordinate varies with respect to $\mu_i$, the conditional value $g_i$ has essential range length at most $c_i$.
To turn this fiberwise statement into conditional bounds, define $A_i:\mathcal X\to\mathbb R\cup\{-\infty\}$ to be the conditional essential infimum of $M_i$ given $\mathcal F_{i-1}$, and define $B_i:\mathcal X\to\mathbb R\cup\{\infty\}$ to be the conditional essential supremum of $M_i$ given $\mathcal F_{i-1}$. These are not arbitrary fiberwise choices: by the definition of conditional essential infimum and supremum, $A_i$ and $B_i$ are $\mathcal F_{i-1}$-measurable random variables. They satisfy
\begin{align*}
A_i\le M_i\le B_i
\end{align*}
almost surely.
Why does the length bound survive this conditional formulation? Outside the Fubini null set, conditioning on $\mathcal F_{i-1}$ fixes the prefix $(Y_1,\dots,Y_{i-1})$ and leaves the $i$-th coordinate distributed according to $\mu_i$. The preceding pairwise estimate says that the resulting essential range of $g_i$ over that coordinate has length at most $c_i$. Therefore the conditional essential bounds satisfy
\begin{align*}
B_i-A_i\le c_i
\end{align*}
almost surely. Since $M_i$ is real-valued almost surely and this conditional essential range has finite length, we may use finite versions of $A_i$ and $B_i$ after altering them on a null set. Since
\begin{align*}
M_{i-1}=\mathbb E_\mu[M_i\mid\mathcal F_{i-1}],
\end{align*}
the conditional average of a [random variable](/page/Random%20Variable) lying between $A_i$ and $B_i$ also lies between those same $\mathcal F_{i-1}$-measurable bounds:
\begin{align*}
A_i\le M_{i-1}\le B_i
\end{align*}
almost surely. Consequently the martingale increment
\begin{align*}
D_i=M_i-M_{i-1}
\end{align*}
has conditional mean
\begin{align*}
\mathbb E_\mu[D_i\mid\mathcal F_{i-1}]=0
\end{align*}
and, conditionally on $\mathcal F_{i-1}$, lies in the interval $[A_i-M_{i-1},B_i-M_{i-1}]$, whose length is at most $c_i$. This is the precise form of the bounded increment estimate needed for the exponential argument.
[/guided]
[/step]
[step:Estimate the exponential moment of the martingale sum]
Define the variance proxy
\begin{align*}
V:=\sum_{i=1}^n c_i^2.
\end{align*}
We use the following elementary conditional Hoeffding estimate: if $W$ is an integrable real-valued random variable, $\mathcal G$ is a sub-$\sigma$-algebra, $\mathbb E[W\mid\mathcal G]=0$, and there are $\mathcal G$-measurable real-valued random variables $A$ and $B$ such that $A\le W\le B$ almost surely and $B-A\le c$, then for every $\lambda\in\mathbb R$,
\begin{align*}
\mathbb E[\exp(\lambda W)\mid\mathcal G]\le \exp\left(\frac{\lambda^2c^2}{8}\right).
\end{align*}
Indeed, $\mathbb E[W\mid\mathcal G]=0$ and $A\le W\le B$ imply $A\le 0\le B$ almost surely after taking conditional expectations. On each conditional interval $[A,B]$, convexity of the map $x\mapsto \exp(\lambda x)$ gives the chord bound
\begin{align*}
\exp(\lambda W)
\le
\frac{B-W}{B-A}\exp(\lambda A)+\frac{W-A}{B-A}\exp(\lambda B),
\end{align*}
with the equality case interpreted directly when $A=B$. Taking conditional expectation and using $\mathbb E[W\mid\mathcal G]=0$ gives the scalar expression with interval length $L:=B-A$ and
\begin{align*}
\theta:=\frac{B}{B-A}\in[0,1].
\end{align*}
Since $A=-(1-\theta)L$ and $B=\theta L$, it remains to prove the scalar estimate
\begin{align*}
\theta e^{-\lambda(1-\theta)L}+(1-\theta)e^{\lambda\theta L}
\le \exp\left(\frac{\lambda^2L^2}{8}\right).
\end{align*}
Set $s:=\lambda L$ and define $\varphi:[0,1]\to\mathbb R$ by
\begin{align*}
\varphi(\theta):=\log\left(\theta e^{-(1-\theta)s}+(1-\theta)e^{\theta s}\right).
\end{align*}
The standard one-variable proof of [Hoeffding's lemma](/theorems/1956) differentiates twice and gives
\begin{align*}
\varphi''(\theta)\le s^2
\end{align*}
for $0\le\theta\le1$, while $\varphi(0)=\varphi(1)=0$. Applying [Taylor's theorem](/theorems/827) with remainder to $\varphi$ around its maximizing point on $[0,1]$ yields
\begin{align*}
\varphi(\theta)\le \frac{s^2}{8}.
\end{align*}
Thus
\begin{align*}
\theta e^{-\lambda(1-\theta)L}+(1-\theta)e^{\lambda\theta L}
\le \exp\left(\frac{\lambda^2L^2}{8}\right)
\le \exp\left(\frac{\lambda^2c^2}{8}\right),
\end{align*}
and the conditional Hoeffding estimate follows.
Applying this estimate to $W=D_i$, $\mathcal G=\mathcal F_{i-1}$, and $c=c_i$, we obtain
\begin{align*}
\mathbb E_\mu[\exp(\lambda D_i)\mid\mathcal F_{i-1}]
\le
\exp\left(\frac{\lambda^2c_i^2}{8}\right).
\end{align*}
Iterating conditional expectations from $i=n$ down to $i=1$ gives
\begin{align*}
\mathbb E_\mu\left[\exp\left(\lambda\sum_{i=1}^nD_i\right)\right]
\le
\prod_{i=1}^n \exp\left(\frac{\lambda^2c_i^2}{8}\right)
=
\exp\left(\frac{\lambda^2}{8}\sum_{i=1}^n c_i^2\right).
\end{align*}
Since $\sum_{i=1}^nD_i=f(Y)-\mathbb E_\mu[f(Y)]$, this becomes
\begin{align*}
\mathbb E_\mu\left[\exp\left(\lambda(f(Y)-\mathbb E_\mu[f(Y)])\right)\right]
\le
\exp\left(\frac{\lambda^2V}{8}\right).
\end{align*}
[/step]
[step:Optimize Markov's inequality to obtain the upper tail]
Assume first that $V>0$. For $\lambda>0$, [Markov's inequality](/theorems/514) applied to the non-negative random variable
\begin{align*}
\exp\left(\lambda(f(Y)-\mathbb E_\mu[f(Y)])\right)
\end{align*}
gives
\begin{align*}
\mathbb P_\mu(f(Y)-\mathbb E_\mu[f(Y)]\ge t)
\le
\exp(-\lambda t)
\mathbb E_\mu\left[\exp\left(\lambda(f(Y)-\mathbb E_\mu[f(Y)])\right)\right].
\end{align*}
Using the exponential moment estimate,
\begin{align*}
\mathbb P_\mu(f(Y)-\mathbb E_\mu[f(Y)]\ge t)
\le
\exp\left(-\lambda t+\frac{\lambda^2V}{8}\right).
\end{align*}
The quadratic function
\begin{align*}
q:(0,\infty)\to\mathbb R,\qquad q(\lambda):=-\lambda t+\frac{\lambda^2V}{8}
\end{align*}
is minimized at
\begin{align*}
\lambda=\frac{4t}{V}.
\end{align*}
Substituting this value gives
\begin{align*}
-\lambda t+\frac{\lambda^2V}{8}
=
-\frac{4t^2}{V}+\frac{16t^2}{V^2}\frac{V}{8}
=
-\frac{2t^2}{V}.
\end{align*}
Therefore
\begin{align*}
\mathbb P_\mu(f(Y)-\mathbb E_\mu[f(Y)]\ge t)
\le
\exp\left(-\frac{2t^2}{V}\right).
\end{align*}
Transferring back from the product model to $Z=f(X)$ gives
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)
\le
\exp\left(-\frac{2t^2}{V}\right).
\end{align*}
[/step]
[step:Apply the same argument to the negative function and combine the tails]
The function
\begin{align*}
-f:\mathcal X_1\times\cdots\times\mathcal X_n\to\mathbb R
\end{align*}
satisfies the same bounded differences condition with the same constants $c_i$, because
\begin{align*}
|(-f)(x)-(-f)(x')|=|f(x)-f(x')|
\end{align*}
whenever $x$ and $x'$ differ only in coordinate $i$. Applying the upper-tail estimate to $-f$ gives
\begin{align*}
\mathbb P_\mu(\mathbb E_\mu[f(Y)]-f(Y)\ge t)
\le
\exp\left(-\frac{2t^2}{V}\right).
\end{align*}
Returning to $Z=f(X)$,
\begin{align*}
\mathbb P(\mathbb E[Z]-Z\ge t)
\le
\exp\left(-\frac{2t^2}{V}\right).
\end{align*}
Finally,
\begin{align*}
\{|Z-\mathbb E[Z]|\ge t\}
=
\{Z-\mathbb E[Z]\ge t\}\cup\{\mathbb E[Z]-Z\ge t\}.
\end{align*}
By the [union bound](/theorems/6078),
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)
\le
2\exp\left(-\frac{2t^2}{V}\right).
\end{align*}
[/step]
[step:Handle the degenerate zero-variance-proxy case]
If $V=0$, then $c_i=0$ for every $i\in\{1,\dots,n\}$. Choose any two points
\begin{align*}
y=(y_1,\dots,y_n),\qquad z=(z_1,\dots,z_n)
\end{align*}
in $\mathcal X_1\times\cdots\times\mathcal X_n$. Define intermediate points $r_k\in\mathcal X_1\times\cdots\times\mathcal X_n$ for $k\in\{0,\dots,n\}$ by
\begin{align*}
r_k:=(z_1,\dots,z_k,y_{k+1},\dots,y_n).
\end{align*}
Then $r_0=y$, $r_n=z$, and $r_k$ and $r_{k-1}$ differ only in coordinate $k$. The bounded differences condition with $c_k=0$ gives
\begin{align*}
f(r_k)=f(r_{k-1})
\end{align*}
for each $k\in\{1,\dots,n\}$. Therefore $f(y)=f(z)$. Since $y$ and $z$ were arbitrary, $f$ is constant on the whole product space. Hence
\begin{align*}
f(Y)=\mathbb E_\mu[f(Y)]
\end{align*}
$\mu$-almost surely, and therefore
\begin{align*}
Z=\mathbb E[Z]
\end{align*}
$\mathbb P$-almost surely. Thus both one-sided tail probabilities are $0$ for every $t>0$, and the two-sided tail probability is also $0$ for every $t>0$. With the convention from the theorem statement that $\exp(-2t^2/V)$ equals $0$ for $t>0$ and equals $1$ for $t=0$ when $V=0$, the displayed estimates hold: for $t>0$ both sides of each one-sided estimate are $0$, while at $t=0$ probabilities are at most $1$ in the one-sided cases and at most $1\le 2$ in the two-sided case. This completes the proof.
[/step]