McDiarmid's Bounded Differences Inequality for Independent Non-Identically Distributed Inputs

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We pass to the product law of the independent coordinates and form the Doob martingale obtained by revealing the coordinates one at a time. The bounded differences assumption implies that the $i$-th martingale increment has conditional range length at most $c_i$, even though the coordinate spaces and laws may be different. A conditional Hoeffding estimate then bounds the exponential moment of each increment, and iterating these bounds gives a sub-Gaussian moment generating function. Markov's inequality and optimization over the exponential parameter give the upper tail, the same argument applied to $-Z$ gives the lower tail, and the two-sided bound follows by the union bound. [/proofplan] [step:Move the argument to the product probability space] Let $(\Omega,\mathcal F,\mathbb P)$ be the original probability space on which the random variables $X_1,\dots,X_n$ are defined. Let $\mu_i:=\mathbb P\circ X_i^{-1}$ be the law of $X_i$ on $(\mathcal X_i,\mathcal A_i)$, and define the product measurable space \begin{align*} (\mathcal X,\mathcal A):=\left(\prod_{i=1}^n \mathcal X_i,\bigotimes_{i=1}^n \mathcal A_i\right). \end{align*} Since $X_1,\dots,X_n$ are independent, the law of \begin{align*} X:=(X_1,\dots,X_n):(\Omega,\mathcal F)\to(\mathcal X,\mathcal A) \end{align*} is the product probability measure \begin{align*} \mu:=\bigotimes_{i=1}^n \mu_i. \end{align*} Because $f:(\mathcal X,\mathcal A)\to\mathbb R$ is measurable by hypothesis and the law of $X$ is $\mu$, integrability of $Z=f(X)$ gives \begin{align*} \int_{\mathcal X}|f(y)|\,d\mu(y)=\mathbb E[|f(X)|]<\infty. \end{align*} Thus $f\in L^1(\mathcal X,\mathcal A,\mu)$. It is therefore enough to prove the estimates for the coordinate map \begin{align*} Y:\mathcal X\to\mathcal X,\qquad y\mapsto y, \end{align*} on the probability space $(\mathcal X,\mathcal A,\mu)$, since $f(Y)$ and $Z$ have the same distribution and the same expectation. For the rest of the proof, write \begin{align*} Y=(Y_1,\dots,Y_n), \end{align*} where $Y_i:\mathcal X\to\mathcal X_i$ is the $i$-th coordinate projection. [/step] [step:Construct the Doob martingale by revealing coordinates] For each $i\in\{0,\dots,n\}$, define the sub-$\sigma$-algebra \begin{align*} \mathcal F_i:=\sigma(Y_1,\dots,Y_i), \end{align*} with $\mathcal F_0:=\{\varnothing,\mathcal X\}$. Define the Doob martingale \begin{align*} M_i:\mathcal X\to\mathbb R,\qquad M_i:=\mathbb E_\mu[f(Y)\mid \mathcal F_i]. \end{align*} Then $(M_i)_{i=0}^n$ is an integrable martingale with respect to $(\mathcal F_i)_{i=0}^n$, and \begin{align*} M_0=\mathbb E_\mu[f(Y)],\qquad M_n=f(Y). \end{align*} Define the martingale increments \begin{align*} D_i:\mathcal X\to\mathbb R,\qquad D_i:=M_i-M_{i-1} \end{align*} for $i\in\{1,\dots,n\}$. Then \begin{align*} f(Y)-\mathbb E_\mu[f(Y)]=\sum_{i=1}^n D_i. \end{align*} [/step] [step:Bound each martingale increment by the corresponding coordinate oscillation] Fix $i\in\{1,\dots,n\}$. By [Fubini's theorem](/theorems/2961), applied to the integrable function $f\in L^1(\mathcal X,\mathcal A,\mu)$ on the finite product probability space, the following section integrals are finite for $\bigotimes_{j=1}^i\mu_j$-almost every prefix. Choose measurable versions on the remaining null set arbitrarily. Define the measurable map $g_i:\prod_{j=1}^i\mathcal X_j\to\mathbb R$ by \begin{align*} g_i(a_1,\dots,a_i) := \int_{\prod_{j=i+1}^n\mathcal X_j} f(a_1,\dots,a_i,u_{i+1},\dots,u_n) \,d\left(\bigotimes_{j=i+1}^n\mu_j\right)(u_{i+1},\dots,u_n), \end{align*} with the convention that, when $i=n$, the integral over the empty product is the value of $f(a_1,\dots,a_n)$. The standard product-space formula for [conditional expectation](/page/Conditional%20Expectation), justified by [Fubini's theorem](/theorems/2961), gives \begin{align*} M_i=g_i(Y_1,\dots,Y_i) \end{align*} almost surely. Similarly, \begin{align*} M_{i-1}= \int_{\mathcal X_i} g_i(Y_1,\dots,Y_{i-1},v_i)\,d\mu_i(v_i) \end{align*} almost surely. For $\bigotimes_{j=1}^{i-1}\mu_j$-almost every prefix $(a_1,\dots,a_{i-1})$, Fubini's theorem applied on $\mathcal X_i\times\mathcal X_i\times\prod_{j=i+1}^n\mathcal X_j$ gives, for $\mu_i\otimes\mu_i$-almost every pair $(v_i,w_i)$, \begin{align*} |g_i(a_1,\dots,a_{i-1},v_i)-g_i(a_1,\dots,a_{i-1},w_i)|\le c_i. \end{align*} This follows by integrating the bounded differences inequality in the future coordinates against $\bigotimes_{j=i+1}^n\mu_j$. Hence, for almost every fixed prefix, the essential range with respect to $\mu_i$ of the function $v_i\mapsto g_i(a_1,\dots,a_{i-1},v_i)$ has length at most $c_i$. Define $A_i:\mathcal X\to\mathbb R\cup\{-\infty\}$ to be the conditional essential infimum of $M_i$ given $\mathcal F_{i-1}$, and define $B_i:\mathcal X\to\mathbb R\cup\{\infty\}$ to be the conditional essential supremum of $M_i$ given $\mathcal F_{i-1}$. By definition of conditional essential infimum and conditional essential supremum, $A_i$ and $B_i$ are $\mathcal F_{i-1}$-measurable, and \begin{align*} A_i\le M_i\le B_i \end{align*} almost surely. The preceding fiberwise essential-range estimate gives \begin{align*} B_i-A_i\le c_i \end{align*} almost surely, because conditional essential bounds over $\mathcal F_{i-1}$ are computed by fixing the revealed prefix outside the Fubini null set and then taking essential bounds over the remaining $i$-th coordinate. Since $M_i$ is finite almost surely and this conditional essential range has finite length, the bounds $A_i$ and $B_i$ may be chosen finite almost surely after changing them on a null set. Since $M_{i-1}=\mathbb E_\mu[M_i\mid\mathcal F_{i-1}]$, the conditional average also satisfies \begin{align*} A_i\le M_{i-1}\le B_i \end{align*} almost surely. Therefore $D_i=M_i-M_{i-1}$ has conditional mean \begin{align*} \mathbb E_\mu[D_i\mid\mathcal F_{i-1}]=0 \end{align*} and, conditionally on $\mathcal F_{i-1}$, lies in the $\mathcal F_{i-1}$-measurable interval $[A_i-M_{i-1},B_i-M_{i-1}]$, whose length is at most $c_i$. [guided] The key point is that the future coordinates are still averaged against their original product law after the first $i$ coordinates have been revealed. This is exactly where independence is used, and no identical-distribution assumption is involved. Fix $i\in\{1,\dots,n\}$. Once the first $i$ coordinates are prescribed as \begin{align*} a=(a_1,\dots,a_i)\in\prod_{j=1}^i\mathcal X_j, \end{align*} the only remaining randomness is in the future coordinates. Because $f\in L^1(\mathcal X,\mathcal A,\mu)$ and $\mu$ is a finite product probability measure, [Fubini's theorem](/theorems/2961) implies that these future-coordinate sections are integrable for almost every prefix. We choose measurable finite versions on the exceptional null set and encode the remaining average by the measurable map $g_i:\prod_{j=1}^i\mathcal X_j\to\mathbb R$ defined by \begin{align*} g_i(a_1,\dots,a_i) := \int_{\prod_{j=i+1}^n\mathcal X_j} f(a_1,\dots,a_i,u_{i+1},\dots,u_n) \,d\left(\bigotimes_{j=i+1}^n\mu_j\right)(u_{i+1},\dots,u_n). \end{align*} When $i=n$, this means simply \begin{align*} g_n(a_1,\dots,a_n)=f(a_1,\dots,a_n). \end{align*} Since the joint law is the product measure $\mu=\bigotimes_{j=1}^n\mu_j$, the product-space formula for [conditional expectation](/page/Conditional%20Expectation), again justified by Fubini's theorem, gives \begin{align*} M_i=\mathbb E_\mu[f(Y)\mid\mathcal F_i] =g_i(Y_1,\dots,Y_i) \end{align*} almost surely. Conditioning on one fewer coordinate gives \begin{align*} M_{i-1} = \int_{\mathcal X_i} g_i(Y_1,\dots,Y_{i-1},v_i)\,d\mu_i(v_i) \end{align*} almost surely. Now fix a prefix $(a_1,\dots,a_{i-1})$ outside the null set supplied by Fubini's theorem. The bounded differences hypothesis says that replacing only the $i$-th coordinate changes $f$ by at most $c_i$, uniformly over all future coordinates on the product space. Thus Fubini's theorem, applied to the non-negative measurable function obtained from the absolute difference of the two future-coordinate sections, gives for $\mu_i\otimes\mu_i$-almost every pair $(v_i,w_i)$ the integrated inequality \begin{align*} |g_i(a_1,\dots,a_{i-1},v_i)-g_i(a_1,\dots,a_{i-1},w_i)|\le c_i. \end{align*} This almost-everywhere pairwise estimate is exactly the statement needed for essential ranges: as the $i$-th coordinate varies with respect to $\mu_i$, the conditional value $g_i$ has essential range length at most $c_i$. To turn this fiberwise statement into conditional bounds, define $A_i:\mathcal X\to\mathbb R\cup\{-\infty\}$ to be the conditional essential infimum of $M_i$ given $\mathcal F_{i-1}$, and define $B_i:\mathcal X\to\mathbb R\cup\{\infty\}$ to be the conditional essential supremum of $M_i$ given $\mathcal F_{i-1}$. These are not arbitrary fiberwise choices: by the definition of conditional essential infimum and supremum, $A_i$ and $B_i$ are $\mathcal F_{i-1}$-measurable random variables. They satisfy \begin{align*} A_i\le M_i\le B_i \end{align*} almost surely. Why does the length bound survive this conditional formulation? Outside the Fubini null set, conditioning on $\mathcal F_{i-1}$ fixes the prefix $(Y_1,\dots,Y_{i-1})$ and leaves the $i$-th coordinate distributed according to $\mu_i$. The preceding pairwise estimate says that the resulting essential range of $g_i$ over that coordinate has length at most $c_i$. Therefore the conditional essential bounds satisfy \begin{align*} B_i-A_i\le c_i \end{align*} almost surely. Since $M_i$ is real-valued almost surely and this conditional essential range has finite length, we may use finite versions of $A_i$ and $B_i$ after altering them on a null set. Since \begin{align*} M_{i-1}=\mathbb E_\mu[M_i\mid\mathcal F_{i-1}], \end{align*} the conditional average of a [random variable](/page/Random%20Variable) lying between $A_i$ and $B_i$ also lies between those same $\mathcal F_{i-1}$-measurable bounds: \begin{align*} A_i\le M_{i-1}\le B_i \end{align*} almost surely. Consequently the martingale increment \begin{align*} D_i=M_i-M_{i-1} \end{align*} has conditional mean \begin{align*} \mathbb E_\mu[D_i\mid\mathcal F_{i-1}]=0 \end{align*} and, conditionally on $\mathcal F_{i-1}$, lies in the interval $[A_i-M_{i-1},B_i-M_{i-1}]$, whose length is at most $c_i$. This is the precise form of the bounded increment estimate needed for the exponential argument. [/guided] [/step] [step:Estimate the exponential moment of the martingale sum] Define the variance proxy \begin{align*} V:=\sum_{i=1}^n c_i^2. \end{align*} We use the following elementary conditional Hoeffding estimate: if $W$ is an integrable real-valued random variable, $\mathcal G$ is a sub-$\sigma$-algebra, $\mathbb E[W\mid\mathcal G]=0$, and there are $\mathcal G$-measurable real-valued random variables $A$ and $B$ such that $A\le W\le B$ almost surely and $B-A\le c$, then for every $\lambda\in\mathbb R$, \begin{align*} \mathbb E[\exp(\lambda W)\mid\mathcal G]\le \exp\left(\frac{\lambda^2c^2}{8}\right). \end{align*} Indeed, $\mathbb E[W\mid\mathcal G]=0$ and $A\le W\le B$ imply $A\le 0\le B$ almost surely after taking conditional expectations. On each conditional interval $[A,B]$, convexity of the map $x\mapsto \exp(\lambda x)$ gives the chord bound \begin{align*} \exp(\lambda W) \le \frac{B-W}{B-A}\exp(\lambda A)+\frac{W-A}{B-A}\exp(\lambda B), \end{align*} with the equality case interpreted directly when $A=B$. Taking conditional expectation and using $\mathbb E[W\mid\mathcal G]=0$ gives the scalar expression with interval length $L:=B-A$ and \begin{align*} \theta:=\frac{B}{B-A}\in[0,1]. \end{align*} Since $A=-(1-\theta)L$ and $B=\theta L$, it remains to prove the scalar estimate \begin{align*} \theta e^{-\lambda(1-\theta)L}+(1-\theta)e^{\lambda\theta L} \le \exp\left(\frac{\lambda^2L^2}{8}\right). \end{align*} Set $s:=\lambda L$ and define $\varphi:[0,1]\to\mathbb R$ by \begin{align*} \varphi(\theta):=\log\left(\theta e^{-(1-\theta)s}+(1-\theta)e^{\theta s}\right). \end{align*} The standard one-variable proof of [Hoeffding's lemma](/theorems/1956) differentiates twice and gives \begin{align*} \varphi''(\theta)\le s^2 \end{align*} for $0\le\theta\le1$, while $\varphi(0)=\varphi(1)=0$. Applying [Taylor's theorem](/theorems/827) with remainder to $\varphi$ around its maximizing point on $[0,1]$ yields \begin{align*} \varphi(\theta)\le \frac{s^2}{8}. \end{align*} Thus \begin{align*} \theta e^{-\lambda(1-\theta)L}+(1-\theta)e^{\lambda\theta L} \le \exp\left(\frac{\lambda^2L^2}{8}\right) \le \exp\left(\frac{\lambda^2c^2}{8}\right), \end{align*} and the conditional Hoeffding estimate follows. Applying this estimate to $W=D_i$, $\mathcal G=\mathcal F_{i-1}$, and $c=c_i$, we obtain \begin{align*} \mathbb E_\mu[\exp(\lambda D_i)\mid\mathcal F_{i-1}] \le \exp\left(\frac{\lambda^2c_i^2}{8}\right). \end{align*} Iterating conditional expectations from $i=n$ down to $i=1$ gives \begin{align*} \mathbb E_\mu\left[\exp\left(\lambda\sum_{i=1}^nD_i\right)\right] \le \prod_{i=1}^n \exp\left(\frac{\lambda^2c_i^2}{8}\right) = \exp\left(\frac{\lambda^2}{8}\sum_{i=1}^n c_i^2\right). \end{align*} Since $\sum_{i=1}^nD_i=f(Y)-\mathbb E_\mu[f(Y)]$, this becomes \begin{align*} \mathbb E_\mu\left[\exp\left(\lambda(f(Y)-\mathbb E_\mu[f(Y)])\right)\right] \le \exp\left(\frac{\lambda^2V}{8}\right). \end{align*} [/step] [step:Optimize Markov's inequality to obtain the upper tail] Assume first that $V>0$. For $\lambda>0$, [Markov's inequality](/theorems/514) applied to the non-negative random variable \begin{align*} \exp\left(\lambda(f(Y)-\mathbb E_\mu[f(Y)])\right) \end{align*} gives \begin{align*} \mathbb P_\mu(f(Y)-\mathbb E_\mu[f(Y)]\ge t) \le \exp(-\lambda t) \mathbb E_\mu\left[\exp\left(\lambda(f(Y)-\mathbb E_\mu[f(Y)])\right)\right]. \end{align*} Using the exponential moment estimate, \begin{align*} \mathbb P_\mu(f(Y)-\mathbb E_\mu[f(Y)]\ge t) \le \exp\left(-\lambda t+\frac{\lambda^2V}{8}\right). \end{align*} The quadratic function \begin{align*} q:(0,\infty)\to\mathbb R,\qquad q(\lambda):=-\lambda t+\frac{\lambda^2V}{8} \end{align*} is minimized at \begin{align*} \lambda=\frac{4t}{V}. \end{align*} Substituting this value gives \begin{align*} -\lambda t+\frac{\lambda^2V}{8} = -\frac{4t^2}{V}+\frac{16t^2}{V^2}\frac{V}{8} = -\frac{2t^2}{V}. \end{align*} Therefore \begin{align*} \mathbb P_\mu(f(Y)-\mathbb E_\mu[f(Y)]\ge t) \le \exp\left(-\frac{2t^2}{V}\right). \end{align*} Transferring back from the product model to $Z=f(X)$ gives \begin{align*} \mathbb P(Z-\mathbb E[Z]\ge t) \le \exp\left(-\frac{2t^2}{V}\right). \end{align*} [/step] [step:Apply the same argument to the negative function and combine the tails] The function \begin{align*} -f:\mathcal X_1\times\cdots\times\mathcal X_n\to\mathbb R \end{align*} satisfies the same bounded differences condition with the same constants $c_i$, because \begin{align*} |(-f)(x)-(-f)(x')|=|f(x)-f(x')| \end{align*} whenever $x$ and $x'$ differ only in coordinate $i$. Applying the upper-tail estimate to $-f$ gives \begin{align*} \mathbb P_\mu(\mathbb E_\mu[f(Y)]-f(Y)\ge t) \le \exp\left(-\frac{2t^2}{V}\right). \end{align*} Returning to $Z=f(X)$, \begin{align*} \mathbb P(\mathbb E[Z]-Z\ge t) \le \exp\left(-\frac{2t^2}{V}\right). \end{align*} Finally, \begin{align*} \{|Z-\mathbb E[Z]|\ge t\} = \{Z-\mathbb E[Z]\ge t\}\cup\{\mathbb E[Z]-Z\ge t\}. \end{align*} By the [union bound](/theorems/6078), \begin{align*} \mathbb P(|Z-\mathbb E[Z]|\ge t) \le 2\exp\left(-\frac{2t^2}{V}\right). \end{align*} [/step] [step:Handle the degenerate zero-variance-proxy case] If $V=0$, then $c_i=0$ for every $i\in\{1,\dots,n\}$. Choose any two points \begin{align*} y=(y_1,\dots,y_n),\qquad z=(z_1,\dots,z_n) \end{align*} in $\mathcal X_1\times\cdots\times\mathcal X_n$. Define intermediate points $r_k\in\mathcal X_1\times\cdots\times\mathcal X_n$ for $k\in\{0,\dots,n\}$ by \begin{align*} r_k:=(z_1,\dots,z_k,y_{k+1},\dots,y_n). \end{align*} Then $r_0=y$, $r_n=z$, and $r_k$ and $r_{k-1}$ differ only in coordinate $k$. The bounded differences condition with $c_k=0$ gives \begin{align*} f(r_k)=f(r_{k-1}) \end{align*} for each $k\in\{1,\dots,n\}$. Therefore $f(y)=f(z)$. Since $y$ and $z$ were arbitrary, $f$ is constant on the whole product space. Hence \begin{align*} f(Y)=\mathbb E_\mu[f(Y)] \end{align*} $\mu$-almost surely, and therefore \begin{align*} Z=\mathbb E[Z] \end{align*} $\mathbb P$-almost surely. Thus both one-sided tail probabilities are $0$ for every $t>0$, and the two-sided tail probability is also $0$ for every $t>0$. With the convention from the theorem statement that $\exp(-2t^2/V)$ equals $0$ for $t>0$ and equals $1$ for $t=0$ when $V=0$, the displayed estimates hold: for $t>0$ both sides of each one-sided estimate are $0$, while at $t=0$ probabilities are at most $1$ in the one-sided cases and at most $1\le 2$ in the two-sided case. This completes the proof. [/step]

Prerequisites (0/7 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Fubini's Theorem

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.