McDiarmid Bounded Differences Inequality — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We reveal the variables one at a time and form the Doob martingale whose $i$-th term is the [conditional expectation](/page/Conditional%20Expectation) of $Z$ after observing the first $i$ variables. The bounded differences hypothesis implies that each successive martingale increment is almost surely confined, conditionally on the previously revealed variables, to an interval of length at most the corresponding constant $c_i$. A conditional Hoeffding-type estimate then gives an exponential bound for each conditional increment, and iteration controls the moment generating function of $Z-\mathbb E[Z]$. Markov's inequality and optimization in the exponential parameter give the upper tail; applying the same argument to $-f$ gives the lower tail, and the two-sided estimate follows from the union bound. [/proofplan] [step:Construct the Doob martingale by revealing the variables one at a time] Let $(\Omega,\mathcal F,\mathbb P)$ be the probability space on which the random variables $X_1,\dots,X_n$ are defined. For $i\in\{0,\dots,n\}$, define the sub-$\sigma$-algebra \begin{align*} \mathcal F_i:=\sigma(X_1,\dots,X_i), \end{align*} with $\mathcal F_0:=\{\varnothing,\Omega\}$ completed if necessary. Define the [random variable](/page/Random%20Variable) $M_i:\Omega\to\mathbb R$ by \begin{align*} M_i(\omega):=\mathbb E[Z\mid \mathcal F_i](\omega). \end{align*} Here $L^1(\Omega,\mathcal F,\mathbb P)$ denotes the space of integrable real-valued random variables on the probability space $(\Omega,\mathcal F,\mathbb P)$, modulo almost sure equality. Since $Z\in L^1(\Omega,\mathcal F,\mathbb P)$, each $M_i$ is integrable and $\mathcal F_i$-measurable. The [tower property of conditional expectation](/theorems/1150) applies because $\mathcal F_{i-1}\subset\mathcal F_i$ and $Z\in L^1(\Omega,\mathcal F,\mathbb P)$, and gives \begin{align*} \mathbb E[M_i\mid \mathcal F_{i-1}] = \mathbb E[\mathbb E[Z\mid \mathcal F_i]\mid \mathcal F_{i-1}] = \mathbb E[Z\mid \mathcal F_{i-1}] = M_{i-1}, \end{align*} so $(M_i)_{i=0}^n$ is a martingale with respect to $(\mathcal F_i)_{i=0}^n$. Also $M_0=\mathbb E[Z]$ and $M_n=Z$ almost surely, because $Z$ is $\mathcal F_n$-measurable. [/step] [step:Bound each martingale increment by the corresponding bounded difference constant] For $i\in\{1,\dots,n\}$, define the martingale difference $D_i:\Omega\to\mathbb R$ by \begin{align*} D_i(\omega):=M_i(\omega)-M_{i-1}(\omega). \end{align*} We claim that $D_i$ is almost surely contained in an interval of length at most $c_i$ conditional on $\mathcal F_{i-1}$. Fix $i$. Let $\mu_j$ denote the law of $X_j$ on $(\mathcal X_j,\mathcal A_j)$. Let $\nu_i$ denote the product probability measure $\mu_{i+1}\otimes\cdots\otimes\mu_n$ on $\mathcal X_{i+1}\times\cdots\times\mathcal X_n$, with the convention that for $i=n$ this is the one-point probability measure on the empty tail. Independence means that the law of $(X_1,\dots,X_n)$ on the product measurable space is $\mu_1\otimes\cdots\otimes\mu_n$. We will use this product-law identity directly to verify conditional expectations, rather than invoking a regular conditional distribution of the unrevealed coordinates. Define a measurable version $G_i:\mathcal X_1\times\cdots\times\mathcal X_i\to\mathbb R$ of the conditional expectation section as follows. Since $Z\in L^1(\Omega,\mathcal F,\mathbb P)$, the Tonelli theorem applied to $|f|$ under the product law is valid because $|f|$ is nonnegative and measurable. Hence for $(\mu_1\otimes\cdots\otimes\mu_i)$-almost every prefix $(x_1,\dots,x_i)$ the integral of $|f(x_1,\dots,x_i,\cdot)|$ with respect to $\nu_i$ is finite. On this full-measure set define \begin{align*} G_i(x_1,\dots,x_i):=\int_{\mathcal X_{i+1}\times\cdots\times\mathcal X_n} f(x_1,\dots,x_i,y_{i+1},\dots,y_n)\, d\nu_i(y_{i+1},\dots,y_n), \end{align*} and define $G_i$ to be $0$ on the exceptional null set. For $i=n$, this formula reads $G_n=f$ outside a null set. The function $G_i$ is $\mathcal A_1\otimes\cdots\otimes\mathcal A_i$-measurable by the measurable-parameter form of Tonelli theorem applied to the positive and negative parts of the integrable function $f$. We verify the conditional-expectation identity by the defining property. Let $H:\Omega\to\mathbb R$ be a bounded $\mathcal F_i$-measurable random variable. By the Doob-Dynkin factorization lemma for $\sigma(X_1,\dots,X_i)$-measurable real random variables, there is a bounded $\mathcal A_1\otimes\cdots\otimes\mathcal A_i$-measurable function $h:\mathcal X_1\times\cdots\times\mathcal X_i\to\mathbb R$ such that \begin{align*} H=h(X_1,\dots,X_i) \end{align*} almost surely. Since $h$ is bounded and $f$ is integrable under the product law, [Fubini's theorem](/theorems/2961) applies to the integrable function \begin{align*} (x_1,\dots,x_n)\mapsto h(x_1,\dots,x_i)f(x_1,\dots,x_n). \end{align*} Using the product law of $(X_1,\dots,X_n)$, we have \begin{align*} \mathbb E[HZ] = \int_{\mathcal X_1\times\cdots\times\mathcal X_n} h(x_1,\dots,x_i)f(x_1,\dots,x_n)\,d(\mu_1\otimes\cdots\otimes\mu_n)(x_1,\dots,x_n). \end{align*} [Fubini's theorem](/theorems/2961) identifies the last integral as \begin{align*} \int_{\mathcal X_1\times\cdots\times\mathcal X_i} h(x_1,\dots,x_i)G_i(x_1,\dots,x_i)\,d(\mu_1\otimes\cdots\otimes\mu_i)(x_1,\dots,x_i). \end{align*} By the product law of $(X_1,\dots,X_i)$, this equals \begin{align*} \mathbb E[H G_i(X_1,\dots,X_i)]. \end{align*} Thus $G_i(X_1,\dots,X_i)$ is integrable, $\mathcal F_i$-measurable, and satisfies the defining identity for $\mathbb E[Z\mid\mathcal F_i]$. Hence \begin{align*} M_i=G_i(X_1,\dots,X_i) \end{align*} almost surely. By [Fubini's theorem](/theorems/2961) applied to the integrable function $|f|$ under the product law, there is a set $B_i\subset\mathcal X_1\times\cdots\times\mathcal X_{i-1}$ with $(\mu_1\otimes\cdots\otimes\mu_{i-1})(B_i)=1$ such that, for every prefix $x=(x_1,\dots,x_{i-1})\in B_i$, the section $u\mapsto G_i(x_1,\dots,x_{i-1},u)$ is finite for $\mu_i$-almost every $u\in\mathcal X_i$. For such a prefix $x$, define the finite-a.e. one-coordinate section $g_{i,x}:\mathcal X_i\to\mathbb R$ by $g_{i,x}(u):=G_i(x_1,\dots,x_{i-1},u)$ on its full-measure domain. If $u,v$ belong to this full-measure domain, then the bounded differences hypothesis gives, for every tail point outside a $\nu_i$-null set depending on $u$ and $v$, \begin{align*} \left| f(x_1,\dots,x_{i-1},u,y_{i+1},\dots,y_n) - f(x_1,\dots,x_{i-1},v,y_{i+1},\dots,y_n) \right| \le c_i. \end{align*} Integrating this pointwise inequality with respect to the probability measure $\nu_i$ gives \begin{align*} |g_{i,x}(u)-g_{i,x}(v)| \le c_i. \end{align*} Thus the essential range of $g_{i,x}(X_i)$ under $\mu_i$ is contained in an interval of length at most $c_i$. Conditionally on $\mathcal F_{i-1}$, $M_i$ has this essential range and \begin{align*} M_{i-1}=\mathbb E[M_i\mid\mathcal F_{i-1}]. \end{align*} Subtracting $M_{i-1}$ translates the interval, so conditionally on $\mathcal F_{i-1}$, $D_i=M_i-M_{i-1}$ has mean $0$ and is supported, in the essential-range sense, in an interval of length at most $c_i$. [guided] The point of the Doob martingale is that $M_i$ is the best prediction of $Z$ after seeing the first $i$ variables. We must show that revealing $X_i$ can change this prediction by at most $c_i$. Fix $i\in\{1,\dots,n\}$. Let $\mu_j$ be the distribution of $X_j$ on $(\mathcal X_j,\mathcal A_j)$ for each $j$, and let $\nu_i$ be the product measure $\mu_{i+1}\otimes\cdots\otimes\mu_n$ on the unrevealed coordinates, with the empty product interpreted as the one-point probability measure when $i=n$. Because $X_1,\dots,X_n$ are independent, the joint law of $(X_1,\dots,X_n)$ is the product measure $\mu_1\otimes\cdots\otimes\mu_n$. This product-law statement is what allows us to compute conditional expectations by testing against bounded functions of the revealed coordinates; no regular [conditional probability](/page/Conditional%20Probability) is needed. Since the theorem assumes $Z\in L^1(\Omega,\mathcal F,\mathbb P)$, the Tonelli theorem applies to $|f|$ under the product law because $|f|$ is nonnegative and measurable. It implies that for $(\mu_1\otimes\cdots\otimes\mu_i)$-almost every prefix $(x_1,\dots,x_i)$, the tail integral of $|f(x_1,\dots,x_i,\cdot)|$ with respect to $\nu_i$ is finite. On this full-measure set define $G_i:\mathcal X_1\times\cdots\times\mathcal X_i\to\mathbb R$ by \begin{align*} G_i(x_1,\dots,x_i):=\int_{\mathcal X_{i+1}\times\cdots\times\mathcal X_n} f(x_1,\dots,x_i,y_{i+1},\dots,y_n)\, d\nu_i(y_{i+1},\dots,y_n), \end{align*} and set $G_i$ equal to $0$ on the exceptional null set. For $i=n$, this is just the statement that $G_n=f$ outside a null set. The function $G_i$ is measurable with respect to $\mathcal A_1\otimes\cdots\otimes\mathcal A_i$ by the measurable-parameter form of Tonelli theorem, applied separately to the positive and negative parts of the integrable function $f$. We now verify that $G_i(X_1,\dots,X_i)$ is the conditional expectation of $Z$ given $\mathcal F_i$. Take any bounded $\mathcal F_i$-measurable random variable $H:\Omega\to\mathbb R$. Since $\mathcal F_i=\sigma(X_1,\dots,X_i)$, the Doob-Dynkin factorization lemma gives a bounded measurable function $h:\mathcal X_1\times\cdots\times\mathcal X_i\to\mathbb R$ such that \begin{align*} H=h(X_1,\dots,X_i) \end{align*} almost surely. Because $h$ is bounded and $f$ is integrable under the product law, [Fubini's theorem](/theorems/2961) applies to the integrable function $(x_1,\dots,x_n)\mapsto h(x_1,\dots,x_i)f(x_1,\dots,x_n)$. Therefore \begin{align*} \mathbb E[HZ] = \int_{\mathcal X_1\times\cdots\times\mathcal X_n} h(x_1,\dots,x_i)f(x_1,\dots,x_n)\,d(\mu_1\otimes\cdots\otimes\mu_n)(x_1,\dots,x_n). \end{align*} Fubini's theorem turns this into \begin{align*} \int_{\mathcal X_1\times\cdots\times\mathcal X_i} h(x_1,\dots,x_i)G_i(x_1,\dots,x_i)\,d(\mu_1\otimes\cdots\otimes\mu_i)(x_1,\dots,x_i). \end{align*} By the product law of $(X_1,\dots,X_i)$, the last display is \begin{align*} \mathbb E[H G_i(X_1,\dots,X_i)]. \end{align*} This is exactly the defining property of $\mathbb E[Z\mid\mathcal F_i]$, and $G_i(X_1,\dots,X_i)$ is $\mathcal F_i$-measurable by construction. Hence \begin{align*} M_i=G_i(X_1,\dots,X_i) \end{align*} almost surely. Now we must be careful about null sets. [Fubini's theorem](/theorems/2961) applied to the integrable function $|f|$ gives a full-measure set $B_i\subset\mathcal X_1\times\cdots\times\mathcal X_{i-1}$ such that, once $x=(x_1,\dots,x_{i-1})\in B_i$ is fixed, the one-coordinate section $u\mapsto G_i(x_1,\dots,x_{i-1},u)$ is finite for $\mu_i$-almost every $u$. Define $g_{i,x}:\mathcal X_i\to\mathbb R$ by $g_{i,x}(u):=G_i(x_1,\dots,x_{i-1},u)$ on this full-measure domain. Compare two values $u,v$ from that full-measure domain. The bounded differences condition says that changing only the $i$-th coordinate changes $f$ by at most $c_i$. Hence for $\nu_i$-almost every tail point, \begin{align*} \left| f(x_1,\dots,x_{i-1},u,y_{i+1},\dots,y_n) - f(x_1,\dots,x_{i-1},v,y_{i+1},\dots,y_n) \right| \le c_i. \end{align*} Since $\nu_i$ is a probability measure, integrating preserves this bound: \begin{align*} |g_{i,x}(u)-g_{i,x}(v)| \le c_i. \end{align*} Thus, conditional on the information $\mathcal F_{i-1}$, the essential range of $M_i$ under the remaining randomness of $X_i$ is contained in an interval whose length is at most $c_i$. Since \begin{align*} M_{i-1}=\mathbb E[M_i\mid\mathcal F_{i-1}], \end{align*} subtracting $M_{i-1}$ only translates that interval. Therefore $D_i=M_i-M_{i-1}$ has conditional mean zero and, conditional on $\mathcal F_{i-1}$, is supported in the essential-range sense in an interval of length at most $c_i$. [/guided] [/step] [step:Control the exponential moment of each martingale increment] We use the following conditional form of [Hoeffding's lemma](/theorems/1956). Let $\mathcal G\subset\mathcal F$ be a sub-$\sigma$-algebra. Let $A:\Omega\to\mathbb R$ and $B:\Omega\to\mathbb R$ be $\mathcal G$-measurable real-valued random variables satisfying $A\le B$ almost surely and $B-A\le c$ almost surely. If $Y:\Omega\to\mathbb R$ is an integrable real-valued random variable with $\mathbb E[Y\mid\mathcal G]=0$ and $A\le Y\le B$ almost surely, then for every $\lambda\in\mathbb R$, \begin{align*} \mathbb E[e^{\lambda Y}\mid\mathcal G]\le \exp\left(\frac{\lambda^2c^2}{8}\right). \end{align*} This follows by applying the scalar argument below on each conditioning fiber, with the $\mathcal G$-measurable endpoints $A$ and $B$. If $B=A$, then the support condition forces $Y=A$ conditionally on $\mathcal G$, and $\mathbb E[Y\mid\mathcal G]=0$ forces $A=0$ conditionally; hence $\mathbb E[e^{\lambda Y}\mid\mathcal G]=1$, which gives the estimate. Assume now that $B>A$. Since $A\le Y\le B$ and $\mathbb E[Y\mid\mathcal G]=0$, taking conditional expectations gives \begin{align*} A\le 0\le B \end{align*} almost surely. The graph of $y\mapsto e^{\lambda y}$ lies below the chord joining $(A,e^{\lambda A})$ and $(B,e^{\lambda B})$ on $[A,B]$. Taking conditional expectations and using $\mathbb E[Y\mid\mathcal G]=0$ gives \begin{align*} \mathbb E[e^{\lambda Y}\mid\mathcal G]\le \frac{-A}{B-A}e^{\lambda B}+\frac{B}{B-A}e^{\lambda A}. \end{align*} Writing $p:=-A/(B-A)$ and $h:=\lambda(B-A)$, the right-hand side equals $(1-p)e^{-ph}+p e^{(1-p)h}$. The remaining scalar estimate is exactly [Hoeffding's lemma](/theorems/1956) applied pointwise to the centered two-point random variable taking values $-ph$ and $(1-p)h$ with probabilities $1-p$ and $p$, respectively; its hypotheses hold because this variable has mean $0$ and range length $|h|$. Therefore \begin{align*} (1-p)e^{-ph}+p e^{(1-p)h}\le e^{h^2/8}. \end{align*} Since $|h|=|\lambda|(B-A)\le |\lambda|c$, the displayed conditional bound follows. In the preceding step, the interval containing the conditional essential range of $D_i$ may depend on the value of the prefix $(X_1,\dots,X_{i-1})$. This causes no change in the estimate: for each prefix in the full-measure set $B_i$, apply the deterministic interval argument above to the conditional law of $D_i$ on that fiber. The bound depends only on the interval length, which is at most $c_i$, and hence the resulting inequality is $\mathcal F_{i-1}$-almost sure. Applying this fiberwise estimate to $Y=D_i$, $\mathcal G=\mathcal F_{i-1}$, and $c=c_i$, we obtain \begin{align*} \mathbb E[e^{\lambda D_i}\mid\mathcal F_{i-1}] \le \exp\left(\frac{\lambda^2c_i^2}{8}\right) \end{align*} for every $\lambda\in\mathbb R$ and every $i\in\{1,\dots,n\}$. [/step] [step:Iterate the conditional exponential bounds] Define \begin{align*} C:=\sum_{i=1}^n c_i^2. \end{align*} Since the martingale telescopes, we have \begin{align*} Z-\mathbb E[Z] = M_n-M_0 = \sum_{i=1}^n D_i \end{align*} almost surely. Repeated conditioning gives the moment generating function estimate. First, \begin{align*} \mathbb E[e^{\lambda(Z-\mathbb E[Z])}]=\mathbb E\left[\exp\left(\lambda\sum_{i=1}^n D_i\right)\right]. \end{align*} Conditioning on $\mathcal F_{n-1}$ and using that $D_1,\dots,D_{n-1}$ are $\mathcal F_{n-1}$-measurable gives \begin{align*} \mathbb E[e^{\lambda(Z-\mathbb E[Z])}]=\mathbb E\left[\exp\left(\lambda\sum_{i=1}^{n-1}D_i\right)\mathbb E[e^{\lambda D_n}\mid\mathcal F_{n-1}]\right]. \end{align*} Using the conditional exponential bound for $D_n$ yields \begin{align*} \mathbb E[e^{\lambda(Z-\mathbb E[Z])}]\le \exp\left(\frac{\lambda^2c_n^2}{8}\right)\mathbb E\left[\exp\left(\lambda\sum_{i=1}^{n-1}D_i\right)\right]. \end{align*} Repeating the same conditioning step for $D_{n-1},\dots,D_1$ gives \begin{align*} \mathbb E[e^{\lambda(Z-\mathbb E[Z])}]\le \exp\left(\frac{\lambda^2}{8}\sum_{i=1}^n c_i^2\right)=\exp\left(\frac{\lambda^2C}{8}\right). \end{align*} [/step] [step:Optimize Markov's inequality to obtain the upper tail] Assume first that $C>0$. If $t=0$, then the desired upper-tail estimate is immediate because $\mathbb P(Z-\mathbb E[Z]\ge 0)\le 1=\exp(0)$. Now assume $t>0$. For $\lambda>0$, [Markov's inequality](/theorems/514) applies to the nonnegative random variable $e^{\lambda(Z-\mathbb E[Z])}$ and gives \begin{align*} \mathbb P(Z-\mathbb E[Z]\ge t)=\mathbb P(e^{\lambda(Z-\mathbb E[Z])}\ge e^{\lambda t}). \end{align*} Markov's inequality and the exponential moment bound give \begin{align*} \mathbb P(Z-\mathbb E[Z]\ge t)\le e^{-\lambda t}\mathbb E[e^{\lambda(Z-\mathbb E[Z])}]. \end{align*} Therefore \begin{align*} \mathbb P(Z-\mathbb E[Z]\ge t)\le \exp\left(-\lambda t+\frac{\lambda^2C}{8}\right). \end{align*} The quadratic function \begin{align*} \lambda\mapsto -\lambda t+\frac{\lambda^2C}{8} \end{align*} is minimized over $\lambda>0$ at \begin{align*} \lambda=\frac{4t}{C}. \end{align*} Substituting this value gives \begin{align*} \mathbb P(Z-\mathbb E[Z]\ge t) \le \exp\left(-\frac{2t^2}{C}\right). \end{align*} [/step] [step:Apply the same argument to $-Z$ and combine the two tails] In this step assume $C>0$; the case $C=0$ is handled at the end of the proof. Define $\tilde f:\mathcal X_1\times\cdots\times\mathcal X_n\to\mathbb R$ by \begin{align*} \tilde f(x_1,\dots,x_n):=-f(x_1,\dots,x_n). \end{align*} If two inputs differ only in coordinate $i$, then \begin{align*} |\tilde f(x_1,\dots,x_n)-\tilde f(y_1,\dots,y_n)| = |f(x_1,\dots,x_n)-f(y_1,\dots,y_n)| \le c_i. \end{align*} Thus $\tilde f$ has the same bounded differences constants, and applying the upper-tail estimate to $-Z=\tilde f(X_1,\dots,X_n)$ gives \begin{align*} \mathbb P(\mathbb E[Z]-Z\ge t) = \mathbb P((-Z)-\mathbb E[-Z]\ge t) \le \exp\left(-\frac{2t^2}{C}\right). \end{align*} Finally, \begin{align*} \{|Z-\mathbb E[Z]|\ge t\} = \{Z-\mathbb E[Z]\ge t\}\cup\{\mathbb E[Z]-Z\ge t\}. \end{align*} Using [subadditivity of probability](/theorems/1106), \begin{align*} \mathbb P(|Z-\mathbb E[Z]|\ge t)\le \mathbb P(Z-\mathbb E[Z]\ge t)+\mathbb P(\mathbb E[Z]-Z\ge t). \end{align*} Using the two one-sided estimates, we obtain \begin{align*} \mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp\left(-\frac{2t^2}{C}\right). \end{align*} If $C=0$, then each $c_i=0$, so the bounded differences condition shows that changing any coordinate does not change $f$. Hence $Z$ is almost surely constant under the product law of $(X_1,\dots,X_n)$, and therefore $Z=\mathbb E[Z]$ almost surely. This gives the degenerate case. [/step]

Prerequisites (0/6 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Fubini's Theorem

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.

McDiarmid Bounded Differences Inequality (Theorem # 6074)

Discussion

Proof

Prerequisites (0/6 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

McDiarmid Bounded Differences Inequality (Theorem # 6074)

Discussion

Proof

Prerequisites (0/6 completed)

Prerequisites Graph

Explore Further