Azuma-Hoeffding Inequality — Statement & Proof

Azuma-Hoeffding Inequality (Theorem # 6071)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] Write the martingale increments as $X_k := M_k - M_{k-1}$. The bounded-difference assumption gives a conditional Hoeffding bound for each increment, and iterating [conditional expectation](/page/Conditional%20Expectation) yields a Gaussian-type bound for the moment generating function of $\sum_{k=1}^{n} X_k$. Exponential Markov's inequality then converts this moment generating function estimate into a one-sided tail bound, and optimizing over the exponential parameter gives the stated exponent. The two-sided estimate follows by applying the same one-sided bound to the martingale $(-M_k)_{k=0}^{n}$ and using the union bound. [/proofplan] [step:Reduce the martingale to bounded mean-zero increments] For each $1 \le k \le n$, define the real-valued [random variable](/page/Random%20Variable) $X_k: \Omega \to \mathbb R$ by \begin{align*} X_k(\omega) = M_k(\omega) - M_{k-1}(\omega). \end{align*} Since $M_k$ and $M_{k-1}$ are integrable and $\mathcal F_k$-measurable, $X_k$ is integrable and $\mathcal F_k$-measurable. Since $(M_k)_{k=0}^{n}$ is a martingale with respect to $(\mathcal F_k)_{k=0}^{n}$, for every $1 \le k \le n$ we first use linearity of conditional expectation and the fact that $M_{k-1}$ is $\mathcal F_{k-1}$-measurable to get \begin{align*} \mathbb E[X_k \mid \mathcal F_{k-1}] = \mathbb E[M_k \mid \mathcal F_{k-1}] - M_{k-1}. \end{align*} The martingale property gives $\mathbb E[M_k \mid \mathcal F_{k-1}] = M_{k-1}$ $\mathbb P$-a.s., hence \begin{align*} \mathbb E[X_k \mid \mathcal F_{k-1}] = 0 \end{align*} $\mathbb P$-a.s. Also, \begin{align*} \sum_{k=1}^{n} X_k = M_n - M_0 \end{align*} by telescoping. The hypothesis gives $|X_k| \le c_k$ $\mathbb P$-a.s. for every $1 \le k \le n$. [/step] [step:Prove the conditional Hoeffding bound for one increment] [claim:Conditional Hoeffding bound] Let $\mathcal G \subset \mathcal F$ be a sub-$\sigma$-algebra, let $Y: \Omega \to \mathbb R$ be an integrable random variable, and let $c \ge 0$ be deterministic. Suppose $|Y| \le c$ $\mathbb P$-a.s. and $\mathbb E[Y \mid \mathcal G] = 0$ $\mathbb P$-a.s. Then, for every $\lambda \in \mathbb R$, \begin{align*} \mathbb E[e^{\lambda Y} \mid \mathcal G] \le \exp\left(\frac{\lambda^2 c^2}{2}\right) \end{align*} $\mathbb P$-a.s. [/claim] [proof] If $c=0$, then $Y=0$ $\mathbb P$-a.s., and therefore \begin{align*} \mathbb E[e^{\lambda Y} \mid \mathcal G] = 1 = \exp(0) \end{align*} $\mathbb P$-a.s. Hence assume $c>0$. For every $y \in [-c,c]$, convexity of the map $r \mapsto e^{\lambda r}$ on $\mathbb R$ gives the chord bound \begin{align*} e^{\lambda y} \le \frac{c-y}{2c}e^{-\lambda c}+\frac{c+y}{2c}e^{\lambda c}. \end{align*} Applying this pointwise with $y=Y$ and taking conditional expectation with respect to $\mathcal G$ gives \begin{align*} \mathbb E[e^{\lambda Y}\mid \mathcal G] \le \mathbb E\left[\frac{c-Y}{2c}e^{-\lambda c}+\frac{c+Y}{2c}e^{\lambda c}\,\middle|\,\mathcal G\right]. \end{align*} By linearity of conditional expectation and the hypothesis $\mathbb E[Y\mid \mathcal G]=0$ $\mathbb P$-a.s., \begin{align*} \mathbb E[e^{\lambda Y}\mid \mathcal G] \le \frac{e^{-\lambda c}+e^{\lambda c}}{2}. \end{align*} Define the function $h: \mathbb R \to \mathbb R$ by \begin{align*} h(u)=\frac{e^u+e^{-u}}{2}. \end{align*} Using the [power series](/page/Power%20Series) expansions of $e^u$ and $e^{-u}$, \begin{align*} h(u)=\sum_{m=0}^{\infty}\frac{u^{2m}}{(2m)!}. \end{align*} Since $(2m)! \ge 2^m m!$ for every integer $m \ge 0$, \begin{align*} h(u) \le \sum_{m=0}^{\infty}\frac{(u^2/2)^m}{m!}=e^{u^2/2}, \end{align*} because $(2m)! \ge 2^m m!$ for every integer $m \ge 0$. Taking $u=\lambda c$ yields \begin{align*} \mathbb E[e^{\lambda Y}\mid \mathcal G] \le \exp\left(\frac{\lambda^2 c^2}{2}\right) \end{align*} $\mathbb P$-a.s. [/proof] [guided] The point of this claim is to replace a bounded mean-zero random variable by the worst possible exponential moment allowed by its range. The proof is conditional, because later we will apply it to $X_k$ after conditioning on the past $\mathcal F_{k-1}$. If $c=0$, the bound $|Y|\le 0$ implies $Y=0$ $\mathbb P$-a.s. Therefore $e^{\lambda Y}=1$ $\mathbb P$-a.s., and \begin{align*} \mathbb E[e^{\lambda Y}\mid \mathcal G] = 1 = \exp(0). \end{align*} So assume $c>0$. For each number $y\in[-c,c]$, write $y$ as a convex combination of the two endpoints $-c$ and $c$: \begin{align*} y = \frac{c-y}{2c}(-c)+\frac{c+y}{2c}c, \end{align*} where both coefficients are non-negative and their sum is $1$. Since the map $r\mapsto e^{\lambda r}$ is convex on $\mathbb R$, its value at this convex combination is at most the corresponding convex combination of its endpoint values: \begin{align*} e^{\lambda y} \le \frac{c-y}{2c}e^{-\lambda c}+\frac{c+y}{2c}e^{\lambda c}. \end{align*} Applying this inequality with $y=Y(\omega)$ gives a pointwise inequality outside a $\mathbb P$-null set, because $|Y|\le c$ $\mathbb P$-a.s. Taking conditional expectation with respect to $\mathcal G$ preserves the inequality: \begin{align*} \mathbb E[e^{\lambda Y}\mid \mathcal G] \le \mathbb E\left[\frac{c-Y}{2c}e^{-\lambda c}+\frac{c+Y}{2c}e^{\lambda c}\,\middle|\,\mathcal G\right]. \end{align*} By linearity of conditional expectation, \begin{align*} \mathbb E\left[\frac{c-Y}{2c}e^{-\lambda c}+\frac{c+Y}{2c}e^{\lambda c}\,\middle|\,\mathcal G\right]=\frac{c-\mathbb E[Y\mid \mathcal G]}{2c}e^{-\lambda c}+\frac{c+\mathbb E[Y\mid \mathcal G]}{2c}e^{\lambda c}. \end{align*} The conditional mean-zero hypothesis is now used exactly once: $\mathbb E[Y\mid\mathcal G]=0$ $\mathbb P$-a.s. Hence \begin{align*} \mathbb E[e^{\lambda Y}\mid \mathcal G] \le \frac{e^{-\lambda c}+e^{\lambda c}}{2}. \end{align*} It remains to bound the hyperbolic cosine term by a Gaussian exponential. Define the function $h: \mathbb R \to \mathbb R$ by \begin{align*} h(u)=\frac{e^u+e^{-u}}{2}. \end{align*} Using the power series for the exponential functions, the odd powers cancel and give \begin{align*} h(u) = \sum_{m=0}^{\infty}\frac{u^{2m}}{(2m)!}. \end{align*} For every integer $m\ge 0$, the factorial inequality $(2m)!\ge 2^m m!$ gives \begin{align*} \frac{u^{2m}}{(2m)!}\le \frac{(u^2/2)^m}{m!}. \end{align*} Summing over $m$ yields \begin{align*} h(u)\le \sum_{m=0}^{\infty}\frac{(u^2/2)^m}{m!}=e^{u^2/2}. \end{align*} With $u=\lambda c$, this proves \begin{align*} \mathbb E[e^{\lambda Y}\mid \mathcal G] \le \exp\left(\frac{\lambda^2 c^2}{2}\right) \end{align*} $\mathbb P$-a.s. [/guided] [/step] [step:Iterate conditional expectations to bound the full moment generating function] Fix $\lambda \in \mathbb R$. For each $0 \le j \le n$, define the partial-sum random variable $S_j: \Omega \to \mathbb R$ by \begin{align*} S_j(\omega)=\sum_{k=1}^{j} X_k(\omega), \end{align*} with the convention $S_0=0$. Since $|X_k|\le c_k$ $\mathbb P$-a.s. and $n<\infty$, each $e^{\lambda S_j}$ is integrable. For $1\le k\le n$, the random variable $S_{k-1}$ is $\mathcal F_{k-1}$-measurable. Applying the conditional Hoeffding bound with $\mathcal G=\mathcal F_{k-1}$, $Y=X_k$, and $c=c_k$ gives \begin{align*} \mathbb E[e^{\lambda X_k}\mid \mathcal F_{k-1}] \le \exp\left(\frac{\lambda^2 c_k^2}{2}\right) \end{align*} $\mathbb P$-a.s. Therefore, using the tower property and then pulling the $\mathcal F_{k-1}$-measurable factor $e^{\lambda S_{k-1}}$ out of the inner conditional expectation, \begin{align*} \mathbb E[e^{\lambda S_k}]=\mathbb E\left[e^{\lambda S_{k-1}}\mathbb E[e^{\lambda X_k}\mid \mathcal F_{k-1}]\right]. \end{align*} The conditional Hoeffding bound gives \begin{align*} \mathbb E[e^{\lambda S_k}]\le \exp\left(\frac{\lambda^2 c_k^2}{2}\right)\mathbb E[e^{\lambda S_{k-1}}]. \end{align*} Iterating this inequality from $k=n$ down to $k=1$ gives \begin{align*} \mathbb E[e^{\lambda(M_n-M_0)}]=\mathbb E[e^{\lambda S_n}]\le \prod_{k=1}^{n}\exp\left(\frac{\lambda^2 c_k^2}{2}\right)=\exp\left(\frac{\lambda^2}{2}\sum_{k=1}^{n}c_k^2\right). \end{align*} [/step] [step:Apply exponential Markov and optimize the parameter] Assume first that $S=\sum_{k=1}^{n}c_k^2>0$. Fix $t\ge 0$ and $\lambda>0$. Define the non-negative random variable $Z_\lambda: \Omega \to [0,\infty)$ by \begin{align*} Z_\lambda(\omega)=\exp(\lambda(M_n(\omega)-M_0(\omega))). \end{align*} On the event $\{M_n-M_0\ge t\}$, one has $Z_\lambda\ge e^{\lambda t}$. Hence, by monotonicity of expectation and the moment generating function estimate, \begin{align*} e^{\lambda t}\mathbb P(M_n-M_0\ge t)\le \mathbb E[Z_\lambda]\le \exp\left(\frac{\lambda^2 S}{2}\right). \end{align*} Thus \begin{align*} \mathbb P(M_n-M_0\ge t) \le \exp\left(-\lambda t+\frac{\lambda^2S}{2}\right). \end{align*} Choose \begin{align*} \lambda := \frac{t}{S}. \end{align*} If $t=0$, this choice gives $\lambda=0$, and the desired inequality is simply $\mathbb P(M_n-M_0\ge 0)\le 1$. If $t>0$, the chosen $\lambda$ is positive and gives \begin{align*} -\lambda t+\frac{\lambda^2S}{2} = -\frac{t^2}{S}+\frac{t^2}{2S} = -\frac{t^2}{2S}. \end{align*} Therefore, for every $t\ge 0$, \begin{align*} \mathbb P(M_n-M_0\ge t) \le \exp\left(-\frac{t^2}{2S}\right). \end{align*} [guided] The moment generating function estimate becomes a tail estimate by exponentiating the event. Fix $t\ge 0$ and choose a parameter $\lambda>0$. Define the random variable $Z_\lambda: \Omega \to [0,\infty)$ by \begin{align*} Z_\lambda(\omega)=\exp(\lambda(M_n(\omega)-M_0(\omega))). \end{align*} This random variable is non-negative. On the event $\{M_n-M_0\ge t\}$, the monotonicity of the exponential function gives \begin{align*} Z_\lambda = \exp(\lambda(M_n-M_0))\ge e^{\lambda t}. \end{align*} Equivalently, \begin{align*} e^{\lambda t}\mathbb 1_{\{M_n-M_0\ge t\}} \le Z_\lambda. \end{align*} Taking expectation and using monotonicity of expectation gives \begin{align*} e^{\lambda t}\mathbb P(M_n-M_0\ge t) \le \mathbb E[Z_\lambda]. \end{align*} The previous step bounds the right-hand side: \begin{align*} \mathbb E[Z_\lambda]=\mathbb E[e^{\lambda(M_n-M_0)}]\le \exp\left(\frac{\lambda^2}{2}\sum_{k=1}^{n}c_k^2\right). \end{align*} Writing \begin{align*} S:=\sum_{k=1}^{n}c_k^2, \end{align*} we obtain \begin{align*} \mathbb P(M_n-M_0\ge t) \le \exp\left(-\lambda t+\frac{\lambda^2S}{2}\right). \end{align*} Now we choose $\lambda$ to make the exponent as small as possible. Define the function $q:(0,\infty)\to\mathbb R$ by \begin{align*} q(\lambda)=-\lambda t+\frac{\lambda^2S}{2}. \end{align*} This is a quadratic polynomial with derivative $q'(\lambda)=-t+\lambda S$. When $S>0$ and $t>0$, the minimum over $\lambda>0$ occurs at \begin{align*} \lambda=\frac{t}{S}. \end{align*} Substituting this value gives \begin{align*} -\lambda t+\frac{\lambda^2S}{2} = -\frac{t^2}{S}+\frac{t^2}{2S} = -\frac{t^2}{2S}. \end{align*} Therefore, for $t>0$, \begin{align*} \mathbb P(M_n-M_0\ge t) \le \exp\left(-\frac{t^2}{2S}\right). \end{align*} For $t=0$, the same displayed inequality reads $\mathbb P(M_n-M_0\ge 0)\le 1$, which is true because probabilities are at most $1$. Thus the one-sided estimate holds for every $t\ge 0$ when $S>0$. [/guided] [/step] [step:Apply the one-sided estimate to the negative martingale for the two-sided bound] Assume $S>0$. Define the process $(N_k)_{k=0}^{n}$ by declaring, for each $0\le k\le n$, the random variable $N_k: \Omega\to\mathbb R$ as \begin{align*} N_k(\omega)=-M_k(\omega). \end{align*} Since $(M_k)_{k=0}^{n}$ is a martingale with respect to $(\mathcal F_k)_{k=0}^{n}$, the process $(N_k)_{k=0}^{n}$ is also a martingale with respect to the same filtration. Moreover, \begin{align*} |N_k-N_{k-1}|=|M_k-M_{k-1}|\le c_k \end{align*} $\mathbb P$-a.s. for every $1\le k\le n$. Applying the one-sided estimate to $(N_k)_{k=0}^{n}$ gives \begin{align*} \mathbb P(M_n-M_0\le -t) = \mathbb P(N_n-N_0\ge t) \le \exp\left(-\frac{t^2}{2S}\right). \end{align*} Since \begin{align*} \{|M_n-M_0|\ge t\} \subset \{M_n-M_0\ge t\}\cup\{M_n-M_0\le -t\}, \end{align*} the union bound gives \begin{align*} \mathbb P(|M_n-M_0|\ge t)\le \mathbb P(M_n-M_0\ge t)+\mathbb P(M_n-M_0\le -t)\le 2\exp\left(-\frac{t^2}{2S}\right). \end{align*} [/step] [step:Handle the degenerate variance proxy] If $S=0$, then $c_k=0$ for every $1\le k\le n$. Hence $|M_k-M_{k-1}|\le 0$ $\mathbb P$-a.s. for every $k$, so $M_k=M_{k-1}$ $\mathbb P$-a.s. for every $k$. Since there are finitely many indices, intersecting the corresponding probability-one events gives \begin{align*} M_n=M_0 \end{align*} $\mathbb P$-a.s. Therefore, for every $t>0$, \begin{align*} \mathbb P(M_n-M_0\ge t)=0 \end{align*} and \begin{align*} \mathbb P(|M_n-M_0|\ge t)=0. \end{align*} Together with the estimates proved for $S>0$, this proves the theorem. [/step]

Prerequisites (0/6 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

What brings you to Androma?

Start with a route through the knowledge graph.

Azuma-Hoeffding Inequality (Theorem # 6071)

Discussion

Proof

Prerequisites (0/6 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Azuma-Hoeffding Inequality (Theorem # 6071)

Discussion

Proof

Prerequisites (0/6 completed)

Prerequisites Graph

Explore Further