Bennett Inequality — Statement & Proof

Bennett Inequality (Theorem # 6050)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We first prove the one-variable exponential moment estimate that uses only the upper bound $X_i\le b$, the zero mean, and the variance. Independence then multiplies the one-variable estimates into an exponential moment bound for the sum. A direct estimate for the nonnegative [random variable](/page/Random%20Variable) $e^{\lambda S_n}$ converts this moment bound into a tail bound, and the remaining work is the explicit optimization in the exponential parameter. The degenerate case $V=0$ is handled separately because the displayed formula divides by $V$ in the argument of $h$. [/proofplan] [step:Reduce the zero variance case to an almost sure identity] Assume first that $V=0$. For each $1\le i\le n$, set $\sigma_i^2:=\operatorname{Var}(X_i)$. Since each $\sigma_i^2\ge 0$ and $\sum_{i=1}^n\sigma_i^2=0$, we have $\sigma_i^2=0$ for every $i$. Because $\mathbb E[X_i]=0$, this gives \begin{align*} \mathbb E[X_i^2]=\operatorname{Var}(X_i)+(\mathbb E[X_i])^2=0. \end{align*} Since $X_i^2\ge 0$, it follows that $X_i=0$ almost surely. The finite intersection of the probability-one events $\{X_i=0\}$ has probability one, so \begin{align*} \sum_{i=1}^n X_i=0 \end{align*} almost surely. Hence $\mathbb P(\sum_{i=1}^n X_i\ge t)=0$ for $t>0$ and equals $1$ for $t=0$. [/step] [step:Prove the bounded variance exponential moment estimate] Assume now that $V>0$. For $1\le i\le n$, define $\sigma_i^2:=\operatorname{Var}(X_i)$. We claim that for every $\lambda\ge 0$ and every $1\le i\le n$, \begin{align*} \mathbb E[e^{\lambda X_i}] \le \exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right). \end{align*} Fix $\lambda\ge 0$ and define $\mu:=\lambda b$. Let $\mathcal L^1$ denote one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $[0,1]$ with its Borel $\sigma$-algebra. Define the auxiliary function $q:\mathbb R\to[0,\infty)$ by setting $q(0):=1/2$ and, for $y\ne0$, setting \begin{align*} q(y):=\frac{e^y-1-y}{y^2}. \end{align*} The identity \begin{align*} e^y-1-y = y^2\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s) \end{align*} shows that \begin{align*} q(y)=\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s). \end{align*} Therefore $q$ is increasing on $\mathbb R$, because for $y_1\le y_2$ the integrand satisfies $(1-s)e^{sy_1}\le (1-s)e^{sy_2}$ for every $s\in[0,1]$. Since $\mu u\le\mu$, we obtain \begin{align*} e^{\mu u}-1-\mu u \le u^2(e^\mu-1-\mu). \end{align*} Define the real-valued random variable $U_i:\Omega\to\mathbb R$ by \begin{align*} U_i(\omega):=\frac{X_i(\omega)}{b}. \end{align*} Since $X_i\le b$ almost surely, $U_i\le 1$ almost surely. Applying the preceding scalar inequality with $u=U_i$ gives \begin{align*} e^{\lambda X_i} \le 1+\lambda X_i+\frac{X_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right) \end{align*} almost surely. Taking expectations and using $\mathbb E[X_i]=0$ and $\mathbb E[X_i^2]=\sigma_i^2$ gives \begin{align*} \mathbb E[e^{\lambda X_i}] \le 1+\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right). \end{align*} Since $1+a\le e^a$ for every $a\ge 0$, and $e^{\lambda b}-1-\lambda b\ge 0$, the claimed estimate follows. [guided] The key point is to dominate $e^{\lambda x}$ on the whole half-line $x\le b$ by a quadratic expression whose linear term disappears after expectation and whose quadratic term becomes the variance. Fix $\lambda\ge 0$ and set $\mu:=\lambda b$. We first prove the scalar inequality \begin{align*} e^{\mu u} \le 1+\mu u+u^2(e^\mu-1-\mu) \end{align*} for every real number $u\le 1$. Let $\mathcal L^1$ denote one-dimensional Lebesgue measure on $[0,1]$ with its Borel $\sigma$-algebra. To prove the scalar inequality, define $q:\mathbb R\to[0,\infty)$ by setting $q(0):=1/2$ and, for $y\ne0$, setting \begin{align*} q(y):=\frac{e^y-1-y}{y^2}. \end{align*} The useful representation of $q$ is obtained from the integral remainder formula for the exponential function: \begin{align*} e^y-1-y = y^2\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s). \end{align*} Thus \begin{align*} q(y)=\int_0^1 (1-s)e^{sy}\,d\mathcal L^1(s). \end{align*} If $y_1\le y_2$, then $e^{sy_1}\le e^{sy_2}$ for every $s\in[0,1]$, and hence \begin{align*} (1-s)e^{sy_1}\le (1-s)e^{sy_2}. \end{align*} Integrating with respect to $\mathcal L^1$ over $[0,1]$ gives $q(y_1)\le q(y_2)$, so $q$ is increasing. Now take any $u\le 1$. Since $\mu\ge 0$, we have $\mu u\le\mu$. The monotonicity of $q$ gives \begin{align*} \frac{e^{\mu u}-1-\mu u}{(\mu u)^2} \le \frac{e^\mu-1-\mu}{\mu^2} \end{align*} when $\mu u\ne0$ and $\mu\ne0$, with the same conclusion by continuity in the cases where one of these quantities is zero. Multiplying by $\mu^2u^2$ yields \begin{align*} e^{\mu u}-1-\mu u \le u^2(e^\mu-1-\mu), \end{align*} which is the desired scalar inequality. We apply this scalar inequality to the normalized random variable $U_i:\Omega\to\mathbb R$ defined by \begin{align*} U_i(\omega):=\frac{X_i(\omega)}{b}. \end{align*} The hypothesis $X_i\le b$ almost surely is exactly what gives $U_i\le1$ almost surely. Therefore \begin{align*} e^{\lambda X_i} \le 1+\lambda X_i+\frac{X_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right) \end{align*} almost surely. Taking expectations is legitimate because $X_i^2$ is integrable by the finite variance hypothesis, and $e^{\lambda X_i}\le e^{\lambda b}$ almost surely. Using $\mathbb E[X_i]=0$ and \begin{align*} \mathbb E[X_i^2]=\operatorname{Var}(X_i)+(\mathbb E[X_i])^2=\sigma_i^2 \end{align*} gives \begin{align*} \mathbb E[e^{\lambda X_i}] \le 1+\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right). \end{align*} Finally $e^{\lambda b}-1-\lambda b\ge0$, since $e^r\ge1+r$ for $r\ge0$, and $1+a\le e^a$ for $a\ge0$. Hence \begin{align*} \mathbb E[e^{\lambda X_i}] \le \exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right). \end{align*} [/guided] [/step] [step:Multiply the one-variable estimates for the sum] Define the partial sum random variable $S_n:\Omega\to\mathbb R$ by \begin{align*} S_n(\omega):=\sum_{i=1}^n X_i(\omega). \end{align*} For every $\lambda\ge0$, the independence of $X_1,\dots,X_n$ implies the independence of $e^{\lambda X_1},\dots,e^{\lambda X_n}$. Since $X_i\le b$ almost surely for each $i$, each factor satisfies $0\le e^{\lambda X_i}\le e^{\lambda b}$ almost surely and is integrable. Hence \begin{align*} \mathbb E[e^{\lambda S_n}]=\mathbb E\left[\prod_{i=1}^n e^{\lambda X_i}\right]. \end{align*} Independence of the integrable nonnegative random variables $e^{\lambda X_1},\dots,e^{\lambda X_n}$ gives \begin{align*} \mathbb E\left[\prod_{i=1}^n e^{\lambda X_i}\right]=\prod_{i=1}^n \mathbb E[e^{\lambda X_i}]. \end{align*} Applying the one-variable estimate to each factor gives \begin{align*} \prod_{i=1}^n \mathbb E[e^{\lambda X_i}]\le\prod_{i=1}^n \exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right). \end{align*} Using $V=\sum_{i=1}^n\sigma_i^2$, the product is \begin{align*} \prod_{i=1}^n \exp\left(\frac{\sigma_i^2}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right)=\exp\left(\frac{V}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right). \end{align*} [/step] [step:Convert the exponential moment bound into a tail bound] Fix $t\ge0$ and $\lambda\ge0$. Since $e^{\lambda S_n}$ is nonnegative and the event $\{S_n\ge t\}$ implies $e^{\lambda S_n}\ge e^{\lambda t}$, we have \begin{align*} e^{\lambda t}\mathbb P(S_n\ge t) \le \mathbb E[e^{\lambda S_n}]. \end{align*} Combining this with the preceding exponential moment estimate gives \begin{align*} \mathbb P(S_n\ge t) \le \exp\left(-\lambda t+\frac{V}{b^2}\left(e^{\lambda b}-1-\lambda b\right)\right) \end{align*} for every $\lambda\ge0$. [/step] [step:Optimize the exponential parameter] Define $\Phi:[0,\infty)\to\mathbb R$ by \begin{align*} \Phi(\lambda):=-\lambda t+\frac{V}{b^2}\left(e^{\lambda b}-1-\lambda b\right). \end{align*} If $t=0$, choosing $\lambda=0$ gives \begin{align*} \mathbb P(S_n\ge0)\le 1 = \exp\left(-\frac{V}{b^2}h(0)\right), \end{align*} because $h(0)=0$. Assume now that $t>0$. Differentiating $\Phi$ gives \begin{align*} \Phi'(\lambda)=-t+\frac{V}{b}\left(e^{\lambda b}-1\right). \end{align*} Differentiating once more gives \begin{align*} \Phi''(\lambda)=V e^{\lambda b}>0. \end{align*} Thus $\Phi$ is strictly convex, and its unique minimizer $\lambda_*$ is determined by \begin{align*} e^{\lambda_* b}=1+\frac{bt}{V}. \end{align*} Hence \begin{align*} \lambda_*=\frac{1}{b}\log\left(1+\frac{bt}{V}\right). \end{align*} Define $r\in(0,\infty)$ by \begin{align*} r:=\frac{bt}{V}. \end{align*} Then \begin{align*} \frac{t}{b}=\frac{V}{b^2}r \end{align*} and \begin{align*} \lambda_* = \frac{1}{b}\log(1+r). \end{align*} Substituting these identities into $\Phi$ gives \begin{align*} \Phi(\lambda_*)=-\frac{V}{b^2}r\log(1+r)+\frac{V}{b^2}\left(r-\log(1+r)\right). \end{align*} Factoring $-V/b^2$ yields \begin{align*} \Phi(\lambda_*)=-\frac{V}{b^2}\left[(1+r)\log(1+r)-r\right]. \end{align*} By the definition $h(r)=(1+r)\log(1+r)-r$, this becomes \begin{align*} \Phi(\lambda_*)=-\frac{V}{b^2}h\left(\frac{bt}{V}\right). \end{align*} Therefore \begin{align*} \mathbb P(S_n\ge t) \le \exp\left(-\frac{V}{b^2}h\left(\frac{bt}{V}\right)\right). \end{align*} Since $S_n=\sum_{i=1}^n X_i$, this is exactly the asserted Bennett bound. [/step]

Prerequisites (0/6 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

Expectation Definition Continuity Definition Lebesgue Measure Definition Random Variable Definition Variance Definition Event Definition Harmonicity of the Brownian Dirichlet Solution Brownian Motion Stable Embedding Lower Bound for Sparse Vectors Probability & Statistics Canonical Minimax Rates in Gaussian High-Dimensional Models Probability & Statistics Asymptotic Normality of Random Design Ordinary Least Squares Probability & Statistics Conditional Hoeffding Lemma for Bounded Martingale Differences Probability & Statistics Time Inversion of Brownian Motion Brownian Motion Doob's Upcrossing Inequality Martingale Theory Planted-Clique-Based Sparse PCA Lower Bound Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.

Bennett Inequality (Theorem # 6050)

Discussion

Proof

Prerequisites (0/6 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Bennett Inequality (Theorem # 6050)

Discussion

Proof

Prerequisites (0/6 completed)

Prerequisites Graph

Explore Further