This course develops the classical toolkit for proving concentration inequalities, the results that quantify how sharply a [random variable](/page/Random%20Variable) or a random process stays near its typical value. It focuses on the foundational methods used throughout probability and statistics: turning information about moments, variance, or boundedness into tail bounds, and interpreting those bounds as practical statements about rare events. The central goal is to understand not just the inequalities themselves, but the proof ideas behind them and the situations in which each method is most effective.
The chapters build a progression from general principles to increasingly refined and structured bounds. The course begins with moment- and variance-based estimates, then introduces the Chernoff method and moment generating functions as a more systematic way to control tails. From there it derives the classical inequalities for independent bounded sums, including Hoeffding’s inequality, and then sharpens the picture with variance-sensitive results such as Bernstein’s and Bennett’s inequalities. Later chapters develop the language of sub-Gaussian and sub-exponential random variables, apply concentration to empirical means and basic statistical problems, and extend the theory to martingales through Azuma-Hoeffding and to functions of independent variables via McDiarmid’s inequality. The final chapter ties these results together by highlighting the recurring proof patterns that underlie classical concentration theory.
# Introduction
This introductory chapter fixes the language and aims of the course. Concentration inequalities ask for quantitative upper bounds on probabilities of rare deviations, usually at finite sample size and with constants visible. The later chapters develop separate methods, but the same pattern appears throughout: identify structural information about a random variable, translate it into a moment or exponential moment estimate, and optimize the resulting bound.
The course begins in Chapter 1 with inequalities that use only expectation and variance, moves in Chapter 2 to the Chernoff method, then develops Hoeffding, Bernstein, sub-Gaussian, sub-exponential, martingale, and bounded-differences inequalities in later chapters. This chapter is a map of that route. It also records the baseline vocabulary for tails, deviation scales, moment generating functions, and the kinds of results that count as concentration.
## What Concentration Inequalities Measure
A random variable may be complicated, but many probabilistic questions only ask how far it is from a typical value. The first problem is therefore to decide what quantity counts as a deviation and what kind of upper bound is informative.
[definition: Tail Probability]
Let $X: (\Omega, \mathcal F, \mathbb P) \to \mathbb R$ be a real-valued random variable. For $a \in \mathbb R$, the upper tail probability of $X$ at level $a$ is $\mathbb P(X \ge a)$, and the lower tail probability is $\mathbb P(X \le a)$.
[/definition]
Tail probabilities become useful when the level is expressed relative to a reference point such as the mean or median. This motivates packaging the event together with its centre, its direction, and its deviation parameter.
[definition: Deviation Inequality]
Let $X: (\Omega, \mathcal F, \mathbb P) \to \mathbb R$ be a real-valued random variable and let $m \in \mathbb R$ be a reference point. A deviation inequality for $X$ around $m$ is an upper bound on quantities such as $\mathbb P(X - m \ge t)$, $\mathbb P(m - X \ge t)$, or $\mathbb P(|X-m| \ge t)$, valid for a specified range of $t \ge 0$.
[/definition]
A deviation inequality is nonasymptotic when it applies at the sample size under discussion, rather than only after taking a limit. This distinction is central: the course studies estimates that can be used before invoking a law of large numbers or a [central limit theorem](/theorems/521).
[example: Sample Mean Deviation]
Let $X_1,\dots,X_n$ be i.i.d. real-valued random variables with $\mathbb E[X_1]=\mu$, and set
\begin{align*}
\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
Since $\mu=(1/n)\sum_{i=1}^n \mu$, the centered sample mean can be written as
\begin{align*}
\bar X_n-\mu=\frac{1}{n}\sum_{i=1}^n X_i-\frac{1}{n}\sum_{i=1}^n \mu=\frac{1}{n}\sum_{i=1}^n (X_i-\mu).
\end{align*}
Thus a concentration result for the sample mean is an upper bound on
\begin{align*}
\mathbb P(|\bar X_n-\mu|\ge t)=\mathbb P\left(\left|\frac{1}{n}\sum_{i=1}^n (X_i-\mu)\right|\ge t\right)
\end{align*}
in terms of $n$, $t$, and assumptions on the common distribution.
For example, suppose in addition that $\operatorname{Var}(X_1)=\sigma^2<\infty$. Scaling variance gives
\begin{align*}
\operatorname{Var}(\bar X_n)=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^n X_i\right).
\end{align*}
Independence makes the variance of the sum equal to the sum of the variances, so
\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n \operatorname{Var}(X_i)=\sum_{i=1}^n \sigma^2=n\sigma^2.
\end{align*}
Substituting this into the previous display gives
\begin{align*}
\operatorname{Var}(\bar X_n)=\frac{1}{n^2}\,n\sigma^2=\frac{\sigma^2}{n}.
\end{align*}
Applying *Chebyshev Inequality* to $\bar X_n$ therefore yields, for $t>0$,
\begin{align*}
\mathbb P(|\bar X_n-\mu|\ge t)\le \frac{\operatorname{Var}(\bar X_n)}{t^2}=\frac{\sigma^2}{n t^2}.
\end{align*}
Finite variance alone gives polynomial decay in the deviation parameter $t$, while stronger hypotheses such as boundedness or finite exponential moments lead later to bounds with exponential decay in $n t^2$ for moderate deviations.
[/example]
The example shows why concentration is sharper than convergence in probability. To compare different theorems systematically, we need vocabulary that records whether the decay is polynomial or exponential and what the natural scale is.
[definition: Weak And Exponential Concentration]
For each $n \ge 1$, let $X_n: (\Omega_n, \mathcal F_n, \mathbb P_n) \to \mathbb R$ be a real-valued random variable with reference point $m_n \in \mathbb R$. The sequence has weak concentration at scale $r_n>0$ if there are constants $C, p>0$ such that
\begin{align*}
\mathbb P_n(|X_n-m_n| \ge t r_n) \le C t^{-p}
\end{align*}
for all $t$ in a specified range. It has exponential concentration at scale $r_n$ if there are constants $c, C>0$ such that
\begin{align*}
\mathbb P_n(|X_n-m_n| \ge t r_n) \le C e^{-c t^2}
\end{align*}
for all $t$ in a specified range.
[/definition]
The exact exponent changes from theorem to theorem: Gaussian-type estimates have $e^{-c t^2}$ behaviour, while exponential-type estimates have $e^{-c t}$ behaviour in the far tail. The purpose of the definition is to separate polynomial control from exponential control, not to impose a single universal formula.
## From Structural Information To Tail Bounds
The guiding question for the course is how assumptions on $X$ are converted into bounds for $\mathbb P(X \ge a)$. The classical methods are organized by the kind of information available: moments, variance, bounded range, independence, or control of a moment generating function.
[quotetheorem:514]
[citeproof:514]
The nonnegativity hypothesis is essential to the pointwise comparison: if $Y$ can be negative, its expectation may be small while its upper tail is large. For instance, if $Y$ takes the values $-100$ and $1$ with probabilities $1/2$ each, then $\mathbb E[Y]<0$ although $\mathbb P(Y\ge 1)=1/2$, so the displayed bound would give no valid positive control. Integrability is also part of the statement because the right-hand side must be a finite number that can be compared with a probability. Markov's inequality does not say that tails are small in an absolute sense; it says that a nonnegative random variable cannot frequently exceed a level much larger than its mean. Its real strength in this course comes from choosing $Y$ as a transform of the quantity of interest. Taking $Y=|X-m|^p$ gives moment bounds; taking $Y=e^{\lambda X}$ gives the Chernoff method.
[example: Moment Bound From Markov]
Let $X$ be a real-valued random variable, let $m \in \mathbb R$, and suppose $\mathbb E[|X-m|^p]<\infty$ for some $p\ge 1$. For $t>0$, set
\begin{align*}
Y=|X-m|^p.
\end{align*}
Then $Y\ge 0$ and $\mathbb E[Y]=\mathbb E[|X-m|^p]<\infty$, so *[Markov Inequality](/theorems/514)* applied at the level $t^p>0$ gives
\begin{align*}
\mathbb P(Y\ge t^p)\le \frac{\mathbb E[Y]}{t^p}.
\end{align*}
Substituting $Y=|X-m|^p$ yields
\begin{align*}
\mathbb P(|X-m|^p\ge t^p)\le \frac{\mathbb E[|X-m|^p]}{t^p}.
\end{align*}
Since $t>0$ and the map $u\mapsto u^p$ is increasing on $[0,\infty)$, the events are equal:
\begin{align*}
\{|X-m|\ge t\}=\{|X-m|^p\ge t^p\}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(|X-m|\ge t)\le \frac{\mathbb E[|X-m|^p]}{t^p}, \qquad t>0.
\end{align*}
A finite $p$th centered moment therefore gives polynomial tail control of order $t^{-p}$; exponential decay requires stronger information than this moment assumption alone.
[/example]
Variance is the first structural quantity that captures centering. It measures average squared fluctuation and leads to a universal bound for deviations around the mean.
[quotetheorem:1126]
[citeproof:1126]
The finite second moment hypothesis is not cosmetic: if $\mathbb E[X^2]=\infty$, then $\operatorname{Var}(X)$ is not a finite quantity and Chebyshev's bound has no content. A concrete obstruction is a Pareto-type random variable with $\mathbb P(|X|\ge t)=t^{-\alpha}$ for $t\ge 1$ and $1<\alpha\le 2$; it may have a finite mean, but its variance is infinite. Centering at $\mathbb E[X]$ is also part of what makes the statement useful, because variance measures squared fluctuation around the mean rather than around an arbitrary location. [Chebyshev's inequality](/theorems/1126) cannot detect whether the underlying tail is Gaussian, exponential, or merely polynomial; all such distributions are compressed into the single number $\operatorname{Var}(X)$. It explains the baseline rate for sample means: if the variance is finite and the variables are independent, the variance of the average is of order $1/n$. The next example records the computation that later exponential inequalities will improve.
[example: Chebyshev For Sample Means]
Let $X_1,\dots,X_n$ be independent real-valued random variables with common mean $\mu$ and common variance $\sigma^2<\infty$, and set
\begin{align*}
\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
By linearity of expectation,
\begin{align*}
\mathbb E[\bar X_n]=\mathbb E\left[\frac{1}{n}\sum_{i=1}^n X_i\right].
\end{align*}
Pulling out the constant $1/n$ and distributing expectation over the finite sum gives
\begin{align*}
\mathbb E[\bar X_n]=\frac{1}{n}\sum_{i=1}^n \mathbb E[X_i].
\end{align*}
Since each $\mathbb E[X_i]=\mu$,
\begin{align*}
\mathbb E[\bar X_n]=\frac{1}{n}\sum_{i=1}^n \mu=\frac{n\mu}{n}=\mu.
\end{align*}
Now compute the variance of $\bar X_n$. Using $\operatorname{Var}(cY)=c^2\operatorname{Var}(Y)$ with $c=1/n$,
\begin{align*}
\operatorname{Var}(\bar X_n)=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^n X_i\right).
\end{align*}
Because $X_1,\dots,X_n$ are independent, the variance of their sum is the sum of their variances:
\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n \operatorname{Var}(X_i).
\end{align*}
Each variance is $\sigma^2$, so
\begin{align*}
\sum_{i=1}^n \operatorname{Var}(X_i)=\sum_{i=1}^n \sigma^2=n\sigma^2.
\end{align*}
Substituting this back gives
\begin{align*}
\operatorname{Var}(\bar X_n)=\frac{1}{n^2}\,n\sigma^2=\frac{\sigma^2}{n}.
\end{align*}
Applying *Chebyshev Inequality* to the random variable $\bar X_n$, whose mean is $\mu$, gives for every $t>0$,
\begin{align*}
\mathbb P(|\bar X_n-\mu|\ge t)\le \frac{\operatorname{Var}(\bar X_n)}{t^2}.
\end{align*}
Using the variance computation,
\begin{align*}
\frac{\operatorname{Var}(\bar X_n)}{t^2}=\frac{\sigma^2/n}{t^2}=\frac{\sigma^2}{n t^2}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(|\bar X_n-\mu| \ge t) \le \frac{\sigma^2}{n t^2}, \qquad t>0.
\end{align*}
Thus finite variance and independence give a nonasymptotic $1/n$ improvement for fixed deviation level $t$, but the dependence on $t$ remains polynomial rather than exponential.
[/example]
## Exponential Moments As The Main Organizing Device
The next question is how to move from polynomial tails to exponential tails. The central device is to apply Markov's inequality not to powers of the deviation, but to exponentials of it.
[definition: Moment Generating Function]
Let $X: (\Omega, \mathcal F, \mathbb P) \to \mathbb R$ be a real-valued random variable, and define
\begin{align*}
D_X=\{\lambda \in \mathbb R : \mathbb E[e^{\lambda X}]<\infty\}.
\end{align*}
The moment generating function of $X$ is the map
\begin{align*}
M_X:D_X \to (0,\infty), \qquad \lambda \mapsto \mathbb E[e^{\lambda X}].
\end{align*}
[/definition]
The moment generating function handles a single random variable, but products appear as soon as independent sums enter the argument. This motivates taking logarithms, because logarithms turn those products into additive quantities and expose convexity.
[definition: Cumulant Generating Function]
Let $X: (\Omega, \mathcal F, \mathbb P) \to \mathbb R$ be a real-valued random variable, and define
\begin{align*}
D_X=\{\lambda \in \mathbb R : \mathbb E[e^{\lambda X}]<\infty\}.
\end{align*}
The cumulant generating function of $X$ is the map
\begin{align*}
\Lambda_X:D_X \to \mathbb R, \qquad \lambda \mapsto \log \mathbb E[e^{\lambda X}].
\end{align*}
[/definition]
The cumulant generating function is useful only after it is connected back to probability. The obstruction is that a tail event such as $\{X\ge a\}$ is an indicator, while $\Lambda_X$ only controls exponential averages. Exponential Markov converts that mismatch into a usable estimate, and the remaining problem is to choose the exponential tilt well.
[quotetheorem:6038]
[citeproof:6038]
The sign restriction $\lambda>0$ is what makes the exponential transform an upper-tail argument: on $\{X\ge a\}$ it gives $e^{\lambda X}\ge e^{\lambda a}$. For a lower-tail bound one instead applies the same idea to $-X$, or equivalently uses negative exponential tilts. The finite exponential moment assumption is essential, since a heavy-tailed random variable with finite variance can still satisfy $\mathbb E[e^{\lambda X}]=\infty$ for every $\lambda>0$, making the displayed estimate unusable. The template is therefore not a complete concentration theorem by itself; without an upper bound on $\Lambda_X(\lambda)$ and a tractable optimization in $\lambda$, it gives only a formal restatement of exponential Markov. The rest of the course supplies such estimates from boundedness, independence, Bernoulli structure, Gaussian structure, Bernstein moment conditions, martingale differences, and Lipschitz-type bounded differences.
[example: Gaussian Chernoff Computation]
Let $Z\sim \mathcal N(0,1)$ and fix $t>0$. For any $\lambda\in\mathbb R$, the standard normal density gives
\begin{align*}
M_Z(\lambda)=\mathbb E[e^{\lambda Z}]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{\lambda z}e^{-z^2/2}\,dz.
\end{align*}
Combining the exponents,
\begin{align*}
e^{\lambda z}e^{-z^2/2}=\exp\left(-\frac{z^2-2\lambda z}{2}\right).
\end{align*}
Completing the square gives
\begin{align*}
z^2-2\lambda z=(z-\lambda)^2-\lambda^2.
\end{align*}
Therefore
\begin{align*}
M_Z(\lambda)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left(-\frac{(z-\lambda)^2-\lambda^2}{2}\right)\,dz.
\end{align*}
Pulling out the factor independent of $z$,
\begin{align*}
M_Z(\lambda)=e^{\lambda^2/2}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-(z-\lambda)^2/2}\,dz.
\end{align*}
The function $z\mapsto (2\pi)^{-1/2}e^{-(z-\lambda)^2/2}$ is the density of $\mathcal N(\lambda,1)$, so its integral over $\mathbb R$ is $1$. Hence
\begin{align*}
M_Z(\lambda)=e^{\lambda^2/2}.
\end{align*}
Applying *[Chernoff Bound](/theorems/6038) Template* to $Z$ gives
\begin{align*}
\mathbb P(Z\ge t)\le \inf_{\lambda>0}\exp\left(\log M_Z(\lambda)-\lambda t\right).
\end{align*}
Since $\log M_Z(\lambda)=\lambda^2/2$,
\begin{align*}
\mathbb P(Z\ge t)\le \inf_{\lambda>0}\exp\left(\frac{\lambda^2}{2}-\lambda t\right).
\end{align*}
For $\lambda>0$, the exponent satisfies
\begin{align*}
\frac{\lambda^2}{2}-\lambda t=\frac{1}{2}(\lambda^2-2\lambda t).
\end{align*}
Completing the square again,
\begin{align*}
\lambda^2-2\lambda t=(\lambda-t)^2-t^2.
\end{align*}
Thus
\begin{align*}
\frac{\lambda^2}{2}-\lambda t=\frac{(\lambda-t)^2}{2}-\frac{t^2}{2}.
\end{align*}
Since $(\lambda-t)^2\ge 0$ and $t>0$ makes $\lambda=t$ admissible in the infimum over $\lambda>0$, the smallest possible exponent is $-t^2/2$. Substituting $\lambda=t$ gives
\begin{align*}
\mathbb P(Z\ge t)\le \exp\left(\frac{t^2}{2}-t^2\right).
\end{align*}
Therefore
\begin{align*}
\mathbb P(Z\ge t)\le e^{-t^2/2}, \qquad t>0.
\end{align*}
The Gaussian upper tail is therefore controlled by an exponential of a negative quadratic, which is the prototype for sub-Gaussian concentration.
[/example]
## The Road Through The Course
The final question for this chapter is how the different tools fit together. Each later lecture block answers the same tail-bound problem under stronger or differently structured hypotheses.
[explanation: Course Architecture]
The first chapter studies bounds from moments and variance. Markov, Chebyshev, Cantelli, and tail-integration identities establish the baseline: weak information gives weak tails, and tail bounds can also be integrated back into moment estimates.
The second chapter introduces the Chernoff method and moment generating functions. It develops the optimization principle, the role of convexity, and the Legendre-transform viewpoint, then applies these ideas to Bernoulli, binomial, Poisson, and Gaussian variables.
Chapters 3 and 4 turn the Chernoff method into the main classical inequalities for independent sums. [Hoeffding's inequality](/theorems/1962) shows what boundedness alone gives, while Bennett and Bernstein explain how variance information sharpens the bounded-summand picture.
Chapters 5 and 6 use sub-Gaussian and sub-exponential random variables as reusable classes. Instead of reproving an mgf calculation for every distribution, we prove closure properties, sum bounds, and norm equivalences that let a concentration argument proceed from a few structural checks.
Chapter 7 translates the preceding tail bounds into statistical guarantees for empirical means, finite classes, and robust estimation. This is where the course shifts from proving inequalities to choosing confidence radii and sample sizes.
Chapters 8 and 9 turn from sums to processes and functions. Martingale inequalities handle sequential dependence through conditional moment estimates, while bounded differences and McDiarmid's inequality show how independent inputs can force concentration for a nonlinear function of many variables.
[/explanation]
This organization reflects a progression in what the random object looks like. Sums of independent variables are the basic case; martingales add time and conditioning; bounded differences treat functions whose individual coordinates have limited influence.
[explanation: How To Read The Blocks]
The formal cards in the course are landmarks, not isolated fragments. Each theorem is meant to answer a local obstruction: Markov handles a nonnegative transform, Chernoff handles an exponential transform, Hoeffding handles bounded independent summands, Bernstein adds variance, and martingale or bounded-differences results handle dependence created by revealing information. The examples are computational because they show the scaling that survives after the theorem is applied, but the main narrative question is always the same: what information is available, and which inequality uses exactly that information without inventing stronger assumptions?
[/explanation]
This viewpoint is deliberately finite-sample. The estimates below are not merely asymptotic slogans; they are tools for deciding what can be concluded at a concrete sample size and confidence level.
[remark: Nonasymptotic Viewpoint]
The course does not replace limit theorems such as the law of large numbers or the [central limit theorem](/theorems/1848). It gives finite-sample estimates that often imply convergence theorems after summing or optimizing the bounds. The advantage is that the same inequality can be used at a specified $n$, with an explicit confidence level and deviation threshold.
[/remark]
A useful way to read the rest of the notes is to track three questions for every theorem. What information about the random variable is assumed? What transform or comparison converts that information into a probability bound? What scale of deviation does the resulting inequality identify?
The opening chapter leaves us with a simple reading guide: identify the assumption, identify the transform or comparison, and identify the deviation scale. Chapter 1 begins by making that guide concrete for the most basic concentration tools, where expectation, variance, and elementary tail manipulations already give nontrivial control of rare events.
# 1. Tail Bounds from Moments and Variance
After the introductory roadmap, this first chapter sets up the basic language of concentration: how likely it is for a random variable to deviate from a typical scale, and what information is needed to control that likelihood. The results here use only positivity, expectation, variance, and elementary integration. They are weaker than the exponential inequalities developed later, but they are the baseline against which sharper methods are measured.
## Tail Probabilities and Deviation Scales
The first question in concentration is not how to compute the whole distribution of a random variable, but how to bound the probability of an unusually large deviation. This shifts attention from densities and distribution functions to tail events such as $\{X \ge t\}$ and $\{|X - m| \ge r\}$, where $m$ is a centre and $r$ is a deviation scale.
[definition: Tail Probability]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space and let $X: \Omega \to \mathbb R$ be a real-valued random variable. For $t \in \mathbb R$, the upper tail probability of $X$ at level $t$ is $\mathbb P(X \ge t)$, and the lower tail probability is $\mathbb P(X \le t)$.
[/definition]
Tail probabilities isolate rare large-value events, but concentration usually asks whether a random variable stays near a reference value. This makes it necessary to measure tails after subtracting a centre, so that the level parameter records distance rather than absolute size. This motivates the centred deviation probability.
[definition: Deviation Probability]
Let $X: \Omega \to \mathbb R$ be a real-valued random variable and let $m \in \mathbb R$. For $r \ge 0$, the two-sided deviation probability of $X$ from $m$ at scale $r$ is
\begin{align*}
\mathbb P(|X - m| \ge r).
\end{align*}
[/definition]
A deviation probability depends on the chosen centre $m$. For integrable random variables the mean $\mathbb E[X]$ is the usual centre, while for distributions with heavy tails a median or another quantile may better represent a typical value. This motivates recording quantiles as possible centres for concentration statements.
[definition: Quantile]
Let $X: \Omega \to \mathbb R$ be a real-valued random variable with distribution function $F_X: \mathbb R \to [0,1]$ defined by $F_X(x) = \mathbb P(X \le x)$. For $p \in (0,1)$, a $p$-quantile of $X$ is a number $q_p \in \mathbb R$ such that
\begin{align*}
\mathbb P(X \le q_p) \ge p, \qquad \mathbb P(X \ge q_p) \ge 1-p.
\end{align*}
[/definition]
Medians are the case $p=1/2$. They will reappear in later chapters because concentration around a median can sometimes be established before concentration around the mean. The next example shows that the numerical centre chosen for a tail bound can change the interpretation of the same random variable.
[example: Bernoulli Median and Mean]
Let $X \sim \operatorname{Ber}(p)$ with $p<1/2$, so $X$ takes the value $0$ with probability $1-p$ and the value $1$ with probability $p$. At $q=0$,
\begin{align*}
\mathbb P(X\le 0)=1-p \ge \frac12.
\end{align*}
Also,
\begin{align*}
\mathbb P(X\ge 0)=1.
\end{align*}
Thus $0$ is a median. If $q<0$, then no value of $X$ is at most $q$, so
\begin{align*}
\mathbb P(X\le q)=0<\frac12.
\end{align*}
If $0<q\le 1$, then the event $\{X\ge q\}$ is exactly $\{X=1\}$, and hence
\begin{align*}
\mathbb P(X\ge q)=p<\frac12.
\end{align*}
If $q>1$, then no value of $X$ is at least $q$, so
\begin{align*}
\mathbb P(X\ge q)=0<\frac12.
\end{align*}
Therefore $0$ is the unique median.
The mean is computed from the two atoms:
\begin{align*}
\mathbb E[X]=0\cdot \mathbb P(X=0)+1\cdot \mathbb P(X=1).
\end{align*}
Substituting the Bernoulli probabilities gives
\begin{align*}
\mathbb E[X]=0\cdot(1-p)+1\cdot p=p.
\end{align*}
For $0<p<1/2$, the mean $p$ is not one of the possible values $0$ or $1$. On the atom $\{X=0\}$, the two centred distances are
\begin{align*}
|X-p|=p, \qquad |X-0|=0.
\end{align*}
On the atom $\{X=1\}$, they are
\begin{align*}
|X-p|=1-p, \qquad |X-0|=1.
\end{align*}
Thus $\{|X-p|\ge r\}$ measures distance from the non-attained centre $p$, while $\{|X-0|\ge r\}$ measures distance from the atom $0$, which has probability $1-p>1/2$. The centre used in a concentration statement is therefore part of the mathematical content of the statement.
[/example]
Once a centre and deviation scale are fixed, the next issue is the rate at which the bound decreases as $r$ grows. This rate separates the weak moment bounds of the present chapter from the exponential concentration inequalities that occupy the rest of the course. This motivates distinguishing polynomial tails from exponential tails at the level of definitions.
[definition: Polynomial and Exponential Tail Bounds]
Let $X: \Omega \to \mathbb R$ be a real-valued random variable and let $m \in \mathbb R$. A polynomial tail bound of order $p>0$ at scale $a>0$ is a bound of the form
\begin{align*}
\mathbb P(|X-m| \ge r) \le C\left(\frac{a}{r}\right)^p
\end{align*}
for all $r>0$, with $C>0$. An exponential tail bound at scale $a>0$ is a bound of the form
\begin{align*}
\mathbb P(|X-m| \ge r) \le C\exp\left(-c\left(\frac{r}{a}\right)^\alpha\right)
\end{align*}
for all $r>0$, with constants $C,c,\alpha>0$.
[/definition]
Polynomial bounds are useful because they require little information. Exponential bounds carry much stronger nonasymptotic information, but they require stronger hypotheses such as boundedness, independence with moment generating function control, or Lipschitz dependence on independent coordinates.
## Markov, Chebyshev, and Cantelli Inequalities
How much can be concluded from a single moment? The answer is Markov's inequality: nonnegativity and the expectation already force the upper tail to be small at high levels.
[quotetheorem:514]
[citeproof:514]
Markov's inequality is the template for many later arguments. Whenever a nonnegative transform of a random variable is available, applying Markov to that transform converts moment information into a tail probability. The nonnegativity hypothesis is necessary for this pointwise comparison: if $X=-1$ almost surely, then for $t=1$ the right-hand side $\mathbb E[X]/t$ is negative while the probability $\mathbb P(X\ge 1)$ is nonnegative, so the stated inequality cannot hold. Markov's inequality also does not identify the typical size of $X$ or provide a lower-tail estimate; it only says that a nonnegative variable with small expectation cannot often exceed a large positive level. The following estimator example records the most direct use of the theorem before we move to centred deviations.
[example: Failure Probability of a Nonnegative Estimator]
Let $Z \ge 0$ be an error estimator with $\mathbb E[Z] \le \varepsilon$, and fix a tolerance $\delta>0$. The failure event is $\{Z \ge \delta\}$. Since $Z$ is nonnegative, *Markov Inequality* applied with threshold $\delta$ gives
\begin{align*}
\mathbb P(Z \ge \delta)
&\le \frac{\mathbb E[Z]}{\delta}.
\end{align*}
Using the assumed mean bound and the positivity of $\delta$,
\begin{align*}
\frac{\mathbb E[Z]}{\delta}
&\le \frac{\varepsilon}{\delta},
\end{align*}
so
\begin{align*}
\mathbb P(Z \ge \delta)
&\le \frac{\varepsilon}{\delta}.
\end{align*}
Thus a bound on the mean error gives a bound on the probability of exceeding any fixed tolerance, with polynomial decay proportional to $1/\delta$.
[/example]
The estimator example involved a nonnegative quantity whose size itself was the error. For a general real-valued random variable, the error is usually the centred quantity $X-\mathbb E[X]$, and the relevant nonnegative transform is its square. This motivates the variance, which is the mean squared centred deviation.
[definition: Variance]
Let $X: \Omega \to \mathbb R$ be a real-valued random variable with $\mathbb E[X^2]<\infty$. The variance of $X$ is
\begin{align*}
\operatorname{Var}(X) = \mathbb E[(X-\mathbb E[X])^2].
\end{align*}
[/definition]
Variance is small when $X$ is usually close to its mean in squared distance. Since the square of a centred variable is nonnegative, Markov's inequality can be applied to it directly. This motivates the basic two-sided variance concentration estimate.
[quotetheorem:1126]
[citeproof:1126]
The centering in Chebyshev's inequality is not cosmetic. Without subtracting the mean, the second moment mixes spread with the location of the random variable: a constant random variable $X=M$ has no fluctuation at all, but $\mathbb E[X^2]=M^2$ can be arbitrarily large. The finite variance hypothesis is also a real restriction; for instance, a Cauchy random variable has no finite second moment, so Chebyshev's argument has no variance quantity to which Markov's inequality can be applied. Even when the hypotheses hold, the conclusion has only polynomial decay of order $r^{-2}$, so it cannot by itself capture the exponential tails that later independence or boundedness assumptions produce. The sample mean example shows why the centred variance form is the useful one for averages.
[example: Variance Bound for Sample Means]
Let $X_1,\dots,X_n$ be independent real-valued random variables with common mean $\mu$ and common variance $\sigma^2<\infty$, and set $\bar X_n=n^{-1}\sum_{i=1}^n X_i$. By linearity of expectation,
\begin{align*}
\mathbb E[\bar X_n]=\mathbb E\left[\frac{1}{n}\sum_{i=1}^n X_i\right].
\end{align*}
Pulling out the constant $1/n$ and distributing expectation over the finite sum gives
\begin{align*}
\mathbb E[\bar X_n]=\frac{1}{n}\sum_{i=1}^n \mathbb E[X_i].
\end{align*}
Since each $\mathbb E[X_i]=\mu$,
\begin{align*}
\mathbb E[\bar X_n]=\frac{1}{n}\sum_{i=1}^n \mu=\frac{n\mu}{n}=\mu.
\end{align*}
For the variance, first use $\operatorname{Var}(aY)=a^2\operatorname{Var}(Y)$ with $a=1/n$:
\begin{align*}
\operatorname{Var}(\bar X_n)=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^n X_i\right).
\end{align*}
Because the variables are independent, variances add for the finite sum:
\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n \operatorname{Var}(X_i).
\end{align*}
Substituting $\operatorname{Var}(X_i)=\sigma^2$ for every $i$ gives
\begin{align*}
\operatorname{Var}(\bar X_n)=\frac{1}{n^2}\sum_{i=1}^n \sigma^2=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}.
\end{align*}
Applying *Chebyshev Inequality* to $\bar X_n$, whose mean is $\mu$, gives for every $r>0$,
\begin{align*}
\mathbb P(|\bar X_n-\mu|\ge r)\le \frac{\operatorname{Var}(\bar X_n)}{r^2}.
\end{align*}
Using the variance computation,
\begin{align*}
\mathbb P(|\bar X_n-\mu|\ge r)\le \frac{\sigma^2/n}{r^2}=\frac{\sigma^2}{nr^2}.
\end{align*}
Thus averaging independent variables with common variance reduces the Chebyshev deviation bound by the factor $n$, giving a nonasymptotic law-of-large-numbers estimate.
[/example]
Chebyshev controls both sides of the deviation event at once. If only upward deviations matter, the proof can exploit a shifted square rather than the symmetric square $(X-\mathbb E[X])^2$. This motivates a one-sided Chebyshev inequality with an optimized shift.
[quotetheorem:6039]
[citeproof:6039]
Cantelli's inequality improves Chebyshev only because the event is one-sided and the shifted square is optimized for that direction. The assumptions still require a finite mean and finite variance; a symmetric heavy-tailed variable with $\mathbb P(|X|>t)=t^{-1}$ for $t\ge 1$ has no finite variance, so there is no parameter $\sigma^2$ from which this bound can be formed. The theorem does not give simultaneous control of both tails at the Cantelli rate, and it does not turn variance information into exponential decay. For a random variable with mean $\mu$, apply Cantelli to $X-\mu$. The resulting upper-tail bound is
\begin{align*}
\mathbb P(X-\mu \ge r) \le \frac{\operatorname{Var}(X)}{\operatorname{Var}(X)+r^2}.
\end{align*}
A matching lower-tail version follows by applying the same result to $\mu-X$.
[example: Sharpness for Two-Point Distributions]
Fix $r>0$ and $\sigma>0$. Let $X$ take the values $r$ and $-\sigma^2/r$ with probabilities
\begin{align*}
\mathbb P(X=r)=\frac{\sigma^2}{\sigma^2+r^2}, \qquad \mathbb P\left(X=-\frac{\sigma^2}{r}\right)=\frac{r^2}{\sigma^2+r^2}.
\end{align*}
These are valid probabilities because both numerators are positive and
\begin{align*}
\frac{\sigma^2}{\sigma^2+r^2}+\frac{r^2}{\sigma^2+r^2}=\frac{\sigma^2+r^2}{\sigma^2+r^2}=1.
\end{align*}
The mean is
\begin{align*}
\mathbb E[X]=r\cdot \frac{\sigma^2}{\sigma^2+r^2}+\left(-\frac{\sigma^2}{r}\right)\cdot \frac{r^2}{\sigma^2+r^2}.
\end{align*}
Multiplying the two terms gives
\begin{align*}
\mathbb E[X]=\frac{r\sigma^2}{\sigma^2+r^2}-\frac{\sigma^2 r^2}{r(\sigma^2+r^2)}.
\end{align*}
Since $r>0$, the second fraction is $\frac{r\sigma^2}{\sigma^2+r^2}$, so
\begin{align*}
\mathbb E[X]=\frac{r\sigma^2}{\sigma^2+r^2}-\frac{r\sigma^2}{\sigma^2+r^2}=0.
\end{align*}
Since $\mathbb E[X]=0$, the variance is $\operatorname{Var}(X)=\mathbb E[X^2]$. Therefore
\begin{align*}
\operatorname{Var}(X)=r^2\cdot \frac{\sigma^2}{\sigma^2+r^2}+\left(-\frac{\sigma^2}{r}\right)^2\cdot \frac{r^2}{\sigma^2+r^2}.
\end{align*}
Squaring the second atom gives
\begin{align*}
\operatorname{Var}(X)=\frac{r^2\sigma^2}{\sigma^2+r^2}+\frac{\sigma^4}{r^2}\cdot \frac{r^2}{\sigma^2+r^2}.
\end{align*}
Cancelling $r^2$ in the second term gives
\begin{align*}
\operatorname{Var}(X)=\frac{r^2\sigma^2}{\sigma^2+r^2}+\frac{\sigma^4}{\sigma^2+r^2}.
\end{align*}
Combining the fractions,
\begin{align*}
\operatorname{Var}(X)=\frac{\sigma^2(r^2+\sigma^2)}{\sigma^2+r^2}=\sigma^2.
\end{align*}
The second atom satisfies $-\sigma^2/r<0<r$, so the event $\{X\ge r\}$ is exactly the event $\{X=r\}$. Hence
\begin{align*}
\mathbb P(X\ge r)=\mathbb P(X=r)=\frac{\sigma^2}{\sigma^2+r^2}.
\end{align*}
Thus this two-point law has mean $0$, variance $\sigma^2$, and upper-tail probability equal to the Cantelli bound at the level $r$, so the inequality is attained.
[/example]
This example also explains why variance-only bounds cannot generally have exponential tails. A distribution supported on two points can saturate the one-sided variance bound at a prescribed deviation level.
## Tail Integration and Moment Bounds
The inequalities above pass from moments to tails. The reverse direction is also fundamental: tail probabilities determine moments of nonnegative random variables through integration over level sets.
[quotetheorem:4993]
[citeproof:4993]
The identity turns the expectation of a nonnegative variable into an integral of its tails. Nonnegativity is essential in this form: if $X=-1$ almost surely, then $\mathbb P(X>t)=0$ for every $t\ge0$, while $\mathbb E[X]=-1$. For signed random variables one must instead decompose $X=X^+-X^-$ and apply the identity separately to the positive and negative parts when the expectations are well-defined. The identity also does not guarantee that the expectation is finite; it may assert that both sides are $\infty$. To estimate higher moments, we apply the same idea to the transformed variable $X^p$. This motivates the standard moment formula used throughout concentration theory.
[remark: Higher Moments from Tail Integration]
Applying the preceding [tail integral formula](/theorems/4993) to $X^p$ gives the higher-moment identity
\begin{align*}
\mathbb E[X^p]=p\int_0^\infty t^{p-1}\mathbb P(X>t)\,dt,\qquad p>0.
\end{align*}
This derived formula lets us convert an upper tail estimate into a moment estimate by integration. The nonnegativity of $X$ is used so that $X^p$ is an order-preserving transform with level sets $\{X^p>s\}=\{X>s^{1/p}\}$; for signed variables the corresponding formula is applied to $|X|^p$ instead. The condition $p>0$ is also part of the level-set change of variables, since negative powers would study behaviour near zero rather than upper tails. The formula is an identity in $[0,\infty]$, not a finiteness theorem: if $\mathbb P(X>t)=1/(1+t)$, then the integral for $p=1$ diverges and $\mathbb E[X]=\infty$. The next consequence shows how polynomial tails control lower-order moments.
[/remark]
The obstruction is the endpoint: a tail of order $t^{-q}$ can make the $q$th moment diverge, even though every lower-order tail integral is finite. A usable moment estimate must therefore stop at $p<q$ and keep the dependence on the tail scale explicit.
[quotetheorem:6040]
[citeproof:6040]
The endpoint $p=q$ is not covered because the last integral has logarithmic divergence. This endpoint failure is a recurring warning: tail exponents and moment orders are tightly linked. The uniform tail assumption for all $t>0$ also matters, since the small-$t$ bound by $1$ and the large-$t$ bound by $(A/t)^q$ are both used in the split integral. The conclusion cannot be strengthened to the $q$-th moment in general: a nonnegative random variable with tail $\mathbb P(X>t)=\min\{1,t^{-q}\}$ satisfies the hypothesis with $A=1$, but its $q$-th moment diverges by the tail formula. Applying this principle to Chebyshev's inequality recovers lower moments from variance control.
[example: Moment Bound from a Chebyshev Tail]
Let $X$ be a real-valued random variable with mean $\mu$ and variance $\sigma^2<\infty$, and set $Y=|X-\mu|$. If $\sigma=0$, then
\begin{align*}
\mathbb E[(X-\mu)^2]=0.
\end{align*}
Since $(X-\mu)^2\ge 0$, this forces $(X-\mu)^2=0$ almost surely, so $Y=0$ almost surely and $\mathbb E[Y^p]=0$ for every $p>0$.
Assume now that $\sigma>0$. For every $t>0$, the event $\{Y>t\}$ is contained in $\{Y\ge t\}$, so
\begin{align*}
\mathbb P(Y>t)\le \mathbb P(Y\ge t).
\end{align*}
By *Chebyshev Inequality*,
\begin{align*}
\mathbb P(Y\ge t)\le \frac{\sigma^2}{t^2}.
\end{align*}
Every probability is also at most $1$, hence
\begin{align*}
\mathbb P(Y>t)\le 1.
\end{align*}
Combining these two upper bounds gives
\begin{align*}
\mathbb P(Y>t)\le \min\left\{1,\frac{\sigma^2}{t^2}\right\}.
\end{align*}
Since $\sigma>0$ and $t>0$,
\begin{align*}
\frac{\sigma^2}{t^2}=\left(\frac{\sigma}{t}\right)^2.
\end{align*}
Therefore
\begin{align*}
\mathbb P(Y>t)\le \min\left\{1,\left(\frac{\sigma}{t}\right)^2\right\}.
\end{align*}
Thus $Y$ satisfies the hypothesis of *Moments from Polynomial Tails* with $A=\sigma$ and $q=2$. For every $p\in(0,2)$, that theorem gives
\begin{align*}
\mathbb E[Y^p]\le \sigma^p\left(1+\frac{p}{2-p}\right).
\end{align*}
Substituting $Y=|X-\mu|$ yields
\begin{align*}
\mathbb E[|X-\mu|^p]\le \sigma^p\left(1+\frac{p}{2-p}\right).
\end{align*}
The estimate shows that finite variance controls every centred absolute moment of order strictly below $2$.
[/example]
For integer-valued random variables, the integral identity has a discrete counterpart. This is useful for waiting times and counting variables, where summing tail probabilities is often more natural than integrating them. The identity below is the counting version of the layer-cake representation.
If $X$ is a nonnegative integer-valued random variable, then
\begin{align*}
\mathbb E[X]=\sum_{k=1}^{\infty}\mathbb P(X\ge k),
\end{align*}
with both sides interpreted in $[0,\infty]$.
The integer-valued hypothesis is what turns the area under the tail into a countable sum with unit-width layers. For a non-integer variable such as $X\sim \operatorname{Unif}(0,1)$, the sum $\sum_{k=1}^\infty \mathbb P(X\ge k)$ is $0$, while $\mathbb E[X]=1/2$, so the discrete identity cannot replace the integral identity. The identity also does not by itself prove finiteness; the expectation and the tail sum may both be infinite. The chapter's methods therefore move in both directions: moments imply tails through Markov-type arguments, and tails imply moments through integration or summation. Chapter 2 replaces powers by exponentials through the Chernoff method, and Chapter 5 abstracts the resulting quadratic exponential behaviour into the sub-Gaussian scale.
Chapter 1 shows that Markov- and Chebyshev-type arguments can turn moments into tails, but at the cost of losing the sharper decay visible in many independent-sum problems. Chapter 2 recovers that decay by switching from powers to exponentials, introducing the Chernoff method and moment generating functions as the main engine.
# 2. The Chernoff Method and Moment Generating Functions
The first chapter showed how Markov and Chebyshev convert expectations and variances into tail bounds. Those inequalities are robust, but they usually lose the exponential decay that appears in sums of independent variables. The Chernoff method is the first systematic way to recover that decay: exponentiate the random variable, apply Markov's inequality, and then optimize over the exponential parameter. The prerequisites are the expectation and variance estimates from Chapter 1, independence of random variables, the moment generating function language introduced in Chapter 0, and the basic convexity inequalities used in Hölder's inequality. This chapter develops the method through moment generating functions, convex duality, and the standard Bernoulli, binomial, Poisson, and Gaussian examples.
## Exponential Markov Inequality and Chernoff Optimization
The basic problem is to estimate a probability such as $\mathbb P(X \ge t)$ when polynomial moment information is too weak. Working directly with $\mathbb P(X\ge t)$ gives no algebraic handle on sums: the event $\{X_1+\cdots+X_n\ge t\}$ does not factor under independence. Exponentiation fixes this obstruction because $e^{\lambda(X_1+\cdots+X_n)}$ becomes a product, but it introduces a new danger: the expectation may be infinite for some values of $\lambda$, and the sign of $\lambda$ decides which tail is being controlled. If $e^{\lambda X}$ has finite expectation for some $\lambda > 0$, then the event $X \ge t$ becomes the event $e^{\lambda X} \ge e^{\lambda t}$, so Markov's inequality can be applied on the exponential scale. The remaining choice is the value of $\lambda$, and the whole method is built around optimizing that parameter over the admissible range only.
[definition: Moment Generating Function]
Let $X$ be a real-valued random variable. The moment generating function of $X$ is the function $M_X: D_X \to (0, \infty)$ defined by
\begin{align*}
M_X(\lambda) = \mathbb E[e^{\lambda X}],
\end{align*}
where
\begin{align*}
D_X = \{\lambda \in \mathbb R : \mathbb E[e^{\lambda X}] < \infty\}.
\end{align*}
[/definition]
As in Chapter 0, the moment generating function records which exponential reweightings of $X$ are integrable. The immediate difficulty is that integrability is an average statement, while concentration asks for a probability of a large-deviation event. Monotonicity of the exponential function is the mechanism that turns the event $\{X\ge a\}$ into an inequality involving $e^{\lambda X}$.
[quotetheorem:6041]
[citeproof:6041]
The sign and finiteness hypotheses are part of the content of the theorem. A positive $\lambda$ is used for the upper tail because $x\mapsto e^{\lambda x}$ is increasing; if $\lambda<0$ were inserted into the upper-tail argument, the event inclusion would reverse and the displayed estimate would not follow. Finiteness of $\mathbb E[e^{\lambda X}]$ is equally necessary: for a heavy-tailed $X$ with infinite positive exponential moments, the formal expression gives no finite numerical bound. This theorem gives a valid inequality for each admissible exponential parameter, so the parameter should be chosen rather than fixed in advance. The next theorem formalizes this optimization principle and is the form used throughout the rest of this chapter.
[quotetheorem:6042]
[citeproof:6042]
The optimization step improves a family of valid bounds, but it does not create information outside the available exponential moment domain. If the only admissible positive parameter is very small, or if no positive parameter is admissible, the upper-tail Chernoff bound may be weak or unavailable even when the probability itself is small. The theorem also gives an upper bound only; it does not assert that the exponent is sharp without additional large-deviation input. In sums of independent random variables, independence turns the moment generating function of a sum into a product, so the logarithm becomes additive and the minimization becomes a convex calculus problem.
[example: Coin Flip Deviation Bound]
Let $X_1,\dots,X_n$ be independent with $X_i\sim\operatorname{Ber}(p)$, where $p\in(0,1)$, and let $S_n=\sum_{i=1}^n X_i$. For the nontrivial upper-tail case $a\in(p,1)$, one summand has
\begin{align*}
\mathbb E[e^{\lambda X_i}]=e^{\lambda\cdot 0}(1-p)+e^{\lambda\cdot 1}p=1-p+pe^\lambda.
\end{align*}
Independence gives
\begin{align*}
\mathbb E[e^{\lambda S_n}]=\mathbb E\left[\prod_{i=1}^n e^{\lambda X_i}\right]=\prod_{i=1}^n \mathbb E[e^{\lambda X_i}]=(1-p+pe^\lambda)^n.
\end{align*}
By *[Chernoff Bound Via Moment Generating Functions](/theorems/6042)*, for every $\lambda>0$,
\begin{align*}
\mathbb P(S_n\ge na)\le \exp\{-\lambda na\}(1-p+pe^\lambda)^n.
\end{align*}
Equivalently,
\begin{align*}
\mathbb P(S_n\ge na)\le \inf_{\lambda>0}\exp\left\{n\left[-\lambda a+\log(1-p+pe^\lambda)\right]\right\}.
\end{align*}
It remains to minimize
\begin{align*}
\phi(\lambda)=-\lambda a+\log(1-p+pe^\lambda).
\end{align*}
Differentiating gives
\begin{align*}
\phi'(\lambda)=-a+\frac{pe^\lambda}{1-p+pe^\lambda}.
\end{align*}
The critical point equation $\phi'(\lambda)=0$ is
\begin{align*}
a=\frac{pe^\lambda}{1-p+pe^\lambda}.
\end{align*}
Multiplying by $1-p+pe^\lambda$ gives
\begin{align*}
a(1-p+pe^\lambda)=pe^\lambda.
\end{align*}
Expanding the left side gives
\begin{align*}
a(1-p)+ape^\lambda=pe^\lambda.
\end{align*}
Moving the $ape^\lambda$ term to the right gives
\begin{align*}
a(1-p)=p(1-a)e^\lambda.
\end{align*}
Since $p\in(0,1)$ and $a\in(p,1)$, division by $p(1-a)$ is valid, and
\begin{align*}
e^\lambda=\frac{a(1-p)}{p(1-a)}.
\end{align*}
The ratio is greater than $1$ because
\begin{align*}
a(1-p)>p(1-a)
\end{align*}
is equivalent to $a>p$. Hence
\begin{align*}
\lambda_*=\log\frac{a(1-p)}{p(1-a)}
\end{align*}
is positive. Also,
\begin{align*}
\phi''(\lambda)=\frac{p(1-p)e^\lambda}{(1-p+pe^\lambda)^2}>0,
\end{align*}
so this critical point is the minimizer.
Now substitute $\lambda_*$. First,
\begin{align*}
1-p+pe^{\lambda_*}=1-p+p\cdot\frac{a(1-p)}{p(1-a)}.
\end{align*}
Canceling $p$ in the second term gives
\begin{align*}
1-p+pe^{\lambda_*}=1-p+\frac{a(1-p)}{1-a}.
\end{align*}
Factoring $1-p$ gives
\begin{align*}
1-p+pe^{\lambda_*}=(1-p)\left(1+\frac{a}{1-a}\right).
\end{align*}
Since $1+a/(1-a)=1/(1-a)$, this becomes
\begin{align*}
1-p+pe^{\lambda_*}=\frac{1-p}{1-a}.
\end{align*}
Therefore
\begin{align*}
\phi(\lambda_*)=-a\log\frac{a(1-p)}{p(1-a)}+\log\frac{1-p}{1-a}.
\end{align*}
Using $\log(xy)=\log x+\log y$ on the first logarithm,
\begin{align*}
\phi(\lambda_*)=-a\log\frac{a}{p}-a\log\frac{1-p}{1-a}+\log\frac{1-p}{1-a}.
\end{align*}
Combining the last two terms gives
\begin{align*}
\phi(\lambda_*)=-a\log\frac{a}{p}+(1-a)\log\frac{1-p}{1-a}.
\end{align*}
Since $\log((1-p)/(1-a))=-\log((1-a)/(1-p))$,
\begin{align*}
\phi(\lambda_*)=-\left(a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}\right).
\end{align*}
Thus
\begin{align*}
\mathbb P(S_n\ge na)\le \exp\{-nD(a\|p)\},
\end{align*}
where
\begin{align*}
D(a\|p)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}.
\end{align*}
This shows explicitly how independence turns the exponential moment of the count into an $n$th power, and how optimizing the exponential tilt produces the relative entropy cost.
[/example]
The previous example also indicates why it is preferable to work with logarithms. Products from independent variables become sums, and the final exponent records the balance between the imposed deviation and the exponential moment cost.
## Cumulant Generating Functions and Convex Duality
The Chernoff bound is strongest when written in a form that separates probability from optimization. Raw moment generating functions multiply under independence, so minimizing expressions such as $e^{-\lambda t}M_X(\lambda)$ for sums can hide the linear-in-$n$ rate behind products. Taking logarithms exposes the additive quantity that is actually being optimized. The logarithm also prevents a common mistake: optimizing over a value of $\lambda$ where $M_X(\lambda)=\infty$ is not a limiting argument but an invalid parameter choice. Its Legendre transform records the best exponential rate obtainable from the Chernoff method.
[definition: Cumulant Generating Function]
Let $X$ be a real-valued random variable. The cumulant generating function of $X$ is the function $\Lambda_X: D_X \to \mathbb R$ defined by
\begin{align*}
\Lambda_X(\lambda)=\log \mathbb E[e^{\lambda X}],
\end{align*}
where $D_X=\{\lambda\in\mathbb R:M_X(\lambda)<\infty\}$.
[/definition]
This definition packages the moment generating function on the exponential scale. The domain $D_X$ must stay visible in the notation: for distributions with only one-sided exponential moments, the cumulant generating function may exist on a half-line rather than on all of $\mathbb R$. The next theorem is needed because optimization over $\lambda$ depends on convexity; without convexity, the Chernoff parameter search would not have the clean calculus structure used below.
[quotetheorem:6043]
[citeproof:6043]
Convexity is a structural statement, not a differentiability statement. The theorem does not say that $\Lambda_X$ has a derivative everywhere, nor that every supremum used later is attained; boundary optima and infinite rates can occur when the domain has endpoints. This is why the later bounds are stated with infima and suprema rather than always with a named optimizer. Convexity turns the Chernoff optimization into a duality calculation. The next definition names the dual function whose value is the exponential cost assigned to a proposed deviation level.
[example: Exponential Moment Domain Matters]
Let $X\sim\operatorname{Exp}(1)$, so $X$ has density $f_X(x)=e^{-x}$ for $x\ge 0$. For any real $\lambda$,
\begin{align*}
M_X(\lambda)=\mathbb E[e^{\lambda X}]=\int_0^\infty e^{\lambda x}e^{-x}\,dx=\int_0^\infty e^{-(1-\lambda)x}\,dx.
\end{align*}
If $\lambda<1$, then $1-\lambda>0$. For $R>0$,
\begin{align*}
\int_0^R e^{-(1-\lambda)x}\,dx=\left[-\frac{1}{1-\lambda}e^{-(1-\lambda)x}\right]_{x=0}^{x=R}.
\end{align*}
Evaluating the endpoints gives
\begin{align*}
\left[-\frac{1}{1-\lambda}e^{-(1-\lambda)x}\right]_{x=0}^{x=R}=\frac{1-e^{-(1-\lambda)R}}{1-\lambda}.
\end{align*}
Since $1-\lambda>0$, we have $e^{-(1-\lambda)R}\to 0$ as $R\to\infty$, and therefore
\begin{align*}
M_X(\lambda)=\lim_{R\to\infty}\frac{1-e^{-(1-\lambda)R}}{1-\lambda}=\frac{1}{1-\lambda}.
\end{align*}
At the endpoint $\lambda=1$,
\begin{align*}
M_X(1)=\int_0^\infty e^{1\cdot x}e^{-x}\,dx=\int_0^\infty 1\,dx=\infty.
\end{align*}
If $\lambda>1$, then $\lambda-1>0$, and for $R>0$,
\begin{align*}
\int_0^R e^{(\lambda-1)x}\,dx=\frac{e^{(\lambda-1)R}-1}{\lambda-1}.
\end{align*}
Because $e^{(\lambda-1)R}\to\infty$ as $R\to\infty$, this gives
\begin{align*}
M_X(\lambda)=\int_0^\infty e^{(\lambda-1)x}\,dx=\infty.
\end{align*}
Thus the true moment-generating-function domain is
\begin{align*}
D_X=(-\infty,1).
\end{align*}
On this domain,
\begin{align*}
\Lambda_X(\lambda)=\log M_X(\lambda)=\log\frac{1}{1-\lambda}=-\log(1-\lambda).
\end{align*}
Differentiating on $(-\infty,1)$ gives
\begin{align*}
\Lambda_X'(\lambda)=\frac{1}{1-\lambda}.
\end{align*}
Differentiating once more gives
\begin{align*}
\Lambda_X''(\lambda)=\frac{1}{(1-\lambda)^2}>0.
\end{align*}
Hence $\Lambda_X$ is convex on its finite domain $(-\infty,1)$. There is no finite value of $\Lambda_X(\lambda)$ at or beyond $\lambda=1$, because $M_X(\lambda)=\infty$ there; using such a parameter in Chernoff's method would insert an infinite exponential moment into Markov's inequality rather than produce a finite tail estimate.
[/example]
The exponential example is also a warning about endpoints: convexity holds on the interval where the function is finite, while the Chernoff optimization may approach the boundary without being attained there. To turn this endpoint-aware optimization into a reusable object, we need a function that assigns to each proposed value $t$ the best available exponential penalty $\lambda t-\Lambda_X(\lambda)$. This is the role of the Legendre transform. It keeps the admissible domain inside the definition and converts the Chernoff parameter search into a rate function that can be compared across sums, scalings, and different distributions.
[definition: Legendre Transform of a Cumulant Generating Function]
Let $X$ be a real-valued random variable with cumulant generating function $\Lambda_X$. The Legendre transform of $\Lambda_X$ is the function $\Lambda_X^*:\mathbb R\to[0,\infty]$ defined by
\begin{align*}
\Lambda_X^*(t)=\sup_{\lambda\in D_X}\{\lambda t-\Lambda_X(\lambda)\}.
\end{align*}
[/definition]
The supremum in the Legendre transform is useful only when it is taken over admissible parameters. If the full transform $\Lambda_X^*$ is used without checking signs, it can mix upper-tail and lower-tail information and produce a rate for the wrong event. The Chernoff bound over all real parameters can now be expressed as an exponential rate. The next theorem gives the one-sided version of this dual expression, with the sign of $\lambda$ matching the direction of the tail.
[quotetheorem:6044]
[citeproof:6044]
The theorem is a repackaging of the Chernoff infimum, so it inherits all of its restrictions. The supremum is over $\lambda>0$ for upper tails and $\lambda<0$ for lower tails; using $\lambda=0$ would give only the bound $1$, and using the wrong sign would not control the desired event. It also does not guarantee a matching lower bound. The transform $\Lambda_X^*$ is therefore the rate function produced by this upper-bound method, with positive values when the requested value is separated from the mean in a direction where exponential moments exist.
[example: Gaussian Tail from Completing the Square]
Let $Z\sim\mathcal N(0,1)$, whose density is $(2\pi)^{-1/2}e^{-z^2/2}$. For every $\lambda\in\mathbb R$, its moment generating function is
\begin{align*}
M_Z(\lambda)=\mathbb E[e^{\lambda Z}]=\int_{-\infty}^{\infty} e^{\lambda z}\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz.
\end{align*}
Combining the exponent terms gives
\begin{align*}
M_Z(\lambda)=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(\lambda z-\frac{z^2}{2}\right)\,dz.
\end{align*}
Complete the square in the exponent:
\begin{align*}
\lambda z-\frac{z^2}{2}=-\frac{1}{2}(z^2-2\lambda z).
\end{align*}
Since
\begin{align*}
z^2-2\lambda z=(z-\lambda)^2-\lambda^2,
\end{align*}
we get
\begin{align*}
\lambda z-\frac{z^2}{2}=\frac{\lambda^2}{2}-\frac{(z-\lambda)^2}{2}.
\end{align*}
Substituting this into the integral gives
\begin{align*}
M_Z(\lambda)=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(\frac{\lambda^2}{2}-\frac{(z-\lambda)^2}{2}\right)\,dz.
\end{align*}
The factor $e^{\lambda^2/2}$ does not depend on $z$, so
\begin{align*}
M_Z(\lambda)=e^{\lambda^2/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-(z-\lambda)^2/2}\,dz.
\end{align*}
With the change of variables $y=z-\lambda$, we have $dy=dz$, and the limits remain $-\infty$ and $\infty$. Hence
\begin{align*}
\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-(z-\lambda)^2/2}\,dz=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy.
\end{align*}
The last integral is $1$ because it is the total mass of the standard normal density. Therefore
\begin{align*}
M_Z(\lambda)=e^{\lambda^2/2}.
\end{align*}
Thus
\begin{align*}
\Lambda_Z(\lambda)=\log M_Z(\lambda)=\log(e^{\lambda^2/2})=\frac{\lambda^2}{2}.
\end{align*}
For $t>0$, the upper-tail Chernoff exponent from *Chernoff Bound in Legendre Form* is
\begin{align*}
\sup_{\lambda>0}\left\{\lambda t-\frac{\lambda^2}{2}\right\}.
\end{align*}
For $\lambda>0$,
\begin{align*}
\lambda t-\frac{\lambda^2}{2}=-\frac{1}{2}(\lambda^2-2\lambda t).
\end{align*}
Since
\begin{align*}
\lambda^2-2\lambda t=(\lambda-t)^2-t^2,
\end{align*}
we have
\begin{align*}
\lambda t-\frac{\lambda^2}{2}=\frac{t^2}{2}-\frac{(\lambda-t)^2}{2}.
\end{align*}
Because $(\lambda-t)^2\ge 0$, this expression is at most $t^2/2$. Equality occurs at $\lambda=t$, and this parameter is admissible because $t>0$. Therefore
\begin{align*}
\sup_{\lambda>0}\left\{\lambda t-\frac{\lambda^2}{2}\right\}=\frac{t^2}{2}.
\end{align*}
Applying the upper-tail Legendre-form Chernoff bound gives
\begin{align*}
\mathbb P(Z\ge t)\le \exp\left\{-\frac{t^2}{2}\right\}.
\end{align*}
The Gaussian tail bound comes from the same square-completion calculation twice: first to evaluate the tilted [Gaussian integral](/theorems/1140), and then to optimize the Chernoff exponent in $\lambda$.
[/example]
The Gaussian example is the model case of a quadratic cumulant generating function. Chapter 5 abstracts this into the definition of sub-Gaussian random variables, where $\Lambda_X(\lambda)$ is bounded above by a quadratic even when $X$ is not Gaussian.
## Chernoff Bounds for Classical Distributions
We now compute the Chernoff method in the distributions that appear most often in discrete probability. These computations are reusable templates: Bernoulli and binomial variables give relative entropy exponents, Poisson variables give the same multiplicative shape with a different mean parameter, and Gaussian variables give quadratic tails.
The first computation isolates the single-trial rate function. It is needed because sums of independent Bernoulli variables inherit this exponent additively. The hypotheses exclude the degenerate cases $p=0$ and $p=1$, where the logarithms in $D(a\|p)$ require separate endpoint conventions and the random variable has no genuine fluctuation. The restriction $a>p$ for the upper tail and $a<p$ for the lower tail is also not cosmetic: if the threshold lies on the typical side of the mean, Chernoff optimization over the corresponding sign gives no positive exponential rate.
[quotetheorem:6045]
[citeproof:6045]
For a single Bernoulli variable this bound is mainly a rate computation, not the final concentration statement: since $X$ only takes the values $0$ and $1$, a threshold $a\in(p,1)$ makes $\mathbb P(X\ge a)=p$, so the entropy bound is not meant to be the sharpest possible one-step estimate. Its role is to identify the convex dual exponent that becomes meaningful after summing many independent trials. The endpoint restrictions also avoid undefined logarithms; endpoint versions can be recovered by limits but need separate statements. The next theorem applies the same exponent to the binomial sum, where independence makes the rate scale linearly in $n$.
[example: Bernoulli Tail Direction Cannot Be Ignored]
Let $X\sim\operatorname{Ber}(p)$ with $p\in(0,1)$. For an upper deviation take $a\in(p,1)$. The Bernoulli entropy optimizer is determined by
\begin{align*}
e^\lambda=\frac{a(1-p)}{p(1-a)}.
\end{align*}
All four factors $a$, $1-p$, $p$, and $1-a$ are positive, so the ratio is positive. To decide the sign of $\lambda$, compare the ratio with $1$. Since the denominator $p(1-a)$ is positive,
\begin{align*}
\frac{a(1-p)}{p(1-a)}>1
\end{align*}
is equivalent to
\begin{align*}
a(1-p)>p(1-a).
\end{align*}
Expanding both sides gives
\begin{align*}
a-ap>p-pa.
\end{align*}
Because $ap=pa$, this is equivalent to
\begin{align*}
a>p.
\end{align*}
Thus $a>p$ implies $e^\lambda>1$, and therefore
\begin{align*}
\lambda=\log\frac{a(1-p)}{p(1-a)}>0.
\end{align*}
So the optimizer belongs to the positive-parameter side used for the upper-tail Chernoff bound.
If one instead used the lower-tail Chernoff optimization, the parameter would have to satisfy $\lambda<0$, so it could not include the optimizer above. The event would also be the wrong event: because $0<a<1$ and $X$ only takes the values $0$ and $1$,
\begin{align*}
\{X\le a\}=\{X=0\},
\end{align*}
and hence
\begin{align*}
\mathbb P(X\le a)=\mathbb P(X=0)=1-p.
\end{align*}
This is not the event $\{X\ge a\}$, which is
\begin{align*}
\{X\ge a\}=\{X=1\},
\end{align*}
with probability
\begin{align*}
\mathbb P(X\ge a)=\mathbb P(X=1)=p.
\end{align*}
Conversely, for a lower deviation take $a\in(0,p)$. The same optimizer formula gives a ratio below $1$. Indeed, since $p(1-a)>0$,
\begin{align*}
\frac{a(1-p)}{p(1-a)}<1
\end{align*}
is equivalent to
\begin{align*}
a(1-p)<p(1-a).
\end{align*}
Expanding both sides gives
\begin{align*}
a-ap<p-pa.
\end{align*}
Again using $ap=pa$, this is equivalent to
\begin{align*}
a<p.
\end{align*}
Thus $a<p$ implies $e^\lambda<1$, so
\begin{align*}
\lambda=\log\frac{a(1-p)}{p(1-a)}<0.
\end{align*}
The entropy optimizer now lies on the negative-parameter side used for the lower-tail Chernoff bound. The sign condition is therefore forced by the deviation direction: positive $\lambda$ controls upper tails, negative $\lambda$ controls lower tails, and using the opposite sign changes both the admissible optimizer and the event being bounded.
[/example]
This sign check becomes more important for sums, because an incorrect choice of tail direction can turn a rare event into a typical one and remove the positive exponential rate. For $n$ independent Bernoulli trials, the event $S_n\ge na$ with $a>p$ asks for an empirical success frequency above its mean, while $S_n\le na$ with $a<p$ asks for a frequency below its mean. The single-trial computation identifies the entropy cost of forcing one tilted Bernoulli variable to have mean $a$; independence is what makes the same cost add over $n$ trials. The next theorem records this passage from the one-step rate to the binomial concentration estimate.
[quotetheorem:6046]
[citeproof:6046]
The [binomial theorem](/theorems/750) depends essentially on independence. Having Bernoulli marginals with the same $p$ is not enough: if all trials are perfectly correlated, then $S_n$ is either $0$ or $n$, and the exponential-in-$n$ decay can fail. The threshold condition $a>p$ or $a<p$ again marks the side of the mean being tested; at $a=p$ the entropy rate is zero and no decay statement is obtained from this bound. The entropy form is accurate and symmetric between the two tails, but it is sometimes more convenient to express deviations multiplicatively around the mean $\mu=np$. The next theorem gives that practical form and the common quadratic simplifications used in applications.
[quotetheorem:6047]
[citeproof:6047]
The range restrictions on $\delta$ encode both the tail direction and the support of the binomial variable. For the lower tail, the proof uses the negative parameter $\lambda=\log(1-\delta)$, so the displayed formula is naturally stated for $\delta\in(0,1)$. At the endpoint $\delta=1$, the event is $S_n\le0$ and the limiting right-hand side is $e^{-\mu}$, which is still a valid bound since $\mathbb P(S_n=0)=(1-p)^n\le e^{-\mu}$. For $\delta>1$, the threshold $(1-\delta)\mu$ is negative, so the lower-tail event is empty. The upper-tail multiplicative expression holds for all $\delta>0$, while the quadratic simplification with constant $1/3$ is stated only for $\delta\in(0,1)$; larger deviations require a different elementary comparison if one wants a quadratic-looking bound. These estimates explain why a binomial random variable is sharply concentrated when its mean is large. The exponent is proportional to $\mu$, so relative deviations of fixed size have exponentially small probability in the expected count.
[example: Occupancy Counts]
Throw $m$ balls independently into $N$ boxes, each ball choosing a box uniformly. Fix a box $j$, and define the indicator
\begin{align*}
I_{r,j}=\mathbf 1\{\text{ball }r\text{ lands in box }j\}
\end{align*}
for $1\le r\le m$. Then the occupancy of box $j$ is
\begin{align*}
Y_j=\sum_{r=1}^m I_{r,j}.
\end{align*}
For each $r$, uniform choice among the $N$ boxes gives
\begin{align*}
\mathbb P(I_{r,j}=1)=\frac{1}{N}.
\end{align*}
Also,
\begin{align*}
\mathbb P(I_{r,j}=0)=1-\frac{1}{N}.
\end{align*}
Since the balls are thrown independently, the indicators $I_{1,j},\dots,I_{m,j}$ are independent Bernoulli random variables with parameter $1/N$. Hence
\begin{align*}
Y_j\sim\operatorname{Bin}\left(m,\frac{1}{N}\right).
\end{align*}
Its mean is
\begin{align*}
\mu=m\cdot\frac{1}{N}=\frac{m}{N}.
\end{align*}
For any $\delta>0$, the threshold $(1+\delta)m/N$ is the same as $(1+\delta)\mu$. Applying *Multiplicative Chernoff Bounds for Binomial Variables* to $Y_j$ gives
\begin{align*}
\mathbb P\left(Y_j\ge (1+\delta)\frac{m}{N}\right)=\mathbb P(Y_j\ge (1+\delta)\mu).
\end{align*}
Therefore
\begin{align*}
\mathbb P(Y_j\ge (1+\delta)\mu)\le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.
\end{align*}
Substituting $\mu=m/N$ gives the one-box estimate
\begin{align*}
\mathbb P\left(Y_j\ge (1+\delta)\frac{m}{N}\right)\le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{m/N}.
\end{align*}
Now set
\begin{align*}
A_j=\left\{Y_j\ge (1+\delta)\frac{m}{N}\right\}.
\end{align*}
The event that at least one box has occupancy at least $(1+\delta)m/N$ is
\begin{align*}
\left\{\max_{1\le j\le N}Y_j\ge (1+\delta)\frac{m}{N}\right\}=\bigcup_{j=1}^N A_j.
\end{align*}
By the *Union Bound*,
\begin{align*}
\mathbb P\left(\bigcup_{j=1}^N A_j\right)\le \sum_{j=1}^N \mathbb P(A_j).
\end{align*}
Each box has the same occupancy distribution because the ball choices are uniform, so $\mathbb P(A_j)=\mathbb P(A_1)$ for every $j$. Thus
\begin{align*}
\sum_{j=1}^N \mathbb P(A_j)=N\mathbb P(A_1).
\end{align*}
Using the one-box Chernoff estimate for $A_1$ gives
\begin{align*}
N\mathbb P(A_1)\le N\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{m/N}.
\end{align*}
Combining these inequalities,
\begin{align*}
\mathbb P\left(\max_{1\le j\le N}Y_j\ge (1+\delta)\frac{m}{N}\right)\le N\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{m/N}.
\end{align*}
The single-box Chernoff estimate gives an exponential bound in the expected occupancy $m/N$, and the union bound pays a factor of $N$ to control all boxes at once.
[/example]
Occupancy counts also point toward rare-event regimes, where binomial variables are often approximated by Poisson variables. In that regime the binomial formula is no longer the most natural object to optimize, but the same Chernoff calculus should still apply if the Poisson cumulant generating function has the right shape. The question is whether the familiar multiplicative upper and lower tail factors survive after passing to the Poisson model.
[quotetheorem:6048]
[citeproof:6048]
The condition $\mu>0$ excludes the degenerate Poisson law concentrated at $0$, for which the displayed relative deviations have to be interpreted separately. The lower-tail restriction $\delta\in(0,1)$ has the same meaning as in the binomial case: it keeps the threshold nonnegative and makes $\lambda=\log(1-\delta)$ a valid negative parameter. These are one-sided upper bounds, so they do not assert the exact asymptotic probability of the tail event, although the exponent has the correct large-deviation shape. The Poisson proof is the cleanest demonstration of the Chernoff calculus: write down $\Lambda_X$, solve a one-variable minimization, and substitute the optimizer. The next theorem completes the chapter by applying the same method to Gaussian variables, where the cumulant generating function is quadratic.
For a one-dimensional Gaussian, the same Chernoff computation gives the standard quadratic tail estimate. If $X\sim\mathcal N(\mu,\sigma^2)$ with $\sigma>0$, then for every $t>0$,
\begin{align*}
\mathbb P(X-\mu\ge t)\le \exp\left(-\frac{t^2}{2\sigma^2}\right),
\end{align*}
and applying the same bound to $\mu-X$ gives
\begin{align*}
\mathbb P(|X-\mu|\ge t)\le 2\exp\left(-\frac{t^2}{2\sigma^2}\right).
\end{align*}
The assumption $\sigma>0$ rules out the degenerate Gaussian $X=\mu$, where both tails at positive distance have probability $0$ and the formula with $1/\sigma^2$ is not meaningful. The condition $t>0$ is also the deviation regime; at $t=0$ the one-sided upper bound gives $1$, while the true one-sided Gaussian probability is $1/2$. The estimate captures the correct quadratic exponential scale but not the missing prefactor of the normal tail, so it is designed for concentration rather than precise asymptotics. The estimates in this chapter have the same architecture: an exponential moment bound, a convex optimization, and a final rate function. Chapter 3 first turns this architecture into Hoeffding's bounded-sum inequality, and Chapters 5 and 6 later abstract it into sub-Gaussian and sub-exponential classes, where exact moment generating functions are replaced by upper bounds that still drive the Chernoff method.
The Chernoff method gives a reusable template: bound an exponential moment, optimize over the parameter, and read off the tail from the resulting rate function. Chapter 3 applies this template to bounded independent sums, where Hoeffding's inequality shows how far boundedness alone can take us.
# 3. Hoeffding's Inequality for Bounded Independent Sums
This chapter turns the Chernoff method from the previous chapter into a reusable bound for bounded independent sums. The guiding question is: how much exponential concentration can be obtained from boundedness alone, when the summands may have different ranges? The answer is Hoeffding's inequality, which gives Gaussian-type tails with a variance proxy determined by the interval lengths rather than by the actual variance.
The argument has two parts. First, [Hoeffding's lemma](/theorems/1956) controls the moment generating function of a single centered bounded random variable. Second, independence multiplies these bounds and the Chernoff optimization converts the product estimate into a tail probability.
## Centered Bounded Variables and Their Moment Generating Functions
The Chernoff method requires a usable upper bound for $\mathbb E[e^{\lambda X}]$. For a centered random variable with finite variance, a second-order Taylor expansion suggests Gaussian behaviour near $\lambda=0$, but a tail bound needs control for every positive $\lambda$. Boundedness supplies that global control.
[definition: Centered Bounded Random Variable]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A real-valued random variable $X: \Omega \to \mathbb R$ is centered and bounded in $[a,b]$ if $a \le X \le b$ a.s. and $\mathbb E[X]=0$.
[/definition]
The centering condition removes the linear term in the exponential moment. The interval $[a,b]$ records the strongest information available in this chapter: no distributional shape is assumed beyond support and mean. The next theorem is needed because it converts exactly this limited support information into the global mgf estimate required by the Chernoff method.
[quotetheorem:1956]
[citeproof:1956]
Hoeffding's lemma says that every centered variable supported in an interval of length $b-a$ is sub-Gaussian with variance proxy $(b-a)^2/4$. Centering is essential for this exact form: if $X\equiv 1$, then $\mathbb E[e^{\lambda X}]=e^\lambda$, which cannot be bounded by $e^{\lambda^2\sigma^2/2}$ for all sufficiently small positive $\lambda$ with a finite variance proxy. Boundedness is also essential; a centered random variable with heavy enough tails may fail to have a finite moment generating function for any positive $\lambda$. The lemma does not identify the best possible mgf bound for a particular distribution, since it forgets the exact law of $X$; its role is to provide the uniform single-variable input that can be multiplied across independent summands.
[example: Rademacher Variable]
Let $\varepsilon$ satisfy $\mathbb P(\varepsilon=1)=\mathbb P(\varepsilon=-1)=1/2$. Its mean is
\begin{align*}
\mathbb E[\varepsilon]=1\cdot \frac12+(-1)\cdot \frac12=0.
\end{align*}
Thus $\varepsilon$ is centered, and its only possible values are $-1$ and $1$, so $\varepsilon\in[-1,1]$.
For every $\lambda\in\mathbb R$, its moment generating function is
\begin{align*}
\mathbb E[e^{\lambda\varepsilon}]=e^{\lambda\cdot 1}\mathbb P(\varepsilon=1)+e^{\lambda\cdot(-1)}\mathbb P(\varepsilon=-1).
\end{align*}
Substituting the two probabilities gives
\begin{align*}
\mathbb E[e^{\lambda\varepsilon}]=e^\lambda\cdot\frac12+e^{-\lambda}\cdot\frac12=\frac{e^\lambda+e^{-\lambda}}{2}.
\end{align*}
By the definition of $\cosh$, this is
\begin{align*}
\mathbb E[e^{\lambda\varepsilon}]=\cosh\lambda.
\end{align*}
The interval length is $1-(-1)=2$, so *Hoeffding Lemma* gives
\begin{align*}
\mathbb E[e^{\lambda\varepsilon}]\le \exp\left(\frac{\lambda^2(2)^2}{8}\right)=\exp\left(\frac{\lambda^2}{2}\right).
\end{align*}
Combining this with the exact mgf computation yields
\begin{align*}
\cosh\lambda\le e^{\lambda^2/2}.
\end{align*}
Thus the range-based lemma recovers the standard sub-Gaussian estimate for a symmetric sign variable, with variance proxy $1$.
[/example]
The preceding example is symmetric, but the lemma does not require symmetry. The next example shows how the interval length controls the estimate even when the distribution is skewed.
[example: Centered Bernoulli Variable]
Let $Y\sim\operatorname{Ber}(p)$, where $p\in[0,1]$, and set $X=Y-p$. Since $\mathbb P(Y=1)=p$ and $\mathbb P(Y=0)=1-p$,
\begin{align*}
\mathbb E[Y]=1\cdot p+0\cdot(1-p)=p,
\end{align*}
so
\begin{align*}
\mathbb E[X]=\mathbb E[Y-p]=\mathbb E[Y]-p=p-p=0.
\end{align*}
The two possible values of $X$ are
\begin{align*}
1-p \quad \text{when } Y=1,
\end{align*}
and
\begin{align*}
-p \quad \text{when } Y=0.
\end{align*}
Thus $X\in[-p,1-p]$ a.s., and the interval length is
\begin{align*}
(1-p)-(-p)=1-p+p=1.
\end{align*}
Applying *Hoeffding Lemma* to the centered bounded variable $X=Y-p$ gives, for every $\lambda\in\mathbb R$,
\begin{align*}
\mathbb E[e^{\lambda(Y-p)}]
=
\mathbb E[e^{\lambda X}]
\le
\exp\left(\frac{\lambda^2\cdot 1^2}{8}\right)
=
e^{\lambda^2/8}.
\end{align*}
This estimate depends only on the fact that $Y-p$ lies in an interval of length $1$, so it is uniform over all $p\in[0,1]$; the sharper [Bernoulli Chernoff bound](/theorems/6045) keeps the parameter $p$ in the exponent.
[/example]
The Bernoulli example shows both the utility and the cost of replacing an exact mgf by a range-based bound. To compare this estimate with Gaussian tails and with later concentration results, we need a common vocabulary for variables whose mgfs are bounded by Gaussian mgfs.
[definition: Sub-Gaussian Variance Proxy]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A real-valued random variable $X: \Omega \to \mathbb R$ has sub-Gaussian variance proxy $\sigma^2$ if $\mathbb E[X]=0$ and, for every $\lambda \in \mathbb R$,
\begin{align*}
\mathbb E[e^{\lambda X}] \le \exp\left(\frac{\lambda^2\sigma^2}{2}\right).
\end{align*}
[/definition]
With this language, Hoeffding's lemma says that a centered variable in $[a,b]$ has variance proxy $(b-a)^2/4$. The constant is expressed this way because a Gaussian $G \sim \mathcal N(0,\sigma^2)$ satisfies $\mathbb E[e^{\lambda G}]=e^{\lambda^2\sigma^2/2}$.
## Independent Bounded Summands with Unequal Ranges
The next problem is to pass from one bounded variable to a sum of many bounded variables. The summands need not have identical distributions and their intervals may have different lengths. This is the setting of empirical averages with non-identical observations, randomized algorithms with step-dependent errors, and many bounded martingale increments before the martingale theory appears.
[quotetheorem:1962]
[citeproof:1962]
The denominator is the sum of squared interval lengths. Thus a summand with range length $2$ contributes four times as much to the concentration scale as a summand with range length $1$, regardless of its variance. Independence is the hypothesis that lets the mgf of the sum factor into a product; without it, taking $X_1=\cdots=X_n$ would make the sum fluctuate like $nX_1$ rather than like an average of $n$ separate errors. The bounded-range assumption is also doing real work: variables with only finite variance are governed at this level by Chebyshev-type polynomial tails, not by a uniform exponential bound. The theorem does not say the range proxy is sharp for every distribution, only that it is valid uniformly; later martingale inequalities keep the same squared-range scale while weakening independence to conditional control.
[example: Empirical Average of Bounded Observations]
Let $X_1,\dots,X_n$ be independent observations with $0\le X_i\le 1$ a.s., and write $\mu_i=\mathbb E[X_i]$. If
\begin{align*}
\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i,
\end{align*}
then the empirical average error satisfies
\begin{align*}
\bar X_n-\frac{1}{n}\sum_{i=1}^n \mu_i=\frac{1}{n}\sum_{i=1}^n X_i-\frac{1}{n}\sum_{i=1}^n \mu_i.
\end{align*}
Factoring out $1/n$ and combining the two finite sums gives
\begin{align*}
\frac{1}{n}\sum_{i=1}^n X_i-\frac{1}{n}\sum_{i=1}^n \mu_i=\frac{1}{n}\left(\sum_{i=1}^n X_i-\sum_{i=1}^n \mu_i\right)=\frac{1}{n}\sum_{i=1}^n (X_i-\mu_i).
\end{align*}
Therefore, for every $\varepsilon>0$,
\begin{align*}
\left\{\bar X_n-\frac{1}{n}\sum_{i=1}^n \mu_i\ge \varepsilon\right\}=\left\{\frac{1}{n}\sum_{i=1}^n (X_i-\mu_i)\ge \varepsilon\right\}.
\end{align*}
Since $n>0$, multiplying both sides of the inequality inside the event by $n$ gives
\begin{align*}
\left\{\frac{1}{n}\sum_{i=1}^n (X_i-\mu_i)\ge \varepsilon\right\}=\left\{\sum_{i=1}^n (X_i-\mu_i)\ge n\varepsilon\right\}.
\end{align*}
For each $i$, the bounds are $a_i=0$ and $b_i=1$, so the denominator in *Hoeffding Inequality* is
\begin{align*}
\sum_{i=1}^n (b_i-a_i)^2=\sum_{i=1}^n (1-0)^2=\sum_{i=1}^n 1=n.
\end{align*}
Applying *Hoeffding Inequality* to the sample average with threshold $\varepsilon$ gives
\begin{align*}
\mathbb P\left(\bar X_n-\frac{1}{n}\sum_{i=1}^n \mu_i\ge \varepsilon\right)
\le \exp\left(-\frac{2n^2\varepsilon^2}{n}\right).
\end{align*}
Expanding the exponent gives
\begin{align*}
\exp\left(-\frac{2n^2\varepsilon^2}{n}\right)=e^{-2n\varepsilon^2}.
\end{align*}
Thus
\begin{align*}
\mathbb P\left(\bar X_n-\frac{1}{n}\sum_{i=1}^n \mu_i\ge \varepsilon\right)\le e^{-2n\varepsilon^2}.
\end{align*}
The same event conversion with absolute values gives
\begin{align*}
\left\{\left|\bar X_n-\frac{1}{n}\sum_{i=1}^n \mu_i\right|\ge \varepsilon\right\}=\left\{\left|\sum_{i=1}^n (X_i-\mu_i)\right|\ge n\varepsilon\right\}.
\end{align*}
Using the two-sided form of *Hoeffding Inequality* with threshold $\varepsilon$ gives
\begin{align*}
\mathbb P\left(\left|\bar X_n-\frac{1}{n}\sum_{i=1}^n \mu_i\right|\ge \varepsilon\right)\le 2\exp\left(-\frac{2n^2\varepsilon^2}{n}\right)=2e^{-2n\varepsilon^2}.
\end{align*}
Thus the empirical average concentrates around the average of the individual means at exponential scale $e^{-2n\varepsilon^2}$, using only independence and the common range length $1$.
[/example]
This example is the basic distribution-free confidence bound for bounded data. It depends only on independence and the interval $[0,1]$, so it applies to Bernoulli observations, clipped real-valued measurements, and bounded losses in learning theory.
[example: Randomized Rounding Error]
Suppose $x_1,\dots,x_n\in[0,1]$ are rounded independently to Bernoulli variables $Y_i$ with $\mathbb P(Y_i=1)=x_i$ and $\mathbb P(Y_i=0)=1-x_i$. For each $i$,
\begin{align*}
\mathbb E[Y_i]
=
1\cdot \mathbb P(Y_i=1)+0\cdot \mathbb P(Y_i=0)
=
1\cdot x_i+0\cdot(1-x_i)
=
x_i.
\end{align*}
Therefore the centered rounding error in the $i$th coordinate is $Y_i-x_i$, and the total rounding error is
\begin{align*}
R_n=\sum_{i=1}^n (Y_i-x_i)=\sum_{i=1}^n (Y_i-\mathbb E[Y_i]).
\end{align*}
The possible values of $Y_i-x_i$ are
\begin{align*}
1-x_i \quad \text{when } Y_i=1,
\end{align*}
and
\begin{align*}
-x_i \quad \text{when } Y_i=0.
\end{align*}
Since $0\le x_i\le 1$, we have $-x_i\le 1-x_i$, so
\begin{align*}
Y_i-x_i\in[-x_i,1-x_i]\quad\text{a.s.}
\end{align*}
The length of this interval is
\begin{align*}
(1-x_i)-(-x_i)=1-x_i+x_i=1.
\end{align*}
Thus the denominator in the two-sided form of *Hoeffding Inequality* is
\begin{align*}
\sum_{i=1}^n (b_i-a_i)^2
=
\sum_{i=1}^n \bigl((1-x_i)-(-x_i)\bigr)^2
=
\sum_{i=1}^n 1^2
=
n.
\end{align*}
Applying the inequality to the independent centered summands $Y_i-\mathbb E[Y_i]$ gives, for every $t>0$,
\begin{align*}
\mathbb P(|R_n|\ge t)
=
\mathbb P\left(\left|\sum_{i=1}^n (Y_i-\mathbb E[Y_i])\right|\ge t\right)
\le
2\exp\left(-\frac{2t^2}{n}\right).
\end{align*}
Equivalently, choosing $t=C\sqrt n$ gives
\begin{align*}
\mathbb P(|R_n|\ge C\sqrt n)
\le
2\exp\left(-\frac{2C^2 n}{n}\right)
=
2e^{-2C^2}.
\end{align*}
Thus independent randomized rounding preserves the total sum up to fluctuations on the $\sqrt n$ scale, with exponentially small probability in the squared constant $C^2$ for exceeding $C\sqrt n$.
[/example]
The rounding example has equal sensitivities: each coordinate changes the rounded total by at most one. Many applications attach different weights to different coordinates, so the effective size of the $i$th fluctuation should be $|c_i|$ rather than $1$. The following weighted form records this unequal-sensitivity version without introducing a new proof idea.
Let $Y_1,\dots,Y_n$ be independent random variables with $0\le Y_i\le 1$ almost surely, and let $c_1,\dots,c_n\in\mathbb R$. Applying Hoeffding's inequality to the bounded variables $c_iY_i$, whose interval lengths are $|c_i|$, gives
\begin{align*}
\mathbb P\left(\left|\sum_{i=1}^n c_i(Y_i-\mathbb E[Y_i])\right|\ge t\right)
\le
2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)
\end{align*}
whenever $t>0$ and $\sum_i c_i^2>0$.
Weighted Hoeffding bounds are the first glimpse of bounded differences: changing the $i$th input can alter the sum by at most $|c_i|$. The positivity of $\sum_i c_i^2$ only removes the degenerate case where the weighted sum is identically zero; in that case the displayed event has probability $0$ for every $t>0$. Independence is still essential, because perfectly correlated inputs can make many weighted errors move together and destroy the effective squared-weight scale. The bound does not use the Bernoulli law, only the interval $[0,1]$, so it applies equally to bounded losses or randomized rounding variables; later, McDiarmid's inequality will replace this linear function by a general function of independent inputs with controlled coordinate sensitivities.
## Bernoulli Sums and the Additive Chernoff-Hoeffding Bound
For Bernoulli sums, Chapter 2 gave multiplicative Chernoff bounds that adapt to the mean. Hoeffding's inequality gives a simpler additive form: it controls deviations of the sample proportion by an absolute error. This form is often the right one for confidence intervals and uniform estimation over bounded losses.
[quotetheorem:6049]
[citeproof:6049]
The word "additive" refers to the absolute deviation $t$ or $\varepsilon$, not a deviation measured relative to $\mu$. Independence is again necessary: if $Y_1=\cdots=Y_n$ with common Bernoulli parameter $p$, then $S_n/n$ is just one Bernoulli observation and does not concentrate exponentially with $n$. The Bernoulli boundedness $0\le Y_i\le 1$ fixes the range scale, but the theorem does not exploit the exact value of the mean except through centering. When $\mu$ is small, the multiplicative Chernoff bounds can be sharper; when a fixed accuracy for a proportion is desired, the additive form is usually more transparent and gives the direct confidence-interval scale used next.
[example: Confidence Interval for a Bounded Mean]
Let $X_1,\dots,X_n$ be independent random variables with $0\le X_i\le 1$ a.s. and common mean $m$, and write
\begin{align*}
\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
Since each $\mathbb E[X_i]=m$, the average of the means is
\begin{align*}
\frac{1}{n}\sum_{i=1}^n \mathbb E[X_i]
=
\frac{1}{n}\sum_{i=1}^n m
=
\frac{nm}{n}
=
m.
\end{align*}
Thus the two-sided empirical-average form of *Hoeffding Inequality* gives, for every $\varepsilon>0$,
\begin{align*}
\mathbb P(|\bar X_n-m|\ge \varepsilon)
\le
2e^{-2n\varepsilon^2}.
\end{align*}
For a prescribed $\delta\in(0,1)$, choose
\begin{align*}
\varepsilon=\sqrt{\frac{\log(2/\delta)}{2n}}.
\end{align*}
Then
\begin{align*}
2n\varepsilon^2
=
2n\left(\sqrt{\frac{\log(2/\delta)}{2n}}\right)^2
=
2n\cdot \frac{\log(2/\delta)}{2n}
=
\log(2/\delta),
\end{align*}
so the Hoeffding bound becomes
\begin{align*}
\mathbb P(|\bar X_n-m|\ge \varepsilon)
\le
2e^{-\log(2/\delta)}.
\end{align*}
Using $e^{-\log(2/\delta)}=(2/\delta)^{-1}=\delta/2$, we get
\begin{align*}
2e^{-\log(2/\delta)}
=
2\cdot \frac{\delta}{2}
=
\delta.
\end{align*}
Therefore
\begin{align*}
\mathbb P(|\bar X_n-m|\ge \varepsilon)\le \delta.
\end{align*}
The event that the interval contains $m$ is
\begin{align*}
\{m\in[\bar X_n-\varepsilon,\bar X_n+\varepsilon]\}
=
\{\bar X_n-\varepsilon\le m\le \bar X_n+\varepsilon\}
=
\{|\bar X_n-m|\le \varepsilon\}.
\end{align*}
Taking complements gives
\begin{align*}
\mathbb P(m\in[\bar X_n-\varepsilon,\bar X_n+\varepsilon])
=
1-\mathbb P(|\bar X_n-m|>\varepsilon)
\ge
1-\mathbb P(|\bar X_n-m|\ge \varepsilon)
\ge
1-\delta.
\end{align*}
Thus $[\bar X_n-\varepsilon,\bar X_n+\varepsilon]$ is a nonasymptotic confidence interval with coverage at least $1-\delta$, before any correction that clips the endpoints to the parameter range $[0,1]$.
[/example]
This confidence interval has width proportional to $n^{-1/2}$ and logarithmic dependence on $1/\delta$. Its advantage over normal approximations is that it is valid for every sample size under only boundedness and independence.
## Comparing Chebyshev, Chernoff, and Hoeffding Bounds
The final question in this chapter is not how to prove another inequality, but how to choose among the bounds already available. Chebyshev uses variance, Chernoff uses exact or bounded moment generating functions, and Hoeffding uses boundedness. These inputs lead to different strengths and losses.
[explanation: Three Sources of Tail Information]
Chebyshev's inequality turns a variance bound into a polynomial tail:
\begin{align*}
\mathbb P(|S_n-\mathbb E[S_n]|\ge t)\le \frac{\operatorname{Var}(S_n)}{t^2}.
\end{align*}
It is robust and often sharp when only second moments are known, but it cannot produce exponential decay in $t^2$.
The Chernoff method starts from moment generating functions and usually gives the sharpest exponential bound among the three methods when the mgf can be computed or tightly bounded. For Bernoulli and Poisson variables, the optimized exponent reflects the actual mean and the asymmetry between upper and lower tails.
Hoeffding's inequality sits between these approaches. It is a Chernoff argument, but Hoeffding's lemma replaces the exact mgf by a support-based upper bound. The resulting estimate is distribution-free for bounded independent summands, with a Gaussian-type exponent governed by squared ranges.
[/explanation]
The comparison is most concrete for $[0,1]$-valued averages. If the variables are independent with common mean $m$, Chebyshev gives a bound of order
\begin{align*}
\frac{1}{n\varepsilon^2},
\end{align*}
while Hoeffding gives
\begin{align*}
2e^{-2n\varepsilon^2}.
\end{align*}
[example: Chebyshev Versus Hoeffding for a Sample Mean]
Let $X_1,\dots,X_n$ be independent with $0\le X_i\le 1$ a.s. and common mean $m$, and fix $\varepsilon>0$. Since $0\le X_i\le 1$, we have $X_i^2\le X_i$ a.s., so
\begin{align*}
\operatorname{Var}(X_i)=\mathbb E[X_i^2]-(\mathbb E[X_i])^2\le \mathbb E[X_i]-m^2=m-m^2.
\end{align*}
Also,
\begin{align*}
0\le \left(m-\frac12\right)^2=m^2-m+\frac14,
\end{align*}
which is the same as
\begin{align*}
m-m^2\le \frac14.
\end{align*}
Thus $\operatorname{Var}(X_i)\le 1/4$ for every $i$.
Writing
\begin{align*}
\bar X_n=\frac1n\sum_{i=1}^n X_i,
\end{align*}
independence gives
\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n \operatorname{Var}(X_i).
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}(\bar X_n)=\frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i).
\end{align*}
Using $\operatorname{Var}(X_i)\le 1/4$,
\begin{align*}
\operatorname{Var}(\bar X_n)\le \frac{1}{n^2}\sum_{i=1}^n \frac14=\frac{1}{n^2}\cdot \frac n4=\frac{1}{4n}.
\end{align*}
The mean of the sample average is
\begin{align*}
\mathbb E[\bar X_n]=\mathbb E\left[\frac1n\sum_{i=1}^n X_i\right]=\frac1n\sum_{i=1}^n \mathbb E[X_i]=\frac1n\sum_{i=1}^n m=m.
\end{align*}
Applying *Chebyshev's inequality* to $\bar X_n$ gives
\begin{align*}
\mathbb P(|\bar X_n-m|\ge \varepsilon)\le \frac{\operatorname{Var}(\bar X_n)}{\varepsilon^2}\le \frac{1/(4n)}{\varepsilon^2}=\frac{1}{4n\varepsilon^2}.
\end{align*}
For the Hoeffding comparison, rewrite the same deviation as a centered sum:
\begin{align*}
\bar X_n-m=\frac1n\sum_{i=1}^n X_i-\frac1n\sum_{i=1}^n m=\frac1n\sum_{i=1}^n (X_i-m).
\end{align*}
Since $n>0$,
\begin{align*}
\{|\bar X_n-m|\ge \varepsilon\}=\left\{\left|\sum_{i=1}^n (X_i-m)\right|\ge n\varepsilon\right\}.
\end{align*}
For each $i$, the range is $[0,1]$, so
\begin{align*}
\sum_{i=1}^n (b_i-a_i)^2=\sum_{i=1}^n (1-0)^2=\sum_{i=1}^n 1=n.
\end{align*}
Applying the two-sided form of *Hoeffding Inequality* to the sample average with threshold $\varepsilon$ gives
\begin{align*}
\mathbb P(|\bar X_n-m|\ge \varepsilon)\le 2\exp\left(-\frac{2n^2\varepsilon^2}{n}\right)=2e^{-2n\varepsilon^2}.
\end{align*}
Thus, for fixed $\varepsilon>0$, the Chebyshev estimate decreases like $1/n$, while the Hoeffding estimate decreases exponentially in $n$.
[/example]
Hoeffding is not always the sharpest Chernoff-type estimate. Its strength is that it applies uniformly across all bounded distributions, while exact Chernoff calculations exploit more structure.
[example: Additive and Multiplicative Bernoulli Bounds]
Let $S_n\sim\operatorname{Bin}(n,p)$ and $\mu=np$. Since $S_n$ is a sum of $n$ independent Bernoulli variables with mean $p$, the additive form of *Hoeffding's inequality* gives, for every $t>0$,
\begin{align*}
\mathbb P(S_n-\mu\ge t)\le e^{-2t^2/n}.
\end{align*}
If the deviation is written as $t=c\mu$ with $c>0$, then
\begin{align*}
e^{-2t^2/n}
=
e^{-2(c\mu)^2/n}
=
e^{-2c^2\mu^2/n}.
\end{align*}
Using $\mu=np$, this exponent becomes
\begin{align*}
-\frac{2c^2\mu^2}{n}
=
-\frac{2c^2(np)^2}{n}
=
-2c^2np^2.
\end{align*}
Thus the additive bound gives
\begin{align*}
\mathbb P(S_n-\mu\ge c\mu)\le e^{-2c^2np^2}.
\end{align*}
The multiplicative Chernoff bound from Chapter 2 gives, for the same event,
\begin{align*}
\mathbb P(S_n-\mu\ge c\mu)
=
\mathbb P(S_n\ge (1+c)\mu)
\le
\exp\left(-\mu\bigl((1+c)\log(1+c)-c\bigr)\right).
\end{align*}
Substituting $\mu=np$ gives
\begin{align*}
\exp\left(-\mu\bigl((1+c)\log(1+c)-c\bigr)\right)
=
\exp\left(-np\bigl((1+c)\log(1+c)-c\bigr)\right).
\end{align*}
For fixed $c>0$, the additive exponent is $-2c^2np^2$, while the multiplicative exponent is
\begin{align*}
-np\bigl((1+c)\log(1+c)-c\bigr).
\end{align*}
Hence, when $p$ is small, the multiplicative exponent is linear in $p$ and the additive exponent is quadratic in $p$, so the multiplicative bound can be substantially smaller for deviations comparable to the mean.
For fixed additive accuracy in estimating $p$, write
\begin{align*}
\left|\frac{S_n}{n}-p\right|
=
\left|\frac{S_n}{n}-\frac{\mu}{n}\right|
=
\frac{|S_n-\mu|}{n}.
\end{align*}
Since $n>0$,
\begin{align*}
\left\{\left|\frac{S_n}{n}-p\right|\ge \varepsilon\right\}
=
\{|S_n-\mu|\ge n\varepsilon\}.
\end{align*}
The two-sided additive bound gives
\begin{align*}
\mathbb P\left(\left|\frac{S_n}{n}-p\right|\ge \varepsilon\right)
\le
2e^{-2(n\varepsilon)^2/n}
=
2e^{-2n\varepsilon^2}.
\end{align*}
To make this at most $\delta$, it is enough that
\begin{align*}
2e^{-2n\varepsilon^2}\le \delta.
\end{align*}
Equivalently,
\begin{align*}
e^{-2n\varepsilon^2}\le \frac{\delta}{2},
\end{align*}
so taking logarithms gives
\begin{align*}
-2n\varepsilon^2\le \log(\delta/2).
\end{align*}
Multiplying by $-1$ reverses the inequality:
\begin{align*}
2n\varepsilon^2\ge \log(2/\delta),
\end{align*}
and therefore
\begin{align*}
n\ge \frac{\log(2/\delta)}{2\varepsilon^2}.
\end{align*}
Thus Hoeffding gives a direct distribution-free sample-size rule for additive estimation of $p$, while the multiplicative Chernoff form is sharper when the relevant deviation is measured relative to a small mean.
[/example]
The chapter's main lesson is that boundedness plus independence is enough for exponential concentration of sums. Chapter 4 keeps the same Chernoff architecture but adds variance information to improve the range-only scale. The proof is short because the work is concentrated in Hoeffding's lemma: convexity reduces the single-variable mgf to an extremal two-point comparison, and the Chernoff method then handles sums. This pattern will reappear in later chapters whenever a local mgf estimate is tensorized by independence and optimized into a tail bound.
Hoeffding's inequality captures the effect of bounded summands, but it does not use variance information when that information is available. Chapter 4 sharpens the picture by developing Bernstein and Bennett bounds, which trade the crude range control for variance-sensitive estimates with better tail behavior.
# 4. Variance-Sensitive Bounds: Bernstein and Bennett
Chapters 2 and 3 developed the Chernoff method and Hoeffding-type estimates, where boundedness alone controlled the tail of a sum. This chapter refines that picture by replacing the range of a random variable by its variance wherever the exponential moment calculation permits it. Bennett and [Bernstein inequalities](/theorems/3188) keep the boundedness assumption but insert the total variance as the scale of the moderate-deviation regime. The goal is to understand why many statistical averages behave Gaussian near the mean even when the only global assumption is a bounded jump size, and why the tail becomes merely exponential farther out.
## Variance as the Effective Scale
The problem with a range-only estimate is not that it is false, but that it forgets how often a summand actually moves. If $X_i$ is bounded by $1$ but usually equals $0$, then the interval length is still $1$ while the variance may be much smaller. The Chernoff method can exploit this difference because the logarithm of the moment generating function begins with a quadratic term governed by variance.
[definition: Total Variance Proxy]
Let $X_1,\dots,X_n$ be independent real-valued random variables with $\mathbb E[X_i]=0$ and $\operatorname{Var}(X_i)<\infty$. The total variance proxy is
\begin{align*}
V := \sum_{i=1}^n \operatorname{Var}(X_i).
\end{align*}
[/definition]
The number $V$ is the natural scale for central fluctuations of the sum $S_n=\sum_i X_i$. A range bound, by contrast, sees only the maximum possible jump size and therefore cannot distinguish a dense Bernoulli sum from a sparse one.
[example: Sparse Bernoulli Variance Scale]
Let $Y_1,\dots,Y_n \overset{\text{i.i.d.}}{\sim} \operatorname{Ber}(p)$ and set $X_i=Y_i-p$. Since $Y_i=1$ with probability $p$ and $Y_i=0$ with probability $1-p$, the variable $X_i$ equals $1-p$ with probability $p$ and equals $-p$ with probability $1-p$. Hence $|X_i|$ is either $1-p$ or $p$, so $|X_i|\le 1$.
The centering and variance are computed from these two values:
\begin{align*}
\mathbb E[X_i]=p(1-p)+(1-p)(-p)=p-p^2-p+p^2=0.
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}(X_i)=\mathbb E[X_i^2]-(\mathbb E[X_i])^2=p(1-p)^2+(1-p)p^2.
\end{align*}
Expanding the right-hand side gives
\begin{align*}
p(1-p)^2+(1-p)p^2=p(1-2p+p^2)+p^2-p^3=p-2p^2+p^3+p^2-p^3=p-p^2=p(1-p).
\end{align*}
Thus the total variance proxy is
\begin{align*}
V=\sum_{i=1}^n \operatorname{Var}(X_i)=\sum_{i=1}^n p(1-p)=np(1-p).
\end{align*}
The centred sum is
\begin{align*}
\sum_{i=1}^n X_i=\sum_{i=1}^n (Y_i-p)=\sum_{i=1}^n Y_i-np.
\end{align*}
So deviations of $\sum_i Y_i$ around its mean $np$ are exactly deviations of $\sum_i X_i$ around $0$. A range-only estimate sees only $|X_i|\le 1$ and therefore uses scale $n$, giving a quadratic exponent of order $t^2/n$. A variance-sensitive estimate uses $V=np(1-p)$, and in the moderate-deviation regime its quadratic exponent has scale
\begin{align*}
\frac{t^2}{V}=\frac{t^2}{np(1-p)}.
\end{align*}
When $p\ll 1$, we have $1-p$ close to $1$, so this is of order $t^2/(np)$. Thus the sparse Bernoulli sum initially fluctuates on the variance scale $np$, while a range-only bound misses the fact that most summands are usually zero.
[/example]
This example points to the target form: keep a quadratic exponent when $t$ is small compared with $V$, but prevent the bound from becoming too optimistic when a single bounded summand can cause a large jump. To make that tradeoff precise, we need a Chernoff bound whose exponential-moment estimate contains both the variance and the maximum allowed jump.
## Bennett's Inequality for Bounded Summands
We now ask how much of the exact Bernoulli Chernoff calculation survives for arbitrary independent summands that are bounded above. The answer is Bennett's inequality: among centred variables with a fixed upper bound and fixed variance, the exponential moment is controlled by a two-point extremal calculation.
[definition: Bennett Function]
The Bennett function is the function $h:[0,\infty)\to[0,\infty)$ defined by
\begin{align*}
h(u) := (1+u)\log(1+u)-u.
\end{align*}
[/definition]
The function $h$ is the rate function that appears after optimizing the Chernoff parameter. It behaves quadratically near $0$ and grows like $u\log u$ for large $u$, so it is designed to express the exact compromise between variance scale and bounded-jump scale. The next theorem is the variance-sensitive Chernoff bound that realizes this compromise for independent bounded summands.
[quotetheorem:6050]
[citeproof:6050]
Bennett's theorem is one-sided because the hypothesis is one-sided: $X_i\le b$ controls the upper tail. Applying it to $-X_i$ gives the lower tail when $X_i\ge -b$ almost surely, and a two-sided bound follows when $|X_i|\le b$.
The hypotheses are not cosmetic. Independence is what allows the moment generating functions to multiply; if $X_1=\cdots=X_n=Y$ with $\mathbb E[Y]=0$ and $Y\le b$, then the sum is $nY$, not a sum with variance scale $n\operatorname{Var}(Y)$ in the same Chernoff factorisation sense. Centering is also part of the conclusion: if $X_i\equiv a>0$, then $\sum_i X_i$ has a deterministic positive drift, so an upper-tail estimate around $0$ cannot have the displayed form. The upper bound cannot be replaced by variance alone, since a rare but very large positive value can keep the variance finite while making the upper tail much heavier at that jump scale. Finite variance is needed because $V$ is the quantity appearing in the exponent; Bennett does not claim a purely variance-free bound, nor does it identify the exact tail for every distribution.
[example: Bounded Measurements with Small Variance]
Suppose $X_1,\dots,X_n$ are independent centred measurements with $|X_i|\le 1$ and $\operatorname{Var}(X_i)\le \sigma^2$, where $0<\sigma^2\ll 1$. The total variance proxy satisfies
\begin{align*}
V=\sum_{i=1}^n \operatorname{Var}(X_i)\le \sum_{i=1}^n \sigma^2=n\sigma^2.
\end{align*}
Since $X_i\le 1$ almost surely, *[Bennett Inequality](/theorems/6050)* applies with $b=1$ and gives, for $V>0$,
\begin{align*}
\mathbb P\left(\sum_{i=1}^n X_i\ge t\right)\le \exp\left(-Vh\left(\frac{t}{V}\right)\right).
\end{align*}
To replace $V$ by the upper bound $n\sigma^2$, define $g(v):=v h(t/v)$ for $v>0$. If $u=t/v$, then $h'(u)=\log(1+u)$, so
\begin{align*}
g'(v)=h\left(\frac{t}{v}\right)-\frac{t}{v}h'\left(\frac{t}{v}\right).
\end{align*}
Substituting $u=t/v$ into the right-hand side gives
\begin{align*}
g'(v)=(1+u)\log(1+u)-u-u\log(1+u)=\log(1+u)-u.
\end{align*}
Since $\log(1+u)\le u$ for $u\ge 0$, we have $g'(v)\le 0$, so $g$ is decreasing. Therefore $V\le n\sigma^2$ implies
\begin{align*}
Vh\left(\frac{t}{V}\right)=g(V)\ge g(n\sigma^2)=n\sigma^2 h\left(\frac{t}{n\sigma^2}\right).
\end{align*}
Hence
\begin{align*}
\mathbb P\left(\sum_{i=1}^n X_i\ge t\right)\le \exp\left(-n\sigma^2 h\left(\frac{t}{n\sigma^2}\right)\right).
\end{align*}
Now set $A:=n\sigma^2$ and take $t=c\sqrt A$ with $0<c\le \sqrt A$. Then $u:=t/A=c/\sqrt A$ lies in $[0,1]$. Since
\begin{align*}
h(0)=0,\qquad h'(0)=0,\qquad h''(u)=\frac{1}{1+u},
\end{align*}
Taylor's formula with integral remainder gives
\begin{align*}
h(u)=\int_0^u \frac{u-s}{1+s}\,ds.
\end{align*}
For $0\le s\le u\le 1$, we have $1/2\le 1/(1+s)\le 1$, and therefore
\begin{align*}
\int_0^u \frac{u-s}{2}\,ds\le h(u)\le \int_0^u (u-s)\,ds.
\end{align*}
Evaluating the two elementary integrals gives
\begin{align*}
\frac{u^2}{4}\le h(u)\le \frac{u^2}{2}.
\end{align*}
With $u=c/\sqrt A$, this becomes
\begin{align*}
\frac{c^2}{4}\le A h\left(\frac{c}{\sqrt A}\right)\le \frac{c^2}{2}.
\end{align*}
Thus deviations of size $t=c\sqrt{n\sigma^2}$ have a Bennett exponent bounded above and below by constants depending only on $c$. A range-only estimate such as Hoeffding's inequality uses the scale $n$ instead of $n\sigma^2$, so when $\sigma^2$ is small it misses the smaller variance scale governing the moderate deviations.
[/example]
Bennett is accurate but its rate function is sometimes inconvenient in applications. Bernstein's inequality replaces $h$ by a simpler lower bound that exposes the two regimes directly.
## Bernstein's Quadratic-to-Linear Tail Transition
The next question is how to turn Bennett's rate into a formula that can be read without optimizing a logarithmic expression. The price is a slightly weaker constant, but the reward is a denominator that shows exactly when the bound changes from Gaussian-looking to exponential-looking.
[explanation: Scalar Bernstein as a Bennett Consequence]
The scalar Bernstein estimate used in the examples below is the standard simplification of Bennett's inequality. If $X_1,\dots,X_n$ are independent, centred, and satisfy $|X_i|\le b$ almost surely, define
\begin{align*}
V:=\sum_{i=1}^n\operatorname{Var}(X_i).
\end{align*}
Bennett's inequality applied with the upper bound $X_i\le b$ gives a rate involving $h(bt/V)$. The elementary lower bound
\begin{align*}
h(u)\ge \frac{u^2}{2(1+u/3)},\qquad u\ge0,
\end{align*}
turns that Bennett rate into the readable Bernstein estimate
\begin{align*}
\mathbb P\left(\sum_{i=1}^n X_i\ge t\right)
\le
\exp\left(-\frac{t^2}{2(V+bt/3)}\right).
\end{align*}
[/explanation]
The denominator $V+bt/3$ is the main message. For $t\ll V/b$, the term $V$ dominates and the exponent behaves like $-t^2/(2V)$. For $t\gg V/b$, the term $bt/3$ dominates and the exponent behaves like $-3t/(2b)$.
The same counterexamples that limit Bennett also limit Bernstein. Without independence, correlations can turn many small variables into one repeated variable, so the variance sum no longer captures the exponential moment of the total. Without centering, the bound is placed around the wrong origin. Without a uniform upper bound, finite variance alone permits rare large jumps that violate a Bernstein tail at large $t$; for instance, a centred heavy-tailed variable can have finite variance but polynomial upper tail. Bernstein is therefore a bounded-increment theorem with a variance correction, not a replacement for assumptions such as sub-Gaussianity, martingale difference structure, or exact distributional information.
[remark: Bernstein Transition Scale]
The transition occurs at deviations of order $V/b$. Below this scale the sum behaves like a sub-Gaussian variable with variance proxy $V$. Above this scale the bounded jump size $b$ controls the tail, so the decay becomes exponential in $t/b$.
[/remark]
This transition explains why Bernstein is often the right replacement for Hoeffding in statistical estimates. Hoeffding has a single quadratic regime tied to the range, while Bernstein adapts the quadratic part to the actual variance and keeps a bounded-jump correction for larger deviations.
[example: Comparison with Hoeffding Regimes]
Let $X_i=Y_i-p$, where $Y_i\overset{\text{i.i.d.}}{\sim}\operatorname{Ber}(p)$. Then $X_i=1-p$ with probability $p$ and $X_i=-p$ with probability $1-p$, so
\begin{align*}
-p\le X_i\le 1-p.
\end{align*}
The interval length is
\begin{align*}
(1-p)-(-p)=1.
\end{align*}
By *Hoeffding's inequality*, the range bound therefore gives
\begin{align*}
\mathbb P\left(\sum_{i=1}^n X_i\ge t\right)\le \exp\left(-\frac{2t^2}{\sum_{i=1}^n 1^2}\right)=\exp\left(-\frac{2t^2}{n}\right).
\end{align*}
Thus Hoeffding uses the quadratic scale $n$.
For Bernstein, compute the variance scale from the two values of $X_i$. Since $\mathbb E[X_i]=0$,
\begin{align*}
\operatorname{Var}(X_i)=\mathbb E[X_i^2]=p(1-p)^2+(1-p)p^2.
\end{align*}
Expanding gives
\begin{align*}
p(1-p)^2=p(1-2p+p^2)=p-2p^2+p^3.
\end{align*}
Also
\begin{align*}
(1-p)p^2=p^2-p^3.
\end{align*}
Adding the two terms,
\begin{align*}
p(1-p)^2+(1-p)p^2=(p-2p^2+p^3)+(p^2-p^3)=p-p^2=p(1-p).
\end{align*}
Hence
\begin{align*}
V=\sum_{i=1}^n \operatorname{Var}(X_i)=np(1-p).
\end{align*}
Since $X_i\le 1-p\le 1$, *Bernstein Inequality* applies with $b=1$ and gives
\begin{align*}
\mathbb P\left(\sum_{i=1}^n X_i\ge t\right)\le \exp\left(-\frac{t^2}{2(V+t/3)}\right)=\exp\left(-\frac{t^2}{2(np(1-p)+t/3)}\right).
\end{align*}
Up to universal constants in the exponent, this is the scale
\begin{align*}
-\frac{t^2}{np(1-p)+t}.
\end{align*}
When $p$ is small and $t\ll np$, the term $t$ is negligible compared with $np(1-p)\approx np$, so Bernstein has quadratic exponent of order $-t^2/(np)$ instead of Hoeffding's $-t^2/n$. When $t\gg np$, the denominator is governed by $t$, and the exponent is of order $-t$; in that regime the bounded increment size and the discreteness of the count force a linear tail scale.
[/example]
The same bound will be expressed in Chapter 6 in the language of sub-exponential random variables. This form is less sharp in constants but is useful because it matches the algebra of sums and maxima used later in the course.
## Simplified Bernstein Forms in Applications
In applications, the exact denominator is often less important than a bound with clean parameters and a usable confidence form. We therefore record several equivalent-looking Bernstein consequences, each adapted to a common use case.
[quotetheorem:6051]
[citeproof:6051]
The minimum form is the standard sub-exponential Bernstein tail: Gaussian decay up to the variance scale, exponential decay beyond it. It also makes the dependence on $V$ and $b$ transparent when constants are not the main issue.
This simplification loses the exact Bennett rate and the visible constant $1/3$ in the transition term, so it should be read as a robust order-of-magnitude consequence rather than a sharp inequality. Its hypotheses are inherited from Bernstein, and each one has a concrete role. For independence, take $X_1=\cdots=X_n=Y$, where $Y$ is centred and bounded; then $\sum_i X_i=nY$, so the deviation scale is controlled by a single random draw rather than by a product of $n$ moment generating functions. For centering, take $X_i\equiv a>0$; the event $\{\sum_i X_i\ge na/2\}$ has probability $1$, while a centred Bernstein tail would predict decay. For bounded jumps, take a centred random variable with finite variance and polynomial upper tail; at large $t$, the probability of one large positive jump is too large to satisfy an exponential bound of Bernstein type.
These examples show why the minimum form is best treated as a reusable tail pattern rather than as a theorem about arbitrary finite-variance variables. The next definition abstracts exactly the two-regime conclusion of Bernstein, separating the variance-like parameter from the jump-size parameter so that later arguments can invoke this tail behaviour without repeating the bounded-summand proof.
[definition: Sub-Exponential Bernstein Tail]
A centred real-valued random variable $Z$ has a sub-exponential Bernstein tail with parameters $(\nu^2,\alpha)$ if, for every $t\ge 0$,
\begin{align*}
\mathbb P(Z\ge t)\le \exp\left(-c\min\left\{\frac{t^2}{\nu^2},\frac{t}{\alpha}\right\}\right)
\end{align*}
for a universal constant $c>0$.
[/definition]
For sums of bounded independent variables, Bernstein gives this definition with $Z=\sum_i X_i$, $\nu^2=V$, and $\alpha=b$. The terminology emphasizes that a sub-exponential tail is not merely linear in all ranges; near the origin it retains a variance-sensitive quadratic regime.
[example: Confidence Bound for a Bounded Mean]
Let $X_i:=Z_i-\mu$ and $S_n:=\sum_{i=1}^n X_i$. Then $\mathbb E[X_i]=0$, $|X_i|\le b$, and
\begin{align*}
\operatorname{Var}(X_i)=\operatorname{Var}(Z_i)\le \sigma^2.
\end{align*}
The sum $S_n$ is exactly the centered sample mean multiplied by $n$:
\begin{align*}
S_n=\sum_{i=1}^n (Z_i-\mu)=\sum_{i=1}^n Z_i-n\mu=n(\bar Z-\mu).
\end{align*}
Also,
\begin{align*}
V=\sum_{i=1}^n \operatorname{Var}(X_i)\le \sum_{i=1}^n \sigma^2=n\sigma^2.
\end{align*}
Since $X_i\le b$ almost surely, *Bernstein Inequality* gives, for every $u\ge 0$,
\begin{align*}
\mathbb P(S_n\ge u)\le \exp\left(-\frac{u^2}{2(V+bu/3)}\right).
\end{align*}
Taking $u=ns$ gives
\begin{align*}
\mathbb P(\bar Z-\mu\ge s)=\mathbb P(S_n\ge ns).
\end{align*}
Therefore
\begin{align*}
\mathbb P(\bar Z-\mu\ge s)\le \exp\left(-\frac{n^2s^2}{2(V+bns/3)}\right).
\end{align*}
Because $V+bns/3\le n\sigma^2+bns/3$, we have
\begin{align*}
\frac{n^2s^2}{2(V+bns/3)}\ge \frac{n^2s^2}{2(n\sigma^2+bns/3)}.
\end{align*}
Hence
\begin{align*}
\mathbb P(\bar Z-\mu\ge s)\le \exp\left(-\frac{ns^2}{2(\sigma^2+bs/3)}\right).
\end{align*}
Now fix $\delta\in(0,1)$ and write $L:=\log(1/\delta)$. Since $\delta\in(0,1)$, we have $L>0$. To make the preceding probability at most $\delta$, it is enough to choose $s$ so that
\begin{align*}
\frac{ns^2}{2(\sigma^2+bs/3)}\ge L.
\end{align*}
Multiplying by the positive denominator $2(\sigma^2+bs/3)$, this condition is
\begin{align*}
ns^2\ge 2\sigma^2L+\frac{2}{3}bsL.
\end{align*}
Set
\begin{align*}
s_0:=2\sqrt{\frac{\sigma^2L}{n}}+\frac{2bL}{n}.
\end{align*}
Expanding the square,
\begin{align*}
ns_0^2=4\sigma^2L+8bL\sqrt{\frac{\sigma^2L}{n}}+\frac{4b^2L^2}{n}.
\end{align*}
Substituting the same $s_0$ into the right-hand side of the required inequality gives
\begin{align*}
2\sigma^2L+\frac{2}{3}bs_0L=2\sigma^2L+\frac{4}{3}bL\sqrt{\frac{\sigma^2L}{n}}+\frac{4b^2L^2}{3n}.
\end{align*}
The difference between these two quantities is
\begin{align*}
ns_0^2-\left(2\sigma^2L+\frac{2}{3}bs_0L\right)=2\sigma^2L+\frac{20}{3}bL\sqrt{\frac{\sigma^2L}{n}}+\frac{8b^2L^2}{3n}.
\end{align*}
Each term on the right is nonnegative, so
\begin{align*}
ns_0^2\ge 2\sigma^2L+\frac{2}{3}bs_0L.
\end{align*}
Thus
\begin{align*}
\mathbb P(\bar Z-\mu\ge s_0)\le e^{-L}.
\end{align*}
Since $L=\log(1/\delta)$,
\begin{align*}
e^{-L}=e^{-\log(1/\delta)}=\delta.
\end{align*}
Therefore, with probability at least $1-\delta$,
\begin{align*}
\bar Z-\mu\le 2\sqrt{\frac{\sigma^2\log(1/\delta)}{n}}+\frac{2b\log(1/\delta)}{n}.
\end{align*}
The square-root term is the variance-sensitive moderate-deviation radius, while the linear term is the bounded-jump correction that becomes important at very high confidence.
[/example]
The previous display uses a known variance upper bound. In data analysis the variance is often estimated from the sample, leading to empirical Bernstein bounds. A full empirical Bernstein theorem requires an additional concentration argument for the sample variance, but the heuristic already explains the form of the result.
[remark: Empirical Bernstein Heuristic]
For bounded observations $Z_i\in[0,1]$, the fixed-variance Bernstein confidence radius has the shape
\begin{align*}
\sqrt{\frac{\sigma^2\log(1/\delta)}{n}}+\frac{\log(1/\delta)}{n}.
\end{align*}
Empirical Bernstein inequalities replace $\sigma^2$ by a sample variance, plus correction terms that account for estimating the variance. The advantage is strongest when the data are nearly deterministic, because the leading square-root term shrinks with the observed variance rather than with the worst-case range.
[/remark]
The main lesson of this chapter is that boundedness and variance play different roles. Variance determines the moderate-deviation scale through the quadratic part of the cumulant generating function, while boundedness limits the size of rare large deviations and produces the linear tail. Bennett preserves the sharper logarithmic rate; Bernstein gives the simpler form used again in Chapter 6 for sub-exponential sums and in Chapter 7 for statistical confidence radii.
Chapter 4 shows that variance can improve the exponential rate, especially when the fluctuations are moderate and the cumulant generating function is well behaved. Chapter 5 packages this phenomenon into the language of sub-Gaussian random variables, where Gaussian-like tails become the natural benchmark.
# 5. Sub-Gaussian Random Variables
Sub-Gaussian random variables are the right language for random quantities whose tails decay at least as fast as a Gaussian tail. The chapter uses the probability-space notation, expectation, moment generating functions, Markov's inequality, and elementary tail integration developed in Chapters 0 and 1; it also relies on the Chernoff method from Chapter 2, which turns moment generating function bounds into tail bounds. Here we reverse that perspective and treat the mgf bound, tail decay, moment growth, and Orlicz norm as interchangeable ways of measuring the same phenomenon. Once this equivalence is established, sums and maxima of sub-Gaussian variables become accessible through a small number of reusable norm estimates, connecting back to Hoeffding inequalities from Chapter 3 and preparing the ground for empirical-process bounds and high-dimensional probability estimates later in the course.
## Equivalent Ways to Recognise Sub-Gaussianity
The first question is how to define a random variable whose deviations behave like a Gaussian. A tail estimate is often the most visible form of concentration, but the Chernoff method naturally asks for control of exponential moments, while moment estimates are often what computations provide.
[definition: Sub-Gaussian Moment Generating Function Bound]
Let $X$ be a real-valued random variable with $\mathbb E[X]=0$. We say that $X$ is sub-Gaussian with variance proxy $\sigma^2>0$ if
\begin{align*}
\mathbb E[e^{\lambda X}] \le e^{\sigma^2\lambda^2/2}
\end{align*}
for every $\lambda \in \mathbb R$.
[/definition]
This definition is modelled on the exact moment generating function of $\mathcal N(0,\sigma^2)$. The centering condition matters because a nonzero mean creates a linear term in the log-mgf, and concentration is meant to describe fluctuations around the mean.
[example: Gaussian Moment Generating Function]
Let $X\sim \mathcal N(0,\sigma^2)$, with $\sigma>0$. For every $\lambda\in\mathbb R$, the Gaussian density gives
\begin{align*}
\mathbb E[e^{\lambda X}]=\int_{-\infty}^{\infty} e^{\lambda x}\frac{1}{\sqrt{2\pi}\sigma}e^{-x^2/(2\sigma^2)}\,dx.
\end{align*}
Combining the exponential factors,
\begin{align*}
\mathbb E[e^{\lambda X}]=\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}\exp\left(\lambda x-\frac{x^2}{2\sigma^2}\right)\,dx.
\end{align*}
Complete the square in the exponent:
\begin{align*}
\lambda x-\frac{x^2}{2\sigma^2}=-\frac{1}{2\sigma^2}\left(x^2-2\sigma^2\lambda x\right).
\end{align*}
Since
\begin{align*}
x^2-2\sigma^2\lambda x=(x-\sigma^2\lambda)^2-\sigma^4\lambda^2,
\end{align*}
we obtain
\begin{align*}
\lambda x-\frac{x^2}{2\sigma^2}=-\frac{(x-\sigma^2\lambda)^2}{2\sigma^2}+\frac{\sigma^2\lambda^2}{2}.
\end{align*}
Substituting this identity into the integral and factoring out the term independent of $x$ gives
\begin{align*}
\mathbb E[e^{\lambda X}]=e^{\sigma^2\lambda^2/2}\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}\exp\left(-\frac{(x-\sigma^2\lambda)^2}{2\sigma^2}\right)\,dx.
\end{align*}
With the change of variables $y=x-\sigma^2\lambda$, we have $dy=dx$ and the limits remain $-\infty$ and $\infty$, so
\begin{align*}
\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}\exp\left(-\frac{(x-\sigma^2\lambda)^2}{2\sigma^2}\right)\,dx=\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}e^{-y^2/(2\sigma^2)}\,dy.
\end{align*}
The last integral is the total mass of the $\mathcal N(0,\sigma^2)$ density, hence it equals $1$. Therefore
\begin{align*}
\mathbb E[e^{\lambda X}]=e^{\sigma^2\lambda^2/2}, \qquad \lambda\in\mathbb R.
\end{align*}
Since $\mathbb E[X]=0$, this is exactly the sub-Gaussian moment generating function bound with variance proxy $\sigma^2$, with equality for every $\lambda$. This example fixes the normalisation used throughout the chapter.
[/example]
The Gaussian computation supplies the template, but an inequality is useful only if it gives a probability statement at the desired deviation level. The mgf bound is still a transform-space statement, so the remaining task is to convert the quadratic exponent into a bound on $\mathbb P(|X|\ge t)$. Because both tails are involved, the argument must use the estimate for positive and negative exponential parameters.
[quotetheorem:6052]
[citeproof:6052]
The centering hypothesis in the theorem removes deterministic drift: if $X=Y+m$ with $m\neq 0$, then $\mathbb E[e^{\lambda X}]$ contains the factor $e^{\lambda m}$, so the displayed centred Gaussian proxy cannot hold uniformly with the same interpretation around $0$. The two-sided conclusion also uses the assumption for both positive and negative $\lambda$; controlling only $\lambda>0$ would give an upper-tail estimate and say nothing about large negative deviations. The theorem does not assert that the bound is sharp for every distribution satisfying the mgf inequality, only that the Gaussian proxy is sufficient for Chernoff optimisation. The tail estimate is already a concentration inequality, and it also controls all polynomial moments. This matters because later estimates often begin with an $L^p$ computation rather than an exponential moment computation, so we need a way to move from tails to moment scales.
[quotetheorem:6053]
[citeproof:6053]
The exponential tail assumption is essential in this form: a polynomial tail such as $\mathbb P(|X|\ge t)\asymp t^{-4}$ gives finite low moments but cannot satisfy $\|X\|_{L^p}\lesssim \sqrt p$ for all $p$. The constant $K$ records the price of a non-normalised tail bound, while the scale $a$ determines the size of the moments; changing either changes the resulting $L^p$ estimate. The theorem does not recover an mgf bound by itself, since moment control must hold uniformly over all $p$ with the correct growth before it can be converted back to exponential integrability. Moment growth tells us that the $p$th norm grows like the Gaussian benchmark $\sqrt p$, but carrying every $p$ separately is inconvenient. The next definition packages the whole sequence of moment bounds into one Orlicz norm, which becomes the main bookkeeping device for sub-Gaussian estimates.
[definition: Orlicz Psi Two Norm]
Let $L^0(\Omega;\mathbb R)$ denote the space of real-valued random variables modulo almost sure equality. The Orlicz $\psi_2$ functional is the map
\begin{align*}
\|\cdot\|_{\psi_2}:L^0(\Omega;\mathbb R)\longrightarrow [0,\infty]
\end{align*}
defined by
\begin{align*}
\|X\|_{\psi_2} := \inf\left\{s>0 : \mathbb E\left[e^{X^2/s^2}\right] \le 2\right\}.
\end{align*}
[/definition]
The value may be $+\infty$ on all of $L^0(\Omega;\mathbb R)$; it becomes a norm on the subspace of almost-sure equivalence classes with finite $\psi_2$ value. The constant $2$ is conventional; replacing it by any fixed number greater than $1$ changes the quantity only by a universal factor. Since this norm is not tied to centering, the next theorem states the equivalence for the centred fluctuation $X-\mathbb E[X]$ and records that all four ways of measuring sub-Gaussianity give the same scale up to constants.
[quotetheorem:6054]
[citeproof:6054]
The centering in the statement is not cosmetic: if $X=m+Z$ with $Z$ centred Gaussian and $m$ large, then the tails of $X$ around $0$ and the tails of $X-\mathbb E[X]$ live on different scales. The finite-norm condition excludes heavy-tailed variables such as Pareto variables, whose individual low moments may exist but whose exponential square moment is infinite. The theorem does not identify the exact best constant in any one convention; it says that the conventions are interchangeable when universal constants are acceptable. This is the point at which $\|X\|_{\psi_2}$ becomes a genuine concentration scale. In applications we usually estimate the most accessible quantity, then move to tails or mgfs as needed.
[example: Maximum Of Gaussian Coordinates]
Let $G=(G_1,\dots,G_n)$ have independent coordinates $G_i\sim\mathcal N(0,1)$. We first compute the one-coordinate tail estimate. For $t\ge 0$ and $\lambda>0$, Markov's inequality gives
\begin{align*}
\mathbb P(G_i\ge t)=\mathbb P(e^{\lambda G_i}\ge e^{\lambda t}).
\end{align*}
Hence
\begin{align*}
\mathbb P(G_i\ge t)\le e^{-\lambda t}\mathbb E[e^{\lambda G_i}].
\end{align*}
The standard Gaussian moment generating function is $\mathbb E[e^{\lambda G_i}]=e^{\lambda^2/2}$, so
\begin{align*}
\mathbb P(G_i\ge t)\le \exp\left(-\lambda t+\frac{\lambda^2}{2}\right).
\end{align*}
Choosing $\lambda=t$ gives
\begin{align*}
\mathbb P(G_i\ge t)\le e^{-t^2/2}.
\end{align*}
Since $-G_i$ also has the $\mathcal N(0,1)$ distribution, the same argument gives
\begin{align*}
\mathbb P(G_i\le -t)=\mathbb P(-G_i\ge t)\le e^{-t^2/2}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(|G_i|\ge t)=\mathbb P(G_i\ge t)+\mathbb P(G_i\le -t)\le 2e^{-t^2/2}.
\end{align*}
Now apply the union bound to the events $\{|G_i|\ge t\}$. For every $t\ge 0$,
\begin{align*}
\mathbb P\left(\max_{1\le i\le n}|G_i|\ge t\right)=\mathbb P\left(\bigcup_{i=1}^n\{|G_i|\ge t\}\right).
\end{align*}
Thus
\begin{align*}
\mathbb P\left(\max_{1\le i\le n}|G_i|\ge t\right)\le \sum_{i=1}^n\mathbb P(|G_i|\ge t)\le 2n e^{-t^2/2}.
\end{align*}
Fix $u\ge 0$ and set $t=\sqrt{2\log(2n)}+u$. Expanding the square gives
\begin{align*}
t^2=2\log(2n)+2u\sqrt{2\log(2n)}+u^2.
\end{align*}
Substituting this value of $t$ into the union-bound estimate yields
\begin{align*}
2n e^{-t^2/2}=2n\exp\left(-\log(2n)-u\sqrt{2\log(2n)}-\frac{u^2}{2}\right).
\end{align*}
Since $2n e^{-\log(2n)}=1$, this becomes
\begin{align*}
2n e^{-t^2/2}=\exp\left(-u\sqrt{2\log(2n)}-\frac{u^2}{2}\right).
\end{align*}
Because $u\sqrt{2\log(2n)}\ge 0$, we have
\begin{align*}
\exp\left(-u\sqrt{2\log(2n)}-\frac{u^2}{2}\right)\le e^{-u^2/2}.
\end{align*}
Consequently,
\begin{align*}
\mathbb P\left(\max_{1\le i\le n}|G_i|\ge \sqrt{2\log(2n)}+u\right)\le e^{-u^2/2}.
\end{align*}
The maximum is therefore concentrated around a scale of order $\sqrt{\log n}$; the logarithmic factor is the cost of controlling all $n$ Gaussian coordinates at once.
[/example]
## Closure Properties and Standard Sources
The next question is whether the class of sub-Gaussian variables is stable under the operations used in probability: scaling, centering, adding independent variables, and forming bounded functions. Without these closure properties, the equivalence theorem would be a diagnostic tool but not a usable calculus. For example, even if $X$ has Gaussian tails, a proof usually needs the centred variable $X-\mathbb E[X]$ or a rescaled copy $aX$, so the theory must keep track of how the concentration scale changes under those operations.
[quotetheorem:6055]
[citeproof:6055]
The scaling identity depends on using a homogeneous Orlicz gauge; it would fail for a tail statement with constants hidden in an unspecified way. Centering requires integrability, which is supplied by the finite $\psi_2$ norm through the moment characterisation. The result does not say that subtracting an arbitrary random quantity preserves sub-Gaussianity: if $Y$ is heavy-tailed, then $X-Y$ can be heavy-tailed even when $X$ is bounded. Scaling and centering let us normalise individual variables, but concentration for sums requires the class to interact well with independence. The following theorem says that independent centred sub-Gaussian variables add like independent Gaussians: the variance proxies add rather than the standard deviations.
[quotetheorem:6056]
[citeproof:6056]
Independence is the structural hypothesis that makes the variance proxies add: the proof uses factorisation of the mgf, and that step is unavailable for dependent variables. A concrete failure is the perfectly correlated case $X_i=Z$ for all $i$, where $Z$ is centred sub-Gaussian; then $\sum_i a_iX_i=(\sum_i a_i)Z$, so the scale is governed by $|\sum_i a_i|$ and can be as large as the $\ell^1$ size of the coefficient vector rather than its $\ell^2$ size. Centering also matters because deterministic means add linearly and belong outside the fluctuation estimate. The theorem does not cover martingale differences or weak dependence; those require conditional mgf estimates or separate decoupling arguments. It reduces concentration for a sum to a deterministic estimate of the individual sub-Gaussian scales, as illustrated by random signs in a fixed direction.
[example: Random Signs In A Fixed Linear Form]
Let $\varepsilon_1,\dots,\varepsilon_n$ be independent Rademacher random variables, so $\mathbb P(\varepsilon_i=1)=\mathbb P(\varepsilon_i=-1)=1/2$, and let $a=(a_1,\dots,a_n)\in\mathbb R^n$. For
\begin{align*}
S=\sum_{i=1}^n a_i\varepsilon_i,
\end{align*}
we compute the moment generating function. For each $i$ and every $\lambda\in\mathbb R$,
\begin{align*}
\mathbb E[e^{\lambda a_i\varepsilon_i}]=\frac12 e^{\lambda a_i}+\frac12 e^{-\lambda a_i}.
\end{align*}
Thus
\begin{align*}
\mathbb E[e^{\lambda a_i\varepsilon_i}]=\frac{e^{\lambda a_i}+e^{-\lambda a_i}}{2}.
\end{align*}
By the definition of hyperbolic cosine,
\begin{align*}
\mathbb E[e^{\lambda a_i\varepsilon_i}]=\cosh(\lambda a_i).
\end{align*}
For every real $u$, the [power series](/page/Power%20Series) expansions give
\begin{align*}
\cosh u=\sum_{k=0}^{\infty}\frac{u^{2k}}{(2k)!}.
\end{align*}
Since $(2k)!=1\cdot 2\cdot 3\cdots (2k)$ contains the $k$ factors $2,4,\dots,2k$, we have
\begin{align*}
(2k)!\ge 2^k k!.
\end{align*}
Therefore
\begin{align*}
\frac{1}{(2k)!}\le \frac{1}{2^k k!}.
\end{align*}
Substituting this coefficient bound term by term gives
\begin{align*}
\cosh u\le \sum_{k=0}^{\infty}\frac{u^{2k}}{2^k k!}.
\end{align*}
The right-hand side is the exponential series with argument $u^2/2$:
\begin{align*}
\sum_{k=0}^{\infty}\frac{u^{2k}}{2^k k!}=\sum_{k=0}^{\infty}\frac{(u^2/2)^k}{k!}=e^{u^2/2}.
\end{align*}
Applying this with $u=\lambda a_i$ yields
\begin{align*}
\mathbb E[e^{\lambda a_i\varepsilon_i}]\le e^{\lambda^2a_i^2/2}.
\end{align*}
Independence factors the exponential moment of the sum. First,
\begin{align*}
e^{\lambda S}=e^{\lambda\sum_{i=1}^n a_i\varepsilon_i}.
\end{align*}
Using $e^{x_1+\cdots+x_n}=e^{x_1}\cdots e^{x_n}$,
\begin{align*}
e^{\lambda S}=\prod_{i=1}^n e^{\lambda a_i\varepsilon_i}.
\end{align*}
Taking expectations and using independence,
\begin{align*}
\mathbb E[e^{\lambda S}]=\prod_{i=1}^n \mathbb E[e^{\lambda a_i\varepsilon_i}].
\end{align*}
Using the one-coordinate bound,
\begin{align*}
\mathbb E[e^{\lambda S}]\le \prod_{i=1}^n e^{\lambda^2a_i^2/2}.
\end{align*}
Multiplying the exponentials,
\begin{align*}
\prod_{i=1}^n e^{\lambda^2a_i^2/2}=\exp\left(\frac{\lambda^2}{2}\sum_{i=1}^n a_i^2\right).
\end{align*}
Since $|a|^2=\sum_{i=1}^n a_i^2$,
\begin{align*}
\mathbb E[e^{\lambda S}]\le e^{\lambda^2|a|^2/2}.
\end{align*}
Also
\begin{align*}
\mathbb E[\varepsilon_i]=\frac12\cdot 1+\frac12\cdot (-1)=0,
\end{align*}
so by linearity of expectation,
\begin{align*}
\mathbb E[S]=\sum_{i=1}^n a_i\mathbb E[\varepsilon_i]=0.
\end{align*}
Thus $S$ is sub-Gaussian with variance proxy $|a|^2$. If $|a|>0$, the moment-generating-function-to-tail estimate gives, for every $t\ge 0$,
\begin{align*}
\mathbb P(|S|\ge t)\le 2e^{-t^2/(2|a|^2)}.
\end{align*}
If $|a|=0$, then $a_i=0$ for every $i$, so $S=0$ almost surely. Random signs therefore turn the deterministic Euclidean size $|a|$ of the coefficient vector into the concentration scale of the random linear form.
[/example]
Random signs show sub-Gaussianity through an exact mgf calculation. This exact calculation would not survive replacing the signs by arbitrary bounded variables, because the mgf may no longer have a closed form. Bounded random variables require a different argument that uses only the interval containing the variable; the next lemma proves that the range length alone controls the centred mgf.
[quotetheorem:1956]
[citeproof:1956]
Both hypotheses in Hoeffding's lemma are doing work. Boundedness rules out variables with rare large jumps, such as a centred variable that equals $M$ with probability $M^{-2}$ and is adjusted elsewhere to have mean zero; such variables may have finite variance while their mgf grows too quickly for a uniform Gaussian proxy. Centering removes the linear term in the log-mgf, since an uncentred bounded variable has $\mathbb E[e^{\lambda X}]=1+\lambda\mathbb E[X]+O(\lambda^2)$ near $\lambda=0$. The lemma also does not give the sharp variance-sensitive behaviour available from Bernstein-type bounds when the variance is much smaller than $(b-a)^2$. The following example records the form used when a bounded summand appears inside a larger concentration proof.
[example: Bounded Variables As Sub-Gaussian Variables]
Let $Y$ be any random variable with values in $[a,b]$, and set
\begin{align*}
X=Y-\mathbb E[Y].
\end{align*}
Then
\begin{align*}
\mathbb E[X]=\mathbb E[Y-\mathbb E[Y]]
=\mathbb E[Y]-\mathbb E[Y]
=0.
\end{align*}
Since $a\le Y\le b$ almost surely, subtracting $\mathbb E[Y]$ from each part of the inequality gives
\begin{align*}
a-\mathbb E[Y]\le X\le b-\mathbb E[Y]
\end{align*}
almost surely. The length of this interval is
\begin{align*}
(b-\mathbb E[Y])-(a-\mathbb E[Y])
=b-\mathbb E[Y]-a+\mathbb E[Y]
=b-a.
\end{align*}
Applying *Hoeffding Lemma* to the centred variable $X$ therefore gives, for every $\lambda\in\mathbb R$,
\begin{align*}
\mathbb E[e^{\lambda X}]
\le \exp\left(\frac{\lambda^2(b-a)^2}{8}\right).
\end{align*}
Substituting $X=Y-\mathbb E[Y]$ yields
\begin{align*}
\mathbb E[e^{\lambda(Y-\mathbb E[Y])}]
\le \exp\left(\frac{\lambda^2(b-a)^2}{8}\right).
\end{align*}
The sub-Gaussian moment generating function bound with variance proxy $\sigma^2$ has the form
\begin{align*}
\mathbb E[e^{\lambda(Y-\mathbb E[Y])}]
\le \exp\left(\frac{\sigma^2\lambda^2}{2}\right).
\end{align*}
Choosing
\begin{align*}
\sigma^2=\frac{(b-a)^2}{4}
\end{align*}
gives
\begin{align*}
\frac{\sigma^2\lambda^2}{2}
=\frac{1}{2}\cdot \frac{(b-a)^2}{4}\lambda^2
=\frac{\lambda^2(b-a)^2}{8}.
\end{align*}
Thus $Y-\mathbb E[Y]$ is sub-Gaussian with variance proxy $(b-a)^2/4$. In particular, if $Y$ is Bernoulli, then $Y\in[0,1]$, so $Y-\mathbb E[Y]$ is sub-Gaussian with variance proxy $1/4$.
[/example]
## Norm Estimates and Maximal Inequalities
The last question of the chapter is how sub-Gaussian norms behave in estimates where constants and finite families matter. Concentration arguments often need a bound for a maximum, a supremum over a finite class, or a linear combination, so the logarithmic dependence on the number of variables must be tracked. This is also where the chapter begins to connect with later empirical process theory and random matrix theory: finite maxima are the first model for suprema over classes of functions, nets on spheres, and high-dimensional coordinate searches.
[quotetheorem:6057]
[citeproof:6057]
The linear-combination estimate is another place where independence is essential; without mgf factorisation, the coefficient vector need not enter through its Euclidean length. The estimate is also a norm-scale statement rather than an exact variance calculation, so constants are universal rather than distribution-specific. It is the form used in high-dimensional statistics when random signed or sub-Gaussian designs are tested against a fixed vector. A maximum is different because it asks many questions at once, and the cost of answering $N$ questions appears as a logarithmic factor.
[quotetheorem:6058]
[citeproof:6058]
This inequality does not require independence among the variables; the union bound only uses the individual tails. The logarithmic factor is generally unavoidable, as independent standard Gaussian variables satisfy $\mathbb E[\max_{1\le i\le N}|G_i|]\asymp \sqrt{\log N}$. The result is finite-class only: it does not control $\sup_{t\in T}X_t$ over an infinite index set without additional metric entropy, covering, or chaining information. This limitation is the bridge to empirical process theory, concentration of measure on product spaces, and random matrix estimates where one first controls a finite net and then passes to a continuum.
[remark: Constants And Variance Proxies]
Different texts normalise sub-Gaussianity in slightly different ways: some use $\mathbb E[e^{\lambda X}]\le e^{K^2\lambda^2}$, some use tails $e^{-t^2/K^2}$, and some define the theory through $\psi_2$ norms. The equivalence theorem means that these conventions differ only by universal constants, but constants such as $1/2$ and $1/8$ matter when comparing Gaussian and bounded-variable bounds exactly.
[/remark]
The chapter's working rule is therefore simple: use exact mgf constants when they are available, as for Gaussians, Rademacher variables, and Hoeffding's lemma; use $\psi_2$ estimates when the object is assembled through several operations. This balance keeps the proofs short while preserving the correct dependence on variance proxies, coefficient vectors, and the size of finite classes.
Sub-Gaussianity gives a clean and flexible way to state Gaussian-scale concentration, but many important variables decay more slowly than a Gaussian tail. Chapter 6 moves to the sub-exponential scale, where the same Chernoff logic still works after the mgf is controlled only locally and the tails split into quadratic and linear regimes.
# 6. Sub-Exponential Random Variables
Sub-exponential random variables are the right scale for random quantities whose tails decay like $e^{-ct}$ rather than $e^{-ct^2}$. Chapters 2 and 5 used the Chernoff method and sub-Gaussian variables to control sums with Gaussian-type fluctuations. This chapter develops the next layer: variables with exponential tails, products and squares of sub-Gaussian variables, and Bernstein's inequality, which interpolates between Gaussian behaviour for moderate deviations and exponential behaviour for large deviations.
The guiding issue is that not every useful random variable has a finite moment generating function on the whole real line. A square of a Gaussian variable, for instance, has tails of order $e^{-ct}$, so its moment generating function exists only near the origin. Sub-exponential theory records exactly this local exponential integrability and turns it into stable tail bounds for sums.
## Tail, Moment, and Moment Generating Function Tests
How should we recognise exponential tails in a way that is robust under scaling, summation, and comparison? A bare estimate such as $\mathbb P(|X| \ge t) \le C e^{-ct}$ is useful, but the Chernoff method naturally asks for information about $\mathbb E[e^{\lambda X}]$ near $\lambda=0$, while moment estimates ask how fast $\mathbb E[|X|^p]$ grows with $p$. The point of this section is that these viewpoints are interchangeable up to absolute constants.
We begin with the norm that packages exponential integrability into a single number.
[definition: Psi One Norm]
On the [vector space](/page/Vector%20Space) of real-valued random variables modulo a.s. equality, define the functional
\begin{align*}
\|\cdot\|_{\psi_1}: L^0(\Omega;\mathbb R) \to [0,\infty]
\end{align*}
by
\begin{align*}
\|X\|_{\psi_1} := \inf\{K > 0 : \mathbb E[\exp(|X|/K)] \le 2\}.
\end{align*}
A real-valued random variable $X$ is sub-exponential if $\|X\|_{\psi_1} < \infty$.
[/definition]
This definition is two-sided and therefore suited to controlling $|X|$. For concentration of sums, the centered one-sided moment generating function is often more convenient, because cancellation in $\sum_i X_i$ is visible only after centering.
[definition: Centered Sub-Exponential MGF Bound]
Let $X$ be a real-valued random variable with $\mathbb E[X]=0$. We say that $X$ satisfies a centered sub-exponential MGF bound with parameters $(\nu,b)$ if $\nu>0$, $b>0$, and
\begin{align*}
\mathbb E[e^{\lambda X}] \le \exp\left(\frac{\nu^2\lambda^2}{2}\right)
\end{align*}
for all $\lambda \in \mathbb R$ with $|\lambda| < 1/b$.
[/definition]
The parameter $\nu$ governs the small-deviation, variance-like part of the estimate, while $b$ marks the scale beyond which the MGF is no longer controlled quadratically. A concrete reason to keep several tests available is that different examples reveal different information: an exponential variable gives its MGF explicitly, while a maximum or a product is often easier to control first by a tail or moment estimate. Since later arguments may start from a tail estimate, a moment estimate, or an Orlicz norm estimate, we need a theorem that lets us pass between these forms without changing the class of variables.
[quotetheorem:6059]
[citeproof:6059]
The theorem explains why the phrase sub-exponential is not tied to a single formula. The constants matter only up to universal factors, so the result is a classification of tail scale rather than an exact norm identity. The exponential decay hypothesis cannot be weakened to polynomial decay: for example, a Pareto-type variable with $\mathbb P(|X|\ge t)\asymp t^{-q}$ has only finitely many moments and cannot satisfy the moment growth condition for all $p\ge1$. Conversely, finite moments of every order alone are not enough unless their growth is at most linear in $p$; this limitation is what makes the moment condition quantitative rather than merely qualitative. It is legitimate to prove a tail estimate, use it as a moment estimate in a later argument, and then convert back to a moment generating function bound when summing independent variables.
[example: Centered Exponential Variable]
Let $Y\sim \operatorname{Exp}(1)$, so $Y$ has density $e^{-y}\mathbf 1_{\{y\ge0\}}$, and set $X=Y-1$. First,
\begin{align*}
\mathbb E[X]=\int_0^\infty y e^{-y}\,dy-1.
\end{align*}
[Integration by parts](/theorems/210) with $u=y$ and $dv=e^{-y}\,dy$ gives
\begin{align*}
\int_0^\infty y e^{-y}\,dy=\left[-ye^{-y}\right]_0^\infty+\int_0^\infty e^{-y}\,dy=0+1=1.
\end{align*}
Hence $\mathbb E[X]=1-1=0$, so $X$ is centered.
The variable is sub-exponential because $|X|=|Y-1|\le Y+1$. Taking $K=4$,
\begin{align*}
\mathbb E[e^{|X|/4}]\le e^{1/4}\mathbb E[e^{Y/4}].
\end{align*}
Using the density of $Y$,
\begin{align*}
\mathbb E[e^{Y/4}]=\int_0^\infty e^{y/4}e^{-y}\,dy=\int_0^\infty e^{-3y/4}\,dy=\frac43.
\end{align*}
Thus
\begin{align*}
\mathbb E[e^{|X|/4}]\le \frac43 e^{1/4}<2,
\end{align*}
so $\|X\|_{\psi_1}\le4$ by the definition of the $\psi_1$ norm.
For $t\ge0$, the right tail is
\begin{align*}
\mathbb P(X\ge t)=\mathbb P(Y\ge t+1)=\int_{t+1}^\infty e^{-y}\,dy=e^{-(t+1)}.
\end{align*}
Since $Y\ge0$, we have $X\ge -1$, and therefore $\mathbb P(X\le -t)=0$ for every $t>1$.
For $\lambda<1$, the moment generating function is
\begin{align*}
\mathbb E[e^{\lambda X}]=e^{-\lambda}\int_0^\infty e^{\lambda y}e^{-y}\,dy=e^{-\lambda}\int_0^\infty e^{-(1-\lambda)y}\,dy=e^{-\lambda}(1-\lambda)^{-1}.
\end{align*}
Taking logarithms gives
\begin{align*}
\log \mathbb E[e^{\lambda X}]=-\lambda-\log(1-\lambda).
\end{align*}
For $|\lambda|\le1/2$, the logarithmic series gives
\begin{align*}
-\lambda-\log(1-\lambda)=-\lambda+\sum_{k=1}^\infty \frac{\lambda^k}{k}=\sum_{k=2}^\infty \frac{\lambda^k}{k}.
\end{align*}
Therefore
\begin{align*}
-\lambda-\log(1-\lambda)\le \sum_{k=2}^\infty |\lambda|^k=\frac{\lambda^2}{1-|\lambda|}\le 2\lambda^2.
\end{align*}
Exponentiating,
\begin{align*}
\mathbb E[e^{\lambda X}]\le e^{2\lambda^2}
\end{align*}
for $|\lambda|\le1/2$. This explicitly shows the local quadratic MGF behaviour of the centered exponential variable, while the formula $e^{-\lambda}(1-\lambda)^{-1}$ also shows that the MGF is finite only for $\lambda<1$.
[/example]
This example is the model case: the MGF exists near the origin but fails at a finite positive value of $\lambda$. To use the Chernoff method systematically, we need a result saying that the abstract $\psi_1$ definition always supplies a local quadratic MGF estimate after centering.
[quotetheorem:6060]
[citeproof:6060]
This result is the bridge from the norm definition to Chernoff estimates. It says that a centered sub-exponential variable behaves like a sub-Gaussian variable only for sufficiently small values of the transform parameter. The restriction on $\lambda$ is not a technical nuisance: it is exactly what distinguishes sub-exponential variables from genuinely sub-Gaussian ones and is responsible for the linear large-deviation regime that appears later in Bernstein-type inequalities. Centering is also part of the mechanism, since an uncentered variable would carry a linear term in its logarithmic moment generating function and would not have purely quadratic local behavior around the origin. Thus the theorem supplies the local input for Chernoff optimization, while leaving the global tail behavior to be handled by separate estimates.
## Squares and Products of Sub-Gaussian Variables
Where do sub-exponential variables come from in practice? The main source is multiplication. Products have heavier tails than their factors, and the square of a sub-Gaussian variable has the exponential tail scale that appears in chi-square concentration.
We use the sub-Gaussian $\psi_2$ norm from Chapter 5.
[definition: Psi Two Norm]
On the vector space of real-valued random variables modulo a.s. equality, define the functional
\begin{align*}
\|\cdot\|_{\psi_2}: L^0(\Omega;\mathbb R) \to [0,\infty]
\end{align*}
by
\begin{align*}
\|X\|_{\psi_2}:=\inf\{K>0 : \mathbb E[\exp(X^2/K^2)]\le 2\}.
\end{align*}
A real-valued random variable $X$ is sub-Gaussian if $\|X\|_{\psi_2}<\infty$.
[/definition]
The definition says that $X^2$ has exponential integrability at the scale $\|X\|_{\psi_2}^2$. Since many statistics are sums of centered squares, the next lemma records this conversion in the exact form needed for Bernstein's inequality.
[quotetheorem:6061]
[citeproof:6061]
The square lemma handles diagonal quadratic terms such as $X_i^2$. The sub-Gaussian assumption is the natural hypothesis because it is exactly exponential integrability of $X^2$; if $X$ merely has exponential tails, then $X^2$ has tail scale $\mathbb P(X^2\ge t)=\mathbb P(|X|\ge \sqrt t)$, which is typically much heavier than $e^{-ct}$. The conclusion is also limited: $X^2$ is generally not sub-Gaussian, as the Gaussian example below shows by the finite singularity of its MGF. For covariance estimates and quadratic forms, cross terms $X_iY_i$ appear as well, so we also need a product estimate with the same sub-exponential conclusion.
[quotetheorem:6062]
[citeproof:6062]
This product theorem marks a structural difference between the sub-Gaussian and sub-exponential classes. Both factors need sub-Gaussian control in this form: if one factor has only a polynomial tail, then even multiplying by a bounded nonzero variable can leave a product with polynomial rather than exponential tails. The estimate also does not require independence; the proof uses only pointwise inequalities and Cauchy--Schwarz, so dependence between $X$ and $Y$ is not an obstruction. Its limitation is that multiplication loses a tail regime: even when both inputs are sub-Gaussian, the product need not be sub-Gaussian, and the correct general target is $\psi_1$ rather than $\psi_2$. Linear combinations preserve sub-Gaussian behaviour, while quadratic expressions usually land in the sub-exponential class.
[example: Gaussian Square]
Let $Z\sim \mathcal N(0,1)$, with density $(2\pi)^{-1/2}e^{-z^2/2}$. To verify sub-Gaussianity directly, take $K=2$. Then
\begin{align*}
\mathbb E[e^{Z^2/4}]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{z^2/4}e^{-z^2/2}\,dz.
\end{align*}
Combining the exponents gives $z^2/4-z^2/2=-z^2/4$, so
\begin{align*}
\mathbb E[e^{Z^2/4}]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/4}\,dz.
\end{align*}
Using $\int_{-\infty}^{\infty}e^{-az^2}\,dz=\sqrt{\pi/a}$ for $a>0$, with $a=1/4$, gives
\begin{align*}
\mathbb E[e^{Z^2/4}]=\frac{1}{\sqrt{2\pi}}\cdot 2\sqrt{\pi}=\sqrt2<2.
\end{align*}
Thus $\|Z\|_{\psi_2}\le2$, so $Z$ is sub-Gaussian.
Also $\mathbb E[Z^2]=1$. Indeed, by symmetry,
\begin{align*}
\mathbb E[Z^2]=\frac{2}{\sqrt{2\pi}}\int_0^\infty z^2e^{-z^2/2}\,dz.
\end{align*}
[Integration by parts](/theorems/2098) with $u=z$ and $dv=ze^{-z^2/2}\,dz$ gives $du=dz$ and $v=-e^{-z^2/2}$, hence
\begin{align*}
\int_0^\infty z^2e^{-z^2/2}\,dz=\left[-ze^{-z^2/2}\right]_0^\infty+\int_0^\infty e^{-z^2/2}\,dz.
\end{align*}
The boundary term is $0$, and symmetry of the standard Gaussian integral gives
\begin{align*}
\int_0^\infty e^{-z^2/2}\,dz=\frac12\int_{-\infty}^{\infty}e^{-z^2/2}\,dz=\frac{\sqrt{2\pi}}{2}.
\end{align*}
Therefore
\begin{align*}
\mathbb E[Z^2]=\frac{2}{\sqrt{2\pi}}\cdot\frac{\sqrt{2\pi}}{2}=1.
\end{align*}
So $\mathbb E[Z^2-1]=0$, and by *Sub-Gaussian Square Is Sub-Exponential*, $Z^2-1$ is centered sub-exponential.
For $\lambda<1/2$, compute the moment generating function from the density:
\begin{align*}
\mathbb E[e^{\lambda(Z^2-1)}]=e^{-\lambda}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{\lambda z^2}e^{-z^2/2}\,dz.
\end{align*}
Since $\lambda z^2-z^2/2=-(1/2-\lambda)z^2$,
\begin{align*}
\mathbb E[e^{\lambda(Z^2-1)}]=e^{-\lambda}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-(1/2-\lambda)z^2}\,dz.
\end{align*}
Applying the Gaussian integral formula with $a=1/2-\lambda>0$ gives
\begin{align*}
\mathbb E[e^{\lambda(Z^2-1)}]=e^{-\lambda}\frac{1}{\sqrt{2\pi}}\sqrt{\frac{\pi}{1/2-\lambda}}.
\end{align*}
Because $1/2-\lambda=(1-2\lambda)/2$, this becomes
\begin{align*}
\mathbb E[e^{\lambda(Z^2-1)}]=e^{-\lambda}(1-2\lambda)^{-1/2}.
\end{align*}
Taking logarithms,
\begin{align*}
\log \mathbb E[e^{\lambda(Z^2-1)}]=-\lambda-\frac12\log(1-2\lambda).
\end{align*}
For $|\lambda|\le1/4$, the logarithmic series $-\log(1-x)=\sum_{k=1}^{\infty}x^k/k$ applies with $x=2\lambda$, so
\begin{align*}
-\lambda-\frac12\log(1-2\lambda)=-\lambda+\frac12\sum_{k=1}^{\infty}\frac{(2\lambda)^k}{k}.
\end{align*}
The $k=1$ term is $\lambda$, so it cancels the initial $-\lambda$:
\begin{align*}
-\lambda-\frac12\log(1-2\lambda)=\sum_{k=2}^{\infty}\frac{2^{k-1}\lambda^k}{k}.
\end{align*}
Taking absolute values term by term gives
\begin{align*}
-\lambda-\frac12\log(1-2\lambda)\le \sum_{k=2}^{\infty}2^{k-1}|\lambda|^k.
\end{align*}
The right-hand side is a geometric series with first term $2|\lambda|^2$ and ratio $2|\lambda|$, hence
\begin{align*}
\sum_{k=2}^{\infty}2^{k-1}|\lambda|^k=\frac{2|\lambda|^2}{1-2|\lambda|}.
\end{align*}
Since $|\lambda|\le1/4$, we have $1-2|\lambda|\ge1/2$, and therefore
\begin{align*}
-\lambda-\frac12\log(1-2\lambda)\le4\lambda^2.
\end{align*}
Exponentiating,
\begin{align*}
\mathbb E[e^{\lambda(Z^2-1)}]\le e^{4\lambda^2}
\end{align*}
for every $|\lambda|\le1/4$. Thus the centered Gaussian square has the local quadratic MGF behaviour required of a centered sub-exponential variable, while the integral formula also shows that the MGF is finite exactly for $\lambda<1/2$.
[/example]
The finite singularity at $\lambda=1/2$ is not a defect of the method. It is the analytic signature of exponential, rather than Gaussian, tails for $Z^2$.
## Bernstein Inequality for Sub-Exponential Sums
What tail bound should replace Hoeffding's inequality when each summand has only a local MGF bound? The Chernoff argument still works, but the optimization over $\lambda$ is restricted to $|\lambda|<1/b$. This restriction produces a two-regime inequality: Gaussian decay for deviations below the natural variance scale and exponential decay beyond it.
[quotetheorem:6063]
[citeproof:6063]
The minimum in Bernstein's inequality is the essential feature. Small deviations are controlled by the aggregate variance proxy $\nu^2$, while large deviations are controlled by the largest individual exponential scale $b$. Independence is used exactly when multiplying the moment generating functions; without it, perfectly correlated summands can behave like $nX$ rather than like a variance-proxy sum. Centering is also necessary for a two-sided concentration statement around zero, since nonzero means create a deterministic drift in $\sum_i X_i$. Finally, the local restriction $|\lambda|<1/b$ is what forces the linear large-deviation regime, so Bernstein's inequality should be read as a two-scale estimate rather than a purely Gaussian bound.
[example: Centered Exponential Sums]
Let $Y_1,\dots,Y_n$ be i.i.d. $\operatorname{Exp}(1)$ random variables and set $X_i=Y_i-1$. For each $i$, the centered exponential computation gives
\begin{align*}
\mathbb E[X_i]=0
\end{align*}
and
\begin{align*}
\mathbb E[e^{\lambda X_i}]\le e^{2\lambda^2}
\end{align*}
whenever $|\lambda|\le 1/2$. Since
\begin{align*}
e^{2\lambda^2}=\exp\left(\frac{2^2\lambda^2}{2}\right),
\end{align*}
each $X_i$ satisfies a centered sub-exponential MGF bound with common parameters $(2,2)$.
Let
\begin{align*}
S_n:=\sum_{i=1}^n X_i=\sum_{i=1}^n(Y_i-1)=\sum_{i=1}^nY_i-n.
\end{align*}
For these parameters, the aggregate variance proxy in *Bernstein Inequality For Sub-Exponential Sums* is
\begin{align*}
\nu_{\mathrm{sum}}^2=\sum_{i=1}^n 2^2=4n,
\end{align*}
and the common exponential scale is
\begin{align*}
b_{\mathrm{sum}}=\max_{1\le i\le n}2=2.
\end{align*}
Applying Bernstein's inequality to $S_n$ gives, for every $t\ge0$,
\begin{align*}
\mathbb P(|S_n|\ge t)\le 2\exp\left(-\frac14\min\left\{\frac{t^2}{4n},\frac{t}{2}\right\}\right).
\end{align*}
Also,
\begin{align*}
\frac1n\sum_{i=1}^nY_i-1=\frac1n\left(\sum_{i=1}^nY_i-n\right)=\frac{S_n}{n}.
\end{align*}
Thus the event
\begin{align*}
\left|\frac1n\sum_{i=1}^nY_i-1\right|\ge u
\end{align*}
is the same as the event $|S_n|\ge nu$. Substituting $t=nu$ into the Bernstein bound gives
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^nY_i-1\right|\ge u\right)\le 2\exp\left(-\frac14\min\left\{\frac{n^2u^2}{4n},\frac{nu}{2}\right\}\right).
\end{align*}
Since
\begin{align*}
\frac{n^2u^2}{4n}=\frac{nu^2}{4},
\end{align*}
this becomes
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^nY_i-1\right|\ge u\right)\le 2\exp\left(-\min\left\{\frac{nu^2}{16},\frac{nu}{8}\right\}\right).
\end{align*}
Because
\begin{align*}
\min\left\{\frac{nu^2}{16},\frac{nu}{8}\right\}\ge \frac{n}{16}\min\{u^2,u\},
\end{align*}
we obtain
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^nY_i-1\right|\ge u\right)\le 2\exp\left(-\frac{n}{16}\min\{u^2,u\}\right)
\end{align*}
for all $u\ge0$. Thus the empirical mean has Gaussian-type decay $e^{-cnu^2}$ for $0\le u\le1$ and exponential-type decay $e^{-cnu}$ for $u\ge1$, with $c=1/16$ in this calculation.
[/example]
The same calculation underlies concentration of quadratic statistics. The only additional input is the square lemma, which converts Gaussian observations into centered sub-exponential summands.
[example: Chi-Square Concentration]
Let $Z_1,\dots,Z_n$ be i.i.d. $\mathcal N(0,1)$ random variables and set
\begin{align*}
X_i:=Z_i^2-1.
\end{align*}
From the Gaussian square computation, $\mathbb E[Z_i^2]=1$, and therefore
\begin{align*}
\mathbb E[X_i]=\mathbb E[Z_i^2-1]=\mathbb E[Z_i^2]-1=1-1=0.
\end{align*}
By *Sub-Gaussian Square Is Sub-Exponential*, each $X_i$ is centered sub-exponential with absolute parameters: there are universal constants $\nu_0>0$ and $b_0>0$ such that
\begin{align*}
\mathbb E[e^{\lambda X_i}]\le \exp\left(\frac{\nu_0^2\lambda^2}{2}\right)
\end{align*}
for all $|\lambda|<1/b_0$.
Applying *Bernstein Inequality For Sub-Exponential Sums* to $X_1,\dots,X_n$, the aggregate variance proxy is
\begin{align*}
\nu_{\mathrm{sum}}^2=\sum_{i=1}^n \nu_0^2=n\nu_0^2,
\end{align*}
and the aggregate exponential scale is
\begin{align*}
b_{\mathrm{sum}}=\max_{1\le i\le n} b_0=b_0.
\end{align*}
Hence, for every $t\ge0$,
\begin{align*}
\mathbb P\left(\left|\sum_{i=1}^n (Z_i^2-1)\right|\ge t\right)\le 2\exp\left(-\frac14\min\left\{\frac{t^2}{n\nu_0^2},\frac{t}{b_0}\right\}\right).
\end{align*}
Define
\begin{align*}
c:=\frac14\min\left\{\frac{1}{\nu_0^2},\frac{1}{b_0}\right\}.
\end{align*}
Since
\begin{align*}
\frac{t^2}{n\nu_0^2}\ge \min\left\{\frac{1}{\nu_0^2},\frac{1}{b_0}\right\}\frac{t^2}{n}
\end{align*}
and
\begin{align*}
\frac{t}{b_0}\ge \min\left\{\frac{1}{\nu_0^2},\frac{1}{b_0}\right\}t,
\end{align*}
we get
\begin{align*}
\frac14\min\left\{\frac{t^2}{n\nu_0^2},\frac{t}{b_0}\right\}\ge c\min\left\{\frac{t^2}{n},t\right\}.
\end{align*}
Therefore
\begin{align*}
\mathbb P\left(\left|\sum_{i=1}^n (Z_i^2-1)\right|\ge t\right)\le 2\exp\left(-c\min\left\{\frac{t^2}{n},t\right\}\right).
\end{align*}
For the empirical second moment,
\begin{align*}
\frac1n\sum_{i=1}^n Z_i^2-1=\frac1n\sum_{i=1}^n (Z_i^2-1).
\end{align*}
Thus the event
\begin{align*}
\left|\frac1n\sum_{i=1}^n Z_i^2-1\right|\ge u
\end{align*}
is exactly the event
\begin{align*}
\left|\sum_{i=1}^n (Z_i^2-1)\right|\ge nu.
\end{align*}
Substituting $t=nu$ into the previous bound gives
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^n Z_i^2-1\right|\ge u\right)\le 2\exp\left(-c\min\left\{\frac{n^2u^2}{n},nu\right\}\right).
\end{align*}
Since
\begin{align*}
\frac{n^2u^2}{n}=nu^2
\end{align*}
and
\begin{align*}
\min\{nu^2,nu\}=n\min\{u^2,u\},
\end{align*}
we conclude that, for every $u\ge0$,
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^n Z_i^2-1\right|\ge u\right)\le 2\exp\left(-c n\min\{u^2,u\}\right).
\end{align*}
This proves the elementary chi-square concentration estimate using only the sub-exponential square principle and Bernstein's inequality, without using the exact chi-square density.
[/example]
This example shows why the sub-exponential framework is useful even in classical Gaussian calculations. It gives the correct two-regime deviation shape while requiring only independence and the square-of-sub-Gaussian principle.
[example: Sample Second Moment Of Gaussian Observations]
Let $W_1,\dots,W_n$ be i.i.d. $\mathcal N(0,\sigma^2)$ random variables with $\sigma>0$, and define
\begin{align*}
Z_i:=\frac{W_i}{\sigma}.
\end{align*}
If $W_i$ has density
\begin{align*}
f_W(w)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{w^2}{2\sigma^2}\right),
\end{align*}
then the change of variables $w=\sigma z$ gives
\begin{align*}
f_Z(z)=\sigma f_W(\sigma z).
\end{align*}
Substituting the formula for $f_W$,
\begin{align*}
f_Z(z)=\sigma\cdot \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\sigma^2z^2}{2\sigma^2}\right).
\end{align*}
Canceling the factor $\sigma$ and simplifying the exponent gives
\begin{align*}
f_Z(z)=\frac{1}{\sqrt{2\pi}}e^{-z^2/2}.
\end{align*}
Thus $Z_1,\dots,Z_n$ are i.i.d. $\mathcal N(0,1)$.
By the chi-square concentration estimate from the previous example, there is a universal constant $c>0$ such that, for every $u\ge0$,
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^n Z_i^2-1\right|\ge u\right)
\le 2\exp\left(-c n\min\{u^2,u\}\right).
\end{align*}
Since $Z_i^2=W_i^2/\sigma^2$,
\begin{align*}
\frac1n\sum_{i=1}^n Z_i^2-1
=\frac1n\sum_{i=1}^n \frac{W_i^2}{\sigma^2}-1.
\end{align*}
Pulling out the constant $1/\sigma^2$,
\begin{align*}
\frac1n\sum_{i=1}^n \frac{W_i^2}{\sigma^2}-1
=\frac{1}{\sigma^2}\left(\frac1n\sum_{i=1}^n W_i^2\right)-1.
\end{align*}
Writing $1=\sigma^2/\sigma^2$ gives
\begin{align*}
\frac{1}{\sigma^2}\left(\frac1n\sum_{i=1}^n W_i^2\right)-1
=\frac{1}{\sigma^2}\left(\frac1n\sum_{i=1}^n W_i^2-\sigma^2\right).
\end{align*}
Because $\sigma^2>0$, the event
\begin{align*}
\left|\frac1n\sum_{i=1}^n Z_i^2-1\right|\ge u
\end{align*}
is the same as
\begin{align*}
\frac{1}{\sigma^2}\left|\frac1n\sum_{i=1}^n W_i^2-\sigma^2\right|\ge u.
\end{align*}
Multiplying both sides by the positive number $\sigma^2$ gives the equivalent event
\begin{align*}
\left|\frac1n\sum_{i=1}^n W_i^2-\sigma^2\right|\ge \sigma^2 u.
\end{align*}
Substituting this equivalent event into the standard-normal concentration bound gives
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^n W_i^2-\sigma^2\right|\ge \sigma^2 u\right)
\le 2\exp\left(-c n\min\{u^2,u\}\right).
\end{align*}
Thus the empirical second moment estimates $\sigma^2$ with relative error $u$ and failure probability at most $2\exp(-c n\min\{u^2,u\})$; in particular, deviations of order $\sigma^2/\sqrt n$ are the natural small-error scale.
[/example]
Bernstein's inequality is the workhorse estimate that closes the chapter. It extends the Chernoff method from globally sub-Gaussian MGFs to locally quadratic MGFs, and it explains why sums of squared or multiplied sub-Gaussian variables concentrate with a Gaussian core and exponential tails. Chapter 7 turns these tail estimates into confidence radii for empirical means.
Chapter 6 extends the classical methods to a broader class of variables whose tails are exponential rather than Gaussian. Chapter 7 turns these tail bounds into statistical tools, using them to build confidence radii for empirical means and other basic estimators.
# 7. Concentration of Empirical Means and Basic Statistical Applications
This chapter turns the concentration estimates developed earlier into statistical guarantees for empirical averages. Chapters 3, 5, and 6 gave tail bounds for sums under boundedness, sub-Gaussian, and sub-exponential hypotheses; here we rewrite those inequalities as confidence radii and use them to compare many estimates at once. The main questions are practical: how many samples are needed for a desired accuracy, how should the estimator change under heavy tails, and how does the error grow when many means or risks are estimated simultaneously?
## Confidence Intervals from Tail Bounds
The basic statistical problem is to estimate an unknown mean from independent observations. The concentration point of view asks for a radius $r_n(\beta)$ such that the empirical mean differs from the true mean by at most $r_n(\beta)$ with probability at least $1-\beta$. Different assumptions on the observations give different radii, and the course's previous inequalities now become recipes for confidence intervals.
A useful warning is that a confidence statement is only as strong as the event it controls. If $M$ separate means are each estimated with failure probability $\beta$, then the chance that at least one interval fails can be much larger than $\beta$ after the estimates are inspected together. This is why the chapter first develops pointwise intervals and then returns to simultaneous guarantees for finite classes.
[definition: Empirical Mean]
For $n\in\mathbb N$, the empirical mean map is the function
\begin{align*}
A_n: \mathbb R^n \to \mathbb R, \qquad A_n(x_1,\dots,x_n)=\frac{1}{n}\sum_{i=1}^n x_i.
\end{align*}
If $X_1,\dots,X_n$ are real-valued random variables with common expectation $\mu=\mathbb E[X_1]$, their empirical mean is the random variable
\begin{align*}
\bar X_n = A_n(X_1,\dots,X_n)=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
[/definition]
The empirical mean is useful because its expectation equals the target mean. To turn this estimator into an interval statement, we need a named quantity that records how far from $\mu$ the estimator may be at a prescribed failure probability.
[definition: Confidence Radius]
Let $\beta \in (0,1)$. A number $r_n(\beta) \ge 0$ is a confidence radius at level $1-\beta$ for estimating $\mu$ by $\bar X_n$ if
\begin{align*}
\mathbb P\left(|\bar X_n - \mu| \le r_n(\beta)\right) \ge 1-\beta.
\end{align*}
[/definition]
A confidence radius is not unique; any larger number also works. We now need an explicit radius under a concrete sampling assumption, and bounded observations are the natural starting point because Hoeffding's inequality converts ranges directly into two-sided estimation error. This motivates the following theorem.
[quotetheorem:6064]
[citeproof:6064]
This theorem explains why bounded observations lead to the familiar $n^{-1/2}$ rate with only logarithmic dependence on the failure probability. The independence hypothesis is doing real work: if all observations were identical copies of a single bounded random variable, averaging would not reduce the error at all. The bounded-range hypothesis fixes the scale without asking for a variance estimate, but it also makes the bound insensitive to favourable cases such as rare Bernoulli events with very small variance. It is often the first nonasymptotic confidence interval used for averages of losses, rewards, or proportions, and it sets the template for replacing the range parameter by sharper tail parameters later.
[example: Estimating A Bernoulli Mean]
Let $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \operatorname{Ber}(p)$. A Bernoulli random variable takes only the values $0$ and $1$, so $0\le X_i\le 1$ a.s. for every $i$. Its expectation is
\begin{align*}
\mathbb E[X_i]=0\cdot \mathbb P(X_i=0)+1\cdot \mathbb P(X_i=1)=0\cdot(1-p)+1\cdot p=p.
\end{align*}
Thus the common mean is $\mu=p$, and the empirical mean is
\begin{align*}
\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
Applying *Bounded Mean Confidence Radius* with $a=0$ and $b=1$, the radius is
\begin{align*}
r_n(\beta)=(1-0)\sqrt{\frac{\log(2/\beta)}{2n}}=\sqrt{\frac{1}{2n}\log\frac{2}{\beta}}.
\end{align*}
Substituting $\mu=p$ and this value of $r_n(\beta)$ gives
\begin{align*}
\mathbb P\left(|\bar X_n-p| \le \sqrt{\frac{1}{2n}\log\frac{2}{\beta}}\right) \ge 1-\beta.
\end{align*}
Equivalently, the random interval centered at $\bar X_n$ with radius $\sqrt{\log(2/\beta)/(2n)}$ contains the unknown success probability $p$ with probability at least $1-\beta$. Its width is $2\sqrt{\log(2/\beta)/(2n)}$, which depends on $n$ and $\beta$ but not on $p$; the bound is therefore distribution-free over all Bernoulli laws.
[/example]
The Bernoulli example is bounded, so the range determines the scale. For Gaussian observations the range assumption fails completely, since every interval $[a,b]$ misses some possible observations. Nevertheless Gaussian tails are still quadratic in the deviation, so the boundedness assumption is too strong for the phenomenon we want. The next radius uses the sub-Gaussian moment generating function condition to isolate quadratic tails without requiring bounded support.
[quotetheorem:6065]
[citeproof:6065]
Sub-Gaussian intervals have the same $n^{-1/2}$ scaling as bounded intervals, but the range length is replaced by the sub-Gaussian scale. This is the correct language for Gaussian observations and for bounded observations after centering. The identical variance-proxy assumption keeps the statement readable; with non-identical proxies the same proof gives the average proxy $n^{-2}\sum_i \sigma_i^2$ for the mean. The moment generating function hypothesis cannot be replaced by finite variance, since a finite-variance heavy-tailed law may still have polynomial rather than exponential tail probabilities for the empirical mean. This distinction is what forces the median-of-means construction later in this chapter when exponential confidence is desired under only second-moment assumptions.
[example: Gaussian Mean with Known Variance Proxy]
Let $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \mathcal N(\mu,\sigma^2)$, and write $Z_i=X_i-\mu$. Then $Z_i\sim \mathcal N(0,\sigma^2)$, so its moment generating function is
\begin{align*}
\mathbb E[e^{\lambda Z_i}]
&=\exp\left(\frac{\sigma^2\lambda^2}{2}\right)
\end{align*}
for every $\lambda\in\mathbb R$. Thus each $X_i-\mu$ is sub-Gaussian with variance proxy $\sigma^2$.
By *Sub-Gaussian Confidence Radius*, for every $\beta\in(0,1)$,
\begin{align*}
\mathbb P\left(|\bar X_n-\mu|\le \sigma\sqrt{\frac{2\log(2/\beta)}{n}}\right)\ge 1-\beta.
\end{align*}
Let
\begin{align*}
r=\sigma\sqrt{\frac{2\log(2/\beta)}{n}}.
\end{align*}
The event $|\bar X_n-\mu|\le r$ is equivalent to
\begin{align*}
-r\le \bar X_n-\mu\le r.
\end{align*}
Subtracting $\bar X_n$ throughout gives
\begin{align*}
-r-\bar X_n\le -\mu\le r-\bar X_n,
\end{align*}
and multiplying by $-1$ reverses both inequalities:
\begin{align*}
\bar X_n-r\le \mu\le \bar X_n+r.
\end{align*}
Therefore
\begin{align*}
\mathbb P\left(\mu \in \left[\bar X_n-\sigma\sqrt{\frac{2\log(2/\beta)}{n}},\,\bar X_n+\sigma\sqrt{\frac{2\log(2/\beta)}{n}}\right]\right)\ge 1-\beta.
\end{align*}
This interval is usable because its centre $\bar X_n$ is computed from the data and its radius depends only on the known scale $\sigma$, the sample size $n$, and the chosen failure probability $\beta$.
[/example]
The Gaussian example still has purely quadratic tails. A centred exponential variable gives the next boundary case: its moment generating function exists only on a bounded interval of values of $\lambda$, so Chernoff's method cannot optimize over all $\lambda\in\mathbb R$. Small deviations still behave in a variance-like way, but large deviations are governed by the finite endpoint of the moment generating function. To handle variables with exponential but non-Gaussian tails, we need a confidence radius that records both the small-deviation quadratic regime and the large-deviation linear regime.
[quotetheorem:6066]
[citeproof:6066]
This interval records the operational meaning of the sub-exponential condition. The parameter $\nu$ controls the local, variance-like regime, while $b$ controls how far the moment generating function extends and therefore how expensive very high confidence becomes. The theorem does not claim the empirical mean is always sub-Gaussian; the extra linear term is precisely the price paid for one-sided exponential tails such as those of exponential or geometric-type variables. At ordinary confidence levels the first term may dominate, while extremely small failure probabilities can force the second term to control the radius.
[example: Averaging Exponential Observations]
Let $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \operatorname{Exp}(\lambda)$, with density $\lambda e^{-\lambda x}$ on $x\ge 0$, and set $\mu=1/\lambda$. For $t<\lambda$, the moment generating function is
\begin{align*}
\mathbb E[e^{tX_i}]=\int_0^\infty e^{tx}\lambda e^{-\lambda x}\,dx=\lambda\int_0^\infty e^{-(\lambda-t)x}\,dx=\lambda\cdot\frac{1}{\lambda-t}=\frac{\lambda}{\lambda-t}.
\end{align*}
Therefore
\begin{align*}
\mathbb E[e^{t(X_i-\mu)}]=e^{-t/\lambda}\mathbb E[e^{tX_i}]=e^{-t/\lambda}\frac{\lambda}{\lambda-t}.
\end{align*}
Writing $u=t/\lambda$, this becomes
\begin{align*}
\mathbb E[e^{t(X_i-\mu)}]=\frac{e^{-u}}{1-u}.
\end{align*}
For $|u|<1$, the power series for $-\log(1-u)$ gives
\begin{align*}
\log \mathbb E[e^{t(X_i-\mu)}]=-u-\log(1-u)=-u+\sum_{k=1}^{\infty}\frac{u^k}{k}=\sum_{k=2}^{\infty}\frac{u^k}{k}.
\end{align*}
If $|u|\le 1/2$, then
\begin{align*}
\sum_{k=2}^{\infty}\frac{u^k}{k}\le \sum_{k=2}^{\infty}\frac{|u|^k}{k}\le \frac{1}{2}\sum_{k=2}^{\infty}|u|^k=\frac{|u|^2}{2(1-|u|)}\le |u|^2.
\end{align*}
Thus, whenever $|t|\le \lambda/2$,
\begin{align*}
\mathbb E[e^{t(X_i-\mu)}]\le \exp\left(\frac{t^2}{\lambda^2}\right)=\exp\left(\frac{(2/\lambda^2)t^2}{2}\right).
\end{align*}
So $X_i-\mu$ is sub-exponential with parameters $\nu^2=2/\lambda^2$ and $b=2/\lambda$ under the convention used in the preceding theorem.
Applying *Sub-Exponential Mean Confidence Radius* with these parameters gives, for every $\beta\in(0,1)$,
\begin{align*}
\mathbb P\left(|\bar X_n-\mu|\le \max\left\{\frac{\sqrt{2}}{\lambda}\sqrt{\frac{2\log(2/\beta)}{n}},\frac{4\log(2/\beta)}{\lambda n}\right\}\right)\ge 1-\beta.
\end{align*}
The first term satisfies
\begin{align*}
\frac{\sqrt{2}}{\lambda}\sqrt{\frac{2\log(2/\beta)}{n}}=\frac{2}{\lambda}\sqrt{\frac{\log(2/\beta)}{n}}.
\end{align*}
Hence a valid radius is
\begin{align*}
\max\left\{\frac{2}{\lambda}\sqrt{\frac{\log(2/\beta)}{n}},\frac{4}{\lambda}\frac{\log(2/\beta)}{n}\right\}.
\end{align*}
Since $\max\{a,b\}\le a+b$ for $a,b\ge 0$, this is at most
\begin{align*}
\frac{1}{\lambda}\left(2\sqrt{\frac{\log(2/\beta)}{n}}+4\frac{\log(2/\beta)}{n}\right).
\end{align*}
Thus exponential observations are asymmetric and unbounded, but their empirical mean still has an exponential-probability confidence radius with a variance-like term of order $\lambda^{-1}\sqrt{\log(2/\beta)/n}$ and a large-deviation term of order $\lambda^{-1}\log(2/\beta)/n$.
[/example]
## Median-of-Means Under Finite Variance
What should be done when the moment generating function is unavailable? Chebyshev's inequality alone gives only polynomial tails for the empirical mean, but a blocking construction converts a finite-variance bound into an exponential failure probability. The median-of-means estimator is the basic robust amplifier for this purpose.
[definition: Median-of-Means Estimator]
Let $k,m\in\mathbb N$ and $n=km$. Fix a partition of $\{1,\dots,n\}$ into $k$ disjoint blocks $B_1,\dots,B_k$ of size $m$. The median-of-means map associated with this partition is any function
\begin{align*}
M_{k,m}:\mathbb R^n\to\mathbb R
\end{align*}
that sends $x=(x_1,\dots,x_n)$ to a median of the block averages
\begin{align*}
y_j(x)=\frac{1}{m}\sum_{i\in B_j}x_i, \qquad 1\le j\le k.
\end{align*}
For real-valued observations $X_1,\dots,X_n$, the median-of-means estimator is $\widehat\mu_{\mathrm{MOM}}=M_{k,m}(X_1,\dots,X_n)$.
[/definition]
The construction separates accuracy from confidence: each block mean only needs a constant success probability, and the median succeeds if most blocks are accurate. The next theorem quantifies this amplification using Chebyshev for each block and a Chernoff bound for the number of bad blocks.
[quotetheorem:6067]
[citeproof:6067]
The theorem trades a small constant factor in sample size for robustness to heavy tails. The i.i.d. assumption makes the block means independent and gives them a common variance bound; without independent blocks, the Chernoff amplification step would not apply. The finite-variance hypothesis is also essential for this particular proof, since Chebyshev's inequality is the mechanism that makes each block reliable with constant probability. Unlike the empirical mean under finite variance, the estimator achieves exponential dependence on the confidence parameter without requiring a finite moment generating function, which is why it is a basic tool in robust statistics.
[example: Heavy-Tailed Mean Estimation with Finite Variance]
Suppose $X_1,\dots,X_n$ are i.i.d. with mean $\mu$ and variance at most $\sigma^2$, but the distribution has no finite moment generating function near $0$. Fix $\beta\in(0,1)$, set
\begin{align*}
L=\log\frac{2}{\beta}, \qquad k=\left\lceil 8L\right\rceil, \qquad m=\left\lfloor \frac{n}{k}\right\rfloor,
\end{align*}
and form the median-of-means estimator from $k$ blocks of size $m$, using the first $km$ observations if $km<n$.
By construction,
\begin{align*}
k=\left\lceil 8L\right\rceil \ge 8L,
\end{align*}
so
\begin{align*}
2e^{-k/8}
\le 2e^{-L}
=2e^{-\log(2/\beta)}
=2\cdot \frac{\beta}{2}
=\beta.
\end{align*}
The *[Median-of-Means Deviation Inequality](/theorems/6067)* says that if $m\ge 8\sigma^2/r^2$, then
\begin{align*}
\mathbb P\left(|\widehat\mu_{\mathrm{MOM}}-\mu|>r\right)\le 2e^{-k/8}.
\end{align*}
Choose
\begin{align*}
r=\sqrt{\frac{8\sigma^2}{m}}.
\end{align*}
Then $m=8\sigma^2/r^2$, so
\begin{align*}
\mathbb P\left(|\widehat\mu_{\mathrm{MOM}}-\mu|>\sqrt{\frac{8\sigma^2}{m}}\right)\le \beta.
\end{align*}
If $n\ge 2k$, then $n/k\ge 2$, and for any $x\ge 2$ one has $\lfloor x\rfloor\ge x/2$. Hence
\begin{align*}
m=\left\lfloor \frac{n}{k}\right\rfloor \ge \frac{n}{2k}.
\end{align*}
Since $L\ge \log 2$ and $k=\lceil 8L\rceil\le 8L+1\le 10L$, we get
\begin{align*}
\sqrt{\frac{8\sigma^2}{m}}
\le \sqrt{\frac{8\sigma^2}{n/(2k)}}
= \sqrt{\frac{16k\sigma^2}{n}}
\le \sqrt{\frac{160\sigma^2 L}{n}}
=4\sqrt{10}\,\sigma\sqrt{\frac{\log(2/\beta)}{n}}.
\end{align*}
Therefore, whenever $n\ge 2\lceil 8\log(2/\beta)\rceil$,
\begin{align*}
\mathbb P\left(|\widehat\mu_{\mathrm{MOM}}-\mu|\le 4\sqrt{10}\,\sigma\sqrt{\frac{\log(2/\beta)}{n}}\right)\ge 1-\beta.
\end{align*}
The conclusion has the same $\sigma\sqrt{\log(1/\beta)/n}$ confidence scale as sub-Gaussian mean estimation, but it uses only independence and the finite variance bound.
[/example]
Median-of-means is not designed to improve constants for Gaussian data. Its role is to preserve the concentration rate in settings where a few extreme observations may make the ordinary empirical mean unstable at high confidence.
[remark: Block Number and Confidence]
The number of blocks $k$ controls the failure probability, while the block size $m$ controls the accuracy of each block mean. Taking too many blocks makes each block too noisy; taking too few blocks leaves too little amplification. The usual choice is $k$ proportional to $\log(1/\beta)$, so the construction requires $n$ to be at least a constant multiple of $\log(1/\beta)$.
[/remark]
## Uniform Control Over Finite Classes
Many statistical tasks require estimating more than one mean. If a finite family of empirical quantities is inspected after seeing the data, then a pointwise confidence interval is not enough; the probability of at least one failure can accumulate across the class. For example, if $1000$ independent confidence intervals each fail with probability $0.01$, then the probability that at least one fails is close to one rather than $0.01$. This is the same obstruction behind multiple testing and behind overfitting by selecting the model with the smallest noisy empirical risk. The union bound gives the first uniform concentration principle.
[quotetheorem:1108]
[citeproof:1108/concentration-inequalities-i-classical-methods]
This result is simple but central: the failure budget is divided across the candidates. No independence assumption appears, which is important because empirical risks for different models are usually computed from the same data and are therefore strongly dependent. The price is that the theorem only uses the number of candidates, not their similarity; two nearly identical functions cost the same as two unrelated functions. In concentration language, it replaces $\log(1/\beta)$ by $\log(|\mathcal F|/\beta)$ in many confidence radii, and later refinements replace cardinality by metric entropy or other complexity measures.
[quotetheorem:6068]
[citeproof:6068]
The logarithmic dependence on $|\mathcal F|$ is the key lesson. The theorem requires only pointwise sub-Gaussian concentration and a finite index set; it does not require the random variables for different $f$ to be independent. Its limitation is also visible from the proof: it treats the class as an unstructured list, so it cannot exploit smoothness, low dimension, or correlations between nearby functions. A finite search over many candidates is statistically feasible when the class size is large but not exponentially larger than the sample size at the target accuracy, which is the basic finite-class version of empirical process control in learning theory.
[example: Estimating Many Bounded Means Simultaneously]
For $j=1,\dots,M$, let $X_{1j},\dots,X_{nj}$ be observations in $[0,1]$, independent across $i$ for each fixed $j$, with common mean $\mu_j$. Write
\begin{align*}
\bar X_{nj}=\frac{1}{n}\sum_{i=1}^n X_{ij}.
\end{align*}
For a fixed $j$, the variables satisfy $0\le X_{ij}\le 1$ a.s., so *Bounded Mean Confidence Radius* applied with $a=0$, $b=1$, and failure probability $\beta/M$ gives
\begin{align*}
\mathbb P\left(|\bar X_{nj}-\mu_j| \le (1-0)\sqrt{\frac{\log(2/(\beta/M))}{2n}}\right) \ge 1-\frac{\beta}{M}.
\end{align*}
Since
\begin{align*}
\log\frac{2}{\beta/M}
=\log\frac{2M}{\beta},
\end{align*}
this becomes
\begin{align*}
\mathbb P\left(|\bar X_{nj}-\mu_j| \le \sqrt{\frac{1}{2n}\log\frac{2M}{\beta}}\right) \ge 1-\frac{\beta}{M}.
\end{align*}
Define the good event
\begin{align*}
A_j=\left\{|\bar X_{nj}-\mu_j| \le \sqrt{\frac{1}{2n}\log\frac{2M}{\beta}}\right\}.
\end{align*}
The preceding bound says $\mathbb P(A_j^c)\le \beta/M$ for every $j$. Therefore, by *Finite-Class Union Bound* applied to the finite set $\{1,\dots,M\}$,
\begin{align*}
\mathbb P\left(\bigcap_{j=1}^M A_j\right)\ge 1-\beta.
\end{align*}
On the event $\bigcap_{j=1}^M A_j$, every one of the $M$ inequalities holds, which is exactly
\begin{align*}
\max_{1\le j\le M}|\bar X_{nj}-\mu_j|
\le \sqrt{\frac{1}{2n}\log\frac{2M}{\beta}}.
\end{align*}
Hence
\begin{align*}
\mathbb P\left(\max_{1\le j\le M}|\bar X_{nj}-\mu_j| \le \sqrt{\frac{1}{2n}\log\frac{2M}{\beta}}\right)\ge 1-\beta.
\end{align*}
Thus the individual failure level is divided by $M$, and the simultaneous confidence radius pays the logarithmic price $\log(2M/\beta)$ for controlling all $M$ bounded means at once.
[/example]
A related way to express finite-class control is to bound the expectation of the maximum deviation. This turns tail bounds into maximal inequalities and often feeds into model-selection arguments.
[quotetheorem:1963]
[citeproof:1963]
Applying the same quoted theorem to the $2M$ variables $Y_1,\dots,Y_M,-Y_1,\dots,-Y_M$ also gives $\mathbb E[\max_j |Y_j|]\le \sigma\sqrt{2\log(2M)}$. Thus the finite maximum bound controls the expected amount by which the most favourable member of a finite class can fluctuate upward. The mean-zero assumption centres the comparison around fluctuation rather than bias, and the common sub-Gaussian proxy supplies a uniform scale for all candidates. The theorem does not give a high-probability event by itself; it controls expectation, which is often paired with a separate concentration step when probability bounds are needed. In statistics, this is the concentration cost of selecting the best-looking candidate from finitely many noisy estimates.
[example: Selecting Among Finitely Many Models by Empirical Risk]
Let $\mathcal F$ be a nonempty finite set of prediction rules, and suppose $Z_1,\dots,Z_n$ are independent data points. For each $f\in\mathcal F$, let the loss satisfy $0\le \ell_f(Z_i)\le 1$ a.s., and define
\begin{align*}
R(f)=\mathbb E[\ell_f(Z)].
\end{align*}
\begin{align*}
\widehat R_n(f)=\frac{1}{n}\sum_{i=1}^n \ell_f(Z_i).
\end{align*}
For fixed $f$, the variables $\ell_f(Z_i)$ are bounded in $[0,1]$ and have common expectation $R(f)$, so *Bounded Mean Confidence Radius* with failure probability $\beta/|\mathcal F|$ gives
\begin{align*}
\mathbb P\left(|\widehat R_n(f)-R(f)|\le \sqrt{\frac{1}{2n}\log\frac{2}{\beta/|\mathcal F|}}\right)\ge 1-\frac{\beta}{|\mathcal F|}.
\end{align*}
Since
\begin{align*}
\log\frac{2}{\beta/|\mathcal F|}=\log\frac{2|\mathcal F|}{\beta},
\end{align*}
we have
\begin{align*}
\mathbb P\left(|\widehat R_n(f)-R(f)|\le \sqrt{\frac{1}{2n}\log\frac{2|\mathcal F|}{\beta}}\right)\ge 1-\frac{\beta}{|\mathcal F|}.
\end{align*}
Set
\begin{align*}
r=\sqrt{\frac{1}{2n}\log\frac{2|\mathcal F|}{\beta}}.
\end{align*}
For each $f\in\mathcal F$, let
\begin{align*}
A_f=\{|\widehat R_n(f)-R(f)|\le r\}.
\end{align*}
The preceding bound says $\mathbb P(A_f^c)\le \beta/|\mathcal F|$ for every $f$. By *Finite-Class Union Bound*,
\begin{align*}
\mathbb P\left(\bigcap_{f\in\mathcal F}A_f\right)\ge 1-\beta.
\end{align*}
On this event,
\begin{align*}
\sup_{f\in\mathcal F}|\widehat R_n(f)-R(f)|\le r.
\end{align*}
Now choose an empirical risk minimizer $\widehat f\in\mathcal F$ and a true risk minimizer $f^*\in\mathcal F$, so that
\begin{align*}
\widehat R_n(\widehat f)\le \widehat R_n(f^*).
\end{align*}
\begin{align*}
R(f^*)=\inf_{f\in\mathcal F}R(f).
\end{align*}
On the same uniform event, applying $|\widehat R_n(f)-R(f)|\le r$ first with $f=\widehat f$ gives
\begin{align*}
R(\widehat f)\le \widehat R_n(\widehat f)+r.
\end{align*}
The defining property of $\widehat f$ gives
\begin{align*}
\widehat R_n(\widehat f)+r\le \widehat R_n(f^*)+r.
\end{align*}
Applying the uniform bound again with $f=f^*$ gives
\begin{align*}
\widehat R_n(f^*)+r\le R(f^*)+2r.
\end{align*}
Combining the three displayed inequalities,
\begin{align*}
R(\widehat f)\le R(f^*)+2r.
\end{align*}
Therefore, with probability at least $1-\beta$,
\begin{align*}
R(\widehat f)-\inf_{f\in\mathcal F}R(f)\le 2\sqrt{\frac{1}{2n}\log\frac{2|\mathcal F|}{\beta}}.
\end{align*}
The factor $2$ appears because the comparison passes through two estimation errors: one at the selected model $\widehat f$ and one at the best true model $f^*$.
[/example]
## Statistical Interpretation and Sample Complexity
The same inequalities can be read backwards as sample complexity requirements. Instead of fixing $n$ and asking for a radius, fix an accuracy $\varepsilon>0$ and confidence level $1-\beta$, then solve for the number of samples needed.
[quotetheorem:6069]
[citeproof:6069]
This form makes the statistical message visible. Accuracy costs $\varepsilon^{-2}$ samples, confidence costs $\log(1/\beta)$ samples, and a finite search space costs $\log|\mathcal F|$ samples. The statement is sufficient rather than necessary, since it inherits the constants and possible looseness of the union bound. It also assumes a fixed finite class chosen before observing the data; if the class is adaptively generated from the same sample, the event being union-bounded has changed and additional validation or complexity control is needed.
[remark: What the Chapter Adds]
Chapters 1 through 6 developed concentration inequalities as probability estimates for sums. This chapter converts them into statistical tools: confidence intervals, robust mean estimators, simultaneous estimation guarantees, and finite-model selection bounds. Chapter 9 replaces direct summation by bounded-differences arguments for functions of independent inputs; more advanced material beyond this course replaces the finite union bound by covering or symmetrisation when the class is too large to enumerate.
[/remark]
After developing concentration for sums and averages, the course now asks what survives when randomness arrives sequentially rather than all at once. Chapter 8 answers this by replacing independent summands with martingale differences, showing how [conditional expectation](/page/Conditional%20Expectation) and bounded increments lead to Azuma-Hoeffding concentration.
# 8. Martingales and Azuma-Hoeffding as a Classical Extension
This chapter extends concentration beyond independent sums by letting information arrive over time. It uses conditional expectation, sigma-algebras, independence, Chernoff's method from Chapter 2, and the bounded-summand viewpoint from Chapters 3 and 4. The guiding object is a martingale: a process whose current value is the best conditional prediction of its future value. In Chapters 3 and 7, independent coordinates were controlled directly through sums and finite-class bounds; here the same reveal-one-coordinate philosophy is recast through filtrations and conditional moment-generating functions.
## Filtrations and Martingale Differences
When a random experiment is observed sequentially, the first question is how to record exactly what is known at each time. Without such a record, a quantity may accidentally use future information: after tossing coins $Y_1,\dots,Y_n$, the process $N_k=\mathbb E[Y_n\mid Y_1,\dots,Y_n]$ is already looking at the final coin even when $k<n$. Filtrations rule out this kind of anticipatory definition by specifying the information available at each stage, and martingales describe quantities whose conditional drift vanishes as that information expands.
[definition: Filtration]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A filtration is a sequence $(\mathcal F_k)_{k=0}^{n}$ of sub-$\sigma$-algebras of $\mathcal F$ such that
\begin{align*}
\mathcal F_0 \subset \mathcal F_1 \subset \cdots \subset \mathcal F_n \subset \mathcal F.
\end{align*}
[/definition]
The interpretation is that $\mathcal F_k$ contains the events whose truth can be decided after observing the first $k$ stages. To speak about a stochastic process that is available as the experiment unfolds, the next requirement is measurability with respect to the current information rather than future information.
[definition: Adapted Process]
Let $(\mathcal F_k)_{k=0}^{n}$ be a filtration. A sequence $(M_k)_{k=0}^{n}$ of real-valued random variables is adapted to $(\mathcal F_k)$ if $M_k$ is $\mathcal F_k$-measurable for every $0 \le k \le n$.
[/definition]
Adaptedness alone only says that the process can be observed in real time. To obtain concentration, we also need a no-drift condition saying that the next increment has conditional mean zero.
[definition: Martingale]
Let $(\mathcal F_k)_{k=0}^{n}$ be a filtration. An adapted sequence $(M_k)_{k=0}^{n}$ of integrable real-valued random variables is a martingale if
\begin{align*}
\mathbb E[M_k \mid \mathcal F_{k-1}] = M_{k-1}
\end{align*}
for every $1 \le k \le n$.
[/definition]
The increment form is the one used in Azuma-Hoeffding. It separates the proof into a conditional estimate for one step and an iteration over time.
[definition: Martingale Difference Sequence]
Let $(\mathcal F_k)_{k=0}^{n}$ be a filtration. A sequence $(X_k)_{k=1}^{n}$ of integrable real-valued random variables is a martingale difference sequence if $X_k$ is $\mathcal F_k$-measurable and
\begin{align*}
\mathbb E[X_k \mid \mathcal F_{k-1}] = 0
\end{align*}
for every $1 \le k \le n$.
[/definition]
If $M_k = M_0 + \sum_{i=1}^{k} X_i$, then $(M_k)$ is a martingale precisely when $(X_k)$ is a martingale difference sequence relative to the same filtration. This is the martingale analogue of decomposing a sum into centered summands.
[example: Revealing Independent Variables]
Let $Y_1,\dots,Y_n$ be independent random variables, set $\mathcal F_k=\sigma(Y_1,\dots,Y_k)$, and write $\mu_k=\mathbb E[g_k(Y_k)]$ and $X_k=g_k(Y_k)-\mu_k$. We show that $(X_k)_{k=1}^{n}$ is a martingale difference sequence with respect to this exposure filtration.
Since $g_k(Y_k)$ is $\sigma(Y_k)$-measurable and $\sigma(Y_k)\subseteq \mathcal F_k$, the random variable $X_k$ is $\mathcal F_k$-measurable. It is integrable because
\begin{align*}
\mathbb E[|X_k|]\le \mathbb E[|g_k(Y_k)|]+|\mu_k|\le 2\mathbb E[|g_k(Y_k)|]<\infty .
\end{align*}
Independence of $Y_1,\dots,Y_n$ implies that $Y_k$ is independent of $\mathcal F_{k-1}=\sigma(Y_1,\dots,Y_{k-1})$, so the integrable random variable $g_k(Y_k)$ is independent of $\mathcal F_{k-1}$. Hence
\begin{align*}
\mathbb E[g_k(Y_k)\mid \mathcal F_{k-1}]=\mathbb E[g_k(Y_k)]=\mu_k .
\end{align*}
Also, because $\mu_k$ is constant,
\begin{align*}
\mathbb E[\mu_k\mid \mathcal F_{k-1}]=\mu_k .
\end{align*}
Therefore, by linearity of conditional expectation,
\begin{align*}
\mathbb E[X_k\mid \mathcal F_{k-1}]=\mathbb E[g_k(Y_k)-\mu_k\mid \mathcal F_{k-1}]=\mu_k-\mu_k=0 .
\end{align*}
Thus the centered independent variables become martingale differences when the variables are revealed one at a time, so the classical centered independent sum is already a martingale sum.
[/example]
For concentration, the relevant boundedness condition is allowed to depend on the past only through deterministic constants. The word predictable refers to quantities known before the next increment is revealed.
[definition: Predictably Bounded Increments]
Let $(X_k)_{k=1}^{n}$ be adapted to $(\mathcal F_k)_{k=0}^{n}$. The sequence has predictably bounded increments with constants $c_1,\dots,c_n \ge 0$ if
\begin{align*}
|X_k| \le c_k \quad \text{a.s.}
\end{align*}
for every $1 \le k \le n$.
[/definition]
More general versions allow $c_k$ to be $\mathcal F_{k-1}$-measurable, but this first course keeps deterministic bounds so that the final inequality matches Hoeffding's form. The next lemma is the conditional replacement for Hoeffding's lemma.
[quotetheorem:6070]
[citeproof:6070]
This lemma says that bounded martingale differences are conditionally sub-Gaussian with variance proxy $c_k^2$. The martingale-difference hypothesis is essential: if $X_k$ has positive conditional drift, for instance $X_k=c_k$ a.s., then $\mathbb E[X_k\mid \mathcal F_{k-1}]\ne 0$ and the centered sub-Gaussian estimate is not the right object. The boundedness hypothesis is also doing real work, since a mean-zero variable with rare very large values can have a finite first moment but an infinite or very large moment-generating function. The lemma does not assert independence from the past; it asserts that after the past is fixed, the next increment has a Hoeffding-type conditional mgf bound, which is exactly the input needed for the iterative proof of Azuma-Hoeffding.
## Azuma-Hoeffding from Conditional Moment Generating Functions
The main question is whether the Chernoff method survives without independence. For martingales the answer is yes, because conditional mgf bounds can be multiplied by repeatedly conditioning on the last revealed sigma-algebra.
[quotetheorem:6071]
[citeproof:6071]
The result is a direct martingale extension of Hoeffding's inequality. Martingality is essential because without the conditional mean-zero property the process can drift upward deterministically, as in $M_k=M_{k-1}+1$, making a sub-Gaussian upper-tail bound around $M_0$ false. The bounded-increment assumption is also essential: a martingale may have rare large jumps, and those jumps can dominate the upper tail even when the conditional mean remains zero. The deterministic constants $c_k$ keep the final variance proxy non-random; if the bounds depend on the past, the same method needs a more refined statement tracking predictable quadratic variation, leading toward Freedman-type martingale inequalities.
[example: Concentration of a Sequential Randomized Algorithm]
Consider a randomized algorithm whose final score after $n$ random choices is an integrable real-valued random variable $Z$. Let $\mathcal F_k$ be the information generated by the first $k$ choices, and define
\begin{align*}
M_k=\mathbb E[Z\mid \mathcal F_k]
\end{align*}
for $0\le k\le n$. Assume $\mathcal F_0$ is trivial and that $Z$ is determined by all $n$ choices. Then
\begin{align*}
M_0=\mathbb E[Z]
\end{align*}
and
\begin{align*}
M_n=\mathbb E[Z\mid \mathcal F_n]=Z .
\end{align*}
Fix $k$ and fix the first $k-1$ random choices. Suppose that, under this fixed past, the possible values of the conditional expected final score after varying the $k$-th random choice all lie in an interval $[L_k,U_k]$ satisfying
\begin{align*}
U_k-L_k\le c_k .
\end{align*}
After the $k$-th choice is revealed, $M_k$ is one of these conditional expected scores, so
\begin{align*}
L_k\le M_k\le U_k .
\end{align*}
Before the $k$-th choice is revealed, the tower property gives
\begin{align*}
M_{k-1}=\mathbb E[Z\mid \mathcal F_{k-1}]
\end{align*}
and
\begin{align*}
\mathbb E[M_k\mid \mathcal F_{k-1}]=\mathbb E[\mathbb E[Z\mid \mathcal F_k]\mid \mathcal F_{k-1}]=\mathbb E[Z\mid \mathcal F_{k-1}]=M_{k-1}.
\end{align*}
Taking conditional expectations in $L_k\le M_k\le U_k$ therefore gives
\begin{align*}
L_k\le M_{k-1}\le U_k .
\end{align*}
Thus both $M_k$ and $M_{k-1}$ lie in the same interval of length at most $c_k$, and hence
\begin{align*}
|M_k-M_{k-1}|\le U_k-L_k\le c_k .
\end{align*}
Applying *[Azuma-Hoeffding Inequality](/theorems/6071)* to the martingale $(M_k)_{k=0}^{n}$ yields, for every $t\ge 0$,
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)=\mathbb P(M_n-M_0\ge t)\le \exp\left(-\frac{t^2}{2\sum_{k=1}^{n}c_k^2}\right).
\end{align*}
Thus the final score is concentrated around its mean whenever each newly revealed random choice can change the conditional expected final score by only a small amount.
[/example]
This example shows why martingales are suited to adaptive procedures. The algorithm may choose later actions using earlier random outcomes, but the Doob martingale isolates the net effect of each newly revealed choice.
## Doob Martingales for Functions of Independent Variables
The final question is how to turn a general function of independent inputs into a martingale. The construction is canonical: reveal the inputs one at a time and track the conditional expectation of the final quantity.
[definition: Doob Martingale]
Let $Y_i:(\Omega,\mathcal F)\to(E_i,\mathcal E_i)$ be independent random variables for $1\le i\le n$. Let $f:E_1\times\cdots\times E_n\to\mathbb R$ be a measurable function such that $Z=f(Y_1,\dots,Y_n)$ is integrable, and set $\mathcal F_k=\sigma(Y_1,\dots,Y_k)$. The Doob martingale associated with $Z$ and the exposure filtration is
\begin{align*}
M_k = \mathbb E[Z \mid \mathcal F_k], \qquad 0 \le k \le n.
\end{align*}
[/definition]
Here $M_0=\mathbb E[Z]$ when $\mathcal F_0$ contains only null events and their complements, and $M_n=Z$ since $Z$ is measurable with respect to all inputs. The next result verifies that this construction has the martingale property and gives the telescoping decomposition that Azuma-Hoeffding can act on.
[quotetheorem:3539]
[citeproof:3539]
The decomposition is formal; by itself it gives no concentration. Azuma-Hoeffding enters only after separate estimates prove that the increments $|M_k-M_{k-1}|$ are bounded by deterministic constants. Integrability is necessary because conditional expectations are not defined as finite random variables for arbitrary non-integrable $Z$. For a concrete failure, take $n=1$, let $\mathcal F_0$ contain only null events and their complements, let $Y_1$ have a Cauchy distribution, and set $Z=Y_1$. Then $\mathbb E[Z]$ is not finite, so $M_0=\mathbb E[Z]$ and the displayed decomposition are not meaningful finite identities. A nonnegative example gives the same obstruction: if $\mathbb P(Y_1=m)=C/m^2$ for $m\ge 1$ and $Z=Y_1$, then $\mathbb E[Z]=\infty$, so the terminal random variable cannot generate an integrable Doob martingale. Terminal measurability is also essential: if $Z$ depends on randomness not contained in $\mathcal F_n$, then $M_n=\mathbb E[Z\mid\mathcal F_n]$ need not equal $Z$, and the telescoping sum represents the learned conditional expectation rather than the final quantity itself. The statement $M_0=\mathbb E[Z]$ uses an initial $\sigma$-algebra containing only null events and their complements; with nonconstant initial information, the correct starting point is $M_0=\mathbb E[Z\mid\mathcal F_0]$. The bounded differences condition developed fully in Chapter 9 is a common way to obtain exactly the needed increment bounds, and the next theorem gives the martingale proof of that forthcoming concentration result.
[quotetheorem:6072]
[citeproof:6072]
This proof identifies [McDiarmid's bounded differences inequality](/theorems/6072) as an Azuma-Hoeffding inequality for the exposure martingale. Independence is essential for the simple exposure argument: if the unrevealed coordinates change their conditional distribution after observing $Y_k$, then averaging over the future coordinates need not preserve the same oscillation bound. For example, if all coordinates are forced to equal a single hidden bit, changing the first revealed coordinate can determine the whole vector, so the Doob increment can be much larger than the single-coordinate Lipschitz constant computed under independent resampling. The coordinate bounded-difference hypothesis is also necessary for this conclusion; a function such as $f(y_1,\dots,y_n)=n\mathbb{1}_{\{y_1=1\}}$ has an increment of order $n$ when the first coordinate is revealed. The viewpoint remains valuable because whenever a different filtration gives valid martingale increment bounds, Azuma-Hoeffding applies even if the information is not literally a product coordinate. Chapter 9 specializes this martingale viewpoint to deterministic coordinate replacement bounds.
[example: Exposure Martingale for Random Graphs]
Let $G\sim G(m,p)$ on vertex set $\{1,\dots,m\}$, and enumerate the $N=\binom{m}{2}$ edges as $e_1,\dots,e_N$. Write $Y_i$ for the indicator that $e_i$ is present, set $\mathcal F_k=\sigma(Y_1,\dots,Y_k)$, and let
\begin{align*}
Z=\sum_{\{a,b,c\}\subseteq \{1,\dots,m\}} Y_{ab}Y_{ac}Y_{bc}
\end{align*}
be the number of triangles. The associated exposure martingale is
\begin{align*}
M_k=\mathbb E[Z\mid \mathcal F_k], \qquad 0\le k\le N.
\end{align*}
Fix an edge $e=\{a,b\}$. If two full edge configurations differ only in the value of $Y_e$, then every triangle not containing $e$ has the same indicator in both configurations. The triangles containing $e$ are exactly those with vertex set $\{a,b,r\}$, where
\begin{align*}
r\in \{1,\dots,m\}\setminus\{a,b\}.
\end{align*}
There are therefore $m-2$ such triangles. Each corresponding triangle indicator can change by at most $1$, so changing the single edge $e$ changes $Z$ by at most
\begin{align*}
\sum_{r\in \{1,\dots,m\}\setminus\{a,b\}} 1=m-2.
\end{align*}
Now fix the first $k-1$ revealed edge indicators. As the value of the $k$-th edge varies between $0$ and $1$, the conditional expected value of $Z$ changes by at most $m-2$, because the unrevealed edge indicators are still averaged over the same independent product distribution and the pointwise change in $Z$ is bounded by $m-2$. Hence the possible values of $M_k$ under this fixed past lie in an interval $[L_k,U_k]$ satisfying
\begin{align*}
U_k-L_k\le m-2.
\end{align*}
The tower property gives
\begin{align*}
\mathbb E[M_k\mid \mathcal F_{k-1}]=\mathbb E[\mathbb E[Z\mid \mathcal F_k]\mid \mathcal F_{k-1}]=\mathbb E[Z\mid \mathcal F_{k-1}]=M_{k-1}.
\end{align*}
Thus $M_{k-1}$ is a conditional average of values in $[L_k,U_k]$, while $M_k$ itself also lies in $[L_k,U_k]$. Therefore
\begin{align*}
|M_k-M_{k-1}|\le U_k-L_k\le m-2.
\end{align*}
Applying *Azuma-Hoeffding Inequality* with $c_k=m-2$ for every $1\le k\le N$ gives, for every $t\ge 0$,
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)=\mathbb P(M_N-M_0\ge t).
\end{align*}
The inequality bounds this probability by
\begin{align*}
\exp\left(-\frac{t^2}{2\sum_{k=1}^{N}(m-2)^2}\right).
\end{align*}
Since
\begin{align*}
\sum_{k=1}^{N}(m-2)^2=N(m-2)^2,
\end{align*}
we obtain
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)\le \exp\left(-\frac{t^2}{2N(m-2)^2}\right).
\end{align*}
This bound is often not optimal for triangle counts, but it demonstrates that the exposure martingale method controls a statistic built from highly dependent triangle indicators.
[/example]
The random graph example also shows a limitation. Azuma-Hoeffding uses worst-case bounded increments, so it may ignore that many exposed edges lie in few present partial configurations.
[remark: Role in the Classical Toolkit]
Martingale concentration is the first point in the course where the random variables being summed need not be independent. The method keeps the Chernoff-mgf architecture but replaces factorization by the tower property. This makes the same inequality useful in randomized algorithms, empirical process arguments, and probabilistic combinatorics, where the object of interest is often revealed through a sequence of partial observations rather than written as an independent sum. In empirical-process arguments, the filtration may reveal sample points one at a time while the statistic is a supremum over a class of functions; in random graph processes, it may reveal edges or vertices while tracking a global graph parameter. The same conditional-expectation viewpoint is also close to optional-stopping intuition: a martingale has no conditional drift, but concentration requires quantitative control of the allowed jumps before stopping or tail estimates can be trusted. Later refinements, such as Freedman's inequality and Bernstein-type martingale bounds, improve Azuma-Hoeffding by incorporating conditional variance rather than only worst-case increment bounds.
[/remark]
Martingale concentration shows that additivity is not essential if one can control the size of each revealed increment. Chapter 9 shifts from processes to functions of many inputs and asks how much concentration remains when the output depends on each coordinate only through bounded sensitivity.
# 9. McDiarmid's Bounded Differences Inequality
This chapter moves from concentration for sums to concentration for functions of many independent inputs. The question is how much randomness remains when no additive structure is visible, but changing one coordinate has a controlled effect on the output. The main tool is the Doob martingale from Chapter 8 obtained by revealing coordinates one at a time, which converts coordinatewise sensitivity into bounded martingale differences and then applies the Azuma-Hoeffding method.
## Functions of Independent Variables and Coordinatewise Sensitivity
Suppose $X_1,\dots,X_n$ are independent random variables and $f$ is a statistic of the whole vector. To prove concentration of $f(X_1,\dots,X_n)$, we need a way to measure how much influence a single input coordinate can have while all other coordinates are held fixed.
[definition: Bounded Differences Condition]
Let $\mathcal X_1,\dots,\mathcal X_n$ be sets, let $f:\mathcal X_1\times\cdots\times\mathcal X_n\to\mathbb R$, and let $c_1,\dots,c_n\ge 0$. The function $f$ satisfies the bounded differences condition with constants $(c_i)_{i=1}^n$ if, for every $i\in\{1,\dots,n\}$ and every pair of points $x,y\in\mathcal X_1\times\cdots\times\mathcal X_n$ satisfying $x_j=y_j$ for all $j\ne i$, one has
\begin{align*}
|f(x)-f(y)|\le c_i.
\end{align*}
[/definition]
The constants $c_i$ are worst-case coordinate sensitivities. They do not depend on the distribution of $X_i$, only on the deterministic behaviour of the statistic under coordinate replacement.
[example: Sample Mean as a Bounded Differences Function]
Let $X_1,\dots,X_n$ take values in $[0,1]$, and define
\begin{align*}
f(x_1,\dots,x_n)=\frac{1}{n}\sum_{k=1}^n x_k.
\end{align*}
Fix a coordinate $i$, and let $x,y\in[0,1]^n$ satisfy $x_j=y_j$ for every $j\ne i$. Then
\begin{align*}
f(x)-f(y)=\frac{1}{n}\sum_{k=1}^n x_k-\frac{1}{n}\sum_{k=1}^n y_k.
\end{align*}
By linearity of finite sums,
\begin{align*}
f(x)-f(y)=\frac{1}{n}\sum_{k=1}^n (x_k-y_k).
\end{align*}
For every $k\ne i$ we have $x_k-y_k=0$, so only the $i$th term remains:
\begin{align*}
f(x)-f(y)=\frac{1}{n}(x_i-y_i).
\end{align*}
Taking absolute values gives
\begin{align*}
|f(x)-f(y)|=\frac{1}{n}|x_i-y_i|.
\end{align*}
Since $x_i,y_i\in[0,1]$, one has $-1\le x_i-y_i\le 1$, and hence $|x_i-y_i|\le 1$. Therefore
\begin{align*}
|f(x)-f(y)|\le \frac{1}{n}.
\end{align*}
Thus the bounded differences constants are $c_i=1/n$ for all $i$. Their squared sensitivity scale is
\begin{align*}
\sum_{i=1}^n c_i^2=\sum_{i=1}^n \frac{1}{n^2}=\frac{n}{n^2}=\frac{1}{n}.
\end{align*}
This is the same $n^{-1/2}$ fluctuation scale that appears for bounded sums.
[/example]
The point of this condition is that it treats the statistic as a black box. In many applications the statistic is a maximum, a graph parameter, or the output of an algorithm, and the useful information is not a formula for $f$ but a certificate saying that each input can only perturb it by a small amount.
[remark: Worst-Case Rather Than Average Influence]
The bounded differences condition is uniform over all possible coordinate replacements. This makes the theorem robust and simple, but it can be conservative when rare configurations have unusually large influence. Later refinements replace worst-case bounds by variance-sensitive or self-bounding quantities.
[/remark]
The next step is to turn this deterministic sensitivity statement into a probabilistic tail bound. To do that, we need an intermediate result showing that coordinate sensitivity controls the jumps of the conditional expectations obtained when coordinates are revealed.
## Proof from Doob Martingales
The problem is that $f(X_1,\dots,X_n)$ need not be a sum of independent random variables. The standard remedy is to replace the statistic by a martingale whose increments represent the new information gained when each coordinate is revealed.
[definition: Doob Martingale of a Statistic]
Let $X_1,\dots,X_n$ be independent random variables on $(\Omega,\mathcal F,\mathbb P)$, let $f$ be an integrable real-valued function of $(X_1,\dots,X_n)$, and set $Z=f(X_1,\dots,X_n)$. The Doob martingale associated with $Z$ and the coordinate filtration is the process $(M_i)_{i=0}^n$ given by
\begin{align*}
M_i=\mathbb E[Z\mid X_1,\dots,X_i], \qquad 0\le i\le n.
\end{align*}
For each $i$, $M_i:\Omega\to\mathbb R$ is a real-valued random variable.
[/definition]
This martingale starts at $M_0=\mathbb E[Z]$ and ends at $M_n=Z$, so a deviation of $Z$ from its mean is the sum of the martingale increments $M_i-M_{i-1}$. The remaining issue is to verify that these increments inherit the deterministic coordinate bounds $c_i$.
[quotetheorem:6073]
[citeproof:6073]
This lemma supplies the exact hypothesis required by Azuma-Hoeffding: the statistic has been decomposed into a martingale with bounded increments. Independence is doing real work here. If $X_1=\cdots=X_n=Y$ for a single Bernoulli random variable $Y$ and $f(x_1,\dots,x_n)=n^{-1}\sum_i x_i$, then each coordinate replacement changes $f$ by at most $1/n$, but $f(X_1,\dots,X_n)=Y$ has order-one fluctuations rather than the order $n^{-1/2}$ fluctuations predicted under independence. The increment bound also does not say that the final statistic is small, or that its variance is close to $\sum_i c_i^2/4$; it only controls how much the conditional prediction can move at each reveal step. Because the constants are worst-case replacement bounds, a statistic with rare high-influence configurations may satisfy McDiarmid with a much larger scale than its typical behaviour deserves. The natural next question is what tail bound results when these increment bounds are substituted into the martingale inequality.
[quotetheorem:6074]
[citeproof:6074]
The theorem is distribution-free except for independence, and that exception is essential rather than cosmetic. The counterexample $X_1=\cdots=X_n=Y$ with $f(x)=n^{-1}\sum_i x_i$ has the same deterministic constants $c_i=1/n$ as an independent sample mean, but no averaging takes place because all coordinates carry the same random bit. Thus bounded differences control sensitivity to formal coordinate replacement, while independence ensures that revealing one coordinate does not also reveal hidden information about the others. The theorem also inherits the limitations of worst-case constants: if a coordinate can change the output by $1$ on a tiny part of the sample space and by $1/n$ elsewhere, McDiarmid must use the value $1$ unless the statistic is modified or a sharper inequality is invoked. In applications the result is best read as a robust first certificate of sub-Gaussian concentration, not as a guarantee that the displayed exponent is optimal.
This raises a useful operational point for applications: the inputs may have different laws and may take values in different measurable spaces, provided the coordinate replacement constants are available. The following formulation states the same conclusion in the notation often used when each coordinate has its own state space and its own influence constant.
[quotetheorem:6075]
[citeproof:6075]
The phrase "variance proxy" records the sub-Gaussian scale: the bound has the form expected for a centred sub-Gaussian random variable with proxy variance $\sum_i c_i^2/4$. It should not be read as an assertion that $\operatorname{Var}(Z)$ equals this quantity, nor does it imply that the coordinates contribute variance independently in an additive decomposition. The hypotheses also cannot be weakened to arbitrary dependence: if every coordinate is a copy of the same random variable, the coordinatewise constants may be small while the output still has the full fluctuation of that single shared source. Measurability and integrability are part of the statement because the theorem is centred at $\mathbb E[Z]$ and uses conditional expectations throughout the martingale proof. With these qualifications in place, the theorem is ready for use as a certificate: identify the independent inputs, compute replacement constants, and then compare the resulting bound with more tailored inequalities when additional structure is present.
[example: Number of Edges in an Erdos-Renyi Graph]
Let $G\sim G(n,p)$, and write $E=\binom{\{1,\dots,n\}}{2}$, so $|E|=\binom{n}{2}$. For each unordered pair $e\in E$, let $X_e\in\{0,1\}$ be the indicator that the edge $e$ is present. The number of edges is
\begin{align*}
Z=\sum_{e\in E}X_e.
\end{align*}
Since the edge indicators are independent Bernoulli random variables with parameter $p$, linearity of expectation gives
\begin{align*}
\mathbb E[Z]=\mathbb E\left[\sum_{e\in E}X_e\right].
\end{align*}
Again by linearity of expectation,
\begin{align*}
\mathbb E\left[\sum_{e\in E}X_e\right]=\sum_{e\in E}\mathbb E[X_e].
\end{align*}
For each $e\in E$, $\mathbb E[X_e]=p$, hence
\begin{align*}
\sum_{e\in E}\mathbb E[X_e]=\sum_{e\in E}p=p|E|=p\binom{n}{2}.
\end{align*}
Thus $\mathbb E[Z]=p\binom{n}{2}$.
To verify bounded differences, fix an edge $e_0\in E$ and take two edge-indicator vectors $x,y\in\{0,1\}^{E}$ such that $x_e=y_e$ for every $e\ne e_0$. Then
\begin{align*}
Z(x)-Z(y)=\sum_{e\in E}x_e-\sum_{e\in E}y_e.
\end{align*}
By linearity of finite sums,
\begin{align*}
\sum_{e\in E}x_e-\sum_{e\in E}y_e=\sum_{e\in E}(x_e-y_e).
\end{align*}
For every $e\ne e_0$, the assumption $x_e=y_e$ gives $x_e-y_e=0$, so
\begin{align*}
\sum_{e\in E}(x_e-y_e)=x_{e_0}-y_{e_0}.
\end{align*}
Therefore
\begin{align*}
Z(x)-Z(y)=x_{e_0}-y_{e_0}.
\end{align*}
Because $x_{e_0},y_{e_0}\in\{0,1\}$, one has $x_{e_0}-y_{e_0}\in\{-1,0,1\}$, and hence
\begin{align*}
|Z(x)-Z(y)|=|x_{e_0}-y_{e_0}|\le 1.
\end{align*}
Thus the bounded differences constants are $c_e=1$ for all $e\in E$. Their squared sensitivity scale is
\begin{align*}
\sum_{e\in E}c_e^2=\sum_{e\in E}1^2=\sum_{e\in E}1=|E|=\binom{n}{2}.
\end{align*}
Applying *[McDiarmid Bounded Differences Inequality](/theorems/6074)* gives, for every $t\ge 0$,
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp\left(-\frac{2t^2}{\sum_{e\in E}c_e^2}\right).
\end{align*}
Substituting $\sum_{e\in E}c_e^2=\binom{n}{2}$ yields
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp\left(-\frac{2t^2}{\binom{n}{2}}\right).
\end{align*}
Here $Z$ is also a binomial random variable with parameters $\binom{n}{2}$ and $p$, so Chernoff bounds use more of the additive Bernoulli structure; the bounded differences argument instead recovers Gaussian-scale concentration using only the certificate that changing one edge indicator changes the edge count by at most $1$.
[/example]
The graph example is additive, so it does not yet show the full strength of the theorem. McDiarmid becomes most useful when the statistic depends on the coordinates through an optimization or a data-dependent choice.
## Typical Lipschitz Certificates
To use McDiarmid, most of the work is proving the bounded differences condition. The proof usually takes the form of a replacement argument: freeze all coordinates except one, replace that coordinate by another admissible value, and bound the change in the statistic.
[definition: Hamming Lipschitz Function]
Let $\mathcal X$ be a set and equip $\mathcal X^n$ with the Hamming distance
\begin{align*}
d_H:\mathcal X^n\times\mathcal X^n\to\{0,\dots,n\},\qquad
d_H(x,y)=|\{i\in\{1,\dots,n\}:x_i\ne y_i\}|.
\end{align*}
A function $f:\mathcal X^n\to\mathbb R$ is $L$-Lipschitz with respect to $d_H$ if
\begin{align*}
|f(x)-f(y)|\le Ld_H(x,y)
\end{align*}
for all $x,y\in\mathcal X^n$.
[/definition]
A Hamming Lipschitz condition gives bounded differences with $c_i=L$ for all coordinates. This is the quickest certificate when all coordinates play the same role.
[example: Lipschitz Statistics of Independent Samples]
Let $X_1,\dots,X_n$ be independent observations in a set $\mathcal X$, and set $Z=f(X_1,\dots,X_n)$, where $f:\mathcal X^n\to\mathbb R$ is $L$-Lipschitz with respect to Hamming distance. Fix a coordinate $i$, and let $x,y\in\mathcal X^n$ satisfy $x_j=y_j$ for every $j\ne i$. Then the set of coordinates on which $x$ and $y$ differ is contained in $\{i\}$, so
\begin{align*}
d_H(x,y)=|\{j:x_j\ne y_j\}|\le 1.
\end{align*}
By the Hamming Lipschitz condition,
\begin{align*}
|f(x)-f(y)|\le Ld_H(x,y).
\end{align*}
Combining the two inequalities gives
\begin{align*}
|f(x)-f(y)|\le L.
\end{align*}
Thus $f$ satisfies the bounded differences condition with constants
\begin{align*}
c_i=L,\qquad 1\le i\le n.
\end{align*}
The squared sensitivity scale is therefore
\begin{align*}
\sum_{i=1}^n c_i^2=\sum_{i=1}^n L^2=nL^2.
\end{align*}
Applying *McDiarmid Bounded Differences Inequality* gives, for every $t\ge 0$,
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).
\end{align*}
Substituting $\sum_{i=1}^n c_i^2=nL^2$ yields
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp\left(-\frac{2t^2}{nL^2}\right).
\end{align*}
Thus any statistic for which one observation can change the value by at most $L$ has sub-Gaussian deviations at scale $L\sqrt n$; this includes capped counts, bounded empirical risks, and maximum scores over a fixed finite collection of bounded losses once the same one-sample replacement bound has been verified.
[/example]
When coordinates have unequal effects, it is better to keep individual constants. The sum $\sum_i c_i^2$ is then the effective dimension of the problem, and small-influence coordinates contribute very little to the deviation scale.
[example: Weighted Sample Average]
Let $X_1,\dots,X_n\in[0,1]$ be independent, let $a_1,\dots,a_n\in\mathbb R$ be deterministic weights, and define
\begin{align*}
Z=f(X_1,\dots,X_n),\qquad f(x_1,\dots,x_n)=\sum_{k=1}^n a_kx_k.
\end{align*}
Fix a coordinate $i$, and let $x,y\in[0,1]^n$ satisfy $x_j=y_j$ for every $j\ne i$. Then
\begin{align*}
f(x)-f(y)=\sum_{k=1}^n a_kx_k-\sum_{k=1}^n a_ky_k.
\end{align*}
By linearity of finite sums,
\begin{align*}
f(x)-f(y)=\sum_{k=1}^n a_k(x_k-y_k).
\end{align*}
For every $k\ne i$, the assumption $x_k=y_k$ gives $x_k-y_k=0$, so
\begin{align*}
\sum_{k=1}^n a_k(x_k-y_k)=a_i(x_i-y_i).
\end{align*}
Therefore
\begin{align*}
|f(x)-f(y)|=|a_i(x_i-y_i)|.
\end{align*}
Using $|uv|=|u||v|$,
\begin{align*}
|a_i(x_i-y_i)|=|a_i|\,|x_i-y_i|.
\end{align*}
Since $x_i,y_i\in[0,1]$, one has $-1\le x_i-y_i\le 1$, and hence $|x_i-y_i|\le 1$. Thus
\begin{align*}
|f(x)-f(y)|\le |a_i|.
\end{align*}
So $f$ satisfies the bounded differences condition with constants
\begin{align*}
c_i=|a_i|,\qquad 1\le i\le n.
\end{align*}
Their squared sensitivity scale is
\begin{align*}
\sum_{i=1}^n c_i^2=\sum_{i=1}^n |a_i|^2.
\end{align*}
Since $|a_i|^2=a_i^2$ for real $a_i$,
\begin{align*}
\sum_{i=1}^n c_i^2=\sum_{i=1}^n a_i^2.
\end{align*}
Applying *McDiarmid Bounded Differences Inequality* gives, for every $t\ge 0$,
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).
\end{align*}
Substituting $\sum_{i=1}^n c_i^2=\sum_{i=1}^n a_i^2$ yields
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n a_i^2}\right).
\end{align*}
For equal weights $a_i=1/n$, the denominator becomes
\begin{align*}
\sum_{i=1}^n a_i^2=\sum_{i=1}^n \frac{1}{n^2}.
\end{align*}
Since the sum has $n$ identical terms,
\begin{align*}
\sum_{i=1}^n \frac{1}{n^2}=\frac{n}{n^2}=\frac{1}{n}.
\end{align*}
Hence in the equal-weight case,
\begin{align*}
\mathbb P(|Z-\mathbb E[Z]|\ge t)\le 2\exp(-2nt^2).
\end{align*}
Thus the equal-weight sample average has deviations on the usual $n^{-1/2}$ scale, while unequal weights concentrate according to the Euclidean weight scale $\left(\sum_{i=1}^n a_i^2\right)^{1/2}$.
[/example]
A common pattern in learning theory is that the statistic is a supremum over hypotheses. The coordinate replacement proof works because replacing one sample affects every empirical average by a bounded amount, and taking a supremum preserves that bound.
[example: Generalization Gap for a Finite Hypothesis Class]
Let $\mathcal H$ be a finite class of hypotheses, let $S=(X_1,\dots,X_n)$ be independent samples, and assume that $\ell(h,X_i)\in[0,1]$ for every $h\in\mathcal H$ and every $i$. For each $h\in\mathcal H$, write
\begin{align*}
\mu_h=\mathbb E[\ell(h,X)].
\end{align*}
Define
\begin{align*}
Z=f(X_1,\dots,X_n),\qquad f(x_1,\dots,x_n)=\sup_{h\in\mathcal H}\left|\mu_h-\frac{1}{n}\sum_{k=1}^n \ell(h,x_k)\right|.
\end{align*}
We verify the bounded differences condition with constants $c_i=1/n$.
Fix $i$, and let $x,y$ be two samples with $x_j=y_j$ for every $j\ne i$. For each $h\in\mathcal H$, set
\begin{align*}
A_h(x)=\mu_h-\frac{1}{n}\sum_{k=1}^n \ell(h,x_k),\qquad A_h(y)=\mu_h-\frac{1}{n}\sum_{k=1}^n \ell(h,y_k).
\end{align*}
Then
\begin{align*}
A_h(x)-A_h(y)=\left(\mu_h-\frac{1}{n}\sum_{k=1}^n \ell(h,x_k)\right)-\left(\mu_h-\frac{1}{n}\sum_{k=1}^n \ell(h,y_k)\right).
\end{align*}
The $\mu_h$ terms cancel, so
\begin{align*}
A_h(x)-A_h(y)=-\frac{1}{n}\sum_{k=1}^n \ell(h,x_k)+\frac{1}{n}\sum_{k=1}^n \ell(h,y_k).
\end{align*}
By linearity of finite sums,
\begin{align*}
A_h(x)-A_h(y)=\frac{1}{n}\sum_{k=1}^n\bigl(\ell(h,y_k)-\ell(h,x_k)\bigr).
\end{align*}
For $k\ne i$, the equality $x_k=y_k$ gives $\ell(h,y_k)-\ell(h,x_k)=0$, hence
\begin{align*}
A_h(x)-A_h(y)=\frac{1}{n}\bigl(\ell(h,y_i)-\ell(h,x_i)\bigr).
\end{align*}
Since both losses lie in $[0,1]$,
\begin{align*}
\left|A_h(x)-A_h(y)\right|=\frac{1}{n}\left|\ell(h,y_i)-\ell(h,x_i)\right|\le \frac{1}{n}.
\end{align*}
The [reverse triangle inequality](/theorems/2300) gives
\begin{align*}
\left||A_h(x)|-|A_h(y)|\right|\le |A_h(x)-A_h(y)|\le \frac{1}{n}.
\end{align*}
Put $U_h=|A_h(x)|$ and $V_h=|A_h(y)|$. Since $\mathcal H$ is finite, the suprema are maxima. Let $h_x$ attain $\sup_{h\in\mathcal H}U_h$. Then
\begin{align*}
\sup_{h\in\mathcal H}U_h-\sup_{h\in\mathcal H}V_h=U_{h_x}-\sup_{h\in\mathcal H}V_h.
\end{align*}
Because $\sup_{h\in\mathcal H}V_h\ge V_{h_x}$,
\begin{align*}
U_{h_x}-\sup_{h\in\mathcal H}V_h\le U_{h_x}-V_{h_x}.
\end{align*}
Also $U_{h_x}-V_{h_x}\le |U_{h_x}-V_{h_x}|$, so
\begin{align*}
\sup_{h\in\mathcal H}U_h-\sup_{h\in\mathcal H}V_h\le |U_{h_x}-V_{h_x}|\le \frac{1}{n}.
\end{align*}
Interchanging the roles of $x$ and $y$ gives
\begin{align*}
\sup_{h\in\mathcal H}V_h-\sup_{h\in\mathcal H}U_h\le \frac{1}{n}.
\end{align*}
Combining the two one-sided bounds,
\begin{align*}
|f(x)-f(y)|=\left|\sup_{h\in\mathcal H}U_h-\sup_{h\in\mathcal H}V_h\right|\le \frac{1}{n}.
\end{align*}
Thus $f$ satisfies the bounded differences condition with $c_i=1/n$ for all $i$, and
\begin{align*}
\sum_{i=1}^n c_i^2=\sum_{i=1}^n \frac{1}{n^2}=\frac{n}{n^2}=\frac{1}{n}.
\end{align*}
Applying *McDiarmid Bounded Differences Inequality* gives, for every $t\ge 0$,
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)\le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).
\end{align*}
Substituting $\sum_{i=1}^n c_i^2=1/n$ yields
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)\le \exp\left(-\frac{2t^2}{1/n}\right)=\exp(-2nt^2).
\end{align*}
The concentration step controls fluctuations of the random generalization gap around its mean; a full learning-theory bound still needs a separate estimate of $\mathbb E[Z]$, often obtained by symmetrization and a union bound over the finite class $\mathcal H$.
[/example]
The examples have the same structure: the random variables are revealed one at a time, and the statistic is tested against a coordinate replacement. This viewpoint is often called a reveal-one-coordinate proof, because it mirrors the Doob martingale even when the martingale is not written explicitly.
[explanation: Reveal-One-Coordinate Proofs]
A reveal-one-coordinate proof begins by imagining that $X_1,\dots,X_n$ are hidden and are then exposed sequentially. At step $i$, the proof asks how much the conditional prediction of the final statistic can move when $X_i$ is disclosed. If replacing $X_i$ while fixing all other coordinates changes the final value by at most $c_i$, then the $i$th prediction update has absolute value at most $c_i$. McDiarmid's inequality is exactly the concentration consequence of summing these bounded prediction updates.
This perspective is useful because it separates modelling from probability. The modelling work is the coordinate replacement certificate; the probabilistic work is supplied by the martingale concentration theorem. Once the certificate is obtained, the tail estimate is automatic.
[/explanation]
The chapter therefore completes the classical route from sums to general functions of independent variables. Chernoff and Hoeffding inequalities exploit additivity directly; McDiarmid exploits independence through conditional expectation and uses bounded coordinate sensitivity as the replacement for additivity. Chapter 10 closes the course by turning these methods into a proof-selection guide.
The final chapter steps back from individual inequalities and treats them as design patterns. Having seen how Markov, Chebyshev, Chernoff, Hoeffding, Bernstein, martingale, and bounded-differences arguments fit different assumptions, we close by organizing these tools into a practical guide for choosing the right proof strategy.
# 10. Classical Concentration Patterns and Proof Design
This closing chapter is about proof design rather than a new inequality. Chapters 1 through 9 built Markov, Chebyshev, Chernoff, Hoeffding, Bernstein, sub-Gaussian, sub-exponential, martingale, and bounded-differences bounds; here we organise them by the information available in a problem. The aim is to turn a concentration question into a short diagnostic: what is known, what scale is natural, and which reusable proof move exposes the needed structure?
## Choosing an Inequality from the Available Information
A concentration proof usually begins with incomplete information about a random quantity. Sometimes only a moment is known; sometimes the variables are bounded; sometimes a moment generating function is available; sometimes the object is not a sum, but changing one coordinate can only change the output a little. The first task is to match the available data to a theorem whose hypotheses are genuinely present.
[explanation: Information Hierarchy]
The classical inequalities form a hierarchy of assumptions. Markov uses nonnegativity and a first moment. Chebyshev and Cantelli use a variance. Chernoff uses an mgf bound and improves when the mgf has the right curvature. Hoeffding uses independence and bounded summands. Bernstein uses independence, boundedness, and variance. Sub-Gaussian and sub-exponential estimates package mgf information into reusable norms. Azuma-Hoeffding and McDiarmid replace summand structure by martingale increments or coordinate sensitivity.
The practical point is not that stronger assumptions are always better. A Bernstein bound may dominate Hoeffding when the variance is much smaller than the square of the range, while Hoeffding may be cleaner when only ranges are recorded. McDiarmid can control nonlinear functions of independent inputs, but it may lose variance information that a problem-specific martingale or Bernstein argument can retain.
[/explanation]
The hierarchy is easiest to use when it is compressed into a single decision table. The following guide records which earlier theorem is usually appropriate for each kind of input information.
[explanation: Classical Concentration Selection Table]
Let $t > 0$. The following choices are standard when the stated information is available.
| Method | Input information | Typical upper-tail form | Natural scale |
|---|---|---|---|
| Markov | $X \ge 0$, $\mathbb E[X]$ | first-moment tail | $\mathbb E[X]$ |
| Chebyshev | $\mathbb E[X]$, $\operatorname{Var}(X)$ | two-sided variance tail | $\sqrt{\operatorname{Var}(X)}$ |
| Cantelli | $\mathbb E[X]$, $\operatorname{Var}(X)$ | one-sided variance tail | $\sqrt{\operatorname{Var}(X)}$ |
| Chernoff | $\mathbb E[e^{\lambda X}]$ | optimized exponential tilt | Legendre scale |
| Hoeffding | independent bounded summands | bounded-range exponential tail | $\sqrt{\sum_i (b_i-a_i)^2}$ |
| Bernstein | independent summands with bounded centered deviations and variance | variance-sensitive exponential tail | $\sqrt{\sigma^2}$, then $M$ |
| Sub-Gaussian | $\|X\|_{\psi_2}$ or quadratic mgf | quadratic exponential tail | $\|X\|_{\psi_2}$ |
| Sub-exponential | $\|X\|_{\psi_1}$ or Bernstein mgf | quadratic-to-linear exponential tail | $K$ and $K^2$ |
| Azuma-Hoeffding | bounded martingale increments | martingale bounded-increment tail | $\sqrt{\sum_i c_i^2}$ |
| McDiarmid | bounded coordinate differences | bounded-difference tail | $\sqrt{\sum_i c_i^2}$ |
In formula form, the corresponding bounds have the following representative shapes. They are written separately so that each hypothesis and scale can be read without treating the table as a single chain of equalities.
For a nonnegative random variable, Markov gives
\begin{align*}
\mathbb P(X \ge t) \le \frac{\mathbb E[X]}{t}.
\end{align*}
With a finite variance, Chebyshev gives
\begin{align*}
\mathbb P(|X-\mathbb E[X]| \ge t) \le \frac{\operatorname{Var}(X)}{t^2}.
\end{align*}
For a one-sided variance estimate, Cantelli gives
\begin{align*}
\mathbb P(X-\mathbb E[X] \ge t) \le \frac{\operatorname{Var}(X)}{\operatorname{Var}(X)+t^2}.
\end{align*}
When exponential moments are available, Chernoff gives
\begin{align*}
\mathbb P(X \ge t) \le \inf_{\lambda>0} e^{-\lambda t}\mathbb E[e^{\lambda X}].
\end{align*}
For independent bounded summands, Hoeffding gives
\begin{align*}
\mathbb P(S-\mathbb E[S] \ge t) \le \exp\left(-\frac{2t^2}{\sum_i (b_i-a_i)^2}\right).
\end{align*}
For independent summands with bounded centered deviations and variance proxy $\sigma^2$, Bernstein gives
\begin{align*}
\mathbb P(S-\mathbb E[S] \ge t) \le \exp\left(-\frac{t^2}{2(\sigma^2+Mt/3)}\right).
\end{align*}
For a sub-Gaussian random variable, the reusable tail form is
\begin{align*}
\mathbb P(|X| \ge t) \lesssim \exp\left(-\frac{ct^2}{\|X\|_{\psi_2}^2}\right).
\end{align*}
For a sub-exponential random variable with scale $K$, the tail form is
\begin{align*}
\mathbb P(|X| \ge t) \lesssim \exp\left(-c\min\{t^2/K^2,t/K\}\right).
\end{align*}
For a martingale with bounded increments, Azuma-Hoeffding gives
\begin{align*}
\mathbb P(M_n-M_0 \ge t) \le \exp\left(-\frac{t^2}{2\sum_i c_i^2}\right).
\end{align*}
For a function satisfying bounded coordinate differences, McDiarmid gives
\begin{align*}
\mathbb P(f-\mathbb E[f] \ge t) \le \exp\left(-\frac{2t^2}{\sum_i c_i^2}\right).
\end{align*}
Here $S=\sum_i X_i$, each $X_i \in [a_i,b_i]$ in the Hoeffding row, $|X_i-\mathbb E[X_i]| \le M$ and $\sigma^2=\sum_i \operatorname{Var}(X_i)$ in the Bernstein row, and the constants implicit in $\lesssim$ are universal.
[/explanation]
This table should be read from the hypotheses column first. The strongest-looking bound is unusable if the available data do not justify its mgf, boundedness, or sensitivity assumption. For instance, Chebyshev cannot produce exponential tails from variance alone, and Hoeffding cannot be invoked merely because a sum has independent terms; the summands must have deterministic ranges. Conversely, an mgf bound that holds only for small $\lambda$ naturally leads to Bernstein or sub-exponential behaviour rather than a purely Gaussian tail at all scales. The forward design principle is therefore to record the exact information available before starting the proof, because later moves such as truncation, centering, martingale exposure, and union bounds are useful precisely when they create one of the hypotheses in the table.
[example: Sampling Without Replacement]
Let the population values be $a_1,\dots,a_N\in[0,1]$, and let $X_1,\dots,X_n$ be the first $n$ terms of a uniformly random ordering of these values. Although the variables $X_i$ are dependent, we show that sequential exposure gives martingale increments bounded by $1/n$, and therefore the deviation scale of $\bar X_n=n^{-1}\sum_{i=1}^n X_i$ is still $n^{-1/2}$.
Let $\mathcal F_k=\sigma(X_1,\dots,X_k)$, let $S_k=\sum_{i=1}^k X_i$, and let $R_k$ be the sum of the $N-k$ population values not yet revealed after time $k$. Given $\mathcal F_k$, each future draw has conditional mean $R_k/(N-k)$, so the Doob martingale $M_k=\mathbb E[\bar X_n\mid \mathcal F_k]$ satisfies
\begin{align*}
M_k=\frac{1}{n}\mathbb E\left[\sum_{i=1}^n X_i\mid \mathcal F_k\right].
\end{align*}
Splitting the already revealed and unrevealed parts gives
\begin{align*}
M_k=\frac{1}{n}\left(S_k+\mathbb E\left[\sum_{i=k+1}^n X_i\mid \mathcal F_k\right]\right).
\end{align*}
There are $n-k$ future sampled values among the remaining $N-k$ population values, each with conditional mean $R_k/(N-k)$, hence
\begin{align*}
M_k=\frac{1}{n}\left(S_k+(n-k)\frac{R_k}{N-k}\right).
\end{align*}
Thus $M_0=\mathbb E[\bar X_n]$ and $M_n=\bar X_n$.
Now fix $k\in\{1,\dots,n\}$ and condition on $\mathcal F_{k-1}$. Put $m=N-k+1$, $r=n-k$, $R=R_{k-1}$, and $\mu=R/m$. After revealing $X_k$, the remaining sum is $R-X_k$ and the remaining population size is $m-1=N-k$. Therefore
\begin{align*}
M_k-M_{k-1}=\frac{1}{n}\left(X_k+r\frac{R-X_k}{m-1}-(r+1)\frac{R}{m}\right).
\end{align*}
Since $R=m\mu$, this becomes
\begin{align*}
M_k-M_{k-1}=\frac{1}{n}\left(X_k+r\frac{m\mu-X_k}{m-1}-(r+1)\mu\right).
\end{align*}
Collecting the $X_k$ terms and the $\mu$ terms gives
\begin{align*}
M_k-M_{k-1}=\frac{1}{n}\left(X_k\left(1-\frac{r}{m-1}\right)+\mu\left(\frac{rm}{m-1}-r-1\right)\right).
\end{align*}
The coefficient of $X_k$ is $(m-1-r)/(m-1)$, and the coefficient of $\mu$ is $r/(m-1)-1=-(m-1-r)/(m-1)$, so
\begin{align*}
M_k-M_{k-1}=\frac{m-1-r}{n(m-1)}(X_k-\mu).
\end{align*}
Because $m-1-r=(N-k)-(n-k)=N-n$ and $X_k,\mu\in[0,1]$,
\begin{align*}
|M_k-M_{k-1}|=\frac{N-n}{n(N-k)}|X_k-\mu|.
\end{align*}
Since $k\le n$, we have $N-n\le N-k$, and $|X_k-\mu|\le 1$, hence
\begin{align*}
|M_k-M_{k-1}|\le \frac{1}{n}.
\end{align*}
Applying the *Azuma-Hoeffding inequality* to this martingale gives
\begin{align*}
\mathbb P(\bar X_n-\mathbb E[\bar X_n]\ge t)=\mathbb P(M_n-M_0\ge t).
\end{align*}
Using the increment bound $|M_k-M_{k-1}|\le 1/n$,
\begin{align*}
\mathbb P(M_n-M_0\ge t)\le \exp\left(-\frac{t^2}{2\sum_{k=1}^n (1/n)^2}\right).
\end{align*}
Since $\sum_{k=1}^n(1/n)^2=1/n$, this is
\begin{align*}
\mathbb P(\bar X_n-\mathbb E[\bar X_n]\ge t)\le \exp\left(-\frac{nt^2}{2}\right).
\end{align*}
Applying the same argument to $-\bar X_n$ gives the lower-tail estimate. Thus sampling without replacement has the same $n^{-1/2}$ bounded-difference scale as an independent bounded average; the dependence is handled by the exposure martingale rather than by treating the sampled values as independent.
[/example]
The example illustrates a common selection rule: if independence is awkward but the output is stable under each reveal, try a martingale or bounded-differences formulation before forcing the problem into a sum. The important diagnostic is the exposure, not the original algebraic form of the statistic. Once the statistic is rewritten as a martingale terminal value, the proof needs only a deterministic increment bound, and the absence of replacement becomes part of the conditional-expectation calculation rather than an obstacle to the concentration theorem. This is the same pattern that will recur for random graphs, permutations, and algorithms whose outputs are built by revealing randomness step by step.
## Reusable Proof Moves
Once the inequality has been chosen, many proofs still need a preliminary transformation. Centering removes deterministic offsets, truncation separates rare large values from bounded fluctuations, symmetrization-lite replaces an unknown centre by an independent comparison, and union bounds convert one-dimensional tails into simultaneous control. These moves recur because they expose exactly the hypotheses in the selection table.
[definition: Centering Map]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. The centering map is the map from integrable real-valued random variables on $(\Omega,\mathcal F,\mathbb P)$ to integrable real-valued random variables on $(\Omega,\mathcal F,\mathbb P)$ defined by
\begin{align*}
X \longmapsto X - \mathbb E[X].
\end{align*}
[/definition]
Centering is useful because most concentration inequalities measure fluctuations around a deterministic location. For a sum $S=\sum_i X_i$, centering also decomposes the fluctuation into $S-\mathbb E[S]=\sum_i (X_i-\mathbb E[X_i])$, which is the form needed by mgf and variance calculations.
[example: Centering a Bernoulli Average]
Let $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \operatorname{Ber}(p)$ and set
\begin{align*}
\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
Since $\mathbb E[X_i]=p$, linearity of expectation gives
\begin{align*}
\mathbb E[\bar X_n]=\frac{1}{n}\sum_{i=1}^n \mathbb E[X_i].
\end{align*}
Substituting $\mathbb E[X_i]=p$ into the sum,
\begin{align*}
\mathbb E[\bar X_n]=\frac{1}{n}\sum_{i=1}^n p=p.
\end{align*}
Thus the centered upper-tail event is
\begin{align*}
\{\bar X_n-p\ge t\}=\left\{\sum_{i=1}^n (X_i-p)\ge nt\right\}.
\end{align*}
Because each $X_i\in[0,1]$, *Hoeffding's inequality* applied to $S=\sum_{i=1}^n X_i$ with ranges $b_i-a_i=1$ gives
\begin{align*}
\mathbb P(\bar X_n-p\ge t)=\mathbb P(S-\mathbb E[S]\ge nt).
\end{align*}
The Hoeffding denominator is
\begin{align*}
\sum_{i=1}^n (b_i-a_i)^2=\sum_{i=1}^n 1^2=n.
\end{align*}
Therefore
\begin{align*}
\mathbb P(\bar X_n-p\ge t)\le \exp\left(-\frac{2(nt)^2}{n}\right).
\end{align*}
Since $2(nt)^2/n=2nt^2$, this becomes
\begin{align*}
\mathbb P(\bar X_n-p\ge t)\le \exp(-2nt^2).
\end{align*}
For the variance-sensitive comparison, put $Y_i=X_i-p$. Then $\mathbb E[Y_i]=0$, and since $X_i-p\in\{-p,1-p\}\subseteq[-1,1]$,
\begin{align*}
|Y_i|\le 1.
\end{align*}
For a Bernoulli variable, $X_i^2=X_i$, so
\begin{align*}
\operatorname{Var}(X_i)=\mathbb E[X_i^2]-(\mathbb E[X_i])^2.
\end{align*}
Substituting $\mathbb E[X_i^2]=\mathbb E[X_i]=p$ gives
\begin{align*}
\operatorname{Var}(X_i)=p-p^2=p(1-p).
\end{align*}
Centering does not change variance, hence
\begin{align*}
\sigma^2=\sum_{i=1}^n \operatorname{Var}(Y_i)=\sum_{i=1}^n p(1-p)=np(1-p).
\end{align*}
Applying *Bernstein's inequality* with $M=1$ and threshold $nt$ gives
\begin{align*}
\mathbb P(\bar X_n-p\ge t)\le \exp\left(-\frac{(nt)^2}{2(np(1-p)+nt/3)}\right).
\end{align*}
Dividing numerator and denominator by $n$ yields the exact variance-sensitive form
\begin{align*}
\mathbb P(\bar X_n-p\ge t)\le \exp\left(-\frac{nt^2}{2p(1-p)+2t/3}\right).
\end{align*}
Since $p(1-p)\le p$, we have
\begin{align*}
2p(1-p)+\frac{2t}{3}\le 2p+\frac{2t}{3}.
\end{align*}
Replacing the denominator by the larger quantity gives the simpler estimate
\begin{align*}
\mathbb P(\bar X_n-p\ge t)\le \exp\left(-\frac{nt^2}{2p+2t/3}\right).
\end{align*}
The Hoeffding estimate uses only the range $[0,1]$, while the Bernstein estimate also uses the variance $p(1-p)$; this is why centering and variance information should both be recorded before choosing the inequality.
[/example]
The Bernoulli example used bounded variables, but many concentration problems start with variables that have rare large values. To use bounded-summand theorems in that situation, we first need a formal way to separate the ordinary part of a variable from its exceptional tail. The next definition names that separation.
[definition: Truncation Map]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $L>0$. The truncation map at level $L$ is the map from real-valued random variables on $(\Omega,\mathcal F,\mathbb P)$ to real-valued random variables on $(\Omega,\mathcal F,\mathbb P)$ defined by
\begin{align*}
X \longmapsto X_L := X\mathbb{1}_{\{|X|\le L\}}.
\end{align*}
[/definition]
The discarded part $X-X_L$ is handled by Markov, a moment bound, or an Orlicz tail estimate. The retained part is bounded, so Hoeffding, Bernstein, or an mgf argument becomes available. To make this division usable in proofs, we need a probability estimate that charges any large deviation either to the bounded sum or to the removed tail mass.
[quotetheorem:6076]
[citeproof:6076]
The truncation decomposition moves difficulty into a bounded piece and a tail piece, but it does not say either piece is small by itself. The constant $t/2$ is only a convenient split from the triangle inequality; any fixed split $\alpha t$ and $(1-\alpha)t$ could be used, and the best choice depends on the tail estimate available. The hypothesis $L>0$ is essential because the bounded piece must have a finite deterministic range before Hoeffding or Bernstein can be applied. A concrete limitation is that truncation may introduce bias: $\sum_i (X_i)_L$ is usually centred at $\sum_i \mathbb E[(X_i)_L]$, not at $\sum_i \mathbb E[X_i]$, so a proof must also control the discarded expectation. A different obstacle occurs when the random variable has a median or mean that is hard to calculate, while an independent copy gives a more symmetric comparison. The next result gives a lightweight replacement for the full symmetrization machinery.
[quotetheorem:6077]
[citeproof:6077]
The bound is useful when $X-X'$ has a representation with independent signs or bounded increments. The median hypotheses are not cosmetic. For a concrete failure, let $X\sim\operatorname{Ber}(1/2)$ and take $a=2$, which is not a median. With $t=3/2$, the left-hand side is $\mathbb P(|X-2|\ge 3/2)=1/2$, while $|X-X'|\le 1$ always, so the right-hand side is $0$. Thus the comparison can break down completely when $a$ is not a median. Independence of $X'$ is also essential, since taking $X'=X$ would make $|X-X'|=0$ and could not control any non-degenerate fluctuation. The comparison does not by itself identify the mean, nor does it preserve sharp constants; it is a device for proving concentration around a median before using a separate mean-median comparison. After a one-dimensional tail has been proved, the next proof-design question is often how to make many such estimates hold at the same time.
Union bounds answer that simultaneous-control question with minimal hypotheses. They are crude in dependence structure but very reliable when the number of events is moderate compared with the tail exponent.
[quotetheorem:6078]
[citeproof:6078]
The theorem converts individual concentration into simultaneous concentration, and the cost is the cardinality of the index set. The finiteness of $J$ is the relevant hypothesis here. If $Y_j=j^{-1}Z_j$ with independent standard Gaussian $Z_j$, then $\mathbb P(|Y_j|\ge t)$ is summable in $j$ for every fixed $t>0$, so the same union-bound idea can be applied after replacing the finite sum by a convergent infinite series. Without such summability, pointwise tail bounds do not imply a small tail for the supremum. For a quantitative failure, let $(Y_j)_{j\ge 1}$ be i.i.d. $\operatorname{Ber}(q)$ with $q\in(0,1)$ and take $t=1/2$. Each individual probability is $\mathbb P(|Y_j|\ge 1/2)=q$, but
\begin{align*}
\mathbb P\left(\sup_{j\ge 1}|Y_j|\ge 1/2\right)
=1-\lim_{m\to\infty}(1-q)^m
=1.
\end{align*}
Thus any attempted infinite-index conclusion of the form "small individual tails give a small simultaneous tail" is false without a finite index set or a summable sequence of tail probabilities. Independence is deliberately absent from the finite theorem, so the result applies even to heavily dependent deviations, but this robustness is also its limitation because it cannot exploit negative dependence or geometry. For example, bounding a maximum over $d$ genuinely different coordinates often costs $\log d$ in the tail inversion, while bounding a smooth norm may require a method that sees more than coordinatewise events. Random matrices provide a standard case where entrywise control is useful as a first bound, even though it does not capture every matrix quantity.
[example: Random Matrix Entrywise Control]
Let $A$ be an $m\times n$ random matrix whose entries are centered sub-Gaussian random variables satisfying $\|A_{ij}\|_{\psi_2}\le K$. For each fixed pair $(i,j)$, the sub-Gaussian tail estimate gives a universal constant $c>0$ such that, for every $t>0$,
\begin{align*}
\mathbb P(|A_{ij}|\ge t)\le 2\exp\left(-\frac{ct^2}{K^2}\right).
\end{align*}
The event that at least one entry exceeds $t$ is the union of the entrywise events:
\begin{align*}
\left\{\max_{1\le i\le m,\ 1\le j\le n}|A_{ij}|\ge t\right\}=\bigcup_{i=1}^m\bigcup_{j=1}^n \{|A_{ij}|\ge t\}.
\end{align*}
By the *[Union Bound for Simultaneous Deviations](/theorems/6078)*,
\begin{align*}
\mathbb P\left(\max_{1\le i\le m,\ 1\le j\le n}|A_{ij}|\ge t\right)\le \sum_{i=1}^m\sum_{j=1}^n \mathbb P(|A_{ij}|\ge t).
\end{align*}
Substituting the fixed-entry tail bound into each summand gives
\begin{align*}
\sum_{i=1}^m\sum_{j=1}^n \mathbb P(|A_{ij}|\ge t)\le \sum_{i=1}^m\sum_{j=1}^n 2\exp\left(-\frac{ct^2}{K^2}\right).
\end{align*}
The summand does not depend on $i$ or $j$, and there are $mn$ pairs, so
\begin{align*}
\sum_{i=1}^m\sum_{j=1}^n 2\exp\left(-\frac{ct^2}{K^2}\right)=2mn\exp\left(-\frac{ct^2}{K^2}\right).
\end{align*}
Therefore
\begin{align*}
\mathbb P\left(\max_{1\le i\le m,\ 1\le j\le n}|A_{ij}|\ge t\right)\le 2mn\exp\left(-\frac{ct^2}{K^2}\right).
\end{align*}
To make the high-probability scale explicit, fix $u>0$ and choose
\begin{align*}
t=K\sqrt{\frac{\log(2mn)+u}{c}}.
\end{align*}
Then
\begin{align*}
\frac{ct^2}{K^2}=\log(2mn)+u.
\end{align*}
Thus
\begin{align*}
2mn\exp\left(-\frac{ct^2}{K^2}\right)=2mn\exp(-\log(2mn)-u).
\end{align*}
Since $\exp(-\log(2mn))=(2mn)^{-1}$, this becomes
\begin{align*}
2mn\exp(-\log(2mn)-u)=e^{-u}.
\end{align*}
Hence
\begin{align*}
\mathbb P\left(\max_{1\le i\le m,\ 1\le j\le n}|A_{ij}|\ge K\sqrt{\frac{\log(2mn)+u}{c}}\right)\le e^{-u}.
\end{align*}
This proves entrywise control at scale $K\sqrt{\log(mn)}$, which is appropriate for the maximum entry, but it does not use the geometry needed for spectral norm concentration.
[/example]
The logarithmic factor in the example is the price of simultaneous entrywise control. Whether that price is necessary depends on the target quantity.
## Diagnosing Dimension Losses and Method Limits
Classical concentration methods are powerful, but they can lose factors when the argument ignores geometry, variance structure, or dependence. The final skill of the course is to diagnose whether a bound is weak because the random object is genuinely variable or because the proof used information too crudely. This diagnosis motivates entropy and transport methods in later courses, while keeping the present course within the classical toolkit.
[explanation: Sources of Dimension Loss]
A dimension-dependent loss often enters through a union bound, a worst-case bounded-difference constant, or a variance proxy that treats all coordinates as equally active. Union bounds replace geometry by cardinality. McDiarmid replaces local variance by worst-case coordinate sensitivity. Hoeffding replaces variance by squared ranges. These replacements are sometimes correct for the target, but they can be wasteful for smooth averages, norms, and suprema with metric structure.
A useful diagnostic is to compare the deviation scale predicted by the method with the scale suggested by variance or known limiting behaviour. If the theorem gives $\sqrt{d}$ where the variance scale is constant, the proof has probably discarded structure. If a maximum over $d$ nearly independent coordinates is being bounded, a $\sqrt{\log d}$ factor may be a real feature rather than an artefact.
[/explanation]
The preceding discussion identifies coordinate sensitivity as both useful and potentially lossy. The precise obstruction is that bounded differences remember only the largest possible coordinate change, even if that change occurs on a very rare part of the probability space. To diagnose when McDiarmid's inequality is using too little information, we isolate an example where the worst-case increments stay fixed while the actual variance scale becomes much smaller.
[quotetheorem:6079]
[citeproof:6079]
This is not a defect in the theorem; it is a statement about what the theorem is designed to use. The Bernoulli family in the proof keeps the same worst-case coordinate sensitivity while its variance tends to zero, so the bounded-difference constants alone do not determine the fluctuation scale. The opposite inequality $\operatorname{Var}(f)\le \sum_i c_i^2/4$ is not the point of this warning; it is a form of Efron-Stein control under independence. The loss diagnosed here is that McDiarmid may be much weaker than a variance-sensitive method when the worst-case change occurs on a rare part of the space.
The hypotheses also mark real failure modes. Independence cannot simply be dropped: if $X_1=\cdots=X_n=B$ with $B\sim\operatorname{Ber}(1/2)$ and $f(x_1,\dots,x_n)=n^{-1}\sum_i x_i$, then $f(B,\dots,B)=B$ although the coordinate differences of $f$ are all $1/n$. A McDiarmid-style bound would give an upper tail at level $1/2$ of order $\exp(-n/2)$, while the actual probability is $1/2$. The bounded-differences assumption is equally essential. For $n=1$ and $f(x)=x$ with $X\sim\operatorname{Exp}(1)$, there is no finite deterministic coordinate-difference constant on the state space, and a sub-Gaussian tail of the McDiarmid form would contradict the exact exponential tail $\mathbb P(X-\mathbb E[X]\ge t)=e^{-(t+1)}$ for $t\ge -1$. When conditional variances are much smaller than worst-case increments but bounded differences and independence do hold, a Bernstein-type martingale inequality from a later course is the natural next step.
[example: Isolated Vertices in an Erdos-Renyi Graph]
Let $G\sim G(n,p)$, and write the graph as the family of independent edge indicators $(B_e)_{e\in \binom{[n]}{2}}$, where $B_{\{u,v\}}=1$ means that the edge $\{u,v\}$ is present. For each vertex $v$, define
\begin{align*}
I_v=\mathbb{1}_{\{v\text{ is isolated}\}}.
\end{align*}
Then the number of isolated vertices is
\begin{align*}
Z=\sum_{v=1}^n I_v.
\end{align*}
We expose the graph one edge at a time and check a bounded-differences constant. Fix an edge $e=\{u,v\}$, and compare two edge configurations that differ only in the value of $B_e$. If $w\notin\{u,v\}$, then every edge incident to $w$ has the same value in both configurations, so
\begin{align*}
I_w'=I_w.
\end{align*}
Thus only the indicators $I_u$ and $I_v$ can change. Since each isolation indicator is either $0$ or $1$, we have $|I_u'-I_u|\le 1$ and $|I_v'-I_v|\le 1$. Therefore
\begin{align*}
|Z'-Z|=\left|(I_u'-I_u)+(I_v'-I_v)\right|.
\end{align*}
By the triangle inequality,
\begin{align*}
\left|(I_u'-I_u)+(I_v'-I_v)\right|\le |I_u'-I_u|+|I_v'-I_v|.
\end{align*}
Using the two indicator bounds gives
\begin{align*}
|Z'-Z|\le 2.
\end{align*}
Thus each independent edge coordinate has bounded-difference constant $c_e=2$.
There are $\binom n2=n(n-1)/2$ edge coordinates. Hence
\begin{align*}
\sum_{e\in\binom{[n]}{2}} c_e^2=\sum_{e\in\binom{[n]}{2}}2^2.
\end{align*}
Since the summand is $4$ and there are $\binom n2$ edges,
\begin{align*}
\sum_{e\in\binom{[n]}{2}} c_e^2=4\binom n2.
\end{align*}
Substituting $\binom n2=n(n-1)/2$ gives
\begin{align*}
\sum_{e\in\binom{[n]}{2}} c_e^2=2n(n-1).
\end{align*}
By *McDiarmid's inequality* applied to the independent edge indicators,
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)\le \exp\left(-\frac{2t^2}{\sum_{e\in\binom{[n]}{2}}c_e^2}\right).
\end{align*}
Using the computed sum of squared constants,
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)\le \exp\left(-\frac{2t^2}{2n(n-1)}\right).
\end{align*}
Cancelling the factor $2$ in numerator and denominator yields
\begin{align*}
\mathbb P(Z-\mathbb E[Z]\ge t)\le \exp\left(-\frac{t^2}{n(n-1)}\right).
\end{align*}
The same argument applied to $-Z$ gives the corresponding lower-tail estimate.
This proves concentration only at the coarse scale $n$, because the exponent is of order one when $t$ is of order $n$. The method is still useful as a first pass: it uses only independent edge exposures and worst-case changes. Its limitation is equally visible from the calculation, since every edge is charged as though it could change two isolation indicators, even though in sparse regimes many edge changes occur in configurations where neither endpoint is isolated.
[/example]
The final comparison is between entrywise and structural control. A union bound over coordinates may be the correct argument for a coordinate maximum, but it can be inefficient for a norm or spectrum because those objects aggregate directions continuously.
[example: When Entrywise Bounds Lose Geometry]
Let $A$ be an $n\times n$ random matrix with independent centered sub-Gaussian entries of unit scale. We make the entrywise route explicit and then compare its operator-norm consequence with the spectral scale.
Assume the unit-scale [sub-Gaussian tail bound](/theorems/1953): there is a universal constant $c>0$ such that, for every $i,j$ and every $t>0$,
\begin{align*}
\mathbb P(|A_{ij}|\ge t)\le 2\exp(-ct^2).
\end{align*}
The event $\{\max_{1\le i,j\le n}|A_{ij}|\ge t\}$ is the union of the $n^2$ events $\{|A_{ij}|\ge t\}$. By the union bound,
\begin{align*}
\mathbb P\left(\max_{1\le i,j\le n}|A_{ij}|\ge t\right)\le \sum_{i=1}^n\sum_{j=1}^n \mathbb P(|A_{ij}|\ge t).
\end{align*}
Using the fixed-entry tail estimate in each summand gives
\begin{align*}
\sum_{i=1}^n\sum_{j=1}^n \mathbb P(|A_{ij}|\ge t)\le \sum_{i=1}^n\sum_{j=1}^n 2\exp(-ct^2).
\end{align*}
Since the summand is independent of $i$ and $j$, and there are $n^2$ pairs,
\begin{align*}
\sum_{i=1}^n\sum_{j=1}^n 2\exp(-ct^2)=2n^2\exp(-ct^2).
\end{align*}
Therefore
\begin{align*}
\mathbb P\left(\max_{1\le i,j\le n}|A_{ij}|\ge t\right)\le 2n^2\exp(-ct^2).
\end{align*}
Fix $u>0$ and choose
\begin{align*}
t=\sqrt{\frac{2\log n+u+\log 2}{c}}.
\end{align*}
Then
\begin{align*}
ct^2=2\log n+u+\log 2.
\end{align*}
Substituting this value into the entrywise bound gives
\begin{align*}
2n^2\exp(-ct^2)=2n^2\exp(-2\log n-u-\log 2).
\end{align*}
Using $\exp(-2\log n)=n^{-2}$ and $\exp(-\log 2)=2^{-1}$,
\begin{align*}
2n^2\exp(-2\log n-u-\log 2)=2n^2\cdot n^{-2}\cdot e^{-u}\cdot 2^{-1}.
\end{align*}
The numerical factors cancel, so
\begin{align*}
2n^2\cdot n^{-2}\cdot e^{-u}\cdot 2^{-1}=e^{-u}.
\end{align*}
Hence
\begin{align*}
\mathbb P\left(\max_{1\le i,j\le n}|A_{ij}|\ge \sqrt{\frac{2\log n+u+\log 2}{c}}\right)\le e^{-u}.
\end{align*}
Now use only this entrywise information to control the operator norm. For any $x\in\mathbb R^n$ with $\|x\|_2=1$, the $i$th coordinate of $Ax$ is
\begin{align*}
(Ax)_i=\sum_{j=1}^n A_{ij}x_j.
\end{align*}
The triangle inequality gives
\begin{align*}
|(Ax)_i|\le \sum_{j=1}^n |A_{ij}|\,|x_j|.
\end{align*}
Each $|A_{ij}|$ is bounded by the maximum entry, so
\begin{align*}
\sum_{j=1}^n |A_{ij}|\,|x_j|\le \left(\max_{1\le r,s\le n}|A_{rs}|\right)\sum_{j=1}^n |x_j|.
\end{align*}
By Cauchy-Schwarz,
\begin{align*}
\sum_{j=1}^n |x_j|\le \left(\sum_{j=1}^n 1^2\right)^{1/2}\left(\sum_{j=1}^n x_j^2\right)^{1/2}.
\end{align*}
Since $\sum_{j=1}^n 1^2=n$ and $\sum_{j=1}^n x_j^2=\|x\|_2^2=1$,
\begin{align*}
\sum_{j=1}^n |x_j|\le \sqrt n.
\end{align*}
Thus, for every $i$,
\begin{align*}
|(Ax)_i|\le \sqrt n\,\max_{1\le r,s\le n}|A_{rs}|.
\end{align*}
Squaring and summing over $i$ gives
\begin{align*}
\|Ax\|_2^2=\sum_{i=1}^n |(Ax)_i|^2.
\end{align*}
Using the coordinate bound in each term,
\begin{align*}
\sum_{i=1}^n |(Ax)_i|^2\le \sum_{i=1}^n n\left(\max_{1\le r,s\le n}|A_{rs}|\right)^2.
\end{align*}
The summand is constant in $i$, so
\begin{align*}
\sum_{i=1}^n n\left(\max_{1\le r,s\le n}|A_{rs}|\right)^2=n^2\left(\max_{1\le r,s\le n}|A_{rs}|\right)^2.
\end{align*}
Taking square roots gives
\begin{align*}
\|Ax\|_2\le n\max_{1\le r,s\le n}|A_{rs}|.
\end{align*}
Since this holds for every unit vector $x$,
\begin{align*}
\|A\|_{\mathrm{op}}=\sup_{\|x\|_2=1}\|Ax\|_2\le n\max_{1\le r,s\le n}|A_{rs}|.
\end{align*}
Combining this deterministic inequality with the high-probability entrywise bound yields
\begin{align*}
\mathbb P\left(\|A\|_{\mathrm{op}}\ge n\sqrt{\frac{2\log n+u+\log 2}{c}}\right)\le e^{-u}.
\end{align*}
For many standard centered unit-scale sub-Gaussian matrix ensembles, the true spectral norm scale is of order $\sqrt n$. The entrywise argument gives order $n\sqrt{\log n}$ instead, because it first controls the largest coordinate and then treats every row and every direction as if all entries could align with that same worst-case value. The loss comes from ignoring cancellation and the geometry of the unit sphere, which are exactly the features that operator norm concentration must exploit.
[/example]
These limitations set up the next stage of concentration theory beyond this course. Entropy methods refine the mgf method by exploiting variance and self-bounding structure. Transport methods relate concentration to geometric inequalities for measures. Chaining and metric entropy replace a raw union bound by a multiscale count of the relevant index set. The classical methods from Chapters 1 through 9 remain the first tools to try, but a good proof designer also knows when their hypotheses are discarding the structure that controls the random quantity.
## Connections and Further Reading
The measure-theoretic foundations behind expectations, independence, and conditional expectation are developed in [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure). The martingale viewpoint used for Azuma-Hoeffding and bounded differences connects directly to [Martingale](/page/Martingale) and to the more advanced probability background in [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
In statistics, the empirical-mean and uniform-estimation chapters lead toward [High-Dimensional Statistics I: Sparsity and Regularisation](/page/High-Dimensional%20Statistics%20I%3A%20Sparsity%20and%20Regularisation). The random-matrix examples at the end are deliberately crude first passes; sharper operator-norm concentration and minimax uses appear in [High-Dimensional Statistics II: Minimax Theory and Random Matrices](/page/High-Dimensional%20Statistics%20II%3A%20Minimax%20Theory%20and%20Random%20Matrices).
The practical takeaway is to choose the inequality from the information actually available. Moment information points to Markov, Chebyshev, and tail integration. Bounded independent summands point to Hoeffding, Bennett, and Bernstein. Conditional bounded increments point to martingale methods. Nonlinear functions of independent variables point to McDiarmid, unless a sharper problem-specific variance calculation is available.
## References
- Boucheron, Lugosi, and Massart, *Concentration Inequalities: A Nonasymptotic Theory of Independence*.
- Vershynin, *High-Dimensional Probability: An Introduction with Applications in Data Science*.
- Wainwright, *High-Dimensional Statistics: A Non-Asymptotic Viewpoint*.
- van der Vaart and Wellner, *[Weak Convergence](/page/Weak%20Convergence) and Empirical Processes*.
Contents
- Introduction
- What Concentration Inequalities Measure
- From Structural Information To Tail Bounds
- Exponential Moments As The Main Organizing Device
- The Road Through The Course
- 1. Tail Bounds from Moments and Variance
- Tail Probabilities and Deviation Scales
- Markov, Chebyshev, and Cantelli Inequalities
- Tail Integration and Moment Bounds
- 2. The Chernoff Method and Moment Generating Functions
- Exponential Markov Inequality and Chernoff Optimization
- Cumulant Generating Functions and Convex Duality
- Chernoff Bounds for Classical Distributions
- 3. Hoeffding's Inequality for Bounded Independent Sums
- Centered Bounded Variables and Their Moment Generating Functions
- Independent Bounded Summands with Unequal Ranges
- Bernoulli Sums and the Additive Chernoff-Hoeffding Bound
- Comparing Chebyshev, Chernoff, and Hoeffding Bounds
- 4. Variance-Sensitive Bounds: Bernstein and Bennett
- Variance as the Effective Scale
- Bennett's Inequality for Bounded Summands
- Bernstein's Quadratic-to-Linear Tail Transition
- Simplified Bernstein Forms in Applications
- 5. Sub-Gaussian Random Variables
- Equivalent Ways to Recognise Sub-Gaussianity
- Closure Properties and Standard Sources
- Norm Estimates and Maximal Inequalities
- 6. Sub-Exponential Random Variables
- Tail, Moment, and Moment Generating Function Tests
- Squares and Products of Sub-Gaussian Variables
- Bernstein Inequality for Sub-Exponential Sums
- 7. Concentration of Empirical Means and Basic Statistical Applications
- Confidence Intervals from Tail Bounds
- Median-of-Means Under Finite Variance
- Uniform Control Over Finite Classes
- Statistical Interpretation and Sample Complexity
- 8. Martingales and Azuma-Hoeffding as a Classical Extension
- Filtrations and Martingale Differences
- Azuma-Hoeffding from Conditional Moment Generating Functions
- Doob Martingales for Functions of Independent Variables
- 9. McDiarmid's Bounded Differences Inequality
- Functions of Independent Variables and Coordinatewise Sensitivity
- Proof from Doob Martingales
- Typical Lipschitz Certificates
- 10. Classical Concentration Patterns and Proof Design
- Choosing an Inequality from the Available Information
- Reusable Proof Moves
- Diagnosing Dimension Losses and Method Limits
- Connections and Further Reading
- References
Concentration Inequalities I: Classical Methods
Content
Problems
History
Created by admin on 6/8/2026 | Last updated on 6/8/2026
Prerequisites (0/7 completed)
Log in to track your prerequisite progress.
Prerequisites Graph
Interactive dependency map showing prerequisite concepts
Loading dependency graph...
Theorem
Definition
Current
Requires
Rate this page
★
★
★
★
★
Poor
Excellent