Concentration Inequalities II: Entropy and Transport

This course develops a modern concentration-of-measure theory centered on entropy, functional inequalities, and transport. It explains how probabilistic fluctuations can be controlled not only by variance-based methods, but also by more structural tools that reveal why high-dimensional random objects tend to be sharply concentrated. The emphasis is on a unified view: entropy as a measure of disorder and information, and transport as a way to compare probability measures geometrically. The early chapters introduce entropy as a concentration tool and build the Herbst argument, which turns logarithmic Sobolev inequalities into exponential tail bounds. From there, the course moves to discrete entropy and product spaces, showing how tensorization and product structure make concentration mechanisms work in both continuous and combinatorial settings. Gaussian isoperimetry and Talagrand’s convex distance inequality then provide some of the sharpest classical examples, illustrating how geometry and probability interact to produce optimal concentration bounds. The later chapters broaden the framework to transportation-cost inequalities, quadratic transport, and Talagrand’s $T_2$ inequality, connecting concentration to optimal transport and geometric properties of measures. The course closes with case studies and synthesis, tying together entropy, functional inequalities, isoperimetry, and transport into a coherent toolkit. By the end, the chapters build from foundational entropy methods to deep geometric principles that explain concentration across a wide range of probabilistic models.

# Introduction

This course studies why independent or weakly dependent randomness produces functions that are sharply concentrated around typical values. The first course treated classical moment-generating-function estimates, bounded differences, and direct martingale methods.
Here the same phenomena are reorganised through entropy, isoperimetry, and transport: instead of estimating every [random variable](/page/Random%20Variable) separately, we prove structural inequalities for the underlying probability measure and then derive concentration as a consequence. The central theme is that sub-Gaussian concentration is often a shadow of a stronger principle. Tensorization explains why product measures behave well in high dimension; logarithmic Sobolev inequalities turn entropy bounds into moment bounds; Gaussian isoperimetry identifies the exact geometric source of Gaussian tails; transportation-cost inequalities connect concentration to stability of measures under optimal coupling. This introduction fixes the language and the map of the course before the technical development begins.

## From Tail Bounds to Structural Inequalities

What kind of statement should replace a direct tail estimate when we want a reusable concentration principle? A tail inequality for one function is useful, but it usually hides the mechanism producing the bound. This course looks for inequalities attached to a probability measure, because such inequalities can be transported through Lipschitz maps, tensorized across independent coordinates, and compared across different spaces. [definition: Concentration Bound] Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space and let $X: \Omega \to \mathbb R$ be a real-valued random variable. A concentration bound for $X$ is an upper bound on probabilities of the form $\mathbb P(X - m \ge t)$ or $\mathbb P(|X - m| \ge t)$, where $m \in \mathbb R$ is a centre and $t \ge 0$. [/definition] The choice of centre matters less than it first appears when the tails are strong enough. We will move between means, medians, and other typical values when the argument gives quantitative control of the shift.
[example: Gaussian Lipschitz Concentration] Let $G \sim \mathcal N(0,I_n)$ and let $f:\mathbb R^n\to\mathbb R$ be $1$-Lipschitz for the Euclidean norm. Since \begin{align*} |f(x)|\le |f(0)|+\|x\|_2, \end{align*} and $\mathbb E[\|G\|_2]\le (\mathbb E[\|G\|_2^2])^{1/2}=\sqrt n$, the random variable $f(G)$ is integrable. The model conclusion is \begin{align*} \mathbb P(f(G)-\mathbb E[f(G)]\ge t)\le \exp(-t^2/2),\qquad t\ge 0. \end{align*} The structural routes developed later all prove the centred Laplace estimate \begin{align*} \log \mathbb E\exp\left(\lambda(f(G)-\mathbb E[f(G)])\right)\le \frac{\lambda^2}{2},\qquad \lambda\ge 0. \end{align*} From this estimate, Markov's inequality applied to the nonnegative random variable $\exp(\lambda(f(G)-\mathbb E[f(G)]))$ gives, for every $\lambda>0$, \begin{align*} \mathbb P(f(G)-\mathbb E[f(G)]\ge t)\le e^{-\lambda t}\mathbb E\exp\left(\lambda(f(G)-\mathbb E[f(G)])\right). \end{align*} The Laplace estimate is equivalent to \begin{align*} \mathbb E\exp\left(\lambda(f(G)-\mathbb E[f(G)])\right)\le e^{\lambda^2/2}, \end{align*} so \begin{align*} \mathbb P(f(G)-\mathbb E[f(G)]\ge t)\le \exp\left(\frac{\lambda^2}{2}-\lambda t\right). \end{align*} Choosing $\lambda=t$ when $t>0$ gives \begin{align*} \mathbb P(f(G)-\mathbb E[f(G)]\ge t)\le \exp\left(\frac{t^2}{2}-t^2\right)=\exp(-t^2/2). \end{align*} For $t=0$, the same displayed bound reads $\mathbb P(f(G)-\mathbb E[f(G)]\ge 0)\le 1$, which is true. Thus the Gaussian Lipschitz tail bound is a common consequence of the Laplace estimate; Chapter 2 obtains that estimate through Herbst's argument, Chapter 3 through the Gaussian logarithmic Sobolev inequality, Chapter 5 through Gaussian isoperimetry, and Chapters 7 and 8 through transportation-cost inequalities. [/example] The Gaussian example suggests a shift in perspective. Rather than asking only for the tail of $f(G)$, we ask which functional inequality of the Gaussian measure forces every Lipschitz image to have Gaussian tails. 
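The optimisation over $\lambda$ in the Gaussian Lipschitz example can be checked numerically. The following is a minimal sketch, assuming only the Laplace bound $e^{\lambda^2/2}$ stated in the example; the function names `chernoff_bound` and `best_on_grid` and the grid parameters are illustrative choices, not part of the notes.

```python
import math


def chernoff_bound(t, lam):
    """Markov bound exp(lam^2/2 - lam*t) obtained from the Laplace estimate."""
    return math.exp(lam * lam / 2.0 - lam * t)


def best_on_grid(t, lam_max=10.0, steps=10000):
    """Minimise the bound over the grid lam = lam_max * k / steps, k = 1..steps."""
    best_lam, best_val = None, math.inf
    for k in range(1, steps + 1):
        lam = lam_max * k / steps
        val = chernoff_bound(t, lam)
        if val < best_val:
            best_lam, best_val = lam, val
    return best_lam, best_val


# The grid minimiser should sit at lam = t, with value exp(-t^2/2).
lam_star, bound = best_on_grid(2.0)
gaussian_tail_exponent = math.exp(-2.0 ** 2 / 2.0)
```

The grid search is deliberately crude: the exponent $\lambda^2/2-\lambda t$ is a convex quadratic in $\lambda$, so the minimiser $\lambda=t$ is also available in closed form, as in the example.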
[definition: Sub-Gaussian Concentration Property] Let $(E,d)$ be a [metric space](/page/Metric%20Space) and let $\mu$ be a probability measure on its Borel $\sigma$-algebra. We say that $\mu$ has the sub-Gaussian concentration property with constant $C > 0$ if for every $\mu$-integrable $1$-Lipschitz function $f:E\to \mathbb R$ and every $t \ge 0$, \begin{align*} \mu\{x \in E : f(x) - \int_E f\,d\mu \ge t\} \le \exp\left(-\frac{t^2}{2C}\right). \end{align*} [/definition] This formulation turns concentration into a property of the measure. The rest of the course develops several ways of proving such a property and comparing the constants that arise.

## Entropy as a Measure of Tilting

How can a proof control rare events without estimating their probabilities directly? The entropy method replaces a rare-event estimate by a cost of changing measure. If a new measure is obtained by tilting the old one toward an unlikely region, relative entropy measures how expensive that tilt is. [definition: Relative Entropy] Let $(E,\mathcal E)$ be a measurable space, and let $\mathcal P(E,\mathcal E)$ denote the set of probability measures on $(E,\mathcal E)$. The relative entropy functional is the map \begin{align*} H: \mathcal P(E,\mathcal E) \times \mathcal P(E,\mathcal E) \to [0,+\infty] \end{align*} defined by \begin{align*} H(\nu\mid \mu) := \int_E \log\left(\frac{d\nu}{d\mu}\right)\,d\nu \end{align*} when $\nu \ll \mu$, and by $H(\nu\mid \mu):=+\infty$ otherwise. [/definition] Some quoted results use the information-theoretic notation $D(\nu\|\mu)$ for the same quantity. Throughout these notes, \begin{align*} D(\nu\|\mu)=H(\nu\mid\mu) \end{align*} with the same absolute-continuity convention. Relative entropy will be the bookkeeping device behind exponential tilting. It also gives a precise way to compare two expectations under different measures. To use entropy in estimates, we need a dual formula that turns a logarithmic moment-generating function into an optimisation over tilted laws.
The [Gibbs variational principle](/theorems/6723) supplies that formula and explains why entropy is the correct penalty term. [quotetheorem:6723] [citeproof:6723] The finiteness assumption \begin{align*} \int_E e^f\,d\mu < \infty \end{align*} is the boundary between a genuine logarithmic moment and an infinite variational problem. For instance, if $\mu$ is the standard exponential law on $[0,\infty)$ and $f(x)=x$, then \begin{align*} \int_{[0,\infty)} e^f\,d\mu=\infty . \end{align*} Truncating $f$ at level $k$ gives finite tilted laws whose variational values grow like $\log k$. Hence the variational supremum diverges together with the logarithmic moment, rather than producing a finite log-[Laplace transform](/page/Laplace%20Transform). The restriction $\nu \ll \mu$ is also essential, since a singular measure may put mass where the reference measure sees nothing, and entropy assigns infinite cost to that change of law. The theorem does not itself give concentration; it only identifies the correct dual form of exponential moments. Chapters 2 and 3 supply upper bounds for entropies of exponential tilts, while Chapters 7 and 8 supply bounds on $H(\nu\mid\mu)$ through transport; this principle then converts those bounds into estimates on Laplace transforms. [example: Entropy Cost of an Exponential Tilt] Let $X\sim\mathcal N(0,1)$ under $\mu$, so $\mu$ has density \begin{align*} \varphi(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2} \end{align*} with respect to [Lebesgue measure](/page/Lebesgue%20Measure). For $\lambda\in\mathbb R$, the normalising constant of the exponential tilt by $e^{\lambda X}$ is \begin{align*} Z_\lambda=\int_{\mathbb R} e^{\lambda x}\varphi(x)\,dx. \end{align*} Expanding the density gives \begin{align*} Z_\lambda=\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}\exp\left(\lambda x-\frac{x^2}{2}\right)\,dx. 
\end{align*} Completing the square, \begin{align*} \lambda x-\frac{x^2}{2}=-\frac{x^2-2\lambda x}{2}=-\frac{(x-\lambda)^2-\lambda^2}{2}=-\frac{(x-\lambda)^2}{2}+\frac{\lambda^2}{2}. \end{align*} Therefore \begin{align*} Z_\lambda=e^{\lambda^2/2}\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}\exp\left(-\frac{(x-\lambda)^2}{2}\right)\,dx. \end{align*} The change of variables $y=x-\lambda$ turns the remaining integral into the total mass of the standard Gaussian density, so \begin{align*} Z_\lambda=e^{\lambda^2/2}. \end{align*} Let $\nu_\lambda$ be the tilted law. Its Lebesgue density is \begin{align*} \frac{d\nu_\lambda}{dx}(x)=\frac{e^{\lambda x}}{Z_\lambda}\varphi(x). \end{align*} Substituting the values of $Z_\lambda$ and $\varphi$ gives \begin{align*} \frac{d\nu_\lambda}{dx}(x)=\frac{1}{\sqrt{2\pi}}\exp\left(\lambda x-\frac{\lambda^2}{2}-\frac{x^2}{2}\right). \end{align*} Since \begin{align*} \lambda x-\frac{\lambda^2}{2}-\frac{x^2}{2}=-\frac{x^2-2\lambda x+\lambda^2}{2}=-\frac{(x-\lambda)^2}{2}, \end{align*} we obtain \begin{align*} \frac{d\nu_\lambda}{dx}(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-\lambda)^2}{2}\right). \end{align*} Thus $\nu_\lambda=\mathcal N(\lambda,1)$. The Radon-Nikodym derivative of the tilted law relative to $\mu$ is \begin{align*} \frac{d\nu_\lambda}{d\mu}(x)=\frac{e^{\lambda x}}{Z_\lambda}=e^{\lambda x-\lambda^2/2}. \end{align*} Using the definition of relative entropy, \begin{align*} H(\nu_\lambda\mid\mu)=\int_{\mathbb R}\log\left(\frac{d\nu_\lambda}{d\mu}\right)\,d\nu_\lambda. \end{align*} Substituting the logarithm of the derivative gives \begin{align*} H(\nu_\lambda\mid\mu)=\int_{\mathbb R}\left(\lambda x-\frac{\lambda^2}{2}\right)\,d\nu_\lambda(x). \end{align*} By linearity of the integral, \begin{align*} H(\nu_\lambda\mid\mu)=\lambda\int_{\mathbb R}x\,d\nu_\lambda(x)-\frac{\lambda^2}{2}\int_{\mathbb R}1\,d\nu_\lambda(x). 
\end{align*} Since $\nu_\lambda=\mathcal N(\lambda,1)$ has mean $\lambda$ and total mass $1$, \begin{align*} H(\nu_\lambda\mid\mu)=\lambda^2-\frac{\lambda^2}{2}=\frac{\lambda^2}{2}. \end{align*} Thus moving the mean of a standard Gaussian from $0$ to $\lambda$ costs exactly $\lambda^2/2$ units of entropy, the same quadratic scale that appears in Gaussian concentration exponents. [/example] The course repeatedly uses this example as a template. Entropy quantifies the price of changing the law; functional inequalities estimate that price using gradients, conditional entropies, or transportation distances.

## Tensorization and High-Dimensional Stability

Why do concentration inequalities often improve rather than deteriorate in product spaces? Independence permits global quantities to decompose into coordinatewise pieces. Tensorization is the formal expression of this principle. [definition: Product Probability Space] Let $(E_i,\mathcal E_i,\mu_i)$ be probability spaces for $1 \le i \le n$. The product probability space is \begin{align*} (E,\mathcal E,\mu) = \left(\prod_{i=1}^n E_i,\bigotimes_{i=1}^n \mathcal E_i,\bigotimes_{i=1}^n \mu_i\right). \end{align*} [/definition] Product spaces are where entropy becomes especially powerful, because independence gives a canonical way to expose one coordinate at a time. To make that idea quantitative, we need an entropy functional for nonnegative weights: a function $F$ will later play the role of an unnormalised density for a tilted law, and its entropy measures how far that tilt is from being constant. [definition: Entropy of a Nonnegative Function] Let $(E,\mathcal E,\mu)$ be a probability space.
The entropy of a nonnegative function with respect to $\mu$ is the functional \begin{align*} \operatorname{Ent}_\mu:\left\{F:E\to[0,\infty]\ \middle|\ F \text{ is measurable and } \int_E F\,d\mu<\infty\right\}\to[0,\infty] \end{align*} defined by \begin{align*} \operatorname{Ent}_\mu(F):=\int_E F\log F\,d\mu-\left(\int_E F\,d\mu\right)\log\left(\int_E F\,d\mu\right), \end{align*} with the convention $0\log 0=0$ and with value $+\infty$ if the positive part $(F\log F)^+$ is not $\mu$-integrable. [/definition] This functional is the relative entropy of the tilted measure with density $F/\int_E F\,d\mu$, multiplied by the normalising mass $\int_E F\,d\mu$. Thus the preceding definition converts measure-level entropy into an object that can be applied directly to exponential weights such as $F=e^{\lambda f}$. The next theorem is needed because later concentration proofs first control the entropy of such a global weight by splitting it into coordinatewise contributions; without this tensorization step, an $n$-variable entropy estimate would not reduce to the one-coordinate estimates available from logarithmic Sobolev or bounded-difference arguments. [quotetheorem:6725] [citeproof:6725] Independence is the hypothesis that makes the theorem high-dimensional rather than merely notational. If the coordinates are strongly dependent, for instance if $X_1=\cdots=X_n$ almost surely, the conditional reference law of a coordinate is no longer the fixed marginal law $\mu_i$, and the coordinatewise entropy accounting used in the proof breaks down. The nonnegativity hypothesis is also structural, since $\operatorname{Ent}_\mu(F)$ is defined through $F\log F$ and the normalisation by $\mathbb E[F]$; for example $F=1-2\mathbb{1}_{\{X_1=1\}}$ on a Bernoulli space takes negative values, so $F\log F$ is not a real entropy expression.
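The entropy functional and its coordinatewise splitting can be tested on a small finite space. The sketch below first checks the identity $\operatorname{Ent}_\mu(F)=\left(\int_E F\,d\mu\right)H(\nu\mid\mu)$ for $\nu$ with density $F/\int_E F\,d\mu$, and then checks tensorization on $\{0,1\}^2$. It assumes the quoted theorem has the standard form in which the global entropy is at most the sum of expected one-coordinate entropies; the helper names, the measure `mu`, and the weights `F` and `G` are made up for illustration.

```python
import itertools
import math


def xlogx(x):
    """x log x with the convention 0 log 0 = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)


def ent(weights, values):
    """Ent_mu(F) on a finite space: weights are mu-masses, values are F >= 0."""
    mass = sum(w * v for w, v in zip(weights, values))
    return sum(w * xlogx(v) for w, v in zip(weights, values)) - xlogx(mass)


def relative_entropy(nu, mu):
    """H(nu | mu) on a finite space, +inf when nu is not absolutely continuous."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n > 0.0:
            if m == 0.0:
                return math.inf
            total += n * math.log(n / m)
    return total


# Identity Ent_mu(F) = (mass of F) * H(nu | mu), with nu = F mu / mass.
mu = [0.2, 0.5, 0.3]
F = [3.0, 0.0, 1.5]
mass = sum(m * f for m, f in zip(mu, F))
nu = [m * f / mass for m, f in zip(mu, F)]
identity_gap = abs(ent(mu, F) - mass * relative_entropy(nu, mu))

# Tensorization on {0,1}^2 under Ber(p) x Ber(p), for an arbitrary weight G >= 0.
p = 0.3
w = {0: 1.0 - p, 1: p}
G = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 0.5, (1, 1): 2.0}
points = list(itertools.product((0, 1), repeat=2))
global_ent = ent([w[x] * w[y] for x, y in points], [G[pt] for pt in points])
# Expected one-coordinate entropies: vary one coordinate, average over the other.
first = sum(w[y] * ent([w[0], w[1]], [G[(0, y)], G[(1, y)]]) for y in (0, 1))
second = sum(w[x] * ent([w[0], w[1]], [G[(x, 0)], G[(x, 1)]]) for x in (0, 1))
coordinate_sum = first + second
```

Under the stated assumption, `global_ent` should not exceed `coordinate_sum` for any nonnegative choice of `G`, and `identity_gap` should vanish up to rounding.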
Integrability prevents a different failure mode: if $F(x)=e^{x^2}$ under a standard Gaussian input, then $\mathbb E[F]=\infty$, so neither the global entropy nor the conditional entropies give finite quantities to compare. The theorem also does not give a tail bound by itself: it reduces a global entropy to local terms, but those local terms still need separate estimates such as two-point logarithmic Sobolev inequalities or bounded-difference bounds. This is the reason later product-space arguments split into two stages, first tensorizing and then estimating each coordinate contribution. [example: Bernoulli Product Space] Let $\mu=\operatorname{Ber}(p)^{\otimes n}$ on $\{0,1\}^n$, let $X=(X_1,\dots,X_n)\sim\mu$, let $f:\{0,1\}^n\to\mathbb R$ be a function, and set \begin{align*} F(X)=e^{\lambda f(X)} \end{align*} for $\lambda\ge 0$. Since $F\ge 0$, entropy tensorization gives \begin{align*} \operatorname{Ent}_\mu(e^{\lambda f})\le \sum_{i=1}^n \mathbb E\left[\operatorname{Ent}_{\operatorname{Ber}(p)}\left(e^{\lambda f(X_1,\dots,X_{i-1},\cdot,X_{i+1},\dots,X_n)}\right)\right]. \end{align*} For fixed $x_{-i}=(x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$, write \begin{align*} a_i(x_{-i})=f(x_1,\dots,x_{i-1},1,x_{i+1},\dots,x_n) \end{align*} and \begin{align*} b_i(x_{-i})=f(x_1,\dots,x_{i-1},0,x_{i+1},\dots,x_n). \end{align*} Suppressing the argument $x_{-i}$ of $a_i$ and $b_i$, the $i$th conditional entropy is therefore the two-point expression \begin{align*} p e^{\lambda a_i}\log(e^{\lambda a_i})+(1-p)e^{\lambda b_i}\log(e^{\lambda b_i})-\left(p e^{\lambda a_i}+(1-p)e^{\lambda b_i}\right)\log\left(p e^{\lambda a_i}+(1-p)e^{\lambda b_i}\right). \end{align*} Thus the global entropy of $e^{\lambda f(X)}$ has been reduced to $n$ explicit Bernoulli entropies, one for each coordinate. Assume now that $f$ has bounded coordinate differences: for every $i$ and every $x_{-i}$, \begin{align*} |a_i(x_{-i})-b_i(x_{-i})|\le c_i.
\end{align*} The two-point entropy estimate on a Bernoulli coordinate gives \begin{align*} \operatorname{Ent}_{\operatorname{Ber}(p)}\left(e^{\lambda f(x_1,\dots,x_{i-1},\cdot,x_{i+1},\dots,x_n)}\right)\le \frac{\lambda^2c_i^2}{8}\mathbb E_i\left[e^{\lambda f(x_1,\dots,x_{i-1},X_i,x_{i+1},\dots,x_n)}\right], \end{align*} where $\mathbb E_i$ denotes expectation only over the $i$th Bernoulli coordinate. Taking expectation over the remaining coordinates gives \begin{align*} \mathbb E\left[\operatorname{Ent}_{\operatorname{Ber}(p)}\left(e^{\lambda f(X_1,\dots,X_{i-1},\cdot,X_{i+1},\dots,X_n)}\right)\right]\le \frac{\lambda^2c_i^2}{8}\mathbb E[e^{\lambda f(X)}]. \end{align*} Substituting these $n$ bounds into tensorization yields \begin{align*} \operatorname{Ent}_\mu(e^{\lambda f})\le \sum_{i=1}^n \frac{\lambda^2c_i^2}{8}\mathbb E[e^{\lambda f(X)}]. \end{align*} Hence \begin{align*} \operatorname{Ent}_\mu(e^{\lambda f})\le \frac{\lambda^2}{8}\left(\sum_{i=1}^n c_i^2\right)\mathbb E[e^{\lambda f(X)}]. \end{align*} Applying Herbst's argument with $C=\frac{1}{4}\sum_{i=1}^n c_i^2$ gives \begin{align*} \mathbb P\left(f(X)-\mathbb E[f(X)]\ge t\right)\le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right),\qquad t\ge 0. \end{align*} Independence is what decomposes the entropy into one-coordinate costs, and the bounded-difference assumption controls each cost by the square of the corresponding coordinate sensitivity. [/example] This product-space viewpoint will later be compared with martingale differences. The martingale proof estimates increments of conditional expectations, while the entropy proof estimates the cost of coordinatewise tilting.

## The Three Main Routes to Concentration

Which structural inequalities will the course use, and how do they relate to one another? The course follows three routes: entropy and logarithmic Sobolev inequalities, isoperimetry, and optimal transport.
They overlap, but each has a distinct language and a distinct class of examples where it is most natural. [definition: Logarithmic Sobolev Strategy] The logarithmic Sobolev strategy is the method of proving concentration by bounding the entropy of $e^f$ or $e^{\lambda f}$ in terms of a gradient quantity associated with $f$. [/definition] After this definition, the main technical device is Herbst's argument. It converts entropy control of exponential tilts into a differential inequality for the log-Laplace transform. This conversion is needed because the final concentration statement is a tail bound, while logarithmic Sobolev inequalities speak about entropy. The next theorem records the analytic bridge between these two languages. [quotetheorem:6727] [citeproof:6727] The exponential-moment hypotheses are not cosmetic: a heavy-tailed random variable may have infinite $\mathbb E[e^{\lambda X}]$ for every $\lambda>0$, in which case no finite sub-Gaussian Laplace bound can hold. The additional condition $\mathbb E[|X|e^{\lambda X}]<\infty$ is what justifies differentiating the log-Laplace transform under the expectation; without it, $\mathbb E[e^{\lambda X}]$ may be finite at a boundary point while $\mathbb E[Xe^{\lambda X}]$ is infinite, so the displayed entropy identity does not define a finite derivative term. For example, a nonnegative random variable with tail density proportional to $x^{-2}e^{-\lambda_0 x}$ for large $x$ has finite $\mathbb E[e^{\lambda_0 X}]$ but infinite $\mathbb E[Xe^{\lambda_0 X}]$. The entropy hypothesis is also one-sided in the parameter range stated, so the theorem proves an upper-tail estimate; a matching lower-tail estimate requires applying the same argument to $-X$ or assuming the entropy bound for both signs. The result does not identify why the entropy bound is true, and it does not by itself prove dimension-free concentration for a measure. 
Later logarithmic Sobolev and tensorization arguments are precisely the mechanisms used to verify this hypothesis for Lipschitz functions of Gaussian or product inputs. [definition: Transportation-Cost Strategy] The transportation-cost strategy is the method of proving concentration by bounding an optimal-transport distance between probability measures in terms of their relative entropy. [/definition] This strategy changes the object of study from a random variable to a pair of probability measures: the reference law $\mu$ and a tilted law $\nu$ that gives extra weight to the event or region under investigation. To turn such a comparison into a concentration estimate, we need a distance between laws that is visible to Lipschitz test functions. The Wasserstein-one distance is the first transport distance with exactly this feature: it measures the cheapest average amount of movement needed to transform one law into another, and its dual formulation controls differences of expectations. [definition: Wasserstein-One Distance] Let $(E,d)$ be a metric space, fix $x_0\in E$, and let $\mathcal P_1(E)$ denote the set of Borel probability measures $\rho$ on $E$ such that \begin{align*} \int_E d(x,x_0)\,d\rho(x)<\infty. \end{align*} The Wasserstein-one distance is the map \begin{align*} W_1:\mathcal P_1(E)\times \mathcal P_1(E)\to [0,\infty) \end{align*} defined by \begin{align*} W_1(\nu,\mu):=\inf_{\pi}\int_{E\times E} d(x,y)\,d\pi(x,y), \end{align*} where the infimum is over all couplings $\pi$ of $\nu$ and $\mu$. [/definition] The finite-moment condition does not depend on the chosen base point $x_0$, by the triangle inequality. It is needed because the average transport cost may be infinite for arbitrary probability measures on an unbounded metric space. The dual form of this distance says that $W_1$ is exactly the largest possible change in expectation over $1$-Lipschitz test functions, under the usual hypotheses for [Kantorovich duality](/theorems/6799). 
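The primal and dual descriptions of $W_1$ can be compared on the real line. The sketch below relies on a standard one-dimensional fact that is not proved in these notes: for two uniform empirical measures with equally many atoms, the monotone coupling that pairs sorted atoms is optimal. The function names and sample points are illustrative only.

```python
import math


def w1_empirical_line(xs, ys):
    """W_1 between uniform empirical measures on equally many real atoms.

    Assumes the one-dimensional fact that the sorted (monotone) coupling is optimal.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch needs equally many atoms")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mean(f, points):
    """Expectation of f under the uniform empirical measure on points."""
    return sum(f(p) for p in points) / len(points)


xs = [0.0, 1.0, 2.5, 4.0]
ys = [0.5, 0.5, 3.0, 6.0]
w1 = w1_empirical_line(xs, ys)

# A few 1-Lipschitz test functions; each expectation gap is at most w1.
lipschitz_tests = [abs, math.sin, lambda x: x, lambda x: -x, lambda x: min(x, 2.0)]
gaps = [abs(mean(f, ys) - mean(f, xs)) for f in lipschitz_tests]
```

Each entry of `gaps` is a lower bound on the dual supremum, so none of them can exceed `w1`; the dual formulation says that the supremum over all $1$-Lipschitz functions closes the gap.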
This duality is the transport input used in the following example. [example: Transport Interpretation of a Lipschitz Deviation] Let $(E,d)$ be a metric space, let $\mu,\nu\in\mathcal P_1(E)$, and let $f:E\to\mathbb R$ be $1$-Lipschitz and integrable under both measures. By *Kantorovich duality for $W_1$*, \begin{align*} \int_E f\,d\nu-\int_E f\,d\mu\le \sup_{\operatorname{Lip}(g)\le 1}\left\{\int_E g\,d\nu-\int_E g\,d\mu\right\}=W_1(\nu,\mu). \end{align*} If $\mu$ satisfies the transportation-cost inequality \begin{align*} W_1(\rho,\mu)^2\le 2C H(\rho\mid\mu) \end{align*} for every probability measure $\rho\ll\mu$, then applying it to $\rho=\nu$ gives \begin{align*} \int_E f\,d\nu-\int_E f\,d\mu\le W_1(\nu,\mu)\le \sqrt{2C H(\nu\mid\mu)}. \end{align*} This estimate becomes a Laplace bound by applying the *Gibbs variational principle* to the centred function $\lambda(f-\int_E f\,d\mu)$, where $\lambda\ge 0$: \begin{align*} \log\int_E \exp\left(\lambda\left(f-\int_E f\,d\mu\right)\right)\,d\mu=\sup_{\rho\ll\mu}\left\{\lambda\left(\int_E f\,d\rho-\int_E f\,d\mu\right)-H(\rho\mid\mu)\right\}. \end{align*} Measures of infinite entropy do not contribute to the supremum. For each admissible $\rho$ with $H(\rho\mid\mu)<\infty$, the transport estimate gives \begin{align*} \lambda\left(\int_E f\,d\rho-\int_E f\,d\mu\right)-H(\rho\mid\mu)\le \lambda\sqrt{2C H(\rho\mid\mu)}-H(\rho\mid\mu). \end{align*} Writing $h=H(\rho\mid\mu)\ge 0$, the identity \begin{align*} \left(\sqrt h-\lambda\sqrt{\frac C2}\right)^2=h-\lambda\sqrt{2C h}+\frac{C\lambda^2}{2} \end{align*} and nonnegativity of the square imply \begin{align*} \lambda\sqrt{2C h}-h\le \frac{C\lambda^2}{2}. \end{align*} Taking the supremum over $\rho\ll\mu$ therefore yields \begin{align*} \log\int_E \exp\left(\lambda\left(f-\int_E f\,d\mu\right)\right)\,d\mu\le \frac{C\lambda^2}{2}.
\end{align*} Markov's inequality applied to the nonnegative random variable $\exp(\lambda(f-\int_E f\,d\mu))$ gives, for $\lambda>0$ and $t\ge 0$, \begin{align*} \mu\left\{f-\int_E f\,d\mu\ge t\right\}\le \exp(-\lambda t)\int_E \exp\left(\lambda\left(f-\int_E f\,d\mu\right)\right)\,d\mu. \end{align*} Substituting the Laplace bound gives \begin{align*} \mu\left\{f-\int_E f\,d\mu\ge t\right\}\le \exp\left(\frac{C\lambda^2}{2}-\lambda t\right). \end{align*} Choosing $\lambda=t/C$ when $t>0$ gives \begin{align*} \mu\left\{f-\int_E f\,d\mu\ge t\right\}\le \exp\left(\frac{C}{2}\frac{t^2}{C^2}-\frac{t^2}{C}\right)=\exp\left(-\frac{t^2}{2C}\right). \end{align*} For $t=0$, the same bound reads $\mu\{f-\int_E f\,d\mu\ge 0\}\le 1$. Thus a quadratic transport-[entropy inequality](/theorems/6729) turns the entropy cost of tilting $\mu$ into the usual sub-Gaussian upper tail for every $1$-Lipschitz observable. [/example] The transport example explains deviations by comparing two probability measures. A different route avoids tilted measures and controls metric neighbourhoods of sets directly. This motivates the isoperimetric strategy, which is needed for Gaussian half-space extremality and for Talagrand's convex distance inequality, where geometry is the primary object. [definition: Isoperimetric Strategy] The isoperimetric strategy is the method of proving concentration by lower bounding the measure of metric enlargements of sets from their original measure. [/definition] These three strategies will be developed separately and then compared. A mature use of concentration inequalities often involves selecting the route whose hypotheses match the problem rather than forcing every problem into a single template.

## Prerequisites and Course Conventions

What background will the notes assume, and what notation will be used without redefinition?
We assume measure-theoretic probability, [conditional expectation](/page/Conditional%20Expectation), real analysis, basic functional analysis, convexity, and the classical concentration inequalities from the first course. In particular, Chernoff bounds, bounded differences, [Jensen's inequality](/theorems/9), Hölder's inequality, and basic properties of product measures are treated as available tools. [remark: Probability Notation] All random variables are defined on a probability space $(\Omega,\mathcal F,\mathbb P)$ unless a different space is named. Expectations are written as $\mathbb E[X]$. Laws of random variables are written as pushforwards, and $X\sim \mu$ means that $X$ has distribution $\mu$. [/remark] The notes distinguish between inequalities for random variables and inequalities for measures. When the measure is the main object, integral notation such as \begin{align*} \int_E f\,d\mu \end{align*} will often replace probabilistic notation. [remark: Metric and Functional Notation] Metric spaces are written as $(E,d)$, and Lipschitz constants are computed with respect to $d$. For functions on Euclidean space, gradients are written as $\nabla f$. Norms on functions are subscripted on first use, for instance $\|f\|_{L^2(\mu)}$ or $\|f\|_\infty$. [/remark] The course uses constants in a way that tracks dimension-free behaviour. If a constant depends on fixed parameters, those parameters will be stated near the inequality. [remark: What Counts as a Proof in These Notes] When a theorem is part of the course's main development, the notes include a proof sketch indicating the strategy and the tools used. Some comparison results from measure theory, [convex geometry](/page/Convex%20Geometry), or optimal transport are used as external inputs when their proofs would lead outside the course. Those results are identified at the point of use rather than treated as gaps in the argument.
[/remark] This introduction should be read as a map rather than as a substitute for the later arguments. The next chapter begins the course in earnest by defining entropy for probability measures and nonnegative random variables, proving the Gibbs variational principle, and deriving the entropy inequality that drives the first concentration estimates.

# 1. Entropy as a Concentration Tool

This chapter introduces entropy as the bookkeeping device behind many modern concentration arguments. In the first concentration course, exponential moments were often estimated directly; here the same estimates are reorganised through relative entropy, variational formulae, and product-space decompositions. The main prerequisites are measure-theoretic probability, Radon--Nikodym derivatives, conditional distributions on standard Borel spaces, [Jensen's inequality](/theorems/1977), and the elementary moment-generating-function method from concentration inequalities. The main question is how information about a change of measure controls expectations, and how independence lets entropy split coordinate by coordinate. The same objects also appear outside concentration theory. Relative entropy is the rate function in Sanov-type large deviations, the loss functional in maximum-entropy statistical inference, and the quantity transported by logarithmic Sobolev and transportation-cost inequalities. The course uses these broader links selectively: large-deviation intuition explains why tilted measures are canonical, while transport and log-Sobolev methods later convert entropy estimates into geometric tail bounds.

## Relative Entropy and the Gibbs Formula

How much does a probability measure change when we tilt it toward large values of a function? Relative entropy measures the cost of replacing a reference law by another law, and the Gibbs variational formula says that this cost is exactly dual to the logarithmic moment generating function.
We begin with the measure-theoretic version, since later chapters repeatedly pass between probability measures and densities. [definition: Relative Entropy] Let $(E, \mathcal E)$ be a measurable space, and let $\mathcal P(E)$ denote the set of probability measures on $(E,\mathcal E)$. Relative entropy is the functional \begin{align*} H:\mathcal P(E)\times \mathcal P(E)\to [0,\infty]. \end{align*} For $\nu,\mu\in\mathcal P(E)$ with $\nu\ll\mu$, the relative entropy of $\nu$ with respect to $\mu$ is \begin{align*} H(\nu\mid \mu) = \int_E \log\left(\frac{d\nu}{d\mu}\right)\,d\nu. \end{align*} If $\nu\not\ll\mu$, set $H(\nu\mid\mu)=+\infty$. [/definition] The density $d\nu/d\mu$ records how strongly the new measure reweights the old one. Entropy is therefore finite only when the new measure does not assign mass to sets that were impossible under the reference measure. [example: Bernoulli Relative Entropy] Let $\nu=\operatorname{Ber}(p)$ and $\mu=\operatorname{Ber}(q)$ on $\{0,1\}$, where $p,q\in(0,1)$. Since $\mu(\{0\})=1-q>0$ and $\mu(\{1\})=q>0$, we have $\nu\ll\mu$. The Radon--Nikodym derivative is determined pointwise by \begin{align*} \frac{d\nu}{d\mu}(1)=\frac{\nu(\{1\})}{\mu(\{1\})}=\frac{p}{q}. \end{align*} Similarly, \begin{align*} \frac{d\nu}{d\mu}(0)=\frac{\nu(\{0\})}{\mu(\{0\})}=\frac{1-p}{1-q}. \end{align*} Using the definition of relative entropy as an integral with respect to $\nu$, the integral over the two-point space is \begin{align*} H(\nu\mid\mu)=\nu(\{1\})\log\left(\frac{d\nu}{d\mu}(1)\right)+\nu(\{0\})\log\left(\frac{d\nu}{d\mu}(0)\right). \end{align*} Substituting the two masses and the two derivative values gives \begin{align*} H(\nu\mid\mu)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}. \end{align*} If the measures are reversed, the same calculation gives \begin{align*} H(\mu\mid\nu)=q\log\frac{q}{p}+(1-q)\log\frac{1-q}{1-p}. \end{align*} The weights and logarithmic ratios have changed, so Bernoulli relative entropy is not symmetric in general. 
The displayed function of $p$ with reference parameter $q$ is the rate function that later appears for empirical means of independent Bernoulli coordinates. [/example] Relative entropy compares two probability measures, but concentration proofs often start with a nonnegative weight $f$ rather than a named second measure. We therefore need a version of entropy that treats $f$ as an unnormalised density and still records the convex cost of reweighting the reference law. [definition: Entropy of a Nonnegative Random Variable] Let $(E,\mathcal E,\mu)$ be a probability space. Entropy under $\mu$ is the functional \begin{align*} \operatorname{Ent}_\mu:\mathcal D_\mu\to[0,\infty), \end{align*} where $\mathcal D_\mu$ is the set of [measurable functions](/page/Measurable%20Functions) $f:E\to[0,\infty]$ satisfying \begin{align*} \int_E f\,d\mu<\infty, \qquad \int_E f\log f\,d\mu<\infty. \end{align*} For such $f$, the entropy of $f$ under $\mu$ is \begin{align*} \operatorname{Ent}_\mu(f)=\int_E f\log f\,d\mu-\left(\int_E f\,d\mu\right)\log\left(\int_E f\,d\mu\right). \end{align*} [/definition] When \begin{align*} \int_E f\,d\mu=1, \end{align*} the function $f$ is the density of a probability measure $\nu$ with respect to $\mu$, and $\operatorname{Ent}_\mu(f)=H(\nu\mid\mu)$. For general $f$, this is the homogeneous version of the same expression; multiplying $f$ by a constant multiplies the entropy by that constant. The next task is to connect entropy with exponential moments. This connection is what makes entropy useful for concentration: moment generating functions are easy to manipulate under independence, while relative entropy tracks how much a tilted law has changed. The variational formula below gives the exact dictionary between the two quantities. [quotetheorem:6723] [citeproof:6723] The theorem converts moment generating functions into variational problems. 
The exponential integrability assumption is what makes the tilted law a probability measure and keeps the left-hand side finite; if it fails, the logarithmic moment generating function can be infinite and no normalised exponential tilt exists. Absolute continuity is enforced through the entropy term: a measure that charges a $\mu$-null set pays infinite cost, so it cannot improve the supremum under the stated convention. The formula also does not claim that arbitrary expectations of the form \begin{align*} \int_E g\,d\nu \end{align*} are finite; measures for which the positive part is not integrable are excluded from contributing by the extended-value convention. In concentration arguments we usually use the result in the reverse direction: a rough bound on relative entropy gives a useful bound on an expectation. [example: Failures Without Gibbs Hypotheses] Let $E=\mathbb N$ and let $\mu(\{k\})=c k^{-2}$, where $c>0$ is chosen so that \begin{align*} 1=\sum_{k=1}^{\infty}\mu(\{k\})=c\sum_{k=1}^{\infty}k^{-2}. \end{align*} For $g(k)=3\log k$, we have $e^{g(k)}=e^{3\log k}=k^3$. Therefore \begin{align*} \int_E e^g\,d\mu=\sum_{k=1}^{\infty}e^{g(k)}\mu(\{k\})=\sum_{k=1}^{\infty}k^3 c k^{-2}=c\sum_{k=1}^{\infty}k=\infty. \end{align*} Thus the normalising constant for the exponential tilt is infinite, so no probability law can have density $e^g/\int_E e^g\,d\mu$ with respect to $\mu$. For each fixed $k\in\mathbb N$, take $\nu=\delta_k$. Since $\mu(\{j\})>0$ for every $j\in\mathbb N$, we have $\delta_k\ll\mu$. The Radon--Nikodym derivative equals $1/\mu(\{k\})$ at $k$ and equals $0$ away from $k$, so the entropy integral is evaluated only at the point to which $\delta_k$ assigns mass one: \begin{align*} D(\delta_k\|\mu)=\log\left(\frac{1}{\mu(\{k\})}\right). \end{align*} Using $\mu(\{k\})=c k^{-2}$ gives \begin{align*} D(\delta_k\|\mu)=\log\left(\frac{1}{c k^{-2}}\right)=\log\left(\frac{k^2}{c}\right)=2\log k-\log c. 
\end{align*} Also, \begin{align*} \int_E g\,d\delta_k=g(k)=3\log k. \end{align*} Hence \begin{align*} \int_E g\,d\delta_k-D(\delta_k\|\mu)=3\log k-\left(2\log k-\log c\right)=\log k+\log c. \end{align*} Since $\log k+\log c\to\infty$ as $k\to\infty$, the variational supremum is $+\infty$, matching the infinite exponential moment. Absolute continuity is also necessary for a finite entropy cost. If $E=\{0,1\}$, $\mu=\delta_0$, $\nu=\delta_1$, and $g(1)=M$, then \begin{align*} \int_E g\,d\nu=g(1)=M. \end{align*} But $\mu(\{1\})=0$ while $\nu(\{1\})=1$, so $\nu\not\ll\mu$. By the definition of relative entropy, \begin{align*} H(\nu\mid\mu)=+\infty. \end{align*} The entropy penalty is what prevents mass from being moved onto impossible points at finite cost. [/example] With the hypotheses in place, exponential tilting has a concrete probabilistic meaning rather than only a variational one. The Gaussian case is the basic model: the tilt shifts the mean, and entropy records the cost of that shift. [example: Gaussian Exponential Tilt] Let $X\sim\mathcal N(0,1)$ under $\mu$, and fix $\lambda\in\mathbb R$. With respect to Lebesgue measure, $\mu$ has density \begin{align*} \varphi_0(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}. \end{align*} The normalising constant of the exponential tilt is \begin{align*} \mathbb E_\mu[e^{\lambda X}]=\int_{\mathbb R}e^{\lambda x}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx. \end{align*} Combining the exponents gives \begin{align*} \mathbb E_\mu[e^{\lambda X}]=\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}\exp\left(\lambda x-\frac{x^2}{2}\right)\,dx. \end{align*} Completing the square, \begin{align*} \lambda x-\frac{x^2}{2}=-\frac{x^2-2\lambda x}{2}=-\frac{(x-\lambda)^2-\lambda^2}{2}=-\frac{(x-\lambda)^2}{2}+\frac{\lambda^2}{2}. \end{align*} Therefore \begin{align*} \mathbb E_\mu[e^{\lambda X}]=e^{\lambda^2/2}\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}e^{-(x-\lambda)^2/2}\,dx. 
\end{align*} The translated Gaussian density integrates to $1$, so \begin{align*} \mathbb E_\mu[e^{\lambda X}]=e^{\lambda^2/2}. \end{align*} Hence \begin{align*} \log\mathbb E_\mu[e^{\lambda X}]=\frac{\lambda^2}{2}. \end{align*} Let $\nu_\lambda$ be the tilted law. Its density with respect to Lebesgue measure is \begin{align*} e^{\lambda x-\lambda^2/2}\varphi_0(x). \end{align*} Substituting $\varphi_0$ gives \begin{align*} e^{\lambda x-\lambda^2/2}\varphi_0(x)=\frac{1}{\sqrt{2\pi}}\exp\left(\lambda x-\frac{\lambda^2}{2}-\frac{x^2}{2}\right). \end{align*} The exponent satisfies \begin{align*} \lambda x-\frac{\lambda^2}{2}-\frac{x^2}{2}=-\frac{x^2-2\lambda x+\lambda^2}{2}=-\frac{(x-\lambda)^2}{2}. \end{align*} Thus the tilted density is \begin{align*} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-\lambda)^2}{2}\right), \end{align*} so $\nu_\lambda=\mathcal N(\lambda,1)$. The density ratio relative to $\mu$ is \begin{align*} \frac{d\nu_\lambda}{d\mu}(x)=e^{\lambda x-\lambda^2/2}. \end{align*} Using the definition of relative entropy, \begin{align*} H(\nu_\lambda\mid\mu)=\int_{\mathbb R}\log\left(e^{\lambda x-\lambda^2/2}\right)\,d\nu_\lambda(x). \end{align*} Since $\log(e^a)=a$, \begin{align*} H(\nu_\lambda\mid\mu)=\int_{\mathbb R}\left(\lambda x-\frac{\lambda^2}{2}\right)\,d\nu_\lambda(x). \end{align*} By linearity of the integral, \begin{align*} H(\nu_\lambda\mid\mu)=\lambda\int_{\mathbb R}x\,d\nu_\lambda(x)-\frac{\lambda^2}{2}\int_{\mathbb R}1\,d\nu_\lambda(x). \end{align*} Because $\nu_\lambda=\mathcal N(\lambda,1)$, its mean is $\lambda$, and because $\nu_\lambda$ is a probability measure, $\int_{\mathbb R}1\,d\nu_\lambda(x)=1$. Hence \begin{align*} H(\nu_\lambda\mid\mu)=\lambda^2-\frac{\lambda^2}{2}=\frac{\lambda^2}{2}. \end{align*} The exponential tilt by $e^{\lambda X}$ shifts the Gaussian mean from $0$ to $\lambda$, keeps the variance equal to $1$, and pays exactly the entropy cost $\lambda^2/2$. 
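The two identities in this example, $\log\mathbb E_\mu[e^{\lambda X}]=\lambda^2/2$ and $H(\nu_\lambda\mid\mu)=\lambda^2/2$, can be sanity-checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `phi0`, `trapezoid`, and `gaussian_tilt_check`, the integration window, and the grid size are illustrative assumptions and not part of the course material.

```python
import math


def phi0(x):
    """Standard Gaussian density with respect to Lebesgue measure."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def trapezoid(f, lo, hi, n):
    """Composite trapezoid rule for f on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


def gaussian_tilt_check(lam, half_width=12.0, n=20000):
    """Return (log-MGF, tilted mean, relative entropy) computed by quadrature.

    The window is an assumption: it covers both the reference mean 0 and
    the tilted mean lam with a wide margin, so the truncated tails are negligible.
    """
    lo = min(0.0, lam) - half_width
    hi = max(0.0, lam) + half_width

    # Normalising constant E_mu[exp(lam X)] of the exponential tilt.
    z = trapezoid(lambda x: math.exp(lam * x) * phi0(x), lo, hi, n)
    log_mgf = math.log(z)

    def tilted_density(x):
        # Lebesgue density of the tilted law nu_lam.
        return math.exp(lam * x) * phi0(x) / z

    mean = trapezoid(lambda x: x * tilted_density(x), lo, hi, n)
    # H(nu_lam | mu) is the nu_lam-integral of log(d nu_lam / d mu) = lam x - log z.
    entropy = trapezoid(lambda x: (lam * x - log_mgf) * tilted_density(x), lo, hi, n)
    return log_mgf, mean, entropy
```

For a given `lam`, the three returned values should agree with $\lambda^2/2$, $\lambda$, and $\lambda^2/2$ up to quadrature error, matching the closed-form computation above.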
[/example]

## Entropy Inequality for Tilted Expectations

Suppose a random variable is large on a rare event, and we want to estimate its expectation under a probability measure that has been biased toward that event. The entropy inequality is the basic tool: it separates the price of tilting the measure from the exponential integrability of the variable being tested. The following result is a direct corollary of Gibbs' formula, but its operational form is important enough to name separately.

[quotetheorem:6729] [citeproof:6729]

The parameter $\alpha$ must be positive because the proof divides by $\alpha$ and uses the moment generating function of $\alpha g$ rather than that of $-\alpha g$. Finiteness of \begin{align*} \int_E e^{\alpha g}\,d\mu \end{align*} is the condition that the reference law has enough upper-tail integrability to control the tilted expectation. If $D(\nu\|\mu)=+\infty$, or if the exponential moment is extremely large, the bound is still formally correct but may carry no usable information. Later, Herbst's argument chooses $\alpha$ through a differential inequality; in this chapter the point is to see how the entropy price and the exponential-moment term balance.

[example: Failures Without Entropy Inequality Hypotheses] The sign of $\alpha$ cannot be reversed. Let $E=\mathbb R$, let $\mu=\mathcal N(0,1)$, let $\nu=\mathcal N(m,1)$ with $m>0$, and set $g(x)=x$. The density ratio is \begin{align*} \frac{d\nu}{d\mu}(x)=\frac{(2\pi)^{-1/2}e^{-(x-m)^2/2}}{(2\pi)^{-1/2}e^{-x^2/2}}. \end{align*} Cancelling the common factor and expanding the exponent gives \begin{align*} -\frac{(x-m)^2}{2}+\frac{x^2}{2}=-\frac{x^2-2mx+m^2}{2}+\frac{x^2}{2}=mx-\frac{m^2}{2}. \end{align*} Thus \begin{align*} \frac{d\nu}{d\mu}(x)=\exp\left(mx-\frac{m^2}{2}\right). \end{align*} Using the definition of relative entropy, \begin{align*} D(\nu\|\mu)=\int_{\mathbb R}\left(mx-\frac{m^2}{2}\right)\,d\nu(x).
\end{align*} By linearity of the integral and because $\nu=\mathcal N(m,1)$ is a probability measure with mean $m$, \begin{align*} D(\nu\|\mu)=m\int_{\mathbb R}x\,d\nu(x)-\frac{m^2}{2}\int_{\mathbb R}1\,d\nu(x)=m^2-\frac{m^2}{2}=\frac{m^2}{2}. \end{align*} Also, \begin{align*} \int_{\mathbb R}g\,d\nu=\int_{\mathbb R}x\,d\nu(x)=m. \end{align*} For the standard Gaussian moment generating function, \begin{align*} \int_{\mathbb R}e^{\alpha x}\,d\mu(x)=\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}\exp\left(\alpha x-\frac{x^2}{2}\right)\,dx. \end{align*} Completing the square, \begin{align*} \alpha x-\frac{x^2}{2}=-\frac{x^2-2\alpha x}{2}=-\frac{(x-\alpha)^2-\alpha^2}{2}=-\frac{(x-\alpha)^2}{2}+\frac{\alpha^2}{2}. \end{align*} Hence \begin{align*} \int_{\mathbb R}e^{\alpha x}\,d\mu(x)=e^{\alpha^2/2}\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}e^{-(x-\alpha)^2/2}\,dx=e^{\alpha^2/2}. \end{align*} Therefore \begin{align*} \log\int_{\mathbb R}e^{\alpha g}\,d\mu=\frac{\alpha^2}{2}. \end{align*} If one tried to use the entropy inequality with $\alpha<0$, it would assert \begin{align*} m\le \frac{1}{\alpha}\left(\frac{m^2}{2}+\frac{\alpha^2}{2}\right). \end{align*} The numerator $(m^2+\alpha^2)/2$ is positive and $\alpha<0$, so the right-hand side is negative, while the left-hand side is $m>0$. Thus the asserted upper bound fails. The exponential integrability assumption cannot be dropped. Let $E=\mathbb N$, let $\mu(\{k\})=c k^{-2}$, where $c>0$ is chosen so that $c\sum_{k=1}^{\infty}k^{-2}=1$, set $g(k)=2\log k$, and take $\nu=\delta_m$ for a fixed $m\in\mathbb N$. Since $\mu(\{m\})=c m^{-2}>0$, we have $\delta_m\ll\mu$. The Radon--Nikodym derivative equals $1/\mu(\{m\})$ at $m$ and $0$ away from $m$, so the entropy integral is evaluated at the single point $m$: \begin{align*} H(\delta_m\mid\mu)=\log\left(\frac{1}{\mu(\{m\})}\right)=\log\left(\frac{1}{c m^{-2}}\right)=\log\left(\frac{m^2}{c}\right)<\infty. 
\end{align*} However, \begin{align*} \int_E e^g\,d\mu=\sum_{k=1}^{\infty}e^{2\log k}\,c k^{-2}. \end{align*} Since $e^{2\log k}=k^2$, this becomes \begin{align*} \int_E e^g\,d\mu=\sum_{k=1}^{\infty}k^2 c k^{-2}=c\sum_{k=1}^{\infty}1=\infty. \end{align*} Thus this finite-entropy change of measure has no finite exponential-moment term available in the inequality. The integrability condition under $\nu$ is also necessary for the left-hand side to be a finite quantity. Let $\mu=\nu$ be the probability measure on $\mathbb N$ with $\mu(\{k\})=c k^{-2}$, and set $g(k)=k$. Because the two measures are equal, $d\nu/d\mu=1$ $\mu$-a.e., so \begin{align*} D(\nu\|\mu)=\int_E \log 1\,d\nu=0. \end{align*} Since $g(k)=k\ge0$, we have $g^+=g$, and therefore \begin{align*} \int_E g^+\,d\nu=\sum_{k=1}^{\infty}k\,c k^{-2}=c\sum_{k=1}^{\infty}\frac{1}{k}=\infty. \end{align*} In this case the entropy cost is zero, but the expectation $\int_E g\,d\nu$ is infinite, so entropy alone cannot produce a finite upper bound without the stated positive-part integrability assumption. [/example] The preceding failures explain why the theorem is formulated with an upper-tail exponential moment and a finite positive-part expectation under the tilted law. Under those conditions, conditioning on a rare event becomes a controlled finite-entropy change of measure. [example: Controlling a Rare-Event Expectation] Let $A\in\mathcal E$ with $\mu(A)>0$, and let $\nu=\mu(\cdot\mid A)$. For every $B\in\mathcal E$, \begin{align*} \nu(B)=\frac{\mu(B\cap A)}{\mu(A)}=\int_B \frac{\mathbf 1_A}{\mu(A)}\,d\mu. \end{align*} Hence \begin{align*} \frac{d\nu}{d\mu}=\frac{\mathbf 1_A}{\mu(A)}. \end{align*} Since $\nu(A)=1$, the definition of relative entropy gives \begin{align*} D(\nu\|\mu)=\int_E \log\left(\frac{d\nu}{d\mu}\right)\,d\nu=\int_A \log\left(\frac{1}{\mu(A)}\right)\,d\nu. 
\end{align*} The integrand is constant on $A$, so \begin{align*} D(\nu\|\mu)=\log\left(\frac{1}{\mu(A)}\right)\nu(A)=\log\left(\frac{1}{\mu(A)}\right). \end{align*} Assume that $g$ is sub-Gaussian under $\mu$, meaning that for every $\alpha>0$, \begin{align*} \log\int_E e^{\alpha g}\,d\mu\le \alpha\mathbb E_\mu[g]+\frac{\sigma^2\alpha^2}{2}. \end{align*} Applying the *Entropy Inequality* to the conditional law $\nu$ gives, for every $\alpha>0$ for which the expectation is finite, \begin{align*} \mathbb E[g\mid A]=\int_E g\,d\nu\le \frac{1}{\alpha}\left(D(\nu\|\mu)+\log\int_E e^{\alpha g}\,d\mu\right). \end{align*} Substituting the entropy of the conditional law gives \begin{align*} \mathbb E[g\mid A]\le \frac{1}{\alpha}\left(\log\frac{1}{\mu(A)}+\log\int_E e^{\alpha g}\,d\mu\right). \end{align*} Using the sub-Gaussian bound then yields \begin{align*} \mathbb E[g\mid A]\le \frac{1}{\alpha}\left(\log\frac{1}{\mu(A)}+\alpha\mathbb E_\mu[g]+\frac{\sigma^2\alpha^2}{2}\right). \end{align*} Dividing each term by $\alpha$ gives \begin{align*} \mathbb E[g\mid A]\le \mathbb E_\mu[g]+\frac{\log(1/\mu(A))}{\alpha}+\frac{\sigma^2\alpha}{2}. \end{align*} Set \begin{align*} L=\log\frac{1}{\mu(A)}. \end{align*} If $L>0$ and $\sigma>0$, choose \begin{align*} \alpha=\frac{\sqrt{2L}}{\sigma}. \end{align*} Then \begin{align*} \frac{L}{\alpha}=L\cdot\frac{\sigma}{\sqrt{2L}}=\frac{\sigma\sqrt L}{\sqrt 2}. \end{align*} Also, \begin{align*} \frac{\sigma^2\alpha}{2}=\frac{\sigma^2}{2}\cdot\frac{\sqrt{2L}}{\sigma}=\frac{\sigma\sqrt{2L}}{2}=\frac{\sigma\sqrt L}{\sqrt 2}. \end{align*} Therefore \begin{align*} \mathbb E[g\mid A]-\mathbb E_\mu[g]\le \frac{\sigma\sqrt L}{\sqrt 2}+\frac{\sigma\sqrt L}{\sqrt 2}=\sigma\sqrt{2L}. \end{align*} Returning to $L=\log(1/\mu(A))$, \begin{align*} \mathbb E[g\mid A]-\mathbb E_\mu[g]\le \sigma\sqrt{2\log\frac{1}{\mu(A)}}. \end{align*} If $L=0$, then $\mu(A)=1$ and the conditional law equals $\mu$, so the excess is $0$. 
If $\sigma=0$, the bound above with arbitrary $\alpha>0$ becomes $\mathbb E[g\mid A]-\mathbb E_\mu[g]\le L/\alpha$, and letting $\alpha\to\infty$ gives excess at most $0$. Thus conditioning on a rare event costs the square root of its entropy price $\log(1/\mu(A))$, measured in the sub-Gaussian scale $\sigma$. [/example]

The same inequality can be written with an arbitrary density. This form is useful when the tilted measure comes from an algorithm, a conditioning argument, or a martingale step rather than from an explicitly named probability measure.

[remark: Density Form] If $h\ge0$ and \begin{align*} \int_E h\,d\mu=1, \end{align*} then for every $\alpha>0$, \begin{align*} \int_E gh\,d\mu\le \frac{1}{\alpha}\left(\operatorname{Ent}_\mu(h)+\log\int_E e^{\alpha g}\,d\mu\right). \end{align*} This is the entropy inequality applied to the probability measure $d\nu=h\,d\mu$. [/remark]

## Conditional Entropy and the Chain Rule

Concentration on product spaces depends on revealing coordinates one at a time. The relevant question is how relative entropy changes when we separate the first coordinate from the remaining coordinates, or more generally when we condition on part of the information. To state this without choosing coordinates, we need the conditional version of relative entropy.

[definition: Conditional Relative Entropy] Let $\nu$ and $\mu$ be probability measures on $(E\times F,\mathcal E\otimes\mathcal F)$. Let $X:E\times F\to E$ and $Y:E\times F\to F$ be the coordinate maps. Suppose regular conditional distributions $\nu_{X\mid Y=y}$ and $\mu_{X\mid Y=y}$ are chosen for $y\in F$, and write $\nu_Y$ for the $Y$-marginal of $\nu$. Conditional relative entropy is the functional \begin{align*} D(\,\cdot\mid \cdot\,):\{(\nu,\mu)\text{ with the stated regular conditional distributions}\}\to[0,\infty].
\end{align*} For such $(\nu,\mu)$, the conditional relative entropy of the first coordinate given the second coordinate is \begin{align*} D_\nu(X\mid Y;\mu)=\int_F D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})\,d\nu_Y(y). \end{align*} [/definition] This definition measures the average entropy remaining after the value of $Y$ has been revealed. In the product reference case, the reference conditional law often does not depend on $y$, which makes the expression especially concrete. [quotetheorem:6731] [citeproof:6731] The chain rule is the entropy analogue of the tower property for conditional expectation. Regular conditional distributions are needed because the second term compares conditional laws pointwise in the observed value of $Y$; on general measurable spaces such kernels need not be available without additional assumptions. The absolute-continuity alternatives identify the two ways entropy can become infinite: either the marginal of $Y$ already charges a reference-null set, or the marginal is acceptable but some conditional law of $X$ charges a conditional reference-null set. The identity decomposes an already specified change of measure; it does not by itself produce independence or bound any of the conditional terms. Its value for concentration is that entropy can be accumulated over successive coordinate exposures once those conditional terms are estimated. [example: Infinite Terms in the Chain Rule] Let $E=F=\{0,1\}$, with $X$ and $Y$ the two coordinate maps. First take \begin{align*} \mu=\delta_{(0,0)}, \qquad \nu=\delta_{(0,1)}. \end{align*} The $Y$-marginals are found by evaluating the mass of the fibres of $Y$: \begin{align*} \mu_Y(\{0\})=\mu(E\times\{0\})=\mu(\{(0,0),(1,0)\})=1, \end{align*} while \begin{align*} \nu_Y(\{1\})=\nu(E\times\{1\})=\nu(\{(0,1),(1,1)\})=1. \end{align*} Thus $\mu_Y=\delta_0$ and $\nu_Y=\delta_1$. 
Since \begin{align*} \mu_Y(\{1\})=0 \qquad\text{but}\qquad \nu_Y(\{1\})=1, \end{align*} we have $\nu_Y\not\ll\mu_Y$, so \begin{align*} D(\nu_Y\|\mu_Y)=+\infty. \end{align*} The full measures fail absolute continuity for the same reason: \begin{align*} \mu(\{(0,1)\})=0 \qquad\text{but}\qquad \nu(\{(0,1)\})=1. \end{align*} Therefore $\nu\not\ll\mu$, and the full relative entropy is \begin{align*} D(\nu\|\mu)=+\infty. \end{align*} The conditional term can be infinite even when the marginals match. Let $\mu$ put mass $1/2$ on each of $(0,0)$ and $(0,1)$, and let $\nu$ put mass $1/2$ on each of $(0,0)$ and $(1,1)$. Then \begin{align*} \mu_Y(\{0\})=\mu(\{(0,0),(1,0)\})=\frac12, \qquad \mu_Y(\{1\})=\mu(\{(0,1),(1,1)\})=\frac12, \end{align*} and \begin{align*} \nu_Y(\{0\})=\nu(\{(0,0),(1,0)\})=\frac12, \qquad \nu_Y(\{1\})=\nu(\{(0,1),(1,1)\})=\frac12. \end{align*} Hence $\nu_Y=\mu_Y$, so $d\nu_Y/d\mu_Y=1$ at both points and \begin{align*} D(\nu_Y\|\mu_Y) =\sum_{y\in\{0,1\}}\nu_Y(\{y\})\log\left(\frac{d\nu_Y}{d\mu_Y}(y)\right) =\frac12\log 1+\frac12\log 1 =0. \end{align*} Now condition on the event $Y=1$. Under $\mu$, the only point with $Y=1$ and positive mass is $(0,1)$, so \begin{align*} \mu_{X\mid Y=1}=\delta_0. \end{align*} Under $\nu$, the only point with $Y=1$ and positive mass is $(1,1)$, so \begin{align*} \nu_{X\mid Y=1}=\delta_1. \end{align*} Since \begin{align*} \mu_{X\mid Y=1}(\{1\})=0 \qquad\text{but}\qquad \nu_{X\mid Y=1}(\{1\})=1, \end{align*} we have $\nu_{X\mid Y=1}\not\ll\mu_{X\mid Y=1}$, and therefore \begin{align*} D(\nu_{X\mid Y=1}\|\mu_{X\mid Y=1})=+\infty. \end{align*} Because $\nu_Y(\{1\})=1/2>0$, the averaged conditional entropy contains this infinite contribution: \begin{align*} \int_{\{0,1\}}D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})\,d\nu_Y(y)=+\infty. \end{align*} The full entropy is also infinite, since \begin{align*} \mu(\{(1,1)\})=0 \qquad\text{but}\qquad \nu(\{(1,1)\})=\frac12, \end{align*} so $\nu\not\ll\mu$. 
Thus the two infinite alternatives in the chain rule correspond exactly to failure of absolute continuity at the marginal level or at a conditional level of positive $\nu_Y$-mass. [/example] When the absolute-continuity requirements hold at each stage, the chain rule turns independent coordinate changes into an additive computation. The next example records the finite-space calculation that later product-space arguments abstract. [example: Product Bernoulli Entropy Decomposition] Let $\mu=\operatorname{Ber}(q_1)\otimes\cdots\otimes\operatorname{Ber}(q_n)$ and $\nu=\operatorname{Ber}(p_1)\otimes\cdots\otimes\operatorname{Ber}(p_n)$ on $\{0,1\}^n$, with $p_i,q_i\in(0,1)$ for every $i$. Since every point of $\{0,1\}^n$ has positive $\mu$-mass and positive $\nu$-mass, we have $\nu\ll\mu$. Reveal the coordinates in the order $X_1,\dots,X_n$. By repeated application of the *[Chain Rule for Relative Entropy](/theorems/6731)*, the full entropy is the sum of the conditional one-coordinate entropy costs: \begin{align*} D(\nu\|\mu)=\sum_{i=1}^n \int_{\{0,1\}^{i-1}} D\left(\nu_{X_i\mid X_1=x_1,\dots,X_{i-1}=x_{i-1}}\middle\|\mu_{X_i\mid X_1=x_1,\dots,X_{i-1}=x_{i-1}}\right)\,d\nu_{1:i-1}(x_1,\dots,x_{i-1}). \end{align*} Because both measures are products, conditioning on the previously revealed coordinates does not change the $i$th coordinate law: \begin{align*} \nu_{X_i\mid X_1=x_1,\dots,X_{i-1}=x_{i-1}}=\operatorname{Ber}(p_i). \end{align*} Likewise, \begin{align*} \mu_{X_i\mid X_1=x_1,\dots,X_{i-1}=x_{i-1}}=\operatorname{Ber}(q_i). \end{align*} Therefore the integrand is constant in $(x_1,\dots,x_{i-1})$. Since $\nu_{1:i-1}$ is a probability measure, integrating this constant gives \begin{align*} D(\nu\|\mu)=\sum_{i=1}^n D(\operatorname{Ber}(p_i)\|\operatorname{Ber}(q_i)). 
\end{align*} For a fixed coordinate $i$, the Radon--Nikodym derivative of $\operatorname{Ber}(p_i)$ with respect to $\operatorname{Ber}(q_i)$ at $1$ is \begin{align*} \frac{d\operatorname{Ber}(p_i)}{d\operatorname{Ber}(q_i)}(1)=\frac{p_i}{q_i}. \end{align*} At $0$ it is \begin{align*} \frac{d\operatorname{Ber}(p_i)}{d\operatorname{Ber}(q_i)}(0)=\frac{1-p_i}{1-q_i}. \end{align*} Using the definition of relative entropy on the two-point space, \begin{align*} D(\operatorname{Ber}(p_i)\|\operatorname{Ber}(q_i))=p_i\log\frac{p_i}{q_i}+(1-p_i)\log\frac{1-p_i}{1-q_i}. \end{align*} Substituting this into the chain-rule sum gives \begin{align*} D(\nu\|\mu)=\sum_{i=1}^n\left[p_i\log\frac{p_i}{q_i}+(1-p_i)\log\frac{1-p_i}{1-q_i}\right]. \end{align*} Thus independent coordinate changes pay entropy one coordinate at a time, with no interaction term between distinct coordinates. [/example]

## Tensorization of Entropy

The final question of the chapter is how the entropy of a function on a product space can be controlled by entropies in single coordinates. This is the step that turns one-dimensional inequalities into dimension-free concentration statements.

Let $\mu=\mu_1\otimes\cdots\otimes\mu_n$ on $E=E_1\times\cdots\times E_n$, and write \begin{align*} E_{-i}=\prod_{j\ne i}E_j, \qquad \mu_{-i}=\bigotimes_{j\ne i}\mu_j. \end{align*} For $x=(x_1,\dots,x_n)\in E$, let $x_{-i}\in E_{-i}$ denote all coordinates of $x$ except $x_i$. For each $i$, the single-coordinate entropy operator is the functional \begin{align*} \operatorname{Ent}_{\mu_i,i}: \mathcal A_i\to L^1(E_{-i},\mu_{-i};[0,\infty]). \end{align*} The class $\mathcal A_i$ consists of measurable functions $f:E\to[0,\infty)$ such that, for $\mu_{-i}$-a.e. $x_{-i}$, the slice $z_i\mapsto f(z_i,x_{-i})$ has finite $\mu_i$-entropy and the resulting slice-entropy function is $\mu_{-i}$-integrable.
Set \begin{align*} \operatorname{Ent}_{\mu_i,i}(f)(x_{-i}) =\int_{E_i} f(x)\log f(x)\,d\mu_i(x_i) -\left(\int_{E_i} f(x)\,d\mu_i(x_i)\right) \log\left(\int_{E_i} f(x)\,d\mu_i(x_i)\right), \end{align*} where the other coordinates are held fixed. With this notation, tensorization says that total entropy is no larger than the sum of conditional coordinate entropies. [quotetheorem:6725] [citeproof:6725] Tensorization is the first structural reason concentration can be dimension-free. The product assumption is essential because it makes the reference conditional law in each coordinate independent of the other coordinates; under dependence, conditional reference laws can change with the revealed variables and the same coordinatewise entropy sum need not dominate the full entropy. The inequality also does not assert equality in general, nor does it say that the coordinate terms are small; that requires a separate one-coordinate entropy estimate. If every coordinate satisfies such an estimate, the sum of coordinate contributions often collapses to a gradient, difference, or Lipschitz quantity for the whole function. [example: Dependent Reference Measure Breaks Tensorization] Let $E_1=E_2=\{0,1\}$, and let $\mu$ be the diagonal law \begin{align*} \mu(\{(0,0)\})=\frac12,\qquad \mu(\{(1,1)\})=\frac12. \end{align*} Both one-coordinate marginals are $\operatorname{Ber}(1/2)$, but the coordinates are fully dependent because $\mu(\{(0,0),(1,1)\})=1$. Define \begin{align*} f(0,0)=2,\qquad f(1,1)=0,\qquad f(0,1)=f(1,0)=1. \end{align*} Only the values on the diagonal affect $\operatorname{Ent}_\mu(f)$. Using $0\log 0=0$, \begin{align*} \int f\,d\mu=\frac12 f(0,0)+\frac12 f(1,1)=\frac12\cdot 2+\frac12\cdot 0=1. \end{align*} Also, \begin{align*} \int f\log f\,d\mu=\frac12 f(0,0)\log f(0,0)+\frac12 f(1,1)\log f(1,1)=\frac12\cdot 2\log 2+\frac12\cdot 0=\log 2. 
\end{align*} Therefore \begin{align*} \operatorname{Ent}_\mu(f)=\int f\log f\,d\mu-\left(\int f\,d\mu\right)\log\left(\int f\,d\mu\right)=\log 2-1\cdot\log 1=\log 2. \end{align*} Now incorrectly apply the product-space coordinate expression using the marginal measure $\operatorname{Ber}(1/2)$ in each coordinate. For the first coordinate, the slice at $x_2=0$ is $(2,1)$, and the slice at $x_2=1$ is $(1,0)$. The second coordinate has the same two slices: at $x_1=0$ it is $(2,1)$, and at $x_1=1$ it is $(1,0)$. For $h=(2,1)$ under $\operatorname{Ber}(1/2)$, \begin{align*} \int h\,d\operatorname{Ber}(1/2)=\frac12\cdot 2+\frac12\cdot 1=\frac32. \end{align*} Since $1\log 1=0$, \begin{align*} \int h\log h\,d\operatorname{Ber}(1/2)=\frac12\cdot 2\log 2+\frac12\cdot 1\log 1=\log 2. \end{align*} Thus \begin{align*} \operatorname{Ent}_{\operatorname{Ber}(1/2)}(2,1)=\log 2-\frac32\log\frac32. \end{align*} For $h=(1,0)$, \begin{align*} \int h\,d\operatorname{Ber}(1/2)=\frac12\cdot 1+\frac12\cdot 0=\frac12. \end{align*} Again using $0\log 0=0$ and $1\log 1=0$, \begin{align*} \int h\log h\,d\operatorname{Ber}(1/2)=\frac12\cdot 1\log 1+\frac12\cdot 0\log 0=0. \end{align*} Hence \begin{align*} \operatorname{Ent}_{\operatorname{Ber}(1/2)}(1,0)=0-\frac12\log\frac12=\frac12\log 2. \end{align*} Averaging the two slices in each coordinate and then summing the two coordinates gives \begin{align*} 2\left(\frac12\operatorname{Ent}_{\operatorname{Ber}(1/2)}(2,1)+\frac12\operatorname{Ent}_{\operatorname{Ber}(1/2)}(1,0)\right)=\operatorname{Ent}_{\operatorname{Ber}(1/2)}(2,1)+\operatorname{Ent}_{\operatorname{Ber}(1/2)}(1,0). \end{align*} Substituting the two computed slice entropies, \begin{align*} \operatorname{Ent}_{\operatorname{Ber}(1/2)}(2,1)+\operatorname{Ent}_{\operatorname{Ber}(1/2)}(1,0)=\log 2-\frac32\log\frac32+\frac12\log 2. 
\end{align*} To compare with the true entropy, subtract this quantity from $\log 2$: \begin{align*} \log 2-\left(\log 2-\frac32\log\frac32+\frac12\log 2\right)=\frac32\log\frac32-\frac12\log 2. \end{align*} Combining logarithms gives \begin{align*} \frac32\log\frac32-\frac12\log 2=\frac12\left(3\log\frac32-\log 2\right)=\frac12\log\left(\frac{(3/2)^3}{2}\right)=\frac12\log\frac{27}{16}. \end{align*} Since $27/16>1$, this difference is positive. Thus the marginal-coordinate entropy sum is strictly smaller than the true entropy $\operatorname{Ent}_\mu(f)=\log 2$. The failure comes from replacing the true conditional coordinate laws, which are point masses on the diagonal, by the marginal laws $\operatorname{Ber}(1/2)$. [/example] For genuine product measures, the obstruction in the previous example disappears and coordinate entropies add in the expected way. Factorised functions give the cleanest case, because no interaction term remains between the two coordinates. [example: Two-Coordinate Bernoulli Function] Let $\mu=\operatorname{Ber}(1/2)\otimes\operatorname{Ber}(1/2)$ and define $f(x_1,x_2)=e^{\lambda(x_1+x_2)}$ on $\{0,1\}^2$. Write $g(x)=e^{\lambda x}$ for $x\in\{0,1\}$, let $X\sim\operatorname{Ber}(1/2)$, and set \begin{align*} a=\mathbb E[g(X)]=\frac12 g(0)+\frac12 g(1)=\frac{1+e^\lambda}{2}. \end{align*} Since \begin{align*} f(x_1,x_2)=e^{\lambda x_1}e^{\lambda x_2}=g(x_1)g(x_2), \end{align*} if $X_1,X_2$ are independent $\operatorname{Ber}(1/2)$ variables, then independence gives \begin{align*} \mathbb E_\mu[f]=\mathbb E[g(X_1)g(X_2)]=\mathbb E[g(X_1)]\mathbb E[g(X_2)]=a^2. \end{align*} Also, \begin{align*} \log f(X_1,X_2)=\log(g(X_1)g(X_2))=\log g(X_1)+\log g(X_2). \end{align*} Therefore \begin{align*} \mathbb E_\mu[f\log f]=\mathbb E[g(X_1)g(X_2)\log g(X_1)]+\mathbb E[g(X_1)g(X_2)\log g(X_2)]. 
\end{align*} Using independence in each term, \begin{align*} \mathbb E[g(X_1)g(X_2)\log g(X_1)]=\mathbb E[g(X_1)\log g(X_1)]\mathbb E[g(X_2)]=a\,\mathbb E[g(X)\log g(X)]. \end{align*} Similarly, \begin{align*} \mathbb E[g(X_1)g(X_2)\log g(X_2)]=\mathbb E[g(X_1)]\mathbb E[g(X_2)\log g(X_2)]=a\,\mathbb E[g(X)\log g(X)]. \end{align*} Adding the two terms gives \begin{align*} \mathbb E_\mu[f\log f]=2a\,\mathbb E[g(X)\log g(X)]. \end{align*} Hence, by the definition of entropy of a nonnegative random variable, \begin{align*} \operatorname{Ent}_\mu(f)=2a\,\mathbb E[g(X)\log g(X)]-a^2\log(a^2). \end{align*} Since $\log(a^2)=2\log a$, this becomes \begin{align*} \operatorname{Ent}_\mu(f)=2a\,\mathbb E[g(X)\log g(X)]-2a^2\log a. \end{align*} Factoring out $2a$ gives \begin{align*} \operatorname{Ent}_\mu(f)=2a\left(\mathbb E[g(X)\log g(X)]-a\log a\right). \end{align*} Because $a=\mathbb E[g(X)]$, the expression in parentheses is exactly $\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g)$. Thus \begin{align*} \operatorname{Ent}_\mu(f)=2a\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g). \end{align*} Now compute the coordinate entropy terms appearing in tensorization. For fixed $x_2$, the first-coordinate slice is \begin{align*} x_1\mapsto f(x_1,x_2)=e^{\lambda x_2}g(x_1). \end{align*} For a constant $c>0$, the entropy of the scaled function $cg$ under $\operatorname{Ber}(1/2)$ is \begin{align*} \operatorname{Ent}_{\operatorname{Ber}(1/2)}(cg)=\mathbb E[cg(X)\log(cg(X))]-\mathbb E[cg(X)]\log\mathbb E[cg(X)]. \end{align*} Since $\log(cg(X))=\log c+\log g(X)$ and $\mathbb E[cg(X)]=ca$, this equals \begin{align*} c\,\mathbb E[g(X)\log g(X)]+ca\log c-ca\log(ca). \end{align*} Using $\log(ca)=\log c+\log a$, we get \begin{align*} \operatorname{Ent}_{\operatorname{Ber}(1/2)}(cg)=c\,\mathbb E[g(X)\log g(X)]+ca\log c-ca\log c-ca\log a. 
\end{align*} The two $ca\log c$ terms cancel, so \begin{align*} \operatorname{Ent}_{\operatorname{Ber}(1/2)}(cg)=c\left(\mathbb E[g(X)\log g(X)]-a\log a\right)=c\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g). \end{align*} Taking $c=e^{\lambda x_2}$ gives \begin{align*} \operatorname{Ent}_{\mu_1,1}(f)(x_2)=e^{\lambda x_2}\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g). \end{align*} Averaging over the second coordinate, \begin{align*} \int \operatorname{Ent}_{\mu_1,1}(f)(x_2)\,d\mu_2(x_2)=\left(\int e^{\lambda x_2}\,d\operatorname{Ber}(1/2)(x_2)\right)\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g). \end{align*} The integral in parentheses is $a$, hence \begin{align*} \int \operatorname{Ent}_{\mu_1,1}(f)(x_2)\,d\mu_2(x_2)=a\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g). \end{align*} By the same calculation with the two coordinates reversed, \begin{align*} \int \operatorname{Ent}_{\mu_2,2}(f)(x_1)\,d\mu_1(x_1)=a\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g). \end{align*} Therefore the sum of the two coordinate entropy terms is \begin{align*} a\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g)+a\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g)=2a\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g). \end{align*} Comparing with the earlier computation, \begin{align*} \operatorname{Ent}_\mu(f)=2a\,\operatorname{Ent}_{\operatorname{Ber}(1/2)}(g), \end{align*} so the tensorization inequality is attained as equality for this factorised two-coordinate Bernoulli function. This is the finite-space prototype of the additivity used in the Herbst argument: independent multiplicative factors contribute entropy coordinate by coordinate, with no interaction term. [/example] The chapter has built the entropy toolkit used throughout the course: variational control of exponential moments, entropy bounds for tilted expectations, chain rules under conditioning, and tensorization across independent coordinates. 
The next chapter applies these tools to the log-Laplace transform and derives concentration from differential inequalities. # 2. The Herbst Argument The previous chapter introduced entropy as a way of measuring the cost of changing measure. Building on Chapter 1's entropy variational formula, entropy inequality, and tensorization principle, this chapter also uses Markov's inequality, elementary convexity of exponential moments, and the logarithmic Sobolev inequality as the main functional input. We now turn that variational viewpoint into tail estimates by applying it to exponential tilts of a random variable. The central mechanism is Herbst's argument: an entropy bound for $e^{\beta F}$ becomes a differential inequality for the log-Laplace transform of $F$, and integrating that inequality yields sub-Gaussian concentration. ## Exponential Tilting and the Log-Laplace Transform How should we study the event that a random variable is unusually large? Instead of conditioning on a rare event directly, we bias the probability measure toward large values by weighting with $e^{\beta F}$. The parameter $\beta$ controls the strength of the bias, and differentiating in $\beta$ records how the tilted averages change. [definition: Log-Laplace Transform] Let $F$ be a real-valued random variable on a probability space $(\Omega, \mathcal F, \mathbb P)$, and define its effective exponential-moment domain by \begin{align*} D_F := \{\beta \in \mathbb R : \mathbb E[e^{\beta F}] < \infty\}. \end{align*} The log-Laplace transform of $F$ is the function $\Lambda_F: D_F \to \mathbb R$ given by \begin{align*} \Lambda_F(\beta) := \log \mathbb E[e^{\beta F}]. \end{align*} [/definition] The log-Laplace transform packages all exponential moment information into a convex function. To connect this analytic object with probability, we need the measure under which differentiation of $\Lambda_F$ becomes ordinary expectation. 
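As a numerical companion to the definition, the following Python sketch evaluates $\Lambda_F$ for a finite-valued random variable and checks midpoint convexity on a few pairs of parameters. The law $\operatorname{Ber}(0.3)$ and the helper names `log_laplace` and `midpoint_gap` are illustrative assumptions, not notation from the course.

```python
import math

def log_laplace(values, probs, beta):
    """Lambda_F(beta) = log E[exp(beta * F)] for a finite-valued F."""
    exponents = [beta * v for v in values]
    shift = max(exponents)  # log-sum-exp shift to avoid overflow for large |beta|
    total = sum(p * math.exp(e - shift) for p, e in zip(probs, exponents))
    return shift + math.log(total)

def midpoint_gap(values, probs, b1, b2):
    """Convexity gap (Lambda(b1) + Lambda(b2)) / 2 - Lambda((b1 + b2) / 2)."""
    average = 0.5 * (log_laplace(values, probs, b1) + log_laplace(values, probs, b2))
    return average - log_laplace(values, probs, 0.5 * (b1 + b2))

# Hypothetical example: F ~ Ber(0.3), so Lambda_F(beta) = log(0.7 + 0.3 * e^beta).
values, probs = [0.0, 1.0], [0.7, 0.3]

# A convex function has a nonnegative midpoint gap for every pair of parameters.
gaps = [midpoint_gap(values, probs, b1, b2) for b1, b2 in [(-2.0, 3.0), (0.0, 1.0), (-5.0, -1.0)]]
print(gaps)
```

For a finite-valued $F$ the domain $D_F$ is all of $\mathbb R$, so no integrability check is needed here; for heavy-tailed variables the finite sum would be replaced by an integral that may diverge.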
[definition: Exponential Tilt] Let $F$ be a real-valued random variable on a probability space $(\Omega,\mathcal F,\mathbb P)$, and let $\beta \in \mathbb R$ satisfy $0 < \mathbb E[e^{\beta F}] < \infty$. The exponential tilt of $\mathbb P$ by $F$ at inverse temperature $\beta$ is the probability measure $\mathbb P_\beta$ on $(\Omega,\mathcal F)$ defined by \begin{align*} \frac{d\mathbb P_\beta}{d\mathbb P} = \frac{e^{\beta F}}{\mathbb E[e^{\beta F}]}. \end{align*} [/definition] This tilted measure is the Gibbs measure associated with the observable $F$. Expectations under it are written \begin{align*} \mathbb E_\beta[G] := \frac{\mathbb E[G e^{\beta F}]}{\mathbb E[e^{\beta F}]}, \end{align*} so large values of $F$ receive more weight when $\beta > 0$. The next theorem records the exact calculus rule behind this notation: the slope of the log-Laplace transform is the tilted mean, and its curvature is the tilted variance. [quotetheorem:6736] [citeproof:6736] This theorem gives a differential interpretation of exponential moments, but its hypotheses are doing real work. If exponential moments exist only at a single point, the tilted expectations need not vary differentiably in $\beta$, so the differential method has no stable object to integrate. A concrete failure occurs for a Pareto-type non-negative random variable with $\mathbb P(F>x)=x^{-\alpha}$ for $x\ge 1$: $\mathbb E[e^{\beta F}]=\infty$ for every $\beta>0$, so $\Lambda_F$ has no right-neighbourhood of $0$ on which a positive tilt can be differentiated. At the boundary of an exponential-moment domain, the same problem can appear even when moments exist on one side; differentiating may create the factor $F e^{\beta F}$, whose expectation need not be finite at the endpoint. The open interval assumption provides nearby exponential moments, which dominate the extra factors of $F$ created by differentiation. 
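The calculus rule described before the theorem, slope equals tilted mean and curvature equals tilted variance, can be checked by finite differences when $F$ takes finitely many values. The sketch below is only a sanity check under that assumption; the three-point law, the step size `h`, and all function names are hypothetical choices.

```python
import math

def log_laplace(values, probs, beta):
    """Lambda_F(beta) = log E[exp(beta * F)] for a finite-valued F."""
    return math.log(sum(p * math.exp(beta * v) for v, p in zip(values, probs)))

def tilted_mean_and_variance(values, probs, beta):
    """Mean and variance of F under the exponential tilt P_beta."""
    weights = [p * math.exp(beta * v) for v, p in zip(values, probs)]
    normaliser = sum(weights)  # this is E[exp(beta * F)]
    tilted = [w / normaliser for w in weights]
    mean = sum(q * v for q, v in zip(tilted, values))
    variance = sum(q * (v - mean) ** 2 for q, v in zip(tilted, values))
    return mean, variance

# Hypothetical three-point law for F; any finite distribution would do.
values = [-1.0, 0.0, 2.0]
probs = [0.2, 0.5, 0.3]
beta, h = 0.4, 1e-4

# Central finite differences for the first and second derivatives of Lambda_F.
lam_minus = log_laplace(values, probs, beta - h)
lam_mid = log_laplace(values, probs, beta)
lam_plus = log_laplace(values, probs, beta + h)
slope = (lam_plus - lam_minus) / (2 * h)
curvature = (lam_plus - 2 * lam_mid + lam_minus) / h ** 2

tilted_mean, tilted_variance = tilted_mean_and_variance(values, probs, beta)
print(slope, tilted_mean)
print(curvature, tilted_variance)
```

Because every exponential moment of a finite-valued variable is finite, the open-interval hypothesis holds automatically in this toy setting, which is why the finite differences behave well.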
Convexity is the analytic sign that tilting increases the sensitivity of averages; concentration will come from proving that this sensitivity does not grow too quickly. As in Chapter 1's Gibbs formula, entropy supplies exactly the quantity that compares the tilted expectation with the original moment generating function. [example: Gaussian Linear Functional] Let $X\sim\mathcal N(0,I_n)$ and let $a\in\mathbb R^n$. For $F=a\cdot X$, the standard Gaussian density gives, for every $\beta\in\mathbb R$, \begin{align*} \mathbb E[e^{\beta a\cdot X}] = (2\pi)^{-n/2}\int_{\mathbb R^n}\exp\left(\beta a\cdot x-\frac{|x|^2}{2}\right)\,dx. \end{align*} The square-completion identity is \begin{align*} |x-\beta a|^2=|x|^2-2\beta a\cdot x+\beta^2|a|^2. \end{align*} Rearranging this identity gives \begin{align*} \beta a\cdot x-\frac{|x|^2}{2} =-\frac{|x-\beta a|^2}{2}+\frac{\beta^2|a|^2}{2}. \end{align*} Substituting into the integral, \begin{align*} \mathbb E[e^{\beta a\cdot X}] = \exp\left(\frac{\beta^2|a|^2}{2}\right)(2\pi)^{-n/2}\int_{\mathbb R^n}\exp\left(-\frac{|x-\beta a|^2}{2}\right)\,dx. \end{align*} The translated [Gaussian integral](/theorems/1140) equals $(2\pi)^{n/2}$, so \begin{align*} \mathbb E[e^{\beta F}] = \exp\left(\frac{\beta^2|a|^2}{2}\right). \end{align*} Therefore \begin{align*} \Lambda_F(\beta)=\log\mathbb E[e^{\beta F}]=\frac{\beta^2|a|^2}{2}. \end{align*} The tilted density with respect to Lebesgue measure is \begin{align*} \frac{e^{\beta a\cdot x}}{\mathbb E[e^{\beta a\cdot X}]}(2\pi)^{-n/2}e^{-|x|^2/2} = (2\pi)^{-n/2}\exp\left(\beta a\cdot x-\frac{|x|^2}{2}-\frac{\beta^2|a|^2}{2}\right). \end{align*} Using the same square completion, \begin{align*} \beta a\cdot x-\frac{|x|^2}{2}-\frac{\beta^2|a|^2}{2} = -\frac{|x-\beta a|^2}{2}. \end{align*} Hence the tilted density is \begin{align*} (2\pi)^{-n/2}\exp\left(-\frac{|x-\beta a|^2}{2}\right), \end{align*} so $\mathbb P_\beta$ is the law of $\mathcal N(\beta a,I_n)$. 
Finally, \begin{align*} \Lambda_F'(\beta)=\beta |a|^2 \end{align*} and \begin{align*} \Lambda_F''(\beta)=|a|^2. \end{align*} Thus a linear functional of a standard Gaussian has exactly quadratic log-Laplace transform, making it the model case for the sub-Gaussian bounds proved later in the chapter. [/example] ## Entropy Bounds as Differential Inequalities The next problem is to convert a functional inequality into an estimate on $\Lambda_F$. The bridge is the entropy of $e^{\beta F}$, because after division by $\mathbb E[e^{\beta F}]$ it is exactly the difference between the tilted mean of $\beta F$ and the logarithm of the normalising constant $\mathbb E[e^{\beta F}]$. [definition: Entropy of an Exponential Tilt] Let $F$ be a real-valued random variable. Define \begin{align*} D_F^{\operatorname{Ent}} := \{\beta \in \mathbb R : \mathbb E[e^{\beta F}] < \infty \text{ and } \mathbb E[|F|e^{\beta F}] < \infty\}. \end{align*} The entropy of the exponential tilt is the function $\operatorname{Ent}_F^{\exp}: D_F^{\operatorname{Ent}} \to \mathbb R$ given by \begin{align*} \operatorname{Ent}_F^{\exp}(\beta) := \mathbb E[\beta F e^{\beta F}] - \mathbb E[e^{\beta F}]\log \mathbb E[e^{\beta F}]. \end{align*} We write this value as $\operatorname{Ent}(e^{\beta F})$. [/definition] Dividing this entropy by $\mathbb E[e^{\beta F}]$ gives $\beta\,\mathbb E_\beta[F]-\Lambda_F(\beta)$, which equals $\beta\Lambda_F'(\beta)-\Lambda_F(\beta)$ wherever the derivative formula applies; for $\beta\ne 0$ this is $\beta^2$ times the derivative of $\Lambda_F(\beta)/\beta$. This matters because many functional inequalities give upper bounds for entropy, while tail bounds require upper bounds for log-Laplace transforms. The next lemma is the algebraic conversion between those two forms. [quotetheorem:6738] [citeproof:6738] The lemma separates the method into two tasks, and both assumptions are necessary for that separation. Exponential integrability on a neighbourhood of $0$ is what makes $\Lambda_F(\beta)/\beta$ differentiable and gives the limiting value $\mathbb E[F]$ at the origin; without it, the calculation can fail before any tail estimate is attempted.
For instance, if $F$ has the Pareto tail $\mathbb P(F>x)=x^{-\alpha}$ for $x\ge 1$, then $\mathbb E[e^{\beta F}]=\infty$ for every $\beta>0$ and the entropy expression for $e^{\beta F}$ is not available for any $\beta>0$. The sign of the tilt is also not cosmetic: a bound on $\operatorname{Ent}(e^{\beta F})$ only for $\beta>0$ can at most control the upper tail, while a variable such as $F=-Y$ with $Y$ Pareto has a harmless Laplace transform for positive $\beta$, since $\mathbb E[e^{\beta F}]\le 1$ there, but an uncontrolled lower tail. Thus the lemma cannot manufacture two-sided concentration from one-sided entropy information. It also gives no entropy estimate by itself: it only converts an already proved entropy bound into a Laplace-transform bound. Once the exponential-moment interval and the appropriate signs are in place, the rest is deterministic calculus: first prove an entropy inequality for exponentials of the functions under study, then integrate the differential inequality to obtain a Laplace-transform estimate. [remark: Centering in the Laplace Bound] The conclusion is often written as \begin{align*} \log \mathbb E[e^{\beta(F-\mathbb E[F])}] \le c\beta^2. \end{align*} This is the same statement, since $\Lambda_F(\beta)-\beta\mathbb E[F]$ is the log-Laplace transform of the centered random variable $F-\mathbb E[F]$. [/remark] ## The Herbst Lemma for Sub-Gaussian Tails A Laplace bound becomes a tail bound through Chernoff's method. The role of the Herbst lemma is to state this conversion in a reusable form, so later chapters can focus on proving entropy inequalities rather than redoing the optimisation each time. [quotetheorem:6740] [citeproof:6740] The theorem explains why a quadratic log-Laplace estimate is called sub-Gaussian: it gives the same tail exponent as a Gaussian random variable with variance proxy $\sigma^2$. The centering by $\mathbb E[F]$ is essential.
If $F=a+Z$ with $Z\sim\mathcal N(0,\sigma^2)$ and the Laplace estimate were used without subtracting $\mathbb E[F]=a$, the [Chernoff bound](/theorems/6038) would estimate deviations above $0$ rather than deviations above the natural centre $a$, giving the wrong statement when $|a|$ is large. The one-sided hypothesis only controls the upper tail; without negative values of $\beta$, there is no corresponding estimate for $\mathbb P(F-\mathbb E[F]\le -t)$. A concrete example is $F=-Y$ with $Y$ non-negative and heavy-tailed: $\mathbb E[e^{\beta F}]\le 1$ for every $\beta\ge 0$, so positive Laplace parameters look benign, but the lower tail of $F$ is the heavy upper tail of $Y$. The theorem is also limited to Gaussian-scale concentration: heavier-tailed variables may have finite moments but fail any quadratic Laplace bound near infinity. The next theorem packages the entropy-to-Laplace lemma with a logarithmic Sobolev inequality; Chapter 3 then studies this functional inequality systematically and proves its Gaussian and tensorized forms. [quotetheorem:6742] [citeproof:6742] This result is the main template for the rest of the course: prove a logarithmic Sobolev inequality, insert an exponential [test function](/page/Test%20Function), and read off concentration. The Lipschitz hypothesis is the point at which geometry enters, because it turns the gradient term into the uniform bound $|\nabla F|\le L$; without such control the entropy inequality may still hold but gives a variance proxy depending on $F$ rather than a dimension-free constant. In Gaussian space, $F(x)=|x|^2$ is the basic warning example: it is smooth but not Lipschitz, and its upper tail is chi-squared rather than governed by a dimension-free Gaussian bound with a fixed Lipschitz constant. Smoothness is a technical condition needed to use $e^{\beta F/2}$ as a test function; later approximation arguments remove it in settings where the Dirichlet form is closed. 
The theorem does not say that every measure has Gaussian concentration, since the logarithmic Sobolev inequality is a strong input and fails for many heavy-tailed measures. For a Pareto law on $[1,\infty)$, the $1$-Lipschitz observable $F(x)=x$ has $\mathbb E[e^{\beta F}]=\infty$ for every $\beta>0$, so Gaussian concentration fails and the law cannot satisfy a logarithmic Sobolev inequality of this form. Gaussian space is the guiding example because its logarithmic Sobolev constant and its linear functions match the model calculation from the beginning of the chapter. [example: Lipschitz Functions on Euclidean Gaussian Space] Let $X\sim\mathcal N(0,I_n)$, and let $F:\mathbb R^n\to\mathbb R$ be $L$-Lipschitz. Since the Gaussian logarithmic Sobolev inequality has constant $C=1$, the *[Sub-Gaussian Tail Bound](/theorems/1953) from Logarithmic Sobolev Inequality* applied with this $C$ gives, for every $t\ge 0$, \begin{align*} \mathbb P(F(X)-\mathbb E[F(X)]\ge t) \le \exp\left(-\frac{t^2}{2\cdot 1\cdot L^2}\right) = \exp\left(-\frac{t^2}{2L^2}\right). \end{align*} For the linear functional $F(x)=a\cdot x$, its Lipschitz constant is $|a|$: by Cauchy-Schwarz, \begin{align*} |F(x)-F(y)| = |a\cdot(x-y)| \le |a|\,|x-y|, \end{align*} and equality in the Lipschitz ratio is attained when $x-y$ is a non-zero multiple of $a$ if $a\ne 0$. Therefore the bound becomes \begin{align*} \mathbb P(a\cdot X\ge t) \le \exp\left(-\frac{t^2}{2|a|^2}\right), \end{align*} because $\mathbb E[a\cdot X]=a\cdot \mathbb E[X]=0$. This has the same Gaussian exponent as the exact one-dimensional law $a\cdot X\sim\mathcal N(0,|a|^2)$. For a nonlinear example, let $A\subset\mathbb R^n$ be nonempty and define $F(x)=\operatorname{dist}(x,A)$. For any $x,y\in\mathbb R^n$ and any $z\in A$, the triangle inequality gives \begin{align*} \operatorname{dist}(x,A) \le |x-z| \le |x-y|+|y-z|. \end{align*} Taking the infimum over $z\in A$ yields \begin{align*} \operatorname{dist}(x,A)\le |x-y|+\operatorname{dist}(y,A).
\end{align*} Interchanging $x$ and $y$ gives \begin{align*} \operatorname{dist}(y,A)\le |x-y|+\operatorname{dist}(x,A), \end{align*} so \begin{align*} |\operatorname{dist}(x,A)-\operatorname{dist}(y,A)|\le |x-y|. \end{align*} Thus $F$ is $1$-Lipschitz, and the same theorem gives \begin{align*} \mathbb P(\operatorname{dist}(X,A)-\mathbb E[\operatorname{dist}(X,A)]\ge t) \le \exp\left(-\frac{t^2}{2}\right). \end{align*} The point is that the concentration estimate only uses the Lipschitz constant, so it applies even when the distribution of $\operatorname{dist}(X,A)$ is not explicitly computable. [/example] ## Centering, Median Bounds, and Two-Sided Concentration The Herbst argument naturally centers at the mean because the log-Laplace transform remembers $\Lambda_F'(0)=\mathbb E[F]$. In geometric concentration, however, medians are often more robust and interact better with isoperimetric statements. We therefore need a short comparison between mean and median under sub-Gaussian tails. [definition: Median] Let $F$ be a real-valued random variable on a probability space $(\Omega, \mathcal F, \mathbb P)$. The set of medians of $F$ is \begin{align*} \operatorname{Med}(F) := \left\{m\in\mathbb R : \mathbb P(F\le m)\ge \frac12 \text{ and } \mathbb P(F\ge m)\ge \frac12\right\}\subset\mathbb R. \end{align*} A median of $F$ is an element $m_F\in\operatorname{Med}(F)$. [/definition] Medians need not be unique, but concentration estimates around any chosen median are equivalent up to a change in constants to concentration around the mean. The next lemma gives the quantitative comparison used throughout the course. [quotetheorem:6744] [citeproof:6744] This comparison allows the course to move between analytic estimates, which usually produce mean-centered concentration, and geometric estimates, which often produce median-centered concentration. 
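The definition of $\operatorname{Med}(F)$ can be tested directly on a finite distribution. The sketch below uses exact rational arithmetic and a hypothetical helper `is_median`; it illustrates non-uniqueness for a fair coin, where every point of $[0,1]$ is a median, and uniqueness for a skewed three-point law chosen purely for illustration.

```python
from fractions import Fraction

def is_median(m, values, probs):
    """Return True when P(F <= m) >= 1/2 and P(F >= m) >= 1/2."""
    below = sum(p for v, p in zip(values, probs) if v <= m)
    above = sum(p for v, p in zip(values, probs) if v >= m)
    half = Fraction(1, 2)
    return below >= half and above >= half

# Fair coin: F takes the values 0 and 1 with probability 1/2 each.
coin_values = [0, 1]
coin_probs = [Fraction(1, 2), Fraction(1, 2)]
candidates = [Fraction(-1, 10), 0, Fraction(1, 2), 1, Fraction(11, 10)]
coin_medians = [m for m in candidates if is_median(m, coin_values, coin_probs)]

# Skewed law (illustrative): most of the mass sits at 0, so 0 is the only median among the candidates.
skew_values = [0, 1, 10]
skew_probs = [Fraction(3, 5), Fraction(1, 5), Fraction(1, 5)]
skew_medians = [m for m in candidates if is_median(m, skew_values, skew_probs)]

print(coin_medians)
print(skew_medians)
```

Exact fractions avoid the floating-point ambiguity at the threshold $1/2$, which matters because the definition uses non-strict inequalities.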
The two-sided tail assumption is essential for the two-sided conclusion: an upper-tail estimate alone bounds how far a median can lie above the mean, but it says nothing about deviations of $F$ below a median. A concrete failure is obtained by taking $F=-Y$, where $Y$ is non-negative and heavy-tailed. The lower tail of $F$ is then the heavy upper tail of $Y$, so $\mathbb P(F-m_F\le -t)$ cannot decay at a Gaussian rate whichever centre is used; the missing lower-tail estimate is exactly what would rule this out. The constant is not designed to be sharp, and for symmetric distributions it may have substantial slack, but it is uniform over all random variables with the stated tail bound. The price is only a shift of order $\sigma$, so the Gaussian scale is unchanged. The following example shows how to use the comparison without knowing the median explicitly. [example: Comparison Between Mean and Median Under Sub-Gaussian Tails] Suppose $F$ satisfies the two-sided tail bound with parameter $\sigma=1$, meaning that for every $t\ge 0$, \begin{align*} \mathbb P(|F-\mathbb E[F]|\ge t) \le 2\exp\left(-\frac{t^2}{2}\right). \end{align*} Applying the *[Mean-Median Comparison Under Sub-Gaussian Tails](/theorems/6744)* with $\sigma=1$ gives, for every median $m_F$, \begin{align*} |\mathbb E[F]-m_F| \le 1\cdot \sqrt{2\log 4} = \sqrt{2\log 4}. \end{align*} The same comparison gives, for every $t\ge 0$, \begin{align*} \mathbb P(|F-m_F|\ge t+\sqrt{2\log 4}) \le 2\exp\left(-\frac{t^2}{2\cdot 1^2}\right) = 2e^{-t^2/2}. \end{align*} For a standard Gaussian $Z$, symmetry gives $\mathbb P(Z\le 0)=\mathbb P(Z\ge 0)=1/2$, so $0$ is a median, and $\mathbb E[Z]=0$. Thus the comparison allows \begin{align*} |\mathbb E[Z]-0|=0\le \sqrt{2\log 4}, \end{align*} while the actual mean-median distance is zero.
The estimate is therefore not sharp for this special distribution, but it gives a distribution-free way to replace the unknown median by the mean up to a fixed Gaussian-scale error. [/example] The example also illustrates a recurring theme: the constants obtained by a general concentration principle may not be sharp for a special distribution, but the scale is stable and dimension-free. This is the feature needed when the random variable is a complicated Lipschitz functional rather than a coordinate projection. [remark: What Herbst Does Not Prove] The Herbst argument converts an entropy inequality into concentration, but it does not by itself prove the entropy inequality or identify the best geometric constant. Chapter 3 supplies logarithmic Sobolev inputs, Chapter 4 supplies discrete tensorized entropy inputs, Chapters 5 and 6 supply isoperimetric inputs, and Chapters 7 and 8 supply transport inputs. This separation is useful because the same differential argument can be reused once the correct entropy estimate has been established, whether the input comes from analysis, optimal transport, or geometric measure concentration. [/remark] # 3. Logarithmic Sobolev Inequalities Logarithmic Sobolev inequalities give the functional-inequality form of the entropy method introduced in Chapter 1 and converted into tails by the Herbst argument in Chapter 2. The previous chapter showed that an entropy estimate for exponential tilts yields concentration through the Herbst argument. This chapter identifies a robust source of such entropy estimates: control of entropy by a Dirichlet energy. The main theme is that logarithmic Sobolev inequalities tensorize, are sharp for Gaussian measures, and imply the more familiar Poincare inequality. ## Entropy Controlled by Energy The problem is to find hypotheses on a probability measure that turn the entropy of a function into a local quantity involving derivatives. 
For concentration, this matters because applying such an estimate to $e^{\theta f}$ converts global fluctuations of $f$ into bounds involving $|\nabla f|$. Let $\nu$ be a probability measure on $\mathbb R^n$ or, more generally, on a smooth space where a gradient is available. The entropy functional is the map \begin{align*} \operatorname{Ent}_\nu : \{g:\mathbb R^n\to[0,\infty] \text{ measurable} : \int g\,d\nu<\infty,\ \int g\log g\,d\nu \text{ is defined}\}\to[0,\infty] \end{align*} given by \begin{align*} \operatorname{Ent}_\nu(g) = \int g \log g\, d\nu - \left(\int g\, d\nu\right)\log \left(\int g\, d\nu\right). \end{align*} Here $0\log 0$ is interpreted as $0$, and the value $+\infty$ is allowed. The logarithmic Sobolev inequality is stated for squares because the energy of $f$ is quadratic and because $f^2$ is nonnegative. [definition: Logarithmic Sobolev Inequality] Let $\nu$ be a probability measure on $\mathbb R^n$. We say that $\nu$ satisfies a logarithmic Sobolev inequality with constant $C>0$, written $\operatorname{LSI}(C)$, if every smooth function $f:\mathbb R^n\to \mathbb R$ satisfies \begin{align*} \operatorname{Ent}_\nu(f^2) \le 2C \int |\nabla f|^2\, d\nu. \end{align*} [/definition] The factor $2$ is a convention chosen so that the standard Gaussian has constant $C=1$. Constants do not affect the qualitative theory, but the normalization is useful when comparing log-Sobolev and Poincare constants. [example: Entropy Detects Rare Spikes] Let $A$ be measurable with $\nu(A)=p\in(0,1)$, and set $g=p^{-1}\mathbf 1_A$. Since $g=p^{-1}$ on $A$ and $g=0$ on $A^c$, \begin{align*} \int g\,d\nu=\int_A p^{-1}\,d\nu+\int_{A^c}0\,d\nu=p^{-1}\nu(A)=p^{-1}p=1. \end{align*} Using the definition of entropy and $\int g\,d\nu=1$, \begin{align*} \operatorname{Ent}_\nu(g)=\int g\log g\,d\nu-\left(\int g\,d\nu\right)\log\left(\int g\,d\nu\right)=\int g\log g\,d\nu-1\cdot\log 1=\int g\log g\,d\nu. 
\end{align*} The convention $0\log 0=0$ makes the contribution from $A^c$ equal to $0$, while on $A$ we have $g\log g=p^{-1}\log(p^{-1})$. Therefore \begin{align*} \operatorname{Ent}_\nu(g)=\int_A p^{-1}\log(p^{-1})\,d\nu=p^{-1}\log(p^{-1})\nu(A)=p^{-1}\log(p^{-1})p=\log\left(\frac{1}{p}\right). \end{align*} Thus a nonnegative function can have fixed mean $1$ while concentrating all its mass on an event of probability $p$, and its entropy records exactly the logarithmic cost $\log(1/p)$ of that localization. [/example] The preceding example shows that entropy is sensitive to localization, not merely to variance. To use this sensitivity in concentration, the main input is the following entropy estimate for exponential functions. [remark: Exponential Test Functions] If $f:\mathbb R^n\to \mathbb R$ is smooth and $\nu$ satisfies $\operatorname{LSI}(C)$, applying the inequality to $e^{\theta f/2}$ gives \begin{align*} \operatorname{Ent}_\nu(e^{\theta f}) \le \frac{C\theta^2}{2}\int |\nabla f|^2 e^{\theta f}\,d\nu. \end{align*} When $f$ is $L$-Lipschitz, this becomes \begin{align*} \operatorname{Ent}_\nu(e^{\theta f}) \le \frac{C\theta^2 L^2}{2}\mathbb E_\nu[e^{\theta f}], \end{align*} which is exactly the type of input required by Herbst's argument. [/remark] Thus an $\operatorname{LSI}(C)$ measure has sub-Gaussian concentration for smooth Lipschitz functions, once the approximation step from smooth to general Lipschitz functions is justified. The rest of the chapter is about finding measures with this property and understanding how the property behaves under products. ## Gross's Gaussian Inequality Which probability measure should be the model case for logarithmic Sobolev inequalities? Since the Herbst argument converts a log-Sobolev inequality into Gaussian tails, the standard Gaussian measure should satisfy the inequality with the best possible constant. 
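As a one-dimensional sanity check of the normalisation, the sketch below evaluates both sides of $\operatorname{Ent}_\nu(f^2)\le 2\int |f'|^2\,d\nu$ for the standard Gaussian on $\mathbb R$ and the exponential test function $f(x)=e^{\theta x/2}$ by numerical quadrature. For this particular choice a direct computation gives $\frac{\theta^2}{2}e^{\theta^2/2}$ on both sides, so the inequality with $C=1$ holds with equality and the constant cannot be lowered. The trapezoidal rule, the truncation to $[-12,12]$, and the helper names are assumptions made for illustration.

```python
import math

def gaussian_expectation(func, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation of E[func(X)] for X ~ N(0, 1), truncated to [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * width
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * func(x) * math.exp(-x * x / 2.0)
    return total * width / math.sqrt(2.0 * math.pi)

def lsi_sides(theta):
    """Return (Ent(f^2), 2 * E[|f'|^2]) for f(x) = exp(theta * x / 2)."""
    mean = gaussian_expectation(lambda x: math.exp(theta * x))
    # f^2 = exp(theta * x), so f^2 * log(f^2) = theta * x * exp(theta * x).
    e_glogg = gaussian_expectation(lambda x: theta * x * math.exp(theta * x))
    entropy = e_glogg - mean * math.log(mean)
    # f'(x) = (theta / 2) * exp(theta * x / 2), so |f'|^2 = (theta / 2)^2 * exp(theta * x).
    energy = gaussian_expectation(lambda x: (theta / 2.0) ** 2 * math.exp(theta * x))
    return entropy, 2.0 * energy

theta = 1.0
entropy_side, energy_side = lsi_sides(theta)
closed_form = 0.5 * theta ** 2 * math.exp(theta ** 2 / 2.0)
print(entropy_side, energy_side, closed_form)
```

Other smooth test functions can be substituted in `lsi_sides`, provided the derivative is updated by hand and the truncation interval still captures essentially all of the Gaussian mass of the integrands.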
Let $\gamma_n$ denote the standard Gaussian probability measure on $\mathbb R^n$, with density \begin{align*} (2\pi)^{-n/2}e^{-|x|^2/2} \end{align*} with respect to $\mathcal L^n$. [quotetheorem:6746] [citeproof:6746] The hypotheses encode the exact geometry used in the proof: Gaussian [integration by parts](/theorems/210) supplies the generator $L=\Delta-x\cdot\nabla$, and the curvature of the Gaussian potential gives the decay factor $e^{-t}$ in the gradient commutation estimate. A measure with much heavier tails, such as the Cauchy distribution on $\mathbb R$, cannot satisfy this Gaussian log-Sobolev inequality, since applying Herbst would force sub-Gaussian tails for Lipschitz functions. The theorem is therefore not just a concentration statement; it identifies the Gaussian measure as the model case where entropy dissipation, convexity of the potential, and dimension-free constants meet. This is the result that later comparison and perturbation theorems try to preserve under controlled changes of measure. [example: Standard Gaussian Lipschitz Concentration] Let $F:\mathbb R^n\to\mathbb R$ be smooth and $L$-Lipschitz, so $|\nabla F|\le L$ pointwise. For $\theta\ge 0$, apply *Gross Logarithmic Sobolev Inequality* to $f=e^{\theta F/2}$. The chain rule gives \begin{align*} \nabla e^{\theta F/2}=\frac{\theta}{2}e^{\theta F/2}\nabla F. \end{align*} Therefore \begin{align*} \left|\nabla e^{\theta F/2}\right|^2=\frac{\theta^2}{4}e^{\theta F}|\nabla F|^2. \end{align*} Using $|\nabla F|^2\le L^2$, \begin{align*} \left|\nabla e^{\theta F/2}\right|^2\le \frac{\theta^2L^2}{4}e^{\theta F}. \end{align*} Gross's inequality now gives \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta F})\le 2\int \left|\nabla e^{\theta F/2}\right|^2\,d\gamma_n. \end{align*} Substituting the pointwise bound into the right-hand side, \begin{align*} 2\int \left|\nabla e^{\theta F/2}\right|^2\,d\gamma_n\le 2\int \frac{\theta^2L^2}{4}e^{\theta F}\,d\gamma_n. 
\end{align*} Since $2\cdot \theta^2L^2/4=\theta^2L^2/2$, this is \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta F})\le \frac{\theta^2L^2}{2}\mathbb E_{\gamma_n}[e^{\theta F}]. \end{align*} Set $m=\mathbb E_{\gamma_n}[F]$, $Z(\theta)=\mathbb E_{\gamma_n}[e^{\theta(F-m)}]$, and $\psi(\theta)=\log Z(\theta)$. Multiplying a nonnegative function by a constant $a>0$ multiplies its entropy by $a$, because \begin{align*} \operatorname{Ent}_{\gamma_n}(ag)=a\int g\log g\,d\gamma_n+a\log a\int g\,d\gamma_n-a\int g\,d\gamma_n\log\left(a\int g\,d\gamma_n\right). \end{align*} The two $a\log a\int g\,d\gamma_n$ terms cancel, so $\operatorname{Ent}_{\gamma_n}(ag)=a\operatorname{Ent}_{\gamma_n}(g)$. Applying this with $a=e^{-\theta m}$ gives \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta(F-m)})\le \frac{\theta^2L^2}{2}Z(\theta). \end{align*} By the entropy definition, \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta(F-m)})=\mathbb E_{\gamma_n}[\theta(F-m)e^{\theta(F-m)}]-Z(\theta)\log Z(\theta). \end{align*} Since \begin{align*} Z'(\theta)=\mathbb E_{\gamma_n}[(F-m)e^{\theta(F-m)}], \end{align*} the entropy identity becomes \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta(F-m)})=\theta Z'(\theta)-Z(\theta)\psi(\theta). \end{align*} Using $\psi'(\theta)=Z'(\theta)/Z(\theta)$, \begin{align*} \theta Z'(\theta)-Z(\theta)\psi(\theta)=Z(\theta)\left(\theta\psi'(\theta)-\psi(\theta)\right). \end{align*} After dividing by $Z(\theta)>0$, \begin{align*} \theta\psi'(\theta)-\psi(\theta)\le \frac{\theta^2L^2}{2}. \end{align*} For $\theta>0$, \begin{align*} \frac{d}{d\theta}\left(\frac{\psi(\theta)}{\theta}\right)=\frac{\theta\psi'(\theta)-\psi(\theta)}{\theta^2}. \end{align*} Thus \begin{align*} \frac{d}{d\theta}\left(\frac{\psi(\theta)}{\theta}\right)\le \frac{L^2}{2}. \end{align*} Also $Z(0)=1$ and $Z'(0)=\mathbb E_{\gamma_n}[F-m]=0$, so $\lim_{\theta\downarrow0}\psi(\theta)/\theta=0$. 
Integrating from $0$ to $\theta$ gives \begin{align*} \frac{\psi(\theta)}{\theta}\le \frac{L^2\theta}{2}. \end{align*} Multiplying by $\theta$, \begin{align*} \psi(\theta)\le \frac{L^2\theta^2}{2}. \end{align*} Equivalently, \begin{align*} \mathbb E_{\gamma_n}[e^{\theta(F-m)}]\le \exp\left(\frac{L^2\theta^2}{2}\right). \end{align*} By Markov's inequality, for every $\theta>0$ and $t\ge0$, \begin{align*} \mathbb P_{\gamma_n}(F-m\ge t)\le e^{-\theta t}\mathbb E_{\gamma_n}[e^{\theta(F-m)}]. \end{align*} Using the log-Laplace bound, \begin{align*} \mathbb P_{\gamma_n}(F-m\ge t)\le \exp\left(-\theta t+\frac{L^2\theta^2}{2}\right). \end{align*} If $L>0$, choose $\theta=t/L^2$. Then \begin{align*} -\theta t+\frac{L^2\theta^2}{2}=-\frac{t^2}{L^2}+\frac{t^2}{2L^2}. \end{align*} Hence \begin{align*} -\theta t+\frac{L^2\theta^2}{2}=-\frac{t^2}{2L^2}. \end{align*} Therefore \begin{align*} \mathbb P_{\gamma_n}(F-\mathbb E_{\gamma_n}[F]\ge t)\le \exp\left(-\frac{t^2}{2L^2}\right),\qquad t\ge0. \end{align*} When $L=0$, the bound $|\nabla F|=0$ forces $F$ to be constant on $\mathbb R^n$, so the same upper-tail statement is trivial. The computation shows explicitly how Gross's local energy estimate becomes a Gaussian bound for the centered log-Laplace transform. [/example] The Gaussian example is also the template for the transport and curvature comparisons in Chapters 8 and 9. The feature to retain here is dimension-free control: the constant in Gross's inequality does not grow with $n$. ## Tensorization Across Independent Coordinates The next problem is whether logarithmic Sobolev inequalities survive passage from one coordinate to many independent coordinates. Concentration on high-dimensional product spaces requires constants that remain stable under products; otherwise the Herbst bound degenerates as dimension grows. 
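The dimension-free character of the Gaussian tail bound $\exp(-t^2/(2L^2))$ can be probed by simulation. The sketch below samples $F(x)=|x|=\operatorname{dist}(x,\{0\})$, which is $1$-Lipschitz, in a few dimensions and compares the empirical upper tail at $t=1$ with $e^{-t^2/2}$. It is a Monte Carlo illustration only: the empirical mean stands in for $\mathbb E[F]$, and the sample size, seed, and dimensions are arbitrary assumptions.

```python
import math
import random

def norm_samples(n, num_samples, rng):
    """Draw num_samples values of |X| for X ~ N(0, I_n)."""
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
        for _ in range(num_samples)
    ]

def empirical_upper_tail(samples, t):
    """Fraction of samples exceeding their empirical mean by at least t."""
    centre = sum(samples) / len(samples)  # empirical mean replaces E[F]
    return sum(1 for s in samples if s - centre >= t) / len(samples)

rng = random.Random(0)  # fixed seed so the run is reproducible
t = 1.0
bound = math.exp(-t * t / 2.0)  # exp(-t^2 / (2 L^2)) with L = 1

tails = {}
for n in (5, 50, 200):
    tails[n] = empirical_upper_tail(norm_samples(n, 2000, rng), t)
    print(n, tails[n], bound)
```

The typical size of $|X|$ grows like $\sqrt n$, yet the fluctuations around the mean stay of order one, which is what a bound depending only on the Lipschitz constant predicts.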
For a smooth function $f$ on a product space $E_1\times\cdots\times E_n$, write $\nabla_i f$ for the gradient in the $i$th coordinate, with the other coordinates held fixed. In Euclidean products this gives \begin{align*} |\nabla f|^2=\sum_{i=1}^n |\nabla_i f|^2. \end{align*} [quotetheorem:6748] [citeproof:6748] Tensorization is the structural reason why logarithmic Sobolev inequalities are useful in concentration. The independence hypothesis is essential: the proof decomposes entropy into conditional entropies, and such a decomposition is not available for an arbitrary correlated law. For example, a mixture of two well-separated product Gaussians may have excellent one-coordinate conditional behaviour inside each component but poor global concentration because the function distinguishing the two components has small local gradient away from the transition region and large variance. Thus tensorization covers product measures, not general dependence structures; later results need perturbation, convexity, or transportation arguments to handle non-product laws. The smooth-space assumption is also a technical way of ensuring that the coordinate Dirichlet energies $|\nabla_i f|^2$ are meaningful and compatible with [Fubini's theorem](/theorems/2961). [example: Product Gaussian Measures] Let $\gamma_1$ be the standard Gaussian measure on $\mathbb R$. By *Gross Logarithmic Sobolev Inequality*, $\gamma_1$ satisfies $\operatorname{LSI}(1)$. Applying *Tensorization of Logarithmic Sobolev Inequalities* with $C_1=\cdots=C_n=1$ gives, for every smooth $f:\mathbb R^n\to\mathbb R$, \begin{align*} \operatorname{Ent}_{\gamma_1^{\otimes n}}(f^2)\le 2\sum_{i=1}^n \int |\partial_i f|^2\,d\gamma_1^{\otimes n}. \end{align*} Since integration is linear, \begin{align*} 2\sum_{i=1}^n \int |\partial_i f|^2\,d\gamma_1^{\otimes n}=2\int \sum_{i=1}^n |\partial_i f|^2\,d\gamma_1^{\otimes n}. 
\end{align*} For the Euclidean product gradient, $|\nabla f|^2=\sum_{i=1}^n|\partial_i f|^2$, so \begin{align*} \operatorname{Ent}_{\gamma_1^{\otimes n}}(f^2)\le 2\int |\nabla f|^2\,d\gamma_1^{\otimes n}. \end{align*} Thus $\gamma_1^{\otimes n}$ satisfies $\operatorname{LSI}(1)$, and because independent standard Gaussian coordinates have joint law $\gamma_n$, this recovers the Gaussian logarithmic Sobolev inequality for $\gamma_n$. Now let $\mu_\sigma$ be the law of $\sigma Y$, where $Y\sim\mathcal N(0,1)$ and $\sigma>0$. For a smooth $u:\mathbb R\to\mathbb R$, define $v(y)=u(\sigma y)$. Since $\mu_\sigma$ is the pushforward of $\gamma_1$ under $y\mapsto \sigma y$, \begin{align*} \operatorname{Ent}_{\gamma_1}(v^2)=\operatorname{Ent}_{\mu_\sigma}(u^2). \end{align*} The chain rule gives \begin{align*} v'(y)=\sigma u'(\sigma y). \end{align*} Therefore \begin{align*} \int |v'(y)|^2\,d\gamma_1(y)=\sigma^2\int |u'(\sigma y)|^2\,d\gamma_1(y). \end{align*} Changing variables through the pushforward relation gives \begin{align*} \int |u'(\sigma y)|^2\,d\gamma_1(y)=\int |u'(x)|^2\,d\mu_\sigma(x). \end{align*} Applying $\operatorname{LSI}(1)$ for $\gamma_1$ to $v$ now yields \begin{align*} \operatorname{Ent}_{\mu_\sigma}(u^2)\le 2\sigma^2\int |u'|^2\,d\mu_\sigma. \end{align*} Hence $\mathcal N(0,\sigma^2)$ satisfies $\operatorname{LSI}(\sigma^2)$. If $X_i\sim\mathcal N(0,\sigma_i^2)$ are independent and $\mu=\bigotimes_{i=1}^n\mathcal N(0,\sigma_i^2)$, tensorization gives \begin{align*} \operatorname{Ent}_{\mu}(f^2)\le 2\sum_{i=1}^n \sigma_i^2\int |\partial_i f|^2\,d\mu. \end{align*} Since $\sigma_i^2\le \max_{1\le j\le n}\sigma_j^2$ for each $i$, \begin{align*} 2\sum_{i=1}^n \sigma_i^2\int |\partial_i f|^2\,d\mu\le 2\left(\max_{1\le j\le n}\sigma_j^2\right)\sum_{i=1}^n\int |\partial_i f|^2\,d\mu. 
\end{align*} Using again $|\nabla f|^2=\sum_{i=1}^n|\partial_i f|^2$, \begin{align*} \operatorname{Ent}_{\mu}(f^2)\le 2\left(\max_{1\le j\le n}\sigma_j^2\right)\int |\nabla f|^2\,d\mu. \end{align*} Thus the product law satisfies $\operatorname{LSI}(\max_i\sigma_i^2)$: among independent Gaussian coordinates, the largest variance determines the product log-Sobolev constant. [/example] The tensorized estimate also explains why the gradient appears as a sum of coordinate energies. It matches the way entropy decomposes under independence, so the functional inequality mirrors the probabilistic product structure. ## Poincare Inequality as a Consequence A logarithmic Sobolev inequality controls entropy, while the Poincare inequality controls variance. The next question is how these two notions compare, since variance bounds are easier to interpret but often too weak for sharp tail estimates. [definition: Poincare Inequality] Let $\nu$ be a probability measure on $\mathbb R^n$. We say that $\nu$ satisfies a Poincare inequality with constant $C>0$ if every smooth function $f:\mathbb R^n\to\mathbb R$ satisfies \begin{align*} \operatorname{Var}_\nu(f)\le C\int |\nabla f|^2\,d\nu. \end{align*} [/definition] The definition records a variance bound in terms of the same Dirichlet energy that appears in the log-Sobolev inequality. The natural comparison question is whether the stronger entropy estimate automatically supplies this variance estimate, and the answer comes from linearising the log-Sobolev inequality around constants. [quotetheorem:6751] [citeproof:6751] The hypotheses are used through a small perturbation of constants, so only the local second-order content of the logarithmic Sobolev inequality survives in the conclusion. The converse is false: many measures satisfy a Poincare inequality without satisfying a logarithmic Sobolev inequality. 
A standard example is the exponential law on $\mathbb R_+$, which has a spectral-gap/Poincare inequality but cannot satisfy an LSI with the Euclidean gradient because Herbst's argument would give Gaussian concentration for the identity function, contradicting its exponential tail. Thus the variance inequality is a lower-order shadow of the entropy estimate: it records the quadratic fluctuation scale but loses the information needed to control the whole log-Laplace transform. This distinction explains why the course treats logarithmic Sobolev inequalities as the main engine for Gaussian concentration rather than replacing them by Poincare bounds. [example: Variance Bound Versus Entropy Bound] Let $F:\mathbb R^n\to\mathbb R$ be smooth and $L$-Lipschitz under $\gamma_n$, so $|\nabla F|\le L$ pointwise. Applying *Gross Logarithmic Sobolev Inequality* to $f=e^{\theta F/2}$ gives \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta F})\le 2\int \left|\nabla e^{\theta F/2}\right|^2\,d\gamma_n. \end{align*} By the chain rule, \begin{align*} \nabla e^{\theta F/2}=\frac{\theta}{2}e^{\theta F/2}\nabla F. \end{align*} Taking squared norms gives \begin{align*} \left|\nabla e^{\theta F/2}\right|^2=\frac{\theta^2}{4}e^{\theta F}|\nabla F|^2. \end{align*} Since $|\nabla F|^2\le L^2$, \begin{align*} \left|\nabla e^{\theta F/2}\right|^2\le \frac{\theta^2L^2}{4}e^{\theta F}. \end{align*} Substitution into the log-Sobolev bound yields \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta F})\le 2\int \frac{\theta^2L^2}{4}e^{\theta F}\,d\gamma_n. \end{align*} Because $2\cdot \theta^2L^2/4=\theta^2L^2/2$, \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta F})\le \frac{\theta^2L^2}{2}\mathbb E_{\gamma_n}[e^{\theta F}]. \end{align*} By *Log-Sobolev Implies Poincare*, the same Gaussian log-Sobolev inequality gives \begin{align*} \operatorname{Var}_{\gamma_n}(F)\le \int |\nabla F|^2\,d\gamma_n. 
\end{align*} Using $|\nabla F|^2\le L^2$ and $\gamma_n(\mathbb R^n)=1$, \begin{align*} \int |\nabla F|^2\,d\gamma_n\le \int L^2\,d\gamma_n=L^2. \end{align*} Thus \begin{align*} \operatorname{Var}_{\gamma_n}(F)\le L^2. \end{align*} For $F(x)=x_1$, we have $\nabla F=(1,0,\dots,0)$, so $L=1$. If $X_1\sim\mathcal N(0,1)$, then $\mathbb E[X_1]=0$ and $\mathbb E[X_1^2]=1$, hence \begin{align*} \operatorname{Var}_{\gamma_n}(x_1)=\mathbb E[X_1^2]-(\mathbb E[X_1])^2=1-0^2=1. \end{align*} So the Poincare estimate is sharp for this coordinate function. The entropy estimate for $e^{\theta F}$ is sharp as well. Completing the square gives \begin{align*} \theta x-\frac{x^2}{2}=-\frac{(x-\theta)^2}{2}+\frac{\theta^2}{2}. \end{align*} By the definition of the standard Gaussian density, \begin{align*} \mathbb E[e^{\theta X_1}]=\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}e^{\theta x}e^{-x^2/2}\,dx. \end{align*} Using the displayed square completion, \begin{align*} \mathbb E[e^{\theta X_1}]=e^{\theta^2/2}\frac{1}{\sqrt{2\pi}}\int_{\mathbb R}e^{-(x-\theta)^2/2}\,dx. \end{align*} The last integral equals $\sqrt{2\pi}$ by the change of variables $u=x-\theta$, so \begin{align*} \mathbb E[e^{\theta X_1}]=e^{\theta^2/2}. \end{align*} Differentiating the moment generating function gives \begin{align*} \mathbb E[X_1e^{\theta X_1}]=\frac{d}{d\theta}e^{\theta^2/2}=\theta e^{\theta^2/2}. \end{align*} By the definition of entropy, \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta x_1})=\mathbb E[\theta X_1e^{\theta X_1}]-\mathbb E[e^{\theta X_1}]\log \mathbb E[e^{\theta X_1}]. \end{align*} Substituting the two identities above gives \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta x_1})=\theta\cdot\theta e^{\theta^2/2}-e^{\theta^2/2}\log(e^{\theta^2/2}). \end{align*} Since $\log(e^{\theta^2/2})=\theta^2/2$, \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta x_1})=\theta^2e^{\theta^2/2}-\frac{\theta^2}{2}e^{\theta^2/2}. 
\end{align*} Hence \begin{align*} \operatorname{Ent}_{\gamma_n}(e^{\theta x_1})=\frac{\theta^2}{2}e^{\theta^2/2}=\frac{\theta^2}{2}\mathbb E[e^{\theta X_1}]. \end{align*} Thus the variance estimate captures the exact second-moment scale for the coordinate function, while the entropy estimate captures its exact Gaussian exponential scale. [/example] The chapter therefore fits into the course as the functional-inequality engine behind entropy concentration. Gross's inequality supplies the canonical model, tensorization makes the method high-dimensional, and the Poincare implication records the lower-order information contained in the same estimate. # 4. Discrete Entropy and Product Spaces This chapter moves from the entropy identities of product spaces to the discrete functional inequalities that make those identities useful. The guiding setting is a product cube, where changing one coordinate gives a finite-difference derivative and conditional replacement gives the local averaging operation. The prerequisites are Chapter 1's entropy functional and tensorization theorem, conditional expectation, Bernoulli product measures, and the Herbst argument from Chapter 2. The main result is a tensorized logarithmic Sobolev inequality for product Bernoulli measures, followed by the [entropy method for self-bounding functions](/theorems/6757). ## Discrete Gradients and Conditional Replacement How should a derivative be defined on a space with no small increments? A variance bound based only on the range of $f$ loses the coordinate structure: it cannot tell whether the fluctuation comes from one coordinate or from many small independent contributions. On a product space the natural local operation is to resample or flip one coordinate while all other coordinates are held fixed. This produces coordinate oscillations which play the role of squared gradients in logarithmic Sobolev inequalities. 
Let $(\Omega_i, \mathcal F_i, \mu_i)$ be probability spaces, and let $(\Omega, \mathcal F, \mu)=\prod_{i=1}^n(\Omega_i, \mathcal F_i, \mu_i)$. For $x=(x_1,\dots,x_n)$ write $x_{-i}$ for all coordinates except $x_i$. [definition: Conditional Replacement Operator] The $i$th conditional replacement operator is the [linear map](/page/Linear%20Map) \begin{align*} Q_i:L^1(\Omega,\mathcal F,\mu)\to L^1(\Omega,\mathcal F,\mu) \end{align*} defined by \begin{align*} Q_i f(x) := \int_{\Omega_i} f(x_1,\dots,x_{i-1},y_i,x_{i+1},\dots,x_n)\,d\mu_i(y_i). \end{align*} [/definition] The operator $Q_i$ is conditional expectation with respect to all coordinates except $i$. The deviation from this local average is the basic signed coordinate variation, while its square-averaged version is the local energy needed in entropy estimates. This motivates recording both the oscillation and the conditional variance. [definition: Discrete Gradient] The $i$th coordinate oscillation is the map \begin{align*} \Delta_i:L^1(\Omega,\mathcal F,\mu)\to L^1(\Omega,\mathcal F,\mu) \end{align*} defined by \begin{align*} \Delta_i f(x):= f(x)-Q_i f(x). \end{align*} The $i$th conditional variance is the map \begin{align*} \operatorname{Var}_i:L^2(\Omega,\mathcal F,\mu)\to L^1(\Omega,\mathcal F,\mu) \end{align*} defined by \begin{align*} \operatorname{Var}_i(f)(x):=Q_i(f^2)(x)-(Q_i f(x))^2. \end{align*} [/definition] The two quantities serve different purposes. The oscillation is signed and behaves like a first derivative; the conditional variance is nonnegative and is the quadratic energy that appears naturally after entropy tensorization. A first test case is the usual bounded-difference hypothesis on the discrete cube. [example: Lipschitz Functions on the Discrete Cube] Let $\Omega=\{0,1\}^n$ carry a product Bernoulli measure, and let $f:\Omega\to\mathbb R$ satisfy the bounded-difference hypothesis with constants $c_1,\dots,c_n\ge0$, meaning that $|f(x)-f(x')|\le c_i$ whenever $x$ and $x'$ differ only in coordinate $i$. Fix a coordinate $i$, and fix the remaining coordinates $x_{-i}$. 
Write \begin{align*} a:=f(x_1,\dots,x_{i-1},0,x_{i+1},\dots,x_n) \end{align*} and \begin{align*} b:=f(x_1,\dots,x_{i-1},1,x_{i+1},\dots,x_n). \end{align*} Let $p_i$ be the Bernoulli mass of $1$ in the $i$th coordinate, and set $q_i:=1-p_i$. The two inputs defining $a$ and $b$ differ only in coordinate $i$, so the bounded-difference hypothesis gives \begin{align*} |a-b|\le c_i. \end{align*} For this fixed $x_{-i}$, the conditional replacement operator averages only over the $i$th coordinate, hence \begin{align*} Q_i f(x)=q_i a+p_i b \end{align*} and \begin{align*} Q_i(f^2)(x)=q_i a^2+p_i b^2. \end{align*} Therefore, by the definition of conditional variance, \begin{align*} \operatorname{Var}_i(f)(x)=q_i a^2+p_i b^2-(q_i a+p_i b)^2. \end{align*} Expanding the square gives \begin{align*} (q_i a+p_i b)^2=q_i^2a^2+2p_iq_iab+p_i^2b^2. \end{align*} Substituting this expansion, \begin{align*} \operatorname{Var}_i(f)(x)=q_i a^2+p_i b^2-q_i^2a^2-2p_iq_iab-p_i^2b^2. \end{align*} Since $1-q_i=p_i$ and $1-p_i=q_i$, this becomes \begin{align*} \operatorname{Var}_i(f)(x)=p_iq_i a^2-2p_iq_iab+p_iq_i b^2. \end{align*} Factoring, \begin{align*} \operatorname{Var}_i(f)(x)=p_iq_i(a-b)^2. \end{align*} Finally, \begin{align*} 0\le (p_i-q_i)^2. \end{align*} Using $p_i+q_i=1$, this is \begin{align*} 0\le (p_i+q_i)^2-4p_iq_i=1-4p_iq_i, \end{align*} so \begin{align*} p_iq_i\le \frac14. \end{align*} Combining $p_iq_i(a-b)^2\le \frac14(a-b)^2$ with $|a-b|\le c_i$ gives \begin{align*} \operatorname{Var}_i(f)(x)\le \frac{c_i^2}{4}. \end{align*} Thus the conditional variance in one coordinate is controlled by one quarter of the squared bounded-difference constant for that coordinate. [/example] This example explains why the discrete gradient formalism recovers bounded-difference concentration. The next step is to replace variance by entropy, since entropy tensorizes more directly under exponential tilting. 
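The identity $\operatorname{Var}_i(f)(x)=p_iq_i(a-b)^2$ and the bound $c_i^2/4$ from the example above can be checked by brute force on a small cube. The sketch below is illustrative only: the weights, the Bernoulli masses, and the test function `f` are hypothetical choices, and `conditional_variance` is a helper name introduced here rather than notation from the course.

```python
from itertools import product

WEIGHTS = (0.5, 1.0, 2.0)  # hypothetical bounded-difference constants c_i
P = (0.1, 0.5, 0.8)        # hypothetical Bernoulli masses p_i of the value 1

def f(x):
    """Illustrative nonlinear function: flipping coordinate i moves it by at most WEIGHTS[i]."""
    return max(w * xi for w, xi in zip(WEIGHTS, x))

def conditional_variance(func, x, i, p_i):
    """Return (Q_i(func^2) - (Q_i func)^2, p_i*q_i*(a-b)^2) at the point x.

    Only coordinate i is averaged; all other coordinates of x stay fixed.
    """
    a = func(x[:i] + (0,) + x[i + 1:])  # value with x_i = 0
    b = func(x[:i] + (1,) + x[i + 1:])  # value with x_i = 1
    q_i = 1.0 - p_i
    q_f = q_i * a + p_i * b             # Q_i func (x)
    q_f2 = q_i * a**2 + p_i * b**2      # Q_i (func^2) (x)
    return q_f2 - q_f**2, p_i * q_i * (a - b) ** 2

# Collect (direct variance, closed form, c_i) over every point and coordinate.
checks = []
for x in product((0, 1), repeat=len(WEIGHTS)):
    for i, (c_i, p_i) in enumerate(zip(WEIGHTS, P)):
        var_i, closed_form = conditional_variance(f, x, i, p_i)
        checks.append((var_i, closed_form, c_i))

print(max(abs(v - c) for v, c, _ in checks))      # discrepancy between the two formulas
print(all(v <= c_i**2 / 4 + 1e-12 for v, _, c_i in checks))
```

The function `f` is deliberately not linear, since the identity depends only on the two values $a$ and $b$ in the fibre and not on any additive structure.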
## The Two-Point Logarithmic Sobolev Inequality What is the smallest local inequality needed for a product cube? A Poincare or Efron-Stein estimate controls variance, but Herbst's argument asks for entropy of $e^{\lambda F}$, not variance of $F$. Tensorization reduces the missing estimate to a single Bernoulli coordinate, so the fundamental input is a logarithmic Sobolev inequality on $\{0,1\}$. It compares entropy of a square to the squared difference between the two endpoint values. [definition: Bernoulli Entropy Functional] Let $\nu_p$ be the Bernoulli measure on $\{0,1\}$ with $\nu_p(1)=p$ and $\nu_p(0)=1-p$. The Bernoulli entropy functional is the map \begin{align*} \operatorname{Ent}_{\nu_p}:L^\infty_+(\{0,1\},\nu_p)\to[0,\infty) \end{align*} defined by \begin{align*} \operatorname{Ent}_{\nu_p}(g):=\mathbb E_{\nu_p}[g\log g]-\mathbb E_{\nu_p}[g]\log\mathbb E_{\nu_p}[g]. \end{align*} The convention is $0\log 0:=0$. [/definition] Here $L^\infty_+(\{0,1\},\nu_p)$ denotes the nonnegative bounded measurable functions $g:\{0,1\}\to[0,\infty)$. On this finite space, the notation mainly records the domain and codomain explicitly; there is no integrability issue beyond nonnegativity. The entropy is zero exactly when $g$ is constant up to null sets. On a two-point space, all nonconstant behaviour is controlled by the single difference $f(1)-f(0)$, so the key local theorem asks how much entropy can be generated by that difference alone. [quotetheorem:6752] [citeproof:6752] The condition $p\in(0,1)$ is essential because both points must have positive mass. The constant is sharp for the normalization used here, and it degenerates as $p\downarrow0$ or $p\uparrow1$: \begin{align*} C_p=\frac{p(1-p)}{\Lambda(p,1-p)} \sim p\log\frac1p\quad(p\downarrow0), \qquad C_p\sim (1-p)\log\frac1{1-p}\quad(p\uparrow1). \end{align*} This degeneration is the correct rare-endpoint behaviour. 
For instance, if $f(0)=0$ and $f(1)=1$, then $\operatorname{Ent}_{\nu_p}(f^2)=p\log(1/p)$, which has the same order as $C_p$ as $p\downarrow0$. At the actual endpoint $p=0$ or $p=1$, the space has only one measured point, so values on the missing point should not contribute to entropy. The use of $f^2$ is also part of the structure: entropy is defined for nonnegative inputs and the square removes signs while matching the quadratic Dirichlet energy. Replacing $f^2$ by $f$ would not be admissible for $f(0)=-1$ and $f(1)=1$, whereas replacing the two-point difference by an arbitrary single value would miss the constant functions, for which entropy is zero exactly because the endpoint difference vanishes. The theorem is local: it does not say that every discrete measure has the same constant, nor does it control arbitrary moves on a non-product discrete space. Its role is to supply the single-coordinate estimate which tensorization can repeat. In the symmetric case the statement takes a particularly readable form. [example: Symmetric Bernoulli Coordinate] For the symmetric coordinate, $p=q=1/2$. In the *[Two-Point Logarithmic Sobolev Inequality](/theorems/6752)*, the constant is \begin{align*} C_p=\frac{p(1-p)}{\Lambda(p,1-p)}. \end{align*} Since $p=1-p=1/2$, the defining convention $\Lambda(a,a)=a$ gives \begin{align*} \Lambda(1/2,1/2)=1/2. \end{align*} Therefore \begin{align*} C_{1/2} =\frac{(1/2)(1/2)}{1/2} =\frac{1/4}{1/2} =\frac12. \end{align*} Hence, for every $f:\{0,1\}\to\mathbb R$, \begin{align*} \operatorname{Ent}_{\nu_{1/2}}(f^2)\le \frac12\,(f(1)-f(0))^2. \end{align*} This is the reference normalization on the unbiased cube: the one-coordinate law contributes the fixed constant $1/2$, so any dimension dependence in the product cube must come from summing coordinate differences, not from the one-bit inequality itself. The boundary behaviour is different. 
If $q=1-p$ and $p\ne q$, then \begin{align*} C_p =\frac{pq}{\Lambda(p,q)} =pq\,\frac{\log p-\log q}{p-q} =pq\,\frac{\log(q/p)}{q-p}. \end{align*} As $p\downarrow0$, we have $q\to1$ and $q-p\to1$, so \begin{align*} C_p =pq\,\frac{\log(q/p)}{q-p} \sim p\log\frac1p. \end{align*} For the rare-endpoint indicator $f(0)=0$ and $f(1)=1$, one has $f^2=f$ and \begin{align*} \mathbb E_{\nu_p}[f^2]=p, \qquad \mathbb E_{\nu_p}[f^2\log f^2]=p\cdot 1\cdot\log 1+q\cdot0\log0=0, \end{align*} using $0\log0=0$. Thus \begin{align*} \operatorname{Ent}_{\nu_p}(f^2) =0-p\log p =p\log\frac1p, \end{align*} which has the same scale as $C_p$ near $p=0$. At $p=1/2$, the two endpoints have equal mass and the entropy cost is balanced; tensorization then repeats this one-bit estimate across all coordinates of $\{0,1\}^n$. [/example] This local inequality is the discrete analogue of a Gaussian logarithmic Sobolev inequality. The replacement for $|\nabla f|^2$ is the sum of coordinate energies. ## Tensorized Logarithmic Sobolev Inequalities on Product Cubes How does a one-coordinate entropy inequality become an $n$-coordinate inequality? Without independence, conditioning on all but one coordinate changes the remaining coordinate law, so the one-bit estimate no longer repeats with the same constant and the sum of local terms can miss correlations. For a product measure, entropy tensorization gives the exact bridge: global entropy is bounded by the sum of conditional entropies. Each conditional entropy is then controlled by the two-point inequality applied with the other coordinates frozen. [definition: Coordinate Difference on the Cube] For $x\in\{0,1\}^n$, let $x_{\operatorname{flip},i}$ denote the point obtained by flipping the $i$th coordinate. The $i$th coordinate difference is the linear map \begin{align*} D_i:\{f:\{0,1\}^n\to\mathbb R\}\to\{g:\{0,1\}^n\to\mathbb R\} \end{align*} defined by \begin{align*} D_i f(x):=f(x)-f(x_{\operatorname{flip},i}). 
\end{align*} [/definition] The coordinate difference records exactly the change visible to the two-point inequality after the other $n-1$ coordinates have been fixed. To use this local control for concentration, we need a theorem that sums the coordinate estimates into one global entropy bound on the whole cube. This motivates the discrete tensorized logarithmic Sobolev inequality. [quotetheorem:6755] [citeproof:6755] The product structure is not a cosmetic hypothesis. For a concrete failure of the product proof, put mass $1/2$ on $(0,0,\dots,0)$ and mass $1/2$ on $(1,1,\dots,1)$, and let $f(x)=x_1$. On the support of this law the function fluctuates by one, so $\operatorname{Ent}(f^2)$ is positive after a harmless shift such as $f_\varepsilon(x)=\varepsilon+x_1$. But conditioning on all coordinates except $i$ determines the remaining coordinate, so the conditional law in the $i$th fibre is a point mass and the corresponding conditional entropy is zero. Thus the tensorization step has no local entropy terms to sum. This does not mean that the formal flip energy $\mathbb E[(D_i f)^2]$ is intrinsically zero under the correlated law; for the diagonal measure, flipping a coordinate usually leaves the support and may even produce a nonzero value of $D_i f$ if $f$ has been defined on the whole cube. The failure is that the conditional laws are degenerate and are not the Bernoulli fibres used in the theorem, so the displayed product right-hand side is no longer the quantity delivered by entropy tensorization. This example shows both the limitation and the mechanism: independence is what makes the conditional fibres carry the original Bernoulli law and allows local entropy losses to recombine into global entropy. The theorem also gives an entropy bound, not a tail bound by itself; the concentration step still requires applying it to $e^{\lambda F/2}$ and controlling the resulting coordinate differences. 
This is where the method reconnects with concentration through the Herbst argument from Chapter 2. [example: Bounded Differences from Tensorized Log-Sobolev] Let $F:\{0,1\}^n\to\mathbb R$ satisfy $|D_iF|\le c_i$ for every $i$, and put $c_*:=\max_i c_i$. We apply the *Discrete Tensorized Logarithmic Sobolev Inequality* to \begin{align*} f=e^{\lambda F/2}. \end{align*} For fixed $x$ and $i$, write $y=x_{\operatorname{flip},i}$. By the [mean value theorem](/theorems/186) applied to $t\mapsto e^{\lambda t/2}$, \begin{align*} \left|e^{\lambda F(x)/2}-e^{\lambda F(y)/2}\right|\le \frac{|\lambda|}{2}|F(x)-F(y)|\max\{e^{\lambda F(x)/2},e^{\lambda F(y)/2}\}. \end{align*} Since $|F(x)-F(y)|=|D_iF(x)|\le c_i$ and \begin{align*} e^{\lambda F(y)/2}=e^{\lambda F(x)/2}e^{\lambda(F(y)-F(x))/2}, \end{align*} we have \begin{align*} e^{\lambda F(y)/2}\le e^{\lambda F(x)/2}e^{|\lambda|c_i/2}. \end{align*} The same bound is also true when the maximum is already $e^{\lambda F(x)/2}$, because $e^{|\lambda|c_i/2}\ge1$. Hence \begin{align*} |D_i e^{\lambda F/2}(x)|\le \frac{|\lambda|c_i}{2}e^{\lambda F(x)/2}e^{|\lambda|c_i/2}. \end{align*} Squaring this estimate gives \begin{align*} (D_i e^{\lambda F/2}(x))^2\le \frac{\lambda^2c_i^2}{4}e^{\lambda F(x)}e^{|\lambda|c_i}. \end{align*} Since $c_i\le c_*$, \begin{align*} (D_i e^{\lambda F/2}(x))^2\le \frac{\lambda^2c_i^2}{4}e^{|\lambda|c_*}e^{\lambda F(x)}. \end{align*} Therefore, with $C_p$ denoting the one-coordinate log-Sobolev constant from the two-point inequality, \begin{align*} \operatorname{Ent}_{\mu_p}(e^{\lambda F})\le C_p\sum_{i=1}^n\mathbb E_{\mu_p}\left[(D_i e^{\lambda F/2})^2\right]. \end{align*} Using the preceding pointwise bound in each summand, \begin{align*} \operatorname{Ent}_{\mu_p}(e^{\lambda F})\le \frac{C_p}{4}\lambda^2e^{|\lambda|c_*}\left(\sum_{i=1}^n c_i^2\right)\mathbb E_{\mu_p}[e^{\lambda F}]. \end{align*} Thus one may take $K_p=C_p/4$ with the explicit exponential correction $e^{|\lambda|c_*}$. 
For the upper tail, set $G:=F-\mathbb E_{\mu_p}F$ and \begin{align*} \psi(\lambda):=\log\mathbb E_{\mu_p}[e^{\lambda G}]. \end{align*} The entropy identity gives \begin{align*} \frac{\operatorname{Ent}_{\mu_p}(e^{\lambda G})}{\mathbb E_{\mu_p}[e^{\lambda G}]}=\lambda\psi'(\lambda)-\psi(\lambda). \end{align*} Multiplying $e^{\lambda F}$ by the constant $e^{-\lambda\mathbb E_{\mu_p}F}$ leaves this normalized entropy estimate unchanged, so for $\lambda\ge0$, \begin{align*} \lambda\psi'(\lambda)-\psi(\lambda)\le A\lambda^2e^{\lambda c_*}, \end{align*} where \begin{align*} A:=\frac{C_p}{4}\sum_{i=1}^n c_i^2. \end{align*} Since \begin{align*} \left(\frac{\psi(\lambda)}{\lambda}\right)'=\frac{\lambda\psi'(\lambda)-\psi(\lambda)}{\lambda^2}, \end{align*} integration from $0$ to $\lambda$ gives \begin{align*} \frac{\psi(\lambda)}{\lambda}\le A\int_0^\lambda e^{s c_*}\,ds. \end{align*} If $0\le\lambda\le 1/c_*$, then $e^{s c_*}\le e$ throughout the integral, and therefore \begin{align*} \psi(\lambda)\le eA\lambda^2. \end{align*} Markov's inequality gives \begin{align*} \mathbb P(F-\mathbb E_{\mu_p}F\ge t)\le \exp\{-\lambda t+\psi(\lambda)\}. \end{align*} Combining this with the bound on $\psi$, \begin{align*} \mathbb P(F-\mathbb E_{\mu_p}F\ge t)\le \exp\{-\lambda t+eA\lambda^2\}. \end{align*} Choosing $\lambda=t/(2eA)$ is admissible whenever $t\le 2eA/c_*$, and then \begin{align*} -\lambda t+eA\lambda^2=-\frac{t^2}{4eA}. \end{align*} Thus, in this moderate-deviation range, \begin{align*} \mathbb P(F-\mathbb E_{\mu_p}F\ge t)\le \exp\left(-\frac{t^2}{eC_p\sum_{i=1}^n c_i^2}\right). \end{align*} The tensorized log-Sobolev inequality therefore recovers a sub-Gaussian upper tail at the bounded-difference scale $\sum_i c_i^2$, while the factor involving $c_*$ records the range of $\lambda$ on which the purely quadratic estimate is valid. [/example] The exponential substitution also shows why entropy is more flexible than variance. 
It can absorb nonlinear functions of independent variables provided their coordinate increments have a suitable one-sided structure. ## The Entropy Method for Self-Bounding Functions Bounded differences use uniform coordinate sensitivities, but many statistics have smaller increments when the statistic itself is small. For a counting statistic, a worst-case bound may charge every coordinate even when only a few coordinates can change the count. The entropy method captures the missing scale through self-bounding conditions, where the sum of coordinate influences is controlled by the value of the function. The formal definition isolates exactly the two estimates needed after entropy tensorization: a unit bound on each coordinate contribution and a global bound on the sum of those contributions. [definition: Self-Bounding Function] Let $X_i$ be independent random variables with values in measurable spaces $(E_i,\mathcal E_i)$, and let $F:\prod_{i=1}^n E_i\to[0,\infty)$ be measurable. The function $F$ is self-bounding for the law of $X=(X_1,\dots,X_n)$ if there exist measurable functions $F_i:\prod_{j\ne i}E_j\to\mathbb R$ such that, for every $i$, a.s. \begin{align*} 0\le F-F_i\le 1 \end{align*} and \begin{align*} \sum_{i=1}^n (F-F_i)\le F. \end{align*} [/definition] The auxiliary $F_i$ is often the statistic after coordinate $i$ has been removed or replaced. The definition itself only requires the measurable auxiliary functions and the two inequalities, because different applications choose $F_i$ in different ways. Since entropy tensorization produces a sum over coordinates, the self-bounding conditions are designed to turn that sum into a multiple of $F$ itself rather than a deterministic Lipschitz constant. The next theorem is the payoff: the entropy inequality remembers the random scale of the statistic. [quotetheorem:6757] [citeproof:6757] Each hypothesis has a distinct job, and each can fail in a concrete way. 
The theorem is not a replacement for every bounded-difference inequality: a function may satisfy $|F(x)-F(x')|\le1$ in each coordinate and still fail the global self-bounding condition $\sum_i(F-F_i)\le F$. It also gives upper-tail control in the displayed form; lower tails require a separate lower self-bounding or variance proxy hypothesis. Nonnegativity is needed because the final entropy bound is proportional to $F e^{\lambda F}$; if $F$ is replaced by $G=X_1-1$ for a Bernoulli variable $X_1$, then $G$ takes a negative value and $G e^{\lambda G}$ is negative on part of the space, so it cannot serve as a positive entropy energy. The unit increment condition prevents a single coordinate from creating a large exponential jump: for $F=M X_1$ with $M\gg1$, taking $F_1=0$ and $F_i=F$ for $i\ge2$ makes the sum condition hold with equality, but the increment $F-F_1$ can equal $M$, so the exponential change from $0$ to $M$ is governed by $e^{\lambda M}$ rather than by $e^\lambda$. The sum condition rules out many small local charges attached to a statistic of small value. For instance, let $F=\mathbf 1_{\{X_1=\cdots=X_n=1\}}$ for independent nondegenerate Bernoulli variables with $n\ge2$. When the other coordinates all equal $1$, the requirement $F-F_i\ge0$ at $X_i=0$ forces $F_i\le0$, and the requirement $F-F_i\le1$ at $X_i=1$ forces $F_i\ge0$. Hence on the event $\{X_1=\cdots=X_n=1\}$ every admissible choice gives $F-F_i=1$ for all $i$, so $\sum_i(F-F_i)=n>1=F$ and the sum condition fails although every increment is at most one. By contrast, the indicator $\mathbf 1_{\{X_1+\cdots+X_n\ge1\}}$ with $F_i$ obtained by deleting coordinate $i$ is self-bounding, because $F-F_i=1$ only when $X_i$ is the unique coordinate equal to $1$. Without that condition, tensorization would produce an energy of order the number of active coordinates, not the value of the statistic. The resulting concentration is Poisson-type: Gaussian near the mean and exponential farther out, matching many counting statistics better than a worst-case bounded-difference estimate. [example: Occupancy Statistics] Throw $m$ balls independently into $N$ boxes. For each box $r\in\{1,\dots,N\}$, let \begin{align*} M_r:=\#\{i:\text{ball }i\text{ lands in box }r\}. \end{align*} Then the number of occupied boxes is \begin{align*} F=\sum_{r=1}^N \mathbf 1_{\{M_r\ge1\}}. 
\end{align*} If ball $i$ lands in box $B_i$, and $F_i$ denotes the number of occupied boxes after removing ball $i$, then every box $r\ne B_i$ has the same occupancy status before and after the removal. The only possible change is in box $B_i$, so \begin{align*} F-F_i=\mathbf 1_{\{M_{B_i}\ge1\}}-\mathbf 1_{\{M_{B_i}-1\ge1\}}. \end{align*} Because $M_{B_i}\ge1$ always holds for the box containing ball $i$, the difference equals $1$ exactly when ball $i$ was the only ball in that box: \begin{align*} F-F_i=\mathbf 1_{\{M_{B_i}=1\}}. \end{align*} Hence \begin{align*} 0\le F-F_i\le1. \end{align*} Now sum the coordinate contributions. First group the balls according to the box they occupy: \begin{align*} \sum_{i=1}^m(F-F_i)=\sum_{i=1}^m \mathbf 1_{\{M_{B_i}=1\}}. \end{align*} This is the same as summing inside each box: \begin{align*} \sum_{i=1}^m \mathbf 1_{\{M_{B_i}=1\}}=\sum_{r=1}^N \sum_{i:B_i=r}\mathbf 1_{\{M_r=1\}}. \end{align*} For a fixed box $r$, there are exactly $M_r$ indices $i$ with $B_i=r$, so \begin{align*} \sum_{r=1}^N \sum_{i:B_i=r}\mathbf 1_{\{M_r=1\}}=\sum_{r=1}^N M_r\,\mathbf 1_{\{M_r=1\}}. \end{align*} Since $M_r\,\mathbf 1_{\{M_r=1\}}=\mathbf 1_{\{M_r=1\}}$, this becomes \begin{align*} \sum_{i=1}^m(F-F_i)=\sum_{r=1}^N \mathbf 1_{\{M_r=1\}}. \end{align*} Every singly occupied box is occupied, so \begin{align*} \sum_{r=1}^N \mathbf 1_{\{M_r=1\}}\le \sum_{r=1}^N \mathbf 1_{\{M_r\ge1\}}=F. \end{align*} Thus $F$ satisfies the self-bounding conditions. Applying *Entropy Method for Self-Bounding Functions* gives upper-tail concentration with variance scale governed by $\mathbb E[F]$, the expected number of occupied boxes. [/example] Self-bounding is also stable under many convex constructions where each coordinate contributes a controlled marginal increment. This is the form in which the method often appears in empirical process and randomized combinatorial applications. 
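The occupancy computation above can be verified directly on sampled configurations. The following sketch is illustrative: the helper names, the seed, and the sizes $m=12$, $N=8$ are assumptions introduced here. It recomputes $F_i$ by literally removing ball $i$ and compares the summed increments with the number of singly occupied boxes.

```python
import random
from collections import Counter

def occupied_boxes(assignment):
    """F: number of distinct boxes hit; assignment[i] is the box of ball i."""
    return len(set(assignment))

def removal_increments(assignment):
    """List of F - F_i, where F_i is the occupancy count after removing ball i."""
    total = occupied_boxes(assignment)
    return [
        total - occupied_boxes(assignment[:i] + assignment[i + 1:])
        for i in range(len(assignment))
    ]

def singly_occupied(assignment):
    """Number of boxes r with M_r = 1."""
    return sum(1 for count in Counter(assignment).values() if count == 1)

rng = random.Random(0)  # fixed seed so the run is reproducible
m, N = 12, 8            # hypothetical numbers of balls and boxes
assignment = [rng.randrange(N) for _ in range(m)]
increments = removal_increments(assignment)

print("F                  :", occupied_boxes(assignment))
print("sum of F - F_i     :", sum(increments))
print("singly occupied    :", singly_occupied(assignment))
print("increments in {0,1}:", all(d in (0, 1) for d in increments))
```

Recomputing $F_i$ from scratch, rather than using the closed form $\mathbf 1_{\{M_{B_i}=1\}}$, keeps the check independent of the derivation it is meant to confirm.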
[example: Convex Functions of Independent Bernoulli Variables] Let $X\in\{0,1\}^n$ have independent Bernoulli coordinates, and let \begin{align*} F(X)=\sup_{a\in A}\sum_{j=1}^n a_jX_j, \end{align*} where $A\subset[0,1]^n$ is countable, or has first been replaced by a countable subclass giving the same measurable supremum. For each coordinate $i$, define \begin{align*} F_i(X_{-i}) := \sup_{a\in A}\sum_{j\ne i} a_jX_j. \end{align*} Fix an outcome $X$. Since $a_iX_i\ge0$ for every $a\in A$, \begin{align*} \sum_{j\ne i}a_jX_j\le \sum_{j=1}^n a_jX_j. \end{align*} Taking suprema over $a\in A$ gives \begin{align*} F_i(X_{-i})\le F(X), \end{align*} so $F-F_i\ge0$. Also, because $a_i\in[0,1]$ and $X_i\in\{0,1\}$, \begin{align*} a_iX_i\le1. \end{align*} Thus, for every $a\in A$, \begin{align*} \sum_{j=1}^n a_jX_j =\sum_{j\ne i}a_jX_j+a_iX_i \le F_i(X_{-i})+1. \end{align*} Taking the supremum over $a\in A$ gives \begin{align*} F(X)\le F_i(X_{-i})+1, \end{align*} and hence \begin{align*} 0\le F-F_i\le1. \end{align*} It remains to verify the summed self-bounding condition. If the supremum is attained at some $a^*\in A$, then \begin{align*} F(X)=\sum_{j=1}^n a_j^*X_j. \end{align*} For each $i$, \begin{align*} F_i(X_{-i}) =\sup_{a\in A}\sum_{j\ne i}a_jX_j \ge \sum_{j\ne i}a_j^*X_j. \end{align*} Therefore \begin{align*} F(X)-F_i(X_{-i}) \le \sum_{j=1}^n a_j^*X_j-\sum_{j\ne i}a_j^*X_j =a_i^*X_i. \end{align*} Summing over $i$, \begin{align*} \sum_{i=1}^n(F-F_i) \le \sum_{i=1}^n a_i^*X_i =F. \end{align*} If the supremum is not attained, then for every $\varepsilon>0$ choose $a^{(\varepsilon)}\in A$ such that \begin{align*} \sum_{j=1}^n a_j^{(\varepsilon)}X_j\ge F(X)-\varepsilon. \end{align*} For each $i$, \begin{align*} F_i(X_{-i}) \ge \sum_{j\ne i}a_j^{(\varepsilon)}X_j, \end{align*} so \begin{align*} F-F_i \le F-\sum_{j\ne i}a_j^{(\varepsilon)}X_j =F-\sum_{j=1}^n a_j^{(\varepsilon)}X_j+a_i^{(\varepsilon)}X_i \le \varepsilon+a_i^{(\varepsilon)}X_i. 
\end{align*} After summing in $i$, \begin{align*} \sum_{i=1}^n(F-F_i) \le n\varepsilon+\sum_{i=1}^n a_i^{(\varepsilon)}X_i \le n\varepsilon+F. \end{align*} Letting $\varepsilon\downarrow0$ gives \begin{align*} \sum_{i=1}^n(F-F_i)\le F. \end{align*} Thus this supremum statistic satisfies the self-bounding conditions: each coordinate changes the value by at most one, and the total removable contribution is controlled by the value of $F$ itself. [/example] The chapter's progression is therefore local-to-global, continuing the entropy tensorization theme of Chapter 1. A two-point logarithmic Sobolev inequality controls entropy in one Bernoulli coordinate, tensorization sums these controls across a product space, and the entropy method converts coordinate replacement estimates into concentration for nonlinear statistics. # 5. Gaussian Isoperimetry This chapter studies the sharp form of Gaussian concentration through the geometry of sets. In Chapters 2 and 3, entropy and logarithmic Sobolev inequalities produced sub-Gaussian tails by controlling moment generating functions. Gaussian isoperimetry gives a more geometric route: it identifies which sets have the smallest Gaussian boundary for a given Gaussian measure, and it turns that extremal statement into concentration around medians and sets. ## Gaussian Boundary Measure and Enlargement of Sets How should the boundary size of a measurable set be measured under Gaussian measure? Euclidean surface area is not the right invariant quantity, since Gaussian mass is weighted toward the origin. The useful object is the infinitesimal rate at which Gaussian measure grows when we thicken a set by Euclidean distance. [definition: Standard Gaussian Measure] Let $\gamma_n:\mathcal B(\mathbb R^n)\to[0,1]$ denote the standard Gaussian probability measure on $\mathbb R^n$, defined by \begin{align*} \gamma_n(A) = \int_A (2\pi)^{-n/2} e^{-|x|^2/2}\,d\mathcal L^n(x),\qquad A\in\mathcal B(\mathbb R^n). 
\end{align*} [/definition] The measure $\gamma_n$ is rotation-invariant but not translation-invariant, so translating a set changes its Gaussian mass. For example, in one dimension the intervals $[-1,1]$ and $[9,11]$ have the same Euclidean length, but their Gaussian measures are very different because the density near $10$ is exponentially smaller than the density near $0$. This is the first sign that Euclidean perimeter alone cannot be the right boundary quantity in Gaussian space. The next definition introduces the Euclidean enlargement of a set, which is the operation whose Gaussian mass growth will later define boundary measure. [definition: Euclidean Enlargement] For each $r \ge 0$, the Euclidean $r$-enlargement operation is the map $\mathcal P(\mathbb R^n)\to\mathcal P(\mathbb R^n)$ defined by \begin{align*} A \longmapsto A_r := \{x \in \mathbb R^n : \operatorname{dist}(x,A) \le r\}. \end{align*} [/definition] For nonempty $A$, the map $x\mapsto\operatorname{dist}(x,A)$ is $1$-Lipschitz, so $A_r$ is closed and hence Borel for every $r\ge 0$; measurability of $A$ itself is still needed so that $\gamma_n(A)$ is defined, and the statements below are understood for such sets. The enlargement $A_r$ records all points whose distance to $A$ is at most $r$, so $r \mapsto \gamma_n(A_r)$ is the distribution function of the distance from a Gaussian point to $A$. This motivates the next definition: Gaussian boundary measure is the lower right derivative at $r=0$ of this enlargement mass. [definition: Gaussian Boundary Measure] The lower Gaussian boundary measure is the functional from measurable subsets of $\mathbb R^n$ to $[0,\infty]$ defined by \begin{align*} A \longmapsto \gamma_n^+(A) := \liminf_{r \downarrow 0} \frac{\gamma_n(A_r)-\gamma_n(A)}{r}. \end{align*} [/definition] This definition turns boundary estimates into differential information about enlargement. The comparison problem is now to know what lower boundary size is forced by the single number $\gamma_n(A)$. 
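As a numerical illustration of the definition, the sketch below evaluates the enlargement difference quotient for a one-dimensional interval $A=[c,d]$, whose enlargement is $A_r=[c-r,d+r]$. The helper names are illustrative and not part of the course; the only library object used is Python's `statistics.NormalDist`. For this set the quotient should approach the sum of the standard normal density values at the two endpoints as $r\downarrow 0$.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal law on the real line


def gauss_interval(c, d):
    """Gaussian measure of the interval [c, d]."""
    return STD.cdf(d) - STD.cdf(c)


def enlargement_quotient(c, d, r):
    """Difference quotient (gamma(A_r) - gamma(A)) / r for A = [c, d]."""
    return (gauss_interval(c - r, d + r) - gauss_interval(c, d)) / r


def interval_boundary(c, d):
    """Limit of the quotient: density at the left plus the right endpoint."""
    return STD.pdf(c) + STD.pdf(d)
```

Calling `enlargement_quotient(-1.0, 2.0, r)` for decreasing `r` and comparing with `interval_boundary(-1.0, 2.0)` shows the convergence for one concrete interval.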
Half-spaces provide the model obstruction: their Gaussian enlargement is just a one-dimensional threshold shift, so their boundary contribution can be computed from the density of a standard normal variable at the corresponding quantile. [definition: Gaussian Isoperimetric Profile] Let $\Phi(t)=\gamma_1((-\infty,t])$ be the standard normal distribution function and let $\phi(t)=(2\pi)^{-1/2}e^{-t^2/2}$ be its density. The Gaussian isoperimetric profile is the function $I:[0,1]\to[0,\infty)$ defined by $I(0)=I(1)=0$ and, for $u\in(0,1)$, \begin{align*} I(u) := \phi(\Phi^{-1}(u)). \end{align*} [/definition] The profile $I$ is the boundary measure of a half-line with Gaussian mass $u$. The central extremal question is whether any higher-dimensional set can have smaller Gaussian boundary than this one-dimensional model with the same measure. [quotetheorem:6759] [citeproof:6759] The theorem is sharp in every dimension and at every volume. At the endpoint volumes $0$ and $1$, the extended profile gives $I(0)=I(1)=0$, so the inequality asserts only the non-negativity of lower boundary measure. The measurability hypothesis is essential because $\gamma_n(A)$ and $\gamma_n(A_r)$ must be defined. This is not a harmless technicality: take a non-Lebesgue-measurable set $V\subset[0,1]$ and view it as a subset of $\mathbb R$ under the Gaussian density. Since the density is bounded above and below by positive constants on $[0,1]$, measurability of $V$ for $\gamma_1$ would be equivalent to Lebesgue measurability of $V$; hence $\gamma_1(V)$ is not defined, and neither side of the proposed comparison has a well-defined value. The theorem does not identify all equality cases in this form, and it does not say that Euclidean surface area is minimized by half-spaces; the boundary is measured through Gaussian enlargement, so the ambient Gaussian weight is part of the statement. 
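The inequality can be sanity-checked in dimension one on sets whose boundary is explicit. The sketch below is an illustration under stated assumptions, not a proof: for the symmetric interval $A=[-b,b]$ the enlargement is $[-b-r,b+r]$, so $\gamma_1(A)=2\Phi(b)-1$ and $\gamma_1^+(A)=2\phi(b)$. The function names `profile` and `symmetric_interval` are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal law on the real line


def profile(u):
    """Gaussian isoperimetric profile I(u) = phi(Phi^{-1}(u)), with I(0) = I(1) = 0."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return STD.pdf(STD.inv_cdf(u))


def symmetric_interval(b):
    """Gaussian mass and lower boundary measure of A = [-b, b] in dimension one."""
    mass = 2.0 * STD.cdf(b) - 1.0
    boundary = 2.0 * STD.pdf(b)
    return mass, boundary
```

For each `b`, comparing `profile(mass)` with `boundary` tests the inequality on that interval, while `profile(STD.cdf(a))` reproduces the half-line value `STD.pdf(a)`.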
Unlike Euclidean isoperimetry, where balls extremise surface area, the Gaussian weight makes half-spaces the extremal sets. For instance, moving a Euclidean ball far from the origin leaves its Euclidean surface area unchanged but makes both its Gaussian mass and its Gaussian boundary exponentially small. The next computation records the equality case that fixes the profile and prepares the enlargement formulation used for concentration. [example: Boundary Of A Gaussian Half-Space] Fix $a\in\mathbb R$ and let $H_a=\{x\in\mathbb R^n:x_1\le a\}$. By the product form of the standard Gaussian density and Fubini's theorem, \begin{align*} \gamma_n(H_a)=\left(\int_{-\infty}^a (2\pi)^{-1/2}e^{-s^2/2}\,ds\right)\prod_{j=2}^n\left(\int_{-\infty}^{\infty}(2\pi)^{-1/2}e^{-x_j^2/2}\,dx_j\right). \end{align*} Each full one-dimensional Gaussian integral in the product equals $1$, while the first integral is $\Phi(a)$, so \begin{align*} \gamma_n(H_a)=\Phi(a). \end{align*} For $x\in\mathbb R^n$, if $x_1\le a$ then $x\in H_a$ and $\operatorname{dist}(x,H_a)=0$. If $x_1>a$, the point $(a,x_2,\ldots,x_n)$ lies in $H_a$ and has distance $x_1-a$ from $x$; every point $y\in H_a$ satisfies $y_1\le a$, hence $|x-y|\ge |x_1-y_1|\ge x_1-a$. Therefore $\operatorname{dist}(x,H_a)=x_1-a$ when $x_1>a$. It follows that $\operatorname{dist}(x,H_a)\le r$ exactly when $x_1\le a+r$, and hence \begin{align*} (H_a)_r=H_{a+r}. \end{align*} Using the definition of lower Gaussian boundary measure, \begin{align*} \gamma_n^+(H_a)=\liminf_{r\downarrow 0}\frac{\gamma_n((H_a)_r)-\gamma_n(H_a)}{r}. \end{align*} Substituting $(H_a)_r=H_{a+r}$ and $\gamma_n(H_b)=\Phi(b)$ gives \begin{align*} \gamma_n^+(H_a)=\liminf_{r\downarrow 0}\frac{\Phi(a+r)-\Phi(a)}{r}. \end{align*} Since $\Phi$ is differentiable and $\Phi'(a)=\phi(a)$, the one-sided difference quotient has limit $\phi(a)$, so \begin{align*} \gamma_n^+(H_a)=\phi(a). 
\end{align*} Also $\Phi^{-1}(\Phi(a))=a$, so the definition of the Gaussian isoperimetric profile gives \begin{align*} I(\gamma_n(H_a))=I(\Phi(a))=\phi(\Phi^{-1}(\Phi(a)))=\phi(a). \end{align*} Thus $\gamma_n^+(H_a)=I(\gamma_n(H_a))$, so Gaussian half-spaces attain equality in the [Gaussian isoperimetric inequality](/theorems/6759). [/example] ## Half-Spaces as Extremizers Why do half-spaces, rather than balls, control Gaussian enlargement? A half-space is determined by a single Gaussian coordinate, so its enlargement is governed by translation of a one-dimensional threshold. This makes it possible to compare every set with the half-space having the same Gaussian measure. [definition: Gaussian Half-Space] A Gaussian half-space in $\mathbb R^n$ is a set of the form \begin{align*} H_{v,a}:=\{x\in\mathbb R^n: \langle x,v\rangle \le a\}, \end{align*} where $v\in\mathbb R^n$ satisfies $|v|=1$ and $a\in\mathbb R$. [/definition] For such a half-space, the random variable $\langle G,v\rangle$ is standard normal when $G\sim\gamma_n$. Hence $\gamma_n(H_{v,a})=\Phi(a)$ and $(H_{v,a})_r=H_{v,a+r}$, so enlarging the set shifts the one-dimensional threshold. The Gaussian isoperimetric inequality identifies the correct infinitesimal comparison at $r=0$, but concentration estimates need positive-radius information: the event that a Gaussian point lies more than distance $r$ from a set is the complement of $A_r$, not an infinitesimal boundary event. The half-space calculation therefore suggests the finite comparison that must be proved next: if $A$ has the same initial mass as $H_{v,a}$, then $A_r$ should have at least the mass of $H_{v,a+r}$ for every $r\ge 0$. This finite-radius statement is the form of isoperimetry that will be applied directly to median sublevel sets. It is obtained by integrating the boundary lower bound along the enlargement flow $r\mapsto A_r$, with the half-space curve $\Phi(a+r)$ serving as the comparison solution. 
In the form used here, Borell's Gaussian enlargement inequality says that if $A\subseteq\mathbb R^n$ is measurable, $0<\gamma_n(A)<1$, and $a=\Phi^{-1}(\gamma_n(A))$, then for every $r\ge 0$, \begin{align*} \gamma_n(A_r)\ge \Phi(a+r). \end{align*} Equivalently, \begin{align*} \gamma_n(A_r)\ge \Phi(\Phi^{-1}(\gamma_n(A))+r). \end{align*} If $\gamma_n(A)=0$, the endpoint convention gives the weaker statement $\gamma_n(A_r)\ge 0$. If $\gamma_n(A)=1$, then $\gamma_n(A_r)=1$ for every $r\ge 0$. Borell's inequality is stronger than the boundary statement because it controls every enlargement radius, not only the infinitesimal growth at $0$. The hypothesis $0<\gamma_n(A)<1$ is exactly what makes the real threshold $a=\Phi^{-1}(\gamma_n(A))$ exist; null sets and full-measure sets are handled by the endpoint statements rather than by choosing a finite $a$. The measurability assumption again matters because the comparison is a statement about Gaussian probabilities of $A$ and $A_r$. These hypotheses are needed within the theorem itself. Separate limitations arise when the setting is changed: under the uniform probability measure on the Euclidean ball in $\mathbb R^n$, the Euclidean isoperimetric behaviour is governed by spherical caps and boundary effects from the ball, not by Gaussian half-spaces with profile $\Phi(a+r)$. If the Euclidean distance is replaced by a different metric such as $\ell^\infty$ distance, enlargement grows by boxes rather than round Euclidean neighbourhoods, so the one-dimensional Gaussian threshold shift no longer describes the extremal comparison. Borell's form is the one most directly used in concentration, since the event that a Gaussian point is far from a large set is the complement of an enlargement. [example: Distance To A Measurable Set] Let $A\subseteq\mathbb R^n$ be measurable with $\gamma_n(A)\ge 1/2$, let $G\sim\gamma_n$, and fix $r\ge 0$. By the definition of Euclidean enlargement, \begin{align*} A_r=\{x\in\mathbb R^n:\operatorname{dist}(x,A)\le r\}. 
\end{align*} Therefore the event $\{\operatorname{dist}(G,A)\le r\}$ is exactly the event $\{G\in A_r\}$, and since $G$ has law $\gamma_n$, \begin{align*} \mathbb P(\operatorname{dist}(G,A)>r)=1-\gamma_n(A_r). \end{align*} If $\gamma_n(A)=1$, then $\gamma_n(A_r)=1$ because $A\subseteq A_r$, so $\mathbb P(\operatorname{dist}(G,A)>r)=0$. Otherwise $1/2\le \gamma_n(A)<1$, so we may set \begin{align*} a=\Phi^{-1}(\gamma_n(A)). \end{align*} Since $\Phi(0)=1/2$ and $\Phi$ is increasing, the inequality $\gamma_n(A)\ge 1/2$ gives $a\ge 0$. By *Borell Inequality*, \begin{align*} \gamma_n(A_r)\ge \Phi(a+r). \end{align*} Because $a+r\ge r$ and $\Phi$ is increasing, \begin{align*} \Phi(a+r)\ge \Phi(r). \end{align*} Combining these inequalities gives \begin{align*} \gamma_n(A_r)\ge \Phi(r). \end{align*} Hence \begin{align*} \mathbb P(\operatorname{dist}(G,A)>r)\le 1-\Phi(r). \end{align*} Finally, the standard Gaussian tail bound $1-\Phi(r)\le e^{-r^2/2}$ for $r\ge 0$ yields \begin{align*} \mathbb P(\operatorname{dist}(G,A)>r)\le e^{-r^2/2}. \end{align*} Thus any measurable set of Gaussian measure at least $1/2$ captures a standard Gaussian point within distance $r$ except with Gaussian tail probability. [/example] ## From Gaussian Isoperimetry to Concentration of Lipschitz Functions How does a statement about sets become a tail bound for functions? A Lipschitz function has sublevel sets whose enlargements are contained in higher sublevel sets. If a median sublevel set has measure at least $1/2$, Borell's inequality forces the function to remain close to its median with high probability. [definition: Lipschitz Function On Gaussian Space] A function $f:\mathbb R^n\to\mathbb R$ is $L$-Lipschitz if \begin{align*} |f(x)-f(y)|\le L|x-y| \end{align*} for all $x,y\in\mathbb R^n$. [/definition] The Lipschitz condition turns geometric distance into control of function values: if $A=\{x:f(x)\le m\}$, then points near $A$ lie in a higher sublevel set of $f$. 
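The containment of enlarged sublevel sets can be made concrete in dimension one with the $1$-Lipschitz function $f(x)=|x|$. The following sketch assumes $G\sim\mathcal N(0,1)$ and takes $m$ with $2\Phi(m)-1=1/2$, so that $A=\{|x|\le m\}$ has Gaussian measure $1/2$ and $A_r=\{|x|\le m+r\}$. It compares the exact tail $\mathbb P(|G|>m+r)$ with the bound $1-\Phi(r)$ from the distance example above; the names `MEDIAN_ABS`, `upper_tail_abs`, and `borell_bound` are illustrative only.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal law on the real line

# Level m with P(|G| <= m) = 2*Phi(m) - 1 = 1/2, i.e. Phi(m) = 3/4.
MEDIAN_ABS = STD.inv_cdf(0.75)


def upper_tail_abs(r):
    """Exact value of P(|G| > m + r), the mass outside the enlargement A_r."""
    return 2.0 * (1.0 - STD.cdf(MEDIAN_ABS + r))


def borell_bound(r):
    """Bound 1 - Phi(r) for sets of Gaussian measure at least 1/2."""
    return 1.0 - STD.cdf(r)
```

At $r=0$ the two quantities coincide at $1/2$; for larger $r$ the exact tail sits below the bound, as the enlargement inequality requires.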
The next definition introduces medians, which provide sublevel and superlevel sets with the measure threshold required by Borell's inequality. [definition: Median] Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. A median of $X$ is a number $m\in\mathbb R$ such that \begin{align*} \mathbb P(X\le m)\ge \frac12,\qquad \mathbb P(X\ge m)\ge \frac12. \end{align*} [/definition] Medians are natural for isoperimetry because both median sublevel and superlevel sets have measure at least $1/2$. This motivates the following concentration theorem, which combines Borell enlargement with the containment of enlarged sublevel sets inside higher sublevel sets. [quotetheorem:6761] [citeproof:6761] This theorem is often the cleanest way to see that every Lipschitz observable of a standard Gaussian vector has fluctuations of order its Lipschitz constant. The Lipschitz hypothesis is the mechanism that converts distance from a sublevel set into a bound on the function value; without it, a function can have arbitrarily large jumps on sets of tiny Gaussian measure and no Gaussian tail bound of this form can hold. The median hypothesis is also structural rather than cosmetic: it supplies a sublevel set and a superlevel set of measure at least $1/2$, exactly the threshold at which Borell's inequality gives the clean comparison with $\Phi(t/L)$. The result controls deviations from a median, not pointwise oscillation of $f$ or higher moments without additional integration. The first illustration records the median statement in its most reusable form, for a general $1$-Lipschitz function. [example: Concentration Around A Median For One-Lipschitz Functions] Let $G\sim\gamma_n$, let $f:\mathbb R^n\to\mathbb R$ be $1$-Lipschitz, let $m$ be a median of $f(G)$, and fix $t\ge 0$. The claim is that $\mathbb P(|f(G)-m|\ge t)\le 2(1-\Phi(t))\le 2e^{-t^2/2}$. For $t=0$, the first bound reads $\mathbb P(|f(G)-m|\ge 0)\le 2(1-\Phi(0))=1$, which holds trivially, so assume $t>0$. Fix $0<\varepsilon<t$. 
Since $m+t>m+t-\varepsilon$, the event $\{f(G)\ge m+t\}$ is contained in $\{f(G)>m+t-\varepsilon\}$, so \begin{align*} \mathbb P(f(G)\ge m+t)\le \mathbb P(f(G)>m+t-\varepsilon). \end{align*} Since $f$ is $1$-Lipschitz, *[Gaussian Concentration Around A Median](/theorems/6761)* applied with $L=1$ and deviation $t-\varepsilon>0$ gives \begin{align*} \mathbb P(f(G)>m+t-\varepsilon)\le 1-\Phi(t-\varepsilon). \end{align*} Thus \begin{align*} \mathbb P(f(G)\ge m+t)\le 1-\Phi(t-\varepsilon). \end{align*} Letting $\varepsilon\downarrow 0$ and using continuity of $\Phi$ yields \begin{align*} \mathbb P(f(G)\ge m+t)\le 1-\Phi(t). \end{align*} The lower-tail part of the same theorem gives, by the same argument, \begin{align*} \mathbb P(f(G)\le m-t)\le 1-\Phi(t). \end{align*} The event $|f(G)-m|\ge t$ is contained in the union of the two one-sided events: \begin{align*} \{|f(G)-m|\ge t\}\subseteq \{f(G)\ge m+t\}\cup\{f(G)\le m-t\}. \end{align*} By the union bound, \begin{align*} \mathbb P(|f(G)-m|\ge t)\le \mathbb P(f(G)\ge m+t)+\mathbb P(f(G)\le m-t). \end{align*} Substituting the two one-sided bounds gives \begin{align*} \mathbb P(|f(G)-m|\ge t)\le 2(1-\Phi(t)). \end{align*} Using the standard Gaussian tail estimate $1-\Phi(t)\le e^{-t^2/2}$ for $t\ge 0$, we obtain \begin{align*} \mathbb P(|f(G)-m|\ge t)\le 2e^{-t^2/2}. \end{align*} Thus a one-Lipschitz observable of a standard Gaussian vector has median deviations on a scale independent of the dimension $n$, even when it depends on all coordinates. [/example] The preceding example isolates the method: find a Lipschitz constant, choose a median, and apply Borell's inequality to sublevel sets. A concrete high-dimensional observable is the Euclidean norm, whose concentration describes the thin annulus phenomenon for Gaussian vectors. [example: Euclidean Norm Of A Gaussian Vector] Let $G=(G_1,\ldots,G_n)\sim\gamma_n$ and set $f(x)=|x|$. For any $x,y\in\mathbb R^n$, the triangle inequality gives \begin{align*} |x|=|x-y+y|\le |x-y|+|y|. \end{align*} Subtracting $|y|$ from both sides gives \begin{align*} |x|-|y|\le |x-y|. 
\end{align*} Interchanging $x$ and $y$ gives \begin{align*} |y|-|x|\le |x-y|. \end{align*} The two inequalities together imply \begin{align*} \bigl||x|-|y|\bigr|\le |x-y|, \end{align*} so $f$ is $1$-Lipschitz. Let $m_n$ be a median of $|G|$. Applying *Concentration Around A Median For One-Lipschitz Functions* to this $f$ gives, for every $t\ge 0$, \begin{align*} \mathbb P\bigl(|f(G)-m_n|\ge t\bigr)\le 2e^{-t^2/2}. \end{align*} Since $f(G)=|G|$, this becomes \begin{align*} \mathbb P\bigl(|\,|G|-m_n\,|\ge t\bigr)\le 2e^{-t^2/2}. \end{align*} The radius $\sqrt n$ is the natural scale because \begin{align*} |G|^2=\sum_{j=1}^n G_j^2. \end{align*} Taking expectations and using linearity of expectation, \begin{align*} \mathbb E|G|^2=\sum_{j=1}^n \mathbb E[G_j^2]. \end{align*} Each coordinate $G_j$ is standard normal, so $\mathbb E[G_j^2]=1$. Hence \begin{align*} \mathbb E|G|^2=\sum_{j=1}^n 1=n. \end{align*} Thus the Euclidean norm of a standard Gaussian vector lies in an annulus of dimension-free Gaussian width around its median, with the ambient radius scale set by $\sqrt n$. [/example] The norm example shows the strength of median concentration, but many analytic estimates are stated around the mean. This motivates the following mean-centered version and prepares the comparison with the logarithmic Sobolev method from Chapter 3, where $\mathbb E[f(G)]$ is the natural center. [quotetheorem:6763] [citeproof:6763] The distinction between median and mean is pedagogically useful. The Lipschitz assumption remains essential in the mean-centered statement: without it, a function of a Gaussian vector may have much heavier tails or may fail to have a finite expectation. A concrete failure is obtained by taking $G_1\sim\mathcal N(0,1)$ and $f(x)=e^{x_1^2}$ on $\mathbb R^n$. This function is not Lipschitz, and $\mathbb E[e^{G_1^2}]=\infty$, so a mean-centered Gaussian tail estimate is not even well-defined. 
By contrast, a Lipschitz function is continuous and hence Borel measurable, and the Gaussian tail bound around a median implies integrability by integrating the tail probabilities. The split between $L>0$ and $L=0$ is only to avoid dividing by the Lipschitz constant; when $L=0$, the function is constant and the centered variable is identically zero, so only the convention at $t=0$ in one-sided tail statements needs attention. The theorem also does not assert that the mean is an extremal level set in the isoperimetric sense; the mean enters through integration of tails or through the entropy method. The isoperimetric route passes through a median and then compares the median to the mean, so it loses constants and does not by itself give the sharp mean-centered constant $1/(2L^2)$. Isoperimetry naturally produces medians, while entropy naturally produces moment generating function bounds around the mean. ## Comparison With Log-Sobolev Concentration What does Gaussian isoperimetry add beyond the logarithmic Sobolev method? Both methods prove dimension-free Gaussian concentration for Lipschitz functions, but they encode different structural information. Log-Sobolev concentration is analytic and stable under tensorization; isoperimetry is geometric and sharp for sets. [quotetheorem:6765] The theorem was proved in Chapter 3 by the Gaussian logarithmic Sobolev inequality and Herbst's argument. We recall it here to compare the output with the isoperimetric theorem rather than to give another proof. [remark: Strengths Of The Two Methods] Gaussian isoperimetry gives the optimal enlargement inequality for all measurable sets and identifies half-spaces as extremizers. The logarithmic Sobolev method gives a direct estimate on the log-Laplace transform of $f(G)-\mathbb E[f(G)]$ and extends naturally to product measures satisfying suitable functional inequalities. 
The outputs are different even when both imply concentration: Borell's inequality estimates $\gamma_n(A_r)$ directly for an arbitrary measurable set $A$, while the log-Sobolev method estimates $\mathbb E[e^{\lambda(f(G)-\mathbb E[f(G)])}]$ for sufficiently regular Lipschitz functions $f$. Thus a set enlargement statement for a rough set is available from isoperimetry before any choice of function or exponential moment estimate is made. The two methods agree on the sub-Gaussian scale for Lipschitz functions, but only the isoperimetric method records the sharp set geometry. [/remark] The comparison also explains why both methods appear in a course on entropy and transport. Entropy controls exponential moments, transport controls distances between measures, and isoperimetry controls enlargements of sets; the sharp example behind the Gaussian theory is a half-space. [example: Half-Space Shows Sharpness] Let $f(x)=x_1$ and let $G=(G_1,\ldots,G_n)\sim\gamma_n$. By the product form of the standard Gaussian measure, the first coordinate $G_1$ has density $\phi(s)=(2\pi)^{-1/2}e^{-s^2/2}$, so $f(G)=G_1\sim\mathcal N(0,1)$. For any $x,y\in\mathbb R^n$, \begin{align*} |f(x)-f(y)|^2=|x_1-y_1|^2\le \sum_{j=1}^n |x_j-y_j|^2=|x-y|^2. \end{align*} Taking square roots gives $|f(x)-f(y)|\le |x-y|$, so $f$ is $1$-Lipschitz. The mean is \begin{align*} \mathbb E[f(G)]=\mathbb E[G_1]=\int_{-\infty}^{\infty}s\phi(s)\,ds=0, \end{align*} because $s\phi(s)$ is an odd integrable function. Also, \begin{align*} \mathbb P(f(G)\le 0)=\mathbb P(G_1\le 0)=\Phi(0)=\frac12. \end{align*} Since $G_1$ has a continuous density, $\mathbb P(G_1=0)=0$, and therefore \begin{align*} \mathbb P(f(G)\ge 0)=1-\mathbb P(G_1<0)=1-\frac12=\frac12. \end{align*} Thus $0$ is both the mean and a median of $f(G)$. For $t\ge 0$, \begin{align*} \mathbb P(f(G)\ge t)=\mathbb P(G_1\ge t)=\int_t^\infty \phi(s)\,ds=1-\int_{-\infty}^t\phi(s)\,ds=1-\Phi(t). 
\end{align*} Because $G_1$ has a continuous density, $\mathbb P(f(G)>t)=\mathbb P(f(G)\ge t)=1-\Phi(t)$. Hence the Gaussian isoperimetric upper-tail bound around the median is attained by the linear functional $x\mapsto x_1$, whose sublevel sets $\{x:x_1\le a\}$ are exactly Gaussian half-spaces. [/example] The half-space example is the equality case that keeps the constants honest. It also points toward the next part of the course, where the geometry of enlargements is replaced by product and transport distances suited to non-Gaussian spaces. [explanation: Role In The Course] Gaussian isoperimetry is the geometric benchmark for the rest of the course. Chapter 6 gives product-space analogues through Talagrand's convex distance, replacing distance to a set by a convex distance. Chapters 7 and 8 then connect concentration to metric control of probability measures through transportation-cost inequalities. The Gaussian case remains the model example because its extremizers, boundary profile, and Lipschitz concentration are all explicit. [/explanation] # 6. Talagrand's Convex Distance Inequality This chapter brings the entropy and isoperimetric methods of the course into finite product spaces where the geometry is not Euclidean. It uses Chapter 1's [tensorization of entropy](/theorems/6733), Chapter 2's Herbst-type concentration arguments, independence on product probability spaces, and elementary finite-dimensional convexity, especially convex hulls and separating hyperplanes in $\mathbb R^n$. The central question is how far a random point in a product space is from a set $A$, when distance is measured by the best weighted collection of coordinates that certifies membership in $A$. Talagrand's convex distance inequality gives a dimension-free exponential bound for this distance and is the product-space analogue of a sharp isoperimetric principle. 
We then use it to obtain concentration for convex Lipschitz functions and to control combinatorial random structures whose large values have small certificates. ## Certificate Geometry in Product Spaces A product space has many coordinate directions, and ordinary Hamming distance treats every coordinate equally. For concentration, this is often the wrong geometry: an event may be certified by checking only a small collection of coordinates, even if many other coordinates are irrelevant. The first problem is to record, for a point outside a set, which coordinate changes can move the point into the set. [definition: Coordinate Disagreement Vector] Let $X = X_1 \times \cdots \times X_n$ be a product set and let $A \subset X$ be nonempty. Define the map \begin{align*} h: X\times A \to \{0,1\}^n \end{align*} by \begin{align*} h_i(x,y) = \mathbb{1}_{\{x_i \ne y_i\}}, \qquad 1 \le i \le n. \end{align*} The set of disagreement vectors from $x$ to $A$ is \begin{align*} U_A(x) = \{h(x,y) : y \in A\} \subset \{0,1\}^n. \end{align*} [/definition] The set $U_A(x)$ records all coordinate-change patterns that can carry $x$ into $A$. Minimising the number of ones over $U_A(x)$ would give ordinary Hamming distance, but concentration will need a smoother quantity that can be tested by weighted coordinate functionals. This leads to the Euclidean distance from the origin to the convex hull of the possible disagreement patterns. [definition: Talagrand Convex Distance] Let $X = X_1 \times \cdots \times X_n$ and let $A \subset X$ be nonempty. The Talagrand convex distance to $A$ is the map \begin{align*} d_T(\cdot,A):X\to [0,\infty) \end{align*} defined by \begin{align*} d_T(x,A) = \inf_{u \in \operatorname{conv} U_A(x)} |u|, \end{align*} where $|u|$ is the Euclidean norm on $\mathbb R^n$. [/definition] The definition is geometric, but applications usually need a way to prove lower bounds on $d_T(x,A)$. 
Such lower bounds come from choosing coordinate weights and proving that every point of $A$ must disagree with $x$ on enough weighted coordinates. The following dual formula makes this certificate interpretation exact. [quotetheorem:6767] [citeproof:6767] The dual formula shows that convex distance is not an abstract relaxation of Hamming distance; it is the value of the best normalised coordinate test. The nonnegativity of the weights is essential because the geometry records coordinate disagreements, not signed coordinate displacement: allowing negative weights would make a lower bound meaningless, since a coordinate on which all points of $A$ must disagree could be assigned a negative coefficient and reduce the infimum. The restriction $|\alpha|\le 1$ is also necessary; otherwise the supremum could be multiplied by an arbitrary constant whenever $x\notin A$. What the formula does not provide is a metric triangle inequality or a way to measure how far a coordinate has moved inside its coordinate space; it only tests whether two coordinates agree. The simplest place to see the fractional nature of the convex hull is the discrete cube, where a monotone threshold event has many interchangeable certificates. [example: Distance to a Monotone Event in the Discrete Cube] Let $X=\{0,1\}^n$, let $r$ be an integer with $1\le r\le n$, and let \begin{align*} A=\left\{z\in\{0,1\}^n:\sum_{i=1}^n z_i\ge r\right\}. \end{align*} Fix $x\notin A$ with exactly $k<r$ coordinates equal to one, and set \begin{align*} S_0=\{i:x_i=0\},\qquad N=|S_0|=n-k,\qquad q=r-k. \end{align*} Thus $q$ is the number of additional coordinates that must be set to one to reach $A$, and $1\le q\le N$ because $k<r\le n$. For every subset $J\subset S_0$ with $|J|=q$, define $y^J$ by setting $y_i^J=1$ for $i\in J$ and $y_i^J=x_i$ for $i\notin J$. Then \begin{align*} \sum_{i=1}^n y_i^J=\sum_{i=1}^n x_i+\sum_{i\in J}(1-0)=k+q=r. \end{align*} Hence $y^J\in A$, and the disagreement vector $h(x,y^J)$ is the indicator vector of $J$. 
Average the vectors $h(x,y^J)$ over all subsets $J\subset S_0$ with $|J|=q$. The average is a vector $u\in\operatorname{conv}U_A(x)$. For a fixed $i\in S_0$, exactly $\binom{N-1}{q-1}$ of the $\binom{N}{q}$ subsets contain $i$, so \begin{align*} u_i=\frac{\binom{N-1}{q-1}}{\binom{N}{q}}=\frac{(N-1)!}{(q-1)!(N-q)!}\cdot\frac{q!(N-q)!}{N!}=\frac{q}{N}. \end{align*} For $i\notin S_0$, every averaged vector has coordinate $0$, so $u_i=0$. Therefore \begin{align*} d_T(x,A)\le |u|=\left(\sum_{i\in S_0}\left(\frac{q}{N}\right)^2\right)^{1/2}=\left(N\frac{q^2}{N^2}\right)^{1/2}=\frac{q}{\sqrt N}. \end{align*} For the reverse inequality, take the weight vector $\alpha$ with $\alpha_i=N^{-1/2}$ for $i\in S_0$ and $\alpha_i=0$ for $i\notin S_0$. Its Euclidean norm is \begin{align*} |\alpha|=\left(\sum_{i\in S_0}\frac{1}{N}\right)^{1/2}=1. \end{align*} Now fix any $y\in A$, and let \begin{align*} b=\sum_{i\in S_0}\mathbb 1_{\{y_i=1\}}. \end{align*} The coordinates where $x_i=1$ can contribute at most $k$ ones to $y$, while $y\in A$ has at least $r$ ones. Hence \begin{align*} k+b\ge \sum_{i=1}^n y_i\ge r. \end{align*} Thus $b\ge r-k=q$. Since $x_i=0$ on $S_0$, we get \begin{align*} \sum_{i=1}^n \alpha_i\mathbb 1_{\{x_i\ne y_i\}}=\frac{1}{\sqrt N}\sum_{i\in S_0}\mathbb 1_{\{y_i=1\}}=\frac{b}{\sqrt N}\ge \frac{q}{\sqrt N}. \end{align*} Taking the infimum over $y\in A$ and applying the *Dual Formula for Convex Distance* gives \begin{align*} d_T(x,A)\ge \frac{q}{\sqrt N}. \end{align*} Combining the two bounds, \begin{align*} d_T(x,A)=\frac{q}{\sqrt N}=\frac{r-k}{\sqrt{n-k}}. \end{align*} Thus convex distance spreads the deficit $r-k$ evenly over the $n-k$ zero coordinates that could repair it. [/example] This example shows how convex distance refines Hamming distance. Hamming distance is $r-k$, while convex distance reflects how many interchangeable coordinates can repair the deficit. 
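The two bounds in the threshold example can be reproduced by brute force on a small cube. The sketch below is only a check of this one example, assuming $k<r\le n$; the function name `threshold_bounds` is hypothetical. It forms the uniform average of the indicator vectors of all $q$-subsets of the zero set for the upper bound, and minimises the weighted disagreement with weights $N^{-1/2}$ on the zero set over every $y\in A$ for the lower bound.

```python
from itertools import combinations, product
from math import sqrt


def threshold_bounds(x, r):
    """Upper and lower bounds on d_T(x, A) for A = {z : sum(z) >= r}."""
    n = len(x)
    zeros = [i for i in range(n) if x[i] == 0]
    N, q = len(zeros), r - sum(x)

    # Upper bound: average the disagreement vectors h(x, y^J) over all q-subsets J.
    subsets = list(combinations(zeros, q))
    u = [sum(1 for J in subsets if i in J) / len(subsets) for i in range(n)]
    upper = sqrt(sum(c * c for c in u))

    # Lower bound: dual weights alpha_i = N^{-1/2} on the zero set, 0 elsewhere.
    alpha = [1.0 / sqrt(N) if x[i] == 0 else 0.0 for i in range(n)]
    lower = min(
        sum(a for a, xi, yi in zip(alpha, x, y) if xi != yi)
        for y in product((0, 1), repeat=n)
        if sum(y) >= r
    )
    return upper, lower
```

For $n=6$, $r=4$, and a point $x$ with $k=1$, both returned values agree with the convex distance $3/\sqrt5$.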
In the degenerate case where exactly $r-k$ zero coordinates are available, the convex distance $(r-k)/\sqrt{n-k}$ becomes $\sqrt{r-k}$; when many zero coordinates are available, convexification spreads the certificate mass across them. ## Talagrand's Convex Distance Inequality The next problem is to turn the geometry into probability. Ordinary Hamming distance can be too crude here: for a threshold event in the cube, Hamming distance gives the deficit $r-k$, while the certificate geometry above also accounts for how many coordinates can repair that deficit. Bounded-difference inequalities also miss this structure when a large value can be certified by exposing only a small subset of coordinates. If $X=(X_1,\dots,X_n)$ has independent coordinates and $A$ has positive probability, then a product-space isoperimetric principle should say that most points are close to $A$ in convex distance. Talagrand's inequality gives exactly this statement with no dependence on $n$ or on the coordinate laws. [quotetheorem:6769] The theorem is best read as an expansion statement: a set of non-negligible measure rapidly fills the space when enlarged in convex distance. Independence is the structural hypothesis; without it, the conclusion can fail. For example, let $Y$ be a fair bit and set $X=(Y,\dots,Y)\in\{0,1\}^n$, so the coordinates are perfectly correlated. If $A=\{(0,\dots,0)\}$, then $\mathbb P(X\in A)=1/2$, but $d_T((1,\dots,1),A)=\sqrt n$, so $\mathbb E\exp\bigl(d_T(X,A)^2/4\bigr)=(1+e^{n/4})/2$, which grows without bound in $n$ and cannot be controlled by $1/\mathbb P(X\in A)=2$. Positive mass is also indispensable because the right side contains $1/\mathbb P(A)$ and no finite enlargement estimate can start from a null event under the sampling law. The theorem does not say that $d_T$ behaves like Euclidean distance inside the coordinate spaces, and this limitation will matter when we later pass to functions on intervals. To state the result in the language of neighbourhoods, we package all points within convex distance $r$ of a set into one enlargement. 
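The role of independence in the correlated-bit example above can be checked with exact arithmetic. The sketch below assumes that the quoted inequality has the standard form $\mathbb E\exp(d_T(X,A)^2/4)\le 1/\mathbb P(A)$, which is not reproduced in this section, and takes $A=\{(0,\dots,0)\}$, for which $U_A(x)=\{x\}$ and $d_T(x,A)^2$ is the number of ones in $x$. The function names are illustrative.

```python
from math import comb, exp


def moment_correlated(n):
    """E exp(d_T^2 / 4) when X = (Y, ..., Y) for one fair bit Y.

    d_T is 0 with probability 1/2 and sqrt(n) with probability 1/2.
    """
    return 0.5 * (1.0 + exp(n / 4.0))


def moment_independent(n):
    """E exp(d_T^2 / 4) for n independent fair bits.

    d_T^2 is the number of ones, which is Binomial(n, 1/2).
    """
    return sum(comb(n, j) * exp(j / 4.0) for j in range(n + 1)) / 2 ** n
```

In the independent case the value stays below $1/\mathbb P(A)=2^n$, whereas in the correlated case it exceeds $1/\mathbb P(A)=2$ from $n=5$ onward.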
[definition: Convex Enlargement] Let $X$ be a product set and let $r\ge 0$. The convex-enlargement operator at radius $r$ is the map \begin{align*} \mathcal P(X)\to \mathcal P(X), \qquad A\mapsto A_r^{\mathrm T}, \end{align*} where \begin{align*} A_r^{\mathrm T} = \{x\in X: d_T(x,A)\le r\}. \end{align*} [/definition] The enlargement notation turns the exponential-moment theorem into a form that can be inserted directly into applications. Once a favourable event $A$ has probability at least one half, we need a theorem that bounds the probability of falling outside its convex enlargement. The next result supplies exactly that expansion estimate and will be the set-level input for the concentration arguments below. [quotetheorem:6771] [citeproof:6771] The preceding theorem converts a median-probability event into a high-probability enlargement. The assumption $\mathbb P(A)>0$ is necessary because the bound is normalised by the initial mass of $A$: if $A$ is null, then no estimate involving $1/\mathbb P(A)$ has content, and a null set need not be reached by typical samples at any fixed small radius. Independence is inherited from Talagrand's inequality; for perfectly correlated coordinates, a set that is large in the formal product space can be invisible or poorly expanded under the actual law. The result does not identify which points enter the enlargement; it only bounds the mass of the complement. The following computation records the radius needed to make the exceptional probability at most a chosen tolerance. [example: Median Enlargement Bound] Let $A$ be an event with $\mathbb P(A)\ge 1/2$ in a product probability space, and fix $0<\delta\le 1$. By *Product-Space Isoperimetry via Convex Distance*, for every $r\ge 0$, \begin{align*} \mathbb P(X\notin A_r^{\mathrm T})\le 2\exp\left(-\frac{r^2}{4}\right). \end{align*} Choose \begin{align*} r=2\sqrt{\log(2/\delta)}. \end{align*} This is well-defined because $2/\delta\ge 2$, so $\log(2/\delta)>0$. 
For this choice of $r$, \begin{align*} r^2=\left(2\sqrt{\log(2/\delta)}\right)^2=4\log(2/\delta). \end{align*} Dividing by $4$ gives \begin{align*} \frac{r^2}{4}=\log(2/\delta). \end{align*} Substituting into the isoperimetric bound, \begin{align*} 2\exp\left(-\frac{r^2}{4}\right)=2\exp(-\log(2/\delta)). \end{align*} Since $\exp(\log(2/\delta))=2/\delta$, this becomes \begin{align*} 2\exp(-\log(2/\delta))=2\cdot\frac{1}{2/\delta}=\delta. \end{align*} Therefore \begin{align*} \mathbb P\left(X\notin A_{2\sqrt{\log(2/\delta)}}^{\mathrm T}\right)\le \delta. \end{align*} Thus a probability-$1/2$ event becomes a probability-$1-\delta$ event after enlarging it by convex-distance radius $2\sqrt{\log(2/\delta)}$. [/example] This median-to-tail mechanism is the bridge from set expansion to concentration of functions. The remaining step is to connect the event $\{f\le m\}$ or $\{f\ge m\}$ to level sets of a function $f$. ## Convex Lipschitz Concentration on Product Spaces The functional question is when a real-valued function on a product space has sub-Gaussian tails under arbitrary independent coordinates. Talagrand's answer is that convexity and coordinate Lipschitz control are enough. Convexity enters because the distance $d_T$ itself is defined by convexifying certificates. [definition: Convex Function on a Product Subset of Euclidean Space] Let $K_i\subset \mathbb R$ be intervals and let $K=K_1\times\cdots\times K_n$. A function $f:K\to\mathbb R$ is convex if \begin{align*} f(\lambda x+(1-\lambda)y)\le \lambda f(x)+(1-\lambda)f(y) \end{align*} for all $x,y\in K$ and all $\lambda\in[0,1]$. [/definition] Convexity supplies separating hyperplanes for sublevel sets, but it does not set the deviation scale. To compare a vertical change in $f$ with a horizontal convex-distance enlargement, we also need a Euclidean Lipschitz constant and a coordinate-diameter normalisation. 
This normalisation is not cosmetic: convex distance records whether coordinate $i$ changed, not whether it changed by $10^{-3}$ or by $10^3$. [definition: Euclidean Lipschitz Constant] Let $K\subset \mathbb R^n$. A function $f:K\to\mathbb R$ is $L$-Lipschitz if \begin{align*} |f(x)-f(y)|\le L|x-y| \end{align*} for all $x,y\in K$. [/definition] The concentration theorem is stated around a median, since the preceding isoperimetric statement starts from a set of probability at least one half. The key geometric claim is that a point whose value is $t$ above a median sublevel set must be at convex distance at least $t/L$ from that set. [quotetheorem:6773] [citeproof:6773] This result shows why convexity appears in the statement: it turns level-set separation into the weighted coordinate tests of the dual formula. Without convexity, a sublevel set may have holes that no single separating hyperplane detects; an indicator-like oscillation smoothed to be Lipschitz on the cube can have many isolated high points while its median remains low. The diameter assumption is equally important. If $K_1=[0,M]$ and $f(x)=x_1$, then $f$ is $1$-Lipschitz, but changing the first coordinate is still only one Talagrand disagreement, so the displayed bound with the same constant cannot hold uniformly as $M$ grows. The theorem also gives only the upper tail for convex functions, because convex sublevel sets are the sets being enlarged by convex distance. [remark: Why Convexity Is Needed] Lipschitz continuity alone does not force dimension-free concentration for arbitrary product measures. A function may depend on many coordinates in a nonconvex way that defeats certificate geometry, while a convex function has level sets that can be separated by linear functionals. Talagrand's inequality is tailored to those separating functionals. [/remark] The median form is often enough, but estimates in applications are frequently stated around $\mathbb E[f(X)]$. 
The passage from medians to means costs only an absolute constant at the scale $L$. [quotetheorem:6775] [citeproof:6775] The mean form is less sharp in constants but more convenient when the typical value has already been estimated by expectation methods. Its hypotheses are exactly those of the median theorem, so the same coordinate-diameter restriction remains in force; the example $f(x)=x_1$ on $[0,M]$ again shows why this restriction cannot be dropped without rescaling constants by the coordinate diameter. The convexity and one-sidedness assumptions are also unchanged: replacing the median by the mean does not create a lower-tail estimate for a convex function, since the proof still controls only expansion away from convex sublevel sets. A basic illustration is the Euclidean norm of independent bounded coordinates, where convexity is built into the norm. [example: Convex Norms of Independent Bounded Coordinates] Let $X=(X_1,\dots,X_n)$ have independent coordinates in $[0,1]$, and define $f:[0,1]^n\to\mathbb R$ by $f(x)=|x|$. For $x,y\in[0,1]^n$ and $\lambda\in[0,1]$, the triangle inequality gives \begin{align*} |\lambda x+(1-\lambda)y|\le |\lambda x|+|(1-\lambda)y|. \end{align*} Since $\lambda\ge 0$ and $1-\lambda\ge 0$, homogeneity of the Euclidean norm gives \begin{align*} |\lambda x|+|(1-\lambda)y|=\lambda |x|+(1-\lambda)|y|. \end{align*} Therefore \begin{align*} f(\lambda x+(1-\lambda)y)\le \lambda f(x)+(1-\lambda)f(y), \end{align*} so $f$ is convex. For the Lipschitz constant, the triangle inequality gives \begin{align*} |x|=|x-y+y|\le |x-y|+|y|. \end{align*} Hence $|x|-|y|\le |x-y|$. Interchanging $x$ and $y$ gives $|y|-|x|\le |x-y|$, and combining the two inequalities gives \begin{align*} |f(x)-f(y)|=\bigl||x|-|y|\bigr|\le |x-y|. \end{align*} Thus $f$ is $1$-Lipschitz. Each coordinate interval has diameter \begin{align*} \sup_{a,b\in[0,1]}|a-b|=1. 
\end{align*} If $m$ is a median of $|X|=f(X)$, then *Convex-Lipschitz Concentration on Product Spaces* with $L=1$ gives, for every $t\ge 0$, \begin{align*} \mathbb P(|X|\ge m+t)=\mathbb P(f(X)\ge m+t). \end{align*} The theorem bounds the right-hand side by \begin{align*} 2\exp\left(-\frac{t^2}{4\cdot 1^2}\right)=2\exp\left(-\frac{t^2}{4}\right). \end{align*} Thus the Euclidean norm has a dimension-free upper-tail scale, independent of $n$ and independent of whether the coordinate distributions are identical. If instead the coordinates lie in $[-1,1]$, set $Y_i=(X_i+1)/2$, so $Y_i\in[0,1]$ and $X_i=2Y_i-1$. The function of $Y$ is \begin{align*} g(y)=|2y-\mathbf 1|. \end{align*} This $g$ is convex, since it is the composition of the Euclidean norm with an affine map. For $y,z\in[0,1]^n$, the [reverse triangle inequality](/theorems/2300) gives \begin{align*} |g(y)-g(z)|=\bigl||2y-\mathbf 1|-|2z-\mathbf 1|\bigr|\le |(2y-\mathbf 1)-(2z-\mathbf 1)|. \end{align*} The difference inside the norm is $2(y-z)$, so \begin{align*} |(2y-\mathbf 1)-(2z-\mathbf 1)|=2|y-z|. \end{align*} Thus $g$ is $2$-Lipschitz on $[0,1]^n$. Since $g(Y)=|X|$, applying the same theorem with $L=2$ gives the rescaled bound \begin{align*} \mathbb P(|X|\ge m+t)\le 2\exp\left(-\frac{t^2}{16}\right). \end{align*} [/example] The theorem is one-sided in the convex case. For a convex function, the upper tail is controlled because the sublevel sets are convex and can be separated by hyperplanes; lower tails require either concavity or additional structure. ## Combinatorial Random Structures and Small Certificates Talagrand's inequality is especially useful when a random variable is not globally Lipschitz in a strong sense but large values can be certified by a small number of coordinates. This is common in combinatorics: the existence of a long subsequence, a large matching, or many local substructures can often be witnessed by exposing only the coordinates that participate in the object. The problem is to translate the size of such a witness into a lower bound on the convex distance separating low-value points from high-level sets. 
[definition: Certificate Size] Let $X=X_1\times\cdots\times X_n$ be a product set and let \begin{align*} Z:X\to \mathbb Z_{\ge 0} \end{align*} be a nonnegative integer-valued function. We say that $Z$ has certificates of size at most $r$ for level $s$ if, whenever $Z(x)\ge s$, there is a set $I\subset\{1,\dots,n\}$ with $|I|\le r$ such that every $y\in X$ agreeing with $x$ on $I$ satisfies $Z(y)\ge s$. [/definition] Certificates turn level sets into sets with strong convex-distance separation, but the direction of the separation matters. A certificate for a high value proves that a point agreeing with the certificate remains high; therefore it naturally bounds the probability of falling far below a high-probability upper level, rather than directly bounding the upper tail from a low-level set. [quotetheorem:6777] [citeproof:6777] The principle is designed for random objects whose large values come with visible witnesses. The certificate hypothesis is needed because a lower tail is controlled by showing that points far below a typical high level are far from the high-level set; without certificates, a $1$-Lipschitz variable can have a high-level set whose membership cannot be checked from $O(m)$ coordinates. The Lipschitz hypothesis is needed as well: if changing one coordinate could destroy many certified objects at once, then the argument comparing $Z(x)$ with the certified point would lose the factor $t$. The positivity of $r$ and the integer median level keep the certificate size $rm$ meaningful at the level actually used in the proof. The result does not, by itself, give an upper tail; upper tails require a separate argument, often applied to a dual variable or to a different certificate structure. The longest increasing subsequence is a model case: the value may depend on all coordinates, but a single subsequence certifies a lower bound on the value. 
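The certificate idea for the longest increasing subsequence can be sketched in a few lines of Python. This is a minimal illustration, not part of the course's formal development: the function names `lis_certificate`, `lis_length`, and `resample_outside` are hypothetical, and the patience-sorting routine is one standard way to extract a witness index set $I$. The sketch returns the indices of one longest strictly increasing subsequence, which serve as a certificate of size $s$ for the level $s=Z(x)$.

```python
import bisect
import random


def lis_certificate(x):
    """Return the indices of one longest strictly increasing subsequence of x.

    The returned index set plays the role of the certificate I: any vector
    agreeing with x on these indices still contains this subsequence.
    """
    tails = []      # tails[k]: smallest last value of an increasing subsequence of length k+1
    tails_idx = []  # tails_idx[k]: index in x of that last value
    prev = [-1] * len(x)  # back-pointers used to rebuild one witness
    for i, v in enumerate(x):
        k = bisect.bisect_left(tails, v)  # bisect_left enforces strict increase
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    cert = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        cert.append(i)
        i = prev[i]
    return cert[::-1]


def lis_length(x):
    """Length Z(x) of the longest strictly increasing subsequence."""
    return len(lis_certificate(x))


def resample_outside(x, cert, rng):
    """Keep x on the certificate indices and redraw every other coordinate."""
    keep = set(cert)
    return [x[i] if i in keep else rng.random() for i in range(len(x))]
```

Redrawing all coordinates outside the certificate with `resample_outside` never lowers `lis_length` below the certified level, and changing a single coordinate moves `lis_length` by at most one; these are the two hypotheses that the lower-tail principle uses.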
[example: Longest Increasing Subsequence Certificates] Let $X_1,\dots,X_n$ be independent continuous labels, and let the induced permutation be obtained by ranking the labels. For a deterministic label vector $x$, let $Z(x)$ be the length of the longest increasing subsequence, meaning the largest $s$ for which there are indices \begin{align*} 1\le i_1<\cdots<i_s\le n \end{align*} such that \begin{align*} x_{i_1}<x_{i_2}<\cdots<x_{i_s}. \end{align*} If $Z(x)\ge s$, choose one increasing subsequence $i_1<\cdots<i_s$ and set \begin{align*} I=\{i_1,\dots,i_s\}. \end{align*} Whenever $y$ agrees with $x$ on every coordinate in $I$, we have \begin{align*} y_{i_1}=x_{i_1}<x_{i_2}=y_{i_2}<\cdots<x_{i_s}=y_{i_s}. \end{align*} Thus $y$ also has an increasing subsequence of length $s$, so $Z(y)\ge s$. Therefore the event $\{Z\ge s\}$ has certificates of size at most $s$. Next, suppose $x$ and $x'$ differ in only one coordinate, say coordinate $j$. Let $L=Z(x)$, and choose an increasing subsequence of $x$ with index set \begin{align*} i_1<\cdots<i_L. \end{align*} If $j\notin\{i_1,\dots,i_L\}$, then all labels used by this subsequence are unchanged, so the same subsequence is increasing for $x'$ and \begin{align*} Z(x')\ge L=Z(x). \end{align*} If $j\in\{i_1,\dots,i_L\}$, remove $j$ from the subsequence. The remaining indices are still in increasing order, and their labels are unchanged, so they form an increasing subsequence of $x'$ of length $L-1$. Hence \begin{align*} Z(x')\ge L-1=Z(x)-1. \end{align*} In both cases, \begin{align*} Z(x)-Z(x')\le 1. \end{align*} Interchanging the roles of $x$ and $x'$ gives \begin{align*} Z(x')-Z(x)\le 1, \end{align*} and therefore \begin{align*} |Z(x)-Z(x')|\le 1. \end{align*} Thus $Z$ is $1$-Lipschitz under single-coordinate changes and has certificates of size at most $s=1\cdot s$ at level $s$. 
Applying the *Certificate Lower-Tail Principle* with $r=1$, if $m$ is an integer median level for $Z$, then for every $0\le t\le m$ with $m-t\ge 1$, \begin{align*} \mathbb P(Z\le m-t)\le 2\exp\left(-\frac{t^2}{4\cdot 1\cdot m}\right) =2\exp\left(-\frac{t^2}{4m}\right). \end{align*} The lower tail is therefore controlled at variance scale $m$, the typical length of the subsequence, rather than at the ambient scale $n$. [/example] The same idea applies to random graphs when the random coordinates are edges. The event that many copies of a fixed graph appear may be certified by listing the participating edges, provided the copies can be organised without excessive overlap. [example: Random Graph Substructure Counts with Small Certificates] Let $G\sim G(n,p)$, with one product coordinate for each possible edge, and let $Z(G)$ be the maximum number of edge-disjoint triangles in $G$. Suppose first that $Z(G)\ge s$. Choose edge-disjoint triangles $T_1,\dots,T_s$ in $G$, and let \begin{align*} I=\bigcup_{\ell=1}^s E(T_\ell). \end{align*} Each triangle has $3$ edges, and the triangles are edge-disjoint, so \begin{align*} |I|=\sum_{\ell=1}^s |E(T_\ell)|=\sum_{\ell=1}^s 3=3s. \end{align*} If another graph $H$ agrees with $G$ on every edge in $I$, then every edge of every $T_\ell$ is still present in $H$. Therefore $T_1,\dots,T_s$ are still $s$ edge-disjoint triangles in $H$, and hence $Z(H)\ge s$. Thus the level set $\{Z\ge s\}$ has certificates of size at most $3s$. Now let $G$ and $H$ differ in exactly one edge, say $e$. Set \begin{align*} L=Z(G). \end{align*} Choose an edge-disjoint family of triangles $T_1,\dots,T_L$ in $G$. Since the triangles are edge-disjoint, at most one of them can contain the edge $e$. Removing that triangle if it exists leaves at least $L-1$ triangles. Every remaining triangle uses only edges on which $G$ and $H$ agree, so all remaining triangles are present in $H$. Hence \begin{align*} Z(H)\ge L-1=Z(G)-1. 
\end{align*} This gives \begin{align*} Z(G)-Z(H)\le 1. \end{align*} Interchanging the roles of $G$ and $H$ gives \begin{align*} Z(H)-Z(G)\le 1. \end{align*} Therefore \begin{align*} |Z(G)-Z(H)|\le 1. \end{align*} So $Z$ is $1$-Lipschitz under single-edge changes and has certificates of size at most $3s=rs$ at level $s$ with $r=3$. If $m$ is an integer median level for $Z$, then the *Certificate Lower-Tail Principle* gives, for every $0\le t\le m$ with $m-t\ge 1$, \begin{align*} \mathbb P(Z\le m-t)\le 2\exp\left(-\frac{t^2}{4rm}\right). \end{align*} Substituting $r=3$ yields \begin{align*} \mathbb P(Z\le m-t)\le 2\exp\left(-\frac{t^2}{4\cdot 3\cdot m}\right). \end{align*} Since $4\cdot 3\cdot m=12m$, this is \begin{align*} \mathbb P(Z\le m-t)\le 2\exp\left(-\frac{t^2}{12m}\right). \end{align*} Thus the lower-tail scale is controlled by the $3m$ edges needed to witness $m$ edge-disjoint triangles, rather than by the total number of possible edges in the graph. [/example] These examples show that the denominator in the lower-tail bound is governed by witness size rather than by the number of ambient coordinates. This distinction is the main reason Talagrand's method can improve bounded-difference estimates for certificate-rich variables, but only in the direction supported by the available certificates. [remark: Certificates Versus Variance Proxies] The denominator in the certificate lower-tail bound is not the number of ambient coordinates. It is the size of the witness for the level being tested. This is why Talagrand's method can improve bounded-difference estimates for combinatorial variables whose natural Lipschitz constant is spread over a much smaller random structure. The same certificate data should not be read as an upper-tail estimate without an additional argument, because high-value certificates separate low points from high-level sets, not high points from low-level sets. 
[/remark] The chapter's main lesson is that, beyond the entropy and Gaussian isoperimetric routes of Chapters 1 through 5, concentration can come from geometry of evidence rather than from uniform sensitivity alone. Convex distance packages that geometry into a product-space isoperimetric theorem, and the resulting inequalities explain why certificate-rich random structures often have sharper tails than classical bounded-difference methods predict. # 7. Transportation-Cost Inequalities The preceding chapters developed concentration through Laplace transforms, entropy, and tensorization. This chapter reorganises those tools around a new question: when does a probability measure control the transport cost of every alternative law in terms of relative entropy? The prerequisites are Chapter 1's entropy variational formula, Chapter 2's sub-Gaussian Laplace bounds for Lipschitz functions, and basic metric-space measure theory. The goal is to introduce $W_1$ transportation-cost inequalities, prove the Bobkov-Gotze duality between $T_1$ and Laplace estimates, and use tensorization and pushforward stability to recover the product-space concentration bounds from a geometric viewpoint. ## Couplings, Wasserstein Distance, and Entropy How should we measure the cost of changing one probability law into another? Total variation records the mass that must be changed, but concentration for Lipschitz functions needs a metric-sensitive notion: moving a small amount of mass a long distance should cost more than moving it nearby. Let $(X,d)$ be a metric space equipped with its Borel $\sigma$-algebra. The basic object in optimal transport is a joint law whose two marginals are prescribed. [definition: Coupling] Let $\mu$ and $\nu$ be probability measures on a measurable space $(X,\mathcal F)$. 
A coupling of $\mu$ and $\nu$ is a probability measure $\pi$ on $(X\times X,\mathcal F\otimes\mathcal F)$ such that \begin{align*} \pi(A\times X)&=\mu(A),\qquad \pi(X\times A)=\nu(A) \end{align*} for every $A\in\mathcal F$. [/definition] A coupling should be read as a way to realise two random variables $Y$ and $Z$ on the same probability space with $Y\sim\mu$ and $Z\sim\nu$. To compare laws quantitatively, we now minimise the average metric distance between the paired variables over all such joint realisations. [definition: First Wasserstein Distance] Let $(X,d)$ be a metric space and let $\mathcal P_1(X)$ denote the probability measures on $X$ with finite first moment. The first Wasserstein distance is the functional $W_1:\mathcal P_1(X)\times\mathcal P_1(X)\to[0,\infty)$ defined by \begin{align*} W_1(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\int_{X\times X} d(x,y)\,d\pi(x,y), \end{align*} where $\Pi(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$. [/definition] The definition is primal: it minimises over all transport plans, so it is well suited to geometric intuition but not yet connected to scalar concentration estimates. The next theorem gives the missing bridge. It rewrites $W_1$ as the largest possible change in expectation over the class of $1$-Lipschitz functions, which are exactly the observables whose fluctuations earlier chapters controlled by Laplace methods. [quotetheorem:6779] [citeproof:6779] This duality explains why $W_1$ is the right transport distance for Lipschitz concentration: a uniform bound on $W_1(\nu,\mu)$ controls the change in expectation of every $1$-Lipschitz observable. The hypotheses are not cosmetic. The Polish assumption is a regularity condition ensuring that couplings, measurable potentials, and the Kantorovich dual problem behave well; on badly behaved measurable metric spaces the supremum over Lipschitz functions need not capture the primal transport problem without extra regularity. 
The finite first moment assumption is also necessary for the displayed formula to be meaningful: if $\int d(x_0,x)\,d\mu(x)=\infty$ for one of the measures, then some $1$-Lipschitz functions have undefined or infinite expectations, so the dual expression no longer gives a finite comparison of means. Even in the Polish finite-moment setting, the theorem only tests against Lipschitz observables; it says nothing about discontinuous indicators except through Lipschitz approximations, which is why later concentration statements are formulated for Lipschitz functions rather than arbitrary events. The cost term in the inequalities below is relative entropy, because entropy is the variational price paid when the reference measure is tilted. Transport distance measures how far a new law moves mass away from the reference law, while entropy measures how expensive it is to create that law by reweighting the reference measure. The next definition fixes this cost before we compare it with $W_1$. [definition: Relative Entropy] Let $\nu$ and $\mu$ be probability measures on a measurable space $(X,\mathcal F)$. The relative entropy functional is $H(\cdot\mid\cdot):\mathcal P(X)\times\mathcal P(X)\to[0,\infty]$. If $\nu\ll\mu$, the relative entropy of $\nu$ with respect to $\mu$ is \begin{align*} H(\nu\mid\mu)=\int_X \log\left(\frac{d\nu}{d\mu}\right)\,d\nu. \end{align*} If $\nu\not\ll\mu$, the relative entropy is $H(\nu\mid\mu)=+\infty$. [/definition] Entropy is asymmetric, while $W_1$ is symmetric. Transportation inequalities exploit this asymmetry by using entropy as the price of replacing the reference measure by a tilted or conditional law. [example: Two Point Space] Let $X=\{0,1\}$ with $d(0,1)=1$, let $\mu=\operatorname{Ber}(p)$ and $\nu=\operatorname{Ber}(q)$, so \begin{align*} \mu(\{1\})=p,\quad \mu(\{0\})=1-p,\quad \nu(\{1\})=q,\quad \nu(\{0\})=1-q. 
\end{align*} For any coupling $\pi$ of $\mu$ and $\nu$, the cost is \begin{align*} \int_{X\times X}d(x,y)\,d\pi(x,y)=\pi(\{0\}\times\{1\})+\pi(\{1\}\times\{0\}), \end{align*} because $d(0,0)=d(1,1)=0$ and $d(0,1)=d(1,0)=1$. The marginal constraints give \begin{align*} \pi(\{0\}\times\{1\})+\pi(\{1\}\times\{1\})=q. \end{align*} They also give \begin{align*} \pi(\{1\}\times\{0\})+\pi(\{1\}\times\{1\})=p. \end{align*} Subtracting the second identity from the first yields \begin{align*} \pi(\{0\}\times\{1\})-\pi(\{1\}\times\{0\})=q-p. \end{align*} Since both off-diagonal masses are nonnegative, \begin{align*} \pi(\{0\}\times\{1\})+\pi(\{1\}\times\{0\})\ge \left|\pi(\{0\}\times\{1\})-\pi(\{1\}\times\{0\})\right|=|q-p|. \end{align*} Thus every coupling has cost at least $|q-p|$. This lower bound is attained by matching the common mass at each point and moving only the excess mass. If $q\ge p$, define a coupling by \begin{align*} \pi(\{1\}\times\{1\})=p,\quad \pi(\{0\}\times\{1\})=q-p,\quad \pi(\{0\}\times\{0\})=1-q,\quad \pi(\{1\}\times\{0\})=0. \end{align*} Its first marginal is $\mu$, its second marginal is $\nu$, and its cost is $q-p$. If $p\ge q$, define instead \begin{align*} \pi(\{1\}\times\{1\})=q,\quad \pi(\{1\}\times\{0\})=p-q,\quad \pi(\{0\}\times\{0\})=1-p,\quad \pi(\{0\}\times\{1\})=0. \end{align*} This coupling has cost $p-q$. Hence \begin{align*} W_1(\mu,\nu)=|p-q|. \end{align*} If $0<p<1$ and $0<q<1$, the Radon-Nikodym derivative is \begin{align*} \frac{d\nu}{d\mu}(1)=\frac{q}{p},\quad \frac{d\nu}{d\mu}(0)=\frac{1-q}{1-p}. \end{align*} Therefore, by the definition of relative entropy, \begin{align*} H(\nu\mid\mu)=q\log\frac{q}{p}+(1-q)\log\frac{1-q}{1-p}. \end{align*} With the conventions $0\log(0/a)=0$ for $a>0$ and $a\log(a/0)=+\infty$ for $a>0$, the same expression describes the endpoint cases. 
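The coupling computation above can be checked numerically. The following sketch assumes nothing beyond the two-point setup of this example; the helper names `coupling_cost`, `w1_two_point`, and `relative_entropy_bernoulli` are hypothetical. It uses the fact that a coupling of $\operatorname{Ber}(p)$ and $\operatorname{Ber}(q)$ is determined by the single number $a=\pi(\{1\}\times\{1\})$, scans the admissible values of $a$ on a grid, and evaluates the entropy formula with the endpoint conventions stated above.

```python
import math


def coupling_cost(p, q, a):
    """Cost of the coupling of Ber(p) and Ber(q) with pi({1} x {1}) = a."""
    # The marginal constraints determine the three remaining masses.
    masses = (a, p - a, q - a, 1.0 - p - q + a)
    if min(masses) < -1e-12:
        raise ValueError("a does not define a coupling")
    # Only the off-diagonal masses pi(1,0) and pi(0,1) pay distance 1.
    return (p - a) + (q - a)


def w1_two_point(p, q, grid=1000):
    """Grid approximation of W_1(Ber(p), Ber(q)) by scanning all couplings."""
    lo = max(0.0, p + q - 1.0)  # smallest admissible diagonal mass
    hi = min(p, q)              # largest admissible diagonal mass
    return min(
        coupling_cost(p, q, lo + (hi - lo) * k / grid) for k in range(grid + 1)
    )


def relative_entropy_bernoulli(q, p):
    """H(Ber(q) | Ber(p)) with 0 log(0/a) = 0 and a log(a/0) = +infinity."""
    def term(a, b):
        if a == 0:
            return 0.0
        if b == 0:
            return math.inf
        return a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)
```

Because the cost $p+q-2a$ is decreasing in $a$, the minimum sits at $a=\min(p,q)$, which matches the coupling that keeps the common mass in place.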
Thus, on the two-point space, $W_1$ is exactly the amount of probability mass shifted between the two points, while entropy is the logarithmic cost of changing the Bernoulli weights. [/example] ## T One Inequalities and Sub-Gaussian Concentration Which transport-entropy inequality is strong enough to imply Gaussian tails for all Lipschitz functions? For $W_1$, the useful scale is quadratic: distance is bounded by the square root of entropy. [definition: T One Inequality] Let $(X,d)$ be a metric space and let $\mu$ be a probability measure on $X$. We say that $\mu$ satisfies the $T_1(C)$ transportation-cost inequality if, for every probability measure $\nu$ on $X$, with $W_1(\nu,\mu)$ interpreted as the extended first Wasserstein cost when a finite first moment is absent, \begin{align*} W_1(\nu,\mu)\le \sqrt{2C\,H(\nu\mid\mu)}. \end{align*} [/definition] The constant $C$ has the same role as a variance proxy. The obstruction is that $T_1$ is a statement about distances between probability laws, while concentration is a statement about the tails of a single real-valued observable. Exponential tilting connects the two: a large Laplace transform of a Lipschitz function would create a tilted law whose expectation moves too far unless the entropy term pays a quadratic cost. [quotetheorem:6781] [citeproof:6781] The theorem says that a geometric inequality over all alternative laws $\nu$ is equivalent to concentration of scalar observables, but the formulation has important boundaries. The proof uses exponential tilts, so for unbounded $f$ the argument must first be run for bounded truncations $f_m=(-m)\vee f\wedge m$ and then passed to the limit using the resulting uniform Laplace bounds; the assumption $\int_X f\,d\mu<\infty$ identifies the centering constant but does not by itself justify every tilted measure before this regularisation. 
The $1$-Lipschitz hypothesis is essential because $T_1$ only controls expectations through the Kantorovich-Rubinstein dual class: a highly discontinuous function can have large jumps on sets that are very close in the metric and need not satisfy any Gaussian tail bound. Nor does $T_1$ imply stronger transport inequalities such as $T_2$, logarithmic Sobolev inequalities, or dimension-free concentration for arbitrary non-Lipschitz observables; for instance, a boundedly supported measure satisfies some $T_1$ inequality, while it may have atoms and therefore cannot satisfy the usual log-Sobolev inequality on a smooth space. On bounded metric spaces such inequalities are automatic, although the constants may be coarse. [example: Bounded Metric Spaces] Suppose $\operatorname{diam}(X)\le D$ and $\mu$ is any probability measure on $X$. If $f:X\to\mathbb R$ is $1$-Lipschitz, then for all $x,y\in X$, \begin{align*} f(x)-f(y)\le |f(x)-f(y)|\le d(x,y)\le D. \end{align*} Taking the supremum over $x$ and the infimum over $y$ gives \begin{align*} \sup_X f-\inf_X f\le D. \end{align*} Thus the centered random variable $f-\mathbb E[f]$ comes from a random variable whose range has length at most $D$, so *[Hoeffding's lemma](/theorems/1956)* gives, for every $\lambda\in\mathbb R$, \begin{align*} \log\mathbb E\left[e^{\lambda(f-\mathbb E[f])}\right]\le \frac{\lambda^2D^2}{8}. \end{align*} To match this with the Bobkov-Gotze Laplace form \begin{align*} \log\mathbb E\left[e^{\lambda(f-\mathbb E[f])}\right]\le \frac{C\lambda^2}{2}, \end{align*} we set \begin{align*} \frac{C\lambda^2}{2}=\frac{\lambda^2D^2}{8}, \end{align*} and hence, for $\lambda\ne 0$, \begin{align*} C=\frac{D^2}{4}. \end{align*} The same value also covers $\lambda=0$, where both sides are $0$. By *Bobkov-Gotze Theorem*, $\mu$ satisfies $T_1(D^2/4)$. Thus bounded support alone forces a transportation inequality; no curvature, smoothness, or product structure is being used. 
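As a sanity check of the constant, the next sketch specialises to a two-point space $\{0,D\}$ with $\mu(\{D\})=p$; this specialisation and the helper names `centered_log_mgf`, `hoeffding_gap`, and `t1_gap` are assumptions made for illustration only. On this space the Laplace bound is Hoeffding's lemma for a two-valued variable, and $T_1(D^2/4)$ reduces to $D|p-q|\le D\sqrt{H(\nu\mid\mu)/2}$, which is Pinsker's inequality for Bernoulli laws.

```python
import math


def centered_log_mgf(p, D, lam):
    """log E exp(lam (f - E f)) when f = D with probability p and f = 0 otherwise."""
    return math.log((1.0 - p) + p * math.exp(lam * D)) - lam * p * D


def hoeffding_gap(p, D, lam):
    """lam^2 D^2 / 8 minus the centered log-Laplace transform (nonnegative by Hoeffding)."""
    return lam ** 2 * D ** 2 / 8.0 - centered_log_mgf(p, D, lam)


def t1_gap(p, q, D=1.0):
    """sqrt(2 C H(nu | mu)) - W_1(nu, mu) with C = D^2 / 4, for 0 < p, q < 1.

    Here mu puts mass p on the point D and nu puts mass q on it, so that
    W_1(nu, mu) = D |p - q| on the two-point space {0, D}.
    """
    h = q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    h = max(h, 0.0)  # guard against tiny negative rounding errors
    C = D ** 2 / 4.0
    return math.sqrt(2.0 * C * h) - D * abs(p - q)
```

Evaluating `hoeffding_gap` and `t1_gap` over a grid of parameters should return nonnegative values up to rounding, consistent with the constant $C=D^2/4$ derived above.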
[/example] The bounded-support example shows that $T_1$ can hold for reasons unrelated to curvature, but it gives a constant determined only by the diameter. Gaussian measures are the model unbounded case: the same duality recovers the sharp Lipschitz concentration scale from the usual Gaussian Laplace estimate. This comparison also indicates why $T_1$ is weaker than the stronger transport inequalities studied later; it detects the correct one-dimensional Lipschitz tails without encoding the full quadratic transport geometry. [example: Gaussian T One Inequality] Let $\gamma_n=\mathcal N(0,I_n)$ on $\mathbb R^n$ with the Euclidean metric. For every $1$-Lipschitz function $f:\mathbb R^n\to\mathbb R$, the *Gaussian concentration theorem* gives, for every $\lambda\in\mathbb R$, \begin{align*} \log\mathbb E_{\gamma_n}\left[e^{\lambda(f-\mathbb E_{\gamma_n}[f])}\right]\le \frac{\lambda^2}{2}. \end{align*} The Bobkov-Gotze Laplace criterion asks for the bound \begin{align*} \log\mathbb E_{\mu}\left[e^{\lambda(f-\mathbb E_{\mu}[f])}\right]\le \frac{C\lambda^2}{2} \end{align*} for every bounded $1$-Lipschitz $f$ and every $\lambda\in\mathbb R$. Comparing the right-hand sides in the Gaussian bound gives $C=1$, since $\lambda^2/2=1\cdot\lambda^2/2$ for every $\lambda$. Hence, by *Bobkov-Gotze Theorem*, $\gamma_n$ satisfies $T_1(1)$. For $\mathcal N(0,\sigma^2 I_n)$ with $\sigma>0$, write $Y=\sigma Z$ where $Z\sim\gamma_n$. If $f:\mathbb R^n\to\mathbb R$ is $1$-Lipschitz and $g(z)=f(\sigma z)$, then for all $z,z'\in\mathbb R^n$, \begin{align*} |g(z)-g(z')|=|f(\sigma z)-f(\sigma z')|\le |\sigma z-\sigma z'|=\sigma |z-z'|. \end{align*} Thus $g/\sigma$ is $1$-Lipschitz. Applying the standard Gaussian Laplace bound to $g/\sigma$ with parameter $\lambda\sigma$ gives \begin{align*} \log\mathbb E\left[e^{\lambda\sigma((g(Z)/\sigma)-\mathbb E[g(Z)/\sigma])}\right]\le \frac{(\lambda\sigma)^2}{2}. 
\end{align*} Since $\lambda\sigma((g(Z)/\sigma)-\mathbb E[g(Z)/\sigma])=\lambda(g(Z)-\mathbb E[g(Z)])$ and $g(Z)=f(Y)$, this becomes \begin{align*} \log\mathbb E\left[e^{\lambda(f(Y)-\mathbb E[f(Y)])}\right]\le \frac{\sigma^2\lambda^2}{2}. \end{align*} This is the Bobkov-Gotze Laplace bound with $C=\sigma^2$, so $\mathcal N(0,\sigma^2 I_n)$ satisfies $T_1(\sigma^2)$. The transport constant scales exactly like the variance parameter. [/example] ## Bobkov-Gotze Duality Between T One and Laplace Bounds The previous section used $T_1$ to derive sub-Gaussian tails. Is the converse true: do Laplace bounds for all Lipschitz functions force the transport-entropy inequality? Bobkov-Gotze duality answers this by combining Kantorovich-Rubinstein duality with the Gibbs variational formula. [quotetheorem:6783] [citeproof:6783] The theorem is the bridge between analytic concentration estimates and transport. The Polish hypothesis again supplies the regularity needed to combine Kantorovich-Rubinstein duality with the Gibbs variational formula over probability measures; without such structure, the dual description can fail to see all transport plans or can require additional measurability hypotheses on potentials. A concrete pathology arises if a metric space is equipped with a $\sigma$-algebra smaller than its Borel $\sigma$-algebra: a Lipschitz potential needed to separate two transport plans may fail to be measurable for that smaller $\sigma$-algebra, while the variational formula only ranges over measurable tilts. This is why the theorem is stated for Polish Borel spaces rather than arbitrary measurable metric spaces. The bounded-test-function condition is also part of the precise mechanism: the Gibbs formula is immediate for bounded measurable functions, while unbounded Lipschitz functions require exponential integrability and approximation. 
A concrete limitation is furnished by heavy-tailed laws on $\mathbb R$: a Cauchy distribution has no finite first moment, so the identity function, which is $1$-Lipschitz, admits no Gaussian Laplace bound, and the law cannot satisfy $T_1(C)$ for any finite $C$. Thus Bobkov-Gotze is not just a convenient reformulation; it identifies exactly the sub-Gaussian Lipschitz regime and excludes measures whose tails are too heavy. It also explains why centering by the mean is natural: additive constants disappear from both Lipschitz seminorms and Wasserstein duality. [remark: Bounded Test Functions] The boundedness assumption on $f$ is a technical entry point for the variational formula. Under standard integrability assumptions, the bound extends to unbounded Lipschitz functions by truncation and monotone or dominated convergence. In applications to Gaussian and product measures, the required exponential integrability is obtained from the same Laplace estimate. [/remark] A useful consequence is that any method proving a dimension-free Laplace bound for Lipschitz functions also proves a $T_1$ inequality. This lets the Herbst argument from earlier chapters be reused in transport form. [example: From Log Sobolev to T One] Suppose a probability measure $\mu$ on a Polish metric space supports a Herbst argument giving, for every $1$-Lipschitz function $f$ in a class that is uniformly dense among bounded $1$-Lipschitz functions and every $\lambda\in\mathbb R$, \begin{align*} \log\mathbb E_\mu\left[e^{\lambda(f-\mathbb E_\mu[f])}\right]\le \frac{C\lambda^2}{2}. \end{align*} We show that the same Laplace bound holds for every bounded $1$-Lipschitz function, which is the hypothesis in the *Bobkov-Gotze Theorem*. Let $h$ be bounded and $1$-Lipschitz, and choose functions $f_m$ from the dense class such that \begin{align*} \|f_m-h\|_\infty\to 0. \end{align*} Then \begin{align*} |\mathbb E_\mu[f_m]-\mathbb E_\mu[h]|=\left|\int (f_m-h)\,d\mu\right|\le \int |f_m-h|\,d\mu\le \|f_m-h\|_\infty, \end{align*} so $\mathbb E_\mu[f_m]\to\mathbb E_\mu[h]$. 
For each fixed $\lambda\in\mathbb R$, \begin{align*} \left|\lambda(f_m-\mathbb E_\mu[f_m])-\lambda(h-\mathbb E_\mu[h])\right|\le |\lambda|\,\|f_m-h\|_\infty+|\lambda|\,|\mathbb E_\mu[f_m]-\mathbb E_\mu[h]|. \end{align*} The right-hand side tends to $0$, hence the exponents converge uniformly. Since $h$ is bounded and $f_m\to h$ uniformly, the centered functions $f_m-\mathbb E_\mu[f_m]$ are uniformly bounded for all sufficiently large $m$, so [uniform convergence](/page/Uniform%20Convergence) of the exponents gives \begin{align*} \mathbb E_\mu\left[e^{\lambda(f_m-\mathbb E_\mu[f_m])}\right]\to \mathbb E_\mu\left[e^{\lambda(h-\mathbb E_\mu[h])}\right]. \end{align*} Passing to the limit in \begin{align*} \log\mathbb E_\mu\left[e^{\lambda(f_m-\mathbb E_\mu[f_m])}\right]\le \frac{C\lambda^2}{2} \end{align*} therefore yields \begin{align*} \log\mathbb E_\mu\left[e^{\lambda(h-\mathbb E_\mu[h])}\right]\le \frac{C\lambda^2}{2} \end{align*} for every bounded $1$-Lipschitz $h$ and every $\lambda\in\mathbb R$. By the *Bobkov-Gotze Theorem*, $\mu$ satisfies \begin{align*} W_1(\nu,\mu)\le \sqrt{2C\,H(\nu\mid\mu)} \end{align*} for every probability measure $\nu$. Thus the Herbst estimate supplied by the functional inequality has been converted into the transportation-cost inequality $T_1(C)$. [/example] ## Tensorization and Stability of Transportation Inequalities Why do transportation inequalities matter for concentration on high-dimensional product spaces? The answer is tensorization: if each coordinate satisfies $T_1$ with a uniform constant, then the product law satisfies a $T_1$ inequality for the product metric with no dimension loss in the variance proxy. [definition: Product L One Metric] Let $(X_i,d_i)$ be metric spaces for $1\le i\le n$. On $X=X_1\times\cdots\times X_n$, the product $L^1$ metric is \begin{align*} d_1^{(n)}:X\times X\to[0,\infty),\qquad d_1^{(n)}(x,y)=\sum_{i=1}^n d_i(x_i,y_i), \end{align*} where $x=(x_1,\dots,x_n)$ and $y=(y_1,\dots,y_n)$. 
[/definition] This metric is adapted to functions whose coordinatewise Lipschitz constants add. For Gaussian-scale concentration of separately Lipschitz functions, the squared coordinate constants appear, so the product theorem is best stated first for a weighted Euclidean combination of the coordinate distances. The $L^1$ version then follows from Cauchy-Schwarz. [quotetheorem:6786] [citeproof:6786] The tensorized statement is useful because it translates coordinatewise control into a bound for a single high-dimensional observable. In applications the observable is often not naturally Lipschitz for the weighted Euclidean metric, but it has separate coordinate sensitivities. The next example rewrites those sensitivities as the variance proxy that appears in bounded-difference inequalities. [example: Empirical Averages Through Transportation Distance] Let $X=(X_1,\dots,X_n)$ have independent coordinates, where $X_i$ takes values in a metric space $(E_i,d_i)$ and its law $\mu_i$ satisfies $T_1(C_i)$ with $C_i>0$, and let $\mu=\mu_1\otimes\cdots\otimes\mu_n$ be the law of $X$. Let $F:E_1\times\cdots\times E_n\to\mathbb R$ have coordinate sensitivities $a_1,\dots,a_n\ge 0$, in the sense of the separate Lipschitz bound displayed below. By *Tensorization of T One*, $\mu$ satisfies $T_1(1)$ for the weighted product metric \begin{align*} \rho(x,y)=\left(\sum_{i=1}^n \frac{d_i(x_i,y_i)^2}{C_i}\right)^{1/2}. \end{align*} For any $x,y\in E_1\times\cdots\times E_n$, the assumed separate Lipschitz bound gives \begin{align*} |F(x)-F(y)|\le \sum_{i=1}^n a_i d_i(x_i,y_i). \end{align*} Rewrite the right-hand side as \begin{align*} \sum_{i=1}^n a_i d_i(x_i,y_i)=\sum_{i=1}^n \left(a_i\sqrt{C_i}\right)\left(\frac{d_i(x_i,y_i)}{\sqrt{C_i}}\right). \end{align*} Applying Cauchy-Schwarz to these two finite sequences yields \begin{align*} \sum_{i=1}^n \left(a_i\sqrt{C_i}\right)\left(\frac{d_i(x_i,y_i)}{\sqrt{C_i}}\right)\le \left(\sum_{i=1}^n C_i a_i^2\right)^{1/2}\left(\sum_{i=1}^n \frac{d_i(x_i,y_i)^2}{C_i}\right)^{1/2}. \end{align*} Therefore \begin{align*} |F(x)-F(y)|\le \left(\sum_{i=1}^n C_i a_i^2\right)^{1/2}\rho(x,y). \end{align*} Thus $F$ is $L$-Lipschitz for $\rho$, where \begin{align*} L=\left(\sum_{i=1}^n C_i a_i^2\right)^{1/2}. 
\end{align*} If $L>0$, then $F/L$ is $1$-Lipschitz, so *T One Implies Lipschitz Concentration* applied under the $T_1(1)$ inequality gives, for every $t\ge 0$, \begin{align*} \mathbb P(F-\mathbb E[F]\ge t)=\mathbb P\left(\frac{F-\mathbb E[F]}{L}\ge \frac{t}{L}\right). \end{align*} The concentration bound for $F/L$ gives \begin{align*} \mathbb P\left(\frac{F-\mathbb E[F]}{L}\ge \frac{t}{L}\right)\le \exp\left(-\frac{(t/L)^2}{2}\right). \end{align*} Since \begin{align*} \frac{(t/L)^2}{2}=\frac{t^2}{2L^2}=\frac{t^2}{2\sum_{i=1}^n C_i a_i^2}, \end{align*} we obtain \begin{align*} \mathbb P(F-\mathbb E[F]\ge t)\le \exp\left(-\frac{t^2}{2\sum_{i=1}^n C_i a_i^2}\right). \end{align*} If $L=0$, then $\sum_i C_i a_i^2=0$ and $C_i>0$ for every $i$, so $a_i=0$ for every $i$. The assumed Lipschitz bound then gives $|F(x)-F(y)|\le 0$ for all $x,y$, hence $F$ is constant and the upper-tail probability is $0$ for every $t>0$. For the empirical average \begin{align*} F(x)=\frac{1}{n}\sum_{i=1}^n g(x_i), \end{align*} where $g$ is $1$-Lipschitz, we have \begin{align*} |F(x)-F(y)|=\left|\frac{1}{n}\sum_{i=1}^n \bigl(g(x_i)-g(y_i)\bigr)\right|. \end{align*} By the triangle inequality, \begin{align*} \left|\frac{1}{n}\sum_{i=1}^n \bigl(g(x_i)-g(y_i)\bigr)\right|\le \frac{1}{n}\sum_{i=1}^n |g(x_i)-g(y_i)|. \end{align*} Since $g$ is $1$-Lipschitz, \begin{align*} \frac{1}{n}\sum_{i=1}^n |g(x_i)-g(y_i)|\le \frac{1}{n}\sum_{i=1}^n d_i(x_i,y_i). \end{align*} Thus the sensitivities are $a_i=1/n$. If $C_i\le C$ for every $i$, then \begin{align*} \sum_{i=1}^n C_i a_i^2=\sum_{i=1}^n C_i\frac{1}{n^2}. \end{align*} Factoring out $1/n^2$ gives \begin{align*} \sum_{i=1}^n C_i\frac{1}{n^2}=\frac{1}{n^2}\sum_{i=1}^n C_i. \end{align*} Using $C_i\le C$, \begin{align*} \frac{1}{n^2}\sum_{i=1}^n C_i\le \frac{1}{n^2}\cdot nC=\frac{C}{n}. \end{align*} Substituting this variance proxy into the preceding tail estimate gives \begin{align*} \mathbb P(F-\mathbb E[F]\ge t)\le \exp\left(-\frac{t^2}{2C/n}\right). 
\end{align*} Equivalently, \begin{align*} \mathbb P(F-\mathbb E[F]\ge t)\le \exp\left(-\frac{nt^2}{2C}\right). \end{align*} The transport formulation therefore records the same effective variance proxy as bounded differences: each coordinate contributes its squared sensitivity $a_i^2$ multiplied by its one-coordinate transport constant $C_i$. [/example] The empirical-average example treats a scalar observable as a Lipschitz image of the product random vector. To use transportation inequalities systematically after applying maps, we need a stability principle showing that Lipschitz pushforwards preserve $T_1$ with the expected rescaling of the constant. [quotetheorem:6788] [citeproof:6788] The factor $L^2$ is the same scaling seen in ordinary sub-Gaussian estimates: multiplying a Lipschitz seminorm by $L$ multiplies the variance proxy by $L^2$. The Borel assumption is needed so that $T_\#\mu(A)=\mu(T^{-1}(A))$ defines a probability measure on the Borel sets of $Y$, and the Polish hypotheses ensure that Bobkov-Gotze can be applied after passing to the image law. In the proof, bounded Borel Lipschitz functions on $Y$ compose with $T$ to give bounded Borel Lipschitz functions on $X$, so both the Laplace transform and the centering term are legitimate. The result does not claim that arbitrary measurable images preserve transportation inequalities; without a Lipschitz bound, nearby points in $X$ can be sent far apart in $Y$, so $W_1$ on the image may see fluctuations that the original metric did not control. For instance, if $X=[0,1]$ has the usual metric and $\mu$ is uniform, the map $T(x)=1/x$ for $x>0$ and $T(0)=0$ is Borel but not Lipschitz; the pushforward has a polynomial tail and the identity function on the image has no Gaussian Laplace bound, so no finite $T_1$ constant can be inferred. Pushforward stability explains scalar concentration for Lipschitz observables, but some applications need a transport bound for an arbitrary law on the whole product space. 
Marton's inequality supplies such a bound for Hamming metrics, and its proof exposes the role of the chain rule for entropy. [quotetheorem:6790] [citeproof:6790] Marton's inequality is weaker in constants than some bounded-difference bounds, but it has a geometric advantage: it controls the distance from an arbitrary tilted law to the product law. Product independence is essential because the proof decomposes entropy into coordinate contributions; for dependent coordinates, the conditional laws may change with the past and the same chain-rule estimate no longer yields the displayed square-root bound. A concrete failure occurs for the diagonal law on $\{0,1\}^n$, where $X_1=\cdots=X_n$ with probability $1/2$ for each diagonal point. Moving the conditional law from the all-zero point to the diagonal reference law has Hamming cost of order $n$, while the entropy cost is only $\log 2$, contradicting the displayed $\sqrt n$ scale for large $n$. The Hamming metric is also essential to this formulation because the one-coordinate input is the discrete two-level oscillation estimate; replacing it by an unrelated metric would require a different one-coordinate transportation inequality. This viewpoint is the entry point for concentration around sets, convex distance inequalities, empirical process bounds, information-theoretic stability estimates, statistical learning generalization bounds, and the stronger $T_2$ inequalities developed in Chapter 8. # 8. Quadratic Transport and Talagrand's $T_2$ Inequality This chapter moves from entropy bounds to transport bounds with quadratic cost. It uses the language of probability measures on metric spaces, couplings, relative entropy, and the Lipschitz concentration and moment-generating-function tools developed in Chapters 2 and 7. The central object is the Wasserstein distance $W_2$, which measures the least root-mean-square displacement needed to transform one probability law into another. 
The main result is Talagrand's $T_2$ inequality for Gaussian measure, which turns relative entropy into a quadratic transportation estimate and, through the $T_1$ concentration principle of Chapter 7, yields dimension-free Gaussian concentration. We also record the Otto--Villani principle, which explains why logarithmic Sobolev inequalities are strong enough to imply quadratic transport inequalities. ## Quadratic Transport Cost How should we measure the distance between two probability laws when the geometry of the underlying space matters? Total variation ignores the metric, while [weak convergence](/page/Weak%20Convergence) records the metric but does not assign a displacement cost. Quadratic transport inserts the cost $|x-y|^2$ and asks for the most economical joint realisation of the two laws. [definition: Coupling] Let $\nu$ and $\rho$ be probability measures on $\mathbb R^n$. A coupling of $\nu$ and $\rho$ is a probability measure $\pi$ on $\mathbb R^n \times \mathbb R^n$ such that \begin{align*} \pi(A \times \mathbb R^n) &= \nu(A), & \pi(\mathbb R^n \times B) &= \rho(B) \end{align*} for all Borel sets $A,B \subset \mathbb R^n$. [/definition] A coupling is a joint law $(X,Y)$ with prescribed marginals $X \sim \nu$ and $Y \sim \rho$. Once the law of the pair is chosen, the expected transportation cost is $\mathbb E[|X-Y|^2]$. To compare laws without choosing a preferred joint construction, we need to minimise this cost over all couplings. [definition: Quadratic Wasserstein Distance] Let $\mathcal P_2(\mathbb R^n)$ be the set of probability measures on $\mathbb R^n$ with finite second moment. 
The quadratic Wasserstein distance is the map \begin{align*} W_2: \mathcal P_2(\mathbb R^n)\times \mathcal P_2(\mathbb R^n) \to [0,\infty) \end{align*} defined by \begin{align*} W_2(\nu,\rho) := \inf_{\pi \in \Pi(\nu,\rho)} \left( \int_{\mathbb R^n \times \mathbb R^n} |x-y|^2\,d\pi(x,y) \right)^{1/2}, \end{align*} where $\Pi(\nu,\rho)$ denotes the set of all couplings of $\nu$ and $\rho$. [/definition] The square root is part of the definition because it makes $W_2$ a metric on the set of probability laws with finite second moment. The corresponding squared quantity $W_2^2$ is the minimal mean quadratic displacement, so translations give the first calibration of the definition. [example: Translation Of A Law] Let $X\sim\nu$ on $\mathbb R^n$ with finite second moment, let $a\in\mathbb R^n$, and let $\rho$ be the law of $X+a$. The pair $(X,X+a)$ is a coupling of $\nu$ and $\rho$, and its quadratic cost is \begin{align*} \mathbb E\left[|X-(X+a)|^2\right]=\mathbb E\left[|-a|^2\right]=|a|^2. \end{align*} Since $W_2^2(\nu,\rho)$ is the infimum of the quadratic costs over all couplings, this coupling gives \begin{align*} W_2^2(\nu,\rho)\le |a|^2. \end{align*} Taking square roots gives $W_2(\nu,\rho)\le |a|$. For the reverse inequality, take any coupling $(Y,Z)$ of $\nu$ and $\rho$. Finite second moments imply finite first moments, so the expectations below are defined. Since $Y\sim\nu$ and $Z\sim\rho$, while $\rho$ is the law of $X+a$, we have \begin{align*} \mathbb E[Z]-\mathbb E[Y]=\mathbb E[X+a]-\mathbb E[X]=a. \end{align*} Apply Jensen's inequality to the convex function $u\mapsto |u|^2$ and the random vector $Z-Y$: \begin{align*} \mathbb E[|Z-Y|^2]\ge \left|\mathbb E[Z-Y]\right|^2. \end{align*} Also, \begin{align*} \mathbb E[Z-Y]=\mathbb E[Z]-\mathbb E[Y]=a. \end{align*} Therefore every coupling satisfies \begin{align*} \mathbb E[|Z-Y|^2]\ge |a|^2. \end{align*} Taking the infimum over all couplings gives $W_2^2(\nu,\rho)\ge |a|^2$. 
Together with the translation coupling bound, \begin{align*} W_2^2(\nu,\rho)=|a|^2, \end{align*} so $W_2(\nu,\rho)=|a|$. Thus translating a law by $a$ costs exactly the Euclidean displacement length $|a|$ in quadratic Wasserstein distance. [/example] This example explains why $W_2$ is sensitive to displacement rather than pointwise disagreement. Transport alone does not know which law is the reference law for concentration. To compare a transported law with a reference law, we need an information cost for changing measure. [definition: Relative Entropy With Respect To A Reference Measure] Let $\mu$ be a probability measure on a measurable space $(E,\mathcal E)$. The relative entropy with respect to $\mu$ is the map \begin{align*} H(\cdot\mid\mu):\mathcal P(E)\to [0,\infty] \end{align*} defined by \begin{align*} H(\nu\mid \mu) := \int_E \log\left(\frac{d\nu}{d\mu}\right)\,d\nu \end{align*} when $\nu \ll \mu$, and by $H(\nu\mid \mu):=+\infty$ when $\nu$ is not absolutely continuous with respect to $\mu$. [/definition] The transport inequalities in this chapter compare a geometric distance from $\mu$ with an information-theoretic distance from $\mu$. Entropy penalises rare changes of measure, while $W_2$ records how far mass must move. ## Talagrand's Gaussian $T_2$ Inequality For Gaussian measure, entropy controls quadratic displacement with the optimal constant. The question is: if a probability law is obtained by tilting a Gaussian density, how far can it move from the original Gaussian in $W_2$ compared with the entropy spent by the tilt? [definition: Standard Gaussian Measure] The standard Gaussian measure on $(\mathbb R^n,\mathcal B(\mathbb R^n))$ is the probability measure $\gamma_n:\mathcal B(\mathbb R^n)\to[0,1]$ defined by \begin{align*} \gamma_n(A)=\int_A (2\pi)^{-n/2}e^{-|x|^2/2}\,d\mathcal L^n(x) \end{align*} for every Borel set $A\subset \mathbb R^n$. [/definition] Thus $\gamma_n$ has density $(2\pi)^{-n/2}e^{-|x|^2/2}$ with respect to $\mathcal L^n$. 
The Gaussian is the reference law for the sharp quadratic inequality because its negative log-density, which equals $|x|^2/2$ up to an additive constant, has Hessian $I_n$. This uniform convexity is the source of the constant $2$ below. [quotetheorem:6792] [citeproof:6792] The theorem says that no tilt can move a Gaussian law in mean-square distance by more than the square root of twice its entropy cost. The Gaussian hypothesis is essential for this constant: if $\mu=\mathcal N(0,\sigma^2 I_n)$ with large $\sigma$, translating by $a$ gives $W_2^2(\mathcal N(a,\sigma^2 I_n),\mu)=|a|^2$ but $H(\mathcal N(a,\sigma^2 I_n)\mid\mu)=|a|^2/(2\sigma^2)$, so the constant must scale like $\sigma^2$. The inequality does not identify the optimal coupling or imply a logarithmic Sobolev inequality by itself. Its role here is to provide the quadratic transport input that will later imply $T_1$ concentration. [example: Tilted Gaussian Law] Let $a\in\mathbb R^n$ and define $\nu$ by \begin{align*} d\nu(x)=e^{a\cdot x-|a|^2/2}\,d\gamma_n(x). \end{align*} Using the density of $\gamma_n$, the Lebesgue density of $\nu$ is \begin{align*} e^{a\cdot x-|a|^2/2}(2\pi)^{-n/2}e^{-|x|^2/2}=(2\pi)^{-n/2}\exp\left(-\frac{|x|^2-2a\cdot x+|a|^2}{2}\right). \end{align*} Since \begin{align*} |x-a|^2=|x|^2-2a\cdot x+|a|^2, \end{align*} this becomes \begin{align*} (2\pi)^{-n/2}\exp\left(-\frac{|x-a|^2}{2}\right). \end{align*} Thus $\nu=\mathcal N(a,I_n)$. The relative entropy is \begin{align*} H(\nu\mid\gamma_n)=\int_{\mathbb R^n}\log\left(\frac{d\nu}{d\gamma_n}(x)\right)\,d\nu(x). \end{align*} Because \begin{align*} \log\left(\frac{d\nu}{d\gamma_n}(x)\right)=a\cdot x-\frac{|a|^2}{2}, \end{align*} we get \begin{align*} H(\nu\mid\gamma_n)=\int_{\mathbb R^n}\left(a\cdot x-\frac{|a|^2}{2}\right)\,d\nu(x). \end{align*} Since $\nu=\mathcal N(a,I_n)$ has mean $a$, \begin{align*} H(\nu\mid\gamma_n)=a\cdot a-\frac{|a|^2}{2}. \end{align*} Therefore \begin{align*} H(\nu\mid\gamma_n)=\frac{|a|^2}{2}. 
\end{align*} If $X\sim\gamma_n$, then $X+a\sim\nu$, so $(X+a,X)$ is a coupling of $\nu$ and $\gamma_n$. Its quadratic cost is \begin{align*} \mathbb E[|(X+a)-X|^2]=\mathbb E[|a|^2]=|a|^2. \end{align*} Hence \begin{align*} W_2^2(\nu,\gamma_n)\le |a|^2. \end{align*} For the reverse inequality, let $(Y,Z)$ be any coupling of $\nu$ and $\gamma_n$. Then $\mathbb E[Y]=a$ and $\mathbb E[Z]=0$, so \begin{align*} \mathbb E[Y-Z]=a. \end{align*} Jensen's inequality applied to the convex function $u\mapsto |u|^2$ gives \begin{align*} \mathbb E[|Y-Z|^2]\ge |\mathbb E[Y-Z]|^2. \end{align*} Substituting $\mathbb E[Y-Z]=a$ yields \begin{align*} \mathbb E[|Y-Z|^2]\ge |a|^2. \end{align*} Taking the infimum over all couplings gives \begin{align*} W_2^2(\nu,\gamma_n)\ge |a|^2. \end{align*} Combining the two bounds, \begin{align*} W_2^2(\nu,\gamma_n)=|a|^2=2H(\nu\mid\gamma_n). \end{align*} Thus Talagrand's Gaussian $T_2$ inequality is sharp on linear tilts, which are exactly translations of the standard Gaussian. [/example] This equality case is a useful calibration. The constant $2$ cannot be improved, and the inequality is exactly sharp along translations. The same scaling also identifies the right constant for non-isotropic centred Gaussian laws. [remark: Centred Gaussian Reference] If $\gamma_{n,\Sigma}=\mathcal N(0,\Sigma)$ with positive definite covariance matrix $\Sigma$ satisfying $\Sigma \le \sigma^2 I_n$, then the corresponding inequality has constant $2\sigma^2$: \begin{align*} W_2^2(\nu,\gamma_{n,\Sigma}) \le 2\sigma^2 H(\nu\mid\gamma_{n,\Sigma}). \end{align*} This follows by applying the standard result after a linear change of variables and using the Lipschitz norm of $\Sigma^{1/2}$. [/remark] The covariance version shows how transport inequalities encode curvature or variance scale. A flatter Gaussian allows larger displacement for the same entropy. 
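[example: Dilated Gaussian Law] As a complementary check, offered as a sketch that relies on two standard one-dimensional Gaussian formulas not derived in this chapter, take $n=1$, $s>0$, and $\nu=\mathcal N(0,s^2)$. The monotone map $x\mapsto sx$ pushes $\gamma_1$ to $\nu$ and is optimal for quadratic cost on the line, so \begin{align*} W_2^2(\nu,\gamma_1)=\mathbb E[|sX-X|^2]=(s-1)^2,\qquad X\sim\gamma_1. \end{align*} The relative entropy of a centred Gaussian with respect to $\gamma_1$ is \begin{align*} H(\nu\mid\gamma_1)=\frac{s^2-1}{2}-\log s. \end{align*} Talagrand's inequality $W_2^2(\nu,\gamma_1)\le 2H(\nu\mid\gamma_1)$ therefore reads \begin{align*} (s-1)^2\le s^2-1-2\log s, \end{align*} which, after expanding the square and cancelling $s^2$, is equivalent to $\log s\le s-1$. This elementary inequality holds for every $s>0$, with equality only at $s=1$. Dilations therefore satisfy the Gaussian $T_2$ inequality strictly, in contrast with the translations of the tilted Gaussian example. [/example]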
## The Otto--Villani Principle Talagrand's inequality was first proved for Gaussian measure directly, but the broader mechanism is functional-analytic: logarithmic Sobolev inequalities imply quadratic transport inequalities. The guiding question is why an inequality controlling entropy by Fisher information should control the cost of moving mass. [definition: Logarithmic Sobolev Inequality] Let $\mu$ be a probability measure on $\mathbb R^n$. The entropy functional of $\mu$ is the map \begin{align*} \operatorname{Ent}_\mu:L^1_+(\mu)\to [0,\infty] \end{align*} defined by \begin{align*} \operatorname{Ent}_\mu(g) := \int g\log g\,d\mu - \left(\int g\,d\mu\right)\log\left(\int g\,d\mu\right) \end{align*} for nonnegative $g\in L^1(\mu)$ for which the right-hand side is defined. We say that $\mu$ satisfies a logarithmic Sobolev inequality with constant $C>0$ if every $f\in C_c^\infty(\mathbb R^n)$ satisfies \begin{align*} \operatorname{Ent}_\mu(f^2) \le 2C\int_{\mathbb R^n} |\nabla f|^2\,d\mu. \end{align*} [/definition] In density form, set $g=f^2$ with $\int g\,d\mu=1$. The logarithmic Sobolev inequality becomes an entropy-information estimate, \begin{align*} H(g\mu\mid\mu) \le \frac{C}{2}\int \frac{|\nabla g|^2}{g}\,d\mu, \end{align*} for smooth positive densities $g$. This estimate controls not just the size of a tilt, but the behaviour of Hamilton--Jacobi infimum convolutions associated with Lipschitz test functions. We need the Otto--Villani theorem to turn this infinitesimal entropy control into a global quadratic transport bound. [quotetheorem:6794] [citeproof:6794] This theorem is often used as a black-box implication: prove a logarithmic Sobolev inequality, then obtain a $T_2$ inequality without constructing optimal maps. 
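[remark: Checking The Density Form] The passage from the logarithmic Sobolev inequality to the density form stated before the theorem can be verified directly. This is a sketch for smooth positive $f$ with $\int f^2\,d\mu=1$, ignoring the approximation needed to reach such $f$ from $C_c^\infty(\mathbb R^n)$. With $g=f^2$ we have $\nabla g=2f\nabla f$, hence \begin{align*} \frac{|\nabla g|^2}{g}=\frac{4f^2|\nabla f|^2}{f^2}=4|\nabla f|^2,\qquad \operatorname{Ent}_\mu(f^2)=\int g\log g\,d\mu=H(g\mu\mid\mu). \end{align*} Substituting into $\operatorname{Ent}_\mu(f^2)\le 2C\int|\nabla f|^2\,d\mu$ gives \begin{align*} H(g\mu\mid\mu)\le 2C\cdot\frac14\int\frac{|\nabla g|^2}{g}\,d\mu=\frac{C}{2}\int\frac{|\nabla g|^2}{g}\,d\mu. \end{align*} [/remark]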
The logarithmic Sobolev hypothesis is doing real work. Singular translates are not the relevant obstruction: if a translate is singular with respect to the reference measure, then its relative entropy is infinite and the transport inequality holds trivially for that comparison. The genuine limitation is that the implication is not reversible. There are measures that satisfy a quadratic transport inequality while the entropy dissipation estimate required by a logarithmic Sobolev inequality fails; transport controls displacement from tilted laws, whereas LSI controls the decay of entropy along a whole smoothing dynamics. In particular, the theorem does not say that every measure satisfying $T_2(C)$ satisfies $\mathrm{LSI}(C)$. For the standard Gaussian, the Gaussian logarithmic Sobolev inequality has $C=1$, recovering Talagrand's bound and feeding the concentration results below. [remark: Direction Of Implications] The implication $\mathrm{LSI}(C)\Rightarrow T_2(C)$ is not reversible in general. Quadratic transport controls concentration of Lipschitz functions, while logarithmic Sobolev inequalities also control entropy dissipation and tensorise in a stronger differential form. [/remark] The non-reversibility is important in applications. Transport inequalities are often the concentration output of a stronger analytic estimate, not a replacement for the estimate itself. ## From $T_2$ To $T_1$ And Concentration Quadratic transport inequalities imply linear transport inequalities because first moments are bounded by second moments. The next question is how a bound on $W_2$ produces familiar concentration estimates for Lipschitz functions. [definition: Transportation-Cost Inequality] Let $(E,d)$ be a metric space and let $\mathcal P_p(E)$ be the set of probability measures $\nu$ on $E$ with $\int d(x,x_0)^p\,d\nu(x)<\infty$ for some $x_0\in E$. 
The $p$-Wasserstein distance is the map \begin{align*} W_p:\mathcal P_p(E)\times\mathcal P_p(E)\to[0,\infty) \end{align*} defined by \begin{align*} W_p(\nu,\rho):=\inf_{\pi\in\Pi(\nu,\rho)}\left(\int_{E\times E} d(x,y)^p\,d\pi(x,y)\right)^{1/p}, \end{align*} where $\Pi(\nu,\rho)$ is the set of couplings of $\nu$ and $\rho$. Let $\mu\in\mathcal P_p(E)$. We say that $\mu$ satisfies $T_p(C)$ for $p\in\{1,2\}$ and $C>0$ if every probability measure $\nu\in\mathcal P_p(E)$ satisfies \begin{align*} W_p^2(\nu,\mu) \le 2C H(\nu\mid\mu). \end{align*} [/definition] With this convention, Gaussian $T_2$ has $C=1$, and $T_1(C)$ agrees with the form $W_1(\nu,\mu)\le\sqrt{2C\,H(\nu\mid\mu)}$ used in Chapter 7. The concentration argument later only tests against Lipschitz observables, and those observables are naturally paired with $W_1$ rather than $W_2$. The problem is therefore to discard the extra quadratic information without worsening the entropy constant; Jensen's inequality is exactly the mechanism that compares the two transport scales when the second moment is finite. [quotetheorem:6796] [citeproof:6796] The finite second-moment hypothesis behind $T_2$ is necessary for this comparison to be meaningful: a probability law with heavy tails may have finite first moment but infinite second moment, so $W_1$ can be finite while $W_2$ is not. The theorem does not recover any quadratic displacement information after passing to $T_1$; for instance, two measures can be close in $W_1$ while a small amount of mass is sent very far away, making $W_2$ large or infinite. What survives is precisely the part matched to Lipschitz observables through Kantorovich duality. We need the next theorem to convert this dual comparison into a tail bound for every Lipschitz observable. [quotetheorem:6797] [citeproof:6797] The Lipschitz hypothesis is essential: under Gaussian measure, the function $F(x)=|x|^2$ is not Lipschitz and has chi-squared rather than Gaussian upper tails. The theorem does not assert matching lower tails, sharp constants for every individual observable, or concentration for arbitrary measurable functions. 
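[remark: The Two Elementary Steps] The two quoted theorems rest on short computations, sketched here under the normalization $W_p^2(\nu,\mu)\le 2C\,H(\nu\mid\mu)$ and assuming the Laplace form of $T_1(C)$ from Chapter 7; the cited proofs remain the reference. First, for any coupling $\pi$ of $\nu$ and $\mu$, Jensen's inequality gives \begin{align*} \int d(x,y)\,d\pi(x,y)\le\left(\int d(x,y)^2\,d\pi(x,y)\right)^{1/2}, \end{align*} and taking the infimum over $\pi$ yields $W_1(\nu,\mu)\le W_2(\nu,\mu)$, so $W_1^2(\nu,\mu)\le W_2^2(\nu,\mu)\le 2C\,H(\nu\mid\mu)$. Second, if a $1$-Lipschitz $f$ satisfies $\log\mathbb E_\mu[e^{\lambda(f-\mathbb E_\mu[f])}]\le C\lambda^2/2$ for all $\lambda\ge 0$, then Markov's inequality applied to the exponential gives, for every $t\ge 0$, \begin{align*} \mu\left(f-\mathbb E_\mu[f]\ge t\right)\le \inf_{\lambda\ge 0}\exp\left(-\lambda t+\frac{C\lambda^2}{2}\right)=\exp\left(-\frac{t^2}{2C}\right), \end{align*} the infimum being attained at $\lambda=t/C$. [/remark]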
For Gaussian measure, $C=1$, so this recovers the standard dimension-free concentration of Lipschitz functions and explains why the preceding transport inequalities are useful for applications. [example: Lipschitz Maps From Gaussian Space] Let $X\sim\gamma_n$ and let $F:\mathbb R^n\to\mathbb R$ be $L$-Lipschitz for the Euclidean metric, with $L>0$. Since $\gamma_n$ satisfies $T_2(1)$ by *Talagrand Gaussian T Two Inequality*, it satisfies $T_1(1)$ by *T Two Implies T One*. Define \begin{align*} G(x):=\frac{F(x)}{L}. \end{align*} For every $x,y\in\mathbb R^n$, \begin{align*} |G(x)-G(y)|=\frac{|F(x)-F(y)|}{L}. \end{align*} Because $F$ is $L$-Lipschitz, \begin{align*} \frac{|F(x)-F(y)|}{L}\le \frac{L|x-y|}{L}=|x-y|. \end{align*} Thus $G$ is $1$-Lipschitz. Applying *T One Concentration For Lipschitz Functions* to $G$ with $C=1$ gives, for every $s\ge 0$, \begin{align*} \mathbb P\left(G(X)-\mathbb E[G(X)]\ge s\right)\le \exp\left(-\frac{s^2}{2}\right). \end{align*} By linearity of expectation, \begin{align*} G(X)-\mathbb E[G(X)]=\frac{F(X)-\mathbb E[F(X)]}{L}. \end{align*} Fix $t\ge 0$. Taking $s=t/L$ and using $L>0$, \begin{align*} \left\{G(X)-\mathbb E[G(X)]\ge \frac{t}{L}\right\}=\left\{F(X)-\mathbb E[F(X)]\ge t\right\}. \end{align*} Therefore, for every $t\ge 0$, \begin{align*} \mathbb P\left(F(X)-\mathbb E[F(X)]\ge t\right)\le \exp\left(-\frac{(t/L)^2}{2}\right)=\exp\left(-\frac{t^2}{2L^2}\right). \end{align*} For a nonempty set $A\subset\mathbb R^n$, the distance function $F(x)=\operatorname{dist}(x,A)$ is $1$-Lipschitz. Indeed, for every $z\in A$, the triangle inequality gives \begin{align*} \operatorname{dist}(x,A)\le |x-z|\le |x-y|+|y-z|. \end{align*} Taking the infimum over $z\in A$ gives \begin{align*} \operatorname{dist}(x,A)\le |x-y|+\operatorname{dist}(y,A). \end{align*} Interchanging $x$ and $y$ gives \begin{align*} \operatorname{dist}(y,A)\le |x-y|+\operatorname{dist}(x,A). \end{align*} The two inequalities imply \begin{align*} |\operatorname{dist}(x,A)-\operatorname{dist}(y,A)|\le |x-y|. 
\end{align*} Thus Gaussian enlargement bounds follow by applying the same transport-to-concentration estimate to the distance from the set. [/example] The example shows why transportation-cost inequalities are concentration inequalities in geometric form. Instead of estimating each moment-generating function separately, one proves a single inequality for all tilted laws. The distinction between linear and quadratic transport remains useful when deciding how much geometry the conclusion retains. [example: Comparing Linear And Quadratic Transport Consequences] Let $0<\varepsilon<1$, let $R>0$, and set \begin{align*} \nu=(1-\varepsilon)\gamma_1+\varepsilon\delta_R . \end{align*} We compare the linear and quadratic costs along one explicit coupling. Let $X\sim\gamma_1$, and let $B$ be independent of $X$ with $\mathbb P(B=0)=1-\varepsilon$ and $\mathbb P(B=1)=\varepsilon$. Define $Y=(1-B)X+BR$. For every Borel set $A\subset\mathbb R$, \begin{align*} \mathbb P(Y\in A)=\mathbb P(B=0)\mathbb P(X\in A)+\mathbb P(B=1)\mathbf 1_{\{R\in A\}}. \end{align*} Substituting the two probabilities gives \begin{align*} \mathbb P(Y\in A)=(1-\varepsilon)\gamma_1(A)+\varepsilon\delta_R(A). \end{align*} Thus $Y\sim\nu$, so $(Y,X)$ is a coupling of $\nu$ and $\gamma_1$. For this coupling, \begin{align*} Y-X=(1-B)X+BR-X=B(R-X). \end{align*} Since $B\in\{0,1\}$, we have $|B(R-X)|=B|R-X|$, and therefore \begin{align*} \mathbb E[|Y-X|]=\mathbb E[B|R-X|]. \end{align*} Independence of $B$ and $X$ gives \begin{align*} \mathbb E[B|R-X|]=\mathbb E[B]\mathbb E[|R-X|]=\varepsilon\mathbb E[|R-X|]. \end{align*} By the triangle inequality, $|R-X|\le R+|X|$, hence \begin{align*} W_1(\nu,\gamma_1)\le \mathbb E[|Y-X|]\le \varepsilon R+\varepsilon\mathbb E[|X|]. \end{align*} The part of this upper bound that grows with the distant location $R$ is linear in $R$ and has coefficient $\varepsilon$. 
For the squared quadratic cost of the same coupling, $B^2=B$ because $B\in\{0,1\}$, so \begin{align*} \mathbb E[|Y-X|^2]=\mathbb E[B(R-X)^2]. \end{align*} Using independence again, \begin{align*} \mathbb E[B(R-X)^2]=\varepsilon\mathbb E[(R-X)^2]. \end{align*} Expanding the square gives \begin{align*} (R-X)^2=R^2-2RX+X^2. \end{align*} Since $X\sim\gamma_1$ has $\mathbb E[X]=0$ and $\mathbb E[X^2]=1$, \begin{align*} \mathbb E[(R-X)^2]=R^2-2R\mathbb E[X]+\mathbb E[X^2]=R^2+1. \end{align*} Hence this transport plan has squared quadratic cost \begin{align*} \mathbb E[|Y-X|^2]=\varepsilon(R^2+1). \end{align*} The same mass $\varepsilon$ placed near distance $R$ therefore appears linearly in this $W_1$ cost estimate and quadratically in this $W_2^2$ cost estimate. This is why the $W_1$ consequence is suited to Lipschitz concentration, while the stronger $W_2$ estimate keeps mean-square displacement information and is much more sensitive to small amounts of mass placed far away. [/example] Combining this chapter with the $T_1$ duality of Chapter 7, the hierarchy can be summarised as \begin{align*} \mathrm{LSI}(C) \implies T_2(C) \implies T_1(C) \implies \text{sub-Gaussian concentration for Lipschitz functions}. \end{align*} This chain is the organising principle of the chapter. Entropy dissipation produces quadratic transport, quadratic transport produces linear transport, and linear transport produces the familiar Gaussian tails. # 9. Connections with Optimal Transport and Geometry This chapter connects the entropy methods developed earlier in the course with the geometry of optimal transport. It uses the earlier material on relative entropy, the Gibbs variational principle, logarithmic Sobolev inequalities, and basic weak convergence of probability measures. The recurring question is how an analytic inequality, such as a logarithmic Sobolev or transport-entropy inequality, can be interpreted as a statement about moving probability mass. 
We move between three viewpoints: duality for transport costs, infimal convolution semigroups, and curvature-driven convexity along transport interpolations. ## Kantorovich Duality and $c$-Transforms The first problem is to turn the cost of transporting one measure to another into something that can be tested by functions. This is the same structural move as the Gibbs variational principle from Chapter 1: a primal optimization over measures has a dual formulation over test functions, and the dual side is often the one that interacts with concentration. [definition: Transport Cost] Let $(X,d)$ be a Polish metric space and let $c:X \times X \to [0,\infty]$ be a lower semicontinuous cost function. The transport cost associated with $c$ is the functional \begin{align*} \mathcal T_c:\mathcal P(X)\times \mathcal P(X)\to[0,\infty] \end{align*} defined by \begin{align*} \mathcal T_c(\mu,\nu) := \inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times X} c(x,y)\,d\pi(x,y), \end{align*} for $\mu,\nu \in \mathcal P(X)$, where $\Pi(\mu,\nu)$ is the set of probability measures on $X\times X$ with first marginal $\mu$ and second marginal $\nu$. [/definition] The coupling $\pi$ records how mass starting at $x$ is reassigned to $y$. For concentration, the most important case is the quadratic cost \begin{align*} c(x,y)=\frac{d(x,y)^2}{2}, \end{align*} because its dual formulation naturally produces infimal convolutions and Lipschitz bounds; before proving duality, it helps to see how the cost measures an ordinary displacement. [example: Quadratic Cost on Euclidean Space] Let $X=\mathbb R^n$ with Euclidean distance, let $c(x,y)=|x-y|^2/2$, and let $\nu=(\tau_a)_{\#}\mu$ be the translate of $\mu$ by $\tau_a(x)=x+a$. The map $x\mapsto (x,x+a)$ induces a coupling $\pi_0\in\Pi(\mu,\nu)$, because its first coordinate has law $\mu$ and its second coordinate has law $(\tau_a)_{\#}\mu=\nu$. 
Its cost is \begin{align*} \int_{\mathbb R^n\times\mathbb R^n}\frac{|x-y|^2}{2}\,d\pi_0(x,y)=\int_{\mathbb R^n}\frac{|x-(x+a)|^2}{2}\,d\mu(x) \end{align*} and since $x-(x+a)=-a$, this becomes \begin{align*} \int_{\mathbb R^n}\frac{|-a|^2}{2}\,d\mu(x)=\frac{|a|^2}{2}\int_{\mathbb R^n}1\,d\mu(x)=\frac{|a|^2}{2}. \end{align*} Thus $\mathcal T_c(\mu,\nu)\le |a|^2/2$. Assume now that $\mu$ has finite second moment. Let $\pi\in\Pi(\mu,\nu)$ be any coupling, and write $(U,V)$ for the coordinate random variables under $\pi$. The first marginal condition gives \begin{align*} \mathbb E_{\pi}[U]=\int_{\mathbb R^n}x\,d\mu(x). \end{align*} The second marginal condition and the definition of the translated measure give \begin{align*} \mathbb E_{\pi}[V]=\int_{\mathbb R^n}y\,d\nu(y)=\int_{\mathbb R^n}(x+a)\,d\mu(x)=\int_{\mathbb R^n}x\,d\mu(x)+a. \end{align*} Therefore \begin{align*} \mathbb E_{\pi}[V-U]=\mathbb E_{\pi}[V]-\mathbb E_{\pi}[U]=a. \end{align*} Expanding the squared displacement around its mean gives \begin{align*} \mathbb E_{\pi}|V-U|^2=\mathbb E_{\pi}|(V-U)-a|^2+2a\cdot\mathbb E_{\pi}[(V-U)-a]+|a|^2. \end{align*} Since $\mathbb E_{\pi}[(V-U)-a]=\mathbb E_{\pi}[V-U]-a=0$, this reduces to \begin{align*} \mathbb E_{\pi}|V-U|^2=\mathbb E_{\pi}|(V-U)-a|^2+|a|^2\ge |a|^2. \end{align*} Hence every coupling satisfies \begin{align*} \int_{\mathbb R^n\times\mathbb R^n}\frac{|x-y|^2}{2}\,d\pi(x,y)\ge \frac{|a|^2}{2}. \end{align*} Taking the infimum over all $\pi\in\Pi(\mu,\nu)$ and combining this lower bound with the translation coupling yields \begin{align*} \mathcal T_c(\mu,\nu)=\frac{|a|^2}{2}. \end{align*} So translating the whole distribution by $a$ costs exactly half the squared Euclidean displacement. [/example] This example shows why transport sees geometry: translating any measure with finite second moment by $a$ costs exactly $|a|^2/2$, regardless of the shape of the measure.
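The lower-bound argument in the translation example can be illustrated on empirical measures. The sketch below makes the following assumptions, none of which come from the course: it works in dimension one, $\mu$ is the uniform measure on $n$ Gaussian samples, $\nu$ is its translate by $a$, and only permutation couplings are tried (a subset of $\Pi(\mu,\nu)$). By the mean-displacement argument, every such coupling costs at least $|a|^2/2$, and the translation coupling attains it.

```python
import random


def coupling_cost(xs, ys):
    """Average of |x - y|^2 / 2 over the matched pairs (xs[i], ys[i])."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


def translation_vs_shuffled(n=200, a=1.5, trials=50, seed=0):
    """Compare the translation coupling with random permutation couplings."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # atoms of mu
    ys = [x + a for x in xs]                      # atoms of the translate nu
    translation_cost = coupling_cost(xs, ys)      # equals a^2 / 2 up to rounding
    shuffled_costs = []
    for _ in range(trials):
        perm = ys[:]
        rng.shuffle(perm)                         # a different admissible matching
        shuffled_costs.append(coupling_cost(xs, perm))
    return translation_cost, min(shuffled_costs)


if __name__ == "__main__":
    print(translation_vs_shuffled())
```

The random matchings are typically much more expensive than the translation coupling, because they add the variance term $\mathbb E_\pi|(V-U)-a|^2$ on top of $|a|^2$.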
The dual problem asks which functions detect this displacement most efficiently, and the device that builds the best admissible partner for a test function is the $c$-transform. [definition: Cost Transform] Let $c:X\times X\to (-\infty,\infty]$ be a cost function and let $\varphi:X\to \mathbb R\cup\{-\infty\}$. The $c$-transform of $\varphi$ is the function $\varphi^c:X\to \mathbb R\cup\{-\infty,+\infty\}$ defined by \begin{align*} \varphi^c(y) := \inf_{x\in X}\{c(x,y)-\varphi(x)\}. \end{align*} [/definition] The defining inequality is $\varphi(x)+\varphi^c(y)\le c(x,y)$. Any admissible pair gives a lower bound on the cost of every coupling, but a priori such bounds could be far from sharp because they only test the transport problem through functions. The duality question is whether, under enough topological regularity of the cost, these functional tests separate couplings well enough to recover the exact primal transport value. [quotetheorem:6799] Kantorovich duality is the bridge from geometry back to exponential moments. The lower semicontinuity of $c$ is a genuine regularity assumption: on $X=[0,1]$, take $\mu=\delta_0$, $\nu=\delta_1$, and define $c(0,1)=1$ while $c(x,y)=0$ elsewhere. The primal cost is then $1$, because the only coupling is $\delta_{(0,1)}$, but any bounded continuous admissible pair must satisfy $\varphi(0)+\psi(1)\le0$ by continuity from points where the cost is $0$, so the continuous dual gives at most $0$. The Polish hypothesis supplies the Radon tightness and separation by continuous functions used in the compact-approximation step; if the topology does not separate points, for instance the two-point space with the indiscrete topology and cost $\mathbb{1}_{\{x\ne y\}}$, continuous test functions cannot distinguish $\delta_0$ from $\delta_1$ although the primal cost is positive. The theorem also does not identify an optimal map; it only identifies the value of the optimal coupling problem through potentials. 
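On a finite space the $c$-transform and the easy half of duality can be computed directly. The following sketch uses hypothetical data (four grid points, an arbitrary test function, and the product coupling) to check the defining inequality $\varphi(x)+\varphi^c(y)\le c(x,y)$ and the resulting lower bound on the cost of a coupling. It illustrates weak duality only, not the equality asserted by Kantorovich duality.

```python
def c_transform(phi, cost):
    """Return phi^c(y) = min_x { c(x, y) - phi(x) } on a finite space.

    cost[x][y] stores c(x, y); phi[x] stores phi(x).
    """
    n_x, n_y = len(cost), len(cost[0])
    return [min(cost[x][y] - phi[x] for x in range(n_x)) for y in range(n_y)]


def dual_value(phi, psi, mu, nu):
    """Integral of phi against mu plus integral of psi against nu."""
    return sum(p * m for p, m in zip(phi, mu)) + sum(q * v for q, v in zip(psi, nu))


def primal_cost(pi, cost):
    """Cost of a coupling pi stored as a matrix pi[x][y]."""
    return sum(pi[x][y] * cost[x][y]
               for x in range(len(cost)) for y in range(len(cost[0])))


# Hypothetical data: four points on the line with cost |x - y|^2 / 2.
points = [0.0, 0.5, 1.0, 1.5]
cost = [[(x - y) ** 2 / 2 for y in points] for x in points]
phi = [0.3, -0.1, 0.4, 0.0]           # arbitrary test function
phi_c = c_transform(phi, cost)        # its best admissible partner

mu = [0.4, 0.3, 0.2, 0.1]
nu = [0.1, 0.2, 0.3, 0.4]
product_coupling = [[m * v for v in nu] for m in mu]  # one element of Pi(mu, nu)

if __name__ == "__main__":
    print(dual_value(phi, phi_c, mu, nu), primal_cost(product_coupling, cost))
```

Any other coupling with the same marginals could replace `product_coupling`; the dual value stays below its cost because the pointwise inequality is integrated against the coupling.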
A transport inequality will control the left-hand side by entropy, while the right-hand side converts that control into inequalities for integrals of transformed functions. [remark: Duality as a Transport Version of Gibbs] The Gibbs variational formula writes entropy as the convex dual of the log-Laplace transform. Kantorovich duality writes transport cost as the convex dual of a constraint on pairs of functions. Transport-entropy inequalities are powerful because they allow these two dualities to be composed. [/remark] ## Infimal Convolution and the Hopf-Lax Viewpoint The next problem is to understand the specific dual functions generated by the quadratic transport cost. For the cost \begin{align*} c(x,y)=\frac{d(x,y)^2}{2t}, \end{align*} the $c$-transform is an infimum over penalized values of a function, which is the discrete-time shadow of the Hamilton-Jacobi semigroup. [definition: Infimal Convolution Operator] Let $(X,d)$ be a metric space and let $t>0$. The infimal convolution operator at time $t$ is the map \begin{align*} Q_t : \{f:X\to \mathbb R\cup\{\infty\}\}\to \{g:X\to \mathbb R\cup\{-\infty,\infty\}\} \end{align*} defined by \begin{align*} (Q_t f)(x) := \inf_{y\in X}\left\{f(y)+\frac{d(x,y)^2}{2t}\right\}. \end{align*} [/definition] The operator $Q_t$ lowers $f$ by allowing the point $x$ to borrow values from nearby points, paying a quadratic travel penalty. Concentration enters because inequalities involving $Q_t f$ express that a function cannot have too many high values unless the measure pays transport cost; the Lipschitz case gives the basic estimate used in the concentration proof. [example: Infimal Convolution of a Lipschitz Function] Let $f:X\to\mathbb R$ be $L$-Lipschitz on a metric space and fix $t>0$. For any $x,y\in X$, the Lipschitz condition gives \begin{align*} f(y)\ge f(x)-L\,d(x,y). \end{align*} Therefore \begin{align*} f(y)+\frac{d(x,y)^2}{2t}\ge f(x)-L\,d(x,y)+\frac{d(x,y)^2}{2t}. \end{align*} Write $r=d(x,y)\ge0$. 
The part depending on $r$ satisfies \begin{align*} -Lr+\frac{r^2}{2t}=\frac{1}{2t}\left(r^2-2tLr\right). \end{align*} Completing the square gives \begin{align*} \frac{1}{2t}\left(r^2-2tLr\right)=\frac{1}{2t}\left((r-tL)^2-t^2L^2\right). \end{align*} Thus \begin{align*} -Lr+\frac{r^2}{2t}=\frac{(r-tL)^2}{2t}-\frac{tL^2}{2}. \end{align*} Since $(r-tL)^2/(2t)\ge0$, we get \begin{align*} -Lr+\frac{r^2}{2t}\ge -\frac{tL^2}{2}. \end{align*} Substituting $r=d(x,y)$, for every $y\in X$, \begin{align*} f(y)+\frac{d(x,y)^2}{2t}\ge f(x)-\frac{tL^2}{2}. \end{align*} Taking the infimum over $y$ in the definition of $Q_t$ gives \begin{align*} Q_t f(x)\ge f(x)-\frac{tL^2}{2}. \end{align*} On the other hand, choosing $y=x$ in the infimum gives \begin{align*} Q_t f(x)\le f(x)+\frac{d(x,x)^2}{2t}=f(x). \end{align*} Hence \begin{align*} f(x)-\frac{tL^2}{2}\le Q_t f(x)\le f(x). \end{align*} So infimal convolution can lower an $L$-Lipschitz function by at most $tL^2/2$ at each point. [/example] This bound is the computational core behind the transport proof of sub-Gaussian concentration. It also shows why the quadratic penalty has the correct scaling: the cost of moving a distance $r$ competes with a linear Lipschitz gain $Lr$, and optimizing this tradeoff produces the quadratic exponent later seen in Gaussian tails. The next theorem explains why $Q_t$ is also the natural solution operator for a first-order evolution equation. This is the point where transport duality begins to look like a semigroup method rather than only a static optimization principle. [quotetheorem:6802] [citeproof:6802] The course uses this theorem mainly as motivation and as a guide for the algebra of $Q_t$. The bounded uniformly continuous hypothesis is not just a convenience for the initial condition: if $f=\mathbb{1}_{\mathbb Q}$ on $\mathbb R$, then $Q_t f(x)=0$ for every $t>0$ and every $x$, so $Q_t f$ cannot converge pointwise to $f$ as $t\downarrow0$. 
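Returning to the Lipschitz bound $f-\tfrac{tL^2}{2}\le Q_t f\le f$ from the example above, the estimate can be tested on a finite metric space. The sketch below assumes a hypothetical grid in $[-2,2]$ with the usual distance and the $2$-Lipschitz function $f(x)=2|x-0.3|$. The bound is valid on any metric space, so restricting the infimum to grid points does not affect it.

```python
def inf_convolution(f_vals, points, t):
    """(Q_t f)(x) = min_y { f(y) + |x - y|^2 / (2 t) } over a finite point set."""
    return [
        min(fy + (x - y) ** 2 / (2.0 * t) for fy, y in zip(f_vals, points))
        for x in points
    ]


# Hypothetical example: a 2-Lipschitz function on a grid in [-2, 2].
L = 2.0
points = [-2.0 + 0.05 * k for k in range(81)]
f_vals = [L * abs(x - 0.3) for x in points]

if __name__ == "__main__":
    t = 0.25
    q_vals = inf_convolution(f_vals, points, t)
    largest_drop = max(f - q for f, q in zip(f_vals, q_vals))
    print(largest_drop, t * L ** 2 / 2)  # the drop never exceeds t L^2 / 2
```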
Unbounded data also require separate growth assumptions; for example $f(x)=-x^2$ makes the infimum equal to $-\infty$ when the quadratic penalty is too weak. The Euclidean full-space setting matters because the displayed equation uses the ordinary gradient and has no boundary condition; on a half-line the same minimization is constrained at the boundary, and the resulting viscosity problem is a state-constrained or boundary-value Hamilton-Jacobi problem rather than the stated Cauchy problem on $\mathbb R^n$. The theorem does not say that $u$ is classically differentiable, since minimizers can fail to be unique and shocks can form even from smooth-looking initial data. It suggests that concentration inequalities can be seen as integrated Hamilton-Jacobi estimates, and the following example records the standard computation that will reappear in transport-entropy arguments. [example: Infimal Convolution Proof of Concentration] Assume that $\mu$ satisfies \begin{align*} \mathcal T_{d^2/2}(\nu,\mu)\le C H(\nu\mid \mu) \end{align*} for all $\nu\ll\mu$, and let $f:X\to\mathbb R$ be $L$-Lipschitz. Fix $\lambda\ge0$. Since $\lambda f$ is $\lambda L$-Lipschitz, for every $x,y\in X$, \begin{align*} \lambda f(y)\ge \lambda f(x)-\lambda L\,d(x,y). \end{align*} Put $r=d(x,y)$. Then \begin{align*} \lambda f(y)+\frac{d(x,y)^2}{2}\ge \lambda f(x)-\lambda Lr+\frac{r^2}{2}. \end{align*} The scalar quadratic term is \begin{align*} -\lambda Lr+\frac{r^2}{2}=\frac{1}{2}\left(r^2-2\lambda Lr\right). \end{align*} Completing the square gives \begin{align*} \frac{1}{2}\left(r^2-2\lambda Lr\right)=\frac{1}{2}\left((r-\lambda L)^2-\lambda^2L^2\right). \end{align*} Thus \begin{align*} -\lambda Lr+\frac{r^2}{2}=\frac{(r-\lambda L)^2}{2}-\frac{\lambda^2L^2}{2}\ge -\frac{\lambda^2L^2}{2}. \end{align*} Substituting back, for every $y\in X$, \begin{align*} \lambda f(y)+\frac{d(x,y)^2}{2}\ge \lambda f(x)-\frac{\lambda^2L^2}{2}. 
\end{align*} Taking the infimum over $y$ in the definition of $Q_1$ gives \begin{align*} Q_1(\lambda f)(x)=\inf_{y\in X}\left\{\lambda f(y)+\frac{d(x,y)^2}{2}\right\}\ge \lambda f(x)-\frac{\lambda^2L^2}{2}. \end{align*} By the *Dual Form of the Transport-Entropy Inequality*, applied to the test function $\lambda f$, \begin{align*} \int_X \exp\left(\frac{1}{C}Q_1(\lambda f)\right)\,d\mu \le \exp\left(\frac{\lambda}{C}\int_X f\,d\mu\right). \end{align*} Since the exponential function is increasing, the pointwise lower bound on $Q_1(\lambda f)$ implies \begin{align*} \exp\left(\frac{\lambda}{C}f(x)-\frac{\lambda^2L^2}{2C}\right)\le \exp\left(\frac{1}{C}Q_1(\lambda f)(x)\right). \end{align*} Integrating this inequality with respect to $\mu$ gives \begin{align*} e^{-\lambda^2L^2/(2C)}\int_X \exp\left(\frac{\lambda}{C}f\right)\,d\mu \le \int_X \exp\left(\frac{1}{C}Q_1(\lambda f)\right)\,d\mu. \end{align*} Combining the last two displays yields \begin{align*} e^{-\lambda^2L^2/(2C)}\int_X \exp\left(\frac{\lambda}{C}f\right)\,d\mu \le \exp\left(\frac{\lambda}{C}\int_X f\,d\mu\right). \end{align*} Multiplying by $e^{\lambda^2L^2/(2C)}$ and then by $\exp\left(-\frac{\lambda}{C}\int_X f\,d\mu\right)$ gives \begin{align*} \int_X \exp\left(\frac{\lambda}{C}\left(f-\int_X f\,d\mu\right)\right)\,d\mu \le \exp\left(\frac{\lambda^2L^2}{2C}\right). \end{align*} With $s=\lambda/C$, equivalently $\lambda=Cs$, this becomes \begin{align*} \int_X e^{s(f-\int_X f\,d\mu)}\,d\mu \le \exp\left(\frac{Cs^2L^2}{2}\right) \end{align*} for every $s\ge0$. For $r\ge0$ and $s\ge0$, Markov's inequality applied to $e^{s(f-\int_X f\,d\mu)}$ gives \begin{align*} \mu\left(f-\int_X f\,d\mu\ge r\right)\le e^{-sr}\int_X e^{s(f-\int_X f\,d\mu)}\,d\mu. \end{align*} Using the exponential moment bound, \begin{align*} \mu\left(f-\int_X f\,d\mu\ge r\right)\le \exp\left(-sr+\frac{Cs^2L^2}{2}\right). 
\end{align*} If $L>0$, the exponent $-sr+Cs^2L^2/2$ is minimized over $s\ge0$ at $s=r/(CL^2)$, since \begin{align*} \frac{d}{ds}\left(-sr+\frac{Cs^2L^2}{2}\right)=-r+CL^2s. \end{align*} Substituting $s=r/(CL^2)$ gives \begin{align*} -sr+\frac{Cs^2L^2}{2}=-\frac{r^2}{CL^2}+\frac{r^2}{2CL^2}=-\frac{r^2}{2CL^2}. \end{align*} Therefore \begin{align*} \mu\left(f-\int_X f\,d\mu\ge r\right)\le \exp\left(-\frac{r^2}{2CL^2}\right). \end{align*} When $L=0$, the function $f$ is constant, so the same upper-tail bound is immediate. Thus the transport-entropy inequality gives a sub-Gaussian upper tail with variance proxy $CL^2$. [/example] ## Primal and Dual Viewpoints on Entropy Inequalities The third problem is to decide which side of a transport-entropy inequality is more useful in a given argument. The primal side compares probability measures through couplings, while the dual side compares exponential integrals through infimal convolution. Transport cost can say that two probability measures are far apart, but by itself it does not distinguish a mild density tilt from a singular perturbation on a set the reference measure never sees. Relative entropy supplies that missing bookkeeping: it assigns a finite price to changes of density and an infinite price to changes outside the reference measure. [definition: Relative Entropy] Let $(X,\mathcal B)$ be a measurable space. Relative entropy is the functional \begin{align*} H(\cdot\mid\cdot):\mathcal P(X)\times\mathcal P(X)\to[0,\infty] \end{align*} defined as follows. For $\nu,\mu\in\mathcal P(X)$ with $\nu\ll\mu$, \begin{align*} H(\nu\mid\mu):=\int_X \log\left(\frac{d\nu}{d\mu}\right)\,d\nu. \end{align*} If $\nu$ is not absolutely continuous with respect to $\mu$, set $H(\nu\mid\mu):=+\infty$. [/definition] Entropy measures the price of changing the underlying measure. 
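On a finite set the definition of relative entropy reduces to a sum, and the convention for measures that are not absolutely continuous becomes an explicit branch. The following is a minimal sketch with an assumed function name, using the usual convention $0\log 0=0$.

```python
from math import inf, log


def relative_entropy(nu, mu):
    """H(nu | mu) for probability vectors on a finite set.

    Returns +inf when nu charges a point that mu does not,
    that is, when nu is not absolutely continuous with respect to mu.
    """
    total = 0.0
    for q, p in zip(nu, mu):
        if q == 0.0:
            continue          # convention 0 * log 0 = 0
        if p == 0.0:
            return inf        # singular part: infinite price
        total += q * log(q / p)
    return total


if __name__ == "__main__":
    print(relative_entropy([0.5, 0.5], [0.9, 0.1]))   # finite density tilt
    print(relative_entropy([0.5, 0.5], [1.0, 0.0]))   # not absolutely continuous
```

The asymmetry is visible here: `relative_entropy([1.0, 0.0], [0.5, 0.5])` is finite, while swapping the arguments gives an infinite value.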
To connect that price with geometry, we need an inequality saying that low-entropy changes of measure cannot create large quadratic displacement; this is the transport-entropy condition. [definition: Quadratic Transport-Entropy Inequality] Let $(X,d)$ be a metric space and let $C>0$. A probability measure $\mu\in\mathcal P(X)$ satisfies $T_2(C)$ if \begin{align*} \mathcal T_{d^2/2}(\nu,\mu)\le C H(\nu\mid\mu) \end{align*} for every $\nu\in\mathcal P(X)$. [/definition] This is a primal statement: it speaks directly about couplings between $\nu$ and $\mu$. To use it for concentration, however, we need an exponential-integral form that can be applied to Lipschitz functions, and Kantorovich duality gives exactly that conversion. [quotetheorem:6804] [citeproof:6804] The dual form packages a whole family of exponential moment estimates into one inequality. Each hypothesis controls a specific failure mode. Boundedness of $f$ keeps the exponential integral finite in the abstract Polish setting; even for the Gaussian measure, taking rapidly growing positive $f$ can make $\int e^{Q_1 f/C}\,d\mu$ infinite or outside the scope of the Gibbs formula. Continuity is tied to Kantorovich duality with continuous potentials; on a topology with too few continuous functions, such as the indiscrete two-point space, the dual tests cannot detect the cost between two distinct Dirac masses. The convention $H(\nu\mid\mu)=+\infty$ for $\nu\not\ll\mu$ is also forced: if a singular measure were assigned finite entropy, then the variational step over densities with respect to $\mu$ would not control that singular perturbation. The theorem does not by itself prove concentration until it is combined with estimates on $Q_1 f$, such as the Lipschitz infimal-convolution bound above. It is also the form in which the analogy with hypercontractivity and Hamilton-Jacobi semigroups is most visible, while the primal form remains useful when an optimal or near-optimal map can be described. 
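The primal form of $T_2(C)$ can be checked by hand when both measures are one-dimensional Gaussians. The sketch below assumes $\mu=\gamma_1$, which is known to satisfy $T_2(1)$ in this normalization by Talagrand's inequality, and $\nu=\mathcal N(m,\sigma^2)$. It uses the standard closed forms $W_2^2(\nu,\gamma_1)=m^2+(\sigma-1)^2$ and $H(\nu\mid\gamma_1)=\tfrac12(\sigma^2+m^2-1)-\log\sigma$. The function names are illustrative.

```python
from math import log


def quadratic_transport_cost(m: float, sigma: float) -> float:
    """T_{|.|^2/2}(N(m, sigma^2), N(0, 1)) = W_2^2 / 2 in one dimension."""
    return 0.5 * (m ** 2 + (sigma - 1.0) ** 2)


def gaussian_relative_entropy(m: float, sigma: float) -> float:
    """H(N(m, sigma^2) | N(0, 1)) in closed form."""
    return 0.5 * (sigma ** 2 + m ** 2 - 1.0) - log(sigma)


def t2_gap(m: float, sigma: float, C: float = 1.0) -> float:
    """C * H(nu | gamma_1) - T(nu, gamma_1); nonnegative when T_2(C) holds for nu."""
    return C * gaussian_relative_entropy(m, sigma) - quadratic_transport_cost(m, sigma)


if __name__ == "__main__":
    for m, sigma in [(0.0, 0.5), (1.0, 1.0), (2.0, 3.0)]:
        print(m, sigma, t2_gap(m, sigma))
```

In this family the gap reduces to $\sigma-1-\log\sigma\ge0$, so pure translations ($\sigma=1$) saturate the inequality and the constant $C=1$ cannot be improved.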
[example: Gaussian Transport Maps] Let $\gamma_n=\mathcal N(0,I_n)$, and let $\nu$ be given by \begin{align*} d\nu=e^V\,d\gamma_n, \end{align*} with \begin{align*} \int_{\mathbb R^n}e^V\,d\gamma_n=1. \end{align*} Assume that $\nu$ has finite second moment. Since $\gamma_n$ is absolutely continuous with respect to Lebesgue measure, *Brenier's theorem* gives an optimal quadratic transport map $T=\nabla\Phi$ such that \begin{align*} T_{\#}\gamma_n=\nu. \end{align*} The map $x\mapsto (x,T(x))$ induces the coupling \begin{align*} \pi_T=(\operatorname{Id},T)_{\#}\gamma_n. \end{align*} Its first marginal is $\gamma_n$, because the first coordinate is $x$, and its second marginal is $T_{\#}\gamma_n=\nu$. Hence $\pi_T\in\Pi(\gamma_n,\nu)$. For the quadratic cost $c(x,y)=|x-y|^2/2$, optimality of $T$ means that the transport cost is the cost of this map-induced coupling: \begin{align*} \mathcal T_{|\cdot|^2/2}(\gamma_n,\nu)=\int_{\mathbb R^n\times\mathbb R^n}\frac{|x-y|^2}{2}\,d\pi_T(x,y). \end{align*} By the definition of pushforward under $x\mapsto (x,T(x))$, the last integral is \begin{align*} \int_{\mathbb R^n}\frac{|x-T(x)|^2}{2}\,d\gamma_n(x). \end{align*} Since $|x-T(x)|=|T(x)-x|$, we obtain \begin{align*} \mathcal T_{|\cdot|^2/2}(\gamma_n,\nu)=\frac{1}{2}\int_{\mathbb R^n}|T(x)-x|^2\,d\gamma_n(x). \end{align*} The entropy of the density tilt is computed from $d\nu/d\gamma_n=e^V$: \begin{align*} H(\nu\mid\gamma_n)=\int_{\mathbb R^n}\log(e^V)\,d\nu. \end{align*} Since $\log(e^V)=V$, this becomes \begin{align*} H(\nu\mid\gamma_n)=\int_{\mathbb R^n}V\,d\nu. \end{align*} Using $d\nu=e^V\,d\gamma_n$, equivalently, \begin{align*} H(\nu\mid\gamma_n)=\int_{\mathbb R^n}V e^V\,d\gamma_n, \end{align*} whenever the positive and negative parts are integrable; if the entropy is $+\infty$, the bound below is vacuous. By *Talagrand's Gaussian transport inequality*, \begin{align*} \mathcal T_{|\cdot|^2/2}(\gamma_n,\nu)\le H(\nu\mid\gamma_n). 
\end{align*} Combining this inequality with the transport-map formula gives \begin{align*} \frac{1}{2}\int_{\mathbb R^n}|T(x)-x|^2\,d\gamma_n(x)\le H(\nu\mid\gamma_n). \end{align*} Thus the mean squared displacement of the Brenier map is at most twice the entropy cost of the density tilt $e^V$. [/example] ## Curvature Heuristics and Displacement Convexity The next question is why the Gaussian measure satisfies such strong inequalities, and why similar conclusions hold for measures with uniformly convex potentials. The geometric answer is that entropy behaves convexly along transport geodesics when the underlying measure has positive curvature. [definition: Wasserstein Geodesic] Let $(X,d)$ be a geodesic metric space and let $\mu_0,\mu_1\in\mathcal P_2(X)$. A Wasserstein geodesic from $\mu_0$ to $\mu_1$ is a map \begin{align*} [0,1]\to\mathcal P_2(X),\qquad t\mapsto \mu_t, \end{align*} such that $\mu_0$ and $\mu_1$ are its endpoints and \begin{align*} W_2(\mu_s,\mu_t)=|s-t|W_2(\mu_0,\mu_1) \end{align*} for all $s,t\in[0,1]$. [/definition] Along such curves, the interpolation is not linear interpolation of densities. It is interpolation of mass locations, and this suggests the right notion of convexity for entropy and related functionals: convexity along Wasserstein geodesics rather than along linear mixtures. Different texts vary on whether displacement convexity is required along every Wasserstein geodesic or only along at least one selected geodesic between each pair of endpoints. These notes use the stronger convention, which is the one needed for the curvature heuristic below: once the endpoints are fixed, any admissible transport interpolation should preserve convexity of the functional. [definition: Displacement Convexity] Let $\mathcal F:\mathcal P_2(X)\to (-\infty,\infty]$ be a functional. 
The functional $\mathcal F$ is displacement convex if every pair $\mu_0,\mu_1$ in its effective domain is joined by at least one Wasserstein geodesic and, for every Wasserstein geodesic $(\mu_t)_{0\le t\le 1}$ joining them, \begin{align*} \mathcal F(\mu_t)\le (1-t)\mathcal F(\mu_0)+t\mathcal F(\mu_1) \end{align*} for all $t\in[0,1]$. [/definition] For concentration, the functional of interest is relative entropy. The next theorem gives the model mechanism: uniform convexity of a potential becomes strict convexity of entropy along transport geodesics, and this is the geometric source behind the log-Sobolev and transport inequalities compared in Chapters 3 and 8. [quotetheorem:6806] [citeproof:6806] This theorem is not used later as a technical black box, but it explains why the same curvature parameter controls several inequalities. The normalization hypothesis is essential: if $V=0$ on $\mathbb R^n$, then $Z=\infty$ and there is no probability measure $\mu$ of the displayed form. The finite second moment assumption is also part of the geometry; a Cauchy law on $\mathbb R$ is a probability measure, but it is not in $\mathcal P_2(\mathbb R)$, so $W_2$-geodesics and the squared distance term are not finite. Finite entropy endpoints are needed for a finite convexity statement, since $\nu=\delta_0$ has infinite entropy relative to any smooth positive density $\mu$. The uniform lower Hessian bound is essential for the strict correction term: if $V(x)=|x|^4$ near the origin, the Hessian has no positive uniform lower bound there, and small translations near the origin cannot produce a positive $\rho W_2^2$ correction. Smoothness is the assumption under which the proof differentiates the potential term pointwise; for nonsmooth potentials such as $V(x)=|x|+x^2$, the Hessian condition must be replaced by a distributional or convex-analytic version before the theorem has a precise meaning. 
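For one-dimensional Gaussian endpoints the convexity mechanism can be checked with closed formulas. The sketch below rests on several assumptions. The reference measure is $\mu=\gamma_1$, so $V(x)=x^2/2$ and $\rho=1$. The Wasserstein geodesic between $\mathcal N(m_0,\sigma_0^2)$ and $\mathcal N(m_1,\sigma_1^2)$ is taken to be $\mathcal N(m_t,\sigma_t^2)$ with $m_t$ and $\sigma_t$ interpolated linearly. The inequality tested is the common form $H(\mu_t\mid\mu)\le(1-t)H(\mu_0\mid\mu)+tH(\mu_1\mid\mu)-\tfrac{\rho}{2}t(1-t)W_2^2(\mu_0,\mu_1)$, and the normalization of the correction term in the quoted theorem may differ.

```python
from math import log


def gaussian_entropy(m: float, sigma: float) -> float:
    """H(N(m, sigma^2) | N(0, 1)) in closed form."""
    return 0.5 * (sigma ** 2 + m ** 2 - 1.0) - log(sigma)


def geodesic_point(t, m0, s0, m1, s1):
    """Mean and standard deviation of the displacement interpolant at time t."""
    return (1.0 - t) * m0 + t * m1, (1.0 - t) * s0 + t * s1


def convexity_gap(t, m0, s0, m1, s1, rho=1.0):
    """Chord minus correction minus entropy at time t; nonnegative if the bound holds."""
    m_t, s_t = geodesic_point(t, m0, s0, m1, s1)
    w2_squared = (m0 - m1) ** 2 + (s0 - s1) ** 2
    chord = (1.0 - t) * gaussian_entropy(m0, s0) + t * gaussian_entropy(m1, s1)
    return chord - 0.5 * rho * t * (1.0 - t) * w2_squared - gaussian_entropy(m_t, s_t)


if __name__ == "__main__":
    for t in (0.25, 0.5, 0.75):
        print(t, convexity_gap(t, m0=-1.0, s0=0.5, m1=2.0, s1=3.0))
```

In this family the quadratic part of the entropy accounts for the full $W_2^2$ correction, and the remaining gap comes from convexity of $-\log\sigma_t$ in $t$.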
The result does not assert ordinary convexity under linear interpolation of densities, nor does it construct the geodesic in a nonsmooth space. Positive curvature makes entropy decrease efficiently under smoothing and resist transport displacement; the heat-flow picture gives a dynamic version of the same idea. [example: Entropy Dissipation Along Heat Flow] Let $\mu=\gamma_n$, and let $(P_t)_{t\ge0}$ be the Ornstein-Uhlenbeck semigroup with generator \begin{align*} L=\Delta-x\cdot\nabla \end{align*} and invariant measure $\gamma_n$. For a positive smooth density $f_t=P_t f$ with respect to $\gamma_n$, the evolution equation is \begin{align*} \partial_t f_t=L f_t. \end{align*} Since $\int_{\mathbb R^n} f_t\,d\gamma_n=\int_{\mathbb R^n} f\,d\gamma_n=1$ by invariance of $\gamma_n$, the entropy is \begin{align*} \operatorname{Ent}_{\gamma_n}(f_t)=\int_{\mathbb R^n} f_t\log f_t\,d\gamma_n. \end{align*} Differentiating under the integral gives \begin{align*} \frac{d}{dt}\operatorname{Ent}_{\gamma_n}(f_t)=\int_{\mathbb R^n}\partial_t f_t\log f_t\,d\gamma_n+\int_{\mathbb R^n}f_t\frac{\partial_t f_t}{f_t}\,d\gamma_n. \end{align*} Using $\partial_t f_t=Lf_t$, this becomes \begin{align*} \frac{d}{dt}\operatorname{Ent}_{\gamma_n}(f_t)=\int_{\mathbb R^n}Lf_t\log f_t\,d\gamma_n+\int_{\mathbb R^n}Lf_t\,d\gamma_n. \end{align*} Invariance of $\gamma_n$ gives $\int_{\mathbb R^n}Lf_t\,d\gamma_n=0$. The Gaussian integration-by-parts identity for the Ornstein-Uhlenbeck generator is \begin{align*} \int_{\mathbb R^n} g\,Lh\,d\gamma_n=-\int_{\mathbb R^n}\nabla g\cdot\nabla h\,d\gamma_n. \end{align*} Applying it with $g=\log f_t$ and $h=f_t$ yields \begin{align*} \frac{d}{dt}\operatorname{Ent}_{\gamma_n}(f_t)=-\int_{\mathbb R^n}\nabla(\log f_t)\cdot\nabla f_t\,d\gamma_n. \end{align*} Since $f_t>0$, we have \begin{align*} \nabla(\log f_t)=\frac{\nabla f_t}{f_t}. 
\end{align*} Therefore \begin{align*} \frac{d}{dt}\operatorname{Ent}_{\gamma_n}(f_t)=-\int_{\mathbb R^n}\frac{|\nabla f_t|^2}{f_t}\,d\gamma_n. \end{align*} The Gaussian logarithmic Sobolev inequality in density form says \begin{align*} \operatorname{Ent}_{\gamma_n}(f_t)\le \frac12\int_{\mathbb R^n}\frac{|\nabla f_t|^2}{f_t}\,d\gamma_n. \end{align*} Equivalently, \begin{align*} \int_{\mathbb R^n}\frac{|\nabla f_t|^2}{f_t}\,d\gamma_n\ge 2\operatorname{Ent}_{\gamma_n}(f_t). \end{align*} Substituting this lower bound into the dissipation identity gives \begin{align*} \frac{d}{dt}\operatorname{Ent}_{\gamma_n}(f_t)\le -2\operatorname{Ent}_{\gamma_n}(f_t). \end{align*} Multiplying by $e^{2t}$, we compute \begin{align*} \frac{d}{dt}\left(e^{2t}\operatorname{Ent}_{\gamma_n}(f_t)\right)=e^{2t}\left(2\operatorname{Ent}_{\gamma_n}(f_t)+\frac{d}{dt}\operatorname{Ent}_{\gamma_n}(f_t)\right)\le0. \end{align*} Hence, for every $t\ge0$, \begin{align*} e^{2t}\operatorname{Ent}_{\gamma_n}(f_t)\le \operatorname{Ent}_{\gamma_n}(f). \end{align*} Equivalently, \begin{align*} \operatorname{Ent}_{\gamma_n}(P_t f)\le e^{-2t}\operatorname{Ent}_{\gamma_n}(f). \end{align*} Thus the Fisher-information dissipation identity, together with the Gaussian logarithmic Sobolev inequality, shows that entropy decays exponentially along the Ornstein-Uhlenbeck flow. [/example] ## Isoperimetry, Log-Sobolev Inequalities, and Transport The final problem is to place the main mechanisms of the course into one map. Isoperimetry controls enlargement of sets, logarithmic Sobolev inequalities control entropy through gradients, and transport inequalities control the cost of moving tilted measures. Tail bounds for Lipschitz functions are scalar shadows of a sharper set-enlargement question: among all sets with a fixed Gaussian measure, which ones have the smallest boundary and therefore grow most slowly under metric enlargement? 
The Gaussian isoperimetric profile records this boundary cost as a function of the set's measure, so it is the right object for comparing set-level concentration with entropy and transport inequalities. [definition: Gaussian Isoperimetric Profile] Let $\gamma_n$ be standard Gaussian measure on $\mathbb R^n$. The Gaussian isoperimetric profile is the function $I:[0,1]\to[0,\infty)$ defined by \begin{align*} I(s):=\phi(\Phi^{-1}(s)), \end{align*} where $\Phi$ is the standard normal distribution function and $\phi$ is the standard normal density. [/definition] The profile describes the boundary measure of half-spaces, which are extremal for Gaussian enlargement. To make the comparison with the rest of the course precise, we need a theorem turning a transport-entropy inequality directly into a tail estimate for Lipschitz functions. [quotetheorem:6808] [citeproof:6808] This theorem is the most compact expression of the course's theme: entropy controls changes of measure, transport turns changes of measure into geometric displacement, and duality turns displacement control into concentration. Each hypothesis is tied to that conversion. The absolute-continuity condition $\nu\ll\mu$ ensures that entropy is computed through density tilts of $\mu$, exactly the class accessed by the Gibbs variational formula. The metric assumption supplies the Lipschitz test functions in the Kantorovich-Rubinstein dual representation of $W_1$, and the $1$-Lipschitz condition is what turns displacement control into the scalar inequality $|\int f\,d\nu-\int f\,d\mu|\le W_1(\nu,\mu)$. The $T_1$ hypothesis is doing real work: a measure with heavy polynomial tails on $\mathbb R$ cannot satisfy the displayed Gaussian tail conclusion for the $1$-Lipschitz function $f(x)=x$. The theorem gives one-sided concentration for Lipschitz observables, but it does not identify sharp isoperimetric sets or prove a logarithmic Sobolev inequality. 
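Returning to the Gaussian isoperimetric profile $I(s)=\phi(\Phi^{-1}(s))$ defined above, the function is easy to evaluate with the Python standard library. The sketch below assumes the limiting convention $I(0)=I(1)=0$ at the endpoints, where $\Phi^{-1}$ is infinite. The function name is illustrative.

```python
from math import pi, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: cdf Phi, density phi


def gaussian_isoperimetric_profile(s: float) -> float:
    """I(s) = phi(Phi^{-1}(s)), with I(0) = I(1) = 0 taken as a limiting convention."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    if s in (0.0, 1.0):
        return 0.0
    return _STD_NORMAL.pdf(_STD_NORMAL.inv_cdf(s))


if __name__ == "__main__":
    # The profile is symmetric about s = 1/2, where it peaks at 1 / sqrt(2 pi).
    for s in (0.01, 0.25, 0.5, 0.75, 0.99):
        print(s, gaussian_isoperimetric_profile(s))
    print(1.0 / sqrt(2.0 * pi))
```

The value $I(s)$ is the Gaussian boundary measure of a half-space of measure $s$, so small values of $I$ near $s=0$ and $s=1$ correspond to half-spaces whose boundary lies far out in the Gaussian tail.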
We can now summarize how the three major languages of the course describe the same phenomenon at different levels. [explanation: How the Main Inequalities Fit Together] Gaussian isoperimetry is the sharp set-level statement. It says that among all sets of a fixed Gaussian measure, half-spaces have the smallest enlargement, and this gives the sharp Gaussian concentration profile for Lipschitz functions. The logarithmic Sobolev inequality is the gradient-level statement. Through Herbst's argument it implies Gaussian concentration, and through semigroup methods it expresses exponential decay of entropy along the Ornstein-Uhlenbeck flow. Transport inequalities are the metric-measure statement. They compare the entropy cost of tilting a measure with the geometric cost of moving it, and their dual forms recover exponential integrability of Lipschitz functions. The implications are not all reversible without additional assumptions. In the Gaussian and uniformly log-concave settings, curvature supplies the missing structure and makes the three pictures mutually reinforcing. [/explanation] This synthesis also marks a shift in what concentration inequalities are measuring. The final remark records the conceptual gain from the chapter: concentration is no longer only a tail estimate, but also a geometric stability property of a probability measure. [remark: What Geometry Adds to Concentration] The first concentration course treated concentration mainly through martingales, bounded differences, and moment-generating functions. Entropy and transport add an invariant language: the same inequality can be read as a stability property of a measure under tilting, a bound on optimal displacement, or convexity of entropy along geodesics. This is why the methods extend from product spaces to Gaussian space, log-concave measures, and geometric probability models. [/remark] # 10. 
Case Studies and Synthesis ## Choosing a Concentration Method This final chapter synthesises the course's three main mechanisms for concentration of measure: entropy, isoperimetry, and transport. The goal is to decide which method matches a given probabilistic model and observable, using the definitions and inequalities developed in Chapters 1 through 9. We assume familiarity with product measures, Lipschitz functions on metric spaces, relative entropy, Wasserstein distance, logarithmic Sobolev inequalities, and Talagrand-type isoperimetric inequalities. The practical question is rarely whether a random variable concentrates at all. It is usually which structural fact about the underlying measure is available and which class of observables must be controlled. [explanation: Three Routes to Concentration] The entropy route starts from a functional inequality such as a logarithmic Sobolev inequality or a modified logarithmic Sobolev inequality. It is strongest when the observable has a gradient, coordinate differences, or another local sensitivity bound that feeds into the entropy of $e^{\theta f}$. The isoperimetric route starts from expansion of sets. It is most natural when the concentration statement concerns neighbourhoods of sets, medians, or geometric distance functions. Talagrand's convex distance inequality is the discrete product-space version that turns the shape of an event into a sharp deviation estimate. The transport route starts from a transportation-cost inequality such as $T_1(C)$ or $T_2(C)$. It is designed to convert relative entropy into Wasserstein distance, so it gives a direct bridge from information control to Lipschitz concentration. [/explanation] The three routes overlap, but they are not interchangeable. Entropy methods usually expose tensorization and differential inequalities; isoperimetric methods expose sharp boundary growth; transport methods expose stability under Lipschitz pushforwards and perturbations. 
To compare these methods on product spaces, we need a principle explaining why a concentration constant can remain fixed while the number of coordinates grows. [quotetheorem:6810] [citeproof:6810] This theorem is the organising principle for high-dimensional applications. The tensorization hypothesis is essential: a one-dimensional tail estimate by itself does not imply that arbitrary high-dimensional laws keep the same constant. For example, let $Y$ be a symmetric $\{-1,1\}$-valued random variable and set $X=(Y,\dots,Y)\in\{-1,1\}^n$. The one-coordinate law has bounded tails, but the $1$-Lipschitz function $F(x)=n^{-1/2}\sum_{i=1}^n x_i$ for the Euclidean metric satisfies $F(X)=\sqrt n\,Y$, so its deviations are of order $\sqrt n$ rather than order $1$. This diagonal law is the basic obstruction removed by product tensorization. The Euclidean product metric is also essential because it makes independent coordinate sensitivities add in squares. If the same dimension-free conclusion were asserted for the $\ell^1$ product metric, the sum $S(x)=\sum_{i=1}^n x_i$ on $\{0,1\}^n$ would be $1$-Lipschitz, and the theorem would predict order-one fluctuations for a binomial random variable. In reality, $S-\mathbb E[S]$ has standard deviation of order $\sqrt n$. The Lipschitz hypothesis is the final place where dimension can enter: on the same cube, $S$ has Euclidean Lipschitz constant $\sqrt n$, so applying a unit-Lipschitz theorem to $S$ would lose the correct scale. The theorem says that once $F$ has been measured in the correct product metric, no additional factor of $n$ is introduced by the concentration inequality. [example: Normalised Average on a Product Space] Let $X_1,\dots,X_n$ be independent random variables in $[0,1]$, and define \begin{align*} F(x)=\frac{1}{n}\sum_{i=1}^n x_i \end{align*} for $x=(x_1,\dots,x_n)\in[0,1]^n$. 
For the Euclidean product metric $d_2(x,y)=\left(\sum_{i=1}^n |x_i-y_i|^2\right)^{1/2}$, the Lipschitz constant of $F$ is at most $n^{-1/2}$ because \begin{align*} |F(x)-F(y)|=\frac{1}{n}\left|\sum_{i=1}^n (x_i-y_i)\right|. \end{align*} By the triangle inequality, \begin{align*} \frac{1}{n}\left|\sum_{i=1}^n (x_i-y_i)\right|\le \frac{1}{n}\sum_{i=1}^n |x_i-y_i|. \end{align*} By Cauchy-Schwarz, \begin{align*} \sum_{i=1}^n |x_i-y_i|\le \left(\sum_{i=1}^n 1^2\right)^{1/2}\left(\sum_{i=1}^n |x_i-y_i|^2\right)^{1/2}. \end{align*} Since $\left(\sum_{i=1}^n 1^2\right)^{1/2}=\sqrt n$, these inequalities give \begin{align*} |F(x)-F(y)|\le \frac{1}{\sqrt n}d_2(x,y). \end{align*} Thus $G=\sqrt n\,F$ is $1$-Lipschitz with respect to $d_2$. If the product law satisfies the hypotheses of the *Dimension-Free Concentration Principle* with constant $C$, then for every $s\ge 0$, \begin{align*} \mathbb P(G-\mathbb E[G]\ge s)\le \exp\left(-\frac{s^2}{2C}\right). \end{align*} Moreover, \begin{align*} G-\mathbb E[G]=\sqrt n\,F-\sqrt n\,\mathbb E[F]=\sqrt n\,(F-\mathbb E[F]). \end{align*} Taking $s=\sqrt n\,t$ gives \begin{align*} \mathbb P(F-\mathbb E[F]\ge t)=\mathbb P(G-\mathbb E[G]\ge \sqrt n\,t). \end{align*} Therefore \begin{align*} \mathbb P(F-\mathbb E[F]\ge t)\le \exp\left(-\frac{n t^2}{2C}\right). \end{align*} In the notation $\exp(-c n t^2)$, this is $c=1/(2C)$, so the exponent grows linearly with the sample size and the fluctuation scale is $n^{-1/2}$. [/example]

## Lipschitz Observables on Product Spaces

The next question is how to use the dimension-free principle when the function is not a simple average. Many observables in statistics and learning theory are Lipschitz functions of an entire sample, but their Lipschitz constants depend on the metric used on the sample space.

[definition: Product Lipschitz Constant] Let $(E,d)$ be a metric space.
The Euclidean product Lipschitz functional is the map \begin{align*} \operatorname{Lip}_2:\{F:E^n\to\mathbb R\}\to [0,\infty] \end{align*} defined by \begin{align*} \operatorname{Lip}_2(F)=\sup_{x\neq y}\frac{|F(x)-F(y)|}{\left(\sum_{i=1}^n d(x_i,y_i)^2\right)^{1/2}}. \end{align*} [/definition]

After this definition, concentration statements reduce to estimating $\operatorname{Lip}_2(F)$. The remaining issue is how a deterministic oscillation estimate becomes a probabilistic tail estimate under the product law. Tensorized $T_1$ supplies the missing bridge: once an observable is Lipschitz for the product metric, entropy duality converts that metric control into a Gaussian Laplace-transform bound around its mean.

[quotetheorem:6813] [citeproof:6813]

The $T_1$ hypothesis is not a cosmetic assumption: it is the mechanism that turns entropy control into exponential integrability of all Lipschitz observables. Without it, a $1$-Lipschitz observable can have tails that are much heavier than Gaussian. For instance, if $\nu$ is the exponential distribution on $[0,\infty)$ with rate $1$ and $f(x)=x$, so that $\mathbb E[f]=1$, then \begin{align*} \nu(f-\mathbb E[f]\ge r)=e^{-(r+1)},\qquad r\ge 0, \end{align*} so no estimate of the form $\exp(-r^2/(2C))$ can hold for all $r$ with fixed $C$.

The Lipschitz condition is equally necessary, since transport duality tests against Lipschitz functions and gives no direct tail estimate for observables with large jumps under small metric changes. The standard Gaussian law satisfies a $T_1$ inequality, but for a standard Gaussian variable $X$ the non-Lipschitz observable $f(x)=x^2$ has \begin{align*} \mathbb P(f(X)-\mathbb E[f(X)]\ge r)=\mathbb P(|X|\ge \sqrt{r+1}), \end{align*} which decays on the scale $e^{-r/2}$ rather than $e^{-c r^2}$.

The theorem controls fluctuations around $\mathbb E[f]$, but it does not estimate that mean or identify the typical location of $f$. This distinction is central for Wasserstein statistics.
Kantorovich duality expresses $W_1$ as a supremum over $1$-Lipschitz test functions, so replacing one sample point changes the empirical measure by at most its transport cost. The theorem therefore supplies a fluctuation estimate for empirical Wasserstein distances once the sample-to-measure map has been shown to be Lipschitz; the separate task of estimating the mean belongs to quantization, covering, or empirical-process arguments. [example: Empirical Measure Concentration in W One] Let $X_1,\dots,X_n$ be i.i.d. samples from a probability measure $\nu$ on a bounded metric space $(E,d)$ with diameter at most $D$, and let \begin{align*} \hat\nu_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}. \end{align*} For $x=(x_1,\dots,x_n)\in E^n$, define \begin{align*} F(x)=W_1\left(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\nu\right). \end{align*} We show that $F$ has Lipschitz constant at most $n^{-1/2}$ with respect to \begin{align*} d_2(x,y)=\left(\sum_{i=1}^n d(x_i,y_i)^2\right)^{1/2}. \end{align*} Fix $x,y\in E^n$, and set \begin{align*} \mu_x=\frac{1}{n}\sum_{i=1}^n\delta_{x_i}. \end{align*} Also set \begin{align*} \mu_y=\frac{1}{n}\sum_{i=1}^n\delta_{y_i}. \end{align*} The triangle inequality for $W_1$ gives \begin{align*} W_1(\mu_x,\nu)\le W_1(\mu_x,\mu_y)+W_1(\mu_y,\nu). \end{align*} Subtracting $W_1(\mu_y,\nu)$ from both sides gives \begin{align*} W_1(\mu_x,\nu)-W_1(\mu_y,\nu)\le W_1(\mu_x,\mu_y). \end{align*} Interchanging $x$ and $y$ gives \begin{align*} W_1(\mu_y,\nu)-W_1(\mu_x,\nu)\le W_1(\mu_y,\mu_x). \end{align*} Since $W_1(\mu_y,\mu_x)=W_1(\mu_x,\mu_y)$, the two inequalities imply \begin{align*} |F(x)-F(y)|\le W_1(\mu_x,\mu_y). \end{align*} Now define the coordinatewise coupling \begin{align*} \pi=\frac{1}{n}\sum_{i=1}^n\delta_{(x_i,y_i)}. \end{align*} Its first marginal is $\mu_x$, and its second marginal is $\mu_y$. Therefore, by the definition of $W_1$ as the infimum of transport cost over couplings, \begin{align*} W_1(\mu_x,\mu_y)\le \int_{E\times E}d(u,v)\,d\pi(u,v). 
\end{align*} Evaluating the integral against the finite measure $\pi$ gives \begin{align*} \int_{E\times E}d(u,v)\,d\pi(u,v)=\frac{1}{n}\sum_{i=1}^n d(x_i,y_i). \end{align*} By Cauchy-Schwarz applied to $(1,\dots,1)$ and $(d(x_1,y_1),\dots,d(x_n,y_n))$, \begin{align*} \sum_{i=1}^n d(x_i,y_i)\le \left(\sum_{i=1}^n 1^2\right)^{1/2}\left(\sum_{i=1}^n d(x_i,y_i)^2\right)^{1/2}. \end{align*} Since $\left(\sum_{i=1}^n 1^2\right)^{1/2}=\sqrt n$, this becomes \begin{align*} \sum_{i=1}^n d(x_i,y_i)\le \sqrt n\,d_2(x,y). \end{align*} Combining the preceding estimates, \begin{align*} |F(x)-F(y)|\le \frac{1}{\sqrt n}d_2(x,y). \end{align*} Thus $\sqrt n\,F$ is $1$-Lipschitz with respect to $d_2$. If the product law $\nu^{\otimes n}$ satisfies the tensorized $T_1(C)$ hypothesis from *Concentration from Transport-Entropy Inequalities*, then for every $s\ge 0$, \begin{align*} \mathbb P\left(\sqrt n\,F(X)-\mathbb E[\sqrt n\,F(X)]\ge s\right)\le \exp\left(-\frac{s^2}{2C}\right). \end{align*} Linearity of expectation gives \begin{align*} \sqrt n\,F(X)-\mathbb E[\sqrt n\,F(X)]=\sqrt n\left(F(X)-\mathbb E[F(X)]\right). \end{align*} Taking $s=\sqrt n\,t$ therefore gives, for every $t\ge 0$, \begin{align*} \mathbb P\left(F(X)-\mathbb E[F(X)]\ge t\right)\le \exp\left(-\frac{n t^2}{2C}\right). \end{align*} Because $F(X)=W_1(\hat\nu_n,\nu)$, the empirical Wasserstein distance fluctuates around $\mathbb E[W_1(\hat\nu_n,\nu)]$ on the scale $n^{-1/2}$; this concentration estimate controls the fluctuation term, not the size of the mean itself. [/example] This example separates two tasks. Concentration controls fluctuations around the mean, while estimating the mean $\mathbb E[W_1(\hat\nu_n,\nu)]$ is a quantization and empirical-process problem that depends on the geometry of $E$. 
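The Lipschitz estimate in the example can be checked numerically. The sketch below is only an illustration under an added assumption: it specialises to $E=[0,1]$ with $\nu$ uniform, so that $W_1$ has the one-dimensional closed form $W_1(\mu,\nu)=\int_0^1 |F_\mu(t)-t|\,dt$. The function names `w1_to_uniform` and `lipschitz_ratio` are hypothetical helpers, not part of the course material.

```python
import math
import random


def w1_to_uniform(points):
    """Exact W_1 between the empirical measure of `points` (all in [0,1])
    and Uniform[0,1], via the 1D formula: integral of |F_n(t) - t| on [0,1]."""
    xs = sorted(points)
    n = len(xs)
    knots = [0.0] + xs + [1.0]
    total = 0.0
    for i in range(n + 1):
        a, b = knots[i], knots[i + 1]
        c = i / n  # value of the empirical CDF on [a, b)
        if c <= a:      # integrand is t - c on the whole interval
            total += ((b - c) ** 2 - (a - c) ** 2) / 2
        elif c >= b:    # integrand is c - t on the whole interval
            total += ((c - a) ** 2 - (c - b) ** 2) / 2
        else:           # integrand changes sign at t = c
            total += ((c - a) ** 2 + (b - c) ** 2) / 2
    return total


def lipschitz_ratio(x, y):
    """sqrt(n) * |F(x) - F(y)| / d_2(x, y); the example shows this is at most 1."""
    n = len(x)
    d2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.sqrt(n) * abs(w1_to_uniform(x) - w1_to_uniform(y)) / d2


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
    worst = 0.0
    for _ in range(1000):
        x = [rng.random() for _ in range(30)]
        y = [rng.random() for _ in range(30)]
        worst = max(worst, lipschitz_ratio(x, y))
    print("largest observed ratio:", worst)
```

The check is deterministic in nature: the ratio is bounded by $1$ for every pair of samples, not just with high probability. It says nothing about the size of $\mathbb E[W_1(\hat\nu_n,\nu)]$, which is the separate task described above.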
## Convex Concentration Versus Arbitrary Lipschitz Concentration

Product measures on the discrete cube and on bounded intervals often give stronger estimates for convex observables than for arbitrary Lipschitz observables. The issue is that convexity lets the function interact with product geometry through supporting hyperplanes, while arbitrary Lipschitz functions can encode rough set boundaries.

[definition: Convex Lipschitz Observable] Let $K\subset\mathbb R^n$ be convex. A function $F:K\to\mathbb R$ is a convex Lipschitz observable if $F$ is convex and there exists $L\ge 0$ such that \begin{align*} |F(x)-F(y)|\le L|x-y|,\qquad x,y\in K. \end{align*} [/definition]

Convex Lipschitz observables include norms, suprema of affine functions, and many risk functionals. These are common in applications where a random vector is built from independent bounded coordinates. The next theorem is needed because Talagrand's convex distance converts this convexity assumption into a Gaussian enlargement estimate for sublevel sets.

[quotetheorem:6815] [citeproof:6815]

Independence is used through the product-space isoperimetric inequality; strong dependence can concentrate mass near thin diagonal sets where product enlargement estimates fail. A concrete failure is obtained by taking $X=(Y,\dots,Y)$ with $Y$ symmetric on $\{-1,1\}$ and $F(x)=n^{-1/2}\sum_{i=1}^n x_i$. The function $F$ is convex and $1$-Lipschitz, but $F(X)=\sqrt n\,Y$, so a dimension-free upper-tail bound cannot hold without the product structure. Bounded coordinate ranges set the length scale of the convex distance, which is why changing the interval diameter changes the effective Lipschitz constant. If this boundedness is removed, independent exponential random variables and the convex $1$-Lipschitz function $F(x)=x_1$ already give only exponential upper tails, not Gaussian tails.
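The diagonal counterexample above is easy to see in a simulation. The following is a minimal sketch, with hypothetical helper names and arbitrary choices of $n$ and trial count, comparing the spread of $F(x)=n^{-1/2}\sum_i x_i$ under the diagonal law $X=(Y,\dots,Y)$ with its spread under the product law of independent symmetric signs.

```python
import math
import random


def normalised_sum(x):
    """F(x) = n^{-1/2} * sum_i x_i, convex and 1-Lipschitz for the Euclidean metric."""
    return sum(x) / math.sqrt(len(x))


def sample_diagonal(n, rng):
    """Diagonal law: one symmetric sign copied into all n coordinates."""
    y = rng.choice((-1, 1))
    return [y] * n


def sample_product(n, rng):
    """Product law: n independent symmetric signs."""
    return [rng.choice((-1, 1)) for _ in range(n)]


def empirical_std(sampler, n, trials, rng):
    """Monte Carlo standard deviation of F(X) when X is drawn by `sampler`."""
    values = [normalised_sum(sampler(n, rng)) for _ in range(trials)]
    mean = sum(values) / trials
    return math.sqrt(sum((v - mean) ** 2 for v in values) / trials)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, chosen arbitrarily
    for n in (25, 100, 400):
        diag = empirical_std(sample_diagonal, n, 2000, rng)
        prod = empirical_std(sample_product, n, 2000, rng)
        print(n, "diagonal:", round(diag, 2), "product:", round(prod, 2))
```

Under the diagonal law $F(X)$ takes only the values $\pm\sqrt n$, so its spread grows with $n$; under the product law the variance of $F(X)$ is exactly $1$ for every $n$.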
Convexity is the structural assumption that allows sublevel sets to be separated by supporting hyperplanes; without it, even bounded independent coordinates can produce a Lipschitz observable outside the reach of this theorem. On the discrete cube, the parity function $G(x)=\mathbb{1}_{\{x_1+\cdots+x_n\text{ is odd}\}}$ is Lipschitz for the Hamming metric but has alternating sublevel geometry, and the convex-distance supporting-hyperplane argument has no object to act on. The Euclidean Lipschitz hypothesis is also necessary: on $[0,1]^n$, the convex function $S(x)=\sum_{i=1}^n x_i$ has Lipschitz constant $\sqrt n$, and treating it as if $L=1$ would predict order-one fluctuations for a sum whose standard deviation is of order $\sqrt n$. The theorem is asymmetric in spirit because convexity is preserved under suprema and positive combinations, but not under sign change, so upper and lower tails may require separate arguments. [example: Convex Functions of Independent Bounded Variables] Let $X_1,\dots,X_n$ be independent random variables in $[-1,1]$, and define \begin{align*} F(x)=\max_{1\le j\le N}(a_j\cdot x+b_j), \end{align*} where $a_j\in\mathbb R^n$, $b_j\in\mathbb R$, and $\max_{1\le j\le N}|a_j|\le L$. Each map $x\mapsto a_j\cdot x+b_j$ is affine, hence convex. For $\lambda\in[0,1]$ and $x,y\in[-1,1]^n$, the identity \begin{align*} a_j\cdot(\lambda x+(1-\lambda)y)+b_j=\lambda(a_j\cdot x+b_j)+(1-\lambda)(a_j\cdot y+b_j) \end{align*} holds for every $j$. Since $a_j\cdot x+b_j\le F(x)$ and $a_j\cdot y+b_j\le F(y)$, taking the maximum over $j$ gives \begin{align*} F(\lambda x+(1-\lambda)y)\le \lambda F(x)+(1-\lambda)F(y). \end{align*} Thus $F$ is convex. We next compute its Euclidean Lipschitz constant. For each $j$, \begin{align*} a_j\cdot x+b_j=a_j\cdot y+b_j+a_j\cdot(x-y). \end{align*} Because $a_j\cdot y+b_j\le F(y)$, this implies \begin{align*} a_j\cdot x+b_j\le F(y)+a_j\cdot(x-y). 
\end{align*} Taking the maximum over $j$ yields \begin{align*} F(x)\le F(y)+\max_{1\le j\le N}a_j\cdot(x-y). \end{align*} By Cauchy-Schwarz, \begin{align*} a_j\cdot(x-y)\le |a_j|\,|x-y|\le L|x-y| \end{align*} for every $j$, so \begin{align*} F(x)-F(y)\le L|x-y|. \end{align*} Interchanging $x$ and $y$ gives \begin{align*} F(y)-F(x)\le L|x-y|. \end{align*} Combining the two one-sided bounds, \begin{align*} |F(x)-F(y)|\le L|x-y|. \end{align*} The coordinate intervals $[-1,1]$ have length $2$. Applying *Convex Concentration from Talagrand's Inequality* with diameter factor $D=2$, and absorbing this fixed factor into the universal constant, gives for every median $m_F$ of $F(X)$ and every $r\ge 0$, \begin{align*} \mathbb P(F(X)-m_F\ge r)\le \exp\left(-\frac{c r^2}{L^2}\right). \end{align*} Thus finite suprema of affine statistics concentrate at a scale controlled by the largest Euclidean coefficient norm, covering bounded empirical-process suprema and support functions of random polytopes. [/example] The example shows why convex concentration is useful even when the number $N$ of functions is large. The concentration scale depends on the largest Euclidean coefficient norm, not directly on $N$. [example: Gaussian Supremum of a Bounded Process] Let $(g_t)_{t\in T}$ be a centred Gaussian process indexed by a finite set $T$, and assume \begin{align*} \sup_{t\in T}\operatorname{Var}(g_t)\le \sigma^2. \end{align*} Since $T$ is finite, the covariance matrix $(\mathbb E[g_sg_t])_{s,t\in T}$ is positive semidefinite. Hence there are vectors $(a_t)_{t\in T}$ in some Euclidean space $\mathbb R^n$ such that \begin{align*} \mathbb E[g_sg_t]=a_s\cdot a_t. \end{align*} If $G$ is a standard Gaussian vector in $\mathbb R^n$, then $(a_t\cdot G)_{t\in T}$ is centred and has covariance matrix $(a_s\cdot a_t)_{s,t\in T}$, so it has the same law as $(g_t)_{t\in T}$. Also, \begin{align*} |a_t|^2=a_t\cdot a_t=\mathbb E[g_t^2]=\operatorname{Var}(g_t)\le \sigma^2. 
\end{align*} Define \begin{align*} \Phi(z)=\sup_{t\in T}a_t\cdot z,\qquad z\in\mathbb R^n. \end{align*} For $\lambda\in[0,1]$ and $x,y\in\mathbb R^n$, the identity \begin{align*} a_t\cdot(\lambda x+(1-\lambda)y)=\lambda a_t\cdot x+(1-\lambda)a_t\cdot y \end{align*} holds for every $t\in T$. Since $a_t\cdot x\le \Phi(x)$ and $a_t\cdot y\le \Phi(y)$, we get \begin{align*} a_t\cdot(\lambda x+(1-\lambda)y)\le \lambda\Phi(x)+(1-\lambda)\Phi(y). \end{align*} Taking the supremum over $t\in T$ gives \begin{align*} \Phi(\lambda x+(1-\lambda)y)\le \lambda\Phi(x)+(1-\lambda)\Phi(y), \end{align*} so $\Phi$ is convex. For $x,y\in\mathbb R^n$ and every $t\in T$, \begin{align*} a_t\cdot x=a_t\cdot y+a_t\cdot(x-y). \end{align*} Since $a_t\cdot y\le \Phi(y)$, this implies \begin{align*} a_t\cdot x\le \Phi(y)+a_t\cdot(x-y). \end{align*} By Cauchy-Schwarz and $|a_t|\le \sigma$, \begin{align*} a_t\cdot(x-y)\le |a_t|\,|x-y|\le \sigma |x-y|. \end{align*} Therefore \begin{align*} a_t\cdot x\le \Phi(y)+\sigma |x-y| \end{align*} for every $t\in T$. Taking the supremum over $t$ gives \begin{align*} \Phi(x)-\Phi(y)\le \sigma |x-y|. \end{align*} Interchanging $x$ and $y$ gives \begin{align*} \Phi(y)-\Phi(x)\le \sigma |x-y|. \end{align*} Combining the two one-sided estimates, \begin{align*} |\Phi(x)-\Phi(y)|\le \sigma |x-y|. \end{align*} Thus $\Phi$ is $\sigma$-Lipschitz. Applying *Gaussian Concentration Inequality* to the $\sigma$-Lipschitz function $\Phi$ gives, for every $r\ge 0$, \begin{align*} \mathbb P\left(\Phi(G)-\mathbb E[\Phi(G)]\ge r\right)\le \exp\left(-\frac{r^2}{2\sigma^2}\right). \end{align*} Since $\Phi(G)=\sup_{t\in T}a_t\cdot G$ has the same distribution as $\sup_{t\in T}g_t$, this becomes \begin{align*} \mathbb P\left(\sup_{t\in T}g_t-\mathbb E\left[\sup_{t\in T}g_t\right]\ge r\right)\le \exp\left(-\frac{r^2}{2\sigma^2}\right). 
\end{align*} The deviation scale is controlled by the largest variance bound $\sigma^2$; the cardinality and geometry of $T$ enter through the mean of the supremum. [/example]

## Sharp Constants and Dimension-Free Bounds

A final synthesis must distinguish the order of a bound from its sharp constant. Entropy, isoperimetry, and transport may all yield Gaussian tails, but they can disagree by factors of $2$, by centering at the mean versus a median, or by the metric in which the Lipschitz constant is measured.

[remark: Mean and Median Centering] Isoperimetric arguments often produce concentration around a median because set expansion is naturally phrased in terms of sets of measure at least $1/2$. Entropy and transport arguments more often produce Laplace-transform bounds around the mean. For sub-Gaussian random variables, the two centre choices differ by at most a constant multiple of the sub-Gaussian scale, so either version gives the same qualitative concentration rate. [/remark]

The next remark records the [comparison principle](/theorems/4870) used throughout the course. It is not a new method; it is a checklist for deciding whether to invoke the entropy-Herbst results of Chapters 2 through 4, the isoperimetric results of Chapters 5 and 6, or the transport results of Chapters 7 through 9.

[remark: Method Selection Principle] Suppose $X=(X_1,\dots,X_n)$ has independent coordinates and $F=F(X)$ is a real-valued observable. If a tensorized logarithmic Sobolev inequality is available and $F$ has a usable gradient or difference bound, the entropy-Herbst route gives mean-centred sub-Gaussian concentration. If the available information is expansion of product sets or convex distance, the isoperimetric route gives set and median concentration, with strong consequences for convex functions. If the measure satisfies a transport-entropy inequality and $F$ is Lipschitz for the relevant metric, the transport route gives concentration through Kantorovich duality.
[/remark] This principle is a synthesis of the preceding chapters rather than a new theorem. Logarithmic Sobolev plus tensorization controls the entropy of exponential tilts, so it is strongest when the observable has a gradient or coordinate-difference bound that can be inserted into Herbst's argument. Isoperimetry is the right language when the proof starts from enlargement of sets, but it may give only median-centred statements and may require convexity for sharp product-space conclusions. Transport-entropy inequalities compare Wasserstein distance to relative entropy, so they require Lipschitz control in the chosen metric and do not estimate the location term $\mathbb E[F]$. The sharpest result is usually the one whose hypotheses match the observable most closely. A smooth function under a log-Sobolev measure should be treated through entropy; a convex supremum under a product measure often belongs to Talagrand's inequality; a Wasserstein statistic should first be tested against transport concentration. The same checklist appears outside probability in several guises: logarithmic Sobolev inequalities are functional-analytic estimates for semigroups, isoperimetric inequalities are geometric boundary estimates, and transport inequalities compare variational problems over measures. Concentration is the probabilistic expression of these analytic and geometric controls. [example: Comparing Routes for a Sample Mean] Let $X_1,\dots,X_n$ be independent and suppose each $X_i$ takes values in an interval of length at most $D$. For \begin{align*} F(x)=\frac{1}{n}\sum_{i=1}^n x_i, \end{align*} changing only the $i$th coordinate from $x_i$ to $y_i$ changes $F$ by \begin{align*} \left|\frac{1}{n}\sum_{k=1}^n x_k-\frac{1}{n}\left(\sum_{k\ne i}x_k+y_i\right)\right|=\frac{1}{n}|x_i-y_i|\le \frac{D}{n}. \end{align*} Thus the bounded-differences sensitivity vector has entries $D/n$, and \begin{align*} \sum_{i=1}^n \left(\frac{D}{n}\right)^2=n\frac{D^2}{n^2}=\frac{D^2}{n}. 
\end{align*} The bounded-differences route therefore gives a Gaussian tail with exponent proportional to \begin{align*} \frac{r^2}{D^2/n}=\frac{n r^2}{D^2}. \end{align*} The Euclidean product Lipschitz calculation gives the same scale. For $x,y\in\mathbb R^n$, \begin{align*} |F(x)-F(y)|=\frac{1}{n}\left|\sum_{i=1}^n(x_i-y_i)\right|. \end{align*} By the triangle inequality, \begin{align*} \frac{1}{n}\left|\sum_{i=1}^n(x_i-y_i)\right|\le \frac{1}{n}\sum_{i=1}^n |x_i-y_i|. \end{align*} By Cauchy-Schwarz, \begin{align*} \sum_{i=1}^n |x_i-y_i|\le \left(\sum_{i=1}^n 1^2\right)^{1/2}\left(\sum_{i=1}^n |x_i-y_i|^2\right)^{1/2}. \end{align*} Since $\left(\sum_{i=1}^n 1^2\right)^{1/2}=\sqrt n$, this gives \begin{align*} |F(x)-F(y)|\le \frac{1}{\sqrt n}|x-y|. \end{align*} Hence $\sqrt n\,F$ is $1$-Lipschitz for the Euclidean product metric. Any tensorized entropy or transport theorem that gives \begin{align*} \mathbb P(G-\mathbb E[G]\ge s)\le \exp\left(-\frac{s^2}{2C}\right) \end{align*} for every $1$-Lipschitz $G$ gives, with $G=\sqrt n\,F$, \begin{align*} G-\mathbb E[G]=\sqrt n\,F-\sqrt n\,\mathbb E[F]=\sqrt n\,(F-\mathbb E[F]). \end{align*} Taking $s=\sqrt n\,r$ therefore gives \begin{align*} \mathbb P(F-\mathbb E[F]\ge r)=\mathbb P(G-\mathbb E[G]\ge \sqrt n\,r). \end{align*} Thus \begin{align*} \mathbb P(F-\mathbb E[F]\ge r)\le \exp\left(-\frac{n r^2}{2C}\right). \end{align*} Finally, $F$ is affine, hence convex, and the same Lipschitz computation gives Euclidean Lipschitz constant $1/\sqrt n$. Convex concentration on coordinate intervals of length at most $D$ therefore gives, around a median $m_F$, \begin{align*} \mathbb P(F-m_F\ge r)\le \exp\left(-\frac{c n r^2}{D^2}\right). \end{align*} Thus bounded differences, entropy tensorization, convex concentration, and transport all produce the same $n r^2$ exponent scale for the sample mean; they differ in constants, centering, and in which sensitivity estimate their hypotheses require. 
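Within this example, the common $n r^2$ scale can be compared with a simulation. The sketch below assumes independent Uniform$[0,1]$ coordinates, so $D=1$ and $\mathbb E[F]=1/2$, and it uses the standard McDiarmid form $\exp(-2nr^2/D^2)$ of the bounded-differences bound; the text above fixes the exponent only up to proportionality, so the constant $2$ is an added assumption. The function names are illustrative.

```python
import math
import random


def bounded_differences_bound(n, r, diameter):
    """McDiarmid upper-tail bound exp(-2 n r^2 / D^2) for the mean of n
    independent coordinates, each confined to an interval of length D."""
    return math.exp(-2.0 * n * r * r / diameter ** 2)


def empirical_upper_tail(n, r, trials, rng):
    """Monte Carlo estimate of P(F - E[F] >= r) for the mean F of n
    independent Uniform[0,1] draws (so E[F] = 1/2)."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if mean - 0.5 >= r:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(2)  # fixed seed, chosen arbitrarily
    r = 0.1
    for n in (25, 50, 100):
        bound = bounded_differences_bound(n, r, 1.0)
        tail = empirical_upper_tail(n, r, 5000, rng)
        print(n, "bound:", round(bound, 4), "simulated tail:", tail)
```

The simulated tail sits below the bound, and both shrink as $n$ grows with $r$ fixed, which is the $n r^2$ scaling shared by all four routes. The gap between them reflects the constants, not the scale.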
[/example]

## Limitations and Failure Modes

The last question is what concentration inequalities do not say. A dimension-free tail bound can still be weak if the Lipschitz constant is large, if the mean is hard to estimate, or if the observable lies outside the class controlled by the theorem.

[remark: Lipschitz Constants Carry Dimension] Dimension-free concentration does not mean every high-dimensional observable has fluctuations of order $1$. It means that once the Lipschitz constant is computed in the correct product metric, no further dimension factor appears in the theorem. For sums, averages, maxima, and empirical measures, the main work is often the computation of this Lipschitz constant. [/remark]

A second limitation concerns the distinction between fluctuations and location. Concentration around $\mathbb E[F]$ is useful only when $\mathbb E[F]$ is known or can be bounded by another argument.

[example: Empirical Wasserstein Location Term] Let $X_1,\dots,X_n$ be i.i.d. samples from a probability measure $\nu$ on $[0,1]^d$, and let $\hat\nu_n=n^{-1}\sum_{i=1}^n\delta_{X_i}$ be the empirical measure. A concentration estimate of the form \begin{align*} \mathbb P\left(W_1(\hat\nu_n,\nu)-\mathbb E[W_1(\hat\nu_n,\nu)]\ge t\right)\le \exp(-c n t^2) \end{align*} controls only the fluctuation around the mean. If we set $t=a n^{-1/2}$, then \begin{align*} c n t^2=c n\left(a n^{-1/2}\right)^2=c n a^2 n^{-1}=c a^2, \end{align*} so deviations of size $a n^{-1/2}$ have probability at most $e^{-c a^2}$. The location term is separate. In high-dimensional empirical Wasserstein problems one often has, up to endpoint and logarithmic effects, \begin{align*} \mathbb E[W_1(\hat\nu_n,\nu)]\asymp n^{-1/d}. \end{align*} Thus the concentration inequality gives an error decomposition of the form \begin{align*} W_1(\hat\nu_n,\nu) = \mathbb E[W_1(\hat\nu_n,\nu)] + \left(W_1(\hat\nu_n,\nu)-\mathbb E[W_1(\hat\nu_n,\nu)]\right), \end{align*} where the first term has scale $n^{-1/d}$ and the second has scale $n^{-1/2}$.
For $d>2$, \begin{align*} \frac{n^{-1/d}}{n^{-1/2}}=n^{1/2-1/d}, \end{align*} and since $1/2-1/d>0$, this ratio grows with $n$. In that regime the approximation scale $n^{-1/d}$ is larger than the fluctuation scale $n^{-1/2}$, so concentration around the mean does not by itself determine the statistical error; the mean estimate is the dominant part. [/example]

The empirical Wasserstein example shows that a tail inequality is only one part of a statistical estimate. The next obstruction is structural rather than numerical: some theorems apply only to observables with convex sublevel or superlevel geometry.

[remark: Convexity Is a Structural Hypothesis] Convex concentration cannot be applied to an arbitrary Lipschitz function by ignoring the word convex. Nonconvex functions may encode complicated unions of product events, and Talagrand's convex-distance mechanism no longer supplies the same supporting-hyperplane argument. When convexity fails, return to entropy, bounded differences, or transport methods that control all Lipschitz observables. [/remark]

The course ends by returning to the map set out in Chapter 0. Entropy controls exponential tilts, isoperimetry controls enlargement of sets, and transport controls movement of measures. Concentration inequalities are most effective when the observable is translated into the language of the method before estimates begin.

## Beyond and Further Connections

The three routes developed in these notes are starting points rather than separate endpoints. The entropy route continues into modified logarithmic Sobolev inequalities, concentration for dependent spin systems, and entropy dissipation for Markov semigroups. The isoperimetric route continues into sharp comparison geometry, convex geometry, and product-space concentration for functions whose large values have small certificates.
The transport route continues into empirical process theory, Wasserstein statistics, curvature-dimension conditions, and stability of concentration under Lipschitz maps.

Several Androma topics connect directly with this page. [Gibbs' Inequality](/theorems/1629) is the basic entropy positivity principle behind the variational formulas, while [Chain Rule for Entropy](/theorems/1635) and [Conditioning Reduces Entropy](/theorems/1652) are finite-model versions of the decomposition ideas used in tensorization. [Chernoff Bound for Sub-Gaussian Random Variables](/theorems/6052), [Sub-Gaussian Sum Bound for Independent Random Variables](/theorems/6056), and [Maximal Inequality for Finitely Many Sub-Gaussian Random Variables](/theorems/6058) give classical concentration outputs to compare with the entropy and transport machinery here. On the geometric side, [Edge Isoperimetric Inequality in the Cube](/theorems/2602) and [Isoperimetric Inequality for Convex Bodies](/theorems/4117) provide related finite and convex isoperimetric models. The staged theorem cards in this page then develop the course-specific forms: Gibbs variational duality, logarithmic Sobolev concentration, Gaussian isoperimetry, Kantorovich duality, Bobkov-Götze, and Talagrand's $T_2$ inequality.

A useful next project is to compare what these methods say about the same observable. For a sample mean, all major routes give the same $n r^2$ scale after the Lipschitz or bounded-difference constant is computed. For a supremum, a convex function, or an empirical measure, the best method may change: entropy may see gradients, isoperimetry may see certificates, and transport may see Lipschitz dependence on the underlying law. This method selection problem is often the real mathematical work before a concentration theorem is applied.

## References

- Bakry, D., Gentil, I., and Ledoux, M. *Analysis and Geometry of Markov Diffusion Operators*. Springer, 2014.
- Bobkov, S. G. and Götze, F. "Exponential integrability and transportation cost related to logarithmic Sobolev inequalities." *Journal of Functional Analysis* 163 (1999), 1-28.
- Boucheron, S., Lugosi, G., and Massart, P. *Concentration Inequalities: A Nonasymptotic Theory of Independence*. Oxford University Press, 2013.
- Ledoux, M. *The Concentration of Measure Phenomenon*. American Mathematical Society, 2001.
- Villani, C. *Optimal Transport: Old and New*. Springer, 2009.

Created by admin on 6/12/2026 | Last updated on 6/12/2026
