This course develops optimal transport as a unifying framework for analysis, geometry, and applied modeling. It begins with the dynamic viewpoint, where transport is described by time-dependent flows and the Benamou-Brenier formula connects geometry with kinetic energy. From there, it builds a calculus on the Wasserstein space $W_2$, including Otto’s formal Riemannian picture, which turns probability measures into an infinite-dimensional geometric object on which gradients, geodesics, and curvature-like effects can be studied.
The middle chapters focus on how this geometry controls functionals and evolution equations. Displacement convexity leads to sharp functional inequalities and structural stability results, while the JKO scheme and minimizing movements provide a variational method for constructing gradient flows. These ideas are then applied to nonlinear diffusion, aggregation dynamics, and Monge-Ampère methods, showing how optimal transport informs both PDE theory and classical PDE techniques. The course then extends to Ricci curvature through transport and to concentration, isoperimetry, and stability, where geometric and probabilistic consequences of transport convexity become central.
The final chapters emphasize modern applications and computation. Computational and statistical optimal transport connects the theory to algorithms, regularization, and data analysis, while the synthesis chapter draws together the core message of the course: transport is not just a way to compare measures, but a modeling language for describing motion, interaction, curvature, and optimization across analysis and applied mathematics.
# Introduction
This course begins after the Monge-Kantorovich existence and duality theory is already available. The central question is how optimal transport becomes an analytic method: not only a way to compare probability measures, but a geometry in which PDE, convexity, curvature, concentration, and computation can be organised. The emphasis is on applications, but the applications are treated through the structures that make them work.
The guiding object is the Wasserstein space of probability measures with finite second moment. In the first course, $W_2$ was introduced through transport plans and optimal couplings. Here the same distance is studied dynamically: a curve of measures is interpreted as mass moving with a velocity field, and the squared distance is recovered as a least kinetic action.
## What Changes After the Foundations?
The foundational theory answers when optimal plans and maps exist and how duality identifies them. For applications, the more useful question is often different: if a probability measure evolves in time, what differential equation does the evolution satisfy, and what functional is it decreasing? This shifts the viewpoint from static optimisation to calculus on spaces of measures.
[definition: Wasserstein Space]
Let $W_2$ denote the quadratic Wasserstein distance on
\begin{align*}
\mathcal P_2(\mathbb R^n) = \left\{\mu \text{ a Borel probability measure on } \mathbb R^n : \int_{\mathbb R^n} |x|^2\,d\mu(x) < \infty\right\}.
\end{align*}
The metric space $(\mathcal P_2(\mathbb R^n), W_2)$ is called quadratic Wasserstein space.
[/definition]
This definition fixes the ambient space for most of the course. Its importance is that the distance is strong enough to see second moments, but flexible enough to contain singular measures, empirical measures, smooth densities, and solutions of weak PDE.
[example: Transport Between Point Masses]
For $a,b\in\mathbb R^n$, consider $\delta_a,\delta_b\in\mathcal P_2(\mathbb R^n)$. If $\pi$ is a coupling of $\delta_a$ and $\delta_b$, then its first marginal gives $\pi(\{a\}\times\mathbb R^n)=1$, and its second marginal gives $\pi(\mathbb R^n\times\{b\})=1$. Hence the complement of $\{a\}\times\{b\}$ has $\pi$-measure $0$, so $\pi=\delta_{(a,b)}$.
Using the definition of the quadratic Wasserstein distance, the squared cost is therefore
\begin{align*}
W_2^2(\delta_a,\delta_b)=\int_{\mathbb R^n\times\mathbb R^n}|x-y|^2\,d\delta_{(a,b)}(x,y)=|a-b|^2.
\end{align*}
Since $W_2$ is nonnegative, this gives $W_2(\delta_a,\delta_b)=|a-b|$. Thus the map $a\mapsto\delta_a$ preserves distances from Euclidean space into Wasserstein space, while Wasserstein space also contains genuinely distributed measures, not only point masses.
[/example]
The course repeatedly exploits this mixture of finite-dimensional intuition and infinite-dimensional behaviour. Geodesics between point masses look like straight lines, while geodesics between densities encode interpolation of whole distributions.
## Dynamic Transport as a Calculus of Moving Mass
How should a curve $t\mapsto \mu_t$ be differentiated when its values are measures rather than points? The answer used throughout the course is to pair the curve with a velocity field and impose conservation of mass. This gives the continuity equation as the differential language of transport.
[definition: Continuity Equation]
Let $(\mu_t)_{t\in[0,1]}$ be a narrowly continuous curve in $\mathcal P_2(\mathbb R^n)$ and let $(v_t)_{t\in[0,1]}$ be a Borel vector field with $v_t\in L^2(\mu_t;\mathbb R^n)$ for a.e. $t$. The pair $(\mu_t,v_t)$ solves the continuity equation if
\begin{align*}
\partial_t\mu_t + \nabla\cdot(v_t\mu_t)=0
\end{align*}
in the sense of distributions on $(0,1)\times\mathbb R^n$.
[/definition]
The formula says that the only way mass changes in a region is by flux through its boundary. It is the bridge between transport plans and PDE: a coupling describes where mass starts and ends, while the continuity equation describes how it moves in between. The next issue is whether this dynamical description gives the same distance as the static Kantorovich problem, because the rest of the course relies on replacing couplings by curves without changing the metric.
[quotetheorem:9556]
[citeproof:9556]
This theorem is the first main technical hinge of the course. It turns the Wasserstein distance into an action functional and makes tools from weak PDE available in transport geometry. The finite second moment assumption is not decorative: without it the squared displacement and the kinetic action may both be infinite, so the formula would no longer give a finite metric statement. Narrow continuity is the minimal continuity condition compatible with testing measures against bounded continuous functions, while the requirement $v_t\in L^2(\mu_t;\mathbb R^n)$ is exactly what makes the kinetic energy meaningful. The formula should not be read as saying that an optimal velocity field is unique or smooth; minimisers may exist only in weak form, and the continuity equation is usually a distributional equation rather than a classical PDE.
[example: Constant-Speed Motion of a Density]
Let $\mu_0=\rho_0\mathcal L^n\in\mathcal P_2(\mathbb R^n)$, let $T$ be the optimal map with $T_\#\mu_0=\mu_1$, and set $T_t(x)=(1-t)x+tT(x)$ and $\mu_t=(T_t)_\#\mu_0$. By the definition of pushforward, for every bounded Borel test function $\varphi$,
\begin{align*}
\int_{\mathbb R^n}\varphi(y)\,d\mu_t(y)=\int_{\mathbb R^n}\varphi(T_t(x))\,d\mu_0(x).
\end{align*}
Since $\partial_tT_t(x)=T(x)-x$, differentiating the identity for smooth compactly supported $\varphi$ gives
\begin{align*}
\frac{d}{dt}\int_{\mathbb R^n}\varphi(y)\,d\mu_t(y)=\int_{\mathbb R^n}\nabla\varphi(T_t(x))\cdot (T(x)-x)\,d\mu_0(x).
\end{align*}
Define the velocity along transported particles by $v_t(T_t(x))=T(x)-x$; for $t<1$ this is well defined because $T_t$ is injective when $T$ is optimal. Then the previous identity becomes
\begin{align*}
\frac{d}{dt}\int_{\mathbb R^n}\varphi(y)\,d\mu_t(y)=\int_{\mathbb R^n}\nabla\varphi(y)\cdot v_t(y)\,d\mu_t(y),
\end{align*}
which is exactly the weak form of $\partial_t\mu_t+\nabla\cdot(v_t\mu_t)=0$.
The kinetic action of this curve is computed by changing variables through the pushforward $\mu_t=(T_t)_\#\mu_0$:
\begin{align*}
\int_0^1\int_{\mathbb R^n}|v_t(y)|^2\,d\mu_t(y)\,dt=\int_0^1\int_{\mathbb R^n}|v_t(T_t(x))|^2\,d\mu_0(x)\,dt.
\end{align*}
Using $v_t(T_t(x))=T(x)-x$ gives
\begin{align*}
\int_0^1\int_{\mathbb R^n}|v_t(T_t(x))|^2\,d\mu_0(x)\,dt=\int_0^1\int_{\mathbb R^n}|T(x)-x|^2\,d\mu_0(x)\,dt.
\end{align*}
The integrand is independent of $t$, so
\begin{align*}
\int_0^1\int_{\mathbb R^n}|T(x)-x|^2\,d\mu_0(x)\,dt=\int_{\mathbb R^n}|T(x)-x|^2\,d\mu_0(x).
\end{align*}
Thus the displacement interpolation moves each particle with constant velocity from $x$ to $T(x)$, and its kinetic action equals the quadratic transport cost induced by the optimal map $T$.
[/example]
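The identity between kinetic action and transport cost can be checked numerically in the simplest setting. The following is a minimal sketch, assuming two hypothetical empirical measures on the line with equally weighted atoms, for which the sorted matching is optimal; the sample sizes, distributions, and the helper `particles` are illustrative choices and not part of the course material.

```python
import random

random.seed(0)

# Hypothetical data: two empirical measures on the line, each with N atoms
# of equal weight 1/N.
N = 1000
x = sorted(random.gauss(0.0, 1.0) for _ in range(N))  # atoms of mu_0
y = sorted(random.gauss(3.0, 0.5) for _ in range(N))  # atoms of mu_1

# In one dimension the sorted matching x[i] -> y[i] is optimal for the
# quadratic cost, so y[i] plays the role of T(x[i]).
transport_cost = sum((yi - xi) ** 2 for xi, yi in zip(x, y)) / N


def particles(t):
    """Positions T_t(x) = (1 - t) x + t T(x) of all particles at time t."""
    return [(1.0 - t) * xi + t * yi for xi, yi in zip(x, y)]


# Kinetic action: estimate each particle velocity by a finite difference in
# time, average |v|^2 over the particles, then integrate over [0, 1].
steps = 50
dt = 1.0 / steps
action = 0.0
for k in range(steps):
    now, later = particles(k * dt), particles((k + 1) * dt)
    kinetic_energy = sum(((q - p) / dt) ** 2 for p, q in zip(now, later)) / N
    action += kinetic_energy * dt

print(transport_cost, action)
```

Because every particle moves on a straight line at constant velocity, the finite-difference velocities are exact up to rounding, and the two printed numbers should agree to many digits.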
## Displacement Convexity and Gradient Flows
Once curves and velocities are available, the next problem is to decide which functionals are convex along transport geodesics. This is not ordinary convexity on a vector space: the midpoint between two measures is formed by optimal displacement, not by linear averaging. The resulting notion is displacement convexity.
[definition: Displacement Convexity]
A functional $\mathcal F:\mathcal P_2(\mathbb R^n)\to(-\infty,\infty]$ is displacement convex if for every $\mu_0,\mu_1\in\mathcal P_2(\mathbb R^n)$ there exists a constant-speed $W_2$-geodesic $(\mu_t)_{t\in[0,1]}$ joining them such that
\begin{align*}
\mathcal F(\mu_t) \le (1-t)\mathcal F(\mu_0)+t\mathcal F(\mu_1)
\end{align*}
for every $t\in[0,1]$.
[/definition]
Displacement convexity is the condition that lets variational arguments survive in Wasserstein space. It underlies uniqueness of minimisers, contraction estimates, entropy methods, and the interpretation of dissipative PDE as steepest descent.
[example: Potential Energy Along Transport Geodesics]
Let $V:\mathbb R^n\to\mathbb R$ be convex, and define
\begin{align*}
\mathcal V[\mu]=\int_{\mathbb R^n}V(y)\,d\mu(y)
\end{align*}
whenever this integral is finite. Suppose $\mu_1=T_\#\mu_0$ for an optimal map $T$, and let the displacement interpolation be
\begin{align*}
\mu_t=(T_t)_\#\mu_0,\qquad T_t(x)=(1-t)x+tT(x).
\end{align*}
By the definition of pushforward applied to the test function $V$,
\begin{align*}
\mathcal V[\mu_t]=\int_{\mathbb R^n}V(T_t(x))\,d\mu_0(x).
\end{align*}
Substituting $T_t(x)=(1-t)x+tT(x)$ gives
\begin{align*}
\mathcal V[\mu_t]=\int_{\mathbb R^n}V((1-t)x+tT(x))\,d\mu_0(x).
\end{align*}
Since $V$ is convex,
\begin{align*}
V((1-t)x+tT(x))\le (1-t)V(x)+tV(T(x))
\end{align*}
for every $x$ and every $t\in[0,1]$. Integrating this pointwise inequality with respect to $\mu_0$ gives
\begin{align*}
\mathcal V[\mu_t]\le \int_{\mathbb R^n}\bigl((1-t)V(x)+tV(T(x))\bigr)\,d\mu_0(x).
\end{align*}
By linearity of the integral,
\begin{align*}
\int_{\mathbb R^n}\bigl((1-t)V(x)+tV(T(x))\bigr)\,d\mu_0(x)=(1-t)\int_{\mathbb R^n}V(x)\,d\mu_0(x)+t\int_{\mathbb R^n}V(T(x))\,d\mu_0(x).
\end{align*}
The first integral is $\mathcal V[\mu_0]$, and the second is $\mathcal V[\mu_1]$ because $\mu_1=T_\#\mu_0$. Therefore
\begin{align*}
\mathcal V[\mu_t]\le (1-t)\mathcal V[\mu_0]+t\mathcal V[\mu_1].
\end{align*}
Thus convexity of $V$ in Euclidean space becomes displacement convexity of the potential energy along the transport geodesic generated by $T$.
[/example]
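The inequality in the example can be observed on samples. The sketch below assumes one-dimensional empirical measures with equal weights (so that the sorted matching is an optimal map) and a convex potential $V(z)=|z|+z^2/2$ chosen only for illustration; it compares $\mathcal V[\mu_t]$ with the chord $(1-t)\mathcal V[\mu_0]+t\mathcal V[\mu_1]$ on a grid of times.

```python
import random

random.seed(1)

# Hypothetical empirical measures on the line with N equally weighted atoms.
N = 500
x = sorted(random.uniform(-2.0, 1.0) for _ in range(N))  # atoms of mu_0
y = sorted(random.gauss(1.0, 2.0) for _ in range(N))     # atoms of mu_1


def V(z):
    """A convex potential chosen only for illustration."""
    return abs(z) + 0.5 * z * z


def potential_energy(points):
    """Integral of V against the uniform measure on the given points."""
    return sum(V(p) for p in points) / len(points)


energy_0 = potential_energy(x)
energy_1 = potential_energy(y)

# Chord value minus energy along the displacement interpolation.
gaps = []
for k in range(11):
    t = k / 10
    mu_t = [(1.0 - t) * xi + t * yi for xi, yi in zip(x, y)]
    chord = (1.0 - t) * energy_0 + t * energy_1
    gaps.append(chord - potential_energy(mu_t))

print(min(gaps))
```

Every gap should be nonnegative up to rounding, with equality at $t=0$ and $t=1$.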
The example explains why convex potentials behave well under displacement, but the main analytic payoff is stronger. The course next needs a theorem identifying a familiar PDE as a Wasserstein steepest descent equation. The entropy functional is the basic test case because its descent equation is the heat equation, which is already familiar from PDE.
[quotetheorem:9557]
[citeproof:9557]
This calculation previews the style of the course: formal geometric identities guide the analysis, and later sections explain how to justify them under weak hypotheses. The identity $\rho\nabla\log\rho=\nabla\rho$ uses positivity and enough regularity to differentiate the density, so it is initially a formal computation for smooth positive solutions. For general absolutely continuous measures the entropy may be infinite, the density may vanish, and the heat flow has to be constructed by a weak or variational argument rather than by pointwise calculus. The rigorous statement therefore concerns curves in $\mathcal P_2(\mathbb R^n)$ produced by the minimizing movement scheme, with entropy dissipation replacing the classical chain rule.
## Curvature, Concentration, and Functional Inequalities
The next question is why transport inequalities know about curvature and probability tails. In Riemannian geometry, lower Ricci curvature bounds control volume growth, heat flow, and convexity of entropy. Optimal transport recasts these facts as convexity properties along Wasserstein geodesics.
[definition: Transport-Entropy Inequality]
Let $\nu\in\mathcal P_2(\mathbb R^n)$. The measure $\nu$ satisfies a quadratic transport-entropy inequality with constant $C>0$ if every probability measure $\mu\ll\nu$ satisfies
\begin{align*}
W_2^2(\mu,\nu) \le 2C\operatorname{Ent}_\nu(\mu),
\end{align*}
where $\operatorname{Ent}_\nu(\mu)=\int \log\left(\frac{d\mu}{d\nu}\right)\,d\mu$.
[/definition]
This inequality converts information-theoretic distance into geometric distance. It becomes useful because $W_2$ controls deviations of Lipschitz observables by Kantorovich duality and approximation. The course therefore needs a first sharp case where the inequality can be proved and then reused as a model for curvature-driven concentration; the standard Gaussian is that case.
[quotetheorem:6792]
[citeproof:6792]
The Gaussian case is the model for a larger principle: convexity of the reference potential produces transport inequalities, and transport inequalities imply concentration of measure. The assumption $\mu\ll\gamma$ is needed because the relative entropy is defined through the density $d\mu/d\gamma$; if absolute continuity fails, the entropy is interpreted as $+\infty$ and the inequality holds trivially. The factor $2$, corresponding to $C=1$ in the definition above, is sharp for the standard Gaussian: translates of the Gaussian measure attain equality. The theorem is not a statement about arbitrary log-concave reference measures: extending it requires quantitative convexity of the potential, curvature lower bounds, or additional functional inequalities.
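A short computation makes the sharpness claim concrete. Assume $\gamma$ is the standard Gaussian on $\mathbb R^n$ with density $(2\pi)^{-n/2}e^{-|x|^2/2}$, and for a fixed $a\in\mathbb R^n$ (a symbol introduced here only for illustration) let $\gamma_a$ denote the translate of $\gamma$ by $a$. Its density relative to $\gamma$ is
\begin{align*}
\frac{d\gamma_a}{d\gamma}(x)=\exp\left(-\frac{|x-a|^2}{2}+\frac{|x|^2}{2}\right)=\exp\left(a\cdot x-\frac{|a|^2}{2}\right).
\end{align*}
Since $\gamma_a$ has mean $a$, the relative entropy is
\begin{align*}
\operatorname{Ent}_\gamma(\gamma_a)=\int_{\mathbb R^n}\left(a\cdot x-\frac{|a|^2}{2}\right)\,d\gamma_a(x)=|a|^2-\frac{|a|^2}{2}=\frac{|a|^2}{2}.
\end{align*}
The translation $x\mapsto x+a$ gives $W_2^2(\gamma_a,\gamma)\le|a|^2$, and comparing barycentres with Jensen's inequality gives the reverse bound, so
\begin{align*}
W_2^2(\gamma_a,\gamma)=|a|^2=2\operatorname{Ent}_\gamma(\gamma_a).
\end{align*}
Equality holds for every $a$, so no factor smaller than $2$ can work.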
## Computational and Statistical Directions
The final part of the course asks how much of this analytic structure remains useful when measures are empirical, high-dimensional, or represented by parameterised maps. This leads to entropic regularisation, stability of empirical transport, and links with generative modelling.
[definition: Entropic Regularisation]
Let $\mu,\nu\in\mathcal P_2(\mathbb R^n)$ and let $\varepsilon>0$. The entropically regularised transport cost is
\begin{align*}
\operatorname{OT}_\varepsilon(\mu,\nu)
=\inf_{\pi\in\Pi(\mu,\nu)}\left\{\int_{\mathbb R^n\times\mathbb R^n}|x-y|^2\,d\pi(x,y)+\varepsilon\operatorname{Ent}_{\mu\otimes\nu}(\pi)\right\}.
\end{align*}
[/definition]
The entropy term makes the optimisation problem more stable and computationally tractable, while changing the geometry in a controlled way as $\varepsilon\downarrow0$.
[example: Sinkhorn Scaling in the Discrete Case]
Let $c_{ij}=|x_i-y_j|^2$. A coupling of $\mu=\sum_{i=1}^m a_i\delta_{x_i}$ and $\nu=\sum_{j=1}^k b_j\delta_{y_j}$ is the same as a matrix $P=(P_{ij})$ with $P_{ij}\ge0$,
\begin{align*}
\sum_{j=1}^k P_{ij}=a_i
\end{align*}
for each $i$, and
\begin{align*}
\sum_{i=1}^m P_{ij}=b_j
\end{align*}
for each $j$. Since all $a_i$ and $b_j$ are positive, the entropy relative to $\mu\otimes\nu$ is
\begin{align*}
\operatorname{Ent}_{\mu\otimes\nu}(P)=\sum_{i=1}^m\sum_{j=1}^k P_{ij}\log\left(\frac{P_{ij}}{a_i b_j}\right).
\end{align*}
Thus the discrete entropic problem is to minimise
\begin{align*}
\sum_{i=1}^m\sum_{j=1}^k c_{ij}P_{ij}+\varepsilon\sum_{i=1}^m\sum_{j=1}^k P_{ij}\log\left(\frac{P_{ij}}{a_i b_j}\right)
\end{align*}
under the row and column constraints. The first term is linear in $P$, and each function $p\mapsto p\log(p/(a_i b_j))$ is strictly convex on $(0,\infty)$, so the minimiser is unique.
At an interior minimiser, introduce Lagrange multipliers $\alpha_i$ and $\beta_j$ for the row and column constraints. Differentiating the Lagrangian with respect to $P_{ij}$ gives
\begin{align*}
c_{ij}+\varepsilon\left(\log\left(\frac{P_{ij}}{a_i b_j}\right)+1\right)+\alpha_i+\beta_j=0.
\end{align*}
Rearranging,
\begin{align*}
\log\left(\frac{P_{ij}}{a_i b_j}\right)=-\frac{c_{ij}}{\varepsilon}-1-\frac{\alpha_i}{\varepsilon}-\frac{\beta_j}{\varepsilon}.
\end{align*}
Exponentiating both sides gives
\begin{align*}
P_{ij}=a_i b_j e^{-1-\alpha_i/\varepsilon}e^{-c_{ij}/\varepsilon}e^{-\beta_j/\varepsilon}.
\end{align*}
If
\begin{align*}
K_{ij}=e^{-c_{ij}/\varepsilon}=e^{-|x_i-y_j|^2/\varepsilon},
\end{align*}
and the positive factors depending only on $i$ and only on $j$ are absorbed into vectors $u$ and $v$, then the minimiser has the scaling form
\begin{align*}
P_{ij}=u_iK_{ij}v_j.
\end{align*}
The row constraints require
\begin{align*}
a_i=\sum_{j=1}^k P_{ij}=\sum_{j=1}^k u_iK_{ij}v_j=u_i\sum_{j=1}^k K_{ij}v_j,
\end{align*}
so, once $v$ is fixed,
\begin{align*}
u_i=\frac{a_i}{\sum_{j=1}^k K_{ij}v_j}.
\end{align*}
Similarly, the column constraints require
\begin{align*}
b_j=\sum_{i=1}^m P_{ij}=\sum_{i=1}^m u_iK_{ij}v_j=v_j\sum_{i=1}^m u_iK_{ij},
\end{align*}
so, once $u$ is fixed,
\begin{align*}
v_j=\frac{b_j}{\sum_{i=1}^m u_iK_{ij}}.
\end{align*}
Sinkhorn scaling is precisely the iteration that alternates these two formulas: rescale the rows to match $a$, then rescale the columns to match $b$. The kernel $K$ contains the transport cost, while the vectors $u$ and $v$ enforce the marginal constraints.
[/example]
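The alternating update can be written in a few lines. This is a minimal sketch in plain Python, assuming small hypothetical inputs and a fixed iteration count; the function name `sinkhorn` and its parameters are illustrative, and no stopping criterion or numerical stabilisation is included.

```python
import math


def sinkhorn(a, b, cost, eps, n_iter=500):
    """Alternate the two scaling formulas and return P_ij = u_i K_ij v_j."""
    m, k = len(a), len(b)
    # Gibbs kernel K_ij = exp(-c_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(k)] for i in range(m)]
    u = [1.0] * m
    v = [1.0] * k
    for _ in range(n_iter):
        # Rescale rows so that the row sums match a (v held fixed).
        u = [a[i] / sum(K[i][j] * v[j] for j in range(k)) for i in range(m)]
        # Rescale columns so that the column sums match b (u held fixed).
        v = [b[j] / sum(u[i] * K[i][j] for i in range(m)) for j in range(k)]
    return [[u[i] * K[i][j] * v[j] for j in range(k)] for i in range(m)]


# Hypothetical example: three source atoms and two target atoms on the line.
x = [0.0, 1.0, 2.0]
y = [0.5, 1.5]
a = [0.2, 0.5, 0.3]
b = [0.6, 0.4]
cost = [[(xi - yj) ** 2 for yj in y] for xi in x]

P = sinkhorn(a, b, cost, eps=0.5)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(len(a))) for j in range(len(b))]
print(row_sums, col_sums)
```

The column sums match $b$ exactly after the last update, while the row sums match $a$ only up to the convergence error of the iteration. For very small $\varepsilon$ the entries of $K$ underflow, so practical implementations usually work with logarithms of $u$ and $v$.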
These computational themes do not replace the analytic theory; they depend on it. Stability, convergence, bias, and sample complexity all require understanding how transport distances and regularised variants behave under weak convergence, moment bounds, and functional inequalities.
## Course Map
The first chapter develops dynamic transport and the Benamou-Brenier formula. The second chapter introduces Otto's formal Riemannian calculus on $\mathcal P_2$ and uses it to compute gradients of internal, potential, and interaction energies. The next chapters study displacement convexity, gradient flows, and curvature-dimension ideas.
The middle part turns these tools into inequalities: logarithmic Sobolev inequalities, Talagrand inequalities, HWI estimates, and concentration bounds. The final part treats computation and statistics, including entropic regularisation, empirical measures, stability estimates, and transport-based generative models.
Throughout the course, the same pattern repeats. Start with a variational or geometric statement in Wasserstein space, translate it into a PDE or inequality, and then prove enough compactness, convexity, or stability to make the translation reliable.
# 1. Dynamic Transport and the Benamou-Brenier Formula
This chapter replaces the static problem of matching $\mu_0$ to $\mu_1$ by a time-dependent problem: move mass through a continuum of intermediate probability measures while paying kinetic energy. It assumes the static quadratic transport problem, optimal couplings, pushforwards, weak convergence of measures, basic distributional PDE notation, and the space $\mathcal P_2(\mathbb R^n)$ with the Wasserstein distance $W_2$. The main question is how a curve $t \mapsto \mu_t$ in Wasserstein space should be differentiated, and how that derivative is represented by an Eulerian velocity field. The reward is the Benamou--Brenier formula, which identifies $W_2^2$ with the least kinetic action among all solutions of a continuity equation joining the two endpoints.
## Curves in Wasserstein Space and Metric Derivatives
How should we measure the speed of a curve whose points are probability measures rather than points in Euclidean space? The right notion is intrinsic to the metric space $(\mathcal P_2(\mathbb R^n), W_2)$: compare nearby times by their Wasserstein distance and ask whether this comparison is controlled by an integrable scalar speed.
[definition: Absolutely Continuous Curve in a Metric Space]
Let $(X,d)$ be a metric space and let $I\subset \mathbb R$ be an interval. A curve $\gamma:I\to X$ is absolutely continuous if there exists $g\in L^1(I)$ such that for all $s,t\in I$ with $s\le t$,
\begin{align*}
d(\gamma(s),\gamma(t))\le \int_s^t g(r)\,dr.
\end{align*}
[/definition]
This definition only uses the distance, so it applies without coordinates or linear structure. To compare different admissible controls $g$, we need an intrinsic pointwise speed attached to the curve itself.
[definition: Metric Derivative]
Let $(X,d)$ be a metric space and let $\gamma:I\to X$ be an absolutely continuous curve. The metric derivative is the partially defined function $|\gamma'|:I\to[0,\infty]$ given at points $t\in I$ where the limit exists by
\begin{align*}
|\gamma'|(t):=\lim_{h\to 0}\frac{d(\gamma(t+h),\gamma(t))}{|h|}.
\end{align*}
[/definition]
The control function $g$ in the definition of absolute continuity is not unique: if one admissible $g$ works, any larger integrable function works as well. To turn metric absolute continuity into a usable calculus, one must identify the intrinsic smallest speed determined only by the curve and the distance. The obstruction is that the pointwise limit defining $|\gamma'|$ need not exist for an arbitrary continuous curve, so the length density has to be recovered from absolute continuity itself rather than assumed in advance.
[quotetheorem:9558]
[citeproof:9558]
The theorem says that metric speed is not an extra structure: it is forced by the distance. It does not say that every continuous curve has a speed, nor that finite endpoint distance controls the length travelled in between. For instance, a continuous curve in a compact metric space can have infinite variation by repeatedly traversing shorter and shorter loops with divergent total length, so no $L^1$ control function can bound all increments. The completeness assumption is harmless here and becomes more important in later compactness arguments; the decisive hypothesis for this theorem is metric absolute continuity, not mere continuity or pointwise differentiability.
The minimality clause is essential because many larger functions $g$ may control the same curve, while only $|\gamma'|$ gives the intrinsic length density. In the constant curve case, every nonnegative $g\in L^1(0,1)$ is an admissible control, but the metric derivative is $0$ a.e.; this separates "a speed bound" from "the speed." For transport, the next task is to express this intrinsic speed through a PDE describing conservation of mass rather than through pairwise comparisons of measures.
[example: Translation Curve]
Let $\mu_0\in\mathcal P_2(\mathbb R^n)$, let $a\in\mathbb R^n$, and set $T_t(x)=x+ta$ and $\mu_t=(T_t)_\#\mu_0$. For $0\le s,t\le1$, the pushforward of $\mu_0$ under the map $x\mapsto (T_s(x),T_t(x))$ is a coupling of $\mu_s$ and $\mu_t$, hence
\begin{align*}
W_2^2(\mu_s,\mu_t)\le \int_{\mathbb R^n}|T_t(x)-T_s(x)|^2\,d\mu_0(x)=\int_{\mathbb R^n}|(t-s)a|^2\,d\mu_0(x)=|t-s|^2|a|^2.
\end{align*}
For the reverse inequality, let $b_t=\int_{\mathbb R^n}z\,d\mu_t(z)$ be the barycentre. Since $\mu_0\in\mathcal P_2(\mathbb R^n)$, this integral is finite, and the pushforward formula gives
\begin{align*}
b_t=\int_{\mathbb R^n}(x+ta)\,d\mu_0(x)=\int_{\mathbb R^n}x\,d\mu_0(x)+ta.
\end{align*}
Thus $b_t-b_s=(t-s)a$. If $\pi$ is any coupling of $\mu_s$ and $\mu_t$, then
\begin{align*}
(t-s)a=b_t-b_s=\int_{\mathbb R^n\times\mathbb R^n}(y-x)\,d\pi(x,y).
\end{align*}
By the triangle inequality and Cauchy--Schwarz,
\begin{align*}
|t-s||a|\le \int_{\mathbb R^n\times\mathbb R^n}|y-x|\,d\pi(x,y)\le \left(\int_{\mathbb R^n\times\mathbb R^n}|y-x|^2\,d\pi(x,y)\right)^{1/2}.
\end{align*}
Taking the infimum over all couplings $\pi$ gives $|t-s|^2|a|^2\le W_2^2(\mu_s,\mu_t)$. Combining the two inequalities,
\begin{align*}
W_2(\mu_s,\mu_t)=|t-s||a|.
\end{align*}
Therefore
\begin{align*}
|\mu'|(t)=\lim_{h\to0}\frac{W_2(\mu_{t+h},\mu_t)}{|h|}=\lim_{h\to0}\frac{|h||a|}{|h|}=|a|.
\end{align*}
This translation curve is the model case where every particle moves with the same constant velocity $a$, and the Wasserstein speed records exactly that particle speed.
[/example]
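The formula $W_2(\mu_s,\mu_t)=|t-s||a|$ can be confirmed on a small point cloud without assuming which coupling is optimal. The sketch below uses a hypothetical uniform empirical measure on six points in the plane; for such measures an optimal coupling can be taken to be a permutation, so a brute-force search over all matchings computes $W_2$ exactly. The helper names `translate` and `w2` are illustrative.

```python
import itertools
import math
import random

random.seed(2)

# Hypothetical data: a uniform empirical measure on N points in the plane.
N = 6
points = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(N)]
a = (1.5, -0.5)  # translation velocity


def translate(pts, t):
    """Atoms of mu_t = (T_t)_# mu_0 with T_t(x) = x + t a."""
    return [(p[0] + t * a[0], p[1] + t * a[1]) for p in pts]


def w2(xs, ys):
    """Exact W_2 between uniform empirical measures with equally many atoms,
    computed by brute force over all matchings (only feasible for tiny N)."""
    n = len(xs)
    best = min(
        sum(
            (xs[i][0] - ys[s[i]][0]) ** 2 + (xs[i][1] - ys[s[i]][1]) ** 2
            for i in range(n)
        )
        for s in itertools.permutations(range(n))
    )
    return math.sqrt(best / n)


s, t = 0.2, 0.9
distance = w2(translate(points, s), translate(points, t))
expected = abs(t - s) * math.hypot(*a)

# Difference quotient approximating the metric derivative at time t.
h = 1e-3
speed = w2(translate(points, t), translate(points, t + h)) / h

print(distance, expected, speed, math.hypot(*a))
```

The search never has to be told that the identity matching is optimal; it recovers the value $|t-s||a|$ on its own, and the difference quotient reproduces the speed $|a|$.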
## Continuity Equations and Kinetic Action
If the curve $\mu_t$ is a moving density, what is its derivative? The answer cannot be only a pointwise time derivative of a density: a Dirac mass moving along a curve has no density with respect to $\mathcal L^n$, and even smooth densities can change because mass crosses the boundary of a test region. A formula such as $\partial_t\rho_t=0$ would miss transport entirely, while tracking particle labels is unavailable for general measures. Conservation of mass is therefore encoded weakly by a continuity equation with a velocity field $v_t$.
[definition: Distributional Continuity Equation]
Let $(\mu_t)_{t\in[0,1]}$ be a narrowly continuous curve in $\mathcal P(\mathbb R^n)$ and let $v_t:\mathbb R^n\to\mathbb R^n$ be a Borel vector field defined for a.e. $t$. The pair $(\mu_t,v_t)$ solves the continuity equation
\begin{align*}
\partial_t\mu_t+\nabla\cdot(v_t\mu_t)=0
\end{align*}
in the distributional sense if
\begin{align*}
\int_0^1\int_{\mathbb R^n}\left(\partial_t\phi(t,x)+v_t(x)\cdot \nabla\phi(t,x)\right)\,d\mu_t(x)\,dt=0
\end{align*}
for every $\phi\in C_c^\infty((0,1)\times\mathbb R^n)$.
[/definition]
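The weak form says that, when the curve is tested against a smooth profile $\phi$, the only change comes from transporting $\phi$ along $v_t$: mass is moved but neither created nor destroyed. The identity presupposes that $v_t\cdot\nabla\phi$ is integrable with respect to $\mu_t\,dt$, which holds in particular when the kinetic action defined next is finite. To turn this conservation law into a variational transport problem, we need a numerical cost assigned to the velocity field.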
[definition: Kinetic Action]
Let $\mathsf{CE}_{0,1}$ denote the class of pairs $(\mu,v)$ solving the continuity equation on $[0,1]$, where $\mu=(\mu_t)_{t\in[0,1]}$ and $v=(v_t)_{t\in[0,1]}$. The kinetic action is the functional
\begin{align*}
\mathcal A:\mathsf{CE}_{0,1}\to[0,\infty]
\end{align*}
defined by
\begin{align*}
\mathcal A(\mu,v):=\int_0^1\int_{\mathbb R^n}|v_t(x)|^2\,d\mu_t(x)\,dt,
\end{align*}
with value $+\infty$ if the integral is not finite.
[/definition]
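As a minimal worked instance of the two definitions above, assume (purely for illustration) that $x:[0,1]\to\mathbb R^n$ is a $C^1$ path, put $\mu_t=\delta_{x(t)}$, and let $v_t$ be any Borel field with $v_t(x(t))=\dot x(t)$. For $\phi\in C_c^\infty((0,1)\times\mathbb R^n)$ the weak form reduces to a total derivative:
\begin{align*}
\int_0^1\int_{\mathbb R^n}\left(\partial_t\phi+v_t\cdot\nabla\phi\right)\,d\mu_t\,dt
=\int_0^1\left(\partial_t\phi(t,x(t))+\dot x(t)\cdot\nabla\phi(t,x(t))\right)\,dt
=\int_0^1\frac{d}{dt}\phi(t,x(t))\,dt=0,
\end{align*}
because $\phi$ vanishes for $t$ near $0$ and near $1$. So the moving Dirac mass solves the continuity equation although it has no density. Its kinetic action is
\begin{align*}
\mathcal A(\mu,v)=\int_0^1|v_t(x(t))|^2\,dt=\int_0^1|\dot x(t)|^2\,dt,
\end{align*}
the time integral of the squared speed of the single particle.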
The action is the Eulerian analogue of average squared particle speed. A velocity field with large cancellations or unnecessary rotational motion can solve the same continuity equation while spending extra action, so the variational problem must identify the least $L^2(\mu_t)$ representative. Conversely, a curve in $W_2$ has a metric speed before any velocity field has been chosen, and it is not automatic that this abstract derivative comes from a PDE. The next theorem resolves this obstruction by matching metric speed with the minimal Eulerian kinetic density.
[quotetheorem:9559]
[citeproof:9559]
This theorem is the bridge between metric geometry and PDE, but each hypothesis is doing work. The finite second moment assumption is needed because a curve such as $\mu_t=\delta_{1/(1-t)}$, $t<1$, sends its mass to infinity as $t\uparrow1$ and so cannot be extended to a $W_2$ curve on $[0,1]$, even though the continuity equation makes sense on every subinterval $[0,T]$ with $T<1$. Narrow continuity is also not automatic from the continuity equation if time slices are chosen inconsistently: changing $\mu_t$ at a single time leaves the integral identity unchanged, but destroys continuity of $t\mapsto\mu_t$ and therefore destroys the metric-curve interpretation. The precise assumption $\int_0^1\|v_t\|_{L^2(\mu_t)}\,dt<\infty$ rules out a velocity field whose $L^2(\mu_t)$ norm is not integrable in time; a bound on $\|v_t\|_{L^1(\mu_t)}$ alone allows a small amount of mass to travel very long distances and is suited to $W_1$, not to the quadratic distance.
The theorem also does not give a unique velocity field. If $\mu_t$ is normalised Lebesgue measure on a bounded smooth region for every $t$ and $w$ is a divergence-free vector field tangent to the boundary, then $\nabla\cdot(w\mu_t)=0$, so adding $w$ to any admissible velocity preserves the continuity equation while, in general, changing the action. The distinguished representative is obtained only after projecting onto the closure of gradients in $L^2(\mu_t)$, which is why the next discussion keeps track not just of existence of velocities, but of action-minimising velocities.
At the same time, the Eulerian equation should still be read as a weak particle-transport law. Under suitable measurability and integrability assumptions on the velocity field, one can often represent a narrowly continuous solution of the continuity equation by a probability measure on path space whose time marginals are the measures $\mu_t$. This superposition viewpoint explains why continuity equations are the right weak language: even when no deterministic flow map exists, the solution may decompose into random particle paths.
The path-space interpretation is weaker than uniqueness of characteristics; different path measures may represent the same Eulerian solution when trajectories merge or when the velocity field has nonunique ODE solutions. Its forward role is specific: it converts admissible Eulerian competitors into endpoint couplings carried by paths, which is the lower-bound mechanism behind the Benamou--Brenier formula.
[example: Compressible and Incompressible-Looking Paths]
Let $\rho_0$ be an absolutely continuous probability density on a disc in $\mathbb R^2$, and let the moving particles be described by an affine flow
\begin{align*}
X_t(x)=c(t)+B_t(x-c(0)),\qquad B_t=R_{\omega t}\operatorname{diag}(e^{\alpha t},e^{-\beta t}),
\end{align*}
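where $R_{\omega t}$ is rotation by angle $\omega t$, the constants $\alpha,\beta,\omega\in\mathbb R$ are fixed, and $c:[0,1]\to\mathbb R^2$ is a $C^1$ curve. The pushforward density $\rho_t\mathcal L^2=(X_t)_\#(\rho_0\mathcal L^2)$ satisfies the change-of-variables identity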
\begin{align*}
\rho_t(X_t(x))\det B_t=\rho_0(x).
\end{align*}
Since $\det R_{\omega t}=1$ and $\det\operatorname{diag}(e^{\alpha t},e^{-\beta t})=e^{\alpha t}e^{-\beta t}=e^{(\alpha-\beta)t}$, this becomes
\begin{align*}
\rho_t(X_t(x))=e^{-(\alpha-\beta)t}\rho_0(x).
\end{align*}
The Eulerian velocity is obtained by differentiating the particle path and then writing the result at the spatial point $z=X_t(x)$:
\begin{align*}
v_t(z)=\dot c(t)+\dot B_tB_t^{-1}(z-c(t)).
\end{align*}
Its divergence is the trace of the linear part,
\begin{align*}
\nabla\cdot v_t=\operatorname{tr}(\dot B_tB_t^{-1})=\frac{d}{dt}\log\det B_t=\frac{d}{dt}\bigl((\alpha-\beta)t\bigr)=\alpha-\beta.
\end{align*}
Thus the rotational factor contributes no divergence, because rotations preserve area, while the unequal squeeze factors contribute $\alpha-\beta$. Along each particle trajectory,
\begin{align*}
\frac{d}{dt}\rho_t(X_t(x))=-(\alpha-\beta)e^{-(\alpha-\beta)t}\rho_0(x)=-(\nabla\cdot v_t)\rho_t(X_t(x)).
\end{align*}
By the chain rule, the left side is $\partial_t\rho_t(X_t(x))+v_t(X_t(x))\cdot\nabla\rho_t(X_t(x))$, so the continuity equation reads
\begin{align*}
\partial_t\rho_t+\nabla\cdot(\rho_t v_t)=\partial_t\rho_t+v_t\cdot\nabla\rho_t+\rho_t\nabla\cdot v_t=0.
\end{align*}
A rotating-looking motion is therefore incompressible only when the area factor is constant, here when $\alpha=\beta$; otherwise the density changes by the Jacobian factor even if the visible motion resembles a rigid rotation.
[/example]
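The trace identity in the example can be checked by finite differences. The sketch below assumes hypothetical values for $\alpha$, $\beta$, $\omega$ and a sample time, builds $B_t$ as a nested list, and compares $\operatorname{tr}(\dot B_tB_t^{-1})$ and the rate of change of $\log\det B_t$ with $\alpha-\beta$; the helper functions are illustrative and specific to $2\times2$ matrices.

```python
import math

alpha, beta, omega = 0.7, 0.2, 1.3  # hypothetical constants


def B(t):
    """B_t = R_{omega t} diag(e^{alpha t}, e^{-beta t}) as a nested list."""
    c, s = math.cos(omega * t), math.sin(omega * t)
    ea, eb = math.exp(alpha * t), math.exp(-beta * t)
    return [[c * ea, -s * eb], [s * ea, c * eb]]


def det(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def inverse(M):
    d = det(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]


def matmul(A, C):
    return [
        [sum(A[i][k] * C[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


t, h = 0.4, 1e-5
B_plus, B_minus = B(t + h), B(t - h)

# Central difference approximation of the time derivative of B_t.
B_dot = [
    [(B_plus[i][j] - B_minus[i][j]) / (2 * h) for j in range(2)]
    for i in range(2)
]

# Linear part of the Eulerian velocity and its trace (the divergence).
velocity_gradient = matmul(B_dot, inverse(B(t)))
divergence = velocity_gradient[0][0] + velocity_gradient[1][1]

# Rate of change of log det B_t, also by central differences.
log_det_rate = (math.log(det(B_plus)) - math.log(det(B_minus))) / (2 * h)

print(divergence, log_det_rate, alpha - beta)
```

Changing `omega` leaves both computed quantities unchanged up to discretisation error, which reflects the fact that the rotation contributes no divergence.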
## The Benamou-Brenier Variational Formula
The static definition of $W_2$ minimises squared distance over couplings. The dynamic question asks for the least kinetic action among all conservative evolutions from $\mu_0$ to $\mu_1$, and the central result says that these two minimisation problems have the same value.
[definition: Admissible Dynamic Plan]
Let $\mu_0,\mu_1\in\mathcal P_2(\mathbb R^n)$. An admissible dynamic plan from $\mu_0$ to $\mu_1$ is a pair $(\mu,v)$, where $\mu:[0,1]\to\mathcal P_2(\mathbb R^n)$ is a narrowly continuous curve and $v=(v_t)_{t\in[0,1]}$ is a time-indexed Borel vector field $v_t:\mathbb R^n\to\mathbb R^n$ defined for a.e. $t$, such that $(\mu_t,v_t)$ solves the continuity equation, $\mu_0$ and $\mu_1$ are the endpoint values of the curve, and $\mathcal A(\mu,v)<\infty$.
[/definition]
The endpoints turn the continuity equation into a transport problem. Among all admissible ways of moving the mass, the quadratic action selects constant-speed geodesics.
[quotetheorem:9556]
[citeproof:9556]
This formula is the entry point to the differential calculus of $W_2$, and its hypotheses delimit the quadratic theory. The finite second moment hypothesis is exactly what makes both sides finite-valued: a measure on $\mathbb R$ with tail density proportional to $(1+|x|)^{-3}$ has finite total mass but infinite second moment, so its quadratic transport cost to $\delta_0$ is infinite. The continuity equation constraint is also essential: without it, choosing $v_t=0$ for arbitrary unrelated endpoints would give zero action. The endpoint conditions matter as well; dropping one endpoint would allow the constant curve at the other endpoint to have zero action.
The attainment statement is part of the theorem, not a formal consequence of writing an infimum. It uses compactness of probability measures with controlled second moments and lower semicontinuity of the quadratic action. Minimizers need not be unique: if the optimal static coupling is not unique, or if the velocity is changed on a $\mu_t\,dt$-null set, the dynamic minimizer is not unique as an Eulerian representative. The formula turns an optimal transport distance into an action minimisation problem, so geodesics can be studied through Euler-Lagrange ideas.
[example: One-Dimensional Monotone Interpolation]
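Let $\mu_0,\mu_1\in\mathcal P_2(\mathbb R)$ with $\mu_0$ atomless, and let $T=F_{\mu_1}^{-1}\circ F_{\mu_0}$ be the monotone rearrangement, where $F_{\mu_0}$ is the distribution function of $\mu_0$ and $F_{\mu_1}^{-1}$ is the quantile function of $\mu_1$. Set $S_t=(1-t)\operatorname{id}+tT$, so $\mu_t=(S_t)_\#\mu_0$. By the one-dimensional monotone rearrangement theorem, $T$ is optimal and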
\begin{align*}
W_2^2(\mu_0,\mu_1)=\int_{\mathbb R}|T(x)-x|^2\,d\mu_0(x).
\end{align*}
For the particle starting at $x$, the path is $\gamma_x(t)=S_t(x)=(1-t)x+tT(x)$. Differentiating this scalar affine function of $t$ gives
\begin{align*}
\dot\gamma_x(t)=-x+T(x)=T(x)-x.
\end{align*}
The associated path measure is $\eta=(x\mapsto \gamma_x)_\#\mu_0$, and its time-$t$ marginal is $\mu_t$ because $e_t(\gamma_x)=S_t(x)$.
For $0\le t<1$, the map $S_t$ is strictly increasing: if $x_1<x_2$, then monotonicity of $T$ gives
\begin{align*}
S_t(x_2)-S_t(x_1)=(1-t)(x_2-x_1)+t(T(x_2)-T(x_1))\ge (1-t)(x_2-x_1)>0.
\end{align*}
Thus the Eulerian velocity at $z=S_t(x)$ is well defined by
\begin{align*}
v_t(z)=T(x)-x.
\end{align*}
Using the pushforward identity for $\mu_t=(S_t)_\#\mu_0$, the kinetic action is
\begin{align*}
\int_0^1\int_{\mathbb R}|v_t(z)|^2\,d\mu_t(z)\,dt=\int_0^1\int_{\mathbb R}|T(x)-x|^2\,d\mu_0(x)\,dt.
\end{align*}
Since the inner integral is independent of $t$,
\begin{align*}
\int_0^1\int_{\mathbb R}|T(x)-x|^2\,d\mu_0(x)\,dt=\left(\int_0^1 1\,dt\right)\int_{\mathbb R}|T(x)-x|^2\,d\mu_0(x)=W_2^2(\mu_0,\mu_1).
\end{align*}
So in one dimension the Benamou--Brenier action is realised by moving each quantile at constant speed from $x$ to its monotone rearranged position $T(x)$.
[/example]
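The identity between kinetic action and $W_2^2$ can be checked numerically in a discrete stand-in for this example. The sketch below assumes uniform empirical measures on $\mathbb R$ with equally many atoms, for which sorted matching plays the role of the monotone map $T$. The function names `w2_squared`, `interpolate`, and `kinetic_action`, and the sample points, are illustrative choices rather than part of the course text.

```python
def w2_squared(xs, ys):
    """Squared W_2 distance between two uniform empirical measures on R.

    Assumes len(xs) == len(ys); sorted matching is the monotone rearrangement.
    """
    xs, ys = sorted(xs), sorted(ys)
    return sum((y - x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def interpolate(xs, ys, t):
    """Atoms of mu_t = (S_t)_# mu_0 with S_t = (1 - t) id + t T."""
    xs, ys = sorted(xs), sorted(ys)
    return [(1.0 - t) * x + t * y for x, y in zip(xs, ys)]


def kinetic_action(xs, ys, steps=100):
    """Riemann sum in time of the mean squared particle velocity along S_t."""
    dt = 1.0 / steps
    total = 0.0
    for k in range(steps):
        now = interpolate(xs, ys, k * dt)
        nxt = interpolate(xs, ys, (k + 1) * dt)
        # Finite-difference velocity of each particle over one time step.
        energy = sum(((b - a) / dt) ** 2 for a, b in zip(now, nxt)) / len(now)
        total += energy * dt
    return total


# Hypothetical sample data.
source = [0.0, 1.0, 3.0]
target = [2.0, 2.5, 7.0]
print("W_2^2          :", w2_squared(source, target))
print("kinetic action :", kinetic_action(source, target))
```

Because every particle moves on a straight line at constant speed, the time discretisation introduces no error beyond floating-point rounding, and the two printed numbers should agree.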
## Geodesics and Velocity Potentials
The Benamou-Brenier minimiser is not an arbitrary velocity field. Since the cost is quadratic and the constraint is a continuity equation, the optimal velocity is a gradient field, and along smooth portions of a geodesic it is governed by a Hamilton-Jacobi equation.
[definition: Wasserstein Geodesic]
Let $\mu_0,\mu_1\in\mathcal P_2(\mathbb R^n)$. A curve $\mu:[0,1]\to\mathcal P_2(\mathbb R^n)$, written $\mu(t)=\mu_t$, is a constant-speed Wasserstein geodesic from $\mu_0$ to $\mu_1$ if $\mu_{t=0}=\mu_0$, $\mu_{t=1}=\mu_1$, and
\begin{align*}
W_2(\mu_s,\mu_t)=|t-s|W_2(\mu_0,\mu_1)
\end{align*}
for all $s,t\in[0,1]$.
[/definition]
Geodesics are exactly the minimisers of the Benamou-Brenier action, and they spend that action evenly in time. A velocity field can contain a divergence-free component that circulates mass without changing the density, and such a component increases the quadratic action without helping the endpoints. This is the obstruction behind the gradient condition: optimal transport should keep only the part of the velocity that changes the measure in the required direction. To recognise minimising velocity fields in computations, we need a differential condition that rules out rotational components.
[quotetheorem:9560]
[citeproof:9560]
The theorem is a smooth necessary-and-sufficient calculation with a global dual condition, not a general regularity theorem for all geodesics. Smoothness is used to justify integration by parts and pointwise Euler-Lagrange equations; the geodesic between two separated Dirac masses is a moving Dirac mass, so no smooth density formulation can apply. Positivity of $\rho_t$ is also needed: if a smooth density vanishes on an open vacuum region, variations of $\rho$ there cannot determine the Hamilton-Jacobi equation from the action. Constant kinetic energy rules out reparametrising the same spatial path with a non-uniform clock: for instance, replacing a geodesic $\bar\mu_t$ by $\bar\mu_{\alpha(t)}$ with nonlinear increasing $\alpha$ changes the speed profile even when the image of the curve is the same.
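The cost of a non-uniform clock can be made explicit. As an illustrative choice, take $\alpha(t)=t^2$ on $[0,1]$. Since $W_2(\bar\mu_{\alpha(s)},\bar\mu_{\alpha(t)})=|\alpha(t)-\alpha(s)|\,W_2(\mu_0,\mu_1)$, the reparametrised curve has metric speed $\alpha'(t)W_2(\mu_0,\mu_1)$, and its action is
\begin{align*}
\int_0^1\alpha'(t)^2\,dt\;W_2^2(\mu_0,\mu_1)=\int_0^1 4t^2\,dt\;W_2^2(\mu_0,\mu_1)=\frac43 W_2^2(\mu_0,\mu_1).
\end{align*}
More generally, the Cauchy-Schwarz inequality gives
\begin{align*}
\int_0^1\alpha'(t)^2\,dt\ge\left(\int_0^1\alpha'(t)\,dt\right)^2=1
\end{align*}
for every increasing $\alpha$ with $\alpha(0)=0$ and $\alpha(1)=1$, with equality only when $\alpha'\equiv1$, so the uniform clock is the only one that attains the minimal action.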
The converse would be false without the dual admissibility hypothesis. A smooth gradient field may solve the continuity equation and the local Hamilton-Jacobi equation for a short time while generating crossing characteristics or a non-optimal endpoint map; local potential flow alone does not certify global optimal transport. Chapter 2 makes this tangent-space picture the basis of Otto calculus and gradient flows, where the gradient condition is kept but the global optimality statement is used with care.
[example: Gaussian-to-Gaussian Geodesics]
Let $\mu_0=\mathcal N(m_0,\Sigma_0)$ and $\mu_1=\mathcal N(m_1,\Sigma_1)$, where $\Sigma_0$ and $\Sigma_1$ are positive definite. The Gaussian optimal transport map for the quadratic cost is
\begin{align*}
T(x)=m_1+A(x-m_0),\qquad A=\Sigma_0^{-1/2}(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})^{1/2}\Sigma_0^{-1/2}.
\end{align*}
This matrix $A$ is symmetric positive definite. It also sends the covariance $\Sigma_0$ to $\Sigma_1$: in the product $A\Sigma_0A$ the middle factor $\Sigma_0^{-1/2}\Sigma_0\Sigma_0^{-1/2}$ equals $I$, so
\begin{align*}
A\Sigma_0A=\Sigma_0^{-1/2}(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})^{1/2}(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})^{1/2}\Sigma_0^{-1/2}.
\end{align*}
Multiplying the two square-root factors gives
\begin{align*}
A\Sigma_0A=\Sigma_0^{-1/2}(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})\Sigma_0^{-1/2}=\Sigma_1.
\end{align*}
Define the interpolating maps
\begin{align*}
S_t(x)=(1-t)x+tT(x).
\end{align*}
Substituting the formula for $T$ gives
\begin{align*}
S_t(x)=(1-t)x+t m_1+tA(x-m_0).
\end{align*}
Writing $m_t=(1-t)m_0+tm_1$ and $B_t=(1-t)I+tA$, this becomes
\begin{align*}
S_t(x)=m_t+B_t(x-m_0).
\end{align*}
Since every eigenvalue $\lambda$ of $A$ is positive, the corresponding eigenvalue $(1-t)+t\lambda$ of $B_t$ is positive, so $B_t$ is invertible for $0\le t\le1$.
If $X\sim\mathcal N(m_0,\Sigma_0)$ and $Z_t=S_t(X)$, then
\begin{align*}
\mathbb E[Z_t]=m_t+B_t(\mathbb E[X]-m_0)=m_t.
\end{align*}
The covariance is
\begin{align*}
\operatorname{Cov}(Z_t)=\mathbb E[(Z_t-m_t)(Z_t-m_t)^T]=\mathbb E[B_t(X-m_0)(X-m_0)^TB_t^T]=B_t\Sigma_0B_t^T.
\end{align*}
Because $B_t$ is symmetric, $B_t^T=B_t$, hence
\begin{align*}
\Sigma_t=B_t\Sigma_0B_t=((1-t)I+tA)\Sigma_0((1-t)I+tA).
\end{align*}
Thus $\mu_t=(S_t)_\#\mu_0=\mathcal N(m_t,\Sigma_t)$.
The particle velocity is obtained by differentiating $S_t(x)=m_t+B_t(x-m_0)$ in $t$:
\begin{align*}
\frac{d}{dt}S_t(x)=m_1-m_0+(A-I)(x-m_0).
\end{align*}
At the Eulerian point $z=S_t(x)$, the identity $z-m_t=B_t(x-m_0)$ gives
\begin{align*}
x-m_0=B_t^{-1}(z-m_t).
\end{align*}
Therefore the velocity field is
\begin{align*}
v_t(z)=m_1-m_0+(A-I)B_t^{-1}(z-m_t).
\end{align*}
Since $B_t=(1-t)I+tA$ is a polynomial in $A$, the matrices $A-I$ and $B_t^{-1}$ commute. They are symmetric, so
\begin{align*}
M_t:=(A-I)B_t^{-1}
\end{align*}
is symmetric. Hence
\begin{align*}
v_t(z)=m_1-m_0+M_t(z-m_t)=\nabla\left((m_1-m_0)\cdot z+\frac12(z-m_t)^TM_t(z-m_t)\right).
\end{align*}
The Gaussian geodesic is therefore generated by affine velocity fields, and those affine fields are gradients of quadratic potentials, exactly the structure predicted by the velocity-potential characterization.
[/example]
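The matrix identities in this example can be verified numerically. The sketch below is restricted to $\mathbb R^2$, which is an assumption made only so that the matrix square root has a closed form and no external library is needed. The helper names and the two sample covariances are hypothetical.

```python
import math


def matmul(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(p):
    """Inverse of an invertible 2x2 matrix."""
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    return [[p[1][1] / det, -p[0][1] / det],
            [-p[1][0] / det, p[0][0] / det]]


def sqrt_spd(p):
    """Square root of a 2x2 symmetric positive definite matrix.

    Uses sqrt(P) = (P + s I) / tau with s = sqrt(det P) and
    tau = sqrt(trace P + 2 s), a 2x2 consequence of Cayley-Hamilton.
    """
    s = math.sqrt(p[0][0] * p[1][1] - p[0][1] * p[1][0])
    tau = math.sqrt(p[0][0] + p[1][1] + 2.0 * s)
    return [[(p[0][0] + s) / tau, p[0][1] / tau],
            [p[1][0] / tau, (p[1][1] + s) / tau]]


def transport_matrix(sigma0, sigma1):
    """A = S0^{-1/2} (S0^{1/2} S1 S0^{1/2})^{1/2} S0^{-1/2}."""
    root = sqrt_spd(sigma0)
    root_inv = inverse(root)
    middle = sqrt_spd(matmul(matmul(root, sigma1), root))
    return matmul(matmul(root_inv, middle), root_inv)


def interpolation_matrix(a, t):
    """B_t = (1 - t) I + t A."""
    return [[(1.0 - t) * (1.0 if i == j else 0.0) + t * a[i][j]
             for j in range(2)] for i in range(2)]


def covariance_at(sigma0, a, t):
    """Sigma_t = B_t Sigma_0 B_t, using the symmetry of B_t."""
    b = interpolation_matrix(a, t)
    return matmul(matmul(b, sigma0), b)


def velocity_matrix(a, t):
    """M_t = (A - I) B_t^{-1}, the linear part of the velocity field."""
    a_minus_i = [[a[i][j] - (1.0 if i == j else 0.0) for j in range(2)]
                 for i in range(2)]
    return matmul(a_minus_i, inverse(interpolation_matrix(a, t)))


# Hypothetical positive definite covariances.
sigma0 = [[2.0, 0.5], [0.5, 1.0]]
sigma1 = [[1.0, -0.3], [-0.3, 3.0]]
A = transport_matrix(sigma0, sigma1)
print("A Sigma_0 A :", matmul(matmul(A, sigma0), A))
print("Sigma_{1/2} :", covariance_at(sigma0, A, 0.5))
print("M_{1/2}     :", velocity_matrix(A, 0.5))
```

Up to rounding, the first printed matrix should reproduce `sigma1`, and the last should be symmetric, which is the numerical counterpart of the statement that the velocity field is a gradient.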
The Gaussian calculation shows that the abstract potential condition can be concrete: quadratic potentials generate affine maps, and affine maps preserve the Gaussian family. The chapter can now close by recording the three equivalent viewpoints that will be used repeatedly.
[remark: What This Chapter Establishes]
The static coupling formulation, the dynamic continuity-equation formulation, and the pathwise superposition formulation all compute the same quadratic transport geometry. The next chapter uses this equivalence to treat $W_2$ as a formal Riemannian space whose tangent vectors are gradient fields and whose energy functionals generate PDEs.
[/remark]
# 2. Calculus on $W_2$ and Otto's Formal Riemannian Picture
The dynamic formulation of $W_2$ from the previous chapter turns curves of measures into solutions of the continuity equation. This chapter asks what differential calculus looks like on the resulting space: what should a tangent vector be, how should we measure its length, and how do functionals such as entropy or interaction energy generate evolution equations? Otto's insight is that, at a smooth positive density, the tangent directions can be represented by gradient vector fields and the resulting formal Riemannian calculus recovers many important PDEs as gradient flows.
## Tangent Vectors and the Wasserstein Inner Product
A curve $(\rho_t)_{t\in(-\varepsilon,\varepsilon)}$ in $\mathcal P_2(\mathbb R^n)$ cannot be differentiated by subtracting measures pointwise. The right question is: which velocity fields $v_t$ can represent the infinitesimal motion of mass, and which representative has the minimal kinetic cost? We first restrict attention to smooth positive densities so that this question can be answered by calculus rather than by metric relaxation.
[definition: Smooth Positive Density Manifold]
Let $\mathcal P_2^\infty(\mathbb R^n)$ denote the class of probability measures $\rho\,d\mathcal L^n$ such that $\rho\in C^\infty(\mathbb R^n)$, $\rho>0$, $\int_{\mathbb R^n}\rho\,d\mathcal L^n=1$, and $\int_{\mathbb R^n}|x|^2\rho(x)\,d\mathcal L^n(x)<\infty$.
[/definition]
This is a formal model space rather than a complete manifold. As Chapter 1 showed, the Benamou--Brenier formula identifies metric motion with continuity-equation motion; here that identification is restricted to smooth positive densities so it can be differentiated and integrated by parts without first handling all metric-measure technicalities. The next issue is to describe an infinitesimal change of density in a way that still records the velocity field transporting the mass.
[definition: Tangent Vector at a Density]
For $\rho\,d\mathcal L^n\in\mathcal P_2^\infty(\mathbb R^n)$, a tangent vector is a distribution $s$ of total mass zero for which there exists a potential $\phi$ with
\begin{align*}
s=-\nabla\cdot(\rho\nabla\phi).
\end{align*}
[/definition]
The potential is unique only up to additive constants, but its gradient is the meaningful velocity field. The total mass condition appears because differentiating $\int\rho_t\,d\mathcal L^n=1$ gives $\int s\,d\mathcal L^n=0$. This motivates recording tangent directions by the gradient velocity itself, since the metric cost depends on velocity and not only on the distributional derivative.
[definition: Otto Tangent Space]
At $\rho\,d\mathcal L^n\in\mathcal P_2^\infty(\mathbb R^n)$, the Otto tangent space is formally
\begin{align*}
T_\rho\mathcal P_2 := \overline{\{\nabla\phi:\phi\in C_c^\infty(\mathbb R^n)\}}^{L^2(\rho)}.
\end{align*}
[/definition]
The tangent vector may be recorded either as the velocity $v=\nabla\phi$ or as the density variation $s=-\nabla\cdot(\rho v)$. The velocity notation remembers the metric cost; the density-variation notation remembers how the measure itself changes. To turn this representation into a Riemannian picture, we need the inner product whose norm reproduces the kinetic action.
[definition: Wasserstein Inner Product]
For tangent velocities $v=\nabla\phi$ and $w=\nabla\psi$ at $\rho$, define
\begin{align*}
(v,w)_{T_\rho\mathcal P_2}:=\int_{\mathbb R^n}\nabla\phi(x)\cdot\nabla\psi(x)\rho(x)\,d\mathcal L^n(x).
\end{align*}
[/definition]
The squared norm induced by this inner product is exactly the kinetic energy $\int_{\mathbb R^n}|v|^2\rho\,d\mathcal L^n$ from the dynamic formulation. It turns the minimising velocity field in the continuity equation into the analogue of the derivative of a curve. The next theorem justifies the choice by connecting it back to the metric derivative from $W_2$.
[quotetheorem:9561]
[citeproof:9561]
The theorem explains why gradients, rather than arbitrary vector fields, are built into the tangent space. It also gives a practical recipe: solve the weighted Poisson equation $-\nabla\cdot(\rho\nabla\phi)=s$ to convert a density derivative $s$ into a tangent velocity. The smooth-minimiser caveat is essential: outside this formal positive-density class the weighted Helmholtz projection may exist only as an $L^2(\rho)$ object, and the theorem is not claiming a classical potential or pointwise velocity field.
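In one dimension the weighted Poisson recipe reduces to a single integration: $-(\rho\phi')'=s$ gives $\rho\phi'(x)=-\int_{-\infty}^x s$, so the tangent velocity is $v=\phi'=-\big(\int_{-\infty}^x s\big)/\rho$. The sketch below applies this to a standard Gaussian density and the density variation $s=-a\rho'$ of a rigid shift with speed $a$, for which the recovered velocity should be the constant $a$. The grid, the trapezoid rule, the value of `SHIFT_SPEED`, and the function names are assumptions made for illustration.

```python
import math

SHIFT_SPEED = 1.5  # hypothetical translation speed a


def gaussian(x):
    """Standard Gaussian density rho."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def shift_variation(x):
    """s = -a * rho'(x) = a * x * rho(x), the variation of a rigid shift."""
    return SHIFT_SPEED * x * gaussian(x)


def tangent_velocity_1d(rho, s, grid):
    """Solve -(rho * phi')' = s on a uniform grid and return v = phi'.

    Integrates s with the trapezoid rule, assuming the flux rho * phi'
    is negligible at the left end of the grid.
    """
    h = grid[1] - grid[0]
    cumulative = 0.0
    velocities = []
    for i, x in enumerate(grid):
        if i > 0:
            cumulative += 0.5 * h * (s(grid[i - 1]) + s(x))
        velocities.append(-cumulative / rho(x))
    return velocities


grid = [-8.0 + 0.01 * i for i in range(1601)]
velocity = tangent_velocity_1d(gaussian, shift_variation, grid)
for x, v in zip(grid, velocity):
    if abs(x) <= 2.0 and abs(round(x) - x) < 1e-9:
        print(f"x = {x:5.2f}   recovered velocity = {v:.6f}")
```

The division by $\rho$ amplifies quadrature error where the density is tiny, so the comparison is only meaningful in the bulk of the distribution. This mirrors the caveat above that outside the positive smooth class the velocity is an $L^2(\rho)$ object.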
[example: Translating a Density]
Let $\rho_0\in\mathcal P_2^\infty(\mathbb R^n)$, fix $a=(a_1,\dots,a_n)\in\mathbb R^n$, and define $\rho_t(x)=\rho_0(x-ta)$. We verify that this curve is represented by the constant velocity field $a$ and compute its Wasserstein speed.
For each coordinate $x_i$, the chain rule gives
\begin{align*}
\partial_{x_i}\rho_t(x)=\partial_{x_i}\rho_0(x-ta)=(\partial_i\rho_0)(x-ta).
\end{align*}
Differentiating with respect to $t$ and using $\partial_t(x-ta)=-a$ gives
\begin{align*}
\partial_t\rho_t(x)=\nabla\rho_0(x-ta)\cdot(-a).
\end{align*}
Since $\nabla\rho_t(x)=\nabla\rho_0(x-ta)$ by the coordinate computation above,
\begin{align*}
\partial_t\rho_t(x)=-a\cdot\nabla\rho_t(x).
\end{align*}
Because $a_i$ is constant in $x$ for every $i$,
\begin{align*}
\nabla\cdot(\rho_t a)=\sum_{i=1}^n\partial_{x_i}(\rho_t a_i)=\sum_{i=1}^n a_i\partial_{x_i}\rho_t.
\end{align*}
The last sum is the dot product $a\cdot\nabla\rho_t$, so
\begin{align*}
\nabla\cdot(\rho_t a)=a\cdot\nabla\rho_t.
\end{align*}
Therefore
\begin{align*}
\partial_t\rho_t+\nabla\cdot(\rho_t a)=-a\cdot\nabla\rho_t+a\cdot\nabla\rho_t=0.
\end{align*}
The velocity $a$ is an Otto tangent velocity because it is a gradient field. Indeed,
\begin{align*}
a\cdot x=\sum_{i=1}^n a_i x_i.
\end{align*}
For each coordinate,
\begin{align*}
\partial_{x_j}(a\cdot x)=\partial_{x_j}\left(\sum_{i=1}^n a_i x_i\right)=a_j.
\end{align*}
Hence
\begin{align*}
\nabla(a\cdot x)=a.
\end{align*}
By *Benamou Brenier Metric Tensor*, the squared speed represented by this gradient velocity is
\begin{align*}
\int_{\mathbb R^n}|a|^2\rho_t(x)\,d\mathcal L^n(x).
\end{align*}
Since $|a|^2$ is independent of $x$,
\begin{align*}
\int_{\mathbb R^n}|a|^2\rho_t(x)\,d\mathcal L^n(x)=|a|^2\int_{\mathbb R^n}\rho_0(x-ta)\,d\mathcal L^n(x).
\end{align*}
Use the translation change of variables $y=x-ta$, so $x=y+ta$ and $d\mathcal L^n(x)=d\mathcal L^n(y)$. Then
\begin{align*}
\int_{\mathbb R^n}\rho_0(x-ta)\,d\mathcal L^n(x)=\int_{\mathbb R^n}\rho_0(y)\,d\mathcal L^n(y).
\end{align*}
Because $\rho_0$ is a probability density,
\begin{align*}
\int_{\mathbb R^n}\rho_0(y)\,d\mathcal L^n(y)=1.
\end{align*}
Thus the squared Wasserstein speed is $|a|^2\cdot 1=|a|^2$, and the Wasserstein speed of the translating density is $|a|$. This matches the Euclidean fact that translating every particle with constant velocity $a$ has speed $|a|$.
[/example]
This example is the flat-space model inside the formal calculus: translating all mass by $a$ has the same squared speed as the Euclidean vector $a$. More general motions differ because the density weights the cost of velocity at each spatial point.
## First Variations and Wasserstein Gradients
Once tangent vectors have an inner product, the next question is how a functional $\mathcal F[\rho]$ differentiates along a continuity equation. The answer should identify the vector field $\operatorname{grad}_{W_2}\mathcal F(\rho)$ whose inner product with every tangent velocity gives the first derivative of $\mathcal F$. We begin with the ordinary variational derivative with respect to density.
[definition: First Variation]
Let $\mathcal F:\mathcal P_2^\infty(\mathbb R^n)\to\mathbb R$ be a functional on smooth positive densities. A function $\frac{\delta\mathcal F}{\delta\rho}$ is a first variation of $\mathcal F$ at $\rho$ if, for every smooth perturbation $\sigma$ with $\int\sigma\,d\mathcal L^n=0$,
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal F[\rho+\varepsilon\sigma]
=\int_{\mathbb R^n}\frac{\delta\mathcal F}{\delta\rho}(x)\sigma(x)\,d\mathcal L^n(x).
\end{align*}
[/definition]
The first variation is determined only up to an additive constant, because admissible perturbations have zero total mass. Its gradient is therefore the invariant object in the Wasserstein metric. We need a theorem that converts this invariant first variation into the Riemannian gradient used in evolution equations.
[quotetheorem:9562]
[citeproof:9562]
The formula converts variational derivatives into PDEs. The sign convention is important: steepest descent moves with velocity $v_t=-\nabla(\delta\mathcal F/\delta\rho)$, while the continuity equation writes this as a positive divergence on the right-hand side. To use the formula in applications, we need first variations for the standard energy families.
[quotetheorem:9563]
[citeproof:9563]
These three families cover most PDEs encountered in the applications part of the course. Internal energy creates diffusion, potential energy creates drift, and interaction energy creates nonlocal aggregation or repulsion.
[example: Entropy Gradient Is Heat Flow]
Take a smooth positive density $\rho$ and the Boltzmann entropy
\begin{align*}
\mathcal H[\rho]=\int_{\mathbb R^n}\rho(x)\log\rho(x)\,d\mathcal L^n(x).
\end{align*}
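For a smooth compactly supported perturbation $\sigma$ with $\int_{\mathbb R^n}\sigma\,d\mathcal L^n=0$, choose $|\varepsilon|$ small enough that $\rho+\varepsilon\sigma>0$; this is possible because $\rho$ is continuous and positive on the support of $\sigma$. Then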
\begin{align*}
\mathcal H[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}(\rho(x)+\varepsilon\sigma(x))\log(\rho(x)+\varepsilon\sigma(x))\,d\mathcal L^n(x).
\end{align*}
For the scalar function $f(r)=r\log r$ on $(0,\infty)$,
\begin{align*}
f'(r)=1\cdot\log r+r\cdot\frac1r=\log r+1.
\end{align*}
Applying the chain rule pointwise gives
\begin{align*}
\frac{d}{d\varepsilon}f(\rho(x)+\varepsilon\sigma(x))=(\log(\rho(x)+\varepsilon\sigma(x))+1)\sigma(x).
\end{align*}
Evaluating at $\varepsilon=0$ and differentiating under the integral sign,
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal H[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}(\log\rho(x)+1)\sigma(x)\,d\mathcal L^n(x).
\end{align*}
By the definition of first variation, one representative is therefore
\begin{align*}
\frac{\delta\mathcal H}{\delta\rho}=\log\rho+1.
\end{align*}
Using *Otto Gradient Formula*, the Wasserstein gradient flow satisfies
\begin{align*}
\partial_t\rho_t=\nabla\cdot\left(\rho_t\nabla(\log\rho_t+1)\right).
\end{align*}
Because the gradient of a constant is zero,
\begin{align*}
\nabla(\log\rho_t+1)=\nabla\log\rho_t.
\end{align*}
Since $\rho_t>0$, the ordinary chain rule for $\log$ gives
\begin{align*}
\nabla\log\rho_t=\frac{1}{\rho_t}\nabla\rho_t.
\end{align*}
Hence
\begin{align*}
\rho_t\nabla(\log\rho_t+1)=\rho_t\frac{1}{\rho_t}\nabla\rho_t.
\end{align*}
The factors $\rho_t$ and $1/\rho_t$ cancel pointwise, so
\begin{align*}
\rho_t\nabla(\log\rho_t+1)=\nabla\rho_t.
\end{align*}
Substituting this expression into the gradient flow equation gives
\begin{align*}
\partial_t\rho_t=\nabla\cdot(\nabla\rho_t).
\end{align*}
By the definition of the Laplacian,
\begin{align*}
\nabla\cdot(\nabla\rho_t)=\sum_{i=1}^n\partial_{x_i}^2\rho_t=\Delta\rho_t.
\end{align*}
Therefore
\begin{align*}
\partial_t\rho_t=\Delta\rho_t.
\end{align*}
The heat equation is the steepest descent equation for Boltzmann entropy in the $W_2$ geometry.
[/example]
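The conclusion can be tested on the one-dimensional heat kernel $\rho_t(x)=(4\pi t)^{-1/2}e^{-x^2/(4t)}$, which solves $\partial_t\rho_t=\partial_{xx}\rho_t$. The sketch below evaluates the right-hand side in the gradient-flow form $\partial_x\big(\rho_t\,\partial_x(\log\rho_t+1)\big)$ by central differences and compares it with a finite-difference time derivative. The step size and the function names are illustrative assumptions.

```python
import math


def heat_kernel(t, x):
    """One-dimensional heat kernel at time t > 0."""
    return math.exp(-x * x / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)


def entropy_flux(t, x, h=1e-3):
    """rho * d/dx (log rho + 1), with the derivative by central difference."""
    def first_variation(y):
        return math.log(heat_kernel(t, y)) + 1.0

    slope = (first_variation(x + h) - first_variation(x - h)) / (2.0 * h)
    return heat_kernel(t, x) * slope


def gradient_flow_rhs(t, x, h=1e-3):
    """d/dx of the entropy flux: the right-hand side of the gradient flow."""
    return (entropy_flux(t, x + h) - entropy_flux(t, x - h)) / (2.0 * h)


def time_derivative(t, x, h=1e-3):
    """Central difference approximation of d/dt rho_t(x)."""
    return (heat_kernel(t + h, x) - heat_kernel(t - h, x)) / (2.0 * h)


for t, x in [(0.5, 0.7), (1.0, -1.2), (2.0, 0.0)]:
    print(f"t = {t}, x = {x}: "
          f"d/dt rho = {time_derivative(t, x):.6f}, "
          f"flow rhs = {gradient_flow_rhs(t, x):.6f}")
```

The two columns should agree up to finite-difference error, which is a numerical restatement of the cancellation between $\rho_t$ and $1/\rho_t$ in the computation above.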
The computation shows why diffusion is linear in the heat equation even though the metric is nonlinear in $\rho$. The density factor in the Wasserstein inner product cancels the reciprocal density coming from $\nabla\log\rho$.
[example: Potential Energy and Force Fields]
For $\mathcal V[\rho]=\int_{\mathbb R^n}V(x)\rho(x)\,d\mathcal L^n(x)$ and a smooth perturbation $\sigma$ with $\int_{\mathbb R^n}\sigma\,d\mathcal L^n=0$, we compute the first variation and then the induced Wasserstein descent equation. By linearity of the integral,
\begin{align*}
\mathcal V[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}V(x)(\rho(x)+\varepsilon\sigma(x))\,d\mathcal L^n(x).
\end{align*}
Expanding the product inside the integral gives
\begin{align*}
V(x)(\rho(x)+\varepsilon\sigma(x))=V(x)\rho(x)+\varepsilon V(x)\sigma(x).
\end{align*}
Therefore
\begin{align*}
\mathcal V[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}V(x)\rho(x)\,d\mathcal L^n(x)+\varepsilon\int_{\mathbb R^n}V(x)\sigma(x)\,d\mathcal L^n(x).
\end{align*}
Differentiating the right-hand side with respect to $\varepsilon$ gives
\begin{align*}
\frac{d}{d\varepsilon}\mathcal V[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}V(x)\sigma(x)\,d\mathcal L^n(x).
\end{align*}
Evaluating at $\varepsilon=0$ leaves the same expression:
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal V[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}V(x)\sigma(x)\,d\mathcal L^n(x).
\end{align*}
By the definition of first variation, this means
\begin{align*}
\frac{\delta\mathcal V}{\delta\rho}=V,
\end{align*}
in agreement with *First Variations of Standard Energies*.
Using *Otto Gradient Formula*, the Wasserstein gradient is
\begin{align*}
\operatorname{grad}_{W_2}\mathcal V(\rho)=\nabla V.
\end{align*}
Steepest descent therefore uses the velocity
\begin{align*}
v_t=-\nabla V.
\end{align*}
Substituting this velocity into the continuity equation gives
\begin{align*}
\partial_t\rho_t+\nabla\cdot(\rho_t(-\nabla V))=0.
\end{align*}
Since multiplication by $-1$ factors out of divergence,
\begin{align*}
\nabla\cdot(\rho_t(-\nabla V))=-\nabla\cdot(\rho_t\nabla V).
\end{align*}
Thus the continuity equation becomes
\begin{align*}
\partial_t\rho_t-\nabla\cdot(\rho_t\nabla V)=0.
\end{align*}
Adding $\nabla\cdot(\rho_t\nabla V)$ to both sides gives
\begin{align*}
\partial_t\rho_t=\nabla\cdot(\rho_t\nabla V).
\end{align*}
So the density evolves by transport along the force field $-\nabla V$.
For the quadratic potential
\begin{align*}
V(x)=\frac{|x|^2}{2}=\frac12\sum_{j=1}^n x_j^2,
\end{align*}
the $i$-th coordinate derivative is
\begin{align*}
\partial_{x_i}V(x)=\partial_{x_i}\left(\frac12\sum_{j=1}^n x_j^2\right).
\end{align*}
Using $\partial_{x_i}x_j^2=2x_j\delta_{ij}$, this becomes
\begin{align*}
\partial_{x_i}V(x)=\frac12\sum_{j=1}^n 2x_j\delta_{ij}.
\end{align*}
Only the $j=i$ term remains in the sum, so
\begin{align*}
\partial_{x_i}V(x)=x_i.
\end{align*}
Hence
\begin{align*}
\nabla V(x)=(x_1,\dots,x_n)=x.
\end{align*}
The descent velocity is therefore
\begin{align*}
v_t(x)=-x.
\end{align*}
Substituting $\nabla V(x)=x$ into the density equation gives
\begin{align*}
\partial_t\rho_t=\nabla\cdot(\rho_t x).
\end{align*}
Along a particle path $X_t$, the velocity equation is
\begin{align*}
\dot X_t=v_t(X_t)=-X_t.
\end{align*}
The function $X_t=e^{-t}X_0$ satisfies this ordinary differential equation because
\begin{align*}
\frac{d}{dt}(e^{-t}X_0)=-e^{-t}X_0=-X_t.
\end{align*}
It also satisfies $X_t=X_0$ at $t=0$ since $e^0=1$. Thus particles move exponentially toward the origin, and the PDE $\partial_t\rho_t=\nabla\cdot(\rho_t x)$ is the pure drift part of the Ornstein--Uhlenbeck equation $\partial_t\rho_t=\Delta\rho_t+\nabla\cdot(\rho_t x)$, that is, the equation with the diffusion term $\Delta\rho_t$ removed.
[/example]
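The particle picture for the quadratic potential is easy to simulate. The sketch below integrates $\dot X_t=-X_t$ with explicit Euler steps for a small cloud of equally weighted particles on $\mathbb R$ and compares the result with the exact solution $e^{-t}X_0$. It also tracks the potential energy, which for this flow is multiplied by $e^{-2t}$. The step count, the sample positions, and the function names are assumptions.

```python
import math


def euler_drift(positions, t_final, steps):
    """Explicit Euler for dX/dt = -X, the descent velocity of V = |x|^2 / 2."""
    dt = t_final / steps
    current = list(positions)
    for _ in range(steps):
        current = [x - dt * x for x in current]
    return current


def potential_energy(positions):
    """V[rho] for the uniform empirical measure of the particles."""
    return sum(0.5 * x * x for x in positions) / len(positions)


# Hypothetical initial particle positions.
start = [-2.0, 0.5, 1.0, 3.0]
end = euler_drift(start, t_final=1.0, steps=1000)
exact = [math.exp(-1.0) * x for x in start]

print("Euler :", end)
print("exact :", exact)
print("energy ratio :", potential_energy(end) / potential_energy(start))
print("exp(-2)      :", math.exp(-2.0))
```

Explicit Euler is used only for transparency; its first-order error is visible in the last digits, and refining `steps` should shrink it.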
Potential energy is local in space: each particle reacts only to the gradient of $V$ at its current location. The next example shows the nonlocal alternative, where the motion depends on the whole density through convolution.
[example: Interaction Energy with Newtonian Type Kernels]
Let $W$ be a smooth even regularisation of the Newtonian kernel on $\mathbb R^n$, so $W(z)=W(-z)$, and define
\begin{align*}
\mathcal W[\rho]=\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
For a smooth zero-mass perturbation $\sigma$, the perturbed density gives
\begin{align*}
\mathcal W[\rho+\varepsilon\sigma]=\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)(\rho(x)+\varepsilon\sigma(x))(\rho(y)+\varepsilon\sigma(y))\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
Expanding the product pointwise,
\begin{align*}
(\rho(x)+\varepsilon\sigma(x))(\rho(y)+\varepsilon\sigma(y))=\rho(x)\rho(y)+\varepsilon\sigma(x)\rho(y)+\varepsilon\rho(x)\sigma(y)+\varepsilon^2\sigma(x)\sigma(y).
\end{align*}
Substituting this expansion into $\mathcal W[\rho+\varepsilon\sigma]$ gives
\begin{align*}
\mathcal W[\rho+\varepsilon\sigma]=\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y)+\frac{\varepsilon}{2}\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\sigma(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y)+\frac{\varepsilon}{2}\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y)+\frac{\varepsilon^2}{2}\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\sigma(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
Differentiating this polynomial in $\varepsilon$ and evaluating at $\varepsilon=0$ removes the constant term and the $\varepsilon^2$ term:
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal W[\rho+\varepsilon\sigma]=\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\sigma(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y)+\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
In the second integral, interchange the dummy variables $x$ and $y$:
\begin{align*}
\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y)=\int_{\mathbb R^n}\int_{\mathbb R^n}W(y-x)\rho(y)\sigma(x)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
Since $W$ is even, $W(y-x)=W(-(x-y))=W(x-y)$, so
\begin{align*}
\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y)=\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(y)\sigma(x)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
Thus the two first-order integrals are equal, and the factors $\frac12$ add to $1$:
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal W[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\sigma(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
For each fixed $x$, the convolution is
\begin{align*}
(W*\rho)(x)=\int_{\mathbb R^n}W(x-y)\rho(y)\,d\mathcal L^n(y).
\end{align*}
Substituting this inner integral gives
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal W[\rho+\varepsilon\sigma]=\int_{\mathbb R^n}(W*\rho)(x)\sigma(x)\,d\mathcal L^n(x).
\end{align*}
By the definition of first variation, one representative is therefore
\begin{align*}
\frac{\delta\mathcal W}{\delta\rho}=W*\rho.
\end{align*}
Using *Otto Gradient Formula*, the Wasserstein gradient flow satisfies
\begin{align*}
\partial_t\rho_t=\nabla\cdot\left(\rho_t\nabla(W*\rho_t)\right).
\end{align*}
Equivalently, writing the same evolution in continuity-equation form gives
\begin{align*}
\partial_t\rho_t+\nabla\cdot(\rho_t v_t)=0.
\end{align*}
Comparing the two displayed equations, the descent velocity is
\begin{align*}
v_t(x)=-\nabla(W*\rho_t)(x).
\end{align*}
Because $W$ is smooth and the convolution is differentiable under the integral sign,
\begin{align*}
\nabla(W*\rho_t)(x)=\int_{\mathbb R^n}\nabla_x W(x-y)\rho_t(y)\,d\mathcal L^n(y).
\end{align*}
Thus
\begin{align*}
v_t(x)=-\int_{\mathbb R^n}\nabla W(x-y)\rho_t(y)\,d\mathcal L^n(y).
\end{align*}
The velocity at $x$ is therefore determined by the entire distribution of mass through the integral over $y$, not only by the local value of $\rho_t$ near $x$. For attractive $W$, the vector field $-\nabla(W*\rho_t)$ moves mass toward decreasing values of the interaction potential generated by $\rho_t$.
[/example]
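A particle discretisation makes the nonlocal velocity concrete. The sketch below works on $\mathbb R$ and assumes the kernel $W_\varepsilon(z)=\frac12\sqrt{z^2+\varepsilon^2}$, a smooth even regularisation of the one-dimensional kernel $|z|/2$ (the fundamental solution of $d^2/dx^2$). For the uniform empirical measure of $N$ particles, the velocity formula becomes $v(x_i)=-\frac1N\sum_j W_\varepsilon'(x_i-x_j)$. The value of `EPS`, the sample positions, and the function names are hypothetical.

```python
import math

EPS = 0.1  # hypothetical regularisation scale


def kernel(z):
    """Smooth even kernel W(z) = sqrt(z^2 + EPS^2) / 2."""
    return 0.5 * math.sqrt(z * z + EPS * EPS)


def kernel_gradient(z):
    """W'(z); odd because W is even."""
    return 0.5 * z / math.sqrt(z * z + EPS * EPS)


def interaction_energy(points):
    """(1/2) * double sum of W(x_i - x_j) with weights 1/N each."""
    n = len(points)
    return 0.5 * sum(kernel(x - y) for x in points for y in points) / (n * n)


def descent_velocity(points):
    """Particle form of v = -grad(W * rho): one velocity per particle."""
    n = len(points)
    return [-sum(kernel_gradient(x - y) for y in points) / n for x in points]


def euler_step(points, dt):
    """Move every particle along its descent velocity for time dt."""
    return [x + dt * v for x, v in zip(points, descent_velocity(points))]


# Hypothetical particle positions.
cloud = [-1.0, 0.0, 0.5, 2.0]
print("velocities    :", descent_velocity(cloud))
print("energy before :", interaction_energy(cloud))
print("energy after  :", interaction_energy(euler_step(cloud, 0.01)))
```

Each velocity depends on every other particle, the outermost particles move inward for this attractive kernel, and the velocities sum to zero because $W_\varepsilon'$ is odd, so the centre of mass is preserved.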
The Newtonian case is often singular, so the smooth computation should be read as the formal model. Later chapters use weak formulations and regularisation to decide when the same expression defines a well-posed evolution.
## Hessians, Geodesic Convexity, and Formal Integration by Parts
Gradient flows depend not only on first derivatives but also on convexity along geodesics. The problem is to express second variation in the Wasserstein metric and to recognise when a functional bends upward along displacement interpolations. We first define the Hessian by differentiating twice along a geodesic.
[definition: Wasserstein Hessian Along a Geodesic]
Let $(\rho_t)_{t\in(-\varepsilon,\varepsilon)}$ be a smooth $W_2$ geodesic with velocity potential $\phi_t$. The formal Wasserstein Hessian of $\mathcal F$ at $\rho_0$ in the direction $\nabla\phi_0$ is
\begin{align*}
\operatorname{Hess}_{W_2}\mathcal F(\rho_0)[\nabla\phi_0,\nabla\phi_0]
:=\frac{d^2}{dt^2}\Big|_{t=0}\mathcal F[\rho_t].
\end{align*}
[/definition]
This definition relies on having a geodesic with the chosen initial velocity. In the formal smooth setting, geodesics are governed by a Hamilton--Jacobi equation for the velocity potential. The next theorem gives exactly this equation, so it supplies the computational input needed to turn the Hessian definition into the integration-by-parts formulas used throughout the rest of the chapter.
[quotetheorem:9564]
[citeproof:9564]
The second equation is the inviscid Hamilton--Jacobi equation. Taking its spatial gradient and writing $v_t=\nabla\phi_t$ gives $\partial_t v_t+(v_t\cdot\nabla)v_t=0$, so particles move with constant velocity before shocks or caustics form. This motivates replacing linear convexity by convexity along these geodesic paths.
[definition: Displacement Convexity]
A functional $\mathcal F$ is displacement convex on a class of measures if, for every constant-speed $W_2$ geodesic $(\rho_t)_{t\in[0,1]}$ in that class, the function $t\mapsto\mathcal F[\rho_t]$ is convex on $[0,1]$.
[/definition]
Displacement convexity is ordinary convexity after replacing straight lines of densities by optimal-transport geodesics. It is the geometric condition behind contraction, uniqueness, and energy dissipation for many Wasserstein gradient flows. The next criterion turns this geometric condition into a second-variation inequality.
[quotetheorem:9565]
[citeproof:9565]
This criterion is only as rigorous as the class of geodesics and differentiability allowed, but it gives the calculation that later becomes McCann's displacement convexity theory. The main computational device is formal integration by parts against the continuity and Hamilton--Jacobi equations. The next theorem is the basic model calculation: for the same potential energies introduced in the first-variation section, the Hessian can be read directly from $D^2V$.
[quotetheorem:9566]
[citeproof:9566]
For potential energies, displacement convexity reduces to ordinary convexity of the spatial potential. This is the easy Hessian calculation because the integrand is linear in $\rho$: the geodesic acceleration terms cancel and only the spatial matrix $D^2V$ remains. Internal and interaction energies are subtler, since their second variations also see the density dependence, the dimension, and the way mass is rearranged along the geodesic.
[example: Quadratic Confinement]
Let $V(x)=|x|^2/2=\frac12\sum_{k=1}^n x_k^2$, and let $\mathcal V[\rho]=\int_{\mathbb R^n}V\rho\,d\mathcal L^n$. We compute its Wasserstein Hessian along a smooth $W_2$ geodesic with initial velocity $\nabla\phi$.
For each coordinate $i$,
\begin{align*}
\partial_{x_i}V(x)=\partial_{x_i}\left(\frac12\sum_{k=1}^n x_k^2\right).
\end{align*}
By linearity of $\partial_{x_i}$,
\begin{align*}
\partial_{x_i}V(x)=\frac12\sum_{k=1}^n\partial_{x_i}(x_k^2).
\end{align*}
For each $k$, $\partial_{x_i}(x_k^2)=2x_k\delta_{ik}$, so
\begin{align*}
\partial_{x_i}V(x)=\frac12\sum_{k=1}^n 2x_k\delta_{ik}.
\end{align*}
Only the term $k=i$ survives in the sum, hence
\begin{align*}
\partial_{x_i}V(x)=x_i.
\end{align*}
Differentiating once more with respect to $x_j$ gives
\begin{align*}
\partial_{x_j}\partial_{x_i}V(x)=\partial_{x_j}x_i.
\end{align*}
Since $\partial_{x_j}x_i=\delta_{ij}$,
\begin{align*}
\partial_{x_j}\partial_{x_i}V(x)=\delta_{ij}.
\end{align*}
Thus the Hessian matrix has entries $(D^2V(x))_{ij}=\delta_{ij}$, so
\begin{align*}
D^2V(x)=I_n.
\end{align*}
By *Potential Energy Hessian*,
\begin{align*}
\operatorname{Hess}_{W_2}\mathcal V(\rho)[\nabla\phi,\nabla\phi]=\int_{\mathbb R^n}\nabla\phi(x)\cdot(D^2V(x)\nabla\phi(x))\rho(x)\,d\mathcal L^n(x).
\end{align*}
Substituting $D^2V(x)=I_n$ gives
\begin{align*}
\operatorname{Hess}_{W_2}\mathcal V(\rho)[\nabla\phi,\nabla\phi]=\int_{\mathbb R^n}\nabla\phi(x)\cdot(I_n\nabla\phi(x))\rho(x)\,d\mathcal L^n(x).
\end{align*}
The identity matrix satisfies $I_n\nabla\phi(x)=\nabla\phi(x)$, so the integrand becomes
\begin{align*}
\nabla\phi(x)\cdot(I_n\nabla\phi(x))=\nabla\phi(x)\cdot\nabla\phi(x).
\end{align*}
By the definition of the Euclidean norm,
\begin{align*}
\nabla\phi(x)\cdot\nabla\phi(x)=|\nabla\phi(x)|^2.
\end{align*}
Therefore
\begin{align*}
\operatorname{Hess}_{W_2}\mathcal V(\rho)[\nabla\phi,\nabla\phi]=\int_{\mathbb R^n}|\nabla\phi(x)|^2\rho(x)\,d\mathcal L^n(x).
\end{align*}
By *Benamou Brenier Metric Tensor*, the right-hand side is the squared Wasserstein speed of the geodesic at this time. Hence the second derivative of the quadratic confinement energy equals the squared tangent norm, so in particular it is at least $1$ times that squared norm. This is the formal statement that quadratic confinement is $1$-convex in the Wasserstein geometry.
[/example]
This calculation is the model for strongly convex confinement in Fokker--Planck equations. The same method, with more elaborate algebra, treats entropy and interaction energies.
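As a hedged illustration of how the same substitution behaves for a general quadratic potential, let $V(x)=\frac12 x\cdot Ax$, where the symmetric matrix $A$ and the constant $\lambda$ with $A\ge\lambda I_n$ are assumptions introduced only for this sketch. Then $D^2V(x)=A$, and the formula from *Potential Energy Hessian* gives
\begin{align*}
\operatorname{Hess}_{W_2}\mathcal V(\rho)[\nabla\phi,\nabla\phi]=\int_{\mathbb R^n}\nabla\phi(x)\cdot(A\nabla\phi(x))\rho(x)\,d\mathcal L^n(x)\ge\lambda\int_{\mathbb R^n}|\nabla\phi(x)|^2\rho(x)\,d\mathcal L^n(x).
\end{align*}
At the formal level this says that $\mathcal V$ is $\lambda$-convex in the Wasserstein geometry, and the choice $A=I_n$, $\lambda=1$ recovers the quadratic confinement example.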
[remark: Formal Status of Otto Calculus]
Otto calculus is a formal differential calculus on a dense smooth part of $\mathcal P_2$. The rigorous metric theory replaces tangent spaces by minimal velocity fields, differentiability by absolutely continuous curves, and Hessian inequalities by convexity along geodesics or evolution variational inequalities.
[/remark]
The formal picture remains valuable because it predicts the correct PDE, energy identity, and convexity condition before the analytic framework is built. Chapter 3 applies this pattern to displacement convexity and functional inequalities: compute in Otto notation, then translate the result into a weak or metric statement that survives beyond smooth positive densities.
# 3. Displacement Convexity and Functional Inequalities
This chapter explains how convexity reappears in optimal transport after replacing straight-line interpolation of densities by $W_2$-geodesic interpolation. The main theme is that many nonlinear integral functionals become convex along displacement paths, and this convexity gives analytic consequences that are not visible from linear interpolation alone. The prerequisites are the Brenier theorem, change of variables for optimal maps, relative entropy, Fisher information, and the basic dual formulation of $W_2^2$. We begin with McCann's internal-energy criterion, then apply it to entropy and uniqueness of minimizers, and end with the transport proof of HWI, logarithmic Sobolev, and Talagrand inequalities under curvature lower bounds.
## Internal Energies Along Displacement Interpolations
The first question is: which nonlinear functionals of a density behave convexly when mass moves along optimal maps? Linear convexity is too restrictive for transport problems, because the natural interpolation between two absolutely continuous probability measures is not $(1-t)\rho_0+t\rho_1$ but the pushforward of $\rho_0$ along the map $(1-t)\operatorname{id}+tT$. McCann's condition identifies a large class of integrands for which this displacement interpolation improves, rather than destroys, convexity.
[definition: Internal Energy]
Let $U:[0,\infty)\to\mathbb R$ be a convex function with $U(0)=0$. The internal energy associated to $U$ is the functional
\begin{align*}
\mathcal U: \mathcal D(\mathcal U)\subset \mathcal P_2(\mathbb R^n)\to (-\infty,\infty]
\end{align*}
defined on absolutely continuous probability measures $\mu=\rho\,d\mathcal L^n$ with finite second moment by
\begin{align*}
\mathcal U[\mu] = \int_{\mathbb R^n} U(\rho(x))\,d\mathcal L^n(x).
\end{align*}
[/definition]
When used on all of $\mathcal P_2(\mathbb R^n)$, $\mathcal U$ denotes the lower semicontinuous extension of this functional, with value $+\infty$ assigned where no finite relaxed value exists.
The internal energy records the cost of local compression, but displacement interpolation changes density through Jacobian determinants rather than through pointwise convex combinations. Convexity of $U$ by itself does not control this Jacobian effect: for instance, in dimensions $n\ge 2$, the convex integrand $U(s)=-s^m$ with $0<m<1-1/n$ lies outside the displacement-convex range, and compressing mass anisotropically along an optimal map can make the corresponding internal energy bend the wrong way along the geodesic. To decide when the previous functional is compatible with transport geodesics, we need a condition that tests exactly how $U$ reacts to the volume factor in an $n$-dimensional change of variables.
[definition: McCann Class]
Let $n\ge 1$. A convex function $U:[0,\infty)\to\mathbb R$ with $U(0)=0$ belongs to McCann's displacement convexity class $\mathcal{DC}_n$ if the McCann transform
\begin{align*}
G_U:(0,\infty)\to\mathbb R, \qquad G_U(r)=r^n U(r^{-n})
\end{align*}
is convex and nonincreasing on $(0,\infty)$.
[/definition]
The condition packages the change-of-variables calculation that appears along optimal maps. What remains to check is that this one-dimensional convexity condition is strong enough to control every Jacobian determinant produced by an optimal transport map, not just scalar dilations. Without such a result, the internal energy could look convex as a function of density values while still bending downward along Wasserstein geodesics because the mass is being rearranged through anisotropic volume changes.
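A formal sketch of that calculation may help. It assumes that the interpolating maps $T_t=(1-t)\operatorname{id}+tT$ are smooth and injective with $\det DT_t>0$, which is more regularity than the rigorous proof requires. With $\mu_t=(T_t)_\#\mu_0$ and densities $\rho_t$, the change of variables $y=T_t(x)$ gives
\begin{align*}
\rho_t(T_t(x))\det DT_t(x)=\rho_0(x),
\end{align*}
and therefore
\begin{align*}
\mathcal U[\mu_t]=\int_{\mathbb R^n}U\left(\frac{\rho_0(x)}{\det DT_t(x)}\right)\det DT_t(x)\,d\mathcal L^n(x).
\end{align*}
Writing $r_t(x)=(\det DT_t(x))^{1/n}$, the integrand at a point with $\rho_0(x)>0$ is
\begin{align*}
r_t(x)^nU\bigl(\rho_0(x)r_t(x)^{-n}\bigr)=\rho_0(x)\,G_U\bigl(r_t(x)\rho_0(x)^{-1/n}\bigr).
\end{align*}
By the Brenier theorem $T=\nabla\psi$ with $\psi$ convex, so $DT_t=(1-t)I+tDT$ is an interpolation of symmetric positive semidefinite matrices, and the Minkowski determinant inequality makes $t\mapsto r_t(x)$ concave. A convex nonincreasing function composed with a concave function is convex, so the integrand is convex in $t$ for each fixed $x$. This is exactly where both requirements on $G_U$ are used.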
[quotetheorem:9567]
[citeproof:9567]
The hypotheses are doing real work. Absolute continuity is what allows the density of $\mu_t$ to be computed from the Jacobian of the transport map; if $\mu_0=\delta_0$ and $\mu_1=\delta_1$, the geodesic consists of Dirac masses, so the integral formula for $\mathcal U[\mu_t]$ has no density to evaluate and the relaxed energy is typically infinite. The McCann condition is also not cosmetic: in dimensions $n\ge 2$, the convex integrand $U(s)=-s^m$ with $0<m<1-1/n$ gives a concrete failure of displacement convexity, because its McCann transform $-r^{n(1-m)}$ is concave. The theorem proves convexity along the specific $W_2$-geodesics between absolutely continuous endpoints; it does not assert strict convexity, existence of minimizers, or convexity along arbitrary interpolations of measures.
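The failure of the McCann condition for this integrand can be checked in two lines. Put $p=n(1-m)$; the range $0<m<1-1/n$ is equivalent to $1<p<n$. Then
\begin{align*}
G_U(r)=r^n\bigl(-(r^{-n})^m\bigr)=-r^{p},\qquad G_U'(r)=-p\,r^{p-1}\le 0,\qquad G_U''(r)=-p(p-1)\,r^{p-2}<0.
\end{align*}
So $G_U$ is nonincreasing but strictly concave on $(0,\infty)$, and $U\notin\mathcal{DC}_n$. At the borderline $p=1$, that is $m=1-1/n$, the transform $G_U(r)=-r$ is linear, hence still convex and nonincreasing.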
The theorem is useful because it covers the internal energies that generate diffusion equations. Two model cases are the Boltzmann entropy and porous-medium energy.
[example: Boltzmann Entropy In The McCann Class]
Let $U(s)=s\log s$ for $s>0$ and define $U(0)=0$, which is compatible with
\begin{align*}
\lim_{s\downarrow 0}s\log s=0.
\end{align*}
For $s>0$,
\begin{align*}
U''(s)=\frac{1}{s}>0,
\end{align*}
so $U$ is convex on $(0,\infty)$, and its continuous extension at $0$ is convex on $[0,\infty)$. For an absolutely continuous measure $\mu=\rho\,d\mathcal L^n$, the associated internal energy is therefore
\begin{align*}
\mathcal U[\mu]=\int_{\mathbb R^n}\rho(x)\log\rho(x)\,d\mathcal L^n(x)=\operatorname{Ent}_{\mathcal L^n}(\mu).
\end{align*}
We compute the McCann transform. For $r>0$,
\begin{align*}
G_U(r)=r^nU(r^{-n}).
\end{align*}
Since $r^{-n}>0$,
\begin{align*}
U(r^{-n})=r^{-n}\log(r^{-n}).
\end{align*}
Using $\log(r^{-n})=-n\log r$, this gives
\begin{align*}
G_U(r)=r^n r^{-n}(-n\log r)=-n\log r.
\end{align*}
Its first derivative is
\begin{align*}
G_U'(r)=-\frac{n}{r}\le 0,
\end{align*}
and its second derivative is
\begin{align*}
G_U''(r)=\frac{n}{r^2}\ge 0.
\end{align*}
Thus $G_U$ is nonincreasing and convex on $(0,\infty)$, so $U\in\mathcal{DC}_n$. By *McCann Displacement Convexity Theorem*, $\operatorname{Ent}_{\mathcal L^n}$ is convex along $W_2$-geodesics between absolutely continuous probability measures on $\mathbb R^n$.
[/example]
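The entropy example corresponds to linear heat flow in the Wasserstein gradient-flow picture. Replacing $s\log s$ by a power $s^m/(m-1)$ produces nonlinear diffusion, and the displacement convexity threshold $m\ge 1-1/n$ records the dimension-dependent range of exponents in which transport geometry still gives convexity. The next example treats the porous-medium range $m>1$, which lies inside this range in every dimension.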
[example: Porous-Medium Energy]
Let $m>1$ and set $U(s)=s^m/(m-1)$ for $s\ge 0$. For $s>0$,
\begin{align*}
U''(s)=m s^{m-2}>0.
\end{align*}
Thus $U$ is convex, and for $\mu=\rho\,d\mathcal L^n$ its internal energy is
\begin{align*}
\mathcal U_m[\rho\,d\mathcal L^n]=\int_{\mathbb R^n}\frac{\rho(x)^m}{m-1}\,d\mathcal L^n(x)=\frac{1}{m-1}\int_{\mathbb R^n}\rho(x)^m\,d\mathcal L^n(x).
\end{align*}
We compute the McCann transform. For $r>0$,
\begin{align*}
G_U(r)=r^nU(r^{-n})=r^n\frac{(r^{-n})^m}{m-1}.
\end{align*}
Since $(r^{-n})^m=r^{-nm}$, this becomes
\begin{align*}
G_U(r)=\frac{1}{m-1}r^{n-nm}=\frac{1}{m-1}r^{-n(m-1)}.
\end{align*}
Let $\alpha=n(m-1)>0$. Then
\begin{align*}
G_U(r)=\frac{1}{m-1}r^{-\alpha}.
\end{align*}
Its first derivative is
\begin{align*}
G_U'(r)=-\frac{\alpha}{m-1}r^{-\alpha-1}\le 0.
\end{align*}
Its second derivative is
\begin{align*}
G_U''(r)=\frac{\alpha(\alpha+1)}{m-1}r^{-\alpha-2}\ge 0.
\end{align*}
Therefore $G_U$ is nonincreasing and convex on $(0,\infty)$, so $U\in\mathcal{DC}_n$. By *McCann Displacement Convexity Theorem*, the porous-medium energy $\mathcal U_m$ is displacement convex along $W_2$-geodesics between absolutely continuous probability measures on $\mathbb R^n$.
[/example]
These examples show that displacement convexity is not a rare accident. It is a systematic compatibility between mass rearrangement, Jacobians, and nonlinear density penalization.
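The McCann condition is easy to probe numerically before attempting a proof. The following is a minimal sketch, not a verification: it samples $G_U$ on a uniform grid and tests monotonicity and convexity through first and second differences. The function names, the grid bounds, and the tolerance are all illustrative assumptions.

```python
import math

def mccann_transform(U, n):
    """Return the map r -> r**n * U(r**(-n)) for r > 0."""
    return lambda r: r**n * U(r ** (-n))

def grid_check(U, n, r_min=0.1, r_max=10.0, steps=400, tol=1e-9):
    """Test monotonicity and convexity of G_U on a uniform grid.

    Returns the pair (nonincreasing, convex). A grid test is numerical
    evidence only; it does not prove membership in the McCann class.
    """
    G = mccann_transform(U, n)
    h = (r_max - r_min) / steps
    vals = [G(r_min + i * h) for i in range(steps + 1)]
    nonincreasing = all(vals[i + 1] <= vals[i] + tol for i in range(steps))
    # Second differences of a convex function on a uniform grid are >= 0.
    convex = all(vals[i - 1] - 2.0 * vals[i] + vals[i + 1] >= -tol
                 for i in range(1, steps))
    return nonincreasing, convex

def entropy(s):
    """Boltzmann integrand U(s) = s log s for s > 0."""
    return s * math.log(s)

def porous_medium(s, m=2.0):
    """Porous-medium integrand U(s) = s^m / (m - 1)."""
    return s**m / (m - 1.0)

def sublinear_power(s, m=0.25):
    """Convex integrand U(s) = -s^m; m = 1/4 < 1 - 1/n when n = 2."""
    return -(s**m)

for name, U in [("entropy", entropy),
                ("porous medium", porous_medium),
                ("sublinear power", sublinear_power)]:
    print(name, grid_check(U, n=2))
```

In dimension $n=2$ the first two integrands should pass both tests, while the third should pass the monotonicity test and fail the convexity test, in line with the computations in the text.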
## Entropy Convexity And Uniqueness Of Minimizers
The next problem is to turn convexity along geodesics into uniqueness statements. In ordinary convex analysis, a strictly convex functional on a convex set has at most one minimizer. In Wasserstein space, the same principle applies once the admissible set is geodesically convex and the functional is strictly convex along those geodesics.
[definition: Displacement Convex Functional]
Let $\mathcal A\subset \mathcal P_2(\mathbb R^n)$ be a set such that every pair of measures in $\mathcal A$ can be joined by a $W_2$-geodesic contained in $\mathcal A$. A functional $\mathcal F:\mathcal A\to(-\infty,\infty]$ is displacement convex on $\mathcal A$ if for every constant-speed $W_2$-geodesic $(\mu_t)_{0\le t\le 1}$ in $\mathcal A$,
\begin{align*}
\mathcal F[\mu_t]\le (1-t)\mathcal F[\mu_0]+t\mathcal F[\mu_1]
\end{align*}
for all $t\in[0,1]$.
[/definition]
Convexity alone gives a convex set of minimizers, so uniqueness requires either strict convexity or another term that creates strictness. In applications, entropy is often combined with a potential energy whose convexity comes from a convex confinement potential.
[definition: Free Energy With Reference Potential]
Let $V:\mathbb R^n\to\mathbb R$ be Borel measurable. The free energy with confinement potential $V$ is the functional
\begin{align*}
\mathcal F:\mathcal D(\mathcal F)\subset \mathcal P_2(\mathbb R^n)\to(-\infty,\infty]
\end{align*}
defined for absolutely continuous measures $\mu=\rho\,d\mathcal L^n$ by
\begin{align*}
\mathcal F[\mu]=\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n+\int_{\mathbb R^n}V(x)\,d\mu(x),
\end{align*}
whenever the positive and negative parts are not both infinite, and extended by $+\infty$ outside its finite-energy domain.
[/definition]
The entropy term penalizes concentration relative to Lebesgue measure, while the potential term prevents mass from escaping to infinity. To use this free energy in variational arguments, we need a geodesic convexity estimate that retains the curvature of $V$ as a quantitative correction term.
[quotetheorem:9568]
[citeproof:9568]
The estimate separates two roles that are often conflated. The entropy supplies displacement convexity, while the Hessian lower bound on $V$ supplies the quantitative correction term; the $C^2$ assumption is a convenient way to express this through the second derivative of $V$ along transport rays. Without a lower Hessian bound the potential term can destroy geodesic convexity: for example $V(x)=-|x|^4$ on $\mathbb R^n$ has second derivative unbounded below along rays, and translating a compactly supported density outward makes $\int V\,d\mu_t$ bend downward faster than any fixed quadratic correction can control. Integrability is a separate boundary condition; if the negative part of $V$ is not integrable against an endpoint, then
\begin{align*}
\int_{\mathbb R^n} V\,d\mu
\end{align*}
is not a finite variational term and the displayed inequality cannot be read as a comparison of real numbers. If $\lambda=0$, the conclusion is ordinary displacement convexity and equality may persist along nontrivial geodesics, while $\lambda<0$ allows a controlled loss of convexity rather than a uniqueness mechanism. The theorem also says nothing by itself about existence of minimizers: compactness, coercivity of $V$, and lower semicontinuity are separate variational inputs.
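The loss of convexity for $V(x)=-|x|^4$ can be made explicit along a translation. In the following sketch the vector $a\in\mathbb R^n$ and the compactly supported initial measure $\mu_0$ are illustrative choices. The map $x\mapsto x+a$ is the gradient of the convex function $|x|^2/2+a\cdot x$, hence optimal, so $\mu_t=(\operatorname{id}+ta)_\#\mu_0$ is a $W_2$-geodesic with $W_2(\mu_0,\mu_1)^2=|a|^2$. Differentiating twice under the integral,
\begin{align*}
\frac{d^2}{dt^2}\int_{\mathbb R^n}V\,d\mu_t=\frac{d^2}{dt^2}\int_{\mathbb R^n}V(x+ta)\,d\mu_0(x)=\int_{\mathbb R^n}a\cdot\bigl(D^2V(x+ta)\,a\bigr)\,d\mu_0(x).
\end{align*}
For $V(y)=-|y|^4$ one has $\nabla V(y)=-4|y|^2y$ and $D^2V(y)=-4|y|^2I-8\,y\otimes y$, so
\begin{align*}
a\cdot\bigl(D^2V(y)\,a\bigr)=-4|y|^2|a|^2-8(a\cdot y)^2\le-4|y|^2|a|^2.
\end{align*}
A $\lambda$-convexity inequality for the potential term along this geodesic would therefore force
\begin{align*}
\lambda\le-4\int_{\mathbb R^n}|x+ta|^2\,d\mu_0(x),
\end{align*}
and the right-hand side tends to $-\infty$ as the support of $\mu_0$ is moved away from the origin. No fixed $\lambda$ survives.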
A convexity estimate alone does not identify a single equilibrium unless equality along a geodesic is rigid. When $\lambda>0$, two distinct minimizers would be joined by a nonconstant Wasserstein geodesic along which, at every interior time, the energy is bounded above by the common minimal value minus a strictly positive quadratic correction. That contradicts minimality. The argument presupposes that minimizers exist, so the next theorem is a uniqueness statement only: positive curvature forbids more than one minimizer but does not by itself produce one.
[quotetheorem:9569]
[citeproof:9569]
The condition $\lambda>0$ is the strictness that rules out a whole geodesic of minimizers. For merely convex potentials, minimizers may still be unique for other reasons, but convexity alone does not force uniqueness from this argument; for weak confinement, a minimizer may fail to exist because mass can drift to infinity. This is why attainment is stated as a separate hypothesis rather than hidden inside the convexity theorem. The most important reference example is the Gaussian measure: its quadratic confinement both guarantees a well-behaved equilibrium and gives the curvature constant that becomes sharp in the functional inequalities below.
[example: Gaussian Free Energy]
Let
\begin{align*}
d\gamma(x)=(2\pi)^{-n/2}e^{-|x|^2/2}\,d\mathcal L^n(x)
\end{align*}
be the standard Gaussian probability measure on $\mathbb R^n$, and let $\mu=\rho\,d\mathcal L^n$ be a probability measure with $\mu\ll\gamma$. Writing $\mu=f\gamma$, the density $f$ is determined by
\begin{align*}
\rho(x)\,d\mathcal L^n(x)=f(x)(2\pi)^{-n/2}e^{-|x|^2/2}\,d\mathcal L^n(x).
\end{align*}
Hence
\begin{align*}
f(x)=\rho(x)(2\pi)^{n/2}e^{|x|^2/2}.
\end{align*}
Taking logarithms gives
\begin{align*}
\log f(x)=\log\rho(x)+\frac n2\log(2\pi)+\frac{|x|^2}{2}
\end{align*}
at points where $\rho(x)>0$, with the usual convention $0\log 0=0$ in the entropy integral. Therefore
\begin{align*}
\operatorname{Ent}_\gamma(\mu)=\int_{\mathbb R^n} f\log f\,d\gamma=\int_{\mathbb R^n}\rho(x)\log f(x)\,d\mathcal L^n(x).
\end{align*}
Substituting the displayed expression for $\log f$ yields
\begin{align*}
\operatorname{Ent}_\gamma(\mu)=\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n+\frac12\int_{\mathbb R^n}|x|^2\,d\mu(x)+\frac n2\log(2\pi)\int_{\mathbb R^n}\rho\,d\mathcal L^n.
\end{align*}
Since $\mu$ is a probability measure, $\int_{\mathbb R^n}\rho\,d\mathcal L^n=1$, so
\begin{align*}
\operatorname{Ent}_\gamma(\mu)=\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n+\frac12\int_{\mathbb R^n}|x|^2\,d\mu(x)+\frac n2\log(2\pi).
\end{align*}
Thus Gaussian relative entropy is the free energy with confinement potential $V(x)=|x|^2/2$, plus the constant $n\log(2\pi)/2$. For this potential,
\begin{align*}
\nabla V(x)=x.
\end{align*}
Its Jacobian matrix is
\begin{align*}
J(\nabla V)_x=I.
\end{align*}
Therefore $J(\nabla V)_x\ge 1\cdot I$ for every $x\in\mathbb R^n$. By *Entropy Plus Convex Potential Is Displacement Convex*, the nonconstant part of $\operatorname{Ent}_\gamma$ is $1$-displacement convex, and adding the constant $n\log(2\pi)/2$ does not change the convexity inequality along a geodesic. Hence the Gaussian relative entropy is $1$-displacement convex.
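As a sanity check of the decomposition, take $n=1$ and let $\mu$ be the Gaussian measure with mean $m$ and variance $s^2$; these parameters are introduced only for this check. The standard values
\begin{align*}
\int_{\mathbb R}\rho\log\rho\,d\mathcal L^1=-\frac12\log(2\pi e s^2),\qquad \int_{\mathbb R}|x|^2\,d\mu(x)=s^2+m^2
\end{align*}
inserted into the displayed formula give
\begin{align*}
\operatorname{Ent}_\gamma(\mu)=-\frac12\log(2\pi e s^2)+\frac12(s^2+m^2)+\frac12\log(2\pi)=\frac12(s^2+m^2-1)-\log s.
\end{align*}
This is the familiar relative entropy between two one-dimensional Gaussians, and it vanishes exactly when $m=0$ and $s=1$, that is, when $\mu=\gamma$.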
[/example]
Together with the entropy gradient-flow calculation from Chapter 2, this example is the bridge from convexity of energy to quantitative probability inequalities. The constant $\lambda=1$ in the curvature bound for the Gaussian potential becomes the sharp constant in the Gaussian logarithmic Sobolev and Talagrand inequalities.
## HWI, Logarithmic Sobolev, And Talagrand Inequalities
The final question is how transport distances, entropy, and Fisher information constrain each other. Otto and Villani's HWI inequality links the three quantities named by $H$ for entropy, $W$ for Wasserstein distance, and $I$ for Fisher information. Once this bridge is in place, logarithmic Sobolev and Talagrand inequalities follow by optimization and comparison.
[definition: Relative Entropy And Fisher Information]
Let $V:\mathbb R^n\to\mathbb R$ be Borel measurable, and let $\nu=e^{-V}\,d\mathcal L^n$ be a probability measure on $\mathbb R^n$. The relative entropy is the functional
\begin{align*}
\operatorname{Ent}_\nu:\mathcal P(\mathbb R^n)\to[0,\infty]
\end{align*}
defined for $\mu=f\nu$ by
\begin{align*}
\operatorname{Ent}_\nu(\mu) = \int_{\mathbb R^n} f\log f\,d\nu,
\end{align*}
and defined to be $+\infty$ when $\mu$ is not absolutely continuous with respect to $\nu$. The relative Fisher information is the functional
\begin{align*}
I_\nu:\mathcal D(I_\nu)\subset \mathcal P(\mathbb R^n)\to[0,\infty]
\end{align*}
defined on measures $\mu=f\nu$ such that $\sqrt f\in W^{1,2}_{\mathrm{loc}}(\mathbb R^n)$ and
\begin{align*}
\int_{\mathbb R^n} |\nabla \sqrt f|^2\,d\nu<\infty
\end{align*}
by
\begin{align*}
I_\nu(\mu) = 4\int_{\mathbb R^n}|\nabla \sqrt f|^2\,d\nu,
\end{align*}
where the gradient is the weak gradient. The value is $+\infty$ outside this finite-information domain.
[/definition]
On the set where $f>0$ and $\nabla f$ exists weakly, the same Fisher information may be written in the density-gradient form
\begin{align*}
I_\nu(\mu)=\int_{\mathbb R^n}\frac{|\nabla f|^2}{f}\,d\nu.
\end{align*}
The Fisher information is the squared Wasserstein slope of entropy in smooth situations. The following theorem is needed because it converts displacement convexity into an inequality involving this slope, by comparing $\mu$ with the equilibrium measure $\nu$ along the geodesic joining them and estimating the initial entropy derivative.
[quotetheorem:9570]
[citeproof:9570]
The assumptions in HWI ensure that all three terms are meaningful at the same time. Regularity and finite Fisher information justify the first-variation computation, finite entropy prevents the left-hand side from being a formal infinity, and finite second moment is needed for $W_2(\mu,\nu)$. The curvature lower bound is the source of the sign and size of the quadratic term: when $\lambda>0$ it penalizes distance from equilibrium, when $\lambda=0$ the inequality still relates entropy to slope and distance, and when $\lambda<0$ it no longer yields a direct logarithmic Sobolev estimate by optimization. Thus HWI is a bridge inequality, not by itself an existence theorem or a replacement for regularity of the entropy slope.
In the HWI inequality, the distance to equilibrium is an auxiliary variable rather than the quantity one ultimately wants to keep. Positive curvature makes the right-hand side a concave quadratic expression in $W_2(\mu,\nu)$, so the distance term can be eliminated by maximizing over its possible value. The obstruction is precisely the sign of the curvature term: only when $\lambda>0$ does this optimization produce a finite entropy bound depending solely on Fisher information.
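The elimination step is a one-variable optimization. The sketch below assumes that the HWI inequality is stated in its standard form $\operatorname{Ent}_\nu(\mu)\le W_2(\mu,\nu)\sqrt{I_\nu(\mu)}-\frac{\lambda}{2}W_2(\mu,\nu)^2$; the precise hypotheses are those of the HWI theorem quoted earlier in this section. Abbreviate $w=W_2(\mu,\nu)$ and $I=I_\nu(\mu)$. For $\lambda>0$, completing the square gives
\begin{align*}
w\sqrt I-\frac{\lambda}{2}w^2=\frac{I}{2\lambda}-\frac{\lambda}{2}\left(w-\frac{\sqrt I}{\lambda}\right)^2\le\frac{I}{2\lambda},
\end{align*}
with equality at $w=\sqrt I/\lambda$. Hence
\begin{align*}
\operatorname{Ent}_\nu(\mu)\le\frac{1}{2\lambda}I_\nu(\mu).
\end{align*}
For $\lambda\le 0$ and $I>0$ the supremum of the left-hand expression over $w\ge 0$ is $+\infty$, so no bound of this type follows.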
[quotetheorem:9571]
[citeproof:9571]
The positivity of $\lambda$ is essential in this derivation because the quadratic expression has a finite maximum only when the curvature term bends downward. The resulting inequality controls entropy by the infinitesimal cost of changing the density, but it does not directly bound $W_2$ or guarantee concentration without an additional argument. The missing global transport control is supplied by the next theorem, which turns logarithmic Sobolev information into Talagrand's $T_2$ inequality. Sharpness is already visible in the Gaussian case: the constant cannot be improved because exponential tilts attain equality. The smooth proof is only a starting point: extending the inequality to the natural Sobolev domain requires approximating $f$ by positive bounded densities, passing the entropy by lower semicontinuity, and passing the Fisher information by weak lower semicontinuity of the Dirichlet form. When the curvature lower bound is zero or negative, HWI may remain informative, but this optimization no longer produces a finite logarithmic Sobolev constant.
The logarithmic Sobolev inequality controls entropy by infinitesimal information. The next step asks for a global form of the same control, replacing the local Fisher information by the squared transport distance to equilibrium.
[quotetheorem:9572]
[citeproof:9572]
The conclusion is global in transport distance, but it is weaker than the logarithmic Sobolev inequality from which it was derived. It controls $W_2(\mu,\nu)$ by entropy, while LSI controls entropy by Fisher information and carries differential information about the density. The Euclidean Hamilton-Jacobi framework is needed because the proof differentiates the Hopf-Lax semigroup and inserts its spatial gradient into LSI; on a space where the Hopf-Lax formula lacks a usable gradient calculus, the displayed differential inequality is not available from the stated hypotheses alone. The bounded-Lipschitz test class also matters: without truncation and approximation, the exponential moment term may be infinite and the entropy variational formula cannot be applied. Absolute continuity and finite entropy are needed for the right-hand side to be a real quantitative bound; if $\mu$ is singular with respect to $\nu$, for instance $\mu=\delta_0$ while $\nu$ is the standard Gaussian measure, then $\operatorname{Ent}_\nu(\mu)=+\infty$ and the inequality is only an extended-real statement with no finite transport estimate. The finite second moment assumption is also indispensable for the left-hand side: a probability density on $\mathbb R^n$ with tail comparable to $|x|^{-(n+2)}$ has infinite second moment, so its $W_2$ distance to any measure with finite second moment is infinite and the asserted finite-distance conclusion is unavailable.
For the Gaussian measure, the preceding inequalities recover the sharp constants familiar from probability. For $V(x)=|x|^2/2$, the Jacobian of the gradient satisfies $J(\nabla V)_x=I$, so $\lambda=1$.
[example: Sharp Gaussian Constants]
Let $\gamma$ be the standard Gaussian measure on $\mathbb R^n$. Since the Gaussian potential is $V(x)=|x|^2/2$, we have $\nabla V(x)=x$ and $J(\nabla V)_x=I$, so the curvature constant is $\lambda=1$. Applying *Logarithmic Sobolev Inequality From HWI* with $\lambda=1$ gives, for every smooth probability density $f$ with respect to $\gamma$,
\begin{align*}
\operatorname{Ent}_\gamma(f\gamma)\le \frac{1}{2}I_\gamma(f\gamma).
\end{align*}
Applying *Talagrand T Two From Logarithmic Sobolev* with the same constant gives
\begin{align*}
W_2(f\gamma,\gamma)^2\le 2\operatorname{Ent}_\gamma(f\gamma).
\end{align*}
Now fix $a\in\mathbb R^n$ and set
\begin{align*}
f_a(x)=\exp\left(a\cdot x-\frac{|a|^2}{2}\right).
\end{align*}
Multiplying by the Gaussian density gives
\begin{align*}
f_a(x)(2\pi)^{-n/2}e^{-|x|^2/2}=(2\pi)^{-n/2}\exp\left(-\frac{|x|^2}{2}+a\cdot x-\frac{|a|^2}{2}\right).
\end{align*}
Since
\begin{align*}
-|x|^2+2a\cdot x-|a|^2=-|x-a|^2,
\end{align*}
we obtain
\begin{align*}
f_a(x)(2\pi)^{-n/2}e^{-|x|^2/2}=(2\pi)^{-n/2}e^{-|x-a|^2/2}.
\end{align*}
Thus $f_a\gamma$ is the translate of $\gamma$ by $a$, and in particular its mean is $a$.
For the entropy, $\log f_a(x)=a\cdot x-|a|^2/2$, so
\begin{align*}
\operatorname{Ent}_\gamma(f_a\gamma)=\int_{\mathbb R^n}\left(a\cdot x-\frac{|a|^2}{2}\right)\,d(f_a\gamma)(x).
\end{align*}
Using $\int x\,d(f_a\gamma)(x)=a$ and $\int 1\,d(f_a\gamma)=1$, this becomes
\begin{align*}
\operatorname{Ent}_\gamma(f_a\gamma)=a\cdot a-\frac{|a|^2}{2}=\frac{|a|^2}{2}.
\end{align*}
For the Fisher information, $\nabla f_a(x)=a f_a(x)$, hence
\begin{align*}
\frac{|\nabla f_a(x)|^2}{f_a(x)}=|a|^2f_a(x).
\end{align*}
Therefore
\begin{align*}
I_\gamma(f_a\gamma)=\int_{\mathbb R^n}|a|^2f_a\,d\gamma=|a|^2.
\end{align*}
The logarithmic Sobolev inequality is an equality for $f_a$, because both sides are
\begin{align*}
\frac{|a|^2}{2}.
\end{align*}
Finally, the map $T(x)=x+a$ pushes $\gamma$ forward to $f_a\gamma$, so the corresponding transport plan gives
\begin{align*}
W_2(\gamma,f_a\gamma)^2\le \int_{\mathbb R^n}|T(x)-x|^2\,d\gamma(x)=|a|^2.
\end{align*}
Conversely, for any coupling $(X,Y)$ of $\gamma$ and $f_a\gamma$, Jensen's inequality gives
\begin{align*}
\mathbb E|Y-X|^2\ge |\mathbb E(Y-X)|^2=|a-0|^2=|a|^2.
\end{align*}
Taking the infimum over couplings yields
\begin{align*}
W_2(\gamma,f_a\gamma)^2=|a|^2.
\end{align*}
Since $2\operatorname{Ent}_\gamma(f_a\gamma)=|a|^2$, the Talagrand inequality is also an equality for these translated Gaussian measures.
[/example]
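The closed-form values in this example can be cross-checked numerically in dimension one. The following is a minimal sketch using a composite trapezoid rule from the standard library only; the function names, the integration window, and the step count are illustrative assumptions, and the window is chosen for moderate values of $a$.

```python
import math

def gaussian_density(x):
    """Density of the standard Gaussian on the real line."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def trapezoid(g, lo, hi, steps):
    """Composite trapezoid rule for g on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, steps):
        total += g(lo + i * h)
    return total * h

def entropy_and_fisher(a, half_width=12.0, steps=20000):
    """Approximate Ent_gamma(f_a gamma) and I_gamma(f_a gamma) for n = 1."""
    f = lambda x: math.exp(a * x - 0.5 * a * a)   # tilt density f_a
    df = lambda x: a * f(x)                        # derivative of f_a
    lo, hi = a - half_width, a + half_width        # window centred at the mean a
    ent = trapezoid(lambda x: f(x) * math.log(f(x)) * gaussian_density(x),
                    lo, hi, steps)
    fisher = trapezoid(lambda x: df(x) ** 2 / f(x) * gaussian_density(x),
                       lo, hi, steps)
    return ent, fisher

a = 1.5
ent, fisher = entropy_and_fisher(a)
print(ent, a * a / 2)   # entropy against |a|^2 / 2
print(fisher, a * a)    # Fisher information against |a|^2
```

The two printed pairs should agree to many digits, which is the one-dimensional instance of equality in the logarithmic Sobolev inequality. Combined with $W_2(\gamma,f_a\gamma)^2=|a|^2$ from the example, the same entropy value gives equality in the Talagrand inequality.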
The chapter's main lesson is that displacement convexity is not only a structural property of Wasserstein space. It is an engine for uniqueness, stability, and sharp inequalities: McCann convexity controls internal energies, entropy convexity controls minimizers, and HWI converts curvature into quantitative concentration estimates. The same pattern reaches beyond probability inequalities: Chapter 4 turns it into the JKO construction of variational PDE, Chapter 5 applies it to nonlinear diffusion and aggregation, and Chapter 7 recasts it as curvature-dimension theory.
# 4. The JKO Scheme and Minimizing Movements
The previous chapter developed displacement convexity as the structural reason many functionals behave like convex energies along Wasserstein geodesics. This chapter turns that formal geometry into an existence method for evolution equations. The guiding question is how to construct a gradient flow in a space of probability measures when there is no linear Hilbert-space structure available. The answer is to replace the differential equation by a sequence of variational problems, each balancing distance travelled against energy decreased.
## Implicit Euler Steps in Metric Spaces
Suppose a smooth Hilbert-space curve $u(t)$ solves the gradient flow equation $\frac{d}{dt}u(t)=-\nabla E(u(t))$. The backward Euler discretisation does not move by evaluating the vector field at the old point; instead, it chooses the next point by minimising a penalised energy. This viewpoint survives in metric spaces because it only uses the distance and the functional.
[definition: Moreau-Yosida Step]
Let $(X,d)$ be a metric space, let $E:X\to(-\infty,\infty]$ be a functional, let $\tau>0$, and let $x\in X$. A Moreau-Yosida step from $x$ with time step $\tau$ is any minimiser $x_\tau\in X$ of
\begin{align*}
y\mapsto \frac{1}{2\tau}d^2(y,x)+E(y).
\end{align*}
[/definition]
The squared-distance term penalises motion away from the previous state, while $E$ rewards moving downhill. In a smooth Hilbert space the Euler-Lagrange equation for this minimisation problem is $(x_\tau-x)/\tau=-\nabla E(x_\tau)$, which is exactly backward Euler.
[example: Quadratic Energy in Euclidean Space]
Let $X=\mathbb R^n$ with Euclidean distance, let $E(x)=\frac{1}{2}|x|^2$, and fix $x_0\in\mathbb R^n$. The Moreau-Yosida objective is
\begin{align*}
\Phi_\tau(y)=\frac{1}{2\tau}|y-x_0|^2+\frac{1}{2}|y|^2.
\end{align*}
Expanding the squared norm gives
\begin{align*}
\Phi_\tau(y)=\frac{1}{2\tau}\bigl(|y|^2-2y\cdot x_0+|x_0|^2\bigr)+\frac{1}{2}|y|^2.
\end{align*}
Hence
\begin{align*}
\nabla\Phi_\tau(y)=\frac{1}{\tau}(y-x_0)+y.
\end{align*}
The Hessian is $(\tau^{-1}+1)I$, so $\Phi_\tau$ is strictly convex and its critical point is the unique minimiser. Setting the gradient equal to zero gives
\begin{align*}
\frac{1}{\tau}(y-x_0)+y=0.
\end{align*}
Multiplying by $\tau$ gives
\begin{align*}
y-x_0+\tau y=0.
\end{align*}
Therefore $(1+\tau)y=x_0$, and the one-step minimiser is
\begin{align*}
y=\frac{1}{1+\tau}x_0.
\end{align*}
Iterating the same formula gives $x_1=(1+\tau)^{-1}x_0$, $x_2=(1+\tau)^{-1}x_1=(1+\tau)^{-2}x_0$, and by induction
\begin{align*}
x_k=(1+\tau)^{-k}x_0.
\end{align*}
If $k\tau\to t$ as $\tau\downarrow0$, then
\begin{align*}
(1+\tau)^{-k}=\exp\bigl(-k\log(1+\tau)\bigr).
\end{align*}
Since
\begin{align*}
k\log(1+\tau)=(k\tau)\frac{\log(1+\tau)}{\tau}
\end{align*}
and $\frac{\log(1+\tau)}{\tau}\to1$, we get $k\log(1+\tau)\to t$. Thus $x_k\to e^{-t}x_0$, which is exactly the solution of the continuous gradient flow $\dot x=-x$ starting from $x_0$.
[/example]
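The convergence claimed in this example is easy to observe numerically. The sketch below iterates the closed-form one-step minimiser $y=x/(1+\tau)$ derived above and compares the result with $e^{-t}x_0$; the function names and the sample time steps are illustrative assumptions.

```python
import math

def moreau_yosida_step(x, tau):
    """Closed-form minimiser of |y - x|^2 / (2 tau) + |y|^2 / 2 (scalar case)."""
    return x / (1.0 + tau)

def discrete_flow(x0, tau, t_final):
    """Apply round(t_final / tau) Moreau-Yosida steps starting from x0."""
    steps = round(t_final / tau)
    x = x0
    for _ in range(steps):
        x = moreau_yosida_step(x, tau)
    return x

def error_at(x0, tau, t_final):
    """Distance between the discrete scheme and the exact flow exp(-t) x0."""
    return abs(discrete_flow(x0, tau, t_final) - math.exp(-t_final) * x0)

for tau in (0.1, 0.01, 0.001):
    print(tau, error_at(1.0, tau, 1.0))
```

The printed errors should shrink roughly in proportion to $\tau$, which is the first-order accuracy expected of backward Euler.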
This example explains why the metric minimisation problem is the right replacement for a differential equation. To pass from one step to an entire curve, we need a definition that records all successive variational choices and the interpolation they generate.
[definition: Discrete Minimizing Movement]
Let $(X,d)$ be a metric space, let $E:X\to(-\infty,\infty]$, let $x_0\in X$, and let $\tau>0$. A discrete minimizing movement with time step $\tau$ is a sequence $(x_k^\tau)_{k\ge0}$ such that $x_0^\tau=x_0$ and, for each $k\ge0$, $x_{k+1}^\tau$ minimises
\begin{align*}
y\mapsto \frac{1}{2\tau}d^2(y,x_k^\tau)+E(y).
\end{align*}
[/definition]
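The associated piecewise-constant interpolation is $\bar{x}_\tau(0)=x_0$ and $\bar{x}_\tau(t)=x_k^\tau$ for $t\in((k-1)\tau,k\tau]$, $k\ge1$. A more refined interpolation, often called the De Giorgi interpolation, fills in each interval variationally: for $t\in(k\tau,(k+1)\tau]$ it takes a minimiser of $y\mapsto \frac{1}{2(t-k\tau)}d^2(y,x_k^\tau)+E(y)$, that is, the same functional with the elapsed time $t-k\tau$ in place of $\tau$.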
The two interpolations just introduced turn each discrete sequence into a curve $\bar{x}_\tau$ indexed by continuous time, but they leave the central analytic question untouched: as the time step $\tau$ shrinks, does the family of these curves converge, at least along a subsequence, to a limiting trajectory? Nothing so far guarantees this. Each $\bar{x}_\tau$ is built from its own chain of minimisations, and a priori the curves could oscillate, develop finer and finer jumps, or drift to infinity as $\tau\downarrow0$. To extract a limit we need two ingredients working together: an estimate that prevents the discrete curves from moving too far too fast (equicontinuity in time), and a mechanism that keeps the states confined to a compact region of $X$ (so that pointwise limits exist). The first comes from the energy budget of the scheme, the second from a hypothesis on the sublevel sets of $E$. The next theorem records precisely which structural assumptions make this extraction possible and what kind of limit it produces.
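The energy budget can be written out in three lines. The sketch assumes in addition that $E(x_0)<\infty$ and that $E$ is bounded below; the hypotheses actually used are those in the theorem's statement. Testing the minimality of $x_{k+1}^\tau$ against the competitor $y=x_k^\tau$ gives
\begin{align*}
\frac{1}{2\tau}d^2(x_{k+1}^\tau,x_k^\tau)+E(x_{k+1}^\tau)\le E(x_k^\tau).
\end{align*}
Summing over $k=0,\dots,N-1$, the energies telescope:
\begin{align*}
\sum_{k=0}^{N-1}d^2(x_{k+1}^\tau,x_k^\tau)\le 2\tau\bigl(E(x_0)-E(x_N^\tau)\bigr)\le 2\tau\bigl(E(x_0)-\inf E\bigr).
\end{align*}
For $m<n$, the triangle inequality and the Cauchy-Schwarz inequality then give
\begin{align*}
d(x_m^\tau,x_n^\tau)\le\sum_{k=m}^{n-1}d(x_{k+1}^\tau,x_k^\tau)\le\sqrt{n-m}\left(\sum_{k=m}^{n-1}d^2(x_{k+1}^\tau,x_k^\tau)\right)^{1/2}\le\sqrt{2\bigl(E(x_0)-\inf E\bigr)}\,\sqrt{(n-m)\tau}.
\end{align*}
Thus the discrete curves are Hölder continuous with exponent $1/2$ at time scales above $\tau$, with a constant independent of $\tau$. The first inequality also shows that the energies decrease along the scheme, so every state stays in the sublevel set $\{E\le E(x_0)\}$.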
[quotetheorem:9573]
[citeproof:9573]
This theorem is qualitative: it constructs a curve without yet identifying which differential equation it solves. The compact-sublevel hypothesis is the metric replacement for finite-dimensional compactness; without it, a sequence of minimisers can drift away while keeping bounded energy, so no convergent subsequence need exist. Lower semicontinuity is equally structural, since otherwise a limiting point of an approximating sequence may fail to minimise the time-step functional. In Wasserstein space, the same principle becomes a numerical and theoretical engine for diffusion equations.
## The Jordan-Kinderlehrer-Otto Scheme for Entropy and Diffusion
The Fokker-Planck equation combines diffusion with drift, and its formal Wasserstein gradient-flow structure was identified in Chapter 2 through Otto's first-variation formula. The question now is whether this formal gradient flow can be constructed by the minimizing-movement recipe. The JKO scheme answers this by taking $X=\mathcal P_2(\mathbb R^n)$ with the $W_2$ distance.
[definition: Free Energy for Fokker-Planck]
Let $V:\mathbb R^n\to\mathbb R$ be a Borel potential. The free energy is the functional $\mathcal F:\mathcal P_2(\mathbb R^n)\to(-\infty,\infty]$ defined as follows. If $\rho\in\mathcal P_2(\mathbb R^n)$ is absolutely continuous with respect to $\mathcal L^n$, with density still denoted by $\rho$, then
\begin{align*}
\mathcal F[\rho]=\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n+\int_{\mathbb R^n}V\rho\,d\mathcal L^n.
\end{align*}
For non-absolutely continuous measures, set $\mathcal F[\rho]=+\infty$.
[/definition]
The first term is the Boltzmann entropy and drives spreading; the second term is potential energy and drives motion down $V$. Their Wasserstein gradient flow is expected to be the Fokker-Planck equation
\begin{align*}
\partial_t\rho=\Delta\rho+\nabla\cdot(\rho\nabla V).
\end{align*}
To make this expectation into a construction, we need the abstract minimizing-movement step with $d=W_2$ and $E=\mathcal F$.
[definition: JKO Scheme]
Let $\rho_0\in\mathcal P_2(\mathbb R^n)$ with $\mathcal F[\rho_0]<\infty$, and let $\tau>0$. The JKO sequence $(\rho_k^\tau)_{k\ge0}$ for $\mathcal F$ is defined by $\rho_0^\tau=\rho_0$ and
\begin{align*}
\rho_{k+1}^\tau\in\operatorname*{argmin}_{\rho\in\mathcal P_2(\mathbb R^n)}\left\{\frac{1}{2\tau}W_2^2(\rho,\rho_k^\tau)+\mathcal F[\rho]\right\}.
\end{align*}
[/definition]
The definition is formally identical to the metric Euler step, but the distance term now encodes mass transport. The minimiser is a probability density whose optimal transport map back to the previous step contains the drift velocity. To identify that velocity, we need the one-step Euler-Lagrange equation.
[quotetheorem:9574]
[citeproof:9574]
This identity is the discrete weak form of the Fokker-Planck equation, but its derivation leans on the regularity hypotheses in an essential way, and it is worth being explicit about what they buy. Smoothness and strict positivity of the minimiser are what allow $\nabla\log\rho_{k+1}^\tau$ to be written down pointwise and integrated against; where the density vanishes, $\log\rho$ degenerates and the entropy first variation must instead be understood weakly, as the divergence of a flux $\nabla\rho_{k+1}^\tau$ rather than as a pointwise gradient. The $C^1$ control on $V$ is what makes the potential first variation $\int\nabla V\cdot\xi\,\rho\,d\mathcal L^n$ finite and the integration by parts legitimate; without a growth bound on $\nabla V$ the perturbation could change the energy by an infinite amount and the Euler-Lagrange equation would be meaningless. What the theorem does not assert is that such a smooth positive minimiser exists, nor that the displayed equation holds across a vanishing-density region; it is a statement about one step under regularity that the convergence theorem will later have to dispense with. The remaining problem is to justify the limiting passage from discrete transport displacements to the continuum velocity field $-\nabla\log\rho-\nabla V$. We now need the JKO convergence theorem, which combines existence of minimisers, compactness of interpolations, and identification of the weak PDE.
[quotetheorem:9575]
[citeproof:9575]
The theorem gives a rigorous construction of diffusion from optimal transport geometry. The coercivity and growth assumptions are not cosmetic: without tightness or second-moment control, mass can escape to infinity along a minimizing sequence, and narrow convergence alone would not identify a limit in $\mathcal P_2(\mathbb R^n)$. Lower semicontinuity is needed so that the variational structure survives the limiting process rather than disappearing at the minimiser. The first test case removes the drift term, leaving only entropy and the spreading of mass.
[example: Heat Equation as Entropy Gradient Flow]
Take $V=0$, so the free energy is the entropy functional
\begin{align*}
\mathcal F[\rho]=\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n.
\end{align*}
The JKO limiting equation from *JKO Theorem for the Fokker-Planck Equation* becomes
\begin{align*}
\partial_t\rho=\Delta\rho+\nabla\cdot(\rho\nabla V)=\Delta\rho+\nabla\cdot(0)=\Delta\rho.
\end{align*}
Thus the entropy-only JKO scheme constructs the heat equation as a Wasserstein gradient flow.
For the Gaussian initial datum $\rho_0=\mathcal N(0,\sigma_0^2I_n)$, set $a(t)=\sigma_0^2+2t$ and
\begin{align*}
\rho(t,x)=(2\pi a(t))^{-n/2}\exp\left(-\frac{|x|^2}{2a(t)}\right).
\end{align*}
Since $a'(t)=2$, differentiating the logarithm gives
\begin{align*}
\partial_t\log\rho(t,x)=-\frac{n}{2}\frac{a'(t)}{a(t)}+\frac{|x|^2}{2a(t)^2}a'(t)=-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}.
\end{align*}
Therefore
\begin{align*}
\partial_t\rho(t,x)=\rho(t,x)\left(-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}\right).
\end{align*}
For the spatial derivatives,
\begin{align*}
\nabla\rho(t,x)=-\frac{x}{a(t)}\rho(t,x).
\end{align*}
Taking the divergence component by component gives
\begin{align*}
\Delta\rho(t,x)=\sum_{i=1}^n\partial_{x_i}\left(-\frac{x_i}{a(t)}\rho(t,x)\right)=\left(-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}\right)\rho(t,x).
\end{align*}
Hence $\partial_t\rho=\Delta\rho$, and the covariance is exactly $a(t)I_n=(\sigma_0^2+2t)I_n$.
This matches the variational picture: for $\rho=\mathcal N(0,aI_n)$,
\begin{align*}
\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n=-\frac{n}{2}\log(2\pi a)-\frac{n}{2},
\end{align*}
whose derivative in $a$ is $-\frac{n}{2a}<0$, so increasing the variance lowers the entropy functional. The $W_2$ term penalises changing the variance too quickly, and in the limit $\tau\downarrow0$ the balance produces the linear variance law $a(t)=\sigma_0^2+2t$.
[/example]
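The Gaussian computation can be sanity-checked numerically. The sketch below works in one dimension with the illustrative choice $\sigma_0^2=1$ and compares central finite-difference estimates of $\partial_t\rho$ and $\partial_{xx}\rho$ for the density above. The function names, the step size `h`, and the sample points are assumptions made for this illustration only.

```python
import math

def gaussian_density(t, x, sigma0_sq=1.0):
    """Density of N(0, a(t)) on the real line, with a(t) = sigma0_sq + 2 t."""
    a = sigma0_sq + 2.0 * t
    return math.exp(-x * x / (2.0 * a)) / math.sqrt(2.0 * math.pi * a)

def heat_residual(t, x, h=1e-3):
    """Central-difference estimate of (d/dt - d^2/dx^2) rho at (t, x)."""
    dt_rho = (gaussian_density(t + h, x) - gaussian_density(t - h, x)) / (2.0 * h)
    dxx_rho = (gaussian_density(t, x + h) - 2.0 * gaussian_density(t, x)
               + gaussian_density(t, x - h)) / (h * h)
    return dt_rho - dxx_rho

# Illustrative sample points (hypothetical choices, not taken from the text).
sample_times = (0.1, 0.5, 2.0)
sample_points = (-2.0, -0.5, 0.0, 1.0, 3.0)
max_residual = max(abs(heat_residual(t, x))
                   for t in sample_times for x in sample_points)
print(f"largest finite-difference residual: {max_residual:.2e}")
```

The residual should be small, at the level of the discretisation error, and should shrink further if `h` is reduced moderately.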
As previewed by the entropy gradient-flow example in Chapters 0 and 2, the heat equation shows the role of entropy alone. Adding a quadratic potential creates the model drift-diffusion semigroup where spreading competes with confinement.
[example: Ornstein-Uhlenbeck Flow]
Let $V(x)=\frac{1}{2}|x|^2$, so $\nabla V(x)=x$. The JKO limiting equation for the free energy
\begin{align*}
\mathcal F[\rho]=\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n+\frac{1}{2}\int_{\mathbb R^n}|x|^2\rho\,d\mathcal L^n
\end{align*}
is therefore
\begin{align*}
\partial_t\rho=\Delta\rho+\nabla\cdot(\rho\nabla V)=\Delta\rho+\nabla\cdot(x\rho).
\end{align*}
This is the Ornstein-Uhlenbeck Fokker-Planck equation.
Let
\begin{align*}
\gamma(x)=(2\pi)^{-n/2}\exp\left(-\frac{|x|^2}{2}\right).
\end{align*}
Then $\nabla\gamma=-x\gamma$, so
\begin{align*}
\Delta\gamma=\nabla\cdot(-x\gamma)=-n\gamma-x\cdot\nabla\gamma=(-n+|x|^2)\gamma.
\end{align*}
Also
\begin{align*}
\nabla\cdot(x\gamma)=n\gamma+x\cdot\nabla\gamma=(n-|x|^2)\gamma.
\end{align*}
Adding the two identities gives
\begin{align*}
\Delta\gamma+\nabla\cdot(x\gamma)=0,
\end{align*}
so $\gamma$ is an invariant density.
The same Gaussian also rewrites the free energy as relative entropy up to an additive constant. Since
\begin{align*}
\log\gamma(x)=-\frac{|x|^2}{2}-\frac{n}{2}\log(2\pi),
\end{align*}
we have
\begin{align*}
\int_{\mathbb R^n}\rho\log\frac{\rho}{\gamma}\,d\mathcal L^n=\int_{\mathbb R^n}\rho\log\rho\,d\mathcal L^n+\frac{1}{2}\int_{\mathbb R^n}|x|^2\rho\,d\mathcal L^n+\frac{n}{2}\log(2\pi).
\end{align*}
Thus
\begin{align*}
\mathcal F[\rho]=\int_{\mathbb R^n}\rho\log\frac{\rho}{\gamma}\,d\mathcal L^n-\frac{n}{2}\log(2\pi).
\end{align*}
For centred Gaussian data $\rho(t)=\mathcal N(0,a(t)I_n)$, write
\begin{align*}
\rho(t,x)=(2\pi a(t))^{-n/2}\exp\left(-\frac{|x|^2}{2a(t)}\right).
\end{align*}
As in the heat-flow computation,
\begin{align*}
\partial_t\rho=\frac{a'(t)}{2}\left(-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}\right)\rho.
\end{align*}
The diffusion term is
\begin{align*}
\Delta\rho=\left(-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}\right)\rho,
\end{align*}
and the drift term is
\begin{align*}
\nabla\cdot(x\rho)=n\rho+x\cdot\nabla\rho=\left(n-\frac{|x|^2}{a(t)}\right)\rho.
\end{align*}
Therefore the equation $\partial_t\rho=\Delta\rho+\nabla\cdot(x\rho)$ is equivalent to
\begin{align*}
\frac{a'(t)}{2}\left(-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}\right)=\left(-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}\right)+\left(n-\frac{|x|^2}{a(t)}\right).
\end{align*}
The right-hand side factors as
\begin{align*}
\left(-\frac{n}{a(t)}+\frac{|x|^2}{a(t)^2}\right)(1-a(t)).
\end{align*}
Matching coefficients gives $a'(t)=2(1-a(t))$, so
\begin{align*}
a(t)=1+(a(0)-1)e^{-2t}.
\end{align*}
The variance is pushed toward the invariant value $a=1$: entropy spreads the density, while the quadratic potential confines it toward the standard Gaussian.
[/example]
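The variance law can also be checked by integrating the scalar equation $a'(t)=2(1-a(t))$ directly. The following sketch uses a classical fourth-order Runge-Kutta integrator, chosen here only for illustration, and compares the result with the closed form $a(t)=1+(a(0)-1)e^{-2t}$; the initial variances, the final time, and the number of steps are assumed values.

```python
import math

def ou_variance_exact(t, a0):
    """Closed-form variance a(t) = 1 + (a0 - 1) exp(-2 t)."""
    return 1.0 + (a0 - 1.0) * math.exp(-2.0 * t)

def ou_variance_rk4(t_final, a0, steps=200):
    """Integrate a'(t) = 2 (1 - a(t)) with the classical fourth-order Runge-Kutta method."""
    def rhs(a):
        return 2.0 * (1.0 - a)
    h = t_final / steps
    a = a0
    for _ in range(steps):
        k1 = rhs(a)
        k2 = rhs(a + 0.5 * h * k1)
        k3 = rhs(a + 0.5 * h * k2)
        k4 = rhs(a + h * k3)
        a += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return a

# Start above and below the invariant variance a = 1 (illustrative values).
for a0 in (4.0, 0.25):
    numeric = ou_variance_rk4(1.5, a0)
    exact = ou_variance_exact(1.5, a0)
    print(f"a0={a0}: rk4={numeric:.8f}, closed form={exact:.8f}")
```

Both initial variances relax toward $a=1$, from above and from below respectively.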
These two examples also indicate why $W_2$ is the natural metric: the discrete velocity is a transport velocity, not a pointwise relaxation of density values.
## Compactness, Energy Dissipation, and Passage to the PDE Limit
The JKO theorem rests on three estimates that recur throughout the theory of Wasserstein gradient flows. We need compactness to extract a curve, an energy inequality to control motion, and a weak formulation to identify the PDE. Each estimate is already visible at the discrete level.
[quotetheorem:9576]
[citeproof:9576]
The inequality says that every unit of transport has to be paid for by energy decrease. It gives a discrete analogue of square-integrability of the metric derivative. The next problem is to turn this estimate into compactness of time-dependent measures.
[quotetheorem:9577]
[citeproof:9577]
Compactness gives a candidate limit, but the limiting PDE requires testing the discrete optimality condition. The moment assumption is essential because narrow compactness by itself does not control the quadratic transport cost; measures may converge narrowly while their second moments escape. The increment estimate is what turns time-slice compactness into compactness of curves, since otherwise different times could be selected independently with no coherent motion between them. We also need a continuum energy statement that survives the limiting process and records the amount of dissipation lost between times $s$ and $t$.
[quotetheorem:9578]
[citeproof:9578]
This is the metric substitute for the identity $\frac{d}{dt}E(u(t))=-\|\nabla E(u(t))\|^2$ in Hilbert space. In nonsmooth settings it is often an inequality, but it is strong enough to imply uniqueness under suitable convexity hypotheses. A final finite-dimensional calculation shows these abstract estimates at work in a computable JKO step.
[example: One-Step JKO Calculation for a Gaussian Ansatz]
Consider the heat-flow case $V=0$ in one dimension and restrict the minimisation to centred Gaussian densities $\rho_s=\mathcal N(0,s)$ with variance $s>0$, where the previous step is $\rho_{s_0}$ with variance $s_0>0$. Write $r=\sqrt{s}$ and $r_0=\sqrt{s_0}$. The increasing linear map $T(x)=\frac{r_0}{r}x$ pushes $\rho_s$ forward to $\rho_{s_0}$, so in one dimension it is the optimal (monotone) transport map and
\begin{align*}
W_2^2(\rho_s,\rho_{s_0})=\int_{\mathbb R}\left|x-\frac{r_0}{r}x\right|^2\rho_s(x)\,dx.
\end{align*}
Since $\int_{\mathbb R}x^2\rho_s(x)\,dx=s=r^2$, this becomes
\begin{align*}
W_2^2(\rho_s,\rho_{s_0})=\left(1-\frac{r_0}{r}\right)^2r^2=(r-r_0)^2=(\sqrt{s}-\sqrt{s_0})^2.
\end{align*}
For the entropy term, the density is
\begin{align*}
\rho_s(x)=(2\pi s)^{-1/2}\exp\left(-\frac{x^2}{2s}\right),
\end{align*}
hence
\begin{align*}
\log\rho_s(x)=-\frac{1}{2}\log(2\pi s)-\frac{x^2}{2s}.
\end{align*}
Using $\int_{\mathbb R}\rho_s\,dx=1$ and $\int_{\mathbb R}x^2\rho_s(x)\,dx=s$, we get
\begin{align*}
\int_{\mathbb R}\rho_s\log\rho_s\,dx=-\frac{1}{2}\log(2\pi s)-\frac{1}{2}=-\frac{1}{2}\log(2\pi e s).
\end{align*}
Thus the Gaussian-restricted one-step JKO objective is
\begin{align*}
J_\tau(s)=\frac{1}{2\tau}(\sqrt{s}-\sqrt{s_0})^2-\frac{1}{2}\log(2\pi e s).
\end{align*}
In the variable $r=\sqrt{s}$, the same objective is
\begin{align*}
J_\tau(r)=\frac{1}{2\tau}(r-r_0)^2-\frac{1}{2}\log(2\pi e r^2)=\frac{1}{2\tau}(r-r_0)^2-\log r-\frac{1}{2}\log(2\pi e).
\end{align*}
Differentiating for $r>0$ gives
\begin{align*}
J_\tau'(r)=\frac{1}{\tau}(r-r_0)-\frac{1}{r}.
\end{align*}
At the minimiser this derivative vanishes, so
\begin{align*}
\frac{1}{\tau}(r-r_0)-\frac{1}{r}=0.
\end{align*}
Multiplying by $\tau r$ gives
\begin{align*}
r(r-r_0)=\tau.
\end{align*}
Equivalently,
\begin{align*}
r^2-r_0r-\tau=0.
\end{align*}
The positive root is
\begin{align*}
r=\frac{r_0+\sqrt{r_0^2+4\tau}}{2}.
\end{align*}
Using $\sqrt{r_0^2+4\tau}=r_0\sqrt{1+\frac{4\tau}{r_0^2}}=r_0+\frac{2\tau}{r_0}+O(\tau^2)$ as $\tau\downarrow0$, we obtain
\begin{align*}
r=r_0+\frac{\tau}{r_0}+O(\tau^2).
\end{align*}
Therefore
\begin{align*}
r^2-s_0=(r-r_0)(r+r_0)=\left(\frac{\tau}{r_0}+O(\tau^2)\right)\left(2r_0+O(\tau)\right)=2\tau+O(\tau^2).
\end{align*}
Thus one Gaussian JKO step increases the variance by $2\tau$ to first order, matching the heat equation variance law $s(t)=s_0+2t$.
[/example]
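The one-step formula can be iterated to see the discrete variance approach the heat-equation law. The sketch below applies the positive root $r=\tfrac12\bigl(r_0+\sqrt{r_0^2+4\tau}\bigr)$ repeatedly; the function names and the values of $s_0$, the final time, and the step counts are assumptions chosen for illustration.

```python
import math

def gaussian_jko_step(r_prev, tau):
    """Positive root of r^2 - r_prev * r - tau = 0 (standard deviation after one step)."""
    return 0.5 * (r_prev + math.sqrt(r_prev * r_prev + 4.0 * tau))

def gaussian_jko_variance(s0, t_final, n_steps):
    """Variance after n_steps Gaussian-restricted JKO steps of size t_final / n_steps."""
    tau = t_final / n_steps
    r = math.sqrt(s0)
    for _ in range(n_steps):
        r = gaussian_jko_step(r, tau)
    return r * r

s0, t_final = 1.0, 1.0  # illustrative initial variance and final time
heat_variance = s0 + 2.0 * t_final
errors = [abs(gaussian_jko_variance(s0, t_final, n) - heat_variance)
          for n in (10, 100, 1000)]
print(errors)
```

In this restricted setting the error should shrink roughly in proportion to $\tau$, consistent with an $O(\tau^2)$ discrepancy per step accumulated over $t/\tau$ steps.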
The Gaussian calculation is not a proof of the full theorem, since it restricts the admissible densities to a finite-dimensional family. Its value is diagnostic: it shows how the transport penalty and entropy variation combine to produce the correct diffusion rate. The general proof replaces this finite-dimensional calculation by the compactness, lower semicontinuity, and weak Euler-Lagrange estimates developed above.
# 5. Nonlinear Diffusion and Aggregation Equations
Chapters 2 through 4 developed Wasserstein gradient flows as a calculus and a minimizing-movement construction for energies on probability measures. This chapter applies that calculus to nonlinear diffusion and aggregation equations, where the unknown density both spreads by an internal energy and moves under self-interaction or external confinement. The guiding question is how much qualitative PDE information is already encoded in the geometry of the energy: existence of weak evolutions, contraction estimates, equilibrium structure, and long-time convergence. The chapter assumes the Wasserstein distance $W_2$ on $\mathcal P_2(\mathbb R^n)$ from Chapter 1, first variations of integral functionals from Chapter 2, weak formulations of parabolic equations, and the JKO minimizing-movement construction from Chapter 4.
## Porous Medium and Fast Diffusion as Wasserstein Gradient Flows
The first problem is to identify which nonlinear parabolic equations arise when the entropy in the heat equation is replaced by a different internal energy. In Euclidean space the Wasserstein gradient of an integral functional is computed by a first variation, so nonlinear diffusion appears once that first variation is a nonlinear function of the density.
Let $\rho$ denote a probability density on $\mathbb R^n$. For $m>0$ with $m\ne 1$, the model internal energy is the power entropy.
[definition: Power Entropy]
Let $m>0$ with $m\ne 1$, and define
\begin{align*}
\mathcal A_m := \{\rho\in\mathcal P_2(\mathbb R^n): \rho\ll\mathcal L^n,\ d\rho=\rho(x)\,d\mathcal L^n(x),\ \rho\in L^m(\mathbb R^n)\}.
\end{align*}
The power entropy is the functional $\mathcal U_m:\mathcal A_m\to\mathbb R$ given by
\begin{align*}
\mathcal U_m[\rho]=\frac{1}{m-1}\int_{\mathbb R^n}\rho(x)^m\,d\mathcal L^n(x).
\end{align*}
[/definition]
The sign convention is chosen so that the first variation is a useful pressure variable. At the level of the density variable the integrand $r^m/(m-1)$ has second derivative $m r^{m-2}>0$ for every $m>0$, so the distinction between $m>1$ and $0<m<1$ is not ordinary pointwise convexity. For $m>1$ the functional penalises concentration through a positive superlinear energy; for $0<m<1$ the energy is negative and the first variation is singular near vacuum, so displacement convexity and well-posedness require more care.
[example: First Variation of the Power Entropy]
Let $\rho$ be a smooth positive probability density and let $\rho_\varepsilon=\rho+\varepsilon\sigma$, where $\int_{\mathbb R^n}\sigma\,d\mathcal L^n=0$ and $\rho_\varepsilon>0$ for small $|\varepsilon|$. For such $\varepsilon$,
\begin{align*}
\mathcal U_m[\rho_\varepsilon]=\frac{1}{m-1}\int_{\mathbb R^n}(\rho(x)+\varepsilon\sigma(x))^m\,d\mathcal L^n(x).
\end{align*}
Differentiating the integrand with respect to $\varepsilon$ gives
\begin{align*}
\frac{d}{d\varepsilon}(\rho(x)+\varepsilon\sigma(x))^m=m(\rho(x)+\varepsilon\sigma(x))^{m-1}\sigma(x).
\end{align*}
Therefore at $\varepsilon=0$,
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal U_m[\rho_\varepsilon]=\int_{\mathbb R^n}\frac{m}{m-1}\rho(x)^{m-1}\sigma(x)\,d\mathcal L^n(x).
\end{align*}
Since this is the pairing of the perturbation $\sigma$ with the first variation, we obtain
\begin{align*}
\frac{\delta \mathcal U_m}{\delta \rho}(x)=\frac{m}{m-1}\rho(x)^{m-1}.
\end{align*}
The formal Wasserstein velocity is therefore
\begin{align*}
v=-\nabla\left(\frac{m}{m-1}\rho^{m-1}\right).
\end{align*}
Using the chain rule,
\begin{align*}
\nabla\left(\frac{m}{m-1}\rho^{m-1}\right)=m\rho^{m-2}\nabla\rho.
\end{align*}
Substitution into $\partial_t\rho+\nabla\cdot(\rho v)=0$ gives
\begin{align*}
\partial_t\rho=\nabla\cdot(m\rho^{m-1}\nabla\rho).
\end{align*}
Finally, because $\nabla(\rho^m)=m\rho^{m-1}\nabla\rho$, the equation becomes
\begin{align*}
\partial_t\rho=\Delta(\rho^m).
\end{align*}
Thus the power entropy produces the porous medium or fast diffusion equation, depending on the exponent $m$.
[/example]
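A quick numerical check of the first-variation formula is possible on a one-dimensional grid. The sketch below takes the standard Gaussian as $\rho$, the zero-mean perturbation $\sigma=(x^2-1)\rho$, and the exponent $m=3/2$, all of which are illustrative assumptions, and compares a central difference of $\varepsilon\mapsto\mathcal U_m[\rho+\varepsilon\sigma]$ with the pairing $\int\frac{m}{m-1}\rho^{m-1}\sigma$.

```python
import math

M_EXP = 1.5                      # illustrative exponent m > 1
N_PTS = 1601
X_MAX = 8.0
DX = 2.0 * X_MAX / (N_PTS - 1)
xs = [-X_MAX + i * DX for i in range(N_PTS)]

def trapezoid(values):
    """Composite trapezoid rule on the uniform grid xs."""
    return DX * (sum(values) - 0.5 * (values[0] + values[-1]))

# Standard Gaussian density and a zero-mean perturbation sigma = (x^2 - 1) rho.
rho = [math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) for x in xs]
sigma = [(x * x - 1.0) * r for x, r in zip(xs, rho)]

def power_entropy(density):
    """Discrete version of U_m[rho] = (1/(m-1)) * integral of rho^m."""
    return trapezoid([d ** M_EXP for d in density]) / (M_EXP - 1.0)

eps = 1e-4
rho_plus = [r + eps * s for r, s in zip(rho, sigma)]
rho_minus = [r - eps * s for r, s in zip(rho, sigma)]
finite_difference = (power_entropy(rho_plus) - power_entropy(rho_minus)) / (2.0 * eps)

# Pairing of sigma with the first variation (m/(m-1)) rho^(m-1).
pairing = trapezoid([M_EXP / (M_EXP - 1.0) * r ** (M_EXP - 1.0) * s
                     for r, s in zip(rho, sigma)])
print(finite_difference, pairing)
```

The two numbers should agree up to the $O(\varepsilon^2)$ error of the central difference. The pairing is negative here because this perturbation moves mass from the centre toward the tails, which lowers the power entropy for $m>1$.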
The formal computation identifies the velocity field that the Wasserstein gradient-flow equation should produce, but it is not yet a statement about an evolving density. One still has to verify that the continuity equation with this velocity is exactly the porous medium equation, and that the integrations by parts used in the calculation are legitimate for the smooth class under discussion.
This is the first place where the formal Wasserstein recipe is tested against a concrete nonlinear PDE. The aim is to isolate hypotheses, namely smooth positivity and decay, under which the energy variation, the continuity equation, and the classical differential equation all describe the same smooth evolution, so that the gradient-flow identity selects precisely the porous medium evolution rather than only suggesting it at the level of symbols. The theorem records this equivalence so that later nonlinear diffusion examples can use the gradient-flow computation as a justified model case.
[quotetheorem:9579]
[citeproof:9579]
The porous medium result shows how the exponent $m>1$ creates degenerate diffusion at vacuum. The hypotheses in the theorem are part of the formal calculation, not cosmetic assumptions: positivity allows the first variation $\rho^{m-1}$ to be differentiated without discussing free boundaries, decay at infinity removes boundary flux terms, and finite second moment places the curve in $\mathcal P_2(\mathbb R^n)$ where $W_2$ is finite. If compactly supported weak solutions are allowed, the density may vanish on large regions and the equation must be interpreted through weak fluxes rather than by differentiating $\rho^{m-1}$ pointwise. If the decay assumption is dropped, integrations by parts may acquire boundary terms at infinity and the Wasserstein energy identity can fail even when the PDE is formally the same.
The porous medium equation covers the exponent range in which diffusion degenerates at vacuum. The complementary range has the same algebraic form, but the coefficient $m\rho^{m-1}$ becomes singular as the density approaches zero, so the analytic behaviour changes even before one discusses interactions. It is useful to name this equation separately because its weak solutions may require different integrability assumptions, extinction criteria, and mass-conservation arguments.
[definition: Fast Diffusion Equation]
Let $0<m<1$. A distributional solution of the fast diffusion equation is a nonnegative density $\rho\in L^1_{\mathrm{loc}}((0,\infty)\times\mathbb R^n)$ with $\rho^m\in L^1_{\mathrm{loc}}((0,\infty)\times\mathbb R^n)$ such that
\begin{align*}
\partial_t\rho=\Delta(\rho^m)
\end{align*}
in $\mathcal D'((0,\infty)\times\mathbb R^n)$.
[/definition]
For $0<m<1$ the diffusivity $m\rho^{m-1}$ becomes singular as $\rho\to0$. This singularity is the analytic reason that fast diffusion may spread with infinite speed and, below critical exponents, may lose mass or extinguish in finite time.
[example: Barenblatt Profiles]
For $m>1$, look for a radially symmetric self-similar solution of
\begin{align*}
\partial_t\rho=\Delta(\rho^m)
\end{align*}
in the form
\begin{align*}
\rho_t(x)=t^{-\alpha}F(xt^{-\beta}).
\end{align*}
Mass conservation forces $\alpha=n\beta$, because with $z=xt^{-\beta}$ and $dx=t^{n\beta}\,dz$,
\begin{align*}
\int_{\mathbb R^n}\rho_t(x)\,dx=t^{-\alpha+n\beta}\int_{\mathbb R^n}F(z)\,dz.
\end{align*}
The scaling of the equation also requires $\alpha+1=m\alpha+2\beta$, since $\partial_t\rho$ has time homogeneity $t^{-\alpha-1}$ while $\Delta(\rho^m)$ has time homogeneity $t^{-m\alpha-2\beta}$. Combining $\alpha=n\beta$ with $\alpha+1=m\alpha+2\beta$ gives
\begin{align*}
1=(m-1)n\beta+2\beta.
\end{align*}
Hence
\begin{align*}
\beta=\frac{1}{n(m-1)+2}
\end{align*}
and
\begin{align*}
\alpha=\frac{n}{n(m-1)+2}.
\end{align*}
Now set, with a free constant $C>0$ and a constant $k>0$ to be determined,
\begin{align*}
\rho_t(x)=t^{-\alpha}\left(C-k|x|^2t^{-2\beta}\right)_+^{1/(m-1)}.
\end{align*}
Inside the positivity set, write $A=C-k|x|^2t^{-2\beta}$ and $q=1/(m-1)$. Then $\rho=t^{-\alpha}A^q$, and
\begin{align*}
\partial_t\rho=t^{-\alpha-1}\left(-\alpha A^q+2\beta kq|x|^2t^{-2\beta}A^{q-1}\right).
\end{align*}
Also $\rho^m=t^{-m\alpha}A^{q+1}$. Since $\nabla A=-2kt^{-2\beta}x$,
\begin{align*}
\nabla(A^{q+1})=-2k(q+1)t^{-2\beta}xA^q.
\end{align*}
Taking the divergence gives
\begin{align*}
\Delta(A^{q+1})=-2kn(q+1)t^{-2\beta}A^q+4k^2q(q+1)|x|^2t^{-4\beta}A^{q-1}.
\end{align*}
Therefore
\begin{align*}
\Delta(\rho^m)=t^{-m\alpha-2\beta}\left(-2kn(q+1)A^q+4k^2q(q+1)|x|^2t^{-2\beta}A^{q-1}\right).
\end{align*}
The time powers match because $\alpha+1=m\alpha+2\beta$. Matching the coefficient of $A^q$ gives $\alpha=2kn(q+1)$, and matching the coefficient of $|x|^2t^{-2\beta}A^{q-1}$ gives
\begin{align*}
2\beta kq=4k^2q(q+1).
\end{align*}
Since $q+1=m/(m-1)$, this fixes
\begin{align*}
k=\frac{\beta}{2(q+1)}=\frac{m-1}{2m\{n(m-1)+2\}}.
\end{align*}
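With this $k$, the first condition reads $\alpha=2kn(q+1)=n\beta$, which is the mass-conservation relation already imposed, so both coefficient conditions hold. Thus the Barenblatt profiles are compactly supported in the closed ball $|x|\le\sqrt{C/k}\,t^\beta$, conserve mass by the relation $\alpha=n\beta$, and solve the porous medium equation inside their positivity set. Their support expands like $t^\beta$, which is the finite-speed propagation feature absent from the linear heat equation.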
[/example]
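The exponents and the constant $k$ derived above can be tested against the equation itself. The following sketch evaluates the one-dimensional profile with the illustrative normalisation $C=1$ and estimates $\partial_t\rho-\partial_{xx}(\rho^m)$ by central differences at points inside the positivity set; the function names, the exponents $m\in\{2,3\}$, and the sample points are assumptions made for this check.

```python
import math

def barenblatt_constants(m, n):
    """Exponents alpha, beta and the constant k from the scaling and matching relations."""
    beta = 1.0 / (n * (m - 1.0) + 2.0)
    alpha = n * beta
    k = (m - 1.0) * beta / (2.0 * m)
    return alpha, beta, k

def barenblatt_1d(t, x, m, c=1.0):
    """One-dimensional profile t^(-alpha) (c - k x^2 t^(-2 beta))_+^(1/(m-1))."""
    alpha, beta, k = barenblatt_constants(m, 1)
    a = c - k * x * x * t ** (-2.0 * beta)
    return t ** (-alpha) * max(a, 0.0) ** (1.0 / (m - 1.0))

def pme_residual(t, x, m, h=1e-3):
    """Central-difference estimate of d/dt rho - d^2/dx^2 (rho^m) at (t, x)."""
    dt_rho = (barenblatt_1d(t + h, x, m) - barenblatt_1d(t - h, x, m)) / (2.0 * h)
    def rho_m(y):
        return barenblatt_1d(t, y, m) ** m
    dxx_rho_m = (rho_m(x + h) - 2.0 * rho_m(x) + rho_m(x - h)) / (h * h)
    return dt_rho - dxx_rho_m

# Sample points chosen well inside the positivity set (illustrative values).
worst = max(abs(pme_residual(t, x, m))
            for m in (2.0, 3.0)
            for t in (0.5, 1.0, 2.0)
            for x in (0.0, 0.5, 1.5, 2.5))
print(f"largest residual inside the support: {worst:.2e}")
```

The check is deliberately restricted to the interior: at the free boundary the profile is not smooth, so pointwise finite differences are not meaningful there.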
The Barenblatt family is more than a collection of special solutions. It gives the asymptotic shape for many porous medium evolutions after the natural rescaling, so the Wasserstein energy picture connects local PDE smoothing with global convergence to a self-similar attractor.
## Aggregation-Diffusion Energies and Interaction Potentials
Diffusion alone spreads mass, but many applications include attraction, repulsion, or confinement. The next question is how to combine local spreading with nonlocal transport while retaining a gradient-flow structure.
The standard energy has three pieces: an internal energy, a potential energy, and an interaction energy. Each term contributes additively to the first variation, so the PDE is obtained by taking the Wasserstein gradient of the total first variation.
[definition: Aggregation-Diffusion Energy]
Let $U:[0,\infty)\to\mathbb R$, let $V:\mathbb R^n\to\mathbb R$, and let $W:\mathbb R^n\to\mathbb R$ be an even interaction potential. Define $D(\mathcal E)$ to be the set of probability densities $\rho\in\mathcal P_2(\mathbb R^n)$, written $d\rho=\rho(x)\,d\mathcal L^n(x)$, for which the three integrals below are well-defined and finite. The aggregation-diffusion energy is the functional $\mathcal E:D(\mathcal E)\to\mathbb R$ given by
\begin{align*}
\mathcal E[\rho]=\int_{\mathbb R^n}U(\rho(x))\,d\mathcal L^n(x)+\int_{\mathbb R^n}V(x)\rho(x)\,d\mathcal L^n(x)+\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
[/definition]
The factor $1/2$ prevents double-counting pairs. Evenness of $W$ makes the interaction force antisymmetric between two particles and is the continuum analogue of action-reaction symmetry.
[example: First Variation of an Interaction Energy]
Assume $W$ is even and smooth, and let
\begin{align*}
\mathcal W[\rho]=\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
For a smooth mass-preserving perturbation $\rho_\varepsilon=\rho+\varepsilon\sigma$, expand the two density factors:
\begin{align*}
\rho_\varepsilon(x)\rho_\varepsilon(y)=\rho(x)\rho(y)+\varepsilon\sigma(x)\rho(y)+\varepsilon\rho(x)\sigma(y)+\varepsilon^2\sigma(x)\sigma(y).
\end{align*}
Substituting this expansion into $\mathcal W[\rho_\varepsilon]$ gives
\begin{align*}
\mathcal W[\rho_\varepsilon]=\mathcal W[\rho]+\frac{\varepsilon}{2}\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\sigma(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y)+\frac{\varepsilon}{2}\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y)+\frac{\varepsilon^2}{2}\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\sigma(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
Differentiating at $\varepsilon=0$ removes the quadratic term and leaves
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal W[\rho_\varepsilon]=\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\sigma(x)\rho(y)\,d\mathcal L^n(x)d\mathcal L^n(y)+\frac12\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
In the second double integral, interchange the names of the variables $x$ and $y$:
\begin{align*}
\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(x)\sigma(y)\,d\mathcal L^n(x)d\mathcal L^n(y)=\int_{\mathbb R^n}\int_{\mathbb R^n}W(y-x)\rho(y)\sigma(x)\,d\mathcal L^n(x)d\mathcal L^n(y).
\end{align*}
Since $W$ is even, $W(y-x)=W(x-y)$, so the two linear terms are equal. Hence
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal W[\rho_\varepsilon]=\int_{\mathbb R^n}\int_{\mathbb R^n}W(x-y)\rho(y)\sigma(x)\,d\mathcal L^n(y)d\mathcal L^n(x).
\end{align*}
By the definition of convolution,
\begin{align*}
(W*\rho)(x)=\int_{\mathbb R^n}W(x-y)\rho(y)\,d\mathcal L^n(y).
\end{align*}
Therefore
\begin{align*}
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\mathcal W[\rho_\varepsilon]=\int_{\mathbb R^n}(W*\rho)(x)\sigma(x)\,d\mathcal L^n(x).
\end{align*}
This is the pairing of the perturbation $\sigma$ with the first variation, so
\begin{align*}
\frac{\delta\mathcal W}{\delta\rho}=W*\rho.
\end{align*}
The interaction part of the Wasserstein velocity is therefore $-\nabla(W*\rho)$, meaning the nonlocal force at $x$ is determined by the convolution of $W$ with the whole density.
[/example]
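Because $\mathcal W$ is quadratic in $\rho$, the identity $\frac{d}{d\varepsilon}\mathcal W[\rho_\varepsilon]\big|_{\varepsilon=0}=\int(W*\rho)\sigma$ can be tested on a grid with essentially no discretisation error in $\varepsilon$. The sketch below uses a hypothetical kernel $W(z)=e^{-z^2}$, a Gaussian density centred at $0.5$, and an odd perturbation; all three are illustrative choices, not taken from the text.

```python
import math

N_PTS = 201
X_MAX = 5.0
DX = 2.0 * X_MAX / (N_PTS - 1)
xs = [-X_MAX + i * DX for i in range(N_PTS)]

def kernel(z):
    """Hypothetical even, smooth interaction potential W(z) = exp(-z^2)."""
    return math.exp(-z * z)

# Precompute the symmetric matrix W(x_i - x_j).
W = [[kernel(xi - xj) for xj in xs] for xi in xs]

def convolve(density):
    """Riemann-sum approximation of (W * rho)(x_i)."""
    return [DX * sum(w * d for w, d in zip(row, density)) for row in W]

def interaction_energy(density):
    """Discrete version of (1/2) * double integral of W(x - y) rho(x) rho(y)."""
    conv = convolve(density)
    return 0.5 * DX * sum(c * d for c, d in zip(conv, density))

# Gaussian density centred at 0.5 and an odd, zero-mean perturbation.
rho = [math.exp(-(x - 0.5) ** 2 / 2.0) / math.sqrt(2.0 * math.pi) for x in xs]
sigma = [x * math.exp(-x * x / 2.0) for x in xs]

eps = 1e-3
plus = [r + eps * s for r, s in zip(rho, sigma)]
minus = [r - eps * s for r, s in zip(rho, sigma)]
finite_difference = (interaction_energy(plus) - interaction_energy(minus)) / (2.0 * eps)
pairing = DX * sum(c * s for c, s in zip(convolve(rho), sigma))
print(finite_difference, pairing)
```

The agreement relies on the symmetry of the matrix `W`, which is the discrete counterpart of the evenness of the kernel used in the derivation.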
The interaction calculation supplies the missing nonlocal part of the first variation, but the full energy combines three different forces with different analytic origins. The internal energy contributes a density-dependent diffusion, the potential contributes a local drift, and the interaction contributes a convolution force.
The next obstruction is bookkeeping at the level of the evolution equation: each first variation must be inserted into the same Wasserstein velocity without losing the continuity-equation structure. A general aggregation-diffusion formula records exactly how the local diffusion, external drift, and nonlocal interaction assemble into one PDE.
[quotetheorem:9580]
[citeproof:9580]
This theorem packages many familiar PDEs in a single template. Its hypotheses also mark the limits of the formal derivation: smoothness of $U$ is used to identify the internal first variation as $U'(\rho)$, regularity of $V$ and $W*\rho$ is used to define the velocity field, and decay or a no-flux boundary condition is needed to avoid losing mass through the boundary. Evenness of $W$ is what turns the pair interaction into the symmetric first variation $W*\rho$; without it, the factor $1/2$ no longer represents a simple double-counting correction. Singular kernels such as the Newtonian or logarithmic interaction in the Keller-Segel model below require separate approximation and compactness arguments, because $W*\rho$ may not be classically differentiable at points of concentration. The balance between diffusion and attraction decides whether solutions disperse, converge, or concentrate.
[example: Keller-Segel Energy Below Critical Mass]
Fix a nonnegative density $\rho$ on $\mathbb R^2$ with total mass $M$, and write the Keller-Segel free energy, with the sign convention in which the logarithmic interaction is attractive, as
\begin{align*}
\mathcal F[\rho]=\int_{\mathbb R^2}\rho(x)\log\rho(x)\,dx+\frac{1}{4\pi}\int_{\mathbb R^2}\int_{\mathbb R^2}\log|x-y|\rho(x)\rho(y)\,dxdy.
\end{align*}
The factor $1/(4\pi)$ comes from the definition of the interaction energy with $W(x)=(2\pi)^{-1}\log|x|$ and the prefactor $1/2$ in the aggregation-diffusion energy.
To see where the threshold $8\pi$ comes from, concentrate the density by setting $\rho_\varepsilon(x)=\varepsilon^{-2}\rho(x/\varepsilon)$ for $0<\varepsilon<1$. Its mass is unchanged, because with $x=\varepsilon z$,
\begin{align*}
\int_{\mathbb R^2}\rho_\varepsilon(x)\,dx=\int_{\mathbb R^2}\varepsilon^{-2}\rho(x/\varepsilon)\,dx=\int_{\mathbb R^2}\rho(z)\,dz=M.
\end{align*}
For the entropy term, the same change of variables gives
\begin{align*}
\int_{\mathbb R^2}\rho_\varepsilon(x)\log\rho_\varepsilon(x)\,dx=\int_{\mathbb R^2}\rho(z)\log(\varepsilon^{-2}\rho(z))\,dz.
\end{align*}
Since $\log(\varepsilon^{-2}\rho(z))=\log\rho(z)-2\log\varepsilon$, this becomes
\begin{align*}
\int_{\mathbb R^2}\rho_\varepsilon\log\rho_\varepsilon=\int_{\mathbb R^2}\rho\log\rho-2M\log\varepsilon.
\end{align*}
For the interaction term, put $x=\varepsilon a$ and $y=\varepsilon b$. Then $dxdy=\varepsilon^4\,dadb$ and $\rho_\varepsilon(x)\rho_\varepsilon(y)=\varepsilon^{-4}\rho(a)\rho(b)$, so
\begin{align*}
\frac{1}{4\pi}\int_{\mathbb R^2}\int_{\mathbb R^2}\log|x-y|\rho_\varepsilon(x)\rho_\varepsilon(y)\,dxdy=\frac{1}{4\pi}\int_{\mathbb R^2}\int_{\mathbb R^2}\log(\varepsilon|a-b|)\rho(a)\rho(b)\,dadb.
\end{align*}
Using $\log(\varepsilon|a-b|)=\log\varepsilon+\log|a-b|$ and $\int\rho=M$, we get
\begin{align*}
\frac{1}{4\pi}\int_{\mathbb R^2}\int_{\mathbb R^2}\log|x-y|\rho_\varepsilon(x)\rho_\varepsilon(y)\,dxdy=\frac{M^2}{4\pi}\log\varepsilon+\frac{1}{4\pi}\int_{\mathbb R^2}\int_{\mathbb R^2}\log|a-b|\rho(a)\rho(b)\,dadb.
\end{align*}
Therefore
\begin{align*}
\mathcal F[\rho_\varepsilon]=\mathcal F[\rho]+\left(-2M+\frac{M^2}{4\pi}\right)\log\varepsilon.
\end{align*}
Since $-2M+M^2/(4\pi)=M(M-8\pi)/(4\pi)$, the free energy is scale-invariant at $M=8\pi$ and tends to $-\infty$ along concentrating families when $M>8\pi$. For $M<8\pi$, the logarithmic Hardy-Littlewood-Sobolev inequality gives a lower bound for $\mathcal F$ under the fixed mass constraint, so the energy no longer drives collapse by this concentration mechanism and the gradient-flow construction can produce global-in-time evolutions.
[/example]
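The scaling identity $\mathcal F[\rho_\varepsilon]-\mathcal F[\rho]=\bigl(-2M+\tfrac{M^2}{4\pi}\bigr)\log\varepsilon$ is easy to tabulate. The sketch below only evaluates this formula for a few masses and concentration scales; the function names and the sample values are assumptions made for illustration.

```python
import math

CRITICAL_MASS = 8.0 * math.pi

def concentration_coefficient(mass):
    """Coefficient of log(eps) in F[rho_eps] - F[rho], namely -2 M + M^2 / (4 pi)."""
    return -2.0 * mass + mass * mass / (4.0 * math.pi)

def energy_shift(mass, eps):
    """Change F[rho_eps] - F[rho] under the mass-preserving rescaling by eps."""
    return concentration_coefficient(mass) * math.log(eps)

# One mass below, one at, and one above the threshold (illustrative values).
for mass in (4.0 * math.pi, CRITICAL_MASS, 10.0 * math.pi):
    shifts = [energy_shift(mass, eps) for eps in (1e-1, 1e-3, 1e-6)]
    print(f"M = {mass / math.pi:.0f} pi:", [f"{s:+.2f}" for s in shifts])
```

Below the threshold the shift is positive and grows as $\varepsilon\downarrow0$, so concentration costs energy; above it the shift is negative and unbounded, matching the collapse mechanism described in the example.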
The Keller-Segel example is the prototype for a competition between entropy and attraction. It also shows why lower bounds and displacement convexity are different issues: an energy can be bounded below in a mass range without being globally convex along every Wasserstein geodesic.
## Long-Time Behavior, Equilibria, and Contractivity
Once a gradient flow has been identified, the central qualitative question is whether two solutions move closer together and whether a single solution approaches a minimiser. In Wasserstein space these questions are governed by geodesic convexity, expressed through the Evolution Variational Inequality.
The EVI formulation is useful because it contains the gradient-flow equation, energy dissipation, and contraction estimates in one metric inequality. It is also stable under approximation schemes, which makes it a preferred definition for nonsmooth energies.
[definition: Evolution Variational Inequality]
Let $(X,d)$ be a metric space, let $\mathcal E:X\to(-\infty,\infty]$, and let $\lambda\in\mathbb R$. A locally absolutely continuous curve $(u_t)_{t>0}$ satisfies the $\operatorname{EVI}_\lambda$ condition for $\mathcal E$ if for every $v\in D(\mathcal E)$ and for a.e. $t>0$,
\begin{align*}
\frac12\frac{d}{dt}d(u_t,v)^2+\frac{\lambda}{2}d(u_t,v)^2+\mathcal E[u_t]\le\mathcal E[v].
\end{align*}
[/definition]
The comparison point $v$ is fixed while the flow point $u_t$ moves, so the inequality measures whether the flow decreases its squared distance to every competitor at the rate predicted by the energy gap. In a smooth Hilbert-space gradient flow this kind of estimate follows by differentiating the squared distance and using convexity of the energy. The Wasserstein version has to replace linear segments by geodesics and ordinary convexity by geodesic $\lambda$-convexity, which is the obstruction the metric formulation must overcome.
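As a sketch of that Hilbert-space computation, assume (for illustration only) that $H$ is a Hilbert space, that $\mathcal E\in C^1(H)$ satisfies the $\lambda$-convexity inequality $\mathcal E[v]\ge\mathcal E[u]+\langle\nabla\mathcal E[u],v-u\rangle+\frac{\lambda}{2}|v-u|^2$, and that $\dot u_t=-\nabla\mathcal E[u_t]$. Then for fixed $v$,
\begin{align*}
\frac12\frac{d}{dt}|u_t-v|^2=\langle u_t-v,\dot u_t\rangle=\langle\nabla\mathcal E[u_t],v-u_t\rangle\le\mathcal E[v]-\mathcal E[u_t]-\frac{\lambda}{2}|u_t-v|^2,
\end{align*}
which is the $\operatorname{EVI}_\lambda$ inequality with $d(u,v)=|u-v|$ after moving the last two terms to the left. The only place where linear structure enters is the segment from $u_t$ to $v$ hidden in the convexity inequality.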
[quotetheorem:9581]
[citeproof:9581]
The EVI is not just another way to write the PDE. The $\lambda$-convexity hypothesis is the geometric input that prevents two gradient-flow trajectories from separating faster than the curvature bound allows. If an interaction energy has a strongly attractive nonconvex potential, two nearby masses can be pulled toward different concentration regions, and no global Wasserstein contraction estimate should be expected. The theorem also does not by itself prove existence of the flow; in nonsmooth settings that existence statement is a substantial part of the Ambrosio-Gigli-Savaré minimizing-movement theory, where approximation, compactness, and lower semicontinuity replace the smooth geodesic calculation. Once an EVI flow exists, however, the EVI directly implies uniqueness and continuous dependence on initial data, which are often harder to prove from weak formulations of nonlinear diffusion equations.
[quotetheorem:9582]
[citeproof:9582]
The contraction estimate controls the distance between two evolving densities. Its strength depends on the sign of $\lambda$: when $\lambda>0$ it gives exponential contraction, when $\lambda=0$ it gives nonexpansion, and when $\lambda<0$ it allows the distance to grow, but by a factor of at most $e^{-\lambda t}=e^{|\lambda|t}$. The estimate is a consequence of the EVI structure, so it can fail for flows generated by nonconvex attractive interactions even when weak solutions exist. It also compares two solutions for the same energy and does not say that a single solution has a limit unless an equilibrium or compactness mechanism is available. To discuss the limit of one density as $t\to\infty$, the course next names the stationary objects selected by the energy landscape.
[definition: Wasserstein Equilibrium]
Let $\mathcal E:\mathcal P_2(\mathbb R^n)\to(-\infty,\infty]$. A probability measure $\rho_\infty\in D(\mathcal E)$ is a Wasserstein equilibrium for $\mathcal E$ if it is a stationary point of the Wasserstein gradient flow of $\mathcal E$.
[/definition]
For smooth positive densities, stationarity should become an Euler-Lagrange equation with a constant Lagrange multiplier because admissible perturbations must preserve total mass. The subtle point is that the first variation can only be tested against zero-mean perturbations, so vanishing of the constrained variation forces the first variation to be constant, not to be zero.
The formal equilibrium definition therefore needs a usable test for candidates: if a density is proposed as a long-time limit or minimizer, one must know what equation its first variation satisfies under the mass constraint. The Euler-Lagrange statement supplies that criterion in the smooth positive case, where arbitrary zero-mean perturbations are available.
[quotetheorem:9583]
[citeproof:9583]
The assumptions in the theorem are restrictive because they avoid two complications that occur in applications. Smooth positivity lets the first variation be tested by arbitrary zero-mean perturbations inside the support; if the minimiser has a free boundary, the Euler-Lagrange equation usually becomes an inequality outside the support together with a constant-value condition on the positivity set. Minimality is also stronger than stationarity: a stationary density may be unstable, while a minimiser is the variational object selected by the long-time energy dissipation. If the support has several connected components, the constant can first be obtained componentwise, and additional perturbations moving mass between components are needed to force a single global Lagrange multiplier. The condition becomes explicit in the important case of entropy plus confinement: the equilibrium is the Gibbs density, and convexity of the confining potential gives convergence of the whole flow.
[example: Convergence to Gibbs Equilibrium Under Confinement]
Let $U(r)=r\log r$, let $W=0$, and let $V:\mathbb R^n\to\mathbb R$ be smooth with $D^2V\ge\lambda I$ for some $\lambda>0$. Assume
\begin{align*}
Z=\int_{\mathbb R^n}e^{-V(x)}\,d\mathcal L^n(x)<\infty.
\end{align*}
For this energy, $U'(r)=\log r+1$, so the equilibrium condition from *Euler-Lagrange Condition for Smooth Equilibria* reads
\begin{align*}
\log\rho_\infty(x)+1+V(x)=c
\end{align*}
on the support of $\rho_\infty$. Subtracting $1+V(x)$ gives
\begin{align*}
\log\rho_\infty(x)=c-1-V(x).
\end{align*}
Exponentiating gives
\begin{align*}
\rho_\infty(x)=e^{c-1}e^{-V(x)}.
\end{align*}
The mass constraint determines the constant:
\begin{align*}
1=\int_{\mathbb R^n}\rho_\infty(x)\,d\mathcal L^n(x)=e^{c-1}\int_{\mathbb R^n}e^{-V(x)}\,d\mathcal L^n(x)=e^{c-1}Z.
\end{align*}
Hence $e^{c-1}=Z^{-1}$, and therefore
\begin{align*}
\rho_\infty(x)=Z^{-1}e^{-V(x)}.
\end{align*}
The corresponding gradient-flow equation from *Aggregation-Diffusion Gradient Flow Equation* is
\begin{align*}
\partial_t\rho_t=\nabla\cdot\left(\rho_t\nabla(\log\rho_t+1+V)\right).
\end{align*}
Since $\nabla(\log\rho_t+1)=\rho_t^{-1}\nabla\rho_t$ for positive $\rho_t$, the flux satisfies
\begin{align*}
\rho_t\nabla(\log\rho_t+1+V)=\nabla\rho_t+\rho_t\nabla V.
\end{align*}
Thus
\begin{align*}
\partial_t\rho_t=\Delta\rho_t+\nabla\cdot(\rho_t\nabla V).
\end{align*}
Because $D^2V\ge\lambda I$ and the entropy term is displacement convex, the free energy is $\lambda$-convex along Wasserstein geodesics. Applying *EVI for Lambda-Convex Wasserstein Energies* with comparison point $\rho_\infty$ gives
\begin{align*}
\frac12\frac{d}{dt}W_2(\rho_t,\rho_\infty)^2+\frac{\lambda}{2}W_2(\rho_t,\rho_\infty)^2+\mathcal E[\rho_t]\le\mathcal E[\rho_\infty].
\end{align*}
Since $\rho_\infty$ minimises $\mathcal E$, $\mathcal E[\rho_\infty]\le\mathcal E[\rho_t]$, so
\begin{align*}
\frac12\frac{d}{dt}W_2(\rho_t,\rho_\infty)^2+\frac{\lambda}{2}W_2(\rho_t,\rho_\infty)^2\le0.
\end{align*}
Gronwall's inequality yields
\begin{align*}
W_2(\rho_t,\rho_\infty)\le e^{-\lambda t/2}W_2(\rho_0,\rho_\infty).
\end{align*}
In the quadratic case $V(x)=|x|^2/2$, we have $\nabla V(x)=x$ and $Z=(2\pi)^{n/2}$, so the equation becomes
\begin{align*}
\partial_t\rho_t=\Delta\rho_t+\nabla\cdot(x\rho_t)
\end{align*}
and the equilibrium is the standard Gaussian density
\begin{align*}
\rho_\infty(x)=(2\pi)^{-n/2}e^{-|x|^2/2}.
\end{align*}
Thus the confined entropy flow is the Ornstein-Uhlenbeck flow, and the convex confinement forces convergence in $W_2$ to the Gibbs density.
[/example]
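The convergence estimate can be checked on explicit one-dimensional Gaussian solutions. The sketch below assumes $n=1$, $V(x)=x^2/2$ (so $\lambda=1$), and a Gaussian initial datum $N(m_0,s_0^2)$; it uses the closed-form Ornstein-Uhlenbeck solution with mean $m_0e^{-t}$ and variance $1+(s_0^2-1)e^{-2t}$, together with the Gaussian formula $W_2^2=(m_1-m_2)^2+(s_1-s_2)^2$. The function names are hypothetical.

```python
import math

def ou_gaussian(m0, s0, t):
    """Mean and standard deviation at time t of the solution of
    d/dt rho = rho'' + (x rho)' started from N(m0, s0^2)."""
    mean = m0 * math.exp(-t)
    var = 1.0 + (s0 ** 2 - 1.0) * math.exp(-2.0 * t)
    return mean, math.sqrt(var)

def w2_gaussian(m1, s1, m2, s2):
    """W_2 distance between the one-dimensional Gaussians N(m1, s1^2), N(m2, s2^2)."""
    return math.hypot(m1 - m2, s1 - s2)

m0, s0 = 3.0, 0.5
d0 = w2_gaussian(m0, s0, 0.0, 1.0)          # distance to the standard Gaussian
for t in (0.0, 0.5, 1.0, 2.0, 4.0):
    m, s = ou_gaussian(m0, s0, t)
    dist = w2_gaussian(m, s, 0.0, 1.0)
    bound = math.exp(-0.5 * t) * d0          # the estimate e^{-lambda t / 2} W_2(rho_0, rho_inf)
    print(f"t={t:.1f}  W2={dt:.6f}  bound={bound:.6f}".replace("dt", "dist") if False else
          f"t={t:.1f}  W2={dist:.6f}  bound={bound:.6f}")
    assert dist <= bound + 1e-12
```

In this Gaussian family the distance decays faster than the bound, so the check confirms the inequality without suggesting that the rate $e^{-\lambda t/2}$ is attained.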
The chapter's main lesson is that nonlinear diffusion, aggregation, and equilibration are not separate stories. After the energy is chosen, the Wasserstein gradient-flow formalism predicts the PDE, the stationary equation, and the stability estimate; the analytic work is to justify those predictions under the regularity and compactness available for the model at hand.
# 6. Monge--Ampere Methods in Applied Transport
This chapter turns the Monge--Kantorovich existence theory recalled in the first course, together with the Brenier-map viewpoint used in Chapters 1 and 3, into a nonlinear PDE and a computational method. The main goals are to derive the Monge--Ampere equation from Brenier's theorem, understand its Alexandrov weak form, and explain why semi-discrete algorithms are governed by power diagrams. These tools appear directly in mesh generation, geometric optics, density estimation, and fluid rearrangement, where the unknown transport map must satisfy both a mass-balance law and geometric constraints. The prerequisites are the quadratic-cost Brenier theorem, the pushforward formulation of transport, basic convex analysis for subgradients, and the change-of-variables formula for smooth maps. In the quadratic-cost setting, Brenier's theorem says that the optimal map is the gradient of a convex potential, so the change-of-variables formula becomes a Monge--Ampere equation; when the target is a finite weighted sum of atoms, the same convex-potential viewpoint becomes a finite-dimensional system for cell masses.
## From Convex Potentials to Monge--Ampere Equations
What equation does an optimal transport map satisfy when it is regular enough to be differentiated? The starting point is the quadratic-cost case on subsets of Euclidean space, where optimal maps are gradients of convex functions. The transport constraint then says that the Jacobian determinant of the map must convert the source density into the target density.
[definition: Brenier Potential]
Let $\Omega_0,\Omega_1\subset \mathbb R^n$ be Borel sets, and let $\mu_0,\mu_1\in \mathcal P_2(\mathbb R^n)$ be concentrated on $\Omega_0$ and $\Omega_1$ respectively. A Brenier potential from $\mu_0$ to $\mu_1$ is a convex function $\phi:\mathbb R^n\to (-\infty,\infty]$ such that the map $T:\Omega_0\to\mathbb R^n$ defined by $T(x)=\nabla \phi(x)$ for $\mu_0$-a.e. $x$ satisfies $T_\#\mu_0=\mu_1$.
[/definition]
The convexity condition is the analytic trace of cyclical monotonicity for the quadratic cost. This definition gives the map, but applications need an equation that can be estimated; the next result derives that equation from mass preservation under the extra regularity assumptions that make the Jacobian meaningful.
[quotetheorem:7480]
[citeproof:7480]
The theorem is the classical PDE form of optimal transport. It is fully nonlinear because the highest derivatives enter through a determinant, and it is elliptic only on the cone of convex potentials.
Each hypothesis is doing visible work. The $C^2$ assumption is what turns the derivative of $\nabla\phi$ into the Hessian $D^2\phi$; for a convex function with corners, the Hessian determinant may not exist pointwise even though transport still makes sense. The diffeomorphism assumption excludes folding and many-to-one behaviour, where the change-of-variables formula would require multiplicities rather than a single determinant. Positivity of $\rho_1$ on the image prevents division by zero, while convexity keeps $D^2\phi$ positive semidefinite and places the equation in the elliptic branch. Thus the theorem does not describe general Brenier maps; it describes the smooth, single-valued regime that the weak theory must extend.
[example: Uniform Source to Log-Concave Density]
Let $\Omega_0=B(0,1)\subset\mathbb R^n$, let $\rho_0=1/\mathcal L^n(B(0,1))$, and let $\rho_1(y)=Z^{-1}e^{-V(y)}$ on a convex target domain, where $V$ is convex and $Z=\int_{\Omega_1}e^{-V(y)}\,d\mathcal L^n(y)$. Assuming the hypotheses of *Smooth Monge--Ampere Transport Equation*, the Brenier potential $\phi$ satisfies
\begin{align*}
\det D^2\phi(x)=\frac{\rho_0(x)}{\rho_1(\nabla\phi(x))}.
\end{align*}
Substituting the two densities gives
\begin{align*}
\frac{\rho_0(x)}{\rho_1(\nabla\phi(x))}=\frac{1/\mathcal L^n(B(0,1))}{Z^{-1}e^{-V(\nabla\phi(x))}}.
\end{align*}
Since division by $Z^{-1}e^{-V(\nabla\phi(x))}$ is multiplication by $Ze^{V(\nabla\phi(x))}$, this becomes
\begin{align*}
\frac{\rho_0(x)}{\rho_1(\nabla\phi(x))}=\frac{Z}{\mathcal L^n(B(0,1))}e^{V(\nabla\phi(x))}.
\end{align*}
Therefore
\begin{align*}
\det D^2\phi(x)=\frac{Z}{\mathcal L^n(B(0,1))}e^{V(\nabla\phi(x))}.
\end{align*}
Taking logarithms, which is legitimate because both sides are positive in the smooth positive-density regime, gives
\begin{align*}
\log\det D^2\phi(x)=\log Z-\log \mathcal L^n(B(0,1))+V(\nabla\phi(x)).
\end{align*}
For a coordinate direction $x_k$, differentiating the right-hand side by the chain rule gives
\begin{align*}
\partial_k\bigl(V(\nabla\phi(x))\bigr)=\sum_{j=1}^n \partial_j V(\nabla\phi(x))\,\partial_{kj}\phi(x).
\end{align*}
Thus log-concavity of the target enters the Monge--Ampere equation through the composed term $V\circ\nabla\phi$; bounds on $\nabla V$, together with bounds on $D^2\phi$, control the differentiated right-hand side. By contrast, a target density with several separated peaks corresponds to a nonconvex effective potential $V$ and can force sharp transitions in the transport map rather than a uniformly regular rearrangement.
[/example]
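A one-dimensional analogue makes the equation checkable by hand and by machine. The sketch below assumes $n=1$, a uniform source on $[0,1]$ (an interval in place of the ball), and the target density $Z^{-1}e^{-y}$ on $[0,1]$, so $V(y)=y$ and $Z=1-e^{-1}$. In one dimension the Brenier map is the monotone rearrangement, here $T(x)=-\log(1-Zx)$, and the Monge--Ampere equation reduces to $T'(x)=Z\,e^{V(T(x))}$. The function names are hypothetical.

```python
import math

Z = 1.0 - math.exp(-1.0)   # normalising constant of e^{-y} on [0, 1]

def transport_map(x):
    """Monotone map T = phi' pushing Uniform[0,1] to the density e^{-y}/Z on [0,1]."""
    return -math.log(1.0 - Z * x)

def hessian_fd(x, h=1e-5):
    """Central finite difference approximating phi''(x) = T'(x)."""
    return (transport_map(x + h) - transport_map(x - h)) / (2.0 * h)

def monge_ampere_rhs(x):
    """Right-hand side (Z / |Omega_0|) * exp(V(T(x))) with |Omega_0| = 1 and V(y) = y."""
    return Z * math.exp(transport_map(x))

for x in (0.1, 0.5, 0.9):
    print(f"x={x:.1f}  phi''~{hessian_fd(x):.8f}  rhs={monge_ampere_rhs(x):.8f}")
```

The two printed columns agree up to finite-difference error, which is the one-dimensional form of $\det D^2\phi=\frac{Z}{|\Omega_0|}e^{V(\nabla\phi)}$.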
The example depends on writing a pointwise Hessian determinant, but many applied transports involve corners, discontinuous densities, or maps that are not differentiable everywhere. We therefore need a measure-valued Monge--Ampere operator that still records how a convex potential sends sets of source points into sets of slopes.
[definition: Monge--Ampere Measure]
Let $\Omega\subset\mathbb R^n$ be open and let $\phi:\Omega\to\mathbb R$ be convex. The Monge--Ampere measure of $\phi$ is the map $M_\phi:\mathcal B(\Omega)\to[0,\infty]$ defined by
\begin{align*}
M_\phi(E)=\mathcal L^n(\partial\phi(E))
\end{align*}
for every Borel set $E\subset\Omega$ for which $\partial\phi(E)$ is $\mathcal L^n$-measurable.
[/definition]
The standard measurability theorem for convex subdifferentials ensures that this construction applies to Borel sets and gives a Borel measure. Thus $M_\phi$ records the Lebesgue size of the slope image rather than the Lebesgue size of the original set.
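A standard illustration, included here as a sketch rather than as part of the course's theorem list, is the cone $\phi(x)=|x|$ on $\Omega=B(0,1)\subset\mathbb R^n$. Its subdifferential is
\begin{align*}
\partial\phi(0)=\overline{B(0,1)},\qquad \partial\phi(x)=\left\{\frac{x}{|x|}\right\}\quad\text{for }x\ne0.
\end{align*}
For a Borel set $E\subset\Omega$, the image $\partial\phi(E\setminus\{0\})$ lies in the unit sphere, which is $\mathcal L^n$-null, so
\begin{align*}
M_\phi(E)=\begin{cases}\mathcal L^n(B(0,1)),&0\in E,\\0,&0\notin E,\end{cases}
\end{align*}
that is, $M_\phi=\mathcal L^n(B(0,1))\,\delta_0$. All of the Monge--Ampere mass sits at the single point where $\phi$ fails to be differentiable.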
For $C^2$ convex potentials this measure has density $\det D^2\phi$ with respect to Lebesgue measure. In transport, however, the determinant is not arbitrary: it must account for how much source density is sent into each part of the target.
The remaining question is how the pushforward constraint determines the Monge--Ampere right-hand side when the target density is nonuniform. The transport identity below gives the density-ratio formula in the regular regime and indicates what the Alexandrov measure is trying to encode beyond pointwise differentiability.
[quotetheorem:7481]
[citeproof:7481]
The theorem above is a restricted transport identity, not the full Alexandrov Monge--Ampere equation on all Borel sets. The restriction matters because a convex potential may fail to be differentiable on an $\mathcal L^n$-null set whose subgradient image has positive Lebesgue measure, and because image-level saturation does not rule out extra source points mapping into the same target set. In the usual Alexandrov formulation, this difficulty is absorbed into the Monge--Ampere measure $M_\phi$ and interpreted with the appropriate right-hand side measure. Singular targets, such as sums of atoms, cannot be inserted into the density-ratio formula because $\rho_1$ is then not a positive function; they require a cell formulation instead. If $\rho_1$ vanishes on part of the image, division by $\rho_1(\nabla\phi)$ is not meaningful and mass may concentrate near the zero-density region. This is the limitation numerical methods must respect: they approximate transported mass of cells, not a pointwise Hessian determinant everywhere.
## Semi-Discrete Transport and Power Diagrams
How does the Monge--Ampere picture change when the target measure consists of finitely many atoms? The PDE collapses to a finite-dimensional nonlinear system: each atom receives a cell in the source domain, and the unknowns are weights that make the source mass of each cell equal to the prescribed target mass.
[definition: Semi-Discrete Transport Problem]
Let $\mu_0=\rho_0\,d\mathcal L^n$ be a probability measure on a compact set $\Omega\subset\mathbb R^n$, and let
\begin{align*}
\mu_1=\sum_{i=1}^N m_i\delta_{y_i}
\end{align*}
where $y_i\in\mathbb R^n$ are distinct, $m_i>0$, and $\sum_{i=1}^N m_i=1$. The semi-discrete quadratic transport problem is the optimal transport problem from $\mu_0$ to $\mu_1$ for the cost $c(x,y)=\frac12|x-y|^2$.
[/definition]
The continuous source still gives geometric cells, while the discrete target turns the potential into the upper envelope of finitely many affine functions. To compute those cells and adjust their masses, we need a weighted version of Voronoi geometry.
[definition: Power Cell]
Let $y_1,\dots,y_N\in\mathbb R^n$ be distinct sites and let $w=(w_1,\dots,w_N)\in\mathbb R^N$. The power cell associated to $y_i$ and $w$ is
\begin{align*}
P_i(w)=\{x\in\Omega: |x-y_i|^2-w_i\le |x-y_j|^2-w_j \text{ for all }1\le j\le N\}.
\end{align*}
[/definition]
Power cells generalise Voronoi cells by allowing additive weights, but the weights cannot be chosen geometrically in isolation. Each cell must receive exactly the prescribed target mass, and changing one weight changes neighbouring cells at the same time. The semi-discrete problem is therefore to prove that the correct mass vector can be achieved by weights, and that the resulting cells really encode the Brenier map to the target atoms.
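The defining inequality can be turned directly into a membership test. The following minimal sketch (the function name `power_cell_index` and the sample sites are hypothetical) returns the index of the site whose power cell contains a given point, breaking ties on cell boundaries in favour of the smallest index.

```python
def power_cell_index(x, sites, weights):
    """Index i minimising |x - y_i|^2 - w_i; ties go to the smallest index."""
    def score(i):
        return sum((xc - yc) ** 2 for xc, yc in zip(x, sites[i])) - weights[i]
    return min(range(len(sites)), key=score)

sites = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.9)]
x = (0.45, 0.3)

# Zero weights reproduce the Voronoi assignment.
print(power_cell_index(x, sites, [0.0, 0.0, 0.0]))
# Raising one weight enlarges that site's cell and can capture the point.
print(power_cell_index(x, sites, [0.0, 0.2, 0.0]))
# Adding the same constant to every weight leaves the assignment unchanged.
print(power_cell_index(x, sites, [5.0, 5.2, 5.0]))
```

The second and third calls return the same index, which illustrates that only weight differences matter.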
[quotetheorem:9584]
[citeproof:9584]
The theorem turns a transport problem into computational geometry plus convex optimisation. It is also an Alexandrov-type discretisation, because each cell is the set of points whose subgradient contains the same target atom.
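One way to see the link with convex potentials, sketched here under the sign and scaling conventions of the power-cell definition (the symbol $\phi_w$ is introduced only for this illustration), is to write the potential explicitly as
\begin{align*}
\phi_w(x)=\max_{1\le i\le N}\left\{x\cdot y_i-\frac12|y_i|^2+\frac12 w_i\right\}.
\end{align*}
Expanding the squares and cancelling $|x|^2$ gives
\begin{align*}
|x-y_i|^2-w_i\le|x-y_j|^2-w_j\Longleftrightarrow x\cdot y_i-\frac12|y_i|^2+\frac12w_i\ge x\cdot y_j-\frac12|y_j|^2+\frac12w_j,
\end{align*}
so $P_i(w)$ is exactly the subset of $\Omega$ on which the $i$-th affine function attains the maximum. Consequently $\nabla\phi_w(x)=y_i$ in the interior of $P_i(w)$, and $y_i\in\partial\phi_w(x)$ for every $x\in P_i(w)$.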
The assumptions exclude several degeneracies. Absolute continuity of $\mu_0$ makes cell boundaries negligible, so assigning boundary points to one neighbouring site or another does not change the transported mass; an atomic source could put positive mass on a boundary and destroy this clean cell rule. Distinct sites are needed because two identical target atoms would define the same geometric destination and weights would not separate them uniquely. Positive masses prevent a target atom from being intentionally assigned an empty cell, while the connected adjacency condition rules out independent components whose weights can slide relative to one another. The theorem gives existence and structural uniqueness of weights under these conditions, but it does not promise good numerical conditioning near disappearing cells or nearly disconnected diagrams.
[example: Semi-Discrete Matching to Three Sites]
Let $\Omega=[0,1]^2$ with uniform source density, so $\mu_0(E)=\mathcal L^2(E)$ for every measurable $E\subset\Omega$, and let
\begin{align*}
\mu_1=m_1\delta_{y_1}+m_2\delta_{y_2}+m_3\delta_{y_3}
\end{align*}
with non-collinear sites, $m_i>0$, and $m_1+m_2+m_3=1$. For weights $w=(w_1,w_2,w_3)$, the cell assigned to $y_i$ is
\begin{align*}
P_i(w)=\{x\in[0,1]^2: |x-y_i|^2-w_i\le |x-y_j|^2-w_j \text{ for }j=1,2,3\}.
\end{align*}
For a fixed pair $i,j$, the separating inequality is affine in $x$ because
\begin{align*}
|x-y_i|^2-w_i\le |x-y_j|^2-w_j \Longleftrightarrow 2x\cdot(y_j-y_i)\le |y_j|^2-|y_i|^2+w_i-w_j.
\end{align*}
Thus each $P_i(w)$ is the intersection of the square with two half-planes, hence is a convex polygon, possibly empty or clipped by the boundary of $\Omega$.
The required weights are exactly those satisfying the three area equations
\begin{align*}
\mathcal L^2(P_i(w))=m_i,\qquad i=1,2,3.
\end{align*}
Only two equations are independent, since the cells cover $[0,1]^2$ up to their edges and
\begin{align*}
\mathcal L^2(P_1(w))+\mathcal L^2(P_2(w))+\mathcal L^2(P_3(w))=1=m_1+m_2+m_3.
\end{align*}
Also, adding the same constant $c$ to all weights changes each expression $|x-y_i|^2-w_i$ to $|x-y_i|^2-w_i-c$, so every comparison is unchanged; one may fix the gauge, for example $w_3=0$ or $w_1+w_2+w_3=0$.
To see what Newton's method differentiates, suppose cells $P_i$ and $P_j$ share an interior edge segment of length $\ell_{ij}$. The line between them has normal direction $y_j-y_i$, and changing $w_i-w_j$ by $\delta w_i-\delta w_j$ moves this line toward $y_j$ by the signed distance
\begin{align*}
\frac{\delta w_i-\delta w_j}{2|y_j-y_i|}.
\end{align*}
Therefore, to first order, the area gained by $P_i$ across that edge is
\begin{align*}
\ell_{ij}\frac{\delta w_i-\delta w_j}{2|y_j-y_i|}.
\end{align*}
Writing
\begin{align*}
a_{ij}=\frac{\ell_{ij}}{2|y_j-y_i|}
\end{align*}
for adjacent cells, the linearized mass equations have the form
\begin{align*}
\delta \mathcal L^2(P_i)=\sum_{j\sim i} a_{ij}(\delta w_i-\delta w_j).
\end{align*}
Newton's method solves this two-dimensional gauge-fixed linear system for the weight correction that reduces the residuals $\mathcal L^2(P_i(w))-m_i$. Geometrically, the update moves only shared edges, and an edge contributes to the linear system exactly when it has positive length.
[/example]
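The Newton update can be tried on a one-dimensional analogue, where every quantity is explicit. The sketch below is not the two-dimensional algorithm of the example: it assumes $\Omega=[0,1]$ with uniform density, three increasing sites, nonempty cells with interfaces inside $(0,1)$, and the gauge $w_3=0$. Interfaces are points, so the role of the edge length $\ell_{ij}$ is played by $1$. All function names are hypothetical.

```python
def interfaces(y, w):
    """Breakpoints between neighbouring 1D power cells (sites sorted increasingly)."""
    return [(y[j] ** 2 - y[i] ** 2 + w[i] - w[j]) / (2.0 * (y[j] - y[i]))
            for i, j in ((0, 1), (1, 2))]

def cell_masses(y, w):
    """Lengths of the three cells in [0, 1]; assumes 0 < b12 < b23 < 1."""
    b12, b23 = interfaces(y, w)
    return [b12, b23 - b12, 1.0 - b23]

def newton_step(y, w, m):
    """One Newton update of (w1, w2) with the gauge w3 = 0 held fixed."""
    a12 = 1.0 / (2.0 * (y[1] - y[0]))   # interface "length" 1 over 2|y2 - y1|
    a23 = 1.0 / (2.0 * (y[2] - y[1]))
    M = cell_masses(y, w)
    r1, r2 = m[0] - M[0], m[1] - M[1]   # mass still to be gained by cells 1 and 2
    # Gauge-fixed Jacobian [[a12, -a12], [-a12, a12 + a23]], solved by Cramer's rule.
    det = a12 * a23
    dw1 = (r1 * (a12 + a23) + a12 * r2) / det
    dw2 = (a12 * r2 + a12 * r1) / det
    return [w[0] + dw1, w[1] + dw2, w[2]]

y = [0.2, 0.5, 0.9]
m = [0.5, 0.3, 0.2]
w = newton_step(y, [0.0, 0.0, 0.0], m)
print(w, cell_masses(y, w))
```

In this one-dimensional uniform setting the mass map is affine in the weights as long as the cell ordering is unchanged, so a single step reaches the prescribed masses. In two dimensions the edge lengths depend on $w$, and the same linear solve has to be iterated.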
For algorithms and applications, it is not enough to know that a weight vector exists. We need stability: if the desired masses or sites are perturbed, the weights and the cells should not jump uncontrollably.
[quotetheorem:9585]
[citeproof:9585]
This stability theorem is one reason semi-discrete optimal transport is usable in numerical pipelines. In Newton methods, the Jacobian formula gives the sparse linear system that updates the normalised weights from the current mass residual $\mathcal M(w)-m$. In continuation methods, it justifies tracking a branch of weights as the target masses or sites are changed in small steps, with the gauge-fixed inverse controlling how large each update can be before recomputing the diagram. The graph Laplacian structure is essential when the number of target sites is large because only adjacent cells interact in the linearised system. The theorem is local: it does not give global stability of all diagrams, uniform conditioning over a family of problems, convergence of Newton's method from arbitrary initial weights, or stability through changes in cell topology.
Each condition protects the inverse problem from a concrete failure. If $\rho_0$ is allowed to vanish on a cell interface, moving that interface may change no mass to first order, so the Jacobian of the mass map can lose rank. If a cell has zero mass, its weight can often vary over an interval before the cell reappears, preventing local invertibility. Positive $\mathcal H^{n-1}$-measure of shared faces is the differentiability and nondegeneracy condition behind the graph Laplacian formula; touching at a point or along a lower-dimensional ridge does not create a first-order mass exchange. Gauge fixing is necessary because adding the same constant to every $w_i$ leaves all power cells unchanged, so the unnormalised Jacobian always has the constant vector in its kernel. Finally, if the positive-face adjacency graph is disconnected, two clusters of cells can shift their weights relative to each other without changing any shared face between the clusters, giving another kernel direction.
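For three mutually adjacent cells with uniform unit source density, as in the three-site example, the linearised mass equations $\delta\mathcal L^2(P_i)=\sum_{j\sim i}a_{ij}(\delta w_i-\delta w_j)$ can be written as a matrix. The following is an illustrative sketch, with $J$ a name introduced here for the unnormalised Jacobian:
\begin{align*}
J=\begin{pmatrix}a_{12}+a_{13}&-a_{12}&-a_{13}\\-a_{12}&a_{12}+a_{23}&-a_{23}\\-a_{13}&-a_{23}&a_{13}+a_{23}\end{pmatrix},\qquad J\begin{pmatrix}1\\1\\1\end{pmatrix}=0.
\end{align*}
Fixing the gauge $w_3=0$ deletes the third row and column, and
\begin{align*}
\det\begin{pmatrix}a_{12}+a_{13}&-a_{12}\\-a_{12}&a_{12}+a_{23}\end{pmatrix}=a_{12}a_{13}+a_{12}a_{23}+a_{13}a_{23}.
\end{align*}
Since each $a_{ij}\ge0$, this determinant is positive exactly when at least two of the three shared edges have positive length, which is the statement that the positive-face adjacency graph on three cells is connected.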
[example: Mesh Generation by Optimal Transport]
Let $\Omega\subset\mathbb R^2$ be a polygonal physical domain with source density $\rho_0$, and choose target sites $y_1,\dots,y_N$ with masses $m_i>0$ satisfying $\sum_{i=1}^N m_i=1$. The semi-discrete construction uses weights $w=(w_1,\dots,w_N)$ and cells
\begin{align*}
P_i(w)=\{x\in\Omega: |x-y_i|^2-w_i\le |x-y_j|^2-w_j \text{ for all }j\}.
\end{align*}
For fixed $i$ and $j$, the boundary between the two competing sites is obtained by expanding both squared distances:
\begin{align*}
|x-y_i|^2-w_i\le |x-y_j|^2-w_j.
\end{align*}
Since $|x-y_i|^2=|x|^2-2x\cdot y_i+|y_i|^2$ and $|x-y_j|^2=|x|^2-2x\cdot y_j+|y_j|^2$, cancellation of the common $|x|^2$ term gives
\begin{align*}
-2x\cdot y_i+|y_i|^2-w_i\le -2x\cdot y_j+|y_j|^2-w_j.
\end{align*}
Equivalently,
\begin{align*}
2x\cdot(y_j-y_i)\le |y_j|^2-|y_i|^2+w_i-w_j.
\end{align*}
Thus every pairwise constraint is a half-plane, so each cell is a polygon obtained by clipping $\Omega$ by finitely many half-planes.
The desired mesh-density prescription is encoded by the mass equations
\begin{align*}
\int_{P_i(w)}\rho_0(x)\,d\mathcal L^2(x)=m_i,\qquad i=1,\dots,N.
\end{align*}
By *Semi-Discrete Brenier Cells*, weights satisfying these equations make the transport map $T(x)=y_i$ on $P_i(w)$ optimal, up to cell boundaries of $\mu_0$-measure zero. If $\rho_0$ is uniform with total mass one, then $\rho_0=1/\mathcal L^2(\Omega)$ and the equation becomes
\begin{align*}
\frac{\mathcal L^2(P_i(w))}{\mathcal L^2(\Omega)}=m_i.
\end{align*}
Multiplying by $\mathcal L^2(\Omega)$ gives
\begin{align*}
\mathcal L^2(P_i(w))=m_i\,\mathcal L^2(\Omega).
\end{align*}
Therefore, for a uniform source, the area of each control volume is proportional to its prescribed mass: sites with larger $m_i$ receive larger cells, and under a fixed element-size rule those cells are subdivided into more mesh elements.
The weights are insensitive to adding a common constant: replacing every $w_i$ by $w_i+c$ changes $|x-y_i|^2-w_i$ to $|x-y_i|^2-w_i-c$, so all pairwise comparisons are unchanged. After fixing this gauge, *Stability of Semi-Discrete Optimal Weights* says that near a nondegenerate diagram with connected positive-face adjacency, small changes in the mass vector produce small changes in the normalized weights. In mesh terms, away from disappearing cells and degenerate interfaces, small changes in the desired density move the power-cell edges continuously rather than reorganizing the whole mesh at once.
[/example]
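The mass equations can be evaluated approximately without constructing the polygons. The sketch below assumes $\Omega=[0,1]^2$ as the polygonal domain and approximates $\int_{P_i(w)}\rho_0\,d\mathcal L^2$ by a midpoint rule on a uniform grid; the function name `grid_cell_masses` and the callable `density` are hypothetical, and a production code would clip the polygons exactly instead.

```python
def grid_cell_masses(sites, weights, density, n=200):
    """Midpoint-rule estimate of the rho_0-mass of each power cell in [0,1]^2."""
    masses = [0.0] * len(sites)
    h = 1.0 / n
    for a in range(n):
        for b in range(n):
            x = ((a + 0.5) * h, (b + 0.5) * h)
            # Assign the grid midpoint to the site minimising |x - y_i|^2 - w_i.
            i = min(range(len(sites)),
                    key=lambda k: (x[0] - sites[k][0]) ** 2
                                  + (x[1] - sites[k][1]) ** 2 - weights[k])
            masses[i] += density(x) * h * h
    return masses

sites = [(0.25, 0.5), (0.75, 0.5)]
uniform = lambda x: 1.0
linear = lambda x: 2.0 * x[0]          # nonuniform density with total mass one

print(grid_cell_masses(sites, [0.0, 0.0], uniform))
print(grid_cell_masses(sites, [0.1, 0.0], uniform))
print(grid_cell_masses(sites, [0.0, 0.0], linear))
```

Raising $w_1$ moves the interface toward the second site and increases the first mass, while the nonuniform density changes the masses without changing the cells, which is the distinction between the geometric cells and the mass equations they must satisfy.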
## Regularity as an Analytic Input for Applications
When can the transport map be used as a regular change of variables rather than only as a measurable map? Applications to PDE, statistics, geometry, and numerics often require bounds on derivatives of the Brenier potential. Monge--Ampere regularity supplies those bounds, but only under geometric and analytic hypotheses that cannot be ignored.
[definition: Second Boundary Value Problem for Monge--Ampere]
Let $\Omega_0,\Omega_1\subset\mathbb R^n$ be domains, let $f\in C^0(\Omega_0;(0,\infty))$ be a density ratio, and let $\phi\in C^2(\Omega_0;\mathbb R)$ be the unknown potential. The second boundary value problem for the Monge--Ampere equation consists of the conditions
\begin{align*}
\det D^2\phi(x)=f(x),\qquad x\in\Omega_0.
\end{align*}
\begin{align*}
\nabla\phi(\Omega_0)=\Omega_1.
\end{align*}
\begin{align*}
\phi\text{ is convex.}
\end{align*}
[/definition]
The boundary condition is global and nonlinear: it prescribes the image of the gradient rather than the value of $\phi$ on the boundary. This formulation is the regularity problem faced by smooth transport applications, so the next theorem records the standard analytic input used later in the course.
[quotetheorem:9586]
The full proof belongs to the global second-boundary-value regularity theory of Caffarelli and Urbas for the Monge--Ampere equation, using Alexandrov estimates, strict convexity, localization, and boundary obliqueness. In this applications course, the theorem is used as an analytic input: it licenses differentiating the transport map and transferring estimates across it. The hypotheses are part of the content, not decorative assumptions. Uniform convexity of both domains prevents flat or reentrant boundary geometry from producing singular boundary contact. Smooth positive densities keep the right-hand side of the Monge--Ampere equation smooth and uniformly elliptic along the convex branch. The total-mass compatibility condition is necessary for any transport map between the two weighted domains. The result gives regularity for the continuous positive-density problem only; it does not cover nonconvex supports, densities that vanish or blow up, nonsmooth boundaries, or targets with singular parts such as atoms. Typical failures include discontinuous maps for nonconvex target support, degenerate Monge--Ampere equations where the target density vanishes, and boundary singularities at corners or flat pieces of the domain.
[example: Why Convex Supports Matter]
Let $D\subset\mathbb R^2$ be a disk with uniform source measure $\mu_0=\mathcal L^2|_D/\mathcal L^2(D)$, and let the target $\mu_1$ be the uniform probability measure on
\begin{align*}
Y=Y_1\cup Y_2
\end{align*}
where $Y_1$ and $Y_2$ are two disjoint open disks with positive distance between them. Since both target components have positive area, a transport map $T$ satisfying $T_\#\mu_0=\mu_1$ must send positive source mass to each component:
\begin{align*}
\mu_0(T^{-1}(Y_1))=\mu_1(Y_1)>0.
\end{align*}
\begin{align*}
\mu_0(T^{-1}(Y_2))=\mu_1(Y_2)>0.
\end{align*}
If $T$ were a smooth diffeomorphism from $D$ onto $Y$, then $T$ would be continuous, and the continuous image $T(D)$ of the connected set $D$ would have to be connected. But $Y_1\cup Y_2$ is disconnected, because
\begin{align*}
Y_1\cap Y_2=\varnothing
\end{align*}
and both $Y_1$ and $Y_2$ are nonempty and open in the relative topology of $Y$. Thus no continuous map, and in particular no smooth diffeomorphism, can send the connected disk onto this disconnected target.
Equivalently, any mass-splitting transport must separate the source into two positive-mass regions $T^{-1}(Y_1)$ and $T^{-1}(Y_2)$. Since the target components are separated by a positive distance, approaching a dividing interface from the two sides forces the image values to approach different target components, so the map cannot remain a smooth single-valued diffeomorphism across that interface. This is the obstruction: convexity-type assumptions on the target prevent this topological splitting and are therefore structural, not merely technical.
[/example]
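The jump can be seen explicitly in a one-dimensional analogue. The sketch below replaces the disk by the interval $[0,1]$ with uniform source and takes the uniform target on $[0,0.5]\cup[1.5,2]$; both choices are assumptions made for illustration. The optimal map for the quadratic cost is then the monotone rearrangement, which shifts the right half of the source by one unit. The function name is hypothetical.

```python
def monotone_map(x):
    """Quantile map pushing Uniform[0,1] onto the uniform measure on [0,0.5] U [1.5,2]."""
    return x if x < 0.5 else x + 1.0

# The one-sided limits at the dividing point differ by the gap between components.
for eps in (1e-1, 1e-3, 1e-6):
    jump = monotone_map(0.5 + eps) - monotone_map(0.5 - eps)
    print(f"eps={eps:g}  jump={jump:.6f}")
```

The jump tends to $1$, the distance between the two target components, so the map is discontinuous at the dividing point no matter how smooth the densities are on their supports.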
Regularity also interacts with the earlier semi-discrete picture. A discrete target is intentionally singular, so the potential is piecewise affine in the dual representation and the transport map is piecewise constant; the relevant regularity is then stability of cells and weights, not differentiability of $\nabla\phi$.
[remark: Two Notions of Regularity]
In continuous transport, regularity usually means estimates for $\phi$, $\nabla\phi$, or $D^2\phi$. In semi-discrete transport, regularity usually means nondegenerate cell geometry and stable dependence of weights on masses and sites. Both notions are Monge--Ampere regularity statements, but they live in different categories: PDE estimates in the continuous case and finite-dimensional sensitivity estimates in the semi-discrete case.
[/remark]
The chapter's main lesson is that applied optimal transport often passes through the same bridge: Brenier's convex potential turns mass balance into Monge--Ampere structure, and Chapter 9 will use the semi-discrete version of this bridge in computational transport. In smooth settings this gives a nonlinear elliptic PDE for a map; in semi-discrete settings it gives a weighted power diagram and a finite-dimensional mass equation. The success of the application depends on choosing the version of regularity that matches the target measure.
# 7. Ricci Curvature Through Transport
This chapter explains how lower Ricci curvature bounds can be read from the geometry of probability measures. It assumes the earlier material on Wasserstein geodesics, displacement interpolation, relative entropy, and basic Riemannian comparison through Jacobi fields. Chapters 3 through 5 used displacement convexity to study functional inequalities and gradient flows in $W_2$; here the same convexity becomes a definition of curvature in spaces where no smooth tensor is available. The guiding question is: which convexity properties of entropy along optimal transport geodesics encode the inequality $\operatorname{Ric} \ge K$?
## Entropy Convexity on Riemannian Manifolds
The first problem is to translate a pointwise curvature tensor condition into a statement about whole probability measures. On a smooth Riemannian manifold $M$, the entropy sees how volume changes when mass is transported along geodesics. Ricci curvature enters through the second variation of the Riemannian volume distortion.
[definition: Relative Entropy on a Riemannian Manifold]
Let $(M,g)$ be a complete Riemannian manifold with Riemannian volume measure $\operatorname{vol}$. The relative entropy with respect to $\operatorname{vol}$ is the functional
\begin{align*}
\operatorname{Ent}_{\operatorname{vol}}:\mathcal P_2(M)\longrightarrow [-\infty,+\infty].
\end{align*}
It sends $\mu$ to the following extended real number. When $\mu=\rho\operatorname{vol}$ and $(\rho\log\rho)_+\in L^1(M,\operatorname{vol})$, its value is
\begin{align*}
\operatorname{Ent}_{\operatorname{vol}}(\mu)=\int_M \rho\log\rho\,d\operatorname{vol}.
\end{align*}
Otherwise,
\begin{align*}
\operatorname{Ent}_{\operatorname{vol}}(\mu)=+\infty.
\end{align*}
[/definition]
This is the same entropy whose Wasserstein gradient flow was identified with the heat equation in Chapters 2 and 4, so it is already tied to diffusion. Since the synthetic definitions below no longer have a distinguished Riemannian volume, we also need the metric-measure version of the same functional before asking for convexity.
[definition: Relative Entropy on a Metric Measure Space]
Let $(X,d,m)$ be a metric measure space with $m$ a locally finite Borel measure. The relative entropy of $\mu\in\mathcal P_2(X)$ with respect to $m$ is the functional
\begin{align*}
\operatorname{Ent}_m:\mathcal P_2(X)\longrightarrow [-\infty,+\infty].
\end{align*}
It sends $\mu$ to the following extended real number. When $\mu=\rho m$ and $(\rho\log\rho)_+\in L^1(X,m)$, its value is
\begin{align*}
\operatorname{Ent}_m(\mu)=\int_X \rho\log\rho\,dm.
\end{align*}
Otherwise,
\begin{align*}
\operatorname{Ent}_m(\mu)=+\infty.
\end{align*}
[/definition]
The metric-measure entropy has the same formula as the Riemannian one, but its reference measure may now encode weights, singularities, or limiting geometry. In the convexity statements below, "finite entropy" means that the entropy value is a real number, excluding both $+\infty$ and $-\infty$. To extract curvature from entropy, we must ask how it behaves along transport geodesics rather than along its own gradient flow; this motivates the definition of displacement convexity.
[definition: Displacement Convexity of Entropy]
Let $(X,d,m)$ be a metric measure space and let $K \in \mathbb R$. The entropy $\operatorname{Ent}_m$ is $K$-displacement convex if for every pair $\mu_0,\mu_1 \in \mathcal P_2(X)$ with finite real entropy there exists a constant-speed $W_2$-geodesic $(\mu_t)_{t\in[0,1]}$ from $\mu_0$ to $\mu_1$ such that
\begin{align*}
\operatorname{Ent}_m(\mu_t) \le (1-t)\operatorname{Ent}_m(\mu_0)+t\operatorname{Ent}_m(\mu_1) - \frac{K}{2}t(1-t)W_2^2(\mu_0,\mu_1)
\end{align*}
for all $t\in[0,1]$.
[/definition]
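One way to read the constant $K$ is as a second-derivative bound. The following is a sketch under an extra assumption that is not part of the definition: suppose $\varphi(t)=\operatorname{Ent}_m(\mu_t)$ is twice continuously differentiable on $[0,1]$ and satisfies $\varphi''\ge KW^2$, where $W=W_2(\mu_0,\mu_1)$. The auxiliary function $\psi(t)=\varphi(t)-\frac K2W^2t^2$ (notation introduced here) is then convex, and comparing it with its chord gives
\begin{align*}
\psi''(t)&=\varphi''(t)-KW^2\ge 0,\\
\varphi(t)-\frac K2W^2t^2&\le (1-t)\varphi(0)+t\Bigl(\varphi(1)-\frac K2W^2\Bigr),\\
\varphi(t)&\le (1-t)\varphi(0)+t\varphi(1)-\frac K2t(1-t)W^2.
\end{align*}
So the inequality in the definition is the integrated form of $\frac{d^2}{dt^2}\operatorname{Ent}_m(\mu_t)\ge KW_2^2(\mu_0,\mu_1)$, written so that no differentiability in $t$ is needed.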
The parameter $K$ measures the strength of convexity in Wasserstein space. The natural test of the definition is whether the classical lower Ricci bound supplies exactly this strength of convexity, and this motivates the first main theorem.
[quotetheorem:9587]
[citeproof:9587]
The completeness assumption prevents geodesics from leaving the space before time $1$; on an incomplete open submanifold, minimizing geodesics between transported particles may hit the missing boundary and the global interpolation is no longer available. Absolute continuity of one endpoint is the smooth setting in which the optimal map description avoids branching; without it, mass can split over several geodesics and a chosen interpolation may fail to reflect a pointwise Jacobian comparison. The theorem gives existence of a convexity-realizing geodesic, not convexity along every possible optimal plan in a branching metric space. With these restrictions in view, the next examples check that the sign and size of $K$ match familiar model geometries.
[example: Sphere and Hyperbolic Space]
On the unit sphere $S^n$ with its round metric, the constant sectional curvature is $1$, so tracing the sectional curvatures in the $n-1$ directions orthogonal to a unit vector gives
\begin{align*}
\operatorname{Ric}(v,v)=(n-1)g(v,v)
\end{align*}
for every tangent vector $v$. Thus $\operatorname{Ric}=(n-1)g$, and applying *Entropy Convexity from Ricci Lower Bounds* with $K=n-1$ gives
\begin{align*}
\operatorname{Ent}_{\operatorname{vol}}(\mu_t)\le (1-t)\operatorname{Ent}_{\operatorname{vol}}(\mu_0)+t\operatorname{Ent}_{\operatorname{vol}}(\mu_1)-\frac{n-1}{2}t(1-t)W_2^2(\mu_0,\mu_1).
\end{align*}
So entropy on the round unit sphere is $(n-1)$-displacement convex.
On hyperbolic space $\mathbb H^n$ with constant sectional curvature $-1$, the same trace gives
\begin{align*}
\operatorname{Ric}(v,v)=-(n-1)g(v,v)
\end{align*}
for every tangent vector $v$. Hence $\operatorname{Ric}=-(n-1)g$, and the same theorem with $K=-(n-1)$ gives
\begin{align*}
\operatorname{Ent}_{\operatorname{vol}}(\mu_t)\le (1-t)\operatorname{Ent}_{\operatorname{vol}}(\mu_0)+t\operatorname{Ent}_{\operatorname{vol}}(\mu_1)+\frac{n-1}{2}t(1-t)W_2^2(\mu_0,\mu_1).
\end{align*}
The sign change records the model-space comparison: positive curvature strengthens entropy convexity, while negative curvature weakens it by allowing larger volume expansion along transport.
The same sign reappears in the reduced distortion coefficient $\sigma_{K,N}^{(t)}$ defined in the next section. When $K>0$, its spherical branch is
\begin{align*}
\sigma_{K,N}^{(t)}(\theta)=\frac{\sin(t\theta\sqrt{K/N})}{\sin(\theta\sqrt{K/N})}.
\end{align*}
The denominator first vanishes when
\begin{align*}
\theta\sqrt{K/N}=\pi,
\end{align*}
equivalently
\begin{align*}
\theta=\pi\sqrt{N/K}.
\end{align*}
This is why the definition of distortion coefficients below requires $\theta<\pi\sqrt{N/K}$ in the positive-curvature case and assigns $+\infty$ at and beyond the first zero.
[/example]
The example shows that entropy convexity has the right qualitative behaviour on spaces of constant curvature. The stronger question is whether entropy convexity contains enough information to reconstruct the tensor inequality, and this motivates the von Renesse-Sturm characterization.
[quotetheorem:9588]
[citeproof:9588]
The connectedness assumption rules out a minor but real pathology: if the space has several components, transport between components may be impossible at finite distance unless the metric is specified across components. Completeness again ensures that the local tests used in the reverse implication sit inside genuine minimizing geodesics. The theorem does not recover sectional curvature, injectivity radius, or topology; many manifolds with different local geometry can share the same Ricci lower bound. What it does recover is exactly the tensor inequality that survives when the language is reduced to distance, measure, and transport, which motivates the synthetic definitions below.
## Curvature-Dimension Conditions
The next problem is to retain both a curvature lower bound and an upper dimension bound in a setting that may have no coordinates. The notation $CD(K,N)$ abbreviates curvature at least $K$ and dimension at most $N$. The case $N=\infty$ is governed by ordinary entropy convexity, while finite $N$ requires a distortion correction reflecting volume comparison.
[definition: Curvature-Dimension Condition CD K Infinity]
Let $(X,d,m)$ be a complete separable geodesic metric measure space with $m$ a locally finite Borel measure. The space satisfies $CD(K,\infty)$ if $\operatorname{Ent}_m$ is $K$-displacement convex on $\mathcal P_2(X)$.
[/definition]
This definition extends the smooth theorem by promoting the conclusion to an axiom. To see that the axiom has the expected normalization, we first test it in the flat case where the curvature parameter should be zero.
[example: Euclidean Space]
In $\mathbb R^n$ with the Euclidean metric and Lebesgue measure, the Riemannian curvature tensor vanishes, so $\operatorname{Ric}=0$. By *Entropy Convexity from Ricci Lower Bounds* with $K=0$, every finite-entropy pair with absolutely continuous first endpoint admits a Wasserstein geodesic satisfying
\begin{align*}
\operatorname{Ent}_{\mathcal L^n}(\mu_t)\le (1-t)\operatorname{Ent}_{\mathcal L^n}(\mu_0)+t\operatorname{Ent}_{\mathcal L^n}(\mu_1).
\end{align*}
For a concrete affine case, take nondegenerate Gaussian measures $\mu_0,\mu_1$ with means $m_0,m_1$ and covariances $\Sigma_0,\Sigma_1$; the optimal map is then affine, $T(x)=m_1+A(x-m_0)$, with $A$ symmetric positive definite. The interpolation map is $F_t(x)=m_t+B_t(x-m_0)$, where $m_t=(1-t)m_0+tm_1$ and
\begin{align*}
B_t=(1-t)I+tA.
\end{align*}
The covariance at time $t$ is
\begin{align*}
\Sigma_t=B_t\Sigma_0B_t.
\end{align*}
Since $B_t$ is symmetric, $\det(B_t\Sigma_0B_t)=\det(B_t)^2\det(\Sigma_0)$, so
\begin{align*}
\det\Sigma_t=\det(B_t)^2\det\Sigma_0.
\end{align*}
For a Gaussian density with covariance $\Sigma$, the entropy relative to Lebesgue measure is
\begin{align*}
\operatorname{Ent}_{\mathcal L^n}(\mu)=-\frac12\log\bigl((2\pi e)^n\det\Sigma\bigr).
\end{align*}
Therefore
\begin{align*}
\operatorname{Ent}_{\mathcal L^n}(\mu_t)=\operatorname{Ent}_{\mathcal L^n}(\mu_0)-\log\det B_t.
\end{align*}
At $t=1$, this gives
\begin{align*}
\operatorname{Ent}_{\mathcal L^n}(\mu_1)=\operatorname{Ent}_{\mathcal L^n}(\mu_0)-\log\det A.
\end{align*}
Thus the convexity inequality is exactly
\begin{align*}
-\log\det B_t\le -t\log\det A,
\end{align*}
equivalently
\begin{align*}
\log\det((1-t)I+tA)\ge (1-t)\log\det I+t\log\det A.
\end{align*}
This is the concavity of $\log\det$ on positive definite matrices. In flat space, entropy convexity is therefore the matrix determinant concavity appearing in the Jacobian of affine transport.
[/example]
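The concavity of $\log\det$ used in the Euclidean example can be checked by diagonalizing $A$. As a short sketch, write $\lambda_1,\ldots,\lambda_n>0$ for the eigenvalues of $A$ (notation introduced here); then $(1-t)I+tA$ has eigenvalues $(1-t)+t\lambda_i$, and
\begin{align*}
\log\det((1-t)I+tA)&=\sum_{i=1}^n\log\bigl((1-t)+t\lambda_i\bigr)\\
&\ge\sum_{i=1}^n\bigl((1-t)\log 1+t\log\lambda_i\bigr)\\
&=t\log\det A.
\end{align*}
The middle step is concavity of the scalar logarithm applied to each eigenvalue. Only this special case, in which one endpoint is the identity matrix, is needed for the Gaussian computation.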
The flat example confirms the infinite-dimensional version, but it forgets that $\mathbb R^n$ has dimension $n$. To retain dimension, the transport inequality must use the same comparison functions that appear in Jacobi field estimates; this motivates the distortion coefficients.
[definition: Distortion Coefficients]
For $K\in\mathbb R$, $N\in(0,\infty)$, and $t\in[0,1]$, the reduced distortion coefficient is the map
\begin{align*}
\sigma_{K,N}^{(t)}:[0,\infty)\longrightarrow[0,\infty].
\end{align*}
For $\theta\in[0,\infty)$, its value is specified by the following four cases. If $K>0$ and $0<\theta<\pi\sqrt{N/K}$, then
\begin{align*}
\sigma_{K,N}^{(t)}(\theta)=\frac{\sin(t\theta\sqrt{K/N})}{\sin(\theta\sqrt{K/N})}.
\end{align*}
If $K=0$ or $\theta=0$, then
\begin{align*}
\sigma_{K,N}^{(t)}(\theta)=t.
\end{align*}
If $K<0$ and $\theta>0$, then
\begin{align*}
\sigma_{K,N}^{(t)}(\theta)=\frac{\sinh(t\theta\sqrt{-K/N})}{\sinh(\theta\sqrt{-K/N})}.
\end{align*}
If $K>0$ and $\theta\ge\pi\sqrt{N/K}$, then
\begin{align*}
\sigma_{K,N}^{(t)}(\theta)=+\infty.
\end{align*}
For $N>1$, the Lott-Sturm-Villani distortion coefficient is the map
\begin{align*}
\tau_{K,N}^{(t)}:[0,\infty)\longrightarrow[0,\infty], \qquad \tau_{K,N}^{(t)}(\theta)=t^{1/N}\left(\sigma_{K,N-1}^{(t)}(\theta)\right)^{1-1/N}.
\end{align*}
[/definition]
The coefficient $\sigma$ records one transverse Jacobi direction, while $\tau$ combines the radial interpolation factor $t^{1/N}$ with the $(N-1)$ transverse directions. Using $\sigma$ alone would give the reduced curvature-dimension condition, commonly denoted $CD^*(K,N)$, rather than the standard Lott-Sturm-Villani $CD(K,N)$ condition, so the finite-dimensional definition below uses $\tau$.
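For small distances the reduced coefficient can be expanded, which makes the role of the sign of $K$ explicit. The following sketch uses only the Taylor expansions $\sin x=x-x^3/6+O(x^5)$ and $\sinh x=x+x^3/6+O(x^5)$ in the two branches of the definition. For fixed $t\in[0,1]$ and $\theta\to0$,
\begin{align*}
\sigma_{K,N}^{(t)}(\theta)=t\left(1+\frac{K}{6N}(1-t^2)\theta^2+O(\theta^4)\right)
\end{align*}
for every $K\in\mathbb R$. Thus, for $t\in(0,1)$ and small $\theta>0$, the coefficient lies above the flat value $t$ when $K>0$ and below it when $K<0$.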
These coefficients are one-dimensional shadows of the Jacobi field comparison appearing in the smooth proof. Once the coefficients are available, they motivate the finite-dimensional condition comparing the endpoint densities with the density at intermediate times along a dynamical optimal plan. The obstruction addressed here is that $CD(K,\infty)$ sees only logarithmic entropy: Euclidean spaces of all finite dimensions satisfy $CD(0,\infty)$, so no Bishop-Gromov type dimension information can be recovered from that condition alone.
For a geodesic metric space $X$, write $\operatorname{Geo}(X)$ for the set of constant-speed geodesics $\gamma:[0,1]\to X$, equipped with the topology of uniform convergence. The evaluation map at time $t$ is denoted by $e_t:\operatorname{Geo}(X)\to X$, $e_t(\gamma)=\gamma_t$.
The finite-dimensional definition therefore has to remember the whole path of transported mass, not only the endpoint measures. The dynamical plan supplies this path, and the inequality below compares the intermediate density with the two endpoint densities after correcting by the model-space distortion factors.
[definition: Curvature-Dimension Condition CD K N]
Let $(X,d,m)$ be a complete separable geodesic metric measure space, let $m$ be locally finite with full support, and let $K\in\mathbb R$, $N\in(1,\infty)$. The space satisfies the weak Lott-Sturm-Villani condition $CD(K,N)$ if for every pair $\mu_0=\rho_0m$ and $\mu_1=\rho_1m$ in $\mathcal P_2(X)$ with bounded support there exists an optimal dynamical plan $\pi\in\mathcal P(\operatorname{Geo}(X))$ from $\mu_0$ to $\mu_1$ such that, writing $\mu_t=(e_t)_\#\pi=\rho_t m+\mu_t^\perp$ with $\mu_t^\perp\perp m$,
\begin{align*}
\int_X \rho_t^{1-1/N}\,dm \ge \int_{\operatorname{Geo}(X)} \left[ \tau_{K,N}^{(1-t)}(d(\gamma_0,\gamma_1))\rho_0(\gamma_0)^{-1/N} + \tau_{K,N}^{(t)}(d(\gamma_0,\gamma_1))\rho_1(\gamma_1)^{-1/N} \right] \,d\pi(\gamma)
\end{align*}
for every $t\in[0,1]$, with $r^{-1/N}=+\infty$ when $r=0$ and with the extended-real convention that the inequality is required to be meaningful.
[/definition]
This is a weak formulation: it asks for at least one good optimal plan for each endpoint pair, while the strong version requires the inequality for every optimal dynamical plan between such endpoints. That distinction matters on branching spaces such as metric graphs, where different optimal plans can merge and split mass in ways that destroy pointwise density comparison even though another plan behaves well. The singular part $\mu_t^\perp$ is recorded because intermediate measures need not remain absolutely continuous in a general metric space; the left-hand side uses only the density of the absolutely continuous part. These conventions set up the smooth consistency theorem, where non-branching and absolute continuity remove the ambiguity.
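As a consistency check, the flat case $K=0$ can be worked out directly from the definitions. Since $\sigma_{0,N-1}^{(t)}(\theta)=t$, the distortion coefficient is
\begin{align*}
\tau_{0,N}^{(t)}(\theta)=t^{1/N}\,t^{1-1/N}=t.
\end{align*}
Because $(e_0)_\#\pi=\rho_0m$ and $(e_1)_\#\pi=\rho_1m$, the right-hand side of the $CD(0,N)$ inequality then reduces to
\begin{align*}
\int_{\operatorname{Geo}(X)}\left[(1-t)\rho_0(\gamma_0)^{-1/N}+t\rho_1(\gamma_1)^{-1/N}\right]d\pi(\gamma)
=(1-t)\int_X\rho_0^{1-1/N}\,dm+t\int_X\rho_1^{1-1/N}\,dm.
\end{align*}
So $CD(0,N)$ asks that $\int_X\rho_t^{1-1/N}\,dm$ dominate the linear interpolation of its endpoint values along the chosen plan, with no dependence on the transport distance.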
The finite-dimensional condition is written using the power functional $\int\rho^{1-1/N}\,dm$ rather than the logarithmic entropy, so its normalization is not self-evident from the definition alone. A synthetic curvature-dimension inequality would be poorly calibrated if, on a smooth $n$-manifold with Riemannian volume, it produced a curvature bound different from the classical Ricci lower bound or allowed an effective dimension below the manifold dimension. The consistency problem is therefore to compare the distortion-coefficient inequality with the usual Jacobi-field description of volume distortion.
[quotetheorem:9589]
[citeproof:9589]
The dimension hypothesis is sharp: a smooth $n$-manifold with its Riemannian volume cannot satisfy $CD(K,N)$ for any $N<n$, as small balls have volume growth comparable to $r^n$ rather than $r^N$. Completeness is needed for the same reason as before; removing a point from $\mathbb R^n$ leaves the local Ricci tensor unchanged away from the puncture but destroys global geodesic completeness and changes which optimal dynamical plans exist. The use of Riemannian volume is also essential in this exact form: replacing volume by $e^{-V}\operatorname{vol}$ changes the tensor to a Bakry-Emery tensor, so the unweighted Ricci condition no longer governs the density distortion. The theorem says nothing about arbitrary singular spaces satisfying $CD(K,N)$; branching examples show that synthetic curvature bounds can behave differently unless extra hypotheses such as essential non-branching or infinitesimal Hilbertianity are added. Its role here is calibration: the synthetic formula has the right normalization before we pass to weighted measures and nonsmooth limits.
[example: Dimension Bound on Model Spaces]
For the round unit sphere $S^n$, every two-plane has sectional curvature $1$. If $v$ is a unit tangent vector and $e_2,\ldots,e_n$ is an orthonormal basis of $v^\perp$, then the Ricci trace is
\begin{align*}
\operatorname{Ric}(v,v)=\sum_{i=2}^n \sec(v,e_i)=\sum_{i=2}^n 1=n-1.
\end{align*}
For a general tangent vector $w=|w|v$, bilinearity gives
\begin{align*}
\operatorname{Ric}(w,w)=|w|^2\operatorname{Ric}(v,v)=(n-1)|w|^2=(n-1)g(w,w).
\end{align*}
Thus $\operatorname{Ric}=(n-1)g$, so *Smooth Manifolds Satisfy Curvature-Dimension* gives
\begin{align*}
(S^n,d_g,\operatorname{vol})\text{ satisfies }CD(n-1,n).
\end{align*}
For Euclidean space $\mathbb R^n$, all sectional curvatures are $0$. With the same orthonormal trace,
\begin{align*}
\operatorname{Ric}(v,v)=\sum_{i=2}^n 0=0
\end{align*}
for every unit vector $v$, hence $\operatorname{Ric}=0$. Applying *Smooth Manifolds Satisfy Curvature-Dimension* with $K=0$ gives
\begin{align*}
(\mathbb R^n,|\cdot|,\mathcal L^n)\text{ satisfies }CD(0,n).
\end{align*}
For hyperbolic space $\mathbb H^n$, every two-plane has sectional curvature $-1$. Therefore, for a unit tangent vector $v$,
\begin{align*}
\operatorname{Ric}(v,v)=\sum_{i=2}^n \sec(v,e_i)=\sum_{i=2}^n (-1)=-(n-1).
\end{align*}
For $w=|w|v$, this becomes
\begin{align*}
\operatorname{Ric}(w,w)=-(n-1)|w|^2=-(n-1)g(w,w),
\end{align*}
so $\operatorname{Ric}=-(n-1)g$. Hence *Smooth Manifolds Satisfy Curvature-Dimension* gives
\begin{align*}
(\mathbb H^n,d_g,\operatorname{vol})\text{ satisfies }CD(-(n-1),n).
\end{align*}
The corresponding $CD(K,\infty)$ statements follow from the same Ricci lower bounds through entropy $K$-displacement convexity. The finite-dimensional statements are sharper here because the parameter $N=n$ is carried by the distortion coefficients, so the inequality records both the Ricci lower bound and the model space dimension.
[/example]
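For the three model spaces the coefficients entering the $CD(K,n)$ inequality can be written out explicitly. The following is a short computation from the definition of distortion coefficients, assuming $n\ge2$ and using that $\sqrt{|K|/(N-1)}=1$ when $K=\pm(n-1)$ and $N=n$:
\begin{align*}
\tau_{n-1,n}^{(t)}(\theta)&=t^{1/n}\left(\frac{\sin(t\theta)}{\sin\theta}\right)^{1-1/n},\qquad 0<\theta<\pi,\\
\tau_{0,n}^{(t)}(\theta)&=t,\\
\tau_{-(n-1),n}^{(t)}(\theta)&=t^{1/n}\left(\frac{\sinh(t\theta)}{\sinh\theta}\right)^{1-1/n},\qquad \theta>0.
\end{align*}
On the sphere the threshold $\theta=\pi$ is the distance between antipodal points, that is, the diameter of $S^n$.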
The sharp dimension parameter $N=n$ in these examples depends on using the Riemannian volume as reference measure. Many applications require weighted measures, and then the Ricci tensor must be modified.
## Weighted Manifolds and the Bakry-Emery Tensor
The next question is what curvature means when the geometry is a manifold but the reference measure is not volume. In analysis and probability, the natural invariant measure is often $e^{-V}\operatorname{vol}$ for a potential $V$. Transport detects the convexity of $V$ together with the curvature of the underlying metric.
[definition: Weighted Manifold]
A weighted Riemannian manifold is a triple $(M,g,m)$ where $(M,g)$ is a complete Riemannian manifold and
\begin{align*}
m = e^{-V}\operatorname{vol}
\end{align*}
for a smooth function $V:M\to\mathbb R$.
[/definition]
The weight changes entropy by adding the potential energy term
\begin{align*}
\int_M V\,d\mu.
\end{align*}
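This can be checked directly. If $\mu=\rho\operatorname{vol}$, then the density of $\mu$ with respect to $m=e^{-V}\operatorname{vol}$ is $\rho e^{V}$, and, assuming both integrals on the second line are finite,
\begin{align*}
\operatorname{Ent}_m(\mu)&=\int_M \rho e^{V}\log\bigl(\rho e^{V}\bigr)\,e^{-V}\,d\operatorname{vol}\\
&=\int_M \rho\log\rho\,d\operatorname{vol}+\int_M V\rho\,d\operatorname{vol}\\
&=\operatorname{Ent}_{\operatorname{vol}}(\mu)+\int_M V\,d\mu.
\end{align*}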
To express the curvature detected by this modified entropy, the ordinary Ricci tensor must be combined with the Hessian of the potential.
[definition: Bakry-Emery Ricci Tensor]
Let $(M,g,m)$ be a weighted Riemannian manifold with $m=e^{-V}\operatorname{vol}$. The infinite-dimensional Bakry-Emery Ricci tensor is
\begin{align*}
\operatorname{Ric}_V = \operatorname{Ric}+\nabla^2 V.
\end{align*}
[/definition]
This tensor is the curvature seen by the diffusion generator $\Delta-\nabla V\cdot\nabla$, but the transport definition of curvature is phrased through convexity of entropy along Wasserstein geodesics. After changing the reference measure from volume to $e^{-V}\operatorname{vol}$, entropy gains a potential-energy term, and its second variation contains both the Ricci contribution from volume distortion and the Hessian contribution from $V$. The question is whether these two terms combine exactly into the Bakry-Emery lower bound.
[quotetheorem:9590]
[citeproof:9590]
Completeness and absence of boundary are global hypotheses, not presentation choices. On the punctured plane $\mathbb R^2\setminus\{0\}$ with the Euclidean metric, particle trajectories of a transport geodesic between endpoint measures may have to pass through the missing point, while on a bounded domain with nonconvex boundary the displacement interpolation can leave the domain even though the interior tensor calculation is flat. Local finiteness and full support ensure that entropy is a genuine metric-measure functional rather than a quantity supported on an invisible subset; if the reference measure vanishes on an open set, transport into that region produces singular parts with infinite entropy. The smoothness of $V$ is used to differentiate the potential energy twice along geodesics; for rough weights, the same statement requires a weak Hessian or a separate approximation theorem. The lower bound on $\operatorname{Ric}_V$ is also essential: on $\mathbb R$ with $V(x)=-x^2$, one has $\operatorname{Ric}_V=\nabla^2V=-2$, so the reference measure expands rather than concentrates and the entropy cannot satisfy a convexity bound with any $K>-2$. The theorem does not impose finite dimension in the $CD(K,N)$ sense, so the next example should be read as an $N=\infty$ curvature statement.
[example: Gaussian Measure]
On $\mathbb R^n$ with its Euclidean metric, the Riemannian volume measure is $\mathcal L^n$. Write the Gaussian reference measure as $m=e^{-\widetilde V}\mathcal L^n$, where
\begin{align*}
\widetilde V(x)=\frac{|x|^2}{2}+\frac n2\log(2\pi).
\end{align*}
The additive constant does not affect first or second derivatives. Since
\begin{align*}
\widetilde V(x)=\frac12\sum_{i=1}^n x_i^2+\frac n2\log(2\pi),
\end{align*}
we have
\begin{align*}
\frac{\partial \widetilde V}{\partial x_i}=x_i.
\end{align*}
Differentiating once more gives
\begin{align*}
\frac{\partial^2 \widetilde V}{\partial x_i\partial x_j}=\delta_{ij}.
\end{align*}
Thus $\nabla^2\widetilde V=I$ as a symmetric bilinear form. Euclidean space is flat, so $\operatorname{Ric}=0$, and the Bakry-Emery tensor is
\begin{align*}
\operatorname{Ric}_{\widetilde V}=\operatorname{Ric}+\nabla^2\widetilde V=0+I=I.
\end{align*}
Equivalently, for every tangent vector $w\in T_x\mathbb R^n$,
\begin{align*}
\operatorname{Ric}_{\widetilde V}(w,w)=|w|^2=g(w,w).
\end{align*}
Hence $\operatorname{Ric}_{\widetilde V}\ge 1\cdot g$, so by *Weighted Entropy Convexity*, $\operatorname{Ent}_m$ is $1$-displacement convex. This is the transport curvature input behind Gaussian concentration and the logarithmic Sobolev inequality.
[/example]
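A simple family suggests that the constant $1$ in the Gaussian example cannot be improved. The following sketch uses translated Gaussians $\mu_0=\mathcal N(a,I)$ and $\mu_1=\mathcal N(b,I)$ with $a,b\in\mathbb R^n$ (notation introduced here) and the standard Gaussian $m$ as reference measure. The optimal map is the translation $x\mapsto x+b-a$, so $\mu_t=\mathcal N((1-t)a+tb,I)$ and $W_2^2(\mu_0,\mu_1)=|a-b|^2$. The relative entropy of $\mathcal N(c,I)$ with respect to $m$ is $|c|^2/2$, hence
\begin{align*}
\operatorname{Ent}_m(\mu_t)&=\frac12\bigl|(1-t)a+tb\bigr|^2\\
&=(1-t)\frac{|a|^2}{2}+t\frac{|b|^2}{2}-\frac12t(1-t)|a-b|^2\\
&=(1-t)\operatorname{Ent}_m(\mu_0)+t\operatorname{Ent}_m(\mu_1)-\frac12t(1-t)W_2^2(\mu_0,\mu_1).
\end{align*}
Equality holds for every $t$, so along these geodesics the $K$-displacement convexity inequality fails for every $K>1$ as soon as $a\ne b$.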
Finite-dimensional weighted curvature requires an additional correction for the gradient of the weight. The Gaussian example has $N=\infty$; to discuss finite effective dimension, the tensor must include a term measuring how the weight can mimic hidden dimensions.
[definition: N-Bakry-Emery Ricci Tensor]
Let $(M,g,m)$ be an $n$-dimensional weighted Riemannian manifold with $m=e^{-V}\operatorname{vol}$ and let $N\in(n,\infty)$. The $N$-Bakry-Emery Ricci tensor is
\begin{align*}
\operatorname{Ric}_{V,N} = \operatorname{Ric}+\nabla^2V-\frac{1}{N-n}\nabla V\otimes\nabla V.
\end{align*}
[/definition]
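As an illustration, consider the Gaussian weight on $\mathbb R^n$ with $V(x)=|x|^2/2$ (the additive constant is dropped because it affects neither $\nabla V$ nor $\nabla^2V$) and any $N\in(n,\infty)$. With $\operatorname{Ric}=0$, $\nabla^2V=I$, and $\nabla V(x)=x$, the definition gives, for a tangent vector $w$ at $x$,
\begin{align*}
\operatorname{Ric}_{V,N}(w,w)=|w|^2-\frac{(x\cdot w)^2}{N-n}.
\end{align*}
Taking the unit vector $w=x/|x|$ yields $1-|x|^2/(N-n)$, which tends to $-\infty$ as $|x|\to\infty$. So this tensor has no lower bound $K$ for any finite $N$, consistent with reading the Gaussian example as an $N=\infty$ statement.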
The final negative term is the cost of treating the weight as hidden dimensions. The transport question is whether this tensor lower bound gives the finite-dimensional $CD(K,N)$ inequality with the same parameters.
[quotetheorem:9591]
[citeproof:9591]
The restriction $N>n$ is not cosmetic: the coefficient $(N-n)^{-1}$ becomes singular at $N=n$, and in that borderline case a nonconstant weight is incompatible with finite effective dimension in this formula. A concrete failure occurs already on $\mathbb R$ with $m=e^{-ax}\mathcal L^1$ and $a\ne0$. Here $V(x)=ax$ has $\nabla^2V=0$, so omitting the gradient term would suggest a flat finite-dimensional bound, but the correct tensor is $\operatorname{Ric}_{V,N}=-a^2/(N-1)<0$ and therefore cannot give $CD(0,N)$. Completeness is also part of the metric-measure conclusion: if a point is removed from a manifold, the local tensor inequality still holds away from the puncture while the global dynamical plans required by $CD(K,N)$ are no longer available across the missing point. Full support prevents a different failure, where an open region of the metric space carries no reference mass and the density comparison in the $CD$ inequality no longer describes the geometry there. The theorem also does not assert the converse for arbitrary metric measure spaces, nor does it identify weak and strong $CD$ formulations on branching spaces. Its content is the smooth sufficient criterion: a pointwise weighted tensor bound forces the finite-dimensional transport inequality with the same $K$ and $N$.
## Lott-Sturm-Villani and Bakry-Emery Viewpoints
The transport viewpoint defines curvature through convexity of entropy along geodesics in probability space. The Bakry-Emery viewpoint defines curvature through differential inequalities for a Markov generator. The central comparison problem is to understand when these languages describe the same analytic content.
[definition: Carre du Champ and Iterated Carre du Champ]
Let $(M,g,m)$ be a smooth weighted Riemannian manifold and let $\mathcal A=C_c^\infty(M)$. A diffusion operator is a linear map
\begin{align*}
L:\mathcal A\longrightarrow C^\infty(M)
\end{align*}
such that $L$ is second order with no zeroth-order term. The carre du champ is the bilinear map
\begin{align*}
\Gamma:\mathcal A\times\mathcal A\longrightarrow C^\infty(M), \qquad \Gamma(f,g)=\frac{1}{2}\left(L(fg)-fLg-gLf\right).
\end{align*}
The iterated carre du champ is the bilinear map
\begin{align*}
\Gamma_2:\mathcal A\times\mathcal A\longrightarrow C^\infty(M), \qquad \Gamma_2(f,g)=\frac{1}{2}\left(L\Gamma(f,g)-\Gamma(f,Lg)-\Gamma(g,Lf)\right).
\end{align*}
For $f\in\mathcal A$, write $\Gamma(f)=\Gamma(f,f)$ and $\Gamma_2(f)=\Gamma_2(f,f)$.
[/definition]
This formalism packages first and second derivatives of the diffusion semigroup. To turn it into a curvature lower bound, the second-order form must dominate the first-order energy with the same constants $K$ and $N$ as in the transport condition; this motivates the Bakry-Emery condition.
[definition: Bakry-Emery Condition BE K N]
Let $(M,g,m)$ be a smooth weighted Riemannian manifold, let $\mathcal A=C_c^\infty(M)$, and let
\begin{align*}
L:\mathcal A\longrightarrow C^\infty(M)
\end{align*}
be a diffusion operator with associated maps $\Gamma,\Gamma_2:\mathcal A\times\mathcal A\to C^\infty(M)$. For $K\in\mathbb R$ and $N\in(1,\infty]$, the condition $BE(K,N)$ holds if
\begin{align*}
\Gamma_2(f) \ge K\Gamma(f)+\frac{1}{N}(Lf)^2
\end{align*}
pointwise on $M$ for every $f\in\mathcal A$, with the final term omitted when $N=\infty$.
[/definition]
The inequality is local and infinitesimal, while $CD(K,N)$ is global and variational. On a weighted manifold with $L=\Delta-\nabla V\cdot\nabla$, the Bochner formula identifies the local curvature term inside $\Gamma_2$, and this motivates the bridge theorem.
[quotetheorem:9592]
[citeproof:9592]
Compactly supported smooth test functions keep integrations by parts away from boundary terms; on a domain with a concave boundary, heat flow can create boundary contributions of the wrong sign even when the interior Ricci tensor is nonnegative. Smoothness of $V$ is also a genuine hypothesis: if the weight has a corner, for instance $V(x)=|x|$ on the line, the Hessian contains a singular contribution and the displayed pointwise identity is no longer an identity of smooth functions. Finally, the formula itself does not state a curvature lower bound until the Hessian square is estimated and $\operatorname{Ric}_V$ is bounded from below; on $\mathbb R$ with $V(x)=-x^2$, the term $\nabla^2V=-2$ gives negative weighted curvature. The purpose of the formula is to isolate exactly where the curvature term enters $\Gamma_2$, so the next theorem can compare the local $BE(K,N)$ inequality with the global transport definition.
[quotetheorem:9593]
[citeproof:9593]
The hypotheses are deliberately smooth, complete, and non-branching: Riemannian geodesics do not split after agreeing for a time, and the measure has a smooth positive density. Completeness cannot be dropped without changing the global transport statement; for instance, an incomplete open submanifold can have the same local Bochner identity as its completion while some Wasserstein geodesics leave the space. The smooth positive density is needed because the formula for $L$ and the tensor $\operatorname{Ric}_{V,N}$ use first and second derivatives of $V$; a weight with a corner has no pointwise smooth Hessian to insert into Bochner's identity. Outside this class the statement must be reformulated; for example, Finsler manifolds may satisfy transport $CD(K,N)$ while their energy is not Hilbertian, so the usual quadratic $\Gamma$ calculus does not match the Riemannian one. The theorem also does not identify the standard $CD$ condition, the reduced $CD$ condition, strong $CD$, and $RCD$ in general metric spaces. In nonsmooth theory, the equivalence between transport curvature and Bakry-Emery type calculus is usually stated under infinitesimal Hilbertianity as an $RCD(K,N)$ result, not as a bare $CD(K,N)$ theorem. It only says that on smooth weighted Riemannian manifolds the transport, tensor, and diffusion formulations have the same content, which is why examples can be checked by whichever side is easiest.
[example: Ornstein-Uhlenbeck Generator]
For the standard Gaussian measure $m$ on $\mathbb R^n$, write
\begin{align*}
m=e^{-V}\mathcal L^n,\qquad V(x)=\frac{|x|^2}{2}+\frac n2\log(2\pi).
\end{align*}
The constant term has zero gradient, and
\begin{align*}
\nabla V(x)=x.
\end{align*}
Therefore the weighted diffusion generator is
\begin{align*}
Lf=\Delta f-\nabla V\cdot\nabla f=\Delta f-x\cdot\nabla f.
\end{align*}
For this generator, the carre du champ is the Euclidean energy. Indeed, using $\Delta(fg)=f\Delta g+g\Delta f+2\nabla f\cdot\nabla g$ and $x\cdot\nabla(fg)=f\,x\cdot\nabla g+g\,x\cdot\nabla f$, we get
\begin{align*}
L(fg)=fLg+gLf+2\nabla f\cdot\nabla g.
\end{align*}
Hence
\begin{align*}
\Gamma(f,g)=\frac12\left(L(fg)-fLg-gLf\right)=\nabla f\cdot\nabla g,
\end{align*}
so
\begin{align*}
\Gamma(f)=|\nabla f|^2.
\end{align*}
Since Euclidean space has $\operatorname{Ric}=0$ and
\begin{align*}
\frac{\partial^2 V}{\partial x_i\partial x_j}=\delta_{ij},
\end{align*}
we have
\begin{align*}
\operatorname{Ric}_V=\operatorname{Ric}+\nabla^2V=0+I=I.
\end{align*}
By *Bochner Formula for Weighted Manifolds*,
\begin{align*}
\Gamma_2(f)=|\nabla^2 f|^2+\operatorname{Ric}_V(\nabla f,\nabla f)=|\nabla^2 f|^2+|\nabla f|^2.
\end{align*}
Since $|\nabla^2 f|^2\ge0$, this gives
\begin{align*}
\Gamma_2(f)\ge|\nabla f|^2=\Gamma(f).
\end{align*}
Thus the Ornstein-Uhlenbeck generator satisfies $BE(1,\infty)$, matching the $CD(1,\infty)$ curvature statement for Gaussian entropy convexity.
[/example]
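The constant $K=1$ in the Ornstein-Uhlenbeck example can be tested on the coordinate function $f(x)=x_1$. This function is not compactly supported, so the following is a pointwise sketch: since $L$, $\Gamma$, and $\Gamma_2$ are local differential expressions, a cutoff of $f$ that equals $x_1$ on a given ball is an admissible test function with the same values on that ball. There,
\begin{align*}
Lf&=\Delta x_1-x\cdot\nabla x_1=-x_1,\\
\Gamma(f)&=|\nabla x_1|^2=1,\\
\Gamma_2(f)&=\frac12L\Gamma(f)-\Gamma(f,Lf)=0-\nabla x_1\cdot\nabla(-x_1)=1.
\end{align*}
So $\Gamma_2(f)=\Gamma(f)$, and the inequality $\Gamma_2(f)\ge K\Gamma(f)$ fails for every $K>1$. Moreover $(Lf)^2=x_1^2$ is unbounded while $\Gamma_2(f)=1$, so $\Gamma_2(f)\ge K\Gamma(f)+\frac1N(Lf)^2$ fails at points with large $|x_1|$ for every finite $N$ and every $K$; the Ornstein-Uhlenbeck generator satisfies $BE(K,N)$ only with $N=\infty$.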
## Tensorization and Stability
The final problem in this chapter is structural: if curvature-dimension bounds are to serve as a synthetic replacement for Ricci bounds, they must survive products and limits. These two operations are unavoidable in analysis, probability, and geometry.
[quotetheorem:9594]
[citeproof:9594]
The common lower bound must be the same $K$ on both factors; if one factor has curvature only $K_1<K_2$, the product cannot inherit the stronger bound $K_2$ because geodesics moving entirely in the first factor already violate it. The product structure is also part of the hypothesis. If the same underlying set $X_1\times X_2$ is equipped with a warped metric, or if the measure is replaced by a non-product density such as $e^{-V(x_1,x_2)}m_1\otimes m_2$ with mixed second derivatives, the factorwise inequalities no longer control transports that couple the two coordinates. The theorem does not say that dimensions multiply, nor does it improve curvature through averaging across factors. Its use is practical: check the factors separately, add the dimension parameters, and carry the weaker curvature lower bound to the product.
[example: Product of Gaussian Spaces]
Identify $\mathbb R^{n_1}\times\mathbb R^{n_2}$ with $\mathbb R^{n_1+n_2}$ and use the product Euclidean distance, so
\begin{align*} |(x_1,x_2)|^2=|x_1|^2+|x_2|^2. \end{align*}
The product Gaussian measure has density
\begin{align*} d(\gamma_{n_1}\otimes\gamma_{n_2})(x_1,x_2)=(2\pi)^{-n_1/2}e^{-|x_1|^2/2}(2\pi)^{-n_2/2}e^{-|x_2|^2/2}\,dx_1\,dx_2. \end{align*}
Multiplying the constants and exponents gives
\begin{align*} d(\gamma_{n_1}\otimes\gamma_{n_2})(x_1,x_2)=(2\pi)^{-(n_1+n_2)/2}e^{-(|x_1|^2+|x_2|^2)/2}\,d(x_1,x_2). \end{align*}
Since $|(x_1,x_2)|^2=|x_1|^2+|x_2|^2$, this is exactly the standard Gaussian measure $\gamma_{n_1+n_2}$ on $\mathbb R^{n_1+n_2}$.
Each factor satisfies $CD(1,\infty)$ by the Gaussian weighted-curvature computation and *Weighted Entropy Convexity*. Applying *Tensorization of Curvature-Dimension Bounds* with $K=1$, $N_1=\infty$, and $N_2=\infty$ gives
\begin{align*} CD(1,N_1+N_2)=CD(1,\infty+\infty)=CD(1,\infty). \end{align*}
Thus the product Gaussian space satisfies $CD(1,\infty)$.
The same curvature value is visible from the potential. Writing
\begin{align*} \widetilde V(x_1,x_2)=\frac{|x_1|^2+|x_2|^2}{2}+\frac{n_1+n_2}{2}\log(2\pi), \end{align*}
we have $\gamma_{n_1}\otimes\gamma_{n_2}=e^{-\widetilde V}\mathcal L^{n_1+n_2}$. The additive constant has zero Hessian, and differentiating the quadratic part gives
\begin{align*} \nabla^2\widetilde V=I_{n_1+n_2}. \end{align*}
Euclidean Ricci curvature is zero, so
\begin{align*} \operatorname{Ric}_{\widetilde V}=\operatorname{Ric}+\nabla^2\widetilde V=0+I_{n_1+n_2}=I_{n_1+n_2}. \end{align*}
Equivalently, for every tangent vector $w$,
\begin{align*} \operatorname{Ric}_{\widetilde V}(w,w)=|w|^2=g(w,w). \end{align*}
Hence the product space has weighted curvature lower bound $1$, matching the $CD(1,\infty)$ conclusion from tensorization.
[/example]
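The density identity in the example can be checked numerically. The following is a minimal sketch, not part of the theory: the helper name `gaussian_density` and the sample points are illustrative assumptions, and only the Python standard library is used.

```python
import math

def gaussian_density(x):
    """Standard Gaussian density on R^n evaluated at x (a list of floats)."""
    n = len(x)
    sq_norm = sum(t * t for t in x)
    return (2 * math.pi) ** (-n / 2) * math.exp(-sq_norm / 2)

# Illustrative points: x1 in R^{n_1} with n_1 = 2, x2 in R^{n_2} with n_2 = 3.
x1 = [0.3, -1.2]
x2 = [0.7, 0.1, -0.4]

# Density of the product measure versus density of the standard Gaussian on R^5.
product_density = gaussian_density(x1) * gaussian_density(x2)
joint_density = gaussian_density(x1 + x2)  # list concatenation gives the point (x1, x2)

print(product_density, joint_density)
```

The two printed values agree up to floating-point rounding, which is the numerical counterpart of $\gamma_{n_1}\otimes\gamma_{n_2}=\gamma_{n_1+n_2}$.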
Products preserve curvature inside a fixed category, but many geometric arguments also pass to limits of spaces. The obstruction is that a limit of manifolds or weighted spaces may lose smooth coordinates, develop singular points, or collapse directions, so a curvature condition defined only by tensors would no longer be directly meaningful.
Measured Gromov-Hausdorff convergence is the limit notion that retains both distances and measures. Before stating a closure theorem, we need a name for the property that a class survives this limiting process rather than depending on the smooth structure of the approximating spaces.
[definition: Measured Gromov-Hausdorff Stability]
A class of metric measure spaces is stable under measured Gromov-Hausdorff limits if, whenever $(X_j,d_j,m_j)$ belong to the class and converge in measured Gromov-Hausdorff sense to $(X,d,m)$, the limit space $(X,d,m)$ also belongs to the class.
[/definition]
This stability property is a major advantage of the transport formulation. Curvature-dimension bounds are closed under such convergence because the defining inequalities use only distances, measures, optimal plans, and lower semicontinuity; this is the content of the final structural theorem.
[quotetheorem:9595]
[proofunderconstruction:9595]
Uniform control of $K$ and $N$ is essential: a sequence with curvature bounds tending to $-\infty$ or dimensions tending to infinity need not leave a finite curvature-dimension bound in the limit. The theorem also does not assert smoothness of the limit; measured Gromov-Hausdorff limits of manifolds may have singular points or collapsed directions. Stability is therefore the bridge from the smooth checking criteria above to spaces where no tensor calculation is available.
[example: Collapsing Flat Tori]
Let $T^2_\varepsilon=S^1\times(\varepsilon S^1)$ carry the product metric
\begin{align*}
d_\varepsilon((x,y),(x',y'))^2=d_{S^1}(x,x')^2+d_{\varepsilon S^1}(y,y')^2.
\end{align*}
Both circle factors are flat one-dimensional manifolds, so their product is a flat two-dimensional manifold. Thus for every tangent vector $w\in T T^2_\varepsilon$,
\begin{align*}
\operatorname{Ric}_{T^2_\varepsilon}(w,w)=0=0\cdot g_\varepsilon(w,w).
\end{align*}
By *Smooth Manifolds Satisfy Curvature-Dimension* with $K=0$ and $n=2$, each normalized flat torus $(T^2_\varepsilon,d_\varepsilon,\operatorname{vol}_\varepsilon)$ satisfies
\begin{align*}
CD(0,2).
\end{align*}
Let $p_\varepsilon:T^2_\varepsilon\to S^1$ be the projection $p_\varepsilon(x,y)=x$. For any two points $(x,y),(x',y')$,
\begin{align*}
d_{S^1}(p_\varepsilon(x,y),p_\varepsilon(x',y'))=d_{S^1}(x,x')\le d_\varepsilon((x,y),(x',y')).
\end{align*}
Also, since the scaled circle $\varepsilon S^1$ has diameter $\pi\varepsilon$,
\begin{align*}
d_\varepsilon((x,y),(x',y'))\le d_{S^1}(x,x')+\pi\varepsilon.
\end{align*}
Therefore
\begin{align*}
\left|d_\varepsilon((x,y),(x',y'))-d_{S^1}(p_\varepsilon(x,y),p_\varepsilon(x',y'))\right|\le \pi\varepsilon.
\end{align*}
The fibers of $p_\varepsilon$ have diameter $\pi\varepsilon$, so the projection collapses only the second circle factor, and the metric error tends to $0$ as $\varepsilon\to0$.
For the measures, the normalized volume on the product factors into the normalized length measures on the two circles:
\begin{align*}
\operatorname{vol}_\varepsilon=\operatorname{length}_{S^1}\otimes\operatorname{length}_{\varepsilon S^1}.
\end{align*}
Hence for every continuous function $f$ on $S^1$,
\begin{align*}
\int_{T^2_\varepsilon} f(p_\varepsilon(x,y))\,d\operatorname{vol}_\varepsilon(x,y)=\int_{S^1} f(x)\,d\operatorname{length}_{S^1}(x).
\end{align*}
Thus $(p_\varepsilon)_\#\operatorname{vol}_\varepsilon=\operatorname{length}_{S^1}$, and the measured Gromov-Hausdorff limit as $\varepsilon\to0$ is the circle with normalized length measure. By *Stability of Curvature-Dimension Conditions*, the limit satisfies
\begin{align*}
CD(0,2).
\end{align*}
Moreover the circle is itself a flat one-dimensional model space, so it satisfies $CD(0,N)$ for every $N>1$. This example shows explicitly that the lower curvature bound survives the collapse, while the effective dimension can drop from $2$ to $1$ in the limit.
[/example]
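The metric estimate $|d_\varepsilon-d_{S^1}\circ p_\varepsilon|\le\pi\varepsilon$ from the example can be probed on random pairs of points. This is a sketch under stated assumptions: points are represented by pairs of angles in $[0,2\pi)$, the unit circle has circumference $2\pi$, and the function names are hypothetical.

```python
import math
import random

def circle_distance(s, t, radius=1.0):
    """Geodesic distance on a circle of the given radius; s and t are angles."""
    gap = abs(s - t) % (2 * math.pi)
    return radius * min(gap, 2 * math.pi - gap)

def torus_distance(p, q, eps):
    """Product distance on S^1 x (eps S^1); p and q are pairs of angles."""
    dx = circle_distance(p[0], q[0])
    dy = circle_distance(p[1], q[1], radius=eps)
    return math.hypot(dx, dy)

def max_distortion(eps, trials=2000, seed=0):
    """Largest observed |d_eps(p, q) - d_{S^1}(proj p, proj q)| over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        p = (rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi))
        q = (rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi))
        gap = torus_distance(p, q, eps) - circle_distance(p[0], q[0])
        worst = max(worst, abs(gap))
    return worst

for eps in (1.0, 0.1, 0.01):
    print(eps, max_distortion(eps), math.pi * eps)
```

The observed distortion stays below $\pi\varepsilon$ for each sampled $\varepsilon$. A random search only illustrates the bound; the proof is the two-line estimate in the example.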
The chapter's message is that Ricci curvature can be reconstructed from transport. On smooth manifolds, entropy convexity is equivalent to the classical tensor lower bound. On weighted manifolds, the same story produces Bakry-Emery curvature. In synthetic settings, $CD(K,N)$ keeps the parts of Ricci curvature that are stable under products, limits, heat flow, and displacement interpolation. This is why the theory interfaces with comparison geometry through Bishop-Gromov type volume bounds, with Chapter 8's concentration inequalities, and with PDE through heat-flow regularization on nonsmooth spaces.
# 8. Concentration, Isoperimetry, and Stability
This chapter turns the geometry of Wasserstein distance into quantitative deviation estimates. The guiding question is how an inequality comparing entropy to transport cost forces Lipschitz functions to concentrate, and why such estimates survive product constructions and bounded perturbations. Chapters 3 and 7 developed displacement convexity, curvature, and functional inequalities; here those tools become probability inequalities with dimension-free consequences.
## Transport-Entropy Inequalities
A probability measure can be studied by asking how expensive it is to transport another probability measure back to it. The transport-entropy philosophy is that measures with strong concentration make every absolutely continuous perturbation pay entropy proportional to the transportation distance it creates.
[definition: Relative Entropy]
Let $(X,d)$ be a Polish metric space, and let $\nu,\rho$ be Borel probability measures on $X$. If $\nu \ll \rho$, the relative entropy of $\nu$ with respect to $\rho$ is
\begin{align*}
H(\nu \mid \rho) := \int_X \log\left(\frac{d\nu}{d\rho}\right)\,d\nu.
\end{align*}
If $\nu$ is not absolutely continuous with respect to $\rho$, set $H(\nu\mid\rho):=+\infty$.
[/definition]
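On a finite set the definition reduces to a finite sum. The following sketch assumes that $\nu$ and $\rho$ are given as probability vectors on a common finite set; the function name `relative_entropy` is illustrative.

```python
import math

def relative_entropy(nu, rho):
    """H(nu | rho) for probability vectors nu, rho on the same finite set."""
    total = 0.0
    for nu_i, rho_i in zip(nu, rho):
        if nu_i == 0.0:
            continue              # convention 0 log 0 = 0
        if rho_i == 0.0:
            return math.inf       # nu is not absolutely continuous w.r.t. rho
        total += nu_i * math.log(nu_i / rho_i)
    return total

rho = [0.25, 0.75]
print(relative_entropy(rho, rho))          # no tilt: entropy zero
print(relative_entropy([0.5, 0.5], rho))   # a genuine tilt of rho
print(relative_entropy([0.5, 0.5], [1.0, 0.0]))  # mass where rho has none
```

The last call returns `inf`, matching the convention $H(\nu\mid\rho)=+\infty$ when $\nu\not\ll\rho$.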
Entropy measures how much the density $d\nu/d\rho$ tilts the reference law. Without a transport-entropy inequality, entropy can be too weak to control displacement: for a heavy-tailed law on $\mathbb R$, the $1$-Lipschitz function $x\mapsto x$ may have no sub-Gaussian tail, so bounded entropy tilts can still move mass far into the tail. To turn entropy into a concentration statement, we need an inequality that limits the Wasserstein displacement produced by any entropy budget.
[definition: Transport-Entropy Inequality]
Let $(X,d)$ be a Polish metric space, let $p \ge 1$, and let $C > 0$. A Borel probability measure $\rho$ satisfies $T_p(C)$ if
\begin{align*}
W_p(\nu,\rho)^2 \le 2C\,H(\nu \mid \rho)
\end{align*}
for every Borel probability measure $\nu$ on $X$.
[/definition]
The square on $W_p$ is part of the convention used in concentration theory. For $p=1$ this gives a dual handle through Kantorovich-Rubinstein; for $p=2$ it is stronger and interacts well with quadratic transport, Gaussian measures, and convexity.
[remark: Monotonicity Between T Two and T One]
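If $\rho$ satisfies $T_2(C)$, then it satisfies $T_1(C)$. Indeed, for every coupling $\pi$ of $\nu$ and $\rho$, Jensen's inequality gives $\int d\,d\pi\le\left(\int d^2\,d\pi\right)^{1/2}$, and taking the infimum over couplings yields $W_1(\nu,\rho) \le W_2(\nu,\rho)$.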
[/remark]
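The comparison $W_1\le W_2$ is easy to observe on the real line. The sketch below assumes two uniform empirical measures with equally many atoms and uses the standard fact that the sorted (monotone) pairing is optimal on $\mathbb R$ for every $p\ge1$; the function name and the sample data are illustrative.

```python
import random

def wasserstein_1d(xs, ys, p):
    """W_p between uniform empirical measures on equally many real points.

    Assumes the monotone pairing of sorted samples is optimal, which holds
    on the line for the convex costs |x - y|^p with p >= 1.
    """
    assert len(xs) == len(ys)
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(x - y) ** p for x, y in pairs) / len(xs)
    return mean_cost ** (1 / p)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [rng.uniform(-2.0, 4.0) for _ in range(500)]

w1 = wasserstein_1d(xs, ys, 1)
w2 = wasserstein_1d(xs, ys, 2)
print(w1, w2)
```

For the fixed sorted pairing, the inequality is exactly Jensen's inequality applied to the displacements $|x_i-y_i|$.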
This implication explains why $T_2$ is often proved first when a quadratic structure is available. To build intuition before the Gaussian case, we ask what concentration can be forced from the most elementary geometric hypothesis: a bounded support diameter.
[quotetheorem:9596]
[citeproof:9596]
Marton's inequality is rough when the support is large or unbounded, but it already reveals the pattern: entropy controls how far a tilted law can move in Wasserstein distance. The diameter enters because the argument has no information about the geometry beyond the largest possible displacement, so the bound deteriorates as soon as the space has long tails or no finite diameter.
This limitation explains why bounded support is only a first model for concentration. To treat Gaussian measures, product measures, and other unbounded laws, one needs a condition that controls Lipschitz observables through their exponential moments rather than through a hard cutoff on the metric space. The next criterion gives exactly that replacement.
[quotetheorem:6783]
[citeproof:6783]
This criterion is the point where transport and concentration become interchangeable. The mean-zero normalization removes the linear term in the moment generating function, while the $1$-Lipschitz restriction is exactly the dual class appearing in Kantorovich-Rubinstein duality for $W_1$. The criterion does not say that arbitrary bounded observables have Gaussian fluctuations; boundedness alone gives only a finite range, not the geometric sensitivity needed to identify the transport cost. It also prepares the product examples, because exponential moment bounds tensorize naturally for independent coordinates.
[example: Bounded Product Space]
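Let $X=X_1\times\cdots\times X_n$, where each factor $(X_i,d_i)$ has finite diameter $D_i$, and let $\rho=\rho_1\otimes\cdots\otimes\rho_n$ with $\rho_i$ a Borel probability measure on $X_i$. For each coordinate, *Marton Transportation Inequality* gives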
\begin{align*}
W_{1,d_i}(\nu_i,\rho_i)^2 \le \frac{D_i^2}{2}H(\nu_i\mid\rho_i).
\end{align*}
In the convention where $T_1(C_i)$ means $W_1^2\le 2C_iH$, this is $T_1(D_i^2/4)$ on $X_i$.
Now equip $X$ with
\begin{align*}
d(x,y)=\sum_{i=1}^n d_i(x_i,y_i).
\end{align*}
Under tensorization for the $\ell^1$ product metric, the coordinate constants add, so $\rho$ satisfies
\begin{align*}
W_{1,d}(\nu,\rho)^2 \le 2\left(\sum_{i=1}^n \frac{D_i^2}{4}\right)H(\nu\mid\rho)
= \frac{1}{2}\left(\sum_{i=1}^n D_i^2\right)H(\nu\mid\rho).
\end{align*}
Equivalently, $\rho$ satisfies $T_1(C)$ with
\begin{align*}
C=\frac{1}{4}\sum_{i=1}^n D_i^2.
\end{align*}
For the normalized metric
\begin{align*}
\bar d(x,y)=\frac{1}{n}\sum_{i=1}^n d_i(x_i,y_i),
\end{align*}
the Wasserstein distance scales by the same factor:
\begin{align*}
W_{1,\bar d}(\nu,\rho)=\frac{1}{n}W_{1,d}(\nu,\rho).
\end{align*}
Therefore
\begin{align*}
W_{1,\bar d}(\nu,\rho)^2
=\frac{1}{n^2}W_{1,d}(\nu,\rho)^2
\le \frac{1}{2n^2}\left(\sum_{i=1}^n D_i^2\right)H(\nu\mid\rho).
\end{align*}
Thus $\rho$ satisfies $T_1(\bar C)$ for $\bar d$ with
\begin{align*}
\bar C=\frac{1}{4n^2}\sum_{i=1}^n D_i^2.
\end{align*}
By the Bobkov--Gotze concentration consequence of $T_1$, every $1$-Lipschitz $f$ for $\bar d$ satisfies
\begin{align*}
\rho\left(f-\int_X f\,d\rho\ge r\right)
\le \exp\left(-\frac{r^2}{2\bar C}\right)
=\exp\left(-\frac{2n^2r^2}{\sum_{i=1}^n D_i^2}\right).
\end{align*}
If $D_i\le D$ for all $i$, then $\sum_iD_i^2\le nD^2$, so the exponent is at most $-2nr^2/D^2$; hence the typical deviation scale is $D/\sqrt n$.
[/example]
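The bookkeeping of constants in this example can be scripted. The sketch below only evaluates the formulas derived above for a list of diameters; the function names are hypothetical and nothing is simulated.

```python
import math

def normalized_t1_constant(diameters):
    """The constant bar C = (1 / (4 n^2)) * sum D_i^2 for the normalized metric."""
    n = len(diameters)
    return sum(D * D for D in diameters) / (4 * n * n)

def upper_tail_bound(r, diameters):
    """exp(-r^2 / (2 bar C)), the upper-tail bound for 1-Lipschitz f."""
    return math.exp(-r * r / (2 * normalized_t1_constant(diameters)))

diameters = [1.0, 2.0, 0.5, 1.5]
r = 0.3
n = len(diameters)

# The two forms of the exponent displayed in the example should agree.
closed_form = math.exp(-2 * n * n * r * r / sum(D * D for D in diameters))
print(upper_tail_bound(r, diameters), closed_form)
```

With all diameters equal to $D$, the bound reduces to $\exp(-2nr^2/D^2)$, the scale stated at the end of the example.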
## Gaussian Concentration from Talagrand and Herbst Arguments
The central problem for unbounded spaces is to identify a transport inequality strong enough to recover the familiar Gaussian tail scale. Talagrand's inequality supplies this for the standard Gaussian law, and the Herbst argument turns logarithmic moment bounds into tail estimates.
[quotetheorem:6792]
[citeproof:6792]
Talagrand's inequality is dimension-free: the constant does not depend on $n$. The constant $1$ reflects the standard isotropic Gaussian normalization; for a Gaussian with covariance matrix $\Sigma$, the corresponding constant is governed by the largest eigenvalue of $\Sigma$, and for general log-concave laws no comparable bound follows without a strong convexity or curvature hypothesis. In particular, merely being log-concave does not prevent weak confinement in some direction, which can destroy the Gaussian-scale transport control. The next step is to extract a tail estimate for an arbitrary Lipschitz observable from this transport control.
[quotetheorem:9597]
[citeproof:9597]
The previous theorem is the transport form of the Herbst argument. The finiteness assumption on $\int_X f\,d\rho$ is needed because the deviation is centred at the mean; without an integrable observable, the displayed expression is not even defined. The result controls geometric observables whose oscillation is limited by the metric, and it gives no direct information about non-Lipschitz functionals such as rapidly growing polynomials unless they are first truncated or estimated by another method. The Gaussian case is the cleanest illustration because the transport constant is one and the Lipschitz constant alone determines the variance proxy.
[example: Gaussian Lipschitz Deviation]
Let $Z\sim\mathcal N(0,I_n)$, and let $f:\mathbb R^n\to\mathbb R$ be $L$-Lipschitz for the Euclidean distance. By *Gaussian T Two Inequality*, the standard Gaussian law satisfies $T_2(1)$, so *Concentration from T Two* with $C=1$ gives, for every $r\ge 0$,
\begin{align*}
\mathbb P\left(f(Z)-\mathbb E[f(Z)]\ge r\right)\le \exp\left(-\frac{r^2}{2L^2}\right).
\end{align*}
For the empirical linear functional
\begin{align*}
f(x)=\frac{1}{n}\sum_{i=1}^n x_i,
\end{align*}
we compute its Lipschitz constant. For $x,y\in\mathbb R^n$, the Cauchy-Schwarz inequality gives
\begin{align*}
|f(x)-f(y)|=\frac{1}{n}\left|\sum_{i=1}^n(x_i-y_i)\right|\le \frac{1}{n}\left(\sum_{i=1}^n 1^2\right)^{1/2}\left(\sum_{i=1}^n(x_i-y_i)^2\right)^{1/2}.
\end{align*}
Since $\sum_{i=1}^n1^2=n$, this becomes
\begin{align*}
|f(x)-f(y)|\le \frac{1}{\sqrt n}\|x-y\|_2.
\end{align*}
Thus $f$ is $n^{-1/2}$-Lipschitz, and this constant is sharp because taking $x-y=t(1,\dots,1)$ gives $|f(x)-f(y)|=|t|$ and $\|x-y\|_2=\sqrt n\,|t|$.
Also,
\begin{align*}
\mathbb E[f(Z)]=\frac{1}{n}\sum_{i=1}^n\mathbb E[Z_i]=0.
\end{align*}
Substituting $L=n^{-1/2}$ into the concentration bound gives
\begin{align*}
\mathbb P(f(Z)\ge r)\le \exp\left(-\frac{r^2}{2n^{-1}}\right).
\end{align*}
Equivalently,
\begin{align*}
\mathbb P(f(Z)\ge r)\le \exp\left(-\frac{nr^2}{2}\right).
\end{align*}
This has the exact Gaussian scale: indeed $f(Z)$ is a centered Gaussian with variance $n^{-2}\sum_{i=1}^n1=n^{-1}$, so deviations of constant size have exponent proportional to $n$.
[/example]
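Because $f(Z)$ is exactly $\mathcal N(0,1/n)$ here, the bound can be compared with the true tail. A minimal sketch using the complementary error function from the standard library; the function names are illustrative.

```python
import math

def exact_gaussian_tail(n, r):
    """P(f(Z) >= r) when f(Z) ~ N(0, 1/n), written with erfc."""
    return 0.5 * math.erfc(r * math.sqrt(n) / math.sqrt(2))

def concentration_bound(n, r):
    """The transport bound exp(-n r^2 / 2) from the example."""
    return math.exp(-n * r * r / 2)

for n in (1, 10, 100):
    for r in (0.0, 0.1, 0.5, 1.0):
        print(n, r, exact_gaussian_tail(n, r), concentration_bound(n, r))
```

The exact tail sits below the bound in every row, and both decay with the same exponent $nr^2/2$ up to lower-order factors.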
Gaussian coordinates are special, but the same scaling appears for empirical averages under any law satisfying a transport-entropy inequality. The next example records the reusable computation of the Lipschitz constant.
[example: Deviation of Empirical Means]
Let $X_1,\dots,X_n$ be independent random variables with common law $\rho$ satisfying $T_1(C)$ on $\mathbb R$, and let $\rho^{\otimes n}$ denote their joint law. For a $1$-Lipschitz function $g:\mathbb R\to\mathbb R$, define
\begin{align*}
F(x_1,\dots,x_n)=\frac{1}{n}\sum_{i=1}^n g(x_i).
\end{align*}
We first compute the Lipschitz constant of $F$ for the Euclidean product metric on $\mathbb R^n$.
For $x,y\in\mathbb R^n$, the triangle inequality and the $1$-Lipschitz property of $g$ give
\begin{align*}
|F(x)-F(y)|=\left|\frac{1}{n}\sum_{i=1}^n \bigl(g(x_i)-g(y_i)\bigr)\right|\le \frac{1}{n}\sum_{i=1}^n |g(x_i)-g(y_i)|.
\end{align*}
Since $|g(x_i)-g(y_i)|\le |x_i-y_i|$ for each $i$, this implies
\begin{align*}
|F(x)-F(y)|\le \frac{1}{n}\sum_{i=1}^n |x_i-y_i|.
\end{align*}
By Cauchy-Schwarz,
\begin{align*}
\sum_{i=1}^n |x_i-y_i|\le \left(\sum_{i=1}^n 1^2\right)^{1/2}\left(\sum_{i=1}^n |x_i-y_i|^2\right)^{1/2}.
\end{align*}
Because $\sum_{i=1}^n1^2=n$, we obtain
\begin{align*}
|F(x)-F(y)|\le \frac{1}{\sqrt n}\|x-y\|_2.
\end{align*}
Thus $F$ is $n^{-1/2}$-Lipschitz for the Euclidean product metric.
The intermediate estimate $|F(x)-F(y)|\le \frac1n\sum_{i=1}^n|x_i-y_i|$ also shows that $F$ is $n^{-1}$-Lipschitz for the $\ell^1$ product metric $d_1(x,y)=\sum_{i=1}^n |x_i-y_i|$, which is the metric for which $T_1$ tensorizes. By *Tensorization of Transport Entropy*, the coordinate constants add for $d_1$, exactly as in the bounded product example, so the product law $\rho^{\otimes n}$ satisfies $T_1(nC)$ for $d_1$:
\begin{align*}
W_{1,d_1}(\nu,\rho^{\otimes n})^2\le 2nC\,H(\nu\mid\rho^{\otimes n}).
\end{align*}
The constant $C$ itself is not preserved for $d_1$; only the sum of the coordinate constants is available.
Applying *Bobkov Gotze Criterion* to the centered function
\begin{align*}
F-\int_{\mathbb R^n}F\,d\rho^{\otimes n}
\end{align*}
with Lipschitz constant $L=n^{-1}$ for $d_1$ and transport constant $nC$ gives, for every $r\ge 0$,
\begin{align*}
\mathbb P\left(F(X_1,\dots,X_n)-\mathbb E[F(X_1,\dots,X_n)]\ge r\right)\le \exp\left(-\frac{r^2}{2(nC)n^{-2}}\right).
\end{align*}
Since $2(nC)n^{-2}=2C/n$, this is
\begin{align*}
\mathbb P\left(F-\mathbb E[F]\ge r\right)\le \exp\left(-\frac{nr^2}{2C}\right).
\end{align*}
The factor $n$ in the exponent comes from averaging: the $\ell^1$ transport constant grows like $n$, while the squared Lipschitz constant of the empirical mean decays like $n^{-2}$. If $\rho$ satisfies the stronger inequality $T_2(C)$, which tensorizes without loss for the Euclidean product metric, then *Concentration from T Two* with the Euclidean Lipschitz constant $n^{-1/2}$ computed above gives the same exponent.
[/example]
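A concrete instance makes the exponent visible. The sketch below assumes $\rho$ is a Bernoulli law on $\{0,1\}\subset\mathbb R$ and $g(x)=x$; the support has diameter $1$, so the Marton bound in the convention of the bounded product example gives $T_1(1/4)$ and the estimate reads $\exp(-2nr^2)$. The function names and parameter values are illustrative.

```python
import math

def exact_upper_tail(n, p, r):
    """P(mean of n i.i.d. Bernoulli(p) variables - p >= r), from the binomial law."""
    total = 0.0
    for k in range(n + 1):
        if k / n - p >= r:
            total += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total

def transport_bound(n, r, C=0.25):
    """exp(-n r^2 / (2 C)); C = 1/4 is the assumed T_1 constant for diameter 1."""
    return math.exp(-n * r * r / (2 * C))

for n in (10, 50, 200):
    print(n, exact_upper_tail(n, 0.3, 0.13), transport_bound(n, 0.13))
```

In this special case the transport bound coincides with the classical Hoeffding bound for variables with values in $[0,1]$.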
## Stability Under Tensorization and Perturbation
Concentration inequalities are useful only if they persist under the constructions used in applications. The two basic constructions are independent products, which model many-coordinate systems, and bounded changes of density, which model controlled perturbations of an ideal reference law.
[quotetheorem:9598]
[citeproof:9598]
Tensorization is responsible for the correct scaling in empirical averages. Without it, applying a one-dimensional inequality after embedding into a high-dimensional product would lose the independence structure.
[remark: Metric Normalization in Products]
For empirical means, the Euclidean product metric and the averaging map combine to produce a Lipschitz constant of order $n^{-1/2}$. Equivalently, using a normalized product metric moves the factor of $n$ from the Lipschitz constant into the transport constant. Both conventions produce the same deviation exponent when the constants are tracked consistently.
[/remark]
The product principle handles independent coordinates, while many models arise by tilting a known reference measure. The obstruction is that a change of density can alter entropy and normalization even when the underlying metric space is unchanged, so a transport inequality for the reference law does not automatically pass to the tilted law.
A bounded perturbation is the regime where this loss can be controlled by the oscillation of the density ratio. The transfer result below quantifies how much the transport constant worsens when the new measure is uniformly comparable to the old one.
[quotetheorem:9599]
[citeproof:9599]
The perturbation factor is rarely sharp, but it is robust and dimension-free when the oscillation of the perturbation is dimension-free. This makes the principle useful for strongly log-concave references modified by bounded potentials.
[example: Bounded Perturbation of a Strongly Log Concave Law]
Let $U:\mathbb R^n\to\mathbb R$ be twice differentiable, assume $\nabla^2U(x)\ge \kappa I_n$ for every $x$, and set $\rho(dx)=Z_0^{-1}e^{-U(x)}\,dx$. By the *Bakry-Emery criterion*, this curvature lower bound gives a logarithmic Sobolev inequality for $\rho$ with constant $1/\kappa$, and by the *Otto-Villani theorem* this implies
\begin{align*}
W_2(\nu,\rho)^2\le 2\kappa^{-1}H(\nu\mid\rho)
\end{align*}
for every probability measure $\nu$. Thus $\rho$ satisfies $T_2(1/\kappa)$.
Now let $V$ be bounded and define
\begin{align*}
\mu(dx)=Z^{-1}e^{-U(x)-V(x)}\,dx.
\end{align*}
Since $d\rho=Z_0^{-1}e^{-U}\,dx$, the density of $\mu$ with respect to $\rho$ is
\begin{align*}
\frac{d\mu}{d\rho}(x)=\frac{Z_0}{Z}e^{-V(x)}.
\end{align*}
Equivalently, if $\widetilde Z=\int_{\mathbb R^n}e^{-V}\,d\rho$, then $Z=Z_0\widetilde Z$ and
\begin{align*}
d\mu=\widetilde Z^{-1}e^{-V}\,d\rho.
\end{align*}
The *Bounded Perturbation Principle* applied with $C=1/\kappa$ gives that $\mu$ satisfies $T_2(C_\mu)$ with
\begin{align*}
C_\mu=\frac{1}{\kappa}e^{\operatorname{osc}(V)}=\frac{e^{\operatorname{osc}(V)}}{\kappa}.
\end{align*}
Let $f:\mathbb R^n\to\mathbb R$ be $L$-Lipschitz. Applying *Concentration from T Two* to $\mu$ with constant $C_\mu=e^{\operatorname{osc}(V)}/\kappa$ gives, for every $r\ge 0$,
\begin{align*}
\mu\left(f-\int f\,d\mu\ge r\right)\le \exp\left(-\frac{r^2}{2C_\mu L^2}\right).
\end{align*}
Substituting the value of $C_\mu$ gives
\begin{align*}
2C_\mu L^2=2\frac{e^{\operatorname{osc}(V)}}{\kappa}L^2=\frac{2e^{\operatorname{osc}(V)}L^2}{\kappa}.
\end{align*}
Hence
\begin{align*}
\frac{r^2}{2C_\mu L^2}=\frac{\kappa r^2}{2e^{\operatorname{osc}(V)}L^2}.
\end{align*}
Therefore
\begin{align*}
\mu\left(f-\int f\,d\mu\ge r\right)\le \exp\left(-\frac{\kappa r^2}{2e^{\operatorname{osc}(V)}L^2}\right).
\end{align*}
Applying the same estimate to $-f$, which is also $L$-Lipschitz, gives the matching lower-tail bound
\begin{align*}
\mu\left(f-\int f\,d\mu\le -r\right)\le \exp\left(-\frac{\kappa r^2}{2e^{\operatorname{osc}(V)}L^2}\right).
\end{align*}
Thus the bounded perturbation preserves Gaussian Lipschitz concentration, with the variance proxy multiplied by $e^{\operatorname{osc}(V)}$.
[/example]
This final example combines all three themes of the chapter: a reference transport inequality, a stability principle, and the conversion from transport to Lipschitz concentration. The result is a practical recipe for proving high-dimensional deviation bounds without estimating each observable separately.
[explanation: What the Chapter Achieves]
The chapter establishes three reusable mechanisms. First, $T_1$ is equivalent to sub-Gaussian exponential moments for Lipschitz observables, so transportation and concentration can be translated into one another. Second, Gaussian $T_2$ and its descendants give dimension-free concentration through the Herbst-Chernoff route. Third, tensorization and bounded perturbation explain why these estimates remain available in high-dimensional models built from independent pieces or controlled changes of density.
[/explanation]
# 9. Computational and Statistical Optimal Transport
This chapter turns the analytic theory of optimal transport into methods that can be computed from data. Chapters 1 through 8 treated Wasserstein geometry as a continuum object, with geodesics, gradient flows, curvature inequalities, and concentration estimates stated for probability measures on metric or Euclidean spaces. Here the measures are often empirical, the transport plan is represented by a matrix, and regularization is introduced both to make algorithms stable and to change the statistical behaviour of the objective.
The central tension is that exact Wasserstein distances are geometrically meaningful but computationally and statistically expensive in high dimension. Entropic regularization leads to Sinkhorn scaling, empirical measures reveal the curse of dimensionality, and generative modelling uses Kantorovich duality as a trainable loss. The chapter ends by relating the same entropy term to Schrödinger bridge interpolation, which connects the static regularized transport problem to the dynamic viewpoint of Chapter 1 through stochastic paths.
## Entropic Regularization and Sinkhorn Divergence
The first computational problem is the discrete transport problem between two point clouds. If the measures have many atoms, the linear program defining optimal transport has a large number of variables and constraints. Entropic regularization changes the linear program into a strictly convex problem whose optimizer has a multiplicative scaling form.
Let $a=(a_1,\dots,a_m)$ and $b=(b_1,\dots,b_n)$ be probability vectors with positive entries, and let $C=(C_{ij})\in \mathbb R^{m\times n}$ be a cost matrix. Before introducing the regularized objective, we need the finite-dimensional analogue of the set of Kantorovich plans: the matrices whose row and column sums prescribe the two measures.
[definition: Discrete Coupling Polytope]
For probability vectors $a\in [0,\infty)^m$ and $b\in [0,\infty)^n$ with $\sum_i a_i=\sum_j b_j=1$, the discrete coupling polytope is
\begin{align*}
\Pi(a,b)=\left\{P\in [0,\infty)^{m\times n} : \sum_{j=1}^n P_{ij}=a_i,\ \sum_{i=1}^m P_{ij}=b_j\right\}.
\end{align*}
[/definition]
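Membership in $\Pi(a,b)$ is a finite check on row and column sums. A minimal sketch, assuming matrices are stored as lists of rows; the function name and tolerance are illustrative choices.

```python
def in_coupling_polytope(P, a, b, tol=1e-12):
    """Check nonnegativity and the row/column sum constraints defining Pi(a, b)."""
    m, n = len(a), len(b)
    if len(P) != m or any(len(row) != n for row in P):
        return False
    nonnegative = all(entry >= 0 for row in P for entry in row)
    rows_ok = all(abs(sum(P[i]) - a[i]) <= tol for i in range(m))
    cols_ok = all(abs(sum(P[i][j] for i in range(m)) - b[j]) <= tol for j in range(n))
    return nonnegative and rows_ok and cols_ok

a = [0.2, 0.8]
b = [0.5, 0.3, 0.2]

# The independent coupling a b^T always belongs to Pi(a, b).
independent = [[a_i * b_j for b_j in b] for a_i in a]
print(in_coupling_polytope(independent, a, b))

# A matrix with the right column sums but the wrong row sums does not.
wrong_rows = [[0.5, 0.3, 0.2], [0.0, 0.0, 0.0]]
print(in_coupling_polytope(wrong_rows, a, b))
```

The independent coupling shows in particular that $\Pi(a,b)$ is never empty.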
This is the finite-dimensional version of the set of Kantorovich plans. The computational difficulty is that minimizing $\sum_{i,j}C_{ij}P_{ij}$ over $\Pi(a,b)$ is a linear program whose optimizer often lies on a low-dimensional face, so small changes in the data can change the active constraints. To obtain a smoother optimization problem and an efficient scaling algorithm, we add an entropy term to the linear cost.
[definition: Entropic Optimal Transport Cost]
Let $\Delta_k=\{a\in[0,\infty)^k : \sum_{i=1}^k a_i=1\}$ for $k\in\mathbb N$. For $\varepsilon>0$, the entropic optimal transport cost is the map $\operatorname{OT}_\varepsilon:\Delta_m\times \Delta_n\times \mathbb R^{m\times n}\to \mathbb R$ defined by
\begin{align*}
\operatorname{OT}_\varepsilon(a,b;C)=\min_{P\in \Pi(a,b)}\left\{\sum_{i=1}^m\sum_{j=1}^n C_{ij}P_{ij}+\varepsilon\sum_{i=1}^m\sum_{j=1}^n P_{ij}(\log P_{ij}-1)\right\},
\end{align*}
with the convention $0(\log 0-1)=0$.
[/definition]
The entropy term makes the minimizer positive on every entry whose row and column marginal are both positive. Its Euler-Lagrange equations suggest that the optimizer should be obtained by multiplying a fixed positive kernel on the left and right by diagonal matrices. The next theorem supplies the existence, uniqueness, and algorithmic foundation for that scaling procedure in the strictly positive marginal case.
[quotetheorem:9600]
[citeproof:9600]
The strict positivity of $K$ is essential for this form of the theorem: if $K$ contains a zero pattern that disconnects some rows from some columns, prescribed positive marginals may be impossible to realize by diagonal scaling. The result also does not give a uniform iteration count; as entries of $K$ become extremely small, the alternating normalizations can become numerically stiff. For entropic optimal transport, the matrix to be scaled is $K_{ij}=e^{-C_{ij}/\varepsilon}$, so the theorem becomes both an existence result and an algorithm: alternate row and column normalizations until the desired marginals are reached. The next example shows the resulting computation in the point-cloud setting where the discrete formulation first arises.
[example: Two Empirical Point Clouds]
Let $x_1,\dots,x_m\in \mathbb R^d$ and $y_1,\dots,y_n\in \mathbb R^d$, and put empirical weights $a_i=1/m$ and $b_j=1/n$. For the quadratic cost $C_{ij}=|x_i-y_j|^2$ and parameter $\varepsilon>0$, define
\begin{align*}
K_{ij}=e^{-C_{ij}/\varepsilon}=e^{-|x_i-y_j|^2/\varepsilon}.
\end{align*}
Since $K_{ij}>0$ for every $i,j$, the *Sinkhorn Scaling Theorem* applies to this positive kernel: we look for positive vectors $u\in\mathbb R^m_+$ and $v\in\mathbb R^n_+$ such that
\begin{align*}
P_{ij}=u_iK_{ij}v_j
\end{align*}
has row sums $1/m$ and column sums $1/n$.
The row constraint for a fixed row $i$ is
\begin{align*}
\sum_{j=1}^n P_{ij}=\sum_{j=1}^n u_iK_{ij}v_j=u_i\sum_{j=1}^n K_{ij}v_j=\frac1m.
\end{align*}
Because $\sum_{j=1}^n K_{ij}v_j>0$, this gives the row update
\begin{align*}
u_i=\frac{1/m}{\sum_{j=1}^n K_{ij}v_j}.
\end{align*}
Similarly, the column constraint for a fixed column $j$ is
\begin{align*}
\sum_{i=1}^m P_{ij}=\sum_{i=1}^m u_iK_{ij}v_j=v_j\sum_{i=1}^m u_iK_{ij}=\frac1n,
\end{align*}
so
\begin{align*}
v_j=\frac{1/n}{\sum_{i=1}^m u_iK_{ij}}.
\end{align*}
Sinkhorn iteration alternates exactly these two normalizations.
The matrix $P$ is dense because $u_i>0$, $v_j>0$, and $K_{ij}>0$, hence $P_{ij}>0$ for every pair $(i,j)$. Its preference for low-cost pairs is encoded entrywise: for two pairs $(i,j)$ and $(i',j')$,
\begin{align*}
\frac{K_{ij}}{K_{i'j'}}=\exp\left(-\frac{|x_i-y_j|^2-|x_{i'}-y_{j'}|^2}{\varepsilon}\right).
\end{align*}
If $|x_i-y_j|^2<|x_{i'}-y_{j'}|^2$, then the exponent $(|x_{i'}-y_{j'}|^2-|x_i-y_j|^2)/\varepsilon$ is positive, and this ratio tends to $+\infty$ as $\varepsilon\downarrow 0$. Thus the scaled coupling remains a full matrix for every positive $\varepsilon$, but the kernel increasingly favors nearby point-cloud pairs; the marginal constraints are enforced by the two scaling vectors rather than by solving a general linear program.
[/example]
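The alternating updates derived above translate directly into a few lines of code. This is a sketch under stated assumptions: the points are taken on the real line ($d=1$) for brevity, the iteration count is fixed rather than chosen by a stopping rule, and the function name `sinkhorn` is illustrative.

```python
import math

def sinkhorn(a, b, C, eps, iterations=500):
    """Alternate the row update u = a / (K v) and the column update v = b / (K^T u)."""
    m, n = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iterations):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(u[i] * K[i][j] for i in range(m)) for j in range(n)]
    # Scaled coupling P_ij = u_i K_ij v_j.
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]

xs = [0.0, 1.0, 2.0]           # illustrative point cloud x_1, ..., x_m
ys = [0.5, 1.5]                # illustrative point cloud y_1, ..., y_n
a = [1 / len(xs)] * len(xs)    # empirical weights 1/m
b = [1 / len(ys)] * len(ys)    # empirical weights 1/n
C = [[(x - y) ** 2 for y in ys] for x in xs]

P = sinkhorn(a, b, C, eps=0.5)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(len(xs))) for j in range(len(ys))]
print(row_sums, col_sums)
```

Because the loop ends on a column update, the column sums match $b$ up to rounding, while the row sums match $a$ only up to the remaining iteration error. For small $\varepsilon$ the entries of $K$ can underflow, and practical implementations usually work with logarithms of $u$ and $v$; that variant is not shown here.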
The price of regularization is that $\operatorname{OT}_\varepsilon$ is not the original transport cost. To use it as a controlled approximation, we need to know that the regularized values return to the Kantorovich value as the entropy weight is removed.
[quotetheorem:9601]
[citeproof:9601]
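The convergence of the regularized values can be observed on a $2\times2$ problem, where the unregularized optimum is available in closed form. The sketch below is illustrative only. The two-sided check uses the elementary bounds $-\log(mn)\le\sum_{i,j}P_{ij}\log P_{ij}\le0$ for $P\in\Pi(a,b)$, which give $\varepsilon\le \operatorname{OT}_0-\operatorname{OT}_\varepsilon\le\varepsilon(1+\log(mn))$ for the entropy convention used in the definition, with $\operatorname{OT}_0$ the unregularized value; this bound is derived here for the illustration and is not quoted from the theorem. All function names and data are assumptions.

```python
import math

def sinkhorn_plan(a, b, C, eps, iterations=2000):
    """Approximate entropic optimizer by alternating row and column scalings."""
    m, n = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iterations):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(u[i] * K[i][j] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]

def entropic_objective(P, C, eps):
    """Linear cost plus eps * sum P (log P - 1), with 0 (log 0 - 1) = 0."""
    total = 0.0
    for row_P, row_C in zip(P, C):
        for p, c in zip(row_P, row_C):
            total += c * p
            if p > 0.0:
                total += eps * p * (math.log(p) - 1.0)
    return total

def exact_ot_2x2(a, b, C):
    """Unregularized value: Pi(a, b) is a segment in t = P_11, and a linear
    cost attains its minimum at an endpoint of that segment."""
    def cost(t):
        P = [[t, a[0] - t], [b[0] - t, a[1] - b[0] + t]]
        return sum(C[i][j] * P[i][j] for i in range(2) for j in range(2))
    lo, hi = max(0.0, a[0] - b[1]), min(a[0], b[0])
    return min(cost(lo), cost(hi))

a, b = [0.3, 0.7], [0.6, 0.4]
C = [[0.0, 1.0], [1.0, 0.0]]
exact = exact_ot_2x2(a, b, C)

gaps = []
for eps in (1.0, 0.5, 0.2, 0.1):
    value = entropic_objective(sinkhorn_plan(a, b, C, eps), C, eps)
    gaps.append((eps, exact - value))
    print(eps, value, exact)
```

The gap shrinks linearly with $\varepsilon$ in this example, which is consistent with convergence of the values but says nothing about how the optimal plans or the iteration counts behave.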
Finite dimensionality is doing real work here: compactness of $\Pi(a,b)$ and boundedness of the entropy term prevent mass from escaping. In continuum problems on non-compact spaces, convergence requires moment and tightness assumptions; without them, a minimizing sequence can drift to infinity while preserving weak endpoint information. The theorem also concerns optimal values, not a stability estimate for the optimal plans or the number of Sinkhorn iterations needed at small $\varepsilon$.
The convergence result explains approximation, but it does not remove the self-interaction bias introduced by the entropy term. For statistical comparisons, the regularized cost of a measure against itself should not be counted as evidence of discrepancy. This motivates a debiased version that subtracts the two self-costs.
[definition: Sinkhorn Divergence]
Let $X=\{x_1,\dots,x_m\}$ and $Y=\{y_1,\dots,y_n\}$ be finite subsets of a common space, and let $c:(X\cup Y)\times (X\cup Y)\to\mathbb R$ be a ground cost. Define $C_{X,Y}\in\mathbb R^{m\times n}$, $C_{X,X}\in\mathbb R^{m\times m}$, and $C_{Y,Y}\in\mathbb R^{n\times n}$ by
\begin{align*}
(C_{X,Y})_{ij}&=c(x_i,y_j), &
(C_{X,X})_{ii'}&=c(x_i,x_{i'}), &
(C_{Y,Y})_{jj'}&=c(y_j,y_{j'}).
\end{align*}
For $\varepsilon>0$, the Sinkhorn divergence is the map $S_\varepsilon:\Delta_m\times\Delta_n\to \mathbb R$ defined by
\begin{align*}
S_\varepsilon(a,b)=\operatorname{OT}_\varepsilon(a,b;C_{X,Y})-\frac{1}{2}\operatorname{OT}_\varepsilon(a,a;C_{X,X})-\frac{1}{2}\operatorname{OT}_\varepsilon(b,b;C_{Y,Y}).
\end{align*}
[/definition]
The subtraction is not cosmetic. When the two arguments are weights on the same finite support with the same ground cost, so that $X=Y$ and $C_{X,Y}=C_{X,X}=C_{Y,Y}$ under the natural identification, it forces $S_\varepsilon(a,a)=0$. For different supports, the two self-costs still remove the entropy-driven self-interaction terms, but the expression should not be read as an identity test between unrelated grids. This debiasing makes the divergence useful for barycenters, where the objective should compare shapes without rewarding the entropy of each image against itself.
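The effect of the two self-cost terms can be seen on a shared three-point support. The sketch below assumes strictly positive weights, a quadratic ground cost on the line, and a fixed Sinkhorn iteration count; `entropic_ot` and `sinkhorn_divergence` are illustrative names, and the entropic value is only approximated by the scaling iteration.

```python
import math

def entropic_ot(a, b, C, eps, iterations=1000):
    """Approximate OT_eps(a, b; C): Sinkhorn scaling, then evaluate the objective."""
    m, n = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iterations):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(u[i] * K[i][j] for i in range(m)) for j in range(n)]
    value = 0.0
    for i in range(m):
        for j in range(n):
            p = u[i] * K[i][j] * v[j]   # strictly positive for positive a, b
            value += C[i][j] * p + eps * p * (math.log(p) - 1.0)
    return value

def sinkhorn_divergence(a, b, C, eps):
    """S_eps(a, b) when both arguments live on the same support with cost C."""
    return (entropic_ot(a, b, C, eps)
            - 0.5 * entropic_ot(a, a, C, eps)
            - 0.5 * entropic_ot(b, b, C, eps))

points = [0.0, 1.0, 2.0]                       # shared support, X = Y
C = [[(x - y) ** 2 for y in points] for x in points]
a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]

print(entropic_ot(a, a, C, eps=1.0))           # raw self-cost: not zero
print(sinkhorn_divergence(a, a, C, eps=1.0))   # debiased self-comparison
print(sinkhorn_divergence(a, b, C, eps=1.0))   # comparison of distinct weights
```

The raw self-cost $\operatorname{OT}_\varepsilon(a,a;C)$ is nonzero, while the debiased quantity vanishes on the diagonal and is positive for the two distinct weight vectors in this example.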
[example: Barycenters of Images]
Suppose a grayscale image on a fixed pixel grid with $M$ pixels is represented by a vector $a\in\Delta_M$, where $a_k$ is the normalized intensity at pixel $k$. For images $a_1,\dots,a_N\in\Delta_M$ and weights $\lambda_r>0$ with $\sum_{r=1}^N\lambda_r=1$, a Sinkhorn barycenter minimizes
\begin{align*}
\sum_{r=1}^N \lambda_r S_\varepsilon(a,a_r)
\end{align*}
over $a\in\Delta_M$. Expanding the Sinkhorn divergence gives
\begin{align*}
\sum_{r=1}^N \lambda_r S_\varepsilon(a,a_r)=\sum_{r=1}^N\lambda_r\operatorname{OT}_\varepsilon(a,a_r;C)-\frac12\left(\sum_{r=1}^N\lambda_r\right)\operatorname{OT}_\varepsilon(a,a;C)-\frac12\sum_{r=1}^N\lambda_r\operatorname{OT}_\varepsilon(a_r,a_r;C).
\end{align*}
Since $\sum_{r=1}^N\lambda_r=1$, this is
\begin{align*}
\sum_{r=1}^N \lambda_r S_\varepsilon(a,a_r)=\sum_{r=1}^N\lambda_r\operatorname{OT}_\varepsilon(a,a_r;C)-\frac12\operatorname{OT}_\varepsilon(a,a;C)-\frac12\sum_{r=1}^N\lambda_r\operatorname{OT}_\varepsilon(a_r,a_r;C).
\end{align*}
The final term is independent of $a$, so it does not affect which $a$ minimizes the objective.
This differs from Euclidean averaging. If Euclidean averaging is defined by minimizing $\sum_{r=1}^N\lambda_r\|a-a_r\|_2^2$, then, with $\bar a=\sum_{r=1}^N\lambda_r a_r$, one has
\begin{align*}
\sum_{r=1}^N\lambda_r\|a-a_r\|_2^2=\|a\|_2^2-2\langle a,\bar a\rangle+\sum_{r=1}^N\lambda_r\|a_r\|_2^2.
\end{align*}
Also
\begin{align*}
\|a-\bar a\|_2^2=\|a\|_2^2-2\langle a,\bar a\rangle+\|\bar a\|_2^2.
\end{align*}
Therefore the Euclidean objective equals $\|a-\bar a\|_2^2+\sum_r\lambda_r\|a_r\|_2^2-\|\bar a\|_2^2$, so its minimizer is the pixelwise average $\bar a$. For two translated copies of the same shape, $\bar a$ places intensity at both translated locations. The Sinkhorn barycenter instead compares images through transport costs on the pixel grid, so mass can move across nearby pixels; the resulting average is a transported intermediate shape rather than a superposition of the separate locations.
[/example]
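The contrast in the example can be checked on a one-dimensional toy case. The sketch below is not a Sinkhorn barycenter: as a stand-in it uses the unregularized $W_2$ barycenter of two uniform point clouds on the line, obtained by averaging sorted positions, and compares it with the pixelwise average of the corresponding histograms. The grid size and helper names are assumptions made for illustration.

```python
def histogram(points, grid_size):
    """Normalized histogram of integer pixel positions."""
    h = [0.0] * grid_size
    for p in points:
        h[p] += 1.0 / len(points)
    return h

def euclidean_average(h0, h1):
    """Pixelwise average, the minimizer of the squared L2 objective."""
    return [0.5 * (u + v) for u, v in zip(h0, h1)]

def w2_barycenter_1d(points0, points1):
    """Equal-weight W_2 barycenter of two uniform clouds of the same size.

    In one dimension the optimal matching is monotone, so the barycenter
    is supported on the midpoints of the sorted positions.
    """
    return [(p + q) / 2 for p, q in zip(sorted(points0), sorted(points1))]

shape_left = [1, 2, 3]    # a three-pixel shape
shape_right = [7, 8, 9]   # the same shape translated by six pixels

avg = euclidean_average(histogram(shape_left, 11), histogram(shape_right, 11))
bary = w2_barycenter_1d(shape_left, shape_right)
print([k for k, v in enumerate(avg) if v > 0])  # support of the pixelwise average
print(bary)                                      # support of the transport average
```

The pixelwise average keeps both copies at half intensity, while the transport average is a single copy of the shape centred between them.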
## Sample Complexity and Statistical Bias of Empirical Wasserstein Distances
The second problem is statistical rather than algorithmic: even if exact Wasserstein distances could be computed, empirical measures may converge slowly in Wasserstein distance. This matters whenever the input distributions are observed only through samples. The convergence rate depends strongly on the ambient dimension and on whether regularization or smoothing is used.
Let $X_1,\dots,X_N$ be i.i.d. samples with law $\mu$ on $\mathbb R^d$. The empirical measure is the random probability measure
\begin{align*}
\hat{\mu}_N=\frac{1}{N}\sum_{i=1}^N \delta_{X_i}.
\end{align*}
The quantity $\mathbb E[W_p(\hat{\mu}_N,\mu)]$ measures the intrinsic sampling error of Wasserstein estimation. The next theorem gives the rate that makes high-dimensional empirical transport statistically expensive.
[quotetheorem:9602]
[citeproof:9602]
The upper and lower density assumptions exclude two opposite failures. If $\mu$ is the uniform measure on the line segment $[0,1]\times\{0\}^{d-1}\subset[0,1]^d$, then the sampling rate is governed by the one-dimensional geometry rather than by the ambient dimension $d$. If instead $\mu$ has a density on $[0,1]$ proportional to $x^{-\alpha}$ near $0$ for some $0<\alpha<1$, the heavy concentration near the endpoint changes local cell fluctuations and may force constants or rates outside the bounded-density theorem. The theorem also does not describe the low-dimensional regimes $d<2p$ or the critical logarithmic case $d=2p$, where the rates have different forms. For instance, when $\mu$ is uniform on $[0,1]^2$ and $p=1$, the critical case has the logarithmic correction $\mathbb E[W_1(\hat{\mu}_N,\mu)]$ of order $(\log N/N)^{1/2}$ rather than a pure $N^{-1/2}$ rate.
This theorem is the statistical reason that high-dimensional Wasserstein distances are difficult to estimate directly. It also explains why regularized, sliced, projected, and learned transport objectives appear in applications: they trade exact geometric fidelity for improved estimation and optimization behaviour. The same rate appears as an upward bias when two independent empirical samples come from the same law.
[example: Bias Between Two Empirical Samples]
Assume, as in the preceding bounded-density setting, that $\mu$ is non-atomic on $[0,1]^d$. Let $X_1,\dots,X_N,Y_1,\dots,Y_N$ be independent samples with law $\mu$, and let
\begin{align*}
\hat{\mu}_N=\frac1N\sum_{i=1}^N\delta_{X_i}
\end{align*}
and
\begin{align*}
\hat{\nu}_N=\frac1N\sum_{i=1}^N\delta_{Y_i}.
\end{align*}
Since $\mu$ is non-atomic, for each pair $(i,j)$ one has $\mathbb P(X_i=Y_j)=0$. Taking the union over the $N^2$ pairs gives
\begin{align*}
\mathbb P\left(\exists\, i,j\text{ with }X_i=Y_j\right)\le \sum_{i=1}^N\sum_{j=1}^N\mathbb P(X_i=Y_j)=0.
\end{align*}
Thus the two empirical supports are disjoint almost surely. If $\hat{\mu}_N=\hat{\nu}_N$, then every atom of $\hat{\mu}_N$ would also be an atom of $\hat{\nu}_N$, forcing some equality $X_i=Y_j$. Therefore $\hat{\mu}_N\ne \hat{\nu}_N$ almost surely, and the identity property of $W_p$ gives
\begin{align*}
W_p(\hat{\mu}_N,\hat{\nu}_N)>0
\end{align*}
almost surely, while the population distance is
\begin{align*}
W_p(\mu,\mu)=0.
\end{align*}
The size of this positive plug-in distance is controlled by the same empirical scale. By the triangle inequality,
\begin{align*}
W_p(\hat{\mu}_N,\hat{\nu}_N)\le W_p(\hat{\mu}_N,\mu)+W_p(\mu,\hat{\nu}_N).
\end{align*}
Taking expectations and using that $\hat{\mu}_N$ and $\hat{\nu}_N$ have the same law,
\begin{align*}
\mathbb E[W_p(\hat{\mu}_N,\hat{\nu}_N)]\le 2\mathbb E[W_p(\hat{\mu}_N,\mu)].
\end{align*}
In the regime $d>2p$ with density bounded above and below, *Empirical Wasserstein Curse of Dimensionality* gives constants $A,B>0$ such that
\begin{align*}
A N^{-1/d}\le \mathbb E[W_p(\hat{\mu}_N,\mu)]\le B N^{-1/d}.
\end{align*}
The corresponding two-sample empirical lower estimate gives a constant $A'>0$ with
\begin{align*}
A'N^{-1/d}\le \mathbb E[W_p(\hat{\mu}_N,\hat{\nu}_N)]\le 2BN^{-1/d}.
\end{align*}
Thus, under the null hypothesis that both samples come from the same law, the population distance is zero but the empirical plug-in distance has expectation of order $N^{-1/d}$. This is the upward statistical bias that debiased quantities such as the Sinkhorn divergence are designed to reduce.
[/example]
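The positive plug-in bias is easy to observe numerically. The following sketch is a Monte Carlo experiment in dimension $d=1$ with $p=1$, which lies outside the regime $d>2p$ of the theorem, so the expected decay is of order $N^{-1/2}$ rather than $N^{-1/d}$. It relies on the one-dimensional fact that the optimal coupling of two equal-size empirical measures matches sorted samples. The sample sizes, the seed, and the function names are illustrative assumptions.

```python
import random

def w1_empirical_1d(xs, ys):
    """W_1 between two empirical measures with the same number of atoms.

    On the line the optimal coupling matches sorted samples in order.
    """
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def mean_two_sample_w1(n, trials, rng):
    """Monte Carlo estimate of E[W_1(mu_hat_N, nu_hat_N)] for uniform mu."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]
        total += w1_empirical_1d(xs, ys)
    return total / trials

rng = random.Random(0)
for n in (50, 200, 800):
    # Both samples come from the same law, yet the estimate stays positive.
    print(n, mean_two_sample_w1(n, trials=50, rng=rng))
```

The printed averages are strictly positive although the population distance is zero, and they shrink as $N$ grows.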
Regularization changes the statistical picture because the optimizer is smoother as a function of the input measures. The resulting quantity is biased as an approximation to $W_p$, but it can have lower variance and better sample complexity at fixed $\varepsilon$. This creates the practical tuning problem summarized in the following remark.
[remark: Bias-Variance Tradeoff]
The regularization parameter $\varepsilon$ plays two roles. Smaller $\varepsilon$ makes the objective closer to unregularized optimal transport but increases numerical stiffness and statistical sensitivity. Larger $\varepsilon$ produces smoother dual potentials and more stable gradients, but it moves the objective toward a kernel discrepancy rather than a transport distance.
[/remark]
The course perspective is that computational and statistical errors should be treated together. A useful Wasserstein objective is not only one whose population value is meaningful, but one whose empirical value can be computed and optimized at the sample sizes available.
## Wasserstein Objectives in Generative Modeling
The third problem is how to compare a model distribution with a data distribution when the model is represented implicitly. In generative modelling, a neural network $G_\theta$ pushes forward a simple latent law $\zeta$ to a model law $\mu_\theta=(G_\theta)_\#\zeta$. The loss should give meaningful gradients even when $\mu_\theta$ and the data law are supported on low-dimensional sets that do not overlap.
The dual formulation of $W_1$ is the starting point for Wasserstein generative adversarial networks. It replaces optimization over couplings by optimization over Lipschitz test functions, which can then be approximated by a critic network.
[quotetheorem:6779]
[citeproof:6779]
The finite first-moment assumption is necessary because the integrals of Lipschitz functions can be infinite without it. For a concrete case on $\mathbb R$, take $\mu$ to be the Cauchy law and $\nu=\delta_0$; then
\begin{align*}
W_1(\mu,\nu)=\int_{\mathbb R}|x|\,d\mu(x)=\infty.
\end{align*}
The theorem is also an infinite-dimensional duality statement, not a guarantee that a finite neural-network class contains an optimal critic or that empirical training finds one. In a WGAN, the function $f$ is represented by a neural network critic $f_\omega$. The training objective approximates the difference between expectations under real and generated samples, so we need a precise population objective before discussing how the Lipschitz constraint is enforced.
[definition: WGAN Critic Loss]
Let $\mu_{\mathrm{data}}$ be a data law on a metric space $X$, let $\zeta$ be a latent law on a space $Z$, let $\Theta$ be a generator parameter space, and let $G_\theta:Z\to X$ be a generator for each $\theta\in\Theta$. Let $\{f_\omega:X\to \mathbb R\}_{\omega\in \Omega_\omega}$ be a parametrized class of $1$-Lipschitz critics for which the displayed expectations are well-defined. The WGAN critic loss is the map $\mathcal W_{\Omega_\omega}:\Theta\to \mathbb R\cup\{+\infty\}$ defined by
\begin{align*}
\mathcal W_{\Omega_\omega}(\theta)=\sup_{\omega\in\Omega_\omega}\left\{\mathbb E[f_\omega(X)]-\mathbb E[f_\omega(G_\theta(Z))]\right\},
\end{align*}
where $X\sim \mu_{\mathrm{data}}$ and $Z\sim \zeta$.
[/definition]
The definition hides the main engineering issue: a neural network is not automatically $1$-Lipschitz. Weight clipping, spectral normalization, and gradient penalties are different ways to enforce or encourage the constraint from Kantorovich-Rubinstein duality. The following objective records the gradient-penalty version used as a heuristic Lipschitz enforcement method.
[definition: Gradient-Penalty WGAN Objective]
Let $\mu_{\mathrm{data}}$ be a data law on $\mathbb R^d$, let $\zeta$ be a latent law on a space $Z$, let $\Theta$ be a generator parameter space, and let $G_\theta:Z\to\mathbb R^d$ be a generator for each $\theta\in\Theta$. Let $\{f_\omega:\mathbb R^d\to\mathbb R\}_{\omega\in\Omega_\omega}$ be a differentiable critic class, let $\lambda>0$, and let $\hat{X}$ be sampled along line segments between real samples $X\sim\mu_{\mathrm{data}}$ and generated samples $G_\theta(Z)$ with $Z\sim\zeta$. The gradient-penalty WGAN critic objective is the map $\mathcal L_{\mathrm{GP}}:\Theta\times\Omega_\omega\to \mathbb R\cup\{+\infty\}$ defined by
\begin{align*}
\mathcal L_{\mathrm{GP}}(\theta,\omega)=\mathbb E[f_\omega(X)]-\mathbb E[f_\omega(G_\theta(Z))]-\lambda\,\mathbb E\left[\left(|\nabla f_\omega(\hat{X})|-1\right)^2\right].
\end{align*}
[/definition]
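A toy computation shows why the penalty encourages rather than enforces the Lipschitz constraint. The sketch assumes a one-dimensional linear critic $f(x)=wx$, for which the gradient equals $w$ at every interpolation point, so that $\mathcal L_{\mathrm{GP}}$ reduces to $w\,\Delta-\lambda(|w|-1)^2$ with $\Delta=\mathbb E[X]-\mathbb E[G_\theta(Z)]$. The names `gp_objective` and `best_slope` and the numerical values are hypothetical.

```python
def gp_objective(w, gap, lam):
    """L_GP for the linear critic f(x) = w * x on the real line.

    gap stands for E[X] - E[G(Z)]; the gradient of f equals w at every
    interpolation point, so the penalty is lam * (|w| - 1)^2.
    """
    return w * gap - lam * (abs(w) - 1.0) ** 2

def best_slope(gap, lam, w_max=3.0, n_grid=3000):
    """Grid search for the slope that maximizes the penalized objective."""
    grid = [w_max * k / n_grid for k in range(-n_grid, n_grid + 1)]
    return max(grid, key=lambda w: gp_objective(w, gap, lam))

gap, lam = 2.0, 10.0
# Numerical maximizer next to the closed form 1 + gap / (2 * lam).
print(best_slope(gap, lam), 1.0 + gap / (2.0 * lam))
```

For $\Delta>0$ the maximizing slope is $1+\Delta/(2\lambda)$, which exceeds $1$: within this toy class the critic is only approximately $1$-Lipschitz, and the violation shrinks as $\lambda$ grows.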
The corresponding training step maximizes $\mathcal L_{\mathrm{GP}}(\theta,\omega)$ over $\omega$ for fixed $\theta$. The penalty is a heuristic Lipschitz enforcement mechanism. Since a differentiable function on a convex region is $1$-Lipschitz if $|\nabla f(x)|\le 1$ throughout that region, penalizing deviations of $|\nabla f|$ from $1$ along sampled interpolation paths encourages the critic to behave like an admissible Kantorovich potential where training places mass. A one-dimensional mixture makes the critic's role visible.
[example: WGAN Critic on a One-Dimensional Mixture]
Let
\begin{align*}
F_{\mathrm{data}}(x)=\frac12\Phi\left(\frac{x+2}{\sigma}\right)+\frac12\Phi\left(\frac{x-2}{\sigma}\right)
\end{align*}
and
\begin{align*}
F_m(x)=\Phi\left(\frac{x-m}{\sigma}\right),
\end{align*}
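where $\Phi$ is the standard normal distribution function and $\sigma>0$. Here $F_{\mathrm{data}}$ is the distribution function of the data law $\mu_{\mathrm{data}}=\frac12\mathcal N(-2,\sigma^2)+\frac12\mathcal N(2,\sigma^2)$, and $F_m$ is the distribution function of the generated law $\mu_\theta=\mathcal N(m,\sigma^2)$. For a differentiable $1$-Lipschitz critic $f$, the population critic objective is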
\begin{align*}
\mathbb E[f(X)]-\mathbb E[f(G_\theta(Z))]=\int_{\mathbb R} f(x)\,d(\mu_{\mathrm{data}}-\mu_\theta)(x).
\end{align*}
Writing $A_m(x)=F_{\mathrm{data}}(x)-F_m(x)$, integration by parts for the signed measure $\mu_{\mathrm{data}}-\mu_\theta$ gives
\begin{align*}
\int_{\mathbb R} f(x)\,d(\mu_{\mathrm{data}}-\mu_\theta)(x)=-\int_{\mathbb R} f'(x)A_m(x)\,dx,
\end{align*}
with no boundary term because $A_m(x)$ decays at a Gaussian rate as $|x|\to\infty$ while the $1$-Lipschitz function $f$ grows at most linearly. Since $|f'(x)|\le 1$, the right-hand side is maximized pointwise by choosing
\begin{align*}
f'(x)=-\operatorname{sign}(A_m(x))=\operatorname{sign}\bigl(F_m(x)-F_{\mathrm{data}}(x)\bigr)
\end{align*}
where $A_m(x)\ne 0$; where $A_m(x)=0$, any value in $[-1,1]$ gives the same contribution.
For example, if $m=0$, symmetry gives
\begin{align*}
F_m(0)=\Phi(0)=\frac12
\end{align*}
and
\begin{align*}
F_{\mathrm{data}}(0)=\frac12\Phi\left(\frac{2}{\sigma}\right)+\frac12\Phi\left(-\frac{2}{\sigma}\right)=\frac12,
\end{align*}
using $\Phi(-t)=1-\Phi(t)$. Thus the sign can change near the center, and the critic need not form two symmetric peaks at the data modes. Instead, its slope records where the generated cumulative mass exceeds or falls short of the data cumulative mass. The generator therefore receives a transport direction from the critic even when the generated density is concentrated between the two modes rather than overlapping both modes pointwise.
[/example]
This example captures the advantage of Wasserstein losses for singular or nearly singular model distributions. The loss can still vary continuously with the generator parameters when classical divergences saturate because the supports fail to overlap.
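The slope rule from the mixture example can be evaluated directly. The sketch below computes the sign of $F_m-F_{\mathrm{data}}$ and the resulting optimal critic value $\int_{\mathbb R}|F_{\mathrm{data}}(x)-F_m(x)|\,dx$, which in one dimension coincides with $W_1(\mu_{\mathrm{data}},\mu_\theta)$. The value $\sigma=0.5$, the integration window, and the function names are assumptions chosen for illustration.

```python
import math

def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def F_data(x, sigma):
    """Distribution function of the two-mode data law."""
    return 0.5 * Phi((x + 2.0) / sigma) + 0.5 * Phi((x - 2.0) / sigma)

def F_model(x, m, sigma):
    """Distribution function of the generated law N(m, sigma^2)."""
    return Phi((x - m) / sigma)

def critic_slope(x, m, sigma):
    """Pointwise optimal slope sign(F_m(x) - F_data(x))."""
    diff = F_model(x, m, sigma) - F_data(x, sigma)
    return (diff > 0) - (diff < 0)

def critic_objective(m, sigma, lo=-12.0, hi=12.0, n=24000):
    """Midpoint-rule value of the integral of |F_data - F_m|."""
    h = (hi - lo) / n
    return h * sum(abs(F_data(lo + (k + 0.5) * h, sigma)
                       - F_model(lo + (k + 0.5) * h, m, sigma))
                   for k in range(n))

sigma = 0.5
# With m = 0 the critic decreases left of the centre and increases right of it.
print(critic_slope(-1.0, 0.0, sigma), critic_slope(1.0, 0.0, sigma))
# Optimal critic values for a centred generator and for one sitting on a mode.
print(critic_objective(0.0, sigma), critic_objective(2.0, sigma))
```

For $m=0$ the optimal critic is V-shaped around the generated mass, so it is large near both data modes, and the objective stays finite and informative although the generated law sits between the modes.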
## Entropic Dynamic Transport and Schrödinger Bridges
The final problem is to interpret entropic regularization dynamically. In the static discrete problem, entropy makes the coupling diffuse. In path space, entropy selects the most likely stochastic interpolation between two endpoint laws relative to a reference diffusion.
The static entropic problem regularizes couplings. To describe a whole interpolation, we instead regularize probability measures on paths and impose endpoint constraints on their time marginals.
[definition: Schrödinger Bridge Problem]
Let $R$ be a reference probability measure on path space $C([0,1];\mathbb R^d)$. The Schrödinger bridge objective is the map $\mathcal S_R:\mathcal P(\mathbb R^d)\times\mathcal P(\mathbb R^d)\to [0,\infty]$ defined by
\begin{align*}
\mathcal S_R(\mu_0,\mu_1)=\inf\left\{\operatorname{KL}(P\mid R): P\in \mathcal P(C([0,1];\mathbb R^d)),\ P_0=\mu_0,\ P_1=\mu_1\right\},
\end{align*}
where $P_t$ denotes the time-$t$ marginal of $P$.
[/definition]
This is the dynamic analogue of entropy-regularized transport. If $R$ is Brownian motion with small variance, the endpoint coupling induced by the bridge approximates an optimal transport plan, while the intermediate marginals form a noisy version of the Wasserstein geodesic. The following theorem makes that approximation principle precise in the small-noise limit.
[quotetheorem:9603]
[citeproof:9603]
The compact support and moment hypotheses prevent endpoint mass from escaping in the small-noise limit; without tightness, weak limits of endpoint couplings may fail to exist. A typical failure is a sequence of endpoint laws that places mass $1/N$ at distance $N^2$ from the origin: the second moments are not uniformly controlled, and a small amount of far-away mass can dominate quadratic transport costs. Absolute continuity of $\mu_0$ is a standard hypothesis under which the quadratic transport problem has a well-controlled dynamical representation, but the theorem only asserts optimality of subsequential endpoint limits and does not require uniqueness of a limiting coupling. For a non-unique endpoint example, take $\mu_0=\frac12\delta_{(-1,0)}+\frac12\delta_{(1,0)}$ and $\mu_1=\frac12\delta_{(0,-1)}+\frac12\delta_{(0,1)}$ on $\mathbb R^2$; all four endpoint distances are equal, so every coupling between these two two-point laws has the same quadratic cost. The theorem also concerns endpoint couplings, not necessarily uniform convergence of the whole interpolation curve in a strong topology.
The theorem connects this chapter back to the Benamou-Brenier dynamic formulation from Chapters 0 and 1: entropic methods are not merely numerical smoothing of a static linear program, but also arise from conditioning stochastic processes on unlikely endpoint marginals. The resulting interpolation is often easier to compute and sample than an exact displacement geodesic.
[example: Schrödinger Bridge Interpolation]
Let $\mu_0=\rho_0(x)\,dx$ and $\mu_1=\rho_1(y)\,dy$ be smooth compactly supported probability densities on $\mathbb R^d$, and let $R^\varepsilon_{\mu_0}$ be the law of $X_t=X_0+\sqrt{\varepsilon}B_t$ with $X_0\sim\mu_0$. Under the reference law, the conditional density of $X_1$ given $X_0=x$ is
\begin{align*}
p_\varepsilon(x,y)=(2\pi\varepsilon)^{-d/2}\exp\left(-\frac{|y-x|^2}{2\varepsilon}\right).
\end{align*}
Thus the endpoint law of the reference process is
\begin{align*}
dR^\varepsilon_{0,1}(x,y)=\rho_0(x)p_\varepsilon(x,y)\,dx\,dy.
\end{align*}
The factor depending on the displacement is exactly
\begin{align*}
\exp\left(-\frac{|y-x|^2}{2\varepsilon}\right),
\end{align*}
so moving mass from $x$ to $y$ is exponentially suppressed according to the quadratic cost $|x-y|^2/2$.
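For a deterministic absolutely continuous path $\gamma$ from $x$ to $y$, Jensen's inequality gives $\int_0^1|\dot\gamma_t|^2\,dt\ge\left|\int_0^1\dot\gamma_t\,dt\right|^2=|y-x|^2$, so the least kinetic action is obtained by the constant-speed straight path $\gamma_t=(1-t)x+ty$. For this path,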
\begin{align*}
\dot\gamma_t=y-x.
\end{align*}
Therefore
\begin{align*}
\frac12\int_0^1|\dot\gamma_t|^2\,dt=\frac12\int_0^1|y-x|^2\,dt.
\end{align*}
Since $\int_0^1 1\,dt=1$, this equals
\begin{align*}
\frac{|y-x|^2}{2}.
\end{align*}
This is the same quadratic endpoint cost that appears in the Brownian transition density.
Let $P^\varepsilon$ be the bridge law and set $\mu_t^\varepsilon=(X_t)_\#P^\varepsilon$. The endpoint constraints give
\begin{align*}
\mu_0^\varepsilon=\mu_0
\end{align*}
and
\begin{align*}
\mu_1^\varepsilon=\mu_1.
\end{align*}
For $0<t<1$, the Brownian reference still contributes fluctuations of size $\sqrt{\varepsilon}$ around transported paths, so the intermediate density is not concentrated only on deterministic transport rays. As $\varepsilon\downarrow 0$, the endpoint coupling $(X_0,X_1)_\#P^\varepsilon$ converges along subsequences to a quadratic-cost optimal coupling by *Small-Noise Limit of Schrödinger Bridges*. When that optimal coupling is induced by a transport map $T$, the limiting displacement interpolation has the form
\begin{align*}
\mu_t=((1-t)\operatorname{id}+tT)_\#\mu_0.
\end{align*}
Thus positive $\varepsilon$ gives a stochastic, smoothed interpolation, while the small-noise limit recovers the deterministic quadratic-transport interpolation.
[/example]
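The kinetic-action computation in the example can be verified with a finite-difference sketch. It compares the straight path with a path that has the same endpoints but a sinusoidal detour; the endpoints, the detour, and the function names are hypothetical choices.

```python
import math

def kinetic_action(path, n=1000):
    """Discrete version of (1/2) * integral over [0, 1] of |d(gamma)/dt|^2."""
    dt = 1.0 / n
    total = 0.0
    for k in range(n):
        p, q = path(k * dt), path((k + 1) * dt)
        speed_sq = sum((qi - pi) ** 2 for pi, qi in zip(p, q)) / dt ** 2
        total += 0.5 * speed_sq * dt
    return total

x, y = (0.0, 0.0), (3.0, 4.0)   # endpoints with |y - x|^2 / 2 = 12.5

def straight(t):
    """Constant-speed segment from x to y."""
    return tuple((1 - t) * xi + t * yi for xi, yi in zip(x, y))

def wiggly(t):
    """Same endpoints, with a sinusoidal detour of amplitude 1."""
    sx, sy = straight(t)
    return (sx, sy + math.sin(math.pi * t))

# The straight path attains |y - x|^2 / 2; the detour costs strictly more.
print(kinetic_action(straight), kinetic_action(wiggly))
```

The straight path reproduces the endpoint cost $|y-x|^2/2$ in the exponent of the Brownian transition density, and any detour with the same endpoints has larger action.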
Computational optimal transport therefore has a unified structure. Sinkhorn scaling, statistical debiasing, WGAN critics, and Schrödinger bridges all use duality, regularization, and sampling to make Wasserstein geometry operational. The analytic lesson is that approximation is part of the model: changing the objective changes the geometry, the gradients, and the statistical regime in which the method can be trusted.
# 10. Synthesis: Transport as a Modeling Language
The guiding theme is that the same object can be read in three registers. A static Kantorovich problem identifies an optimal coupling, a dynamic Benamou-Brenier problem identifies an evolution, and an entropic problem identifies a softened coupling with algorithmic structure. Applications usually require moving between these registers rather than choosing one forever.
## Choosing Between Static, Dynamic, and Entropic Formulations
The first modeling question is not "what is the transport distance?" but "which variables should the model expose?" If only the endpoints matter, the static formulation is economical. If paths, velocities, or conservation laws matter, the dynamic formulation carries the relevant information. If large-scale computation or statistical smoothing matters, entropic regularisation changes the model in a controlled way.
[definition: Static Quadratic Transport Problem]
Let $\rho_0,\rho_1 \in \mathcal P_2(\mathbb R^n)$. The static quadratic transport problem is
\begin{align*}
W_2^2(\rho_0,\rho_1) := \inf_{\pi \in \Pi(\rho_0,\rho_1)} \int_{\mathbb R^n \times \mathbb R^n} |x-y|^2\,d\pi(x,y),
\end{align*}
where $\Pi(\rho_0,\rho_1)$ is the set of probability measures on $\mathbb R^n \times \mathbb R^n$ with first marginal $\rho_0$ and second marginal $\rho_1$.
[/definition]
This formulation keeps the coupling but forgets the time-parametrised route between the marginals. That loss of path information is acceptable for matching and comparison, but it becomes a defect when the application asks how mass moves, which velocity field is used, or which conservation law is being enforced. The next formulation adds exactly that missing dynamical layer.
[definition: Dynamic Quadratic Transport Problem]
Let $\rho_0,\rho_1 \in \mathcal P_2(\mathbb R^n)$. The dynamic quadratic transport problem is
\begin{align*}
\inf_{(\rho_t,v_t)} \int_0^1 \int_{\mathbb R^n} |v_t(x)|^2\,d\rho_t(x)\,dt,
\end{align*}
where $(\rho_t)_{t\in[0,1]}$ is a narrowly continuous curve in $\mathcal P_2(\mathbb R^n)$, $v_t:\mathbb R^n\to\mathbb R^n$ is a Borel velocity field, $\rho_{t=0}=\rho_0$, $\rho_{t=1}=\rho_1$, and
\begin{align*}
\partial_t\rho_t + \nabla\cdot(\rho_t v_t)=0
\end{align*}
holds in the sense of distributions on $(0,1)\times\mathbb R^n$.
[/definition]
The continuity equation is the modeling gain: it identifies transport with mass conservation under a velocity field. This fits crowd motion, fluid-like interpolation, and variational derivations of evolution equations. The price is that the model now carries a PDE constraint, so for high-dimensional or data-driven problems it is natural to ask whether a softened static problem can keep the useful geometry while becoming easier to compute.
[definition: Entropic Transport Problem]
Let $c: X\times Y\to \mathbb R$ be a measurable cost, let $\mu\in\mathcal P(X)$ and $\nu\in\mathcal P(Y)$, and let $\varepsilon>0$. The entropic transport problem is
\begin{align*}
\inf_{\pi\in\Pi(\mu,\nu)}\left\{\int_{X\times Y} c(x,y)\,d\pi(x,y)+\varepsilon\operatorname{KL}(\pi\mid \mu\otimes\nu)\right\}.
\end{align*}
[/definition]
The entropic term rewards spread relative to the product reference measure. It also creates the scaling structure behind the Sinkhorn algorithm, so the regularised problem is both a modeling choice and a numerical method. With the three formulations on the table, the first synthesis result states how they compare.
[quotetheorem:9604]
[citeproof:9604]
The theorem explains why the three languages can be compared, but it does not make them interchangeable in practice. The static problem hides trajectories, the dynamic problem introduces a PDE constraint, and the entropic problem solves a nearby smoothed problem.
[example: Toy Comparison of Transport Losses]
For $\mu=\frac12\delta_0+\frac12\delta_2$ and $\nu_a=\frac12\delta_a+\frac12\delta_{a+2}$, any coupling is determined by one number $s\in[0,\frac12]$: mass $s$ is sent from $0$ to $a$, mass $\frac12-s$ from $0$ to $a+2$, mass $\frac12-s$ from $2$ to $a$, and mass $s$ from $2$ to $a+2$. For the $W_1$ cost, the total cost of this coupling is
\begin{align*}
s|a|+\left(\frac12-s\right)|a+2|+\left(\frac12-s\right)|2-a|+s|2-(a+2)|.
\end{align*}
Since $|2-(a+2)|=|a|$ and $|2-a|=|a-2|$, this becomes
\begin{align*}
2s|a|+\left(\frac12-s\right)|a+2|+\left(\frac12-s\right)|a-2|.
\end{align*}
The triangle inequality gives $|a+2|+|a-2|\ge |(a+2)+(a-2)|=2|a|$, so replacing off-diagonal mass by diagonal mass cannot increase the cost. Thus the minimum is attained at $s=\frac12$, and
\begin{align*}
W_1(\mu,\nu_a)=\frac12|0-a|+\frac12|2-(a+2)|=|a|.
\end{align*}
For the quadratic cost, the same coupling has cost
\begin{align*}
s a^2+\left(\frac12-s\right)(a+2)^2+\left(\frac12-s\right)(a-2)^2+s a^2.
\end{align*}
The off-diagonal sum expands as
\begin{align*}
(a+2)^2+(a-2)^2=(a^2+4a+4)+(a^2-4a+4)=2a^2+8.
\end{align*}
Therefore the quadratic cost is
\begin{align*}
2sa^2+\left(\frac12-s\right)(2a^2+8)=a^2+4-8s.
\end{align*}
This affine function of $s$ is minimized at $s=\frac12$, so
\begin{align*}
W_2^2(\mu,\nu_a)=a^2.
\end{align*}
Taking the square root gives
\begin{align*}
W_2(\mu,\nu_a)=|a|.
\end{align*}
Now change only one target atom and set $\eta_R=\frac12\delta_0+\frac12\delta_R$ with $R\ge2$. The monotone matching in one dimension keeps $0$ paired with $0$ and sends $2$ to $R$, so
\begin{align*}
W_1(\mu,\eta_R)=\frac12|0-0|+\frac12|2-R|=\frac{R-2}{2}.
\end{align*}
For squared cost, the same matching gives
\begin{align*}
W_2^2(\mu,\eta_R)=\frac12(0-0)^2+\frac12(2-R)^2=\frac{(R-2)^2}{2}.
\end{align*}
Hence
\begin{align*}
W_2(\mu,\eta_R)=\frac{R-2}{\sqrt2}.
\end{align*}
Thus both metric values $W_1(\mu,\eta_R)$ and $W_2(\mu,\eta_R)$ grow linearly in $R$, but the quadratic transport cost $W_2^2$ records the far displacement as a squared penalty.
For the entropic version in the translated case, use the squared-distance costs $C_{00}=a^2$, $C_{0,2}=(a+2)^2$, $C_{2,0}=(a-2)^2$, and $C_{2,2}=a^2$. With $K_{ij}=e^{-C_{ij}/\varepsilon}$, the diagonal kernel entries are both $e^{-a^2/\varepsilon}$, while the off-diagonal entries are $e^{-(a+2)^2/\varepsilon}$ and $e^{-(a-2)^2/\varepsilon}$. For every $\varepsilon>0$ these four entries are positive, so the Sinkhorn-scaled plan may assign positive mass to off-diagonal pairings even when the exact quadratic optimum is diagonal. Relative to a diagonal entry, the two off-diagonal weights are measured by
\begin{align*}
\frac{e^{-(a+2)^2/\varepsilon}}{e^{-a^2/\varepsilon}}=e^{-((a+2)^2-a^2)/\varepsilon}=e^{-(4a+4)/\varepsilon}.
\end{align*}
The other ratio is
\begin{align*}
\frac{e^{-(a-2)^2/\varepsilon}}{e^{-a^2/\varepsilon}}=e^{-((a-2)^2-a^2)/\varepsilon}=e^{(4a-4)/\varepsilon}.
\end{align*}
These formulas show the tradeoff explicitly: positive temperature smooths the dependence on atom locations through exponential weights, but for fixed $\varepsilon$ it biases the plan away from the sharp diagonal assignment whenever off-diagonal weights remain non-negligible.
[/example]
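The computations in the toy comparison can be reproduced in a few lines. The sketch minimizes the coupling cost over the single parameter $s$ and runs Sinkhorn scaling on the $2\times2$ kernel. Since diagonal scaling preserves the cross-ratio $\pi_{00}\pi_{11}/(\pi_{01}\pi_{10})=K_{00}K_{11}/(K_{01}K_{10})=e^{8/\varepsilon}$ and the marginals force $\pi_{01}=\pi_{10}$, the entropic plan has off-diagonal mass $\frac12/(1+e^{4/\varepsilon})$ per entry, which the code checks. The values $a=0.5$ and $\varepsilon=2$ and the function names are illustrative assumptions.

```python
import math

def coupling_cost(s, a, p):
    """Cost of the coupling with diagonal mass s for the cost |x - y|^p."""
    diag = abs(a) ** p                        # pairs (0, a) and (2, a + 2)
    off = abs(a + 2) ** p + abs(a - 2) ** p   # pairs (0, a + 2) and (2, a)
    return 2 * s * diag + (0.5 - s) * off

def sinkhorn_plan(a, eps, iters=500):
    """Entropic plan between the two uniform two-point laws, squared cost."""
    src, tgt = [0.0, 2.0], [a, a + 2.0]
    K = [[math.exp(-((x - y) ** 2) / eps) for y in tgt] for x in src]
    u, v = [1.0, 1.0], [1.0, 1.0]
    for _ in range(iters):
        u = [0.5 / sum(K[i][j] * v[j] for j in range(2)) for i in range(2)]
        v = [0.5 / sum(K[i][j] * u[i] for i in range(2)) for j in range(2)]
    return [[u[i] * K[i][j] * v[j] for j in range(2)] for i in range(2)]

a, eps = 0.5, 2.0
grid = [k / 1000 for k in range(501)]                # s ranges over [0, 1/2]
print(min(coupling_cost(s, a, 1) for s in grid))     # equals |a|
print(min(coupling_cost(s, a, 2) for s in grid))     # equals a^2
plan = sinkhorn_plan(a, eps)
# Off-diagonal entropic mass next to the cross-ratio formula.
print(plan[0][1], 0.5 / (1.0 + math.exp(4.0 / eps)))
```

In this symmetric example the off-diagonal mass does not depend on the translation $a$: it is governed only by $\varepsilon$ and tends to zero as $\varepsilon\downarrow0$.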
## From Variational Principles to PDE and Algorithms
The second modeling question is how a variational principle becomes something that evolves or can be computed. The course has repeatedly used the same pipeline, from Otto calculus in Chapter 2 through JKO in Chapter 4 and Sinkhorn scaling in Chapter 9: choose an energy, define a movement or regularized problem by transport cost, derive the Euler-Lagrange equation, and then read the limiting object as a PDE or algorithm.
[definition: JKO Step]
Let $\mathcal E:\mathcal P_2(\mathbb R^n)\to(-\infty,\infty]$ be an energy functional and let $\tau>0$. Given $\rho_k\in\mathcal P_2(\mathbb R^n)$, a JKO step is any minimiser
\begin{align*}
\rho_{k+1}\in \operatorname*{argmin}_{\rho\in\mathcal P_2(\mathbb R^n)}\left\{\frac{1}{2\tau}W_2^2(\rho,\rho_k)+\mathcal E[\rho]\right\}.
\end{align*}
[/definition]
The transport term prevents the next state from moving too far in one time step, while the energy term chooses the direction of descent. This turns gradient flow into a sequence of static minimisation problems.
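The simplest instance of this reduction can be computed by hand and checked numerically. The sketch assumes the potential energy $\mathcal E[\rho]=\int \frac12|y|^2\,d\rho(y)$ on $\mathbb R$ and a Dirac initial state $\rho_k=\delta_x$. Since $W_2^2(\rho,\delta_x)=\int|y-x|^2\,d\rho(y)$, the JKO objective is linear in $\rho$ and is minimized by the Dirac mass at the minimizer of $|y-x|^2/(2\tau)+|y|^2/2$, so the step is an implicit Euler step for $\dot x=-x$. The function names and step counts are hypothetical.

```python
import math

def jko_step_point_mass(x, tau):
    """JKO step for E[rho] = integral of (y^2 / 2) d rho, started at delta_x.

    The objective is linear in rho, so the minimizer is the Dirac mass at
    the minimizer of (y - x)^2 / (2 tau) + y^2 / 2, namely x / (1 + tau).
    """
    return x / (1.0 + tau)

def jko_flow(x0, T, n_steps):
    """Iterate the step n_steps times with tau = T / n_steps."""
    tau, x = T / n_steps, x0
    for _ in range(n_steps):
        x = jko_step_point_mass(x, tau)
    return x

for n in (10, 100, 1000):
    # Discrete JKO position at time 1 next to the exact flow value exp(-1).
    print(n, jko_flow(1.0, 1.0, n), math.exp(-1.0))
```

As $\tau\downarrow0$ the iterates approach $e^{-t}x_0$, the gradient flow of the potential, which is the pattern that the convergence theorem establishes for genuine densities.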
[quotetheorem:9605]
[citeproof:9605]
This result is the model case for the slogan that diffusion is transport-gradient descent of entropy. The growth assumptions on $V$ keep mass from escaping to infinity and ensure that the potential energy controls the tails of $\rho$, while convexity is the condition that makes the energy behave like a convex functional along Wasserstein geodesics. Without such structure, the formal calculation still produces an equation, but it does not by itself prove existence, uniqueness, or stability of solutions. For example, a double-well potential can create two competing basins of attraction, so minimising movements may depend sensitively on the initial distribution and need not select a single globally stable equilibrium.
[example: Diffusion Model as Stochastic Interpolation]
Take the Gaussian reference to be $\gamma=\mathcal N(0,I_n)$, and define the forward noising process by
\begin{align*}
X_t=e^{-t/2}X_0+\sqrt{1-e^{-t}}\,Z,
\end{align*}
where $X_0\sim\rho_0$ and $Z\sim\gamma$ are independent. At $t=0$ this gives $X_t=X_0$, because $e^0=1$ and $\sqrt{1-e^0}=0$. As $t\to\infty$, $e^{-t/2}\to0$ and $\sqrt{1-e^{-t}}\to1$, so the law of $X_t$ converges to the law of $Z$, namely $\gamma$.
Equivalently, this interpolation is generated by the Ornstein-Uhlenbeck forward equation
\begin{align*}
\partial_t\rho_t=\frac12\Delta\rho_t+\frac12\nabla\cdot(x\rho_t).
\end{align*}
For positive smooth densities,
\begin{align*}
\Delta\rho_t=\nabla\cdot(\nabla\rho_t)=\nabla\cdot(\rho_t\nabla\log\rho_t),
\end{align*}
because $\rho_t\nabla\log\rho_t=\rho_t(\nabla\rho_t/\rho_t)=\nabla\rho_t$. Therefore the same equation can be written as
\begin{align*}
\partial_t\rho_t=\nabla\cdot\left(\rho_t\left(\frac12\nabla\log\rho_t+\frac12x\right)\right).
\end{align*}
Moving the right-hand side to the left gives the continuity-equation form
\begin{align*}
\partial_t\rho_t+\nabla\cdot(\rho_t v_t)=0,
\end{align*}
with
\begin{align*}
v_t(x)=-\frac12\left(x+\nabla\log\rho_t(x)\right).
\end{align*}
This is the transport content of the diffusion model: the learned reverse dynamics is trying to approximate the score $\nabla\log\rho_t$ so that it can reconstruct the velocity field carrying noisy laws back toward the data law. Thus the model is not only comparing $\rho_0$ with $\gamma$; it is learning a time-parametrised interpolation between them, with diffusion supplying the regularity along the path.
[/example]
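For Gaussian initial data the velocity field in the example is explicit, which allows a quick consistency check. The sketch assumes $n=1$ and $X_0\sim\mathcal N(m_0,s_0^2)$, so that $\rho_t$ is Gaussian with mean $e^{-t/2}m_0$ and variance $e^{-t}s_0^2+1-e^{-t}$. It then verifies by finite differences that a particle following a fixed Gaussian quantile moves with velocity $v_t$. The parameter values and function names are illustrative assumptions.

```python
import math

def mean_std(t, m0, s0):
    """Mean and standard deviation of X_t when X_0 ~ N(m0, s0^2) in 1D."""
    var = math.exp(-t) * s0 ** 2 + 1.0 - math.exp(-t)
    return math.exp(-t / 2.0) * m0, math.sqrt(var)

def velocity(t, x, m0, s0):
    """v_t(x) = -(x + d/dx log rho_t(x)) / 2 for the Gaussian law rho_t."""
    m, s = mean_std(t, m0, s0)
    score = -(x - m) / s ** 2
    return -0.5 * (x + score)

def characteristic(t, z, m0, s0):
    """Position at time t of the particle sitting at Gaussian quantile z."""
    m, s = mean_std(t, m0, s0)
    return m + s * z

m0, s0, z, t, h = 3.0, 0.2, 1.5, 0.7, 1e-5
# Central difference of the quantile path, compared with the velocity field.
slope = (characteristic(t + h, z, m0, s0) - characteristic(t - h, z, m0, s0)) / (2 * h)
print(slope, velocity(t, characteristic(t, z, m0, s0), m0, s0))
```

The two printed numbers agree up to discretization error, which is the one-dimensional statement that the deterministic flow of $v_t$ carries $\rho_0$ through the same marginals as the noising process.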
A variational derivation also becomes an algorithm when the minimisation problem has exploitable structure. Entropic regularisation is the most important example in the computational part of the course.
[quotetheorem:9606]
[citeproof:9606]
The computational lesson is that adding entropy changes the geometry but exposes a matrix-scaling problem. The positivity assumptions are part of that lesson: positive marginals and the strictly positive kernel $K_{ij}=e^{-C_{ij}/\varepsilon}$ prevent zero rows or columns from blocking the alternating rescalings. If a marginal entry vanishes, the corresponding row or column must be removed or treated separately; if the kernel has structural zeros, a feasible scaling may fail to exist or may fail to be unique on the full matrix. This is why Sinkhorn losses are common in learning pipelines where exact transport would be too expensive or too unstable, but also why implementations track support and numerical underflow carefully.
[example: Crowd Motion with Congestion Energy]
Let $\rho_t$ denote a crowd density in a bounded domain $\Omega\subset\mathbb R^n$, and consider the energy
\begin{align*}
\mathcal E[\rho]=\int_\Omega U(\rho(x))\,d\mathcal L^n(x)+\int_\Omega V(x)\rho(x)\,d\mathcal L^n(x),
\end{align*}
where $V$ is a desired potential and $U$ penalizes high density. The JKO step from $\rho_k$ chooses $\rho_{k+1}$ by minimizing
\begin{align*}
\frac{1}{2\tau}W_2^2(\rho,\rho_k)+\int_\Omega U(\rho(x))\,d\mathcal L^n(x)+\int_\Omega V(x)\rho(x)\,d\mathcal L^n(x).
\end{align*}
For a smooth positive density, the first variation of the congestion term is found by perturbing $\rho$ to $\rho+\theta\sigma$ with $\int_\Omega\sigma\,d\mathcal L^n=0$:
\begin{align*}
\left.\frac{d}{d\theta}\right|_{\theta=0}\int_\Omega U(\rho+\theta\sigma)\,d\mathcal L^n=\int_\Omega U'(\rho)\sigma\,d\mathcal L^n.
\end{align*}
The potential term varies as
\begin{align*}
\left.\frac{d}{d\theta}\right|_{\theta=0}\int_\Omega V\,(\rho+\theta\sigma)\,d\mathcal L^n=\int_\Omega V\sigma\,d\mathcal L^n.
\end{align*}
Thus
\begin{align*}
\frac{\delta\mathcal E}{\delta\rho}=U'(\rho)+V.
\end{align*}
The formal Wasserstein gradient-flow equation is therefore
\begin{align*}
\partial_t\rho_t=\nabla\cdot\left(\rho_t\nabla\left(U'(\rho_t)+V\right)\right).
\end{align*}
Writing this as a continuity equation gives
\begin{align*}
\partial_t\rho_t+\nabla\cdot(\rho_t v_t)=0,
\end{align*}
with velocity
\begin{align*}
v_t=-\nabla\left(U'(\rho_t)+V\right).
\end{align*}
Equivalently,
\begin{align*}
\rho_t v_t=-\rho_t U''(\rho_t)\nabla\rho_t-\rho_t\nabla V.
\end{align*}
The flux term $-\rho_t\nabla V$ drives the crowd toward lower potential, while the term $-\rho_t U''(\rho_t)\nabla\rho_t$ is the congestion pressure: when $U$ is convex it pushes mass down the density gradient, opposing movement into already dense regions.
[/example]
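The continuity-equation form suggests a simple conservative discretization. The sketch below is an explicit finite-volume scheme in one dimension, not the JKO scheme itself. It assumes the quadratic congestion energy $U(r)=r^2/2$, the potential $V(x)=\frac12(x-1)^2$ on $\Omega=[0,2]$, zero flux through the boundary, and a time step small enough for an explicit update. All of these choices and the function names are illustrative.

```python
def potential(x):
    """Illustrative potential V with minimum at x = 1."""
    return 0.5 * (x - 1.0) ** 2

def step(rho, xs, dx, dt):
    """One explicit conservative update of rho_t = d/dx (rho d/dx (U'(rho) + V)).

    Uses U(r) = r^2 / 2, so U'(r) = r, and zero flux through both ends.
    """
    p = [r + potential(x) for r, x in zip(rho, xs)]   # U'(rho) + V
    flux = [0.0] * (len(rho) + 1)                      # rho * v at cell faces
    for i in range(1, len(rho)):
        rho_face = 0.5 * (rho[i - 1] + rho[i])
        flux[i] = -rho_face * (p[i] - p[i - 1]) / dx
    return [r - dt * (flux[i + 1] - flux[i]) / dx for i, r in enumerate(rho)]

n, dx, dt = 20, 0.1, 1e-3
xs = [(i + 0.5) * dx for i in range(n)]               # cell centres on [0, 2]
rho = [1.0 if x < 0.5 else 0.0 for x in xs]           # crowd packed near x = 0
rho = [r / (sum(rho) * dx) for r in rho]              # normalize to unit mass

mass0 = sum(rho) * dx
com0 = sum(r * x for r, x in zip(rho, xs)) * dx       # initial centre of mass
for _ in range(1000):
    rho = step(rho, xs, dx, dt)
# Total mass before and after, and the centre of mass after the evolution.
print(mass0, sum(rho) * dx, sum(r * x for r, x in zip(rho, xs)) * dx)
```

Because each interior flux is subtracted from one cell and added to its neighbour, the total mass is preserved up to rounding, mirroring the conservation law built into the continuity equation.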
## Stability and Passage to the Limit
The third modeling question is whether the conclusions survive approximation. Applications use discretised costs, empirical measures, regularised energies, and finite-dimensional parameterisations. A transport model is useful only when these approximations converge to the intended continuum object.
[definition: Gamma-Convergence of Energies]
Let $X$ be a metric space and let $\mathcal E_k:X\to(-\infty,\infty]$ be a sequence of functionals. The sequence $\mathcal E_k$ Gamma-converges to $\mathcal E:X\to(-\infty,\infty]$ if for every $x\in X$ the following two conditions hold. First (liminf inequality), every sequence $x_k\to x$ satisfies
\begin{align*}
\mathcal E[x]\le \liminf_{k\to\infty}\mathcal E_k[x_k].
\end{align*}
Second (recovery sequence), there exists a sequence $y_k\to x$ such that
\begin{align*}
\mathcal E[x]\ge \limsup_{k\to\infty}\mathcal E_k[y_k].
\end{align*}
[/definition]
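As a minimal illustration of the definition, take $X=\mathbb R$ and the oscillating energies $\mathcal E_k[x]=\sin(kx)$; this choice is an assumption made here for the example and is not used elsewhere in the course. The sequence does not converge pointwise for generic $x$, yet it Gamma-converges to the constant $\mathcal E\equiv-1$. The lower bound is immediate because $\mathcal E_k\ge-1$ everywhere. For the approximating sequence, let $y_k$ be the point of $\frac1k\left(-\frac\pi2+2\pi\mathbb Z\right)$ closest to $x$, say $y_k=\frac1k\left(-\frac\pi2+2\pi j_k\right)$, so that
\begin{align*}
|y_k-x|\le\frac{\pi}{k},\qquad \mathcal E_k[y_k]=\sin\left(-\frac\pi2+2\pi j_k\right)=-1.
\end{align*}
In particular $\min\mathcal E_k=-1=\min\mathcal E$ for every $k$, even though the functionals themselves oscillate.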
The first condition prevents artificial loss of energy in the limit, and the second says that every limiting state can be approximated without paying extra energy. Together they are designed to preserve minimisation problems. Since gradient flows are built by iterated minimisation, the natural next question is whether this preservation extends from single minimisers to whole time-dependent evolutions.
[quotetheorem:9607]
[proofunderconstruction:9607]
This theorem is the abstract justification for many approximations in the course: particle-to-continuum limits, vanishing regularisation, and discretised schemes. It also warns that the energy, the metric, and the compactness mechanism must be controlled together.
[example: Vanishing Entropic Bias]
Let $\widehat\mu=\frac1m\sum_{i=1}^m\delta_{x_i}$ and $\widehat\nu=\frac1n\sum_{j=1}^n\delta_{y_j}$, and write $C_{ij}=|x_i-y_j|^2$. The entropic transport value is
\begin{align*}
T_\varepsilon(\widehat\mu,\widehat\nu)=\min_{\pi\in\Pi(\widehat\mu,\widehat\nu)}\left\{\sum_{i=1}^m\sum_{j=1}^n C_{ij}\pi_{ij}+\varepsilon\sum_{i=1}^m\sum_{j=1}^n\pi_{ij}\log\frac{\pi_{ij}}{(1/m)(1/n)}\right\}.
\end{align*}
Since every feasible $\pi$ has nonnegative relative entropy with respect to $\widehat\mu\otimes\widehat\nu$, we have
\begin{align*}
T_\varepsilon(\widehat\mu,\widehat\nu)\ge \min_{\pi\in\Pi(\widehat\mu,\widehat\nu)}\sum_{i=1}^m\sum_{j=1}^n C_{ij}\pi_{ij}=W_2^2(\widehat\mu,\widehat\nu).
\end{align*}
If $\pi^0$ is an optimal unregularised coupling, then $\pi^0$ is also feasible for the entropic problem, so
\begin{align*}
T_\varepsilon(\widehat\mu,\widehat\nu)\le W_2^2(\widehat\mu,\widehat\nu)+\varepsilon\operatorname{KL}(\pi^0\mid\widehat\mu\otimes\widehat\nu).
\end{align*}
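For fixed empirical supports the relative entropy $\operatorname{KL}(\pi^0\mid\widehat\mu\otimes\widehat\nu)$ is a finite constant, so the gap between the two bounds is at most a constant multiple of $\varepsilon$, hence $T_\varepsilon(\widehat\mu,\widehat\nu)\to W_2^2(\widehat\mu,\widehat\nu)$ as $\varepsilon\downarrow0$.
The debiased Sinkhorn divergence subtracts half of each self-cost: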
\begin{align*}
S_\varepsilon(\widehat\mu,\widehat\nu)=T_\varepsilon(\widehat\mu,\widehat\nu)-\frac12T_\varepsilon(\widehat\mu,\widehat\mu)-\frac12T_\varepsilon(\widehat\nu,\widehat\nu).
\end{align*}
Applying the same bound to each of the three terms shows that each entropic value converges to its unregularised value. Since $W_2^2(\widehat\mu,\widehat\mu)=0$ and $W_2^2(\widehat\nu,\widehat\nu)=0$, the limit is
\begin{align*}
\lim_{\varepsilon\downarrow0}S_\varepsilon(\widehat\mu,\widehat\nu)=W_2^2(\widehat\mu,\widehat\nu).
\end{align*}
Thus fixed $\varepsilon$ introduces a controlled smoothing bias, while sending $\varepsilon$ to zero recovers the sharp empirical transport loss; the price is that the limiting problem again depends on a nonsmoothed optimal coupling, so gradients can become unstable when the empirical support nearly admits several competing matchings.
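The two-sided bound can be checked on a small instance. The sketch below is an illustration under stated assumptions, not a production solver: it uses log-domain Sinkhorn iterations with a fixed iteration count, uniform weights, equal cloud sizes, and a brute-force search over permutations for the exact cost. The point clouds and all function names are hypothetical. For a permutation coupling of $n$ uniform atoms one has $\operatorname{KL}(\pi^0\mid\widehat\mu\otimes\widehat\nu)=\log n$, which gives the explicit upper bound printed in the loop.

```python
import itertools
import math


def logsumexp(values):
    """Stable evaluation of log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def entropic_value(xs, ys, eps, iters=500):
    """Approximate T_eps between uniform empirical measures on xs and ys."""
    m, n = len(xs), len(ys)
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    log_a, log_b = -math.log(m), -math.log(n)
    f, g = [0.0] * m, [0.0] * n  # dual potentials
    for _ in range(iters):
        f = [-eps * logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(n)])
             for i in range(m)]
        g = [-eps * logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(m)])
             for j in range(n)]
    value = 0.0
    for i in range(m):
        for j in range(n):
            log_ratio = (f[i] + g[j] - cost[i][j]) / eps  # log(pi_ij / (a_i b_j))
            pi = math.exp(log_a + log_b + log_ratio)
            value += cost[i][j] * pi + eps * pi * log_ratio
    return value


def exact_w2_squared(xs, ys):
    """Brute-force W_2^2 for equal-size uniform clouds (permutation couplings)."""
    n = len(xs)
    return min(sum((xs[i] - ys[s[i]]) ** 2 for i in range(n)) / n
               for s in itertools.permutations(range(n)))


def sinkhorn_divergence(xs, ys, eps):
    """Debiased value S_eps = T(mu, nu) - T(mu, mu)/2 - T(nu, nu)/2."""
    return (entropic_value(xs, ys, eps)
            - 0.5 * entropic_value(xs, xs, eps)
            - 0.5 * entropic_value(ys, ys, eps))


xs, ys = [0.0, 0.4, 1.0], [0.1, 0.7, 1.2]  # hypothetical one-dimensional samples
w2 = exact_w2_squared(xs, ys)
for eps in (1.0, 0.5, 0.2):
    t = entropic_value(xs, ys, eps)
    upper = w2 + eps * math.log(len(xs))  # W_2^2 + eps * KL(pi^0 | mu x nu)
    print(eps, w2, t, upper, sinkhorn_divergence(xs, ys, eps))
```

The brute-force search is only feasible for a handful of points; it is used here because it makes the unregularised value unambiguous. Very small $\varepsilon$ would need more iterations than the fixed count assumed above.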
[/example]
## Limits of the Theory
The final modeling question is where the transport viewpoint stops being a reliable guide. The theory is strongest when compactness, convexity, and regularity interact favourably. Singular measures, nonconvex energies, and high dimension each break part of that structure.
[remark: Singular Measures]
Transport distances handle singular measures as metric objects, but PDE interpretations may require densities, fluxes, or first variations that do not exist as functions. A Dirac mass can move along a Wasserstein geodesic, yet entropy is infinite on it, and diffusion energies may instantly regularise it. The modeler must distinguish metric admissibility from analytic admissibility.
[/remark]
This distinction matters whenever a particle method is used to approximate a density-driven PDE. The particle configuration may converge narrowly, while the energy or velocity field may fail to converge without additional estimates. A different obstruction appears when the energy itself has several competing wells, because then the transport metric supplies a descent direction but does not restore convexity.
[quotetheorem:9608]
[citeproof:9608]
The theorem should be read as a warning label for applications such as swarming, aggregation, and pattern formation. Transport still gives a language for writing the equations, but it may no longer give a unique or stable prediction.
[example: Nonconvex Aggregation]
Take the radial interaction potential $W(z)=\frac14(|z|^2-1)^2$. Since
\begin{align*}
\nabla W(z)=\frac14\cdot 2(|z|^2-1)\cdot 2z=(|z|^2-1)z,
\end{align*}
the force $-\nabla W(z)=(1-|z|^2)z$ is repulsive when $|z|<1$ and attractive when $|z|>1$.
For two equal point clusters on the real line at positions $-r/2$ and $r/2$ with $r>0$, write
\begin{align*}
\rho_r=\frac12\delta_{-r/2}+\frac12\delta_{r/2}.
\end{align*}
For particles at positions $x_i$ carrying masses $m_i$, the particle velocity induced by the interaction energy is
\begin{align*}
\dot x_i=-\sum_j m_j\nabla W(x_i-x_j).
\end{align*}
For the right particle $x_+=r/2$, the self-interaction term is zero because $\nabla W(0)=0$, so
\begin{align*}
\dot x_+=-\frac12\nabla W(r)=-\frac12(r^2-1)r.
\end{align*}
For the left particle $x_-=-r/2$, similarly,
\begin{align*}
\dot x_-=-\frac12\nabla W(-r)=\frac12(r^2-1)r.
\end{align*}
Therefore the separation $r=x_+-x_-$ evolves by
\begin{align*}
\dot r=\dot x_+-\dot x_-=-(r^2-1)r.
\end{align*}
If $r>1$, then $\dot r<0$, so the clusters move toward each other; if $0<r<1$, then $\dot r>0$, so the clusters move apart. The preferred distance is $r=1$, not complete collapse.
This explicit calculation shows the nonconvex behavior: transport dissipation moves the configuration downhill, but the downhill set can contain separated equilibria. With many particles, the same preferred-distance mechanism can stabilize several geometric arrangements, such as rings, lattice-like patterns, or multiple clusters, so the Wasserstein gradient-flow structure alone does not select one unique macroscopic pattern.
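The two-cluster calculation can be reproduced with a short particle simulation. This is a minimal sketch assuming one space dimension, forward Euler time stepping, and hypothetical step size and function names; it integrates the particle velocity formula from the example directly rather than the reduced equation for $r$.

```python
def grad_W(z):
    """Gradient of W(z) = (z**2 - 1)**2 / 4 in one dimension."""
    return (z * z - 1.0) * z


def simulate(positions, masses, dt=0.01, steps=2000):
    """Forward Euler for dx_i/dt = -sum_j m_j grad_W(x_i - x_j)."""
    x = list(positions)
    for _ in range(steps):
        velocity = [-sum(m * grad_W(xi - xj) for xj, m in zip(x, masses))
                    for xi in x]
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


for r0 in (2.0, 0.3):  # one start outside and one inside the preferred distance
    left, right = simulate([-r0 / 2, r0 / 2], [0.5, 0.5])
    print(f"r0 = {r0}: final separation = {right - left:.6f}")
```

Both runs should approach the preferred separation $r=1$, one from above and one from below. The same function accepts more particles, which is one way to experiment with the multi-cluster states mentioned above, although nothing in this sketch decides which pattern is selected.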
[/example]
High dimension creates a different obstruction. Even when the theory is correct, the sample and computational complexity of estimating transport can dominate the modeling error.
[remark: High-Dimensional Transport Losses]
For empirical measures in $\mathbb R^n$, exact transport costs can suffer from slow statistical convergence as $n$ grows. Entropic regularisation improves numerical conditioning and can reduce variance, but it changes the loss landscape and introduces a smoothing scale. Sliced, projected, low-rank, and Schrödinger-type variants should be understood as modeling choices, not merely accelerations.
[/remark]
## Final Modeling Checklist
The course ends with a practical checklist. A transport model should specify the state space, the metric or cost, the energy or data-fitting term, the regularisation, and the intended limiting process. Most mistakes come from mixing these layers without checking that the hypotheses needed at one layer survive at the next.
[explanation: Choosing a Transport Model]
Use the static formulation when the endpoint matching problem is primary and intermediate dynamics are irrelevant. Use the Benamou--Brenier formulation when velocities, conservation laws, or PDE limits are part of the model. Use JKO when the process is dissipative and the energy landscape is the main object of interest. Use entropic transport when computation, noise, or statistical smoothing is central, while tracking the bias introduced by $\varepsilon$.
The decision should also record what convergence statement is expected. If a discrete scheme is meant to approximate a PDE, identify the compactness estimate and the limiting Euler-Lagrange equation. If a regularised loss is meant to approximate exact transport, identify the regime in which the regularisation parameter vanishes relative to sample size and discretisation. If the energy is nonconvex, treat uniqueness and stability as additional hypotheses rather than background facts.
[/explanation]
The checklist is abstract, so it is useful to run it on a single applied problem. The next example compares the modeling choices that arise when the same observed evolution is viewed through static, dynamic, variational, and entropic lenses.
[example: Comparing Modeling Choices in One Problem]
Suppose the observed data are empirical laws
\begin{align*}
\mu_0=\frac1N\sum_{i=1}^N\delta_{x_i}
\end{align*}
and
\begin{align*}
\mu_1=\frac1N\sum_{j=1}^N\delta_{y_j}.
\end{align*}
The static quadratic transport model chooses a coupling matrix $\pi=(\pi_{ij})$ with row sums $\sum_j\pi_{ij}=1/N$ and column sums $\sum_i\pi_{ij}=1/N$, and it measures endpoint mismatch by minimizing
\begin{align*}
\sum_{i=1}^N\sum_{j=1}^N |x_i-y_j|^2\pi_{ij}.
\end{align*}
This number compares the two endpoint clouds, but the variables $\pi_{ij}$ only say how much mass moves from $x_i$ to $y_j$; they do not specify where that mass is at an intermediate time $t\in(0,1)$.
The dynamic model exposes the missing variables by replacing a single coupling with a curve $(\rho_t)_{t\in[0,1]}$ and a velocity field $v_t$. If one particle starting at $x_i$ is paired with $y_j$ and travels at constant speed, its interpolated position is
\begin{align*}
z_{ij}(t)=(1-t)x_i+ty_j.
\end{align*}
Its velocity is
\begin{align*}
\dot z_{ij}(t)=y_j-x_i.
\end{align*}
Hence its kinetic contribution over the time interval $[0,1]$ is
\begin{align*}
\int_0^1 |\dot z_{ij}(t)|^2\,dt=\int_0^1 |y_j-x_i|^2\,dt=|y_j-x_i|^2.
\end{align*}
Thus the same squared displacement that appears in the static cost becomes kinetic action when the model records the path and velocity.
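A quick numerical check of this identity, written as a sketch with hypothetical endpoints and function names: the action of a path is approximated by finite-difference velocities, and the constant-speed segment is compared with a detour sharing the same endpoints.

```python
import math


def action(path, steps=2000):
    """Riemann-sum approximation of int_0^1 |path'(t)|^2 dt via finite differences."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        p, q = path(k * h), path((k + 1) * h)
        total += sum((b - a) ** 2 for a, b in zip(p, q)) / h
    return total


x, y = (0.0, 0.0), (3.0, 4.0)  # hypothetical paired points, |y - x|^2 = 25


def straight(t):
    """Constant-speed interpolation z(t) = (1 - t) x + t y."""
    return tuple((1 - t) * a + t * b for a, b in zip(x, y))


def detour(t):
    """Same endpoints, with an illustrative sideways bump added."""
    sx, sy = straight(t)
    return (sx, sy + math.sin(math.pi * t))


print("straight:", action(straight))
print("detour  :", action(detour))
```

The straight path reproduces the squared displacement, while the detour pays strictly more action, which is why the static cost is the smallest kinetic energy compatible with the pairing.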
A JKO model adds an energy $\mathcal E$ and selects the next state by minimizing
\begin{align*}
\frac{1}{2\tau}W_2^2(\rho,\rho_k)+\mathcal E[\rho].
\end{align*}
The first term penalizes movement away from $\rho_k$, while the second term ranks candidate states by the modeling energy; changing $\mathcal E$ changes the direction selected by the same transport geometry. For example, if $\mathcal E$ is an entropy-plus-potential energy, its formal Wasserstein gradient flow has a diffusion term from entropy and a drift term from the potential.
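A simple worked instance makes the step concrete. Assume, purely for illustration and separately from the observed-data setting above, that $\rho_k=\delta_{x_k}$ and that $\mathcal E[\rho]=\int V\,d\rho$ is a pure potential energy with $V(x)=\frac{\lambda}{2}|x|^2$ and $\lambda>0$. Since $W_2^2(\rho,\delta_{x_k})=\int|x-x_k|^2\,d\rho(x)$, the JKO functional becomes
\begin{align*}
\int\left(\frac{|x-x_k|^2}{2\tau}+\frac{\lambda}{2}|x|^2\right)d\rho(x),
\end{align*}
which is minimized by the Dirac mass at the minimizer $x_{k+1}$ of the integrand. Setting the gradient of the integrand to zero gives
\begin{align*}
\frac{x_{k+1}-x_k}{\tau}=-\lambda x_{k+1},\qquad x_{k+1}=\frac{x_k}{1+\tau\lambda}.
\end{align*}
In this special case the JKO step is exactly the implicit Euler step for $\dot x=-\nabla V(x)$, and iterating gives $x_k=(1+\tau\lambda)^{-k}x_0$, which converges to $e^{-\lambda t}x_0$ when $k\tau=t$ is held fixed and $\tau\downarrow0$.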
The Sinkhorn version keeps the endpoint variables but replaces the sharp linear program by the entropic objective
\begin{align*}
\sum_{i=1}^N\sum_{j=1}^N |x_i-y_j|^2\pi_{ij}+\varepsilon\sum_{i=1}^N\sum_{j=1}^N \pi_{ij}\log\frac{\pi_{ij}}{(1/N)(1/N)}.
\end{align*}
For $\varepsilon>0$, the corresponding kernel weights are
\begin{align*}
K_{ij}=e^{-|x_i-y_j|^2/\varepsilon}.
\end{align*}
If two candidate pairings have squared costs $c_1$ and $c_2$, then their unscaled kernel-weight ratio is
\begin{align*}
\frac{e^{-c_1/\varepsilon}}{e^{-c_2/\varepsilon}}=e^{-(c_1-c_2)/\varepsilon}.
\end{align*}
So smaller $\varepsilon$ amplifies cost differences and approaches sharper matching, while larger $\varepsilon$ makes competing pairings closer in weight and gives smoother numerical gradients. The same observed evolution can therefore be modeled as endpoint matching, mass-conserving motion, energy-driven descent, or regularized computation, depending on which variables the application needs to expose.
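The effect of $\varepsilon$ on the weight ratio is easy to tabulate. The snippet below is a small sketch with hypothetical squared costs $c_1=1$ and $c_2=1.5$; it evaluates only the unscaled kernel ratio from the display above, not the full Sinkhorn scaling.

```python
import math


def kernel_ratio(c1, c2, eps):
    """Ratio exp(-c1/eps) / exp(-c2/eps) = exp(-(c1 - c2)/eps) of unscaled weights."""
    return math.exp(-(c1 - c2) / eps)


c1, c2 = 1.0, 1.5  # hypothetical squared costs of two candidate pairings
for eps in (10.0, 1.0, 0.1, 0.01):
    print(f"eps = {eps:5.2f}: weight ratio = {kernel_ratio(c1, c2, eps):.4g}")
```

For large $\varepsilon$ the ratio stays close to one, so the two pairings receive nearly equal weight; as $\varepsilon$ decreases the cheaper pairing dominates exponentially.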
[/example]
The main lesson is that optimal transport is not a single formula but a family of compatible modeling languages. Its strength comes from the ability to translate between geometry, PDE, probability, and computation while keeping track of which assumptions make each translation valid.
## Beyond This Note
Several nearby Androma topics extend the ideas developed here. The metric viewpoint connects naturally to [Metric Space](/page/Metric%20Space), [Weak Convergence](/page/Weak%20Convergence), and [Wasserstein Geodesics Induced by Dynamical Optimal Plans](/theorems/7486), because optimal transport turns convergence of measures into a geometric question. The dynamic formulation leads toward [Partial Differential Equations I: Classical Foundations and First-Order Equations](/page/Partial%20Differential%20Equations%20I%3A%20Classical%20Foundations%20and%20First-Order%20Equations) and [Partial Differential Equations II: Elliptic Theory and Variational Methods](/page/Partial%20Differential%20Equations%20II%3A%20Elliptic%20Theory%20and%20Variational%20Methods), where transport paths are described by velocity fields rather than only by couplings. The probabilistic side points to [Compactness of the Coupling Set](/theorems/7462), while the computational side connects with [Convex Optimisation I: Theory](/page/Convex%20Optimisation%20I%3A%20Theory), [Fundamental Theorem of Linear Programming](/theorems/6702), and [Convex Function](/page/Convex%20Function).
These connections also mark the limits of the formal Riemannian picture. Otto calculus is most useful as a guide to structure, but rigorous arguments still pass through precise metric, variational, or PDE formulations. Reading further in any of the linked directions should therefore keep the same discipline used throughout this note: identify the space, the topology, the admissible curves or plans, and the hypotheses under which the formal transport language becomes a theorem.
## References
- [Metric Space](/page/Metric%20Space)
- [Weak Convergence](/page/Weak%20Convergence)
- [Wasserstein Geodesics Induced by Dynamical Optimal Plans](/theorems/7486)
- [$p$-Wasserstein Distance Is a Metric](/theorems/7483)
- [Partial Differential Equations I: Classical Foundations and First-Order Equations](/page/Partial%20Differential%20Equations%20I%3A%20Classical%20Foundations%20and%20First-Order%20Equations)
- [Partial Differential Equations II: Elliptic Theory and Variational Methods](/page/Partial%20Differential%20Equations%20II%3A%20Elliptic%20Theory%20and%20Variational%20Methods)
- [Convex Optimisation I: Theory](/page/Convex%20Optimisation%20I%3A%20Theory)
- [Convex Function](/page/Convex%20Function)
Optimal Transport II: Applications
Created by admin on 6/22/2026 | Last updated on 6/22/2026