Optimal Transport I: Foundations develops the central questions and methods of optimal transport, the study of how to move mass from one distribution to another as efficiently as possible. The course begins with the classical Monge formulation, where transport is described by maps, and then moves to the Kantorovich relaxation, which replaces rigid maps with transport plans and makes the theory flexible enough to handle existence, duality, and computation. Along the way, the course connects analysis, geometry, partial differential equations, and linear programming through a single organizing problem.
The main themes are the structure of optimal couplings, the role of dual potentials, and the geometric meaning of transport cost. Early chapters build the basic variational framework, then discrete optimal transport and linear programming provide a computable model that clarifies the general theory. From there, the course develops [Kantorovich duality](/theorems/6799), cyclical monotonicity, and Brenier’s theorem for quadratic cost, leading to the Monge-Ampere equation as the PDE behind optimal maps. The later chapters use these tools to define Wasserstein distances, study the geometry of Wasserstein spaces, and analyze stability, approximation, and limiting behavior of transport problems in general settings.
# Introduction
Optimal transport asks how one distribution of mass can be moved into another while paying the least possible cost. The subject begins with a geometric problem about moving piles of material, but the modern theory quickly becomes a language for probability measures, [weak convergence](/page/Weak%20Convergence), convexity, and partial differential equations. This foundations course develops the passage from transport maps to transport plans, then uses compactness and duality to obtain existence and structure theorems. Chapters 8 and 9 treat Wasserstein distances and their geodesics, while Chapters 6 and 7 develop the quadratic-cost theory culminating in [Brenier's theorem](/theorems/7477) and its Monge-Ampere equation.
The guiding viewpoint is that a probability measure records where mass sits, while a transport object records how that mass is matched to its new location. At first, the desired object is a map: each point of the source chooses a single destination. The main lesson of the opening lectures is that maps are too rigid for a general existence theory, so the course replaces them by measures on the product space.
We use two standard pieces of notation from the start. If $X$ is a measurable or [topological space](/page/Topological%20Space), then $\mathcal P(X)$ denotes the set of probability measures on $X$. On $\mathbb R^n$, the symbol $\mathcal L^n$ denotes [Lebesgue measure](/page/Lebesgue%20Measure).
## The Central Question of Transport
How can two probability measures be compared when the underlying space has geometry? Total variation detects how much mass differs at the same location, but it does not measure the effort required to move mass from one location to another. Optimal transport inserts a cost $c(x,y)$ between a source point $x$ and a target point $y$, and asks for a mass-preserving assignment that minimises the total cost.
The first formulation uses deterministic assignments. Before minimising anything, we need a precise way to say that a measurable map sends one measure to another.
[definition: Pushforward Measure]
Let $(X, \mathcal A)$ and $(Y, \mathcal B)$ be measurable spaces, let $\mu$ be a measure on $(X, \mathcal A)$, and let $T: X \to Y$ be measurable. The pushforward measure $T_\#\mu$ on $(Y, \mathcal B)$ is defined by
\begin{align*}
T_\#\mu(B) = \mu(T^{-1}(B))
\end{align*}
for every $B \in \mathcal B$.
[/definition]
The pushforward condition is the mass-balance constraint: all mass that lands in a target set $B$ must have come from the preimage $T^{-1}(B)$. This lets us state Monge's problem as a constrained minimisation over maps.
[definition: Transport Map]
Let $(X, \mathcal A, \mu)$ and $(Y, \mathcal B, \nu)$ be probability spaces. A transport map from $\mu$ to $\nu$ is a measurable map $T: X \to Y$ such that $T_\#\mu = \nu$.
[/definition]
Once a transport map exists, a cost function $c: X \times Y \to (-\infty,\infty]$ assigns a price to moving a unit of mass from $x$ to $T(x)$. The resulting optimisation problem is the historical starting point.
[definition: Monge Problem]
Let $X$ and $Y$ be measurable spaces, let $\mu \in \mathcal P(X)$ and $\nu \in \mathcal P(Y)$, and let $c: X \times Y \to (-\infty,\infty]$ be measurable. The Monge problem is
\begin{align*}
\inf\left\{\int_X c(x,T(x))\,d\mu(x) : T_\#\mu = \nu\right\}.
\end{align*}
[/definition]
This problem is nonlinear in two ways: the admissibility constraint is imposed on maps, and the set of maps is not stable under averaging. Those features are the reason Monge's formulation is geometrically natural but analytically difficult.
[example: Translating Lebesgue Measure]
Let $\mu$ be given by $\mu(B)=\mathcal L^n(B\cap A)/\mathcal L^n(A)$ for Borel sets $B\subset \mathbb R^n$, and fix $a\in\mathbb R^n$. The map $T_a(x)=x+a$ is continuous, hence Borel measurable, and if $\nu=(T_a)_\#\mu$, then for every Borel set $B\subset\mathbb R^n$,
\begin{align*}
(T_a)_\#\mu(B)=\mu(T_a^{-1}(B))=\nu(B).
\end{align*}
Thus $T_a{}_\#\mu=\nu$, so $T_a$ is a transport map from $\mu$ to $\nu$.
For the squared Euclidean cost $c(x,y)=|x-y|^2$, the cost of this map is
\begin{align*}
\int_{\mathbb R^n} c(x,T_a(x))\,d\mu(x)=\int_{\mathbb R^n}|x-(x+a)|^2\,d\mu(x).
\end{align*}
Since $x-(x+a)=-a$ and $|-a|=|a|$, this becomes
\begin{align*}
\int_{\mathbb R^n}|x-(x+a)|^2\,d\mu(x)=\int_{\mathbb R^n}|a|^2\,d\mu(x).
\end{align*}
Because $|a|^2$ is constant and $\mu$ is a probability measure,
\begin{align*}
\int_{\mathbb R^n}|a|^2\,d\mu(x)=|a|^2\mu(\mathbb R^n)=|a|^2.
\end{align*}
So translating every point by the same vector $a$ has total quadratic transport cost exactly $|a|^2$.
[/example]
The example shows the simplest geometric meaning of the cost: transport is not just comparison of two measures, but comparison through a chosen geometry. It also hides a rigidity that becomes visible when a single source point must send mass to several destinations. To formulate transport in a way that can represent splitting, the next object must record pairings of source and target locations as a measure rather than as a function.
## Why Maps Are Too Rigid
What fails if every source point is forced to choose only one destination? The obstruction appears as soon as atoms are present. A point mass cannot be split by a deterministic map, but many natural target measures require exactly such splitting.
[example: An Atom Cannot Split]
Let $\mu=\delta_0$ on $\mathbb R$ and let $\nu=\frac12\delta_{-1}+\frac12\delta_1$. Fix a measurable map $T:\mathbb R\to\mathbb R$. For every Borel set $B\subset\mathbb R$, the definition of pushforward gives
\begin{align*}
T_\#\mu(B)=\mu(T^{-1}(B))=\delta_0(T^{-1}(B)).
\end{align*}
The condition $0\in T^{-1}(B)$ is equivalent to $T(0)\in B$, so $T_\#\mu(B)=1$ when $T(0)\in B$ and $T_\#\mu(B)=0$ when $T(0)\notin B$. Hence $T_\#\mu=\delta_{T(0)}$.
If $T$ were a transport map from $\mu$ to $\nu$, then $T_\#\mu=\nu$. Evaluating both measures on the singleton $\{T(0)\}$ gives
\begin{align*}
T_\#\mu(\{T(0)\})=\delta_{T(0)}(\{T(0)\})=1.
\end{align*}
On the other hand, $\nu(\{T(0)\})$ is $1/2$ if $T(0)=-1$, is $1/2$ if $T(0)=1$, and is $0$ otherwise. Thus $\nu(\{T(0)\})\ne 1$, contradicting $T_\#\mu=\nu$. Therefore no transport map sends $\delta_0$ to $\frac12\delta_{-1}+\frac12\delta_1$, even though the desired movement is exactly to send half of the mass at $0$ to $-1$ and half to $1$.
[/example]
The atomic example identifies the exact missing operation: deterministic maps preserve the indivisibility of atoms, while physical or probabilistic transport may need to divide an atom among several targets. The next definition introduces an admissible object whose marginals still enforce mass balance but whose mass on $X\times Y$ can describe splitting.
[definition: Transport Plan]
Let $(X,\mathcal A)$ and $(Y,\mathcal B)$ be measurable spaces, and let $\mu\in\mathcal P(X)$ and $\nu\in\mathcal P(Y)$. A transport plan between $\mu$ and $\nu$ is a probability measure $\pi$ on the product measurable space $(X\times Y,\mathcal A\otimes\mathcal B)$ such that
\begin{align*}
\pi(A\times Y) &= \mu(A), & \pi(X\times B) &= \nu(B)
\end{align*}
for all measurable $A\subset X$ and $B\subset Y$.
[/definition]
The set of transport plans is usually denoted $\Pi(\mu,\nu)$. A plan records how much mass travels from each source region to each target region, so it permits splitting while retaining both marginal constraints. Since a plan already assigns mass to source-target pairs, the natural relaxed optimisation problem is to integrate the cost directly over this plan.
[definition: Kantorovich Problem]
Let $X$ and $Y$ be measurable spaces, let $\mu\in\mathcal P(X)$ and $\nu\in\mathcal P(Y)$, and let $c:X\times Y\to(-\infty,\infty]$ be measurable. The Kantorovich problem is
\begin{align*}
\inf\left\{\int_{X\times Y} c(x,y)\,d\pi(x,y) : \pi\in\Pi(\mu,\nu)\right\}.
\end{align*}
[/definition]
The Kantorovich problem relaxes the Monge problem by replacing graph-like deterministic assignments with all transport plans. This raises an essential consistency question: does every admissible Monge map give an admissible Kantorovich plan with the same cost? The next theorem is needed to answer that question and to justify treating Kantorovich's problem as an enlargement of Monge's original formulation.
[quotetheorem:7449]
[citeproof:7449]
This theorem explains why Kantorovich's problem is a relaxation rather than a different subject. The admissible class is enlarged from graph measures to all couplings, and that enlargement makes compactness and convexity available.
The hypotheses also mark the theorem's limitations. Measurability of $T$ is needed because otherwise $(\operatorname{id}_X,T)_\#\mu$ may not be a well-defined probability measure on the product measurable space. The nonnegative-cost assumption avoids integrability ambiguities in the identity: for signed or extended-valued costs, the two sides may contain undefined expressions of the form $\infty-\infty$ unless additional integrability assumptions are imposed. Finally, the theorem does not say that every optimal plan is induced by a map; it only embeds deterministic transports into the larger class of plans. The atom-splitting example shows exactly why this inclusion can be strict.
## Compactness, Duality, and Geometry
How does the relaxation lead to an existence theory? The key point is that probability measures on suitable spaces have compactness properties under weak convergence, while the cost functional behaves well when the cost is lower semicontinuous and bounded below. These two facts will produce optimal plans in broad generality.
[quotetheorem:7450]
[citeproof:7450]
This principle is the first major payoff of the Kantorovich formulation. It shifts the problem from constructing maps to proving compactness, lower semicontinuity, and eventually structure of optimisers.
Each assumption has a specific role in the direct method. Polishness gives the tightness and weak compactness machinery needed to extract a convergent subsequence of plans; on badly behaved measurable spaces, probability measures need not have enough compactness for this argument. Lower semicontinuity of $c$ is what prevents mass from lowering the cost by disappearing into a limit: if $c$ has a downward jump, a weak limit can have strictly smaller or poorly controlled cost behaviour than the approximating sequence. The bounded-below condition ensures the integral functional is well defined after adding a constant if necessary, while the finite-value assumption rules out the vacuous case where all admissible plans have infinite cost. These are existence hypotheses, not uniqueness or map-structure hypotheses; an optimal plan may exist without being induced by a transport map.
The second major payoff is duality. Instead of minimising over all couplings, the theory seeks lower bounds obtained from functions on the two marginal spaces. To make such lower bounds compatible with the transport cost for every possible pair $(x,y)$, the functions must satisfy a pointwise constraint.
[definition: Kantorovich Dual Constraint]
Let $X$ and $Y$ be sets and let $c:X\times Y\to(-\infty,\infty]$. A pair of functions $\varphi:X\to[-\infty,\infty)$ and $\psi:Y\to[-\infty,\infty)$ is $c$-admissible if
\begin{align*}
\varphi(x)+\psi(y)\le c(x,y)
\end{align*}
for all $x\in X$ and $y\in Y$.
[/definition]
Duality turns the inequality into a certificate of optimality: whenever a plan is concentrated where equality holds, the primal and dual costs agree. The full theorem will require measurability, integrability, and topological hypotheses, so the course builds it after the existence theory.
## The Route Through the Course
Which ideas should be mastered before using optimal transport as a tool? The foundations course follows a sequence in which each topic removes an obstruction exposed by the previous one. Maps express the original geometric problem; plans solve the existence problem; duality reveals optimality conditions; Wasserstein distances turn probability measures into metric spaces; quadratic costs connect the theory to convex gradients and PDE.
The early lectures study pushforwards, Monge's problem, and examples where maps do or do not exist. The purpose is to make the mass-balance constraint concrete and to isolate the role of deterministic assignments.
The middle lectures develop Kantorovich's relaxation. They introduce couplings, prove compactness of $\Pi(\mu,\nu)$ under weak convergence, establish existence of optimal plans, and prove the basic duality theorem. Finite-space examples accompany the general theory because they expose the linear-programming structure without topological overhead.
The later lectures specialise to costs with geometric content. For $c(x,y)=d(x,y)^p$, the value defines Wasserstein distances on suitable classes of probability measures. For $c(x,y)=|x-y|^2$ on $\mathbb R^n$, convex analysis enters: optimal maps, when they exist, are gradients of convex functions.
[quotetheorem:7451]
This theorem is stated here as the destination of the course, not as a result to be proved at the beginning. Its proof combines the Kantorovich dual problem, cyclic monotonicity, convex potentials, and the absolute continuity hypothesis on the source measure.
[example: Quadratic Cost and Convex Gradients]
Let $\mu$ and $\nu$ have continuous positive densities $f$ and $g$ on intervals $I,J\subset\mathbb R$, and define their distribution functions by $F(x)=\mu((-\infty,x])$ and $G(y)=\nu((-\infty,y])$. Since the densities are positive on their intervals, $F$ and $G$ are strictly increasing there, so the increasing rearrangement is
\begin{align*}
T(x)=G^{-1}(F(x)).
\end{align*}
For any $y\in J$, monotonicity of $G^{-1}$ gives
\begin{align*}
T(x)\le y \quad\Longleftrightarrow\quad G^{-1}(F(x))\le y \quad\Longleftrightarrow\quad F(x)\le G(y).
\end{align*}
Therefore
\begin{align*}
T_\#\mu((-\infty,y])=\mu(\{x:T(x)\le y\}).
\end{align*}
Using the previous equivalence,
\begin{align*}
\mu(\{x:T(x)\le y\})=\mu(\{x:F(x)\le G(y)\}).
\end{align*}
Since $F$ is increasing,
\begin{align*}
\mu(\{x:F(x)\le G(y)\})=\mu((-\infty,F^{-1}(G(y))]).
\end{align*}
By the definition of $F$,
\begin{align*}
\mu((-\infty,F^{-1}(G(y))])=F(F^{-1}(G(y)))=G(y)=\nu((-\infty,y]).
\end{align*}
Thus $T_\#\mu=\nu$ on intervals of the form $(-\infty,y]$, and hence on the Borel sets generated by them.
For the quadratic cost, the important structural point is that $T$ is monotone and therefore comes from a convex potential. Fix $x_0\in I$ and define
\begin{align*}
\phi(x)=\int_{x_0}^x T(s)\,ds.
\end{align*}
By the [fundamental theorem of calculus](/theorems/632),
\begin{align*}
\phi'(x)=T(x).
\end{align*}
If $u<v<w$ in $I$, then monotonicity of $T$ gives $T(s)\le T(v)$ for $s\in[u,v]$ and $T(v)\le T(s)$ for $s\in[v,w]$. Hence
\begin{align*}
\frac{\phi(v)-\phi(u)}{v-u}=\frac{1}{v-u}\int_u^v T(s)\,ds\le T(v).
\end{align*}
Also,
\begin{align*}
T(v)\le \frac{1}{w-v}\int_v^w T(s)\,ds=\frac{\phi(w)-\phi(v)}{w-v}.
\end{align*}
The secant slopes of $\phi$ are increasing, so $\phi$ is convex. Thus, in one dimension, the quadratic-cost transport map is the derivative of a convex function; Brenier's theorem generalizes this monotone-derivative picture to convex gradients in higher dimensions.
[/example]
## Prerequisites and Conventions
What background will be used without rebuilding it from the ground up? The course assumes measure theory, real analysis, probability measures and weak convergence, basic functional analysis, convex analysis, and introductory familiarity with PDE. The notes recall definitions when they are part of the transport story, but they treat standard facts such as weak compactness criteria, lower semicontinuity, and convex subdifferentials as tools to be invoked and connected.
Throughout the course, $\mathcal P(X)$ denotes probability measures on a measurable or topological space $X$. When $X$ is a [metric space](/page/Metric%20Space), weak convergence of probability measures means convergence against bounded continuous test functions. On Euclidean spaces, $\mathcal L^n$ denotes Lebesgue measure, and costs such as $|x-y|$ and $|x-y|^2$ use the Euclidean norm.
The notation $T_\#\mu$ will always mean pushforward, $\Pi(\mu,\nu)$ will denote the set of transport plans, and $c$ will denote a cost function. These conventions are fixed early because the course repeatedly moves between map-based, plan-based, dual, and geometric formulations of the same transport problem.
The conventions fixed in the introduction now become tools rather than notation. With pushforwards, couplings, and costs in place, we can ask the basic Monge question: can one move the source measure to the target by a single transport map?
# 1. The Monge Problem and Transport Maps
Optimal transport begins with a concrete question: if a distribution of mass is arranged according to a probability measure $\mu$, and a target distribution is prescribed by another probability measure $\nu$, can we move every point $x$ to a single destination $T(x)$ so that the moved mass has law $\nu$? Monge's formulation asks for such a map with minimal transportation cost. This first chapter sets up the language of pushforwards and transport maps, states the Monge problem, and explains why the map-based formulation is powerful but too rigid for general existence theory. The only prerequisites are basic measure theory: measurable maps, probability measures, integration of [measurable functions](/page/Measurable%20Functions), and the idea that two measures are equal when they agree on all measurable sets.
The central tension is already visible before optimization begins. A map can move each source point to only one target point, so it cannot split the mass sitting at a single atom. This creates elementary obstructions that Chapter 2 resolves by Kantorovich's relaxation to transport plans.
## Pushforwards and Transport Maps
The first problem is how to express the sentence "the image of $\mu$ under $T$ is $\nu$" without relying on densities or coordinates. The correct language is the pushforward measure, which records how much source mass lands in each target set.
[definition: Pushforward Measure]
Let $(X, \mathcal A)$ and $(Y, \mathcal B)$ be measurable spaces, let $\mu$ be a measure on $(X, \mathcal A)$, and let $T: X \to Y$ be a measurable map. The pushforward of $\mu$ by $T$ is the measure $T_{\#}\mu$ on $(Y, \mathcal B)$ defined by
\begin{align*}
(T_{\#}\mu)(B) = \mu(T^{-1}(B)), \qquad B \in \mathcal B.
\end{align*}
[/definition]
Thus $T_{\#}\mu$ is determined by testing target sets $B$ and measuring the mass of all source points sent into $B$. This definition is set-theoretic rather than analytic, so it applies equally well to discrete spaces, Euclidean spaces, and abstract probability spaces.
[example: Pushforward of a Two-Point Measure]
Let $X=\{a,b\}$, $Y=\{0,1\}$, and
\begin{align*}
\mu = \frac{1}{3}\delta_a + \frac{2}{3}\delta_b.
\end{align*}
We compute the pushforward by evaluating the mass assigned to each subset of $Y$. First suppose $T(a)=0$ and $T(b)=1$. For the singleton $\{0\}$,
\begin{align*}
T^{-1}(\{0\})=\{a\}.
\end{align*}
Hence
\begin{align*}
(T_{\#}\mu)(\{0\})=\mu(\{a\})=\frac{1}{3}.
\end{align*}
Similarly,
\begin{align*}
T^{-1}(\{1\})=\{b\}.
\end{align*}
Thus
\begin{align*}
(T_{\#}\mu)(\{1\})=\mu(\{b\})=\frac{2}{3}.
\end{align*}
Since $Y$ has only the two points $0$ and $1$, these two singleton masses determine the measure:
\begin{align*}
T_{\#}\mu = \frac{1}{3}\delta_0+\frac{2}{3}\delta_1.
\end{align*}
If instead $T(a)=T(b)=0$, then
\begin{align*}
T^{-1}(\{0\})=\{a,b\}.
\end{align*}
Therefore
\begin{align*}
(T_{\#}\mu)(\{0\})=\mu(\{a,b\})=\mu(\{a\})+\mu(\{b\})=\frac{1}{3}+\frac{2}{3}=1.
\end{align*}
Also,
\begin{align*}
T^{-1}(\{1\})=\varnothing.
\end{align*}
So
\begin{align*}
(T_{\#}\mu)(\{1\})=\mu(\varnothing)=0.
\end{align*}
Hence $T_{\#}\mu=\delta_0$. The two cases show that when distinct source points have the same image, their masses are added at that common target point.
[/example]
The example computes the image law of a single map. For transport, the problem is reversed: the target law $\nu$ is prescribed in advance, so we need a name for maps whose pushforward is exactly that prescribed target.
[definition: Transport Map]
Let $(X, \mathcal A, \mu)$ and $(Y, \mathcal B, \nu)$ be probability spaces. A transport map from $\mu$ to $\nu$ is a measurable map $T:X\to Y$ such that
\begin{align*}
T_{\#}\mu = \nu.
\end{align*}
[/definition]
The definition packages infinitely many setwise constraints into one equation, but that compact notation can hide the actual verification problem. To decide whether a candidate map really transports $\mu$ to $\nu$, one must check that every measurable target set receives exactly the mass prescribed by $\nu$, computed by pulling the set back to the source.
[quotetheorem:7452]
[citeproof:7452]
The mass-balance identity is the practical content of the pushforward constraint: every target test set must receive exactly the prescribed amount of mass. Measurability of $T$ is essential here, because otherwise $T^{-1}(B)$ need not be a measurable source set and the expression $\mu(T^{-1}(B))$ may not be defined. The theorem does not assert that such a map exists; it only gives the criterion a candidate map must satisfy. A concrete failure already occurs when
\begin{align*}
X=\{a\}, \qquad
\mu=\delta_a, \qquad
Y=\{0,1\}, \qquad
\nu=\frac{1}{2}\delta_0+\frac{1}{2}\delta_1.
\end{align*}
Any map sends $a$ to either $0$ or $1$, so the pushforward is either $\delta_0$ or $\delta_1$, never $\nu$. This limitation is the first form of the atomic splitting obstruction. Before examining that obstruction systematically, we need a second way to read the pushforward: by testing against functions rather than sets.
[quotetheorem:7453]
[citeproof:7453]
This formula is the bridge between measure preservation and optimization. The measurability hypotheses ensure that both $f$ and $f\circ T$ are legitimate integrands, and the integrability clause is needed because signed or complex integrals are not meaningful when positive and negative parts are both infinite. If $T$ were not measurable, then $T^{-1}(B)$ could fail to be measurable for an indicator test $f=\mathbb{1}_B$, so $f\circ T$ would not be a valid integrand; if a signed function has infinite positive and negative parts, then neither side defines a signed integral. The theorem does not provide a change of coordinates with a Jacobian determinant; it only identifies integration against the image measure with integration of the pulled-back function. For example, if $\mu$ is normalized Lebesgue measure on $[-1,1]$ and $T(x)=x^2$, then $T_{\#}\mu$ has density $y\mapsto (2\sqrt y)^{-1}$ on $(0,1)$, so the identity reads
\begin{align*}
\int_0^1 f(y)(2\sqrt y)^{-1}\,d\mathcal L^1(y)
=\frac{1}{2}\int_{-1}^1 f(x^2)\,d\mathcal L^1(x).
\end{align*}
This is a statement about the image measure, not a single-branch coordinate substitution. Costs in Monge's problem are integrated over the source, but the admissibility constraint is stated as a pushforward condition, so this identity is the device that lets us move between the two viewpoints.
[example: Testing a Pushforward by Moments]
Let $X=Y=\mathbb R$, let $\mu$ be a probability measure with finite first moment, and let $T(x)=2x+1$. Since
\begin{align*}
|T(x)|=|2x+1|\le 2|x|+1,
\end{align*}
we have
\begin{align*}
\int_{\mathbb R}|T(x)|\,d\mu(x)\le 2\int_{\mathbb R}|x|\,d\mu(x)+\int_{\mathbb R}1\,d\mu(x)<\infty.
\end{align*}
Thus the [test function](/page/Test%20Function) $f(y)=y$ is integrable against $T_{\#}\mu$, and *Change of Variables for Pushforwards* gives
\begin{align*}
\int_{\mathbb R} y\,d(T_{\#}\mu)(y)=\int_{\mathbb R} f(T(x))\,d\mu(x).
\end{align*}
Because $f(T(x))=T(x)=2x+1$, this becomes
\begin{align*}
\int_{\mathbb R} y\,d(T_{\#}\mu)(y)=\int_{\mathbb R}(2x+1)\,d\mu(x).
\end{align*}
By linearity of the integral,
\begin{align*}
\int_{\mathbb R}(2x+1)\,d\mu(x)=2\int_{\mathbb R}x\,d\mu(x)+\int_{\mathbb R}1\,d\mu(x).
\end{align*}
Since $\mu$ is a probability measure,
\begin{align*}
\int_{\mathbb R}1\,d\mu(x)=\mu(\mathbb R)=1.
\end{align*}
Therefore
\begin{align*}
\int_{\mathbb R} y\,d(T_{\#}\mu)(y)=2\int_{\mathbb R}x\,d\mu(x)+1.
\end{align*}
So the mean of the image law $T_{\#}\mu$ is exactly twice the mean of $\mu$ plus $1$, computed without first finding a density for $T_{\#}\mu$.
[/example]
## Monge's Optimization Problem
Once admissible maps are defined, the next question is which admissible map should be chosen. Monge's answer is to assign a cost $c(x,y)$ to moving mass from $x$ to $y$ and minimize the total cost over all maps that push $\mu$ to $\nu$.
[definition: Cost Function]
Let $(X,\mathcal A)$ and $(Y,\mathcal B)$ be measurable spaces. A cost function is a measurable function
\begin{align*}
c:(X\times Y,\mathcal A\otimes\mathcal B)\to [0,\infty].
\end{align*}
[/definition]
The value $c(x,T(x))$ is the cost paid by the particle initially at $x$. Once a cost is fixed, the next step is to combine the cost integral with the pushforward constraint into the variational problem introduced by Monge.
[definition: Monge Problem]
Let $(X,\mathcal A,\mu)$ and $(Y,\mathcal B,\nu)$ be probability spaces, and let $c:(X\times Y,\mathcal A\otimes\mathcal B)\to [0,\infty]$ be measurable. The Monge problem is
\begin{align*}
\inf\left\{\int_X c(x,T(x))\,d\mu(x) : T:X\to Y \text{ measurable and } T_{\#}\mu=\nu\right\}.
\end{align*}
[/definition]
An admissible map attaining the infimum is called an optimal transport map for the Monge problem.
The feasible set may be empty, and even when it is nonempty the infimum may fail to be attained. These two difficulties are separate: admissibility is a measure-theoretic constraint, while attainment is an analytic compactness issue.
[remark: Empty Feasible Set]
If no measurable map $T:X\to Y$ satisfies $T_{\#}\mu=\nu$, then the Monge problem has no admissible maps. In this case the map-based optimization problem does not describe a transport mechanism from $\mu$ to $\nu$, regardless of the cost.
[/remark]
Two costs dominate the first course in optimal transport. Distance cost emphasizes shortest paths and is historically closest to moving piles of mass, while the quadratic cost requires a separate definition because it brings convexity into the theory.
[definition: Distance Cost]
Let $(X,d)$ be a metric space equipped with its Borel $\sigma$-algebra, and let $\mu,\nu$ be Borel probability measures on $X$. The distance-cost Monge problem uses the Borel measurable cost function $c:X\times X\to [0,\infty)$ defined by
\begin{align*}
c(x,y)=d(x,y).
\end{align*}
[/definition]
This cost measures travel length and makes sense on any metric space. On Euclidean space there is a second canonical cost, obtained by squaring distance, whose stronger penalty for long moves leads to a richer structural theory.
[definition: Squared Euclidean Cost]
Let $X=Y=\mathbb R^n$ and let $\mu,\nu$ be Borel probability measures on $\mathbb R^n$. The squared Euclidean cost is the function $c:\mathbb R^n\times\mathbb R^n\to [0,\infty)$ defined by
\begin{align*}
c(x,y)=|x-y|^2.
\end{align*}
[/definition]
The square changes the problem qualitatively: it penalizes long moves more heavily and interacts with convex functions through gradients. Chapter 6 uses this cost to connect optimal maps with convex analysis through Brenier's theorem.
[example: Translating a Probability Measure]
Let $X=Y=\mathbb R^n$, let $a\in\mathbb R^n$, and define $T(x)=x+a$. The statement $T_{\#}\mu=\nu$ means that for every Borel set $B\subseteq\mathbb R^n$,
\begin{align*}
\nu(B)=\mu(T^{-1}(B)).
\end{align*}
Since $T(x)\in B$ exactly when $x+a\in B$, equivalently $x\in B-a:=\{z-a:z\in B\}$, we have
\begin{align*}
T^{-1}(B)=B-a.
\end{align*}
Thus $T_{\#}\mu=\nu$ exactly when
\begin{align*}
\nu(B)=\mu(B-a)
\end{align*}
for every Borel set $B$, which is precisely the statement that $\nu$ is the translate of $\mu$ by $a$.
For the squared Euclidean cost, the displacement from $x$ to $T(x)$ is
\begin{align*}
x-T(x)=x-(x+a)=-a.
\end{align*}
Therefore
\begin{align*}
|x-T(x)|^2=|-a|^2=|a|^2.
\end{align*}
Integrating this constant function against the probability measure $\mu$ gives
\begin{align*}
\int_{\mathbb R^n}|x-T(x)|^2\,d\mu(x)=\int_{\mathbb R^n}|a|^2\,d\mu(x)=|a|^2\mu(\mathbb R^n)=|a|^2.
\end{align*}
So whenever $\nu$ is the translate of $\mu$ by $a$, the translation map is an admissible Monge map, and its squared Euclidean cost depends only on the translation vector $a$, not on the shape of $\mu$.
[/example]
The change-of-variables formula lets us express constraints on a candidate optimizer by testing against functions. In Euclidean settings, this is often the most convenient way to prove that a map has the right target distribution.
[quotetheorem:7454]
[citeproof:7454]
This characterisation is especially useful when the target measure is described through integrals, moments, or densities. The bounded-measurable testing class is large enough because it contains indicators of all measurable target sets; testing only against a smaller class may lose information unless that class is known to determine measures. For instance, matching only the first moment on $\mathbb R$ cannot distinguish
\begin{align*}
\delta_0
\qquad\text{from}\qquad
\frac{1}{2}\delta_{-1}+\frac{1}{2}\delta_1,
\end{align*}
even though the measures are different. The boundedness and probability hypotheses keep both integrals finite for every test in the class. If one tried to use all measurable functions against an infinite measure, the statement would not even be a well-posed finite identity: for example, Lebesgue measure on $\mathbb R$ gives $\int_{\mathbb R} 1\,d\mathcal L^1=\infty$, and signed tests such as $f(y)=y$ may have undefined integral because the positive and negative parts are both infinite. This is a hypothesis-by-hypothesis warning: bounded measurable tests recover sets because they include $\mathbb{1}_B$, while moment tests can agree for distinct probability measures, and the probability setting prevents total-mass pathologies. The theorem therefore gives an exact testing criterion, not merely a moment condition. It also prepares the transition from Monge maps to Kantorovich plans, where marginals are tested in the same way.
## Atoms and Splitting Obstructions
The key obstruction to Monge's formulation is that a map sends each source point to a single target point. If positive mass is concentrated at one point, a transport map must send that entire atom to one target location, so it cannot divide the atom among several destinations.
[definition: Atom of a Measure]
Let $(X,\mathcal A,\mu)$ be a [measure space](/page/Measure%20Space). A measurable set $A\in\mathcal A$ is an atom of $\mu$ if $\mu(A)>0$ and for every measurable set $B\subseteq A$, either $\mu(B)=0$ or $\mu(B)=\mu(A)$.
[/definition]
This definition captures indivisible mass at the level of measurable sets, which is the right general notion but too broad for the finite obstructions that follow. In those examples the indivisible mass is visible at one named source point, and the relevant question is simply whether the singleton containing that point has positive measure.
[definition: Point Atom]
Let $(X,\mathcal A,\mu)$ be a measure space and let $x\in X$ with $\{x\}\in\mathcal A$. The point $x$ is a point atom of $\mu$ if
\begin{align*}
\mu(\{x\})>0.
\end{align*}
[/definition]
Point atoms show the map obstruction in its simplest form. The next example demonstrates nonexistence before any cost is considered.
[example: Moving a Two-Point Measure to a Three-Point Measure]
Let
\begin{align*}
\mu = \frac{1}{2}\delta_0+\frac{1}{2}\delta_1
\end{align*}
on $\{0,1\}$, and let
\begin{align*}
\nu = \frac{1}{3}\delta_0+\frac{1}{3}\delta_1+\frac{1}{3}\delta_2
\end{align*}
on $\{0,1,2\}$. We show that there is no map $T:\{0,1\}\to\{0,1,2\}$ with $T_{\#}\mu=\nu$.
Fix any such map $T$. Since the source has only the two points $0$ and $1$, the image set is
\begin{align*}
T(\{0,1\})=\{T(0),T(1)\}
\end{align*}
and therefore contains either one or two target points. If $y\in\{0,1,2\}$ is not equal to $T(0)$ or $T(1)$, then
\begin{align*}
T^{-1}(\{y\})=\varnothing
\end{align*}
so
\begin{align*}
(T_{\#}\mu)(\{y\})=\mu(\varnothing)=0.
\end{align*}
Thus $T_{\#}\mu$ assigns positive mass to at most two of the three singleton target sets.
More explicitly, for each $y\in\{0,1,2\}$,
\begin{align*}
T^{-1}(\{y\})=\{x\in\{0,1\}:T(x)=y\}.
\end{align*}
This preimage can be $\varnothing$, $\{0\}$, $\{1\}$, or $\{0,1\}$. Hence
\begin{align*}
(T_{\#}\mu)(\{y\})\in\left\{0,\frac{1}{2},1\right\}.
\end{align*}
But $\nu$ satisfies
\begin{align*}
\nu(\{0\})=\nu(\{1\})=\nu(\{2\})=\frac{1}{3}.
\end{align*}
Since $\frac{1}{3}$ is not one of $0,\frac{1}{2},1$, no singleton mass of $T_{\#}\mu$ can match the corresponding singleton mass of $\nu$. Therefore $T_{\#}\mu\ne\nu$ for every map $T$, so the Monge feasible set is empty.
[/example]
This example is not a peculiarity of finite spaces. It reflects a necessary condition: the image of an atom under a map remains unsplit.
[quotetheorem:7455]
[citeproof:7455]
The statement is a structural restriction, not an optimality condition, and it does not require a target law $\nu$ or a pushforward identity $T_{\#}\mu=\nu$. The finite-partition hypothesis is only used to express splitting among finitely many target regions; for a point atom $A=\{x\}$ the obstruction is stronger, since the whole mass at $x$ is concentrated at the single image point $T(x)$. The atomic hypothesis is essential: with $\mu=\mathcal L^1|_{[0,1]}$ and $A=[0,1]$, the map $T(x)=0$ for $0\le x\le \frac12$ and $T(x)=1$ for $\frac12<x\le 1$ sends positive mass from the same source set into both target regions $\{0\}$ and $\{1\}$. This theorem therefore does not say that all source mass must travel together, only that mass contained in one atom cannot be divided by a map. It shows why Monge's problem can fail at the admissibility stage, while the relaxed Kantorovich problem will allow the mass at an atom to be divided across several destinations.
[remark: Why Plans Will Fix the Obstruction]
A transport plan records a probability measure on $X\times Y$ rather than a map $X\to Y$. For a point atom $x$, a plan may place the mass over several points $(x,y)$ with different values of $y$. This is the basic mechanism that removes the splitting obstruction in the next chapter.
[/remark]
## Monotone Rearrangement on $\mathbb R$
Despite the obstructions above, the Monge problem has a canonical solution in one dimension under broad hypotheses. The order structure of $\mathbb R$ gives a distinguished increasing map from one distribution to another.
[definition: Distribution Function]
Let $\mu$ be a Borel probability measure on $\mathbb R$. Its distribution function is
\begin{align*}
F_\mu:\mathbb R\to[0,1], \qquad F_\mu(x)=\mu((-\infty,x]), \qquad x\in\mathbb R.
\end{align*}
[/definition]
The distribution function describes how much mass lies to the left of each point. To convert this cumulative description back into locations of mass, we need a generalized inverse that works even when the distribution function has flat parts or jumps.
[definition: Quantile Function]
Let $\mu$ be a Borel probability measure on $\mathbb R$ with distribution function $F_\mu$. Its quantile function is the map $F_\mu^{-1}:(0,1)\to \mathbb R\cup\{-\infty,\infty\}$ defined by
\begin{align*}
F_\mu^{-1}(s)=\inf\{x\in\mathbb R: F_\mu(x)\ge s\}, \qquad s\in(0,1).
\end{align*}
[/definition]
For Borel probability measures on $\mathbb R$, the quantile function is real-valued for all $s\in(0,1)$, because the distribution function tends to $0$ at $-\infty$ and to $1$ at $\infty$. When the endpoint values are needed, we use the extended convention
\begin{align*}
F_\mu^{-1}(0)=\inf \operatorname{supp}\mu,
\qquad
F_\mu^{-1}(1)=\sup \operatorname{supp}\mu,
\end{align*}
allowing the values $-\infty$ and $\infty$ if the support is unbounded. The quantile function converts uniform mass on $(0,1)$ into the desired probability measure. This motivates the next definition: to transport an arbitrary source measure to a target measure, we first convert source locations into cumulative ranks and then convert those ranks into target locations.
[definition: Increasing Rearrangement]
Let $\mu$ and $\nu$ be Borel probability measures on $\mathbb R$, and assume that $\mu$ has no atoms. The increasing rearrangement from $\mu$ to $\nu$ is any Borel map $T:\mathbb R\to\mathbb R$ satisfying
\begin{align*}
T(x)=F_\nu^{-1}(F_\mu(x))
\end{align*}
at every point where $0<F_\mu(x)<1$.
[/definition]
The values of $T$ on the $\mu$-null endpoint set where $F_\mu(x)\in\{0,1\}$ may be chosen arbitrarily, subject to Borel measurability, because those values do not affect the pushforward law. This formula is built to preserve order, but it must still be checked against the pushforward condition. The next theorem is needed to turn the rearrangement formula into an admissible Monge map. The no-atom assumption on the source measure is what makes the rank variable $F_\mu(X)$ behave like a uniform [random variable](/page/Random%20Variable), and the quantile function then rebuilds the target law.
[quotetheorem:7456]
[proofunderconstruction:7456]
This theorem solves the existence problem in one dimension for non-atomic sources and also selects a canonical map. The no-atom hypothesis is not cosmetic: if
\begin{align*}
\mu=\delta_0
\qquad\text{and}\qquad
\nu=\frac{1}{2}\delta_{-1}+\frac{1}{2}\delta_1,
\end{align*}
no map can send the single source atom to both target points, so the displayed formula cannot produce a transport map. The real-line hypothesis is also doing mathematical work: the construction uses the total order on $\mathbb R$ to turn positions into cumulative ranks, and the Borel structure is the measurable structure for which intervals $(-\infty,x]$ determine probability laws. On a general measurable space there is no intrinsic order, and on a non-Borel measurable structure the distribution functions may fail to encode the full target measure. The theorem also says only admissibility and monotonicity, not optimality or uniqueness for an arbitrary cost. For convex costs depending on $x-y$, the increasing rearrangement is the optimal map; Chapters 4 and 5 derive this from duality and cyclical monotonicity.
[example: Increasing Rearrangement Between Probability Densities on $\mathbb R$]
Let $\mu$ have density $f_\mu(x)=2x$ on $(0,1)$ and let $\nu$ be uniform on $(0,1)$. For $0<x<1$, the distribution function of $\mu$ is
\begin{align*}
F_\mu(x)=\mu((-\infty,x])=\int_0^x 2r\,dr=x^2.
\end{align*}
For the uniform measure $\nu$ on $(0,1)$, its distribution function satisfies $F_\nu(y)=y$ for $0<y<1$, so its quantile function is
\begin{align*}
F_\nu^{-1}(s)=s, \qquad 0<s<1.
\end{align*}
Therefore the increasing rearrangement on $(0,1)$ is
\begin{align*}
T(x)=F_\nu^{-1}(F_\mu(x))=F_\nu^{-1}(x^2)=x^2.
\end{align*}
We verify the target law by computing the distribution function of $T(X)$ when $X\sim\mu$. If $0<t<1$, then $X\in(0,1)$ implies $X^2\le t$ exactly when $X\le \sqrt t$, because the square-root function is the inverse of $x\mapsto x^2$ on $[0,\infty)$. Hence
\begin{align*}
\mathbb P(T(X)\le t)=\mathbb P(X^2\le t)=\mathbb P(X\le \sqrt t).
\end{align*}
Using the density of $\mu$,
\begin{align*}
\mathbb P(X\le \sqrt t)=\int_0^{\sqrt t}2r\,dr=(\sqrt t)^2=t.
\end{align*}
Also $\mathbb P(T(X)\le t)=0$ for $t\le 0$ and $\mathbb P(T(X)\le t)=1$ for $t\ge 1$, since $0<X^2<1$ almost surely. Thus $T(X)$ has distribution function $t$ on $(0,1)$, so $T(X)$ is uniform on $(0,1)$. This shows concretely that the rearrangement $T(x)=x^2$ transports the density $2x$ to the uniform law.
[/example]
The chapter ends with a precise lesson. Pushforwards make the Monge constraint exact, and the change-of-variables formula makes it usable in calculations. But maps cannot split mass, so the original Monge formulation is not stable enough to guarantee admissible competitors in general. The same pushforward language appears beyond optimal transport whenever deterministic data are propagated into observed laws: in statistics, a fitted transformation sends a model distribution to the distribution of predicted observations; in PDE, a Lagrangian flow pushes an initial density forward to the density at a later time. These connections explain why optimal transport can later speak both to statistical matching problems and to continuity equations. Kantorovich's relaxation begins by replacing maps with probability measures on the product space, preserving mass balance while allowing splitting.
The Monge problem makes the transport task concrete, but its rigidity quickly becomes apparent. When a deterministic map cannot split mass, we are forced to enlarge the notion of solution and allow transport plans on the product space.
# 2. Kantorovich Relaxation
The Monge formulation asks for a deterministic rule sending each point of the source space to a point of the target space. Chapter 1 showed why this is too rigid: atoms may need to split, and a measurable map may fail to realise the desired mass balance. Kantorovich's relaxation replaces a transport map by a joint probability measure, called a coupling or transport plan, so that mass may be distributed across many destinations while the two marginals remain fixed.
The main questions in this chapter are existence and stability. Once transport is formulated as a minimization problem over measures on a product space, compactness and lower semicontinuity become the basic tools. The reward is a clean existence theorem on Polish spaces for lower semicontinuous costs bounded from below.
## Couplings and Marginals
If deterministic maps are too restrictive, the next object should record how much mass travels from each source region to each target region. Such an object must live on the product space: a set $A \times B$ records mass starting in $A$ and ending in $B$. The marginal constraints then impose that all outgoing mass reproduces the source measure and all incoming mass reproduces the target measure.
[definition: Coupling]
Let $(X, \mathcal A)$ and $(Y, \mathcal B)$ be measurable spaces, and let $\mu$ be a probability measure on $(X, \mathcal A)$ and $\nu$ a probability measure on $(Y, \mathcal B)$. A coupling of $\mu$ and $\nu$ is a probability measure $\gamma$ on $(X \times Y, \mathcal A \otimes \mathcal B)$ such that, for all $A \in \mathcal A$ and all $B \in \mathcal B$,
\begin{align*}
\gamma(A \times Y) = \mu(A)
\end{align*}
and
\begin{align*}
\gamma(X \times B) = \nu(B).
\end{align*}
[/definition]
The two identities say that the first and second coordinate projections recover the prescribed measures. This is the probabilistic viewpoint: a coupling is a joint law for random variables $(X_0,Y_0)$ whose individual laws are $\mu$ and $\nu$. Since we will optimize over all such joint laws, we need a stable notation for the full admissible class rather than naming couplings one at a time.
[definition: Set Of Transport Plans]
Let $\mu \in \mathcal P(X)$ and $\nu \in \mathcal P(Y)$ be probability measures on measurable spaces $X$ and $Y$. The set of transport plans from $\mu$ to $\nu$ is
\begin{align*}
\Pi(\mu,\nu) := \{\gamma \in \mathcal P(X \times Y) : (\pi_X)_{\#}\gamma = \mu,\ (\pi_Y)_{\#}\gamma = \nu\},
\end{align*}
where $\pi_X:X\times Y\to X$ and $\pi_Y:X\times Y\to Y$ are the coordinate projections defined by $\pi_X(x,y)=x$ and $\pi_Y(x,y)=y$.
[/definition]
This is the same notation for transport plans introduced in Chapter 0, and it will be used throughout the course. The set is never empty, since the product measure $\mu \otimes \nu$ belongs to it. The product coupling represents independent sampling of source and target points, so it is usually not optimal for geometric costs, but it proves admissibility.
[example: Couplings On A Two Point Space]
Let $X=Y=\{0,1\}$, let $\mu(\{0\})=p$, and let $\nu(\{0\})=q$, so $\mu(\{1\})=1-p$ and $\nu(\{1\})=1-q$. A probability measure $\gamma$ on $X\times Y$ is determined by the four numbers
\begin{align*}
\gamma_{ij}:=\gamma(\{i\}\times\{j\}),\qquad i,j\in\{0,1\}.
\end{align*}
The first marginal condition gives the two row-sum equations
\begin{align*}
\gamma_{00}+\gamma_{01}=p
\end{align*}
and
\begin{align*}
\gamma_{10}+\gamma_{11}=1-p.
\end{align*}
The second marginal condition gives the two column-sum equations
\begin{align*}
\gamma_{00}+\gamma_{10}=q
\end{align*}
and
\begin{align*}
\gamma_{01}+\gamma_{11}=1-q.
\end{align*}
Write $a=\gamma_{00}$. From $\gamma_{00}+\gamma_{01}=p$ we get
\begin{align*}
\gamma_{01}=p-a.
\end{align*}
From $\gamma_{00}+\gamma_{10}=q$ we get
\begin{align*}
\gamma_{10}=q-a.
\end{align*}
Using the total mass identity
\begin{align*}
\gamma_{00}+\gamma_{01}+\gamma_{10}+\gamma_{11}=1,
\end{align*}
we solve for the last entry:
\begin{align*}
\gamma_{11}=1-\gamma_{00}-\gamma_{01}-\gamma_{10}.
\end{align*}
Substituting the three known entries gives
\begin{align*}
\gamma_{11}=1-a-(p-a)-(q-a).
\end{align*}
Expanding the signs yields
\begin{align*}
\gamma_{11}=1-a-p+a-q+a.
\end{align*}
Collecting terms gives
\begin{align*}
\gamma_{11}=1-p-q+a.
\end{align*}
It remains to impose nonnegativity of the four masses. The conditions are
\begin{align*}
a\ge 0,
\end{align*}
\begin{align*}
p-a\ge 0,
\end{align*}
\begin{align*}
q-a\ge 0,
\end{align*}
and
\begin{align*}
1-p-q+a\ge 0.
\end{align*}
These are equivalent to
\begin{align*}
a\ge 0,
\end{align*}
\begin{align*}
a\le p,
\end{align*}
\begin{align*}
a\le q,
\end{align*}
and
\begin{align*}
a\ge p+q-1.
\end{align*}
Therefore the admissible values of $a$ are exactly
\begin{align*}
\max\{0,p+q-1\}\le a\le \min\{p,q\}.
\end{align*}
Thus every coupling is encoded by one parameter in this interval, and even this smallest finite example already produces a convex family of relaxed transports rather than a single deterministic rule.
[/example]
This example also introduces the geometry of $\Pi(\mu,\nu)$ in the finite case. The marginal constraints are linear equations and positivity is a set of inequalities, so the set of plans is a convex polytope. We need the same convexity in general spaces because Kantorovich minimization will be a linear problem over the admissible class rather than a nonlinear search over maps.
[quotetheorem:7457]
[citeproof:7457]
Convexity turns an originally nonlinear search over maps into a linear minimization problem over a convex set of measures. The fixed marginal condition is essential here: on $X=Y=\{0,1\}$, the measures $\delta_{(0,0)}$ and $\delta_{(1,1)}$ are each graph plans, but they have different first marginals. Their average has first marginal $\frac12\delta_0+\frac12\delta_1$, so it is not a plan with either original source marginal. Convexity also has a limitation: it does not imply compactness, closedness, or existence of a minimizer. Those require topological hypotheses and lower semicontinuity of the cost, so the next task is first to put a cost on this convex set and then to identify when minimization over it is well behaved.
## The Kantorovich Problem
The Monge problem minimizes the average cost of a map, but it cannot split mass. The relaxed problem should charge a plan according to the cost of each source-target pair and then minimize this total cost over all admissible joint distributions.
[definition: Kantorovich Cost]
Let $X$ and $Y$ be measurable spaces, and let $c:X\times Y\to (-\infty,\infty]$ be measurable. The Kantorovich cost functional associated to $c$ is the map
\begin{align*}
\mathsf K_c:\mathcal P(X)\times \mathcal P(Y)\to [-\infty,\infty]
\end{align*}
defined, for $\mu \in \mathcal P(X)$ and $\nu \in \mathcal P(Y)$, by
\begin{align*}
\mathsf K_c(\mu,\nu) := \inf_{\gamma \in \Pi(\mu,\nu)} \int_{X\times Y} c(x,y)\,d\gamma(x,y).
\end{align*}
A plan $\gamma \in \Pi(\mu,\nu)$ is an optimal Kantorovich plan if it attains this infimum.
[/definition]
The integral may be infinite, and the value may be infinite for very singular costs. In the existence theorem below we assume the cost is bounded from below, which keeps the minimization from degenerating to $-\infty$ and lets lower semicontinuity control minimizing sequences.
[example: Optimal Plans On A Finite Metric Space]
Let $\bar\gamma$ be the plan on $\{1,2,3\}\times\{1,2,3\}$ defined by
\begin{align*}
\bar\gamma_{12}=\frac12,\qquad \bar\gamma_{23}=\frac12,
\end{align*}
with all other entries equal to $0$. Its row sums are
\begin{align*}
\bar\gamma_{11}+\bar\gamma_{12}+\bar\gamma_{13}=0+\frac12+0=\frac12,
\end{align*}
\begin{align*}
\bar\gamma_{21}+\bar\gamma_{22}+\bar\gamma_{23}=0+0+\frac12=\frac12,
\end{align*}
and
\begin{align*}
\bar\gamma_{31}+\bar\gamma_{32}+\bar\gamma_{33}=0+0+0=0,
\end{align*}
so the first marginal is $\mu$. Its column sums are
\begin{align*}
\bar\gamma_{11}+\bar\gamma_{21}+\bar\gamma_{31}=0+0+0=0,
\end{align*}
\begin{align*}
\bar\gamma_{12}+\bar\gamma_{22}+\bar\gamma_{32}=\frac12+0+0=\frac12,
\end{align*}
and
\begin{align*}
\bar\gamma_{13}+\bar\gamma_{23}+\bar\gamma_{33}=0+\frac12+0=\frac12,
\end{align*}
so the second marginal is $\nu$. Hence $\bar\gamma\in\Pi(\mu,\nu)$.
Its cost is
\begin{align*}
\sum_{i=1}^3\sum_{j=1}^3 |i-j|\bar\gamma_{ij}=|1-2|\frac12+|2-3|\frac12=1\cdot\frac12+1\cdot\frac12=1.
\end{align*}
We now show that no admissible plan can have smaller cost. Let $\gamma\in\Pi(\mu,\nu)$. Since $\nu(\{1\})=0$, the first column sum satisfies
\begin{align*}
\gamma_{11}+\gamma_{21}+\gamma_{31}=0.
\end{align*}
All entries are nonnegative, so $\gamma_{11}=0$. Since the first row sum is $\mu(\{1\})=1/2$, we get
\begin{align*}
\gamma_{12}+\gamma_{13}=\frac12-\gamma_{11}=\frac12.
\end{align*}
Similarly, since $\mu(\{3\})=0$, the third row sum gives $\gamma_{33}=0$. Since the third column sum is $\nu(\{3\})=1/2$, we get
\begin{align*}
\gamma_{13}+\gamma_{23}=\frac12-\gamma_{33}=\frac12.
\end{align*}
For the relevant entries, the costs are
\begin{align*}
|1-2|=1,\qquad |1-3|=2,\qquad |2-3|=1.
\end{align*}
Therefore the total cost of $\gamma$ is at least
\begin{align*}
\gamma_{12}+2\gamma_{13}+\gamma_{23}.
\end{align*}
Regrouping this expression gives
\begin{align*}
\gamma_{12}+2\gamma_{13}+\gamma_{23}=(\gamma_{12}+\gamma_{13})+(\gamma_{13}+\gamma_{23}).
\end{align*}
Substituting the two marginal identities above yields
\begin{align*}
(\gamma_{12}+\gamma_{13})+(\gamma_{13}+\gamma_{23})=\frac12+\frac12=1.
\end{align*}
Thus every plan in $\Pi(\mu,\nu)$ has cost at least $1$, while $\bar\gamma$ has cost exactly $1$. Hence $\bar\gamma$ is optimal: it moves the left half of the mass one step to the right and the middle half one step to the right, exactly filling the target deficit at site $3$.
[/example]
Finite examples show how Kantorovich transport becomes a linear programme: minimize a linear functional over a polytope. To compare this relaxation with the original Monge problem, we need to translate each admissible map into an admissible plan. The translation is obtained by placing all mass on the graph of the map.
[definition: Graph Plan]
Let $T:X\to Y$ be a measurable map and let $\mu \in \mathcal P(X)$. The graph plan associated to $T$ is
\begin{align*}
\gamma_T := (\operatorname{id}_X,T)_{\#}\mu \in \mathcal P(X\times Y).
\end{align*}
[/definition]
A graph plan is admissible precisely when the map transports $\mu$ to $\nu$. This embeds the Monge problem into the Kantorovich problem, so relaxation can only decrease the infimum.
[quotetheorem:7458]
[citeproof:7458]
The hypotheses in this embedding are doing real work. Measurability of $T$ is needed so that $(\operatorname{id}_X,T)_{\#}\mu$ is a well-defined probability measure on $X\times Y$, and the condition $T_{\#}\mu=\nu$ is exactly the second marginal constraint. Without this pushforward condition the graph measure is still concentrated on the graph of $T$, but it is a plan from $\mu$ to $T_{\#}\mu$, not necessarily from $\mu$ to the prescribed target $\nu$.
The assumption that the two extended integrals are well-defined is separate from admissibility. It rules out the indeterminate situation in which the positive and negative parts of $c(x,T(x))$ both have infinite integral, so that the expression $\int c\,d\gamma_T$ has no variational meaning. When $c$ is nonnegative, or more generally bounded from below, this issue disappears because the negative part is controlled.
The converse is false in general: most plans are not concentrated on the graph of a map. This failure is exactly what makes the relaxation useful when a source atom must be split.
[example: A Relaxed Plan When No Monge Map Exists]
Let $X=Y=\mathbb R$, let $\mu=\delta_0$, and define
\begin{align*}
\nu=\frac12\delta_{-1}+\frac12\delta_1.
\end{align*}
We first show that no measurable map $T:\mathbb R\to\mathbb R$ can satisfy $T_{\#}\mu=\nu$. For any Borel set $B\subset\mathbb R$, the pushforward definition gives
\begin{align*}
T_{\#}\delta_0(B)=\delta_0(T^{-1}(B)).
\end{align*}
Since $\delta_0(A)=1$ exactly when $0\in A$ and $\delta_0(A)=0$ exactly when $0\notin A$, this becomes
\begin{align*}
T_{\#}\delta_0(B)=1 \quad \text{if } T(0)\in B,
\end{align*}
and
\begin{align*}
T_{\#}\delta_0(B)=0 \quad \text{if } T(0)\notin B.
\end{align*}
Thus $T_{\#}\delta_0=\delta_{T(0)}$, a single Dirac mass. But $\nu(\{-1\})=1/2$ and $\nu(\{1\})=1/2$, so $\nu$ is not equal to any single Dirac mass. Hence no Monge map transports $\mu$ to $\nu$.
Now define
\begin{align*}
\gamma=\frac12\delta_{(0,-1)}+\frac12\delta_{(0,1)}.
\end{align*}
For any Borel set $A\subset\mathbb R$,
\begin{align*}
\gamma(A\times\mathbb R)=\frac12\delta_{(0,-1)}(A\times\mathbb R)+\frac12\delta_{(0,1)}(A\times\mathbb R).
\end{align*}
The point $(0,-1)$ belongs to $A\times\mathbb R$ exactly when $0\in A$, and the same is true of $(0,1)$, so
\begin{align*}
\gamma(A\times\mathbb R)=\frac12\delta_0(A)+\frac12\delta_0(A).
\end{align*}
Adding the coefficients gives
\begin{align*}
\gamma(A\times\mathbb R)=\delta_0(A)=\mu(A).
\end{align*}
For any Borel set $B\subset\mathbb R$,
\begin{align*}
\gamma(\mathbb R\times B)=\frac12\delta_{(0,-1)}(\mathbb R\times B)+\frac12\delta_{(0,1)}(\mathbb R\times B).
\end{align*}
Here $(0,-1)\in\mathbb R\times B$ exactly when $-1\in B$, and $(0,1)\in\mathbb R\times B$ exactly when $1\in B$, hence
\begin{align*}
\gamma(\mathbb R\times B)=\frac12\delta_{-1}(B)+\frac12\delta_1(B)=\nu(B).
\end{align*}
Therefore $\gamma\in\Pi(\mu,\nu)$.
For the cost $c(x,y)=|x-y|^2$, the integral against this two-point plan is
\begin{align*}
\int_{\mathbb R\times\mathbb R}|x-y|^2\,d\gamma(x,y)=\frac12|0-(-1)|^2+\frac12|0-1|^2.
\end{align*}
Since $0-(-1)=1$ and $0-1=-1$, this is
\begin{align*}
\frac12|1|^2+\frac12|-1|^2=\frac12\cdot 1+\frac12\cdot 1.
\end{align*}
Adding the two terms gives
\begin{align*}
\int_{\mathbb R\times\mathbb R}|x-y|^2\,d\gamma(x,y)=1.
\end{align*}
Thus the relaxed plan splits the atom at $0$ equally between $-1$ and $1$, producing an admissible Kantorovich plan even though no deterministic transport map exists.
[/example]
Returning to the atom-splitting obstruction from Chapter 1, this example is the canonical reason for Kantorovich's enlargement. The relaxed admissible class is not an approximation to the Monge class; it contains genuine transport mechanisms that deterministic maps cannot express.
## Compactness Of Transport Plans
The direct method of the [calculus of variations](/page/Calculus%20of%20Variations) has two steps: extract a convergent subsequence from a minimizing sequence, then pass to the limit in the functional. For transport plans, compactness is supplied by tightness and [Prokhorov's theorem](/theorems/1172).
[definition: Tight Family Of Probability Measures]
Let $Z$ be a topological space and let $\mathcal C \subset \mathcal P(Z)$. The family $\mathcal C$ is tight if for every $\varepsilon>0$ there exists a compact set $K\subset Z$ such that
\begin{align*}
\alpha(Z\setminus K)<\varepsilon \quad \text{for all } \alpha\in\mathcal C.
\end{align*}
[/definition]
Tightness says that mass cannot escape to infinity uniformly over the family. For a single probability measure on a Polish space this is automatic, but for a family it is a genuine compactness condition. We need a theorem converting this mass-control condition into [weak sequential compactness](/theorems/214), because minimizing sequences of plans will be controlled through their marginals rather than through pointwise formulas.
[quotetheorem:7459]
[citeproof:7459]
Prokhorov's theorem is quoted here in the form needed for optimal transport. The Polish hypothesis is not cosmetic: separability and completeness provide regular probability measures and countable test families, which are what make tightness equivalent to weak relative compactness. On more general topological spaces, tightness may fail to identify weak limit points in the same useful way. For instance, on a nonmetrizable space the condition that every measure places most of its mass on some compact set need not provide a countable diagonal extraction argument for bounded continuous test functions, so tightness alone may not produce [sequential compactness](/page/Sequential%20Compactness) of the family. The theorem also gives only relative compactness; it does not say that a particular admissible family is closed, so the next step is to verify both tightness and closedness for $\Pi(\mu,\nu)$.
[quotetheorem:7460]
[citeproof:7460]
Tightness gives relative compactness, but the limit of couplings must still be a coupling. The fixed marginals are what made the previous tightness argument possible: without them, a sequence such as $\delta_{(n,0)}$ on $\mathbb R^2$ has no uniform compact set capturing most of its mass. Compactness of $K_X\times K_Y$ also uses the topological setting in which finite products of compact sets are compact. Even after tightness is known, a minimizing sequence could converge to a measure with the wrong source or target marginal unless the marginal constraints are closed. The next result rules out that loss of admissibility by testing weak convergence against functions depending on only one coordinate.
[quotetheorem:7461]
[citeproof:7461]
We now have the two pieces needed for compactness of the admissible class. Tightness prevents escape of mass from $X\times Y$, and weak closedness preserves the prescribed marginals in the limit. Both ingredients are necessary: tightness alone would allow weak limits with the wrong marginals if the constraints were not closed, while closedness alone would not stop mass from escaping to infinity. Combining them gives the compactness statement used in the existence theorem.
[quotetheorem:7462]
[citeproof:7462]
Compactness alone does not minimize anything. This compactness statement depends on the fixed marginals and on the Polish-space form of Prokhorov's theorem; it should not be read as saying that arbitrary families of probability measures on $X\times Y$ are compact. It also says nothing about the objective function. We still need the cost functional to behave well under weak limits of plans.
## Lower Semicontinuity And Existence
A minimizing sequence may converge weakly to a plan that rearranges mass in a limiting way. To pass the cost to the limit, the correct hypothesis is lower semicontinuity of $c$, not continuity. This allows costs such as hard constraints, where $c(x,y)=\infty$ outside an admissible region.
[definition: Lower Semicontinuous Cost]
Let $X$ and $Y$ be topological spaces. A function $c:X\times Y\to (-\infty,\infty]$ is lower semicontinuous if, for every $a\in\mathbb R$, the strict superlevel set
\begin{align*}
\{(x,y)\in X\times Y : c(x,y)> a\}
\end{align*}
is open.
[/definition]
Equivalently, $c$ is lower semicontinuous exactly when every net, and every sequence in first-countable spaces, satisfies the lower bound $c(z)\le \liminf c(z_n)$ whenever $z_n\to z$. For extended-real costs this convention allows the value $+\infty$ and is the one used in the Portmanteau lower bound below.
Lower semicontinuity means that downward jumps are forbidden. Under weak convergence of measures, this is exactly the condition that prevents the limiting plan from having lower cost than the limit inferior of the approximating costs. The following Portmanteau form is the analytic bridge from topology of functions to lower semicontinuity of the integral functional.
[quotetheorem:7463]
[citeproof:7463]
This theorem is the lower semicontinuity half of the direct method. Lower semicontinuity is needed because weak convergence cannot control pointwise downward jumps of the cost; a merely measurable cost may assign a smaller value to the limiting plan than to all approximating plans. A concrete failure occurs on $Z=\mathbb R$ with $\alpha_n=\delta_{1/n}$, $\alpha=\delta_0$, and $f=\mathbb{1}_{\mathbb R\setminus\{0\}}$. Then $\alpha_n\rightharpoonup\alpha$, but
\begin{align*}
\int_{\mathbb R} f\,d\alpha = 0
\quad\text{while}\quad
\liminf_{n\to\infty}\int_{\mathbb R} f\,d\alpha_n = 1,
\end{align*}
so the Portmanteau lower bound fails for this discontinuous cost. The bounded-below hypothesis has a different role: it prevents the negative part from escaping and makes the extended integral compatible with the liminf argument. For example, on $Z=\mathbb R$ take
\begin{align*}
\alpha_n=\left(1-\frac1n\right)\delta_0+\frac1n\delta_{n^2},
\end{align*}
so that $\alpha_n\rightharpoonup\delta_0$, and let $f(x)=-x$. Then $f$ is continuous, hence lower semicontinuous, but
\begin{align*}
\int_{\mathbb R} f\,d\delta_0 = 0
\quad\text{while}\quad
\liminf_{n\to\infty}\int_{\mathbb R} f\,d\alpha_n = -\infty.
\end{align*}
This shows the negative-part failure that boundedness from below rules out in the direct method. Applying the theorem on $Z=X\times Y$ with $f=c$ gives lower semicontinuity of the Kantorovich objective on the weakly compact set $\Pi(\mu,\nu)$.
[quotetheorem:7464]
[citeproof:7464]
The finite-cost assumption excludes the case in which every admissible plan has infinite cost, where the statement that an optimizer exists would carry no useful variational information. Each remaining hypothesis has a distinct role: compactness of $\Pi(\mu,\nu)$ supplies a weakly convergent subsequence, lower semicontinuity prevents cost from dropping incorrectly in the limit, and boundedness from below rules out minimizing sequences whose negative parts run away. If the cost is not lower semicontinuous, the infimum can be approached by plans concentrating near a cheap discontinuity without being attained; if compactness or tightness fails, minimizing mass may escape to infinity. The theorem also proves only existence of at least one optimal plan. It does not imply uniqueness, and it does not imply that the optimizer is induced by a Monge map.
This is the same direct-method pattern that appears in variational problems for Sobolev spaces and weak solutions of PDEs: compactness produces a candidate, and lower semicontinuity passes the inequality to the limit. In optimal transport the compactness is measure-theoretic rather than Sobolev compactness, but the logical structure is identical.
[example: Existence For Squared Distance Cost]
Let $X=Y=\mathbb R^n$, let $\mu,\nu\in\mathcal P(\mathbb R^n)$ have finite second moments, and take $c(x,y)=|x-y|^2$. The map $(x,y)\mapsto x-y$ is continuous, the Euclidean norm is continuous, and $r\mapsto r^2$ is continuous on $[0,\infty)$, so $c$ is continuous. Also $|x-y|^2\ge 0$, so $c$ is bounded from below.
We first verify that there is an admissible plan with finite cost. The product measure $\mu\otimes\nu$ has marginals $\mu$ and $\nu$: for every Borel set $A\subset\mathbb R^n$,
\begin{align*}
(\mu\otimes\nu)(A\times\mathbb R^n)=\mu(A)\nu(\mathbb R^n)=\mu(A)\cdot 1=\mu(A),
\end{align*}
and for every Borel set $B\subset\mathbb R^n$,
\begin{align*}
(\mu\otimes\nu)(\mathbb R^n\times B)=\mu(\mathbb R^n)\nu(B)=1\cdot\nu(B)=\nu(B).
\end{align*}
Thus $\mu\otimes\nu\in\Pi(\mu,\nu)$.
For all $x,y\in\mathbb R^n$, the triangle inequality gives
\begin{align*}
|x-y|\le |x|+|y|.
\end{align*}
Squaring both sides gives
\begin{align*}
|x-y|^2\le (|x|+|y|)^2.
\end{align*}
Expanding the square gives
\begin{align*}
(|x|+|y|)^2=|x|^2+2|x||y|+|y|^2.
\end{align*}
Since $(|x|-|y|)^2\ge 0$, we have
\begin{align*}
|x|^2-2|x||y|+|y|^2\ge 0.
\end{align*}
Rearranging gives
\begin{align*}
2|x||y|\le |x|^2+|y|^2.
\end{align*}
Substituting this bound into the expanded square yields
\begin{align*}
(|x|+|y|)^2\le |x|^2+(|x|^2+|y|^2)+|y|^2=2|x|^2+2|y|^2.
\end{align*}
Therefore
\begin{align*}
|x-y|^2\le 2|x|^2+2|y|^2.
\end{align*}
Integrating with respect to $\mu\otimes\nu$ gives
\begin{align*}
\int_{\mathbb R^n\times\mathbb R^n}|x-y|^2\,d(\mu\otimes\nu)(x,y)\le \int_{\mathbb R^n\times\mathbb R^n}(2|x|^2+2|y|^2)\,d(\mu\otimes\nu)(x,y).
\end{align*}
By the defining property of product measure, the right-hand side separates as
\begin{align*}
\int_{\mathbb R^n\times\mathbb R^n}(2|x|^2+2|y|^2)\,d(\mu\otimes\nu)(x,y)=2\int_{\mathbb R^n}|x|^2\,d\mu(x)+2\int_{\mathbb R^n}|y|^2\,d\nu(y).
\end{align*}
Both terms are finite by the finite-second-moment assumption, so $\mu\otimes\nu$ has finite quadratic cost. The preceding *[Existence Of Optimal Kantorovich Plans](/theorems/7464)* theorem therefore gives an optimal plan for the quadratic Kantorovich problem. Later chapters study when this optimal plan is induced by a map and why convexity enters the answer.
[/example]
Kantorovich relaxation therefore changes the landscape of the problem. The admissible class becomes convex and compact under weak convergence, the objective is lower semicontinuous under natural hypotheses, and existence follows by the direct method. The remaining foundational tasks are taken up next: Chapter 3 studies the finite linear-programming model, Chapter 4 proves Kantorovich duality, and Chapter 5 turns optimality into a support condition through cyclical monotonicity.
Passing from maps to plans restores existence, but it also shifts the problem into a more flexible and more tractable framework. In the finite setting, that same framework becomes linear programming, where transport is encoded by matrices and solved with polyhedral methods.
# 3. Discrete Optimal Transport and Linear Programming
The previous chapter replaced Monge maps by couplings and proved that optimal plans exist under compactness and lower semicontinuity hypotheses. This chapter studies the finite case, where the same objects become matrices and the Kantorovich problem becomes a linear program. The finite model is not only computationally useful: it reveals the polyhedral structure behind duality, [complementary slackness](/theorems/2559), and special integral solutions such as assignments.
The guiding question is how much of optimal transport can already be seen in a finite table of masses and costs. We shall see that transport plans form a convex polytope, that optimal plans occur at extreme points, and that the dual variables are potentials constrained by the cost matrix.
## Transport Polytopes and the Primal Program
Suppose the source and target spaces have finitely many points. A transport plan is then a non-negative matrix whose row sums and column sums match the prescribed masses. Before defining the feasible matrices, we need a compact notation for the two lists of masses whose equality of total mass makes transport possible.
[definition: Finite Probability Vectors]
Let $m,n \in \mathbb N$. A finite source probability vector is $a=(a_1,\dots,a_m) \in [0,1]^m$ with $\sum_{i=1}^m a_i=1$. A finite target probability vector is $b=(b_1,\dots,b_n) \in [0,1]^n$ with $\sum_{j=1}^n b_j=1$.
[/definition]
The entries $a_i$ and $b_j$ are the masses at the source point $x_i$ and target point $y_j$. Once these masses are fixed, the next problem is to describe all possible ways of splitting each source mass among the target points while recovering the target masses exactly.
[definition: Transport Polytope]
For finite probability vectors $a \in [0,1]^m$ and $b \in [0,1]^n$, the transport polytope is
\begin{align*}
\Pi(a,b)=\left\{P=(p_{ij})\in [0,\infty)^{m\times n}: \sum_{j=1}^n p_{ij}=a_i \text{ for all } i,\; \sum_{i=1}^m p_{ij}=b_j \text{ for all } j\right\}.
\end{align*}
[/definition]
Thus $p_{ij}$ records the amount of mass sent from $x_i$ to $y_j$. The row constraints impose the source marginal, while the column constraints impose the target marginal. A first example shows that finite transport already allows splitting, which was the key relaxation in the preceding chapter.
[example: A Simple Transport Polytope]
Let $a=(1/2,1/2)$ and $b=(1/3,1/3,1/3)$. Consider the non-negative matrix $P=(p_{ij})$ whose entries are
\begin{align*}
p_{11}=1/3,\quad p_{12}=1/6,\quad p_{13}=0,\quad p_{21}=0,\quad p_{22}=1/6,\quad p_{23}=1/3.
\end{align*}
We verify that $P\in\Pi(a,b)$ by checking each row and column sum. The first row has total mass
\begin{align*}
p_{11}+p_{12}+p_{13}=1/3+1/6+0=2/6+1/6=3/6=1/2.
\end{align*}
The second row has total mass
\begin{align*}
p_{21}+p_{22}+p_{23}=0+1/6+1/3=1/6+2/6=3/6=1/2.
\end{align*}
Thus the row sums are exactly $a=(1/2,1/2)$. The first column has total mass
\begin{align*}
p_{11}+p_{21}=1/3+0=1/3.
\end{align*}
The second column has total mass
\begin{align*}
p_{12}+p_{22}=1/6+1/6=2/6=1/3.
\end{align*}
The third column has total mass
\begin{align*}
p_{13}+p_{23}=0+1/3=1/3.
\end{align*}
Thus the column sums are exactly $b=(1/3,1/3,1/3)$, so $P$ is a feasible transport plan. The first source atom splits its mass between targets $1$ and $2$, while the second source atom splits its mass between targets $2$ and $3$, showing explicitly how the finite coupling model allows mass splitting.
[/example]
The feasible set only records which movements conserve mass. To turn the feasibility problem into an optimisation problem, we must assign a price to every possible edge from a source point to a target point.
[definition: Cost Matrix]
A cost matrix is a matrix $C=(c_{ij})\in \mathbb R^{m\times n}$. Its entry $c_{ij}$ is the cost of transporting one unit of mass from source point $x_i$ to target point $y_j$.
[/definition]
With this notation, the Kantorovich objective becomes a linear functional on matrices. The finite primal problem asks for the cheapest feasible matrix, so it is the first place where optimal transport and linear programming meet directly.
[definition: Finite Kantorovich Primal Problem]
Given $a\in [0,1]^m$, $b\in [0,1]^n$, and $C\in \mathbb R^{m\times n}$, the finite Kantorovich primal problem is
\begin{align*}
\min_{P\in \Pi(a,b)} \sum_{i=1}^m\sum_{j=1}^n c_{ij}p_{ij}.
\end{align*}
[/definition]
The feasible set is cut out by finitely many affine equalities and inequalities. The first structural question is whether a minimiser can fail to exist because the infimum is approached only at the boundary or at infinity.
[quotetheorem:7465]
[citeproof:7465]
This proof is the finite analogue of Chapter 2's compactness argument for $\Pi(\mu,\nu)$: compactness of the admissible set and continuity of the objective give existence. The hypotheses matter in two separate ways: non-empty marginals with equal total mass give at least one feasible plan, and boundedness prevents minimising sequences from escaping. In a general linear program without compact feasible set, for instance minimising $x$ over $x>0$, an infimum need not be attained; the transport polytope avoids this failure because it is closed and bounded. The theorem is an existence statement rather than an algorithm, so the next step is to understand where inside the polytope an optimum can be found.
## Extreme Points and Basic Feasible Solutions
Linear objectives over polytopes are controlled by the geometry of faces and vertices. The question now is what the vertices of $\Pi(a,b)$ look like and why this matters for computation.
[definition: Extreme Point]
Let $K\subset \mathbb R^N$ be convex. A point $z\in K$ is an extreme point of $K$ if whenever $z=t z_1+(1-t)z_2$ with $z_1,z_2\in K$ and $0<t<1$, then $z_1=z_2=z$.
[/definition]
Extreme points are the finite-dimensional shadows of indivisible choices in a convex feasible region. Since the primal objective is linear, we now need the standard fact that searching the whole polytope can be reduced to searching its vertices.
[quotetheorem:7466]
[citeproof:7466]
For transport, this theorem says that an optimal plan may be chosen among the vertices of $\Pi(a,b)$. Compactness is again essential: a linear functional on an unbounded polyhedron may have no minimum, and on an open feasible set it may approach its infimum without attaining it. The conclusion also does not say that every optimum is a vertex; if the cost is constant on a face, every point of that face is optimal. What it gives is a reduction principle, and the next issue is how to recognise the relevant vertices from the pattern of positive entries in the matrix.
[definition: Support Graph of a Transport Plan]
Let $P\in \Pi(a,b)$. The support graph of $P$ is the bipartite graph with left vertices $1,\dots,m$, right vertices $1,\dots,n$, and an edge $(i,j)$ whenever $p_{ij}>0$.
[/definition]
The support graph records which transport routes are used. We need a criterion for when this pattern can be perturbed while keeping the same row and column sums. Cycles give exactly the perturbation mechanism, so the next theorem turns graph structure into a test for non-extremality.
[quotetheorem:7467]
[citeproof:7467]
The hypotheses are doing real work here. If the support contains a cycle, the alternating perturbation gives two different feasible plans with the same marginals, so a cyclic positive pattern cannot be a vertex. If the support is acyclic but has fewer than $m+n-1$ positive entries, it may be degenerate or disconnected rather than a nondegenerate basis; for example, zero marginals can create unused rows or columns that destroy the spanning-tree picture. This graph criterion is valuable because it turns a linear-algebraic question about bases into a combinatorial question about cycles, which is exactly what the small computation below makes visible.
[example: Computing a Two-by-Three Transport Problem]
Let $a=(1/2,1/2)$, $b=(1/3,1/3,1/3)$, and let the cost entries be
\begin{align*}c_{11}=0,\quad c_{12}=2,\quad c_{13}=3,\quad c_{21}=1,\quad c_{22}=0,\quad c_{23}=2.\end{align*}
Set $p_{11}=s$ and $p_{12}=t$. The first row constraint gives
\begin{align*}p_{13}=1/2-p_{11}-p_{12}=1/2-s-t.\end{align*}
The first two column constraints give
\begin{align*}p_{21}=1/3-p_{11}=1/3-s.\end{align*}
\begin{align*}p_{22}=1/3-p_{12}=1/3-t.\end{align*}
Finally, the second row constraint gives
\begin{align*}p_{23}=1/2-p_{21}-p_{22}=1/2-(1/3-s)-(1/3-t)=s+t-1/6.\end{align*}
Non-negativity of the six entries is therefore equivalent to
\begin{align*}s\ge 0,\quad t\ge 0,\quad s+t\le 1/2,\quad s\le 1/3,\quad t\le 1/3,\quad s+t\ge 1/6.\end{align*}
Thus the feasible set in the $(s,t)$-plane is the polygon cut out by these six inequalities. Substituting the determined entries into the cost gives
\begin{align*}\sum_{i,j}c_{ij}p_{ij}=0\cdot s+2t+3(1/2-s-t)+(1/3-s)+0\cdot(1/3-t)+2(s+t-1/6).\end{align*}
Collecting constant, $s$, and $t$ terms,
\begin{align*}\sum_{i,j}c_{ij}p_{ij}=3/2-3s-3t+2t+1/3-s+2s+2t-1/3=3/2-2s+t.\end{align*}
The vertices of the feasible polygon are obtained by intersecting pairs of active boundary lines:
\begin{align*}(0,1/6),\quad (0,1/3),\quad (1/6,1/3),\quad (1/3,1/6),\quad (1/3,0),\quad (1/6,0).\end{align*}
Evaluating $3/2-2s+t$ at these points gives
\begin{align*}5/3,\quad 11/6,\quad 3/2,\quad 1,\quad 5/6,\quad 7/6.\end{align*}
The smallest value is $5/6$, attained at $(s,t)=(1/3,0)$. Substituting $s=1/3$ and $t=0$ into the formulas for the entries gives
\begin{align*}p^*_{11}=1/3,\quad p^*_{12}=0,\quad p^*_{13}=1/2-1/3-0=1/6.\end{align*}
\begin{align*}p^*_{21}=1/3-1/3=0,\quad p^*_{22}=1/3-0=1/3,\quad p^*_{23}=1/3+0-1/6=1/6.\end{align*}
So the optimal plan has total cost $5/6$. This calculation illustrates how the finite problem reduces to solving a small linear program and also shows why checking feasibility alone does not identify the cheapest splitting pattern.
[/example]
The example has fractional entries because the marginals were probabilities. If the supplies and demands are integer amounts, a natural question is whether linear programming can return fractional flows even though the input data are integral.
[quotetheorem:5800]
[citeproof:5800/optimal-transport-i-foundations]
This theorem explains why discrete transport with integer supplies and demands often produces combinatorial objects. The integer hypothesis cannot simply be dropped: probability marginals such as $(1/2,1/2)$ force fractional entries in many feasible plans. The conclusion is also about basic feasible solutions, not about every point of the feasible polytope, since convex combinations of integral vertices can be fractional. The assignment problem is the most important special case, but the permutation matrices appear after scaling the transport variables to have row and column sums equal to $1$.
[example: Assignment Problem as Transport]
Let $m=n$ and set $a_i=b_j=1/n$ for all $i,j$. For a plan $P=(p_{ij})\in\Pi(a,b)$, define $Q=(q_{ij})$ by $q_{ij}=n p_{ij}$. Since $p_{ij}\ge 0$, we have $q_{ij}\ge 0$, and for each row $i$,
\begin{align*}
\sum_{j=1}^n q_{ij}=\sum_{j=1}^n n p_{ij}=n\sum_{j=1}^n p_{ij}=n\cdot \frac1n=1.
\end{align*}
For each column $j$,
\begin{align*}
\sum_{i=1}^n q_{ij}=\sum_{i=1}^n n p_{ij}=n\sum_{i=1}^n p_{ij}=n\cdot \frac1n=1.
\end{align*}
Conversely, if $Q=(q_{ij})$ has non-negative entries and all row and column sums equal to $1$, then $p_{ij}=q_{ij}/n$ satisfies
\begin{align*}
\sum_{j=1}^n p_{ij}=\frac1n\sum_{j=1}^n q_{ij}=\frac1n
\end{align*}
and
\begin{align*}
\sum_{i=1}^n p_{ij}=\frac1n\sum_{i=1}^n q_{ij}=\frac1n.
\end{align*}
Thus $P\in\Pi(a,b)$ exactly when $Q=nP$ is a non-negative matrix with all row and column sums equal to $1$.
Under this change of variables, the objective becomes
\begin{align*}
\sum_{i=1}^n\sum_{j=1}^n c_{ij}p_{ij}=\sum_{i=1}^n\sum_{j=1}^n c_{ij}\frac{q_{ij}}{n}=\frac1n\sum_{i=1}^n\sum_{j=1}^n c_{ij}q_{ij}.
\end{align*}
Multiplication by the positive constant $1/n$ does not change which matrices minimize the objective, so the transport problem is equivalent to minimizing $\sum_{i,j}c_{ij}q_{ij}$ over non-negative matrices with row and column sums equal to $1$.
The right-hand side has integer marginals, all equal to $1$. By *Integrality for Integer Marginals*, an optimal basic feasible solution can be chosen with integer entries. Since each $q_{ij}$ is a non-negative integer and each row sums to $1$, every row contains exactly one entry equal to $1$ and all other entries equal to $0$. The same column-sum condition forces each column to contain exactly one entry equal to $1$. Therefore $Q$ is a permutation matrix, and the assignment problem is exactly the problem of choosing the cheapest bijection between the $n$ sources and the $n$ targets.
[/example]
## Finite Kantorovich Duality
The primal problem chooses flows $p_{ij}$. The dual problem assigns prices to the marginal constraints. The central question is which price systems certify that no transport plan can have cost below a given value.
Suppose $u=(u_1,\dots,u_m)\in \mathbb R^m$ and $v=(v_1,\dots,v_n)\in \mathbb R^n$. If $u_i+v_j\le c_{ij}$ for every pair $(i,j)$, then for any $P\in \Pi(a,b)$,
\begin{align*}
\sum_{i=1}^m a_i u_i+\sum_{j=1}^n b_j v_j
&=\sum_{i=1}^m\sum_{j=1}^n (u_i+v_j)p_{ij}
\end{align*}
and hence
\begin{align*}
\sum_{i=1}^m a_i u_i+\sum_{j=1}^n b_j v_j
&\le \sum_{i=1}^m\sum_{j=1}^n c_{ij}p_{ij}.
\end{align*}
This inequality is [weak duality](/theorems/2549). It motivates the dual problem: maximise this certified lower bound over all feasible potentials.
[definition: Finite Kantorovich Dual Problem]
Given $a\in [0,1]^m$, $b\in [0,1]^n$, and $C\in \mathbb R^{m\times n}$, the finite Kantorovich dual problem is
\begin{align*}
\max_{u\in \mathbb R^m,\,v\in \mathbb R^n}\left\{\sum_{i=1}^m a_i u_i+\sum_{j=1}^n b_j v_j : u_i+v_j\le c_{ij}\text{ for all } i,j\right\}.
\end{align*}
[/definition]
The variables $u_i$ and $v_j$ are finite Kantorovich potentials. They measure how much value can be assigned to the source and target marginals without exceeding the transport cost on any edge. The main duality question is whether the best lower bound produced in this way reaches the primal minimum.
Finite-dimensional linear programming duality gives equality between the finite Kantorovich primal minimum and the finite dual maximum:
\begin{align*}
\min_{P\in\Pi(a,b)}\sum_{i,j}c_{ij}P_{ij}
=
\max\left\{\sum_i a_i u_i+\sum_j b_j v_j:
u\in\mathbb R^m,\ v\in\mathbb R^n,\ u_i+v_j\le c_{ij}\right\}.
\end{align*}
This finite duality statement is the finite prototype for the general Kantorovich duality theorem proved in Chapter 4. Feasibility of the primal problem is essential: without a transport plan the primal value is not a finite minimum to compare with the dual. Finite-dimensionality also hides difficulties that reappear later, because in infinite spaces dual attainment can fail without compactness, tightness, or regularity assumptions. The inequality $u_i+v_j\le c_{ij}$ is the discrete version of the $c$-transform constraint, and examples of dual certificates show how optimality can be verified without searching all plans.
[example: Dual Certificate for the Two-by-Three Problem]
For the cost entries $c_{11}=0$, $c_{12}=2$, $c_{13}=3$, $c_{21}=1$, $c_{22}=0$, and $c_{23}=2$, take $u=(0,0)$ and $v=(0,0,2)$. We check the six dual constraints $u_i+v_j\le c_{ij}$ one by one:
\begin{align*}u_1+v_1=0+0=0=c_{11}.\end{align*}
\begin{align*}u_1+v_2=0+0=0\le 2=c_{12}.\end{align*}
\begin{align*}u_1+v_3=0+2=2\le 3=c_{13}.\end{align*}
\begin{align*}u_2+v_1=0+0=0\le 1=c_{21}.\end{align*}
\begin{align*}u_2+v_2=0+0=0=c_{22}.\end{align*}
\begin{align*}u_2+v_3=0+2=2=c_{23}.\end{align*}
Thus $(u,v)$ is dual feasible. Its dual value is
\begin{align*}(1/2)u_1+(1/2)u_2+(1/3)v_1+(1/3)v_2+(1/3)v_3=(1/2)0+(1/2)0+(1/3)0+(1/3)0+(1/3)2.\end{align*}
Hence
\begin{align*}(1/2)0+(1/2)0+(1/3)0+(1/3)0+(1/3)2=2/3.\end{align*}
Now let $P\in\Pi(a,b)$ be any feasible transport plan. Since $u_i+v_j\le c_{ij}$ and $p_{ij}\ge 0$ for every $i,j$, we have
\begin{align*}\sum_{i=1}^2\sum_{j=1}^3 (u_i+v_j)p_{ij}\le \sum_{i=1}^2\sum_{j=1}^3 c_{ij}p_{ij}.\end{align*}
Using the row and column constraints of $P$, the left-hand side equals the dual value:
\begin{align*}\sum_{i=1}^2\sum_{j=1}^3 (u_i+v_j)p_{ij}=\sum_{i=1}^2 u_i\sum_{j=1}^3 p_{ij}+\sum_{j=1}^3 v_j\sum_{i=1}^2 p_{ij}.\end{align*}
Since the row sums are $a=(1/2,1/2)$ and the column sums are $b=(1/3,1/3,1/3)$, this becomes
\begin{align*}\sum_{i=1}^2 u_i a_i+\sum_{j=1}^3 v_j b_j=2/3.\end{align*}
Therefore every feasible plan has cost at least $2/3$. This certificate is valid but not sharp, since the primal computation above found an optimal cost of $5/6$; sharper potentials must make the dual value reach $5/6$, and this happens by forcing equality on the edges used by an optimal plan.
[/example]
The dual certificate in the example is valid but not sharp. We need a criterion that detects sharpness by comparing the primal cost with the dual lower bound term by term. In the finite transport problem, if $P\in\Pi(a,b)$ is primal feasible and $(u,v)$ is dual feasible, then the weak-duality gap is
\begin{align*}
\sum_{i=1}^m\sum_{j=1}^n c_{ij}p_{ij}
-
\left(\sum_{i=1}^m a_i u_i+\sum_{j=1}^n b_j v_j\right)
=
\sum_{i=1}^m\sum_{j=1}^n \left(c_{ij}-u_i-v_j\right)p_{ij}.
\end{align*}
All summands on the right are non-negative. Therefore $P$ and $(u,v)$ are simultaneously optimal exactly when
\begin{align*}
\left(c_{ij}-u_i-v_j\right)p_{ij}=0
\quad\text{for every } i,j.
\end{align*}
This is the finite transport form of complementary slackness: any edge carrying mass must have zero slack in the dual inequality.
Thus mass can only travel along edges where the dual constraint is tight. The feasibility assumptions on both sides are necessary: without primal feasibility there is no plan to test, and without dual feasibility the slack terms need not be non-negative. The criterion is also not a uniqueness statement; several different plans may be supported on tight edges when the tight graph contains more than one feasible coupling. In later chapters this becomes the principle that optimal plans are concentrated on a $c$-cyclically monotone set.
[remark: Gauge Freedom of Potentials]
If $(u,v)$ is dual feasible, then $(u+\lambda\mathbf{1},v-\lambda\mathbf{1})$ is dual feasible for every $\lambda\in \mathbb R$ and has the same dual value, because $\sum_i a_i=\sum_j b_j=1$. A normalization such as $u_1=0$ removes this non-uniqueness without changing the transport problem.
[/remark]
The finite theory therefore gives three complementary views of optimal transport: a matrix minimisation problem, a polyhedral problem about vertices and bases, and a dual problem about potentials. These views will persist in the general theory, with compactness, convex analysis, and measurable selection replacing the finite-dimensional linear algebra.
The discrete theory provides the finite-dimensional model for everything that follows. Its primal-dual structure and vertex geometry reappear in the general setting, where compactness and convexity replace matrix algebra.
# 4. Kantorovich Duality in General Spaces
This chapter proves the Kantorovich duality theorem in the setting where optimal transport is most useful: probability measures on Polish spaces and costs that need not be finite everywhere. Chapter 2 established existence of optimal plans by compactness and lower semicontinuity, and Chapter 3 showed in finite spaces that the same minimisation problem is a linear program with a dual. We now identify the variational dual problem, explain why its admissible functions encode all lower bounds on transport cost, and extract the contact relation between optimal plans and optimal potentials.
The main new difficulty is that the finite-dimensional linear programming proof no longer applies directly. We replace it by an approximation argument: first prove duality for bounded continuous costs, then pass to lower semicontinuous costs by monotone approximation and tightness. The same formalism also introduces $c$-transforms and $c$-concavity, which are the right language for the geometry of optimal transport.
## Admissible Potentials and the Dual Problem
How can a pair of functions give a lower bound for every transport plan at once? The guiding observation is that if two functions satisfy a pointwise inequality against the cost, then integrating that inequality against any coupling gives a lower bound for the Kantorovich value.
[definition: Admissible Pair]
Let $X$ and $Y$ be Polish spaces, let $\mu\in \mathcal P(X)$ and $\nu\in \mathcal P(Y)$, and let $c:X\times Y\to (-\infty,\infty]$ be Borel measurable. A pair $(\varphi,\psi)$ of measurable functions $\varphi:X\to \mathbb R\cup\{-\infty\}$ and $\psi:Y\to \mathbb R\cup\{-\infty\}$ is admissible for $c$ if
\begin{align*}
\varphi(x)+\psi(y)\le c(x,y)
\end{align*}
for all $(x,y)\in X\times Y$.
[/definition]
Admissibility is imposed pointwise because the pair is meant to work before a particular coupling has been chosen. The next result turns that pointwise comparison into the basic inequality behind all duality: every admissible pair gives a certificate that no plan can cost less than its marginal integral.
Weak duality says that every integrable admissible pair gives a lower bound on every transport plan:
\begin{align*}
\int_X\varphi\,d\mu+\int_Y\psi\,d\nu
\le
\int_{X\times Y}c\,d\pi
\end{align*}
for all $\pi\in\Pi(\mu,\nu)$ whenever the displayed marginal integrals are well-defined.
The integrability assumption is part of the certificate, not a cosmetic restriction. Without it the expression $\int_X\varphi\,d\mu+\int_Y\psi\,d\nu$ may contain the undefined operation $\infty-\infty$, so the alleged lower bound has no numerical meaning. For instance, with $X=Y=\mathbb R$, $\mu=\nu$ a probability measure with infinite first moment, $c(x,y)=|x-y|$, and $\varphi(x)=x$, $\psi(y)=-y$, admissibility follows from $x-y\le |x-y|$, but both marginal integrals are not finite. Weak duality therefore does not assert that every pointwise admissible pair produces a usable lower bound; it asserts this only after the marginal integrals are well-defined.
Weak duality changes the transport problem into a search for the best possible certificate. Since a single admissible pair gives only one lower bound, we gather all such bounds into a supremum; the central theorem of the chapter will say that this supremum has no gap from the primal infimum.
[definition: Kantorovich Dual Value]
For a cost $c:X\times Y\to (-\infty,\infty]$, define the primal cost functional
\begin{align*}
C_c:\mathcal P(X)\times\mathcal P(Y)\to[-\infty,\infty],
\qquad
C_c(\mu,\nu):=\inf_{\gamma\in\Pi(\mu,\nu)}\int_{X\times Y}c\,d\gamma.
\end{align*}
The admissible-function class is
\begin{align*}
\mathcal A_c(\mu,\nu):=
\left\{
(\varphi,\psi)\in L^1(\mu)\times L^1(\nu):
\varphi(x)+\psi(y)\le c(x,y)\ \text{for all }(x,y)\in X\times Y
\right\}.
\end{align*}
The Kantorovich dual value is the extended-real map
\begin{align*}
D_c:\mathcal P(X)\times\mathcal P(Y)\to[-\infty,\infty]
\end{align*}
defined by
\begin{align*}
D_c(\mu,\nu)
:=
\sup\left\{
\int_X \varphi\,d\mu + \int_Y \psi\,d\nu
:
(\varphi,\psi)\in\mathcal A_c(\mu,\nu)
\right\}.
\end{align*}
[/definition]
The weak inequality says $D_c(\mu,\nu)\le C_c(\mu,\nu)$. Kantorovich duality is the assertion that, under the standard hypotheses, this lower bound is exact.
[example: Lipschitz Potentials for Distance Cost on $\mathbb R$]
Let $X=Y=\mathbb R$, let $\mu,\nu\in\mathcal P(\mathbb R)$ have finite first moments, and take $c(x,y)=|x-y|$. If $f:\mathbb R\to\mathbb R$ is $1$-Lipschitz, then for every $x,y\in\mathbb R$,
\begin{align*}
f(x)-f(y)\le |f(x)-f(y)|\le |x-y|.
\end{align*}
Thus the pair $(f,-f)$ satisfies
\begin{align*}
f(x)+(-f(y))\le |x-y|=c(x,y),
\end{align*}
so it is admissible.
The finite first moment assumption makes the two marginal integrals finite, because the Lipschitz condition gives
\begin{align*}
|f(x)|\le |f(0)|+|f(x)-f(0)|\le |f(0)|+|x|.
\end{align*}
Therefore $\int |f|\,d\mu<\infty$ and $\int |f|\,d\nu<\infty$. For any coupling $\gamma\in\Pi(\mu,\nu)$, integrating the admissibility inequality against $\gamma$ gives
\begin{align*}
\int_{\mathbb R\times\mathbb R}\bigl(f(x)-f(y)\bigr)\,d\gamma(x,y)\le \int_{\mathbb R\times\mathbb R}|x-y|\,d\gamma(x,y).
\end{align*}
Since the marginals of $\gamma$ are $\mu$ and $\nu$, the left-hand side equals
\begin{align*}
\int_{\mathbb R} f\,d\mu-\int_{\mathbb R} f\,d\nu.
\end{align*}
Hence
\begin{align*}
\int_{\mathbb R} f\,d\mu-\int_{\mathbb R} f\,d\nu\le \int_{\mathbb R\times\mathbb R}|x-y|\,d\gamma(x,y).
\end{align*}
Taking the infimum over all $\gamma\in\Pi(\mu,\nu)$ gives
\begin{align*}
\int_{\mathbb R} f\,d\mu-\int_{\mathbb R} f\,d\nu\le \inf_{\gamma\in\Pi(\mu,\nu)}\int_{\mathbb R\times\mathbb R}|x-y|\,d\gamma(x,y).
\end{align*}
Thus every $1$-Lipschitz function supplies a valid dual lower bound, and the full duality theorem identifies the supremum of these bounds with the $W_1$ transportation cost.
[/example]
## The $c$-Transform and $c$-Concavity
The admissibility condition couples two unknown functions, so it is natural to ask whether one of them determines the best possible choice of the other. The $c$-transform answers this question by pushing a function down until the admissibility inequality is saturated as much as possible.
[definition: C Transform]
Let $c:X\times Y\to (-\infty,\infty]$. For a function $\varphi:X\to \mathbb R\cup\{-\infty\}$, its $c$-transform is the function $\varphi^c:Y\to \mathbb R\cup\{-\infty,\infty\}$ defined by
\begin{align*}
\varphi^c(y):=\inf_{x\in X}\bigl(c(x,y)-\varphi(x)\bigr).
\end{align*}
For a function $\psi:Y\to \mathbb R\cup\{-\infty\}$, its $c$-transform is the function $\psi^c:X\to \mathbb R\cup\{-\infty,\infty\}$ defined by
\begin{align*}
\psi^c(x):=\inf_{y\in Y}\bigl(c(x,y)-\psi(y)\bigr).
\end{align*}
[/definition]
The transform is designed so that $(\varphi,\varphi^c)$ is admissible whenever the expression is well-defined in the extended real sense. The next result explains why this construction is not merely a source of examples: it improves any admissible pair by replacing either potential with the largest compatible partner.
[quotetheorem:7468]
[citeproof:7468]
The improvement theorem is a pointwise order statement, and its limitations matter in the dual problem. Replacing $\psi$ by $\varphi^c$ may destroy measurability unless the cost has enough regularity, and it may destroy integrability even when measurability is available. A pair with finite dual value can therefore be improved pointwise into a pair whose objective is not an admissible extended-real integral. The theorem also does not prove that a maximizer exists, nor that the dual value increases inside the chosen admissible class; those conclusions require later hypotheses such as lower semicontinuity, integrable growth bounds, and normalization.
The improvement theorem tells us that a dual optimizer, when it exists and remains inside the integrable class, should be replaceable by potentials obtained from the cost itself. This motivates isolating the class of functions that already have this form. The resulting definition is the cost-dependent analogue of concavity or Legendre-Fenchel closedness.
[definition: C Concave Function]
A function $\varphi:X\to \mathbb R\cup\{-\infty\}$ is $c$-concave if there exists a function $\psi:Y\to \mathbb R\cup\{-\infty\}$ such that
\begin{align*}
\varphi(x)=\inf_{y\in Y}\bigl(c(x,y)-\psi(y)\bigr)
\end{align*}
for all $x\in X$.
[/definition]
Thus a $c$-concave function is an infimum of translated cost functions. For the quadratic cost, this notion is a disguised form of ordinary convexity.
[example: Quadratic Cost Potentials from Convex Functions]
Let $X=Y=\mathbb R^n$ and $c(x,y)=\frac12|x-y|^2$. For $x,y\in\mathbb R^n$, expanding the square gives
\begin{align*}
\frac12|x-y|^2=\frac12(x-y)\cdot(x-y)=\frac12|x|^2-x\cdot y+\frac12|y|^2.
\end{align*}
Let $u:\mathbb R^n\to\mathbb R\cup\{\infty\}$ be convex and lower semicontinuous, and set
\begin{align*}
\varphi(x)=\frac12|x|^2-u(x).
\end{align*}
By the *[Fenchel-Moreau theorem](/theorems/6677)*,
\begin{align*}
u(x)=\sup_{y\in\mathbb R^n}\{x\cdot y-u^*(y)\}.
\end{align*}
Define
\begin{align*}
\psi(y)=\frac12|y|^2-u^*(y).
\end{align*}
Then for each fixed $x$,
\begin{align*}
\inf_{y\in\mathbb R^n}\left(\frac12|x-y|^2-\psi(y)\right)=\inf_{y\in\mathbb R^n}\left(\frac12|x|^2-x\cdot y+\frac12|y|^2-\frac12|y|^2+u^*(y)\right).
\end{align*}
Cancelling the two $\frac12|y|^2$ terms inside the infimum,
\begin{align*}
\inf_{y\in\mathbb R^n}\left(\frac12|x-y|^2-\psi(y)\right)=\frac12|x|^2+\inf_{y\in\mathbb R^n}\left(u^*(y)-x\cdot y\right).
\end{align*}
Since $\inf_y(u^*(y)-x\cdot y)=-\sup_y(x\cdot y-u^*(y))$, the Fenchel-Moreau representation gives
\begin{align*}
\inf_{y\in\mathbb R^n}\left(\frac12|x-y|^2-\psi(y)\right)=\frac12|x|^2-u(x)=\varphi(x).
\end{align*}
Thus $\varphi=\psi^c$, so $\varphi$ is $c$-concave. For the quadratic cost, $c$-concavity is therefore exactly convex duality after subtracting the quadratic term $\frac12|x|^2$.
[/example]
The example shows that $c$-concavity is not an artificial definition; it is the cost-adjusted convexity that later controls optimal maps. We also need an algebraic test for when repeated transformation has stabilized, and this is supplied by the involution property.
[quotetheorem:7469]
[citeproof:7469]
The equality $\varphi^{cc}=\varphi$ is therefore a closedness condition, not a universal identity. The operation $\varphi\mapsto\varphi^{cc}$ is analogous to taking a closed convex envelope in Fenchel duality, except that with the present sign convention it produces a $c$-concave envelope lying above the original potential. If $X=Y=\mathbb R^n$ and $c(x,y)=-x\cdot y$, then the $c$-transform is the negative Fenchel transform, and $\varphi^{cc}$ corresponds to the closed convex envelope in the sign-adjusted variables. A nonconvex or non-lower-semicontinuous input can therefore satisfy $\varphi<\varphi^{cc}$ at some points. The theorem does not say that the transformed potential is integrable, maximizes the dual problem, or has any uniqueness property; it only identifies the fixed points of this closure operation.
## Duality from Bounded Continuous Costs
The central question is now whether the supremum over admissible potentials equals the infimum over couplings. In compact or bounded continuous settings this is an analytic form of linear programming duality, and the general case is reached by approximation.
For compact metric spaces and bounded continuous costs, Kantorovich duality identifies the primal transport value with the supremum over continuous admissible potentials:
\begin{align*}
\inf_{\pi\in\Pi(\mu,\nu)}\int c\,d\pi
=
\sup_{\varphi(x)+\psi(y)\le c(x,y)}
\left(\int\varphi\,d\mu+\int\psi\,d\nu\right),
\end{align*}
where the supremum is taken over bounded continuous potentials.
Boundedness and continuity are the hypotheses that make this separation argument live in a well-behaved dual pair. Continuity ensures that $\gamma\mapsto\int c\,d\gamma$ is continuous under weak convergence of measures, while boundedness prevents integrability losses when couplings vary. If $c$ is merely measurable, weakly convergent couplings can change the value of $\int c\,d\gamma$ abruptly: for $c=\mathbb{1}_{(0,\infty)\times\{0\}}$ on $\mathbb R\times\mathbb R$, the measures $\delta_{(1/n,0)}$ converge weakly to $\delta_{(0,0)}$, but the costs drop from $1$ to $0$. If boundedness is removed, continuity alone does not make integration continuous in the [weak topology](/page/Weak%20Topology) on finite measures: the probability measures $\eta_n=(1-1/n)\delta_0+(1/n)\delta_n$ on $\mathbb R$ converge weakly to $\delta_0$, but for the continuous unbounded function $f(x)=|x|$ the integrals satisfy $\int f\,d\eta_n=1$ while $\int f\,d\delta_0=0$. This is why the bounded continuous theorem is used as a controlled intermediate statement rather than as the final transport theorem. It is the engine of the proof, but many costs in transport are only lower semicontinuous and may be unbounded, so the approximation argument must preserve both the value of the primal problem and enough admissibility in the dual problem.
For lower semicontinuous costs on Polish spaces, the same equality is obtained by approximating the cost from below by bounded continuous costs and passing to the limit:
\begin{align*}
\inf_{\pi\in\Pi(\mu,\nu)}\int c\,d\pi
=
\sup_{\varphi(x)+\psi(y)\le c(x,y)}
\left(\int\varphi\,d\mu+\int\psi\,d\nu\right),
\end{align*}
with the usual convention that only well-defined dual integrals contribute to the supremum.
This theorem also shows why lower semicontinuity is the natural hypothesis: it is exactly the condition allowing approximation from below while retaining compactness of minimizing sequences from the previous chapter. If lower semicontinuity is removed, the primal value need not behave well under weak limits. For example, on $X=Y=[0,1]$, a cost that is $0$ on an open [dense subset](/page/Dense%20Subset) of $X\times Y$ and $1$ on its complement can be Borel measurable but not lower semicontinuous; weak limits of plans supported mostly in the zero-cost region may charge points where the limiting cost jumps. The theorem also does not promise a dual maximizer. It gives equality between an infimum and a supremum; attainment of the supremum needs the separate compactness hypotheses discussed next.
[example: Truncating the Quadratic Cost]
Let $X=Y=\mathbb R^n$ and let $c(x,y)=|x-y|^2$. For each $k\in\mathbb N$, set
\begin{align*}
c_k(x,y)=\min\{|x-y|^2,k\}.
\end{align*}
Since $(x,y)\mapsto |x-y|^2$ is continuous and $r\mapsto \min\{r,k\}$ is continuous and bounded by $k$, each $c_k$ is bounded and continuous on $\mathbb R^n\times\mathbb R^n$.
The truncations increase pointwise to the quadratic cost. Indeed, for every $r\ge 0$,
\begin{align*}
\min\{r,k\}\le \min\{r,k+1\}\le r.
\end{align*}
If $k\ge r$, then $\min\{r,k\}=r$, so $\min\{r,k\}\uparrow r$. Applying this with $r=|x-y|^2$ gives
\begin{align*}
c_k(x,y)\uparrow c(x,y).
\end{align*}
Let
\begin{align*}
P_k=\inf_{\gamma\in\Pi(\mu,\nu)}\int_{\mathbb R^n\times\mathbb R^n} c_k\,d\gamma
\end{align*}
and
\begin{align*}
P=\inf_{\gamma\in\Pi(\mu,\nu)}\int_{\mathbb R^n\times\mathbb R^n} |x-y|^2\,d\gamma.
\end{align*}
Because $c_k\le c_{k+1}$, every coupling $\gamma$ satisfies
\begin{align*}
\int c_k\,d\gamma\le \int c_{k+1}\,d\gamma.
\end{align*}
Taking infima over $\gamma\in\Pi(\mu,\nu)$ gives $P_k\le P_{k+1}$. Also $c_k\le c$, so $P_k\le P$ for every $k$.
The monotone approximation argument for lower semicontinuous costs gives the reverse limit inequality: if $L=\sup_k P_k$, then tightness of $\Pi(\mu,\nu)$ gives a weak limit of almost-minimizers, and bounded continuity of each fixed $c_m$ gives
\begin{align*}
\int c_m\,d\gamma\le L.
\end{align*}
Letting $m\to\infty$ and using monotone convergence gives
\begin{align*}
\int |x-y|^2\,d\gamma\le L.
\end{align*}
Hence $P\le L$, and therefore
\begin{align*}
P_k\uparrow P.
\end{align*}
If $\mu$ and $\nu$ have finite second moments, the limiting value is finite. For the product coupling $\mu\otimes\nu$,
\begin{align*}
|x-y|^2\le (|x|+|y|)^2=|x|^2+2|x||y|+|y|^2.
\end{align*}
Using $2|x||y|\le |x|^2+|y|^2$, this gives
\begin{align*}
|x-y|^2\le 2|x|^2+2|y|^2.
\end{align*}
Therefore
\begin{align*}
P\le \int |x-y|^2\,d(\mu\otimes\nu)\le 2\int |x|^2\,d\mu+2\int |y|^2\,d\nu<\infty.
\end{align*}
Thus the truncated bounded-continuous dual problems approximate the quadratic transport problem from below, and finite second moments are exactly the integrability scale that makes quadratic-growth potentials have finite marginal integrals.
[/example]
## Dual Maximizers
Duality gives equality of values, but it does not automatically give a maximizing pair. The existence of dual maximizers requires a normalization and compactness argument for potentials, together with hypotheses preventing mass from escaping to infinite negative values.
[quotetheorem:7470]
[citeproof:7470]
This theorem is often used through more concrete versions, for example when $X=Y=\mathbb R^n$ and $c$ has at most polynomial growth controlled by finite moments of the marginals. The assumptions separate value duality from attainment. Polish lower semicontinuous duality may give $D_c(\mu,\nu)=C_c(\mu,\nu)$ even when no maximizing pair exists; attainment requires compactness of potentials, not just equality of values. Finiteness rules out costs that force the primal value to be infinite, where maximizing the dual is no longer a finite compactness problem. The upper bound $c\le a+b$ gives the [compactness theorem](/theorems/2748) an integrable scale after the potentials have been saturated and normalized; it should not be read as saying that every admissible pair has both signs bounded by $a$ and $b$. For $c(x,y)=|x-y|^2$ on $\mathbb R^n$, marginals without finite second moments can make the natural quadratic potentials nonintegrable, so a maximizing sequence can approach the value while losing any $L^1$ limit. A typical nonattainment mechanism is translation in the additive degree of freedom combined with growth at infinity: the values remain stable, but after every fixed normalization one marginal potential develops tails that are not uniformly integrable. Normalization is necessary because $(\varphi+\lambda,\psi-\lambda)$ has the same admissibility and value for every $\lambda\in\mathbb R$, so an unnormalized maximizing sequence can drift without converging even when the dual value is finite.
[remark: Additive Normalization]
If $(\varphi,\psi)$ is admissible, then $(\varphi+\lambda,\psi-\lambda)$ is admissible for every constant $\lambda\in\mathbb R$ and has the same dual value. Any compactness proof for dual maximizers must fix this degree of freedom. Common choices are $\varphi(x_0)=0$ when pointwise values are well-defined, or $\int_X\varphi\,d\mu=0$ when the potentials are handled in $L^1$.
[/remark]
## Complementary Slackness and Optimal Supports
Once the primal and dual optima exist, the next question is how they recognize each other. The answer is complementary slackness in transport form. If $\pi\in\Pi(\mu,\nu)$ is primal optimal and $(\varphi,\psi)$ is a dual maximizing admissible pair, then
\begin{align*}
\varphi(x)+\psi(y)=c(x,y)
\end{align*}
for $\pi$-almost every $(x,y)$. Conversely, if $\pi$ is primal feasible, $(\varphi,\psi)$ is dual feasible, the primal and dual integrals are finite, and the equality above holds $\pi$-almost everywhere, then $\pi$ and $(\varphi,\psi)$ are optimal. Indeed, integrating the pointwise inequality $\varphi(x)+\psi(y)\le c(x,y)$ against $\pi$ shows that the weak-duality gap is exactly the integral of the non-negative slack $c-\varphi-\psi$.
Both optimality assumptions in the theorem are doing work. A finite example already shows the point. Let $X=Y=\{0,1\}$, let both marginals be uniform, and set $c(x,y)=0$ if $x=y$ and $c(x,y)=1$ if $x\ne y$. The pair $\varphi=\psi=0$ is dual maximizing, and the diagonal coupling is primal optimal; its support lies in the contact set. The off-diagonal coupling with mass $1/2$ on $(0,1)$ and $1/2$ on $(1,0)$ has the same marginals but is not optimal, and it charges only points where the slack equals $1$. Thus a nonoptimal plan need not satisfy contact even when a dual maximizer exists. Conversely, for the one-point problem $X=Y=\{0\}$ with $c(0,0)=1$, the admissible pair $\varphi(0)=\psi(0)=0$ is not dual maximizing, and the unique primal optimal plan has positive slack at its whole support. These examples isolate the role of the two hypotheses before the infinite-dimensional complication: if the dual supremum is approached but not attained, a maximizing sequence gives only approximate contact sets, and their limiting geometry can lose the exact equality relation. Complementary slackness is therefore stronger than value duality: it converts equality of numbers into a pointwise support statement only after both sides have optimizers.
Complementary slackness translates value equality into geometry: optimal mass can move only along points where the dual certificate touches the cost. We therefore need a name for this locus, since later structure theorems study it directly before showing when it becomes the graph of a map. The following definition records the exact equality set associated to a chosen admissible pair, so that the support condition can be stated without repeating the equality relation each time.
[definition: Contact Set]
Let $X$ and $Y$ be sets, let $c:X\times Y\to[-\infty,\infty]$, and let $(\varphi,\psi)$ be an admissible pair with $\varphi:X\to[-\infty,\infty]$ and $\psi:Y\to[-\infty,\infty]$. The contact set of $(\varphi,\psi)$ is the subset $\Gamma_{\varphi,\psi}\subset X\times Y$ defined by
\begin{align*}
\Gamma_{\varphi,\psi}:=\bigl\{(x,y)\in X\times Y: \varphi(x)+\psi(y)=c(x,y)\bigr\}.
\end{align*}
[/definition]
Complementary slackness states that an optimal plan is concentrated on $\Gamma_{\varphi,\psi}$. In many later arguments, this contact set replaces the graph of a transport map until additional structure is proved.
[example: Contact Set for the Distance Cost]
For $c(x,y)=|x-y|$ on $\mathbb R$, let $(f,-f)$ be the admissible pair determined by a $1$-Lipschitz function $f$. The contact set consists exactly of the points where the admissibility inequality is sharp:
\begin{align*}
\Gamma_f=\{(x,y)\in\mathbb R^2:f(x)+(-f(y))=|x-y|\}.
\end{align*}
Equivalently,
\begin{align*}
\Gamma_f=\{(x,y)\in\mathbb R^2:f(x)-f(y)=|x-y|\}.
\end{align*}
Suppose $x>y$ and $(x,y)\in\Gamma_f$. Since $x>y$, the absolute value is
\begin{align*}
|x-y|=x-y.
\end{align*}
Thus the contact condition becomes
\begin{align*}
f(x)-f(y)=x-y.
\end{align*}
For any $z\in[y,x]$, the $1$-Lipschitz condition gives
\begin{align*}
f(x)-f(z)\le |x-z|=x-z.
\end{align*}
It also gives
\begin{align*}
f(z)-f(y)\le |z-y|=z-y.
\end{align*}
Adding these two inequalities yields
\begin{align*}
f(x)-f(y)\le (x-z)+(z-y)=x-y.
\end{align*}
But the endpoint equality already gives $f(x)-f(y)=x-y$, so both displayed Lipschitz inequalities must be equalities. Hence
\begin{align*}
f(x)-f(z)=x-z
\end{align*}
and
\begin{align*}
f(z)-f(y)=z-y.
\end{align*}
Therefore $f$ uses its full Lipschitz bound on every subinterval between $y$ and $x$.
If $\gamma$ is an optimal plan and $(f,-f)$ is a dual maximizing pair, then the slack
\begin{align*}
|x-y|-f(x)+f(y)
\end{align*}
is nonnegative and has $\gamma$-integral equal to $0$. Hence $\gamma$ can charge only points of $\Gamma_f$: optimal transport may move mass only along pairs where the chosen Kantorovich potential spends its entire Lipschitz budget.
[/example]
For quadratic cost, the contact set becomes the bridge to convex analysis and ultimately Brenier's theorem. The equality condition can be rewritten as a subgradient relation.
[example: Quadratic Contact and Subgradients]
Let $c(x,y)=\frac12|x-y|^2$ on $\mathbb R^n$, and write the $c$-concave potential in the quadratic form
\begin{align*}
\varphi(x)=\frac12|x|^2-u(x)
\end{align*}
with $u$ convex. The matching transform has the form
\begin{align*}
\psi(y)=\frac12|y|^2-u^*(y),
\end{align*}
so the contact condition $\varphi(x)+\psi(y)=c(x,y)$ becomes
\begin{align*}
\frac12|x|^2-u(x)+\frac12|y|^2-u^*(y)=\frac12|x-y|^2.
\end{align*}
Expanding the quadratic term gives
\begin{align*}
\frac12|x-y|^2=\frac12|x|^2-x\cdot y+\frac12|y|^2.
\end{align*}
Substituting this into the contact condition gives
\begin{align*}
\frac12|x|^2-u(x)+\frac12|y|^2-u^*(y)=\frac12|x|^2-x\cdot y+\frac12|y|^2.
\end{align*}
Subtracting $\frac12|x|^2+\frac12|y|^2$ from both sides gives
\begin{align*}
-u(x)-u^*(y)=-x\cdot y.
\end{align*}
Multiplying by $-1$ gives the equivalent equality
\begin{align*}
u(x)+u^*(y)=x\cdot y.
\end{align*}
This equality is exactly the subgradient condition. Indeed, $y\in\partial u(x)$ means
\begin{align*}
u(z)\ge u(x)+y\cdot(z-x)\quad\text{for every }z\in\mathbb R^n.
\end{align*}
Rearranging gives
\begin{align*}
z\cdot y-u(z)\le x\cdot y-u(x)\quad\text{for every }z\in\mathbb R^n.
\end{align*}
Taking the supremum over $z$ gives
\begin{align*}
u^*(y)\le x\cdot y-u(x).
\end{align*}
The reverse inequality follows from the definition of $u^*$ by evaluating the supremum at $z=x$:
\begin{align*}
u^*(y)=\sup_{z\in\mathbb R^n}\{z\cdot y-u(z)\}\ge x\cdot y-u(x).
\end{align*}
Hence $u^*(y)=x\cdot y-u(x)$, which is the same as $u(x)+u^*(y)=x\cdot y$. Therefore contact for the quadratic cost is equivalent to $y\in\partial u(x)$.
Thus, when an optimal plan is concentrated on the contact set of the maximizing potentials, its support is contained in the graph of the convex subdifferential $\partial u$, which is the structural input for the map-valued theory in the next chapter.
[/example]
The chapter has moved from value equality to support geometry. Kantorovich duality identifies the exact lower bounds for transport cost, while complementary slackness tells us where an optimal plan is allowed to place mass. The next chapter specializes these ideas to quadratic cost, where $c$-concavity becomes convex analysis and the contact set can collapse to the graph of a transport map.
Duality turns the transport problem into a comparison between costs and potentials. The next step is to read that analytic information geometrically, using cyclical monotonicity to recognize when a plan can actually be optimal.
# 5. Cyclical Monotonicity and Optimality Criteria
This chapter turns the duality theory of the preceding chapter into a geometric test for optimality. The prerequisites are the Kantorovich formulation and compactness from Chapter 2, finite duality intuition from Chapter 3, and weak duality together with the $c$-transform from Chapter 4. Instead of solving the Kantorovich problem directly, we ask which subsets of $X \times Y$ can carry an optimal plan. The answer is cyclical monotonicity: no finite cycle of transported mass can be rearranged to lower the total cost. This criterion is local on finite configurations but strong enough to reconstruct dual potentials.
## Finite Cycles and the Cost-Swapping Test
The basic question is how to recognise, from the support of a plan, whether the plan contains a finite pattern that wastes cost. If a plan sends $x_i$ to $y_i$ for $i=1,\dots,n$, then a competing plan can keep the same $x$-marginal and $y$-marginal while permuting the targets among these finitely many points. Optimality should forbid every such cost-improving permutation.
[definition: C-Cyclically Monotone Set]
Let $X$ and $Y$ be sets, and let $c:X\times Y\to (-\infty,\infty]$ be a cost function. A subset $\Gamma\subset X\times Y$ is $c$-cyclically monotone if, for every $n\in\mathbb N$, every $(x_1,y_1),\dots,(x_n,y_n)\in\Gamma$, and every permutation $\sigma$ of $\{1,\dots,n\}$,
\begin{align*}
\sum_{i=1}^n c(x_i,y_i) \le \sum_{i=1}^n c(x_i,y_{\sigma(i)}).
\end{align*}
[/definition]
The definition says that every finite piece of the graph already minimises the assignment cost among all ways of matching the same sources to the same targets. It is a support condition rather than a statement about the weights of the plan, which is why it can survive passage to restrictions and approximations.
[remark: Cycles Instead of All Permutations]
It is enough to test cyclic permutations of the form $y_i\mapsto y_{i+1}$, with $y_{n+1}=y_1$. Every permutation decomposes into disjoint cycles, and the inequality for each cycle adds to the inequality for the whole permutation.
[/remark]
This reduction is useful because failures of cyclical monotonicity usually appear as a visible cycle. The smallest nontrivial case already gives the familiar two-point crossing test.
[example: Two Point Cost Swap]
Let $X=Y=\mathbb R$ and $c(x,y)=|x-y|^2$. Suppose $x_1<x_2$ but $y_1>y_2$. The two-point cyclic monotonicity inequality would require
\begin{align*}
|x_1-y_1|^2+|x_2-y_2|^2\le |x_1-y_2|^2+|x_2-y_1|^2.
\end{align*}
Compute the difference between the left-hand side and the right-hand side:
\begin{align*}
\bigl((x_1-y_1)^2+(x_2-y_2)^2\bigr)-\bigl((x_1-y_2)^2+(x_2-y_1)^2\bigr)=2x_1y_2+2x_2y_1-2x_1y_1-2x_2y_2.
\end{align*}
Factoring the right-hand side gives
\begin{align*}
2x_1(y_2-y_1)+2x_2(y_1-y_2)=2(x_2-x_1)(y_1-y_2).
\end{align*}
Thus the required inequality is equivalent to
\begin{align*}
2(x_2-x_1)(y_1-y_2)\le 0.
\end{align*}
But $x_2-x_1>0$ and $y_1-y_2>0$, so $2(x_2-x_1)(y_1-y_2)>0$. The two-point swap lowers the cost, and therefore a squared-cost optimal support on the line cannot contain two crossing pairs.
[/example]
The one-dimensional example returns to the increasing rearrangement from Chapter 1: order preservation is exactly the no-crossing form of the finite swap test. To compare this with the convex-analysis language used for the quadratic cost in higher dimensions, we isolate the corresponding condition on pairs $(x,p)$ in Euclidean space.
[definition: Cyclically Monotone Set]
A subset $\Gamma\subset \mathbb R^d\times\mathbb R^d$ is cyclically monotone if, for every $n\in\mathbb N$ and every $(x_1,p_1),\dots,(x_n,p_n)\in\Gamma$,
\begin{align*}
\sum_{i=1}^n p_i\cdot (x_{i+1}-x_i) \le 0,
\end{align*}
where $x_{n+1}=x_1$.
[/definition]
This is the classical convex-analytic notion. The next task is to justify that this older definition is not a separate condition but exactly the same finite reassignment test once the cost is
\begin{align*}
c(x,y)=\frac12|x-y|^2.
\end{align*}
[quotetheorem:7471]
[citeproof:7471]
This theorem explains why convex analysis enters the quadratic theory: after expansion, the quadratic terms depending only on the marginals disappear, and only the bilinear pairing $x\cdot y$ remains. The quadratic structure is essential here; for a general cost there is no reason for the reassignment inequality to become an ordinary subgradient inequality. For instance, if $c(x,y)=|x-y|$, take $x_1=(0,0)$, $x_2=(-4,-4)$, $y_1=(-4,1)$, and $y_2=(-3,0)$ in $\mathbb R^2$. The two-point set is classically cyclically monotone because
\begin{align*}
(y_2-y_1)\cdot(x_2-x_1)=(1,-1)\cdot(-4,-4)=0.
\end{align*}
However, the cost swap for the distance cost gives
\begin{align*}
|x_1-y_1|+|x_2-y_2|=2\sqrt{17}>8=|x_1-y_2|+|x_2-y_1|.
\end{align*}
Thus classical cyclical monotonicity does not imply $c$-cyclical monotonicity for this non-quadratic cost. The theorem also does not say that every cyclically monotone set is the graph of a function, since vertical fibres may occur when a convex function has a multi-valued subdifferential. Its role is to convert optimal transport support geometry into the language needed for Rockafellar's reconstruction theorem.
[example: Monotone Supports in One Dimension]
Let $\Gamma=\{(x,T(x)):x\in\mathbb R\}$ with $T$ nondecreasing, and write $p_i=T(x_i)$. If $x_1<x_2$, then $p_1\le p_2$, so
\begin{align*}
(p_2-p_1)(x_2-x_1)\ge 0.
\end{align*}
Equivalently,
\begin{align*}
p_1x_1+p_2x_2-p_1x_2-p_2x_1=(p_2-p_1)(x_2-x_1)\ge 0.
\end{align*}
Thus, for two source points on the graph, matching the smaller $x$ to the smaller value of $T$ has no larger bilinear cost than crossing the two matches.
For a finite cycle $(x_1,p_1),\dots,(x_n,p_n)$ in $\Gamma$, order the same points so that $x_{(1)}\le\cdots\le x_{(n)}$. Since $T$ is nondecreasing, $p_{(1)}\le\cdots\le p_{(n)}$. The two-point inequality above shows that any inversion in a matching can be removed without decreasing $\sum_i p_i x_{\tau(i)}$; repeating this adjacent swap process gives
\begin{align*}
\sum_{i=1}^n p_i x_{i+1}\le \sum_{i=1}^n p_i x_i,
\end{align*}
where $x_{n+1}=x_1$. Subtracting the right-hand side gives
\begin{align*}
\sum_{i=1}^n p_i(x_{i+1}-x_i)\le 0.
\end{align*}
This is classical cyclical monotonicity in one dimension. For the quadratic cost
\begin{align*}
c(x,y)=\frac12|x-y|^2,
\end{align*}
the quadratic expansion converts this classical condition into $c$-cyclical monotonicity, so the graph of a nondecreasing transport map has no cost-lowering finite swap.
[/example]
## Optimal Plans Live on $c$-Cyclically Monotone Sets
The next question is whether optimality forces the support condition just described. Since a transport plan may be diffuse, we cannot inspect individual atoms of mass only. The strategy is to use finite restrictions of the plan: if a bad finite cycle occurs on a set of positive mass, then a small amount of mass can be rerouted and the total cost decreases.
[definition: Concentrated on a Set]
Let $\pi$ be a measure on $X\times Y$. We say that $\pi$ is concentrated on $\Gamma\subset X\times Y$ if $\pi((X\times Y)\setminus\Gamma)=0$.
[/definition]
The relevant set $\Gamma$ is often chosen as a full-measure subset of the support of $\pi$. The statement below gives the necessary condition for optimality under the usual integrability assumptions that make the Kantorovich value finite.
[quotetheorem:7472]
[citeproof:7472]
The theorem is the first major optimality criterion that does not mention potentials explicitly. The Borel and Polish hypotheses are used to make the violating finite configurations measurable and to reduce the selection argument to countably many neighbourhoods; on a nonmeasurable cost-swapping set, the phrase "positive mass of violating cycles" may have no measure-theoretic meaning. The finite-cost assumption also has a concrete role: if two assignments both have infinite total cost, the formal inequality $\infty>\infty$ cannot create a cheaper competitor. For example, with $c(x,y)=\infty$ off a prescribed closed relation, every plan that leaves the relation has infinite cost, and finite-cycle comparisons outside the relation do not give a numerical contradiction. The theorem is only a necessary condition at this point: a plan may be concentrated on a cyclically monotone set, but without a dual potential tight on that set we have not yet proved optimality. The next section supplies the missing reconstruction step.
[example: Nonoptimal Crossing Transports in $\mathbb R^2$]
Let $X=Y=\mathbb R^2$ and use the cost
\begin{align*}
c(x,y)=\frac12|x-y|^2.
\end{align*}
Take
\begin{align*}
x_1=(-1,0),\quad x_2=(1,0),\quad y_1=(1,0),\quad y_2=(-1,0),
\end{align*}
and give each source atom mass $1/2$. For the crossing plan sending $x_1$ to $y_1$ and $x_2$ to $y_2$, the first displacement is
\begin{align*}
x_1-y_1=(-1,0)-(1,0)=(-2,0),
\end{align*}
so
\begin{align*}
|x_1-y_1|^2=(-2)^2+0^2=4.
\end{align*}
Hence
\begin{align*}
c(x_1,y_1)=\frac12|x_1-y_1|^2=\frac12\cdot 4=2.
\end{align*}
Similarly,
\begin{align*}
x_2-y_2=(1,0)-(-1,0)=(2,0),
\end{align*}
so
\begin{align*}
c(x_2,y_2)=\frac12\bigl(2^2+0^2\bigr)=2.
\end{align*}
The probability-weighted cost of the crossing plan is therefore
\begin{align*}
\frac12 c(x_1,y_1)+\frac12 c(x_2,y_2)=\frac12\cdot 2+\frac12\cdot 2=2.
\end{align*}
Now compare the two-point assignment costs. The crossing assignment has unweighted cost
\begin{align*}
c(x_1,y_1)+c(x_2,y_2)=2+2=4.
\end{align*}
The swapped assignment sends each point to itself, because $y_2=x_1$ and $y_1=x_2$. Thus
\begin{align*}
c(x_1,y_2)=\frac12|x_1-x_1|^2=0
\end{align*}
and
\begin{align*}
c(x_2,y_1)=\frac12|x_2-x_2|^2=0.
\end{align*}
Therefore
\begin{align*}
c(x_1,y_1)+c(x_2,y_2)=4>0=c(x_1,y_2)+c(x_2,y_1).
\end{align*}
This is exactly a violation of the two-point $c$-cyclical monotonicity inequality for the support $\{(x_1,y_1),(x_2,y_2)\}$. The crossing plan admits a cheaper swap, so it cannot be optimal.
[/example]
The example is intentionally discrete, but the same obstruction appears in continuous plans. Whenever two pieces of mass cross in a way that admits a cheaper local exchange, the support fails the finite-cycle test.
[remark: Necessity Does Not Use Strict Convexity]
The necessity of $c$-cyclical monotonicity is not special to the quadratic cost. It comes from comparing a finite assignment with a permuted assignment, so the argument applies to any Borel cost for which the optimal plan has finite cost.
[/remark]
The quadratic cost becomes special when we ask for a more constructive description of all cyclically monotone supports. That direction is supplied by Rockafellar's theorem.
## Potentials Built from Cyclical Monotonicity
A support condition becomes a full optimality theorem only if it produces dual potentials. The problem is to start from inequalities on finite cycles and manufacture a function whose $c$-transform touches the cost exactly on the given set. In the quadratic case this is the content of Rockafellar's theorem; for a general cost it is the construction of a $c$-convex potential.
[definition: Subdifferential of a Convex Function]
Let $\varphi:\mathbb R^d\to (-\infty,\infty]$ be convex. The subdifferential of $\varphi$ is the set-valued map $\partial\varphi:\mathbb R^d\to\mathcal P(\mathbb R^d)$ defined at $x\in\mathbb R^d$ by
\begin{align*}
\partial\varphi(x):=\{p\in\mathbb R^d: \varphi(z)\ge \varphi(x)+p\cdot(z-x)\text{ for all }z\in\mathbb R^d\}.
\end{align*}
[/definition]
The subdifferential definition gives a source of cyclically monotone sets, because the supporting inequalities can be added around a cycle. For optimal transport we need the reverse direction: given only the finite-cycle inequalities on a proposed support, we need a potential whose subdifferential contains that support. Rockafellar's theorem supplies exactly this reconstruction step.
[quotetheorem:7473]
[citeproof:7473]
The construction resembles an antiderivative: cyclic monotonicity is the path-independence condition needed to integrate the field of slopes. The nonempty hypothesis is only there to choose a base point; if $\Gamma=\varnothing$, any proper convex lower semicontinuous function would satisfy the empty inclusion. Cyclical monotonicity itself is indispensable: the two-point set $\{(0,1),(1,0)\}\subset\mathbb R\times\mathbb R$ violates monotonicity, and no convex function can have both $1\in\partial\varphi(0)$ and $0\in\partial\varphi(1)$ because subgradients of a convex function are monotone. The theorem does not assert uniqueness of $\varphi$, since adding constants or changing the function away from the generated convex envelope may leave the same subdifferential containment. It is also special to the Euclidean pairing, so the next step is to replace affine functions by the cost profiles $x\mapsto -c(x,y)$ and obtain a $c$-convex version suitable for optimal transport.
[definition: C-Convex Function]
Let $X$ and $Y$ be sets and let $c:X\times Y\to (-\infty,\infty]$. A function $\phi:X\to (-\infty,\infty]$ is $c$-convex if there exists a function $a:Y\to [-\infty,\infty]$ such that
\begin{align*}
\phi(x)=\sup_{y\in Y}\{a(y)-c(x,y)\}.
\end{align*}
[/definition]
A $c$-convex function is designed to participate in the Kantorovich dual inequality. To identify the support where a plan attains equality in that inequality, we need the contact relation between $\phi$ and its $c$-transform.
[definition: C-Subdifferential]
Let $X$ and $Y$ be sets, let $c:X\times Y\to(-\infty,\infty]$ be a cost function, and let $\phi:X\to (-\infty,\infty]$ be a function. Its $c$-transform is the function $\phi^c:Y\to [-\infty,\infty]$ defined by
\begin{align*}
\phi^c(y):=\inf_{x\in X}\{c(x,y)-\phi(x)\}.
\end{align*}
The $c$-subdifferential of $\phi$ is the subset $\partial^c\phi\subset X\times Y$ defined by
\begin{align*}
\partial^c\phi:=\{(x,y)\in X\times Y: \phi(x)+\phi^c(y)=c(x,y)\}.
\end{align*}
[/definition]
Thus $\partial^c\phi$ is the contact set for the admissible dual pair $(\phi,\phi^c)$. The remaining problem is the converse direction: a support condition such as cyclic monotonicity is geometric, while optimality is an equality between primal transport cost and dual potentials. One needs a criterion that turns the geometry of the support into an integrable dual certificate.
[quotetheorem:5772]
[citeproof:5772]
This result closes the loop between support geometry and Kantorovich duality. The lower semicontinuity of $c$ is part of the regularity framework in which the reconstructed potential is measurable, while the explicit integrability assumption is what allows the final equality of integrals to be meaningful. The point is not that every tight pair is usable. For example, take $X=Y=\mathbb R$, $c(x,y)=|x-y|$, and let $\mu=\nu$ be the standard Cauchy law. The identity plan has zero cost and is concentrated on the diagonal. The pair $\phi(x)=x$, $\phi^c(y)=-y$ is tight on that diagonal, but its two marginal integrals are undefined, so this particular pair does not certify optimality by equality of primal and dual values. The same plan is certified by the integrable tight pair $\phi=0$ and $\phi^c=0$, showing why the theorem asks for the existence of a well-defined tight pair rather than accepting an arbitrary reconstructed representative. In applications, the burden is therefore to verify both the finite-cycle inequalities and an integrable dual certificate.
[example: Dual Potentials for Monotone Rearrangement]
Let $\mu$ and $\nu$ be nonatomic probability measures on $\mathbb R$, and let
$T=F_\nu^{-1}\circ F_\mu$ be the increasing rearrangement. Since both $F_\mu$ and $F_\nu^{-1}$ are nondecreasing, $T$ is nondecreasing. Assume $T$ is locally integrable and define
\begin{align*}
\varphi(x)=\int_0^x T(s)\,ds.
\end{align*}
For $z>x$, monotonicity gives $T(s)\ge T(x)$ for $s\in[x,z]$, so
\begin{align*}
\varphi(z)-\varphi(x)=\int_x^z T(s)\,ds\ge \int_x^z T(x)\,ds=T(x)(z-x).
\end{align*}
For $z<x$, monotonicity gives $T(s)\le T(x)$ for $s\in[z,x]$, hence
\begin{align*}
\varphi(z)-\varphi(x)=-\int_z^x T(s)\,ds\ge -\int_z^x T(x)\,ds=T(x)(z-x).
\end{align*}
Thus, at every point where this pointwise representative of $T$ is used,
\begin{align*}
\varphi(z)\ge \varphi(x)+T(x)(z-x)
\end{align*}
for all $z\in\mathbb R$, which means $T(x)\in\partial\varphi(x)$. In particular this holds for a.e. $x$ after changing $T$ on a null set if necessary. The same inequalities also show that $\varphi$ is convex: its graph has a supporting affine line of slope $T(x)$ at almost every $x$.
For the quadratic transport cost
\begin{align*}
c(x,y)=\frac12|x-y|^2,
\end{align*}
the Kantorovich potential is obtained by subtracting this convex primitive from the quadratic term:
\begin{align*}
\Phi(x)=\frac12x^2-\varphi(x).
\end{align*}
Let $\varphi^*$ be the Legendre transform of $\varphi$, and set
\begin{align*}
\Psi(y)=\frac12y^2-\varphi^*(y).
\end{align*}
The Fenchel inequality $\varphi(x)+\varphi^*(y)\ge xy$ gives
\begin{align*}
\Phi(x)+\Psi(y)=\frac12x^2+\frac12y^2-\varphi(x)-\varphi^*(y)\le \frac12x^2+\frac12y^2-xy=\frac12|x-y|^2.
\end{align*}
When $y=T(x)\in\partial\varphi(x)$, Fenchel equality gives $\varphi(x)+\varphi^*(T(x))=xT(x)$, and therefore
\begin{align*}
\Phi(x)+\Psi(T(x))=\frac12x^2+\frac12T(x)^2-xT(x)=\frac12|x-T(x)|^2.
\end{align*}
So the increasing rearrangement is supported on the equality set of the dual pair $(\Phi,\Psi)$, and the monotone graph is certified by an explicit quadratic-cost potential.
[/example]
## The Optimality Criterion in Practice
The last question is how to use the criterion without rebuilding the whole dual problem each time. In applications, one usually proposes a plan from geometry or symmetry, proves its support is $c$-cyclically monotone, and then invokes the converse criterion.
[quotetheorem:7474]
[citeproof:7474]
The theorem packages duality into a support test. The finite-value assumption rules out the degenerate situation in which all admissible plans have infinite cost and optimality carries no useful information; for instance, if $c\equiv\infty$, every plan has the same infinite cost, while cyclical monotonicity carries no selection power. The potential reconstruction and integrability assumptions are also substantive: the support test becomes a proof of optimality only after it yields a Borel dual pair whose equality can be integrated. The Cauchy example above illustrates a milder but common pitfall: a natural tight pair may have undefined marginal integrals even when another integrable certificate exists. Thus the characterization is stated with an explicit reconstruction-and-integrability hypothesis rather than as a bare support condition in complete generality. The theorem also does not identify the optimal plan uniquely; different plans may live on the same cyclically monotone set, or on different such sets, unless extra hypotheses such as strict convexity and absolute continuity are imposed. These extra hypotheses are precisely what the quadratic theory will exploit in the passage to Brenier maps.
[example: Checking a Proposed Plan by Support]
In $\mathbb R^2$, set
\begin{align*}
\mu=\frac12\delta_{(-1,0)}+\frac12\delta_{(1,0)},\qquad \nu=\frac12\delta_{(-2,0)}+\frac12\delta_{(2,0)},
\end{align*}
and use the cost
\begin{align*}
c(x,y)=\frac12|x-y|^2.
\end{align*}
Consider the plan
\begin{align*}
\pi=\frac12\delta_{((-1,0),(-2,0))}+\frac12\delta_{((1,0),(2,0))}.
\end{align*}
Its first marginal is $\mu$ and its second marginal is $\nu$, so $\pi\in\Pi(\mu,\nu)$.
Write
\begin{align*}
x_1=(-1,0),\qquad x_2=(1,0),\qquad y_1=(-2,0),\qquad y_2=(2,0).
\end{align*}
The assigned costs are
\begin{align*}
c(x_1,y_1)=\frac12|(-1,0)-(-2,0)|^2=\frac12|(1,0)|^2=\frac12
\end{align*}
and
\begin{align*}
c(x_2,y_2)=\frac12|(1,0)-(2,0)|^2=\frac12|(-1,0)|^2=\frac12.
\end{align*}
The crossed costs are
\begin{align*}
c(x_1,y_2)=\frac12|(-1,0)-(2,0)|^2=\frac12|(-3,0)|^2=\frac92
\end{align*}
and
\begin{align*}
c(x_2,y_1)=\frac12|(1,0)-(-2,0)|^2=\frac12|(3,0)|^2=\frac92.
\end{align*}
Therefore
\begin{align*}
c(x_1,y_1)+c(x_2,y_2)=\frac12+\frac12=1
\end{align*}
while
\begin{align*}
c(x_1,y_2)+c(x_2,y_1)=\frac92+\frac92=9.
\end{align*}
Thus the nontrivial two-point swap satisfies
\begin{align*}
c(x_1,y_1)+c(x_2,y_2)\le c(x_1,y_2)+c(x_2,y_1).
\end{align*}
The identity permutation gives equality, and every finite assignment using only these two support points is built from identity matches and copies of this same two-point comparison. Hence the support of $\pi$ is $c$-cyclically monotone. By the optimality characterization from the preceding theorem, this proposed plan is optimal.
[/example]
The finite example illustrates the general workflow: the optimality proof is reduced to checking finitely many or structurally controlled inequalities. For [absolutely continuous measures](/page/Absolutely%20Continuous%20Measures) and the quadratic cost, this support condition will lead in the next chapter to maps of the form $T(x)=\nabla\varphi(x)$, which is the gateway to Brenier's theorem.
Cyclical monotonicity gives a support-based optimality criterion that is still purely geometric. For the quadratic cost, that geometry becomes especially rigid, and optimal transport begins to produce maps with gradient structure.
# 6. Quadratic Cost and Brenier Theory
Quadratic cost is the point where optimal transport stops being only a compactness-and-duality theory and begins to have differential structure. The cost $c(x,y)=|x-y|^2$ interacts with convexity, and this interaction converts optimality of a plan into geometric monotonicity of its support. The main question in this chapter is when an optimal plan is induced by a map, and why that map must be the gradient of a convex function.
Chapters 4 and 5 developed Kantorovich duality and $c$-cyclical monotonicity for general costs. Here we specialize those tools to Euclidean spaces, rewrite the dual problem in convex variables, and prove the Brenier theorem under absolute continuity of the source measure. We then record two model computations and state the [polar factorization theorem](/theorems/7479), which explains Brenier maps as the nonlinear analogue of the [polar decomposition](/theorems/3074) of matrices.
## Rewriting Quadratic Transport Using Convex Potentials
The first problem is to understand why the quadratic cost is better than a generic continuous cost. The expression $|x-y|^2$ contains the [inner product](/page/Inner%20Product) $x\cdot y$, and after discarding terms depending only on the marginals, minimizing quadratic transport becomes the same as maximizing correlation.
Let $\mu,\nu \in \mathcal P_2(\mathbb R^n)$, meaning that both probability measures have finite second moment. For any coupling $\pi \in \Pi(\mu,\nu)$, expansion of the square gives
\begin{align*}
\int_{\mathbb R^n \times \mathbb R^n} |x-y|^2\,d\pi(x,y)=\int_{\mathbb R^n} |x|^2\,d\mu(x)+\int_{\mathbb R^n} |y|^2\,d\nu(y)-2\int_{\mathbb R^n \times \mathbb R^n} x\cdot y\,d\pi(x,y).
\end{align*}
Thus the variable part of the optimization is the last integral. This is the source of convex analysis in the quadratic theory: supporting hyperplanes to convex functions are exactly inequalities involving $x\cdot y$.
[definition: Convex Conjugate]
Let $\varphi:\mathbb R^n\to(-\infty,\infty]$ be a proper convex function. Its convex conjugate is the function $\varphi^*:\mathbb R^n\to(-\infty,\infty]$ defined by
\begin{align*}
\varphi^*(y)=\sup_{x\in\mathbb R^n}\{x\cdot y-\varphi(x)\}.
\end{align*}
[/definition]
The definition packages all affine lower bounds of $\varphi$ into a single function on the dual variable. The next result is needed because it identifies exactly when the transport inequality $x\cdot y\le \varphi(x)+\varphi^*(y)$ is saturated, and this saturation will become the support condition for an optimal plan.
To state the equality condition, we need the convex-analytic replacement for an ordinary derivative. At a point where a convex function has a supporting hyperplane, the slope of that hyperplane is called a subgradient; at a corner there may be several such slopes.
[definition: Subdifferential of a Convex Function]
Let $\varphi:\mathbb R^n\to(-\infty,\infty]$ be a proper convex function. The subdifferential is the set-valued map $\partial\varphi:\mathbb R^n\to\mathcal P(\mathbb R^n)$ defined as follows. For $x\in\mathbb R^n$ with $\varphi(x)<\infty$,
\begin{align*}
\partial\varphi(x)=\{y\in\mathbb R^n:\varphi(z)\ge \varphi(x)+y\cdot(z-x)\text{ for all }z\in\mathbb R^n\}.
\end{align*}
If $\varphi(x)=\infty$, set $\partial\varphi(x)=\varnothing$.
[/definition]
This definition turns the geometric idea of a supporting hyperplane into a set-valued map $x\mapsto\partial\varphi(x)$. When $\varphi$ is differentiable at $x$, the subdifferential is the singleton $\{\nabla\varphi(x)\}$; when $\varphi$ has a flat face or corner, the subdifferential records all supporting slopes at that point. The next theorem is the algebraic test for membership in this graph: it converts a geometric supporting-plane condition into equality in a universal inequality, which is exactly the form needed for complementary slackness in the transport problem.
[quotetheorem:6676]
[citeproof:6676]
Fenchel inequality has the same shape as the Kantorovich dual constraint, but its hypotheses matter. Properness prevents degenerate extended-real behaviour: if $\varphi\equiv\infty$, then $\varphi^*\equiv-\infty$ in the extended-real convention, and the displayed inequality no longer gives a useful finite contact condition. Convexity is also needed for the equality statement. For instance, on $\mathbb R$ let $\varphi(x)=-|x|$. Then $\varphi^*(0)=\sup_x |x|=\infty$, so the inequality contains no supporting-slope information at any point; the nonconvex function has no global supporting hyperplane at $0$ although ordinary one-sided slopes exist. The result does not say that every pair $(x,y)$ lies in a subdifferential graph; it identifies the exceptional equality set inside the universal inequality. In the quadratic problem, this equality set is the place where complementary slackness will force optimal mass to live, so the next step is to normalize Kantorovich potentials until their constraint is exactly Fenchel inequality.
[definition: Quadratic Kantorovich Potential]
Let $\mu,\nu\in\mathcal P_2(\mathbb R^n)$. A convex function $\varphi:\mathbb R^n\to(-\infty,\infty]$ is a quadratic Kantorovich potential from $\mu$ to $\nu$ if the functions $f,g:\mathbb R^n\to(-\infty,\infty]$ defined by
\begin{align*}
f(x)= |x|^2-2\varphi(x), \qquad g(y)= |y|^2-2\varphi^*(y)
\end{align*}
form an optimal dual pair for the cost $c(x,y)=|x-y|^2$.
[/definition]
With this convention, the dual constraint $f(x)+g(y)\le |x-y|^2$ is equivalent to Fenchel inequality. The following theorem is needed because it translates all the abstract Kantorovich information from earlier chapters into the language of convex subdifferentials, which is the form in which Brenier's theorem can be proved.
[quotetheorem:7475]
[citeproof:7475]
The finite second moment assumption is not cosmetic: it makes the two marginal terms $\int |x|^2\,d\mu$ and $\int |y|^2\,d\nu$ finite, so the expansion separates the optimization into a finite constant part and a correlation part. The theorem is still a statement about Kantorovich plans, not a statement that an optimal plan must already be induced by a map. If $\mu=\delta_0$ and $\nu$ is not a Dirac mass, the only coupling is $\delta_0\otimes\nu$, and it cannot be represented as $(\operatorname{id},T)_\#\mu$ for any map $T$ with target law $\nu$. Thus the theorem reduces the geometric content of quadratic transport to understanding subgradients of convex functions, while the next section explains why absolute continuity of $\mu$ turns the relevant subgradient into a single-valued map almost everywhere.
[example: One-Dimensional Increasing Rearrangement as a Convex Gradient]
Let $F_\nu^{-1}(t)=\inf\{y\in\mathbb R:F_\nu(y)\ge t\}$ for $0<t<1$, and define
\begin{align*}
T(x)=F_\nu^{-1}(F_\mu(x))
\end{align*}
at points where $0<F_\mu(x)<1$. Since $F_\mu$ is nondecreasing, if $x_1\le x_2$ then $F_\mu(x_1)\le F_\mu(x_2)$. The quantile map $F_\nu^{-1}$ is also nondecreasing: if $s\le t$, then $\{y:F_\nu(y)\ge t\}\subset\{y:F_\nu(y)\ge s\}$, so taking infima gives $F_\nu^{-1}(s)\le F_\nu^{-1}(t)$. Therefore
\begin{align*}
x_1\le x_2 \implies T(x_1)=F_\nu^{-1}(F_\mu(x_1))\le F_\nu^{-1}(F_\mu(x_2))=T(x_2).
\end{align*}
Because $T$ is nondecreasing, define a convex potential by
\begin{align*}
\varphi(x)=\int_0^x T(s)\,ds
\end{align*}
on intervals where $T$ is finite, with the usual sign convention $\int_0^x=-\int_x^0$ when $x<0$. For $a<b<c$, monotonicity gives $T(s)\le T(t)$ whenever $s\in[a,b]$ and $t\in[b,c]$, hence the average slope of $\varphi$ on $[a,b]$ is at most the average slope on $[b,c]$:
\begin{align*}
\frac{\varphi(b)-\varphi(a)}{b-a}=\frac{1}{b-a}\int_a^b T(s)\,ds\le \frac{1}{c-b}\int_b^c T(t)\,dt=\frac{\varphi(c)-\varphi(b)}{c-b}.
\end{align*}
This monotonicity of secant slopes is the one-dimensional convexity criterion, so $\varphi$ is convex. At every continuity point $x$ of $T$, the difference quotient satisfies
\begin{align*}
\frac{\varphi(x+h)-\varphi(x)}{h}=\frac{1}{h}\int_x^{x+h}T(s)\,ds\to T(x)
\end{align*}
as $h\to0$, because the averages of a continuous-at-$x$ monotone function over shrinking intervals converge to its value at $x$. Thus $T=\varphi'$ at all continuity points of $T$. This is the one-dimensional model for Brenier theory: increasing rearrangement is exactly a convex gradient, and in higher dimensions the gradient of a convex potential replaces scalar monotonicity.
[/example]
## Gradients of Convex Functions as Optimal Maps
The next question is why gradients of convex functions are the correct maps, rather than merely a convenient way to describe optimal plans. The answer is that subdifferential graphs of convex functions are cyclically monotone, and cyclical monotonicity is exactly the optimality condition for the quadratic cost.
[definition: Cyclically Monotone Set]
A set $\Gamma\subset\mathbb R^n\times\mathbb R^n$ is cyclically monotone if for every $N\in\mathbb N$ and every finite family $(x_i,y_i)_{i=1}^N\subset\Gamma$,
\begin{align*}
\sum_{i=1}^N x_i\cdot y_i \ge \sum_{i=1}^N x_i\cdot y_{i+1}, \qquad y_{N+1}=y_1.
\end{align*}
[/definition]
For quadratic cost, this condition says that no finite cycle of destinations can be reassigned to lower the cost. What is not automatic is that a collection of finitely consistent inequalities comes from one global object. The obstruction is compatibility: the same assigned pairs must fit together as the contact data of a single convex potential, not merely pass isolated finite tests.
[quotetheorem:7476]
[citeproof:7476]
Rockafellar's theorem explains why convex potentials are forced by optimality, but the exact cyclic condition is essential. Pairwise monotonicity is enough in one dimension, yet in higher dimensions it does not control all finite cycles. A concrete failure is given by the [linear map](/page/Linear%20Map) $A:\mathbb R^2\to\mathbb R^2$,
\begin{align*}
A(x_1,x_2)=(x_1-2x_2,2x_1+x_2).
\end{align*}
Its graph is pairwise monotone because $(x-z)\cdot(Ax-Az)=|x-z|^2$ for all $x,z\in\mathbb R^2$. However, for
\begin{align*}
x_1=(1,0),\qquad x_2=(-1,-1),\qquad x_3=(-1,1),\qquad y_i=Ax_i,
\end{align*}
we have $\sum_i x_i\cdot y_i=5$ while $\sum_i x_i\cdot y_{i+1}=6$, with $y_4=y_1$. Thus the graph is not cyclically monotone and cannot be contained in the subdifferential of a convex function. The theorem deliberately states the constructed potential as convex rather than automatically lower semicontinuous; taking a lower semicontinuous envelope is a separate operation and may enlarge contact sets unless the equality relation is tracked. The conclusion is also a containment statement, not a parametrization of $\Gamma$: the subdifferential graph may contain more points than the original set, and the subdifferential may be multivalued on faces of the potential. To turn this relation into a transport map, we need to know that convex functions have a single subgradient at almost every point seen by an absolutely continuous source measure.
[quotetheorem:3088]
[citeproof:3088]
The interior-domain hypothesis is where the local Lipschitz argument lives. Near the boundary of the effective domain, a convex function may have vertical supporting behaviour or may take the value $\infty$ immediately outside the domain, so [Rademacher's theorem](/theorems/3069) cannot be applied without first restricting to compact subsets of the interior. For example, let $\varphi:\mathbb R\to(-\infty,\infty]$ be the indicator of the closed half-line $[0,\infty)$, equal to $0$ on $[0,\infty)$ and $\infty$ on $(-\infty,0)$. The interior $(0,\infty)$ is harmless, but at the boundary point $0$ the function is not finite on any neighbourhood and $\partial\varphi(0)=(-\infty,0]$ is multivalued. Almost-everywhere differentiability also does not make $\nabla\varphi$ continuous, invertible, or globally defined on every boundary point; it only says that the multivalued part of $\partial\varphi$ is invisible to Lebesgue measure inside the domain. Now the reason for assuming absolute continuity becomes visible: if $\mu\ll\mathcal L^n$, then the exceptional set where the convex potential is not differentiable has zero $\mu$-mass, so an optimal plan concentrated on $\partial\varphi$ must in fact sit on the graph of $\nabla\varphi$.
[quotetheorem:7477]
[citeproof:7477]
Brenier's theorem is the central structural result of the course, and each hypothesis has a specific job. Finite second moments are needed so the quadratic objective and the $W_2$ geometry are finite; for measures outside $\mathcal P_2(\mathbb R^n)$, the transport cost may be infinite and the variational statement loses its content. The quadratic cost is also special: the conclusion that the optimizer is a gradient of a convex function comes from the identity $|x-y|^2=|x|^2+|y|^2-2x\cdot y$, and a generic cost gives $c$-convex potentials rather than ordinary convex gradients. Absolute continuity of $\mu$ is the map-producing assumption; if $\mu=\delta_0$ and $\nu$ has more than one point in its support, no deterministic map can send $\mu$ to $\nu$, and if $\mu$ gives mass to a nondifferentiability set of a convex potential, an optimal plan may split that mass inside a subdifferential. Thus the theorem says that, for quadratic cost and volume-type source measures, the relaxed Kantorovich problem has not introduced extra minimizers: the unique optimal plan comes from a deterministic map.
[example: Optimal Transport Between Gaussian Measures]
Let $\mu=\mathcal N(m_0,\Sigma_0)$ and $\nu=\mathcal N(m_1,\Sigma_1)$ on $\mathbb R^n$, where $\Sigma_0$ and $\Sigma_1$ are symmetric positive definite. Put
\begin{align*}
B=\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}
\end{align*}
and define
\begin{align*}
A=\Sigma_0^{-1/2}B^{1/2}\Sigma_0^{-1/2}.
\end{align*}
The matrix $B$ is symmetric positive definite, so $B^{1/2}$ is symmetric positive definite. Hence $A$ is symmetric, since
\begin{align*}
A^\top=(\Sigma_0^{-1/2}B^{1/2}\Sigma_0^{-1/2})^\top=\Sigma_0^{-1/2}B^{1/2}\Sigma_0^{-1/2}=A.
\end{align*}
It is also positive definite: for $v\ne0$,
\begin{align*}
v\cdot Av=(\Sigma_0^{-1/2}v)\cdot B^{1/2}(\Sigma_0^{-1/2}v)>0.
\end{align*}
Define the affine map
\begin{align*}
T(x)=m_1+A(x-m_0).
\end{align*}
If
\begin{align*}
\varphi(x)=m_1\cdot x+\frac12(x-m_0)\cdot A(x-m_0),
\end{align*}
then the symmetry of $A$ gives
\begin{align*}
\nabla\varphi(x)=m_1+A(x-m_0)=T(x).
\end{align*}
Since the Hessian of $\varphi$ is $A$, and $A$ is positive definite, $\varphi$ is convex.
It remains to check that $T$ sends $\mu$ to $\nu$. If $X\sim\mathcal N(m_0,\Sigma_0)$, then $X-m_0$ has mean $0$ and covariance $\Sigma_0$. Therefore $T(X)$ is Gaussian with mean
\begin{align*}
\mathbb E[T(X)]=m_1+A\mathbb E[X-m_0]=m_1
\end{align*}
and covariance
\begin{align*}
\operatorname{Cov}(T(X))=A\Sigma_0A^\top=A\Sigma_0A.
\end{align*}
Using the definition of $A$ and $A^\top=A$,
\begin{align*}
A\Sigma_0A=\Sigma_0^{-1/2}B^{1/2}\Sigma_0^{-1/2}\Sigma_0\Sigma_0^{-1/2}B^{1/2}\Sigma_0^{-1/2}.
\end{align*}
Since $\Sigma_0^{-1/2}\Sigma_0\Sigma_0^{-1/2}=I$, this becomes
\begin{align*}
A\Sigma_0A=\Sigma_0^{-1/2}B^{1/2}B^{1/2}\Sigma_0^{-1/2}.
\end{align*}
Because $B^{1/2}B^{1/2}=B$,
\begin{align*}
A\Sigma_0A=\Sigma_0^{-1/2}B\Sigma_0^{-1/2}.
\end{align*}
Substituting $B=\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}$ gives
\begin{align*}
A\Sigma_0A=\Sigma_0^{-1/2}\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\Sigma_0^{-1/2}=\Sigma_1.
\end{align*}
Thus $T(X)\sim\mathcal N(m_1,\Sigma_1)$, so $T_\#\mu=\nu$. Since $\mu$ has a smooth positive density and $T=\nabla\varphi$ for a convex potential, *Brenier theorem* identifies this affine map as the unique quadratic optimal transport map from $\mu$ to $\nu$.
[/example]
This Gaussian example is finite-dimensional linear algebra inside Brenier theory. The next example is nonlinear but still explicit because symmetry reduces the map to one scalar monotone rearrangement.
[example: Radial Transport Between Balls]
Let $\mu$ and $\nu$ have densities $f(|x|)$ on $B(0,R)$ and $g(|y|)$ on $B(0,S)$, with $f,g>0$ on their supports. For $0\le r\le R$, define
\begin{align*}
F(r)=\int_0^r f(s)s^{n-1}\,ds
\end{align*}
and for $0\le q\le S$, define
\begin{align*}
G(q)=\int_0^q g(t)t^{n-1}\,dt.
\end{align*}
Because $f,g>0$, both $F$ and $G$ are strictly increasing on their supports. The equality of total masses gives $|\mathbb S^{n-1}|F(R)=|\mathbb S^{n-1}|G(S)=1$, so $F(R)=G(S)$. Define
\begin{align*}
\rho(r)=G^{-1}(F(r)).
\end{align*}
Then $\rho$ is nondecreasing because $F$ and $G^{-1}$ are nondecreasing, and it satisfies the mass-matching identity
\begin{align*}
\int_0^r f(s)s^{n-1}\,ds=\int_0^{\rho(r)} g(t)t^{n-1}\,dt.
\end{align*}
Set
\begin{align*}
T(x)=\frac{\rho(|x|)}{|x|}x \quad (x\ne 0), \qquad T(0)=0.
\end{align*}
For every $0\le r\le R$, the image of the ball $B(0,r)$ under $T$ is $B(0,\rho(r))$, since $|T(x)|=\rho(|x|)$ and $\rho$ is nondecreasing. Therefore
\begin{align*}
\mu(B(0,r))=|\mathbb S^{n-1}|\int_0^r f(s)s^{n-1}\,ds
\end{align*}
and
\begin{align*}
\nu(B(0,\rho(r)))=|\mathbb S^{n-1}|\int_0^{\rho(r)} g(t)t^{n-1}\,dt.
\end{align*}
The two quantities are equal by the defining equation for $\rho$. Since both measures are radial, equality on centered balls is exactly the one-dimensional equality of radial distribution functions, so $T_\#\mu=\nu$.
Now define
\begin{align*}
h(r)=\int_0^r \rho(s)\,ds \quad (r\ge 0)
\end{align*}
and set $\varphi(x)=h(|x|)$. The function $h$ is convex because its derivative $\rho$ is nondecreasing, and $h$ is nondecreasing because $\rho(r)\ge 0$. Hence $\varphi=h\circ |\cdot|$ is convex. For $x\ne 0$, the chain rule gives
\begin{align*}
\nabla\varphi(x)=h'(|x|)\nabla |x|=\rho(|x|)\frac{x}{|x|}=T(x).
\end{align*}
Thus the radial rearrangement is a gradient of a convex potential. Since $\mu$ is absolutely continuous and $T_\#\mu=\nu$, the *Brenier theorem* identifies this map as the quadratic optimal transport map; symmetry has reduced the transport equation to the scalar cumulative relation $F(r)=G(\rho(r))$.
[/example]
## Absolute Continuity and Uniqueness of the Optimal Plan
The remaining problem is to identify exactly what the absolute continuity assumption contributes. Without it, optimal plans still live on subdifferential graphs, but subdifferentials may contain whole faces; then a source atom can split among several destinations while remaining optimal.
[remark: Why Atoms Break Monge Solutions]
Let $\mu=\delta_0$ on $\mathbb R^n$ and let $\nu$ be any probability measure in $\mathcal P_2(\mathbb R^n)$. There is only one coupling, namely $\delta_0\otimes\nu$, so the Kantorovich plan is unique, but it is induced by a map only when $\nu$ is a Dirac mass. This is the simplest obstruction to solving the Monge problem: a single source point cannot be sent by a deterministic map to several target points.
[/remark]
The atom example concerns existence of a Monge map, while Brenier's absolute continuity assumption also controls uniqueness of optimal Kantorovich plans. The next theorem records the operational form used later: once the source gives no mass to Lebesgue-null sets, every optimal plan must coincide with the graph plan selected by the convex gradient.
[quotetheorem:7478]
[citeproof:7478]
The absolute continuity hypothesis is again the decisive one: it prevents the first marginal from charging the set where the selected convex potential has several subgradients. Without it, uniqueness of a plan can fail even for quadratic cost. For example, take
\begin{align*}
\mu=\frac12\delta_{(-1,0)}+\frac12\delta_{(1,0)},\qquad
\nu=\frac12\delta_{(0,-1)}+\frac12\delta_{(0,1)}
\end{align*}
on $\mathbb R^2$. Every pairing between a source point and a target point has squared distance $2$, so every coupling of $\mu$ and $\nu$ has the same total cost and is optimal; the deterministic matching and the crossed matching are two different optimal plans. The theorem also does not say that every coupling between $\mu$ and $\nu$ is a graph, only that the optimal coupling is forced onto the unique gradient graph. This uniqueness theorem is often the working form of Brenier's theorem: once the source has a density, every optimality argument can be checked against a single map. To connect this map-level structure with transformations that are not themselves gradients, we separate a general map into its mass-preserving rearrangement part and its monotone Brenier part.
[definition: Measure-Preserving Map]
Let $(X,\mathcal A,\mu)$ be a probability space and let $\Omega\subset\mathbb R^n$ be a Borel set with probability measure $\lambda$. A measurable map $s:X\to\Omega$ is measure-preserving from $\mu$ to $\lambda$ if $s_\#\mu=\lambda$.
[/definition]
The definition isolates the part of a map that rearranges mass without changing the reference distribution. The following theorem is needed because it says that, relative to Lebesgue measure on a domain, every sufficiently integrable map can be decomposed into such a rearrangement followed by a Brenier map.
[quotetheorem:7479]
[citeproof:7479]
Boundedness of $\Omega$ and the assumption $u\in L^2(\Omega;\mathbb R^n)$ ensure that both the reference law $\lambda$ and the image law $\nu$ lie in $\mathcal P_2(\mathbb R^n)$, so Brenier's theorem applies to the monotone factor. If the $L^2$ hypothesis is removed, this may fail even on a bounded interval: with $\Omega=(0,1)$ and $u(x)=x^{-1}$, the image law has infinite second moment, so the quadratic Brenier theorem is outside its natural finite-cost setting. Nondegeneracy is the extra rearrangement hypothesis: it says that the image law has a density, which supplies the reverse Brenier map used to define $s=R\circ u$. It is also needed for any uniqueness claim about the rearrangement factor. If $u$ is the constant map $u(x)=0$, then $\nu=\delta_0$ and the Brenier factor is the constant map $T\equiv0$; every measure-preserving map $s:\Omega\to\Omega$ satisfies $u=T\circ s$, so $s$ is not unique. The theorem is therefore the nonlinear analogue of writing a matrix as a positive symmetric part times an orthogonal part, with $\nabla\varphi$ as the monotone part and $s$ as the rearrangement preserving the reference measure.
[remark: Brenier Theory as Geometry of $W_2$]
Quadratic cost gives the Wasserstein distance $W_2$ on $\mathcal P_2(\mathbb R^n)$. Brenier's theorem supplies the geodesic direction from an absolutely continuous source measure to a target measure: the displacement $T(x)-x$ is the velocity field of the constant-speed interpolation $((1-t)\operatorname{id}+tT)_\#\mu$. Chapters 8 and 9 use this structure to introduce Wasserstein distances and treat $\mathcal P_2(\mathbb R^n)$ as a geodesic, curved infinite-dimensional space.
[/remark]
Brenier theory turns the quadratic-cost problem into a statement about convex potentials and gradients. Once that structure is in hand, the transport map can be recast as the solution of a nonlinear PDE, namely the Monge-Ampere equation.
# 7. The Monge-Ampere Equation Behind Transport
The course has now reached the point where the geometric existence theorem for quadratic optimal transport begins to look like a nonlinear PDE. The prerequisites for this chapter are Brenier's theorem, the change-of-variables formula, basic facts about convex functions, and the interpretation of pushforwards in terms of testing against functions. The previous chapter identified the quadratic-cost optimal map as the gradient of a convex potential, at least under the hypotheses of Brenier's theorem. This chapter asks what equation that potential should satisfy when the source and target measures have densities.
The answer is a nonlinear Jacobian equation, the Monge-Ampere equation, together with a boundary condition recording where the gradient map sends the support of the source measure. The goal is not yet to prove regularity. Instead, the chapter explains how the classical determinant equation arises, why it fails as a literal pointwise equation for general Brenier potentials, and how Alexandrov's measure gives the weak replacement used in transport.
The equation first appears by pretending that all maps and densities are smooth. That formal calculation is useful, but it is not stable enough for the general theory because convex potentials may fail to have classical second derivatives everywhere. The main purpose of the chapter is to reinterpret the same Jacobian identity through Alexandrov's measure associated to a convex function.
## The Smooth Jacobian Equation for Brenier Maps
The guiding question is: if $T = \nabla \phi$ pushes a density $\rho_0$ to a density $\rho_1$, what local equation forces the right amount of mass to pass through each infinitesimal volume element? The answer comes from applying the change-of-variables formula to the Brenier map. Since $T$ is a gradient map, the Jacobian matrix is the Hessian $D^2\phi$, and convexity makes this matrix positive semidefinite wherever it exists.
[quotetheorem:7480]
[citeproof:7480]
This theorem is the bridge from optimal transport to a fully nonlinear PDE. Each hypothesis has a specific job. The $C^2$ assumption makes the Hessian determinant a classical function; without it, a potential such as $\phi(x)=|x|$ has no classical second derivative at the origin. The diffeomorphism hypothesis prevents multiplicity in the change-of-variables formula; for instance $x\mapsto x^2$ on $(-1,1)$ folds two source points over most positive target points. Positivity of $\rho_1$ is needed because the formula divides by $\rho_1(\nabla\phi(x))$, and convexity ensures $D^2\phi$ is positive semidefinite so the determinant has the orientation-free volume meaning expected of a Brenier map.
The theorem does not say that a solution of the determinant equation is automatically an optimal transport map. It also does not handle nonsmooth convex potentials, nor does it encode which target set the gradient must fill. These limitations lead first to the Alexandrov measure, which replaces $\det D^2\phi\,d\mathcal L^n$ by a geometric measure of subgradient images, and later to the second boundary value condition.
The unknown is not the map $T$ directly, but a convex scalar potential $\phi$, and the equation constrains the determinant of its Hessian rather than its trace.
[remark: Why the Equation is Fully Nonlinear]
The Laplacian $\Delta \phi$ is linear in the second derivatives of $\phi$, while $\det D^2\phi$ is a polynomial of degree $n$ in the entries of the Hessian matrix. For $n=1$ the equation reduces to an ordinary first-order density balance for $\phi'$, but in higher dimensions it couples all principal curvatures of the graph of $\phi$.
[/remark]
This same determinant structure appears outside optimal transport. In geometric optics, a reflector or refractor must redistribute incoming light intensity onto a prescribed target intensity, and the surface potential satisfies a Monge-Ampere type equation with a geometric boundary condition. In prescribed curvature problems, determinant equations similarly encode how a convex graph bends volume or area. Optimal transport is therefore one route into a broader class of fully nonlinear elliptic equations where local curvature and global image constraints cannot be separated.
The one-dimensional case shows that the Monge-Ampere equation is the multidimensional descendant of monotone rearrangement. In one dimension, convexity of $\phi$ says that $T=\phi'$ is increasing, so the equation becomes the derivative of the cumulative distribution identity.
[example: Uniform Measure on an Interval]
Let $\mu=\operatorname{Unif}(0,1)$ and $\nu=\operatorname{Unif}(0,2)$ on $\mathbb R$. Their densities are
\begin{align*}
\rho_0(x)=1 \quad \text{for } x\in(0,1)
\end{align*}
and
\begin{align*}
\rho_1(y)=\frac{1}{2} \quad \text{for } y\in(0,2),
\end{align*}
because the intervals have lengths $1$ and $2$ respectively.
The increasing map carrying $(0,1)$ onto $(0,2)$ is $T(x)=2x$. It is the derivative of the convex potential $\phi(x)=x^2$, since
\begin{align*}
\phi'(x)=2x
\end{align*}
and
\begin{align*}
\phi''(x)=2.
\end{align*}
For $x\in(0,1)$, the image point is $\phi'(x)=2x\in(0,2)$, so the target density at the image point is
\begin{align*}
\rho_1(\phi'(x))=\rho_1(2x)=\frac{1}{2}.
\end{align*}
Therefore the one-dimensional Monge-Ampere density ratio is
\begin{align*}
\frac{\rho_0(x)}{\rho_1(\phi'(x))}=\frac{1}{1/2}=2.
\end{align*}
Thus the equation
\begin{align*}
\phi''(x)=\frac{\rho_0(x)}{\rho_1(\phi'(x))}
\end{align*}
becomes $2=2$ for every $x\in(0,1)$. In one dimension, the Hessian determinant is just the second derivative, so this is exactly the smooth Monge-Ampere equation for the monotone rearrangement from $(0,1)$ to $(0,2)$.
[/example]
The determinant equation is local, but transport is global: the PDE only says how volumes are distorted at points where the map is smooth. To obtain a well-posed transport problem, the equation must be paired with a condition saying that the gradient image lands in the target support.
## Convex Potentials and the Alexandrov Measure
The smooth derivation raises a problem: Brenier's theorem gives a convex potential, not a $C^2$ potential. [Convex functions are locally Lipschitz](/theorems/3086) on the interior of their domain and are twice differentiable almost everywhere, but their Hessians may not define classical functions everywhere. The right weak object is not obtained by differentiating twice distributionally; it is obtained by measuring the size of the subgradient image.
For a convex function, the gradient may be multivalued at nonsmooth points. The subdifferential records every supporting slope and replaces the pointwise gradient in the weak Jacobian equation.
[definition: Subdifferential of a Convex Function]
Let $U \subset \mathbb R^n$ be convex and let $\phi:U \to \mathbb R$ be convex. The subdifferential is the set-valued map
\begin{align*}
\partial\phi:U\to \mathcal P(\mathbb R^n)
\end{align*}
given, for $x \in U$, by
\begin{align*}
\partial\phi(x) := \{p \in \mathbb R^n : \phi(z) \ge \phi(x) + p\cdot(z-x) \text{ for every } z \in U\}.
\end{align*}
It also induces the set-valued map on subsets
\begin{align*}
\partial\phi:\mathcal P(U)\to \mathcal P(\mathbb R^n)
\end{align*}
defined, for $E \subset U$, by
\begin{align*}
\partial\phi(E) := \bigcup_{x\in E}\partial\phi(x).
\end{align*}
[/definition]
At points where $\phi$ is differentiable, the subdifferential consists of the single vector $\nabla\phi(x)$. At corners it has positive-dimensional size, so the next step is to turn the set of all slopes over $E$ into a measure on the original domain.
[definition: Alexandrov Measure]
Let $U \subset \mathbb R^n$ be open and convex, and let $\phi:U\to\mathbb R$ be convex. The Alexandrov measure associated to $\phi$ is the Borel measure
\begin{align*}
M_\phi: \mathcal B(U) \to [0,\infty]
\end{align*}
defined by
\begin{align*}
M_\phi(E) := \mathcal L^n(\partial\phi(E))
\end{align*}
for every Borel set $E\subset U$.
[/definition]
The standard measurability theorem for subgradient images of convex functions is what makes the displayed set function a Borel measure; this measure is also called the Monge-Ampere measure of $\phi$. This definition turns the determinant of the Hessian into a measure. The necessary consistency check is the smooth case: if $\phi\in C^2(U)$ is convex and $\nabla\phi$ is injective on a Borel set $E\subset U$, then the [area formula](/theorems/3075) gives
\begin{align*}
M_\phi(E)=\mathcal L^n(\nabla\phi(E))=\int_E \det D^2\phi(x)\,d\mathcal L^n(x).
\end{align*}
This smooth consistency check explains why the Alexandrov measure deserves to be called the weak Monge-Ampere measure. The injectivity assumption is a hypothesis for this stated version, because it lets the area formula count image volume without a multiplicity term. It is not meant to say that convex gradients must be injective for the smooth Alexandrov identity to have a meaningful degenerate form. For gradients of convex functions, failure of injectivity is not a folding phenomenon like a nonconvex map; it comes from flat or degenerate directions, and those directions also force the Hessian determinant to vanish. For example, the convex potential $\phi(x_1,x_2)=x_1^2/2$ has $\nabla\phi(x_1,x_2)=(x_1,0)$, so vertical line segments collapse to the same target point and the Hessian determinant vanishes. More general versions of the smooth Alexandrov identity can handle such degeneracies by the area formula with multiplicity or by approximation, but the injective formulation isolates the mechanism needed in the course: subgradient image volume replaces Hessian determinant volume. Convexity is also doing work, because it identifies the subdifferential with a monotone gradient and removes the orientation sign from the Jacobian determinant. Smoothness is a separate hypothesis: for $\phi(x)=|x|$ the subdifferential at $0$ is the interval $[-1,1]$, so a point contributes positive Alexandrov mass even though there is no classical Hessian determinant at that point.
The smooth identity does not yet solve the nonsmooth case; it only checks that the new definition agrees with the classical determinant when the gradient behaves well on the chosen set. It also does not include the density ratio $\rho_0/\rho_1$, does not impose a target support, and does not by itself produce a transport map. Its role is calibration: it identifies $M_\phi$ as the correct weak replacement for $\det D^2\phi\,d\mathcal L^n$. Once the determinant has been recast as the size of a slope image, corners and other nonsmooth features can contribute mass without requiring pointwise second derivatives, and the actual weak transport equation can be stated by combining this geometry with the pushforward identity.
[example: A Corner Creates Monge-Ampere Mass]
Let $\phi(x)=|x|$ on $\mathbb R$. For $x>0$ we have $\phi(x)=x$, so $\phi'(x)=1$ and hence $\partial\phi(x)=\{1\}$. For $x<0$ we have $\phi(x)=-x$, so $\phi'(x)=-1$ and hence $\partial\phi(x)=\{-1\}$.
At the corner $x=0$, a number $p\in\mathbb R$ belongs to $\partial\phi(0)$ exactly when
\begin{align*}
|z| \ge pz \text{ for every } z\in\mathbb R.
\end{align*}
Taking $z>0$ gives $z\ge pz$, so $p\le 1$. Taking $z<0$ gives $-z\ge pz$, and division by the negative number $z$ gives $p\ge -1$. Conversely, if $-1\le p\le 1$, then for $z>0$ we have $pz\le z=|z|$, while for $z<0$ we have $pz\le -z=|z|$, and for $z=0$ both sides are $0$. Thus
\begin{align*}
\partial\phi(0)=[-1,1].
\end{align*}
By the definition of the Alexandrov measure,
\begin{align*}
M_\phi(\{0\})=\mathcal L^1(\partial\phi(\{0\}))=\mathcal L^1([-1,1])=1-(-1)=2.
\end{align*}
Away from the origin, $\phi$ is affine on each side, so $\phi''(x)=0$ for $x\ne 0$; the Monge-Ampere mass is concentrated at the nonsmooth point because the single point $\{0\}$ has a subgradient image of positive length.
[/example]
For optimal transport, the weak equation must include the target density rather than only the Lebesgue size of the subgradient image. This motivates a formulation in which the convex potential solves the transport equation by pushing the source measure to the target measure.
[definition: Transport Alexandrov Solution of the Monge-Ampere Equation]
Let $U\subset\mathbb R^n$ be an open convex set, let $V\subset\mathbb R^n$ be a Borel set, let $\mu=\rho_0\,d\mathcal L^n$ be a measure on $U$, and let $\nu=\rho_1\,d\mathcal L^n$ be a measure on $V$. A convex function $\phi:U\to\mathbb R$ is a transport Alexandrov solution of the Monge-Ampere equation from $\mu$ to $\nu$ if $\phi$ is differentiable $\mu$-a.e. and the a.e. defined map $\nabla\phi:U\to V$ satisfies
\begin{align*}
(\nabla\phi)_\#\mu = \nu.
\end{align*}
[/definition]
Equivalently, the pushforward condition says that for every Borel set $A\subset V$,
\begin{align*}
\mu((\nabla\phi)^{-1}(A))=\nu(A).
\end{align*}
This definition deliberately uses the almost-everywhere single-valued gradient rather than a naive pushforward of the set-valued subdifferential. That distinction matters because a nonsmooth point may have a subdifferential meeting several target sets, so counting all intersections would not define an ordinary transported measure. In Brenier applications the source is absolutely continuous, convex functions on open convex domains are differentiable $\mu$-a.e., and the optimal map is exactly this a.e. gradient.
The terminology here is deliberately qualified by the word "transport". In the classical Alexandrov theory, the central object is the Monge-Ampere measure $M_\phi(E)=\mathcal L^n(\partial\phi(E))$ and a solution is described by an identity for that measure. The definition above is not that general classical definition; it is the absolutely continuous transport formulation where the subgradient is represented by the a.e. Brenier map. The remaining question is whether the convex potential supplied by Brenier's theorem satisfies this transport version of the weak equation, and whether the smooth determinant equation reappears once extra differentiability and invertibility hypotheses are imposed. The next theorem answers both parts, turning the existence theorem for optimal maps into the weak Monge-Ampere statement that later regularity theory can study.
[quotetheorem:7481]
[citeproof:7481]
This result is not a new existence theorem; it is a reinterpretation of Brenier's theorem as a weak PDE statement. Absolute continuity of $\mu$ is the hypothesis that makes the gradient formulation legitimate, because the nondifferentiability set of a convex function has zero $\mathcal L^n$-measure and hence zero $\mu$-measure. If the source had an atom at a corner of $\phi$, the subdifferential could be genuinely multivalued on a set of positive source mass, and the a.e. gradient map would no longer describe all transport choices.
The theorem also does not claim regularity of $\phi$. Brenier gives the weak transport identity, while smoothness and injectivity are extra assumptions needed before the determinant equation can be read pointwise. Without $C^2$ regularity, the corner example $\phi(x)=|x|$ has Alexandrov mass at $0$ but no pointwise Hessian determinant there. Without the diffeomorphism hypothesis, even a smooth gradient can collapse directions, as in the convex potential $\phi(x_1,x_2)=x_1^2/2$ whose gradient sends all points with the same $x_1$ to the same target point. This is why the PDE perspective naturally splits into existence of weak Alexandrov solutions and separate regularity theorems for when those solutions become classical.
## The Second Boundary Value Problem
The Monge-Ampere equation alone does not determine the correct transport map. Even for a fixed density ratio, different convex potentials may solve the same local determinant equation while sending mass to different regions. The transport problem therefore imposes a second boundary value condition: the gradient image of the source support must lie in the target support and carry the source mass onto the target mass.
[definition: Second Boundary Value Condition]
Let $U,V\subset\mathbb R^n$ be domains and let $\phi\in C^1(U)$ be convex, so that $\nabla\phi:U\to\mathbb R^n$ is defined as a classical map. The second boundary value condition for the transport Monge-Ampere equation is
\begin{align*}
\nabla\phi(U)=V
\end{align*}
in the classical smooth setting.
[/definition]
The word ``second'' distinguishes this condition from prescribing $\phi$ on the boundary. Here the boundary data are imposed on the image of the gradient. Containment $\nabla\phi(U)\subset V$ prevents mass from leaving the target region, while equality in the smooth setting expresses that the target is not merely touched but filled.
In weak transport settings, where $\nabla\phi$ may only be defined $\mu$-a.e., the condition is interpreted as $\nabla\phi(x)\in V$ for $\mu$-a.e. $x$ together with the mass condition $(\nabla\phi)_\#\mu=\nu$. This pushforward identity fills the target up to sets of $\nu$-measure zero. For set-valued Alexandrov formulations, the analogous support condition is imposed on $\partial\phi$ modulo negligible sets.
The preceding definition isolates the global information that the determinant equation cannot see. To use the PDE in practice, the workflow is therefore: first identify the Brenier potential $\phi$, then compute $\det D^2\phi$ wherever the potential is smooth, and finally check that the gradient image fills the target in the sense required by the transport problem. The next theorem makes this workflow precise in the classical regime by proving that the determinant equation plus the image condition is equivalent to the pushforward identity.
[quotetheorem:7482]
[citeproof:7482]
The theorem packages the PDE and boundary condition into a form that exactly matches mass transport on the prescribed target measure space. The displayed image condition is the measure-theoretic version forced by the pushforward identity; because $\rho_1$ is positive, it is equivalent here to $\mathcal L^n(V\setminus T(U))=0$. In the classical second boundary value problem it is usually strengthened to the geometric condition $\nabla\phi(U)=V$. The equality of total masses is necessary before any pushforward relation can exist: a probability density cannot be transported onto a target density with different total mass. Positivity prevents division by zero in the density ratio, convexity keeps the map in the Brenier class, and the diffeomorphism assumption is what allows the ordinary change-of-variables formula without multiplicity. Boundedness is not the conceptual heart of the theorem, but it avoids escape of mass and boundary pathologies in this smooth formulation.
The theorem does not assert that every weak solution is smooth, nor that the second boundary value problem is automatically solvable for arbitrary rough domains or densities. It says that, in the classical regime, the local determinant equation plus the global gradient-image condition is exactly the pushforward identity. This is the point where optimal transport meets elliptic regularity: later results ask which assumptions on $U$, $V$, $\rho_0$, and $\rho_1$ upgrade Alexandrov solutions to smooth solutions.
[example: Transport from a Disk to an Ellipse by a Linear Brenier Map]
Let $U=B(0,1)\subset\mathbb R^2$, let $A=\operatorname{diag}(a,b)$ with $a,b>0$, and set $V=A(U)$. For $x=(x_1,x_2)$, the potential
\begin{align*}
\phi(x)=\frac{1}{2}x\cdot Ax=\frac{1}{2}(a x_1^2+b x_2^2)
\end{align*}
has partial derivatives
\begin{align*}
\partial_{x_1}\phi(x)=a x_1,\qquad \partial_{x_2}\phi(x)=b x_2.
\end{align*}
Hence
\begin{align*}
\nabla\phi(x)=(a x_1,b x_2)=Ax.
\end{align*}
Its second derivatives are $\partial_{x_1x_1}\phi=a$, $\partial_{x_2x_2}\phi=b$, and $\partial_{x_1x_2}\phi=\partial_{x_2x_1}\phi=0$, so $D^2\phi=A$ and
\begin{align*}
\det D^2\phi=ab-0\cdot 0=ab.
\end{align*}
Since $h\cdot D^2\phi\,h=a h_1^2+b h_2^2\ge 0$ for every $h=(h_1,h_2)$, the potential is convex.
The unit disk has area $\pi$, so the uniform probability density on $U$ is
\begin{align*}
\rho_0=\frac{1}{\pi}.
\end{align*}
The linear map $A$ stretches area by the factor $\det A=ab$, so $V=A(U)$ has area $ab\pi$, and the uniform probability density on $V$ is
\begin{align*}
\rho_1=\frac{1}{ab\pi}.
\end{align*}
For every $x\in U$, the image point $\nabla\phi(x)=Ax$ lies in $V$, and therefore
\begin{align*}
\frac{\rho_0(x)}{\rho_1(\nabla\phi(x))}=\frac{1/\pi}{1/(ab\pi)}=ab.
\end{align*}
Thus
\begin{align*}
\det D^2\phi(x)=\frac{\rho_0(x)}{\rho_1(\nabla\phi(x))}
\end{align*}
for every $x\in U$.
Finally,
\begin{align*}
\nabla\phi(U)=\{Ax:x\in B(0,1)\}=A(U)=V.
\end{align*}
So the determinant equation holds with the correct density ratio, and the gradient image fills the target ellipse; this linear Brenier map solves the smooth second boundary value problem.
[/example]
The support condition is often written in a slightly weaker form when measures are not supported on open domains. The map must send the source support into the target support, and the pushforward identity then guarantees that the correct amount of mass fills the target.
[remark: Support Form of the Boundary Condition]
For compactly supported probability measures $\mu$ and $\nu$, the boundary condition is often expressed as
\begin{align*}
\nabla\phi(\operatorname{supp}\mu) \subset \operatorname{supp}\nu
\end{align*}
up to sets of $\mu$-measure zero. In the Alexandrov formulation this becomes $\partial\phi(\operatorname{supp}\mu)\subset \operatorname{supp}\nu$, again understood modulo the sets that do not affect the transported measure.
[/remark]
This chapter completes the passage from Chapter 6's Brenier structure theorem to the PDE behind it. The map is optimal because it is the gradient of a convex potential; the potential solves Monge-Ampere because its gradient changes volume according to the density ratio; and the second boundary value condition remembers the target set. Later regularity theory asks when the weak Alexandrov solution produced by transport is smooth enough for the formal PDE to hold pointwise.
The Monge-Ampere equation explains how an optimal map reshapes density, but the transport cost itself can also be used to compare measures directly. That perspective leads naturally to Wasserstein distances, where optimal transport becomes a metric theory.
# 8. Wasserstein Distances
This chapter turns the value of an optimal transport problem into a distance between probability measures. Chapters 2 through 5 used couplings to minimise a cost, and compactness, duality, or cyclical monotonicity to prove existence and optimality of plans. Here the cost is a power of the underlying metric, so the resulting quantity measures both displacement and distributional convergence.
The main point is that Wasserstein distance is not merely a formula: on probability measures with finite moment it is a genuine metric. Its topology records weak convergence together with convergence of moments, which is why Wasserstein spaces are useful as geometric refinements of the usual weak topology.
## The Finite Moment Space
A transport cost of the form $d(x,y)^p$ may be infinite unless the measures have enough integrability. The first question is therefore not how to define a distance, but which probability measures should be allowed as points of the space.
[definition: Finite p-Moment Space]
Let $(X,d)$ be a metric space and let $1 \le p < \infty$. The finite $p$-moment space $\mathcal P_p(X)$ is the set of Borel probability measures $\mu \in \mathcal P(X)$ such that, for some $x_0 \in X$,
\begin{align*}
\int_X d(x,x_0)^p \,d\mu(x) < \infty.
\end{align*}
[/definition]
The choice of base point is part of the test but not part of the resulting notion. If the integral is finite for one $x_0$, then it is finite for every $x_1 \in X$, by the triangle inequality and the estimate $d(x,x_1)^p \le 2^{p-1}(d(x,x_0)^p+d(x_0,x_1)^p)$.
[example: Dirac Measures Have Finite Moment]
Fix $a\in X$ and choose any base point $x_0\in X$. By the defining property of the Dirac measure, integration against $\delta_a$ evaluates the integrand at $a$, so for the nonnegative function $x\mapsto d(x,x_0)^p$ we have
\begin{align*}
\int_X d(x,x_0)^p\,d\delta_a(x)=d(a,x_0)^p.
\end{align*}
Since $d(a,x_0)$ is a finite real number in a metric space and $1\le p<\infty$, the number $d(a,x_0)^p$ is finite. Hence $\delta_a\in\mathcal P_p(X)$.
Thus every point $a\in X$ determines a probability measure $\delta_a\in\mathcal P_p(X)$, giving the canonical embedding $a\mapsto\delta_a$ of the original metric space into the finite moment space.
[/example]
Having identified the correct class of measures, we can use the Kantorovich problem with cost $d^p$. The $p$th root is included so that the resulting quantity has the same homogeneity as the underlying distance.
[definition: Wasserstein Distance]
Let $(X,d)$ be a metric space and let $1 \le p < \infty$. The $p$-Wasserstein distance is the map
\begin{align*}
W_p: \mathcal P_p(X) \times \mathcal P_p(X) \to [0,\infty)
\end{align*}
defined, for $\mu,\nu \in \mathcal P_p(X)$, by
\begin{align*}
W_p(\mu,\nu)
= \left(\inf_{\pi \in \Pi(\mu,\nu)} \int_{X\times X} d(x,y)^p \,d\pi(x,y)\right)^{1/p}.
\end{align*}
[/definition]
The finite-moment condition guarantees that this expression is finite. For instance, if $\pi=\mu\otimes\nu$, then $d(x,y)^p$ is bounded above by a constant times $d(x,x_0)^p+d(y,x_0)^p$, and both terms are integrable.
[example: Wasserstein Distance Between Dirac Masses]
Let $a,b\in X$. We first identify the admissible couplings. If $\pi\in\Pi(\delta_a,\delta_b)$, then its first marginal is $\delta_a$ and its second marginal is $\delta_b$, so
\begin{align*}
\pi(\{a\}\times X)=\delta_a(\{a\})=1.
\end{align*}
Also,
\begin{align*}
\pi(X\times\{b\})=\delta_b(\{b\})=1.
\end{align*}
Since both events have $\pi$-measure $1$, their intersection has $\pi$-measure $1$:
\begin{align*}
\pi(\{a\}\times\{b\})=1.
\end{align*}
Thus $\pi=\delta_{(a,b)}$, so $\Pi(\delta_a,\delta_b)=\{\delta_{(a,b)}\}$.
Therefore the infimum in the definition of $W_p$ is taken over a single coupling:
\begin{align*}
W_p(\delta_a,\delta_b)^p=\int_{X\times X} d(x,y)^p\,d\delta_{(a,b)}(x,y).
\end{align*}
Integration against a Dirac mass evaluates the integrand at its support point, hence
\begin{align*}
\int_{X\times X} d(x,y)^p\,d\delta_{(a,b)}(x,y)=d(a,b)^p.
\end{align*}
Taking the $p$th root gives
\begin{align*}
W_p(\delta_a,\delta_b)=\bigl(d(a,b)^p\bigr)^{1/p}=d(a,b),
\end{align*}
because $d(a,b)\ge0$. Thus the map $a\mapsto\delta_a$ preserves distances, so it embeds $X$ isometrically into $\mathcal P_p(X)$.
[/example]
This calculation is the basic consistency check: Wasserstein distance extends the original geometry of $X$ rather than replacing it with an unrelated metric on measures.
## Metric Properties of Wasserstein Distance
The next problem is to prove that $W_p$ satisfies the metric axioms. Symmetry and non-negativity follow from the definition, but the triangle inequality requires a way to combine two transport plans sharing a common marginal. We use the transport [gluing lemma](/theorems/1871) in the following form: if $X,Y,Z$ are Polish spaces, $\pi_{12}\in\mathcal P(X\times Y)$ and $\pi_{23}\in\mathcal P(Y\times Z)$ have the same $Y$-marginal, then there exists $\gamma\in\mathcal P(X\times Y\times Z)$ whose $(X,Y)$-marginal is $\pi_{12}$ and whose $(Y,Z)$-marginal is $\pi_{23}$.
This lemma says that two compatible pairwise plans can be realised on a single probability space, but it does not say that the original two plans are independent or that the resulting measure $\gamma$ is unique. The common $Y$-marginal is essential: if the $Y$-marginal of $\pi_{12}$ differs from the $Y$-marginal of $\pi_{23}$, then no measure on $X\times Y\times Z$ can have both prescribed pairwise marginals. The Polish hypothesis is the regularity condition that gives the disintegrations used in the proof; without such [conditional probability](/page/Conditional%20Probability) kernels, the displayed construction of $\gamma$ need not be available in this form. The next result needs exactly this mechanism: to prove that $W_p$ is a metric, the only non-formal axiom is the triangle inequality, and that axiom comes from gluing two transport legs into one joint plan.
[quotetheorem:7483]
[citeproof:7483]
The diagonal argument for separation uses the idea that zero transport cost forces all mass to travel zero distance. The gluing argument is more structural: it is the transport analogue of concatenating paths in the base metric space. The assumption $p\ge1$ is needed because the proof uses Minkowski's inequality after taking the $p$th root; for $0<p<1$, the same formula with exponent $1/p$ does not generally satisfy the triangle inequality. The Polish hypothesis is used to access gluing and compactness arguments for probability measures; the theorem is not asserting the same proof works on arbitrary measurable spaces without this regularity. The result also does not claim that optimal plans are unique, only that the infimum defines a distance between measures.
[example: Triangle Inequality for Three Dirac Masses]
Let $a,b,c\in X$. By the Wasserstein triangle inequality from the metric theorem just proved, applied to the three measures $\delta_a,\delta_b,\delta_c\in\mathcal P_p(X)$, we have
\begin{align*}
W_p(\delta_a,\delta_c)\le W_p(\delta_a,\delta_b)+W_p(\delta_b,\delta_c).
\end{align*}
For Dirac masses, the preceding computation gives $W_p(\delta_x,\delta_y)=d(x,y)$ for any $x,y\in X$. Substituting $(x,y)=(a,c)$, $(a,b)$, and $(b,c)$ into this identity gives
\begin{align*}
W_p(\delta_a,\delta_c)=d(a,c).
\end{align*}
Also,
\begin{align*}
W_p(\delta_a,\delta_b)=d(a,b).
\end{align*}
And
\begin{align*}
W_p(\delta_b,\delta_c)=d(b,c).
\end{align*}
Replacing each Wasserstein term in the triangle inequality by the corresponding distance term yields
\begin{align*}
d(a,c)\le d(a,b)+d(b,c).
\end{align*}
Thus, on the embedded copy $a\mapsto\delta_a$, the Wasserstein triangle inequality is exactly the original triangle inequality in $X$.
[/example]
For $p=1$, the formula already resembles the minimal average distance moved. For larger $p$, long moves are penalized more strongly, which makes moment control an essential part of the topology.
## Wasserstein Convergence and Weak Convergence
Once $W_p$ is a metric, the next question is what convergence in this metric means. Since $W_p$ is defined by moving mass over the metric space, it should imply weak convergence, but it also remembers the behaviour of mass far from a base point.
[definition: Weak Convergence of Probability Measures]
Let $X$ be a metric space. A sequence $(\mu_n)_{n\ge1}$ in $\mathcal P(X)$ converges weakly to $\mu\in\mathcal P(X)$ if
\begin{align*}
\int_X f\,d\mu_n \to \int_X f\,d\mu
\end{align*}
for every bounded [continuous function](/page/Continuous%20Function) $f:X\to\mathbb R$.
[/definition]
Weak convergence tests bounded continuous functions, so it cannot by itself detect a small amount of mass escaping to infinity. Since $W_p$ charges long travel distances with a $p$th power, convergence in transport also has to rule out hidden $p$-moment loss in the tails. The missing condition is therefore not another bounded test, but control of the $p$th moment about one base point.
This raises the main structural question for the Wasserstein topology: whether weak convergence plus one scalar moment condition is merely a necessary consequence of $W_p$ convergence, or whether it fully determines the metric convergence. A usable criterion should avoid constructing optimal couplings every time and instead reduce $W_p$ convergence to conditions that can be checked directly on distributions and tails.
The precise characterization gives exactly this equivalence. On a Polish metric space, weak convergence together with convergence of one $p$th moment recovers the full transport topology, giving a practical test for proving $W_p$ convergence.
This criterion is needed because the definition of $W_p$ is infimum-based and coupling-dependent, while many limiting arguments naturally provide only weak convergence and tail estimates. The theorem below identifies exactly when those distributional and moment data are enough to reconstruct convergence in the transport metric.
[quotetheorem:7484]
[citeproof:7484]
This theorem should be read as a topology statement: $W_p$ convergence is weak convergence plus no loss of $p$th moment at infinity. The moment condition cannot be omitted: a small amount of mass can move farther and farther away, as in the example below, and bounded continuous test functions do not detect its full transport cost. The Polish hypothesis is what permits the compactness, tightness, and representation tools used in the proof; on a general measurable space there is no comparable metric structure on which to formulate this characterization in the same way. The theorem does not say that weak convergence alone controls transportation, nor does it say that convergence of moments alone controls the limiting distribution. The base point $x_0$ does not affect the condition, because finite moments about different base points are equivalent through the triangle inequality.
[example: Weak Convergence Without Wasserstein Convergence]
Work on $X=\mathbb R$ with its usual distance, fix $1\le p<\infty$, and define
\begin{align*}
\mu_n=\left(1-\frac{1}{n}\right)\delta_0+\frac{1}{n}\delta_n.
\end{align*}
We show that $\mu_n$ converges weakly to $\delta_0$, but not in $W_p$. Let $f:\mathbb R\to\mathbb R$ be bounded and continuous. Since integration against a Dirac mass evaluates the integrand at the support point,
\begin{align*}
\int_{\mathbb R} f\,d\mu_n=\left(1-\frac{1}{n}\right)f(0)+\frac{1}{n}f(n).
\end{align*}
Subtracting the limiting value $f(0)=\int_{\mathbb R}f\,d\delta_0$ gives
\begin{align*}
\int_{\mathbb R} f\,d\mu_n-f(0)=\frac{1}{n}f(n)-\frac{1}{n}f(0)=\frac{f(n)-f(0)}{n}.
\end{align*}
If $|f(x)|\le M$ for all $x$, then
\begin{align*}
\left|\int_{\mathbb R} f\,d\mu_n-f(0)\right|\le \frac{|f(n)|+|f(0)|}{n}\le \frac{2M}{n}.
\end{align*}
Since $2M/n\to0$, we have $\int f\,d\mu_n\to\int f\,d\delta_0$ for every bounded continuous $f$, so $\mu_n$ converges weakly to $\delta_0$.
Now compute the $p$th moment about $0$. Again using the two Dirac masses in the definition of $\mu_n$,
\begin{align*}
\int_{\mathbb R}|x|^p\,d\mu_n(x)=\left(1-\frac{1}{n}\right)|0|^p+\frac{1}{n}|n|^p.
\end{align*}
Since $|0|^p=0$ and $|n|^p=n^p$ for $n\ge1$,
\begin{align*}
\int_{\mathbb R}|x|^p\,d\mu_n(x)=0+\frac{n^p}{n}=n^{p-1}.
\end{align*}
For the proposed limit $\delta_0$,
\begin{align*}
\int_{\mathbb R}|x|^p\,d\delta_0(x)=|0|^p=0.
\end{align*}
If $p>1$, then $n^{p-1}\to\infty$, so the moments do not converge to $0$. If $p=1$, then $n^{p-1}=1$ for every $n$, so the moments still do not converge to $0$. Therefore the moment condition in *[Wasserstein Convergence Characterization](/theorems/7484)* fails, and $\mu_n$ does not converge to $\delta_0$ in $W_p$.
The escaping atom has mass $1/n$, which is invisible to bounded weak tests in the limit, but its distance from the origin is $n$, so its contribution to the $p$th moment is $n^{p-1}$.
[/example]
The example isolates the role of moment convergence. Weak convergence sees that the escaping mass has small probability, while $W_p$ also measures the distance travelled by that mass.
[remark: Monotonicity in p]
If $1\le p\le q<\infty$ and $\mu,\nu\in\mathcal P_q(X)$, then
\begin{align*}
W_p(\mu,\nu) \le W_q(\mu,\nu).
\end{align*}
This follows by applying Lyapunov's inequality to the random variable $d(x,y)$ under any coupling and then taking infima.
[/remark]
This monotonicity matches the intuition that larger exponents impose stronger control on long transport distances. It also explains why convergence in $W_q$ implies convergence in $W_p$ when the $q$th moments are finite.
## One-Dimensional Quantile Formula
The final topic in this chapter is a computable model case. On the real line, the order structure identifies a canonical optimal plan: match equal quantiles of the two measures.
[definition: Quantile Function]
Let $\mu\in\mathcal P(\mathbb R)$. The distribution function of $\mu$ is the map
\begin{align*}
F_\mu: \mathbb R \to [0,1], \qquad F_\mu(t)=\mu(({-}\infty,t]).
\end{align*}
The quantile function of $\mu$ is the map $F_\mu^{-1}:(0,1)\to\mathbb R$ defined by
\begin{align*}
F_\mu^{-1}(s)=\inf\{t\in\mathbb R: F_\mu(t)\ge s\}.
\end{align*}
[/definition]
Quantiles convert a probability measure into an increasing function on $(0,1)$. This suggests a specific coupling: draw one uniform random variable and use it to sample both measures at the same quantile level. The next theorem proves that, on the line, this monotone coupling minimizes every cost $|x-y|^p$ with $p\ge1$.
[quotetheorem:7485]
[citeproof:7485]
The one-dimensional order structure is essential here. On $\mathbb R$, any crossing in a transport plan can be uncrossed without increasing the convex cost, so monotone matching is optimal. In higher dimensions there is no total order on points, and an optimal plan cannot usually be recovered by sorting a single coordinate or matching scalar quantiles. The hypothesis $\mu,\nu\in\mathcal P_p(\mathbb R)$ is also essential: it guarantees that the displayed integral is finite and that the formula gives a value of $W_p$ rather than $\infty$. The theorem does not claim uniqueness of optimal transport in every one-dimensional situation, but it does identify a canonical optimal coupling.
For $p=2$, this gives the especially useful identity
\begin{align*}
W_2(\mu,\nu)^2
= \int_0^1 |F_\mu^{-1}(s)-F_\nu^{-1}(s)|^2\,ds.
\end{align*}
Thus $W_2$ on the line is the $L^2(0,1)$ distance between quantile functions.
[example: Wasserstein Two Between Uniform Intervals]
Let $\mu=\operatorname{Unif}(0,1)$ and $\nu=\operatorname{Unif}(a,a+b)$ with $b>0$. For $0<s<1$, the quantile functions are $F_\mu^{-1}(s)=s$ and $F_\nu^{-1}(s)=a+bs$, so *One-Dimensional Wasserstein Formula* with $p=2$ gives
\begin{align*}
W_2(\mu,\nu)^2=\int_0^1 |F_\mu^{-1}(s)-F_\nu^{-1}(s)|^2\,ds.
\end{align*}
Substituting the two quantile functions gives
\begin{align*}
W_2(\mu,\nu)^2=\int_0^1 |s-(a+bs)|^2\,ds.
\end{align*}
Since $s-(a+bs)=s-a-bs=(1-b)s-a$, this becomes
\begin{align*}
W_2(\mu,\nu)^2=\int_0^1 |(1-b)s-a|^2\,ds.
\end{align*}
The integrand is real-valued, so $|r|^2=r^2$ for $r=(1-b)s-a$. Expanding the square,
\begin{align*}
\bigl((1-b)s-a\bigr)^2=(1-b)^2s^2-2a(1-b)s+a^2.
\end{align*}
Therefore
\begin{align*}
W_2(\mu,\nu)^2=\int_0^1 \bigl((1-b)^2s^2-2a(1-b)s+a^2\bigr)\,ds.
\end{align*}
By linearity of the integral,
\begin{align*}
W_2(\mu,\nu)^2=(1-b)^2\int_0^1 s^2\,ds-2a(1-b)\int_0^1 s\,ds+a^2\int_0^1 1\,ds.
\end{align*}
The three elementary integrals are
\begin{align*}
\int_0^1 s^2\,ds=\frac{1}{3}.
\end{align*}
Also,
\begin{align*}
\int_0^1 s\,ds=\frac{1}{2}.
\end{align*}
And
\begin{align*}
\int_0^1 1\,ds=1.
\end{align*}
Substituting these values,
\begin{align*}
W_2(\mu,\nu)^2=(1-b)^2\frac{1}{3}-2a(1-b)\frac{1}{2}+a^2.
\end{align*}
Since $2\cdot \frac{1}{2}=1$, this is
\begin{align*}
W_2(\mu,\nu)^2=\frac{(1-b)^2}{3}-a(1-b)+a^2.
\end{align*}
The term involving $a$ records the translation of the interval, while the factor $1-b$ records the change of scale from length $1$ to length $b$.
[/example]
The quantile formula is the one-dimensional prototype for later optimal maps. It shows that in special geometries the abstract infimum over couplings can collapse to an explicit transport rule.
Wasserstein distances turn the space of probability measures into a geometric object with its own curves and tangent directions. Having built that metric structure, we can now study geodesics, curvature, and the global geometry of Wasserstein spaces.
# 9. Geometry of Wasserstein Spaces
Chapter 8 constructed the Wasserstein distances by minimising transport cost over couplings, while Chapter 6 identified quadratic Euclidean optimal plans through convex potentials. This chapter changes perspective: the collection of probability measures is treated as a metric space in its own right. The central questions are whether two measures can be joined by a shortest path, how such paths are built from optimal transport plans, and when the path is uniquely determined by a transport map.
## Constant-Speed Geodesics from Optimal Plans
The Wasserstein distance is defined by an optimization problem, but a metric space becomes geometrically useful only when its distances can be realized by curves. The first problem is therefore to turn an optimal coupling between two endpoint measures into a path of intermediate measures whose metric speed is constant.
[definition: Constant-Speed Geodesic]
Let $(X,d)$ be a metric space. A curve $\nu:[0,1]\to X$ is a constant-speed geodesic from $\nu_0$ to $\nu_1$ if $\nu(0)=\nu_0$, $\nu(1)=\nu_1$, and
\begin{align*}
d(\nu(s),\nu(t))=|t-s|d(\nu_0,\nu_1)
\end{align*}
for all $s,t\in[0,1]$.
[/definition]
This definition gives the target identity for a shortest path, but it does not explain how to build such a path between measures. An endpoint coupling records only where each unit of mass starts and ends; it does not record which intermediate route that mass takes. To turn endpoint transport into a curve $t\mapsto \mu_t$, we need an object whose points are whole base-space geodesics and whose time evaluations produce the intermediate marginals. This is the purpose of a dynamical transport plan.
[definition: Dynamical Transport Plan]
Let $(X,d)$ be a metric space and let $\operatorname{Geo}(X)$ denote the set of constant-speed geodesics $\gamma:[0,1]\to X$, equipped with the Borel structure inherited from the uniform topology on $C([0,1];X)$. For each $t\in[0,1]$, let $e_t:\operatorname{Geo}(X)\to X$ be the Borel evaluation map $e_t(\gamma)=\gamma(t)$. A dynamical transport plan is a probability measure $\Pi$ on $\operatorname{Geo}(X)$. Its time-$t$ marginal is $(e_t)_\#\Pi$.
[/definition]
The endpoint map $(e_0,e_1):\operatorname{Geo}(X)\to X\times X$ turns a dynamical plan into the ordinary transport plan $(e_0,e_1)_\#\Pi$. This makes the next question precise: if the endpoint plan is optimal and the chosen trajectories are shortest paths in the base space, does the family of time marginals realize the Wasserstein distance at constant speed?
[quotetheorem:7486]
[citeproof:7486]
The completeness and separability hypotheses provide the measure-theoretic setting in which optimal plans and measurable geodesic selections behave well; the geodesic hypothesis is the geometric ingredient that supplies the particle paths. Without geodesics in the base space, there may be no shortest curve along which to move a particle from $x$ to $y$, so the construction can fail before any Wasserstein estimate is attempted. For example, in $X=\{0,1\}$ with the inherited Euclidean distance and $p>1$, there is no constant-speed Wasserstein geodesic from $\delta_0$ to $\delta_1$. A candidate curve must have the form
\begin{align*}
\mu_t=(1-m(t))\delta_0+m(t)\delta_1.
\end{align*}
The constant-speed identities would require
\begin{align*}
W_p(\delta_0,\mu_t)=t,
\qquad
W_p(\mu_t,\delta_1)=1-t,
\end{align*}
which force
\begin{align*}
m(t)=t^p=1-(1-t)^p.
\end{align*}
This is impossible for intermediate $t$. The theorem also does not say that the resulting geodesic is unique: different optimal endpoint plans or different choices of base geodesics may produce different dynamical plans. What it gives is the basic existence mechanism that later Euclidean and nonbranching theories refine into explicit formulas and uniqueness statements.
[example: Isometric Embedding by Dirac Masses]
Let $(X,d)$ be a geodesic metric space, fix $x,y\in X$, and take $\mu_0=\delta_x$ and $\mu_1=\delta_y$. Any coupling $\pi\in\Pi(\delta_x,\delta_y)$ satisfies $\pi(\{x\}\times X)=1$ and $\pi(X\times\{y\})=1$, hence $\pi(\{(x,y)\})=1$; therefore the only coupling is $\delta_{(x,y)}$. Thus, for $1\le p<\infty$,
\begin{align*}
W_p(\delta_x,\delta_y)^p=\int_{X\times X} d(u,v)^p\,d\delta_{(x,y)}(u,v)=d(x,y)^p.
\end{align*}
Taking $p$th roots gives $W_p(\delta_x,\delta_y)=d(x,y)$.
Now let $\gamma:[0,1]\to X$ be a constant-speed geodesic from $x$ to $y$, and define $\mu_t=\delta_{\gamma(t)}$. Repeating the same Dirac-mass calculation at times $s,t\in[0,1]$ gives
\begin{align*}
W_p(\mu_s,\mu_t)^p=\int_{X\times X} d(u,v)^p\,d\delta_{(\gamma(s),\gamma(t))}(u,v)=d(\gamma(s),\gamma(t))^p.
\end{align*}
Since $\gamma$ is constant speed,
\begin{align*}
d(\gamma(s),\gamma(t))=|t-s|d(\gamma(0),\gamma(1))=|t-s|d(x,y).
\end{align*}
Therefore
\begin{align*}
W_p(\mu_s,\mu_t)=|t-s|d(x,y)=|t-s|W_p(\delta_x,\delta_y).
\end{align*}
So $t\mapsto\delta_{\gamma(t)}$ is a constant-speed Wasserstein geodesic, and the map $x\mapsto\delta_x$ preserves all distances from $X$ into $(\mathcal P_p(X),W_p)$.
[/example]
This example shows that Wasserstein geometry contains the original geometry of $X$ as the subspace of Dirac masses. For non-atomic measures, the same idea persists, but a transport plan may split the mass and therefore may encode many particle paths at once.
[remark: Role of Measurable Geodesic Selection]
When $X$ is Polish and geodesic, lifting an optimal endpoint plan to a dynamical plan uses a measurable choice of a geodesic between $x$ and $y$ for $\pi$-almost every pair $(x,y)$. The course treats this selection step as part of the standard structure theory of Polish geodesic spaces. The transport estimate above is the part of the proof that carries the metric content.
[/remark]
The next step is to specialize this abstract construction to Euclidean space. There the base geodesics are line segments, so the entire dynamical description collapses to an explicit interpolation formula. This is also where the geometric picture begins to connect with analysis: a moving probability distribution can be read as a mass density transported by a velocity field, the viewpoint used later for continuity equations and Wasserstein gradient flows.
## Displacement Interpolation in Euclidean Space
In $\mathbb R^n$, a particle moving from $x$ to $y$ along a shortest path sits at $(1-t)x+ty$ at time $t$. The problem is to translate this pointwise linear motion into a formula for probability measures and then verify that it gives a Wasserstein geodesic.
[definition: Displacement Interpolation]
Let $\mu_0,\mu_1\in\mathcal P_p(\mathbb R^n)$ and let $\pi\in\Pi(\mu_0,\mu_1)$. Let $\operatorname{pr}_1,\operatorname{pr}_2:\mathbb R^n\times\mathbb R^n\to\mathbb R^n$ be the coordinate projections, defined by $\operatorname{pr}_1(x,y)=x$ and $\operatorname{pr}_2(x,y)=y$. For $t\in[0,1]$, let $F_t:\mathbb R^n\times\mathbb R^n\to\mathbb R^n$ be
\begin{align*}
F_t(x,y)=(1-t)\operatorname{pr}_1(x,y)+t\operatorname{pr}_2(x,y)=(1-t)x+ty.
\end{align*}
The displacement interpolation induced by $\pi$ is the curve
\begin{align*}
\mu_t=(F_t)_\#\pi,
\qquad 0\le t\le 1.
\end{align*}
[/definition]
The word displacement is important: the interpolation moves mass through space rather than mixing the two endpoint measures linearly. However, an arbitrary coupling can make particles take unnecessary trips, so pushing it through the affine maps may describe motion without describing a shortest path. The key question is when this interpolated motion has exactly the Wasserstein length dictated by the endpoint distance.
[quotetheorem:7487]
[citeproof:7487]
The hypothesis that $\pi$ is optimal is essential: pushing a non-optimal coupling through the same affine maps still produces a curve of measures, but its length need not equal $W_p(\mu_0,\mu_1)$. A concrete failure is the crossed self-coupling of $\frac12\delta_0+\frac12\delta_1$ with itself: it sends half the mass from $0$ to $1$ and half from $1$ to $0$, so the interpolation moves even though the endpoint Wasserstein distance is zero. The Euclidean hypothesis is also doing real work, because the formula uses the fact that line segments are constant-speed geodesics for the Euclidean distance. The theorem does not assert that every Wasserstein geodesic arises from a unique plan, nor that the chosen optimal plan is unique. Its practical value is that once one optimal endpoint plan has been found, the whole displacement geodesic is obtained by pushing that single plan forward through the affine maps $F_t$.
[example: Non-Optimal Coupling Gives a Longer Motion]
Let $\mu_0=\mu_1=\frac12\delta_0+\frac12\delta_1$ on $\mathbb R$. The diagonal plan
\begin{align*}
\pi_{\mathrm{diag}}=\frac12\delta_{(0,0)}+\frac12\delta_{(1,1)}
\end{align*}
has both marginals equal to $\mu_0$, and its $p$-cost is
\begin{align*}
\int_{\mathbb R\times\mathbb R}|x-y|^p\,d\pi_{\mathrm{diag}}(x,y)=\frac12|0-0|^p+\frac12|1-1|^p=0.
\end{align*}
Since transport costs are nonnegative, this proves $W_p(\mu_0,\mu_1)^p=0$, hence $W_p(\mu_0,\mu_1)=0$.
Now consider the crossed coupling
\begin{align*}
\pi=\frac12\delta_{(0,1)}+\frac12\delta_{(1,0)}.
\end{align*}
Its first marginal is $\frac12\delta_0+\frac12\delta_1$, and its second marginal is $\frac12\delta_1+\frac12\delta_0$, so $\pi\in\Pi(\mu_0,\mu_1)$. However, its $p$-cost is
\begin{align*}
\int_{\mathbb R\times\mathbb R}|x-y|^p\,d\pi(x,y)=\frac12|0-1|^p+\frac12|1-0|^p=1,
\end{align*}
so it is not optimal because the diagonal plan has cost $0$.
The displacement map is $F_t(x,y)=(1-t)x+ty$. On the two atoms of $\pi$,
\begin{align*}
F_t(0,1)=t
\end{align*}
and
\begin{align*}
F_t(1,0)=1-t.
\end{align*}
Therefore
\begin{align*}
\mu_t=(F_t)_\#\pi=\frac12\delta_t+\frac12\delta_{1-t}.
\end{align*}
At $t=\frac12$, both atoms land at $\frac12$, so
\begin{align*}
\mu_{1/2}=\frac12\delta_{1/2}+\frac12\delta_{1/2}=\delta_{1/2}.
\end{align*}
A constant-speed geodesic from $\mu_0$ to itself would have distance $|t-s|W_p(\mu_0,\mu_0)=0$ between every pair of times, hence would be constant. The crossed interpolation is not constant, since $\mu_{1/2}=\delta_{1/2}\ne\frac12\delta_0+\frac12\delta_1$, so it cannot be a Wasserstein geodesic from $\mu_0$ to itself.
[/example]
The preceding example is the concrete failure hidden inside the optimality hypothesis of McCann's formula: even when the two endpoint measures are identical, a crossed coupling can manufacture artificial motion. The next example returns to an optimal deterministic coupling, where every particle is assigned the same displacement and the Wasserstein path becomes especially transparent.
[example: Translation of a Density]
Let $\mu_0\in\mathcal P_p(\mathbb R^n)$ have density $\rho_0\in L^1(\mathbb R^n)$, fix $a\in\mathbb R^n$, and set $\mu_1=(\tau_a)_\#\mu_0$ for $\tau_a(x)=x+a$. With $T(x)=x+a$, the map-induced displacement interpolation is
\begin{align*}
F_t(x)=(1-t)x+tT(x)=(1-t)x+t(x+a)=x+ta.
\end{align*}
Thus
\begin{align*}
\mu_t=(F_t)_\#\mu_0=(x\mapsto x+ta)_\#\mu_0.
\end{align*}
To identify the density of $\mu_t$, let $\varphi:\mathbb R^n\to\mathbb R$ be bounded and Borel. By the definition of pushforward measure,
\begin{align*}
\int_{\mathbb R^n}\varphi(z)\,d\mu_t(z)=\int_{\mathbb R^n}\varphi(F_t(x))\,d\mu_0(x).
\end{align*}
Since $d\mu_0(x)=\rho_0(x)\,dx$ and $F_t(x)=x+ta$,
\begin{align*}
\int_{\mathbb R^n}\varphi(F_t(x))\,d\mu_0(x)=\int_{\mathbb R^n}\varphi(x+ta)\rho_0(x)\,dx.
\end{align*}
Make the change of variables $z=x+ta$, so $x=z-ta$ and $dx=dz$. Then
\begin{align*}
\int_{\mathbb R^n}\varphi(x+ta)\rho_0(x)\,dx=\int_{\mathbb R^n}\varphi(z)\rho_0(z-ta)\,dz.
\end{align*}
Therefore $\mu_t$ has density
\begin{align*}
\rho_t(z)=\rho_0(z-ta).
\end{align*}
At time $t$, every point of the original density profile has been shifted by the same vector $ta$, so the distribution translates rigidly and its shape is unchanged.
[/example]
This translation example is the simplest case in which the Wasserstein path differs from convex mixing of measures. Convex mixing would produce two overlapping copies during intermediate times, while displacement interpolation moves the original copy continuously.
[example: Interpolating Gaussian Measures]
Let $\mu_0=\mathcal N(m_0,\Sigma_0)$ and $\mu_1=\mathcal N(m_1,\Sigma_1)$ on $\mathbb R^n$, with $\Sigma_0$ and $\Sigma_1$ positive definite. For quadratic cost, *Brenier's theorem for Gaussian measures* gives the optimal affine map
\begin{align*}
T(x)=m_1+A(x-m_0),
\end{align*}
where $A$ is positive definite and satisfies
\begin{align*}
A\Sigma_0A=\Sigma_1.
\end{align*}
The map-induced displacement interpolation is
\begin{align*}
F_t(x)=(1-t)x+tT(x).
\end{align*}
Substituting the formula for $T$ gives
\begin{align*}
F_t(x)=(1-t)x+t m_1+tA(x-m_0).
\end{align*}
Let
\begin{align*}
B_t=(1-t)I+tA
\end{align*}
and
\begin{align*}
m_t=(1-t)m_0+tm_1.
\end{align*}
Then
\begin{align*}
m_t+B_t(x-m_0)=(1-t)m_0+tm_1+((1-t)I+tA)(x-m_0).
\end{align*}
Expanding the right-hand side gives
\begin{align*}
m_t+B_t(x-m_0)=(1-t)x+tm_1+tA(x-m_0)=F_t(x).
\end{align*}
Thus, if $X\sim\mathcal N(m_0,\Sigma_0)$, then
\begin{align*}
F_t(X)=m_t+B_t(X-m_0).
\end{align*}
By the affine transformation rule for Gaussian random vectors, $F_t(X)$ is Gaussian. Its mean is
\begin{align*}
\mathbb E[F_t(X)]=m_t+B_t(\mathbb E[X]-m_0)=m_t,
\end{align*}
and its covariance is
\begin{align*}
\operatorname{Cov}(F_t(X))=\mathbb E[B_t(X-m_0)(X-m_0)^\top B_t^\top]=B_t\Sigma_0B_t^\top.
\end{align*}
Since $A$ is symmetric positive definite, $B_t$ is symmetric, so $B_t^\top=B_t$. Therefore
\begin{align*}
\Sigma_t=((1-t)I+tA)\Sigma_0((1-t)I+tA).
\end{align*}
Hence
\begin{align*}
\mu_t=(F_t)_\#\mu_0=\mathcal N(m_t,\Sigma_t).
\end{align*}
At $t=0$ this gives $\Sigma_t=\Sigma_0$, and at $t=1$ it gives $\Sigma_t=A\Sigma_0A=\Sigma_1$. Thus Wasserstein interpolation preserves Gaussianity, while the covariance follows the transport matrix $A$ rather than the arithmetic average $(1-t)\Sigma_0+t\Sigma_1$.
[/example]
The Gaussian computation foreshadows a major theme: the Wasserstein space behaves like a nonlinear Riemannian space over the base domain. Means move linearly, but covariance matrices follow the geometry dictated by optimal maps.
## Quadratic Cost and Geodesics Induced by Brenier Maps
For $W_2$ on Euclidean space, the preceding formula becomes sharper when the optimal plan comes from a map. The problem is to understand when the geodesic is no longer plan-valued data but is determined by a single deterministic motion of particles.
[quotetheorem:7488]
[citeproof:7488]
The assumption that the optimal plan is induced by a map rules out splitting at the starting point: each initial position $x$ has a single assigned destination $T(x)$. The quadratic cost matters because Brenier's theorem supplies such maps under absolute continuity of the source, and the map has the special form $T=\nabla\varphi$ for a convex potential. For other costs, the relevant optimality condition may not force a convex-gradient structure, and optimal couplings can fail to be represented by a single deterministic map without additional twist or regularity hypotheses. Singular sources create a separate obstruction: if $\mu_0=\delta_0$ and $\mu_1=\frac12\delta_{-1}+\frac12\delta_1$ on $\mathbb R$, no map from the one-point source mass can push $\mu_0$ to $\mu_1$, although a transport plan exists. The theorem does not claim that every pair of measures has a map-induced geodesic, only that when the endpoint optimum is deterministic the displacement interpolation is deterministic as well. This is the formula most often used in applications, where one studies each particle moving on the straight line from $x$ to $T(x)$.
[remark: Brenier Maps]
When $\mu_0$ is absolutely continuous with respect to $\mathcal L^n$, Brenier's theorem gives a unique optimal map $T:\mathbb R^n\to\mathbb R^n$ of the form $T=\nabla\varphi$ for a convex function $\varphi:\mathbb R^n\to\mathbb R$, for the quadratic cost. In that common case the preceding theorem applies and the geodesic is determined by the convex potential. If the source has atoms or lies on a lower-dimensional set, optimal plans may fail to be induced by a map.
[/remark]
The role of uniqueness is not cosmetic. If there are several optimal endpoint pairings, the same two endpoint measures can be connected by different straight-line motions, because different particles may be matched to different destinations. When optimality fixes the endpoint plan, that ambiguity disappears and the displacement curve has no remaining choice of pairing.
[quotetheorem:7489]
[citeproof:7489]
The uniqueness hypothesis is needed at the level of endpoint plans. For instance, on a square in $\mathbb R^2$ with
\begin{align*}
\mu_0=\frac12\delta_{(-1,0)}+\frac12\delta_{(1,0)},
\qquad
\mu_1=\frac12\delta_{(0,-1)}+\frac12\delta_{(0,1)},
\end{align*}
both matchings have the same quadratic cost. One optimal plan moves $(-1,0)$ to $(0,-1)$ and $(1,0)$ to $(0,1)$; the other crosses these assignments. At time $t=1/2$, their displacement interpolations put mass at different midpoint pairs, so distinct optimal plans produce different intermediate measures. The conclusion is also deliberately restricted to geodesics built by straight-line displacement of optimal pairings; it does not by itself exclude more exotic metric-geodesic phenomena in spaces with branching or weak regularity. In the map-induced case the formula gives the strongest form of determinism, because the initial label $x$ determines the whole path. Later nonbranching assumptions strengthen this kind of uniqueness by preventing two particle trajectories from coinciding for a positive time and then separating.
[example: Splitting from an Atom]
Let $\mu_0=\delta_0$ on $\mathbb R$ and let $\mu_1=\frac12\delta_{-1}+\frac12\delta_1$. If $\pi\in\Pi(\mu_0,\mu_1)$, then its first marginal is concentrated at $0$, so
\begin{align*}
\pi(\{0\}\times\mathbb R)=\mu_0(\{0\})=1.
\end{align*}
Its second marginal is $\mu_1$, hence
\begin{align*}
\pi(\mathbb R\times\{-1\})=\frac12
\end{align*}
and
\begin{align*}
\pi(\mathbb R\times\{1\})=\frac12.
\end{align*}
Intersecting these sets with $\{0\}\times\mathbb R$ gives
\begin{align*}
\pi(\{(0,-1)\})=\frac12
\end{align*}
and
\begin{align*}
\pi(\{(0,1)\})=\frac12.
\end{align*}
Thus the endpoint plan is uniquely determined:
\begin{align*}
\pi=\frac12\delta_{(0,-1)}+\frac12\delta_{(0,1)}.
\end{align*}
This plan is not induced by a transport map from $\mu_0$. Indeed, if $\pi=(\operatorname{id},T)_\#\delta_0$, then all mass is sent to the single point $(0,T(0))$, so $\pi=\delta_{(0,T(0))}$. Its second marginal would be $\delta_{T(0)}$, which cannot equal $\frac12\delta_{-1}+\frac12\delta_1$ because the latter assigns mass $\frac12$ to each of two distinct points.
For the displacement interpolation, use
\begin{align*}
F_t(x,y)=(1-t)x+ty.
\end{align*}
On the two atoms of $\pi$,
\begin{align*}
F_t(0,-1)=-t
\end{align*}
and
\begin{align*}
F_t(0,1)=t.
\end{align*}
Therefore
\begin{align*}
\mu_t=(F_t)_\#\pi=\frac12\delta_{-t}+\frac12\delta_t.
\end{align*}
The mass has split at the level of the transport plan: one half travels from $0$ to $-1$, and the other half travels from $0$ to $1$. This is splitting of a single source atom, not nonuniqueness of the endpoint plan and not branching of Euclidean geodesics.
[/example]
The example separates splitting from nonuniqueness. A plan can split mass even when the base space has unique geodesics between points; genuine base-space branching is a different issue, concerning the behaviour of geodesic paths after they have overlapped.
[example: Branching in a Simple Metric Tree]
Consider a metric tree shaped like a Y. Let $o$ be the root, let $q$ be the branching point, and let $a,b$ be two leaves with
\begin{align*}
d(o,a)=d(o,b)=L
\end{align*}
and
\begin{align*}
d(o,q)=r
\end{align*}
where $0<r<L$. In a metric tree there is a unique embedded arc between any two points, so the geodesic from $o$ to $a$ and the geodesic from $o$ to $b$ are individually unique.
Let $\gamma_a$ and $\gamma_b$ be the constant-speed geodesics from $o$ to $a$ and from $o$ to $b$. Since both paths travel along the same trunk from $o$ to $q$, the time at which they reach $q$ is determined by
\begin{align*}
d(o,\gamma_a(t))=tL
\end{align*}
and
\begin{align*}
d(o,q)=r.
\end{align*}
Thus $\gamma_a(t)=\gamma_b(t)$ for $0\le t\le r/L$, because both points are the unique point on the trunk at distance $tL$ from $o$. For $t>r/L$, the point $\gamma_a(t)$ lies on the arm from $q$ to $a$, while $\gamma_b(t)$ lies on the arm from $q$ to $b$, so $\gamma_a(t)\ne\gamma_b(t)$.
Now take
\begin{align*}
\mu_0=\delta_o
\end{align*}
and
\begin{align*}
\mu_1=\frac12\delta_a+\frac12\delta_b.
\end{align*}
The endpoint transport plan is forced to be
\begin{align*}
\pi=\frac12\delta_{(o,a)}+\frac12\delta_{(o,b)},
\end{align*}
because the first marginal is concentrated at $o$ and the second marginal assigns mass $\frac12$ to each of $a$ and $b$. The corresponding dynamical plan is
\begin{align*}
\Pi=\frac12\delta_{\gamma_a}+\frac12\delta_{\gamma_b}.
\end{align*}
Its time-$t$ marginal is
\begin{align*}
\mu_t=(e_t)_\#\Pi=\frac12\delta_{\gamma_a(t)}+\frac12\delta_{\gamma_b(t)}.
\end{align*}
For $0\le t\le r/L$, this becomes
\begin{align*}
\mu_t=\frac12\delta_{\gamma_a(t)}+\frac12\delta_{\gamma_a(t)}=\delta_{\gamma_a(t)}.
\end{align*}
For $t>r/L$, the two atoms are distinct, so
\begin{align*}
\mu_t=\frac12\delta_{\gamma_a(t)}+\frac12\delta_{\gamma_b(t)}
\end{align*}
is split between the two arms. Thus the branching is not caused by two different geodesics between the same pair of points; it is caused by two uniquely determined particle paths sharing an initial segment and then separating.
[/example]
Nonbranching assumptions in later theory are designed to rule out this kind of ambiguity. They allow information about intermediate-time coincidence of particles to propagate backward and forward along the whole geodesic.
## What the Geometry Encodes
The constructions in this chapter show that $W_p$ is not just a numerical distance between probability measures. It carries the geodesic structure of the base space and adds a layer of mass allocation through optimal plans.
[explanation: Eulerian and Lagrangian Views]
The plan-based formula is Lagrangian: it labels particles by their initial and final positions, then follows each particle along a shortest base-space path. The curve $t\mapsto\mu_t$ is Eulerian: it records only the distribution of particles at time $t$, after labels have been forgotten.
This distinction explains why different plans may lead to different curves with the same endpoints. It also explains why map-induced geodesics are easier to analyze: the label $x$ at time $0$ determines the entire trajectory $t\mapsto(1-t)x+tT(x)$.
[/explanation]
This two-viewpoint language is the bridge to the analytic uses of Wasserstein geometry. The final point of the chapter is that later convexity and evolution statements will be measured along displacement paths, not along ordinary affine combinations of measures.
[remark: Link to Later Analysis]
Displacement interpolation is the foundation for displacement convexity, gradient flows in Wasserstein space, and synthetic curvature bounds. In those topics, convexity is tested not along linear mixtures of measures but along the geodesics constructed here. Analytically, a map-induced geodesic also gives a weak solution of a continuity equation $\partial_t\mu_t+\nabla\cdot(v_t\mu_t)=0$, so the metric picture is tied to the PDE language of mass conservation. The quadratic-cost formula $\mu_t=((1-t)\operatorname{id}+tT)_\#\mu_0$ is therefore the basic path along which entropy and energy functionals are studied.
[/remark]
The metric and geometric picture is now complete enough to ask how stable it is under approximation and limiting procedures. The final chapter shifts from exact formulas to convergence, showing which parts of the theory persist when measures, costs, or transport plans vary.
# 10. Stability, Approximation, and Structural Limits
This closing chapter asks how much of the foundational theory survives limiting procedures. Chapters 2 through 9 established existence, duality, Wasserstein distances, geodesics, and the main geometric structure of optimal plans. Here the emphasis shifts from exact continuous problems to approximation: if the measures, costs, or finite discretisations vary, what happens to the optimal values and the optimal plans? The answer is strong enough to justify computational and statistical approximations, but it also marks the boundary of the first course: regularity, curvature, and most applications require additional hypotheses not supplied by the foundational framework alone.
## Stability Under Weak Convergence
The first question is whether optimal transport is stable when the input measures are only known as limits. This matters because weak convergence is the natural topology for probability measures on Polish spaces, while optimal transport also remembers the cost and may require moment control. The guiding principle is that lower semicontinuity gives compactness of minimising sequences, and compactness lets optimality pass to the limit.
[definition: Narrow Convergence Of Probability Measures]
Let $X$ be a Polish space. A sequence $(\mu_k)_{k \ge 1}$ in $\mathcal P(X)$ converges narrowly to $\mu \in \mathcal P(X)$ if
\begin{align*}
\int_X \varphi\,d\mu_k \to \int_X \varphi\,d\mu
\end{align*}
for every bounded continuous function $\varphi: X \to \mathbb R$.
[/definition]
Narrow convergence is the weak convergence of probability measures used throughout the compactness part of Kantorovich theory. It controls bounded continuous observables, but it does not by itself control large values of an unbounded cost. The next definition isolates the extra hypothesis needed for costs such as $|x-y|^p$.
[definition: Uniform Moment Bound]
Let $(X,d)$ be a metric space, let $p \ge 1$, and fix $x_0 \in X$. A sequence $(\mu_k)_{k \ge 1}$ in $\mathcal P(X)$ has a uniform $p$-moment bound if
\begin{align*}
\sup_{k \ge 1} \int_X d(x,x_0)^p\,d\mu_k(x) < \infty.
\end{align*}
[/definition]
The choice of $x_0$ does not change the property when $p \ge 1$, because the triangle inequality compares $d(x,x_1)^p$ and $d(x,x_0)^p$ up to constants depending on $x_0,x_1,p$. Moment bounds are the price paid for working with Wasserstein costs rather than only bounded costs. A basic failure mode is the sequence $\mu_k=\delta_k$ on $\mathbb R$: against bounded continuous test functions it has no tight subsequential probability limit, and even when a sequence converges narrowly, small amounts of mass moving far away can destroy convergence of unbounded transport costs. Thus weak convergence and moment control play different roles: weak convergence locates the mass, while the moment condition prevents expensive tails from being invisible to bounded tests.
The basic analytic input is lower semicontinuity of the integral cost under weak convergence of plans. This is the same mechanism used in the existence proof for Kantorovich minimisers, now applied to a varying sequence of admissible plans.
[quotetheorem:7490]
[citeproof:7490]
This theorem prevents mass from lowering the transport cost by escaping through a limiting process. The lower semicontinuity hypothesis is essential: if the cost has a downward jump at a limiting point, a sequence of plans can converge to a plan whose limiting cost is larger than the liminf inequality allows, so optimality need not pass to the limit. The bounded-below assumption rules out a different pathology, namely sending small amounts of mass into regions where the cost tends to $-\infty$ and thereby destroying compactness of minimising values. What the theorem gives is only a one-sided inequality; an upper bound requires separate competitors or recovery sequences, which is why the next result adds marginal convergence and an explicit approximation hypothesis.
[quotetheorem:7491]
[citeproof:7491]
The recovery sequence hypothesis is a compact way to state the upper-bound half of stability. The lower semicontinuity and tightness assumptions do different work: lower semicontinuity prevents cost from dropping at the limit, while tightness prevents optimal plans from losing mass at infinity before a subsequential limit is even available. Without a recovery sequence, the theorem may identify a lower bound for every subsequential limit but still fail to prove that the limiting value is no larger than the approximating values. The statement also does not assert uniqueness or full-sequence convergence: if the limiting problem has several optimal plans, different subsequences of $(\pi_k)$ may converge to different optimal limits. For bounded continuous costs the recovery condition is often obtained by approximating a fixed competitor through its marginals; for unbounded Wasserstein costs it is supplied by moment convergence and truncation.
[example: Weak Limit Of Gaussian Marginals]
Let $X=Y=\mathbb R^d$ and $c(x,y)=|x-y|^2$. Suppose $\mu_k=\mathcal N(0,A_k)$ and $\nu_k=\mathcal N(0,B_k)$, where $A_k \to A$ and $B_k \to B$ as positive semidefinite covariance matrices. For each $t \in \mathbb R^d$, the characteristic function of $\mu_k$ is
\begin{align*}
\widehat \mu_k(t)=\exp\left(-\frac12 t^\top A_k t\right).
\end{align*}
Since $t^\top A_k t \to t^\top A t$, we get
\begin{align*}
\widehat \mu_k(t) \to \exp\left(-\frac12 t^\top A t\right)=\widehat\mu(t),
\end{align*}
so $\mu_k \to \mu=\mathcal N(0,A)$ narrowly by the *Lévy [continuity theorem](/theorems/1145)*. The same argument gives $\nu_k \to \nu=\mathcal N(0,B)$ narrowly.
If $X_k \sim \mathcal N(0,A_k)$, then
\begin{align*}
\int_{\mathbb R^d}|x|^2\,d\mu_k(x)=\mathbb E|X_k|^2=\operatorname{tr}(A_k).
\end{align*}
Because trace is linear in the matrix entries and $A_k \to A$, we have
\begin{align*}
\operatorname{tr}(A_k)\to \operatorname{tr}(A)=\int_{\mathbb R^d}|x|^2\,d\mu(x).
\end{align*}
The same computation gives second-moment convergence for $\nu_k$. Hence, by *Stability In Wasserstein Distance Under Moment Bounds* with $p=2$,
\begin{align*}
W_2(\mu_k,\nu_k)^2 \to W_2(\mu,\nu)^2.
\end{align*}
In the commuting covariance case, choose an orthogonal matrix $Q$ and numbers $a_i,b_i \ge 0$ such that
\begin{align*}
A=Q\operatorname{diag}(a_1,\dots,a_d)Q^\top,\qquad B=Q\operatorname{diag}(b_1,\dots,b_d)Q^\top.
\end{align*}
Then
\begin{align*}
A^{1/2}BA^{1/2}=Q\operatorname{diag}(a_1b_1,\dots,a_db_d)Q^\top.
\end{align*}
Taking the positive semidefinite square root gives
\begin{align*}
\left(A^{1/2}BA^{1/2}\right)^{1/2}=Q\operatorname{diag}(\sqrt{a_1b_1},\dots,\sqrt{a_db_d})Q^\top.
\end{align*}
Therefore the Gaussian [quadratic formula](/theorems/1301) gives
\begin{align*}
W_2(\mu,\nu)^2=\operatorname{tr}(A)+\operatorname{tr}(B)-2\operatorname{tr}\left(\left(A^{1/2}BA^{1/2}\right)^{1/2}\right).
\end{align*}
Substituting the diagonal traces yields
\begin{align*}
W_2(\mu,\nu)^2=\sum_{i=1}^d a_i+\sum_{i=1}^d b_i-2\sum_{i=1}^d \sqrt{a_ib_i}.
\end{align*}
Equivalently,
\begin{align*}
W_2(\mu,\nu)^2=\sum_{i=1}^d(\sqrt{a_i}-\sqrt{b_i})^2.
\end{align*}
Thus the stability statement recovers the same limiting value as the closed-form Gaussian computation in the simultaneously diagonalizable case.
[/example]
This example is deliberately regular: the measures have all moments and no mass escapes to infinity. For approximation arguments, regularity of a single closed-form family is not enough; one needs to know that replacing measures by nearby discrete or smoothed measures does not change the limiting optimal cost. The possible failure is again tail loss, so the convergence must be in the Wasserstein topology rather than merely narrow convergence.
[quotetheorem:7492]
[citeproof:7492]
This theorem is often the most convenient stability statement in computations: once the approximating measures converge in $W_p$, all pairwise Wasserstein distances against another convergent sequence behave continuously. Moment convergence is not a technical decoration. For example, on $\mathbb R$ the measures
\begin{align*}
\mu_k = \left(1-\frac{1}{k}\right)\delta_0 + \frac{1}{k}\delta_k
\end{align*}
converge narrowly to $\delta_0$, but their first moments do not converge to the first moment of $\delta_0$; the small escaping mass is invisible to bounded continuous tests and visible to the cost $|x-y|$. The theorem concerns convergence of distances and optimal values, not stability of Monge maps or uniqueness of optimal plans, so later map-level statements still require separate structural assumptions.
## Empirical Approximation And Finite Transport Problems
The second question is how an infinite-dimensional transport problem becomes a finite linear programme without changing its limiting answer. The practical idea is to replace a measure by finitely many weighted samples and solve a matrix-valued coupling problem. The mathematical issue is to verify that these finite problems approximate the original one in the topology relevant to the cost.
[definition: Empirical Measure]
Let $X$ be a measurable space and let $x_1,\dots,x_N \in X$. The empirical measure associated with these points is
\begin{align*}
\mu_N = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}.
\end{align*}
More generally, for weights $a_i \ge 0$ with $\sum_{i=1}^N a_i=1$, the weighted empirical measure is
\begin{align*}
\mu_N = \sum_{i=1}^N a_i\delta_{x_i}.
\end{align*}
[/definition]
Empirical measures convert integration against $\mu$ into finite sums over samples. When both marginals are empirical, Kantorovich transport is a finite-dimensional linear optimisation problem, so the abstract coupling becomes a matrix with prescribed row and column sums.
[definition: Finite Kantorovich Problem]
Let $X$ and $Y$ be sets, let $c:X \times Y \to (-\infty,\infty]$ be a cost function, and let
\begin{align*}
\mu = \sum_{i=1}^m a_i\delta_{x_i},
\qquad
\nu = \sum_{j=1}^n b_j\delta_{y_j},
\end{align*}
where $x_i \in X$, $y_j \in Y$, $a_i,b_j \ge 0$, and $\sum_i a_i=\sum_j b_j=1$. Set $c_{ij}=c(x_i,y_j)$. The finite Kantorovich problem is
\begin{align*}
\min_{(\pi_{ij})}
\sum_{i=1}^m\sum_{j=1}^n c_{ij}\pi_{ij}
\end{align*}
subject to
\begin{align*}
\pi_{ij} \ge 0,
\qquad
\sum_{j=1}^n \pi_{ij}=a_i,
\qquad
\sum_{i=1}^m \pi_{ij}=b_j.
\end{align*}
[/definition]
The constraints say that each row sends out the mass at $x_i$ and each column receives the mass required at $y_j$. The remaining problem is not how to solve a single finite programme, but why a sequence of such programmes recovers the continuous value and the continuous optimal plans.
[quotetheorem:7493]
[citeproof:7493]
The theorem explains why empirical optimal transport is not merely a numerical trick. Under the right moment control, the finite linear programmes converge to the intended continuous problem. Without moment control, a grid or sample approximation may converge narrowly while still placing a small amount of mass very far away, which can change $W_p^p$ by a non-vanishing amount. Without convergence of the plan costs to the limiting optimal value, lower semicontinuity gives only that a subsequential limit is no more expensive than the liminf of the discrete costs; it does not identify the limit as optimal. This is the point at which numerical linear programming practice connects back to the compactness theory: the finite matrix problem is reliable only when the discretisation is compatible with the topology of the cost.
[example: Empirical Approximation Of A Smooth Density]
Let $\Omega \subset \mathbb R^d$ be bounded, and let $\mu=f\mathcal L^d$ and $\nu=g\mathcal L^d$ be probability measures with smooth positive densities on $\Omega$. For each $N$, partition $\Omega$ into grid cells $(C_{N,i})_i$ and $(D_{N,j})_j$ with mesh size tending to $0$, choose sample points $x_{N,i}\in C_{N,i}$ and $y_{N,j}\in D_{N,j}$, and set
\begin{align*}
a_{N,i}=\int_{C_{N,i}} f(x)\,dx,\qquad b_{N,j}=\int_{D_{N,j}} g(y)\,dy.
\end{align*}
Since $f,g\ge 0$ and $\mu,\nu$ are probability measures,
\begin{align*}
\sum_i a_{N,i}=\int_\Omega f(x)\,dx=1,\qquad \sum_j b_{N,j}=\int_\Omega g(y)\,dy=1.
\end{align*}
Thus
\begin{align*}
\mu_N=\sum_i a_{N,i}\delta_{x_{N,i}},\qquad \nu_N=\sum_j b_{N,j}\delta_{y_{N,j}}
\end{align*}
are probability measures.
We show first that $\mu_N\to\mu$ narrowly. Let $\varphi:\mathbb R^d\to\mathbb R$ be bounded and continuous, and define $q_N(x)=x_{N,i}$ for $x\in C_{N,i}$. Then
\begin{align*}
\int_{\mathbb R^d}\varphi\,d\mu_N=\sum_i a_{N,i}\varphi(x_{N,i}).
\end{align*}
Substituting the definition of $a_{N,i}$ gives
\begin{align*}
\sum_i a_{N,i}\varphi(x_{N,i})=\sum_i\int_{C_{N,i}}\varphi(x_{N,i})f(x)\,dx.
\end{align*}
Since $q_N(x)=x_{N,i}$ on $C_{N,i}$, this is
\begin{align*}
\sum_i\int_{C_{N,i}}\varphi(x_{N,i})f(x)\,dx=\int_\Omega \varphi(q_N(x))f(x)\,dx.
\end{align*}
The mesh size tends to $0$, so $q_N(x)\to x$ for all non-boundary points of the partition, and the partition boundaries have Lebesgue measure $0$. Hence $\varphi(q_N(x))\to\varphi(x)$ for $\mathcal L^d$-almost every $x$. Also
\begin{align*}
|\varphi(q_N(x))f(x)|\le \|\varphi\|_\infty f(x),
\end{align*}
and $\int_\Omega f(x)\,dx=1$, so dominated convergence gives
\begin{align*}
\int_{\mathbb R^d}\varphi\,d\mu_N\to \int_\Omega \varphi(x)f(x)\,dx=\int_{\mathbb R^d}\varphi\,d\mu.
\end{align*}
This is exactly narrow convergence of $\mu_N$ to $\mu$. The same argument with $g$, $D_{N,j}$, and $y_{N,j}$ gives $\nu_N\to\nu$ narrowly.
Now fix $x_0\in\mathbb R^d$ and $p\ge 1$. For the same map $q_N$,
\begin{align*}
\int_{\mathbb R^d}|x-x_0|^p\,d\mu_N(x)=\sum_i a_{N,i}|x_{N,i}-x_0|^p.
\end{align*}
Substituting $a_{N,i}$ again gives
\begin{align*}
\sum_i a_{N,i}|x_{N,i}-x_0|^p=\int_\Omega |q_N(x)-x_0|^p f(x)\,dx.
\end{align*}
Because $\Omega$ is bounded, there is $R<\infty$ such that $|z-x_0|\le R$ for all grid points and all $z\in\Omega$. Therefore
\begin{align*}
|q_N(x)-x_0|^p f(x)\le R^p f(x),
\end{align*}
and $R^p f$ is integrable. Since $q_N(x)\to x$ almost everywhere, dominated convergence gives
\begin{align*}
\int_{\mathbb R^d}|x-x_0|^p\,d\mu_N(x)\to \int_\Omega |x-x_0|^p f(x)\,dx=\int_{\mathbb R^d}|x-x_0|^p\,d\mu(x).
\end{align*}
The identical calculation for $\nu_N$ gives convergence of the $p$-moments of $\nu_N$ to those of $\nu$.
By *Convergence Of Discrete Approximations Under Moment Bounds*, the narrow convergence and moment convergence imply
\begin{align*}
W_p(\mu_N,\nu_N)\to W_p(\mu,\nu).
\end{align*}
Since the map $r\mapsto r^p$ is continuous on $[0,\infty)$, this yields
\begin{align*}
W_p(\mu_N,\nu_N)^p\to W_p(\mu,\nu)^p.
\end{align*}
For the cost $c(x,y)=|x-y|^p$, the finite Kantorovich value between $\mu_N$ and $\nu_N$ is exactly $W_p(\mu_N,\nu_N)^p$, so the finite transport costs converge to the continuous transport cost. If $\pi_N$ is any sequence of optimal discrete plans and a subsequence $\pi_{N_k}$ converges narrowly, the same theorem identifies its limit as an optimal plan between $\mu$ and $\nu$. Thus the grid problems recover both the limiting value and, along convergent subsequences, limiting optimal plans.
[/example]
Finite approximations also reveal a distinction between transport plans and transport maps. Plans are compact under weak convergence, while maps may fail to have a meaningful limit when the source measure develops atoms or when mass splitting becomes necessary.
[example: Instability Of Monge Maps When Absolute Continuity Is Lost]
Let $\mu_\varepsilon$ be the uniform probability measure on $[-\varepsilon,\varepsilon]\subset\mathbb R$, and let
\begin{align*}
\nu=\frac12\delta_{-1}+\frac12\delta_1.
\end{align*}
Define $T_\varepsilon(x)=-1$ for $-\varepsilon\le x<0$ and $T_\varepsilon(x)=1$ for $0\le x\le \varepsilon$. The point $0$ has $\mu_\varepsilon$-mass $0$, so its assignment does not affect the pushforward. Since $\mu_\varepsilon$ has density $1/(2\varepsilon)$ on $[-\varepsilon,\varepsilon]$,
\begin{align*}
\mu_\varepsilon([-\varepsilon,0))=\frac{1}{2\varepsilon}\int_{-\varepsilon}^0 dx=\frac{1}{2\varepsilon}\varepsilon=\frac12.
\end{align*}
Similarly,
\begin{align*}
\mu_\varepsilon([0,\varepsilon])=\frac{1}{2\varepsilon}\int_0^\varepsilon dx=\frac{1}{2\varepsilon}\varepsilon=\frac12.
\end{align*}
Thus $T_\varepsilon$ sends mass $1/2$ to $-1$ and mass $1/2$ to $1$, so
\begin{align*}
(T_\varepsilon)_\#\mu_\varepsilon=\frac12\delta_{-1}+\frac12\delta_1=\nu.
\end{align*}
In one dimension with quadratic cost, the monotone transport map is optimal by the monotone rearrangement principle, and $T_\varepsilon$ is monotone nondecreasing.
Now let $\varphi:\mathbb R\to\mathbb R$ be bounded and continuous. Then
\begin{align*}
\int_{\mathbb R}\varphi\,d\mu_\varepsilon=\frac{1}{2\varepsilon}\int_{-\varepsilon}^{\varepsilon}\varphi(x)\,dx.
\end{align*}
Subtracting $\varphi(0)$ gives
\begin{align*}
\int_{\mathbb R}\varphi\,d\mu_\varepsilon-\varphi(0)=\frac{1}{2\varepsilon}\int_{-\varepsilon}^{\varepsilon}(\varphi(x)-\varphi(0))\,dx.
\end{align*}
Hence
\begin{align*}
\left|\int_{\mathbb R}\varphi\,d\mu_\varepsilon-\varphi(0)\right|\le \frac{1}{2\varepsilon}\int_{-\varepsilon}^{\varepsilon}|\varphi(x)-\varphi(0)|\,dx.
\end{align*}
Since $|\varphi(x)-\varphi(0)|\le \sup_{|z|\le\varepsilon}|\varphi(z)-\varphi(0)|$ for $x\in[-\varepsilon,\varepsilon]$, we get
\begin{align*}
\left|\int_{\mathbb R}\varphi\,d\mu_\varepsilon-\varphi(0)\right|\le \sup_{|z|\le\varepsilon}|\varphi(z)-\varphi(0)|.
\end{align*}
The right side tends to $0$ by continuity of $\varphi$ at $0$, so $\mu_\varepsilon$ converges narrowly to $\delta_0$.
The optimal plan induced by $T_\varepsilon$ is
\begin{align*}
\pi_\varepsilon=(\operatorname{id},T_\varepsilon)_\#\mu_\varepsilon.
\end{align*}
For every bounded continuous $\psi:\mathbb R^2\to\mathbb R$,
\begin{align*}
\int_{\mathbb R^2}\psi\,d\pi_\varepsilon=\frac{1}{2\varepsilon}\int_{-\varepsilon}^0\psi(x,-1)\,dx+\frac{1}{2\varepsilon}\int_0^\varepsilon\psi(x,1)\,dx.
\end{align*}
For the first term,
\begin{align*}
\left|\frac{1}{2\varepsilon}\int_{-\varepsilon}^0\psi(x,-1)\,dx-\frac12\psi(0,-1)\right|\le \frac12\sup_{-\varepsilon\le x\le 0}|\psi(x,-1)-\psi(0,-1)|.
\end{align*}
The right side tends to $0$ by continuity of $\psi$ at $(0,-1)$. Likewise,
\begin{align*}
\left|\frac{1}{2\varepsilon}\int_0^\varepsilon\psi(x,1)\,dx-\frac12\psi(0,1)\right|\le \frac12\sup_{0\le x\le \varepsilon}|\psi(x,1)-\psi(0,1)|.
\end{align*}
This tends to $0$ by continuity of $\psi$ at $(0,1)$. Therefore
\begin{align*}
\int_{\mathbb R^2}\psi\,d\pi_\varepsilon\to \frac12\psi(0,-1)+\frac12\psi(0,1).
\end{align*}
Equivalently,
\begin{align*}
\pi_\varepsilon\to \frac12\delta_{(0,-1)}+\frac12\delta_{(0,1)}
\end{align*}
narrowly.
There is no transport map from $\delta_0$ to $\nu$. Indeed, if $S:\mathbb R\to\mathbb R$ is measurable, then for every Borel set $E\subset\mathbb R$,
\begin{align*}
S_\#\delta_0(E)=\delta_0(S^{-1}(E)).
\end{align*}
This equals $1$ when $S(0)\in E$ and equals $0$ when $S(0)\notin E$, so
\begin{align*}
S_\#\delta_0=\delta_{S(0)}.
\end{align*}
A single Dirac mass cannot equal $\nu$, because $\nu(\{-1\})=1/2$ and $\nu(\{1\})=1/2$ at two distinct points. Thus the Kantorovich optimal plans have a stable limit, but the limiting transport cannot be represented by a Monge map.
[/example]
The example is the basic warning behind many later regularity results. Absolute continuity of the source is not a cosmetic assumption; it can be the difference between a genuine map and unavoidable splitting.
## What The Foundational Theory Does Not Decide
The final question is what remains unresolved after existence, duality, stability, and approximation have been established. The foundational theory gives optimal plans and, in favourable cases, optimal maps. It does not by itself decide whether those maps are continuous, differentiable, unique beyond known structural hypotheses, or compatible with curvature and PDE estimates.
[remark: Regularity Requires More Than Existence]
For quadratic cost on $\mathbb R^d$, Brenier's theorem gives an optimal map $T=\nabla \varphi$ under absolute continuity of the source. The theorem alone does not imply that $T$ is continuous or smooth. Regularity theory studies additional assumptions on the domains and densities under which the associated Monge-Ampere equation has classical or partial regularity properties.
[/remark]
This separates the convex-analytic existence theorem from the PDE problem hidden behind it. When $T=\nabla \varphi$ transports $f\mathcal L^d$ to $g\mathcal L^d$, a formal change of variables leads to a determinant equation.
[example: Formal Monge-Ampere Equation For A Brenier Map]
Let $\Omega,\Lambda \subset \mathbb R^d$ be bounded domains, and let $f,g$ be positive smooth probability densities on them. Suppose that $T=\nabla\varphi:\Omega\to\Lambda$ is a smooth Brenier map transporting $f\mathcal L^d$ to $g\mathcal L^d$, and suppose also that the change-of-variables formula applies to $T$ with Jacobian determinant $\det DT(x)>0$.
For every bounded measurable $\eta:\Lambda\to\mathbb R$, the pushforward condition $T_\#(f\mathcal L^d)=g\mathcal L^d$ means
\begin{align*}
\int_\Lambda \eta(y)g(y)\,dy=\int_\Omega \eta(T(x))f(x)\,dx.
\end{align*}
Applying the change-of-variables formula to $y=T(x)$ rewrites the left side as
\begin{align*}
\int_\Lambda \eta(y)g(y)\,dy=\int_\Omega \eta(T(x))g(T(x))\det DT(x)\,dx.
\end{align*}
Comparing the two identities gives
\begin{align*}
\int_\Omega \eta(T(x))f(x)\,dx=\int_\Omega \eta(T(x))g(T(x))\det DT(x)\,dx.
\end{align*}
Since this holds for all bounded measurable test functions $\eta$ on $\Lambda$ and $T$ is onto up to the transported mass, the densities agree pointwise under the smooth hypotheses:
\begin{align*}
f(x)=g(T(x))\det DT(x).
\end{align*}
Because $T=\nabla\varphi$, its derivative matrix is the Hessian:
\begin{align*}
DT(x)=D(\nabla\varphi)(x)=D^2\varphi(x).
\end{align*}
Substituting this into the density identity yields
\begin{align*}
f(x)=g(\nabla\varphi(x))\det D^2\varphi(x).
\end{align*}
Equivalently,
\begin{align*}
\det D^2\varphi(x)=\frac{f(x)}{g(\nabla\varphi(x))}.
\end{align*}
Thus, when the smoothness, positivity of the Jacobian, and change-of-variables hypotheses are already available, the Brenier potential satisfies a Monge-Ampere equation together with the transport boundary condition $\nabla\varphi(\Omega)=\Lambda$.
[/example]
Curvature is another feature absent from the basic minimisation problem. Wasserstein spaces behave like metric spaces with geodesics, but synthetic Ricci curvature, displacement convexity, and gradient-flow theory require a second layer of structure.
[remark: Curvature And Gradient Flows Are Later Topics]
The metric space $(\mathcal P_p(X),W_p)$ can be studied through geodesics and convexity of functionals along geodesics. In later courses, this leads to displacement convexity, entropy methods, synthetic lower Ricci curvature bounds, and gradient flows in Wasserstein space. None of these conclusions follows from compactness and duality alone; they require additional geometric or analytic hypotheses on the underlying space and the functionals being transported.
[/remark]
Applications also sit beyond the foundational endpoint of this course. The same stability theorem supports numerical approximation, statistics, PDE, geometry, and machine learning, but each application adds its own assumptions and estimates.
[explanation: The Role Of Stability At The End Of The Course]
Stability is the bridge between theory and use. It says that optimal transport problems are not isolated exact objects: they can be approximated by empirical measures, discretised into finite programmes, and recovered as limits of computable problems.
At the same time, stability is weaker than regularity. It can identify a limiting optimal plan without proving that the plan is induced by a smooth map, without estimating the map, and without describing the geometry of the support beyond the structural results already proved. The foundational theory therefore ends with a reliable compactness and approximation framework, not with a complete theory of smoothness or applications.
[/explanation]
The course closes with a useful hierarchy. Kantorovich theory supplies existence and compactness; duality supplies certificates and potentials; cyclic monotonicity and Brenier theory supply structure for special costs; stability supplies approximation. Regularity, curvature, and application-specific estimates form the next stage rather than consequences of the foundational package alone.
## Beyond And Connections
The natural continuation is the regularity theory of the Monge-Ampere equation. Once a Brenier map is known to be $\nabla\varphi$, one can ask which assumptions on domains and densities make $\varphi$ strictly convex, continuous up to the boundary, or smooth in the interior. These questions connect optimal transport to fully nonlinear elliptic PDE, [convex geometry](/page/Convex%20Geometry), and the second boundary value problem.
A second continuation is the geometry of Wasserstein space. Displacement interpolation turns $\mathcal P_p(X)$ into a metric-geometric object, and for $p=2$ this geometry supports displacement convexity, entropy methods, gradient flows, and synthetic Ricci curvature lower bounds. In that direction, the geodesics constructed here become the curves along which energies and curvature conditions are tested.
Finally, the finite and empirical approximation results point toward computation and statistics. Entropic regularization, matching algorithms, sample complexity, and convergence of discrete optimal plans all refine the stability ideas from the last chapter. The foundational results in this course supply the common language: couplings, dual certificates, cyclic monotonicity, Brenier potentials, and Wasserstein convergence.
## References
- Villani, C., *Topics in Optimal Transportation*, American Mathematical Society, 2003.
- Villani, C., *Optimal Transport: Old and New*, Springer, 2009.
- Santambrogio, F., *Optimal Transport for Applied Mathematicians*, Birkhauser, 2015.
- Ambrosio, L., Gigli, N., and Savare, G., *Gradient Flows in Metric Spaces and in the Space of Probability Measures*, Birkhauser, 2005.
- Rachev, S. T., and Ruschendorf, L., *Mass Transportation Problems*, Volumes I and II, Springer, 1998.
Contents
- Introduction
- The Central Question of Transport
- Why Maps Are Too Rigid
- Compactness, Duality, and Geometry
- The Route Through the Course
- Prerequisites and Conventions
- 1. The Monge Problem and Transport Maps
- Pushforwards and Transport Maps
- Monge's Optimization Problem
- Atoms and Splitting Obstructions
- Monotone Rearrangement on $\mathbb R$
- 2. Kantorovich Relaxation
- Couplings and Marginals
- The Kantorovich Problem
- Compactness Of Transport Plans
- Lower Semicontinuity And Existence
- 3. Discrete Optimal Transport and Linear Programming
- Transport Polytopes and the Primal Program
- Extreme Points and Basic Feasible Solutions
- Finite Kantorovich Duality
- 4. Kantorovich Duality in General Spaces
- Admissible Potentials and the Dual Problem
- The $c$-Transform and $c$-Concavity
- Duality from Bounded Continuous Costs
- Dual Maximizers
- Complementary Slackness and Optimal Supports
- 5. Cyclical Monotonicity and Optimality Criteria
- Finite Cycles and the Cost-Swapping Test
- Optimal Plans Live on $c$-Cyclically Monotone Sets
- Potentials Built from Cyclical Monotonicity
- The Optimality Criterion in Practice
- 6. Quadratic Cost and Brenier Theory
- Rewriting Quadratic Transport Using Convex Potentials
- Gradients of Convex Functions as Optimal Maps
- Absolute Continuity and Uniqueness of the Optimal Plan
- 7. The Monge-Ampere Equation Behind Transport
- The Smooth Jacobian Equation for Brenier Maps
- Convex Potentials and the Alexandrov Measure
- The Second Boundary Value Problem
- 8. Wasserstein Distances
- The Finite Moment Space
- Metric Properties of Wasserstein Distance
- Wasserstein Convergence and Weak Convergence
- One-Dimensional Quantile Formula
- 9. Geometry of Wasserstein Spaces
- Constant-Speed Geodesics from Optimal Plans
- Displacement Interpolation in Euclidean Space
- Quadratic Cost and Geodesics Induced by Brenier Maps
- What the Geometry Encodes
- 10. Stability, Approximation, and Structural Limits
- Stability Under Weak Convergence
- Empirical Approximation And Finite Transport Problems
- What The Foundational Theory Does Not Decide
- Beyond And Connections
- References
Optimal Transport I: Foundations
Content
Problems
History
Created by admin on 6/18/2026 | Last updated on 6/18/2026
Prerequisites (0/4 completed)
Log in to track your prerequisite progress.
Prerequisites Graph
Interactive dependency map showing prerequisite concepts
Loading dependency graph...
Theorem
Definition
Current
Requires
Rate this page
★
★
★
★
★
Poor
Excellent