[proofplan]
We verify the four metric axioms directly from the coupling definition. Non-negativity and finiteness follow from the non-negativity of the cost and from the product coupling estimate using the finite $p$-moment assumption. Symmetry is obtained by pushing a coupling forward under the coordinate-swap map. For identity of indiscernibles, zero transport cost forces the two measures to agree on every [closed set](/page/Closed%20Set). The triangle inequality follows by gluing two transport plans through their common middle marginal and applying the pointwise triangle inequality together with Minkowski's inequality in $L^p$.
[/proofplan]
[step:Show that $W_p$ is finite and non-negative on $\mathcal{P}_p(X)$]
Fix $\mu,\nu \in \mathcal{P}_p(X)$. Since $d(x,y)^p \ge 0$ for all $(x,y) \in X \times X$, every transport cost is non-negative, and therefore $W_p(\mu,\nu) \ge 0$.
Choose a point $x_0 \in X$. Since $\mu,\nu \in \mathcal{P}_p(X)$, both moments
\begin{align*}
\int_X d(x,x_0)^p\,d\mu(x)
\end{align*}
and
\begin{align*}
\int_X d(y,x_0)^p\,d\nu(y)
\end{align*}
are finite. The product measure $\mu \otimes \nu$ is a coupling of $\mu$ and $\nu$. For every $(x,y) \in X \times X$, the triangle inequality $d(x,y) \le d(x,x_0)+d(x_0,y)$ and the convexity inequality $(a+b)^p \le 2^{p-1}(a^p+b^p)$, valid for $a,b \ge 0$ and $p \ge 1$, give
\begin{align*}
d(x,y)^p \le 2^{p-1}\bigl(d(x,x_0)^p+d(y,x_0)^p\bigr).
\end{align*}
Integrating this inequality with respect to $\mu \otimes \nu$, and using that each factor is a probability measure, gives
\begin{align*}
\int_{X \times X} d(x,y)^p\,d(\mu \otimes \nu)(x,y) \le 2^{p-1}\int_X d(x,x_0)^p\,d\mu(x) + 2^{p-1}\int_X d(y,x_0)^p\,d\nu(y) < \infty.
\end{align*}
Hence the infimum defining $W_p(\mu,\nu)$ is finite.
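As a sanity check that is not needed for the proof, consider two Dirac masses; the points $a,b \in X$ are illustrative notation introduced only here, and $p \ge 1$ is assumed, as the convexity inequality requires. For $\mu=\delta_a$ and $\nu=\delta_b$ the product coupling is $\delta_a \otimes \delta_b=\delta_{(a,b)}$, and the estimate above with $x_0=a$ reads
\begin{align*}
\int_{X \times X} d(x,y)^p\,d\delta_{(a,b)}(x,y) = d(a,b)^p \le 2^{p-1}\bigl(d(a,a)^p+d(b,a)^p\bigr) = 2^{p-1}d(a,b)^p,
\end{align*}
which holds because $2^{p-1} \ge 1$. In this case $\delta_{(a,b)}$ is the only coupling, so $W_p(\delta_a,\delta_b)=d(a,b)$.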
[/step]
[step:Prove symmetry by swapping the two coordinates]
Let $S: X \times X \to X \times X$ be the Borel map defined by
\begin{align*}
S(x,y) := (y,x).
\end{align*}
If $\pi \in \Pi(\mu,\nu)$, define $\pi^S := S_{\#}\pi$, the pushforward of $\pi$ by $S$. Then $\pi^S \in \Pi(\nu,\mu)$, because the first marginal of $\pi^S$ is the second marginal of $\pi$, and the second marginal of $\pi^S$ is the first marginal of $\pi$. Moreover, since $d(y,x)=d(x,y)$,
\begin{align*}
\int_{X \times X} d(x,y)^p\,d\pi^S(x,y) = \int_{X \times X} d(y,x)^p\,d\pi(x,y) = \int_{X \times X} d(x,y)^p\,d\pi(x,y).
\end{align*}
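Since $\pi^S \in \Pi(\nu,\mu)$, the left-hand side is at least $W_p(\nu,\mu)^p$. Taking the infimum of the right-hand side over $\pi \in \Pi(\mu,\nu)$ and then $p$-th roots gives $W_p(\nu,\mu) \le W_p(\mu,\nu)$. Applying the same argument with $\mu$ and $\nu$ interchanged gives $W_p(\mu,\nu) \le W_p(\nu,\mu)$. Therefore $W_p(\mu,\nu)=W_p(\nu,\mu)$.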
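For intuition, here is a finite illustration; the points $x_i,y_j \in X$ and the weights $\pi_{ij} \ge 0$ are hypothetical notation and play no role in the argument. If $\pi=\sum_{i,j}\pi_{ij}\,\delta_{(x_i,y_j)}$ is a finitely supported coupling, then
\begin{align*}
S_{\#}\pi = \sum_{i,j}\pi_{ij}\,\delta_{(y_j,x_i)}, \qquad \int_{X \times X} d(x,y)^p\,d(S_{\#}\pi)(x,y) = \sum_{i,j}\pi_{ij}\,d(y_j,x_i)^p = \sum_{i,j}\pi_{ij}\,d(x_i,y_j)^p.
\end{align*}
In matrix terms the swap transposes the weight matrix $(\pi_{ij})$: row sums and column sums, that is, the two marginals, are exchanged, while the total cost is unchanged.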
[/step]
[step:Prove that zero distance forces equality of measures]
Assume $W_p(\mu,\nu)=0$. For each $n \in \mathbb{N}$, choose $\pi_n \in \Pi(\mu,\nu)$ such that
\begin{align*}
\int_{X \times X} d(x,y)^p\,d\pi_n(x,y) \le \frac{1}{n}.
\end{align*}
Let $F \subset X$ be closed, and for $\varepsilon > 0$ define its open $\varepsilon$-neighbourhood $F_\varepsilon \subset X$ by
\begin{align*}
F_\varepsilon := \{y \in X : \operatorname{dist}(y,F) < \varepsilon\}.
\end{align*}
If $x \in F$ and $y \notin F_\varepsilon$, then $d(x,y) \ge \varepsilon$. Hence
\begin{align*}
\pi_n(F \times (X \setminus F_\varepsilon)) \le \frac{1}{\varepsilon^p}\int_{X \times X} d(x,y)^p\,d\pi_n(x,y) \le \frac{1}{\varepsilon^p n}.
\end{align*}
Using that $\pi_n$ has first marginal $\mu$ and second marginal $\nu$, we obtain
\begin{align*}
\mu(F) = \pi_n(F \times X) \le \pi_n(X \times F_\varepsilon) + \pi_n(F \times (X \setminus F_\varepsilon)) \le \nu(F_\varepsilon) + \frac{1}{\varepsilon^p n}.
\end{align*}
Letting $n \to \infty$ gives $\mu(F) \le \nu(F_\varepsilon)$ for every $\varepsilon > 0$. Since $F$ is closed, $\bigcap_{\varepsilon>0}F_\varepsilon=\{y \in X : \operatorname{dist}(y,F)=0\}=F$, so the sets $F_\varepsilon$ decrease to $F$ as $\varepsilon \downarrow 0$, and continuity from above for the probability measure $\nu$ gives $\mu(F) \le \nu(F)$.
By symmetry of $W_p$, the same argument with $\mu$ and $\nu$ interchanged gives $\nu(F) \le \mu(F)$ for every closed $F \subset X$. Thus $\mu$ and $\nu$ agree on all closed sets. The closed sets are stable under finite intersections and generate the Borel $\sigma$-algebra, so Dynkin's $\pi$-$\lambda$ theorem shows that $\mu$ and $\nu$ agree on every Borel set. Thus $\mu=\nu$.
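To see the Markov-type estimate at work in the simplest case, take hypothetical points $a \ne b$ in $X$ (introduced only for this illustration) and set $\mu=\delta_a$, $\nu=\delta_b$, $F=\{a\}$ and $\varepsilon=d(a,b)$. The only coupling is $\pi=\delta_{(a,b)}$, and $b \notin F_\varepsilon$ because $\operatorname{dist}(b,F)=d(a,b)$ is not strictly smaller than $\varepsilon$. Hence
\begin{align*}
1 = \pi(F \times (X \setminus F_\varepsilon)) \le \frac{1}{\varepsilon^p}\int_{X \times X} d(x,y)^p\,d\pi(x,y) = \frac{d(a,b)^p}{d(a,b)^p} = 1.
\end{align*}
The bound is attained, and the transport cost $d(a,b)^p$ cannot be made small, which is consistent with the conclusion that distinct measures have positive distance.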
[guided]
The point of this step is to convert “zero average transport cost” into a statement that almost all transported mass lies arbitrarily close to the diagonal. We assume $W_p(\mu,\nu)=0$. By the definition of the infimum, for every $n \in \mathbb{N}$ there exists a coupling $\pi_n \in \Pi(\mu,\nu)$ whose cost satisfies
\begin{align*}
\int_{X \times X} d(x,y)^p\,d\pi_n(x,y) \le \frac{1}{n}.
\end{align*}
Now fix a closed set $F \subset X$. To prove $\mu=\nu$, it is enough to show that $\mu(F)=\nu(F)$ for every closed $F$, because closed sets determine Borel probability measures on a [metric space](/page/Metric%20Space). For $\varepsilon > 0$, define the open $\varepsilon$-neighbourhood of $F$ by
\begin{align*}
F_\varepsilon := \{y \in X : \operatorname{dist}(y,F) < \varepsilon\}.
\end{align*}
If $x \in F$ but $y \notin F_\varepsilon$, then $y$ is at least distance $\varepsilon$ from every point of $F$, so in particular $d(x,y) \ge \varepsilon$. Therefore
\begin{align*}
\varepsilon^p \mathbb{1}_{F \times (X \setminus F_\varepsilon)}(x,y) \le d(x,y)^p
\end{align*}
for all $(x,y) \in X \times X$. Integrating this pointwise inequality with respect to $\pi_n$ gives
\begin{align*}
\pi_n(F \times (X \setminus F_\varepsilon)) \le \frac{1}{\varepsilon^p}\int_{X \times X} d(x,y)^p\,d\pi_n(x,y) \le \frac{1}{\varepsilon^p n}.
\end{align*}
We now decompose the mass that starts in $F$. Since $\pi_n$ has first marginal $\mu$,
\begin{align*}
\mu(F) = \pi_n(F \times X).
\end{align*}
The set $F \times X$ is contained in the union of $X \times F_\varepsilon$ and $F \times (X \setminus F_\varepsilon)$, so subadditivity of $\pi_n$ gives
\begin{align*}
\mu(F) \le \pi_n(X \times F_\varepsilon) + \pi_n(F \times (X \setminus F_\varepsilon)).
\end{align*}
Because the second marginal of $\pi_n$ is $\nu$, we have $\pi_n(X \times F_\varepsilon)=\nu(F_\varepsilon)$. Combining this with the previous estimate yields
\begin{align*}
\mu(F) \le \nu(F_\varepsilon) + \frac{1}{\varepsilon^p n}.
\end{align*}
Letting $n \to \infty$ removes the transport error and gives
\begin{align*}
\mu(F) \le \nu(F_\varepsilon).
\end{align*}
Finally let $\varepsilon \downarrow 0$. Since $F$ is closed, a point $y$ satisfies $\operatorname{dist}(y,F)=0$ only if $y \in F$, so the sets $F_\varepsilon$ decrease to $F$, and since $\nu$ is a probability measure, continuity from above gives
\begin{align*}
\mu(F) \le \nu(F).
\end{align*}
The symmetry already proved gives $W_p(\nu,\mu)=0$, so the same argument with the roles of $\mu$ and $\nu$ interchanged gives $\nu(F) \le \mu(F)$ for every closed $F \subset X$. Hence $\mu(F)=\nu(F)$ for every closed $F$. The closed sets are stable under finite intersections and generate the Borel $\sigma$-algebra of the metric space $X$, so by Dynkin's $\pi$-$\lambda$ theorem the two Borel probability measures are equal.
[/guided]
[/step]
[step:Glue two couplings through their common marginal]
Let $\mu,\nu,\rho \in \mathcal{P}_p(X)$, and let $\pi_{12} \in \Pi(\mu,\nu)$ and $\pi_{23} \in \Pi(\nu,\rho)$. We need a probability measure on $X^3$ whose $(x,y)$-marginal is $\pi_{12}$ and whose $(y,z)$-marginal is $\pi_{23}$.
[claim:Gluing lemma for couplings on a Polish space]
There exists a Borel probability measure $\gamma$ on $X^3$ such that
\begin{align*}
(\operatorname{pr}_{12})_{\#}\gamma = \pi_{12}
\end{align*}
and
\begin{align*}
(\operatorname{pr}_{23})_{\#}\gamma = \pi_{23},
\end{align*}
where $\operatorname{pr}_{12}:X^3 \to X^2$ and $\operatorname{pr}_{23}:X^3 \to X^2$ are the coordinate projection maps defined by $\operatorname{pr}_{12}(x,y,z)=(x,y)$ and $\operatorname{pr}_{23}(x,y,z)=(y,z)$.
[/claim]
[proof]
Since $X$ is Polish, regular conditional probabilities exist for Borel probability measures on products of $X$ with itself (citing a result not yet in the wiki: existence of regular conditional probabilities on Polish spaces). Disintegrate $\pi_{12}$ with respect to its second marginal $\nu$: there is a Markov kernel $K_1:X \times \mathcal{B}(X) \to [0,1]$ such that, for all Borel sets $A,B \subset X$,
\begin{align*}
\pi_{12}(A \times B) = \int_B K_1(y,A)\,d\nu(y).
\end{align*}
Similarly, disintegrate $\pi_{23}$ with respect to its first marginal $\nu$: there is a Markov kernel $K_3:X \times \mathcal{B}(X) \to [0,1]$ such that, for all Borel sets $B,C \subset X$,
\begin{align*}
\pi_{23}(B \times C) = \int_B K_3(y,C)\,d\nu(y).
\end{align*}
Define a set function $\gamma$ on measurable rectangles $A \times B \times C \subset X^3$ by
\begin{align*}
\gamma(A \times B \times C) := \int_B K_1(y,A)K_3(y,C)\,d\nu(y).
\end{align*}
By the standard [extension theorem](/theorems/59) for probability kernels, this prescription extends uniquely to a Borel probability measure $\gamma$ on $X^3$. Taking $C=X$ gives $K_3(y,X)=1$, hence
\begin{align*}
\gamma(A \times B \times X) = \int_B K_1(y,A)\,d\nu(y) = \pi_{12}(A \times B).
\end{align*}
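Since measurable rectangles form a $\pi$-system generating the Borel $\sigma$-algebra of $X^2$, this identity on rectangles implies $(\operatorname{pr}_{12})_{\#}\gamma=\pi_{12}$. Taking $A=X$ gives $K_1(y,X)=1$, hence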
\begin{align*}
\gamma(X \times B \times C) = \int_B K_3(y,C)\,d\nu(y) = \pi_{23}(B \times C).
\end{align*}
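Since two Borel probability measures on $X^2$ that agree on all measurable rectangles are equal, it follows that $(\operatorname{pr}_{23})_{\#}\gamma=\pi_{23}$.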
[/proof]
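In the finitely supported case the construction can be written out explicitly. The following is a sketch under the assumption that $\nu=\sum_j \nu_j\,\delta_{y_j}$ with distinct points $y_j$ and weights $\nu_j>0$; the distinct points $x_i$, $z_k$ and the weights $a_{ij},b_{jk} \ge 0$ are illustrative notation. Write $\pi_{12}=\sum_{i,j}a_{ij}\,\delta_{(x_i,y_j)}$ and $\pi_{23}=\sum_{j,k}b_{jk}\,\delta_{(y_j,z_k)}$, so that $\sum_i a_{ij}=\nu_j=\sum_k b_{jk}$. Then $K_1(y_j,\{x_i\})=a_{ij}/\nu_j$ and $K_3(y_j,\{z_k\})=b_{jk}/\nu_j$, and
\begin{align*}
\gamma = \sum_{i,j,k}\frac{a_{ij}\,b_{jk}}{\nu_j}\,\delta_{(x_i,y_j,z_k)}, \qquad \sum_k \frac{a_{ij}\,b_{jk}}{\nu_j} = a_{ij}, \qquad \sum_i \frac{a_{ij}\,b_{jk}}{\nu_j} = b_{jk}.
\end{align*}
The two sums recover $\pi_{12}$ and $\pi_{23}$ as the $(x,y)$- and $(y,z)$-marginals. Informally, given the middle point $y_j$, the first and third coordinates are chosen independently of each other.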
[/step]
[step:Apply Minkowski to prove the triangle inequality]
Fix $\eta>0$. Choose couplings $\pi_{12} \in \Pi(\mu,\nu)$ and $\pi_{23} \in \Pi(\nu,\rho)$ such that
\begin{align*}
\left(\int_{X \times X} d(x,y)^p\,d\pi_{12}(x,y)\right)^{1/p} \le W_p(\mu,\nu)+\eta
\end{align*}
and
\begin{align*}
\left(\int_{X \times X} d(y,z)^p\,d\pi_{23}(y,z)\right)^{1/p} \le W_p(\nu,\rho)+\eta.
\end{align*}
Let $\gamma$ be the glued probability measure on $X^3$ from the previous step. Define [measurable functions](/page/Measurable%20Functions) $a,b:X^3 \to [0,\infty)$ by
\begin{align*}
a(x,y,z):=d(x,y), \qquad b(x,y,z):=d(y,z).
\end{align*}
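The pointwise triangle inequality gives $d(x,z) \le a(x,y,z)+b(x,y,z)$. Because $(\operatorname{pr}_{12})_{\#}\gamma=\pi_{12}$ and $(\operatorname{pr}_{23})_{\#}\gamma=\pi_{23}$, the first marginal of $\gamma$ is $\mu$ and its third marginal is $\rho$. Hence the $(x,z)$-marginal of $\gamma$ is a coupling of $\mu$ and $\rho$, and we have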
\begin{align*}
W_p(\mu,\rho) \le \left(\int_{X^3} d(x,z)^p\,d\gamma(x,y,z)\right)^{1/p}.
\end{align*}
Since $0 \le d(x,z) \le a(x,y,z)+b(x,y,z)$ pointwise, monotonicity of the integral and Minkowski's inequality in $L^p(\gamma)$ (citing a result not yet in the wiki: [Minkowski inequality](/theorems/517)) give
\begin{align*}
\left(\int_{X^3} d(x,z)^p\,d\gamma(x,y,z)\right)^{1/p} \le \left(\int_{X^3} a(x,y,z)^p\,d\gamma(x,y,z)\right)^{1/p} + \left(\int_{X^3} b(x,y,z)^p\,d\gamma(x,y,z)\right)^{1/p}.
\end{align*}
Using the marginal identities for $\gamma$,
\begin{align*}
\int_{X^3} a(x,y,z)^p\,d\gamma(x,y,z) = \int_{X \times X} d(x,y)^p\,d\pi_{12}(x,y)
\end{align*}
and
\begin{align*}
\int_{X^3} b(x,y,z)^p\,d\gamma(x,y,z) = \int_{X \times X} d(y,z)^p\,d\pi_{23}(y,z).
\end{align*}
Combining these identities with the choice of $\pi_{12}$ and $\pi_{23}$ gives
\begin{align*}
W_p(\mu,\rho) \le W_p(\mu,\nu)+W_p(\nu,\rho)+2\eta.
\end{align*}
Letting $\eta \downarrow 0$ gives
\begin{align*}
W_p(\mu,\rho) \le W_p(\mu,\nu)+W_p(\nu,\rho).
\end{align*}
This proves the triangle inequality.
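A small example suggests why the $p$-th root, and hence Minkowski's inequality, is needed. Assume for illustration only that $X=\mathbb{R}$ with $d(x,y)=|x-y|$ and $p=2$, and take $\mu=\delta_0$, $\nu=\delta_1$, $\rho=\delta_2$. Two Dirac masses $\delta_s,\delta_t$ admit the single coupling $\delta_{(s,t)}$, so $W_2(\delta_s,\delta_t)=|s-t|$, and the glued measure is $\gamma=\delta_{(0,1,2)}$. Then
\begin{align*}
W_2(\mu,\rho) = 2 \le 1+1 = W_2(\mu,\nu)+W_2(\nu,\rho), \qquad W_2(\mu,\rho)^2 = 4 > 2 = W_2(\mu,\nu)^2+W_2(\nu,\rho)^2.
\end{align*}
The triangle inequality holds with equality for $W_2$, while the un-rooted cost $W_2^2$ violates it.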
[/step]
[step:Conclude that all metric axioms hold]
We have shown that $W_p(\mu,\nu)$ is finite and non-negative for all $\mu,\nu \in \mathcal{P}_p(X)$, that $W_p(\mu,\nu)=W_p(\nu,\mu)$, that $W_p(\mu,\nu)=0$ implies $\mu=\nu$, and that the triangle inequality holds. Conversely, if $\mu=\nu$, then the diagonal coupling $(x \mapsto (x,x))_{\#}\mu$ has zero cost, so $W_p(\mu,\mu)=0$. Hence $W_p$ satisfies all metric axioms on $\mathcal{P}_p(X)$.
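The zero-cost claim can be checked in one line. Write $D(x):=(x,x)$ for the diagonal map and $\operatorname{pr}_1,\operatorname{pr}_2:X^2 \to X$ for the coordinate projections (names introduced only for this computation). Both marginals of $D_{\#}\mu$ equal $\mu$ because $\operatorname{pr}_1 \circ D=\operatorname{pr}_2 \circ D=\operatorname{id}_X$, and the change-of-variables formula for pushforwards gives
\begin{align*}
\int_{X \times X} d(x,y)^p\,d(D_{\#}\mu)(x,y) = \int_X d(x,x)^p\,d\mu(x) = 0.
\end{align*}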
[/step]