[proofplan]
We construct the increasing coupling by pushing forward [Lebesgue measure](/page/Lebesgue%20Measure) on $(0,1)$ through the two quantile functions. The quantile transform gives the prescribed marginals, so this coupling gives the upper bound for $W_p(\mu,\nu)^p$. For the reverse inequality, we prove that any coupling with a crossing can be uncrossed without increasing the cost $|x-y|^p$, first for finitely supported couplings and then by approximation for general couplings. The uncrossed limiting coupling is precisely the quantile coupling, giving optimality.
[/proofplan]
[step:Define the quantile coupling and verify its marginals]
Let $\lambda$ denote the restriction of one-dimensional Lebesgue measure $\mathcal{L}^1$ to the Borel space $((0,1),\mathcal{B}((0,1)))$. Define the quantile maps
\begin{align*}
Q_\mu:(0,1)&\to\mathbb{R}, \qquad s\mapsto F_\mu^{-1}(s),
\end{align*}
and
\begin{align*}
Q_\nu:(0,1)&\to\mathbb{R}, \qquad s\mapsto F_\nu^{-1}(s).
\end{align*}
Both maps are nondecreasing and Borel measurable. Define the probability measure $\gamma$ on $\mathbb{R}^2$ by
\begin{align*}
\gamma=(Q_\mu,Q_\nu)_\#\lambda.
\end{align*}
We show that $\gamma$ has first marginal $\mu$ and second marginal $\nu$.
For $x\in\mathbb{R}$, the defining property of the generalized inverse gives
\begin{align*}
Q_\mu(s)\leq x \iff s\leq F_\mu(x)
\end{align*}
for $\lambda$-almost every $s\in(0,1)$. Therefore
\begin{align*}
\gamma((-\infty,x]\times\mathbb{R})=\lambda(\{s\in(0,1):Q_\mu(s)\leq x\})=F_\mu(x).
\end{align*}
Thus the first marginal of $\gamma$ is $\mu$. The same argument with $\nu$ gives
\begin{align*}
\gamma(\mathbb{R}\times(-\infty,y])=F_\nu(y)
\end{align*}
for every $y\in\mathbb{R}$, so the second marginal is $\nu$. Hence $\gamma\in\Pi(\mu,\nu)$, where $\Pi(\mu,\nu)$ denotes the set of Borel probability measures on $\mathbb{R}^2$ with marginals $\mu$ and $\nu$.
[guided]
We want to build a coupling of $\mu$ and $\nu$ from one common parameter $s\in(0,1)$. The correct maps are the generalized inverse functions, so we define
\begin{align*}
Q_\mu:(0,1)&\to\mathbb{R}, \qquad s\mapsto F_\mu^{-1}(s),
\end{align*}
and
\begin{align*}
Q_\nu:(0,1)&\to\mathbb{R}, \qquad s\mapsto F_\nu^{-1}(s).
\end{align*}
Because each distribution function is nondecreasing and right-continuous, each generalized inverse is nondecreasing and Borel measurable. Let $\lambda=\mathcal{L}^1|_{(0,1)}$ be Lebesgue probability measure on $(0,1)$, and define
\begin{align*}
\gamma=(Q_\mu,Q_\nu)_\#\lambda.
\end{align*}
This means that for every Borel set $A\subset\mathbb{R}^2$,
\begin{align*}
\gamma(A)=\lambda(\{s\in(0,1):(Q_\mu(s),Q_\nu(s))\in A\}).
\end{align*}
We now verify the marginals directly from distribution functions. Fix $x\in\mathbb{R}$. By the definition
\begin{align*}
Q_\mu(s)=\inf\{a\in\mathbb{R}:F_\mu(a)\geq s\},
\end{align*}
the inequality $Q_\mu(s)\leq x$ is equivalent, up to the harmless endpoint ambiguity of a monotone function, to $s\leq F_\mu(x)$. Hence
\begin{align*}
\lambda(\{s\in(0,1):Q_\mu(s)\leq x\})=F_\mu(x).
\end{align*}
Therefore
\begin{align*}
\gamma((-\infty,x]\times\mathbb{R})=F_\mu(x),
\end{align*}
which says exactly that the first marginal of $\gamma$ has distribution function $F_\mu$, hence is $\mu$. Repeating the same argument for $Q_\nu$ gives
\begin{align*}
\gamma(\mathbb{R}\times(-\infty,y])=F_\nu(y)
\end{align*}
for every $y\in\mathbb{R}$, so the second marginal is $\nu$. Thus $\gamma$ is an admissible transport plan from $\mu$ to $\nu$.
[/guided]
[/step]
[step:Use the quantile coupling to obtain the upper bound]
Since $\gamma\in\Pi(\mu,\nu)$ and $W_p(\mu,\nu)^p$ is the infimum of $\int_{\mathbb{R}^2}|x-y|^p\,d\pi(x,y)$ over all $\pi\in\Pi(\mu,\nu)$, the definition of the Wasserstein distance gives
\begin{align*}
W_p(\mu,\nu)^p\leq \int_{\mathbb{R}^2}|x-y|^p\,d\gamma(x,y).
\end{align*}
By the definition of pushforward measure under $(Q_\mu,Q_\nu)$, this integral equals
\begin{align*}
\int_{\mathbb{R}^2}|x-y|^p\,d\gamma(x,y)=\int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
Therefore
\begin{align*}
W_p(\mu,\nu)^p\leq \int_0^1 |F_\mu^{-1}(s)-F_\nu^{-1}(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
The integral is finite because $\mu,\nu\in\mathcal{P}_p(\mathbb{R})$ imply $Q_\mu,Q_\nu\in L^p((0,1),\lambda)$, and the inequality $|a-b|^p\leq 2^{p-1}(|a|^p+|b|^p)$ applies for all $a,b\in\mathbb{R}$.
[/step]
[step:Prove the crossing inequality for the cost $|x-y|^p$]
Let $x_1,x_2,y_1,y_2\in\mathbb{R}$ satisfy $x_1<x_2$ and $y_1<y_2$. We claim that
\begin{align*}
|x_1-y_1|^p+|x_2-y_2|^p\leq |x_1-y_2|^p+|x_2-y_1|^p.
\end{align*}
Define the convex function
\begin{align*}
\phi:\mathbb{R}\to\mathbb{R}, \qquad \phi(t)=|t|^p.
\end{align*}
For fixed $x_1<x_2$, define
\begin{align*}
h:\mathbb{R}\to\mathbb{R}, \qquad h(y)=\phi(x_2-y)-\phi(x_1-y).
\end{align*}
Convexity of $\phi$ implies that the difference quotient of $\phi$ over intervals of the same length is nondecreasing in the base point. Since
\begin{align*}
h(y)=\phi(x_2-y)-\phi(x_1-y),
\end{align*}
the function $h$ is nonincreasing in $y$. Hence $h(y_2)\leq h(y_1)$, which is precisely
\begin{align*}
|x_2-y_2|^p-|x_1-y_2|^p\leq |x_2-y_1|^p-|x_1-y_1|^p.
\end{align*}
Rearranging gives the desired crossing inequality.
[guided]
The crossing inequality is the algebraic reason monotone transport is optimal on the line. Suppose $x_1<x_2$ but the larger target $y_2$ is paired with the smaller source $x_1$, while the smaller target $y_1$ is paired with the larger source $x_2$. We want to show that uncrossing these two assignments cannot increase the total cost.
Define
\begin{align*}
\phi:\mathbb{R}\to\mathbb{R}, \qquad \phi(t)=|t|^p.
\end{align*}
Since $p\geq1$, the function $\phi$ is convex. Fix $x_1<x_2$, and define
\begin{align*}
h:\mathbb{R}\to\mathbb{R}, \qquad h(y)=\phi(x_2-y)-\phi(x_1-y).
\end{align*}
The quantity $h(y)$ measures how much more expensive it is to send $x_2$ to $y$ instead of sending $x_1$ to $y$.
Convexity says that slopes of secant lines are monotone. Applied to the interval endpoints $x_1-y$ and $x_2-y$, this means that as $y$ increases, both endpoints shift left, and the increment
\begin{align*}
\phi(x_2-y)-\phi(x_1-y)
\end{align*}
cannot increase. Therefore $h$ is nonincreasing. Since $y_1<y_2$, we get
\begin{align*}
h(y_2)\leq h(y_1).
\end{align*}
Writing this out gives
\begin{align*}
|x_2-y_2|^p-|x_1-y_2|^p\leq |x_2-y_1|^p-|x_1-y_1|^p.
\end{align*}
After moving the negative terms to the other side, this becomes
\begin{align*}
|x_1-y_1|^p+|x_2-y_2|^p\leq |x_1-y_2|^p+|x_2-y_1|^p.
\end{align*}
Thus matching the smaller source with the smaller target and the larger source with the larger target is no more expensive than the crossed matching.
[/guided]
[/step]
[step:Minimize the cost among finitely supported couplings by uncrossing]
Assume first that $\mu$ and $\nu$ are finitely supported probability measures on $\mathbb{R}$. Write
\begin{align*}
\mu=\sum_{i=1}^{m} r_i\delta_{x_i}.
\end{align*}
\begin{align*}
\nu=\sum_{j=1}^{n} c_j\delta_{y_j}.
\end{align*}
Here $x_1<\cdots<x_m$, $y_1<\cdots<y_n$, $r_i>0$, $c_j>0$, $\sum_{i=1}^{m}r_i=1$, $\sum_{j=1}^{n}c_j=1$, $\delta_z$ denotes the Dirac probability measure at $z\in\mathbb{R}$, and $\delta_{(x_i,y_j)}$ denotes the Dirac probability measure at $(x_i,y_j)\in\mathbb{R}^2$. Let
\begin{align*}
\pi=\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}\delta_{(x_i,y_j)}
\end{align*}
be a coupling of $\mu$ and $\nu$, where $a_{ij}\geq0$, $\sum_{j=1}^{n}a_{ij}=r_i$, and $\sum_{i=1}^{m}a_{ij}=c_j$.
Construct the monotone filling matrix $b=(b_{ij})$ by the finite north-west rule: starting at $(i,j)=(1,1)$, put $b_{ij}=\min\{r_i^{\mathrm{rem}},c_j^{\mathrm{rem}}\}$, subtract this amount from the remaining row mass $r_i^{\mathrm{rem}}$ and column mass $c_j^{\mathrm{rem}}$, then increase $i$ when the row is exhausted and increase $j$ when the column is exhausted. Since at least one row or column is exhausted at each move, the construction terminates after at most $m+n-1$ moves. It produces a coupling, and its support is monotone: if $b_{ij}>0$ and $b_{i'j'}>0$ with $i<i'$, then $j\leq j'$.
Let $a=(a_{ij})$ be any cost-minimizing coupling matrix among all matrices with these row and column sums. Such a minimizer exists because the transportation polytope is a closed and bounded subset of $\mathbb{R}^{mn}$ and the cost is a linear function of $a$. Among all cost minimizers, choose one for which the lexicographic vector of cumulative southwest masses
\begin{align*}
M_{ij}(a)=\sum_{u=1}^{i}\sum_{v=1}^{j}a_{uv}
\end{align*}
is maximal in lexicographic order over the finite list of pairs $(i,j)$.
We claim that this chosen minimizer has no crossing. Suppose instead that $i_1<i_2$ and $j_1<j_2$ satisfy $a_{i_1j_2}>0$ and $a_{i_2j_1}>0$. Define
\begin{align*}
\varepsilon=\min\{a_{i_1j_2},a_{i_2j_1}\}.
\end{align*}
Move mass $\varepsilon$ from $(i_1,j_2)$ and $(i_2,j_1)$ to $(i_1,j_1)$ and $(i_2,j_2)$. This preserves every row sum and column sum. By the crossing inequality, it does not increase the cost, so the new matrix is again a cost minimizer. For every lower-left rectangle, its mass changes by $0$ except for rectangles whose row cutoff lies in $\{i_1,\dots,i_2-1\}$ and whose column cutoff lies in $\{j_1,\dots,j_2-1\}$, where the mass increases by $\varepsilon$. Hence the lexicographic vector $(M_{ij})$ strictly increases, contradicting the choice of $a$. Thus a cost minimizer is monotone.
It remains to identify the monotone coupling. For $R_i=\sum_{u=1}^{i}r_u$ and $C_j=\sum_{v=1}^{j}c_v$, monotonicity forces
\begin{align*}
\sum_{u=1}^{i}\sum_{v=1}^{j}a_{uv}=\min\{R_i,C_j\}.
\end{align*}
Indeed, if $R_i\leq C_j$, then all mass in rows $u\leq i$ must lie in columns $v\leq j$, since any mass with $u\leq i$ and $v>j$ would require, by the column sums up to $j$, positive mass with $u'>i$ and $v'\leq j$, creating a crossing. The case $C_j\leq R_i$ is symmetric. These rectangle values determine every entry by finite inclusion-exclusion, so the monotone coupling is unique. The north-west filling matrix has exactly these rectangle values; hence it is the unique monotone coupling. It is also exactly $(Q_\mu,Q_\nu)_\#\lambda$, because $Q_\mu=x_i$ on the interval $(R_{i-1},R_i]$ and $Q_\nu=y_j$ on the interval $(C_{j-1},C_j]$, with $R_0=C_0=0$. Therefore, in the finitely supported case,
\begin{align*}
\int_{\mathbb{R}^2}|x-y|^p\,d\pi(x,y)\geq \int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s)
\end{align*}
for every $\pi\in\Pi(\mu,\nu)$.
[/step]
[step:Pass from finitely supported measures to arbitrary measures in $\mathcal{P}_p(\mathbb{R})$]
Let $\pi\in\Pi(\mu,\nu)$ be arbitrary. For each $k\in\mathbb{N}$, define the truncation-and-quantization map
\begin{align*}
T_k:\mathbb{R}&\to 2^{-k}\mathbb{Z}\cap[-k,k], \qquad x\mapsto \max\{-k,\min\{k,2^{-k}\lfloor 2^k x\rfloor\}\}.
\end{align*}
Define finitely supported probability measures
\begin{align*}
\mu_k=(T_k)_\#\mu, \qquad \nu_k=(T_k)_\#\nu, \qquad \pi_k=(T_k,T_k)_\#\pi.
\end{align*}
Then $\pi_k\in\Pi(\mu_k,\nu_k)$. Since $T_k(x)\to x$ for every $x\in\mathbb{R}$ and $|T_k(x)|\leq |x|+1$, dominated convergence with respect to $\pi$ gives
\begin{align*}
\lim_{k\to\infty}\int_{\mathbb{R}^2}|T_k(x)-T_k(y)|^p\,d\pi(x,y)=\int_{\mathbb{R}^2}|x-y|^p\,d\pi(x,y),
\end{align*}
because $|T_k(x)-T_k(y)|^p\leq 2^{p-1}(|x|+1)^p+2^{p-1}(|y|+1)^p$ and the right-hand side is $\pi$-integrable by the marginal moment hypotheses.
Applying the finitely supported case to $\mu_k$ and $\nu_k$ gives
\begin{align*}
\int_{\mathbb{R}^2}|T_k(x)-T_k(y)|^p\,d\pi(x,y)\geq \int_0^1 |Q_{\mu_k}(s)-Q_{\nu_k}(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
For the same monotone truncation-and-quantization map $T_k$, the generalized inverse of $(T_k)_\#\mu$ agrees with $T_k\circ Q_\mu$ for $\mathcal{L}^1$-almost every $s\in(0,1)$. To verify this, let $z\in 2^{-k}\mathbb{Z}\cap[-k,k]$ and use monotonicity of $T_k$ to write
\begin{align*}
\{s\in(0,1):T_k(Q_\mu(s))\leq z\}=\{s\in(0,1):Q_\mu(s)\leq \sup T_k^{-1}((-∞,z])\}
\end{align*}
up to endpoints. By the sublevel identity for $Q_\mu$, the $\lambda$-measure of this set equals $\mu(T_k^{-1}((-∞,z]))$, which is exactly the distribution function of $(T_k)_\#\mu$ at $z$. Since both $Q_{\mu_k}$ and $T_k\circ Q_\mu$ take values in the same finite ordered set, equality of all sublevel sets up to $\lambda$-null endpoints gives $Q_{\mu_k}=T_k\circ Q_\mu$ $\lambda$-almost everywhere. The same argument gives $Q_{\nu_k}=T_k\circ Q_\nu$ $\lambda$-almost everywhere.
Here $L^p((0,1),\lambda)$ denotes the space of $p$-integrable real-valued functions on the probability space $((0,1),\mathcal{B}((0,1)),\lambda)$, modulo equality $\lambda$-almost everywhere. Since $T_k(t) o t$ for every $t\in\mathbb{R}$ and $|T_k(t)|\leq |t|+1$, dominated convergence on $((0,1),\mathcal{B}((0,1)),\lambda)$ gives
\begin{align*}
Q_{\mu_k}\to Q_\mu \text{ in } L^p((0,1),\lambda).
\end{align*}
\begin{align*}
Q_{\nu_k}\to Q_\nu \text{ in } L^p((0,1),\lambda).
\end{align*}
The map $(f,g)\mapsto \|f-g\|_{L^p((0,1),\lambda)}^p$ is continuous under $L^p$ convergence, so
\begin{align*}
\lim_{k\to\infty}\int_0^1 |Q_{\mu_k}(s)-Q_{\nu_k}(s)|^p\,d\mathcal{L}^1(s)=\int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
Passing to the limit yields
\begin{align*}
\int_{\mathbb{R}^2}|x-y|^p\,d\pi(x,y)\geq \int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
Since $\pi\in\Pi(\mu,\nu)$ was arbitrary, taking the infimum over all couplings gives
\begin{align*}
W_p(\mu,\nu)^p\geq \int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
[/step]
[step:Combine the two inequalities]
The upper bound from the quantile coupling gives
\begin{align*}
W_p(\mu,\nu)^p\leq \int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s),
\end{align*}
while the optimality argument for arbitrary couplings gives
\begin{align*}
W_p(\mu,\nu)^p\geq \int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
Therefore
\begin{align*}
W_p(\mu,\nu)^p=\int_0^1 |Q_\mu(s)-Q_\nu(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
Substituting $Q_\mu=F_\mu^{-1}$ and $Q_\nu=F_\nu^{-1}$ gives
\begin{align*}
W_p(\mu,\nu)^p=\int_0^1 |F_\mu^{-1}(s)-F_\nu^{-1}(s)|^p\,d\mathcal{L}^1(s).
\end{align*}
This is the desired formula.
[/step]