[proofplan]
We prove the upper bound by a dyadic hierarchical transport estimate: mass discrepancies are corrected scale by scale, and the expected discrepancy in each dyadic cube is controlled by binomial variance. The condition $d>2p$ makes the large-scale part of the dyadic sum dominated by the terminal scale $2^{-k}\simeq N^{-1/d}$, while the small-scale tail is summable. For the lower bound, we use a covering argument around the empirical atoms: with positive expected $\mu$-mass, a point sampled from $\mu$ lies at distance comparable to $N^{-1/d}$ from all sample points, forcing any coupling from $\hat\mu_N$ to $\mu$ to move that mass at least this distance. The constants are chosen uniformly in $N$.
[/proofplan]
[step:Control transport by dyadic mass discrepancies]
For every $k\in\mathbb N_0$, let $\mathcal D_k$ denote the family of half-open dyadic cubes of side length $2^{-k}$ contained in $Q$, with the usual convention that boundary pieces are assigned to exactly one cube. Thus $\mathcal D_0=\{Q\}$ and $\#\mathcal D_k=2^{kd}$.
We use the following standard dyadic transport estimate.
[claim:Dyadic hierarchical transport estimate]
There exists a constant $K_{d,p}\in(0,\infty)$, depending only on $d$ and $p$, such that, for all Borel probability measures $\nu_0$ and $\nu_1$ on $Q$,
\begin{align*}
W_p(\nu_0,\nu_1)^p\le K_{d,p}\sum_{k=0}^{\infty}2^{-kp}\sum_{R\in\mathcal D_k}|\nu_0(R)-\nu_1(R)|.
\end{align*}
[/claim]
[proof]
This is the standard dyadic hierarchical transport estimate for probability measures on a cube. For completeness, we recall the construction. At level $k$, each cube $R\in\mathcal D_k$ has diameter $\sqrt d\,2^{-k}$. Starting at level $0$ and refining downward, one transports the excess mass between the children of each dyadic parent cube before passing the remaining balanced masses to the next scale. The total mass that must be rearranged at scale $k$ is bounded by
\begin{align*}
\sum_{R\in\mathcal D_k}|\nu_0(R)-\nu_1(R)|.
\end{align*}
Each unit of mass moved at this scale travels distance at most $\sqrt d\,2^{-k+1}$, so its $p$-cost is bounded by $d^{p/2}2^p2^{-kp}$. Summing over all scales and then letting the terminal mesh size tend to $0$ gives the stated estimate with, for example, a finite constant $K_{d,p}$ obtained by absorbing the geometric diameter factor and the finite overlap of the hierarchical construction. The limiting passage is justified by narrow compactness on the compact metric space $Q$ and lower semicontinuity of the transport cost.
[/proof]
[guided]
The estimate says that we can upper-bound Wasserstein cost by measuring how much the two measures disagree on dyadic cells at every scale. The reason this is useful for empirical measures is that $\hat\mu_N(R)-\mu(R)$ is a centered binomial fluctuation for each fixed dyadic cube $R$.
Let $\nu_0$ and $\nu_1$ be Borel probability measures on $Q$. For each level $k$, the partition $\mathcal D_k$ decomposes $Q$ into cubes of diameter $\sqrt d\,2^{-k}$. The hierarchical construction first compares the masses of $\nu_0$ and $\nu_1$ in the children of $Q$, then in the grandchildren, and so on. At scale $k$, the amount of mass that must be corrected is bounded by the total discrepancy
\begin{align*}
\sum_{R\in\mathcal D_k}|\nu_0(R)-\nu_1(R)|.
\end{align*}
Since all correction at this level takes place inside dyadic parents of diameter at most $\sqrt d\,2^{-k+1}$, transporting one unit of mass at this level costs at most
\begin{align*}
(\sqrt d\,2^{-k+1})^p=d^{p/2}2^p2^{-kp}.
\end{align*}
Adding the cost over all levels yields
\begin{align*}
W_p(\nu_0,\nu_1)^p\le K_{d,p}\sum_{k=0}^{\infty}2^{-kp}\sum_{R\in\mathcal D_k}|\nu_0(R)-\nu_1(R)|.
\end{align*}
Here $K_{d,p}$ is a constant depending only on the dimension and on $p$; it accounts for the diameter factor and for the bounded overlap in the tree construction. This is the standard dyadic hierarchical transport estimate for empirical measures on cubes.
[/guided]
[/step]
[step:Estimate the expected dyadic discrepancy of the empirical measure]
Fix $N\in\mathbb N$ and $R\in\mathcal D_k$. Define the Bernoulli random variables
\begin{align*}
Y_i^R:\Omega\to\{0,1\},\qquad \omega\mapsto \mathbb 1_R(X_i(\omega)).
\end{align*}
Then
\begin{align*}
\hat\mu_N(R)=\frac{1}{N}\sum_{i=1}^N Y_i^R.
\end{align*}
The variables $Y_1^R,\dots,Y_N^R$ are independent and identically distributed with mean $\mu(R)$. Therefore, using Cauchy--Schwarz in the probability space $(\Omega,\mathcal F,\mathbb P)$ and the variance formula for a Bernoulli random variable,
\begin{align*}
\mathbb E[|\hat\mu_N(R)-\mu(R)|]\le \left(\mathbb E[|\hat\mu_N(R)-\mu(R)|^2]\right)^{1/2}.
\end{align*}
Also
\begin{align*}
\mathbb E[|\hat\mu_N(R)-\mu(R)|^2]=\frac{\mu(R)(1-\mu(R))}{N}\le \frac{\mu(R)}{N}.
\end{align*}
Hence
\begin{align*}
\mathbb E[|\hat\mu_N(R)-\mu(R)|]\le \left(\frac{\mu(R)}{N}\right)^{1/2}.
\end{align*}
Summing over $R\in\mathcal D_k$ and applying Cauchy--Schwarz to the finite counting measure on $\mathcal D_k$ gives
\begin{align*}
\sum_{R\in\mathcal D_k}\mu(R)^{1/2}\le \left(\#\mathcal D_k\right)^{1/2}\left(\sum_{R\in\mathcal D_k}\mu(R)\right)^{1/2}=2^{kd/2}.
\end{align*}
Thus
\begin{align*}
\mathbb E\left[\sum_{R\in\mathcal D_k}|\hat\mu_N(R)-\mu(R)|\right]\le N^{-1/2}2^{kd/2}.
\end{align*}
The elementary bound $|\hat\mu_N(R)-\mu(R)|\le \hat\mu_N(R)+\mu(R)$ also gives
\begin{align*}
\sum_{R\in\mathcal D_k}|\hat\mu_N(R)-\mu(R)|\le 2.
\end{align*}
Combining the two estimates,
\begin{align*}
\mathbb E\left[\sum_{R\in\mathcal D_k}|\hat\mu_N(R)-\mu(R)|\right]\le \min\{2,N^{-1/2}2^{kd/2}\}.
\end{align*}
[/step]
[step:Sum the dyadic upper bound at the critical scale]
Choose $K_N\in\mathbb N_0$ such that
\begin{align*}
2^{K_Nd}\le N<2^{(K_N+1)d}
\end{align*}
when $N\ge 2$, and set $K_1:=0$. Taking expectation in the dyadic transport estimate and using the previous step gives
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)^p]\le K_{d,p}\sum_{k=0}^{\infty}2^{-kp}\min\{2,N^{-1/2}2^{kd/2}\}.
\end{align*}
For $0\le k\le K_N$, use the variance bound. Since $d/2-p>0$,
\begin{align*}
\sum_{k=0}^{K_N}2^{-kp}N^{-1/2}2^{kd/2}=N^{-1/2}\sum_{k=0}^{K_N}2^{k(d/2-p)}.
\end{align*}
The finite geometric sum is bounded by a constant depending only on $d$ and $p$ times $2^{K_N(d/2-p)}$. Since $2^{K_N}\le N^{1/d}$, this part is bounded by a constant times
\begin{align*}
N^{-1/2}N^{(d/2-p)/d}=N^{-p/d}.
\end{align*}
For $k>K_N$, use the deterministic bound:
\begin{align*}
\sum_{k=K_N+1}^{\infty}2^{-kp}\,2\le \frac{2\,2^{-(K_N+1)p}}{1-2^{-p}}.
\end{align*}
Because $N<2^{(K_N+1)d}$, we have $2^{-(K_N+1)p}<N^{-p/d}$. Hence there exists a constant $M_{d,p}\in(0,\infty)$ such that
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)^p]\le M_{d,p}N^{-p/d}
\end{align*}
for every $N\in\mathbb N$. Jensen's inequality for the concave map $t\mapsto t^{1/p}$ on $[0,\infty)$ yields
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)]\le \left(\mathbb E[W_p(\hat\mu_N,\mu)^p]\right)^{1/p}\le M_{d,p}^{1/p}N^{-1/d}.
\end{align*}
Thus the upper bound holds with $B:=M_{d,p}^{1/p}$.
[guided]
We now insert the dyadic fluctuation estimate into the transport estimate. The two bounds
\begin{align*}
\mathbb E\left[\sum_{R\in\mathcal D_k}|\hat\mu_N(R)-\mu(R)|\right]\le N^{-1/2}2^{kd/2}
\end{align*}
and
\begin{align*}
\mathbb E\left[\sum_{R\in\mathcal D_k}|\hat\mu_N(R)-\mu(R)|\right]\le 2
\end{align*}
have different roles. The first is good at coarse scales, where there are not too many cubes. The second is good at very fine scales, where the variance bound would grow too large.
Let $K_N$ be the integer satisfying
\begin{align*}
2^{K_Nd}\le N<2^{(K_N+1)d}.
\end{align*}
This means that the dyadic cell side length $2^{-K_N}$ is comparable to $N^{-1/d}$, the expected nearest-neighbor spacing of $N$ points in $d$ dimensions. By the dyadic transport estimate,
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)^p]\le K_{d,p}\sum_{k=0}^{\infty}2^{-kp}\min\{2,N^{-1/2}2^{kd/2}\}.
\end{align*}
For the coarse scales $k\le K_N$, we use the fluctuation estimate:
\begin{align*}
\sum_{k=0}^{K_N}2^{-kp}N^{-1/2}2^{kd/2}=N^{-1/2}\sum_{k=0}^{K_N}2^{k(d/2-p)}.
\end{align*}
Here the hypothesis $d>2p$ is consumed: it gives $d/2-p>0$, so the geometric sum is dominated by its largest term. Therefore this part is bounded by a constant depending only on $d$ and $p$ times
\begin{align*}
N^{-1/2}2^{K_N(d/2-p)}\le N^{-1/2}N^{(d/2-p)/d}=N^{-p/d}.
\end{align*}
For the fine scales $k>K_N$, the total discrepancy cannot exceed $2$, because both $\hat\mu_N$ and $\mu$ are probability measures. Hence
\begin{align*}
\sum_{k=K_N+1}^{\infty}2^{-kp}\,2\le \frac{2\,2^{-(K_N+1)p}}{1-2^{-p}}.
\end{align*}
The defining inequality $N<2^{(K_N+1)d}$ implies $2^{-(K_N+1)p}<N^{-p/d}$, so this tail is also bounded by a constant times $N^{-p/d}$. Combining the coarse and fine scale estimates gives
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)^p]\le M_{d,p}N^{-p/d}.
\end{align*}
Finally, Jensen's inequality applies because $t\mapsto t^{1/p}$ is concave on $[0,\infty)$. Thus
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)]\le \left(\mathbb E[W_p(\hat\mu_N,\mu)^p]\right)^{1/p}\le M_{d,p}^{1/p}N^{-1/d}.
\end{align*}
[/guided]
[/step]
[step:Force transport over empty neighborhoods of the empirical support]
Let $\alpha_d:=\mathcal L^d(B(0,1))$ denote the Lebesgue measure of the Euclidean unit ball in $\mathbb R^d$. Define
\begin{align*}
r_N:=\left(\frac{1}{2C\alpha_d N}\right)^{1/d}.
\end{align*}
For $\omega\in\Omega$, define the random closed set
\begin{align*}
S_N(\omega):=\{X_1(\omega),\dots,X_N(\omega)\}\subset Q
\end{align*}
and the random Borel set
\begin{align*}
E_N(\omega):=\{y\in Q:\operatorname{dist}(y,S_N(\omega))\ge r_N\}.
\end{align*}
Since $\rho\le C$ $(\mathcal L^d\!\restriction_Q)$-a.e., for every $\omega\in\Omega$,
\begin{align*}
\mu(Q\setminus E_N(\omega))\le \sum_{i=1}^N\mu(Q\cap B(X_i(\omega),r_N)).
\end{align*}
For each $i\in\{1,\dots,N\}$,
\begin{align*}
\mu(Q\cap B(X_i(\omega),r_N))\le C\mathcal L^d(B(X_i(\omega),r_N))=C\alpha_d r_N^d.
\end{align*}
Therefore
\begin{align*}
\mu(Q\setminus E_N(\omega))\le NC\alpha_d r_N^d=\frac{1}{2}.
\end{align*}
Thus
\begin{align*}
\mu(E_N(\omega))\ge \frac{1}{2}
\end{align*}
for every $\omega\in\Omega$.
Fix $\omega\in\Omega$ and let $\pi\in\Pi(\hat\mu_N(\omega),\mu)$. Since the first marginal of $\pi$ is supported on $S_N(\omega)$, the set $S_N(\omega)\times E_N(\omega)$ has $\pi$-mass $\mu(E_N(\omega))$. For every $(x,y)\in S_N(\omega)\times E_N(\omega)$, one has $|x-y|\ge r_N$. Hence
\begin{align*}
\int_{Q\times Q}|x-y|^p\,d\pi(x,y)\ge r_N^p\mu(E_N(\omega)).
\end{align*}
Taking the infimum over $\pi\in\Pi(\hat\mu_N(\omega),\mu)$ gives
\begin{align*}
W_p(\hat\mu_N(\omega),\mu)\ge r_N\mu(E_N(\omega))^{1/p}.
\end{align*}
Since $\mu(E_N(\omega))\ge 1/2$,
\begin{align*}
W_p(\hat\mu_N(\omega),\mu)\ge 2^{-1/p}r_N.
\end{align*}
Taking expectation with respect to $\mathbb P$ yields
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)]\ge 2^{-1/p}\left(\frac{1}{2C\alpha_d}\right)^{1/d}N^{-1/d}.
\end{align*}
Thus the lower bound holds with
\begin{align*}
A:=2^{-1/p}\left(\frac{1}{2C\alpha_d}\right)^{1/d}.
\end{align*}
[/step]
[step:Combine the two estimates]
The upper estimate gives
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)]\le B N^{-1/d}
\end{align*}
for every $N\in\mathbb N$, where $B=M_{d,p}^{1/p}$. The lower estimate gives
\begin{align*}
\mathbb E[W_p(\hat\mu_N,\mu)]\ge A N^{-1/d}
\end{align*}
for every $N\in\mathbb N$, where
\begin{align*}
A=2^{-1/p}\left(\frac{1}{2C\alpha_d}\right)^{1/d}.
\end{align*}
Both constants are positive and depend only on the stated parameters. This proves the asserted empirical Wasserstein rate.
[/step]