[proofplan]
The forward direction $\mu = \nu \Rightarrow d_\phi(\mu, \nu) = 0$ is immediate from the definition of $d_\phi$ as the Hilbert-norm distance between kernel mean embeddings. The converse is the substantive content. Assuming $d_\phi(\mu,\nu) = 0$, we have $M^\phi_\mu = M^\phi_\nu$ in $\mathcal{H}_\phi$, hence $\mathbb{E}_\mu[g] = \mathbb{E}_\nu[g]$ for every $g \in \mathcal{H}_\phi$ via the reproducing property. Universality of $k_\phi$ lets us approximate any $f \in C(\mathcal{K})$ uniformly by RKHS elements, so a triangle-inequality estimate transfers integral identities from $\mathcal{H}_\phi$ to all of $C(\mathcal{K})$. Since bounded continuous functions on a compact metric space determine probability measures, this forces $\mu = \nu$.
[/proofplan]
[step:Establish the forward implication from the definition]
By the [Definition of MMD](/theorems/???), $d_\phi(\mu, \nu) = \|M^\phi_\mu - M^\phi_\nu\|_{\mathcal{H}_\phi}$. If $\mu = \nu$, then $M^\phi_\mu = M^\phi_\nu$ as elements of $\mathcal{H}_\phi$ (the kernel mean embedding is well-defined as a function of the measure), so $d_\phi(\mu, \nu) = 0$.
[/step]
[step:Translate $d_\phi(\mu,\nu) = 0$ into equality of integrals against $\mathcal{H}_\phi$ functions]
Suppose $d_\phi(\mu, \nu) = 0$. Then $M^\phi_\mu = M^\phi_\nu$ in $\mathcal{H}_\phi$. By the [Reproducing Property of the Kernel Mean Embedding](/theorems/???), for every $g \in \mathcal{H}_\phi$,
\begin{align*}
\mathbb{E}_{x \sim \mu}[g(x)] = \langle g, M^\phi_\mu \rangle_{\mathcal{H}_\phi} = \langle g, M^\phi_\nu \rangle_{\mathcal{H}_\phi} = \mathbb{E}_{y \sim \nu}[g(y)].
\end{align*}
Hence
\begin{align*}
\mathbb{E}_{x \sim \mu}[g(x)] - \mathbb{E}_{y \sim \nu}[g(y)] = 0 \qquad \text{for all } g \in \mathcal{H}_\phi.
\end{align*}
[guided]
The hypothesis $d_\phi(\mu, \nu) = 0$ is by [Definition of MMD](/theorems/???) the statement
\begin{align*}
\|M^\phi_\mu - M^\phi_\nu\|_{\mathcal{H}_\phi} = 0,
\end{align*}
and norms are positive-definite, so this forces equality in the Hilbert space $\mathcal{H}_\phi$:
\begin{align*}
M^\phi_\mu = M^\phi_\nu \quad \text{in } \mathcal{H}_\phi.
\end{align*}
**The reproducing property of the kernel mean embedding.** The kernel mean embedding $M^\phi_\rho \in \mathcal{H}_\phi$ of a probability measure $\rho \in \mathcal{P}(\mathcal{K})$ is the unique element of $\mathcal{H}_\phi$ characterised by the [Reproducing Property of the Kernel Mean Embedding](/theorems/???):
\begin{align*}
\langle g, M^\phi_\rho \rangle_{\mathcal{H}_\phi} = \mathbb{E}_{x \sim \rho}[g(x)] \qquad \text{for every } g \in \mathcal{H}_\phi.
\end{align*}
This identity is the kernel mean embedding's *reason for existing*. It follows from $M^\phi_\rho = \int k_\phi(x, \cdot)\, d\rho(x)$ (a Bochner integral in the Hilbert space $\mathcal{H}_\phi$, well-defined because $\mathcal{K}$ is compact and $k_\phi$ is continuous, hence $k_\phi(x, \cdot)$ is uniformly bounded in $\mathcal{H}_\phi$-norm) combined with the reproducing kernel property $g(x) = \langle g, k_\phi(x, \cdot)\rangle_{\mathcal{H}_\phi}$:
\begin{align*}
\langle g, M^\phi_\rho\rangle_{\mathcal{H}_\phi} = \Bigl\langle g, \int k_\phi(x, \cdot)\, d\rho(x)\Bigr\rangle_{\mathcal{H}_\phi} = \int \langle g, k_\phi(x, \cdot)\rangle_{\mathcal{H}_\phi}\, d\rho(x) = \int g(x)\, d\rho(x) = \mathbb{E}_\rho[g],
\end{align*}
where the second equality uses that the bounded linear functional $\langle g, \cdot\rangle_{\mathcal{H}_\phi}$ commutes with the Bochner integral.
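**A worked instance: finitely supported measures.** As a sanity check, and not as part of the proof, consider the hypothetical special case $\rho = \sum_{j=1}^n w_j \delta_{x_j}$ with points $x_j \in \mathcal{K}$ and weights $w_j \ge 0$, $\sum_{j=1}^n w_j = 1$ (the symbols $n$, $x_j$, $w_j$ are introduced here for illustration only). The Bochner integral then reduces to a finite sum, and the identity can be checked by hand:
\begin{align*}
M^\phi_\rho &= \sum_{j=1}^n w_j\, k_\phi(x_j, \cdot), \\
\langle g, M^\phi_\rho\rangle_{\mathcal{H}_\phi} &= \sum_{j=1}^n w_j\, \langle g, k_\phi(x_j, \cdot)\rangle_{\mathcal{H}_\phi} = \sum_{j=1}^n w_j\, g(x_j) = \mathbb{E}_{x \sim \rho}[g(x)].
\end{align*}
Only linearity of the inner product and the reproducing kernel property are needed in this case; the general statement replaces the finite sum by the Bochner integral.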
**Applying the reproducing property to both measures.** Apply the identity to $\rho = \mu$ and to $\rho = \nu$ for an arbitrary $g \in \mathcal{H}_\phi$:
\begin{align*}
\mathbb{E}_{x \sim \mu}[g(x)] &= \langle g, M^\phi_\mu\rangle_{\mathcal{H}_\phi}, \\
\mathbb{E}_{y \sim \nu}[g(y)] &= \langle g, M^\phi_\nu\rangle_{\mathcal{H}_\phi}.
\end{align*}
Since $M^\phi_\mu = M^\phi_\nu$ in $\mathcal{H}_\phi$, the two right-hand sides agree:
\begin{align*}
\langle g, M^\phi_\mu\rangle_{\mathcal{H}_\phi} = \langle g, M^\phi_\nu\rangle_{\mathcal{H}_\phi}.
\end{align*}
Therefore
\begin{align*}
\mathbb{E}_{x \sim \mu}[g(x)] = \mathbb{E}_{y \sim \nu}[g(y)] \qquad \text{for every } g \in \mathcal{H}_\phi,
\end{align*}
or equivalently
\begin{align*}
\mathbb{E}_{x \sim \mu}[g(x)] - \mathbb{E}_{y \sim \nu}[g(y)] = 0 \qquad \text{for every } g \in \mathcal{H}_\phi.
\end{align*}
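**A worked consequence: the kernel form of $d_\phi^2$.** The same reproducing identity yields an explicit formula for the squared distance, which may help make the hypothesis $d_\phi(\mu, \nu) = 0$ concrete. This is a sketch under the standing assumptions (compact $\mathcal{K}$, continuous symmetric $k_\phi$); the independent copies $x'$ and $y'$ are notation introduced here. Taking $g = k_\phi(z, \cdot)$ in the identity gives $M^\phi_\rho(z) = \mathbb{E}_{x \sim \rho}[k_\phi(x, z)]$, and taking $g = M^\phi_\mu$ with $\rho = \nu$ gives $\langle M^\phi_\mu, M^\phi_\nu\rangle_{\mathcal{H}_\phi} = \mathbb{E}_{y \sim \nu}[M^\phi_\mu(y)]$. Expanding the squared norm,
\begin{align*}
d_\phi(\mu, \nu)^2 &= \langle M^\phi_\mu, M^\phi_\mu\rangle_{\mathcal{H}_\phi} - 2\,\langle M^\phi_\mu, M^\phi_\nu\rangle_{\mathcal{H}_\phi} + \langle M^\phi_\nu, M^\phi_\nu\rangle_{\mathcal{H}_\phi} \\
&= \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] - 2\,\mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] + \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')].
\end{align*}
In particular, $d_\phi(\mu, \nu) = 0$ holds exactly when this combination of kernel averages vanishes.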
**Strategic significance.** This identity is the foothold for the rest of the proof. We know $\mu$ and $\nu$ agree on integrals against every element of $\mathcal{H}_\phi$, and we want to upgrade this to agreement on integrals against every $f \in C(\mathcal{K})$. The gap is that $\mathcal{H}_\phi$ is in general a *strict* subset of $C(\mathcal{K})$, since RKHS elements satisfy additional smoothness or summability constraints. The bridge from "agree on $\mathcal{H}_\phi$" to "agree on $C(\mathcal{K})$" is *uniform density*. Without universality of $k_\phi$, the conclusion can fail: there are kernels for which $\mathcal{H}_\phi$ is "too small" to determine the measure (e.g. polynomial kernels of fixed degree). Universality is precisely the property that fills the gap.
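**A worked counterexample without universality.** To illustrate the failure mode for polynomial kernels, here is a sketch with a hypothetical space and kernel that are not those of the theorem: take the interval $[-1, 1] \subset \mathbb{R}$ and the degree-one polynomial kernel $k(x, y) = 1 + xy$, whose RKHS consists of the affine functions $g(x) = a + bx$. Let $\mu = \delta_0$ and $\nu = \tfrac{1}{2}(\delta_{-1} + \delta_{1})$. As functions of $y$, the two mean embeddings are
\begin{align*}
M_\mu(y) &= k(0, y) = 1, \\
M_\nu(y) &= \tfrac{1}{2}\bigl(k(-1, y) + k(1, y)\bigr) = \tfrac{1}{2}\bigl((1 - y) + (1 + y)\bigr) = 1.
\end{align*}
The embeddings coincide, so the associated distance is zero although $\mu \ne \nu$. Equivalently, $\mathbb{E}_\mu[a + bx] = a = \mathbb{E}_\nu[a + bx]$ for every affine function, while the continuous function $f(x) = x^2$ separates the two measures:
\begin{align*}
\mathbb{E}_\mu[x^2] = 0 \ne 1 = \mathbb{E}_\nu[x^2].
\end{align*}
This is consistent with the fact that affine functions are not uniformly dense in $C([-1, 1])$.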
[/guided]
[/step]
[step:Approximate any $f \in C(\mathcal{K})$ uniformly by an element of $\mathcal{H}_\phi$ using universality]
Fix $f \in C(\mathcal{K})$ and $\varepsilon > 0$. By the hypothesis that $k_\phi$ is universal, the RKHS $\mathcal{H}_\phi$ is dense in $C(\mathcal{K})$ with respect to the uniform norm $\|\cdot\|_\infty$. Hence there exists $g \in \mathcal{H}_\phi$ with
\begin{align*}
\|f - g\|_\infty = \sup_{x \in \mathcal{K}} |f(x) - g(x)| < \varepsilon.
\end{align*}
[/step]
[step:Bound $|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]|$ by a triangle-inequality estimate]
For any probability measure $\rho$ on $\mathcal{K}$, the inequality $|f(x) - g(x)| \le \|f - g\|_\infty < \varepsilon$ holds pointwise, so by monotonicity of the integral and the normalisation $\rho(\mathcal{K}) = 1$,
\begin{align*}
|\mathbb{E}_{z \sim \rho}[f(z)] - \mathbb{E}_{z \sim \rho}[g(z)]| \le \mathbb{E}_{z \sim \rho}|f(z) - g(z)| \le \|f - g\|_\infty < \varepsilon.
\end{align*}
Applying this with $\rho = \mu$ and $\rho = \nu$ and combining with the identity $\mathbb{E}_\mu[g] = \mathbb{E}_\nu[g]$ from Step 2,
\begin{align*}
|\mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)]| &\le |\mathbb{E}_\mu[f] - \mathbb{E}_\mu[g]| + |\mathbb{E}_\mu[g] - \mathbb{E}_\nu[g]| + |\mathbb{E}_\nu[g] - \mathbb{E}_\nu[f]| \\
&< \varepsilon + 0 + \varepsilon = 2\varepsilon.
\end{align*}
[guided]
We want to compare $\mathbb{E}_\mu[f]$ and $\mathbb{E}_\nu[f]$, but the equality from Step 2 is only available for elements of $\mathcal{H}_\phi$, not for $f$ itself. The standard "bracket" or "triangle" technique inserts a proxy $g \in \mathcal{H}_\phi$ that is uniformly close to $f$ and uses three triangle terms:
\begin{align*}
|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| \le \underbrace{|\mathbb{E}_\mu[f] - \mathbb{E}_\mu[g]|}_{\text{(I)}} + \underbrace{|\mathbb{E}_\mu[g] - \mathbb{E}_\nu[g]|}_{\text{(II)}} + \underbrace{|\mathbb{E}_\nu[g] - \mathbb{E}_\nu[f]|}_{\text{(III)}}.
\end{align*}
Each term is controlled by a different fact:
- Terms (I) and (III): For any probability measure $\rho$,
\begin{align*}
|\mathbb{E}_\rho[f] - \mathbb{E}_\rho[g]| = \left| \int (f - g)\, d\rho \right| \le \int |f - g|\, d\rho \le \|f - g\|_\infty \cdot \rho(\mathcal{K}) = \|f - g\|_\infty < \varepsilon,
\end{align*}
where the equality $\rho(\mathcal{K}) = 1$ uses that $\rho$ is a probability measure. Note that $f$ and $g$ are bounded on the compact set $\mathcal{K}$ (as continuous functions), so both integrals are finite and the first equality is simply linearity of the integral. We use this for $\rho = \mu$ in (I) and $\rho = \nu$ in (III).
- Term (II): This is exactly the equality from Step 2: $\mathbb{E}_\mu[g] = \mathbb{E}_\nu[g]$ for $g \in \mathcal{H}_\phi$, hence (II) $= 0$.
Adding the three bounds,
\begin{align*}
|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| < \varepsilon + 0 + \varepsilon = 2\varepsilon.
\end{align*}
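Universality enters only through the proxy $g$: it guarantees that some element of $\mathcal{H}_\phi$ lies within $\varepsilon$ of $f$ in the uniform norm, which is what makes terms (I) and (III) small. Term (II) vanishes for every $g \in \mathcal{H}_\phi$ by Step 2, with or without universality.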
[/guided]
[/step]
[step:Take $\varepsilon \to 0$ and conclude $\mu = \nu$]
The bound $|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| < 2\varepsilon$ holds for every $\varepsilon > 0$, so $\mathbb{E}_\mu[f] = \mathbb{E}_\nu[f]$. Since $f \in C(\mathcal{K})$ was arbitrary, this identity holds on all of $C(\mathcal{K})$; and since $\mathcal{K}$ is a compact metric space, every continuous function on it is bounded, so $C(\mathcal{K}) = C_b(\mathcal{K})$. We now apply [Bounded Continuous Functions Determine Borel Probability Measures](/theorems/???) (Dudley, *Real Analysis and Probability*, Lemma 9.3.2): if $\mu, \nu$ are Borel probability measures on a metric space $X$ and $\int f\, d\mu = \int f\, d\nu$ for every $f \in C_b(X)$, then $\mu = \nu$. The hypotheses are met with $X = \mathcal{K}$: $\mu, \nu \in \mathcal{P}(\mathcal{K})$ are Borel by assumption and $\mathcal{K}$ is a compact metric space. Hence $\mu = \nu$, completing the converse.
Combined with Step 1, this proves the equivalence $d_\phi(\mu, \nu) = 0 \iff \mu = \nu$.
[guided]
The bound $|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| < 2\varepsilon$ from Step 4 holds for *every* $\varepsilon > 0$: the proxy $g \in \mathcal{H}_\phi$ depends on the choice of $\varepsilon$, but the left-hand side involves neither $g$ nor $\varepsilon$. Letting $\varepsilon \to 0$,
\begin{align*}
|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| \le \lim_{\varepsilon \to 0^+} 2\varepsilon = 0,
\end{align*}
so $\mathbb{E}_\mu[f] = \mathbb{E}_\nu[f]$. Since the function $f \in C(\mathcal{K})$ was arbitrary, this identity holds for *every* continuous function on $\mathcal{K}$.
**Translating to $C_b(\mathcal{K})$.** Continuous functions on a compact metric space are automatically bounded — for $f \in C(\mathcal{K})$, the [Extreme Value Theorem](/theorems/???) gives $\sup_{x \in \mathcal{K}} |f(x)| < \infty$. Hence $C(\mathcal{K}) = C_b(\mathcal{K})$ in our setting, and we conclude
\begin{align*}
\int_\mathcal{K} f\, d\mu = \int_\mathcal{K} f\, d\nu \qquad \text{for every } f \in C_b(\mathcal{K}).
\end{align*}
**Applying the measure-determination theorem.** We invoke the [Bounded Continuous Functions Determine Borel Probability Measures](/theorems/???) (Dudley, *Real Analysis and Probability*, Lemma 9.3.2):
\begin{quote}
Let $X$ be a metric space and $\mu, \nu$ be Borel probability measures on $X$. If $\int f\, d\mu = \int f\, d\nu$ for every $f \in C_b(X)$, then $\mu = \nu$.
\end{quote}
We verify each hypothesis:
\begin{itemize}
\item *$X$ is a metric space.* The compact set $\mathcal{K}$ is a compact metric space (subspace of $\mathcal{C}_p$, which carries a metric topology by hypothesis).
\item *$\mu, \nu$ are Borel probability measures on $X$.* By assumption $\mu, \nu \in \mathcal{P}(\mathcal{K})$, the space of Borel probability measures on $\mathcal{K}$.
\item *$\mathbb{E}_\mu[f] = \mathbb{E}_\nu[f]$ for every $f \in C_b(\mathcal{K})$.* Just established.
\end{itemize}
The conclusion is $\mu = \nu$.
**Why does the chain of implications need universality?** The argument boils down to: $d_\phi(\mu, \nu) = 0 \Rightarrow \mathbb{E}_\mu = \mathbb{E}_\nu$ on $\mathcal{H}_\phi \Rightarrow \mathbb{E}_\mu = \mathbb{E}_\nu$ on $C_b(\mathcal{K}) \Rightarrow \mu = \nu$. The first arrow is just the reproducing property; the third is Dudley's lemma. The middle arrow, which extends integral identities from the dense subspace $\mathcal{H}_\phi$ of $C(\mathcal{K})$ to the whole space, is the *only* place where universality enters. Without universality, the uniform closure of $\mathcal{H}_\phi$ might be a proper subspace of $C(\mathcal{K})$ (e.g. polynomials of bounded degree, or functions vanishing on a fixed set), and Dudley's lemma would not apply because we would only have integral identities on a non-dense subset of $C_b(\mathcal{K})$.
**Closing the equivalence.** The forward implication $\mu = \nu \Rightarrow d_\phi(\mu, \nu) = 0$ from Step 1 is trivial: equal measures have equal kernel mean embeddings, so the difference has zero norm. The converse, just established, is the substantive content. Together,
\begin{align*}
d_\phi(\mu, \nu) = 0 \iff \mu = \nu \qquad \text{(equivalently, $d_\phi$ is a metric on $\mathcal{P}(\mathcal{K})$).}
\end{align*}
**The role of compactness of $\mathcal{K}$.** Compactness is used three times: in Step 2 (together with continuity of $k_\phi$, it bounds $k_\phi(x, \cdot)$ in $\mathcal{H}_\phi$-norm, so the Bochner integral defining the mean embedding exists), in Step 3 (universality is defined on compact subsets), and here to identify $C(\mathcal{K})$ with $C_b(\mathcal{K})$. On a non-compact $\mathcal{K}$, continuous functions need not be bounded, and the universality-based argument would need a separate truncation step. The standard MMD framework therefore restricts attention to compact $\mathcal{K}$ for cleanliness, although extensions to non-compact settings exist (using e.g. Polish-space versions of Dudley's lemma and characteristic kernels, for instance integrally strictly positive definite ones).
[/guided]
[/step]