MMD is a Metric under Characteristicness — Statement & Proof

Theorem

Proof

[proofplan] The forward direction $\mu = \nu \Rightarrow d_\phi(\mu, \nu) = 0$ is immediate from the definition of $d_\phi$ as the Hilbert-norm distance between kernel mean embeddings. The converse is the substantive content. Assuming $d_\phi(\mu,\nu) = 0$, we have $M^\phi_\mu = M^\phi_\nu$ in $\mathcal{H}_\phi$, hence $\mathbb{E}_\mu[g] = \mathbb{E}_\nu[g]$ for every $g \in \mathcal{H}_\phi$ via the reproducing property. Universality of $k_\phi$ lets us approximate any $f \in C(\mathcal{K})$ uniformly by RKHS elements, so a triangle-inequality estimate transfers integral identities from $\mathcal{H}_\phi$ to all of $C(\mathcal{K})$. Since bounded continuous functions on a compact metric space determine probability measures, this forces $\mu = \nu$. [/proofplan] [step:Establish the forward implication from the definition] By the [Definition of MMD](/theorems/???), $d_\phi(\mu, \nu) = \|M^\phi_\mu - M^\phi_\nu\|_{\mathcal{H}_\phi}$. If $\mu = \nu$, then $M^\phi_\mu = M^\phi_\nu$ as elements of $\mathcal{H}_\phi$ (the kernel mean embedding is well-defined as a function of the measure), so $d_\phi(\mu, \nu) = 0$. [/step] [step:Translate $d_\phi(\mu,\nu) = 0$ into equality of integrals against $\mathcal{H}_\phi$ functions] Suppose $d_\phi(\mu, \nu) = 0$. Then $M^\phi_\mu = M^\phi_\nu$ in $\mathcal{H}_\phi$. By the [Reproducing Property of the Kernel Mean Embedding](/theorems/???), for every $g \in \mathcal{H}_\phi$, \begin{align*} \mathbb{E}_{x \sim \mu}[g(x)] = \langle g, M^\phi_\mu \rangle_{\mathcal{H}_\phi} = \langle g, M^\phi_\nu \rangle_{\mathcal{H}_\phi} = \mathbb{E}_{y \sim \nu}[g(y)]. \end{align*} Hence \begin{align*} \mathbb{E}_{x \sim \mu}[g(x)] - \mathbb{E}_{y \sim \nu}[g(y)] = 0 \qquad \text{for all } g \in \mathcal{H}_\phi. \end{align*} [guided] The hypothesis $d_\phi(\mu, \nu) = 0$ is by [Definition of MMD](/theorems/???) 
the statement \begin{align*} \|M^\phi_\mu - M^\phi_\nu\|_{\mathcal{H}_\phi} = 0, \end{align*} and norms are positive-definite, so this forces equality in the Hilbert space $\mathcal{H}_\phi$: \begin{align*} M^\phi_\mu = M^\phi_\nu \quad \text{in } \mathcal{H}_\phi. \end{align*} **The reproducing property of the kernel mean embedding.** The kernel mean embedding $M^\phi_\rho \in \mathcal{H}_\phi$ of a probability measure $\rho \in \mathcal{P}(\mathcal{K})$ is the unique element of $\mathcal{H}_\phi$ characterised by the [Reproducing Property of the Kernel Mean Embedding](/theorems/???): \begin{align*} \langle g, M^\phi_\rho \rangle_{\mathcal{H}_\phi} = \mathbb{E}_{x \sim \rho}[g(x)] \qquad \text{for every } g \in \mathcal{H}_\phi. \end{align*} This identity is the kernel mean embedding's *reason for existing*. It follows from $M^\phi_\rho = \int k_\phi(x, \cdot)\, d\rho(x)$ (a Bochner integral in the Hilbert space $\mathcal{H}_\phi$, well-defined because $\mathcal{K}$ is compact and $k_\phi$ is continuous, hence $k_\phi(x, \cdot)$ is uniformly bounded in $\mathcal{H}_\phi$-norm) combined with the reproducing kernel property $g(x) = \langle g, k_\phi(x, \cdot)\rangle_{\mathcal{H}_\phi}$: \begin{align*} \langle g, M^\phi_\rho\rangle_{\mathcal{H}_\phi} = \Bigl\langle g, \int k_\phi(x, \cdot)\, d\rho(x)\Bigr\rangle_{\mathcal{H}_\phi} = \int \langle g, k_\phi(x, \cdot)\rangle_{\mathcal{H}_\phi}\, d\rho(x) = \int g(x)\, d\rho(x) = \mathbb{E}_\rho[g], \end{align*} where the second equality uses linearity and continuity of the inner product against the Bochner integral. **Applying the reproducing property to both measures.** Apply the identity to $\rho = \mu$ and to $\rho = \nu$ for an arbitrary $g \in \mathcal{H}_\phi$: \begin{align*} \mathbb{E}_{x \sim \mu}[g(x)] &= \langle g, M^\phi_\mu\rangle_{\mathcal{H}_\phi}, \\ \mathbb{E}_{y \sim \nu}[g(y)] &= \langle g, M^\phi_\nu\rangle_{\mathcal{H}_\phi}. 
\end{align*} Since $M^\phi_\mu = M^\phi_\nu$ in $\mathcal{H}_\phi$, the two right-hand sides agree: \begin{align*} \langle g, M^\phi_\mu\rangle_{\mathcal{H}_\phi} = \langle g, M^\phi_\nu\rangle_{\mathcal{H}_\phi}. \end{align*} Therefore \begin{align*} \mathbb{E}_{x \sim \mu}[g(x)] = \mathbb{E}_{y \sim \nu}[g(y)] \qquad \text{for every } g \in \mathcal{H}_\phi, \end{align*} or equivalently \begin{align*} \mathbb{E}_{x \sim \mu}[g(x)] - \mathbb{E}_{y \sim \nu}[g(y)] = 0 \qquad \text{for every } g \in \mathcal{H}_\phi. \end{align*} **Strategic significance.** This identity is the foothold for the rest of the proof. We know $\mu$ and $\nu$ agree on integrals against every element of $\mathcal{H}_\phi$, and we want to upgrade this to agreement on integrals against every $f \in C(\mathcal{K})$. The gap is that $\mathcal{H}_\phi$ is in general a *strict* subset of $C(\mathcal{K})$ (RKHS elements satisfy additional smoothness or summability constraints), and the bridge from "agree on $\mathcal{H}_\phi$" to "agree on $C(\mathcal{K})$" is *uniform density*. Without universality of $k_\phi$, the conclusion can fail: there are kernels for which $\mathcal{H}_\phi$ is "too small" to determine the measure (e.g. polynomial kernels of fixed degree, whose RKHS contains only polynomials of bounded degree). Universality is precisely the property that fills the gap. [/guided] [/step] [step:Approximate any $f \in C(\mathcal{K})$ uniformly by an element of $\mathcal{H}_\phi$ using universality] Fix $f \in C(\mathcal{K})$ and $\varepsilon > 0$. By the hypothesis that $k_\phi$ is universal, the RKHS $\mathcal{H}_\phi$ is dense in $C(\mathcal{K})$ with respect to the uniform norm $\|\cdot\|_\infty$. Hence there exists $g \in \mathcal{H}_\phi$ with \begin{align*} \|f - g\|_\infty = \sup_{x \in \mathcal{K}} |f(x) - g(x)| < \varepsilon. 
\end{align*} [/step] [step:Bound $|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]|$ by a triangle-inequality estimate] For any probability measure $\rho$ on $\mathcal{K}$, the inequality $|f(x) - g(x)| \le \|f - g\|_\infty < \varepsilon$ holds pointwise, so by monotonicity of the integral and $\rho(\mathcal{K}) = 1$, \begin{align*} |\mathbb{E}_{z \sim \rho}[f(z)] - \mathbb{E}_{z \sim \rho}[g(z)]| \le \mathbb{E}_{z \sim \rho}|f(z) - g(z)| \le \|f - g\|_\infty < \varepsilon. \end{align*} Applying this with $\rho = \mu$ and $\rho = \nu$ and combining with the identity $\mathbb{E}_\mu[g] = \mathbb{E}_\nu[g]$ from Step 2, \begin{align*} |\mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)]| &\le |\mathbb{E}_\mu[f] - \mathbb{E}_\mu[g]| + |\mathbb{E}_\mu[g] - \mathbb{E}_\nu[g]| + |\mathbb{E}_\nu[g] - \mathbb{E}_\nu[f]| \\ &< \varepsilon + 0 + \varepsilon = 2\varepsilon. \end{align*} [guided] We want to compare $\mathbb{E}_\mu[f]$ and $\mathbb{E}_\nu[f]$, but the equality from Step 2 is only available for elements of $\mathcal{H}_\phi$, not for $f$ itself. The standard technique inserts a proxy $g \in \mathcal{H}_\phi$ that is uniformly close to $f$ and splits the difference into three terms by the triangle inequality: \begin{align*} |\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| \le \underbrace{|\mathbb{E}_\mu[f] - \mathbb{E}_\mu[g]|}_{\text{(I)}} + \underbrace{|\mathbb{E}_\mu[g] - \mathbb{E}_\nu[g]|}_{\text{(II)}} + \underbrace{|\mathbb{E}_\nu[g] - \mathbb{E}_\nu[f]|}_{\text{(III)}}. \end{align*} The terms are controlled as follows: - Terms (I) and (III): For any probability measure $\rho$, \begin{align*} |\mathbb{E}_\rho[f] - \mathbb{E}_\rho[g]| = \left| \int (f - g)\, d\rho \right| \le \int |f - g|\, d\rho \le \|f - g\|_\infty \cdot \rho(\mathcal{K}) = \|f - g\|_\infty < \varepsilon, \end{align*} where the equality $\rho(\mathcal{K}) = 1$ uses that $\rho$ is a probability measure. 
Note that $f$ and $g$ are bounded on the compact set $\mathcal{K}$ (as continuous functions), so both integrals are finite and linearity of the integral applies. We use this for $\rho = \mu$ in (I) and $\rho = \nu$ in (III). - Term (II): This is exactly the equality from Step 2: $\mathbb{E}_\mu[g] = \mathbb{E}_\nu[g]$ for $g \in \mathcal{H}_\phi$, hence (II) $= 0$. Adding the three bounds, \begin{align*} |\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| < \varepsilon + 0 + \varepsilon = 2\varepsilon. \end{align*} Universality enters only through the choice of the proxy $g$: it guarantees a $g \in \mathcal{H}_\phi$ uniformly close to $f$, which makes terms (I) and (III) small, while term (II) vanishes for every $g \in \mathcal{H}_\phi$ by Step 2. [/guided] [/step] [step:Take $\varepsilon \to 0$ and conclude $\mu = \nu$] The bound $|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| < 2\varepsilon$ holds for every $\varepsilon > 0$, so $\mathbb{E}_\mu[f] = \mathbb{E}_\nu[f]$. Since $f \in C(\mathcal{K})$ was arbitrary, this identity holds for every $f \in C(\mathcal{K})$, and since $\mathcal{K}$ is a compact metric space, every such $f$ is bounded, so $C(\mathcal{K}) = C_b(\mathcal{K})$. By [Bounded Continuous Functions Determine Borel Probability Measures](/theorems/???) (Dudley, *Real Analysis and Probability*, Lemma 9.3.2), if $\mu, \nu$ are Borel probability measures on the metric space $\mathcal{K}$ and $\int f\, d\mu = \int f\, d\nu$ for every $f \in C_b(\mathcal{K})$, then $\mu = \nu$. The hypotheses are met: $\mu, \nu \in \mathcal{P}(\mathcal{K})$ are Borel by assumption and $\mathcal{K}$ is a compact metric space. Hence $\mu = \nu$, completing the converse. Combined with Step 1, this proves the equivalence $d_\phi(\mu, \nu) = 0 \iff \mu = \nu$. [guided] The bound $|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| < 2\varepsilon$ from Step 4 holds for *every* $\varepsilon > 0$: the proxy $g \in \mathcal{H}_\phi$ depends on the choice of $\varepsilon$, but the left-hand side does not. 
Letting $\varepsilon \to 0$, \begin{align*} |\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| \le \lim_{\varepsilon \to 0^+} 2\varepsilon = 0, \end{align*} so $\mathbb{E}_\mu[f] = \mathbb{E}_\nu[f]$. Since the function $f \in C(\mathcal{K})$ was arbitrary, this identity holds for *every* continuous function on $\mathcal{K}$. **Translating to $C_b(\mathcal{K})$.** Continuous functions on a compact metric space are automatically bounded — for $f \in C(\mathcal{K})$, the [Extreme Value Theorem](/theorems/???) gives $\sup_{x \in \mathcal{K}} |f(x)| < \infty$. Hence $C(\mathcal{K}) = C_b(\mathcal{K})$ in our setting, and we conclude \begin{align*} \int_\mathcal{K} f\, d\mu = \int_\mathcal{K} f\, d\nu \qquad \text{for every } f \in C_b(\mathcal{K}). \end{align*} **Applying the measure-determination theorem.** We invoke the [Bounded Continuous Functions Determine Borel Probability Measures](/theorems/???) (Dudley, *Real Analysis and Probability*, Lemma 9.3.2): \begin{quote} Let $X$ be a metric space and $\mu, \nu$ be Borel probability measures on $X$. If $\int f\, d\mu = \int f\, d\nu$ for every $f \in C_b(X)$, then $\mu = \nu$. \end{quote} We verify each hypothesis: \begin{itemize} \item *$X$ is a metric space.* The compact set $\mathcal{K}$ is a compact metric space (subspace of $\mathcal{C}_p$, which carries a metric topology by hypothesis). \item *$\mu, \nu$ are Borel probability measures on $X$.* By assumption $\mu, \nu \in \mathcal{P}(\mathcal{K})$, the space of Borel probability measures on $\mathcal{K}$. \item *$\mathbb{E}_\mu[f] = \mathbb{E}_\nu[f]$ for every $f \in C_b(\mathcal{K})$.* Just established. \end{itemize} The conclusion is $\mu = \nu$. **Why does the chain of implications need universality?** The argument boils down to: $d_\phi(\mu, \nu) = 0 \Rightarrow \mathbb{E}_\mu = \mathbb{E}_\nu$ on $\mathcal{H}_\phi \Rightarrow \mathbb{E}_\mu = \mathbb{E}_\nu$ on $C_b(\mathcal{K}) \Rightarrow \mu = \nu$. 
The first arrow is just the reproducing property; the third is Dudley's lemma. The middle arrow, which extends integral identities from the dense subspace $\mathcal{H}_\phi$ of $C(\mathcal{K})$ to the whole space, is the *only* place where universality enters. Without universality, the uniform closure of $\mathcal{H}_\phi$ might be a proper subspace of $C(\mathcal{K})$ (e.g. polynomials of bounded degree, or functions vanishing on a fixed set), and Dudley's lemma would not apply because we would only have integral identities on a non-dense subset of $C_b(\mathcal{K})$. **Closing the equivalence.** The forward implication $\mu = \nu \Rightarrow d_\phi(\mu, \nu) = 0$ from Step 1 is trivial: equal measures have equal kernel mean embeddings, so the difference has zero norm. The converse, just established, is the substantive content. Together, \begin{align*} d_\phi(\mu, \nu) = 0 \iff \mu = \nu \qquad \text{(so the pseudometric $d_\phi$ is a metric on $\mathcal{P}(\mathcal{K})$).} \end{align*} **The role of compactness of $\mathcal{K}$.** Compactness is used in three places: in Step 2 (it bounds $k_\phi(x, \cdot)$ uniformly in $\mathcal{H}_\phi$-norm, which makes the Bochner integral well-defined), in Step 3 (universality is defined on compact subsets), and here to identify $C(\mathcal{K})$ with $C_b(\mathcal{K})$. On a non-compact $\mathcal{K}$, continuous functions need not be bounded, and the universality-based argument would need a separate truncation step. This proof therefore restricts attention to compact $\mathcal{K}$ for cleanliness, although extensions to non-compact settings exist (using e.g. Polish-space versions of Dudley's lemma and characteristic kernels). [/guided] [/step]
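
The role of universality in the proof can be checked numerically on finitely supported measures. The following is a minimal sketch, assuming NumPy, the compact set $\mathcal{K} = [-1, 1]$, and two illustrative kernels that do not appear in the source: a Gaussian kernel (a standard example of a universal kernel on compact subsets of $\mathbb{R}^d$) and a linear kernel (a polynomial kernel of degree 1, standing in for the "too small" RKHS mentioned in the proof). The names `gaussian_gram`, `linear_gram`, and `mmd_squared` are hypothetical helpers. For uniform measures on finitely many atoms, the double-sum formula below equals $\|M^\phi_\mu - M^\phi_\nu\|^2_{\mathcal{H}_\phi}$ exactly, so no sampling error is involved.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 * bandwidth^2)) for 1-D atoms."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))


def linear_gram(a, b):
    """Gram matrix of the linear kernel k(x, y) = x * y (not universal)."""
    return a[:, None] * b[None, :]


def mmd_squared(x_atoms, y_atoms, gram):
    """Squared MMD between the uniform measures on x_atoms and on y_atoms.

    Expands ||M_mu - M_nu||^2 = <M_mu, M_mu> + <M_nu, M_nu> - 2 <M_mu, M_nu>,
    where each inner product is an average of kernel evaluations.
    """
    return (
        gram(x_atoms, x_atoms).mean()
        + gram(y_atoms, y_atoms).mean()
        - 2.0 * gram(x_atoms, y_atoms).mean()
    )


# Two different probability measures on K = [-1, 1] with the same mean:
# mu is uniform on {-1, +1}, nu is the point mass at 0.
mu_atoms = np.array([-1.0, 1.0])
nu_atoms = np.array([0.0])

linear_value = mmd_squared(mu_atoms, nu_atoms, linear_gram)
gaussian_value = mmd_squared(mu_atoms, nu_atoms, gaussian_gram)
self_value = mmd_squared(mu_atoms, mu_atoms, gaussian_gram)

print("linear kernel   MMD^2(mu, nu):", linear_value)
print("gaussian kernel MMD^2(mu, nu):", gaussian_value)
print("gaussian kernel MMD^2(mu, mu):", self_value)
```

Under these assumptions the linear kernel reports zero distance although $\mu \neq \nu$, because its RKHS only sees the mean of each measure, whereas the Gaussian kernel separates the two measures and still returns zero for $\mu$ against itself, matching the forward implication of Step 1.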

MMD is a Metric under Characteristicness (Theorem # 2522)
