MMD Metrizes Weak Convergence — Statement & Proof

Theorem

Proof

[proofplan] The proof has two halves matching the two assertions. **Continuity of $M^\phi$:** expand $\|M^\phi_{\mu_n} - M^\phi_\mu\|_{\mathcal{H}_\phi}^2$ as a combination of three integrals against product measures; weak convergence of $\mu_n$ to $\mu$ lifts to weak convergence of the relevant product measures, and continuity plus boundedness of $k_\phi$ on the compact set $\mathcal{K} \times \mathcal{K}$ allows us to pass each integral through the limit, giving $d_\phi(\mu_n, \mu) \to 0$. **Converse direction:** assume $d_\phi(\mu_n, \mu) \to 0$. For $f \in C_b(\mathcal{K})$, universality of $k_\phi$ yields a uniform RKHS approximant $g \in \mathcal{H}_\phi$ within $\varepsilon$ of $f$; the difference of expectations against $g$ is bounded by $\|g\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu)$ via Cauchy-Schwarz on the reproducing-kernel inner product. Letting $n \to \infty$ and then $\varepsilon \to 0$ gives $\mu_n(f) \to \mu(f)$, which is weak convergence. The two halves combine to show that the identity map between the weak topology and the metric topology induced by $d_\phi$ is a continuous bijection — hence a homeomorphism, since $\mathcal{P}(\mathcal{K})$ is compact in the weak topology by Prokhorov's theorem. [/proofplan] [step:Expand $\|M^\phi_{\mu_n} - M^\phi_\mu\|_{\mathcal{H}_\phi}^2$ as integrals against product measures] By the [Squared MMD Formula](/theorems/???), for any $\rho_1, \rho_2 \in \mathcal{P}(\mathcal{K})$, \begin{align*} \|M^\phi_{\rho_1} - M^\phi_{\rho_2}\|_{\mathcal{H}_\phi}^2 &= \int_{\mathcal{K} \times \mathcal{K}} k_\phi\, d(\rho_1 \otimes \rho_1) - 2\int_{\mathcal{K} \times \mathcal{K}} k_\phi\, d(\rho_1 \otimes \rho_2) + \int_{\mathcal{K} \times \mathcal{K}} k_\phi\, d(\rho_2 \otimes \rho_2). 
\end{align*} All three integrals are finite: $k_\phi$ is continuous (by the hypothesis on the kernel) and $\mathcal{K} \times \mathcal{K}$ is compact, so $k_\phi$ is bounded, say $|k_\phi(x, y)| \le M$ for all $(x, y) \in \mathcal{K} \times \mathcal{K}$, and each $\rho_i \otimes \rho_j$ is a probability measure. [guided] By the [Definition of MMD](/theorems/???), the squared MMD is a squared Hilbert-space norm: $d_\phi(\rho_1, \rho_2)^2 = \|M^\phi_{\rho_1} - M^\phi_{\rho_2}\|_{\mathcal{H}_\phi}^2$, where $M^\phi_\rho \in \mathcal{H}_\phi$ is the kernel mean embedding of $\rho$. To compare this with a weak-convergence statement, we rewrite each Hilbert-space term as an *integral against a product measure*, which is the form the definition of weak convergence acts on directly. **Expanding the squared norm.** For any two vectors $a, b$ in a real Hilbert space, bilinearity and symmetry of the inner product give \begin{align*} \|a - b\|^2 = \langle a - b, a - b\rangle = \langle a, a\rangle - 2\langle a, b\rangle + \langle b, b\rangle = \|a\|^2 - 2\langle a, b\rangle + \|b\|^2. \end{align*} Apply this with $a = M^\phi_{\rho_1}$, $b = M^\phi_{\rho_2}$ in $\mathcal{H}_\phi$: \begin{align*} \|M^\phi_{\rho_1} - M^\phi_{\rho_2}\|_{\mathcal{H}_\phi}^2 = \|M^\phi_{\rho_1}\|_{\mathcal{H}_\phi}^2 - 2\langle M^\phi_{\rho_1}, M^\phi_{\rho_2}\rangle_{\mathcal{H}_\phi} + \|M^\phi_{\rho_2}\|_{\mathcal{H}_\phi}^2. \end{align*} **Inner products as integrals against product measures.** By the [Inner Product Formula for Kernel Mean Embeddings](/theorems/???), \begin{align*} \langle M^\phi_\rho, M^\phi_\tau\rangle_{\mathcal{H}_\phi} = \int_{\mathcal{K} \times \mathcal{K}} k_\phi(x, y)\, d(\rho \otimes \tau)(x, y) \qquad \text{for any } \rho, \tau \in \mathcal{P}(\mathcal{K}).
\end{align*} This identity comes from the reproducing property. Write $M^\phi_\rho = \int k_\phi(x, \cdot)\, d\rho(x)$ and $M^\phi_\tau = \int k_\phi(y, \cdot)\, d\tau(y)$ as Bochner integrals in $\mathcal{H}_\phi$. Because the inner product is a bounded bilinear form, it can be moved inside both Bochner integrals, so \begin{align*} \langle M^\phi_\rho, M^\phi_\tau\rangle_{\mathcal{H}_\phi} = \int\int \langle k_\phi(x, \cdot), k_\phi(y, \cdot)\rangle_{\mathcal{H}_\phi}\, d\rho(x)\, d\tau(y) = \int\int k_\phi(x, y)\, d\rho(x)\, d\tau(y), \end{align*} using $\langle k_\phi(x, \cdot), k_\phi(y, \cdot)\rangle_{\mathcal{H}_\phi} = k_\phi(x, y)$ (the reproducing property of the kernel). Setting $\rho = \tau$ gives $\|M^\phi_\rho\|_{\mathcal{H}_\phi}^2 = \int k_\phi\, d(\rho \otimes \rho)$. **The full expansion.** Substituting these identities into the squared-norm expansion above, \begin{align*} \|M^\phi_{\rho_1} - M^\phi_{\rho_2}\|_{\mathcal{H}_\phi}^2 = \int_{\mathcal{K} \times \mathcal{K}} k_\phi\, d(\rho_1 \otimes \rho_1) - 2\int_{\mathcal{K} \times \mathcal{K}} k_\phi\, d(\rho_1 \otimes \rho_2) + \int_{\mathcal{K} \times \mathcal{K}} k_\phi\, d(\rho_2 \otimes \rho_2). \end{align*} **Finiteness of each integral.** The kernel $k_\phi : \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ is continuous by hypothesis. A continuous real-valued function on the compact space $\mathcal{K} \times \mathcal{K}$ is bounded (extreme value theorem applied to $|k_\phi|$), so there exists $M < \infty$ with $\sup_{(x, y) \in \mathcal{K} \times \mathcal{K}} |k_\phi(x, y)| \le M$. For any product probability measure $\rho \otimes \tau$ on $\mathcal{K} \times \mathcal{K}$, \begin{align*} \Bigl|\int k_\phi\, d(\rho \otimes \tau)\Bigr| \le \int |k_\phi|\, d(\rho \otimes \tau) \le M \cdot (\rho \otimes \tau)(\mathcal{K} \times \mathcal{K}) = M < \infty. \end{align*} All three integrals are therefore finite, so the expansion involves no $\infty - \infty$ ambiguity.
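As a sanity check on the three-integral expansion and the bound $M$, here is a minimal Python sketch for finitely supported measures on $\mathcal{K} = [0, 1]$. The Gaussian kernel, the bandwidth `sigma`, the two example measures, and all function names are illustrative assumptions and are not part of the theorem; for this kernel one can take $M = 1$.

```python
import math

def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian kernel on the real line (assumed example); continuous, values in (0, 1]."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def kernel_integral(rho, tau, kernel=gaussian_kernel):
    """Integral of the kernel against the product measure rho (x) tau.

    Each measure is a list of (point, weight) pairs whose weights sum to 1,
    so the double integral reduces to a finite double sum.
    """
    return sum(w * v * kernel(x, y) for x, w in rho for y, v in tau)

def mmd_squared(rho1, rho2, kernel=gaussian_kernel):
    """Squared MMD via the three-integral expansion with signs +1, -2, +1."""
    return (kernel_integral(rho1, rho1, kernel)
            - 2.0 * kernel_integral(rho1, rho2, kernel)
            + kernel_integral(rho2, rho2, kernel))

# Two hypothetical probability measures supported in [0, 1].
rho1 = [(0.0, 0.5), (1.0, 0.5)]
rho2 = [(0.25, 0.2), (0.5, 0.3), (0.75, 0.5)]

# The three integrals appearing in the expansion; each is bounded by M = 1 here.
terms = [kernel_integral(a, b) for a, b in ((rho1, rho1), (rho1, rho2), (rho2, rho2))]
value = mmd_squared(rho1, rho2)

print("integrals:", terms)
print("squared MMD:", value)
```

Each of the three integrals stays within $[-M, M]$, and the expansion returns zero (up to rounding) when both arguments are the same measure.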
**Strategic significance of this expansion.** We have re-expressed $d_\phi(\mu_n, \mu)^2$ entirely in terms of three integrals against product measures of the form $\mu_n \otimes \mu_n$, $\mu_n \otimes \mu$, and $\mu \otimes \mu$. The rest of the forward direction will lift weak convergence $\mu_n \rightharpoonup \mu$ to weak convergence of these product measures (Step 2), then plug $k_\phi \in C_b(\mathcal{K} \times \mathcal{K})$ into the definition of weak convergence to get convergence of each integral (Step 3). The alternating signs $+1, -2, +1$ produce a perfect cancellation in the limit, sending $d_\phi(\mu_n, \mu)^2 \to 0$. [/guided] [/step] [step:Lift weak convergence $\mu_n \rightharpoonup \mu$ to weak convergence of the product measures] Suppose $\mu_n \rightharpoonup \mu$ in $\mathcal{P}(\mathcal{K})$. We claim: \begin{align*} \mu_n \otimes \mu_n &\rightharpoonup \mu \otimes \mu, \\ \mu_n \otimes \mu &\rightharpoonup \mu \otimes \mu, \end{align*} both in $\mathcal{P}(\mathcal{K} \times \mathcal{K})$. The second is the standard [Continuity of Product with a Fixed Measure](/theorems/???): for any $h \in C_b(\mathcal{K} \times \mathcal{K})$, the partial integration $h_2(x) := \int_\mathcal{K} h(x, y)\, d\mu(y)$ defines an element of $C_b(\mathcal{K})$ (continuity follows from continuity of $h$ and dominated convergence; boundedness from $\|h_2\|_\infty \le \|h\|_\infty$), so $\int h\, d(\mu_n \otimes \mu) = \int h_2\, d\mu_n \to \int h_2\, d\mu = \int h\, d(\mu \otimes \mu)$. For the first, we use that $\mathcal{K}$ is a compact metric space, so $\mathcal{P}(\mathcal{K})$ is a compact metric space in the weak topology (e.g. metrized by the Lévy-Prokhorov metric), and the [Continuity of the Product Map](/theorems/???) 
$\mathcal{P}(\mathcal{K}) \times \mathcal{P}(\mathcal{K}) \to \mathcal{P}(\mathcal{K} \times \mathcal{K})$, $(\rho, \tau) \mapsto \rho \otimes \tau$, applied to the convergent pair $(\mu_n, \mu_n) \to (\mu, \mu)$, gives $\mu_n \otimes \mu_n \rightharpoonup \mu \otimes \mu$. [/step] [step:Pass each integral through the weak limit using $k_\phi \in C_b(\mathcal{K} \times \mathcal{K})$] Since $k_\phi : \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ is continuous (by hypothesis) and $\mathcal{K} \times \mathcal{K}$ is compact, $k_\phi$ is bounded, so $k_\phi \in C_b(\mathcal{K} \times \mathcal{K})$. By the [Definition of Weak Convergence](/theorems/???) (integration against $C_b$ test functions converges): \begin{align*} \int k_\phi\, d(\mu_n \otimes \mu_n) &\to \int k_\phi\, d(\mu \otimes \mu), \\ \int k_\phi\, d(\mu_n \otimes \mu) &\to \int k_\phi\, d(\mu \otimes \mu), \\ \int k_\phi\, d(\mu \otimes \mu) &= \int k_\phi\, d(\mu \otimes \mu) \quad \text{(constant in } n\text{)}. \end{align*} Substituting into the formula of Step 1 with $\rho_1 = \mu_n$, $\rho_2 = \mu$, \begin{align*} \|M^\phi_{\mu_n} - M^\phi_\mu\|_{\mathcal{H}_\phi}^2 \to \int k_\phi\, d(\mu \otimes \mu) - 2\int k_\phi\, d(\mu \otimes \mu) + \int k_\phi\, d(\mu \otimes \mu) = 0. \end{align*} Hence $d_\phi(\mu_n, \mu) = \|M^\phi_{\mu_n} - M^\phi_\mu\|_{\mathcal{H}_\phi} \to 0$. This proves continuity of $M^\phi$ from the weak topology to the Hilbert-norm topology, and the forward direction of the metrization claim. [guided] The strategy is to write $d_\phi(\mu_n, \mu)^2$ entirely in terms of integrals against product measures and then exploit weak convergence at the level of products. The expansion in Step 1 is purely algebraic — it follows from $\|a - b\|^2 = \|a\|^2 - 2\langle a, b\rangle + \|b\|^2$ together with $\langle M^\phi_\rho, M^\phi_\tau\rangle_{\mathcal{H}_\phi} = \int k_\phi\, d(\rho \otimes \tau)$ (the [Inner Product Formula for Kernel Mean Embeddings](/theorems/???)). 
The non-trivial work is in lifting $\mu_n \rightharpoonup \mu$ to product convergence. The two product convergences we need are: - $\mu_n \otimes \mu \rightharpoonup \mu \otimes \mu$: This is "fixing one factor and varying the other". For $h \in C_b(\mathcal{K} \times \mathcal{K})$, define $h_2 \in C_b(\mathcal{K})$ by partial integration against $\mu$. Then $\int h\, d(\mu_n \otimes \mu) = \int h_2\, d\mu_n \to \int h_2\, d\mu = \int h\, d(\mu \otimes \mu)$. The continuity of $h_2$ uses dominated convergence: $h$ is continuous and bounded on the compact product, so for $x_n \to x$, $h(x_n, y) \to h(x, y)$ pointwise with the constant bound $\|h\|_\infty$, and DCT gives convergence of the integral against $\mu$. - $\mu_n \otimes \mu_n \rightharpoonup \mu \otimes \mu$: Both factors vary. The cleanest argument uses the **continuity of the product map** $\mathcal{P}(\mathcal{K}) \times \mathcal{P}(\mathcal{K}) \to \mathcal{P}(\mathcal{K} \times \mathcal{K})$. On compact metric $\mathcal{K}$, weak convergence is metrizable, so we can verify continuity by sequential continuity. Apply this to the convergent pair $(\mu_n, \mu_n) \to (\mu, \mu)$. Once we have product convergence and $k_\phi \in C_b$, plugging into the definition of weak convergence gives convergence of each of the three integrals, and the alternating signs $+1, -2, +1$ produce the perfect cancellation $1 - 2 + 1 = 0$. [/guided] [/step] [step:Establish the converse: $d_\phi(\mu_n, \mu) \to 0$ implies $\mu_n \rightharpoonup \mu$] Assume $d_\phi(\mu_n, \mu) \to 0$. Fix $f \in C_b(\mathcal{K})$ and $\varepsilon > 0$. Since $\mathcal{K}$ is compact, $f$ is bounded and continuous. By universality of $k_\phi$, $\mathcal{H}_\phi$ is dense in $C(\mathcal{K})$ in the uniform norm; since $\mathcal{K}$ is compact, $C(\mathcal{K}) = C_b(\mathcal{K})$, so there exists $g \in \mathcal{H}_\phi$ with $\|f - g\|_\infty < \varepsilon$. 
Following the triangle-inequality decomposition used in [MMD is a Metric under Characteristicness](/theorems/2522) (the same bracket argument), write $\rho(f) := \mathbb{E}_\rho[f] = \int f\, d\rho$ and insert $g$ as a proxy: \begin{align*} |\mu_n(f) - \mu(f)| \le |\mu_n(f) - \mu_n(g)| + |\mu_n(g) - \mu(g)| + |\mu(g) - \mu(f)|. \end{align*} The two outer terms are controlled by uniform closeness: for any probability measure $\rho$ on $\mathcal{K}$, \begin{align*} |\mathbb{E}_\rho[f] - \mathbb{E}_\rho[g]| \le \mathbb{E}_\rho|f - g| \le \|f - g\|_\infty < \varepsilon. \end{align*} For the middle term, the [Reproducing Property of the Kernel Mean Embedding](/theorems/???) gives \begin{align*} \mathbb{E}_{\mu_n}[g] - \mathbb{E}_\mu[g] = \langle g, M^\phi_{\mu_n} - M^\phi_\mu \rangle_{\mathcal{H}_\phi}. \end{align*} By the Cauchy-Schwarz inequality in $\mathcal{H}_\phi$, \begin{align*} |\mathbb{E}_{\mu_n}[g] - \mathbb{E}_\mu[g]| \le \|g\|_{\mathcal{H}_\phi} \cdot \|M^\phi_{\mu_n} - M^\phi_\mu\|_{\mathcal{H}_\phi} = \|g\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu). \end{align*} Combining all three estimates, \begin{align*} |\mu_n(f) - \mu(f)| \le 2\varepsilon + \|g\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu). \end{align*} The element $g$ depends on $\varepsilon$ but is fixed once $\varepsilon$ is. Since $d_\phi(\mu_n, \mu) \to 0$, taking $\limsup_{n \to \infty}$, \begin{align*} \limsup_{n \to \infty} |\mu_n(f) - \mu(f)| \le 2\varepsilon. \end{align*} Letting $\varepsilon \to 0$ gives $\mu_n(f) \to \mu(f)$. Since $f \in C_b(\mathcal{K})$ was arbitrary, $\mu_n \rightharpoonup \mu$ by the [Definition of Weak Convergence](/theorems/???). [guided] The converse direction is the heart of the metrization claim. We are given $d_\phi(\mu_n, \mu) \to 0$, that is, norm convergence of the *kernel mean embeddings* in $\mathcal{H}_\phi$, and must conclude $\mu_n \rightharpoonup \mu$ in the *weak topology* on $\mathcal{P}(\mathcal{K})$, i.e. $\int f\, d\mu_n \to \int f\, d\mu$ for every $f \in C_b(\mathcal{K})$. The key idea is to use universality of $k_\phi$ to bridge between $C_b(\mathcal{K})$ and $\mathcal{H}_\phi$, where the convergence in $d_\phi$ acts directly through the reproducing property.
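The Cauchy-Schwarz estimate for the middle term can be checked numerically. The sketch below assumes a Gaussian kernel on $[0, 1]$, a hypothetical RKHS element $g = \sum_i \alpha_i k_\phi(z_i, \cdot)$ (whose norm satisfies $\|g\|_{\mathcal{H}_\phi}^2 = \sum_{i,j} \alpha_i \alpha_j k_\phi(z_i, z_j)$ by the reproducing property), and two illustrative finitely supported measures; none of these choices come from the theorem itself.

```python
import math

def k(x, y, sigma=0.5):
    """Gaussian kernel (assumed example kernel)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def pair_integral(rho, tau):
    """Integral of k against rho (x) tau for lists of (point, weight) pairs."""
    return sum(w * v * k(x, y) for x, w in rho for y, v in tau)

def mmd(rho, tau):
    """MMD as the square root of the three-integral expansion."""
    sq = pair_integral(rho, rho) - 2.0 * pair_integral(rho, tau) + pair_integral(tau, tau)
    return math.sqrt(max(sq, 0.0))  # clip tiny negative rounding errors

# Hypothetical RKHS element g = sum_i alpha_i * k(z_i, .).
centers = [0.1, 0.4, 0.9]
alphas = [1.5, -2.0, 0.7]

def g(x):
    return sum(a * k(z, x) for a, z in zip(alphas, centers))

# RKHS norm of g: sqrt(alpha^T K alpha), where K is the Gram matrix of the centers.
g_norm = math.sqrt(sum(a * b * k(z, zp)
                       for a, z in zip(alphas, centers)
                       for b, zp in zip(alphas, centers)))

def expectation(rho, func):
    """Integral of func against a finitely supported measure."""
    return sum(w * func(x) for x, w in rho)

mu = [(0.0, 0.5), (1.0, 0.5)]
mu_n = [(0.05, 0.5), (0.95, 0.5)]  # a nearby measure standing in for mu_n

lhs = abs(expectation(mu_n, g) - expectation(mu, g))
rhs = g_norm * mmd(mu_n, mu)
print("difference of expectations:", lhs)
print("norm of g times MMD:", rhs)
```

The first printed value should not exceed the second, which is exactly the bound $|\mu_n(g) - \mu(g)| \le \|g\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu)$.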
**(a) Setup.** Fix $f \in C_b(\mathcal{K})$ and $\varepsilon > 0$. Since $\mathcal{K}$ is compact, $C(\mathcal{K}) = C_b(\mathcal{K})$ (every continuous function on a compact space is bounded by the extreme value theorem). By universality of $k_\phi$, the RKHS $\mathcal{H}_\phi$ is uniformly dense in $C(\mathcal{K})$, so there exists $g_\varepsilon \in \mathcal{H}_\phi$ — depending on the choice of $\varepsilon$ — with \begin{align*} \|f - g_\varepsilon\|_\infty := \sup_{x \in \mathcal{K}} |f(x) - g_\varepsilon(x)| < \varepsilon. \end{align*} The element $g_\varepsilon$ is fixed for the remainder of the argument; in particular, $\|g_\varepsilon\|_{\mathcal{H}_\phi}$ is a finite constant depending on $\varepsilon$ and $f$ but not on $n$. **(b) Triangle decomposition.** Insert $g_\varepsilon$ as a proxy and apply the triangle inequality: \begin{align*} |\mu_n(f) - \mu(f)| \le \underbrace{|\mu_n(f) - \mu_n(g_\varepsilon)|}_{\text{(I)}} + \underbrace{|\mu_n(g_\varepsilon) - \mu(g_\varepsilon)|}_{\text{(II)}} + \underbrace{|\mu(g_\varepsilon) - \mu(f)|}_{\text{(III)}}. \end{align*} **(c) Bounding terms (I) and (III) by uniform closeness.** For any probability measure $\rho \in \mathcal{P}(\mathcal{K})$, \begin{align*} |\rho(f) - \rho(g_\varepsilon)| = \Bigl|\int_\mathcal{K} (f - g_\varepsilon)\, d\rho\Bigr| \le \int_\mathcal{K} |f - g_\varepsilon|\, d\rho \le \|f - g_\varepsilon\|_\infty \cdot \rho(\mathcal{K}) = \|f - g_\varepsilon\|_\infty < \varepsilon, \end{align*} using the pointwise bound $|f - g_\varepsilon|(x) \le \|f - g_\varepsilon\|_\infty$ on $\mathcal{K}$, monotonicity of the integral, and $\rho(\mathcal{K}) = 1$ since $\rho$ is a probability measure. Apply this with $\rho = \mu_n$ for term (I) and $\rho = \mu$ for term (III); both are bounded by $\varepsilon$. **(d) Bounding term (II) via the reproducing property.** This is the bridge from "weak topology" to "MMD topology". 
Since $g_\varepsilon \in \mathcal{H}_\phi$, the [Reproducing Property of the Kernel Mean Embedding](/theorems/???) gives, for any $\rho \in \mathcal{P}(\mathcal{K})$, \begin{align*} \rho(g_\varepsilon) = \mathbb{E}_{x \sim \rho}[g_\varepsilon(x)] = \langle g_\varepsilon, M^\phi_\rho\rangle_{\mathcal{H}_\phi}. \end{align*} Subtracting the same identity with $\rho = \mu$ from $\rho = \mu_n$, \begin{align*} \mu_n(g_\varepsilon) - \mu(g_\varepsilon) = \langle g_\varepsilon, M^\phi_{\mu_n} - M^\phi_\mu\rangle_{\mathcal{H}_\phi}. \end{align*} Apply Cauchy--Schwarz in the Hilbert space $\mathcal{H}_\phi$: \begin{align*} |\mu_n(g_\varepsilon) - \mu(g_\varepsilon)| \le \|g_\varepsilon\|_{\mathcal{H}_\phi} \cdot \|M^\phi_{\mu_n} - M^\phi_\mu\|_{\mathcal{H}_\phi} = \|g_\varepsilon\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu). \end{align*} The right-hand side has $\|g_\varepsilon\|_{\mathcal{H}_\phi}$ fixed (independent of $n$) and $d_\phi(\mu_n, \mu) \to 0$ by hypothesis, so term (II) tends to zero as $n \to \infty$. **(e) Combining the bounds.** From (b)--(d), \begin{align*} |\mu_n(f) - \mu(f)| \le \varepsilon + \|g_\varepsilon\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu) + \varepsilon = 2\varepsilon + \|g_\varepsilon\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu). \end{align*} Take the limsup over $n$, treating $\varepsilon$ and $g_\varepsilon$ as fixed: \begin{align*} \limsup_{n \to \infty} |\mu_n(f) - \mu(f)| \le 2\varepsilon + \|g_\varepsilon\|_{\mathcal{H}_\phi} \cdot \limsup_{n \to \infty} d_\phi(\mu_n, \mu) = 2\varepsilon + \|g_\varepsilon\|_{\mathcal{H}_\phi} \cdot 0 = 2\varepsilon. \end{align*} Now take $\varepsilon \to 0^+$. Since the bound $\limsup |\mu_n(f) - \mu(f)| \le 2\varepsilon$ holds for *every* $\varepsilon > 0$, \begin{align*} \limsup_{n \to \infty} |\mu_n(f) - \mu(f)| = 0, \end{align*} which means $\mu_n(f) \to \mu(f)$. **(f) Conclusion.** The function $f \in C_b(\mathcal{K})$ was arbitrary, so $\mu_n(f) \to \mu(f)$ for every $f \in C_b(\mathcal{K})$. 
By the [Definition of Weak Convergence](/theorems/???) on a metric space (which uses $C_b$ test functions), $\mu_n \rightharpoonup \mu$. **Why the order of limits matters.** Notice we send $n \to \infty$ first (with $\varepsilon$ fixed) and then $\varepsilon \to 0$, not the reverse. This is essential because $g_\varepsilon$ depends on $\varepsilon$: as $\varepsilon \to 0$, the approximant changes and $\|g_\varepsilon\|_{\mathcal{H}_\phi}$ may *blow up* (it necessarily does when $f \notin \mathcal{H}_\phi$, since a uniformly convergent sequence with bounded RKHS norm has its limit in $\mathcal{H}_\phi$). If we tried to let $\varepsilon \to 0$ first with $n$ held fixed, the term $\|g_\varepsilon\|_{\mathcal{H}_\phi} \cdot d_\phi(\mu_n, \mu)$ could be unbounded. Sending $n \to \infty$ first kills $d_\phi(\mu_n, \mu)$ for *that specific* $g_\varepsilon$, neutralising the blow-up. This is the standard density argument: the two limits are taken in a fixed order and are never interchanged. **Why universality is essential.** Without universality, we could not produce $g_\varepsilon \in \mathcal{H}_\phi$ uniformly close to a generic $f \in C_b(\mathcal{K})$. The argument would then only conclude $\mu_n(f) \to \mu(f)$ for $f$ in the uniform closure of $\mathcal{H}_\phi$, which might be a proper subset of $C_b(\mathcal{K})$. Such partial convergence is *not* equivalent to weak convergence: measures could agree on $\mathcal{H}_\phi$ but disagree elsewhere, so $d_\phi$ would fail to metrize the weak topology. [/guided] [/step] [step:Conclude that $d_\phi$ metrizes the weak topology] The forward direction (Step 3) and the converse (Step 4) together give: $\mu_n \rightharpoonup \mu \iff d_\phi(\mu_n, \mu) \to 0$. By [MMD is a Metric under Characteristicness](/theorems/2522), $d_\phi$ is a genuine metric on $\mathcal{P}(\mathcal{K})$ (separation: $d_\phi(\mu, \nu) = 0 \iff \mu = \nu$; symmetry and triangle inequality follow from $d_\phi$ being a Hilbert-norm distance).
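Both topologies are metrizable (the weak topology because $\mathcal{K}$ is a compact metric space, as used in Step 2), so each is determined by its convergent sequences. Hence the convergence equivalence shows that the metric topology induced by $d_\phi$ coincides with the weak topology on $\mathcal{P}(\mathcal{K})$. This completes the proof. [/step]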
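To see the equivalence on a concrete sequence, the following Python sketch takes $\mathcal{K} = [0, 1]$, $\mu_n = \delta_{1/n}$ and $\mu = \delta_0$, so that $\mu_n \rightharpoonup \mu$. It assumes a Gaussian kernel (which is known to be universal on compact subsets of $\mathbb{R}^d$) and a hypothetical test function $f(x) = |x - 0.3|$; both are illustrative choices, not part of the theorem.

```python
import math

def k(x, y, sigma=0.5):
    """Gaussian kernel (assumed example of a universal kernel on [0, 1])."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def mmd_between_diracs(a, b):
    """MMD between point masses: sqrt(k(a, a) - 2 k(a, b) + k(b, b))."""
    sq = k(a, a) - 2.0 * k(a, b) + k(b, b)
    return math.sqrt(max(sq, 0.0))  # clip tiny negative rounding errors

def f(x):
    """A continuous, non-smooth test function on [0, 1] (hypothetical choice)."""
    return abs(x - 0.3)

ns = [1, 10, 100, 1000]

# MMD side: d(delta_{1/n}, delta_0) for increasing n.
mmd_values = [mmd_between_diracs(1.0 / n, 0.0) for n in ns]

# Weak-convergence side: |mu_n(f) - mu(f)| = |f(1/n) - f(0)|.
f_gaps = [abs(f(1.0 / n) - f(0.0)) for n in ns]

for n, d, gap in zip(ns, mmd_values, f_gaps):
    print(f"n={n:5d}  MMD={d:.6f}  |mu_n(f) - mu(f)|={gap:.6f}")
```

Both columns shrink toward zero as $n$ grows, consistent with the equivalence $\mu_n \rightharpoonup \mu \iff d_\phi(\mu_n, \mu) \to 0$ on this example; a numerical run is of course only an illustration, not a substitute for the proof.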

MMD Metrizes Weak Convergence (Theorem # 2523)
