Unbiased MMD Estimator — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The proof has two parts. For unbiasedness, we expand $\mathbb{E}[\hat{d}_\phi(\mu,\nu)^2]$ term by term, using the i.i.d. structure to identify each off-diagonal average with a population kernel expectation; matching this to the population formula for the squared MMD established in the [Population Squared MMD Formula](/theorems/???) yields unbiasedness. For consistency, each of the three averages is a generalised U-statistic with bounded second moment (guaranteed by the Cauchy-Schwarz bound $|k_\phi(x,y)|^2 \le k_\phi(x,x) k_\phi(y,y)$ together with the integrability hypotheses), and the [Weak Law of Large Numbers for U-Statistics](/theorems/???) provides convergence in probability. The continuous mapping theorem then transfers convergence to the linear combination. [/proofplan] [step:Identify the three components of $\hat{d}_\phi(\mu,\nu)^2$ as separate averages] Decompose the estimator into three U-statistics. Define the maps \begin{align*} A_m &: (\mathcal{K}^m, \mu^{\otimes m}) \to \mathbb{R}, & (x^1,\dots,x^m) &\mapsto \frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(x^i, x^j), \\ B_{m,n} &: (\mathcal{K}^{m+n}, \mu^{\otimes m} \otimes \nu^{\otimes n}) \to \mathbb{R}, & (x,y) &\mapsto \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n k_\phi(x^i, y^j), \\ C_n &: (\mathcal{K}^n, \nu^{\otimes n}) \to \mathbb{R}, & (y^1,\dots,y^n) &\mapsto \frac{1}{n(n-1)} \sum_{i \neq j} k_\phi(y^i, y^j), \end{align*} so that $\hat{d}_\phi(\mu,\nu)^2 = A_m - 2 B_{m,n} + C_n$. Each sum is a finite sum of measurable kernel evaluations, hence integrable provided the kernel evaluations are integrable; we verify this below. [/step] [step:Verify integrability of each pairwise kernel evaluation] By the [Cauchy-Schwarz inequality for reproducing kernels](/theorems/???), for any $u, v \in \mathcal{K}$, \begin{align*} |k_\phi(u, v)| \le \sqrt{k_\phi(u,u)\, k_\phi(v,v)}. \end{align*} For $i \neq j$ the variables $x^i, x^j$ are independent draws from $\mu$, so by Tonelli's theorem and the AM-GM bound $\sqrt{ab} \le \tfrac12(a+b)$, \begin{align*} \mathbb{E}\bigl[|k_\phi(x^i, x^j)|\bigr] \le \mathbb{E}_{x,x' \sim \mu}\bigl[\sqrt{k_\phi(x,x) k_\phi(x',x')}\bigr] \le \mathbb{E}_{x \sim \mu}[k_\phi(x,x)] < \infty. \end{align*} The same argument applied to $\mu \otimes \nu$ and to $\nu \otimes \nu$ shows that $\mathbb{E}[|k_\phi(x^i, y^j)|]$ and $\mathbb{E}[|k_\phi(y^i, y^j)|]$ are finite. Hence Fubini's theorem applies to each sum in the decomposition. [guided] Before computing any expectations we must verify that the kernel evaluations being averaged are integrable. Without this, neither linearity-of-expectation manipulations nor Fubini are licensed. The two ingredients are (a) the universal Cauchy--Schwarz bound for any positive-definite kernel, and (b) the integrability hypotheses on the diagonal $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ that are part of the theorem's standing assumptions. **(a) The reproducing-kernel Cauchy--Schwarz bound.** Since $k_\phi$ is a positive-definite kernel arising from the inner product on the Hilbert space $T_\phi((V))$, the [Cauchy--Schwarz inequality for reproducing kernels](/theorems/???) gives, for all $u, v \in \mathcal{K}$, \begin{align*} |k_\phi(u, v)| = |\langle S(u), S(v)\rangle_\phi| \le \|S(u)\|_\phi \, \|S(v)\|_\phi = \sqrt{k_\phi(u, u)} \, \sqrt{k_\phi(v, v)}. \end{align*} The two equalities at the ends use $\|S(u)\|_\phi^2 = \langle S(u), S(u)\rangle_\phi = k_\phi(u, u)$ (and likewise for $v$). The bound is pointwise on $\mathcal{K} \times \mathcal{K}$ and requires no integrability. **(b) Off-diagonal integrability under $\mu \otimes \mu$.** For $i \neq j$ the joint law of $(x^i, x^j)$ is $\mu \otimes \mu$ by the i.i.d.\ hypothesis on the $\mu$-sample. By Tonelli's theorem (applicable to the non-negative measurable function $\sqrt{k_\phi(\cdot, \cdot) k_\phi(\cdot, \cdot)}$), \begin{align*} \mathbb{E}\bigl[|k_\phi(x^i, x^j)|\bigr] = \int_{\mathcal{K} \times \mathcal{K}} |k_\phi(u, v)|\, d(\mu \otimes \mu)(u, v) \le \int_{\mathcal{K} \times \mathcal{K}} \sqrt{k_\phi(u, u) k_\phi(v, v)}\, d(\mu \otimes \mu)(u, v). \end{align*} Apply the AM--GM inequality $\sqrt{ab} \le \tfrac{1}{2}(a + b)$ with $a = k_\phi(u, u) \ge 0$ and $b = k_\phi(v, v) \ge 0$, and split the resulting integral by Tonelli: \begin{align*} \int_{\mathcal{K} \times \mathcal{K}} \sqrt{k_\phi(u, u) k_\phi(v, v)}\, d(\mu \otimes \mu)(u, v) &\le \frac{1}{2}\int_{\mathcal{K} \times \mathcal{K}} \bigl(k_\phi(u, u) + k_\phi(v, v)\bigr)\, d(\mu \otimes \mu)(u, v) \\ &= \frac{1}{2}\Bigl(\int_\mathcal{K} k_\phi(u, u)\, d\mu(u) + \int_\mathcal{K} k_\phi(v, v)\, d\mu(v)\Bigr) = \mathbb{E}_{x \sim \mu}[k_\phi(x, x)] < \infty, \end{align*} where the last inequality is the integrability hypothesis $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$. Hence $\mathbb{E}[|k_\phi(x^i, x^j)|] < \infty$ for any $i \neq j$. **(c) Cross-term and $\nu$-term integrability.** The same Cauchy--Schwarz + AM--GM argument applied with the joint law $\mu \otimes \nu$ on $(x^i, y^j)$ gives \begin{align*} \mathbb{E}\bigl[|k_\phi(x^i, y^j)|\bigr] \le \frac{1}{2}\bigl(\mathbb{E}_{x \sim \mu}[k_\phi(x, x)] + \mathbb{E}_{y \sim \nu}[k_\phi(y, y)]\bigr) < \infty, \end{align*} and applied with $\nu \otimes \nu$ on $(y^i, y^j)$ for $i \neq j$ gives \begin{align*} \mathbb{E}\bigl[|k_\phi(y^i, y^j)|\bigr] \le \mathbb{E}_{y \sim \nu}[k_\phi(y, y)] < \infty. \end{align*} **Why integrability matters here.** Three downstream operations rely on it: \begin{enumerate} \item *Fubini* — to compute $\mathbb{E}[k_\phi(x^i, x^j)] = \int\int k_\phi\, d\mu\, d\mu$ as iterated integrals (Steps 3 and 4) we need integrability on the product space; we have $L^1$-integrability by the bound just established. \item *Linearity of expectation* — interchanging $\mathbb{E}$ with the finite double sum $\sum_{i \neq j}$ is automatic for finite sums of integrable random variables. \item *Continuous mapping / WLLN* — convergence-in-probability arguments via Chebyshev or U-statistic theorems require finite first (and ideally second) moments; we have controlled the first moment, and the same Cauchy--Schwarz bound iterated gives the second moment, $\mathbb{E}[|k_\phi(u, v)|^2] \le \mathbb{E}[k_\phi(u, u) k_\phi(v, v)]$, which is finite by Tonelli and the same hypothesis (note here we do not even need AM--GM since we are already a single product over the product measure). \end{enumerate} The integrability hypotheses $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ are therefore not technical decoration — they are the entry condition that makes every expectation in the rest of the proof well-defined. [/guided] [/step] [step:Compute $\mathbb{E}[A_m]$ via the i.i.d. structure on the off-diagonal] For any pair $(i,j)$ with $i \neq j$, the variables $x^i$ and $x^j$ are independent and each distributed as $\mu$, so $(x^i, x^j) \sim \mu \otimes \mu$. By Fubini, \begin{align*} \mathbb{E}[k_\phi(x^i, x^j)] = \int_{\mathcal{K}}\int_{\mathcal{K}} k_\phi(u, v)\, d\mu(u)\, d\mu(v) = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]. \end{align*} The sum $\sum_{i \neq j}$ has exactly $m(m-1)$ terms, each with the same expectation, so \begin{align*} \mathbb{E}[A_m] = \frac{1}{m(m-1)} \cdot m(m-1) \cdot \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]. \end{align*} [guided] The crucial design choice in the unbiased estimator is the *exclusion of the diagonal* $i = j$ from the double sum. We compute $\mathbb{E}[A_m]$ in three steps: (a) identify the joint law of an off-diagonal pair, (b) reduce $\mathbb{E}[A_m]$ to a single expectation by linearity and i.i.d.-symmetry, (c) explain why the diagonal would have introduced bias. **(a) Joint law of an off-diagonal pair.** Fix any ordered pair $(i, j)$ with $i \neq j$, $1 \le i, j \le m$. Since $(x^1, \ldots, x^m)$ is i.i.d. with law $\mu$, the marginal joint law of $(x^i, x^j)$ is the product measure $\mu \otimes \mu$ on $\mathcal{K} \times \mathcal{K}$. (For $i = j$, the joint law is the diagonal copy of $\mu$ on $\mathcal{K} \times \mathcal{K}$, which is *not* $\mu \otimes \mu$ in general.) By Fubini's theorem — applicable because $\mathbb{E}[|k_\phi(x^i, x^j)|] < \infty$ from the integrability check in Step 2 — \begin{align*} \mathbb{E}[k_\phi(x^i, x^j)] = \int_{\mathcal{K} \times \mathcal{K}} k_\phi(u, v)\, d(\mu \otimes \mu)(u, v) = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]. \end{align*} **(b) Symmetry across pairs.** This expectation does not depend on the choice of $(i, j)$ as long as $i \neq j$: every off-diagonal ordered pair has the same joint law $\mu \otimes \mu$ by the i.i.d. structure. The double sum $\sum_{i \neq j}$ ranges over ordered pairs with $i \neq j$, of which there are exactly $m(m-1)$ (we choose $i$ in $m$ ways, then $j$ in $m - 1$ remaining ways). By linearity of expectation, \begin{align*} \mathbb{E}[A_m] = \frac{1}{m(m-1)} \mathbb{E}\Bigl[\sum_{i \neq j} k_\phi(x^i, x^j)\Bigr] = \frac{1}{m(m-1)} \sum_{i \neq j} \mathbb{E}[k_\phi(x^i, x^j)] = \frac{m(m-1)}{m(m-1)} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]. \end{align*} The interchange of expectation and finite sum requires no further justification; for an infinite sum, one would invoke Fubini, but the sum here is over the finite index set $\{(i, j) : i \neq j, 1 \le i, j \le m\}$. **(c) Why exclude the diagonal?** If we instead summed over *all* ordered pairs $(i, j)$ including $i = j$, the diagonal would contribute the term \begin{align*} \frac{1}{m^2}\sum_{i = 1}^m k_\phi(x^i, x^i), \end{align*} whose expectation is $\frac{1}{m}\mathbb{E}_{x \sim \mu}[k_\phi(x, x)]$ — the integral of $k_\phi$ on the *diagonal* of $\mathcal{K} \times \mathcal{K}$ rather than on the full product. In general $\mathbb{E}_{x \sim \mu}[k_\phi(x, x)] \neq \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$, so including the diagonal injects a bias proportional to the difference between the two. The denominator $m(m-1)$ instead of $m^2$ further fine-tunes the normalisation so that the off-diagonal sum has the right expectation. This bias-removal is the entire reason the estimator is called *unbiased*. The same trick is used in Step 4 for $C_n$ (the $\nu$-sample), but is *not* needed for $B_{m,n}$ since the cross sum naturally has no diagonal — the indices $i$ and $j$ index *different* samples and there is no case of "the same draw" to exclude. [/guided] [/step] [step:Compute $\mathbb{E}[B_{m,n}]$ and $\mathbb{E}[C_n]$ analogously] For the cross term, every pair $(x^i, y^j)$ has joint law $\mu \otimes \nu$ by the independence of the two samples and the i.i.d. structure within each. The sum has $mn$ terms with common expectation, so \begin{align*} \mathbb{E}[B_{m,n}] = \frac{1}{mn} \cdot mn \cdot \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] = \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)]. \end{align*} The same off-diagonal argument as in the previous step, applied to the $\nu$-sample, gives \begin{align*} \mathbb{E}[C_n] = \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]. \end{align*} [/step] [step:Combine the three expectations to obtain the population squared MMD] By linearity of expectation and the previous two steps, \begin{align*} \mathbb{E}[\hat{d}_\phi(\mu,\nu)^2] = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] - 2\, \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] + \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]. \end{align*} By the [Population Squared MMD Formula](/theorems/???), the right-hand side equals $d_\phi(\mu,\nu)^2$. This proves unbiasedness. [/step] [step:Establish convergence in probability of each component average] Each of $A_m$, $B_{m,n}$, $C_n$ is an average of pairwise kernel evaluations on i.i.d. data. We show each converges in probability to its mean. [claim:$A_m \xrightarrow{\mathbb{P}} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$ as $m \to \infty$] [proof] $A_m$ is a degree-$2$ U-statistic with kernel $h(u,v) = k_\phi(u, v)$ on the i.i.d. sample $(x^1, \dots, x^m) \sim \mu^{\otimes m}$. The kernel $h$ has finite first moment $\mathbb{E}|h(x, x')| < \infty$ by the integrability check above. By the [Weak Law of Large Numbers for U-Statistics](/theorems/???) (Hoeffding, Theorem 5.4), $A_m \to \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$ in probability as $m \to \infty$. [/proof] [/claim] The identical argument with the kernel $k_\phi$ on the $\nu$-sample gives $C_n \xrightarrow{\mathbb{P}} \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]$ as $n \to \infty$. For the cross term, $B_{m,n}$ is a sum of $mn$ pairwise terms on independent i.i.d. samples; this is a two-sample U-statistic. The kernel $k_\phi(x, y)$ has $\mathbb{E}|k_\phi(x, y)| < \infty$ from Step 2, and the [Weak Law of Large Numbers for Two-Sample U-Statistics](/theorems/???) gives \begin{align*} B_{m,n} \xrightarrow{\mathbb{P}} \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] \qquad \text{as } m, n \to \infty. \end{align*} [guided] Each of the three averages $A_m$, $B_{m,n}$, $C_n$ is a *generalised U-statistic* — an average of a kernel over all pairs of distinct (or product-independent) draws — and we apply the appropriate Weak Law of Large Numbers for each. The key checks in every application are: (i) the underlying samples have the right independence structure, (ii) the kernel has finite first moment. **(a) $A_m$ as a one-sample U-statistic.** $A_m = \frac{1}{m(m-1)}\sum_{i \neq j} k_\phi(x^i, x^j)$ is a degree-$2$ U-statistic on the i.i.d. sample $(x^1, \ldots, x^m) \sim \mu^{\otimes m}$ with kernel $h(u, v) := k_\phi(u, v)$. For the [Weak Law of Large Numbers for U-Statistics](/theorems/???) (Hoeffding, Theorem 5.4) we need: \begin{itemize} \item *Independence*: $(x^1, \ldots, x^m)$ are i.i.d.\ with law $\mu$ — this is a hypothesis of the theorem. \item *Finite first moment*: $\mathbb{E}|h(x, x')| = \mathbb{E}|k_\phi(x, x')| < \infty$ — verified in Step 2 via the Cauchy--Schwarz bound $|k_\phi(u, v)| \le \sqrt{k_\phi(u, u) k_\phi(v, v)}$ and the integrability hypothesis $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$. \end{itemize} The conclusion is $A_m \xrightarrow{\mathbb{P}} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$ as $m \to \infty$. **(b) $C_n$ by symmetry.** The same theorem applied to the i.i.d. sample $(y^1, \ldots, y^n) \sim \nu^{\otimes n}$ with kernel $k_\phi$ — using the integrability hypothesis $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ to verify finite first moment — gives $C_n \xrightarrow{\mathbb{P}} \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]$ as $n \to \infty$. **(c) $B_{m, n}$ as a two-sample U-statistic.** $B_{m, n} = \frac{1}{mn}\sum_{i, j} k_\phi(x^i, y^j)$ is a degree-$(1, 1)$ two-sample U-statistic with kernel $k_\phi(u, v)$ on the independent product sample $\mu^{\otimes m} \otimes \nu^{\otimes n}$. The [Weak Law of Large Numbers for Two-Sample U-Statistics](/theorems/???) requires: \begin{itemize} \item *Independence*: the two samples are mutually independent — given as a hypothesis. \item *I.i.d. within each sample*: also given. \item *Finite first moment*: $\mathbb{E}_{x \sim \mu, y \sim \nu}|k_\phi(x, y)| < \infty$ — verified in Step 2 via $|k_\phi(x, y)| \le \sqrt{k_\phi(x, x) k_\phi(y, y)}$ and AM-GM, using both integrability hypotheses. \end{itemize} The conclusion is $B_{m, n} \xrightarrow{\mathbb{P}} \mathbb{E}_{x \sim \mu, y \sim \nu}[k_\phi(x, y)]$ as $m, n \to \infty$. **Note on the mode of convergence in the two-sample case.** The two-sample WLLN gives convergence in probability under the joint law of the two independent samples, with $m$ and $n$ both tending to infinity — *not* one held fixed while the other varies. This is the standard formulation, and matches the regime in which the unbiased MMD estimator is used: both sample sizes grow. **Why is the second moment automatically controlled?** While the WLLN only requires finite first moment, the standard proofs use a second-moment Chebyshev argument; the Cauchy--Schwarz bound on the kernel gives $|k_\phi(u, v)|^2 \le k_\phi(u, u) k_\phi(v, v)$, so the second moment of the kernel is bounded by $\mathbb{E}_\mu[k_\phi(x, x)]^2 < \infty$ (and the analogous expressions for the other two). The hypothesis "the integrability hypotheses" in the theorem is shorthand for $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$, which together give finite second moments for all three kernels via this Cauchy--Schwarz bound. [/guided] [/step] [step:Conclude convergence of $\hat{d}_\phi(\mu,\nu)^2$ via the continuous mapping theorem] The map $T : \mathbb{R}^3 \to \mathbb{R}$, $(a, b, c) \mapsto a - 2b + c$, is continuous. By the previous step, \begin{align*} (A_m, B_{m,n}, C_n) \xrightarrow{\mathbb{P}} \bigl(\mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')],\, \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)],\, \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]\bigr) \end{align*} as $m, n \to \infty$. Applying the [Continuous Mapping Theorem](/theorems/???) to $T$, \begin{align*} \hat{d}_\phi(\mu,\nu)^2 = T(A_m, B_{m,n}, C_n) \xrightarrow{\mathbb{P}} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] - 2\, \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] + \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]. \end{align*} By the [Population Squared MMD Formula](/theorems/???), the right-hand side is $d_\phi(\mu,\nu)^2$. This completes the proof of consistency, and combined with Step 5, the theorem. [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Unbiased MMD Estimator (Theorem # 2521)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Unbiased MMD Estimator (Theorem # 2521)

Discussion

Proof

Explore Further