[proofplan]
The proof has two parts. For unbiasedness, we expand $\mathbb{E}[\hat{d}_\phi(\mu,\nu)^2]$ term by term, using the i.i.d. structure to identify each off-diagonal average with a population kernel expectation; matching this to the population formula for the squared MMD established in the [Population Squared MMD Formula](/theorems/???) yields unbiasedness. For consistency, each of the three averages is a generalised U-statistic with bounded second moment (guaranteed by the Cauchy-Schwarz bound $|k_\phi(x,y)|^2 \le k_\phi(x,x) k_\phi(y,y)$ together with the integrability hypotheses), and the [Weak Law of Large Numbers for U-Statistics](/theorems/???) provides convergence in probability. The continuous mapping theorem then transfers convergence to the linear combination.
[/proofplan]
[step:Identify the three components of $\hat{d}_\phi(\mu,\nu)^2$ as separate averages]
Decompose the estimator into three U-statistics. Define the maps
\begin{align*}
A_m &: (\mathcal{K}^m, \mu^{\otimes m}) \to \mathbb{R}, & (x^1,\dots,x^m) &\mapsto \frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(x^i, x^j), \\
B_{m,n} &: (\mathcal{K}^{m+n}, \mu^{\otimes m} \otimes \nu^{\otimes n}) \to \mathbb{R}, & (x,y) &\mapsto \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n k_\phi(x^i, y^j), \\
C_n &: (\mathcal{K}^n, \nu^{\otimes n}) \to \mathbb{R}, & (y^1,\dots,y^n) &\mapsto \frac{1}{n(n-1)} \sum_{i \neq j} k_\phi(y^i, y^j),
\end{align*}
so that $\hat{d}_\phi(\mu,\nu)^2 = A_m - 2 B_{m,n} + C_n$. Each sum is a finite sum of measurable kernel evaluations, hence integrable provided the kernel evaluations are integrable; we verify this below.
[/step]
[step:Verify integrability of each pairwise kernel evaluation]
By the [Cauchy-Schwarz inequality for reproducing kernels](/theorems/???), for any $u, v \in \mathcal{K}$,
\begin{align*}
|k_\phi(u, v)| \le \sqrt{k_\phi(u,u)\, k_\phi(v,v)}.
\end{align*}
For $i \neq j$ the variables $x^i, x^j$ are independent draws from $\mu$, so by Tonelli's theorem and the AM-GM bound $\sqrt{ab} \le \tfrac12(a+b)$,
\begin{align*}
\mathbb{E}\bigl[|k_\phi(x^i, x^j)|\bigr] \le \mathbb{E}_{x,x' \sim \mu}\bigl[\sqrt{k_\phi(x,x) k_\phi(x',x')}\bigr] \le \mathbb{E}_{x \sim \mu}[k_\phi(x,x)] < \infty.
\end{align*}
The same argument applied to $\mu \otimes \nu$ and to $\nu \otimes \nu$ shows that $\mathbb{E}[|k_\phi(x^i, y^j)|]$ and $\mathbb{E}[|k_\phi(y^i, y^j)|]$ are finite. Hence Fubini's theorem applies to each sum in the decomposition.
[guided]
Before computing any expectations we must verify that the kernel evaluations being averaged are integrable. Without this, neither linearity-of-expectation manipulations nor Fubini are licensed. The two ingredients are (a) the universal Cauchy--Schwarz bound for any positive-definite kernel, and (b) the integrability hypotheses on the diagonal $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ that are part of the theorem's standing assumptions.
**(a) The reproducing-kernel Cauchy--Schwarz bound.** Since $k_\phi$ is a positive-definite kernel arising from the inner product on the Hilbert space $T_\phi((V))$, the [Cauchy--Schwarz inequality for reproducing kernels](/theorems/???) gives, for all $u, v \in \mathcal{K}$,
\begin{align*}
|k_\phi(u, v)| = |\langle S(u), S(v)\rangle_\phi| \le \|S(u)\|_\phi \, \|S(v)\|_\phi = \sqrt{k_\phi(u, u)} \, \sqrt{k_\phi(v, v)}.
\end{align*}
The two equalities at the ends use $\|S(u)\|_\phi^2 = \langle S(u), S(u)\rangle_\phi = k_\phi(u, u)$ (and likewise for $v$). The bound is pointwise on $\mathcal{K} \times \mathcal{K}$ and requires no integrability.
**(b) Off-diagonal integrability under $\mu \otimes \mu$.** For $i \neq j$ the joint law of $(x^i, x^j)$ is $\mu \otimes \mu$ by the i.i.d.\ hypothesis on the $\mu$-sample. By Tonelli's theorem (applicable to the non-negative measurable function $\sqrt{k_\phi(\cdot, \cdot) k_\phi(\cdot, \cdot)}$),
\begin{align*}
\mathbb{E}\bigl[|k_\phi(x^i, x^j)|\bigr] = \int_{\mathcal{K} \times \mathcal{K}} |k_\phi(u, v)|\, d(\mu \otimes \mu)(u, v) \le \int_{\mathcal{K} \times \mathcal{K}} \sqrt{k_\phi(u, u) k_\phi(v, v)}\, d(\mu \otimes \mu)(u, v).
\end{align*}
Apply the AM--GM inequality $\sqrt{ab} \le \tfrac{1}{2}(a + b)$ with $a = k_\phi(u, u) \ge 0$ and $b = k_\phi(v, v) \ge 0$, and split the resulting integral by Tonelli:
\begin{align*}
\int_{\mathcal{K} \times \mathcal{K}} \sqrt{k_\phi(u, u) k_\phi(v, v)}\, d(\mu \otimes \mu)(u, v) &\le \frac{1}{2}\int_{\mathcal{K} \times \mathcal{K}} \bigl(k_\phi(u, u) + k_\phi(v, v)\bigr)\, d(\mu \otimes \mu)(u, v) \\
&= \frac{1}{2}\Bigl(\int_\mathcal{K} k_\phi(u, u)\, d\mu(u) + \int_\mathcal{K} k_\phi(v, v)\, d\mu(v)\Bigr) = \mathbb{E}_{x \sim \mu}[k_\phi(x, x)] < \infty,
\end{align*}
where the last inequality is the integrability hypothesis $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$. Hence $\mathbb{E}[|k_\phi(x^i, x^j)|] < \infty$ for any $i \neq j$.
**(c) Cross-term and $\nu$-term integrability.** The same Cauchy--Schwarz + AM--GM argument applied with the joint law $\mu \otimes \nu$ on $(x^i, y^j)$ gives
\begin{align*}
\mathbb{E}\bigl[|k_\phi(x^i, y^j)|\bigr] \le \frac{1}{2}\bigl(\mathbb{E}_{x \sim \mu}[k_\phi(x, x)] + \mathbb{E}_{y \sim \nu}[k_\phi(y, y)]\bigr) < \infty,
\end{align*}
and applied with $\nu \otimes \nu$ on $(y^i, y^j)$ for $i \neq j$ gives
\begin{align*}
\mathbb{E}\bigl[|k_\phi(y^i, y^j)|\bigr] \le \mathbb{E}_{y \sim \nu}[k_\phi(y, y)] < \infty.
\end{align*}
**Why integrability matters here.** Three downstream operations rely on it:
\begin{enumerate}
\item *Fubini* — to compute $\mathbb{E}[k_\phi(x^i, x^j)] = \int\int k_\phi\, d\mu\, d\mu$ as iterated integrals (Steps 3 and 4) we need integrability on the product space; we have $L^1$-integrability by the bound just established.
\item *Linearity of expectation* — interchanging $\mathbb{E}$ with the finite double sum $\sum_{i \neq j}$ is automatic for finite sums of integrable random variables.
\item *Continuous mapping / WLLN* — convergence-in-probability arguments via Chebyshev or U-statistic theorems require finite first (and ideally second) moments; we have controlled the first moment, and the same Cauchy--Schwarz bound iterated gives the second moment, $\mathbb{E}[|k_\phi(u, v)|^2] \le \mathbb{E}[k_\phi(u, u) k_\phi(v, v)]$, which is finite by Tonelli and the same hypothesis (note here we do not even need AM--GM since we are already a single product over the product measure).
\end{enumerate}
The integrability hypotheses $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ are therefore not technical decoration — they are the entry condition that makes every expectation in the rest of the proof well-defined.
[/guided]
[/step]
[step:Compute $\mathbb{E}[A_m]$ via the i.i.d. structure on the off-diagonal]
For any pair $(i,j)$ with $i \neq j$, the variables $x^i$ and $x^j$ are independent and each distributed as $\mu$, so $(x^i, x^j) \sim \mu \otimes \mu$. By Fubini,
\begin{align*}
\mathbb{E}[k_\phi(x^i, x^j)] = \int_{\mathcal{K}}\int_{\mathcal{K}} k_\phi(u, v)\, d\mu(u)\, d\mu(v) = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')].
\end{align*}
The sum $\sum_{i \neq j}$ has exactly $m(m-1)$ terms, each with the same expectation, so
\begin{align*}
\mathbb{E}[A_m] = \frac{1}{m(m-1)} \cdot m(m-1) \cdot \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')].
\end{align*}
[guided]
The crucial design choice in the unbiased estimator is the *exclusion of the diagonal* $i = j$ from the double sum. We compute $\mathbb{E}[A_m]$ in three steps: (a) identify the joint law of an off-diagonal pair, (b) reduce $\mathbb{E}[A_m]$ to a single expectation by linearity and i.i.d.-symmetry, (c) explain why the diagonal would have introduced bias.
**(a) Joint law of an off-diagonal pair.** Fix any ordered pair $(i, j)$ with $i \neq j$, $1 \le i, j \le m$. Since $(x^1, \ldots, x^m)$ is i.i.d. with law $\mu$, the marginal joint law of $(x^i, x^j)$ is the product measure $\mu \otimes \mu$ on $\mathcal{K} \times \mathcal{K}$. (For $i = j$, the joint law is the diagonal copy of $\mu$ on $\mathcal{K} \times \mathcal{K}$, which is *not* $\mu \otimes \mu$ in general.) By Fubini's theorem — applicable because $\mathbb{E}[|k_\phi(x^i, x^j)|] < \infty$ from the integrability check in Step 2 —
\begin{align*}
\mathbb{E}[k_\phi(x^i, x^j)] = \int_{\mathcal{K} \times \mathcal{K}} k_\phi(u, v)\, d(\mu \otimes \mu)(u, v) = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')].
\end{align*}
**(b) Symmetry across pairs.** This expectation does not depend on the choice of $(i, j)$ as long as $i \neq j$: every off-diagonal ordered pair has the same joint law $\mu \otimes \mu$ by the i.i.d. structure. The double sum $\sum_{i \neq j}$ ranges over ordered pairs with $i \neq j$, of which there are exactly $m(m-1)$ (we choose $i$ in $m$ ways, then $j$ in $m - 1$ remaining ways). By linearity of expectation,
\begin{align*}
\mathbb{E}[A_m] = \frac{1}{m(m-1)} \mathbb{E}\Bigl[\sum_{i \neq j} k_\phi(x^i, x^j)\Bigr] = \frac{1}{m(m-1)} \sum_{i \neq j} \mathbb{E}[k_\phi(x^i, x^j)] = \frac{m(m-1)}{m(m-1)} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')].
\end{align*}
The interchange of expectation and finite sum requires no further justification; for an infinite sum, one would invoke Fubini, but the sum here is over the finite index set $\{(i, j) : i \neq j, 1 \le i, j \le m\}$.
**(c) Why exclude the diagonal?** If we instead summed over *all* ordered pairs $(i, j)$ including $i = j$, the diagonal would contribute the term
\begin{align*}
\frac{1}{m^2}\sum_{i = 1}^m k_\phi(x^i, x^i),
\end{align*}
whose expectation is $\frac{1}{m}\mathbb{E}_{x \sim \mu}[k_\phi(x, x)]$ — the integral of $k_\phi$ on the *diagonal* of $\mathcal{K} \times \mathcal{K}$ rather than on the full product. In general $\mathbb{E}_{x \sim \mu}[k_\phi(x, x)] \neq \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$, so including the diagonal injects a bias proportional to the difference between the two. The denominator $m(m-1)$ instead of $m^2$ further fine-tunes the normalisation so that the off-diagonal sum has the right expectation. This bias-removal is the entire reason the estimator is called *unbiased*.
The same trick is used in Step 4 for $C_n$ (the $\nu$-sample), but is *not* needed for $B_{m,n}$ since the cross sum naturally has no diagonal — the indices $i$ and $j$ index *different* samples and there is no case of "the same draw" to exclude.
[/guided]
[/step]
[step:Compute $\mathbb{E}[B_{m,n}]$ and $\mathbb{E}[C_n]$ analogously]
For the cross term, every pair $(x^i, y^j)$ has joint law $\mu \otimes \nu$ by the independence of the two samples and the i.i.d. structure within each. The sum has $mn$ terms with common expectation, so
\begin{align*}
\mathbb{E}[B_{m,n}] = \frac{1}{mn} \cdot mn \cdot \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] = \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)].
\end{align*}
The same off-diagonal argument as in the previous step, applied to the $\nu$-sample, gives
\begin{align*}
\mathbb{E}[C_n] = \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')].
\end{align*}
[/step]
[step:Combine the three expectations to obtain the population squared MMD]
By linearity of expectation and the previous two steps,
\begin{align*}
\mathbb{E}[\hat{d}_\phi(\mu,\nu)^2] = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] - 2\, \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] + \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')].
\end{align*}
By the [Population Squared MMD Formula](/theorems/???), the right-hand side equals $d_\phi(\mu,\nu)^2$. This proves unbiasedness.
[/step]
[step:Establish convergence in probability of each component average]
Each of $A_m$, $B_{m,n}$, $C_n$ is an average of pairwise kernel evaluations on i.i.d. data. We show each converges in probability to its mean.
[claim:$A_m \xrightarrow{\mathbb{P}} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$ as $m \to \infty$]
[proof]
$A_m$ is a degree-$2$ U-statistic with kernel $h(u,v) = k_\phi(u, v)$ on the i.i.d. sample $(x^1, \dots, x^m) \sim \mu^{\otimes m}$. The kernel $h$ has finite first moment $\mathbb{E}|h(x, x')| < \infty$ by the integrability check above. By the [Weak Law of Large Numbers for U-Statistics](/theorems/???) (Hoeffding, Theorem 5.4), $A_m \to \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$ in probability as $m \to \infty$.
[/proof]
[/claim]
The identical argument with the kernel $k_\phi$ on the $\nu$-sample gives $C_n \xrightarrow{\mathbb{P}} \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]$ as $n \to \infty$.
For the cross term, $B_{m,n}$ is a sum of $mn$ pairwise terms on independent i.i.d. samples; this is a two-sample U-statistic. The kernel $k_\phi(x, y)$ has $\mathbb{E}|k_\phi(x, y)| < \infty$ from Step 2, and the [Weak Law of Large Numbers for Two-Sample U-Statistics](/theorems/???) gives
\begin{align*}
B_{m,n} \xrightarrow{\mathbb{P}} \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] \qquad \text{as } m, n \to \infty.
\end{align*}
[guided]
Each of the three averages $A_m$, $B_{m,n}$, $C_n$ is a *generalised U-statistic* — an average of a kernel over all pairs of distinct (or product-independent) draws — and we apply the appropriate Weak Law of Large Numbers for each. The key checks in every application are: (i) the underlying samples have the right independence structure, (ii) the kernel has finite first moment.
**(a) $A_m$ as a one-sample U-statistic.** $A_m = \frac{1}{m(m-1)}\sum_{i \neq j} k_\phi(x^i, x^j)$ is a degree-$2$ U-statistic on the i.i.d. sample $(x^1, \ldots, x^m) \sim \mu^{\otimes m}$ with kernel $h(u, v) := k_\phi(u, v)$. For the [Weak Law of Large Numbers for U-Statistics](/theorems/???) (Hoeffding, Theorem 5.4) we need:
\begin{itemize}
\item *Independence*: $(x^1, \ldots, x^m)$ are i.i.d.\ with law $\mu$ — this is a hypothesis of the theorem.
\item *Finite first moment*: $\mathbb{E}|h(x, x')| = \mathbb{E}|k_\phi(x, x')| < \infty$ — verified in Step 2 via the Cauchy--Schwarz bound $|k_\phi(u, v)| \le \sqrt{k_\phi(u, u) k_\phi(v, v)}$ and the integrability hypothesis $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$.
\end{itemize}
The conclusion is $A_m \xrightarrow{\mathbb{P}} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$ as $m \to \infty$.
**(b) $C_n$ by symmetry.** The same theorem applied to the i.i.d. sample $(y^1, \ldots, y^n) \sim \nu^{\otimes n}$ with kernel $k_\phi$ — using the integrability hypothesis $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ to verify finite first moment — gives $C_n \xrightarrow{\mathbb{P}} \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]$ as $n \to \infty$.
**(c) $B_{m, n}$ as a two-sample U-statistic.** $B_{m, n} = \frac{1}{mn}\sum_{i, j} k_\phi(x^i, y^j)$ is a degree-$(1, 1)$ two-sample U-statistic with kernel $k_\phi(u, v)$ on the independent product sample $\mu^{\otimes m} \otimes \nu^{\otimes n}$. The [Weak Law of Large Numbers for Two-Sample U-Statistics](/theorems/???) requires:
\begin{itemize}
\item *Independence*: the two samples are mutually independent — given as a hypothesis.
\item *I.i.d. within each sample*: also given.
\item *Finite first moment*: $\mathbb{E}_{x \sim \mu, y \sim \nu}|k_\phi(x, y)| < \infty$ — verified in Step 2 via $|k_\phi(x, y)| \le \sqrt{k_\phi(x, x) k_\phi(y, y)}$ and AM-GM, using both integrability hypotheses.
\end{itemize}
The conclusion is $B_{m, n} \xrightarrow{\mathbb{P}} \mathbb{E}_{x \sim \mu, y \sim \nu}[k_\phi(x, y)]$ as $m, n \to \infty$.
**Note on the mode of convergence in the two-sample case.** The two-sample WLLN gives convergence in probability under the joint law of the two independent samples, with $m$ and $n$ both tending to infinity — *not* one held fixed while the other varies. This is the standard formulation, and matches the regime in which the unbiased MMD estimator is used: both sample sizes grow.
**Why is the second moment automatically controlled?** While the WLLN only requires finite first moment, the standard proofs use a second-moment Chebyshev argument; the Cauchy--Schwarz bound on the kernel gives $|k_\phi(u, v)|^2 \le k_\phi(u, u) k_\phi(v, v)$, so the second moment of the kernel is bounded by $\mathbb{E}_\mu[k_\phi(x, x)]^2 < \infty$ (and the analogous expressions for the other two). The hypothesis "the integrability hypotheses" in the theorem is shorthand for $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$, which together give finite second moments for all three kernels via this Cauchy--Schwarz bound.
[/guided]
[/step]
[step:Conclude convergence of $\hat{d}_\phi(\mu,\nu)^2$ via the continuous mapping theorem]
The map $T : \mathbb{R}^3 \to \mathbb{R}$, $(a, b, c) \mapsto a - 2b + c$, is continuous. By the previous step,
\begin{align*}
(A_m, B_{m,n}, C_n) \xrightarrow{\mathbb{P}} \bigl(\mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')],\, \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)],\, \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]\bigr)
\end{align*}
as $m, n \to \infty$. Applying the [Continuous Mapping Theorem](/theorems/???) to $T$,
\begin{align*}
\hat{d}_\phi(\mu,\nu)^2 = T(A_m, B_{m,n}, C_n) \xrightarrow{\mathbb{P}} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] - 2\, \mathbb{E}_{x \sim \mu,\, y \sim \nu}[k_\phi(x, y)] + \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')].
\end{align*}
By the [Population Squared MMD Formula](/theorems/???), the right-hand side is $d_\phi(\mu,\nu)^2$. This completes the proof of consistency, and combined with Step 5, the theorem.
[/step]