Androma — The Home of Mathematics on the Internet


[guided]Before computing any expectations we must verify that the kernel evaluations being averaged are integrable. Without this, neither linearity-of-expectation manipulations nor Fubini's theorem are licensed. The two ingredients are the universal Cauchy--Schwarz bound, valid for any positive-definite kernel, and the diagonal integrability hypotheses $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$, which are part of the theorem's standing assumptions.

**(a) The reproducing-kernel Cauchy--Schwarz bound.** Since $k_\phi$ is a positive-definite kernel arising from the inner product on the Hilbert space $T_\phi((V))$, the [Cauchy--Schwarz inequality for reproducing kernels](/theorems/???) gives, for all $u, v \in \mathcal{K}$,
\begin{align*}
|k_\phi(u, v)| = |\langle S(u), S(v)\rangle_\phi| \le \|S(u)\|_\phi \, \|S(v)\|_\phi = \sqrt{k_\phi(u, u)} \, \sqrt{k_\phi(v, v)}.
\end{align*}
The first equality is the definition $k_\phi(u, v) = \langle S(u), S(v)\rangle_\phi$; the last uses $\|S(u)\|_\phi^2 = \langle S(u), S(u)\rangle_\phi = k_\phi(u, u)$ (and likewise for $v$). The bound is pointwise on $\mathcal{K} \times \mathcal{K}$ and requires no integrability.

**(b) Off-diagonal integrability under $\mu \otimes \mu$.** For $i \neq j$ the joint law of $(x^i, x^j)$ is $\mu \otimes \mu$ by the i.i.d.\ hypothesis on the $\mu$-sample. Integrating the pointwise bound from (a) against $\mu \otimes \mu$ (both sides are non-negative measurable functions of $(u, v)$, so the integrals are well defined in $[0, \infty]$) gives
\begin{align*}
\mathbb{E}\bigl[|k_\phi(x^i, x^j)|\bigr] = \int_{\mathcal{K} \times \mathcal{K}} |k_\phi(u, v)|\, d(\mu \otimes \mu)(u, v) \le \int_{\mathcal{K} \times \mathcal{K}} \sqrt{k_\phi(u, u) k_\phi(v, v)}\, d(\mu \otimes \mu)(u, v).
\end{align*}
Apply the AM--GM inequality $\sqrt{ab} \le \tfrac{1}{2}(a + b)$ with $a = k_\phi(u, u) \ge 0$ and $b = k_\phi(v, v) \ge 0$, and split the resulting integral by Tonelli's theorem:
\begin{align*}
\int_{\mathcal{K} \times \mathcal{K}} \sqrt{k_\phi(u, u) k_\phi(v, v)}\, d(\mu \otimes \mu)(u, v) &\le \frac{1}{2}\int_{\mathcal{K} \times \mathcal{K}} \bigl(k_\phi(u, u) + k_\phi(v, v)\bigr)\, d(\mu \otimes \mu)(u, v) \\
&= \frac{1}{2}\Bigl(\int_\mathcal{K} k_\phi(u, u)\, d\mu(u) + \int_\mathcal{K} k_\phi(v, v)\, d\mu(v)\Bigr) = \mathbb{E}_{x \sim \mu}[k_\phi(x, x)] < \infty,
\end{align*}
where the last inequality is the integrability hypothesis $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$. Hence $\mathbb{E}[|k_\phi(x^i, x^j)|] < \infty$ for any $i \neq j$.

**(c) Cross-term and $\nu$-term integrability.** The same Cauchy--Schwarz and AM--GM argument applied with the joint law $\mu \otimes \nu$ on $(x^i, y^j)$ gives
\begin{align*}
\mathbb{E}\bigl[|k_\phi(x^i, y^j)|\bigr] \le \frac{1}{2}\bigl(\mathbb{E}_{x \sim \mu}[k_\phi(x, x)] + \mathbb{E}_{y \sim \nu}[k_\phi(y, y)]\bigr) < \infty,
\end{align*}
and applied with $\nu \otimes \nu$ on $(y^i, y^j)$ for $i \neq j$ gives
\begin{align*}
\mathbb{E}\bigl[|k_\phi(y^i, y^j)|\bigr] \le \mathbb{E}_{y \sim \nu}[k_\phi(y, y)] < \infty.
\end{align*}

**Why integrability matters here.** Three downstream operations rely on it:
\begin{enumerate}
\item *Fubini*: to compute $\mathbb{E}[k_\phi(x^i, x^j)] = \int\int k_\phi\, d\mu\, d\mu$ as an iterated integral (Steps 3 and 4) we need integrability on the product space, which is exactly the $L^1$ bound just established.
\item *Linearity of expectation*: interchanging $\mathbb{E}$ with the finite double sum $\sum_{i \neq j}$ is automatic for finite sums of integrable random variables.
\item *Continuous mapping / WLLN*: convergence-in-probability arguments via Chebyshev or U-statistic theorems require a finite first (and ideally second) moment. We have controlled the first moment, and squaring the Cauchy--Schwarz bound gives, for independent $u, v \sim \mu$, $\mathbb{E}[|k_\phi(u, v)|^2] \le \mathbb{E}[k_\phi(u, u)\, k_\phi(v, v)] = \mathbb{E}_\mu[k_\phi(x, x)]^2 < \infty$, where the equality is Tonelli's theorem on the product measure. AM--GM is not needed here because the integrand already factorises over the two coordinates.
\end{enumerate}
The integrability hypotheses $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ are therefore not technical decoration: they are the entry condition that makes every expectation in the rest of the proof well defined.[/guided]
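The pointwise chain $|k(u, v)| \le \sqrt{k(u, u)\, k(v, v)} \le \tfrac{1}{2}(k(u, u) + k(v, v))$ used above can be checked numerically. The following is a minimal sketch, assuming a polynomial kernel `poly_kernel` as a stand-in positive-definite kernel with unbounded diagonal; it is not the kernel $k_\phi$ of the theorem, and all names are hypothetical. It also checks the sample analogue of the conclusion of (b): the off-diagonal average of $|k|$ is at most the diagonal average.

```python
import math
import random


def poly_kernel(u, v):
    """Stand-in positive-definite kernel k(u, v) = (1 + <u, v>)^2.

    Assumption for illustration only; this is NOT the kernel k_phi of the text.
    """
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** 2


def check_bounds(points, kernel, tol=1e-9):
    """Check |k(u,v)| <= sqrt(k(u,u) k(v,v)) <= (k(u,u) + k(v,v)) / 2 on all pairs."""
    for u in points:
        for v in points:
            kuv = abs(kernel(u, v))
            cs = math.sqrt(kernel(u, u) * kernel(v, v))   # Cauchy-Schwarz bound
            amgm = 0.5 * (kernel(u, u) + kernel(v, v))    # AM-GM bound
            if kuv > cs + tol * (1.0 + cs) or cs > amgm + tol * (1.0 + amgm):
                return False
    return True


rng = random.Random(0)
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50)]
m = len(points)

bounds_hold = check_bounds(points, poly_kernel)

# Sample analogue of E|k(x, x')| <= E k(x, x).
mean_diag = sum(poly_kernel(p, p) for p in points) / m
mean_off = sum(
    abs(poly_kernel(points[i], points[j]))
    for i in range(m) for j in range(m) if i != j
) / (m * (m - 1))
```

The second check is deterministic rather than statistical: summing the AM--GM bound over all ordered pairs $i \neq j$ counts each diagonal value $k(x^i, x^i)$ exactly $m - 1$ times, which mirrors the Tonelli split in (b).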


[guided]The crucial design choice in the unbiased estimator is the *exclusion of the diagonal* $i = j$ from the double sum. We compute $\mathbb{E}[A_m]$ in three steps: (a) identify the joint law of an off-diagonal pair, (b) reduce $\mathbb{E}[A_m]$ to a single expectation by linearity and i.i.d.\ symmetry, (c) explain why the diagonal would have introduced bias.

**(a) Joint law of an off-diagonal pair.** Fix any ordered pair $(i, j)$ with $i \neq j$, $1 \le i, j \le m$. Since $(x^1, \ldots, x^m)$ is i.i.d.\ with law $\mu$, the joint law of $(x^i, x^j)$ is the product measure $\mu \otimes \mu$ on $\mathcal{K} \times \mathcal{K}$. (For $i = j$, the joint law is the image of $\mu$ under the diagonal map $u \mapsto (u, u)$, which is *not* $\mu \otimes \mu$ in general.) Because $\mathbb{E}[|k_\phi(x^i, x^j)|] < \infty$ by the integrability check in Step 2, the expectation is well defined and Fubini's theorem applies:
\begin{align*}
\mathbb{E}[k_\phi(x^i, x^j)] = \int_{\mathcal{K} \times \mathcal{K}} k_\phi(u, v)\, d(\mu \otimes \mu)(u, v) = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')].
\end{align*}

**(b) Symmetry across pairs.** This expectation does not depend on the choice of $(i, j)$ as long as $i \neq j$: every off-diagonal ordered pair has the same joint law $\mu \otimes \mu$ by the i.i.d.\ structure. The double sum $\sum_{i \neq j}$ ranges over ordered pairs with $i \neq j$, of which there are exactly $m(m-1)$ (we choose $i$ in $m$ ways, then $j$ in $m - 1$ remaining ways). By linearity of expectation,
\begin{align*}
\mathbb{E}[A_m] = \frac{1}{m(m-1)} \mathbb{E}\Bigl[\sum_{i \neq j} k_\phi(x^i, x^j)\Bigr] = \frac{1}{m(m-1)} \sum_{i \neq j} \mathbb{E}[k_\phi(x^i, x^j)] = \frac{m(m-1)}{m(m-1)} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')] = \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')].
\end{align*}
The interchange of expectation and finite sum requires no further justification; for an infinite sum, one would invoke Fubini, but the sum here is over the finite index set $\{(i, j) : i \neq j, 1 \le i, j \le m\}$.

**(c) Why exclude the diagonal?** If we instead averaged over *all* $m^2$ ordered pairs $(i, j)$, including $i = j$, the diagonal would contribute the term
\begin{align*}
\frac{1}{m^2}\sum_{i = 1}^m k_\phi(x^i, x^i),
\end{align*}
whose expectation is $\frac{1}{m}\mathbb{E}_{x \sim \mu}[k_\phi(x, x)]$: an integral of $k_\phi$ over the *diagonal* of $\mathcal{K} \times \mathcal{K}$ rather than against the product measure. The off-diagonal part would contribute $\frac{m(m-1)}{m^2}\mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$, so the all-pairs average has bias $\frac{1}{m}\bigl(\mathbb{E}_{x \sim \mu}[k_\phi(x, x)] - \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]\bigr)$, which is non-zero in general because $\mathbb{E}_{x \sim \mu}[k_\phi(x, x)] \neq \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$. Dropping the diagonal and normalising by $m(m-1)$, the number of off-diagonal ordered pairs, removes this bias exactly. This is the reason the estimator is called *unbiased*. The same trick is used in Step 4 for $C_n$ (the $\nu$-sample), but is *not* needed for $B_{m,n}$, since the cross sum has no diagonal: the indices $i$ and $j$ index *different* samples, and there is no case of "the same draw" to exclude.[/guided]
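The effect of the diagonal can be seen exactly on a toy example. The sketch below is an illustration under stated assumptions, not the setting of the theorem: it takes a hypothetical three-point law on the integers, a stand-in kernel $k(u, v) = (1 + uv)^2$, and sample size $m = 3$, then enumerates every possible sample with exact rational arithmetic to compare the expectation of the off-diagonal average (normalised by $m(m-1)$) with that of the all-pairs average (normalised by $m^2$).

```python
from fractions import Fraction
from itertools import product


def kernel(u, v):
    """Stand-in positive-definite kernel on integers (assumption, not k_phi)."""
    return (1 + u * v) ** 2


# Hypothetical discrete law mu: P(0) = 1/2, P(1) = 1/4, P(2) = 1/4.
support = [0, 1, 2]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
m = 3  # sample size


def expect_over_samples(statistic):
    """Exact expectation of statistic(x^1..x^m) over all i.i.d. samples from mu."""
    total = Fraction(0)
    for idx in product(range(len(support)), repeat=m):
        weight = Fraction(1)
        for i in idx:
            weight *= probs[i]
        total += weight * statistic([support[i] for i in idx])
    return total


def u_stat(xs):
    """Off-diagonal average: sum over i != j, divided by m(m-1)."""
    s = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    return Fraction(s, m * (m - 1))


def v_stat(xs):
    """All-pairs average: sum over all (i, j), divided by m^2."""
    s = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m))
    return Fraction(s, m * m)


# Population quantities E[k(x, x')] (independent copies) and E[k(x, x)].
off_diag = sum(p * q * kernel(u, v)
               for u, p in zip(support, probs)
               for v, q in zip(support, probs))
diag = sum(p * kernel(u, u) for u, p in zip(support, probs))

mean_u = expect_over_samples(u_stat)  # expectation of the off-diagonal average
mean_v = expect_over_samples(v_stat)  # expectation of the all-pairs average
```

In this toy setting `mean_u` coincides with `off_diag`, while `mean_v - off_diag` equals `(diag - off_diag) / m`, matching the $1/m$ diagonal contribution described in (c).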


[guided]Each of the three averages $A_m$, $B_{m,n}$, $C_n$ is a *generalised U-statistic*, that is, an average of a kernel over all pairs of distinct (or product-independent) draws, and we apply the appropriate Weak Law of Large Numbers to each. The key checks in every application are: (i) the underlying samples have the right independence structure, (ii) the kernel has finite first moment.

**(a) $A_m$ as a one-sample U-statistic.** $A_m = \frac{1}{m(m-1)}\sum_{i \neq j} k_\phi(x^i, x^j)$ is a degree-$2$ U-statistic on the i.i.d.\ sample $(x^1, \ldots, x^m) \sim \mu^{\otimes m}$ with kernel $h(u, v) := k_\phi(u, v)$. For the [Weak Law of Large Numbers for U-Statistics](/theorems/???) (Hoeffding, Theorem 5.4) we need:
\begin{itemize}
\item *Independence*: $(x^1, \ldots, x^m)$ are i.i.d.\ with law $\mu$; this is a hypothesis of the theorem.
\item *Finite first moment*: $\mathbb{E}|h(x, x')| = \mathbb{E}|k_\phi(x, x')| < \infty$, verified in Step 2 via the Cauchy--Schwarz bound $|k_\phi(u, v)| \le \sqrt{k_\phi(u, u) k_\phi(v, v)}$ and the integrability hypothesis $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$.
\end{itemize}
The conclusion is $A_m \xrightarrow{\mathbb{P}} \mathbb{E}_{x, x' \sim \mu}[k_\phi(x, x')]$ as $m \to \infty$.

**(b) $C_n$ by symmetry.** The same theorem applied to the i.i.d.\ sample $(y^1, \ldots, y^n) \sim \nu^{\otimes n}$ with kernel $k_\phi$, using the integrability hypothesis $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$ to verify the finite first moment, gives $C_n \xrightarrow{\mathbb{P}} \mathbb{E}_{y, y' \sim \nu}[k_\phi(y, y')]$ as $n \to \infty$.

**(c) $B_{m, n}$ as a two-sample U-statistic.** $B_{m, n} = \frac{1}{mn}\sum_{i, j} k_\phi(x^i, y^j)$ is a degree-$(1, 1)$ two-sample U-statistic with kernel $k_\phi(u, v)$ on the independent product sample $\mu^{\otimes m} \otimes \nu^{\otimes n}$. The [Weak Law of Large Numbers for Two-Sample U-Statistics](/theorems/???) requires:
\begin{itemize}
\item *Independence*: the two samples are mutually independent, which is given as a hypothesis.
\item *I.i.d.\ within each sample*: also given.
\item *Finite first moment*: $\mathbb{E}_{x \sim \mu, y \sim \nu}|k_\phi(x, y)| < \infty$, verified in Step 2 via $|k_\phi(x, y)| \le \sqrt{k_\phi(x, x) k_\phi(y, y)}$ and AM--GM, using both integrability hypotheses.
\end{itemize}
The conclusion is $B_{m, n} \xrightarrow{\mathbb{P}} \mathbb{E}_{x \sim \mu, y \sim \nu}[k_\phi(x, y)]$ as $m, n \to \infty$.

**Note on the mode of convergence in the two-sample case.** The two-sample WLLN gives convergence in probability under the joint law of the two independent samples, with $m$ and $n$ both tending to infinity, *not* with one held fixed while the other varies. This is the standard formulation, and it matches the regime in which the unbiased MMD estimator is used: both sample sizes grow.

**Why is the second moment automatically controlled?** While the WLLN only requires a finite first moment, the standard proofs use a second-moment Chebyshev argument. The Cauchy--Schwarz bound on the kernel gives $|k_\phi(u, v)|^2 \le k_\phi(u, u) k_\phi(v, v)$, so by independence the second moment of the kernel under $\mu \otimes \mu$ is bounded by $\mathbb{E}_\mu[k_\phi(x, x)]^2 < \infty$, and under $\mu \otimes \nu$ and $\nu \otimes \nu$ by $\mathbb{E}_\mu[k_\phi(x, x)]\, \mathbb{E}_\nu[k_\phi(y, y)]$ and $\mathbb{E}_\nu[k_\phi(y, y)]^2$ respectively. The phrase "the integrability hypotheses" in the theorem is shorthand for $\mathbb{E}_\mu[k_\phi(x, x)] < \infty$ and $\mathbb{E}_\nu[k_\phi(y, y)] < \infty$, which together give finite second moments of the kernel under all three product laws via this Cauchy--Schwarz bound.[/guided]
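The three averages can be simulated to see the convergence at work. This is a rough sketch under assumptions that are not part of the theorem: a Gaussian RBF kernel `rbf` on the real line stands in for $k_\phi$, and $\mu = N(0, 1)$, $\nu = N(1, 1)$ are hypothetical choices made because the limits are then available in closed form. For $d \sim N(\delta, 2)$ one has $\mathbb{E}[e^{-d^2/2}] = e^{-\delta^2/6}/\sqrt{3}$, which gives the two limit values computed at the end.

```python
import math
import random


def rbf(u, v):
    """Stand-in bounded kernel exp(-(u - v)^2 / 2) (assumption, not k_phi)."""
    return math.exp(-0.5 * (u - v) ** 2)


def one_sample_u(xs, kernel):
    """Degree-2 U-statistic: average of kernel over ordered pairs with i != j."""
    m = len(xs)
    total = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))


def two_sample_u(xs, ys, kernel):
    """Degree-(1, 1) two-sample U-statistic: average over all cross pairs."""
    total = sum(kernel(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


rng = random.Random(0)
m = n = 300
xs = [rng.gauss(0.0, 1.0) for _ in range(m)]  # sample from mu = N(0, 1)
ys = [rng.gauss(1.0, 1.0) for _ in range(n)]  # sample from nu = N(1, 1)

A_m = one_sample_u(xs, rbf)
C_n = one_sample_u(ys, rbf)
B_mn = two_sample_u(xs, ys, rbf)

# Closed-form limits for this particular stand-in kernel and these Gaussian laws.
limit_same = 1.0 / math.sqrt(3.0)                      # E k(x, x') and E k(y, y')
limit_cross = math.exp(-1.0 / 6.0) / math.sqrt(3.0)    # E k(x, y)
```

With both sample sizes growing, `A_m` and `C_n` should approach `limit_same` and `B_mn` should approach `limit_cross`; a bounded kernel was chosen here only to keep the simulation variance small, so the integrability hypotheses hold trivially in this example.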
