[proofplan]
The strategy is a bounded-differences concentration argument applied to $\hat{d}_\phi(\mu,\nu)^2$ viewed as a function of the $2m$ independent samples. Under the null $H_0 : \mu = \nu$ the estimator has mean zero (by [Unbiased MMD Estimator](/theorems/2521) combined with [MMD is a Metric under Characteristicness](/theorems/2522)). We bound the maximum change of the estimator when a single sample is replaced — this gives explicit bounded-differences constants $c_k = 8M/m$ for each of the $2m$ coordinates. McDiarmid's inequality then converts the per-coordinate bounds into an exponential tail bound, and inverting that bound at level $\alpha$ produces the rejection threshold $(8M/\sqrt{m}) \sqrt{\log(2/\alpha)}$.
[/proofplan]
[step:Reformulate $\hat{d}_\phi(\mu,\nu)^2$ as a function of $2m$ independent samples]
Define
\begin{align*}
F : \mathcal{K}^{2m} &\to \mathbb{R}, \\
(x_1, \dots, x_m, y_1, \dots, y_m) &\mapsto \frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(x_i, x_j) - \frac{2}{m^2} \sum_{i,j=1}^m k_\phi(x_i, y_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(y_i, y_j).
\end{align*}
Then $\hat{d}_\phi(\mu, \nu)^2 = F(X_1, \dots, X_m, Y_1, \dots, Y_m)$, and the random vector $Z = (X_1, \dots, X_m, Y_1, \dots, Y_m)$ has independent coordinates: the $X$-block is i.i.d. $\mu$, the $Y$-block is i.i.d. $\nu$, and the two blocks are independent of each other by hypothesis.
[/step]
[step:Verify $\mathbb{E}_{H_0}[\hat{d}_\phi(\mu, \nu)^2] = 0$ under the null]
Under $H_0$, $\mu = \nu$. By [Unbiased MMD Estimator](/theorems/2521), $\mathbb{E}[\hat{d}_\phi(\mu, \nu)^2] = d_\phi(\mu, \nu)^2$. By [MMD is a Metric under Characteristicness](/theorems/2522), $d_\phi(\mu, \nu) = 0$ when $\mu = \nu$. Hence
\begin{align*}
\mathbb{E}_{H_0}[\hat{d}_\phi(\mu, \nu)^2] = 0.
\end{align*}
[/step]
[step:Compute the bounded-differences constant for replacing a single $X$-coordinate]
Fix $k \in \{1, \dots, m\}$ and let $Z' = (X_1, \dots, X_{k-1}, X_k', X_{k+1}, \dots, X_m, Y_1, \dots, Y_m)$ differ from $Z$ only in the $k$-th $X$-coordinate. We bound
\begin{align*}
|F(Z) - F(Z')| \le c_k.
\end{align*}
Replacing $X_k$ affects only those terms in $F$ that involve $X_k$ — namely the terms $k_\phi(X_k, X_j)$ and $k_\phi(X_j, X_k)$ in the first sum, and $k_\phi(X_k, Y_j)$ in the cross sum.
**First sum (within-$X$ off-diagonal):** The pairs containing index $k$ are $\{(k, j) : j \neq k\} \cup \{(j, k) : j \neq k\}$, totalling $2(m-1)$ ordered pairs. Each $k_\phi$ value lies in $[-M, M]$, so the change in any single term is at most $2M$. Hence
\begin{align*}
\left| \frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(X_i, X_j) - \frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(X_i', X_j') \right| \le \frac{2(m-1) \cdot 2M}{m(m-1)} = \frac{4M}{m},
\end{align*}
where we use the convention $X_i' = X_i$ for $i \neq k$.
**Cross sum:** The terms containing $X_k$ are $\{(k, j) : 1 \le j \le m\}$, totalling $m$ pairs, each shifted by at most $2M$. The cross term carries a coefficient of $-2/m^2$, so
\begin{align*}
\left| -\frac{2}{m^2} \sum_{i, j} k_\phi(X_i, Y_j) + \frac{2}{m^2} \sum_{i, j} k_\phi(X_i', Y_j) \right| \le \frac{2 \cdot m \cdot 2M}{m^2} = \frac{4M}{m}.
\end{align*}
**Within-$Y$ sum:** Unchanged.
By the triangle inequality,
\begin{align*}
|F(Z) - F(Z')| \le \frac{4M}{m} + \frac{4M}{m} = \frac{8M}{m}.
\end{align*}
Thus $c_k = 8M/m$ for $k \in \{1, \dots, m\}$ (the $X$-coordinates).
[guided]
McDiarmid's inequality requires a coordinate-wise bound: for each $k$, we need a constant $c_k$ such that flipping $X_k \to X_k'$ (holding all other coordinates fixed) changes $F$ by at most $c_k$. This number controls the variance proxy in the resulting Gaussian tail; getting it wrong by even a constant factor changes the threshold. So we do the bookkeeping carefully.
Fix $k \in \{1, \dots, m\}$ and define $Z' := (X_1, \dots, X_{k-1}, X_k', X_{k+1}, \dots, X_m, Y_1, \dots, Y_m)$ — same as $Z$ except in position $k$. Recall
\begin{align*}
F(Z) = \underbrace{\frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(X_i, X_j)}_{\text{within-}X} - \underbrace{\frac{2}{m^2} \sum_{i,j=1}^m k_\phi(X_i, Y_j)}_{\text{cross}} + \underbrace{\frac{1}{m(m-1)} \sum_{i \neq j} k_\phi(Y_i, Y_j)}_{\text{within-}Y}.
\end{align*}
The within-$Y$ term does not depend on $X_k$, so it contributes $0$ to $|F(Z) - F(Z')|$. We bound the other two.
**Within-$X$ term.** It is $\frac{1}{m(m-1)}$ times a sum over the $m(m-1)$ ordered off-diagonal pairs. The pairs that involve index $k$ are $(k, j)$ for $j \neq k$ ($m-1$ pairs) and $(j, k)$ for $j \neq k$ (another $m-1$ pairs), giving $2(m-1)$ pairs. Each $k_\phi$ value lies in $[-M, M]$ (by hypothesis $|k_\phi| \le M$), so each term shifts by at most $|k_\phi(\cdot,\cdot) - k_\phi(\cdot,\cdot)| \le 2M$. By the triangle inequality on these $2(m-1)$ terms,
\begin{align*}
\Delta_{\text{within-}X} \le \frac{1}{m(m-1)} \cdot 2(m-1) \cdot 2M = \frac{4M}{m}.
\end{align*}
**Cross term.** It is $-\frac{2}{m^2}$ times a sum over the $m^2$ ordered pairs $(i, j)$. The pairs involving $X_k$ are $(k, j)$ for $1 \le j \le m$ — exactly $m$ pairs. Each shifts by at most $2M$, so
\begin{align*}
\Delta_{\text{cross}} \le \frac{2}{m^2} \cdot m \cdot 2M = \frac{4M}{m}.
\end{align*}
Adding via the triangle inequality,
\begin{align*}
|F(Z) - F(Z')| \le 0 + \frac{4M}{m} + \frac{4M}{m} = \frac{8M}{m}.
\end{align*}
Hence $c_k = 8M/m$ for the $X$-coordinates.
A useful sanity check: the within-$X$ and cross contributions come out equal precisely because the $1/(m(m-1))$ within-coefficient compensates for the smaller pair count $2(m-1)$, while the $2/m^2$ cross-coefficient compensates for the larger pair count $m$. This balance is by design — the unbiased MMD estimator's coefficients are chosen to make these cancellations clean.
[/guided]
[/step]
[step:Compute the bounded-differences constant for replacing a single $Y$-coordinate]
By symmetry of $F$ in the $X$ and $Y$ blocks (with the cross term being symmetric in $X$ and $Y$, since changing $Y_\ell$ shifts $m$ pairs $\{(i, \ell)\}$ in the cross sum and $2(m-1)$ pairs $\{(\ell, j), (j, \ell)\}$ in the within-$Y$ sum, by the same argument), the bounded-differences constant for the $\ell$-th $Y$-coordinate is also
\begin{align*}
c_{m + \ell} = \frac{8M}{m} \qquad \text{for } \ell \in \{1, \dots, m\}.
\end{align*}
[guided]
We do the bookkeeping carefully because off-by-constant errors in the bounded-differences constants propagate to the threshold. The crucial observation is the structure of $F$ as a sum of three terms, and the role of each term's coefficient.
For replacing $Y_\ell$ by $Y_\ell'$: the only terms in $F$ that change are those in the cross sum (mixed $X$, $Y$) and the within-$Y$ off-diagonal sum.
**Cross sum.** It is $-\frac{2}{m^2}$ times a sum of $m^2$ pairs $(i, j)$. Replacing $Y_\ell$ affects the $m$ pairs $\{(i, \ell) : 1 \le i \le m\}$, each shifted by at most $|k_\phi(X_i, Y_\ell) - k_\phi(X_i, Y_\ell')| \le 2M$. By the triangle inequality,
\begin{align*}
\Delta_{\text{cross}} \le \frac{2}{m^2} \cdot m \cdot 2M = \frac{4M}{m}.
\end{align*}
**Within-$Y$ sum.** It is $\frac{1}{m(m-1)}$ times a sum of $m(m-1)$ ordered pairs $(i, j)$ with $i \neq j$. The pairs containing index $\ell$ are $(\ell, j)$ for $j \neq \ell$ ($m-1$ pairs) and $(j, \ell)$ for $j \neq \ell$ (another $m-1$ pairs), giving $2(m-1)$ total. Each shifted by at most $2M$:
\begin{align*}
\Delta_{\text{within-}Y} \le \frac{1}{m(m-1)} \cdot 2(m-1) \cdot 2M = \frac{4M}{m}.
\end{align*}
The first term (within-$X$) does not depend on $Y_\ell$, so it contributes $0$.
Adding,
\begin{align*}
|F(Z) - F(Z')| \le 0 + \frac{4M}{m} + \frac{4M}{m} = \frac{8M}{m}.
\end{align*}
This matches the constant for the $X$-coordinates by the symmetric structure of $F$. So all $2m$ bounded-differences constants are equal: $c_k = 8M/m$.
[/guided]
[/step]
[step:Apply McDiarmid's inequality to obtain a Gaussian tail bound]
The vector $Z$ has $2m$ independent coordinates, and we have established the bounded-differences condition $|F(Z) - F(Z')| \le c_k = 8M/m$ for every coordinate replacement. We invoke [McDiarmid's Inequality](/theorems/???) (the bounded-differences concentration inequality) in its two-sided form: if $F$ satisfies the bounded-differences condition with constants $(c_k)_{k=1}^{N}$ on $N$ independent coordinates, then for every $t > 0$,
\begin{align*}
\mathbb{P}\!\left( \left| F(Z) - \mathbb{E}[F(Z)] \right| \ge t \right) \le 2 \exp\!\left( -\frac{2 t^2}{\sum_{k=1}^{N} c_k^2} \right).
\end{align*}
We apply this with $N = 2m$ and $c_k = 8M/m$ for all $k$, which gives
\begin{align*}
\sum_{k=1}^{2m} c_k^2 = 2m \cdot \left( \frac{8M}{m} \right)^2 = \frac{128 M^2}{m}.
\end{align*}
Substituting,
\begin{align*}
\mathbb{P}\!\left( \left| F(Z) - \mathbb{E}[F(Z)] \right| \ge t \right) \le 2 \exp\!\left( -\frac{2 t^2}{128 M^2 / m} \right) = 2 \exp\!\left( -\frac{m t^2}{64 M^2} \right).
\end{align*}
Under $H_0$, Step 2 gives $\mathbb{E}[F(Z)] = 0$, so dropping the lower tail (which can only weaken the bound) yields
\begin{align*}
\mathbb{P}_{H_0}\!\left( \hat{d}_\phi(\mu, \nu)^2 \ge t \right) \le \mathbb{P}_{H_0}\!\left( |\hat{d}_\phi(\mu, \nu)^2| \ge t \right) \le 2 \exp\!\left( -\frac{m t^2}{64 M^2} \right).
\end{align*}
[guided]
The point of Steps 1-4 was to show that $F$ has the bounded-differences property: each of its $2m$ coordinates can shift the value by at most $8M/m$. We now cash this in for a Gaussian tail.
We invoke [McDiarmid's Inequality](/theorems/???) (also called the bounded-differences inequality). Its hypotheses, which we verify:
- **Independent coordinates.** $Z = (X_1, \dots, X_m, Y_1, \dots, Y_m)$ has $2m$ coordinates. The $X$-block is i.i.d. $\mu$, the $Y$-block is i.i.d. $\nu$, and the two blocks are independent of each other (Step 1). So all $2m$ coordinates of $Z$ are mutually independent.
- **Bounded differences.** For each $k \in \{1, \dots, 2m\}$, replacing the $k$-th coordinate while holding the rest fixed changes $F$ by at most $c_k$. Steps 3 and 4 established $c_k = 8M/m$ uniformly.
Under these hypotheses, McDiarmid's inequality (two-sided form) gives, for every $t > 0$,
\begin{align*}
\mathbb{P}\!\left( \left| F(Z) - \mathbb{E}[F(Z)] \right| \ge t \right) \le 2 \exp\!\left( -\frac{2 t^2}{\sum_{k=1}^{2m} c_k^2} \right).
\end{align*}
Plugging in $c_k = 8M/m$ for all $k$ and computing the sum:
\begin{align*}
\sum_{k=1}^{2m} c_k^2 = 2m \cdot \left( \frac{8M}{m} \right)^2 = 2m \cdot \frac{64 M^2}{m^2} = \frac{128 M^2}{m}.
\end{align*}
The exponent becomes
\begin{align*}
-\frac{2 t^2}{128 M^2 / m} = -\frac{m t^2}{64 M^2},
\end{align*}
giving
\begin{align*}
\mathbb{P}\!\left( \left| F(Z) - \mathbb{E}[F(Z)] \right| \ge t \right) \le 2 \exp\!\left( -\frac{m t^2}{64 M^2} \right).
\end{align*}
Now we specialise to the null. Step 2 established that under $H_0$ (i.e. $\mu = \nu$), $\mathbb{E}_{H_0}[F(Z)] = \mathbb{E}_{H_0}[\hat{d}_\phi(\mu,\nu)^2] = 0$. Substituting,
\begin{align*}
\mathbb{P}_{H_0}\!\left( \left| \hat{d}_\phi(\mu, \nu)^2 \right| \ge t \right) \le 2 \exp\!\left( -\frac{m t^2}{64 M^2} \right).
\end{align*}
For deriving an upper-tail threshold we use $\{\hat{d}_\phi^2 \ge t\} \subseteq \{|\hat{d}_\phi^2| \ge t\}$, which only weakens the inequality:
\begin{align*}
\mathbb{P}_{H_0}\!\left( \hat{d}_\phi(\mu, \nu)^2 \ge t \right) \le 2 \exp\!\left( -\frac{m t^2}{64 M^2} \right).
\end{align*}
The factor of $1/m$ in the exponent is the source of the $1/\sqrt{m}$ rate in the threshold below — concentration improves with sample size at the parametric rate.
[/guided]
[/step]
[step:Invert the tail bound at level $\alpha$ to derive the threshold]
Set the right-hand side equal to $\alpha$ and solve for $t$:
\begin{align*}
2 \exp\!\left( -\frac{m t^2}{64 M^2} \right) = \alpha
\quad\Longleftrightarrow\quad
\exp\!\left( -\frac{m t^2}{64 M^2} \right) = \frac{\alpha}{2}
\quad\Longleftrightarrow\quad
\frac{m t^2}{64 M^2} = \log\!\left( \frac{2}{\alpha} \right).
\end{align*}
Solving for $t$ and taking the positive root,
\begin{align*}
t^* = \frac{8 M}{\sqrt{m}} \sqrt{\log(2/\alpha)}.
\end{align*}
This is the level-$\alpha$ threshold appearing in the theorem statement.
[guided]
We pause to verify the algebra by direct substitution, since the threshold formula is the load-bearing computation of the theorem. With $t^* = (8M/\sqrt{m}) \sqrt{\log(2/\alpha)}$,
\begin{align*}
(t^*)^2 = \frac{64 M^2}{m} \log(2/\alpha),
\end{align*}
so
\begin{align*}
\frac{m (t^*)^2}{64 M^2} = \log(2/\alpha).
\end{align*}
Therefore
\begin{align*}
2 \exp\!\left( -\frac{m (t^*)^2}{64 M^2} \right) = 2 \exp(-\log(2/\alpha)) = 2 \cdot \frac{\alpha}{2} = \alpha.
\end{align*}
The substitution recovers $\alpha$ exactly, confirming the threshold formula.
[/guided]
[/step]
[step:Verify that the test attains level at most $\alpha$]
The test rejects $H_0$ when $\hat{d}_\phi(\mu, \nu)^2 \ge t^* = (8M/\sqrt{m}) \sqrt{\log(2/\alpha)}$. Combining the tail bound from Step 5 with the choice of $t^*$ from Step 6,
\begin{align*}
\mathbb{P}_{H_0}\!\left( \text{reject } H_0 \right) = \mathbb{P}_{H_0}\!\left( \hat{d}_\phi(\mu, \nu)^2 \ge t^* \right) \le 2 \exp\!\left( -\frac{m (t^*)^2}{64 M^2} \right) = \alpha.
\end{align*}
This is the level of the test, so the test has level at most $\alpha$, completing the proof.
[/step]