[proofplan]
We reduce estimation over $\Theta$ to testing among the finite packing points $\theta_1,\dots,\theta_M$. Given an estimator $\hat{\theta}$, we project its output to the nearest packing point and thereby obtain an index estimator $\hat V$. The $2s$-separation implies that an estimation error smaller than $s$ forces correct testing, so the average estimation risk dominates $s$ times the testing error probability. [Fano's inequality](/theorems/1654) controls this testing error from below, and the assumed average Kullback-Leibler bound controls the mutual information through the mixture distribution.
[/proofplan]
[step:Place the uniform prior on the packing and define the induced testing problem]
Let $V:(\Omega,\mathcal F)\to\{1,\dots,M\}$ denote a uniformly distributed random index on the underlying probability space, so
\begin{align*}
\mathbb P(V=j)=\frac{1}{M}
\end{align*}
for every $j\in\{1,\dots,M\}$. Conditional on the event $\{V=j\}$, let the data random element $X:(\Omega,\mathcal F)\to(\mathcal X,\mathcal A)$ have distribution $P_{\theta_j}$. Thus the joint law of $(V,X)$ is determined by
\begin{align*}
\mathbb P(V=j,\ X\in A)=\frac{1}{M}P_{\theta_j}(A)
\end{align*}
for every $j\in\{1,\dots,M\}$ and every $A\in\mathcal A$.
Fix a measurable estimator $\hat{\theta}:\mathcal X\to\Theta$. Define the finite-set nearest-neighbour index estimator $\hat V:\mathcal X\to\{1,\dots,M\}$ by choosing the smallest minimising index:
\begin{align*}
\hat V(x):=\min\left\{j\in\{1,\dots,M\}: d(\hat{\theta}(x),\theta_j)=\min_{1\leq k\leq M}d(\hat{\theta}(x),\theta_k)\right\}.
\end{align*}
Since the minimum is taken over a finite set and $d(\hat{\theta}(\cdot),\theta_j)$ is measurable for each $j$, the map $\hat V$ is measurable.
[/step]
[step:Convert small estimation error into correct identification]
For every outcome $\omega\in\Omega$, if
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s,
\end{align*}
then $\hat V(X(\omega))=V(\omega)$. Indeed, if $k\neq V(\omega)$, the packing assumption and the triangle inequality give
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_k)\geq d(\theta_{V(\omega)},\theta_k)-d(\hat{\theta}(X(\omega)),\theta_{V(\omega)}).
\end{align*}
The packing assumption gives $d(\theta_{V(\omega)},\theta_k)\geq 2s$, while the assumed error bound gives $d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s$. Hence
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_k)>s.
\end{align*}
Since also $d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s$, we have
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_k)>d(\hat{\theta}(X(\omega)),\theta_{V(\omega)}).
\end{align*}
Hence $\theta_{V(\omega)}$ is the unique nearest packing point to $\hat{\theta}(X(\omega))$, so the tie-breaking rule gives $\hat V(X(\omega))=V(\omega)$. Taking the contrapositive yields
\begin{align*}
\{\hat V(X)\neq V\}\subseteq \{d(\hat{\theta}(X),\theta_V)\geq s\}.
\end{align*}
Therefore, pointwise on $\Omega$,
\begin{align*}
d(\hat{\theta}(X),\theta_V)\geq s\,\mathbb 1_{\{\hat V(X)\neq V\}}.
\end{align*}
[guided]
The purpose of the nearest-neighbour construction is to turn an estimator of the parameter into a test of which packing point generated the data. The separation assumption is exactly what makes this reduction quantitative.
Fix an outcome $\omega\in\Omega$ and suppose that the estimator is within distance $s$ of the true packing point:
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s.
\end{align*}
We compare the distance from $\hat{\theta}(X(\omega))$ to the true packing point with its distance to any other packing point. Let $k\in\{1,\dots,M\}$ with $k\neq V(\omega)$. Since the packing points are $2s$-separated,
\begin{align*}
d(\theta_{V(\omega)},\theta_k)\geq 2s.
\end{align*}
The triangle inequality in the form $d(a,c)\geq d(b,c)-d(a,b)$, applied with $a=\hat{\theta}(X(\omega))$, $b=\theta_{V(\omega)}$, and $c=\theta_k$, gives
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_k)\geq d(\theta_{V(\omega)},\theta_k)-d(\hat{\theta}(X(\omega)),\theta_{V(\omega)}).
\end{align*}
Combining $d(\theta_{V(\omega)},\theta_k)\geq 2s$ with $d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s$ yields
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_k)>s.
\end{align*}
At the same time, the assumed small-error event says
\begin{align*}
d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s.
\end{align*}
Thus every incorrect packing point $\theta_k$ is farther from $\hat{\theta}(X(\omega))$ than the true packing point $\theta_{V(\omega)}$. Consequently the nearest-neighbour rule must select the true index:
\begin{align*}
\hat V(X(\omega))=V(\omega).
\end{align*}
Equivalently, if the induced test makes an error, then the estimator cannot have been within distance $s$ of the true parameter. Hence
\begin{align*}
\{\hat V(X)\neq V\}\subseteq \{d(\hat{\theta}(X),\theta_V)\geq s\}.
\end{align*}
Multiplying the indicator of the testing error by $s$ gives the pointwise lower bound
\begin{align*}
d(\hat{\theta}(X),\theta_V)\geq s\,\mathbb 1_{\{\hat V(X)\neq V\}}.
\end{align*}
This is the central [testing-to-estimation reduction](/theorems/5895): a lower bound on testing error becomes a lower bound on metric estimation risk.
[/guided]
[/step]
[step:Average the pointwise bound under the finite prior]
Taking expectation with respect to the joint law of $(V,X)$ in the pointwise inequality gives
\begin{align*}
\mathbb E[d(\hat{\theta}(X),\theta_V)]
\geq s\,\mathbb P(\hat V(X)\neq V).
\end{align*}
By the definition of the joint law,
\begin{align*}
\mathbb E[d(\hat{\theta}(X),\theta_V)]
=
\frac{1}{M}\sum_{j=1}^{M}\mathbb E_{\theta_j}[d(\hat{\theta},\theta_j)].
\end{align*}
Since the supremum over $\Theta$ dominates the average over the finite subset $\{\theta_1,\dots,\theta_M\}$,
\begin{align*}
\sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat{\theta},\theta)]
\geq
\frac{1}{M}\sum_{j=1}^{M}\mathbb E_{\theta_j}[d(\hat{\theta},\theta_j)]
\geq
s\,\mathbb P(\hat V(X)\neq V).
\end{align*}
[/step]
[step:Bound the mutual information by the average divergence to $Q$]
Let $\bar P$ be the mixture probability measure on $(\mathcal X,\mathcal A)$ defined by
\begin{align*}
\bar P(A):=\frac{1}{M}\sum_{j=1}^{M}P_{\theta_j}(A)
\end{align*}
for every $A\in\mathcal A$. The mutual information between $V$ and $X$ is
\begin{align*}
I(V;X)=\frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|\bar P).
\end{align*}
We claim that
\begin{align*}
I(V;X)\leq \frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q).
\end{align*}
If the right-hand side is $+\infty$, the inequality is immediate. Otherwise each $P_{\theta_j}$ is absolutely continuous with respect to $Q$. Let $\mu$ be the probability measure
\begin{align*}
\mu:=Q+\sum_{j=1}^{M}P_{\theta_j}
\end{align*}
renormalised by the positive constant $M+1$, and let $p_j:\mathcal X\to[0,\infty)$, $\bar p:\mathcal X\to[0,\infty)$, and $q:\mathcal X\to[0,\infty)$ be Radon-Nikodym densities of $P_{\theta_j}$, $\bar P$, and $Q$ with respect to $\mu$. Since $\bar p=M^{-1}\sum_{j=1}^{M}p_j$, direct expansion of the relative entropies gives
\begin{align*}
\frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q)=\frac{1}{M}\sum_{j=1}^{M}\int_{\mathcal X}p_j(x)\log\frac{p_j(x)}{q(x)}\,d\mu(x).
\end{align*}
Using $p_j/q=(p_j/\bar p)(\bar p/q)$ on the set where the densities are positive, with the standard extended-value convention for relative entropy, this becomes
\begin{align*}
\frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q)=\frac{1}{M}\sum_{j=1}^{M}\int_{\mathcal X}p_j(x)\log\frac{p_j(x)}{\bar p(x)}\,d\mu(x)+\int_{\mathcal X}\bar p(x)\log\frac{\bar p(x)}{q(x)}\,d\mu(x).
\end{align*}
Therefore
\begin{align*}
\frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q)=I(V;X)+D(\bar P\|Q).
\end{align*}
Since $D(\bar P\|Q)\geq 0$, the claimed inequality follows. Therefore the hypothesis implies
\begin{align*}
I(V;X)\leq \alpha\log M.
\end{align*}
[/step]
[step:Apply Fano's inequality to the induced test]
We apply Fano's inequality to the uniformly distributed index $V\in\{1,\dots,M\}$ and the estimator $\hat V(X)$ of $V$ (citing a result not yet in the wiki: Fano's inequality). Its hypotheses are satisfied because $M\geq 2$, $V$ is uniform on a finite set of cardinality $M$, and $\hat V(X)$ takes values in the same finite set. It gives
\begin{align*}
\mathbb P(\hat V(X)\neq V)
\geq
1-\frac{I(V;X)+\log 2}{\log M}.
\end{align*}
Using the bound $I(V;X)\leq \alpha\log M$, we obtain
\begin{align*}
\mathbb P(\hat V(X)\neq V)
\geq
1-\alpha-\frac{\log 2}{\log M}.
\end{align*}
[/step]
[step:Combine the testing and estimation bounds]
Substituting the Fano lower bound into the average-risk inequality first gives
\begin{align*}
\sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat{\theta},\theta)]\geq s\,\mathbb P(\hat V(X)\neq V).
\end{align*}
Using $\mathbb P(\hat V(X)\neq V)\geq 1-\alpha-\frac{\log 2}{\log M}$, we obtain
\begin{align*}
\sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat{\theta},\theta)]\geq s\left(1-\alpha-\frac{\log 2}{\log M}\right).
\end{align*}
The estimator $\hat{\theta}:\mathcal X\to\Theta$ was arbitrary, so this lower bound holds for every measurable estimator. This proves the claimed minimax testing-to-estimation reduction.
[/step]