[proofplan]
We convert the estimator into a binary test between the two product laws $Q_0=P_0^{\otimes n}$ and $Q_1=P_1^{\otimes n}$ by selecting the closer of the two target values. The separation assumption forces the estimation error to be at least $s$ whenever this induced test chooses the wrong alternative. We then prove the elementary two-point testing inequality that the sum of the two error probabilities is at least $1-\operatorname{TV}(Q_0,Q_1)$, and finally pass from the average of the two endpoint risks to the supremum risk over $\mathcal P$.
[/proofplan]
[step:Introduce endpoint notation and reduce the supremum risk to the two endpoint risks]
Define the two target points $\theta_0,\theta_1\in\Psi$ by
\begin{align*}
\theta_i:=\psi(P_i), \qquad i\in\{0,1\}.
\end{align*}
Define the endpoint risks $R_i\in[0,\infty]$ of the estimator $\hat\psi_n$ by
\begin{align*}
R_i:=\int_{\mathcal X^n} d(\hat\psi_n(x),\theta_i)\,dQ_i(x),
\qquad i\in\{0,1\}.
\end{align*}
Since $P_0,P_1\in\mathcal P$, the supremum risk dominates the larger endpoint risk:
\begin{align*}
\sup_{P\in\mathcal P}\mathbb E_{P^{\otimes n}}\!\left[d(\hat\psi_n,\psi(P))\right]
\ge \max\{R_0,R_1\}.
\end{align*}
The maximum dominates the arithmetic mean, so
\begin{align*}
\sup_{P\in\mathcal P}\mathbb E_{P^{\otimes n}}\!\left[d(\hat\psi_n,\psi(P))\right]
\ge \frac{R_0+R_1}{2}.
\end{align*}
[/step]
[step:Build the closest-target test induced by the estimator]
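Since $\hat\psi_n:\mathcal X^n\to\Psi$ is an estimator, it is $\mathcal A^{\otimes n}$-to-Borel measurable. The maps $y\mapsto d(y,\theta_0)$ and $y\mapsto d(y,\theta_1)$ from $\Psi$ to $\mathbb R$ are continuous (indeed $1$-Lipschitz, by the reverse triangle inequality) and hence Borel measurable. The compositions $x\mapsto d(\hat\psi_n(x),\theta_i)$ are therefore $\mathcal A^{\otimes n}$-measurable, so the set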
\begin{align*}
A_0:=\{x\in\mathcal X^n: d(\hat\psi_n(x),\theta_0)\le d(\hat\psi_n(x),\theta_1)\}
\end{align*}
belongs to $\mathcal A^{\otimes n}$. Define
\begin{align*}
A_1:=\mathcal X^n\setminus A_0.
\end{align*}
Then $A_1\in\mathcal A^{\otimes n}$ as well. Define the induced test $\varphi:\mathcal X^n\to\{0,1\}$ by setting $\varphi(x)=0$ for $x\in A_0$ and $\varphi(x)=1$ for $x\in A_1$. The test $\varphi$ selects the target value closer to $\hat\psi_n(x)$, with ties assigned to $0$.
On $A_1$, we have $d(\hat\psi_n(x),\theta_1)<d(\hat\psi_n(x),\theta_0)$. By the separation assumption,
\begin{align*}
2s\le d(\theta_0,\theta_1).
\end{align*}
By the triangle inequality and the defining inequality of $A_1$,
\begin{align*}
d(\theta_0,\theta_1)\le d(\theta_0,\hat\psi_n(x))+d(\hat\psi_n(x),\theta_1)<2d(\theta_0,\hat\psi_n(x)).
\end{align*}
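Thus $d(\hat\psi_n(x),\theta_0)>s$ for every $x\in A_1$. Since $d\ge 0$, discarding the contribution of $A_0$ to the integral gives
\begin{align*}
R_0
=\int_{\mathcal X^n} d(\hat\psi_n(x),\theta_0)\,dQ_0(x)
\ge \int_{A_1} d(\hat\psi_n(x),\theta_0)\,dQ_0(x)
\ge s\,Q_0(A_1).
\end{align*}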
On $A_0$, the definition of $A_0$ gives $d(\hat\psi_n(x),\theta_0)\le d(\hat\psi_n(x),\theta_1)$. Again, the separation assumption gives
\begin{align*}
2s\le d(\theta_0,\theta_1).
\end{align*}
The triangle inequality and the defining inequality of $A_0$ give
\begin{align*}
d(\theta_0,\theta_1)\le d(\theta_0,\hat\psi_n(x))+d(\hat\psi_n(x),\theta_1)\le 2d(\hat\psi_n(x),\theta_1).
\end{align*}
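Thus $d(\hat\psi_n(x),\theta_1)\ge s$ for every $x\in A_0$. Since $d\ge 0$, discarding the contribution of $A_1$ to the integral gives
\begin{align*}
R_1
=\int_{\mathcal X^n} d(\hat\psi_n(x),\theta_1)\,dQ_1(x)
\ge \int_{A_0} d(\hat\psi_n(x),\theta_1)\,dQ_1(x)
\ge s\,Q_1(A_0).
\end{align*}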
Combining the two endpoint bounds gives
\begin{align*}
R_0+R_1\ge s\bigl(Q_0(A_1)+Q_1(A_0)\bigr).
\end{align*}
[guided]
The estimator $\hat\psi_n$ takes values in the target space $\Psi$, not in the label set $\{0,1\}$. Recall that the two target points are defined by $\theta_i:=\psi(P_i)$ for $i\in\{0,1\}$. To compare estimation with testing, we turn the estimator into a test by asking which target point it is closer to. Since $\hat\psi_n:\mathcal X^n\to\Psi$ is $\mathcal A^{\otimes n}$-to-Borel measurable and the maps $y\mapsto d(y,\theta_0)$ and $y\mapsto d(y,\theta_1)$ are continuous on the [metric space](/page/Metric%20Space) $\Psi$, the set
\begin{align*}
A_0:=\{x\in\mathcal X^n: d(\hat\psi_n(x),\theta_0)\le d(\hat\psi_n(x),\theta_1)\}
\end{align*}
belongs to $\mathcal A^{\otimes n}$. Define
\begin{align*}
A_1:=\mathcal X^n\setminus A_0.
\end{align*}
Then $A_1\in\mathcal A^{\otimes n}$. Define $\varphi:\mathcal X^n\to\{0,1\}$ by $\varphi(x)=0$ on $A_0$ and $\varphi(x)=1$ on $A_1$. Thus $\varphi$ chooses model $0$ when $\hat\psi_n(x)$ is at least as close to $\theta_0$ as to $\theta_1$, and chooses model $1$ otherwise.
Why does a wrong testing decision force a large estimation error? Suppose first that the true model is $0$, but the induced test chooses $1$, so $x\in A_1$. Then
\begin{align*}
d(\hat\psi_n(x),\theta_1)<d(\hat\psi_n(x),\theta_0).
\end{align*}
The separation assumption gives
\begin{align*}
2s\le d(\theta_0,\theta_1).
\end{align*}
Using the triangle inequality between $\theta_0$, $\hat\psi_n(x)$, and $\theta_1$, and then using the displayed inequality above, we get
\begin{align*}
d(\theta_0,\theta_1)\le d(\theta_0,\hat\psi_n(x))+d(\hat\psi_n(x),\theta_1)<2d(\theta_0,\hat\psi_n(x)).
\end{align*}
Therefore $d(\hat\psi_n(x),\theta_0)>s$ on $A_1$. The integrand is nonnegative on the complement $A_0$, so restricting the integral to $A_1$ and integrating this pointwise lower bound with respect to the true law $Q_0$ gives
\begin{align*}
R_0
=\int_{\mathcal X^n} d(\hat\psi_n(x),\theta_0)\,dQ_0(x)
\ge s\,Q_0(A_1).
\end{align*}
Now suppose the true model is $1$, but the induced test chooses $0$, so $x\in A_0$. The definition of $A_0$ gives
\begin{align*}
d(\hat\psi_n(x),\theta_0)\le d(\hat\psi_n(x),\theta_1).
\end{align*}
The separation assumption gives
\begin{align*}
2s\le d(\theta_0,\theta_1).
\end{align*}
The same triangle inequality calculation gives
\begin{align*}
d(\theta_0,\theta_1)\le d(\theta_0,\hat\psi_n(x))+d(\hat\psi_n(x),\theta_1)\le 2d(\hat\psi_n(x),\theta_1).
\end{align*}
Therefore $d(\hat\psi_n(x),\theta_1)\ge s$ on $A_0$. The integrand is nonnegative on $A_1$, so restricting the integral to $A_0$ and integrating with respect to $Q_1$ yields
\begin{align*}
R_1
=\int_{\mathcal X^n} d(\hat\psi_n(x),\theta_1)\,dQ_1(x)
\ge s\,Q_1(A_0).
\end{align*}
Adding the two inequalities gives the reduction from estimation risk to testing error:
\begin{align*}
R_0+R_1\ge s\bigl(Q_0(A_1)+Q_1(A_0)\bigr).
\end{align*}
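The two pointwise bounds can also be recorded in a single display, which may be convenient when citing this reduction. This is only a compact restatement of the argument above; the indicator notation $\mathbf 1\{\cdot\}$ is introduced here for illustration and is not used elsewhere in the proof. For every $x\in\mathcal X^n$ and $i\in\{0,1\}$,
\begin{align*}
d(\hat\psi_n(x),\theta_i)\ge s\,\mathbf 1\{\varphi(x)\ne i\},
\qquad\text{so}\qquad
R_i\ge s\,Q_i(\varphi\ne i),
\end{align*}
where $\{\varphi\ne 0\}=A_1$ and $\{\varphi\ne 1\}=A_0$.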
[/guided]
[/step]
[step:Lower-bound the two testing errors by total variation]
For any measurable set $A\in\mathcal A^{\otimes n}$, the definition of total variation gives
\begin{align*}
Q_0(A)-Q_1(A)\le \operatorname{TV}(Q_0,Q_1).
\end{align*}
Apply this with $A=A_0$. Since $A_1=\mathcal X^n\setminus A_0$ and $Q_0(\mathcal X^n)=1$, finite additivity gives
\begin{align*}
Q_0(A_1)+Q_1(A_0)=1-Q_0(A_0)+Q_1(A_0).
\end{align*}
Rearranging the right-hand side gives
\begin{align*}
1-Q_0(A_0)+Q_1(A_0)=1-\bigl(Q_0(A_0)-Q_1(A_0)\bigr).
\end{align*}
Using $Q_0(A_0)-Q_1(A_0)\le \operatorname{TV}(Q_0,Q_1)$, we obtain
\begin{align*}
Q_0(A_1)+Q_1(A_0)\ge 1-\operatorname{TV}(Q_0,Q_1).
\end{align*}
Consequently,
\begin{align*}
R_0+R_1\ge s\left(1-\operatorname{TV}(Q_0,Q_1)\right).
\end{align*}
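The first display of this step depends on which normalization of total variation is in force. The following one-line justification is a sketch that assumes the supremum convention; the definition is not restated in this proof, so the convention is an assumption:
\begin{align*}
\operatorname{TV}(Q_0,Q_1):=\sup_{A\in\mathcal A^{\otimes n}}\bigl|Q_0(A)-Q_1(A)\bigr|
\quad\Longrightarrow\quad
Q_0(A)-Q_1(A)\le\bigl|Q_0(A)-Q_1(A)\bigr|\le\operatorname{TV}(Q_0,Q_1).
\end{align*}
Under this convention $\operatorname{TV}(Q_0,Q_1)\in[0,1]$, so the factor $1-\operatorname{TV}(Q_0,Q_1)$ is nonnegative.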
[/step]
[step:Combine the endpoint average with the testing lower bound]
Recall the endpoint reduction from the first step:
\begin{align*}
\sup_{P\in\mathcal P}\mathbb E_{P^{\otimes n}}\!\left[d(\hat\psi_n,\psi(P))\right]\ge \frac{R_0+R_1}{2}.
\end{align*}
Combining this with $R_0+R_1\ge s\left(1-\operatorname{TV}(Q_0,Q_1)\right)$ gives
\begin{align*}
\sup_{P\in\mathcal P}\mathbb E_{P^{\otimes n}}\!\left[d(\hat\psi_n,\psi(P))\right]\ge \frac{s}{2}\left(1-\operatorname{TV}(Q_0,Q_1)\right).
\end{align*}
Substituting $Q_i=P_i^{\otimes n}$ for $i\in\{0,1\}$ gives
\begin{align*}
\sup_{P\in\mathcal P}\mathbb E_{P^{\otimes n}}\!\left[d(\hat\psi_n,\psi(P))\right]
\ge
\frac{s}{2}\left(1-\operatorname{TV}(P_0^{\otimes n},P_1^{\otimes n})\right).
\end{align*}
This is the desired lower bound for the arbitrary estimator $\hat\psi_n$.
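As a sanity check, the bound can be evaluated by hand in a toy case. The Bernoulli model, the choice $n=1$, and the numerical values below are hypothetical and serve only as an illustration; they are not part of the statement being proved. Take $P_i=\operatorname{Bernoulli}(p_i)$, $\psi(P_i)=p_i$, $d(a,b)=|a-b|$, and $s=|p_0-p_1|/2$, so that the separation assumption holds with equality. For a single observation, and with $\operatorname{TV}(P_0,P_1)=\sup_A|P_0(A)-P_1(A)|$, one has $\operatorname{TV}(P_0,P_1)=|p_0-p_1|=2s$, and the lower bound reads
\begin{align*}
\frac{s}{2}\bigl(1-2s\bigr),
\qquad\text{e.g.}\qquad
p_0=0.4,\ p_1=0.6\ \Longrightarrow\ s=0.1,\quad \frac{0.1}{2}\,(1-0.2)=0.04.
\end{align*}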
[/step]