[proofplan]
We reduce the minimax risk over $\Theta$ to the Bayes risk over the finite hypercube $\psi(\{0,1\}^m)$ under the uniform prior. Given an arbitrary estimator of $\theta$, we decode it to a vertex of the cube by nearest-neighbour projection, and the separation condition converts metric loss into Hamming loss. The Hamming risk splits into coordinatewise bit-estimation risks. Each coordinatewise problem is a binary testing problem, whose average error is bounded below by one minus total variation.
[/proofplan]
[step:Reduce the minimax risk to the uniform risk on the embedded hypercube]
Fix a measurable estimator $\hat{\theta}:\mathcal X \to \Theta$. Since $\psi(\{0,1\}^m) \subset \Theta$, restricting the supremum to this finite subset gives
\begin{align*}
\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
\max_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]
\geq
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr],
\end{align*}
where $\mathbb{E}_v$ denotes expectation with respect to $P_v=P_{\psi(v)}$.
[guided]
The minimax risk is the largest risk over all parameters in $\Theta$. Since the finite set $\psi(\{0,1\}^m)$ is contained in $\Theta$, the supremum over $\Theta$ is at least the maximum over this finite embedded cube. For the fixed measurable estimator $\hat\theta:\mathcal X\to\Theta$, this gives
\begin{align*}
\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
\max_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr],
\end{align*}
where $\mathbb E_v$ means expectation with respect to the distribution $P_v=P_{\psi(v)}$ from the statistical experiment.
The next move replaces a maximum by an average under the uniform prior on the cube. This can only decrease the quantity, because the average of finitely many [real numbers](/page/Real%20Numbers) is bounded above by their maximum. Therefore
\begin{align*}
\max_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]
\geq
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr].
\end{align*}
This reduction is useful because the uniform average over the cube can later be decomposed coordinate by coordinate.
[/guided]
[/step]
[step:Decode the estimator to a nearest vertex of the cube]
Define a decoder $\hat{v}:\mathcal X \to \{0,1\}^m$ as follows. For each $w \in \{0,1\}^m$, define the measurable function $r_w:\mathcal X \to [0,\infty)$ by
\begin{align*}
r_w(x)=\rho(\hat{\theta}(x),\psi(w)).
\end{align*}
Here $\mathcal B(\Theta)$ denotes the Borel $\sigma$-algebra generated by the metric $\rho$ on $\Theta$. The function $r_w$ is measurable because $\hat{\theta}:(\mathcal X,\mathcal A)\to(\Theta,\mathcal B(\Theta))$ is measurable and $\theta\mapsto \rho(\theta,\psi(w))$ is continuous on the [metric space](/page/Metric%20Space) $(\Theta,\rho)$. For each $x \in \mathcal X$, choose $\hat{v}(x)$ to be the lexicographically first minimizer of $w \mapsto r_w(x)$ over the finite set $\{0,1\}^m$.
For each $w \in \{0,1\}^m$, the event $\{x\in\mathcal X:\hat v(x)=w\}$ is a finite intersection of sets of the form $\{r_w\le r_u\}$ and $\{r_w<r_u\}$, according to whether $u$ is lexicographically after or before $w$. Hence $\hat v:(\mathcal X,\mathcal A)\to\{0,1\}^m$ is measurable when $\{0,1\}^m$ is equipped with the discrete $\sigma$-algebra. Then, for every $x \in \mathcal X$ and every $v \in \{0,1\}^m$,
\begin{align*}
\rho(\hat{\theta}(x),\psi(\hat{v}(x))) \leq \rho(\hat{\theta}(x),\psi(v)).
\end{align*}
By the triangle inequality for $\rho$,
\begin{align*}
\rho(\psi(\hat{v}(x)),\psi(v))\leq \rho(\psi(\hat{v}(x)),\hat{\theta}(x))+\rho(\hat{\theta}(x),\psi(v)).
\end{align*}
Using symmetry of the metric and the nearest-neighbour inequality on the first term gives
\begin{align*}
\rho(\psi(\hat{v}(x)),\psi(v))\leq 2\rho(\hat{\theta}(x),\psi(v)).
\end{align*}
The assumed hypercube separation gives
\begin{align*}
2s\,d_H(\hat{v}(x),v)
\leq \rho(\psi(\hat{v}(x)),\psi(v)).
\end{align*}
Combining the two inequalities yields
\begin{align*}
\rho(\hat{\theta}(x),\psi(v)) \geq s\,d_H(\hat{v}(x),v).
\end{align*}
[guided]
The purpose of the decoder is to turn an arbitrary estimator taking values in $\Theta$ into an estimator of the hidden vertex $v \in \{0,1\}^m$. For each $w \in \{0,1\}^m$, define $r_w:\mathcal X\to[0,\infty)$ by
\begin{align*}
r_w(x)=\rho(\hat{\theta}(x),\psi(w)).
\end{align*}
Let $\mathcal B(\Theta)$ denote the Borel $\sigma$-algebra generated by the metric $\rho$ on $\Theta$. This function is measurable because $\hat\theta:(\mathcal X,\mathcal A)\to(\Theta,\mathcal B(\Theta))$ is measurable, while $\theta\mapsto\rho(\theta,\psi(w))$ is continuous. Since the cube is finite, the minimum of $w\mapsto r_w(x)$ over $w\in\{0,1\}^m$ exists for each observation $x\in\mathcal X$. We choose the lexicographically first minimizer to make the choice single-valued.
We also need the decoder to be measurable, because later we will take probabilities of events involving $\hat v_j$. For a fixed $w\in\{0,1\}^m$, the event $\{x\in\mathcal X:\hat v(x)=w\}$ is described by finitely many comparisons between the [measurable functions](/page/Measurable%20Functions) $r_w$ and $r_u$: if $u$ is lexicographically after $w$, require $r_w\le r_u$, and if $u$ is lexicographically before $w$, require $r_w<r_u$. Sets of the form $\{r_w\le r_u\}$ and $\{r_w<r_u\}$ are measurable, so each fibre of $\hat v$ is measurable. Therefore $\hat v:(\mathcal X,\mathcal A)\to\{0,1\}^m$ is measurable for the discrete $\sigma$-algebra on the finite cube.
By definition of this nearest-neighbour choice, for every fixed true vertex $v \in \{0,1\}^m$,
\begin{align*}
\rho(\hat{\theta}(x),\psi(\hat{v}(x))) \leq \rho(\hat{\theta}(x),\psi(v)).
\end{align*}
Now we compare the two cube vertices $\psi(\hat{v}(x))$ and $\psi(v)$. The triangle inequality in the metric space $(\Theta,\rho)$ gives
\begin{align*}
\rho(\psi(\hat{v}(x)),\psi(v))\leq \rho(\psi(\hat{v}(x)),\hat{\theta}(x))+\rho(\hat{\theta}(x),\psi(v)).
\end{align*}
By symmetry of $\rho$, the first term equals $\rho(\hat{\theta}(x),\psi(\hat{v}(x)))$. The nearest-neighbour inequality then gives
\begin{align*}
\rho(\psi(\hat{v}(x)),\psi(v))\leq 2\rho(\hat{\theta}(x),\psi(v)).
\end{align*}
The hypercube separation hypothesis says that metric distance between embedded vertices dominates Hamming distance:
\begin{align*}
\rho(\psi(\hat{v}(x)),\psi(v)) \geq 2s\,d_H(\hat{v}(x),v).
\end{align*}
Combining the last two displays and dividing by $2$ gives
\begin{align*}
\rho(\hat{\theta}(x),\psi(v)) \geq s\,d_H(\hat{v}(x),v).
\end{align*}
Thus every unit error in the decoded Hamming distance costs at least $s$ in the original metric loss.
[/guided]
[/step]
[step:Decompose the Bayes Hamming risk into coordinatewise bit errors]
For every $v \in \{0,1\}^m$, taking expectation under $P_v$ in the previous pointwise inequality gives
\begin{align*}
\mathbb{E}_v\bigl[\rho(\hat{\theta},\psi(v))\bigr]
\geq
s\,\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr].
\end{align*}
Since
\begin{align*}
d_H(\hat{v},v)=\sum_{j=1}^m \mathbb{1}_{\{\hat{v}_j \neq v_j\}},
\end{align*}
linearity of expectation gives
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr]
=
\sum_{j=1}^m
2^{-m}\sum_{v \in \{0,1\}^m}
P_v(\hat{v}_j \neq v_j).
\end{align*}
Therefore
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]
\geq
s\sum_{j=1}^m
2^{-m}\sum_{v \in \{0,1\}^m}
P_v(\hat{v}_j \neq v_j).
\end{align*}
[guided]
The decoder step gave a pointwise comparison between metric loss and Hamming loss: for every observation $x\in\mathcal X$ and every vertex $v\in\{0,1\}^m$,
\begin{align*}
\rho(\hat{\theta}(x),\psi(v)) \geq s\,d_H(\hat v(x),v).
\end{align*}
Both sides are non-negative measurable functions of $x$, because $\hat\theta$ and $\hat v$ are measurable. Taking expectation with respect to $P_v$ preserves the inequality and gives
\begin{align*}
\mathbb{E}_v\bigl[\rho(\hat{\theta},\psi(v))\bigr]
\geq
s\,\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr].
\end{align*}
Now we expand the Hamming distance into coordinate errors. For vertices in $\{0,1\}^m$, the Hamming distance is exactly the number of coordinates on which the two vertices differ, so
\begin{align*}
d_H(\hat{v},v)=\sum_{j=1}^m \mathbb{1}_{\{\hat{v}_j \neq v_j\}}.
\end{align*}
Linearity of expectation applies to this finite sum and yields
\begin{align*}
\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr]
=
\sum_{j=1}^m P_v(\hat{v}_j \neq v_j).
\end{align*}
Averaging this identity over the uniform prior on $\{0,1\}^m$ and interchanging the two finite sums gives
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr]
=
\sum_{j=1}^m
2^{-m}\sum_{v \in \{0,1\}^m}
P_v(\hat{v}_j \neq v_j).
\end{align*}
Combining this with the expected metric-loss lower bound gives
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]
\geq
s\sum_{j=1}^m
2^{-m}\sum_{v \in \{0,1\}^m}
P_v(\hat{v}_j \neq v_j).
\end{align*}
Thus the Bayes metric risk is reduced to the sum of $m$ coordinatewise bit-estimation risks.
[/guided]
[/step]
[step:Bound each coordinatewise bit risk by total variation]
Fix $j \in \{1,\dots,m\}$. Let $\mathcal V_j^0 := \{v \in \{0,1\}^m : v_j = 0\}$. Define the measurable set
\begin{align*}
A_{j} := \{x \in \mathcal X : \hat{v}_j(x)=1\}.
\end{align*}
The set $A_j$ belongs to $\mathcal A$ because $\hat v$ is measurable and the coordinate projection $w\mapsto w_j$ from the finite cube to $\{0,1\}$ is measurable.
Then
\begin{align*}
P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j)
=
P_v(A_j)+P_{v^{(j)}}(A_j^c).
\end{align*}
Using the definition of [total variation distance](/page/Total%20Variation) as
\begin{align*}
\operatorname{TV}(P,Q)=\sup_{A \in \mathcal A}|P(A)-Q(A)|,
\end{align*}
we first compute
\begin{align*}
P_v(A_j)+P_{v^{(j)}}(A_j^c)=1-\bigl(P_{v^{(j)}}(A_j)-P_v(A_j)\bigr).
\end{align*}
Since $P_{v^{(j)}}(A_j)-P_v(A_j) \leq \operatorname{TV}(P_v,P_{v^{(j)}})$, it follows that
\begin{align*}
P_v(A_j)+P_{v^{(j)}}(A_j^c) \geq 1-\operatorname{TV}(P_v,P_{v^{(j)}}).
\end{align*}
Averaging over all pairs with $v_j=0$ gives
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j)
=
2^{-m}\sum_{v \in \mathcal V_j^0}
\left[
P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j)
\right].
\end{align*}
Applying the preceding total-variation bound to each pair gives
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j)
\geq
2^{-m}\sum_{v \in \mathcal V_j^0}
\left[1-\operatorname{TV}(P_v,P_{v^{(j)}})\right].
\end{align*}
Since the map $v \mapsto v^{(j)}$ pairs each vertex with its $j$th-bit flip and total variation is symmetric,
\begin{align*}
2^{-m}\sum_{v \in \mathcal V_j^0}\operatorname{TV}(P_v,P_{v^{(j)}})
=
\frac{1}{2}\alpha_j.
\end{align*}
Also $|\mathcal V_j^0|=2^{m-1}$. Hence
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j)
\geq
\frac{1-\alpha_j}{2}.
\end{align*}
[guided]
Fix a coordinate $j\in\{1,\dots,m\}$. We compare vertices in pairs that differ only in the $j$th coordinate. Let
\begin{align*}
\mathcal V_j^0 := \{v \in \{0,1\}^m : v_j = 0\}
\end{align*}
be the lower half of the cube in coordinate $j$, and define the measurable decision event
\begin{align*}
A_j := \{x \in \mathcal X : \hat v_j(x)=1\}.
\end{align*}
The set $A_j$ lies in $\mathcal A$ because $\hat v$ is measurable and the coordinate projection $w\mapsto w_j$ from the finite cube to $\{0,1\}$ is measurable.
For $v\in\mathcal V_j^0$, the estimator makes a $j$th-coordinate error under $P_v$ exactly on $A_j$, while it makes a $j$th-coordinate error under $P_{v^{(j)}}$ exactly on $A_j^c$. Therefore
\begin{align*}
P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j)
=
P_v(A_j)+P_{v^{(j)}}(A_j^c).
\end{align*}
Using $P_{v^{(j)}}(A_j^c)=1-P_{v^{(j)}}(A_j)$, we compute
\begin{align*}
P_v(A_j)+P_{v^{(j)}}(A_j^c)=1-\bigl(P_{v^{(j)}}(A_j)-P_v(A_j)\bigr).
\end{align*}
By the definition of [total variation distance](/page/Total%20Variation),
\begin{align*}
P_{v^{(j)}}(A_j)-P_v(A_j) \leq \operatorname{TV}(P_v,P_{v^{(j)}}),
\end{align*}
so
\begin{align*}
P_v(A_j)+P_{v^{(j)}}(A_j^c) \geq 1-\operatorname{TV}(P_v,P_{v^{(j)}}).
\end{align*}
This is the binary testing lower bound in the only form needed here, derived directly from the definition of total variation.
Now average over all pairs with $v_j=0$. Each vertex of the cube appears exactly once in the paired sum, either as $v$ or as $v^{(j)}$, so
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j)
=
2^{-m}\sum_{v \in \mathcal V_j^0}
\left[
P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j)
\right].
\end{align*}
Applying the pairwise bound gives
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j)
\geq
2^{-m}\sum_{v \in \mathcal V_j^0}
\left[1-\operatorname{TV}(P_v,P_{v^{(j)}})\right].
\end{align*}
Since $|\mathcal V_j^0|=2^{m-1}$, the constant part contributes $2^{-m}2^{m-1}=1/2$. Also, the map $v\mapsto v^{(j)}$ pairs the cube into unordered pairs and total variation is symmetric, so
\begin{align*}
2^{-m}\sum_{v \in \mathcal V_j^0}\operatorname{TV}(P_v,P_{v^{(j)}})
=
\frac{1}{2}\alpha_j.
\end{align*}
Consequently
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j)
\geq
\frac{1-\alpha_j}{2}.
\end{align*}
[/guided]
[/step]
[step:Combine the coordinate bounds and take the infimum over estimators]
Combining the preceding steps, every measurable estimator $\hat{\theta}:\mathcal X \to \Theta$ satisfies
\begin{align*}
\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr].
\end{align*}
The decoded Hamming-risk lower bound gives
\begin{align*}
\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
s\sum_{j=1}^m
2^{-m}\sum_{v \in \{0,1\}^m}
P_v(\hat{v}_j \neq v_j).
\end{align*}
The coordinatewise total-variation bound gives
\begin{align*}
\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
\frac{s}{2}\sum_{j=1}^m(1-\alpha_j).
\end{align*}
Taking the infimum over all admissible measurable estimators $\hat{\theta}$ gives
\begin{align*}
\inf_{\hat{\theta}}\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
\frac{s}{2}\sum_{j=1}^m(1-\alpha_j),
\end{align*}
which is the asserted lower bound.
[guided]
We now assemble the inequalities proved in the preceding steps for an arbitrary measurable estimator $\hat\theta:\mathcal X\to\Theta$. The reduction to the uniform cube risk gave
\begin{align*}
\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr].
\end{align*}
The decoding argument and the decomposition of Hamming distance into coordinate errors then gave
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]
\geq
s\sum_{j=1}^m
2^{-m}\sum_{v \in \{0,1\}^m}
P_v(\hat{v}_j \neq v_j).
\end{align*}
Finally, for each coordinate $j$, the total-variation bound gives
\begin{align*}
2^{-m}\sum_{v \in \{0,1\}^m}
P_v(\hat{v}_j \neq v_j)
\geq
\frac{1-\alpha_j}{2}.
\end{align*}
Substituting this coordinatewise lower bound into the preceding display yields
\begin{align*}
\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
\frac{s}{2}\sum_{j=1}^m(1-\alpha_j).
\end{align*}
Because the estimator $\hat\theta$ was arbitrary, taking the infimum over all admissible measurable estimators preserves the lower bound:
\begin{align*}
\inf_{\hat{\theta}}\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr]
\geq
\frac{s}{2}\sum_{j=1}^m(1-\alpha_j).
\end{align*}
This is exactly the claimed Assouad lower bound.
[/guided]
[/step]