Assouad's Lemma — Statement & Proof

Assouad's Lemma (Theorem # 5906)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We reduce the minimax risk over $\Theta$ to the Bayes risk over the finite hypercube $\psi(\{0,1\}^m)$ under the uniform prior. Given an arbitrary estimator of $\theta$, we decode it to a vertex of the cube by nearest-neighbour projection, and the separation condition converts metric loss into Hamming loss. The Hamming risk splits into coordinatewise bit-estimation risks. Each coordinatewise problem is a binary testing problem, whose average error is bounded below by one minus total variation. [/proofplan] [step:Reduce the minimax risk to the uniform risk on the embedded hypercube] Fix a measurable estimator $\hat{\theta}:\mathcal X \to \Theta$. Since $\psi(\{0,1\}^m) \subset \Theta$, restricting the supremum to this finite subset gives \begin{align*} \sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq \max_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr] \geq 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr], \end{align*} where $\mathbb{E}_v$ denotes expectation with respect to $P_v=P_{\psi(v)}$. [guided] The minimax risk is the largest risk over all parameters in $\Theta$. Since the finite set $\psi(\{0,1\}^m)$ is contained in $\Theta$, the supremum over $\Theta$ is at least the maximum over this finite embedded cube. For the fixed measurable estimator $\hat\theta:\mathcal X\to\Theta$, this gives \begin{align*} \sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq \max_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr], \end{align*} where $\mathbb E_v$ means expectation with respect to the distribution $P_v=P_{\psi(v)}$ from the statistical experiment. The next move replaces a maximum by an average under the uniform prior on the cube. This can only decrease the quantity, because the average of finitely many [real numbers](/page/Real%20Numbers) is bounded above by their maximum. Therefore \begin{align*} \max_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr] \geq 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]. \end{align*} This reduction is useful because the uniform average over the cube can later be decomposed coordinate by coordinate. [/guided] [/step] [step:Decode the estimator to a nearest vertex of the cube] Define a decoder $\hat{v}:\mathcal X \to \{0,1\}^m$ as follows. For each $w \in \{0,1\}^m$, define the measurable function $r_w:\mathcal X \to [0,\infty)$ by \begin{align*} r_w(x)=\rho(\hat{\theta}(x),\psi(w)). \end{align*} Here $\mathcal B(\Theta)$ denotes the Borel $\sigma$-algebra generated by the metric $\rho$ on $\Theta$. The function $r_w$ is measurable because $\hat{\theta}:(\mathcal X,\mathcal A)\to(\Theta,\mathcal B(\Theta))$ is measurable and $\theta\mapsto \rho(\theta,\psi(w))$ is continuous on the [metric space](/page/Metric%20Space) $(\Theta,\rho)$. For each $x \in \mathcal X$, choose $\hat{v}(x)$ to be the lexicographically first minimizer of $w \mapsto r_w(x)$ over the finite set $\{0,1\}^m$. For each $w \in \{0,1\}^m$, the event $\{x\in\mathcal X:\hat v(x)=w\}$ is a finite intersection of sets of the form $\{r_w\le r_u\}$ and $\{r_w<r_u\}$, according to whether $u$ is lexicographically after or before $w$. Hence $\hat v:(\mathcal X,\mathcal A)\to\{0,1\}^m$ is measurable when $\{0,1\}^m$ is equipped with the discrete $\sigma$-algebra. Then, for every $x \in \mathcal X$ and every $v \in \{0,1\}^m$, \begin{align*} \rho(\hat{\theta}(x),\psi(\hat{v}(x))) \leq \rho(\hat{\theta}(x),\psi(v)). \end{align*} By the triangle inequality for $\rho$, \begin{align*} \rho(\psi(\hat{v}(x)),\psi(v))\leq \rho(\psi(\hat{v}(x)),\hat{\theta}(x))+\rho(\hat{\theta}(x),\psi(v)). \end{align*} Using symmetry of the metric and the nearest-neighbour inequality on the first term gives \begin{align*} \rho(\psi(\hat{v}(x)),\psi(v))\leq 2\rho(\hat{\theta}(x),\psi(v)). \end{align*} The assumed hypercube separation gives \begin{align*} 2s\,d_H(\hat{v}(x),v) \leq \rho(\psi(\hat{v}(x)),\psi(v)). \end{align*} Combining the two inequalities yields \begin{align*} \rho(\hat{\theta}(x),\psi(v)) \geq s\,d_H(\hat{v}(x),v). \end{align*} [guided] The purpose of the decoder is to turn an arbitrary estimator taking values in $\Theta$ into an estimator of the hidden vertex $v \in \{0,1\}^m$. For each $w \in \{0,1\}^m$, define $r_w:\mathcal X\to[0,\infty)$ by \begin{align*} r_w(x)=\rho(\hat{\theta}(x),\psi(w)). \end{align*} Let $\mathcal B(\Theta)$ denote the Borel $\sigma$-algebra generated by the metric $\rho$ on $\Theta$. This function is measurable because $\hat\theta:(\mathcal X,\mathcal A)\to(\Theta,\mathcal B(\Theta))$ is measurable, while $\theta\mapsto\rho(\theta,\psi(w))$ is continuous. Since the cube is finite, the minimum of $w\mapsto r_w(x)$ over $w\in\{0,1\}^m$ exists for each observation $x\in\mathcal X$. We choose the lexicographically first minimizer to make the choice single-valued. We also need the decoder to be measurable, because later we will take probabilities of events involving $\hat v_j$. For a fixed $w\in\{0,1\}^m$, the event $\{x\in\mathcal X:\hat v(x)=w\}$ is described by finitely many comparisons between the [measurable functions](/page/Measurable%20Functions) $r_w$ and $r_u$: if $u$ is lexicographically after $w$, require $r_w\le r_u$, and if $u$ is lexicographically before $w$, require $r_w<r_u$. Sets of the form $\{r_w\le r_u\}$ and $\{r_w<r_u\}$ are measurable, so each fibre of $\hat v$ is measurable. Therefore $\hat v:(\mathcal X,\mathcal A)\to\{0,1\}^m$ is measurable for the discrete $\sigma$-algebra on the finite cube. By definition of this nearest-neighbour choice, for every fixed true vertex $v \in \{0,1\}^m$, \begin{align*} \rho(\hat{\theta}(x),\psi(\hat{v}(x))) \leq \rho(\hat{\theta}(x),\psi(v)). \end{align*} Now we compare the two cube vertices $\psi(\hat{v}(x))$ and $\psi(v)$. The triangle inequality in the metric space $(\Theta,\rho)$ gives \begin{align*} \rho(\psi(\hat{v}(x)),\psi(v))\leq \rho(\psi(\hat{v}(x)),\hat{\theta}(x))+\rho(\hat{\theta}(x),\psi(v)). \end{align*} By symmetry of $\rho$, the first term equals $\rho(\hat{\theta}(x),\psi(\hat{v}(x)))$. The nearest-neighbour inequality then gives \begin{align*} \rho(\psi(\hat{v}(x)),\psi(v))\leq 2\rho(\hat{\theta}(x),\psi(v)). \end{align*} The hypercube separation hypothesis says that metric distance between embedded vertices dominates Hamming distance: \begin{align*} \rho(\psi(\hat{v}(x)),\psi(v)) \geq 2s\,d_H(\hat{v}(x),v). \end{align*} Combining the last two displays and dividing by $2$ gives \begin{align*} \rho(\hat{\theta}(x),\psi(v)) \geq s\,d_H(\hat{v}(x),v). \end{align*} Thus every unit error in the decoded Hamming distance costs at least $s$ in the original metric loss. [/guided] [/step] [step:Decompose the Bayes Hamming risk into coordinatewise bit errors] For every $v \in \{0,1\}^m$, taking expectation under $P_v$ in the previous pointwise inequality gives \begin{align*} \mathbb{E}_v\bigl[\rho(\hat{\theta},\psi(v))\bigr] \geq s\,\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr]. \end{align*} Since \begin{align*} d_H(\hat{v},v)=\sum_{j=1}^m \mathbb{1}_{\{\hat{v}_j \neq v_j\}}, \end{align*} linearity of expectation gives \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr] = \sum_{j=1}^m 2^{-m}\sum_{v \in \{0,1\}^m} P_v(\hat{v}_j \neq v_j). \end{align*} Therefore \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr] \geq s\sum_{j=1}^m 2^{-m}\sum_{v \in \{0,1\}^m} P_v(\hat{v}_j \neq v_j). \end{align*} [guided] The decoder step gave a pointwise comparison between metric loss and Hamming loss: for every observation $x\in\mathcal X$ and every vertex $v\in\{0,1\}^m$, \begin{align*} \rho(\hat{\theta}(x),\psi(v)) \geq s\,d_H(\hat v(x),v). \end{align*} Both sides are non-negative measurable functions of $x$, because $\hat\theta$ and $\hat v$ are measurable. Taking expectation with respect to $P_v$ preserves the inequality and gives \begin{align*} \mathbb{E}_v\bigl[\rho(\hat{\theta},\psi(v))\bigr] \geq s\,\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr]. \end{align*} Now we expand the Hamming distance into coordinate errors. For vertices in $\{0,1\}^m$, the Hamming distance is exactly the number of coordinates on which the two vertices differ, so \begin{align*} d_H(\hat{v},v)=\sum_{j=1}^m \mathbb{1}_{\{\hat{v}_j \neq v_j\}}. \end{align*} Linearity of expectation applies to this finite sum and yields \begin{align*} \mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr] = \sum_{j=1}^m P_v(\hat{v}_j \neq v_j). \end{align*} Averaging this identity over the uniform prior on $\{0,1\}^m$ and interchanging the two finite sums gives \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_v\bigl[d_H(\hat{v},v)\bigr] = \sum_{j=1}^m 2^{-m}\sum_{v \in \{0,1\}^m} P_v(\hat{v}_j \neq v_j). \end{align*} Combining this with the expected metric-loss lower bound gives \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr] \geq s\sum_{j=1}^m 2^{-m}\sum_{v \in \{0,1\}^m} P_v(\hat{v}_j \neq v_j). \end{align*} Thus the Bayes metric risk is reduced to the sum of $m$ coordinatewise bit-estimation risks. [/guided] [/step] [step:Bound each coordinatewise bit risk by total variation] Fix $j \in \{1,\dots,m\}$. Let $\mathcal V_j^0 := \{v \in \{0,1\}^m : v_j = 0\}$. Define the measurable set \begin{align*} A_{j} := \{x \in \mathcal X : \hat{v}_j(x)=1\}. \end{align*} The set $A_j$ belongs to $\mathcal A$ because $\hat v$ is measurable and the coordinate projection $w\mapsto w_j$ from the finite cube to $\{0,1\}$ is measurable. Then \begin{align*} P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j) = P_v(A_j)+P_{v^{(j)}}(A_j^c). \end{align*} Using the definition of [total variation distance](/page/Total%20Variation) as \begin{align*} \operatorname{TV}(P,Q)=\sup_{A \in \mathcal A}|P(A)-Q(A)|, \end{align*} we first compute \begin{align*} P_v(A_j)+P_{v^{(j)}}(A_j^c)=1-\bigl(P_{v^{(j)}}(A_j)-P_v(A_j)\bigr). \end{align*} Since $P_{v^{(j)}}(A_j)-P_v(A_j) \leq \operatorname{TV}(P_v,P_{v^{(j)}})$, it follows that \begin{align*} P_v(A_j)+P_{v^{(j)}}(A_j^c) \geq 1-\operatorname{TV}(P_v,P_{v^{(j)}}). \end{align*} Averaging over all pairs with $v_j=0$ gives \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j) = 2^{-m}\sum_{v \in \mathcal V_j^0} \left[ P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j) \right]. \end{align*} Applying the preceding total-variation bound to each pair gives \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j) \geq 2^{-m}\sum_{v \in \mathcal V_j^0} \left[1-\operatorname{TV}(P_v,P_{v^{(j)}})\right]. \end{align*} Since the map $v \mapsto v^{(j)}$ pairs each vertex with its $j$th-bit flip and total variation is symmetric, \begin{align*} 2^{-m}\sum_{v \in \mathcal V_j^0}\operatorname{TV}(P_v,P_{v^{(j)}}) = \frac{1}{2}\alpha_j. \end{align*} Also $|\mathcal V_j^0|=2^{m-1}$. Hence \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j) \geq \frac{1-\alpha_j}{2}. \end{align*} [guided] Fix a coordinate $j\in\{1,\dots,m\}$. We compare vertices in pairs that differ only in the $j$th coordinate. Let \begin{align*} \mathcal V_j^0 := \{v \in \{0,1\}^m : v_j = 0\} \end{align*} be the lower half of the cube in coordinate $j$, and define the measurable decision event \begin{align*} A_j := \{x \in \mathcal X : \hat v_j(x)=1\}. \end{align*} The set $A_j$ lies in $\mathcal A$ because $\hat v$ is measurable and the coordinate projection $w\mapsto w_j$ from the finite cube to $\{0,1\}$ is measurable. For $v\in\mathcal V_j^0$, the estimator makes a $j$th-coordinate error under $P_v$ exactly on $A_j$, while it makes a $j$th-coordinate error under $P_{v^{(j)}}$ exactly on $A_j^c$. Therefore \begin{align*} P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j) = P_v(A_j)+P_{v^{(j)}}(A_j^c). \end{align*} Using $P_{v^{(j)}}(A_j^c)=1-P_{v^{(j)}}(A_j)$, we compute \begin{align*} P_v(A_j)+P_{v^{(j)}}(A_j^c)=1-\bigl(P_{v^{(j)}}(A_j)-P_v(A_j)\bigr). \end{align*} By the definition of [total variation distance](/page/Total%20Variation), \begin{align*} P_{v^{(j)}}(A_j)-P_v(A_j) \leq \operatorname{TV}(P_v,P_{v^{(j)}}), \end{align*} so \begin{align*} P_v(A_j)+P_{v^{(j)}}(A_j^c) \geq 1-\operatorname{TV}(P_v,P_{v^{(j)}}). \end{align*} This is the binary testing lower bound in the only form needed here, derived directly from the definition of total variation. Now average over all pairs with $v_j=0$. Each vertex of the cube appears exactly once in the paired sum, either as $v$ or as $v^{(j)}$, so \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j) = 2^{-m}\sum_{v \in \mathcal V_j^0} \left[ P_v(\hat{v}_j \neq v_j)+P_{v^{(j)}}(\hat{v}_j \neq v^{(j)}_j) \right]. \end{align*} Applying the pairwise bound gives \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j) \geq 2^{-m}\sum_{v \in \mathcal V_j^0} \left[1-\operatorname{TV}(P_v,P_{v^{(j)}})\right]. \end{align*} Since $|\mathcal V_j^0|=2^{m-1}$, the constant part contributes $2^{-m}2^{m-1}=1/2$. Also, the map $v\mapsto v^{(j)}$ pairs the cube into unordered pairs and total variation is symmetric, so \begin{align*} 2^{-m}\sum_{v \in \mathcal V_j^0}\operatorname{TV}(P_v,P_{v^{(j)}}) = \frac{1}{2}\alpha_j. \end{align*} Consequently \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}P_v(\hat{v}_j \neq v_j) \geq \frac{1-\alpha_j}{2}. \end{align*} [/guided] [/step] [step:Combine the coordinate bounds and take the infimum over estimators] Combining the preceding steps, every measurable estimator $\hat{\theta}:\mathcal X \to \Theta$ satisfies \begin{align*} \sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]. \end{align*} The decoded Hamming-risk lower bound gives \begin{align*} \sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq s\sum_{j=1}^m 2^{-m}\sum_{v \in \{0,1\}^m} P_v(\hat{v}_j \neq v_j). \end{align*} The coordinatewise total-variation bound gives \begin{align*} \sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq \frac{s}{2}\sum_{j=1}^m(1-\alpha_j). \end{align*} Taking the infimum over all admissible measurable estimators $\hat{\theta}$ gives \begin{align*} \inf_{\hat{\theta}}\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq \frac{s}{2}\sum_{j=1}^m(1-\alpha_j), \end{align*} which is the asserted lower bound. [guided] We now assemble the inequalities proved in the preceding steps for an arbitrary measurable estimator $\hat\theta:\mathcal X\to\Theta$. The reduction to the uniform cube risk gave \begin{align*} \sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr]. \end{align*} The decoding argument and the decomposition of Hamming distance into coordinate errors then gave \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m}\mathbb{E}_{v}\bigl[\rho(\hat{\theta},\psi(v))\bigr] \geq s\sum_{j=1}^m 2^{-m}\sum_{v \in \{0,1\}^m} P_v(\hat{v}_j \neq v_j). \end{align*} Finally, for each coordinate $j$, the total-variation bound gives \begin{align*} 2^{-m}\sum_{v \in \{0,1\}^m} P_v(\hat{v}_j \neq v_j) \geq \frac{1-\alpha_j}{2}. \end{align*} Substituting this coordinatewise lower bound into the preceding display yields \begin{align*} \sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq \frac{s}{2}\sum_{j=1}^m(1-\alpha_j). \end{align*} Because the estimator $\hat\theta$ was arbitrary, taking the infimum over all admissible measurable estimators preserves the lower bound: \begin{align*} \inf_{\hat{\theta}}\sup_{\theta \in \Theta}\mathbb{E}_{\theta}\bigl[\rho(\hat{\theta},\theta)\bigr] \geq \frac{s}{2}\sum_{j=1}^m(1-\alpha_j). \end{align*} This is exactly the claimed Assouad lower bound. [/guided] [/step]

Prerequisites (0/9 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Triangle Inequality For Inner Product Spaces

Definitions & Concepts

Explore Further

Distribution Definition Measurable Functions Definition Metric Space Definition Expectation Definition Event Definition Function Definition Matrix Definition Set Definition Triangle Inequality For Inner Product Spaces Theorem #433 Expectation of a Product of Independent Variables Probability Theory Countable Additivity of Conditional Probability Probability Theory Thresholded Lasso Support and Sign Recovery Under Supremum Norm Control Probability & Statistics Ridgeless Regression Conditional Risk Decomposition Probability & Statistics Doob-Dynkin Representation Probability & Statistics Gaussian Score Concentration for Column-Normalized Fixed Designs Probability & Statistics Stirling's Formula Probability Theory Feasibility of the True Parameter for the Dantzig Selector Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.

Assouad's Lemma (Theorem # 5906)

Discussion

Proof

Prerequisites (0/9 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Assouad's Lemma (Theorem # 5906)

Discussion

Proof

Prerequisites (0/9 completed)

Prerequisites Graph

Explore Further