Assouad's Lemma for Coordinatewise Support Recovery

Assouad's Lemma for Coordinatewise Support Recovery (Theorem # 5910)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] Put the uniform prior on the hypercube and write the Bayes Hamming risk as the sum of the coordinatewise error probabilities. For each coordinate $j$, condition on all other coordinates; the remaining decision is a binary test between two neighboring vertices of the cube. A direct total-variation testing bound gives average error at least $(1-\alpha)/2$ for each such binary problem. Summing over the $m$ coordinates and comparing Bayes risk with worst-case risk gives the Hamming lower bound, and the probability lower bound follows from the deterministic inequality $|\hat S\triangle S(v)|\le m\mathbb{1}_{\{\hat S\neq S(v)\}}$. [/proofplan] [step:Put the uniform prior on the hypercube and decompose the Bayes Hamming risk] Fix a measurable estimator $\hat S:\mathcal X\to 2^{\{1,\dots,m\}}$. Define a probability measure $\mathbb Q$ on $\{0,1\}^m\times\mathcal X$ by \begin{align*} \mathbb Q(\{v\}\times A):=2^{-m}P_v(A) \end{align*} for every $v\in\{0,1\}^m$ and every $A\in\mathcal A$. Let $V:\{0,1\}^m\times\mathcal X\to\{0,1\}^m$ and $X:\{0,1\}^m\times\mathcal X\to\mathcal X$ be the coordinate maps. Thus $V$ is uniform on $\{0,1\}^m$, and conditional on $V=v$, the observation $X$ has distribution $P_v$. Equip the finite set $2^{\{1,\dots,m\}}$ with its power-set $\sigma$-algebra. For each $j\in\{1,\dots,m\}$, define the coordinate decision rule $\delta_j:\mathcal X\to\{0,1\}$ by \begin{align*} \delta_j(x):=\mathbb{1}_{\{j\in\hat S(x)\}}. \end{align*} The map $A\mapsto \mathbb{1}_{\{j\in A\}}$ from $2^{\{1,\dots,m\}}$ to $\{0,1\}$ is measurable because both spaces are finite and carry their power-set $\sigma$-algebras, so $\delta_j$ is measurable as its composition with the measurable estimator $\hat S$. Since \begin{align*} |\hat S(X)\triangle S(V)| =\sum_{j=1}^m \mathbb{1}_{\{\delta_j(X)\neq V_j\}}, \end{align*} linearity of expectation with respect to $\mathbb Q$ gives \begin{align*} \mathbb E_{\mathbb Q}[|\hat S(X)\triangle S(V)|] =\sum_{j=1}^m \mathbb Q(\delta_j(X)\neq V_j). \end{align*} [guided] The uniform prior turns the minimax problem into an average-risk problem that is easier to lower bound. We define $\mathbb Q$ by first drawing $V$ uniformly from the cube and then drawing $X$ from the experiment $P_V$: \begin{align*} \mathbb Q(\{v\}\times A):=2^{-m}P_v(A). \end{align*} The estimator $\hat S$ produces a subset of $\{1,\dots,m\}$. We equip the finite set $2^{\{1,\dots,m\}}$ with its power-set $\sigma$-algebra, so the coordinate membership map $A\mapsto \mathbb{1}_{\{j\in A\}}$ is measurable. To compare $\hat S$ coordinate by coordinate with the true support $S(V)$, define $\delta_j:\mathcal X\to\{0,1\}$ by \begin{align*} \delta_j(x):=\mathbb{1}_{\{j\in\hat S(x)\}}. \end{align*} Because $\delta_j$ is the composition of $\hat S$ with a measurable map on the finite codomain, $\delta_j$ is measurable. Thus $\delta_j(X)$ is the estimator's decision about whether coordinate $j$ belongs to the support. Since $j\in S(V)$ exactly when $V_j=1$, the event of making a mistake at coordinate $j$ is $\{\delta_j(X)\neq V_j\}$. Therefore the symmetric-difference size is exactly the Hamming loss: \begin{align*} |\hat S(X)\triangle S(V)| =\sum_{j=1}^m \mathbb{1}_{\{\delta_j(X)\neq V_j\}}. \end{align*} Taking expectation under $\mathbb Q$ and using linearity of expectation gives \begin{align*} \mathbb E_{\mathbb Q}[|\hat S(X)\triangle S(V)|] =\sum_{j=1}^m \mathbb Q(\delta_j(X)\neq V_j). \end{align*} This decomposition is the point of using the hypercube: the total support error splits into $m$ binary testing errors. [/guided] [/step] [step:Lower bound each coordinate error by a binary testing inequality] Fix $j\in\{1,\dots,m\}$. Let $\{0,1\}^{m-1}_j$ denote the set of assignments to all coordinates except $j$. For each $w\in\{0,1\}^{m-1}_j$, define $v_0(w),v_1(w)\in\{0,1\}^m$ by requiring that both vectors agree with $w$ outside coordinate $j$, while $(v_0(w))_j=0$ and $(v_1(w))_j=1$. Then $v_1(w)=v_0(w)^{(j)}$. For any measurable set $A\in\mathcal A$ and any probability measures $P,Q$ on $(\mathcal X,\mathcal A)$, \begin{align*} P(A)+Q(A^c)\ge 1-\operatorname{TV}(P,Q). \end{align*} Indeed, $Q(A^c)=1-Q(A)$, so \begin{align*} P(A)+Q(A^c)=1-(Q(A)-P(A))\ge 1-|P(A)-Q(A)|\ge 1-\operatorname{TV}(P,Q). \end{align*} Apply this with $P=P_{v_0(w)}$, $Q=P_{v_1(w)}$, and $A=\{x\in\mathcal X:\delta_j(x)=1\}$. Since $\operatorname{TV}(P_{v_0(w)},P_{v_1(w)})\le\alpha$, we obtain \begin{align*} \frac{1}{2}P_{v_0(w)}(\delta_j=1)+\frac{1}{2}P_{v_1(w)}(\delta_j=0) \ge \frac{1-\alpha}{2}. \end{align*} Averaging this inequality over all $w\in\{0,1\}^{m-1}_j$ gives \begin{align*} \mathbb Q(\delta_j(X)\neq V_j)\ge \frac{1-\alpha}{2}. \end{align*} [guided] Fix a coordinate $j\in\{1,\dots,m\}$. The strategy is to freeze every coordinate except $j$ and view the remaining question as a binary test. Let $\{0,1\}^{m-1}_j$ denote the set of assignments to the coordinates in $\{1,\dots,m\}\setminus\{j\}$. For each $w\in\{0,1\}^{m-1}_j$, define $v_0(w),v_1(w)\in\{0,1\}^m$ as follows: both vectors agree with $w$ away from coordinate $j$, while $(v_0(w))_j=0$ and $(v_1(w))_j=1$. Thus $v_1(w)=v_0(w)^{(j)}$, so the theorem hypothesis gives \begin{align*} \operatorname{TV}(P_{v_0(w)},P_{v_1(w)})\le \alpha. \end{align*} We now prove the binary testing inequality used for this frozen value of $w$. Let $P$ and $Q$ be probability measures on $(\mathcal X,\mathcal A)$, and let $A\in\mathcal A$ be measurable. Since $Q(A^c)=1-Q(A)$, subtracting and bounding by total variation gives \begin{align*} P(A)+Q(A^c)=1-(Q(A)-P(A))\ge 1-|P(A)-Q(A)|\ge 1-\operatorname{TV}(P,Q). \end{align*} This inequality says that no test can make both type-I and type-II errors too small when the two laws are close in total variation. Apply the inequality with $P=P_{v_0(w)}$, $Q=P_{v_1(w)}$, and $A=\{x\in\mathcal X:\delta_j(x)=1\}$. The set $A$ is measurable because $\delta_j$ is measurable. Under $P_{v_0(w)}$, the true value of coordinate $j$ is $0$, so the event $\{\delta_j=1\}$ is an error. Under $P_{v_1(w)}$, the true value of coordinate $j$ is $1$, so the event $\{\delta_j=0\}=A^c$ is an error. Therefore \begin{align*} \frac{1}{2}P_{v_0(w)}(\delta_j=1)+\frac{1}{2}P_{v_1(w)}(\delta_j=0) \ge \frac{1}{2}\bigl(1-\operatorname{TV}(P_{v_0(w)},P_{v_1(w)})\bigr) \ge \frac{1-\alpha}{2}. \end{align*} Averaging this bound over the uniformly distributed frozen vector $w\in\{0,1\}^{m-1}_j$ is exactly the conditional contribution to $\mathbb Q(\delta_j(X)\neq V_j)$. Hence \begin{align*} \mathbb Q(\delta_j(X)\neq V_j)\ge \frac{1-\alpha}{2}. \end{align*} [/guided] [/step] [step:Sum the coordinate lower bounds and pass from Bayes risk to worst-case risk] Using the coordinate decomposition and the lower bound from the previous step, \begin{align*} \mathbb E_{\mathbb Q}[|\hat S(X)\triangle S(V)|] =\sum_{j=1}^m \mathbb Q(\delta_j(X)\neq V_j) \ge \frac{m}{2}(1-\alpha). \end{align*} By the definition of $\mathbb Q$, \begin{align*} \mathbb E_{\mathbb Q}[|\hat S(X)\triangle S(V)|] =2^{-m}\sum_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|] \le \sup_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|]. \end{align*} Therefore every measurable estimator $\hat S$ satisfies \begin{align*} \sup_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|] \ge \frac{m}{2}(1-\alpha). \end{align*} Taking the infimum over all measurable estimators $\hat S$ proves \begin{align*} \inf_{\hat S}\sup_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|] \ge \frac{m}{2}(1-\alpha). \end{align*} [/step] [step:Convert the Hamming lower bound into a probability lower bound] For every $v\in\{0,1\}^m$ and every observation $x\in\mathcal X$, \begin{align*} |\hat S(x)\triangle S(v)| \le m\,\mathbb{1}_{\{\hat S(x)\neq S(v)\}}, \end{align*} because the symmetric difference is a subset of $\{1,\dots,m\}$ and is nonempty exactly when $\hat S(x)\neq S(v)$. Taking expectation under $P_v$ gives \begin{align*} \mathbb E_v[|\hat S\triangle S(v)|] \le m\,\mathbb P_v(\hat S\neq S(v)). \end{align*} Since $m\in\mathbb N$, we have $m\ge 1$. Hence \begin{align*} \sup_{v\in\{0,1\}^m}\mathbb P_v(\hat S\neq S(v)) \ge \frac{1}{m}\sup_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|] \ge \frac{1-\alpha}{2}. \end{align*} Taking the infimum over all measurable estimators $\hat S$ proves \begin{align*} \inf_{\hat S}\sup_{v\in\{0,1\}^m}\mathbb P_v(\hat S\neq S(v)) \ge \frac{1-\alpha}{2}. \end{align*} [guided] The Bayes Hamming lower bound from the coordinatewise testing argument says that every measurable estimator $\hat S$ satisfies \begin{align*} \mathbb E_{\mathbb Q}[|\hat S(X)\triangle S(V)|]\ge \frac{m}{2}(1-\alpha). \end{align*} By the definition of $\mathbb Q$, this Bayes risk is the uniform average of the risks over the cube: \begin{align*} \mathbb E_{\mathbb Q}[|\hat S(X)\triangle S(V)|] =2^{-m}\sum_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|]. \end{align*} An average is bounded above by the maximum of the same finite family, so \begin{align*} \sup_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|] \ge \frac{m}{2}(1-\alpha). \end{align*} Taking the infimum over all measurable estimators proves the minimax Hamming-risk bound. It remains to convert Hamming loss into an exact-support error probability. For every $v\in\{0,1\}^m$ and every observation $x\in\mathcal X$, the symmetric difference $\hat S(x)\triangle S(v)$ is a subset of $\{1,\dots,m\}$. It has size $0$ exactly when $\hat S(x)=S(v)$, and otherwise it has size at most $m$. Hence \begin{align*} |\hat S(x)\triangle S(v)| \le m\,\mathbb{1}_{\{\hat S(x)\neq S(v)\}}. \end{align*} Taking expectation under $P_v$ gives \begin{align*} \mathbb E_v[|\hat S\triangle S(v)|] \le m\,\mathbb P_v(\hat S\neq S(v)). \end{align*} Since $m\in\mathbb N$, the number of coordinates satisfies $m\ge 1$, so division by $m$ is legitimate. Therefore \begin{align*} \sup_{v\in\{0,1\}^m}\mathbb P_v(\hat S\neq S(v)) \ge \frac{1}{m}\sup_{v\in\{0,1\}^m}\mathbb E_v[|\hat S\triangle S(v)|] \ge \frac{1-\alpha}{2}. \end{align*} Taking the infimum over all measurable estimators $\hat S$ proves the probability lower bound. [/guided] This completes the proof. [/step]

Prerequisites (0/6 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

test

Definitions & Concepts

Explore Further

What brings you to Androma?

Start with a route through the knowledge graph.