Androma — The Home of Mathematics on the Internet

[step:Convert every estimator into a nearest packing decoder]Let $\hat\theta:\mathcal X\to\Theta$ be any estimator. Define a decoder $\hat V:\mathcal X\to I$ by choosing an index attaining the minimum distance from $\hat\theta(x)$ to the packing: \begin{align*} \hat V(x)\in\operatorname*{arg\,min}_{j\in I} d(\hat\theta(x),\theta_j). \end{align*} Since $I$ is finite, such an index exists; ties are resolved by the smallest index. For each $k\in I$, the preimage $\{x\in\mathcal X:\hat V(x)=k\}$ is the finite intersection of the measurable comparison sets $\{x:d(\hat\theta(x),\theta_k)\le d(\hat\theta(x),\theta_j)\}$ for $j\in I$ and $\{x:d(\hat\theta(x),\theta_k)<d(\hat\theta(x),\theta_j)\}$ for $j<k$, so $\hat V$ is measurable. For each $i\in I$, on the event $\{\hat V(X)\ne i\}$ under $P_{\theta_i}$, the triangle inequality and the definition of $\hat V$ give \begin{align*} \varepsilon \le d(\theta_i,\theta_{\hat V(X)}) \le d(\theta_i,\hat\theta(X))+d(\hat\theta(X),\theta_{\hat V(X)}) \le 2d(\theta_i,\hat\theta(X)). \end{align*} Therefore \begin{align*} d(\hat\theta(X),\theta_i)\ge \frac{\varepsilon}{2}\mathbb 1_{\{\hat V(X)\ne i\}}. \end{align*} Integrating with respect to $P_{\theta_i}$ yields \begin{align*} \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}P_{\theta_i}(\hat V(X)\ne i). \end{align*} Averaging over $i\in I$ and using $\sup_{\theta\in\Theta}$ to dominate the average over the finite subset $\Theta_\varepsilon$, we obtain \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)]. \end{align*} The pointwise testing reduction gives \begin{align*} \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i). 
\end{align*} By the definition of the joint law $\mathbb P$, \begin{align*} \frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i) = \mathbb P(\hat V(X)\ne V). \end{align*} Therefore \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{\varepsilon}{2}\mathbb P(\hat V(X)\ne V). \end{align*}[/step]
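
The decoder construction and the pointwise inequality above can be checked numerically. The following is a minimal sketch, assuming a hypothetical packing of four points on the real line with the absolute-value metric; the names `nearest_packing_index` and `check_pointwise_bound` are illustrative and not part of the proof, and Python indices start at 0 rather than 1.

```python
import random

def nearest_packing_index(estimate, packing, dist):
    """Return the smallest index j minimizing dist(estimate, packing[j])."""
    best_j, best_d = 0, dist(estimate, packing[0])
    for j in range(1, len(packing)):
        d_j = dist(estimate, packing[j])
        if d_j < best_d:  # strict inequality keeps the smallest index on ties
            best_j, best_d = j, d_j
    return best_j

def check_pointwise_bound(packing, eps, dist, trials=10_000, seed=0):
    """Check d(estimate, theta_i) >= (eps/2) * 1{V_hat != i} on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        i = rng.randrange(len(packing))      # index of the true packing point
        estimate = rng.uniform(-1.0, 4.0)    # arbitrary estimator output
        v_hat = nearest_packing_index(estimate, packing, dist)
        indicator = 1.0 if v_hat != i else 0.0
        if dist(estimate, packing[i]) < 0.5 * eps * indicator - 1e-12:
            return False
    return True

# Hypothetical eps-separated packing on the real line (an assumption for illustration).
packing = [0.0, 1.0, 2.0, 3.0]
eps = 1.0
abs_dist = lambda a, b: abs(a - b)

bound_holds = check_pointwise_bound(packing, eps, abs_dist)
tie_index = nearest_packing_index(0.5, packing, abs_dist)  # equidistant from 0.0 and 1.0
```

The estimator output is drawn at random here because the inequality holds for every value of $\hat\theta(x)$, not only for sensible estimators.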

[guided]The purpose of the finite packing is to turn estimation into testing. Under the joint law $\mathbb P$, the random index $V$ is uniform on $I=\{1,\dots,M\}$, and conditionally on $V=i$ the observation $X$ has law $P_{\theta_i}$. Thus the average testing error $\mathbb P(\hat V(X)\ne V)$ is exactly the average, over packing points, of the probability of failing to identify the true index. If an estimator $\hat\theta$ estimates $\theta_i$ within distance less than $\varepsilon/2$, then it must identify $\theta_i$ as the nearest packing point, because every other packing point is at least $\varepsilon$ away from $\theta_i$. Formally, for each observation $x\in\mathcal X$, define $\hat V(x)$ to be a nearest packing index: \begin{align*} \hat V(x)\in\operatorname*{arg\,min}_{j\in I} d(\hat\theta(x),\theta_j), \end{align*} with ties broken by the smallest index. Because $I$ is finite, this minimum is attained. The same finite comparison description used in the exact proof shows that $\hat V$ is measurable, so $\hat V(X)$ is an $I$-valued [random variable](/page/Random%20Variable). Now fix $i\in I$ and work under the law $P_{\theta_i}$. On the event $\{\hat V(X)\ne i\}$, the two packing points $\theta_i$ and $\theta_{\hat V(X)}$ are distinct, hence packing separation gives \begin{align*} d(\theta_i,\theta_{\hat V(X)})\ge\varepsilon. \end{align*} The triangle inequality gives \begin{align*} d(\theta_i,\theta_{\hat V(X)}) \le d(\theta_i,\hat\theta(X))+d(\hat\theta(X),\theta_{\hat V(X)}). \end{align*} By the definition of $\hat V(X)$ as a nearest packing index, \begin{align*} d(\hat\theta(X),\theta_{\hat V(X)}) \le d(\hat\theta(X),\theta_i). \end{align*} Combining these three inequalities gives \begin{align*} \varepsilon \le 2d(\hat\theta(X),\theta_i) \end{align*} on $\{\hat V(X)\ne i\}$. Equivalently, \begin{align*} d(\hat\theta(X),\theta_i) \ge \frac{\varepsilon}{2}\mathbb 1_{\{\hat V(X)\ne i\}}. 
\end{align*} Integrating this pointwise inequality with respect to $P_{\theta_i}$ gives \begin{align*} \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}P_{\theta_i}(\hat V(X)\ne i). \end{align*} Because every packing point $\theta_i$ lies in $\Theta$, the worst-case risk dominates the average risk over the uniformly chosen index $V$: \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)]. \end{align*} Using the inequality just proved for each $i\in I$ gives \begin{align*} \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i). \end{align*} Finally, the joint law $\mathbb P$ was defined so that \begin{align*} \frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i) = \mathbb P(\hat V(X)\ne V). \end{align*} Combining these three displays yields \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{\varepsilon}{2}\mathbb P(\hat V(X)\ne V). \end{align*} Thus every estimator induces a finite testing rule, and the testing error of that rule bounds the worst-case estimation risk from below.[/guided]
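
The averaged form of the reduction can be illustrated by simulating the joint law of $(V,X)$. This is a sketch under assumptions that are not in the text: a Gaussian location model $X=\theta_V+\sigma Z$ on the real line, the identity estimator $\hat\theta(x)=x$, and a hypothetical four-point packing. The function name `simulate` is illustrative.

```python
import random

def simulate(packing, sigma, n=20_000, seed=1):
    """Draw (V, X) from the joint law and return (average loss, testing error)."""
    rng = random.Random(seed)
    total_loss, errors = 0.0, 0
    indices = range(len(packing))
    for _ in range(n):
        v = rng.randrange(len(packing))            # V uniform on the index set
        x = packing[v] + rng.gauss(0.0, sigma)     # X ~ P_{theta_v} (assumed Gaussian)
        estimate = x                               # identity estimator (an assumption)
        # min() returns the first minimizer, i.e. ties go to the smallest index
        v_hat = min(indices, key=lambda j: abs(estimate - packing[j]))
        total_loss += abs(estimate - packing[v])
        errors += int(v_hat != v)
    return total_loss / n, errors / n

packing = [0.0, 1.0, 2.0, 3.0]   # hypothetical packing with separation eps = 1
eps = 1.0
avg_risk, test_error = simulate(packing, sigma=0.6)
```

Since the inequality holds sample by sample, the empirical average loss is at least $\frac{\varepsilon}{2}$ times the empirical error rate for every seed; the worst-case risk can only be larger than this average.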

[step:Prove the finite Fano bound and compare the mixture information to the reference divergence]Let $P_e:=\mathbb P(\hat V(X)\ne V)$ denote the testing error of the decoder. For a finite-valued random variable $Z$, let $H(Z)$ denote its Shannon entropy. For a finite-valued random variable $Z$ and a sub-$\sigma$-algebra $\mathcal G$, let $H(Z\mid\mathcal G)$ denote the conditional Shannon entropy, defined as the expectation of the entropy of the regular conditional distribution of $Z$ given $\mathcal G$. We write $H(V\mid X)$ for $H(V\mid\sigma(X))$. Let $I(V;X)$ denote the mutual information of the finite mixture experiment, equivalently the relative entropy between the joint law of $(V,X)$ and the product of its marginals. Since $V$ is uniform on the finite set $I$, its entropy is $H(V)=\log M$. The conditional entropy satisfies the following elementary entropy estimate: \begin{align*} H(V\mid X)\le \log 2+P_e\log M. \end{align*} Indeed, with $E:=\mathbb 1_{\{\hat V(X)\ne V\}}$, the value of $E$ is determined by $(V,\hat V(X))$, so conditioning and the chain rule for finite entropy give \begin{align*} H(V\mid X)\le H(E,V\mid X,\hat V(X)). \end{align*} The chain rule gives \begin{align*} H(E,V\mid X,\hat V(X))=H(E\mid X,\hat V(X))+H(V\mid E,X,\hat V(X)). \end{align*} The first term is at most $\log 2$ because $E$ is binary. On $\{E=0\}$, $V=\hat V(X)$ is determined; on $\{E=1\}$, $V$ takes values in at most $M-1$ indices. Hence \begin{align*} H(V\mid E,X,\hat V(X))\le P_e\log(M-1)\le P_e\log M. \end{align*} Combining these inequalities proves the displayed entropy estimate. Since $I(V;X)=H(V)-H(V\mid X)$, we obtain \begin{align*} P_e\ge 1-\frac{I(V;X)+\log 2}{\log M}. \end{align*} The marginal law $\bar P$ of $X$ under $\mathbb P$ is \begin{align*} \bar P(A):=\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(A) \end{align*} for every $A\in\mathcal A$. 
By the definition of mutual information for the finite mixture experiment, \begin{align*} I(V;X)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P). \end{align*} We now compare this quantity with the reference divergences. If $D(P_{\theta_i}\|Q_\varepsilon)=\infty$ for some $i$, the admissibility inequality cannot hold, because the average divergence would then be infinite while the bound $\alpha\log M$ is finite. Thus every $D(P_{\theta_i}\|Q_\varepsilon)$ is finite, and in particular each $P_{\theta_i}$ is absolutely continuous with respect to $Q_\varepsilon$. Let $p_i:\mathcal X\to[0,\infty)$ be a Radon-Nikodym density of $P_{\theta_i}$ with respect to $Q_\varepsilon$, and let $\bar p:\mathcal X\to[0,\infty)$ be the density of $\bar P$ with respect to $Q_\varepsilon$ defined by \begin{align*} \bar p(x):=\frac{1}{M}\sum_{i=1}^M p_i(x). \end{align*} For each $i$, the set where $p_i>0$ and $\bar p=0$ has $Q_\varepsilon$-measure zero, since $\bar p\ge p_i/M$. Therefore $P_{\theta_i}$ is absolutely continuous with respect to $\bar P$, and the divergence is well-defined by the density ratio $p_i/\bar p$. The convexity of $t\mapsto t\log t$ on $[0,\infty)$ gives \begin{align*} D(\bar P\|Q_\varepsilon) =\int_{\mathcal X}\bar p(x)\log \bar p(x)\,dQ_\varepsilon(x) \le \frac{1}{M}\sum_{i=1}^M\int_{\mathcal X}p_i(x)\log p_i(x)\,dQ_\varepsilon(x) = \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)<\infty. \end{align*} Using the identity $\log(p_i/\bar p)=\log p_i-\log\bar p$ on $\{p_i>0\}$ and averaging over $i$, we get \begin{align*} \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)-D(\bar P\|Q_\varepsilon). \end{align*} Moreover $D(\bar P\|Q_\varepsilon)\ge 0$, since $t\log t-t+1\ge0$ for every $t\ge0$ and $\int_{\mathcal X}\bar p(x)\,dQ_\varepsilon(x)=1$. Hence \begin{align*} I(V;X)\le \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon). \end{align*} Using the admissibility hypothesis, we obtain \begin{align*} I(V;X)\le \alpha\log M. 
\end{align*} Substituting this into the finite Fano bound gives \begin{align*} \mathbb P(\hat V(X)\ne V)\ge 1-\alpha-\frac{\log 2}{\log M}. \end{align*}[/step]
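
The comparison between the mixture information and the reference divergences can be verified on a finite sample space, where all densities are probability vectors. The sketch below assumes four hypothetical component laws on a four-point space and an arbitrary reference law; `kl`, `components`, and `reference` are illustrative names, and logarithms are natural.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in nats; returns inf if p is not dominated by q."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x > 0.0:
            if q_x == 0.0:
                return math.inf
            total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical component laws P_{theta_i} on {0, 1, 2, 3} (assumed for illustration).
components = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
M = len(components)
reference = [0.4, 0.3, 0.2, 0.1]   # plays the role of Q_eps (an assumption)

# Mixture law: pointwise average of the component laws.
mixture = [sum(p[x] for p in components) / M for x in range(len(reference))]

mutual_info = sum(kl(p, mixture) for p in components) / M       # I(V;X)
avg_reference_div = sum(kl(p, reference) for p in components) / M
mixture_div = kl(mixture, reference)                            # D(P_bar || Q_eps)

# Decomposition: I(V;X) = average reference divergence - D(P_bar || Q_eps).
decomposition_gap = abs(mutual_info - (avg_reference_div - mixture_div))
```

Because `mixture_div` is nonnegative, the decomposition gives `mutual_info <= avg_reference_div`, which is the inequality used before invoking admissibility.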

[guided]We need a lower bound on the probability that the decoder identifies the wrong packing index. Let \begin{align*} P_e:=\mathbb P(\hat V(X)\ne V) \end{align*} denote this testing error. For a finite-valued random variable $Z$, $H(Z)$ denotes its Shannon entropy. For a finite-valued random variable $Z$ and a sub-$\sigma$-algebra $\mathcal G$, $H(Z\mid\mathcal G)$ denotes the expected entropy of the regular conditional distribution of $Z$ given $\mathcal G$; in particular, $H(V\mid X)$ means $H(V\mid\sigma(X))$. The mutual information $I(V;X)$ is the relative entropy between the joint law of $(V,X)$ and the product of the marginal laws of $V$ and $X$. In this finite mixture setting this is also the average divergence from each component law to the mixture law, as recalled below. Because $V$ is uniform on $I=\{1,\dots,M\}$, its entropy is $H(V)=\log M$. The information identity \begin{align*} I(V;X)=H(V)-H(V\mid X) \end{align*} shows that an upper bound on $H(V\mid X)$ in terms of $P_e$ turns any upper bound on $I(V;X)$ into a lower bound on $P_e$. We prove the finite Fano estimate directly. Define the error indicator $E:=\mathbb 1_{\{\hat V(X)\ne V\}}$. Since $\hat V(X)$ is a function of $X$, conditioning on it in addition to $X$ changes nothing, and by the chain rule adjoining $E$ cannot decrease the entropy. Hence \begin{align*} H(V\mid X)=H(V\mid X,\hat V(X))\le H(E,V\mid X,\hat V(X)). \end{align*} Applying the chain rule again gives \begin{align*} H(E,V\mid X,\hat V(X))=H(E\mid X,\hat V(X))+H(V\mid E,X,\hat V(X)). \end{align*} The variable $E$ is binary, hence $H(E\mid X,\hat V(X))\le\log 2$. For the second term, split according to the value of $E$. If $E=0$, then $V=\hat V(X)$, so no uncertainty about $V$ remains. If $E=1$, then $V$ can be any index except $\hat V(X)$, so there are at most $M-1$ possibilities. Therefore \begin{align*} H(V\mid E,X,\hat V(X))\le P_e\log(M-1)\le P_e\log M. \end{align*} Combining the entropy bounds gives \begin{align*} H(V\mid X)\le \log 2+P_e\log M. 
\end{align*} Substituting $H(V\mid X)=\log M-I(V;X)$ and rearranging yields \begin{align*} P_e\ge 1-\frac{I(V;X)+\log 2}{\log M}. \end{align*} It remains to bound the mutual information by the admissibility assumption. The marginal law of $X$ is the mixture measure $\bar P$ defined by \begin{align*} \bar P(A):=\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(A) \end{align*} for every $A\in\mathcal A$. For this finite mixture experiment, mutual information is the average divergence from the component laws to the mixture: \begin{align*} I(V;X)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P). \end{align*} The admissibility condition bounds divergences to $Q_\varepsilon$, not to $\bar P$, so we compare the two. Since \begin{align*} \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)<\infty, \end{align*} each $D(P_{\theta_i}\|Q_\varepsilon)$ is finite. Hence each $P_{\theta_i}$ is absolutely continuous with respect to $Q_\varepsilon$. Let $p_i:\mathcal X\to[0,\infty)$ be a Radon-Nikodym density of $P_{\theta_i}$ with respect to $Q_\varepsilon$. Then the mixture $\bar P$ has density $\bar p:\mathcal X\to[0,\infty)$ given by \begin{align*} \bar p(x):=\frac{1}{M}\sum_{i=1}^M p_i(x). \end{align*} Because $\bar p\ge p_i/M$, the ratio $p_i/\bar p$ is well-defined $P_{\theta_i}$-almost everywhere. Before subtracting the mixture divergence, we check that it is finite. The convexity of $t\mapsto t\log t$ on $[0,\infty)$ gives \begin{align*} D(\bar P\|Q_\varepsilon) =\int_{\mathcal X}\bar p(x)\log \bar p(x)\,dQ_\varepsilon(x) \le \frac{1}{M}\sum_{i=1}^M\int_{\mathcal X}p_i(x)\log p_i(x)\,dQ_\varepsilon(x) = \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)<\infty. \end{align*} Therefore \begin{align*} \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)-D(\bar P\|Q_\varepsilon). \end{align*} This is the logarithmic decomposition $\log(p_i/\bar p)=\log p_i-\log\bar p$ averaged over $i$. 
Also $D(\bar P\|Q_\varepsilon)\ge0$, because $t\log t-t+1\ge0$ for every $t\ge0$ and $\int_{\mathcal X}\bar p(x)\,dQ_\varepsilon(x)=1$. Hence \begin{align*} I(V;X)\le \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon). \end{align*} The admissibility hypothesis gives \begin{align*} I(V;X)\le \alpha\log M. \end{align*} Substituting this into Fano's bound gives the testing lower bound \begin{align*} \mathbb P(\hat V(X)\ne V)\ge 1-\alpha-\frac{\log 2}{\log M}. \end{align*}[/guided]
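
Fano's bound can be compared with the exact error of a concrete decoder on a finite sample space. The following sketch assumes the same kind of hypothetical four-component experiment as a toy model and uses a maximum-likelihood decoder; the names `mutual_information`, `decoder_error`, and `fano_lower_bound` are illustrative, and logarithms are natural.

```python
import math

def mutual_information(components):
    """I(V;X) for V uniform over the components, as an average divergence to the mixture."""
    M, n = len(components), len(components[0])
    mixture = [sum(p[x] for p in components) / M for x in range(n)]
    total = 0.0
    for p in components:
        for x in range(n):
            if p[x] > 0.0:
                total += p[x] * math.log(p[x] / mixture[x])
    return total / M

def decoder_error(components, decoder):
    """Exact P(V_hat(X) != V) when decoder[x] is the index guessed at observation x."""
    M = len(components)
    correct = sum(components[decoder[x]][x] for x in range(len(decoder)))
    return 1.0 - correct / M

def fano_lower_bound(info, M):
    """Lower bound 1 - (I(V;X) + log 2) / log M on the testing error."""
    return 1.0 - (info + math.log(2)) / math.log(M)

# Hypothetical component laws on {0, 1, 2, 3} (assumed for illustration).
components = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
M = len(components)

# Maximum-likelihood decoder: guess the component with the largest mass at x.
ml_decoder = [max(range(M), key=lambda i: components[i][x]) for x in range(4)]

info = mutual_information(components)
error = decoder_error(components, ml_decoder)
bound = fano_lower_bound(info, M)
```

In this toy experiment the decoder guesses correctly with probability $0.7$ under each component, so its error is $0.3$, which lies above the Fano lower bound as the inequality requires. The bound applies to every decoder, not only the maximum-likelihood one.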
