Yang-Barron Entropy Lower Bound — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] Fix an admissible scale $\varepsilon$ and reduce estimation over $\Theta$ to testing among the finite packing $\Theta_\varepsilon$. Any estimator whose expected metric error is small can be converted into a decoder of the packing point with small testing error, because distinct packing points are separated by at least $\varepsilon$. The average Kullback-Leibler condition bounds the mutual information in this finite testing problem, and the elementary Fano argument gives the stated lower bound. Since the argument applies independently at every admissible scale, taking the supremum over scales gives the final assertion. [/proofplan] [step:Fix an admissible packing and build the associated testing problem] Fix an admissible $\varepsilon>0$. Let \begin{align*} \Theta_\varepsilon=\{\theta_1,\dots,\theta_M\} \end{align*} be an admissible $\varepsilon$-packing, where $M:=|\Theta_\varepsilon|\ge 2$, and let $Q_\varepsilon$ be the corresponding reference probability measure. Thus \begin{align*} d(\theta_i,\theta_j)\ge \varepsilon \end{align*} for all distinct $i,j\in\{1,\dots,M\}$, and \begin{align*} \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon) \le \alpha\log M. \end{align*} Define the finite index set $I:=\{1,\dots,M\}$. Let $V:I\to I$ denote a uniformly distributed random index, and conditionally on $V=i$, let $X:(I\times\mathcal X,\mathcal P(I)\otimes\mathcal A)\to(\mathcal X,\mathcal A)$ have distribution $P_{\theta_i}$. Equivalently, the joint law of $(V,X)$ is the probability measure $\mathbb P$ determined by \begin{align*} \mathbb P(V=i, X\in A)=\frac{1}{M}P_{\theta_i}(A) \end{align*} for every $i\in I$ and every $A\in\mathcal A$. The marginal law of $X$ under $\mathbb P$ is denoted by $\bar P$. [/step] [step:Convert every estimator into a nearest packing decoder] Let $\hat\theta:\mathcal X\to\Theta$ be any estimator. Define a decoder $\hat V:\mathcal X\to I$ by choosing an index attaining the minimum distance from $\hat\theta(x)$ to the packing: \begin{align*} \hat V(x)\in\operatorname*{arg\,min}_{j\in I} d(\hat\theta(x),\theta_j). \end{align*} Since $I$ is finite, such an index exists; ties are resolved by the smallest index. For each $k\in I$, the preimage $\{x\in\mathcal X:\hat V(x)=k\}$ is the finite intersection of the measurable comparison sets $\{x:d(\hat\theta(x),\theta_k)\le d(\hat\theta(x),\theta_j)\}$ for $j\in I$ and $\{x:d(\hat\theta(x),\theta_k)<d(\hat\theta(x),\theta_j)\}$ for $j<k$, so $\hat V$ is measurable. For each $i\in I$, on the event $\{\hat V(X)\ne i\}$ under $P_{\theta_i}$, the triangle inequality and the definition of $\hat V$ give \begin{align*} \varepsilon \le d(\theta_i,\theta_{\hat V(X)}) \le d(\theta_i,\hat\theta(X))+d(\hat\theta(X),\theta_{\hat V(X)}) \le 2d(\theta_i,\hat\theta(X)). \end{align*} Therefore \begin{align*} d(\hat\theta(X),\theta_i)\ge \frac{\varepsilon}{2}\mathbb 1_{\{\hat V(X)\ne i\}}. \end{align*} Integrating with respect to $P_{\theta_i}$ yields \begin{align*} \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}P_{\theta_i}(\hat V(X)\ne i). \end{align*} Averaging over $i\in I$ and using $\sup_{\theta\in\Theta}$ to dominate the average over the finite subset $\Theta_\varepsilon$, we obtain \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)]. \end{align*} The pointwise testing reduction gives \begin{align*} \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i). \end{align*} By the definition of the joint law $\mathbb P$, \begin{align*} \frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i) = \mathbb P(\hat V(X)\ne V). \end{align*} Therefore \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{\varepsilon}{2}\mathbb P(\hat V(X)\ne V). \end{align*} [guided] The purpose of the finite packing is to turn estimation into testing. Under the joint law $\mathbb P$, the random index $V$ is uniform on $I=\{1,\dots,M\}$, and conditionally on $V=i$ the observation $X$ has law $P_{\theta_i}$. Thus the average testing error $\mathbb P(\hat V(X)\ne V)$ is exactly the average, over packing points, of the probability of failing to identify the true index. If an estimator $\hat\theta$ estimates $\theta_i$ within distance less than $\varepsilon/2$, then it must identify $\theta_i$ as the nearest packing point, because every other packing point is at least $\varepsilon$ away from $\theta_i$. Formally, for each observation $x\in\mathcal X$, define $\hat V(x)$ to be a nearest packing index: \begin{align*} \hat V(x)\in\operatorname*{arg\,min}_{j\in I} d(\hat\theta(x),\theta_j), \end{align*} with ties broken by the smallest index. Because $I$ is finite, this minimum is attained. The same finite comparison description used in the exact proof shows that $\hat V$ is measurable, so $\hat V(X)$ is an $I$-valued [random variable](/page/Random%20Variable). Now fix $i\in I$ and work under the law $P_{\theta_i}$. On the event $\{\hat V(X)\ne i\}$, the two packing points $\theta_i$ and $\theta_{\hat V(X)}$ are distinct, hence packing separation gives \begin{align*} d(\theta_i,\theta_{\hat V(X)})\ge\varepsilon. \end{align*} The triangle inequality gives \begin{align*} d(\theta_i,\theta_{\hat V(X)}) \le d(\theta_i,\hat\theta(X))+d(\hat\theta(X),\theta_{\hat V(X)}). \end{align*} By the definition of $\hat V(X)$ as a nearest packing index, \begin{align*} d(\hat\theta(X),\theta_{\hat V(X)}) \le d(\hat\theta(X),\theta_i). \end{align*} Combining these three inequalities gives \begin{align*} \varepsilon \le 2d(\hat\theta(X),\theta_i) \end{align*} on $\{\hat V(X)\ne i\}$. Equivalently, \begin{align*} d(\hat\theta(X),\theta_i) \ge \frac{\varepsilon}{2}\mathbb 1_{\{\hat V(X)\ne i\}}. \end{align*} Integrating this pointwise inequality with respect to $P_{\theta_i}$ gives \begin{align*} \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}P_{\theta_i}(\hat V(X)\ne i). \end{align*} Averaging over the uniformly chosen index $V$ first gives \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)]. \end{align*} Using the inequality just proved for each $i\in I$ gives \begin{align*} \frac{1}{M}\sum_{i=1}^M \mathbb E_{\theta_i}[d(\hat\theta,\theta_i)] \ge \frac{\varepsilon}{2}\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i). \end{align*} Finally, the joint law $\mathbb P$ was defined so that \begin{align*} \frac{1}{M}\sum_{i=1}^M P_{\theta_i}(\hat V(X)\ne i) = \mathbb P(\hat V(X)\ne V). \end{align*} Combining these three displayed inequalities yields \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{\varepsilon}{2}\mathbb P(\hat V(X)\ne V). \end{align*} Thus every estimator induces a finite testing rule, and its estimation risk controls the testing error from below. [/guided] [/step] [step:Prove the finite Fano bound and compare the mixture information to the reference divergence] Let $P_e:=\mathbb P(\hat V(X)\ne V)$ denote the testing error of the decoder. For a finite-valued random variable $Z$, let $H(Z)$ denote its Shannon entropy. For a finite-valued random variable $Z$ and a sub-$\sigma$-algebra $\mathcal G$, let $H(Z\mid\mathcal G)$ denote the conditional Shannon entropy, defined as the expectation of the entropy of the regular conditional distribution of $Z$ given $\mathcal G$. We write $H(V\mid X)$ for $H(V\mid\sigma(X))$. Let $I(V;X)$ denote the mutual information of the finite mixture experiment, equivalently the relative entropy between the joint law of $(V,X)$ and the product of its marginals. Since $V$ is uniform on the finite set $I$, its entropy is $H(V)=\log M$. The conditional entropy satisfies the following elementary entropy estimate: \begin{align*} H(V\mid X)\le \log 2+P_e\log M. \end{align*} Indeed, with $E:=\mathbb 1_{\{\hat V(X)\ne V\}}$, the value of $E$ is determined by $(V,\hat V(X))$, so conditioning and the chain rule for finite entropy give \begin{align*} H(V\mid X)\le H(E,V\mid X,\hat V(X)). \end{align*} The chain rule gives \begin{align*} H(E,V\mid X,\hat V(X))=H(E\mid X,\hat V(X))+H(V\mid E,X,\hat V(X)). \end{align*} The first term is at most $\log 2$ because $E$ is binary. On $\{E=0\}$, $V=\hat V(X)$ is determined; on $\{E=1\}$, $V$ takes values in at most $M-1$ indices. Hence \begin{align*} H(V\mid E,X,\hat V(X))\le P_e\log(M-1)\le P_e\log M. \end{align*} Combining these inequalities proves the displayed entropy estimate. Since $I(V;X)=H(V)-H(V\mid X)$, we obtain \begin{align*} P_e\ge 1-\frac{I(V;X)+\log 2}{\log M}. \end{align*} The marginal law $\bar P$ of $X$ under $\mathbb P$ is \begin{align*} \bar P(A):=\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(A) \end{align*} for every $A\in\mathcal A$. By the definition of mutual information for the finite mixture experiment, \begin{align*} I(V;X)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P). \end{align*} We now compare this quantity with the reference divergences. If $D(P_{\theta_i}\|Q_\varepsilon)=\infty$ for some $i$, the admissibility inequality cannot hold because its left side is finite. Thus each $P_{\theta_i}$ is absolutely continuous with respect to $Q_\varepsilon$. Let $p_i:\mathcal X\to[0,\infty)$ be a Radon-Nikodym density of $P_{\theta_i}$ with respect to $Q_\varepsilon$, and let $\bar p:\mathcal X\to[0,\infty)$ be the density of $\bar P$ with respect to $Q_\varepsilon$ defined by \begin{align*} \bar p(x):=\frac{1}{M}\sum_{i=1}^M p_i(x). \end{align*} For each $i$, the set where $p_i>0$ and $\bar p=0$ has $Q_\varepsilon$-measure zero, since $\bar p\ge p_i/M$. Therefore $P_{\theta_i}$ is absolutely continuous with respect to $\bar P$, and the divergence is well-defined by the density ratio $p_i/\bar p$. The convexity of $t\mapsto t\log t$ on $[0,\infty)$ gives \begin{align*} D(\bar P\|Q_\varepsilon) =\int_{\mathcal X}\bar p(x)\log \bar p(x)\,dQ_\varepsilon(x) \le \frac{1}{M}\sum_{i=1}^M\int_{\mathcal X}p_i(x)\log p_i(x)\,dQ_\varepsilon(x) = \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)<\infty. \end{align*} Using the identity $\log(p_i/\bar p)=\log p_i-\log\bar p$ on $\{p_i>0\}$ and averaging over $i$, we get \begin{align*} \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)-D(\bar P\|Q_\varepsilon). \end{align*} Moreover $D(\bar P\|Q_\varepsilon)\ge 0$, since $t\log t-t+1\ge0$ for every $t\ge0$ and $\int_{\mathcal X}\bar p(x)\,dQ_\varepsilon(x)=1$. Hence \begin{align*} I(V;X)\le \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon). \end{align*} Using the admissibility hypothesis, we obtain \begin{align*} I(V;X)\le \alpha\log M. \end{align*} Substituting this into the finite Fano bound gives \begin{align*} \mathbb P(\hat V(X)\ne V)\ge 1-\alpha-\frac{\log 2}{\log M}. \end{align*} [guided] We need a lower bound on the probability that the decoder identifies the wrong packing index. Let \begin{align*} P_e:=\mathbb P(\hat V(X)\ne V) \end{align*} denote this testing error. For a finite-valued random variable $Z$, $H(Z)$ denotes its Shannon entropy. For a finite-valued random variable $Z$ and a sub-$\sigma$-algebra $\mathcal G$, $H(Z\mid\mathcal G)$ denotes the expected entropy of the regular conditional distribution of $Z$ given $\mathcal G$; in particular, $H(V\mid X)$ means $H(V\mid\sigma(X))$. The mutual information $I(V;X)$ is the relative entropy between the joint law of $(V,X)$ and the product of the marginal laws of $V$ and $X$. In this finite mixture setting this is also the average divergence from each component law to the mixture law, as verified below. Because $V$ is uniform on $I=\{1,\dots,M\}$, its entropy is $H(V)=\log M$. The information identity \begin{align*} I(V;X)=H(V)-H(V\mid X) \end{align*} shows that a lower bound on $H(V\mid X)$ becomes an upper bound on how well $X$ can reveal $V$. We prove the finite Fano estimate directly. Define the error indicator $E:=\mathbb 1_{\{\hat V(X)\ne V\}}$. Since $E$ is determined by the pair $(V,\hat V(X))$, conditioning and the chain rule for finite entropy give \begin{align*} H(V\mid X)\le H(E,V\mid X,\hat V(X)). \end{align*} Applying the chain rule again gives \begin{align*} H(E,V\mid X,\hat V(X))=H(E\mid X,\hat V(X))+H(V\mid E,X,\hat V(X)). \end{align*} The variable $E$ is binary, hence $H(E\mid X,\hat V(X))\le\log 2$. For the second term, split according to the value of $E$. If $E=0$, then $V=\hat V(X)$, so no uncertainty about $V$ remains. If $E=1$, then $V$ can be any index except $\hat V(X)$, so there are at most $M-1$ possibilities. Therefore \begin{align*} H(V\mid E,X,\hat V(X))\le P_e\log(M-1)\le P_e\log M. \end{align*} Combining the entropy bounds gives \begin{align*} H(V\mid X)\le \log 2+P_e\log M. \end{align*} Substituting $H(V\mid X)=\log M-I(V;X)$ and rearranging yields \begin{align*} P_e\ge 1-\frac{I(V;X)+\log 2}{\log M}. \end{align*} It remains to bound the mutual information by the admissibility assumption. The marginal law of $X$ is the mixture measure $\bar P$ defined by \begin{align*} \bar P(A):=\frac{1}{M}\sum_{i=1}^M P_{\theta_i}(A) \end{align*} for every $A\in\mathcal A$. For this finite mixture experiment, mutual information is the average divergence from the component laws to the mixture: \begin{align*} I(V;X)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P). \end{align*} The admissibility condition bounds divergences to $Q_\varepsilon$, not to $\bar P$, so we compare the two. Since \begin{align*} \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)<\infty, \end{align*} each $D(P_{\theta_i}\|Q_\varepsilon)$ is finite. Hence each $P_{\theta_i}$ is absolutely continuous with respect to $Q_\varepsilon$. Let $p_i:\mathcal X\to[0,\infty)$ be a Radon-Nikodym density of $P_{\theta_i}$ with respect to $Q_\varepsilon$. Then the mixture $\bar P$ has density $\bar p:\mathcal X\to[0,\infty)$ given by \begin{align*} \bar p(x):=\frac{1}{M}\sum_{i=1}^M p_i(x). \end{align*} Because $\bar p\ge p_i/M$, the ratio $p_i/\bar p$ is well-defined $P_{\theta_i}$-almost everywhere. Before subtracting the mixture divergence, we check that it is finite. The convexity of $t\mapsto t\log t$ on $[0,\infty)$ gives \begin{align*} D(\bar P\|Q_\varepsilon) =\int_{\mathcal X}\bar p(x)\log \bar p(x)\,dQ_\varepsilon(x) \le \frac{1}{M}\sum_{i=1}^M\int_{\mathcal X}p_i(x)\log p_i(x)\,dQ_\varepsilon(x) = \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)<\infty. \end{align*} Therefore \begin{align*} \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|\bar P)=\frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon)-D(\bar P\|Q_\varepsilon). \end{align*} This is the logarithmic decomposition $\log(p_i/\bar p)=\log p_i-\log\bar p$ averaged over $i$. Also $D(\bar P\|Q_\varepsilon)\ge0$, because $t\log t-t+1\ge0$ for every $t\ge0$ and $\int_{\mathcal X}\bar p(x)\,dQ_\varepsilon(x)=1$. Hence \begin{align*} I(V;X)\le \frac{1}{M}\sum_{i=1}^M D(P_{\theta_i}\|Q_\varepsilon). \end{align*} The admissibility hypothesis gives \begin{align*} I(V;X)\le \alpha\log M. \end{align*} Substituting this into Fano's bound gives the testing lower bound \begin{align*} \mathbb P(\hat V(X)\ne V)\ge 1-\alpha-\frac{\log 2}{\log M}. \end{align*} [/guided] [/step] [step:Take the infimum over estimators and then optimize over admissible scales] Combining the [testing-to-estimation reduction](/theorems/5895) with the testing lower bound gives, for every estimator $\hat\theta$, \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{\varepsilon}{2} \left( 1-\alpha-\frac{\log 2}{\log M} \right). \end{align*} Since $M=|\Theta_\varepsilon|$, taking the infimum over all estimators gives \begin{align*} \inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat\theta,\theta)] \ge \frac{\varepsilon}{2} \left( 1-\alpha-\frac{\log 2}{\log |\Theta_\varepsilon|} \right). \end{align*} This proves the asserted bound for the fixed admissible scale $\varepsilon$. The preceding derivation used only the admissibility conditions for the fixed scale $\varepsilon$ and therefore yields one valid lower bound for each admissible scale. Taking the supremum of these valid lower bounds over all admissible $\varepsilon$ for which \begin{align*} 1-\alpha-\frac{\log 2}{\log |\Theta_\varepsilon|}>0 \end{align*} gives the strongest lower bound produced by the theorem. [/step]

Prerequisites (0/7 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Triangle Inequality For Inner Product Spaces

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.

Yang-Barron Entropy Lower Bound (Theorem # 5905)

Discussion

Proof

Prerequisites (0/7 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Yang-Barron Entropy Lower Bound (Theorem # 5905)

Discussion

Proof

Prerequisites (0/7 completed)

Prerequisites Graph

Explore Further