Fano Testing-to-Estimation Reduction — Statement & Proof

Fano Testing-to-Estimation Reduction (Theorem # 5900)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We reduce estimation over $\Theta$ to testing among the finite packing points $\theta_1,\dots,\theta_M$. Given an estimator $\hat{\theta}$, we project its output to the nearest packing point and thereby obtain an index estimator $\hat V$. The $2s$-separation implies that an estimation error smaller than $s$ forces correct testing, so the average estimation risk dominates $s$ times the testing error probability. [Fano's inequality](/theorems/1654) controls this testing error from below, and the assumed average Kullback-Leibler bound controls the mutual information through the mixture distribution. [/proofplan] [step:Place the uniform prior on the packing and define the induced testing problem] Let $V:(\Omega,\mathcal F)\to\{1,\dots,M\}$ denote a uniformly distributed random index on the underlying probability space, so \begin{align*} \mathbb P(V=j)=\frac{1}{M} \end{align*} for every $j\in\{1,\dots,M\}$. Conditional on the event $\{V=j\}$, let the data random element $X:(\Omega,\mathcal F)\to(\mathcal X,\mathcal A)$ have distribution $P_{\theta_j}$. Thus the joint law of $(V,X)$ is determined by \begin{align*} \mathbb P(V=j,\ X\in A)=\frac{1}{M}P_{\theta_j}(A) \end{align*} for every $j\in\{1,\dots,M\}$ and every $A\in\mathcal A$. Fix a measurable estimator $\hat{\theta}:\mathcal X\to\Theta$. Define the finite-set nearest-neighbour index estimator $\hat V:\mathcal X\to\{1,\dots,M\}$ by choosing the smallest minimising index: \begin{align*} \hat V(x):=\min\left\{j\in\{1,\dots,M\}: d(\hat{\theta}(x),\theta_j)=\min_{1\leq k\leq M}d(\hat{\theta}(x),\theta_k)\right\}. \end{align*} Since the minimum is taken over a finite set and $d(\hat{\theta}(\cdot),\theta_j)$ is measurable for each $j$, the map $\hat V$ is measurable. [/step] [step:Convert small estimation error into correct identification] For every outcome $\omega\in\Omega$, if \begin{align*} d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s, \end{align*} then $\hat V(X(\omega))=V(\omega)$. Indeed, if $k\neq V(\omega)$, the packing assumption and the triangle inequality give \begin{align*} d(\hat{\theta}(X(\omega)),\theta_k)\geq d(\theta_{V(\omega)},\theta_k)-d(\hat{\theta}(X(\omega)),\theta_{V(\omega)}). \end{align*} The packing assumption gives $d(\theta_{V(\omega)},\theta_k)\geq 2s$, while the assumed error bound gives $d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s$. Hence \begin{align*} d(\hat{\theta}(X(\omega)),\theta_k)>s. \end{align*} Since also $d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s$, we have \begin{align*} d(\hat{\theta}(X(\omega)),\theta_k)>d(\hat{\theta}(X(\omega)),\theta_{V(\omega)}). \end{align*} Hence $\theta_{V(\omega)}$ is the unique nearest packing point to $\hat{\theta}(X(\omega))$, so the tie-breaking rule gives $\hat V(X(\omega))=V(\omega)$. Taking the contrapositive yields \begin{align*} \{\hat V(X)\neq V\}\subseteq \{d(\hat{\theta}(X),\theta_V)\geq s\}. \end{align*} Therefore, pointwise on $\Omega$, \begin{align*} d(\hat{\theta}(X),\theta_V)\geq s\,\mathbb 1_{\{\hat V(X)\neq V\}}. \end{align*} [guided] The purpose of the nearest-neighbour construction is to turn an estimator of the parameter into a test of which packing point generated the data. The separation assumption is exactly what makes this reduction quantitative. Fix an outcome $\omega\in\Omega$ and suppose that the estimator is within distance $s$ of the true packing point: \begin{align*} d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s. \end{align*} We compare the distance from $\hat{\theta}(X(\omega))$ to the true packing point with its distance to any other packing point. Let $k\in\{1,\dots,M\}$ with $k\neq V(\omega)$. Since the packing points are $2s$-separated, \begin{align*} d(\theta_{V(\omega)},\theta_k)\geq 2s. \end{align*} The triangle inequality in the form $d(a,c)\geq d(b,c)-d(a,b)$, applied with $a=\hat{\theta}(X(\omega))$, $b=\theta_{V(\omega)}$, and $c=\theta_k$, gives \begin{align*} d(\hat{\theta}(X(\omega)),\theta_k)\geq d(\theta_{V(\omega)},\theta_k)-d(\hat{\theta}(X(\omega)),\theta_{V(\omega)}). \end{align*} Combining $d(\theta_{V(\omega)},\theta_k)\geq 2s$ with $d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s$ yields \begin{align*} d(\hat{\theta}(X(\omega)),\theta_k)>s. \end{align*} At the same time, the assumed small-error event says \begin{align*} d(\hat{\theta}(X(\omega)),\theta_{V(\omega)})<s. \end{align*} Thus every incorrect packing point $\theta_k$ is farther from $\hat{\theta}(X(\omega))$ than the true packing point $\theta_{V(\omega)}$. Consequently the nearest-neighbour rule must select the true index: \begin{align*} \hat V(X(\omega))=V(\omega). \end{align*} Equivalently, if the induced test makes an error, then the estimator cannot have been within distance $s$ of the true parameter. Hence \begin{align*} \{\hat V(X)\neq V\}\subseteq \{d(\hat{\theta}(X),\theta_V)\geq s\}. \end{align*} Multiplying the indicator of the testing error by $s$ gives the pointwise lower bound \begin{align*} d(\hat{\theta}(X),\theta_V)\geq s\,\mathbb 1_{\{\hat V(X)\neq V\}}. \end{align*} This is the central [testing-to-estimation reduction](/theorems/5895): a lower bound on testing error becomes a lower bound on metric estimation risk. [/guided] [/step] [step:Average the pointwise bound under the finite prior] Taking expectation with respect to the joint law of $(V,X)$ in the pointwise inequality gives \begin{align*} \mathbb E[d(\hat{\theta}(X),\theta_V)] \geq s\,\mathbb P(\hat V(X)\neq V). \end{align*} By the definition of the joint law, \begin{align*} \mathbb E[d(\hat{\theta}(X),\theta_V)] = \frac{1}{M}\sum_{j=1}^{M}\mathbb E_{\theta_j}[d(\hat{\theta},\theta_j)]. \end{align*} Since the supremum over $\Theta$ dominates the average over the finite subset $\{\theta_1,\dots,\theta_M\}$, \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat{\theta},\theta)] \geq \frac{1}{M}\sum_{j=1}^{M}\mathbb E_{\theta_j}[d(\hat{\theta},\theta_j)] \geq s\,\mathbb P(\hat V(X)\neq V). \end{align*} [/step] [step:Bound the mutual information by the average divergence to $Q$] Let $\bar P$ be the mixture probability measure on $(\mathcal X,\mathcal A)$ defined by \begin{align*} \bar P(A):=\frac{1}{M}\sum_{j=1}^{M}P_{\theta_j}(A) \end{align*} for every $A\in\mathcal A$. The mutual information between $V$ and $X$ is \begin{align*} I(V;X)=\frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|\bar P). \end{align*} We claim that \begin{align*} I(V;X)\leq \frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q). \end{align*} If the right-hand side is $+\infty$, the inequality is immediate. Otherwise each $P_{\theta_j}$ is absolutely continuous with respect to $Q$. Let $\mu$ be the probability measure \begin{align*} \mu:=Q+\sum_{j=1}^{M}P_{\theta_j} \end{align*} renormalised by the positive constant $M+1$, and let $p_j:\mathcal X\to[0,\infty)$, $\bar p:\mathcal X\to[0,\infty)$, and $q:\mathcal X\to[0,\infty)$ be Radon-Nikodym densities of $P_{\theta_j}$, $\bar P$, and $Q$ with respect to $\mu$. Since $\bar p=M^{-1}\sum_{j=1}^{M}p_j$, direct expansion of the relative entropies gives \begin{align*} \frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q)=\frac{1}{M}\sum_{j=1}^{M}\int_{\mathcal X}p_j(x)\log\frac{p_j(x)}{q(x)}\,d\mu(x). \end{align*} Using $p_j/q=(p_j/\bar p)(\bar p/q)$ on the set where the densities are positive, with the standard extended-value convention for relative entropy, this becomes \begin{align*} \frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q)=\frac{1}{M}\sum_{j=1}^{M}\int_{\mathcal X}p_j(x)\log\frac{p_j(x)}{\bar p(x)}\,d\mu(x)+\int_{\mathcal X}\bar p(x)\log\frac{\bar p(x)}{q(x)}\,d\mu(x). \end{align*} Therefore \begin{align*} \frac{1}{M}\sum_{j=1}^{M}D(P_{\theta_j}\|Q)=I(V;X)+D(\bar P\|Q). \end{align*} Since $D(\bar P\|Q)\geq 0$, the claimed inequality follows. Therefore the hypothesis implies \begin{align*} I(V;X)\leq \alpha\log M. \end{align*} [/step] [step:Apply Fano's inequality to the induced test] We apply Fano's inequality to the uniformly distributed index $V\in\{1,\dots,M\}$ and the estimator $\hat V(X)$ of $V$ (citing a result not yet in the wiki: Fano's inequality). Its hypotheses are satisfied because $M\geq 2$, $V$ is uniform on a finite set of cardinality $M$, and $\hat V(X)$ takes values in the same finite set. It gives \begin{align*} \mathbb P(\hat V(X)\neq V) \geq 1-\frac{I(V;X)+\log 2}{\log M}. \end{align*} Using the bound $I(V;X)\leq \alpha\log M$, we obtain \begin{align*} \mathbb P(\hat V(X)\neq V) \geq 1-\alpha-\frac{\log 2}{\log M}. \end{align*} [/step] [step:Combine the testing and estimation bounds] Substituting the Fano lower bound into the average-risk inequality first gives \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat{\theta},\theta)]\geq s\,\mathbb P(\hat V(X)\neq V). \end{align*} Using $\mathbb P(\hat V(X)\neq V)\geq 1-\alpha-\frac{\log 2}{\log M}$, we obtain \begin{align*} \sup_{\theta\in\Theta}\mathbb E_\theta[d(\hat{\theta},\theta)]\geq s\left(1-\alpha-\frac{\log 2}{\log M}\right). \end{align*} The estimator $\hat{\theta}:\mathcal X\to\Theta$ was arbitrary, so this lower bound holds for every measurable estimator. This proves the claimed minimax testing-to-estimation reduction. [/step]

Prerequisites (0/9 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

Explore Further

Distribution Definition Expectation Definition Event Definition Matrix Definition Set Definition test Theorem #89 Testing-to-Estimation Reduction Theorem #5895 Fano's Inequality Theorem #1654 Triangle Inequality For Inner Product Spaces Theorem #433 Spiked Covariance Eigenvalue And Eigenvector Threshold Probability & Statistics Doob's Maximal Inequality Martingale Theory Independence Under Complementation Probability Theory Comparison of Lasso and Dantzig Selector Rates Probability & Statistics Davis-Kahan Sin Theta Theorem Probability & Statistics Moments of Branching Processes Probability Theory Moments from the MGF Probability Theory Pivotal Score Bound for the Square-Root Lasso Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.

Fano Testing-to-Estimation Reduction (Theorem # 5900)

Discussion

Proof

Prerequisites (0/9 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Fano Testing-to-Estimation Reduction (Theorem # 5900)

Discussion

Proof

Prerequisites (0/9 completed)

Prerequisites Graph

Explore Further