Average KL Divergence Bound for Mutual Information (Theorem # 5899)

Proof

[proofplan] We introduce the mixture law $\overline{P}$ of the observation $X$ and express mutual information as the average divergence from each conditional law $P_j$ to $\overline{P}$. We then compare this expression with an arbitrary reference measure $Q$ by expanding Radon-Nikodym derivatives, obtaining an exact decomposition with a non-negative remainder $D(\overline{P}\|Q)$. Finally, when all pairwise divergences are finite, we use the convexity of relative entropy in its second argument to bound each $D(P_j\|\overline{P})$ by the average of the divergences $D(P_j\|P_k)$. [/proofplan] [step:Identify the marginal law of $X$ and express mutual information through conditional laws] Define the mixture probability measure $\overline{P}$ on $(E,\mathcal{E})$ by \begin{align*} \overline{P}(A) := \frac{1}{M}\sum_{j=1}^{M} P_j(A), \qquad A \in \mathcal{E}. \end{align*} Since $V$ is uniform and the conditional law of $X$ given $V=j$ is $P_j$, the marginal law of $X$ is $\overline{P}$. For every $j \in \{1,\dots,M\}$, $P_j \ll \overline{P}$: if $\overline{P}(A)=0$, then $\sum_{k=1}^{M}P_k(A)=0$, hence $P_j(A)=0$. Let $r_j: E \to [0,\infty)$ be a Radon-Nikodym derivative, defined $\overline{P}$-almost everywhere by \begin{align*} r_j(x) := \frac{dP_j}{d\overline{P}}(x). \end{align*} The joint law of $(V,X)$ is \begin{align*} \mathbb{P}_{V,X}(\{j\}\times A)=\frac{1}{M}P_j(A), \qquad j \in \{1,\dots,M\},\ A \in \mathcal{E}, \end{align*} while the product of the marginals is \begin{align*} (\mathbb{P}_V\otimes \mathbb{P}_X)(\{j\}\times A)=\frac{1}{M}\overline{P}(A). \end{align*} Therefore the Radon-Nikodym derivative of $\mathbb{P}_{V,X}$ with respect to $\mathbb{P}_V\otimes \mathbb{P}_X$ is $r_j(x)$ on $\{j\}\times E$. By the definition of mutual information as relative entropy of the joint law with respect to the product law, \begin{align*} I(V;X)= \sum_{j=1}^{M}\frac{1}{M}\int_E \log r_j(x)\, dP_j(x). 
\end{align*} Using the definition of relative entropy for each pair $(P_j,\overline{P})$, this becomes \begin{align*} I(V;X)= \frac{1}{M}\sum_{j=1}^{M} D(P_j\|\overline{P}). \end{align*} [/step] [step:Decompose the average divergence through an arbitrary reference law $Q$] If $D(P_j\|Q)=+\infty$ for some $j$, then \begin{align*} I(V;X) \leq \frac{1}{M}\sum_{j=1}^{M}D(P_j\|Q) \end{align*} holds because the right-hand side is $+\infty$. Assume therefore that $D(P_j\|Q)<+\infty$ for all $j$. Then $P_j\ll Q$ for all $j$, and hence $\overline{P}\ll Q$. Let $p_j: E \to [0,\infty)$ be a Radon-Nikodym derivative, defined $Q$-almost everywhere by \begin{align*} p_j(x) := \frac{dP_j}{dQ}(x), \end{align*} and let $\overline{p}: E \to [0,\infty)$ be a Radon-Nikodym derivative, defined $Q$-almost everywhere by \begin{align*} \overline{p}(x) := \frac{d\overline{P}}{dQ}(x). \end{align*} Since $\overline{P}=M^{-1}\sum_{k=1}^{M}P_k$, we may choose \begin{align*} \overline{p}(x)=\frac{1}{M}\sum_{k=1}^{M}p_k(x) \end{align*} for $Q$-almost every $x \in E$. Also, \begin{align*} \frac{dP_j}{d\overline{P}}(x)=\frac{p_j(x)}{\overline{p}(x)} \end{align*} for $P_j$-almost every $x \in E$. Moreover $p_j(x) \leq M\overline{p}(x)$ for $Q$-almost every $x\in E$, so $\log(p_j/\overline{p}) \leq \log M$ on the set where $p_j>0$. Thus the following use of the logarithm does not form an undefined $+\infty-\infty$ expression. Therefore \begin{align*} D(P_j\|\overline{P})= \int_E \log\left(\frac{p_j(x)}{\overline{p}(x)}\right)\, dP_j(x). \end{align*} Since $D(P_j\|Q)<+\infty$, the positive part of $\log p_j$ is $P_j$-integrable; the preceding upper bound controls the positive part of $\log(p_j/\overline{p})$. Hence the subtraction below is well-defined in the extended real sense: \begin{align*} D(P_j\|\overline{P})= \int_E \log p_j(x)\, dP_j(x)-\int_E \log \overline{p}(x)\, dP_j(x). 
\end{align*} Averaging over $j$ gives \begin{align*} I(V;X)= \frac{1}{M}\sum_{j=1}^{M}\int_E \log p_j(x)\, dP_j(x)-\frac{1}{M}\sum_{j=1}^{M}\int_E \log \overline{p}(x)\, dP_j(x). \end{align*} By the definition of $D(P_j\|Q)$ and the identity $\overline{P}=M^{-1}\sum_{j=1}^{M}P_j$, this is \begin{align*} I(V;X)= \frac{1}{M}\sum_{j=1}^{M}D(P_j\|Q)-\int_E \log \overline{p}(x)\, d\overline{P}(x). \end{align*} By the definition of $D(\overline{P}\|Q)$, we conclude \begin{align*} I(V;X)= \frac{1}{M}\sum_{j=1}^{M}D(P_j\|Q)-D(\overline{P}\|Q). \end{align*} [guided] The goal of this step is to compare every $P_j$ not with the true marginal $\overline{P}$, but with an arbitrary reference law $Q$. If some $D(P_j\|Q)$ is infinite, then the claimed upper bound is immediate because the right-hand side is $+\infty$. Thus the meaningful case is the finite case, where $D(P_j\|Q)<+\infty$ for every $j$. Finite relative entropy implies absolute continuity, so $P_j\ll Q$ for every $j$. Since $\overline{P}$ is the average of the measures $P_j$, this also gives $\overline{P}\ll Q$. Define the Radon-Nikodym derivative map $p_j: E \to [0,\infty)$ for $j\in\{1,\dots,M\}$ by \begin{align*} p_j(x) := \frac{dP_j}{dQ}(x) \end{align*} for $Q$-almost every $x\in E$, and define the Radon-Nikodym derivative map $\overline{p}: E \to [0,\infty)$ by \begin{align*} \overline{p}(x) := \frac{d\overline{P}}{dQ}(x) \end{align*} for $Q$-almost every $x\in E$. Because $\overline{P}=M^{-1}\sum_{k=1}^{M}P_k$, the derivative of the mixture is the mixture of the derivatives: \begin{align*} \overline{p}(x)=\frac{1}{M}\sum_{k=1}^{M}p_k(x) \end{align*} for $Q$-almost every $x\in E$. Now compare $P_j$ to $\overline{P}$. The [chain rule for Radon-Nikodym derivatives](/theorems/1208) gives \begin{align*} \frac{dP_j}{d\overline{P}}(x)=\frac{p_j(x)}{\overline{p}(x)} \end{align*} for $P_j$-almost every $x\in E$. 
Substituting this into the relative entropy gives \begin{align*} D(P_j\|\overline{P})= \int_E \log\left(\frac{p_j(x)}{\overline{p}(x)}\right)\, dP_j(x). \end{align*} Why is it legitimate to split the logarithm? Since $\overline{p}=M^{-1}\sum_{k=1}^{M}p_k$, we have $p_j\leq M\overline{p}$ $Q$-almost everywhere. Hence $\log(p_j/\overline{p})\leq \log M$ on the set where $p_j>0$. Also $D(P_j\|Q)<+\infty$, so the positive part of $\log p_j$ is integrable with respect to $P_j$. These two facts ensure that the following subtraction is not an undefined $+\infty-\infty$ expression: \begin{align*} D(P_j\|\overline{P})= \int_E \log p_j(x)\, dP_j(x)-\int_E \log \overline{p}(x)\, dP_j(x). \end{align*} The first integral is exactly $D(P_j\|Q)$. The second integral becomes a mixture integral after averaging over $j$: \begin{align*} \frac{1}{M}\sum_{j=1}^{M}\int_E \log \overline{p}(x)\, dP_j(x)= \int_E \log \overline{p}(x)\, d\overline{P}(x). \end{align*} By the definition of relative entropy with density $\overline{p}=d\overline{P}/dQ$, this last integral is \begin{align*} \int_E \log \overline{p}(x)\, d\overline{P}(x)= D(\overline{P}\|Q). \end{align*} Therefore \begin{align*} I(V;X) = \frac{1}{M}\sum_{j=1}^{M}D(P_j\|Q)-D(\overline{P}\|Q). \end{align*} This identity is the central point: the average divergence to the reference law $Q$ exceeds the mutual information by exactly the divergence from the marginal law $\overline{P}$ to $Q$. [/guided] [/step] [step:Use non-negativity of relative entropy to obtain the reference-law bound] We verify $D(\overline{P}\|Q)\geq 0$. Let $\overline{p}=d\overline{P}/dQ$. Since $\overline{P}$ is a probability measure, \begin{align*} \int_E \overline{p}(x)\, dQ(x)=1. \end{align*} Using the inequality $\log t \leq t-1$ for $t>0$ with $t=1/\overline{p}(x)$ on the set where $\overline{p}(x)>0$, we obtain \begin{align*} -\log \overline{p}(x) \leq \frac{1}{\overline{p}(x)}-1. 
\end{align*} By the definition of relative entropy, \begin{align*} -D(\overline{P}\|Q)= \int_E -\log \overline{p}(x)\, d\overline{P}(x). \end{align*} Using $d\overline{P}=\overline{p}\,dQ$, this equals \begin{align*} -D(\overline{P}\|Q)= \int_E -\overline{p}(x)\log \overline{p}(x)\, dQ(x). \end{align*} Multiply the pointwise inequality above by $\overline{p}(x)$ on the set where $\overline{p}(x)>0$. On the set where $\overline{p}(x)=0$, the integrand $-\overline{p}(x)\log \overline{p}(x)$ vanishes under the convention $0\log 0=0$, while $1-\overline{p}(x)=1$. Integrating with respect to $Q$ therefore gives \begin{align*} \int_E -\overline{p}(x)\log \overline{p}(x)\, dQ(x) \leq \int_E (1-\overline{p}(x))\, dQ(x). \end{align*} Since $Q$ and $\overline{P}$ are probability measures, \begin{align*} \int_E (1-\overline{p}(x))\, dQ(x)=Q(E)-\overline{P}(E)=0. \end{align*} Hence $D(\overline{P}\|Q)\geq 0$. Combining this with the decomposition from the previous step yields \begin{align*} I(V;X)\leq \frac{1}{M}\sum_{j=1}^{M}D(P_j\|Q). \end{align*} [/step] [step:Apply convexity in the second argument to obtain the pairwise bound] Assume $D(P_j\|P_k)<+\infty$ for every $j,k\in\{1,\dots,M\}$. Fix $j\in\{1,\dots,M\}$. Define the finite measure $\mu_j$ on $(E,\mathcal{E})$ by \begin{align*} \mu_j := P_j+\sum_{k=1}^{M}P_k. \end{align*} Let $p: E \to [0,\infty)$ be a Radon-Nikodym derivative, defined $\mu_j$-almost everywhere by \begin{align*} p(x):=\frac{dP_j}{d\mu_j}(x). \end{align*} For $k\in\{1,\dots,M\}$, let $q_k: E \to [0,\infty)$ be a Radon-Nikodym derivative, defined $\mu_j$-almost everywhere by \begin{align*} q_k(x):=\frac{dP_k}{d\mu_j}(x). \end{align*} Define $\overline{q}: E \to [0,\infty)$ by \begin{align*} \overline{q}(x):=\frac{1}{M}\sum_{k=1}^{M}q_k(x). \end{align*} Then $\overline{q}$ is a Radon-Nikodym derivative of $\overline{P}$ with respect to $\mu_j$. Since $D(P_j\|P_k)<+\infty$, we have $P_j\ll P_k$ for each $k$, so $q_k(x)>0$ for $P_j$-almost every $x\in E$. 
On the set where $p(x)>0$, we use [Jensen's inequality](/theorems/1977) for the convex function $a\mapsto -\log a$ on $(0,\infty)$. By the definition of $\overline{q}$, \begin{align*} -\log\left(\frac{\overline{q}(x)}{p(x)}\right)= -\log\left(\frac{1}{M}\sum_{k=1}^{M}\frac{q_k(x)}{p(x)}\right). \end{align*} Applying [Jensen's inequality](/theorems/9) to the numbers $q_k(x)/p(x)>0$ gives \begin{align*} -\log\left(\frac{1}{M}\sum_{k=1}^{M}\frac{q_k(x)}{p(x)}\right) \leq \frac{1}{M}\sum_{k=1}^{M}-\log\left(\frac{q_k(x)}{p(x)}\right). \end{align*} By the density representation of relative entropy with respect to the dominating measure $\mu_j$, \begin{align*} D(P_j\|\overline{P})= \int_E p(x)\log\left(\frac{p(x)}{\overline{q}(x)}\right)\, d\mu_j(x). \end{align*} Multiplying the pointwise bound from the [Jensen inequality](/theorems/515) by $p(x)$ and integrating with respect to $\mu_j$ gives \begin{align*} D(P_j\|\overline{P}) \leq \frac{1}{M}\sum_{k=1}^{M}\int_E p(x)\log\left(\frac{p(x)}{q_k(x)}\right)\, d\mu_j(x). \end{align*} Again by the density representation of relative entropy with respect to $\mu_j$, \begin{align*} \frac{1}{M}\sum_{k=1}^{M}\int_E p(x)\log\left(\frac{p(x)}{q_k(x)}\right)\, d\mu_j(x)= \frac{1}{M}\sum_{k=1}^{M}D(P_j\|P_k). \end{align*} Averaging this inequality over $j$ and using \begin{align*} I(V;X)=\frac{1}{M}\sum_{j=1}^{M}D(P_j\|\overline{P}) \end{align*} yields \begin{align*} I(V;X) \leq \frac{1}{M}\sum_{j=1}^{M}\frac{1}{M}\sum_{k=1}^{M}D(P_j\|P_k). \end{align*} Rewriting the double average gives \begin{align*} I(V;X)\leq \frac{1}{M^2}\sum_{j=1}^{M}\sum_{k=1}^{M}D(P_j\|P_k). \end{align*} This is the claimed pairwise KL bound. [/step]
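
As a numerical sanity check, the decomposition $I(V;X)=\frac{1}{M}\sum_j D(P_j\|Q)-D(\overline{P}\|Q)$ and the two resulting bounds can be tested on a finite alphabet. The sketch below is illustrative only and is not part of the proof. It assumes strictly positive, randomly drawn laws on a six-point space with natural logarithms, and the helper names `kl` and `mutual_information` are hypothetical names introduced here.

```python
import numpy as np


def kl(p, q):
    """Relative entropy D(p||q) in nats; assumes p and q are strictly positive."""
    return float(np.sum(p * np.log(p / q)))


def mutual_information(P):
    """I(V;X) for V uniform on the rows of P, where row j is the law P_j."""
    p_bar = P.mean(axis=0)  # mixture law, i.e. the marginal law of X
    return float(np.mean([kl(p_j, p_bar) for p_j in P]))


# Assumed toy setup: M = 4 laws on an alphabet of n = 6 points.
rng = np.random.default_rng(0)
M, n = 4, 6
P = rng.dirichlet(np.ones(n), size=M)  # rows are P_1, ..., P_M
Q = rng.dirichlet(np.ones(n))          # arbitrary reference law
p_bar = P.mean(axis=0)

mi = mutual_information(P)
avg_to_q = float(np.mean([kl(p_j, Q) for p_j in P]))
pairwise = float(np.mean([[kl(p_j, p_k) for p_k in P] for p_j in P]))

# Exact decomposition: the gap equals D(p_bar || Q) up to rounding.
print("gap         :", avg_to_q - mi)
print("D(p_bar||Q) :", kl(p_bar, Q))
# Both upper bounds from the proof.
print("I(V;X)      :", mi)
print("avg D(P_j||Q)        :", avg_to_q)
print("avg D(P_j||P_k), j,k :", pairwise)
```

On this discrete example the gap between the reference-law average and the mutual information should match $D(\overline{P}\|Q)$ up to floating-point error, and the pairwise average should dominate $I(V;X)$.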
