Chain Rule for Relative Entropy — Statement & Proof

Theorem

Proof

[proofplan] We prove the formula by disintegrating both measures with respect to their $Y$-marginals and factoring the Radon--Nikodym derivative into a marginal density and a conditional density. Taking the logarithm of this factorization turns the entropy integral into two terms. The first term integrates only over $F$ and gives $D(\nu_Y\|\mu_Y)$; the second term is the $\nu_Y$-average of the conditional entropies. The infinite cases follow from the same absolute-continuity factorization: failure at the marginal level or on a positive-measure family of conditional fibers forces failure of absolute continuity for the joint law. [/proofplan] [step:Reduce failure of marginal absolute continuity to failure of joint absolute continuity] Assume first that $\nu_Y\not\ll\mu_Y$. Then there exists a set $B\in\mathcal F$ such that $\mu_Y(B)=0$ and $\nu_Y(B)>0$. Define the measurable rectangle \begin{align*} A := E\times B \in \mathcal E\otimes\mathcal F. \end{align*} By the definition of the second marginal, \begin{align*} \mu(A)=\mu_Y(B)=0 \end{align*} and \begin{align*} \nu(A)=\nu_Y(B)>0. \end{align*} Hence $\nu\not\ll\mu$, so $D(\nu\|\mu)=+\infty$ by the definition of relative entropy. Also $D(\nu_Y\|\mu_Y)=+\infty$ by the same definition applied to the marginal measures. Thus the asserted infinite conclusion holds whenever $\nu_Y\not\ll\mu_Y$. [/step] [step:Factor the joint density into marginal and conditional densities] Assume now that $\nu_Y\ll\mu_Y$. By the [Radon--Nikodym theorem](/page/Radon-Nikodym%20Theorem) for the probability measures $\nu_Y$ and $\mu_Y$ on $(F,\mathcal F)$, define the marginal Radon--Nikodym density $g:F\to[0,\infty)$ by \begin{align*} g(y)=\frac{d\nu_Y}{d\mu_Y}(y). \end{align*} Assume first that $\nu_{X\mid Y=y}\ll\mu_{X\mid Y=y}$ for $\nu_Y$-a.e. $y\in F$. Because $d\nu_Y=g\,d\mu_Y$, the density $g$ vanishes $\mu_Y$-a.e. on every $\nu_Y$-null subset of $F$, so such sets contribute nothing to integrals taken against $g\,d\mu_Y$. 
We use the kernel Radon--Nikodym theorem for standard Borel kernels: if $K_y$ and $L_y$ are probability kernels from $(F,\mathcal F)$ to $(E,\mathcal E)$ and $K_y\ll L_y$ for $\lambda$-a.e. $y$, then after changing the kernels on a $\lambda$-null set there is an $\mathcal E\otimes\mathcal F$-measurable map $h:E\times F\to[0,\infty)$ such that, for $\lambda$-a.e. $y$, \begin{align*} h(\cdot,y)=\frac{dK_y}{dL_y}. \end{align*} Applying this theorem with $\lambda=\nu_Y$, $K_y=\nu_{X\mid Y=y}$, and $L_y=\mu_{X\mid Y=y}$ gives a jointly measurable conditional density $h:E\times F\to[0,\infty)$ such that \begin{align*} h(x,y)=\frac{d\nu_{X\mid Y=y}}{d\mu_{X\mid Y=y}}(x) \end{align*} for $\nu_Y$-a.e. $y$ and $\mu_{X\mid Y=y}$-a.e. $x$. Define $r:E\times F\to[0,\infty]$ by \begin{align*} r(x,y)=g(y)h(x,y). \end{align*} We claim that $r$ is a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Let $A\in\mathcal E\otimes\mathcal F$. By the disintegration theorem for standard Borel spaces for $\mu$ and Tonelli's theorem for the non-negative measurable function $(x,y)\mapsto \mathbb{1}_A(x,y)r(x,y)$, \begin{align*} \int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y)=\int_F g(y)\left(\int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x)\right)d\mu_Y(y). \end{align*} For $\nu_Y$-a.e. $y$, the definition of $h(\cdot,y)$ gives \begin{align*} \int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x) = \nu_{X\mid Y=y}(A_y), \end{align*} where \begin{align*} A_y := \{x\in E:(x,y)\in A\}. \end{align*} By the displayed definition of $g$, the previous display gives \begin{align*} \int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y) = \int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y). \end{align*} By disintegration of $\nu$, \begin{align*} \int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y)=\nu(A). \end{align*} Therefore \begin{align*} \frac{d\nu}{d\mu}(x,y)=g(y)h(x,y) \end{align*} for $\nu$-a.e. $(x,y)\in E\times F$. 
[guided] The goal is to express the joint likelihood ratio as a product of two simpler likelihood ratios: one for the $Y$-marginal and one for the conditional law of $X$ given $Y=y$. Since $\nu_Y\ll\mu_Y$, the [Radon--Nikodym theorem](/page/Radon-Nikodym%20Theorem) for probability measures on $(F,\mathcal F)$ gives the measurable marginal density $g:F\to[0,\infty)$ defined by \begin{align*} g(y)=\frac{d\nu_Y}{d\mu_Y}(y). \end{align*} Since we are in the case where $\nu_{X\mid Y=y}\ll\mu_{X\mid Y=y}$ for $\nu_Y$-a.e. $y$, the pointwise Radon--Nikodym derivatives must be chosen in a way that is measurable jointly in $(x,y)$. This is exactly where the standard Borel hypothesis is used: for probability kernels on standard Borel spaces, the kernel Radon--Nikodym theorem applies. Applied to the two probability kernels $y\mapsto\nu_{X\mid Y=y}$ and $y\mapsto\mu_{X\mid Y=y}$, it gives a jointly measurable map $h:E\times F\to[0,\infty)$ such that \begin{align*} h(x,y)=\frac{d\nu_{X\mid Y=y}}{d\mu_{X\mid Y=y}}(x) \end{align*} for $\nu_Y$-a.e. $y$ and $\mu_{X\mid Y=y}$-a.e. $x$. The exceptional set of $y$ is harmless in the later $\mu_Y$-integral because integration against $g\,d\mu_Y$ is the same as integration against $d\nu_Y$: \begin{align*} d\nu_Y(y)=g(y)\,d\mu_Y(y). \end{align*} The candidate joint density is therefore the measurable map $r:E\times F\to[0,\infty]$ defined by \begin{align*} r(x,y)=g(y)h(x,y). \end{align*} To verify that $r=d\nu/d\mu$, we test it on an arbitrary measurable set $A\in\mathcal E\otimes\mathcal F$. For each $y\in F$, define the fiber \begin{align*} A_y := \{x\in E:(x,y)\in A\}. \end{align*} The disintegration theorem for standard Borel spaces says that integration with respect to $\mu$ can be performed by first integrating over the conditional law $\mu_{X\mid Y=y}$ and then integrating over $\mu_Y$. 
Applying it together with Tonelli's theorem to the non-negative measurable function $(x,y)\mapsto \mathbb{1}_A(x,y)r(x,y)$ gives \begin{align*} \int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y) = \int_F g(y) \left( \int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x) \right) d\mu_Y(y). \end{align*} For $\nu_Y$-a.e. $y$, the function $h(\cdot,y)$ is the Radon--Nikodym derivative of $\nu_{X\mid Y=y}$ with respect to $\mu_{X\mid Y=y}$. Hence \begin{align*} \int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x) = \nu_{X\mid Y=y}(A_y). \end{align*} Now the factor $g(y)$ converts integration against $\mu_Y$ into integration against $\nu_Y$, by the displayed definition of $g$. Thus \begin{align*} \int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y) = \int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y). \end{align*} Finally, disintegration of $\nu$ identifies the last integral with $\nu(A)$: \begin{align*} \int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y)=\nu(A). \end{align*} Since this equality holds for every $A\in\mathcal E\otimes\mathcal F$, the function $r$ is a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Therefore \begin{align*} \frac{d\nu}{d\mu}(x,y)=g(y)h(x,y) \end{align*} for $\nu$-a.e. $(x,y)\in E\times F$. [/guided] [/step] [step:Integrate the logarithmic factorization] Keep the definitions from the previous step: $g$ is the marginal density, $h$ is the jointly measurable conditional density, and $r:E\times F\to[0,\infty]$ is defined by \begin{align*} r(x,y)=g(y)h(x,y). \end{align*} The previous step proved that $r$ is a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Since $r$ is a probability density with respect to the probability measure $\mu$, it is finite and positive for $\nu$-a.e. $(x,y)$. Also $g$ is finite and positive for $\nu_Y$-a.e. $y$, and $h(\cdot,y)$ is finite and positive for $\nu_{X\mid Y=y}$-a.e. $x$ for $\nu_Y$-a.e. $y$. Therefore \begin{align*} \log r(x,y)=\log g(y)+\log h(x,y) \end{align*} for $\nu$-a.e. $(x,y)$. 
We first justify that the integral of this sum can be split term by term. Define the negative-logarithm map $\ell^-:[0,\infty)\to[0,\infty]$ by $\ell^-(0)=+\infty$ and, for $t>0$, \begin{align*} \ell^-(t)=\max\{-\log t,0\}. \end{align*} We use the convention $t\ell^-(t)=0$ at $t=0$, and below we write $(\log t)^-$ for $\ell^-(t)$. The elementary bound \begin{align*} t\ell^-(t)\le e^{-1} \end{align*} holds for all $t\ge 0$, because $-t\log t\le e^{-1}$ on $0<t\le 1$ and $\ell^-(t)=0$ for $t\ge 1$. Since $\mu_Y$ is a probability measure, it follows that \begin{align*} \int_F (\log g(y))^-\,d\nu_Y(y) = \int_F g(y)(\log g(y))^-\,d\mu_Y(y) \le e^{-1}. \end{align*} For $\nu_Y$-a.e. $y$, the same bound applied to the density $h(\cdot,y)$ with respect to the probability measure $\mu_{X\mid Y=y}$ gives \begin{align*} \int_E (\log h(x,y))^-\,d\nu_{X\mid Y=y}(x) \le e^{-1}. \end{align*} Integrating this inequality with respect to $\nu_Y$ and using disintegration of $\nu$ gives \begin{align*} \int_{E\times F}(\log h(x,y))^-\,d\nu(x,y) \le e^{-1}. \end{align*} Thus the negative parts of both logarithmic terms are integrable, so no undefined expression of the form $+\infty-\infty$ occurs when the logarithm of $r=gh$ is split. By the definition of relative entropy and the preceding integrability check, \begin{align*} D(\nu\|\mu)=\int_{E\times F}\log r(x,y)\,d\nu(x,y)=\int_{E\times F}\log g(y)\,d\nu(x,y)+\int_{E\times F}\log h(x,y)\,d\nu(x,y). \end{align*} For the first term, since $(x,y)\mapsto\log g(y)$ depends only on $y$, the definition of the marginal $\nu_Y$ gives \begin{align*} \int_{E\times F} \log g(y)\,d\nu(x,y) = \int_F \log g(y)\,d\nu_Y(y) = D(\nu_Y\|\mu_Y). \end{align*} For the second term, the disintegration formula for $\nu$ gives \begin{align*} \int_{E\times F} \log h(x,y)\,d\nu(x,y) = \int_F \left( \int_E \log h(x,y)\,d\nu_{X\mid Y=y}(x) \right) d\nu_Y(y). \end{align*} For $\nu_Y$-a.e. $y$, $h(\cdot,y)=d\nu_{X\mid Y=y}/d\mu_{X\mid Y=y}$, so \begin{align*} \int_E \log h(x,y)\,d\nu_{X\mid Y=y}(x) = D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y}). 
\end{align*} Combining the two identities yields \begin{align*} D(\nu\|\mu) = D(\nu_Y\|\mu_Y) + \int_F D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y}) \,d\nu_Y(y). \end{align*} [/step] [step:Show that conditional absolute continuity failure forces infinite joint entropy] Assume $\nu_Y\ll\mu_Y$ and suppose that the measurable set \begin{align*} S:=\{y\in F:\nu_{X\mid Y=y}\not\ll\mu_{X\mid Y=y}\} \end{align*} has positive $\nu_Y$-measure. The set $S$ is measurable by the standard absolute-[continuity theorem](/theorems/1145) for probability kernels on standard Borel target spaces: if $K_y$ and $L_y$ are probability kernels from $(F,\mathcal F)$ to a standard Borel space $(E,\mathcal E)$, then \begin{align*} \{y\in F:K_y\ll L_y\} \end{align*} is $\mathcal F$-measurable, and on that set the kernel Radon--Nikodym theorem gives a jointly measurable density after changing the kernels on a null set for any chosen base measure. We apply this theorem to \begin{align*} K_y=\nu_{X\mid Y=y} \end{align*} and \begin{align*} L_y=\mu_{X\mid Y=y}. \end{align*} On $F\setminus S$, the kernel Radon--Nikodym theorem for standard Borel kernels provides a jointly measurable conditional density $h:E\times(F\setminus S)\to[0,\infty)$. The conditional entropy on $F\setminus S$ is the extended real-valued measurable function obtained from \begin{align*} D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})=\int_E \log h(x,y)\,d\nu_{X\mid Y=y}(x), \end{align*} where the negative part is integrable by the bound $t(\log t)^-\le e^{-1}$. On $S$ absolute continuity fails, so the definition of relative entropy gives, for every $y\in S$, \begin{align*} D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})=+\infty. \end{align*} Since $\nu_Y(S)>0$, the extended non-negative integral satisfies \begin{align*} \int_F D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y}) \,d\nu_Y(y) = +\infty. \end{align*} It remains to prove $\nu\not\ll\mu$. The factorization of the previous step cannot be used here, because it assumed conditional absolute continuity for $\nu_Y$-a.e. $y$; instead we argue by contradiction. 
Suppose $\nu\ll\mu$, and let $q:E\times F\to[0,\infty)$ be a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Let $g:F\to[0,\infty)$ denote the Radon--Nikodym derivative $d\nu_Y/d\mu_Y$. Since $d\nu_Y=g\,d\mu_Y$, the set $\{y\in F:g(y)=0\}$ has $\nu_Y$-measure $\int_{\{g=0\}}g\,d\mu_Y=0$. Because $(E,\mathcal E)$ is standard Borel, choose a countable determining class $\mathcal C\subseteq\mathcal E$ that contains $E$, is closed under finite intersections, and generates $\mathcal E$. For each $C\in\mathcal C$, define the measurable function $Q_C:F\to[0,\infty]$ by \begin{align*} Q_C(y)=\int_E \mathbb{1}_C(x)q(x,y)\,d\mu_{X\mid Y=y}(x). \end{align*} For every $B\in\mathcal F$, the disintegration theorem for standard Borel spaces for $\mu$ and Tonelli's theorem applied to the non-negative function $(x,y)\mapsto \mathbb{1}_C(x)\mathbb{1}_B(y)q(x,y)$ give \begin{align*} \nu(C\times B)=\int_B Q_C(y)\,d\mu_Y(y). \end{align*} Disintegration of $\nu$ gives \begin{align*} \nu(C\times B)=\int_B \nu_{X\mid Y=y}(C)\,d\nu_Y(y) = \int_B g(y)\nu_{X\mid Y=y}(C)\,d\mu_Y(y). \end{align*} Since these two identities hold for every $B\in\mathcal F$, uniqueness in the Radon--Nikodym theorem gives \begin{align*} Q_C(y)=g(y)\nu_{X\mid Y=y}(C) \end{align*} for $\mu_Y$-a.e. $y$. Taking the union of the exceptional null sets over the countable class $\mathcal C$, there is a set $N\in\mathcal F$ with $\mu_Y(N)=0$ such that the last identity holds for every $C\in\mathcal C$ and every $y\in F\setminus N$. For $y\in F\setminus N$ with $g(y)>0$, define the measurable function $k_y:E\to[0,\infty)$ by \begin{align*} k_y(x)=\frac{q(x,y)}{g(y)}. \end{align*} Then for every $C\in\mathcal C$, \begin{align*} \nu_{X\mid Y=y}(C)=\int_C k_y(x)\,d\mu_{X\mid Y=y}(x). \end{align*} Taking $C=E$ shows that $A\mapsto\int_A k_y(x)\,d\mu_{X\mid Y=y}(x)$ has total mass $1$. The two probability measures $\nu_{X\mid Y=y}$ and $A\mapsto\int_A k_y(x)\,d\mu_{X\mid Y=y}(x)$ therefore agree on the countable determining class $\mathcal C$, which is closed under finite intersections and generates $\mathcal E$, hence they agree on all of $\mathcal E$ by the $\pi$-$\lambda$ theorem. 
Therefore \begin{align*} \nu_{X\mid Y=y}\ll\mu_{X\mid Y=y} \end{align*} for every $y\in F\setminus N$ with $g(y)>0$. Since $\mu_Y(N)=0$ and $\nu_Y\ll\mu_Y$ give $\nu_Y(N)=0$, and since $\nu_Y(\{g=0\})=0$, this absolute continuity holds for $\nu_Y$-a.e. $y$, contradicting $\nu_Y(S)>0$. Therefore $\nu\not\ll\mu$, and by the definition of relative entropy, \begin{align*} D(\nu\|\mu)=+\infty. \end{align*} This proves the remaining infinite case and completes the proof. [/step]
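
As a numerical sanity check of the identity proved above, the following is a minimal Python sketch for the special case where $E$ and $F$ are finite sets, so that disintegration reduces to ordinary conditional probability tables. The function names (`kl`, `marginal_y`, `conditional_x`, `chain_rule_rhs`) and the example distributions are illustrative assumptions and are not part of the theorem or its proof.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) of finite distributions given as dicts.

    Returns +inf when p is not absolutely continuous with respect to q.
    """
    total = 0.0
    for key, pk in p.items():
        if pk == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        qk = q.get(key, 0.0)
        if qk == 0.0:
            return math.inf  # p charges a q-null point
        total += pk * math.log(pk / qk)
    return total

def marginal_y(joint):
    """Y-marginal of a joint distribution keyed by (x, y)."""
    out = {}
    for (_, y), w in joint.items():
        out[y] = out.get(y, 0.0) + w
    return out

def conditional_x(joint, y, mass_y):
    """Conditional law of X given Y = y, assuming mass_y > 0."""
    return {x: w / mass_y for (x, yy), w in joint.items() if yy == y}

def chain_rule_rhs(nu, mu):
    """D(nu_Y||mu_Y) plus the nu_Y-average of the conditional entropies."""
    nu_y, mu_y = marginal_y(nu), marginal_y(mu)
    rhs = kl(nu_y, mu_y)
    if math.isinf(rhs):
        return rhs  # marginal absolute continuity already fails
    for y, w in nu_y.items():
        if w == 0.0:
            continue  # nu_Y-null fibers do not contribute
        rhs += w * kl(conditional_x(nu, y, w), conditional_x(mu, y, mu_y[y]))
    return rhs

# Hypothetical example on E = F = {0, 1}; keys are (x, y).
nu = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.2}
mu = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}
# Here mu_Y(0) > 0, but mu_{X|Y=0} gives no mass to x = 1 while nu_{X|Y=0} does.
mu_singular = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.25, (1, 1): 0.25}

print(kl(nu, mu), chain_rule_rhs(nu, mu))
print(kl(nu, mu_singular), chain_rule_rhs(nu, mu_singular))
```

In the first pair the two printed numbers should agree up to floating-point rounding. The second pair illustrates the last step of the proof: the marginals are absolutely continuous, but conditional absolute continuity fails on a fiber of positive $\nu_Y$-measure, and both sides are infinite.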

Chain Rule for Relative Entropy (Theorem # 6731)
