[proofplan]
We prove the formula by disintegrating both measures with respect to their $Y$-marginals and factoring the Radon--Nikodym derivative into a marginal density and a conditional density. Taking the logarithm of this factorization splits the entropy integral into two terms. The first term depends only on $y\in F$ and gives $D(\nu_Y\|\mu_Y)$; the second term is the $\nu_Y$-average of the conditional entropies. The infinite cases are treated separately: failure of absolute continuity at the marginal level, or on a set of conditional fibers of positive $\nu_Y$-measure, forces failure of absolute continuity for the joint law.
[/proofplan]
[step:Reduce failure of marginal absolute continuity to failure of joint absolute continuity]
Assume first that $\nu_Y\not\ll\mu_Y$. Then there exists a set $B\in\mathcal F$ such that $\mu_Y(B)=0$ and $\nu_Y(B)>0$. Define the measurable rectangle
\begin{align*}
A := E\times B \in \mathcal E\otimes\mathcal F.
\end{align*}
By the definition of the second marginal,
\begin{align*}
\mu(A)=\mu_Y(B)=0
\end{align*}
and
\begin{align*}
\nu(A)=\nu_Y(B)>0.
\end{align*}
Hence $\nu\not\ll\mu$, so $D(\nu\|\mu)=+\infty$ by the definition of relative entropy. Also $D(\nu_Y\|\mu_Y)=+\infty$ by the same definition applied to the marginal measures. Thus the asserted infinite conclusion holds whenever $\nu_Y\not\ll\mu_Y$.
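For a concrete illustration (a toy example with assumed data, not taken from the statement being proved), take $E=F=\{0,1\}$ with the discrete $\sigma$-algebras, $\mu=\delta_{(0,0)}$, and $\nu=\delta_{(0,1)}$. Then $\mu_Y=\delta_0$ and $\nu_Y=\delta_1$, so $B=\{1\}$ witnesses $\nu_Y\not\ll\mu_Y$, and the rectangle $A=E\times\{1\}$ satisfies
\begin{align*}
\mu(A)=\mu_Y(\{1\})=0,
\qquad
\nu(A)=\nu_Y(\{1\})=1.
\end{align*}
Both $D(\nu_Y\|\mu_Y)$ and $D(\nu\|\mu)$ are therefore $+\infty$ in this example.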
[/step]
[step:Factor the joint density into marginal and conditional densities]
Assume now that $\nu_Y\ll\mu_Y$. By the [Radon--Nikodym theorem](/page/Radon-Nikodym%20Theorem) for the probability measures $\nu_Y$ and $\mu_Y$ on $(F,\mathcal F)$, define the marginal Radon--Nikodym density $g:F\to[0,\infty)$ by
\begin{align*}
g(y)=\frac{d\nu_Y}{d\mu_Y}(y).
\end{align*}
Assume in addition that $\nu_{X\mid Y=y}\ll\mu_{X\mid Y=y}$ for $\nu_Y$-a.e. $y\in F$. Since $d\nu_Y=g\,d\mu_Y$, a $\nu_Y$-null subset of $F$ contributes nothing to the integral of a non-negative function against $g\,d\mu_Y$. We use the kernel Radon--Nikodym theorem for standard Borel kernels: if $\lambda$ is a probability measure on $(F,\mathcal F)$, $K_y$ and $L_y$ are probability kernels from $(F,\mathcal F)$ to $(E,\mathcal E)$, and $K_y\ll L_y$ for $\lambda$-a.e. $y$, then after changing the kernels on a $\lambda$-null set there is an $\mathcal E\otimes\mathcal F$-measurable map $h:E\times F\to[0,\infty)$ such that, for $\lambda$-a.e. $y$,
\begin{align*}
h(\cdot,y)=\frac{dK_y}{dL_y}.
\end{align*}
Applying this theorem with $\lambda=\nu_Y$, $K_y=\nu_{X\mid Y=y}$, and $L_y=\mu_{X\mid Y=y}$ gives a jointly measurable conditional density $h:E\times F\to[0,\infty)$ such that
\begin{align*}
h(x,y)=\frac{d\nu_{X\mid Y=y}}{d\mu_{X\mid Y=y}}(x)
\end{align*}
for $\nu_Y$-a.e. $y$ and $\mu_{X\mid Y=y}$-a.e. $x$.
Define $r:E\times F\to[0,\infty]$ by
\begin{align*}
r(x,y)=g(y)h(x,y).
\end{align*}
We claim that $r$ is a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Let $A\in\mathcal E\otimes\mathcal F$. By the disintegration theorem for standard Borel spaces, applied to $\mu$, and Tonelli's theorem for the non-negative measurable function $(x,y)\mapsto \mathbb{1}_A(x,y)r(x,y)$,
\begin{align*}
\int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y)=\int_F g(y)\left(\int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x)\right)d\mu_Y(y).
\end{align*}
For $\nu_Y$-a.e. $y$, the definition of $h(\cdot,y)$ gives
\begin{align*}
\int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x)
=
\nu_{X\mid Y=y}(A_y),
\end{align*}
where
\begin{align*}
A_y := \{x\in E:(x,y)\in A\}.
\end{align*}
The exceptional $\nu_Y$-null set of $y$ does not affect the outer integral, since $g\,d\mu_Y=d\nu_Y$. Substituting the previous display therefore gives
\begin{align*}
\int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y)
=
\int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y).
\end{align*}
By disintegration of $\nu$,
\begin{align*}
\int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y)=\nu(A).
\end{align*}
Therefore
\begin{align*}
\frac{d\nu}{d\mu}(x,y)=g(y)h(x,y)
\end{align*}
for $\nu$-a.e. $(x,y)\in E\times F$.
[guided]
The goal is to express the joint likelihood ratio as a product of two simpler likelihood ratios: one for the $Y$-marginal and one for the conditional law of $X$ given $Y=y$.
Since $\nu_Y\ll\mu_Y$, the [Radon--Nikodym theorem](/page/Radon-Nikodym%20Theorem) for probability measures on $(F,\mathcal F)$ gives the measurable marginal density $g:F\to[0,\infty)$ defined by
\begin{align*}
g(y)=\frac{d\nu_Y}{d\mu_Y}(y).
\end{align*}
Since we are in the case where $\nu_{X\mid Y=y}\ll\mu_{X\mid Y=y}$ for $\nu_Y$-a.e. $y$, the pointwise Radon--Nikodym derivatives must be chosen in a way that is measurable jointly in $(x,y)$. This is exactly where the standard Borel hypothesis is used: for probability kernels on standard Borel spaces, the kernel Radon--Nikodym theorem applies. Applied to the two probability kernels $y\mapsto\nu_{X\mid Y=y}$ and $y\mapsto\mu_{X\mid Y=y}$, it gives a jointly measurable map $h:E\times F\to[0,\infty)$ such that
\begin{align*}
h(x,y)=\frac{d\nu_{X\mid Y=y}}{d\mu_{X\mid Y=y}}(x)
\end{align*}
for $\nu_Y$-a.e. $y$ and $\mu_{X\mid Y=y}$-a.e. $x$. The exceptional set of $y$ is harmless in the later $\mu_Y$-integral because integration against $g\,d\mu_Y$ is the same as integration against $d\nu_Y$:
\begin{align*}
d\nu_Y(y)=g(y)\,d\mu_Y(y).
\end{align*}
The candidate joint density is therefore the measurable map $r:E\times F\to[0,\infty]$ defined by
\begin{align*}
r(x,y)=g(y)h(x,y).
\end{align*}
To verify that $r=d\nu/d\mu$, we test it on an arbitrary measurable set $A\in\mathcal E\otimes\mathcal F$. For each $y\in F$, define the fiber
\begin{align*}
A_y := \{x\in E:(x,y)\in A\}.
\end{align*}
The disintegration theorem for standard Borel spaces says that integration with respect to $\mu$ can be performed by first integrating over the conditional law $\mu_{X\mid Y=y}$ and then integrating over $\mu_Y$. Applying it together with Tonelli's theorem to the non-negative measurable function $(x,y)\mapsto \mathbb{1}_A(x,y)r(x,y)$ gives
\begin{align*}
\int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y)
=
\int_F
g(y)
\left(
\int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x)
\right)
d\mu_Y(y).
\end{align*}
For $\nu_Y$-a.e. $y$, the function $h(\cdot,y)$ is the Radon--Nikodym derivative of $\nu_{X\mid Y=y}$ with respect to $\mu_{X\mid Y=y}$. Hence
\begin{align*}
\int_E \mathbb{1}_A(x,y)h(x,y)\,d\mu_{X\mid Y=y}(x)
=
\nu_{X\mid Y=y}(A_y).
\end{align*}
Now the factor $g(y)$ converts integration against $\mu_Y$ into integration against $\nu_Y$, by the displayed definition of $g$. Thus
\begin{align*}
\int_{E\times F}\mathbb{1}_A(x,y)r(x,y)\,d\mu(x,y)
=
\int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y).
\end{align*}
Finally, disintegration of $\nu$ identifies the last integral with $\nu(A)$:
\begin{align*}
\int_F \nu_{X\mid Y=y}(A_y)\,d\nu_Y(y)=\nu(A).
\end{align*}
Since this equality holds for every $A\in\mathcal E\otimes\mathcal F$, the function $r$ is a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Therefore
\begin{align*}
\frac{d\nu}{d\mu}(x,y)=g(y)h(x,y)
\end{align*}
for $\nu$-a.e. $(x,y)\in E\times F$.
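As a sanity check, consider the hypothetical special case in which $E$ and $F$ are countable with the discrete $\sigma$-algebras and $\mu(\{(x,y)\})>0$ for every $(x,y)$; these assumptions are made only for illustration. Disintegration then reduces to $\mu(\{(x,y)\})=\mu_Y(\{y\})\,\mu_{X\mid Y=y}(\{x\})$, and likewise for $\nu$, so for every $y$ with $\nu_Y(\{y\})>0$,
\begin{align*}
\frac{\nu(\{(x,y)\})}{\mu(\{(x,y)\})}
=
\frac{\nu_Y(\{y\})}{\mu_Y(\{y\})}\cdot
\frac{\nu_{X\mid Y=y}(\{x\})}{\mu_{X\mid Y=y}(\{x\})}
=
g(y)h(x,y).
\end{align*}
This is the factorization above in its most elementary form: a ratio of joint masses equals the ratio of marginal masses times the ratio of conditional masses.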
[/guided]
[/step]
[step:Integrate the logarithmic factorization]
Keep the definitions from the previous step: $g$ is the marginal density, $h$ is the jointly measurable conditional density, and $r:E\times F\to[0,\infty]$ is defined by
\begin{align*}
r(x,y)=g(y)h(x,y).
\end{align*}
The previous step proved that $r$ is a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Since $\nu(\{r=0\})=\int_{\{r=0\}}r\,d\mu=0$, the density $r$ is positive for $\nu$-a.e. $(x,y)$, and it is finite because $g$ and $h$ are finite-valued. By the same argument, $g$ is finite and positive for $\nu_Y$-a.e. $y$, and $h(\cdot,y)$ is finite and positive for $\nu_{X\mid Y=y}$-a.e. $x$ for $\nu_Y$-a.e. $y$. Therefore
\begin{align*}
\log r(x,y)=\log g(y)+\log h(x,y)
\end{align*}
for $\nu$-a.e. $(x,y)$.
Before integrating, we check that splitting the integral of this sum is legitimate. Define the negative-logarithm map $\ell^-:[0,\infty)\to[0,\infty]$ by $\ell^-(0)=+\infty$ and, for $t>0$,
\begin{align*}
\ell^-(t)=\max\{-\log t,0\}.
\end{align*}
Thus $\ell^-(t)=(\log t)^-$ is the negative part of $\log t$. We use the convention $t\ell^-(t)=0$ at $t=0$. The elementary bound
\begin{align*}
t\ell^-(t)\le e^{-1}
\end{align*}
holds for all $t\ge 0$, because $-t\log t\le e^{-1}$ on $0<t\le 1$ and the left-hand side is $0$ for $t\ge 1$. Hence
\begin{align*}
\int_F (\log g(y))^-\,d\nu_Y(y)
=
\int_F g(y)(\log g(y))^-\,d\mu_Y(y)
\le e^{-1}.
\end{align*}
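The bound $-t\log t\le e^{-1}$ on $0<t\le 1$ used above follows from one-variable calculus. Writing $\varphi(t)=-t\log t$ for $t>0$ (a name introduced here only for this check),
\begin{align*}
\varphi'(t)=-\log t-1,
\qquad
\varphi'(t)=0 \iff t=e^{-1},
\qquad
\varphi(e^{-1})=-e^{-1}\log e^{-1}=e^{-1}.
\end{align*}
Since $\varphi''(t)=-1/t<0$, the function $\varphi$ is concave on $(0,\infty)$, so $t=e^{-1}$ is its global maximizer and $\varphi(t)\le e^{-1}$ for all $t>0$.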
For $\nu_Y$-a.e. $y$, the same bound applied to the density $h(\cdot,y)$ gives
\begin{align*}
\int_E (\log h(x,y))^-\,d\nu_{X\mid Y=y}(x)
\le e^{-1}.
\end{align*}
Integrating this inequality with respect to $\nu_Y$ and using disintegration of $\nu$ gives
\begin{align*}
\int_{E\times F}(\log h(x,y))^-\,d\nu(x,y)
\le e^{-1}.
\end{align*}
Thus the negative parts of both logarithmic terms are integrable, so no undefined expression of the form $+\infty-\infty$ occurs when the logarithm of $r=gh$ is split.
By the definition of relative entropy and the preceding integrability check,
\begin{align*}
D(\nu\|\mu)=\int_{E\times F}\log r(x,y)\,d\nu(x,y)=\int_{E\times F}\log g(y)\,d\nu(x,y)+\int_{E\times F}\log h(x,y)\,d\nu(x,y).
\end{align*}
For the first term, since $(x,y)\mapsto\log g(y)$ depends only on $y$, the definition of the marginal $\nu_Y$ gives
\begin{align*}
\int_{E\times F}
\log g(y)\,d\nu(x,y)
=
\int_F \log g(y)\,d\nu_Y(y)
=
D(\nu_Y\|\mu_Y).
\end{align*}
For the second term, the disintegration formula for $\nu$ gives
\begin{align*}
\int_{E\times F}
\log h(x,y)\,d\nu(x,y)
=
\int_F
\left(
\int_E \log h(x,y)\,d\nu_{X\mid Y=y}(x)
\right)
d\nu_Y(y).
\end{align*}
For $\nu_Y$-a.e. $y$, $h(\cdot,y)=d\nu_{X\mid Y=y}/d\mu_{X\mid Y=y}$, so
\begin{align*}
\int_E \log h(x,y)\,d\nu_{X\mid Y=y}(x)
=
D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y}).
\end{align*}
Combining the two identities yields
\begin{align*}
D(\nu\|\mu)
=
D(\nu_Y\|\mu_Y)
+
\int_F
D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})
\,d\nu_Y(y).
\end{align*}
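To see the identity in a concrete case, here is a small worked example under assumed data (the measures below are chosen purely for illustration). Let $E=F=\{0,1\}$, let $\mu$ be uniform on $E\times F$, and let $\nu$ have masses
\begin{align*}
\nu(\{(0,0)\})=\tfrac12,\quad \nu(\{(1,0)\})=\tfrac14,\quad \nu(\{(0,1)\})=\tfrac18,\quad \nu(\{(1,1)\})=\tfrac18.
\end{align*}
Writing a measure on $\{0,1\}$ as the pair of its masses at $0$ and $1$, we get $\nu_Y=(\tfrac34,\tfrac14)$, $\mu_Y=(\tfrac12,\tfrac12)$, $\nu_{X\mid Y=0}=(\tfrac23,\tfrac13)$, $\nu_{X\mid Y=1}=(\tfrac12,\tfrac12)$, and $\mu_{X\mid Y=y}=(\tfrac12,\tfrac12)$ for both $y$. Then
\begin{align*}
D(\nu\|\mu) &= \tfrac12\log 2+\tfrac14\log 1+\tfrac18\log\tfrac12+\tfrac18\log\tfrac12=\tfrac14\log 2,\\
D(\nu_Y\|\mu_Y) &= \tfrac34\log\tfrac32+\tfrac14\log\tfrac12=\tfrac34\log 3-\log 2,\\
D(\nu_{X\mid Y=0}\|\mu_{X\mid Y=0}) &= \tfrac23\log\tfrac43+\tfrac13\log\tfrac23=\tfrac53\log 2-\log 3,\\
D(\nu_{X\mid Y=1}\|\mu_{X\mid Y=1}) &= 0.
\end{align*}
The right-hand side of the chain rule is therefore
\begin{align*}
\tfrac34\log 3-\log 2+\tfrac34\left(\tfrac53\log 2-\log 3\right)+\tfrac14\cdot 0=\tfrac14\log 2,
\end{align*}
which agrees with $D(\nu\|\mu)$.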
[/step]
[step:Show that conditional absolute continuity failure forces infinite joint entropy]
Assume $\nu_Y\ll\mu_Y$ and suppose that the measurable set
\begin{align*}
S:=\{y\in F:\nu_{X\mid Y=y}\not\ll\mu_{X\mid Y=y}\}
\end{align*}
has positive $\nu_Y$-measure. The set $S$ is measurable by the standard [absolute-continuity theorem](/theorems/1145) for probability kernels with standard Borel target space: if $K_y$ and $L_y$ are probability kernels from $(F,\mathcal F)$ to a standard Borel space $(E,\mathcal E)$, then
\begin{align*}
\{y\in F:K_y\ll L_y\}
\end{align*}
is $\mathcal F$-measurable, and on that set the kernel Radon--Nikodym theorem gives a jointly measurable density after changing the kernels on a null set for any chosen base measure. We apply this theorem to
\begin{align*}
K_y=\nu_{X\mid Y=y}
\end{align*}
and
\begin{align*}
L_y=\mu_{X\mid Y=y}.
\end{align*}
On $F\setminus S$, the kernel Radon--Nikodym theorem for standard Borel kernels provides a jointly measurable conditional density $h:E\times(F\setminus S)\to[0,\infty)$. The conditional entropy on $F\setminus S$ is the extended real-valued measurable function obtained from
\begin{align*}
D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})=\int_E \log h(x,y)\,d\nu_{X\mid Y=y}(x),
\end{align*}
where the negative part is integrable by the bound $t(\log t)^-\le e^{-1}$. For every $y\in S$, absolute continuity fails, so the definition of relative entropy gives
\begin{align*}
D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})=+\infty.
\end{align*}
Since $\nu_Y(S)>0$, the extended non-negative integral satisfies
\begin{align*}
\int_F
D(\nu_{X\mid Y=y}\|\mu_{X\mid Y=y})
\,d\nu_Y(y)
=
+\infty.
\end{align*}
It remains to prove $\nu\not\ll\mu$. The factorization from the earlier steps assumed conditional absolute continuity, so it cannot be used here; instead we argue by contradiction. Suppose $\nu\ll\mu$, and let $q:E\times F\to[0,\infty)$ be a Radon--Nikodym derivative of $\nu$ with respect to $\mu$. Let $g:F\to[0,\infty)$ denote the Radon--Nikodym derivative $d\nu_Y/d\mu_Y$, which exists since $\nu_Y\ll\mu_Y$. The set $\{y\in F:g(y)=0\}$ has $\nu_Y$-measure $0$, because $\nu_Y(\{g=0\})=\int_{\{g=0\}}g\,d\mu_Y=0$.
Because $(E,\mathcal E)$ is standard Borel, choose a countable determining class $\mathcal C\subseteq\mathcal E$ that contains $E$, is closed under finite intersections, and generates $\mathcal E$. For each $C\in\mathcal C$, define the measurable function $Q_C:F\to[0,\infty)$ by
\begin{align*}
Q_C(y)=\int_E \mathbb{1}_C(x)q(x,y)\,d\mu_{X\mid Y=y}(x).
\end{align*}
For every $B\in\mathcal F$, the disintegration theorem for standard Borel spaces, applied to $\mu$, and Tonelli's theorem for the non-negative function $(x,y)\mapsto \mathbb{1}_C(x)\mathbb{1}_B(y)q(x,y)$ give
\begin{align*}
\nu(C\times B)=\int_B Q_C(y)\,d\mu_Y(y).
\end{align*}
Disintegration of $\nu$ gives
\begin{align*}
\nu(C\times B)=\int_B \nu_{X\mid Y=y}(C)\,d\nu_Y(y)
=
\int_B g(y)\nu_{X\mid Y=y}(C)\,d\mu_Y(y).
\end{align*}
Since these two identities hold for every $B\in\mathcal F$, uniqueness in the Radon--Nikodym theorem gives
\begin{align*}
Q_C(y)=g(y)\nu_{X\mid Y=y}(C)
\end{align*}
for $\mu_Y$-a.e. $y$. Taking the union of the exceptional null sets over the countable class $\mathcal C$, we obtain a set $N\in\mathcal F$ with $\mu_Y(N)=0$ such that the last identity holds for every $C\in\mathcal C$ and every $y\in F\setminus N$.
For $y\in F\setminus N$ with $g(y)>0$, define the measurable function $k_y:E\to[0,\infty)$ by
\begin{align*}
k_y(x)=\frac{q(x,y)}{g(y)}.
\end{align*}
Then for every $C\in\mathcal C$,
\begin{align*}
\nu_{X\mid Y=y}(C)=\int_C k_y(x)\,d\mu_{X\mid Y=y}(x).
\end{align*}
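Taking $C=E$ shows that $A\mapsto\int_A k_y(x)\,d\mu_{X\mid Y=y}(x)$ is a probability measure. It agrees with $\nu_{X\mid Y=y}$ on the countable class $\mathcal C$, which is closed under finite intersections and generates $\mathcal E$, so the two measures agree on all of $\mathcal E$ by the $\pi$--$\lambda$ theorem. Therefore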
\begin{align*}
\nu_{X\mid Y=y}\ll\mu_{X\mid Y=y}
\end{align*}
for every $y\in F\setminus N$ with $g(y)>0$. Since $\mu_Y(N)=0$ and $\nu_Y\ll\mu_Y$ give $\nu_Y(N)=0$, and since $\nu_Y(\{g=0\})=0$, this absolute continuity holds for $\nu_Y$-a.e. $y$, contradicting $\nu_Y(S)>0$. Therefore $\nu\not\ll\mu$, and by the definition of relative entropy,
\begin{align*}
D(\nu\|\mu)=+\infty.
\end{align*}
This proves the remaining infinite case and completes the proof.
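A toy example (with data assumed only for illustration) shows that this case can occur even when the marginals coincide. Let $E=F=\{0,1\}$ and
\begin{align*}
\mu=\tfrac12\delta_{(0,0)}+\tfrac12\delta_{(1,1)},
\qquad
\nu=\tfrac12\delta_{(1,0)}+\tfrac12\delta_{(1,1)}.
\end{align*}
Then $\mu_Y=\nu_Y$ is uniform on $\{0,1\}$, so $D(\nu_Y\|\mu_Y)=0$, while $\mu_{X\mid Y=0}=\delta_0$ and $\nu_{X\mid Y=0}=\delta_1$. Hence $S=\{0\}$ has $\nu_Y(S)=\tfrac12>0$, and indeed
\begin{align*}
\mu(\{(1,0)\})=0,
\qquad
\nu(\{(1,0)\})=\tfrac12,
\end{align*}
so $\nu\not\ll\mu$ and $D(\nu\|\mu)=+\infty$, in agreement with the infinite value of the integral of conditional entropies.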
[/step]