[proofplan]
We first reduce to the invertible case by passing to the natural extension, where the lifted partition and lifted factor have the same conditional information process as the original ones. In the invertible case, the conditional information chain rule writes the block information as a non-stationary ergodic sum of one-step conditional information functions. Martingale convergence identifies the limiting one-step information function, Maker's ergodic theorem together with ergodicity shows that the normalized sum converges to the space integral of that limit, and the relative Kolmogorov-Sinai formula identifies that integral with $h_\mu(T,\mathcal P\mid\mathcal G)$.
[/proofplan]
[step:Pass to the natural extension when $T$ is not invertible]
Let $(\widehat X,\widehat{\mathcal B},\widehat\mu,\widehat T)$ denote the natural extension of $(X,\mathcal B,\mu,T)$, and let
\begin{align*}
\pi: \widehat X \to X
\end{align*}
be the factor map satisfying $\pi\circ\widehat T=T\circ\pi$ and $\pi_*\widehat\mu=\mu$. Define the lifted partition $\widehat{\mathcal P}:=\pi^{-1}\mathcal P$ and the lifted factor $\widehat{\mathcal G}:=\pi^{-1}\mathcal G$. Since $\mathcal P$ is finite, $I_\mu(\mathcal P_{[0,n-1]}\mid\mathcal G)\in L^1(X,\mathcal B,\mu)$ for every $n\in\mathbb N$.
For a probability measure $\lambda$, a finite measurable partition $\mathcal R$, and a sub-$\sigma$-algebra $\mathcal H$, write
\begin{align*}
H_\lambda(\mathcal R\mid\mathcal H):=\int I_\lambda(\mathcal R\mid\mathcal H)\,d\lambda
\end{align*}
for the corresponding conditional entropy.
We use the natural-extension reduction in the following precise form. For every finite partition $\mathcal R$ of $X$ and every sub-$\sigma$-algebra $\mathcal H\subseteq\mathcal B$,
\begin{align*}
I_{\widehat\mu}(\pi^{-1}\mathcal R\mid\pi^{-1}\mathcal H)=I_\mu(\mathcal R\mid\mathcal H)\circ\pi
\end{align*}
$\widehat\mu$-almost everywhere, and consequently
\begin{align*}
H_{\widehat\mu}(\pi^{-1}\mathcal R\mid\pi^{-1}\mathcal H)=H_\mu(\mathcal R\mid\mathcal H).
\end{align*}
The first identity follows from the defining property of [conditional expectation](/page/Conditional%20Expectation) under a measure-preserving factor map:
\begin{align*}
\mathbb E_{\widehat\mu}[\mathbb 1_{\pi^{-1}R}\mid\pi^{-1}\mathcal H]=\mathbb E_\mu[\mathbb 1_R\mid\mathcal H]\circ\pi
\end{align*}
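for each atom $R$ of $\mathcal R$, and the entropy identity follows by integrating the conditional information identity against $\widehat\mu$ and using $\pi_*\widehat\mu=\mu$. The relation $\pi\circ\widehat T=T\circ\pi$ gives $\pi^{-1}(T^{-k}\mathcal P)=\widehat T^{-k}\widehat{\mathcal P}$ for every $k\ge0$, hence $\pi^{-1}\mathcal P_{[0,n-1]}=\widehat{\mathcal P}_{[0,n-1]}$. Applying the reduction to $\mathcal R=\mathcal P_{[0,n-1]}$ and $\mathcal H=\mathcal G$, we have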
\begin{align*}
I_{\widehat\mu}(\widehat{\mathcal P}_{[0,n-1]}\mid\widehat{\mathcal G})=I_\mu(\mathcal P_{[0,n-1]}\mid\mathcal G)\circ\pi
\end{align*}
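for every $n\in\mathbb N$. The entropy identity for the same pair, combined with the definition of relative entropy as the limit of $\frac1n H_\mu(\mathcal P_{[0,n-1]}\mid\mathcal G)$, gives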
\begin{align*}
h_{\widehat\mu}(\widehat T,\widehat{\mathcal P}\mid\widehat{\mathcal G})=h_\mu(T,\mathcal P\mid\mathcal G).
\end{align*}
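Because $\pi$ is measure-preserving, a sequence $(g_n)$ of functions on $X$ converges $\mu$-almost everywhere, or in $L^1(\mu)$, if and only if $(g_n\circ\pi)$ converges $\widehat\mu$-almost everywhere, or in $L^1(\widehat\mu)$, respectively. Moreover, $\widehat T$ is invertible, the natural extension of an ergodic system is ergodic, and $\widehat T^{-1}\widehat{\mathcal G}=\pi^{-1}(T^{-1}\mathcal G)=\widehat{\mathcal G}$ modulo $\widehat\mu$-null sets, so the lifted system satisfies the hypotheses of the theorem. It is therefore enough to prove the theorem under the additional assumption that $T$ is invertible modulo $\mu$-null sets.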
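To make the passage from the conditional expectation identity to the conditional information identity explicit, one can write it out atom by atom. The sketch below assumes the usual formula $I_\lambda(\mathcal R\mid\mathcal H)=-\sum_{R\in\mathcal R}\mathbb 1_R\log\mathbb E_\lambda[\mathbb 1_R\mid\mathcal H]$ for the conditional information function, which is not restated in this proof:
\begin{align*}
I_{\widehat\mu}(\pi^{-1}\mathcal R\mid\pi^{-1}\mathcal H)
&=-\sum_{R\in\mathcal R}\mathbb 1_{\pi^{-1}R}\log\mathbb E_{\widehat\mu}[\mathbb 1_{\pi^{-1}R}\mid\pi^{-1}\mathcal H]\\
&=-\sum_{R\in\mathcal R}(\mathbb 1_R\circ\pi)\log\bigl(\mathbb E_\mu[\mathbb 1_R\mid\mathcal H]\circ\pi\bigr)\\
&=I_\mu(\mathcal R\mid\mathcal H)\circ\pi.
\end{align*}
The second line uses $\mathbb 1_{\pi^{-1}R}=\mathbb 1_R\circ\pi$ together with the conditional expectation identity for each atom.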
[/step]
[step:Decompose the block information into one-step conditional information terms]
Assume from now on that $T$ is invertible modulo $\mu$-null sets. For $j\in\mathbb N\cup\{0\}$ define the finite future partition
\begin{align*}
\mathcal P_{[1,j]}:=\bigvee_{i=1}^j T^i\mathcal P,
\end{align*}
with the convention that $\mathcal P_{[1,0]}$ is the one-atom partition. Define
\begin{align*}
f_j:X\to[0,\infty]
\end{align*}
by
\begin{align*}
f_j(x):=I_\mu(\mathcal P\mid\mathcal G\vee\mathcal P_{[1,j]})(x).
\end{align*}
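The function $f_j$ is measurable and belongs to $L^1(X,\mathcal B,\mu)$ because $\mathcal P$ is finite, so that $\int_X f_j\,d\mu=H_\mu(\mathcal P\mid\mathcal G\vee\mathcal P_{[1,j]})\le\log|\mathcal P|<\infty$, where $|\mathcal P|$ denotes the number of atoms of $\mathcal P$. The conditional information chain rule applied to the ordered join $\mathcal P_{[0,n-1]}=\bigvee_{k=0}^{n-1}T^{-k}\mathcal P$ gives, with the convention that $\mathcal P_{[0,-1]}$ is the one-atom partition,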
\begin{align*}
I_\mu(\mathcal P_{[0,n-1]}\mid\mathcal G)=\sum_{k=0}^{n-1} I_\mu(T^{-k}\mathcal P\mid\mathcal G\vee\mathcal P_{[0,k-1]}).
\end{align*}
Since $T$ is invertible, $\mu$ is $T$-invariant, and $T^{-1}\mathcal G=\mathcal G$ modulo $\mu$-null sets, covariance of conditional information under $T^k$ gives
\begin{align*}
I_\mu(T^{-k}\mathcal P\mid\mathcal G\vee\mathcal P_{[0,k-1]})=f_k\circ T^k
\end{align*}
for each $0\le k\le n-1$. Hence
\begin{align*}
I_n^{\mathcal P\mid\mathcal G}=\sum_{k=0}^{n-1} f_k\circ T^k.
\end{align*}
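The index bookkeeping behind the covariance identity can be checked directly. The following is a sketch, assuming the covariance rule in the form $I_\mu(T^{-k}\mathcal Q\mid T^{-k}\mathcal A)=I_\mu(\mathcal Q\mid\mathcal A)\circ T^k$ for a finite partition $\mathcal Q$ and a sub-$\sigma$-algebra $\mathcal A$ (the symbols $\mathcal Q$, $\mathcal A$, and $m$ are used only in this illustration):
\begin{align*}
T^{-k}\bigl(\mathcal G\vee\mathcal P_{[1,k]}\bigr)
&=T^{-k}\mathcal G\vee\bigvee_{i=1}^{k}T^{i-k}\mathcal P\\
&=\mathcal G\vee\bigvee_{m=0}^{k-1}T^{-m}\mathcal P
=\mathcal G\vee\mathcal P_{[0,k-1]},
\end{align*}
where $m=k-i$ and the equalities hold modulo $\mu$-null sets. Taking $\mathcal Q=\mathcal P$ and $\mathcal A=\mathcal G\vee\mathcal P_{[1,k]}$ in the covariance rule then yields the $k$th summand as $f_k\circ T^k$.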
[guided]
The goal of this step is to turn information about an $n$-block into a sum of one-symbol information terms. For $j\in\mathbb N\cup\{0\}$ define
\begin{align*}
\mathcal P_{[1,j]}:=\bigvee_{i=1}^j T^i\mathcal P,
\end{align*}
where $\mathcal P_{[1,0]}$ is the one-atom partition. This is the information carried by the $j$ forward images $T\mathcal P,\dots,T^j\mathcal P$ of the partition. Define the one-step conditional information function
\begin{align*}
f_j:X\to[0,\infty]
\end{align*}
by
\begin{align*}
f_j(x):=I_\mu(\mathcal P\mid\mathcal G\vee\mathcal P_{[1,j]})(x).
\end{align*}
Because $\mathcal P$ is finite, the conditional entropy $H_\mu(\mathcal P\mid\mathcal G\vee\mathcal P_{[1,j]})$ is finite, so $f_j\in L^1(X,\mathcal B,\mu)$.
We apply the conditional information chain rule. Its hypotheses are satisfied because $\mathcal P_{[0,n-1]}$ is a finite join of finite measurable partitions and $\mathcal G$ is a sub-$\sigma$-algebra of $\mathcal B$. With $\mathcal P_{[0,-1]}$ understood as the one-atom partition, the rule gives
\begin{align*}
I_\mu(\mathcal P_{[0,n-1]}\mid\mathcal G)=\sum_{k=0}^{n-1} I_\mu(T^{-k}\mathcal P\mid\mathcal G\vee\mathcal P_{[0,k-1]}).
\end{align*}
The $k$th summand is the information needed to specify the $k$th name once the factor information and the previous $k$ names are known.
Now use invertibility and invariance to move the $k$th summand to time $0$. Since $T$ is invertible modulo $\mu$-null sets and preserves $\mu$, conditional information is covariant under $T^k$. Since $T^{-1}\mathcal G=\mathcal G$ modulo $\mu$-null sets, the factor $\sigma$-algebra is unchanged by this shift. The previous block $\mathcal P_{[0,k-1]}$ becomes the future partition $\mathcal P_{[1,k]}$ after applying $T^k$. Therefore
\begin{align*}
I_\mu(T^{-k}\mathcal P\mid\mathcal G\vee\mathcal P_{[0,k-1]})=I_\mu(\mathcal P\mid\mathcal G\vee\mathcal P_{[1,k]})\circ T^k=f_k\circ T^k.
\end{align*}
Substituting this into the chain-rule identity yields
\begin{align*}
I_n^{\mathcal P\mid\mathcal G}=\sum_{k=0}^{n-1} f_k\circ T^k.
\end{align*}
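This is the structural point of the proof: the relative block information is a non-stationary ergodic sum, so $\frac1n I_n^{\mathcal P\mid\mathcal G}$ is a non-stationary ergodic average in which the function being averaged changes with the time index.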
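As a concrete instance of the decomposition, here is the case $n=3$ written out in full, using only the chain rule and the covariance identity from this step:
\begin{align*}
I_3^{\mathcal P\mid\mathcal G}
&=I_\mu(\mathcal P\mid\mathcal G)
+I_\mu(T^{-1}\mathcal P\mid\mathcal G\vee\mathcal P)
+I_\mu(T^{-2}\mathcal P\mid\mathcal G\vee\mathcal P\vee T^{-1}\mathcal P)\\
&=I_\mu(\mathcal P\mid\mathcal G)
+I_\mu(\mathcal P\mid\mathcal G\vee T\mathcal P)\circ T
+I_\mu(\mathcal P\mid\mathcal G\vee T\mathcal P\vee T^{2}\mathcal P)\circ T^{2}\\
&=f_0+f_1\circ T+f_2\circ T^{2}.
\end{align*}
The three summands involve three different functions $f_0$, $f_1$, $f_2$, which is why Birkhoff's ergodic theorem alone does not apply to the sum.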
[/guided]
[/step]
[step:Identify the limiting one-step information by martingale convergence]
Define the increasing $\sigma$-algebras
\begin{align*}
\mathcal A_j:=\mathcal G\vee\mathcal P_{[1,j]}
\end{align*}
for $j\in\mathbb N\cup\{0\}$, and define
\begin{align*}
\mathcal A_\infty:=\mathcal G\vee\bigvee_{i=1}^{\infty}T^i\mathcal P.
\end{align*}
The sequence $(\mathcal A_j)_{j\ge0}$ is increasing and $\bigvee_{j\ge0}\mathcal A_j=\mathcal A_\infty$. Hence, for each fixed atom $P$ of $\mathcal P$, the [conditional probability](/page/Conditional%20Probability) $\mathbb E_\mu[\mathbb 1_P\mid\mathcal A_j]$ is a martingale in $j$ with values in $[0,1]$. The martingale convergence theorem therefore gives convergence of these conditional probabilities almost everywhere and in $L^1$ to the corresponding conditional probabilities with respect to $\mathcal A_\infty$. Define
\begin{align*}
f_\infty:X\to[0,\infty]
\end{align*}
by
\begin{align*}
f_\infty:=I_\mu(\mathcal P\mid\mathcal A_\infty).
\end{align*}
Since $\mathcal P$ has finitely many atoms, the standard consequence of martingale convergence for the information function of a finite partition gives
\begin{align*}
f_j\to f_\infty
\end{align*}
$\mu$-a.e. and in $L^1(X,\mathcal B,\mu)$.
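The $L^1$ part of this convergence rests on an integrable bound for $\sup_j f_j$. A minimal sketch of the standard argument follows, assuming natural logarithms and the formula $f_j=-\sum_{P\in\mathcal P}\mathbb 1_P\log\mathbb E_\mu[\mathbb 1_P\mid\mathcal A_j]$. The symbols $f^*:=\sup_{j\ge0}f_j$, $t$, and $\tau$ are introduced only for this illustration, and $H_\mu(\mathcal P):=-\sum_{P\in\mathcal P}\mu(P)\log\mu(P)$. Fix an atom $P$ and $t\ge0$, and let $\tau$ be the first index $j$ with $\mathbb E_\mu[\mathbb 1_P\mid\mathcal A_j]<e^{-t}$, so that $\{\tau=j\}\in\mathcal A_j$ and $P\cap\{f^*>t\}=P\cap\{\tau<\infty\}$. Then
\begin{align*}
\mu\bigl(P\cap\{f^*>t\}\bigr)
&=\sum_{j\ge0}\int_{\{\tau=j\}}\mathbb E_\mu[\mathbb 1_P\mid\mathcal A_j]\,d\mu
\le e^{-t}\sum_{j\ge0}\mu(\tau=j)\le e^{-t},\\
\int_X f^*\,d\mu
&=\sum_{P\in\mathcal P}\int_0^\infty\mu\bigl(P\cap\{f^*>t\}\bigr)\,dt
\le\sum_{P\in\mathcal P}\int_0^\infty\min\{\mu(P),e^{-t}\}\,dt
=H_\mu(\mathcal P)+1.
\end{align*}
Since $0\le f_j\le f^*$ and $f_j\to f_\infty$ almost everywhere, dominated convergence then yields the $L^1$ convergence.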
[/step]
[step:Apply Maker's theorem to the non-stationary ergodic average]
The functions $f_j$ and $f_\infty$ belong to $L^1(X,\mathcal B,\mu)$, and the previous step gives $f_j\to f_\infty$ in $L^1$ and almost everywhere. Define the invariant $\sigma$-algebra
\begin{align*}
\mathcal I_T:=\{A\in\mathcal B:T^{-1}A=A\text{ modulo }\mu\text{-null sets}\}.
\end{align*}
Let
\begin{align*}
\mathbb E_\mu[\cdot\mid\mathcal I_T]:L^1(X,\mathcal B,\mu)\to L^1(X,\mathcal I_T,\mu)
\end{align*}
denote conditional expectation with respect to $\mathcal I_T$. We use Maker's theorem in the tail-envelope form: if $g_j\to g_\infty$ almost everywhere and in $L^1$, and if the map $R_N^g:X\to[0,\infty]$ defined by
\begin{align*}
R_N^g(x):=\sup_{j\geq N}|g_j(x)-g_\infty(x)|
\end{align*}
belongs to $L^1(X,\mathcal B,\mu)$ with
\begin{align*}
\int_X R_N^g(x)\,d\mu(x)\to0\quad\text{as }N\to\infty,
\end{align*}
then
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}(g_k-g_\infty)(T^k x)\to0
\end{align*}
for $\mu$-a.e. $x$ and in $L^1(X,\mathcal B,\mu)$. We now verify the tail-envelope hypothesis using Breiman's maximal convergence theorem for finite conditional information. In the form needed here, the theorem says: if $\mathcal Q$ is a finite measurable partition of a probability space $(X,\mathcal B,\mu)$, if $(\mathcal A_j)_{j\geq0}$ is an increasing sequence of sub-$\sigma$-algebras, if
\begin{align*}
\mathcal A_\infty:=\bigvee_{j=0}^{\infty}\mathcal A_j,
\end{align*}
and if the maps $F_j:X\to[0,\infty]$ and $F_\infty:X\to[0,\infty]$ are defined by
\begin{align*}
F_j:=I_\mu(\mathcal Q\mid\mathcal A_j)
\end{align*}
and
\begin{align*}
F_\infty:=I_\mu(\mathcal Q\mid\mathcal A_\infty),
\end{align*}
then for each $N\in\mathbb N$ the maximal tail map $S_N:X\to[0,\infty]$ defined by
\begin{align*}
S_N(x):=\sup_{j\geq N}|F_j(x)-F_\infty(x)|
\end{align*}
belongs to $L^1(X,\mathcal B,\mu)$ and satisfies
\begin{align*}
\int_X S_N(x)\,d\mu(x)\to0\quad\text{as }N\to\infty.
\end{align*}
The hypotheses match our situation with $\mathcal Q=\mathcal P$, $F_j=f_j$, and $F_\infty=f_\infty$, because the previous step defined $\mathcal A_j=\mathcal G\vee\mathcal P_{[1,j]}$, $\mathcal A_\infty=\mathcal G\vee\bigvee_{i=1}^{\infty}T^i\mathcal P$, and $f_j=I_\mu(\mathcal P\mid\mathcal A_j)$. Therefore the map $R_N:X\to[0,\infty]$ defined by
\begin{align*}
R_N(x):=\sup_{j\geq N}|f_j(x)-f_\infty(x)|
\end{align*}
belongs to $L^1(X,\mathcal B,\mu)$ and satisfies
\begin{align*}
\int_X R_N(x)\,d\mu(x)\to0\quad\text{as }N\to\infty.
\end{align*}
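Therefore Maker's theorem applies to the sequence $(f_j)_{j\ge0}$ and shows that $\frac{1}{n}\sum_{k=0}^{n-1}(f_k-f_\infty)\circ T^k\to0$ $\mu$-a.e. and in $L^1(X,\mathcal B,\mu)$. Since $f_\infty\in L^1(X,\mathcal B,\mu)$, Birkhoff's ergodic theorem gives $\frac{1}{n}\sum_{k=0}^{n-1}f_\infty\circ T^k\to\mathbb E_\mu[f_\infty\mid\mathcal I_T]$ in the same two senses. Adding the two limits gives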
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1} f_k(T^k x)\to \mathbb E_\mu[f_\infty\mid\mathcal I_T](x)
\end{align*}
for $\mu$-a.e. $x$ and in $L^1(X,\mathcal B,\mu)$. Since the system is ergodic, every member of $\mathcal I_T$ has $\mu$-measure $0$ or $1$, so
\begin{align*}
\mathbb E_\mu[f_\infty\mid\mathcal I_T]=\int_X f_\infty(y)\,d\mu(y)
\end{align*}
$\mu$-a.e. Combining this with the decomposition $I_n^{\mathcal P\mid\mathcal G}=\sum_{k=0}^{n-1}f_k\circ T^k$ from the second step gives
\begin{align*}
\frac{1}{n}I_n^{\mathcal P\mid\mathcal G}(x)\to \int_X f_\infty(y)\,d\mu(y)
\end{align*}
for $\mu$-a.e. $x$ and in $L^1(X,\mathcal B,\mu)$.
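For orientation, the tail-envelope form of Maker's theorem used above admits a short proof sketch from Birkhoff's ergodic theorem. This is a sketch under the stated hypotheses on $(g_j)$ and $R_N^g$, not a substitute for the cited theorem. For fixed $N$ and $n>N$, splitting the sum at $N$ and using $|g_k-g_\infty|\le R_N^g$ for $k\ge N$ gives
\begin{align*}
\left|\frac1n\sum_{k=0}^{n-1}(g_k-g_\infty)(T^kx)\right|
&\le\frac1n\sum_{k=0}^{N-1}|g_k-g_\infty|(T^kx)
+\frac1n\sum_{k=N}^{n-1}R_N^g(T^kx),\\
\limsup_{n\to\infty}\left|\frac1n\sum_{k=0}^{n-1}(g_k-g_\infty)(T^kx)\right|
&\le\mathbb E_\mu[R_N^g\mid\mathcal I_T](x)
\end{align*}
for $\mu$-a.e. $x$. The first term on the right of the first line tends to $0$ because it is a fixed finite sum divided by $n$, and the second is controlled by Birkhoff's theorem applied to $R_N^g\in L^1(X,\mathcal B,\mu)$. Since $R_N^g$ decreases in $N$ and $\int_X R_N^g\,d\mu\to0$, the conditional expectations $\mathbb E_\mu[R_N^g\mid\mathcal I_T]$ decrease to $0$ $\mu$-a.e., which gives the almost-everywhere statement. Integrating the first line and using the $T$-invariance of $\mu$ bounds the $L^1$ norm of the average by $\frac1n\sum_{k=0}^{N-1}\|g_k-g_\infty\|_{L^1}+\int_X R_N^g\,d\mu$, which gives the $L^1$ statement.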
[/step]
[step:Identify the limiting integral with relative entropy]
Recall that for a finite measurable partition $\mathcal Q$ and a sub-$\sigma$-algebra $\mathcal A\subseteq\mathcal B$, the conditional entropy $H_\mu(\mathcal Q\mid\mathcal A)$ is
\begin{align*}
H_\mu(\mathcal Q\mid\mathcal A):=\int_X I_\mu(\mathcal Q\mid\mathcal A)(y)\,d\mu(y).
\end{align*}
By this definition of conditional entropy and the definition of $f_\infty$,
\begin{align*}
\int_X f_\infty(y)\,d\mu(y)=H_\mu\left(\mathcal P\mid\mathcal G\vee\bigvee_{i=1}^{\infty}T^i\mathcal P\right).
\end{align*}
We use the relative Kolmogorov-Sinai formula in the following form: if $T$ is invertible and measure-preserving, $T^{-1}\mathcal G=\mathcal G$ modulo $\mu$-null sets, and $\mathcal P$ is finite, then
\begin{align*}
h_\mu(T,\mathcal P\mid\mathcal G)=H_\mu\left(\mathcal P\mid\mathcal G\vee\bigvee_{i=1}^{\infty}T^i\mathcal P\right).
\end{align*}
These hypotheses hold here: $T$ is invertible after the reduction in the first step, $\mu$ is $T$-invariant, $T^{-1}\mathcal G=\mathcal G$ modulo $\mu$-null sets, and $\mathcal P$ is finite. Combining the last two displays with the convergence established in the previous step, we obtain
\begin{align*}
\frac{1}{n}I_n^{\mathcal P\mid\mathcal G}(x)\to h_\mu(T,\mathcal P\mid\mathcal G)
\end{align*}
for $\mu$-a.e. $x$ and in $L^1(X,\mathcal B,\mu)$. This proves the theorem.
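As a consistency check, integrating the decomposition from the second step recovers the relative Kolmogorov-Sinai formula at the level of entropies. The sketch assumes that $h_\mu(T,\mathcal P\mid\mathcal G)$ is defined as $\lim_{n\to\infty}\frac1n H_\mu(\mathcal P_{[0,n-1]}\mid\mathcal G)$, which is the usual definition but is not restated in this proof:
\begin{align*}
\frac1n H_\mu(\mathcal P_{[0,n-1]}\mid\mathcal G)
&=\frac1n\sum_{k=0}^{n-1}\int_X f_k\circ T^k\,d\mu
=\frac1n\sum_{k=0}^{n-1}H_\mu(\mathcal P\mid\mathcal G\vee\mathcal P_{[1,k]})\\
&\xrightarrow[n\to\infty]{}H_\mu\left(\mathcal P\mid\mathcal G\vee\bigvee_{i=1}^{\infty}T^i\mathcal P\right).
\end{align*}
The limit holds because $H_\mu(\mathcal P\mid\mathcal G\vee\mathcal P_{[1,k]})=\int_X f_k\,d\mu\to\int_X f_\infty\,d\mu$ by the $L^1$ convergence from the third step, and Cesàro averages of a convergent sequence converge to the same limit.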
[/step]