[proofplan]
We first prove the theorem when $T$ is invertible. In that case the information of a forward block can be written exactly as a sum of conditional information increments: the symbol at time $k$ is conditioned on the symbols at times $0,\dots,k-1$, which becomes the present symbol conditioned on a finite past after applying $T^k$. The martingale convergence theorem identifies these finite-past increments with the conditional information given the entire past, and an ergodic averaging lemma plus Birkhoff's theorem turns the resulting time average into the entropy rate. For a non-invertible system we pass to the natural extension, where forward names and their measures are preserved under the factor map, and then descend the almost sure statement back to $X$.
[/proofplan]
[step:Introduce conditional information for a finite partition]
Let $\log$ denote the natural logarithm. For $n\in\mathbb N$, define the joined partition $\mathcal P_0^{n-1}$ by $\mathcal P_0^{n-1}:=\bigvee_{k=0}^{n-1}T^{-k}\mathcal P$, and let $\mathcal P_0^{n-1}(x)$ denote the atom of this joined partition containing $x$. Define the block information function $I_n^{\mathcal P}:X\to[0,\infty]$ by $I_n^{\mathcal P}(x):=-\log\mu(\mathcal P_0^{n-1}(x))$, with the convention $-\log 0=\infty$.
If $\mathcal Q$ is a finite measurable partition of $X$, define the information function $I_\mu(\mathcal Q):X\to[0,\infty]$ by
\begin{align*}
I_\mu(\mathcal Q)(x):=-\log\mu(\mathcal Q(x)),
\end{align*}
where $\mathcal Q(x)$ is the atom of $\mathcal Q$ containing $x$.
If $\mathcal G\subseteq\mathcal B$ is a sub-$\sigma$-algebra, define the conditional information function $I_\mu(\mathcal Q\mid\mathcal G):X\to[0,\infty]$ by the finite-atom formula
\begin{align*}
I_\mu(\mathcal Q\mid\mathcal G)(x):=-\sum_{Q\in\mathcal Q}\mathbb 1_Q(x)\log\mathbb E[\mathbb 1_Q\mid\mathcal G](x),
\end{align*}
with the convention $-\log 0=\infty$. This avoids treating $\mathcal Q(x)$ as a fixed set inside the [conditional expectation](/page/Conditional%20Expectation): for each atom $Q\in\mathcal Q$, $\mathbb E[\mathbb 1_Q\mid\mathcal G]:X\to[0,1]$ is a chosen $\mathcal G$-measurable version of the [conditional probability](/page/Conditional%20Probability) of $Q$, and the finite sum selects the atom containing $x$.
The corresponding conditional entropy is
\begin{align*}
H_\mu(\mathcal Q\mid\mathcal G):=\int_X I_\mu(\mathcal Q\mid\mathcal G)(x)\,d\mu(x).
\end{align*}
[/step]
[step:Prove the theorem when $T$ is invertible]
Assume in this step that $T:X\to X$ is invertible modulo null sets and that $T^{-1}:X\to X$ is measurable and measure-preserving. For each integer $j\in\mathbb Z$, define the shifted partition
\begin{align*}
\mathcal P_j:=T^{-j}\mathcal P.
\end{align*}
For $m\in\mathbb N$, define the finite-past $\sigma$-algebra
\begin{align*}
\mathcal F_m:=\sigma\left(\bigvee_{j=1}^{m}\mathcal P_{-j}\right)=\sigma\left(\bigvee_{j=1}^{m}T^j\mathcal P\right),
\end{align*}
and define the full-past $\sigma$-algebra
\begin{align*}
\mathcal F_\infty:=\sigma\left(\bigcup_{m=1}^{\infty}\mathcal F_m\right).
\end{align*}
For $m\geq 0$, with $\mathcal F_0:=\{\varnothing,X\}$, define $f_m:X\to[0,\infty]$ by
\begin{align*}
f_m(x):=I_\mu(\mathcal P\mid\mathcal F_m)(x).
\end{align*}
Also define $f:X\to[0,\infty]$ by
\begin{align*}
f(x):=I_\mu(\mathcal P\mid\mathcal F_\infty)(x).
\end{align*}
For each $n\in\mathbb N$, the chain rule for conditional probabilities applied to the finite partition $\mathcal P_0^{n-1}$ gives, outside the null union of zero-measure atoms,
\begin{align*}
I_n^{\mathcal P}(x)=\sum_{k=0}^{n-1} f_k(T^k x).
\end{align*}
Indeed, the conditional probability of the symbol at time $k$ given the previous symbols at times $0,\dots,k-1$ is, after applying $T^k$, exactly the conditional probability of the present $\mathcal P$-atom given the finite past $\mathcal F_k$.
We use the conditional information martingale convergence theorem for finite partitions: if $\mathcal R$ is a finite measurable partition with $H_\mu(\mathcal R)<\infty$ and $(\mathcal G_m)_{m\geq0}$ is an increasing sequence of sub-$\sigma$-algebras with $\mathcal G_\infty=\sigma(\bigcup_{m\geq0}\mathcal G_m)$, then $I_\mu(\mathcal R\mid\mathcal G_m)\to I_\mu(\mathcal R\mid\mathcal G_\infty)$ in $L^1$ and almost everywhere. Its hypotheses hold here because $\mathcal P$ is finite, so $H_\mu(\mathcal P)<\infty$, and because $\mathcal F_\infty$ is defined as $\sigma(\bigcup_{m\geq0}\mathcal F_m)$. Therefore $(f_m)_{m\geq0}$ converges to $f$ in $L^1(X,\mathcal B,\mu)$ and $\mu$-almost everywhere.
We now use the Breiman conditional-information averaging lemma in its finite-partition form. This auxiliary result is the Breiman-Maker estimate for the specific sequence of conditional information functions: if $(Y,\mathcal A,\nu,S)$ is measure-preserving, $\mathcal R$ is a finite measurable partition with $H_\nu(\mathcal R)<\infty$, $(\mathcal G_m)_{m\geq0}$ is an increasing sequence of sub-$\sigma$-algebras with $\mathcal G_\infty=\sigma(\bigcup_{m\geq0}\mathcal G_m)$, and $g_m:=I_\nu(\mathcal R\mid\mathcal G_m)$ and $g:=I_\nu(\mathcal R\mid\mathcal G_\infty)$, then
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}(g_k-g)(S^k y)=0
\end{align*}
for $\nu$-almost every $y\in Y$. This is the standard Breiman-Maker step specialized to conditional information; its proof supplies the required truncation and maximal-tail estimates for finite partitions, so no separate domination hypothesis such as $\sup_m|g_m-g|\in L^1$ is being assumed here.
The hypotheses are satisfied with $Y=X$, $\mathcal A=\mathcal B$, $\nu=\mu$, $S=T$, $\mathcal R=\mathcal P$, and $\mathcal G_m=\mathcal F_m$: the transformation $T$ is measure-preserving, the partition $\mathcal P$ is finite and hence has finite entropy, the sequence $(\mathcal F_m)_{m\geq0}$ is increasing, and $\mathcal F_\infty=\sigma(\bigcup_{m\geq0}\mathcal F_m)$ by definition. Hence
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}(f_k-f)(T^k x)=0
\end{align*}
for $\mu$-almost every $x\in X$. Therefore
\begin{align*}
\lim_{n\to\infty}\left(\frac{1}{n}I_n^{\mathcal P}(x)-\frac{1}{n}\sum_{k=0}^{n-1}f(T^k x)\right)=0
\end{align*}
for $\mu$-almost every $x\in X$.
Since $f\in L^1(X,\mathcal B,\mu)$, $T$ is measure-preserving, and $T$ is ergodic, Birkhoff's ergodic theorem for integrable observables applies and gives
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}f(T^k x)=\int_X f\,d\mu
\end{align*}
for $\mu$-almost every $x\in X$. Finally, the fixed-partition entropy formula for invertible systems says that, for a finite partition $\mathcal P$ of an invertible measure-preserving system, the fixed-partition entropy equals the conditional entropy of the present atom given the full past. Applying that formula identifies
\begin{align*}
\int_X f\,d\mu=H_\mu(\mathcal P\mid\mathcal F_\infty)=h_\mu(T,\mathcal P).
\end{align*}
Combining the preceding limits proves
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}I_n^{\mathcal P}(x)=h_\mu(T,\mathcal P)
\end{align*}
for $\mu$-almost every $x\in X$ in the invertible case.
[guided]
Assume first that the transformation is invertible. The reason for making this assumption is that it lets us speak about the past of a point. The forward name of $x$ records the atoms visited by $x,T x,\dots,T^{n-1}x$; after shifting by $T^k$, the previous symbols become symbols in the past of $T^k x$.
For each integer $j\in\mathbb Z$, define
\begin{align*}
\mathcal P_j:=T^{-j}\mathcal P.
\end{align*}
Thus $\mathcal P_j(x)$ records the $\mathcal P$-symbol of $T^j x$. For $m\in\mathbb N$, define the finite-past $\sigma$-algebra
\begin{align*}
\mathcal F_m:=\sigma\left(\bigvee_{j=1}^{m}\mathcal P_{-j}\right).
\end{align*}
This is the information contained in the symbols at times $-m,\dots,-1$. Equivalently, since $\mathcal P_{-j}=T^j\mathcal P$ for $j\geq1$, we may write $\mathcal F_m=\sigma(\bigvee_{j=1}^{m}T^j\mathcal P)$. Define also
\begin{align*}
\mathcal F_\infty:=\sigma\left(\bigcup_{m=1}^{\infty}\mathcal F_m\right),
\end{align*}
the $\sigma$-algebra generated by the entire past.
For $m\geq0$, with $\mathcal F_0=\{\varnothing,X\}$, define $f_m:X\to[0,\infty]$ by
\begin{align*}
f_m(x):=I_\mu(\mathcal P\mid\mathcal F_m)(x).
\end{align*}
Thus $f_m(x)$ is the information in the present $\mathcal P$-symbol of $x$ once the previous $m$ symbols are known. Define the limiting conditional information $f:X\to[0,\infty]$ by
\begin{align*}
f(x):=I_\mu(\mathcal P\mid\mathcal F_\infty)(x).
\end{align*}
Now fix $n\in\mathbb N$. The ordinary chain rule for probabilities factors the measure of an $n$-block atom into successive conditional probabilities. For a point $x$ outside the null union of atoms of $\mathcal P_0^{n-1}$ with measure $0$, this gives
\begin{align*}
\mu(\mathcal P_0^{n-1}(x))=\prod_{k=0}^{n-1}\mu(\mathcal P_k(x)\mid\mathcal P_0^{k-1})(x),
\end{align*}
where the factor for $k=0$ is interpreted as the unconditional probability of the atom $\mathcal P(x)$. Taking $-\log$ turns this product into a sum:
\begin{align*}
I_n^{\mathcal P}(x)=\sum_{k=0}^{n-1} f_k(T^k x).
\end{align*}
The equality holds because conditioning the $k$-th symbol of $x$ on the symbols at times $0,\dots,k-1$ is the same, after applying $T^k$, as conditioning the present symbol of $T^k x$ on its symbols at times $-k,\dots,-1$.
The next question is what happens to $f_k$ as $k$ grows. The conditional information martingale convergence theorem for finite partitions applies to a finite partition with finite entropy and to an increasing family of conditioning $\sigma$-algebras. Those hypotheses are satisfied here: $\mathcal P$ has finitely many atoms, hence $H_\mu(\mathcal P)<\infty$, and the $\sigma$-algebras $\mathcal F_k$ increase to $\mathcal F_\infty=\sigma(\bigcup_{k\geq0}\mathcal F_k)$. Therefore
\begin{align*}
f_k\to f
\end{align*}
in $L^1(X,\mathcal B,\mu)$ and $\mu$-almost everywhere.
This convergence alone is not quite enough, because the summand in the block decomposition is $f_k(T^k x)$ rather than $f_k(x)$. We therefore invoke Breiman's conditional-information averaging lemma in the finite-partition form. The statement needed here is: if $(Y,\mathcal A,\nu,S)$ is measure-preserving, $\mathcal R$ is a finite measurable partition with $H_\nu(\mathcal R)<\infty$, $(\mathcal G_m)_{m\geq0}$ is an increasing sequence of sub-$\sigma$-algebras, $\mathcal G_\infty=\sigma(\bigcup_{m\geq0}\mathcal G_m)$, and $g_m=I_\nu(\mathcal R\mid\mathcal G_m)$ and $g=I_\nu(\mathcal R\mid\mathcal G_\infty)$, then
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}(g_k-g)(S^k y)=0
\end{align*}
for $\nu$-almost every $y\in Y$. The important point is that this is not the bare assertion that arbitrary $L^1$ convergence may be sampled along the orbit with a moving index. The lemma is specialized to conditional information of a finite partition. Its proof truncates the functions $g_m-g$ at a height $M$, applies Maker's averaging argument to the bounded truncated part, and then uses the finite entropy of $\mathcal R$ together with conditional-information tail estimates to make the contribution of the set where conditional probabilities are smaller than $e^{-M}$ uniformly small after averaging. Thus no separate domination hypothesis such as $\sup_m|g_m-g|\in L^1$ is being assumed.
We check the hypotheses in our setting. The system $(X,\mathcal B,\mu,T)$ is measure-preserving by assumption. The partition $\mathcal P$ is finite. The $\sigma$-algebras $(\mathcal F_m)_{m\geq0}$ are increasing because adding one more past symbol refines the generated information, and $\mathcal F_\infty=\sigma(\bigcup_{m\geq0}\mathcal F_m)$ by definition. With $Y=X$, $\mathcal A=\mathcal B$, $\nu=\mu$, $S=T$, $\mathcal R=\mathcal P$, and $\mathcal G_m=\mathcal F_m$, the functions in the lemma are exactly $g_m=f_m$ and $g=f$. Applying the lemma gives
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}(f_k-f)(T^k x)=0
\end{align*}
for $\mu$-almost every $x\in X$. Hence
\begin{align*}
\lim_{n\to\infty}\left(\frac{1}{n}I_n^{\mathcal P}(x)-\frac{1}{n}\sum_{k=0}^{n-1}f(T^k x)\right)=0.
\end{align*}
It remains to identify the limit of the ordinary time average of $f$. Since $f\in L^1(X,\mathcal B,\mu)$ and $T$ is measure-preserving and ergodic, Birkhoff's ergodic theorem gives
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}f(T^k x)=\int_X f\,d\mu
\end{align*}
for $\mu$-almost every $x\in X$. The fixed-partition entropy formula for invertible systems identifies this integral with the entropy rate of the partition:
\begin{align*}
\int_X f\,d\mu=H_\mu(\mathcal P\mid\mathcal F_\infty)=h_\mu(T,\mathcal P).
\end{align*}
Combining the chain-rule decomposition, Maker's lemma, Birkhoff's theorem, and the entropy formula proves
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}I_n^{\mathcal P}(x)=h_\mu(T,\mathcal P)
\end{align*}
for $\mu$-almost every $x\in X$ when $T$ is invertible.
[/guided]
[/step]
[step:Pass from a non-invertible system to its natural extension]
Now let $T$ be an arbitrary measure-preserving transformation. Let
\begin{align*}
\widehat X:=\{(x_j)_{j\in\mathbb Z}\in X^{\mathbb Z}:T x_j=x_{j+1}\text{ for every }j\in\mathbb Z\}
\end{align*}
be the natural extension space. Let $\widehat{\mathcal B}$ be the trace of the product $\sigma$-algebra on $\widehat X$, let $\widehat T:\widehat X\to\widehat X$ be the shift map given by
\begin{align*}
\widehat T((x_j)_{j\in\mathbb Z}):=(x_{j+1})_{j\in\mathbb Z},
\end{align*}
and let $\pi_0:\widehat X\to X$ be the coordinate projection given by
\begin{align*}
\pi_0((x_j)_{j\in\mathbb Z}):=x_0.
\end{align*}
By the natural [extension theorem](/theorems/59) for standard probability measure-preserving systems, applied to the probability-preserving endomorphism $T:(X,\mathcal B,\mu)\to(X,\mathcal B,\mu)$, there is a probability measure $\widehat\mu$ on $(\widehat X,\widehat{\mathcal B})$ such that $(\widehat X,\widehat{\mathcal B},\widehat\mu,\widehat T)$ is an invertible measure-preserving system, $\pi_0\circ\widehat T=T\circ\pi_0$, and $\widehat\mu\circ\pi_0^{-1}=\mu$. In particular, the inverse is the measurable shift $\widehat T^{-1}((x_j)_{j\in\mathbb Z})=(x_{j-1})_{j\in\mathbb Z}$, which is well-defined on $\widehat X$ because $T x_{j-1}=x_j$ for every $j\in\mathbb Z$, and the theorem includes preservation of $\widehat\mu$ by both $\widehat T$ and $\widehat T^{-1}$. The theorem also states that ergodicity is preserved by the natural extension: since $(X,\mathcal B,\mu,T)$ is ergodic, $(\widehat X,\widehat{\mathcal B},\widehat\mu,\widehat T)$ is ergodic.
Define the lifted partition
\begin{align*}
\widehat{\mathcal P}:=\pi_0^{-1}\mathcal P=\{\pi_0^{-1}(P):P\in\mathcal P\}.
\end{align*}
For every $n\in\mathbb N$ and every $\widehat x\in\widehat X$ with $x:=\pi_0(\widehat x)$, the factor relation $\pi_0\circ\widehat T^k=T^k\circ\pi_0$ gives
\begin{align*}
\widehat{\mathcal P}_0^{n-1}(\widehat x)=\pi_0^{-1}(\mathcal P_0^{n-1}(x))\cap\widehat X.
\end{align*}
Since $\widehat\mu\circ\pi_0^{-1}=\mu$, it follows that
\begin{align*}
\widehat\mu(\widehat{\mathcal P}_0^{n-1}(\widehat x))=\mu(\mathcal P_0^{n-1}(x)).
\end{align*}
Therefore
\begin{align*}
I_n^{\widehat{\mathcal P}}(\widehat x)=I_n^{\mathcal P}(\pi_0(\widehat x)).
\end{align*}
Applying the invertible case to $(\widehat X,\widehat{\mathcal B},\widehat\mu,\widehat T)$ and $\widehat{\mathcal P}$ gives
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}I_n^{\widehat{\mathcal P}}(\widehat x)=h_{\widehat\mu}(\widehat T,\widehat{\mathcal P})
\end{align*}
for $\widehat\mu$-almost every $\widehat x\in\widehat X$. We verify the fixed-partition entropy equality directly from the lifted block measures. For each $n\in\mathbb N$, the atoms of $\bigvee_{k=0}^{n-1}\widehat T^{-k}\widehat{\mathcal P}$ are precisely the nonempty sets of the form $\pi_0^{-1}(A)\cap\widehat X$, where $A$ is an atom of $\bigvee_{k=0}^{n-1}T^{-k}\mathcal P$. Since $\widehat\mu\circ\pi_0^{-1}=\mu$, corresponding atoms have the same measure. Therefore
\begin{align*}
H_{\widehat\mu}\left(\bigvee_{k=0}^{n-1}\widehat T^{-k}\widehat{\mathcal P}\right)=H_\mu\left(\bigvee_{k=0}^{n-1}T^{-k}\mathcal P\right).
\end{align*}
Dividing by $n$ and taking the defining limit of fixed-partition entropy gives
\begin{align*}
h_{\widehat\mu}(\widehat T,\widehat{\mathcal P})=h_\mu(T,\mathcal P).
\end{align*}
Thus
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}I_n^{\mathcal P}(\pi_0(\widehat x))=h_\mu(T,\mathcal P)
\end{align*}
for $\widehat\mu$-almost every $\widehat x\in\widehat X$.
[guided]
When $T$ is not invertible, a point $x\in X$ may have many possible histories, so the previous step cannot be applied directly. The natural extension replaces $x$ by a full two-sided orbit. Define
\begin{align*}
\widehat X:=\{(x_j)_{j\in\mathbb Z}\in X^{\mathbb Z}:T x_j=x_{j+1}\text{ for every }j\in\mathbb Z\}.
\end{align*}
Let $\widehat{\mathcal B}$ be the trace of the product $\sigma$-algebra on $\widehat X$, define the shift map $\widehat T:\widehat X\to\widehat X$ by
\begin{align*}
\widehat T((x_j)_{j\in\mathbb Z}):=(x_{j+1})_{j\in\mathbb Z},
\end{align*}
and define the coordinate projection $\pi_0:\widehat X\to X$ by
\begin{align*}
\pi_0((x_j)_{j\in\mathbb Z}):=x_0.
\end{align*}
The natural extension theorem applies because $T$ is a measure-preserving map on the standard probability space $(X,\mathcal B,\mu)$. It gives a probability measure $\widehat\mu$ on $(\widehat X,\widehat{\mathcal B})$ such that $\widehat T$ is invertible and measure-preserving. With our convention $\widehat T((x_j)_{j\in\mathbb Z})=(x_{j+1})_{j\in\mathbb Z}$, the inverse is the measurable shift $\widehat T^{-1}((x_j)_{j\in\mathbb Z})=(x_{j-1})_{j\in\mathbb Z}$; this is well-defined on $\widehat X$ because each sequence in $\widehat X$ satisfies $T x_{j-1}=x_j$. The same theorem gives preservation of $\widehat\mu$ by this invertible shift, the factor identity
\begin{align*}
\pi_0\circ\widehat T=T\circ\pi_0
\end{align*}
holds, and the projection has the original law:
\begin{align*}
\widehat\mu\circ\pi_0^{-1}=\mu.
\end{align*}
The same natural-extension theorem preserves ergodicity, so the lifted system is ergodic because the original system is ergodic.
Define the lifted partition
\begin{align*}
\widehat{\mathcal P}:=\pi_0^{-1}\mathcal P=\{\pi_0^{-1}(P):P\in\mathcal P\}.
\end{align*}
Why does this partition preserve the information blocks? Fix $n\in\mathbb N$ and $\widehat x\in\widehat X$, and write $x:=\pi_0(\widehat x)$. For every $0\leq k\leq n-1$, the factor identity gives
\begin{align*}
\pi_0(\widehat T^k\widehat x)=T^k(\pi_0(\widehat x))=T^k x.
\end{align*}
Thus the length-$n$ lifted name of $\widehat x$ with respect to $\widehat{\mathcal P}$ is exactly the pullback of the length-$n$ name of $x$ with respect to $\mathcal P$:
\begin{align*}
\widehat{\mathcal P}_0^{n-1}(\widehat x)=\pi_0^{-1}(\mathcal P_0^{n-1}(x))\cap\widehat X.
\end{align*}
Since $\widehat\mu\circ\pi_0^{-1}=\mu$, the two atoms have the same measure:
\begin{align*}
\widehat\mu(\widehat{\mathcal P}_0^{n-1}(\widehat x))=\mu(\mathcal P_0^{n-1}(x)).
\end{align*}
Therefore their information functions agree pointwise:
\begin{align*}
I_n^{\widehat{\mathcal P}}(\widehat x)=I_n^{\mathcal P}(\pi_0(\widehat x)).
\end{align*}
The lifted system is now invertible and ergodic, so the invertible case applies to $(\widehat X,\widehat{\mathcal B},\widehat\mu,\widehat T)$ and $\widehat{\mathcal P}$. Hence
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}I_n^{\widehat{\mathcal P}}(\widehat x)=h_{\widehat\mu}(\widehat T,\widehat{\mathcal P})
\end{align*}
for $\widehat\mu$-almost every $\widehat x\in\widehat X$.
It remains to check that this entropy is the original entropy. For each $n\in\mathbb N$, the atoms of $\bigvee_{k=0}^{n-1}\widehat T^{-k}\widehat{\mathcal P}$ are exactly pullbacks under $\pi_0$ of atoms of $\bigvee_{k=0}^{n-1}T^{-k}\mathcal P$, up to empty sets. Corresponding atoms have equal measure because $\widehat\mu\circ\pi_0^{-1}=\mu$. Therefore the finite-block entropies agree:
\begin{align*}
H_{\widehat\mu}\left(\bigvee_{k=0}^{n-1}\widehat T^{-k}\widehat{\mathcal P}\right)=H_\mu\left(\bigvee_{k=0}^{n-1}T^{-k}\mathcal P\right).
\end{align*}
Dividing by $n$ and passing to the defining entropy limits gives
\begin{align*}
h_{\widehat\mu}(\widehat T,\widehat{\mathcal P})=h_\mu(T,\mathcal P).
\end{align*}
Combining the information equality with the lifted almost sure convergence yields
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}I_n^{\mathcal P}(\pi_0(\widehat x))=h_\mu(T,\mathcal P)
\end{align*}
for $\widehat\mu$-almost every $\widehat x\in\widehat X$.
[/guided]
[/step]
[step:Descend the almost sure conclusion to the original system]
Define the exceptional set
\begin{align*}
E:=\left\{x\in X:\lim_{n\to\infty}\frac{1}{n}I_n^{\mathcal P}(x)\neq h_\mu(T,\mathcal P)\right\}.
\end{align*}
The preceding step proves that
\begin{align*}
\widehat\mu(\pi_0^{-1}(E))=0.
\end{align*}
Because $\widehat\mu\circ\pi_0^{-1}=\mu$, we have
\begin{align*}
\mu(E)=\widehat\mu(\pi_0^{-1}(E))=0.
\end{align*}
Therefore
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}I_n^{\mathcal P}(x)=h_\mu(T,\mathcal P)
\end{align*}
for $\mu$-almost every $x\in X$. Since $I_n^{\mathcal P}(x)=-\log\mu(\mathcal P_0^{n-1}(x))$ by definition, this is exactly the equivalent logarithmic formulation. This completes the proof.
[/step]