[proofplan]
We first prove the assertion for invertible probability-preserving systems. In that case the information of an $n$-name factors by the chain rule into a sum of present-symbol information conditioned on longer and longer finite pasts. Martingale convergence identifies the limiting present information conditioned on the full past, and the conditional ergodic theorem, together with the standard Maker averaging lemma, gives the almost-sure limit of the normalized information. For a non-invertible system we pass to the natural extension, where forward-name atoms and their measures are preserved by the factor map, and then descend the limiting function to the original system.
[/proofplan]
[step:Define conditional information for finite partitions]
Let $(Y,\mathcal C,\nu,S)$ be an invertible probability-preserving system, and let $\mathcal Q$ be a finite measurable partition of $Y$. Define the invariant $\sigma$-algebra
\begin{align*}
\mathcal I_S:=\{A\in\mathcal C:S^{-1}A=A\text{ modulo }\nu\}.
\end{align*}
For integers $a\leq b$, define the finite name partition
\begin{align*}
\mathcal Q_{[a,b]}:=\bigvee_{j=a}^{b}S^{-j}\mathcal Q.
\end{align*}
For $y\in Y$, let $\mathcal Q_{[a,b]}(y)$ denote the atom of $\mathcal Q_{[a,b]}$ containing $y$.
If $\mathcal G\subseteq\mathcal C$ is a sub-$\sigma$-algebra, define the conditional information function
\begin{align*}
I_\nu(\mathcal Q\mid\mathcal G):Y\to[0,\infty]
\end{align*}
by
\begin{align*}
I_\nu(\mathcal Q\mid\mathcal G)(y):=-\log \nu(\mathcal Q(y)\mid\mathcal G)(y),
\end{align*}
where $\nu(\mathcal Q(y)\mid\mathcal G)$ means the [conditional probability](/page/Conditional%20Probability) of the atom of $\mathcal Q$ containing $y$. More explicitly, if $\mathcal Q=\{Q_1,\dots,Q_r\}$, then
\begin{align*}
I_\nu(\mathcal Q\mid\mathcal G)(y)=-\sum_{i=1}^r \mathbb 1_{Q_i}(y)\log \mathbb E_\nu[\mathbb 1_{Q_i}\mid\mathcal G](y),
\end{align*}
where each set $\{y\in Q_i:\mathbb E_\nu[\mathbb 1_{Q_i}\mid\mathcal G](y)=0\}$ is $\nu$-null, so the value assigned there (formally $+\infty$) is irrelevant.
For each $m\in\mathbb N$, define the finite-past $\sigma$-algebra
\begin{align*}
\mathcal G_m:=\sigma(\mathcal Q_{[-m,-1]})
\end{align*}
and define the full-past $\sigma$-algebra
\begin{align*}
\mathcal G_\infty:=\bigvee_{m=1}^{\infty}\mathcal G_m.
\end{align*}
Finally define functions $J_m:Y\to[0,\infty]$ and $J_\infty:Y\to[0,\infty]$ by
\begin{align*}
J_m:=I_\nu(\mathcal Q\mid\mathcal G_m)
\end{align*}
and
\begin{align*}
J_\infty:=I_\nu(\mathcal Q\mid\mathcal G_\infty).
\end{align*}
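Since $\mathcal Q$ is finite, each $J_m$ and $J_\infty$ is integrable: for every sub-$\sigma$-algebra $\mathcal G\subseteq\mathcal C$, the integral $\int_Y I_\nu(\mathcal Q\mid\mathcal G)\,d\nu$ is the conditional entropy of $\mathcal Q$ given $\mathcal G$, which is at most $\log|\mathcal Q|$.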
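As an illustration (a sketch under assumptions not made in the source), suppose $Y=\mathcal A^{\mathbb Z}$ for a finite alphabet $\mathcal A$, $S$ is the left shift $(Sy)_n=y_{n+1}$, $\mathcal Q$ is the time-zero partition $\{\{y:y_0=a\}:a\in\mathcal A\}$, and $\nu$ is a stationary Markov measure with a hypothetical transition matrix $(p_{ab})_{a,b\in\mathcal A}$ and stationary vector $(\pi_a)_{a\in\mathcal A}$. Then $\mathcal G_m$ is generated by the coordinates $y_{-m},\dots,y_{-1}$, and the Markov property gives, for $\nu$-almost every $y$,
\begin{align*}
I_\nu(\mathcal Q\mid\{\varnothing,Y\})(y)=-\log\pi_{y_0},\qquad J_m(y)=-\log p_{y_{-1}y_0}\quad(m\geq1),\qquad J_\infty(y)=-\log p_{y_{-1}y_0}.
\end{align*}
In this example the sequence $(J_m)_{m\geq1}$ is constant, so conditioning on longer pasts adds nothing beyond the most recent symbol.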
[/step]
[step:Factor the information of an $n$-name into finite-past increments]For $n\in\mathbb N$, define the $n$-name information function
\begin{align*}
I_{n,\mathcal Q}:Y\to[0,\infty]
\end{align*}
by
\begin{align*}
I_{n,\mathcal Q}(y):=-\log \nu(\mathcal Q_{[0,n-1]}(y)).
\end{align*}
We claim that, for $\nu$-almost every $y\in Y$,
\begin{align*}
I_{n,\mathcal Q}(y)=\sum_{k=0}^{n-1}J_k(S^k y),
\end{align*}
where $J_0:=I_\nu(\mathcal Q\mid\{\varnothing,Y\})$.
Indeed, fix atoms $Q_{i_0},\dots,Q_{i_{n-1}}\in\mathcal Q$ and set
\begin{align*}
A_k:=\bigcap_{\ell=0}^{k}S^{-\ell}Q_{i_\ell}
\end{align*}
for $0\leq k\leq n-1$. On the atom $A_{n-1}$, the chain rule for conditional probabilities gives
\begin{align*}
\nu(A_{n-1})=\prod_{k=0}^{n-1}\nu(S^{-k}Q_{i_k}\mid A_{k-1}),
\end{align*}
where $A_{-1}:=Y$. Since $S$ is invertible and $\nu$ is $S$-invariant, the conditional probability of $S^{-k}Q_{i_k}$ given the previous symbols $A_{k-1}$ equals the conditional probability of the present atom $Q_{i_k}$ at $S^k y$ given the $k$-symbol past $\mathcal Q_{[-k,-1]}$. Taking $-\log$ of the product gives the displayed identity.[/step]
[guided]The purpose of this step is to convert one large probability, namely the probability of an entire length-$n$ name, into a sum of one-step conditional probabilities. This is the entropy analogue of writing the probability of a word as the probability of its first symbol times the probability of the second symbol given the first, and so on.
For $n\in\mathbb N$, the partition
\begin{align*}
\mathcal Q_{[0,n-1]}:=\bigvee_{j=0}^{n-1}S^{-j}\mathcal Q
\end{align*}
records the symbols seen at times $0,1,\dots,n-1$. Its atom at $y$ is the set of all points whose first $n$ symbols agree with those of $y$. The information of this atom is the function
\begin{align*}
I_{n,\mathcal Q}:Y\to[0,\infty]
\end{align*}
defined by
\begin{align*}
I_{n,\mathcal Q}(y):=-\log \nu(\mathcal Q_{[0,n-1]}(y)).
\end{align*}
Fix an atom of $\mathcal Q_{[0,n-1]}$. Thus choose atoms $Q_{i_0},\dots,Q_{i_{n-1}}\in\mathcal Q$ and define
\begin{align*}
A_k:=\bigcap_{\ell=0}^{k}S^{-\ell}Q_{i_\ell}
\end{align*}
for $0\leq k\leq n-1$, with $A_{-1}:=Y$. The set $A_k$ is the cylinder determined by the first $k+1$ symbols. The elementary chain rule for conditional probabilities gives
\begin{align*}
\nu(A_{n-1})=\prod_{k=0}^{n-1}\nu(S^{-k}Q_{i_k}\mid A_{k-1}).
\end{align*}
Now we rewrite the $k$th factor from the viewpoint of the point $S^k y$. Because $S$ is invertible and $\nu$ is $S$-invariant, applying $S^k$ transports the previous-symbol cylinder $A_{k-1}$ to the atom of the past partition $\mathcal Q_{[-k,-1]}$ containing $S^k y$, while $S^{-k}Q_{i_k}$ is transported to the present atom $Q_{i_k}$. Therefore
\begin{align*}
\nu(S^{-k}Q_{i_k}\mid A_{k-1})=\nu(\mathcal Q(S^k y)\mid\mathcal Q_{[-k,-1]})(S^k y).
\end{align*}
By the definition of $J_k$,
\begin{align*}
J_k(S^k y)=-\log \nu(\mathcal Q(S^k y)\mid\mathcal Q_{[-k,-1]})(S^k y).
\end{align*}
Taking $-\log$ of the chain-rule product converts the product into a sum, so for $\nu$-almost every $y\in Y$,
\begin{align*}
I_{n,\mathcal Q}(y)=\sum_{k=0}^{n-1}J_k(S^k y).
\end{align*}
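For orientation, the case $n=2$ can be written out in full; this is only a worked instance of the argument above, with $i_0,i_1$ the indices of the atoms containing $y$ and $Sy$. Since $S$ is invertible and preserves $\nu$, we have $\nu(SB)=\nu(B)$ for every $B\in\mathcal C$, so
\begin{align*}
\nu(Q_{i_0}\cap S^{-1}Q_{i_1})=\nu(Q_{i_0})\cdot\frac{\nu(Q_{i_0}\cap S^{-1}Q_{i_1})}{\nu(Q_{i_0})}=\nu(Q_{i_0})\cdot\frac{\nu(SQ_{i_0}\cap Q_{i_1})}{\nu(SQ_{i_0})}.
\end{align*}
Here $SQ_{i_0}$ is the atom of $\mathcal Q_{[-1,-1]}=S\mathcal Q$ containing $Sy$, and $Q_{i_1}=\mathcal Q(Sy)$, so the second factor is $\nu(\mathcal Q(Sy)\mid\mathcal Q_{[-1,-1]})(Sy)$. Taking $-\log$ gives
\begin{align*}
I_{2,\mathcal Q}(y)=J_0(y)+J_1(Sy).
\end{align*}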
To make the null-set issue precise, discard every atom $A_{n-1}$ with $\nu(A_{n-1})=0$; the union of these atoms has $\nu$-measure $0$ because $\mathcal Q_{[0,n-1]}$ is finite. On every remaining atom $A_{n-1}$, each preceding cylinder $A_{k-1}$ has positive measure and the elementary conditional probabilities $\nu(S^{-k}Q_{i_k}\mid A_{k-1})$ are well-defined. The conditional-expectation representatives defining $J_k$ may disagree with these finite-atom conditional probabilities only on a $\nu$-null subset of each finite-past atom. Since there are only finitely many such atoms for the fixed $n$, their union is null. Therefore the displayed identity holds outside a $\nu$-null set depending on $n$, and intersecting over $n\in\mathbb N$ gives a single full-measure set on which the identity holds for every $n$.[/guided]
[step:Pass from finite pasts to the full past]The $\sigma$-algebras $(\mathcal G_m)_{m\geq1}$ increase to $\mathcal G_\infty$. We use the finite-partition information convergence theorem: for a finite partition $\mathcal Q$ and an increasing sequence of sub-$\sigma$-algebras $\mathcal G_m$ with join $\mathcal G_\infty$, the conditional information functions satisfy
\begin{align*}
I_\nu(\mathcal Q\mid\mathcal G_m)\to I_\nu(\mathcal Q\mid\mathcal G_\infty)
\end{align*}
$\nu$-almost everywhere and in $L^1(Y,\mathcal C,\nu)$. Its hypotheses hold here because $\mathcal Q$ is finite, each $\mathcal G_m$ is a sub-$\sigma$-algebra of $\mathcal C$, and $\mathcal G_\infty=\bigvee_{m=1}^{\infty}\mathcal G_m$ by definition. Hence
\begin{align*}
J_m\to J_\infty
\end{align*}
$\nu$-almost everywhere and in $L^1(Y,\mathcal C,\nu)$.
For completeness, this finite-partition convergence is the direct finite-alphabet consequence of martingale convergence. For each atom $Q_i$ of $\mathcal Q$, the conditional probabilities $\mathbb E_\nu[\mathbb 1_{Q_i}\mid\mathcal G_m]$ converge almost everywhere and in $L^1$ to $\mathbb E_\nu[\mathbb 1_{Q_i}\mid\mathcal G_\infty]$. Since there are finitely many atoms, the corresponding conditional information sums converge almost everywhere; the standard truncation of $-\log t$ on $[\varepsilon,1]$ and the finite entropy bound for $\mathcal Q$ give the $L^1$ convergence.
We also invoke Breiman's maximal lemma for finite-alphabet conditional information in the exact form needed here. If $\mathcal Q$ is finite, $(\mathcal G_m)_{m\geq1}$ is increasing, $\mathcal G_\infty=\bigvee_{m=1}^{\infty}\mathcal G_m$, and
\begin{align*}
F_m:=I_\nu(\mathcal Q\mid\mathcal G_m),\qquad F_\infty:=I_\nu(\mathcal Q\mid\mathcal G_\infty),
\end{align*}
then the maximal tails
\begin{align*}
R_{r,\mathcal Q,\mathcal G}(y):=\sup_{m\geq r}|F_m(y)-F_\infty(y)|
\end{align*}
belong to $L^1(Y,\mathcal C,\nu)$ and satisfy
\begin{align*}
\int_Y R_{r,\mathcal Q,\mathcal G}(y)\,d\nu(y)\to0
\end{align*}
as $r\to\infty$. This maximal lemma is the standard strengthening of martingale information convergence used in proofs of the [Shannon-McMillan-Breiman theorem](/theorems/6766); it is stronger than mere $L^1$ convergence and is the input that makes the non-stationary information average legitimate. Applying it to the present finite partition $\mathcal Q$ and increasing filtration $(\mathcal G_m)_{m\geq1}$ gives
\begin{align*}
R_r(y):=\sup_{m\geq r}|J_m(y)-J_\infty(y)|
\end{align*}
with $R_r\in L^1(Y,\mathcal C,\nu)$ and
\begin{align*}
\int_Y R_r(y)\,d\nu(y)\to0.
\end{align*}
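The tail estimate can be derived from a standard maximal inequality; the following sketch is an aside in the present notation, not an argument taken from the source. Put $J^*:=\sup_{m\geq1}J_m$. For an atom $Q_i$ of $\mathcal Q$ and $t\geq0$, let $B_m$ be the set of points at which $\mathbb E_\nu[\mathbb 1_{Q_i}\mid\mathcal G_m]<e^{-t}$ holds for the first time at index $m$. Then $B_m\in\mathcal G_m$, the sets $B_m$ are pairwise disjoint, and $\nu(Q_i\cap B_m)=\int_{B_m}\mathbb E_\nu[\mathbb 1_{Q_i}\mid\mathcal G_m]\,d\nu\leq e^{-t}\nu(B_m)$. Summing over $m$ and then over $i$ gives
\begin{align*}
\nu\{J^*>t\}\leq|\mathcal Q|e^{-t},\qquad\text{hence}\qquad\int_Y J^*\,d\nu=\int_0^\infty\nu\{J^*>t\}\,dt\leq\int_0^\infty\min\{1,|\mathcal Q|e^{-t}\}\,dt=\log|\mathcal Q|+1.
\end{align*}
Since $J_m\geq0$ and $J_\infty=\lim_{m\to\infty}J_m\leq J^*$ almost everywhere, one has $0\leq R_r\leq J^*$ almost everywhere. Because $R_r\to0$ almost everywhere, dominated convergence yields $\int_Y R_r\,d\nu\to0$.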
The [conditional Birkhoff ergodic theorem](/theorems/518) applied to the integrable function $J_\infty:Y\to[0,\infty]$ gives
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}J_\infty(S^k y)\to \mathbb E_\nu[J_\infty\mid\mathcal I_S](y)
\end{align*}
for $\nu$-almost every $y\in Y$, where $\mathcal I_S$ is the $S$-invariant $\sigma$-algebra.
It remains to replace $J_\infty(S^k y)$ by $J_k(S^k y)$. We record the exact diagonal averaging input needed here.
[claim:Breiman-Maker information averaging]
Let $(Y,\mathcal C,\nu,S)$ be a probability-preserving system, let $\mathcal Q$ be a finite measurable partition, and let $(\mathcal G_m)_{m\geq0}$ be an increasing sequence of sub-$\sigma$-algebras of $\mathcal C$. Define the maps
\begin{align*}
f_m:Y&\to[0,\infty]
\end{align*}
by
\begin{align*}
f_m:=I_\nu(\mathcal Q\mid\mathcal G_m)
\end{align*}
for $m\in\mathbb N\cup\{0\}$, and let $f_\infty:Y\to[0,\infty]$ be an integrable function such that $f_m\to f_\infty$ $\nu$-almost everywhere and in $L^1(Y,\mathcal C,\nu)$. Suppose moreover that the Breiman tail condition holds:
\begin{align*}
R_r:Y&\to[0,\infty]
\end{align*}
where
\begin{align*}
R_r(y):=\sup_{m\geq r}|f_m(y)-f_\infty(y)|
\end{align*}
satisfies $R_r\in L^1(Y,\mathcal C,\nu)$ and $\int_Y R_r(y)\,d\nu(y)\to0$ as $r\to\infty$. Then
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}\bigl(f_k-f_\infty\bigr)(S^k y)\to0
\end{align*}
for $\nu$-almost every $y\in Y$.
[/claim]
[proof]
Define $g_m:Y\to\mathbb R$ by $g_m:=f_m-f_\infty$. Fix $r\in\mathbb N$. For every $n>r$ and every $y\in Y$,
\begin{align*}
\left|\frac{1}{n}\sum_{k=0}^{n-1}g_k(S^k y)\right|\leq \frac{1}{n}\sum_{k=0}^{r-1}|g_k(S^k y)|+\frac{1}{n}\sum_{k=r}^{n-1}R_r(S^k y).
\end{align*}
The first term tends to $0$ for every $y$ at which the finitely many values $|g_0(y)|,\dots,|g_{r-1}(S^{r-1}y)|$ are finite, because the numerator is fixed while $n\to\infty$; this holds for $\nu$-almost every $y$, since each $g_k$ is integrable and $S$ preserves $\nu$. The function $R_r$ is integrable by the Breiman tail condition, so the [conditional Birkhoff ergodic theorem](/theorems/518) applied to $R_r$ gives
\begin{align*}
\limsup_{n\to\infty}\frac{1}{n}\sum_{k=r}^{n-1}R_r(S^k y)\leq \mathbb E_\nu[R_r\mid\mathcal I_S](y)
\end{align*}
for $\nu$-almost every $y$. Hence
\begin{align*}
\limsup_{n\to\infty}\left|\frac{1}{n}\sum_{k=0}^{n-1}g_k(S^k y)\right|\leq \mathbb E_\nu[R_r\mid\mathcal I_S](y)
\end{align*}
for every fixed $r$ and for $\nu$-almost every $y$.
It remains to let $r\to\infty$. Since $R_r\downarrow0$ $\nu$-almost everywhere and $\int_Y R_r(y)\,d\nu(y)\to0$, preservation of integrals under [conditional expectation](/page/Conditional%20Expectation) gives
\begin{align*}
\int_Y \mathbb E_\nu[R_r\mid\mathcal I_S](y)\,d\nu(y)=\int_Y R_r(y)\,d\nu(y)\to0.
\end{align*}
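Since $R_{r+1}\leq R_r$, the functions $\mathbb E_\nu[R_r\mid\mathcal I_S]$ are nonincreasing in $r$ $\nu$-almost everywhere, so they converge $\nu$-almost everywhere to a nonnegative limit. By dominated convergence, with dominating function $\mathbb E_\nu[R_1\mid\mathcal I_S]$, this limit has integral $0$ and therefore vanishes $\nu$-almost everywhere. Discarding the countable union over $r\in\mathbb N$ of the exceptional null sets in the previous limsup bound, we conclude that the limsup above is $0$ for $\nu$-almost every $y$, which proves the claimed diagonal convergence.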
[/proof]
Its hypotheses apply after adjoining $\mathcal G_0:=\{\varnothing,Y\}$ to the increasing filtration. The partition $\mathcal Q$ is finite, $J_m=I_\nu(\mathcal Q\mid\mathcal G_m)$ for $m\geq0$, $J_m\to J_\infty$ in $L^1(Y,\mathcal C,\nu)$ and almost everywhere by the finite-partition information convergence theorem, and the preceding Breiman tail estimate gives $R_r\in L^1(Y,\mathcal C,\nu)$ with $\int_Y R_r(y)\,d\nu(y)\to0$ for $R_r(y):=\sup_{m\geq r}|J_m(y)-J_\infty(y)|$. Therefore
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}\bigl(J_k-J_\infty\bigr)(S^k y)\to0
\end{align*}
for $\nu$-almost every $y\in Y$.
Combining this proved approximation with the chain-rule identity and the conditional Birkhoff limit yields
\begin{align*}
\frac{1}{n}I_{n,\mathcal Q}(y)\to \mathbb E_\nu[J_\infty\mid\mathcal I_S](y)
\end{align*}
for $\nu$-almost every $y\in Y$.[/step]
[guided]We first pass from finite pasts to the full past. The $\sigma$-algebras $(\mathcal G_m)_{m\geq1}$ increase to
\begin{align*}
\mathcal G_\infty=\bigvee_{m=1}^{\infty}\mathcal G_m.
\end{align*}
The finite-partition information convergence theorem applies because $\mathcal Q$ is finite and each $\mathcal G_m$ is a sub-$\sigma$-algebra of $\mathcal C$. It gives
\begin{align*}
I_\nu(\mathcal Q\mid\mathcal G_m)\to I_\nu(\mathcal Q\mid\mathcal G_\infty)
\end{align*}
$\nu$-almost everywhere and in $L^1(Y,\mathcal C,\nu)$. Since
\begin{align*}
J_m=I_\nu(\mathcal Q\mid\mathcal G_m)
\end{align*}
and
\begin{align*}
J_\infty=I_\nu(\mathcal Q\mid\mathcal G_\infty),
\end{align*}
we have
\begin{align*}
J_m\to J_\infty
\end{align*}
$\nu$-almost everywhere and in $L^1(Y,\mathcal C,\nu)$.
The missing hypothesis for the diagonal average is supplied by Breiman's maximal convergence theorem for conditional information of a finite partition. In the exact form used here, if $\mathcal Q$ is finite, if $(\mathcal G_m)_{m\geq1}$ is increasing, if
\begin{align*}
\mathcal G_\infty=\bigvee_{m=1}^{\infty}\mathcal G_m,
\end{align*}
and if
\begin{align*}
F_m:=I_\nu(\mathcal Q\mid\mathcal G_m),\qquad F_\infty:=I_\nu(\mathcal Q\mid\mathcal G_\infty),
\end{align*}
then the maximal tail functions
\begin{align*}
R_{r,\mathcal Q,\mathcal G}(y):=\sup_{m\geq r}|F_m(y)-F_\infty(y)|
\end{align*}
are integrable and satisfy
\begin{align*}
\int_Y R_{r,\mathcal Q,\mathcal G}(y)\,d\nu(y)\to0.
\end{align*}
With $F_m=J_m$ and $F_\infty=J_\infty$, this gives
\begin{align*}
R_r(y):=\sup_{m\geq r}|J_m(y)-J_\infty(y)|
\end{align*}
with $R_r\in L^1(Y,\mathcal C,\nu)$ and
\begin{align*}
\int_Y R_r(y)\,d\nu(y)\to0.
\end{align*}
The [conditional Birkhoff ergodic theorem](/theorems/518) applies to the integrable function
\begin{align*}
J_\infty:Y\to[0,\infty].
\end{align*}
Therefore
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}J_\infty(S^k y)\to \mathbb E_\nu[J_\infty\mid\mathcal I_S](y)
\end{align*}
for $\nu$-almost every $y\in Y$, where $\mathcal I_S$ is the $S$-invariant $\sigma$-algebra.
It remains to justify the diagonal replacement of $J_k(S^k y)$ by $J_\infty(S^k y)$. The exact input is the Breiman-Maker information averaging lemma. Let $(Y,\mathcal C,\nu,S)$ be a probability-preserving system, let $\mathcal Q$ be a finite measurable partition, and let $(\mathcal G_m)_{m\geq0}$ be an increasing sequence of sub-$\sigma$-algebras of $\mathcal C$. Define
\begin{align*}
f_m:Y\to[0,\infty]
\end{align*}
by
\begin{align*}
f_m:=I_\nu(\mathcal Q\mid\mathcal G_m),
\end{align*}
and let $f_\infty:Y\to[0,\infty]$ be an integrable function with $f_m\to f_\infty$ $\nu$-almost everywhere and in $L^1(Y,\mathcal C,\nu)$. Suppose the tail functions
\begin{align*}
R_r:Y\to[0,\infty]
\end{align*}
defined by
\begin{align*}
R_r(y):=\sup_{m\geq r}|f_m(y)-f_\infty(y)|
\end{align*}
satisfy $R_r\in L^1(Y,\mathcal C,\nu)$ and
\begin{align*}
\int_Y R_r(y)\,d\nu(y)\to0
\end{align*}
as $r\to\infty$. Then the lemma says
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}\bigl(f_k-f_\infty\bigr)(S^k y)\to0
\end{align*}
for $\nu$-almost every $y\in Y$.
Here is the proof of that diagonal lemma in the present notation. Put
\begin{align*}
g_m:Y\to\mathbb R
\end{align*}
with
\begin{align*}
g_m:=f_m-f_\infty.
\end{align*}
Fix $r\in\mathbb N$. For every $n>r$ and every $y\in Y$,
\begin{align*}
\left|\frac{1}{n}\sum_{k=0}^{n-1}g_k(S^k y)\right|\leq \frac{1}{n}\sum_{k=0}^{r-1}|g_k(S^k y)|+\frac{1}{n}\sum_{k=r}^{n-1}R_r(S^k y).
\end{align*}
The first term tends to $0$ for every $y$ at which the finitely many values $|g_0(y)|,\dots,|g_{r-1}(S^{r-1}y)|$ are finite, which is the case for $\nu$-almost every $y$ because each $g_k$ is integrable and $S$ preserves $\nu$. Since $R_r$ is integrable, the [conditional Birkhoff ergodic theorem](/theorems/518) gives
\begin{align*}
\limsup_{n\to\infty}\frac{1}{n}\sum_{k=r}^{n-1}R_r(S^k y)\leq \mathbb E_\nu[R_r\mid\mathcal I_S](y)
\end{align*}
for $\nu$-almost every $y\in Y$.
Thus, for fixed $r$,
\begin{align*}
\limsup_{n\to\infty}\left|\frac{1}{n}\sum_{k=0}^{n-1}g_k(S^k y)\right|\leq \mathbb E_\nu[R_r\mid\mathcal I_S](y)
\end{align*}
for $\nu$-almost every $y$. Now let $r\to\infty$. Since $R_r\downarrow0$ $\nu$-almost everywhere and
\begin{align*}
\int_Y R_r(y)\,d\nu(y)\to0,
\end{align*}
conditional expectation preserves integrals:
\begin{align*}
\int_Y \mathbb E_\nu[R_r\mid\mathcal I_S](y)\,d\nu(y)=\int_Y R_r(y)\,d\nu(y)\to0.
\end{align*}
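Since $R_{r+1}\leq R_r$, the functions $\mathbb E_\nu[R_r\mid\mathcal I_S]$ are nonincreasing in $r$ $\nu$-almost everywhere and their integrals tend to $0$, so they converge to $0$ $\nu$-almost everywhere. After discarding the countable union over $r$ of the exceptional null sets, we obtain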
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}\bigl(f_k-f_\infty\bigr)(S^k y)\to0
\end{align*}
for $\nu$-almost every $y\in Y$.
We now apply this with $f_m=J_m$ and $f_\infty=J_\infty$. The hypotheses verified above give
\begin{align*}
J_m=I_\nu(\mathcal Q\mid\mathcal G_m)\to J_\infty=I_\nu(\mathcal Q\mid\mathcal G_\infty)
\end{align*}
$\nu$-almost everywhere and in $L^1(Y,\mathcal C,\nu)$. The Breiman tail estimate for conditional information of a finite partition gives, for
\begin{align*}
R_r(y):=\sup_{m\geq r}|J_m(y)-J_\infty(y)|,
\end{align*}
that $R_r\in L^1(Y,\mathcal C,\nu)$ and
\begin{align*}
\int_Y R_r(y)\,d\nu(y)\to0.
\end{align*}
Hence
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}\bigl(J_k-J_\infty\bigr)(S^k y)\to0
\end{align*}
for $\nu$-almost every $y\in Y$.
Finally, combine this diagonal convergence with the Birkhoff limit:
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}J_\infty(S^k y)\to \mathbb E_\nu[J_\infty\mid\mathcal I_S](y).
\end{align*}
Adding the two limits gives
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1}J_k(S^k y)\to \mathbb E_\nu[J_\infty\mid\mathcal I_S](y).
\end{align*}
Using the chain-rule identity
\begin{align*}
I_{n,\mathcal Q}(y)=\sum_{k=0}^{n-1}J_k(S^k y)
\end{align*}
then yields
\begin{align*}
\frac{1}{n}I_{n,\mathcal Q}(y)\to \mathbb E_\nu[J_\infty\mid\mathcal I_S](y)
\end{align*}
for $\nu$-almost every $y\in Y$.[/guided]
[step:Identify the integral of the limiting function in the invertible case]
Define
\begin{align*}
\bar h_\nu(S,\mathcal Q\mid\mathcal I_S):Y\to[0,\infty]
\end{align*}
by
\begin{align*}
\bar h_\nu(S,\mathcal Q\mid\mathcal I_S)(y):=\mathbb E_\nu[J_\infty\mid\mathcal I_S](y).
\end{align*}
This function is $\mathcal I_S$-measurable by the defining property of conditional expectation. From the previous step,
\begin{align*}
\lim_{n\to\infty}-\frac{1}{n}\log \nu(\mathcal Q_{[0,n-1]}(y))=\bar h_\nu(S,\mathcal Q\mid\mathcal I_S)(y)
\end{align*}
for $\nu$-almost every $y\in Y$.
For a finite measurable partition $\mathcal R$, define
\begin{align*}
H_\nu(\mathcal R):=-\sum_{R\in\mathcal R}\nu(R)\log\nu(R),
\end{align*}
where terms with $\nu(R)=0$ are interpreted as $0$. Define the entropy rate of the finite partition $\mathcal Q$ for the system $(Y,\mathcal C,\nu,S)$ by
\begin{align*}
h_\nu(S,\mathcal Q):=\lim_{n\to\infty}\frac{1}{n}H_\nu(\mathcal Q_{[0,n-1]}),
\end{align*}
where the limit exists because the sequence $n\mapsto H_\nu(\mathcal Q_{[0,n-1]})$ is subadditive.
By preservation of integrals under conditional expectation,
\begin{align*}
\int_Y \bar h_\nu(S,\mathcal Q\mid\mathcal I_S)(y)\,d\nu(y)=\int_Y J_\infty(y)\,d\nu(y).
\end{align*}
For each $m\in\mathbb N$, the [chain rule for entropy](/theorems/1635) gives
\begin{align*}
H_\nu(\mathcal Q_{[0,m]})=\sum_{k=0}^{m} H_\nu(\mathcal Q\mid\mathcal G_k),
\end{align*}
where $H_\nu(\mathcal Q\mid\mathcal G_k)=\int_Y J_k(y)\,d\nu(y)$ and $\mathcal G_0:=\{\varnothing,Y\}$. Since $J_k\to J_\infty$ in $L^1(Y,\mathcal C,\nu)$, the terms $H_\nu(\mathcal Q\mid\mathcal G_k)$ converge to $\int_Y J_\infty(y)\,d\nu(y)$, and Cesàro averaging gives
\begin{align*}
\lim_{m\to\infty}\frac{1}{m+1}H_\nu(\mathcal Q_{[0,m]})=\int_Y J_\infty(y)\,d\nu(y).
\end{align*}
Therefore
\begin{align*}
\int_Y J_\infty(y)\,d\nu(y)=h_\nu(S,\mathcal Q).
\end{align*}
Thus
\begin{align*}
\int_Y \bar h_\nu(S,\mathcal Q\mid\mathcal I_S)(y)\,d\nu(y)=h_\nu(S,\mathcal Q).
\end{align*}
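As a sanity check in a hypothetical special case (an assumption for illustration, not part of the source), let $Y=\mathcal A^{\mathbb Z}$ with the left shift $S$, let $\mathcal Q$ be the time-zero partition, and let $\nu$ be a stationary Markov measure with transition matrix $(p_{ab})$ and stationary vector $(\pi_a)$. The Markov property gives $J_\infty(y)=-\log p_{y_{-1}y_0}$ almost everywhere, and $\nu\{y:y_{-1}=a,\ y_0=b\}=\pi_a p_{ab}$, so
\begin{align*}
h_\nu(S,\mathcal Q)=\int_Y J_\infty(y)\,d\nu(y)=-\sum_{a,b\in\mathcal A}\pi_a\,p_{ab}\log p_{ab},
\end{align*}
with the convention $0\log0=0$. This is the familiar entropy formula for a stationary Markov chain.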
[/step]
[step:Lift the non-invertible system to its natural extension]
Return to the original standard probability-preserving system $(X,\mathcal B,\mu,T)$, where $T$ need not be invertible. The standardness assumption means that, after removing a $\mu$-null set and up to completion, $(X,\mathcal B)$ is a standard Borel space, so that consistent families of finite-dimensional distributions on countable products and inverse limits extend to probability measures. Let $\widehat X$ be the inverse-limit space of one-sided histories
\begin{align*}
\widehat X:=\{(x_0,x_{-1},x_{-2},\dots)\in X^{\mathbb N_0}:T x_{-j}=x_{-j+1}\text{ for every }j\geq1\},
\end{align*}
and let $\widehat{\mathcal B}$ be the trace of the product $\sigma$-algebra on $\widehat X$. For indices $0\leq j_1<\cdots<j_r$ and sets $B_1,\dots,B_r\in\mathcal B$, define the cylinder distribution by
\begin{align*}
\widehat\mu\{\widehat x:x_{-j_1}\in B_1,\dots,x_{-j_r}\in B_r\}:=\mu\left(B_r\cap T^{-(j_r-j_{r-1})}B_{r-1}\cap\cdots\cap T^{-(j_r-j_1)}B_1\right).
\end{align*}
The consistency of these finite-dimensional distributions follows from $\mu(T^{-1}A)=\mu(A)$ for every $A\in\mathcal B$. Imposing the trivial constraint $X$ at a coordinate newer than $j_r$ does not change the displayed set, because $T^{-d}X=X$ for every $d\geq0$. Imposing it at an older coordinate $j_{r+1}>j_r$ replaces the displayed set by its preimage under $T^{j_{r+1}-j_r}$, which has the same $\mu$-measure by invariance. Since $(X,\mathcal B)$ is standard Borel, the countable projective-limit construction for standard Borel probability spaces gives a probability measure $\widehat\mu$ on $(\widehat X,\widehat{\mathcal B})$ with these cylinder values. This is the natural extension of $(X,\mathcal B,\mu,T)$. Let $\widehat T:\widehat X\to\widehat X$ be the shift
\begin{align*}
\widehat T(x_0,x_{-1},x_{-2},\dots):=(T x_0,x_0,x_{-1},\dots).
\end{align*}
Its inverse is the measurable map
\begin{align*}
\widehat T^{-1}(x_0,x_{-1},x_{-2},\dots):=(x_{-1},x_{-2},x_{-3},\dots),
\end{align*}
and the displayed cylinder formula shows that $\widehat\mu$ is $\widehat T$-invariant.
Let
\begin{align*}
\pi_0:\widehat X\to X
\end{align*}
be the time-zero factor map $\pi_0(x_0,x_{-1},x_{-2},\dots)=x_0$. Thus $\pi_0$ is measurable, $\widehat\mu\circ\pi_0^{-1}=\mu$, and
\begin{align*}
\pi_0\circ\widehat T=T\circ\pi_0.
\end{align*}
Let
\begin{align*}
\widehat{\mathcal P}:=\pi_0^{-1}\mathcal P
\end{align*}
be the lifted finite partition of $\widehat X$.
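For every $n\in\mathbb N$ and every $\widehat x\in\widehat X$, the factor relation $\pi_0\circ\widehat T^{\,j}=T^j\circ\pi_0$ for $0\leq j\leq n-1$ gives $\widehat T^{-j}\widehat{\mathcal P}=\pi_0^{-1}(T^{-j}\mathcal P)$, and hence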
\begin{align*}
\widehat{\mathcal P}_{[0,n-1]}(\widehat x)=\pi_0^{-1}\bigl(\mathcal P_{[0,n-1]}(\pi_0\widehat x)\bigr).
\end{align*}
Because $\widehat\mu\circ\pi_0^{-1}=\mu$, this implies
\begin{align*}
\widehat\mu(\widehat{\mathcal P}_{[0,n-1]}(\widehat x))=\mu(\mathcal P_{[0,n-1]}(\pi_0\widehat x)).
\end{align*}
Let $\mathcal I_{\widehat T}:=\{A\in\widehat{\mathcal B}:\widehat T^{-1}A=A\text{ modulo }\widehat\mu\}$ denote the $\widehat T$-invariant $\sigma$-algebra. Applying the invertible case to $(\widehat X,\widehat{\mathcal B},\widehat\mu,\widehat T)$ and $\widehat{\mathcal P}$ gives an $\mathcal I_{\widehat T}$-measurable function
\begin{align*}
\widehat h:\widehat X\to[0,\infty]
\end{align*}
such that
\begin{align*}
-\frac{1}{n}\log\mu(\mathcal P_{[0,n-1]}(\pi_0\widehat x))\to \widehat h(\widehat x)
\end{align*}
for $\widehat\mu$-almost every $\widehat x\in\widehat X$.
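A concrete instance may help fix ideas; it is an illustrative special case, not one singled out in the source. Suppose $X=\mathcal A^{\mathbb N_0}$ for a finite alphabet $\mathcal A$, $T$ is the one-sided shift $(Tx)_n=x_{n+1}$, and $\mathcal P$ is the time-zero partition. If $x_0=(\omega_0,\omega_1,\dots)$, then the constraint $Tx_{-j}=x_{-j+1}$ forces each $x_{-j}$ to arise from $x_{-j+1}$ by prepending a single symbol $\omega_{-j}$:
\begin{align*}
x_{-j}=(\omega_{-j},\omega_{-j+1},\dots,\omega_{-1},\omega_0,\omega_1,\dots).
\end{align*}
Thus $\widehat x\mapsto(\omega_n)_{n\in\mathbb Z}$ identifies $\widehat X$ with $\mathcal A^{\mathbb Z}$. Under this identification $\widehat T$ becomes the two-sided left shift, $\pi_0$ becomes the restriction $(\omega_n)_{n\in\mathbb Z}\mapsto(\omega_n)_{n\geq0}$, and $\widehat\mu$ is the shift-invariant measure on $\mathcal A^{\mathbb Z}$ whose marginal on the nonnegative coordinates is $\mu$. The atom $\widehat{\mathcal P}_{[0,n-1]}(\widehat x)$ is the two-sided cylinder fixing $\omega_0,\dots,\omega_{n-1}$, and its $\widehat\mu$-measure equals the $\mu$-measure of the corresponding one-sided cylinder.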
[/step]
[step:Descend the limiting function and recover the entropy rate]
For each $n\in\mathbb N$, define
\begin{align*}
a_n:X\to[0,\infty]
\end{align*}
by
\begin{align*}
a_n(x):=-\frac{1}{n}\log\mu(\mathcal P_{[0,n-1]}(x)).
\end{align*}
The lifted convergence says that $a_n\circ\pi_0$ converges $\widehat\mu$-almost everywhere to $\widehat h$. Since every $a_n\circ\pi_0$ is $\pi_0^{-1}\mathcal B$-measurable, the almost-sure limit $\widehat h$ is also $\pi_0^{-1}\mathcal B$-measurable after modifying it on a $\widehat\mu$-null set.
[claim:Factor-measurability descent]
If $\pi_0:(\widehat X,\widehat{\mathcal B},\widehat\mu)\to(X,\mathcal B,\mu)$ is a measure-preserving factor map between completed standard probability spaces and $g:\widehat X\to[0,\infty]$ is $\pi_0^{-1}\mathcal B$-measurable, then there exists a $\mathcal B$-measurable function $f:X\to[0,\infty]$ with $g=f\circ\pi_0$ $\widehat\mu$-almost everywhere.
[/claim]
[proof]
On the completed space, $g$ is measurable with respect to $\pi_0^{-1}\mathcal B$ only up to $\widehat\mu$-null sets, so first choose a genuinely $\pi_0^{-1}\mathcal B$-measurable function $g_0:\widehat X\to[0,\infty]$ such that $g=g_0$ $\widehat\mu$-almost everywhere. For each $q\in\mathbb Q\cap[0,\infty)$, choose $B_q\in\mathcal B$ such that
\begin{align*}
\{\widehat x:g_0(\widehat x)>q\}=\pi_0^{-1}B_q.
\end{align*}
We now replace these representatives by a monotone family. For each $q\in\mathbb Q\cap[0,\infty)$, define
\begin{align*}
C_q:=\bigcup_{r\in\mathbb Q\cap[0,\infty),\ r>q}B_r.
\end{align*}
Then $C_r\subseteq C_q$ whenever $q<r$, each $C_q$ belongs to $\mathcal B$, and
\begin{align*}
\pi_0^{-1}C_q=\{\widehat x:g_0(\widehat x)>q\}
\end{align*}
because the strict superlevel sets of $g_0$ satisfy $\{g_0>q\}=\bigcup_{r\in\mathbb Q,\ r>q}\{g_0>r\}$.
Define the measurable function
\begin{align*}
f:X\to[0,\infty]
\end{align*}
by
\begin{align*}
f(x):=\sup\{q\in\mathbb Q\cap[0,\infty):x\in C_q\},
\end{align*}
with the convention that the supremum of the empty set is $0$. The monotonicity of the family $(C_q)_{q\in\mathbb Q\cap[0,\infty)}$ implies that, for every rational $q\geq0$,
\begin{align*}
\{x:f(x)>q\}=\bigcup_{r\in\mathbb Q\cap[0,\infty),\ r>q}C_r=C_q.
\end{align*}
Pulling back by $\pi_0$ gives
\begin{align*}
\{\widehat x:f(\pi_0\widehat x)>q\}=\pi_0^{-1}C_q=\{\widehat x:g_0(\widehat x)>q\}
\end{align*}
for every rational $q\geq0$. Equality of all rational superlevel sets implies $f\circ\pi_0=g_0$ everywhere, and hence $g=f\circ\pi_0$ $\widehat\mu$-almost everywhere.
[/proof]
Applying the claim to $g=\widehat h$, there exists a measurable function
\begin{align*}
\bar h_\mu(T,\mathcal P\mid\mathcal I_T):X\to[0,\infty]
\end{align*}
such that
\begin{align*}
\widehat h=\bar h_\mu(T,\mathcal P\mid\mathcal I_T)\circ\pi_0
\end{align*}
for $\widehat\mu$-almost every point. We first push the almost-sure convergence down to $X$. Let
\begin{align*}
E:=\left\{x\in X:a_n(x)\text{ does not converge to }\bar h_\mu(T,\mathcal P\mid\mathcal I_T)(x)\right\}.
\end{align*}
For every $\widehat x$ outside the union of the null set where $a_n\circ\pi_0$ fails to converge to $\widehat h$ and the null set where $\widehat h\neq\bar h_\mu(T,\mathcal P\mid\mathcal I_T)\circ\pi_0$, the point $\pi_0\widehat x$ does not belong to $E$. Hence $\pi_0^{-1}E$ is contained in a $\widehat\mu$-null set. Since $\widehat\mu\circ\pi_0^{-1}=\mu$,
\begin{align*}
\mu(E)=\widehat\mu(\pi_0^{-1}E)=0.
\end{align*}
Therefore
\begin{align*}
a_n(x)\to\bar h_\mu(T,\mathcal P\mid\mathcal I_T)(x)
\end{align*}
for $\mu$-almost every $x\in X$.
We now verify invariant measurability after descent. Since $\widehat h=\bar h_\mu(T,\mathcal P\mid\mathcal I_T)\circ\pi_0$ and $\widehat h\circ\widehat T=\widehat h$ $\widehat\mu$-almost everywhere, the factor relation $\pi_0\circ\widehat T=T\circ\pi_0$ gives
\begin{align*}
\bar h_\mu(T,\mathcal P\mid\mathcal I_T)(T\pi_0\widehat x)=\bar h_\mu(T,\mathcal P\mid\mathcal I_T)(\pi_0\widehat x)
\end{align*}
for $\widehat\mu$-almost every $\widehat x\in\widehat X$. Because $\widehat\mu\circ\pi_0^{-1}=\mu$, this is exactly
\begin{align*}
\bar h_\mu(T,\mathcal P\mid\mathcal I_T)\circ T=\bar h_\mu(T,\mathcal P\mid\mathcal I_T)
\end{align*}
$\mu$-almost everywhere. Thus the descended function is $\mathcal I_T$-measurable modulo $\mu$. Hence
\begin{align*}
\lim_{n\to\infty}-\frac{1}{n}\log\mu(\mathcal P_{[0,n-1]}(x))=\bar h_\mu(T,\mathcal P\mid\mathcal I_T)(x)
\end{align*}
for $\mu$-almost every $x\in X$. Since $\mathcal P$ is finite, the entropy rate $h_\mu(T,\mathcal P)$ is at most $\log |\mathcal P|$, and the integral identity proved below implies that $\bar h_\mu(T,\mathcal P\mid\mathcal I_T)$ is finite $\mu$-almost everywhere; after changing it on a null set we regard it as $[0,\infty)$-valued.
Finally, using $\widehat\mu\circ\pi_0^{-1}=\mu$ and $\widehat h=\bar h_\mu(T,\mathcal P\mid\mathcal I_T)\circ\pi_0$ $\widehat\mu$-almost everywhere, the change-of-variables identity for pushforward measures gives
\begin{align*}
\int_X \bar h_\mu(T,\mathcal P\mid\mathcal I_T)(x)\,d\mu(x)=\int_{\widehat X}\widehat h(\widehat x)\,d\widehat\mu(\widehat x).
\end{align*}
Define
\begin{align*}
h_{\widehat\mu}(\widehat T,\widehat{\mathcal P}):=\lim_{n\to\infty}\frac{1}{n}H_{\widehat\mu}(\widehat{\mathcal P}_{[0,n-1]})
\end{align*}
for the entropy rate of the lifted finite partition. By the invertible case,
\begin{align*}
\int_{\widehat X}\widehat h(\widehat x)\,d\widehat\mu(\widehat x)=h_{\widehat\mu}(\widehat T,\widehat{\mathcal P}).
\end{align*}
Since $\widehat{\mathcal P}_{[0,n-1]}$ is the pullback of $\mathcal P_{[0,n-1]}$ under $\pi_0$ and $\widehat\mu\circ\pi_0^{-1}=\mu$, the entropies of the corresponding finite partitions agree for every $n$:
\begin{align*}
H_{\widehat\mu}(\widehat{\mathcal P}_{[0,n-1]})=H_\mu(\mathcal P_{[0,n-1]}).
\end{align*}
Taking entropy-rate limits gives
\begin{align*}
h_{\widehat\mu}(\widehat T,\widehat{\mathcal P})=h_\mu(T,\mathcal P).
\end{align*}
Combining the change-of-variables identity, the invertible-case integral identity, and the equality of entropy rates gives
\begin{align*}
\int_X \bar h_\mu(T,\mathcal P\mid\mathcal I_T)(x)\,d\mu(x)=h_\mu(T,\mathcal P).
\end{align*}
This proves both the almost-sure convergence statement and the asserted integral identity.
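As a consistency check, consider the special case in which $\mu$ is ergodic for $T$ (an additional assumption, made here only for illustration). Every $\mathcal I_T$-measurable function is then $\mu$-almost everywhere constant, so $\bar h_\mu(T,\mathcal P\mid\mathcal I_T)=c$ almost everywhere for some constant $c$, and the integral identity forces $c=h_\mu(T,\mathcal P)$. The conclusion becomes
\begin{align*}
\lim_{n\to\infty}-\frac{1}{n}\log\mu(\mathcal P_{[0,n-1]}(x))=h_\mu(T,\mathcal P)
\end{align*}
for $\mu$-almost every $x\in X$; equivalently, $\mu(\mathcal P_{[0,n-1]}(x))=\exp\bigl(-n\,(h_\mu(T,\mathcal P)+o(1))\bigr)$ as $n\to\infty$.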
[/step]