This course develops the modern theory of entropy in ergodic theory and shows how information-theoretic ideas organize the study of dynamical systems. It begins with entropy for partitions and the basic language of information, then moves to Kolmogorov-Sinai entropy as an isomorphism invariant that measures dynamical complexity. From there, the course turns to generators, entropy computation, and the [Shannon-McMillan-Breiman theorem](/theorems/6766), which shows that entropy is the exponential rate at which the measure of a typical orbit name decays with its length.
The later chapters broaden the scope from measure-theoretic dynamics to symbolic, topological, and statistical viewpoints. Bernoulli shifts and isomorphism problems illustrate the classification power of entropy, while Markov shifts and symbolic dynamics provide concrete models for more general systems. Topological entropy, the variational principle, and thermodynamic formalism connect entropy with pressure, equilibrium states, and statistical mechanics. The final chapters explore how entropy interacts with mixing and decay of correlations, with number-theoretic dynamical systems, and with phase transitions in statistical mechanics, showing how a single invariant links rigorous dynamics, probability, and mathematical physics.
# Introduction
This opening chapter fixes the scope and language of the course. Ergodic Theory I treated qualitative long-term behaviour: invariant sets, recurrence, ergodicity, weak mixing, and strong mixing. The second course asks for quantitative invariants, especially entropy, which measure how much information is produced by a dynamical system per unit time.
The guiding contrast is between systems that look chaotic because they separate nearby orbits and systems that are chaotic in the measure-theoretic sense because observations reveal genuinely new information. Entropy connects measurable dynamics, symbolic dynamics, topological dynamics, statistical mechanics, and number-theoretic examples. The aim of the course is to learn how entropy is defined, computed, compared under factors and codings, and used as a classification tool.
## The Central Questions of the Course
A measure-preserving system may be studied by repeatedly observing which atom of a finite partition contains the orbit point. The first question is how much information this observation process produces over a long time interval. A second question is whether that number is intrinsic to the system or depends on the chosen observation scheme.
[explanation: Entropy As Information Growth]
Let $(X, \mathcal B, \mu, T)$ be a probability-preserving system and let $\mathcal P$ be a finite measurable partition of $X$. The observation of $x, T x, \dots, T^{n-1}x$ through $\mathcal P$ records the atom of the joined partition
\begin{align*}
\mathcal P_0^{n-1} = \mathcal P \vee T^{-1}\mathcal P \vee \cdots \vee T^{-(n-1)}\mathcal P
\end{align*}
that contains $x$. The entropy of this joined partition measures the information needed to describe the length-$n$ name of a typical point. The entropy rate asks for the asymptotic average information per observation.
[/explanation]
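As a small illustration of orbit names, the following sketch computes the length-$n$ name of a point under the doubling map $Tx=2x \bmod 1$, observed through the two-interval partition $\{[0,1/2),[1/2,1)\}$. The choice of map and partition, and the helper names `doubling_map`, `atom_index`, and `orbit_name`, are assumptions made only for illustration. Exact rational arithmetic is used because floating-point doubling discards one binary digit per step.

```python
from fractions import Fraction

def doubling_map(x):
    """Doubling map T(x) = 2x mod 1 on [0, 1), in exact rational arithmetic."""
    return (2 * x) % 1

def atom_index(x):
    """Index of the atom of P = {[0, 1/2), [1/2, 1)} that contains x."""
    return 0 if x < Fraction(1, 2) else 1

def orbit_name(x, n):
    """Length-n name of x: the atom indices of x, Tx, ..., T^{n-1}x."""
    name = []
    for _ in range(n):
        name.append(atom_index(x))
        x = doubling_map(x)
    return tuple(name)

print(orbit_name(Fraction(1, 3), 6))   # (0, 1, 0, 1, 0, 1)
print(orbit_name(Fraction(5, 8), 4))   # (1, 0, 1, 0)
```

Two points lie in the same atom of $\mathcal P_0^{n-1}$ exactly when `orbit_name` returns the same tuple for both.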
This viewpoint turns dynamics into a source-coding problem: a partition gives a finite alphabet, orbit segments give words, and the measure gives the frequencies of those words. To make this rate useful, we need to know that the finite-time entropies have a well-defined long-time average.
[quotetheorem:6722]
[citeproof:6722]
The theorem explains why entropy is an asymptotic invariant rather than a finite-time statistic. The finiteness of the partition is essential here: on the countable atomic probability space $X=\{2,3,4,\dots\}$ with
\begin{align*}
\mu(\{m\})=\frac{c}{m(\log m)^2},
\end{align*}
where $c>0$ normalises the total mass, the partition into singletons has infinite Shannon entropy. For the identity map on this space, the first normalised entropy value is already infinite, so no finite numerical rate is obtained from that countable observation. Measure preservation is also doing real work, because it lets the entropy of $T^{-m}\mathcal P_0^{n-1}$ agree with the entropy of $\mathcal P_0^{n-1}$ in the subadditivity argument. Without preservation this comparison can fail even for finite partitions: on $X=\{0,1\}$ with $\mu(\{0\})=\mu(\{1\})=1/2$, the map $T(0)=T(1)=0$ is not measure preserving, and for the partition into singletons the pullback $T^{-1}\mathcal P$ is the single-atom partition, so its entropy is $0$ rather than $H_\mu(\mathcal P)=\log 2$. The result does not say that the finite-time values are stable or monotone in $n$; it says only that their average information per step has a limiting rate. Chapter 2 removes the dependence on a particular finite observation by taking the supremum over all finite measurable partitions in the definition of Kolmogorov-Sinai entropy.
## Background Assumed from Ergodic Theory I
Before entropy can be compared across examples, we need to know what counts as the same long-term statistical experiment and what counts as a coarser observation of it. The course assumes that measure-preserving transformations have already been introduced as the measurable analogue of time evolution, so this section recalls only the base objects that entropy will attach numerical invariants to and compare under morphisms.
[definition: Probability-Preserving System]
A probability-preserving system is a quadruple $(X, \mathcal B, \mu, T)$ where $(X, \mathcal B, \mu)$ is a probability space and $T:X\to X$ is a measurable map such that
\begin{align*}
\mu(T^{-1}A)=\mu(A)
\end{align*}
for every $A\in \mathcal B$.
[/definition]
The map $T$ may be invertible or non-invertible, and both cases occur throughout the course. Invertible systems include shifts and rotations; non-invertible systems include expanding maps such as the doubling map on the circle.
[example: Irrational Rotation As Zero-Entropy Model]
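Fix an irrational number $\alpha$, let $X=\mathbb R/\mathbb Z$ carry Lebesgue measure $\mu$, and let $T:X\to X$ be the rotation $Tx=x+\alpha \pmod 1$. Let $\mathcal P=\{I_1,\dots,I_r\}$ be a finite partition of $X$ into intervals, and let $E$ be the set of interval endpoints, so $|E|\le r$. For each $j\ge 0$, the partition $T^{-j}\mathcal P$ has endpoint set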
\begin{align*}
T^{-j}E=\{e-j\alpha \pmod 1:e\in E\}.
\end{align*}
Hence the joined partition
\begin{align*}
\mathcal P_0^{n-1}
=\mathcal P\vee T^{-1}\mathcal P\vee\cdots\vee T^{-(n-1)}\mathcal P
\end{align*}
is obtained by cutting the circle at points in
\begin{align*}
E\cup (E-\alpha)\cup\cdots\cup(E-(n-1)\alpha).
\end{align*}
This set has at most $nr$ points, so it cuts the circle into at most $nr$ intervals. Therefore $\mathcal P_0^{n-1}$ has at most $nr$ atoms, and the entropy of a probability distribution on at most $nr$ atoms is at most $\log(nr)$. Thus
\begin{align*}
0\le \frac{1}{n}H_\mu(\mathcal P_0^{n-1})
\le \frac{\log(nr)}{n}
=\frac{\log n}{n}+\frac{\log r}{n}.
\end{align*}
Both terms on the right tend to $0$, so $h_\mu(T,\mathcal P)=0$ for every finite interval partition $\mathcal P$. Irrational rotation is therefore a model of rigid long-term behaviour: it is ergodic, but interval observations produce only subexponentially many orbit names and hence no positive entropy rate.
[/example]
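The counting bound can be checked numerically. The sketch below cuts the circle at the translated endpoints, groups the resulting arcs by their length-$n$ name, and compares the normalised entropy with $\log(nr)/n$. The function name `rotation_join_entropy`, the half-open interval convention, the rounding tolerance, and the rotation number are illustrative assumptions, and floating-point arithmetic makes this a sanity check rather than a proof.

```python
import math
from collections import defaultdict

def rotation_join_entropy(endpoints, alpha, n):
    """Atom count and Lebesgue entropy of the n-fold join for x -> x + alpha mod 1."""
    endpoints = sorted(e % 1.0 for e in endpoints)
    r = len(endpoints)

    def atom(x):
        # Intervals are [e_i, e_{i+1}); counting endpoints <= x (mod r) labels them.
        return sum(1 for e in endpoints if e <= x) % r

    # Cut points e - j*alpha mod 1; rounding merges floating-point duplicates.
    cuts = sorted({round((e - j * alpha) % 1.0, 12) % 1.0
                   for e in endpoints for j in range(n)})
    arcs = zip(cuts, cuts[1:] + [cuts[0] + 1.0])

    mass = defaultdict(float)
    for a, b in arcs:
        mid = ((a + b) / 2) % 1.0
        name = tuple(atom((mid + j * alpha) % 1.0) for j in range(n))
        mass[name] += b - a   # arcs sharing a name belong to the same atom
    entropy = -sum(m * math.log(m) for m in mass.values() if m > 0)
    return len(mass), entropy

alpha = (math.sqrt(5) - 1) / 2   # illustrative irrational rotation number
for n in (1, 10, 100):
    atoms, H = rotation_join_entropy([0.0, 0.5], alpha, n)
    print(n, atoms, H / n, math.log(2 * n) / n)
```

The printed atom counts stay below $nr$ with $r=2$, and the third column is dominated by the fourth, which tends to $0$.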
Ergodicity says that invariant measurable sets have measure $0$ or $1$, while entropy measures the statistical complexity inside that irreducible behaviour. A comparison of two systems must distinguish genuine dynamical complexity from complexity that disappears after a loss of information. The formal way to express such a loss is a map that preserves measure and sends each orbit in the larger system to the corresponding orbit in the smaller observed system.
[definition: Factor Map]
Let $(X, \mathcal B, \mu, T)$ and $(Y, \mathcal C, \nu, S)$ be probability-preserving systems. A factor map is a measurable map $\pi:X\to Y$ such that $\nu=\mu\circ \pi^{-1}$ and
\begin{align*}
\pi\circ T = S\circ \pi
\end{align*}
holds $\mu$-a.e.
[/definition]
Factors formalise the idea of observing less information about a system. This raises the next structural question: can a coarser observation have larger entropy than the system from which it came?
[definition: Measure-Theoretic Entropy]
Let $(X, \mathcal B, \mu, T)$ be a probability-preserving system. The measure-theoretic entropy of $T$ is
\begin{align*}
h_\mu(T)=\sup_{\mathcal P} h_\mu(T,\mathcal P),
\end{align*}
where the supremum is taken over all finite measurable partitions $\mathcal P$ of $X$.
[/definition]
This definition makes entropy intrinsic to the system rather than to a chosen finite observation. It also gives the [comparison principle](/theorems/4870) below its precise meaning: a factor can only see partition information that was already visible upstairs.
[quotetheorem:6724]
[citeproof:6724]
This monotonicity result is one reason entropy is an invariant rather than merely a computation. The factor-map hypotheses are essential: the map must push $\mu$ forward to $\nu$ and must intertwine the dynamics, otherwise pullback partitions no longer describe the same orbit-name process. For instance, let $X$ be a one-point zero-entropy system and let $Y=\{0,1\}^{\mathbb Z}$ carry the fair Bernoulli shift; a measurable map from $X$ to $Y$ can choose a single sequence, but it cannot push the point mass to the Bernoulli measure, so it gives no factor relation and no entropy comparison with $\log 2$. Monotonicity also does not say that every lower-entropy system is a factor; entropy is only an obstruction, not a complete factor criterion. If two systems are isomorphic, each is a factor of the other, and their entropies agree.
## Symbolic Models and Generators
Many advanced examples become tractable only after replacing the original phase space by sequences of symbols. The problem is to know when a partition records enough information to reconstruct the measurable dynamics, at least up to null sets.
[definition: Finite Generator]
Let $(X, \mathcal B, \mu, T)$ be an invertible probability-preserving system. A finite measurable partition $\mathcal P$ is a finite generator if
\begin{align*}
\sigma\left(\bigvee_{n\in\mathbb Z} T^{-n}\mathcal P\right)=\mathcal B
\end{align*}
up to $\mu$-null sets.
[/definition]
A generator turns an abstract system into a symbolic process without losing measurable information. The obstruction in computing $h_\mu(T)$ is that the definition takes a supremum over all finite partitions, many of which have no evident relation to a chosen symbolic coding. When a finite partition generates the whole measurable structure, no hidden measurable information remains outside its orbit names, so the entropy of that one partition should determine the system entropy.
[quotetheorem:6726]
[citeproof:6726]
The theorem is the computational bridge from general measure spaces to shifts. The generator hypothesis is much stronger than saying that $\mathcal P$ is informative at one time: the full two-sided orbit of the partition must recover the whole $\sigma$-algebra modulo null sets. If $\mathcal P$ is not generating, then $h_\mu(T,\mathcal P)$ measures only the entropy visible through that observation and may be strictly smaller than $h_\mu(T)$. In the fair Bernoulli shift on $\{0,1\}^{\mathbb Z}$, the partition $\{X\}$ has entropy rate $0$, while the time-zero coordinate generator gives entropy $\log 2$. The finiteness assumption matters because the proof uses continuity estimates for finite partitions and because a countable generator can have infinite one-step entropy. For example, on the full shift over $\{2,3,4,\dots\}^{\mathbb Z}$ with the heavy-tailed product marginal displayed above, the time-zero coordinate partition is a countable generator but has infinite partition entropy, so it cannot be inserted into the finite-generator theorem as a single finite approximation device. Countable generating partitions compute entropy in Chapter 3 only when finite-entropy or approximation hypotheses justify passage from finite subpartitions to the countable partition. Much of the course uses the finite theorem to evaluate entropy for Bernoulli shifts, Markov shifts, expanding maps, and systems with Markov partitions.
[example: Bernoulli Shift Entropy]
Let $A=\{1,\dots,k\}$, let $p=(p_1,\dots,p_k)$ be a probability vector, and let $T$ be the left shift on $X=A^{\mathbb Z}$ equipped with the product measure $\mu=p^{\mathbb Z}$. Let $\mathcal P=\{[x_0=i]:1\le i\le k\}$ be the time-zero coordinate partition of $X$. Since the translate $T^{-m}\mathcal P$ records the coordinate $x_m$, the two-sided join $\bigvee_{m\in\mathbb Z}T^{-m}\mathcal P$ generates all finite-coordinate cylinder sets, hence the product $\sigma$-algebra. Thus $\mathcal P$ is a finite generator, so by the *[Kolmogorov-Sinai Generator Theorem](/theorems/6726)* it is enough to compute $h_\mu(T,\mathcal P)$.
For $n\ge 1$, the atoms of $\mathcal P_0^{n-1}$ are the cylinders
\begin{align*}
[a_0,\dots,a_{n-1}]=\{x\in A^{\mathbb Z}:x_0=a_0,\dots,x_{n-1}=a_{n-1}\},
\end{align*}
with $(a_0,\dots,a_{n-1})\in A^n$. By the product definition of $\mu=p^{\mathbb Z}$,
\begin{align*}
\mu([a_0,\dots,a_{n-1}])=\prod_{j=0}^{n-1}p_{a_j}.
\end{align*}
Therefore the entropy of the joined partition is
\begin{align*}
H_\mu(\mathcal P_0^{n-1})=-\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{j=0}^{n-1}p_{a_j}\right)\log\left(\prod_{j=0}^{n-1}p_{a_j}\right).
\end{align*}
Using $\log(\prod_{j=0}^{n-1}p_{a_j})=\sum_{j=0}^{n-1}\log p_{a_j}$, this becomes
\begin{align*}
H_\mu(\mathcal P_0^{n-1})=-\sum_{j=0}^{n-1}\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{\ell=0}^{n-1}p_{a_\ell}\right)\log p_{a_j}.
\end{align*}
For a fixed $j$, the inner sum separates into the $j$th coordinate and the remaining coordinates:
\begin{align*}
\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{\ell=0}^{n-1}p_{a_\ell}\right)\log p_{a_j}=\left(\sum_{i=1}^k p_i\log p_i\right)\prod_{\ell\ne j}\left(\sum_{i=1}^k p_i\right).
\end{align*}
Since $\sum_{i=1}^k p_i=1$, this gives
\begin{align*}
\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{\ell=0}^{n-1}p_{a_\ell}\right)\log p_{a_j}=\sum_{i=1}^k p_i\log p_i.
\end{align*}
Substituting this value for each $j=0,\dots,n-1$ yields
\begin{align*}
H_\mu(\mathcal P_0^{n-1})=n\left(-\sum_{i=1}^k p_i\log p_i\right).
\end{align*}
Hence
\begin{align*}
h_\mu(T,\mathcal P)=\lim_{n\to\infty}\frac{1}{n}H_\mu(\mathcal P_0^{n-1})=-\sum_{i=1}^k p_i\log p_i.
\end{align*}
Because $\mathcal P$ is a finite generator,
\begin{align*}
h_\mu(T)=-\sum_{i=1}^k p_i\log p_i.
\end{align*}
This example supplies the basic model of a source producing independent symbols with fixed probabilities: the entropy rate is exactly the one-symbol Shannon entropy.
[/example]
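The identity $H_\mu(\mathcal P_0^{n-1})=nH(p)$ can be confirmed by brute force for small $n$. The sketch below enumerates all words in $A^n$, assigns each the product mass, and compares the resulting entropy with $n$ times the one-symbol entropy. The probability vector and the function names are hypothetical choices for illustration, and the enumeration is exponential in $n$, so it is only a check for short words.

```python
import math
from itertools import product

def shannon_entropy(probs):
    """-sum q log q with the convention 0 log 0 = 0 (natural logarithm)."""
    return -sum(q * math.log(q) for q in probs if q > 0)

def cylinder_entropy(p, n):
    """Entropy of the partition into length-n cylinders under the product measure."""
    word_masses = [math.prod(p[a] for a in word)
                   for word in product(range(len(p)), repeat=n)]
    return shannon_entropy(word_masses)

p = [0.5, 0.3, 0.2]   # hypothetical probability vector on three symbols
for n in range(1, 6):
    print(n, cylinder_entropy(p, n), n * shannon_entropy(p))
```

The two printed columns agree up to rounding, matching the separation of the sum coordinate by coordinate in the computation above.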
## Measure-Theoretic and Topological Entropy
The course also compares measure-theoretic entropy with topological entropy. The guiding problem is that topological dynamics counts distinguishable orbit segments, while measure theory weighs them according to an invariant probability measure.
[definition: Topological Entropy]
Let $X$ be a compact [metric space](/page/Metric%20Space), let $d$ be its metric, and let $T:X\to X$ be continuous. For $n\in\mathbb N$ define
\begin{align*}
d_n(x,y)=\max_{0\le j<n} d(T^j x,T^j y).
\end{align*}
An $(n,\varepsilon)$-separated set is a subset $E\subset X$ such that $d_n(x,y)>\varepsilon$ whenever $x,y\in E$ are distinct. Let $s_n(\varepsilon)$ be the supremum of $|E|$ over all $(n,\varepsilon)$-separated sets $E\subset X$. The topological entropy is
\begin{align*}
h_{\mathrm{top}}(T)=\lim_{\varepsilon\downarrow 0}\limsup_{n\to\infty}\frac{1}{n}\log s_n(\varepsilon).
\end{align*}
[/definition]
The definition counts orbit names using the metric rather than a measurable partition. This creates a comparison problem: separated sets are topological objects, while measure-theoretic entropy is computed from the statistical distribution of names under an invariant probability measure. The bridge must show that the largest statistical entropy seen by invariant measures exactly matches the exponential growth rate of distinguishable orbit segments.
[quotetheorem:6728]
[citeproof:6728]
This theorem explains why topological entropy is not separate from the measure-theoretic story. Compactness is used to extract limit measures from long orbit segments, and continuity is used so that Bowen balls and separated sets behave consistently under iteration. The hypotheses rule out familiar pathologies: on the full shift over a countably infinite alphabet, the number of length-$n$ words is infinite, so topological entropy is infinite and no invariant probability measure with finite entropy can attain the supremum. Invariance is also essential: a probability measure that is not $T$-invariant does not describe the long-term statistics of the system, so its measure-theoretic entropy is not part of the variational comparison. The theorem gives a supremum over invariant measures, but in general one must still check separately whether the supremum is attained by a measure of maximal entropy. It also does not identify topological and measure-theoretic entropy for a fixed measure; it says that topological entropy is the supremum of the measure-theoretic entropies of invariant measures.
## Advanced Directions
After the foundations, the course moves toward classification and thermodynamic formalism. The common problem is to understand when entropy alone classifies a system and when it must be supplemented by finer invariants.
[explanation: From Entropy to Classification]
Bernoulli shifts are the main test case. Their entropy is determined by a probability vector, but the isomorphism problem asks whether entropy is the only invariant. Ornstein theory gives a deep affirmative answer for Bernoulli shifts over finite alphabets, while later developments show that nearby classes of systems require additional structure.
[/explanation]
On the topological side, thermodynamic formalism adds a potential to the dynamics and replaces entropy by pressure. This leads to equilibrium states, Gibbs measures, and the dictionary with statistical mechanics.
[definition: Topological Pressure]
Let $X$ be a compact metric space with metric $d$, and let $T:X\to X$ be continuous. For each $\varphi\in C(X,\mathbb R)$, define the Birkhoff sum
\begin{align*}
S_n\varphi(x)=\sum_{j=0}^{n-1}\varphi(T^j x).
\end{align*}
For $n\in\mathbb N$ and $\varepsilon>0$, let
\begin{align*}
Z_n(T,\varphi,\varepsilon)=\sup_E \sum_{x\in E}\exp(S_n\varphi(x)),
\end{align*}
where the supremum is taken over all $(n,\varepsilon)$-separated sets $E\subset X$, using the Bowen metric $d_n(x,y)=\max_{0\le j<n}d(T^j x,T^j y)$. The topological pressure functional for the fixed system $(X,T)$ is the map
\begin{align*}
P(T,-):C(X,\mathbb R)\to (-\infty,\infty]
\end{align*}
defined by
\begin{align*}
P(T,\varphi)=\lim_{\varepsilon\downarrow 0}\limsup_{n\to\infty}\frac{1}{n}\log Z_n(T,\varphi,\varepsilon).
\end{align*}
[/definition]
Pressure reduces to topological entropy when $\varphi=0$, and it introduces energy terms into the counting of orbits. The issue is that the separated-set formula weighs individual orbit segments, while invariant measures describe long-run statistical behaviour. To use pressure in dynamics and statistical mechanics, one needs to know that this weighted orbit growth is governed by an optimisation over invariant measures balancing entropy against the average value of the potential.
[quotetheorem:6730]
[citeproof:6730]
The hypotheses in the pressure variational principle again matter. Compactness gives weak* compactness of invariant probability measures, continuity of $T$ makes orbit segments topologically coherent, and continuity of $\varphi$ ensures that Birkhoff sums vary little on sufficiently small Bowen balls. On the full shift over $\mathbb N$, the zero potential already gives infinite pressure because there are infinitely many one-symbol orbit names; more refined non-compact shifts can have finite pressure while mass escapes to symbols tending to infinity, so no equilibrium state attains the supremum. If the potential is unbounded or the space is non-compact, the weighted orbit sums and the measure-theoretic expression may require extra tightness or integrability assumptions, and the supremum need not be finite or attained. When the supremum is attained, the maximizing measure is an equilibrium state; the theorem itself identifies the optimisation problem but does not guarantee uniqueness.
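For orientation, the optimisation can be made concrete in the simplest compact case: the full shift on a finite alphabet with a potential $\varphi$ depending only on the zeroth coordinate. There the sum of $\exp(S_n\varphi)$ over length-$n$ words is $(\sum_i e^{\varphi_i})^n$, and among Bernoulli measures with marginal $q$ the quantity to maximise is $H(q)+\sum_i q_i\varphi_i$. The sketch below checks numerically that this is bounded by $\log\sum_i e^{\varphi_i}$, with equality at $q_i\propto e^{\varphi_i}$. The restriction to such potentials and to Bernoulli measures, the potential values, and the function names are assumptions for illustration; the code does not compute separated sets and is not a substitute for the theorem.

```python
import math
import random

def log_partition(phi):
    """log sum_i exp(phi_i): growth rate of the word sums (sum_i e^{phi_i})^n."""
    return math.log(sum(math.exp(v) for v in phi))

def free_energy(q, phi):
    """Entropy plus mean potential for the Bernoulli measure with marginal q."""
    entropy = -sum(t * math.log(t) for t in q if t > 0)
    return entropy + sum(t * v for t, v in zip(q, phi))

def gibbs_vector(phi):
    """Probability vector with q_i proportional to exp(phi_i)."""
    z = sum(math.exp(v) for v in phi)
    return [math.exp(v) / z for v in phi]

phi = [0.0, 1.0, -0.5]   # hypothetical potential values on three symbols
print(log_partition(phi), free_energy(gibbs_vector(phi), phi))

random.seed(0)
for _ in range(5):
    raw = [random.random() for _ in phi]
    q = [t / sum(raw) for t in raw]
    print(free_energy(q, phi) <= log_partition(phi) + 1e-12)
```

With $\varphi=0$ the value `log_partition` returns is $\log k$, consistent with pressure reducing to topological entropy for the zero potential.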
The course ends by applying these ideas to examples from statistical mechanics, smooth dynamics, compact group rotations, and number-theoretic transformations. The point of the introduction is to mark entropy as the main invariant, symbolic coding as the main computational tool, and variational principles as the bridge between measurable and topological dynamics.
These tools need a precise measurable starting point. The next chapter begins by formalizing what it means to observe a system through a finite partition and to quantify the uncertainty of that observation before any iteration is taken.
# 1. Entropy of Partitions and Information
Entropy begins with a basic problem: a measurable observation of a dynamical system does not reveal the point itself, but only which element of a partition contains it. This chapter develops the language for measuring the average uncertainty of such observations before any dynamics is applied. We first define entropy for finite and countable measurable partitions, then refine it through conditional entropy and information functions, and finally record the algebraic rules that make entropy useful for iterating partitions under a transformation. The prerequisites are probability spaces, measurable maps, countable sums of non-negative functions, [Jensen's inequality](/theorems/9), and the basic language of measure-preserving transformations.
## Measuring Uncertainty of a Measurable Partition
Suppose $(X, \mathcal B, \mu)$ is a probability space and an experiment reports only membership in one of countably many measurable sets. Before measuring the uncertainty of such a report, we need a precise object representing the possible reports and a convention for ignoring outcomes that occur on null sets. This is the role of a measurable partition.
[definition: Measurable Partition]
A countable measurable partition of $(X, \mathcal B, \mu)$ is a countable family $\alpha = \{A_i : i \in I\}$ of pairwise disjoint measurable sets such that
\begin{align*}
\mu\left(X \setminus \bigcup_{i \in I} A_i\right) = 0.
\end{align*}
The atoms of $\alpha$ are the sets $A_i$ with $\mu(A_i) > 0$.
[/definition]
Since null atoms do not affect almost-everywhere statements, partitions are identified when their atoms agree up to null sets. Once the possible observations are encoded by atoms, the next task is to assign higher cost to rare atoms and lower cost to common atoms. The entropy formula uses the convention $0\log 0 = 0$, matching the limiting value of $-t\log t$ as $t \downarrow 0$.
[definition: Shannon Entropy of a Partition]
Let $(X, \mathcal B, \mu)$ be a probability space. The Shannon entropy functional is the map
\begin{align*}
H_\mu : \{\text{countable measurable partitions of }(X, \mathcal B, \mu)\} \to [0,\infty]
\end{align*}
defined by
\begin{align*}
H_\mu(\alpha) = -\sum_{i \in I} \mu(A_i)\log \mu(A_i)
\end{align*}
for every countable measurable partition $\alpha = \{A_i : i \in I\}$.
[/definition]
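A minimal sketch of the entropy functional for a partition with finitely many listed atoms, given only the masses of those atoms. The function name `shannon_entropy`, the `base` parameter, and the tolerance in the normalisation check are assumptions made here; skipping zero masses implements the convention $0\log 0=0$.

```python
import math

def shannon_entropy(masses, base=math.e):
    """Shannon entropy of a partition, given the masses of its atoms.

    Zero masses are skipped, which implements the convention 0 log 0 = 0.
    """
    if abs(sum(masses) - 1.0) > 1e-9:
        raise ValueError("atom masses must sum to 1")
    return -sum(m * math.log(m, base) for m in masses if m > 0)

print(shannon_entropy([0.5, 0.25, 0.25]))          # natural logarithm
print(shannon_entropy([0.5, 0.25, 0.25], base=2))  # base-2 logarithm
```

For the masses $(1/2,1/4,1/4)$ the base-$2$ value is $1.5$ up to rounding, and a partition with a single atom of full mass has entropy $0$.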
The logarithm base is a choice of units: base $2$ gives bits, while base $e$ gives nats. In these notes we use natural logarithms unless a symbolic dynamics example is written in base $2$ for interpretation. The first test case is a binary observation in a product probability space, where the entropy reduces to the familiar Bernoulli entropy function.
[example: Bernoulli Partition]
Let $X=\{0,1\}^{\mathbb Z}$ with Bernoulli product measure $\mu=(p\delta_1+(1-p)\delta_0)^{\mathbb Z}$, where $p\in[0,1]$, and let $\alpha=\{A_0,A_1\}$ with $A_a=\{x\in X:x_0=a\}$. Since the zeroth coordinate has distribution $p\delta_1+(1-p)\delta_0$, the atom masses are
\begin{align*}
\mu(A_1)=p
\end{align*}
and
\begin{align*}
\mu(A_0)=1-p.
\end{align*}
Substituting these masses into the definition of partition entropy gives
\begin{align*}
H_\mu(\alpha)=-\sum_{a\in\{0,1\}}\mu(A_a)\log\mu(A_a).
\end{align*}
The two terms in the sum are the $a=0$ term and the $a=1$ term, so
\begin{align*}
H_\mu(\alpha)=-\mu(A_0)\log\mu(A_0)-\mu(A_1)\log\mu(A_1).
\end{align*}
Using $\mu(A_0)=1-p$ and $\mu(A_1)=p$, this becomes
\begin{align*}
H_\mu(\alpha)=-(1-p)\log(1-p)-p\log p.
\end{align*}
Equivalently,
\begin{align*}
H_\mu(\alpha)=-p\log p-(1-p)\log(1-p).
\end{align*}
With the convention $0\log 0=0$, this gives $H_\mu(\alpha)=0$ when $p=0$ and also when $p=1$.
For $0<p<1$, write $h(p)=-p\log p-(1-p)\log(1-p)$. Differentiating the first term gives
\begin{align*}
\frac{d}{dp}\bigl(-p\log p\bigr)=-(\log p+1).
\end{align*}
Differentiating the second term gives
\begin{align*}
\frac{d}{dp}\bigl(-(1-p)\log(1-p)\bigr)=\log(1-p)+1.
\end{align*}
Therefore
\begin{align*}
h'(p)=\log(1-p)-\log p.
\end{align*}
Thus $h'(p)=0$ exactly when $\log(1-p)=\log p$, equivalently $1-p=p$, so $p=1/2$. A second derivative computation gives
\begin{align*}
h''(p)=-\frac{1}{1-p}-\frac{1}{p}.
\end{align*}
For $0<p<1$, both denominators are positive, so $h''(p)<0$. Hence $h$ is strictly concave on $(0,1)$ and its unique maximum occurs at $p=1/2$, where $h(1/2)=\log 2$. Thus the entropy of the binary coordinate observation is $-p\log p-(1-p)\log(1-p)$; it vanishes for the deterministic endpoint distributions and is largest for the balanced binary observation.
[/example]
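The calculus in this example is easy to sanity-check numerically. The sketch below compares the derivative formula $h'(p)=\log(1-p)-\log p$ with a central difference and locates the maximiser of $h$ on a grid. The function names, the test point, the step size, and the grid resolution are arbitrary illustrative choices.

```python
import math

def binary_entropy(p):
    """h(p) = -p log p - (1-p) log(1-p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def binary_entropy_derivative(p):
    """h'(p) = log(1-p) - log p for 0 < p < 1."""
    return math.log(1 - p) - math.log(p)

# Central-difference check of the derivative at an interior point.
p, eps = 0.3, 1e-6
numeric = (binary_entropy(p + eps) - binary_entropy(p - eps)) / (2 * eps)
print(numeric, binary_entropy_derivative(p))

# Grid search for the maximiser.
grid = [i / 1000 for i in range(1001)]
print(max(grid, key=binary_entropy))   # 0.5
```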
The previous example is finite, but countable partitions occur naturally when inducing maps, coding return times, or recording countably many symbols. In that setting entropy may be infinite even though the total measure is one. The next example marks the boundary between countability of the partition and finiteness of its entropy.
[example: Countable Partition with Infinite Entropy]
Let $X=\mathbb N=\{1,2,3,\dots\}$, and let
\begin{align*}
Z=\sum_{n=1}^{\infty}\frac{1}{n(\log(n+1))^2}.
\end{align*}
This series is finite: for $k\ge 1$, grouping $2^k\le n<2^{k+1}$ gives a block bounded above by a constant multiple of $1/k^2$, so the blocks have a convergent total sum. Choose $c=Z^{-1}$, and define a measure $\mu$ on $X$ by
\begin{align*}
p_n=\mu(\{n\})=\frac{c}{n(\log(n+1))^2}.
\end{align*}
Then $\sum_{n=1}^{\infty}p_n=1$, so $\mu$ is a probability measure.
For the singleton partition $\alpha=\{\{n\}:n\in\mathbb N\}$, the entropy is
\begin{align*}
H_\mu(\alpha)=-\sum_{n=1}^{\infty}p_n\log p_n.
\end{align*}
Substituting the formula for $p_n$ gives
\begin{align*}
H_\mu(\alpha)=\sum_{n=1}^{\infty}\frac{c}{n(\log(n+1))^2}\log\left(\frac{n(\log(n+1))^2}{c}\right).
\end{align*}
Using $\log(ab/c)=\log a+\log b-\log c$ with $a=n$ and $b=(\log(n+1))^2$, this becomes
\begin{align*}
H_\mu(\alpha)=\sum_{n=1}^{\infty}\frac{c}{n(\log(n+1))^2}\left(\log n+2\log\log(n+1)-\log c\right).
\end{align*}
Since $\log n\to\infty$ and $2\log\log(n+1)-\log c$ grows more slowly than $\log n$, there is $N$ such that for all $n\ge N$,
\begin{align*}
\log n+2\log\log(n+1)-\log c\ge \frac{1}{2}\log n.
\end{align*}
Since every term of the entropy series is nonnegative, discarding the terms with $n<N$ gives
\begin{align*}
H_\mu(\alpha)\ge \frac{c}{2}\sum_{n=N}^{\infty}\frac{\log n}{n(\log(n+1))^2}.
\end{align*}
For $n\ge 3$, we have $\log(n+1)\le 2\log n$, hence
\begin{align*}
\frac{\log n}{n(\log(n+1))^2}\ge \frac{1}{4n\log n}.
\end{align*}
After increasing $N$ if necessary so that $N\ge 3$,
\begin{align*}
H_\mu(\alpha)\ge \frac{c}{8}\sum_{n=N}^{\infty}\frac{1}{n\log n}.
\end{align*}
The last series diverges. Indeed, for every sufficiently large $k$ with $2^k\ge N$,
\begin{align*}
\sum_{n=2^k}^{2^{k+1}-1}\frac{1}{n\log n}\ge \frac{1}{2(k+1)\log 2}.
\end{align*}
Summing these lower bounds over $k$ gives a constant multiple of the harmonic series, so the partial sums are unbounded. Hence $H_\mu(\alpha)=\infty$: the partition has countably many atoms with total mass one, but its average information is infinite.
[/example]
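A numerical look at partial sums shows the contrast between the two series, although no finite computation can prove divergence. The sketch below works with the unnormalised weights $w_n=1/(n(\log(n+1))^2)$, so that $p_n=cw_n$ and $-p_n\log p_n=c\,w_n(\log(1/w_n)-\log c)$. The function name `partial_sums` and the cut-offs are illustrative assumptions.

```python
import math

def partial_sums(N):
    """Partial sums up to N of w_n and of w_n log(1/w_n), w_n = 1/(n log(n+1)^2)."""
    mass = 0.0
    info = 0.0
    for n in range(1, N + 1):
        w = 1.0 / (n * math.log(n + 1) ** 2)
        mass += w
        info += w * math.log(1.0 / w)   # unnormalised analogue of -p_n log p_n
    return mass, info

for N in (10**3, 10**4, 10**5):
    print(N, partial_sums(N))
```

The first component changes by less and less between successive cut-offs, while the second keeps increasing by a comparable amount each time $N$ is multiplied by $10$. That is the very slow growth expected from comparison with $\sum 1/(n\log n)$.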
The examples show that entropy detects both randomness and the size of the alphabet. For finite partitions we need universal bounds, because later entropy rates are formed by normalising the entropy of larger and larger finite joins. These bounds are proved from the concavity of $-t\log t$ and from [Jensen's inequality](/theorems/1977).
[quotetheorem:6732]
[citeproof:6732]
The upper bound explains why finite partitions are technically convenient in Kolmogorov-Sinai entropy. Each hypothesis has a concrete role. Null listed sets must be ignored when counting $k$: if a two-set list has masses $1$ and $0$, its entropy is $0$, not $\log 2$, so the equality statement is about non-null atoms rather than formal labels. Finiteness is essential for a uniform estimate, since the preceding countable singleton partition has infinitely many atoms and infinite entropy, so there is no analogue of $H_\mu(\alpha)\le \log k$ without a finite alphabet. Probability normalization is also part of the statement: if the total mass were $2$ and the partition had one atom of mass $2$, the same expression would give $-2\log 2<0$, contradicting the lower bound. The equality cases describe only the distribution of atom names, not the geometry of the atoms or any dynamical behaviour of the underlying system; two very different partitions can have the same probability vector and hence the same entropy.
## Conditional Entropy and Information Functions
A second observation may make the first one partly predictable. The main problem is to measure the remaining uncertainty in a partition once another partition is already known. To formalise simultaneous observation, we first need the partition whose atoms remember both names.
[definition: Join of Partitions]
The join operation maps pairs of countable measurable partitions of $(X,\mathcal B,\mu)$ to countable measurable partitions of $(X,\mathcal B,\mu)$. For countable measurable partitions $\alpha = \{A_i : i \in I\}$ and $\beta = \{B_j : j \in J\}$, their join is
\begin{align*}
\alpha \vee \beta = \{A_i \cap B_j : i \in I,\ j \in J\},
\end{align*}
after discarding null atoms.
[/definition]
The join is the common observation obtained by recording both names at once. It lets us compare the entropy of knowing both observations with the entropy of knowing only the conditioning observation. The difference is the natural candidate for remaining uncertainty.
[definition: Conditional Entropy of Partitions]
Let $(X, \mathcal B, \mu)$ be a probability space. The conditional entropy functional is the partially defined map
\begin{align*}
H_\mu(\cdot \mid \cdot) : \{(\alpha,\beta) : \alpha,\beta \text{ are countable measurable partitions and } H_\mu(\beta)<\infty\} \to [0,\infty]
\end{align*}
given by
\begin{align*}
H_\mu(\alpha \mid \beta) = H_\mu(\alpha \vee \beta) - H_\mu(\beta),
\end{align*}
whenever this difference is defined in $[0,\infty]$.
[/definition]
This definition is compact, but computations usually require a formula inside the atoms of $\beta$. For finite partitions, conditioning by $\beta$ means restricting the probability measure to each atom of $\beta$ and averaging the resulting entropies. If $\mu(B)>0$, write
\begin{align*}
\mu_B(A)=\frac{\mu(A\cap B)}{\mu(B)}.
\end{align*}
[quotetheorem:6734]
[citeproof:6734]
The formula says that conditioning splits the space into the atoms of $\beta$, computes the entropy of $\alpha$ inside each atom, and averages. Finiteness has two separate uses: it makes all sums finite and prevents the defining difference from becoming an ambiguous $\infty-\infty$ expression. For a concrete failure mode, take a countable partition $\beta$ with $H_\mu(\beta)=\infty$ and set $\alpha=\beta$; the intended value of $H_\mu(\beta\mid\beta)$ is $0$, but the definition as $H_\mu(\beta\vee\beta)-H_\mu(\beta)$ reads $\infty-\infty$. If the ambient measure is not normalised, the weights in the displayed average need not form a probability distribution: for a space of total mass $2$ with one conditioning atom $B=X$, the expression $\mu(B)H_{\mu_B}(\alpha)$ carries a factor $2$ and no longer represents an average uncertainty per outcome. Null atoms show a different obstruction, since $\mu_B(A)=\mu(A\cap B)/\mu(B)$ is undefined when $\mu(B)=0$; including such an atom in the displayed sum would force division by zero, so only positive-measure atoms may be used. The theorem does not say that every atom of $\beta$ has the same conditional distribution of $\alpha$: it averages the different conditional entropies with weights $\mu(B)$.
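Both descriptions of conditional entropy can be compared on a finite table of joint masses. In the sketch below, `joint[i][j]` stands for $\mu(A_i\cap B_j)$; the table and the function names are hypothetical, and null atoms of $\beta$ are skipped in the averaging formula exactly as in the discussion above.

```python
import math

def H(masses):
    """-sum m log m with the convention 0 log 0 = 0."""
    return -sum(m * math.log(m) for m in masses if m > 0)

def conditional_entropy_by_difference(joint):
    """H(alpha v beta) - H(beta), where joint[i][j] = mu(A_i intersect B_j)."""
    beta = [sum(col) for col in zip(*joint)]
    return H([m for row in joint for m in row]) - H(beta)

def conditional_entropy_by_averaging(joint):
    """Sum over atoms B of beta of mu(B) * H_{mu_B}(alpha), skipping null atoms."""
    total = 0.0
    for col in zip(*joint):
        mass_B = sum(col)
        if mass_B > 0:
            total += mass_B * H([m / mass_B for m in col])
    return total

joint = [[0.1, 0.2, 0.0],
         [0.3, 0.1, 0.3]]   # hypothetical joint masses summing to 1
print(conditional_entropy_by_difference(joint),
      conditional_entropy_by_averaging(joint))
```

The two printed values agree up to rounding. A diagonal table, which corresponds to $\alpha=\beta$, gives $0$ in both computations.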
Once a joint observation is understood through conditional pieces, the next question is how to account for several observations made in sequence. We need an identity that decomposes the entropy of a combined name into the entropy already paid for and the new uncertainty left after previous names are known. That bookkeeping principle is the chain rule.
[quotetheorem:1635]
[citeproof:1635]
The chain rule is not an extra inequality; it is an accounting identity saying that the total cost of a joint name can be paid in successive conditional costs. The finite hypothesis is again what makes the telescoping calculation legal without choosing conventions for infinite subtraction; countable versions are useful only after finite-entropy or integrability assumptions have been checked. The identity does not imply that the increments are equal, nor that later observations add fresh information: an observation already determined by the previous join contributes conditional entropy $0$. Its forward consequence is that entropy growth can be studied one new observation at a time, which is the viewpoint behind information functions and entropy rates.
[definition: Information Function]
Let $\alpha$ be a countable measurable partition of $(X, \mathcal B, \mu)$. The information function of $\alpha$ is the $\mu$-a.e. defined measurable function
\begin{align*}
I_\mu(\alpha) : X \to [0,\infty]
\end{align*}
given by
\begin{align*}
I_\mu(\alpha)(x) = -\log \mu(A_\alpha(x)),
\end{align*}
where $A_\alpha(x)$ is the atom of $\alpha$ containing $x$. On the null set not covered by positive-measure atoms, $I_\mu(\alpha)$ may be assigned any value in $[0,\infty]$.
[/definition]
The definition has replaced a partition-level number by a [random variable](/page/Random%20Variable) on the underlying probability space: each point carries the information content of its own atom. To justify using this pointwise object in entropy estimates, we need to check that averaging these local costs recovers the original entropy and introduces no new quantity. The next theorem is the bridge from the atomwise definition to the global sum.
[quotetheorem:6735]
[citeproof:6735]
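On a finite probability space the averaging identity can be verified by direct summation. The sketch below assumes a hypothetical six-point space with masses `mu` and a partition `alpha` encoded by one label per point; these names are chosen for illustration only.

```python
import math

# Hypothetical finite probability space: points 0..5 with masses mu[x].
mu = [0.05, 0.15, 0.20, 0.10, 0.30, 0.20]
# alpha is encoded by a label per point: points with equal labels share an atom.
alpha = [0, 0, 1, 1, 1, 2]

# Mass of each atom of alpha.
atom_mass = {}
for x, label in enumerate(alpha):
    atom_mass[label] = atom_mass.get(label, 0.0) + mu[x]

def information(x):
    """I_mu(alpha)(x) = -log mu(A_alpha(x)), the information content of the atom containing x."""
    return -math.log(atom_mass[alpha[x]])

# Integral of the information function against mu.
integral_of_information = sum(mu[x] * information(x) for x in range(len(mu)))
# Entropy computed from the atom masses.
entropy_of_alpha = -sum(m * math.log(m) for m in atom_mass.values() if m > 0)

print(integral_of_information, entropy_of_alpha)
```

Points in the same atom carry the same information value, so the integral collapses to the sum over atoms that defines the entropy.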
The identity is an equality in the extended sense, so it may say that both sides are infinite. This theorem does not provide a pointwise limit theorem by itself. It supplies the bookkeeping device that later allows entropy rates to be compared with averages of pointwise information. To express the chain rule pointwise, the next object measures the surprise of the $\alpha$-atom after the $\beta$-atom is known.
[definition: Conditional Information Function]
Let $\alpha$ and $\beta$ be countable measurable partitions of $(X, \mathcal B, \mu)$. The conditional information function of $\alpha$ given $\beta$ is the $\mu$-a.e. defined measurable function
\begin{align*}
I_\mu(\alpha \mid \beta) : X \to [0,\infty]
\end{align*}
given by
\begin{align*}
I_\mu(\alpha \mid \beta)(x)
= -\log \frac{\mu(A_\alpha(x)\cap B_\beta(x))}{\mu(B_\beta(x))},
\end{align*}
for points $x$ whose $\beta$-atom has positive measure. On the null set not covered by positive-measure atoms of $\alpha$ and $\beta$, the function may be assigned any value in $[0,\infty]$.
[/definition]
The conditional information function records a local cost after the $\beta$-name of the point is already known. For this pointwise object to represent the conditional entropy introduced earlier, its average must agree with the finite double sum over intersections of atoms. The following theorem verifies exactly that agreement, so the chain rule can be read both globally as an entropy identity and locally as an identity between information functions.
[quotetheorem:6737]
[citeproof:6737]
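The local reading of the chain rule can also be tested point by point. The sketch below assumes the same kind of hypothetical finite space as before (`mu`, `alpha`, `beta` are illustrative labels, not objects from the text). It checks that $I_\mu(\alpha\vee\beta)=I_\mu(\beta)+I_\mu(\alpha\mid\beta)$ at every point and that the average of the conditional information equals $H_\mu(\alpha\vee\beta)-H_\mu(\beta)$.

```python
import math

# Hypothetical finite probability space and two partitions given by labels.
mu = [0.05, 0.15, 0.20, 0.10, 0.30, 0.20]
alpha = [0, 0, 1, 1, 1, 2]
beta = [0, 1, 1, 0, 1, 0]
points = range(len(mu))

def atom_masses(labels):
    """Map each label to the total mass of the points carrying it."""
    out = {}
    for x in points:
        out[labels[x]] = out.get(labels[x], 0.0) + mu[x]
    return out

def entropy(masses):
    return -sum(m * math.log(m) for m in masses.values() if m > 0)

join = list(zip(alpha, beta))  # labels of the join alpha v beta
mass_beta, mass_join = atom_masses(beta), atom_masses(join)

def info_beta(x):
    return -math.log(mass_beta[beta[x]])

def info_join(x):
    return -math.log(mass_join[join[x]])

def info_alpha_given_beta(x):
    return -math.log(mass_join[join[x]] / mass_beta[beta[x]])

# Pointwise chain rule: the largest deviation over all points should be rounding noise.
pointwise_gap = max(
    abs(info_join(x) - info_beta(x) - info_alpha_given_beta(x)) for x in points
)

average_conditional_info = sum(mu[x] * info_alpha_given_beta(x) for x in points)
conditional_entropy = entropy(mass_join) - entropy(mass_beta)
print(pointwise_gap, average_conditional_info, conditional_entropy)
```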
The finiteness hypothesis ensures that the atomwise conditional probabilities and sums have no convergence ambiguity. Countable conditional information needs extra integrability hypotheses before it can be used as an $L^1$ object. The theorem is an averaging statement, not an assertion that conditional information is constant across atoms. A different comparison asks how much information two observations share rather than how much remains after conditioning. This shared part is mutual information.
[definition: Mutual Information of Partitions]
Let $(X,\mathcal B,\mu)$ be a probability space. Mutual information of finite partitions is the map
\begin{align*}
I_\mu(\cdot;\cdot) : \{(\alpha,\beta) : \alpha,\beta \text{ are finite measurable partitions of }(X,\mathcal B,\mu)\} \to [0,\infty)
\end{align*}
defined by
\begin{align*}
I_\mu(\alpha;\beta) = H_\mu(\alpha) + H_\mu(\beta) - H_\mu(\alpha\vee\beta).
\end{align*}
[/definition]
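Three small joint tables illustrate the range of values this definition can take. The sketch assumes hypothetical $2\times 2$ tables of joint atom masses; `independent`, `identical`, and `dependent` are illustrative names. The product table should give mutual information $0$, the diagonal table (where $\alpha=\beta$) should give $H_\mu(\alpha)=\log 2$, and the mixed table should fall strictly between.

```python
import math

def entropy(masses):
    return -sum(p * math.log(p) for p in masses if p > 0)

def mutual_information(joint):
    """H(alpha) + H(beta) - H(alpha v beta) from a table joint[i][j] = mu(A_i ∩ B_j)."""
    row_masses = [sum(row) for row in joint]        # atoms of alpha
    col_masses = [sum(col) for col in zip(*joint)]  # atoms of beta
    cells = [p for row in joint for p in row]       # atoms of the join
    return entropy(row_masses) + entropy(col_masses) - entropy(cells)

# Hypothetical tables.
independent = [[0.3 * 0.5, 0.3 * 0.5], [0.7 * 0.5, 0.7 * 0.5]]  # product of (0.3, 0.7) and (0.5, 0.5)
identical = [[0.5, 0.0], [0.0, 0.5]]                            # beta equals alpha
dependent = [[0.4, 0.1], [0.1, 0.4]]                            # partial overlap

for name, table in [("independent", independent), ("identical", identical), ("dependent", dependent)]:
    print(name, mutual_information(table))
```

Transposing a table swaps the roles of $\alpha$ and $\beta$ and leaves the value unchanged.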
This number is symmetric in $\alpha$ and $\beta$. By the chain rule it also equals $H_\mu(\alpha)-H_\mu(\alpha\mid\beta)$, so it measures the reduction in uncertainty about $\alpha$ obtained by observing $\beta$. Finite-state Markov chains give the standard model where conditioning reduces uncertainty without eliminating it.
[example: Two-Step Markov Chain Partition]
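Let $(X_n)_{n\ge 0}$ be a stationary Markov chain on a finite state space $S$, defined on $(X,\mathcal B,\mu)$, with transition matrix $P=(P_{ab})_{a,b\in S}$ and stationary distribution $\pi=(\pi_a)_{a\in S}$. Let $C_a=\{x:X_0(x)=a\}$ and $D_b=\{x:X_1(x)=b\}$, so $\alpha_0=\{C_a:a\in S\}$ and $\alpha_1=\{D_b:b\in S\}$ after discarding null atoms. We compute the conditional entropy of the next-state partition $\alpha_1$ given the present-state partition $\alpha_0$.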
By the *Conditional Entropy Formula*,
\begin{align*}
H_\mu(\alpha_1\mid\alpha_0)
= -\sum_{a\in S}\sum_{b\in S}\mu(C_a\cap D_b)\log\frac{\mu(C_a\cap D_b)}{\mu(C_a)}.
\end{align*}
Here the sum uses only atoms $C_a$ with $\mu(C_a)>0$, and a term with $\mu(C_a\cap D_b)=0$ is interpreted as $0$. Stationarity gives
\begin{align*}
\mu(C_a)=\mu(X_0=a)=\pi_a.
\end{align*}
For each $a,b\in S$, the definition of the transition matrix gives
\begin{align*}
\mu(X_1=b\mid X_0=a)=P_{ab}.
\end{align*}
Therefore
\begin{align*}
\mu(C_a\cap D_b)=\mu(X_0=a,\ X_1=b)=\pi_a P_{ab}.
\end{align*}
For every $a$ with $\pi_a>0$, we then have
\begin{align*}
\frac{\mu(C_a\cap D_b)}{\mu(C_a)}=\frac{\pi_a P_{ab}}{\pi_a}=P_{ab}.
\end{align*}
Substituting these identities into the conditional entropy sum yields
\begin{align*}
H_\mu(\alpha_1\mid\alpha_0)
= -\sum_{a\in S}\sum_{b\in S}\pi_a P_{ab}\log P_{ab}.
\end{align*}
Equivalently,
\begin{align*}
H_\mu(\alpha_1\mid\alpha_0)
= -\sum_{a,b\in S}\pi_a P_{ab}\log P_{ab}.
\end{align*}
Thus the conditional entropy is the stationary average, over current states $a$, of the entropy of the transition distribution in the $a$-th row of $P$.
[/example]
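The closing formula of the example is easy to evaluate for a concrete chain. The sketch below assumes a hypothetical two-state transition matrix `P` with stationary vector `pi = (5/6, 1/6)`. It checks stationarity and compares the stationary average of row entropies with the value obtained from the joint masses $\pi_aP_{ab}$.

```python
import math

# Hypothetical two-state chain; pi solves pi P = pi.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5 / 6, 1 / 6]
states = range(2)

# Stationarity check: (pi P)_b should reproduce pi_b.
pi_next = [sum(pi[a] * P[a][b] for a in states) for b in states]

def entropy(masses):
    return -sum(p * math.log(p) for p in masses if p > 0)

# Formula from the example: stationary average of the row entropies of P.
h_formula = sum(pi[a] * entropy(P[a]) for a in states)

# Direct route: joint masses mu(C_a ∩ D_b) = pi_a P_ab, then H(join) - H(alpha_0).
joint = [pi[a] * P[a][b] for a in states for b in states]
h_direct = entropy(joint) - entropy(pi)

print(pi_next, h_formula, h_direct)
```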
## Refinement, Independence, and Monotonicity
The next structural question asks how entropy changes when an observation is made more detailed or when the conditioning information is enlarged. To compare observations, we need a relation saying that one partition records at least everything another partition records.
[definition: Refinement of Partitions]
The refinement relation $\succeq$ on countable measurable partitions of $(X,\mathcal B,\mu)$ modulo null changes of atoms is defined as follows: $\alpha \succeq \beta$ if every atom of $\alpha$ is contained, up to a null set, in an atom of $\beta$.
[/definition]
A refinement records at least as much information as the coarser partition. The resulting comparison problem is whether this extra detail is always visible numerically in the entropy. For finite partitions, the entropy formula is monotone under refinement.
[quotetheorem:6739]
[citeproof:6739]
The finite theorem has a limited but important meaning. If finiteness is dropped, a refinement can move immediately outside the finite-entropy regime: on $\mathbb N$ with a heavy-tailed probability measure of infinite Shannon entropy, the partition into singletons refines the one-atom partition $\{X\}$, changing entropy from $0$ to $\infty$. The theorem also does not imply strict increase; refining by splitting only null sets, or using the same partition again, leaves entropy unchanged. Its role is therefore not to guarantee a positive gain, but to ensure that adding finite detail never lowers the average information. Refinement behaves differently from conditioning: making the observed partition finer can only raise entropy or leave it unchanged, while making the conditioning partition finer can only lower the remaining uncertainty or leave it unchanged.
[quotetheorem:6741]
[citeproof:6741]
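Both monotonicity statements can be observed on a small finite space. The sketch below assumes hypothetical labels `alpha`, `beta`, `gamma` on six points, with `gamma` chosen so that each of its atoms lies inside one atom of `beta`. It compares $H_\mu(\gamma)$ with $H_\mu(\beta)$ and $H_\mu(\alpha\mid\gamma)$ with $H_\mu(\alpha\mid\beta)$.

```python
import math

# Hypothetical finite probability space and partitions given by labels.
mu = [0.05, 0.15, 0.20, 0.10, 0.30, 0.20]
alpha = [0, 1, 0, 1, 0, 1]
beta = [0, 0, 0, 1, 1, 1]
gamma = [0, 0, 1, 2, 3, 3]  # refines beta: atoms {0,1}, {2} sit in {0,1,2}; {3}, {4,5} sit in {3,4,5}

def entropy(labels):
    """Entropy of the partition whose atoms are the level sets of `labels`."""
    masses = {}
    for x, label in enumerate(labels):
        masses[label] = masses.get(label, 0.0) + mu[x]
    return -sum(m * math.log(m) for m in masses.values() if m > 0)

def conditional_entropy(a, b):
    """H(a | b) = H(a v b) - H(b); the join is encoded by pairs of labels."""
    return entropy(list(zip(a, b))) - entropy(b)

h_given_beta = conditional_entropy(alpha, beta)
h_given_gamma = conditional_entropy(alpha, gamma)
print(entropy(beta), entropy(gamma))  # refinement does not lower entropy
print(h_given_beta, h_given_gamma)    # finer conditioning does not raise it
```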
The finiteness assumptions are doing real work here: the proof uses entropy as a concave function on finite probability vectors and avoids uncontrolled infinite sums. The theorem does not say that conditioning determines the value of $\alpha$; it says only that adding information to the conditioning partition cannot raise the average remaining uncertainty. Equality can still occur without $\gamma$ and $\alpha$ being the same partition, for instance when the extra information in $\gamma$ is independent of $\alpha$ inside each atom of $\beta$. This equality question leads naturally to independence, the case where observing $\beta$ fails to reduce the entropy of $\alpha$ at all.
[definition: Independent Partitions]
Let $\alpha$ and $\beta$ be countable measurable partitions of $(X, \mathcal B, \mu)$. They are independent if
\begin{align*}
\mu(A\cap B)=\mu(A)\mu(B)
\end{align*}
for all atoms $A\in\alpha$ and $B\in\beta$.
[/definition]
For finite independent partitions, the join behaves like a product probability distribution. The question is whether two observations that carry no information about each other really contribute separate amounts of uncertainty when they are observed together. Independence is precisely the condition that removes overlap between the two name distributions, so the entropy of the joint observation should split into two separate entropy costs.
[quotetheorem:1634]
[citeproof:1634]
This additivity result is the local mechanism behind the entropy of Bernoulli shifts, and the independence hypothesis cannot be replaced by a weaker visual separation of atoms. If $\alpha=\beta$ is the fair two-atom partition of $[0,1]$, then $H_\mu(\alpha\vee\beta)=H_\mu(\alpha)=\log 2$, while $H_\mu(\alpha)+H_\mu(\beta)=2\log 2$; complete dependence removes one whole copy of the information. The finite hypothesis keeps the product expansion and chain-rule argument inside finite sums; for countable independent partitions with infinite entropy, the equality may reduce to $\infty=\infty+\infty$ in the extended sense and no finite entropy rate can be extracted from it. The theorem also does not say that every join is additive: in general, overlap between observations is measured by mutual information. Independent coordinates contribute fresh information at each time, so the entropy of a block of coordinate names grows linearly with the block length.
[example: Independent Coordinates in a Bernoulli Shift]
In the Bernoulli shift $X=\{0,1\}^{\mathbb Z}$ with product measure $\mu=(p\delta_1+(1-p)\delta_0)^{\mathbb Z}$, let $\alpha_r$ be the partition according to the coordinate $x_r$. Write $p_1=p$ and $p_0=1-p$. For a word $w=(w_0,\dots,w_{n-1})\in\{0,1\}^n$, the corresponding atom of $\bigvee_{r=0}^{n-1}\alpha_r$ is
\begin{align*}
A_w=\{x\in X:x_0=w_0,\dots,x_{n-1}=w_{n-1}\}.
\end{align*}
By the product measure definition,
\begin{align*}
\mu(A_w)=\prod_{r=0}^{n-1}p_{w_r}.
\end{align*}
Assume first that $0<p<1$, so all these atoms have positive measure. Using the definition of partition entropy,
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}\alpha_r\right)=-\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\log\left(\prod_{s=0}^{n-1}p_{w_s}\right).
\end{align*}
Since $\log\prod_{s=0}^{n-1}p_{w_s}=\sum_{s=0}^{n-1}\log p_{w_s}$, this becomes
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}\alpha_r\right)=-\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\sum_{r=0}^{n-1}\log p_{w_r}.
\end{align*}
Because the sums are finite, we may interchange the sum over $w$ with the sum over $r$:
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}\alpha_r\right)=-\sum_{r=0}^{n-1}\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\log p_{w_r}.
\end{align*}
Fix $r$. Splitting the inner sum according to the value $a=w_r$ gives
\begin{align*}
\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\log p_{w_r}=\sum_{a\in\{0,1\}}p_a\log p_a\sum_{(w_s)_{s\ne r}\in\{0,1\}^{n-1}}\prod_{s\ne r}p_{w_s}.
\end{align*}
The remaining sum factors coordinate by coordinate:
\begin{align*}
\sum_{(w_s)_{s\ne r}\in\{0,1\}^{n-1}}\prod_{s\ne r}p_{w_s}=\prod_{s\ne r}(p_0+p_1)=1.
\end{align*}
Therefore
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}\alpha_r\right)=-\sum_{r=0}^{n-1}\sum_{a\in\{0,1\}}p_a\log p_a.
\end{align*}
Since the summand does not depend on $r$,
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}\alpha_r\right)=n\bigl(-p_0\log p_0-p_1\log p_1\bigr).
\end{align*}
The one-coordinate partition $\alpha_0$ has atom masses $p_0$ and $p_1$, so
\begin{align*}
H_\mu(\alpha_0)=-p_0\log p_0-p_1\log p_1.
\end{align*}
Hence
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}\alpha_r\right)=nH_\mu(\alpha_0).
\end{align*}
If $p=0$ or $p=1$, every coordinate is deterministic, so $\alpha_0$ and every finite coordinate join have one atom of measure $1$ after null atoms are discarded; both sides of the same identity are then $0$. Thus each additional independent coordinate contributes one more copy of the one-coordinate entropy, giving linear partition-entropy growth.
[/example]
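The identity $H_\mu\bigl(\bigvee_{r=0}^{n-1}\alpha_r\bigr)=nH_\mu(\alpha_0)$ can be confirmed by brute-force enumeration of all $2^n$ words for small $n$. This is a minimal sketch; the function names `block_entropy` and `one_coordinate_entropy` are illustrative, and the enumeration is only practical for short blocks.

```python
import itertools
import math

def block_entropy(p, n):
    """Entropy of the join of n coordinate partitions under the Bernoulli(p) product measure."""
    weights = {0: 1.0 - p, 1: p}
    total = 0.0
    for word in itertools.product((0, 1), repeat=n):
        mass = math.prod(weights[s] for s in word)  # mu(A_w) as a product over coordinates
        if mass > 0:                                # null atoms are discarded
            total -= mass * math.log(mass)
    return total

def one_coordinate_entropy(p):
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

for n in range(1, 9):
    print(n, block_entropy(0.3, n), n * one_coordinate_entropy(0.3))
```

The degenerate cases $p=0$ and $p=1$ are handled by the `mass > 0` guard, which leaves a single atom of mass $1$.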
## Subadditivity for Iterated Joins
The final question of the chapter is dynamical: if a transformation repeatedly pulls back the same observation, how fast can the joint information grow? To name orbit segments using a fixed initial partition, we first describe how a measurable map transports a partition backward.
[definition: Pullback Partition]
Let $T:X\to X$ be a non-singular measurable map on $(X,\mathcal B,\mu)$, meaning that $\mu(N)=0$ implies $\mu(T^{-1}N)=0$. The pullback operation
\begin{align*}
T^{-1} : \{\text{countable measurable partitions of }(X,\mathcal B,\mu)\} \to \{\text{countable measurable partitions of }(X,\mathcal B,\mu)\}
\end{align*}
is defined by
\begin{align*}
T^{-1}\alpha = \{T^{-1}A : A\in\alpha\},
\end{align*}
for every countable measurable partition $\alpha$, after discarding null atoms.
[/definition]
Without measure preservation, pulling back a partition can change the probabilities of its atoms and hence change entropy before any genuine orbit growth is being measured. For example, a non-singular map can send most of the space into one atom of a two-set partition, turning a balanced observation into a nearly deterministic one. When $T$ preserves $\mu$, this obstruction disappears: the pullback has the same atom measures as the original partition, up to null atoms. This creates a necessary check before studying growth: a single transported observation should have unchanged entropy. The following theorem supplies that invariance.
[quotetheorem:6743]
[citeproof:6743]
Measure preservation is the exact hypothesis that prevents the pullback from changing the probability vector. For a concrete failure without it, let $X=[0,1]$ with [Lebesgue measure](/page/Lebesgue%20Measure), let $\alpha=\{[0,1/2),[1/2,1]\}$, and let $T(x)=x/4$; then $T^{-1}[0,1/2)=X$ and the pullback observation has entropy $0$ instead of $\log 2$. Finiteness is used here to keep the entropy comparison inside finite sums; countable partitions with infinite entropy may still have equal extended entropy after pullback, but that equality is too weak for the finite block estimates used later. The theorem does not say that $T$ preserves the shapes of atoms or is invertible, only that the pulled-back atom measures agree. Orbit-name partitions are built from repeated joins, so we also need the basic estimate that observing two finite partitions together costs no more than observing them separately.
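The failure of invariance for $T(x)=x/4$ described above can be checked with a few lines of arithmetic. In this sketch the helper `pullback_mass` is a hypothetical name; it uses the fact that $T(x)\in[a,b)$ exactly when $x\in[4a,4b)$ and clips that interval to $[0,1]$.

```python
import math

def entropy(masses):
    return -sum(p * math.log(p) for p in masses if p > 0)

def pullback_mass(a, b, scale=4.0):
    """Lebesgue measure of T^{-1}[a, b) inside [0, 1] for T(x) = x / scale."""
    low, high = max(0.0, scale * a), min(1.0, scale * b)
    return max(0.0, high - low)

# The two half-interval atoms, given by their endpoints (the closed endpoint at 1 has measure zero).
atoms = [(0.0, 0.5), (0.5, 1.0)]
original = [b - a for a, b in atoms]
pulled_back = [pullback_mass(a, b) for a, b in atoms]

print(original, entropy(original))        # balanced partition
print(pulled_back, entropy(pulled_back))  # all mass in the first atom
```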
[quotetheorem:1634]
[citeproof:1634]
Subadditivity is weaker than additivity: two observations may overlap heavily, and the inequality counts the shared information only once. Finiteness again rules out undefined expressions and ensures that the chain rule and conditional monotonicity apply directly. A concrete countable obstruction is the following: take a probability vector $(p_i)_{i\ge 1}$ with infinite Shannon entropy, let $\alpha=\{A_i:i\ge 1\}$ have $\mu(A_i)=p_i$, and set $\beta=\alpha$; then $H_\mu(\alpha\vee\beta)=H_\mu(\alpha)=\infty$, while the chain-rule expression $H_\mu(\beta)+H_\mu(\alpha\mid\beta)$ involves $\infty+0$ and cannot be used to control finite differences or limiting averages. Countable partitions therefore require finite-entropy hypotheses before the same conclusion can be used safely. Equality is a special independence phenomenon, not the generic case. Applying subadditivity to orbit-name partitions gives the sequence whose limiting average will define partition entropy. The remaining problem is to prove that this sequence is itself subadditive in the block length. A block of length $m+n$ splits into an initial block of length $m$ and a shifted block of length $n$.
[quotetheorem:6745]
[citeproof:6745]
The hypotheses in the iterated version are doing two separate jobs. Measure preservation is needed when the shifted $n$-block is replaced by an unshifted $n$-block: without it, pullback can change entropy, as in the earlier map $T(x)=x/4$ on $[0,1]$ with the two-atom half-interval partition, where the pulled-back partition has entropy $0$ instead of $\log 2$. Finiteness of $\alpha$ keeps every block entropy finite, so the sequence $(a_n)$ is a genuine non-negative real sequence to which Fekete's lemma applies; for an infinite-entropy partition the same displayed inequality may collapse into comparisons involving $+\infty$ and no finite entropy rate is obtained. The theorem also has a limitation: it gives only subadditive upper control, not linear growth, independence, or an explicit value of the limiting average. In highly dependent systems the inequality can be strict for many block lengths, while in periodic or deterministic examples the growth may be much smaller than $nH_\mu(\alpha)$. Its forward use is nevertheless decisive: Fekete's lemma gives the existence of $\lim_{n\to\infty} a_n/n$ in Chapter 2, where this limit is named the entropy rate of the partition. The first example returns to Bernoulli systems, where independence makes the subadditive bound an equality at every block length.
[example: Bernoulli Linear Partition Growth]
Let $X=\{0,1\}^{\mathbb Z}$ with Bernoulli product measure $\mu=(p\delta_1+(1-p)\delta_0)^{\mathbb Z}$, and let $\alpha$ be the partition according to the coordinate $x_0$. Write $p_1=p$ and $p_0=1-p$. Since $T^{-r}\alpha$ records the coordinate $x_r$, an atom of $\bigvee_{r=0}^{n-1}T^{-r}\alpha$ is determined by a word $w=(w_0,\dots,w_{n-1})\in\{0,1\}^n$:
\begin{align*}
A_w=\{x\in X:x_0=w_0,\dots,x_{n-1}=w_{n-1}\}.
\end{align*}
By the product definition of the Bernoulli measure,
\begin{align*}
\mu(A_w)=\prod_{r=0}^{n-1}p_{w_r}.
\end{align*}
Assume first that $0<p<1$, so every word atom has positive measure. By the definition of partition entropy,
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=-\sum_{w\in\{0,1\}^n}\mu(A_w)\log\mu(A_w).
\end{align*}
Substituting the value of $\mu(A_w)$ gives
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=-\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\log\left(\prod_{s=0}^{n-1}p_{w_s}\right).
\end{align*}
Since $\log\prod_{s=0}^{n-1}p_{w_s}=\sum_{r=0}^{n-1}\log p_{w_r}$, this becomes
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=-\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\sum_{r=0}^{n-1}\log p_{w_r}.
\end{align*}
Because all sums are finite, we may interchange the sum over $w$ and the sum over $r$:
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=-\sum_{r=0}^{n-1}\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\log p_{w_r}.
\end{align*}
Fix $r$. Splitting the inner sum according to the value $a=w_r$ gives
\begin{align*}
\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\log p_{w_r}=\sum_{a\in\{0,1\}}p_a\log p_a\sum_{(w_s)_{s\ne r}\in\{0,1\}^{n-1}}\prod_{s\ne r}p_{w_s}.
\end{align*}
The remaining sum factors over the coordinates other than $r$:
\begin{align*}
\sum_{(w_s)_{s\ne r}\in\{0,1\}^{n-1}}\prod_{s\ne r}p_{w_s}=\prod_{s\ne r}(p_0+p_1).
\end{align*}
Since $p_0+p_1=1$, we get
\begin{align*}
\sum_{(w_s)_{s\ne r}\in\{0,1\}^{n-1}}\prod_{s\ne r}p_{w_s}=1.
\end{align*}
Therefore
\begin{align*}
\sum_{w\in\{0,1\}^n}\left(\prod_{s=0}^{n-1}p_{w_s}\right)\log p_{w_r}=\sum_{a\in\{0,1\}}p_a\log p_a.
\end{align*}
Substituting this back into the entropy sum gives
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=-\sum_{r=0}^{n-1}\sum_{a\in\{0,1\}}p_a\log p_a.
\end{align*}
The summand is independent of $r$, so
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=n\left(-p_0\log p_0-p_1\log p_1\right).
\end{align*}
The one-coordinate partition $\alpha$ has atom masses $p_0$ and $p_1$, hence
\begin{align*}
H_\mu(\alpha)=-p_0\log p_0-p_1\log p_1.
\end{align*}
Thus
\begin{align*}
H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=nH_\mu(\alpha).
\end{align*}
If $p=0$ or $p=1$, the coordinate partition has one atom of measure $1$ after null atoms are discarded, and every finite block partition also has one atom of measure $1$; both sides of the same identity are $0$. Thus every additional Bernoulli coordinate contributes exactly one more copy of the one-coordinate entropy.
[/example]
Bernoulli shifts represent the maximal-growth model for independent finite names. At the opposite end, rotations on compact groups often create many repeated geometric constraints, so the number of distinguishable names can grow only polynomially. The two-interval coding of an irrational circle rotation gives the basic zero-growth example.
[example: Irrational Rotation with Zero Partition Growth Rate]
Let $T:x\mapsto x+\theta \pmod 1$ be an irrational rotation of the circle with Lebesgue measure $\mu$, and let $\alpha=\{[0,c),[c,1)\}$ be the two-interval partition determined by a cut point $c\in(0,1)$ in the half-open interval model of the circle. On the circle, the two arcs of $\alpha$ have endpoints $0$ and $c$. For $0\le r\le n-1$, the pullback partition $T^{-r}\alpha$ therefore has its endpoints at $T^{-r}(0)=-r\theta \pmod 1$ and $T^{-r}(c)=c-r\theta \pmod 1$. Hence the join
\begin{align*}
\bigvee_{r=0}^{n-1}T^{-r}\alpha
\end{align*}
is refined by the interval partition obtained by cutting the circle at the set
\begin{align*}
\{-r\theta \pmod 1,\ c-r\theta \pmod 1:0\le r\le n-1\}.
\end{align*}
This set has at most $2n$ points, and cutting the circle at $k\ge 1$ points leaves exactly $k$ arcs. Therefore the join has at most $2n$ non-null atoms.
Write
\begin{align*}
\beta_n=\bigvee_{r=0}^{n-1}T^{-r}\alpha.
\end{align*}
Since $\beta_n$ is a finite partition with at most $2n$ non-null atoms, the finite entropy bound gives
\begin{align*}
H_\mu(\beta_n)\le \log(2n).
\end{align*}
Thus
\begin{align*}
0\le \frac{1}{n}H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)
\le \frac{\log(2n)}{n}.
\end{align*}
To see that the right-hand side tends to $0$, write
\begin{align*}
\frac{\log(2n)}{n}=\frac{\log 2}{n}+\frac{\log n}{n}.
\end{align*}
Since $\log 2/n\to 0$ and
\begin{align*}
\frac{\log n}{n}\to 0
\end{align*}
as $n\to\infty$, we get
\begin{align*}
\frac{\log(2n)}{n}\to 0.
\end{align*}
By squeezing,
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}H_\mu\left(\bigvee_{r=0}^{n-1}T^{-r}\alpha\right)=0.
\end{align*}
Thus this two-interval observation of an irrational rotation has zero entropy rate: the number of orbit names grows at most linearly, so the average information per step vanishes.
[/example]
The contrast between Bernoulli shifts and irrational rotations is the guiding intuition for the next chapter. Entropy of a partition measures the asymptotic information per unit time in the names of orbit segments; Kolmogorov-Sinai entropy then takes the supremum over finite observations.
Once entropy has been defined for a single partition, the natural next question is how it behaves along orbits. This chapter turns that static quantity into a dynamical invariant by tracking names of orbit segments, taking refinements and joins into account, and then passing to the supremum over all finite observations.
# 2. Kolmogorov-Sinai Entropy
Entropy becomes a dynamical invariant when the information in a partition is measured along an orbit. Chapter 1 developed entropy for a single finite partition, including the identities for joins, refinements, conditional entropy, and subadditive estimates that will be used throughout this chapter. The goal here is to turn those static identities into Kolmogorov-Sinai entropy: an invariant measuring the maximal asymptotic information produced per unit time by a measure-preserving system. By the end of the chapter, the main course-level tools are the entropy-rate construction, the supremum over finite partitions, and the structural rules for factors, products, powers, and inverses.
The prerequisites are the basic measure-theoretic language of probability spaces, measurable partitions modulo null sets, measure-preserving transformations, and the Shannon entropy formalism from Chapter 1, especially joins, refinement monotonicity, pullback invariance, and subadditivity for iterated joins. The chapter is organised around a practical question that will recur later in the course: how can one compute or compare entropy without testing every finite partition from first principles?
Throughout, $(X, \mathcal B, \mu)$ is a probability space and $T:X\to X$ is a measure-preserving transformation. Partitions are measurable partitions, and unless stated otherwise they are finite.
## Entropy Rate Along an Orbit
A finite observation made once gives a partition $\mathcal P$ of phase space. The dynamical question is how much new information is gained by repeating the same observation at times $0,1,\dots,n-1$ along the orbit of $T$.
[definition: Dynamical Join]
Let $\mathcal P$ be a finite measurable partition of $X$. For $n\in\mathbb N$, define
\begin{align*}
\mathcal P_0^{n-1} := \bigvee_{j=0}^{n-1} T^{-j}\mathcal P.
\end{align*}
[/definition]
The atoms of $\mathcal P_0^{n-1}$ record the length-$n$ itinerary of a point through the atoms of $\mathcal P$. To extract a rate from these block entropies, we need a structural bound showing that the information in a long block is no larger than the information in consecutive shorter blocks.
[quotetheorem:1634]
[citeproof:1634]
Subadditivity is the mechanism that turns block entropy into an asymptotic rate. The finiteness of $\mathcal P$ is doing real work: it keeps every block entropy finite, so Fekete's lemma applies to a sequence in $[0,\infty)$ rather than to undefined expressions of the form $\infty-\infty$. Measure preservation is equally essential, because the proof replaces $H_\mu(T^{-m}\mathcal Q)$ by $H_\mu(\mathcal Q)$; if a map collapses most of a probability space into one atom, the entropy of pullbacks can decrease in a way that no longer gives the same time-homogeneous bound. The theorem does not say that the increments $a_{n+1}-a_n$ converge, nor that every finite observation has positive rate; it only supplies enough control to define the average rate below. This is the bridge from the static Shannon entropy of Chapter 1 to a dynamical invariant built from orbit names.
[definition: Entropy Rate of a Partition]
Let $T:X\to X$ be measure-preserving. The partition entropy-rate functional for $T$ is the map
\begin{align*}
h_\mu(T,-):\{\text{finite measurable partitions of }X\}\longrightarrow [0,\infty)
\end{align*}
defined on a finite measurable partition $\mathcal P$ by
\begin{align*}
h_\mu(T,\mathcal P):=\lim_{n\to\infty}\frac{1}{n}H_\mu\left(\mathcal P_0^{n-1}\right).
\end{align*}
[/definition]
Fekete's lemma applied to the previous theorem gives existence of the limit and the formula
\begin{align*}
h_\mu(T,\mathcal P)=\inf_{n\ge 1}\frac{1}{n}H_\mu\left(\mathcal P_0^{n-1}\right).
\end{align*}
This number is the average information per unit time extracted by the observation $\mathcal P$.
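The passage from subadditivity to a rate can be watched on a system where block entropies are computable by enumeration. The sketch below assumes a hypothetical stationary two-state Markov measure (`P`, `pi`) with the state partition as the observation; this model is an illustration and not one fixed by the text at this point. It checks $a_{m+n}\le a_m+a_n$ for $a_n=H_\mu(\mathcal P_0^{n-1})$ and compares the averages $a_n/n$ with the value $-\sum_{a,b}\pi_aP_{ab}\log P_{ab}$ that they approach.

```python
import itertools
import math

# Hypothetical stationary two-state Markov measure; pi solves pi P = pi.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5 / 6, 1 / 6]

def block_entropy(n):
    """a_n: entropy of the length-n name partition, by enumerating all words."""
    total = 0.0
    for word in itertools.product(range(2), repeat=n):
        mass = pi[word[0]]
        for a, b in zip(word, word[1:]):
            mass *= P[a][b]
        if mass > 0:
            total -= mass * math.log(mass)
    return total

N = 10
a = {n: block_entropy(n) for n in range(1, N + 1)}
averages = [a[n] / n for n in range(1, N + 1)]

# Subadditivity a_{m+n} <= a_m + a_n for all pairs with m + n <= N.
subadditive = all(
    a[m + n] <= a[m] + a[n] + 1e-12
    for m in range(1, N)
    for n in range(1, N - m + 1)
)

# Limiting value for this Markov model, used only for comparison.
rate = -sum(pi[i] * P[i][j] * math.log(P[i][j]) for i in range(2) for j in range(2))

print(subadditive, averages, rate)
```

In this model the averages decrease towards the rate, so the infimum in the displayed formula is approached along the whole sequence rather than attained at a finite $n$.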
[example: Bernoulli Shift Partition Entropy Rate]
Let $A$ be a finite alphabet with probability vector $(p_a)_{a\in A}$, let $X=A^{\mathbb Z}$ with product measure $\mu=p^{\mathbb Z}$, and let $\sigma:X\to X$ be the shift. Write $\mathcal P=\{P_a:a\in A\}$ for the coordinate partition, where
\begin{align*}
P_a=\{x\in A^{\mathbb Z}:x_0=a\}.
\end{align*}
For a word $w=(a_0,\dots,a_{n-1})\in A^n$, the corresponding atom of $\mathcal P_0^{n-1}=\bigvee_{j=0}^{n-1}\sigma^{-j}\mathcal P$ is
\begin{align*}
C_w=P_{a_0}\cap \sigma^{-1}P_{a_1}\cap \cdots \cap \sigma^{-(n-1)}P_{a_{n-1}}.
\end{align*}
Equivalently,
\begin{align*}
C_w=\{x\in A^{\mathbb Z}:x_0=a_0,\ x_1=a_1,\ \dots,\ x_{n-1}=a_{n-1}\}.
\end{align*}
By the definition of the product measure,
\begin{align*}
\mu(C_w)=p_{a_0}p_{a_1}\cdots p_{a_{n-1}}=\prod_{j=0}^{n-1}p_{a_j}.
\end{align*}
Using the convention $0\log 0=0$, the Shannon entropy of the block partition is
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=-\sum_{w\in A^n}\mu(C_w)\log \mu(C_w).
\end{align*}
Substituting the product formula for $\mu(C_w)$ gives
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=-\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{j=0}^{n-1}p_{a_j}\right)\log\left(\prod_{j=0}^{n-1}p_{a_j}\right).
\end{align*}
Since $\log\left(\prod_{j=0}^{n-1}p_{a_j}\right)=\sum_{j=0}^{n-1}\log p_{a_j}$ on the positive-probability terms, this becomes
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=-\sum_{j=0}^{n-1}\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{i=0}^{n-1}p_{a_i}\right)\log p_{a_j}.
\end{align*}
For a fixed $j$, separate the $a_j$ coordinate from the other coordinates:
\begin{align*}
\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{i=0}^{n-1}p_{a_i}\right)\log p_{a_j}=\sum_{a_j\in A}p_{a_j}\log p_{a_j}\prod_{i\ne j}\left(\sum_{a_i\in A}p_{a_i}\right).
\end{align*}
Because $\sum_{a_i\in A}p_{a_i}=1$ for each $i\ne j$, the fixed-$j$ sum is
\begin{align*}
\sum_{(a_0,\dots,a_{n-1})\in A^n}\left(\prod_{i=0}^{n-1}p_{a_i}\right)\log p_{a_j}=\sum_{a\in A}p_a\log p_a.
\end{align*}
Therefore
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=-\sum_{j=0}^{n-1}\sum_{a\in A}p_a\log p_a.
\end{align*}
Hence
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=n\left(-\sum_{a\in A}p_a\log p_a\right).
\end{align*}
Dividing by $n$ and taking the limit in the definition of entropy rate gives
\begin{align*}
h_\mu(\sigma,\mathcal P)=\lim_{n\to\infty}\frac{1}{n}H_\mu\left(\mathcal P_0^{n-1}\right)=-\sum_{a\in A}p_a\log p_a.
\end{align*}
Thus the coordinate observation extracts exactly one independent symbol's Shannon entropy at each time step.
[/example]
The Bernoulli example shows the intended calibration: independent fresh symbols contribute their Shannon entropy each time step. At the opposite extreme, deterministic rotations produce partitions whose refined orbit names grow too slowly to create positive entropy.
[example: Kronecker Rotation Has Zero Rate For Interval Partitions]
Let $X=\mathbb T=\mathbb R/\mathbb Z$ with Lebesgue measure, let $T(x)=x+\alpha$, and let $\mathcal P$ be a finite partition of $\mathbb T$ into intervals. Let $B$ be the finite set of endpoints of the intervals in $\mathcal P$, and put $C=\max(1,|B|)$. For each $j\ge 0$, the pullback partition $T^{-j}\mathcal P$ has endpoints
\begin{align*}
T^{-j}B=\{b-j\alpha \pmod 1:b\in B\}.
\end{align*}
Thus $\mathcal P_0^{n-1}$ is obtained by cutting the circle at the points in
\begin{align*}
\bigcup_{j=0}^{n-1}T^{-j}B.
\end{align*}
This union has at most
\begin{align*}
\sum_{j=0}^{n-1}|T^{-j}B|=n|B|\le Cn
\end{align*}
points when $B$ is nonempty, and in the trivial case $B=\varnothing$ the partition has one atom, also bounded by $Cn$ for $n\ge 1$. Therefore $\mathcal P_0^{n-1}$ has at most $Cn$ interval atoms.
Write the atom measures of $\mathcal P_0^{n-1}$ as $q_1,\dots,q_M$, with $M\le Cn$ and $\sum_{i=1}^M q_i=1$. Let $K=\{i:q_i>0\}$. Using the convention $0\log 0=0$, the entropy is
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=\sum_{i\in K}q_i\log\left(\frac{1}{q_i}\right).
\end{align*}
Since $\log$ is concave, Jensen's inequality gives
\begin{align*}
\sum_{i\in K}q_i\log\left(\frac{1}{q_i}\right)\le \log\left(\sum_{i\in K}q_i\frac{1}{q_i}\right).
\end{align*}
The expression inside the logarithm is
\begin{align*}
\sum_{i\in K}q_i\frac{1}{q_i}=\sum_{i\in K}1=|K|.
\end{align*}
Hence
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)\le \log |K|\le \log M\le \log(Cn).
\end{align*}
Dividing by $n$ gives
\begin{align*}
0\le \frac{1}{n}H_\mu\left(\mathcal P_0^{n-1}\right)\le \frac{\log(Cn)}{n}.
\end{align*}
Since
\begin{align*}
\frac{\log(Cn)}{n}=\frac{\log C}{n}+\frac{\log n}{n}
\end{align*}
and both terms tend to $0$, the [squeeze theorem](/theorems/627) gives
\begin{align*}
h_\mu(T,\mathcal P)=\lim_{n\to\infty}\frac{1}{n}H_\mu\left(\mathcal P_0^{n-1}\right)=0.
\end{align*}
Thus a finite interval observation of a circle rotation creates only linearly many orbit-name atoms, so its entropy rate is zero.
[/example]
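The linear atom count and the bound $H_\mu(\mathcal P_0^{n-1})\le\log(Cn)$ can be observed numerically. The sketch below is an illustration in floating-point arithmetic, assuming at least one endpoint; the rotation number, the endpoints, and all function names are hypothetical choices. It cuts the circle at the points $b-j\alpha$, labels each resulting arc by the itinerary of its midpoint, and sums arc lengths with the same itinerary.

```python
import math
from collections import defaultdict


def rotation_join_entropy(alpha, endpoints, n):
    """Return (entropy of P_0^{n-1}, number of cut points) for x -> x + alpha mod 1.

    The partition P consists of the arcs between consecutive endpoints.
    """
    base = sorted(b % 1.0 for b in endpoints)
    # Cut points of the join: b - j*alpha (mod 1) for b in B and 0 <= j < n.
    cuts = sorted({(b - j * alpha) % 1.0 for b in base for j in range(n)})

    def atom_index(x):
        # Number of endpoints <= x, reduced mod |B| so the wrap-around arc gets index 0.
        return sum(1 for b in base if b <= x) % len(base)

    masses = defaultdict(float)
    for i, left in enumerate(cuts):
        right = cuts[(i + 1) % len(cuts)]
        length = (right - left) % 1.0 or 1.0  # a single cut leaves the whole circle
        mid = (left + length / 2) % 1.0
        # The itinerary is constant on each arc, so the midpoint determines it.
        name = tuple(atom_index((mid + j * alpha) % 1.0) for j in range(n))
        masses[name] += length
    entropy = -sum(q * math.log(q) for q in masses.values() if q > 0)
    return entropy, len(cuts)


alpha = (math.sqrt(5) - 1) / 2  # hypothetical irrational rotation number
for n in (10, 50, 200):
    H, k = rotation_join_entropy(alpha, [0.0, 0.5], n)
    print(n, k, H / n, math.log(2 * n) / n)
```

Grouping arcs by itinerary matters because an atom of the join can consist of more than one arc; the bound in the example only needs the count of arcs, which dominates the count of atoms.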
## Supremum Over Finite Partitions
A single partition represents a chosen measuring device. To define an invariant of the system rather than of a particular observation, we ask for the largest entropy rate obtainable from any finite observation.
[definition: Kolmogorov-Sinai Entropy]
For the probability space $(X,\mathcal B,\mu)$, the Kolmogorov-Sinai entropy functional is the map
\begin{align*}
h_\mu:\{\text{measure-preserving transformations }T:X\to X\}\longrightarrow [0,\infty]
\end{align*}
defined by
\begin{align*}
h_\mu(T):=\sup_{\mathcal P} h_\mu(T,\mathcal P),
\end{align*}
where the supremum is taken over all finite measurable partitions $\mathcal P$ of $X$.
[/definition]
The definition packages all finite observations into a single number, but it leaves a practical comparison problem: replacing an observation by a finer one should not reduce the measured information rate. This motivates the refinement monotonicity theorem, which is the basic tool for approximating $h_\mu(T)$ by increasingly detailed partitions.
[quotetheorem:6747]
[citeproof:6747]
This theorem is often the first tool in computations: find a sequence of finite partitions that captures more of the measurable structure, then compute or bound their rates. Its hypotheses prevent two common mistakes. The refinement relation must hold modulo null sets at the level of measurable partitions; if two partitions merely have the same number of atoms, neither entropy rate need dominate the other. Pullback by any measurable map preserves the refinement relation itself, while measure preservation is used in the surrounding entropy-rate framework to keep pulled-back entropies invariant along the orbit. Finiteness keeps the entropy comparison within the Shannon entropy identities from Chapter 1. The theorem also has a sharp limitation: a finer partition can have the same rate as a coarser one, as happens when the extra atoms record information already determined by previous orbit coordinates.
Combining the definition of $h_\mu(T,\mathcal P)$ with the supremum over finite observations gives the working formula
\begin{align*}
h_\mu(T)
=\sup_{\mathcal P}\lim_{n\to\infty}\frac{1}{n}
H_\mu\left(\bigvee_{j=0}^{n-1}T^{-j}\mathcal P\right),
\end{align*}
where the supremum is taken over finite measurable partitions $\mathcal P$ of $X$.
The formula is short, but it is the central object of the chapter. It says that entropy is the best possible asymptotic growth rate of finite measurable names. The word finite is not cosmetic: for countable partitions the Shannon entropy may be infinite, and additional integrability assumptions are needed before the same limiting expression has useful meaning. Measure preservation is the reason the block entropy sequence has a stationary form; without it, the observation made at time $j$ may have a different marginal distribution, so the displayed expression no longer defines the usual invariant of a single measure-preserving system. The formula is also not a computation method by itself, because the supremum may be hard to locate and need not be attained by an arbitrary convenient partition. The rest of the chapter therefore studies structural rules that let us compute or compare entropy without resolving the supremum from first principles every time.
[example: Doubling Map On The Circle]
Let $T:\mathbb T\to\mathbb T$ be $T(x)=2x \pmod 1$ with Lebesgue measure $\mu$, and write
\begin{align*}
\mathcal P=\{P_0,P_1\},\quad P_0=[0,1/2),\quad P_1=[1/2,1).
\end{align*}
For $j\ge 0$, the condition $x\in T^{-j}P_0$ means $2^j x \pmod 1\in [0,1/2)$, so $T^{-j}\mathcal P$ cuts the circle at the points $m/2^{j+1}$ for $m=0,1,\dots,2^{j+1}-1$. Hence the common refinement
\begin{align*}
\mathcal P_0^{n-1}=\bigvee_{j=0}^{n-1}T^{-j}\mathcal P
\end{align*}
is the partition, modulo endpoints, into the dyadic intervals
\begin{align*}
I_m=\left[\frac{m}{2^n},\frac{m+1}{2^n}\right),\quad m=0,1,\dots,2^n-1.
\end{align*}
Each interval has measure
\begin{align*}
\mu(I_m)=\frac{m+1}{2^n}-\frac{m}{2^n}=2^{-n}.
\end{align*}
Using the Shannon entropy formula for a finite partition,
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=-\sum_{m=0}^{2^n-1}\mu(I_m)\log\mu(I_m).
\end{align*}
Substituting $\mu(I_m)=2^{-n}$ gives
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=-\sum_{m=0}^{2^n-1}2^{-n}\log(2^{-n}).
\end{align*}
Since $\log(2^{-n})=-n\log 2$, this is
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=\sum_{m=0}^{2^n-1}2^{-n}n\log 2.
\end{align*}
There are $2^n$ summands, so
\begin{align*}
H_\mu\left(\mathcal P_0^{n-1}\right)=2^n\cdot 2^{-n}\cdot n\log 2=n\log 2.
\end{align*}
Dividing by $n$ and taking the entropy-rate limit gives
\begin{align*}
h_\mu(T,\mathcal P)=\lim_{n\to\infty}\frac{1}{n}H_\mu\left(\mathcal P_0^{n-1}\right)=\log 2.
\end{align*}
The only ambiguity in the binary itinerary occurs at dyadic endpoints, and the set of all dyadic endpoints is countable, hence has Lebesgue measure $0$. Outside this null set, the joins $\mathcal P_0^{n-1}$ separate points by their first $n$ binary digits, and the dyadic intervals generate the Borel sigma-algebra of $\mathbb T$. Thus $\mathcal P$ is a one-sided generating partition modulo null sets. By the *Kolmogorov-Sinai Generator Theorem*, which is established in Chapter 3,
\begin{align*}
h_\mu(T)=h_\mu(T,\mathcal P)=\log 2.
\end{align*}
Thus the doubling map produces one binary digit of information per iterate.
[/example]
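The count of atoms and the value $n\log 2$ can be confirmed with exact rational arithmetic. This is a minimal sketch under the assumption, justified in the example, that the itinerary is constant on each dyadic interval of length $2^{-n}$, so one sample point per interval suffices; the function names are hypothetical.

```python
import math
from collections import Counter
from fractions import Fraction


def doubling_itinerary(x, n):
    """First n symbols of x under T(x) = 2x mod 1 with P_0=[0,1/2), P_1=[1/2,1)."""
    name = []
    for _ in range(n):
        name.append(0 if x < Fraction(1, 2) else 1)
        x = (2 * x) % 1
    return tuple(name)


def doubling_join_entropy(n):
    """Return (entropy of P_0^{n-1}, number of distinct itineraries)."""
    cell = Fraction(1, 2 ** n)
    masses = Counter()
    for m in range(2 ** n):
        # Midpoint of the dyadic interval [m/2^n, (m+1)/2^n), never an endpoint.
        midpoint = (m + Fraction(1, 2)) * cell
        masses[doubling_itinerary(midpoint, n)] += cell
    entropy = -sum(float(q) * math.log(float(q)) for q in masses.values())
    return entropy, len(masses)


for n in range(1, 7):
    H, atoms = doubling_join_entropy(n)
    print(n, atoms, H, n * math.log(2))
```

Using `Fraction` avoids the loss of binary digits that repeated doubling causes in floating-point arithmetic.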
For the doubling map, entropy counts the exponential growth of distinguishable inverse branches. Compact group rotations present the opposite tension: their orbits may be equidistributed and ergodic, but the motion is rigid translation rather than branching. The relevant question is whether any finite measurable observation of such a rigid system can produce exponentially many statistically significant orbit names.
[quotetheorem:6750]
[citeproof:6750]
This result separates entropy from ergodicity and mixing phenomena. An irrational circle rotation is ergodic, but its orbit names do not branch exponentially. The compact group rotation hypothesis is crucial because translations preserve a rigid geometric and spectral structure: boundaries of regular test partitions do not proliferate exponentially under iteration. Outside the rotation or discrete-spectrum setting the conclusion can fail; the doubling map on the circle preserves Lebesgue measure but has entropy $\log 2$, because its inverse branches create exponentially many distinguishable names. The theorem also does not imply that every zero-entropy system is a rotation or has discrete spectrum, since many zero-entropy systems are weakly mixing or have more complicated measurable structure. Its role here is to provide the main low-complexity model against which Bernoulli and expanding examples can be contrasted.
## Behaviour Under Factors, Products, Powers, And Inverses
Once Kolmogorov-Sinai entropy is defined, the next question is whether it behaves like a dynamical invariant should. Factors should not create information, products should add independent information, powers should rescale time, and invertible systems should have the same entropy forward and backward.
[definition: Factor Map]
Let $(X,\mathcal B_X,\mu,T)$ and $(Y,\mathcal B_Y,\nu,S)$ be measure-preserving systems. A factor map from $X$ to $Y$ is a measurable map $\pi:X\to Y$ such that $\nu=\mu\circ\pi^{-1}$ and
\begin{align*}
\pi\circ T = S\circ \pi
\end{align*}
$\mu$-a.e.
[/definition]
A factor is a coarser system obtained by forgetting some measurable information while respecting the dynamics. The entropy question is whether this loss of information can ever increase the best possible asymptotic observation rate.
[quotetheorem:6753]
[citeproof:6753]
This theorem makes entropy an obstruction to factor maps. A system of smaller entropy may be a factor of one with larger entropy, but a system of strictly larger entropy can never be a factor of one with smaller entropy. The pushforward condition $\nu=\mu\circ\pi^{-1}$ is needed so that pulling back a partition preserves atom measures and hence Shannon entropy; a measurable map that intertwines the point maps but changes the measure can distort all entropy comparisons. The a.e. intertwining relation is also essential, because it identifies the $S$-names of $\pi(x)$ with the pulled-back $T$-names of $x$ along the whole orbit. A concrete failure occurs if $\pi$ intertwines the dynamics but does not push $\mu$ forward to $\nu$: then the pullback block entropies need not equal the factor block entropies, so the proof has no comparison to take to rates. The theorem is only a necessary obstruction, not a classification of factors; equal entropy does not by itself produce a factor map.
[example: Bernoulli Factor Entropy Obstruction]
Let $\sigma_p$ and $\sigma_q$ be Bernoulli shifts over finite alphabets $A$ and $B$, with probability vectors $p=(p_a)_{a\in A}$ and $q=(q_b)_{b\in B}$. From the coordinate-partition computation for a Bernoulli shift, their Kolmogorov-Sinai entropies are
\begin{align*}
h(\sigma_p)=-\sum_{a\in A}p_a\log p_a,\qquad
h(\sigma_q)=-\sum_{b\in B}q_b\log q_b,
\end{align*}
using the convention $0\log 0=0$.
Suppose that $\sigma_q$ is a factor of $\sigma_p$. By *Entropy Monotonicity Under Factors*, factor maps cannot increase Kolmogorov-Sinai entropy, so
\begin{align*}
h(\sigma_q)\le h(\sigma_p).
\end{align*}
Substituting the Bernoulli entropy formulas gives
\begin{align*}
-\sum_{b\in B}q_b\log q_b
\le
-\sum_{a\in A}p_a\log p_a.
\end{align*}
Therefore, if
\begin{align*}
-\sum_{b\in B}q_b\log q_b
>
-\sum_{a\in A}p_a\log p_a,
\end{align*}
then the required factor inequality is violated. Hence $\sigma_q$ cannot be a factor of $\sigma_p$ whenever $h(\sigma_q)>h(\sigma_p)$.
[/example]
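The obstruction reduces to comparing two Shannon entropies, which is easy to automate. The sketch below is illustrative only; the function names and the tolerance `tol` are hypothetical. A return value of `False` means the entropy test is inconclusive, not that a factor map exists.

```python
import math


def bernoulli_entropy(weights):
    """Entropy of the Bernoulli shift with the given probability vector."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def factor_ruled_out_by_entropy(p, q, tol=1e-12):
    """True if h(sigma_q) > h(sigma_p), so sigma_q cannot be a factor of sigma_p."""
    return bernoulli_entropy(q) > bernoulli_entropy(p) + tol


fair_coin = [0.5, 0.5]
fair_three = [1 / 3, 1 / 3, 1 / 3]
# log 3 > log 2, so the 3-symbol shift is not a factor of the 2-symbol shift.
print(factor_ruled_out_by_entropy(fair_coin, fair_three))
# The reverse comparison gives no obstruction from entropy alone.
print(factor_ruled_out_by_entropy(fair_three, fair_coin))
```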
Factors compare systems by forgetting information. Products ask for the complementary operation: two systems are run side by side with product measure, so a name for the product contains one name from each coordinate. The possible obstruction is correlation or hidden redundancy between the two coordinates; under the product measure that redundancy is absent, and the entropy rate should be the sum of the two separate rates.
[quotetheorem:6754]
[citeproof:6754]
Thus independent dynamical sources add their information rates. The product measure assumption is where independence enters: for rectangle partitions, atom measures factor, and Shannon entropy of the product partition splits as a sum. With a non-product joining of the same two systems, correlations can reduce the information carried by the pair, so the product formula is not a statement about arbitrary invariant measures on $X\times Y$. The countable-generation hypothesis is what supports the upper-bound approximation step: arbitrary finite measurable partitions of the product can be approximated in measure by finite Boolean combinations of measurable rectangles, and entropy continuity then lets this approximation pass to rates. The formula does not identify which product partition attains the supremum, and in non-generating coordinates it may give only a lower bound until the approximation step is used. The next operation changes the time scale rather than the phase space: replacing $T$ by $T^k$ observes the same orbits only at every $k$-th step.
[quotetheorem:6756]
[citeproof:6756]
The factor $k$ reflects a change of clock: one step of $T^k$ is $k$ steps of $T$. The restriction $k\in\mathbb N$ excludes $k=0$, where $T^0=\operatorname{id}_X$ has entropy $0$: the formula then holds only trivially when $h_\mu(T)<\infty$ and becomes the indeterminate product $0\cdot\infty$ when $h_\mu(T)=\infty$. Measure preservation remains part of the statement because the proof compares block entropies after applying iterates of $T$; without invariant measure, the entropy of a pulled-back partition need not match the entropy of the original partition. The formula also concerns integer time changes only, not inducing on a subset or passing to a suspension flow, where Abramov-type formulas require return times and extra hypotheses. This raises the final structural question in the chapter: if $T$ is invertible, does reading the same orbit in reverse time change the asymptotic information rate? The block joins have the same atoms after a global shift, which motivates the inverse-invariance theorem.
[quotetheorem:6758]
[citeproof:6758]
Invertibility is the essential new hypothesis in this theorem. It gives an actual transformation $T^{-1}:X\to X$ and lets the proof shift a future-looking join back to a past-looking join without changing atom measures. For a noninvertible map there is no inverse transformation on the same probability space to compare with; choosing a natural extension is a different construction, not an inverse inside the original system. Measure preservation again prevents the global shift from changing entropy, so the statement is not a property of arbitrary bijections of a measured space. The theorem says that entropy has no preferred orientation on two-sided measure-preserving orbits, while leaving genuinely one-sided systems to be handled through factors or natural extensions.
Together these structural laws make Kolmogorov-Sinai entropy a robust conjugacy invariant. It detects exponential information production, respects passage to coarser systems, and normalises correctly under changes of time scale.
With Kolmogorov-Sinai entropy in hand, the focus shifts from definition to computation. The next chapter uses the structural identities from the previous chapters to identify generators and to show how a well-chosen symbolic model can replace a supremum by a single entropy-rate calculation.
# 3. Generators and Entropy Computation
This chapter turns the definition of Kolmogorov-Sinai entropy into a usable computational tool. It assumes the measure-theoretic entropy of finite partitions from Chapter 2, together with the conditional entropy and join identities from Chapter 1 and the basic language of measure-preserving systems and factors. In Chapter 2, entropy was defined as a supremum over finite measurable partitions, which is conceptually natural but hard to evaluate directly. The main question here is when a single partition contains enough orbit information to recover the whole system, so that its entropy rate already equals the system entropy. Symbolic codings provide the bridge: they convert a measure-preserving transformation into a shift system whose coordinates record the names of partition atoms along an orbit. This is also where ergodic theory meets information theory most directly: a generator plays the role of an efficient observable, and its entropy rate measures the average information needed to describe long orbit names.
## Generating Partitions and Symbolic Codings
How can a measurable dynamical system be read from the sequence of partition elements visited by a typical orbit? A partition records finite information at one time; the iterates of the partition record information at all integer times. If these iterated observations separate all measurable events up to null sets, then the partition acts as a coordinate system for the dynamics.
[definition: Iterated Join of a Partition]
Let $(X,\mathcal A,\mu)$ be a probability space, let $T:X\to X$ be measure-preserving, and let $\mathcal P$ be a finite or countable measurable partition of $X$. For integers $m\le n$, with $m\ge 0$ unless $T$ is invertible, the iterated join of $\mathcal P$ from time $m$ to time $n$ is
\begin{align*}
\mathcal P_m^n := \bigvee_{k=m}^{n} T^{-k}\mathcal P.
\end{align*}
[/definition]
The atoms of $\mathcal P_m^n$ are cylinder-like sets: membership in such an atom specifies the $\mathcal P$-name of a point during the time interval $m,\dots,n$. This notation measures the information obtained from a bounded observation window. The next question is when increasing these windows recovers every measurable event, which leads to the notion of a generator.
[definition: Generating Partition]
Let $(X,\mathcal A,\mu,T)$ be an invertible measure-preserving system, and let $\mathcal P$ be a finite or countable measurable partition. The partition $\mathcal P$ is a generator if
\begin{align*}
\sigma\left(\bigcup_{n=1}^{\infty} \mathcal P_{-n}^{n}\right)=\mathcal A \quad \operatorname{mod}\mu.
\end{align*}
For a non-invertible system, $\mathcal P$ is a one-sided generator if
\begin{align*}
\sigma\left(\bigcup_{n=0}^{\infty} \mathcal P_0^n\right)=\mathcal A \quad \operatorname{mod}\mu.
\end{align*}
[/definition]
The phrase $\operatorname{mod}\mu$ means that every set in $\mathcal A$ differs from a set in the generated $\sigma$-algebra by a null set. Thus a generator need not distinguish exceptional points, but it must distinguish all events relevant to measure theory. To turn this recovery property into an actual model, we record the whole itinerary of a point as a sequence of symbols.
[definition: Symbolic Coding Map]
Let $\mathcal P=\{P_a:a\in A\}$ be a finite or countable measurable partition indexed by an alphabet $A$. For an invertible measure-preserving system $(X,\mathcal A,\mu,T)$, the symbolic coding map associated to $\mathcal P$ is the measurable map
\begin{align*}
\pi_{\mathcal P}:X&\to A^{\mathbb Z},
\end{align*}
where $(\pi_{\mathcal P}(x))_n=a$ when $T^n x\in P_a$. For a non-invertible system, the coding map is defined with target $A^{\mathbb N_0}$ by the same formula for $n\ge 0$.
[/definition]
The coding map intertwines the original dynamics with the shift on sequences. If $S$ denotes the left shift on $A^{\mathbb Z}$ or $A^{\mathbb N_0}$, then $\pi_{\mathcal P}\circ T=S\circ \pi_{\mathcal P}$ wherever the partition name is defined. This motivates the following theorem: under the generator hypothesis, the symbolic process is not merely a factor but a faithful measurable model.
[quotetheorem:6760]
[citeproof:6760]
The hypotheses are doing real work. Countability ensures that the sequence space has its usual product measurable structure and that cylinder events form a manageable coding language. Generation is the condition that prevents loss of information: if $\mathcal P=\{X\}$, then every point has the same name, so the coding collapses any system with genuine measurable structure to a one-point factor. Invertibility explains the two-sided names; for a non-invertible map, negative coordinates are not available from the forward dynamics, so the correct statement uses one-sided names.
The theorem also has a precise limitation. It identifies measurable information modulo null sets; it does not say that every individual point is recovered from its symbolic itinerary. Boundary points of partitions, such as dyadic rationals for binary expansions, may have ambiguous names without affecting the measure-theoretic model.
This theorem is the conceptual reason generators compute entropy. Once the system is represented as a shift process, entropy becomes the asymptotic information per symbol in its names.
[example: Binary Expansion for the Doubling Map]
Let $X=[0,1)$ with Lebesgue measure, let $T(x)=2x \pmod 1$, and write $P_0=[0,1/2)$ and $P_1=[1/2,1)$. If $x$ is not dyadic, then $x$ has a unique binary expansion
\begin{align*}
x=\sum_{r=1}^{\infty} b_r2^{-r},\qquad b_r\in\{0,1\}.
\end{align*}
For each $k\ge 0$,
\begin{align*}
T^k x=2^k x \pmod 1.
\end{align*}
Multiplying the binary expansion by $2^k$ gives
\begin{align*}
2^k x=\sum_{r=1}^{k} b_r2^{k-r}+\sum_{r=k+1}^{\infty} b_r2^{k-r}.
\end{align*}
The first sum is an integer, so reducing modulo $1$ removes it:
\begin{align*}
T^k x=\sum_{r=k+1}^{\infty} b_r2^{k-r}.
\end{align*}
With $s=r-k$, this becomes
\begin{align*}
T^k x=\sum_{s=1}^{\infty} b_{k+s}2^{-s}.
\end{align*}
Therefore $T^k x\in P_0$ exactly when $b_{k+1}=0$, and $T^k x\in P_1$ exactly when $b_{k+1}=1$. Thus the one-sided itinerary of $x$ records its binary digits in order, except on the countable null set
\begin{align*}
\left\{\frac{m}{2^r}:r\ge 0,\ 0\le m<2^r\right\},
\end{align*}
where a point lies on an endpoint of some dyadic interval and has two binary expansions.
For a word $a_0,\dots,a_{n-1}\in\{0,1\}$, the corresponding atom of $\mathcal P_0^{n-1}$ is
\begin{align*}
\bigcap_{k=0}^{n-1}T^{-k}P_{a_k}.
\end{align*}
The condition $T^k x\in P_{a_k}$ says that the $(k+1)$-st binary digit of $x$ is $a_k$, so the first $n$ binary digits are fixed. Hence the atom is the half-open dyadic interval
\begin{align*}
\left[\sum_{k=0}^{n-1}a_k2^{-(k+1)},\ \sum_{k=0}^{n-1}a_k2^{-(k+1)}+2^{-n}\right),
\end{align*}
up to endpoint conventions. Consequently $\mathcal P_0^{n-1}$ consists of the $2^n$ half-open dyadic intervals of length $2^{-n}$. These intervals generate the Borel $\sigma$-algebra on $[0,1)$, and changing endpoint conventions affects only the dyadic rationals, which have Lebesgue measure $0$. Therefore $\mathcal P$ is a one-sided generator. Moreover, each length-$n$ cylinder has measure $2^{-n}=(1/2)^n$, so the coded symbols are independent and each symbol has probability $1/2$; the coding is the one-sided Bernoulli shift with weights $(1/2,1/2)$.
[/example]
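The correspondence between itineraries and binary digits, and the intertwining of $T$ with the shift, can be checked on a non-dyadic rational point using exact arithmetic. This is a small illustrative sketch; the sample point $5/7$ and the function names are hypothetical choices.

```python
from fractions import Fraction


def doubling_map(x):
    """T(x) = 2x mod 1 on [0, 1)."""
    return (2 * x) % 1


def itinerary(x, n):
    """Symbols a_0..a_{n-1} with a_k = 0 if T^k x < 1/2 and a_k = 1 otherwise."""
    name = []
    for _ in range(n):
        name.append(0 if x < Fraction(1, 2) else 1)
        x = doubling_map(x)
    return name


def binary_digits(x, n):
    """Digits b_1..b_n of x in [0, 1), computed as floor(2^r x) mod 2."""
    return [int(x * 2 ** r) % 2 for r in range(1, n + 1)]


x = Fraction(5, 7)  # non-dyadic, with periodic expansion 0.101101...
print(itinerary(x, 9))
print(binary_digits(x, 9))
# Coding intertwines T with the left shift: the name of T(x) drops the first symbol.
print(itinerary(doubling_map(x), 8) == itinerary(x, 9)[1:])
```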
The example shows the simplest pattern: expansion in the dynamics produces finer and finer names. More complicated systems often need partitions whose atoms are shaped by stable and unstable directions rather than ordinary intervals.
## The Kolmogorov-Sinai Generator Theorem
The definition of $h_\mu(T)$ as a supremum over finite partitions raises a practical problem. If $\mathcal P$ is a generator, should the entropy of $T$ be found from $\mathcal P$ alone, or could another partition carry more entropy? The Kolmogorov-Sinai generator theorem answers that a finite-entropy generator already sees all entropy.
[definition: Finite-Entropy Partition]
Let $(X,\mathcal A,\mu)$ be a probability space. On the class of countable measurable partitions of $X$, the partition entropy functional is
\begin{align*}
H_\mu:\{\text{countable measurable partitions of }X\}\to [0,\infty],
\end{align*}
defined by
\begin{align*}
H_\mu(\mathcal P):=-\sum_{P\in\mathcal P}\mu(P)\log \mu(P),
\end{align*}
with the convention $0\log 0=0$. The partition $\mathcal P$ has finite entropy if $H_\mu(\mathcal P)<\infty$.
[/definition]
Finite entropy is the integrability condition needed to pass from finite partitions to countable symbolic names. Without it, a countable generator can encode the system with infinitely much one-step information, so the expression $H_\mu(\mathcal P)$ may already be infinite before any dynamical averaging begins. A different failure occurs when the partition is not generating: for the doubling map, the one-atom partition has entropy rate $0$ although the system has entropy $\log 2$.
These two failures isolate the exact computational problem left open by the definition of $h_\mu(T)$. The entropy of the system is a supremum over all finite measurable partitions, while a generator is a single observable whose iterates recover the measurable structure. The next theorem is needed to justify replacing the global supremum by the entropy rate of this one observable: it says that no other finite partition can reveal more asymptotic information once a finite-entropy generator is already available.
[quotetheorem:6726]
[citeproof:6726]
This theorem is robust because a generator does not merely name points; it approximates every finite observation of the system with bounded time windows. Boundary terms disappear because entropy is an asymptotic rate. The theorem is therefore a computation theorem, not an existence theorem: it does not construct a generator, and it does not assert that every countable generator has finite entropy.
The finite-entropy hypothesis cannot be ignored. For instance, take the doubling map and refine the binary partition by splitting one half into countably many measurable pieces, indexed by $n\ge 2$, with masses proportional to $1/(n(\log n)^2)$ after normalisation. The resulting countable partition is still a one-sided generator, because it refines the binary generator, but its one-step entropy is infinite. Thus a generator can carry the right orbit information while being useless as a finite entropy-rate computation. The generator hypothesis is equally essential: a non-generating partition only computes the entropy of the factor it sees, as the one-atom partition of the doubling map already showed.
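The behaviour of the weights $1/(n(\log n)^2)$ can be explored numerically. The sketch below is numerical evidence rather than a proof, and it takes the index to start at $n=2$, an assumption made here so that $\log n>0$. With unnormalised weights $w_n$ and total mass $Z$, the normalised entropy equals $\log Z+Z^{-1}\sum_n w_n\log(1/w_n)$, so it is enough to watch the two partial sums computed below: the first stabilises, while the second keeps growing, roughly like $\log\log N$.

```python
import math


def partial_sums(N):
    """Partial sums over 2 <= n <= N of w_n and of w_n * log(1/w_n),

    where w_n = 1 / (n (log n)^2) are the unnormalised weights.
    """
    mass = 0.0
    entropy_terms = 0.0
    for n in range(2, N + 1):
        w = 1.0 / (n * math.log(n) ** 2)
        mass += w
        entropy_terms += w * math.log(1.0 / w)
    return mass, entropy_terms


for N in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5):
    mass, entropy_terms = partial_sums(N)
    print(N, mass, entropy_terms)
```

The growth is very slow, which is why the divergence is established by comparison with $\sum 1/(n\log n)$ rather than by computation.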
[example: Entropy of the Doubling Map from a Generator]
Let $T(x)=2x\pmod 1$ on $[0,1)$ and let $\mathcal P=\{P_0,P_1\}$, where $P_0=[0,1/2)$ and $P_1=[1/2,1)$. For a word $a_0,\dots,a_{n-1}\in\{0,1\}$, the atom of $\mathcal P_0^{n-1}$ with itinerary $a_0,\dots,a_{n-1}$ is
\begin{align*}
\bigcap_{k=0}^{n-1}T^{-k}P_{a_k}.
\end{align*}
By the binary expansion computation, this atom fixes the first $n$ binary digits of $x$, so it is the half-open interval
\begin{align*}
\left[\sum_{k=0}^{n-1}a_k2^{-(k+1)},\ \sum_{k=0}^{n-1}a_k2^{-(k+1)}+2^{-n}\right)
\end{align*}
up to dyadic endpoints. Hence $\mathcal P_0^{n-1}$ has $2^n$ atoms, each of Lebesgue measure $2^{-n}$.
Using the definition of partition entropy, the entropy of this join is
\begin{align*}
H_\mu(\mathcal P_0^{n-1})=-\sum_{A\in \mathcal P_0^{n-1}}\mu(A)\log\mu(A).
\end{align*}
Since every atom has measure $2^{-n}$ and there are $2^n$ atoms, this becomes
\begin{align*}
H_\mu(\mathcal P_0^{n-1})=-\sum_{j=1}^{2^n}2^{-n}\log(2^{-n}).
\end{align*}
The summand is constant in $j$, so
\begin{align*}
H_\mu(\mathcal P_0^{n-1})=-2^n\cdot 2^{-n}\log(2^{-n}).
\end{align*}
Because $2^n\cdot 2^{-n}=1$ and $\log(2^{-n})=-n\log 2$, we get
\begin{align*}
H_\mu(\mathcal P_0^{n-1})=n\log 2.
\end{align*}
By the definition of the entropy rate,
\begin{align*}
h_\mu(T,\mathcal P)=\lim_{n\to\infty}\frac{1}{n}H_\mu(\mathcal P_0^{n-1}).
\end{align*}
Substituting the computed value gives
\begin{align*}
h_\mu(T,\mathcal P)=\lim_{n\to\infty}\frac{1}{n}n\log 2.
\end{align*}
Thus
\begin{align*}
h_\mu(T,\mathcal P)=\log 2.
\end{align*}
Since $\mathcal P$ is a one-sided generator for the doubling map, the *Kolmogorov-Sinai Generator Theorem* gives
\begin{align*}
h_\mu(T)=h_\mu(T,\mathcal P)=\log 2.
\end{align*}
Thus the doubling map has exactly one fair binary digit of new information per iterate.
[/example]
Generator computations also apply beyond independent symbols. When the symbolic model has transition constraints, entropy is computed from the growth or stationary uncertainty of admissible words.
[example: Markov Shift from a Subshift of Finite Type]
Let $A$ be a finite alphabet and let $M$ be a zero-one transition matrix. The subshift of finite type
\begin{align*}
\Sigma_M:=\{x\in A^{\mathbb Z}: M_{x_nx_{n+1}}=1\text{ for all }n\in\mathbb Z\}
\end{align*}
is invariant under the left shift $S$, because $x\in\Sigma_M$ implies
\begin{align*}
M_{(Sx)_n(Sx)_{n+1}}=M_{x_{n+1}x_{n+2}}=1
\end{align*}
for every $n\in\mathbb Z$. Let $\nu$ be a stationary Markov measure with transition matrix $P$ compatible with $M$ and stationary distribution $\rho$, so $P_{ab}=0$ whenever $M_{ab}=0$ and
\begin{align*}
\sum_{a\in A}\rho_aP_{ab}=\rho_b
\end{align*}
for every $b\in A$.
Write $[a]=\{x\in\Sigma_M:x_0=a\}$ and $\mathcal P=\{[a]:a\in A\}$. The join $\bigvee_{k=-n}^{n}S^{-k}\mathcal P$ records exactly the coordinates $x_{-n},\dots,x_n$, so these finite coordinate cylinders generate the product $\sigma$-algebra on $\Sigma_M$. Thus $\mathcal P$ is a finite generator.
For a word $a_0,\dots,a_{n-1}$, the atom of $\mathcal P_0^{n-1}$ with that itinerary is
\begin{align*}
[a_0\dots a_{n-1}]=\{x\in\Sigma_M:x_0=a_0,\dots,x_{n-1}=a_{n-1}\}.
\end{align*}
By the definition of the stationary Markov measure,
\begin{align*}
\nu([a_0\dots a_{n-1}])=\rho_{a_0}P_{a_0a_1}P_{a_1a_2}\cdots P_{a_{n-2}a_{n-1}}.
\end{align*}
Let $W_n$ be the set of words of length $n$ with positive $\nu$-measure. Using the definition of partition entropy,
\begin{align*}
H_\nu(\mathcal P_0^{n-1})=-\sum_{(a_0,\dots,a_{n-1})\in W_n}\rho_{a_0}P_{a_0a_1}\cdots P_{a_{n-2}a_{n-1}}\log\!\left(\rho_{a_0}\prod_{k=0}^{n-2}P_{a_ka_{k+1}}\right).
\end{align*}
Since every factor in a word from $W_n$ is positive,
\begin{align*}
\log\!\left(\rho_{a_0}\prod_{k=0}^{n-2}P_{a_ka_{k+1}}\right)=\log\rho_{a_0}+\sum_{k=0}^{n-2}\log P_{a_ka_{k+1}}.
\end{align*}
Therefore the initial-symbol contribution is
\begin{align*}
-\sum_{(a_0,\dots,a_{n-1})\in W_n}\rho_{a_0}P_{a_0a_1}\cdots P_{a_{n-2}a_{n-1}}\log\rho_{a_0}=-\sum_{a\in A}\rho_a\log\rho_a,
\end{align*}
because, for fixed $a_0=a$, the total [conditional probability](/page/Conditional%20Probability) of all continuations $a_1,\dots,a_{n-1}$ is $1$.
For each fixed $k$ and each pair $a,b\in A$, stationarity gives
\begin{align*}
\nu(x_k=a,x_{k+1}=b)=\rho_aP_{ab}.
\end{align*}
Equivalently, summing the word probabilities over all words with $a_k=a$ and $a_{k+1}=b$ gives $\rho_aP_{ab}$. Hence the $k$-th transition contribution is
\begin{align*}
-\sum_{a,b\in A}\rho_aP_{ab}\log P_{ab},
\end{align*}
with the convention $0\log 0=0$. There are $n-1$ transition positions, so
\begin{align*}
H_\nu(\mathcal P_0^{n-1})=-\sum_{a\in A}\rho_a\log\rho_a-(n-1)\sum_{a,b\in A}\rho_aP_{ab}\log P_{ab}.
\end{align*}
Dividing by $n$ gives
\begin{align*}
\frac{1}{n}H_\nu(\mathcal P_0^{n-1})=-\frac{1}{n}\sum_{a\in A}\rho_a\log\rho_a-\frac{n-1}{n}\sum_{a,b\in A}\rho_aP_{ab}\log P_{ab}.
\end{align*}
Because $A$ is finite, $\sum_{a\in A}\rho_a\log\rho_a$ is finite, so the first term tends to $0$, while $(n-1)/n$ tends to $1$. Thus
\begin{align*}
h_\nu(S,\mathcal P)=-\sum_{a,b\in A}\rho_aP_{ab}\log P_{ab}.
\end{align*}
Since $\mathcal P$ is a finite generator, the *Kolmogorov-Sinai Generator Theorem* gives
\begin{align*}
h_\nu(S)=-\sum_{a,b\in A}\rho_aP_{ab}\log P_{ab}.
\end{align*}
The formula says that the new information per step is the stationary average of the uncertainty in the next symbol after the present symbol is known.
[/example]
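Both the block-entropy identity and the limiting formula can be verified by brute force on a small chain. The following sketch uses a hypothetical two-state transition matrix compatible with the golden-mean constraint (the transition $1\to 1$ is forbidden) and finds $\rho$ by power iteration, which is assumed to converge for this matrix; all names are illustrative.

```python
import itertools
import math


def stationary_distribution(P, iterations=10_000):
    """Approximate rho with rho P = rho by power iteration from the uniform vector."""
    k = len(P)
    rho = [1.0 / k] * k
    for _ in range(iterations):
        rho = [sum(rho[a] * P[a][b] for a in range(k)) for b in range(k)]
    return rho


def markov_entropy_rate(P, rho):
    """-sum_{a,b} rho_a P_ab log P_ab with the convention 0 log 0 = 0."""
    k = len(P)
    return -sum(rho[a] * P[a][b] * math.log(P[a][b])
                for a in range(k) for b in range(k) if P[a][b] > 0)


def block_entropy(P, rho, n):
    """H_nu(P_0^{n-1}) by enumerating every word of length n."""
    total = 0.0
    for word in itertools.product(range(len(P)), repeat=n):
        prob = rho[word[0]]
        for a, b in zip(word, word[1:]):
            prob *= P[a][b]
        if prob > 0:  # words outside W_n contribute nothing
            total -= prob * math.log(prob)
    return total


P = [[0.5, 0.5],
     [1.0, 0.0]]  # hypothetical stochastic matrix; 1 -> 1 has probability 0
rho = stationary_distribution(P)
h = markov_entropy_rate(P, rho)
H_rho = -sum(r * math.log(r) for r in rho if r > 0)
for n in range(1, 8):
    # Compare the enumerated block entropy with H(rho) + (n-1) h.
    print(n, block_entropy(P, rho, n), H_rho + (n - 1) * h)
```

For this matrix the stationary vector is $(2/3,1/3)$, so the rate is $\tfrac{2}{3}\log 2$.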
This example separates topological constraints from measure-theoretic randomness. The matrix $M$ lists the allowed transitions, and hence the admissible words, while the stochastic matrix $P$ assigns their probabilities.
## Rokhlin Towers and the Intuition Behind Finite Generators
Why should finite generators exist for systems with finite entropy? A finite partition seems to record only a bounded amount of information at each time, while an arbitrary probability space may have a very complicated measurable structure. Rokhlin towers give the geometric intuition: over long intervals of time, most of the space can be organised into columns along which a small amount of information is repeated and refined.
[definition: Rokhlin Tower]
Let $(X,\mathcal A,\mu,T)$ be an invertible measure-preserving system. A Rokhlin tower of height $N$ with base $B\in\mathcal A$ is the collection
\begin{align*}
B,TB,T^2B,\dots,T^{N-1}B
\end{align*}
of pairwise disjoint measurable sets. Its remainder is
\begin{align*}
X\setminus \bigcup_{j=0}^{N-1}T^jB.
\end{align*}
[/definition]
A tower turns orbit segments into vertical columns. If the remainder has small measure, most points spend a long block of time inside a controlled array, so finite labels placed on the tower levels can encode long orbit names efficiently. The next result says that such towers are always available in aperiodic measure-preserving systems.
[quotetheorem:6762]
[citeproof:6762]
The hypotheses are necessary. Aperiodicity rules out the obstruction of finite cycles: if $T=\operatorname{id}_X$ and $N\ge 2$, then $B$ and $TB$ are the same set, so a non-null tower of height $N$ cannot have disjoint levels. Invertibility is part of this standard tower formulation because the construction organises two-sided orbit structure; non-invertible maps require modified tower statements with different bookkeeping.
The lemma is not itself an entropy theorem and it does not produce a generator by itself. A tower gives long controlled orbit segments, but a finite generator also needs a coding scheme whose symbols distinguish enough measurable sets while keeping the error from the tower remainder and boundary levels under quantitative entropy control. Those estimates are additional work: without them, a tower may organise orbits without proving that finitely many labels recover the whole measure algebra. This limitation points toward a complementary structural question. Instead of coding the whole system, can entropy force the existence of simpler symbolic factors inside it?
[quotetheorem:6764]
These notes use this statement as a quoted structural input rather than as one of the proved results. Its role here is to identify which Bernoulli factors entropy permits; realizing those factors requires structure beyond the tower and finite-generator estimates developed in this chapter.
Each hypothesis in the theorem has a role. The entropy bound is necessary because factors cannot have larger entropy than the original system. Ergodicity prevents the factor from having different Bernoulli laws on different invariant components; for example, a disjoint union of two invariant subsystems may satisfy the same total entropy bound but admit no single ergodic Bernoulli factor describing both components uniformly. The point is that entropy controls which independent symbolic processes can be extracted as factors, even though extracting them is a different task from finding a generating partition.
[remark: Relation Between Generators and Factors]
A generator represents the original system by a symbolic process without losing measurable information modulo null sets. A factor moves in the opposite direction: it deliberately forgets information while preserving the dynamics. The generator theorem computes the entropy of the whole system from a rich enough symbolic name, while Sinai's theorem guarantees Bernoulli symbolic factors of prescribed smaller entropy.
[/remark]
The finite generator philosophy is that entropy measures the average number of symbols needed per iterate. Rokhlin towers explain how long orbit names can be packed into finite alphabets, while the generator theorem explains why a successful packing computes the Kolmogorov-Sinai entropy.
[example: Coding a Hyperbolic Toral Automorphism]
Let $A\in SL(2,\mathbb Z)$ be hyperbolic, and let $T:\mathbb T^2\to\mathbb T^2$ be given by $T(x)=Ax\pmod{\mathbb Z^2}$. Choose a finite Markov partition $\mathcal R=\{R_1,\dots,R_q\}$ whose rectangle sides lie in the stable and unstable directions of $A$. Define the zero-one transition matrix by
\begin{align*}
M_{ij}=\mathbf 1_{\{\operatorname{int}(R_i)\cap T^{-1}\operatorname{int}(R_j)\ne\varnothing\}}.
\end{align*}
For every point whose orbit never hits a rectangle boundary, define its itinerary by $\pi(x)_n=i$ exactly when $T^n x\in R_i$. If $\pi(x)_n=i$ and $\pi(x)_{n+1}=j$, then
\begin{align*}
T^n x\in R_i\cap T^{-1}R_j.
\end{align*}
Since $x$ avoids all rectangle boundaries, this membership occurs in the corresponding interiors, so $M_{ij}=1$. Thus $\pi(x)$ belongs to the subshift of finite type
\begin{align*}
\Sigma_M=\{y\in\{1,\dots,q\}^{\mathbb Z}:M_{y_ny_{n+1}}=1\text{ for all }n\in\mathbb Z\}.
\end{align*}
The coding intertwines $T$ with the left shift $S$. Indeed, for each $n\in\mathbb Z$ and each symbol $i$,
\begin{align*}
(\pi(Tx))_n=i \Longleftrightarrow T^n(Tx)\in R_i.
\end{align*}
Since $T^n(Tx)=T^{n+1}x$, this is equivalent to
\begin{align*}
T^{n+1}x\in R_i.
\end{align*}
By the definition of $\pi(x)$, that is equivalent to
\begin{align*}
\pi(x)_{n+1}=i.
\end{align*}
By the definition of the left shift, this is equivalent to
\begin{align*}
(S\pi(x))_n=i.
\end{align*}
Therefore $\pi\circ T=S\circ\pi$ away from the boundary exceptional set.
The exceptional set is
\begin{align*}
B=\bigcup_{n\in\mathbb Z}T^{-n}\left(\bigcup_{i=1}^q\partial R_i\right).
\end{align*}
Each $\partial R_i$ is a finite union of stable and unstable line segments, hence has Haar measure $0$. Haar measure is $T$-invariant, so each set $T^{-n}(\partial R_i)$ also has Haar measure $0$. Since $B$ is a countable union of null sets,
\begin{align*}
m_{\mathbb T^2}(B)=0.
\end{align*}
On $\mathbb T^2\setminus B$, the atoms of $\bigvee_{k=-N}^{N}T^{-k}\mathcal R$ are bounded by pieces of stable and unstable sides. Forward iterates contract the stable direction and backward iterates contract the unstable direction, so these atoms shrink to points modulo the boundary set. Hence $\mathcal R$ is a finite generator modulo Haar-null sets. By the *Kolmogorov-Sinai Generator Theorem*,
\begin{align*}
h_m(T)=h_m(T,\mathcal R)=h_{\pi_*m}(S).
\end{align*}
For the standard cat map, take the integer matrix with first row $(2,1)$ and second row $(1,1)$. Its characteristic polynomial is computed from
\begin{align*}
\det(A-\lambda I)=(2-\lambda)(1-\lambda)-1.
\end{align*}
Expanding the product gives
\begin{align*}
(2-\lambda)(1-\lambda)=2-3\lambda+\lambda^2.
\end{align*}
Therefore
\begin{align*}
\det(A-\lambda I)=\lambda^2-3\lambda+1.
\end{align*}
Solving $\lambda^2-3\lambda+1=0$ gives
\begin{align*}
\lambda=\frac{3\pm\sqrt{9-4}}{2}.
\end{align*}
Thus the expanding eigenvalue is
\begin{align*}
\lambda_+=\frac{3+\sqrt5}{2}.
\end{align*}
For Haar measure on a hyperbolic toral automorphism, the entropy is the logarithm of the product of the moduli of the expanding eigenvalues, so in this two-dimensional case
\begin{align*}
h_m(T)=\log\lambda_+=\log\left(\frac{3+\sqrt5}{2}\right).
\end{align*}
The Markov coding converts the geometric stretching of the torus into the entropy rate of a finite-state symbolic shift.
[/example]
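The eigenvalue computation in the example can be checked numerically. The sketch below assumes a $2\times 2$ integer matrix with real eigenvalues, as for a hyperbolic element of $SL(2,\mathbb Z)$; the helper name `expanding_eigenvalue` is hypothetical. It solves the characteristic polynomial $\lambda^2-(\operatorname{tr}A)\lambda+\det A=0$ and returns the eigenvalue of largest modulus.

```python
import math

def expanding_eigenvalue(a, b, c, d):
    """Largest-modulus eigenvalue of [[a, b], [c, d]], assuming real eigenvalues."""
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc <= 0:
        raise ValueError("eigenvalues are not real and distinct")
    root = math.sqrt(disc)
    return max(abs((trace + root) / 2), abs((trace - root) / 2))

# Standard cat map: first row (2, 1), second row (1, 1).
lam = expanding_eigenvalue(2, 1, 1, 1)
entropy = math.log(lam)  # Haar entropy in nats, by the formula in the example
print(f"lambda_+ = {lam:.6f}, entropy = {entropy:.6f} nats")
```

Since $(3+\sqrt5)/2$ is the square of the golden ratio, the entropy also equals twice the logarithm of the golden ratio.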
The examples in this chapter show three levels of coding. The doubling map has independent binary digits, Markov shifts have finite memory, and hyperbolic toral automorphisms need geometrically adapted rectangles. In all cases the same principle is at work: once a finite-entropy generator is found, the supremum in the definition of Kolmogorov-Sinai entropy is replaced by a single entropy-rate calculation.
After generators reduce entropy computation to a tractable symbolic problem, the next step is to understand what that rate means pointwise. The Shannon-McMillan-Breiman theorem upgrades average information growth to an almost sure statement along typical orbits, linking partition entropy to exponential counting.
# 4. Shannon-McMillan-Breiman Theory
This chapter turns the entropy rate of a partition into a pointwise statement along almost every orbit. In Chapters 1 and 2, entropy was defined through partition entropies and orbit-name averages such as
\begin{align*}
\frac{1}{n}H(\mathcal P_0^{n-1});
\end{align*}
the Shannon-McMillan-Breiman theorem says that these averages are seen by individual long names. The prerequisites are finite measurable partitions, [conditional expectation](/page/Conditional%20Expectation), martingale convergence, and Birkhoff's ergodic theorem. The central question is whether most orbit segments have probabilities close to $e^{-n h}$, so that entropy becomes the exponential scale of typical orbit complexity.
## Information Along Orbits and Asymptotic Equipartition
The entropy of a finite partition $\mathcal P$ measures the expected information in the first symbol, while $H(\mathcal P_0^{n-1})$ measures the expected information in the first $n$ symbols of the orbit. This average alone does not say whether the particular name seen by a given point has probability near the exponential scale suggested by entropy: a distribution that puts most of its mass on a few common names and spreads the rest over many rare names can have the same average information as one whose names all have comparable probability. The pointwise object is therefore the information contained in the atom of the joined partition that contains the point being followed.
[definition: Name of an Orbit Segment]
Let $(X,\mathcal B,\mu,T)$ be a measure-preserving system and let $\mathcal P=\{P_1,\dots,P_k\}$ be a finite measurable partition. For $n\geq 1$, define
\begin{align*}
\mathcal P_0^{n-1}:=\bigvee_{j=0}^{n-1}T^{-j}\mathcal P.
\end{align*}
After replacing the atoms by disjoint measurable representatives whose union is $X$, the length-$n$ name map is the measurable map
\begin{align*}
N_n^{\mathcal P}:X\to\{1,\dots,k\}^n,\qquad
N_n^{\mathcal P}(x)=(a_0,\dots,a_{n-1}),
\end{align*}
where $a_j$ is the unique index satisfying $T^j x\in P_{a_j}$ for $0\leq j<n$.
[/definition]
The atom of $\mathcal P_0^{n-1}$ containing $x$ is the cylinder of all points with the same first $n$ symbols as $x$. To compare long names quantitatively, we need a pointwise random variable whose average recovers partition entropy and whose values measure the rarity of the observed name.
[definition: Information of an Orbit Segment]
Let $(X,\mathcal B,\mu,T)$ be a measure-preserving system and let $\mathcal P$ be a finite measurable partition. For $n\geq 1$ and $x\in X$, write $\mathcal P_0^{n-1}(x)$ for the atom of $\mathcal P_0^{n-1}$ containing $x$. The length-$n$ information function is the measurable map
\begin{align*}
I_n^{\mathcal P}:X\to[0,\infty],\qquad
x\mapsto -\log \mu(\mathcal P_0^{n-1}(x)),
\end{align*}
with $-\log 0:=\infty$.
[/definition]
Since the union of the zero-measure atoms of the finite partition $\mathcal P_0^{n-1}$ has measure $0$, $I_n^{\mathcal P}$ is finite $\mu$-a.e. This is why the extended value $\infty$ does not affect entropy integrals or almost sure statements.
Averaging this function recovers the entropy of the joined partition:
\begin{align*}
\int_X I_n^{\mathcal P}\,d\mu=H(\mathcal P_0^{n-1}).
\end{align*}
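Because $I_n^{\mathcal P}$ is constant on each atom, the integral is a finite sum over atoms. The following sketch evaluates that sum in a case where the answer is known in closed form: it assumes a Bernoulli measure with an illustrative probability vector, where independence gives $H(\mathcal P_0^{n-1})=nH(\mathcal P)$. The function name `mean_information` and the numerical values are choices made for this example only.

```python
import itertools
import math

def mean_information(probs, n):
    """Integral of the length-n information function for a Bernoulli measure,
    computed as a sum over all length-n words (atoms of the joined partition)."""
    total = 0.0
    for word in itertools.product(range(len(probs)), repeat=n):
        mass = math.prod(probs[a] for a in word)  # measure of the cylinder
        total += mass * (-math.log(mass))         # value of I_n on that atom
    return total

probs = [0.5, 0.3, 0.2]  # illustrative symbol probabilities
n = 6
lhs = mean_information(probs, n)
rhs = n * -sum(p * math.log(p) for p in probs)  # n * H(P) for independent symbols
print(f"sum over atoms = {lhs:.10f}, n * H(P) = {rhs:.10f}")
```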
The averaged identity alone does not say what a typical single orbit sees: a mean can be controlled even when rare names have very different probabilities. The key question is whether, along almost every orbit, the information content of the observed length-$n$ name stabilises at the same rate as the partition entropy average.
[quotetheorem:6766]
[citeproof:6766]
The theorem says that the entropy rate is not merely an average over names. Ergodicity is what turns the limit into the same constant on almost every orbit: in a mixture of two Bernoulli shifts with different entropy rates, typical points in the two components have different limiting information rates. The finiteness of $\mathcal P$ keeps the one-step information integrable and allows the martingale and chain-rule argument to be applied without extra entropy assumptions. The convergence is almost sure and in mean in standard formulations, but it is not a uniform statement over all points or all atoms; rare names with much smaller probability may persist for every $n$. The Bernoulli shift gives the model case where the exponential scale can be computed directly.
[example: Typical Cylinder Sizes in Bernoulli Shifts]
Let $X=\{1,\dots,k\}^{\mathbb Z}$ carry the Bernoulli measure determined by $p_1,\dots,p_k>0$, and let $\mathcal P=\{[a]:1\leq a\leq k\}$ be the time-zero partition. For $x=(x_j)_{j\in\mathbb Z}$, write
\begin{align*}
N_a(n,x):=\#\{0\leq j<n:x_j=a\}.
\end{align*}
The atom $\mathcal P_0^{n-1}(x)$ fixes the word $(x_0,\dots,x_{n-1})$, so independence of the Bernoulli coordinates gives
\begin{align*}
\mu(\mathcal P_0^{n-1}(x))=\prod_{j=0}^{n-1}p_{x_j}.
\end{align*}
Grouping equal symbols in the product gives
\begin{align*}
\prod_{j=0}^{n-1}p_{x_j}=\prod_{a=1}^k p_a^{N_a(n,x)}.
\end{align*}
Since each $p_a>0$, taking logarithms yields
\begin{align*}
-\frac{1}{n}\log \mu(\mathcal P_0^{n-1}(x))=-\frac{1}{n}\log\left(\prod_{a=1}^k p_a^{N_a(n,x)}\right).
\end{align*}
Using $\log(uv)=\log u+\log v$ and $\log(p_a^{N_a(n,x)})=N_a(n,x)\log p_a$, this becomes
\begin{align*}
-\frac{1}{n}\log \mu(\mathcal P_0^{n-1}(x))=-\sum_{a=1}^k \frac{N_a(n,x)}{n}\log p_a.
\end{align*}
For each fixed symbol $a$,
\begin{align*}
\frac{N_a(n,x)}{n}=\frac{1}{n}\sum_{j=0}^{n-1}\mathbb 1_{[a]}(T^j x).
\end{align*}
Bernoulli shifts are ergodic, so by the *[Birkhoff Ergodic Theorem](/theorems/518)* applied to $\mathbb 1_{[a]}$, for $\mu$-a.e. $x$ this average converges to
\begin{align*}
\int_X \mathbb 1_{[a]}\,d\mu=\mu([a])=p_a.
\end{align*}
The alphabet is finite, so the limit passes through the finite sum and gives
\begin{align*}
-\frac{1}{n}\log \mu(\mathcal P_0^{n-1}(x))\to -\sum_{a=1}^k p_a\log p_a.
\end{align*}
Thus, writing $h=-\sum_{a=1}^k p_a\log p_a$, a typical length-$n$ cylinder has measure $e^{-n(h+o(1))}$: its information $-\log\mu(\mathcal P_0^{n-1}(x))$ is asymptotic to $nh$. Words with different empirical symbol counts can still have different exact probabilities.
[/example]
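The convergence in the example can be observed on a simulated sample. The sketch below is a numerical illustration under stated assumptions: the probability vector, the sample length, and the seed are arbitrary choices, and `information_rate` is a hypothetical helper that evaluates $-\frac1n\log\mu(\mathcal P_0^{n-1}(x))$ from the product formula for cylinder measures.

```python
import math
import random

def information_rate(word, probs):
    """-(1/n) log of the Bernoulli cylinder measure of the given word."""
    return -sum(math.log(probs[a]) for a in word) / len(word)

probs = [0.5, 0.3, 0.2]   # illustrative symbol probabilities
rng = random.Random(0)    # fixed seed so the run is reproducible
word = rng.choices(range(len(probs)), weights=probs, k=200_000)

rate = information_rate(word, probs)
entropy = -sum(p * math.log(p) for p in probs)
print(f"empirical rate = {rate:.4f}, entropy = {entropy:.4f}")
```

A single simulated word is only one sample, so the agreement is approximate and depends on the sample length.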
The Bernoulli case hides the conditional nature of the theorem because the symbols are independent: the probability of a long word factors into one-symbol probabilities. For dependent processes, the probability of the next symbol changes after the past has been observed, and the SMB limit averages these successive conditional uncertainties along the orbit. Markov chains are the first setting where this distinction is visible while the computation remains explicit.
[example: Markov Chain Entropy Rate]
Let $(X_n)_{n\in\mathbb Z}$ be a stationary irreducible Markov chain with finite state space, stationary distribution $\pi$, and transition matrix $P=(p_{ij})$. In the shift system on path space, let $\mathcal P$ be the partition according to the value of $X_0$. For a path $x=(x_j)_{j\in\mathbb Z}$, the atom $\mathcal P_0^{n-1}(x)$ is the cylinder fixing $(x_0,\dots,x_{n-1})$, so by definition
\begin{align*}
\mu(\mathcal P_0^{n-1}(x))=\mu(X_0=x_0,\dots,X_{n-1}=x_{n-1}).
\end{align*}
The chain rule for conditional probabilities gives
\begin{align*}
\mu(X_0=x_0,\dots,X_{n-1}=x_{n-1})=\mu(X_0=x_0)\prod_{r=0}^{n-2}\mu(X_{r+1}=x_{r+1}\mid X_0=x_0,\dots,X_r=x_r).
\end{align*}
By the Markov property,
\begin{align*}
\mu(X_{r+1}=x_{r+1}\mid X_0=x_0,\dots,X_r=x_r)=\mu(X_{r+1}=x_{r+1}\mid X_r=x_r)=p_{x_r x_{r+1}}.
\end{align*}
Since the chain is stationary, $\mu(X_0=x_0)=\pi_{x_0}$, and therefore
\begin{align*}
\mu(\mathcal P_0^{n-1}(x))=\pi_{x_0}\prod_{r=0}^{n-2}p_{x_r x_{r+1}}.
\end{align*}
On the full-measure set of paths for which all observed transitions have positive transition probability, logarithms give
\begin{align*}
-\frac{1}{n}\log\mu(\mathcal P_0^{n-1}(x))=-\frac{1}{n}\log\left(\pi_{x_0}\prod_{r=0}^{n-2}p_{x_r x_{r+1}}\right).
\end{align*}
Using $\log(uv)=\log u+\log v$ and $\log\prod_r a_r=\sum_r\log a_r$, this becomes
\begin{align*}
-\frac{1}{n}\log\mu(\mathcal P_0^{n-1}(x))=-\frac{1}{n}\log\pi_{x_0}+\frac{1}{n}\sum_{r=0}^{n-2}(-\log p_{x_r x_{r+1}}).
\end{align*}
Irreducibility on a finite state space implies $\pi_i>0$ for every state $i$. If $\pi_{\min}:=\min_i\pi_i$, then
\begin{align*}
0\leq -\frac{1}{n}\log\pi_{x_0}\leq -\frac{1}{n}\log\pi_{\min}.
\end{align*}
The right-hand side tends to $0$, so the initial distribution term does not contribute to the limiting information per symbol.
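Define $f(x):=-\log p_{x_0x_1}$ when $p_{x_0x_1}>0$, and define $f$ arbitrarily on the null set of impossible transitions. The shift on path space is ergodic because the chain is stationary and irreducible, and $f$ is integrable because it takes finitely many values off a null set. By the *Birkhoff Ergodic Theorem*,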
\begin{align*}
\frac{1}{n-1}\sum_{r=0}^{n-2}f(T^r x)\to \int f\,d\mu
\end{align*}
for $\mu$-a.e. $x$. Since
\begin{align*}
\frac{1}{n}\sum_{r=0}^{n-2}f(T^r x)=\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{r=0}^{n-2}f(T^r x),
\end{align*}
the same limit holds with denominator $n$. The integral is computed from the one-step stationary law:
\begin{align*}
\int f\,d\mu=\sum_i\sum_j \mu(X_0=i,X_1=j)(-\log p_{ij}).
\end{align*}
Stationarity and the transition rule give $\mu(X_0=i,X_1=j)=\pi_i p_{ij}$, so
\begin{align*}
\int f\,d\mu=\sum_i\sum_j \pi_i p_{ij}(-\log p_{ij}).
\end{align*}
With the convention $0\log 0=0$, this is
\begin{align*}
\int f\,d\mu=-\sum_i\pi_i\sum_j p_{ij}\log p_{ij}.
\end{align*}
Combining the vanishing initial term with the ergodic average gives
\begin{align*}
-\frac{1}{n}\log\mu(\mathcal P_0^{n-1}(x))\to -\sum_i\pi_i\sum_j p_{ij}\log p_{ij}
\end{align*}
for $\mu$-a.e. $x$. Thus the entropy rate is the stationary average of the uncertainty in the next state after the current state is known.
[/example]
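The entropy-rate formula can be compared with a simulated path. The sketch below assumes a two-state transition matrix chosen for illustration, finds the stationary distribution by power iteration (which presumes the chain is also aperiodic), and evaluates $-\frac1n\log\mu(\mathcal P_0^{n-1}(x))$ along one sampled path. All function names are hypothetical.

```python
import math
import random

def stationary_distribution(P, iterations=1000):
    """Stationary row vector by power iteration (assumes irreducible, aperiodic P)."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

def entropy_rate(P, pi):
    """-sum_i pi_i sum_j p_ij log p_ij, with the convention 0 log 0 = 0."""
    return -sum(pi[i] * p * math.log(p)
                for i, row in enumerate(P) for p in row if p > 0)

def sample_information_rate(P, pi, n, rng):
    """-(1/n) log of the cylinder measure of a sampled length-n path."""
    states = range(len(P))
    x = rng.choices(states, weights=pi)[0]
    info = -math.log(pi[x])                # initial term -log pi_{x_0}
    for _ in range(n - 1):
        y = rng.choices(states, weights=P[x])[0]
        info -= math.log(P[x][y])          # one transition term per step
        x = y
    return info / n

P = [[0.9, 0.1], [0.4, 0.6]]               # illustrative transition matrix
pi = stationary_distribution(P)
h = entropy_rate(P, pi)
rate = sample_information_rate(P, pi, 100_000, random.Random(0))
print(f"pi = {pi}, entropy rate = {h:.4f}, sampled rate = {rate:.4f}")
```

For this matrix the balance equation $0.1\,\pi_0=0.4\,\pi_1$ gives $\pi=(0.8,0.2)$, which the power iteration should reproduce.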
## Conditional Forms and Entropy Relative to Invariant Sigma-Algebras
The ergodic hypothesis makes the limiting information rate constant. The next question is what remains when the system is not ergodic, or when we want to retain information about an invariant factor. The answer is a conditional entropy function, measurable with respect to the invariant $\sigma$-algebra.
[definition: Invariant Sigma-Algebra]
Let $(X,\mathcal B,\mu,T)$ be a measure-preserving system. The invariant $\sigma$-algebra is
\begin{align*}
\mathcal I_T:=\{A\in\mathcal B: \mu(T^{-1}A\triangle A)=0\}.
\end{align*}
[/definition]
The invariant $\sigma$-algebra records which ergodic component a point belongs to. To state a non-ergodic SMB theorem, we therefore need entropy conditioned on this information rather than a single number that averages all components together.
[definition: Conditional Partition Entropy Relative to the Invariant Sigma-Algebra]
Let $(X,\mathcal B,\mu,T)$ be a measure-preserving system, let $\mathcal P$ be a finite measurable partition, and let $\mathcal I_T$ be the invariant $\sigma$-algebra. Define the scalar
\begin{align*}
h_\mu(T,\mathcal P\mid \mathcal I_T):=\lim_{n\to\infty}\frac{1}{n}H(\mathcal P_0^{n-1}\mid \mathcal I_T),
\end{align*}
whenever the scalar limit of conditional entropies exists in $[0,\infty)$.
[/definition]
The notation distinguishes two objects: $h_\mu(T,\mathcal P\mid \mathcal I_T)$ is a number obtained by integrating conditional entropy, while $\bar h_\mu(T,\mathcal P\mid \mathcal I_T):X\to[0,\infty)$ denotes the invariant-measurable function seen by individual points. Without this separation, a non-ergodic system gives the wrong target: if half the space carries a fair Bernoulli shift and half carries a biased Bernoulli shift, the global entropy rate is an average of two numbers, while a point never moves between the two components. The next problem is to identify the pointwise function as the limit of long-name information, so that each orbit sees the entropy rate of its own invariant component.
[quotetheorem:6768]
[citeproof:6768]
This formulation explains non-ergodic examples without treating them as exceptions. The theorem does not force a single exponential scale; it assigns the scale dictated by the component, as seen in mixtures of Bernoulli processes. The finiteness of $\mathcal P$ again matters because conditional information must be integrable enough for the martingale argument. The invariant $\sigma$-algebra cannot be replaced by an arbitrary sub-$\sigma$-algebra in this statement, since the limiting time average is controlled by information that is unchanged along the dynamics. The theorem also does not identify a typical set with one universal size in a non-ergodic system; different invariant components may require exponentially different numbers of names.
[example: Non-Ergodic Mixture of Bernoulli Processes]
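Let $X=\{0,1\}^{\mathbb Z}$ with the left shift $T$, and for $0<r<1$ let $\mu_r$ denote the Bernoulli measure with $\mu_r([1])=r$ and $\mu_r([0])=1-r$. Let $\mathcal P=\{[0],[1]\}$ and write
\begin{align*}
h(r):=-r\log r-(1-r)\log(1-r)
\end{align*}
for $0<r<1$. Fix $p,q\in(0,1)$ with $h(p)\neq h(q)$ and let $\mu:=\frac12\mu_p+\frac12\mu_q$. Since $h(p)\neq h(q)$, we have $p\neq q$. For $x\in X$, set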
\begin{align*}
N_1(n,x):=\#\{0\leq j<n:x_j=1\}
\end{align*}
and
\begin{align*}
N_0(n,x):=n-N_1(n,x).
\end{align*}
The atom $\mathcal P_0^{n-1}(x)$ is the cylinder fixing $x_0,\dots,x_{n-1}$. Under $\mu_p$ this cylinder has measure
\begin{align*}
\mu_p(\mathcal P_0^{n-1}(x))=p^{N_1(n,x)}(1-p)^{N_0(n,x)},
\end{align*}
and under $\mu_q$ it has measure
\begin{align*}
\mu_q(\mathcal P_0^{n-1}(x))=q^{N_1(n,x)}(1-q)^{N_0(n,x)}.
\end{align*}
Therefore the mixture measure is
\begin{align*}
\mu(\mathcal P_0^{n-1}(x))=\frac12 p^{N_1(n,x)}(1-p)^{N_0(n,x)}+\frac12 q^{N_1(n,x)}(1-q)^{N_0(n,x)}.
\end{align*}
Now take $x$ in the full-measure $\mu_p$-typical set for which
\begin{align*}
\frac{N_1(n,x)}{n}\to p
\end{align*}
and
\begin{align*}
\frac{N_0(n,x)}{n}\to 1-p.
\end{align*}
For the $p$-component,
\begin{align*}
-\frac1n\log\mu_p(\mathcal P_0^{n-1}(x))=-\frac1n\log\left(p^{N_1(n,x)}(1-p)^{N_0(n,x)}\right).
\end{align*}
Using $\log(uv)=\log u+\log v$ and $\log(a^m)=m\log a$, this is
\begin{align*}
-\frac1n\log\mu_p(\mathcal P_0^{n-1}(x))=-\frac{N_1(n,x)}{n}\log p-\frac{N_0(n,x)}{n}\log(1-p).
\end{align*}
Passing to the limit along the two frequency limits gives
\begin{align*}
-\frac1n\log\mu_p(\mathcal P_0^{n-1}(x))\to -p\log p-(1-p)\log(1-p)=h(p).
\end{align*}
For the $q$-component evaluated on the same $p$-typical sequence,
\begin{align*}
-\frac1n\log\mu_q(\mathcal P_0^{n-1}(x))=-\frac{N_1(n,x)}{n}\log q-\frac{N_0(n,x)}{n}\log(1-q),
\end{align*}
so
\begin{align*}
-\frac1n\log\mu_q(\mathcal P_0^{n-1}(x))\to -p\log q-(1-p)\log(1-q).
\end{align*}
Subtracting the $p$-entropy gives
\begin{align*}
\left[-p\log q-(1-p)\log(1-q)\right]-h(p)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}.
\end{align*}
By the strict form of Gibbs' inequality, this last quantity, the relative entropy of the $p$-coin with respect to the $q$-coin, is positive because $p\neq q$. Thus along a $\mu_p$-typical sequence the $\mu_q$ cylinder mass is exponentially smaller than the $\mu_p$ cylinder mass.
To make this comparison explicit, write
\begin{align*}
A_n:=\mu_p(\mathcal P_0^{n-1}(x))
\end{align*}
and
\begin{align*}
B_n:=\mu_q(\mathcal P_0^{n-1}(x)).
\end{align*}
The preceding limits imply
\begin{align*}
-\frac1n\log A_n\to h(p)
\end{align*}
and, with $\eta>0$ denoting the positive difference computed above,
\begin{align*}
-\frac1n\log B_n\to h(p)+\eta.
\end{align*}
Hence $B_n/A_n\to 0$ exponentially. Since
\begin{align*}
\mu(\mathcal P_0^{n-1}(x))=\frac12 A_n\left(1+\frac{B_n}{A_n}\right),
\end{align*}
we get
\begin{align*}
-\frac1n\log\mu(\mathcal P_0^{n-1}(x))=-\frac1n\log A_n-\frac1n\log\frac12-\frac1n\log\left(1+\frac{B_n}{A_n}\right).
\end{align*}
The first term tends to $h(p)$, the second term tends to $0$, and the third term tends to $0$ because $B_n/A_n\to 0$. Therefore
\begin{align*}
-\frac1n\log\mu(\mathcal P_0^{n-1}(x))\to h(p)
\end{align*}
for $\mu_p$-a.e. $x$.
The same calculation with $p$ and $q$ interchanged gives
\begin{align*}
-\frac1n\log\mu(\mathcal P_0^{n-1}(x))\to h(q)
\end{align*}
for $\mu_q$-a.e. $x$. Thus the invariant $\sigma$-algebra separates the two Bernoulli components: a point sees the entropy rate of its own component, while the scalar entropy rate of the mixture is $\frac12h(p)+\frac12h(q)$. Since $h(p)\neq h(q)$, the pointwise information rate is not constant.
[/example]
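The dominance of one component in the mixture can be checked numerically. The sketch below evaluates $-\frac1n\log\mu(\mathcal P_0^{n-1}(x))$ for the mixture measure using only the count of ones in the word, working in logarithms so that the very small cylinder masses do not underflow. The values of $p$, $q$, and $n$ are illustrative, the idealised typical words have exactly $pn$ or $qn$ ones, and the function names are hypothetical.

```python
import math

def binary_entropy(r):
    return -r * math.log(r) - (1 - r) * math.log(1 - r)

def mixture_information_rate(n, ones, p, q):
    """-(1/n) log of (1/2) A_n + (1/2) B_n for a word with the given number of ones."""
    zeros = n - ones
    log_a = ones * math.log(p) + zeros * math.log(1 - p)  # log of mu_p cylinder
    log_b = ones * math.log(q) + zeros * math.log(1 - q)  # log of mu_q cylinder
    top = max(log_a, log_b)
    # Stable log of the sum: factor out the larger term before exponentiating.
    log_mix = math.log(0.5) + top + math.log(
        math.exp(log_a - top) + math.exp(log_b - top))
    return -log_mix / n

p, q, n = 0.3, 0.1, 10_000
rate_p = mixture_information_rate(n, round(p * n), p, q)  # p-typical word
rate_q = mixture_information_rate(n, round(q * n), p, q)  # q-typical word
print(f"p-typical: {rate_p:.5f} vs h(p) = {binary_entropy(p):.5f}")
print(f"q-typical: {rate_q:.5f} vs h(q) = {binary_entropy(q):.5f}")
```

The small gap between each rate and the corresponding entropy comes from the term $-\frac1n\log\frac12$, which vanishes as $n\to\infty$.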
The mixture example conditions on the whole invariant component. In extensions and factors, the natural problem is finer: after a factor has already been observed, how much new information remains in the partition names upstairs? For instance, a skew-product whose base is a Bernoulli shift and whose fibre is another random process has total names that mix base and fibre randomness; conditioning on the base leaves only the fibre contribution. This leads to the relative form of SMB.
[definition: Relative Conditional Information]
Let $(X,\mathcal B,\mu,T)$ be a measure-preserving system, let $\mathcal G\subset\mathcal B$ be a $T$-invariant sub-$\sigma$-algebra modulo null sets, and let $\mathcal P$ be a finite measurable partition. For a finite partition $\mathcal Q$, choose versions of the conditional probability maps
\begin{align*}
\mu(Q\mid\mathcal G):X\to[0,1],
\qquad
\mu(Q\mid\mathcal G)=\mathbb E[\mathbb{1}_Q\mid\mathcal G],
\end{align*}
for all atoms $Q\in\mathcal Q$, modifying on one null set common to the finitely many atoms. The conditional information of $\mathcal Q$ given $\mathcal G$ is the measurable map
\begin{align*}
I_\mu(\mathcal Q\mid\mathcal G):X\to[0,\infty],
\qquad
I_\mu(\mathcal Q\mid\mathcal G)(x):=-\log \mu(\mathcal Q(x)\mid\mathcal G)(x),
\end{align*}
outside the null set where the chosen conditional probability of the containing atom is $0$; set the value to $0$ on that null set.
[/definition]
This definition fixes the version issue that is hidden by the shorter notation $\mu(\mathcal Q(x)\mid\mathcal G)(x)$. Since $\mathcal Q$ is finite, all atom representatives and conditional probabilities can be chosen simultaneously, and changing them on a null set does not change the information function as an element of $L^1$. The remaining question is dynamical rather than measure-theoretic: when $\mathcal Q=\mathcal P_0^{n-1}$ grows along the orbit, does this relative information have a deterministic asymptotic rate after the factor information in $\mathcal G$ has already been supplied? The next theorem answers this by showing that relative entropy is the almost sure exponential scale of the conditional probabilities of long names.
[quotetheorem:6770]
[citeproof:6770]
The relative theorem is the version used in extensions and factors: it separates the randomness already present downstairs from the additional randomness upstairs. Its hypotheses distinguish two different kinds of conditioning. Conditioning on $\mathcal I_T$ handles non-ergodic decomposition, while conditioning on a factor $\mathcal G$ measures entropy left after observing another dynamical system. The theorem does not say that conditional atom probabilities are uniformly close to $e^{-n h_\mu(T,\mathcal P\mid \mathcal G)}$ for every fibre or every base point; exceptional names and finite-time fluctuations remain. In a product of two Bernoulli shifts, conditioning on the first coordinate process leaves the entropy of the second coordinate process, which is the basic example to keep in mind.
This relative pointwise law is the technical input for later extension arguments. When comparing a system to a factor, the theorem turns the relative entropy number into an almost sure growth rate of fibre names, which is the form needed in relative generator constructions and in proofs that entropy is monotone under factors. It is also the pointwise language behind the Pinsker $\sigma$-algebra: a zero relative entropy extension is one in which, after conditioning on the factor, the remaining names grow subexponentially almost everywhere. Thus relative SMB is not only a product-Bernoulli calculation; it is the tool that lets later chapters localise entropy to the part of a system not already visible in a chosen factor.
Each hypothesis has a specific role. If ergodicity is dropped while $\mathcal G$ is fixed as the first-coordinate factor in a mixture of two product Bernoulli systems with different second-coordinate entropies, the relative information rate is not a single constant $h_\mu(T,\mathcal P\mid\mathcal G)$; it depends on the invariant component. If $\mathcal P$ is allowed to be countable with infinite entropy, the one-step conditional information may fail to be integrable, so the martingale and Birkhoff averages need extra assumptions before the displayed convergence has a finite target. If $\mathcal G$ is not invariant under $T$, conditioning is not information from a factor system along the whole orbit: for the shift, taking $\mathcal G=\sigma(X_0)$ means the conditioning at time $0$ does not contain the corresponding information at later times, and the chain-rule increments are not stationary relative to $\mathcal G$.
## Typical Names, Orbit Complexity, and Measure-Theoretic Interpretation
SMB turns entropy into a counting principle. The guiding question is how many orbit names are needed to describe most of the measure. The answer is that, for an ergodic process, almost all mass is carried by about $e^{n h}$ names, each of probability about $e^{-n h}$.
[definition: Typical Set of Names]
Let $(X,\mathcal B,\mu,T)$ be an ergodic measure-preserving system, let $\mathcal P$ be a finite measurable partition, and let $h=h_\mu(T,\mathcal P)$. For $\varepsilon>0$ and $n\geq 1$, define the $n$-th $\varepsilon$-typical set by
\begin{align*}
A_n(\varepsilon):=\left\{x\in X: e^{-n(h+\varepsilon)}\leq \mu(\mathcal P_0^{n-1}(x))\leq e^{-n(h-\varepsilon)}\right\}.
\end{align*}
[/definition]
By the SMB theorem, $-\frac1n\log\mu(\mathcal P_0^{n-1}(x))$ converges to $h$ almost surely and hence in measure, so $\mu(A_n(\varepsilon))\to 1$ for every $\varepsilon>0$. The next question is how this high-measure typical set translates into a count of admissible names, which is the bridge from information to orbit complexity.
[quotetheorem:6772]
[citeproof:6772]
This is the measure-theoretic form of the asymptotic equipartition property. It says that entropy is the exponential growth rate of the number of statistically relevant names, rather than the total number of possible names. Ergodicity is needed for a single exponent $h$: in the mixture $\frac12\mu_p+\frac12\mu_q$ of two Bernoulli shifts with different entropies, a set covering most of both components must accommodate two different exponential scales, so no single typical-name count describes the whole system sharply. Finiteness of $\mathcal P$ is also a real hypothesis; for a countable partition of a Bernoulli shift with symbol distribution of infinite Shannon entropy, the expected one-symbol information is infinite and the exponential estimate $e^{n(h+\varepsilon)}$ has no finite entropy rate to use. The high-measure condition cannot be replaced by an exact count of all atoms: even a finite Bernoulli process has rare words with probabilities far below the typical scale, and including every positive-measure word can require the full combinatorial alphabet growth rather than the entropy growth. The $\varepsilon$-losses are therefore not cosmetic; finite-time cylinder probabilities fluctuate, and the theorem controls names only after ignoring a set of small measure.
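For a binary Bernoulli shift the typical set can be counted exactly, since the measure of a cylinder depends only on the number of ones in the word. The sketch below is an illustration under assumed parameter values: it sums binomial coefficients over the counts $k$ whose cylinder measure lies in the window defining $A_n(\varepsilon)$, and reports the number of typical names and their total mass. The function name `typical_set_stats` is hypothetical.

```python
import math

def typical_set_stats(n, p, eps):
    """Entropy, number of eps-typical length-n words, and their total mass
    for the Bernoulli measure with P(1) = p."""
    h = -p * math.log(p) - (1 - p) * math.log(1 - p)
    count = 0
    mass = 0.0
    for k in range(n + 1):
        # log-measure of any single word with exactly k ones
        log_word_mass = k * math.log(p) + (n - k) * math.log(1 - p)
        if -n * (h + eps) <= log_word_mass <= -n * (h - eps):
            words = math.comb(n, k)
            count += words
            mass += math.exp(math.log(words) + log_word_mass)
    return h, count, mass

n, p, eps = 2000, 0.3, 0.05
h, count, mass = typical_set_stats(n, p, eps)
growth = math.log(count) / n
print(f"h = {h:.4f}, (1/n) log count = {growth:.4f}, mass = {mass:.6f}")
```

The count is automatically at most $e^{n(h+\varepsilon)}$, because every typical word has measure at least $e^{-n(h+\varepsilon)}$ and the total mass is at most $1$.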
[example: Zero Entropy Rotation Names]
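Let $X=\mathbb R/\mathbb Z$ with Lebesgue measure $\mu$, let $T(x)=x+\alpha \pmod 1$ be a rotation, and let $\mathcal P$ be a finite partition of $X$ into intervals. Let $E=\{e_1,\dots,e_r\}$ be the finite set of endpoints of the interval partition $\mathcal P$, and assume that the endpoint orbits are distinct, meaning that the points $e_\ell-j\alpha \pmod 1$ with $1\leq\ell\leq r$ and $j\geq 0$ are pairwise distinct. For $0\leq j<n$, the partition $T^{-j}\mathcal P$ has endpoints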
\begin{align*}
T^{-j}E=\{e_\ell-j\alpha \pmod 1:1\leq \ell\leq r\}.
\end{align*}
Hence the join
\begin{align*}
\mathcal P_0^{n-1}=\bigvee_{j=0}^{n-1}T^{-j}\mathcal P
\end{align*}
is obtained by cutting the circle at the finite set
\begin{align*}
E_n:=\bigcup_{j=0}^{n-1}T^{-j}E.
\end{align*}
The assumption that the endpoint orbits are distinct gives
\begin{align*}
|E_n|=\sum_{j=0}^{n-1}|T^{-j}E|=\sum_{j=0}^{n-1}r=rn.
\end{align*}
A circle cut at $rn$ points has $rn$ interval components, and each atom of $\mathcal P_0^{n-1}$ is a union of such components, so
\begin{align*}
|\mathcal P_0^{n-1}|\leq rn.
\end{align*}
Therefore the number of possible length-$n$ names satisfies
\begin{align*}
\#\{N_n^{\mathcal P}(x):x\in X\}\leq rn.
\end{align*}
Taking logarithms and dividing by $n$ gives
\begin{align*}
0\leq \frac1n\log \#\{N_n^{\mathcal P}(x):x\in X\}
\leq \frac1n\log(rn)
=\frac{\log r}{n}+\frac{\log n}{n}.
\end{align*}
Since
\begin{align*}
\frac{\log r}{n}\to 0
\qquad\text{and}\qquad
\frac{\log n}{n}\to 0,
\end{align*}
the exponential growth rate of orbit names is $0$. Also
\begin{align*}
H(\mathcal P_0^{n-1})\leq \log|\mathcal P_0^{n-1}|\leq \log(rn),
\end{align*}
so
\begin{align*}
0\leq h_\mu(T,\mathcal P)
=\lim_{n\to\infty}\frac1nH(\mathcal P_0^{n-1})
\leq \lim_{n\to\infty}\frac1n\log(rn)=0.
\end{align*}
Thus $h_\mu(T,\mathcal P)=0$, and the typical-name growth rate is subexponential rather than exponential as in Bernoulli shifts.
[/example]
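The linear bound on the number of names can be observed directly. The sketch below assumes the rotation by the golden mean and the partition of the circle into two half-open half-circles with endpoints $0$ and $\tfrac12$, so $r=2$; both are illustrative choices. It cuts the circle at the points $e_\ell-j\alpha$, takes the midpoint of each resulting arc, and records the length-$n$ name of that midpoint. The function names are hypothetical, and floating-point arithmetic is used throughout.

```python
import bisect
import math

def symbol(x, endpoints):
    """Label of the half-open arc containing x; endpoints must be sorted in [0, 1)."""
    return bisect.bisect_right(endpoints, x) % len(endpoints)

def rotation_names(alpha, endpoints, n):
    """Set of length-n names realised by x -> x + alpha (mod 1)."""
    cuts = sorted({(e - j * alpha) % 1.0 for e in endpoints for j in range(n)})
    names = set()
    # Each arc between consecutive cut points carries a single name.
    for left, right in zip(cuts, cuts[1:] + [cuts[0] + 1.0]):
        x = ((left + right) / 2) % 1.0
        names.add(tuple(symbol((x + j * alpha) % 1.0, endpoints)
                        for j in range(n)))
    return names

alpha = (math.sqrt(5) - 1) / 2
endpoints = [0.0, 0.5]
for n in (5, 20, 80):
    count = len(rotation_names(alpha, endpoints, n))
    print(f"n={n:3d}  names={count:4d}  bound r*n={len(endpoints) * n:4d}  "
          f"(1/n) log names={math.log(count) / n:.4f}")
```

The last column shrinks with $n$, in contrast with a Bernoulli shift where the corresponding quantity stays bounded away from $0$.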
Algorithmic complexity gives another interpretation of typical names: a long symbolic orbit from an ergodic process usually cannot be compressed below its entropy rate. Typical-set counting by itself gives only a high-measure list of plausible names; it does not rule out the possibility that most of those names have much shorter individual descriptions. The missing ingredient is the prefix-free counting bound supplied by Kraft's inequality, which limits how many words can have descriptions below a given length. The statement below fixes the finite-alphabet coding conventions so that the entropy comparison has a definite meaning.
[quotetheorem:6774]
[remark: Brudno Theorem in Context]
The argument combines the SMB theorem with coding estimates for typical sets and the converse fact that too many typical strings cannot all have descriptions much shorter than their entropy scale. The upper bound codes the high-measure family of typical words using approximately $n h_\mu(T)$ nats plus lower-order overhead. The lower bound uses Kraft's inequality: there are too few short prefix-free descriptions to cover almost all words in the typical family. The ergodicity assumption again supplies a single almost sure compression rate; without it, the rate depends on the ergodic component. The finite alphabet, coding convention, and prefix-free machine are part of the statement because the comparison between counting names and program lengths uses finite-word encodings and Kraft-type estimates. The theorem is an asymptotic statement about typical orbits and does not give a practical compression algorithm for a given finite sample.
[/remark]
The preceding theorem connects orbit names with individual description lengths, but the reader still needs a practical interpretation of what this connection says about entropy itself. The next remark extracts the conceptual message: entropy is not only a counting invariant of typical sets, but also the unavoidable compression scale for typical observations.
[remark: Entropy as Compression Rate]
SMB provides the probabilistic half of compression: most observed names lie in a set of size about $e^{n h}$. A coding scheme can therefore describe typical names using about $n h$ nats, or $n h/\log 2$ bits. Brudno's theorem says that for individual typical orbits this compression scale is also forced by algorithmic complexity.
[/remark]
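The compression-rate picture can be checked numerically in the simplest case. The following is a minimal sketch, assuming an i.i.d. two-symbol source with $\mu([0])=q$; the function names, the choice $q=0.3$, the sample length, and the seed are illustrative assumptions rather than part of the notes. It compares the pointwise information rate $-\frac1n\log\mu([x_0\dots x_{n-1}])$ of one sampled name with the entropy $h$, and converts $h$ from nats to bits.

```python
import math
import random

def entropy_rate_bernoulli(q):
    """Entropy h = -q log q - (1-q) log(1-q) in nats, for 0 < q < 1."""
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

def empirical_information_rate(q, n, seed=0):
    """Sample x_0..x_{n-1} i.i.d. with P(0) = q and return
    -(1/n) log mu([x_0 ... x_{n-1}]) for the sampled name."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    log_mu = 0.0
    for _ in range(n):
        symbol = 0 if rng.random() < q else 1
        log_mu += math.log(q if symbol == 0 else 1 - q)
    return -log_mu / n

h = entropy_rate_bernoulli(0.3)                   # nats per symbol
rate = empirical_information_rate(0.3, n=20000)   # should be close to h
bits_per_symbol = h / math.log(2)                 # same rate expressed in bits
print(h, rate, bits_per_symbol)
```

The sketch illustrates only the probabilistic (SMB) side of the remark; it says nothing about algorithmic complexity, which is not computable.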
The chapter's main conclusion is that entropy is simultaneously an average information rate, an almost sure orbit statistic, and an exponential counting rate for typical symbolic names. Later chapters use this pointwise viewpoint when comparing Bernoulli systems, proving isomorphism results, and relating measure-theoretic entropy to topological growth.
The pointwise picture of entropy now prepares the ground for the model class where everything is most transparent: Bernoulli shifts. The next chapter uses independent coordinates and orbit shifts to study isomorphism problems, showing how entropy becomes the decisive invariant in the probabilistic setting.
# 5. Bernoulli Shifts and Isomorphism Problems
Bernoulli shifts are the model systems in which entropy has its cleanest probabilistic meaning: the coordinates are independent, identically distributed observations, and the dynamics moves the observer one step along the sequence. The prerequisites are the definitions of measure-preserving systems and partitions from Chapters 0 and 1, generators from Chapter 3, and Kolmogorov-Sinai entropy from Chapter 2. The previous chapters gave entropy as a measure-theoretic invariant and showed how generators convert dynamics into symbolic processes. This chapter asks how much of a Bernoulli shift is remembered by its entropy, and it introduces the isomorphism problem that led from Kolmogorov's obstruction to Ornstein's classification theorem.
## Complete Independence in Bernoulli Schemes
What is the most independent measure-preserving system with a prescribed one-step distribution? The answer is a product probability space with the shift map, where every finite block of coordinates factors into the product of its marginals.
[definition: Bernoulli Scheme]
Let $A$ be a finite or countable alphabet, let $p=(p_a)_{a \in A}$ be a probability vector with $p_a \ge 0$ and $\sum_{a \in A} p_a=1$, and let
\begin{align*}
X &= A^{\mathbb Z}, & \mathcal F &= \bigotimes_{n \in \mathbb Z} 2^A, & \mu &= p^{\mathbb Z}.
\end{align*}
The shift map is the function $T:A^{\mathbb Z}\to A^{\mathbb Z}$ defined by $x \mapsto Tx$, where
\begin{align*}
(Tx)_n = x_{n+1}
\end{align*}
for every $x=(x_n)_{n \in \mathbb Z} \in A^{\mathbb Z}$ and $n \in \mathbb Z$. The two-sided Bernoulli shift with base distribution $p$ is the measure-preserving system $(X,\mathcal F,\mu,T)$.
[/definition]
The definition builds the ambient dynamical system from independent coordinates, but entropy is computed through partitions rather than through coordinates written informally. We therefore need the partition that records exactly the information seen at one time, because its shifted joins will describe finite observed blocks.
[definition: Coordinate Partition]
For a Bernoulli scheme over $A$, the coordinate partition is
\begin{align*}
\mathcal P = \{[a] : a \in A\}, \qquad [a]=\{x \in A^{\mathbb Z}:x_0=a\}.
\end{align*}
[/definition]
The point of the definition is that the atoms of $\bigvee_{j=0}^{n-1}T^{-j}\mathcal P$ are cylinder sets specifying the block $(x_0,\dots,x_{n-1})$. Their probabilities multiply, so the entropy computation reduces to the Shannon entropy of the one-coordinate distribution.
[example: Two Symbol Bernoulli Shift]
Let $A=\{0,1\}$, let $p_0=q$, and let $p_1=1-q$ with $0<q<1$. The coordinate partition is $\mathcal P=\{[0],[1]\}$, where $\mu([0])=q$ and $\mu([1])=1-q$, so
\begin{align*}
H(\mathcal P)=-\mu([0])\log \mu([0])-\mu([1])\log \mu([1]).
\end{align*}
Substituting the two atom measures gives
\begin{align*}
H(\mathcal P)=-q\log q-(1-q)\log(1-q).
\end{align*}
For $n\ge 1$, an atom of $\bigvee_{j=0}^{n-1}T^{-j}\mathcal P$ is determined by a word $a_0\dots a_{n-1}\in\{0,1\}^n$. Its measure is the product of the one-coordinate probabilities:
\begin{align*}
\mu([a_0\dots a_{n-1}])=\prod_{j=0}^{n-1}p_{a_j}.
\end{align*}
Therefore the entropy of the $n$-block partition is
\begin{align*}
H\left(\bigvee_{j=0}^{n-1}T^{-j}\mathcal P\right)=-\sum_{(a_0,\dots,a_{n-1})\in\{0,1\}^n}\left(\prod_{j=0}^{n-1}p_{a_j}\right)\log\left(\prod_{j=0}^{n-1}p_{a_j}\right).
\end{align*}
Using $\log(\prod_{j=0}^{n-1}p_{a_j})=\sum_{j=0}^{n-1}\log p_{a_j}$, this becomes
\begin{align*}
H\left(\bigvee_{j=0}^{n-1}T^{-j}\mathcal P\right)=-\sum_{j=0}^{n-1}\sum_{(a_0,\dots,a_{n-1})\in\{0,1\}^n}\left(\prod_{k=0}^{n-1}p_{a_k}\right)\log p_{a_j}.
\end{align*}
For a fixed $j$, the inner sum factors as
\begin{align*}
\sum_{(a_0,\dots,a_{n-1})\in\{0,1\}^n}\left(\prod_{k=0}^{n-1}p_{a_k}\right)\log p_{a_j}=\left(\sum_{a_j\in\{0,1\}}p_{a_j}\log p_{a_j}\right)\prod_{k\ne j}\left(\sum_{a_k\in\{0,1\}}p_{a_k}\right).
\end{align*}
Since $\sum_{a_k\in\{0,1\}}p_{a_k}=q+(1-q)=1$, the fixed-$j$ contribution is
\begin{align*}
\sum_{a_j\in\{0,1\}}p_{a_j}\log p_{a_j}=q\log q+(1-q)\log(1-q).
\end{align*}
Hence
\begin{align*}
H\left(\bigvee_{j=0}^{n-1}T^{-j}\mathcal P\right)=n\left(-q\log q-(1-q)\log(1-q)\right).
\end{align*}
Dividing by $n$ gives the entropy rate seen by the coordinate generator:
\begin{align*}
h(q)=\lim_{n\to\infty}\frac{1}{n}H\left(\bigvee_{j=0}^{n-1}T^{-j}\mathcal P\right)=-q\log q-(1-q)\log(1-q).
\end{align*}
Thus the two-symbol Bernoulli shift has entropy equal to the binary entropy of its one-coordinate distribution.
[/example]
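The linear growth of the $n$-block entropy can be confirmed by brute force for small $n$. The sketch below is an illustration under assumed values ($q=0.3$, $n\le 8$); the helper names are hypothetical. It enumerates all words in $\{0,1\}^n$, assigns each the product measure, and compares $H_n/n$ with the one-coordinate entropy.

```python
import itertools
import math

def shannon_entropy(probabilities):
    """-sum p log p in nats, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def block_entropy(p, n):
    """Entropy of the n-block partition of the Bernoulli shift with base
    distribution p: every word gets the product of its symbol probabilities."""
    block_measures = (math.prod(word) for word in itertools.product(p, repeat=n))
    return shannon_entropy(block_measures)

q = 0.3
p = (q, 1 - q)
ratios = [block_entropy(p, n) / n for n in range(1, 9)]
print(shannon_entropy(p), ratios)  # every ratio should agree with H(p)
```

The enumeration has $2^n$ terms, so this check is only practical for short blocks; the factorisation argument in the example is what gives the identity for all $n$.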
This example is the template for the general computation. For a Bernoulli process, the formal Kolmogorov-Sinai definition still asks for entropy rates of joined time translates, while the probabilistic model presents independent coordinates with a fixed one-step distribution. The point to verify is that the coordinate generator sees all measurable information and that independence makes the $n$-block entropy grow exactly linearly.
[quotetheorem:6776]
[citeproof:6776]
The theorem shows that complete independence makes entropy additive across time. The independence hypothesis is doing real work: for a stationary Markov chain, the block probabilities do not factor into one-coordinate marginals, and the entropy rate is usually conditional entropy rather than $H(p)$. The generator hypothesis is also essential, because a non-generating partition may see only a factor and therefore undercount the entropy of the full system. The finite-entropy assumption rules out the case where the coordinate partition has infinite Shannon entropy, in which the finite numerical classification statements below no longer apply in this form. This computation raises the inverse question: if two systems have the same entropy, when does that force them to be the same system up to measurable change of coordinates?
## Entropy as an Obstruction to Isomorphism
How can one prove that two measure-preserving systems are not the same when their orbits may look symbolically complicated? The first answer supplied by entropy is negative: isomorphic systems must have the same Kolmogorov-Sinai entropy, so unequal entropy blocks an isomorphism.
[definition: Measure-Theoretic Isomorphism]
Let $(X,\mathcal F,\mu,T)$ and $(Y,\mathcal G,\nu,S)$ be measure-preserving systems. A measure-theoretic isomorphism is a bimeasurable bijection $\Phi:X_0\to Y_0$ between invariant full-measure sets $X_0\subset X$ and $Y_0\subset Y$ such that $\Phi_*\mu=\nu$ and
\begin{align*}
\Phi(Tx) = S(\Phi(x))
\end{align*}
for all $x\in X_0$.
[/definition]
This definition records the [equivalence relation](/page/Equivalence%20Relation) for the classification problem. Once this notion of sameness is fixed, the next task is to identify quantities that survive the transport of partitions through an isomorphism, and entropy is the central such quantity.
[quotetheorem:6778]
[citeproof:6778]
The obstruction is powerful because it turns an isomorphism question into a number. Its limitation is just as important: equal entropy is not sufficient for isomorphism among general measure-preserving systems. For instance, a Bernoulli shift and a non-Bernoulli $K$-automorphism can have the same entropy but fail to be isomorphic because finer mixing and independence properties differ. Even zero entropy does not collapse everything to one model, since an irrational rotation and the identity transformation both have zero entropy but are not isomorphic as dynamical systems. The Bernoulli case is therefore exceptional: only there will equality of entropy become a complete classification theorem.
[example: Unequal Entropy Non-Isomorphism]
Compare the fair two-symbol Bernoulli shift with $p=(1/2,1/2)$ and the biased two-symbol Bernoulli shift with $q=(1/3,2/3)$. By *Entropy of a Bernoulli Shift*, their entropies are computed from the one-coordinate distributions. For the fair shift,
\begin{align*}
H(p)=-\frac12\log\frac12-\frac12\log\frac12.
\end{align*}
Since both terms contain the same logarithm,
\begin{align*}
H(p)=-\left(\frac12+\frac12\right)\log\frac12.
\end{align*}
Because $\frac12+\frac12=1$ and $\log(1/2)=-\log 2$, this gives
\begin{align*}
H(p)=\log 2.
\end{align*}
For the biased shift,
\begin{align*}
H(q)=-\frac13\log\frac13-\frac23\log\frac23.
\end{align*}
Using $\log(1/3)=-\log 3$ and $\log(2/3)=\log 2-\log 3$, we get
\begin{align*}
H(q)=\frac13\log 3-\frac23(\log 2-\log 3).
\end{align*}
Expanding the last term gives
\begin{align*}
H(q)=\frac13\log 3-\frac23\log 2+\frac23\log 3.
\end{align*}
Combining the two $\log 3$ terms gives
\begin{align*}
H(q)=\log 3-\frac23\log 2.
\end{align*}
These values are unequal: if $\log 2=\log 3-\frac23\log 2$, then
\begin{align*}
\frac53\log 2=\log 3.
\end{align*}
Exponentiating gives $2^{5/3}=3$, and cubing both sides gives $2^5=3^3$, i.e.
\begin{align*}
32=27,
\end{align*}
which is impossible. Therefore $H(p)\ne H(q)$, so the *Kolmogorov Entropy Obstruction* rules out a measure-theoretic isomorphism between the two Bernoulli shifts.
[/example]
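A quick numerical check of the two entropies is sketched below. It is only a sanity check of the closed forms derived above, with a hypothetical helper `shannon_entropy`; the exact argument $32\ne 27$ in the example is what proves the inequality.

```python
import math

def shannon_entropy(p):
    """-sum p log p in nats, with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

h_fair = shannon_entropy((1 / 2, 1 / 2))         # fair two-symbol shift
h_biased = shannon_entropy((1 / 3, 2 / 3))       # biased two-symbol shift
closed_form = math.log(3) - (2 / 3) * math.log(2)

print(h_fair, h_biased, closed_form)
```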
This non-isomorphism example is the easy direction of classification. The deeper theorem says that, within Bernoulli shifts, entropy is not only an obstruction but a complete invariant.
[quotetheorem:6780]
This theorem is quoted without proof as a landmark classification theorem. The notes prove only the entropy obstruction direction directly; the converse is Ornstein's deep coding theorem and is used here to explain the shape of the classification landscape.
The Bernoulli hypothesis in Ornstein's theorem is essential: outside the Bernoulli class, equal entropy does not force isomorphism, as zero-entropy rotations, periodic systems, and many positive-entropy non-Bernoulli systems show. The finite-entropy assumption, for finite or countable alphabets, is also part of the classification statement, because the theorem compares systems by a finite Shannon entropy number and the standard block-coding construction needs finite information per coordinate. Equality of entropy is the exact numerical condition: unequal entropy is ruled out by Kolmogorov's obstruction, while equality becomes sufficient only after the very weak Bernoulli machinery supplies the missing independence structure. That machinery, developed by Ornstein, is far beyond the direct generator computations of the earlier chapters.
[example: Two and Three Symbol Bernoulli Shifts with Equal Entropy]
By *Entropy of a Bernoulli Shift*, the fair two-symbol shift has entropy
\begin{align*}
-\frac12\log\frac12-\frac12\log\frac12=-\left(\frac12+\frac12\right)\log\frac12.
\end{align*}
Since $\frac12+\frac12=1$, this becomes
\begin{align*}
-\left(\frac12+\frac12\right)\log\frac12=-\log\frac12.
\end{align*}
Using $\log(1/2)=-\log 2$, we get
\begin{align*}
-\log\frac12=\log 2.
\end{align*}
We now exhibit a three-symbol base distribution with the same entropy. For $t\in[0,1]$, set
\begin{align*}
q(t)=\left(1-\frac{2t}{3},\frac{t}{3},\frac{t}{3}\right).
\end{align*}
Its entropy is
\begin{align*}
H(q(t))=-\left(1-\frac{2t}{3}\right)\log\left(1-\frac{2t}{3}\right)-2\left(\frac{t}{3}\right)\log\left(\frac{t}{3}\right),
\end{align*}
with the endpoint convention $0\log 0=0$, justified by $\lim_{x\to0^+}x\log x=0$. At $t=0$,
\begin{align*}
H(q(0))=-1\log 1-0-0=0.
\end{align*}
At $t=1$,
\begin{align*}
H(q(1))=-3\left(\frac13\log\frac13\right).
\end{align*}
Since $3\cdot\frac13=1$, this is
\begin{align*}
H(q(1))=-\log\frac13.
\end{align*}
Using $\log(1/3)=-\log 3$, we obtain
\begin{align*}
H(q(1))=\log 3.
\end{align*}
Because $1<2<3$, monotonicity of the logarithm gives $0<\log 2<\log 3$, so $\log 2$ lies strictly between $H(q(0))=0$ and $H(q(1))=\log 3$. The function $H(q(t))$ is continuous on $[0,1]$, because $x\mapsto -x\log x$ extends continuously to $x=0$. Therefore the [intermediate value theorem](/theorems/180) gives some $t_*\in(0,1)$ such that
\begin{align*}
H(q(t_*))=\log 2.
\end{align*}
Thus the three-symbol Bernoulli shift with base distribution $q(t_*)$ and the fair two-symbol Bernoulli shift have equal entropy. By *[Ornstein Isomorphism Theorem](/theorems/6780)*, these Bernoulli shifts are measure-theoretically isomorphic even though their alphabets have different sizes; the invariant is the average information per coordinate, not the number of symbols.
[/example]
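The intermediate value argument is not constructive, but the parameter $t_*$ can be approximated by bisection. The following is a minimal sketch; the function names and the iteration count are assumptions made for illustration. Bisection needs only continuity and the sign change between $t=0$ and $t=1$, which are exactly the facts used in the example.

```python
import math

def entropy_q(t):
    """Entropy in nats of q(t) = (1 - 2t/3, t/3, t/3), with 0 log 0 = 0."""
    probs = (1 - 2 * t / 3, t / 3, t / 3)
    return -sum(x * math.log(x) for x in probs if x > 0)

def solve_equal_entropy(target, lo=0.0, hi=1.0, iterations=100):
    """Bisection for entropy_q(t) = target, assuming
    entropy_q(lo) < target < entropy_q(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if entropy_q(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t_star = solve_equal_entropy(math.log(2))
print(t_star, entropy_q(t_star), math.log(2))
```

The numerical value of `t_star` only locates a base distribution with entropy $\log 2$; the isomorphism itself still comes from Ornstein's theorem, not from this computation.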
The example explains why alphabet size is not a measure-theoretic invariant. A measurable recoding may turn binary independent coordinates into ternary independent coordinates, provided the average information per time step is unchanged.
## Very Weak Bernoulli Partitions and Finitary Intuition
What extra structure lets equal entropy become a positive isomorphism theorem? Ornstein's answer is that good finite blocks from the remote past and remote future must become nearly independent in a strong matching sense, not merely in the scalar sense detected by entropy.
[definition: Hamming Distance on Words]
Let $A$ be a finite alphabet. The normalized Hamming distance is the function $d_n:A^n\times A^n\to[0,1]$ defined by
\begin{align*}
d_n(u,v)=\frac{1}{n}|\{0\le j\le n-1:u_j\ne v_j\}|
\end{align*}
for $u=(u_0,\dots,u_{n-1})$ and $v=(v_0,\dots,v_{n-1})\in A^n$.
[/definition]
Hamming distance measures whether two long names agree on most time positions. To compare distributions of names, we need a way to match random words drawn from two laws and pay the average Hamming error under the best possible matching.
[definition: Bar Distance]
Let $\operatorname{Prob}(A^n)$ be the set of probability measures on $A^n$. The bar distance is the function $\bar d_n:\operatorname{Prob}(A^n)\times\operatorname{Prob}(A^n)\to[0,1]$ defined by
\begin{align*}
\bar d_n(\lambda,\rho)=\inf_{\pi}\int_{A^n\times A^n} d_n(u,v)\,d\pi(u,v),
\end{align*}
where $\lambda,\rho\in\operatorname{Prob}(A^n)$ and the infimum is over all couplings $\pi$ of $\lambda$ and $\rho$.
[/definition]
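The two distances can be made concrete in a small sketch. The Hamming function follows the definition directly. For the bar distance, only the case $n=1$ is computed: there $d_1(u,v)$ is the indicator of $u\ne v$, and the optimal coupling cost for that indicator is the total variation distance $\frac12\sum_a|\lambda(a)-\rho(a)|$. For $n>1$ the infimum over couplings is a linear programme, which this sketch does not attempt. The function names and the dictionary representation of measures are assumptions.

```python
def hamming_distance(u, v):
    """Normalised Hamming distance d_n(u, v) between two words of equal length n >= 1."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("words must have the same positive length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

def bar_distance_one_block(lam, rho):
    """Bar distance for n = 1 between two laws on the alphabet, given as
    dicts symbol -> probability. For n = 1 it equals total variation."""
    symbols = set(lam) | set(rho)
    return 0.5 * sum(abs(lam.get(a, 0.0) - rho.get(a, 0.0)) for a in symbols)

print(hamming_distance("0110", "0101"))
print(bar_distance_one_block({"0": 0.5, "1": 0.5}, {"0": 0.25, "1": 0.75}))
```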
The bar distance gives a metric for comparing finite name distributions up to a small density of coordinate errors. This motivates the very weak Bernoulli condition, which asks whether the future name distribution conditioned on a distant past is close to the unconditional future name distribution in precisely this metric.
[definition: Very Weak Bernoulli Partition]
Let $(X,\mathcal F,\mu,T)$ be an invertible measure-preserving system and let $\mathcal P$ be a finite partition. The partition $\mathcal P$ is very weak Bernoulli if for every $\varepsilon>0$ there exists $N\in\mathbb N$ such that for all future lengths $n\ge N$, all past lengths $m\ge 1$, and all atoms $C$ of the immediate past block partition $\bigvee_{j=-m}^{-1}T^{-j}\mathcal P$ outside a set of total measure at most $\varepsilon$, the conditional distribution of the future $n$-block name
\begin{align*}
(\mathcal P(x),\mathcal P(Tx),\dots,\mathcal P(T^{n-1}x))
\end{align*}
given $C$ has bar distance less than $\varepsilon$ from its unconditional distribution.
[/definition]
The definition formalises a robust separation of past and future, but the condition is only meaningful if it recognises the basic independent model it is meant to abstract. In a Bernoulli shift, distant past coordinates should not distort the distribution of a future block at all; the bar distance then measures this separation at the level of finite names rather than individual events.
The point to check is that the new condition is not too strong. If it failed already for independent coordinates, it would be the wrong abstraction for Bernoulli behaviour. The following theorem verifies that genuine Bernoulli generators satisfy very weak Bernoulli, so the later classification criterion starts from the intended model case.
[quotetheorem:6782]
[citeproof:6782]
This result is immediate for an independent process, but it marks the property that survives in more disguised Bernoulli systems. The finite-alphabet assumption matters because the bar-distance formulation compares finite words over a finite name space; countable versions require extra integrability and approximation control. The partition must also be the right observable: a coarse partition of a Bernoulli shift may be very weak Bernoulli while failing to generate the whole system, so it cannot by itself identify the original dynamics. The conclusion uses exact independence of separated coordinate blocks, not merely small correlation of a few functions.
The remaining classification gap is the converse direction needed for Bernoulli recognition. If a finite generator has very weak Bernoulli block statistics, then the observable already sees the whole system and its long blocks can be matched to independent Bernoulli names with small average error. The criterion below is the mechanism that turns this approximate block independence into an actual Bernoulli model.
[quotetheorem:6784]
This theorem is stated to explain the mechanism behind the isomorphism theorem rather than proved in full. Each hypothesis controls a specific obstruction. Without ergodicity, different invariant components can carry different statistics, so no single Bernoulli shift model is forced. Without invertibility, the two-sided past-and-future formulation of very weak Bernoulli no longer matches the system as stated, and one must pass to a natural extension or use a one-sided variant. Without a finite generating partition, the block names may either miss part of the system or require unbounded alphabet control. Without the very weak Bernoulli hypothesis, finite entropy and finite generation alone leave room for non-Bernoulli $K$-automorphisms whose block distributions have the right entropy but cannot be matched with Bernoulli names at small Hamming error density.
[example: Markov Chains that are Bernoulli]
Let the finite state space be $S$, and write cylinder atoms as
\begin{align*}
[i_0\dots i_{n-1}]=\{x\in S^{\mathbb Z}:x_0=i_0,\dots,x_{n-1}=i_{n-1}\}.
\end{align*}
For a stationary Markov chain with irreducible, aperiodic transition matrix $P=(P_{ij})$ and stationary distribution $\pi$, the Markov property gives
\begin{align*}
\mu([i_0\dots i_{n-1}])=\pi_{i_0}P_{i_0i_1}P_{i_1i_2}\cdots P_{i_{n-2}i_{n-1}}.
\end{align*}
The coordinate process is independent only in the special case where the next-step law does not depend on the present state. Indeed, irreducibility on the finite state space gives $\pi_i>0$ for every $i\in S$, so
\begin{align*}
\mu(x_1=j\mid x_0=i)=P_{ij}.
\end{align*}
If $x_0$ and $x_1$ were independent, then
\begin{align*}
P_{ij}=\mu(x_1=j)=\pi_j
\end{align*}
for every $i,j\in S$. Thus, unless every row of $P$ is the same distribution, the natural coordinates retain one step of dependence.
For $n\ge 2$, let $H_n$ be the entropy of the partition into length-$n$ coordinate cylinders. Using the cylinder formula above, we have
\begin{align*}
H_n=-\sum_{i_0,\dots,i_{n-1}\in S}\pi_{i_0}\prod_{r=0}^{n-2}P_{i_ri_{r+1}}\log\left(\pi_{i_0}\prod_{r=0}^{n-2}P_{i_ri_{r+1}}\right).
\end{align*}
Since $\log(ab)=\log a+\log b$ for positive $a,b$, with the convention $0\log 0=0$, this splits into the initial-state contribution and the transition contributions:
\begin{align*}
H_n=-\sum_{i_0,\dots,i_{n-1}\in S}\pi_{i_0}\prod_{r=0}^{n-2}P_{i_ri_{r+1}}\log\pi_{i_0}-\sum_{r=0}^{n-2}\sum_{i_0,\dots,i_{n-1}\in S}\pi_{i_0}\prod_{\ell=0}^{n-2}P_{i_\ell i_{\ell+1}}\log P_{i_ri_{r+1}}.
\end{align*}
For the first term, summing over $i_{n-1}$, then $i_{n-2}$, and so on gives $1$ at each step because each row of $P$ sums to $1$. Hence
\begin{align*}
-\sum_{i_0,\dots,i_{n-1}\in S}\pi_{i_0}\prod_{r=0}^{n-2}P_{i_ri_{r+1}}\log\pi_{i_0}=-\sum_{i_0\in S}\pi_{i_0}\log\pi_{i_0}.
\end{align*}
For a fixed $r$, stationarity gives
\begin{align*}
\mu(x_r=i,x_{r+1}=j)=\pi_iP_{ij}.
\end{align*}
Therefore the $r$th transition contribution is
\begin{align*}
\sum_{i_0,\dots,i_{n-1}\in S}\pi_{i_0}\prod_{\ell=0}^{n-2}P_{i_\ell i_{\ell+1}}\log P_{i_ri_{r+1}}=\sum_{i,j\in S}\pi_iP_{ij}\log P_{ij}.
\end{align*}
There are $n-1$ transition positions, so
\begin{align*}
H_n=-\sum_{i\in S}\pi_i\log\pi_i-(n-1)\sum_{i\in S}\pi_i\sum_{j\in S}P_{ij}\log P_{ij}.
\end{align*}
Dividing by $n$ gives
\begin{align*}
\frac{H_n}{n}=-\frac{1}{n}\sum_{i\in S}\pi_i\log\pi_i-\frac{n-1}{n}\sum_{i\in S}\pi_i\sum_{j\in S}P_{ij}\log P_{ij}.
\end{align*}
Since $S$ is finite, the initial entropy $-\sum_i \pi_i\log\pi_i$ is finite, so the first term, which carries the factor $1/n$, tends to $0$. Also $(n-1)/n\to 1$, so the entropy rate is
\begin{align*}
\lim_{n\to\infty}\frac{H_n}{n}=-\sum_{i\in S}\pi_i\sum_{j\in S}P_{ij}\log P_{ij}.
\end{align*}
By *[Ornstein Very Weak Bernoulli Criterion](/theorems/6784)*, mixing finite-state Markov shifts are Bernoulli, so this Markov shift is measure-theoretically isomorphic to a Bernoulli shift with exactly this entropy.
[/example]
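The block-entropy formula $H_n=H(\pi)+(n-1)h$ can be verified by enumeration for a small chain. The sketch below uses a hypothetical two-state transition matrix and its stationary vector, both chosen only for illustration; it checks the entropy computation and does not test the Bernoulli property, which is the deep part of the example.

```python
import itertools
import math

# Hypothetical two-state chain (an assumption for illustration).
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5 / 6, 1 / 6]  # stationary vector: pi P = pi

def cylinder_measure(word):
    """mu([i_0 ... i_{n-1}]) = pi_{i_0} P_{i_0 i_1} ... P_{i_{n-2} i_{n-1}}."""
    measure = pi[word[0]]
    for i, j in zip(word, word[1:]):
        measure *= P[i][j]
    return measure

def block_entropy(n):
    """Entropy H_n of the partition into length-n coordinate cylinders."""
    total = 0.0
    for word in itertools.product(range(2), repeat=n):
        m = cylinder_measure(word)
        if m > 0:
            total -= m * math.log(m)
    return total

initial_entropy = -sum(p * math.log(p) for p in pi)
entropy_rate = -sum(pi[i] * P[i][j] * math.log(P[i][j])
                    for i in range(2) for j in range(2))

for n in (2, 5, 10):
    print(n, block_entropy(n), initial_entropy + (n - 1) * entropy_rate)
```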
This example separates the appearance of dependence from the measure-theoretic classification. A Markov process remembers one step in its natural coordinates, but after a measurable recoding it may become a sequence of independent symbols.
## Factors of Bernoulli Shifts
If Bernoulli systems are the entropy building blocks, which lower-entropy systems can be obtained from them? Factor maps answer this by allowing an observer to read a coarser process from the Bernoulli shift.
[definition: Factor Map]
Let $(X,\mathcal F,\mu,T)$ and $(Y,\mathcal G,\nu,S)$ be measure-preserving systems. A factor map from $X$ to $Y$ is a measurable map $\pi:X\to Y$ such that $\pi_*\mu=\nu$ and
\begin{align*}
\pi\circ T=S\circ \pi
\end{align*}
holds modulo null sets.
[/definition]
A factor map formalises the idea of observing a system through a coarser measurable process. Since factors cannot increase entropy, the natural question is whether every ergodic process whose entropy fits below a Bernoulli source can actually be read from that source.
[quotetheorem:6785]
This is another statement-only structural theorem in the notes. Its proof belongs to Ornstein theory and is not supplied by the elementary entropy monotonicity argument; the point here is to record the exact theorem that entropy monotonicity suggests but does not prove.
This theorem marks the exact reach of Bernoulli randomness under factor maps. Its construction belongs to the same circle of ideas as the very weak Bernoulli criterion: one builds increasingly accurate names for the target system inside long independent source blocks while preserving the required statistics. The finite-entropy assumption on the Bernoulli source is needed because the theorem measures the source capacity by a finite entropy budget; infinite-entropy sources require a separate formulation, and zero-entropy Bernoulli sources cannot produce positive-entropy factors. Ergodicity of the target prevents a hidden decomposition into invariant pieces with different behaviours; without it, the correct statement must account for the ergodic decomposition rather than a single target process. The inequality $h_\nu(S)\le h_\mu(T)$ is necessary because entropy is monotone under factors: for example, a fair two-symbol Bernoulli shift of entropy $\log 2$ cannot factor onto a fair three-symbol Bernoulli shift of entropy $\log 3$, while a zero-entropy rotation can be a candidate target only for a source with entropy at least $0$ and still requires the factor construction rather than entropy alone.
[remark: Classification Picture]
For Bernoulli shifts, equal entropy gives isomorphism, and smaller entropy gives factors. For general measure-preserving systems, entropy remains an invariant and a monotone quantity under factors, but it does not classify systems by itself. The Bernoulli category is special because independence gives enough room for flexible measurable recoding.
[/remark]
The chapter therefore ends with a two-sided message. Kolmogorov-Sinai entropy supplies a universal obstruction to isomorphism and factor maps, while Ornstein theory shows that in the Bernoulli world this numerical obstruction is exact.
Bernoulli shifts provide the cleanest isomorphism theory, but many systems are better studied through symbolic recoding. The next chapter moves to Markov shifts and symbolic dynamics, where entropy can be read from finite combinatorics and where the same invariant governs shifts, factors, and coded hyperbolic maps.
# 6. Markov Shifts and Symbolic Dynamics
Symbolic dynamics gives a way to replace a complicated system by a space of infinite words together with the shift map. In Chapters 2 and 3, entropy was defined through partition entropy rates and generators; here the same quantity can be computed from finite combinatorial data. Markov shifts are the central model because their allowed transitions are encoded by a matrix, and the growth rate of admissible words is governed by Perron-Frobenius theory.
The chapter moves between three viewpoints. Topologically, a shift space is a closed shift-invariant subset of a full shift. Measure-theoretically, a Markov shift carries natural invariant measures, among which the Parry measure maximises entropy. Combinatorially, the entropy is the exponential growth rate of allowed paths in a directed graph.
## Topological Markov Chains and Transition Matrices
The first problem is to describe infinite symbolic orbits using only local transition rules. Before entropy can be measured by word growth, we need a space in which finite words are visible as local observations and the shift map turns time evolution into a single transformation. A full shift supplies this baseline object: it allows every word, so it has no memory. A Markov shift then imposes a nearest-neighbour constraint: whether a symbol may follow another is determined by a finite matrix.
[definition: Full Shift]
Let $A$ be a finite set with $|A|=k$. The two-sided full shift over $A$ is the compact metric space
\begin{align*}
A^{\mathbb Z} = \{x=(x_n)_{n\in\mathbb Z}: x_n\in A\},
\end{align*}
with the [product topology](/page/Product%20Topology), together with the shift map $\sigma:A^{\mathbb Z}\to A^{\mathbb Z}$ defined by $(\sigma x)_n=x_{n+1}$.
[/definition]
The full shift is the benchmark case: every finite block in $A^m$ occurs somewhere in the space. This makes the entropy computation depend only on counting words, so it gives the reference value against which constrained shifts are compared.
[example: Full Shift On k Symbols]
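Let $A=\{1,\dots,k\}$, and write $\mathcal L_m(A^{\mathbb Z})$ for the set of words of length $m$ that occur in the full shift. A word of length $m$ is a tuple $(a_0,\dots,a_{m-1})$ with each $a_i\in A$, and every such tuple occurs. There are $k$ choices for each coordinate, so multiplication of the $m$ choices gives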
\begin{align*}
|\mathcal L_m(A^{\mathbb Z})|=\underbrace{k\cdot k\cdots k}_{m\text{ factors}}=k^m.
\end{align*}
Hence the exponential word-growth rate is
\begin{align*}
\lim_{m\to\infty}|\mathcal L_m(A^{\mathbb Z})|^{1/m}=\lim_{m\to\infty}(k^m)^{1/m}=k.
\end{align*}
Equivalently,
\begin{align*}
\lim_{m\to\infty}\frac{1}{m}\log|\mathcal L_m(A^{\mathbb Z})|=\lim_{m\to\infty}\frac{1}{m}\log(k^m)=\log k.
\end{align*}
Thus the full shift has topological entropy $\log k$. For the uniform Bernoulli measure, each symbol has probability $1/k$, so the one-symbol entropy is
\begin{align*}
-\sum_{i=1}^k \frac{1}{k}\log\frac{1}{k}=-k\cdot\frac{1}{k}\log\frac{1}{k}=\log k.
\end{align*}
The measure entropy of the uniform Bernoulli measure therefore matches the topological entropy of the full shift.
[/example]
The full shift example shows that word growth is the right combinatorial object, but it does not yet encode any geometry of allowed transitions. To model systems with forbidden transitions, we need a finite object that records which symbols may follow which other symbols; a zero-one transition matrix does exactly this.
[definition: Topological Markov Chain]
Let $A=\{1,\dots,k\}$ and let $M=(M_{ij})_{1\le i,j\le k}$ be a $k\times k$ matrix with entries in $\{0,1\}$. The two-sided topological Markov chain associated to $M$ is
\begin{align*}
\Sigma_M = \{x\in A^{\mathbb Z}: M_{x_n x_{n+1}}=1 \text{ for all } n\in\mathbb Z\},
\end{align*}
with the shift map $\sigma:\Sigma_M\to\Sigma_M$.
[/definition]
The same construction also has a one-sided version indexed by $\mathbb N$. In entropy computations the two-sided and one-sided versions have the same finite blocks, so their topological entropies agree. The next problem is to name the finite blocks that really occur, because entropy will be computed from their exponential growth rather than from the ambient full shift.
[definition: Admissible Word]
Let $M$ be a zero-one transition matrix. A word $w=(w_0,\dots,w_{m-1})\in A^m$ is $M$-admissible if
\begin{align*}
M_{w_i w_{i+1}}=1 \quad \text{for } 0\le i\le m-2.
\end{align*}
The set of admissible words of length $m$ is denoted $\mathcal L_m(\Sigma_M)$.
[/definition]
Admissible words turn the transition matrix into finite directed paths, so they are the objects whose number will later be counted. Counting words alone does not yet specify which subsets of the shift space are being observed, nor which finite partitions generate the measurable dynamics. To connect that combinatorics back to the shift space itself, we need the sets of sequences in which a prescribed finite word appears at a prescribed location. These sets are the local observations of the symbolic orbit, and they provide both the topology and the finite partitions used in entropy computations.
[definition: Cylinder Set]
Let $X\subset A^{\mathbb Z}$ be a subshift, let $w=(w_0,\dots,w_{m-1})$ be a word over $A$, and let $n\in\mathbb Z$. The cylinder set determined by $w$ at position $n$ is
\begin{align*}
[w]_n=\{x\in X:x_n=w_0,\dots,x_{n+m-1}=w_{m-1}\}.
\end{align*}
[/definition]
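Cylinder sets form a basis for the product topology on a subshift. For Markov shifts in which every symbol has at least one allowed successor and one allowed predecessor (that is, $M$ has no zero row or column), the nonempty cylinders are exactly those determined by admissible words. Matrix powers count such words as paths in the transition graph, which is why spectral theory enters the entropy calculation.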
[example: Golden Mean Shift]
Let $A=\{0,1\}$ and forbid the block $11$. With the symbol order $(0,1)$, the allowed transitions are $0\to 0$, $0\to 1$, and $1\to 0$, while $1\to 1$ is forbidden; equivalently, $M_{00}=M_{01}=M_{10}=1$ and $M_{11}=0$. Thus an admissible word is exactly a binary word with no two consecutive symbols equal to $1$.
Let $a_m=|\mathcal L_m(\Sigma_M)|$. Split admissible words of length $m$ according to their last symbol. If the last symbol is $0$, the first $m-1$ symbols may be any admissible word of length $m-1$, giving $a_{m-1}$ possibilities. If the last symbol is $1$, then for $m\ge 2$ the previous symbol must be $0$, so the word is obtained by taking an admissible word of length $m-2$ and appending $01$, giving $a_{m-2}$ possibilities. These two cases are disjoint and exhaustive, hence
\begin{align*}
a_m=a_{m-1}+a_{m-2}\qquad (m\ge 2).
\end{align*}
The initial values are
\begin{align*}
a_0=1,\qquad a_1=2.
\end{align*}
Therefore $a_m=F_{m+2}$, where $F_0=0$, $F_1=1$, and $F_{n+1}=F_n+F_{n-1}$.
For the recurrence $a_m=a_{m-1}+a_{m-2}$, a trial solution $a_m=t^m$ with $t\ne 0$ gives
\begin{align*}
t^m=t^{m-1}+t^{m-2}.
\end{align*}
Dividing by $t^{m-2}$ gives
\begin{align*}
t^2=t+1.
\end{align*}
Thus
\begin{align*}
t^2-t-1=0.
\end{align*}
The [quadratic formula](/theorems/1301) gives the two roots
\begin{align*}
t=\frac{1+\sqrt5}{2}\quad\text{and}\quad t=\frac{1-\sqrt5}{2}.
\end{align*}
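The positive root is $\varphi=(1+\sqrt5)/2$, and the other root $\psi=(1-\sqrt5)/2$ satisfies $|\psi|<1<\varphi$. Every solution of the recurrence has the form $a_m=c\varphi^m+c'\psi^m$, and $c>0$ here because $a_m$ is positive and unbounded while $\psi^m\to 0$. Hence $a_m$ grows exponentially at rate $\varphi$, so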
\begin{align*}
\lim_{m\to\infty}a_m^{1/m}=\varphi.
\end{align*}
Thus forbidding only the block $11$ lowers the word-growth rate from $2$ for the full binary shift to the golden ratio $\varphi$.
[/example]
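The count $a_m=F_{m+2}$ can be checked by brute force for small $m$. The following is a minimal sketch, not part of the original notes; the function names `count_golden_mean_words` and `fibonacci` are illustrative choices.

```python
from itertools import product
from math import sqrt

def count_golden_mean_words(m):
    """Count binary words of length m that do not contain the block 11."""
    return sum(1 for w in product("01", repeat=m) if "11" not in "".join(w))

def fibonacci(n):
    """Return F_n with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

PHI = (1 + sqrt(5)) / 2

for m in range(0, 13):
    a_m = count_golden_mean_words(m)
    assert a_m == fibonacci(m + 2)  # a_m = F_{m+2}
    if m > 0:
        # The m-th root of a_m approaches PHI as m grows.
        print(m, a_m, round(a_m ** (1 / m), 4), round(PHI, 4))
```

The enumeration is exponential in $m$, so it is only a sanity check; the recurrence is the practical way to compute $a_m$.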
The golden mean shift already shows that entropy depends on the long-term path structure of the transition graph, not just on the number of symbols. If the graph has several components, different parts may have different growth rates; to state the spectral theorem that separates the irreducible case from the reducible case, we need the communication conditions below.
[definition: Irreducible and Aperiodic Matrix]
Let $M$ be a non-negative square matrix. The matrix $M$ is irreducible if for every pair of indices $i,j$ there exists $n\ge 1$ such that $(M^n)_{ij}>0$. It is aperiodic if it is irreducible and for each index $i$ the set $\{n\ge 1:(M^n)_{ii}>0\}$ has greatest common divisor $1$.
[/definition]
Irreducibility says that every state communicates with every other state after some positive time, while aperiodicity rules out a cyclic decomposition of returns. The theorem needed next is the spectral statement that turns these graph-theoretic hypotheses into precise asymptotics for the powers $M^n$, which are the path-counting matrices.
[quotetheorem:6787]
[citeproof:6787]
Perron--Frobenius converts long path counts into powers of a single eigenvalue, but its hypotheses are doing real work. For $M=\operatorname{diag}(2,3)$, the system splits into two non-communicating components and there is no single positive eigenvector seeing both components; the spectral radius only records the faster component. For the two-state matrix with allowed transitions $1\to 2$ and $2\to 1$ and no self-loops, the matrix is irreducible but periodic, and the powers $M^n$ oscillate rather than converging after the normalisation by $\lambda^n$. Thus irreducibility gives one communicating symbolic system, while aperiodicity is the extra condition needed for convergence and mixing-type conclusions, not for every entropy calculation.
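The two hypotheses can be tested mechanically on small matrices. The sketch below is an illustration under stated assumptions: it tracks only which entries of $M^n$ are positive, checks $n\le k$ for irreducibility, and checks return times up to $3k$ for the period, which is enough for an irreducible $k\times k$ matrix. All function names are hypothetical.

```python
from math import gcd

def bool_product(X, Y):
    """Boolean matrix product: entry (i, j) is True iff some t has X[i][t] and Y[t][j]."""
    k = len(X)
    return [[any(X[i][t] and Y[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]

def positivity_patterns(M, n_max):
    """Return the positivity patterns of M^1, ..., M^n_max."""
    k = len(M)
    base = [[M[i][j] > 0 for j in range(k)] for i in range(k)]
    patterns, current = [base], base
    for _ in range(n_max - 1):
        current = bool_product(current, base)
        patterns.append(current)
    return patterns

def is_irreducible(M):
    """Every pair (i, j) must satisfy (M^n)_ij > 0 for some 1 <= n <= k."""
    k = len(M)
    pats = positivity_patterns(M, k)
    return all(any(p[i][j] for p in pats) for i in range(k) for j in range(k))

def period_of_state(M, i):
    """gcd of the return times n <= 3k with (M^n)_ii > 0 (0 if there are none)."""
    g = 0
    for n, p in enumerate(positivity_patterns(M, 3 * len(M)), start=1):
        if p[i][i]:
            g = gcd(g, n)
    return g

def is_aperiodic(M):
    return is_irreducible(M) and all(period_of_state(M, i) == 1 for i in range(len(M)))

print(is_irreducible([[2, 0], [0, 3]]))        # the diagonal example
print(period_of_state([[0, 1], [1, 0]], 0))    # the two-state swap
print(is_aperiodic([[1, 1], [1, 0]]))          # the golden mean matrix
```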
The next question is whether this spectral growth is exactly the topological entropy of the shift, rather than only an estimate for a convenient counting problem. The answer is affirmative because cylinder covers and admissible words measure the same exponential complexity. The limitation is that the formula computes only the exponential growth rate of admissible words; it does not by itself describe mixing, uniqueness of equilibrium states for other potentials, or the distribution of words inside reducible components.
[quotetheorem:6789]
[citeproof:6789]
This formula reduces a dynamical invariant to linear algebra, but the reducible case should be read with care. For a block diagonal transition matrix, the corresponding shift is a disjoint union of component shifts, and the topological entropy is the maximum of the component entropies rather than an average over components. A block upper triangular matrix may allow paths to move from one component into another, but since a finite path can pass through only finitely many components, these transitions contribute at most polynomial factors to word counts and do not change the exponential rate. The formula therefore gives the leading growth exponent, not a classification of components, transitivity, or the number of maximal-entropy pieces.
To practise the computation, it is useful to take a transition matrix that is not the golden mean matrix and read the entropy directly from its characteristic polynomial.
[example: Entropy From An Adjacency Matrix]
Consider the transition matrix whose nonzero entries are $M_{11}=M_{12}=M_{21}=M_{23}=M_{32}=M_{33}=1$. With the convention $\det(\lambda I-M)$, the matrix $\lambda I-M$ has first row $(\lambda-1,-1,0)$, second row $(-1,\lambda,-1)$, and third row $(0,-1,\lambda-1)$.
Using the $3\times 3$ determinant formula with $a=\lambda-1$, $b=-1$, $c=0$, $d=-1$, $e=\lambda$, $f=-1$, $g=0$, $h=-1$, and $i=\lambda-1$, we get
\begin{align*}
\det(\lambda I-M)=(\lambda-1)\bigl(\lambda(\lambda-1)-(-1)(-1)\bigr)-(-1)\bigl((-1)(\lambda-1)-(-1)\cdot 0\bigr)+0\cdot\bigl((-1)(-1)-\lambda\cdot 0\bigr).
\end{align*}
The three terms simplify separately as
\begin{align*}
(\lambda-1)\bigl(\lambda(\lambda-1)-(-1)(-1)\bigr)=(\lambda-1)(\lambda^2-\lambda-1).
\end{align*}
\begin{align*}
-(-1)\bigl((-1)(\lambda-1)-(-1)\cdot 0\bigr)=-(\lambda-1).
\end{align*}
\begin{align*}
0\cdot\bigl((-1)(-1)-\lambda\cdot 0\bigr)=0.
\end{align*}
Therefore
\begin{align*}
\det(\lambda I-M)=(\lambda-1)(\lambda^2-\lambda-1)-(\lambda-1).
\end{align*}
Factoring out $\lambda-1$ gives
\begin{align*}
\det(\lambda I-M)=(\lambda-1)(\lambda^2-\lambda-2).
\end{align*}
Since $\lambda^2-\lambda-2=(\lambda-2)(\lambda+1)$, this is
\begin{align*}
\det(\lambda I-M)=(\lambda-1)(\lambda-2)(\lambda+1).
\end{align*}
Thus the eigenvalues are $1$, $2$, and $-1$. Their absolute values are $1$, $2$, and $1$, so the spectral radius is
\begin{align*}
\rho(M)=2.
\end{align*}
By *Entropy Of A Topological Markov Chain*, the associated Markov shift has
\begin{align*}
h_{\mathrm{top}}(\sigma|_{\Sigma_M})=\log\rho(M)=\log 2.
\end{align*}
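Thus the open-cover entropy of this shift and the exponential growth rate of its admissible words agree, and both equal $\log 2$. As a check, every row of $M$ sums to $2$, so $(1,1,1)^\top$ is a positive eigenvector with eigenvalue $2$.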
[/example]
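The same value can be seen directly from path counts. The sketch below is illustrative only: it assumes that admissible words of length $m$ correspond to paths with $m-1$ edges, and it counts those paths by repeated matrix-vector multiplication. The helper names `mat_vec` and `count_paths` are not from the notes.

```python
from math import log

def mat_vec(M, v):
    """Multiply the square matrix M by the column vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def count_paths(M, edges):
    """Number of paths with the given number of edges, i.e. the sum of the entries of M^edges."""
    v = [1] * len(M)
    for _ in range(edges):
        v = mat_vec(M, v)
    return sum(v)

M = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]

# (1/m) log a_m should approach log 2 as m grows.
for m in (2, 5, 10, 20, 40):
    a_m = count_paths(M, m - 1)
    print(m, a_m, log(a_m) / m, log(2))
```

Because every row of this matrix sums to $2$, the count is exactly $3\cdot 2^{m-1}$, so the convergence to $\log 2$ is visible without any eigenvalue computation.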
## Parry Measure and Maximal Entropy Measures
Having computed topological entropy, the next question is whether there is an invariant probability measure whose Kolmogorov--Sinai entropy attains this value. For full shifts the answer is the uniform Bernoulli measure. For irreducible Markov shifts the analogous measure is built from Perron--Frobenius eigenvectors.
[definition: Markov Measure On A Shift]
Let $P=(P_{ij})$ be a stochastic matrix on $A=\{1,\dots,k\}$ and let $\pi$ be a stationary probability vector, so $\pi P=\pi$. The associated stationary Markov measure is the Borel probability measure
\begin{align*}
\mu_{\pi,P}:\mathcal B(A^{\mathbb Z})\to [0,1]
\end{align*}
determined on cylinder sets by
\begin{align*}
\mu_{\pi,P}([a_0\dots a_{m-1}]_0)=\pi_{a_0}P_{a_0a_1}\cdots P_{a_{m-2}a_{m-1}}.
\end{align*}
[/definition]
When $P_{ij}=0$ whenever $M_{ij}=0$, the measure is supported on $\Sigma_M$. Its entropy is the average uncertainty of the next symbol given the present state.
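The cylinder formula is easy to test with exact arithmetic. The following sketch uses a hypothetical two-state chain that never takes the step $1\to 1$; the matrix `P`, the vector `pi`, and the function name `cylinder_measure` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def cylinder_measure(pi, P, word):
    """mu([a_0 ... a_{m-1}]_0) = pi[a_0] * P[a_0][a_1] * ... * P[a_{m-2}][a_{m-1}]."""
    mass = pi[word[0]]
    for a, b in zip(word, word[1:]):
        mass *= P[a][b]
    return mass

# Hypothetical chain on {0, 1}: P_11 = 0, so the block 11 never occurs.
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1), Fraction(0)]]
pi = [Fraction(2, 3), Fraction(1, 3)]  # stationary: pi P = pi

# Cylinders of a fixed length partition the space, so their masses sum to 1.
print(sum(cylinder_measure(pi, P, w) for w in product(range(2), repeat=4)))

# Stationarity: prepending any symbol and summing recovers the original mass.
for w in product(range(2), repeat=3):
    assert cylinder_measure(pi, P, w) == sum(
        cylinder_measure(pi, P, (a,) + w) for a in range(2))

# Words using a forbidden transition have measure zero.
print(cylinder_measure(pi, P, (0, 1, 1)))
```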
[quotetheorem:6791]
[citeproof:6791]
The Markov entropy formula depends on two hypotheses that cannot simply be dropped. Stationarity is what makes the entropy rate independent of absolute time: for instance, on $A=\{0,1\}$ take the transition matrix
\begin{align*}
P_{ij}=1/2 \quad \text{for all } i,j\in\{0,1\},
\end{align*}
but start at time $0$ from $x_0=0$ with probability $1$. The one-step conditional entropy at time $0$ is $\log 2$, but the marginal law at time $0$ is not the stationary vector $(1/2,1/2)$, so the resulting process is not shift-invariant and the stationary weighted formula does not apply to it as stated.
The Markov property is also essential. For a concrete invariant non-Markov example, take i.i.d. fair bits $(Y_k)_{k\in\mathbb Z}$, form doubled blocks $\dots,Y_{-1},Y_{-1},Y_0,Y_0,Y_1,Y_1,\dots$, and then choose the origin uniformly from the two possible phases. The resulting binary process $(X_n)_{n\in\mathbb Z}$ is stationary and has entropy rate $(1/2)\log 2$, because one fresh fair bit is introduced every two symbols. However, given $X_0$, the next symbol agrees with probability $3/4$ and disagrees with probability $1/4$: with probability $1/2$ the two symbols are copies of the same bit, and otherwise they are independent fair bits. Hence
\begin{align*}
H(X_1\mid X_0)=-\frac34\log\frac34-\frac14\log\frac14>(1/2)\log 2.
\end{align*}
This example shows why an arbitrary invariant process must be analysed through conditional entropies against longer pasts, not by assuming a one-step Markov transition formula from the displayed theorem. Thus the theorem evaluates stationary one-step Markov measures, not arbitrary invariant measures on the shift.
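The numbers in the doubled-block example can be reproduced by enumerating the eight equally likely choices of phase and of the two relevant bits. This is a small sketch with illustrative variable names, using natural logarithms.

```python
from fractions import Fraction
from itertools import product
from math import log

# Joint law of (X_0, X_1) for the doubled-block process.
joint = {}
for phase, y0, y1 in product((0, 1), repeat=3):
    # Phase 0: X_0 and X_1 are the two copies of y0.
    # Phase 1: X_0 is the second copy of y0 and X_1 is the first copy of y1.
    x0, x1 = (y0, y0) if phase == 0 else (y0, y1)
    joint[(x0, x1)] = joint.get((x0, x1), 0) + Fraction(1, 8)

p_agree = joint[(0, 0)] + joint[(1, 1)]
marginal = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}

# H(X_1 | X_0) = -sum p(x0, x1) log( p(x0, x1) / p(x0) )
H_cond = -sum(float(p) * log(float(p / marginal[a])) for (a, b), p in joint.items())

print(p_agree)                   # probability that X_1 equals X_0
print(H_cond, 0.5 * log(2))      # one-step conditional entropy versus the entropy rate
```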
The formula tells us how to evaluate a proposed stationary Markov measure, but it does not say which transition probabilities should maximise entropy under the allowed edges. The definition needed next solves this optimisation problem at the level of transitions: Perron--Frobenius weights each allowed edge according to the future growth available after taking it.
[definition: Parry Transition Matrix]
Let $M$ be an irreducible zero-one $k\times k$ matrix with Perron eigenvalue $\lambda$ and positive right eigenvector $r\in\mathbb R^k_+$. The Parry transition matrix is the $k\times k$ stochastic matrix $P=(P_{ij})$ with entries
\begin{align*}
P_{ij}=\frac{M_{ij}r_j}{\lambda r_i}.
\end{align*}
[/definition]
The rows of $P$ sum to $1$ because $Mr=\lambda r$, so $P$ is a Markov transition kernel on the finite state space $A$: from state $i$, the next state is chosen among the allowed successors $j$ with probabilities $P_{ij}$. A transition matrix alone is not yet a two-sided invariant measure; for that we also need a stationary distribution, and the left Perron eigenvector supplies the missing weights.
[definition: Parry Measure]
Let $M$ be irreducible, and choose positive Perron eigenvectors $r,l\in\mathbb R^k_+$ satisfying $Mr=\lambda r$, $l^\top M=\lambda l^\top$, and $l^\top r=1$. The Parry measure on $\Sigma_M$ is the Borel probability measure $\mu_P:\mathcal B(\Sigma_M)\to[0,1]$ given by the stationary Markov measure with transition matrix
\begin{align*}
P_{ij}=\frac{M_{ij}r_j}{\lambda r_i}
\end{align*}
and stationary vector $\pi_i=l_i r_i$.
[/definition]
This measure gives an especially simple formula for cylinders. If $w=(w_0,\dots,w_{m-1})$ is admissible, then most intermediate eigenvector factors cancel, so endpoint effects are separated from the main exponential factor $\lambda^{-(m-1)}$.
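The cancellation can be written out explicitly. The following worked computation uses only the definitions above, for an admissible word $w$ (so each $M_{w_tw_{t+1}}=1$) and the normalisation $l^\top r=1$:
\begin{align*}
\mu_P([w]_0)
=\pi_{w_0}\prod_{t=0}^{m-2}P_{w_tw_{t+1}}
=l_{w_0}r_{w_0}\prod_{t=0}^{m-2}\frac{r_{w_{t+1}}}{\lambda r_{w_t}}
=l_{w_0}r_{w_0}\cdot\frac{r_{w_{m-1}}}{\lambda^{m-1}r_{w_0}}
=\frac{l_{w_0}\,r_{w_{m-1}}}{\lambda^{m-1}}.
\end{align*}
The product telescopes because each $r_{w_t}$ with $0<t<m-1$ appears once in a numerator and once in a denominator, so only the endpoint weights $l_{w_0}$ and $r_{w_{m-1}}$ survive.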
[example: Cylinder Weights For The Golden Mean Shift]
For the golden mean transition matrix, with coordinates ordered as $(0,1)$, the nonzero entries are $M_{00}=M_{01}=M_{10}=1$ and $M_{11}=0$. The characteristic polynomial is
\begin{align*}
\det(\lambda I-M)=(\lambda-1)\lambda-(-1)(-1)=\lambda^2-\lambda-1.
\end{align*}
Thus the positive eigenvalue is
\begin{align*}
\varphi=\frac{1+\sqrt5}{2},
\end{align*}
and the equation $\varphi^2-\varphi-1=0$ gives
\begin{align*}
\varphi^2=\varphi+1.
\end{align*}
Take $r=(\varphi,1)$. Its two coordinates under multiplication by $M$ are
\begin{align*}
(Mr)_0=M_{00}r_0+M_{01}r_1=1\cdot\varphi+1\cdot 1=\varphi+1=\varphi^2,
\end{align*}
and
\begin{align*}
(Mr)_1=M_{10}r_0+M_{11}r_1=1\cdot\varphi+0\cdot 1=\varphi.
\end{align*}
On the other hand,
\begin{align*}
(\varphi r)_0=\varphi\cdot\varphi=\varphi^2,
\end{align*}
and
\begin{align*}
(\varphi r)_1=\varphi\cdot 1=\varphi.
\end{align*}
Hence $Mr=\varphi r$, so $r$ is a positive right Perron eigenvector.
By the definition of the Parry transition matrix,
\begin{align*}
P_{ij}=\frac{M_{ij}r_j}{\varphi r_i}.
\end{align*}
Therefore
\begin{align*}
P_{00}=\frac{M_{00}r_0}{\varphi r_0}=\frac{1\cdot\varphi}{\varphi\cdot\varphi}=\frac{1}{\varphi}.
\end{align*}
Similarly,
\begin{align*}
P_{01}=\frac{M_{01}r_1}{\varphi r_0}=\frac{1\cdot 1}{\varphi\cdot\varphi}=\frac{1}{\varphi^2}.
\end{align*}
For transitions out of state $1$,
\begin{align*}
P_{10}=\frac{M_{10}r_0}{\varphi r_1}=\frac{1\cdot\varphi}{\varphi\cdot 1}=1,
\end{align*}
and
\begin{align*}
P_{11}=\frac{M_{11}r_1}{\varphi r_1}=\frac{0\cdot 1}{\varphi\cdot 1}=0.
\end{align*}
The first row sums to $1$ because
\begin{align*}
P_{00}+P_{01}=\frac{1}{\varphi}+\frac{1}{\varphi^2}=\frac{\varphi+1}{\varphi^2}=\frac{\varphi^2}{\varphi^2}=1,
\end{align*}
and the second row sums to $1$ because
\begin{align*}
P_{10}+P_{11}=1+0=1.
\end{align*}
Thus after seeing a $1$, the next symbol must be $0$. After seeing a $0$, the allowed successors $0$ and $1$ have probabilities $1/\varphi$ and $1/\varphi^2$, respectively, reflecting the Perron weights of the future states.
[/example]
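The transition probabilities above can be combined with the stationary vector $\pi_i=l_ir_i$ to check numerically that the resulting Markov measure has entropy $\log\varphi$. The sketch assumes the usual Markov entropy formula $-\sum_i\pi_i\sum_jP_{ij}\log P_{ij}$ and uses the fact that the golden mean matrix is symmetric, so the right eigenvector also serves as a left eigenvector; the normalisation of `pi` is done by dividing by the sum of the weights.

```python
from math import sqrt, log

PHI = (1 + sqrt(5)) / 2
M = [[1, 1], [1, 0]]
r = [PHI, 1.0]   # right Perron eigenvector from the example
l = [PHI, 1.0]   # M is symmetric, so the same vector is a left eigenvector

# Parry transition matrix P_ij = M_ij r_j / (lambda r_i)
P = [[M[i][j] * r[j] / (PHI * r[i]) for j in range(2)] for i in range(2)]

# Stationary vector pi_i proportional to l_i r_i, normalised to total mass 1
weights = [l[i] * r[i] for i in range(2)]
pi = [w / sum(weights) for w in weights]

# Markov entropy; terms with P_ij = 0 contribute nothing
h = -sum(pi[i] * P[i][j] * log(P[i][j])
         for i in range(2) for j in range(2) if P[i][j] > 0)

print(P)
print(pi)
print(h, log(PHI))
```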
The golden mean calculation illustrates the principle behind the general construction: transitions leading to states with more future continuations receive more mass. The key theorem is that this balancing makes the measure entropy equal to the full topological entropy, so the Parry measure is the symbolic model of a maximal entropy measure.
[quotetheorem:6793]
[citeproof:6793]
The theorem separates uniqueness of the maximal entropy measure from stronger mixing conclusions. Irreducibility is enough for uniqueness because all states communicate and the Perron eigenvectors prescribe compatible cylinder weights throughout the whole graph. Aperiodicity is needed for the shift itself to be mixing: in the two-state system with allowed transitions $1\to 2$ and $2\to 1$ and no self-loops, the system alternates between the two symbols and cannot mix at odd and even times, although its Parry measure is still the unique measure of maximal entropy. If irreducibility is dropped, uniqueness may fail; for example, a block diagonal matrix with two irreducible components having the same spectral radius gives two disjoint maximal-entropy components and hence distinct maximal entropy measures.
[remark: Periodic Irreducible Shifts]
If $M$ is irreducible but periodic, the Parry measure is still the unique measure of maximal entropy. What fails is mixing for $\sigma$ itself: the space decomposes into cyclic classes permuted by $\sigma$, and if $p$ is the period then $\sigma^p$ preserves each class and is mixing on it.
[/remark]
## Entropy of Subshifts of Finite Type and Sofic Shifts
The matrix shifts above are the basic symbolic systems with finite memory. The next problem is to understand which shift spaces can be described by finitely many forbidden words, and how entropy behaves after finite-state factors.
[definition: Subshift]
Let $A$ be a finite alphabet, and let $\sigma:A^{\mathbb Z}\to A^{\mathbb Z}$ be the shift map $(\sigma x)_n=x_{n+1}$. A subshift is a closed subset $X\subset A^{\mathbb Z}$ such that $\sigma(X)=X$.
[/definition]
Closedness means that membership can be tested on finite windows: a sequence lies in $X$ as soon as each of its finite blocks occurs in some point of $X$, so $X$ is described by a (possibly infinite) list of forbidden finite words. Shift-invariance means that these forbidden words do not depend on absolute position. The next definition is needed to isolate the shifts whose constraints can be given by a finite list, since those are exactly the systems that can be recoded into matrix form.
[definition: Subshift Of Finite Type]
A subshift $X\subset A^{\mathbb Z}$ is a subshift of finite type if there exists a finite set $\mathcal F$ of finite words over $A$ such that
\begin{align*}
X=\{x\in A^{\mathbb Z}: \text{no translate of a word in } \mathcal F \text{ occurs in } x\}.
\end{align*}
[/definition]
Every one-step Markov shift is a subshift of finite type, with forbidden words of length $2$. The converse is less immediate because a finite type constraint may look several symbols into the past; the next theorem explains why this is only a matter of recoding.
[quotetheorem:6795]
[citeproof:6795]
Higher block presentation justifies computing entropy for every subshift of finite type through an adjacency matrix, after a possible recoding. The finite type hypothesis is essential: the construction uses a uniform bound $N$ on how far the forbidden constraints look, so length-$N$ blocks contain all the memory needed to continue the sequence. An arbitrary subshift may have constraints of unbounded length, and no finite block alphabet can record enough past information to make the system one-step Markov. This is why systems such as the even shift require a different finite-state description rather than an ordinary finite forbidden list.
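The recoding can be made concrete for a small forbidden list. The sketch below follows the standard higher block idea and may differ in detail from the construction in the theorem: states are the allowed words of length $N$, where $N+1$ is the length of the longest forbidden word (assumed to be at least $2$), and an edge joins two overlapping states when the combined word of length $N+1$ is allowed. The list `FORBIDDEN` and all function names are hypothetical.

```python
from itertools import product

def avoids(word, forbidden):
    """True if no forbidden word occurs as a block of the given word."""
    return not any(f in word for f in forbidden)

def higher_block_graph(alphabet, forbidden):
    """Return (N, states, edges) for the recoded one-step graph on N-blocks."""
    N = max(len(f) for f in forbidden) - 1
    states = [w for w in map("".join, product(alphabet, repeat=N))
              if avoids(w, forbidden)]
    edges = {u: [v for v in states
                 if u[1:] == v[:-1] and avoids(u + v[-1], forbidden)]
             for u in states}
    return N, states, edges

def count_paths(states, edges, steps):
    """Number of paths with the given number of edges in the recoded graph."""
    counts = {u: 1 for u in states}
    for _ in range(steps):
        counts = {u: sum(counts[v] for v in edges[u]) for u in states}
    return sum(counts.values())

def count_words(alphabet, forbidden, length):
    """Brute-force count of words of the given length avoiding every forbidden word."""
    return sum(1 for w in map("".join, product(alphabet, repeat=length))
               if avoids(w, forbidden))

FORBIDDEN = ["11", "000"]
N, states, edges = higher_block_graph("01", FORBIDDEN)
for steps in range(6):
    # Paths with `steps` edges correspond to allowed words of length N + steps.
    print(steps, count_paths(states, edges, steps), count_words("01", FORBIDDEN, N + steps))
```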
Some natural symbolic systems still need unbounded memory in the visible symbols, so no finite forbidden list describes them. The obstruction is not that the system has infinite combinatorial complexity, but that the relevant memory may be hidden in an auxiliary state. Sofic shifts solve this by allowing a finite-state Markov system upstairs and then reading a labelled output downstairs, exactly as finite automata recognise languages whose visible words need internal memory.
[definition: Sofic Shift]
A subshift $Y\subset B^{\mathbb Z}$ is sofic if there exist a subshift of finite type $X\subset A^{\mathbb Z}$ and a continuous shift-commuting surjection $\pi:X\to Y$.
[/definition]
A sofic shift can also be described by a finite labelled directed graph: paths in the graph produce symbol sequences by reading edge labels. This allows finite automata to encode constraints that need unbounded memory in the observed symbols.
[example: Even Shift]
The even shift consists of binary sequences in which the number of symbols equal to $0$ between consecutive symbols equal to $1$ is even.
It is not a subshift of finite type. Suppose, for contradiction, that membership were determined by forbidding finitely many words, and let $N$ be at least the length of every forbidden word. Choose an odd integer $L>N$. The periodic sequence
\begin{align*}
x=\cdots 1\,0^L\,1\,0^L\,1\cdots
\end{align*}
is not in the even shift, because each gap between consecutive symbols equal to $1$ contains the odd number $L$ of zeros. However, every block of $x$ of length at most $N$ contains at most one symbol equal to $1$, since two consecutive symbols equal to $1$ in $x$ are separated by $L>N$ zeros. Such a block is therefore either all zeros, or has the form
\begin{align*}
0^a1\,0^b
\end{align*}
with $a+b+1\le N$. The same block occurs in the periodic sequence
\begin{align*}
y=\cdots 1\,0^{L+1}\,1\,0^{L+1}\,1\cdots,
\end{align*}
because $L+1$ is even and larger than $N$. Thus every word of length at most $N$ seen in the forbidden sequence $x$ is also seen in the valid even-shift sequence $y$, so no finite forbidden list of maximum length $N$ can separate them.
The shift is nevertheless sofic. Use two states, $E$ and $O$, recording whether the current run of zeros since the last symbol equal to $1$ has even or odd length. Put labelled edges
\begin{align*}
E\xrightarrow{1}E,\qquad E\xrightarrow{0}O,\qquad O\xrightarrow{0}E.
\end{align*}
Starting in state $E$, reading a $0$ changes the parity state, while reading a $1$ is allowed only from $E$ and resets the parity to $E$. Hence a path can emit a $1$ exactly when the number of preceding zeros since the previous symbol equal to $1$ is even. Conversely, every binary sequence with even zero-runs between consecutive symbols equal to $1$ determines a path by following the parity of the current zero-run. Therefore the even shift is the labelled image of a finite directed graph, so it is sofic.
[/example]
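Both halves of the example can be checked on a small case. The sketch below simulates the two-state labelled graph and then replays the separation argument with the illustrative values $N=4$ and $L=5$; the names `in_even_language` and `blocks` are assumptions, and a finite word is accepted when it labels a path starting from either state.

```python
# Labelled edges of the two-state presentation: (state, symbol) -> next state.
EDGES = {("E", "1"): "E", ("E", "0"): "O", ("O", "0"): "E"}

def in_even_language(word):
    """True if the word labels a path in the graph starting from some state."""
    for start in ("E", "O"):
        state = start
        for symbol in word:
            state = EDGES.get((state, symbol))
            if state is None:
                break           # no edge with this label: try the other start state
        else:
            return True         # the whole word was read without getting stuck
    return False

def blocks(unit, max_len):
    """All blocks of length 1..max_len in the periodic sequence with repeating unit `unit`."""
    text = unit * (max_len // len(unit) + 2)
    return {text[i:i + n] for n in range(1, max_len + 1) for i in range(len(unit))}

N, L = 4, 5
x_unit = "1" + "0" * L          # odd gaps: not in the even shift
y_unit = "1" + "0" * (L + 1)    # even gaps: in the even shift

print(in_even_language("1" + "0" * L + "1"))        # a block of x of length L + 2
print(blocks(x_unit, N) <= blocks(y_unit, N))       # short blocks of x also occur in y
print(all(in_even_language(b) for b in blocks(x_unit, N)))
```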
The even shift shows why labelled graph presentations are more flexible than forbidden lists of bounded length. The next definition is needed because labelled graphs can overcount words: different paths may have the same label sequence, and entropy comparisons require a presentation where this ambiguity is controlled.
[definition: Right Resolving Presentation]
Let $G=(V,E)$ be a finite directed graph, let $B$ be a finite alphabet, and let $\ell:E\to B$ be a label map. The labelled graph presentation determined by $(G,\ell)$ is right resolving if, for every vertex $v\in V$, no two distinct outgoing edges from $v$ have the same label.
[/definition]
In a right-resolving presentation, a starting vertex and an emitted word determine at most one path. This is the property needed to compare word growth in the sofic shift with path growth in the underlying finite graph.
[quotetheorem:6798]
[citeproof:6798]
The theorem turns the finite automaton for a sofic shift into a practical entropy computation, provided the presentation does not hide exponential overcounting or irrelevant components. Right-resolving matters because, from a fixed starting vertex, a label word determines at most one path; without this condition, many different paths can carry the same label word, so path growth in the graph can be larger than word growth in the shift. Essentiality removes vertices and components that appear in the graph but never contribute to bi-infinite sequences in $Y$; such components can change $\rho(M)$ without changing the presented shift. The theorem therefore computes entropy from a suitable presentation, but it does not classify all presentations of the same sofic shift or say that every labelled graph with the same labels has the same spectral radius.
For the even shift, the same matrix as the golden mean shift appears, but it is now the adjacency matrix of a labelled presentation rather than a forbidden-transition matrix on visible symbols.
[example: Entropy Of The Even Shift]
Use states $E$ and $O$ for even and odd parity of the current run of zeros. The labelled edges are
\begin{align*}
E\xrightarrow{1}E,\qquad E\xrightarrow{0}O,\qquad O\xrightarrow{0}E.
\end{align*}
With the state order $(E,O)$, there is one edge from $E$ to $E$, one edge from $E$ to $O$, one edge from $O$ to $E$, and no edge from $O$ to $O$. Hence the underlying adjacency matrix is
\begin{align*}
M=\begin{pmatrix}1&1\cr 1&0\end{pmatrix}.
\end{align*}
We compute the spectral radius of $M$. First,
\begin{align*}
\lambda I=\begin{pmatrix}\lambda&0\cr 0&\lambda\end{pmatrix}.
\end{align*}
Therefore
\begin{align*}
\lambda I-M=\begin{pmatrix}\lambda&0\cr 0&\lambda\end{pmatrix}-\begin{pmatrix}1&1\cr 1&0\end{pmatrix}=\begin{pmatrix}\lambda-1&-1\cr -1&\lambda\end{pmatrix}.
\end{align*}
For a $2\times 2$ matrix, the determinant is the product of the diagonal entries minus the product of the off-diagonal entries, so
\begin{align*}
\det(\lambda I-M)=(\lambda-1)\lambda-(-1)(-1).
\end{align*}
Expanding the two products gives
\begin{align*}
(\lambda-1)\lambda=\lambda^2-\lambda.
\end{align*}
Also,
\begin{align*}
(-1)(-1)=1.
\end{align*}
Hence
\begin{align*}
\det(\lambda I-M)=\lambda^2-\lambda-1.
\end{align*}
The eigenvalues therefore solve
\begin{align*}
\lambda^2-\lambda-1=0.
\end{align*}
By the quadratic formula,
\begin{align*}
\lambda=\frac{1+\sqrt5}{2}\quad\text{or}\quad \lambda=\frac{1-\sqrt5}{2}.
\end{align*}
The first eigenvalue is $\varphi=(1+\sqrt5)/2$. The absolute value of the second eigenvalue is
\begin{align*}
\left|\frac{1-\sqrt5}{2}\right|=\frac{\sqrt5-1}{2}.
\end{align*}
Since
\begin{align*}
\frac{\sqrt5-1}{2}<\frac{\sqrt5+1}{2}=\varphi,
\end{align*}
the spectral radius is
\begin{align*}
\rho(M)=\varphi.
\end{align*}
By *Entropy Of A Sofic Shift*, the even shift has
\begin{align*}
h_{\mathrm{top}}(\sigma)=\log\rho(M)=\log\varphi.
\end{align*}
Thus the same spectral number as in the golden mean shift appears, but here it comes from a finite labelled presentation of a sofic shift rather than from forbidden transitions on the visible symbols themselves.
[/example]
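The role of the right-resolving condition can be seen by counting label words directly. The sketch below is an illustration with hypothetical function names: it checks that no vertex has two outgoing edges with the same label, enumerates the distinct label words of a given length, and compares their growth rate with $\log\varphi$.

```python
from math import log, sqrt

# Edges of the labelled presentation as (source, label, target).
EDGES = [("E", "1", "E"), ("E", "0", "O"), ("O", "0", "E")]

def is_right_resolving(edges):
    """True if no vertex has two distinct outgoing edges with the same label."""
    seen = set()
    for source, label, _ in edges:
        if (source, label) in seen:
            return False
        seen.add((source, label))
    return True

def label_words(edges, length):
    """Set of label words of the given length read along paths in the graph."""
    current = {(v, "") for e in edges for v in (e[0], e[2])}
    for _ in range(length):
        current = {(t, w + a) for (v, w) in current for (s, a, t) in edges if s == v}
    return {w for _, w in current}

PHI = (1 + sqrt(5)) / 2
print(is_right_resolving(EDGES))
for n in (1, 2, 3, 4, 8, 16):
    count = len(label_words(EDGES, n))
    print(n, count, log(count) / n, log(PHI))
```

Different starting vertices can still produce the same label word, so the number of words is at most the number of paths; right-resolving keeps this overcounting bounded by the number of vertices, which does not affect the exponential rate.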
## The Variational Principle In The Symbolic Model
The final question in the chapter is how the measure-theoretic and topological notions of entropy meet. For compact systems the variational principle is a general theorem; for shifts of finite type it can be proved by finite-block counting and Perron--Frobenius estimates. In this setting $\mathcal M_T(X)$ denotes the set of $T$-invariant Borel probability measures on $X$.
[quotetheorem:6800]
[citeproof:6800]
This symbolic proof contains the main pattern of the general variational principle. Topological entropy bounds how many orbit names are available, while measure entropy measures how many orbit names are typical for a given invariant measure. In shifts of finite type, compactness of the alphabet and the finite-memory condition are what make the upper bound so direct: each length-$m$ cylinder partition is finite, and its number of atoms is controlled by a single finite adjacency matrix. For countable-state Markov shifts or non-compact symbolic spaces, this finite partition argument can fail without extra tightness or recurrence hypotheses; entropy may escape to infinity or to states not seen by a fixed finite subgraph. In shifts of finite type, the two sides meet because the Perron eigenvectors produce a measure distributing mass across cylinders at exactly the exponential rate allowed by the transition graph.
The irreducibility hypothesis is the condition that makes the stated attainment result a single-component statement. If $M$ is reducible, the same upper bound still holds for every invariant measure, but the supremum is attained on any irreducible component whose spectral radius equals the maximum component spectral radius. Thus a reducible shift may have several measures of maximal entropy: for instance, a block diagonal matrix with two full two-shifts as diagonal blocks has topological entropy $\log 2$, and the uniform Bernoulli measure on each component is a measure of maximal entropy, as is every convex combination of the two. This is the limitation of the irreducible model theorem: it proves the variational principle and identifies the Parry measure in the transitive case, while reducible shifts require decomposing the graph into maximal entropy components.
The next layer of the subject keeps the same symbolic framework but replaces the zero potential by a general potential on a shift space. Then maximal entropy measures become equilibrium states, the Perron eigenvalue is replaced by pressure, and the Perron--Frobenius construction becomes the transfer-operator method used in thermodynamic formalism. These symbolic models also connect back to smooth dynamics: Markov partitions code hyperbolic maps by shifts of finite type, and the symbolic entropy computed here becomes the entropy of the original system after passing through the coding map.
Symbolic dynamics gives a combinatorial language for orbit complexity, but entropy also has a topological form that ignores any chosen measure. The next chapter develops topological entropy as the continuous analogue of the measure-theoretic theory, counting distinguishable orbit segments in compact spaces.
# 7. Topological Entropy
Topological entropy is the topological counterpart of Kolmogorov-Sinai entropy developed in the preceding measure-theoretic part of the course: it measures how many distinguishable orbit segments a continuous map can produce without choosing an invariant measure. The prerequisites are compact topological spaces, continuous maps, finite open covers, and the basic language of metric spaces; symbolic dynamics enters later as the main class of computable examples. The guiding idea is that a dynamical system has positive topological entropy when the number of orbit patterns visible at time scale $n$ grows exponentially in $n$. This chapter develops three equivalent languages for that growth: open covers, separated sets, and spanning sets. It then explains how entropy behaves under conjugacy and how symbolic codings turn entropy into a word-counting problem.
## Measuring Orbit Complexity Without a Measure
How can we count the complexity of a continuous dynamical system when no probability measure has been chosen? The measure-theoretic chapters counted names of atoms of iterated partitions with weights attached by a measure. In the topological setting the analogous object is an open cover: instead of asking which atom contains a point, we ask which member of a finite cover contains the point at each time.
Let $X$ be a compact [topological space](/page/Topological%20Space) and let $T:X \to X$ be continuous. Compactness makes finite subcovers available and keeps all the quantities below finite. To combine observations made at different times, we first need the operation that refines two covers at once.
[definition: Join of Open Covers]
The join operation sends a pair $(\mathcal U,\mathcal V)$ of open covers of $X$ to the open cover
\begin{align*}
\mathcal U \vee \mathcal V := \{U \cap V : U \in \mathcal U,\ V \in \mathcal V\}.
\end{align*}
[/definition]
The join records simultaneous information from two observations: a point is placed in both a member of $\mathcal U$ and a member of $\mathcal V$. For dynamics, the next problem is to record not two static observations, but the observations made at times $0,1,\dots,n-1$ along the same orbit.
[definition: Iterated Dynamical Cover]
Let $\mathcal U$ be a finite open cover of $X$. For $n \ge 1$, define
\begin{align*}
\mathcal U_0^{n-1} := \mathcal U \vee T^{-1}\mathcal U \vee \cdots \vee T^{-(n-1)}\mathcal U.
\end{align*}
[/definition]
An element of $\mathcal U_0^{n-1}$ consists of points whose first $n$ iterates can be placed into prescribed members of $\mathcal U$. The next counting problem is that this iterated cover may contain redundant sets, so we count the smallest number of its members needed to cover the whole space.
[definition: Covering Number of an Open Cover]
Let $X$ be a [compact space](/page/Compact%20Space). The covering-number functional is the map
\begin{align*}
N:\{\text{open covers of }X\}\to \mathbb N
\end{align*}
defined as follows: for an open cover $\mathcal U$ of $X$, $N(\mathcal U)$ is the least cardinality of a finite subcover of $\mathcal U$.
[/definition]
The number $N(\mathcal U_0^{n-1})$ measures the number of orbit names of length $n$ needed at the resolution of $\mathcal U$. Raw covering numbers depend on the length of observation and are not directly comparable between different $n$. To turn these counts into an invariant for a fixed observational scale, we must separate transient finite-time effects from persistent exponential growth. This requires a rate attached to the cover itself.
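The three operations just defined can be tried on a toy system. The sketch below is only an illustration under a strong simplifying assumption: $X$ is a six-point space with the discrete topology, so every subset is open, and $T$ is a cyclic rotation. The covering number is found by brute force, and empty intersections are dropped from the join because they never help to cover. All names are hypothetical.

```python
from itertools import combinations

def join(U, V):
    """Join of two covers: all pairwise intersections (empty ones are discarded)."""
    return {u & v for u in U for v in V if u & v}

def preimage_cover(T, U, X):
    """The cover T^{-1}U, consisting of the preimages of the members of U."""
    return {frozenset(x for x in X if T(x) in u) for u in U}

def iterated_cover(T, U, X, n):
    """U v T^{-1}U v ... v T^{-(n-1)}U."""
    result, current = set(U), set(U)
    for _ in range(n - 1):
        current = preimage_cover(T, current, X)
        result = join(result, current)
    return result

def covering_number(U, X):
    """Least number of members of U whose union is X, by exhaustive search."""
    members = list(U)
    for size in range(1, len(members) + 1):
        for choice in combinations(members, size):
            if frozenset().union(*choice) == X:
                return size
    raise ValueError("the family does not cover X")

X = frozenset(range(6))
T = lambda x: (x + 1) % 6
U = {frozenset({0, 1, 2, 3}), frozenset({3, 4, 5, 0})}

for n in range(1, 6):
    print(n, covering_number(iterated_cover(T, U, X, n), X))
```

On a finite space the covering number never exceeds the number of points, so these counts cannot grow exponentially; the example shows the mechanics of the definitions rather than positive entropy.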
[definition: Topological Entropy of an Open Cover]
Let $T:X \to X$ be continuous on a compact space and let $\mathcal U$ be a finite open cover of $X$. The topological entropy of $T$ relative to $\mathcal U$ is
\begin{align*}
h_{\mathrm{top}}(T,\mathcal U) := \lim_{n \to \infty} \frac{1}{n}\log N(\mathcal U_0^{n-1}).
\end{align*}
[/definition]
The logarithm turns multiplication of possibilities into addition, matching the subadditive structure of joins. A single cover sees the system at one finite resolution, so the next step is to allow every finite open cover and take the largest possible rate.
[definition: Topological Entropy via Open Covers]
Let $T:X \to X$ be continuous on a compact space. The topological entropy of $T$ is
\begin{align*}
h_{\mathrm{top}}(T) := \sup_{\mathcal U} h_{\mathrm{top}}(T,\mathcal U),
\end{align*}
where the supremum is over all finite open covers of $X$.
[/definition]
This definition makes sense on compact spaces without a metric. There is still a well-definedness issue inside the relative entropy $h_{\mathrm{top}}(T,\mathcal U)$: the formula uses a limit in $n$, and a priori the logarithmic covering counts could oscillate. Before the definition can be used, one must rule out the possibility that different observation lengths give incompatible asymptotic rates. The obstruction is controlled by the subadditive structure of iterated-cover joins.
[quotetheorem:6801]
[citeproof:6801]
The theorem is the topological analogue of the subadditivity argument used for entropy rates of partitions. Compactness is not a cosmetic assumption here: without it an open cover need not have a finite subcover, so $N(\mathcal U_0^{n-1})$ may be infinite before any limiting question can be asked. The theorem also does not say that the raw sequence $N(\mathcal U_0^{n-1})$ has regular growth; it says only that the logarithmic growth rate stabilises after dividing by $n$. This is why entropy is a rate rather than a raw count, and it prepares the passage to metric definitions where the same exponential rate will be recovered by packing and covering orbit segments.
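Subadditivity can be observed in the simplest symbolic case. The sketch below assumes that, for the golden mean shift and the cover by the two time-zero cylinders, the covering number of the $n$-fold iterated cover equals the number $a_n=F_{n+2}$ of admissible words of length $n$. It then checks $a_{m+n}\le a_ma_n$ and prints the normalised logarithms; the function name `golden_mean_count` is illustrative.

```python
from math import log, sqrt

def golden_mean_count(n):
    """a_n for the golden mean shift, from a_0 = 1, a_1 = 2, a_n = a_{n-1} + a_{n-2}."""
    a, b = 1, 2
    for _ in range(n):
        a, b = b, a + b
    return a

# Submultiplicativity of the counts is subadditivity of their logarithms.
for m in range(1, 31):
    for n in range(1, 31):
        assert golden_mean_count(m + n) <= golden_mean_count(m) * golden_mean_count(n)

PHI = (1 + sqrt(5)) / 2
for n in (1, 2, 5, 10, 100, 1000):
    # The normalised logarithm stays above log(PHI) and approaches it.
    print(n, log(golden_mean_count(n)) / n, log(PHI))
```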
[example: Irrational Rotation Has Zero Cover Growth]
Let $T:S^1\to S^1$ be $T(x)=x+\alpha \pmod 1$ with $\alpha\notin\mathbb Q$. We first compute the growth for a finite cover $\mathcal V=\{V_1,\dots,V_m\}$ by open arcs. Let $B$ be the finite set of all endpoints of the arcs in $\mathcal V$, so $|B|\le 2m$. For $0\le k\le n-1$, the cover $T^{-k}\mathcal V$ has endpoint set $T^{-k}B$, because $x\in T^{-k}V_i$ exactly when $T^k x\in V_i$. Hence every possible change in membership for the joined cover
\begin{align*}
\mathcal V_0^{n-1}
=
\mathcal V\vee T^{-1}\mathcal V\vee\cdots\vee T^{-(n-1)}\mathcal V
\end{align*}
can occur only at a point of
\begin{align*}
B_n
=
\bigcup_{k=0}^{n-1}T^{-k}B.
\end{align*}
Thus
\begin{align*}
|B_n|
\le
\sum_{k=0}^{n-1}|T^{-k}B|
=
\sum_{k=0}^{n-1}|B|
=
n|B|
\le
2mn.
\end{align*}
The points of $B_n$ cut the circle into at most $|B_n|$ complementary open arcs, and on each such arc the choice of which members of $T^{-k}\mathcal V$ contain the point is constant for every $0\le k\le n-1$. Therefore each complementary arc is contained in some member of $\mathcal V_0^{n-1}$. Adding, if necessary, one joined-cover set containing each boundary point gives
\begin{align*}
N(\mathcal V_0^{n-1})
\le
2|B_n|
\le
4mn.
\end{align*}
Consequently
\begin{align*}
0
\le
\frac{1}{n}\log N(\mathcal V_0^{n-1})
\le
\frac{1}{n}\log(4mn)
=
\frac{\log(4m)+\log n}{n},
\end{align*}
and the right-hand side tends to $0$ as $n\to\infty$. Hence $h_{\mathrm{top}}(T,\mathcal V)=0$ for every finite open arc cover $\mathcal V$.
Now let $\mathcal U$ be any finite open cover of $S^1$. Choose a finite open arc cover $\mathcal V$ which refines $\mathcal U$, meaning that each $V\in\mathcal V$ is contained in some $U\in\mathcal U$. Then each member of $\mathcal V_0^{n-1}$ is contained in a member of $\mathcal U_0^{n-1}$, so
\begin{align*}
N(\mathcal U_0^{n-1})
\le
N(\mathcal V_0^{n-1}).
\end{align*}
Therefore
\begin{align*}
0
\le
h_{\mathrm{top}}(T,\mathcal U)
\le
h_{\mathrm{top}}(T,\mathcal V)
=
0,
\end{align*}
so $h_{\mathrm{top}}(T,\mathcal U)=0$ for every finite open cover $\mathcal U$. Taking the supremum over $\mathcal U$ gives $h_{\mathrm{top}}(T)=0$. The dense orbit of an irrational rotation therefore creates recurrence without exponential growth of distinguishable orbit names.
[/example]
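The linear bound on cut points in the rotation example can be checked on a concrete cover. This is a rough sketch under assumptions: the rotation number is taken to be $(\sqrt5-1)/2$, the cover consists of three hypothetical arcs, and floating-point values within a tolerance `tol` are treated as the same point. The function names are illustrative.

```python
from math import log, sqrt

def cut_points(endpoints, alpha, n, tol=1e-9):
    """Distinct points of B_n, the union of T^{-k}B for 0 <= k < n, where T(x) = x + alpha mod 1."""
    raw = sorted((b - k * alpha) % 1.0 for b in endpoints for k in range(n))
    distinct = []
    for p in raw:
        # Merge floating-point duplicates that lie within the tolerance.
        if not distinct or p - distinct[-1] > tol:
            distinct.append(p)
    return distinct

def rate_bound(m, n):
    """The upper bound (1/n) log(4 m n) derived in the example."""
    return log(4 * m * n) / n

alpha = (sqrt(5) - 1) / 2  # assumed irrational rotation number
# Hypothetical cover by m = 3 arcs: (0, 0.3), (0.25, 0.6), (0.55, 1.05 mod 1).
endpoints = [0.0, 0.3, 0.25, 0.6, 0.55, 0.05]
m = 3

for n in (1, 10, 100, 1000):
    print(n, len(cut_points(endpoints, alpha, n)), 2 * m * n, rate_bound(m, n))
```

The number of cut points never exceeds $2mn$, and the bound $(1/n)\log(4mn)$ shrinks towards $0$ as $n$ grows.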
The rotation example shows that infinite orbits and dense orbits do not by themselves imply positive entropy. Entropy detects exponential orbit complexity, not recurrence or density alone.
## Separated and Spanning Orbit Sets
Open covers are intrinsic, but calculations often need a metric. The problem is to turn the informal phrase "two orbit segments are distinguishable up to time $n$" into a precise counting rule. On a compact metric space, the Bowen-Dinaburg metric packages all observations up to time $n$ into a single distance.
[definition: Bowen Dinaburg Metric]
Let $(X,d)$ be a compact metric space and let $T:X \to X$ be continuous. For $n \ge 1$, the Bowen-Dinaburg metric is the map
\begin{align*}
d_n:X\times X\to [0,\infty)
\end{align*}
given by
\begin{align*}
d_n(x,y):= \max_{0 \le k \le n-1} d(T^k x,T^k y).
\end{align*}
[/definition]
This metric regards two points as close when their first $n$ iterates remain close at every observed time. With this orbit metric in hand, the first natural count asks how many orbit segments can be mutually distinguished at a fixed resolution.
[definition: Separated Set]
Let $(X,d)$ be a compact metric space and let $T:X \to X$ be continuous. A set $E \subset X$ is $(n,\varepsilon)$-separated if for every distinct $x,y \in E$,
\begin{align*}
d_n(x,y) > \varepsilon.
\end{align*}
Let $s_n(\varepsilon)$ be the largest cardinality of an $(n,\varepsilon)$-separated subset of $X$.
[/definition]
Separated sets measure how many orbit segments can be told apart at resolution $\varepsilon$. The dual problem is approximation: rather than packing distinguishable orbits, we ask how many orbit templates are enough to approximate all orbits.
[definition: Spanning Set]
Let $(X,d)$ be a compact metric space and let $T:X \to X$ be continuous. A set $F \subset X$ is $(n,\varepsilon)$-spanning if for every $x \in X$ there exists $y \in F$ such that
\begin{align*}
d_n(x,y) \le \varepsilon.
\end{align*}
Let $r_n(\varepsilon)$ be the least cardinality of an $(n,\varepsilon)$-spanning subset of $X$.
[/definition]
The separated and spanning numbers are not equal in general, but they bracket each other after a small change of scale. This creates a consistency problem for the metric definition of entropy: packing distinguishable orbit segments and covering all orbit segments might have produced different exponential rates. The comparison theorem below resolves that ambiguity and also ties both metric counts back to the open-cover invariant.
[quotetheorem:6803]
[citeproof:6803]
The theorem explains why small changes in constants, such as $\varepsilon$ versus $\varepsilon/2$, do not affect entropy. Compactness is again essential: it gives finite spanning sets and Lebesgue numbers for finite covers, while on a non-compact space these counting functions can be infinite or fail to reflect the open-cover definition. For instance, the identity map on the discrete space $\mathbb N$ has infinite $(1,\varepsilon)$-separated sets for every $0<\varepsilon<1$, although this is not exponential orbit creation; it is a failure of compactness. The theorem is also tied to the chosen compact topology; replacing a compatible metric on $S^1$ by the discrete metric makes every distinct pair separated at scale $1/2$ immediately, which no longer represents the circle topology. What survives under compatible compact metrics is only the exponential rate after the fine-scale limit, and this is the bridge that lets explicit metric examples compute the intrinsic open-cover invariant.
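One half of the bracketing between separated and spanning counts can be seen on a finite sample: a separated set that cannot be enlarged is automatically spanning for that sample. The sketch below is an illustration only. It assumes the map $x\mapsto 2x \pmod 1$ on a finite grid of rational points and uses exact arithmetic; the helper names are hypothetical.

```python
from fractions import Fraction

def circle_dist(x, y):
    """Circle metric on R/Z."""
    t = (x - y) % 1
    return min(t, 1 - t)

def bowen_dist(x, y, n, m=2):
    """d_n(x, y) for T(x) = m x mod 1, computed exactly with Fractions."""
    best = Fraction(0)
    for _ in range(n):
        best = max(best, circle_dist(x, y))
        x, y = (m * x) % 1, (m * y) % 1
    return best

def greedy_separated(sample, n, eps, m=2):
    """Greedily build a maximal (n, eps)-separated subset of a finite sample."""
    chosen = []
    for x in sample:
        if all(bowen_dist(x, y, n, m) > eps for y in chosen):
            chosen.append(x)
    return chosen

sample = [Fraction(i, 240) for i in range(240)]   # assumed finite sample of the circle
eps = Fraction(1, 10)
E = greedy_separated(sample, n=4, eps=eps)
print(len(E))
```

Every sample point that was not selected was rejected because it lies within $\varepsilon$ of a selected point in the metric $d_n$, so the selected set spans the sample at the same scale.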
[example: Doubling Map on the Circle]
Identify $S^1$ with $\mathbb R/\mathbb Z$ and use the circle metric $d(x,y)=\min_{p\in\mathbb Z}|x-y-p|$. We show, using *[Equivalence of Separated and Spanning Definitions](/theorems/6803)*, that the separated and spanning growth rates are both $\log 2$.
For the lower bound, set
\begin{align*}
E_n=\left\{\frac{j}{2^n}:0\le j\le 2^n-1\right\}\subset S^1.
\end{align*}
If $x_j=j/2^n$ and $x_\ell=\ell/2^n$ are distinct, choose the representative $q\in\{1,\dots,2^n-1\}$ of $j-\ell \pmod {2^n}$. Write
\begin{align*}
q=2^a b,
\end{align*}
where $b$ is odd and $0\le a\le n-1$. With $k=n-a-1$, we have $0\le k\le n-1$ and
\begin{align*}
T^k x_j-T^k x_\ell
\equiv
\frac{2^k(j-\ell)}{2^n}
\equiv
\frac{2^{n-a-1}2^a b}{2^n}
=
\frac{b}{2}
\equiv
\frac12
\pmod 1.
\end{align*}
Therefore
\begin{align*}
d(T^k x_j,T^k x_\ell)=\frac12>\frac14,
\end{align*}
so $E_n$ is $(n,1/4)$-separated. Hence
\begin{align*}
s_n(1/4)\ge |E_n|=2^n,
\end{align*}
and for every $0<\varepsilon\le 1/4$,
\begin{align*}
\limsup_{n\to\infty}\frac1n\log s_n(\varepsilon)
\ge
\lim_{n\to\infty}\frac1n\log(2^n)
=
\log 2.
\end{align*}
For the upper bound, fix $\varepsilon>0$ and choose an integer $L\ge 1/(4\varepsilon)$. Let
\begin{align*}
F_{n,\varepsilon}
=
\left\{\frac{j}{L2^n}:0\le j\le L2^n-1\right\}.
\end{align*}
For any $x\in S^1$, choose $y\in F_{n,\varepsilon}$ with
\begin{align*}
d(x,y)\le \frac{1}{2L2^n}.
\end{align*}
Since $T^k(x)=2^k x\pmod 1$, the map $T^k$ is $2^k$-Lipschitz for the circle metric, so for $0\le k\le n-1$,
\begin{align*}
d(T^k x,T^k y)
\le
2^k d(x,y)
\le
2^{n-1}\frac{1}{2L2^n}
=
\frac{1}{4L}
\le
\varepsilon.
\end{align*}
Thus $F_{n,\varepsilon}$ is $(n,\varepsilon)$-spanning, and
\begin{align*}
r_n(\varepsilon)\le |F_{n,\varepsilon}|=L2^n.
\end{align*}
Consequently
\begin{align*}
\limsup_{n\to\infty}\frac1n\log r_n(\varepsilon)
\le
\lim_{n\to\infty}\frac1n\log(L2^n)
=
\lim_{n\to\infty}\left(\frac{\log L}{n}+\log 2\right)
=
\log 2.
\end{align*}
The separated lower bound and spanning upper bound therefore give
\begin{align*}
h_{\mathrm{top}}(T)=\log 2.
\end{align*}
The doubling map has entropy $\log 2$ because each additional iterate doubles the number of distinguishable orbit segments: the counts grow like $2^n$, so the exponential growth rate is $\log 2$.
[/example]
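The separation step in the doubling example can be verified exhaustively for a small $n$. The following sketch assumes $n=6$ and uses exact rational arithmetic; `two_adic_valuation` and `separation_time` are illustrative names for the quantities $a$ and $k=n-a-1$ in the argument.

```python
from fractions import Fraction

def two_adic_valuation(q):
    """Largest a with 2**a dividing the positive integer q."""
    a = 0
    while q % 2 == 0:
        q //= 2
        a += 1
    return a

def separation_time(j, l, n):
    """Time k = n - a - 1, where (j - l) mod 2**n = 2**a * b with b odd."""
    q = (j - l) % 2**n
    return n - two_adic_valuation(q) - 1

def circle_dist(x, y):
    t = (x - y) % 1
    return min(t, 1 - t)

def doubling_iterate(x, k):
    """T^k(x) = 2**k * x mod 1."""
    return (2**k * x) % 1

n = 6
distances = []
for j in range(2**n):
    for l in range(j):
        k = separation_time(j, l, n)
        x, y = Fraction(j, 2**n), Fraction(l, 2**n)
        distances.append(circle_dist(doubling_iterate(x, k), doubling_iterate(y, k)))

print(len(distances), set(distances))
```

Each pair of distinct grid points is at circle distance exactly $1/2$ at its separation time, which is the computation behind the claim that $E_n$ is $(n,1/4)$-separated.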
The doubling map exhibits the basic meaning of entropy as exponential orbit branching. The same mechanism appears for higher expanding maps, where each point has several inverse branches and the number of distinguishable histories grows exponentially.
[example: Expanding Maps of the Circle]
Let $T_m:S^1\to S^1$ be $T_m(x)=mx\pmod 1$ for an integer $m\ge2$, and use the circle metric $d(x,y)=\min_{p\in\mathbb Z}|x-y-p|$. We compute the separated and spanning growth rates and then apply *Equivalence of Separated and Spanning Definitions*.
For the lower bound, set
\begin{align*}
E_n=\left\{\frac{j}{m^n}:0\le j\le m^n-1\right\}.
\end{align*}
Take two distinct points $x_j=j/m^n$ and $x_\ell=\ell/m^n$ in $E_n$. Let $q\in\{1,\dots,m^n-1\}$ be the representative of $j-\ell\pmod {m^n}$. Write
\begin{align*}
q=m^a b,
\end{align*}
where $0\le a\le n-1$ and $m\nmid b$. Put $k=n-a-1$, so $0\le k\le n-1$. Since $T_m^k(x)=m^k x\pmod 1$,
\begin{align*}
T_m^k x_j-T_m^k x_\ell\equiv \frac{m^k(j-\ell)}{m^n}\pmod 1.
\end{align*}
Using $j-\ell\equiv q\pmod {m^n}$ and $q=m^a b$, this becomes
\begin{align*}
T_m^k x_j-T_m^k x_\ell\equiv \frac{m^{n-a-1}m^a b}{m^n}\pmod 1.
\end{align*}
Cancelling $m^{n-1}$ gives
\begin{align*}
T_m^k x_j-T_m^k x_\ell\equiv \frac{b}{m}\pmod 1.
\end{align*}
Because $m\nmid b$, the residue of $b$ modulo $m$ lies in $\{1,\dots,m-1\}$. Hence the circle distance from $b/m$ to $0$ is at least $1/m$, so
\begin{align*}
d(T_m^k x_j,T_m^k x_\ell)\ge \frac1m>\frac{1}{2m}.
\end{align*}
Thus $E_n$ is $(n,1/(2m))$-separated, and
\begin{align*}
s_n\left(\frac{1}{2m}\right)\ge |E_n|=m^n.
\end{align*}
Therefore
\begin{align*}
\limsup_{n\to\infty}\frac1n\log s_n\left(\frac{1}{2m}\right)\ge \lim_{n\to\infty}\frac1n\log(m^n)=\log m.
\end{align*}
Since smaller separation scales only allow larger separated sets, the separated-set entropy is at least $\log m$.
For the upper bound, fix $\varepsilon>0$ and choose an integer $L\ge 1/(2\varepsilon)$. Define
\begin{align*}
F_{n,\varepsilon}=\left\{\frac{j}{Lm^n}:0\le j\le Lm^n-1\right\}.
\end{align*}
For every $x\in S^1$, choose $y\in F_{n,\varepsilon}$ with
\begin{align*}
d(x,y)\le \frac{1}{2Lm^n}.
\end{align*}
The map $T_m^k(x)=m^k x\pmod 1$ is $m^k$-Lipschitz for the circle metric, so for $0\le k\le n-1$,
\begin{align*}
d(T_m^k x,T_m^k y)\le m^k d(x,y).
\end{align*}
Using the choice of $y$ and the bound $k\le n-1$, we get
\begin{align*}
d(T_m^k x,T_m^k y)\le m^{n-1}\frac{1}{2Lm^n}.
\end{align*}
Cancelling $m^{n-1}$ gives
\begin{align*}
d(T_m^k x,T_m^k y)\le \frac{1}{2Lm}\le \varepsilon.
\end{align*}
Hence $F_{n,\varepsilon}$ is $(n,\varepsilon)$-spanning, and
\begin{align*}
r_n(\varepsilon)\le |F_{n,\varepsilon}|=Lm^n.
\end{align*}
It follows that
\begin{align*}
\limsup_{n\to\infty}\frac1n\log r_n(\varepsilon)\le \lim_{n\to\infty}\frac1n\log(Lm^n).
\end{align*}
Since
\begin{align*}
\frac1n\log(Lm^n)=\frac{\log L}{n}+\log m,
\end{align*}
we obtain
\begin{align*}
\limsup_{n\to\infty}\frac1n\log r_n(\varepsilon)\le \log m.
\end{align*}
The separated lower bound and spanning upper bound give
\begin{align*}
h_{\mathrm{top}}(T_m)=\log m.
\end{align*}
Thus the map $x\mapsto mx\pmod 1$ has entropy $\log m$ because the number of length-$n$ orbit segments distinguishable at a fixed small scale grows like $m^n$.
[/example]
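The spanning estimate for $T_m$ can be tested on sample points. This is a minimal sketch, assuming exact rational sample points and the grid $F_{n,\varepsilon}=\{j/(Lm^n)\}$ from the example with $L=\lceil 1/(2\varepsilon)\rceil$; the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

def circle_dist(x, y):
    t = (x - y) % 1
    return min(t, 1 - t)

def bowen_dist(x, y, n, m):
    """d_n(x, y) for T_m(x) = m x mod 1."""
    best = Fraction(0)
    for _ in range(n):
        best = max(best, circle_dist(x, y))
        x, y = (m * x) % 1, (m * y) % 1
    return best

def nearest_grid_point(x, m, n, L):
    """Closest point of the grid {j / (L m^n)} to x on the circle."""
    N = L * m**n
    return Fraction(round(x * N) % N, N)

def check_spanning(m, n, eps, sample):
    """Check that every sample point is eps-close in d_n to its nearest grid point."""
    L = ceil(1 / (2 * eps))  # integer L >= 1/(2 eps), as in the example
    return all(
        bowen_dist(x, nearest_grid_point(x, m, n, L), n, m) <= eps
        for x in sample
    )

sample = [Fraction(i, 997) for i in range(997)]  # assumed test points
print(check_spanning(3, 4, Fraction(1, 10), sample))
print(check_spanning(2, 5, Fraction(1, 8), sample))
```

The check covers only the chosen sample, so it supports the estimate rather than proving it; the proof is the Lipschitz bound in the example.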
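These examples also illustrate why topological entropy is independent of the particular compatible metric on a compact space. In both computations the choice of scale entered only through constants such as $L$, which disappear after taking logarithms and dividing by $n$. Replacing the metric by another compatible one likewise changes the small scale at which two orbit segments are distinguishable, but not the eventual exponential rate.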
## Entropy as a Conjugacy Invariant
A useful dynamical invariant must not depend on the coordinates used to describe the system. The natural equivalence relation for topological dynamics is topological conjugacy: two systems are the same if a homeomorphism sends orbits of one to orbits of the other.
[definition: Topological Conjugacy]
Let $T:X\to X$ and $S:Y\to Y$ be continuous maps on compact spaces. A topological conjugacy from $(X,T)$ to $(Y,S)$ is a homeomorphism $\varphi:X\to Y$ such that
\begin{align*}
\varphi \circ T = S \circ \varphi.
\end{align*}
[/definition]
A conjugacy transports an orbit segment of $T$ into an orbit segment of $S$ with the same time ordering. The possible obstruction is that separated sets and open covers are expressed in the particular topology and coordinates of one space. To use entropy as an invariant, these finite-resolution counts must survive the change of coordinates induced by the conjugacy.
[quotetheorem:6805]
[citeproof:6805]
This result is the topological counterpart of entropy invariance under measure-theoretic isomorphism. The homeomorphism assumption is essential: a continuous factor map may collapse many orbit names, so semiconjugacy can decrease entropy and need not preserve it in both directions. The theorem also does not assert that equal entropy implies conjugacy; entropy is a coarse invariant, and many non-conjugate systems share the same numerical value. Its useful role is therefore obstructive: unequal entropies prove that two systems cannot be conjugate, while equal entropies merely leave the question open.
[example: Rotation and Doubling Are Not Conjugate]
Let $R_\alpha:S^1\to S^1$ be the irrational rotation $R_\alpha(x)=x+\alpha\pmod 1$, with $\alpha\notin\mathbb Q$, and let $D:S^1\to S^1$ be the doubling map $D(x)=2x\pmod 1$. From the rotation computation,
\begin{align*}
h_{\mathrm{top}}(R_\alpha)=0,
\end{align*}
and from the doubling-map computation,
\begin{align*}
h_{\mathrm{top}}(D)=\log 2.
\end{align*}
Since $2>1$ and the logarithm is strictly increasing,
\begin{align*}
\log 2>\log 1=0.
\end{align*}
Thus
\begin{align*}
h_{\mathrm{top}}(R_\alpha)=0\ne \log 2=h_{\mathrm{top}}(D).
\end{align*}
If $R_\alpha$ and $D$ were topologically conjugate, *Topological Entropy Is Conjugacy Invariant* would give
\begin{align*}
h_{\mathrm{top}}(R_\alpha)=h_{\mathrm{top}}(D),
\end{align*}
contradicting the inequality above. Therefore the irrational rotation and the doubling map are not topologically conjugate. The obstruction is numerical: conjugacy would have to preserve entropy, but these two systems have different entropy values.
[/example]
Conjugacy invariance is especially powerful when a complicated map can be coded by a symbolic system. The next section turns this into a concrete word-growth calculation.
## Expansive Maps and Symbolic Codings
When a system is expansive, a fixed positive scale is already enough to distinguish distinct orbits. This removes the need to pass through arbitrarily small scales in the separated-set definition, and it is the reason symbolic codings are so effective for hyperbolic and expanding systems.
[definition: Expansive Map]
Let $(X,d)$ be a compact metric space and let $T:X\to X$ be continuous. The map $T$ is expansive if there exists $\delta>0$ such that for every distinct $x,y\in X$ there is $n\ge0$ with
\begin{align*}
d(T^n x,T^n y)>\delta.
\end{align*}
[/definition]
The number $\delta$ is an expansivity scale. For homeomorphisms, many courses define expansivity using $n\in\mathbb Z$; for non-invertible maps the forward version above is the relevant one. In the general entropy definition, one must let the resolution tend to zero because a fixed scale may miss fine orbit complexity. Expansivity removes that obstruction by guaranteeing that distinct orbits eventually separate at a uniform positive scale.
[quotetheorem:6807]
[citeproof:6807]
This theorem explains why Markov partitions and symbolic codings can compute entropy. Without expansivity, a fixed $\varepsilon$ can miss complexity that appears only at smaller and smaller scales, so the limiting operation $\varepsilon\downarrow0$ in the general definition cannot in general be removed. A concrete model is a disjoint union of full shifts in which the component indexed by $j\ge2$ is the full shift on $j$ symbols, rescaled to have diameter $2^{-j}$, and the components accumulate at a fixed point. The system has infinite entropy, because it contains invariant full shifts of entropy $\log j$ for every $j$, but any fixed $\varepsilon>0$ cannot resolve the components of diameter less than $\varepsilon$ and therefore sees only the finite growth rate of the finitely many larger components. Compactness is also part of the mechanism: it turns pointwise eventual separation into a uniform finite-time comparison, while non-compact examples need extra hypotheses to control escape to infinity and infinite separated sets. The theorem does not say that every expansive system has a convenient symbolic model; it says that once a faithful coding is available at an expansivity scale, counting allowed symbolic words captures the full topological entropy, so the next object to define is the symbolic system itself.
[definition: Subshift]
Let $A$ be a finite alphabet and let $A^{\mathbb Z_{\ge 0}}$ have the product topology, with coordinates indexed by $0,1,2,\dots$. The shift map $\sigma:A^{\mathbb Z_{\ge 0}}\to A^{\mathbb Z_{\ge 0}}$ is defined by
\begin{align*}
(\sigma x)_k = x_{k+1}, \qquad k\in\mathbb Z_{\ge 0}.
\end{align*}
A one-sided subshift is a closed subset $\Sigma\subset A^{\mathbb Z_{\ge 0}}$ such that $\sigma(\Sigma)\subset\Sigma$.
[/definition]
A point of a subshift is an infinite admissible itinerary. To count finite orbit complexity, we need the finite pieces of these itineraries.
[definition: Language of a Subshift]
Let $\Sigma\subset A^{\mathbb Z_{\ge 0}}$ be a subshift. Its language of length $n$ is
\begin{align*}
\mathcal L_n(\Sigma):=\{a_0\cdots a_{n-1}: \text{there exists }x\in\Sigma\text{ with }x_i=a_i\text{ for }0\le i<n\}.
\end{align*}
[/definition]
For symbolic systems, orbit separation is the same as disagreement in a finite word when the standard product metric is used. The remaining issue is to justify replacing metric orbit counts by the purely combinatorial count of admissible words. The theorem below identifies those two growth rates, making entropy computable from the language.
[quotetheorem:6809]
[citeproof:6809]
The theorem reduces symbolic entropy to combinatorics, but the finiteness and closedness hypotheses do real work. A finite alphabet makes each word count $|\mathcal L_n(\Sigma)|$ finite, while closedness and shift-invariance ensure that the language describes an actual compact dynamical system rather than an arbitrary list of finite words. The result does not classify subshifts with the same entropy, and it does not say that every prescribed word-growth sequence comes from a subshift. The full shift is the simplest case, and subshifts of finite type add a matrix-counting layer.
[example: Full Shift Entropy]
Let $\Sigma=A^{\mathbb Z_{\ge 0}}$ be the full one-sided shift over a finite alphabet with $|A|=q$. Since no finite word is forbidden in the full shift, a length-$n$ word is formed by choosing one symbol from $A$ in each of the $n$ positions. Therefore
\begin{align*}
|\mathcal L_n(\Sigma)|=|A|^n.
\end{align*}
Using $|A|=q$, this becomes
\begin{align*}
|\mathcal L_n(\Sigma)|=q^n.
\end{align*}
By *[Entropy of a Subshift via Word Growth](/theorems/6809)*,
\begin{align*}
h_{\mathrm{top}}(\sigma)=\lim_{n\to\infty}\frac{1}{n}\log |\mathcal L_n(\Sigma)|.
\end{align*}
Substituting the word count gives
\begin{align*}
h_{\mathrm{top}}(\sigma)=\lim_{n\to\infty}\frac{1}{n}\log(q^n).
\end{align*}
Since $\log(q^n)=n\log q$, we get
\begin{align*}
h_{\mathrm{top}}(\sigma)=\lim_{n\to\infty}\frac{n\log q}{n}.
\end{align*}
Hence
\begin{align*}
h_{\mathrm{top}}(\sigma)=\log q.
\end{align*}
Thus the full shift has entropy $\log q$, reflecting that each new time coordinate contributes exactly one independent choice among $q$ symbols.
[/example]
The full shift has no transition restrictions. A subshift of finite type has a finite directed graph of allowed transitions, and entropy is governed by the growth rate of paths in that graph.
[example: Golden Mean Shift]
Let $\Sigma\subset\{0,1\}^{\mathbb Z_{\ge 0}}$ be the subshift in which the block $11$ is forbidden, and write
\begin{align*}
a_n=|\mathcal L_n(\Sigma)|.
\end{align*}
For $n=1$, the admissible words are $0$ and $1$, so
\begin{align*}
a_1=2.
\end{align*}
For $n=2$, the admissible words are $00,01,10$, so
\begin{align*}
a_2=3.
\end{align*}
For $n\ge 3$, every admissible word of length $n$ ends either in $0$ or in $1$. If it ends in $0$, deleting the final symbol gives an arbitrary admissible word of length $n-1$, and appending $0$ to any admissible word of length $n-1$ cannot create the forbidden block $11$. Thus there are $a_{n-1}$ admissible words ending in $0$. If an admissible word ends in $1$, then the preceding symbol must be $0$, so the word ends in $10$. Deleting this final block $10$ gives an arbitrary admissible word of length $n-2$, and appending $10$ to any admissible word of length $n-2$ cannot create the block $11$ at the join. Thus there are $a_{n-2}$ admissible words ending in $1$. The two cases are disjoint and exhaust all admissible words, hence
\begin{align*}
a_n=a_{n-1}+a_{n-2}
\end{align*}
for $n\ge 3$.
Let $F_0=0$, $F_1=1$, and $F_{r+1}=F_r+F_{r-1}$ for $r\ge 1$. Since
\begin{align*}
a_1=2=F_3
\end{align*}
and
\begin{align*}
a_2=3=F_4,
\end{align*}
the recurrence gives, by induction,
\begin{align*}
a_n=F_{n+2}.
\end{align*}
The characteristic equation for the Fibonacci recurrence is
\begin{align*}
\lambda^2=\lambda+1.
\end{align*}
Its two roots are $\phi=(1+\sqrt5)/2$ and $\psi=(1-\sqrt5)/2$. Define
\begin{align*}
G_r=\frac{\phi^r-\psi^r}{\sqrt5}.
\end{align*}
Because $\phi^2=\phi+1$ and $\psi^2=\psi+1$, the sequence $G_r$ satisfies $G_{r+1}=G_r+G_{r-1}$. Also
\begin{align*}
G_0=\frac{1-1}{\sqrt5}=0
\end{align*}
and
\begin{align*}
G_1=\frac{\phi-\psi}{\sqrt5}=1.
\end{align*}
Therefore $F_r=G_r$ for all $r$, and so
\begin{align*}
a_n=F_{n+2}=\frac{\phi^{n+2}-\psi^{n+2}}{\sqrt5}.
\end{align*}
Since $|\psi|<1<\phi$, we can factor the dominant term:
\begin{align*}
a_n=\frac{\phi^{n+2}}{\sqrt5}\left(1-\left(\frac{\psi}{\phi}\right)^{n+2}\right).
\end{align*}
Taking logarithms gives
\begin{align*}
\frac{1}{n}\log a_n=\frac{n+2}{n}\log\phi-\frac{1}{2n}\log 5+\frac{1}{n}\log\left(1-\left(\frac{\psi}{\phi}\right)^{n+2}\right).
\end{align*}
Because $|\psi/\phi|<1$, the last term tends to $0$, while $(n+2)/n\to 1$ and $(\log 5)/(2n)\to 0$. Hence
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\log a_n=\log\phi.
\end{align*}
By *Entropy of a Subshift via Word Growth*,
\begin{align*}
h_{\mathrm{top}}(\sigma|_\Sigma)=\lim_{n\to\infty}\frac{1}{n}\log|\mathcal L_n(\Sigma)|=\log\phi.
\end{align*}
Thus forbidding $11$ changes the full two-symbol count from $2^n$ to Fibonacci growth of order $\phi^n$, whose exponential rate is $\log\phi$.
[/example]
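The Fibonacci count can be confirmed by brute-force enumeration. The sketch below is illustrative: it lists all binary words of length $n$ that avoid the block $11$ and compares the count with $F_{n+2}$. For this particular shift every such word extends to a point of $\Sigma$ by appending zeros, so the enumeration does give $|\mathcal L_n(\Sigma)|$; for a general list of forbidden blocks that need not hold. The function names are hypothetical.

```python
from itertools import product
from math import log, sqrt

def admissible_words(n, alphabet="01", forbidden=("11",)):
    """All length-n words over the alphabet that contain no forbidden block."""
    words = ("".join(w) for w in product(alphabet, repeat=n))
    return [w for w in words if not any(f in w for f in forbidden)]

def fibonacci(r):
    """F_r with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(r):
        a, b = b, a + b
    return a

counts = {n: len(admissible_words(n)) for n in range(1, 16)}
phi = (1 + sqrt(5)) / 2

for n in (1, 2, 5, 10, 15):
    print(n, counts[n], fibonacci(n + 2), log(counts[n]) / n)
print(log(phi))
```

The normalised logarithms approach $\log\phi$ slowly, because of the factor $(n+2)/n$ and the constant $\sqrt5$ in the closed formula.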
Symbolic codings also appear in smooth dynamics, where the coding may be finite-to-one rather than a conjugacy. For hyperbolic systems, Markov partitions create subshifts of finite type whose entropy can often be computed from transition matrices.
[example: Hyperbolic Toral Automorphism]
Let $A\in SL(2,\mathbb Z)$ have eigenvalues $\lambda$ and $\lambda^{-1}$ with $\lambda>1$, and let $T_A:\mathbb T^2\to\mathbb T^2$ be the induced toral automorphism. A Markov partition for $T_A$ gives a finite symbolic coding by a subshift of finite type whose transition matrix $M$ has Perron-Frobenius eigenvalue $\lambda$; this is the standard Markov-partition computation for hyperbolic toral automorphisms.
For that subshift of finite type, admissible words of length $n$ are counted by paths of length $n-1$ in the transition graph. If $\mathbf 1$ denotes the column vector all of whose entries are $1$, then
\begin{align*}
|\mathcal L_n|
=
\mathbf 1^\top M^{n-1}\mathbf 1.
\end{align*}
By Perron-Frobenius asymptotics for the nonnegative transition matrix, which is irreducible for a Markov partition of the topologically transitive map $T_A$, there are constants $C_1,C_2>0$ such that
\begin{align*}
C_1\lambda^{n-1}
\le
\mathbf 1^\top M^{n-1}\mathbf 1
\le
C_2\lambda^{n-1}.
\end{align*}
Taking logarithms and dividing by $n$ gives
\begin{align*}
\frac{1}{n}\log C_1+\frac{n-1}{n}\log\lambda
\le
\frac{1}{n}\log|\mathcal L_n|
\le
\frac{1}{n}\log C_2+\frac{n-1}{n}\log\lambda.
\end{align*}
Since
\begin{align*}
\lim_{n\to\infty}\frac{\log C_1}{n}
=
\lim_{n\to\infty}\frac{\log C_2}{n}
=
0
\end{align*}
and
\begin{align*}
\lim_{n\to\infty}\frac{n-1}{n}\log\lambda
=
\log\lambda,
\end{align*}
the squeeze theorem gives
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\log|\mathcal L_n|
=
\log\lambda.
\end{align*}
Using *Entropy of a Subshift via Word Growth* for the symbolic model and the finite-to-one Markov coding of $T_A$, we obtain
\begin{align*}
h_{\mathrm{top}}(T_A)=\log\lambda.
\end{align*}
In higher dimensions, the same computation replaces the single expanding eigenvalue by the product of all expanding moduli. Thus, for a hyperbolic toral automorphism with eigenvalues $\lambda_i$, the exponential volume growth in the unstable directions is
\begin{align*}
\prod_{|\lambda_i|>1}|\lambda_i|,
\end{align*}
so the entropy is
\begin{align*}
\log\left(\prod_{|\lambda_i|>1}|\lambda_i|\right)
=
\sum_{|\lambda_i|>1}\log|\lambda_i|,
\end{align*}
with eigenvalues counted with algebraic multiplicity.
[/example]
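The path-counting formula $|\mathcal L_n|=\mathbf 1^\top M^{n-1}\mathbf 1$ and the eigenvalue formula can both be sketched in a few lines. Two assumptions are made here. The matrix `M` is only a stand-in: it is the transition matrix of the shift that forbids $11$, not the transition matrix of a Markov partition for any $T_A$. The matrix `A` is one standard hyperbolic element of $SL(2,\mathbb Z)$ chosen for illustration.

```python
from math import log, sqrt

def mat_mul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def word_count(M, n):
    """Sum of the entries of M^(n-1): the number of paths with n vertices."""
    size = len(M)
    power = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n - 1):
        power = mat_mul(power, M)
    return sum(sum(row) for row in power)

def expanding_eigenvalue(A):
    """Larger eigenvalue of a 2x2 integer matrix with determinant 1 and trace > 2."""
    t = A[0][0] + A[1][1]
    return (t + sqrt(t * t - 4)) / 2

M = [[1, 1], [1, 0]]  # stand-in transition matrix (assumption, see above)
A = [[2, 1], [1, 1]]  # illustrative hyperbolic matrix in SL(2, Z)

print([word_count(M, n) for n in range(1, 8)])
print(log(word_count(M, 400)) / 400)
print(log(expanding_eigenvalue(A)))
```

For the stand-in matrix the normalised logarithm of the word count approaches the logarithm of its Perron-Frobenius eigenvalue, which is the same squeeze argument used above for $M$ and $\lambda$.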
The toral example shows the geometric meaning of entropy: unstable directions create exponentially many distinguishable orbit segments. In non-uniform settings, symbolic models may exist only on invariant subsets or for selected parameter values, so entropy claims must specify the exact system.
[example: Logistic Map at the Full Tent Parameter]
Consider $f:[0,1]\to[0,1]$ given by $f(x)=4x(1-x)$. Define
\begin{align*}
\pi:\mathbb R/\mathbb Z\to[0,1],\qquad \pi(\theta)=\sin^2(\pi\theta),
\end{align*}
and let $D(\theta)=2\theta\pmod 1$. Then
\begin{align*}
\pi(D\theta)=\sin^2(2\pi\theta).
\end{align*}
Using $\sin(2u)=2\sin u\cos u$ with $u=\pi\theta$, we get
\begin{align*}
\sin^2(2\pi\theta)=(2\sin(\pi\theta)\cos(\pi\theta))^2.
\end{align*}
Expanding the square gives
\begin{align*}
(2\sin(\pi\theta)\cos(\pi\theta))^2=4\sin^2(\pi\theta)\cos^2(\pi\theta).
\end{align*}
Since $\cos^2(\pi\theta)=1-\sin^2(\pi\theta)$, this becomes
\begin{align*}
4\sin^2(\pi\theta)\cos^2(\pi\theta)=4\sin^2(\pi\theta)(1-\sin^2(\pi\theta)).
\end{align*}
By the definition of $\pi$, the last expression is
\begin{align*}
4\pi(\theta)(1-\pi(\theta))=f(\pi(\theta)).
\end{align*}
Therefore
\begin{align*}
\pi\circ D=f\circ\pi,
\end{align*}
so the logistic map at parameter $4$ is a factor of angle doubling.
Now code $D$ by binary expansions. For $a=(a_0,a_1,a_2,\dots)\in\{0,1\}^{\mathbb Z_{\ge0}}$, set
\begin{align*}
\beta(a)=\sum_{j=0}^{\infty}\frac{a_j}{2^{j+1}}\pmod 1.
\end{align*}
Then
\begin{align*}
D(\beta(a))\equiv 2\sum_{j=0}^{\infty}\frac{a_j}{2^{j+1}}\pmod 1.
\end{align*}
Multiplying the series by $2$ gives
\begin{align*}
2\sum_{j=0}^{\infty}\frac{a_j}{2^{j+1}}=a_0+\sum_{j=1}^{\infty}\frac{a_j}{2^j}.
\end{align*}
Since $a_0$ is an integer, it vanishes modulo $1$, so
\begin{align*}
D(\beta(a))\equiv \sum_{j=1}^{\infty}\frac{a_j}{2^j}\pmod 1.
\end{align*}
Reindexing with $r=j-1$ gives
\begin{align*}
\sum_{j=1}^{\infty}\frac{a_j}{2^j}=\sum_{r=0}^{\infty}\frac{a_{r+1}}{2^{r+1}}.
\end{align*}
The right-hand side is $\beta(\sigma a)$, hence
\begin{align*}
D\circ\beta=\beta\circ\sigma.
\end{align*}
The only ambiguity in binary expansion occurs at dyadic angles, namely points in the [countable set](/page/Countable%20Set)
\begin{align*}
\left\{\frac{k}{2^n}:0\le k\le 2^n,\ n\ge0\right\}.
\end{align*}
Away from this set, the binary coding is one-to-one, and the standard symbolic coding of angle doubling preserves the full two-symbol orbit names.
For the full one-sided shift on two symbols, every word of length $n$ is allowed, so
\begin{align*}
|\mathcal L_n|=2^n.
\end{align*}
By *Entropy of a Subshift via Word Growth*,
\begin{align*}
h_{\mathrm{top}}(\sigma)=\lim_{n\to\infty}\frac1n\log|\mathcal L_n|.
\end{align*}
Substituting $|\mathcal L_n|=2^n$ gives
\begin{align*}
h_{\mathrm{top}}(\sigma)=\lim_{n\to\infty}\frac1n\log(2^n).
\end{align*}
Since $\log(2^n)=n\log2$, this is
\begin{align*}
h_{\mathrm{top}}(\sigma)=\lim_{n\to\infty}\frac{n\log2}{n}.
\end{align*}
Therefore
\begin{align*}
h_{\mathrm{top}}(\sigma)=\log2.
\end{align*}
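The maps $\beta$ and $\pi$ are continuous surjections that are at most two-to-one: $\beta$ identifies only the two binary expansions of a dyadic angle, and $\pi$ identifies only $\theta$ with $1-\theta$. Finite-to-one factor maps preserve topological entropy, so the exponential word-growth rate of the full shift passes to $D$ and then to $f$: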
\begin{align*}
h_{\mathrm{top}}(f)=\log2.
\end{align*}
The value $\log2$ is specific to the parameter value $4$: for $f_c(x)=c\,x(1-x)$ with $c\ne4$, the set of allowed itineraries can change, so the entropy cannot be inferred from this computation.
[/example]
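Both intertwining identities in the logistic example can be spot-checked numerically. This is a sketch under assumptions: the relation $\pi\circ D=f\circ\pi$ is tested in floating point on a grid of angles, and $D\circ\beta=\beta\circ\sigma$ is tested exactly on one finite binary word, read as an expansion followed by zeros. The names `factor_map` and `beta` are illustrative.

```python
from fractions import Fraction
from math import pi, sin

def logistic(x):
    return 4 * x * (1 - x)

def doubling(theta):
    return (2 * theta) % 1

def factor_map(theta):
    """The map called pi in the example: theta -> sin^2(pi * theta)."""
    return sin(pi * theta) ** 2

def beta(word):
    """Binary value of a finite word, read as an expansion followed by zeros."""
    return sum(Fraction(a, 2 ** (j + 1)) for j, a in enumerate(word)) % 1

thetas = [k / 1000 for k in range(1000)]
max_error = max(abs(factor_map(doubling(t)) - logistic(factor_map(t))) for t in thetas)

word = (1, 0, 1, 1, 0, 0, 1)  # hypothetical finite itinerary
print(max_error)
print(doubling(beta(word)) == beta(word[1:]))
```

The first check agrees only up to rounding error, while the second is exact because the word is finite and the arithmetic is rational.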
This careful statement is important because the logistic family contains attracting periodic windows, chaotic invariant sets, and parameter values with different kneading data. Topological entropy is a property of the chosen map, not of a visual impression of its graph.
## What Topological Entropy Records
The definitions in this chapter all measure the same phenomenon: exponential growth in the number of orbit segments distinguishable at finite resolution. Open covers do this without a metric, separated and spanning sets do it with Bowen-Dinaburg metrics, and symbolic systems do it by counting admissible words. The invariant is preserved by topological conjugacy, vanishes for rotations, is positive for expanding maps, and is computable from word growth for subshifts.
Chapter 8 compares this topological invariant with measure-theoretic entropy. The variational principle will say that topological entropy is the supremum of Kolmogorov-Sinai entropy over invariant probability measures, tying the orbit-counting viewpoint back to the information-theoretic viewpoint developed earlier. This connection also places topological entropy beside broader invariants used elsewhere in dynamics: growth of words in symbolic dynamics, growth of volume in smooth dynamics, and algebraic growth rates for toral automorphisms all become different ways of detecting the same exponential orbit complexity.
Topological entropy and Kolmogorov-Sinai entropy now appear as two sides of the same notion of orbit complexity. The next chapter uses this parallel to prove the variational principle, showing that the topological growth rate is recovered by optimizing measure entropy over invariant measures.
# 8. The Variational Principle
Chapters 2 through 6 developed measure-theoretic entropy, and Chapter 7 developed topological entropy, as two ways of measuring orbit complexity. This chapter uses the measure-theoretic prerequisites from earlier in the course: compact metrizable spaces, weak* convergence of Borel probability measures, invariant measures, Kolmogorov-Sinai entropy, and the topological definitions of entropy via separated sets or open covers. With that background in place, the chapter explains why measure entropy and topological entropy are not separate invariants: for a continuous map on a compact space, the topological entropy is the supremum of the measure-theoretic entropies over invariant probability measures. The variational principle is the bridge from topological orbit growth to probabilistic descriptions of typical orbits, and it leads directly to the problem of finding measures of maximal entropy.
The central questions are existence and uniqueness. Does some invariant probability measure realise all the topological entropy? If it exists, is it unique? The answer depends on compactness properties of the invariant-measure simplex and on continuity properties of the entropy map.
## Invariant Probability Measures on Compact Dynamical Systems
The first problem is to identify the correct space over which the variational principle takes its supremum. A compact topological system may have many invariant measures, and these measures form a convex compact set on which entropy can be studied as a function.
[definition: Compact Topological Dynamical System]
A compact topological dynamical system is a pair $(X,T)$ where $X$ is a compact [metrizable space](/page/Metrizable%20Space) and $T:X\to X$ is continuous.
[/definition]
Compactness of $X$ gives a weak* compact space of probability measures, while continuity of $T$ makes the pushforward operation continuous. However, not every probability measure describes stationary orbit statistics: after applying $T$, the distribution may change.
This creates an obstruction for using an arbitrary probability measure to model long-term orbit behaviour. If a distribution changes under one application of $T$, then averages computed from it do not represent a time-stationary statistical state of the system. The variational principle therefore restricts the entropy comparison to measures whose mass assignment is unchanged by the dynamics, meaning that a set and its full preimage have the same measure.
[definition: Invariant Probability Measure]
Let $(X,T)$ be a compact topological dynamical system. A Borel probability measure $\mu$ on $X$ is $T$-invariant if
\begin{align*}
\mu(T^{-1}A)=\mu(A)
\end{align*}
for every Borel set $A\subset X$.
[/definition]
The collection of invariant measures will be denoted by $\mathcal M_T(X)$. Before taking a supremum over this collection, we need to know that it is non-empty for every compact topological dynamical system.
[quotetheorem:3423]
[citeproof:3423]
This theorem says that compact systems always possess at least one probabilistic model of their orbit structure, but it is only an existence theorem. It does not say that the invariant measure is unique, ergodic, or entropy-maximising. Compactness is essential in the proof: for example, the translation $T(x)=x+1$ on the non-compact space $\mathbb Z$ has no invariant probability measure, because invariance would force all singleton masses to be equal and hence either total mass $0$ or infinite total mass. The averaging argument also uses continuity of $T$ so that $f\circ T$ remains continuous and weak* limits can detect invariance.
It is useful to see the theorem in a zero-entropy model where the invariant-measure simplex is as small as possible.
[example: Irrational Rotation Invariant Measures]
Let $X=\mathbb R/\mathbb Z$ and $T(x)=x+\alpha \pmod 1$ with $\alpha\notin\mathbb Q$. If $m$ denotes Lebesgue measure on $X$, then for every interval $I\subset X$,
\begin{align*}
m(T^{-1}I)=m(I-\alpha)=m(I),
\end{align*}
because translation by $-\alpha$ preserves arc length. Since finite unions of intervals generate the Borel $\sigma$-algebra and both $m\circ T^{-1}$ and $m$ are Borel probability measures, this gives $m(T^{-1}A)=m(A)$ for every Borel set $A\subset X$.
By *Unique Ergodicity of Irrational Rotations*, $m$ is the only $T$-invariant Borel probability measure, so
\begin{align*}
\mathcal M_T(X)=\{m\}.
\end{align*}
Also, by *Kolmogorov-Sinai Entropy of Irrational Rotations*, $h_m(T)=0$. Therefore the variational supremum is taken over one measure and equals
\begin{align*}
\sup_{\mu\in\mathcal M_T(X)} h_\mu(T)
=\sup_{\mu\in\{m\}}h_\mu(T)
=h_m(T)
=0.
\end{align*}
Thus an irrational rotation is a model where the invariant-measure simplex is a singleton and the only available invariant measure has zero entropy.
[/example]
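Unique ergodicity can be seen numerically: orbit frequencies of an interval should approach its Lebesgue measure. The following is a minimal sketch of that check, not a proof. The function name `rotation_visit_frequency`, the golden-mean rotation number, the test interval $[0.2,0.5)$, and the orbit length are all illustrative assumptions.

```python
import math

def rotation_visit_frequency(alpha, a, b, n, x0=0.0):
    """Fraction of the orbit points x0 + k*alpha (mod 1), 0 <= k < n, lying in [a, b)."""
    hits = 0
    for k in range(n):
        x = (x0 + k * alpha) % 1.0  # k-th point of the rotation orbit
        if a <= x < b:
            hits += 1
    return hits / n

# Illustrative irrational rotation number (an assumption of this sketch).
alpha = (math.sqrt(5) - 1) / 2
freq = rotation_visit_frequency(alpha, 0.2, 0.5, 100_000)
print(freq)  # expected to be close to the interval length 0.3
```

Floating-point arithmetic only approximates an irrational rotation, so the output illustrates the statement rather than verifying it.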
In the rotation example the set $\mathcal M_T(X)$ is a single point, but most systems have many invariant measures. For variational arguments we need the general structural fact that $\mathcal M_T(X)$ is compact and convex, so maximising sequences have weak* limit points.
[quotetheorem:3451]
[citeproof:3451]
The compactness in this theorem is the first half of the existence story for maximal entropy measures. Its hypotheses are doing real work: compactness of $X$ gives weak* compactness of probability measures, and continuity of $T$ makes the invariance relation closed under weak* limits. If $X$ is not compact, a sequence of invariant or nearly invariant probability measures can lose mass at infinity. If $T$ is not continuous, invariance can also fail to be closed under weak* limits. For instance, let $X=\{0\}\cup\{1/n:n\in\mathbb N\}\subset[0,1]$, define $T(1/n)=1/n$ for every $n$, and set $T(0)=1$. Then each $\delta_{1/n}$ is $T$-invariant, but $\delta_{1/n}\to\delta_0$ weak*. The limit measure is not invariant because $T_*\delta_0=\delta_1\ne\delta_0$; the discontinuity at $0$ is exactly what prevents the weak* limit argument from passing through.
Compact convexity alone still cannot produce a maximal entropy measure. Compactness guarantees that upper semicontinuous functions attain their maximum, but entropy need not be upper semicontinuous in arbitrary compact systems. The second half of the existence story therefore asks whether entropy behaves well enough under weak* limits, and that question only becomes meaningful after comparing measure entropy with topological entropy.
## Measure Entropy Versus Topological Entropy
The next problem is to compare two definitions that appear to live in different worlds. Topological entropy counts distinguishable orbit segments using open covers or separated sets, while Kolmogorov-Sinai entropy measures the asymptotic information produced by finite measurable partitions. The variational principle states that the topological count is exactly the largest possible measure-theoretic information rate.
[definition: Entropy Map]
Let $(X,T)$ be a compact topological dynamical system. The entropy map is the function $h_T:\mathcal M_T(X)\to [0,\infty]$ defined by $h_T(\mu)=h_\mu(T)$, where $h_\mu(T)$ is the Kolmogorov-Sinai entropy of $T$ with respect to $\mu$.
[/definition]
The notation separates the system $T$ from the measure $\mu$. At this point there are two competing ways to measure complexity: one counts all distinguishable topological orbit segments, while the other measures the information rate seen by a stationary probability law. The central comparison problem is whether the topological count is exactly recovered by optimising over invariant measures.
[quotetheorem:6728]
[citeproof:6728]
The principle says that topological entropy is not merely a covering invariant; it is the best entropy rate seen by an invariant probability measure. Compactness and invariance are essential to the statement: without compactness the averaging measures used in the proof may not have a convergent subsequence, and without invariance the quantity $h_\mu(T)$ is not the Kolmogorov-Sinai entropy of a stationary process. A concrete non-compact failure is the translation $T(n)=n+1$ on $\mathbb Z$: empirical measures along an orbit drift to infinity, and no invariant probability measure exists because invariance would assign the same mass to every singleton. The theorem is also a supremum statement, not an attainment statement. For instance, in a compact symbolic system formed by adjoining a limiting fixed point to a sequence of disjoint mixing subshifts whose entropies increase to a value $H$ but never equal $H$, invariant measures on the subshifts can have entropies tending to $H$, while any weak* limit may sit on the fixed point and have entropy $0$. This is why the next section studies measures of maximal entropy and the regularity of the entropy map.
The full shift provides the model computation where orbit counting and measure entropy match exactly.
[example: Full Shift]
Let $X=\{1,\dots,m\}^{\mathbb Z}$ and let $\sigma:X\to X$ be the shift. A length-$n$ word is a string $(a_0,\dots,a_{n-1})\in\{1,\dots,m\}^n$. There are $m$ choices for each coordinate, so the number of length-$n$ words is
\begin{align*}
m\cdot m\cdots m=m^n.
\end{align*}
Therefore the exponential word-growth rate is
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\log(m^n)=\lim_{n\to\infty}\frac{n\log m}{n}=\log m.
\end{align*}
Thus $h_{\mathrm{top}}(\sigma)=\log m$.
Now let $\mu$ be the Bernoulli measure with symbol weights $\mu(x_0=i)=1/m$ for $1\le i\le m$, and let $\mathcal P=\{P_1,\dots,P_m\}$, where $P_i=\{x\in X:x_0=i\}$. The partition $\mathcal P$ is generating for the full shift because the iterates $\sigma^{-k}\mathcal P$ record the coordinate $x_k$. The atoms of $\bigvee_{k=0}^{n-1}\sigma^{-k}\mathcal P$ are the cylinders determined by $(x_0,\dots,x_{n-1})=(a_0,\dots,a_{n-1})$. Each such cylinder has Bernoulli measure
\begin{align*}
\frac{1}{m}\cdot \frac{1}{m}\cdots \frac{1}{m}=m^{-n},
\end{align*}
and there are $m^n$ such cylinders. Hence
\begin{align*}
H_\mu\left(\bigvee_{k=0}^{n-1}\sigma^{-k}\mathcal P\right)=-m^n m^{-n}\log(m^{-n})=-\log(m^{-n})=n\log m.
\end{align*}
Since $\mathcal P$ is generating,
\begin{align*}
h_\mu(\sigma)=\lim_{n\to\infty}\frac{1}{n}H_\mu\left(\bigvee_{k=0}^{n-1}\sigma^{-k}\mathcal P\right)=\lim_{n\to\infty}\frac{n\log m}{n}=\log m.
\end{align*}
Thus the uniform Bernoulli measure has entropy equal to $h_{\mathrm{top}}(\sigma)$, so it attains the variational supremum.
[/example]
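The cylinder computation can be checked by brute force for small $m$ and $n$. The sketch below sums $-\mu(C)\log\mu(C)$ over all length-$n$ cylinders of a Bernoulli measure and divides by $n$. The helper name `join_entropy` and the non-uniform weight vector are assumptions chosen for illustration; the comparison uses the standard fact that a Bernoulli measure with weights $p$ has entropy $-\sum_i p_i\log p_i$.

```python
import itertools
import math

def join_entropy(p, n):
    """Entropy of the n-fold join of the time-zero partition for Bernoulli weights p."""
    total = 0.0
    for word in itertools.product(range(len(p)), repeat=n):
        mass = math.prod(p[i] for i in word)  # Bernoulli mass of the cylinder [word]
        if mass > 0:
            total -= mass * math.log(mass)
    return total

m, n = 3, 5
uniform = [1 / m] * m
skewed = [0.5, 0.3, 0.2]  # illustrative non-uniform weights

rate_uniform = join_entropy(uniform, n) / n
rate_skewed = join_entropy(skewed, n) / n
print(rate_uniform, math.log(m), rate_skewed)
```

For the uniform weights the rate should agree with $\log m$ up to rounding, while the skewed weights give a strictly smaller value.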
The full shift is the clean model case: all words are allowed and the uniform Bernoulli measure spreads mass evenly across them. More constrained symbolic systems require a transition matrix and lead to Perron-Frobenius theory.
[example: Subshift of Finite Type]
Let $A$ be an irreducible $m\times m$ zero-one matrix, and let $X_A$ be the two-sided subshift whose points $x=(x_k)_{k\in\mathbb Z}$ satisfy $A_{x_kx_{k+1}}=1$ for every $k$. By the *Perron-Frobenius theorem*, the spectral radius $\lambda_A>0$ of $A$ is an eigenvalue with positive left and right eigenvectors $\ell=(\ell_i)$ and $r=(r_i)$:
\begin{align*}
Ar=\lambda_A r,
\qquad
\ell^\top A=\lambda_A \ell^\top.
\end{align*}
Normalize them by
\begin{align*}
\sum_{i=1}^m \ell_i r_i=1.
\end{align*}
The number of admissible length-$n$ words is
\begin{align*}
N_n=\sum_{i=1}^m\sum_{j=1}^m (A^{n-1})_{ij},
\end{align*}
because $(A^{n-1})_{ij}$ counts admissible paths from initial symbol $i$ to terminal symbol $j$ with $n-1$ transitions. Perron-Frobenius growth gives
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}\log N_n=\log \lambda_A,
\end{align*}
so the topological entropy of the shift on $X_A$ is
\begin{align*}
h_{\mathrm{top}}(\sigma|_{X_A})=\log \lambda_A.
\end{align*}
Define
\begin{align*}
\pi_i=\ell_i r_i,
\qquad
p_{ij}=\frac{A_{ij}r_j}{\lambda_A r_i}.
\end{align*}
For each $i$,
\begin{align*}
\sum_{j=1}^m p_{ij}=\sum_{j=1}^m \frac{A_{ij}r_j}{\lambda_A r_i}=\frac{(Ar)_i}{\lambda_A r_i}=\frac{\lambda_A r_i}{\lambda_A r_i}=1,
\end{align*}
so $P=(p_{ij})$ is a transition matrix. Its stationary vector is $\pi$, since for each $j$,
\begin{align*}
\sum_{i=1}^m \pi_i p_{ij}=\sum_{i=1}^m \ell_i r_i\frac{A_{ij}r_j}{\lambda_A r_i}=\frac{r_j}{\lambda_A}\sum_{i=1}^m \ell_i A_{ij}.
\end{align*}
Using $\ell^\top A=\lambda_A\ell^\top$, this becomes
\begin{align*}
\sum_{i=1}^m \pi_i p_{ij}=\frac{r_j}{\lambda_A}(\ell^\top A)_j=\frac{r_j}{\lambda_A}\lambda_A\ell_j=\ell_jr_j=\pi_j.
\end{align*}
The corresponding stationary Markov measure assigns an admissible cylinder $[a_0,\dots,a_{n-1}]$ the mass
\begin{align*}
\pi_{a_0}p_{a_0a_1}\cdots p_{a_{n-2}a_{n-1}}=\ell_{a_0}r_{a_0}\prod_{k=0}^{n-2}\frac{r_{a_{k+1}}}{\lambda_A r_{a_k}},
\end{align*}
where admissibility makes every factor $A_{a_ka_{k+1}}$ equal to $1$. The product telescopes:
\begin{align*}
\ell_{a_0}r_{a_0}\prod_{k=0}^{n-2}\frac{r_{a_{k+1}}}{\lambda_A r_{a_k}}=\ell_{a_0}r_{a_0}\frac{r_{a_1}}{\lambda_A r_{a_0}}\frac{r_{a_2}}{\lambda_A r_{a_1}}\cdots\frac{r_{a_{n-1}}}{\lambda_A r_{a_{n-2}}}=\frac{\ell_{a_0}r_{a_{n-1}}}{\lambda_A^{n-1}}.
\end{align*}
By the [entropy formula for a stationary Markov shift](/theorems/6791),
\begin{align*}
h_\mu(\sigma)=-\sum_{i=1}^m \pi_i\sum_{j=1}^m p_{ij}\log p_{ij}.
\end{align*}
On admissible transitions,
\begin{align*}
\log p_{ij}=\log r_j-\log\lambda_A-\log r_i.
\end{align*}
Therefore
\begin{align*}
h_\mu(\sigma)=-\sum_i\pi_i\sum_jp_{ij}(\log r_j-\log\lambda_A-\log r_i).
\end{align*}
Distributing the three terms gives
\begin{align*}
h_\mu(\sigma)=\log\lambda_A\sum_i\pi_i\sum_jp_{ij}+\sum_i\pi_i\log r_i\sum_jp_{ij}-\sum_j\log r_j\sum_i\pi_i p_{ij}.
\end{align*}
Since $\sum_jp_{ij}=1$, $\sum_i\pi_i=1$, and $\sum_i\pi_i p_{ij}=\pi_j$, this reduces to
\begin{align*}
h_\mu(\sigma)=\log\lambda_A+\sum_i\pi_i\log r_i-\sum_j\pi_j\log r_j=\log\lambda_A.
\end{align*}
Thus this Perron-Frobenius Markov measure has entropy equal to the topological entropy, so it is a measure of maximal entropy; the cylinder formula shows explicitly that admissible words are weighted by the Perron-Frobenius data at their endpoints.
[/example]
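The construction can be carried out numerically for a small matrix. The sketch below is one possible implementation under stated assumptions: it uses power iteration on $A+I$ (which has the same eigenvectors as $A$ and is primitive whenever $A$ is irreducible), and the function names `perron_data`, `parry_measure`, and `markov_entropy` are hypothetical. The test matrix is the golden mean shift, chosen for illustration.

```python
import math

def perron_data(A, iters=2000):
    """Approximate (lambda_A, left vector, right vector) for an irreducible 0-1 matrix A."""
    m = len(A)
    # Power iteration on B = A + I; same eigenvectors as A, eigenvalue shifted by 1.
    B = [[A[i][j] + (1 if i == j else 0) for j in range(m)] for i in range(m)]
    r = [1.0] * m
    l = [1.0] * m
    for _ in range(iters):
        r_new = [sum(B[i][j] * r[j] for j in range(m)) for i in range(m)]
        l_new = [sum(l[i] * B[i][j] for i in range(m)) for j in range(m)]
        r = [v / sum(r_new) for v in r_new]
        l = [v / sum(l_new) for v in l_new]
    lam = sum(A[0][j] * r[j] for j in range(m)) / r[0]  # (A r)_0 / r_0
    scale = sum(l[i] * r[i] for i in range(m))
    l = [v / scale for v in l]  # normalise so that sum_i l_i r_i = 1
    return lam, l, r

def parry_measure(A):
    """Stationary vector pi_i = l_i r_i and transitions p_ij = A_ij r_j / (lambda r_i)."""
    lam, l, r = perron_data(A)
    m = len(A)
    pi = [l[i] * r[i] for i in range(m)]
    P = [[A[i][j] * r[j] / (lam * r[i]) for j in range(m)] for i in range(m)]
    return lam, pi, P

def markov_entropy(pi, P):
    """Entropy -sum_i pi_i sum_j p_ij log p_ij of a stationary Markov measure."""
    m = len(pi)
    return -sum(pi[i] * P[i][j] * math.log(P[i][j])
                for i in range(m) for j in range(m) if P[i][j] > 0)

A = [[1, 1], [1, 0]]  # golden mean shift (illustrative choice)
lam, pi, P = parry_measure(A)
h = markov_entropy(pi, P)
print(lam, h, math.log(lam))
```

Up to rounding, the printed entropy should coincide with $\log\lambda_A$, matching the telescoping argument above.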
These examples indicate the main use of the variational principle in symbolic dynamics: topological word growth can often be converted into an invariant measure with matching entropy. The next question is whether such a measure must exist in general.
## Measures of Maximal Entropy
Once the supremum formula is known, the natural problem is attainment. A measure of maximal entropy is an invariant probability measure whose entropy equals the full topological entropy of the system.
[definition: Measure of Maximal Entropy]
Let $(X,T)$ be a compact topological dynamical system with $h_{\mathrm{top}}(T)<\infty$. A measure $\mu\in\mathcal M_T(X)$ is a measure of maximal entropy if
\begin{align*}
h_\mu(T)=h_{\mathrm{top}}(T).
\end{align*}
[/definition]
This definition turns the variational principle into an optimisation problem on the compact convex set $\mathcal M_T(X)$. Compactness alone does not guarantee a maximiser, so we introduce upper semicontinuity as the one-sided continuity condition suited to maximisation.
[definition: Upper Semicontinuity]
Let $K$ be a compact topological space and let $F:K\to[-\infty,\infty)$ be a function. The function $F$ is upper semicontinuous if, whenever $x_j\to x$ in $K$,
\begin{align*}
F(x)\ge \limsup_{j\to\infty}F(x_j).
\end{align*}
[/definition]
Upper semicontinuity allows the entropy of a limit to exceed the entropies along the approximating sequence, but it prevents entropy from being lost in the limit: the limit point has entropy at least the $\limsup$ of the entropies along the sequence. This is exactly the one-sided control needed for maximisation, since a near-maximising sequence can then be replaced by an actual maximal entropy measure, as the next theorem records.
[quotetheorem:6811]
[citeproof:6811]
The theorem isolates the compactness argument, and it also shows exactly where compactness alone falls short. There are compact dynamical systems, including classical symbolic examples built as countable unions of higher and higher entropy subshifts accumulating on a lower-entropy limit component, for which the entropy map is not upper semicontinuous and no invariant measure attains the topological entropy. In such systems a sequence of measures can concentrate on components with entropies increasing to the topological value while every weak* limit lives on a component with smaller entropy. Thus the variational principle gives only a supremum unless some extra hypothesis prevents entropy from being lost under weak* limits.
To state the standard criterion used in the course, we first make the expansivity hypothesis explicit. Upper semicontinuity of entropy asks for a finite observation scheme that keeps controlling orbit information after passing to weak* limits. Expansivity supplies such a scheme: if two full orbits remain uniformly close, then they were the same orbit, so sufficiently fine finite partitions can record the long-term behaviour without losing hidden local motion. This is the topological analogue of a generating partition in measure entropy, and it is why the next definition is formulated for homeomorphisms on compact metric spaces.
[definition: Expansive Homeomorphism]
Let $(X,d)$ be a compact metric space and let $T:X\to X$ be a homeomorphism. The map $T$ is expansive if there exists $c>0$ such that, whenever $x,y\in X$ satisfy
\begin{align*}
d(T^n x,T^n y)<c
\end{align*}
for every $n\in\mathbb Z$, then $x=y$.
[/definition]
For non-invertible maps one uses the corresponding positive expansivity condition with $n\ge 0$. In this chapter, the expansive criterion is stated for homeomorphisms, which is the setting of two-sided shifts and hyperbolic toral automorphisms. The point of the hypothesis is not merely technical: it converts small-scale topological separation into finite measurable information, making it possible to bound the entropy of weak* limits by entropies seen at finite resolution.
This is precisely the missing ingredient in the compactness argument above. A sequence of invariant measures may approach the topological entropy, but compactness alone only gives a weak* limit; without upper semicontinuity, entropy can be lost at that limit and no measure of maximal entropy is produced. The next theorem gives the extra control needed in expansive systems, turning near-maximising sequences into actual maximisers when the entropy supremum is finite.
[quotetheorem:6812]
The course quotes this Bowen-Misiurewicz criterion as a standard tool. Its usual applications choose finite partitions whose boundaries are invisible to the measures under consideration and whose iterates generate because of expansivity, including two-sided shifts of finite type and many hyperbolic systems.
The limitation is real. In non-expansive symbolic systems built as countable unions of subshifts with entropies increasing to a limiting value, together with a compactifying fixed point component, invariant measures supported on the high-entropy components can converge weak* to a zero-entropy measure on the limit component. Entropy then drops at the limit: $h_{\mu}(T)<\limsup_j h_{\mu_j}(T)$ for such a convergent sequence $\mu_j\to\mu$, so upper semicontinuity fails. Expansivity rules out this particular loss of orbit information by forcing entropy to be visible at a uniform scale.
[example: Toral Automorphism]
Let $A\in GL(n,\mathbb Z)$ and define
\begin{align*}
T_A(x+\mathbb Z^n)=Ax+\mathbb Z^n
\end{align*}
on $\mathbb T^n=\mathbb R^n/\mathbb Z^n$. Since $A\mathbb Z^n=\mathbb Z^n$ and $A^{-1}\in GL(n,\mathbb Z)$, this is a well-defined torus homeomorphism. Assume that no eigenvalue of $A$ has modulus $1$, and list the eigenvalues with algebraic multiplicity as $\lambda_1,\dots,\lambda_n$. By *Hyperbolic Toral Automorphisms Are Anosov*, $T_A$ is Anosov, and by *Anosov Diffeomorphisms Are Expansive*, $T_A$ is expansive.
Let $m$ be Haar probability measure on $\mathbb T^n$. The map $T_A$ is a continuous group automorphism, so $(T_A)_*m$ is again translation-invariant. Indeed, for every Borel set $B\subset\mathbb T^n$ and every $y\in\mathbb T^n$,
\begin{align*}
(T_A)_*m(y+B)=m(T_A^{-1}(y+B)).
\end{align*}
Because $T_A^{-1}(y+B)=T_A^{-1}y+T_A^{-1}B$, this becomes
\begin{align*}
(T_A)_*m(y+B)=m(T_A^{-1}y+T_A^{-1}B).
\end{align*}
Translation-invariance of Haar measure gives
\begin{align*}
m(T_A^{-1}y+T_A^{-1}B)=m(T_A^{-1}B).
\end{align*}
Finally, by the definition of pushforward measure,
\begin{align*}
m(T_A^{-1}B)=(T_A)_*m(B).
\end{align*}
Thus $(T_A)_*m$ is translation-invariant, and by uniqueness of Haar probability measure on a compact group, $(T_A)_*m=m$. Hence $m$ is $T_A$-invariant.
By *Entropy Formula for Hyperbolic Toral Automorphisms*,
\begin{align*}
h_m(T_A)=\sum_{|\lambda_i|>1}\log|\lambda_i|.
\end{align*}
The topological entropy of the same hyperbolic toral automorphism is given by the same unstable eigenvalue formula:
\begin{align*}
h_{\mathrm{top}}(T_A)=\sum_{|\lambda_i|>1}\log|\lambda_i|.
\end{align*}
Therefore
\begin{align*}
h_m(T_A)=h_{\mathrm{top}}(T_A).
\end{align*}
Thus Haar measure attains the variational supremum and is a measure of maximal entropy for $T_A$.
[/example]
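For $n=2$ the unstable eigenvalue formula can be evaluated directly from the trace and determinant. The sketch below is restricted to $2\times 2$ integer matrices, which is an assumption made to keep it dependency-free; the function name `toral_entropy_2x2` is hypothetical, and the matrix with rows $(2,1)$ and $(1,1)$ is an illustrative choice.

```python
import cmath
import math

def toral_entropy_2x2(a, b, c, d):
    """Sum of log|lambda| over the eigenvalues of [[a, b], [c, d]] with |lambda| > 1."""
    tr, det = a + d, a * d - b * c
    if abs(det) != 1:
        raise ValueError("matrix must lie in GL(2, Z)")
    disc = cmath.sqrt(tr * tr - 4 * det)
    eigs = [(tr + disc) / 2, (tr - disc) / 2]  # roots of x^2 - tr*x + det
    if any(abs(abs(lam) - 1) < 1e-12 for lam in eigs):
        raise ValueError("matrix is not hyperbolic")
    return sum(math.log(abs(lam)) for lam in eigs if abs(lam) > 1)

h_cat = toral_entropy_2x2(2, 1, 1, 1)
print(h_cat)  # equals log((3 + sqrt(5)) / 2) up to rounding
```

By the example, this single number is both $h_m(T_A)$ and $h_{\mathrm{top}}(T_A)$ for the chosen matrix.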
Expansivity provides existence through upper semicontinuity, but it does not by itself settle uniqueness. For uniqueness, one needs a mechanism forcing high-entropy orbit pieces to concatenate in a controlled way.
## Specification and Uniqueness Questions
The final problem in this chapter is to distinguish systems with many maximal measures from systems with a unique statistical equilibrium at maximal entropy. Specification is a strong orbit-gluing property: finite pieces of orbits can be shadowed by one orbit with uniformly bounded transition times.
[definition: Specification Property]
Let $(X,d)$ be a compact metric space and let $T:X\to X$ be a homeomorphism. The map $T$ has the specification property if for every $\varepsilon>0$ there exists $M\in\mathbb N$ such that, for any finite collection of orbit segments
\begin{align*}
(x_i,[a_i,b_i])\quad 1\le i\le r,
\end{align*}
where $x_i\in X$ and $a_i\le b_i$ are integers marking the times $a_i\le n\le b_i$ of the $i$th segment, ordered so that $a_{i+1}-b_i\ge M$ for $1\le i<r$, there exists $y\in X$ such that
\begin{align*}
d(T^n y,T^n x_i)<\varepsilon
\end{align*}
for all $1\le i\le r$ and all $a_i\le n\le b_i$.
[/definition]
On a compact metrizable space this property is independent of the chosen compatible metric, because compatible metrics determine the same uniform structure. The metric is included in the definition to make the shadowing inequality meaningful.
Specification says that the system can concatenate prescribed finite behaviours. In entropy theory, this prevents the space from decomposing into unrelated high-entropy components and sets up the uniqueness theorem for maximal entropy measures.
[quotetheorem:6814]
This theorem is the standard route from symbolic orbit combinatorics to uniqueness of the maximal measure, and each hypothesis has a distinct role. Expansivity turns orbit separation into finite symbolic information; without it, different nearby orbit histories may not be detected by finite partitions. Specification rules out the simplest source of non-uniqueness: a disjoint union of two mixing shifts of finite type with the same topological entropy has two different maximal entropy measures, one on each component, and fails specification because orbit pieces from different components cannot be glued. The finite-entropy assumption prevents the maximisation problem from degenerating into an infinite value that no probability measure can meaningfully distinguish; for instance, the full shift on a countably infinite alphabet has infinite topological entropy, and probability measures on larger and larger finite subalphabets have entropies tending to infinity rather than converging to a finite maximising measure.
This is also the first appearance of the thermodynamic formalism viewpoint. The present chapter treats the zero-potential pressure problem, where maximal entropy measures are equilibrium states for the constant potential $0$. Later, replacing $h_\mu(T)$ by $h_\mu(T)+\int \varphi\,d\mu$ imports the same compactness, upper semicontinuity, and orbit-gluing questions into the statistical-mechanics language of pressure and Gibbs states.
The theorem does not cover all systems with a unique measure of maximal entropy. Beta-shifts and many non-uniformly hyperbolic systems may have unique maximal entropy measures by more delicate arguments even when full specification fails. It nevertheless explains why mixing shifts of finite type have a distinguished Parry measure and gives a direct application of the abstract criterion.
[example: Mixing Shift of Finite Type]
Let $X_A$ be a mixing subshift of finite type with transition matrix $A$. Mixing means that $A$ is primitive, so there is $M\in\mathbb N$ such that
\begin{align*}
(A^M)_{ij}>0
\end{align*}
for every pair of symbols $i,j$. Since a primitive matrix has no zero row, $(A^k)_{ij}>0$ for every $k\ge M$ as well. If one admissible word ends in $i$ and the next begins in $j$, the inequality $(A^k)_{ij}>0$ gives an admissible path of $k$ transitions from $i$ to $j$, so the two words can be joined across any gap of at least $M$ transitions. Thus the shift has specification.
Take the usual symbolic metric, normalised so that $d(x,y)<1$ forces $x_0=y_0$; then $c=1$ is an expansivity constant. Indeed, if
\begin{align*}
d(\sigma^n x,\sigma^n y)<1
\end{align*}
for every $n\in\mathbb Z$, then $(\sigma^n x)_0=(\sigma^n y)_0$ for every $n$, which means $x_n=y_n$ for every $n\in\mathbb Z$. Hence $x=y$, so the shift is expansive.
By *[Existence and Uniqueness for Expansive Specification Systems](/theorems/6814)*, $X_A$ has a unique measure of maximal entropy. To identify it, let $\lambda_A$ be the Perron-Frobenius eigenvalue of $A$, and let $\ell,r$ be positive left and right Perron-Frobenius eigenvectors normalized by
\begin{align*}
\sum_i \ell_i r_i=1.
\end{align*}
Define
\begin{align*}
\pi_i=\ell_i r_i,
\qquad
p_{ij}=\frac{A_{ij}r_j}{\lambda_A r_i}.
\end{align*}
Then
\begin{align*}
\sum_j p_{ij}
=\sum_j\frac{A_{ij}r_j}{\lambda_A r_i}
=\frac{(Ar)_i}{\lambda_A r_i}
=\frac{\lambda_A r_i}{\lambda_A r_i}
=1,
\end{align*}
and
\begin{align*}
\sum_i\pi_i p_{ij}
=\sum_i \ell_i r_i\frac{A_{ij}r_j}{\lambda_A r_i}
=\frac{r_j}{\lambda_A}\sum_i\ell_i A_{ij}
=\frac{r_j}{\lambda_A}(\ell^\top A)_j
=\frac{r_j}{\lambda_A}\lambda_A\ell_j
=\ell_jr_j
=\pi_j.
\end{align*}
Thus $\pi$ is stationary for $P=(p_{ij})$, and the corresponding stationary Markov measure is the Parry measure. The previous Perron-Frobenius entropy computation gives
\begin{align*}
h_\mu(\sigma)=\log\lambda_A=h_{\mathrm{top}}(\sigma|_{X_A}),
\end{align*}
so this Parry Markov measure is a measure of maximal entropy. By uniqueness, it is the only measure of maximal entropy on the mixing shift of finite type.
[/example]
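Primitivity, and hence the gluing constant $M$ used above, can be tested mechanically. The following sketch searches for the smallest $M$ with $A^M>0$, stopping at Wielandt's bound $(m-1)^2+1$ for primitive $m\times m$ matrices. The function name `primitivity_exponent` is hypothetical, and only the zero pattern of the powers is tracked.

```python
def primitivity_exponent(A):
    """Smallest M >= 1 with every entry of A^M positive, or None if A is not primitive."""
    m = len(A)
    bound = (m - 1) ** 2 + 1  # Wielandt's bound for primitive m x m matrices
    power = [row[:] for row in A]  # zero pattern of A^1
    for M in range(1, bound + 1):
        if all(power[i][j] > 0 for i in range(m) for j in range(m)):
            return M
        # Multiply by A, keeping only whether each entry is zero or positive.
        power = [[1 if any(power[i][k] and A[k][j] for k in range(m)) else 0
                  for j in range(m)]
                 for i in range(m)]
    return None

print(primitivity_exponent([[1, 1], [1, 0]]))  # golden mean shift
print(primitivity_exponent([[0, 1], [1, 0]]))  # irreducible but periodic
```

The second matrix is irreducible with period $2$, so no power is strictly positive and the corresponding shift is not mixing.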
Not every symbolic system has specification, and this is where uniqueness may fail or existence may require different compactness arguments. Beta-shifts show that maximal entropy measures can exist beyond the specification framework.
[example: Beta Shift]
Let $\beta>1$, and let $X_\beta$ be the beta-shift determined by the greedy expansion of $1$. Write $\mathcal L_n(X_\beta)$ for the set of admissible words of length $n$. By *Topological Entropy of Beta Shifts*,
\begin{align*}
h_{\mathrm{top}}(\sigma|_{X_\beta})
=\lim_{n\to\infty}\frac{1}{n}\log |\mathcal L_n(X_\beta)|
=\log\beta.
\end{align*}
The Parry measure $\mu_\beta$ is the invariant measure obtained from the absolutely continuous invariant measure of the beta-transformation $x\mapsto \beta x \pmod 1$ under the greedy coding map. By *Parry Measure for Beta Shifts*,
\begin{align*}
h_{\mu_\beta}(\sigma)=\log\beta.
\end{align*}
Combining the two equalities gives
\begin{align*}
h_{\mu_\beta}(\sigma)
=\log\beta
=h_{\mathrm{top}}(\sigma|_{X_\beta}),
\end{align*}
so $\mu_\beta$ is a measure of maximal entropy.
This conclusion does not come from specification in general. For example, when the greedy expansion of $1$ contains arbitrarily long blocks of zeros, orbit words cannot always be joined with a uniformly bounded transition word, so the full specification property fails. Thus beta-shifts show that specification is a sufficient mechanism for maximal entropy measures, but not a necessary one.
[/example]
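The word-growth rate $\log\beta$ can be explored by brute force using Parry's lexicographic criterion, which is not stated in this chapter and is assumed here: a finite word is admissible exactly when each of its suffixes is lexicographically at most the prefix of the same length of the quasi-greedy expansion of $1$. The sketch below takes that expansion as input. The function name `beta_shift_word_count` is hypothetical, and the golden mean case (quasi-greedy expansion $1010\ldots$) is chosen only because its counts are easy to verify; it happens to be a shift of finite type and so does have specification.

```python
import itertools

def beta_shift_word_count(quasi_greedy, alphabet_size, n):
    """Count length-n words whose suffixes are all <= the matching prefix of quasi_greedy."""
    count = 0
    for w in itertools.product(range(alphabet_size), repeat=n):
        # Tuples compare lexicographically, which is the order Parry's criterion uses.
        if all(w[k:] <= tuple(quasi_greedy[: n - k]) for k in range(n)):
            count += 1
    return count

# Golden mean beta: the quasi-greedy expansion of 1 is assumed to be 1 0 1 0 ...
d_star = [1, 0] * 10
counts = [beta_shift_word_count(d_star, 2, n) for n in range(1, 13)]
ratio = counts[-1] / counts[-2]
print(counts, ratio)
```

The ratio of successive counts should approach $\beta=(1+\sqrt5)/2$, consistent with $h_{\mathrm{top}}(\sigma|_{X_\beta})=\log\beta$.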
The chapter closes the loop between topology and measure. Topological entropy is computed by orbit growth, but the variational principle says that the same number is realised as a supremum of measure entropies. The remaining questions in the course refine this bridge: adding potentials leads to pressure and equilibrium states, while stronger symbolic and hyperbolic hypotheses turn existence statements into uniqueness and statistical limit theorems.
The variational principle identifies the bridge between topology and measure, but thermodynamic formalism asks for a finer invariant than entropy alone. The next chapter adds a potential, interprets it as an energy term, and studies pressure and equilibrium states as the natural refinements of the entropy picture.
# 9. Thermodynamic Formalism
Thermodynamic formalism studies invariant measures by borrowing the language of statistical mechanics. Chapters 2 and 7 treated entropy as the amount of orbit complexity carried by a measure or by the whole topological system; here entropy is combined with an energy term coming from a potential. The central question is: among all invariant measures, which ones optimise the balance between disorder and energy?
The chapter assumes the entropy theory from the previous chapters: invariant Borel probability measures, measure-theoretic entropy, topological entropy via separated sets, Birkhoff's ergodic theorem, and the variational principle for entropy. The chapter is written for expansive symbolic systems and uniformly expanding maps, where orbit segments can be counted with enough precision to obtain strong existence and uniqueness results. The main objects are topological pressure, equilibrium states, transfer operators, Gibbs measures, and the Bowen property.
## Potentials, Pressure, and Equilibrium States
The first problem is to replace the unweighted orbit counts used for topological entropy by weighted orbit counts. A [continuous function](/page/Continuous%20Function) assigns an energy to each point of the phase space, and the weight of an orbit segment is the exponential of the accumulated energy along that segment.
[definition: Birkhoff Sum of a Potential]
Let $T:X\to X$ be a continuous map on a compact metric space $(X,d)$, and let $\phi\in C(X)$. For $n\ge 1$, the $n$th Birkhoff sum of $\phi$ is the function $S_n\phi:X\to\mathbb R$ defined by
\begin{align*}
S_n\phi(x)=\sum_{k=0}^{n-1}\phi(T^k x).
\end{align*}
[/definition]
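As a concrete reading of the definition, a Birkhoff sum is a loop over the first $n$ points of an orbit. The sketch below is generic in the map and the potential. The names `birkhoff_sum`, `quarter_rotation`, and `identity_potential` are illustrative assumptions, as is the choice of a rational rotation, which keeps the arithmetic exact.

```python
def birkhoff_sum(T, phi, x, n):
    """Return phi(x) + phi(T x) + ... + phi(T^{n-1} x)."""
    total = 0.0
    for _ in range(n):
        total += phi(x)
        x = T(x)  # advance one step along the orbit
    return total

def quarter_rotation(x):
    """Rotation of the circle R/Z by 1/4 (illustrative map)."""
    return (x + 0.25) % 1.0

def identity_potential(x):
    """Illustrative potential phi(x) = x on [0, 1)."""
    return x

total = birkhoff_sum(quarter_rotation, identity_potential, 0.0, 4)
print(total)  # 0 + 0.25 + 0.5 + 0.75 = 1.5
```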
The quantity $S_n\phi(x)$ is the energy accumulated by the orbit segment $x,Tx,\dots,T^{n-1}x$. When $\phi=0$, weighting by $e^{S_n\phi(x)}$ gives weight $1$ to every orbit segment, so the formalism reduces to topological entropy. To turn these weighted orbit segments into a topological invariant, we need a growth rate that counts only orbit segments that are distinguishable for $n$ iterates. Separated sets provide this scale-sensitive counting device and avoid choosing coordinates or symbolic names.
[definition: Topological Pressure]
Let $T:X\to X$ be a continuous map on a compact metric space $(X,d)$. The topological pressure is the functional $P(T,\cdot):C(X)\to\mathbb R\cup\{\infty\}$ defined as follows. For $\phi\in C(X)$ and $\varepsilon>0$, let $Z_n(\phi,\varepsilon)$ be the supremum of
\begin{align*}
\sum_{x\in E} e^{S_n\phi(x)}
\end{align*}
over all $(n,\varepsilon)$-separated sets $E\subset X$. The topological pressure of $\phi$ is
\begin{align*}
P(T,\phi)=\lim_{\varepsilon\downarrow 0}\limsup_{n\to\infty}\frac{1}{n}\log Z_n(\phi,\varepsilon).
\end{align*}
[/definition]
Pressure is therefore the exponential growth rate of the weighted collection of distinguishable orbit segments. The special case $\phi=0$ gives $P(T,0)=h_{\mathrm{top}}(T)$, while adding a constant $c$ to the potential shifts pressure by the same constant: $P(T,\phi+c)=P(T,\phi)+c$, because $S_n(\phi+c)=S_n\phi+nc$.
[example: Full Shift Locally Constant Potential]
Let $X=\{1,\dots,m\}^{\mathbb N}$ with the left shift $\sigma$, fix real numbers $a_1,\dots,a_m$, and let $\phi(x)=a_{x_0}$. If $w=w_0\dots w_{n-1}$ is a word of length $n$ and $x$ lies in the cylinder $[w]=\{x\in X:x_k=w_k\text{ for }0\le k<n\}$, then $(\sigma^k x)_0=w_k$ for $0\le k<n$. Hence
\begin{align*}
S_n\phi(x)=\sum_{k=0}^{n-1}\phi(\sigma^k x)=\sum_{k=0}^{n-1}a_{(\sigma^k x)_0}=\sum_{k=0}^{n-1}a_{w_k}=a_{w_0}+\cdots+a_{w_{n-1}}.
\end{align*}
Thus the weight is constant on each length-$n$ cylinder, and for $x\in[w]$ we have
\begin{align*}
e^{S_n\phi(x)}=e^{a_{w_0}+\cdots+a_{w_{n-1}}}=\prod_{k=0}^{n-1}e^{a_{w_k}}.
\end{align*}
Taking one representative from each length-$n$ cylinder gives the weighted sum
\begin{align*}
\sum_{w\in\{1,\dots,m\}^n}e^{a_{w_0}+\cdots+a_{w_{n-1}}}=\sum_{w_0=1}^m\cdots\sum_{w_{n-1}=1}^m\prod_{k=0}^{n-1}e^{a_{w_k}}.
\end{align*}
Because the factor indexed by $w_k$ depends only on the $k$th coordinate of the word, the iterated sum separates into $n$ identical one-symbol sums:
\begin{align*}
\sum_{w_0=1}^m\cdots\sum_{w_{n-1}=1}^m\prod_{k=0}^{n-1}e^{a_{w_k}}=\prod_{k=0}^{n-1}\left(\sum_{i=1}^m e^{a_i}\right)=\left(\sum_{i=1}^m e^{a_i}\right)^n.
\end{align*}
Therefore
\begin{align*}
P(\sigma,\phi)=\lim_{n\to\infty}\frac{1}{n}\log\left(\sum_{i=1}^m e^{a_i}\right)^n=\log\sum_{i=1}^m e^{a_i}.
\end{align*}
Pressure is therefore the logarithmic normalising constant for the $m$ symbolic choices weighted by $e^{a_i}$.
[/example]
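The factorisation of the weighted word sum can be confirmed by direct enumeration for small parameters. In the sketch below the function name `weighted_word_sum` and the particular values of $a_i$ and $n$ are illustrative assumptions; symbols are indexed from $0$ rather than $1$ for convenience.

```python
import itertools
import math

def weighted_word_sum(a, n):
    """Sum over all length-n words w of exp(a[w_0] + ... + a[w_{n-1}])."""
    return sum(math.exp(sum(a[s] for s in w))
               for w in itertools.product(range(len(a)), repeat=n))

a = [0.0, 1.0, -0.5]  # illustrative one-symbol energies
n = 6
brute = weighted_word_sum(a, n)
closed = sum(math.exp(v) for v in a) ** n      # (sum_i e^{a_i})^n
pressure = math.log(sum(math.exp(v) for v in a))
print(brute, closed, math.log(brute) / n, pressure)
```

Because the sum is exactly an $n$th power, the normalised logarithm already equals the pressure at every finite $n$, not only in the limit.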
The example suggests that pressure should also be visible from invariant probability measures: a measure may have high entropy, high average potential, or both. The variational principle identifies the orbit-counting number with this optimisation problem over invariant measures. It is the thermodynamic analogue of the [variational principle for topological entropy](/theorems/6728).
[quotetheorem:6730]
[citeproof:6730]
The variational principle turns pressure into a convex optimisation problem, but its hypotheses are doing real work. Compactness makes every $(n,\varepsilon)$-separated set finite and ensures that invariant probability measures form a weak* compact set; continuity of $\phi$ makes $\mu\mapsto\int_X\phi\,d\mu$ continuous; invariance is essential because entropy is measuring long-term orbit statistics rather than arbitrary distributions on $X$. In the expansive symbolic and expanding settings of this chapter, the entropies are finite and the formula is used in this finite-pressure regime.
The theorem does not by itself say that the supremum is attained or that the maximising measure is unique. Existence follows when the entropy map $\mu\mapsto h_\mu(T)$ has enough upper semicontinuity, and uniqueness requires additional dynamical mixing and distortion control. The measures attaining the supremum are the thermodynamic states selected by the potential.
[definition: Equilibrium State]
Let $T:X\to X$ be continuous on a compact metric space, and let $\phi\in C(X)$. A measure $\mu\in\mathcal M_T(X)$ is an equilibrium state for $\phi$ if
\begin{align*}
h_\mu(T)+\int_X \phi\,d\mu=P(T,\phi).
\end{align*}
[/definition]
An equilibrium state maximises free energy: entropy rewards orbit complexity, while the integral of $\phi$ rewards orbit segments preferred by the potential. Existence is automatic whenever the entropy map $\mu\mapsto h_\mu(T)$ is upper semicontinuous, as it is for expansive systems, but uniqueness needs stronger dynamical and regularity assumptions.
[example: Bernoulli Equilibrium for a One-Symbol Potential]
Let
\begin{align*}
Z=\sum_{j=1}^m e^{a_j}
\end{align*}
and define $p_i=e^{a_i}/Z$ for $1\le i\le m$. Since each $e^{a_i}>0$, each $p_i>0$, and
\begin{align*}
\sum_{i=1}^m p_i=\sum_{i=1}^m\frac{e^{a_i}}{Z}=\frac{1}{Z}\sum_{i=1}^m e^{a_i}=\frac{Z}{Z}=1.
\end{align*}
Thus $(p_1,\dots,p_m)$ defines a Bernoulli measure $\mu_p$ on the full shift.
For a Bernoulli measure on the full shift, the entropy is
\begin{align*}
h_{\mu_p}(\sigma)=-\sum_{i=1}^m p_i\log p_i.
\end{align*}
Since $\phi(x)=a_{x_0}$ and $\mu_p(\{x:x_0=i\})=p_i$, the integral is
\begin{align*}
\int \phi\,d\mu_p=\sum_{i=1}^m p_i a_i.
\end{align*}
From $p_i=e^{a_i}/Z$ we have
\begin{align*}
\log p_i=\log(e^{a_i}/Z)=\log(e^{a_i})-\log Z=a_i-\log Z.
\end{align*}
Substituting this identity into the free energy gives
\begin{align*}
h_{\mu_p}(\sigma)+\int \phi\,d\mu_p=-\sum_{i=1}^m p_i(a_i-\log Z)+\sum_{i=1}^m p_i a_i.
\end{align*}
Expanding the first sum,
\begin{align*}
-\sum_{i=1}^m p_i(a_i-\log Z)+\sum_{i=1}^m p_i a_i=-\sum_{i=1}^m p_i a_i+\sum_{i=1}^m p_i\log Z+\sum_{i=1}^m p_i a_i.
\end{align*}
The two $a_i$-terms cancel, $\log Z$ is independent of $i$, and $\sum_i p_i=1$, so
\begin{align*}
-\sum_{i=1}^m p_i a_i+\sum_{i=1}^m p_i\log Z+\sum_{i=1}^m p_i a_i=\log Z\sum_{i=1}^m p_i=\log Z.
\end{align*}
By the definition of $Z$,
\begin{align*}
h_{\mu_p}(\sigma)+\int \phi\,d\mu_p=\log Z=\log\sum_{j=1}^m e^{a_j}.
\end{align*}
The previous pressure computation gave $P(\sigma,\phi)=\log\sum_{j=1}^m e^{a_j}$, hence
\begin{align*}
h_{\mu_p}(\sigma)+\int \phi\,d\mu_p=P(\sigma,\phi).
\end{align*}
Therefore $\mu_p$ is an equilibrium state, and its probabilities are exactly the exponential weights $e^{a_i}$ normalised by their total weight.
[/example]
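A quick numerical comparison makes the maximising role of $\mu_p$ visible. The Python sketch below evaluates the free energy only on Bernoulli measures, so it illustrates the variational principle rather than proving it; the potential values in `a` and the comparison vectors in `others` are arbitrary choices, and `free_energy` is a hypothetical helper name.

```python
import math

def free_energy(p, a):
    """Entropy plus potential average of the Bernoulli measure with weights p."""
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    return entropy + sum(q * x for q, x in zip(p, a))

a = [0.3, -1.0, 0.7]                        # illustrative potential values (assumption)
Z = sum(math.exp(x) for x in a)
p_star = [math.exp(x) / Z for x in a]       # normalised exponential weights
pressure = math.log(Z)

# Other probability vectors, chosen arbitrarily for comparison.
others = [[1 / 3, 1 / 3, 1 / 3], [0.5, 0.25, 0.25], [0.1, 0.1, 0.8], [1.0, 0.0, 0.0]]
gaps = [pressure - free_energy(p, a) for p in others]
print(free_energy(p_star, a), pressure, gaps)
```

Each entry of `gaps` is the relative entropy of the comparison vector with respect to `p_star`, which is why it is positive for every vector other than `p_star`.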
## Ruelle Transfer Operators and Gibbs Measures
The next problem is to construct the equilibrium state rather than only recognise it variationally. For shifts of finite type and expanding maps, the relevant construction is spectral: a positive operator transports densities backward along inverse branches and weights them by the potential.
[definition: Ruelle Transfer Operator]
Let $T:X\to X$ be a finite-to-one local homeomorphism of a compact metric space, and let $\phi:X\to\mathbb R$ be continuous. The Ruelle transfer operator associated to $\phi$ is the linear operator $\mathcal L_\phi:C(X)\to C(X)$ defined, for $f\in C(X)$ and $x\in X$, by
\begin{align*}
\mathcal L_\phi f(x)=\sum_{Ty=x} e^{\phi(y)}f(y).
\end{align*}
[/definition]
This operator sums a function over all preimages of a point, weighting each preimage $y$ by $e^{\phi(y)}$. For a one-sided subshift of finite type with Hölder $\phi$, the same formula restricts to a bounded operator $\mathcal L_\phi:C^\alpha(\Sigma_A)\to C^\alpha(\Sigma_A)$ on the space of Hölder functions, which is the [Banach space](/page/Banach%20Space) used in the spectral theorem below. Its iterates contain the same Birkhoff sums as pressure:
\begin{align*}
\mathcal L_\phi^n f(x)=\sum_{T^n y=x} e^{S_n\phi(y)}f(y).
\end{align*}
[example: Transfer Matrix for a Markov Shift]
Let $A$ be a primitive $m\times m$ zero-one matrix, and let $\Sigma_A$ be the one-sided Markov shift with admissible transitions $A_{ij}=1$. Suppose $\phi(x)=\phi(x_0,x_1)$, and write $\phi(i,j)$ for its value on the cylinder determined by $x_0=i$ and $x_1=j$. If $f(x)=v_{x_0}$ depends only on the first coordinate and $x_0=j$, then the preimages of $x$ are the sequences $y=(i,j,x_1,x_2,\dots)$ with $A_{ij}=1$. Hence
\begin{align*}
\mathcal L_\phi f(x)=\sum_{\sigma y=x}e^{\phi(y_0,y_1)}f(y)=\sum_{i=1}^m A_{ij}e^{\phi(i,j)}v_i.
\end{align*}
Thus the weighted adjacency matrix is
\begin{align*}
M_{ij}=A_{ij}e^{\phi(i,j)},
\end{align*}
with the convention that the $j$th output coordinate is $\sum_i M_{ij}v_i$.
For an admissible word $w_0\dots w_n$, the Birkhoff weight over the first $n$ transitions is
\begin{align*}
e^{\sum_{k=0}^{n-1}\phi(w_k,w_{k+1})}=\prod_{k=0}^{n-1}e^{\phi(w_k,w_{k+1})}=\prod_{k=0}^{n-1}M_{w_k w_{k+1}},
\end{align*}
because admissibility gives $A_{w_k w_{k+1}}=1$ for every $k$. Summing over all admissible words of length $n+1$ gives
\begin{align*}
\sum_{w_0,\dots,w_n}\prod_{k=0}^{n-1}M_{w_k w_{k+1}}=\sum_{i,j=1}^m (M^n)_{ij}.
\end{align*}
Since $M$ has the same zero pattern as the primitive matrix $A$, it is primitive. The Perron-Frobenius theorem therefore gives a leading eigenvalue $\lambda>0$, and $\sum_{i,j}(M^n)_{ij}$ grows like $\lambda^n$, so its exponential growth rate is $\log\lambda$. Therefore
\begin{align*}
P(\sigma,\phi)=\lim_{n\to\infty}\frac{1}{n}\log\sum_{i,j=1}^m(M^n)_{ij}=\log\lambda.
\end{align*}
Let $\ell$ and $r$ be positive left and right Perron-Frobenius eigenvectors, normalised by
\begin{align*}
\sum_{i=1}^m \ell_i r_i=1.
\end{align*}
The right eigenvector relation reads
\begin{align*}
\sum_{j=1}^m M_{ij}r_j=\lambda r_i,
\end{align*}
and the left eigenvector relation reads
\begin{align*}
\sum_{i=1}^m \ell_iM_{ij}=\lambda \ell_j.
\end{align*}
Define
\begin{align*}
p_{ij}=\frac{M_{ij}r_j}{\lambda r_i}.
\end{align*}
Define also
\begin{align*}
\pi_i=\ell_i r_i.
\end{align*}
For each $i$,
\begin{align*}
\sum_{j=1}^m p_{ij}=\frac{1}{\lambda r_i}\sum_{j=1}^m M_{ij}r_j=\frac{\lambda r_i}{\lambda r_i}=1.
\end{align*}
Thus $p_{ij}$ are transition probabilities. The distribution $\pi$ is stationary because
\begin{align*}
\sum_{i=1}^m \pi_i p_{ij}=\sum_{i=1}^m \ell_i r_i\frac{M_{ij}r_j}{\lambda r_i}=\frac{r_j}{\lambda}\sum_{i=1}^m\ell_iM_{ij}=\frac{r_j}{\lambda}\lambda\ell_j=\ell_jr_j=\pi_j.
\end{align*}
For this stationary Markov chain $\mu$, the entropy and potential average are
\begin{align*}
h_\mu(\sigma)=-\sum_{i,j}\pi_i p_{ij}\log p_{ij}.
\end{align*}
Also
\begin{align*}
\int\phi\,d\mu=\sum_{i,j}\pi_i p_{ij}\phi(i,j).
\end{align*}
On allowed transitions, $M_{ij}=e^{\phi(i,j)}$, so
\begin{align*}
\log p_{ij}=\log M_{ij}+\log r_j-\log\lambda-\log r_i=\phi(i,j)+\log r_j-\log\lambda-\log r_i.
\end{align*}
Therefore
\begin{align*}
-\log p_{ij}+\phi(i,j)=\log\lambda+\log r_i-\log r_j.
\end{align*}
Substituting this identity into the free energy gives
\begin{align*}
h_\mu(\sigma)+\int\phi\,d\mu=\sum_{i,j}\pi_i p_{ij}(\log\lambda+\log r_i-\log r_j).
\end{align*}
Since $\sum_jp_{ij}=1$ and $\sum_i\pi_ip_{ij}=\pi_j$, this becomes
\begin{align*}
h_\mu(\sigma)+\int\phi\,d\mu=\log\lambda+\sum_i\pi_i\log r_i-\sum_j\pi_j\log r_j=\log\lambda.
\end{align*}
Thus the pressure is $\log\lambda$, and the equilibrium Markov chain has stationary distribution $\pi_i=\ell_i r_i$ and transition probabilities $p_{ij}=M_{ij}r_j/(\lambda r_i)$.
[/example]
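The identities in this example can be verified numerically on a small Markov shift. The sketch below uses only the Python standard library and finds the Perron-Frobenius data by power iteration, which is one convenient method and an assumption of this illustration. The golden mean adjacency matrix and the values in `phi` are arbitrary test data, and the helper names are hypothetical.

```python
import math

def mat_vec(M, v):
    """(M v)_i = sum_j M_ij v_j."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def vec_mat(v, M):
    """(v M)_j = sum_i v_i M_ij."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M))]

def perron(M, iters=500):
    """Power iteration; assumes M is primitive so that the iteration converges."""
    m = len(M)
    r, l = [1.0] * m, [1.0] * m
    for _ in range(iters):
        r = mat_vec(M, r)
        s = sum(r)
        r = [x / s for x in r]
        l = vec_mat(l, M)
        s = sum(l)
        l = [x / s for x in l]
    lam = sum(mat_vec(M, r))                    # valid because sum(r) == 1
    scale = sum(x * y for x, y in zip(l, r))
    return lam, [x / scale for x in l], r       # now sum_i l_i r_i == 1

# Golden mean shift with an illustrative two-coordinate potential (assumptions).
A = [[1, 1], [1, 0]]
phi = [[0.2, -0.5], [0.4, 0.0]]                 # phi[1][1] is never used
m = len(A)
M = [[A[i][j] * math.exp(phi[i][j]) for j in range(m)] for i in range(m)]

lam, l, r = perron(M)
p = [[M[i][j] * r[j] / (lam * r[i]) for j in range(m)] for i in range(m)]
pi = [l[i] * r[i] for i in range(m)]

row_sums = [sum(row) for row in p]              # each should be 1
stationary = vec_mat(pi, p)                     # should reproduce pi
free_energy = sum(pi[i] * p[i][j] * (phi[i][j] - math.log(p[i][j]))
                  for i in range(m) for j in range(m) if A[i][j] == 1)
print(math.log(lam), free_energy)
```

The two printed values should agree up to rounding, matching the conclusion that the free energy of this Markov chain equals $\log\lambda$.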
The finite matrix example indicates what should remain true for Hölder potentials: there should be a leading eigenvalue, a positive eigenfunction, and a dual eigenmeasure. The following theorem is the analytic engine behind the construction. In the course, it is proved for mixing subshifts of finite type with Hölder potentials, and the same proof pattern works for $C^{1+\alpha}$ expanding maps after replacing cylinders by inverse branches.
[quotetheorem:6817]
The eigenfunction-eigenmeasure pair gives a concrete invariant measure, and the assumptions explain why the construction is rigid. Topological mixing prevents the system from decomposing into several independent components with separate leading eigenvectors, while Hölder regularity gives bounded distortion of $S_n\phi$ on cylinders. For merely continuous potentials the transfer operator may fail to have a spectral gap on a useful Banach space, and equilibrium states need not have uniform cylinder estimates.
We still need a local test that recognises the spectral measure without referring to the whole spectrum. Such a test should say how much mass is assigned to a finite orbit name. The Gibbs property supplies exactly this cylinder estimate, with pressure as the normalising constant.
[definition: Gibbs Measure]
Let $\Sigma_A$ be a subshift of finite type, let $\phi:\Sigma_A\to\mathbb R$ be continuous, and let $P=P(\sigma,\phi)$. A Borel probability measure $\mu$ on $\Sigma_A$ is a Gibbs measure for $\phi$ if there exists $C\ge 1$ such that for every admissible word $w=w_0\dots w_{n-1}$ and every $x\in[w]$,
\begin{align*}
C^{-1}\le \frac{\mu([w])}{\exp(S_n\phi(x)-nP)}\le C.
\end{align*}
[/definition]
The Gibbs property converts symbolic cylinders into thermodynamic balls: the measure of an orbit neighbourhood is determined, up to a uniform multiplicative constant, by its energy and by the pressure normalisation. Since the Ruelle construction produced a candidate equilibrium state, the next step is to verify that this candidate has the promised local cylinder estimates. This is where the eigenmeasure relation and bounded distortion enter directly.
[quotetheorem:6818]
[citeproof:6818]
The theorem shows that the spectral equilibrium measure can be read from local orbit data. The hypotheses again matter: the conformal eigenmeasure gives the correct normalisation, positivity of $h$ prevents mass from disappearing on cylinders, and Hölder distortion makes the choice of $x\in[w]$ irrelevant up to a uniform constant. If the potential has unbounded variation along cylinders, the ratio in the Gibbs inequality can depend strongly on the chosen representative point and no uniform constant $C$ need exist.
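For the Markov measure built in the transfer matrix example, the Gibbs ratio can be computed exactly, which gives a useful sanity check on the constants. The computation below is a sketch under the assumptions of that example (primitive $A$, potential $\phi(x)=\phi(x_0,x_1)$, and the normalisation $\sum_i\ell_ir_i=1$); it is not a substitute for the general proof. For an admissible word $w_0\dots w_n$ and $x\in[w_0\dots w_n]$,
\begin{align*}
\mu([w_0\dots w_n])=\pi_{w_0}\prod_{k=0}^{n-1}p_{w_kw_{k+1}}=\ell_{w_0}r_{w_0}\,\lambda^{-n}\prod_{k=0}^{n-1}M_{w_kw_{k+1}}\prod_{k=0}^{n-1}\frac{r_{w_{k+1}}}{r_{w_k}}.
\end{align*}
The last product telescopes to $r_{w_n}/r_{w_0}$, the product of the entries of $M$ equals $e^{S_n\phi(x)}$, and $\lambda^{-n}=e^{-nP(\sigma,\phi)}$, so
\begin{align*}
\frac{\mu([w_0\dots w_n])}{\exp(S_n\phi(x)-nP(\sigma,\phi))}=\ell_{w_0}r_{w_n}.
\end{align*}
The ratio therefore lies between $\min_{i,j}\ell_ir_j$ and $\max_{i,j}\ell_ir_j$, both of which are positive and independent of $n$. The cylinder here is indexed by $n+1$ symbols because $\phi$ reads two coordinates; passing to cylinders of $n$ symbols changes the constant only by a bounded factor depending on $m$ and $\sup|\phi|$.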
In smooth expanding systems the same local estimate appears on inverse image intervals rather than symbolic cylinders, which makes the circle map a useful model example before moving to uniqueness.
[example: Expanding Circle Map]
Identify $S^1$ with $\mathbb R/\mathbb Z$ and write $T(x)=dx \pmod 1$, where $d\ge 2$. For each $x\in S^1$, the $d$ preimages are
\begin{align*}
y_j=\frac{x+j}{d}\pmod 1,\qquad 0\le j\le d-1.
\end{align*}
Indeed,
\begin{align*}
T(y_j)=d\left(\frac{x+j}{d}\right)=x+j\equiv x\pmod 1.
\end{align*}
Thus the transfer operator is
\begin{align*}
\mathcal L_\phi f(x)=\sum_{j=0}^{d-1} e^{\phi((x+j)/d)}f((x+j)/d).
\end{align*}
Its iterates contain the Birkhoff sums. For $n=2$,
\begin{align*}
\mathcal L_\phi^2 f(x)=\sum_{Ty=x}e^{\phi(y)}\mathcal L_\phi f(y).
\end{align*}
Substituting the definition of $\mathcal L_\phi f(y)$ gives
\begin{align*}
\mathcal L_\phi^2 f(x)=\sum_{Ty=x}e^{\phi(y)}\sum_{Tz=y}e^{\phi(z)}f(z).
\end{align*}
The conditions $Tz=y$ and $Ty=x$ are equivalent to $T^2z=x$ with $y=Tz$, so
\begin{align*}
\mathcal L_\phi^2 f(x)=\sum_{T^2z=x}e^{\phi(Tz)}e^{\phi(z)}f(z).
\end{align*}
Combining the exponents and using $S_2\phi(z)=\phi(z)+\phi(Tz)$ gives
\begin{align*}
\mathcal L_\phi^2 f(x)=\sum_{T^2z=x}e^{S_2\phi(z)}f(z).
\end{align*}
Repeating the same substitution at each step yields
\begin{align*}
\mathcal L_\phi^n f(x)=\sum_{T^n y=x}e^{S_n\phi(y)}f(y).
\end{align*}
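Let $I$ be a level-$n$ inverse image interval, that is, a half-open arc of length $d^{-n}$ that $T^n$ maps bijectively onto the circle. In the estimates below, $d(\cdot,\cdot)$ denotes the distance on $S^1$ and should not be confused with the degree $d$. If $u,v\in I$, then for $0\le k<n$ the points $T^k u$ and $T^k v$ lie in the same level-$(n-k)$ inverse image interval. Hence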
\begin{align*}
d(T^k u,T^k v)\le d^{k-n}.
\end{align*}
Assume $\phi$ is Hölder with exponent $\alpha$ and Hölder constant $C_\phi$, so
\begin{align*}
|\phi(a)-\phi(b)|\le C_\phi d(a,b)^\alpha
\end{align*}
for all $a,b\in S^1$. Then
\begin{align*}
|S_n\phi(u)-S_n\phi(v)|\le \sum_{k=0}^{n-1}|\phi(T^k u)-\phi(T^k v)|.
\end{align*}
Using the Hölder estimate at each time $k$ gives
\begin{align*}
|S_n\phi(u)-S_n\phi(v)|\le C_\phi\sum_{k=0}^{n-1}d(T^k u,T^k v)^\alpha.
\end{align*}
Using $d(T^k u,T^k v)\le d^{k-n}$ gives
\begin{align*}
|S_n\phi(u)-S_n\phi(v)|\le C_\phi\sum_{k=0}^{n-1}d^{-\alpha(n-k)}.
\end{align*}
With $r=n-k$, the last sum becomes
\begin{align*}
C_\phi\sum_{r=1}^{n}d^{-\alpha r}.
\end{align*}
Since this is bounded by the infinite geometric series,
\begin{align*}
|S_n\phi(u)-S_n\phi(v)|\le C_\phi\sum_{r=1}^{\infty}d^{-\alpha r}=\frac{C_\phi}{d^\alpha-1}.
\end{align*}
Thus $S_n\phi$ has uniformly bounded distortion on all level-$n$ inverse image intervals.
By the *[Ruelle-Perron-Frobenius Theorem](/theorems/6817)*, the operator has a leading eigenvalue $\lambda=e^{P(T,\phi)}$, a positive eigenfunction $h$, and an eigenmeasure $\nu$, and $\mu=h\nu$ is the associated invariant equilibrium measure. The bounded distortion estimate above is the smooth analogue of the cylinder distortion used in the *Ruelle Measure is Gibbs* theorem. Therefore there is a constant $C\ge 1$ such that for every level-$n$ inverse image interval $I$ and every $x\in I$,
\begin{align*}
C^{-1}\le \frac{\mu(I)}{\exp(S_n\phi(x)-nP(T,\phi))}\le C.
\end{align*}
So, for the expanding circle map, the mass of an interval of length $d^{-n}$ selected by an inverse branch is determined up to one uniform multiplicative constant by the accumulated potential $S_n\phi$ and the pressure normalisation $e^{-nP(T,\phi)}$.
[/example]
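Two steps of this example lend themselves to a numerical check: the formula for the iterated transfer operator and the uniform distortion bound. The Python sketch below assumes the doubling map ($d=2$) and the Lipschitz potential $\phi(x)=\cos(2\pi x)$, for which one may take $\alpha=1$ and $C_\phi=2\pi$. These choices, the test function `f`, and all helper names are illustrative assumptions.

```python
import math

D = 2                                     # degree of T (illustrative choice)
ALPHA, C_PHI = 1.0, 2 * math.pi           # phi below is Lipschitz with constant 2*pi

def T(x):
    return (D * x) % 1.0

def phi(x):
    return math.cos(2 * math.pi * x)

def f(x):
    return 1.0 + 0.5 * math.sin(2 * math.pi * x)

def birkhoff_sum(x, n):
    """S_n phi(x) = phi(x) + phi(Tx) + ... + phi(T^{n-1} x)."""
    total = 0.0
    for _ in range(n):
        total += phi(x)
        x = T(x)
    return total

def transfer(g):
    """Return the function L_phi g for the map x -> D x mod 1."""
    def Lg(x):
        return sum(math.exp(phi((x + j) / D)) * g((x + j) / D) for j in range(D))
    return Lg

def transfer_power_direct(g, x, n):
    """Sum of exp(S_n phi(y)) g(y) over the D**n solutions y of T^n y = x."""
    return sum(math.exp(birkhoff_sum((x + j) / D**n, n)) * g((x + j) / D**n)
               for j in range(D**n))

# Iterating the operator three times versus the direct Birkhoff-sum formula.
x0 = 0.3
iterated = transfer(transfer(transfer(f)))(x0)
direct = transfer_power_direct(f, x0, 3)

# Distortion of S_n phi on every level-n inverse image interval.
n = 10
bound = C_PHI / (D**ALPHA - 1)
max_distortion = 0.0
for j in range(D**n):
    u = j / D**n                          # left endpoint of the interval
    v = (j + 0.999) / D**n                # a point near the right endpoint
    max_distortion = max(max_distortion,
                         abs(birkhoff_sum(u, n) - birkhoff_sum(v, n)))
print(iterated, direct, max_distortion, bound)
```

The sampled distortion only tests two points per interval, so it gives a lower estimate of the true supremum; the point of the check is that it stays below the bound $C_\phi/(d^\alpha-1)$ uniformly in $n$.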
## Bowen Property, Specification, and Uniqueness of Equilibrium States
The remaining problem is uniqueness. Existence and the Gibbs property may still allow several competing thermodynamic phases; uniqueness follows when orbit segments can be pasted together and when the potential has controlled distortion along Bowen balls.
[definition: Bowen Property]
Let $T:X\to X$ be continuous on a compact metric space, and let $\phi:X\to\mathbb R$ be continuous. The potential $\phi$ has the Bowen property at scale $\varepsilon>0$ if there exists $V<\infty$ such that whenever
\begin{align*}
d(T^k x,T^k y)<\varepsilon\quad\text{for }0\le k<n,
\end{align*}
one has
\begin{align*}
|S_n\phi(x)-S_n\phi(y)|\le V.
\end{align*}
[/definition]
The Bowen property is the distortion estimate needed to treat all points in the same orbit segment as having essentially the same weight. Hölder potentials on expanding maps and subshifts of finite type satisfy it because matching orbit names force exponentially close coordinates. However, distortion control only compares points that already shadow the same segment; it does not say that unrelated segments can be assembled into longer orbits. The next definition captures the need for pressure estimates to multiply across independently chosen orbit pieces.
[definition: Specification Property]
Let $T:X\to X$ be continuous on a compact metric space. The system has specification at scale $\varepsilon>0$ if there exists $\tau\in\mathbb N$ such that every finite collection of orbit segments can be $\varepsilon$-shadowed in the prescribed order by a single orbit, with transition gaps of length at most $\tau$ between consecutive segments.
[/definition]
Specification says that finite pieces of orbits can be concatenated with uniformly bounded waiting time. This is the dynamical mixing hypothesis that prevents different high-pressure regions from remaining isolated. Combined with the Bowen property, it makes partition sums behave almost multiplicatively: the weight of a pasted orbit segment is comparable to the product of the weights of its pieces, up to a bounded error from the transition gaps.
The next example shows these two requirements in the finite-range symbolic setting. Local dependence of the potential gives distortion control, while primitivity of the transition graph gives the orbit-pasting mechanism.
[example: Nearest-Neighbour Ising-Type Potential]
Let $\Sigma_A\subset S^{\mathbb N}$ be a mixing finite-state Markov shift with adjacency matrix $A$, and consider
\begin{align*}
\phi(x)=J(x_0,x_1)+H(x_0),
\end{align*}
where $J:S\times S\to\mathbb R$ and $H:S\to\mathbb R$. If $x,y\in\Sigma_A$ agree in their first two coordinates, then $x_0=y_0$ and $x_1=y_1$, so
\begin{align*}
\phi(x)-\phi(y)=J(x_0,x_1)+H(x_0)-J(y_0,y_1)-H(y_0)=0.
\end{align*}
Thus $\phi$ is constant on every cylinder determined by two consecutive symbols. More generally, if $x_k=y_k$ for $0\le k\le n$, then for each $0\le k<n$,
\begin{align*}
\phi(\sigma^k x)=J(x_k,x_{k+1})+H(x_k)=J(y_k,y_{k+1})+H(y_k)=\phi(\sigma^k y).
\end{align*}
Therefore
\begin{align*}
S_n\phi(x)-S_n\phi(y)=\sum_{k=0}^{n-1}\bigl(\phi(\sigma^k x)-\phi(\sigma^k y)\bigr)=\sum_{k=0}^{n-1}0=0.
\end{align*}
Hence the Bowen distortion constant may be taken to be $V=0$ at any symbolic scale small enough to force agreement of the coordinates $0,\dots,n$.
Now take a function depending only on the first coordinate, $f(x)=v_{x_0}$. If $x_0=j$, then the preimages of $x$ are the sequences $y=(i,j,x_1,x_2,\dots)$ with $A_{ij}=1$. Hence
\begin{align*}
\mathcal L_\phi f(x)=\sum_{\sigma y=x}e^{\phi(y)}f(y)=\sum_{i\in S:A_{ij}=1}e^{J(i,j)+H(i)}v_i=\sum_{i\in S}A_{ij}e^{J(i,j)+H(i)}v_i.
\end{align*}
Thus the weighted adjacency matrix is
\begin{align*}
M_{ij}=A_{ij}e^{J(i,j)+H(i)}.
\end{align*}
If the shift is full, then $A_{ij}=1$ for all $i,j$, so every $M_{ij}$ is positive. If the transition graph is mixing, then $A$ is primitive, and $M$ has the same zero pattern as $A$ with positive weights on the allowed edges; hence $M$ is primitive.
For an admissible word $w_0\dots w_n$ and any $x\in[w_0\dots w_n]$,
\begin{align*}
S_n\phi(x)=\sum_{k=0}^{n-1}\phi(\sigma^k x)=\sum_{k=0}^{n-1}\bigl(J(w_k,w_{k+1})+H(w_k)\bigr).
\end{align*}
Since $A_{w_k w_{k+1}}=1$ along an admissible word,
\begin{align*}
e^{S_n\phi(x)}=\prod_{k=0}^{n-1}e^{J(w_k,w_{k+1})+H(w_k)}=\prod_{k=0}^{n-1}A_{w_k w_{k+1}}e^{J(w_k,w_{k+1})+H(w_k)}=\prod_{k=0}^{n-1}M_{w_k w_{k+1}}.
\end{align*}
Summing over all admissible words of length $n+1$ gives
\begin{align*}
\sum_{w_0,\dots,w_n}e^{S_n\phi(x_w)}=\sum_{w_0,\dots,w_n}\prod_{k=0}^{n-1}M_{w_k w_{k+1}}=\sum_{i,j\in S}(M^n)_{ij},
\end{align*}
where $x_w$ is any point in the cylinder $[w_0\dots w_n]$. By the *Perron-Frobenius theorem*, the primitive matrix $M$ has a leading eigenvalue $\lambda>0$ with positive left and right eigenvectors $\ell$ and $r$, and $\sum_{i,j\in S}(M^n)_{ij}$ grows like $\lambda^n$, so its exponential growth rate is $\log\lambda$. Thus
\begin{align*}
P(\sigma,\phi)=\log\lambda.
\end{align*}
Normalise $\ell$ and $r$ by
\begin{align*}
\sum_{i\in S}\ell_i r_i=1,
\end{align*}
and define
\begin{align*}
p_{ij}=\frac{M_{ij}r_j}{\lambda r_i}\quad\text{and}\quad \pi_i=\ell_i r_i.
\end{align*}
The right eigenvector relation gives
\begin{align*}
\sum_{j\in S}p_{ij}=\frac{1}{\lambda r_i}\sum_{j\in S}M_{ij}r_j=\frac{\lambda r_i}{\lambda r_i}=1,
\end{align*}
so $p_{ij}$ are transition probabilities on the allowed edges. The left eigenvector relation gives stationarity:
\begin{align*}
\sum_{i\in S}\pi_i p_{ij}=\sum_{i\in S}\ell_i r_i\frac{M_{ij}r_j}{\lambda r_i}=\frac{r_j}{\lambda}\sum_{i\in S}\ell_iM_{ij}=\frac{r_j}{\lambda}\lambda\ell_j=\pi_j.
\end{align*}
Let $\mu$ be the stationary Markov measure with initial distribution $\pi$ and transition matrix $p$. Its entropy and potential average are
\begin{align*}
h_\mu(\sigma)=-\sum_{i,j\in S}\pi_i p_{ij}\log p_{ij}.
\end{align*}
Also,
\begin{align*}
\int\phi\,d\mu=\sum_{i,j\in S}\pi_i p_{ij}\bigl(J(i,j)+H(i)\bigr).
\end{align*}
On allowed transitions, $\log M_{ij}=J(i,j)+H(i)$, so
\begin{align*}
\log p_{ij}=\log M_{ij}+\log r_j-\log\lambda-\log r_i=J(i,j)+H(i)+\log r_j-\log\lambda-\log r_i.
\end{align*}
Therefore
\begin{align*}
-\log p_{ij}+J(i,j)+H(i)=\log\lambda+\log r_i-\log r_j.
\end{align*}
Substituting this identity into the free energy gives
\begin{align*}
h_\mu(\sigma)+\int\phi\,d\mu=\sum_{i,j\in S}\pi_i p_{ij}\bigl(\log\lambda+\log r_i-\log r_j\bigr).
\end{align*}
Expanding the three terms,
\begin{align*}
h_\mu(\sigma)+\int\phi\,d\mu=\log\lambda\sum_{i,j\in S}\pi_i p_{ij}+\sum_{i,j\in S}\pi_i p_{ij}\log r_i-\sum_{i,j\in S}\pi_i p_{ij}\log r_j.
\end{align*}
Since $\sum_j p_{ij}=1$ and $\sum_i\pi_i p_{ij}=\pi_j$, this becomes
\begin{align*}
h_\mu(\sigma)+\int\phi\,d\mu=\log\lambda+\sum_{i\in S}\pi_i\log r_i-\sum_{j\in S}\pi_j\log r_j=\log\lambda.
\end{align*}
Because $P(\sigma,\phi)=\log\lambda$, the stationary Markov measure $\mu$ is an equilibrium state. Thus the nearest-neighbour Ising-type potential selects the Markov chain determined by the Perron-Frobenius eigenvectors of the weighted adjacency matrix, with transition probabilities $p_{ij}=M_{ij}r_j/(\lambda r_i)$.
[/example]
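For a concrete instance, take two spins $S=\{+1,-1\}$ on the full shift with $J(s,t)=\beta st$ and $H(s)=hs$. These formulas and the values of $\beta$ and $h$ are assumptions made for this illustration only. The weighted matrix is then $2\times 2$ with trace $2e^{\beta}\cosh h$ and determinant $e^{2\beta}-e^{-2\beta}$, so its leading eigenvalue is $\lambda=e^{\beta}\cosh h+\sqrt{e^{2\beta}\sinh^2h+e^{-2\beta}}$. The Python sketch below compares this closed form with the growth of the partition sums.

```python
import math

beta, h = 0.5, 0.2                  # illustrative coupling and field (assumptions)
spins = [+1, -1]

def J(s, t):
    return beta * s * t

def H(s):
    return h * s

# Weighted adjacency matrix M_ij = exp(J(i,j) + H(i)) on the full shift.
M = [[math.exp(J(s, t) + H(s)) for t in spins] for s in spins]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def partition_sum(n):
    """Z_n = sum_{i,j} (M^n)_{ij}, the weighted count of words w_0 ... w_n."""
    P = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        P = mat_mul(P, M)
    return sum(map(sum, P))

# Closed-form leading eigenvalue of the 2x2 matrix M.
lam = math.exp(beta) * math.cosh(h) + math.sqrt(
    math.exp(2 * beta) * math.sinh(h) ** 2 + math.exp(-2 * beta))

ratio = partition_sum(61) / partition_sum(60)          # converges to lambda
pressure_estimate = math.log(partition_sum(200)) / 200 # converges to log(lambda)
print(lam, ratio, pressure_estimate, math.log(lam))
```

The ratio of consecutive partition sums converges geometrically to $\lambda$, while $\frac1n\log$ of the partition sum converges to $\log\lambda$ only at rate $O(1/n)$, which is why the second comparison is looser.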
The Ising-type example combines the two mechanisms: finite-range energy gives bounded distortion, while mixing of the transition graph gives orbit concatenation. We can now state the standard uniqueness theorem. It is the thermodynamic counterpart of the uniqueness of the measure of maximal entropy for systems with specification.
[quotetheorem:6819]
The theorem also explains how uniqueness can fail, and each hypothesis rules out a different obstruction. Expansiveness ensures that orbit names generate the dynamics; without it, entropy estimates from separated sets need not control measures on the underlying space. Specification rules out isolated high-pressure components. The Bowen property rules out potentials whose energy oscillates too much inside one orbit name; for example, on a shift a continuous potential with large variations on long cylinders can favour different representatives of the same symbolic block so strongly that no uniform Gibbs comparison survives.
The cleanest counterexample removes specification while keeping the potential harmless.
[example: Why Specification Matters]
Let $X=X_1\sqcup X_2$, where $X_1$ and $X_2$ are disjoint closed invariant mixing subshifts, and let $\phi=0$. Write
\begin{align*}
h_1=h_{\text{top}}(T|_{X_1}),\qquad h_2=h_{\text{top}}(T|_{X_2}),
\end{align*}
and assume $h_1=h_2=h$.
For $\phi=0$, every Birkhoff sum is zero:
\begin{align*}
S_n\phi(x)=\sum_{k=0}^{n-1}\phi(T^kx)=\sum_{k=0}^{n-1}0=0.
\end{align*}
Hence every orbit segment has weight
\begin{align*}
e^{S_n\phi(x)}=e^0=1.
\end{align*}
Therefore the pressure partition sum is the ordinary separated-set count, so
\begin{align*}
P(T,0)=h_{\text{top}}(T).
\end{align*}
Since $X$ is the disjoint union of the two invariant components, separated sets in $X_1$ and $X_2$ may be combined, and every separated set in $X$ splits into its intersections with $X_1$ and $X_2$. Thus
\begin{align*}
h_{\text{top}}(T)=\max\{h_{\text{top}}(T|_{X_1}),h_{\text{top}}(T|_{X_2})\}=\max\{h,h\}=h.
\end{align*}
So
\begin{align*}
P(T,0)=h.
\end{align*}
Let $\mu_1$ be a measure of maximal entropy for $T|_{X_1}$ and let $\mu_2$ be a measure of maximal entropy for $T|_{X_2}$. Viewing them as measures on $X$ by extending them by zero outside their components, they are $T$-invariant and satisfy
\begin{align*}
h_{\mu_1}(T)=h,\qquad h_{\mu_2}(T)=h.
\end{align*}
Also,
\begin{align*}
\int_X \phi\,d\mu_1=\int_X 0\,d\mu_1=0,\qquad \int_X \phi\,d\mu_2=\int_X 0\,d\mu_2=0.
\end{align*}
Therefore
\begin{align*}
h_{\mu_1}(T)+\int_X\phi\,d\mu_1=h+0=h=P(T,0),
\end{align*}
and similarly
\begin{align*}
h_{\mu_2}(T)+\int_X\phi\,d\mu_2=h+0=h=P(T,0).
\end{align*}
Thus both $\mu_1$ and $\mu_2$ are equilibrium states for $\phi=0$. They are distinct because
\begin{align*}
\mu_1(X_1)=1,\qquad \mu_2(X_1)=0.
\end{align*}
Uniqueness fails even though the potential is constant; the obstruction is that an orbit starting in $X_1$ remains in $X_1$ and an orbit starting in $X_2$ remains in $X_2$, so no single orbit can concatenate orbit segments from the two components.
[/example]
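A toy model makes both halves of this example concrete. The Python sketch below assumes that $X_1$ is the full shift on the symbols $\{0,1\}$ and $X_2$ is the full shift on $\{2,3\}$, so $h=\log 2$; this specific choice and the helper names are hypothetical. It counts the words of $X$ and then searches for a gap word that would glue a word of $X_1$ to a word of $X_2$.

```python
import itertools
import math

X1_ALPHABET = (0, 1)   # hypothetical component X_1: full shift on {0, 1}
X2_ALPHABET = (2, 3)   # hypothetical component X_2: full shift on {2, 3}
ALL_SYMBOLS = X1_ALPHABET + X2_ALPHABET

def in_language(word):
    """A word occurs in X = X_1 ⊔ X_2 iff all its symbols lie in one component."""
    symbols = set(word)
    return symbols <= set(X1_ALPHABET) or symbols <= set(X2_ALPHABET)

def count_words(n):
    return sum(1 for w in itertools.product(ALL_SYMBOLS, repeat=n) if in_language(w))

n = 8
count = count_words(n)                     # 2 * 2**n words of length n
entropy_estimate = math.log(count) / n     # log 2 + (log 2) / n

# Try to join a word from X_1 to a word from X_2 with any gap of length <= max_gap.
u, v = (0, 1, 1), (2, 3, 2)
max_gap = 4
glued = [g for k in range(max_gap + 1)
         for g in itertools.product(ALL_SYMBOLS, repeat=k)
         if in_language(u + g + v)]
print(count, entropy_estimate, glued)
```

The word count shows that the union has the same exponential growth rate as each component, and the empty list `glued` reflects the failure of specification: no transition word, of any length, connects the two components.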
The formalism therefore has three layers. Pressure is the variational quantity, the transfer operator constructs the equilibrium measure in hyperbolic models, and the Bowen plus specification hypotheses explain why this measure is unique. Later applications use this package to study dimension, large deviations, and statistical properties of chaotic systems.
Thermodynamic formalism explains how entropy competes with energy, but its consequences are most visible in the statistical behaviour of dynamics. The next chapter studies mixing, weak mixing, and decay of correlations, showing how stronger probabilistic properties refine the entropy-based classification introduced earlier.
# Entropy, Mixing, and Decay of Correlations
This chapter compares the main randomness properties that appear after entropy has been introduced. Bernoulli systems represent the strongest probabilistic model in the course, while the K-property, mixing, and weak mixing record progressively weaker ways in which past and future become unrelated, and correlation decay measures how quickly this happens. The chapter also introduces the analytic mechanism behind quantitative mixing: spectral gaps for transfer operators acting on spaces of regular observables.
The prerequisites are the measure-theoretic entropy of a finite or countable generating partition, basic ergodicity, conditional expectation, and the $L^p$ language for observables on a probability space. We also use the elementary spectral terminology for bounded operators on Banach spaces when transfer operators enter the discussion.
The guiding theme is that entropy alone does not measure every form of randomness. Positive entropy often accompanies strong statistical behaviour, but there are zero-entropy systems with weak mixing, and there are positive-entropy systems whose finer structure requires more than the value of entropy. We therefore place entropy beside structural, spectral, and probabilistic notions of chaos.
## Bernoulli Systems, K-Systems, and Mixing Hierarchies
The first question is how much independence a measure-preserving system can contain. Bernoulli shifts have independent coordinates by construction, but a general system may only look independent after taking a suitable generating partition, or may only display independence asymptotically in time.
[definition: Bernoulli System]
Let $(X,\mathcal B,\mu)$ be a probability space and let $T:X\to X$ be an invertible probability-preserving transformation. It is a Bernoulli system if it is measure-theoretically isomorphic to a two-sided Bernoulli shift $(A^{\mathbb Z},\mathcal A^{\mathbb Z},p^{\mathbb Z},\sigma)$ over a countable alphabet $A$ with probability vector $p$.
[/definition]
The definition says that, after ignoring null sets and changing coordinates by an isomorphism, the whole orbit process is an i.i.d. sequence. This is much stronger than having the same entropy as a Bernoulli shift: Ornstein's theory shows that Bernoulli systems are classified up to isomorphism by their entropy, but being Bernoulli is itself a structural property.
[example: Baker Map As A Bernoulli Model]
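Let $B$ be the baker map on the unit square $[0,1)^2$ with Lebesgue measure, defined by $B(x,y)=(2x,y/2)$ for $0\le x<1/2$ and $B(x,y)=(2x-1,(y+1)/2)$ for $1/2\le x<1$. Discard the null set of points whose $x$- or $y$-coordinate has two binary expansions. For $a=(a_n)_{n\in\mathbb Z}\in\{0,1\}^{\mathbb Z}$, define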
\begin{align*}
\Phi(a)=\left(\sum_{j=1}^{\infty}a_j2^{-j},\sum_{j=0}^{\infty}a_{-j}2^{-(j+1)}\right).
\end{align*}
Thus, writing $(x,y)=\Phi(a)$, the first coordinate $x$ has binary expansion $0.a_1a_2a_3\cdots$, while the second coordinate $y$ has binary expansion $0.a_0a_{-1}a_{-2}\cdots$.
Let $\sigma$ be the left shift, so $(\sigma a)_n=a_{n+1}$. If $a_1=0$, then $0\le x<1/2$ and
\begin{align*}
2x=2\sum_{j=1}^{\infty}a_j2^{-j}=\sum_{j=1}^{\infty}a_j2^{-(j-1)}=\sum_{j=2}^{\infty}a_j2^{-(j-1)}=\sum_{i=1}^{\infty}a_{i+1}2^{-i}.
\end{align*}
Also,
\begin{align*}
\frac y2=\frac12\sum_{j=0}^{\infty}a_{-j}2^{-(j+1)}=\sum_{j=0}^{\infty}a_{-j}2^{-(j+2)}.
\end{align*}
Since $a_1=0$, the last expression is the binary expansion $0.a_1a_0a_{-1}\cdots$. Hence $B(\Phi(a))=\Phi(\sigma a)$ in this case.
If $a_1=1$, then $1/2\le x<1$ and
\begin{align*}
2x-1=2\sum_{j=1}^{\infty}a_j2^{-j}-1=a_1+\sum_{j=2}^{\infty}a_j2^{-(j-1)}-1=\sum_{j=2}^{\infty}a_j2^{-(j-1)}=\sum_{i=1}^{\infty}a_{i+1}2^{-i}.
\end{align*}
Also,
\begin{align*}
\frac{y+1}{2}=\frac12+\frac12\sum_{j=0}^{\infty}a_{-j}2^{-(j+1)}=2^{-1}+\sum_{j=0}^{\infty}a_{-j}2^{-(j+2)}.
\end{align*}
Since $a_1=1$, this is again the binary expansion $0.a_1a_0a_{-1}\cdots$. Therefore $B(\Phi(a))=\Phi(\sigma a)$ for every non-dyadic coded point.
The two vertical rectangles $[0,1/2)\times[0,1)$ and $[1/2,1)\times[0,1)$ record the symbol $a_1$, and their images and preimages under the iterates of $B$ recover all coordinates $a_n$ of the bilateral itinerary. The fair product measure on $\{0,1\}^{\mathbb Z}$ pushes forward to Lebesgue measure because prescribing $k$ binary digits fixes a dyadic rectangle of area $2^{-k}$. Thus the baker map is measure-theoretically isomorphic to the two-sided Bernoulli shift with weights $(1/2,1/2)$. Its entropy is
\begin{align*}
-\frac12\log\frac12-\frac12\log\frac12=\log 2,
\end{align*}
and the coded orbit process is genuinely i.i.d.
[/example]
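The coding identity $B\circ\Phi=\Phi\circ\sigma$ can be tested with exact rational arithmetic. The Python sketch below works with sequences supported on the coordinates $-N,\dots,N$ and zero elsewhere, which is a truncation assumption: such sequences code dyadic points, so the check uses the terminating binary expansion for them rather than the generic points of the example. The dictionary representation and the helper names are hypothetical.

```python
from fractions import Fraction
import itertools

N = 4   # sequences are supported on coordinates -N..N (a truncation assumption)

def coord(a, n):
    """a is a dict n -> bit; coordinates outside the stored support are 0."""
    return a.get(n, 0)

def Phi(a):
    """x = 0.a_1 a_2 ..., y = 0.a_0 a_{-1} ... as exact fractions."""
    x = sum(Fraction(coord(a, j), 2**j) for j in range(1, N + 2))
    y = sum(Fraction(coord(a, -j), 2**(j + 1)) for j in range(0, N + 2))
    return x, y

def shift(a):
    """Left shift: (sigma a)_n = a_{n+1}."""
    return {n - 1: bit for n, bit in a.items()}

def baker(x, y):
    if x < Fraction(1, 2):
        return 2 * x, y / 2
    return 2 * x - 1, (y + 1) / 2

# Compare B(Phi(a)) with Phi(sigma a) for every sequence supported on -N..N.
mismatches = 0
total = 0
for bits in itertools.product((0, 1), repeat=2 * N + 1):
    a = dict(zip(range(-N, N + 1), bits))
    total += 1
    if baker(*Phi(a)) != Phi(shift(a)):
        mismatches += 1
print(total, mismatches)
```

Using `Fraction` avoids any rounding, so the comparison is an exact equality test on each of the $2^{2N+1}$ truncated sequences.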
The baker map shows how an invertible deterministic transformation can still be a complete i.i.d. process after the right coding. To compare this with weaker systems, we need a property that keeps the entropy-theoretic loss of memory but does not demand a literal product-coordinate model.
[definition: K-System]
Let $(X,\mathcal B,\mu)$ be a probability space and let $T:X\to X$ be an invertible probability-preserving transformation. It has the K-property if there exists a sub-$\sigma$-algebra $\mathcal F\subset\mathcal B$ such that $T^{-1}\mathcal F\subset\mathcal F$, $\bigvee_{n\ge 0}T^n\mathcal F=\mathcal B$ modulo null sets, and $\bigcap_{n\ge 0}T^{-n}\mathcal F$ is the trivial $\sigma$-algebra modulo null sets, that is, it contains only sets of measure $0$ or $1$.
[/definition]
Here $T^{-1}\mathcal F$ denotes the pullback $\{T^{-1}A:A\in\mathcal F\}$, and $T^n\mathcal F$ denotes the image $\sigma$-algebra $\{T^nA:A\in\mathcal F\}$, which is well defined modulo null sets because $T$ is invertible and probability-preserving. Stating the convention explicitly prevents the future and past filtrations from being interchanged in the proof.
The K-property is an entropy-theoretic strengthening of ergodicity. It rules out deterministic information that survives indefinitely into the future-facing filtration. The theorem below proves that Bernoulli independence is strong enough to give this tail condition.
[quotetheorem:6820]
[citeproof:6820]
This implication uses the full product structure of the Bernoulli shift: independence of the distant coordinate tail is what allows Kolmogorov's zero-one law to remove all residual future information. The product-coordinate hypothesis cannot be replaced by positive entropy alone. For example, the product of a positive-entropy Bernoulli shift with an irrational rotation still has positive entropy, but the rotation factor contributes non-constant eigenfunctions and a nontrivial deterministic factor, so the product is not a K-system. The conclusion is therefore a statement about independent coordinates, not just about exponential orbit-name growth.
The implication also has a limitation in the opposite direction. It proves that Bernoulli systems are K-systems, but it does not characterise Bernoulli systems among K-systems. Ornstein's non-Bernoulli K-automorphisms, constructed through failures of the very weak Bernoulli condition, show that complete positivity of entropy need not produce an i.i.d. coordinate model. With the tail obstruction removed, the next question is whether ordinary sets become independent at large time separations, which leads from filtration-based memory loss to mixing.
The filtration definition of the K-property is powerful but indirect: it speaks about hidden information contained in increasing and decreasing $\sigma$-algebras. To compare concrete events observed at two different times, we need a formulation that asks whether $A$ and $T^{-n}B$ become asymptotically independent as the time gap $n$ grows.
This set-based formulation is the bridge from structural memory loss to observable statistical behaviour. It keeps the same probability space and transformation, but replaces tail $\sigma$-algebras by the concrete error term $\mu(A\cap T^{-n}B)-\mu(A)\mu(B)$ that measures how much knowing the present event $A$ changes the probability of the future event $B$.
There are two natural strengths of this asymptotic independence. The stronger set-level version asks the error term to converge to $0$ along the full sequence of times; this is the direct analogue of ordinary independence appearing farther and farther apart in the orbit. A weaker but still powerful version asks the same error to be small in Cesàro average, which permits sparse exceptional times while ruling out persistent periodic or eigenfunction obstructions. The next definition records both notions because they occupy different places in the hierarchy between Bernoulli behaviour and ergodicity.
[definition: Mixing And Weak Mixing]
Let $(X,\mathcal B,\mu)$ be a probability space and let $T:X\to X$ be a probability-preserving transformation.
The system is mixing if for all $A,B\in\mathcal B$,
\begin{align*}
\mu(A\cap T^{-n}B)\to \mu(A)\mu(B)
\end{align*}
as $n\to\infty$.
The system is weak mixing if
\begin{align*}
\frac{1}{N}\sum_{n=0}^{N-1}|\mu(A\cap T^{-n}B)-\mu(A)\mu(B)|\to 0
\end{align*}
for all $A,B\in\mathcal B$ as $N\to\infty$.
[/definition]
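As a concrete check of the mixing condition, the following sketch computes the error term $\mu(A\cap T^{-n}B)-\mu(A)\mu(B)$ exactly for the doubling map $x\mapsto 2x\pmod 1$ with Lebesgue measure, taking $A$ and $B$ to be binary cylinders. The choice of system, the function name `cylinder_correlation`, and the encoding of sets by digit strings are illustrative assumptions rather than part of the definition.

```python
from fractions import Fraction
from itertools import product


def cylinder_correlation(a: str, b: str, n: int) -> Fraction:
    """Return mu(A ∩ T^{-n}B) - mu(A)mu(B) for binary cylinders A=[a], B=[b].

    A point x lies in [a] when its binary expansion starts with the string a.
    Under Lebesgue measure every binary word of a fixed length is equally likely.
    """
    length = max(len(a), n + len(b))
    hits = 0
    for word in product("01", repeat=length):
        w = "".join(word)
        # x is in A if its digits start with a; T^n x is in B if the digits
        # from position n onward start with b.
        if w.startswith(a) and w[n:n + len(b)] == b:
            hits += 1
    joint = Fraction(hits, 2 ** length)
    return joint - Fraction(1, 2 ** len(a)) * Fraction(1, 2 ** len(b))


if __name__ == "__main__":
    for n in range(5):
        print(n, cylinder_correlation("01", "1", n))
```

For `a = "01"` and `b = "1"` the error is nonzero only while the two digit windows overlap ($n<2$) and vanishes exactly afterwards, which is the cylinder-level form of mixing for this particular system.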
Weak mixing is an averaged version of mixing. It is strong enough to exclude non-constant eigenfunctions in $L^2(X,\mu)$, but it allows exceptional times at which correlations fail to be small. Since Bernoulli systems model independence of whole coordinate blocks, the next definition asks for independence of any finite family of well-separated observations.
[definition: Mixing Of All Orders]
Let $(X,\mathcal B,\mu)$ be a probability space and let $T:X\to X$ be a probability-preserving transformation. The system is mixing of all orders if for every $r\ge 2$ and every $A_1,\dots,A_r\in\mathcal B$,
\begin{align*}
\mu(A_1\cap T^{-n_2}A_2\cap\cdots\cap T^{-n_r}A_r)\to \prod_{j=1}^r\mu(A_j)
\end{align*}
whenever $\min_{i\ne j}|n_i-n_j|\to\infty$, with $n_1=0$.
[/definition]
This definition records asymptotic independence of several observations taken at mutually separated times. It is designed to match the finite-block independence already present in an i.i.d. shift. In a Bernoulli shift, events depending on sufficiently separated finite coordinate windows are exactly independent, and arbitrary measurable events are approximated by such cylinder events. Consequently every Bernoulli shift is mixing of all orders.
The Bernoulli hypothesis in this result is used through exact independence of separated coordinate blocks. If only ordinary mixing is assumed, pairwise independence at large time gaps gives no direct control over higher joint intersections. Whether mixing implies mixing of all orders for a single transformation is Rokhlin's problem, which remains open; Ledrappier's example shows that the implication fails for $\mathbb Z^2$-actions, while Kalikow proved that mixing rank-one transformations are mixing of order three. The argument above therefore cannot be run from mixing alone, even though no counterexample among single transformations is known. If weak mixing is assumed instead, even pairwise correlations need only be small on average; irrational rotations show that ergodicity by itself does not remove eigenfunction obstructions, and the Chacon transformation shows that weak mixing can occur with zero entropy and without ordinary mixing.
Combining Chapter 5's Bernoulli classification viewpoint with the present mixing definitions, the implication chain encountered in the course is therefore
\begin{align*}
\text{Bernoulli} \implies \text{K-property} \implies \text{mixing} \implies \text{weak mixing} \implies \text{ergodic}.
\end{align*}
The preceding hierarchy records the strongest qualitative mixing conclusion available from Bernoulli independence, but mixing of all orders is still not a characterization of Bernoulli systems in general. There are K-automorphisms that are not Bernoulli, and there are mixing systems that are not K, so the converses fail even before one reaches weak mixing. [Irrational rotations are ergodic](/theorems/3429) but not weak mixing, while the Chacon transformation gives a weak mixing zero-entropy example that is not mixing. The main lesson is that entropy contributes to the hierarchy, but it does not collapse it: structural independence, spectral behaviour, and orbit-name complexity remain distinct invariants.
[example: Zero-Entropy Weak Mixing]
The classical Chacon transformation is obtained by the rank-one cutting-and-stacking rule: start with one column of height $h_0=1$, cut each column into three equal subcolumns, place one spacer above the middle subcolumn, and stack left, middle-with-spacer, then right. Thus the heights satisfy
\begin{align*}
h_{r+1}=h_r+(h_r+1)+h_r=3h_r+1.
\end{align*}
Solving this recurrence gives
\begin{align*}
h_r+\frac12=3\left(h_{r-1}+\frac12\right)=\cdots=3^r\left(h_0+\frac12\right)=\frac{3^{r+1}}{2},
\end{align*}
so
\begin{align*}
h_r=\frac{3^{r+1}-1}{2}.
\end{align*}
The transformation is rank one because every stage is built from a single tower, and the finite partitions into tower levels refine to generate the measure algebra. A rank-one transformation has Kolmogorov-Sinai entropy $0$: at stage $r$, an orbit name over the tower is determined by the position in one tower together with the spacer pattern introduced by the cutting rule, and the number of stage-$r$ tower words grows at most polynomially in $h_r$, not exponentially in $h_r$. Hence
\begin{align*}
\lim_{r\to\infty}\frac{1}{h_r}\log(\text{number of stage-}r\text{ names})=0,
\end{align*}
which gives zero entropy for the generating rank-one partitions.
For Chacon's spacer pattern, the same construction also eliminates non-constant eigenfunctions: the repeated three-cut stacking with a single middle spacer forces any $L^2$ eigenfunction to agree with its translate along tower levels and with the spacer-shifted copy, so the eigenvalue equations are compatible only for the constant eigenvalue $1$. Therefore the transformation is weak mixing, while its Kolmogorov-Sinai entropy is $0$. This shows that weak mixing is not an entropy-positive phenomenon: spectral randomness can occur without exponential growth of orbit names.
[/example]
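The height recurrence and its closed form from the example above can be checked mechanically. The sketch below also builds the tower labels by the substitution $B_{r+1}=B_rB_r1B_r$, where `0` marks a level cut from the original column and `1` marks a spacer. The function names are hypothetical, and the symbolic encoding is one common convention rather than the only one.

```python
from typing import List


def chacon_heights(stages: int) -> List[int]:
    """Heights h_0, ..., h_stages from the recurrence h_{r+1} = 3 h_r + 1."""
    heights = [1]
    for _ in range(stages):
        heights.append(3 * heights[-1] + 1)
    return heights


def closed_form(r: int) -> int:
    """Closed form h_r = (3^{r+1} - 1) / 2; the numerator is always even."""
    return (3 ** (r + 1) - 1) // 2


def chacon_word(r: int) -> str:
    """Labels of the stage-r tower from bottom to top.

    Stacking left, middle-with-spacer, right gives B_{r+1} = B_r B_r 1 B_r.
    """
    word = "0"
    for _ in range(r):
        word = word + word + "1" + word
    return word


if __name__ == "__main__":
    print(chacon_heights(4))
    print(chacon_word(2))
```

The length of `chacon_word(r)` equals $h_r$, so the symbolic construction and the height formula describe the same tower.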
## Transfer Operators and Spectral Gaps
The next problem is quantitative: if a system mixes, how fast do correlations decay? For many expanding systems the answer comes from replacing the transformation by an operator acting on observables, then proving that this operator contracts everything except the invariant density.
[definition: Transfer Operator]
Let $(X,\mathcal B,m)$ be a measure space and let $T:X\to X$ be a non-singular measurable map. The transfer operator associated to $T$ is the [bounded linear operator](/page/Bounded%20Linear%20Operator)
\begin{align*}
\mathcal L:L^1(m)\to L^1(m)
\end{align*}
defined by the duality identity
\begin{align*}
\int_X (\mathcal L f)g\,dm=\int_X f(g\circ T)\,dm
\end{align*}
for all $f\in L^1(m)$ and all $g\in L^\infty(m)$.
[/definition]
The transfer operator is the pre-adjoint of the composition operator $g\mapsto g\circ T$: the duality identity moves $T$ from the test function onto the density. A probability density $h\in L^1(m)$ defines a $T$-invariant measure $h\,dm$ exactly when it satisfies the fixed-point equation $\mathcal Lh=h$.
[example: Doubling Map Transfer Operator]
For the doubling map $T(x)=2x\pmod 1$ on $[0,1)$ with Lebesgue measure, the two inverse branches are $x\mapsto x/2$ and $x\mapsto (x+1)/2$. For bounded $g$ and integrable $f$, split the integral over the two halves of $[0,1)$:
\begin{align*}
\int_0^1 f(x)g(Tx)\,dx=\int_0^{1/2} f(x)g(2x)\,dx+\int_{1/2}^1 f(x)g(2x-1)\,dx.
\end{align*}
In the first integral set $u=2x$, so $dx=du/2$ and $u$ runs from $0$ to $1$:
\begin{align*}
\int_0^{1/2} f(x)g(2x)\,dx=\int_0^1 f(u/2)g(u)\frac{du}{2}.
\end{align*}
In the second integral set $u=2x-1$, so $x=(u+1)/2$, $dx=du/2$, and $u$ again runs from $0$ to $1$:
\begin{align*}
\int_{1/2}^1 f(x)g(2x-1)\,dx=\int_0^1 f((u+1)/2)g(u)\frac{du}{2}.
\end{align*}
Adding the two transformed integrals gives
\begin{align*}
\int_0^1 f(x)g(Tx)\,dx=\int_0^1 \left(\frac12 f(u/2)+\frac12 f((u+1)/2)\right)g(u)\,du.
\end{align*}
Therefore the transfer operator is
\begin{align*}
(\mathcal L f)(x)=\frac12 f(x/2)+\frac12 f((x+1)/2).
\end{align*}
For the constant function $1$ this gives
\begin{align*}
(\mathcal L1)(x)=\frac12\cdot 1+\frac12\cdot 1=1.
\end{align*}
For the Fourier mode $e_k(x)=e^{2\pi ikx}$, the formula gives
\begin{align*}
(\mathcal L e_k)(x)=\frac12 e^{2\pi ik(x/2)}+\frac12 e^{2\pi ik((x+1)/2)}.
\end{align*}
Since $e^{2\pi ik(x/2)}=e^{\pi ikx}$ and $e^{2\pi ik((x+1)/2)}=e^{\pi ikx}e^{\pi ik}$, we get
\begin{align*}
(\mathcal L e_k)(x)=\frac12 e^{\pi ikx}\left(1+e^{\pi ik}\right).
\end{align*}
Because $e^{\pi ik}=(-1)^k$, this becomes
\begin{align*}
(\mathcal L e_k)(x)=\frac12 e^{\pi ikx}\left(1+(-1)^k\right).
\end{align*}
If $k=2\ell$ is even, then $(-1)^k=1$ and
\begin{align*}
(\mathcal L e_{2\ell})(x)=e^{2\pi i\ell x}=e_\ell(x).
\end{align*}
If $k$ is odd, then $(-1)^k=-1$ and
\begin{align*}
(\mathcal L e_k)(x)=0.
\end{align*}
Thus $\mathcal L$ preserves constants, halves even Fourier frequencies, and removes odd Fourier frequencies. On regularity spaces such as bounded variation or Hölder spaces, this systematic loss of oscillation is the mechanism behind exponential correlation decay.
[/example]
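The formula for $\mathcal L$ and its action on Fourier modes can be tested numerically. The sketch below is a minimal check under stated assumptions: the helper names `transfer`, `fourier_mode`, `doubling`, and `midpoint_integral` are hypothetical, and the duality identity is only approximated by a midpoint rule.

```python
import cmath
import math


def doubling(x: float) -> float:
    """The doubling map T(x) = 2x mod 1."""
    return (2.0 * x) % 1.0


def transfer(f):
    """Return L f, the average of f over the two preimages x/2 and (x+1)/2."""
    return lambda x: 0.5 * f(x / 2) + 0.5 * f((x + 1) / 2)


def fourier_mode(k: int):
    """Return the character e_k(x) = exp(2 pi i k x)."""
    return lambda x: cmath.exp(2j * cmath.pi * k * x)


def midpoint_integral(h, n: int = 2000) -> float:
    """Midpoint-rule approximation of the integral of h over [0, 1]."""
    return sum(h((j + 0.5) / n) for j in range(n)) / n


if __name__ == "__main__":
    x = 0.37
    print(abs(transfer(fourier_mode(4))(x) - fourier_mode(2)(x)))  # close to 0
    print(abs(transfer(fourier_mode(3))(x)))                       # close to 0
    f = lambda t: t * t
    g = lambda t: math.cos(2 * math.pi * t)
    lhs = midpoint_integral(lambda t: transfer(f)(t) * g(t))
    rhs = midpoint_integral(lambda t: f(t) * g(doubling(t)))
    print(lhs, rhs)
```

Both printed integrals approximate the same number, as the duality identity requires.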
The doubling map illustrates why the operator should be studied on a Banach space that remembers enough regularity. The central analytic condition is a spectral gap: constants form the peripheral part of the spectrum, and the remaining spectrum is contained in a smaller disk.
[definition: Spectral Gap For A Transfer Operator]
Let $\mathcal B_0$ be a Banach space of observables with norm $\|\cdot\|_{\mathcal B_0}$, and let $\mathcal L:\mathcal B_0\to\mathcal B_0$ be a bounded linear operator with spectral radius $r(\mathcal L)>0$. The operator $\mathcal L$ has a spectral gap if there is a decomposition
\begin{align*}
\mathcal L=\mathcal L\Pi+N
\end{align*}
where $\Pi$ is the finite-rank spectral projection onto the eigenspaces with eigenvalues of modulus $r(\mathcal L)$, $\Pi N=N\Pi=0$, and the spectral radius $r(N)$ is strictly smaller than $r(\mathcal L)$. When the only peripheral eigenvalue is a simple eigenvalue $1$, the peripheral part $\mathcal L\Pi$ equals $\Pi$ and the decomposition reads $\mathcal L=\Pi+N$.
[/definition]
For mixing probability-preserving systems, the projection is usually the rank-one projection onto the invariant density or constant functions. The part $N$ describes fluctuations, and its smaller spectral radius is exactly the mechanism that should force correlations to shrink at an exponential rate. The following theorem makes this operator-to-correlation step precise.
[quotetheorem:6821]
[citeproof:6821]
This theorem turns an asymptotic mixing statement into a quantitative estimate, but each hypothesis is doing real work. The Banach space must encode regularity because transfer operators rarely contract on a large space such as $L^1(\mu)$ strongly enough to give rates, so the conclusion should not be read as a statement about arbitrary rough $L^1$ or $L^2$ observables. The rank-one projection expresses that constants are the only asymptotically invariant observables; if the peripheral spectrum had additional eigenvalues, correlations could oscillate rather than decay along the full sequence of times. For instance, if an operator has a peripheral eigenvalue $-1$, an eigen-observable contributes a factor $(-1)^n$ to correlations instead of a term tending to zero. The $L^1$ embedding is what converts a norm estimate for $\mathcal L^n f$ into an estimate against the bounded test observable $g$. Without a spectral gap, mixing may still hold, as in many intermittent maps such as the Pomeau-Manneville family, but the operator argument no longer supplies a uniform exponential rate; at best, a different Banach space or a renewal method may give slower decay for a narrower class of observables.
[example: Gibbs-Markov Maps]
Let $T:Y\to Y$ be a Gibbs-Markov map with Markov partition $\alpha$. Choose $0<\theta<1$ and define the separation metric $d_\theta(x,y)=\theta^{s(x,y)}$, where $s(x,y)$ is the first time at which the $\alpha$-names of $x$ and $y$ differ. On the Banach space $\mathcal B_\theta$ of observables that are bounded and Lipschitz on partition elements for $d_\theta$, the Gibbs-Markov hypotheses give the standard spectral-gap decomposition
\begin{align*}
\mathcal L=\Pi+N.
\end{align*}
Here
\begin{align*}
\Pi f=\left(\int_Y f\,d\mu\right)1,
\end{align*}
and
\begin{align*}
\Pi N=N\Pi=0.
\end{align*}
The spectral gap means that there are constants $C_0>0$ and $0<\rho<1$ such that
\begin{align*}
\|N^n f\|_{\mathcal B_\theta}\le C_0\rho^n\|f\|_{\mathcal B_\theta}
\end{align*}
for every $f\in\mathcal B_\theta$ and every $n\ge 0$.
Let $f\in\mathcal B_\theta$ be a cylinder observable and let $g\in L^\infty(\mu)$. By the defining duality of the transfer operator,
\begin{align*}
\int_Y f(g\circ T^n)\,d\mu=\int_Y (\mathcal L^n f)g\,d\mu.
\end{align*}
Since $\Pi^2=\Pi$ and $\Pi N=N\Pi=0$, all cross terms vanish when powers of $\mathcal L=\Pi+N$ are expanded, so for every $n\ge 1$
\begin{align*}
\mathcal L^n f=\Pi f+N^n f.
\end{align*}
Therefore
\begin{align*}
\operatorname{Cor}_n(f,g)=\int_Y \left(\Pi f+N^n f\right)g\,d\mu-\int_Y f\,d\mu\int_Y g\,d\mu.
\end{align*}
Substituting the formula for $\Pi f$ gives
\begin{align*}
\operatorname{Cor}_n(f,g)=\int_Y \left(\left(\int_Y f\,d\mu\right)1+N^n f\right)g\,d\mu-\int_Y f\,d\mu\int_Y g\,d\mu.
\end{align*}
Splitting the integral,
\begin{align*}
\operatorname{Cor}_n(f,g)=\left(\int_Y f\,d\mu\right)\left(\int_Y g\,d\mu\right)+\int_Y (N^n f)g\,d\mu-\int_Y f\,d\mu\int_Y g\,d\mu.
\end{align*}
The two product terms cancel, so
\begin{align*}
\operatorname{Cor}_n(f,g)=\int_Y (N^n f)g\,d\mu.
\end{align*}
If the inclusion $\mathcal B_\theta\hookrightarrow L^1(\mu)$ is continuous, choose $C_1>0$ with
\begin{align*}
\|u\|_{L^1}\le C_1\|u\|_{\mathcal B_\theta}
\end{align*}
for all $u\in\mathcal B_\theta$. Then
\begin{align*}
|\operatorname{Cor}_n(f,g)|\le \|g\|_\infty\|N^n f\|_{L^1}.
\end{align*}
Using the continuous inclusion,
\begin{align*}
|\operatorname{Cor}_n(f,g)|\le C_1\|g\|_\infty\|N^n f\|_{\mathcal B_\theta}.
\end{align*}
Using the spectral-gap estimate,
\begin{align*}
|\operatorname{Cor}_n(f,g)|\le C_1C_0\rho^n\|f\|_{\mathcal B_\theta}\|g\|_\infty.
\end{align*}
Thus cylinder observables have exponential correlation decay. The countable Markov partition supplies the symbolic states, while the spectral gap says that all non-constant statistical information is contracted exponentially fast, so the system behaves like a rapidly mixing countable-state Markov chain.
[/example]
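The algebra behind $\mathcal L^n=\Pi+N^n$ can be seen in a finite-state toy model. The sketch below uses a two-state stochastic matrix acting on observables as a stand-in for the transfer operator; it is an analogue chosen for illustration, not a Gibbs-Markov map, and the transition probabilities `a`, `b` and all helper names are hypothetical.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def matpow(A, n):
    """n-th power of a 2x2 matrix by repeated multiplication."""
    R = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
    for _ in range(n):
        R = matmul(R, A)
    return R


def matadd(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]


a, b = Fraction(3, 10), Fraction(1, 5)      # assumed transition probabilities
P = [[1 - a, a], [b, 1 - b]]                # stochastic matrix, P1 = 1
pi = [b / (a + b), a / (a + b)]             # stationary vector, pi P = pi
Pi = [pi[:], pi[:]]                         # (Pi f)(i) = sum_j pi_j f_j
N = [[P[i][j] - Pi[i][j] for j in range(2)] for i in range(2)]
ZERO = [[0, 0], [0, 0]]

if __name__ == "__main__":
    print(matmul(Pi, N) == ZERO, matmul(N, Pi) == ZERO)
    for n in range(1, 6):
        # The fluctuation part contracts by the factor 1 - a - b at each step.
        print(n, matpow(P, n) == matadd(Pi, matpow(N, n)), matpow(N, n)[0][0])
```

Here $N^n=(1-a-b)^{n-1}N$, so the second eigenvalue $1-a-b$ plays the role of $\rho$ in the correlation estimate.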
The same method also covers smooth uniformly expanding maps and subshifts of finite type with Hölder potentials. For hyperbolic invertible systems, anisotropic Banach spaces replace ordinary Hölder spaces because stable and unstable directions must be treated differently.
[example: Cat Map And Anisotropic Regularity]
Let $A$ be the integer matrix with rows $(2,1)$ and $(1,1)$, so $\det A=2\cdot 1-1\cdot 1=1$, and let $T_Ax=Ax\pmod{\mathbb Z^2}$ on $\mathbb T^2=\mathbb R^2/\mathbb Z^2$. Because $A$ maps $\mathbb Z^2$ bijectively onto itself and has determinant $1$, the induced map preserves Lebesgue measure $\mathcal L^2$.
For smooth $f,g:\mathbb T^2\to\mathbb C$, write the Fourier expansions
\begin{align*}
f(x)=\sum_{m\in\mathbb Z^2}\hat f(m)e^{2\pi i m\cdot x}
\end{align*}
and
\begin{align*}
g(x)=\sum_{k\in\mathbb Z^2}\hat g(k)e^{2\pi i k\cdot x}.
\end{align*}
Since $k\cdot A^n x=(A^{\top n}k)\cdot x$, we have
\begin{align*}
g(T_A^n x)=\sum_{k\in\mathbb Z^2}\hat g(k)e^{2\pi i(A^{\top n}k)\cdot x}.
\end{align*}
Multiplying the [Fourier series](/page/Fourier%20Series) and integrating term by term, which is justified by smoothness and rapid decay of Fourier coefficients, gives
\begin{align*}
\int_{\mathbb T^2} f(x)g(T_A^n x)\,d\mathcal L^2(x)=\sum_{m,k\in\mathbb Z^2}\hat f(m)\hat g(k)\int_{\mathbb T^2}e^{2\pi i(m+A^{\top n}k)\cdot x}\,d\mathcal L^2(x).
\end{align*}
The last integral is $1$ when $m+A^{\top n}k=0$ and is $0$ otherwise, by orthogonality of torus characters. Therefore
\begin{align*}
\int_{\mathbb T^2} f(x)g(T_A^n x)\,d\mathcal L^2(x)=\sum_{k\in\mathbb Z^2}\hat f(-A^{\top n}k)\hat g(k).
\end{align*}
The $k=0$ term equals
\begin{align*}
\hat f(0)\hat g(0)=\int_{\mathbb T^2}f\,d\mathcal L^2\int_{\mathbb T^2}g\,d\mathcal L^2.
\end{align*}
Hence
\begin{align*}
\operatorname{Cor}_n(f,g)=\sum_{k\in\mathbb Z^2\setminus\{0\}}\hat f(-A^{\top n}k)\hat g(k).
\end{align*}
The characteristic polynomial of $A$ is
\begin{align*}
t^2-3t+1.
\end{align*}
Thus the eigenvalues are
\begin{align*}
\lambda=\frac{3+\sqrt5}{2}
\end{align*}
and
\begin{align*}
\lambda^{-1}=\frac{3-\sqrt5}{2}.
\end{align*}
The unstable and stable eigendirections have quadratic irrational slopes, so the standard Diophantine estimate for quadratic irrationals gives a constant $c>0$ such that
\begin{align*}
|A^{\top n}k|\ge c\lambda^n|k|^{-1}
\end{align*}
for every $n\ge 0$ and every $k\in\mathbb Z^2\setminus\{0\}$. Smoothness gives, for every $M>0$, constants $C_{f,M}$ and $C_{g,M}$ with
\begin{align*}
|\hat f(q)|\le C_{f,M}(1+|q|)^{-M}
\end{align*}
and
\begin{align*}
|\hat g(k)|\le C_{g,M}(1+|k|)^{-M}.
\end{align*}
Choose $M>4$. Using the lower bound on $|A^{\top n}k|$ for the $f$ coefficient and the same decay estimate with exponent $2M$ for $g$, each nonzero term satisfies
\begin{align*}
|\hat f(-A^{\top n}k)\hat g(k)|\le C_{f,M}C_{g,2M}c^{-M}\lambda^{-Mn}|k|^M(1+|k|)^{-2M}.
\end{align*}
Since $\sum_{k\in\mathbb Z^2\setminus\{0\}} |k|^M(1+|k|)^{-2M}$ converges when $M>2$, there is a constant $C>0$ such that
\begin{align*}
|\operatorname{Cor}_n(f,g)|\le C\lambda^{-Mn}.
\end{align*}
Thus smooth observables for the cat map have exponential correlation decay. The example is the invertible hyperbolic analogue of the expanding-map transfer-operator picture: regularity must be measured anisotropically because forward iteration expands unstable directions while contracting stable directions.
[/example]
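The Diophantine lower bound $|A^{\top n}k|\ge c\lambda^n|k|^{-1}$ is the only non-elementary input in the example above, and it can be probed numerically. The sketch below searches a finite box of frequencies for the smallest value of $|A^nk|\,|k|\,\lambda^{-n}$; since $A$ is symmetric, $A^\top=A$. The box size, the range of $n$, and the function names are assumptions, and a finite search only illustrates the bound rather than proving it.

```python
import math

LAMBDA = (3 + math.sqrt(5)) / 2  # unstable eigenvalue of A = [[2, 1], [1, 1]]


def apply_power(k, n):
    """Return A^n k using exact integer arithmetic."""
    x, y = k
    for _ in range(n):
        x, y = 2 * x + y, x + y
    return x, y


def min_ratio(box: int, max_n: int) -> float:
    """Smallest |A^n k| |k| / lambda^n over nonzero k in the box and 0 <= n <= max_n."""
    best = math.inf
    for k1 in range(-box, box + 1):
        for k2 in range(-box, box + 1):
            if (k1, k2) == (0, 0):
                continue
            norm_k = math.hypot(k1, k2)
            for n in range(max_n + 1):
                x, y = apply_power((k1, k2), n)
                best = min(best, math.hypot(x, y) * norm_k / LAMBDA ** n)
    return best


if __name__ == "__main__":
    print(min_ratio(6, 12))
```

A strictly positive minimum that does not shrink as the box grows is the numerical signature of the constant $c$.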
## Correlations and Limit Theorems
After proving decay of correlations, the final question is probabilistic: do orbit sums behave like sums of independent random variables? The answer is often yes for sufficiently regular observables over systems with a spectral gap, and the [central limit theorem](/theorems/521) is the first major statement of this kind.
[definition: Correlation Function]
Let $(X,\mathcal B,\mu)$ be a probability space and let $T:X\to X$ be a probability-preserving transformation. For each $n\ge 0$, the time-$n$ correlation functional is the map
\begin{align*}
\operatorname{Cor}_n:L^2(\mu)\times L^2(\mu)\to\mathbb R
\end{align*}
defined by
\begin{align*}
\operatorname{Cor}_n(f,g)=\int_X f(g\circ T^n)\,d\mu-\int_X f\,d\mu\int_X g\,d\mu.
\end{align*}
[/definition]
Correlation functions are observable-level versions of mixing. They refine the set-based definition because taking $f=\mathbb{1}_A$ and $g=\mathbb{1}_B$ recovers the mixing expression for sets. To measure fluctuations rather than only pairwise memory, we need a notation for sums along a whole orbit segment.
[definition: Birkhoff Sum]
Let $(X,\mathcal B,\mu)$ be a probability space, let $T:X\to X$ be a probability-preserving transformation, and let $n\ge 1$. Let $\mathcal M(X,\mathcal B;\mathbb R)$ denote the [vector space](/page/Vector%20Space) of real-valued [measurable functions](/page/Measurable%20Functions) $f:X\to\mathbb R$. The length-$n$ Birkhoff sum operator is the map
\begin{align*}
S_n:\mathcal M(X,\mathcal B;\mathbb R)\to \mathcal M(X,\mathcal B;\mathbb R)
\end{align*}
defined by
\begin{align*}
(S_n f)(x)=\sum_{j=0}^{n-1} f(T^j x).
\end{align*}
[/definition]
For each $f\in\mathcal M(X,\mathcal B;\mathbb R)$, the function $S_nf$ is the observable obtained by adding the values of $f$ along the first $n$ points of the orbit. Birkhoff's ergodic theorem controls $S_nf/n$ for integrable $f$. The [central limit theorem](/theorems/1848) asks for the next-order fluctuations of $S_nf$ around its mean, and this requires identifying the observables whose sums telescope instead of growing diffusively.
[definition: Coboundary]
Let $(X,\mathcal B,\mu)$ be a probability space and let $T:X\to X$ be a probability-preserving transformation. A measurable function $f:X\to\mathbb R$ is a coboundary if there exists a measurable function $u:X\to\mathbb R$ such that
\begin{align*}
f=u-u\circ T.
\end{align*}
[/definition]
Coboundaries are the degenerate observables for fluctuation theory. Their Birkhoff sums telescope to $S_nf=u-u\circ T^n$, so when the transfer function $u$ lies in $L^2(\mu)$ the variance of $S_nf$ is bounded by $4\|u\|_{L^2}^2$ and cannot grow diffusively; only a sufficiently irregular transfer function leaves room for such growth. The theorem below states the spectral-gap central limit result once this degeneracy has been separated out.
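The telescoping of Birkhoff sums for a coboundary is a purely algebraic identity, $S_nf=u-u\circ T^n$, and can be verified exactly on a toy system. The sketch below uses a circle rotation with rational angle and $u(x)=x^2$; both choices, as well as the names `ALPHA`, `iterate`, and `birkhoff_sum`, are assumptions made only to keep the arithmetic exact.

```python
from fractions import Fraction

ALPHA = Fraction(2, 7)  # hypothetical rotation angle, rational for exact arithmetic


def T(x: Fraction) -> Fraction:
    """Circle rotation x -> x + ALPHA mod 1."""
    return (x + ALPHA) % 1


def u(x: Fraction) -> Fraction:
    """Transfer function, chosen arbitrarily for illustration."""
    return x * x


def f(x: Fraction) -> Fraction:
    """The coboundary f = u - u∘T."""
    return u(x) - u(T(x))


def iterate(x: Fraction, n: int) -> Fraction:
    """Return T^n x."""
    for _ in range(n):
        x = T(x)
    return x


def birkhoff_sum(obs, x: Fraction, n: int) -> Fraction:
    """Return S_n obs(x) = obs(x) + obs(Tx) + ... + obs(T^{n-1} x)."""
    total = Fraction(0)
    for _ in range(n):
        total += obs(x)
        x = T(x)
    return total


if __name__ == "__main__":
    x0 = Fraction(1, 5)
    for n in (1, 5, 20):
        print(n, birkhoff_sum(f, x0, n), u(x0) - u(iterate(x0, n)))
```

Because $0\le u<1$ on $[0,1)$, the sums stay in $(-1,1)$ for every $n$, in contrast with the $\sqrt n$ growth expected of a non-degenerate observable.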
[quotetheorem:6822]
[example: Nonzero Variance For The Doubling Map]
Let $f=\mathbb{1}_{[0,1/2)}-\frac12$ on $[0,1)$ with Lebesgue measure. Discard the dyadic rationals, a null set, and write $x=0.b_1b_2b_3\cdots$ in binary. For the doubling map $T(x)=2x\pmod 1$, the binary expansion shifts left, so
\begin{align*}
T^j x=0.b_{j+1}b_{j+2}b_{j+3}\cdots.
\end{align*}
Therefore
\begin{align*}
f(T^j x)=\mathbb{1}_{[0,1/2)}(T^j x)-\frac12=\mathbb{1}_{\{b_{j+1}=0\}}-\frac12.
\end{align*}
Thus $f(T^j x)=1/2$ when $b_{j+1}=0$, and $f(T^j x)=-1/2$ when $b_{j+1}=1$.
Under Lebesgue measure, the binary digits $b_1,b_2,\dots$ are independent and satisfy
\begin{align*}
\mathcal L^1(b_j=0)=\mathcal L^1(b_j=1)=\frac12.
\end{align*}
Hence each random variable $f\circ T^j$ has mean
\begin{align*}
\int_0^1 f(T^j x)\,dx=\frac12\cdot \frac12+\left(-\frac12\right)\cdot \frac12=\frac14-\frac14=0,
\end{align*}
and variance
\begin{align*}
\int_0^1 f(T^j x)^2\,dx=\left(\frac12\right)^2\frac12+\left(-\frac12\right)^2\frac12=\frac14\cdot\frac12+\frac14\cdot\frac12=\frac14.
\end{align*}
Because $f(T^j x)$ depends only on the digit $b_{j+1}$, the variables $f(x),f(Tx),\dots,f(T^{n-1}x)$ are independent. Therefore
\begin{align*}
S_nf(x)=\sum_{j=0}^{n-1}f(T^j x)
\end{align*}
is a sum of $n$ independent, mean-zero random variables, each with variance $1/4$, so
\begin{align*}
\operatorname{Var}(S_nf)=\sum_{j=0}^{n-1}\operatorname{Var}(f\circ T^j)=\sum_{j=0}^{n-1}\frac14=\frac n4.
\end{align*}
By the *Classical Central Limit Theorem*,
\begin{align*}
\frac{S_nf}{\sqrt n}\xrightarrow{d}\mathcal N\left(0,\frac14\right).
\end{align*}
So this observable has nonzero limiting variance $\sigma^2=1/4$: it is not in the degenerate coboundary case, and the dynamical central limit theorem gives the same Gaussian law as the independent binary-digit model.
[/example]
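The variance computation in the example can be confirmed by brute force for small $n$: under Lebesgue measure every block of $n$ binary digits has probability $2^{-n}$, so the law of $S_nf$ is a finite average. The sketch below enumerates all blocks; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def birkhoff_sum_from_digits(digits) -> Fraction:
    """S_n f for f = 1_{[0,1/2)} - 1/2, read off from the first n binary digits.

    f(T^j x) equals +1/2 when b_{j+1} = 0 and -1/2 when b_{j+1} = 1.
    """
    return sum((Fraction(1, 2) if b == 0 else Fraction(-1, 2) for b in digits),
               Fraction(0))


def exact_variance(n: int) -> Fraction:
    """Variance of S_n f under Lebesgue measure, by enumerating digit blocks."""
    weight = Fraction(1, 2 ** n)
    mean = Fraction(0)
    second_moment = Fraction(0)
    for digits in product((0, 1), repeat=n):
        s = birkhoff_sum_from_digits(digits)
        mean += weight * s
        second_moment += weight * s * s
    return second_moment - mean * mean


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, exact_variance(n))
```

The enumeration is exponential in $n$ and is meant only as a sanity check on the identity $\operatorname{Var}(S_nf)=n/4$.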
Limit theorems require more than qualitative mixing. Weak mixing alone does not give summable correlations, and mixing without a rate need not imply a central limit theorem for natural observables.
[remark: Entropy Does Not Determine Correlation Decay]
Two systems may have the same Kolmogorov-Sinai entropy and very different statistical rates. Bernoulli shifts have exact independence at separated coordinate blocks, uniformly expanding maps often have exponential correlation decay for regular observables, and some zero-entropy weak mixing systems have no useful quantitative rate. Entropy measures exponential orbit-name complexity, while decay of correlations measures how fast information is forgotten in time.
[/remark]
The chapter therefore leaves us with three complementary languages. Entropy classifies the size of orbit information, mixing hierarchies describe qualitative independence, and transfer-operator spectra produce quantitative statistics. Later thermodynamic formalism combines all three by assigning entropy and pressure to invariant measures and by extracting equilibrium statistics from spectral properties of weighted transfer operators.
The hierarchy of randomness properties now extends the entropy framework from qualitative independence to quantitative statistical decay. The next chapter applies these ideas to number-theoretic systems, where algorithms, digit expansions, and symbolic codings become dynamical transformations with measurable invariant structure.
# 11. Number-Theoretic Dynamical Systems
Number-theoretic dynamical systems arise when an arithmetic algorithm is viewed as iteration of a measurable transformation. [Continued fractions](/page/Continued%20Fractions), base expansions, multiplication maps on compact groups, and symbolic codings of geodesic flow all turn questions about digits and equidistribution into questions about invariant measures, entropy, and generators. This chapter applies the entropy tools developed earlier to systems whose definitions come from elementary number theory but whose long-term behaviour is measured by ergodic theory.
The guiding theme is that arithmetic randomness often appears through deterministic maps with strong expansion. We begin with the Gauss map for continued fractions, then state the entropy formula for smooth expanding interval maps, compute the entropy of the Gauss map, and finally place beta-transformations, toral endomorphisms, and geodesic-flow codings in the same framework.
The chapter assumes measure-theoretic entropy from Chapter 2, generators from Chapter 3, pointwise orbit averages from Chapter 4, and basic facts about absolutely continuous invariant measures for piecewise smooth maps. The number theory needed is limited to continued fractions, base expansions, and elementary matrix actions on tori.
## Continued Fractions and the Gauss Map
The first problem is to encode a real number by a sequence of integers in a way that interacts naturally with iteration. Decimal expansion uses multiplication by $10$ modulo $1$; continued fractions use a nonlinear operation that repeatedly removes the integer part of a reciprocal.
[definition: Continued Fraction Digits]
For each $n\in\mathbb N$, the $n$th continued-fraction digit is a map
\begin{align*}
a_n:(0,1)\setminus\mathbb Q\to\mathbb N.
\end{align*}
For $x \in (0,1)\setminus\mathbb Q$, define $a_1(x), a_2(x), \dots$ recursively as follows. Put
\begin{align*}
a_1(x) = \left\lfloor \frac{1}{x} \right\rfloor
\end{align*}
and
\begin{align*}
x_1 = \frac{1}{x} - a_1(x).
\end{align*}
For $n\ge 1$, put
\begin{align*}
a_{n+1}(x) = a_1(x_n)
\end{align*}
and
\begin{align*}
x_{n+1} = \frac{1}{x_n} - a_1(x_n).
\end{align*}
The continued fraction expansion of $x$ is
\begin{align*}
x = [0; a_1(x), a_2(x), \dots].
\end{align*}
[/definition]
The recursion identifies the first digit and produces a new number whose digits are the remaining tail. To study digit statistics with ergodic theory, we need the transformation that performs exactly this tail shift.
[definition: Gauss Map]
The Gauss map is the measurable map $G:(0,1) \to [0,1)$ defined by
\begin{align*}
G(x) = \frac{1}{x} - \left\lfloor \frac{1}{x} \right\rfloor.
\end{align*}
For each $n \in \mathbb N$, its $n$th branch has domain
\begin{align*}
I_n = \left(\frac{1}{n+1}, \frac{1}{n}\right]
\end{align*}
and on $I_n$ the formula is $G(x)=1/x-n$.
[/definition]
The values $G(1/n)=0$ occur only at countably many rational endpoints, and $0$ lies outside the domain of $G$. When we regard $G$ as a measure-preserving transformation on $(0,1)$, we either remove the rationals, a null set consisting exactly of the points whose orbit reaches $0$, or redefine the map there arbitrarily; all invariant-measure and entropy statements below are unchanged modulo $\mu_G$-null sets.
[illustration:gauss-map-branches]
Thus $a_1(x)=n$ exactly on $I_n$, and $a_k(x)=a_1(G^{k-1}x)$ for every $k \ge 1$. The partition by the intervals $I_n$ is countable, and it is a generator modulo the usual ambiguity coming from rational endpoints.
[example: First Continued Fraction Digits]
Let $x=\sqrt{2}-1$. Since
\begin{align*}
(\sqrt{2}-1)(\sqrt{2}+1)=2-1=1,
\end{align*}
we have
\begin{align*}
\frac{1}{x}=\sqrt{2}+1.
\end{align*}
Because $1<\sqrt{2}<2$, adding $1$ gives $2<\sqrt{2}+1<3$, and therefore
\begin{align*}
a_1(x)=\left\lfloor \frac{1}{x}\right\rfloor=\lfloor \sqrt{2}+1\rfloor=2.
\end{align*}
Using the definition of the Gauss map,
\begin{align*}
G(x)=\frac{1}{x}-\left\lfloor \frac{1}{x}\right\rfloor=(\sqrt{2}+1)-2=\sqrt{2}-1=x.
\end{align*}
Thus $G^k(x)=x$ for every $k\ge 0$. Since the continued-fraction digits satisfy $a_{k+1}(x)=a_1(G^k x)$ by the recursive definition, we get
\begin{align*}
a_{k+1}(x)=a_1(G^k x)=a_1(x)=2
\end{align*}
for every $k\ge 0$. Hence
\begin{align*}
\sqrt{2}-1=[0;2,2,2,\dots].
\end{align*}
The fixed point of $G$ produces a purely periodic continued fraction; more generally, periodic orbits of $G$ give purely periodic continued fractions, and preperiodic orbits give eventually periodic ones.
[/example]
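The digit recursion $a_{k+1}(x)=a_1(G^kx)$ translates directly into a loop. The sketch below is a minimal implementation under stated assumptions: the name `cf_digits` is hypothetical, rational inputs are handled with exact `Fraction` arithmetic and terminate when the orbit reaches $0$, and floating-point inputs are only reliable for the first few digits because $G$ expands rounding errors.

```python
import math
from fractions import Fraction


def cf_digits(x, count: int):
    """Return up to `count` continued-fraction digits of x in (0, 1).

    Each step records a_1 = floor(1/x) and replaces x by G(x) = 1/x - a_1.
    """
    digits = []
    for _ in range(count):
        if x == 0:  # rational input: the expansion has terminated
            break
        inverse = 1 / x
        a = math.floor(inverse)
        digits.append(a)
        x = inverse - a
    return digits


if __name__ == "__main__":
    print(cf_digits(math.sqrt(2) - 1, 10))     # floating-point approximation
    print(cf_digits(Fraction(169, 408), 20))   # a convergent of sqrt(2) - 1
```

The rational $169/408$ is the seventh convergent of $[0;2,2,2,\dots]$, so its finite expansion consists of seven digits equal to $2$.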
This example shows that the Gauss map captures the continued-fraction algorithm, but it does not yet tell us which probability measure makes the digit process stationary. Lebesgue measure is not preserved by $G$, so the next task is to find the invariant density adapted to the inverse branches.
[definition: Gauss Measure]
The Gauss measure $\mu_G$ on $(0,1)$ is the probability measure defined by
\begin{align*}
d\mu_G(x)=\frac{1}{\log 2}\frac{1}{1+x}\,d\mathcal L^1(x).
\end{align*}
[/definition]
The normalising constant is chosen because
\begin{align*}
\int_0^1 \frac{1}{1+x}\,d\mathcal L^1(x)=\log 2.
\end{align*}
The Gauss map has countably many inverse branches, so ordinary Lebesgue measure is distorted in a non-uniform way when points are pulled back through continued-fraction digits. The stationarity problem is to find a density whose distortion contributions from all branches exactly balance. The Gauss density is designed for that cancellation.
[quotetheorem:6823]
[citeproof:6823]
This theorem is stronger than merely saying that $G$ sends null sets to null sets: it identifies the probability law under which the continued-fraction digit process is stationary. Each part of the statement is doing work. The bounded measurable [test function](/page/Test%20Function) formulation avoids integrability failures; for example $f(x)=1/x$ is not $\mu_G$-integrable, so neither side of the displayed identity is a finite expectation. The branch decomposition is also essential: treating $G$ as if it were a single injective change of variables misses the infinitely many inverse branches $u_n(y)=1/(n+y)$, and using only the branch $u_1$ gives the density contribution $1/((1+y)(2+y))$ rather than the full telescoping sum. Finally, the exact density matters. Lebesgue measure is not invariant because
\begin{align*}
\mathcal L^1(G^{-1}(0,1/2))=\sum_{n=1}^{\infty}\left(\frac{1}{n}-\frac{1}{n+1/2}\right)=2-2\log 2,
\end{align*}
whereas $\mathcal L^1((0,1/2))=1/2$. Invariance alone still does not imply convergence of digit frequencies, since a measure-preserving system may decompose into several invariant components. For digit statistics we therefore need time averages of digit functions to converge to their space averages, which leads to ergodicity.
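Both computations in the preceding paragraph can be checked numerically. With $h(x)=1/(1+x)$ and inverse branches $u_n(y)=1/(n+y)$, the branch contributions $|u_n'(y)|\,h(u_n(y))=1/((n+y)(n+y+1))$ telescope to $h(y)$, while the Lebesgue measure of $G^{-1}(0,1/2)$ is a slowly converging series. The sketch below uses hypothetical function names and truncates both sums, so the second check is only approximate.

```python
import math
from fractions import Fraction


def branch_sum(y: Fraction, terms: int) -> Fraction:
    """Partial sum over n <= terms of |u_n'(y)| * h(u_n(y)) for h(x) = 1/(1+x)."""
    total = Fraction(0)
    for n in range(1, terms + 1):
        u = 1 / (n + y)            # inverse branch u_n(y) = 1/(n+y)
        total += u * u / (1 + u)   # |u_n'(y)| = u^2 and h(u) = 1/(1+u)
    return total


def lebesgue_preimage_of_half(terms: int) -> float:
    """Partial sum of the lengths of the intervals (1/(n+1/2), 1/n)."""
    return sum(1 / n - 1 / (n + 0.5) for n in range(1, terms + 1))


if __name__ == "__main__":
    y = Fraction(1, 3)
    # The partial sum differs from h(y) = 1/(1+y) by exactly 1/(terms+1+y).
    print(branch_sum(y, 50), 1 / (1 + y) - 1 / (51 + y))
    print(lebesgue_preimage_of_half(200000), 2 - 2 * math.log(2))
```

The exact remainder $1/(N+1+y)$ tends to $0$, which is the telescoping cancellation that makes $1/(1+x)$ an invariant density up to normalisation.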
[quotetheorem:6824]
[citeproof:6824]
Ergodicity is the point at which stationarity becomes an almost-sure statistical law. The structural hypotheses behind the proof cannot be dropped. If a measure-preserving map is the identity on $[0,1]$, every measurable set is invariant, so stationarity gives no mixing between components. If $T$ is the doubling map on $[0,1/2)$ and separately the doubling map, rescaled, on $[1/2,1]$, then the two halves are invariant and typical digit frequencies depend on the half containing the initial point. The generating-cylinder hypothesis also matters: for the product system $G\times R$ on $(0,1)\times\{0,1\}$, where $R$ is the identity on the second coordinate, the continued-fraction partition in the first coordinate is full-branch and expanding but does not see the invariant sets $(0,1)\times\{0\}$ and $(0,1)\times\{1\}$. The bounded-distortion full-branch argument is what rules out exactly this kind of hidden invariant information for the Gauss map. The theorem also does not say that the digits are independent; continued-fraction digits have correlations even though their time averages are governed by a single invariant measure. As a consequence, the long-run frequency of any fixed digit can be computed by integrating its indicator over the Gauss measure.
[example: Gauss-Kuzmin Digit Frequencies]
For $m\in\mathbb N$, the first continued-fraction digit equals $m$ exactly on
\begin{align*}
I_m=\left(\frac{1}{m+1},\frac{1}{m}\right].
\end{align*}
Indeed, $a_1(x)=\lfloor 1/x\rfloor=m$ means
\begin{align*}
m\le \frac{1}{x}<m+1.
\end{align*}
Since $x>0$, taking reciprocals reverses the inequalities and gives
\begin{align*}
\frac{1}{m+1}<x\le \frac{1}{m}.
\end{align*}
The digit relation $a_k(x)=a_1(G^{k-1}x)$ gives
\begin{align*}
\#\{1\le k\le N:a_k(x)=m\}=\sum_{k=1}^N \mathbf 1_{I_m}(G^{k-1}x).
\end{align*}
By *[Ergodicity of the Gauss Map](/theorems/6824)* and Birkhoff's theorem applied to the bounded function $\mathbf 1_{I_m}$, for $\mu_G$-a.e. $x$,
\begin{align*}
\lim_{N\to\infty}\frac{1}{N}\#\{1\le k\le N:a_k(x)=m\}=\int_0^1 \mathbf 1_{I_m}(y)\,d\mu_G(y).
\end{align*}
The integral on the right is $\mu_G(I_m)$, so it remains to compute that measure from the Gauss density:
\begin{align*}
\mu_G(I_m)=\frac{1}{\log 2}\int_{1/(m+1)}^{1/m}\frac{1}{1+y}\,d\mathcal L^1(y).
\end{align*}
An antiderivative of $1/(1+y)$ is $\log(1+y)$, hence
\begin{align*}
\mu_G(I_m)=\frac{1}{\log 2}\left(\log\left(1+\frac{1}{m}\right)-\log\left(1+\frac{1}{m+1}\right)\right).
\end{align*}
Now
\begin{align*}
1+\frac{1}{m}=\frac{m+1}{m}.
\end{align*}
Also
\begin{align*}
1+\frac{1}{m+1}=\frac{m+2}{m+1}.
\end{align*}
Therefore
\begin{align*}
\mu_G(I_m)=\frac{1}{\log 2}\left(\log\left(\frac{m+1}{m}\right)-\log\left(\frac{m+2}{m+1}\right)\right).
\end{align*}
Using $\log a-\log b=\log(a/b)$ for positive $a,b$,
\begin{align*}
\mu_G(I_m)=\frac{1}{\log 2}\log\left(\frac{(m+1)/m}{(m+2)/(m+1)}\right).
\end{align*}
Equivalently,
\begin{align*}
\mu_G(I_m)=\frac{1}{\log 2}\log\left(\frac{(m+1)^2}{m(m+2)}\right).
\end{align*}
Thus the digit process is stationary under $\mu_G$, but its one-digit distribution is far from uniform: $\mu_G(I_1)=\log_2(4/3)\approx 0.415$, $\mu_G(I_2)=\log_2(9/8)\approx 0.170$, and $\mu_G(I_m)\sim 1/(m^2\log 2)$ as $m\to\infty$.
[/example]
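The closed form can be compared with digit counts from an actual expansion. The sketch below is a heuristic illustration, not a proof. It assumes that a rational number $p/2^{\text{bits}}$ with a pseudo-random numerator behaves like a typical point for its several thousand digits, and it computes digits with the Euclidean algorithm, which is the Gauss map acting on fractions. The function names, the seed, and the bit length are hypothetical choices.

```python
import math
import random
from collections import Counter

def gauss_kuzmin(m):
    """mu_G(I_m) = log2((m+1)^2 / (m(m+2))), the formula derived above."""
    return math.log2((m + 1) ** 2 / (m * (m + 2)))

def cf_digits(p, q):
    """Continued-fraction digits of p/q in (0,1) by the Euclidean algorithm."""
    digits = []
    while p:
        a, r = divmod(q, p)  # a = floor(q/p) is the digit, r/p = G(p/q)
        digits.append(a)
        p, q = r, p
    return digits

def empirical_frequencies(bits=20000, seed=0):
    """Digit frequencies for a pseudo-random rational p / 2^bits (assumption:
    this stands in for a mu_G-typical point)."""
    rng = random.Random(seed)
    q = 1 << bits
    p = rng.randrange(1, q)
    digits = cf_digits(p, q)
    counts = Counter(digits)
    return {m: counts[m] / len(digits) for m in range(1, 6)}, len(digits)

if __name__ == "__main__":
    freqs, n = empirical_frequencies()
    print("number of digits:", n)
    for m in range(1, 6):
        print(m, round(freqs[m], 4), round(gauss_kuzmin(m), 4))
```

The theoretical values telescope: the partial sums of $\mu_G(I_m)$ over $m\le M$ equal $\log_2(2(M+1)/(M+2))$, which tends to $1$.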
## Entropy of Arithmetic Transformations
The next question is how much information the arithmetic algorithm produces per iterate. For piecewise expanding maps, entropy can often be read from the average logarithmic expansion rate. This is the one-dimensional form of a broader principle connecting metric entropy and Jacobian growth.
[quotetheorem:6825]
[citeproof:6825]
The formula reduces metric entropy to an integral only under the stated symbolic and distortion hypotheses, and the failures are concrete. It does not compute topological entropy, nor the entropy of an arbitrary symbolic factor obtained from a coarser partition; it computes Kolmogorov-Sinai entropy for the specified invariant probability measure and the full generating dynamics. For a branch partition that is not generating, take $T(x,y)=(2x \bmod 1, y+\alpha \bmod 1)$ on $\mathbb T^2$ with $\alpha$ irrational and partition only by whether $x\in[0,1/2)$ or $x\in[1/2,1)$; the partition records the expanding binary coordinate but misses the rotation coordinate, so its cylinders do not generate the full Borel structure. For a partition of infinite entropy, consider a countable Bernoulli shift with symbol probabilities proportional to $1/(n(\log n)^2)$ for $n\ge 2$: its generating countable partition has divergent Shannon entropy, so the finite-entropy step in the proof breaks. For a non-integrable logarithmic derivative, consider a map whose branches accumulate at $0$ with slopes so large, on intervals of measure comparable to $1/(n(\log n)^2)$, that $\log |T'|$ has divergent expectation; then there is no finite Lyapunov integral for the entropy to equal. Distortion and absolute continuity are also needed: singular invariant measures for expanding maps can have entropy smaller than the ambient logarithmic expansion because the measure lives on a thin invariant set rather than being spread according to interval length. These restrictions matter for countable-branch maps such as the Gauss map, where the derivative has a singularity at $0$ and the partition has infinitely many atoms. The next theorem checks that the logarithmic singularity is integrable against the Gauss density and then evaluates the resulting series.
[quotetheorem:6826]
[citeproof:6826]
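The value of the Gauss entropy can be sanity-checked numerically. The sketch below assumes that the integrand in the entropy formula is $\log|G'(x)|=2\log(1/x)$ integrated against the Gauss density $1/((1+x)\log 2)$. It evaluates the integral in two ways: by the alternating series obtained from expanding $1/(1+x)$ geometrically, and by a plain midpoint rule. The function names and step counts are hypothetical.

```python
import math

def gauss_entropy_series(terms=10**5):
    """(2/log 2) * sum_{k>=0} (-1)^k / (k+1)^2, using
    int_0^1 x^k log(1/x) dx = 1/(k+1)^2 term by term."""
    s = sum((-1) ** k / (k + 1) ** 2 for k in range(terms))
    return 2.0 * s / math.log(2)

def gauss_entropy_quadrature(n=200000):
    """Midpoint rule for int_0^1 2 log(1/x) / ((1+x) log 2) dx.
    Midpoints avoid evaluating the logarithm at x = 0."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += -2.0 * math.log(x) / ((1.0 + x) * math.log(2))
    return total * h

if __name__ == "__main__":
    exact = math.pi ** 2 / (6 * math.log(2))  # about 2.3731
    print(gauss_entropy_series(), gauss_entropy_quadrature(), exact)
```

The quadrature converges slowly because of the logarithmic singularity at $0$, which is the same singularity whose integrability the theorem has to verify.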
The value $\pi^2/(6\log 2)$ is tied to the Gauss measure, not to Lebesgue measure or to an arbitrary digit distribution. Lebesgue measure gives the wrong starting law because it is not invariant under $G$, as the interval $(0,1/2)$ calculation above shows. The computation also depends on using the natural generating branch partition; the coarser partition that records only whether $a_1(x)=1$ or $a_1(x)\ge 2$ cannot distinguish the full continued-fraction itinerary and therefore records only a two-symbol factor. Integrability is a further limitation: replacing $\mu_G$ by a probability density with too much mass near $0$, for instance a density comparable to $1/(x(\log(1/x))^2)$ near $0$, makes $\int \log(1/x)\,d\mu$ diverge, so Rokhlin's finite entropy formula would not produce a finite number. Thus the calculation is a model for arithmetic entropy only after three checks have been made: identify the invariant probability measure, prove that the natural partition is generating, and verify integrability of logarithmic expansion. The same pattern is visible in base expansions, where the map is still expanding but the invariant density and allowed digit strings depend on the base.
[example: Beta-Transformations]
Fix $\beta>1$ and define
\begin{align*}
T_\beta x=\beta x-\lfloor \beta x\rfloor.
\end{align*}
The first digit is
\begin{align*}
d_1(x)=\lfloor \beta x\rfloor,
\end{align*}
and the recursive formula
\begin{align*}
d_k(x)=\lfloor \beta T_\beta^{k-1}x\rfloor
\end{align*}
records the branch visited by the $(k-1)$st iterate. Thus the branch atoms have the form
\begin{align*}
\left[\frac{j}{\beta},\frac{j+1}{\beta}\right)
\end{align*}
for those integers $j$ for which this interval lies in $[0,1)$, with a final truncated atom when $\beta$ is not an integer.
When $\beta=b\in\mathbb N$ and $b\ge 2$, the branch partition is
\begin{align*}
\alpha=\left\{\left[\frac{j}{b},\frac{j+1}{b}\right):0\le j\le b-1\right\}.
\end{align*}
On the $j$th atom,
\begin{align*}
T_bx=bx-j.
\end{align*}
For every bounded measurable function $f$,
\begin{align*}
\int_0^1 f(T_bx)\,dx=\sum_{j=0}^{b-1}\int_{j/b}^{(j+1)/b} f(bx-j)\,dx.
\end{align*}
On the $j$th summand put $y=bx-j$. Then $dy=b\,dx$, so $dx=dy/b$, and as $x$ runs from $j/b$ to $(j+1)/b$, $y$ runs from $0$ to $1$. Hence
\begin{align*}
\int_{j/b}^{(j+1)/b} f(bx-j)\,dx=\frac{1}{b}\int_0^1 f(y)\,dy.
\end{align*}
Summing the $b$ identical contributions gives
\begin{align*}
\int_0^1 f(T_bx)\,dx=\int_0^1 f(y)\,dy.
\end{align*}
So Lebesgue measure is invariant for the integer-base map.
The $n$-cylinder determined by digits $j_1,\dots,j_n$ is the interval
\begin{align*}
\left[\sum_{k=1}^n \frac{j_k}{b^k},\sum_{k=1}^n \frac{j_k}{b^k}+\frac{1}{b^n}\right).
\end{align*}
Its length is $b^{-n}$, and $b^{-n}\to 0$ as $n\to\infty$, so these cylinders generate the Borel sets modulo endpoints. Since $T_b'(x)=b$ on every branch, the entropy formula for full-branch expanding interval maps gives
\begin{align*}
h_{\mathcal L^1}(T_b)=\int_0^1 \log |T_b'(x)|\,dx.
\end{align*}
Substituting $T_b'(x)=b$ gives
\begin{align*}
h_{\mathcal L^1}(T_b)=\int_0^1 \log b\,dx.
\end{align*}
Since $\log b$ is constant and $\mathcal L^1([0,1])=1$,
\begin{align*}
h_{\mathcal L^1}(T_b)=\log b.
\end{align*}
For non-integer $\beta$, write $m=\lfloor\beta\rfloor$ and $r=\beta-m$, so $0<r<1$. Lebesgue measure is not invariant. Let
\begin{align*}
A=[0,r).
\end{align*}
Then
\begin{align*}
\mathcal L^1(A)=r.
\end{align*}
For each $j=0,1,\dots,m$, the condition $T_\beta x\in A$ on the branch where $\lfloor\beta x\rfloor=j$ is
\begin{align*}
0\le \beta x-j<r.
\end{align*}
Because $\beta>0$, this is equivalent to
\begin{align*}
\frac{j}{\beta}\le x<\frac{j+r}{\beta}.
\end{align*}
Each such interval has length $r/\beta$, and there are $m+1$ of them. Therefore
\begin{align*}
\mathcal L^1(T_\beta^{-1}A)=(m+1)\frac{r}{\beta}.
\end{align*}
Since $m<\beta<m+1$, dividing by $\beta>0$ gives
\begin{align*}
\frac{m+1}{\beta}>1.
\end{align*}
Multiplying by $r>0$ gives
\begin{align*}
(m+1)\frac{r}{\beta}>r.
\end{align*}
Thus
\begin{align*}
\mathcal L^1(T_\beta^{-1}A)>\mathcal L^1(A),
\end{align*}
so Lebesgue measure is not preserved.
The natural invariant probability measure for $T_\beta$ is the absolutely continuous Rényi-Parry measure $\mu_\beta$, rather than Lebesgue measure. With respect to this measure, the digit partition is generating, and $T_\beta'(x)=\beta$ on every smooth branch. The same entropy formula therefore gives
\begin{align*}
h_{\mu_\beta}(T_\beta)=\int_0^1 \log \beta\,d\mu_\beta.
\end{align*}
Since $\mu_\beta$ is a probability measure and $\log\beta$ is constant,
\begin{align*}
h_{\mu_\beta}(T_\beta)=\log\beta.
\end{align*}
Thus integer bases have uniform full branches and Lebesgue measure, while non-integer bases keep the same expansion rate but require the invariant density adapted to the truncated final branch.
[/example]
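The digit recursion and the preimage count in this example can be reproduced with exact rational arithmetic. This is a minimal sketch under stated assumptions: the base $\beta=5/2$ (so $m=2$, $r=1/2$), the midpoint grid, and all function names are hypothetical choices, and the grid count is an exact stand-in for Lebesgue measure only because the preimage intervals have endpoints on the grid's cell boundaries.

```python
from fractions import Fraction
from math import floor

def T(beta, x):
    """Beta-transformation x -> beta*x - floor(beta*x)."""
    y = beta * x
    return y - floor(y)

def digits(beta, x, n):
    """First n digits d_k(x) = floor(beta * T^{k-1} x)."""
    out = []
    for _ in range(n):
        out.append(floor(beta * x))
        x = T(beta, x)
    return out

def preimage_fraction(beta, r, grid=1000):
    """Exact fraction of midpoints x = (2k+1)/(2*grid) with T(x) in [0, r)."""
    hits = sum(
        1 for k in range(grid)
        if T(beta, Fraction(2 * k + 1, 2 * grid)) < r
    )
    return Fraction(hits, grid)

if __name__ == "__main__":
    beta, r = Fraction(5, 2), Fraction(1, 2)
    print(digits(beta, Fraction(1, 2), 4))      # [1, 0, 1, 1]
    print(preimage_fraction(beta, r))           # 3/5 = (m+1) r / beta > r
    print(preimage_fraction(Fraction(2), r))    # 1/2: integer base preserves length
```

For $\beta=5/2$ the preimage of $[0,1/2)$ is $[0,0.2)\cup[0.4,0.6)\cup[0.8,1)$, of total length $3/5$, in agreement with $(m+1)r/\beta$.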
Beta-transformations show how expansion produces digit entropy on an interval. A higher-dimensional analogue replaces the interval by a compact abelian group and replaces multiplication by a scalar with an integer matrix.
[example: Multiplication Maps on Tori]
Let $A\in M_d(\mathbb Z)$ have $\det A\ne 0$, and define
\begin{align*}
T_A:\mathbb T^d\to\mathbb T^d,\qquad T_A(x)=Ax\pmod{\mathbb Z^d}.
\end{align*}
This is well-defined because if $x-y\in\mathbb Z^d$, then
\begin{align*}
Ax-Ay=A(x-y)\in\mathbb Z^d
\end{align*}
since every entry of $A$ is an integer. The map is onto: for any class $y+\mathbb Z^d$, choose $x=A^{-1}y\in\mathbb R^d$, which is defined because $\det A\ne 0$. Then
\begin{align*}
Ax=y,
\end{align*}
so
\begin{align*}
T_A(x+\mathbb Z^d)=y+\mathbb Z^d.
\end{align*}
Thus $T_A$ is a continuous surjective homomorphism of the compact group $\mathbb T^d$. The pushforward of normalized Haar measure under a continuous surjective [group homomorphism](/page/Group%20Homomorphism) is again normalized Haar measure, so Lebesgue-Haar measure $\mathcal L^d$ is $T_A$-invariant.
Assume now that every eigenvalue of $A$ has absolute value greater than $1$. Then $A^{-1}$ has all eigenvalues of absolute value less than $1$, so in an equivalent norm there is a constant $c<1$ such that
\begin{align*}
\|A^{-1}v\|\le c\|v\|
\end{align*}
for all $v\in\mathbb R^d$. Applying this inequality to $v=Aw$ gives
\begin{align*}
\|w\|=\|A^{-1}Aw\|\le c\|Aw\|.
\end{align*}
Therefore
\begin{align*}
\|Aw\|\ge c^{-1}\|w\|,
\end{align*}
with $c^{-1}>1$, so $T_A$ is expanding. Its derivative is constant:
\begin{align*}
D T_A(x)=A.
\end{align*}
Hence the Jacobian determinant is
\begin{align*}
|\det D T_A(x)|=|\det A|.
\end{align*}
By the entropy formula for expanding toral endomorphisms,
\begin{align*}
h_{\mathcal L^d}(T_A)=\int_{\mathbb T^d}\log |\det D T_A(x)|\,d\mathcal L^d(x).
\end{align*}
Substituting the constant Jacobian gives
\begin{align*}
h_{\mathcal L^d}(T_A)=\int_{\mathbb T^d}\log |\det A|\,d\mathcal L^d(x).
\end{align*}
Since $\mathcal L^d(\mathbb T^d)=1$, this becomes
\begin{align*}
h_{\mathcal L^d}(T_A)=\log |\det A|.
\end{align*}
If the eigenvalues of $A$ are $\lambda_1,\dots,\lambda_d$, counted with algebraic multiplicity, then
\begin{align*}
\det A=\prod_{i=1}^d \lambda_i.
\end{align*}
Taking absolute values gives
\begin{align*}
|\det A|=\prod_{i=1}^d |\lambda_i|.
\end{align*}
In the expanding case each $|\lambda_i|>1$, so
\begin{align*}
\log |\det A|=\log\left(\prod_{i=1}^d|\lambda_i|\right).
\end{align*}
Using $\log(ab)=\log a+\log b$ repeatedly for positive factors,
\begin{align*}
\log |\det A|=\sum_{i=1}^d\log|\lambda_i|.
\end{align*}
For a toral automorphism, $A\in GL_d(\mathbb Z)$ and $\det A=\pm 1$, so expansion in some directions is balanced by contraction in others. The entropy formula for toral automorphisms counts only the expanding eigenvalues:
\begin{align*}
h_{\mathcal L^d}(T_A)=\sum_{|\lambda|>1}\log|\lambda|,
\end{align*}
with eigenvalues repeated according to algebraic multiplicity. Thus toral multiplication maps turn linear expansion, measured by determinants or unstable eigenvalues, into metric entropy.
[/example]
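For $2\times 2$ integer matrices the entropy formulas of this example reduce to a short eigenvalue computation. The sketch below is an illustration with hypothetical function names. It uses the unstable-eigenvalue sum $\sum_{|\lambda|>1}\log|\lambda|$, which coincides with $\log|\det A|$ whenever all eigenvalues lie outside the unit circle, and the sample matrices are assumptions chosen for the demonstration.

```python
import cmath
import math

def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial."""
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def haar_entropy(a, b, c, d):
    """Sum of log|lambda| over the eigenvalues with |lambda| > 1."""
    return sum(
        math.log(abs(lam))
        for lam in eigenvalues_2x2(a, b, c, d)
        if abs(lam) > 1
    )

if __name__ == "__main__":
    # Expanding endomorphism: both eigenvalues exceed 1, det = 5.
    print(haar_entropy(3, 1, 1, 2), math.log(5))
    # Automorphism with det = 1: only the unstable eigenvalue contributes.
    print(haar_entropy(2, 1, 1, 1), math.log((3 + math.sqrt(5)) / 2))
```

In the first case the two eigenvalues are $(5\pm\sqrt5)/2$, whose product is $5$, so the eigenvalue sum and the determinant formula agree.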
## Equidistribution, Invariant Measures, and Symbolic Arithmetic Codings
The final problem is to understand why the same orbit can be studied in three languages: equidistribution in a geometric space, invariance of a measure, and symbolic dynamics. Arithmetic maps often become tractable only after moving between these languages.
[definition: Equidistribution of an Orbit]
Let $(X,\mathcal B,\mu)$ be a probability space with $X$ a compact metric space and $\mu$ a Borel probability measure. For a measurable map $T:X\to X$, the orbit $(T^n x)_{n\ge 0}$ is equidistributed with respect to $\mu$ if, for every continuous function $f:X\to\mathbb R$,
\begin{align*}
\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)=\int_X f\,d\mu.
\end{align*}
[/definition]
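The defining limit can be explored numerically. The sketch below is an illustration under assumptions that are not taken from the text: it uses the circle rotation by the golden-mean angle, which is uniquely ergodic for Lebesgue measure, as an equidistributed example, and the identity map as a non-ergodic contrast. The test function $f(x)=\cos(2\pi x)$ has Lebesgue integral $0$; all names are hypothetical.

```python
import math

def birkhoff_average(T, f, x, N):
    """(1/N) * sum_{n=0}^{N-1} f(T^n x)."""
    total = 0.0
    for _ in range(N):
        total += f(x)
        x = T(x)
    return total / N

ALPHA = (math.sqrt(5) - 1) / 2  # assumed irrational rotation angle

def rotation(x):
    return (x + ALPHA) % 1.0

def identity(x):
    return x

def f(x):
    return math.cos(2 * math.pi * x)

if __name__ == "__main__":
    # Rotation: the average approaches the space average 0.
    print(birkhoff_average(rotation, f, 0.2, 10**5))
    # Identity: the average stays at f(x), not at the space average.
    print(birkhoff_average(identity, f, 0.2, 1000), f(0.2))
```

For the rotation the sum is a geometric sum of complex exponentials, so the average is bounded by $1/(N|\sin\pi\alpha|)$ for every starting point.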
This definition turns orbit statistics into convergence of empirical averages, but it leaves open why individual arithmetic orbits should satisfy the condition. We therefore need the theorem that converts ergodicity of an invariant measure into equidistribution for almost every starting point.
[quotetheorem:6827]
[citeproof:6827]
This theorem gives an almost-everywhere statement, not a universal one: exceptional orbits may be periodic, preperiodic, or otherwise non-generic. For instance, a periodic point for an expanding map equidistributes with respect to the uniform measure on its finite orbit rather than with respect to the ambient invariant measure, unless that measure is supported on the orbit. Ergodicity is essential. If $T$ is the identity on $[0,1]$ with Lebesgue measure, Birkhoff averages equal $f(x)$ along each orbit, not $\int_0^1 f\,d\mathcal L^1$ for a general continuous $f$. The compact metric and Borel hypotheses are used to reduce all continuous test functions to a countable dense family; on a nonseparable compact space, $C(X)$ need not have a countable [dense subset](/page/Dense%20Subset) in the uniform norm, so applying Birkhoff to countably many functions cannot certify the defining convergence for every continuous test function. If $X$ is not compact, continuous functions may be unbounded and need not be integrable; for example $f(x)=x$ on $\mathbb R$ is continuous but unbounded, so Birkhoff's theorem applies to it only under an additional $L^1(\mu)$ assumption. Equidistribution describes the geometric orbit, but entropy computations usually need a symbolic record of which branch, cell, or cross-section the orbit visits. This motivates a general coding construction from a measurable partition.
[definition: Symbolic Coding by a Partition]
Let $(X,\mathcal B,\mu,T)$ be a measure-preserving system and let $\alpha=\{A_i:i\in I\}$ be a finite or countable measurable partition. The symbolic coding map associated to $\alpha$ is
\begin{align*}
\pi_\alpha:X\to I^{\mathbb N}, \qquad (\pi_\alpha(x))_n=i \quad \text{if } T^n x\in A_i.
\end{align*}
[/definition]
The coding intertwines $T$ with the left shift on $I^{\mathbb N}$ wherever the itinerary is defined. When $\alpha$ is generating, the coded process retains the measurable dynamics up to null sets; for continued fractions this recovers the digit process.
[example: Continued Fractions as a Countable Shift Coding]
For the Gauss map, take the countable partition $\alpha=\{I_n:n\in\mathbb N\}$, where
\begin{align*}
I_n=\left(\frac{1}{n+1},\frac{1}{n}\right].
\end{align*}
If $x\in(0,1)\setminus\mathbb Q$ and $G^k x\in I_n$, then by the definition of $I_n$,
\begin{align*}
\frac{1}{n+1}<G^k x\le \frac{1}{n}.
\end{align*}
Since $G^k x>0$, taking reciprocals reverses the inequalities and gives
\begin{align*}
n\le \frac{1}{G^k x}<n+1.
\end{align*}
Therefore
\begin{align*}
\left\lfloor \frac{1}{G^k x}\right\rfloor=n.
\end{align*}
By the recursive definition of continued-fraction digits,
\begin{align*}
a_{k+1}(x)=a_1(G^k x)=\left\lfloor \frac{1}{G^k x}\right\rfloor=n.
\end{align*}
Thus the symbolic coding map satisfies
\begin{align*}
(\pi_\alpha(x))_k=n
\end{align*}
exactly when
\begin{align*}
a_{k+1}(x)=n,
\end{align*}
so
\begin{align*}
\pi_\alpha(x)=(a_1(x),a_2(x),a_3(x),\dots),
\end{align*}
up to the harmless choice of whether sequence coordinates are indexed from $0$ or from $1$.
Let $\nu=(\pi_\alpha)_*\mu_G$ be the image measure on $\mathbb N^{\mathbb N}$, and let $\sigma$ be the left shift. For every cylinder set $C\subseteq \mathbb N^{\mathbb N}$,
\begin{align*}
\nu(\sigma^{-1}C)
=\mu_G(\pi_\alpha^{-1}(\sigma^{-1}C)).
\end{align*}
The coding intertwines $G$ with $\sigma$ away from the countable endpoint orbit:
\begin{align*}
\pi_\alpha(Gx)=\sigma(\pi_\alpha(x)).
\end{align*}
Hence
\begin{align*}
\pi_\alpha^{-1}(\sigma^{-1}C)=G^{-1}(\pi_\alpha^{-1}C)
\end{align*}
modulo a $\mu_G$-null set. Since $\mu_G$ is $G$-invariant,
\begin{align*}
\nu(\sigma^{-1}C)
=\mu_G(G^{-1}(\pi_\alpha^{-1}C))
=\mu_G(\pi_\alpha^{-1}C)
=\nu(C).
\end{align*}
Cylinder sets generate the product sigma-algebra on $\mathbb N^{\mathbb N}$, so $\nu$ is shift-invariant. Because the partition $\alpha$ is generating modulo endpoints, this coding retains the Gauss dynamics up to null sets, and the entropy of $G$ with respect to $\mu_G$ is the entropy rate of the continued-fraction digit process.
[/example]
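The coding map and its intertwining relation with the shift can be traced on an exact orbit. The sketch below is a finite illustration with hypothetical names. It works with a rational starting point, whose Gauss orbit reaches $0$ after finitely many steps, so it only exhibits the first few symbols and says nothing about the measure $\nu$; the convention $G(0)=0$ is an assumption made to stop the orbit.

```python
from fractions import Fraction
from math import floor

def gauss(x):
    """Gauss map G(x) = 1/x - floor(1/x); G(0) = 0 by convention (assumption)."""
    if x == 0:
        return Fraction(0)
    y = 1 / x
    return y - floor(y)

def atom_index(x):
    """Index n of the atom I_n = (1/(n+1), 1/n] containing x in (0, 1]."""
    return floor(1 / x)

def itinerary(x, length):
    """First `length` symbols of pi_alpha(x), stopping if the orbit hits 0."""
    word = []
    for _ in range(length):
        if x == 0:
            break
        word.append(atom_index(x))
        x = gauss(x)
    return word

if __name__ == "__main__":
    x = Fraction(93, 415)
    print(itinerary(x, 10))         # [4, 2, 6, 7]
    print(itinerary(gauss(x), 10))  # [2, 6, 7], the shifted word
```

The second line of output is the first with its initial symbol removed, which is the relation $\pi_\alpha(Gx)=\sigma(\pi_\alpha(x))$ on this finite orbit segment.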
The continued-fraction coding is also a gateway to a geometric system that is not itself an interval map. The next example records the advanced bridge: the same digit sequences arise from return maps for geodesic flow on the modular surface.
[example: Geodesic Flow Coding as an Advanced Bridge]
On the modular surface, one standard cross-section for the geodesic flow, the *Artin cross-section construction*, assigns to a returning geodesic an endpoint coordinate $x\in(0,1)$. The return branch is determined by the unique integer $n\in\mathbb N$ satisfying
\begin{align*}
\frac{1}{n+1}<x\le \frac{1}{n}.
\end{align*}
Since $x>0$, taking reciprocals reverses the inequalities:
\begin{align*}
n\le \frac{1}{x}<n+1.
\end{align*}
Therefore
\begin{align*}
\left\lfloor \frac{1}{x}\right\rfloor=n.
\end{align*}
The first-return coordinate after renormalising the geodesic is
\begin{align*}
x'=\frac{1}{x}-n.
\end{align*}
Using the preceding identity for $n$, this becomes
\begin{align*}
x'=\frac{1}{x}-\left\lfloor \frac{1}{x}\right\rfloor=G(x).
\end{align*}
Thus one return of the geodesic-flow cross-section changes the endpoint coordinate by the Gauss map. Iterating the return gives
\begin{align*}
x_0=x,\qquad x_{k+1}=G(x_k),
\end{align*}
and the integer recorded at the $k$th return is
\begin{align*}
n_k=\left\lfloor \frac{1}{x_k}\right\rfloor.
\end{align*}
Since $x_k=G^k x$, this is
\begin{align*}
n_k=\left\lfloor \frac{1}{G^k x}\right\rfloor=a_{k+1}(x).
\end{align*}
So the symbolic return sequence of the geodesic is
\begin{align*}
(n_0,n_1,n_2,\dots)=(a_1(x),a_2(x),a_3(x),\dots).
\end{align*}
The point is not that the interval map and the geodesic flow are the same object: one is a one-dimensional non-invertible return map, while the other is a smooth flow on the unit tangent bundle of a hyperbolic surface. The calculation shows exactly where they meet: the first-return coding of the flow records the same continued-fraction digits as the Gauss map.
[/example]
The chapter therefore closes the entropy part of the course with a recurring template. An arithmetic rule defines a transformation; an invariant measure makes the system probabilistic; ergodicity gives almost-sure statistical laws; and a generating symbolic coding turns entropy into an information rate for digits or branches.
Number-theoretic dynamics shows how entropy turns arithmetic processes into measurable orbit statistics, but the same template also applies to physical systems with many interacting components. The final chapter returns to thermodynamic ideas in their original setting, where entropy, pressure, and phase transitions describe how microscopic rules produce macroscopic behavior.
# 12. Statistical Mechanics and Phase Transitions
Statistical mechanics brings the thermodynamic formalism of Chapter 9 into its original setting: large systems with many local degrees of freedom. The guiding question is how microscopic rules, encoded by an interaction or potential, determine macroscopic quantities such as entropy, pressure, free energy, and phase structure. The chapter assumes the earlier material on measure-theoretic entropy, topological pressure for subshifts, weak* compactness of probability measures, and Perron-Frobenius theory for non-negative matrices. In dynamical language, the central objects are invariant measures which optimise entropy plus energy, and in lattice language they are Gibbs states satisfying local conditional probability rules. This chapter compares these two viewpoints and explains how non-uniqueness of equilibrium states is the dynamical signature of a phase transition.
## Entropy, Pressure, and Free Energy
How should we assign a single thermodynamic number to a potential when the system has exponentially many orbit segments? Entropy counts orbit complexity, while the potential weights orbit segments by energetic preference. Pressure combines these two contributions and is the quantity whose derivatives and maximisers organise the rest of the chapter.
Let $(X,T)$ be a compact metric dynamical system, and let $\nu$ range over invariant Borel probability measures for $T$. For a continuous potential $\beta: X \to \mathbb R$, the quantity
\begin{align*}
h_\nu(T)+\int_X \beta\,d\nu
\end{align*}
is the measure-theoretic free energy contribution of $\nu$. The first task is to name the best value of this tradeoff, because all later equilibrium questions ask which measures attain it.
[definition: Topological Pressure]
Let $(X,T)$ be a compact metric dynamical system. The topological pressure functional is the map
\begin{align*}
P_T:C(X)\to (-\infty,\infty]
\end{align*}
defined by
\begin{align*}
P_T(\beta) := \sup_{\nu \in \mathcal M_T(X)} \left(h_\nu(T)+\int_X \beta\,d\nu\right),
\end{align*}
where $\mathcal M_T(X)$ is the set of invariant Borel probability measures for $T$ on $X$.
[/definition]
The definition gives a variational number, but statistical mechanics usually starts with finite systems and normalising constants. To connect these viewpoints, we need a finite-volume object whose exponential growth rate can be compared with the supremum above. On a subshift, the natural finite systems are admissible words.
[definition: Partition Function for a Subshift]
Let $X \subset A^{\mathbb N}$ be a one-sided subshift over a finite alphabet and let $\sigma:X\to X$ be the shift. For $n\in\mathbb N$, the length-$n$ partition function is the map
\begin{align*}
Z_n:C(X)\to (0,\infty)
\end{align*}
defined by
\begin{align*}
Z_n(\beta) := \sum_{w\in \mathcal L_n(X)} \exp\left(\sup_{x\in [w]} \sum_{k=0}^{n-1}\beta(\sigma^k x)\right),
\end{align*}
where $\mathcal L_n(X)$ is the set of admissible words of length $n$, and $[w]$ is the corresponding cylinder.
[/definition]
The partition function records weighted orbit complexity, while topological pressure was defined by optimising over measures. The next issue is whether these two constructions give the same number. This comparison is the central variational principle for the symbolic models used throughout the chapter.
[quotetheorem:6828]
The theorem turns pressure into a bridge between orbit counting and invariant measures, but each hypothesis controls a different possible failure. If $X$ is the disjoint union of two mixing finite-type components, the pressure is the maximum of the two component pressures; at parameter values where the maxima tie, there are competing equilibrium states rather than a single statistical regime. If the finite-type hypothesis is dropped, a non-sofic subshift can have languages whose entropy is not governed by a finite transition matrix, so the cylinder estimates used in the proof no longer reduce to finitely many local constraints. If Hölder regularity is weakened to arbitrary continuity, Birkhoff sums can have unbounded distortion on long cylinders, and the transfer-operator spectral consequences used later may fail even when the variational formula is recovered by a more general definition of pressure.
Thus the theorem says two things needed later: the finite-volume growth rate equals the variational free energy, and in the Hölder finite-type setting this equality is compatible with transfer-operator methods. It does not say that pressure always has a unique maximiser, nor that all continuous-potential systems have spectral gaps. Once the pressure has a variational description, the next problem is to identify the measures that realise the optimum. These measures are the thermodynamic states selected by the potential.
[definition: Equilibrium State]
Let $(X,T)$ be a compact metric dynamical system and let $\beta\in C(X)$. An equilibrium state for $\beta$ is a measure $\mu\in\mathcal M_T(X)$ such that
\begin{align*}
P_T(\beta)=h_\mu(T)+\int_X\beta\,d\mu.
\end{align*}
[/definition]
Equilibrium states are maximisers, but thermodynamics often phrases the same information in terms of free energy rather than pressure. To compare the course notation with the physics convention, we introduce the finite-volume free energy density attached to a partition function. This also prepares the discussion of singularities in the thermodynamic limit.
[definition: Free Energy Density]
Let $I\subset \mathbb R_+$ be an interval of inverse temperatures. Let $(Z_n)_{n\in\mathbb N}$ be a family of functions
\begin{align*}
Z_n:I\to (0,\infty).
\end{align*}
The finite-volume free energy density in volume $n$ is the map
\begin{align*}
f_n:I\to\mathbb R
\end{align*}
defined by
\begin{align*}
f_n(\theta):=-\frac{1}{\theta n}\log Z_n(\theta).
\end{align*}
[/definition]
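A small numerical sketch shows how $f_n$ behaves as $n$ grows. The family of partition functions used here is an assumption made for illustration and is not the subshift partition function defined earlier: it is the two-state transfer-matrix family $Z_n(\theta)=\lambda_+^n+\lambda_-^n$ with $\lambda_\pm=e^{\theta}\pm e^{-\theta}$, as for a periodic nearest-neighbour chain with unit coupling. Function names are hypothetical.

```python
import math

def Z(n, theta):
    """Assumed partition function: trace of the n-th power of the matrix
    [[e^theta, e^-theta], [e^-theta, e^theta]]."""
    lam_plus = 2.0 * math.cosh(theta)
    lam_minus = 2.0 * math.sinh(theta)
    return lam_plus ** n + lam_minus ** n

def free_energy_density(n, theta):
    """f_n(theta) = -(1 / (theta * n)) * log Z_n(theta)."""
    return -math.log(Z(n, theta)) / (theta * n)

def limiting_free_energy(theta):
    """Limit of f_n: minus the log of the leading eigenvalue, divided by theta."""
    return -math.log(2.0 * math.cosh(theta)) / theta

if __name__ == "__main__":
    theta = 1.0
    for n in (5, 20, 200):
        print(n, free_energy_density(n, theta), limiting_free_energy(theta))
```

In this model the correction to the limit is $-\log(1+\tanh^n\theta)/(\theta n)$, so $f_n$ increases to its limit and the subleading eigenvalue only affects finite volumes.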
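Whenever $\frac{1}{n}\log Z_n(\theta)$ converges to a limit $P(\theta)$, the free energy density $f_n(\theta)$ converges to $-P(\theta)/\theta$, so pressure and free energy contain the same asymptotic information. Singular behaviour of the limiting pressure, or non-uniqueness of measures attaining it, is the mathematical form of a phase transition. In simple finite-state models the variational expression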
\begin{align*}
h_\nu(\sigma)+\int_X\beta\,d\nu
\end{align*}
can be maximised by an explicit finite-dimensional calculation.
[example: Locally Constant Potential on a Full Shift]
Let $X=A^{\mathbb N}$, and suppose $\beta(x)=b_a$ whenever $x_0=a$. For a word $w=w_0\cdots w_{n-1}\in A^n$, every $x\in[w]$ has $(\sigma^k x)_0=w_k$, so
\begin{align*}
\sum_{k=0}^{n-1}\beta(\sigma^k x)=\sum_{k=0}^{n-1}b_{w_k}.
\end{align*}
Thus the supremum over $[w]$ is this same value. Since $X$ is the full shift, every word in $A^n$ is admissible, and
\begin{align*}
Z_n(\beta)=\sum_{w_0,\ldots,w_{n-1}\in A}\exp\left(\sum_{k=0}^{n-1}b_{w_k}\right).
\end{align*}
Writing the exponential of the sum as a product of exponentials gives
\begin{align*}
Z_n(\beta)=\sum_{w_0,\ldots,w_{n-1}\in A}e^{b_{w_0}}\cdots e^{b_{w_{n-1}}}.
\end{align*}
The sum factors over the $n$ independent choices of symbols, so
\begin{align*}
Z_n(\beta)=\left(\sum_{a\in A}e^{b_a}\right)^n.
\end{align*}
Therefore
\begin{align*}
\frac{1}{n}\log Z_n(\beta)=\log\sum_{a\in A}e^{b_a}.
\end{align*}
By the pressure variational principle for this full shift and locally constant potential,
\begin{align*}
P_\sigma(\beta)=\log\sum_{a\in A}e^{b_a}.
\end{align*}
Set
\begin{align*}
q_a=\frac{e^{b_a}}{\sum_{c\in A}e^{b_c}}.
\end{align*}
For the Bernoulli measure $\mu_q$ with symbol weights $(q_a)_{a\in A}$, the one-symbol entropy formula gives
\begin{align*}
h_{\mu_q}(\sigma)=-\sum_{a\in A}q_a\log q_a.
\end{align*}
Since $\beta=b_a$ on the cylinder $[a]$ and $\mu_q([a])=q_a$,
\begin{align*}
\int_X\beta\,d\mu_q=\sum_{a\in A}q_a b_a.
\end{align*}
Hence
\begin{align*}
h_{\mu_q}(\sigma)+\int_X\beta\,d\mu_q=-\sum_{a\in A}q_a\log q_a+\sum_{a\in A}q_a b_a.
\end{align*}
Combining the sums,
\begin{align*}
h_{\mu_q}(\sigma)+\int_X\beta\,d\mu_q=\sum_{a\in A}q_a(b_a-\log q_a).
\end{align*}
Substituting the definition of $q_a$,
\begin{align*}
b_a-\log q_a=b_a-\log\frac{e^{b_a}}{\sum_{c\in A}e^{b_c}}.
\end{align*}
Since $\log e^{b_a}=b_a$,
\begin{align*}
b_a-\log q_a=\log\sum_{c\in A}e^{b_c}.
\end{align*}
Therefore
\begin{align*}
h_{\mu_q}(\sigma)+\int_X\beta\,d\mu_q=\sum_{a\in A}q_a\log\sum_{c\in A}e^{b_c}.
\end{align*}
Because $\sum_{a\in A}q_a=1$,
\begin{align*}
h_{\mu_q}(\sigma)+\int_X\beta\,d\mu_q=\log\sum_{c\in A}e^{b_c}.
\end{align*}
Thus $\mu_q$ attains the pressure and is an equilibrium state. The model is the zero-interaction Gibbs model: each coordinate is chosen independently, with symbol $a$ weighted proportionally to $e^{b_a}$.
[/example]
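The identities in this example can be confirmed by brute force for a small alphabet. The sketch below is an illustration with hypothetical names and an arbitrarily chosen potential vector `b`. It enumerates all words to compute $Z_n$, evaluates entropy plus energy for Bernoulli measures, and compares the Gibbs weights $q$ with another probability vector.

```python
import itertools
import math

def partition_function(b, n):
    """Brute-force Z_n: sum over all words w in A^n of exp(sum_k b[w_k])."""
    symbols = range(len(b))
    return sum(
        math.exp(sum(b[a] for a in w))
        for w in itertools.product(symbols, repeat=n)
    )

def pressure(b):
    """log of the sum of e^{b_a}."""
    return math.log(sum(math.exp(x) for x in b))

def gibbs_weights(b):
    """q_a = e^{b_a} / sum_c e^{b_c}."""
    z = sum(math.exp(x) for x in b)
    return [math.exp(x) / z for x in b]

def free_energy(p, b):
    """Entropy plus energy of the Bernoulli measure with symbol weights p."""
    entropy = -sum(pa * math.log(pa) for pa in p if pa > 0)
    energy = sum(pa * ba for pa, ba in zip(p, b))
    return entropy + energy

if __name__ == "__main__":
    b = [0.0, 1.0, -0.5]  # hypothetical potential values b_a
    print(math.log(partition_function(b, 4)) / 4, pressure(b))
    print(free_energy(gibbs_weights(b), b), free_energy([1 / 3] * 3, b))
```

The uniform vector gives a strictly smaller value than $q$ here, consistent with $\mu_q$ being the maximiser among Bernoulli measures.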
In this example the coordinates do not interact. The next section studies the finite-range case, where conditional probabilities depend on neighbours and equilibrium states become Gibbs states.
## Equilibrium States as Gibbs States
What local rule should replace independence when nearby coordinates interact? Statistical mechanics answers by prescribing conditional probabilities in finite windows given the configuration outside the window. The Dobrushin-Lanford-Ruelle formalism, usually abbreviated DLR, is the measure-theoretic language for this prescription.
Let $S$ be a finite spin set and write $\Omega=S^{\mathbb Z^d}$ with its product Borel $\sigma$-algebra. A configuration is denoted $\omega=(\omega_i)_{i\in\mathbb Z^d}$, and for a finite set $\Lambda\subset\mathbb Z^d$ we write $\omega_\Lambda$ for its restriction to $\Lambda$. Before conditional probabilities can be written, we need a local energy rule specifying which finite groups of sites interact.
[definition: Interaction]
An interaction on $S^{\mathbb Z^d}$ is a family $\Phi=(\Phi_A)_{A\Subset\mathbb Z^d}$ such that each $\Phi_A:S^{\mathbb Z^d}\to\mathbb R$ depends only on the coordinates in the finite set $A$.
[/definition]
An interaction has finite range if there is $R<\infty$ such that $\Phi_A=0$ whenever the diameter of $A$ exceeds $R$; in that case only finitely many terms involving a fixed site are non-zero. This restriction is not cosmetic: if infinitely many interaction terms touch a finite window and no absolute-summability condition is imposed, the local energy may be an undefined infinite series. In this chapter the finite-volume formalism is used for finite-range interactions, which keeps the Hamiltonian and the normalising constant finite. Local interaction terms by themselves do not yet give the energy of a finite experiment, because the experiment sits inside an exterior configuration. The next definition packages the interaction terms touching a finite window together with a chosen boundary condition.
[definition: Finite-Volume Hamiltonian]
Let $\Phi$ be a finite-range interaction and let $\Lambda\Subset\mathbb Z^d$. For each boundary condition $\eta\in S^{\mathbb Z^d}$, the finite-volume Hamiltonian in $\Lambda$ is the map
\begin{align*}
H_\Lambda^\eta:S^\Lambda\to\mathbb R
\end{align*}
defined by
\begin{align*}
H_\Lambda^\eta(\omega_\Lambda)
:=\sum_{A\cap\Lambda\ne\varnothing}\Phi_A(\omega_\Lambda\eta_{\Lambda^c}),
\end{align*}
where $\omega_\Lambda\eta_{\Lambda^c}$ is the configuration agreeing with $\omega_\Lambda$ on $\Lambda$ and with $\eta$ on $\Lambda^c$.
[/definition]
The next definition is needed because an energy table is not yet a probability law. We must normalise the Boltzmann weights over all possible fillings of the same finite window, while keeping the exterior configuration fixed. The finite-range hypothesis ensures that every term in the denominator is a finite positive number and that the denominator is a finite sum. This produces the local probability kernel that will appear in the DLR equations.
[definition: Finite-Volume Gibbs Kernel]
Let $\Phi$ be a finite-range interaction, let $\Lambda\Subset\mathbb Z^d$, and let $\eta\in S^{\mathbb Z^d}$. The finite-volume Gibbs kernel is the map
\begin{align*}
\gamma_\Lambda^\eta:S^\Lambda\to[0,1]
\end{align*}
defined by
\begin{align*}
\gamma_\Lambda^\eta(\omega_\Lambda)
:=\frac{\exp(-H_\Lambda^\eta(\omega_\Lambda))}{\sum_{\tau_\Lambda\in S^\Lambda}\exp(-H_\Lambda^\eta(\tau_\Lambda))}.
\end{align*}
[/definition]
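As a concrete illustration of the two definitions above, the following Python sketch tabulates $H_\Lambda^\eta$ and $\gamma_\Lambda^\eta$ for a window $\Lambda=\{0,\ldots,n-1\}\subset\mathbb Z$. The nearest-neighbour Ising-type interaction, the function names `hamiltonian` and `gibbs_kernel`, and the parameter values are assumptions made for illustration. Because the assumed interaction has range one, only the two boundary spins adjacent to the window enter the energy.

```python
import itertools
import math

def hamiltonian(omega, eta_left, eta_right, J=1.0, h=0.0):
    """Finite-volume energy of a filling omega of the window {0, ..., n-1}.

    Assumed interaction: -J * s_i * s_{i+1} on each bond and -h * s_i on
    each site. Every bond touching the window is counted, including the
    two bonds that connect the window to the boundary spins.
    """
    padded = (eta_left,) + tuple(omega) + (eta_right,)
    bonds = sum(padded[i] * padded[i + 1] for i in range(len(padded) - 1))
    field = sum(omega)  # single-site terms only for sites inside the window
    return -J * bonds - h * field

def gibbs_kernel(n, eta_left, eta_right, J=1.0, h=0.0):
    """Normalised Boltzmann weights over all 2^n fillings of the window."""
    fillings = list(itertools.product((-1, 1), repeat=n))
    weights = [math.exp(-hamiltonian(w, eta_left, eta_right, J, h))
               for w in fillings]
    Z = sum(weights)  # finite sum of positive terms
    return {w: wt / Z for w, wt in zip(fillings, weights)}

# Same window, two different exterior configurations.
plus = gibbs_kernel(3, +1, +1, J=0.8)
minus = gibbs_kernel(3, -1, -1, J=0.8)
print(plus[(1, 1, 1)], minus[(1, 1, 1)])
```

The two printed values differ, which shows in a small case how the same filling receives different conditional probabilities under different boundary conditions.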
The kernel gives the desired conditional law in a finite window, but an infinite-volume state must be compatible with every such finite-window law at once. A sequence of finite-volume Gibbs measures with plus boundary conditions and the same sequence with minus boundary conditions can have different subsequential limits; each finite box has been normalised correctly, yet the limiting conditional laws may remember the exterior choice. This is why finite-volume Gibbs measures alone are not the definition of an infinite-volume phase. The next definition imposes compatibility by averaging the local kernel against the exterior configuration distributed according to the measure itself.
[definition: DLR Gibbs State]
Let $\Phi$ be a finite-range interaction on $S^{\mathbb Z^d}$, and let $\gamma_\Lambda^\eta$ be the finite-volume Gibbs kernel defined from $\Phi$. A Borel probability measure $\mu$ on $S^{\mathbb Z^d}$ is a DLR Gibbs state for $\Phi$ if, for every finite $\Lambda\Subset\mathbb Z^d$ and every bounded measurable $F:S^{\mathbb Z^d}\to\mathbb R$,
\begin{align*}
\int F(\omega)\,d\mu(\omega)
=
\int \sum_{\omega_\Lambda\in S^\Lambda}
F(\omega_\Lambda\eta_{\Lambda^c})\,\gamma_\Lambda^\eta(\omega_\Lambda)\,d\mu(\eta).
\end{align*}
[/definition]
This is the Dobrushin-Lanford-Ruelle formalism in the form needed here. After defining the compatibility condition, the first structural question is existence: a local specification would be unusable if no infinite-volume measure satisfied it. Compactness supplies existence for finite-range interactions over finite spin spaces.
[quotetheorem:6829]
The theorem gives existence only in the compact finite-spin, finite-range setting. It does not identify which boundary conditions lead to which limit points, and it does not cover continuous-spin models without additional tightness and summability hypotheses. Above all, existence does not answer whether the infinite-volume state is determined uniquely. In one dimension with finite-range interactions, the local specification can be represented by a finite transfer matrix, so uniqueness becomes a Perron-Frobenius question. The following theorem is the main uniqueness result proved in the symbolic part of the course.
[quotetheorem:6830]
This theorem explains why phase transitions do not appear in the usual one-dimensional finite-range Ising model. Irreducibility rules out a decomposition into disconnected symbolic components: for instance, the matrix $\operatorname{diag}(e^a,e^b)$ describes two fixed-symbol components that never communicate, and when $a=b$ the two point masses are distinct invariant Gibbs and equilibrium states. Aperiodicity rules out cyclic behaviour in which several peripheral eigenvalues compete with the leading one: a two-state matrix whose only allowed transitions are $0\to1$ and $1\to0$ has period two, and its two eigenvalues $\pm\sqrt{M_{01}M_{10}}$ have equal modulus, so neither dominates. The theorem is stated only for translation-invariant DLR states because that is the version proved here through stationary transfer-matrix distributions; stronger one-dimensional uniqueness theorems for positive finite-range interactions require extra arguments controlling arbitrary boundary limits. For the Ising model, the transfer matrix has strictly positive entries at every positive temperature and hence a unique dominant eigenvector, so boundary conditions are forgotten in the translation-invariant infinite-volume limit.
[example: One-Dimensional Finite-Range Ising Model]
Let $S=\{-1,1\}$, with coupling $J\in\mathbb R$ and external field $h\in\mathbb R$. For a finite word $\omega_1,\ldots,\omega_n\in S$, the nearest-neighbour energy is
\begin{align*}
H_n(\omega)=-J\sum_{i=1}^{n-1}\omega_i\omega_{i+1}-h\sum_{i=1}^{n}\omega_i.
\end{align*}
Hence its Boltzmann weight is
\begin{align*}
\exp(-H_n(\omega))=\exp\left(J\sum_{i=1}^{n-1}\omega_i\omega_{i+1}+h\sum_{i=1}^{n}\omega_i\right).
\end{align*}
Define the transfer matrix by
\begin{align*}
M_{ab}:=\exp(Jab+hb),\qquad a,b\in\{-1,1\}.
\end{align*}
For a word $\omega_1,\ldots,\omega_n$, the product of transition weights is
\begin{align*}
\prod_{i=1}^{n-1}M_{\omega_i,\omega_{i+1}}=\prod_{i=1}^{n-1}\exp(J\omega_i\omega_{i+1}+h\omega_{i+1}).
\end{align*}
Using $\prod_i e^{u_i}=e^{\sum_i u_i}$ gives
\begin{align*}
\prod_{i=1}^{n-1}M_{\omega_i,\omega_{i+1}}=\exp\left(J\sum_{i=1}^{n-1}\omega_i\omega_{i+1}+h\sum_{i=2}^{n}\omega_i\right).
\end{align*}
Multiplying by the missing first-site field gives exactly the Boltzmann weight:
\begin{align*}
e^{h\omega_1}\prod_{i=1}^{n-1}M_{\omega_i,\omega_{i+1}}=\exp\left(J\sum_{i=1}^{n-1}\omega_i\omega_{i+1}+h\sum_{i=1}^{n}\omega_i\right)=\exp(-H_n(\omega)).
\end{align*}
Thus, with this boundary convention,
\begin{align*}
Z_n=\sum_{\omega_1,\ldots,\omega_n\in\{-1,1\}}e^{h\omega_1}\prod_{i=1}^{n-1}M_{\omega_i,\omega_{i+1}}.
\end{align*}
In the ordered basis $(-1,1)$, the four entries are
\begin{align*}
M_{-1,-1}=e^{J-h},\qquad M_{-1,1}=e^{-J+h},\qquad M_{1,-1}=e^{-J-h},\qquad M_{1,1}=e^{J+h}.
\end{align*}
Every entry is positive because $e^t>0$ for every real $t$. Therefore the transfer matrix is irreducible and aperiodic, so *[One-Dimensional Finite-Range Gibbs States Are Unique](/theorems/6830)* gives a unique translation-invariant Markov Gibbs measure, and the pressure is the logarithm of the Perron-Frobenius eigenvalue of $M$.
The trace is
\begin{align*}
\operatorname{tr}(M)=e^{J-h}+e^{J+h}=e^J(e^{-h}+e^h)=2e^J\cosh h.
\end{align*}
The determinant is
\begin{align*}
\det(M)=e^{J-h}e^{J+h}-e^{-J+h}e^{-J-h}=e^{2J}-e^{-2J}.
\end{align*}
For a $2\times2$ matrix, the eigenvalues solve
\begin{align*}
\lambda^2-\operatorname{tr}(M)\lambda+\det(M)=0.
\end{align*}
Hence the larger eigenvalue is
\begin{align*}
\lambda_{\max}=\frac{\operatorname{tr}(M)+\sqrt{\operatorname{tr}(M)^2-4\det(M)}}{2}.
\end{align*}
Substituting the trace and determinant gives
\begin{align*}
\lambda_{\max}=e^J\cosh h+\sqrt{e^{2J}\cosh^2 h-e^{2J}+e^{-2J}}.
\end{align*}
Since $\cosh^2 h-1=\sinh^2 h$, this becomes
\begin{align*}
\lambda_{\max}=e^J\cosh h+\sqrt{e^{2J}\sinh^2 h+e^{-2J}}.
\end{align*}
Therefore the infinite-volume pressure for this transfer-matrix normalisation is
\begin{align*}
P=\log\left(e^J\cosh h+\sqrt{e^{2J}\sinh^2 h+e^{-2J}}\right).
\end{align*}
The one-dimensional finite-range Ising model is governed by one positive Perron-Frobenius eigenvector, so boundary effects do not create multiple translation-invariant Gibbs states in this setting.
[/example]
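The identities in the Ising example can be checked numerically. The sketch below compares a brute-force evaluation of $Z_n$ with the transfer-matrix formula and tests the closed form for $\lambda_{\max}$ against the characteristic polynomial. The function names and the parameter values $J=0.7$, $h=0.3$ are illustrative assumptions; the matrix entries and the boundary convention $e^{h\omega_1}$ follow the example.

```python
import itertools
import math

SPINS = (-1, 1)

def transfer_matrix(J, h):
    """M[a][b] = exp(J*a*b + h*b), as defined in the example."""
    return {a: {b: math.exp(J * a * b + h * b) for b in SPINS} for a in SPINS}

def brute_force_Z(n, J, h):
    """Sum exp(-H_n) over all 2^n words."""
    total = 0.0
    for w in itertools.product(SPINS, repeat=n):
        energy = -J * sum(w[i] * w[i + 1] for i in range(n - 1)) - h * sum(w)
        total += math.exp(-energy)
    return total

def transfer_Z(n, J, h):
    """Same sum via first-site weights e^{h*a} followed by n-1 matrix steps."""
    M = transfer_matrix(J, h)
    vec = {a: math.exp(h * a) for a in SPINS}
    for _ in range(n - 1):
        vec = {b: sum(vec[a] * M[a][b] for a in SPINS) for b in SPINS}
    return sum(vec.values())

def lambda_max(J, h):
    """Closed form for the Perron-Frobenius eigenvalue from the example."""
    return math.exp(J) * math.cosh(h) + math.sqrt(
        math.exp(2 * J) * math.sinh(h) ** 2 + math.exp(-2 * J)
    )

J, h = 0.7, 0.3
z_brute = brute_force_Z(8, J, h)
z_transfer = transfer_Z(8, J, h)

# lambda_max should be a root of lambda^2 - tr(M)*lambda + det(M).
lam = lambda_max(J, h)
trace = 2 * math.exp(J) * math.cosh(h)
det = math.exp(2 * J) - math.exp(-2 * J)
residual = lam ** 2 - trace * lam + det

# (1/n) log Z_n should approach the pressure log(lambda_max).
pressure = math.log(lam)
growth = math.log(transfer_Z(200, J, h)) / 200
print(z_brute, z_transfer, residual, pressure, growth)
```

The finite-$n$ growth rate differs from the pressure by a term of order $1/n$, which is the boundary contribution that disappears in the limit.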
The DLR viewpoint and the variational viewpoint therefore agree in the one-dimensional finite-range models treated here. The next section asks what changes when uniqueness fails.
## Non-Uniqueness, Phase Transitions, and Long-Range Order
When can a local rule fail to determine a single infinite-volume state? The answer is that boundary conditions may leave a persistent influence even as the finite box grows. This persistence is phase coexistence, and in dynamical language it appears as multiple equilibrium states for the same pressure problem.
The uniqueness theorem above gives a baseline: in strongly mixing one-dimensional finite-range systems, the boundary disappears from the limit. To name the opposite phenomenon, we define phase transition directly as non-uniqueness of infinite-volume Gibbs states. This definition focuses on the object controlled by the DLR equations.
[definition: Phase Transition]
Let $S$ be finite, and let $\Phi$ be an interaction on $S^{\mathbb Z^d}$ for which the DLR specification is defined, such as a finite-range interaction. A phase transition for $\Phi$ occurs when there is more than one DLR Gibbs state for $\Phi$.
[/definition]
Non-uniqueness of states does not by itself describe what can be observed at large separation. Two phases may be distinguished by a boundary choice, but the observable question is whether that choice leaves correlations that survive far away. We therefore need a condition detecting persistent correlations of local observables at arbitrarily large distances.
[definition: Long-Range Order]
Let $\mu$ be a translation-invariant probability measure on $S^{\mathbb Z^d}$ and let $f:S^{\mathbb Z^d}\to\mathbb R$ be a local observable with $\int f\,d\mu=0$. For $i\in\mathbb Z^d$, write $\sigma_i:S^{\mathbb Z^d}\to S^{\mathbb Z^d}$ for translation by $i$. The measure $\mu$ has long-range order for $f$ if there is a sequence $i_k\in\mathbb Z^d$ with $|i_k|\to\infty$ such that
\begin{align*}
\limsup_{k\to\infty}\left|\int f(\omega)f(\sigma_{i_k}\omega)\,d\mu(\omega)\right|>0.
\end{align*}
[/definition]
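For contrast, the one-dimensional nearest-neighbour Ising model at $h=0$ has no long-range order for the observable $f(\omega)=\omega_0$. The sketch below assumes the standard description of its translation-invariant Gibbs state as a stationary two-state Markov chain with transition probabilities $P_{ab}=e^{Jab}/(2\cosh J)$ and uniform stationary law. The function names are hypothetical. Under that assumption the two-point correlation equals $\tanh(J)^k$, which decays to $0$.

```python
import math

SPINS = (-1, 1)

def transition_matrix(J):
    """Assumed Markov description at h = 0: P[a][b] = exp(J*a*b) / (2 cosh J)."""
    lam = 2.0 * math.cosh(J)
    return {a: {b: math.exp(J * a * b) / lam for b in SPINS} for a in SPINS}

def correlation(J, k):
    """E[omega_0 * omega_k] under the stationary chain with uniform marginal."""
    P = transition_matrix(J)
    # Start from the identity and multiply by P a total of k times.
    power = {a: {b: 1.0 if a == b else 0.0 for b in SPINS} for a in SPINS}
    for _ in range(k):
        power = {a: {b: sum(power[a][c] * P[c][b] for c in SPINS)
                     for b in SPINS} for a in SPINS}
    return sum(0.5 * a * power[a][b] * b for a in SPINS for b in SPINS)

J = 1.0
for k in (0, 10, 20, 40):
    print(k, correlation(J, k), math.tanh(J) ** k)
```

The decay is exponential for every finite $J$, so the $\limsup$ in the definition vanishes along any sequence $|i_k|\to\infty$ in this model.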
Long-range order is one way to detect phase structure inside a single measure. The variational formalism gives another diagnostic: if the same potential has several pressure maximisers, the system has several competing thermodynamic descriptions. The next theorem records this implication when the equilibrium and DLR descriptions coincide.
[quotetheorem:6831]
[citeproof:6831]
The hypothesis that the two classes of measures coincide is important, and there are concrete ways it can fail. A general continuous potential on a shift has a variational problem, but it need not define a finite-range interaction or a summable specification, so the phrase "DLR Gibbs state for the same local energy" may have no object attached to it. Conversely, if one keeps only shift-invariant equilibrium states in the correspondence, a model could have additional non-translation-invariant DLR states selected by patterned boundary conditions; those states would witness DLR non-uniqueness without appearing in the restricted variational class. A reducible symbolic system gives another caution: several equilibrium states may come from disconnected components rather than from boundary sensitivity in a physical lattice.
Thus the theorem is a translation principle, not a universal definition of phase transition. It will be used later only in settings where the variational potential and the DLR interaction have already been identified, and where the relevant state classes match. Under that correspondence, variational non-uniqueness supplies DLR non-uniqueness; outside it, pressure degeneracy and Gibbs phase coexistence must be checked separately.
[example: Subshift with Multiple Equilibrium States]
Let
\begin{align*}
Y=\{0,1\}^{\mathbb N},\qquad z=222\cdots,\qquad X=Y\cup\{z\}.
\end{align*}
For the zero potential $\beta=0$, the length-$n$ language of $X$ consists of all $2^n$ words over $\{0,1\}$ and the single word $22\cdots 2$, so
\begin{align*}
Z_n(0)=\sum_{w\in\mathcal L_n(X)}e^0=2^n+1.
\end{align*}
Hence
\begin{align*}
\frac{1}{n}\log Z_n(0)
=\frac{1}{n}\log\left(2^n+1\right)
=\frac{1}{n}\log\left(2^n\left(1+2^{-n}\right)\right)
=\log 2+\frac{1}{n}\log\left(1+2^{-n}\right),
\end{align*}
and the last term tends to $0$. Thus the pressure of the zero potential on $X$ is $\log 2$: the two-symbol component $Y$ contributes pressure $\log 2$, while the fixed-point component $\{z\}$ alone has pressure $0$.
Now define a new locally constant potential $\varphi:X\to\mathbb R$ by
\begin{align*}
\varphi(x)=0\quad\text{for }x\in Y,\qquad \varphi(z)=\log 2.
\end{align*}
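Let $\nu\in\mathcal M_\sigma(X)$, and write $p=\nu(Y)$. Since $Y$ and $\{z\}$ are disjoint closed invariant sets, $\nu=p\,\nu_Y+(1-p)\,\delta_z$ for some invariant probability measure $\nu_Y$ on $Y$ (when $p>0$). Entropy is affine in the measure, the fixed point contributes entropy $0$, and the entropy of any invariant measure on the full $2$-shift $Y$ is at most $\log 2$. Therefore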
\begin{align*}
h_\nu(\sigma)\le p\log 2.
\end{align*}
Also,
\begin{align*}
\int_X\varphi\,d\nu
=\int_Y 0\,d\nu+\int_{\{z\}}\log 2\,d\nu
=(1-p)\log 2.
\end{align*}
So every invariant measure satisfies
\begin{align*}
h_\nu(\sigma)+\int_X\varphi\,d\nu
\le p\log 2+(1-p)\log 2
=\log 2.
\end{align*}
The Bernoulli measure $\mu$ on $Y$ with weights $(1/2,1/2)$ has
\begin{align*}
h_\mu(\sigma)=-\frac12\log\frac12-\frac12\log\frac12=\log 2
\end{align*}
and
\begin{align*}
\int_X\varphi\,d\mu=0,
\end{align*}
so
\begin{align*}
h_\mu(\sigma)+\int_X\varphi\,d\mu=\log 2.
\end{align*}
The point mass $\delta_z$ has
\begin{align*}
h_{\delta_z}(\sigma)=0
\end{align*}
and
\begin{align*}
\int_X\varphi\,d\delta_z=\log 2,
\end{align*}
so
\begin{align*}
h_{\delta_z}(\sigma)+\int_X\varphi\,d\delta_z=\log 2.
\end{align*}
Thus both $\mu$ and $\delta_z$ attain the same pressure value $\log 2$, but they are distinct because $\mu(Y)=1$ while $\delta_z(Y)=0$. This gives multiple equilibrium states for one potential, coming from two invariant components whose free-energy contributions have been made equal.
[/example]
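The arithmetic of this example can be reproduced in a few lines. The sketch below evaluates $\frac1n\log(2^n+1)$ for increasing $n$ and the free energy $h_\nu(\sigma)+\int\varphi\,d\nu$ for the mixtures $\nu_p=p\,\mu+(1-p)\,\delta_z$. It uses affinity of entropy to write $h_{\nu_p}(\sigma)=p\log 2$. The function names are illustrative assumptions.

```python
import math

LOG2 = math.log(2.0)

def normalised_log_count(n):
    """(1/n) * log Z_n(0), where Z_n(0) = 2^n + 1 counts the words of length n."""
    return math.log(2 ** n + 1) / n

def free_energy(p):
    """Entropy plus potential average for nu_p = p*mu + (1-p)*delta_z."""
    entropy = p * LOG2                     # affinity: p*log 2 + (1-p)*0
    potential_average = (1.0 - p) * LOG2   # phi = log 2 only at the point z
    return entropy + potential_average

counts = [normalised_log_count(n) for n in (1, 10, 100)]
energies = [free_energy(p) for p in (0.0, 0.25, 0.5, 1.0)]
print(counts)
print(energies)
```

Every mixture attains the value $\log 2$, so the set of equilibrium states for $\varphi$ is the whole segment between $\mu$ and $\delta_z$, with $\mu$ and $\delta_z$ as its ergodic endpoints.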
This symbolic example has phase coexistence for a structural reason: the space decomposes into components that do not communicate. Physical lattice models can also have coexistence in an irreducible space, where the competing phases arise from boundary-condition sensitivity rather than disconnected dynamics.
[example: Hard-Core Model on a Symbolic Space]
Let $G=(V,E)$ be a graph, and let
\begin{align*}
X=\{\omega\in\{0,1\}^V:\omega_u\omega_v=0\text{ whenever }\{u,v\}\in E\}.
\end{align*}
Thus $X$ is the symbolic space of independent-set configurations: if $\omega_v=1$, then every neighbour of $v$ must have spin $0$.
Fix a finite set $\Lambda\Subset V$ and a boundary condition $\eta\in X$. A filling $\omega_\Lambda\in\{0,1\}^\Lambda$ is compatible with $\eta$ when the combined configuration $\omega_\Lambda\eta_{\Lambda^c}$ belongs to $X$. Its number of occupied sites in $\Lambda$ is
\begin{align*}
N_\Lambda(\omega_\Lambda)=\sum_{v\in\Lambda}\omega_v.
\end{align*}
For activity $\lambda>0$, the finite-volume hard-core partition function with boundary $\eta$ is
\begin{align*}
Z_\Lambda^\eta(\lambda)=\sum_{\tau_\Lambda:\tau_\Lambda\eta_{\Lambda^c}\in X}\lambda^{N_\Lambda(\tau_\Lambda)}.
\end{align*}
The corresponding conditional probability is
\begin{align*}
\gamma_\Lambda^\eta(\omega_\Lambda)=\frac{\lambda^{N_\Lambda(\omega_\Lambda)}}{Z_\Lambda^\eta(\lambda)}
\end{align*}
when $\omega_\Lambda\eta_{\Lambda^c}\in X$, and it is $0$ otherwise.
The role of $\lambda$ is visible by comparing two compatible fillings which differ only at one vertex $v\in\Lambda$. Suppose $\omega_v=0$, suppose changing this spin to $1$ still gives an allowed configuration, and let $\omega'_\Lambda$ be the modified filling. Since the two fillings agree away from $v$,
\begin{align*}
N_\Lambda(\omega'_\Lambda)=1+\sum_{u\in\Lambda,\,u\ne v}\omega_u.
\end{align*}
Also,
\begin{align*}
N_\Lambda(\omega_\Lambda)=\sum_{u\in\Lambda,\,u\ne v}\omega_u.
\end{align*}
Therefore
\begin{align*}
N_\Lambda(\omega'_\Lambda)=N_\Lambda(\omega_\Lambda)+1.
\end{align*}
Their relative weights are
\begin{align*}
\frac{\lambda^{N_\Lambda(\omega'_\Lambda)}}{\lambda^{N_\Lambda(\omega_\Lambda)}}=\frac{\lambda^{N_\Lambda(\omega_\Lambda)+1}}{\lambda^{N_\Lambda(\omega_\Lambda)}}=\lambda.
\end{align*}
So increasing $\lambda$ multiplies the weight by $\lambda$ for each additional occupied site, provided the hard-core constraint is still satisfied.
This makes boundary effects concrete. On a bipartite graph with vertex classes $V_{\mathrm{even}}$ and $V_{\mathrm{odd}}$, the configuration occupying every vertex in $V_{\mathrm{even}}$ and no vertex in $V_{\mathrm{odd}}$ is allowed, because every edge joins opposite classes. The analogous all-odd configuration is also allowed. At large activity, both patterns have high occupation number and hence large weight, but the hard-core constraint prevents them from being locally combined. On regular trees and on high-dimensional lattices this competition can persist in the infinite-volume limit, so different boundary patterns can select different Gibbs states. Thus a purely symbolic exclusion rule, together with the local weight $\lambda^{N(\omega)}$, can produce non-uniqueness.
[/example]
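The hard-core kernel is easy to tabulate on a small graph. The sketch below uses a path with vertices $0,\ldots,4$, the window $\Lambda=\{1,2,3\}$, and two boundary conditions on the end vertices. The graph, the activity $\lambda=3$, and the function names are assumptions chosen for illustration. Incompatible fillings are simply omitted from the returned dictionary, which corresponds to probability $0$.

```python
import itertools

def is_independent(config, edges):
    """True when no edge has both endpoints occupied."""
    return all(not (config[u] and config[v]) for u, v in edges)

def hard_core_kernel(window, eta, edges, lam):
    """Conditional law on the window given the boundary configuration eta.

    eta maps every vertex to 0 or 1; its values inside the window are
    overwritten by each candidate filling.
    """
    weights = {}
    for filling in itertools.product((0, 1), repeat=len(window)):
        config = dict(eta)
        config.update(zip(window, filling))
        if is_independent(config, edges):
            weights[filling] = lam ** sum(filling)
    Z = sum(weights.values())
    return {f: w / Z for f, w in weights.items()}

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
window = [1, 2, 3]
empty_boundary = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
occupied_boundary = {0: 1, 1: 0, 2: 0, 3: 0, 4: 1}
lam = 3.0

free = hard_core_kernel(window, empty_boundary, edges, lam)
blocked = hard_core_kernel(window, occupied_boundary, edges, lam)
print(free)
print(blocked)
```

With the empty boundary there are five compatible fillings and the ratio of the weights of $(0,1,0)$ and $(0,0,0)$ is $\lambda$. With occupied end vertices, sites $1$ and $3$ are forced to be empty, so only two fillings survive and the conditional law changes.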
The hard-core model shows that non-uniqueness can arise from competition between local exclusion and global density. To understand the complementary uniqueness regime, we need a criterion that rules out thermodynamic singularities. A basic obstruction is a maximum of two competing analytic pressure branches: if $P(t)=\max\{a(t),b(t)\}$ and the two branches cross with different slopes, then $P$ has a corner even though each branch is analytic. In these notes, a first-order pressure transition for a parameter family $\beta_t$ means a point at which $t\mapsto P(\sigma,\beta_t)$ is not differentiable. Spectral stability of the transfer operator rules out this competing-branch picture in the Hölder symbolic setting.
[quotetheorem:6832]
[citeproof:6832]
The theorem gives the dynamical reason for the absence of phase transitions in uniformly mixing one-dimensional symbolic models: the transfer operator has a stable leading eigenvalue. The spectral gap hypothesis prevents other spectral values from colliding with the leading eigenvalue, while uniqueness prevents the pressure from being the upper envelope of several equilibrium branches. The theorem does not rule out higher-order transitions not visible as first-order non-analyticities, and it does not apply when the shift is reducible or the potential family leaves the Hölder class. Phase transitions begin when this spectral picture breaks down, when there are competing maximal components, or when infinite-dimensional boundary effects survive the thermodynamic limit.
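The competing-branch obstruction can be seen numerically on the reducible subshift $X=Y\cup\{z\}$ from the earlier example. As an illustrative assumption, take the hypothetical family $\varphi_t$ with $\varphi_t=0$ on $Y$ and $\varphi_t(z)=t$. The variational expression over the mixtures $p\,\mu+(1-p)\,\delta_z$ is then $p\log 2+(1-p)t$, whose supremum is $\max\{\log 2,t\}$. The sketch estimates the one-sided slopes of this pressure at $t=\log 2$.

```python
import math

LOG2 = math.log(2.0)

def pressure(t, grid=1001):
    """Maximum over a grid of p in [0, 1] of p*log 2 + (1 - p)*t.

    The grid contains both endpoints p = 0 and p = 1, where the maximum
    of this linear function of p is attained.
    """
    return max(
        (k / (grid - 1)) * LOG2 + (1 - k / (grid - 1)) * t
        for k in range(grid)
    )

def one_sided_slopes(t0, step=1e-6):
    """Left and right difference quotients of the pressure at t0."""
    left = (pressure(t0) - pressure(t0 - step)) / step
    right = (pressure(t0 + step) - pressure(t0)) / step
    return left, right

left_slope, right_slope = one_sided_slopes(LOG2)
print(left_slope, right_slope)
```

The left slope is approximately $0$ and the right slope approximately $1$, so the pressure has a corner at $t=\log 2$, exactly where the two components have equal free energy. Reducibility places this family outside the hypotheses of the theorem.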
[remark: Boundary Conditions and Symmetry Breaking]
In the ferromagnetic Ising model in dimension $d\ge 2$, plus and minus boundary conditions can lead to different infinite-volume limits at low temperature. The two limiting states are exchanged by the spin-flip symmetry, but neither state is itself symmetric. This is the standard example motivating the phrase symmetry breaking.
[/remark]
The course uses this example as a guide rather than proving the full low-temperature theorem. Its proof requires contour or Peierls arguments beyond the entropy techniques developed here, but it clarifies why DLR non-uniqueness is a stronger phenomenon than large finite-volume fluctuations.
## Thermodynamic Formalism as a Dictionary Between Dynamics and Lattices
The main lesson is that thermodynamic formalism translates statistical mechanics into an optimisation problem over invariant measures. Pressure is the asymptotic logarithm of weighted orbit counts and also the supremum of entropy plus potential average. Equilibrium states are the maximisers of this variational problem, while Gibbs states are measures satisfying finite-volume conditional probability rules. In one-dimensional finite-range symbolic models, transfer matrices give uniqueness and analytic pressure; in higher-dimensional or reducible symbolic settings, multiple Gibbs or equilibrium states encode phase coexistence and long-range order. This also links the chapter back to convex analysis and large-deviation heuristics: pressure behaves like a generating function, and competing supporting measures are the probabilistic shadow of multiple macroscopic phases.
## Beyond and Connected Topics
The course sits between several Androma threads. The information-theoretic foundation is the finite-partition calculus around [Subadditivity of Entropy](/theorems/1634), conditional entropy, and joins of partitions; those results are the algebraic reason orbit-name entropy has stable limiting rates. The invariant-measure side connects to compactness and convexity through the [Krylov-Bogolyubov Theorem](/theorems/3423) and [Convexity of the Invariant Measure Space](/theorems/3451), which explain why variational principles optimise over a compact convex space rather than over individual orbits.
Symbolic dynamics provides the main bridge from abstract measurable systems to concrete models. Cylinder sets are the basic local observables in shifts, while [Bernoulli Shifts Are Strongly Mixing](/theorems/3439), strong mixing, and weak mixing locate Bernoulli systems inside the hierarchy of independence properties. The later thermodynamic chapters should be read as the topological and statistical-mechanical continuation of the same theme: pressure replaces entropy alone, equilibrium states replace invariant measures of maximal entropy, and transfer-operator methods give quantitative control when the symbolic system has enough regularity.
## References
- Androma, [Subadditivity of Entropy](/theorems/1634).
- Androma, [Krylov-Bogolyubov Theorem](/theorems/3423).
- Androma, [Convexity of the Invariant Measure Space](/theorems/3451).
- Androma, [Bernoulli Shifts Are Strongly Mixing](/theorems/3439).
Ergodic Theory II: Entropy and Advanced Topics
Created by admin on 6/12/2026 | Last updated on 6/12/2026