Ergodic theory studies measure-preserving dynamical systems and the long-term statistical behaviour they exhibit. When a transformation acts repeatedly on a probability space while preserving the measure, the central questions become: do time averages of observables converge to a limit, and is that limit determined by the measure rather than by individual orbits? How does the system shuffle the space over time—does it mix, or does it display some residual rigidity? Ergodic theory emerged from Boltzmann's hypothesis in statistical mechanics, which posited that time averages along trajectories should equal ensemble averages computed against the measure, and these questions have since revealed profound connections to harmonic analysis, spectral theory, and information theory. The course is built around four central themes. First, we establish measure-preserving systems and their natural examples: circle rotations, the doubling map, torus automorphisms, and Bernoulli shifts. Second, we prove the foundational convergence theorems—the [von Neumann mean ergodic theorem](/theorems/3448) and the Birkhoff pointwise ergodic theorem—which show that for ergodic systems, time averages converge to spatial averages. Along the way, we introduce recurrence (every typical point returns to its neighbourhood infinitely often), ergodicity itself (the absence of non-trivial invariant subsets), and ergodic decomposition (the decomposition into ergodic building blocks). Third, we study the [mixing hierarchy](/theorems/3436), which measures how quickly systems forget their initial conditions: we progress through weak mixing, strong mixing, and higher-order mixing properties. Finally, we explore spectral methods via the Koopman operator, a unitary operator on the [Hilbert space](/page/Hilbert%20Space) of square-integrable functions that encodes the dynamics. The spectrum of this operator becomes our primary tool for classification. 
These ideas culminate in a beautiful structure theorem: ergodic systems with pure discrete spectrum are precisely rotations on compact abelian groups. However, spectral theory alone does not classify systems beyond the discrete case. Bernoulli shifts, the canonical example of maximal disorder, share the same Lebesgue spectrum yet are not all isomorphic. This limitation points to a deeper invariant, entropy (the rate at which a system generates information), which is the subject of Ergodic Theory II. The present course supplies the language, intuition, and technical foundation that make this further theory accessible. By the end of these ten chapters, you will understand how the long-term behaviour of a deterministic system can be decoded through the measure it preserves, how time and space averages relate, and how both spectral and mixing properties distinguish qualitatively different types of dynamics. You will have tools to prove that a system possesses recurrence, ergodicity, or mixing properties, and you will grasp the fundamental limits of prediction in systems governed by measure preservation.

# 1. Measure-Preserving Systems

Ergodic theory is the study of measure-preserving dynamical systems: it asks how, and in what sense, the long-run statistical behaviour of a system is determined by its invariant measure rather than by the details of individual orbits. This opening chapter builds the foundational vocabulary: we define the central object of the subject (a measure-preserving transformation), examine the most important examples, discuss the natural notions of equivalence between systems, and introduce two constructions (induced maps and natural extensions) that allow us to pass between different but closely related systems.

## Probability Spaces and Measurable Transformations

Before defining the central object, we recall the measure-theoretic backdrop.
A **probability space** $(\Omega, \mathcal{F}, \mu)$ consists of a set $\Omega$, a $\sigma$-algebra $\mathcal{F}$ of subsets of $\Omega$, and a probability measure $\mu: \mathcal{F} \to [0, 1]$ with $\mu(\Omega) = 1$. Throughout the course, we work almost exclusively with **Lebesgue spaces** — standard probability spaces that are, up to isomorphism, either a [countable set](/page/Countable%20Set) with a discrete measure or the unit interval $[0,1]$ equipped with Lebesgue measure (or a combination of the two). Every compact [metric space](/page/Metric%20Space) with a Borel probability measure is a Lebesgue space, which covers all the examples we care about. A map $T: (\Omega_1, \mathcal{F}_1) \to (\Omega_2, \mathcal{F}_2)$ between measurable spaces is **measurable** if $T^{-1}(B) \in \mathcal{F}_1$ for every $B \in \mathcal{F}_2$. Measurability is the minimal regularity needed to make sense of how $T$ interacts with the measure. The key idea underlying everything in this course is that a transformation can **push forward** a measure: if $T: \Omega \to \Omega$ is measurable and $\mu$ is a probability measure on $(\Omega, \mathcal{F})$, the **pushforward measure** $T_* \mu$ is defined by \begin{align*} (T_*\mu)(B) = \mu(T^{-1}(B)), \quad B \in \mathcal{F}. \end{align*} The pushforward $T_* \mu$ is the measure that tracks where $\mu$ goes under $T$: the $T_* \mu$-mass of a set $B$ is the $\mu$-mass of all points whose image lands in $B$. Measure preservation is precisely the condition that the system does not redistribute mass. [definition: Measure-Preserving Transformation] Let $(\Omega, \mathcal{F}, \mu)$ be a probability space. A measurable map $T: \Omega \to \Omega$ is a **measure-preserving transformation** (m.p.t.) if \begin{align*} \mu(T^{-1}(B)) = \mu(B) \quad \text{for all } B \in \mathcal{F}. \end{align*} [/definition] Equivalently, $T_*\mu = \mu$, i.e., $\mu$ is **$T$-invariant**. 
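
The pushforward condition $T_*\mu = \mu$ can be probed numerically. The following is a minimal sketch, assuming Lebesgue measure on $[0,1)$ is approximated by a uniform grid of midpoints; the function names (`pushforward_mass`, `doubling`, `squaring`) and the two test maps are hypothetical choices made for illustration. The doubling map is treated later in this chapter, and $x \mapsto x^2$ serves only as a contrasting map that redistributes mass.

```python
def pushforward_mass(T, lo, hi, n=100_000):
    """Approximate (T_* mu)([lo, hi)) = mu(T^{-1}([lo, hi))) for Lebesgue mu on [0, 1).

    The measure mu is approximated by n equally weighted midpoints of a uniform
    grid, so the result is only accurate up to an error of order 1/n.
    """
    hits = 0
    for k in range(n):
        x = (k + 0.5) / n          # midpoint of the k-th grid cell
        if lo <= T(x) < hi:        # x belongs to T^{-1}([lo, hi))
            hits += 1
    return hits / n


def doubling(x):
    """Doubling map x -> 2x mod 1 (illustrative test case)."""
    return (2.0 * x) % 1.0


def squaring(x):
    """Map x -> x^2 on [0, 1) (illustrative test case that is not measure-preserving)."""
    return x * x


lo, hi = 0.2, 0.5
mass_doubling = pushforward_mass(doubling, lo, hi)   # compare with hi - lo
mass_squaring = pushforward_mass(squaring, lo, hi)   # compare with sqrt(hi) - sqrt(lo)
print(mass_doubling, mass_squaring)
```

Agreement on a single interval is only a sanity check, not a proof: measure preservation requires $\mu(T^{-1}(B)) = \mu(B)$ for every measurable $B$. For the squaring map the mismatch on one interval already shows that Lebesgue measure is not invariant.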
The quadruple $(\Omega, \mathcal{F}, \mu, T)$ is called a **measure-preserving system** (m.p.s.) or, when $\mu$ is a probability measure, a **probability-preserving system**. Note that $T$ need not be bijective; the measure-preservation condition involves only preimages, which are always well-defined for any measurable map.

[remark: Checking Measure Preservation on a Generator] In practice, one seldom verifies $\mu(T^{-1}(B)) = \mu(B)$ for every $B \in \mathcal{F}$. It suffices to check it on a **$\pi$-system** (a collection of sets closed under finite intersections) that generates $\mathcal{F}$, because the class of sets on which two probability measures agree is a $\lambda$-system, and the $\pi$-$\lambda$ theorem upgrades agreement on a $\pi$-system to agreement on the generated $\sigma$-algebra. For example, on $([0,1], \mathcal{B}([0,1]), \mathcal{L}^1)$, it suffices to check $\mu(T^{-1}(a,b)) = b - a$ for all intervals $(a,b) \subset [0,1]$. [/remark]

## The Core Examples

Which measure-preserving systems should one keep in mind as concrete touchstones? Three families pervade the entire course, each illustrating a different facet of the theory: rigid order, chaotic expansion, and hyperbolic mixing. It is worth working through each in detail, because every abstract theorem will be illustrated using one of them.

### Circle Rotations

Let $\mathbb{T} = \mathbb{R}/\mathbb{Z} = [0,1)$ be the **circle**, equipped with the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{T})$ and Lebesgue measure $\mathcal{L}^1$.

[definition: Circle Rotation] For $\alpha \in \mathbb{R}$, the **rotation by $\alpha$** is the map $T_\alpha: \mathbb{T} \to \mathbb{T}$ defined by \begin{align*} T_\alpha(x) = x + \alpha \pmod{1}. \end{align*} [/definition]

The simplicity of the definition belies the richness of behaviour that follows, depending on whether $\alpha$ is rational or irrational. The next example confirms that every rotation is measure-preserving.
[example: Circle Rotations Preserve Lebesgue Measure] We show that $T_\alpha$ preserves $\mathcal{L}^1$. For an interval $(a, b) \subset \mathbb{T}$, \begin{align*} T_\alpha^{-1}(a, b) = \{x \in \mathbb{T} : x + \alpha \pmod{1} \in (a, b)\} = (a - \alpha \pmod{1},\, b - \alpha \pmod{1}). \end{align*} This is an interval of the same length $b - a$ as $(a, b)$, so $\mathcal{L}^1(T_\alpha^{-1}(a,b)) = b - a = \mathcal{L}^1(a,b)$. Since intervals generate $\mathcal{B}(\mathbb{T})$, the $\pi$-$\lambda$ theorem extends this to all Borel sets. Thus $(\mathbb{T}, \mathcal{B}(\mathbb{T}), \mathcal{L}^1, T_\alpha)$ is a measure-preserving system for every $\alpha$. The behaviour of $T_\alpha$ depends sharply on $\alpha$: if $\alpha \in \mathbb{Q}$, every orbit is finite and periodic. If $\alpha \notin \mathbb{Q}$, every orbit $\{x, x + \alpha, x + 2\alpha, \ldots\}$ (mod 1) is dense in $\mathbb{T}$. This distinction between rational and irrational rotations will reappear when we discuss ergodicity in Chapter 3. [illustration:circle-rotation-dense-vs-periodic] [/example]

### The Doubling Map

Our next example is the simplest expanding map of the interval, and it will serve as the prototype for non-invertible, strongly mixing behaviour throughout the course.

[definition: Doubling Map] The **doubling map** is the map $T: [0,1) \to [0,1)$ defined by \begin{align*} T(x) = 2x \pmod{1}. \end{align*} [/definition]

That the doubling map preserves Lebesgue measure is not obvious from the definition, since $T$ is 2-to-1 everywhere. The preimage calculation below makes this precise.

[example: The Doubling Map Preserves Lebesgue Measure] For an interval $(a, b) \subset [0,1)$, the preimage under $T$ is \begin{align*} T^{-1}(a, b) = \left(\frac{a}{2}, \frac{b}{2}\right) \cup \left(\frac{a+1}{2}, \frac{b+1}{2}\right).
\end{align*} These two intervals are disjoint and each has length $(b-a)/2$, so \begin{align*} \mathcal{L}^1(T^{-1}(a,b)) = \frac{b-a}{2} + \frac{b-a}{2} = b - a = \mathcal{L}^1(a,b). \end{align*} By the $\pi$-$\lambda$ theorem, $T$ preserves $\mathcal{L}^1$. Unlike circle rotations, the doubling map is **not** invertible: both $x/2$ and $(x+1)/2$ map to $x$, so $T$ is 2-to-1 everywhere. The non-invertibility here is essential: it reflects the fact that $T$ is an **expanding map** and plays a key role in the mixing properties we will establish later. Indeed, the doubling map is the prototypical example of a **strongly mixing** system. [illustration:doubling-map-two-to-one] [/example]

### Torus Automorphisms

Moving to two dimensions gives a richer family of examples that combines algebraic structure with genuinely hyperbolic dynamics. Let $\mathbb{T}^2 = \mathbb{R}^2 / \mathbb{Z}^2$ be the **2-torus** equipped with Lebesgue measure.

[definition: Torus Automorphism] A **torus automorphism** is a map $T_A: \mathbb{T}^2 \to \mathbb{T}^2$ induced by a matrix $A \in \operatorname{GL}(2, \mathbb{Z})$ with $|\det A| = 1$, given by \begin{align*} T_A(x) = Ax \pmod{\mathbb{Z}^2}. \end{align*} [/definition]

The condition $A \in \operatorname{GL}(2, \mathbb{Z})$ (integer entries, determinant $\pm 1$) ensures that $T_A$ is a well-defined bijection on $\mathbb{T}^2$ with measurable inverse $T_{A^{-1}}$ (which also has integer entries since $A^{-1} = (\det A)^{-1} \operatorname{adj}(A)$ and $\det A = \pm 1$).

[example: The Arnold Cat Map] The most famous torus automorphism is the **Arnold cat map**, given by \begin{align*} A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}. \end{align*} We verify $A \in \operatorname{GL}(2, \mathbb{Z})$: all entries are integers, and $\det A = 2 \cdot 1 - 1 \cdot 1 = 1$.
The map $T_A$ preserves Lebesgue measure on $\mathbb{T}^2$: for any Borel set $B \subset \mathbb{T}^2$, the [linear map](/page/Linear%20Map) $x \mapsto Ax$ on $\mathbb{R}^2$ has Jacobian determinant $|\det A| = 1$, so it preserves Lebesgue measure on $\mathbb{R}^2$, and this descends to the quotient $\mathbb{T}^2$. The eigenvalues of $A$ are $\lambda_\pm = (3 \pm \sqrt{5})/2$, with $\lambda_+ > 1 > \lambda_- > 0$. Since neither eigenvalue has modulus $1$, the cat map is **hyperbolic**, meaning it has a contracting direction and an expanding direction. Hyperbolic torus automorphisms are strongly mixing; in fact, they are Bernoulli systems (the strongest form of mixing), a fact we will revisit in Chapter 9. [illustration:arnold-cat-map-shearing] [/example]

The general principle illustrated here: an integer matrix $A$ with $|\det A| = 1$ induces an invertible map of $\mathbb{R}^n / \mathbb{Z}^n$ that preserves Lebesgue measure, by the change-of-variables formula. The determinant condition is what gives invertibility; it is not necessary for measure preservation alone, since the doubling map corresponds to the $1 \times 1$ integer matrix $(2)$ and preserves Lebesgue measure without being invertible.

## Invertible and Non-Invertible Systems

Does it matter whether a measure-preserving transformation is bijective? The examples above split naturally into two classes (circle rotations and torus automorphisms are bijections, while the doubling map is not), and this distinction turns out to have deep consequences for the structure theory of m.p.s.

[definition: Invertible Measure-Preserving System] A measure-preserving system $(\Omega, \mathcal{F}, \mu, T)$ is **invertible** if $T$ is a bijection and $T^{-1}$ is also measurable (hence also measure-preserving). [/definition]

For invertible systems, one can iterate $T$ in both directions, forming a two-sided orbit $\{\ldots, T^{-2}x, T^{-1}x, x, Tx, T^2 x, \ldots\}$. For non-invertible systems, only forward orbits are available. The doubling map is the canonical non-invertible example, and it highlights that interesting dynamics, including strong mixing, can occur in the non-invertible setting.
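
The contrast between the invertible cat map and the non-invertible doubling map can be checked with exact rational arithmetic. The following is a small sketch under stated assumptions: the helper names (`det2`, `inverse_unimodular`, `torus_map`, `doubling_preimages`) and the sample points are hypothetical and chosen only for illustration; the matrix $A$ is the one from the Arnold cat map example.

```python
from fractions import Fraction
import math

A = ((2, 1), (1, 1))  # Arnold cat map matrix from the example above


def det2(m):
    """Determinant of a 2x2 matrix given as nested tuples."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse_unimodular(m):
    """Inverse of an integer 2x2 matrix with determinant +1 or -1."""
    d = det2(m)
    if d not in (1, -1):
        raise ValueError("matrix is not in GL(2, Z)")
    # adj(m) / d; since d is +1 or -1, dividing by d equals multiplying by d
    return ((d * m[1][1], -d * m[0][1]), (-d * m[1][0], d * m[0][0]))


def torus_map(m, p):
    """Apply x -> m x (mod Z^2) to a point p of the 2-torus."""
    x, y = p
    return ((m[0][0] * x + m[0][1] * y) % 1, (m[1][0] * x + m[1][1] * y) % 1)


def doubling_preimages(y):
    """The two preimages of y under the doubling map x -> 2x mod 1."""
    return (y / 2, (y + 1) / 2)


# Invertibility of the cat map: T_{A^{-1}} undoes T_A exactly.
A_inv = inverse_unimodular(A)
p = (Fraction(1, 3), Fraction(2, 5))
round_trip = torus_map(A_inv, torus_map(A, p))

# Eigenvalues from the characteristic polynomial t^2 - trace * t + det.
trace = A[0][0] + A[1][1]
disc = trace * trace - 4 * det2(A)
lam_plus = (trace + math.sqrt(disc)) / 2
lam_minus = (trace - math.sqrt(disc)) / 2

# Non-invertibility of the doubling map: two distinct points share one image.
y = Fraction(3, 7)
images = [(2 * x) % 1 for x in doubling_preimages(y)]
print(A_inv, round_trip == p, lam_plus, lam_minus, images)
```

Exact `Fraction` arithmetic is used so that the round trip through $T_A$ and $T_{A^{-1}}$ can be compared with equality rather than a floating-point tolerance.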
[remark: Almost Everywhere Invertibility] In measure theory, the strict bijectivity in the definition above is often too strong. It is standard to say $T$ is invertible if there exists a set $N \in \mathcal{F}$ with $\mu(N) = 0$ such that $T$ restricted to $\Omega \setminus N$ is a bijection onto a set of full measure, and $T^{-1}$ is measurable. This is the correct notion for Lebesgue spaces, where one always works modulo null sets. [/remark]

## Isomorphism of Measure-Preserving Systems

When are two measure-preserving systems "the same"? The natural notion of equivalence in this category is isomorphism, which asks for a bijection that is simultaneously measure-preserving and intertwines the dynamics.

[definition: Isomorphism of Measure-Preserving Systems] Two measure-preserving systems $(\Omega_1, \mathcal{F}_1, \mu_1, T_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2, T_2)$ are **isomorphic** if there exists a measurable bijection $\phi: \Omega_1 \to \Omega_2$ (defined on sets of full measure) such that:

1. $\phi_* \mu_1 = \mu_2$ (i.e., $\mu_2(\phi(A)) = \mu_1(A)$ for all $A \in \mathcal{F}_1$), and
2. $\phi \circ T_1 = T_2 \circ \phi$ a.e. ($\mu_1$-almost everywhere).

The map $\phi$ is called an **isomorphism** or a **metric isomorphism**. [/definition]

Condition 2 says that $\phi$ intertwines $T_1$ and $T_2$: applying $T_1$ first and then $\phi$ gives the same result as applying $\phi$ first and then $T_2$. Isomorphic systems share all ergodic-theoretic properties (ergodicity, mixing, entropy), since these are defined purely in terms of the measure and the dynamics.

The most important example of isomorphism in this chapter identifies the doubling map with a well-studied combinatorial system.
[example: Isomorphism of Doubling Map and One-Sided Bernoulli Shift] The doubling map $T(x) = 2x \pmod{1}$ on $([0,1), \mathcal{B}, \mathcal{L}^1)$ is isomorphic to the **one-sided Bernoulli shift** $\sigma$ on $(\{0,1\}^{\mathbb{N}}, \mathcal{F}_{\text{product}}, \mu_{1/2}^{\otimes \mathbb{N}})$, where $\sigma(x_1, x_2, x_3, \ldots) = (x_2, x_3, \ldots)$ and $\mu_{1/2}$ is the uniform measure on $\{0,1\}$. The isomorphism is the binary expansion map: write $x = \sum_{n=1}^\infty x_n 2^{-n}$ with each $x_n \in \{0,1\}$, and set $\phi(x) = (x_1, x_2, \ldots)$. Then $\phi \circ T(x) = \phi(2x \pmod 1) = (x_2, x_3, \ldots) = \sigma \circ \phi(x)$, so condition 2 holds. The map $\phi$ is injective off the dyadic rationals (a null set) and maps $\mathcal{L}^1$ to $\mu_{1/2}^{\otimes \mathbb{N}}$ by the standard fact that binary digits of a uniform $[0,1)$ random variable are i.i.d. $\operatorname{Ber}(1/2)$. This isomorphism will become important in Chapter 9, where we study Bernoulli shifts as the paradigmatic strongly-mixing systems. [/example]

## Existence of Invariant Measures

A natural question is: given a measurable map $T: \Omega \to \Omega$, does there exist any $T$-invariant probability measure? In general, no; but for continuous maps on compact metric spaces, the answer is always yes.

[quotetheorem:3423]

The proof uses functional analysis and does not require $T$ to have any special structure beyond continuity. The argument proceeds by taking any probability measure $\nu$ (e.g., a Dirac mass $\delta_x$) and considering the Cesàro averages of its pushforwards: \begin{align*} \mu_N = \frac{1}{N} \sum_{n=0}^{N-1} T^n_* \nu. \end{align*} By the Banach–Alaoglu theorem (combined with the Riesz representation theorem, which identifies measures with functionals on continuous functions), the space of Borel probability measures on a compact [metric space](/page/Metric%20Space) is compact in the [weak* topology](/page/Weak*%20Topology).
Any limit point $\mu$ of the sequence $(\mu_N)$ is $T$-invariant, which follows because $T_*\mu_N - \mu_N = (T^N_*\nu - \nu)/N \to 0$ weak* as $N \to \infty$. This theorem establishes existence but says nothing about uniqueness: there may be many invariant measures. Understanding the structure of the set of all invariant measures, in particular its extreme points, is the content of the ergodic decomposition, treated in Chapter 6.

[remark: The Theorem Does Not Apply to Non-Compact Spaces] Compactness is essential. The translation $T(x) = x + 1$ on $\mathbb{R}$ is continuous but has no $T$-invariant probability measure: invariance would force $\mu([n, n+1)) = \mu([0, 1))$ for every $n \in \mathbb{Z}$, and countably many disjoint sets of equal measure cannot have total measure $1$. Similarly, even on $[0,1)$, a measurable $T$ with no regularity may fail to have an invariant probability measure. [/remark]

## Induced Maps and Natural Extensions

Two standard constructions allow one to modify a given m.p.s. to obtain a new one, while preserving many of its dynamical properties.

### Induced Maps

Given a set $A \in \mathcal{F}$ with $\mu(A) > 0$, the **first return map** to $A$ records what happens to orbits the first time they return to $A$.

[definition: Induced Map] Let $(\Omega, \mathcal{F}, \mu, T)$ be a measure-preserving system and $A \in \mathcal{F}$ with $\mu(A) > 0$. The **first return time** to $A$ is \begin{align*} n_A(x) = \min\{n \ge 1 : T^n(x) \in A\}, \quad x \in A. \end{align*} By Poincaré's Recurrence Theorem (Chapter 2), $n_A(x) < \infty$ for $\mu$-a.e. $x \in A$. The **induced map** (or **first return map**) is $T_A: A \to A$ defined by \begin{align*} T_A(x) = T^{n_A(x)}(x). \end{align*} [/definition]

The induced map $T_A$ is itself measure-preserving for the **normalized restriction** $\mu_A = \mu(\cdot \mid A) = \mu(\cdot \cap A) / \mu(A)$.

[example: Inducing the Doubling Map on an Interval] Consider the doubling map $T(x) = 2x \pmod 1$ on $([0,1), \mathcal{B}, \mathcal{L}^1)$, and let $A = [0, 1/3)$.
For $x \in [0, 1/6)$, we have $T(x) = 2x \in [0, 1/3)$, so $n_A(x) = 1$ and $T_A(x) = 2x$. For $x \in [1/6, 1/3)$, we have $T(x) = 2x \in [1/3, 2/3)$, so one step is insufficient; $T^2(x) = 4x \pmod 1$. For $x \in [1/6, 1/4)$, $T^2(x) = 4x \in [2/3, 1)$, requiring a third step; for $x \in [1/4, 1/3)$, $T^2(x) = 4x - 1 \in [0, 1/3)$, so $n_A(x) = 2$ and $T_A(x) = 4x - 1$ on this sub-interval. The induced map $T_A$ turns out to be an expanding Markov map on $[0, 1/3)$, which can be analysed explicitly. [/example]

A key quantitative result about induced maps concerns how long orbits take to return. We state it here and defer the proof to Chapter 2.

[quotetheorem:3424]

The proof uses Poincaré's Recurrence Theorem and is given in Chapter 2. [Kac's Lemma](/theorems/3424) has a striking interpretation: rare sets (small $\mu(A)$) are visited infrequently, and the mean waiting time is exactly the reciprocal of the measure.

### Natural Extensions

The natural extension addresses the opposite problem: given a non-invertible m.p.s., can one embed it in an invertible one? The answer is yes, and the construction is canonical.

[definition: Natural Extension] Let $(\Omega, \mathcal{F}, \mu, T)$ be a measure-preserving system. The **natural extension** of $(\Omega, \mathcal{F}, \mu, T)$ is a measure-preserving system $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{\mu}, \tilde{T})$ together with a measurable surjection $\pi: \tilde{\Omega} \to \Omega$ satisfying:

1. $\pi \circ \tilde{T} = T \circ \pi$ a.e.,
2. $\pi_* \tilde{\mu} = \mu$,
3. $\tilde{T}$ is invertible,
4. $\tilde{\mathcal{F}} = \bigvee_{n=0}^\infty \tilde{T}^{n}(\pi^{-1}(\mathcal{F}))$ (the $\sigma$-algebra generated by all past information). Forward images are needed here: since $\pi \circ \tilde{T} = T \circ \pi$, the preimages $\tilde{T}^{-n}(\pi^{-1}(\mathcal{F}))$ are all contained in $\pi^{-1}(\mathcal{F})$ and would generate nothing new.

The natural extension is unique up to isomorphism. [/definition]

The explicit construction of $\tilde{\Omega}$ is as an inverse limit.
Define \begin{align*} \tilde{\Omega} = \{(x_0, x_1, x_2, \ldots) \in \Omega^{\mathbb{N}_0} : T(x_{n+1}) = x_n \text{ for all } n \ge 0\}, \end{align*} i.e., the set of **consistent histories** under $T$. The map $\tilde{T}$ applies $T$ to the current state and pushes the history back one place: $\tilde{T}(x_0, x_1, x_2, \ldots) = (T(x_0), x_0, x_1, \ldots)$. It is invertible, and its inverse is the left shift $\tilde{T}^{-1}(x_0, x_1, x_2, \ldots) = (x_1, x_2, x_3, \ldots)$. The projection $\pi(x_0, x_1, x_2, \ldots) = x_0$ is the factor map. The measure $\tilde{\mu}$ is the unique $\tilde{T}$-invariant measure projecting to $\mu$ under $\pi$.

[example: Natural Extension of the Doubling Map] For the doubling map $T(x) = 2x \pmod 1$, the natural extension lives on the space of sequences $(x_0, x_1, x_2, \ldots)$ with $2x_{n+1} \equiv x_n \pmod 1$ for all $n$. Each such sequence is a **complete binary expansion history**: $x_n$ is a point whose $n$-th forward iterate is $x_0$. The natural extension is isomorphic (as an invertible m.p.s.) to the **two-sided Bernoulli shift** on $\{0,1\}^{\mathbb{Z}}$, which encodes both the future and the past binary digits. This explains why the doubling map, though non-invertible, "really is" a Bernoulli system. [/example]

The natural extension captures all the information about the past of an orbit. For ergodic-theoretic purposes (entropy, spectral theory, mixing) one can often work with the natural extension and then project results back to the original system. This technique will appear repeatedly in later chapters.

---

# 2. Recurrence

Chapter 1 established the abstract framework of measure-preserving transformations and their basic structural properties. This chapter turns to one of the first deep theorems of ergodic theory: the [Poincaré Recurrence Theorem](/theorems/3425), which asserts that in a finite-measure-preserving system, almost every point of a set of positive measure returns to that set, and does so infinitely often.
We prove the theorem, sharpen it via [Kac's Lemma](/theorems/3442) (which computes the mean return time precisely), examine the induced map on a return set, characterise recurrence through the notion of wandering sets, and compare measure-theoretic recurrence with its topological analogue. Throughout, we emphasise how each result depends critically on the finiteness of the invariant measure, and exhibit what goes wrong when this hypothesis is dropped.

## The Poincaré Recurrence Theorem

The question of whether orbits return motivates the entire subject. In classical mechanics, Poincaré observed in 1890 that a Hamiltonian system with finitely many degrees of freedom, confined to a region of finite phase-space volume and evolving under a flow that preserves that volume (Liouville measure), must return arbitrarily close to its initial state after sufficient time, at least for almost every initial condition. The measure-theoretic formulation strips away the mechanical specifics and reveals the true engine of the phenomenon: the invariant measure is finite.

To see why finiteness is essential before stating the theorem, consider the integer translation $T: \mathbb{Z} \to \mathbb{Z}$ defined by $T(n) = n + 1$, equipped with counting measure $\mu_{\#}$ (so $\mu_{\#}(\{n\}) = 1$ for each $n \in \mathbb{Z}$). This is a measure-preserving transformation: $\mu_{\#}(T^{-1}(A)) = \mu_{\#}(A - 1) = \mu_{\#}(A)$ for every $A \subset \mathbb{Z}$. Yet for any $n \in \mathbb{Z}$, the iterate $T^k(n) = n + k$ is never equal to $n$ for $k \geq 1$. No orbit ever returns. The reason: counting measure on $\mathbb{Z}$ is infinite, and the finite-measure hypothesis is exactly what the proof requires.

[quotetheorem:3425]

[citeproof:3425]

The proof is a pigeonhole argument in measure: infinitely many disjoint preimages $T^{-n}(B)$ of equal positive measure cannot all fit inside a space of finite total measure.
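The counting behind this pigeonhole step can be written out in one line. This is a sketch under the assumption, taken from the sentence above, that $B$ is a set whose preimages $T^{-n}(B)$ are pairwise disjoint; the cutoff $N$ is notation introduced here for illustration. \begin{align*} N \, \mu(B) = \sum_{n=0}^{N-1} \mu(T^{-n}(B)) = \mu\Big(\bigsqcup_{n=0}^{N-1} T^{-n}(B)\Big) \le \mu(X) < \infty \quad \text{for every } N \ge 1. \end{align*} If $\mu(B) > 0$, this fails as soon as $N > \mu(X)/\mu(B)$, so such a $B$ must be a null set.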
Notice what the proof does not provide: no estimate of the return time, no information about which $n$ achieves the return, and no bound at all on how long one might wait. The theorem is a pure existence result.

[remark: Metric Recurrence] On a separable [metric space](/page/Metric%20Space) $(X, d)$ with Borel $\sigma$-algebra, applying the theorem to each member of a countable base of open sets and discarding the resulting countable union of null sets yields: for $\mu$-a.e. $x \in X$ and every $k \geq 1$, the orbit $(T^n x)_{n \geq 1}$ returns within distance $1/k$ of $x$ infinitely often. The orbit of a typical point accumulates at the starting point. [/remark]

The remark highlights recurrence as a topological phenomenon (orbits return near their starting point) but says nothing about when. The following example shows how the character of returns depends sharply on the arithmetic of the rotation angle $\alpha$.

[example: Recurrence of Irrational Rotations] Let $T_\alpha: \mathbb{T} \to \mathbb{T}$ be the rotation $T_\alpha(x) = x + \alpha \pmod{1}$ on the circle, equipped with Lebesgue measure $\lambda$. For any arc $A = [a, b) \subset \mathbb{T}$ with $\lambda(A) > 0$, the Poincaré theorem guarantees that $\lambda$-a.e. $x \in A$ returns to $A$ infinitely often. When $\alpha \notin \mathbb{Q}$, much more is true: every point of $\mathbb{T}$ visits $A$ infinitely often. The orbit $\{n\alpha \pmod{1} : n \geq 0\}$ is dense in $\mathbb{T}$ (a consequence of the irrationality of $\alpha$, proved via Weyl equidistribution in Chapter 7), and density forces every orbit to enter every non-empty open arc infinitely often. The irrational rotation is both measurably and topologically recurrent in the strongest possible sense. By contrast, when $\alpha = p/q$ in lowest terms, every orbit is periodic of period $q$. A point $x$ visits $A$ if and only if one of $x, T_\alpha(x), \ldots, T_\alpha^{q-1}(x)$ lies in $A$.
If none does, $x$ never visits $A$. This is perfectly consistent with Poincaré, which only guarantees returns for a.e. $x \in A$, not visits for every $x \in \mathbb{T}$. [/example]

## Kac's Lemma and Mean Return Times

The Poincaré theorem guarantees that returns happen but is entirely silent on their timing. A natural question follows immediately: what is the mean waiting time before the first return? Intuition suggests an inverse relationship with $\mu(A)$: if $A$ occupies a small fraction of the space, the orbit spends little time there and must wait a long time between visits. [Kac's Lemma](/theorems/3424) confirms this intuition and gives the exact formula.

For a set $A \in \mathcal{B}$ with $\mu(A) > 0$, the first return time function is defined on $A$ by \begin{align*} n_A(x) := \min\{n \ge 1 : T^n(x) \in A\}. \end{align*} By Poincaré recurrence, $n_A(x) < \infty$ for $\mu$-a.e. $x \in A$, so $n_A$ is a well-defined $\mu$-a.e. finite measurable function on $A$.

[quotetheorem:3442]

[citeproof:3442]

[illustration:rokhlin-tower-kac-lemma]

The formula $\int_A n_A \, d\mu = 1$ is striking for its universality. It holds for every measurable $A$ of positive measure, regardless of the geometry of $A$ or the specific dynamics of $T$. The only inputs are ergodicity (which ensures the columns tile $X$) and the normalisation $\mu(X) = 1$. Normalising by $\mu(A)$ gives the mean return time as $1/\mu(A)$: a set occupying $1\%$ of the space ($\mu(A) = 0.01$) has mean return time $100$; a set occupying half the space has mean return time $2$.

There is a natural probabilistic reading. Think of $n_A(x)$ as the first passage time back to $A$ for a trajectory with initial position drawn from $A$ according to $\mu_A$. The formula $\int_A n_A \, d\mu_A = 1/\mu(A)$ is an ergodic-theoretic analogue of the renewal theory identity: in a renewal process with i.i.d. inter-arrival times $\tau$ satisfying $\mathbb{E}[\tau] = 1/\lambda$, the long-run arrival rate is $\lambda$.
[Kac's Lemma](/theorems/3442) is the deterministic, measure-preserving version of this identity. The same formula appears in the theory of [Markov chains](/page/Markov%20Chain): for a positive recurrent state $s$ with stationary probability $\pi(s)$, the mean return time equals $1/\pi(s)$.

[remark: Ergodicity in Kac's Lemma] The ergodicity hypothesis is used to guarantee that the columns $\{B_{k,j}\}$ tile $X$ up to a null set: $\mu$-a.e. orbit visits $A$. Without ergodicity, the orbit of a point in an invariant subset disjoint from $A$ would never visit $A$, the columns would not fill $X$, and the integral $\int_A n_A \, d\mu$ could be strictly less than $1$. The correct non-ergodic formula is $\int_A n_A \, d\mu = \mu(\{x : T^n x \in A \text{ for some } n \geq 0\})$, the measure of the orbit-saturation of $A$. [/remark]

[Kac's Lemma](/theorems/3424) is most vivid when the dynamics are explicit enough that one can compute not just the mean return time but the full distribution of return times. The irrational rotation provides a setting where the structure of returns can be read off directly from Diophantine approximation.

[example: Mean Return Time for Irrational Rotations] Consider the irrational rotation $T_\alpha$ on $(\mathbb{T}, \lambda)$, which is ergodic (proved in Chapter 3). For any arc $A = [0, r) \subset \mathbb{T}$ with $r \in (0,1)$, [Kac's Lemma](/theorems/3442) gives mean return time $1/r$. As $r \to 1$ (the arc fills the whole circle), the mean return time approaches $1$: nearly every step lands back in $A$. As $r \to 0$ (tiny arc), the mean return time diverges, confirming that visits to a microscopic arc are rare. For a finer calculation, take $\alpha = (\sqrt{5}-1)/2$ (the reciprocal of the golden ratio), so that $\alpha + \alpha^2 = 1$, and let $r = \alpha^m$ for some $m \geq 1$. Write $F_n$ for the $n$-th Fibonacci number ($F_1 = F_2 = 1$), so that the convergents of $\alpha$ are the ratios $F_n / F_{n+1}$. The return time cannot be constant on $A$: the rotation is irrational, so no power of $T_\alpha$ maps an arc onto itself. Instead, the continued-fraction structure of $\alpha$ splits $A = [0, \alpha^m)$ into two sub-arcs on which $n_A$ is constant: $n_A = F_{m+2}$ on a sub-arc of length $\alpha^{m+1}$, and $n_A = F_{m+1}$ on a sub-arc of length $\alpha^{m+2}$. (For $m = 1$: points of $[0, \alpha^2)$ return after $2$ steps and points of $[\alpha^2, \alpha)$ return after $1$ step.) One verifies directly: \begin{align*} \int_A n_A \, d\lambda = F_{m+2} \, \alpha^{m+1} + F_{m+1} \, \alpha^{m+2} = 1, \end{align*} confirming [Kac's Lemma](/theorems/3424). The identity reads $2\alpha^2 + \alpha^3 = 1$ for $m = 1$ and follows for all $m$ by induction, using $F_{m+3} = F_{m+2} + F_{m+1}$ and $\alpha^{m+2} + \alpha^{m+3} = \alpha^{m+1}$. Normalising by $\lambda(A) = \alpha^m$ gives mean return time $\alpha^{-m} = 1/r$, as expected. For a general arc length $r$, the three-gap theorem for return times (due to Slater) shows that $n_A$ takes at most three values on $A$, the largest being the sum of the other two; the weighted average $\int_A n_A \, d\lambda$ still equals $1$ by [Kac's Lemma](/theorems/3442). [/example]

## The Induced Map

The Kac proof partitions $A$ by first return time and stacks the resulting pieces into a Rokhlin tower. Once the tower structure is visible, a natural further question arises: does the first-return map $T_A: A \to A$, which sends $x \in A$ to $T^{n_A(x)}(x)$, itself constitute a measure-preserving system on $(A, \mathcal{B} \cap A, \mu_A)$? The answer is yes.

[definition: Induced Map] Let $(X, \mathcal{B}, \mu, T)$ be a measure-preserving system with $\mu(A) > 0$ for some $A \in \mathcal{B}$. The **induced map** (or **first return map**) on $A$ is the transformation $T_A: A \to A$ defined by \begin{align*} T_A(x) := T^{n_A(x)}(x), \end{align*} where $n_A(x) = \min\{n \geq 1 : T^n(x) \in A\}$ is the first return time, defined $\mu$-a.e. on $A$.
[/definition] The definition makes $T_A$ a well-defined map on $A$ (by Poincaré recurrence), but it is not immediately clear that it preserves a natural measure. The following theorem shows that $T_A$ inherits measure-preservation from $T$. [quotetheorem:3426] [citeproof:3426] The induced map construction is remarkably stable. If $T$ is ergodic on $(X, \mu)$, then $T_A$ is ergodic on $(A, \mu_A)$: the induced system inherits the ergodic properties of the ambient one. Moreover, the operation is consistent under iteration: if $B \subset A$ has $\mu(B) > 0$, then $(T_A)_B = T_B$ — inducing $T$ on $B$ gives the same map as first inducing on $A$ and then inducing the induced map on $B$. This bootstrapping makes the induced map an essential tool in many proofs in ergodic theory, most notably in the proof of the [Birkhoff ergodic theorem](/theorems/518) via the [maximal ergodic lemma](/theorems/3432) (Chapter 5). ## Wandering Sets and the Halmos Recurrence Theorem The Poincaré theorem establishes recurrence as a consequence of finite measure. What is the structural property of the dynamics responsible? A clean reformulation uses the notion of a wandering set: a set that drifts away from itself under the dynamics, never returning. [definition: Wandering Set] Let $(X, \mathcal{B}, \mu, T)$ be a measure-preserving system. A measurable set $W \in \mathcal{B}$ is **wandering** if the sets $\{T^{-n}(W)\}_{n \geq 0}$ are pairwise disjoint up to $\mu$-null sets: $\mu(T^{-j}(W) \cap T^{-k}(W)) = 0$ for all $j \neq k \geq 0$. [/definition] Equivalently (using measure-preservation to shift indices), $W$ is wandering if and only if $\mu(T^{-n}(W) \cap W) = 0$ for all $n \geq 1$: forward iterates of $W$ never return to $W$ in positive measure. 
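The index shift can be written out in one line; the following is a sketch under the assumption $0 \leq j < k$, with $n := k - j \geq 1$ (the symbol $n$ is introduced here only for illustration). Since preimages commute with intersections and $T$ preserves $\mu$, \begin{align*} \mu\big(T^{-j}(W) \cap T^{-k}(W)\big) = \mu\big(T^{-j}\big(W \cap T^{-n}(W)\big)\big) = \mu\big(W \cap T^{-n}(W)\big), \end{align*} so the pairwise condition over all $j \neq k$ reduces to the single family of conditions indexed by $n \geq 1$.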
[quotetheorem:3427] [citeproof:3427] The proof is the same pigeonhole argument as in Poincaré recurrence, but the Halmos theorem exposes the underlying mechanism transparently: positive-measure wandering sets cannot exist in a finite-measure system because they would generate infinitely many disjoint copies of themselves inside a finite-[measure space](/page/Measure%20Space). Notice that the Halmos theorem requires no ergodicity, no irreducibility, no metric or topological structure — only the finiteness of $\mu$. This is the minimal condition, and it is also sufficient. [explanation: Equivalence of Poincaré and Halmos] The two theorems are logically equivalent characterisations of the same property. To derive Poincaré from Halmos: if $B = \{x \in A : T^n x \notin A \text{ for all } n \geq 1\}$ is the set of non-returning points, then $B$ is wandering — for $x \in B$, one has $T^n x \notin A \supset B$ for all $n \geq 1$, so $T^{-n}(B) \cap B = \varnothing$ for all $n \geq 1$. The Halmos theorem gives $\mu(B) = 0$. To derive Halmos from Poincaré: if $W$ is a wandering set with $\mu(W) > 0$, applying Poincaré to $A = W$ would give that a.e. $x \in W$ has some $n \geq 1$ with $T^n x \in W$; but $W$ wandering means no such $n$ exists. The contradiction forces $\mu(W) = 0$. The Halmos theorem is therefore not a strengthening of Poincaré recurrence but an equivalent restatement: in finite-measure-preserving systems, no set of positive measure wanders. [/explanation] To see the Halmos theorem at work concretely, it helps to examine a specific system and identify which sets are wandering (all of measure zero) and which are not. [example: Wandering Sets in the Doubling Map] Consider the doubling map $T(x) = 2x \pmod{1}$ on $([0,1), \lambda)$, which preserves Lebesgue measure. The Halmos theorem asserts that every wandering set has $\lambda$-measure zero. To see that positive-measure sets are non-wandering concretely: take $A = [0, 1/4)$ with $\lambda(A) = 1/4$. 
Then $T^{-2}(A) = \{x : 4x \pmod{1} \in [0,1/4)\} = [0, 1/16) \cup [1/4, 5/16) \cup [1/2, 9/16) \cup [3/4, 13/16)$. The intersection $T^{-2}(A) \cap A = [0, 1/16)$ has $\lambda$-measure $1/16 > 0$, confirming that $A$ is not wandering. Any singleton $\{x_0\}$ with $x_0$ not periodic is trivially a wandering set (the preimages $T^{-n}(\{x_0\})$ are pairwise disjoint finite sets, with $2^n$ points each), but each has $\lambda$-measure zero, consistent with the Halmos theorem. No set of positive Lebesgue measure can be wandering in this system. [/example] ## Topological Recurrence How much of the Poincaré phenomenon depends on the measure, and how much is a purely topological fact about compact dynamics? The Poincaré theorem requires an invariant measure of finite total mass — but one might ask whether something like recurrence holds for continuous maps on compact spaces without any reference to a measure at all. [definition: Topologically Recurrent Point] Let $T: X \to X$ be a continuous map on a topological space $X$. A point $x \in X$ is **topologically recurrent** if $x \in \overline{\{T^n(x) : n \geq 1\}}$. [/definition] Unpacking the closure condition: $x$ is topologically recurrent if and only if for every open neighbourhood $U$ of $x$ there exists $n \geq 1$ with $T^n(x) \in U$. In other words, the orbit of $x$ returns arbitrarily close to $x$ — not merely infinitely often in some abstract sense, but in the specific sense prescribed by the topology of $X$. [quotetheorem:3428] The proof uses [Zorn's lemma](/theorems/1226) to find a minimal closed $T$-invariant set; any point in such a minimal set is recurrent. The theorem requires no measure and no invariance hypothesis on a measure — compactness plays the role that finiteness of the measure plays in Poincaré's theorem.
[example: Comparing Measure-Theoretic and Topological Recurrence] **Irrational rotation:** For $T_\alpha$ with $\alpha \notin \mathbb{Q}$, every point of $\mathbb{T}$ is topologically recurrent (its orbit is dense, so it returns near itself). Almost every point of every positive-measure arc returns to that arc by Poincaré; in fact every point does, again because orbits are dense. The two notions agree completely. **Rational rotation:** For $T_{p/q}$ with $\alpha = p/q$ in lowest terms, every point is periodic of period $q$, hence topologically recurrent. Again both notions agree. **The tent map at a boundary point:** Consider $T: [0,1] \to [0,1]$ defined by $T(x) = 1 - |2x - 1|$. The point $x = 1$ satisfies $T(1) = 0$ and $T^n(1) = 0$ for all $n \geq 1$. The orbit of $1$ lands on the fixed point $0$ and does not accumulate near $1$: $x = 1$ is not topologically recurrent. Yet $\{1\}$ is a null set, so Poincaré recurrence does not forbid this behaviour. **The integer shift:** The map $T: \mathbb{Z} \to \mathbb{Z}$, $T(n) = n+1$, has no topologically recurrent points: the orbit of $n$ is $\{n+k : k \geq 1\}$, which does not accumulate at $n$. Since the counting measure on $\mathbb{Z}$ is infinite, Poincaré recurrence does not apply, and indeed no orbit returns to any finite set infinitely often. The failure of topological recurrence and the failure of measure-theoretic recurrence arise from the same cause: every orbit escapes to infinity, and there is no mechanism (neither compactness nor finite measure) to confine it. [/example] The divergence between the two notions is sharpest in infinite-measure systems. On a [compact space](/page/Compact%20Space), the Birkhoff recurrence theorem guarantees the existence of topologically recurrent points without any measure; in the presence of a finite invariant measure, Poincaré recurrence provides measurable recurrence for almost every point. On non-compact spaces with infinite invariant measures, neither mechanism is in force, and recurrence can fail completely.
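The contrast between the rotation and the tent-map boundary point can be probed numerically. The following is a minimal sketch, not a proof: the helper name `first_close_return`, the tolerances, the iteration cap and the choice of the golden rotation number are all assumptions made for illustration. It searches for the first $n \geq 1$ with $T^n(x)$ within a given distance of $x$.

```python
import math

def circle_dist(a, b):
    """Distance between two points of the circle R/Z."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)

def first_close_return(step, x0, eps, dist, max_iter=10_000):
    """Smallest n >= 1 with dist(T^n(x0), x0) < eps, or None if none is found
    within max_iter iterations (hypothetical helper, for illustration only)."""
    x = x0
    for n in range(1, max_iter + 1):
        x = step(x)
        if dist(x, x0) < eps:
            return n
    return None

alpha = (math.sqrt(5) - 1) / 2            # illustrative irrational angle
rotation = lambda x: (x + alpha) % 1.0    # T_alpha on the circle
tent = lambda x: 1 - abs(2 * x - 1)       # tent map on [0, 1]

# Rotation: a return is found for every tolerance, at later and later times.
rot_returns = [first_close_return(rotation, 0.0, eps, circle_dist)
               for eps in (0.1, 0.01, 0.001)]

# Tent map at x = 1: the orbit is 1, 0, 0, 0, ... and never comes back near 1.
tent_return = first_close_return(tent, 1.0, 0.5, lambda a, b: abs(a - b))

print("rotation return times:", rot_returns)
print("tent map return time from x = 1:", tent_return)
```

A finite search can only suggest non-recurrence; for the tent map the conclusion rests on the exact computation $T^n(1) = 0$ given in the example.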
## Non-Recurrence on Infinite-Measure Spaces The finite-measure hypothesis in the Poincaré and Halmos theorems is not merely sufficient but genuinely necessary. On infinite-measure spaces, wandering sets of positive measure are possible, and the failure of recurrence is not the exception but the rule. ### The Integer Shift The integer shift $T(n) = n + 1$ on $(\mathbb{Z}, \mu_{\#})$ was introduced at the outset. Every singleton $\{n\}$ is a wandering set of positive measure ($\mu_{\#}(\{n\}) = 1$): the sets $T^{-k}(\{n\}) = \{n - k\}$ for $k \geq 0$ are pairwise disjoint singletons, each with measure $1$, and their union is the half-line $\{m \in \mathbb{Z} : m \leq n\}$, which has infinite measure. No contradiction arises, and no point ever returns to $\{n\}$ under forward iteration. More generally, let $A \subset \mathbb{Z}$ be non-empty (equivalently, $\mu_{\#}(A) > 0$) and bounded above, and let $n_0 \in A$. For $n > \max(A) - n_0$, the iterate $T^n(n_0) = n_0 + n$ lies outside $A$, so every orbit leaves $A$ for good. A set that is unbounded above, such as $\mathbb{N}$, is revisited, but only because it extends along the direction in which every orbit escapes. The dynamics on $\mathbb{Z}$ are purely dissipative: $\mathbb{Z}$ is the disjoint union of the translates $T^{k}(\{0\})$, $k \in \mathbb{Z}$, of a single wandering set. ### The Lebesgue Shift Let $T: \mathbb{R} \to \mathbb{R}$ be the unit translation $T(x) = x + 1$, which preserves Lebesgue measure $\mathcal{L}^1$ since $\mathcal{L}^1(A - 1) = \mathcal{L}^1(A)$ for every Borel $A$. The interval $A = [0,1)$ satisfies $\mathcal{L}^1(A) = 1 > 0$. For any $x \in A$ and $n \geq 1$, we have $T^n(x) = x + n \geq n \geq 1$, so $T^n(x) \notin [0,1)$. Every point of $A$ fails to return to $A$. The set $A$ is wandering, and $\mathcal{L}^1(\mathbb{R}) = \infty$ allows this without contradiction. [explanation: Why Finiteness Is Indispensable] The Poincaré argument requires: the sets $T^{-n}(B)$ are pairwise disjoint subsets of $X$, each with measure $\mu(B)$, so $\sum_{n=0}^\infty \mu(B) \leq \mu(X)$, forcing $\mu(B) = 0$ since $\mu(X) < \infty$.
On an infinite-[measure space](/page/Measure%20Space), the final step fails: $\sum_{n=0}^\infty \mu(B) \leq \mu(X) = \infty$ imposes no constraint on $\mu(B)$. The sets $T^{-n}(B)$ can each have measure $\mu(B) > 0$, sum to infinite total measure, and be consistently accommodated in $X$. For the integer shift with $B = \{0\}$: the sets $T^{-n}(\{0\}) = \{-n\}$ are pairwise disjoint singletons of measure $1$, summing to total measure $\infty$, equal to $\mu_{\#}(\mathbb{Z})$. The argument collapses at exactly the step using finiteness, and recurrence fails at exactly the systems where finiteness fails. The failure is not coincidental. [/explanation] [remark: Conservative vs Dissipative Systems] In infinite ergodic theory — the study of measure-preserving systems with $\sigma$-finite infinite invariant measure — a fundamental dichotomy takes the place of the Poincaré theorem. A measure-preserving transformation $T$ on an infinite-[measure space](/page/Measure%20Space) $(X, \mu)$ is **conservative** if every wandering set has measure zero (i.e., the Halmos conclusion holds despite infinite measure), and **dissipative** if it admits a wandering set of positive measure. The integer shift is dissipative. By contrast, many systems arising from number theory are conservative. An example is the Farey map $F: [0,1] \to [0,1]$, given by $F(x) = x/(1-x)$ for $x \leq 1/2$ and $F(x) = (1-x)/x$ for $x > 1/2$, which preserves the infinite measure $d\mu = dx/x$. (The Gauss map $x \mapsto \{1/x\}$ itself preserves the finite Gauss measure $dx/((1+x)\ln 2)$, so it falls under the Poincaré theorem.) Such systems exhibit rich ergodic behaviour described by the Hopf ratio ergodic theorem. [/remark] ## Recurrence, Mean Frequency, and the Road to Birkhoff Poincaré recurrence guarantees that returns happen; [Kac's Lemma](/theorems/3424) computes their average waiting time. Together they give a qualitative and quantitative picture of how orbits revisit sets. But there is a much stronger result lurking: in ergodic systems, orbits do not merely return to sets of positive measure, they visit every such set with a frequency exactly proportional to its measure. This is the Birkhoff Pointwise Ergodic Theorem, proved in Chapter 5.
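Both quantities can be observed along a single orbit of an ergodic system. The sketch below is a numerical sanity check rather than a proof; the golden rotation number, the arc $[0, 0.3)$, the starting point and the orbit length are arbitrary choices made for illustration, and `visit_statistics` is a hypothetical helper name. It records the times at which the orbit lies in the arc, then reports the visit frequency and the average gap between consecutive visits.

```python
import math

def visit_statistics(alpha, arc_length, x0, num_steps):
    """Follow x0 under x -> x + alpha (mod 1) for num_steps steps and
    return (visit frequency, mean gap between visits) for the arc [0, arc_length)."""
    visit_times = [n for n in range(num_steps)
                   if (x0 + n * alpha) % 1.0 < arc_length]
    frequency = len(visit_times) / num_steps
    # Average spacing between consecutive visits along this one orbit.
    mean_gap = (visit_times[-1] - visit_times[0]) / (len(visit_times) - 1)
    return frequency, mean_gap

alpha = (math.sqrt(5) - 1) / 2      # illustrative irrational rotation number
arc_length = 0.3                    # measure of the arc A = [0, 0.3)
frequency, mean_gap = visit_statistics(alpha, arc_length, x0=0.1, num_steps=100_000)

print(f"visit frequency: {frequency:.4f}   (measure of A: {arc_length})")
print(f"mean gap:        {mean_gap:.4f}   (Kac prediction 1/measure: {1 / arc_length:.4f})")
```

The frequency should sit close to the measure of the arc and the mean gap close to its reciprocal, which is the relationship between the Kac and Birkhoff statements.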
To see how the three results fit together, take an ergodic system $(X, \mu, T)$ and a set $A$ with $\mu(A) > 0$. The Birkhoff theorem — applied to $f = \mathbf{1}_A$ — asserts that \begin{align*} \frac{1}{N} \#\{0 \le n \le N-1 : T^n(x) \in A\} \longrightarrow \mu(A) \quad \text{for } \mu\text{-a.e. } x. \end{align*} The orbit of a typical point visits $A$ with asymptotic frequency exactly $\mu(A)$. Poincaré guarantees only that visits recur infinitely often, which by itself does not force a positive frequency; Birkhoff specifies the frequency precisely. [Kac's Lemma](/theorems/3442) is consistent with this picture at the level of mean return times. If the orbit visits $A$ with long-run frequency $\mu(A)$, then the number of steps between consecutive visits averages $1/\mu(A)$, which is exactly Kac's formula. The three results thus form a hierarchy: - **Poincaré**: a.e. orbit returns to $A$ infinitely often (existence of returns). - **Kac**: the mean time between returns is $1/\mu(A)$ (first moment of return time). - **Birkhoff**: the fraction of time spent in $A$ converges to $\mu(A)$ (asymptotic frequency). Each refines the preceding, and all three are best understood together. Chapters 3 and 4 develop the ergodicity hypothesis that underlies [Kac's Lemma](/theorems/3424) in full generality; Chapter 5 proves the Birkhoff theorem from which the full quantitative picture emerges. --- # 3. Ergodicity With measure-preserving transformations and recurrence established in the preceding chapters, we now turn to the most fundamental qualitative question in ergodic theory: when is a dynamical system *indecomposable*? A system is ergodic if it cannot be split into two invariant pieces of positive measure that evolve independently. This indecomposability has several equivalent formulations — through invariant sets, invariant functions, and the spectrum of the associated Koopman operator — and the proof of their equivalence is the central theorem of this chapter.
We close with the paradigmatic examples: irrational rotations (ergodic, by a Fourier-series argument related to Weyl's equidistribution theorem), rational rotations (not ergodic), and the doubling map (ergodic with respect to Lebesgue measure). ## Invariant Sets and Invariant Functions The starting point is deciding what *invariance* means in the measure-theoretic setting. There is a strict set-theoretic notion — the preimage $T^{-1}A$ equals $A$ exactly — and a weaker measure-theoretic one, which asks only that $T^{-1}A$ and $A$ agree up to a null set. The distinction matters because in a measure-preserving system we can detect sets only up to sets of measure zero; insisting on exact set-theoretic equality would make the theory brittle. The measure-theoretic notion is the one that yields a clean, robust theory. We work throughout with a measure-preserving transformation $T: (X, \mathcal{B}, \mu) \to (X, \mathcal{B}, \mu)$ on a probability space, as defined in Chapter 1. [definition: Almost Invariant Set] Let $T: (X, \mathcal{B}, \mu) \to (X, \mathcal{B}, \mu)$ be a measure-preserving transformation. A set $A \in \mathcal{B}$ is **$T$-invariant mod $\mu$** (or *almost invariant*) if \begin{align*} \mu(T^{-1}A \triangle A) = 0. \end{align*} [/definition] The collection of all almost-invariant sets is closed under countable operations and forms a $\sigma$-algebra. To see this: if $\mu(T^{-1}A \triangle A) = 0$, then $T^{-1}(A^c) \triangle A^c = T^{-1}A \triangle A$ has measure zero, so complements of almost-invariant sets are almost invariant. For intersections, $T^{-1}(A \cap B) \triangle (A \cap B) \subset (T^{-1}A \triangle A) \cup (T^{-1}B \triangle B)$, so the intersection of two almost-invariant sets is almost invariant. Countable unions follow by the same containment. 
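For the countable-union step, the containment can be written out explicitly; this is a sketch in which $(A_k)_{k \geq 1}$ denotes an arbitrary countable family of almost-invariant sets (notation introduced here for illustration). Since preimages commute with unions, \begin{align*} T^{-1}\Big(\bigcup_{k} A_k\Big) \triangle \bigcup_{k} A_k = \Big(\bigcup_{k} T^{-1}A_k\Big) \triangle \Big(\bigcup_{k} A_k\Big) \subset \bigcup_{k} \big(T^{-1}A_k \triangle A_k\big), \end{align*} and the right-hand side is a countable union of $\mu$-null sets, hence $\mu$-null.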
[definition: Invariant $\sigma$-Algebra] The **invariant $\sigma$-algebra** $\mathcal{I} = \mathcal{I}_T$ is the sub-$\sigma$-algebra of $\mathcal{B}$ consisting of all sets that are $T$-invariant mod $\mu$. [/definition] The invariant $\sigma$-algebra will play a central role in the rest of the course. In Chapter 5, the [Birkhoff Ergodic Theorem](/theorems/518) will identify the limit of time averages as the conditional expectation $\mathbb{E}[f \mid \mathcal{I}]$; ergodicity is then precisely the condition that collapses this conditional expectation to the global space average $\int_X f\, d\mu$. Alongside invariant sets, it is natural to ask which functions are fixed by the dynamics. [definition: $T$-Invariant Function] A measurable function $f: X \to \mathbb{R}$ is **$T$-invariant** if $f \circ T = f$ $\mu$-almost everywhere. [/definition] There is a tight relationship between the two notions: $A$ is almost invariant if and only if its indicator function $\mathbf{1}_A$ is $T$-invariant. More generally, $f$ is $T$-invariant if and only if $f$ is measurable with respect to the invariant $\sigma$-algebra $\mathcal{I}$. [remark: One-Sided Invariance] For a non-invertible $T$, the condition $f \circ T = f$ a.e. is one-sided: the value of $f$ at $x$ is tied to the value at $Tx$, but $T$ may be many-to-one so there is no natural "backwards" condition. For invertible $T$, invariance under $T$ is equivalent to invariance under $T^{-1}$, and the condition is symmetric. [/remark] ## Ergodicity and Three Equivalent Formulations What does it mean for a system to be *decomposable*? Suppose $A \in \mathcal{B}$ is almost invariant with $0 < \mu(A) < 1$. Then $\mu$-almost every orbit starting in $A$ stays in $A$ (since $T^{-1}A$ and $A$ agree up to a null set), and similarly almost every orbit starting in $A^c$ stays in $A^c$. 
The measure $\mu$ can then be written as a nontrivial convex combination \begin{align*} \mu = \mu(A)\,\mu_A + \mu(A^c)\,\mu_{A^c}, \end{align*} where $\mu_A(\cdot) = \mu(\cdot \cap A)/\mu(A)$ and $\mu_{A^c}(\cdot) = \mu(\cdot \cap A^c)/\mu(A^c)$ are each $T$-invariant probability measures. The system has split into two independently evolving subsystems. Ergodicity is the condition that forbids this. (The connection between ergodicity and extreme points of the convex set of invariant measures is developed fully in Chapter 6.) [definition: Ergodic Transformation] A measure-preserving transformation $T: (X, \mathcal{B}, \mu) \to (X, \mathcal{B}, \mu)$ is **ergodic** if every almost-invariant set has measure $0$ or $1$: \begin{align*} A \in \mathcal{B},\quad \mu(T^{-1}A \triangle A) = 0 \implies \mu(A) \in \{0, 1\}. \end{align*} [/definition] Equivalently, the invariant $\sigma$-algebra $\mathcal{I}$ is trivial: it contains only sets of measure $0$ or $1$. The following theorem shows that this set-theoretic condition is equivalent to two further conditions — one about invariant functions, one about the spectrum of the Koopman operator — each often easier to verify in concrete examples. [quotetheorem:3444] [citeproof:3444] The level-set argument in (i) $\implies$ (ii) is the key analytic step. It works for any $L^1$ invariant function, not just $L^2$, so condition (ii) can equivalently be stated with $L^1$ in place of $L^2$. [explanation: Why Simplicity of the Eigenvalue Matters] Condition (iii) requires that $1$ be a *simple* eigenvalue of $U_T$. Since $T$ preserves $\mu$, constant functions are always fixed by $U_T$: $U_T \mathbf{1} = \mathbf{1}$. So $1$ is always an eigenvalue. The question is whether the eigenspace at $1$ is larger than the constants. If there were a non-constant $T$-invariant $f \in L^2$, the level-set argument would produce some $c$ with $\mu(\{f > c\}) \in (0,1)$, giving a non-trivial almost-invariant set. 
Ergodicity forbids this, cutting the eigenspace at $1$ down to the minimum: the one-dimensional subspace spanned by $\mathbf{1}$. Eigenspaces at eigenvalues $\lambda \neq 1$ are not constrained by ergodicity. An irrational rotation, for example, has many eigenfunctions (the characters $e^{2\pi i nx}$ with $n \neq 0$) at eigenvalues $e^{2\pi i n\alpha} \neq 1$. Ergodicity speaks only about the eigenspace at $1$. [/explanation] ## The 2-Set Criterion The definition of ergodicity requires examining all almost-invariant sets, which is impractical for concrete systems. A more dynamical reformulation — the 2-set criterion — replaces the static condition with a requirement about how orbits communicate between regions of positive measure. The underlying intuition is that an ergodic system cannot have isolated regions: any set of positive measure must eventually send orbits into any other set of positive measure. [quotetheorem:3445] [citeproof:3445] The 2-set criterion gives ergodicity a dynamical character: every positive-measure region eventually communicates with every other positive-measure region. This is stronger than Poincaré recurrence, which only asserts that a region communicates with itself. [example: The Identity and Rotation by $1/2$] For the identity map $T = \operatorname{id}$ on $([0,1], \lambda)$: take $A = [0, 1/3)$ and $B = (2/3, 1]$. Then $T^{-n}A = A$ for all $n$, so $\lambda(T^{-n}A \cap B) = 0$ for all $n \geq 1$. The two disjoint sets $A$ and $B$ never communicate, and $T$ is not ergodic. For the rotation $T_{1/2}(x) = x + 1/2 \pmod 1$: take $A = [0, 1/4)$ and $B = [1/4, 1/2)$. Then $T_{1/2}^{-1}A = [1/2, 3/4)$ and $T_{1/2}^{-2}A = A$, so $T_{1/2}^{-n}A$ alternates between $[1/2, 3/4)$ (odd $n$) and $[0, 1/4)$ (even $n$), neither of which meets $B$: $\lambda(T_{1/2}^{-n}A \cap B) = 0$ for all $n \geq 1$. Equivalently, $A \cup T_{1/2}^{-1}A = [0, 1/4) \cup [1/2, 3/4)$ is a $T_{1/2}$-invariant set of measure $1/2$. The map $T_{1/2}$ is not ergodic.
[/example] ## Ergodicity of Irrational Rotations Having seen that rational rotations are not ergodic, we now prove that irrational rotations are. The spectral criterion provides the clean argument. The circle rotation $T_\alpha: \mathbb{R}/\mathbb{Z} \to \mathbb{R}/\mathbb{Z}$, $T_\alpha(x) = x + \alpha \pmod 1$, preserves Lebesgue measure $\lambda$ on $[0,1)$. [quotetheorem:3429] [citeproof:3429] The proof exploits the fact that $U_{T_\alpha}$ is diagonalised by the Fourier basis, and irrationality of $\alpha$ is exactly the condition ensuring no non-zero character has eigenvalue $1$. This ergodicity result is related to — but weaker than — Weyl's Equidistribution Theorem, which gives a quantitative picture of how orbits distribute over $[0,1)$. [quotetheorem:3443] This theorem is quoted without proof here. The key step is Weyl's criterion: equidistribution is equivalent to showing $\frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i kn\alpha} \to 0$ for each $k \neq 0$. Since $e^{2\pi ik\alpha} \neq 1$ when $\alpha \notin \mathbb{Q}$ and $k \neq 0$, the partial sum of the geometric series is bounded by $\frac{2}{|e^{2\pi ik\alpha}-1|}$, which divided by $N$ goes to zero. A full proof appears in Chapter 7 in the broader context of polynomial sequences. Equidistribution says that *every* orbit of $T_\alpha$ is equidistributed in $[0,1)$, whereas ergodicity (via the Birkhoff theorem of Chapter 5) says only that *almost every* orbit is equidistributed. The reason every orbit behaves well is that irrational rotations are *uniquely ergodic* — they support a unique invariant probability measure. Unique ergodicity is developed in Chapter 7. [example: Rational Rotations are Not Ergodic] When $\alpha = p/q \in \mathbb{Q}$ with $\gcd(p,q) = 1$ and $q \geq 2$, every orbit of $T_\alpha$ is periodic with period exactly $q$: for any $x$, the orbit $\{x, x+\alpha, \ldots, x+(q-1)\alpha\} \pmod 1$ returns to $x$ after $q$ steps. 
Consider the set \begin{align*} A = \bigcup_{j=0}^{q-1}\left[\frac{j}{q},\, \frac{j}{q} + \frac{1}{2q}\right), \end{align*} consisting of the left half of each of the $q$ equal sub-intervals that $T_\alpha$ permutes cyclically. Since $T_\alpha$ maps $[j/q, (j+1)/q)$ to $[(j+p)/q \pmod 1, (j+p+1)/q \pmod 1)$, the set $A$ is mapped to itself by $T_\alpha$: $T_\alpha^{-1}A = A$ exactly. So $A$ is a strictly invariant set with $\lambda(A) = 1/2$, witnessing the failure of ergodicity. Alternatively: the character $e_q(x) = e^{2\pi i qx}$ is non-constant and satisfies $U_{T_\alpha} e_q = e^{2\pi i q\alpha} e_q = e^{2\pi ip} e_q = e_q$, so it is a non-trivial element of the eigenspace at $1$, confirming non-ergodicity via the spectral criterion. [/example] ## Ergodicity of the Doubling Map The doubling map $T: [0,1) \to [0,1)$, $T(x) = 2x \pmod 1$, is measure-preserving with respect to Lebesgue measure $\lambda$ (Chapter 1). Unlike rotations, the doubling map is non-invertible and expanding: it stretches distances by a factor of $2$. The key structural difference from rotations appears immediately in how the Koopman operator $U_T$ acts on the Fourier basis. For the rotation $T_\alpha$, each character $e_n$ was an eigenfunction. For the doubling map: \begin{align*} U_T e_n(x) = e_n(Tx) = e^{2\pi i n \cdot 2x} = e_{2n}(x). \end{align*} The Koopman operator maps each character to a different character — it shifts to higher frequency rather than multiplying by a scalar. This is the signature of an expanding map in Fourier space. [quotetheorem:3446] [citeproof:3446] The argument is a decay argument in Fourier space: invariance ties the Fourier coefficient at a nonzero frequency $n$ to those at $2n, 4n, 8n, \ldots$, so the expanding dynamics drive every nonzero Fourier mode to arbitrarily high frequency, where $L^2$ decay kills it. No arithmetic condition on any parameter is needed — ergodicity follows purely from the expanding nature of $T$. It is instructive to compare this mechanism with the one we used for rotations.
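The two actions on the Fourier basis can be compared numerically on a grid. This is a minimal sketch using only the standard library; the grid size, the range of modes and the angle $\sqrt{2} - 1$ are arbitrary illustrative choices, and the function names are hypothetical. It checks that composing with the doubling map sends $e_n$ to the different character $e_{2n}$, whereas composing with a rotation multiplies $e_n$ by the scalar $e^{2\pi i n\alpha}$.

```python
import cmath
import math

def character(n, x):
    """Fourier character e_n(x) = exp(2*pi*i*n*x)."""
    return cmath.exp(2j * math.pi * n * x)

def doubling(x):
    return (2 * x) % 1.0

def rotation(x, alpha):
    return (x + alpha) % 1.0

alpha = math.sqrt(2) - 1                 # an irrational angle, for illustration only
grid = [k / 500 for k in range(500)]     # sample points in [0, 1)
modes = range(-4, 5)

# Doubling map: e_n composed with T should coincide with e_{2n}.
doubling_error = max(abs(character(n, doubling(x)) - character(2 * n, x))
                     for n in modes for x in grid)

# Rotation: e_n composed with T_alpha should equal exp(2*pi*i*n*alpha) * e_n.
rotation_error = max(abs(character(n, rotation(x, alpha))
                         - character(n, alpha) * character(n, x))
                     for n in modes for x in grid)

print("max deviation, doubling map:", doubling_error)
print("max deviation, rotation:    ", rotation_error)
```

Both deviations should be at the level of floating-point rounding, reflecting the frequency shift in one case and the diagonal action in the other.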
[remark: Contrasting Mechanisms of Ergodicity] For the irrational rotation $T_\alpha$, ergodicity is an arithmetic condition: the Koopman operator diagonalises in the Fourier basis, and irrationality ensures no non-zero character resonates at eigenvalue $1$. For the doubling map, ergodicity is an analytic condition: the Koopman operator is not diagonalised in the Fourier basis at all, and the $L^2$ summability of Fourier coefficients forces every invariant function to be constant. In general, expanding maps tend to be ergodic because their Koopman operators drive every non-constant mode to arbitrarily high frequencies, where $L^2$ decay kills it. [/remark] What does the ergodicity of the doubling map actually tell us about specific numbers? The answer connects ergodic theory to the classical problem of digit frequencies. [example: Binary Digit Frequencies and Normal Numbers] The ergodicity of the doubling map has a concrete number-theoretic consequence. Write $x = \sum_{k=1}^\infty d_k(x) 2^{-k}$ where $d_k(x) \in \{0,1\}$ is the $k$-th binary digit. The key identity is $d_k(x) = d_1(T^{k-1}(x))$: applying $T$ shifts the binary expansion one digit to the left. Applying the [Birkhoff Ergodic Theorem](/theorems/518) (Chapter 5) to $f = \mathbf{1}_{[1/2, 1)}$, which equals $d_1$ $\lambda$-a.e., gives: for Lebesgue-almost every $x \in [0,1)$, \begin{align*} \frac{1}{N}\sum_{k=0}^{N-1} d_{k+1}(x) = \frac{1}{N}\sum_{k=0}^{N-1} f(T^k x) \xrightarrow{a.s.} \int_0^1 f\, d\lambda = \frac{1}{2}. \end{align*} For almost every $x$, the asymptotic proportion of $1$s among its binary digits is exactly one half. Such a number is called *simply normal in base $2$*. Ergodicity of the doubling map thus implies that Lebesgue-almost every $x \in [0,1)$ is simply normal in base $2$. Full normality, which requires every finite block of digits to appear with its expected frequency, follows by applying the same argument to the indicator of each dyadic interval and intersecting the resulting countably many full-measure sets. The same argument applied to the map $x \mapsto bx \pmod 1$, which is ergodic for every integer $b \geq 2$ by an identical Fourier argument, establishes normality in base $b$.
[/example] ## The Spectral Perspective and the Road to Chapter 4 Looking across all the examples — the identity, rational rotations, irrational rotations, and the doubling map — each failure of ergodicity manifests as the Koopman operator having a higher-dimensional eigenspace at eigenvalue $1$. For the identity, every $L^2$ function is invariant. For the rational rotation $T_{p/q}$, the character $e_q$ gives an extra invariant direction. Ergodicity is exactly the condition that reduces the eigenspace at $1$ to the minimum: the one-dimensional span of the constant function $\mathbf{1}$. This spectral picture connects directly to Chapter 4. For any measure-preserving $T$, the space $L^2(X, \mu)$ decomposes orthogonally as \begin{align*} L^2 = \ker(U_T - I) \oplus \overline{\operatorname{Range}(U_T - I)}. \end{align*} The [von Neumann Mean Ergodic Theorem](/theorems/3448) (Chapter 4) asserts that the Cesàro averages \begin{align*} \frac{1}{N}\sum_{n=0}^{N-1} U_T^n f \end{align*} converge in $L^2$ to the [orthogonal projection](/theorems/437) of $f$ onto $\ker(U_T - I)$, the space of $T$-invariant functions; this projection coincides with the conditional expectation $\mathbb{E}[f \mid \mathcal{I}]$. When $T$ is ergodic, $\ker(U_T - I)$ is the one-dimensional space of constants, and the projection is simply the space average $\int_X f\, d\mu$. The Mean Ergodic Theorem then becomes the statement that time averages converge to space averages in $L^2$. Chapter 5 refines this further to almost everywhere convergence via the Birkhoff Pointwise Ergodic Theorem. --- # 4. The von Neumann Mean Ergodic Theorem The previous chapter established ergodicity as a property of measure-preserving transformations — the condition that the only invariant sets are trivial. This chapter takes a different approach to the same phenomenon, asking not about the structure of invariant sets but about the long-run behaviour of time averages of $L^2$ functions.
The [von Neumann Mean Ergodic Theorem](/theorems/3448), proved in 1932, is the first and most fundamental convergence theorem in ergodic theory: the Cesàro averages $\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$ converge in $L^2$ norm to the [orthogonal projection](/theorems/437) of $f$ onto the space of invariant functions. The key insight is that the theorem is really a result in [Hilbert space](/page/Hilbert%20Space) theory — the dynamics enter only through a single unitary operator — and this functional-analytic perspective will reappear in every subsequent chapter. ## The Koopman Operator Why not study the transformation $T$ directly? The obstruction is that $T$ acts on points, not functions, and the tools we want to apply — spectral theory, orthogonal projections, norm estimates — require a linear setting. Iterating $T$ on points produces orbits, but to form averages and take limits in a meaningful function space, we need to pass to function composition. Moreover, two different systems may have the same statistical properties yet look completely different at the level of points; what matters is the action on observables. The question is: what is the right linear object associated to $T$? The naive answer — consider the pushforward action on measures — works for studying invariant measures but discards too much information about individual functions. The right answer is to let $T$ act on $L^2$ functions by precomposition. This simple idea, due to Koopman (1931), turns a dynamical problem into a spectral problem on an infinite-dimensional [Hilbert space](/page/Hilbert%20Space). [definition: Koopman Operator] Let $(X, \mathcal{B}, \mu)$ be a probability space and $T: X \to X$ a measure-preserving transformation. The **Koopman operator** associated to $T$ is the map $U_T: L^2(X, \mu) \to L^2(X, \mu)$ defined by \begin{align*} U_T f := f \circ T. 
\end{align*} [/definition] The passage from $T$ to $U_T$ is called the **Koopman representation** or **Koopman linearisation** of the dynamics. This is a profound step: a potentially complicated nonlinear map $T$ on a [measure space](/page/Measure%20Space) is replaced by a linear operator on a [Hilbert space](/page/Hilbert%20Space). The price we pay is that the [Hilbert space](/page/Hilbert%20Space) is infinite-dimensional, but the tools of functional analysis — spectral theory, orthogonal projections, adjoints — become available. [quotetheorem:3430] [citeproof:3430] The unitarity of $U_T$ is what makes [Hilbert space](/page/Hilbert%20Space) geometry directly applicable to ergodic theory. But what happens when $T$ is not invertible? [remark: Non-invertible Case] When $T$ is not invertible — for example, the doubling map $T(x) = 2x \pmod{1}$ — the Koopman operator $U_T$ is an isometry but not unitary. Its adjoint $U_T^*$ (the **transfer operator** or Perron–Frobenius operator) acts on $L^2$ but is not the inverse of $U_T$. For the Mean Ergodic Theorem, the invertibility of $U_T$ is not required; the isometry property alone suffices. [/remark] The key dictionary between dynamics and functional analysis is: - Orbits of $T$ correspond to iterates $U_T^n = U_{T^n}$. - Invariant functions ($f \circ T = f$ a.e.) correspond to fixed points of $U_T$ in $L^2$. - The invariant $\sigma$-algebra $\mathcal{I}$ corresponds to the fixed-point subspace $\ker(U_T - I)$. - Ergodicity of $T$ is equivalent to $\ker(U_T - I)$ consisting only of constants (see Chapter 3). [example: Koopman Operator for Circle Rotation] Let $T_\alpha: [0,1) \to [0,1)$ be the irrational rotation $T_\alpha(x) = x + \alpha \pmod{1}$ with $\alpha \notin \mathbb{Q}$, equipped with Lebesgue measure $\lambda$. The [Hilbert space](/page/Hilbert%20Space) $L^2([0,1), \lambda)$ has an orthonormal basis of characters $e_n(x) = e^{2\pi i n x}$ for $n \in \mathbb{Z}$. 
The Koopman operator acts on these basis elements as \begin{align*} U_{T_\alpha} e_n(x) = e_n(x + \alpha) = e^{2\pi i n \alpha} e_n(x). \end{align*} So each $e_n$ is an eigenvector of $U_{T_\alpha}$ with eigenvalue $e^{2\pi i n \alpha}$. For $\alpha \notin \mathbb{Q}$, the eigenvalue $e^{2\pi i n \alpha} = 1$ holds only when $n = 0$, corresponding to the constant function $e_0 \equiv 1$. This confirms that the fixed-point subspace of $U_{T_\alpha}$ is exactly the space of constants, consistent with the ergodicity of $T_\alpha$. The Cesàro average of $U_{T_\alpha}^n e_k$ for $k \neq 0$ is: \begin{align*} \frac{1}{N}\sum_{n=0}^{N-1} U_{T_\alpha}^n e_k = \frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i nk\alpha} e_k = \frac{1}{N} \cdot \frac{1 - e^{2\pi i Nk\alpha}}{1 - e^{2\pi i k\alpha}} \cdot e_k \to 0 \end{align*} as $N \to \infty$, since $|e^{2\pi i k\alpha} - 1| > 0$ for $k \neq 0$ and $\alpha \notin \mathbb{Q}$. For $k = 0$, the average is identically $e_0 = 1$. This calculation, done directly on the Fourier basis, is the mechanism behind the general Mean Ergodic Theorem. [/example] ## The Mean Ergodic Theorem: Statement and Proof The central tool in the proof is [orthogonal decomposition](/theorems/436) in [Hilbert space](/page/Hilbert%20Space). The fixed-point subspace of $U_T$ plays the role of the "limit direction," and the key lemma identifies precisely the orthogonal complement. [definition: Fixed-Point Subspace and Coboundaries] Let $U: H \to H$ be an isometry on a [Hilbert space](/page/Hilbert%20Space) $H$. Define: - The **fixed-point subspace**: $\mathcal{H}_1 := \ker(U - I) = \{f \in H : Uf = f\}$. - The **coboundary subspace**: $\mathcal{H}_0 := \overline{\operatorname{Range}(U - I)} = \overline{\{Ug - g : g \in H\}}$. [/definition] With these two subspaces in hand, we can state the key structural lemma: they are orthogonal complements and together span all of $H$. 
The result is perhaps surprising — the closure of coboundaries could a priori be smaller than $\mathcal{H}_1^\perp$ — but the isometry property of $U$ forces an exact fit. [quotetheorem:3447] [citeproof:3447] The decomposition is the engine of the Mean Ergodic Theorem. Before turning to the main proof, it is worth pausing on what each summand contributes dynamically. [remark: Interpretation] The decomposition $H = \mathcal{H}_1 \oplus \mathcal{H}_0$ has a direct dynamical meaning. Functions $f \in \mathcal{H}_1$ are invariant under $T$: their time averages are trivially constant. Functions of the form $Ug - g$ (coboundaries) have zero mean and exhibit strong cancellation in Cesàro averages: $\frac{1}{N}\sum_{n=0}^{N-1}U^n(Ug - g) = \frac{1}{N}(U^N g - g) \to 0$ in norm because $\|U^N g - g\| \le 2\|g\|$ is bounded. The density of coboundaries in $\mathcal{H}_0$ propagates this convergence to the entire complement. [/remark] We are now ready to prove the main theorem. [quotetheorem:3448] [citeproof:3448] The proof is strikingly clean: once the [orthogonal decomposition](/theorems/436) is established, the argument is elementary. The only use of the isometry property is to bound $\|A_N\|_{\mathcal{L}(H)} \le 1$; everything else is linear algebra and the definition of closure. Notice what the theorem does and does not say. It asserts $L^2$ norm convergence: the Cesàro averages converge to $Pf$ as vectors in $L^2$. It says nothing about pointwise convergence at a specific $x \in X$. Pointwise convergence — the stronger statement that $\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) \to Pf(x)$ for $\mu$-almost every $x$ — is the content of the Birkhoff Pointwise Ergodic Theorem in Chapter 5, which requires a completely different argument. The theorem also does not require ergodicity. 
For a non-ergodic system, $\mathcal{H}_1$ strictly contains the constants, and the limit $Pf$ is in general a non-constant function: different orbits can have different limiting averages, depending on which invariant component of $X$ they belong to. Ergodicity is exactly the condition that forces $\mathcal{H}_1 = \{\text{constants}\}$, collapsing the limit to the single number $\int_X f\, d\mu$. To see the theorem in action, we return to the irrational rotation and trace the convergence of each Fourier mode explicitly. [example: Mean Ergodic Theorem for Irrational Rotations] Return to $T_\alpha$ on $([0,1), \lambda)$ with $\alpha \notin \mathbb{Q}$. Take any $f \in L^2([0,1), \lambda)$ and expand in the Fourier basis: $f = \sum_{k \in \mathbb{Z}} \hat{f}(k) e_k$ where $\hat{f}(k) = \int_0^1 f(x) e^{-2\pi i k x}\, d\lambda(x)$. Since $T_\alpha$ is ergodic, the invariant functions are precisely the constants, so $Pf = \hat{f}(0) = \int_0^1 f\, d\lambda$. The Mean Ergodic Theorem states that the time averages converge in $L^2$ to the space average: \begin{align*} \frac{1}{N}\sum_{n=0}^{N-1} f(x + n\alpha) \xrightarrow{L^2} \int_0^1 f\, d\lambda \quad \text{as } N \to \infty. \end{align*} To see this on the Fourier side: for $k \neq 0$, \begin{align*} \frac{1}{N}\sum_{n=0}^{N-1} U_{T_\alpha}^n e_k = \frac{1}{N} \cdot \frac{1 - e^{2\pi i Nk\alpha}}{1 - e^{2\pi i k\alpha}} e_k, \end{align*} and this has $L^2$ norm $\frac{1}{N}\left|\frac{1 - e^{2\pi i Nk\alpha}}{1 - e^{2\pi i k\alpha}}\right| \le \frac{2}{N|1 - e^{2\pi i k\alpha}|} \to 0$. For a general $f$, truncate the [Fourier series](/page/Fourier%20Series): for any $\varepsilon > 0$, take $K$ large enough that $\sum_{|k| > K}|\hat{f}(k)|^2 < \varepsilon^2$. Because the Cesàro averaging operators have norm at most $1$, the tail $\sum_{|k| > K} \hat{f}(k) e_k$ contributes at most $\varepsilon$ in $L^2$ norm for every $N$, while each of the finitely many modes with $0 < |k| \le K$ tends to $0$ as $N \to \infty$ by the bound above.
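The decay of a single Fourier mode can also be checked numerically. The following is a minimal sketch, not part of the course material: the rotation number `ALPHA` and the function names are illustrative assumptions. Since $\|e_k\|_{L^2} = 1$, the $L^2$ norm of the Cesàro average of $U_{T_\alpha}^n e_k$ is the absolute value of the scalar computed here.

```python
import cmath
import math

ALPHA = math.sqrt(2) - 1  # an irrational rotation number (illustrative choice)

def cesaro_coefficient(k, N, alpha=ALPHA):
    """Scalar c with (1/N) * sum_{n<N} U^n e_k = c * e_k, computed term by term."""
    return sum(cmath.exp(2j * math.pi * n * k * alpha) for n in range(N)) / N

def closed_form(k, N, alpha=ALPHA):
    """Same scalar via the geometric-sum formula (valid for k != 0)."""
    r = cmath.exp(2j * math.pi * k * alpha)
    return (1 - r**N) / (N * (1 - r))

def norm_bound(k, N, alpha=ALPHA):
    """Upper bound 2 / (N * |1 - exp(2 pi i k alpha)|) on the L^2 norm."""
    return 2 / (N * abs(1 - cmath.exp(2j * math.pi * k * alpha)))

for N in (10, 100, 1000):
    print(N, abs(cesaro_coefficient(1, N)), norm_bound(1, N))
```

The printed norms shrink at rate $1/N$ and stay below the bound, while for $k = 0$ the coefficient is identically $1$.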
[/example] ## The Limit as Conditional Expectation The Mean Ergodic Theorem identifies the $L^2$ limit as the [orthogonal projection](/theorems/437) $Pf$ onto the invariant subspace $\mathcal{H}_1$. For an ergodic system this is immediate — $\mathcal{H}_1$ consists only of constants, so $Pf = \int_X f\, d\mu$ — but for a general measure-preserving system $\mathcal{H}_1$ can be rich. The question is whether there is a natural probabilistic object that equals $Pf$ and makes the dependence of the limit on $f$ transparent. The answer is yes: $Pf$ is a conditional expectation. [quotetheorem:3449] [citeproof:3449] This identification ties ergodic theory to the classical theory of conditional expectations. The invariant $\sigma$-algebra $\mathcal{I}$ encodes all the constraints the dynamics impose on long-run behaviour: an event $A$ is in $\mathcal{I}$ precisely when knowing whether $T^n x \in A$ for all future $n$ tells you exactly whether $x \in A$. Conditioning on $\mathcal{I}$ extracts the part of $f$ that is determined by this invariant information. [explanation: What Does This Mean Dynamically?] The identification $Pf = \mathbb{E}[f \mid \mathcal{I}]$ is the answer to the question: "What does the long-run time average of a function know about the initial condition $x$?" If $T$ is ergodic, then $\mathcal{I}$ contains only sets of measure $0$ or $1$, so the conditional expectation $\mathbb{E}[f \mid \mathcal{I}]$ is constant $\mu$-a.e., equal to $\int_X f\, d\mu$. The time average forgets the starting point entirely and converges to the space average. This is the quantitative content of ergodicity: the orbit distribution becomes equidistributed. If $T$ is not ergodic, then $\mathcal{I}$ carries non-trivial information. The invariant $\sigma$-algebra partitions $X$ into invariant regions, and on each region the time average converges to the average of $f$ over that region. 
The function $\mathbb{E}[f \mid \mathcal{I}](x)$ is the "local space average" of $f$ on the invariant component containing $x$. Different orbits may have different time limits — but each orbit has a well-defined limit in $L^2$, determined by which invariant component it belongs to. For example, if $X = [0,1]$ and $T$ acts on $[0, \frac{1}{2})$ and $[\frac{1}{2}, 1)$ separately as ergodic maps on each half, then for any $f \in L^2$, \begin{align*} Pf(x) = \begin{cases} 2\int_0^{1/2} f\, d\lambda & \text{if } x \in [0, \frac{1}{2}) \\ 2\int_{1/2}^{1} f\, d\lambda & \text{if } x \in [\frac{1}{2}, 1). \end{cases} \end{align*} The time average remembers which half the orbit is confined to, but nothing finer. [/explanation] ## Extensions to $L^p$ The $L^2$ proof relied on orthogonal projections and inner products — tools specific to [Hilbert space](/page/Hilbert%20Space). What happens in $L^p$ for $p \neq 2$? There is no [orthogonal decomposition](/theorems/436) available, and the Koopman operator need not have an adjoint in any useful sense. One might guess that the theorem fails in $L^p$, since the spectral machinery breaks down. In fact, the theorem holds for all $1 \le p < \infty$, though the proof strategy shifts from spectral decomposition to density-and-boundedness. [quotetheorem:3463] The theorem is stated here without proof; the full argument uses the $L^2$ case as a starting point together with two ingredients: (i) the density of $L^2 \cap L^p$ in $L^p$, and (ii) the operator bound $\|U_T f\|_{L^p} = \|f\|_{L^p}$ (which follows from the measure-preserving property), ensuring that the Cesàro averages are uniformly bounded operators on $L^p$ with norm at most $1$. The proof technique — establishing the result on a dense subclass and extending by a density-and-boundedness argument — mirrors the closure argument in Step 3 of the $L^2$ proof above. The $L^p$ result connects to broader themes in analysis. 
Since $U_T$ is an isometry on both $L^1$ and $L^\infty$, the Riesz–Thorin interpolation theorem gives $\|U_T\|_{\mathcal{L}(L^p)} \le 1$ for all $1 \le p \le \infty$ without any further calculation. Ergodic theory thus provides a natural testing ground for interpolation arguments. The Cesàro operator $f \mapsto \frac{1}{N}\sum_{n=0}^{N-1} f \circ T^n$ is a contraction on every $L^p$ simultaneously, and its $L^p$-limit is the same conditional expectation $\mathbb{E}[f \mid \mathcal{I}]$ independent of $p$ (when $f \in L^\infty$, so that all $L^p$ comparisons are valid). [remark: The Case $p = \infty$] The theorem fails for $p = \infty$ in general: the time averages of an $L^\infty$ function need not converge in $L^\infty$ norm. The obstruction is that $L^\infty$ convergence would require [uniform convergence](/page/Uniform%20Convergence) almost everywhere, which is far stronger than what the theorem delivers. The correct statement for $L^\infty$ functions is $L^p$ convergence for all finite $p$, combined with the pointwise convergence guaranteed (for $L^1$ functions) by the Birkhoff Pointwise Ergodic Theorem in Chapter 5. [/remark] The $L^1$ case deserves special attention. In $L^1$, functions can be unbounded and uniform estimates are unavailable, yet the Cesàro averages still converge in norm. The doubling map provides a concrete illustration of what $L^1$ convergence means in practice. [example: $L^1$ Mean Ergodic Theorem for the Doubling Map] Let $T(x) = 2x \pmod{1}$ on $([0,1], \mathcal{B}, \lambda)$. Since $T$ is ergodic with respect to Lebesgue measure $\lambda$ (a fact established in Chapter 3 using the characterisation via the Fourier spectrum), the invariant $\sigma$-algebra $\mathcal{I}$ consists only of sets of measure $0$ or $1$, and $\mathbb{E}[f \mid \mathcal{I}] = \int_0^1 f\, d\lambda$ for any $f \in L^1$. Take $f = \mathbb{1}_{[0, 1/2)}$. 
Then $f(T^n x) = 1$ exactly when $T^n x \in [0, \frac{1}{2})$, that is, when the $(n+1)$-th binary digit of $x$ is $0$. The $L^1$ Mean Ergodic Theorem gives \begin{align*} \frac{1}{N}\sum_{n=0}^{N-1} \mathbb{1}_{[0,1/2)}(T^n x) \xrightarrow{L^1} \frac{1}{2} \quad \text{as } N \to \infty. \end{align*} This says that the proportion of the first $N$ binary digits of $x$ that equal $0$ converges to $\frac{1}{2}$ in $L^1(\lambda)$. Note that the convergence is in $L^1$, not pointwise, and it can fail at individual points: at $x = 0$, for example, every digit is $0$ and the averages equal $1$ for every $N$. The [Birkhoff Ergodic Theorem](/theorems/518) in Chapter 5 will sharpen this to $\lambda$-a.e. convergence. [/example] The failure at $p = \infty$ is not a mere technicality. The doubling map has orbits of very different character: every rational $x$ has an eventually periodic orbit, while for Lebesgue-almost-every $x$ the orbit is equidistributed. Controlling individual orbits, not just their average over $x$, is the pointwise problem that requires Birkhoff's theorem. The gap between $L^p$ and pointwise convergence is genuine and deep, and Chapter 5 is devoted to bridging it. [motivation] ### Why Hilbert Space Methods? The Mean Ergodic Theorem could have been stated and proved in a more elementary way, by tracking the explicit computation of Cesàro sums and using compactness or [weak convergence](/page/Weak%20Convergence) arguments directly on $L^2$. Von Neumann's original proof (1932) used the spectral decomposition of unitary operators. A simpler-seeming approach: since $\{A_N f\}$ is a bounded sequence in $L^2$, by the Banach–Alaoglu theorem it has a weakly convergent subsequence $A_{N_k} f \rightharpoonup h$. Any weak limit $h$ must satisfy $Uh = h$ (since $A_N(Uf - f) \to 0$ in norm), so $h \in \mathcal{H}_1$.
But this argument gives only convergence along a subsequence — different subsequences might converge to different elements of $\mathcal{H}_1$ — and it does not pin down the limit as the [orthogonal projection](/theorems/437) $Pf$. The [orthogonal decomposition](/theorems/436) approach circumvents both difficulties at once, delivering norm convergence of the full sequence directly.  The proof via [orthogonal decomposition](/theorems/436) presented here has several advantages. First, it is genuinely short: once the decomposition $L^2 = \mathcal{H}_1 \oplus \mathcal{H}_0$ is established, the convergence argument is three lines. Second, it reveals the structural reason for convergence: the fixed-point component converges trivially, and the coboundary component cancels by a telescoping argument. Third, the abstract [Hilbert space](/page/Hilbert%20Space) formulation makes clear that the theorem is not specific to the ergodic-theory setting — it holds for any isometry on any [Hilbert space](/page/Hilbert%20Space). This generality is exploited in Chapter 8, where the same decomposition (now in terms of the spectral measure of $U_T$) characterises weak mixing as the absence of non-trivial eigenvalues. ### The Koopman Representation as a Theme The passage $T \mapsto U_T$ replaces a dynamical system with a unitary operator, and every dynamical question has a linear-operator translation. Ergodicity becomes the simplicity of the eigenvalue $1$. Mixing becomes the decay of matrix coefficients $(U_T^n f, g)_{L^2}$. Spectral isomorphism (Chapter 10) becomes unitary equivalence of Koopman operators. The Koopman representation is not just a tool — it is the organising principle of the entire course from this chapter onwards. This viewpoint also situates ergodic theory within the broader landscape of analysis. 
On the functional-analytic side, $U_T$ is (for invertible $T$) a unitary operator on a separable [Hilbert space](/page/Hilbert%20Space), and its spectral theory is governed by the spectral theorem: there is a projection-valued measure $E$ on the circle $S^1$ such that $U_T = \int_{S^1} z\, dE(z)$. The Mean Ergodic Theorem is then a consequence of the behaviour of the Cesàro kernel $\frac{1}{N}\sum_{n=0}^{N-1} z^n$: the spectral projection $E(\{1\})$ is the projection $P$, while the contribution from $z \neq 1$ vanishes in the limit by dominated convergence, since the kernel is bounded by $1$ and tends to $0$ for every $z \neq 1$ on the circle. On the harmonic analysis side, the spectral measure of $U_{T_\alpha}$ for an irrational rotation is a sum of point masses at $\{e^{2\pi i n\alpha} : n \in \mathbb{Z}\}$, and the equidistribution of these eigenvalues on the circle is precisely the [Weyl equidistribution theorem](/theorems/3434). The Koopman operator thus serves as a bridge connecting measure-theoretic dynamics, operator spectral theory, and classical harmonic analysis. [/motivation] --- # 5. The Birkhoff Pointwise Ergodic Theorem Chapter 4 established the [von Neumann Mean Ergodic Theorem](/theorems/3448): for $f \in L^2(X, \mu)$, the Cesàro averages $\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$ converge in the $L^2$ norm. This is a statement about convergence of functions in a [Hilbert space](/page/Hilbert%20Space), but it says nothing about what happens at any individual point $x$. This chapter upgrades it to pointwise almost everywhere convergence, which is far deeper: the limit exists for $\mu$-almost every $x$, and this holds for all $f \in L^1$, a much larger class than $L^2$. This is the Birkhoff Pointwise Ergodic Theorem, one of the central results of the entire subject. ## The Ergodic Averages What happens to the long-run average of an observable along a typical orbit?
More precisely: given a measurable function $f: X \to \mathbb{R}$ representing some physical quantity, does the time average $\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$ converge as $N \to \infty$, and if so, to what limit? This is the central question of the chapter. Throughout this chapter, $(X, \mathcal{B}, \mu, T)$ denotes a measure-preserving system: $(X, \mathcal{B}, \mu)$ is a probability space and $T: X \to X$ is a measurable, measure-preserving transformation. For $f \in L^1(X, \mu)$ and $N \ge 1$, define the $N$-th ergodic average operator $A_N: L^1(X, \mu) \to L^1(X, \mu)$ by \begin{align*} A_N f(x) := \frac{1}{N} \sum_{n=0}^{N-1} f(T^n x). \end{align*} The sequence $(A_N f(x))_{N \ge 1}$ is the time average of the observable $f$ along the orbit of $x$. The fundamental question of ergodic theory is whether this sequence converges as $N \to \infty$, and if so, to what limit. The von Neumann theorem answered this in $L^2$: $A_N f \to Pf$ in $\|\cdot\|_{L^2}$, where $P$ is the [orthogonal projection](/theorems/437) onto the subspace of $T$-invariant functions. The Birkhoff theorem shows that the convergence is in fact pointwise almost everywhere, and the limit is the same object — the conditional expectation of $f$ onto the $\sigma$-algebra of $T$-invariant sets. [definition: Invariant Sigma-Algebra] Let $(X, \mathcal{B}, \mu, T)$ be a measure-preserving system. The **invariant $\sigma$-algebra** is \begin{align*} \mathcal{I} := \{ A \in \mathcal{B} : T^{-1}A = A \text{ mod } \mu \}, \end{align*} the collection of all measurable sets that are $T$-invariant up to sets of measure zero. [/definition] The invariant $\sigma$-algebra $\mathcal{I}$ captures all the information about which parts of $X$ are permanently separated by the dynamics. When $T$ is ergodic, $\mathcal{I}$ consists only of sets of measure $0$ or $1$, which means the conditional expectation $\mathbb{E}[f \mid \mathcal{I}]$ collapses to the constant $\int_X f \, d\mu$. 
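The definition of $A_N f(x)$ translates directly into a short computation. The sketch below is illustrative only: the helper names, the golden-mean rotation number, the starting point $0.1$, and the choice $f = \mathbb{1}_{[0,1/4)}$ are assumptions made for the example, not part of the text. An irrational rotation is used rather than the doubling map because iterating $2x \bmod 1$ in floating point collapses to $0$ after a few dozen steps.

```python
import math

def ergodic_average(f, T, x, N):
    """Return A_N f(x) = (1/N) * sum_{n=0}^{N-1} f(T^n x)."""
    total = 0.0
    for _ in range(N):
        total += f(x)
        x = T(x)
    return total / N

ALPHA = (math.sqrt(5) - 1) / 2  # irrational rotation number (illustrative choice)

def rotation(x):
    """Circle rotation x -> x + ALPHA (mod 1), which preserves Lebesgue measure."""
    return (x + ALPHA) % 1.0

def indicator_quarter(x):
    """Indicator of [0, 1/4); its space average is 1/4."""
    return 1.0 if 0.0 <= x < 0.25 else 0.0

averages = {N: ergodic_average(indicator_quarter, rotation, 0.1, N)
            for N in (10, 1000, 100_000)}
for N, value in averages.items():
    print(N, value)
```

For this ergodic system the averages approach the space average $\lambda([0, 1/4)) = 1/4$. Replacing `rotation` by the identity map returns $f(x)$ for every $N$, in line with the discussion of $\mathcal{I}$ above.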
## Statement of the Birkhoff Theorem The convergence established by the von Neumann theorem — that ergodic averages converge in $L^2$ norm — does not, by itself, say anything about the behaviour of $A_N f(x)$ at any particular point $x$. One might hope to deduce pointwise convergence from $L^2$ convergence, but this fails in general: $L^2$ convergence allows the exceptional set to move around unpredictably with $N$, so no fixed point $x$ is guaranteed to lie outside it for all large $N$. The question of almost everywhere pointwise convergence is therefore a genuinely harder problem, requiring a fundamentally different approach. The answer is the Birkhoff Pointwise Ergodic Theorem. [quotetheorem:518] The theorem makes two assertions simultaneously: the limit exists almost everywhere (a non-trivial convergence statement), and the limit is identified as the conditional expectation onto the invariant $\sigma$-algebra. The identification in (iii) is not additional content once (i) and (ii) are established — it follows from the uniqueness characterisation of conditional expectation. [remark: Sharpness of the $L^1$ assumption] The assumption $f \in L^1$ is sharp. If $f \notin L^1$, the ergodic averages need not converge almost everywhere; indeed, for any $f \notin L^1$ on a conservative ergodic system, it follows that $\limsup_N A_N |f|(x) = \infty$ for a.e. $x$. The theorem therefore cannot be extended beyond $L^1$ in any general sense. [/remark] When the system is ergodic, the conditional expectation $\mathbb{E}[f \mid \mathcal{I}]$ reduces to the constant $\int_X f \, d\mu$, recovering the classical formulation of Birkhoff's theorem. The ergodic case is the most frequently used in applications: when the system has no non-trivial invariant sets, the time average of any integrable observable equals its space average for almost every initial condition. This is the content of the following specialisation. 
[quotetheorem:3431] This is the mathematical formulation of the ergodic hypothesis from statistical mechanics: the time average equals the space average for almost every initial condition. The "almost every" caveat is essential: one cannot generally remove the exceptional set. Ergodicity is indispensable here. Without it, the conclusion fails: if $T$ is the identity map on $[0,1]$, then $A_N f = f$ for every $N$, so the time average equals $f(x)$ pointwise, a function that need not be the constant $\int_X f \, d\mu$. Measure-preservation is equally essential: if $T$ does not preserve $\mu$, then the orbit of $x$ may concentrate on a subset that does not represent the whole space, and the time average can converge to a value entirely unrelated to $\int_X f \, d\mu$. For a concrete failure, let $X = [0,1]$, $\mu = \lambda$ (Lebesgue), and $T(x) = x/2$ (a contraction). Then $T^n(x) \to 0$ for every $x$, so for any continuous $f$, $A_N f(x) \to f(0)$, which in general differs from $\int_0^1 f \, d\lambda$: for $f(x) = x$ the time averages tend to $0$ while the space average is $\frac{1}{2}$. The map $T$ is not measure-preserving (it pushes all mass toward $0$), and the ergodic conclusion collapses completely. ## The Maximal Ergodic Lemma Before developing the proof machinery, it is worth understanding why the naive approach fails. The most direct attempt is to show that the sequence $(A_N f(x))_{N \ge 1}$ is Cauchy for a.e. $x$, or to use some form of pointwise domination. Both strategies run into the same obstacle: the ergodic averages are not monotone, not ordered, and not uniformly bounded in any pointwise sense (even for $f \in L^1$). What is needed is not control of individual averages but control of the worst-case behaviour, that is, how large the averages can ever get. This is precisely what the [Maximal Ergodic Lemma](/theorems/3432) provides: a sharp integral inequality on the set where the running maximum exceeds zero. The proof of the Birkhoff theorem does not proceed by directly showing the limit exists.
Instead, the key tool is a maximal inequality that controls how large the ergodic averages can be on any set of positive measure. This is the [Maximal Ergodic Lemma](/theorems/3432), sometimes called Hopf's maximal ergodic theorem after Eberhard Hopf, who found the elegant argument below. For $f \in L^1(X, \mu)$ and $N \ge 1$, define the maximal ergodic average operator $M_N: L^1(X, \mu) \to L^1(X, \mu)$ by \begin{align*} M_N f(x) := \max_{1 \le k \le N} A_k f(x) = \max_{1 \le k \le N} \frac{1}{k} \sum_{n=0}^{k-1} f(T^n x), \end{align*} and the one-sided maximal function $M^+ f: X \to (-\infty, +\infty]$ (a pointwise supremum, not necessarily finite a priori) by \begin{align*} M^+ f(x) := \sup_{N \ge 1} A_N f(x). \end{align*} [quotetheorem:3432] [citeproof:3432] The maximal lemma is deceptively simple in statement but remarkably powerful. It says: on the set where the running maximum of the ergodic averages is positive, the integral of $f$ over that set is non-negative. This holds for every $N$, making it a uniform statement. Two important caveats sharpen our understanding. First, the lemma says nothing about individual points: it is a statement about integrals over sets, not about pointwise bounds. Second, the hypothesis $f \in L^1$ is indispensable: if $f \notin L^1$, the set $E_N$ may have full measure and the integral $\int_{E_N} f \, d\mu$ may be undefined or infinite, so no inequality of this form can hold. The $L^1$ integrability of $f$ is what makes the integral on the left coherent and ultimately finite. An important corollary is the $L^1$ maximal inequality: [quotetheorem:3450] [citeproof:3450] This weak-type $(1,1)$ bound is sharp: one cannot replace it with a strong-type $(1,1)$ bound $\|M^+ f\|_{L^1} \le C\|f\|_{L^1}$, since $M^+$ fails to be bounded on $L^1$ in general (take a function concentrating near an orbit).
The correct comparison is with the Hardy–Littlewood maximal inequality in harmonic analysis, where the same weak-type $(1,1)$ bound holds for the Hardy–Littlewood maximal function $Hf(x) = \sup_{r>0} \frac{1}{2r}\int_{x-r}^{x+r} |f|$; the ergodic setting is the dynamical analogue of that classical result. The role of the weak-type bound in the Birkhoff proof is also directly analogous: it converts approximation in $L^1$ norm into almost everywhere convergence, via the standard density argument that occupies the next section. ## Proof of the Birkhoff Theorem The [Maximal Ergodic Lemma](/theorems/3432) reduces the proof of pointwise convergence to a density argument. The strategy is: 1. Prove the theorem for a dense class of functions in $L^1$ where convergence is easy to verify directly. 2. Use the maximal inequality to show the exceptional set (where convergence fails) has measure zero for all $f \in L^1$ by approximation. [quotetheorem:518] [citeproof:518] The three-step architecture of the Birkhoff proof — convergence on a dense class, approximation, and maximal inequality to control the error — is a template that recurs throughout ergodic theory and harmonic analysis. The maximal inequality is the decisive ingredient: without it, the approximation argument collapses because $L^1$ approximation alone cannot force pointwise convergence. [remark: The Role of the Maximal Lemma] The weak-type $(1,1)$ inequality for $M^+$ is the analogue, in ergodic theory, of the Hardy–Littlewood maximal inequality in harmonic analysis. Just as the Hardy–Littlewood maximal inequality implies the [Lebesgue Differentiation Theorem](/theorems/74), the [Maximal Ergodic Lemma](/theorems/3432) implies the Birkhoff theorem via the same approximation-and-truncation strategy. [/remark] ## The Non-Ergodic Case: Convergence to Conditional Expectation The full Birkhoff theorem identifies the limit as $\mathbb{E}[f \mid \mathcal{I}]$ in general. 
This formulation is essential when the system is not ergodic — when $X$ decomposes into a continuum of $T$-invariant components, each of which may be ergodic in its own right. [explanation: What $\mathbb{E}[f \mid \mathcal{I}]$ means geometrically] The invariant $\sigma$-algebra $\mathcal{I}$ partitions $X$ (up to null sets) into the ergodic components of $T$ — the maximal $T$-invariant sets on each of which $T$ acts ergodically. The conditional expectation $\mathbb{E}[f \mid \mathcal{I}](x)$ gives the average of $f$ over the ergodic component containing $x$. When the system is ergodic, there is only one component (all of $X$), and the conditional expectation reduces to the constant $\int_X f \, d\mu$. For a non-ergodic system, different points $x$ may lie in different components and see different time averages. As a concrete illustration, consider the system on $[0,1]$ with $T(x) = x$ (the identity map). Measure-preservation holds because $T^{-1}(A) = A$ for every measurable set $A$, so $\mu(T^{-1}(A)) = \mu(A)$. Every function is invariant, so $\mathcal{I} = \mathcal{B}$, and $\mathbb{E}[f \mid \mathcal{I}] = f$. Indeed, $A_N f(x) = f(x)$ for every $N$, and the Birkhoff theorem correctly predicts the limit is $f$ itself. Each point is its own ergodic component. [/explanation] The non-ergodic formulation of the Birkhoff theorem interacts cleanly with the ergodic decomposition (to be covered in Chapter 6). If $\mu = \int \mu_\omega \, d\nu(\omega)$ is the ergodic decomposition of $\mu$, then for $\mu$-a.e. $x$, the ergodic averages converge to $\int_X f \, d\mu_{\omega(x)}$, where $\omega(x)$ is the ergodic component of $x$. The picture worth keeping in mind is that the ergodic components act as invisible walls: orbits cannot cross between them, so the time average of any observable sees only the piece of $X$ that the orbit actually visits. 
The Birkhoff theorem makes this geometric intuition precise, assigning each point $x$ its own limiting average determined entirely by which component $x$ belongs to. ## Normal Numbers and the Gauss Map What can the Birkhoff theorem tell us about the digits of a typical real number? The answer turns out to be striking: almost every real number, in a precise measure-theoretic sense, has digits that are equidistributed across all bases simultaneously. The Birkhoff theorem is the engine behind this and the analogous result for continued fraction digits. ### Normal Numbers via the Doubling Map A real number $x \in [0,1]$ is called **normal in base $b$** if every finite block of $k$ digits appears in the base-$b$ expansion of $x$ with asymptotic frequency $b^{-k}$. The simplest case is $k = 1$: for a digit $d \in \{0, 1, \ldots, b-1\}$, the frequency of $d$ among the first $N$ base-$b$ digits of $x$ should tend to $1/b$. [example: Normal numbers from the Birkhoff theorem] Consider the times-$b$ map $T_b(x) = bx \pmod{1}$ (the doubling map when $b = 2$) on $([0,1], \mathcal{B}([0,1]), \lambda)$, where $\lambda$ is Lebesgue measure. This map is ergodic with respect to $\lambda$ (as established in Chapter 3). For $n \ge 0$, the $(n+1)$-th base-$b$ digit of $x$ is $\lfloor b \, T_b^n(x) \rfloor$; equivalently, this digit equals $d$ if and only if $T_b^n(x) \in [d/b, (d+1)/b)$. Take $f = \mathbb{1}_{[d/b, (d+1)/b)}$. Then $f(T_b^n x) = 1$ if and only if the $(n+1)$-th digit of $x$ in base $b$ is $d$, so the sum over $0 \le n \le N-1$ counts the occurrences of $d$ among the first $N$ digits. The [Birkhoff–Khinchin theorem](/theorems/3431) gives: \begin{align*} \frac{1}{N} \sum_{n=0}^{N-1} \mathbb{1}_{[d/b, (d+1)/b)}(T_b^n x) \xrightarrow{a.s.} \int_0^1 \mathbb{1}_{[d/b, (d+1)/b)} \, d\lambda = \frac{1}{b}. \end{align*} This holds for $\lambda$-almost every $x \in [0,1]$. Since this works for every digit $d \in \{0, \ldots, b-1\}$ and every base $b \ge 2$ (each application excludes only a null set, and a countable union of null sets is null), $\lambda$-almost every $x \in [0,1]$ has the correct single-digit frequencies in every base simultaneously. The same argument, applied to the indicator of the $b$-adic interval of length $b^{-k}$ determined by a block of $k$ digits, gives the block frequencies, so $\lambda$-almost every $x \in [0,1]$ is normal in every base simultaneously.
The exceptional set — the set of numbers that fail to be normal in at least one base — has Lebesgue measure zero. To justify why the union over all bases remains a null set: for each fixed base $b$ and each fixed digit $d$, the set of $x$ for which the frequency of $d$ in base $b$ fails to be $1/b$ is a null set. There are countably many choices of $(b, d)$, so the union of these countably many null sets is still a null set. [/example] This result is remarkable: it asserts that "most" [real numbers](/page/Real%20Numbers) are normal in every base, yet it is difficult to exhibit a single explicit example of a number provably normal in all bases. Champernowne's number $0.1234567891011\ldots$ is normal in base 10 but its normality in other bases is not fully established. The Birkhoff theorem guarantees existence in abundance without constructing a single example. ### Continued Fraction Digit Frequencies via the Gauss Map Every $x \in (0,1) \setminus \mathbb{Q}$ has a unique continued fraction expansion $x = [a_1, a_2, a_3, \ldots]$ where $a_n \in \mathbb{N}$. The **Gauss map** $G: (0,1) \to [0,1)$ is defined by \begin{align*} G(x) := \left\{ \frac{1}{x} \right\} = \frac{1}{x} - \left\lfloor \frac{1}{x} \right\rfloor, \end{align*} where $\{y\}$ denotes the fractional part of $y$. The key property is that $G^n(x)$ encodes the $n$-th continued fraction digit: $a_n(x) = \lfloor 1/G^{n-1}(x) \rfloor$. [example: Gauss–Kuzmin statistics via the Birkhoff theorem] The Gauss map preserves the **Gauss measure** $\gamma$ with density \begin{align*} \frac{d\gamma}{d\lambda}(x) = \frac{1}{\log 2} \cdot \frac{1}{1 + x}, \quad x \in (0,1). \end{align*} The measure $\gamma$ is $G$-invariant. 
To see this, write $G^{-1}([0,t)) = \bigcup_{k=1}^\infty (\frac{1}{k+t}, \frac{1}{k}]$ for $t \in [0,1)$, compute $\gamma$ of each interval using the density $\frac{1}{(1+x)\log 2}$, and sum the resulting series. The total equals $\gamma([0,t))$, and since the intervals $[0,t)$ generate the Borel $\sigma$-algebra, this confirms $\gamma(G^{-1}(A)) = \gamma(A)$ for all Borel sets $A \subset (0,1)$. Moreover, $G$ is ergodic with respect to $\gamma$. The event that the $n$-th continued fraction digit equals $k$ is $\{G^{n-1}(x) \in (1/(k+1), 1/k]\}$. Take $f = \mathbb{1}_{(1/(k+1), 1/k]}$. The [Birkhoff–Khinchin theorem](/theorems/3431) applied to $((0,1), \gamma, G)$ gives: \begin{align*} \frac{1}{N} \#\{ 1 \le n \le N : a_n(x) = k \} \xrightarrow{a.s.} \gamma\!\left(\left(\frac{1}{k+1}, \frac{1}{k}\right]\right) = \frac{1}{\log 2} \log\!\left(\frac{(k+1)^2}{k(k+2)}\right). \end{align*} Let us compute this for $k = 1$: \begin{align*} \gamma((1/2, 1)) = \frac{1}{\log 2}\int_{1/2}^1 \frac{dx}{1+x} = \frac{1}{\log 2}\Big[\log(1+x)\Big]_{1/2}^1 = \frac{1}{\log 2}\left(\log 2 - \log\frac{3}{2}\right) = \frac{\log(4/3)}{\log 2} \approx 0.415. \end{align*} So for $\gamma$-almost every $x$ (equivalently, $\lambda$-almost every $x$, since $\gamma$ and $\lambda$ are mutually absolutely continuous), about $41.5\%$ of the continued fraction digits of $x$ equal $1$. More precisely, for $\lambda$-almost every $x \in (0,1)$, the limiting frequency of digit $k$ in the continued fraction expansion of $x$ is $\frac{1}{\log 2}\log\frac{(k+1)^2}{k(k+2)}$. This is the **Gauss–Kuzmin distribution**, and its derivation from the Birkhoff theorem requires only the invariance and ergodicity of the Gauss measure: invariance follows from the direct calculation above, while ergodicity needs a separate argument that is not given here. [/example] The equivalence between $\gamma$-a.e. and $\lambda$-a.e.
in the example above follows from the mutual absolute continuity of the Gauss measure and Lebesgue measure on $(0,1)$: since the Gauss density $1/((1+x)\log 2)$ is bounded above and below by positive constants on $[0,1]$, the two measures assign zero to exactly the same sets.  Both examples above have been applications of the theorem to measure-preserving systems where ergodicity and invariance are already in hand. It is equally instructive to see where the theorem breaks down. If $f \notin L^1(X, \mu)$, the ergodic averages need not converge: on a conservative ergodic system, $\limsup_N A_N |f|(x) = +\infty$ for a.e. $x$ whenever $f \notin L^1$ — the averages blow up almost everywhere. Similarly, if $T$ fails to preserve $\mu$, the time average can diverge from the space average regardless of regularity: for the contraction $T(x) = x/2$ on $([0,1], \lambda)$ and $f = \mathbf{1}_{[1/2, 1]}$, every orbit eventually leaves the support of $f$, so $A_N f(x) \to 0$ for every $x$, while $\int_0^1 f \, d\lambda = 1/2$. These failures confirm that $f \in L^1$ and measure-preservation of $T$ are not merely technical conveniences — they are necessary conditions for the theorem to hold. ## Relation to the von Neumann Theorem The von Neumann and Birkhoff theorems both concern the convergence of ergodic averages, but do they imply each other? Does pointwise a.e. convergence subsume $L^2$ norm convergence, or vice versa? The answer is that the two modes of convergence are logically independent — neither implies the other in general — which raises the question of why both theorems are needed and what each contributes. [explanation: Mean versus pointwise convergence] The [von Neumann Mean Ergodic Theorem](/theorems/3448) gives $\|A_N f - f^*\|_{L^2} \to 0$, while the Birkhoff theorem gives $A_N f(x) \to f^*(x)$ for a.e. $x$. These are logically independent modes of convergence: $L^2$ convergence does not imply a.e. 
convergence in general (as can be seen from the typewriter sequence of functions), and a.e. convergence does not imply $L^2$ convergence without a domination hypothesis. For ergodic averages, one can recover $L^1$ convergence from the Birkhoff theorem by an approximation argument. If $f$ is bounded, then $|A_N f| \le \|f\|_{L^\infty}$ pointwise, so the [Dominated Convergence Theorem](/theorems/4) upgrades a.e. convergence to $\|A_N f - f^*\|_{L^1} \to 0$. For general $f \in L^1$, choose a bounded $g$ with $\|f - g\|_{L^1} < \varepsilon$; since $\|A_N h\|_{L^1} \le \|h\|_{L^1}$ and $\|h^*\|_{L^1} \le \|h\|_{L^1}$ for every $h \in L^1$, it follows that $\limsup_N \|A_N f - f^*\|_{L^1} \le 2\varepsilon$. (Dominating $A_N f$ by the maximal function $M^+ f$ does not work directly: the weak-type inequality places $M^+ f$ only in weak $L^1$, and $M^+ f$ need not be integrable.) This gives $\|A_N f - f^*\|_{L^1} \to 0$ for all $f \in L^1$, a bonus corollary of the Birkhoff theorem. The Birkhoff theorem is strictly harder to prove than the von Neumann theorem. The von Neumann proof is a clean [Hilbert space](/page/Hilbert%20Space) argument (projection onto a closed subspace), while the Birkhoff proof requires the combinatorial [Maximal Ergodic Lemma](/theorems/3432). The gain is a far stronger conclusion. [/explanation] --- # 6. Ergodic Decomposition The Birkhoff Pointwise Ergodic Theorem established that time averages converge almost everywhere to a limit that equals the space average precisely when the system is ergodic. For non-ergodic systems, the limit is a non-constant invariant function — the conditional expectation onto the invariant $\sigma$-algebra. This raises a natural structural question: what does a general invariant measure look like, and can every non-ergodic system be understood as a combination of ergodic ones? The [Ergodic Decomposition Theorem](/theorems/3453) answers this definitively: every invariant probability measure decomposes as a mixture — a measurable convex combination — of ergodic measures, and this decomposition is essentially unique.
Understanding this decomposition requires two parallel threads: the functional-analytic perspective, which identifies ergodic measures as extreme points of the convex set of all invariant measures, and the measure-theoretic perspective, which realises the decomposition through disintegration over the invariant $\sigma$-algebra. ## The Space of Invariant Measures What structure does the collection of all invariant measures carry, and why should we expect it to be rich enough to encode the full complexity of the dynamics? A single dynamical system can admit many invariant measures — indeed infinitely many in general — and the question of how they relate to one another is not merely organisational. The answer turns out to be geometric: the invariant measures form a convex body, and the ergodic measures are exactly its extreme points. This geometric perspective is the entry point into the decomposition theory. Let $(X, \mathcal{B}, T)$ be a measurable dynamical system, where $X$ is a compact [metrizable space](/page/Metrizable%20Space), $\mathcal{B}$ is its Borel $\sigma$-algebra, and $T: X \to X$ is a continuous map. The set of all $T$-invariant Borel probability measures plays a central role in what follows. [definition: Space of Invariant Measures] Let $T: X \to X$ be a continuous map on a compact [metrizable space](/page/Metrizable%20Space) $X$. The **space of $T$-invariant probability measures** is \begin{align*} \mathcal{M}_T := \{\mu \in \mathcal{M}(X) : \mu(T^{-1}B) = \mu(B) \text{ for all } B \in \mathcal{B}\}, \end{align*} where $\mathcal{M}(X)$ denotes the set of all Borel probability measures on $X$, equipped with the [weak* topology](/page/Weak*%20Topology) (convergence against continuous functions). [/definition] The [Krylov–Bogolyubov theorem](/theorems/3423) from Chapter 1 guarantees $\mathcal{M}_T \neq \varnothing$.
The [weak* topology](/page/Weak*%20Topology) makes $\mathcal{M}(X)$ a compact [metrizable space](/page/Metrizable%20Space): by the Riesz representation theorem, $\mathcal{M}(X)$ sits inside $C(X)^*$ as a weak*-closed subset of the unit ball, which is weak* compact by the Banach–Alaoglu theorem, and metrizability follows from the separability of $C(X)$ for compact metrizable $X$. [quotetheorem:3451] [citeproof:3451] The convex structure of $\mathcal{M}_T$ immediately raises the question of its extreme points. Recall that a point $\mu$ in a convex set $K$ is an **extreme point** if the only way to write $\mu = t\mu_1 + (1-t)\mu_2$ with $\mu_1, \mu_2 \in K$ and $t \in (0,1)$ is with $\mu_1 = \mu_2 = \mu$. [quotetheorem:3452] [citeproof:3452] This characterisation — ergodicity as extremality — is both aesthetically satisfying and practically powerful. It says that ergodic measures are the indivisible, irreducible invariant measures: they cannot be expressed as proper convex combinations of distinct invariant measures. Non-ergodic measures, by contrast, always admit such a decomposition, and the extreme point structure is precisely what makes this decomposition possible and unique. It is worth being precise about what the theorem does NOT say. It identifies ergodic measures with extreme points of $\mathcal{M}_T$, but it says nothing about how many extreme points exist, nor does it guarantee that every non-ergodic measure decomposes into finitely many ergodic pieces. A system can have a continuum of ergodic measures (as the shear example later in this chapter illustrates), and the decomposition of a non-ergodic measure may be genuinely integral rather than a finite sum. The theorem also does not by itself show that $\mathcal{M}_T$ is a simplex: that property is a consequence of the uniqueness of the ergodic decomposition, and in general $\mathcal{M}_T$ is infinite-dimensional rather than a finite-dimensional convex body. The Choquet Representation Theorem, which we develop in the next section, is what makes the decomposition into ergodic components precise and canonical.
[example: Non-Ergodic Torus Translation as a Convex Combination] Consider the two-dimensional torus $\mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$ with the map $T(x,y) = (x + \alpha, y + \beta) \pmod{1}$, where $\alpha$ is irrational and $\beta$ is rational, say $\beta = p/q$ in lowest terms. This is not ergodic with respect to two-dimensional Lebesgue measure $\lambda_2$: the translation $y \mapsto y + p/q$ leaves $qy \bmod 1$ unchanged, so the set $B = \{(x,y) : qy \bmod 1 \in [0, 1/2)\}$ is $T$-invariant with $\lambda_2(B) = 1/2$, and serves as a witness to non-ergodicity. The ergodic measures for $T$ are the measures $\mu_y = \lambda_1 \otimes \frac{1}{q}\sum_{j=0}^{q-1} \delta_{y + j\beta}$, where $\lambda_1$ is Lebesgue measure on the first factor and the second factor is the uniform measure on the $q$-periodic orbit of $y$ under $y \mapsto y + \beta$. To verify that each $\mu_y$ is indeed ergodic: the map $T^q$ sends each of the $q$ fibres over the periodic orbit $\{y, y + \beta, \ldots, y + (q-1)\beta\}$ to itself and acts on it as $x \mapsto x + q\alpha \pmod{1}$. Since $\alpha$ is irrational, $q\alpha$ is also irrational, so this restricted rotation is ergodic with respect to $\lambda_1$ (by the standard criterion for irrational rotations). Any $T$-invariant $L^2(\mu_y)$ function is in particular $T^q$-invariant, so it is $\lambda_1$-a.e. constant on each of the $q$ fibres; since $T$ carries each fibre to the next, these constants agree, and the function is constant $\mu_y$-a.e., confirming ergodicity. Lebesgue measure $\lambda_2$ decomposes as: \begin{align*} \lambda_2 = q\int_0^{1/q} \mu_y \, d\lambda_1(y), \end{align*} where the factor $q$ normalises Lebesgue measure on $[0, 1/q)$ to total mass $1$. This expresses $\lambda_2$ as a continuous mixture (integral) of ergodic measures — exactly what the [Ergodic Decomposition Theorem](/theorems/3453) guarantees. [/example] ## The Ergodic Decomposition Theorem The Krein–Milman theorem from functional analysis states that any compact convex set in a locally convex space is the closed convex hull of its extreme points.
Applied to $\mathcal{M}_T$, this gives a preliminary decomposition result: every invariant measure lies in the closed convex hull of the ergodic measures. But for ergodic theory the goal is more precise — we want an integral representation that is measurable, essentially unique, and tied to the dynamics of individual points. [motivation] ### Why a Choquet-Type Decomposition? The Krein–Milman theorem is an existence result in functional analysis: it guarantees that extreme points exist and their convex hull is dense. But in ergodic theory we want something much more explicit. Given a specific invariant measure $\mu$, we want a probability measure $\tau$ on the space of ergodic measures such that \begin{align*} \mu = \int_{\mathcal{E}_T} \nu \, d\tau(\nu), \end{align*} where $\mathcal{E}_T$ denotes the set of ergodic measures. This is a Choquet-type integral representation. The measure $\tau$ assigns weights to ergodic components, and the formula says $\mu$ is recovered by averaging those components with those weights. ### Why Disintegration? The decomposition above is a statement about measures on $X$ expressed as integrals of simpler measures. But it has a point-wise incarnation: for $\mu$-almost every $x \in X$, there is a well-defined ergodic measure $\mu_x$ — the ergodic component of $x$ — such that the orbit of $x$ is equidistributed with respect to $\mu_x$. This is the disintegration of $\mu$ over the invariant $\sigma$-algebra $\mathcal{I}$, the sub-$\sigma$-algebra of $\mathcal{B}$ consisting of all $T$-invariant measurable sets. ### The Invariant $\sigma$-Algebra as the Index Space The invariant $\sigma$-algebra $\mathcal{I}$ measures exactly the "long-run" behaviour of orbits. When the system is ergodic, $\mathcal{I}$ is degenerate (contains only sets of measure $0$ or $1$), reflecting that all orbits have the same statistical behaviour. 
When the system is non-ergodic, $\mathcal{I}$ is rich, and its atoms — the equivalence classes of points that cannot be distinguished by invariant sets — are exactly the ergodic components. [/motivation] The [Ergodic Decomposition Theorem](/theorems/3453) realises both threads simultaneously. It constructs the disintegration $x \mapsto \mu_x$ over the invariant $\sigma$-algebra $\mathcal{I}$, shows that $\mu_x$ is ergodic for $\mu$-almost every $x$, and derives the Choquet integral formula $\mu = \int \mu_x \, d\mu(x)$ as a consequence. The standard Borel hypothesis on $X$ is what makes the construction of the measurable family $x \mapsto \mu_x$ possible, via the theory of regular conditional probabilities. All compact metrizable spaces, shift spaces, and manifolds satisfy this hypothesis. [quotetheorem:3453] [citeproof:3453] The disintegration formula in (iii) should be read as: integrating $f$ against $\mu$ is the same as first integrating against the ergodic component $\mu_x$, then averaging those component integrals over $x$. The measure $\mu_x$ captures the long-run statistical behaviour of the orbit of $x$. Property (iv) is the bridge between the abstract disintegration and computation: it identifies $\mu_x$ as the conditional measure given the invariant $\sigma$-algebra, so in practice $\mu_x$ is computed via the Radon–Nikodym machinery of conditional expectations. Property (v) ensures the decomposition is canonical — there is no ambiguity in the assignment of ergodic components, up to a set of $\mu$-measure zero. [remark: Connection to Birkhoff's Theorem] The Birkhoff Pointwise Ergodic Theorem from Chapter 5 stated that for $f \in L^1(\mu)$, \begin{align*} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) \to \mathbb{E}_\mu[f \mid \mathcal{I}](x) \quad \mu\text{-a.e.} \end{align*} The [Ergodic Decomposition Theorem](/theorems/3453) identifies $\mathbb{E}_\mu[f \mid \mathcal{I}](x) = \int f \, d\mu_x$. 
So for $\mu$-almost every $x$, the time average of $f$ along the orbit of $x$ equals the space average of $f$ under the ergodic measure $\mu_x$. The two theorems together provide a complete picture: Birkhoff tells you that the time average converges; the Ergodic Decomposition tells you what it converges to and what the limit means structurally. [/remark] ## Choquet Representation and the Integral Formula The [Ergodic Decomposition Theorem](/theorems/3453) gives a pointwise, orbit-level picture of the decomposition: for $\mu$-almost every $x$, the measure $\mu_x$ captures the statistical behaviour of the orbit of $x$. But this leaves an unsatisfying gap: the family $x \mapsto \mu_x$ is indexed by points of $X$, not by the ergodic measures themselves, and the same ergodic measure $\mu_x$ may be assigned to uncountably many points. What is missing is a global view — a single probability measure on the space $\mathcal{M}_T$ that records, in one object, how much weight $\mu$ places on each ergodic component. The Choquet representation fills this gap by lifting the decomposition from the level of points to the level of measures on $\mathcal{M}_T$. [quotetheorem:3454] [citeproof:3454] This representation has a clean intuitive meaning: $\tau_\mu$ is the "probability distribution over ergodic components" that encodes how $\mu$ is assembled from ergodic building blocks. When $\mu$ itself is ergodic, $\tau_\mu = \delta_\mu$ — a Dirac mass at $\mu$, confirming that an ergodic measure has only one component: itself. The passage from the point-wise family $x \mapsto \mu_x$ to the measure $\tau_\mu$ on $\mathcal{M}_T$ is simply a change of perspective: instead of labelling ergodic components by points $x \in X$, we label them by the ergodic measures themselves and record how much weight $\mu$ assigns to each. This viewpoint makes the representation intrinsic — it depends only on $\mu$ and the dynamics, not on any choice of parametrisation of the ergodic components. 
The hypotheses that $X$ is compact and metrizable are both used in the theorem as stated. Without compactness, $\mathcal{M}_T$ need not be compact in the [weak* topology](/page/Weak*%20Topology) (it may even be empty), and the Choquet theory, which works on a compact convex set, does not apply directly. Without metrizability, the [weak* topology](/page/Weak*%20Topology) on $\mathcal{M}(X)$ may fail to be metrizable, and the set of extreme points of a non-metrizable compact convex set need not be a $G_\delta$, or even a Borel set, so the statement that $\tau_\mu$ is concentrated on $\mathcal{E}_T$ loses its direct meaning. The failure is one of measurability, not of existence: the Krein–Milman theorem still provides extreme points for every non-empty compact convex subset of a locally convex space. In practice this is not a restriction: the compact metrizable spaces met in this course, such as tori and shift spaces over finite alphabets, all satisfy the hypotheses. [example: Direct Sum of Ergodic Systems] Let $(X_1, \mu_1, T_1)$ and $(X_2, \mu_2, T_2)$ be two distinct ergodic measure-preserving systems on disjoint spaces. Form the disjoint union $X = X_1 \sqcup X_2$ and the map $T(x) = T_i(x)$ for $x \in X_i$. For any $p \in [0,1]$, the measure $\mu = p\mu_1 + (1-p)\mu_2$ is $T$-invariant but not ergodic (unless $p \in \{0,1\}$), since $X_1$ is a $T$-invariant set with $\mu(X_1) = p \in (0,1)$. The [Ergodic Decomposition Theorem](/theorems/3453) gives: for $\mu$-almost every $x$, \begin{align*} \mu_x = \begin{cases} \mu_1 & \text{if } x \in X_1, \\ \mu_2 & \text{if } x \in X_2. \end{cases} \end{align*} The representing measure is $\tau_\mu = p\delta_{\mu_1} + (1-p)\delta_{\mu_2}$, a convex combination of two Dirac masses on $\mathcal{E}_T$.
The disintegration formula becomes: \begin{align*} \int_X f \, d\mu = p \int_{X_1} f \, d\mu_1 + (1-p) \int_{X_2} f \, d\mu_2, \end{align*} which is exactly what one computes directly from $\mu = p\mu_1 + (1-p)\mu_2$. [/example] ## Disintegration Over the Invariant $\sigma$-Algebra The invariant $\sigma$-algebra $\mathcal{I}$ plays the role of the "index space" for the ergodic decomposition. Understanding its structure clarifies how ergodic components are parametrised. [definition: Invariant $\sigma$-Algebra] Let $(X, \mathcal{B}, \mu, T)$ be a measure-preserving system. The **invariant $\sigma$-algebra** is \begin{align*} \mathcal{I} := \{B \in \mathcal{B} : \mu(T^{-1}B \triangle B) = 0\}, \end{align*} the collection of all measurable sets that are $T$-invariant up to measure zero. [/definition] Two immediate consequences of this definition clarify its role. A measurable function $f: X \to \mathbb{R}$ is $\mathcal{I}$-measurable if and only if $f \circ T = f$ $\mu$-almost everywhere: the $\mathcal{I}$-[measurable functions](/page/Measurable%20Functions) are precisely the (a.e.) $T$-invariant ones. The system $(X, \mu, T)$ is ergodic if and only if every set in $\mathcal{I}$ has $\mu$-measure $0$ or $1$: ergodicity means $\mathcal{I}$ is degenerate from $\mu$'s point of view, carrying no information beyond "null" or "full". [explanation: Disintegration Mechanics] The formal statement of disintegration works as follows. The invariant $\sigma$-algebra $\mathcal{I}$ partitions $X$ (in the measure-theoretic sense) into "invariant atoms." Each atom $[x]_\mathcal{I}$ is a maximal set that cannot be distinguished by invariant measurements — equivalently, $x$ and $y$ are in the same atom if $\mathbf{1}_B(x) = \mathbf{1}_B(y)$ for every $B \in \mathcal{I}$. For a standard Borel space, the theory of regular conditional probabilities guarantees: there is a measurable map $x \mapsto \mu_x$ such that $\mu_x$ is concentrated on the atom $[x]_\mathcal{I}$ for $\mu$-a.e. 
$x$, and the disintegration formula $\mu(B) = \int \mu_x(B) \, d\mu(x)$ holds for all $B \in \mathcal{B}$. The ergodic component $\mu_x$ is the restriction of the dynamics to the "world seen by $x$": it carries the statistical information of orbits that are indistinguishable from $x$ by invariant observables. Invariance of $\mu_x$ is inherited from the invariance of $\mu$ and of $\mathcal{I}$; ergodicity of $\mu_x$ for $\mu$-a.e. $x$ is not automatic from concentration on an atom and is part of the conclusion of the Ergodic Decomposition Theorem, typically proved by applying the Birkhoff theorem to a countable dense family of continuous functions. The conditional expectation formula $\mathbb{E}_\mu[f \mid \mathcal{I}](x) = \int f \, d\mu_x$ is the precise link between the $L^2$ projection onto $\mathcal{I}$-[measurable functions](/page/Measurable%20Functions) (studied in the Mean Ergodic Theorem, Chapter 4) and the pointwise ergodic decomposition. The Mean Ergodic Theorem says the time average converges in $L^2$ to this conditional expectation; the Birkhoff Theorem says it converges pointwise; the Ergodic Decomposition identifies the limit as the ergodic-component average. [/explanation] The partition of $X$ into invariant atoms, with each atom carrying its own ergodic measure, is the geometric heart of the decomposition — orbits are confined within atoms, and the dynamics on each atom is irreducible with respect to its measure. Before turning to what can go wrong when the standard Borel hypothesis is dropped, it is worth pausing to make this picture concrete: the atom $[x]_\mathcal{I}$ collects the points that no invariant set separates from $x$, and $\mu_x$ is an ergodic measure concentrated on it that accounts for the statistical behaviour of $\mu$-typical orbits starting in that atom. The atom may carry other invariant measures as well, and orbits need not be dense in it: when $\mu$ is ergodic, for instance, there is essentially a single atom of full measure. The disintegration construction above relies on a regularity hypothesis on the measurable space that deserves explicit attention. In measure theory without such structure, conditional probabilities can be pathological: each $\mathbb{P}(B \mid \mathcal{G})$ is defined only up to a null set depending on $B$, and there may be no way to choose versions that are countably additive in $B$ on the fibres.
The standard Borel condition eliminates these pathologies by ensuring enough "room" for a measurable selection of honest probability measures. [remark: Role of Standard Borel Hypothesis] The assumption that $X$ is a standard Borel space (equivalently, a Polish space with its Borel $\sigma$-algebra, or a Borel subset of $\mathbb{R}$) is essential for the disintegration to produce a measurable family $x \mapsto \mu_x$ of honest probability measures. Without this hypothesis, one can have pathological conditional distributions that fail to be measures on each fibre. A concrete example of such a pathology: let $X$ be the unit interval $[0,1]$ equipped with a $\sigma$-algebra $\mathcal{A}$ strictly coarser than the Borel $\sigma$-algebra (such as the countable-cocountable $\sigma$-algebra, where measurable sets are those that are countable or have countable complement). Conditional expectations with respect to a sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{A}$ may then fail to be representable by measures on the fibres: the "conditional probability" $x \mapsto \mathbb{P}(B \mid \mathcal{G})(x)$ need not extend to a $\sigma$-additive set function on each fibre, because the fibre itself need not be measurable in the ambient space. The standard Borel condition provides enough measurable structure to guarantee that a consistent, countably additive selection exists. In practice, all spaces arising in ergodic theory — compact metrizable spaces, shift spaces, manifolds — are standard Borel. [/remark] ## Uniquely Ergodic Systems What happens to the ergodic decomposition when the space of invariant measures collapses to a single point? If $\mathcal{M}_T$ contains exactly one measure, there is nothing to decompose — but the consequences for the dynamics are surprisingly strong. Such systems are called uniquely ergodic, and their simplicity at the level of invariant measures translates into remarkable uniformity at the level of individual orbits. 
This is a degenerate but illuminating special case of the decomposition, and unique ergodicity will be the subject of Chapter 7. The [Ergodic Decomposition Theorem](/theorems/3453) immediately implies: [quotetheorem:3433] [citeproof:3433] For uniquely ergodic systems, every $\mathcal{I}$-set has measure $0$ or $1$ (since $\mu$ is ergodic), and the ergodic component map is the constant map $x \mapsto \mu$. There is only one "world" — every orbit has the same statistical behaviour. In particular, unique ergodicity implies that the Birkhoff averages $\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$ converge to $\int f \, d\mu$ for $\mu$-almost every $x$, and for continuous $f$ the convergence is in fact uniform over all $x$ — a much stronger statement that will be the subject of Chapter 7. It is instructive to see what happens when unique ergodicity just barely fails, i.e., when $T$ has exactly two ergodic measures, $|\mathcal{E}_T| = 2$. Let $\mu_1$ and $\mu_2$ be these two measures; they are the extreme points of $\mathcal{M}_T$, which is then the line segment joining them. Every invariant measure has the form $p\mu_1 + (1-p)\mu_2$ for a unique $p \in [0,1]$, and its Choquet representation is the two-atom measure $p\delta_{\mu_1} + (1-p)\delta_{\mu_2}$. The system decomposes into exactly two ergodic pieces, each supporting its own distinct statistical behaviour. Unlike the uniquely ergodic case, a $\mu_1$-generic point $x$ will have Birkhoff averages converging to $\int f \, d\mu_1$, which may differ from $\int f \, d\mu_2$, the limit for $\mu_2$-generic points. There is no [uniform convergence](/page/Uniform%20Convergence) statement across all $x$: the two ergodic components simply coexist without interaction. [example: Irrational Rotation on the Circle] The rotation $T_\alpha: x \mapsto x + \alpha \pmod{1}$ on $\mathbb{T}^1$ for irrational $\alpha$ is the canonical uniquely ergodic system. Lebesgue measure $\lambda$ is the unique $T_\alpha$-invariant probability measure.
The Ergodic Decomposition is degenerate: $\mathcal{M}_{T_\alpha} = \{\lambda\}$, and the decomposition of $\lambda$ is $\lambda$ itself. Every point $x \in \mathbb{T}^1$ has $\mu_x = \lambda$ as its ergodic component. This should be contrasted with the rational rotation $T_{p/q}$ for $\gcd(p,q) = 1$: here every orbit is periodic of period $q$, and for each $x \in \mathbb{T}^1$ the ergodic measure $\mu_x = \frac{1}{q}\sum_{k=0}^{q-1} \delta_{T^k x}$ is the uniform measure on the finite orbit of $x$. Since there are uncountably many distinct orbits, there are uncountably many ergodic measures, and Lebesgue measure decomposes as \begin{align*} \lambda = q\int_0^{1/q} \frac{1}{q}\sum_{k=0}^{q-1} \delta_{x + kp/q} \, d\lambda(x) = \int_0^{1/q} \sum_{k=0}^{q-1} \delta_{x + kp/q} \, d\lambda(x), \end{align*} an integral of ergodic measures parametrised by one representative $x \in [0, 1/q)$ from each orbit; the factor $q$ normalises Lebesgue measure on $[0,1/q)$ to total mass $1$. [/example] The irrational rotation exhibits the simplest possible decomposition — a single ergodic measure with no structure to uncover. To see the full power of the decomposition, we need a system where the ergodic components are genuinely different from one another. Torus automorphisms provide a family where the character of the decomposition depends on the arithmetic of the matrix entries, ranging from degenerate (a single ergodic component) to a one-parameter family of components indexed by invariant fibres. [example: Torus Automorphisms — Ergodic vs Non-Ergodic] Consider the torus $\mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$ and the automorphism induced by a matrix $A \in SL(2, \mathbb{Z})$. Lebesgue measure $\lambda_2$ is always $T_A$-invariant. **Hyperbolic case** (e.g., the Arnold cat map $A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$): The eigenvalues of $A$ are $(3 \pm \sqrt{5})/2$; neither lies on the unit circle, so in particular neither is a root of unity, which is the criterion for ergodicity of a toral automorphism. The system $(\mathbb{T}^2, \lambda_2, T_A)$ is ergodic, so $\lambda_2$ is an extreme point of $\mathcal{M}_{T_A}$.
In particular, the ergodic decomposition of $\lambda_2$ is degenerate: $\mu_x = \lambda_2$ for $\lambda_2$-almost every $x$. (The space $\mathcal{M}_{T_A}$ itself contains many other invariant measures, such as those supported on periodic orbits, but $\lambda_2$ is indecomposable among them.) **Non-hyperbolic case** (e.g., $A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$, a shear): The map $T_A(x,y) = (x+y, y) \pmod{1}$ preserves each horizontal fibre $\{y = c\}$ and acts on it as the rotation $x \mapsto x + c \pmod{1}$. For irrational $c$, the rotation $x \mapsto x + c$ on the circle is ergodic with respect to Lebesgue measure $\lambda_1$: this is the standard result on irrational rotations (established in Chapter 3), because the only $T$-invariant $L^2$ functions are constants when $c \notin \mathbb{Q}$. Therefore $\lambda_1 \otimes \delta_c$ is the unique ergodic measure for the restricted dynamics $T_A|_{\{y=c\}}$. For rational $c = p/q$, the rotation on fibre $y = c$ has period $q$, so the fibre splits into a continuum of periodic orbits and $\lambda_1 \otimes \delta_c$ is not ergodic; since the rational heights form a Lebesgue-null set, this does not affect the decomposition. Lebesgue measure $\lambda_2$ is not ergodic (the set $\{(x,y) : y \in [0, 1/2)\}$ is invariant), and its ergodic decomposition is: \begin{align*} \lambda_2 = \int_0^1 (\lambda_1 \otimes \delta_c) \, dc, \end{align*} where $\lambda_1 \otimes \delta_c$ is Lebesgue measure on the horizontal fibre at height $c$. Since the set of irrational $c \in [0,1]$ has full Lebesgue measure, $\lambda_1 \otimes \delta_c$ is ergodic for Lebesgue-almost every $c$, confirming that the decomposition produces ergodic measures almost everywhere. [/example] The examples above — irrational rotations, rational rotations, hyperbolic torus automorphisms, and shear maps — illustrate the full range of decomposition behaviour, from degenerate (uniquely ergodic) through finite (direct sums) to continuous (fibre-by-fibre).
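The fibre-by-fibre picture for the shear can be checked numerically. The following is a minimal sketch, not part of the formal development: the observable $f(x,y) = y + \cos(2\pi x)$, the starting point $x_0 = 0.1$, the three irrational heights, and the number of iterates are all illustrative choices. On the fibre at irrational height $c$ the time average should approach $\int f \, d(\lambda_1 \otimes \delta_c) = c$, which varies with the fibre and differs from the global space average $\int f \, d\lambda_2 = 1/2$.

```python
import math

def shear(point):
    """One step of the shear T_A(x, y) = (x + y mod 1, y)."""
    x, y = point
    return ((x + y) % 1.0, y)

def birkhoff_average(f, point, n_terms):
    """Average of f over the first n_terms points of the orbit of `point`."""
    total = 0.0
    for _ in range(n_terms):
        total += f(point)
        point = shear(point)
    return total / n_terms

def observable(point):
    """Illustrative observable f(x, y) = y + cos(2*pi*x)."""
    x, y = point
    return y + math.cos(2 * math.pi * x)

# Three irrational fibre heights (illustrative choices).
heights = [math.sqrt(2) - 1, (math.sqrt(5) - 1) / 2, math.pi - 3]
averages = [birkhoff_average(observable, (0.1, c), 20000) for c in heights]

for c, avg in zip(heights, averages):
    print(f"height c = {c:.4f}: time average = {avg:.4f}")
```

Each printed time average should be close to the height $c$ of its fibre, so the limit is the non-constant invariant function $(x,y) \mapsto y$, in line with $\mathbb{E}_{\lambda_2}[f \mid \mathcal{I}](x,y) = \int f \, d\mu_{(x,y)}$.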
In each case, the [Ergodic Decomposition Theorem](/theorems/3453) produces a canonical assignment of ergodic components that respects the geometric structure of the dynamics. The broader significance of this decomposition extends beyond these examples to a general organisational principle for ergodic theory. [remark: Why the Ergodic Decomposition Matters] The [Ergodic Decomposition Theorem](/theorems/3453) is not merely a structural curiosity — it is an organisational principle for the entire theory. It says: to understand any invariant measure, it suffices to understand the ergodic measures. Every question about time averages, mixing, spectral properties, and entropy for a general invariant measure reduces, via the decomposition, to the same question for ergodic measures. This reduction principle will appear repeatedly in Chapter 7 (unique ergodicity and equidistribution) and Chapter 8 ([mixing hierarchy](/theorems/3436)), where properties of non-ergodic systems are understood by examining their ergodic components. [/remark]  --- # 7. Unique Ergodicity and Equidistribution The [Ergodic Decomposition Theorem](/theorems/3453) of Chapter 6 gave a complete structural picture of invariant measures: every invariant measure decomposes uniquely as a mixture of ergodic ones, parametrised by the invariant $\sigma$-algebra. At the extreme end of this spectrum sits a class of systems where the decomposition is trivial — not because the system lacks structure, but because it has as much as possible. When a system admits exactly one invariant probability measure, Cesàro averages of continuous observables must converge not merely for almost every starting point but uniformly over all starting points, with no exceptional set whatsoever. 
This chapter develops unique ergodicity from its definition through its key characterisation, proves Weyl's equidistribution theorem for polynomial sequences by induction on degree using the Weyl differencing method, and examines how unique ergodicity relates to — but does not coincide with — minimality. ## Unique Ergodicity ### Definition and First Examples Recall from Chapter 1 that the [Krylov–Bogolyubov theorem](/theorems/3423) guarantees that any continuous map on a compact [metric space](/page/Metric%20Space) admits at least one invariant Borel probability measure. The collection $\mathcal{M}_T$ of all $T$-invariant Borel probability measures on $X$ is a non-empty, convex, weak* compact set; Chapter 6 established that its extreme points are precisely the ergodic measures. Unique ergodicity demands that this entire convex set collapse to a single point. [definition: Uniquely Ergodic System] A continuous map $T: X \to X$ on a compact [metric space](/page/Metric%20Space) $X$ is **uniquely ergodic** if $\mathcal{M}_T$ consists of exactly one Borel probability measure. [/definition] When $\mathcal{M}_T = \{\mu\}$, the measure $\mu$ is automatically ergodic: if it were not, there would be an invariant set $A$ with $0 < \mu(A) < 1$, and the normalised restriction $B \mapsto \mu(B \cap A)/\mu(A)$ would be a second invariant probability measure, contradicting uniqueness. (Equivalently, a one-point convex set has its single point as an extreme point.) Unique ergodicity is therefore strictly stronger than ergodicity: an ergodic system can coexist with other invariant measures supported elsewhere, while a uniquely ergodic system rules out all competitors at once. [example: Irrational Rotations Are Uniquely Ergodic] Let $\alpha \in \mathbb{R} \setminus \mathbb{Q}$ and $T_\alpha: \mathbb{T} \to \mathbb{T}$, $T_\alpha(x) = x + \alpha \pmod 1$, on the circle $\mathbb{T} = \mathbb{R}/\mathbb{Z}$. Lebesgue measure $\lambda$ is $T_\alpha$-invariant. We show $\lambda$ is the only invariant Borel probability measure.
Let $\nu$ be any $T_\alpha$-invariant Borel probability measure. For each $k \in \mathbb{Z}$, the $k$-th Fourier coefficient of $\nu$ is \begin{align*} \hat{\nu}(k) = \int_{\mathbb{T}} e^{2\pi i k x} \, d\nu(x). \end{align*} Substituting $x \mapsto x + \alpha$ and using $T_\alpha$-invariance of $\nu$ gives $\hat{\nu}(k) = e^{2\pi i k \alpha} \hat{\nu}(k)$ for all $k$. Since $\alpha \notin \mathbb{Q}$, the factor $e^{2\pi i k \alpha} \neq 1$ for $k \neq 0$, so $(e^{2\pi i k \alpha} - 1)\hat{\nu}(k) = 0$ forces $\hat{\nu}(k) = 0$ for all $k \neq 0$. Since $\hat{\nu}(0) = 1 = \hat{\lambda}(0)$ and $\hat{\lambda}(k) = 0$ for $k \neq 0$, the Fourier coefficients of $\nu$ and $\lambda$ agree for every $k$, giving $\nu = \lambda$ because a finite Borel measure on $\mathbb{T}$ is determined by its Fourier coefficients. The argument fails for rational $\alpha = p/q$ in lowest terms: when $k$ is a multiple of $q$, the factor $e^{2\pi i k \alpha}$ equals $1$, leaving $\hat{\nu}(k)$ unconstrained. Indeed, for rational $\alpha$, the uniform probability measure on the finite orbit $\{x_0, x_0 + \tfrac{p}{q}, \ldots, x_0 + \tfrac{(q-1)p}{q}\}$, with mass $1/q$ at each point, is $T_\alpha$-invariant, and base points $x_0$ lying in different orbits yield different invariant measures. [/example]

### The Uniform Cesàro Characterisation

The [Birkhoff Ergodic Theorem](/theorems/518) guarantees, for an ergodic system $(X, \mathcal{B}, \mu, T)$ and $f \in L^1(\mu)$, that \begin{align*} \frac{1}{N} \sum_{n=0}^{N-1} f(T^n x) \to \int_X f \, d\mu \end{align*} for $\mu$-almost every $x$. The exceptional null set can depend on $f$, and the convergence need not hold at every point. Unique ergodicity removes both qualifications for continuous functions: convergence holds at every $x$ and is uniform in $x$. [quotetheorem:3455] [citeproof:3455] The proof of (i) $\implies$ (ii) is conceptually illuminating: failure of [uniform convergence](/page/Uniform%20Convergence) yields a second invariant measure by the weak* compactness argument, so unique ergodicity is exactly the obstruction to this construction.
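The contrast between uniform convergence and its failure can be probed numerically. The following Python sketch compares Cesàro averages of one continuous function along an irrational rotation and along a rational one. The helper names `cesaro_average` and `sup_deviation`, the observable $\cos(6\pi x)$, the grid of 200 starting points and the choice $N = 1000$ are illustrative assumptions, and a finite grid only approximates the supremum over all $x$.

```python
import math

def cesaro_average(f, alpha, x, N):
    """Average of f over the first N points of the orbit of x under x -> x + alpha mod 1."""
    return sum(f((x + n * alpha) % 1.0) for n in range(N)) / N

def sup_deviation(f, integral, alpha, N, grid=200):
    """Largest |average - integral| over a grid of starting points.

    This is only a finite proxy for the supremum over all x in the circle.
    """
    return max(abs(cesaro_average(f, alpha, j / grid, N) - integral)
               for j in range(grid))

# Continuous observable on the circle with Lebesgue integral 0.
f = lambda x: math.cos(6 * math.pi * x)

# Irrational rotation: the deviation should be small at every starting point.
irrational = sup_deviation(f, 0.0, math.sqrt(2), N=1000)

# Rational rotation by 1/3: f is constant along orbits, so the average
# at x is f(x) itself and depends on the starting point.
rational = sup_deviation(f, 0.0, 1 / 3, N=1000)

print(irrational, rational)
```

For the rational rotation the averages at $x = 0$ stay at $1$, which reflects the existence of several invariant measures; for $\alpha = \sqrt{2}$ the deviation is small across the whole grid, as uniform convergence predicts.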
Compactness of $X$ plays a silent but essential role throughout: it ensures weak* compactness of $\mathcal{M}_T$ and allows the empirical measures $\nu_k$ to have accumulation points. When $X$ is not compact (for instance, when $T$ is a translation on $\mathbb{R}$), the framework breaks down. The space $\mathcal{M}_T$ may be empty (there may be no invariant probability measure), and even when one exists, the Cesàro averages need not produce tight sequences of probability measures. Unique ergodicity in the non-compact setting requires additional hypotheses (such as the existence of a finite invariant measure and tightness of the empirical measures), and the [uniform convergence](/page/Uniform%20Convergence) in (ii) must be replaced by convergence on compact subsets. For these reasons, the theory of unique ergodicity is developed almost exclusively in the compact setting, where Krylov–Bogolyubov guarantees non-emptiness of $\mathcal{M}_T$ and all the compactness arguments go through without modification. [remark: Continuity Is Essential] The [uniform convergence](/page/Uniform%20Convergence) in the theorem requires continuity of $f$; for a general bounded measurable $f$, Birkhoff gives convergence only for $\lambda$-a.e. $x$. For the irrational rotation $T_\alpha$, let $f = \mathbf{1}_{O}$ be the indicator of the countable orbit $O = \{n\alpha \bmod 1 : n \ge 0\}$ of $0$. Then $f = 0$ $\lambda$-almost everywhere, so $\int f \, d\lambda = 0$, yet $f(T_\alpha^n x) = 1$ for every $x \in O$ and every $n \ge 0$, so the Cesàro averages are identically $1$ at every point of $O$. A single jump discontinuity is not enough to cause this failure: for $f = \mathbf{1}_{[0,1/2)}$, squeezing $f$ between continuous functions whose integrals differ by less than $\varepsilon$ shows that the averages still converge to $1/2$ at every $x$. [/remark]

## Weyl's Equidistribution Theorem for Polynomial Sequences

### Weyl's Criterion and the Problem

Unique ergodicity of irrational rotations gives, via the characterisation theorem, uniform equidistribution of linear sequences $\{n\alpha\}$ for all starting points.
A far-reaching generalisation, proved by Weyl in 1916, extends this to polynomial sequences: if $p(n)$ is any polynomial with at least one irrational non-constant coefficient (that is, an irrational coefficient of some $n^j$ with $j \ge 1$), then the sequence $(p(n) \bmod 1)_{n \ge 0}$ distributes itself uniformly over $[0,1)$. [definition: Equidistribution Modulo 1] A sequence $(a_n)_{n \ge 0}$ of [real numbers](/page/Real%20Numbers) is **equidistributed modulo 1** if for every subinterval $[c, d) \subset [0, 1)$, \begin{align*} \lim_{N \to \infty} \frac{1}{N} \#\{0 \le n \le N-1 : \{a_n\} \in [c, d)\} = d - c, \end{align*} where $\{t\} = t - \lfloor t \rfloor$ denotes the fractional part. [/definition] By the Weyl criterion, a sequence is equidistributed modulo 1 if and only if for every non-zero integer $k$, \begin{align*} \frac{1}{N} \sum_{n=0}^{N-1} e^{2\pi i k a_n} \to 0 \quad \text{as } N \to \infty. \end{align*} The equivalence follows from the uniform [density of trigonometric polynomials](/theorems/1219) in $C(\mathbb{T})$: the empirical measures $\frac{1}{N}\sum_{n=0}^{N-1} \delta_{\{a_n\}}$ converge weak* to Lebesgue measure on $\mathbb{T}$ if and only if all their non-trivial Fourier coefficients tend to 0. The irrationality assumption in Weyl's theorem cannot be dropped. To see concretely what goes wrong, consider the simplest rational case: let $p(n) = n/3$. Then the fractional parts $\{p(n)\}$ run through $0, 1/3, 2/3, 0, 1/3, 2/3, \ldots$, cycling through exactly three values with period 3. For any $N$ divisible by 3, the empirical measure $\frac{1}{N}\sum_{n=0}^{N-1} \delta_{\{n/3\}}$ is $\frac{1}{3}(\delta_0 + \delta_{1/3} + \delta_{2/3})$, which is supported on three atoms and bears no resemblance to Lebesgue measure. The Weyl criterion fails at $k = 3$: $\frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i \cdot 3 \cdot n/3} = \frac{1}{N}\sum_{n=0}^{N-1} 1 = 1 \not\to 0$.
More generally, if every non-constant coefficient of $p$ is rational and $b$ is a common denominator of these coefficients, then $p(n+b) - p(n) \in \mathbb{Z}$ for all $n$, so the sequence $\{p(n)\}$ is periodic with period dividing $b$ and equidistribution fails for the same reason. A rational leading coefficient alone is not an obstruction: $p(n) = n^2/3 + n\sqrt{2}$ still has an irrational non-constant coefficient and is covered by the theorem. [quotetheorem:3434] [citeproof:3434] The theorem requires at least one irrational non-constant coefficient, but it says nothing about how rapidly equidistribution is achieved. How quickly do the Cesàro averages settle down? The answer depends on how well $\alpha$ (or the relevant irrational coefficient) can be approximated by rationals. For Diophantine irrationals, those satisfying $|\alpha - p/q| \ge C/q^\kappa$ for some $C > 0$, $\kappa \ge 2$ and all rationals $p/q$, the discrepancy $D_N = \sup_{[c,d)} \left|\frac{1}{N}\#\{n < N : \{p(n)\} \in [c,d)\} - (d-c)\right|$ satisfies $D_N = O(N^{-\beta})$ for some $\beta > 0$ depending on $\kappa$. For Liouville numbers, irrationals approximable by rationals to any order, the convergence can be arbitrarily slow, and no uniform rate can be extracted from the irrationality condition alone. The theorem thus gives a qualitative conclusion (the Weyl sums tend to 0) but makes no promise about quantitative speed: that is a separate question belonging to the metric theory of Diophantine approximation.

### The Quadratic Case

What does the inductive machinery of van der Corput differencing actually produce when applied to the simplest non-linear case? Working out $d = 2$ explicitly reveals the mechanism in its purest form. [example: Equidistribution of $\{n^2 \alpha\}$] Let $\alpha \in \mathbb{R} \setminus \mathbb{Q}$. By Weyl's theorem, the sequence $(\{n^2 \alpha\})_{n \ge 0}$ is equidistributed modulo 1.
The mechanism of the Weyl differencing step is transparent here. For non-zero $k \in \mathbb{Z}$, write $S_N = \frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i k n^2 \alpha}$. Since $(n+h)^2 - n^2 = 2hn + h^2$, the van der Corput inequality gives, for every integer $H$ with $1 \le H \le N$, \begin{align*} |S_N|^2 \le \frac{N+H-1}{HN} + \frac{2(N+H-1)}{HN} \sum_{h=1}^{H-1} \left(1 - \frac{h}{H}\right) \left|\frac{1}{N}\sum_{n=0}^{N-1-h} e^{2\pi i (2k\alpha h) n}\right|. \end{align*} Each inner sum is a linear exponential average with frequency $2k\alpha h$. Since $\alpha$ is irrational and $k, h \neq 0$, the number $2k\alpha h$ is irrational, and by the base case of Weyl's theorem the inner average tends to 0 as $N \to \infty$ for each fixed $h$. Letting $N \to \infty$ with $H$ fixed leaves $\limsup_{N} |S_N|^2 \le 1/H$, and since $H$ is arbitrary, $S_N \to 0$. Concretely, take $\alpha = \sqrt{2}$. The fractional parts $\{n^2\sqrt{2}\}$ for $n = 0, 1, 2, \ldots$ scatter across $[0,1)$ without any periodic pattern, because the irrationality of $\sqrt{2}$ rules out the periodicity seen in the rational case. For large $N$, the proportion of these $N$ values lying in any subinterval $[c,d) \subset [0,1)$ is approximately $d - c$, with an error that tends to 0. [/example] The differencing step reduces the quadratic problem to the linear one at the cost of averaging over $h$. The key observation is that the resulting linear phases $2k\alpha h$ are irrational for every $h \ge 1$ precisely because $\alpha$ is irrational. If $\alpha = a/b$ were rational, the phase $2k\alpha h$ would be an integer for every $h$ divisible by $b$, the corresponding inner averages would tend to 1 rather than 0, and the argument would collapse. This reveals why the irrationality of the coefficient is genuinely needed for the differencing to work: the entire inductive step relies on irrationality propagating from the leading coefficient to all the differenced polynomials.
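The Weyl criterion for the quadratic sequence can be checked numerically. The sketch below computes the normalised exponential sums and an interval frequency for $n^2\sqrt{2}$, and contrasts them with the rational sequence $n^2/3$. The function names `weyl_sum` and `interval_frequency`, the cut-off $N = 20000$, the frequencies $k = 1, 2, 3$ and the interval $[0.2, 0.5)$ are illustrative choices, and floating-point arithmetic limits how large $N$ can sensibly be taken.

```python
import cmath
import math

def weyl_sum(seq, k, N):
    """Normalised exponential sum (1/N) * sum_{n<N} exp(2*pi*i*k*seq(n))."""
    return sum(cmath.exp(2j * math.pi * k * seq(n)) for n in range(N)) / N

def interval_frequency(seq, c, d, N):
    """Proportion of n < N whose fractional part of seq(n) lies in [c, d)."""
    return sum(1 for n in range(N) if c <= seq(n) % 1.0 < d) / N

quadratic_irrational = lambda n: n * n * math.sqrt(2)   # n^2 * sqrt(2)
quadratic_rational = lambda n: n * n / 3                # n^2 / 3

N = 20000

# Irrational case: every Weyl sum with k != 0 should be close to 0.
irrational_sums = [abs(weyl_sum(quadratic_irrational, k, N)) for k in (1, 2, 3)]

# Rational case: the Weyl sum at k = 3 is identically 1, so the criterion fails.
rational_sum = abs(weyl_sum(quadratic_rational, 3, N))

# Frequency of visits to [0.2, 0.5); equidistribution predicts about 0.3.
freq = interval_frequency(quadratic_irrational, 0.2, 0.5, N)

print(irrational_sums, rational_sum, freq)
```

The experiment only illustrates the qualitative statement; it says nothing rigorous about the rate of decay.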
From the Diophantine perspective, the discrepancy of $\{n^2\alpha\}$ over $N$ terms is $O(N^{-1/2 + \varepsilon})$ for any $\varepsilon > 0$ when $\alpha$ has bounded partial quotients in its continued fraction expansion (e.g., $\alpha = \sqrt{2}$), but can be much slower for $\alpha$ that are very well approximated by rationals. The sequence $\{n^2\alpha\}$ thus equidistributes at a rate that encodes the Diophantine quality of $\alpha$ in a precise way.

### The Skew-Product Perspective

The inductive proof via van der Corput is effective but obscures the geometric reason polynomial sequences equidistribute. What dynamical system underlies the phenomenon, and why does reducing degree by one correspond to a natural geometric operation? The quadratic sequence $\{n^2 \alpha\}$ is closely related to the orbit of a specific point under the two-dimensional skew product \begin{align*} T(x, y) = (x + \alpha, \, y + x) \pmod 1 \end{align*} on $\mathbb{T}^2$. Starting from $(0, 0)$, one computes $T^n(0, 0) = (n\alpha \bmod 1, \, \tfrac{n(n-1)}{2}\alpha \bmod 1)$. The second coordinate is $\tfrac{n(n-1)}{2}\alpha = \tfrac{\alpha}{2}n^2 - \tfrac{\alpha}{2}n$, a quadratic polynomial in $n$ with irrational leading coefficient. To obtain $n^2\alpha$ exactly, replace the rotation number by $2\alpha$ and start from $(\alpha, 0)$: the second coordinate of the $n$-th iterate is then $\sum_{j=0}^{n-1} (\alpha + 2j\alpha) = n^2\alpha \bmod 1$. The map $T$ preserves Lebesgue measure on $\mathbb{T}^2$ and is an example of a **nilrotation**, that is, a rotation on a nilmanifold. When $\alpha \notin \mathbb{Q}$, it is both minimal and uniquely ergodic, and the equidistribution of $\{n^2\alpha\}$ follows from the unique ergodicity of $T$ (applied with $2\alpha$ in place of $\alpha$) together with the fact that Lebesgue measure on $\mathbb{T}^2$ projects to Lebesgue measure on the second coordinate.  This nilrotation viewpoint generalises: Weyl's theorem for degree-$d$ polynomials corresponds to unique ergodicity of a rotation on a degree-$d$ nilmanifold, a perspective central to the Host–Kra structure theorem and the proof of Szemerédi's theorem via ergodic methods (topics for Ergodic Theory II).
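The orbit formula for the skew product is easy to confirm by direct iteration. The following Python sketch iterates $T(x, y) = (x + \alpha, y + x) \bmod 1$ and compares the second coordinate with the closed forms. The helper names `skew_orbit_second_coordinate` and `circle_distance` are hypothetical, $\alpha = \sqrt{2}$ and 500 steps are arbitrary choices, and the second check uses the variant with rotation number $2\alpha$ and starting point $(\alpha, 0)$, for which the second coordinate is exactly $n^2\alpha \bmod 1$.

```python
import math

def skew_orbit_second_coordinate(alpha, start, steps):
    """Iterate T(x, y) = (x + alpha, y + x) mod 1; return [y_0, ..., y_steps]."""
    x, y = start
    ys = [y]
    for _ in range(steps):
        # Tuple assignment evaluates the right-hand side first,
        # so y + x uses the value of x before the rotation step.
        x, y = (x + alpha) % 1.0, (y + x) % 1.0
        ys.append(y)
    return ys

def circle_distance(a, b):
    """Distance between a and b on the circle R/Z."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)

alpha = math.sqrt(2)
steps = 500

# Start at (0, 0): the second coordinate should be n(n-1)/2 * alpha mod 1.
ys = skew_orbit_second_coordinate(alpha, (0.0, 0.0), steps)
err_triangular = max(
    circle_distance(ys[n], (n * (n - 1) // 2 * alpha) % 1.0)
    for n in range(steps + 1)
)

# Rotation number 2*alpha, start at (alpha, 0): second coordinate n^2 * alpha mod 1.
ys_sq = skew_orbit_second_coordinate(2 * alpha, (alpha % 1.0, 0.0), steps)
err_square = max(
    circle_distance(ys_sq[n], (n * n * alpha) % 1.0)
    for n in range(steps + 1)
)

print(err_triangular, err_square)
```

Both errors are at the level of floating-point rounding, which is what the identities predict.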
## Minimality Versus Unique Ergodicity ### Minimality as a Topological Counterpart Unique ergodicity is a measure-theoretic indecomposability condition. Its natural topological counterpart — demanding that orbits be dense rather than measures be unique — is minimality. [definition: Minimal System] A continuous map $T: X \to X$ on a compact [metric space](/page/Metric%20Space) $X$ is **minimal** if every orbit $\{T^n x : n \ge 0\}$ is dense in $X$. [/definition] Equivalently, $T$ is minimal if and only if $X$ has no proper closed non-empty $T$-invariant subset: any [closed set](/page/Closed%20Set) $F \subset X$ with $T(F) \subset F$ must satisfy $F = \varnothing$ or $F = X$. [example: Irrational Rotations Are Minimal] For $\alpha \notin \mathbb{Q}$, the rotation $T_\alpha$ on $\mathbb{T}$ is minimal. Fix $x \in \mathbb{T}$ and any open arc $U$ of length $\varepsilon > 0$. Divide $\mathbb{T}$ into $\lceil 1/\varepsilon \rceil + 1$ arcs of length less than $\varepsilon$. By the pigeonhole principle, among the $\lceil 1/\varepsilon \rceil + 2$ points $x, x+\alpha, \ldots, x + (\lceil 1/\varepsilon \rceil + 1)\alpha \pmod 1$, at least two land in the same arc, say $x + m\alpha$ and $x + n\alpha$ with $m > n$. Then $(m-n)\alpha \pmod 1$ has absolute value less than $\varepsilon$. The orbit $\{k(m-n)\alpha \bmod 1 : k \ge 0\}$ is a sequence of steps of size less than $\varepsilon$ around the circle, so it hits every arc of length $\varepsilon$. In particular, $\{T_\alpha^{k(m-n)}(x)\}$ is dense, and since every orbit of $T_\alpha^{m-n}$ is contained in an orbit of $T_\alpha$, the orbit of $x$ under $T_\alpha$ is also dense. [/example] Thus irrational rotations are simultaneously minimal and uniquely ergodic. This coexistence, while natural for rotations, is not automatic. ### Independence of the Two Properties Despite the parallel definitions, minimality and unique ergodicity are genuinely independent. Neither implies the other. 
[quotetheorem:3435] [citeproof:3435] The two examples illustrate a structural asymmetry. In the Furstenberg example, the space $\mathbb{T}^2$ is homogeneous enough that every orbit is dense, yet the metric structure is rich enough to support two genuinely different statistical behaviours: orbits visit different regions of the torus with different long-run frequencies depending on the starting fibre. In the second example, the geometry is almost degenerate: the fixed point $0$ is an attractor for the dynamics, and every orbit eventually collapses toward it, producing a unique invariant measure concentrated at the attractor but at the cost of destroying density of orbits. These contrasting mechanisms suggest that the relationship between minimality and unique ergodicity is mediated by how the invariant measure interacts with the topology of the space, specifically whether the measure sees all of $X$ or only part of it. [explanation: When Minimality and Unique Ergodicity Coincide] The two properties interact most cleanly when the unique invariant measure has full support. Recall that the support of a Borel measure $\mu$ on $X$ is the smallest [closed set](/page/Closed%20Set) of full measure, or equivalently $\operatorname{supp} \mu = \{x \in X : \mu(U) > 0 \text{ for every open neighbourhood } U \text{ of } x\}$. If $T$ is uniquely ergodic with measure $\mu$ and $\mu$ has full support (meaning $\mu(U) > 0$ for every non-empty open $U \subset X$), then $T$ is minimal. The argument: let $F \subsetneq X$ be a non-empty proper closed set with $T(F) \subset F$. The restriction $T|_F$ is a continuous map on the compact metric space $F$, so by the [Krylov–Bogolyubov theorem](/theorems/3423) it admits an invariant Borel probability measure $\nu$. Extending $\nu$ to $X$ by setting $\nu(X \setminus F) = 0$ gives a $T$-invariant measure on $X$, which equals $\mu$ by unique ergodicity. This forces $\mu(F) = 1$, so $\mu(X \setminus F) = 0$. But $X \setminus F$ is non-empty and open, so full support gives $\mu(X \setminus F) > 0$, a contradiction.
Conversely, minimality forces every invariant measure to have full support: the support of an invariant measure is a non-empty closed set $F$ with $T(F) \subset F$, hence equal to $X$. Minimality does not, however, force uniqueness of the invariant measure, as the Furstenberg example shows. For minimal systems that do have a unique invariant measure $\mu$, the characterisation theorem guarantees that the Cesàro averages of any $f \in C(X)$ converge to $\int f \, d\mu$ uniformly in $x$, and the measure $\mu$ automatically has full support by minimality. Such systems are called **strictly ergodic**. [/explanation]

### Strict Ergodicity and Sturmian Systems

When a system is both minimal and uniquely ergodic, the uniform Cesàro convergence from the characterisation theorem takes its strongest form: time averages converge to space averages uniformly over all starting points, and the unique measure sees all of $X$. [example: Sturmian Subshifts] Fix $\alpha \in (0,1) \setminus \mathbb{Q}$ and define the Sturmian sequence $\omega \in \{0,1\}^{\mathbb{Z}}$ by \begin{align*} \omega_n = \begin{cases} 1 & \text{if } \{n\alpha\} \in [0, \alpha), \\ 0 & \text{otherwise.} \end{cases} \end{align*} The orbit closure $X_\alpha = \overline{\{\sigma^k \omega : k \in \mathbb{Z}\}}$ under the left shift $\sigma$ forms a compact shift-invariant subset of $\{0,1\}^{\mathbb{Z}}$ called the Sturmian subshift. The system $(X_\alpha, \sigma)$ is:
- **Minimal**: every orbit under $\sigma$ is dense in $X_\alpha$, because $X_\alpha$ is precisely the set of all sequences whose every finite word appears in $\omega$, and the irrational rotation $T_\alpha$ is minimal, so every finite word of $\omega$ recurs in every element of $X_\alpha$.
- **Uniquely ergodic**: the unique invariant measure $\mu_\alpha$ is the pushforward of Lebesgue measure on $\mathbb{T}$ under the coding map $x \mapsto (\mathbf{1}_{[0,\alpha)}(x + n\alpha))_{n \in \mathbb{Z}}$.
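The coding map lends itself to a quick numerical experiment on symbol frequencies. The sketch below codes the rotation orbits of 50 different starting points and records how far the frequency of the symbol $1$ in the first $N$ symbols deviates from $\alpha$. The names `sturmian_coding` and `symbol_frequency`, the golden-mean value of $\alpha$, the grid of starting points and $N = 2000$ are assumptions made for illustration only.

```python
import math

def sturmian_coding(alpha, x, N):
    """First N symbols of the coding of x: 1 if (x + n*alpha) mod 1 is in [0, alpha), else 0."""
    return [1 if (x + n * alpha) % 1.0 < alpha else 0 for n in range(N)]

def symbol_frequency(word):
    """Proportion of 1s in a finite 0/1 word."""
    return sum(word) / len(word)

alpha = (math.sqrt(5) - 1) / 2      # an irrational in (0, 1), chosen for illustration
N = 2000
starts = [j / 50 for j in range(50)]

# Worst deviation of the frequency of 1 from alpha over all starting points tried.
worst = max(
    abs(symbol_frequency(sturmian_coding(alpha, x, N)) - alpha)
    for x in starts
)

print(worst)
```

The deviation is of order $1/N$ for every starting point tried, consistent with convergence that is uniform rather than merely almost everywhere.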
By the characterisation theorem, for any $f \in C(X_\alpha)$, \begin{align*} \frac{1}{N} \sum_{n=0}^{N-1} f(\sigma^n \omega') \to \int_{X_\alpha} f \, d\mu_\alpha \end{align*} uniformly over all $\omega' \in X_\alpha$. In particular, the frequency of the symbol $1$ in any sequence $\omega' \in X_\alpha$ equals $\alpha$ for every sequence, not just $\mu_\alpha$-almost every sequence: \begin{align*} \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \omega'_n = \alpha \quad \text{for all } \omega' \in X_\alpha. \end{align*} This uniformity — that the digit frequency is the same from every starting sequence in the subshift — is precisely the signature of strict ergodicity. [/example] The comparison between the Birkhoff theorem and the unique ergodicity characterisation is instructive: Birkhoff gives $\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) \to \mathbb{E}_\mu[f \mid \mathcal{I}](x)$ for $\mu$-a.e. $x$ and $f \in L^1(\mu)$; the characterisation gives $\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) \to \int f \, d\mu$ uniformly over all $x \in X$ for $f \in C(X)$. The gain is that the exceptional null set disappears entirely, and convergence is uniform; the cost is that we need the much stronger hypothesis of unique ergodicity rather than ergodicity, and we restrict to continuous (rather than $L^1$) functions. [remark: Connection to Ergodic Decomposition] From the perspective of Chapter 6, unique ergodicity means the ergodic decomposition measure on the space of ergodic measures is a Dirac mass concentrated at $\mu$. The disintegration over the invariant $\sigma$-algebra $\mathcal{I}$ is trivial: $\mathcal{I}$ is $\mu$-trivial (every invariant set has measure 0 or 1), and the conditional measures in the disintegration are all equal to $\mu$. This collapses the Birkhoff limit $\mathbb{E}_\mu[f \mid \mathcal{I}]$ to the constant $\int f \, d\mu$, removing the dependence on $x$ in the limit — which is the content of the characterisation theorem restated in decomposition language. 
[/remark] The chapter has established three connected results: (1) unique ergodicity is equivalent to uniform Cesàro convergence for all continuous observables; (2) polynomial sequences with at least one irrational non-constant coefficient are equidistributed, by inductively reducing the degree via Weyl differencing; (3) unique ergodicity and minimality are independent properties, though they frequently coexist in natural systems. Chapter 8 turns to a finer hierarchy of statistical properties — strong mixing, weak mixing — which concern not the long-run distribution of orbits but the asymptotic independence of events separated by long time intervals, a fundamentally different kind of dynamical regularity. --- # 8. The Mixing Hierarchy With ergodicity, unique ergodicity, and Weyl's equidistribution theorem now established, the course turns to a finer question: how rapidly does a measure-preserving system lose memory of its initial state? Ergodicity guarantees that time averages equal space averages, and unique ergodicity strengthens this to [uniform convergence](/page/Uniform%20Convergence) over all starting points — but neither property says anything about the speed or manner in which the orbit of a set $T^{-n}A$ becomes statistically independent of another set $B$. The [mixing hierarchy](/theorems/3436) formalises this notion of asymptotic independence. At the top sits strong mixing, which demands that $\mu(T^{-n}A \cap B) \to \mu(A)\mu(B)$ for every pair of measurable sets. Weak mixing relaxes this to a Cesàro (time-averaged) version of the same condition. Below both sits ergodicity. This chapter proves the two implications — strong implies weak, weak implies ergodic — establishes by explicit examples that both are strict, characterises weak mixing spectrally via the [Koopman–von Neumann lemma](/theorems/3457), and closes with multiple mixing and Rokhlin's problem. ## Strong Mixing How rapidly does a measure-preserving system lose memory of its initial state? 
Ergodicity guarantees time averages converge to space averages, but says nothing about the speed at which two initially correlated events become statistically decoupled. The strongest possible answer, that correlations vanish completely, along every subsequence, for every pair of measurable sets, is captured by the notion of strong mixing. [motivation]

### Why Strong Mixing?

Consider a physical fluid being stirred. After sufficient time, a drop of ink placed anywhere in the fluid should be distributed uniformly throughout: any small region of the fluid should contain approximately the same proportion of ink regardless of where the drop started. Translated into the language of measure-preserving systems, if $B$ is a "region" and $A$ is the initial ink drop, then $T^{-n}A$ is the location of the evolved ink at time $n$. The proportion of $B$ occupied by the evolved ink is $\mu(T^{-n}A \cap B) / \mu(B)$. Asymptotic independence means this proportion approaches $\mu(A)$, the overall density of the ink, regardless of the sets $A$ and $B$ chosen. This is exactly the definition of strong mixing. [/motivation] The physical picture suggests a precise mathematical condition. The set $T^{-n}A$, the preimage of $A$ under the $n$-th iterate of $T$, consists of the states that land in $A$ at time $n$. The measure $\mu(T^{-n}A \cap B)$ is the proportion of the state space that lies in $B$ now and will lie in $A$ at time $n$. If the system loses all memory of its past, this quantity should factor: $\mu(T^{-n}A \cap B) \approx \mu(A)\mu(B)$, the product of the individual probabilities. Strong mixing is the requirement that this factorisation becomes exact in the limit for every pair of measurable sets, with no uniformity over the pair required.
[definition: Strong Mixing] A measure-preserving transformation $T: (X, \mathcal{B}, \mu) \to (X, \mathcal{B}, \mu)$ is **strongly mixing** (or simply **mixing**) if for every $A, B \in \mathcal{B}$, \begin{align*} \mu(T^{-n}A \cap B) \to \mu(A)\mu(B) \quad \text{as } n \to \infty. \end{align*} [/definition] The quantity $\mu(T^{-n}A \cap B) - \mu(A)\mu(B)$ measures the correlation between the event $B$ and the event that the system lies in $A$ after $n$ steps. Strong mixing requires these correlations to decay to zero for every pair of sets, with no averaging over time. The condition reformulates naturally in $L^2$. Passing to indicator functions and extending by linearity and density, $T$ is strongly mixing if and only if for all $f, g \in L^2(X, \mu)$, \begin{align*} (U_T^n f, g)_{L^2} \to \int f \, d\mu \int \bar{g} \, d\mu \quad \text{as } n \to \infty, \end{align*} where $U_T: L^2(X,\mu) \to L^2(X,\mu)$, $U_T f = f \circ T$, is the Koopman operator. Writing $f_0 = f - \int f \, d\mu$ and $g_0 = g - \int g \, d\mu$ for the centred versions, this reduces to $(U_T^n f_0, g_0)_{L^2} \to 0$ for all $f_0, g_0 \in L^2_0$, the subspace of zero-mean functions in $L^2(X,\mu)$.

### Spectral Characterisation of Strong Mixing

The Koopman operator $U_T$ is a unitary operator on $L^2(X, \mu)$. The spectral theorem assigns to each $f \in L^2$ a Borel measure $\sigma_f$ on the unit circle $\mathbb{T} = \{z \in \mathbb{C} : |z| = 1\}$, the **spectral measure** of $f$, characterised by \begin{align*} (U_T^n f, f)_{L^2} = \int_{\mathbb{T}} z^n \, d\sigma_f(z) \quad \text{for all } n \in \mathbb{Z}. \end{align*} The right-hand side is the $n$-th Fourier coefficient $\hat{\sigma}_f(n)$ of the measure $\sigma_f$. Whether $\hat{\sigma}_f(n) \to 0$ is a question about the fine structure of $\sigma_f$: decay of the Fourier coefficients forces $\sigma_f$ to have no atoms, but the absence of atoms alone does not guarantee decay. [quotetheorem:3456] [citeproof:3456] The spectral characterisation translates a dynamical question — do correlations decay?
— into a measure-theoretic question about the Fourier coefficients of a measure on the circle. It is the starting point for constructing systems that distinguish between the various levels of the [mixing hierarchy](/theorems/3436), since one can engineer spectral measures with prescribed Fourier decay properties. [example: Bernoulli Shifts Are Strongly Mixing] Let $(\Omega^{\mathbb{Z}}, \mu, \sigma)$ be a two-sided Bernoulli shift with alphabet $\Omega$ and product measure $\mu = p^{\otimes \mathbb{Z}}$ for some probability vector $p$. Take cylinder sets $A = [a_{-k} \cdots a_0]$ determined by coordinates $\{-k, \ldots, 0\}$ and $B = [b_{-l} \cdots b_0]$ determined by coordinates $\{-l, \ldots, 0\}$. For $n > k + l$, the set $\sigma^{-n}A$ is determined by coordinates $\{n-k, \ldots, n\}$, which is disjoint from $\{-l, \ldots, 0\}$. The product structure of $\mu$ gives independence over disjoint coordinate windows, so \begin{align*} \mu(\sigma^{-n}A \cap B) = \mu(A)\mu(B) \end{align*} exactly once the windows separate, which is a stronger statement than the limit condition. Since cylinder sets generate $\mathcal{B}$ and every $A, B \in \mathcal{B}$ can be approximated in measure by cylinder sets, the identity passes to all measurable sets in the limit. The argument extends to any Bernoulli shift. [/example] The exact independence (not merely approximate) for Bernoulli shifts once coordinate windows separate is characteristic of the product structure. No other class of systems achieves such a clean verification of strong mixing. ## Weak Mixing Strong mixing demands that correlations $\mu(T^{-n}A \cap B) - \mu(A)\mu(B)$ converge to zero along every subsequence. Weak mixing relaxes this: the correlations need only converge in Cesàro (time) average. 
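The difference between convergence of correlations and convergence of their time averages can be seen numerically on a system that is not mixing. The Python sketch below takes the irrational rotation $x \mapsto x + \alpha \bmod 1$ with $A = B = [0, 1/2)$, for which the correlation has the closed form $\tfrac{1}{4} - \|n\alpha\|$, where $\|\cdot\|$ is the distance to the nearest integer. The function name `rotation_correlation`, the choice $\alpha = \sqrt{2}$ and the cut-off $N = 100000$ are illustrative assumptions.

```python
import math

def rotation_correlation(alpha, n):
    """lambda(T^{-n}A intersect A) - lambda(A)^2 for A = [0, 1/2) under x -> x + alpha mod 1."""
    t = (n * alpha) % 1.0
    d = min(t, 1.0 - t)          # distance from n*alpha to 0 on the circle
    # Two half-circles offset by d overlap in a set of length 1/2 - d.
    return (0.5 - d) - 0.25

alpha = math.sqrt(2)
N = 100000
corr = [rotation_correlation(alpha, n) for n in range(N)]

# Signed Cesaro average: tends to 0 (this is what ergodicity gives).
signed_average = sum(corr) / N

# Cesaro average of absolute values: tends to 1/8, not 0.
absolute_average = sum(abs(c) for c in corr) / N

# The correlations themselves keep returning close to +-1/4.
largest_late = max(abs(c) for c in corr[N // 2:])

print(signed_average, absolute_average, largest_late)
```

The signed average vanishes while the average of absolute values settles near $1/8$, so for this rotation the correlations cancel on average without becoming small.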
[definition: Weak Mixing] A measure-preserving transformation $T: (X, \mathcal{B}, \mu) \to (X, \mathcal{B}, \mu)$ is **weakly mixing** if for every $A, B \in \mathcal{B}$, \begin{align*} \frac{1}{N}\sum_{n=0}^{N-1} \left|\mu(T^{-n}A \cap B) - \mu(A)\mu(B)\right| \to 0 \quad \text{as } N \to \infty. \end{align*} [/definition] The absolute values are essential. Without them, the Cesàro convergence $\frac{1}{N}\sum_{n=0}^{N-1}[\mu(T^{-n}A \cap B) - \mu(A)\mu(B)] \to 0$ holds for every ergodic transformation; this is exactly what the Mean Ergodic Theorem gives for indicator functions. The absolute values encode the additional content of weak mixing over mere ergodicity: the correlations must become small on average, not just cancel on average. An equivalent characterisation is that there exists a set $J \subset \mathbb{N}$ of **density zero**, meaning $|J \cap \{0, \ldots, N-1\}|/N \to 0$ as $N \to \infty$, such that \begin{align*} \mu(T^{-n}A \cap B) \to \mu(A)\mu(B) \quad \text{as } n \to \infty,\; n \notin J. \end{align*} In other words, the correlations converge along a subsequence of full density, though they may oscillate on an exceptional density-zero set, which may depend on $A$ and $B$.

### The Koopman–von Neumann Lemma

The deepest characterisation of weak mixing is spectral. It connects the time-averaging condition on correlations to the absence of non-constant eigenfunctions of the Koopman operator, a purely spectral statement. [quotetheorem:3457] [citeproof:3457] The lemma identifies four distinct facets of the same property: a time-averaging condition on correlations (i), a spectral purity condition (ii), a uniform $L^2$ correlation decay condition (iii), and an ergodicity condition for the product system (iv). Condition (iv) is particularly useful for proving that certain systems are not weakly mixing by exhibiting a non-trivial invariant function for the product system. Condition (ii) is sometimes stated as: $U_T$ has **purely continuous spectrum** on $L^2_0$.
This is the spectral signature that distinguishes weakly mixing systems from ergodic systems that are not weakly mixing. Irrational rotations are the canonical example of the latter. [example: Irrational Rotations Are Ergodic but Not Weakly Mixing] Let $T_\alpha(x) = x + \alpha \pmod 1$ on $([0,1), \mathcal{B}, \lambda)$ with $\alpha \in \mathbb{R} \setminus \mathbb{Q}$. **Ergodicity.** The exponentials $e_n(x) = e^{2\pi i n x}$ for $n \in \mathbb{Z}$ form an orthonormal basis of $L^2([0,1))$, and $U_{T_\alpha} e_n = e^{2\pi i n \alpha} e_n$. An invariant function $f = \sum_n \hat{f}(n) e_n$ satisfies $e^{2\pi i n \alpha}\hat{f}(n) = \hat{f}(n)$ for all $n$. Since $\alpha \notin \mathbb{Q}$, we have $e^{2\pi i n \alpha} \neq 1$ for $n \neq 0$, forcing $\hat{f}(n) = 0$ for all $n \neq 0$. Thus $f$ is constant, and $T_\alpha$ is ergodic. **Failure of weak mixing.** Each $e_n$ for $n \neq 0$ is a non-constant eigenfunction of $U_{T_\alpha}$ with eigenvalue $e^{2\pi i n\alpha} \in \mathbb{T}$. By condition (ii) of the [Koopman–von Neumann lemma](/theorems/3457), $T_\alpha$ is not weakly mixing. Indeed, the spectral measure of $e_n$ is the Dirac mass $\sigma_{e_n} = \delta_{e^{2\pi i n\alpha}}$, whose Fourier coefficients are $e^{2\pi i n\alpha \cdot k}$ — constant in modulus and never decaying. Thus irrational rotations, which Chapter 7 established are ergodic and uniquely ergodic, fail the next level of the hierarchy: their eigenvalues $\{e^{2\pi i n\alpha} : n \in \mathbb{Z}\}$ form a countable dense subgroup of $\mathbb{T}$, giving a rich discrete spectrum. [/example] ## The Implications and Their Strictness The mixing conditions organise into a proper hierarchy. Each level implies the one below, and no implication reverses. [illustration:mixing-hierarchy-venn] [quotetheorem:3436] The theorem tells us that the three conditions are distinct levels of a hierarchy, but it says nothing about how rapidly the convergence at each level occurs. 
Strong mixing asserts $\mu(T^{-n}A \cap B) \to \mu(A)\mu(B)$ without specifying any rate. In fact, no uniform rate is possible: for any sequence $a_n \to 0$, there exist strongly mixing systems whose correlations decay no faster than $a_n$ for some pair of sets $A, B$. Quantitative mixing — polynomial decay, exponential decay — requires additional structure beyond the definition, such as hyperbolicity or explicit spectral gap estimates, and is not part of the present hierarchy. [citeproof:3436] **Strictness of weak mixing $\implies$ ergodic.** [Irrational rotations are ergodic](/theorems/3429) but not weakly mixing, as the example above shows. **Strictness of strong mixing $\implies$ weak mixing.** This requires more work. Gaussian systems furnish the cleanest examples. [example: Weakly but Not Strongly Mixing — Gaussian Systems] A **Gaussian system** is built from a unitary operator $V$ on a real [Hilbert space](/page/Hilbert%20Space) $H$ and a unit vector $e \in H$. One constructs a Gaussian stationary process $(Y_n)_{n \in \mathbb{Z}}$ with covariance $\mathbb{E}[Y_m Y_n] = (V^{n-m} e, e)_H$, which induces a shift-invariant Gaussian measure $\mu_G$ on $\mathbb{R}^{\mathbb{Z}}$. The spectral measure $\sigma$ of $V$ relative to $e$ governs the mixing properties: the two-point correlation decays like $\hat{\sigma}(n) = (V^n e, e)_H$. The system is weakly mixing iff $\sigma$ is a continuous measure (no atoms); it is strongly mixing iff $\hat{\sigma}(n) \to 0$. Now choose $\sigma$ to be the middle-thirds Cantor measure on $\mathbb{T}$ — the self-similar measure supported on the [Cantor set](/page/Cantor%20Set) embedded in $[0,1)$, viewed as a subset of $\mathbb{T}$. This measure is continuous (no atoms), so the Gaussian system is weakly mixing. 
The Fourier coefficients of this specific measure do not tend to zero. Writing a $\sigma$-distributed point as $\sum_{j \ge 1} 2\varepsilon_j 3^{-j}$ with independent fair digits $\varepsilon_j \in \{0, 1\}$ gives the product formula $|\hat{\sigma}(n)| = \prod_{j=1}^{\infty} |\cos(2\pi n/3^{j})|$. For $n = 3^k$ the factors with $j \le k$ equal $|\cos(2\pi \cdot 3^{k-j})| = 1$, so $|\hat{\sigma}(3^k)| = \prod_{m=1}^{\infty} |\cos(2\pi/3^{m})| = |\hat{\sigma}(1)| \approx 0.37$ for every $k \ge 0$, and the coefficients do not decay along the subsequence $n = 3^k$. (More precisely, the self-similarity of the Cantor measure under scaling by $3$ forces the multiplicative recurrence $|\hat\sigma(3n)| = |\hat\sigma(n)|$, which prevents decay; see Zygmund, *Trigonometric Series*, Vol. I, §IV.2.) Consequently the correlations $\hat\sigma(n) = (V^n e, e)_H$ do not converge to zero, and the Gaussian system is not strongly mixing. This is a genuine example: the gap between weak and strong mixing is not an artefact of a special construction but appears whenever the spectral measure is continuous and its Fourier coefficients fail to tend to zero. [/example] The spectral picture summarises the distinctions cleanly. For an ergodic $T$ with Koopman operator $U_T$ on $L^2_0(X,\mu)$: - **Ergodic but not weakly mixing**: $U_T$ has eigenvalues on $\mathbb{T}$. The eigenfunctions span the discrete spectrum component. Example: irrational rotations, eigenvalues $\{e^{2\pi i n\alpha} : n \in \mathbb{Z}\}$. - **Weakly mixing**: $U_T$ has purely continuous spectrum on $L^2_0$ — no eigenvalues. Every spectral measure is a continuous Borel measure on $\mathbb{T}$. - **Strongly mixing**: $U_T$ has purely continuous spectrum and moreover all spectral measures satisfy $\hat{\sigma}_f(n) \to 0$. By the Riemann–Lebesgue lemma for measures, this holds when the spectral measures are absolutely continuous with respect to Lebesgue measure on $\mathbb{T}$. A singular continuous spectral measure may or may not satisfy $\hat{\sigma}_f(n) \to 0$: a **Rajchman measure** is a finite Borel measure whose Fourier coefficients do tend to zero, and there exist singular continuous Rajchman measures (first constructed by Menshov).
A Gaussian system whose spectral measure is a singular continuous Rajchman measure is therefore strongly mixing despite having singular continuous spectrum. By contrast, the middle-thirds Cantor measure is non-Rajchman, so the Gaussian system built from it is weakly but not strongly mixing. The key distinction is whether the spectral measure is Rajchman, not whether it is singular continuous. ## Multiple Mixing and Rokhlin's Problem Both ergodicity and strong mixing are two-set conditions. Ergodicity asks that $\mu(T^{-n}A \cap B)$ converge to $\mu(A)\mu(B)$ on average, that is, $\frac{1}{N}\sum_{n=0}^{N-1}\mu(T^{-n}A \cap B) \to \mu(A)\mu(B)$; strong mixing asks that $\mu(T^{-n}A \cap B) \to \mu(A)\mu(B)$ without averaging. One can demand independence of arbitrarily many iterates simultaneously. [definition: Mixing of Order $k$] A measure-preserving transformation $T: (X, \mathcal{B}, \mu) \to (X, \mathcal{B}, \mu)$ is **mixing of order $k$** (for an integer $k \ge 2$) if for all $A_0, A_1, \ldots, A_{k-1} \in \mathcal{B}$ and all sequences $0 = n_0 < n_1 < n_2 < \cdots < n_{k-1}$ with $n_j - n_{j-1} \to \infty$ for each $j$, \begin{align*} \mu\!\left(A_0 \cap T^{-n_1}A_1 \cap T^{-n_2}A_2 \cap \cdots \cap T^{-n_{k-1}}A_{k-1}\right) \to \mu(A_0)\mu(A_1)\cdots\mu(A_{k-1}). \end{align*} The system is **mixing of all orders** if it is mixing of order $k$ for every $k \ge 2$. [/definition] Mixing of order $2$ is exactly strong mixing. Mixing of order $3$ requires three arbitrarily spread-apart iterates to become mutually independent. The condition is formally stronger: in general, three-way independence cannot be deduced from pairwise independence. Whether it is strictly stronger for measure-preserving transformations is the following open problem. [quotetheorem:3437] This problem, posed by Rokhlin in the 1940s, remains open in full generality. No counterexample — a strongly mixing system that fails to be mixing of order $3$ — has been found. Nor has any proof of the implication been established for general systems.
The difficulty is that knowing two iterates become asymptotically independent gives no algebraic control over three or more iterates simultaneously. Partial results show the answer is affirmative in special cases. For **rank-one systems**, strong mixing does imply mixing of all orders, a result due to Kalikow; Ryzhikov extended it to **finite-rank systems**, those whose Rokhlin towers can be built with a uniformly bounded number of columns. Host proved that if the spectral measure of $T$ is **singular** (purely singular with respect to Lebesgue measure), then strong mixing implies mixing of all orders: thus the Rokhlin problem is solved for all systems in the singular spectral class. The remaining open territory is systems with absolutely continuous or mixed spectral type, where the techniques of Host and Kalikow do not directly apply. For Bernoulli shifts, however, the product structure makes all orders of mixing transparent. [quotetheorem:3438] [citeproof:3438] The proof reveals why Bernoulli shifts are the natural testing ground for higher-order mixing: the product structure collapses any $k$-set correlation to an independence calculation over disjoint windows. The argument is purely combinatorial — no spectral theory is needed — and generalises immediately to any number of sets. This structure is specific to Bernoulli shifts and their factors; for systems without an explicit independence structure, even establishing mixing of order $3$ from mixing of order $2$ remains elusive. [remark: Higher-Order Mixing and Future Connections] The hierarchy of mixing orders is tied to deep questions in ergodic theory. Furstenberg's proof of Szemerédi's theorem on arithmetic progressions (developed in Ergodic Theory II) uses multiple recurrence results, which control averages of multi-set correlations of the same shape as those in the definition of mixing of order $k$ and which hold for arbitrary measure-preserving systems. The connection between the [mixing hierarchy](/theorems/3436) and entropy, and between spectral type and mixing order, remains an active research area.
Chapter 10 returns to the spectral perspective and examines the [Halmos–von Neumann theorem](/theorems/3461) on systems with pure point spectrum, which characterises the ergodic but not weakly mixing end of the hierarchy. [/remark] --- # 9. Bernoulli Shifts and the Natural Examples The [mixing hierarchy](/theorems/3436) of Chapter 8 established the theoretical landscape: strong mixing implies weak mixing implies ergodicity, and each implication is strict. That chapter was largely abstract — conditions defined in terms of correlations and spectral measures on the Koopman operator. This chapter populates the landscape with concrete systems. Bernoulli shifts and Markov shifts are the symbolic models, built directly from probability theory: sequences of outcomes drawn from a finite alphabet, with independence or Markov dependence between consecutive letters. The geodesic flow on a compact hyperbolic surface and Hamiltonian systems with Liouville measure bring ergodic theory into contact with differential geometry and classical mechanics. Together, these examples show that every level of the [mixing hierarchy](/theorems/3436) is realised by systems arising naturally across mathematics — and that measure-preservation, far from being an assumption to check, is often enforced by the underlying structure. ## Bernoulli Shifts: Definition and Construction Do systems exist that achieve perfect independence at each level of the [mixing hierarchy](/theorems/3436)? The symbolic models of ergodic theory answer this question emphatically: Bernoulli shifts are built from the product structure of an infinite sequence of independent trials, and that independence alone drives them to the very top of the [mixing hierarchy](/theorems/3436). Before fixing the definition, it is worth seeing what goes wrong without the product structure. 
Consider the sequence space $\{0,1\}^{\mathbb{Z}}$ equipped with the measure $\nu$ that assigns equal weight $1/2$ to each of the two constant sequences $(0,0,0,\dots)$ and $(1,1,1,\dots)$ and zero weight to every other sequence. This is a perfectly well-defined shift-invariant probability measure on $\{0,1\}^{\mathbb{Z}}$, but it is far from mixing: the system consists of two fixed points and is not even ergodic, since the invariant set $\{(0,0,\dots)\}$ has measure $1/2$. The pathology is that $\nu$ is not a product measure — the coordinates are perfectly correlated rather than independent. The Bernoulli construction corrects this by insisting that the measure is a genuine product. The simplest measure-preserving systems, from the standpoint of independence, are the Bernoulli shifts. They are the ergodic-theoretic models for an infinite sequence of independent trials, each governed by a fixed probability distribution on a finite alphabet. Let $A = \{0, 1, \dots, k-1\}$ be a finite alphabet and let $p = (p_0, p_1, \dots, p_{k-1})$ be a probability vector, meaning $p_i > 0$ for all $i$ and $\sum_{i=0}^{k-1} p_i = 1$. The vector $p$ encodes the weight assigned to each symbol. There are two natural versions of the construction, depending on whether time runs over $\mathbb{N}$ or $\mathbb{Z}$. [definition: One-Sided Bernoulli Shift] The **one-sided $(p_0, \dots, p_{k-1})$-Bernoulli shift** is the measure-preserving system $(X, \mathcal{B}, \mu, \sigma)$ where: - $X = A^{\mathbb{N}} = \{(x_0, x_1, x_2, \dots) : x_n \in A\}$ is the space of one-sided sequences over $A$, - $\mathcal{B}$ is the product $\sigma$-algebra generated by cylinder sets $\{x : x_{i_1} = a_1, \dots, x_{i_m} = a_m\}$, - $\mu = p^{\otimes \mathbb{N}}$ is the product measure, uniquely determined by $\mu(\{x : x_i = a_i,\, 0 \le i \le n\}) = p_{a_0} p_{a_1} \cdots p_{a_n}$ on cylinder sets, - $\sigma: X \to X$ is the left shift map $\sigma(x_0, x_1, x_2, \dots) = (x_1, x_2, x_3, \dots)$. 
[/definition] The one-sided shift discards the zeroth coordinate at each step. It is a surjective but non-invertible endomorphism of $X$: the coordinate $x_0$ is lost upon applying $\sigma$, and cannot be recovered. For most purposes in ergodic theory we prefer invertible transformations, leading to the two-sided version. [definition: Two-Sided Bernoulli Shift] The **two-sided $(p_0, \dots, p_{k-1})$-Bernoulli shift** is the measure-preserving system $(X, \mathcal{B}, \mu, \sigma)$ where: - $X = A^{\mathbb{Z}} = \{(x_n)_{n \in \mathbb{Z}} : x_n \in A\}$ is the space of bi-infinite sequences over $A$, - $\mathcal{B}$ is the product $\sigma$-algebra on $A^{\mathbb{Z}}$, - $\mu = p^{\otimes \mathbb{Z}}$ is the product measure on $A^{\mathbb{Z}}$, - $\sigma: X \to X$ is the left shift $\sigma((x_n)_{n \in \mathbb{Z}}) = (x_{n+1})_{n \in \mathbb{Z}}$. [/definition] In both cases, the shift $\sigma$ is measure-preserving: for any cylinder set $C$ defined by prescribing finitely many coordinates, $\sigma^{-1}(C)$ is again a cylinder set of the same $\mu$-measure, because the product measure assigns the same weight to any finite window of prescribed values regardless of where in $\mathbb{Z}$ (or $\mathbb{N}$) that window sits. The two-sided shift is invertible — its inverse $\sigma^{-1}$ is the right shift $\sigma^{-1}((x_n)) = (x_{n-1})$ — whereas the one-sided shift is not. [remark: Notation for Bernoulli Shifts] It is standard to denote the Bernoulli shift with probability vector $p = (p_0, \dots, p_{k-1})$ by $B(p_0, \dots, p_{k-1})$. When the alphabet is $\{0, 1\}$ with $p_0 = p_1 = 1/2$, the system $B(1/2, 1/2)$ is called the **fair coin-toss shift**. [/remark] The product measure structure is what separates Bernoulli shifts from arbitrary shift-invariant measures. 
Given the alphabet $A$ and weight vector $p$, the measure $\mu = p^{\otimes \mathbb{Z}}$ is characterised by a single property: it is the unique probability measure on $A^{\mathbb{Z}}$ under which the coordinate projections $\pi_n: (x_m)_{m \in \mathbb{Z}} \mapsto x_n$ are independent and identically distributed with law $p$. This is precisely the law of an infinite sequence of i.i.d. random variables with common distribution $p$, which is the probabilistic interpretation of the definition. This independence is not just a convenient feature — it is the mechanism behind the strong mixing property proved below, and it distinguishes Bernoulli shifts from the broader class of Markov shifts, where consecutive coordinates are correlated. The two examples that follow exhibit the same qualitative behaviour — independence at every scale, uniform forgetting — but differ in the quantitative weights assigned to each symbol. It is instructive to work through both explicitly, since the argument for measure-preservation and the application of the Birkhoff theorem are essentially identical, and comparing the two cases clarifies which features depend on the specific vector $p$ and which are purely structural. [example: Fair Coin-Toss Shift] Take $A = \{0, 1\}$ and $p = (1/2, 1/2)$, giving the two-sided Bernoulli shift $B(1/2, 1/2)$. The space $X = \{0,1\}^{\mathbb{Z}}$ consists of all bi-infinite binary sequences, and $\mu$ is the uniform product measure. A cylinder set $[a_{-m}, \dots, a_n] = \{x : x_{-m} = a_{-m}, \dots, x_n = a_n\}$ has measure $(1/2)^{m+n+1}$, reflecting independence of the bits. The shift moves every coordinate one step to the left: $\sigma(\dots, x_{-1}, x_0, x_1, \dots) = (\dots, x_0, x_1, x_2, \dots)$. This system is the canonical model for an i.i.d.
sequence of fair coin flips: the orbit of a point $x$ under $\sigma$ reads off successive symbols of the sequence $x$, and the [Birkhoff Ergodic Theorem](/theorems/518) (applied to $\mathbf{1}_{\{x_0 = 0\}}$) guarantees that for $\mu$-almost every $x$, the frequency of $0$s in the sequence $(x_0, x_1, \dots, x_{n-1})$ converges to $1/2$. [/example] The fair coin-toss gives equal weight to each symbol, but the Bernoulli construction works just as well with an asymmetric distribution. Non-uniform weights produce the same structural properties — measure-preservation, independence of coordinates — while changing the statistics of symbol frequencies along orbits. [example: The $(1/2, 1/3, 1/6)$-Bernoulli Shift] Take $A = \{0, 1, 2\}$ and $p = (1/2, 1/3, 1/6)$. The product measure on $A^{\mathbb{Z}}$ assigns measure $p_{a_0} p_{a_1} \cdots p_{a_{n-1}}$ to each $n$-cylinder. For instance, the cylinder $\{x : x_0 = 0, x_1 = 2\}$ has measure $(1/2)(1/6) = 1/12$. The symbols are drawn independently at each step with a non-uniform distribution, but the shift is measure-preserving because $\mu$ is a product measure — the distribution of $(x_n, x_{n+1})$ does not depend on $n$. For $\mu$-almost every $x$, the [Birkhoff Ergodic Theorem](/theorems/518) gives asymptotic frequencies: symbol $0$ appears with frequency $1/2$, symbol $1$ with frequency $1/3$, and symbol $2$ with frequency $1/6$. [/example] ## Ergodicity and Strong Mixing of Bernoulli Shifts Having defined Bernoulli shifts, the fundamental question is where they sit in the [mixing hierarchy](/theorems/3436) from Chapter 8. The answer is unambiguous: Bernoulli shifts occupy the top. [quotetheorem:3439] [citeproof:3439] The proof reveals that Bernoulli shifts achieve exact independence — not merely asymptotic independence — as soon as the coordinate windows of the two test sets separate. This is a direct consequence of the product structure of $\mu$. 
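The exact independence described here can be checked by hand on cylinder sets. The following is a minimal Python sketch, offered as an illustration rather than as part of the formal development. The helper names `cylinder_measure`, `shifted`, `intersect` and `correlation` are hypothetical, a cylinder is represented as a dictionary from coordinate index to prescribed symbol, and the weights are those of the $(1/2, 1/3, 1/6)$ example.

```python
from fractions import Fraction

# Probability vector of the (1/2, 1/3, 1/6)-Bernoulli shift (illustrative choice).
P = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))

def cylinder_measure(cyl, p=P):
    """Product measure of the cylinder {x : x_i = a for every (i, a) in cyl}."""
    m = Fraction(1)
    for symbol in cyl.values():
        m *= p[symbol]
    return m

def shifted(cyl, n):
    """Constraints defining sigma^{-n}(cyl): x_{i+n} = a for each (i, a) in cyl."""
    return {i + n: a for i, a in cyl.items()}

def intersect(c1, c2):
    """Intersection of two cylinders, or None if their constraints conflict."""
    merged = dict(c1)
    for i, a in c2.items():
        if merged.setdefault(i, a) != a:
            return None
    return merged

def correlation(A, B, n, p=P):
    """mu(sigma^{-n} A intersected with B) for cylinders A and B."""
    both = intersect(shifted(A, n), B)
    return Fraction(0) if both is None else cylinder_measure(both, p)

A = {0: 0, 1: 2}   # the cylinder x_0 = 0, x_1 = 2
B = {0: 1, 1: 1}   # the cylinder x_0 = 1, x_1 = 1
```

For $n \ge 2$ the two coordinate windows are disjoint and `correlation(A, B, n)` equals `cylinder_measure(A) * cylinder_measure(B)` exactly, not just in the limit. For $n = 0$ and $n = 1$ the windows overlap with conflicting symbols, so the intersection is empty.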
No other condition is needed: the system "forgets its past" completely once time has advanced past the width of the test windows. [explanation: Why Bernoulli Shifts Are the Canonical Example] The strong mixing property of Bernoulli shifts is not incidental — it is the ergodic-theoretic expression of the independence built into the product measure. In the [mixing hierarchy](/theorems/3436), strong mixing captures the idea that the system loses all memory of its past as time goes to infinity. For a Bernoulli shift, this forgetting is perfect after a finite time: once the time separation $N$ exceeds the combined widths of two cylinder windows, the two events become exactly independent. This places Bernoulli shifts at the top of all mixing conditions from Chapter 8. The spectral theory of Chapter 10 will refine the picture further: Bernoulli shifts have countable Lebesgue spectrum — the strongest possible spectral type, meaning the spectral measure of every zero-mean $L^2$ function is absolutely continuous with infinite multiplicity. A deep theorem of Ornstein (1970) shows that two Bernoulli shifts $B(p)$ and $B(q)$ are measurably isomorphic as measure-preserving systems if and only if they have the same metric entropy $h = -\sum_i p_i \log p_i$. Entropy therefore provides a complete isomorphism invariant for Bernoulli shifts — a theme central to Ergodic Theory II. [/explanation] ## Markov Shifts Bernoulli shifts assume complete independence — but what happens when we allow short-range memory? If the probability of the next symbol depends on the current symbol but not on the more distant past, we are led to Markov shifts: the ergodic-theoretic models of Markov chains. These systems interpolate between the maximal independence of Bernoulli shifts and the maximal dependence of deterministic systems, and they reveal that the [mixing hierarchy](/theorems/3436) is sensitive to subtle arithmetic properties of the transition dynamics. 
Bernoulli shifts model sequences of independent trials. Markov shifts generalise this by allowing dependence between consecutive symbols: the probability of the next symbol depends on the current symbol, governed by a transition matrix. They are the ergodic-theoretic incarnation of a Markov chain. Let $A = \{1, 2, \dots, k\}$ be a finite state space. A **transition matrix** is a $k \times k$ matrix $P = (P_{ij})$ with $P_{ij} \ge 0$ for all $i, j$ and $\sum_{j=1}^k P_{ij} = 1$ for all $i$; each row is a probability distribution on $A$. [definition: Stationary Distribution for a Transition Matrix] A probability vector $\pi = (\pi_1, \dots, \pi_k)$ is a **stationary distribution** for $P$ if $\pi P = \pi$, that is, \begin{align*} \pi_j = \sum_{i=1}^k \pi_i P_{ij} \quad \text{for all } j \in A. \end{align*} [/definition] By the Perron–Frobenius theorem, every stochastic matrix with strictly positive entries has a unique stationary distribution. For general non-negative stochastic matrices, stationary distributions exist (by compactness of the simplex and continuity of $\pi \mapsto \pi P$) but need not be unique. [definition: Markov Shift] Given a transition matrix $P$ on $A$ and a stationary distribution $\pi$ for $P$, the **two-sided Markov shift** $(X, \mathcal{B}, \mu, \sigma)$ is defined by: - $X = A^{\mathbb{Z}}$, with the [product topology](/page/Product%20Topology) and Borel $\sigma$-algebra $\mathcal{B}$, - $\mu$ is the unique shift-invariant probability measure on $A^{\mathbb{Z}}$ satisfying \begin{align*} \mu(\{x : x_0 = a_0,\, x_1 = a_1,\, \dots,\, x_n = a_n\}) = \pi_{a_0} P_{a_0 a_1} P_{a_1 a_2} \cdots P_{a_{n-1} a_n} \end{align*} for all $n \ge 0$ and all $(a_0, \dots, a_n) \in A^{n+1}$, - $\sigma: X \to X$ is the left shift. 
[/definition] The measure $\mu$ is shift-invariant because $\pi$ is stationary for $P$. The preimage $\sigma^{-1}\{x : x_0 = a_0, \dots, x_n = a_n\} = \{x : x_1 = a_0, \dots, x_{n+1} = a_n\}$ leaves the coordinate $x_0$ free, so its measure is a sum over the values $x_0 = i$, and the stationarity condition $\pi P = \pi$ gives $\sum_{i} \pi_i P_{i a_0} = \pi_{a_0}$, which is the leading factor in the measure of the original cylinder. [remark: Bernoulli Shifts as Special Markov Shifts] A Bernoulli shift $B(p_0, \dots, p_{k-1})$ is the special Markov shift with transition matrix $P_{ij} = p_j$ for all $i$ — the next symbol has distribution $p$ regardless of the current state. The stationary distribution is $\pi = p$. Every Bernoulli shift is a Markov shift, but the converse fails: a Markov measure with non-trivial dependence between consecutive coordinates is not a product measure. [/remark] ### Ergodicity of Markov Shifts When is a Markov shift ergodic? Unlike Bernoulli shifts, where the product structure guarantees ergodicity immediately, a Markov shift can fail to be ergodic if the chain splits into isolated groups of states that never communicate. The relevant condition is a connectivity property of the transition graph: every state must be reachable from every other state. [definition: Irreducible Transition Matrix] The transition matrix $P$ is **irreducible** if for every pair of states $i, j \in A$ there exists an integer $n \ge 1$ such that $(P^n)_{ij} > 0$; that is, there is a positive probability of reaching state $j$ from state $i$ in exactly $n$ steps. [/definition] Irreducibility is the natural connectivity condition for the directed graph with vertex set $A$ and edge $(i,j)$ whenever $P_{ij} > 0$: the chain can eventually reach every state from every other state. This condition is necessary and sufficient for ergodicity of the Markov shift (assuming $\pi_i > 0$ for every state $i$, so that no state is invisible to $\mu$). [quotetheorem:3458] [citeproof:3458] The ergodicity criterion reveals a clean dichotomy: reducible chains decompose into invariant subsystems, while irreducible chains explore the full state space.
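Irreducibility is a finite graph condition, so it can be tested mechanically. The sketch below is one possible way to do this in Python, assuming the transition matrix is given as a list of rows and that states are indexed $0, \dots, k-1$ (an indexing chosen for the code, not the labelling $1, \dots, k$ used in the text). The function name `is_irreducible` is hypothetical.

```python
def is_irreducible(P):
    """Return True if every state reaches every state in at least one step.

    P is a square list of rows of non-negative transition probabilities.
    Only the pattern of positive entries matters, i.e. the transition graph.
    """
    k = len(P)
    for start in range(k):
        # Collect the states reachable from `start` in one or more steps.
        reached = set()
        frontier = [j for j in range(k) if P[start][j] > 0]
        while frontier:
            j = frontier.pop()
            if j in reached:
                continue
            reached.add(j)
            frontier.extend(m for m in range(k) if P[j][m] > 0)
        # `start` itself is in `reached` only if some path returns to it.
        if len(reached) < k:
            return False
    return True
```

The search runs once per starting state, which is adequate for the small matrices used as examples in this chapter. A matrix with all entries positive passes trivially, and a matrix with an absorbing state fails whenever there is more than one state.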
To see concretely what reducibility looks like, consider a two-state chain with $P = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ — the identity matrix. Every state is absorbing: once in state $0$, the chain stays at $0$ forever, and likewise for state $1$. If the stationary distribution is $\pi = (1/2, 1/2)$, the shift decomposes into two invariant subsystems of equal mass, and the indicator $\mathbf{1}_{\{x_0 = 0\}}$ is a non-constant invariant function. The chain is reducible, and the shift is not ergodic. Note also the limit of the ergodicity criterion: irreducibility alone does not give mixing. It guarantees that the chain can travel between any two states, but says nothing about how long this takes or whether the occupation frequencies oscillate. A chain may be irreducible yet have return times with a common period greater than $1$, causing correlations to oscillate rather than decay. The theorem also says nothing about convergence rates — two irreducible chains may both be ergodic while one mixes exponentially fast and the other mixes arbitrarily slowly. These distinctions, between ergodicity and mixing and between mixing and rates of mixing, are the subject of the next discussion. The next natural question is whether stronger mixing properties hold. The transition from ergodicity to strong mixing requires an additional structural hypothesis, and understanding why illuminates the difference between the two conditions. [explanation: Aperiodicity and Strong Mixing of Markov Shifts] For a Markov shift, ergodicity requires only irreducibility. Strong mixing requires an additional condition: the transition matrix must be **aperiodic**, meaning the greatest common divisor of all return times to every state is $1$. Equivalently, $P$ is aperiodic if $(P^n)_{ii} > 0$ for all sufficiently large $n$ and all $i \in A$. 
Under irreducibility and aperiodicity, the classical convergence theorem for Markov chains gives $\|(P^n)_{i \cdot} - \pi\|_1 \to 0$ as $n \to \infty$ for every initial state $i$. Translating to cylinder sets, this convergence implies $\mu(\sigma^{-n}A \cap B) \to \mu(A)\mu(B)$ for all cylinder sets $A, B$, which is exactly the strong mixing condition. An irreducible but periodic chain fails to be strongly mixing. The simplest example is a two-state chain that alternates deterministically between states $0$ and $1$: this is irreducible with period $2$ and hence ergodic, but the correlation $\mu(\sigma^{-n}A \cap B)$ oscillates as $n$ varies between even and odd, never settling to $\mu(A)\mu(B)$. [/explanation] The abstract criterion — irreducibility plus aperiodicity gives strong mixing — becomes concrete once one identifies the spectral gap of the transition matrix as the rate of convergence. In the following two-state example, the gap can be read off directly from the eigenvalues. [example: A Two-State Markov Shift] Let $A = \{0, 1\}$ and consider the transition matrix \begin{align*} P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix} \end{align*} with $\alpha, \beta \in (0, 1)$. This matrix is irreducible: from state $0$, state $1$ is reached in one step with probability $\alpha > 0$, and from state $1$, state $0$ is reached with probability $\beta > 0$. The unique stationary distribution is $\pi = \bigl(\tfrac{\beta}{\alpha+\beta},\, \tfrac{\alpha}{\alpha+\beta}\bigr)$, as can be verified directly from $\pi P = \pi$. Since both diagonal entries $1 - \alpha$ and $1 - \beta$ are strictly positive, the chain has period $1$ (it can return to either state in one step), so $P$ is aperiodic. The Markov shift is therefore strongly mixing. The second eigenvalue of $P$ is $1 - \alpha - \beta$, and the rate of convergence to the stationary distribution is governed by $|1 - \alpha - \beta|^n \to 0$, giving exponential mixing. 
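The rate claim can be verified exactly for particular parameters. The following Python sketch uses rational arithmetic and the illustrative values $\alpha = 3/10$, $\beta = 1/10$, which are assumptions made for the code and are not taken from the text. The helper names are hypothetical. It compares the deviation $(P^n)_{00} - \pi_0$ with the prediction $(1 - \alpha - \beta)^n (1 - \pi_0)$, which follows from the decomposition $P^n = \Pi + (1-\alpha-\beta)^n (I - \Pi)$, where $\Pi$ is the matrix with both rows equal to $\pi$.

```python
from fractions import Fraction

def stationary_two_state(alpha, beta):
    """Stationary distribution of [[1-alpha, alpha], [beta, 1-beta]]."""
    return (beta / (alpha + beta), alpha / (alpha + beta))

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    k = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(k)) for j in range(k)]
            for i in range(k)]

def mat_pow(P, n):
    """n-th power of P by repeated multiplication (fine for small n)."""
    k = len(P)
    result = [[Fraction(int(i == j)) for j in range(k)] for i in range(k)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Illustrative parameters, not taken from the text.
alpha, beta = Fraction(3, 10), Fraction(1, 10)
P = [[1 - alpha, alpha], [beta, 1 - beta]]
pi = stationary_two_state(alpha, beta)
lam = 1 - alpha - beta          # second eigenvalue of P

def deviation(n):
    """(P^n)_{00} - pi_0: how far the n-step law from state 0 is from pi."""
    return mat_pow(P, n)[0][0] - pi[0]
```

With these values `deviation(n)` agrees with `lam ** n * (1 - pi[0])` for every $n$, so the distance to stationarity shrinks by the factor $|1 - \alpha - \beta|$ at each step.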
[/example] ## Geodesic Flow on a Compact Hyperbolic Surface Bernoulli and Markov shifts are built combinatorially. The geodesic flow is a measure-preserving system imposed by geometry: the underlying space is a Riemannian manifold, the dynamics are determined by the metric, and both the invariant measure and the ergodicity are consequences of the geometry of negative curvature. Let $\Sigma$ be a compact orientable surface of constant Gaussian curvature $-1$, realised as a quotient $\Sigma = \Gamma \backslash \mathbb{H}^2$, where $\mathbb{H}^2$ is the upper half-plane equipped with the Poincaré metric $ds^2 = (dx^2 + dy^2)/y^2$, and $\Gamma \subset \mathrm{PSL}(2, \mathbb{R})$ is a cocompact torsion-free lattice. Compactness ensures every geodesic is defined for all time. The **unit tangent bundle** $T^1\Sigma$ is the set of unit-speed tangent vectors $(p, v)$ with $p \in \Sigma$ and $v \in T_p\Sigma$, $|v|_g = 1$; it is a compact smooth $3$-manifold. [definition: Geodesic Flow] The **geodesic flow** on $T^1\Sigma$ is the one-parameter family of smooth maps $(\phi_t)_{t \in \mathbb{R}}$, where \begin{align*} \phi_t: T^1\Sigma \to T^1\Sigma, \qquad \phi_t(p, v) = \bigl(\gamma_{(p,v)}(t),\, \dot\gamma_{(p,v)}(t)\bigr), \end{align*} and $\gamma_{(p,v)}: \mathbb{R} \to \Sigma$ is the unique unit-speed geodesic with $\gamma_{(p,v)}(0) = p$ and $\dot\gamma_{(p,v)}(0) = v$. [/definition] The map $\phi_t$ moves each unit tangent vector forward along its geodesic by time $t$. Since $\Sigma$ is compact, geodesics do not escape to infinity, and $\phi_t$ is defined for all $t \in \mathbb{R}$. Before turning to measure-preservation and ergodicity, it is instructive to note what the geodesic flow does not do on surfaces with different geometry. 
On a flat torus $\mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$ with its Euclidean metric, geodesics are straight lines with constant slope. The direction of motion is conserved, so the geodesic flow on the full unit tangent bundle is not ergodic. Restricted to a fixed direction it is a linear flow on $\mathbb{T}^2$, which is ergodic precisely when the slope is irrational (a result analogous to Weyl's theorem on irrational rotations) but is never strongly mixing: its Koopman operator has purely discrete spectrum, and correlations do not decay. On a positively curved surface such as the round sphere $S^2$, geodesics are great circles that close up in finite time, causing every geodesic to revisit its initial direction periodically; the flow is not even ergodic, since the dynamics decomposes into periodic orbits. For hyperbolic surfaces of infinite area, the Liouville measure has infinite total mass on $T^1\Sigma$. It is still preserved by the flow, but the system cannot be normalised to a probability space, and ergodicity requires a different framework. Negative curvature together with finite total volume, which compactness guarantees here, is the combination that makes the geodesic flow a strongly mixing probability-preserving system. The geodesic flow preserves a natural measure on $T^1\Sigma$. The Riemannian metric on $\Sigma$ induces the Sasaki metric on $T^1\Sigma$, a Riemannian structure on the $3$-manifold, and hence a normalised Riemannian volume measure $\nu$ on $T^1\Sigma$. [quotetheorem:3440] Measure-preservation follows from [Liouville's theorem](/page/Liouville's%20Theorem) for Hamiltonian systems, developed in the next section: the geodesic flow is a Hamiltonian flow on the cotangent bundle $T^*\Sigma$, with Hamiltonian $H(q, p) = \frac{1}{2}g^{ij}(q)p_ip_j$ (the kinetic energy), and the unit tangent bundle $T^1\Sigma$, identified with the unit cotangent bundle via the metric, is the level set $\{H = 1/2\}$. [quotetheorem:3441] Ergodicity of the geodesic flow on surfaces of constant negative curvature was established in the 1930s in work of Hedlund and Hopf, and the proof technique via stable and unstable foliations, now called the **Hopf argument**, is due to Hopf.
The proof exploits the exponential divergence of geodesics in negative curvature. The strategy of the Hopf argument is as follows. For a continuous function $f$ on $T^1\Sigma$ (such functions are dense in $L^1(\nu)$, so they suffice), the Birkhoff averages $\frac{1}{T}\int_0^{\,T} f \circ \phi_t \, dt$ converge $\nu$-almost everywhere to the conditional expectation $\mathbb{E}[f \mid \mathcal{I}]$ onto the invariant $\sigma$-algebra $\mathcal{I}$. To show ergodicity, one must show this limit is $\nu$-a.e. constant. For each $(p, v) \in T^1\Sigma$, the **stable manifold** $W^s(p,v) = \{(p', v') : d(\phi_t(p,v), \phi_t(p',v')) \to 0 \text{ as } t \to +\infty\}$ and the **unstable manifold** $W^u(p,v) = \{(p', v') : d(\phi_t(p,v), \phi_t(p',v')) \to 0 \text{ as } t \to -\infty\}$ are smooth submanifolds. On a negatively curved surface, points on a common stable manifold approach each other exponentially fast: $d(\phi_t(p,v), \phi_t(p',v')) \le C e^{-t} d((p,v),(p',v'))$ for $(p',v') \in W^s(p,v)$ and $t \ge 0$, with the analogous estimate on $W^u$ as $t \to -\infty$. Since $f$ is uniformly continuous, the forward Birkhoff average of $f$ is therefore constant along stable manifolds, and the backward average is constant along unstable manifolds. The two averages agree $\nu$-almost everywhere, and both are invariant along flow lines. Any two points in $T^1\Sigma$ can be connected by a path consisting of finitely many segments lying in stable manifolds, unstable manifolds, and flow lines, and because these foliations are absolutely continuous, a Fubini-type argument upgrades constancy along the segments to constancy $\nu$-a.e. Strong mixing requires a finer argument that again rests on the stable and unstable foliations. Both hypotheses, compactness and negative curvature, matter. On a flat torus $\mathbb{T}^2$, geodesics are straight lines: the direction is an invariant of the motion, so the geodesic flow on the unit tangent bundle is not ergodic, and even the linear flow in a fixed irrational direction, which is uniquely ergodic, is not mixing, since its Koopman operator has discrete spectrum. On a positively curved surface such as the round sphere $S^2$, every geodesic is a great circle of the same length, so the geodesic flow is periodic and far from ergodic. For hyperbolic surfaces of infinite area, the Liouville measure is infinite, and ergodicity must be replaced by conservative ergodicity — a different framework.
Among the settings compared here, the compact negative-curvature case is the one in which both ergodicity and strong mixing hold for the geodesic flow with respect to an invariant probability measure. ## Hamiltonian Systems and Liouville's Theorem Why does the geodesic flow preserve Liouville measure? What structural property forces measure-preservation, without any explicit computation? The answer lies in symplectic geometry: the geodesic flow is a Hamiltonian system, and every Hamiltonian system preserves the volume form canonically attached to its symplectic structure. Measure-preservation is not verified case-by-case — it is a theorem about Hamiltonian flows in general, with [Liouville's theorem](/theorems/38) providing the structural guarantee. [Liouville's theorem](/theorems/346) states that all Hamiltonian flows preserve a canonical volume measure — the Liouville measure — automatically, without any additional hypotheses. This makes Hamiltonian systems a natural source of measure-preserving dynamical systems. Let $(M, \omega)$ be a symplectic manifold of dimension $2n$, where $\omega$ is a closed, non-degenerate $2$-form (the **symplectic form**). The standard example is $M = T^*Q$, the cotangent bundle of a configuration manifold $Q$, with local canonical coordinates $(q_1, \dots, q_n, p_1, \dots, p_n)$ and symplectic form \begin{align*} \omega = \sum_{i=1}^n dq_i \wedge dp_i. \end{align*} [definition: Hamiltonian Vector Field and Hamiltonian Flow] Given a smooth function $H: M \to \mathbb{R}$ (the **Hamiltonian**), the **Hamiltonian vector field** $X_H \in \mathfrak{X}(M)$ is the unique vector field satisfying \begin{align*} \omega(X_H, Y) = dH(Y) \quad \text{for all vector fields } Y \in \mathfrak{X}(M). \end{align*} The **Hamiltonian flow** $(\phi_t^H)_{t \in \mathbb{R}}$ is the flow of $X_H$: each $\phi_t^H: M \to M$ is the time-$t$ map of the ODE $\dot{x} = X_H(x)$.
[/definition] In local canonical coordinates $(q, p)$, the non-degeneracy and skew-symmetry of $\omega$ translate the defining relation $\iota_{X_H}\omega = dH$ into the classical **Hamilton equations of motion**: \begin{align*} \dot{q}_i = \frac{\partial H}{\partial p_i}, \qquad \dot{p}_i = -\frac{\partial H}{\partial q_i}, \quad i = 1, \dots, n. \end{align*} The Hamiltonian vector field encodes all of classical mechanics: Newton's laws, the equations of a pendulum, and the geodesic equations on a Riemannian manifold all arise as Hamiltonian flows for appropriate choices of $H$ and symplectic manifold $M$. This breadth makes the following definition and theorem particularly powerful. [definition: Liouville Measure] The **Liouville measure** on the symplectic manifold $(M, \omega)$ of dimension $2n$ is the measure $\Lambda$ induced by the **Liouville volume form** \begin{align*} \Omega = \frac{\omega^n}{n!} = \frac{1}{n!}\underbrace{\omega \wedge \omega \wedge \cdots \wedge \omega}_{n \text{ factors}}. \end{align*} In local canonical coordinates $(q, p)$, this reduces to $\Omega = dq_1 \wedge \cdots \wedge dq_n \wedge dp_1 \wedge \cdots \wedge dp_n$, so $d\Lambda = dq\,dp$ is ordinary Lebesgue measure on $\mathbb{R}^{2n}$. [/definition] The Liouville volume form $\Omega$ is non-vanishing everywhere on $M$ (since $\omega$ is non-degenerate) and gives a canonical notion of volume that depends only on the symplectic structure, not on any choice of Riemannian metric. The fundamental result is that this volume is preserved by every Hamiltonian flow. [quotetheorem:346] [citeproof:346] [Liouville's theorem](/page/Liouville's%20Theorem) is one of the oldest measure-preservation results in mathematical physics. It says that Hamiltonian dynamics cannot have attractors in the measure-theoretic sense: the volume in phase space occupied by any set of initial conditions is conserved. This is in sharp contrast to dissipative systems, where phase-space volume contracts. 
A concrete example makes the failure of volume preservation visible. Consider the **damped pendulum**, governed by $\ddot{\theta} + \gamma \dot{\theta} + \sin\theta = 0$ for damping constant $\gamma > 0$. Writing $q = \theta$ and $p = \dot{\theta}$, the equations of motion are $\dot{q} = p$ and $\dot{p} = -\sin q - \gamma p$. The divergence of the vector field $(\dot{q}, \dot{p}) = (p, -\sin q - \gamma p)$ with respect to the phase-space coordinates $(q, p)$ is \begin{align*} \frac{\partial}{\partial q}(p) + \frac{\partial}{\partial p}(-\sin q - \gamma p) = 0 + (-\gamma) = -\gamma < 0. \end{align*} By [Liouville's theorem](/theorems/38) for ODEs (the [divergence theorem](/theorems/2754) for flows), the rate of change of the volume of any region $A$ under the flow satisfies $\frac{d}{dt}\Lambda(\phi_t(A)) = \int_{\phi_t(A)} \operatorname{div}(X)\, d\Lambda = -\gamma \Lambda(\phi_t(A))$, so $\Lambda(\phi_t(A)) = e^{-\gamma t}\Lambda(A) \to 0$. Every set of initial conditions collapses to zero volume, and the dynamics contracts onto the attractor $\{(0, 0)\}$ (the rest position). The system is not measure-preserving in any reasonable sense. This is the generic behaviour for non-Hamiltonian systems: dissipation breaks the volume-preserving structure that [Liouville's theorem](/theorems/346) guarantees. When $H$ is time-independent, $H$ is conserved along orbits: $\frac{d}{dt}H(\phi_t^H(x)) = dH(X_H) = \omega(X_H, X_H) = 0$ by skew-symmetry of $\omega$. Each energy level set $M_E = \{x \in M : H(x) = E\}$ is therefore invariant under the flow. On a compact energy level set $M_E$, the restriction of the Hamiltonian flow carries a natural invariant measure — the Liouville measure conditioned to $M_E$ via the co-[area formula](/theorems/3075). For the geodesic flow, the relevant level set is $T^1\Sigma = \{H = 1/2\}$, recovering the Liouville measure $\nu$ from the previous section. 
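The contraction rate $e^{-\gamma t}$ obtained above for the damped pendulum can be checked numerically. The following Python sketch is an illustration under stated assumptions, not part of the theory: it integrates the flow together with its linearisation $\dot{J} = DX(q,p)\,J$ using a hand-written fourth-order Runge–Kutta step, and returns $\det J(t)$, the factor by which the flow distorts infinitesimal phase-space volume. The function names, the step count and the initial condition are choices made for this example.

```python
import math


def rhs(state, gamma):
    """Right-hand side for (q, p) and the 2x2 linearised flow J = [[j11, j12], [j21, j22]]."""
    q, p, j11, j12, j21, j22 = state
    c = math.cos(q)
    # Vector field X(q, p) = (p, -sin q - gamma p); its Jacobian is [[0, 1], [-cos q, -gamma]].
    return (
        p,
        -math.sin(q) - gamma * p,
        j21,
        j22,
        -c * j11 - gamma * j21,
        -c * j12 - gamma * j22,
    )


def rk4_step(state, dt, gamma):
    """One classical Runge-Kutta step of size dt."""
    k1 = rhs(state, gamma)
    k2 = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), gamma)
    k3 = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), gamma)
    k4 = rhs(tuple(s + dt * k for s, k in zip(state, k3)), gamma)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def jacobian_determinant(q0, p0, gamma, t_end, steps=2000):
    """Determinant of the linearised time-t_end flow map at the initial point (q0, p0)."""
    state = (q0, p0, 1.0, 0.0, 0.0, 1.0)  # J(0) is the identity matrix
    dt = t_end / steps
    for _ in range(steps):
        state = rk4_step(state, dt, gamma)
    _, _, j11, j12, j21, j22 = state
    return j11 * j22 - j12 * j21


damped = jacobian_determinant(1.0, 0.5, gamma=0.3, t_end=2.0)
undamped = jacobian_determinant(1.0, 0.5, gamma=0.0, t_end=2.0)
print(damped, math.exp(-0.3 * 2.0))  # the two numbers should agree closely
print(undamped)                      # should be close to 1 (Hamiltonian case)
```

With $\gamma = 0$ the pendulum is Hamiltonian and the determinant stays at $1$ up to integration error; with $\gamma > 0$ it decays like $e^{-\gamma t}$, independently of the initial point, because the divergence of the vector field is the constant $-\gamma$.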
[example: The Arnold Cat Map on $\mathbb{T}^2$] The **Arnold cat map** is the discrete-time Hamiltonian system $T: \mathbb{T}^2 \to \mathbb{T}^2$ defined by \begin{align*} T(x, y) = (2x + y,\, x + y) \pmod{1}, \end{align*} corresponding to the [linear map](/page/Linear%20Map) on $\mathbb{R}^2$ given by the matrix $A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$. Since $\det A = 1$, the map $A$ preserves the standard area form on $\mathbb{R}^2$ and hence $T$ preserves normalised Lebesgue measure $\lambda$ on $\mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$. The eigenvalues of $A$ are $\lambda_\pm = (3 \pm \sqrt{5})/2$, satisfying $\lambda_+ \approx 2.618 > 1$ and $\lambda_- = \lambda_+^{-1} \approx 0.382 < 1$. The expanding eigenvector direction and contracting eigenvector direction wind densely around the torus (since their slopes are irrational), giving $\mathbb{T}^2$ a hyperbolic structure: at every point there is a uniformly expanding direction and a uniformly contracting direction. This is the two-dimensional analogue of the hyperbolicity that drives ergodicity of the geodesic flow. The cat map is strongly mixing. To verify this, consider the Fourier expansion: any $f \in L^2(\mathbb{T}^2)$ has [Fourier series](/page/Fourier%20Series) $f(x,y) = \sum_{(m,n) \in \mathbb{Z}^2} \hat{f}(m,n) e^{2\pi i(mx+ny)}$. The pullback satisfies $f \circ T^k(x,y) = \sum_{(m,n)} \hat{f}(m,n) e^{2\pi i(m,n) \cdot A^k(x,y)^\top}$, so the Fourier coefficient at frequency $(m', n')$ of $f \circ T^k$ equals $\hat{f}(m,n)$ where $(m,n) = (m',n') A^{-k}$. For a fixed non-zero frequency $(m', n') \neq (0,0)$, the dual orbit $(m', n') A^{-k}$ grows without bound as $k \to \infty$ (since $\|A^{-k}\| \sim \lambda_+^k \to \infty$ on non-contracting directions), and the corresponding Fourier mode of $f \circ T^k$ involves a coefficient $\hat{f}$ evaluated at a frequency tending to infinity. By the Riemann–Lebesgue lemma, $\hat{f}(m,n) \to 0$ as $|(m,n)| \to \infty$ for $f \in L^2$. 
Hence for any $f, g \in L^2_0(\mathbb{T}^2, \lambda)$ (zero-mean), the correlation \begin{align*} \int_{\mathbb{T}^2} f \circ T^k \cdot \bar{g}\, d\lambda = \sum_{(m,n) \neq (0,0)} \hat{f}(m,n)\, \overline{\hat{g}((m,n) A^{k})} \to 0 \end{align*} as $k \to \infty$, establishing strong mixing. For trigonometric polynomials $f$ and $g$ the sum has finitely many terms, and each term vanishes once $(m,n)A^k$ has left the finite set of frequencies of $g$, which happens because $|(m,n)A^k| \to \infty$ by the same argument as for the dual orbit; the general case follows by density of trigonometric polynomials in $L^2_0$. The cat map is in fact a Bernoulli system (it is measurably isomorphic to a Bernoulli shift), which places it at the top of the [mixing hierarchy](/theorems/3436).
[/example]

## The Natural Examples in the Mixing Hierarchy

Is every level of the [mixing hierarchy](/theorems/3436) actually inhabited by natural systems? Chapter 8 proved that the implications are strict (ergodicity does not imply weak mixing, and weak mixing does not imply strong mixing), but strictness alone does not say whether the systems separating the levels arise naturally or only through ad hoc constructions. The systems of this chapter show that the ergodic-only tier and the strongly mixing tier are both occupied by systems arising directly from probability theory, symbolic dynamics, or Riemannian geometry. None of the systems listed below is weakly mixing without being strongly mixing.

The examples of this chapter give concrete realisations of levels of the hierarchy established in Chapter 8. The following table summarises their properties:

| System | Ergodic | Weakly Mixing | Strongly Mixing |
|---|---|---|---|
| Irrational rotation $T_\alpha$ (Chapter 7) | Yes | No | No |
| Periodic Markov shift (irreducible, period $> 1$) | Yes | No | No |
| Two-state Markov shift with $\alpha, \beta \in (0,1)$ | Yes | Yes | Yes |
| Bernoulli shift $B(p_0, \dots, p_{k-1})$ | Yes | Yes | Yes |
| Geodesic flow on $T^1\Sigma$ | Yes | Yes | Yes |
| Arnold cat map on $\mathbb{T}^2$ | Yes | Yes | Yes |

The irrational rotation occupies the ergodic-only tier because its Koopman operator has pure discrete spectrum: the characters $e^{2\pi i n x}$, $n \neq 0$, are non-constant eigenfunctions with eigenvalues $e^{2\pi i n\alpha}$, violating the Koopman–von Neumann criterion for weak mixing.
The periodic Markov shift (e.g., the two-state chain alternating deterministically between $0$ and $1$) is ergodic but not weakly mixing: the Koopman operator has non-trivial eigenvalues at roots of unity (specifically $e^{2\pi i k/p}$ for a chain of period $p$), violating the Koopman–von Neumann criterion. The aperiodic two-state Markov shift with $\alpha, \beta \in (0,1)$ is weakly and strongly mixing. The remaining systems are all strongly mixing; they differ in entropy, spectral type, and smooth structure, distinctions that Chapter 10 and Ergodic Theory II will make precise. [remark: Symbolic Coding of the Geodesic Flow] One of the deep results connecting the symbolic and geometric examples is that the geodesic flow on a compact hyperbolic surface admits a **symbolic coding** via a Markov shift: the surface can be partitioned into finitely many pieces such that the first-return map to the partition defines a topological Markov chain. The resulting Markov shift is irreducible and aperiodic, and the geodesic flow is measurably isomorphic to a suspension over this shift. This coding, developed by Adler–Weiss and Bowen in the 1970s, shows that the symbolic and geometric models of this chapter are not separate worlds but are intimately connected: the smooth geometry of the geodesic flow is faithfully captured, up to measurable isomorphism, by a countable symbol sequence with Markov transitions. [/remark] The classification of these examples — and the question of when two measure-preserving systems are isomorphic — leads naturally to spectral invariants. The Koopman operator $U_T$ acting on $L^2(X, \mu)$ encodes the dynamics in a linear operator, and the spectral theory of $U_T$ (the subject of Chapter 10) provides invariants that distinguish systems beyond the [mixing hierarchy](/theorems/3436). 
For Bernoulli shifts and the cat map, the Koopman operator has countable Lebesgue spectrum; for the geodesic flow, spectral analysis involves the representation theory of $\mathrm{PSL}(2, \mathbb{R})$. These spectral distinctions, combined with entropy, form the foundations of the isomorphism theory of ergodic systems. --- # 10. Spectral Theory of Ergodic Transformations The preceding chapters developed ergodic theory through two complementary lenses: the analytic (ergodic theorems and their convergence) and the geometric-probabilistic (the [mixing hierarchy](/theorems/3436) and the gallery of examples in Chapter 9). Chapter 9 showed that Bernoulli shifts, Markov shifts, the geodesic flow, Hamiltonian systems, and the Arnold cat map each occupy a definite place in the [mixing hierarchy](/theorems/3436). But the [mixing hierarchy](/theorems/3436) is too coarse to distinguish all systems: there exist uncountably many non-isomorphic ergodic systems with the same mixing type, and two systems can be strongly mixing yet structurally very different. For instance, the Bernoulli shifts $B(1/2, 1/2)$ (a fair coin) and $B(1/3, 1/3, 1/3)$ (a fair three-sided die) are both strongly mixing, yet they are not isomorphic as measure-preserving systems — a fact that no mixing-type argument can detect. The fundamental question of isomorphism — when are two measure-preserving systems genuinely the same, up to relabelling the phase space? — requires new invariants. Spectral theory provides one of the most powerful and conceptually clean answers. Every measure-preserving transformation $T$ induces a unitary operator $U_T$ on $L^2(X, \mu)$, and the spectral data of this operator — its eigenvalues, spectral measures, and spectral type — are invariants of the system under measure-theoretic isomorphism. 
This chapter develops the spectral theory of ergodic transformations, identifies the key spectral invariants, proves the [Halmos–von Neumann theorem](/theorems/3461) classifying systems with pure discrete spectrum, and closes with a survey of spectral rigidity and the limits of the spectral approach, pointing naturally toward the entropy theory of Ergodic Theory II. ## Spectral Measures and the Cyclic Decomposition Given two ergodic systems that both mix strongly, how can we tell them apart? The [mixing hierarchy](/theorems/3436) certifies that both possess a certain dynamical character, but it provides no finer gauge. What we need is an invariant that sees *inside* a given mixing class and separates systems that mere mixing-type analysis cannot distinguish. The answer begins with the observation that every measure-preserving transformation $T$ acts unitarily on $L^2(X, \mu)$, and this unitary operator carries a rich spectral signature. The Koopman operator $U_T: L^2(X, \mu) \to L^2(X, \mu)$, defined by $U_T f = f \circ T$, is a unitary operator whenever $T$ is a measure-preserving transformation: $\|U_T f\|_{L^2} = \|f\|_{L^2}$ and $U_T$ is invertible with $U_T^{-1} = U_{T^{-1}}$ (in the invertible case). The spectral theory of $U_T$ begins with a fundamental observation: for any $f \in L^2(X, \mu)$, the correlation sequence $n \mapsto (U_T^n f, f)_{L^2}$ is positive-definite. To see why, recall that a sequence $(a_n)_{n \in \mathbb{Z}}$ of complex numbers is **positive-definite** if $\sum_{j,k} a_{j-k} c_j \overline{c_k} \ge 0$ for all finite sequences $(c_j)$. Setting $a_n = (U_T^n f, f)_{L^2}$ gives $\sum_{j,k} (U_T^{j-k} f, f)_{L^2} c_j \overline{c_k} = \|\sum_j c_j U_T^j f\|_{L^2}^2 \ge 0$, confirming positive-definiteness. Bochner's theorem then guarantees the following. [definition: Spectral Measure] Let $(X, \mathcal{B}, \mu, T)$ be a measure-preserving system and let $U_T$ be its Koopman operator on $H = L^2(X, \mu)$. 
For $f \in H$, the **spectral measure of $f$** is the unique finite positive Borel measure $\sigma_f$ on the circle $\mathbb{T} = \{z \in \mathbb{C} : |z| = 1\}$ satisfying \begin{align*} \int_{\mathbb{T}} z^n \, d\sigma_f(z) = (U_T^n f, f)_{L^2} \quad \text{for all } n \in \mathbb{Z}. \end{align*} The total mass of $\sigma_f$ is $\sigma_f(\mathbb{T}) = (f, f)_{L^2} = \|f\|_{L^2}^2$. [/definition] The spectral measure $\sigma_f$ translates the orbit of $f$ under $U_T$ into a frequency-domain object on the circle. Its moments are the correlations of $f$ with its iterates, so $\sigma_f$ encodes the full time-correlation structure of $f$ along the orbit of $T$. The connection to the spectral theorem is direct. The **closed cyclic subspace** generated by $f$ is $H_f := \overline{\operatorname{span}\{U_T^n f : n \in \mathbb{Z}\}}$. By the spectral theorem for unitary operators, $H_f$ is unitarily equivalent to $L^2(\mathbb{T}, \sigma_f)$, with $U_T$ acting as multiplication by the coordinate function $z \mapsto z$. The entire [Hilbert space](/page/Hilbert%20Space) $L^2(X, \mu)$ decomposes as an orthogonal sum of such cyclic subspaces. [quotetheorem:3459] [citeproof:3459] The pair consisting of the measure class of $\sigma_1$ (the **maximal spectral type**) and the multiplicity function (recording how many $\sigma_k$ are mutually absolutely continuous at each point) is a complete unitary invariant for $U_T$: two unitary operators are unitarily equivalent if and only if their maximal spectral types and multiplicity functions agree. This pair is therefore an isomorphism invariant for the underlying dynamical system. A technical point: separability of $L^2(X, \mu)$ is essential to the theorem. Without it, the process of selecting vectors of maximal spectral type may not terminate in a countable sequence, and the Hahn–Hellinger theorem requires the countable-chain condition on the lattice of invariant subspaces. 
For standard probability spaces $(X, \mathcal{B}, \mu)$ — those isomorphic to $[0,1]$ with Lebesgue measure, or countable spaces with atomic measures — separability holds automatically, and the decomposition is always valid. [remark: Spectral Isomorphism vs. Measure-Theoretic Isomorphism] If two systems $(X, \mu, T)$ and $(Y, \nu, S)$ are measure-theoretically isomorphic (there exists a measure-preserving bijection $\phi: X \to Y$ with $\phi \circ T = S \circ \phi$), then their Koopman operators are unitarily equivalent — the map $\Phi: L^2(Y, \nu) \to L^2(X, \mu)$, $\Phi g = g \circ \phi$, is a unitary intertwiner. So spectral data are genuine invariants. The converse fails: spectral isomorphism does not in general imply measure-theoretic isomorphism. Establishing this failure requires the entropy invariant introduced by Kolmogorov. [/remark] ## Spectral Invariants: Discrete, Continuous, and Lebesgue Spectrum The cyclic decomposition classifies every ergodic system down to the finest spectral detail — but this precision is a liability as much as a strength. The full data of the maximal spectral type and the multiplicity function is too fine-grained to compute in practice and too delicate to compare across systems. What coarser spectral features distinguish qualitatively different dynamics? Three broad spectral classes — discrete, continuous, and Lebesgue — capture the most important divisions.  The cyclic decomposition is a complete but unwieldy invariant. For classification purposes, one extracts coarser qualitative data from the spectral type. Three spectral classes are fundamental: discrete spectrum, continuous spectrum, and Lebesgue spectrum. Each class captures a distinct dynamical character. **Eigenvalues and eigenfunctions.** A value $\lambda \in \mathbb{T}$ is an **eigenvalue** of $U_T$ if there exists $f \in L^2(X, \mu)$, $f \ne 0$, with $U_T f = \lambda f$, equivalently $f(Tx) = \lambda f(x)$ $\mu$-almost everywhere. 
The function $f$ is a corresponding **eigenfunction**. Since $U_T$ is unitary, all eigenvalues lie on $\mathbb{T}$, and eigenfunctions for distinct eigenvalues are orthogonal. For an ergodic system, eigenvalues are simple: any two eigenfunctions for the same eigenvalue are proportional $\mu$-a.e. To see this, let $f$ and $g$ be eigenfunctions for $\lambda$. Since $|g \circ T| = |\lambda|\,|g| = |g|$, the modulus $|g|$ is $T$-invariant and hence a non-zero constant $\mu$-a.e. by ergodicity, so the ratio $f/g$ is well defined. It satisfies $(f/g) \circ T = (\lambda f)/(\lambda g) = f/g$, so it is $T$-invariant and hence $\mu$-a.e. constant, again by ergodicity.

[definition: Discrete Spectrum]
A measure-preserving system $(X, \mathcal{B}, \mu, T)$ has **discrete spectrum** (also called **pure point spectrum**) if the eigenfunctions of $U_T$ span a dense subspace of $L^2(X, \mu)$; equivalently, $L^2(X, \mu)$ has an orthonormal basis consisting of eigenfunctions of $U_T$.
[/definition]

For a system with discrete spectrum, the [Hilbert space](/page/Hilbert%20Space) is entirely accounted for by almost-periodic structures: every $f \in L^2$ is approximated in norm by finite linear combinations of eigenfunctions, each of which is merely multiplied by a unimodular constant at every application of $T$.

[definition: Continuous Spectrum and Lebesgue Spectrum]
A measure-preserving system has **continuous spectrum** if the only eigenvalue of $U_T$ is $\lambda = 1$ and the corresponding eigenspace consists exactly of the constants (the latter condition being ergodicity of $T$). The system has **Lebesgue spectrum** if the restriction of $U_T$ to the orthocomplement $L^2_0(X, \mu) = \{f \in L^2 : \int f \, d\mu = 0\}$ of the constants has maximal spectral type equivalent to Lebesgue measure on $\mathbb{T}$ (in particular, all spectral measures $\sigma_f$ for $f \in L^2_0$ are absolutely continuous with respect to Haar measure on $\mathbb{T}$). If additionally the multiplicity is countably infinite everywhere on $\mathbb{T}$, the system has **countable Lebesgue spectrum**.
[/definition]

These three classes are arranged in a hierarchy of "randomness": discrete spectrum corresponds to maximally structured systems, Lebesgue spectrum to maximally random ones.
The connection to the [mixing hierarchy](/theorems/3436) from Chapter 8 is captured precisely by the following theorem. [quotetheorem:3460] [citeproof:3460] The theorem clarifies the hierarchy: Lebesgue spectrum $\implies$ strongly mixing $\implies$ weakly mixing $\implies$ ergodic, and each implication is strict. The two fundamental examples of Chapter 9 sit at opposite ends of this spectrum. [example: Spectral Measures of the Irrational Circle Rotation] Let $T_\alpha: \mathbb{T} \to \mathbb{T}$ be the irrational rotation $T_\alpha(x) = x + \alpha \pmod{1}$ for $\alpha \notin \mathbb{Q}$, acting on $(\mathbb{T}, \mathcal{B}(\mathbb{T}), \lambda)$ where $\lambda$ is Lebesgue measure. The trigonometric characters $e_n(x) = e^{2\pi i n x}$, $n \in \mathbb{Z}$, form an orthonormal basis of $L^2(\mathbb{T}, \lambda)$. Each $e_n$ is an eigenfunction: \begin{align*} U_{T_\alpha} e_n(x) = e_n(x + \alpha) = e^{2\pi i n (x + \alpha)} = e^{2\pi i n \alpha} e_n(x), \end{align*} so $e_n$ has eigenvalue $\lambda_n = e^{2\pi i n \alpha}$. Since $\alpha \notin \mathbb{Q}$, all eigenvalues $\{\lambda_n : n \in \mathbb{Z}\}$ are distinct (if $e^{2\pi i n\alpha} = e^{2\pi i m\alpha}$ then $(n-m)\alpha \in \mathbb{Z}$, forcing $n = m$) and form a countable dense subgroup of $\mathbb{T}$. The spectral measure of $e_n$ is the point mass $\sigma_{e_n} = \delta_{\lambda_n}$. Since the eigenfunctions $\{e_n : n \in \mathbb{Z}\}$ form an orthonormal basis, $T_\alpha$ has **pure discrete spectrum**. The eigenvalue $1$ occurs only at $n = 0$ (the constant function), confirming ergodicity. The presence of non-trivial eigenvalues $\{e^{2\pi i n \alpha} : n \ne 0\}$ confirms that $T_\alpha$ is not weakly mixing, and hence not strongly mixing. [/example] The irrational rotation is the archetypal structured system: all its dynamical complexity is encoded in countably many frequencies, each perfectly periodic. 
To appreciate why the spectral approach is powerful, it helps to contrast this with a system at the opposite extreme, one where no such periodic structure survives.

[example: Spectral Measures of the Bernoulli Shift]
Let $\sigma$ be the two-sided Bernoulli shift $B(p_0, \dots, p_{k-1})$ on $(A^{\mathbb{Z}}, \mu)$, where $A = \{0, \dots, k-1\}$ and $\mu = p^{\otimes \mathbb{Z}}$ is the product measure of a strictly positive probability vector $p = (p_0, \dots, p_{k-1})$. For any $f \in L^2_0(A^{\mathbb{Z}}, \mu)$ (zero mean), the correlation function $(U_\sigma^n f, f)_{L^2}$ decays to zero. To make this explicit for cylinder functions: if $f$ depends only on coordinates in a finite window $W \subset \mathbb{Z}$, then for $|n|$ large enough that the window $W + n = \{w + n : w \in W\}$ is disjoint from $W$, the functions $f$ and $f \circ \sigma^n$ depend on disjoint coordinates. Since $\mu$ is a product measure, functions of disjoint coordinate sets are independent: \begin{align*} (U_\sigma^n f, f)_{L^2} = \int f(\sigma^{n} x)\, \overline{f(x)} \, d\mu(x) = \int f \, d\mu \cdot \overline{\int f \, d\mu} = 0 \end{align*} (the last equality uses $\int f \, d\mu = 0$). This vanishing is exact, not merely in the limit, for all $|n|$ exceeding the diameter of $W$. For such a cylinder function, the spectral measure $\sigma_f$ has only finitely many non-zero Fourier coefficients, so it is absolutely continuous with respect to Lebesgue measure on $\mathbb{T}$, with a bounded density (a trigonometric polynomial). Cylinder functions are dense in $L^2_0$, and the vectors with absolutely continuous spectral measure form a closed subspace, so $\sigma_f$ is absolutely continuous for every $f \in L^2_0$. An explicit orthonormal basis constructed from cylinder functions (analogous to the Walsh functions) shows that the multiplicity is countably infinite. Therefore the Bernoulli shift has **countable Lebesgue spectrum**, the spectral type at the maximally random end of the hierarchy.
[/example]

## The Halmos–von Neumann Theorem

For most spectral types, the spectral data do not determine the system up to isomorphism. But pure discrete spectrum is a remarkable exception: for ergodic systems with pure discrete spectrum, the eigenvalue group alone is a complete isomorphism invariant.
This is the content of the [Halmos–von Neumann theorem](/theorems/3461), proved in their 1942 paper.

The starting point is the observation that the eigenvalues of an ergodic transformation form a group. If $f$ and $g$ are eigenfunctions with $U_T f = \lambda f$ and $U_T g = \eta g$, then $U_T(fg) = (\lambda\eta)(fg)$, so $\lambda\eta$ is an eigenvalue. Here $fg$ is a non-zero element of $L^2$ because, by ergodicity, the $T$-invariant functions $|f|$ and $|g|$ are non-zero constants $\mu$-a.e. Taking complex conjugates in $U_T f = \lambda f$ shows that $\bar{f}$ is an eigenfunction with eigenvalue $\overline{\lambda} = \lambda^{-1}$. Thus the set of eigenvalues $\Lambda(T) \subset \mathbb{T}$ is a subgroup of $\mathbb{T}$.

For an ergodic transformation, each eigenvalue $\lambda$ has a one-dimensional eigenspace. This allows us to associate to each $\lambda \in \Lambda(T)$ an eigenfunction $e_\lambda$, normalised so that $|e_\lambda| = 1$ $\mu$-a.e. and determined up to a unimodular constant. (This normalisation is possible because $|e_\lambda|$ is $U_T$-invariant and hence constant a.e. by ergodicity.) The product $e_\lambda e_\eta$ is always a constant multiple of $e_{\lambda\eta}$, and the constants can be chosen consistently so that $e_{\lambda\eta} = e_\lambda e_\eta$ $\mu$-a.e. and $e_1 = 1$ $\mu$-a.e. With this choice the map $\lambda \mapsto e_\lambda$ is a group homomorphism from $\Lambda(T)$ into the group of measurable unimodular functions.

[quotetheorem:3461] [citeproof:3461]

The [Halmos–von Neumann theorem](/theorems/3461) is a complete structure theorem. It says that the ergodic systems with pure discrete spectrum are, up to isomorphism, exactly the ergodic rotations on compact abelian groups, and the classification reduces to identifying the eigenvalue group. This is a striking rigidity result: the abstract measure-theoretic system is completely determined, up to isomorphism, by a countable subgroup of $\mathbb{T}$.

[example: Isomorphism Classes of Irrational Rotations]
The irrational rotation $T_\alpha$ on $(\mathbb{T}, \lambda)$ has eigenvalue group $\Lambda(T_\alpha) = \{e^{2\pi i n \alpha} : n \in \mathbb{Z}\}$.
This group is isomorphic to $\mathbb{Z}$ as an abstract group (generated by $e^{2\pi i \alpha}$). The Pontryagin dual of $\mathbb{Z}$ (with discrete topology) is $\mathbb{T}$ itself, recovering the model system as a rotation on $\mathbb{T}$. Two irrational rotations $T_\alpha$ and $T_\beta$ are isomorphic if and only if $\Lambda(T_\alpha) = \Lambda(T_\beta)$ as subgroups of $\mathbb{T}$. Both groups are infinite cyclic, and an infinite cyclic group has exactly two generators, so $\{e^{2\pi i n \alpha} : n \in \mathbb{Z}\}$ equals $\{e^{2\pi i n \beta} : n \in \mathbb{Z}\}$ if and only if $e^{2\pi i \beta} = e^{\pm 2\pi i \alpha}$, i.e., $\beta = \pm\alpha + k$ for some $k \in \mathbb{Z}$. In particular, $T_\alpha$ and $T_{-\alpha}$ are isomorphic (they have the same eigenvalue group), but $T_\alpha$ and $T_{\alpha'}$ are not isomorphic whenever $\alpha' \not\equiv \pm\alpha \pmod{1}$, for instance when $\alpha$ and $\alpha'$ are algebraically independent.
[/example]

The classification by eigenvalue group is not limited to rotations on the circle. The [Halmos–von Neumann theorem](/theorems/3461) applies to any compact abelian group, and higher-dimensional tori provide natural examples where the eigenvalue group has higher rank.

[example: Ergodic Rotations on the Two-Dimensional Torus]
Let $G = \mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$ and let $g = (\alpha, \beta)$ with $1, \alpha, \beta$ linearly independent over $\mathbb{Q}$. The rotation $R_g(x, y) = (x + \alpha, y + \beta) \pmod{1}$ is ergodic (since no non-zero character $e^{2\pi i(mx + ny)}$ is invariant: $e^{2\pi i(m\alpha + n\beta)} = 1$ forces $m = n = 0$ by [linear independence](/page/Linear%20Independence)). The eigenvalue group is $\Lambda(R_g) = \{e^{2\pi i(m\alpha + n\beta)} : m, n \in \mathbb{Z}\}$, which is a rank-$2$ subgroup of $\mathbb{T}$, isomorphic to $\mathbb{Z}^2$. By the [Halmos–von Neumann theorem](/theorems/3461), any ergodic system with pure discrete spectrum and this eigenvalue group is isomorphic to $R_g$ on $\mathbb{T}^2$.
[/example]

The two-torus example illustrates how the eigenvalue group, as an abstract abelian group, determines the model compact group as its Pontryagin dual: a free abelian eigenvalue group of rank $1$ gives $\mathbb{T}$, rank $2$ gives $\mathbb{T}^2$, and so on. This spectral-algebraic correspondence between ergodic rotations and subgroups of $\mathbb{T}$ has a striking consequence for entropy.

[remark: Discrete Spectrum and Zero Entropy]
Systems with pure discrete spectrum have zero metric entropy $h(T) = 0$. Heuristically, a rotation on a compact group is an isometry for a translation-invariant metric: nearby points stay nearby under iteration, so observing longer and longer orbit segments reveals no new information. There is no exponential growth of information, and everything about the orbit is captured by the eigenvalue structure. This connection between spectral type and entropy will be central in Ergodic Theory II.
[/remark]

## Spectral Rigidity and the Limits of the Spectral Approach

The [Halmos–von Neumann theorem](/theorems/3461) gives a complete spectral classification for pure discrete spectrum, but beyond this case spectral invariants become insufficient to resolve the isomorphism problem. This section surveys what spectral theory can and cannot do, and explains why the entropy theory of Ergodic Theory II is necessary.

### The reach of spectral invariants

The spectral type (maximal spectral type and multiplicity function) is a complete unitary invariant for the Koopman operator, and hence an isomorphism invariant for the dynamical system. It determines:

- Ergodicity: whether $\lambda = 1$ is a simple eigenvalue.
- Weak mixing: whether, in addition, $U_T$ has no eigenvalue other than $1$, equivalently whether every spectral measure $\sigma_f$ with $f \in L^2_0$ is continuous (has no atoms).
- Strong mixing: whether the Fourier coefficients $(U_T^n f, f)_{L^2}$ of every spectral measure $\sigma_f$ with $f \in L^2_0$ tend to zero as $|n| \to \infty$.
- Lebesgue spectrum: whether the spectral type is absolutely continuous with countably infinite multiplicity.
- The complete isomorphism class within the ergodic pure discrete spectrum case.
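The criteria in this list are all statements about correlation sequences $n \mapsto (U_T^n f, f)_{L^2}$, which are the Fourier coefficients of the spectral measures $\sigma_f$. The following Python sketch is only an illustration, not part of the theory: it approximates these correlations by grid averages, for the character $e_1(x) = e^{2\pi i x}$ under an irrational rotation and for the character $e^{2\pi i x}$ on $\mathbb{T}^2$ under the Arnold cat map of Chapter 9. The function names, the grid sizes and the choice $\alpha = \sqrt{2} - 1$ are assumptions made for this example.

```python
import cmath
import math


def rotation_correlation(n, alpha, grid=1000):
    """Grid average approximating (U^n e_1, e_1) for the rotation x -> x + alpha (mod 1)."""
    total = 0j
    for j in range(grid):
        x = j / grid
        # e_1(x + n*alpha) multiplied by the complex conjugate of e_1(x)
        total += cmath.exp(2j * math.pi * (x + n * alpha)) * cmath.exp(-2j * math.pi * x)
    return total / grid


def cat_map_correlation(k, freq=(1, 0), grid=64):
    """Grid average approximating (U^k e, e) for the cat map, e(x, y) = exp(2 pi i (m x + n y))."""
    m, n = freq
    total = 0j
    for a in range(grid):
        for b in range(grid):
            x, y = a, b  # the lattice point (a/grid, b/grid), tracked in integers mod grid
            for _ in range(k):
                x, y = (2 * x + y) % grid, (x + y) % grid
            # e(T^k z) multiplied by the complex conjugate of e(z)
            total += cmath.exp(2j * math.pi * (m * (x - a) + n * (y - b)) / grid)
    return total / (grid * grid)


alpha = math.sqrt(2) - 1
rotation_moduli = [abs(rotation_correlation(n, alpha)) for n in (1, 10, 100)]
cat_moduli = [abs(cat_map_correlation(k)) for k in (1, 2, 3, 4)]
print(rotation_moduli)  # each modulus is 1 up to rounding error
print(cat_moduli)       # each modulus is 0 up to rounding error
```

For the rotation the modulus of the correlation stays at $1$ for every $n$, reflecting the atom of $\sigma_{e_1}$. For the cat map the correlation of a non-trivial character with itself vanishes as soon as $k \ge 1$, because $(1,0)A^k \ne (1,0)$. The grid average is a faithful stand-in for the integral only for small $k$: once the entries of $A^k$ become comparable to the grid size, frequencies alias modulo the grid.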
The irrational rotation and any Bernoulli shift are immediately distinguished by spectral type: the former has pure discrete spectrum (a purely atomic maximal spectral type), the latter has countable Lebesgue spectrum (a purely absolutely continuous maximal spectral type). No further analysis is needed to show these systems are non-isomorphic. ### The failure of spectral invariants within the Lebesgue class Within the class of systems with countable Lebesgue spectrum, all systems are spectrally isomorphic to one another — they share the same maximal spectral type (Lebesgue measure) and the same multiplicity (countably infinite). Spectral theory is thus powerless to distinguish them. [quotetheorem:3462] The proof follows from the independence structure shown in the Bernoulli shift example above: for every $f \in L^2_0$, the spectral measure $\sigma_f$ is absolutely continuous with respect to Lebesgue measure on $\mathbb{T}$, and an explicit orthonormal basis (built from functions depending on disjoint finite windows) witnesses countably infinite multiplicity. See Walters, Chapter 5, or Petersen, Chapter 4. This result has a striking and unsettling consequence: the Bernoulli shifts $B(1/2, 1/2)$ (fair coin flips) and $B(1/3, 1/3, 1/3)$ (fair three-sided die) are spectrally indistinguishable. Yet they are genuinely non-isomorphic systems — a fact that requires a fundamentally different invariant to establish. [example: Spectral Indistinguishability vs. Non-Isomorphism] The Bernoulli shifts $B(1/2, 1/2)$ and $B(1/3, 1/3, 1/3)$ both have countable Lebesgue spectrum. Their Koopman operators are unitarily equivalent. However, the Shannon entropy of the first system is $h = -2 \cdot \frac{1}{2}\log\frac{1}{2} = \log 2$, while for the second $h = -3 \cdot \frac{1}{3}\log\frac{1}{3} = \log 3$. Since $\log 2 \ne \log 3$, Kolmogorov's entropy invariant (1958) distinguishes the two systems, establishing that they are not isomorphic. No spectral argument can do this. 
[/example] ### Spectral rigidity phenomena While spectral theory cannot solve the isomorphism problem in general, it does impose strong constraints — spectral rigidity — on what transformations can look like. An ergodic system is **spectrally rigid** (with respect to a sequence $(n_k)$) if $U_T^{n_k} \to \mathrm{Id}$ in the strong operator topology on $\mathcal{L}(L^2)$. Spectral rigidity along a sequence $(n_k)$ means the system "almost returns" to the identity along that sequence, constraining the spectral measures to be highly concentrated. [example: Irrational Rotations Cannot Be Mixing — A Spectral Proof] For the irrational rotation $T_\alpha$, the eigenfunction $e_1(x) = e^{2\pi i x}$ satisfies \begin{align*} (U_{T_\alpha}^n e_1, e_1)_{L^2} = \int_{\mathbb{T}} e^{2\pi i (x+n\alpha)} \overline{e^{2\pi i x}} \, d\lambda(x) = e^{2\pi i n\alpha}. \end{align*} Since $|e^{2\pi i n\alpha}| = 1$ for all $n$, the correlation does not tend to zero. The spectral measure $\sigma_{e_1} = \delta_{e^{2\pi i\alpha}}$ is a point mass, consistent with the point spectrum. Strong mixing would require this correlation to tend to zero, but it does not — confirming that $T_\alpha$ is not strongly mixing by a direct spectral computation. [/example] ### The Kolmogorov–Sinai entropy invariant The insufficiency of spectral invariants for the isomorphism problem was the central open problem of ergodic theory through the 1950s. The resolution came with Kolmogorov's introduction of metric entropy in 1958, refined by Sinai. The **metric entropy** $h(T)$ of a measure-preserving transformation $T$ is a non-negative real number (possibly $+\infty$) measuring the asymptotic exponential growth rate of information generated by the orbit of $T$. It is defined via partitions of $(X, \mu)$ and their refinements under iteration of $T$. 
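The entropy values $\log 2$ and $\log 3$ quoted in the Bernoulli example above come from the Shannon formula $-\sum_a p_a \log p_a$. A minimal Python sketch, assuming natural logarithms and using a hypothetical helper name `shannon_entropy`, reproduces them:

```python
import math


def shannon_entropy(p):
    """Shannon entropy -sum p_a log p_a (natural logarithm) of a probability vector p."""
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-12:
        raise ValueError("p must be a probability vector")
    # Terms with p_a = 0 contribute nothing, by the convention 0 log 0 = 0.
    return -sum(x * math.log(x) for x in p if x > 0)


h_coin = shannon_entropy([1 / 2, 1 / 2])
h_die = shannon_entropy([1 / 3, 1 / 3, 1 / 3])
print(h_coin, math.log(2))  # the two values agree up to rounding
print(h_die, math.log(3))   # the two values agree up to rounding
```

Since the two values differ, entropy separates $B(1/2, 1/2)$ from $B(1/3, 1/3, 1/3)$ even though their spectral data coincide.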
Entropy is an isomorphism invariant that is not determined by the spectral invariants, and it solves the isomorphism problem for Bernoulli shifts:

- **Kolmogorov (1958):** For the Bernoulli shift $\sigma_p$ with probability vector $p = (p_a)_{a \in A}$, the metric entropy is $h(\sigma_p) = -\sum_{a \in A} p_a \log p_a$ (the Shannon entropy of $p$).
- **Sinai (1962):** If $(X, \mu, T)$ is ergodic with $h(T) \ge h(\sigma_p)$, then $(X, \mu, T)$ has a Bernoulli factor (a factor isomorphic to $\sigma_p$).
- **Ornstein (1970):** Two Bernoulli shifts are measure-theoretically isomorphic if and only if they have the same entropy. Entropy is a **complete** isomorphism invariant for Bernoulli shifts.

These results, which require a completely different toolkit from spectral theory, are the subject of Ergodic Theory II.

[explanation: Spectral Theory and Entropy — Two Layers of Ergodic Theory]
The course has revealed that ergodic theory has two structurally distinct layers.

The first is the **spectral layer**: the qualitative dynamical behaviour (ergodicity, mixing, and its strength) is encoded in the unitary representation of $T$ on $L^2(X, \mu)$. The [Halmos–von Neumann theorem](/theorems/3461) shows that this layer is complete for pure discrete spectrum systems, giving a classification via countable subgroups of $\mathbb{T}$ and compact group rotations. The spectral layer is the natural domain of functional analysis.

The second is the **informational layer**: it measures complexity, not frequency structure. Entropy lives here. It captures how much genuinely new information the system generates at each step, information that cannot be predicted from any finite past. Systems with pure discrete spectrum generate no new information (their future is determined by the eigenvalue structure), consistent with $h(T) = 0$. Systems with Lebesgue spectrum can generate fresh information at every step, as Bernoulli shifts do; their entropy can be any non-negative value.
The spectral type places structural constraints on entropy (pure discrete spectrum forces $h = 0$), but within the class of Lebesgue spectrum systems the spectral invariant carries no further information, while entropy takes all values in $[0, +\infty]$.

What makes ergodic theory deep is that these two layers interact without being redundant. Two systems can share the same spectral type yet have different entropy (all Bernoulli shifts share countable Lebesgue spectrum, but uniform Bernoulli shifts on alphabets $A$ of different sizes have distinct entropies $\log|A|$). Two systems can also share the same entropy yet have different spectral types: irrational rotations by $\alpha$ and $\beta$ with $\beta \not\equiv \pm\alpha \pmod 1$ both have zero entropy, but their eigenvalue groups differ. Entropy and spectral theory are complementary tools.
[/explanation]

## What the Course Has Achieved, and the Road Ahead

This course has developed ergodic theory from the ground up across ten chapters.

The journey began with the definition of a measure-preserving system and the [Poincaré Recurrence Theorem](/theorems/3425) (Chapters 1–2): even without strong hypotheses, almost every orbit returns. Chapter 3 introduced ergodicity and the fundamental dichotomy between decomposable and indecomposable systems. Chapters 4–5 proved the two central convergence theorems: the [von Neumann Mean Ergodic Theorem](/theorems/3448) (convergence of time averages in $L^2$) and the Birkhoff Pointwise Ergodic Theorem (almost-everywhere convergence of time averages for $L^1$ functions). Chapter 6 established the ergodic decomposition, expressing any invariant measure as an integral of ergodic measures; Chapter 7 treated unique ergodicity and Weyl's equidistribution theorem for polynomial sequences. Chapter 8 developed the [mixing hierarchy](/theorems/3436) (ergodic, weakly mixing, strongly mixing) and established the spectral characterisations that made the hierarchy tractable.
Chapter 9 populated this hierarchy with concrete examples: Bernoulli shifts, Markov shifts, the geodesic flow on a compact hyperbolic surface, Hamiltonian systems via [Liouville's theorem](/page/Liouville's%20Theorem), and the Arnold cat map. Chapter 10 brought the analytic and spectral aspects together: the Halmos–von Neumann classification of pure discrete spectrum systems, the [Lebesgue spectrum of Bernoulli shifts](/theorems/3462), and the fundamental limitation of spectral theory.

The thread connecting all of this is the Koopman operator. It is the bridge between the nonlinear dynamics of $T$ and the linear analysis of $L^2(X, \mu)$; spectral theory is the systematic study of what this bridge reveals and what it conceals.

**Ergodic Theory II: Entropy and Advanced Topics** begins exactly where this course ends. Its central themes are:

- **Entropy theory.** The Kolmogorov–Sinai entropy $h(T)$, its computation via generating partitions and the Kolmogorov–Sinai theorem, and its role as an isomorphism invariant. The entropy of the key examples: $h(\sigma_p) = -\sum p_a \log p_a$ for Bernoulli shifts, $h = 0$ for irrational rotations, and positive entropy for the cat map and geodesic flow.
- **Ornstein isomorphism theory.** Ornstein's theorem that entropy is a complete invariant for Bernoulli shifts: $\sigma_p \cong \sigma_q$ iff $h(\sigma_p) = h(\sigma_q)$. The proof uses the notion of $\bar{d}$-distance between processes and the very weak Bernoulli property.
- **Joinings.** The theory of joinings (invariant measures on product spaces that project to given marginals) as a unifying framework for studying factors, isomorphism, and disjointness of ergodic systems.
- **Multiple recurrence and Furstenberg's theorem.** Furstenberg's ergodic-theoretic proof of Szemerédi's theorem on arithmetic progressions in sets of positive density, via the Multiple Recurrence Theorem and its extensions.
- **Symbolic dynamics and specification.** The relationship between ergodic theory and topological dynamics via specification properties, intrinsic ergodicity, and the thermodynamic formalism.

The foundations laid in this course (the ergodic theorems, the [mixing hierarchy](/theorems/3436), the spectral theory, and the gallery of examples) are the essential prerequisites for all of these directions. Ergodic theory, from the vantage point of Chapter 10, reveals itself as a discipline at the intersection of analysis, probability, and combinatorics, with deep connections to number theory, geometry, and mathematical physics. The second course explores these connections in their full depth.

Created by admin on 5/13/2026 | Last updated on 5/13/2026
