The central question of this course is deceptively simple: when can we interchange a limit and an integral? That is, when does
\begin{align*}
\lim_{n \to \infty} \int f_n \, d\mu = \int \lim_{n \to \infty} f_n \, d\mu
\end{align*}
hold? In the Riemann theory taught in Analysis II, the answer is unsatisfying: only under strong hypotheses (typically [uniform convergence](/page/Uniform%20Convergence)) that exclude many natural situations in analysis and probability. Measure theory was created, in large part, to give a much better answer.
# Motivation
## Why Riemann Integration Breaks Down
The [Riemann integral](/page/Riemann%20Integral) partitions the *domain* of a function into subintervals, forms upper and lower sums, and declares the function integrable when these sums converge to the same limit. This works beautifully for continuous [functions](/page/Function), and even for functions with "few" discontinuities — the precise condition is that the set of discontinuities has Lebesgue measure zero, though the Riemann theory cannot even state this condition in its own language. The trouble begins when we try to take [limits](/page/Limit).
Consider the following example. Enumerate the rationals in $[0,1]$ as $q_1, q_2, q_3, \dots$ and define $f_n = \mathbf{1}_{\{q_1, \dots, q_n\}}$. Each $f_n$ is zero except at finitely many points, so each is Riemann integrable with $\int_0^1 f_n(x)\,dx = 0$. The pointwise limit $f = \lim_n f_n = \mathbf{1}_{\mathbb{Q} \cap [0,1]}$ exists everywhere — yet $f$ is *not* Riemann integrable. Every subinterval of $[0,1]$ contains both rationals (where $f = 1$) and irrationals (where $f = 0$), so the upper and lower Riemann sums are permanently stuck at $1$ and $0$ respectively. The Riemann integral cannot pass the limit through the integral sign, even though the sequence is monotone increasing and uniformly bounded.
This is not an isolated pathology. The three situations where the interchange of limits and integrals matters most are precisely where the Riemann theory is weakest:
**Monotone limits.** If $0 \leq f_1 \leq f_2 \leq \cdots$ pointwise, can we conclude $\int \lim f_n = \lim \int f_n$? Not in the Riemann theory — the limit function may fail to be Riemann integrable, as the example above shows.
**Limits with a dominating function.** If $f_n \to f$ pointwise and $|f_n| \leq g$ for some integrable $g$, can we interchange the limit and integral? Riemann theory requires the much stronger hypothesis of uniform convergence, which rules out many natural examples — for instance, approximating a discontinuous function by continuous ones.
**One-sided bounds on integrals of limits.** If $f_n \geq 0$ pointwise, what can we say about $\int \liminf f_n$ relative to $\liminf \int f_n$? The Riemann theory has no result of this kind at all.
## How Measure Theory Resolves These Failures
The [Lebesgue integral](/page/Lebesgue%20Integral), built on the theory of measures, resolves every one of these difficulties. The key insight is to partition the *range* of a function rather than its domain: instead of asking "how wide is each subinterval?", we ask "how large is the set where $f$ takes values in a given range?". This requires a systematic notion of "size" for subsets — a *measure* — and in particular the ability to assign size zero to complicated [sets](/page/Set) like $\mathbb{Q} \cap [0,1]$. The payoff is a suite of convergence theorems that the Riemann theory cannot match.
The following table summarises the core exchange. Each row identifies a limitation of the Riemann integral and the measure-theoretic result that overcomes it.
| **Problem in Riemann theory** | **Measure-theoretic resolution** |
|---|---|
| Monotone limits of Riemann-integrable functions may have non-Riemann-integrable limits, so $\lim \int f_n$ and $\int \lim f_n$ cannot be compared. | The [Monotone Convergence Theorem](/theorems/509) guarantees $\int \lim f_n = \lim \int f_n$ for any increasing sequence of non-negative measurable functions — no [integrability](/page/Integral) assumption on the limit is needed; the theorem *produces* it. |
| Interchanging limits and integrals requires uniform convergence, which is far too restrictive for most applications. | The [Dominated Convergence Theorem](/theorems/4) requires only pointwise convergence and a single integrable dominating function $g$ with $\lvert f_n \rvert \leq g$. Uniform convergence is replaced by a pointwise bound. |
| No Riemann analogue exists for bounding $\int \liminf f_n$ when the sequence is not monotone. | [Fatou's Lemma](/theorems/510) gives the one-sided inequality $\int \liminf f_n \leq \liminf \int f_n$ for any sequence of non-negative measurable functions — a result with no Riemann counterpart. |
| The class of Riemann-integrable functions is not closed under pointwise limits, so the theory is not self-contained. | The class of Lebesgue-integrable functions is closed under pointwise limits (with an integrable dominator), making the Lebesgue theory internally consistent and stable under the operations of analysis. |
| Riemann integration is defined only on boxes (and, by extension, Jordan-measurable regions) in $\mathbb{R}^n$ and cannot handle abstract domains. | Lebesgue integration is defined on arbitrary measure spaces $(X, \mathcal{A}, \mu)$, including probability spaces, manifolds, and function spaces. |
The three convergence theorems — MCT, Fatou, DCT — are the engine of the entire course. Every major result we prove will rely on at least one of them, often in combination.
## Why Probability Needs Measure Theory
The connection to probability is immediate: a probability space is simply a measure space $(X, \mathcal{A}, \mu)$ with the additional constraint that $\mu(X) = 1$. A random variable is a measurable function, and its expectation is a Lebesgue integral. Once this identification is made, the convergence theorems become theorems about expectations of random variables — and the limitations of the Riemann integral become limitations on what can be proved about random phenomena.
The fundamental questions of probability theory concern the long-run behaviour of [sequences](/page/Sequence) of random variables: does the sample average $\bar{X}_n = (X_1 + \cdots + X_n)/n$ converge to the population mean? In what sense? How are the fluctuations distributed? Answering these questions requires precisely the tools that the Riemann integral lacks. The following table shows how the measure-theoretic machinery developed in the first half of the course underpins the probabilistic results of the second half.
| **Probabilistic question** | **Measure-theoretic tool** |
|---|---|
| Does $\bar{X}_n \to \mu$ almost surely for i.i.d. random variables with $\mathbb{E}[\lvert X_1 \rvert] < \infty$? | The [Strong Law of Large Numbers](/theorems/520), proved via the [Birkhoff Ergodic Theorem](/theorems/518), which itself is established using the Maximal Ergodic Theorem and the Dominated Convergence Theorem. |
| What is the limiting distribution of the standardised partial sums $(S_n - n\mu)/(\sigma\sqrt{n})$? | The [Central Limit Theorem](/theorems/521), proved using characteristic functions (Fourier transforms of probability measures) and Lévy's convergence theorem. |
| When can "infinitely many" events occur? When must they? | The [Borel–Cantelli Lemmas](/theorems/507): the first uses countable subadditivity of measures; the second uses independence to convert a divergent series into an almost-sure event. |
| When can expectations and limits be interchanged for random variables? | The [Dominated Convergence Theorem](/theorems/4) applied in the probability setting, often combined with uniform integrability. |
| How do time averages relate to space averages for stationary processes? | The [Birkhoff Ergodic Theorem](/theorems/518), which asserts $n^{-1}\sum_{k=0}^{n-1} f \circ \Theta^k \to \bar{f}$ almost everywhere for measure-preserving transformations $\Theta$. |
In each case, the measure-theoretic framework is not optional decoration — it is the language in which the statement is formulated and the toolkit from which the proof is assembled.
## Course Overview
The course develops in two phases. The first phase builds the abstract machinery; the second applies it to probability.
**Phase I: Measure and Integration.** We begin with [Measures](/page/Measures), constructing $\sigma$-algebras and measures, proving the extension theorems that build Lebesgue measure from its values on intervals, and establishing the Borel–Cantelli lemmas and Kolmogorov's zero–one law. The [Measurable Functions](/page/Measurable%20Functions) section studies which functions are compatible with this structure — the measurable functions — and develops the modes of convergence (pointwise, almost everywhere, in measure) that arise naturally. The [Integration](/page/Integration) section builds the Lebesgue integral via simple function approximation and proves the three convergence theorems. We then turn to [Inequalities and $L^p$ Spaces](/page/Inequalities%20and%20%24L%5Ep%24%20Spaces), where the Markov, Jensen, Hölder, and Minkowski inequalities give quantitative control on integrals, and completeness of $L^p$ provides the function spaces in which analysis takes place. This section also develops $L^2$ as a [Hilbert space](/page/Hilbert%20Space), orthogonal projection, and its connection to conditional expectation.
**Phase II: Probability.** The [Characteristic Functions and the Fourier Transform](/page/Characteristic%20Functions%20and%20the%20Fourier%20Transform) section introduces the Fourier-analytic tools — characteristic functions, the inversion formula, and Lévy's convergence theorem — that convert questions about distributions into questions about complex-valued functions. [Ergodic Theory](/page/Ergodic%20Theory) establishes the Birkhoff Ergodic Theorem, connecting time averages to space averages and providing the key ingredient for the Strong Law. The course culminates in [The Strong Law and the Central Limit Theorem](/page/The%20Strong%20Law%20and%20the%20Central%20Limit%20Theorem), the two results that justify the entire enterprise: the sample average converges to the population mean (almost surely), and its fluctuations are asymptotically Gaussian.
We assume familiarity with Part IB Analysis II (in particular, the Riemann integral, pointwise and uniform convergence, and basic [metric space](/page/Metric%20Space) topology). No prior exposure to measure theory or abstract probability is required.

---

The [Introduction](/pages/introduction) showed that the Riemann integral breaks down when functions have complicated level sets, and the root cause is that Riemann's framework has no way to assign a "size" to sets like $\mathbb{Q} \cap [0,1]$ beyond the crude tool of covering by intervals. To build a better integral, we need a better notion of size. This section develops that notion: we define *measures* on *$\sigma$-algebras*, prove the extension theorems that let us construct measures from their values on simple sets, and establish the probabilistic tools (independence, Borel–Cantelli) that rely on this foundation.
The central tension is between generality and consistency. We would like to measure every subset of $\mathbb{R}$, but Vitali's construction (see below) shows this is impossible if we insist on translation invariance and countable additivity. The solution is to measure only the subsets in a $\sigma$-algebra — a collection rich enough for all practical purposes but restricted enough to avoid contradictions.
# 1. Measure Spaces
## 1.1 $\sigma$-Algebras and Measures
[definition:$\sigma$-Algebra]
Let $E$ be a set. A *$\sigma$-algebra* $\mathcal{E}$ on $E$ is a collection of subsets of $E$ such that $\emptyset \in \mathcal{E}$; if $A \in \mathcal{E}$ then $A^c = E \setminus A \in \mathcal{E}$; and for any countable sequence $(A_n)$ in $\mathcal{E}$, $\bigcup_n A_n \in \mathcal{E}$. The pair $(E, \mathcal{E})$ is called a *measurable space*.
[/definition]
Closure under complements and countable unions automatically gives closure under countable intersections (by De Morgan), so a $\sigma$-algebra is closed under every set operation we need in analysis. The requirement of *countable* (not just finite) closure is what distinguishes $\sigma$-algebras from algebras, and it is essential: the set $\mathbb{Q} \cap [0,1] = \bigcup_{n=1}^\infty \{q_n\}$ is a countable union of singletons, so any $\sigma$-algebra containing singletons also contains $\mathbb{Q} \cap [0,1]$. An algebra closed only under finite unions would not guarantee this.
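Spelled out as a short illustrative check, the De Morgan step for a sequence $(A_n)$ in $\mathcal{E}$ reads
\begin{align*}
\bigcap_n A_n = \Bigl(\bigcup_n A_n^c\Bigr)^c \in \mathcal{E},
\end{align*}
since each $A_n^c$ lies in $\mathcal{E}$, so does their countable union, and so does the complement of that union. The same reasoning gives $E = \emptyset^c \in \mathcal{E}$.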
[definition:Measure]
A *measure* on a measurable space $(E, \mathcal{E})$ is a function $\mu: \mathcal{E} \to [0, \infty]$ such that $\mu(\emptyset) = 0$ and $\mu$ is *countably additive*: for any disjoint sequence $(A_n)$ in $\mathcal{E}$,
\begin{align*}
\mu\!\left(\bigcup_n A_n\right) = \sum_{n=1}^\infty \mu(A_n).
\end{align*}
The triple $(E, \mathcal{E}, \mu)$ is a *measure space*. A measure is *finite* if $\mu(E) < \infty$, and *$\sigma$-finite* if $E$ can be written as $\bigcup_{n=1}^\infty E_n$ with $E_n \in \mathcal{E}$ and $\mu(E_n) < \infty$ for each $n$.
[/definition]
Countable additivity has immediate and powerful consequences. If $A_1 \subseteq A_2 \subseteq \cdots$ then $\mu(\bigcup_n A_n) = \lim_n \mu(A_n)$ (*[continuity](/page/Continuity) from below*), because the union telescopes as a disjoint union of the differences $A_n \setminus A_{n-1}$. Similarly, if $B_1 \supseteq B_2 \supseteq \cdots$ with $\mu(B_1) < \infty$, then $\mu(\bigcap_n B_n) = \lim_n \mu(B_n)$ (*continuity from above*). These properties are used constantly — for instance, continuity from below is the measure-theoretic engine behind the [Monotone Convergence Theorem](/theorems/509).
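The telescoping argument can be sketched as follows; the convention $A_0 = \emptyset$ and the names $D_n = A_n \setminus A_{n-1}$ are introduced here only for illustration. The sets $D_n$ are disjoint with $\bigcup_{k \leq N} D_k = A_N$, so countable additivity gives
\begin{align*}
\mu\!\left(\bigcup_n A_n\right) = \mu\!\left(\bigcup_n D_n\right) = \sum_{n=1}^\infty \mu(D_n) = \lim_{N \to \infty} \sum_{n=1}^N \mu(D_n) = \lim_{N \to \infty} \mu(A_N).
\end{align*}
The finiteness hypothesis in continuity from above cannot simply be dropped. For counting measure on $\mathbb{N}$ and $B_n = \{n, n+1, \dots\}$, every $\mu(B_n) = \infty$ while $\bigcap_n B_n = \emptyset$ has measure $0$.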
## 1.2 Generating $\sigma$-Algebras
In practice, $\sigma$-algebras are rarely described by listing their elements. Instead, we specify a small *generating* collection and let the $\sigma$-algebra machinery fill in the rest:
[definition:Generated $\sigma$-Algebra]
For any collection $\mathcal{A}$ of subsets of $E$, the *$\sigma$-algebra generated by $\mathcal{A}$*, written $\sigma(\mathcal{A})$, is the intersection of all $\sigma$-algebras on $E$ containing $\mathcal{A}$ — equivalently, the smallest $\sigma$-algebra containing $\mathcal{A}$.
[/definition]
[example:The Borel $\sigma$-Algebra]
The most important example is the *Borel $\sigma$-algebra* $\mathcal{B}(\mathbb{R}) = \sigma(\mathcal{O})$, where $\mathcal{O}$ is the collection of all open subsets of $\mathbb{R}$. Since every open set in $\mathbb{R}$ is a countable union of open intervals with rational endpoints (by density of $\mathbb{Q}$), we equally have $\mathcal{B}(\mathbb{R}) = \sigma(\{(a,b) : a < b\})$, and intervals with $a, b \in \mathbb{Q}$ already suffice. We can restrict the generators further: $\mathcal{B}(\mathbb{R}) = \sigma(\{(-\infty, q] : q \in \mathbb{Q}\})$, because for rational $a < b$ we have $(a, b) = \bigcup_{n=1}^\infty (a, b - 1/n] = \bigcup_{n=1}^\infty \bigl((-\infty, b-1/n] \setminus (-\infty, a]\bigr)$, and every endpoint on the right-hand side is rational. The fact that a single countable family of half-lines generates the entire Borel $\sigma$-algebra is what makes it possible to verify measurability by checking only preimages of half-lines, which is the basis of the [Generator Criterion for Measurability](/theorems/525).
[/example]
## 1.3 $\pi$-Systems, $d$-Systems, and Uniqueness
A recurring problem in measure theory is: *given two measures that agree on some collection of sets, do they agree on the entire generated $\sigma$-algebra?* The answer is yes, provided the collection has enough structure. The key is to decompose the $\sigma$-algebra axioms into two independent pieces.
[definition:$\pi$-System]
A collection $\mathcal{A}$ of subsets of $E$ is a *$\pi$-system* if it is closed under finite intersections: $A, B \in \mathcal{A}$ implies $A \cap B \in \mathcal{A}$.
[/definition]
[definition:$d$-System]
A collection $\mathcal{D}$ of subsets of $E$ is a *$d$-system* (or Dynkin system) if $E \in \mathcal{D}$; whenever $A, B \in \mathcal{D}$ with $A \subseteq B$, then $B \setminus A \in \mathcal{D}$; and for any increasing sequence $(A_n)$ in $\mathcal{D}$, $\bigcup_n A_n \in \mathcal{D}$.
[/definition]
A collection is a $\sigma$-algebra if and only if it is both a $\pi$-system and a $d$-system. The idea is that $\pi$-systems handle intersections (the "algebraic" part) while $d$-systems handle complements and limits (the "analytic" part). Separating these roles is what makes the following lemma so useful:
[quotetheorem:505]
Why is this powerful? Suppose two finite measures $\mu_1$ and $\mu_2$ with $\mu_1(E) = \mu_2(E)$ agree on a $\pi$-system $\mathcal{A}$. The collection $\mathcal{D} = \{B \in \sigma(\mathcal{A}) : \mu_1(B) = \mu_2(B)\}$ is a $d$-system containing $\mathcal{A}$ (checking the three axioms uses countable additivity of both measures, together with finiteness so that measures of differences can be computed by subtraction). By the [Dynkin $\pi$-system lemma](/theorems/505), $\mathcal{D} \supseteq \sigma(\mathcal{A})$, so $\mu_1 = \mu_2$ on $\sigma(\mathcal{A})$. This is the proof strategy for the [Uniqueness of Extension theorem](/theorems/506).
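The $d$-system check can be sketched as follows, under the assumption (as in the uniqueness theorem) that $\mu_1(E) = \mu_2(E) < \infty$, so that every subtraction below is between finite numbers. That $E \in \mathcal{D}$ is exactly this assumption. For $A \subseteq B$ in $\mathcal{D}$, and for an increasing sequence $(A_n)$ in $\mathcal{D}$ with union $A_\infty$ (a name used here only for illustration):
\begin{align*}
\mu_1(B \setminus A) &= \mu_1(B) - \mu_1(A) = \mu_2(B) - \mu_2(A) = \mu_2(B \setminus A), \\
\mu_1(A_\infty) &= \lim_{n \to \infty} \mu_1(A_n) = \lim_{n \to \infty} \mu_2(A_n) = \mu_2(A_\infty).
\end{align*}
The first line uses finite additivity and the second uses continuity from below.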
[citeproof:505]
## 1.4 Measure Extension
Knowing that generators determine $\sigma$-algebras is only useful if we can *extend* a pre-measure defined on generators to a genuine measure on the whole $\sigma$-algebra. This is the content of the extension theorems.
[quotetheorem:522]
The [Carathéodory Extension Theorem](/theorems/522) says: start with a countably additive set function on a ring $\mathcal{A}$ (a collection closed under finite unions and differences), define an *outer measure* $\mu^*(B) = \inf\{\sum_n \mu(A_n) : B \subseteq \bigcup_n A_n,\; A_n \in \mathcal{A}\}$ by covering $B$ with sets from $\mathcal{A}$, then restrict $\mu^*$ to the $\mu^*$-measurable sets (those that "split" every test set additively). The remarkable fact is that these $\mu^*$-measurable sets form a $\sigma$-algebra containing $\mathcal{A}$, and $\mu^*$ restricted to this $\sigma$-algebra is a genuine measure extending the original $\mu$.
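In symbols, writing $T$ for the test set (a letter chosen here for illustration), a set $A \subseteq E$ is $\mu^*$-measurable when
\begin{align*}
\mu^*(T) = \mu^*(T \cap A) + \mu^*(T \cap A^c) \quad \text{for every } T \subseteq E.
\end{align*}
Because an outer measure of this form is subadditive, the inequality $\mu^*(T) \leq \mu^*(T \cap A) + \mu^*(T \cap A^c)$ holds automatically, so in practice only the reverse inequality needs to be checked.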
[citeproof:522]
Extensions need not be unique in general. But under a finiteness condition, the $\pi$-system argument above gives uniqueness:
[quotetheorem:506]
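The hypothesis $\mu_1(E) = \mu_2(E) < \infty$ (or more generally, $\sigma$-finiteness along sets of the $\pi$-system) is essential. On $\mathbb{Z}$ with the power-set $\sigma$-algebra, counting measure and twice counting measure agree on the $\pi$-system of tails $T_n = \{n, n+1, n+2, \dots\}$, $n \in \mathbb{Z}$, because both assign $\infty$ to every tail. The tails generate the full $\sigma$-algebra, since $\{n\} = T_n \setminus T_{n+1}$ and every subset of $\mathbb{Z}$ is a countable union of singletons, yet the two measures differ on every non-empty finite set. The finiteness condition in the [uniqueness theorem](/theorems/506) rules out such examples.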
[citeproof:506]
## 1.5 Lebesgue Measure
The extension theorems construct the most important measure in analysis:
[quotetheorem:523]
The strategy is: define $\mu((a,b]) = b - a$ on the $\pi$-system of half-open intervals (which generates $\mathcal{B}(\mathbb{R})$), extend additively to the ring of finite disjoint unions of such intervals, verify countable additivity on this ring, apply [Carathéodory](/theorems/522) to extend, and invoke [uniqueness](/theorems/506) (using $\sigma$-finiteness: $\mathbb{R} = \bigcup_{n \in \mathbb{Z}} (n, n+1]$). The resulting Lebesgue measure is translation-invariant: $\mu(A + t) = \mu(A)$ for all $t \in \mathbb{R}$ and Borel $A$.
[citeproof:523]
[example:The Vitali Set — Why $\sigma$-Algebras Are Necessary]
Can Lebesgue measure be extended to *all* subsets of $[0,1)$ while remaining countably additive and translation invariant? No. Define an [equivalence relation](/page/Equivalence%20Relation) $x \sim y$ iff $x - y \in \mathbb{Q}$, and use the Axiom of Choice to select one representative from each equivalence class, forming a set $V \subseteq [0,1)$. For each $q \in \mathbb{Q} \cap [0,1)$, define $V_q = \{(v + q) \bmod 1 : v \in V\}$. These translates are pairwise disjoint and $\bigcup_q V_q = [0,1)$. If $V$ were measurable with $\mu(V) = \alpha$, then by translation invariance (applied separately to the part of $V$ that stays below $1$ and the part that wraps around) each $V_q$ would also have measure $\alpha$, and by countable additivity $1 = \mu([0,1)) = \sum_q \alpha$. But this sum is $0$ if $\alpha = 0$ and $\infty$ if $\alpha > 0$, a contradiction either way. So $V$ is not Lebesgue measurable, and the restriction to a $\sigma$-algebra is unavoidable.
[/example]
# 2. Probability and Independence
## 2.1 Probability Measures and Independence
In probability, the measure space $(\Omega, \mathcal{F}, \mathbb{P})$ is called a *probability space* when $\mathbb{P}(\Omega) = 1$. The elements of $\mathcal{F}$ are *events*, and $\mathbb{P}(A)$ is the *probability* of event $A$.
[definition:Probability Space]
A *probability measure* is a measure $\mathbb{P}$ on a measurable space $(\Omega, \mathcal{F})$ with $\mathbb{P}(\Omega) = 1$. The triple $(\Omega, \mathcal{F}, \mathbb{P})$ is a *probability space*.
[/definition]
The central concept in probability is *independence*: events $A$ and $B$ are independent if $\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B)$. More generally, $\sigma$-algebras $\mathcal{G}_1, \dots, \mathcal{G}_n$ are independent if $\mathbb{P}(G_1 \cap \cdots \cap G_n) = \mathbb{P}(G_1) \cdots \mathbb{P}(G_n)$ for all choices $G_i \in \mathcal{G}_i$. Checking this for *every* set in each $\sigma$-algebra is impractical, but the $\pi$-system lemma comes to the rescue:
[quotetheorem:524]
So to verify that two $\sigma$-algebras are independent, it suffices to check the product formula on generating $\pi$-systems — for instance, to show that two real random variables are independent, one only needs $\mathbb{P}(X \leq s,\, Y \leq t) = \mathbb{P}(X \leq s)\,\mathbb{P}(Y \leq t)$ for all $s, t \in \mathbb{R}$, since the half-lines $(-\infty, s]$ form a $\pi$-system generating $\mathcal{B}(\mathbb{R})$.
[citeproof:524]
## 2.2 Borel–Cantelli Lemmas
For a sequence of events $(A_n)$, how do we determine whether infinitely many of them occur? The set-theoretic formulation is:
[definition:$\limsup$ and $\liminf$ of Sets]
For a sequence of sets $(A_n)$, define
\begin{align*}
\limsup_n A_n &= \bigcap_{n=1}^\infty \bigcup_{m \geq n} A_m = \{\omega : \omega \in A_n \text{ for infinitely many } n\}, \\
\liminf_n A_n &= \bigcup_{n=1}^\infty \bigcap_{m \geq n} A_m = \{\omega : \omega \in A_n \text{ for all sufficiently large } n\}.
\end{align*}
[/definition]
The notation "$A_n$ infinitely often" (abbreviated "$A_n$ i.o.") denotes the event $\limsup_n A_n$. For independent events, the Borel–Cantelli lemmas give a sharp dichotomy for the probability of this event:
[quotetheorem:507]
The proof of [BC-I](/theorems/507) is short and instructive: $\mathbb{P}(\limsup A_n) = \mathbb{P}(\bigcap_n \bigcup_{m \geq n} A_m) = \lim_n \mathbb{P}(\bigcup_{m \geq n} A_m) \leq \lim_n \sum_{m \geq n} \mathbb{P}(A_m) = 0$. The second equality is continuity from above (available because $\mathbb{P}$ is finite), the inequality is countable subadditivity, and the limit equals zero because the tail of a convergent series vanishes. No independence is needed.
[citeproof:507]
[quotetheorem:508]
The [second Borel–Cantelli lemma](/theorems/508) is a partial converse: if the events are independent and $\sum \mathbb{P}(A_n) = \infty$, then $\mathbb{P}(\limsup A_n) = 1$. The independence is used to show $\mathbb{P}(\bigcap_{m=n}^N A_m^c) = \prod_{m=n}^N (1 - \mathbb{P}(A_m)) \leq \exp(-\sum_{m=n}^N \mathbb{P}(A_m)) \to 0$ as $N \to \infty$. Without independence, the conclusion can fail: take $A_n = A$ for all $n$ with $\mathbb{P}(A) = 1/2$; then $\sum \mathbb{P}(A_n) = \infty$ but $\mathbb{P}(A_n \text{ i.o.}) = 1/2 \neq 1$.
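One way to finish the argument from this estimate is sketched below; it is a standard route and may differ in detail from the cited proof. The exponential bound comes from $1 - x \leq e^{-x}$. Letting $N \to \infty$ with continuity from above, then taking complements, gives
\begin{align*}
\mathbb{P}\!\left(\bigcap_{m \geq n} A_m^c\right) &= \lim_{N \to \infty} \mathbb{P}\!\left(\bigcap_{m=n}^N A_m^c\right) = 0 \quad \text{for every } n, \\
\mathbb{P}\!\left(\bigcup_{m \geq n} A_m\right) &= 1 \quad \text{for every } n, \\
\mathbb{P}(\limsup_n A_n) &= \lim_{n \to \infty} \mathbb{P}\!\left(\bigcup_{m \geq n} A_m\right) = 1.
\end{align*}
The last line uses continuity from above once more, applied to the decreasing sets $\bigcup_{m \geq n} A_m$.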
[citeproof:508]
[problem]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $(A_n)_{n=1}^\infty$ be a sequence of independent events with $\mathbb{P}(A_n) = 1/n$. Show that $\mathbb{P}(\limsup_n A_n) = 1$, and deduce that almost every $\omega \in \Omega$ belongs to infinitely many of the $A_n$.
[/problem]
[solution]
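Since $\sum_{n=1}^\infty \mathbb{P}(A_n) = \sum_{n=1}^\infty 1/n = \infty$ (the harmonic series diverges) and the events are independent, the [second Borel–Cantelli lemma](/theorems/508) gives $\mathbb{P}(\limsup_n A_n) = 1$. Because $\limsup_n A_n$ is exactly the set of $\omega$ lying in infinitely many $A_n$, its complement is a null set, which is the statement that almost every $\omega \in \Omega$ belongs to infinitely many of the $A_n$.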
Contrast this with $\mathbb{P}(A_n) = 1/n^2$: now $\sum \mathbb{P}(A_n) = \pi^2/6 < \infty$, so the [first Borel–Cantelli lemma](/theorems/507) gives $\mathbb{P}(\limsup_n A_n) = 0$ — no independence needed. The [boundary](/page/Boundary) between "i.o." and "finitely often" is controlled entirely by the convergence or divergence of $\sum \mathbb{P}(A_n)$, with independence needed only for the divergent case.
[/solution]

---

With $\sigma$-algebras and measures in place, we can ask: which functions are compatible with this structure? The answer determines what we can integrate, and in probability, what counts as a "random variable."
The analogy with topology is instructive. A continuous function $f: X \to Y$ between [topological](/page/Topology) spaces is one for which the preimage of every [open set](/page/Open%20Set) is open. Replacing "open" with "measurable" gives the definition of a measurable function. This is not a superficial parallel — it reflects a deep structural principle: both continuity and measurability are defined by requiring preimages to preserve the relevant structure. But measurability is far more permissive than continuity. Pointwise limits of measurable functions are measurable (as we shall see), while pointwise limits of continuous functions need not be continuous. This robustness under limits is precisely what makes measurable functions the right class for integration theory.
# 3. Measurable Functions and Random Variables
## 3.1 Measurable Functions
[definition:Measurable Function]
Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces. A map $f: E \to G$ is *$\mathcal{E}/\mathcal{G}$-measurable* if $f^{-1}(A) \in \mathcal{E}$ for every $A \in \mathcal{G}$.
[/definition]
When $(G, \mathcal{G}) = (\mathbb{R}, \mathcal{B})$, we simply say $f$ is *measurable*. When $E$ is a topological space and $\mathcal{E} = \mathcal{B}(E)$, a measurable function is called *Borel measurable*.
Checking the preimage condition for every set in $\mathcal{G}$ is usually impractical. The following result reduces the work to a generating collection:
[quotetheorem:525]
The [Generator Criterion](/theorems/525) is the measurability workhorse. Since $\mathcal{B}(\mathbb{R}) = \sigma(\{(-\infty, y] : y \in \mathbb{R}\})$, it tells us that $f: E \to \mathbb{R}$ is measurable if and only if $\{f \leq y\} \in \mathcal{E}$ for all $y$ — a single family of preimages, rather than all Borel sets.
Why does this work? The key is that $\{A \in \mathcal{G} : f^{-1}(A) \in \mathcal{E}\}$ is itself a $\sigma$-algebra (preimages commute with complements, countable unions, and countable intersections). If this $\sigma$-algebra contains a generating collection $\mathcal{Q}$, it contains $\sigma(\mathcal{Q}) = \mathcal{G}$, so $f$ is measurable.
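The identities behind this claim, for subsets $A, A_n \subseteq G$, are the following (a sketch of the standard verification):
\begin{align*}
f^{-1}(G \setminus A) &= E \setminus f^{-1}(A), \\
f^{-1}\!\left(\bigcup_n A_n\right) &= \bigcup_n f^{-1}(A_n), \\
f^{-1}\!\left(\bigcap_n A_n\right) &= \bigcap_n f^{-1}(A_n).
\end{align*}
Each follows by unwinding the definition $x \in f^{-1}(A) \iff f(x) \in A$. Forward images are less well behaved: $f(A \cap B)$ can be strictly smaller than $f(A) \cap f(B)$, which is one reason measurability is phrased in terms of preimages.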
Two immediate consequences:
*Continuous functions are Borel measurable.* If $f: E \to F$ is continuous between topological spaces, then $f^{-1}(U)$ is open for every open $U \subseteq F$. Since the open sets generate $\mathcal{B}(F)$, the Generator Criterion gives measurability. Every polynomial, exponential, trigonometric function, etc., is Borel measurable.
*Composition preserves measurability.* If $f: E \to F$ and $g: F \to G$ are measurable, then $(g \circ f)^{-1}(A) = f^{-1}(g^{-1}(A)) \in \mathcal{E}$ for all $A \in \mathcal{G}$. So $g \circ f$ is measurable.
## 3.2 Closure Under Limits
The most important feature of measurable functions — the one that separates them from continuous or Riemann-integrable functions — is their closure under pointwise limits and lattice operations.
If $f$ and $g$ are measurable functions on $(E, \mathcal{E})$ taking values in $[0, \infty]$ or $\mathbb{R}$, then $f + g$, $fg$, $\max\{f, g\}$, and $\min\{f, g\}$ are all measurable. For sequences, $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, and $\liminf_n f_n$ are all measurable.
The proofs are short exercises in using the Generator Criterion. For instance, $\{\sup_n f_n > y\} = \bigcup_n \{f_n > y\}$ is a countable union of measurable sets, hence measurable. For sums of real-valued functions, $\{f + g < y\} = \bigcup_{r \in \mathbb{Q}} (\{f < r\} \cap \{g < y - r\})$, where the density of $\mathbb{Q}$ reduces the condition to a countable union. And $\limsup_n f_n = \inf_n \sup_{m \geq n} f_m$ reduces to the previous cases.
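To see why the rational union captures $\{f + g < y\}$, here is a sketch for real-valued $f$ and $g$. If $f(x) + g(x) < y$, then $f(x) < y - g(x)$, and density of $\mathbb{Q}$ supplies a rational $r$ strictly between them, so $x \in \{f < r\} \cap \{g < y - r\}$. The converse is immediate:
\begin{align*}
f(x) < r \ \text{ and } \ g(x) < y - r \implies f(x) + g(x) < r + (y - r) = y.
\end{align*}
The same pattern handles the remaining limit operations, for example
\begin{align*}
\liminf_n f_n = \sup_n \inf_{m \geq n} f_m, \qquad \Bigl\{\inf_n f_n < y\Bigr\} = \bigcup_n \{f_n < y\}.
\end{align*}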
The consequence is striking: if $(f_n)$ is a sequence of measurable functions converging pointwise to $f$, then $f$ is measurable. Compare this with continuous functions, where the pointwise limit can be any Baire class 1 function (possibly discontinuous on a dense set, though never at every point), and with Riemann-integrable functions, where the pointwise limit may fail to be Riemann-integrable (as the Introduction's example of $\mathbf{1}_{\mathbb{Q} \cap [0,1]}$ shows).
[definition:Simple Function]
A *simple function* on $(E, \mathcal{E})$ is a measurable function $\phi: E \to \mathbb{R}$ taking finitely many values, written $\phi = \sum_{j=1}^n c_j \mathbf{1}_{A_j}$ with $c_j \in \mathbb{R}$ and $A_j \in \mathcal{E}$.
[/definition]
Simple functions are the building blocks of the Lebesgue integral. Their importance rests on the following approximation: every non-negative measurable function $f$ is the pointwise limit of an increasing sequence of simple functions $\phi_n \nearrow f$. The standard construction (dyadic approximation from the [Introduction](/pages/introduction)) combined with the [Monotone Convergence Theorem](/theorems/509) then reduces integration of arbitrary non-negative functions to limits of finite sums.
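One common choice of approximating sequence, given here as an illustrative sketch rather than as the construction fixed elsewhere in these notes, is
\begin{align*}
\phi_n = \min\bigl\{2^{-n} \lfloor 2^n f \rfloor,\; n\bigr\}.
\end{align*}
Each $\phi_n$ takes values in the finite set $\{k 2^{-n} : 0 \leq k \leq n 2^n\}$, the sequence is increasing because $2\lfloor t \rfloor \leq \lfloor 2t \rfloor$, and $0 \leq f - \phi_n < 2^{-n}$ on the set $\{f \leq n\}$, which gives $\phi_n \nearrow f$ pointwise. At a point where $f(x) = 0.7$, for instance,
\begin{align*}
\phi_1(x) = 0.5, \quad \phi_2(x) = 0.5, \quad \phi_3(x) = 0.625, \quad \phi_4(x) = 0.6875.
\end{align*}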
## 3.3 The Monotone Class Theorem
In practice, one often needs to prove that a property $P$ holds for all bounded measurable functions. The following result provides a systematic strategy, playing the same role for functions that the [Dynkin $\pi$-system lemma](/theorems/505) plays for sets:
*Monotone Class Theorem.* Let $\mathcal{A} \subseteq \mathcal{E}$ be a $\pi$-system with $\sigma(\mathcal{A}) = \mathcal{E}$, and let $\mathcal{V}$ be a vector space of bounded functions $f: E \to \mathbb{R}$ such that (i) $\mathbf{1}_E \in \mathcal{V}$, (ii) $\mathbf{1}_A \in \mathcal{V}$ for all $A \in \mathcal{A}$, and (iii) if $(f_n)$ is a bounded, non-negative sequence in $\mathcal{V}$ with $f_n \nearrow f$ pointwise, then $f \in \mathcal{V}$. Then $\mathcal{V}$ contains all bounded $\mathcal{E}$-measurable functions.
The proof strategy mirrors the Dynkin lemma: show that $\{A \in \mathcal{E} : \mathbf{1}_A \in \mathcal{V}\}$ is a $d$-system (using the vector space and monotone closure properties of $\mathcal{V}$), apply the Dynkin lemma to conclude it contains $\sigma(\mathcal{A}) = \mathcal{E}$, so all indicator functions of measurable sets are in $\mathcal{V}$. Then by linearity, all simple functions are in $\mathcal{V}$. Finally, the monotone closure (iii) extends to all bounded non-negative measurable functions (via simple-function approximation), and general bounded measurable functions follow by writing $f = f^+ - f^-$.
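The $d$-system check in the first step rests on three facts about indicator functions (a sketch, with $A \subseteq B$ and $(A_n)$ increasing, all in the collection $\{A \in \mathcal{E} : \mathbf{1}_A \in \mathcal{V}\}$):
\begin{align*}
\mathbf{1}_E &\in \mathcal{V} \quad \text{by (i)}, \\
\mathbf{1}_{B \setminus A} &= \mathbf{1}_B - \mathbf{1}_A \in \mathcal{V} \quad \text{since } \mathcal{V} \text{ is a vector space}, \\
\mathbf{1}_{\bigcup_n A_n} &= \lim_{n \to \infty} \mathbf{1}_{A_n} \in \mathcal{V} \quad \text{by (iii), since } 0 \leq \mathbf{1}_{A_n} \nearrow \mathbf{1}_{\bigcup_n A_n} \leq 1.
\end{align*}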
[example:Applying the Monotone Class Theorem]
Suppose $\mu$ is a finite measure and we want to prove the change of variables formula $\int_G g \, d(\mu \circ f^{-1}) = \int_E (g \circ f) \, d\mu$ for all bounded measurable $g$. Let $\mathcal{V}$ be the set of bounded measurable $g$ for which the formula holds. For $g = \mathbf{1}_B$ with $B \in \mathcal{G}$, both sides equal $\mu(f^{-1}(B))$, so the formula holds trivially, in particular for $B$ in a generating $\pi$-system. $\mathcal{V}$ is a vector space (by linearity of both integrals) and is closed under bounded monotone limits (by MCT). The Monotone Class Theorem gives the result for all bounded measurable $g$, and extension to non-negative measurable $g$ follows by simple-function approximation.
[/example]
## 3.4 Product Measurable Spaces
When studying functions of several variables — or joint distributions of multiple random variables — we need a $\sigma$-algebra on the product space:
[definition:Product $\sigma$-Algebra]
For measurable spaces $(E, \mathcal{E})$ and $(G, \mathcal{G})$, the *product $\sigma$-algebra* is $\mathcal{E} \otimes \mathcal{G} = \sigma(\{A \times B : A \in \mathcal{E},\, B \in \mathcal{G}\})$.
[/definition]
By definition, $\mathcal{E} \otimes \mathcal{G}$ is the smallest $\sigma$-algebra containing all measurable rectangles $A \times B$. The rectangles form a $\pi$-system, since $(A \times B) \cap (A' \times B') = (A \cap A') \times (B \cap B')$, so by the [uniqueness theorem](/theorems/506) a finite measure on the product is determined by its values on rectangles. The product $\sigma$-algebra is characterised by a universal property: $f: (H, \mathcal{H}) \to (E \times G, \mathcal{E} \otimes \mathcal{G})$ is measurable if and only if both components $\pi_1 \circ f$ and $\pi_2 \circ f$ are measurable. This is what makes product $\sigma$-algebras the natural setting for [Fubini's theorem](/theorems/513).
## 3.5 Random Variables and Distributions
In probability, a measurable function from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ to a measurable space $(E, \mathcal{E})$ is called a *random variable*:
[definition:Random Variable and Distribution]
An *$E$-valued random variable* is a measurable function $X: (\Omega, \mathcal{F}) \to (E, \mathcal{E})$. Its *distribution* (or *law*) is the image measure $\mu_X = \mathbb{P} \circ X^{-1}$, defined by $\mu_X(A) = \mathbb{P}(X \in A)$ for $A \in \mathcal{E}$.
[/definition]
The distribution $\mu_X$ captures everything about $X$ that can be determined from $\mathbb{P}$ alone — it is the probability measure on $(E, \mathcal{E})$ that records how likely $X$ is to land in each measurable set. Two random variables on different probability spaces can have the same distribution (e.g., a fair coin flip modelled on $\{H, T\}$ and the indicator of $[0, 1/2]$ on $([0,1], \text{Lebesgue})$).
For real-valued random variables, the distribution is encoded by the *distribution function* $F_X(x) = \mathbb{P}(X \leq x)$, which is non-decreasing, right-continuous, and satisfies $F_X(-\infty) = 0$, $F_X(+\infty) = 1$. That $F_X$ determines $\mu_X$ uniquely follows from the [uniqueness of measure extension](/theorems/506): the half-lines $(-\infty, x]$ form a $\pi$-system generating $\mathcal{B}(\mathbb{R})$, and $\mu_X$ is a finite measure ($\mu_X(\mathbb{R}) = 1$). Conversely, any function $F$ that is non-decreasing, right-continuous, with limits $0$ and $1$ at $\pm\infty$, is the distribution function of a random variable on $([0,1], \text{Lebesgue})$ — take $X(\omega) = \inf\{x : F(x) \geq \omega\}$.
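The generalised inverse $X(\omega) = \inf\{x : F(x) \geq \omega\}$ can be tried out on a step distribution function. The following is a minimal sketch, not part of the course material: the names `quantile`, `points` and `cum_probs` and the three-atom law are assumptions chosen for illustration, and a uniform grid on $(0,1)$ stands in for Lebesgue measure.

```python
from bisect import bisect_left
from collections import Counter

def quantile(points, cum_probs, omega):
    """Return inf{x : F(x) >= omega} for a step distribution function F.

    points    -- atoms x_1 < ... < x_m (hypothetical example data)
    cum_probs -- F(x_1) <= ... <= F(x_m) = 1
    omega     -- a point of (0, 1]
    """
    # bisect_left returns the first index i with cum_probs[i] >= omega.
    return points[bisect_left(cum_probs, omega)]

points = [-1.0, 0.0, 2.0]
cum_probs = [0.2, 0.5, 1.0]   # atoms of mass 0.2, 0.3 and 0.5

# Evaluate X on the midpoints of a uniform grid of (0, 1).
N = 1000
counts = Counter(quantile(points, cum_probs, (k + 0.5) / N) for k in range(N))
for x in points:
    print(x, counts[x] / N)   # fraction of grid points mapped to each atom
```

The fraction of grid points sent to each atom matches the size of the corresponding jump of $F$, which is the statement that $X$ has distribution function $F$.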
[definition:Independence of Random Variables]
Random variables $X_1, \dots, X_n$ are *independent* if the $\sigma$-algebras $\sigma(X_1), \dots, \sigma(X_n)$ are independent. For real-valued $X_i$, this is equivalent to the product formula
\begin{align*}
\mathbb{P}(X_1 \leq x_1, \dots, X_n \leq x_n) = \prod_{j=1}^n \mathbb{P}(X_j \leq x_j) \quad \text{for all } x_1, \dots, x_n \in \mathbb{R},
\end{align*}
by the [$\pi$-system criterion for independence](/theorems/524), since $\{(-\infty, x] : x \in \mathbb{R}\}$ is a $\pi$-system generating $\mathcal{B}(\mathbb{R})$.
[/definition]
## 3.6 Modes of Convergence
A sequence of measurable functions can converge in several distinct senses, each useful in different contexts. The relationships between them form a "convergence dictionary" that appears throughout analysis and probability.
[definition:Almost Everywhere Convergence]
$(f_n)$ converges to $f$ *almost everywhere* (a.e.) if $\mu(\{x : f_n(x) \not\to f(x)\}) = 0$. In probability: *almost sure* (a.s.) convergence.
[/definition]
[definition:Convergence in Measure]
$(f_n)$ converges to $f$ *in measure* if for every $\varepsilon > 0$, $\mu(\{|f_n - f| \geq \varepsilon\}) \to 0$ as $n \to \infty$. In probability: *convergence in probability*.
[/definition]
[definition:Convergence in Distribution]
Random variables $X_n \to X$ *in distribution* if $F_{X_n}(x) \to F_X(x)$ at every continuity point $x$ of $F_X$.
[/definition]
The key relationships:
*A.e. convergence implies convergence in measure on finite measure spaces.* If $f_n \to f$ a.e. and $\mu(E) < \infty$, then for any $\varepsilon > 0$, $\mu(\{|f_n - f| \geq \varepsilon\}) \to 0$. The proof uses continuity of measure from above: the sets $B_n = \bigcup_{m \geq n}\{|f_m - f| \geq \varepsilon\}$ decrease to $\{|f_m - f| \geq \varepsilon \text{ i.o.}\}$, which has measure zero by hypothesis.
*The finiteness of $\mu(E)$ is necessary.* On $(\mathbb{R}, \mathcal{B}, \text{Lebesgue})$, the functions $f_n = \mathbf{1}_{[n, n+1]}$ converge to $0$ pointwise everywhere, but $\mu(\{f_n \geq 1\}) = 1$ for all $n$, so $f_n \not\to 0$ in measure. The "mass" escapes to infinity.
*Convergence in measure implies a.e. convergence along a subsequence.* If $f_n \to f$ in measure, there exists a subsequence $f_{n_k} \to f$ a.e. The proof extracts $n_k$ with $\mu(\{|f_{n_k} - f| \geq 2^{-k}\}) < 2^{-k}$, then applies the [first Borel–Cantelli lemma](/theorems/507) to conclude that a.e. only finitely many of the events $\{|f_{n_k} - f| \geq 2^{-k}\}$ occur.
## 3.7 Tail Events and the Kolmogorov 0-1 Law
[definition:Tail $\sigma$-Algebra]
For a sequence of random variables $(X_n)$, the *tail $\sigma$-algebra* is $\mathcal{T} = \bigcap_{n=1}^\infty \sigma(X_{n+1}, X_{n+2}, \dots)$.
[/definition]
A tail event depends only on the asymptotic behaviour of the sequence — it is unchanged if we modify finitely many terms. The event $\{\sum_n X_n \text{ converges}\}$ is a tail event (convergence of a [series](/page/Series) depends only on its tail). The event $\{X_1 > 0\}$ is not (it depends on $X_1$).
[quotetheorem:512]
The [Kolmogorov 0-1 law](/theorems/512) says that for independent $(X_n)$, every tail event has probability $0$ or $1$. The proof shows that $\mathcal{T}$ is independent of itself: $\mathcal{T} \subseteq \sigma(X_{n+1}, X_{n+2}, \dots)$ is independent of $\sigma(X_1, \dots, X_n)$ for every $n$ (by independence of the $X_i$). The union $\bigcup_n \sigma(X_1, \dots, X_n)$ is a $\pi$-system generating $\sigma(X_1, X_2, \dots)$, so $\mathcal{T}$ is independent of $\sigma(X_1, X_2, \dots) \supseteq \mathcal{T}$. So for any $A \in \mathcal{T}$, $\mathbb{P}(A) = \mathbb{P}(A \cap A) = \mathbb{P}(A)^2$, giving $\mathbb{P}(A) \in \{0, 1\}$.
This has a striking consequence: for independent $(X_n)$, the series $\sum_n X_n$ either converges almost surely or diverges almost surely — there is no intermediate behaviour.
[citeproof:512]
[problem]
Let $(X_n)_{n \geq 1}$ be independent random variables with $X_n$ uniformly distributed on $\{-1, +1\}$. Define $S_n = X_1 + \cdots + X_n$ (a simple random walk). Show that $\{S_n \to +\infty\}$, $\{S_n \to -\infty\}$, $\{\limsup_n S_n = +\infty\}$, and $\{\liminf_n S_n = -\infty\}$ are all tail events. Deduce from the [Kolmogorov 0-1 law](/theorems/512) that each has probability $0$ or $1$, and use symmetry to determine which.
[/problem]
[solution]
Each event depends only on the tail of $(X_n)$: for any fixed $N$, $S_n = S_N + (X_{N+1} + \cdots + X_n)$ with $|S_N| \leq N$, so whether $S_n \to +\infty$ is determined by $X_{N+1}, X_{N+2}, \dots$ alone. Hence $\{S_n \to +\infty\} \in \sigma(X_{N+1}, X_{N+2}, \dots)$ for every $N$. The same argument applies to the other three events, so all four are in the tail $\sigma$-algebra $\mathcal{T}$.
By the [Kolmogorov 0-1 law](/theorems/512), each has probability $0$ or $1$. The distribution of $(X_n)$ is symmetric: replacing each $X_n$ by $-X_n$ maps $\{S_n \to +\infty\}$ to $\{S_n \to -\infty\}$ and vice versa, so $\mathbb{P}(S_n \to +\infty) = \mathbb{P}(S_n \to -\infty)$. If either were $1$, the other would also be $1$, but they are disjoint, so both are $0$.
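For $\{\limsup_n S_n = +\infty\}$ and $\{\liminf_n S_n = -\infty\}$: symmetry gives them equal probability, but they are *not* disjoint (both can occur simultaneously), so the previous argument does not apply. Instead, fix $K \geq 1$ and consider the events $A_k = \{X_{(2K+1)k+1} = \cdots = X_{(2K+1)(k+1)} = +1\}$ for $k \geq 0$. They involve disjoint blocks of steps, so they are independent, with $\mathbb{P}(A_k) = 2^{-(2K+1)}$, and the [second Borel–Cantelli lemma](/theorems/508) shows that almost surely some $A_k$ occurs. A run of $2K+1$ consecutive $+1$ steps takes the walk out of $[-K, K]$, so almost surely $\sup_n |S_n| > K$; letting $K \to \infty$ gives $\mathbb{P}(\sup_n |S_n| = \infty) = 1$. Since $\{\sup_n |S_n| = \infty\} = \{\limsup S_n = +\infty\} \cup \{\liminf S_n = -\infty\}$, the two events cannot both have probability $0$. They have equal probability in $\{0, 1\}$, so $\mathbb{P}(\limsup S_n = +\infty) = \mathbb{P}(\liminf S_n = -\infty) = 1$. The walk oscillates between arbitrarily large positive and negative values almost surely.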
[/solution]

---

The [Measures](/page/Measures) section developed the machinery for assigning sizes to sets; the [Measurable Functions](/page/Measurable%20Functions) section identified which functions are compatible with this machinery. The present section combines both to define the *Lebesgue integral*: the operation that assigns a numerical value to a measurable function over a measure space, generalising area under a curve, expected value, and total mass.
The construction proceeds in three stages. First, we define the integral for *simple functions* — measurable functions taking finitely many values — where integration reduces to a finite sum. Second, we extend to all non-negative measurable functions by approximating from below and taking a supremum. Third, we handle general measurable functions by splitting into positive and negative parts. At each stage, the [Monotone Convergence Theorem](/theorems/509) and its consequences ([Fatou's Lemma](/theorems/510), [Dominated Convergence](/theorems/4)) govern the interaction between limits and integrals.
# 4. Integration
## 4.1 Integration of Simple Functions
The integral of a simple function is the natural weighted sum: each value is multiplied by the measure of the set where that value is attained.
[definition:Integral Of A Simple Function]
Let $(E, \mathcal{E}, \mu)$ be a measure space. For a non-negative simple function $f = \sum_{k=1}^n a_k \mathbb{1}_{A_k}$ with $a_k \geq 0$ and $A_k \in \mathcal{E}$ pairwise disjoint, the *integral* is
\begin{align*}
\int_E f \, d\mu = \sum_{k=1}^n a_k \, \mu(A_k),
\end{align*}
with the convention $0 \cdot \infty = 0$.
[/definition]
The convention $0 \cdot \infty = 0$ ensures that a set of infinite measure on which $f$ vanishes contributes nothing to the integral, as it should, since the function carries no "mass" there.
This definition requires well-definedness: if $f$ admits two representations $\sum_{i} a_i \mathbb{1}_{A_i} = \sum_{j} b_j \mathbb{1}_{B_j}$, the resulting sums must agree. To see this, form the common refinement $C_{ij} = A_i \cap B_j$. On each $C_{ij}$, both representations assign the same value (say $c_{ij}$), and by finite additivity of $\mu$, $\sum_i a_i \mu(A_i) = \sum_{i,j} c_{ij} \mu(C_{ij}) = \sum_j b_j \mu(B_j)$.
Linearity and monotonicity for non-negative simple functions are immediate: if $f = \sum a_k \mathbb{1}_{A_k}$ and $g = \sum b_l \mathbb{1}_{B_l}$, write both in terms of the common refinement $\{A_k \cap B_l\}$ and compute directly.
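The weighted-sum definition, including the convention $0 \cdot \infty = 0$, can be sketched in a few lines. This is an illustrative sketch only: the function name `integrate_simple` and the encoding of a simple function as a list of pairs $(a_k, \mu(A_k))$ are assumptions, not notation from these notes. The convention has to be coded explicitly because floating-point arithmetic evaluates `0.0 * inf` to `nan`.

```python
import math

def integrate_simple(terms):
    """Integral of a non-negative simple function sum_k a_k 1_{A_k}.

    terms -- iterable of pairs (a_k, mu(A_k)) with finite a_k >= 0 and
             pairwise disjoint A_k; mu(A_k) may be math.inf.
    """
    total = 0.0
    for value, measure in terms:
        if value < 0 or measure < 0:
            raise ValueError("values and measures must be non-negative")
        # Convention 0 * inf = 0: skip the term instead of multiplying.
        if value == 0 or measure == 0:
            continue
        total += value * measure
    return total

# f = 2 on [0, 1.5) and 0 on [1.5, inf), with Lebesgue measure on [0, inf).
print(integrate_simple([(2.0, 1.5), (0.0, math.inf)]))

# Two representations of the indicator of [0, 2) give the same integral.
print(integrate_simple([(1.0, 2.0)]), integrate_simple([(1.0, 1.0), (1.0, 1.0)]))
```

The second call is a small instance of the well-definedness argument: refining the partition does not change the sum.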
## 4.2 Extension to Non-Negative Measurable Functions
Simple functions can only take finitely many values, but we need to integrate functions like $x \mapsto x^2$ or $x \mapsto e^{-x}$. The idea is to approximate any non-negative measurable function from below by simple functions and define the integral as the supremum of the simple-function integrals.
[definition:Integral Of A Non Negative Measurable Function]
For a non-negative measurable function $f: E \to [0, \infty]$, the *integral* is
\begin{align*}
\int_E f \, d\mu = \sup\!\left\{ \int_E g \, d\mu : g \text{ simple},\; 0 \leq g \leq f \right\}.
\end{align*}
[/definition]
This definition is well-posed (the set is non-empty since $g = 0$ is always admissible) and extends the simple-function integral (if $f$ is itself simple, the supremum is achieved at $g = f$). But it raises a crucial question: is this supremum actually a *limit*? Can we compute $\int f \, d\mu$ as the limit of an explicit sequence of simple-function integrals, rather than as an abstract supremum? The [Monotone Convergence Theorem](/theorems/509) answers yes.
### 4.2.1 The Monotone Convergence Theorem
Every non-negative measurable function $f$ is the pointwise limit of an increasing sequence of simple functions — the standard dyadic approximation $\phi_n(x) = \sum_{k=0}^{n2^n - 1} \frac{k}{2^n}\mathbb{1}_{\{k/2^n \leq f(x) < (k+1)/2^n\}} + n\,\mathbb{1}_{\{f(x) \geq n\}}$ satisfies $0 \leq \phi_n \leq \phi_{n+1} \leq f$ and $\phi_n \to f$ pointwise. But does $\int \phi_n \, d\mu \to \int f \, d\mu$? The MCT guarantees this, and much more: *any* non-decreasing sequence of non-negative measurable functions has the property that the integral of the limit equals the limit of the integrals.
[quotetheorem:509]
The power of this result is twofold. First, it converts the abstract supremum definition of $\int f \, d\mu$ into a concrete limit: choose any increasing simple-function approximation $\phi_n \uparrow f$ and compute $\int f \, d\mu = \lim_n \int \phi_n \, d\mu$. Second, it allows us to establish properties of the integral (linearity, for instance) by proving them for simple functions and passing to the limit. The monotonicity hypothesis cannot be dropped: on $[0,1]$ with Lebesgue measure, the sequence $f_n = n\,\mathbb{1}_{(0,1/n]}$ is non-negative and converges pointwise to $0$, but it is not monotone and has $\int f_n = 1$ for all $n$, violating the conclusion. The issue is that the "mass" of $f_n$ concentrates on shrinking sets, a phenomenon that monotonicity prevents.
[citeproof:509]
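To see the dyadic approximation at work, the sketch below computes $\int \phi_n \, d\mu$ for the specific choice $f(x) = x^2$ on $[0,1]$ with Lebesgue measure, where the level sets $\{k/2^n \leq f < (k+1)/2^n\}$ are intervals whose lengths are known in closed form. The function names `phi` and `integral_phi` and the choice of $f$ are assumptions made for illustration.

```python
import math

def phi(n, y):
    """Value of the n-th dyadic approximation at a point where f equals y >= 0."""
    return float(n) if y >= n else math.floor(y * 2**n) / 2**n

def integral_phi(n):
    """Integral of phi_n over [0, 1] (Lebesgue measure) for f(x) = x**2."""
    total = 0.0
    for k in range(n * 2**n):
        lo, hi = k / 2**n, (k + 1) / 2**n
        # {x in [0,1] : lo <= x**2 < hi} is an interval with endpoints
        # sqrt(lo) and sqrt(hi), cut off at 1.
        measure = min(math.sqrt(hi), 1.0) - min(math.sqrt(lo), 1.0)
        total += lo * measure
    # The top set {f >= n} is empty or the single point {1} here,
    # so its term n * mu({f >= n}) vanishes.
    return total

values = [integral_phi(n) for n in range(1, 11)]
for n, v in enumerate(values, start=1):
    print(n, v)
```

The printed values increase towards $\int_0^1 x^2 \, dx = 1/3$ from below, as the Monotone Convergence Theorem predicts. Since $f - 2^{-n} \leq \phi_n \leq f$ on $\{f < n\}$, the gap at stage $n$ is at most $2^{-n}$ in this example.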
[example:Summing A Series Via MCT]
Let $(E, \mathcal{E}, \mu)$ be a measure space and let $(g_k)_{k=1}^\infty$ be a sequence of non-negative measurable functions. Define $f_n = \sum_{k=1}^n g_k$. Then $f_n \uparrow f = \sum_{k=1}^\infty g_k$ pointwise, so the [Monotone Convergence Theorem](/theorems/509) gives
\begin{align*}
\int_E \sum_{k=1}^\infty g_k \, d\mu = \lim_{n \to \infty} \int_E \sum_{k=1}^n g_k \, d\mu = \sum_{k=1}^\infty \int_E g_k \, d\mu.
\end{align*}
This "integration against summation" identity is used constantly — for instance, it proves that a [countable set](/page/Countable%20Set) has Lebesgue measure zero: $\mu(\{q_1, q_2, \dots\}) = \int \sum_k \mathbb{1}_{\{q_k\}} \, d\mu = \sum_k \mu(\{q_k\}) = 0$.
[/example]
### 4.2.2 Properties of the Integral
From the MCT, the fundamental properties extend from simple functions to all non-negative measurable functions:
*Linearity.* For non-negative measurable $f, g$ and $\alpha, \beta \geq 0$,
\begin{align*}
\int_E (\alpha f + \beta g) \, d\mu = \alpha \int_E f \, d\mu + \beta \int_E g \, d\mu.
\end{align*}
To prove this, choose simple $\phi_n \uparrow f$ and $\psi_n \uparrow g$. Then $\alpha \phi_n + \beta \psi_n \uparrow \alpha f + \beta g$, and both sides equal $\alpha \lim \int \phi_n + \beta \lim \int \psi_n$ by the MCT.
*Monotonicity.* If $f \leq g$ pointwise, then $\int f \, d\mu \leq \int g \, d\mu$, since every simple function below $f$ is also below $g$.
*Vanishing criterion.* $\int_E f \, d\mu = 0$ if and only if $f = 0$ $\mu$-a.e. The forward direction uses the sets $A_n = \{f \geq 1/n\}$: monotonicity gives $\mu(A_n)/n \leq \int f \, d\mu = 0$, so $\mu(A_n) = 0$ for all $n$, and $\{f > 0\} = \bigcup_n A_n$ has measure zero by countable subadditivity.
## 4.3 Integrable Functions
To integrate functions that take both positive and negative values, we decompose them:
[definition:Integrable Function]
For a measurable function $f: E \to \mathbb{R}$, define the *positive part* $f^+ = \max\{f, 0\}$ and *negative part* $f^- = \max\{-f, 0\}$, so that $f = f^+ - f^-$ and $|f| = f^+ + f^-$. The function $f$ is *integrable* (written $f \in \mathcal{L}^1(E, \mathcal{E}, \mu)$) if $\int_E |f| \, d\mu < \infty$, and its *integral* is
\begin{align*}
\int_E f \, d\mu = \int_E f^+ \, d\mu - \int_E f^- \, d\mu.
\end{align*}
[/definition]
The condition $\int |f| \, d\mu < \infty$ ensures both $\int f^+ \, d\mu$ and $\int f^- \, d\mu$ are finite, so the difference is well-defined (we never encounter $\infty - \infty$). Linearity and monotonicity extend to $\mathcal{L}^1$: for integrable $f, g$ and $\alpha, \beta \in \mathbb{R}$, $\alpha f + \beta g$ is integrable with $\int (\alpha f + \beta g) = \alpha \int f + \beta \int g$. The proof reduces to the non-negative case by writing each function in terms of its positive and negative parts.
### 4.3.1 The Triangle Inequality for Integrals
A useful consequence of the definition is the *integral triangle inequality*: for integrable $f$,
\begin{align*}
\left| \int_E f \, d\mu \right| \leq \int_E |f| \, d\mu.
\end{align*}
This follows from $-|f| \leq f \leq |f|$ and monotonicity: $-\int |f| \leq \int f \leq \int |f|$.
## 4.4 Interchanging Limits and Integrals
The MCT handles non-decreasing sequences. For sequences that are not monotone, we need two further results.
### 4.4.1 Fatou's Lemma
When a sequence of non-negative measurable functions converges pointwise but is not monotone, we cannot expect equality between the integral of the limit and the limit of the integrals. However, we can always bound one side. The failure of equality is caused by "mass escaping" — concentrating on sets of vanishing measure or drifting off to infinity — and Fatou's lemma captures exactly how much control we retain.
[quotetheorem:510]
[Fatou's Lemma](/theorems/510) gives only an inequality, and the inequality can be strict. The construction $g_n = \inf_{m \geq n} f_m$ recovers a monotone minorant to which the [MCT](/theorems/509) applies, but the minorant satisfies $g_n \leq f_n$, not $g_n = f_n$, and the gap between $\int g_n$ and $\int f_n$ is precisely where mass can be lost. The hypothesis of non-negativity is essential: without it the inequality can fail outright. On $[0,1]$ with Lebesgue measure, $f_n = -n\,\mathbb{1}_{(0,1/n]}$ has $\liminf_n f_n = 0$ pointwise but $\int f_n \, d\mu = -1$ for all $n$, so $\int \liminf_n f_n \, d\mu = 0 > -1 = \liminf_n \int f_n \, d\mu$.
[citeproof:510]
[example:Strict Inequality In Fatou]
On $([0,1], \mathcal{B}, \mu)$ with Lebesgue measure, define $f_n = n\,\mathbb{1}_{(0,1/n]}$. Then $f_n(x) \to 0$ for every $x \in [0,1]$ (for $x > 0$, eventually $1/n < x$; for $x = 0$, $f_n(0) = 0$). The integrals are $\int f_n \, d\mu = n \cdot (1/n) = 1$ for all $n$. So
\begin{align*}
0 = \int_{[0,1]} \liminf_n f_n \, d\mu < \liminf_n \int_{[0,1]} f_n \, d\mu = 1.
\end{align*}
The mass of $f_n$ (a rectangle of area $1$) concentrates on $(0, 1/n]$, which shrinks to the empty set. The minorant $g_n = \inf_{m \geq n} f_m = 0$ for all $n$ (since $f_m(x) = 0$ for $m > 1/x$), so $\int g_n = 0$, confirming that the MCT-based proof yields only an inequality.
[/example]
### 4.4.2 The Dominated Convergence Theorem
Fatou's lemma shows that without additional control, limits and integrals need not commute. What condition on the sequence $(f_n)$ prevents mass from escaping? The answer is *domination*: if $|f_n| \leq g$ for some fixed integrable function $g$, then the total mass is uniformly bounded and cannot concentrate or drift. This is the content of the most widely used convergence theorem in analysis.
[quotetheorem:4]
The [Dominated Convergence Theorem](/theorems/4) subsumes the bounded convergence theorem (take $g$ to be a constant on a finite measure space) and is the primary tool for justifying [differentiation](/page/Derivative) under the integral sign, computing limits of parametric integrals, and proving continuity of [convolutions](/page/Convolution). The proof applies [Fatou's Lemma](/theorems/510) to the non-negative sequences $g + f_n$ and $g - f_n$, squeezing $\liminf \int f_n$ and $\limsup \int f_n$ to the same value. The domination hypothesis $|f_n| \leq g$ is used precisely to ensure these auxiliary sequences are non-negative.
[citeproof:4]
## 4.5 Change of Variables
Integration interacts naturally with measurable maps. If $f: (E, \mathcal{E}) \to (G, \mathcal{G})$ is measurable and $\mu$ is a measure on $\mathcal{E}$, the pushforward $\nu = \mu \circ f^{-1}$ is a measure on $\mathcal{G}$. How do integrals against $\nu$ relate to integrals against $\mu$?
For $g = \mathbb{1}_B$ with $B \in \mathcal{G}$, $\int_G \mathbb{1}_B \, d\nu = \nu(B) = \mu(f^{-1}(B)) = \int_E \mathbb{1}_{f^{-1}(B)} \, d\mu = \int_E (\mathbb{1}_B \circ f) \, d\mu$. By linearity this extends to non-negative simple functions, and by the [Monotone Convergence Theorem](/theorems/509) to all non-negative measurable $g$:
\begin{align*}
\int_G g \, d\nu = \int_E (g \circ f) \, d\mu.
\end{align*}
This *change of variables formula* generalises the substitution rule from calculus. In probability, it gives the law of the unconscious statistician: if $X: \Omega \to \mathbb{R}$ is a random variable with distribution $\mu_X = \mathbb{P} \circ X^{-1}$, then
\begin{align*}
\mathbb{E}[h(X)] = \int_\Omega h(X(\omega))\,\mathbb{P}(d\omega) = \int_{\mathbb{R}} h(x)\,\mu_X(dx)
\end{align*}
for any non-negative measurable $h: \mathbb{R} \to [0, \infty]$.
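On a finite probability space both sides of this identity are finite sums, so it can be checked exactly. The sketch below is an illustration under assumed data: the space of two fair dice, the random variable `X` (their sum) and the function `h` are hypothetical choices, not examples from the notes.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: two fair dice with uniform measure P.
omega_space = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega_space}

def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]

def h(x):
    """A non-negative measurable function of the value of X."""
    return x * x

# Image measure mu_X = P o X^{-1}, stored as {value: mass}.
mu_X = defaultdict(Fraction)
for w, p in P.items():
    mu_X[X(w)] += p

lhs = sum(h(X(w)) * p for w, p in P.items())   # integral over Omega against P
rhs = sum(h(x) * m for x, m in mu_X.items())   # integral over R against mu_X
print(lhs, rhs)
```

The left-hand sum ranges over all 36 outcomes, the right-hand sum only over the 11 possible values of $X$. Grouping outcomes by the value of $X$ is exactly what passing to the image measure does.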
### 4.5.1 Densities
A measure $\nu$ on $(E, \mathcal{E})$ is said to be *absolutely continuous* with respect to $\mu$ (written $\nu \ll \mu$) if $\mu(A) = 0$ implies $\nu(A) = 0$. In this case, one expects $\nu$ to be representable as "integration against a function":
[definition:Density]
A non-negative measurable function $f: E \to [0, \infty)$ is a *density* of $\nu$ with respect to $\mu$ if
\begin{align*}
\nu(A) = \int_A f \, d\mu
\end{align*}
for all $A \in \mathcal{E}$. We write $d\nu = f \, d\mu$ or $f = d\nu/d\mu$.
[/definition]
When a density exists, integration against $\nu$ reduces to integration against $\mu$: $\int g \, d\nu = \int gf \, d\mu$ for all non-negative measurable $g$ (verified for indicators, extended by linearity and MCT). In probability, if a random variable $X$ has density $f_X$ with respect to Lebesgue measure, then $\mathbb{E}[h(X)] = \int_{\mathbb{R}} h(x) f_X(x) \, d\mathcal{L}^1(x)$.
## 4.6 Product Measures and Fubini's Theorem
Computing a double integral by iterating single integrals is one of the most basic techniques in analysis and probability. The theoretical foundation requires constructing a *product measure* and proving that integration against it decomposes into iterated integrals.
Given $\sigma$-finite measure spaces $(E_1, \mathcal{E}_1, \mu_1)$ and $(E_2, \mathcal{E}_2, \mu_2)$, define $\mu_0(A_1 \times A_2) = \mu_1(A_1)\,\mu_2(A_2)$ on measurable rectangles. The rectangles form a $\pi$-system generating $\mathcal{E}_1 \otimes \mathcal{E}_2$. By the [Carathéodory Extension Theorem](/theorems/522), $\mu_0$ extends to a measure $\mu_1 \otimes \mu_2$ on $\mathcal{E}_1 \otimes \mathcal{E}_2$, and by the [Uniqueness of Measure Extension](/theorems/506) ($\sigma$-finiteness provides the exhausting sequence), this extension is unique.
[quotetheorem:513]
[Fubini's theorem](/theorems/513) says that, for non-negative measurable or integrable functions, integration over a product space can be computed as an iterated integral, in either order. The $\sigma$-finiteness hypothesis is essential: take $([0,1], \mathcal{B}, \text{Lebesgue}) \times ([0,1], \mathcal{B}, \text{counting})$ and the indicator $\mathbb{1}_{\{(x,x) : x \in [0,1]\}}$ of the diagonal. Integrating first against counting measure gives $1$ for every $x$, hence $1$ overall; integrating first against Lebesgue measure gives $0$ for every $y$, hence $0$ overall. Counting measure on $[0,1]$ is not $\sigma$-finite, and the two iterated integrals disagree. For integrable (not just non-negative) functions, Fubini guarantees that the inner integral exists for almost every value of the outer variable, a subtle measurability conclusion that requires the full strength of the proof's monotone-class argument.
[citeproof:513]
[example:Independence And Product Measures]
In probability, random variables $X_1, \dots, X_n$ on $(\Omega, \mathcal{F}, \mathbb{P})$ are independent if and only if their joint law $\mu_{(X_1, \dots, X_n)} = \mathbb{P} \circ (X_1, \dots, X_n)^{-1}$ equals the product of their marginal laws $\mu_{X_1} \otimes \cdots \otimes \mu_{X_n}$. If the $X_i$ have densities $f_{X_i}$ with respect to Lebesgue measure, this is equivalent to the joint density factoring: $f_{(X_1, \dots, X_n)}(x_1, \dots, x_n) = \prod_{i=1}^n f_{X_i}(x_i)$. [Fubini's theorem](/theorems/513) then gives $\mathbb{E}[g(X_1, \dots, X_n)] = \int \cdots \int g(x_1, \dots, x_n) \prod_i f_{X_i}(x_i) \, d\mathcal{L}^1(x_1) \cdots d\mathcal{L}^1(x_n)$ for non-negative measurable $g$.
[/example]
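For laws supported on finite sets, the product measure and both iterated integrals are finite sums, so the conclusion of Fubini's theorem can be verified exactly. The sketch below uses assumed data throughout: the marginal laws `mu1`, `mu2` and the function `g` are hypothetical, chosen only to keep the arithmetic small.

```python
from fractions import Fraction

# Hypothetical marginal laws on finite sets (illustrative numbers).
mu1 = {0: Fraction(1, 4), 1: Fraction(3, 4)}
mu2 = {1: Fraction(1, 2), 2: Fraction(1, 3), 5: Fraction(1, 6)}

def g(x, y):
    """A non-negative function on the product space."""
    return (x + 1) * y * y

# Product measure, determined by its values on rectangles of singletons.
product_law = {(x, y): p * q for x, p in mu1.items() for y, q in mu2.items()}

# Integral against the product measure, and the two iterated integrals.
joint = sum(g(x, y) * m for (x, y), m in product_law.items())
x_first = sum(q * sum(g(x, y) * p for x, p in mu1.items()) for y, q in mu2.items())
y_first = sum(p * sum(g(x, y) * q for y, q in mu2.items()) for x, p in mu1.items())
print(joint, x_first, y_first)
```

All three numbers agree. Because `g` factors as a function of `x` times a function of `y`, the common value is also the product of the two one-dimensional integrals, which is the familiar rule $\mathbb{E}[u(X_1)v(X_2)] = \mathbb{E}[u(X_1)]\,\mathbb{E}[v(X_2)]$ for independent variables.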
[problem]
Let $f_n = n\,\mathbb{1}_{(0,1/n]}$ on $([0,1], \mathcal{B}, \mu)$ with Lebesgue measure. Compute $\lim_{n \to \infty} f_n(x)$ for each $x$, $\int_{[0,1]} f_n \, d\mu$, and explain why the [Dominated Convergence Theorem](/theorems/4) does not apply. Then apply [Fatou's Lemma](/theorems/510) to verify the inequality.
[/problem]
[solution]
**Step 1 (Pointwise limit).** For $x = 0$: $f_n(0) = n \cdot \mathbb{1}_{(0,1/n]}(0) = 0$ for all $n$, so $\lim_n f_n(0) = 0$. For $x \in (0,1]$: once $n > 1/x$, we have $1/n < x$, so $\mathbb{1}_{(0,1/n]}(x) = 0$ and $f_n(x) = 0$. Therefore $f_n(x) \to 0$ for every $x \in [0,1]$, giving $f = \lim_n f_n = 0$ everywhere.
**Step 2 (Integrals).** For each $n$,
\begin{align*}
\int_{[0,1]} f_n \, d\mu = n \cdot \mu((0, 1/n]) = n \cdot \frac{1}{n} = 1.
\end{align*}
So $\lim_n \int f_n \, d\mu = 1 \neq 0 = \int f \, d\mu$: the limit and integral do not commute.
**Step 3 (Failure of DCT).** For the [DCT](/theorems/4) to apply, we would need an integrable function $g: [0,1] \to [0,\infty)$ with $f_n(x) \leq g(x)$ for all $n$ and all $x$, that is, $g \geq \sup_n f_n$. For $x \in (0,1]$, take $n = \lfloor 1/x \rfloor$: then $x \leq 1/n$, so $f_n(x) = n$, and $n \geq 1/(2x)$ because $\lfloor y \rfloor \geq y/2$ for all $y \geq 1$. Hence $g(x) \geq \sup_n f_n(x) \geq 1/(2x)$ on $(0,1]$, and $\int_0^1 1/(2x) \, d\mathcal{L}^1(x) = \infty$, so no integrable dominator exists.
**Step 4 (Fatou's Lemma).** Since $f_n \geq 0$ for all $n$, [Fatou's Lemma](/theorems/510) applies and gives
\begin{align*}
\int_{[0,1]} \liminf_n f_n \, d\mu \leq \liminf_n \int_{[0,1]} f_n \, d\mu,
\end{align*}
which reads $0 \leq 1$. The inequality is strict, confirming that Fatou gives only a one-sided bound in the absence of domination.
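The failure of domination in Step 3 can also be seen exactly. For $x \in (0,1]$ the smallest possible dominator is $\sup_n f_n(x) = \lfloor 1/x \rfloor$, which equals $k$ on $(1/(k+1), 1/k]$. The sketch below (the function names are illustrative assumptions) integrates this step function over $(1/N, 1]$ with exact rational arithmetic and compares the result with a harmonic sum.

```python
from fractions import Fraction

def truncated_integral(N):
    """Exact integral of sup_n f_n = floor(1/x) over (1/N, 1].

    On (1/(k+1), 1/k] the supremum equals k, and that interval has
    Lebesgue measure 1/k - 1/(k+1).
    """
    return sum(k * (Fraction(1, k) - Fraction(1, k + 1)) for k in range(1, N))

def harmonic(N):
    """The N-th harmonic number 1 + 1/2 + ... + 1/N."""
    return sum(Fraction(1, k) for k in range(1, N + 1))

for N in (10, 100, 1000):
    print(N, float(truncated_integral(N)), float(harmonic(N) - 1))
```

Each term $k\,(1/k - 1/(k+1))$ simplifies to $1/(k+1)$, so the truncated integral is the harmonic number $H_N$ minus $1$. This grows without bound as $N \to \infty$, so $\sup_n f_n$ is not integrable.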
[/solution]

---

The [Integration](/page/Integration) section built the Lebesgue integral and proved the convergence theorems that govern passing limits through integrals. But the integral of $|f|$ alone (the $L^1$ norm) is a coarse measure of the "size" of a function. It cannot distinguish between a function that is moderately large over a wide set and one that is extremely large on a tiny set: $f = \mathbb{1}_{[0,1]}$ and $g = n\,\mathbb{1}_{(0,1/n]}$ both have $\int |f| = \int |g| = 1$, yet $g$ has a spike of height $n$. To capture finer information about the magnitude and distribution of a function's values, we introduce the $L^p$ norms for $p \geq 1$, which penalise large values more heavily as $p$ increases.
The central goal of this section is to show that these $L^p$ norms are genuine *norms* — in particular, that they satisfy the triangle inequality — and that the resulting spaces are *complete*. Neither of these facts is obvious from the definition. The triangle inequality for $\|\cdot\|_p$ (Minkowski's inequality) requires Hölder's inequality as an ingredient, which in turn relies on convexity (Jensen's inequality). We develop these tools in order, beginning with the simplest tail bound (Markov's inequality) and building up to the full $L^p$ theory.
# 5. $L^p$ Spaces
## 5.1 $L^p$ Spaces and Conjugate Exponents
[definition:$L^p$ Space]
Let $(E, \mathcal{E}, \mu)$ be a measure space. For $1 \leq p < \infty$, define $L^p = L^p(E, \mathcal{E}, \mu)$ as the set of measurable functions $f: E \to \mathbb{R}$ with
\begin{align*}
\|f\|_p = \left( \int_E |f|^p \, d\mu \right)^{1/p} < \infty.
\end{align*}
For $p = \infty$, define $L^\infty$ as the set of measurable functions with
\begin{align*}
\|f\|_\infty = \inf\!\left\{ \lambda \geq 0 : |f(x)| \leq \lambda \text{ for } \mu\text{-a.e. } x \right\} < \infty.
\end{align*}
[/definition]
The quantity $\|f\|_p$ satisfies homogeneity ($\|\alpha f\|_p = |\alpha|\,\|f\|_p$) by the scaling properties of the integral. However, $\|f\|_p = 0$ does not imply $f = 0$ pointwise — only $f = 0$ $\mu$-a.e. To obtain a genuine normed space, we pass to the *quotient* $\mathcal{L}^p = L^p/{\sim}$ where $f \sim g$ iff $f = g$ $\mu$-a.e. We write $L^p$ for $\mathcal{L}^p$ when the distinction is clear.
The triangle inequality $\|f + g\|_p \leq \|f\|_p + \|g\|_p$ is far from obvious — it is the content of [Minkowski's inequality](/theorems/517), which we prove below. Its proof requires [Hölder's inequality](/theorems/516), which involves *conjugate exponents*:
[definition:Conjugate Exponents]
Exponents $p, q \in [1, \infty]$ are *conjugate* if $\frac{1}{p} + \frac{1}{q} = 1$ (with $1/\infty = 0$). Thus $p = 1 \leftrightarrow q = \infty$ and $p = 2$ is self-conjugate.
[/definition]
## 5.2 Markov's Inequality
The simplest and most broadly useful tool for converting integral information into pointwise control is Markov's inequality. The question it answers is: if $\int f \, d\mu$ is small, how large can $f$ be on a set of substantial measure? The answer is: not very, because any set where $f \geq \lambda$ contributes at least $\lambda$ times its measure to the integral.
[quotetheorem:514]
[Markov's inequality](/theorems/514) is the prototype for all "tail bounds" in probability and analysis. Its proof is a single line — $f \geq \lambda \cdot \mathbb{1}_{\{f \geq \lambda\}}$, integrate both sides — yet the result is enormously useful because it applies to *any* non-negative measurable function with no regularity assumptions. By applying it to $|f|^p$ instead of $f$, one immediately obtains the family of *Chebyshev-type bounds*: $\mu(\{|f| \geq \lambda\}) \leq \lambda^{-p}\|f\|_p^p$, which improve as $p$ increases (trading integrability for tighter tail control). The inequality is sharp: equality holds when $f = \lambda \cdot \mathbb{1}_A$ for some measurable $A$.
[citeproof:514]
[example:Chebyshev's Inequality From Markov]
Let $X$ be a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ with $\mathbb{E}[X] = m$ and $\operatorname{Var}(X) = \sigma^2$. Applying [Markov's inequality](/theorems/514) to the non-negative function $(X - m)^2$ with threshold $\lambda^2$ gives
\begin{align*}
\mathbb{P}(|X - m| \geq \lambda) = \mathbb{P}((X - m)^2 \geq \lambda^2) \leq \frac{\mathbb{E}[(X-m)^2]}{\lambda^2} = \frac{\sigma^2}{\lambda^2}.
\end{align*}
This is Chebyshev's inequality: it quantifies the concentration of $X$ around its mean. For instance, $\mathbb{P}(|X - m| \geq 3\sigma) \leq 1/9$ regardless of the distribution of $X$.
[/example]
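Chebyshev's bound can be compared with an exact tail probability on a small example. The sketch below assumes a fair six-sided die as the distribution (an illustrative choice, not one from the text) and uses exact rational arithmetic; the names `tail` and `chebyshev_bound` are likewise hypothetical.

```python
from fractions import Fraction

law = {k: Fraction(1, 6) for k in range(1, 7)}   # fair die, an illustrative choice

mean = sum(x * p for x, p in law.items())
var = sum((x - mean) ** 2 * p for x, p in law.items())

def tail(lam):
    """Exact value of P(|X - m| >= lam)."""
    return sum(p for x, p in law.items() if abs(x - mean) >= lam)

def chebyshev_bound(lam):
    """Right-hand side sigma^2 / lam^2 of Chebyshev's inequality."""
    return var / lam**2

for lam in (Fraction(3, 2), Fraction(5, 2)):
    print(lam, tail(lam), chebyshev_bound(lam))
```

For $\lambda = 5/2$ the exact tail is $1/3$ against a bound of $7/15$. For $\lambda = 3/2$ the bound exceeds $1$ and carries no information, a reminder that Chebyshev is only useful for thresholds larger than $\sigma$.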
## 5.3 Jensen's Inequality
To prove Hölder's inequality, we will need the concept of *convexity* and its interaction with expectations. The question is: if $\phi$ is convex and $X$ is a random variable, how does $\mathbb{E}[\phi(X)]$ compare to $\phi(\mathbb{E}[X])$? Intuitively, the "bowing up" of a convex function should make the average of the outputs at least as large as the output at the average input.
[definition:Convex Function]
A function $\phi: I \to \mathbb{R}$ on an interval $I \subseteq \mathbb{R}$ is *convex* if
\begin{align*}
\phi(tx + (1-t)y) \leq t\,\phi(x) + (1-t)\,\phi(y)
\end{align*}
for all $x, y \in I$ and $t \in [0,1]$.
[/definition]
The geometric content is that the graph of $\phi$ lies below (or on) any chord. A key analytic consequence is the *supporting line property*: at every interior point $x_0 \in I$, there exists $a \in \mathbb{R}$ such that $\phi(y) \geq \phi(x_0) + a(y - x_0)$ for all $y \in I$. This says that $\phi$ lies above some affine function tangent at $x_0$ — even if $\phi$ is not differentiable there (in which case $a$ lies between the left and right derivatives).
[quotetheorem:515]
[Jensen's inequality](/theorems/515) formalises the intuition that convex functions amplify variability. The proof is a direct application of the supporting line property: choose $x_0 = \mathbb{E}[X]$ and $a$ such that $\phi(y) \geq \phi(\mathbb{E}[X]) + a(y - \mathbb{E}[X])$ for all $y \in I$. Substituting $y = X(\omega)$ and taking expectations gives $\mathbb{E}[\phi(X)] \geq \phi(\mathbb{E}[X]) + a \cdot 0 = \phi(\mathbb{E}[X])$. The hypothesis that $\mathbb{P}$ is a *probability* measure ($\mathbb{P}(\Omega) = 1$) is essential — for a measure with total mass $c \neq 1$, the inequality fails unless $\phi$ is normalised accordingly.
[citeproof:515]
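The role of the normalisation $\mathbb{P}(\Omega) = 1$ shows up already for finitely supported measures. The sketch below is a hedged illustration: the helper `integral` and the two measures `prob` (total mass $1$) and `heavy` (total mass $2$) are assumptions made up for this check, with $\phi(x) = x^2$ as the convex function.

```python
from fractions import Fraction

def integral(func, measure):
    """Integrate func against a finitely supported measure {point: mass}."""
    return sum(func(x) * m for x, m in measure.items())

def phi(x):
    """A convex function on the real line."""
    return x * x

def identity(x):
    return x

prob = {Fraction(-1): Fraction(1, 2), Fraction(3): Fraction(1, 2)}   # total mass 1
heavy = {Fraction(1): Fraction(2)}                                     # total mass 2

for name, measure in (("prob", prob), ("heavy", heavy)):
    lhs = integral(phi, measure)             # integral of phi(X)
    rhs = phi(integral(identity, measure))   # phi of the integral of X
    print(name, lhs, rhs, lhs >= rhs)
```

For the probability measure the inequality holds ($5 \geq 1$). For the measure of total mass $2$ it fails ($2 < 4$), because taking "expectations" of the supporting line no longer cancels the term $a(y - x_0)$.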
[example:Young's Inequality From Jensen]
For conjugate exponents $p, q \in (1, \infty)$ and non-negative reals $a, b \geq 0$,
\begin{align*}
ab \leq \frac{a^p}{p} + \frac{b^q}{q}.
\end{align*}
This is *Young's inequality*. To prove it: if $a = 0$ or $b = 0$ the result is trivial. Otherwise, the function $\phi(t) = e^t$ is convex, and $\frac{1}{p} + \frac{1}{q} = 1$ gives
\begin{align*}
ab = e^{\log a + \log b} = e^{\frac{1}{p}(p \log a) + \frac{1}{q}(q \log b)} \leq \frac{1}{p}e^{p \log a} + \frac{1}{q}e^{q \log b} = \frac{a^p}{p} + \frac{b^q}{q},
\end{align*}
where the inequality is [Jensen's inequality](/theorems/515) applied to $\phi = \exp$ with the "probability measure" assigning mass $1/p$ to $p\log a$ and $1/q$ to $q\log b$. Young's inequality is the key ingredient in the proof of [Hölder's inequality](/theorems/516).
[/example]
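A quick numerical check of Young's inequality, including the equality case $a^p = b^q$, can be sketched as follows. The helper name `young_gap` and the sample values are assumptions chosen for illustration; results are floating-point and only accurate up to rounding.

```python
import math

def young_gap(a, b, p):
    """Return a**p/p + b**q/q - a*b for the conjugate exponent q = p/(p-1)."""
    q = p / (p - 1)
    return a**p / p + b**q / q - a * b

# Generic point: the gap is strictly positive.
print(young_gap(2.0, 1.0, 3.0))

# Equality case a**p == b**q: with p = 3, q = 3/2, a = 2 we need b**1.5 = 8, so b = 4.
print(young_gap(2.0, 4.0, 3.0))

# The gap stays non-negative (up to rounding) on a small grid.
grid = [0.1, 0.5, 1.0, 2.0, 7.0]
worst = min(young_gap(a, b, p) for a in grid for b in grid for p in (1.5, 2.0, 3.0, 10.0))
print(worst)
```

The equality case corresponds to the two points $p \log a$ and $q \log b$ coinciding, where Jensen's inequality for $\exp$ is trivially an equality.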
## 5.4 Hölder's Inequality
With Young's inequality in hand, we can answer the fundamental duality question for $L^p$ spaces: when is the product $fg$ integrable, and how does $\int |fg|$ relate to $\|f\|_p$ and $\|g\|_q$? Without some relation between the exponents, $fg$ need not be integrable even when $f \in L^p$ and $g \in L^q$ separately. Hölder's inequality shows that conjugacy of $p$ and $q$ is enough, and it provides the optimal bound.
[quotetheorem:516]
[Hölder's inequality](/theorems/516) is the cornerstone of $L^p$ duality. When $p = q = 2$, it reduces to the Cauchy–Schwarz inequality $|\langle f, g \rangle| \leq \|f\|_2 \|g\|_2$. The proof normalises: set $F = |f|/\|f\|_p$ and $G = |g|/\|g\|_q$ (assuming both norms are positive and finite), so that $\|F\|_p = \|G\|_q = 1$. Young's inequality gives $F(x)G(x) \leq F(x)^p/p + G(x)^q/q$ pointwise, and integrating both sides yields $\int FG \, d\mu \leq 1/p + 1/q = 1$, which is the normalised form of Hölder. The inequality is sharp: equality holds if and only if $|f|^p / \|f\|_p^p = |g|^q / \|g\|_q^q$ $\mu$-a.e. (i.e., the two functions are "aligned" in $L^p$-$L^q$ duality).
[citeproof:516]
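On a finite measure space with point masses, integrals become weighted sums, so Hölder's inequality and its equality case can be checked directly. The following is a sketch under that assumption; the weights, the functions `f` and `g`, and the helper names are hypothetical.

```python
def lp_norm(f, weights, p):
    """L^p norm of f on a finite space with the given point masses."""
    return sum(abs(x)**p * w for x, w in zip(f, weights)) ** (1.0 / p)

def holder_sides(f, g, weights, p):
    """Return (integral of |f g|, ||f||_p * ||g||_q) with q conjugate to p."""
    q = p / (p - 1.0)
    lhs = sum(abs(x * y) * w for x, y, w in zip(f, g, weights))
    return lhs, lp_norm(f, weights, p) * lp_norm(g, weights, q)

weights = [0.5, 2.0, 1.0, 0.25]
f = [1.0, -2.0, 0.5, 3.0]
g = [0.3, 1.0, -4.0, 2.0]
lhs, rhs = holder_sides(f, g, weights, 3.0)

# Equality case for p = 3: |g| = |f|^(p-1) makes |g|^q proportional to |f|^p.
g_aligned = [abs(x)**2.0 for x in f]
lhs_eq, rhs_eq = holder_sides(f, g_aligned, weights, 3.0)
```

For generic `g` the left side is strictly smaller; for `g_aligned` the two sides agree up to rounding, matching the equality condition stated above.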
## 5.5 Minkowski's Inequality
With [Hölder's inequality](/theorems/516) established, we can finally prove the triangle inequality for $\|\cdot\|_p$. This is the last ingredient needed to confirm that $\mathcal{L}^p$ is a [normed vector space](/page/Normed%20Vector%20Space).
[quotetheorem:517]
The cases $p = 1$ and $p = \infty$ are straightforward: for $p = 1$, the triangle inequality $|f + g| \leq |f| + |g|$ integrates directly; for $p = \infty$, $\|f + g\|_\infty \leq \|f\|_\infty + \|g\|_\infty$ follows from the triangle inequality for real numbers applied $\mu$-a.e. The case $1 < p < \infty$ is the non-trivial one. The proof writes
\begin{align*}
|f + g|^p \leq |f + g|^{p-1}|f| + |f + g|^{p-1}|g|,
\end{align*}
integrates both sides, and applies [Hölder's inequality](/theorems/516) to each term — pairing $|f|$ (or $|g|$) in $L^p$ with $|f+g|^{p-1}$ in $L^q$ where $q = p/(p-1)$. The exponent $q$ is conjugate to $p$ precisely because $(p-1)q = p$.
[citeproof:517]
## 5.6 Completeness: $\mathcal{L}^p$ Is a [Banach Space](/page/Banach%20Space)
The normed spaces $\mathcal{L}^p$ are not merely normed — they are *complete*, meaning every [Cauchy sequence](/page/Cauchy%20Sequence) converges. Completeness is what makes $\mathcal{L}^p$ spaces useful in practice: it guarantees that limits of approximating sequences (from numerical methods, PDE theory, or functional analysis) remain in the space.
[theorem:Riesz Fischer]
For $1 \leq p \leq \infty$, the normed space $(\mathcal{L}^p(E, \mathcal{E}, \mu), \|\cdot\|_p)$ is complete. That is, every Cauchy sequence in $\mathcal{L}^p$ converges to a limit in $\mathcal{L}^p$.
[/theorem]
The proof for $1 \leq p < \infty$ proceeds as follows. Let $(f_n)$ be Cauchy in $\mathcal{L}^p$. Extract a subsequence $(f_{n_k})$ with $\|f_{n_{k+1}} - f_{n_k}\|_p < 2^{-k}$ (possible by the Cauchy property). Define
\begin{align*}
G = |f_{n_1}| + \sum_{k=1}^\infty |f_{n_{k+1}} - f_{n_k}|.
\end{align*}
By the [Monotone Convergence Theorem](/theorems/509) and the triangle inequality ([Minkowski](/theorems/517)),
\begin{align*}
\|G\|_p \leq \|f_{n_1}\|_p + \sum_{k=1}^\infty \|f_{n_{k+1}} - f_{n_k}\|_p \leq \|f_{n_1}\|_p + 1 < \infty,
\end{align*}
so $G \in L^p$ and in particular $G < \infty$ $\mu$-a.e. This means the telescoping series $f_{n_1} + \sum_k (f_{n_{k+1}} - f_{n_k})$ converges absolutely $\mu$-a.e. to some measurable function $f$ with $|f| \leq G$. By construction, $f_{n_k} \to f$ $\mu$-a.e., and $|f_{n_k} - f|^p \leq (2G)^p \in L^1$, so the [Dominated Convergence Theorem](/theorems/4) gives $\|f_{n_k} - f\|_p \to 0$. Finally, $\|f_n - f\|_p \leq \|f_n - f_{n_k}\|_p + \|f_{n_k} - f\|_p \to 0$ as $n, k \to \infty$ (using the Cauchy property for the first term).
For $p = \infty$: if $(f_n)$ is Cauchy in $L^\infty$, define $N_{n,m} = \{|f_n - f_m| > \|f_n - f_m\|_\infty\}$ and $N = \bigcup_{n,m} N_{n,m}$. Then $\mu(N) = 0$, and on $N^c$ the sequence $(f_n)$ is uniformly Cauchy, hence converges uniformly to some $f$, giving $\|f_n - f\|_\infty \to 0$.
## 5.7 Inclusions Between $L^p$ Spaces
The relationship between different $L^p$ spaces depends on the underlying measure. On $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mathcal{L}^1)$, neither $L^p \subseteq L^q$ nor $L^q \subseteq L^p$ for $p \neq q$. To see this concretely: the function $f(x) = x^{-1/2}\,\mathbb{1}_{(0,1]}(x)$ satisfies
\begin{align*}
\int_0^1 |f|^p \, d\mathcal{L}^1 = \int_0^1 x^{-p/2} \, d\mathcal{L}^1(x),
\end{align*}
which converges if and only if $p < 2$, so $f \in L^p \setminus L^2$ for $1 \leq p < 2$. Conversely, $g(x) = x^{-1/2}\,\mathbb{1}_{[1,\infty)}(x)$ has $\int_1^\infty |g|^p \, d\mathcal{L}^1 < \infty$ iff $p > 2$, so $g \in L^p \setminus L^2$ for $p > 2$. Neither space contains the other.
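The threshold at $p = 2$ can be seen from the exact value of the truncated integral $\int_\varepsilon^1 x^{-p/2}\,dx$. The sketch below (the function name `truncated_integral` is an illustrative choice) evaluates the antiderivative in closed form and prints the values as $\varepsilon \to 0$ for $p = 1$ and $p = 2$.

```python
import math

def truncated_integral(p, eps):
    """Exact integral of x^(-p/2) over [eps, 1], for 0 < eps < 1."""
    s = 1.0 - p / 2.0
    if abs(s) < 1e-15:             # p = 2: the antiderivative is log x
        return -math.log(eps)
    return (1.0 - eps**s) / s      # otherwise the antiderivative is x^s / s

for eps in (1e-2, 1e-4, 1e-8):
    # p = 1 stays bounded (limit 2); p = 2 grows like log(1/eps).
    print(eps, truncated_integral(1.0, eps), truncated_integral(2.0, eps))
```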
On *finite* measure spaces the situation is simpler. If $\mu(E) < \infty$ and $1 \leq p < q \leq \infty$, then $L^q \subseteq L^p$ with
\begin{align*}
\|f\|_p \leq \mu(E)^{1/p - 1/q}\,\|f\|_q.
\end{align*}
The proof for $q < \infty$ applies [Hölder's inequality](/theorems/516) to the pair $|f|^p \in L^{q/p}$ and $\mathbb{1}_E \in L^{q/(q-p)}$ (conjugate exponents, since $p/q + (q-p)/q = 1$), giving $\int |f|^p \, d\mu \leq (\int |f|^q \, d\mu)^{p/q} \cdot \mu(E)^{1 - p/q}$, and taking $p$-th roots yields the bound. For $q = \infty$, integrate the bound $|f|^p \leq \|f\|_\infty^p$ (valid $\mu$-a.e.) over $E$. In probability, this means $L^2(\mathbb{P}) \subseteq L^1(\mathbb{P})$: every square-integrable random variable is integrable.
## 5.8 The Hilbert Space $\mathcal{L}^2$
When $p = 2$, the $L^p$ norm arises from an *inner product*:
\begin{align*}
\langle f, g \rangle = \int_E fg \, d\mu.
\end{align*}
That $\langle f, g \rangle$ is finite for $f, g \in L^2$ follows from [Hölder's inequality](/theorems/516) with $p = q = 2$ (Cauchy–Schwarz): $|\langle f, g \rangle| \leq \|f\|_2\|g\|_2$. The inner product is bilinear, symmetric, and positive definite (on $\mathcal{L}^2$, since $\langle f, f \rangle = 0$ implies $f = 0$ $\mu$-a.e.). Since $\mathcal{L}^2$ is complete (by the Riesz–Fischer theorem above), it is a *Hilbert space*.
### 5.8.1 Orthogonal Projections
The inner product structure gives $\mathcal{L}^2$ a geometric flavour absent from other $\mathcal{L}^p$ spaces. For any closed subspace $V \subseteq \mathcal{L}^2$, every $f \in \mathcal{L}^2$ decomposes uniquely as $f = v + u$ with $v \in V$ and $u \in V^\perp = \{w \in \mathcal{L}^2 : \langle w, h \rangle = 0 \text{ for all } h \in V\}$. The element $v$ is the *orthogonal projection* of $f$ onto $V$: it is the unique element of $V$ minimising $\|f - g\|_2$ over $g \in V$.
[example:Conditional Expectation As Projection]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\mathcal{G} \subseteq \mathcal{F}$ a sub-$\sigma$-algebra. The closed subspace $V = \{Y \in \mathcal{L}^2(\mathbb{P}) : Y \text{ is } \mathcal{G}\text{-measurable}\}$ represents the "information available from $\mathcal{G}$." For $X \in \mathcal{L}^2(\mathbb{P})$, the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the orthogonal projection of $X$ onto $V$. This means:
\begin{align*}
\|X - \mathbb{E}[X \mid \mathcal{G}]\|_2 \leq \|X - Y\|_2 \quad \text{for all } \mathcal{G}\text{-measurable } Y \in \mathcal{L}^2.
\end{align*}
Conditional expectation is the best $L^2$-approximation of $X$ given the information in $\mathcal{G}$. The projection property also encodes the "tower law": $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]$ when $\mathcal{H} \subseteq \mathcal{G}$, because projecting first onto a larger subspace and then onto a smaller one is the same as projecting directly onto the smaller one.
[/example]
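On a finite probability space, a sub-$\sigma$-algebra generated by a partition makes the projection explicit: $\mathbb{E}[X \mid \mathcal{G}]$ is the probability-weighted average of $X$ on each block. The sketch below assumes a hypothetical four-point space and partition, and checks orthogonality of the residual and the minimising property against one $\mathcal{G}$-measurable competitor.

```python
def conditional_expectation(x, prob, blocks):
    """E[X | G] on a finite space, where G is generated by a partition into blocks."""
    out = [0.0] * len(x)
    for block in blocks:
        mass = sum(prob[i] for i in block)
        avg = sum(x[i] * prob[i] for i in block) / mass   # weighted block average
        for i in block:
            out[i] = avg
    return out

def inner(f, g, prob):
    """L^2(P) inner product on the finite space."""
    return sum(a * b * w for a, b, w in zip(f, g, prob))

prob = [0.1, 0.2, 0.3, 0.4]
x = [1.0, 4.0, -2.0, 3.0]
blocks = [[0, 1], [2, 3]]          # hypothetical partition generating G
proj = conditional_expectation(x, prob, blocks)
residual = [a - b for a, b in zip(x, proj)]

# A G-measurable Y is constant on each block; the residual is orthogonal to it.
y = [5.0, 5.0, -1.0, -1.0]
orth = inner(residual, y, prob)
dist_proj = inner(residual, residual, prob)
diff_y = [a - b for a, b in zip(x, y)]
dist_y = inner(diff_y, diff_y, prob)
```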
## 5.9 Uniform Integrability and $L^1$ Convergence
The [Dominated Convergence Theorem](/theorems/4) requires a fixed integrable dominator $g$. In many applications — particularly in probability, where one studies sequences of random variables with no uniform bound — no such $g$ exists. What is the *minimal* condition on a sequence $(X_n)$ that, combined with convergence in probability, guarantees $L^1$ convergence?
The answer is *uniform integrability*. The motivating failure is the sequence $X_n = n\,\mathbb{1}_{(0,1/n]}$ on $([0,1], \mathcal{B}, \mathcal{L}^1)$: $X_n \to 0$ a.s., but $\mathbb{E}[|X_n|] = 1$ for all $n$, so $X_n \not\to 0$ in $L^1$. The problem is that the integral $\mathbb{E}[|X_n|\,\mathbb{1}_{\{|X_n| > K\}}]$ does not vanish as $K \to \infty$ uniformly in $n$: for $K < n$, the entire mass of $X_n$ sits above level $K$.
[definition:Uniform Integrability]
A family $\mathcal{X}$ of $L^1$ random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ is *uniformly integrable* (UI) if
\begin{align*}
\lim_{K \to \infty} \sup_{X \in \mathcal{X}} \mathbb{E}\!\left[|X|\,\mathbb{1}_{\{|X| > K\}}\right] = 0.
\end{align*}
[/definition]
An equivalent formulation: $\sup_{X \in \mathcal{X}} \mathbb{E}[|X|] < \infty$ (the family is $L^1$-bounded), and for every $\varepsilon > 0$ there exists $\delta > 0$ such that $\mathbb{P}(A) < \delta$ implies $\mathbb{E}[|X|\,\mathbb{1}_A] < \varepsilon$ uniformly over $X \in \mathcal{X}$. The second condition says that the integrals cannot concentrate on sets of arbitrarily small probability — precisely the pathology exhibited by $n\,\mathbb{1}_{(0,1/n]}$.
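For the motivating sequence $X_n = n\,\mathbb{1}_{(0,1/n]}$ the tail expectation is available in closed form, so the failure of uniform integrability can be tabulated exactly. The sketch below uses hypothetical helper names and encodes only that closed form; it does not simulate anything.

```python
def tail_expectation(n, K):
    """E[|X_n| 1{|X_n| > K}] for X_n = n * 1_(0, 1/n] on [0, 1] with Lebesgue measure.

    X_n equals n on a set of measure 1/n and 0 elsewhere, so the tail
    expectation is n * (1/n) = 1 when n > K and 0 otherwise.
    """
    return 1.0 if n > K else 0.0

def sup_tail(K, n_max):
    """Supremum of the tail expectation over n = 1, ..., n_max."""
    return max(tail_expectation(n, K) for n in range(1, n_max + 1))

# However large K is, allowing n > K keeps the supremum equal to 1.
sups = [sup_tail(K, n_max=10 * K) for K in (1, 10, 100, 1000)]
```

The supremum over the whole sequence is therefore $1$ for every $K$, so the limit in the definition is $1$ rather than $0$.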
### 5.9.1 The $L^1$ Convergence Characterisation
[theorem: $L^1$ Convergence via Uniform Integrability]
Let $X, X_1, X_2, \dots$ be integrable random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. Then $X_n \to X$ in $L^1$ if and only if $X_n \to X$ in probability and $\{X_n\}$ is uniformly integrable.
[/theorem]
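*Forward direction.* If $\|X_n - X\|_1 \to 0$, then [Markov's inequality](/theorems/514) gives $\mathbb{P}(|X_n - X| \geq \varepsilon) \leq \varepsilon^{-1}\|X_n - X\|_1 \to 0$, so $X_n \to X$ in probability. For UI: $\mathbb{E}[|X_n|\,\mathbb{1}_{\{|X_n| > K\}}] \leq \mathbb{E}[|X_n - X|] + \mathbb{E}[|X|\,\mathbb{1}_{\{|X_n| > K\}}]$. The first term does not depend on $K$ and is below any given $\varepsilon$ for all $n \geq N$; the finitely many $X_n$ with $n < N$ form a UI family on their own. Since $\{|X_n| > K\} \subseteq \{|X| > K/2\} \cup \{|X_n - X| > K/2\}$, the second term is controlled by $\mathbb{E}[|X|\,\mathbb{1}_{\{|X| > K/2\}}] + \mathbb{E}[|X|\,\mathbb{1}_{\{|X_n - X| > K/2\}}]$, both of which tend to zero as $K \to \infty$ uniformly in $n$, because $X$ is integrable and $\mathbb{P}(|X_n - X| > K/2) \leq 2K^{-1}\sup_m \|X_m - X\|_1$.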
*Reverse direction.* Fix $\varepsilon > 0$. Choose $K$ large enough that $\mathbb{E}[|X_n|\,\mathbb{1}_{\{|X_n| > K\}}] < \varepsilon$ and $\mathbb{E}[|X|\,\mathbb{1}_{\{|X| > K\}}] < \varepsilon$ for all $n$. Write
\begin{align*}
X_n - X = (X_n - X)\,\mathbb{1}_{\{|X_n| \leq K,\, |X| \leq K\}} + (X_n - X)\,\mathbb{1}_{\{|X_n| > K \text{ or } |X| > K\}}.
\end{align*}
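The first term $Z_n$ satisfies $|Z_n| \leq 2K$ and $Z_n \to 0$ in probability, so $\mathbb{E}|Z_n| \leq \eta + 2K\,\mathbb{P}(|X_n - X| > \eta)$ for every $\eta > 0$ and its $L^1$ norm tends to zero (this is the bounded-convergence form of the [Dominated Convergence Theorem](/theorems/4) on a finite measure space). The second term is bounded by $(|X_n| + |X|)\,\mathbb{1}_{A_n}$ with $A_n = \{|X_n| > K \text{ or } |X| > K\}$. By [Markov's inequality](/theorems/514), $\mathbb{P}(A_n) \leq K^{-1}(\mathbb{E}|X_n| + \mathbb{E}|X|)$, which is small uniformly in $n$ because a UI family is $L^1$-bounded. Enlarging $K$ if necessary, the $\varepsilon$-$\delta$ form of uniform integrability, applied to the UI family $\{X_n\} \cup \{X\}$, bounds the $L^1$ norm of the second term by $2\varepsilon$. Since $\varepsilon$ was arbitrary, $\|X_n - X\|_1 \to 0$.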
[example:$L^p$ Boundedness Implies Uniform Integrability]
If $\sup_n \|X_n\|_p < \infty$ for some $p > 1$, then $\{X_n\}$ is uniformly integrable. By [Hölder's inequality](/theorems/516) with exponents $p$ and $q = p/(p-1)$:
\begin{align*}
\mathbb{E}[|X_n|\,\mathbb{1}_A] \leq \|X_n\|_p \cdot \|\mathbb{1}_A\|_q = \|X_n\|_p \cdot \mathbb{P}(A)^{1/q} \leq C\,\mathbb{P}(A)^{1/q},
\end{align*}
where $C = \sup_n \|X_n\|_p$. Taking $A = \{|X_n| > K\}$: [Markov's inequality](/theorems/514) gives $\mathbb{P}(|X_n| > K) \leq C^p / K^p$, so $\mathbb{E}[|X_n|\,\mathbb{1}_{\{|X_n| > K\}}] \leq C \cdot (C^p/K^p)^{1/q} = C^{1+p/q}/K^{p/q} \to 0$ as $K \to \infty$, uniformly in $n$.
[/example]
[problem]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X \in L^2(\mathbb{P})$ with $\mathbb{E}[X] = 0$ and $\operatorname{Var}(X) = \sigma^2 > 0$. Show that $\mathbb{P}(|X| \geq t) \leq \sigma^2/t^2$ for all $t > 0$, and that $\mathbb{E}[e^X] \geq 1$.
[/problem]
[solution]
**Step 1 (Chebyshev bound via Markov).** The random variable $X^2$ is non-negative, so [Markov's inequality](/theorems/514) with threshold $t^2$ gives
\begin{align*}
\mathbb{P}(|X| \geq t) = \mathbb{P}(X^2 \geq t^2) \leq \frac{\mathbb{E}[X^2]}{t^2}.
\end{align*}
Since $\mathbb{E}[X] = 0$, we have $\mathbb{E}[X^2] = \operatorname{Var}(X) + (\mathbb{E}[X])^2 = \sigma^2 + 0 = \sigma^2$. Substituting:
\begin{align*}
\mathbb{P}(|X| \geq t) \leq \frac{\sigma^2}{t^2}.
\end{align*}
**Step 2 (Jensen bound for the exponential).** The function $\phi: \mathbb{R} \to \mathbb{R}$ defined by $\phi(x) = e^x$ is convex (since $\phi''(x) = e^x > 0$ everywhere). The random variable $X$ is integrable (since $X \in L^2(\mathbb{P}) \subseteq L^1(\mathbb{P})$ on a probability space, by the $L^p$ inclusion for finite measures). By [Jensen's inequality](/theorems/515),
\begin{align*}
\mathbb{E}[e^X] = \mathbb{E}[\phi(X)] \geq \phi(\mathbb{E}[X]) = e^{\mathbb{E}[X]} = e^0 = 1.
\end{align*}
Equality holds if and only if $X = \mathbb{E}[X] = 0$ $\mathbb{P}$-a.s. Since $\operatorname{Var}(X) = \sigma^2 > 0$, $X$ is not a.s. constant, so the inequality is strict: $\mathbb{E}[e^X] > 1$.
[/solution]

---

The [Integration](/page/Integration) section developed the Lebesgue integral and the convergence theorems; the [Inequalities and $L^p$ Spaces](/page/Inequalities%20and%20%24L%5Ep%24%20Spaces) section built normed function spaces from $L^p$ norms. We now introduce a tool that links analysis and probability at a deeper level: the *Fourier transform*, which decomposes a function or measure into its oscillatory components at each frequency.
The [Fourier transform](/page/Fourier%20Transform) is valuable for two distinct reasons. In analysis, it converts convolution into multiplication and differentiation into polynomial multiplication, turning integral equations into algebraic ones. In probability, the Fourier transform of a distribution — its *characteristic function* — always exists (unlike moment generating functions, which may diverge), uniquely determines the distribution, and converts the question of [distributional](/page/Distribution) convergence into pointwise convergence of functions. This last property, formalised by [Lévy's Continuity Theorem](/theorems/519), is the engine behind the Central Limit Theorem.
# 6. Fourier Analysis
## 6.1 The Fourier Transform on $L^1$
The starting point is to define the Fourier transform for integrable functions. The idea is to decompose $f$ into its "frequency content" by integrating $f$ against complex exponentials $e^{i\langle u, x \rangle}$ at each frequency $u$.
[definition:Fourier Transform Of An Integrable Function]
Let $f \in L^1(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), \mathcal{L}^d)$. The *Fourier transform* of $f$ is the function $\hat{f}: \mathbb{R}^d \to \mathbb{C}$ defined by
\begin{align*}
\hat{f}(u) = \int_{\mathbb{R}^d} f(x)\,e^{i\langle u, x \rangle} \, d\mathcal{L}^d(x),
\end{align*}
where $\langle u, x \rangle = u_1 x_1 + \cdots + u_d x_d$ is the Euclidean inner product.
[/definition]
The integral converges absolutely for every $u$ because $|f(x) e^{i\langle u, x\rangle}| = |f(x)|$ and $f \in L^1$. This observation already gives the first basic property: $|\hat{f}(u)| \leq \int |f| \, d\mathcal{L}^d = \|f\|_1$ for all $u$, so $\hat{f}$ is bounded.
### 6.1.1 Fourier Transform of Finite Measures
The definition extends naturally to measures. In probability, every random variable has a finite law, even when it lacks a density — so the measure-theoretic version is the natural setting for characteristic functions.
[definition:Fourier Transform Of A Finite Measure]
Let $\mu$ be a finite Borel measure on $\mathbb{R}^d$. The *Fourier transform* (or *Fourier–Stieltjes transform*) of $\mu$ is
\begin{align*}
\hat{\mu}(u) = \int_{\mathbb{R}^d} e^{i\langle u, x \rangle} \, \mu(dx).
\end{align*}
If $\mu$ has density $f$ with respect to $\mathcal{L}^d$, then $\hat{\mu} = \hat{f}$.
[/definition]
[definition:Characteristic Function]
Let $X$ be a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ taking values in $\mathbb{R}^d$, with law $\mu_X = \mathbb{P} \circ X^{-1}$. The *characteristic function* of $X$ is
\begin{align*}
\phi_X(u) = \mathbb{E}[e^{i\langle u, X \rangle}] = \hat{\mu}_X(u).
\end{align*}
[/definition]
Every random variable has a characteristic function, since $|e^{i\langle u, X \rangle}| = 1$ is trivially integrable on a probability space. This is a key advantage over the moment generating function $\mathbb{E}[e^{tX}]$, which may be infinite for heavy-tailed distributions.
### 6.1.2 The Riemann–Lebesgue Lemma
The Fourier transform maps $L^1$ functions to bounded, uniformly continuous functions that vanish at infinity. Continuity follows from the [Dominated Convergence Theorem](/theorems/4): if $u_n \to u$, the integrands $f(x)e^{i\langle u_n, x\rangle}$ converge pointwise to $f(x)e^{i\langle u, x\rangle}$ and are dominated by $|f(x)|$, so $\hat{f}(u_n) \to \hat{f}(u)$. The deeper fact is that $\hat{f}$ vanishes at infinity.
[quotetheorem:526]
The [Riemann–Lebesgue Lemma](/theorems/526) shows that $\hat{f} \in C_0(\mathbb{R}^d)$ — the Fourier transform of any $L^1$ function is continuous and decays to zero. The proof proceeds in three stages: first, for indicator functions of rectangles, the Fourier transform is computed explicitly and seen to decay (each coordinate factor is $O(1/|u_j|)$); second, linearity extends the result to simple functions; third, a density argument — approximating $f \in L^1$ by simple functions $g$ with $\|f - g\|_1 < \varepsilon$, then using the bound $|\hat{f}(u) - \hat{g}(u)| \leq \|f - g\|_1$ — yields the general case.
The Riemann–Lebesgue lemma does *not* hold for general finite measures: a point mass $\delta_0$ has $\hat{\delta}_0(u) = 1$ for all $u$. The distinction is that $L^1$ densities spread their mass continuously, while point masses do not.
[citeproof:526]
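The first stage of the proof can be made concrete in dimension $d = 1$: for $f = \mathbb{1}_{[a,b]}$ the transform is $\hat{f}(u) = (e^{iub} - e^{iua})/(iu)$, which is bounded by $2/|u|$. The sketch below (function names and the interval are illustrative assumptions) evaluates this closed form, cross-checks it against a midpoint-rule approximation of the defining integral, and prints the decay.

```python
import cmath

def ft_indicator(u, a, b):
    """Fourier transform of 1_[a,b] at frequency u (convention: integrate f(x) e^{iux})."""
    if u == 0.0:
        return complex(b - a)
    return (cmath.exp(1j * u * b) - cmath.exp(1j * u * a)) / (1j * u)

def ft_midpoint(u, a, b, steps=20000):
    """Midpoint-rule approximation of the same integral, as an independent check."""
    h = (b - a) / steps
    return sum(cmath.exp(1j * u * (a + (k + 0.5) * h)) for k in range(steps)) * h

a, b = -1.0, 2.0
for u in (1.0, 10.0, 100.0, 1000.0):
    # The modulus of the transform never exceeds 2/|u|.
    print(u, abs(ft_indicator(u, a, b)), 2.0 / u)
```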
## 6.2 Convolution
Convolution is the analytical counterpart of adding independent random variables: if $X$ and $Y$ are independent with densities $f$ and $g$, then $X + Y$ has density $f * g$. More broadly, convolution provides a smoothing mechanism — convolving a rough function with a smooth kernel produces a smoother output.
[definition:Convolution]
For $f \in L^1(\mathbb{R}^d, \mathcal{L}^d)$ and a finite Borel measure $\nu$ on $\mathbb{R}^d$, the *convolution* is
\begin{align*}
(f * \nu)(x) = \int_{\mathbb{R}^d} f(x - y) \, \nu(dy),
\end{align*}
defined for $\mathcal{L}^d$-almost every $x$. If $\nu$ has density $g$, we write $(f * g)(x) = \int_{\mathbb{R}^d} f(x - y)\,g(y) \, d\mathcal{L}^d(y)$.
[/definition]
That $f * \nu \in L^p$ when $f \in L^p$ and $\nu$ is a probability measure follows from [Jensen's inequality](/theorems/515) and [Fubini's theorem](/theorems/513): the convexity of $t \mapsto |t|^p$ gives $|(f * \nu)(x)|^p \leq \int |f(x-y)|^p \, \nu(dy)$, and integrating in $x$ via [Fubini](/theorems/513) yields $\|f * \nu\|_p^p \leq \|f\|_p^p$. So convolution with a probability measure is a *contraction* on $L^p$.
### 6.2.1 The Convolution Theorem
The deepest algebraic property of the Fourier transform is that it converts the integral operation of convolution into pointwise multiplication. This is what makes the Fourier transform useful for solving integral and differential equations.
[quotetheorem:527]
The [Convolution Theorem](/theorems/527) is the key structural result of Fourier analysis. The proof is a direct application of [Fubini's theorem](/theorems/513): write out $\widehat{f * \nu}(u)$ as a double integral, exchange the order of integration (justified by absolute convergence), and use the substitution $z = x - y$ (translation invariance of $\mathcal{L}^d$) to factor the inner integral as $e^{i\langle u, y\rangle}\hat{f}(u)$. In probability, this gives $\phi_{X+Y} = \phi_X \cdot \phi_Y$ for independent $X, Y$ — the characteristic function of a sum of independent random variables is the product of their characteristic functions.
[citeproof:527]
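The probabilistic form $\phi_{X+Y} = \phi_X \cdot \phi_Y$ can be verified exactly for laws with finite support on the integers, where convolution is a finite double sum. The sketch below is a discrete analogue (two measures rather than a density and a measure); the die and coin laws and the helper names are illustrative assumptions.

```python
import cmath

def char_fn(pmf, u):
    """Characteristic function of a law on the integers, given as {value: probability}."""
    return sum(p * cmath.exp(1j * u * k) for k, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """Law of X + Y for independent X and Y: the convolution of the two laws."""
    out = {}
    for i, p in pmf_x.items():
        for j, q in pmf_y.items():
            out[i + j] = out.get(i + j, 0.0) + p * q
    return out

die = {k: 1.0 / 6.0 for k in range(1, 7)}   # fair die (illustrative)
coin = {0: 0.3, 1: 0.7}                      # biased coin (illustrative)
law_sum = convolve(die, coin)

u = 0.8
lhs = char_fn(law_sum, u)                    # transform of the convolution
rhs = char_fn(die, u) * char_fn(coin, u)     # product of the transforms
```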
## 6.3 Gaussian Densities and Fourier Inversion
To recover a function from its Fourier transform, we need an *inversion formula*. The strategy is to prove inversion first for Gaussians (where explicit computation is possible), then use Gaussian convolutions to approximate general $L^1$ functions.
[definition:Gaussian Density]
For $t > 0$, the *Gaussian density* on $\mathbb{R}^d$ with variance parameter $t$ is
\begin{align*}
g_t(x) = (2\pi t)^{-d/2}\,\exp\!\left(-\frac{|x|^2}{2t}\right).
\end{align*}
[/definition]
The Gaussian $g_t$ is a probability density ($\int g_t \, d\mathcal{L}^d = 1$), and as $t \to 0$ it concentrates near the origin — $g_t$ approximates the Dirac mass $\delta_0$.
[example:Fourier Transform Of A Gaussian]
The Fourier transform of $g_t$ is computed by completing the square in the exponent. Write
\begin{align*}
\hat{g}_t(u) &= (2\pi t)^{-d/2} \int_{\mathbb{R}^d} \exp\!\left(-\frac{|x|^2}{2t} + i\langle u, x\rangle\right) d\mathcal{L}^d(x).
\end{align*}
The exponent equals $-\frac{1}{2t}|x - itu|^2 - \frac{t|u|^2}{2}$, so the integral factors as $e^{-t|u|^2/2}$ times $\int \exp(-|z|^2/(2t))(2\pi t)^{-d/2}\,d\mathcal{L}^d(z) = 1$ (after the shift $z = x - itu$, justified by contour integration or Cauchy's theorem applied to each coordinate). Therefore
\begin{align*}
\hat{g}_t(u) = e^{-t|u|^2/2}.
\end{align*}
Gaussians are *self-dual* under the Fourier transform: the transform of a Gaussian is again a Gaussian, with variance parameter inverted ($t \leftrightarrow 1/t$, up to normalisation).
[/example]
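The closed form $\hat{g}_t(u) = e^{-t|u|^2/2}$ can be checked numerically in dimension $d = 1$. The sketch below approximates the defining integral with a midpoint rule on a window of twelve standard deviations; the window, step count, and sample values of $t$ and $u$ are assumptions made for illustration.

```python
import cmath
import math

def gaussian_density(x, t):
    """g_t(x) in dimension d = 1."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)

def ft_gaussian_numeric(u, t, half_width=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of g_t(x) e^{iux} dx."""
    scale = math.sqrt(t)
    a, b = -half_width * scale, half_width * scale   # 12 standard deviations each side
    h = (b - a) / steps
    total = 0.0 + 0.0j
    for k in range(steps):
        x = a + (k + 0.5) * h
        total += gaussian_density(x, t) * cmath.exp(1j * u * x)
    return total * h

t, u = 0.5, 1.7
approx = ft_gaussian_numeric(u, t)
exact = math.exp(-t * u * u / 2.0)
```

The imaginary part of `approx` vanishes up to rounding, as expected for the transform of an even real function.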
Since $\hat{g}_t(u) = e^{-t|u|^2/2} \in L^1(\mathbb{R}^d, \mathcal{L}^d)$, both $g_t$ and $\hat{g}_t$ are integrable, and the inversion formula can be verified directly for $g_t$.
### 6.3.1 The Inversion Formula
The key idea is that for any $f \in L^1$, the Gaussian convolution $f * g_t$ has Fourier transform $\hat{f}(u)\,e^{-t|u|^2/2}$ (by the [Convolution Theorem](/theorems/527)), which is integrable (being the product of a bounded function and a Gaussian). So inversion holds for $f * g_t$. As $t \to 0$, $f * g_t \to f$ in $L^1$ (since $g_t$ approximates $\delta_0$), and the [Dominated Convergence Theorem](/theorems/4) identifies the limit.
[quotetheorem:528]
The [Fourier Inversion Formula](/theorems/528) shows that $f$ and $\hat{f}$ determine each other (when both are integrable). It is the Fourier-analytic manifestation of the [Uniqueness of Measure Extension](/theorems/506): the values of $\hat{f}$ at all frequencies $u$ encode the same information as the values of $\int_A f \, d\mathcal{L}^d$ over all Borel sets $A$. A crucial consequence is *injectivity*: if $f, g \in L^1$ satisfy $\hat{f} = \hat{g}$, then $f = g$ $\mathcal{L}^d$-a.e. The proof introduces a Gaussian convergence factor $e^{-t|u|^2/2}$, rewrites the regularised integral as the spatial convolution $f * g_t$ via [Fubini](/theorems/513), and passes to the limit $t \to 0$ using the [Dominated Convergence Theorem](/theorems/4) and the approximation-to-the-identity property of Gaussians.
[citeproof:528]
## 6.4 The Fourier Transform on $L^2$
Many important functions lie in $L^2$ but not $L^1$: for example $x \mapsto \sin(x)/x$ on $\mathbb{R}$, which is (up to a constant factor) the Fourier transform of the indicator $\mathbb{1}_{[-1,1]}$. The Fourier transform is defined as an integral against $e^{i\langle u, x \rangle}$, so it requires *some* integrability. How can we extend the transform to $L^2$ functions whose defining integral may not converge?
The answer is the *Plancherel identity*, which shows that the Fourier transform preserves the $L^2$ norm (up to a constant). Since $L^1 \cap L^2$ is dense in $L^2$, the transform extends by continuity.
[quotetheorem:529]
The [Plancherel Identity](/theorems/529) is the bridge between $L^1$ and $L^2$ Fourier analysis. The proof uses Gaussian regularisation: for $f \in L^1 \cap L^2$, the smoothed function $f_t = f * g_t$ has $\hat{f}_t(u) = \hat{f}(u)\,e^{-t|u|^2/2}$ (by the [Convolution Theorem](/theorems/527)), and a Parseval-type computation — using the [Fourier Inversion Formula](/theorems/528) applied to $f_t * \overline{f_t(-\cdot)}$ — gives $\|f_t\|_2^2 = (2\pi)^{-d}\|\hat{f}_t\|_2^2$. The left side converges to $\|f\|_2^2$ (since $f * g_t \to f$ in $L^2$), and the right side converges to $(2\pi)^{-d}\|\hat{f}\|_2^2$ by the [Monotone Convergence Theorem](/theorems/509). The extension to all of $L^2$ follows from density of $L^1 \cap L^2$ and completeness (Riesz–Fischer).
[citeproof:529]
The Plancherel identity implies that the map $f \mapsto (2\pi)^{-d/2}\hat{f}$ is an *isometry* from $L^1 \cap L^2$ (with the $L^2$ norm) into $L^2$. Since $L^1 \cap L^2$ is dense in $L^2$ and $L^2$ is complete (by the [Riesz–Fischer theorem](/page/Inequalities%20and%20%24L%5Ep%24%20Spaces)), this isometry extends uniquely to a *unitary operator* $\mathcal{F}: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ satisfying $\|\mathcal{F}(f)\|_2 = \|f\|_2$ for all $f \in L^2$. Unitarity means that the Fourier transform is a bijection on $L^2$ that preserves inner products: $\langle \mathcal{F}(f), \mathcal{F}(g) \rangle = \langle f, g \rangle$.
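As a sanity check of the normalisation constant (a worked instance, not part of the proof), take $f = g_t$, which lies in $L^1 \cap L^2$ and has $\hat{g}_t(u) = e^{-t|u|^2/2}$. Using the standard Gaussian integral $\int_{\mathbb{R}^d} e^{-c|x|^2}\,d\mathcal{L}^d(x) = (\pi/c)^{d/2}$ for $c > 0$, both sides of the identity can be computed explicitly:
\begin{align*}
\|g_t\|_2^2 &= (2\pi t)^{-d}\int_{\mathbb{R}^d} e^{-|x|^2/t}\,d\mathcal{L}^d(x) = (2\pi t)^{-d}(\pi t)^{d/2} = (4\pi t)^{-d/2}, \\
(2\pi)^{-d}\|\hat{g}_t\|_2^2 &= (2\pi)^{-d}\int_{\mathbb{R}^d} e^{-t|u|^2}\,d\mathcal{L}^d(u) = (2\pi)^{-d}\left(\frac{\pi}{t}\right)^{d/2} = (4\pi t)^{-d/2}.
\end{align*}
The two expressions agree, consistent with the factor $(2\pi)^{-d}$ that accompanies the convention $\hat{f}(u) = \int f(x)\,e^{i\langle u, x\rangle}\,d\mathcal{L}^d(x)$.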
# 7. Characteristic Functions and Gaussian Random Variables
## 7.1 Characteristic Functions and Convergence in Distribution
We now turn to the probabilistic applications. The Fourier transform of a probability distribution — the characteristic function — is the primary tool for studying weak convergence.
### 7.1.1 Characteristic Functions Determine Distributions
The inversion formula has a direct probabilistic consequence: two random variables with the same characteristic function must have the same distribution. This is the characteristic function analogue of the fact that a measure is determined by its values on a $\pi$-system ([Uniqueness of Measure Extension](/theorems/506)).
[quotetheorem:530]
The proof of [Uniqueness of Characteristic Functions](/theorems/530) uses Gaussian smoothing. If $\phi_X = \phi_Y$, then for any bounded continuous $h$, the smoothed expectation $\mathbb{E}[h(X + \sqrt{t}\,Z)]$ (with $Z \sim N(0, I_d)$ independent) can be expressed via the [Fourier Inversion Formula](/theorems/528) and [Convolution Theorem](/theorems/527) as an integral involving $\phi_X$. Since $\phi_X = \phi_Y$, the smoothed expectations agree. Taking $t \to 0$ and applying the [Dominated Convergence Theorem](/theorems/4) gives $\mathbb{E}[h(X)] = \mathbb{E}[h(Y)]$ for all bounded continuous $h$, which forces $\mu_X = \mu_Y$ because bounded continuous functions determine a finite Borel measure on $\mathbb{R}^d$.
A further consequence: if $\phi_X \in L^1(\mathbb{R}^d, \mathcal{L}^d)$, then the [Fourier Inversion Formula](/theorems/528) gives an explicit density for $\mu_X$:
\begin{align*}
f_X(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \phi_X(u)\,e^{-i\langle u, x \rangle}\,d\mathcal{L}^d(u),
\end{align*}
which is bounded and continuous (as the Fourier transform of an $L^1$ function, by the [Riemann–Lebesgue Lemma](/theorems/526)).
[citeproof:530]
### 7.1.2 Lévy's Continuity Theorem
With uniqueness established, the natural question is: does pointwise convergence of characteristic functions imply convergence of distributions? This is the content of Lévy's theorem, and it is the main reason characteristic functions are so useful in probability — it reduces the difficult question of [weak convergence](/page/Weak%20Convergence) of measures to the simpler question of pointwise convergence of functions.
[quotetheorem:519]
The power of [Lévy's Continuity Theorem](/theorems/519) is that it asks only for *pointwise* convergence of $\phi_{X_n}$ to a function that is already known to be a characteristic function. The proof has three stages: first, a Fourier-analytic bound shows that characteristic function convergence implies tightness of the laws (so subsequential limits exist); second, the [Dominated Convergence Theorem](/theorems/4) identifies the characteristic function of any subsequential limit as $\phi_X$; third, [Uniqueness of Characteristic Functions](/theorems/530) forces all subsequential limits to agree. The continuity of $\phi_X$ at the origin is built into the hypothesis (every characteristic function is continuous, with $\phi_X(0) = 1$), and this is precisely what drives the tightness estimate.
[citeproof:519]
[example:Convergence Of Poisson To Gaussian]
Let $X_n \sim \operatorname{Poisson}(n)$ and define $Y_n = (X_n - n)/\sqrt{n}$ (centred and scaled). The characteristic function of $X_n$ is $\phi_{X_n}(u) = \exp(n(e^{iu} - 1))$, so
\begin{align*}
\phi_{Y_n}(u) = \mathbb{E}[e^{iu(X_n - n)/\sqrt{n}}] = e^{-iun/\sqrt{n}} \cdot \exp\!\left(n(e^{iu/\sqrt{n}} - 1)\right).
\end{align*}
Taylor-expanding $e^{iu/\sqrt{n}} = 1 + iu/\sqrt{n} - u^2/(2n) + O(n^{-3/2})$ gives $n(e^{iu/\sqrt{n}} - 1) = iu\sqrt{n} - u^2/2 + O(n^{-1/2})$, so
\begin{align*}
\phi_{Y_n}(u) = \exp\!\left(-u^2/2 + O(n^{-1/2})\right) \to e^{-u^2/2} = \phi_Z(u),
\end{align*}
where $Z \sim N(0,1)$. By [Lévy's Continuity Theorem](/theorems/519), $Y_n \to Z$ in distribution. This is a special case of the Central Limit Theorem.
[/example]
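The convergence in this example can be observed numerically by evaluating the exact formula for $\phi_{Y_n}$ at a fixed frequency. The sketch below uses illustrative function names and an arbitrary test frequency $u = 1.3$; the error should shrink roughly like $n^{-1/2}$, in line with the Taylor remainder.

```python
import cmath
import math

def phi_poisson_standardised(u, n):
    """Characteristic function of Y_n = (X_n - n)/sqrt(n) with X_n ~ Poisson(n)."""
    s = u / math.sqrt(n)
    return cmath.exp(-1j * u * math.sqrt(n)) * cmath.exp(n * (cmath.exp(1j * s) - 1.0))

def phi_normal(u):
    """Characteristic function of N(0, 1)."""
    return math.exp(-u * u / 2.0)

u = 1.3
# Distance between the two characteristic functions for growing n.
errors = [abs(phi_poisson_standardised(u, n) - phi_normal(u)) for n in (10, 100, 1000, 10000)]
```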
## 7.2 Gaussian Random Variables
Gaussians are the prototypical "well-behaved" distributions: they are completely determined by their first two moments, closed under affine transformations, and uncorrelated Gaussian components are independent. These properties all follow from the characteristic function.
[definition:Gaussian Random Vector]
A random vector $X$ on $\mathbb{R}^d$ is *Gaussian* (or *normally distributed*) if $\langle u, X \rangle$ is a univariate Gaussian random variable for every $u \in \mathbb{R}^d$.
[/definition]
This definition includes the degenerate case $\langle u, X \rangle = \text{const}$ (a Gaussian with variance $0$). The virtue of this definition is that it makes no reference to densities — the distribution may be singular (e.g., supported on a lower-dimensional affine subspace).
### 7.2.1 Characterisation via Characteristic Functions
If $X$ is Gaussian with mean $m = \mathbb{E}[X] \in \mathbb{R}^d$ and covariance matrix $V = \operatorname{Cov}(X) \in \mathbb{R}^{d \times d}$ (where $V_{jk} = \operatorname{Cov}(X_j, X_k)$), then for each $u$ the scalar $\langle u, X \rangle$ is a univariate Gaussian with mean $\langle u, m \rangle$ and variance $\langle u, Vu \rangle$, so its characteristic function is $s \mapsto \exp(is\langle u, m \rangle - \frac{1}{2}s^2\langle u, Vu \rangle)$. Evaluating at $s = 1$:
\begin{align*}
\phi_X(u) = \mathbb{E}[e^{i\langle u, X\rangle}] = \exp\!\left(i\langle u, m \rangle - \frac{1}{2}\langle u, Vu \rangle\right).
\end{align*}
By [Uniqueness of Characteristic Functions](/theorems/530), the law of $X$ is determined entirely by $m$ and $V$. We write $X \sim N(m, V)$.
### 7.2.2 Key Properties
*Closure under affine maps.* If $X \sim N(m, V)$ and $A \in \mathbb{R}^{k \times d}$, $b \in \mathbb{R}^k$, then $AX + b$ is Gaussian: for any $u \in \mathbb{R}^k$, $\langle u, AX + b \rangle = \langle A^\top u, X \rangle + \langle u, b \rangle$ is Gaussian (as an affine function of the Gaussian scalar $\langle A^\top u, X \rangle$). Moreover, $AX + b \sim N(Am + b, AVA^\top)$.
*Uncorrelated implies independent.* Suppose $X = (X_1, X_2)$ is Gaussian with $\operatorname{Cov}(X_1, X_2) = 0$, meaning the covariance matrix $V$ is block-diagonal: $V = \begin{pmatrix} V_1 & 0 \\ 0 & V_2 \end{pmatrix}$. Then for $u = (u_1, u_2)$,
\begin{align*}
\phi_X(u) = \exp\!\left(i\langle u, m \rangle - \tfrac{1}{2}\langle u, Vu \rangle\right) = \exp\!\left(i\langle u_1, m_1\rangle - \tfrac{1}{2}\langle u_1, V_1 u_1\rangle\right) \cdot \exp\!\left(i\langle u_2, m_2\rangle - \tfrac{1}{2}\langle u_2, V_2 u_2\rangle\right) = \phi_{X_1}(u_1)\,\phi_{X_2}(u_2).
\end{align*}
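The right-hand side $\phi_{X_1}(u_1)\phi_{X_2}(u_2)$ is the characteristic function of the product measure $\mu_{X_1} \otimes \mu_{X_2}$, so by [Uniqueness of Characteristic Functions](/theorems/530) the joint law is the product of the marginals, and $X_1$ and $X_2$ are independent. This is special to jointly Gaussian vectors: for general random vectors, uncorrelatedness does not imply independence.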
*Density (non-degenerate case).* If $V$ is invertible, the [Fourier Inversion Formula](/theorems/528) gives the density of $X \sim N(m, V)$:
\begin{align*}
f_X(x) = (2\pi)^{-d/2}(\det V)^{-1/2}\,\exp\!\left(-\frac{1}{2}\langle x - m, V^{-1}(x - m)\rangle\right).
\end{align*}
[example:Standard Normal Characteristic Function And Moments]
For $X \sim N(0, 1)$, the characteristic function is $\phi_X(u) = e^{-u^2/2}$. Expanding in a [power series](/page/Power%20Series): $\phi_X(u) = \sum_{k=0}^\infty (-u^2/2)^k / k! = \sum_{k=0}^\infty (-1)^k u^{2k}/(2^k k!)$. On the other hand, $\phi_X(u) = \sum_{n=0}^\infty \mathbb{E}[X^n](iu)^n/n!$ when moments exist. Matching coefficients of $u^{2k}$:
\begin{align*}
\frac{\mathbb{E}[X^{2k}]\,(i)^{2k}}{(2k)!} = \frac{(-1)^k}{2^k\,k!}, \quad \text{so} \quad \mathbb{E}[X^{2k}] = \frac{(2k)!}{2^k\,k!} = (2k-1)!!,
\end{align*}
and all odd moments vanish by symmetry. This recovers the classical double-factorial formula for Gaussian moments directly from the characteristic function.
[/example]
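The moment formula in the example above can be sanity-checked with exact integer arithmetic. The sketch below compares $(2k)!/(2^k k!)$ with the product $1 \cdot 3 \cdots (2k-1)$. The function names are hypothetical.

```python
from math import factorial

def gaussian_even_moment(k: int) -> int:
    """E[X^(2k)] for X ~ N(0,1), from matching coefficients: (2k)! / (2^k k!)."""
    return factorial(2 * k) // (2 ** k * factorial(k))

def double_factorial_odd(k: int) -> int:
    """(2k-1)!! = 1 * 3 * 5 * ... * (2k-1), with the empty product equal to 1."""
    result = 1
    for j in range(1, 2 * k, 2):
        result *= j
    return result

# E[X^0], E[X^2], E[X^4], E[X^6], E[X^8], E[X^10]
moments = [gaussian_even_moment(k) for k in range(6)]
print(moments)
```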
[problem]
Let $X_1, X_2, \dots$ be independent random variables with $X_k \sim N(0, 1/k^2)$. Define $S_n = \sum_{k=1}^n X_k$. Use characteristic functions to show that $S_n$ converges in distribution as $n \to \infty$, identify the limit, and determine whether the limit has a density.
[/problem]
[solution]
**Step 1 (Characteristic function of $S_n$).** Since $X_1, \dots, X_n$ are independent, the characteristic function of $S_n$ factors as a product (by the [Convolution Theorem](/theorems/527)). Each $X_k \sim N(0, 1/k^2)$ has $\phi_{X_k}(u) = \exp(-u^2/(2k^2))$. Therefore
\begin{align*}
\phi_{S_n}(u) = \prod_{k=1}^n \phi_{X_k}(u) = \prod_{k=1}^n \exp\!\left(-\frac{u^2}{2k^2}\right) = \exp\!\left(-\frac{u^2}{2}\sum_{k=1}^n \frac{1}{k^2}\right).
\end{align*}
**Step 2 (Pointwise limit).** The series $\sum_{k=1}^\infty 1/k^2 = \pi^2/6$ converges. As $n \to \infty$,
\begin{align*}
\phi_{S_n}(u) \to \exp\!\left(-\frac{u^2}{2} \cdot \frac{\pi^2}{6}\right) = \exp\!\left(-\frac{\pi^2 u^2}{12}\right)
\end{align*}
for every $u \in \mathbb{R}$. The limit is the characteristic function of $N(0, \pi^2/6)$.
**Step 3 (Apply Lévy's theorem).** By [Lévy's Continuity Theorem](/theorems/519), since $\phi_{S_n}(u) \to \phi_S(u) = \exp(-\pi^2 u^2/12)$ pointwise and the limit is a characteristic function (of $S \sim N(0, \pi^2/6)$), we conclude $S_n \to S$ in distribution, where $S \sim N(0, \pi^2/6)$.
**Step 4 (Density).** The limit characteristic function $\phi_S(u) = e^{-\pi^2 u^2/12}$ is a Gaussian, which is integrable: $\int_{\mathbb{R}} e^{-\pi^2 u^2/12}\,d\mathcal{L}^1(u) < \infty$. By the [Fourier Inversion Formula](/theorems/528), $S$ has a bounded continuous density:
\begin{align*}
f_S(x) = \frac{1}{2\pi}\int_{\mathbb{R}} e^{-\pi^2 u^2/12}\,e^{-iux}\,d\mathcal{L}^1(u) = \frac{1}{\pi}\sqrt{\frac{3}{\pi}}\,\exp\!\left(-\frac{3x^2}{\pi^2}\right),
\end{align*}
which is the $N(0, \pi^2/6)$ density, confirming the answer from the characteristic function calculation.
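As a numerical cross-check of Steps 2 and 4, the following sketch compares the partial variances with $\pi^2/6$ and approximates the inversion integral by a midpoint rule. The cutoff and step count are arbitrary choices made for this illustration, not part of the solution.

```python
import math

def partial_variance(n: int) -> float:
    """Var(S_n) = sum_{k=1}^n 1/k^2."""
    return sum(1.0 / (k * k) for k in range(1, n + 1))

def density_closed_form(x: float) -> float:
    """The N(0, pi^2/6) density from Step 4."""
    return (1.0 / math.pi) * math.sqrt(3.0 / math.pi) * math.exp(-3.0 * x * x / math.pi ** 2)

def density_by_inversion(x: float, cutoff: float = 12.0, steps: int = 24000) -> float:
    """Midpoint-rule approximation of (1/2pi) * int exp(-pi^2 u^2/12) e^{-iux} du.

    The Gaussian factor is even in u, so only the cosine part of e^{-iux} contributes.
    """
    h = 2.0 * cutoff / steps
    total = 0.0
    for j in range(steps):
        u = -cutoff + (j + 0.5) * h
        total += math.exp(-math.pi ** 2 * u * u / 12.0) * math.cos(u * x)
    return total * h / (2.0 * math.pi)

limit_variance = math.pi ** 2 / 6
gap = limit_variance - partial_variance(10_000)   # tail of the series, roughly 1/n
print("variance gap at n = 10000:", gap)
for x in (0.0, 0.5, 2.0):
    print(x, density_closed_form(x), density_by_inversion(x))
```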
[/solution]

---

The convergence theorems developed in [Integration](/page/Integration) and the $L^p$ framework of [Inequalities and $L^p$ Spaces](/page/Inequalities%20and%20%24L%5Ep%24%20Spaces) govern the behaviour of sequences of functions. But what happens when the *same* function is observed repeatedly along the orbit of a dynamical system? If a transformation $\Theta$ acts on a space $E$ and we measure an observable $f$ at times $0, 1, 2, \dots$, recording the values $f(x), f(\Theta(x)), f(\Theta^2(x)), \dots$, does the time average $\frac{1}{n}\sum_{k=0}^{n-1} f(\Theta^k(x))$ converge, and if so, to what?
This is the central question of ergodic theory. The answer, given by *Birkhoff's ergodic theorem*, is that the time average converges almost surely to the conditional expectation of $f$ given the invariant $\sigma$-algebra — and when the system is *ergodic* (has no non-trivial invariant sets), the limit is the spatial average $\frac{1}{\mu(E)}\int f \, d\mu$. This connects deterministic dynamics to probabilistic averaging and provides, as a direct corollary, the Strong Law of Large Numbers for i.i.d. sequences.
# 8. Ergodic Theory
## 8.1 Measure-Preserving Transformations
The first requirement is that the dynamics do not distort the measure. Without this, the statistical properties of $f$ along an orbit could bear no relation to the global integral $\int f \, d\mu$.
[definition:Measure Preserving Transformation]
Let $(E, \mathcal{E}, \mu)$ be a measure space. A measurable map $\Theta: E \to E$ is *measure preserving* if
\begin{align*}
\mu(\Theta^{-1}(A)) = \mu(A)
\end{align*}
for all $A \in \mathcal{E}$.
[/definition]
The definition uses *preimages* $\Theta^{-1}(A) = \{x \in E : \Theta(x) \in A\}$, not images. This is natural because preimages commute with all set operations (unions, intersections, complements), making $\Theta^{-1}$ a $\sigma$-algebra homomorphism. The transformation need not be injective or surjective.
An immediate consequence of measure preservation is that integration is invariant: for any $f \in L^1(\mu)$, the change of variables formula gives
\begin{align*}
\int_E f \circ \Theta \, d\mu = \int_E f \, d(\mu \circ \Theta^{-1}) = \int_E f \, d\mu.
\end{align*}
This means the spatial average of $f$ is unchanged when we advance the system by one time step — a necessary condition for the time average to converge to the spatial average.
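A concrete map makes both points tangible. The sketch below uses the doubling map $\Theta(x) = 2x \bmod 1$ on $[0,1)$, which is an illustrative example not taken from the notes. It is not injective, preimages of intervals keep their length while images need not, and a midpoint Riemann sum suggests $\int f \circ \Theta = \int f$ for the test function $f(x) = x^2$ (also an arbitrary choice).

```python
from fractions import Fraction

def doubling(x: Fraction) -> Fraction:
    """Theta(x) = 2x mod 1 on [0, 1)."""
    return (2 * x) % 1

def preimage_measure(a: Fraction, b: Fraction) -> Fraction:
    """Lebesgue measure of Theta^{-1}([a, b)) = [a/2, b/2) U [a/2 + 1/2, b/2 + 1/2)."""
    return (b / 2 - a / 2) + (b / 2 - a / 2)

def midpoint_integral(g, n: int = 1000) -> Fraction:
    """Midpoint Riemann sum of g over [0, 1) with n equal cells (exact arithmetic)."""
    return sum(g(Fraction(2 * j + 1, 2 * n)) for j in range(n)) / n

def f(x: Fraction) -> Fraction:
    return x * x

a, b = Fraction(0), Fraction(1, 4)
pre = preimage_measure(a, b)       # measure of the preimage of [0, 1/4)
image_measure = Fraction(1, 2)     # the image Theta([0, 1/4)) is [0, 1/2)

lhs = midpoint_integral(lambda x: f(doubling(x)))   # approximates int f(Theta(x)) dx
rhs = midpoint_integral(f)                          # approximates int f(x) dx
print(pre, image_measure, float(lhs), float(rhs))
```

The preimage of $[0, 1/4)$ has measure $1/4$, whereas its image has measure $1/2$, which is why the definition is phrased with preimages.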
## 8.2 Invariant Sets, Invariant Functions, and Ergodicity
A measure-preserving transformation partitions the space into regions that are "mixed" by the dynamics and regions that are preserved. To study the long-run behaviour of time averages, we need to understand which parts of the space are left unchanged.
[definition:Invariant Set]
A set $A \in \mathcal{E}$ is *$\Theta$-invariant* if $\Theta^{-1}(A) = A$. The collection $\mathcal{E}_\Theta = \{A \in \mathcal{E} : \Theta^{-1}(A) = A\}$ is a sub-$\sigma$-algebra of $\mathcal{E}$, called the *invariant $\sigma$-algebra*.
[/definition]
[definition:Invariant Function]
A measurable function $f: E \to \mathbb{R}$ is *$\Theta$-invariant* if $f = f \circ \Theta$ $\mu$-a.e.
[/definition]
Invariant functions are precisely the $\mathcal{E}_\Theta$-measurable functions. The forward direction is immediate: if $f$ is $\mathcal{E}_\Theta$-measurable, then the level sets $\{f \leq c\}$ are invariant, so $\{f \circ \Theta \leq c\} = \Theta^{-1}(\{f \leq c\}) = \{f \leq c\}$, giving $f \circ \Theta = f$. The converse uses the [Generator Criterion for Measurability](/theorems/525): if $f \circ \Theta = f$ a.e., then $\{f \leq c\} = \{f \circ \Theta \leq c\} = \Theta^{-1}(\{f \leq c\})$ up to a null set, and one can modify $f$ on a null set to make it exactly $\mathcal{E}_\Theta$-measurable.
If the invariant $\sigma$-algebra is trivial — every invariant set has measure $0$ or full measure — then the system is "indecomposable" and every invariant function is a.e. constant. This is the key dynamical property:
[definition:Ergodic Transformation]
A measure-preserving transformation $\Theta$ on $(E, \mathcal{E}, \mu)$ is *ergodic* if every $A \in \mathcal{E}_\Theta$ satisfies $\mu(A) = 0$ or $\mu(A^c) = 0$.
[/definition]
Ergodicity says the dynamics cannot be decomposed into two non-trivial invariant pieces. Every invariant function must be a.e. constant: if $f = f \circ \Theta$ a.e. and $f$ is not a.e. constant, then for some $c$, the invariant set $\{f \leq c\}$ would have $0 < \mu(\{f \leq c\}) < \mu(E)$, contradicting ergodicity.
[example:Irrational Rotation On The Circle]
Let $E = [0,1)$ with Lebesgue measure $\mathcal{L}^1$ and define $\Theta_\alpha(x) = x + \alpha \bmod 1$ for $\alpha \in [0,1)$. This is measure preserving for all $\alpha$ (translation invariance of $\mathcal{L}^1$, reduced modulo $1$).
*Rational case.* If $\alpha = p/q$ in lowest terms, then $\Theta_\alpha^q = \text{id}$, so every orbit is periodic with period dividing $q$. The set $A = \bigcup_{k=0}^{q-1} [k/q, k/q + 1/(2q))$ satisfies $\Theta_\alpha^{-1}(A) = A$, because translation by $p/q$ permutes the $q$ intervals, and $\mathcal{L}^1(A) = 1/2$. So $A$ is a non-trivial invariant set, and the system is not ergodic.
*Irrational case.* If $\alpha \notin \mathbb{Q}$, the orbit $\{x, x + \alpha, x + 2\alpha, \dots\}$ (mod $1$) is dense in $[0,1)$ for every $x$ (by Weyl's equidistribution theorem). If $A$ is invariant, then $\mathbb{1}_A = \mathbb{1}_A \circ \Theta_\alpha$ a.e., so $\hat{\mathbb{1}}_A(n) = e^{2\pi i n \alpha}\hat{\mathbb{1}}_A(n)$ for every $n \in \mathbb{Z}$ (taking Fourier coefficients). Since $\alpha$ is irrational, $e^{2\pi i n \alpha} \neq 1$ for $n \neq 0$, forcing $\hat{\mathbb{1}}_A(n) = 0$ for all $n \neq 0$. By uniqueness of Fourier coefficients, $\mathbb{1}_A$ is a.e. constant, so $\mu(A) = 0$ or $\mu(A) = 1$. Thus $\Theta_\alpha$ is ergodic.
[/example]
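The failure of ergodicity for rational $\alpha$ can be checked with exact rational arithmetic. The sketch below takes $\alpha = 2/5$ and the set $A = \bigcup_{k=0}^{4} [k/5, k/5 + 1/10)$, one convenient choice of invariant set of measure $1/2$. The grid and the two starting points are arbitrary illustrative choices.

```python
from fractions import Fraction

def rotate(x: Fraction, alpha: Fraction) -> Fraction:
    """Theta_alpha(x) = x + alpha mod 1."""
    return (x + alpha) % 1

def in_A(x: Fraction, q: int) -> bool:
    """Membership in A = union over k of [k/q, k/q + 1/(2q))."""
    return (x * q) % 1 < Fraction(1, 2)

def time_average(x: Fraction, alpha: Fraction, q: int, n: int) -> Fraction:
    """(1/n) * sum_{k<n} 1_A(Theta^k x)."""
    hits, point = 0, x
    for _ in range(n):
        hits += in_A(point, q)
        point = rotate(point, alpha)
    return Fraction(hits, n)

alpha, q = Fraction(2, 5), 5
grid = [Fraction(j, 1000) for j in range(1000)]

# x lies in A exactly when Theta(x) lies in A, i.e. Theta^{-1}(A) = A on the grid.
invariant_on_grid = all(in_A(x, q) == in_A(rotate(x, alpha), q) for x in grid)

avg_inside = time_average(Fraction(1, 100), alpha, q, 50)     # orbit starts in A
avg_outside = time_average(Fraction(15, 100), alpha, q, 50)   # orbit starts outside A
print(invariant_on_grid, avg_inside, avg_outside)
```

The time averages of $\mathbb{1}_A$ are $1$ or $0$ depending on the starting point, never the spatial average $1/2$.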
## 8.3 The Canonical Space and the Shift Map
The most important application of ergodic theory to probability uses the *shift map* on the space of sequences. This is the mechanism that converts the Strong Law of Large Numbers into a special case of Birkhoff's theorem.
Let $m$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Define the *canonical space*:
\begin{align*}
E = \mathbb{R}^{\mathbb{N}}, \quad \mathcal{E} = \mathcal{B}(\mathbb{R})^{\otimes \mathbb{N}}, \quad \mu = m^{\otimes \mathbb{N}},
\end{align*}
where $\mu$ is the infinite product measure (constructed via the [Carathéodory Extension Theorem](/theorems/522) and Kolmogorov's extension theorem). The coordinate projections $\pi_k(\omega) = \omega_k$ are i.i.d. random variables with common law $m$.
[definition:Shift Map]
The *shift map* $\theta: \mathbb{R}^{\mathbb{N}} \to \mathbb{R}^{\mathbb{N}}$ is defined by $\theta(\omega_1, \omega_2, \omega_3, \dots) = (\omega_2, \omega_3, \omega_4, \dots)$.
[/definition]
The shift is measurable (since $\pi_k \circ \theta = \pi_{k+1}$ is measurable for each $k$, and the coordinate projections generate $\mathcal{E}$). It is measure preserving because for any cylinder set $A = \{\omega : (\omega_{i_1}, \dots, \omega_{i_r}) \in B\}$, the preimage $\theta^{-1}(A) = \{\omega : (\omega_{i_1+1}, \dots, \omega_{i_r+1}) \in B\}$ has the same $\mu$-measure (since the $\pi_k$ are i.i.d.), and cylinder sets form a $\pi$-system generating $\mathcal{E}$, so the [Uniqueness of Measure Extension](/theorems/506) gives $\mu \circ \theta^{-1} = \mu$.
The shift is ergodic: any $\theta$-invariant set $A$ satisfies $A = \theta^{-1}(A) = \theta^{-2}(A) = \cdots$, so $A$ is a *tail event* (it does not depend on any finite collection of coordinates). By the [Kolmogorov Zero–One Law](/theorems/512), $\mu(A) = 0$ or $\mu(A) = 1$.
The key connection to probability: for $f = \pi_1$ (the first coordinate), $f \circ \theta^k = \pi_{k+1}$, so
\begin{align*}
\frac{S_n(f)}{n} = \frac{1}{n}\sum_{k=0}^{n-1} f \circ \theta^k = \frac{1}{n}\sum_{k=1}^n \pi_k = \frac{X_1 + \cdots + X_n}{n},
\end{align*}
the sample average of the first $n$ i.i.d. random variables. Birkhoff's theorem applied to $(\mathbb{R}^{\mathbb{N}}, \theta, \mu)$ will therefore give the Strong Law directly.
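The identity between ergodic averages of $\pi_1$ along the shift and sample means can be made concrete on a finite truncation of a sequence. This is a minimal sketch: the sample values are hypothetical, and a finite tuple only stands in for a point of $\mathbb{R}^{\mathbb{N}}$.

```python
from typing import Callable, Tuple

Omega = Tuple[float, ...]

def shift(omega: Omega) -> Omega:
    """theta(w1, w2, w3, ...) = (w2, w3, ...), on a finite truncation."""
    return omega[1:]

def first_coordinate(omega: Omega) -> float:
    """The observable f = pi_1."""
    return omega[0]

def ergodic_average(f: Callable[[Omega], float], omega: Omega, n: int) -> float:
    """(1/n) * sum_{k=0}^{n-1} f(theta^k(omega))."""
    total, current = 0.0, omega
    for _ in range(n):
        total += f(current)
        current = shift(current)
    return total / n

omega = (2.0, -1.0, 4.0, 0.5, 3.5, 1.0)   # hypothetical values of X_1, ..., X_6
n = 4
avg = ergodic_average(first_coordinate, omega, n)
sample_mean = sum(omega[:n]) / n
print(avg, sample_mean)
```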
## 8.4 Birkhoff's Ergodic Theorem
The central result of ergodic theory asserts that time averages converge almost surely. The proof is technically demanding — it relies on a *maximal ergodic inequality* that bounds the measure of the set where partial ergodic sums are large.
[quotetheorem:518]
[Birkhoff's theorem](/theorems/518) is the dynamical systems analogue of the Strong Law of Large Numbers, but far more general: it applies to *any* measure-preserving system, not just i.i.d. sequences. The limit $\bar{f}$ is the conditional expectation $\mathbb{E}[f \mid \mathcal{E}_\Theta]$ (though we do not prove this identification here — it follows from the integrability bound $\int |\bar{f}| \leq \int |f|$ and the fact that $\int_A \bar{f} \, d\mu = \int_A f \, d\mu$ for all invariant $A$). In the ergodic case, the invariant $\sigma$-algebra is trivial, so the conditional expectation is the constant $\frac{1}{\mu(E)}\int f \, d\mu$.
The proof has two main components. The *maximal ergodic lemma* shows that on the set where $\max_{1 \leq k \leq n} S_k(f) > 0$, the integral of $f$ is non-negative — a surprising "positivity on positivity" result whose proof uses a telescoping identity and measure preservation. From this, one derives a *maximal inequality* bounding $\mu(\{\sup_n S_n(f)/n > \alpha\})$, which controls the fluctuations of the Cesàro averages. The a.e. convergence then follows by showing that the set $\{\liminf S_n(f)/n < \alpha < \beta < \limsup S_n(f)/n\}$ has measure zero for all rational $\alpha < \beta$.
[citeproof:518]
### 8.4.1 The Strong Law of Large Numbers as a Corollary
Applying Birkhoff's theorem to the ergodic shift $\theta$ on $(\mathbb{R}^{\mathbb{N}}, m^{\otimes \mathbb{N}})$ with $f = \pi_1$: since $\theta$ is ergodic and $\mu(E) = 1$ (probability space), the theorem gives
\begin{align*}
\frac{X_1 + \cdots + X_n}{n} = \frac{S_n(\pi_1)}{n} \to \int_{\mathbb{R}^{\mathbb{N}}} \pi_1 \, d\mu = \mathbb{E}[X_1] \quad \text{a.s.},
\end{align*}
provided $\mathbb{E}[|X_1|] < \infty$ (so that $\pi_1 \in L^1(\mu)$). This is the Strong Law of Large Numbers for i.i.d. random variables.
## 8.5 Von Neumann's Mean Ergodic Theorem
Birkhoff's theorem gives pointwise (almost sure) convergence of ergodic averages. In many applications — particularly in functional analysis and spectral theory — one needs convergence in the $L^p$ norm instead. Von Neumann's theorem provides this, and its proof is considerably simpler than Birkhoff's, using Hilbert space geometry rather than maximal inequalities.
[theorem:Von Neumann Mean Ergodic Theorem]
Let $(E, \mathcal{E}, \mu)$ be a finite measure space, $\Theta: E \to E$ measure preserving, and $f \in L^2(\mu)$. Then
\begin{align*}
\frac{S_n(f)}{n} \to \bar{f} \quad \text{in } L^2(\mu),
\end{align*}
where $\bar{f}$ is the orthogonal projection of $f$ onto the closed subspace $\{g \in L^2(\mu) : g \circ \Theta = g \text{ a.e.}\}$ of $\Theta$-invariant $L^2$ functions.
[/theorem]
The proof uses the decomposition $L^2 = \overline{\operatorname{ran}(I - U)} \oplus \ker(I - U)$, where $U: L^2 \to L^2$ is the isometry $Uf = f \circ \Theta$. For $f \in \ker(I - U)$, $f$ is invariant and $S_n(f)/n = f$. For $f = g - g \circ \Theta \in \operatorname{ran}(I - U)$, the ergodic average telescopes: $S_n(f)/n = (g - g \circ \Theta^n)/n \to 0$ in $L^2$ since $\|g \circ \Theta^n\|_2 = \|g\|_2$ (measure preservation). The general case follows by density and the triangle inequality.
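The telescoping step can be written out in full. As a sketch, for $f = g - g \circ \Theta$ with $g \in L^2(\mu)$ and $S_n(f) = \sum_{k=0}^{n-1} f \circ \Theta^k$:
\begin{align*}
S_n(f) &= \sum_{k=0}^{n-1} \left(g \circ \Theta^k - g \circ \Theta^{k+1}\right) = g - g \circ \Theta^n, \\
\left\|\frac{S_n(f)}{n}\right\|_2 &\leq \frac{\|g\|_2 + \|g \circ \Theta^n\|_2}{n} = \frac{2\|g\|_2}{n} \to 0.
\end{align*}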
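The $L^p$ version for $1 \leq p < \infty$ and $f \in L^p(\mu)$ combines Birkhoff's pointwise convergence with a uniform integrability argument. We have $S_n(f)/n \to \bar{f}$ a.e. and $|S_n(f)/n|^p \leq (\frac{1}{n}\sum_{k=0}^{n-1} |f| \circ \Theta^k)^p \leq \frac{1}{n}\sum_{k=0}^{n-1} (|f| \circ \Theta^k)^p$ (by [Jensen's inequality](/theorems/515) with the convex function $t \mapsto |t|^p$). The right-hand side is not a single dominating function, so the [Dominated Convergence Theorem](/theorems/4) does not apply directly. Instead, measure preservation shows that these averages form a uniformly integrable family, and on a finite measure space a.e. convergence together with uniform integrability gives $L^p$ convergence.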
[example:Ergodic Average Of A Step Function]
Let $\Theta = \Theta_\alpha$ be the irrational rotation on $([0,1), \mathcal{B}, \mathcal{L}^1)$ with $\alpha = (\sqrt{5} - 1)/2$ (the golden ratio minus 1). Take $f = \mathbb{1}_{[0, 1/3)}$. By Birkhoff's theorem (with ergodicity of irrational rotations),
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n-1} \mathbb{1}_{[0,1/3)}(x + k\alpha \bmod 1) \to \int_0^1 \mathbb{1}_{[0,1/3)} \, d\mathcal{L}^1 = \frac{1}{3} \quad \text{for a.e. } x.
\end{align*}
The left side counts the fraction of the first $n$ orbit points $x, x + \alpha, \dots, x + (n-1)\alpha$ (mod $1$) that fall in $[0, 1/3)$. Birkhoff's theorem says this fraction converges to $1/3$, the length of the interval, for almost every starting point $x$. Weyl's equidistribution theorem gives the stronger statement that, for irrational $\alpha$, the convergence holds for every $x$.
[/example]
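The convergence in this example is easy to observe numerically. The sketch below computes visit frequencies in floating point for one starting point. The choice $x = 0.2$ is arbitrary, and a simulation of a single orbit illustrates the almost-everywhere statement rather than proving it.

```python
import math

ALPHA = (math.sqrt(5) - 1) / 2   # rotation angle from the example

def visit_frequency(x: float, n: int, a: float = 0.0, b: float = 1 / 3) -> float:
    """Fraction of the orbit points x + k*ALPHA mod 1, k < n, that lie in [a, b)."""
    hits = 0
    for k in range(n):
        point = (x + k * ALPHA) % 1.0
        if a <= point < b:
            hits += 1
    return hits / n

for n in (100, 10_000, 100_000):
    print(n, visit_frequency(0.2, n))
```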
[problem]
Let $(X_n)_{n \geq 1}$ be i.i.d. random variables with $\mathbb{P}(X_1 = 1) = p$ and $\mathbb{P}(X_1 = 0) = 1 - p$ for some $p \in (0,1)$. Define $S_n = X_1 + \cdots + X_n$. Use the ergodic-theoretic framework to prove that $S_n/n \to p$ almost surely, and show that $\mathbb{E}[(S_n/n - p)^2] \to 0$.
[/problem]
[solution]
**Step 1 (Ergodic setup).** Realise $(X_n)$ on the canonical space $(E, \mathcal{E}, \mu)$ with $E = \{0,1\}^{\mathbb{N}}$, $\mathcal{E} = \mathcal{P}(\{0,1\})^{\otimes \mathbb{N}}$ (the product $\sigma$-algebra, generated by the coordinate projections), and $\mu = \operatorname{Ber}(p)^{\otimes \mathbb{N}}$. The coordinate projections $\pi_k(\omega) = \omega_k$ are i.i.d. $\operatorname{Ber}(p)$ random variables. The shift map $\theta$ is measure preserving and ergodic (by the [Kolmogorov Zero–One Law](/theorems/512), since tail events have probability $0$ or $1$).
**Step 2 (Apply Birkhoff).** The function $f = \pi_1$ satisfies $f \in L^1(\mu)$ (since $|f| \leq 1$ and $\mu$ is a probability measure). By the [Birkhoff Ergodic Theorem](/theorems/518), since $\theta$ is ergodic on a probability space,
\begin{align*}
\frac{S_n}{n} = \frac{1}{n}\sum_{k=0}^{n-1} \pi_1 \circ \theta^k = \frac{1}{n}\sum_{k=1}^n \pi_k \to \int_E \pi_1 \, d\mu = \mathbb{E}[X_1] = p \quad \text{a.s.}
\end{align*}
**Step 3 ($L^2$ convergence).** Since $f = \pi_1 \in L^2(\mu)$ (bounded random variables are in all $L^p$), von Neumann's mean ergodic theorem gives $S_n/n \to p$ in $L^2(\mu)$, i.e.,
\begin{align*}
\mathbb{E}\!\left[\left(\frac{S_n}{n} - p\right)^2\right] = \left\|\frac{S_n}{n} - p\right\|_2^2 \to 0.
\end{align*}
Alternatively, compute directly: $\mathbb{E}[(S_n/n - p)^2] = \operatorname{Var}(S_n/n) = \operatorname{Var}(S_n)/n^2 = np(1-p)/n^2 = p(1-p)/n \to 0$.
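The direct computation can be confirmed exactly from the binomial mass function. In the sketch below the value $p = 3/10$ is an arbitrary illustrative choice and `mean_square_error` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb

def mean_square_error(n: int, p: Fraction) -> Fraction:
    """E[(S_n/n - p)^2] computed exactly from the Binomial(n, p) mass function."""
    total = Fraction(0)
    for k in range(n + 1):
        pmf = comb(n, k) * p ** k * (1 - p) ** (n - k)
        total += pmf * (Fraction(k, n) - p) ** 2
    return total

p = Fraction(3, 10)   # arbitrary illustrative value of p
errors = {n: mean_square_error(n, p) for n in (1, 10, 100)}
for n, err in errors.items():
    print(n, err, p * (1 - p) / n)
```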
[/solution]

---

The preceding sections built a toolkit: [Integration](/page/Integration) gave us the Lebesgue integral and convergence theorems, [Inequalities and $L^p$ Spaces](/page/Inequalities%20and%20%24L%5Ep%24%20Spaces) provided moment bounds and function-space structure, [Characteristic Functions and the Fourier Transform](/page/Characteristic%20Functions%20and%20the%20Fourier%20Transform) connected distributions to their frequency representations, and [Ergodic Theory](/page/Ergodic%20Theory) linked time averages to spatial averages for measure-preserving systems. We now bring these threads together to prove the two most important theorems in probability: the *Strong Law of Large Numbers* and the *Central Limit Theorem*.
Both theorems concern the partial sums $S_n = X_1 + \cdots + X_n$ of an i.i.d. sequence. The Strong Law says that $S_n/n$ converges *almost surely* to the mean — the sample average stabilises to the true average with probability one. The Central Limit Theorem describes the *fluctuations* of $S_n$ around its mean: after centering and scaling by $\sqrt{n}$, these fluctuations are asymptotically Gaussian, regardless of the original distribution. The proofs use completely different machinery — ergodic theory for the Strong Law, characteristic functions for the CLT — illustrating how the abstract tools from earlier sections each find their natural application.
# 9. Limit Theorems
## 9.1 The Strong Law of Large Numbers
### 9.1.1 Motivation: From the Weak Law to the Strong Law
The *weak* law of large numbers, which asserts $S_n/n \to \mu$ in probability, follows quickly from Chebyshev's inequality under a finite second moment: $\mathbb{P}(|S_n/n - \mu| \geq \varepsilon) \leq \operatorname{Var}(S_n/n)/\varepsilon^2 = \sigma^2/(n\varepsilon^2) \to 0$. But convergence in probability does not imply almost sure convergence — it guarantees only that violations become rare, not that the sample path eventually settles. Upgrading to almost sure convergence requires controlling the probability of *infinitely many* violations, which is a fundamentally harder problem.
We present two approaches. The first uses a finite fourth moment to obtain summable tail bounds via [Markov's inequality](/theorems/514) and the [First Borel–Cantelli Lemma](/theorems/507), then extracts a.s. convergence. The second — cleaner and more general — uses the [Birkhoff Ergodic Theorem](/theorems/518) to obtain the result under the optimal hypothesis of mere integrability.
### 9.1.2 The Fourth Moment Approach
The idea is to bound $\mathbb{E}[(S_n/n - \mu)^4]$ and show it is summable in $n$. If this fourth moment decays like $1/n^2$, then [Markov's inequality](/theorems/514) gives $\mathbb{P}(|S_n/n - \mu| \geq \varepsilon) = O(1/n^2)$, which is summable. The [First Borel–Cantelli Lemma](/theorems/507) then forces $|S_n/n - \mu| \geq \varepsilon$ to occur only finitely often, yielding a.s. convergence.
[quotetheorem:531]
The [Fourth Moment Strong Law](/theorems/531) demonstrates how $L^p$ moment bounds — specifically in $L^4$ — combine with the Borel–Cantelli machinery to yield almost sure convergence. The proof centres on the combinatorial observation that when the $Y_k = X_k - \mu$ are independent and centred, expanding $(\sum Y_k)^4$ produces $n^4$ terms, but independence and $\mathbb{E}[Y_k] = 0$ kill all terms where any index appears exactly once. Only the "all-equal" terms ($j = k = l = m$, contributing $O(n)$) and the "two-pair" terms ($\{j,k,l,m\}$ a pair of distinct pairs, contributing $O(n^2)$) survive, giving $\mathbb{E}[(\sum Y_k)^4] = O(n^2)$ and hence $\mathbb{E}[(S_n/n - \mu)^4] = O(1/n^2)$. The summability $\sum 1/n^2 < \infty$ then drives the Borel–Cantelli argument.
[citeproof:531]
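The counting argument predicts $\mathbb{E}[(\sum_k Y_k)^4] = n\,\mathbb{E}[Y_1^4] + 3n(n-1)\,(\mathbb{E}[Y_1^2])^2$. As a brute-force check, the sketch below enumerates all outcomes for i.i.d. signs $Y_k = \pm 1$, where both moments equal $1$. The choice of sign variables is an illustrative assumption, made so that the enumeration is finite.

```python
from fractions import Fraction
from itertools import product

def fourth_moment_of_sum(n: int) -> Fraction:
    """E[(Y_1 + ... + Y_n)^4] for i.i.d. signs Y_k = +/-1, by enumerating all 2^n outcomes."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2 ** n)

def predicted(n: int) -> int:
    """n * E[Y^4] + 3 n (n-1) * (E[Y^2])^2 with E[Y^2] = E[Y^4] = 1."""
    return n + 3 * n * (n - 1)

table = {n: (fourth_moment_of_sum(n), predicted(n)) for n in range(1, 11)}
for n, (enumerated, formula) in table.items():
    print(n, enumerated, formula)
```

The $3n(n-1)$ term is the $O(n^2)$ contribution of the two-pair terms, and the $n$ term comes from the all-equal indices.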
### 9.1.3 The Ergodic Approach: Optimal Integrability
The fourth moment proof requires $\mathbb{E}[X_1^4] < \infty$, which is far from necessary. The optimal result requires only $\mathbb{E}[|X_1|] < \infty$. The elegant proof uses the [Birkhoff Ergodic Theorem](/theorems/518) applied to the shift on the canonical sequence space.
[quotetheorem:520]
The [Strong Law of Large Numbers](/theorems/520) is the most important single application of ergodic theory to probability. The proof embeds the i.i.d. sequence into the canonical space $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R})^{\otimes \mathbb{N}}, m^{\otimes \mathbb{N}})$, where $m$ is the common law. The shift $\theta(\omega_1, \omega_2, \dots) = (\omega_2, \omega_3, \dots)$ is measure preserving (since the product structure is invariant under discarding the first coordinate, verified via the [Uniqueness of Measure Extension](/theorems/506) on cylinder sets) and ergodic (since shift-invariant events are tail events, and the [Kolmogorov Zero–One Law](/theorems/512) makes them trivial). Taking $f = \pi_1$ (the first coordinate), the ergodic sum $S_n(f) = \pi_1 + \cdots + \pi_n$ is exactly the partial sum of the i.i.d. sequence. Birkhoff's theorem gives $S_n(f)/n \to \bar{f}$ a.s. with $\bar{f}$ invariant, and ergodicity forces $\bar{f} = \mathbb{E}[X_1] = \nu$ a.s.
The integrability condition $\mathbb{E}[|X_1|] < \infty$ is *necessary*: if $\mathbb{E}[|X_1|] = \infty$, then $S_n/n$ does not converge a.s. to any constant (one can show that $\limsup |S_n|/n = \infty$ a.s. using the [Second Borel–Cantelli Lemma](/theorems/508) and the divergence of $\sum \mathbb{P}(|X_1| > n)$).
[citeproof:520]
[example:Cauchy Distribution Has No Strong Law]
Let $X_1, X_2, \dots$ be i.i.d. standard Cauchy random variables, with density $f(x) = 1/(\pi(1 + x^2))$. Then $\mathbb{E}[|X_1|] = \int_{\mathbb{R}} |x|/(\pi(1+x^2)) \, d\mathcal{L}^1(x) = \infty$, so the SLLN does not apply. In fact, the distribution of $S_n/n$ is itself standard Cauchy for every $n$. This follows from characteristic functions: $\phi_{X_1}(u) = e^{-|u|}$, so by the [Convolution Theorem](/theorems/527), $\phi_{S_n}(u) = e^{-n|u|}$, and $\phi_{S_n/n}(u) = \phi_{S_n}(u/n) = e^{-|u|}$. The sample average has the same heavy-tailed distribution regardless of $n$, so it cannot converge in probability, and hence not almost surely, to any constant.
[/example]
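A seeded simulation illustrates the absence of averaging. For a standard Cauchy variable $C$, $\mathbb{P}(|C| > 1) = 1/2$, so if $S_n/n$ is standard Cauchy the empirical frequency of $\{|S_n/n| > 1\}$ should stay near $1/2$ for every $n$. The sample sizes and seed below are arbitrary choices for this sketch.

```python
import math
import random

def cauchy_sample(rng: random.Random) -> float:
    """Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2)) with U uniform on [0, 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def tail_frequency(n: int, trials: int, rng: random.Random) -> float:
    """Empirical estimate of P(|S_n / n| > 1), which is 1/2 for a standard Cauchy."""
    count = 0
    for _ in range(trials):
        mean = sum(cauchy_sample(rng) for _ in range(n)) / n
        if abs(mean) > 1.0:
            count += 1
    return count / trials

rng = random.Random(0)   # fixed seed so the run is reproducible
freqs = {n: tail_frequency(n, 4000, rng) for n in (1, 10, 100)}
print(freqs)
```

For a distribution with finite mean, the same frequency would tend to $0$ or $1$ as $n$ grows (unless the mean is exactly $\pm 1$), since $S_n/n$ would concentrate at the mean.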
## 9.2 The Central Limit Theorem
### 9.2.1 Motivation: What Happens at the $\sqrt{n}$ Scale?
The Strong Law says $S_n/n \to \mu$ a.s. But *how fast* does $S_n$ approach $n\mu$? The answer is: the typical deviation is of order $\sqrt{n}$. Precisely, $S_n - n\mu$ has standard deviation $\sigma\sqrt{n}$, so the normalised quantity $(S_n - n\mu)/(\sigma\sqrt{n})$ has mean $0$ and variance $1$ for all $n$. The Central Limit Theorem says its *distribution* converges to the standard normal. The limit is universal: the original distribution enters only through $\mu$ and $\sigma^2$, which appear in the normalisation.
The proof strategy is purely Fourier-analytic: compute the characteristic function of the normalised sum (which factors as a product by independence and the [Convolution Theorem](/theorems/527)), expand it to second order using the moment assumptions, and show the $n$-th power converges pointwise to $e^{-u^2/2}$. The [Lévy Continuity Theorem](/theorems/519) then converts this pointwise convergence into distributional convergence.
[quotetheorem:521]
The [Central Limit Theorem](/theorems/521) is the most broadly applied result in probability and statistics. Its power lies in *universality*: the limiting distribution is always Gaussian, regardless of whether the $X_n$ are discrete, continuous, bounded, or unbounded — only the first two moments matter. The proof uses four ingredients: independence (to factor the characteristic function of $S_n$ as a product, via the [Convolution Theorem](/theorems/527)), the Taylor expansion $\phi(t) = 1 - t^2/2 + o(t^2)$ (from $\mathbb{E}[X_1] = 0$ and $\mathbb{E}[X_1^2] = 1$, with the remainder controlled by the [Dominated Convergence Theorem](/theorems/4)), the classical limit $(1 + a/n + o(1/n))^n \to e^a$ (to identify the $n$-th power), and the [Lévy Continuity Theorem](/theorems/519) (to convert characteristic function convergence to distributional convergence).
[citeproof:521]
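The convergence of the $n$-th power can be watched numerically in a simple case. For i.i.d. signs $X_k = \pm 1$ (an illustrative choice with mean $0$ and variance $1$), the characteristic function is $\phi(t) = \cos t$, so $S_n/\sqrt{n}$ has characteristic function $\cos(u/\sqrt{n})^n$. The sketch below compares this with $e^{-u^2/2}$ at the arbitrary point $u = 1.5$.

```python
import math

def phi_normalised_sum(u: float, n: int) -> float:
    """Characteristic function of S_n / sqrt(n) for i.i.d. signs: cos(u / sqrt(n))^n."""
    return math.cos(u / math.sqrt(n)) ** n

def gaussian_cf(u: float) -> float:
    """Characteristic function of N(0, 1)."""
    return math.exp(-u * u / 2.0)

u = 1.5
sizes = (10, 100, 1000, 10_000)
gaps = [abs(phi_normalised_sum(u, n) - gaussian_cf(u)) for n in sizes]
for n, gap in zip(sizes, gaps):
    print(n, gap)
```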
### 9.2.2 General Mean and Variance
The standardised version of the CLT assumes $\mathbb{E}[X_1] = 0$ and $\operatorname{Var}(X_1) = 1$. The general case follows by a simple linear change of variables.
[quotetheorem:532]
The reduction is immediate: define $Y_k = (X_k - \mu)/\sigma$, which are i.i.d. with $\mathbb{E}[Y_1] = 0$ and $\mathbb{E}[Y_1^2] = 1$. Then $(S_n - n\mu)/(\sigma\sqrt{n}) = (Y_1 + \cdots + Y_n)/\sqrt{n}$, and the [standardised CLT](/theorems/521) applied to $(Y_k)$ gives convergence in distribution to $N(0,1)$. This is the version most commonly used in applications.
[citeproof:532]
[example:De Moivre Laplace Theorem]
Let $X_1, X_2, \dots$ be i.i.d. $\operatorname{Bernoulli}(p)$ with $p \in (0,1)$. Then $S_n = \sum_{k=1}^n X_k \sim \operatorname{Binomial}(n, p)$, $\mathbb{E}[S_n] = np$, and $\operatorname{Var}(S_n) = np(1-p)$. The [General CLT](/theorems/532) gives
\begin{align*}
\frac{S_n - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0,1).
\end{align*}
This is the *De Moivre–Laplace theorem* (1733/1812), historically the first version of the CLT. For example, with $p = 1/2$ and $n = 100$ (fair coin, 100 tosses): $\mathbb{P}(45 \leq S_{100} \leq 55) = \mathbb{P}(|S_{100} - 50| \leq 5) \approx \mathbb{P}(|Z| \leq 1) \approx 0.683$ where $Z \sim N(0,1)$, since $5/\sqrt{25} = 1$. The exact binomial probability is $\approx 0.729$. Most of the gap comes from approximating an integer-valued variable by a continuous one: with the continuity correction, $\mathbb{P}(|S_{100} - 50| \leq 5.5) \approx \mathbb{P}(|Z| \leq 1.1) \approx 0.729$, which agrees with the exact value to three decimal places.
[/example]
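The numbers in this example can be reproduced with the standard library. The sketch below computes the exact binomial probability, the plain normal approximation, and the approximation with the usual half-integer continuity correction, which is a standard refinement added here for comparison.

```python
from math import comb, erf, sqrt

def binomial_prob(n: int, lo: int, hi: int) -> float:
    """Exact P(lo <= S_n <= hi) for S_n ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

def normal_symmetric_prob(z: float) -> float:
    """P(|Z| <= z) for Z ~ N(0, 1), via the error function."""
    return erf(z / sqrt(2.0))

exact = binomial_prob(100, 45, 55)
plain = normal_symmetric_prob(5 / 5)         # threshold 5, standard deviation 5
corrected = normal_symmetric_prob(5.5 / 5)   # half-integer continuity correction
print(round(exact, 4), round(plain, 4), round(corrected, 4))
```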
## 9.3 Relationship Between the Two Theorems
The SLLN and CLT describe the same sequence at different scales. Write $S_n = n\mu + \sigma\sqrt{n} \cdot Z_n$ where $Z_n = (S_n - n\mu)/(\sigma\sqrt{n}) \xrightarrow{d} N(0,1)$. The first term $n\mu$ captures the deterministic drift (SLLN: $S_n/n \to \mu$). The second term $\sigma\sqrt{n}\cdot Z_n$ captures the random fluctuations (CLT: $(S_n - n\mu)/(\sigma\sqrt{n}) \xrightarrow{d} N(0,1)$). The SLLN is a *first-order* result (law of large numbers), while the CLT is a *second-order* refinement (distributional limit of the fluctuations).
The hypotheses differ: the SLLN needs only $\mathbb{E}[|X_1|] < \infty$ (first moment), while the CLT needs $\mathbb{E}[X_1^2] < \infty$ (second moment). This is natural — the SLLN makes no claim about the scale of fluctuations, while the CLT characterises fluctuations at the $\sqrt{n}$ scale, which requires the variance to be finite.
Neither result implies the other. The SLLN gives a.s. convergence of $S_n/n$, which is a statement about *individual sample paths*. The CLT gives distributional convergence of $(S_n - n\mu)/(\sigma\sqrt{n})$, which is a statement about the *distribution* of the normalised sum — it says nothing about pointwise behaviour of sample paths.
[problem]
Let $X_1, X_2, \dots$ be i.i.d. with $\mathbb{P}(X_1 = 1) = \mathbb{P}(X_1 = -1) = 1/2$ (symmetric random walk steps). Define $S_n = X_1 + \cdots + X_n$.
(a) Use the [Strong Law of Large Numbers](/theorems/520) to show that $S_n/n \to 0$ a.s.
(b) Use the [Central Limit Theorem](/theorems/521) to show that $\mathbb{P}(S_n > 0) \to 1/2$.
(c) Show that $\mathbb{P}(S_n > \sqrt{n}\log n) \to 0$ as $n \to \infty$.
[/problem]
[solution]
**Step 1 (Part (a): Strong Law).** The random variables $X_k$ are i.i.d. with $\mathbb{E}[|X_1|] = 1 < \infty$ and $\mathbb{E}[X_1] = 0$. By the [Strong Law of Large Numbers](/theorems/520),
\begin{align*}
\frac{S_n}{n} \to \mathbb{E}[X_1] = 0 \quad \text{almost surely.}
\end{align*}
**Step 2 (Part (b): CLT).** We have $\mathbb{E}[X_1] = 0$ and $\operatorname{Var}(X_1) = \mathbb{E}[X_1^2] = 1$. By the [Central Limit Theorem](/theorems/521), $S_n/\sqrt{n} \xrightarrow{d} Z \sim N(0,1)$. Since $\mathbb{P}(Z = 0) = 0$ (the normal distribution has no atoms),
\begin{align*}
\mathbb{P}(S_n > 0) = \mathbb{P}(S_n/\sqrt{n} > 0) \to \mathbb{P}(Z > 0) = \frac{1}{2}.
\end{align*}
**Step 3 (Part (c): Chernoff bound).** Since $\log n \to \infty$, the CLT (which describes the distribution at fixed thresholds) does not directly apply. We use an exponential moment bound instead. By [Markov's inequality](/theorems/514) applied to $e^{\lambda S_n}$ for $\lambda > 0$,
\begin{align*}
\mathbb{P}(S_n > \sqrt{n}\log n) = \mathbb{P}(e^{\lambda S_n} > e^{\lambda\sqrt{n}\log n}) \leq e^{-\lambda\sqrt{n}\log n}\,\mathbb{E}[e^{\lambda S_n}].
\end{align*}
Since $\mathbb{E}[e^{\lambda X_1}] = \frac{1}{2}(e^\lambda + e^{-\lambda}) = \cosh(\lambda)$ and the $X_k$ are independent, $\mathbb{E}[e^{\lambda S_n}] = \cosh(\lambda)^n$. Using $\cosh(\lambda) \leq e^{\lambda^2/2}$ (which follows from $\cosh(\lambda) = \sum_{k=0}^\infty \lambda^{2k}/(2k)! \leq \sum_{k=0}^\infty (\lambda^2/2)^k/k! = e^{\lambda^2/2}$):
\begin{align*}
\mathbb{P}(S_n > \sqrt{n}\log n) \leq e^{-\lambda\sqrt{n}\log n + n\lambda^2/2}.
\end{align*}
Minimise the exponent over $\lambda > 0$: setting $d/d\lambda(-\lambda\sqrt{n}\log n + n\lambda^2/2) = -\sqrt{n}\log n + n\lambda = 0$ gives $\lambda^* = (\log n)/\sqrt{n}$, which is positive for $n \geq 2$. Substituting:
\begin{align*}
-\lambda^*\sqrt{n}\log n + \frac{n(\lambda^*)^2}{2} = -(\log n)^2 + \frac{(\log n)^2}{2} = -\frac{(\log n)^2}{2}.
\end{align*}
Therefore $\mathbb{P}(S_n > \sqrt{n}\log n) \leq e^{-(\log n)^2/2} \to 0$ as $n \to \infty$.
[/solution]
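Parts (b) and (c) can be checked against exact binomial computations, using $S_n = 2B - n$ with $B \sim \operatorname{Binomial}(n, 1/2)$. The sketch below compares the exact tail probability with the bound $e^{-(\log n)^2/2}$ and evaluates $\mathbb{P}(S_n > 0)$ by symmetry. The values of $n$ are arbitrary.

```python
from math import comb, exp, log, sqrt

def exact_tail(n: int) -> float:
    """P(S_n > sqrt(n) log n) for the symmetric walk, using S_n = 2B - n, B ~ Bin(n, 1/2)."""
    threshold = sqrt(n) * log(n)
    favourable = sum(comb(n, b) for b in range(n + 1) if 2 * b - n > threshold)
    return favourable / 2 ** n

def chernoff_bound(n: int) -> float:
    """The bound exp(-(log n)^2 / 2) from Step 3."""
    return exp(-log(n) ** 2 / 2.0)

def prob_positive(n: int) -> float:
    """P(S_n > 0) = (1 - P(S_n = 0)) / 2 by symmetry of the walk."""
    p_zero = comb(n, n // 2) / 2 ** n if n % 2 == 0 else 0.0
    return (1.0 - p_zero) / 2.0

rows = [(n, exact_tail(n), chernoff_bound(n), prob_positive(n)) for n in (10, 100, 1000)]
for row in rows:
    print(row)
```

The exact tail sits below the bound in each case, and $\mathbb{P}(S_n > 0)$ approaches $1/2$ from below for even $n$ because of the mass at $S_n = 0$.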
Cambridge IB Probability and Measure
Created by admin on 3/2/2026 | Last updated on 6/1/2026