This course develops the core machinery of modern probability theory, beginning with the measure-theoretic formulation of conditional expectation and culminating in the construction and analysis of Brownian motion and Poisson random measures. The central thread is the interplay between martingale theory and the fine structure of stochastic processes: conditional expectation provides the algebraic framework, martingale convergence and optional stopping supply the analytical backbone, and these tools combine to unlock the deeper properties of Brownian paths --- their regularity, their symmetries, their remarkable connections to harmonic analysis and partial differential equations.
The course is structured as follows. Chapter 1 constructs conditional expectation via Hilbert space projections and monotone approximation, establishing the properties that underpin all subsequent martingale arguments. Chapter 2 develops discrete-time martingale theory in full: stopping times, the optional stopping theorem, the almost sure convergence theorem via Doob's upcrossing inequality, $L^p$ convergence, and uniform integrability. This chapter also contains major applications (the strong law of large numbers, Kolmogorov's $0$-$1$ law, and the Radon--Nikodym theorem), each proved using backwards martingales or martingale convergence. Chapter 3 passes to continuous time, addressing the measurability subtleties that arise for uncountable index sets, proving the martingale regularisation theorem that guarantees c\`adl\`ag versions, and establishing Kolmogorov's continuity criterion. Chapter 4 treats weak convergence of probability measures: the portmanteau theorem, tightness, Prokhorov's theorem, and L\'evy's continuity theorem via characteristic functions. Chapter 5 introduces large deviation theory through Cram\'er's theorem, computing the exponential rate at which the probability that empirical means deviate from their expectation decays. Chapter 6, the longest and deepest chapter, constructs Brownian motion via Wiener's theorem, develops its invariance properties and the strong Markov property, proves the reflection principle, establishes the recurrence-transience dichotomy across dimensions, solves the Dirichlet problem probabilistically, and proves Donsker's invariance principle via the Skorokhod embedding. Chapter 7 constructs Poisson random measures and develops their integration theory.
Throughout, the reader is assumed to have a solid grounding in measure theory ($\sigma$-algebras, Lebesgue integration, $L^p$ spaces, the monotone and dominated convergence theorems), elementary probability (random variables, expectation, variance, generating functions), and the central limit theorem. Familiarity with Hilbert space theory at the level of orthogonal projections onto closed subspaces is used in Chapter 1.
# 1. Conditional Expectation
The elementary definition of conditional expectation --- $\mathbb{E}[X \mid B] = \mathbb{E}[X \mathbb{1}_B] / \mathbb{P}(B)$ for an event $B$ with $\mathbb{P}(B) > 0$ --- conditions on a single event. But in probability theory, one routinely needs to condition on an entire $\sigma$-algebra $\mathcal{G}$, representing all the information available at a given time or from a given collection of observations. The question driving this chapter is: given an integrable random variable $X$ on $(\Omega, \mathcal{F}, \mathbb{P})$ and a sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$, can one find a $\mathcal{G}$-measurable random variable $Y$ that serves as the "best prediction of $X$ given $\mathcal{G}$"? The answer requires making precise what "best prediction" means, and the construction passes through the Hilbert space structure of $L^2$.
## 1.1 The Discrete Case
To build intuition, consider first the special case where $\mathcal{G}$ is generated by a countable partition. Let $X$ be an integrable random variable and let $(B_i)_{i \in I}$ be a countable family of disjoint events with $\bigcup_i B_i = \Omega$. Set $\mathcal{G} = \sigma(B_i : i \in I)$. The elements of $\mathcal{G}$ are precisely the unions $\bigcup_{i \in J} B_i$ for subsets $J \subset I$. One defines
\begin{align*}
Y = \sum_{i \in I} \mathbb{E}[X \mid B_i] \, \mathbb{1}_{B_i},
\end{align*}
where the convention $\mathbb{E}[X \mid B_i] = 0$ when $\mathbb{P}(B_i) = 0$ ensures $Y$ is well-defined everywhere. This random variable $Y$ is constant on each atom $B_i$ and takes the value $\mathbb{E}[X \mathbb{1}_{B_i}] / \mathbb{P}(B_i)$ there --- exactly the elementary conditional expectation of $X$ given the event $B_i$.
The following result identifies the two properties that will serve as the defining characteristics of conditional expectation in general.
[quotetheorem:1146]
The proof verifies these properties directly from the construction.
[citeproof:1146]
The two properties --- $\mathcal{G}$-measurability and the integral-matching condition $\mathbb{E}[Y\mathbb{1}_A] = \mathbb{E}[X\mathbb{1}_A]$ for all $A \in \mathcal{G}$ --- completely characterise the conditional expectation. The integral-matching condition says that $Y$ "agrees with $X$ on average" over every event in $\mathcal{G}$: it is the function that retains exactly the information about $X$ that is visible through $\mathcal{G}$, discarding everything else. This motivates the general definition.
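The partition formula and the integral-matching condition can be checked exactly on a small finite space. The following is a minimal sketch, assuming a fair die as the sample space and a three-block partition; the names `expectation` and `cond_exp_partition` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# Hypothetical finite sample space: a fair six-sided die, Omega = {1, ..., 6}.
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}

# X is the face value; the partition groups outcomes as low / middle / high.
X = {w: Fraction(w) for w in omega}
partition = [{1, 2}, {3, 4}, {5, 6}]


def expectation(f, event=None):
    """E[f 1_event] computed exactly; event=None means the whole space."""
    return sum(f[w] * prob[w] for w in omega if event is None or w in event)


def cond_exp_partition(f, blocks):
    """Y = sum_i E[f | B_i] 1_{B_i}, with E[f | B_i] = 0 when P(B_i) = 0."""
    Y = {}
    for B in blocks:
        pB = sum(prob[w] for w in B)
        value = expectation(f, B) / pB if pB > 0 else Fraction(0)
        for w in B:
            Y[w] = value
    return Y


Y = cond_exp_partition(X, partition)

# Every element of G is a union of blocks: check integral-matching on all 2^3 unions.
matches = []
for mask in range(2 ** len(partition)):
    A = set().union(*[B for i, B in enumerate(partition) if (mask >> i) & 1])
    matches.append(expectation(Y, A) == expectation(X, A))

print(Y)             # constant on each block of the partition
print(all(matches))  # integral-matching holds on every set of G
```

Exact rational arithmetic is used so that the equality $\mathbb{E}[Y\mathbb{1}_A] = \mathbb{E}[X\mathbb{1}_A]$ is tested without rounding error.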
## 1.2 Existence and Uniqueness
For a general sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$, the partition-based construction is unavailable: $\mathcal{G}$ need not be generated by a countable partition. The existence proof proceeds in three stages --- first for $X \in L^2$ using Hilbert space methods, then for $X \geq 0$ using monotone approximation, and finally for general $X \in L^1$ by decomposition into positive and negative parts.
[quotetheorem:1147]
The strategy for uniqueness is a short argument using the integral-matching condition. The strategy for existence when $X \in L^2$ is to use the orthogonal projection theorem in Hilbert space. The extension to $X \geq 0$ uses monotone truncation, and the extension to general $X$ uses the decomposition $X = X^+ - X^-$.
[citeproof:1147]
The uniqueness argument --- exploiting the fact that $\{Y > Y'\} \in \mathcal{G}$ and using the integral-matching condition to force the difference to vanish --- is a technique that recurs throughout martingale theory. The same pattern appears in the proof of the Radon--Nikodym theorem (Chapter 2) and in verifying uniqueness of solutions to certain stochastic equations.
One writes $Y = \mathbb{E}[X \mid \mathcal{G}]$ for any version of the conditional expectation; the notation suppresses the fact that $Y$ is determined only up to a set of $\mathbb{P}$-measure zero. When $\mathcal{G} = \sigma(Z)$ for some random variable $Z$, one also writes $\mathbb{E}[X \mid Z]$.
[remark: Alternative Characterisation]
Condition (ii) can equivalently be replaced by: $\mathbb{E}[XW] = \mathbb{E}[YW]$ for all bounded $\mathcal{G}$-measurable random variables $W$. This follows because bounded $\mathcal{G}$-measurable functions can be approximated by linear combinations of indicators $\mathbb{1}_A$ with $A \in \mathcal{G}$, and one passes to the limit using the dominated convergence theorem.
[/remark]
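The $L^2$ construction reveals the geometric content of conditional expectation: for $X \in L^2$, $\mathbb{E}[X \mid \mathcal{G}]$ is the orthogonal projection of $X$ onto the closed subspace $L^2(\mathcal{G})$ of square-integrable $\mathcal{G}$-measurable random variables, and the "error" $X - \mathbb{E}[X \mid \mathcal{G}]$ is orthogonal to every element of $L^2(\mathcal{G})$. Equivalently, $\mathbb{E}[X \mid \mathcal{G}]$ minimises $\mathbb{E}[(X - W)^2]$ over $W \in L^2(\mathcal{G})$, which is the precise sense in which conditional expectation is the "best $L^2$-prediction."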
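The projection picture can be tested by brute force on a finite space. The sketch below assumes two fair coin flips with $X$ the number of heads and $\mathcal{G}$ generated by the first flip; the grid of candidate predictors is an arbitrary choice made for illustration, so the minimisation is only checked over that grid.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: two fair coin flips, X = number of heads, G = sigma(first flip).
omega = list(product([0, 1], repeat=2))
p = Fraction(1, 4)
X = {w: Fraction(w[0] + w[1]) for w in omega}

# E[X | G] is constant on {first flip = c}: average X over the two outcomes in that block.
Y = {w: sum(X[v] for v in omega if v[0] == w[0]) / 2 for w in omega}


def inner(f, g):
    """L^2 inner product E[f g] on the finite space."""
    return sum(f[w] * g[w] * p for w in omega)


def g_measurable(c0, c1):
    """A G-measurable function: value c0 if the first flip is 0, else c1."""
    return {w: (c0 if w[0] == 0 else c1) for w in omega}


def mse(W):
    """Mean squared error E[(X - W)^2]."""
    diff = {w: X[w] - W[w] for w in omega}
    return inner(diff, diff)


residual = {w: X[w] - Y[w] for w in omega}
grid = [Fraction(k, 2) for k in range(-4, 5)]

# Orthogonality: E[(X - Y) W] = 0 for every G-measurable W on the grid.
orthogonal = all(
    inner(residual, g_measurable(c0, c1)) == 0 for c0 in grid for c1 in grid
)

# Best prediction: no G-measurable W on the grid has smaller error than Y.
best = min(mse(g_measurable(c0, c1)) for c0 in grid for c1 in grid)
print(orthogonal, mse(Y), best)
```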
## 1.3 Properties of Conditional Expectation
The following collects the basic algebraic and order-theoretic properties. Each property follows from a short verification of the two defining conditions.
[quotetheorem:1148]
Each property is proved by exhibiting a candidate for the conditional expectation and verifying the two defining conditions: measurability and integral-matching.
[citeproof:1148]
The averaging property (i) is used so frequently that it barely receives comment, but it encodes a fundamental conservation law: conditional expectation preserves the total expectation. The known-functions property (ii) and independence property (iii) are the two extremes of the theory --- when $\mathcal{G}$ contains all the information about $X$ versus none of it.
## 1.4 Conditional Convergence Theorems
The convergence theorems of measure theory have conditional counterparts, proved by combining the unconditional versions with the defining properties. These conditional versions are essential for the passage from discrete to continuous time in Chapter 3 and for the UI convergence arguments in Chapter 2.
[quotetheorem:1149]
The proof strategy for each part is uniform: one constructs a candidate for the conditional expectation of the limit, then verifies the two defining conditions (measurability and integral-matching) using the corresponding unconditional convergence theorem. The key difficulty in each case is justifying the interchange of conditional expectation and limiting operation.
[citeproof:1149]
The conditional Jensen inequality deserves particular attention. The proof uses the representation of convex functions as suprema of affine minorants, which reduces the problem to the linearity and monotonicity of conditional expectation. This representation is itself a consequence of the Hahn--Banach theorem (the supporting hyperplane theorem in $\mathbb{R}$). The conditional Jensen inequality underlies the entire $L^p$ theory of martingales: it is the mechanism by which $|X|^p$ or $e^{\lambda X}$ produces submartingales from martingales.
The $L^p$ contraction inequality $\|\mathbb{E}[X \mid \mathcal{G}]\|_p \leq \|X\|_p$ is a key quantitative tool. It says that conditioning can only reduce the $L^p$ norm --- "smoothing out" by averaging never increases the size of a function.
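Both the conditional Jensen inequality for $\varphi(x) = |x|^p$ and the resulting $L^p$ contraction can be verified exactly on a toy example. The sketch below assumes four equally likely outcomes and a two-block $\sigma$-algebra; the values of $X$ are arbitrary illustrative choices.

```python
from fractions import Fraction

# Hypothetical finite space: four equally likely outcomes, G generated by {0, 1} and {2, 3}.
omega = [0, 1, 2, 3]
p = Fraction(1, 4)
X = {0: Fraction(-3), 1: Fraction(1), 2: Fraction(2), 3: Fraction(6)}
blocks = [[0, 1], [2, 3]]


def cond_exp(f):
    """E[f | G]: average f over the block containing each outcome."""
    out = {}
    for B in blocks:
        avg = sum(f[w] for w in B) / len(B)
        for w in B:
            out[w] = avg
    return out


Y = cond_exp(X)
results = []
for power in (1, 2, 4):
    # Conditional Jensen: |E[X | G]|^p <= E[|X|^p | G] pointwise.
    lhs = {w: abs(Y[w]) ** power for w in omega}
    rhs = cond_exp({w: abs(X[w]) ** power for w in omega})
    jensen_ok = all(lhs[w] <= rhs[w] for w in omega)
    # Taking expectations gives E|Y|^p <= E|X|^p, i.e. the L^p contraction.
    moment_Y = sum(lhs[w] * p for w in omega)
    moment_X = sum(abs(X[w]) ** power * p for w in omega)
    results.append((power, jensen_ok, moment_Y, moment_X))

for row in results:
    print(row)
```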
## 1.5 The Tower Property and Factoring Out Known Information
Two further structural properties are used constantly in martingale theory. The tower property says that if one first conditions on a fine $\sigma$-algebra and then on a coarser one, the result is the same as conditioning directly on the coarser one. This property is the algebraic engine behind the definition of martingales: the martingale condition $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for all $n \geq m$ is equivalent, via the tower property, to the one-step condition $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$.
[quotetheorem:1150]
The proof is a direct verification of the two defining conditions. The key observation is that the inclusion $\mathcal{H} \subset \mathcal{G}$ ensures that the integral-matching condition for $\mathbb{E}[X \mid \mathcal{G}]$ on sets in $\mathcal{H}$ reduces to the integral-matching condition for $\mathbb{E}[X \mid \mathcal{H}]$.
[citeproof:1150]
The tower property technique --- reducing a multi-step conditional expectation to a one-step condition --- reappears throughout this course. It is essential for verifying the martingale property (Chapter 2), for the backwards martingale argument in the strong law (Chapter 2), and for establishing the Markov property of Brownian motion (Chapter 6).
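A direct numerical check of the tower property is possible when both $\sigma$-algebras come from nested partitions. The sketch below assumes three fair coin flips, with $\mathcal{G}$ generated by the first two flips and $\mathcal{H}$ by the first flip alone; `cond_exp` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setup: three fair coin flips; G = sigma(first two flips), H = sigma(first flip).
omega = list(product([0, 1], repeat=3))
X = {w: Fraction(w[0] + 2 * w[1] + 4 * w[2]) for w in omega}


def cond_exp(f, k):
    """E[f | first k flips]: average f over outcomes sharing the first k coordinates."""
    out = {}
    for w in omega:
        block = [v for v in omega if v[:k] == w[:k]]
        out[w] = sum(f[v] for v in block) / len(block)
    return out


direct = cond_exp(X, 1)                  # E[X | H]
iterated = cond_exp(cond_exp(X, 2), 1)   # E[E[X | G] | H]
print(direct == iterated)
```

Averaging over blocks is valid here only because the outcomes are equally likely; for a non-uniform measure the averages would have to be weighted.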
The "taking out what is known" property says that $\mathcal{G}$-measurable factors can be extracted from conditional expectations.
[quotetheorem:1151]
The proof verifies the defining conditions using the alternative characterisation: one replaces $\mathbb{1}_A$ by the bounded $\mathcal{G}$-measurable function $Y \mathbb{1}_A$.
[citeproof:1151]
The next result governs conditioning when some information is independent. It is the basis for the Markov property.
[quotetheorem:1152]
The proof uses the $\pi$-system technique: one verifies the integral-matching condition on sets of the form $A \cap B$ with $A \in \mathcal{G}$, $B \in \mathcal{H}$, which form a $\pi$-system generating $\sigma(\mathcal{G}, \mathcal{H})$, and appeals to the uniqueness of extension theorem for finite measures.
[citeproof:1152]
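The hypothesis that $\sigma(X, \mathcal{G})$ is independent of $\mathcal{H}$ is stronger than requiring $\sigma(X)$ and $\mathcal{G}$ to be separately independent of $\mathcal{H}$, and the extra strength is essential. Under the weaker assumption the conclusion can fail, because pairwise independence does not imply joint independence: $X$ and $\mathcal{G}$ together may determine events of $\mathcal{H}$. For example, if $X$ and $Z$ are independent fair signs, $\mathcal{G} = \sigma(Z)$ and $\mathcal{H} = \sigma(XZ)$, then $\mathcal{H}$ is independent of $\sigma(X)$ and of $\mathcal{G}$ separately, yet $\mathbb{E}[X \mid \sigma(\mathcal{G}, \mathcal{H})] = X$ while $\mathbb{E}[X \mid \mathcal{G}] = 0$.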
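The need for joint independence can be seen by exact enumeration. The sketch below assumes a specific counterexample chosen for illustration: $X$ and $Z$ are independent fair signs, $\mathcal{G} = \sigma(Z)$ and $\mathcal{H} = \sigma(XZ)$. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical counterexample: X, Z independent fair signs, G = sigma(Z), H = sigma(X * Z).
omega = list(product([-1, 1], repeat=2))   # outcomes (x, z), each with probability 1/4
p = Fraction(1, 4)


def prob(event):
    """Probability of the set of outcomes on which event(w) is true."""
    return sum(p for w in omega if event(w))


def independent(f, g):
    """Check P(f = s, g = t) = P(f = s) P(g = t) for all sign values s, t."""
    return all(
        prob(lambda w: f(w) == s and g(w) == t)
        == prob(lambda w: f(w) == s) * prob(lambda w: g(w) == t)
        for s in (-1, 1)
        for t in (-1, 1)
    )


def cond_exp(f, key):
    """E[f | key]: average f over outcomes with the same value of key."""
    out = {}
    for w in omega:
        block = [v for v in omega if key(v) == key(w)]
        out[w] = Fraction(sum(f(v) for v in block), len(block))
    return out


X = lambda w: w[0]
Z = lambda w: w[1]
XZ = lambda w: w[0] * w[1]

pairwise = independent(X, XZ) and independent(Z, XZ)
given_G = cond_exp(X, Z)                          # E[X | G]
given_GH = cond_exp(X, lambda w: (Z(w), XZ(w)))   # E[X | sigma(G, H)]
print(pairwise, given_G, given_GH)
```

Here $\mathcal{H}$ is independent of $\sigma(X)$ and of $\mathcal{G}$ separately, but $Z$ and $XZ$ together determine $X$, so conditioning on $\sigma(\mathcal{G}, \mathcal{H})$ recovers $X$ exactly while conditioning on $\mathcal{G}$ alone gives $0$.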
## 1.6 Product Measures and Fubini's Theorem
The theory of conditional expectation interacts with product measures through Fubini's theorem, which is used repeatedly in later chapters (e.g., in the proof that Brownian motion has the correct finite-dimensional distributions).
[definition: Product Measure Space]
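A measure space $(E, \mathcal{E}, \mu)$ is **$\sigma$-finite** if there exists a sequence $(S_n)_{n \geq 0}$ in $\mathcal{E}$ with $\bigcup_n S_n = E$ and $\mu(S_n) < \infty$ for all $n$. Given two $\sigma$-finite measure spaces $(E_1, \mathcal{E}_1, \mu_1)$ and $(E_2, \mathcal{E}_2, \mu_2)$, the **product $\sigma$-algebra** is $\mathcal{E}_1 \otimes \mathcal{E}_2 = \sigma(\{A_1 \times A_2 : A_1 \in \mathcal{E}_1, A_2 \in \mathcal{E}_2\})$. The **product measure** $\mu_1 \otimes \mu_2$ is the unique measure on $\mathcal{E}_1 \otimes \mathcal{E}_2$ satisfying $(\mu_1 \otimes \mu_2)(A_1 \times A_2) = \mu_1(A_1) \, \mu_2(A_2)$ for all $A_1 \in \mathcal{E}_1$, $A_2 \in \mathcal{E}_2$; uniqueness is guaranteed by $\sigma$-finiteness.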
[/definition]
Fubini's theorem justifies interchanging the order of integration in iterated integrals against the product measure.
[quotetheorem:513]
## 1.7 Examples of Conditional Expectation
### 1.7.1 The Gaussian Case
The Gaussian case provides a fully explicit computation that illustrates how the structure of the joint distribution determines the conditional expectation. Let $(X, Y)$ be a Gaussian random vector in $\mathbb{R}^2$ and set $\mathcal{G} = \sigma(Y)$. One seeks $\mathbb{E}[X \mid \mathcal{G}]$.
Since $\mathbb{E}[X \mid \mathcal{G}]$ must be $\sigma(Y)$-measurable, the Doob--Dynkin lemma guarantees that $\mathbb{E}[X \mid \mathcal{G}] = f(Y)$ for some Borel function $f$. One tries the ansatz $\mathbb{E}[X \mid \mathcal{G}] = aY + b$ for constants $a, b \in \mathbb{R}$.
[example: Gaussian Conditional Expectation]
Let $(X, Y)$ be a Gaussian vector. The averaging property $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}]] = \mathbb{E}[X]$ requires $a\mathbb{E}[Y] + b = \mathbb{E}[X]$. The integral-matching condition $\mathbb{E}[XY] = \mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \cdot Y]$ yields:
\begin{align*}
\mathbb{E}[(X - aY - b)Y] = 0 \implies \operatorname{Cov}(X, Y) = a \operatorname{Var}(Y).
\end{align*}
Setting $a = \operatorname{Cov}(X, Y) / \operatorname{Var}(Y)$ (assuming $\operatorname{Var}(Y) > 0$), one obtains $\operatorname{Cov}(X - aY - b, Y) = 0$. Since $(X - aY - b, Y)$ is a Gaussian vector, zero covariance implies independence. Therefore $X - aY - b$ is independent of $\sigma(Y)$, and for any bounded $\sigma(Y)$-measurable $Z$:
\begin{align*}
\mathbb{E}[(X - aY - b) Z] = \mathbb{E}[X - aY - b] \cdot \mathbb{E}[Z] = 0.
\end{align*}
This confirms that $\mathbb{E}[X \mid Y] = aY + b$ where:
\begin{align*}
a = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)}, \qquad b = \mathbb{E}[X] - a \mathbb{E}[Y].
\end{align*}
[/example]
This computation reveals a remarkable feature of the Gaussian case: the conditional expectation is a linear function of the conditioning variable, and the "residual" $X - \mathbb{E}[X \mid Y]$ is independent of $Y$ (not merely uncorrelated). For non-Gaussian distributions, uncorrelated does not imply independent, and the conditional expectation need not be linear.
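The formulas $a = \operatorname{Cov}(X, Y)/\operatorname{Var}(Y)$ and $b = \mathbb{E}[X] - a\mathbb{E}[Y]$ can be sanity-checked by simulation. The following is a rough sketch using only the standard library; the means, variances, sample size and seed are all hypothetical choices, and the comparison is statistical rather than exact.

```python
import random

# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
mu_x, mu_y = 1.0, -2.0
sigma_y, slope, noise_sd = 2.0, 0.75, 1.5

# Gaussian vector: Y ~ N(mu_y, sigma_y^2), X = mu_x + slope * (Y - mu_y) + independent noise.
# For this construction Cov(X, Y) / Var(Y) = slope and E[X] - slope * E[Y] = 2.5.
n = 50_000
ys = [rng.gauss(mu_y, sigma_y) for _ in range(n)]
xs = [mu_x + slope * (y - mu_y) + rng.gauss(0.0, noise_sd) for y in ys]


def mean(values):
    return sum(values) / len(values)


mx, my = mean(xs), mean(ys)
cov_xy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
var_y = mean([(y - my) ** 2 for y in ys])

a_hat = cov_xy / var_y       # estimate of a = Cov(X, Y) / Var(Y)
b_hat = mx - a_hat * my      # estimate of b = E[X] - a E[Y]

# Independence of the residual from Y implies zero covariance with any function of Y, e.g. Y^2.
residuals = [x - a_hat * y - b_hat for x, y in zip(xs, ys)]
y_sq = [y * y for y in ys]
mean_y_sq = mean(y_sq)
cov_res_ysq = mean([r * (q - mean_y_sq) for r, q in zip(residuals, y_sq)])

print(a_hat, b_hat, cov_res_ysq)
```

The covariance of the residual with $Y^2$ is the interesting quantity: for a non-Gaussian pair it need not be close to zero even when the residual is uncorrelated with $Y$ itself.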
### 1.7.2 Conditional Density Functions
When $(X, Y)$ has a joint density, the conditional expectation can be expressed using conditional densities.
[example: Conditional Density]
Let $X$ and $Y$ be random variables with joint density $f_{X,Y} : \mathbb{R}^2 \to [0, \infty)$ and let $h : \mathbb{R} \to \mathbb{R}$ be Borel with $h(X)$ integrable. The marginal density of $Y$ is $f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x,y) \, dx$. For any bounded measurable $g : \mathbb{R} \to \mathbb{R}$:
\begin{align*}
\mathbb{E}[h(X) g(Y)] &= \int_{\mathbb{R}} \int_{\mathbb{R}} h(x) g(y) f_{X,Y}(x,y) \, dx \, dy \\
&= \int_{\mathbb{R}} \left(\int_{\mathbb{R}} h(x) \frac{f_{X,Y}(x,y)}{f_Y(y)} \, dx\right) g(y) f_Y(y) \, dy,
\end{align*}
where one uses the convention $0/0 = 0$. Defining $\varphi(y) = \int_{\mathbb{R}} h(x) \, f_{X|Y}(x \mid y) \, dx$ where $f_{X|Y}(x \mid y) = f_{X,Y}(x,y) / f_Y(y)$ for $f_Y(y) > 0$, one obtains:
\begin{align*}
\mathbb{E}[h(X) \mid Y] = \varphi(Y) \quad \text{a.s.}
\end{align*}
The function $f_{X|Y}(x \mid y)$ is the **conditional density** of $X$ given $Y = y$, and $\nu(y, dx) = f_{X|Y}(x \mid y) \, dx$ is the **conditional distribution** of $X$ given $Y = y$.
[/example]
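As a concrete instance of the conditional density formula, consider the joint density $f_{X,Y}(x,y) = x + y$ on $[0,1]^2$ with $h(x) = x$; this density is a choice made here purely for illustration. A short computation gives the marginal, the conditional density, and the conditional expectation:
\begin{align*}
f_Y(y) &= \int_0^1 (x + y) \, dx = \tfrac{1}{2} + y, \\
f_{X|Y}(x \mid y) &= \frac{x + y}{\tfrac{1}{2} + y}, \qquad 0 \leq x, y \leq 1, \\
\varphi(y) &= \int_0^1 x \, \frac{x + y}{\tfrac{1}{2} + y} \, dx = \frac{\tfrac{1}{3} + \tfrac{y}{2}}{\tfrac{1}{2} + y} = \frac{2 + 3y}{3 + 6y}.
\end{align*}
Hence $\mathbb{E}[X \mid Y] = (2 + 3Y)/(3 + 6Y)$ a.s. As a consistency check with the averaging property, $\mathbb{E}[\varphi(Y)] = \int_0^1 \varphi(y) f_Y(y) \, dy = \int_0^1 \left(\tfrac{1}{3} + \tfrac{y}{2}\right) dy = \tfrac{7}{12}$, which agrees with $\mathbb{E}[X] = \int_0^1 \int_0^1 x (x + y) \, dx \, dy = \tfrac{7}{12}$.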
### 1.7.3 When Conditional Expectation Fails to Exist
The integrability hypothesis $X \in L^1$ in the existence theorem is not a mere technicality. When $X$ is not integrable, there need not be any integrable (or even finite-valued) random variable satisfying the two defining conditions.
[example: Non-Integrable Conditional Expectation]
Let $\Omega = [0,1]$ with Lebesgue measure $\mathbb{P}$ and let $X(\omega) = 1/\omega$ for $\omega \in (0,1]$. Then $\mathbb{E}[|X|] = \int_0^1 1/\omega \, d\omega = +\infty$, so $X \notin L^1$. Let $\mathcal{G} = \sigma(\{[0, 1/2], (1/2, 1]\})$ be the $\sigma$-algebra generated by the partition into two halves. If one naively attempts to define $\mathbb{E}[X \mid \mathcal{G}]$, the averaging over $[0, 1/2]$ gives:
\begin{align*}
\frac{1}{\mathbb{P}([0, 1/2])} \int_0^{1/2} \frac{1}{\omega} \, d\omega = 2 \cdot (+\infty) = +\infty.
\end{align*}
Any finite-valued $\mathcal{G}$-measurable function $Y$ is constant on $[0, 1/2]$, say equal to $c$, so $\mathbb{E}[Y \mathbb{1}_{[0,1/2]}] = c/2$ is finite and cannot equal $\mathbb{E}[X \mathbb{1}_{[0,1/2]}] = +\infty$. The construction breaks down at the $L^2$ stage (since $X \notin L^2$) and cannot be rescued by the monotone approximation step: the limit $Y = \lim_n Y_n$ of the truncated conditional expectations $Y_n = \mathbb{E}[\min\{X, n\} \mid \mathcal{G}]$ equals $+\infty$ on $[0, 1/2]$, a set of probability $1/2$, so $\mathbb{E}[Y] = +\infty$.
[/example]
This example is not merely pathological. In applications to mathematical finance, the integrability requirement for conditional expectations imposes genuine constraints on which random variables can serve as "prices" conditional on available information.
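The divergence in the non-integrable example can be made quantitative. For $X(\omega) = 1/\omega$ and $n \geq 2$, the truncated conditional expectation $\mathbb{E}[\min\{X, n\} \mid \mathcal{G}]$ takes the value $2(1 + \log(n/2))$ on $[0, 1/2]$. The sketch below evaluates this closed form and compares it with a crude midpoint Riemann sum; the function names and the number of quadrature steps are hypothetical choices.

```python
import math


def truncated_block_average(n):
    """Value of E[min(X, n) | G] on [0, 1/2] for X(w) = 1/w, valid for n >= 2.

    2 * ( int_0^{1/n} n dw + int_{1/n}^{1/2} dw / w ) = 2 * (1 + log(n / 2)).
    """
    if n < 2:
        raise ValueError("closed form assumes n >= 2")
    return 2.0 * (1.0 + math.log(n / 2.0))


def midpoint_check(n, steps=200_000):
    """Midpoint Riemann sum of 2 * int_0^{1/2} min(1/w, n) dw, as a sanity check."""
    h = 0.5 / steps
    return 2.0 * sum(min(1.0 / ((k + 0.5) * h), n) for k in range(steps)) * h


values = [truncated_block_average(n) for n in (2, 10, 100, 10_000, 10**8)]
print(values)                                       # grows like 2 log n, without bound
print(midpoint_check(10), truncated_block_average(10))
```

The growth is only logarithmic in the truncation level, but it is unbounded, which is all that is needed for the monotone limit to be infinite on $[0, 1/2]$.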
The chapter has established the algebraic framework of conditional expectation: existence, uniqueness, the tower property, independence, and the convergence theorems. These tools are the foundation for martingale theory, to which we now turn.
# 2. Discrete-Time Martingales
The previous chapter provided the algebraic framework of conditional expectation. The present chapter uses it to develop the theory of martingales, the stochastic processes that model "fair games" and the central tool in modern probability. The key results are the optional stopping theorem, which controls expectations at random times, and the almost sure convergence theorem, which asserts that $L^1$-bounded martingales converge almost surely. The chapter concludes with applications to the strong law of large numbers, Kolmogorov's $0$-$1$ law, and the Radon--Nikodym theorem, each proved elegantly via martingale methods.
## 2.1 Filtrations and the Martingale Condition
The definition of a martingale requires the notion of a filtration --- an increasing family of $\sigma$-algebras representing the evolution of information over time. The challenge is to formalise the idea that at each time $n$, an observer knows the outcomes of all past events but cannot see into the future. The filtration provides this formalisation: $\mathcal{F}_n$ encodes all events that the observer can evaluate at time $n$.
[definition: Filtration]
A **filtration** on $(\Omega, \mathcal{F}, \mathbb{P})$ is an increasing family $(\mathcal{F}_n)_{n \geq 0}$ of sub-$\sigma$-algebras: $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for all $n$. A process $X = (X_n)_{n \geq 0}$ is **adapted** to $(\mathcal{F}_n)$ if $X_n$ is $\mathcal{F}_n$-measurable for all $n$. The **natural filtration** of $X$ is $\mathcal{F}_n^X = \sigma(X_k : k \leq n)$.
[/definition]
One thinks of $\mathcal{F}_n$ as the information available at time $n$. Every process is adapted to its natural filtration.
With the filtration in place, one can define the central concept. A martingale is a process whose conditional expectation, given the past, equals the current value. This captures the notion of a "fair game": knowing the entire history gives no advantage in predicting the next value.
[definition: Martingale]
An adapted, integrable process $X = (X_n)_{n \geq 0}$ is:
- a **martingale** if $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ a.s. for all $n \geq m$;
- a **supermartingale** if $\mathbb{E}[X_n \mid \mathcal{F}_m] \leq X_m$ a.s. for all $n \geq m$;
- a **submartingale** if $\mathbb{E}[X_n \mid \mathcal{F}_m] \geq X_m$ a.s. for all $n \geq m$.
[/definition]
By the tower property, it suffices to verify the one-step condition: $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ (resp. $\leq$, $\geq$). Every process that is a martingale (resp. supermartingale, submartingale) with respect to a filtration $(\mathcal{F}_n)$ is also one with respect to its natural filtration, since $\mathcal{F}_n^X \subset \mathcal{F}_n$ and the tower property applies.
The simplest examples arise from sums and products of independent random variables.
[example: Random Walk Martingale]
Let $(\xi_i)_{i \geq 1}$ be i.i.d. with $\mathbb{E}[\xi_1] = 0$, and set $X_n = \sum_{i=1}^n \xi_i$ with $X_0 = 0$. Then $X$ is a martingale with respect to $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$. Indeed, since $\xi_{n+1}$ is independent of $\mathcal{F}_n$:
\begin{align*}
\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = \mathbb{E}[\xi_{n+1} \mid \mathcal{F}_n] + X_n = \mathbb{E}[\xi_{n+1}] + X_n = X_n.
\end{align*}
[/example]
The next example illustrates that multiplicative structure also produces martingales.
[example: Product Martingale]
Let $(\xi_i)_{i \geq 1}$ be i.i.d. with $\mathbb{E}[\xi_1] = 1$ and $\xi_i \geq 0$. Set $M_n = \prod_{i=1}^n \xi_i$ with $M_0 = 1$. The "taking out what is known" property (extended by approximation to this unbounded case via DCT) and independence give:
\begin{align*}
\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n \, \mathbb{E}[\xi_{n+1} \mid \mathcal{F}_n] = M_n \, \mathbb{E}[\xi_{n+1}] = M_n.
\end{align*}
So $M$ is a non-negative martingale. Such product martingales play a central role in Kakutani's theorem and the theory of likelihood ratios.
[/example]
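The one-step martingale identity for a product martingale can be verified exactly by enumerating histories. The sketch below assumes a particular step distribution, $\xi \in \{1/2, 3/2\}$ with equal probability, chosen only because it is non-negative with mean $1$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical step distribution: xi is 1/2 or 3/2 with probability 1/2 each, so E[xi] = 1.
steps = [Fraction(1, 2), Fraction(3, 2)]


def M(path):
    """Product martingale M_n = xi_1 * ... * xi_n along a path (empty product = 1)."""
    out = Fraction(1)
    for xi in path:
        out *= xi
    return out


n = 4
one_step_ok = True
for history in product(steps, repeat=n):
    # E[M_{n+1} | F_n] on the atom {(xi_1, ..., xi_n) = history}: average over the next step.
    cond = sum(M(history + (xi,)) for xi in steps) / len(steps)
    one_step_ok = one_step_ok and cond == M(history)

# Unconditional expectation E[M_n], averaging over all 2^n equally likely histories.
expected_M = sum(M(path) for path in product(steps, repeat=n)) / len(steps) ** n
print(one_step_ok, expected_M)
```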
## 2.2 Stopping Times
A central idea in martingale theory is to evaluate a process at a random time determined by the process itself. The challenge is that not every random time is compatible with the filtration --- one must not "look into the future."
[definition: Stopping Time]
A **stopping time** $T$ is a random variable $T : \Omega \to \mathbb{Z}_+ \cup \{\infty\}$ such that $\{T \leq n\} \in \mathcal{F}_n$ for all $n$. Equivalently, $\{T = n\} \in \mathcal{F}_n$ for all $n$.
[/definition]
The distinction between stopping times and general random times is illustrated by the following examples.
[example: Stopping Time Examples]
(a) Constant times $T = n$ are stopping times.
(b) For an adapted process $X$ and a Borel set $A$, the first entrance time $T_A = \inf\{n \geq 0 : X_n \in A\}$ (with $\inf \varnothing = \infty$) is a stopping time, since $\{T_A \leq n\} = \bigcup_{k \leq n} \{X_k \in A\} \in \mathcal{F}_n$.
(c) Last exit times are generally **not** stopping times. For instance, $L_A = \sup\{n \leq 10 : X_n \in A\}$ requires knowledge of the future trajectory after time $n$ to determine whether $n$ is the last visit.
[/example]
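The difference between (b) and (c) is that the event $\{T \leq n\}$ must be decidable from the path up to time $n$. The sketch below illustrates this with two hypothetical paths that share a prefix: the first entrance time gives the same answer on both, while the last exit time does not.

```python
import math


def first_entrance(path, A):
    """T_A = inf{n >= 0 : X_n in A}, with the infimum of the empty set equal to infinity."""
    for n, x in enumerate(path):
        if x in A:
            return n
    return math.inf


def last_exit(path, A, horizon=10):
    """L_A = sup{n <= horizon : X_n in A}; returns None if A is never visited."""
    visits = [n for n, x in enumerate(path[: horizon + 1]) if x in A]
    return visits[-1] if visits else None


# Two hypothetical paths that agree up to time 4 and differ afterwards.
path_1 = [0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8]
path_2 = [0, 1, 0, 1, 2, 1, 0, 1, 2, 3, 4]

n = 4
same_prefix = path_1[: n + 1] == path_2[: n + 1]

# {T <= n} is decided by the prefix: both paths agree for T = first entrance into {2}.
t1 = first_entrance(path_1, {2}) <= n
t2 = first_entrance(path_2, {2}) <= n

# {L <= n} is not: the prefix is identical, yet the last visits to {0} differ.
l1 = last_exit(path_1, {0})
l2 = last_exit(path_2, {0})
print(same_prefix, t1, t2, l1, l2)
```

On the first path the last visit to $0$ occurs at time $2$, on the second at time $6$, so an observer at time $4$ cannot tell whether the last exit has already happened.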
The $\sigma$-algebra $\mathcal{F}_T = \{A \in \mathcal{F} : A \cap \{T \leq n\} \in \mathcal{F}_n \text{ for all } n\}$ captures "the information available at time $T$," and the **stopped process** is defined by $X^T_n = X_{\min\{T, n\}}$. Three basic facts hold for stopping times $S, T$ and an adapted process $X$: if $S \leq T$, then $\mathcal{F}_S \subset \mathcal{F}_T$; the random variable $X_T \mathbb{1}_{T < \infty}$ is $\mathcal{F}_T$-measurable; and $X^T$ is adapted, and integrable whenever $X$ is.
## 2.3 Optional Stopping
Under what conditions does the identity $\mathbb{E}[X_T] = \mathbb{E}[X_0]$ hold for a martingale $X$ and a stopping time $T$? Without additional hypotheses, it can fail: if $X_n = \sum_{k=1}^n \xi_k$ is a simple symmetric random walk and $T = \inf\{n : X_n = 1\}$, then $T < \infty$ a.s. but $\mathbb{E}[X_T] = 1 \neq 0 = \mathbb{E}[X_0]$. This example shows that some boundedness or integrability condition is necessary.
[quotetheorem:1153]
The proof strategy for part (i) is to verify the one-step martingale condition by decomposing $X_{\min\{T,t\}}$ according to whether $T \leq t-1$ or $T \geq t$, then using the measurability of the stopping time event. Parts (ii)--(v) build on part (i) by passing to limits under progressively weaker domination hypotheses.
[citeproof:1153]
The telescoping technique in part (ii) --- decomposing $X_T - X_S$ as a sum of one-step increments weighted by the indicator $\mathbb{1}_{S \leq k < T}$ --- is a standard method in discrete martingale theory. It reappears in the proof of Doob's upcrossing inequality below and in the proofs of various decomposition theorems (Doob's decomposition, the compensator).
The simple random walk example mentioned above illustrates why part (v) requires $\mathbb{E}[T] < \infty$: the walk $X_n$ stopped at $T = \inf\{n : X_n = 1\}$ has bounded increments and $T < \infty$ a.s., but $\mathbb{E}[T] = \infty$, and the conclusion fails.
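The failure of optional stopping for $T = \inf\{n : X_n = 1\}$ can be seen in exact arithmetic: the stopped walk has mean zero at every fixed time, even though it equals $1$ with probability tending to one. The sketch below computes the law of $X_{\min\{T, n\}}$ by dynamic programming; the function name and the chosen horizons are hypothetical.

```python
from fractions import Fraction


def stopped_walk_law(n):
    """Exact law of X_{min(T, n)} for simple symmetric random walk, T = inf{k : X_k = 1}."""
    law = {0: Fraction(1)}          # position -> probability at time 0
    half = Fraction(1, 2)
    for _ in range(n):
        new = {}
        for x, pr in law.items():
            if x == 1:              # already stopped: the walk is frozen at level 1
                new[x] = new.get(x, 0) + pr
            else:
                for step in (-1, 1):
                    new[x + step] = new.get(x + step, 0) + pr * half
        law = new
    return law


results = []
for n in (1, 10, 50, 200):
    law = stopped_walk_law(n)
    mean = sum(x * pr for x, pr in law.items())   # E[X_{min(T, n)}]
    p_stopped = law.get(1, Fraction(0))           # P(T <= n)
    results.append((n, mean, p_stopped))
    print(n, mean, float(p_stopped))
```

The mean stays at $0$ because the small probability of not yet having stopped is carried by increasingly negative positions; no integrable random variable dominates the stopped walk, so the limit cannot be taken inside the expectation.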
For non-negative supermartingales, an unconditional stopping result holds without boundedness assumptions.
[quotetheorem:1154]
The key insight is that non-negativity provides a built-in lower bound that allows the use of Fatou's lemma instead of the dominated convergence theorem.
[citeproof:1154]
The use of Fatou's lemma (rather than DCT) is what makes the non-negative supermartingale result unconditional: Fatou requires only non-negativity, not a dominating function. This technique --- replacing DCT by Fatou when non-negativity is available --- is a standard move in probability and recurs in the proof of convergence of non-negative supermartingales.
## 2.4 Gambler's Ruin
The optional stopping theorem provides an elegant computation of ruin probabilities. Let $(\xi_i)_{i \geq 1}$ be i.i.d. with $\mathbb{P}(\xi_1 = +1) = \mathbb{P}(\xi_1 = -1) = 1/2$, and set $X_n = \sum_{i=1}^n \xi_i$ (simple symmetric random walk with $X_0 = 0$). For positive integers $a, b$, define $T = \min\{T_{-a}, T_b\}$ where $T_c = \inf\{n : X_n = c\}$.
[example: Gambler's Ruin]
The random walk $X$ is a martingale with bounded increments ($|X_{n+1} - X_n| = 1$). To apply part (v) of the optional stopping theorem, one needs $\mathbb{E}[T] < \infty$.
The time $T$ is bounded above by the first time $a + b$ consecutive $+1$'s appear. The probability that $\xi_1, \ldots, \xi_{a+b}$ are all $+1$ is $2^{-(a+b)}$. Successive disjoint blocks of length $a+b$ are independent, so the number of blocks needed is geometric with success probability $2^{-(a+b)}$. Therefore $\mathbb{E}[T] \leq (a+b) \cdot 2^{a+b} < \infty$.
The optional stopping theorem gives $\mathbb{E}[X_T] = \mathbb{E}[X_0] = 0$. Since $X_T \in \{-a, b\}$:
\begin{align*}
0 = -a \, \mathbb{P}(T_{-a} < T_b) + b \, \mathbb{P}(T_b < T_{-a}),
\end{align*}
and $\mathbb{P}(T_{-a} < T_b) + \mathbb{P}(T_b < T_{-a}) = 1$ since $T < \infty$ a.s. Solving:
\begin{align*}
\mathbb{P}(T_{-a} < T_b) = \frac{b}{a+b}.
\end{align*}
[/example]
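The formula $\mathbb{P}(T_{-a} < T_b) = b/(a+b)$ can be cross-checked in two ways. The sketch below assumes the barriers $a = 3$, $b = 5$ and a fixed random seed, both arbitrary: it verifies that $h(x) = (b - x)/(a + b)$ has the correct boundary values and the mean-value property of the walk, and compares $h(0)$ with a Monte Carlo estimate.

```python
import random
from fractions import Fraction

a, b = 3, 5   # hypothetical barriers; the walk starts at 0 and stops at -a or b


def ruin_probability_mc(a, b, trials, rng):
    """Monte Carlo estimate of P(T_{-a} < T_b) for simple symmetric random walk."""
    ruined = 0
    for _ in range(trials):
        x = 0
        while -a < x < b:
            x += rng.choice((-1, 1))
        ruined += x == -a
    return ruined / trials


# Exact check: h(x) = (b - x) / (a + b) equals 1 at -a, 0 at b, and satisfies
# h(x) = (h(x - 1) + h(x + 1)) / 2 strictly between the barriers.
h = {x: Fraction(b - x, a + b) for x in range(-a, b + 1)}
harmonic = all(h[x] == (h[x - 1] + h[x + 1]) / 2 for x in range(-a + 1, b))
boundary = h[-a] == 1 and h[b] == 0

estimate = ruin_probability_mc(a, b, trials=20_000, rng=random.Random(1))
print(h[0], estimate, harmonic, boundary)
```

The mean-value property is the one-step martingale condition in disguise: it says that $h(X_n)$ is a martingale up to time $T$, which is consistent with the optional stopping argument in the example.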
## 2.5 Martingale Convergence
Does a martingale $(X_n)$ converge as $n \to \infty$? Unlike the usual situation in analysis, one typically has no candidate for the limit. The remarkable fact is that an $L^1$-bounded supermartingale converges almost surely, and the proof uses a counting argument based on "upcrossings."
For a real sequence $x = (x_n)$ and $a < b$, let $N_n([a,b], x)$ denote the number of upcrossings of the interval $[a,b]$ by $x$ up to time $n$: the number of times the sequence drops below $a$ and subsequently rises above $b$. Let $N([a,b], x) = \lim_n N_n([a,b], x)$ be the total number of upcrossings.
[quotetheorem:1155]
This is a purely deterministic result --- it characterises convergence of sequences in terms of upcrossing counts. The proof is a straightforward application of the definition of $\liminf$ and $\limsup$.
[citeproof:1155]
The upcrossing criterion converts the convergence problem into a counting problem, and the power of the probabilistic approach is that the upcrossing count can be bounded using the supermartingale inequality.
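Counting upcrossings of a finite sequence is a simple two-state scan. The sketch below is one possible implementation; it assumes the convention that an upcrossing begins when the sequence is strictly below $a$ and is completed when it next goes strictly above $b$, which may differ from the convention in the theorem statement by the treatment of equalities.

```python
def count_upcrossings(xs, a, b):
    """Number of completed upcrossings of [a, b] by the finite sequence xs.

    Convention assumed here: an upcrossing starts when the sequence is < a
    and is completed the next time it is > b.
    """
    if not a < b:
        raise ValueError("need a < b")
    count = 0
    below = False   # True once the sequence has dipped below a since the last upcrossing
    for x in xs:
        if not below and x < a:
            below = True
        elif below and x > b:
            count += 1
            below = False
    return count


# An oscillating sequence keeps accumulating upcrossings as it is extended,
# while a convergent one makes only finitely many for each fixed a < b.
oscillating = [(-1) ** n for n in range(20)]              # 1, -1, 1, -1, ...
convergent = [(-1) ** n / (n + 1) for n in range(200)]    # tends to 0

print(count_upcrossings(oscillating, -0.5, 0.5))
print(count_upcrossings(convergent, -0.15, 0.15))
```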
[quotetheorem:1156]
The proof is the technical heart of martingale convergence theory. The strategy is to bound the expected number of upcrossings by decomposing the trajectory into pieces corresponding to completed upcrossings and a "leftover" part, then applying the optional stopping theorem to each piece. The key difficulty is the incomplete upcrossing at the end: its increment may be negative, but it is bounded below by $-(X_n - a)^-$, which is the source of the negative-part term in the inequality.
[citeproof:1156]
The upcrossing inequality is remarkable for what it bounds the upcrossing count by: the negative part $(X_n - a)^-$, which is related to how far below $a$ the process falls. For a non-negative supermartingale with $a = 0$, the bound simplifies to $(b-0) \mathbb{E}[N_n] \leq \mathbb{E}[X_n^-] = 0$, giving zero upcrossings --- consistent with the process remaining non-negative. The technique of decomposing trajectories into upcrossing segments and applying optional stopping to each segment reappears in the continuous-time setting (Chapter 3), where one applies the same argument to the discrete skeleton $\{X_{q} : q \in \mathbb{Q}\}$.
[quotetheorem:1157]
The proof combines the upcrossing inequality with the convergence criterion: the $L^1$ bound controls the expected number of upcrossings, which forces convergence on a set of full measure.
[citeproof:1157]
A non-negative supermartingale $X$ satisfies $\mathbb{E}[|X_n|] = \mathbb{E}[X_n] \leq \mathbb{E}[X_0] < \infty$, so it is automatically $L^1$-bounded. This gives the important corollary: **every non-negative supermartingale converges a.s. to a finite limit.**
## 2.6 Doob's Inequalities
The maximal inequality controls the probability that a submartingale exceeds a threshold.
[quotetheorem:1158]
The strategy is to introduce the stopping time $T = \inf\{k : X_k \geq \lambda\}$ and use the submartingale property at the bounded stopping time $\min\{T, n\}$.
[citeproof:1158]
The maximal inequality is a probabilistic analogue of the Hardy--Littlewood maximal inequality in harmonic analysis. The stopping-time technique used here --- introducing a random time that detects when the process first exceeds a threshold --- reappears in the proof of Doob's $L^p$ inequality and in the maximal inequalities for continuous-time processes (Chapter 3).
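A maximal inequality of this kind can be checked by exhaustive enumeration for a short random walk. The sketch below assumes the standard form $\lambda \, \mathbb{P}(X_n^* \geq \lambda) \leq \mathbb{E}[X_n \mathbb{1}_{X_n^* \geq \lambda}] \leq \mathbb{E}[X_n]$ for a non-negative submartingale with $X_n^* = \max_{k \leq n} X_k$, which may be stated slightly differently in the theorem above; the horizon and threshold are arbitrary.

```python
from fractions import Fraction
from itertools import product

n, lam = 10, 3   # hypothetical horizon and threshold

lhs = Fraction(0)       # lambda * P(X_n^* >= lambda)
middle = Fraction(0)    # E[X_n 1{X_n^* >= lambda}]
total = Fraction(0)     # E[X_n]
weight = Fraction(1, 2 ** n)

# X_k = |S_k| for simple symmetric random walk S is a non-negative submartingale.
for steps in product((-1, 1), repeat=n):
    s, running_max = 0, 0
    for step in steps:
        s += step
        running_max = max(running_max, abs(s))
    total += abs(s) * weight
    if running_max >= lam:
        lhs += lam * weight
        middle += abs(s) * weight

print(lhs, middle, total, lhs <= middle <= total)
```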
[quotetheorem:1159]
The proof combines Doob's maximal inequality with the layer cake representation and H\"older's inequality. The key step is converting the $L^p$ norm of the maximum into an integral involving the tail probability $\mathbb{P}(X_n^* \geq x)$, then bounding this using the maximal inequality.
[citeproof:1159]
The constant $p/(p-1)$ is sharp. It blows up as $p \downarrow 1$, and no inequality of this form holds at $p = 1$: a martingale bounded in $L^1$ converges a.s. but need not converge in $L^1$. This gap between $L^1$ boundedness (which gives a.s. convergence) and $L^1$ convergence motivates the introduction of uniform integrability later in this chapter.
## 2.7 $L^p$ Convergence for $p > 1$
For $p > 1$, the $L^p$ boundedness of a martingale characterises exactly when convergence in $L^p$ occurs.
[quotetheorem:1160]
The three implications form a cycle: (i) $\Rightarrow$ (ii) uses Doob's $L^p$ inequality to provide a dominating function for DCT, (ii) $\Rightarrow$ (iii) identifies the limit as the closing random variable, and (iii) $\Rightarrow$ (i) uses the $L^p$ contraction of conditional expectation.
[citeproof:1160]
A martingale of the form $X_n = \mathbb{E}[Z \mid \mathcal{F}_n]$ with $Z \in L^p$ is called **closed** in $L^p$. For such a martingale, $X_n \to \mathbb{E}[Z \mid \mathcal{F}_\infty]$ a.s. and in $L^p$, where $\mathcal{F}_\infty = \sigma(\mathcal{F}_n : n \geq 0)$.
## 2.8 Uniform Integrability
The $L^p$ convergence theorem requires $p > 1$. For $p = 1$, the correct substitute for $L^1$-boundedness is uniform integrability.
[definition: Uniform Integrability]
A family $(X_i)_{i \in I}$ of random variables is **uniformly integrable** (UI) if:
\begin{align*}
\sup_{i \in I} \mathbb{E}[|X_i| \mathbb{1}_{|X_i| > \alpha}] \to 0 \quad \text{as } \alpha \to \infty.
\end{align*}
[/definition]
Every UI family is $L^1$-bounded, but the converse fails. However, $L^p$-boundedness for $p > 1$ implies UI, by H\"older's inequality: $\mathbb{E}[|X_i| \mathbb{1}_A] \leq \|X_i\|_p \, \mathbb{P}(A)^{1/q}$ where $q = p/(p-1)$.
A fundamental source of UI families is conditional expectation:
[quotetheorem:1161]
The proof uses the absolute continuity of the integral: $X \in L^1$ implies that $\mathbb{E}[|X| \mathbb{1}_A]$ can be made uniformly small by choosing $\mathbb{P}(A)$ small. This is then combined with Markov's inequality to control $\mathbb{P}(|Y| \geq \lambda)$ for $Y = \mathbb{E}[X \mid \mathcal{G}]$.
[citeproof:1161]
The absolute continuity of the integral --- used here to obtain uniform smallness of $\mathbb{E}[|X| \mathbb{1}_A]$ for small $\mathbb{P}(A)$ --- is one of the fundamental facts of integration theory. This technique reappears in the proof of the Radon--Nikodym theorem below.
The connection between UI and $L^1$ convergence is the following characterisation.
[quotetheorem:1162]
Combining the UI characterisation with the a.s. convergence theorem yields the complete $L^1$ convergence theory for martingales.
[quotetheorem:1163]
The proof follows the same cyclic pattern as the $L^p$ convergence theorem, with UI playing the role of $L^p$-boundedness.
[citeproof:1163]
For a UI martingale, the optional stopping theorem holds without any boundedness assumptions on the stopping times:
[quotetheorem:1164]
The proof uses the representation $X_n = \mathbb{E}[X_\infty \mid \mathcal{F}_n]$ to reduce to verifying the integral-matching condition.
[citeproof:1164]
## 2.9 Backwards Martingales
A backwards martingale reverses the time direction: the filtration $(\mathcal{G}_n)_{n \leq 0}$ is decreasing ($\mathcal{G}_{n-1} \subset \mathcal{G}_n$), and $\mathbb{E}[X_{n+1} \mid \mathcal{G}_n] = X_n$ for $n \leq -1$. By the tower property, $X_n = \mathbb{E}[X_0 \mid \mathcal{G}_n]$ for all $n \leq 0$, so a backwards martingale is automatically UI (by the theorem on conditional expectations of an integrable random variable).
[quotetheorem:1165]
The proof applies the upcrossing inequality to the time-reversed process, then uses the automatic uniform integrability of backwards martingales to upgrade a.s. convergence to $L^p$ convergence.
[citeproof:1165]
The automatic UI of backwards martingales is the key advantage. For forward martingales, UI must be verified on a case-by-case basis, and this is often the hardest step. Backwards martingales sidestep this entirely because they are representations of a fixed $L^1$ function ($X_0$) projected onto successively coarser $\sigma$-algebras.
## 2.10 Applications of Martingale Theory
### 2.10.1 Kolmogorov's $0$-$1$ Law
[quotetheorem:512]
The proof strategy is to construct a martingale that converges to $\mathbb{1}_A$ and simultaneously equals the constant $\mathbb{P}(A)$ at every time step, forcing $\mathbb{1}_A = \mathbb{P}(A)$ a.s.
[citeproof:512]
The elegance of the martingale proof lies in the tension it creates: the martingale $M_n$ is constant (equal to $\mathbb{P}(A)$) at every finite time, yet its limit must be $\mathbb{1}_A$, which takes only the values $0$ and $1$. The only resolution is $\mathbb{P}(A) \in \{0, 1\}$. This technique --- constructing a martingale that is constant at each step and then identifying its limit --- reappears in the proof of Blumenthal's $0$-$1$ law for Brownian motion (Chapter 6).
### 2.10.2 The Strong Law of Large Numbers
[quotetheorem:520]
The proof constructs a backwards martingale from the partial sums. The key insight is a symmetry argument: since the $X_i$ are i.i.d., $\mathbb{E}[X_k \mid S_n] = S_n / n$ for all $1 \leq k \leq n$. This is the step where the i.i.d. hypothesis is used most powerfully.
[citeproof:520]
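The symmetry identity $\mathbb{E}[X_k \mid S_n] = S_n / n$ can be verified by brute force on a finite example. The sketch below assumes the $X_i$ are i.i.d. uniform on the faces of a die, and the helper name `conditional_means` is hypothetical. It enumerates all outcomes and computes the conditional expectation exactly with rational arithmetic.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def conditional_means(values, n, k):
    """Return {s: E[X_k | S_n = s]} for X_1..X_n i.i.d. uniform on `values`."""
    total = defaultdict(Fraction)   # sum of X_k over outcomes with S_n = s
    count = defaultdict(int)        # number of outcomes with S_n = s
    for outcome in product(values, repeat=n):
        s = sum(outcome)
        total[s] += outcome[k - 1]
        count[s] += 1
    return {s: total[s] / count[s] for s in count}

n = 4
dice = range(1, 7)
for k in range(1, n + 1):
    cond = conditional_means(dice, n, k)
    # The conditional mean of each coordinate is the same share of the sum.
    assert all(cond[s] == Fraction(s, n) for s in cond)
print("E[X_k | S_n] = S_n / n verified for n =", n)
```

The check succeeds for every $k$ because relabelling the coordinates does not change the joint law, which is exactly the exchangeability used in the proof.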
### 2.10.3 The Radon--Nikodym Theorem
Martingale convergence also yields a probabilistic proof of the Radon--Nikodym theorem. This proof illustrates a recurring theme: deep results in measure theory can be derived from martingale arguments.
[quotetheorem:1167]
The hypothesis that $\mathcal{F}$ is countably generated is used to construct an increasing filtration whose union generates $\mathcal{F}$. The general Radon--Nikodym theorem (for $\sigma$-finite measures on arbitrary measurable spaces) requires a different proof --- typically via the Riesz representation theorem on $L^2$ --- and the countable generation hypothesis is not needed there. The martingale proof presented here gives a stronger statement in the countably generated case: it constructs the density as a martingale limit, which reveals its structure.
[citeproof:1167]
### 2.10.4 Kakutani's Product Martingale Theorem
The following result characterises when a product of independent non-negative random variables converges to a non-degenerate limit. It is the probabilistic foundation of the dichotomy between equivalence and singularity of product measures.
[quotetheorem:1166]
The proof introduces an auxiliary martingale $N_n = \sqrt{M_n} / \prod_{i=1}^n a_i$ that is $L^2$-bounded when $\prod a_n > 0$. This allows Doob's $L^2$ inequality to provide a uniform bound, which yields UI for the original martingale $M_n$. The device of passing from $M_n$ to $\sqrt{M_n}$ (renormalised) converts an $L^1$ problem into an $L^2$ problem, where stronger tools are available.
[citeproof:1166]
The dichotomy in Kakutani's theorem is sharp: either the product martingale converges to a non-degenerate limit (in $L^1$), or it converges to zero. The condition $\prod a_n > 0$ is equivalent to $\sum (1 - a_n) < \infty$, so the dichotomy is between rapidly converging Hellinger distances (equivalence of product measures) and diverging Hellinger distances (singularity).
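The equivalence between $\prod a_n > 0$ and $\sum (1 - a_n) < \infty$ can be seen numerically on two telescoping examples. The sequences below are hypothetical values of $a_n$, chosen only because their partial products have closed forms: they are not derived from any particular family of random variables.

```python
from fractions import Fraction

def partial_product(a, N):
    """Exact value of prod_{n=2}^{N} a(n)."""
    result = Fraction(1)
    for n in range(2, N + 1):
        result *= a(n)
    return result

def partial_sum(a, N):
    """Exact value of sum_{n=2}^{N} (1 - a(n))."""
    return sum((1 - a(n) for n in range(2, N + 1)), Fraction(0))

fast = lambda n: 1 - Fraction(1, n * n)   # sum of (1 - a_n) converges
slow = lambda n: 1 - Fraction(1, n)       # sum of (1 - a_n) diverges

for N in (10, 100, 1000):
    print(N,
          float(partial_product(fast, N)), float(partial_sum(fast, N)),
          float(partial_product(slow, N)), float(partial_sum(slow, N)))
```

With $a_n = 1 - n^{-2}$ the partial products equal $(N+1)/(2N)$ and stay above $1/2$, while with $a_n = 1 - n^{-1}$ they equal $1/N$ and vanish as the harmonic sum diverges.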
The chapter has established the full discrete-time martingale theory. The passage to continuous time, which requires entirely new techniques to handle the uncountable index set, is the subject of the next chapter.
# 3. Continuous-Time Processes
The martingale theory of Chapter 2 was developed for discrete time: the index set is $\mathbb{N}$, and measurability of processes is automatic. In continuous time, the index set is $\mathbb{R}_+$ and fundamentally new difficulties arise. First, a process $(X_t)_{t \geq 0}$ is a family of random variables indexed by an uncountable set, and the joint mapping $(\omega, t) \mapsto X_t(\omega)$ need not be measurable with respect to $\mathcal{F} \otimes \mathfrak{B}(\mathbb{R}_+)$ without regularity assumptions on the paths $t \mapsto X_t(\omega)$. Second, stopping times in continuous time require care, since $\{T \leq t\}$ may involve uncountable unions. The resolution of both difficulties relies on path regularity: cadlag (right-continuous with left limits) processes.
## 3.1 Measurability and Path Regularity
If the paths $t \mapsto X_t(\omega)$ are continuous, then the mapping $(\omega, t) \mapsto X_t(\omega)$ is $\mathcal{F} \otimes \mathfrak{B}(\mathbb{R}_+)$-measurable, since:
\begin{align*}
X_t(\omega) = \lim_{n \to \infty} \sum_{k=0}^{\infty} \mathbb{1}_{[k 2^{-n}, (k+1) 2^{-n})}(t) \, X_{k 2^{-n}}(\omega),
\end{align*}
and each approximant is measurable. More generally, a **cadlag** process (right-continuous with left limits, from the French "continue \`a droite, limite \`a gauche") has the same property, provided each dyadic interval is sampled at its right endpoint, so that the sampling times decrease to $t$. A cadlag process is determined by its values on a countable dense subset such as $\mathbb{Q}_+$.
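The dyadic approximation can be illustrated on a single deterministic path. The sketch below samples a path at the left endpoint of the dyadic interval of length $2^{-n}$ containing $t$ and reports the error at one time point. The function name `dyadic_approximant` and the use of $\sin$ as a stand-in for a continuous sample path are assumptions made for illustration.

```python
import math

def dyadic_approximant(path, t: float, n: int) -> float:
    """Value at time t of the level-n step function built from `path`.

    The step function equals path(k / 2**n) on [k / 2**n, (k + 1) / 2**n).
    """
    k = math.floor(t * 2 ** n)
    return path(k / 2 ** n)

path = math.sin          # stand-in for one continuous sample path
t = 0.7371
for n in (2, 5, 10, 20):
    err = abs(dyadic_approximant(path, t, n) - path(t))
    print(n, err)
```

Since $\sin$ is $1$-Lipschitz, the error at level $n$ is at most $2^{-n}$. For a merely continuous path the error still tends to zero, but without a rate.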
[definition: Cadlag Process]
A process $(X_t)_{t \geq 0}$ is **cadlag** if almost surely, $t \mapsto X_t(\omega)$ is right-continuous and admits left limits at every $t > 0$.
[/definition]
Two processes $X$ and $X'$ on $(\Omega, \mathcal{F}, \mathbb{P})$ are called **versions** of each other if $\mathbb{P}(X_t = X_t') = 1$ for all $t$. Processes with the same finite-dimensional distributions need not have the same path properties.
[example: Versions With Different Paths]
Let $X_t = 0$ for all $t \in [0,1]$, let $U \sim \operatorname{Unif}[0,1]$, and define $X_t' = \mathbb{1}_{U = t}$. For each fixed $t$, $\mathbb{P}(X_t' = 0) = \mathbb{P}(U \neq t) = 1$, so $X'$ is a version of $X$; in particular the two processes have identical finite-dimensional distributions (all equal to the Dirac measure at $0$). Yet $X'$ is discontinuous (at $t = U$) with probability one, while $X$ is continuous. This shows that finite-dimensional distributions do not determine path regularity.
[/example]
The following example demonstrates two further subtleties: a process can be cadlag without being continuous, and a version of a continuous process can fail to be cadlag.
[example: Cadlag But Not Continuous, and a Non-Cadlag Version]
Let $B$ be a standard Brownian motion and define $Y_t = B_t$ for $t \in [0,1)$ and $Y_t = B_t + 1$ for $t \in [1, 2]$. The process $Y$ has a jump at $t = 1$, so it is not continuous, but it is cadlag: it is right-continuous at $t = 1$ and has the left limit $B_1$ there. Now let $U \sim \operatorname{Unif}[0,1]$, let $(q_n)$ be an enumeration of $\mathbb{Q}$, and consider the process $Z_t = \sum_{n=1}^\infty n^{-2} \mathbb{1}_{t = U + q_n}$ for $t \in [0,1]$, so that $Z_t = n^{-2}$ if $t = U + q_n$ and $Z_t = 0$ otherwise. The random shift $U$ matters: a deterministic process with $Z_{q_n} = n^{-2} \neq 0$ would not be a version of the zero process. With the shift, for any fixed $t$ the event $\{Z_t \neq 0\} = \{U \in t - \mathbb{Q}\}$ has probability zero because $t - \mathbb{Q}$ is countable, so $\mathbb{P}(Z_t = 0) = 1$ for all $t$ and $Z$ is a version of the zero process. However, every path of $Z$ is discontinuous on a dense set: at $t = U + q_n$ the path takes the value $n^{-2}$, while $Z_s \to 0$ as $s \to t$ with $s \neq t$, since only finitely many points carry a value above any given threshold. In particular, $Z$ is not right-continuous and hence not cadlag. The zero process is the unique cadlag version, up to indistinguishability.
[/example]
## 3.2 Stopping Times in Continuous Time
A stopping time in continuous time is $T : \Omega \to [0, \infty]$ with $\{T \leq t\} \in \mathcal{F}_t$ for all $t$. The first hitting time $T_A = \inf\{t \geq 0 : X_t \in A\}$ of a set $A$ is not always a stopping time, since $\{T_A \leq t\} = \bigcup_{0 \leq s \leq t} \{X_s \in A\}$ is an uncountable union that need not belong to $\mathcal{F}_t$.
[quotetheorem:1168]
The proof exploits continuity to reduce the uncountable union to a countable one over rational times. The key insight is that a continuous function that hits a closed set must approach it through a sequence of rational times.
[citeproof:1168]
The closedness of $A$ is essential in this proof: it ensures that the limit of a sequence in $A$ remains in $A$. For open sets $A$, the hitting time $T_A$ is a stopping time with respect to the right-continuous filtration $\mathcal{F}_{t+} = \bigcap_{s > t} \mathcal{F}_s$, since $\{T_A < t\} = \bigcup_{q \in \mathbb{Q}, q < t} \{X_q \in A\} \in \mathcal{F}_t$ and $\{T_A \leq t\} = \bigcap_n \{T_A < t + 1/n\} \in \mathcal{F}_{t+}$.
## 3.3 The Martingale Regularisation Theorem
The fundamental result of this section asserts that a continuous-time martingale always has a cadlag version (under mild conditions on the filtration). This is the continuous-time analogue of the existence of conditional expectations: just as conditional expectation is defined only up to a null set at each time, a continuous-time martingale may have badly-behaved paths, but a cadlag representative always exists.
[definition: Usual Conditions]
A filtration $(\mathcal{F}_t)$ satisfies the **usual conditions** if it is right-continuous ($\mathcal{F}_{t+} = \mathcal{F}_t$ for all $t$) and complete (every $\mathbb{P}$-null set belongs to $\mathcal{F}_0$). The augmented filtration is $\widetilde{\mathcal{F}}_t = \sigma(\mathcal{F}_{t+}, \mathcal{N})$ where $\mathcal{N}$ is the collection of $\mathbb{P}$-null sets.
[/definition]
The usual conditions are not just a technical convenience but resolve genuine pathologies. Without right-continuity, stopping times may fail to generate well-behaved $\sigma$-algebras; without completeness, exceptional null sets accumulate and interfere with monotone convergence arguments.
[quotetheorem:1169]
The proof constructs the cadlag version by taking right limits along rational sequences. The key technical input is the discrete-time machinery --- upcrossing bounds and backwards martingale convergence --- applied to the process restricted to rational times. The main difficulty is showing that the right limits exist simultaneously for all $t$, which requires a full-measure event on which all upcrossing counts are finite.
[citeproof:1169]
The proof reveals the mechanism by which continuous-time results are derived from discrete-time ones: one applies the discrete theory to the rational skeleton $\{X_q : q \in \mathbb{Q}_+\}$, then extends to all times by taking right limits along rationals. This "restrict to rationals and extend" technique is the standard method throughout continuous-time probability.
From this point forward, continuous-time martingales are always taken in their cadlag version (when the filtration satisfies the usual conditions). The extension of discrete-time results to continuous time proceeds by restricting to rational times and using the cadlag property. To illustrate the technique concretely, consider the extension of Doob's maximal inequality: for a cadlag non-negative submartingale $X$, the supremum satisfies $X_t^* = \sup_{0 \leq s \leq t} X_s = \sup_{s \in (\mathbb{Q} \cap [0,t]) \cup \{t\}} X_s$ (the second equality uses right-continuity; the endpoint $t$ is included because a jump at $t$ is not seen from earlier rational times), and the discrete maximal inequality applied to this countable skeleton gives $\lambda \, \mathbb{P}(X_t^* \geq \lambda) \leq \mathbb{E}[X_t]$. Similarly, the upcrossing inequality, $L^p$ convergence, UI convergence, and optional stopping extend to continuous time by the same method.
## 3.4 Kolmogorov's Continuity Criterion
When does a process have a continuous (not just cadlag) version? The following criterion gives a sufficient condition in terms of moment bounds on increments.
[quotetheorem:1170]
The proof uses a multi-scale argument: one controls the increments at each dyadic scale using Markov's inequality and Borel--Cantelli, then sums the geometric series across scales. The key difficulty is that summing over $2^n$ intervals at scale $2^{-n}$ introduces a factor of $2^n$, which must be absorbed by the moment bound; the surplus $\varepsilon$ in the exponent of the moment bound is what leaves a summable, geometrically decaying estimate.
[citeproof:1170]
Kolmogorov's criterion is applied in Chapter 6 to establish the existence of Brownian motion with continuous paths. The multi-scale Borel--Cantelli argument used here is characteristic of regularity proofs in stochastic analysis: one establishes good behaviour at each dyadic scale, then interpolates.
The present chapter has addressed the main difficulties of continuous-time probability: measurability of processes, the existence of cadlag versions, and criteria for path continuity. With these tools in hand, we turn to the convergence of probability measures, which provides the framework for studying distributional limits of stochastic processes.
# 4. Weak Convergence
The previous chapters developed the theory of individual stochastic processes. This chapter takes a step back to study convergence of probability measures themselves. The motivating question is: given a sequence of probability measures $(\mu_n)$ on a metric space $(M, d)$, in what sense can $\mu_n$ "converge" to a limit $\mu$? Pointwise convergence $\mu_n(A) \to \mu(A)$ for all Borel sets $A$ is too strong (it fails for simple examples like Dirac masses converging to a point). The correct notion is **weak convergence**, which tests against continuous bounded functions.
## 4.1 The Portmanteau Theorem
[definition: Weak Convergence]
A sequence $(\mu_n)$ of probability measures on a metric space $(M, d)$ converges **weakly** to $\mu$, written $\mu_n \Rightarrow \mu$, if $\int_M f \, d\mu_n \to \int_M f \, d\mu$ for all bounded continuous $f : M \to \mathbb{R}$.
[/definition]
The portmanteau theorem provides several equivalent characterisations. Each characterisation is useful in different contexts: the open-set formulation (ii) is natural for lower semicontinuity arguments, the closed-set formulation (iii) for upper semicontinuity, and the continuity-set formulation (iv) for distribution functions.
[quotetheorem:1171]
The proof strategy for the main implication (i) $\Rightarrow$ (ii) is to approximate the indicator of an open set from below by continuous functions. The reverse direction (iv) $\Rightarrow$ (i) uses Fubini's theorem to express $\int f \, d\mu_n$ as an integral over level sets, reducing the problem to convergence of $\mu_n$ on continuity sets.
[citeproof:1171]
The Fubini argument in the step (iv) $\Rightarrow$ (i) is elegant and economical: it reduces the convergence of integrals (a priori requiring uniform control over all bounded continuous functions) to convergence of measures on level sets, which are "almost all" continuity sets. This technique reappears in the proof of Levy's continuity theorem.
A sequence of random variables $X_n$ (on possibly different probability spaces) converges **in distribution** to $X$ if the laws $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$.
The following example illustrates why weak convergence is the correct notion and why pointwise convergence of measures fails.
[example: Failure of Pointwise Measure Convergence]
Let $\mu_n = \delta_{1/n}$ be the Dirac mass at $1/n$ on $\mathbb{R}$, and let $\mu = \delta_0$. For the singleton $A = \{0\}$: $\mu_n(A) = 0$ for all $n$, but $\mu(A) = 1$. So $\mu_n(A) \not\to \mu(A)$, even though $\mu_n$ "concentrates near $0$" for large $n$. On the other hand, for any bounded continuous $f : \mathbb{R} \to \mathbb{R}$:
\begin{align*}
\int f \, d\mu_n = f(1/n) \to f(0) = \int f \, d\mu,
\end{align*}
by continuity of $f$. So $\mu_n \Rightarrow \mu$. The portmanteau theorem identifies the source of the failure: $A = \{0\}$ has boundary $\partial A = \{0\}$ with $\mu(\partial A) = 1 > 0$, so convergence $\mu_n(A) \to \mu(A)$ is not guaranteed by weak convergence. Condition (iv) gives convergence only for sets whose boundary has $\mu$-measure zero.
[/example]
## 4.2 Tightness and Prokhorov's Theorem
The Bolzano--Weierstrass theorem guarantees that every bounded sequence in $\mathbb{R}$ has a convergent subsequence. Prokhorov's theorem is the analogue for probability measures: tightness replaces boundedness.
[definition: Tightness]
A sequence $(\mu_n)$ of probability measures on a metric space $M$ is **tight** if for every $\varepsilon > 0$ there exists a compact $K \subset M$ with $\sup_n \mu_n(M \setminus K) \leq \varepsilon$.
[/definition]
Tightness is the right condition because it prevents mass from escaping to infinity. The following theorem shows that tightness is sufficient for sequential compactness of probability measures.
[quotetheorem:1172]
The proof for $M = \mathbb{R}$ proceeds by a diagonal argument on distribution functions, using tightness to ensure that the limit is a genuine probability measure (i.e., has total mass $1$) rather than losing mass to infinity. This is the critical role of tightness: it prevents the measures from "escaping" to infinity along the subsequence.
[citeproof:1172]
The diagonal argument used here --- extracting successive subsequences along a countable dense set and then taking the diagonal --- is a standard compactness technique in analysis. It reappears in the proof of the Arzel\`a--Ascoli theorem and in many constructions in functional analysis. The technique is effective precisely because the rationals are countable and dense.
Without tightness, the conclusion of Prokhorov's theorem fails: the sequence $\mu_n = \delta_n$ has distribution functions $F_n(x) = \mathbb{1}_{[n, \infty)}(x) \to 0$ for all $x$, so along every subsequence the "limit" $F$ is identically $0$, which is not a distribution function.
## 4.3 Characteristic Functions and Levy's Continuity Theorem
The characteristic function provides a powerful analytic tool for establishing weak convergence.
[definition: Characteristic Function]
The **characteristic function** of a random variable $X$ taking values in $\mathbb{R}^d$ is the function $\varphi_X : \mathbb{R}^d \to \mathbb{C}$ defined by:
\begin{align*}
\varphi_X(u) = \mathbb{E}[e^{i \langle u, X \rangle}] = \int_{\mathbb{R}^d} e^{i \langle u, x \rangle} \, \mu(dx), \quad u \in \mathbb{R}^d,
\end{align*}
where $\mu = \mathcal{L}(X)$. The characteristic function is continuous with $\varphi_X(0) = 1$, and it determines the law of $X$ uniquely.
[/definition]
The fundamental theorem of this section asserts that convergence of characteristic functions is equivalent to weak convergence, provided the limit is itself a characteristic function.
[quotetheorem:519]
The proof of part (i) is immediate from the definition of weak convergence. Part (ii) is more substantial: the strategy is to deduce tightness from the behaviour of the characteristic functions near the origin, apply Prokhorov's theorem to extract a convergent subsequence, and then show that the limit is unique by the injectivity of the Fourier transform.
[citeproof:519]
The continuity of $\psi$ at $0$ is essential in part (ii). Without it, tightness can fail: consider $X_n \sim \mathcal{N}(0, n)$, whose characteristic functions $\varphi_{X_n}(u) = e^{-nu^2/2} \to \mathbb{1}_{\{0\}}(u)$ converge pointwise to a function that equals $1$ at $u = 0$ but $0$ elsewhere. This limit is discontinuous at $0$, and indeed $(X_n)$ is not tight --- the mass escapes to infinity. The continuity hypothesis prevents precisely this phenomenon.
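The two phenomena in this counterexample, the discontinuous pointwise limit of the characteristic functions and the escape of mass, can be tabulated directly. The sketch below uses the closed forms $\varphi_{X_n}(u) = e^{-nu^2/2}$ and $\mathbb{P}(|X_n| \leq K) = \operatorname{erf}(K / \sqrt{2n})$; the function names are illustrative.

```python
import math

def char_fn(u: float, n: int) -> float:
    """Characteristic function of N(0, n) at u (real-valued by symmetry)."""
    return math.exp(-n * u * u / 2)

def mass_in_interval(K: float, n: int) -> float:
    """P(|X_n| <= K) for X_n ~ N(0, n)."""
    return math.erf(K / math.sqrt(2 * n))

for n in (1, 100, 10_000, 1_000_000):
    print(n, char_fn(0.0, n), char_fn(0.1, n), mass_in_interval(10.0, n))
```

The value at $u = 0$ stays equal to $1$ while the value at $u = 0.1$ collapses to $0$, and the mass of the fixed compact set $[-10, 10]$ tends to $0$, so no single compact set can serve in the definition of tightness.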
The tools of weak convergence --- the portmanteau theorem, tightness, Prokhorov's theorem, and Levy's continuity theorem --- provide the framework for studying distributional limits. The next chapter applies these tools to study the exponential rate of decay of probabilities of rare events.
# 5. Large Deviations
The strong law of large numbers asserts that $\bar{S}_n = S_n/n \to \mathbb{E}[X_1]$ a.s. The central limit theorem refines this by describing the fluctuations of $S_n$ around its mean at scale $\sqrt{n}$. Both results concern the "typical" behaviour of sums. The theory of large deviations concerns a different question: **at what exponential rate does $\mathbb{P}(S_n \geq na)$ decay for $a > \mathbb{E}[X_1]$?** This is relevant in statistics, information theory, and statistical mechanics, wherever one needs to quantify the probability of rare events.
## 5.1 The Rate Function and Cramer's Theorem
Let $(X_i)_{i \geq 1}$ be i.i.d. with $\mathbb{E}[X_1] = \bar{x}$ and $S_n = X_1 + \cdots + X_n$. The moment generating function and cumulant generating function are:
\begin{align*}
M(\lambda) = \mathbb{E}[e^{\lambda X_1}], \qquad \Psi(\lambda) = \log M(\lambda).
\end{align*}
By Markov's inequality, for $\lambda \geq 0$:
\begin{align*}
\mathbb{P}(S_n \geq na) \leq \mathbb{P}(e^{\lambda S_n} \geq e^{\lambda na}) \leq M(\lambda)^n e^{-\lambda na} = \exp(-n(\lambda a - \Psi(\lambda))).
\end{align*}
Optimising over $\lambda$ gives the upper bound $\mathbb{P}(S_n \geq na) \leq \exp(-n \Psi^*(a))$, where the **Legendre transform** (or **rate function**) is:
\begin{align*}
\Psi^*(a) = \sup_{\lambda \geq 0} (\lambda a - \Psi(\lambda)) \geq -\Psi(0) = 0.
\end{align*}
The Gaussian case provides a benchmark: the rate function is quadratic, and the Chernoff bound is exact.
[example: Gaussian Rate Function]
For $X \sim \mathcal{N}(0,1)$: $M(\lambda) = e^{\lambda^2/2}$, $\Psi(\lambda) = \lambda^2/2$, $\Psi'(\lambda) = \lambda$, so for $a \geq 0$ the choice $\lambda = a$ achieves the supremum, giving $\Psi^*(a) = a^2/2$. This matches the exact Gaussian tail on the exponential scale: since $\bar{S}_n \sim \mathcal{N}(0, 1/n)$, one has $\mathbb{P}(\bar{S}_n \geq a) = e^{-n a^2/2 + o(n)}$ for $a > 0$.
[/example]
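In the Gaussian case the tail probability is available in closed form, so the Chernoff bound and the rate $a^2/2$ can be compared with the exact value. The sketch below uses $\bar{S}_n \sim \mathcal{N}(0, 1/n)$ and the complementary error function; the function names are illustrative.

```python
import math

def gaussian_mean_tail(a: float, n: int) -> float:
    """Exact P(S_n / n >= a) when the X_i are i.i.d. N(0, 1)."""
    return 0.5 * math.erfc(a * math.sqrt(n / 2))

def chernoff_bound(a: float, n: int) -> float:
    """The bound exp(-n * a**2 / 2) obtained from the rate function a**2 / 2."""
    return math.exp(-n * a * a / 2)

a = 1.0
for n in (10, 100, 400):
    p = gaussian_mean_tail(a, n)
    # Third column: exact tail; fourth: Chernoff bound; fifth: empirical rate.
    print(n, a, p, chernoff_bound(a, n), -math.log(p) / n)
```

The exact probability stays below the bound, and the normalised logarithm approaches $a^2/2$ as $n$ grows, with a correction of order $(\log n)/n$.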
The exponential distribution shows that the rate function can have a restricted domain.
[example: Exponential Rate Function]
For $X \sim \operatorname{Exp}(1)$: $M(\lambda) = 1/(1-\lambda)$ for $\lambda < 1$ (and $\infty$ for $\lambda \geq 1$), $\Psi(\lambda) = -\log(1-\lambda)$, $\Psi'(\lambda) = 1/(1-\lambda) = a$ gives $\lambda = 1 - 1/a$ for $a \geq 1$, so $\Psi^*(a) = a - 1 - \log a$ for $a \geq 1$. The rate $\Psi^*(a)$ grows only linearly in $a$ for large $a$, slower than the Gaussian rate $a^2/2$, reflecting the heavier tail of the exponential distribution.
[/example]
Cramer's theorem asserts that the Chernoff upper bound gives the exact exponential rate.
[quotetheorem:1173]
The proof of the upper bound is the Markov/Chernoff argument above. The lower bound is considerably more delicate: the strategy is to "tilt" the distribution so that the rare event becomes typical, then use the CLT under the tilted measure to compute the probability. The key difficulty is handling the case where the moment generating function is infinite for some $\lambda$, which requires a truncation argument.
[citeproof:1173]
The exponential tilting technique in Case 2 is one of the most important ideas in large deviation theory. By changing the measure from $\mathbb{P}$ to $\mathbb{P}_\theta$, the rare event $\{S_n \geq 0\}$ (which has exponentially small probability under $\mathbb{P}$) becomes a typical event under $\mathbb{P}_\theta$ (since $\mathbb{E}_\theta[X_1] = 0$). The CLT then gives a lower bound under $\mathbb{P}_\theta$ that does not decay exponentially, which translates back to the correct exponential rate under $\mathbb{P}$ via the Radon--Nikodym derivative. This technique is the prototype for importance sampling in Monte Carlo methods.
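The tilting mechanism can be made concrete on a two-point distribution, where everything is computable exactly. The sketch below assumes a hypothetical law with $\mathbb{P}(X = 1) = 1/4$ and $\mathbb{P}(X = -1) = 3/4$ (so the mean is negative), solves $\mathbb{E}_\theta[X_1] = 0$ by bisection, and compares the resulting rate $-\log M(\theta)$ with $-\frac{1}{n} \log \mathbb{P}(S_n \geq 0)$ computed from the binomial distribution. All names in the code are illustrative.

```python
import math

# Hypothetical two-point law: X = +1 with probability 1/4, X = -1 with probability 3/4.
P_UP, P_DOWN = 0.25, 0.75

def mgf(theta: float) -> float:
    """Moment generating function M(theta) = E[exp(theta * X)]."""
    return P_UP * math.exp(theta) + P_DOWN * math.exp(-theta)

def tilted_mean(theta: float) -> float:
    """Mean of X under the tilted law dP_theta = exp(theta * X) / M(theta) dP."""
    return (P_UP * math.exp(theta) - P_DOWN * math.exp(-theta)) / mgf(theta)

def solve_tilt(lo: float = 0.0, hi: float = 5.0, iters: int = 100) -> float:
    """Bisection for the theta with tilted mean zero (tilted_mean is increasing)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if tilted_mean(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def finite_n_rate(n: int) -> float:
    """-(1/n) log P(S_n >= 0), computed exactly with integer arithmetic."""
    # S_n >= 0 iff the number k of up-steps satisfies 2k >= n.
    numerator = sum(math.comb(n, k) * 3 ** (n - k) for k in range((n + 1) // 2, n + 1))
    return -(math.log(numerator) - n * math.log(4)) / n

theta = solve_tilt()
rate = -math.log(mgf(theta))
print("theta =", theta, " rate =", rate)
for n in (10, 100, 1000):
    print(n, finite_n_rate(n))
```

In this example $\theta = \tfrac{1}{2} \log 3$ and the rate is $-\log(\sqrt{3}/2)$. The finite-$n$ values approach it from above, consistent with the Chernoff upper bound.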
The Poisson distribution provides a third benchmark for computing rate functions, and its combinatorial structure leads to a rate function involving the entropy.
[example: Poisson Rate Function]
For $X \sim \operatorname{Po}(1)$: $M(\lambda) = e^{e^\lambda - 1}$, $\Psi(\lambda) = e^\lambda - 1$, and $\Psi'(\lambda) = e^\lambda = a$ gives $\lambda = \log a$ for $a \geq 1$, so $\Psi^*(a) = a\log a - a + 1$ for $a \geq 1$. This is the Kullback--Leibler divergence of $\operatorname{Po}(a)$ from $\operatorname{Po}(1)$. This is not a coincidence: the large deviation principle for Poisson random variables is intimately connected to information-theoretic quantities.
[/example]
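The three closed-form rate functions of this section (Gaussian, exponential, Poisson) can be checked against a direct numerical maximisation of $\lambda a - \Psi(\lambda)$ over $\lambda \geq 0$. The sketch below uses ternary search, which relies on the concavity of $\lambda \mapsto \lambda a - \Psi(\lambda)$. The helper `legendre` and the truncation bounds `lam_max` are assumptions made for illustration.

```python
import math

def legendre(psi, a: float, lam_max: float, iters: int = 200) -> float:
    """Approximate the sup over 0 <= lam <= lam_max of lam * a - psi(lam).

    Uses ternary search, which is valid because lam -> lam * a - psi(lam)
    is concave whenever psi is convex.
    """
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if m1 * a - psi(m1) < m2 * a - psi(m2):
            lo = m1
        else:
            hi = m2
    lam = (lo + hi) / 2
    return lam * a - psi(lam)

# Each entry: (cumulant generating function, search bound, closed-form rate).
cases = {
    "N(0,1)": (lambda l: l * l / 2, 50.0, lambda a: a * a / 2),
    "Exp(1)": (lambda l: -math.log(1 - l), 1 - 1e-12, lambda a: a - 1 - math.log(a)),
    "Po(1)": (lambda l: math.exp(l) - 1, 50.0, lambda a: a * math.log(a) - a + 1),
}
for name, (psi, lam_max, closed_form) in cases.items():
    for a in (1.5, 2.0, 4.0):
        print(name, a, legendre(psi, a, lam_max), closed_form(a))
```

For the exponential distribution the search interval stops just short of $\lambda = 1$, where the moment generating function becomes infinite.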
### 5.1.1 Tightness Versus Uniform Integrability
The concepts of tightness (for probability measures) and uniform integrability (for random variables) are often confused because both serve as "compactness substitutes." The following example clarifies the distinction by exhibiting a sequence that is tight but not uniformly integrable.
[example: Tight But Not Uniformly Integrable]
Let $X_n$ take the value $n$ with probability $1/n$ and $0$ with probability $1 - 1/n$. Then $\mathbb{E}[X_n] = 1$ for all $n$, so the sequence is $L^1$-bounded. The laws $\mathcal{L}(X_n)$ are tight: for any $\varepsilon > 0$, take the compact set $K = [0, 1/\varepsilon]$. If $n \leq 1/\varepsilon$ then $X_n \in K$ surely, and if $n > 1/\varepsilon$ then $\mathbb{P}(X_n \notin K) = \mathbb{P}(X_n = n) = 1/n < \varepsilon$.
However, $(X_n)$ is not uniformly integrable: for any $\alpha > 0$ and any $n > \alpha$,
\begin{align*}
\mathbb{E}[|X_n| \mathbb{1}_{|X_n| > \alpha}] = n \cdot \frac{1}{n} = 1,
\end{align*}
so $\sup_n \mathbb{E}[|X_n| \mathbb{1}_{|X_n| > \alpha}] \geq 1$ for all $\alpha$. This sequence converges in distribution to $\delta_0$ (since $\mathbb{P}(X_n = 0) \to 1$) but does not converge in $L^1$: indeed $\mathbb{E}[X_n] = 1 \not\to 0 = \mathbb{E}[0]$. Tightness ensures convergence in distribution along subsequences, but UI is needed to ensure convergence of expectations.
[/example]
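The tail expectations in this example can be tabulated exactly, alongside a comparison family that is bounded in $L^2$ and therefore uniformly integrable. In the sketch below, the second family $Y_n$ (equal to $n$ with probability $n^{-2}$) is a hypothetical addition for contrast, and the supremum over $n$ is truncated at `n_max`, so the code only approximates the true supremum from below.

```python
from fractions import Fraction

def tail_expectation(value: Fraction, prob: Fraction, alpha) -> Fraction:
    """E[X 1{X > alpha}] for X equal to `value` with probability `prob`, else 0."""
    return value * prob if value > alpha else Fraction(0)

def sup_tail(family, alpha, n_max: int = 10_000) -> Fraction:
    """Maximum over n <= n_max of the tail expectation at level alpha."""
    return max(tail_expectation(*family(n), alpha) for n in range(1, n_max + 1))

# X_n = n with probability 1/n (the example above): L^1-bounded, not UI.
spike = lambda n: (Fraction(n), Fraction(1, n))
# Y_n = n with probability 1/n**2 (hypothetical comparison): L^2-bounded, hence UI.
damped = lambda n: (Fraction(n), Fraction(1, n * n))

for alpha in (1, 10, 100, 1000):
    print(alpha, sup_tail(spike, alpha), float(sup_tail(damped, alpha)))
```

The first column of suprema stays at $1$ for every $\alpha$, while the second decays like $1/\alpha$, as H\"older's inequality predicts for an $L^2$-bounded family.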
Cramer's theorem provides the foundational result for large deviation theory. The extensions to multidimensional distributions (the G\"artner--Ellis theorem) and to empirical measures (Sanov's theorem) are beyond the scope of this course. We now turn to Brownian motion, where all the threads of the course converge.
# 6. Brownian Motion
The previous chapters developed the general theory of stochastic processes: conditional expectation, martingale convergence, regularisation, weak convergence, and large deviations. Brownian motion is the process where all these threads converge. It is simultaneously the central object of study and the most powerful source of examples and counterexamples in probability. Brownian motion is a continuous martingale, a Gaussian process, the scaling limit of random walks, and the probabilistic solution of Laplace's equation. This chapter constructs Brownian motion, develops its symmetry properties, and explores its deep connections to analysis.
## 6.1 The Brownian Motion Process
The definition of Brownian motion specifies three properties: a starting point, Gaussian increments, and independence of increments. The remarkable content of Wiener's theorem is that a process with continuous paths and these distributional properties exists --- this is not automatic, as the example of versions with different paths in Chapter 3 showed.
[definition: Brownian Motion]
A **Brownian motion** in $\mathbb{R}^d$ started from $x \in \mathbb{R}^d$ is a continuous process $B = (B_t)_{t \geq 0}$ satisfying:
(i) $B_0 = x$ a.s.
(ii) $B_t - B_s \sim \mathcal{N}(0, (t-s) I_d)$ for all $0 \leq s < t$.
(iii) $B$ has independent increments: for $0 \leq t_1 < \cdots < t_k$, the increments $B_{t_2} - B_{t_1}, \ldots, B_{t_k} - B_{t_{k-1}}$ are independent, and independent of $B_0$.
A **standard Brownian motion** has $x = 0$.
[/definition]
Conditions (i)--(iii) determine the finite-dimensional distributions of $B$ uniquely (they specify a consistent family of Gaussian distributions). The Kolmogorov extension theorem (a foundational result in probability theory stating that a consistent family of finite-dimensional distributions on a product space extends to a unique probability measure on the product $\sigma$-algebra) guarantees the existence of a process with these finite-dimensional distributions, but this process lives on the product space $(\mathbb{R}^d)^{[0,\infty)}$ and need not have continuous paths. The content of Wiener's theorem is that a continuous process with these distributions exists.
## 6.2 Wiener's Theorem
[quotetheorem:1174]
The proof constructs the process on the dyadic rationals using independent Gaussians, verifies the Brownian motion properties, applies Kolmogorov's continuity criterion to obtain H\"older-continuous paths, and extends by continuity. This is the point where Kolmogorov's continuity criterion (from Chapter 3) plays its essential role.
[citeproof:1174]
The construction shows that Brownian paths are a.s. $\alpha$-H\"older continuous for every $\alpha < 1/2$. A deeper result (not proved here) shows that Brownian motion is a.s. **nowhere differentiable** --- the Paley--Wiener--Zygmund theorem (1933). See Morters and Peres, *Brownian Motion*, Chapter 1, for a proof.
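The threshold $1/2$ for the H\"older exponent can be traced to the Gaussian moments of the increments. The sketch below assumes Kolmogorov's criterion in the form: if $\mathbb{E}|X_t - X_s|^p \leq C |t - s|^{1 + \varepsilon}$, then there is a version that is $\alpha$-H\"older for every $\alpha < \varepsilon / p$. It treats the one-dimensional case, where $\mathbb{E}|B_t - B_s|^{2m} = (2m-1)!! \, |t - s|^m$. The function names are illustrative.

```python
from fractions import Fraction

def even_moment_constant(m: int) -> int:
    """E[Z**(2m)] = (2m - 1)!! for Z ~ N(0, 1)."""
    result = 1
    for odd in range(1, 2 * m, 2):
        result *= odd
    return result

def kolmogorov_exponent(m: int) -> Fraction:
    """Supremum of Hoelder exponents given by the 2m-th moment bound.

    E|B_t - B_s|**(2m) = (2m - 1)!! * |t - s|**m, so p = 2m and 1 + eps = m;
    the criterion (in the assumed form) yields every alpha < eps / p.
    """
    p, eps = 2 * m, m - 1
    return Fraction(eps, p)

print("E[Z^4] =", even_moment_constant(2), " E[Z^6] =", even_moment_constant(3))
for m in (2, 3, 5, 50, 500):
    print(m, kolmogorov_exponent(m), float(kolmogorov_exponent(m)))
```

The exponent $(m-1)/(2m) = 1/2 - 1/(2m)$ increases to $1/2$ as $m \to \infty$ but never reaches it, which is why every $\alpha < 1/2$ is obtained and $\alpha = 1/2$ is not.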
## 6.3 Invariance Properties
Brownian motion possesses remarkable symmetries, each reflecting a different invariance of the Gaussian distribution.
[quotetheorem:1175]
Each invariance follows directly from the Gaussian structure: invariance of $\mathcal{N}(0, I_d)$ under orthogonal transformations gives (i), the scaling property of variance gives (ii), and the independence of increments gives (iii). These symmetries are what make Brownian motion the "canonical" continuous process: it is, in law and up to a constant rescaling, the unique continuous process started at the origin with stationary, independent, centred, isotropic Gaussian increments.
The time inversion property is more surprising: it reverses the direction of time and still produces a Brownian motion.
[quotetheorem:1176]
The proof strategy is to check that $X$ has the same finite-dimensional distributions as $B$ by computing the covariance structure, then to verify continuity at $t = 0$ (which is the only non-trivial point, since $X_t = tB_{1/t}$ is a product of the factor $t \to 0$ and the factor $B_{1/t}$, which is unbounded as $t \downarrow 0$).
[citeproof:1176]
Time inversion gives an elegant proof that $B_t/t \to 0$ a.s. as $t \to \infty$: one has $B_t/t = X_{1/t} \to X_0 = 0$ by continuity. This bypasses the strong law of large numbers and uses only the distributional properties of Brownian motion.
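The covariance computation behind time inversion is short enough to verify mechanically. The sketch below treats the one-dimensional case and uses only $\operatorname{Cov}(B_s, B_t) = \min(s, t)$ and bilinearity; the function names and the grid of test times are illustrative.

```python
from fractions import Fraction

def bm_cov(s, t):
    """Cov(B_s, B_t) = min(s, t) for a standard one-dimensional Brownian motion."""
    return min(s, t)

def inverted_cov(s, t):
    """Cov(X_s, X_t) for X_t = t * B_{1/t}, with s, t > 0, by bilinearity."""
    return s * t * bm_cov(1 / s, 1 / t)

times = [Fraction(1, 7), Fraction(1, 2), Fraction(1), Fraction(5, 3), Fraction(9)]
for s in times:
    for t in times:
        # s * t * min(1/s, 1/t) collapses to min(s, t).
        assert inverted_cov(s, t) == bm_cov(s, t)
print("Cov(X_s, X_t) = min(s, t) on the test grid")
```

Since both processes are centred Gaussian, matching covariances identify the finite-dimensional distributions on $(0, \infty)$. Continuity at $t = 0$ is a separate step and is not addressed by this check.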
## 6.4 Blumenthal's $0$-$1$ Law and Its Consequences
The simple Markov property says that $B_{t+s} - B_s$ is independent of $\mathcal{F}_s^B$. A more delicate result is that this independence extends to the right-continuous filtration $\mathcal{F}_s^+ = \bigcap_{t > s} \mathcal{F}_t^B$.
[quotetheorem:1177]
The proof approximates $\mathcal{F}_s^+$ by the $\sigma$-algebras $\mathcal{F}_{s_n}^B$ along a sequence $s_n \downarrow s$, then passes to the limit using continuity of $B$ and the dominated convergence theorem.
[citeproof:1177]
The passage from the simple Markov property to independence from $\mathcal{F}_s^+$ is a strengthening that is specific to Brownian motion (or more generally, Feller processes). It fails for general Markov processes that are not right-continuous. This strengthening is exactly what is needed for Blumenthal's law.
[quotetheorem:1178]
The proof exploits the same "self-independence" argument that proved Kolmogorov's $0$-$1$ law: an event that is independent of itself must have probability $0$ or $1$.
[citeproof:1178]
Blumenthal's $0$-$1$ law says that Brownian motion has no "germ information": any event determined by the infinitesimal behaviour of $B$ near time $0$ has probability either $0$ or $1$. It is the small-time counterpart of Kolmogorov's $0$-$1$ law, which concerns the tail behaviour at infinity; for Brownian motion, time inversion exchanges the two regimes.
The most striking consequence of Blumenthal's law is that Brownian motion immediately takes both positive and negative values, and therefore immediately returns to its starting value.
[quotetheorem:1179]
The proof first shows that the events $\{\tau = 0\}$ and $\{\sigma = 0\}$ belong to $\mathcal{F}_0^+$, then uses Blumenthal's law to conclude they have probability $0$ or $1$, and finally uses the symmetry of Brownian motion to rule out probability $0$.
[citeproof:1179]
This result is genuinely counterintuitive: a continuous function that starts at zero, immediately becomes positive, and immediately becomes negative must oscillate infinitely rapidly near $t = 0$. This infinite oscillation is a manifestation of the nowhere-differentiability of Brownian paths.
This result also gives: $S_\varepsilon = \sup_{0 \leq s \leq \varepsilon} B_s > 0$ and $I_\varepsilon = \inf_{0 \leq s \leq \varepsilon} B_s < 0$ a.s. for all $\varepsilon > 0$, and $\sup_{t \geq 0} B_t = -\inf_{t \geq 0} B_t = +\infty$ a.s. (by the scaling invariance argument).
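The scaling argument can be sketched as follows, writing $S_\infty = \sup_{t \geq 0} B_t$ (notation introduced here only for convenience). For every $c > 0$ the process $(cB_{t/c^2})_{t \geq 0}$ is again a Brownian motion, so
\begin{align*}
S_\infty \overset{d}{=} \sup_{t \geq 0} c B_{t/c^2} = c \, S_\infty, \qquad \text{hence} \qquad \mathbb{P}(S_\infty \geq x) = \mathbb{P}(S_\infty \geq x/c) \quad \text{for all } x, c > 0.
\end{align*}
Thus $x \mapsto \mathbb{P}(S_\infty \geq x)$ is constant on $(0, \infty)$. Letting $x \downarrow 0$ identifies the constant as $\mathbb{P}(S_\infty > 0) = 1$, because $S_\infty \geq S_\varepsilon > 0$ a.s., and letting $x \to \infty$ gives $\mathbb{P}(S_\infty = \infty) = 1$. The statement for the infimum follows by applying the same argument to $-B$.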
## 6.5 Strong Markov Property
The simple Markov property asserts independence of the future from the past at a **deterministic** time $s$. The strong Markov property extends this to **stopping times**.
[quotetheorem:1180]
The proof proceeds in two steps: first verify the result for discrete stopping times (which follows from the simple Markov property applied on each event $\{T = k2^{-n}\}$), then pass to general stopping times by approximating $T$ from above by $T_n = 2^{-n} \lceil 2^n T \rceil$ and using the continuity of $B$. The main difficulty is verifying that the independence is preserved in the limit.
[citeproof:1180]
The "discrete approximation then limit" technique used here is the standard method for extending results from deterministic times to stopping times. It was already used implicitly in Chapter 2 (in the optional stopping theorem), and it reappears in the proofs of the reflection principle and the Dirichlet problem below.
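The dyadic approximation $T_n = 2^{-n} \lceil 2^n T \rceil$ used in the proof is easy to tabulate for a single realised value of $T$. The helper name `dyadic_upper` below is hypothetical; the snippet only illustrates that $T_n \geq T$, that $T_n$ takes values in $2^{-n}\mathbb{Z}$, and that $T_n$ decreases to $T$.

```python
import math


def dyadic_upper(t, n):
    """Return 2^-n * ceil(2^n * t), the smallest multiple of 2^-n that is >= t."""
    return math.ceil((2 ** n) * t) / (2 ** n)


t = 0.3  # a realised value of the stopping time
approximations = [dyadic_upper(t, n) for n in range(1, 8)]
print(approximations)
```

Because $\{T_n = k2^{-n}\} = \{(k-1)2^{-n} < T \leq k2^{-n}\}$ depends only on the path up to time $k2^{-n}$, each $T_n$ is again a stopping time, which is what allows the discrete case to be applied.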
## 6.6 Reflection Principle
The strong Markov property enables the reflection principle: reflecting a Brownian path about its current level after a stopping time produces another Brownian motion.
[quotetheorem:1181]
The proof reduces to observing that the reflected process is obtained by concatenating $(B_t)_{t \leq T}$ with $-B^{(T)}$ (the negation of the post-$T$ Brownian motion), and that the negation of a Brownian motion is again a Brownian motion with the same law.
[citeproof:1181]
The reflection principle is the key tool for computing distributions of extrema of Brownian motion. Without it, the joint distribution of $B_t$ and $\sup_{s \leq t} B_s$ would require solving a boundary value problem for the heat equation. The reflection argument bypasses this entirely.
The following result is the main application of the reflection principle: it computes the joint distribution of Brownian motion and its running maximum.
[quotetheorem:1182]
The proof uses the reflection principle at the hitting time $T_b$, together with the inclusion $\{\widetilde{B}_t \geq 2b - a\} \subset \{T_b \leq t\}$ to simplify the joint probability. The inclusion holds because $a \leq b$ forces $2b - a \geq b$, and $\widetilde{B}$ agrees with $B$ up to time $T_b$.
[citeproof:1182]
The identity $S_t \overset{d}{=} |B_t|$ is a striking consequence of the reflection principle. It says that the maximum of a Brownian path up to time $t$ has exactly the same distribution as the absolute value of the endpoint --- a fact that is far from obvious from the definition.
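The same reflection argument has an exact discrete counterpart, which can be checked by brute force. The sketch below is not the Brownian statement itself: it enumerates all $\pm 1$ paths of a simple random walk and verifies the analogous counting identity $\#\{\max_{k \leq n} S_k \geq b\} = \#\{S_n \geq b\} + \#\{S_n > b\}$. The function name `reflection_check` is hypothetical.

```python
from itertools import product


def reflection_check(n, b):
    """Return (#paths with running max >= b, #paths with S_n >= b + #paths with S_n > b)."""
    hits_max = 0
    reflected_count = 0
    for steps in product((-1, 1), repeat=n):
        position, running_max = 0, 0
        for step in steps:
            position += step
            running_max = max(running_max, position)
        if running_max >= b:
            hits_max += 1
        # Paths ending at or above b, plus reflections of paths ending strictly above b.
        if position >= b:
            reflected_count += 1
        if position > b:
            reflected_count += 1
    return hits_max, reflected_count


left, right = reflection_check(n=10, b=3)
print(left == right)
```

In the Brownian limit the event $\{B_t = b\}$ has probability zero, so the two terms on the right merge into $2\,\mathbb{P}(B_t \geq b) = \mathbb{P}(|B_t| \geq b)$.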
## 6.7 Martingales for Brownian Motion
Brownian motion itself is a martingale, and many other martingales can be built from it. The general principle is that applying a suitable "correction" to $f(B_t)$ --- subtracting the integral of the generator $\frac{1}{2}\Delta f$ --- produces a martingale.
[quotetheorem:1183]
The proofs are direct computations using the independence of increments and the known moments of Gaussian random variables.
[citeproof:1183]
The martingale $B_t^2 - t$ reveals the "quadratic variation" of Brownian motion: since $\mathbb{E}[B_t^2] = t$, the process $B_t^2$ grows on average at unit rate, and subtracting this deterministic growth produces a martingale. This is the prototype for the general theory of quadratic variation and stochastic calculus (not treated in this course).
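For reference, the computation behind $B_t^2 - t$ in one dimension runs as follows. For $s \leq t$, write $B_t = B_s + (B_t - B_s)$ and use that the increment is independent of $\mathcal{F}_s$ with mean $0$ and variance $t - s$:
\begin{align*}
\mathbb{E}[B_t^2 \mid \mathcal{F}_s] = B_s^2 + 2 B_s \, \mathbb{E}[B_t - B_s] + \mathbb{E}[(B_t - B_s)^2] = B_s^2 + (t - s),
\end{align*}
so that $\mathbb{E}[B_t^2 - t \mid \mathcal{F}_s] = B_s^2 - s$.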
The exponential martingale provides a powerful tool for computing hitting probabilities.
[quotetheorem:1184]
The proof is a direct computation using the moment generating function of the Gaussian distribution.
[citeproof:1184]
The exponential martingale is the continuous-time analogue of the product martingale in Chapter 2. It is used for computing Laplace transforms of hitting times and for the Girsanov change of measure (not treated here).
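As an illustration of the Laplace transform computation, consider one-dimensional Brownian motion started at $0$ and the hitting time $T_a = \inf\{t \geq 0 : B_t = a\}$ for $a > 0$. The sketch assumes the exponential martingale has the form $M_t = \exp(\theta B_t - \theta^2 t/2)$ with $\theta > 0$; the notation may differ from the theorem above. The stopped process $M_{t \wedge T_a}$ is bounded by $e^{\theta a}$, so optional stopping at the bounded time $t \wedge T_a$ followed by dominated convergence as $t \to \infty$ gives
\begin{align*}
1 = \mathbb{E}\left[M_{t \wedge T_a}\right] \longrightarrow \mathbb{E}\left[e^{\theta a - \theta^2 T_a / 2} \, \mathbb{1}_{\{T_a < \infty\}}\right], \qquad \text{hence} \qquad \mathbb{E}\left[e^{-\lambda T_a}\right] = e^{-a\sqrt{2\lambda}} \quad \text{with } \lambda = \theta^2/2.
\end{align*}
On $\{T_a = \infty\}$ the stopped martingale is at most $e^{\theta a - \theta^2 t/2} \to 0$, which is why only the event $\{T_a < \infty\}$ contributes in the limit.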
More generally, if $f : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}$ is $C^1$ in $t$ and $C^2$ in $x$ with bounded derivatives, then:
\begin{align*}
M_t = f(t, B_t) - f(0, B_0) - \int_0^t \left(\frac{\partial}{\partial s} + \frac{1}{2} \Delta\right) f(s, B_s) \, ds
\end{align*}
is a martingale. The operator $\frac{1}{2}\Delta$ is the **generator** of Brownian motion, and $\partial_t + \frac{1}{2}\Delta$ is the generator of the space-time process $(t, B_t)$.
## 6.8 Recurrence and Transience
The behaviour of Brownian motion depends drastically on the dimension. In one dimension, the process visits every point infinitely often; in two dimensions, it returns to every neighbourhood but misses individual points; in three or more dimensions, it escapes to infinity.
[quotetheorem:1185]
The proof uses harmonic functions in annular domains together with optional stopping. The key insight is that $\log|y|$ is harmonic in $\mathbb{R}^2 \setminus \{0\}$ and $1/|y|^{d-2}$ is harmonic in $\mathbb{R}^d \setminus \{0\}$ for $d \geq 3$, and applying the optional stopping theorem to the martingale $f(B_t)$ in the annulus $\{\varepsilon \leq |x| \leq R\}$ gives hitting probabilities as explicit functions of $\varepsilon$, $R$, and $|x|$.
[citeproof:1185]
The dimension-dependence is controlled entirely by the behaviour of the harmonic function $f$ at infinity: $\log|y| \to \infty$ as $|y| \to \infty$ (slowly, logarithmically) in $d = 2$, while $1/|y|^{d-2} \to 0$ in $d \geq 3$. In $d = 1$, the identity function is harmonic, and the optional stopping argument directly gives $\mathbb{P}_0(T_a < T_{-b}) = b/(a+b)$ (the gambler's ruin formula from Chapter 2), from which point-recurrence follows.
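The annulus hitting probabilities can be evaluated numerically. Writing $h(r) = \log r$ for $d = 2$ and $h(r) = r^{2-d}$ for $d \geq 3$, optional stopping gives $h(|x|) = p\,h(\varepsilon) + (1-p)\,h(R)$, where $p$ is the probability of hitting the inner sphere first. The sketch below solves this for $p$; the function name `hit_inner_first` is hypothetical.

```python
import math


def hit_inner_first(r, eps, R, d):
    """P_x(hit {|y| = eps} before {|y| = R}) for |x| = r, eps <= r <= R, d >= 2."""
    if d == 2:
        h = math.log                    # log|y| is harmonic in the punctured plane
    else:
        h = lambda s: s ** (2 - d)      # |y|^(2-d) is harmonic for d >= 3
    return (h(R) - h(r)) / (h(R) - h(eps))


for big_radius in (10.0, 1e3, 1e6):
    print(big_radius, hit_inner_first(2.0, 1.0, big_radius, 2), hit_inner_first(2.0, 1.0, big_radius, 3))
```

As $R \to \infty$ the planar probability tends to $1$ (neighbourhood recurrence), while in $d = 3$ it tends to $\varepsilon / |x| < 1$ (transience).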
## 6.9 The Dirichlet Problem
Brownian motion provides a probabilistic solution to the classical Dirichlet problem of potential theory. The connection between Brownian motion and harmonic functions is one of the deepest in probability theory: the expected value of a boundary function evaluated at the exit point of Brownian motion solves Laplace's equation.
[quotetheorem:1186]
The proof uses the strong Markov property to show that $u$ satisfies the mean value property, which characterises harmonic functions.
[citeproof:1186]
The mean value property argument --- using the strong Markov property at the exit time of a small ball, then invoking rotational invariance to identify the exit distribution --- is the fundamental link between Brownian motion and harmonic analysis.
[quotetheorem:1187]
The proof uses the maximum principle: a function that is continuous on $\overline{D}$ and harmonic in the bounded domain $D$ attains its extrema on the boundary.
[citeproof:1187]
Harmonicity and uniqueness are established. The remaining question is whether $u$ is continuous up to the boundary with the correct boundary values. This requires a geometric condition on $D$ that prevents the domain from having inward-pointing cusps where Brownian motion could linger.
[quotetheorem:1188]
The proof uses the cone condition to show that Brownian motion exits $D$ near its starting point with high probability, then combines this with the continuity of $\varphi$ to control $|u(x) - \varphi(z)|$.
[citeproof:1188]
Combining the three results: if $D$ satisfies the Poincar\'e cone condition at every boundary point and $\varphi : \partial D \to \mathbb{R}$ is continuous, then $u(x) = \mathbb{E}_x[\varphi(B_\tau)]$ is the unique continuous function on $\overline{D}$ satisfying $\Delta u = 0$ on $D$ and $u = \varphi$ on $\partial D$.
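A one-dimensional worked example makes the formula concrete. Take $D = (-b, a)$ with $a, b > 0$, so that $\partial D = \{-b, a\}$ and $\tau = T_a \wedge T_{-b}$. Solving $u'' = 0$ with the prescribed boundary values gives
\begin{align*}
u(x) = \mathbb{E}_x[\varphi(B_\tau)] = \varphi(a) \, \frac{x + b}{a + b} + \varphi(-b) \, \frac{a - x}{a + b}, \qquad x \in [-b, a].
\end{align*}
With $\varphi(a) = 1$, $\varphi(-b) = 0$ and $x = 0$ this recovers the gambler's ruin probability $\mathbb{P}_0(T_a < T_{-b}) = b/(a+b)$.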
## 6.10 Donsker's Invariance Principle
Brownian motion is the universal scaling limit of centred random walks with finite variance. This is a functional central limit theorem: convergence holds not just for the marginal distributions but for the entire path.
[quotetheorem:1189]
The proof uses the Skorokhod embedding theorem, which embeds the random walk into a Brownian motion via a sequence of stopping times.
[quotetheorem:1190]
The Skorokhod embedding and Donsker's invariance principle are stated here without proof. The original embedding construction (Skorokhod, 1965) proceeds by the "gambler's ruin device": given the target distribution $\mu$ with mean zero, one decomposes $\mu$ as a mixture $\mu = p \delta_{-a} + q \delta_b + (1-p-q) \nu$ for suitable $a, b > 0$ and a residual distribution $\nu$, then defines the stopping time as the first exit from $[-a, b]$ (which has the correct marginals by the gambler's ruin computation), and iterates. The full details can be found in Billingsley, *Convergence of Probability Measures*, Chapter 4.
For Donsker's theorem, the key step is to show that the rescaled stopping times $T_{\lfloor Nt \rfloor} / N \to \sigma^2 t$ a.s. by the strong law of large numbers (since $\mathbb{E}[T_1] = \sigma^2$), and then to use the continuity of Brownian paths to conclude that $S^{[N]}$ and $B$ are close in the supremum norm.
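The rescaling in Donsker's theorem can be made concrete with a few lines of code. The sketch below assumes the common normalisation in which the walk is evaluated at times $k/N$, divided by $\sigma\sqrt{N}$, and linearly interpolated; the exact definition of $S^{[N]}$ in the theorem may differ, and the function names are hypothetical.

```python
import math


def rescaled_walk(steps, sigma):
    """Values S_k / (sigma * sqrt(N)) at the times k / N, k = 0..N, where N = len(steps)."""
    n = len(steps)
    scale = sigma * math.sqrt(n)
    values, total = [0.0], 0.0
    for x in steps:
        total += x
        values.append(total / scale)
    return values


def evaluate(values, t):
    """Linearly interpolate the rescaled walk at a time t in [0, 1]."""
    n = len(values) - 1
    k = min(int(t * n), n - 1)
    frac = t * n - k
    return values[k] + frac * (values[k + 1] - values[k])


values = rescaled_walk([1, -1, 1, 1], sigma=1.0)
print(values)               # [0.0, 0.5, 0.0, 0.5, 1.0]
print(evaluate(values, 0.125))
```

The interpolated path is a random element of $C[0,1]$, which is the space on which the weak convergence in the theorem takes place.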
## 6.11 Zeros of Brownian Motion
The zero set of Brownian motion has a remarkable fractal structure that reflects the infinite oscillation near each zero.
[quotetheorem:1191]
The proof uses the strong Markov property at rational approximations to zeros, together with the immediate-return-to-zero property established earlier.
[citeproof:1191]
The zero set is thus a perfect set (closed, no isolated points) of Lebesgue measure zero, and in fact a fractal with Hausdorff dimension $1/2$. The proof that the Hausdorff dimension is $1/2$ uses the occupation times formula and is beyond the scope of this course; see M\"orters and Peres, *Brownian Motion*, Chapter 9.
The present chapter has exploited all the machinery developed in Chapters 1--5: conditional expectation and the tower property (for the Markov and strong Markov properties), martingale convergence (for the regularisation theorem and the construction), Kolmogorov's continuity criterion (for the existence of continuous paths), weak convergence and L\'evy's continuity theorem (for verifying finite-dimensional distributions), and the optional stopping theorem (for the recurrence-transience analysis and the Dirichlet problem). The final chapter introduces a different class of random objects: random point measures.
# 7. Poisson Random Measures
The final chapter introduces Poisson random measures, which generalise the Poisson process to arbitrary $\sigma$-finite measure spaces. While Brownian motion models continuous random fluctuations, Poisson random measures model random collections of points --- arrivals, jumps, events occurring in space or time. The two together form the building blocks from which all L\'evy processes are constructed (via the L\'evy--It\^o decomposition, not treated in this course).
The chapter begins with the algebraic properties of the Poisson distribution that make the construction possible, then constructs the random measure itself, and develops its integration theory.
## 7.1 Addition and Splitting for the Poisson Distribution
The construction of Poisson random measures rests on two structural properties of the Poisson distribution.
[quotetheorem:1192]
The addition property follows from computing the probability generating function of the sum: $\mathbb{E}[z^{\sum_k N_k}] = \prod_k \mathbb{E}[z^{N_k}] = \prod_k e^{\lambda_k(z-1)} = e^{(\sum_k \lambda_k)(z-1)}$, which is the PGF of $\operatorname{Po}(\sum_k \lambda_k)$.
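The addition property can also be checked numerically in the two-variable case by convolving probability mass functions. This is only a sanity check under the assumption of two independent summands; the function names are hypothetical.

```python
import math


def poisson_pmf(lam, k):
    """P(N = k) for N ~ Po(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def convolve_pmf(lam1, lam2, n):
    """P(N1 + N2 = n) for independent N1 ~ Po(lam1) and N2 ~ Po(lam2)."""
    return sum(poisson_pmf(lam1, j) * poisson_pmf(lam2, n - j) for j in range(n + 1))


largest_gap = max(abs(convolve_pmf(1.5, 2.5, n) - poisson_pmf(4.0, n)) for n in range(20))
print(largest_gap)  # rounding error only
```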
The splitting property is the converse: a Poisson random variable split into independent categories produces independent Poisson counts.
[quotetheorem:1193]
The proof is a direct computation of the joint probability generating function, using the law of total expectation to condition on $N$.
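The computation can be sketched as follows, in notation that may differ from the theorem: let $N \sim \operatorname{Po}(\lambda)$, let each of the $N$ items be assigned independently to category $j$ with probability $p_j$ (where $\sum_j p_j = 1$), and let $N_j$ be the number of items in category $j$. Conditional on $N = n$ the vector $(N_j)$ is multinomial, so
\begin{align*}
\mathbb{E}\Bigl[\prod_j z_j^{N_j}\Bigr] = \mathbb{E}\Bigl[\Bigl(\sum_j p_j z_j\Bigr)^{N}\Bigr] = \exp\Bigl(\lambda \Bigl(\sum_j p_j z_j - 1\Bigr)\Bigr) = \prod_j e^{\lambda p_j (z_j - 1)}.
\end{align*}
The joint PGF factorises into the PGFs of $\operatorname{Po}(\lambda p_j)$, which gives both the independence and the Poisson marginals.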
## 7.2 The Poisson Random Measure
[definition: Poisson Random Measure]
Let $(E, \mathcal{E}, \mu)$ be a $\sigma$-finite measure space. A **Poisson random measure** (PRM) with intensity $\mu$ is a random measure $M : \Omega \times \mathcal{E} \to \mathbb{Z}_+ \cup \{\infty\}$ satisfying, for all sequences $(A_k)$ of disjoint sets in $\mathcal{E}$:
(i) $M(\bigcup_k A_k) = \sum_k M(A_k)$ (countable additivity for each $\omega$).
(ii) $M(A_k)$, $k \in \mathbb{N}$, are independent random variables.
(iii) $M(A_k) \sim \operatorname{Po}(\mu(A_k))$ for all $k$.
[/definition]
The definition combines measure-theoretic structure (countable additivity) with probabilistic structure (independence and Poisson marginals). The existence of such a measure is not immediate: one must verify that these three conditions are mutually consistent.
[quotetheorem:1194]
The proof constructs the PRM in two stages: first for finite $\mu$ using the splitting property, then for $\sigma$-finite $\mu$ by partitioning $E$ into sets of finite measure and taking independent copies. The uniqueness argument uses the $\pi$-system technique, which by now is a familiar tool.
[citeproof:1194]
The construction reveals the intuitive picture of a PRM. When $\mu$ is finite, one draws a $\operatorname{Po}(\mu(E))$ number of "points" in $E$, each placed independently according to the normalised intensity $\mu / \mu(E)$. The measure $M$ is the counting measure of these random points: $M(A)$ counts how many points fall in $A$.
An important consequence: conditional on $M(A) = k$ (for $\mu(A) < \infty$), the restriction $M|_A$ has the law of $\sum_{i=1}^k \delta_{X_i}$ where $X_1, \ldots, X_k$ are i.i.d. with law $\mu(\cdot \cap A) / \mu(A)$. Moreover, restrictions to disjoint sets are independent.
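The finite-intensity stage of the construction translates directly into a sampler. The sketch below assumes $E = [0,1)^2$ with intensity equal to a constant multiple of Lebesgue measure, and uses Knuth's multiplication method for the Poisson count (a choice made here for self-containedness, suitable only for moderate means). All names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson(lam) sample via Knuth's multiplication method."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_prm(total_mass, rng):
    """Points of a PRM on the unit square with intensity total_mass * Lebesgue."""
    count = sample_poisson(total_mass, rng)
    return [(rng.random(), rng.random()) for _ in range(count)]


def measure(points, indicator):
    """M(A): the number of points in the set A described by its indicator function."""
    return sum(1 for p in points if indicator(p))


rng = random.Random(0)
points = sample_prm(20.0, rng)
left = measure(points, lambda p: p[0] < 0.5)
right = measure(points, lambda p: p[0] >= 0.5)
print(left + right == len(points))  # additivity over a partition of E
```

By the splitting property, `left` and `right` are independent $\operatorname{Po}(10)$ counts, which is exactly conditions (ii) and (iii) of the definition for this partition.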
## 7.3 Integration with Respect to a PRM
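For a measurable function $f : E \to \mathbb{R}$, the integral $M(f) = \int_E f(y) \, M(dy)$ is defined by the standard approximation: first for indicators, then for simple functions, then for non-negative measurable functions via MCT, and finally for signed $f$ through the decomposition $f = f^+ - f^-$ whenever at least one of $M(f^+)$, $M(f^-)$ is finite.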
[quotetheorem:1195]
The proof proceeds by reducing to the case $f = \mathbb{1}_A$ (where $M(A) \sim \operatorname{Po}(\mu(A))$ and the formulae are just the Poisson moment generating function), extending to simple functions by independence, and passing to general $f$ by approximation. The key point in verifying (i) is that the variance formula requires $f \in L^2(\mu)$, not just $f \in L^1(\mu)$: the variance of $M(f)$ involves $\int f^2 \, d\mu$, which is the second moment of $f$ with respect to $\mu$.
[citeproof:1195]
The Laplace functional $\mathbb{E}[e^{-uM(f)}] = \exp\bigl(-\int_E (1 - e^{-uf}) \, d\mu\bigr)$, as $f$ ranges over non-negative measurable functions and $u \geq 0$, uniquely determines the law of the PRM and is the starting point for many computations in the theory of point processes and L\'evy processes. The characteristic function formula in (iii) is the Poisson analogue of the Gaussian characteristic function $\mathbb{E}[e^{iuX}] = e^{-u^2\sigma^2/2}$: the Poisson version replaces $-u^2\sigma^2/2$ with $\int (e^{iuf} - 1) \, d\mu$.
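For a simple function the Laplace functional can be verified by hand. As a sketch, take $f = \sum_{k=1}^n a_k \mathbb{1}_{A_k}$ with $a_k \geq 0$ and disjoint sets $A_k$ of finite measure (notation introduced for this illustration). Independence of the counts $M(A_k)$ and the Poisson Laplace transform $\mathbb{E}[e^{-sN}] = \exp(-\lambda(1 - e^{-s}))$ for $N \sim \operatorname{Po}(\lambda)$ give
\begin{align*}
\mathbb{E}\bigl[e^{-uM(f)}\bigr] = \prod_{k=1}^n \mathbb{E}\bigl[e^{-u a_k M(A_k)}\bigr] = \prod_{k=1}^n \exp\bigl(-\mu(A_k)(1 - e^{-u a_k})\bigr) = \exp\Bigl(-\int_E (1 - e^{-uf}) \, d\mu\Bigr).
\end{align*}
The last equality holds because $1 - e^{-uf}$ equals $1 - e^{-u a_k}$ on $A_k$ and vanishes off $\bigcup_k A_k$.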
### 7.3.1 Compensated Integrals
A natural question is whether the centred (compensated) integral $\widetilde{M}(f) = M(f) - \mathbb{E}[M(f)] = \int_E f \, dM - \int_E f \, d\mu$ defines a martingale when the integration domain grows with time. The answer is yes, under appropriate integrability conditions.
[example: Compensated Poisson Integral]
Let $E = \mathbb{R}_+ \times S$ for some measurable space $(S, \mathcal{S})$, and let $\mu = \mathcal{L}^1 \otimes \nu$ where $\nu$ is a $\sigma$-finite measure on $S$ (the "L\'evy measure"). For $f : S \to \mathbb{R}$ with $f \in L^1(\nu) \cap L^2(\nu)$, define:
\begin{align*}
\widetilde{M}_t(f) = \int_{[0,t] \times S} f(y) \, M(ds, dy) - t \int_S f(y) \, \nu(dy).
\end{align*}
Since $M([0,t] \times A)$ and $M((t, t'] \times A)$ are independent for disjoint time intervals (by the independence property of the PRM), the process $\widetilde{M}_t(f)$ has independent increments. Moreover, $\mathbb{E}[\widetilde{M}_t(f)] = 0$ and $\operatorname{Var}(\widetilde{M}_t(f)) = t \int_S f^2 \, d\nu$. A centred, integrable process with independent increments is a martingale with respect to its natural filtration, so $\widetilde{M}_t(f)$ is a martingale.
The integrability condition $f \in L^2(\nu)$ is essential for the variance to be finite. If $f \in L^1(\nu) \setminus L^2(\nu)$, then $\widetilde{M}_t(f)$ is well-defined as an $L^1$ random variable but has infinite variance, and the martingale property still holds in $L^1$. If $f \notin L^1(\nu)$, the compensator $t \int_S f \, d\nu$ is infinite or undefined, and the centred integral cannot be defined directly as the difference above.
[/example]
This compensated integral is the jump analogue of the stochastic integral with respect to Brownian motion (It\^o integral), and the two together form the building blocks of the general stochastic integral with respect to semimartingales.
## References
- P. Billingsley, *Convergence of Probability Measures*, 2nd edition, Wiley, 1999.
- R. Durrett, *Probability: Theory and Examples*, 5th edition, Cambridge University Press, 2019.
- O. Kallenberg, *Foundations of Modern Probability*, 2nd edition, Springer, 2002.
- P. M\"orters and Y. Peres, *Brownian Motion*, Cambridge University Press, 2010.
- D. Williams, *Probability with Martingales*, Cambridge University Press, 1991.
Cambridge III Advanced Probability
Created by admin on 4/18/2026 | Last updated on 4/18/2026