Markov chains are fundamental mathematical models for systems that evolve randomly over time, where the future depends only on the present state, not on the history. This memoryless property — called the Markov property — makes them remarkably tractable while still capturing the essential randomness in many real-world phenomena: weather patterns, stock prices, population dynamics, and queuing systems all admit Markov chain descriptions. The course develops both the theory underlying these models and the techniques for analysing their long-run behaviour.
The course unfolds in natural stages. After motivating the Markov property and its consequences, we introduce the formal structure of Markov chains as sequences of states connected by transition probabilities, then classify chains and individual states by their connectivity and recurrence properties. These structural insights lead directly to questions about convergence: do chains settle into stable distributions, do they return to certain states infinitely often, and how long does transient behaviour persist? The final chapter on time reversal reveals a beautiful duality — that many chains look statistically identical when run backward — with applications to sampling algorithms and detailed balance in physical systems.
# Introduction
[motivation]
Probability theory, as studied in a first course, is largely a theory of independence. We toss a coin repeatedly, roll dice in sequence, draw from an urn with replacement — in each case, the outcome of one trial tells us nothing about the next. This independence assumption is mathematically convenient and allows us to prove powerful results: the law of large numbers, the central limit theorem, and so on. But it is also, frankly, an idealization.
In the real world, consecutive observations are almost never independent. The weather tomorrow is not independent of the weather today. The queue length at a service counter at 3pm is not independent of the queue length at 2pm. The price of a stock next week is not independent of its price this week. If we wish to model such phenomena mathematically, we need a theory of *dependent* random variables.
The difficulty is that "dependence" is an enormously broad concept. Given two random variables $X$ and $Y$, knowing that they are dependent tells us very little about their joint structure. Their relationship could be simple or extraordinarily complex, monotone or oscillatory, local or non-local. Without further assumptions, very little can be said in general, and essentially no theory can be built.
This parallels the situation in analysis. Given a function $f: \mathbb{R} \to \mathbb{R}$, knowing merely that $f$ exists tells us nothing useful. But if we assume $f$ is continuous, or differentiable, or convex, a rich theory opens up. The structural assumption is what makes the mathematics tractable.
The course takes the following approach: restrict attention to a particularly tractable — yet still general enough to be interesting — form of dependence. We study sequences of random variables $(X_0, X_1, X_2, \ldots)$ in which the conditional distribution of the next state $X_{n+1}$, given the entire history $X_0, X_1, \ldots, X_n$, depends only on the current state $X_n$. In symbols:
\begin{align*}
\mathbb{P}(X_{n+1} = j \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, X_n = i) = \mathbb{P}(X_{n+1} = j \mid X_n = i).
\end{align*}
This is the *Markov property*, and a sequence satisfying it is called a *Markov chain*. The intuition is vivid: the future of the process is determined entirely by its present state. The past — however the chain arrived at its current position — is irrelevant once the current state is known. This is sometimes called the "memorylessness" of the process, though the more precise term is the Markov property.
The Markov property is a structural constraint, and structural constraints enable proofs. As we will see throughout this course, just this single assumption — that the future depends only on the present — is enough to derive a rich and complete theory:
- We can compute the probability of reaching any state from any other state.
- We can determine which states a chain will eventually return to, and with what probability.
- We can identify when a chain settles into a long-run equilibrium distribution, independent of where it started.
- We can reverse time and ask whether the chain looks statistically the same run backwards.
None of these would be tractable for general dependent sequences. The Markov property is precisely the right level of constraint: strong enough to admit a complete theory, weak enough to model a vast range of real phenomena.
Before the formal definitions begin, it is worth noting that Markov chains are not exotic objects. A simple random walk on $\mathbb{Z}$ — where at each step the walker moves right or left with equal probability, independently of everything except the current position — is a Markov chain. The branching process — where each individual independently produces a random number of offspring — is a Markov chain (with the current population size as the state). Coin-tossing and die-rolling are degenerate Markov chains where the transition probabilities do not depend on the current state at all (since consecutive outcomes are independent).
What this course provides is the general framework that encompasses all these examples, and the tools to analyse them systematically.
This course studies discrete-time Markov chains on countable state spaces. The journey proceeds in four main phases. First, we define the basic objects and establish how the transition matrix $P = (p_{ij})$ encodes the entire dynamics of a homogeneous chain. Second, we classify states and chains according to their long-run behaviour: which states communicate, which are recurrent (the chain will surely return), and which are transient (there is positive probability of never returning). Simple random walks in dimensions one, two, and three provide concrete illustrations of this dichotomy. Third, we study invariant distributions and convergence to equilibrium — the deep theorem that an irreducible, positive recurrent, aperiodic chain forgets its initial state and converges to a unique stationary distribution. Finally, we examine time-reversal and detailed balance, which characterise chains that look the same run forwards and backwards.
Throughout, the emphasis is on understanding the probabilistic structure rather than treating Markov chains as a branch of linear algebra — though the language of matrices will appear naturally and usefully at several points.
[/motivation]
## The Transition Matrix
The Markov property makes the structure of a chain remarkably concrete. Since the conditional distribution of $X_{n+1}$ given $X_n = i$ does not depend on the past, it is determined entirely by the probabilities
\begin{align*}
p_{ij} = \mathbb{P}(X_{n+1} = j \mid X_n = i), \qquad i, j \in S.
\end{align*}
For a homogeneous chain (one whose transition probabilities do not change with time), these numbers are constant across all steps, and we arrange them into the **transition matrix** $P = (p_{ij})_{i,j \in S}$. The entry $p_{ij}$ is the probability of moving from state $i$ to state $j$ in a single step.
For $P$ to be a valid transition matrix, its rows must sum to one: $\sum_{j \in S} p_{ij} = 1$ for every $i \in S$ (since from state $i$, the chain must go somewhere). A matrix with non-negative entries and row sums equal to one is called a **stochastic matrix**.
The initial distribution $\lambda = (\lambda_i)_{i \in S}$, where $\lambda_i = \mathbb{P}(X_0 = i)$, completes the specification. The pair $(\lambda, P)$ determines the joint distribution of the entire sequence $(X_0, X_1, X_2, \ldots)$: for any finite sequence of states $i_0, i_1, \ldots, i_n \in S$,
\begin{align*}
\mathbb{P}(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \lambda_{i_0}\, p_{i_0 i_1}\, p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.
\end{align*}
This is the fundamental formula of the subject: the probability of any trajectory is a product of one-step transition probabilities, weighted by the initial distribution.
[example: Weather Model]
Consider a simple weather model with state space $S = \{\text{Sunny}, \text{Rainy}\}$, which we abbreviate as $S = \{0, 1\}$ with $0 = \text{Sunny}$ and $1 = \text{Rainy}$. Suppose that if today is sunny, tomorrow is sunny with probability $0.8$ and rainy with probability $0.2$; and if today is rainy, tomorrow is sunny with probability $0.4$ and rainy with probability $0.6$. The transition matrix is
\begin{align*}
P = \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{pmatrix},
\end{align*}
where row $i$ gives the transition probabilities from state $i$. Each row sums to one, confirming $P$ is stochastic.
Suppose it is sunny on day 0, so the initial distribution is $\lambda = (1, 0)$. What is the probability that day 2 is rainy? There are two trajectories reaching state 1 at time 2:
- $0 \to 0 \to 1$: probability $\lambda_0 \cdot p_{00} \cdot p_{01} = 1 \cdot 0.8 \cdot 0.2 = 0.16$.
- $0 \to 1 \to 1$: probability $\lambda_0 \cdot p_{01} \cdot p_{11} = 1 \cdot 0.2 \cdot 0.6 = 0.12$.
So $\mathbb{P}(X_2 = 1) = 0.16 + 0.12 = 0.28$. Equivalently, the two-step transition probability from state 0 to state 1 is $(P^2)_{01} = 0.28$, since the $n$-step transition probabilities are encoded in the matrix power $P^n$ — a fact known as the **Chapman-Kolmogorov equation**, which we establish in the next chapter.
[/example]
[illustration:two-state-weather-transition-diagram]
<!-- illustration-needed: transition diagram for the two-state weather model — two nodes labelled Sunny and Rainy with directed edges labelled with transition probabilities 0.8 (self-loop at Sunny), 0.2 (Sunny to Rainy), 0.4 (Rainy to Sunny), 0.6 (self-loop at Rainy) -->
## Three-State Random Walk
A slightly richer example illustrates the main structural questions of the course. Consider a particle performing a random walk on the state space $S = \{0, 1, 2\}$, where at each step the particle moves to an adjacent state with equal probability, with reflecting boundaries at 0 and 2:
\begin{align*}
P = \begin{pmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \end{pmatrix}.
\end{align*}
Every row sums to one. From state 0 the particle must go to state 1; from state 1 it goes to state 0 or 2 with equal probability; from state 2 it must return to state 1. Starting from state 0, we can compute $\mathbb{P}(X_1 = 1) = 1$, $\mathbb{P}(X_2 = 0) = 1/2$, $\mathbb{P}(X_2 = 2) = 1/2$.
This chain is **irreducible**: every state can be reached from every other state. It is also **recurrent**: starting from any state, the chain returns to that state with probability one. It has a unique invariant distribution $\pi = (1/4, 1/2, 1/4)$ — that is, $\pi P = \pi$ and $\sum_i \pi_i = 1$. Verifying this:
\begin{align*}
(\pi P)_0 &= \pi_0 \cdot 0 + \pi_1 \cdot \tfrac{1}{2} + \pi_2 \cdot 0 = \tfrac{1}{4} = \pi_0, \\
(\pi P)_1 &= \pi_0 \cdot 1 + \pi_1 \cdot 0 + \pi_2 \cdot 1 = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2} = \pi_1, \\
(\pi P)_2 &= \pi_0 \cdot 0 + \pi_1 \cdot \tfrac{1}{2} + \pi_2 \cdot 0 = \tfrac{1}{4} = \pi_2.
\end{align*}
The central questions of the course — when does an invariant distribution exist, and does the chain converge to it? — are illustrated concretely here before the general theory is developed.
## Road Map of the Course
The course develops the theory of discrete-time, homogeneous Markov chains on countable state spaces in four stages.
**Foundations.** Chapter 1 makes the Markov property precise and establishes how $(\lambda, P)$ determines all finite-dimensional distributions. The Chapman-Kolmogorov equation $P^{m+n} = P^m P^n$ is proved: $n$-step transition probabilities are matrix powers, connecting the probabilistic structure to linear algebra.
**Classification.** Chapter 2 partitions the state space $S$ into communicating classes and introduces recurrence versus transience. A state $i$ is recurrent if $\mathbb{P}_i(X_n = i \text{ for some } n \geq 1) = 1$, and transient otherwise. The canonical illustration is the simple symmetric random walk on $\mathbb{Z}^d$: recurrent for $d \leq 2$, transient for $d \geq 3$. The proof uses generating functions and the divergence of the harmonic series.
**Equilibrium.** Chapter 3 establishes the existence and uniqueness of invariant distributions for irreducible positive recurrent chains, and proves convergence to equilibrium for aperiodic chains via the coupling method.
**Reversibility.** Chapter 4 characterises reversible chains via the detailed balance equations $\pi_i p_{ij} = \pi_j p_{ji}$ and explores the connection to symmetric matrices and spectral methods.
Throughout, the emphasis is on probabilistic reasoning — the tools are hitting times, return probabilities, and couplings, not just eigenvalues and matrix algebra.
Now that we have built intuition about probabilistic reasoning and the core tools of hitting times and couplings, we turn to the foundational definitions and properties of Markov chains themselves.
# 1. Markov Chains
This chapter lays the foundations of the theory of Markov chains. The central idea is simple but powerful: a stochastic process has the Markov property if the future depends on the present alone, not on the entire past history. Starting from this single assumption, we derive the algebraic structure that makes Markov chains tractable — the transition matrix and the Chapman–Kolmogorov equations — and state the extended and strong Markov properties that underpin most of the deeper theory to come.
## The Markov Property and Its Consequences
### Defining Markov Chains
In IA Probability we studied sequences of independent random variables. Independence is a very strong assumption: once you know all past outcomes, you know nothing more about the future. But most real-world processes are not independent. A queue, a gene frequency, a random walk on a graph — in each case the next state depends on the current one. The question is: how can we model dependence in a way that is rich enough to be interesting but structured enough to yield a rigorous theory?
The Markov assumption is the simplest possible answer: the future depends on the present, but given the present, the future is independent of the past. But before making this formal, it is worth seeing what the Markov property actually rules out. Consider a process $X_n$ where $X_{n+1} = X_n + X_{n-1} + \varepsilon_n$, with $\varepsilon_n$ i.i.d. noise: the next state depends on both the current and the previous value, not on the current value alone. Given $X_n$, knowledge of $X_{n-1}$ still provides additional information about $X_{n+1}$, so this process is not Markov. The Markov property is precisely what allows us to summarise the entire history of the chain by its current state.
[definition: Markov Chain]
Let $X = (X_0, X_1, \ldots)$ be a sequence of random variables taking values in a countable set $S$, called the **state space**.
We say $X$ has the **Markov property** if for all $n \geq 0$ and all $i_0, \ldots, i_{n+1} \in S$,
\begin{align*}
\mathbb{P}(X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n) = \mathbb{P}(X_{n+1} = i_{n+1} \mid X_n = i_n).
\end{align*}
A sequence satisfying the Markov property is called a **Markov chain**.
The chain is **homogeneous** (or **time-homogeneous**) if the conditional probabilities $\mathbb{P}(X_{n+1} = j \mid X_n = i)$ do not depend on $n$.
[/definition]
Throughout this course, all Markov chains are assumed to be homogeneous unless explicitly stated otherwise. Since $S$ is countable, we typically label states by integers $i \in \mathbb{N}$ or by a finite set $\{1, 2, \ldots, N\}$.
[example: Random Walk and Branching Process]
Two familiar examples are already Markov chains:
**Random walk.** Let $\xi_1, \xi_2, \ldots$ be i.i.d. integer-valued random variables, and set $X_n = \xi_1 + \cdots + \xi_n$ with $X_0 = 0$. Then $X_{n+1} = X_n + \xi_{n+1}$, so the future increment $\xi_{n+1}$ is independent of $X_0, \ldots, X_{n-1}$ given $X_n$. Hence $(X_n)$ is a Markov chain on $\mathbb{Z}$.
**Branching process.** Let $X_n$ denote the size of the $n$-th generation of a population, where each individual independently produces offspring according to a fixed distribution. Given $X_n = k$, the next generation size $X_{n+1}$ is the sum of $k$ independent copies of the offspring distribution, which depends only on $k$, not on the history $X_0, \ldots, X_{n-1}$.
**A non-Markov process.** For contrast, let $Y_n = X_n + X_{n-1}$ (the running sum of the last two values of a random walk). Then $Y_{n+1} = X_{n+1} + X_n$, but $X_n = Y_n - X_{n-1}$, so knowing $Y_n$ alone does not determine $X_n$: we also need $X_{n-1}$ (or equivalently $Y_{n-1}$). The conditional distribution of $Y_{n+1}$ given $Y_n$ still depends on $Y_{n-1}$, so $(Y_n)$ is not a Markov chain on $\mathbb{Z}$.
[/example]
### Specifying a Markov Chain
Given that a Markov chain is determined by its current state alone, what data do we actually need to write one down? The state space $S$ is fixed, but we still need to specify where the chain starts and how it moves. The minimal algebraic data turns out to be the **initial distribution** and the **transition matrix**.
To fully specify a homogeneous Markov chain we need two objects:
1. The **initial distribution** $\lambda = (\lambda_i : i \in S)$, where $\lambda_i = \mathbb{P}(X_0 = i)$.
2. The **transition probabilities** $p_{i,j} = \mathbb{P}(X_{n+1} = j \mid X_n = i)$, assembled into the **transition matrix** $P = (p_{i,j})_{i,j \in S}$.
The notation $(\lambda, P)$ for a Markov chain is standard.
Not every pair $(\lambda, P)$ is valid. The following proposition characterises exactly which pairs arise:
[quotetheorem:2204]
[citeproof:2204]
[remark: Row Sums vs Column Sums]
The stochastic condition requires only that each **row** sums to one. Column sums need not equal one: the matrix $P$ governs transitions out of a given state, not into it. A matrix whose columns also sum to one is called **doubly stochastic**; such matrices arise when the chain has a uniform stationary distribution, but most chains of interest are not doubly stochastic.
[/remark]
The constraints in this theorem are necessary but also sufficient: given any probability vector $\lambda$ and any stochastic matrix $P$, the Kolmogorov extension theorem guarantees the existence of a Markov chain on $S^{\mathbb{N}}$ with these parameters. We take this existence result for granted.
### The Joint Distribution of a Markov Chain
Once we have $\lambda$ and $P$ in hand, how much do they tell us? Is the pair $(\lambda, P)$ really sufficient to recover every probability involving the chain — not just one-step probabilities, but the probability of any finite path? The answer is yes, and the formula is elegant: every joint probability is a simple product of the relevant entries of $\lambda$ and $P$.
Knowing $(\lambda, P)$ is sufficient to compute every finite-dimensional distribution of $(X_n)$. The next theorem makes this precise and gives a clean product formula for path probabilities.
[quotetheorem:2205]
[citeproof:2205]
The product formula is the workhorse identity of the theory: any probability involving a finite path can be read off directly from $\lambda$ and $P$.
### The Extended Markov Property
The basic Markov property conditions only on the event $\{X_n = i\}$. In practice, we need a stronger statement: given the current state, the entire future is independent of the entire past, not just of the one previous step.
[quotetheorem:2206]
The proof stitches together finitely many applications of the one-step Markov property via the path probability formula; the calculation is a straightforward induction and is omitted here. The extended property is the form of the Markov property most often used in practice, since it lets us treat the chain "restarted from time $n$" as an independent copy with the same transition matrix.
## Transition Matrices and $n$-Step Probabilities
### $n$-Step Transition Probabilities
The one-step transition probability $p_{i,j}$ tells us how likely a single step from $i$ to $j$ is. For many questions — what is the probability of being in state $j$ after $n$ steps? — we need the $n$-step analogue.
[definition: $n$-Step Transition Probability]
The **$n$-step transition probability** from state $i$ to state $j$ is
\begin{align*}
p_{i,j}(n) = \mathbb{P}(X_n = j \mid X_0 = i), \quad n \geq 0,\; i,j \in S.
\end{align*}
[/definition]
The boundary cases anchor the definition. For $n = 0$, the chain starts at $i$ and has not moved, so $p_{i,j}(0) = \mathbb{1}_{\{i = j\}}$. For $n = 1$, we recover the one-step transition probability: $p_{i,j}(1) = p_{i,j}$, the $(i,j)$ entry of $P$.
### The Chapman–Kolmogorov Equations
To compute $p_{i,j}(m+n)$, we condition on the intermediate state after $m$ steps. If the chain is at some state $k$ at time $m$, the probability of then reaching $j$ in $n$ further steps is $p_{k,j}(n)$; by the Markov property, this is independent of how the chain reached $k$.
[quotetheorem:2207]
[citeproof:2207]
### Matrix Formulation
The Chapman-Kolmogorov equations say that the $n$-step transition probabilities interact exactly as matrix products. Why does this matter? Because it transforms a probabilistic computation — tracking how probability mass flows over $n$ steps — into a piece of linear algebra that we already know how to do efficiently.
Write $P(n) = (p_{i,j}(n))_{i,j \in S}$ for the matrix of $n$-step transition probabilities. The Chapman–Kolmogorov equations then say exactly that $P(m+n) = P(m) P(n)$ — the product is ordinary matrix multiplication.
In particular, since $P(1) = P$,
\begin{align*}
P(n) = P(1)\, P(n-1) = \cdots = P^n.
\end{align*}
The $n$-step transition probabilities are simply the entries of the $n$-th power of the transition matrix.
This connection to matrix algebra is striking. Computing $p_{i,j}(n)$ reduces to computing a matrix power, which can be done efficiently via diagonalisation or Jordan normal form. Moreover, it places Markov chains squarely within linear algebra, even though in this course we emphasise the probabilistic perspective. The family $(P^n)_{n \geq 0}$ satisfies $P^{m+n} = P^m P^n$ — the **semigroup property** — which reflects the translation-invariance of time-homogeneous chains: the chain's behaviour over $m+n$ steps is the same whether we split the interval at the start or anywhere in the middle.
### Distribution Propagation
So far we have tracked probabilities conditional on a fixed starting state $X_0 = i$. What if the initial state is itself random, with distribution $\lambda$? How does the distribution of $X_n$ evolve over time? The answer is clean and connects directly to the matrix power $P^n$.
If $X_0$ has distribution $\lambda$, then for each $j \in S$,
\begin{align*}
\mathbb{P}(X_1 = j) &= \sum_{i \in S} \mathbb{P}(X_1 = j \mid X_0 = i)\, \mathbb{P}(X_0 = i) = \sum_{i \in S} \lambda_i\, p_{i,j}.
\end{align*}
Writing $\lambda$ as a row vector, this says the distribution of $X_1$ is $\lambda P$. By the same reasoning, the distribution of $X_n$ is $\lambda P^n$.
[example: Two-State Chain]
Let $S = \{1, 2\}$ with transition matrix
\begin{align*}
P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},
\end{align*}
where $0 < \alpha, \beta < 1$. We find the $n$-step transition probabilities by diagonalising $P$.
**Eigenvalues.** Solving $\det(P - \kappa I) = 0$:
\begin{align*}
(1 - \alpha - \kappa)(1 - \beta - \kappa) - \alpha\beta = 0.
\end{align*}
Expanding and simplifying gives $\kappa^2 - (2 - \alpha - \beta)\kappa + (1 - \alpha - \beta) = 0$, with roots
\begin{align*}
\kappa_1 = 1, \qquad \kappa_2 = 1 - \alpha - \beta.
\end{align*}
**Form of $p_{1,2}(n)$.** Since $P^n$ is diagonalisable (the two eigenvalues are distinct when $\alpha + \beta \neq 1$), the entry $p_{1,2}(n)$ must be of the form
\begin{align*}
p_{1,2}(n) = A \cdot 1^n + B \cdot (1 - \alpha - \beta)^n = A + B(1-\alpha-\beta)^n.
\end{align*}
We determine $A$ and $B$ from the boundary values $p_{1,2}(0) = 0$ and $p_{1,2}(1) = \alpha$:
\begin{align*}
A + B &= 0, \\
A + B(1 - \alpha - \beta) &= \alpha.
\end{align*}
Subtracting the first from the second: $B(1 - \alpha - \beta - 1) = \alpha$, so $B = -\alpha/(\alpha + \beta)$ and $A = \alpha/(\alpha + \beta)$. Therefore
\begin{align*}
p_{1,2}(n) = \frac{\alpha}{\alpha + \beta}\bigl(1 - (1-\alpha-\beta)^n\bigr).
\end{align*}
Since $p_{1,1}(n) + p_{1,2}(n) = 1$, we also have $p_{1,1}(n) = \frac{\beta}{\alpha+\beta} + \frac{\alpha}{\alpha+\beta}(1-\alpha-\beta)^n$.
By symmetry (interchanging $\alpha$ and $\beta$ and the roles of states 1 and 2), the full matrix is
\begin{align*}
P^n = \frac{1}{\alpha + \beta}
\begin{pmatrix}
\beta + \alpha(1-\alpha-\beta)^n & \alpha - \alpha(1-\alpha-\beta)^n \\
\beta - \beta(1-\alpha-\beta)^n & \alpha + \beta(1-\alpha-\beta)^n
\end{pmatrix}.
\end{align*}
To verify $p_{2,1}(n)$: setting $\alpha \leftrightarrow \beta$ in the formula for $p_{1,2}(n)$ gives $p_{2,1}(n) = \frac{\beta}{\alpha+\beta}(1-(1-\alpha-\beta)^n)$, which matches the $(2,1)$ entry above.
**Long-run behaviour.** Since $|\kappa_2| = |1-\alpha-\beta| < 1$ (as $0 < \alpha,\beta < 1$ and $\alpha + \beta < 2$), the term $(1-\alpha-\beta)^n \to 0$ as $n \to \infty$. Thus
\begin{align*}
P^n \to \frac{1}{\alpha + \beta}
\begin{pmatrix} \beta & \alpha \\ \beta & \alpha \end{pmatrix}.
\end{align*}
The two rows of the limiting matrix are identical: the long-run probability of being in state 2 is $\alpha/(\alpha+\beta)$, regardless of whether the chain started in state 1 or state 2. This "forgetting of initial conditions" is a special case of the convergence to equilibrium theorem proved in Chapter 3.
[/example]
The two-state chain illustrates the general strategy for computing $P^n$: find the eigenvalues, write entries as linear combinations of eigenvalue powers, and determine coefficients from boundary conditions. The same method extends to any finite state space.
## The Strong Markov Property
The basic Markov property conditions on the current state at a **fixed** time $n$. Many problems in the theory require conditioning at a **random** time — for example, the first time the chain hits a particular state. The strong Markov property extends the Markov property to such random stopping times.
[definition: Stopping Time]
A random variable $T : \Omega \to \{0, 1, 2, \ldots\} \cup \{\infty\}$ is a **stopping time** for $(X_n)$ if for each $n \geq 0$, the event $\{T = n\}$ is determined by $(X_0, X_1, \ldots, X_n)$ — that is, whether the chain stops by time $n$ can be decided from the chain's history up to and including time $n$.
[/definition]
The key examples are hitting times: for a state $j \in S$, define
\begin{align*}
T_j = \min\{n \geq 0 : X_n = j\}
\end{align*}
(with $T_j = \infty$ if the chain never visits $j$). The event $\{T_j = n\} = \{X_0 \neq j, \ldots, X_{n-1} \neq j, X_n = j\}$ depends only on $X_0, \ldots, X_n$, so $T_j$ is a stopping time.
[quotetheorem:2208]
The strong Markov property states that the chain "restarts from scratch" at any stopping time, with no memory of how it arrived there. The proof uses the fact that on $\{T = m\}$ the event $\{T = m\}$ is determined by $X_0, \ldots, X_m$, so we may apply the extended Markov property at the fixed time $m$ and then sum over all possible values of $m$. The details are omitted here; we will invoke the result freely throughout the course.
It is important that $T$ must genuinely be a stopping time — a random variable whose value can be determined from the chain's history up to time $T$. The property fails for random times that depend on the future. For instance, let $T^* = \sup\{n \leq 10 : X_n = j\}$ be the **last** visit to state $j$ before time 10. Whether $T^*$ has occurred cannot be decided from $X_0, \ldots, X_{T^*}$ alone (since we do not yet know that no later visit to $j$ will occur), so $T^*$ is not a stopping time. Conditioning on $X_{T^*} = j$ and trying to apply the strong Markov property would give the wrong answer: the chain after $T^*$ is biased away from $j$ (since $T^*$ was chosen as the last visit), whereas an ordinary restart from $j$ has no such bias.
[remark: Strong vs Extended Markov Property]
The extended Markov property conditions on a fixed time $n$: given $X_n = i$, past and future are independent. The strong Markov property conditions on a random stopping time $T$: given $T < \infty$ and $X_T = i$, the future $(X_{T+n})_{n \geq 0}$ is independent of $(X_0, \ldots, X_T)$. The strong version is strictly more powerful and is essential for analysing hitting times and absorption probabilities.
[/remark]
<!-- illustration-needed: transition diagram for the two-state chain — two nodes labelled 1 and 2 with directed arrows between them labelled alpha (from 1 to 2) and beta (from 2 to 1), and self-loops labelled 1-alpha (at node 1) and 1-beta (at node 2) -->
The rest of the course builds systematically on these foundations. In the next chapter we classify the states and communicating classes of a chain — asking which states can be reached from which, and which classes the chain can never escape.
Understanding how individual states behave is essential, but Markov chains often exhibit global structure. We now classify the states and communicating classes of a chain — asking which states can be reached from which, and which classes the chain can never escape.
# 2. Classification of Chains and States
This chapter classifies the states and communicating classes of a Markov chain. The central questions are structural: which states can the chain reach from which? Will it return to a given state, and if so, how often? The answers partition the state space into communicating classes and further classify each state as recurrent or transient, positive recurrent or null recurrent, and periodic or aperiodic. These classifications determine the long-run behaviour of the chain.
## Communicating Classes
A Markov chain on a state space $S$ may exhibit very different structural behaviour in different parts of $S$. Some states may be unreachable from others; from certain states, the chain may escape never to return. To study long-run behaviour, we want to identify and isolate the "self-contained" portions of the chain — the regions from which there is no escape. This section introduces the vocabulary for this structural analysis: communicating classes, closed classes, and irreducibility.
## Accessibility and Communication
The first step is to decide when one state can "see" another. Since the chain can take any number of steps, the right notion is reachability in a finite number of steps.
[definition: Accessibility and Communication]
Let $X$ be a Markov chain on state space $S$ with $n$-step transition probabilities $p_{i,j}(n) = \mathbb{P}(X_n = j \mid X_0 = i)$.
We say state $j$ is **accessible** from state $i$, written $i \to j$, if there exists $n \geq 0$ such that
\begin{align*}
p_{i,j}(n) > 0.
\end{align*}
In words: starting from $i$, there is a positive probability of reaching $j$ in some finite number of steps.
We say states $i$ and $j$ **communicate**, written $i \leftrightarrow j$, if $i \to j$ and $j \to i$.
[/definition]
Two points deserve immediate attention. First, the relation $i \to i$ holds for every state, since $p_{i,i}(0) = \mathbb{P}(X_0 = i \mid X_0 = i) = 1 > 0$ — in zero steps, we are trivially already at $i$. Second, communication is a symmetric relation by definition: $i \leftrightarrow j$ is exactly the same statement as $j \leftrightarrow i$.
What is less immediate is that communication is also transitive, which upgrades it to a full equivalence relation.
[quotetheorem:2209]
[citeproof:2209]
The key step in the transitivity argument is that the Chapman–Kolmogorov equation allows us to lower-bound the $(m+n)$-step probability by singling out the path that passes through $j$ at time $m$. The sum over all intermediate states can only be larger than any single term.
## Communicating Classes
Since $\leftrightarrow$ is an equivalence relation, it partitions $S$ into equivalence classes.
[definition: Communicating Class]
The equivalence classes of $\leftrightarrow$ are called **communicating classes**. Concretely, the communicating class of a state $i$ is
\begin{align*}
[i] = \{j \in S : i \leftrightarrow j\}.
\end{align*}
[/definition]
It is important not to confuse "communicating class" with "isolated component". Two states in different communicating classes are not necessarily disconnected — the chain can move from one class to another, but not back.
[explanation: How Classes Interact]
Within a single communicating class $A$, the chain can travel between any two states. But the chain can also jump from $A$ to a different class $B$: what is ruled out is returning from $B$ to $A$. Thus the communicating classes form a directed acyclic structure at the class level: if there is a path from class $A$ to class $B$ (meaning $i \to j$ for some $i \in A$, $j \in B$ with $A \neq B$), then there is no path from $B$ back to $A$.
In a chain with finitely many communicating classes, if we keep moving outward along this directed structure, we must eventually reach a class from which we cannot escape. The chain then stays in that class forever.
[/explanation]
<!-- illustration-needed: directed graph showing three communicating classes with one-way arrows between them, illustrating the acyclic inter-class structure — class {1,2,3} can reach {4} and {5,6}, and {4} can reach {5,6}, but {5,6} has no outgoing arrows to other classes -->
This directed acyclic structure among classes makes it natural to ask which classes sit at the "bottom" — the ones with no exits. These are the closed classes, which we define next.
## Closed Classes and Absorption
The classes that "trap" the chain are called closed.
[definition: Closed Set]
A subset $C \subseteq S$ is **closed** if
\begin{align*}
p_{i,j} = 0 \quad \text{for all } i \in C,\, j \notin C.
\end{align*}
In words: once the chain enters $C$, it can never leave. A closed communicating class is called an **absorbing class**, and a closed class consisting of a single state $\{i\}$ is called an **absorbing state**.
[/definition]
The condition $p_{i,j} = 0$ for $j \notin C$ is a condition on the one-step matrix. The following proposition shows that this is equivalent to a multi-step condition.
[quotetheorem:2210]
[citeproof:2210]
To see these definitions in action, we work through an explicit six-state chain, identifying the communicating classes and checking which are closed.
[example: Identifying Communicating Classes and Closed Sets]
Consider the state space $S = \{1, 2, 3, 4, 5, 6\}$ with transition matrix
\begin{align*}
P =
\begin{pmatrix}
\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0\\
\tfrac{1}{3} & 0 & 0 & \tfrac{1}{3} & \tfrac{1}{3} & 0\\
0 & 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0\\
0 & 0 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 0 & 1 & 0\\
\end{pmatrix}.
\end{align*}
We trace the accessible states from each state:
- From state $1$: $1 \to 1$ (self-loop with prob $\tfrac{1}{2}$) and $1 \to 2$ (prob $\tfrac{1}{2}$). From $2$: $2 \to 3$ (prob $1$). From $3$: $3 \to 1$ (prob $\tfrac{1}{3}$), so $3 \to 1$. Thus $1 \to 2 \to 3 \to 1$, giving $1 \leftrightarrow 2 \leftrightarrow 3$. The class $\{1,2,3\}$ communicates internally.
- From state $4$: $4 \to 4$ (self-loop) and $4 \to 5$. From $5$: $5 \to 6$, $6 \to 5$, so $5 \leftrightarrow 6$. But $5 \not\to 4$ and $6 \not\to 4$, so $4$ does not communicate with $5$ or $6$. The class $\{4\}$ is its own communicating class.
- States $5$ and $6$: $5 \to 6$ (prob $1$) and $6 \to 5$ (prob $1$), so $5 \leftrightarrow 6$. The class $\{5,6\}$ communicates internally.
The three communicating classes are $\{1,2,3\}$, $\{4\}$, and $\{5,6\}$.
Now check which are closed. From class $\{1,2,3\}$: state $3$ has $p_{3,4} = \tfrac{1}{3} > 0$ and $p_{3,5} = \tfrac{1}{3} > 0$, so the chain can leave to $\{4\}$ or $\{5,6\}$. Not closed. From $\{4\}$: $p_{4,5} = \tfrac{1}{2} > 0$, so the chain can leave to $\{5,6\}$. Not closed. From $\{5,6\}$: $p_{5,6} = 1$ and $p_{6,5} = 1$; no probability escapes to $1,2,3,4$. The class $\{5,6\}$ is **closed**.
Starting from any state, the chain will eventually reach $\{5,6\}$ and oscillate between $5$ and $6$ indefinitely.
[/example]
[illustration:communicating-classes-directed-structure]
## Irreducibility
The most tractable chains are those with a single communicating class, where every state can reach every other state.
[definition: Irreducible Chain]
A Markov chain is **irreducible** if $S$ itself is a single communicating class, i.e., $i \leftrightarrow j$ for all $i, j \in S$.
[/definition]
Equivalently, a chain is irreducible if and only if the entire state space is closed and every state is accessible from every other. In an irreducible chain, there are no "escape routes" to elsewhere, because there is no "elsewhere": the chain is always within the unique communicating class.
[remark: Why Irreducibility Is the Right Assumption]
In the example above, if we start in $\{1,2,3\}$, the chain will eventually leave and never return. The long-run behaviour is determined entirely by what happens in the closed class $\{5,6\}$. This means that, for most questions about long-run averages and stationary distributions, it suffices to study irreducible chains. Non-irreducible chains can often be reduced to studying their closed communicating classes one at a time.
[/remark]
For the remainder of these notes, unless stated otherwise, **all chains are assumed to be irreducible**. This assumption ensures that the chain cannot get "trapped" or "excluded", and allows us to develop a clean theory of long-run behaviour.
## Recurrence and Transience
Having classified states into communicating classes and understood how irreducibility organises the state space, the next question is: what is the long-run fate of the chain? Starting from a given state, will the process inevitably return, or might it drift away forever? This is the subject of **recurrence and transience** — the central qualitative dichotomy in the theory of Markov chains.
## Motivation: Will the Chain Return?
The question arises naturally from the study of random walks in IA Probability. A simple random walk starts at $0$ and at each step moves left or right with equal probability. One can ask: will it ever return to $0$? And if so, will it return infinitely often?
For an irreducible chain on a finite state space, the chain cannot escape to infinity, so the question of return is automatically settled — the process must revisit every state. The interesting situation is an **infinite state space**, where the chain might drift further and further from its starting point and never come back. Recurrence and transience give a precise, probabilistic answer to this question.
We introduce convenient notation for probabilities and expectations conditioned on the starting state.
[definition: Conditional Notation]
For a Markov chain $(X_n)_{n \geq 0}$ with state space $S$, we write
\begin{align*}
\mathbb{P}_i(A) &= \mathbb{P}(A \mid X_0 = i), \\
\mathbb{E}_i[Z] &= \mathbb{E}[Z \mid X_0 = i],
\end{align*}
for any event $A$ and random variable $Z$.
[/definition]
## First Passage Times and Recurrence
Before defining recurrence, we need the concept of the **first passage time** — the first moment at which the chain visits a given state, after starting from some initial state.
[definition: First Passage Time and Probability]
Let $(X_n)_{n \geq 0}$ be a Markov chain on state space $S$. For $j \in S$, the **first passage time** to $j$ is
\begin{align*}
T_j = \min\{n \geq 1 : X_n = j\}.
\end{align*}
We require $n \geq 1$ in the definition; this ensures that $T_i$ is not trivially zero when the chain starts at $i$.
The **first passage probability** from $i$ to $j$ in exactly $n$ steps is
\begin{align*}
f_{ij}(n) = \mathbb{P}_i(T_j = n).
\end{align*}
[/definition]
Note that $T_j$ implicitly depends on the starting state $i$ through the law $\mathbb{P}_i$. The total probability of ever reaching $j$ from $i$ is $\sum_{n=1}^\infty f_{ij}(n) = \mathbb{P}_i(T_j < \infty)$.
We can now make the central definition.
[definition: Recurrent and Transient States]
A state $i \in S$ is **recurrent** (also called **persistent**) if
\begin{align*}
\mathbb{P}_i(T_i < \infty) = 1,
\end{align*}
that is, starting from $i$, the chain returns to $i$ with probability $1$.
A state $i \in S$ is **transient** if it is not recurrent, i.e., $\mathbb{P}_i(T_i < \infty) < 1$.
[/definition]
The names are suggestive: a recurrent state is one the chain keeps coming back to, while a transient state is one the chain may pass through and leave behind. The following remark clarifies the precise meaning of transience, which is often misread.
[remark: What Transience Really Means]
Transience does not mean the chain never returns. It means there is a strictly positive probability of never returning. A transient state may still be visited again — it is simply not guaranteed. By contrast, a recurrent state will be revisited with probability $1$.
[/remark]
In fact, for a recurrent state, not only is return guaranteed once — it is guaranteed infinitely often. Starting from a recurrent state $i$, the chain returns to $i$ with probability $1$, and by the strong Markov property applied at each return time, it returns again and again: $\mathbb{P}_i(X_n = i \text{ for infinitely many } n) = 1$.
<!-- illustration-needed: schematic of a recurrent chain (trajectories looping back to origin repeatedly) versus a transient chain (trajectory drifting away with positive probability of never returning) -->
## The Summability Criterion
The defining property of recurrence involves the probability of return, which can be hard to compute directly. The following fundamental theorem gives an equivalent characterisation in terms of the $n$-step transition probabilities $p_{ii}(n) = \mathbb{P}_i(X_n = i)$, which are the entries of the matrix $P^n$ and are often more tractable.
[quotetheorem:2211]
The proof uses generating functions. For fixed $i, j \in S$, define the generating functions
\begin{align*}
P_{ij}(s) &= \sum_{n=0}^\infty p_{ij}(n)\, s^n, \\
F_{ij}(s) &= \sum_{n=0}^\infty f_{ij}(n)\, s^n,
\end{align*}
for $s \in (-1, 1]$, with the conventions $p_{ij}(0) = \delta_{ij}$ and $f_{ij}(0) = 0$.
The key algebraic identity relating these two generating functions is:
[quotetheorem:2212]
[citeproof:2212]
To pass from $s < 1$ to $s = 1$ and recover sums of the original sequences, the proof invokes Abel's lemma from analysis.
[quotetheorem:2213]
This is a standard result in analysis (appearing, for instance, on the first example sheet of Analysis II), so we use it without proof here.
[citeproof:2213]
This criterion is powerful: $p_{ii}(n)$ is the $(i,i)$-entry of $P^n$, which can often be computed or estimated explicitly, as we will see for random walks.
## Recurrence as a Class Property
A priori one might need to check recurrence for every state individually. The next theorem shows this is unnecessary: within a communicating class, all states share the same classification.
[quotetheorem:2214]
[citeproof:2214]
The asymmetry between the two parts of the theorem is worth noting: recurrence forces the class to be closed, but transience imposes no such constraint.
[remark: Transient Classes Need Not Be Closed]
Part (2) says recurrent communicating classes are automatically closed. Transient communicating classes, however, can be non-closed: the chain may leave the class with positive probability, and every state in the class is transient.
[/remark]
An immediate and important consequence concerns finite state spaces.
[quotetheorem:2215]
[citeproof:2215]
The finite case is thus completely settled. The richness of the theory — and the possibility of genuine transience — appears only for **infinite state spaces**.
## Simple Random Walks in $\mathbb{Z}^d$: Pólya's Theorem
The most illuminating illustration of recurrence versus transience is the simple random walk on the integer lattice $\mathbb{Z}^d$. This is the Markov chain with state space $\mathbb{Z}^d$ in which at each step the walk moves to one of its $2d$ lattice neighbours, each chosen with equal probability $\tfrac{1}{2d}$:
\begin{align*}
\mathbb{P}(X_{n+1} = j \mid X_n = i) = \begin{cases} \dfrac{1}{2d} & \text{if } |j - i| = 1 \\ 0 & \text{otherwise,} \end{cases}
\end{align*}
where $|\,\cdot\,|$ denotes the Euclidean norm. The chain is irreducible (one can reach any lattice point from any other), so by the class property all states share the same classification.
<!-- illustration-needed: the integer lattice $\mathbb{Z}^d$ for $d=1$ (a line of dots) and $d=2$ (a square grid) with arrows showing the uniform steps to nearest neighbours -->
[quotetheorem:2216]
The proof proceeds by computing $p_{00}(n)$ (the probability of return to the origin in $n$ steps) and testing the summability criterion.
[citeproof:2216]
### Why the Two-Dimensional Probability Factors as a Square
The computation above reveals a striking identity: $p_{00}^{(2)}(2n) = \left[p_{00}^{(1)}(2n)\right]^2$, where the superscripts denote the dimension. This suggests independence, but horizontal and vertical increments of the two-dimensional walk are not independent — at each step, exactly one coordinate changes.
The resolution uses a change of coordinates. Write the position as $X_n = (A_n, B_n) \in \mathbb{Z}^2$ and define
\begin{align*}
U_n &= A_n - B_n, \\
V_n &= A_n + B_n.
\end{align*}
At each step, either $A_n$ changes by $\pm 1$ (and $B_n$ stays fixed) or $B_n$ changes by $\pm 1$ (and $A_n$ stays fixed). In both cases, $U_n$ and $V_n$ **each** change by $\pm 1$, and crucially, the steps in $U$ and $V$ are independent (each is a fair $\pm 1$ step, uncorrelated with the other). Therefore $U$ and $V$ are independent simple random walks on $\mathbb{Z}$, and
\begin{align*}
p_{00}^{(2)}(2n) &= \mathbb{P}(A_n = 0,\, B_n = 0) = \mathbb{P}(U_n = 0,\, V_n = 0) \\
&= \mathbb{P}(U_n = 0)\,\mathbb{P}(V_n = 0) = \left[\binom{2n}{n}\left(\frac{1}{2}\right)^{2n}\right]^2.
\end{align*}
This factorisation is the genuine explanation: the diagonal coordinate system decouples the two-dimensional walk into two independent one-dimensional walks.
[example: Consequences of Pólya's Theorem]
Consider a particle performing a simple random walk starting at the origin in each dimension.
In **one dimension**, the walk is recurrent: the particle will return to $0$ — and in fact to every integer — with probability $1$, and it will visit each state infinitely often.
In **two dimensions**, the walk is also recurrent: a wandering particle on the integer plane will return to its starting point with probability $1$.
In **three dimensions**, the walk is transient. The probability that a three-dimensional simple random walk ever returns to the origin is strictly less than $1$. A classical computation gives the return probability as approximately $0.3405$, so with probability roughly $65.95\%$ the particle drifts away and never returns.
The intuition is that in higher dimensions there is "more room" for the walk to escape: the volume of the ball of radius $r$ in $\mathbb{Z}^d$ grows like $r^d$, while the number of paths of length $n$ that return to the origin grows more slowly, making it increasingly unlikely that any one of them lands back at the starting point.
[/example]
Pólya's theorem illustrates how the qualitative behaviour of the chain — recurrent or transient — depends not on any local feature of the walk, but on the global geometry of the state space.
## Hitting Probabilities and Mean Hitting Times
Recurrence and transience tell us whether we return to a state with certainty, but they leave open a quantitative question: what is the actual probability of ever reaching a particular set of states, and how long should we expect to wait before we get there? These questions are answered by the theory of hitting probabilities and mean hitting times, which forms the computational core of the chapter. The key tool is first-step analysis — conditioning on the first step of the chain to derive a system of equations that the hitting probabilities or expected times must satisfy.
## Setting Up the Hitting Problem
We work throughout with a Markov chain $X = (X_n)_{n \geq 0}$ on a countable state space $S$ with transition matrix $P = (p_{i,j})$.
[definition: Hitting Time]
Let $A \subseteq S$. The **hitting time** of $A$ is the random variable
\begin{align*}
H^A = \min\{n \geq 0 : X_n \in A\},
\end{align*}
with the convention that $H^A = \infty$ if the chain never enters $A$. If the chain starts in $A$, then $H^A = 0$.
The **hitting probability** of $A$ starting from state $i$ is
\begin{align*}
h_i^A = \mathbb{P}_i(H^A < \infty) = \mathbb{P}_i(\text{the chain ever reaches } A).
\end{align*}
[/definition]
The subscript $i$ in $\mathbb{P}_i$ denotes that the chain starts at $X_0 = i$. Note that $h_i^A = 1$ for all $i \in A$, since the chain starts in $A$ and $H^A = 0$.
## First-Step Analysis for Hitting Probabilities
The hitting probabilities $h_i^A$ are not computed directly from the definition — that would require summing over all possible paths. Instead, we condition on the first step of the chain. This technique is called **first-step analysis**, and it reduces the global question "does the chain ever reach $A$?" to a local linear system.
[quotetheorem:2217]
[citeproof:2217]
The minimality condition is what pins down $h_i^A$ uniquely: the system of equations can have many non-negative solutions (we will see this concretely in the gambler's ruin example), but the hitting probability is always the smallest one. This is because any non-negative solution that satisfies the recurrence dominates the true probability by the iterative argument above.
[remark: Why Minimality Matters]
If the chain starting from $i$ reaches $A$ with probability strictly less than $1$, then $h_i^A < 1$, and there will be other non-negative solutions larger than $h_i^A$. Without the minimality condition, we could not tell which solution corresponds to the actual hitting probability.
[/remark]
## Mean Hitting Times
Given that we want to reach $A$, a natural follow-up question is: how many steps does it take on average? The expected hitting time from state $i$ is $k_i^A = \mathbb{E}_i[H^A]$. We should be careful: if $h_i^A < 1$ then $H^A = \infty$ with positive probability, so $k_i^A = \infty$ regardless. Even when $h_i^A = 1$ (we reach $A$ with certainty), it is possible that $k_i^A = \infty$; the hitting probability being $1$ is necessary but not sufficient for a finite mean.
[quotetheorem:2218]
[citeproof:2218]
The "$1 +$" in the recurrence reflects the fact that one step elapses when moving from state $i$ to a neighbour $j$, before the mean hitting time $k_j^A$ from $j$ is incurred.
## Gambler's Ruin
The classical illustration of first-step analysis is the gambler's ruin problem. It captures the situation of a gambler who accumulates or loses money with each bet and wants to know the probability of eventual ruin.
[example: Gambler's Ruin]
Consider a random walk on $\mathbb{N} = \{0, 1, 2, \ldots\}$. From any state $i \geq 1$, the chain moves to $i+1$ with probability $p$ and to $i-1$ with probability $q = 1-p$. State $0$ is absorbing: $p_{0,0} = 1$. We ask for the probability of ever reaching $0$ starting from state $i$, i.e., $h_i = h_i^{\{0\}}$.
By the theorem, $h_i$ is the minimal non-negative solution to
\begin{align*}
h_0 = 1, \qquad h_i = q\, h_{i-1} + p\, h_{i+1} \quad \text{for } i \geq 1.
\end{align*}
Rearranging gives the second-order linear difference equation
\begin{align*}
p\, h_{i+1} - h_i + q\, h_{i-1} = 0, \qquad i \geq 1.
\end{align*}
**Case $p \neq q$:** The general solution is $h_i = A + B(q/p)^i$. Since $h_i$ is a probability, $0 \leq h_i \leq 1$ for all $i$.
- If $p < q$ (drift towards $0$): $(q/p)^i \to \infty$ as $i \to \infty$, so the term $B(q/p)^i$ would grow without bound unless $B = 0$. Thus $h_i = A$ is constant, and the boundary condition $h_0 = 1$ forces $h_i = 1$ for all $i$. The gambler is ruined with certainty regardless of starting position.
- If $p > q$ (drift away from $0$): The general solution satisfying $h_0 = 1$ is
\begin{align*}
h_i = \left(\frac{q}{p}\right)^i + A\!\left(1 - \left(\frac{q}{p}\right)^i\right)
\end{align*}
for any $A \geq 0$ (non-negativity requires $A \geq 0$). The minimal solution, obtained by setting $A = 0$, is
\begin{align*}
h_i = \left(\frac{q}{p}\right)^i.
\end{align*}
**Case $p = q = 1/2$:** The difference equation has general solution $h_i = A + Bi$. The only bounded non-negative solution satisfying $h_0 = 1$ is $h_i = 1$. The symmetric random walk is recurrent: ruin is certain from every starting state.
To summarise: for $p > q$ the probability of ruin from state $i$ is $(q/p)^i < 1$, which decays exponentially in the starting wealth. For $p \leq q$, ruin is certain.
[/example]
<!-- illustration-needed: The gambler's ruin — a number line on N with arrows of weight p going right and q going left from each state i≥1, and state 0 labelled as an absorbing barrier. Overlay two curves of h_i^{(0)}: the constant h_i=1 for p≤q and the decaying curve (q/p)^i for p>q. -->
## Birth-Death Chains
The gambler's ruin is a random walk with constant step probabilities. The birth-death chain generalises this: the probability of moving up or down can depend on the current state, making it a natural model for population dynamics.
[definition: Birth-Death Chain]
Let $(p_i)_{i \geq 1}$ be a sequence with $p_i \in (0,1)$ for all $i \geq 1$, and set $q_i = 1 - p_i$. The **birth-death chain** on $\mathbb{N}$ is the Markov chain with transition probabilities
\begin{align*}
p_{i,\,i+1} = p_i, \qquad p_{i,\,i-1} = q_i, \qquad i \geq 1,
\end{align*}
and $p_{0,0} = 1$ (state $0$ is absorbing). The random walk is the special case $p_i = p$ constant.
[/definition]
The biological interpretation is transparent: at each step, the population either increases by one (a birth) with probability $p_i$, or decreases by one (a death) with probability $q_i$. When the population reaches $0$, extinction is permanent.
[example: Hitting Probability for the Birth-Death Chain]
We compute $h_i = h_i^{\{0\}}$, the probability of eventual extinction starting from population $i$. By first-step analysis, $h_i$ satisfies
\begin{align*}
h_0 = 1, \qquad p_i\, h_{i+1} - h_i + q_i\, h_{i-1} = 0, \qquad i \geq 1.
\end{align*}
Unlike the gambler's ruin, the coefficients depend on $i$, so this is not a constant-coefficient difference equation. We use a substitution to reduce it.
Rewrite the recurrence as
\begin{align*}
p_i(h_{i+1} - h_i) - q_i(h_i - h_{i-1}) = 0.
\end{align*}
Define $u_i = h_{i-1} - h_i$ (note the sign: since $h_i$ is non-increasing in $i$ when starting near $0$, the $u_i$ are non-negative). Then the recurrence becomes
\begin{align*}
u_{i+1} = \frac{q_i}{p_i}\, u_i.
\end{align*}
Iterating from $u_1$:
\begin{align*}
u_{i+1} = \frac{q_i}{p_i} \cdot \frac{q_{i-1}}{p_{i-1}} \cdots \frac{q_1}{p_1}\, u_1 = \gamma_i\, u_1,
\end{align*}
where we define
\begin{align*}
\gamma_i = \frac{q_1 q_2 \cdots q_i}{p_1 p_2 \cdots p_i}, \qquad \gamma_0 = 1.
\end{align*}
To recover $h_i$, sum the telescoping identity $u_k = h_{k-1} - h_k$ from $k = 1$ to $k = i$:
\begin{align*}
h_0 - h_i = \sum_{k=1}^{i} u_k = u_1 \sum_{k=0}^{i-1} \gamma_k.
\end{align*}
Since $h_0 = 1$:
\begin{align*}
h_i = 1 - u_1 \sum_{k=0}^{i-1} \gamma_k.
\end{align*}
The free parameter $u_1$ is determined by the minimality condition. Set $S = \sum_{k=0}^{\infty} \gamma_k$.
- **If $S = \infty$:** We must have $u_1 = 0$, otherwise $h_i \to -\infty$ as $i \to \infty$, contradicting $h_i \geq 0$. Hence $h_i = 1$ for all $i$: extinction is certain.
- **If $S < \infty$:** The minimal solution is obtained by maximising $u_1$ subject to $h_i \geq 0$ for all $i$. Taking $i \to \infty$ and requiring $h_i \geq 0$ gives the binding constraint $u_1 \leq 1/S$, so the minimum is attained at $u_1 = 1/S$. Then
\begin{align*}
h_i = 1 - \frac{1}{S}\sum_{k=0}^{i-1} \gamma_k = \frac{\sum_{k=i}^{\infty} \gamma_k}{\sum_{k=0}^{\infty} \gamma_k}.
\end{align*}
Thus the extinction probability from state $i$ is $h_i = (\sum_{k=i}^{\infty} \gamma_k) / S$ when $S < \infty$, and $h_i = 1$ when $S = \infty$.
[/example]
The formula is clean, but the criterion $S < \infty$ versus $S = \infty$ deserves a direct interpretation in terms of the birth and death rates.
[explanation: Interpreting the Extinction Criterion]
The quantity $\gamma_i = (q_1 \cdots q_i)/(p_1 \cdots p_i)$ measures the cumulative "drag towards $0$" up to level $i$: it is large when births are systematically less likely than deaths. The condition $S = \sum_{i=0}^{\infty} \gamma_i = \infty$ is the condition that this cumulative drag is infinite — intuitively, the chain always has enough pull back towards $0$ to guarantee return. When $S < \infty$, the drift away from $0$ eventually dominates, and there is a positive probability of escaping to infinity (i.e., the population grows without bound forever).
For the simple random walk with $p_i = p$ constant, $\gamma_i = (q/p)^i$, so $S = \sum_{i=0}^{\infty}(q/p)^i$. This series converges if and only if $q/p < 1$, i.e., $p > q$. This recovers the gambler's ruin result: extinction is certain if and only if $p \leq q$.
[/explanation]
This connection between the extinction criterion and recurrence is not coincidental: the two classifications are two sides of the same probabilistic coin.
[remark: Survival and Recurrence]
The dichotomy $S < \infty$ versus $S = \infty$ is closely related to the recurrence/transience classification from the previous section. When $h_i^{\{0\}} < 1$, the chain has a positive probability of never returning to $0$, which is precisely the definition of $0$ being transient. So the birth-death chain has $0$ transient if and only if $S < \infty$.
[/remark]
## The Strong Markov Property and Applications
The Markov property as we have encountered it so far applies at a fixed deterministic time: given $X_n = i$, the future of the chain after time $n$ is independent of the past. A natural and important generalisation replaces the fixed time $n$ by a random time $T$ that is determined by the evolution of the chain itself. The resulting statement is the **strong Markov property**, and it underpins many of the most useful arguments in the theory of Markov chains — including the key renewal arguments used to characterise recurrence and transience.
## Stopping Times
To allow a random time $T$ in the Markov property, we cannot permit $T$ to depend on the future. If it did, we would be conditioning on information we do not yet have, and the independence statement would break down. The correct condition is that $T$ depends only on what has been observed up to and including time $T$.
[definition: Stopping Time]
Let $X = (X_n)_{n \geq 0}$ be a Markov chain on a countable state space $S$. A random variable $T : \Omega \to \mathbb{N} \cup \{\infty\}$ is a **stopping time** for $X$ if, for every $n \geq 0$, the event $\{T = n\}$ is determined entirely by $(X_0, X_1, \ldots, X_n)$.
[/definition]
The key intuition is that a stopping time is one you could use as an instruction to stop: at each moment $n$, looking only at the history $(X_0, \ldots, X_n)$, you can decide whether $T = n$ or not. No knowledge of the future is required.
[example: Stopping Times in a Casino]
Suppose $X_n$ represents a gambler's fortune at time $n$. The rule "stop when your fortune first reaches $\$10$" defines a stopping time, because at time $n$ you know whether $X_n = 10$ and whether this is the first such occurrence. By contrast, the rule "stop one step before you go bankrupt" is not a stopping time: knowing at time $n$ whether $X_{n+1} = 0$ requires observing the future.
More formally, the hitting time
\begin{align*}
H^A &= \inf\{n \geq 0 : X_n \in A\}
\end{align*}
is a stopping time for any set $A \subset S$. The event $\{H^A = n\}$ equals $\{X_0 \notin A, X_1 \notin A, \ldots, X_{n-1} \notin A, X_n \in A\}$, which depends only on $X_0, \ldots, X_n$. Similarly, $H^A + 1$ is a stopping time (it equals $n$ iff $H^A = n-1$, determined by $X_0, \ldots, X_{n-1}$). However, $H^A - 1$ is not a stopping time: the event $\{H^A - 1 = n\}$ equals $\{H^A = n+1\}$, which requires knowing $X_{n+1}$.
[/example]
## The Strong Markov Property
With the notion of stopping time in hand, the strong Markov property asserts that the chain, restarted at the random time $T$, behaves exactly as if it had just started fresh from state $X_T$.
[quotetheorem:2208]
The proof proceeds by decomposing on the value of $T$: on the event $\{T = n\}$, the strong Markov property reduces to the ordinary (extended) Markov property at the fixed time $n$. The proof is not included in this course.
The content of the theorem is best appreciated by unpacking what "independent of $(X_0, \ldots, X_T)$" means: no matter what the history of the chain was leading up to the stopping time, the future trajectory starting from $X_T$ has the same distribution as a fresh chain started from $X_T$. This is called the **path decomposition at $T$**: the path of $X$ splits at the stopping time $T$ into a past $(X_0, \ldots, X_T)$ and a future $(X_{T+k})_{k \geq 0}$ that are mutually independent given $X_T$.
[remark: Contrast with the Weak Markov Property]
The ordinary Markov property applies at a fixed deterministic time $n$. In probability, "strong" versions of theorems replace fixed deterministic quantities with random ones, and typically imply the "weak" versions. Here, if $T \equiv n$ (a constant), the strong Markov property reduces to the extended Markov property. The strong version is strictly more powerful because it applies at random times like $H^A$.
[/remark]
## Application: Recurrence via Renewal Arguments
The strong Markov property is the engine behind renewal arguments in the classification of states. The idea is: if the chain returns to state $i$, it finds itself in exactly the same situation as it started — by the strong Markov property applied at the first return time $T_i^+ = \inf\{n \geq 1 : X_n = i\}$. This allows us to decompose the event of visiting $i$ infinitely often into a sequence of independent attempts.
Recall from the previous section that a state $i$ is **recurrent** if $\mathbb{P}_i(T_i^+ < \infty) = 1$ and **transient** if $\mathbb{P}_i(T_i^+ < \infty) < 1$. The strong Markov property gives a clean path to the following fundamental equivalence.
[quotetheorem:2219]
[citeproof:2219]
The geometric structure of this argument is worth unpacking carefully: it is the strong Markov property that makes each return attempt truly independent of the previous ones.
[explanation: Why the Renewal Argument Works]
The key step — that $\mathbb{P}_i(V_i \geq k) = f^k$ — is a direct consequence of the strong Markov property. Let $T_i^{(1)} = T_i^+$ be the first return time to $i$, and inductively define $T_i^{(k)}$ as the first return to $i$ after time $T_i^{(k-1)}$. Each $T_i^{(k)}$ is a stopping time, and by the strong Markov property applied at $T_i^{(k-1)}$, the probability that the $k$-th return occurs is exactly $f$, independently of what happened before. The successive attempts to return to $i$ are therefore Bernoulli trials with success probability $f$, and $V_i - 1$ (the number of returns after time $0$) has a geometric distribution.
This argument would fail if we used a non-stopping time — if $T$ depended on the future, the fresh-start property would not hold, and the returns would not be independent.
[/explanation]
## Application: The Generating Function of Hitting Times
The strong Markov property also gives precise quantitative information about the distribution of hitting times. We illustrate this with the gambler's ruin chain, where the method produces a complete description of the time to absorption.
[example: Hitting Time Distribution in Gambler's Ruin]
Consider the random walk on $\mathbb{N} \cup \{0\}$ with transition probabilities $p_{i, i+1} = p$ and $p_{i, i-1} = q = 1 - p$ for $i \geq 1$, and $p_{0,0} = 1$ (state $0$ is absorbing). We wish to find the distribution of the hitting time of $0$ starting from state $1$:
\begin{align*}
H &= \inf\{n \geq 0 : X_n = 0\}.
\end{align*}
Define the probability generating function
\begin{align*}
G_1(s) &= \mathbb{E}_1[s^H] = \sum_{n=0}^\infty s^n \mathbb{P}_1(H = n), \quad |s| < 1.
\end{align*}
(The restriction $|s| < 1$ ensures convergence even if $\mathbb{P}_1(H = \infty) > 0$.)
Conditioning on the first step:
\begin{align*}
G_1(s) &= \mathbb{E}_1[s^H] = p\,\mathbb{E}_1[s^H \mid X_1 = 2] + q\,\mathbb{E}_1[s^H \mid X_1 = 0].
\end{align*}
The second term is $q \cdot s$ since on $\{X_1 = 0\}$ we have $H = 1$. For the first term, starting from $2$, the chain must pass through $1$ before reaching $0$. Let $H'$ be the time to travel from $2$ to $1$, and $H''$ the time to travel from $1$ to $0$. By the strong Markov property applied at the stopping time "first hit of $1$ from $2$", the variables $H'$ and $H''$ are independent, each with the same distribution as $H$ under $\mathbb{P}_1$. Including the first step (cost $1$ in the exponent), we obtain:
\begin{align*}
G_1(s) &= ps\,G_1(s)^2 + qs. \tag{$*$}
\end{align*}
This is a quadratic in $G_1(s)$ with solutions
\begin{align*}
G_1(s) &= \frac{1 \pm \sqrt{1 - 4pqs^2}}{2ps}.
\end{align*}
Since $G_1(s)$ is a power series (hence continuous), the choice of sign must be constant in $s$. Examining the behaviour as $s \to 0$: the denominator $2ps \to 0$, so the numerator must also tend to $0$ for $G_1(s)$ to remain finite. Thus $1 \pm 1 \to 0$, forcing the minus sign. We conclude:
\begin{align*}
G_1(s) &= \frac{1 - \sqrt{1 - 4pqs^2}}{2ps}.
\end{align*}
The probability of ever hitting $0$ is
\begin{align*}
\mathbb{P}_1(H < \infty) &= \lim_{s \to 1} G_1(s) = \frac{1 - \sqrt{1 - 4pq}}{2p}.
\end{align*}
Using $q = 1 - p$, one computes $1 - 4pq = (1 - 2p)^2 = (p - q)^2$, so $\sqrt{1 - 4pq} = |p - q|$. Therefore:
\begin{align*}
\mathbb{P}_1(H < \infty) &= \frac{1 - |p - q|}{2p} = \begin{cases} 1 & \text{if } p \leq q, \\ q/p & \text{if } p > q. \end{cases}
\end{align*}
For the mean hitting time $\mu = \mathbb{E}_1[H]$: if $p > q$ then $\mathbb{P}_1(H = \infty) > 0$ so $\mu = \infty$. If $p \leq q$, differentiate equation $(*)$ with respect to $s$:
\begin{align*}
G_1'(s) &= p G_1(s)^2 + 2ps G_1(s) G_1'(s) + q,
\end{align*}
which rearranges to
\begin{align*}
G_1'(s) &= \frac{p G_1(s)^2 + q}{1 - 2ps\,G_1(s)}.
\end{align*}
By Abel's lemma, $\mu = \lim_{s \to 1} G_1'(s)$. Evaluating at $s = 1$ (using $G_1(1) = 1$ when $p \leq q$) gives:
\begin{align*}
\mu &= \begin{cases} \infty & \text{if } p = 1/2, \\ \dfrac{1}{q - p} & \text{if } p < 1/2. \end{cases}
\end{align*}
The case $p = 1/2$ (symmetric random walk) is recurrent ($\mathbb{P}_1(H < \infty) = 1$) yet the mean return time is infinite — this is the hallmark of **null recurrence**.
[/example]
The generating function calculation above is a template for many similar arguments. The structure is always the same: condition on the first step, use the strong Markov property to factorise the event in terms of independent copies, set up a functional equation for the generating function, and solve it. The factorisation $H = 1 + H' + H''$ with $H', H''$ independent and identically distributed is the **renewal structure** that the strong Markov property guarantees.
[remark: The Renewal Argument and Recurrence/Transience]
The gambler's ruin calculation shows directly that state $0$ is absorbing (recurrent). For state $1$: if $p \leq q$, then $\mathbb{P}_1(H < \infty) = 1$, and since the walk is irreducible on $\{1, 2, 3, \ldots\}$ (before absorption), all these states are recurrent. If $p > q$, then $\mathbb{P}_1(H < \infty) = q/p < 1$, and all states in $\{1, 2, \ldots\}$ are transient. This illustrates the general principle, established in the previous section, that recurrence and transience are class properties.
[/remark]
<!-- illustration-needed: path decomposition at a stopping time T — show a sample path of a Markov chain, mark the stopping time T on the time axis, and illustrate the "past" path up to T and the "future" path after T as two independent segments that rejoin at state X_T -->
## Further Classification of States
The classification of states as recurrent or transient, developed in the previous section, is an important first step. But it does not tell the full story. Among recurrent states there are meaningful distinctions: a chain might be guaranteed to return to a state, yet the expected waiting time for that return could be infinite. Separately, even for well-behaved chains, a subtle arithmetic obstruction — periodicity — can prevent the long-run convergence results we seek. This section introduces both refinements, establishes that they are class properties, and identifies the special class of ergodic states that will play a central role in the convergence theory of later chapters.
## Visits to a Recurrent State
Before defining mean recurrence times, it is worth recording a structural fact about recurrent states: if we are certain to return to a state, we are in fact certain to return to it infinitely many times. This transforms the qualitative statement "recurrence" into a quantitative counting statement.
Fix a state $i$ and suppose the chain starts at $i$, that is $X_0 = i$. Define
\begin{align*}
V_i = |\{n \geq 1 : X_n = i\}|,
\end{align*}
the total number of visits to $i$ after time zero. Recall that $f_{i,i} = \mathbb{P}_i(T_i < \infty)$ denotes the probability of ever returning to $i$ starting from $i$.
[quotetheorem:2220]
[citeproof:2220]
This result makes the intuition precise: recurrent states are visited infinitely often with probability one, while transient states are visited only finitely many times before the chain drifts away for good.
## Mean Recurrence Time and Positive Recurrence
Even among recurrent states there is a further distinction. The chain is guaranteed to return, but how long does it typically take? The mean recurrence time answers this question.
[definition: Mean Recurrence Time]
Let $T_i = \inf\{n \geq 1 : X_n = i\}$ be the first return time to state $i$. The **mean recurrence time** of $i$ is
\begin{align*}
\mu_i = \mathbb{E}_i(T_i) = \begin{cases} \infty & \text{if } i \text{ is transient,} \\ \displaystyle\sum_{n=1}^\infty n\, f_{i,i}(n) & \text{if } i \text{ is recurrent,} \end{cases}
\end{align*}
where $f_{i,i}(n) = \mathbb{P}_i(T_i = n)$ is the probability that the first return to $i$ occurs at time $n$.
[/definition]
For a transient state, the chain may never return at all, so the convention $\mu_i = \infty$ is natural. For a recurrent state, the chain returns with probability one, but the expected return time can still be infinite — the returns happen eventually, but the wait can be arbitrarily long on average.
The mean recurrence time gives a finer classification of recurrent states: those with finite mean return time behave much more regularly than those for which the mean is infinite.
[definition: Positive and Null Recurrence]
A recurrent state $i$ is called **positive recurrent** (or non-null) if $\mu_i < \infty$, and **null recurrent** if $\mu_i = \infty$.
[/definition]
The distinction matters enormously for long-run behaviour. Positive recurrent states are "strongly attracting" in the sense that the chain returns quickly on average; null recurrent states admit returns but with no finite timescale. We will see in the convergence theory of Chapter 3 that the existence of an invariant distribution — and hence the convergence of the chain to equilibrium — is tied precisely to positive recurrence.
The canonical example of null recurrence is the simple symmetric random walk on $\mathbb{Z}$, which we already know is recurrent but which turns out to have infinite mean return time.
[example: Simple Random Walk on $\mathbb{Z}$]
Consider the simple symmetric random walk on $\mathbb{Z}$: from any state $i$, the chain moves to $i+1$ or $i-1$ each with probability $\frac{1}{2}$. It is a classical result (proved in the section on recurrence and transience) that this chain is recurrent in dimension one, meaning $f_{0,0} = 1$. However, the mean recurrence time satisfies $\mu_0 = \infty$.
To see why, recall that $f_{0,0}(n) = 0$ for odd $n$ and, for even $n = 2k$,
\begin{align*}
f_{0,0}(2k) = \frac{1}{2k-1}\binom{2k}{k}\frac{1}{4^k}.
\end{align*}
Using Stirling's approximation, $\binom{2k}{k}/4^k \sim (\pi k)^{-1/2}$, so $f_{0,0}(2k) \sim c \cdot k^{-3/2}$ for a constant $c > 0$. Then
\begin{align*}
\mu_0 = \sum_{k=1}^\infty 2k \cdot f_{0,0}(2k) \sim 2c \sum_{k=1}^\infty k^{-1/2} = \infty.
\end{align*}
The simple symmetric random walk on $\mathbb{Z}$ is therefore null recurrent: it returns to the origin with probability one, but the expected return time is infinite.
[/example]
## Periodicity and Aperiodicity
Alongside the recurrence classification, there is a separate obstruction to well-behaved long-run dynamics: periodicity. The random walk example already hints at this. Starting from the origin, the chain can only return to the origin at even times — steps $2, 4, 6, \ldots$ — because each step changes the parity of the position. This means that even if the chain is positive recurrent, the probability $p_{0,0}(n)$ oscillates rather than converging. To obtain genuine convergence results, we need to exclude this arithmetic obstruction.
[definition: Period of a State]
The **period** of a state $i$ is
\begin{align*}
d_i = \gcd\{n \geq 1 : p_{i,i}(n) > 0\}.
\end{align*}
A state $i$ is called **aperiodic** if $d_i = 1$.
[/definition]
Intuitively, $d_i$ is the arithmetic "rhythm" at which returns to $i$ are possible. If $d_i = 2$, returns can only happen at even times; if $d_i = 3$, only at multiples of $3$; and so on. An aperiodic state has no such arithmetic constraint: returns can happen at times with no common factor greater than $1$.
[remark: Aperiodicity in Practice]
In most applications, periodicity is at most a minor nuisance. For example, the symmetric simple random walk on $\mathbb{Z}$ has period $2$ at every state. Adding a small self-loop probability $\varepsilon > 0$ — that is, setting $p_{i,i} = \varepsilon$ — makes every state aperiodic (since $d_i = \gcd\{1, 2, 3, \ldots\} = 1$), while changing the dynamics by only a tiny amount. In theoretical results, it is common to assume aperiodicity without loss of practical generality.
[/remark]
<!-- illustration-needed: periodicity of simple random walk — show a number line with the state 0 highlighted, and arrows indicating that only even-time transitions $p_{0,0}(2), p_{0,0}(4), \ldots$ are positive while odd-time ones are zero; contrast with an aperiodic chain where both $p_{0,0}(1)$ and $p_{0,0}(2)$ are positive -->
[example: Period of the Simple Random Walk]
For the simple symmetric random walk on $\mathbb{Z}$, starting from $0$, a return to $0$ requires an equal number of steps left and right, so any return takes an even number of steps. Thus $p_{0,0}(n) = 0$ for all odd $n$, and $p_{0,0}(2k) = \binom{2k}{k}(1/2)^{2k} > 0$ for all $k \geq 1$. Therefore
\begin{align*}
d_0 = \gcd\{2, 4, 6, \ldots\} = 2.
\end{align*}
The random walk on $\mathbb{Z}$ has period $2$ at every state.
Now consider a modified chain on $\{0, 1\}$ with transition matrix
\begin{align*}
P = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}.
\end{align*}
From state $0$, we can return to $0$ in one step (with probability $1/2$) or in two steps (via $0 \to 1 \to 0$, with probability $1/4$). So $p_{0,0}(1) = 1/2 > 0$ and $p_{0,0}(2) > 0$, giving $d_0 = \gcd\{1, 2, \ldots\} = 1$. State $0$ is aperiodic.
[/example]
## Ergodic States
Combining positive recurrence and aperiodicity yields the most tractable class of states.
[definition: Ergodic State]
A state $i$ is **ergodic** if it is aperiodic and positive recurrent.
[/definition]
Ergodic states are the "well-behaved" states of a Markov chain. In an irreducible chain consisting entirely of ergodic states, the convergence to equilibrium results of Chapter 3 apply in their strongest form: the $n$-step transition probabilities $p_{i,j}(n)$ converge to a limit $\pi_j$ that is independent of the starting state $i$.
## Class Properties
We established in the previous section that recurrence is a class property: if $i \leftrightarrow j$ and $i$ is recurrent, then $j$ is recurrent. The same turns out to be true for the period and for positive recurrence. This means that when we speak of an irreducible chain, we can speak unambiguously of the period of the chain or whether the chain is positive recurrent or null recurrent.
[quotetheorem:2221]
[citeproof:2221]
With all four properties confirmed as class properties, we can now speak unambiguously about the period and recurrence type of an entire irreducible chain.
[remark: Irreducible Chains]
For an irreducible chain, Theorem 1 implies that there is a single period $d$ shared by all states. We call $d$ the period of the chain. An irreducible chain is aperiodic if $d = 1$, and positive recurrent or null recurrent depending on whether every (equivalently, any one) state is positive or null recurrent.
[/remark]
## Irreducibility and Certain Visitation
We close this section with a result that will be used repeatedly in later arguments. It strengthens the definition of recurrence by removing the assumption on the starting state.
[quotetheorem:2222]
[citeproof:2222]
This result says that for an irreducible recurrent chain, every state is visited from every starting point, with probability one. It is the key structural fact underlying the ergodic theorem and the existence of invariant distributions, to which we turn in Chapter 3.
The classification machinery — communicating classes, recurrence, transience, periodicity, and mean return times — provides the foundation for understanding equilibrium. We now address the central question of the course: does a Markov chain settle down to a long-run equilibrium, and if so, what does it look like?
# 3. Long-run Behaviour
This chapter addresses the central question of the course: does a Markov chain settle down to a long-run equilibrium, and if so, what does that equilibrium look like? The answer involves invariant distributions — probability vectors preserved by the transition matrix — and a remarkable convergence theorem that shows irreducible, positive recurrent, aperiodic chains forget their initial state.
## Invariant Distributions
Having classified chains by their recurrence and transience properties, we now turn to a deeper question: what does a Markov chain look like in the long run? Recall the two-state chain computed at the start of the course: as $n \to \infty$, the distribution of $X_n$ converged to a fixed distribution, and this limit was independent of where the chain started. The central goal of this section is to identify precisely which chains have such limiting behaviour, and to characterise the limit when it exists.
Before defining the limit, we need to decide what "convergence" means for random variables. We cannot simply apply the usual definition of real-sequence convergence, since $X_n$ takes values in a state space $S$, not in $\mathbb{R}$. The right notion, for the purposes of Markov chain theory, is to ask whether the probability mass function converges: for each $k \in S$, does $\mathbb{P}(X_n = k)$ tend to a limit, and does the collection of limits form a probability distribution?
The strategy is to guess the limit rather than prove convergence directly. The limit will be called an *invariant distribution* — a distribution that the chain preserves exactly.
### The Definition of Invariant Distributions
If the chain has initial distribution $\lambda$, then the distribution at step $n$ is $\lambda P^n$. The key identity is
\begin{align*}
\lambda P^{n+1} = (\lambda P^n) P.
\end{align*}
If $\lambda P^n \to \pi$ as $n \to \infty$, then also $\lambda P^{n+1} \to \pi$, so passing to the limit gives $\pi P = \pi$. This is the defining equation of an invariant distribution.
[definition: Invariant Distribution]
Let $(X_n)_{n \geq 0}$ be a Markov chain on countable state space $S$ with transition matrix $P = (p_{ij})_{i,j \in S}$. A distribution $\pi = (\pi_k : k \in S)$ is an **invariant distribution** (also called an *equilibrium distribution*, *stationary distribution*, or *steady-state distribution*) if:
1. $\pi_k \geq 0$ for all $k \in S$, and $\sum_{k \in S} \pi_k = 1$;
2. $\pi P = \pi$, that is, $\pi_j = \sum_{i \in S} \pi_i p_{ij}$ for every $j \in S$.
[/definition]
Condition (1) ensures $\pi$ is a genuine probability distribution. Condition (2) says that if $X_0$ has distribution $\pi$, then $X_1$ also has distribution $\pi$: the distribution is unchanged after one step, and hence unchanged after any number of steps.
[example: Invariant Distribution of the Two-State Chain]
Consider the two-state chain with state space $S = \{1, 2\}$ and transition matrix
\begin{align*}
P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},
\end{align*}
where $0 < \alpha, \beta < 1$. We seek $\pi = (\pi_1, \pi_2)$ satisfying $\pi P = \pi$ and $\pi_1 + \pi_2 = 1$.
The equation $\pi P = \pi$ gives:
\begin{align*}
\pi_1(1 - \alpha) + \pi_2 \beta &= \pi_1, \\
\pi_1 \alpha + \pi_2(1 - \beta) &= \pi_2.
\end{align*}
Both equations reduce to $\pi_1 \alpha = \pi_2 \beta$. Combined with $\pi_1 + \pi_2 = 1$, we get
\begin{align*}
\pi_1 = \frac{\beta}{\alpha + \beta}, \qquad \pi_2 = \frac{\alpha}{\alpha + \beta}.
\end{align*}
This is consistent with the long-run limit found by explicit diagonalization: as $n \to \infty$, both rows of $P^n$ converge to $(\pi_1, \pi_2)$.
[/example]
### Existence and Uniqueness: The Main Theorem
Not every Markov chain has an invariant distribution. Understanding exactly when one exists — and when it is unique — is the central result of this section.
[quotetheorem:2223]
[citeproof:2223]
The theorem has an important structural consequence for the classification of states, which we record as a remark.
[remark: Positive Recurrence as a Class Property]
Part (2) of the theorem completes the deferred proof from the chapter on classification: for an irreducible chain, if *one* state is positive recurrent, then *all* states are positive recurrent. Indeed, if some state is positive recurrent, an invariant distribution exists by part (1), and part (2) then forces every state to be positive recurrent.
[/remark]
### The Return-Time Measure
The proof of existence in the theorem rests on a fundamental construction: the expected number of visits to each state during a single excursion away from a fixed base state. This construction is of independent interest, as it makes the formula $\pi_i = 1/\mu_i$ transparent.
Fix a state $k \in S$ and start the chain at $X_0 = k$. Let $T_k = \min\{n \geq 1 : X_n = k\}$ be the first return time to $k$. Define the **return-time measure** by
\begin{align*}
W_i = \sum_{m=1}^{T_k} \mathbb{1}_{X_m = i}, \qquad \rho_i = \mathbb{E}_k[W_i].
\end{align*}
Here $W_i$ counts the number of times the chain visits state $i$ during the excursion from $k$ back to $k$ (inclusive of the return visit to $k$ itself). The vector $\rho = (\rho_i : i \in S)$ is the expected visit count.
[quotetheorem:2224]
[citeproof:2224]
The proposition shows that $\rho$ satisfies all the conditions of an invariant distribution *except* that it need not sum to $1$: its sum is $\mu_k$ instead. When $k$ is positive recurrent, $\mu_k < \infty$, and we can normalise.
[explanation: From Return-Time Measure to Invariant Distribution]
If $k$ is positive recurrent ($\mu_k < \infty$), define
\begin{align*}
\pi_i = \frac{\rho_i}{\mu_k}.
\end{align*}
Then $\pi_i \geq 0$ by part (4), $\sum_i \pi_i = \mu_k / \mu_k = 1$ by part (2), and $\pi P = \pi$ because $\rho P = \rho$ by part (3). So $\pi$ is an invariant distribution.
The formula $\pi_i = \rho_i / \mu_k$ has a clear intuitive meaning. The quantity $\rho_i$ is the expected number of visits to $i$ per excursion from $k$, and $\mu_k$ is the expected duration of that excursion. Their ratio is the long-run proportion of time spent at $i$. When $i = k$, this gives $\pi_k = 1/\mu_k$, confirming the formula $\pi_i = 1/\mu_i$ for all $i$ (since by part (3) the formula must hold at each state by the same argument applied with $k$ replaced by $i$).
[/explanation]
### Proving the Full Theorem
With the proposition established, the proof of the main theorem is within reach. The existence direction follows immediately from the normalisation argument above. The uniqueness and characterisation direction — showing that any invariant distribution must satisfy $\pi_i = 1/\mu_i$ — requires the computation $\pi_i \mu_i = 1$, which we carry out in detail.
[proof]
**Part (1) — existence:** If $k$ is positive recurrent, the normalised return-time measure $\pi_i = \rho_i / \mu_k$ is an invariant distribution by the Proposition.
**Part (2) — characterisation and uniqueness:** Let $\pi$ be any invariant distribution.
*Step 1: all entries are positive.* Since $\sum_i \pi_i = 1$, there exists $k$ with $\pi_k > 0$. Using $\pi = \pi P^n$ for all $n$:
\begin{align*}
\pi_i = \sum_j \pi_j p_{ji}(n) \geq \pi_k p_{ki}(n).
\end{align*}
By irreducibility, choose $n$ with $p_{ki}(n) > 0$. Then $\pi_i \geq \pi_k p_{ki}(n) > 0$.
*Step 2: all states are recurrent.* Suppose for contradiction that all states are transient, so $p_{ji}(n) \to 0$ as $n \to \infty$ for all $i, j$. From $\pi_i = \sum_j \pi_j p_{ji}(n)$, for any finite set $F \subset S$:
\begin{align*}
\pi_i = \sum_{j \in F} \pi_j p_{ji}(n) + \sum_{j \notin F} \pi_j p_{ji}(n) \leq \sum_{j \in F} p_{ji}(n) + \sum_{j \notin F} \pi_j.
\end{align*}
Taking $n \to \infty$ and using $p_{ji}(n) \to 0$ for each fixed $j \in F$:
\begin{align*}
\pi_i \leq \sum_{j \notin F} \pi_j.
\end{align*}
Since $\sum_j \pi_j = 1$, we have $\sum_{j \notin F} \pi_j \to 0$ as $F \nearrow S$. So $\pi_i \leq 0$, contradicting Step 1.
*Step 3: positive recurrence and the formula.* We show $\pi_i \mu_i = 1$. Start the chain with initial distribution $\pi$; since $\pi$ is invariant, $X_m$ has distribution $\pi$ for every $m \geq 0$. Write
\begin{align*}
\pi_i \mu_i = \pi_i \mathbb{E}_i[T_i] = \pi_i \sum_{n=1}^{\infty} \mathbb{P}_i(T_i \geq n) = \sum_{n=1}^{\infty} \mathbb{P}(T_i \geq n,\, X_0 = i).
\end{align*}
For $n = 1$: $\mathbb{P}(T_i \geq 1, X_0 = i) = \mathbb{P}(X_0 = i) = \pi_i$, since $T_i \geq 1$ always holds.
For $n \geq 2$: define $a_r = \mathbb{P}(X_m \neq i \text{ for all } 0 \leq m \leq r)$. Then:
\begin{align*}
\mathbb{P}(T_i \geq n, X_0 = i) &= \mathbb{P}(X_0 = i,\, X_m \neq i \text{ for } 1 \leq m \leq n-1).
\end{align*}
Writing $\{X_m \neq i,\, 1 \leq m \leq n-1\} = \{X_m \neq i,\, 0 \leq m \leq n-1\}^c \cup \{X_0 \neq i\}$ and exploiting time-stationarity (since the distribution is $\pi$ at every time), the sequence $(X_0, \ldots, X_{n-2})$ has the same distribution as $(X_1, \ldots, X_{n-1})$. Therefore:
\begin{align*}
\mathbb{P}(T_i \geq n, X_0 = i) = a_{n-2} - a_{n-1} \qquad (n \geq 2).
\end{align*}
Summing over $n \geq 1$ using a finite truncation and then taking the limit:
\begin{align*}
\pi_i \mu_i &= \lim_{N \to \infty} \left[\pi_i + (a_0 - a_1) + (a_1 - a_2) + \cdots + (a_{N-2} - a_{N-1})\right] \\
&= \lim_{N \to \infty} \left[\pi_i + a_0 - a_{N-1}\right] \\
&= \pi_i + a_0 - \lim_{N \to \infty} a_N.
\end{align*}
Now $\pi_i + a_0 = \mathbb{P}(X_0 = i) + \mathbb{P}(X_0 \neq i) = 1$. Since the chain is recurrent (by Step 2) and started from state $i$ (in the averaged sense), $\mathbb{P}(X_m \neq i \text{ for all } m \geq 0) = 0$, so $\lim_{N} a_N = 0$. Hence $\pi_i \mu_i = 1$, which gives $\mu_i = 1/\pi_i < \infty$ since $\pi_i > 0$. So every state is positive recurrent and $\pi_i = 1/\mu_i$.
Uniqueness follows: any invariant distribution must equal $1/\mu_i$ at every state, and the values $\mu_i$ are determined by the chain alone.
[/proof]
### Consequences and Interpretation
The theorem draws together the classification of recurrence types with the long-run behaviour of the chain.
[remark: Trichotomy for Irreducible Chains]
For an irreducible chain, exactly one of the following holds:
- All states are **transient**: no invariant distribution exists.
- All states are **null recurrent**: no invariant distribution exists (mean return times are infinite, so $\sum_i 1/\mu_i$ would be $0$ or $\infty$, neither summing to $1$).
- All states are **positive recurrent**: a unique invariant distribution exists, given by $\pi_i = 1/\mu_i$.
[/remark]
The random walks encountered in Chapter 2 illustrate all three cases concretely.
[example: Simple Random Walk on the Integers]
Consider the simple symmetric random walk on $\mathbb{Z}$, which moves to $i+1$ or $i-1$ with probability $1/2$ each. This chain is irreducible and recurrent (shown in Chapter 2 by dimension-dependent arguments). Since $\mathbb{Z}$ is countably infinite, a candidate invariant distribution would require $\pi_i = c$ for some constant $c > 0$ for all $i \in \mathbb{Z}$ (by symmetry and the invariance equation), but then $\sum_{i \in \mathbb{Z}} \pi_i = \infty \neq 1$. No normalisation is possible. This shows the random walk on $\mathbb{Z}$ is null recurrent: it returns to every state with probability $1$, but the mean return time is infinite, and no invariant distribution exists.
By contrast, the simple random walk on $\mathbb{Z}/N\mathbb{Z}$ (the cycle of length $N$) is positive recurrent: the uniform distribution $\pi_i = 1/N$ satisfies $\pi P = \pi$ (each state transitions to two neighbours each with probability $1/2$, and summing contributions gives $\frac{1}{2} \cdot \frac{1}{N} + \frac{1}{2} \cdot \frac{1}{N} = \frac{1}{N} = \pi_i$), so $\mu_i = N$ for all $i$.
[/example]
In the finite case, the trichotomy collapses to a single outcome, which follows from linear algebra.
[remark: Finite Irreducible Chains are Always Positive Recurrent]
If $S$ is finite and the chain is irreducible, an invariant distribution always exists and is unique. One can see this algebraically: the equation $\pi P = \pi$ with $\sum \pi_i = 1$ is a linear system with a stochastic matrix, and the Perron–Frobenius theorem guarantees a unique positive solution. The theorem above then forces all states to be positive recurrent.
[/remark]
The formula $\pi_i = 1/\mu_i$ gives a probabilistic interpretation of the invariant distribution: $\pi_i$ is the long-run fraction of time the chain spends at state $i$. A state visited frequently (small mean return time) receives a large weight in the invariant distribution, while a state rarely visited receives a small weight. This connection between time averages and the invariant distribution will be made precise in the next section, where we address convergence: under what conditions does $\mathbb{P}(X_n = i) \to \pi_i$ as $n \to \infty$?
## Convergence to Equilibrium
The question of whether a Markov chain actually converges — and if so, to what — is one of the central results of the theory. The previous section established that a unique invariant distribution $\pi$ exists precisely when the chain is positive recurrent. This section answers when $p_{i,k}(n) \to \pi_k$ actually holds, provides the coupling proof of the main convergence theorem, and establishes that $\pi_i$ has a concrete meaning as the long-run proportion of time the chain spends in state $i$.
## The Convergence Theorem
We have seen that starting from any initial distribution, a Markov chain evolving for a long time ought to "forget" where it began. The two-state example computed explicitly in Chapter 1 already showed that $P^n$ converges to a matrix with identical rows. The convergence theorem says this happens in full generality under three conditions: irreducibility, positive recurrence, and aperiodicity.
The aperiodicity condition is genuinely necessary. To see why it can fail without it, consider a chain on $S = \{0, 1\}$ that alternates deterministically: $p_{0,1} = p_{1,0} = 1$. This chain is irreducible and positive recurrent, but $p_{0,0}(n) = 1$ if $n$ is even and $p_{0,0}(n) = 0$ if $n$ is odd. The sequence oscillates forever and does not converge. Aperiodicity rules out exactly this kind of periodic trapping.
[quotetheorem:2225]
[citeproof:2225]
## The Coupling Technique
The coupling argument used above is worth pausing to appreciate, because it is a broadly applicable probabilistic method and not merely a trick for this one theorem.
[explanation: The Logic of Coupling]
The core problem is to compare two probability distributions — $p_{i,k}(n)$ and $p_{j,k}(n)$ — which are defined over different probability spaces (or, equivalently, as marginals of a system with no specified joint structure). It is not immediately clear how to bound their difference.
The coupling strategy bypasses this by constructing a single probability space carrying **both** chains simultaneously. Once we have the joint object $Z_n = (X_n, Y_n)$, we can ask events like "do $X$ and $Y$ agree at time $n$?", which would be meaningless if the chains lived in separate probability spaces.
The coupling time $T$ captures the moment the two chains become indistinguishable. After $T$, since both chains are at the same state and evolve under the same Markov kernel, their futures are identically distributed. The bound $|p_{i,k}(n) - p_{j,k}(n)| \leq \mathbb{P}_{ij}(T > n)$ says: the distributions can only differ by the probability that coupling has not yet occurred. As $n \to \infty$, this probability vanishes by recurrence.
The independence of $X$ and $Y$ is used only to guarantee that the product chain $Z$ is irreducible and positive recurrent. One could equally well construct a **non-independent coupling** — choosing $X$ and $Y$ to be correlated in a way that makes coupling happen faster. This is the idea behind quantitative mixing-time bounds, but the qualitative convergence result only needs the coupling to be finite a.s.
[/explanation]
The two-state chain from Chapter 1 provides a check: we can verify convergence directly from the explicit formula for $p_{i,k}(n)$ and confirm the theorem's conclusion.
[example: Two-State Chain Revisited]
Consider $S = \{1, 2\}$ with transition matrix
\begin{align*}
P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}, \quad 0 < \alpha, \beta < 1.
\end{align*}
The unique invariant distribution is $\pi = \bigl(\frac{\beta}{\alpha+\beta},\, \frac{\alpha}{\alpha+\beta}\bigr)$. From the explicit calculation in Chapter 1,
\begin{align*}
p_{1,2}(n) = \frac{\alpha}{\alpha+\beta}\bigl(1 - (1-\alpha-\beta)^n\bigr).
\end{align*}
Since $0 < \alpha, \beta < 1$, we have $|1 - \alpha - \beta| < 1$, so $(1-\alpha-\beta)^n \to 0$ and
\begin{align*}
p_{1,2}(n) \to \frac{\alpha}{\alpha+\beta} = \pi_2.
\end{align*}
The chain is irreducible and aperiodic (both states communicate and $p_{1,1} = 1-\alpha > 0$, so period $1$), confirming the theorem applies. The convergence rate is geometric with ratio $|1-\alpha-\beta|$: smaller $\alpha + \beta$ means slower mixing.
[/example]
The aperiodicity and positive recurrence conditions in the theorem are both essential. The null recurrent case shows what can go wrong when positive recurrence fails.
[remark: The Null Recurrent Case]
When the chain is null recurrent — irreducible and recurrent, but with $\mu_i = \mathbb{E}_i[T_i] = \infty$ for all $i$ — Stage 1 of the coupling proof still goes through: since $Z$ is recurrent, $T$ is still finite a.s., so $|p_{i,k}(n) - p_{j,k}(n)| \to 0$. However, there is no invariant probability distribution $\pi$, so Stage 2 breaks down entirely. The correct conclusion in the null recurrent case is that $p_{i,k}(n) \to 0$ for all $i, k$ — the chain spreads out over the infinite state space and spends vanishingly small probability in any fixed state.
[/remark]
## Long-run Proportion of Time in a State
The invariant distribution $\pi$ was originally motivated by asking: in the long run, what fraction of time does the chain spend in state $i$? The convergence theorem tells us that the probability of being in state $i$ at a large time $n$ tends to $\pi_i$. But a probabilistic statement about a single time is different from a statement about the empirical average over a long run. We now address the latter directly.
For a chain starting at $X_0 = i$, define the **visit count** up to time $n$:
\begin{align*}
V_i(n) = \#\{m \leq n : X_m = i\}.
\end{align*}
The claim is that $V_i(n)/n \to 1/\mu_i = \pi_i$ as $n \to \infty$, where $\mu_i = \mathbb{E}_i[T_i]$ is the mean return time to $i$.
[explanation: Long-run Proportion via the Law of Large Numbers]
Assume $X_0 = i$. Let $T_m$ denote the time of the $m$-th return to $i$, with $T_0 = 0$. The inter-return times $U_m = T_m - T_{m-1}$ for $m \geq 1$ are independent and identically distributed by the strong Markov property (at each return to $i$, the chain restarts afresh), and $\mathbb{E}[U_1] = \mu_i$.
The law of large numbers gives:
\begin{align*}
\frac{T_m}{m} = \frac{1}{m}\sum_{r=1}^m U_r \to \mu_i \quad \text{as } m \to \infty.
\end{align*}
Now observe that $V_i(n) \leq k$ if and only if the $k$-th return to $i$ occurs after time $n$, i.e., $T_k > n$. More generally, for any $x > 0$,
\begin{align*}
V_i(n) \leq x \iff T_{\lceil x \rceil} > n.
\end{align*}
Substituting $x = An/\mu_i$ for a constant $A > 0$, we ask whether $V_i(n)/n \leq A/\mu_i$, which is equivalent to $T_{\lceil An/\mu_i \rceil} > n$. By the law of large numbers, $T_m / m \to \mu_i$, so
\begin{align*}
\frac{T_{\lceil An/\mu_i \rceil}}{n} \approx \frac{T_{\lceil An/\mu_i \rceil}}{\lceil An/\mu_i \rceil} \cdot \frac{\lceil An/\mu_i \rceil}{n} \to \mu_i \cdot \frac{A}{\mu_i} = A.
\end{align*}
Therefore $T_{\lceil An/\mu_i \rceil}/n \to A$ almost surely. If $A < 1$, the event $\{T_{\lceil An/\mu_i \rceil} > n\}$ has probability tending to $1$; if $A > 1$, the probability tends to $0$. This shows that
\begin{align*}
\frac{V_i(n)}{n} \to \frac{1}{\mu_i} = \pi_i \quad \text{a.s.}
\end{align*}
[/explanation]
The argument above established the result for $k = i$; the general statement for any target state $k$ follows by the same reasoning applied starting from $k$.
[quotetheorem:2226]
The ergodic theorem is the Markov chain analogue of the strong law of large numbers. It says that time averages converge to distributional averages — the chain is ergodic in the sense that the long-run empirical distribution equals the stationary distribution, regardless of where the chain started.
[example: Proportion of Even Steps in a Random Walk]
Consider the symmetric simple random walk on $\mathbb{Z}$ reflected at $0$: the chain on $S = \{0, 1, 2, \ldots\}$ with $p_{0,1} = 1$ and $p_{i,i+1} = p_{i,i-1} = 1/2$ for $i \geq 1$. This chain is irreducible and null recurrent (as the unreflected walk on $\mathbb{Z}$ is recurrent, and the reflected walk has mean return time $\mu_0 = \infty$ to state $0$).
By the ergodic theorem, or rather by its failure: since $\mu_0 = \infty$, we have $\pi_0 = 1/\mu_0 = 0$. The long-run proportion of time spent at $0$ is $V_0(n)/n \to 0$, even though the chain returns to $0$ infinitely often. The chain returns, but returns become increasingly rare.
This contrasts with the positive recurrent case, where $\mu_0 < \infty$ forces $\pi_0 > 0$, and the chain returns at a positive rate.
[/example]
A natural question is whether the ergodic theorem requires aperiodicity. It does not — time averages converge regardless of period, even though pointwise probabilities may not.
[remark: Convergence Without Aperiodicity]
For chains that are irreducible and positive recurrent but periodic with period $d > 1$, one can still show that the Cesàro averages $(1/n)\sum_{m=0}^{n-1} p_{i,k}(m) \to \pi_k$. The ergodic theorem for time averages continues to hold — aperiodicity is not required for $V_k(n)/n \to \pi_k$. What aperiodicity buys is the pointwise convergence $p_{i,k}(n) \to \pi_k$; without it, one only has convergence in the averaged sense.
[/remark]
The ergodic theorem and its Cesaro variant provide the definitive tools for extracting long-run averages from Markov chain behaviour. We now explore time reversal, which reveals a deep structural symmetry: under the stationary distribution, the chain and its time-reversed version share the same probabilistic laws.
# 4. Time Reversal
Physicists struggle to explain why we can move freely through space but only forward through time. In the world of Markov chains, however, time reversal is something we understand precisely. The previous chapter established that every irreducible, positive recurrent, aperiodic chain converges to a unique invariant distribution $\pi$. This chapter asks a complementary question: for which chains does the chain in equilibrium look statistically identical run forwards and backwards? We develop the theory of reversed chains, identify the key condition — the detailed balance equations — under which a chain is indistinguishable from its own reversal, and apply this theory to random walks on graphs, where it yields invariant distributions almost effortlessly.
## The Reversed Chain
Given a finite-length Markov chain $X = (X_0, X_1, \ldots, X_N)$, the most natural way to reverse time is to define $Y_k = X_{N-k}$, so that $Y = (X_N, X_{N-1}, \ldots, X_0)$. The sequence $Y$ is just $X$ read backwards. The question is: when is $Y$ itself a Markov chain?
The answer depends on the initial distribution. If $X_0$ is started from an arbitrary distribution, the reversed sequence need not be Markov. But if $X_0$ has the invariant distribution, the reversed sequence is again Markov — and we can identify its transition matrix explicitly.
[quotetheorem:2227]
The formula for $\hat{p}_{i,j}$ has a natural interpretation. The reversed chain's probability of going from $i$ to $j$ is the forward chain's probability of going from $j$ to $i$, rescaled by the ratio $\pi_j / \pi_i$ to ensure the rows of $\hat{P}$ sum to one. In this sense, $\hat{P}$ is a kind of "weighted transpose" of $P$.
[citeproof:2227]
Two points deserve emphasis. First, this theorem requires that $X_0$ be started from the invariant distribution $\pi$: if $X_0$ has an arbitrary initial distribution, the reversed sequence $Y_k = X_{N-k}$ need not satisfy the Markov property. Second, the theorem does not say the reversed chain has the same transition matrix as the original — that is the stronger condition of reversibility, treated in the next section. Here, we only assert that a transition matrix $\hat{P}$ for the reversal exists and is given by the stated formula.
The key algebraic identity hidden in this proof is worth recording on its own:
\begin{align*}
\pi_i \hat{p}_{i,j} = \pi_j p_{j,i} \quad \text{for all } i, j \in S.
\end{align*}
This single equation is the engine that drives everything in this chapter.
## Detailed Balance and Reversibility
Knowing that the reversal of a chain is again Markov does not yet say the chain looks the same forwards and backwards. In physics, a process is called *reversible* if you cannot tell from watching it whether time is running forwards or backwards. For Markov chains, this means the reversed chain should have the same transition probabilities as the original.
[definition: Reversible Chain]
Let $X$ be an irreducible Markov chain with invariant distribution $\pi$ and transition matrix $P$. The chain is *reversible* if its reversal $Y$ (when started in $\pi$) has the same transition matrix as $X$, i.e., $\hat{P} = P$. Equivalently, using the formula for $\hat{p}_{i,j}$, the chain is reversible if and only if
\begin{align*}
\pi_i p_{i,j} = \pi_j p_{j,i} \quad \text{for all } i, j \in S.
\end{align*}
This condition is called the *detailed balance equation*. More generally, if $\lambda$ is any distribution (not necessarily invariant) satisfying $\lambda_i p_{i,j} = \lambda_j p_{j,i}$ for all $i,j$, we say that $(P, \lambda)$ is in *detailed balance*.
[/definition]
The detailed balance equations have a striking structural feature: they impose a local pairwise constraint rather than a global one.
[remark: Most Chains Are Not Reversible]
Reversibility is a special property, not the generic case — just as most functions are not continuous. But when it holds, it is extremely useful: it replaces the global balance equations $\pi P = \pi$ (which involve a sum over all states) with the local equations $\pi_i p_{i,j} = \pi_j p_{j,i}$ (which involve only pairs of states). The detailed balance equations are much easier to solve.
[/remark]
The following proposition gives the efficient route to checking reversibility.
[quotetheorem:2228]
This is the computational workhorse of the chapter. To find the invariant distribution via global balance, one must solve a system of equations $\pi_i = \sum_j \pi_j p_{j,i}$, each involving a sum over all states. The detailed balance equations $\lambda_i p_{i,j} = \lambda_j p_{j,i}$ involve only two states at a time — they are genuinely simpler.
The converse fails: a chain can have a unique invariant distribution without satisfying detailed balance. Consider the three-state chain on $S = \{1, 2, 3\}$ with transition probabilities $p_{1,2} = p_{2,3} = p_{3,1} = 1$ and all other $p_{i,j} = 0$. This is a deterministic cycle $1 \to 2 \to 3 \to 1$, and by symmetry the invariant distribution is $\pi = (1/3, 1/3, 1/3)$. But detailed balance fails: $\pi_1 p_{1,2} = 1/3$ while $\pi_2 p_{2,1} = 0$. The chain is irreversible — it has a preferred direction of rotation that a reversed observer would immediately detect.
[citeproof:2228]
## Reversibility in Birth–Death Chains and Random Walks
### The Birth–Death Chain with Immigration
The detailed balance criterion becomes especially powerful for chains that have a natural "physical" directionality, where one expects reversibility to hold. Birth–death chains are the canonical example.
[example: Birth-Death Chain with Immigration]
Consider a population model where at each state $i > 0$, the population increases to $i+1$ with probability $p_i$ and decreases to $i-1$ with probability $q_i = 1 - p_i$. To make the chain irreducible, we modify the boundary: at state $0$, a new immigrant arrives (moving to state $1$) with probability $p_0$, and the population stays at $0$ with probability $q_0 = 1 - p_0$.
This process has a physical flavour — movements up and down are paired — so we attempt to find a solution to the detailed balance equations. If such a solution $\lambda$ exists and is normalisable, we immediately have the invariant distribution.
The detailed balance equation $\lambda_i p_{i,j} = \lambda_j p_{j,i}$ is automatically satisfied when $|i - j| \geq 2$ (both sides are zero). The only nontrivial case is $j = i + 1$:
\begin{align*}
\lambda_i p_i = \lambda_{i+1} q_{i+1}.
\end{align*}
This is a first-order recurrence relation. Solving iteratively:
\begin{align*}
\lambda_i = \frac{p_{i-1}}{q_i} \lambda_{i-1} = \frac{p_{i-1}}{q_i} \cdot \frac{p_{i-2}}{q_{i-1}} \cdots \frac{p_0}{q_1} \lambda_0 =: \rho_i \lambda_0,
\end{align*}
where we define
\begin{align*}
\rho_i = \frac{p_{i-1}}{q_i} \cdot \frac{p_{i-2}}{q_{i-1}} \cdots \frac{p_0}{q_1} = \prod_{k=1}^{i} \frac{p_{k-1}}{q_k}.
\end{align*}
For $\lambda$ to be a probability distribution, we need $\sum_i \lambda_i = 1$, which forces $\lambda_0 \sum_i \rho_i = 1$. If $\sum_i \rho_i < \infty$, we set $\lambda_0 = 1 / \sum_i \rho_i$ and conclude:
\begin{align*}
\lambda_i = \frac{\rho_i}{\sum_{k \geq 0} \rho_k},
\end{align*}
and this is the unique invariant distribution, with the chain reversible. If $\sum_i \rho_i = \infty$, no normalised invariant distribution exists and the chain is null recurrent or transient.
[/example]
The computation above required no linear algebra — just a two-term recurrence. This illustrates the genuine advantage of seeking detailed balance solutions over solving global balance directly.
### Random Walk on a Finite Graph
The most elegant application of reversibility in this chapter is the random walk on a graph, where the invariant distribution turns out to have a beautifully simple form.
[definition: Graph and Random Walk on a Graph]
A *graph* is a pair $G = (V, E)$ where $V$ is a set of vertices and $E$ is a set of unordered pairs $\{u, v\}$ of distinct vertices, called edges. The graph is *connected* if for every $u, v \in V$ there is a path of edges from $u$ to $v$.
The *degree* of a vertex $i \in V$ is $d_i = |\{j \in V : \{i,j\} \in E\}|$, the number of neighbours of $i$.
The *simple random walk* on a connected graph $G = (V, E)$ with $|V| < \infty$ is the Markov chain $X = (X_n)$ on state space $V$ with transition probabilities
\begin{align*}
p_{i,j} = \begin{cases} \dfrac{1}{d_i} & \text{if } \{i,j\} \in E, \\ 0 & \text{otherwise.} \end{cases}
\end{align*}
At each step, the walk moves to a uniformly chosen neighbour of the current vertex.
[/definition]
[illustration:random-walk-graph-degree-distribution]
<!-- illustration-needed: a small example graph (say, 5–6 vertices) with edges drawn, and the degree of each vertex labelled — helps students see why d_i / (2|E|) is the invariant distribution -->
The simple random walk on a connected finite graph is irreducible (by connectivity) and positive recurrent (by finiteness). We can find its invariant distribution by attempting the detailed balance approach.
[quotetheorem:2229]
[citeproof:2229]
The result says: in equilibrium, the fraction of time spent at vertex $i$ is proportional to its degree. Highly connected vertices are visited more often, and the precise weight is the degree itself.
The hypotheses matter. If the graph is disconnected, the walk is no longer irreducible and there is no single invariant distribution — each connected component has its own, and the long-run behaviour depends on the starting vertex. If the edges are directed (giving a directed graph), the detailed balance argument breaks down because $p_{i,j}$ and $p_{j,i}$ can differ; the walk may no longer be reversible, and finding the invariant distribution requires solving the full global balance equations. The degree-proportional formula $\pi_i = d_i / (2|E|)$ is special to undirected connected graphs.
This result is the starting point for two important applications beyond this course. In electrical network theory, the invariant distribution connects to the theory of effective resistance, yielding quantitative bounds on mixing times. In MCMC (Markov chain Monte Carlo), one deliberately constructs a reversible Markov chain — using the Metropolis–Hastings algorithm — whose invariant distribution is a prescribed target measure; the random walk on a graph is the simplest instance of this construction.
[example: Knight's Random Walk on a Chessboard]
Consider a knight moving on a standard $8 \times 8$ chessboard. At each step, the knight chooses uniformly at random from its legal moves (L-shaped moves: two squares in one direction and one in the perpendicular direction). This defines a simple random walk on the graph $G$ whose vertices are the 64 squares and whose edges connect squares that are a legal knight's move apart.
The number of legal moves varies by position. A corner square (say, a1) has only 2 legal knight moves, while a central square has up to 8. The invariant distribution assigns weight proportional to degree:
\begin{align*}
\pi_i = \frac{d_i}{2|E|}.
\end{align*}
To find $\pi$ at the corner, we need $\sum_i d_i = 2|E|$. By symmetry and direct enumeration, $\sum_i d_i = 336$, so $|E| = 168$. Thus
\begin{align*}
\pi_{\text{corner}} = \frac{2}{336} = \frac{1}{168}.
\end{align*}
In contrast, a central square with $d_i = 8$ has $\pi_{\text{central}} = 8/336 = 1/42$, making it four times more likely to be visited in equilibrium than a corner.
The chain is irreducible: from any square, a knight can reach any other square in finitely many moves (this can be verified by checking that the graph is connected). Since the graph is finite and irreducible, the chain is positive recurrent and convergence to $\pi$ is guaranteed.
[/example]
The knight example shows the degree-proportional formula at work in a combinatorially rich setting. The same result has deep connections to other areas of mathematics.
[remark: Connection to Previous Chapters]
The detailed balance approach complements the global balance theory developed in Chapter 3. Global balance says: if the chain is positive recurrent and irreducible, an invariant distribution exists and is unique. Detailed balance says: if you can find any distribution $\lambda$ satisfying $\lambda_i p_{i,j} = \lambda_j p_{j,i}$, it must be that distribution, and the chain is reversible. The two approaches are not in competition — detailed balance is a shortcut to identifying the invariant distribution for the class of reversible chains.
[/remark]
## References
G. R. Grimmett, *Markov Chains* (Cambridge IB lecture notes, Michaelmas 2015).
Contents
- Introduction
- The Transition Matrix
- Three-State Random Walk
- Road Map of the Course
- 1. Markov Chains
- The Markov Property and Its Consequences
- Defining Markov Chains
- Specifying a Markov Chain
- The Joint Distribution of a Markov Chain
- The Extended Markov Property
- Transition Matrices and $n$-Step Probabilities
- $n$-Step Transition Probabilities
- The Chapman–Kolmogorov Equations
- Matrix Formulation
- Distribution Propagation
- The Strong Markov Property
- 2. Classification of Chains and States
- Communicating Classes
- Accessibility and Communication
- Communicating Classes
- Closed Classes and Absorption
- Irreducibility
- Recurrence and Transience
- Motivation: Will the Chain Return?
- First Passage Times and Recurrence
- The Summability Criterion
- Recurrence as a Class Property
- Simple Random Walks in $\mathbb{Z}^d$: Pólya's Theorem
- Why the Two-Dimensional Probability Factors as a Square
- Hitting Probabilities and Mean Hitting Times
- Setting Up the Hitting Problem
- First-Step Analysis for Hitting Probabilities
- Mean Hitting Times
- Gambler's Ruin
- Birth-Death Chains
- The Strong Markov Property and Applications
- Stopping Times
- The Strong Markov Property
- Application: Recurrence via Renewal Arguments
- Application: The Generating Function of Hitting Times
- Further Classification of States
- Visits to a Recurrent State
- Mean Recurrence Time and Positive Recurrence
- Periodicity and Aperiodicity
- Ergodic States
- Class Properties
- Irreducibility and Certain Visitation
- 3. Long-run Behaviour
- Invariant Distributions
- The Definition of Invariant Distributions
- Existence and Uniqueness: The Main Theorem
- The Return-Time Measure
- Proving the Full Theorem
- Consequences and Interpretation
- Convergence to Equilibrium
- The Convergence Theorem
- The Coupling Technique
- Long-run Proportion of Time in a State
- 4. Time Reversal
- The Reversed Chain
- Detailed Balance and Reversibility
- Reversibility in Birth–Death Chains and Random Walks
- The Birth–Death Chain with Immigration
- Random Walk on a Finite Graph
- References
Cambridge IB Markov Chains
Content
Problems
History
Created by admin on 4/25/2026 | Last updated on 4/25/2026
Prerequisites
No prerequisites required for this page.
Rate this page
★
★
★
★
★
Poor
Excellent