Monte Carlo methods are a family of computational techniques for approximating quantities that are difficult or impossible to compute exactly, especially when they arise from high-dimensional probability models. This course develops the core ideas behind random simulation as a practical tool for estimation, sampling, and uncertainty quantification, with an emphasis on how probabilistic reasoning becomes an algorithmic workflow. Along the way, it connects basic numerical estimation to modern methods for exploring complex distributions and simulating from models used throughout applied statistics, machine learning, and scientific computing.
The chapters move from fundamentals to advanced sampling algorithms. The early material covers Monte Carlo estimation, error, and variance reduction through importance sampling, rejection sampling, transformation methods, and direct simulation. The middle chapters build the main Markov chain Monte Carlo framework, introducing Markov chains for sampling, Metropolis-Hastings, Gibbs sampling, and data-augmentation methods, then turning to convergence diagnostics and practical workflow. The later chapters extend these ideas to more sophisticated and scalable methods, including Hamiltonian and Langevin Monte Carlo, sequential Monte Carlo and particle filters, and finally perfect simulation and advanced couplings.
The overall structure is cumulative: each topic is motivated by the limitations of the previous one, and each new method addresses a harder class of target distributions, larger state spaces, or more demanding computational settings. By the end of the course, the central goal is not just to know a collection of algorithms, but to understand how to choose, justify, diagnose, and combine Monte Carlo methods in real applied problems.
# Introduction
Monte Carlo methods replace an intractable deterministic computation by a random experiment whose output can be analysed probabilistically. The central object is usually an expectation, a probability, a normalizing constant, or a distribution that is too complex to evaluate by direct quadrature or enumeration. This course studies the mathematical principles that make such randomized computation reliable, and the algorithmic choices that make it efficient or cause it to fail.
The course begins with independent sampling and error estimates, then moves to importance sampling, Markov chain Monte Carlo, sequential Monte Carlo, and modern variants used in statistics, machine learning, and computational science. The recurring questions are: what is the target quantity, what random object estimates it, what assumptions make the estimator valid, and how large is its error?
## What Monte Carlo Computes
The first problem is to identify which deterministic quantity has been turned into a probabilistic one. In many applications the desired number is an integral over a high-dimensional space, but the same notation also covers posterior means, risk measures, expectations under stochastic dynamics, and marginal likelihoods.
[definition: Monte Carlo Target]
A Monte Carlo target is a deterministic quantity expressible as
\begin{align*}
I = \mathbb E_\pi[f(X)] = \int_E f(x)\,\pi(dx),
\end{align*}
where $(E, \mathcal E)$ is a measurable space, $\pi$ is a probability measure on $E$, $X \sim \pi$, and $f:E\to \mathbb R$ is integrable with respect to $\pi$.
[/definition]
This definition deliberately separates the mathematical target $I$ from the procedure used to approximate it. The distribution $\pi$ may be directly simulable, known only up to a normalizing constant, or accessible only through a Markov transition kernel.
[example: Posterior Mean As An Expectation]
In Bayesian inference, let $\theta\in\Theta$ have posterior probability measure $\pi(d\theta\mid y)$ after observing data $y$, and let $g:\Theta\to\mathbb R$ be posterior-integrable. The deterministic quantity being computed is the posterior expectation of $g(\theta)$:
\begin{align*}
I=\mathbb E[g(\theta)\mid y]
\end{align*}
By the definition of [conditional expectation](/page/Conditional%20Expectation) under the posterior law, this expectation is the integral of $g$ against $\pi(\cdot\mid y)$:
\begin{align*}
I=\int_\Theta g(\theta)\,\pi(d\theta\mid y)
\end{align*}
Thus the posterior mean is a Monte Carlo target with measurable space $\Theta$, target probability measure $\pi(\cdot\mid y)$, and integrand $g$.
If $\theta_1,\dots,\theta_N$ are samples from the posterior, the corresponding Monte Carlo estimator is
\begin{align*}
\widehat I_N=\frac{1}{N}\sum_{i=1}^N g(\theta_i)
\end{align*}
When the samples are exactly posterior-distributed and the average is well defined, each term $g(\theta_i)$ has expectation $I$, so linearity of expectation gives
\begin{align*}
\mathbb E[\widehat I_N]=\frac{1}{N}\sum_{i=1}^N \mathbb E[g(\theta_i)]
\end{align*}
and therefore
\begin{align*}
\mathbb E[\widehat I_N]=\frac{1}{N}\sum_{i=1}^N I=I
\end{align*}
The empirical posterior average is therefore the numerical version of the posterior integral; approximate posterior sampling makes the same expression the natural reported quantity, with additional sampling or approximation error to be analysed separately.
[/example]
The example shows why Monte Carlo is not only a technique for evaluating elementary integrals. It is a way to compute expectations under probability measures that arise from modelling assumptions, conditioning, and dynamical systems.
## Estimators, Random Error, and Cost
Once the target has been specified, the next question is what [random variable](/page/Random%20Variable) will be reported by the algorithm. The output of a Monte Carlo method is itself random, so correctness cannot mean exact equality at a fixed computational budget. Instead, the course studies bias, variance, asymptotic distribution, and cost-normalised accuracy.
[definition: Monte Carlo Estimator]
Let $I\in\mathbb R$ be a target quantity. A Monte Carlo estimator of $I$ is a random variable $\widehat I_N$ depending on a computational budget parameter $N\in\mathbb N$, intended to satisfy $\widehat I_N \to I$ in a specified mode of convergence as $N\to\infty$.
[/definition]
This definition says what kind of object an algorithm returns, but it does not yet say how to compare two such objects. To compare algorithms, we need a vocabulary that distinguishes systematic displacement from random spread.
[definition: Bias Variance And Root Mean Square Error]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $I\in\mathbb R$. The bias functional at target $I$ is the map
\begin{align*}
\operatorname{Bias}_I : L^1(\Omega,\mathcal F,\mathbb P;\mathbb R) \to \mathbb R,
\qquad
\operatorname{Bias}_I(\widehat I_N) = \mathbb E[\widehat I_N] - I.
\end{align*}
The variance functional is the map
\begin{align*}
\operatorname{Var} : L^2(\Omega,\mathcal F,\mathbb P;\mathbb R) \to \mathbb R_{\ge 0},
\qquad
\operatorname{Var}(\widehat I_N) = \mathbb E[(\widehat I_N - \mathbb E[\widehat I_N])^2].
\end{align*}
The root mean square error functional at target $I$ is the map
\begin{align*}
\operatorname{RMSE}_I : L^2(\Omega,\mathcal F,\mathbb P;\mathbb R) \to \mathbb R_{\ge 0},
\qquad
\operatorname{RMSE}_I(\widehat I_N) = \left(\mathbb E[(\widehat I_N - I)^2]\right)^{1/2}.
\end{align*}
[/definition]
These quantities separate systematic error from random fluctuation, but the definitions alone do not say how the two contributions combine. For error budgeting we need to know whether bias and variance interact through an additional cross term or whether they split cleanly in mean square.
[quotetheorem:4425]
[citeproof:4425]
The decomposition is used throughout the course to decide which error source deserves attention. Square-integrability is the hypothesis that makes this accounting finite: without $\mathbb E[\widehat I_N^2]<\infty$, the variance or the mean square error may be infinite or undefined, so the displayed identity no longer gives a usable numerical error criterion. For instance, an unbiased estimator with a Pareto tail of exponent in $(1,2]$ has a finite mean but an infinite second moment, so its target bias is defined while its variance and RMSE are not finite. The theorem is not a convergence result and does not say that either the bias or the variance decreases with $N$; it only says that whenever the second moment exists, the mean-square error splits exactly into these two pieces. For example, increasing $N$ may reduce variance but leave discretisation bias unchanged, while improving a proposal distribution may reduce variance without changing the target. This is why later chapters repeatedly ask whether an algorithm is unbiased, whether its second moment is finite, and whether the variance estimate is itself reliable.
[example: Estimating A Probability By Simulation]
Let $A\in\mathcal F$ be an event in a probability space $(\Omega,\mathcal F,\mathbb P)$, and let $\omega_1,\dots,\omega_N$ be independent simulations with common law $\mathbb P$. Set $X_i=\mathbb{1}_A(\omega_i)$ and $p=\mathbb P(A)$. We estimate $p$ by the sample proportion
\begin{align*}
\widehat p_N=\frac{1}{N}\sum_{i=1}^N X_i.
\end{align*}
Since $X_i=1$ on $A$ and $X_i=0$ on $A^c$,
\begin{align*}
\mathbb E[X_i]=1\cdot \mathbb P(A)+0\cdot \mathbb P(A^c)=p.
\end{align*}
Linearity of expectation gives
\begin{align*}
\mathbb E[\widehat p_N]=\mathbb E\left[\frac{1}{N}\sum_{i=1}^N X_i\right]=\frac{1}{N}\sum_{i=1}^N \mathbb E[X_i]=\frac{1}{N}\sum_{i=1}^N p=p.
\end{align*}
Thus $\widehat p_N$ is unbiased.
To compute its variance, first note that $X_i^2=X_i$ because $X_i$ only takes the values $0$ and $1$. Hence
\begin{align*}
\operatorname{Var}(X_i)=\mathbb E[X_i^2]-\mathbb E[X_i]^2=\mathbb E[X_i]-p^2=p-p^2=p(1-p).
\end{align*}
The variables $X_1,\dots,X_N$ are independent, so the variance of their sum is the sum of their variances. Therefore
\begin{align*}
\operatorname{Var}(\widehat p_N)=\operatorname{Var}\left(\frac{1}{N}\sum_{i=1}^N X_i\right)=\frac{1}{N^2}\sum_{i=1}^N \operatorname{Var}(X_i)=\frac{1}{N^2}\sum_{i=1}^N p(1-p)=\frac{p(1-p)}{N}.
\end{align*}
Since the estimator is unbiased, its root mean square error is $\sqrt{p(1-p)/N}$. Relative to the target $p$, this is
\begin{align*}
\frac{\sqrt{p(1-p)/N}}{p}=\sqrt{\frac{1-p}{Np}},
\end{align*}
so for small $p$ the absolute error may be small while the relative error remains large unless $N$ is much larger than $1/p$.
[/example]
This example foreshadows two themes. Monte Carlo accuracy depends on the quantity being estimated, not only on the sample size, and relative error can be the relevant measure when the target is very small.
## Why Random Sampling Helps In High Dimension
The next question is why randomized methods are competitive when deterministic quadrature is available in low dimensions. The reason is not that randomness removes difficulty, but that many Monte Carlo error bounds depend on variance rather than directly on dimension.
[quotetheorem:7198]
[citeproof:7198]
The theorem explains the famous $N^{-1/2}$ Monte Carlo rate, but it is a theorem with two essential structural hypotheses. If independence is dropped completely, the rate can disappear: for instance, taking $X_i=X_1$ for every $i$ makes $\widehat I_N=f(X_1)$, so averaging does not reduce the variance at all. If the finite second moment assumption is dropped, the RMSE may not exist; a Pareto-type random variable with finite mean and infinite variance gives an unbiased average whose mean-square error is infinite. Thus the result is not a universal law of simulation, but a baseline for independent sampling with finite variance.
[remark: Dimension-Free Rate Does Not Mean Dimension-Free Difficulty]
The expression $N^{-1/2}$ does not guarantee practical success in high dimension. The rate is dimension-free only in its algebraic dependence on $N$; the variance may grow with dimension, sampling from $\pi$ may become costly, and most proposed samples may land in regions that contribute little to the target. Much of the course studies algorithms designed to control these hidden dimension-dependent costs while preserving enough independence, mixing, or moment control to recover usable error estimates.
[/remark]
This warning motivates the move from crude simulation to structured sampling. The rest of the course asks how to change the sampling distribution, introduce dependence, or propagate particles without changing the target being estimated.
## The Main Algorithmic Families
After the basic estimator is understood, the course branches according to what access we have to the target distribution. Different access models lead to different algorithms and different mathematical guarantees.
[definition: Direct Sampling Regime]
A Monte Carlo problem with target $I=\int_E f(x)\,\pi(dx)$ on a measurable space $(E,\mathcal E)$ is in the direct sampling regime when independent samples $X_1,X_2,\dots \sim \pi$ can be generated at acceptable computational cost.
[/definition]
In this regime, the central tools are laws of large numbers, central limit theorems, confidence intervals, and variance reduction. The obstruction is that many targets cannot be sampled directly, even though a nearby or more convenient distribution can be. To keep the same expectation while changing the simulation law, the problem must include an absolute-continuity relation between the target and the proposal.
[definition: Importance Sampling Regime]
A Monte Carlo problem with target $I=\int_E f(x)\,\pi(dx)$ on a measurable space $(E,\mathcal E)$ is in the importance sampling regime when samples are generated from a proposal probability measure $q$ on $(E,\mathcal E)$, while the target expectation is with respect to a probability measure $\pi$ satisfying $\pi \ll q$.
[/definition]
The absolute continuity condition is the mathematical requirement that the proposal does not miss regions to which the target assigns mass. If even a useful proposal is hard to design, the next definition captures the route of building a dependent simulation whose long-run behaviour has the desired target law.
[definition: Markov Chain Monte Carlo Regime]
A Monte Carlo problem with target $I=\int_E f(x)\,\pi(dx)$ on a measurable space $(E,\mathcal E)$ is in the Markov chain Monte Carlo regime when samples are produced by a Markov chain $(X_n)_{n\ge 0}$ on $E$ whose invariant distribution is the target distribution $\pi$.
[/definition]
Markov chain Monte Carlo replaces independence by long-run stationarity. Some problems have no single fixed target: observations arrive over time, annealing changes the distribution gradually, or rare-event algorithms move through intermediate levels.
This creates a different simulation problem from stationary-chain sampling. The target itself is indexed by time or level, so the formal regime must specify an evolving approximation that carries weights forward while estimating expectations under each current distribution.
[definition: Sequential Monte Carlo Regime]
A Monte Carlo problem on a measurable space $(E,\mathcal E)$ is in the sequential Monte Carlo regime when a sequence of target distributions $(\pi_t)_{t\ge 0}$ is approximated by an evolving weighted particle system, with estimates formed for expectations of the form $\int_E f_t(x)\,\pi_t(dx)$.
[/definition]
Sequential methods are designed for filtering, rare events, annealing, and time-evolving models. Their analysis combines importance sampling, resampling, propagation, and stability of particle approximations.
[example: Four Views Of The Same Integral]
Let the common target be
\begin{align*}
I=\int_E f(x)\,\pi(dx).
\end{align*}
In the direct sampling regime, if $X_1,\dots,X_N$ are sampled from $\pi$, the reported average is
\begin{align*}
\widehat I_N^{\mathrm{direct}}=\frac{1}{N}\sum_{i=1}^N f(X_i).
\end{align*}
Each summand is evaluated at a point drawn from the target law itself, so the average is an empirical version of the defining integral.
In the importance sampling regime, suppose $q$ is a proposal probability measure and $\pi\ll q$. Write
\begin{align*}
w(x)=\frac{d\pi}{dq}(x).
\end{align*}
Then the same target can be rewritten as
\begin{align*}
I=\int_E f(x)\,\pi(dx)=\int_E f(x)w(x)\,q(dx).
\end{align*}
Thus, if $Y_1,\dots,Y_N$ are sampled from $q$, the corresponding estimator is
\begin{align*}
\widehat I_N^{\mathrm{IS}}=\frac{1}{N}\sum_{i=1}^N f(Y_i)w(Y_i).
\end{align*}
The integrand changes from $f$ to $fw$ because the samples are drawn from $q$ rather than from $\pi$.
In the Markov chain Monte Carlo regime, one constructs a Markov chain $X_0,X_1,\dots$ whose invariant distribution is $\pi$. The estimator has the same averaging form,
\begin{align*}
\widehat I_N^{\mathrm{MCMC}}=\frac{1}{N}\sum_{i=1}^N f(X_i),
\end{align*}
but the samples are now dependent; the role of invariance is to make $\pi$ the long-run target distribution of the chain.
In the sequential Monte Carlo regime, one introduces intermediate probability measures
\begin{align*}
\pi_0,\pi_1,\dots,\pi_T=\pi
\end{align*}
and represents the final measure by weighted particles $(X_T^j,W_T^j)_{j=1}^M$. The final expectation is then approximated by
\begin{align*}
\widehat I_M^{\mathrm{SMC}}=\sum_{j=1}^M W_T^j f(X_T^j),
\end{align*}
with weights normalized so that
\begin{align*}
\sum_{j=1}^M W_T^j=1.
\end{align*}
All four methods estimate the same deterministic integral; what changes is the random mechanism used to place mass near the parts of $E$ that matter for $f$ under $\pi$.
[/example]
This comparison gives the organisational map for the course. The same target notation persists, while the sampling mechanism and error analysis change.
## Validity, Diagnostics, and Failure Modes
The final introductory problem is how to know whether a Monte Carlo answer should be trusted. A numerical value without an error assessment is not yet a computational result.
[definition: Monte Carlo Diagnostic]
A Monte Carlo diagnostic is a computable summary of the simulation output designed to reveal accuracy, instability, or failure of the estimator or sampling mechanism.
[/definition]
Diagnostics are not substitutes for theorems, but they are essential in applied work. Examples include estimated standard errors, effective sample size, trace plots, autocorrelation estimates, acceptance rates, weight histograms, and resampling counts.
[remark: Diagnostics Are Model Dependent]
A diagnostic can fail to detect a serious problem if its assumptions do not match the algorithm. For instance, a small estimated standard error from a single poorly mixing Markov chain may measure only local fluctuation, while an importance sampler with a few extreme weights may underestimate rare but dominant contributions. The course treats diagnostics as mathematical objects with hypotheses, not as generic certificates.
[/remark]
This point sets the tone for the remaining chapters. Monte Carlo methods are powerful because they turn computation into probability, but that same randomness demands rigorous error accounting and careful algorithm design.
## Course Roadmap
The organising problem for the rest of the course is how to keep Monte Carlo error measurable when the ideal independent-sampling picture breaks down. Each new family of methods changes the way samples are generated, so each part of the course asks which target is preserved, which error estimate survives, and which new failure mode appears. The course first develops independent Monte Carlo estimation, including laws of large numbers, central limit theorems, confidence intervals, effective sample size, and variance reduction. These tools provide the baseline against which later algorithms are compared.
The second part studies importance sampling and normalizing constants. The main ideas are change of measure, likelihood ratios, proposal design, self-normalisation, rare-event estimation, and the collapse of estimators when weights become too variable.
The third part develops Markov chain Monte Carlo. It covers invariant distributions, detailed balance, Metropolis-Hastings, Gibbs sampling, convergence, autocorrelation, effective sample size for dependent samples, and diagnostics for poor mixing.
The fourth part studies sequential and particle methods. It introduces particle filters, resampling, Feynman-Kac viewpoints, genealogical degeneracy, and applications to state-space models and annealed sampling.
The final part connects these methods to modern applied computation. Topics include pseudo-marginal methods, stochastic gradient Monte Carlo, approximate Bayesian computation, high-dimensional scaling, and the practical limits of simulation-based inference.
The methods surveyed in the introduction all share a common computational theme: they replace exact analytic calculation with randomized approximation, and then ask how the resulting uncertainty should be measured. Chapter 1 begins from that basic Monte Carlo question and develops the standard estimator, its error, and the limits of what averaging alone can accomplish.
# 1. Monte Carlo Estimation and Error
Monte Carlo methods begin with the problem of computing a number that is naturally an expectation, while direct integration or exhaustive summation is infeasible. The basic response is to simulate random inputs, average the outputs, and quantify the sampling error of that average. This chapter develops the first error calculus for that procedure: bias, variance, root mean square error, laws of large numbers, central limit theorems, confidence intervals, and the main elementary variance reduction devices.
The point of the chapter is not only that averages converge. We also need to know how quickly they fluctuate, when a normal error bar is justified, and how to redesign a simulation so that the same computational budget gives a smaller error. The prerequisites are basic probability spaces, random variables, expectation, variance, independence, convergence in distribution, and elementary real analysis at the level of [measurable functions](/page/Measurable%20Functions) and $L^p$ integrability. Chapters 2, 4, and 9 on importance sampling, Markov chain Monte Carlo, and particle methods all reuse the language introduced here.
## Monte Carlo Estimators for Expectations
The first question is how to turn a deterministic numerical target into a random variable whose expectation is that target. Once this is done, repeated simulation gives an empirical average.
[definition: Crude Monte Carlo Estimator]
Let $(E,\mathcal E)$ be a measurable space, let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a random variable, and let $f:E\to\mathbb R$ be measurable with $f(X)\in L^1$. For i.i.d. random variables $X_1,\dots,X_n:(\Omega,\mathcal F)\to(E,\mathcal E)$ with the same distribution as $X$, the crude Monte Carlo estimator of $\mu=\mathbb E[f(X)]$ is the map
\begin{align*}
\hat\mu_n:E^n&\to\mathbb R, &
\hat\mu_n(x_1,\dots,x_n)&=\frac{1}{n}\sum_{i=1}^{n}f(x_i).
\end{align*}
[/definition]
Applied to the simulated sample, the estimator is the random output $\hat\mu_n(X_1,\dots,X_n)$. This definition is the core template: sample from the law that defines the expectation and average the resulting values. A geometric problem gives the simplest concrete instance, and it also shows why indicator functions are useful simulation outputs.
[example: Estimating Pi By Random Points]
Let $(U_i,V_i)_{i=1}^{n}$ be i.i.d. random vectors uniformly distributed on $[0,1]^2$, and define the indicator of the quarter unit disk by
\begin{align*}
Y_i=\mathbb{1}_{\{U_i^2+V_i^2\le 1\}}.
\end{align*}
Since $(U_i,V_i)$ is uniform on $[0,1]^2$, the expectation of this indicator is the area of the set it indicates divided by the area of the square:
\begin{align*}
\mathbb E[Y_i]=\mathbb P(U_i^2+V_i^2\le1)=\frac{\mathcal L^2(\{(u,v)\in[0,1]^2:u^2+v^2\le1\})}{\mathcal L^2([0,1]^2)}.
\end{align*}
The set $\{(u,v)\in[0,1]^2:u^2+v^2\le1\}$ is one quarter of the unit disk, so its area is $\pi/4$, while $\mathcal L^2([0,1]^2)=1$. Hence
\begin{align*}
\mathbb E[Y_i]=\frac{\pi/4}{1}=\frac{\pi}{4}.
\end{align*}
Thus
\begin{align*}
\hat\pi_n=4\cdot\frac{1}{n}\sum_{i=1}^{n}Y_i
\end{align*}
is centred at $\pi$, because
\begin{align*}
\mathbb E[\hat\pi_n]=4\cdot\frac{1}{n}\sum_{i=1}^{n}\mathbb E[Y_i]=4\cdot\frac{1}{n}\cdot n\cdot\frac{\pi}{4}=\pi.
\end{align*}
Each $Y_i$ is Bernoulli with success probability $p=\pi/4$, so $Y_i^2=Y_i$ and
\begin{align*}
\operatorname{Var}(Y_i)=\mathbb E[Y_i^2]-\mathbb E[Y_i]^2=p-p^2=p(1-p)=\frac{\pi}{4}\left(1-\frac{\pi}{4}\right).
\end{align*}
Independence gives zero covariance between distinct $Y_i$ and $Y_j$, so
\begin{align*}
\operatorname{Var}(\hat\pi_n)=16\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}Y_i\right)=16\cdot\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(Y_i)=\frac{16}{n}\frac{\pi}{4}\left(1-\frac{\pi}{4}\right).
\end{align*}
Therefore the standard deviation is
\begin{align*}
\sqrt{\operatorname{Var}(\hat\pi_n)}=4\sqrt{\frac{1}{n}\frac{\pi}{4}\left(1-\frac{\pi}{4}\right)},
\end{align*}
so the random error scale is proportional to $n^{-1/2}$.
[/example]
The example estimates an area by sampling uniformly from a square, and it motivates the following definition for general finite-volume integrals. A common failure at this point is to average $g(X_i)$ and forget the volume factor; that estimates the average value of $g$ over $D$, not the integral of $g$ over $D$. We need a formal way to convert integration over a domain into expectation under a probability distribution on that domain.
[definition: Integral Monte Carlo Estimator]
Let $D\subset\mathbb R^d$ be measurable with $0<\mathcal L^d(D)<\infty$, let $g:D\to\mathbb R$ be measurable with $g\in L^1(D)$, and let $X_1,\dots,X_n:(\Omega,\mathcal F)\to(D,\mathcal B(D))$ be i.i.d. with distribution $\operatorname{Unif}(D)$. The Monte Carlo estimator of
\begin{align*}
I=\int_D g(x)\,d\mathcal L^d(x)
\end{align*}
is the map
\begin{align*}
\hat I_n:D^n&\to\mathbb R, &
\hat I_n(x_1,\dots,x_n)&=\mathcal L^d(D)\frac{1}{n}\sum_{i=1}^{n}g(x_i).
\end{align*}
[/definition]
This estimator has the same structure as the estimator of $\pi$, but the output is now a function value scaled by the volume of the domain. The form does not become more complicated in high dimension; instead, the main challenge becomes measuring random error.
## Bias, Variance, and Root Mean Square Error
A Monte Carlo output is random, so accuracy is described statistically. The first distinction is between systematic displacement from the target and random fluctuation around the estimator's own mean.
[definition: Bias]
Let $(S,\mathcal S,\mathbb P_S)$ be the sample probability space for a simulation output, let $\hat\theta:S\to\mathbb R$ be a measurable estimator of a scalar target $\theta\in\mathbb R$, and suppose $\hat\theta\in L^1(S,\mathcal S,\mathbb P_S)$. For fixed $\theta$, the bias functional is the map
\begin{align*}
\operatorname{bias}_\theta:L^1(S,\mathcal S,\mathbb P_S)&\to\mathbb R, &
\operatorname{bias}_\theta(\hat\theta)&=\mathbb E[\hat\theta]-\theta.
\end{align*}
[/definition]
Bias records whether repeated runs are centred on the target; when the target is fixed, we write $\operatorname{bias}(\hat\theta)$ for $\operatorname{bias}_\theta(\hat\theta)$. Once that systematic component is named, we need a second quantity for the scale of run-to-run variation.
[definition: Variance and Standard Error]
Let $(S,\mathcal S,\mathbb P_S)$ be the sample probability space for a simulation output, and let $\hat\theta:S\to\mathbb R$ be a measurable square-integrable estimator of a scalar target $\theta\in\mathbb R$. The variance is $\operatorname{Var}(\hat\theta)$, and the standard error is the map
\begin{align*}
\operatorname{se}:L^2(S,\mathcal S,\mathbb P_S)&\to[0,\infty), &
\operatorname{se}(\hat\theta)&=\sqrt{\operatorname{Var}(\hat\theta)}.
\end{align*}
[/definition]
The standard error becomes useful only when the variance of the reported average can be related to the variance of one simulation draw. Independence is the key question: if averaging really removes cross-covariances, then the random error should shrink at the familiar square-root rate.
[quotetheorem:7199]
[citeproof:7199]
The theorem explains the random error of an unbiased average, and its hypotheses are doing real work. If the variables are positively correlated, covariance terms remain and the variance may be much larger than $\sigma^2/n$; if $\sigma^2=\infty$, the displayed formula does not define a finite standard error. The result is therefore a rate statement for independent square-integrable sampling, not a universal promise for every simulation output. To compare estimators that may also have systematic approximation error, we need an accuracy measure that includes both effects.
[definition: Root Mean Square Error]
Let $(S,\mathcal S,\mathbb P_S)$ be the sample probability space for a simulation output, let $\hat\theta:S\to\mathbb R$ be a measurable square-integrable estimator, and let $\theta\in\mathbb R$ be the scalar target. For fixed $\theta$, the mean square error is the map
\begin{align*}
\operatorname{MSE}_\theta:L^2(S,\mathcal S,\mathbb P_S)&\to[0,\infty), &
\operatorname{MSE}_\theta(\hat\theta)&=\mathbb E[(\hat\theta-\theta)^2].
\end{align*}
The root mean square error is the map
\begin{align*}
\operatorname{RMSE}_\theta:L^2(S,\mathcal S,\mathbb P_S)&\to[0,\infty), &
\operatorname{RMSE}_\theta(\hat\theta)&=\sqrt{\operatorname{MSE}_\theta(\hat\theta)}.
\end{align*}
[/definition]
RMSE is measured in the same units as the target; when the target is fixed, we write $\operatorname{MSE}(\hat\theta)$ and $\operatorname{RMSE}(\hat\theta)$. The next identity explains how RMSE combines the bias and variance quantities already introduced.
[quotetheorem:4425]
[citeproof:4425]
The decomposition separates two errors that behave differently under repetition. Averaging more independent samples reduces variance but does not remove bias, so a biased time-discretized simulation can converge confidently to the wrong number. The square-integrability assumption is needed because MSE and variance are second-moment quantities; without it, this decomposition is not an available diagnostic. Option pricing provides a standard example where the target is an expectation under a risk-neutral model.
[example: Option Pricing By Crude Monte Carlo]
In a risk-neutral model, let the discounted payoff be $Y=e^{-rT}h(S_T)$ with $Y\in L^2$, and let $Y_i=e^{-rT}h(S_{T,i})$ be independent copies of $Y$ obtained from exact independent path simulation. The option price is $V=\mathbb E[Y]$, and the crude Monte Carlo price estimator is
\begin{align*}
\hat V_n=\frac{1}{n}\sum_{i=1}^{n}Y_i=\frac{1}{n}\sum_{i=1}^{n}e^{-rT}h(S_{T,i}).
\end{align*}
Linearity of expectation gives
\begin{align*}
\mathbb E[\hat V_n]=\mathbb E\left[\frac{1}{n}\sum_{i=1}^{n}Y_i\right]=\frac{1}{n}\sum_{i=1}^{n}\mathbb E[Y_i].
\end{align*}
Since each $Y_i$ has the same distribution as $Y$, $\mathbb E[Y_i]=\mathbb E[Y]=V$, so
\begin{align*}
\mathbb E[\hat V_n]=\frac{1}{n}\sum_{i=1}^{n}V=\frac{nV}{n}=V.
\end{align*}
Thus the estimator is unbiased under exact simulation.
Because the paths are independent, the variables $Y_1,\dots,Y_n$ are independent, so all covariance terms with distinct indices vanish. Therefore
\begin{align*}
\operatorname{Var}(\hat V_n)=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}Y_i\right)=\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(Y_i).
\end{align*}
Each $Y_i$ has variance $\operatorname{Var}(Y)$, hence
\begin{align*}
\operatorname{Var}(\hat V_n)=\frac{1}{n^2}\cdot n\operatorname{Var}(Y)=\frac{\operatorname{Var}(Y)}{n}.
\end{align*}
Since the bias is $0$, the [bias-variance decomposition](/theorems/1424) gives
\begin{align*}
\operatorname{RMSE}(\hat V_n)=\sqrt{\operatorname{Var}(\hat V_n)}=\sqrt{\frac{\operatorname{Var}(Y)}{n}}.
\end{align*}
If a variance reduction method replaces $Y$ by an unbiased payoff with variance $\operatorname{Var}(Y)/4$, then the new RMSE is
\begin{align*}
\sqrt{\frac{\operatorname{Var}(Y)/4}{n}}=\frac{1}{2}\sqrt{\frac{\operatorname{Var}(Y)}{n}}.
\end{align*}
Using four times as many independent paths instead gives
\begin{align*}
\sqrt{\frac{\operatorname{Var}(Y)}{4n}}=\frac{1}{2}\sqrt{\frac{\operatorname{Var}(Y)}{n}}.
\end{align*}
Thus cutting payoff variance by a factor of four has the same RMSE effect as multiplying the number of independent paths by four.
[/example]
The option example uses independent paths, but later algorithms often produce correlated outputs. This motivates effective sample size, which expresses the variance of a possibly dependent estimator in independent-sample units.
[definition: Effective Sample Size]
Let $Y_1,\dots,Y_n:(\Omega,\mathcal F)\to\mathbb R$ be simulated square-integrable outputs with common marginal variance $\sigma^2>0$, and let $\hat\mu_n=H(Y_1,\dots,Y_n)$ be an unbiased estimator of $\mu$, where $H:\mathbb R^n\to\mathbb R$ is measurable and $\operatorname{Var}(\hat\mu_n)>0$. The effective sample size is
\begin{align*}
\operatorname{ESS}=\frac{\sigma^2}{\operatorname{Var}(\hat\mu_n)}.
\end{align*}
[/definition]
For an independent average, $\operatorname{ESS}=n$. Positive dependence usually lowers this number, while useful negative dependence can raise it relative to a naive design.
## Laws of Large Numbers and Error Certification
The next question is why the empirical average should be trusted as the number of simulations grows. The first answer is consistency, expressed as almost sure convergence.
[quotetheorem:520]
[citeproof:520/monte-carlo-methods]
The strong law says the estimator converges, but each hypothesis has a role. Independence and identical distribution give a stable repeated-sampling experiment, while integrability gives a finite target $\mu$ and enough tail control for averages to settle. For example, if $Y_i$ are i.i.d. standard Cauchy random variables, then $\mathbb E[|Y_1|]=\infty$ and the sample average has the same Cauchy distribution as each summand, so it does not settle down to a deterministic target. The conclusion is almost sure convergence, not a finite-run error certificate: it does not give an error bar for a fixed simulation budget, a convergence rate, or a diagnostic for whether a particular run has already reached the asymptotic regime. This gap is exactly why Monte Carlo error analysis next studies the normalized fluctuation $\sqrt n(\hat\mu_n-\mu)$ rather than only the unscaled average.
In the i.i.d. square-integrable setting with mean $\mu$ and variance $\sigma^2\in(0,\infty)$, the classical [central limit theorem](/theorems/1848) gives
\begin{align*}
\sqrt n(\hat\mu_n-\mu)\Rightarrow \mathcal N(0,\sigma^2).
\end{align*}
This explains the $n^{-1/2}$ scale behind ordinary Monte Carlo error bars, but finite variance is essential. With i.i.d. Cauchy outputs, the sample average remains Cauchy rather than becoming Gaussian after $\sqrt n$ scaling, so the usual normal interval has no basis. Independence is also essential for this statement; if all outputs are copies of the same square-integrable random variable $Y$, then the average is $Y$ for every $n$ and the normalized error does not have the independent-sampling limit. Some simulation estimators are sums of independent but non-identically distributed terms, for example under stratification or adaptive allocation, so we need a version that rules out domination by a few large summands.
[quotetheorem:7200]
[citeproof:7200]
The theorem explains when a normal approximation is credible for triangular arrays, and its hypotheses separate three issues. Independence permits the characteristic-function product argument, centering removes deterministic drift, and the Lindeberg condition excludes a failure mode in which a small number of rare but very large terms controls the sum. A concrete failure is the array with $k_n=1$ and $Y_{n,1}=\sqrt n$ with probability $1/(2n)$, $Y_{n,1}=-\sqrt n$ with probability $1/(2n)$, and $Y_{n,1}=0$ otherwise. Then $\operatorname{Var}(Y_{n,1})=1$, but $Y_{n,1}\to0$ in probability rather than to $\mathcal N(0,1)$, and the Lindeberg term equals $1$ for every fixed $\varepsilon>0$ and all large $n$.
The convergence of total variance to a positive finite number fixes the limiting scale; without it, the limiting law would either collapse to a point or require a different normalization. The theorem also does not say that every tail probability is accurately approximated at the finite sample sizes used in a computation, nor does it justify normal intervals for dependent arrays, biased estimators, or arrays whose variance is itself unknown and poorly estimated. It is a weak-convergence statement for the sum, not a uniform guarantee over all rare events. To use the normal approximation in computation, we also need to estimate the unknown variance from the simulated outputs.
[definition: Monte Carlo Sample Variance]
For $n\ge2$, the Monte Carlo sample variance is the measurable statistic $s_n^2:\mathbb R^n\to\mathbb R$ given by
\begin{align*}
s_n^2(y_1,\dots,y_n)=\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar y_n)^2,
\qquad
\bar y_n=\frac{1}{n}\sum_{i=1}^{n}y_i.
\end{align*}
[/definition]
Applied to simulated scalar outputs $Y_1,\dots,Y_n$, the sample variance gives $s_n^2(Y_1,\dots,Y_n)$. It supplies the missing scale in the central limit theorem. Without such a scale estimate, a simulation report like $0.731$ can look precise even when the run-to-run standard deviation is larger than the last displayed digits. This motivates the standard asymptotic confidence interval used as a randomized certificate of sampling error.
[quotetheorem:7201]
[citeproof:7201]
The interval is justified by an asymptotic theorem, so its nominal coverage is not an exact finite-sample guarantee. The finite-variance assumption is needed both for the central limit theorem and for $s_n^2$ to estimate a finite scale; heavy-tailed output can make the interval unstable or misleading. Independence is also part of the certificate: correlated Markov chain output requires an autocorrelation correction rather than the factor $s_n/\sqrt n$. This interval certifies sampling error, not modelling error, discretization error, pseudo-random generator defects, or bias from approximate simulation. Those other errors must be analysed separately.
## Ratios and Posterior Expectations
Many applied Monte Carlo targets are ratios of expectations rather than expectations themselves. Ratios appear in Bayesian computation, conditional expectation estimation, and later in self-normalized importance sampling. The obstruction is that nonlinear operations do not preserve unbiasedness: even if $\bar A_n$ and $\bar B_n$ are unbiased for their expectations, $\mathbb E[\bar A_n/\bar B_n]$ usually differs from $\mathbb E[A_1]/\mathbb E[B_1]$.
[definition: Monte Carlo Ratio Estimator]
Let $(A_i,B_i)_{i=1}^{n}:(\Omega,\mathcal F)\to\mathbb R^2$ be i.i.d. random vectors with $\mathbb E[|A_1|]<\infty$, $\mathbb E[|B_1|]<\infty$, and $\mathbb E[B_1]=\nu\ne0$. The ratio estimator of $\theta=\mathbb E[A_1]/\mathbb E[B_1]$ is the measurable map, defined on samples with nonzero second empirical mean,
\begin{align*}
\hat\theta_n:\left\{((a_1,b_1),\dots,(a_n,b_n))\in(\mathbb R^2)^n:\frac{1}{n}\sum_{i=1}^{n}b_i\ne0\right\}&\to\mathbb R, &
\hat\theta_n((a_1,b_1),\dots,(a_n,b_n))&=\frac{\sum_{i=1}^{n}a_i}{\sum_{i=1}^{n}b_i},
\end{align*}
where
\begin{align*}
\bar A_n=\frac{1}{n}\sum_{i=1}^{n}A_i,
\qquad
\bar B_n=\frac{1}{n}\sum_{i=1}^{n}B_i.
\end{align*}
[/definition]
A ratio estimator is nonlinear, so unbiased numerator and denominator estimates do not generally make the ratio unbiased. Its denominator may also fluctuate near zero, which can turn small sampling errors into large changes in the ratio. To use such estimators responsibly, we need conditions under which the ratio map is locally stable and its leading fluctuation is governed by a linear approximation.
[quotetheorem:1860]
[citeproof:1860]
The theorem is a local linear approximation, so the condition $\nu\ne0$ is essential. If $A_i\equiv1$ and $B_i$ are i.i.d. $\mathcal N(0,1)$, then the formal target would require division by $\mathbb E[B_1]=0$, and the empirical ratio $1/\bar B_n$ has severe instability near $\bar B_n=0$ rather than a Gaussian limit around a finite number. The finite second-moment assumption supplies the covariance matrix used in the quadratic form; without it, the delta-method variance is not defined: if $B_i\equiv1$ and $A_i$ are i.i.d. Cauchy, the ratio estimator is the Cauchy sample mean, not a normal fluctuation governed by a covariance matrix. Bayesian posterior means are a central example when exact posterior draws are available. They also show what later changes when independent draws are replaced by weighted samples or Markov chains.
[example: Bayesian Posterior Mean From Simulated Draws]
Let $Z_i=h(\theta_i)$ for $i=1,\dots,n$. Since $\theta_i$ has posterior distribution $\pi(\cdot\mid y)$, each $Z_i$ has mean
\begin{align*}
\mathbb E[Z_i]=\mathbb E[h(\theta_i)\mid y]=m.
\end{align*}
The posterior mean estimator is therefore
\begin{align*}
\hat m_n=\frac{1}{n}\sum_{i=1}^{n}Z_i=\frac{1}{n}\sum_{i=1}^{n}h(\theta_i).
\end{align*}
Linearity of expectation gives
\begin{align*}
\mathbb E[\hat m_n]=\mathbb E\left[\frac{1}{n}\sum_{i=1}^{n}Z_i\right]=\frac{1}{n}\sum_{i=1}^{n}\mathbb E[Z_i].
\end{align*}
Substituting $\mathbb E[Z_i]=m$ for every $i$ gives
\begin{align*}
\mathbb E[\hat m_n]=\frac{1}{n}\sum_{i=1}^{n}m=\frac{nm}{n}=m.
\end{align*}
Thus independent exact posterior simulation makes $\hat m_n$ an unbiased estimator of the posterior expectation.
Because $h$ is square-integrable under the posterior, $\sigma_h^2=\operatorname{Var}(h(\theta)\mid y)$ is finite. The variables $Z_1,\dots,Z_n$ are independent, so distinct covariance terms vanish, and
\begin{align*}
\operatorname{Var}(\hat m_n)=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}Z_i\right)=\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(Z_i).
\end{align*}
Each $Z_i$ has variance $\sigma_h^2$, hence
\begin{align*}
\operatorname{Var}(\hat m_n)=\frac{1}{n^2}\cdot n\sigma_h^2=\frac{\sigma_h^2}{n}.
\end{align*}
Since $\sigma_h^2$ is usually unknown, estimate it from the simulated values by
\begin{align*}
s_n^2=\frac{1}{n-1}\sum_{i=1}^{n}(h(\theta_i)-\hat m_n)^2.
\end{align*}
The estimated Monte Carlo standard error is then
\begin{align*}
\frac{s_n}{\sqrt n}.
\end{align*}
This standard error describes independent posterior draws; if the same values are produced by a Markov chain, covariance terms between different iterations no longer vanish, so the denominator must be adjusted through an effective sample size rather than treated as $n$.
[/example]
## Variance Reduction Principles
The previous sections show that reducing RMSE by a factor of ten often requires one hundred times as many independent samples. This is a computational obstruction, not merely a statistical inconvenience: if each sample requires solving a differential equation or simulating a long asset path, brute-force replication can be the dominant cost of the computation. Variance reduction asks whether we can change the estimator while preserving the target.
[definition: Variance Reduction]
A variance reduction method for a scalar target $\theta\in\mathbb R$ is a construction of a measurable estimator map $T_n:S_n\to\mathbb R$ on a simulation sample space $S_n$, producing $\tilde\theta_n=T_n(Z_n)$ from simulated data $Z_n$, such that $\mathbb E[\tilde\theta_n]=\theta$ or such that its bias is controlled, and
\begin{align*}
\operatorname{MSE}(\tilde\theta_n)<\operatorname{MSE}(\hat\theta_n)
\end{align*}
for a specified baseline estimator $\hat\theta_n$ at comparable computational cost.
[/definition]
This definition frames the goal as estimator design. The first design changes dependence by pairing outputs so that high values tend to be balanced by low values, and this motivates the following antithetic estimator.
[definition: Antithetic Variables Estimator]
Let $U_1,\dots,U_n:(\Omega,\mathcal F)\to(0,1)$ be i.i.d. random variables with distribution $\operatorname{Unif}(0,1)$, and let $g:(0,1)\to\mathbb R$ be measurable with $g\in L^2(0,1)$. The antithetic variables estimator of $\mu=\int_0^1g(u)\,d\mathcal L^1(u)$ is the map
\begin{align*}
\hat\mu_{n,\mathrm{anti}}:(0,1)^n&\to\mathbb R, &
\hat\mu_{n,\mathrm{anti}}(u_1,\dots,u_n)&=\frac{1}{2n}\sum_{i=1}^{n}\bigl(g(u_i)+g(1-u_i)\bigr).
\end{align*}
[/definition]
The estimator remains unbiased because $U_i$ and $1-U_i$ have the same marginal distribution. Unbiasedness alone does not tell us whether the pairing improves precision, since the answer depends on the covariance between paired outputs. The next calculation shows exactly when the induced dependence helps.
[quotetheorem:7202]
[citeproof:7202]
The identity shows that antithetic sampling helps only when the covariance term is negative, and the assumptions specify the exact situation in which the calculation applies. The condition $U\sim\operatorname{Unif}(0,1)$ ensures that $U$ and $1-U$ have the same marginal law, so the paired average still targets $\int_0^1 g(u)\,d\mathcal L^1(u)$. The square-integrability assumption is needed because both variances and covariance appear in the formula. If $g$ is increasing on $(0,1)$, then $g(U)$ and $g(1-U)$ tend to move in opposite directions, but monotonicity is a separate sufficient condition for negative covariance, not part of the identity itself.
The identity does not say that the antithetic estimator is always better than using two independent function evaluations. For the non-monotone example $g(u)=u(1-u)$, we have $g(1-U)=g(U)$, so $\operatorname{Cov}(Y,Y')=\operatorname{Var}(Y)>0$ and the antithetic pair has variance $\operatorname{Var}(Y)$ instead of the $\operatorname{Var}(Y)/2$ obtained by averaging two independent copies. It also does not address computational cost: if evaluating $g(1-U)$ is substantially more expensive than generating another independent draw, variance per unit time must be compared separately. Thus the method is useful when the transformation creates balancing behaviour at comparable cost. This motivates a second design, where a correlated quantity with known expectation is used to subtract noise more directly.
[definition: Control Variate Estimator]
Let $Y,C:(\Omega,\mathcal F)\to\mathbb R$ be square-integrable random variables with target $\mu=\mathbb E[Y]$ and known $\mathbb E[C]=c$. For $\beta\in\mathbb R$, the control variate random variable is the measurable map $Y_\beta:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ defined by
\begin{align*}
Y_\beta=Y-\beta(C-c).
\end{align*}
The corresponding Monte Carlo estimator is the sample-average map
\begin{align*}
T_{n,\beta}:\mathbb R^n&\to\mathbb R, &
T_{n,\beta}(y_{\beta,1},\dots,y_{\beta,n})&=\frac{1}{n}\sum_{i=1}^{n}y_{\beta,i}
\end{align*}
applied to independent copies of $Y_\beta$.
[/definition]
The estimator is unbiased for every $\beta$, so the design problem is to choose the coefficient with smallest variance. This is a one-dimensional quadratic optimization.
[quotetheorem:7203]
[citeproof:7203]
The assumption $\operatorname{Var}(C)>0$ prevents division by a degenerate control, which would carry no random information to subtract. The formula also shows a limitation: if $\operatorname{Cov}(Y,C)=0$, the optimal coefficient is $0$ and the control variate gives no variance reduction. A poorly estimated or poorly chosen coefficient can even increase variance in finite samples. This makes control variates most natural when the simulation contains a cheap auxiliary quantity with known mean and strong correlation with the payoff. When the dominant error instead comes from random imbalance across different regions of the input space, the next design fixes the number of samples assigned to each region; stratification uses known structure of the sample space to prevent overrepresentation of some regions and underrepresentation of others.
[definition: Stratified Monte Carlo Estimator]
Let $D\subset\mathbb R^d$ be measurable with $0<\mathcal L^d(D)<\infty$, let $g:D\to\mathbb R$ be measurable, and let $D$ be partitioned into measurable sets $D_1,\dots,D_K$ with $p_k=\mathcal L^d(D_k)/\mathcal L^d(D)>0$. Let $n_k\ge1$ for every $k$, and let $X_{k,1},\dots,X_{k,n_k}:(\Omega,\mathcal F)\to(D_k,\mathcal B(D_k))$ be independent samples from $\operatorname{Unif}(D_k)$, with $\sum_{k=1}^{K}n_k=n$. The stratified estimator of $I=\int_Dg(x)\,d\mathcal L^d(x)$ is the map
\begin{align*}
\hat I_{\mathrm{strat}}:\prod_{k=1}^{K}D_k^{n_k}&\to\mathbb R, &
\hat I_{\mathrm{strat}}\bigl((x_{k,j})_{1\le k\le K,\,1\le j\le n_k}\bigr)&=\mathcal L^d(D)\sum_{k=1}^{K}p_k\left(\frac{1}{n_k}\sum_{j=1}^{n_k}g(x_{k,j})\right).
\end{align*}
[/definition]
The estimator is unbiased because it averages unbiased estimates of the conditional mean in each stratum.
The remaining obstruction is budget allocation. With a fixed total number of samples, an unbiased stratified estimator can still have poor precision if high-mass or high-variance strata receive too few samples. A variance identity is needed to separate the contribution of each stratum and show exactly where the allocation numbers $n_k$ enter.
[quotetheorem:7204]
[citeproof:7204]
The formula shows why allocation matters: strata with large $p_k\sigma_k$ deserve more samples than strata with small variability or small mass. The condition $n_k\ge1$ is needed because each conditional mean is estimated by an average inside that stratum; a stratum with $p_k>0$ and no samples would make the displayed estimator and variance formula undefined. Square-integrability supplies the finite conditional variances, and independence across strata is what removes cross-covariance terms from the variance of the weighted sum.
The hypotheses are not cosmetic. Take $D=[0,1]$, $D_1=[0,1/2]$, $D_2=(1/2,1]$, $p_1=p_2=1/2$, and $g(x)=x$. If the two stratum samples are forced to use the same underlying uniform variable by setting $X_{1,1}=U/2$ and $X_{2,1}=1/2+U/2$, the two terms are positively correlated; the actual variance of the weighted sum contains an additional covariance term that the theorem's formula omits. Conversely, if $g$ has infinite second moment on one stratum, $\sigma_k^2$ is not finite and the formula cannot provide a finite error estimate. The theorem gives a variance identity for a fixed independent stratified design, not an automatic claim that every partition improves on crude Monte Carlo or that the displayed allocation is computationally optimal. Stratification reduces error by conditioning the sampling design on region labels; the next method applies the same conditioning idea directly to the estimator itself. When part of the simulation can be averaged analytically after conditioning on a smaller information set, Rao-Blackwellization replaces the original output by that conditional mean and removes the residual randomness.
[definition: Rao-Blackwellized Estimator]
Let $Y:(\Omega,\mathcal F)\to\mathbb R$ be a square-integrable unbiased estimator of $\theta$, and let $\mathcal G\subset\mathcal F$ be a sub-$\sigma$-algebra. The Rao-Blackwellized estimator is the $\mathcal G$-measurable map $Y_{\mathrm{RB}}:(\Omega,\mathcal G)\to(\mathbb R,\mathcal B(\mathbb R))$ given by
\begin{align*}
Y_{\mathrm{RB}}:\omega\mapsto \mathbb E[Y\mid\mathcal G](\omega).
\end{align*}
[/definition]
The estimator has the same mean as $Y$ by the tower property. The remaining question is whether replacing a random output by its conditional mean merely preserves expectation or also improves reliability. Since conditioning removes the variation left inside each fibre of $\mathcal G$, the relevant comparison is between the two variances.
[quotetheorem:7205]
[citeproof:7205]
The theorem depends on square-integrability because the variance decomposition uses second moments. It also explains a practical limitation: Rao-Blackwellization is valuable only when the conditional expectation can be computed or accurately approximated; otherwise the conditional mean may be more expensive than the original simulation. Equality can occur, for instance when $Y$ is already $\mathcal G$-measurable, so conditioning has removed no randomness. These four techniques introduce the main design philosophy of Monte Carlo methods: preserve the target expectation while changing the distribution, dependence, or conditioning structure of the simulated variables. The next chapter studies the most important such change, importance sampling, where the proposal distribution is deliberately different from the target distribution.
Once crude Monte Carlo is understood, the next issue is how to reuse the same target expectation while changing the distribution that generates the samples. Chapter 2 takes up that change-of-measure viewpoint and shows why importance sampling is the natural refinement when direct sampling wastes effort in low-contribution regions.
# 2. Importance Sampling
Importance sampling begins with a practical failure of crude Monte Carlo: the samples may come from the correct distribution, but they may rarely visit the part of the space that contributes most to the expectation. The remedy is to sample from a different distribution and correct the change by a likelihood ratio. This chapter develops that idea from the measure-theoretic identity through self-normalized estimators, variance diagnostics, and rare-event design.
The central theme is that importance sampling is not merely a variance reduction trick. It is a change-of-measure method, and its success depends on absolute continuity, tail behaviour, and the match between the proposal distribution and the integrand. The same structure reappears in Bayesian computation, statistical physics through partition functions, sequential Monte Carlo through particle weights, and mathematical finance through changes of numeraire and rare-loss estimation. Building on Chapter 1's bias-variance calculus and ratio estimators, the chapter follows the progression from unbiased estimators to self-normalized ratio estimators, then to the pathologies that appear when weights become unstable.
## Change of Measure and Likelihood Ratios
Suppose the quantity of interest is an expectation under a probability measure $\mathbb P$, but direct simulation from $\mathbb P$ is expensive or statistically inefficient. The first question is when samples from another probability measure $\mathbb Q$ can still be used without changing the target quantity.
[definition: Importance Sampling Setting]
Let $(E, \mathcal E)$ be a measurable space. Let $\mathbb P$ and $\mathbb Q$ be probability measures on $(E, \mathcal E)$, and let $h:E\to\mathbb R$ be measurable. The target expectation is
\begin{align*}
I := \mathbb E_{\mathbb P}[h(X)] = \int_E h(x)\,d\mathbb P(x).
\end{align*}
The measure $\mathbb Q$ is called the proposal measure.
[/definition]
This setting separates the distribution defining the answer from the distribution used for simulation. To make that separation legitimate, every target-relevant event must remain visible under the proposal, and the correction factor must be a measurable random variable under the proposal. The next definition names this correction factor and makes the support requirement explicit through absolute continuity.
[definition: Likelihood Ratio]
Let $\mathbb P$ and $\mathbb Q$ be probability measures on $(E,\mathcal E)$ with $\mathbb P \ll \mathbb Q$. The likelihood ratio of $\mathbb P$ with respect to $\mathbb Q$ is a measurable map
\begin{align*}
L:E&\to[0,\infty], & x&\mapsto \frac{d\mathbb P}{d\mathbb Q}(x),
\end{align*}
where $L$ is a version of the Radon--Nikodym derivative satisfying
\begin{align*}
\mathbb P(A)=\int_A L(x)\,d\mathbb Q(x)
\end{align*}
for every $A\in\mathcal E$.
[/definition]
The likelihood ratio transfers integration under $\mathbb P$ into integration under $\mathbb Q$. This is the point at which a change of simulation law becomes an exact identity rather than an approximation. The next theorem is the fundamental identity behind every [importance sampling estimator](/theorems/2001) in this chapter.
[quotetheorem:7206]
[citeproof:7206]
The hypothesis $\mathbb P\ll\mathbb Q$ is the support condition that prevents invisible target mass. If there is an event $A$ with $\mathbb P(A)>0$ but $\mathbb Q(A)=0$, no simulation under $\mathbb Q$ can observe the contribution of $A$, and no finite likelihood ratio can repair the missing mass. The integrability assumption is also essential: the identity may still hold in the extended sense for nonnegative $h$, but Monte Carlo averages are meaningful as estimators of a finite number only when the weighted quantity has a finite expectation.
The identity turns a deterministic equality of integrals into a simulation method. Once $\mathbb Q$ is chosen, the random variable $h(Y)L(Y)$ has the correct expectation, so the natural Monte Carlo move is to average independent copies of it. The next definition records this estimator and fixes the notation used for its bias and variance analysis.
[definition: Standard Importance Sampling Estimator]
Let $Y_1,\dots,Y_N$ be i.i.d. with distribution $\mathbb Q$, let $L=d\mathbb P/d\mathbb Q$, and assume $hL\in L^1(\mathbb Q)$. The standard importance sampling estimator of $I=\mathbb E_{\mathbb P}[h(X)]$ is the measurable map
\begin{align*}
\widehat I_{N,\mathrm{IS}}:E^N&\to\mathbb R, &
(y_1,\dots,y_N)&\mapsto \frac{1}{N}\sum_{i=1}^N h(y_i)L(y_i).
\end{align*}
[/definition]
This estimator is unbiased when the weighted integrand is integrable, but precision depends on its second moment under the proposal. Proposal design is therefore a question about the distribution of $h(Y)L(Y)$, not merely about making sampling convenient. The next result gives the variance formula that makes this design criterion quantitative.
[quotetheorem:7207]
[citeproof:7207]
The variance formula shows why a proposal may be mathematically valid yet numerically poor. Absolute continuity gives unbiasedness, but it says nothing about the size of $\mathbb E_{\mathbb Q}[h(Y)^2L(Y)^2]$. A proposal with lighter tails than the target can produce rare but enormous weights; in that case the estimator may remain unbiased while having infinite variance, so ordinary standard-error estimates and normal approximations fail. This limitation turns proposal design into a second-moment problem rather than just a support-matching problem.
A shifted Gaussian proposal gives a concrete first example of moving samples toward the important region while keeping the likelihood ratio explicit.
[example: Gaussian Tail Probability]
Let $X\sim\mathcal N(0,1)$ and let $I=\mathbb P(X>a)=\mathbb E[\mathbb{1}_{\{X>a\}}]$ for a large threshold $a>0$. Take the proposal $Y\sim\mathcal N(\mu,1)$ with $\mu>a$. The target and proposal densities are
\begin{align*}
p(y)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)
\end{align*}
and
\begin{align*}
q(y)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2}\right).
\end{align*}
Since $q(y)>0$ for every $y\in\mathbb R$, the likelihood ratio is
\begin{align*}
L(y)=\frac{p(y)}{q(y)}=\frac{\exp(-y^2/2)}{\exp(-(y-\mu)^2/2)}.
\end{align*}
Expanding $(y-\mu)^2=y^2-2\mu y+\mu^2$ gives
\begin{align*}
L(y)=\exp\left(-\frac{y^2}{2}+\frac{y^2-2\mu y+\mu^2}{2}\right).
\end{align*}
Combining the terms in the exponent gives
\begin{align*}
L(y)=\exp\left(-\mu y+\frac{\mu^2}{2}\right).
\end{align*}
For i.i.d. samples $Y_1,\dots,Y_N\sim\mathcal N(\mu,1)$, define
\begin{align*}
\widehat I_N=\frac{1}{N}\sum_{i=1}^N\mathbb{1}_{\{Y_i>a\}}L(Y_i).
\end{align*}
Its expectation under the proposal is
\begin{align*}
\mathbb E_{\mathbb Q}[\mathbb{1}_{\{Y>a\}}L(Y)]=\int_a^\infty L(y)q(y)\,dy.
\end{align*}
Substituting $L(y)=p(y)/q(y)$ gives
\begin{align*}
\int_a^\infty L(y)q(y)\,dy=\int_a^\infty p(y)\,dy.
\end{align*}
Therefore
\begin{align*}
\mathbb E_{\mathbb Q}[\mathbb{1}_{\{Y>a\}}L(Y)]=\mathbb P(X>a)=I.
\end{align*}
Thus $\widehat I_N$ estimates the same Gaussian tail probability, but under the proposal the event has probability $\mathbb P_{\mathbb Q}(Y>a)=\mathbb P(Z>a-\mu)$ for $Z\sim\mathcal N(0,1)$, which is greater than $1/2$ when $\mu>a$. The proposal makes tail samples common, and the likelihood ratio exactly converts their average back to the original $\mathcal N(0,1)$ target law.
[/example]
The Gaussian example also gives a first proposal-design principle: place mass where the product of target density and integrand is large, while keeping the likelihood ratio stable. That principle has an ideal mathematical form, even though it usually cannot be implemented because it uses the unknown answer. The next theorem identifies this ideal proposal and explains what practical proposal families try to approximate.
[quotetheorem:7208]
[citeproof:7208]
The nonnegativity assumption is what makes $q^*$ proportional to $hp$ without cancellations. For a concrete failure, take $E=\{0,1\}$ with counting measure, $p(0)=p(1)=1/2$, $h(0)=3$, and $h(1)=-1$. Then $I=\int h p\,d\mu=1$, but $h(1)p(1)/I=-1/2$, so $hp/I$ is signed rather than nonnegative and cannot be a proposal density. If $h$ changes sign, the analogous variance-minimizing proposal depends on $|h|p$, and the summand is no longer a constant because the sign remains. The condition $0<I<\infty$ is also doing real work: when $I=0$, the displayed formula cannot normalize a probability density, and when $I=\infty$, the proposed normalizing constant is not finite. The theorem therefore does not give an implementable algorithm or a guarantee for arbitrary signed integrands; it is a benchmark showing that good proposals concentrate where the integrand contributes to the target integral.
The zero-variance proposal identifies the ideal shape: sample from a distribution proportional to the absolute contribution to the integral. In practice, proposals approximate that shape using parametric shifts, heavier tails, mixtures, adaptive fits, or problem-specific transformations. This perspective becomes even more important when the target density is known only up to a multiplicative constant.
## Self-Normalized Importance Sampling and Normalizing Constants
Many applied targets are known only up to a normalizing constant. Bayesian posterior distributions are the main example: the unnormalized density is often available, while the marginal likelihood is exactly the integral being avoided. The next question is how to estimate expectations when the likelihood ratio itself is known only up to scale.
[definition: Unnormalized Target Density]
Let $(E,\mathcal E,\mu)$ be a [measure space](/page/Measure%20Space). Let $\gamma:(E,\mathcal E)\to([0,\infty),\mathcal B([0,\infty)))$ be measurable with
\begin{align*}
Z:=\int_E \gamma(x)\,d\mu(x)\in(0,\infty).
\end{align*}
The normalized target density is the measurable map
\begin{align*}
\pi:E&\to[0,\infty), & x&\mapsto \frac{\gamma(x)}{Z}.
\end{align*}
[/definition]
The normalized target probability measure has density $\pi$ with respect to $\mu$. The unknown constant $Z$ prevents direct evaluation of the normalized density $\pi$, but the unnormalized function $\gamma$ can still be compared to a proposal density. This comparison is the computable part of the likelihood ratio. The next definition introduces the weight used both for estimating $Z$ and for forming normalized weighted averages.
[definition: Unnormalized Importance Weight]
Let $q:E\to[0,\infty)$ be a proposal density with $\gamma(x)>0\implies q(x)>0$ for $\mu$-a.e. $x$. The unnormalized importance weight is the measurable map
\begin{align*}
w:E&\to[0,\infty], & x&\mapsto \frac{\gamma(x)}{q(x)}
\end{align*}
on the set $\{x:q(x)>0\}$. On the $\mu$-null-equivalent part of $\{x:q(x)=0\}$ where $\gamma(x)=0$, set $w(x)=0$.
[/definition]
The weight $w$ was introduced because integration against $q$ cancels its denominator and recovers the integral of $\gamma$. That cancellation is exactly what is needed to estimate partition functions, marginal likelihoods, and other normalizing constants from proposal samples. The next theorem verifies this estimator before the same numerator-denominator idea is reused for self-normalized expectations.
[quotetheorem:7209]
[citeproof:7209]
The theorem requires integrability of $w(Y)$ because the [Strong Law of Large Numbers](/theorems/1852) is being applied to the proposed samples. The support condition is what prevents systematic underestimation of $Z$: for example, if $E=[0,1]$, $\mu=\mathcal L^1$, $\gamma(x)=1$, and the proposal density is $q(x)=2\mathbb{1}_{[0,1/2]}(x)$, then every sample misses the interval $(1/2,1]$. The resulting average estimates only $\int_0^{1/2}\gamma(x)\,d x=1/2$ rather than $Z=1$. The result also says only that $\widehat Z_N$ is unbiased and strongly consistent; it does not guarantee a small relative error, which still depends on the second moment of $w(Y)$.
For expectations under $\pi$, the unknown $Z$ appears in both numerator and denominator. The preceding theorem estimates the denominator, while the same construction applied to $wh$ estimates the numerator. The next estimator combines these two estimates into a ratio, which removes the need to know the normalizing constant in advance.
[definition: Self-Normalized Importance Sampling Estimator]
Let $Y_1,\dots,Y_N$ be i.i.d. from proposal density $q$, let $w:E\to[0,\infty]$ be the unnormalized importance weight, and assume $wh\in L^1(\mathbb Q)$. Define the positive-denominator domain by
\begin{align*}
D_N:=\left\{(y_1,\dots,y_N)\in E^N:\sum_{i=1}^N w(y_i)>0\right\}.
\end{align*}
For measurable $h:E\to\mathbb R$, the self-normalized importance sampling estimator is the measurable map
\begin{align*}
\widehat I_{N,\mathrm{SN}}:D_N&\to\mathbb R, &
(y_1,\dots,y_N)&\mapsto \frac{\sum_{i=1}^N w(y_i)h(y_i)}{\sum_{i=1}^N w(y_i)}.
\end{align*}
[/definition]
The estimator is a ratio, so finite-sample unbiasedness should not be expected from the unbiasedness of the two averages separately. Its main advantage is invariance under multiplying all weights by the same positive constant, which is why it works with unnormalized targets. The next theorem records the tradeoff: finite-sample bias in exchange for almost sure consistency.
[quotetheorem:7210]
[citeproof:7210]
For a concrete finite-sample bias example, let $N=1$, $E=\{0,1\}$, $q(0)=q(1)=1/2$, $w(0)=1$, $w(1)=2$, $h(0)=0$, and $h(1)=1$. Then the self-normalized estimator equals $h(Y_1)$, so its expectation is $1/2$, while the target ratio is
\begin{align*}
\frac{\mathbb E_{\mathbb Q}[w(Y)h(Y)]}{\mathbb E_{\mathbb Q}[w(Y)]}=\frac{2}{3}.
\end{align*}
A separate obstruction is possible zero denominator: if $w(Y)=0$ with probability $1/2$, then for $N=1$ the estimator is undefined with probability $1/2$.
Consistency says that the ratio eventually targets the correct value, but it does not quantify the leading sampling error. Since the estimator is a smooth function of two sample averages, the natural next tool is the [multivariate central limit theorem](/theorems/1854) followed by the [delta method](/theorems/1861). The resulting variance depends on weighted deviations from the target mean.
[quotetheorem:7211]
[citeproof:7211]
The theorem separates two requirements that are often conflated. The denominator only needs a law of large numbers, while the numerator fluctuation needs a finite second moment for the weighted centered quantity $w(Y)(h(Y)-I)$. If this second moment is infinite, the estimator may still be consistent, but the displayed Gaussian error approximation is no longer justified. If finite samples can have zero denominator, the asymptotic statement applies after the denominator becomes positive, while implementation still needs a convention or a proposal with $\mathbb Q(w>0)=1$.
Self-normalization is therefore consistent but not free. The estimator pays for the unknown normalizing constant by introducing ratio bias and by making accuracy depend on the variability of normalized weights. A Bayesian marginal likelihood calculation shows both the normalizing-constant estimator and the self-normalized posterior estimator in the same problem.
[example: Bayesian Marginal Likelihood]
Let $\theta\in\Theta$ be a parameter, fix observed data $y$, and write the unnormalized posterior density as
\begin{align*}
\gamma(\theta)=p(y\mid\theta)p(\theta).
\end{align*}
The marginal likelihood is the normalizing constant
\begin{align*}
Z=p(y)=\int_\Theta \gamma(\theta)\,d\theta=\int_\Theta p(y\mid\theta)p(\theta)\,d\theta.
\end{align*}
Choose a proposal density $q$ such that $q(\theta)>0$ whenever $\gamma(\theta)>0$, and define
\begin{align*}
w(\theta)=\frac{\gamma(\theta)}{q(\theta)}=\frac{p(y\mid\theta)p(\theta)}{q(\theta)}.
\end{align*}
If $\theta\sim q$, then the proposal expectation of the weight is
\begin{align*}
\mathbb E_q[w(\theta)]=\int_\Theta \frac{\gamma(\theta)}{q(\theta)}q(\theta)\,d\theta=\int_\Theta \gamma(\theta)\,d\theta=Z=p(y).
\end{align*}
Thus, for i.i.d. proposal samples $\theta_1,\dots,\theta_N\sim q$,
\begin{align*}
\widehat Z_N=\frac{1}{N}\sum_{i=1}^N w(\theta_i)
\end{align*}
is the corresponding importance-sampling estimator of the marginal likelihood.
The posterior density is
\begin{align*}
\pi(\theta\mid y)=\frac{\gamma(\theta)}{Z}=\frac{p(y\mid\theta)p(\theta)}{p(y)}.
\end{align*}
For a posterior expectation of $h(\theta)$, the target quantity is
\begin{align*}
\mathbb E_\pi[h(\theta)]=\int_\Theta h(\theta)\frac{\gamma(\theta)}{Z}\,d\theta.
\end{align*}
Moving the constant $1/Z$ outside the integral gives
\begin{align*}
\mathbb E_\pi[h(\theta)]=\frac{1}{Z}\int_\Theta h(\theta)\gamma(\theta)\,d\theta.
\end{align*}
Using $\gamma(\theta)=w(\theta)q(\theta)$ gives
\begin{align*}
\mathbb E_\pi[h(\theta)]=\frac{\int_\Theta h(\theta)w(\theta)q(\theta)\,d\theta}{\int_\Theta w(\theta)q(\theta)\,d\theta}.
\end{align*}
The self-normalized estimator replaces the numerator and denominator integrals by proposal sample averages:
\begin{align*}
\widehat I_{N,\mathrm{SN}}=\frac{N^{-1}\sum_{i=1}^N w(\theta_i)h(\theta_i)}{N^{-1}\sum_{i=1}^N w(\theta_i)}=\frac{\sum_{i=1}^N w(\theta_i)h(\theta_i)}{\sum_{i=1}^N w(\theta_i)}.
\end{align*}
The unknown marginal likelihood cancels from this ratio, while the separate average $\widehat Z_N$ estimates that same constant directly.
[/example]
This example also highlights a frequent tension. A proposal close to the posterior may produce stable self-normalized expectations, but estimating the marginal likelihood requires accurate coverage of the full prior-likelihood mass, including regions with non-negligible contribution to $Z$.
## Degeneracy, Effective Sample Size, and Rare Events
The practical question after computing weights is not only whether the estimator is consistent. It is whether the sample size used has produced many informative samples or only a small number of dominant weighted particles.
[definition: Normalized Importance Weights]
For $N\in\mathbb N$, define
\begin{align*}
W_N^+:=\left\{w=(w_1,\dots,w_N)\in[0,\infty)^N:\sum_{i=1}^Nw_i>0\right\}
\end{align*}
and
\begin{align*}
\Delta_{N-1}:=\left\{u\in[0,1]^N:\sum_{i=1}^N u_i=1\right\}.
\end{align*}
The normalized-weight map is
\begin{align*}
\mathcal N_N:W_N^+&\to\Delta_{N-1}, &
(w_1,\dots,w_N)&\mapsto(\bar w_1,\dots,\bar w_N),
\end{align*}
where
\begin{align*}
\bar w_i:=\frac{w_i}{\sum_{j=1}^Nw_j},\qquad i=1,\dots,N.
\end{align*}
[/definition]
Normalized weights turn a weighted average into an ordinary average over a discrete random measure. Degeneracy means that this random measure places most of its mass on a small fraction of the simulated points. To summarize that concentration by a single number, the next definition uses the inverse squared mass of the normalized weights.
[definition: Importance Sampling Effective Sample Size Diagnostic]
Given normalized weights $\bar w=(\bar w_1,\dots,\bar w_N)$ in the probability simplex
\begin{align*}
\Delta_{N-1}:=\left\{u\in[0,1]^N:\sum_{i=1}^N u_i=1\right\},
\end{align*}
the importance sampling effective sample size diagnostic is the map
\begin{align*}
\operatorname{ESS}:\Delta_{N-1}&\to[1,N], &
\bar w&\mapsto \frac{1}{\sum_{i=1}^N \bar w_i^2}.
\end{align*}
[/definition]
The diagnostic ranges between $1$ and $N$. Values near $N$ indicate nearly equal weights, while values near $1$ indicate that a single sample dominates the weighted empirical distribution. Since ESS is computed from realized weights, it should be read as a warning signal rather than as a replacement for variance analysis.
[remark: Effective Sample Size Is a Diagnostic]
The ESS formula is not a theorem giving the exact number of independent target samples. It is a scale diagnostic derived from the concentration of the normalized weights. It is most useful when compared across proposal choices or monitored over repeated runs.
[/remark]
The source of degeneracy is second-moment growth. For standard importance sampling, finite variance is the threshold for the usual normal approximation to Monte Carlo error. The next theorem records this central limit theorem so that failures of the second moment can be interpreted as failures of ordinary error certification.
[quotetheorem:521]
[citeproof:521]
The finite-variance hypothesis is exactly the condition behind the usual Monte Carlo error bar. It is stronger than unbiasedness and stronger than consistency of the sample mean under mere integrability. A concrete failure occurs when the target is a Pareto law with density $p(x)=\alpha x^{-\alpha-1}$ on $x\ge1$, the proposal is a lighter-tailed Pareto law $q(x)=\beta x^{-\beta-1}$ with $\beta>\alpha$, and $h(x)=1$. The likelihood ratio $L(x)=(\alpha/\beta)x^{\beta-\alpha}$ has $\mathbb E_{\mathbb Q}[L(Y)]=1$, so the estimator is unbiased for $1$, but $\mathbb E_{\mathbb Q}[L(Y)^2]=\infty$ whenever $\beta\ge2\alpha$. The estimator can then look stable for many samples and jump after a rare high-weight draw; in that regime the displayed normal approximation is not a reliable certification of uncertainty.
Rare-event problems put this issue at the centre: the event of interest has tiny probability, but conditional on the event the contribution is decisive. Ordinary absolute error can be misleading here, because an error of size $10^{-6}$ is negligible for a probability near one and catastrophic for a probability near $10^{-8}$.
The basic object must therefore be a sequence of estimation problems, not one fixed event. The definition below isolates the asymptotic regime in which the event probabilities tend to zero and relative error becomes the relevant measure of difficulty.
[definition: Rare-Event Probability]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. For each $n\ge1$, let $(E_n,\mathcal E_n)$ be a measurable space, let $X_n:(\Omega,\mathcal F)\to(E_n,\mathcal E_n)$ be a random variable, and let $B_n\in\mathcal E_n$ satisfy
\begin{align*}
A_n:=X_n^{-1}(B_n)\in\mathcal F, \qquad \mathbb P(A_n)\to0.
\end{align*}
Estimating $p_n=\mathbb P(A_n)$ is called rare-event probability estimation.
[/definition]
A crude Monte Carlo estimator for $p_n$ has variance
\begin{align*}
\frac{p_n(1-p_n)}{N}.
\end{align*}
Its relative variance is approximately
\begin{align*}
\frac{1}{Np_n}.
\end{align*}
When $p_n$ is tiny, relative accuracy requires an impractically large number of samples unless the proposal makes the event common. Exponential tilting is the canonical construction for light-tailed sums because it changes the mean while preserving a tractable likelihood ratio.
[definition: Exponential Tilting]
Let $X:(\Omega,\mathcal F,\mathbb P)\to(\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable with moment generating function $M_X(\lambda)=\mathbb E[e^{\lambda X}]$ finite on an open interval containing $0$. For such $\lambda$, the exponentially tilted probability measure $\mathbb P_\lambda$ on $(\Omega,\mathcal F)$ is defined by the density map
\begin{align*}
R_\lambda:\Omega&\to(0,\infty), & \omega&\mapsto \exp\left(\lambda X(\omega)-\log M_X(\lambda)\right),
\end{align*}
so that $d\mathbb P_\lambda/d\mathbb P=R_\lambda$.
[/definition]
Exponential tilting shifts probability mass toward large values when $\lambda>0$. For sums of independent variables, it preserves independence while changing the marginal distribution in a controlled way. The next theorem computes the likelihood ratio for the tilted product measure and gives the standard choice of the tilting parameter for an upper-tail event.
[quotetheorem:7212]
[citeproof:7212]
The moment-generating-function hypothesis is the light-tail assumption that makes exponential tilting available. For heavy-tailed distributions, such as Pareto laws, $M_X(\lambda)$ may be infinite for every positive $\lambda$, and this construction cannot be used to move the upper tail by a finite likelihood ratio of this form. The independence of the summands is also essential to the product-density formula; for dependent summands, the likelihood ratio must account for the joint law rather than multiply identical one-dimensional tilts. The mean-matching equation is a design rule rather than a full optimality theorem: it makes the rare threshold typical under the proposal, while variance still depends on the remaining spread of the likelihood ratio around that threshold. Thus the theorem does not say that every positive $\lambda$ is efficient, nor that tilting solves heavy-tailed rare-event problems; it sets up the light-tailed large-deviation construction used in the Gaussian example.
The theorem explains why tilting is effective for many light-tailed sums. It changes the sampling law so that the event is no longer rare, then pays a deterministic exponential likelihood-ratio penalty for samples that reach the event. This is the same mechanism that appears in large-deviation theory, where the parameter $\lambda$ selects the tilted law under which the rare empirical mean becomes typical.
The Gaussian case shows the computation without any hidden normalizing constant.
[example: Rare Event for a Sum]
Let $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}\mathcal N(0,1)$, write $S_n=X_1+\cdots+X_n$, and fix $a>0$. We estimate
\begin{align*}
p_n=\mathbb P(S_n\ge na).
\end{align*}
For one standard normal summand, the moment generating function is
\begin{align*}
M_X(\lambda)=\mathbb E[e^{\lambda X}]=\exp\left(\frac{\lambda^2}{2}\right).
\end{align*}
The exponentially tilted density with parameter $\lambda$ is therefore
\begin{align*}
f_\lambda(x)=\exp\left(\lambda x-\frac{\lambda^2}{2}\right)\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right).
\end{align*}
Combining the exponents gives
\begin{align*}
f_\lambda(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2-2\lambda x+\lambda^2}{2}\right).
\end{align*}
Since $x^2-2\lambda x+\lambda^2=(x-\lambda)^2$, this is
\begin{align*}
f_\lambda(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-\lambda)^2}{2}\right),
\end{align*}
so each tilted summand has distribution $\mathcal N(\lambda,1)$.
Choose $\lambda=a$. Under the proposal, $Y_1,\dots,Y_n$ are i.i.d. $\mathcal N(a,1)$, and
\begin{align*}
T_n:=Y_1+\cdots+Y_n\sim\mathcal N(na,n).
\end{align*}
Thus the event threshold is exactly the proposal mean:
\begin{align*}
\mathbb E_{\mathbb P_a}[T_n]=na.
\end{align*}
The product tilted density relative to the original product density is
\begin{align*}
\frac{d\mathbb P_a}{d\mathbb P}(Y_1,\dots,Y_n)=\prod_{i=1}^n \exp\left(aY_i-\frac{a^2}{2}\right).
\end{align*}
Multiplying the factors gives
\begin{align*}
\prod_{i=1}^n \exp\left(aY_i-\frac{a^2}{2}\right)=\exp\left(a\sum_{i=1}^nY_i-\frac{na^2}{2}\right).
\end{align*}
Since $\sum_{i=1}^nY_i=T_n$, the likelihood ratio from the proposal back to the original law is the reciprocal
\begin{align*}
L_n(T_n)=\frac{d\mathbb P}{d\mathbb P_a}(Y_1,\dots,Y_n)=\exp\left(-aT_n+\frac{na^2}{2}\right).
\end{align*}
The importance sampling estimator is therefore
\begin{align*}
\widehat p_{n,N}=\frac{1}{N}\sum_{j=1}^N\mathbb 1_{\{T_n^{(j)}\ge na\}}\exp\left(-aT_n^{(j)}+\frac{na^2}{2}\right),
\end{align*}
where $T_n^{(j)}$ are independent sums generated under the tilted law. Its expectation is
\begin{align*}
\mathbb E_{\mathbb P_a}\left[\mathbb 1_{\{T_n\ge na\}}L_n(T_n)\right]=\mathbb P(S_n\ge na)=p_n.
\end{align*}
Under the proposal,
\begin{align*}
\mathbb P_a(T_n\ge na)=\mathbb P\left(\frac{T_n-na}{\sqrt n}\ge0\right)=\mathbb P(Z\ge0)=\frac12
\end{align*}
for $Z\sim\mathcal N(0,1)$. Under the original law, $S_n\sim\mathcal N(0,n)$, so
\begin{align*}
p_n=\mathbb P\left(Z\ge a\sqrt n\right).
\end{align*}
The tilted proposal turns the rare upper-tail event into a probability-$1/2$ event, while the exponential likelihood ratio converts the average back to the original $\mathcal N(0,1)$ product law.
[/example]
The final lesson of the chapter is that proposal design must balance two requirements. The proposal must make important regions visible, and the likelihood ratio must not introduce larger variance than the original problem. Diagnostics such as ESS, repeated-run stability, and estimated second moments are therefore part of the method rather than optional afterthoughts.
Importance sampling frames simulation as a balance between visibility of the important regions and stability of the correction weights. Chapter 3 turns that perspective into concrete sampling schemes, showing how rejection, transformations, and other direct methods can produce random variables with prescribed distributions.
# 3. Rejection, Transformation, and Direct Sampling
This chapter turns the change-of-measure and unnormalized-density perspective of Chapter 2 into concrete mechanisms for producing random variables with prescribed distributions. The prerequisites are the language of probability spaces, densities with respect to a reference measure, cumulative distribution functions, change of variables, [conditional probability](/page/Conditional%20Probability), and the basic independence assumptions used in Monte Carlo simulation. The central question is algorithmic: given access to uniform random numbers, densities known up to a normalizing constant, or only pointwise evaluations of a target, how can we generate samples with the right law? We begin with direct transformation methods, move to rejection sampling and envelope design, and then use log-concavity and slice variables to connect exact sampling ideas with the Markov chain methods that follow later in the course.
## Transforming Uniform Random Variables
The first sampling problem asks how far a source of independent $\operatorname{Unif}(0,1)$ variables can take us. In one dimension, the cumulative distribution function turns the target distribution into a monotone coordinate system, so inversion of that coordinate system gives an exact sampling method.
[definition: Quantile Function]
Let $F: \mathbb R \to [0,1]$ be a distribution function. The quantile function associated with $F$ is
\begin{align*}
F^{-1}:(0,1)\to\mathbb R, \qquad
F^{-1}(u) := \inf\{x \in \mathbb R : F(x) \ge u\}.
\end{align*}
[/definition]
This definition uses the generalized inverse rather than an ordinary inverse because many distributions have jumps or flat parts. The next theorem is the basic justification for [inverse transform sampling](/theorems/1139): it says that applying this generalized inverse to a uniform variable exactly reconstructs the target law.
[quotetheorem:1139]
[citeproof:1139]
The result explains why [inverse transform sampling](/theorems/2000) is exact rather than approximate: all numerical error comes from computing the inverse or representing the uniform variable, not from the probabilistic transformation itself. The continuity hypothesis in the converse is essential. If $X$ is a point mass at $0$, then its distribution function jumps from $0$ to $1$, so $F(X)=1$ almost surely rather than uniform on $(0,1)$. Thus the theorem says that a generalized inverse always pushes a uniform variable forward to the target law, but it does not say that applying a discontinuous CDF to its own random variable produces a uniform variable. The simplest useful cases are distributions whose quantile function can be written explicitly.
[example: Exponential Sampling By Inversion]
Let $U\sim\operatorname{Unif}(0,1)$ and fix $\lambda>0$. For the exponential distribution with rate $\lambda$, the distribution function is
\begin{align*}
F(x)=1-e^{-\lambda x}\qquad x\ge 0.
\end{align*}
To invert this distribution function, fix $u\in(0,1)$ and solve $u=F(x)$:
\begin{align*}
u=1-e^{-\lambda x}.
\end{align*}
Subtracting $1$ from both sides gives
\begin{align*}
u-1=-e^{-\lambda x}.
\end{align*}
Multiplying by $-1$ gives
\begin{align*}
1-u=e^{-\lambda x}.
\end{align*}
Taking logarithms is valid because $1-u\in(0,1)$, and gives
\begin{align*}
\log(1-u)=-\lambda x.
\end{align*}
Dividing by $-\lambda$ gives
\begin{align*}
x=-\frac{1}{\lambda}\log(1-u).
\end{align*}
Thus the inverse-transform sample is
\begin{align*}
X=-\frac{1}{\lambda}\log(1-U).
\end{align*}
Since $\mathbb P(1-U\le t)=\mathbb P(U\ge 1-t)=t$ for every $t\in(0,1)$, the random variable $1-U$ is also $\operatorname{Unif}(0,1)$. Therefore implementations usually use the equivalent sampler
\begin{align*}
X=-\frac{1}{\lambda}\log U.
\end{align*}
This is the ideal inverse-transform case: an explicit quantile formula turns one uniform random variable into an exact exponential sample.
[/example]
Many important distributions do not have a closed-form quantile function. The Gamma family illustrates the boundary between direct inversion and more structured direct sampling, because integer-shape Gamma variables can be assembled from exponential variables.
[example: Sampling Gamma Variables]
For an integer shape parameter $k\in\mathbb N$ and rate $\lambda>0$, let $E_1,\dots,E_k$ be independent $\operatorname{Exp}(\lambda)$ variables generated by inversion, and define
\begin{align*}
X=\sum_{i=1}^k E_i.
\end{align*}
We show that $X\sim\operatorname{Gamma}(k,\lambda)$ by comparing moment generating functions. For one exponential summand and $t<\lambda$,
\begin{align*}
M_{E_i}(t)=\mathbb E[e^{tE_i}]=\int_0^\infty e^{tx}\lambda e^{-\lambda x}\,dx.
\end{align*}
Combining the exponential factors gives
\begin{align*}
M_{E_i}(t)=\lambda\int_0^\infty e^{-(\lambda-t)x}\,dx.
\end{align*}
Since $\lambda-t>0$,
\begin{align*}
\int_0^\infty e^{-(\lambda-t)x}\,dx=\frac{1}{\lambda-t}.
\end{align*}
Therefore
\begin{align*}
M_{E_i}(t)=\frac{\lambda}{\lambda-t}.
\end{align*}
Independence gives
\begin{align*}
M_X(t)=\mathbb E[e^{t(E_1+\cdots+E_k)}]=\mathbb E[e^{tE_1}\cdots e^{tE_k}].
\end{align*}
Factoring the expectation of a product of independent random variables,
\begin{align*}
M_X(t)=\prod_{i=1}^k \mathbb E[e^{tE_i}].
\end{align*}
Substituting the exponential moment generating function into each factor gives
\begin{align*}
M_X(t)=\prod_{i=1}^k \frac{\lambda}{\lambda-t}=\left(\frac{\lambda}{\lambda-t}\right)^k.
\end{align*}
For a $\operatorname{Gamma}(k,\lambda)$ density,
\begin{align*}
f(x)=\frac{\lambda^k}{(k-1)!}x^{k-1}e^{-\lambda x}, \qquad x>0.
\end{align*}
Its moment generating function is
\begin{align*}
\int_0^\infty e^{tx}\frac{\lambda^k}{(k-1)!}x^{k-1}e^{-\lambda x}\,dx=\frac{\lambda^k}{(k-1)!}\int_0^\infty x^{k-1}e^{-(\lambda-t)x}\,dx.
\end{align*}
With the substitution $y=(\lambda-t)x$, so $x=y/(\lambda-t)$ and $dx=dy/(\lambda-t)$, this becomes
\begin{align*}
\frac{\lambda^k}{(k-1)!}\int_0^\infty \left(\frac{y}{\lambda-t}\right)^{k-1}e^{-y}\frac{dy}{\lambda-t}.
\end{align*}
Collecting powers of $\lambda-t$ gives
\begin{align*}
\frac{\lambda^k}{(\lambda-t)^k}\frac{1}{(k-1)!}\int_0^\infty y^{k-1}e^{-y}\,dy.
\end{align*}
Using $\int_0^\infty y^{k-1}e^{-y}\,dy=(k-1)!$, the Gamma moment generating function is
\begin{align*}
\left(\frac{\lambda}{\lambda-t}\right)^k.
\end{align*}
The moment generating functions agree for every $t<\lambda$, an open interval containing $0$, so $X$ has the $\operatorname{Gamma}(k,\lambda)$ distribution by the uniqueness theorem for moment generating functions. Thus integer-shape Gamma sampling reduces to summing independently generated exponential variables; for non-integer shapes, production samplers usually use rejection, squeeze, or related specialised methods instead of direct inversion.
[/example]
The same transformation principle extends beyond one dimension, but it requires a triangular or otherwise invertible map whose Jacobian can be controlled. This is the multivariate version of changing variables in integration, and it turns sampler design into the construction of maps from a tractable reference distribution.
[quotetheorem:7213]
[citeproof:7213]
This formula gives both a sampler and a density calculation, but its hypotheses are doing real work. Bijectivity ensures that each output point has a single preimage; otherwise the density must sum contributions over all preimages, as happens for maps such as $z\mapsto z^2$. Differentiability and differentiability of the inverse supply the Jacobian factor that measures local volume distortion, while the open-set assumption keeps the usual change-of-variables theorem away from boundary pathologies. The formula therefore does not cover many-to-one transformations without modification, nor transformations with singular Jacobian on a set that carries positive probability. A canonical example is the conversion from independent uniforms to independent Gaussian variables through polar coordinates.
[example: Box Muller Transformation]
Let $U_1,U_2$ be independent $\operatorname{Unif}(0,1)$ variables, and define
\begin{align*}
R=\sqrt{-2\log U_1}.
\end{align*}
\begin{align*}
\Theta=2\pi U_2.
\end{align*}
\begin{align*}
Z_1=R\cos\Theta,\qquad Z_2=R\sin\Theta.
\end{align*}
We compute the joint density of $(Z_1,Z_2)$. First, for $s\ge 0$,
\begin{align*}
\mathbb P(R^2\le s)=\mathbb P(-2\log U_1\le s).
\end{align*}
Since $-2<0$, dividing by $-2$ reverses the inequality:
\begin{align*}
\mathbb P(-2\log U_1\le s)=\mathbb P(\log U_1\ge -s/2).
\end{align*}
Exponentiating both sides preserves the inequality because the exponential function is increasing:
\begin{align*}
\mathbb P(\log U_1\ge -s/2)=\mathbb P(U_1\ge e^{-s/2}).
\end{align*}
Because $U_1$ is uniform on $(0,1)$ and $e^{-s/2}\in(0,1]$,
\begin{align*}
\mathbb P(U_1\ge e^{-s/2})=1-e^{-s/2}.
\end{align*}
Thus $R^2$ has density $(1/2)e^{-s/2}$ on $(0,\infty)$. Equivalently, for $r>0$,
\begin{align*}
\mathbb P(R\le r)=\mathbb P(R^2\le r^2)=1-e^{-r^2/2}.
\end{align*}
Differentiating this distribution function gives the density of $R$:
\begin{align*}
f_R(r)=r e^{-r^2/2},\qquad r>0.
\end{align*}
Also, since $\Theta=2\pi U_2$, for $0<\theta<2\pi$,
\begin{align*}
\mathbb P(\Theta\le \theta)=\mathbb P(U_2\le \theta/(2\pi))=\frac{\theta}{2\pi}.
\end{align*}
Hence $\Theta$ has density $f_\Theta(\theta)=1/(2\pi)$ on $(0,2\pi)$. The variables $R$ and $\Theta$ are independent because $R$ is a function of $U_1$, $\Theta$ is a function of $U_2$, and $U_1,U_2$ are independent. Therefore
\begin{align*}
f_{R,\Theta}(r,\theta)=\frac{r}{2\pi}e^{-r^2/2},\qquad r>0,\ 0<\theta<2\pi.
\end{align*}
Now transform from polar coordinates to Cartesian coordinates by
\begin{align*}
z_1=r\cos\theta,\qquad z_2=r\sin\theta.
\end{align*}
Away from the origin and the chosen polar-coordinate branch cut, the inverse map is
\begin{align*}
r=\sqrt{z_1^2+z_2^2}.
\end{align*}
The inverse Jacobian matrix has entries
\begin{align*}
\frac{\partial r}{\partial z_1}=\frac{z_1}{r},\qquad \frac{\partial r}{\partial z_2}=\frac{z_2}{r},\qquad
\frac{\partial \theta}{\partial z_1}=-\frac{z_2}{r^2},\qquad \frac{\partial \theta}{\partial z_2}=\frac{z_1}{r^2}.
\end{align*}
Its determinant is
\begin{align*}
\frac{z_1}{r}\frac{z_1}{r^2}-\frac{z_2}{r}\left(-\frac{z_2}{r^2}\right)=\frac{z_1^2+z_2^2}{r^3}.
\end{align*}
Since $r^2=z_1^2+z_2^2$, this determinant equals
\begin{align*}
\frac{1}{r}.
\end{align*}
The boundary sets where polar coordinates are not one-to-one have [Lebesgue measure](/page/Lebesgue%20Measure) zero, so they do not affect the density. Applying the density transformation formula gives, for $(z_1,z_2)\ne(0,0)$,
\begin{align*}
f_{Z_1,Z_2}(z_1,z_2)=\frac{r}{2\pi}e^{-r^2/2}\cdot\frac{1}{r}.
\end{align*}
Cancelling the factor $r$ gives
\begin{align*}
f_{Z_1,Z_2}(z_1,z_2)=\frac{1}{2\pi}e^{-r^2/2}.
\end{align*}
Substituting $r^2=z_1^2+z_2^2$ yields
\begin{align*}
f_{Z_1,Z_2}(z_1,z_2)=\frac{1}{2\pi}\exp\left(-\frac{z_1^2+z_2^2}{2}\right).
\end{align*}
This is the standard bivariate Gaussian density with independent standard [normal coordinates](/theorems/2713), so the Box-Muller map turns two independent uniform variables into two independent Gaussian samples.
[/example]
## Rejection Sampling And Envelope Design
The next problem appears when inversion is unavailable but we can evaluate the target density up to a constant. Rejection sampling replaces direct transformation by a comparison argument: sample from an easier proposal, then keep points with a probability that corrects the proposal into the target.
[definition: Rejection Envelope]
Let $(E,\mathcal E,\mu)$ be a measure space. Let $p:E\to[0,\infty)$ be a target density and let $q:E\to[0,\infty)$ be a proposal density, both with respect to $\mu$. A constant $M\ge 1$ is an envelope constant for $(p,q)$ if
\begin{align*}
p(x) \le M q(x)
\end{align*}
for $\mu$-a.e. $x\in E$.
[/definition]
The envelope condition is a pointwise domination condition, but a sampler also needs a precise accept-reject rule that uses this domination without changing the retained law. This motivates defining the algorithm in terms of an independent proposal and an independent uniform threshold.
[definition: Rejection Sampling Algorithm]
Let $(E,\mathcal E,\mu)$ be a measure space. Given a target density $p:E\to[0,\infty)$, a proposal density $q:E\to[0,\infty)$, and an envelope constant $M$, the rejection sampling algorithm repeats the following steps until acceptance:
1. Draw $Y\sim q$ and $U\sim\operatorname{Unif}(0,1)$ independently.
2. Accept $X=Y$ if
\begin{align*}
U \le \frac{p(Y)}{M q(Y)},
\end{align*}
with the ratio interpreted only on the set where $q(Y)>0$.
3. Otherwise reject $Y$ and repeat.
[/definition]
The algorithm is simple, but its validity is not just a pointwise picture under two graphs. To use it inside a Monte Carlo estimator, we need to know the law of the accepted proposal and the distribution of the waiting time.
[quotetheorem:7214]
[citeproof:7214]
Thus the envelope constant is not a cosmetic tuning parameter: it is the reciprocal of the acceptance rate. The domination hypothesis is necessary, not merely convenient. If $q(x)=0$ on a set where $p(x)>0$, proposals can never produce target mass there; if $p/q$ is unbounded, no finite accept-reject threshold can keep all acceptance probabilities below $1$. The theorem also does not guarantee that a useful $M$ is known or computationally affordable. In high dimensions, even modest mismatch between $p$ and $q$ can make $M$ enormous, so concrete examples should always compute both the bound and the resulting efficiency.
[example: Rejection Sampling For A Beta Density]
Consider the target density $p(x)=6x(1-x)$ on $(0,1)$ and the uniform proposal density $q(x)=1$ on $(0,1)$. To find an envelope constant, first rewrite
\begin{align*}
x(1-x)=x-x^2.
\end{align*}
Completing the square gives
\begin{align*}
x-x^2=\frac{1}{4}-\left(x-\frac{1}{2}\right)^2.
\end{align*}
Since $\left(x-\frac{1}{2}\right)^2\ge 0$, we have
\begin{align*}
x(1-x)\le \frac{1}{4}.
\end{align*}
Multiplying by $6$ gives
\begin{align*}
p(x)=6x(1-x)\le \frac{6}{4}=\frac{3}{2}.
\end{align*}
Equality occurs at $x=1/2$, so $p(x)\le (3/2)q(x)$ for every $x\in(0,1)$, and we may take $M=3/2$.
For a proposal $Y\sim\operatorname{Unif}(0,1)$ and an independent $U\sim\operatorname{Unif}(0,1)$, the rejection ratio is
\begin{align*}
\frac{p(Y)}{M q(Y)}=\frac{6Y(1-Y)}{(3/2)\cdot 1}.
\end{align*}
Dividing by $3/2$ is the same as multiplying by $2/3$, so
\begin{align*}
\frac{6Y(1-Y)}{3/2}=6Y(1-Y)\cdot \frac{2}{3}.
\end{align*}
Multiplying the constants gives
\begin{align*}
6\cdot\frac{2}{3}=4.
\end{align*}
Thus the acceptance rule is
\begin{align*}
U\le 4Y(1-Y).
\end{align*}
The acceptance probability is
\begin{align*}
\frac{1}{M}=\frac{1}{3/2}=\frac{2}{3}.
\end{align*}
This example shows that a bounded density on a compact interval can be sampled from a uniform proposal once its maximum, and hence a valid envelope constant, is known.
[/example]
The previous example works because a single global maximum is cheap to compute. For sharper or unbounded targets, the proposal must be chosen so that its tails dominate the target and its shape keeps $M$ moderate; otherwise the envelope condition can fail even when the proposal has full support.
[remark: Tail Domination]
A proposal density $q$ can be used for rejection sampling only if $p/q$ is bounded on the support of $p$. In practice this means the proposal should have tails at least as heavy as the target. A Gaussian proposal for a Student-$t$ target fails this test because the ratio $p(x)/q(x)$ diverges as $|x|\to\infty$.
[/remark]
The tail condition is stated for a normalised target, but many applied targets are available only through an unnormalised density, especially in Bayesian computation where the posterior normalising constant is unknown. We therefore need an envelope notion that dominates the unnormalised density directly and separates exact sampling from estimation of the normalising constant.
[definition: Unnormalised Rejection Envelope]
Let $(E,\mathcal E,\mu)$ be a measure space. Let $\gamma:E\to[0,\infty)$ be an unnormalised target with normalising constant
\begin{align*}
Z=\int_E \gamma(x)\,d\mu(x).
\end{align*}
A proposal density $q:E\to[0,\infty)$ with respect to $\mu$ and a constant $C$ form an unnormalised envelope if
\begin{align*}
\gamma(x) \le C q(x)
\end{align*}
for $\mu$-a.e. $x\in E$.
[/definition]
The accepted law is then $p=\gamma/Z$, and the acceptance probability is $Z/C$. This is useful when $C$ is known but $Z$ is not, since the empirical acceptance rate estimates $Z/C$.
## Adaptive Envelopes And Log-Concavity
The remaining question is how to build good envelopes without solving a difficult global optimisation problem by hand. Log-concavity gives a powerful answer in one dimension because tangent lines to the log-density sit above the graph, producing exponential pieces that are easy to sample.
[definition: Log-Concave Density]
Let $I\subset\mathbb R$ be an interval. A density $p:I\to[0,\infty)$ is log-concave if there is a concave function $\ell:I\to[-\infty,\infty)$ such that
\begin{align*}
p(x)=\exp(\ell(x))
\end{align*}
for all $x\in I$.
[/definition]
Log-concavity is preserved by many common models, including Gaussian, exponential, Gamma densities with shape at least $1$, and many posterior densities with concave log-likelihood and log-prior. Its main computational benefit is geometric: concave functions lie below their tangent lines, which turns local derivative information into global upper bounds.
[quotetheorem:7215]
[citeproof:7215]
The theorem explains why tangent envelopes are valid, and it also shows why the assumptions are restrictive. Concavity is what makes tangent lines upper bounds; for a convex log-density such as $\ell(x)=x^2$ on a bounded interval, tangents lie below the graph and therefore cannot form a rejection envelope. Differentiability is used only to write tangent lines, and in nonsmooth log-concave cases the same idea can sometimes be recovered with supporting lines or subgradients. Non-log-concave targets may still be sampled by rejection methods, but this particular tangent construction no longer supplies a valid global envelope. Validity also requires more than pointwise domination: the exponential of the upper hull must have finite integral, because the next algorithm normalises that envelope into the proposal distribution used at each stage. The next result packages the envelope update, proposal sampling, and acceptance step into a single exact sampler for differentiable log-concave targets.
[quotetheorem:7216]
[citeproof:7216]
Adaptive rejection sampling is not an MCMC method: accepted draws are independent conditional on the evolving random envelope construction. The theorem depends on having a genuine upper hull at every stage. If log-concavity fails, a tangent line can pass below the log-density, so the ratio $\gamma(Y)/\exp(u(Y))$ may exceed $1$ and the rejection step is no longer meaningful. This is why mixtures of separated Gaussians, which are typically not log-concave, require different envelope constructions or Markov chain methods. The method belongs in this chapter because it is still exact rejection sampling, but it also prepares the geometric language used by later adaptive and Markov chain algorithms. A basic Gamma calculation shows how the log-concavity condition is checked in practice.
[example: Gamma Density With Shape At Least One]
Let $\gamma(x)=x^{\alpha-1}e^{-\lambda x}$ on $(0,\infty)$, where $\alpha\ge 1$ and $\lambda>0$. Since $x>0$, taking logarithms gives
\begin{align*}
\ell(x)=\log\gamma(x)=\log\left(x^{\alpha-1}e^{-\lambda x}\right)
\end{align*}
Using $\log(ab)=\log a+\log b$, $\log(x^{\alpha-1})=(\alpha-1)\log x$, and $\log(e^{-\lambda x})=-\lambda x$, this becomes
\begin{align*}
\ell(x)=(\alpha-1)\log x-\lambda x.
\end{align*}
Differentiate term by term on $(0,\infty)$:
\begin{align*}
\ell'(x)=\frac{\alpha-1}{x}-\lambda.
\end{align*}
Differentiating once more gives
\begin{align*}
\ell''(x)=-\frac{\alpha-1}{x^2}.
\end{align*}
Because $\alpha\ge 1$, we have $\alpha-1\ge 0$, and because $x>0$, we have $x^2>0$. Therefore
\begin{align*}
-\frac{\alpha-1}{x^2}\le 0.
\end{align*}
Thus $\ell$ is concave on $(0,\infty)$, so $\gamma=\exp(\ell)$ is log-concave. For any support point $s>0$, the tangent line $\ell(s)+\ell'(s)(x-s)$ is an upper bound for $\ell(x)$ by concavity, and exponentiating preserves the inequality. Hence adaptive rejection sampling can use these tangent lines as exponential envelope pieces.
If $0<\alpha<1$, then $\alpha-1<0$, so
\begin{align*}
\ell''(x)=-\frac{\alpha-1}{x^2}=\frac{1-\alpha}{x^2}>0.
\end{align*}
In that range the log-density is convex on $(0,\infty)$, so tangent lines lie below rather than above the log-density, and the tangent-envelope construction no longer gives a valid rejection envelope.
[/example]
## Slice Sampling As A Bridge To Markov Chains
Rejection sampling accepts or rejects proposals by introducing an auxiliary uniform variable under the graph of the target density. Slice sampling keeps that auxiliary-variable viewpoint but changes the computational question: instead of proposing a point and testing whether it lies under the graph, sample from horizontal slices of the graph.
[definition: Slice Variable]
Let $\gamma:E\to[0,\infty)$ be an unnormalised target density. A slice variable for a current point $x\in E$ is a random variable
\begin{align*}
U \mid X=x \sim \operatorname{Unif}(0,\gamma(x)).
\end{align*}
The associated slice at height $u$ is
\begin{align*}
S_u = \{x\in E : \gamma(x)\ge u\}.
\end{align*}
[/definition]
This definition turns sampling from $\gamma/Z$ into a question about a two-dimensional region under the graph of $\gamma$. The key issue is whether alternating between vertical and horizontal conditional moves preserves the desired marginal law. The following invariance theorem is needed because practical slice samplers will later replace independent output by a transition kernel whose correctness is measured through invariance.
[quotetheorem:7217]
[citeproof:7217]
This theorem is exact, but its assumptions are part of the algorithm rather than decorative measure theory. The finite integral $Z$ is needed so that the target marginal is a probability density, and measurable slices with finite positive reference measure are needed before a uniform slice conditional is defined. Exact full-conditional updates give Gibbs invariance; if the horizontal update is replaced by an approximate move that is not reversible or otherwise invariant for the uniform law on $S_u$, the target marginal need not be preserved. Practical univariate slice samplers use interval expansion and shrinkage to build a reversible transition on the slice rather than an arbitrary approximation, and multimodal targets show where the difficulty enters.
[example: Univariate Slice Sampling For A Bimodal Target]
Let
\begin{align*}
\gamma(x)=\frac{1}{2}e^{-(x+3)^2/2}+\frac{1}{2}e^{-(x-3)^2/2}, \qquad x\in\mathbb R.
\end{align*}
Given a current state $x$, draw $U\sim\operatorname{Unif}(0,\gamma(x))$ and define the horizontal slice
\begin{align*}
S_U=\{z\in\mathbb R:\gamma(z)\ge U\}.
\end{align*}
The two modes are centered near $-3$ and $3$. At these two points,
\begin{align*}
\gamma(-3)=\frac{1}{2}e^0+\frac{1}{2}e^{-(-6)^2/2}
=\frac{1}{2}+\frac{1}{2}e^{-18},
\end{align*}
and
\begin{align*}
\gamma(3)=\frac{1}{2}e^{-6^2/2}+\frac{1}{2}e^0
=\frac{1}{2}e^{-18}+\frac{1}{2}.
\end{align*}
At the midpoint between the modes,
\begin{align*}
\gamma(0)=\frac{1}{2}e^{-3^2/2}+\frac{1}{2}e^{-(-3)^2/2}
=\frac{1}{2}e^{-9/2}+\frac{1}{2}e^{-9/2}
=e^{-9/2}.
\end{align*}
Since $e^{-18}<1$, we have
\begin{align*}
\gamma(-3)=\gamma(3)=\frac{1}{2}+\frac{1}{2}e^{-18}>e^{-9/2}=\gamma(0).
\end{align*}
Thus a slice level $U$ with $0<U\le e^{-9/2}$ includes the midpoint $0$ and also includes neighborhoods of both modes. By continuity of $\gamma$, this gives a horizontal slice that can connect the two modal regions through the valley. In contrast, if
\begin{align*}
e^{-9/2}<U<\frac{1}{2}+\frac{1}{2}e^{-18},
\end{align*}
then $-3$ and $3$ belong to $S_U$, but $0$ does not. Therefore the slice has mass near both modes while excluding the point between them, so the relevant horizontal set is split into separated pieces. A stepping-out and shrinkage implementation that discovers only the local piece of $S_U$ can then move within one modal component without crossing the gap to the other, which is the mixing difficulty illustrated by this bimodal target.
[/example]
Slice sampling therefore sits at the boundary between direct sampling and MCMC. Its auxiliary-variable construction gives an exact invariant distribution, but its practical implementation often requires a transition kernel rather than independent draws.
[remark: Direct Sampling Versus Markov Chain Sampling]
Inverse transform sampling, transformation methods, and rejection sampling produce independent samples when run with independent random numbers. Slice sampling produces independent samples only when the slice conditional can be sampled exactly and independently of the past beyond the current slice height. In most nontrivial targets, slice sampling is used as a Markov chain update, so its correctness is expressed through invariance rather than through independent output.
[/remark]
The chapter's methods form a hierarchy. Inversion is best when a quantile is available; deterministic transformations are best when a reference distribution can be mapped to the target; rejection sampling is exact when a tight envelope is available; adaptive rejection sampling automates envelope construction under log-concavity; slice sampling keeps the same under-the-graph geometry but leads naturally to Markov chains. These ideas also reappear outside Monte Carlo integration: statistical simulation uses them as random variate generators, Bayesian computation uses the unnormalised-density versions, and stochastic modelling uses transformation methods to turn primitive noise sources into more structured random inputs.
Direct samplers solve the easiest version of the problem, but many targets cannot be sampled independently in this way. Chapter 4 therefore shifts to dependent simulation, where Markov chains are used to build a mechanism whose long-run behavior targets the desired distribution.
# 4. Markov Chains for Sampling
After Chapters 1--3 treated independent estimation, importance weighting, and exact direct samplers, the next question is how a Monte Carlo method can sample from a distribution when direct independent sampling is unavailable. The central idea of this chapter is to replace independent draws by a dependent sequence whose long-run behaviour is governed by the target distribution. This changes the analysis: we must understand when a Markov chain forgets its starting point, how fast it does so, and how dependence changes the variance of estimators. The chapter moves from structural conditions on Markov kernels, through convergence tools, to the limit theory behind Markov chain Monte Carlo output.
## Markov Kernels and Stationarity
The first question is how to describe a sampling algorithm that evolves by local random moves. Instead of specifying a full joint distribution for the path, we specify the conditional law of the next state given the present state.
[definition: Markov Kernel]
Let $(E, \mathcal E)$ be a measurable space. A Markov kernel on $E$ is a map $P: E \times \mathcal E \to [0,1]$ such that:
1. for each $x \in E$, $A \mapsto P(x,A)$ is a probability measure on $(E,\mathcal E)$;
2. for each $A \in \mathcal E$, $x \mapsto P(x,A)$ is $\mathcal E$-measurable.
[/definition]
Given a current state $x$, the measure $P(x,\cdot)$ is the distribution used to generate the next state. The $n$-step transition kernel $P^n$ is obtained by iterating the kernel, and it records the law of $X_n$ conditional on $X_0=x$.
[definition: Markov Chain With Kernel]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $(E,\mathcal E)$ be a measurable space, and let $P$ be a Markov kernel on $E$. A stochastic process $(X_n)_{n \ge 0}$ with $X_n:(\Omega,\mathcal F)\to(E,\mathcal E)$ is a Markov chain with transition kernel $P$ if, for all $A \in \mathcal E$ and all $n \ge 0$,
\begin{align*}
\mathbb P(X_{n+1} \in A \mid X_0,\dots,X_n) = P(X_n,A)
\end{align*}
whenever the conditional probability is defined.
[/definition]
For sampling, the kernel is designed so that a chosen distribution remains unchanged by one transition. This is the mathematical replacement for drawing independent samples from the target.
[definition: Invariant Distribution]
Let $P$ be a Markov kernel on $(E,\mathcal E)$. A probability measure $\pi$ on $(E,\mathcal E)$ is invariant for $P$ if
\begin{align*}
\pi(A) = \int_E P(x,A)\,d\pi(x)
\end{align*}
for every $A \in \mathcal E$.
[/definition]
If $X_0 \sim \pi$ and $\pi$ is invariant, then $X_n \sim \pi$ for every $n$. Invariance is the target-preservation property, but it can be hard to check directly because it requires integrating the whole transition kernel against $\pi$. We therefore look for a stronger balance condition that compares movement from one region to another with movement in the reverse direction.
[definition: Reversibility]
Let $P$ be a Markov kernel on $(E,\mathcal E)$ and let $\pi$ be a probability measure on $(E,\mathcal E)$. The kernel $P$ is reversible with respect to $\pi$ if the measure on $E \times E$ defined by
\begin{align*}
M(A \times B) = \int_A P(x,B)\,d\pi(x)
\end{align*}
is symmetric in $A$ and $B$ for all $A,B \in \mathcal E$.
[/definition]
Reversibility says that, at stationarity, the statistical flow from $A$ to $B$ matches the flow from $B$ to $A$. The next criterion explains why this flow identity is enough to preserve the target law.
[quotetheorem:7218]
[citeproof:7218]
This criterion is useful because detailed balance is local in the current and proposed states, while invariance is a global condition. The reversibility hypothesis is stronger than invariance: a deterministic cycle on three states with the uniform distribution is invariant but has a nonzero flow in one direction around the cycle and hence is not reversible. The theorem also does not say that an invariant distribution is unique or that the chain converges to it from arbitrary starts; reducible chains may preserve many stationary laws at once. Its role is therefore only the first design check for MCMC: once detailed balance identifies the target as stationary, irreducibility and recurrence must still rule out unreachable parts of the state space.
[example: Random Walk On A Finite Graph]
Let $G=(V,E)$ be a finite connected undirected graph, and define the random-walk transition probabilities by
\begin{align*}
P(x,y)=\frac{1}{\deg(x)}
\end{align*}
when $x$ is adjacent to $y$, and $P(x,y)=0$ otherwise. Since each edge contributes one degree to each of its two endpoints, the handshaking identity gives
\begin{align*}
\sum_{x\in V}\deg(x)=2|E|.
\end{align*}
Thus
\begin{align*}
\sum_{x\in V}\pi(x)=\sum_{x\in V}\frac{\deg(x)}{2|E|}=\frac{1}{2|E|}\sum_{x\in V}\deg(x)=1,
\end{align*}
so $\pi(x)=\deg(x)/(2|E|)$ is a probability distribution on $V$.
We verify stationarity by checking detailed balance. If $x$ and $y$ are adjacent, then
\begin{align*}
\pi(x)P(x,y)=\frac{\deg(x)}{2|E|}\cdot\frac{1}{\deg(x)}=\frac{1}{2|E|}.
\end{align*}
Because the graph is undirected, $y$ is adjacent to $x$, and therefore
\begin{align*}
\pi(y)P(y,x)=\frac{\deg(y)}{2|E|}\cdot\frac{1}{\deg(y)}=\frac{1}{2|E|}.
\end{align*}
If $x$ and $y$ are not adjacent, then $P(x,y)=0$ and $P(y,x)=0$, so
\begin{align*}
\pi(x)P(x,y)=0=\pi(y)P(y,x).
\end{align*}
Hence $\pi(x)P(x,y)=\pi(y)P(y,x)$ for every pair $x,y\in V$. Summing this identity over $x$ gives
\begin{align*}
\sum_{x\in V}\pi(x)P(x,y)=\sum_{x\in V}\pi(y)P(y,x)=\pi(y)\sum_{x\in V}P(y,x)=\pi(y),
\end{align*}
because $\sum_{x\in V}P(y,x)=1$. Thus $\pi$ is invariant. The invariant mass is proportional to degree, so the uniform distribution is invariant only when all vertices have the same degree, that is, when the graph is regular.
[/example]
The graph example verifies the target law for a connected graph, but it also exposes a separate issue: if the graph had two components, a chain started in one component would never enter the other. A usable sampler must be able to reach every region that matters under the target measure, so we need a condition that rules out inaccessible positive-mass sets.
[definition: Irreducibility]
Let $P$ be a Markov kernel on $(E,\mathcal E)$ and let $\varphi$ be a nonzero measure on $(E,\mathcal E)$. The kernel $P$ is $\varphi$-irreducible if, for every $A \in \mathcal E$ with $\varphi(A)>0$ and every $x \in E$, there exists $n \ge 1$ such that $P^n(x,A)>0$.
[/definition]
Irreducibility rules out closed regions that the sampler cannot leave or cannot enter. For finite chains, it says that every state communicates with every other state.
A second obstruction is periodic motion. A chain can visit all relevant states and still fail to settle at the target at individual times because it moves in cycles.
[definition: Aperiodicity]
For a countable-state irreducible Markov chain with transition matrix $P$, the period of a state $x$ is
\begin{align*}
d(x)=\gcd\{n \ge 1 : P^n(x,x)>0\}.
\end{align*}
The chain is aperiodic if $d(x)=1$ for every state $x$.
[/definition]
In irreducible countable chains all states have the same period, so it is enough to check one state. In practical MCMC algorithms, adding a positive probability of staying put often breaks periodicity.
For general state spaces, recurrence needs more care than repeated visits to a single point. Harris recurrence is the condition that sets of positive reference measure are eventually visited with probability one.
[definition: Harris Recurrence]
Let $P$ be a $\varphi$-irreducible Markov kernel on $(E,\mathcal E)$. The chain is Harris recurrent if, for every $A \in \mathcal E$ with $\varphi(A)>0$,
\begin{align*}
\mathbb P_x(\tau_A < \infty)=1
\end{align*}
for every $x \in E$, where $\tau_A=\inf\{n\ge 1:X_n\in A\}$.
[/definition]
Harris recurrence is the recurrence notion that supports strong laws for general-state-space MCMC. It says that the chain keeps returning to all statistically relevant regions rather than drifting away permanently.
## Convergence in Total Variation
Once a kernel has the right stationary distribution, the next question is how to measure convergence to it. For Monte Carlo, the relevant distance should control errors in expectations uniformly over bounded test functions.
[definition: Total Variation Distance]
Let $\mu$ and $\nu$ be probability measures on $(E,\mathcal E)$. Their total variation distance is
\begin{align*}
\|\mu-\nu\|_{\mathrm{TV}}=\sup_{A\in\mathcal E}|\mu(A)-\nu(A)|.
\end{align*}
[/definition]
Total variation convergence of $P^n(x,\cdot)$ to $\pi$ means that the distribution of the chain after $n$ steps is close to the target in a strong operational sense. It bounds the worst-case discrepancy over events, and hence controls bounded integrands.
Coupling gives a concrete way to prove total variation bounds. Instead of comparing two probability measures by sets, we build two random variables with the given laws and estimate the probability that they differ.
[definition: Coupling]
Let $\mu$ and $\nu$ be probability measures on $(E,\mathcal E)$. A coupling of $\mu$ and $\nu$ is a probability measure $\Gamma$ on $E\times E$ whose first marginal is $\mu$ and whose second marginal is $\nu$.
[/definition]
The value of a coupling is that equality of the two coupled variables forces agreement for every event. Good couplings turn convergence into a hitting-time problem.
[quotetheorem:7219]
[citeproof:7219]
The coupling bound reduces convergence to making two copies meet, but the quality of the bound depends entirely on the coupling chosen. If $Y$ and $Z$ are sampled independently, the probability $\mathbb P(Y\ne Z)$ may be close to one even when $\mu$ and $\nu$ are nearly identical, so the theorem gives a method rather than an automatic estimate. The inequality also controls total variation only through equality of the coupled variables; it does not by itself construct a coupling that meets quickly. This raises a new question: where can two copies be forced to share a common part of their transition law? We need a named class of sets that provide uniform overlap, because those sets will become the regeneration sites in convergence proofs.
[definition: Small Set]
Let $P$ be a Markov kernel on $(E,\mathcal E)$. A measurable set $C\in\mathcal E$ is small if there exist $m\ge 1$, $\varepsilon>0$, and a probability measure $\nu$ on $(E,\mathcal E)$ such that
\begin{align*}
P^m(x,A)\ge \varepsilon \nu(A)
\end{align*}
for every $x\in C$ and every $A\in\mathcal E$.
[/definition]
The small-set inequality means that, after $m$ steps from any point in $C$, part of the transition law is the same fixed measure $\nu$. This common part lets two chains coalesce with probability at least $\varepsilon$ whenever they simultaneously enter $C$. The remaining question is how to ensure that the chain keeps returning to such a set instead of drifting through regions with poor mixing.
[definition: Lyapunov Drift Condition]
Let $P$ be a Markov kernel on $(E,\mathcal E)$. A function $V:E\to[1,\infty)$ satisfies a geometric Lyapunov drift condition if there exist constants $\lambda\in(0,1)$, $b<\infty$, and a measurable set $C\in\mathcal E$ such that
\begin{align*}
PV(x):=\int_E V(y)P(x,dy) \le \lambda V(x)+b\mathbb{1}_C(x)
\end{align*}
for every $x\in E$.
[/definition]
The function $V$ measures distance from the well-behaved part of the space. The drift inequality says that outside $C$, the expected value of $V(X_{n+1})$ contracts by a fixed factor. We now need a theorem that converts these two ingredients, uniform local overlap and global return control, into an actual total variation convergence rate.
Standard Foster-Lyapunov criteria turn this local overlap plus global drift structure into geometric ergodicity: under the usual irreducibility, aperiodicity, Harris recurrence, small-set, and invariant-measure hypotheses, one obtains constants $\rho\in(0,1)$ and a finite function $M$ such that
\begin{align*}
\|P^n(x,\cdot)-\pi\|_{\mathrm{TV}} \le M(x)\rho^n.
\end{align*}
This criterion is one of the main practical bridges between algorithm design and convergence theory. Each hypothesis has a separate job: irreducibility prevents closed unreachable regions, aperiodicity prevents deterministic cycling, Harris recurrence prevents escape from the statistically relevant part of the space, the small set gives a uniform regeneration mechanism, and the drift inequality controls the return time to that mechanism. If irreducibility fails, two communicating classes can each satisfy a local drift inequality while the chain still cannot converge to a target putting mass on both classes; if the drift condition fails for a heavy-tailed random-walk sampler, returns to the central region may be too slow for geometric convergence. The criterion also does not identify the constants $M(x)$ and $\rho$ sharply, so it is mainly a qualitative certificate of geometric convergence rather than a practical runtime estimate.
[example: Autoregressive Gaussian Chain]
Let $E=\mathbb R$ and let
\begin{align*}
X_{n+1}=aX_n+\sigma Z_{n+1},
\end{align*}
where $|a|<1$, $\sigma>0$, and $Z_1,Z_2,\dots$ are i.i.d. standard normal random variables. Put
\begin{align*}
\tau^2=\frac{\sigma^2}{1-a^2}.
\end{align*}
If $X_n\sim\mathcal N(0,\tau^2)$ and $Z_{n+1}$ is independent of $X_n$, then $aX_n+\sigma Z_{n+1}$ is normal with mean
\begin{align*}
a\cdot 0+\sigma\cdot 0=0
\end{align*}
and variance
\begin{align*}
a^2\tau^2+\sigma^2=a^2\frac{\sigma^2}{1-a^2}+\sigma^2=\frac{a^2\sigma^2+\sigma^2(1-a^2)}{1-a^2}=\frac{\sigma^2}{1-a^2}=\tau^2.
\end{align*}
Thus $\mathcal N(0,\sigma^2/(1-a^2))$ is invariant.
Now take $V(x)=1+x^2$. Conditional on $X_n=x$, we have $X_{n+1}=ax+\sigma Z_{n+1}$, so
\begin{align*}
\mathbb E[V(X_{n+1})\mid X_n=x]=1+\mathbb E[(ax+\sigma Z_{n+1})^2].
\end{align*}
Expanding the square gives
\begin{align*}
\mathbb E[(ax+\sigma Z_{n+1})^2]=a^2x^2+2a\sigma x\,\mathbb E[Z_{n+1}]+\sigma^2\mathbb E[Z_{n+1}^2].
\end{align*}
Since $\mathbb E[Z_{n+1}]=0$ and $\mathbb E[Z_{n+1}^2]=1$,
\begin{align*}
\mathbb E[V(X_{n+1})\mid X_n=x]=1+a^2x^2+\sigma^2.
\end{align*}
Choose any $\lambda$ with $a^2<\lambda<1$, and choose $R>0$ such that
\begin{align*}
(\lambda-a^2)R^2\ge 1+\sigma^2-\lambda.
\end{align*}
Then for $|x|>R$,
\begin{align*}
1+a^2x^2+\sigma^2\le \lambda+\lambda x^2=\lambda V(x).
\end{align*}
On the compact set $C=[-R,R]$, the bound
\begin{align*}
1+a^2x^2+\sigma^2\le 1+a^2R^2+\sigma^2
\end{align*}
holds, so with $b=1+a^2R^2+\sigma^2$ we get
\begin{align*}
PV(x)\le \lambda V(x)+b\mathbb 1_C(x)
\end{align*}
for every $x\in\mathbb R$.
Compact intervals are also small. For $K<\infty$, the one-step transition density is
\begin{align*}
p(x,y)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-ax)^2}{2\sigma^2}\right).
\end{align*}
Define
\begin{align*}
q_K(y)=\inf_{|x|\le K}p(x,y).
\end{align*}
The function $q_K$ is positive for each $y$ and integrable, and
\begin{align*}
\varepsilon_K=\int_{\mathbb R}q_K(y)\,dy>0.
\end{align*}
If $\nu_K(A)=\varepsilon_K^{-1}\int_A q_K(y)\,dy$, then for every $x\in[-K,K]$ and measurable $A$,
\begin{align*}
P(x,A)=\int_A p(x,y)\,dy\ge \int_A q_K(y)\,dy=\varepsilon_K\nu_K(A).
\end{align*}
Thus compact intervals are small, and together with the displayed drift inequality this Gaussian autoregressive chain has the standard drift-and-minorization structure behind geometric ergodicity.
[/example]
Drift and small sets also help diagnose failures. If a proposed sampler has separate communicating classes, no drift argument can repair the fact that the target distribution is not reached from all starting points.
[example: Reducible Sampler Diagnosis]
Suppose the target distribution $\pi$ on $\mathbb R$ satisfies $\pi((-\infty,0))>0$ and $\pi((0,\infty))>0$. If the sampler updates by
\begin{align*}
X_{n+1}=R_{n+1}X_n
\end{align*}
where $R_{n+1}>0$ almost surely, then the sign of the state cannot change. Indeed, if $X_n>0$, then $R_{n+1}X_n>0$, so $X_{n+1}>0$; by induction, a chain started from $x>0$ satisfies
\begin{align*}
\mathbb P_x(X_n\in(-\infty,0))=0
\end{align*}
for every $n\ge 0$.
Therefore, for $A=(-\infty,0)$,
\begin{align*}
\|P^n(x,\cdot)-\pi\|_{\mathrm{TV}}\ge |P^n(x,A)-\pi(A)|=|0-\pi(A)|=\pi(A)>0
\end{align*}
for every $n$ whenever $x>0$. Similarly, if $x<0$, then $X_n<0$ for every $n$, and taking $A=(0,\infty)$ gives
\begin{align*}
\|P^n(x,\cdot)-\pi\|_{\mathrm{TV}}\ge |P^n(x,A)-\pi(A)|=|0-\pi(A)|=\pi(A)>0.
\end{align*}
Thus neither starting side can converge in total variation to the full target distribution. The trace plot may look stable inside one half-line, but the empirical distribution is confined to a closed class and misses a positive-mass part of $\pi$.
[/example]
## Markov Chain Monte Carlo Estimators
The final question is how dependence changes Monte Carlo estimation. Once a chain has invariant distribution $\pi$, the natural estimator of $\pi f=\int_E f\,d\pi$ is still an average, but its variance reflects serial correlation.
[definition: Markov Chain Monte Carlo Estimator]
Let $(X_n)_{n\ge0}$ be a Markov chain with invariant distribution $\pi$, and let $f:E\to\mathbb R$ be $\pi$-integrable. The Markov chain Monte Carlo estimator based on $N$ samples is
\begin{align*}
\hat{\pi}_N(f)=\frac{1}{N}\sum_{n=0}^{N-1}f(X_n).
\end{align*}
[/definition]
This estimator has the same form as the independent Monte Carlo average, but the justification is different because the summands are dependent. The key question is whether one long dependent trajectory can replace independent samples. The theorem below gives the consistency statement needed for MCMC estimation.
[quotetheorem:2226]
[citeproof:2226]
The ergodic theorem justifies consistency, but each hypothesis matters. Without positive recurrence, as for the simple symmetric random walk on $\mathbb Z$, there is no invariant probability distribution to average against; without Harris recurrence, a chain may preserve $\pi$ while a start in the wrong closed class never sees the whole target. The theorem also says nothing about the rate of convergence of the average or the size of its random fluctuations. This raises the next question: what variance should appear in a confidence interval when the same path contributes correlated observations? We therefore define the long-run variance that combines the ordinary variance with all serial covariance terms.
[definition: Asymptotic Variance]
Assume $(X_n)_{n\ge0}$ is stationary with invariant distribution $\pi$ and let $f\in L^2(\pi)$. When the series is well-defined, the asymptotic variance of $f$ is
\begin{align*}
\sigma_f^2=\operatorname{Var}_\pi(f(X_0))+2\sum_{k=1}^{\infty}\operatorname{Cov}_\pi(f(X_0),f(X_k)).
\end{align*}
[/definition]
The extra covariance terms are the price of dependence. Positive autocorrelation increases Monte Carlo error, while negative autocorrelation can reduce it. Once the long-run variance is identified, the next goal is a Gaussian approximation for the centred and scaled estimator.
[quotetheorem:7220]
[citeproof:7220]
This theorem is the basis for Monte Carlo standard errors from correlated output, but it is more delicate than the ergodic theorem. Geometric ergodicity alone does not control heavy-tailed observables, and the $L^{2+\delta}(\pi)$ condition is one concrete way to prevent rare extreme values from destroying the Gaussian limit. Conversely, a chain may satisfy a law of large numbers while failing this CLT, so consistency of the MCMC average is not enough for confidence intervals. Estimating $\sigma_f^2$ usually requires batch means, spectral variance estimators, or autocorrelation-window methods rather than the independent-sample variance formula, connecting the theorem to martingale approximation and the functional-analytic Poisson equation behind the proof.
[example: Effective Sample Size For A Correlated Chain]
Suppose a stationary scalar output has autocorrelation function approximately $\rho_k=r^k$ with $0<r<1$. The integrated autocorrelation factor is the ordinary variance contribution plus twice the positive-lag autocorrelation contribution:
\begin{align*}
1+2\sum_{k=1}^{\infty}\rho_k=1+2\sum_{k=1}^{\infty}r^k.
\end{align*}
Since $0<r<1$, the geometric series satisfies
\begin{align*}
\sum_{k=1}^{\infty}r^k=\frac{r}{1-r}.
\end{align*}
Substituting this into the integrated autocorrelation factor gives
\begin{align*}
1+2\sum_{k=1}^{\infty}r^k=1+\frac{2r}{1-r}.
\end{align*}
Putting the two terms over the common denominator $1-r$ gives
\begin{align*}
1+\frac{2r}{1-r}=\frac{1-r}{1-r}+\frac{2r}{1-r}.
\end{align*}
Adding the numerators yields
\begin{align*}
\frac{1-r}{1-r}+\frac{2r}{1-r}=\frac{1-r+2r}{1-r}.
\end{align*}
Combining like terms in the numerator gives
\begin{align*}
\frac{1-r+2r}{1-r}=\frac{1+r}{1-r}.
\end{align*}
Thus the variance inflation factor is approximately $(1+r)/(1-r)$. Matching the variance of $N$ correlated draws to the variance of $N_{\mathrm{eff}}$ independent draws gives
\begin{align*}
N_{\mathrm{eff}}\approx \frac{N}{(1+r)/(1-r)}.
\end{align*}
Dividing by a fraction means multiplying by its reciprocal, so
\begin{align*}
\frac{N}{(1+r)/(1-r)}=N\frac{1-r}{1+r}.
\end{align*}
Therefore the effective sample size is approximately $N(1-r)/(1+r)$. When $r$ is close to $1$, the denominator $1+r$ is near $2$ while the numerator $1-r$ is small, so many iterations can contain much less independent information than their raw count suggests.
[/example]
The practical lesson is that an MCMC run must be assessed at two levels. Structural conditions such as irreducibility and recurrence ask whether the target can be reached at all; quantitative conditions such as drift, small sets, and autocorrelation ask how much computation is required for reliable estimation.
Markov chains provide the general framework, but the practical challenge is to make the target invariant while still moving through the space efficiently. Chapter 5 develops Metropolis-Hastings as the canonical solution, combining proposal mechanisms with acceptance correction to recover the right stationary distribution.
# 5. Metropolis-Hastings Algorithms
Metropolis-Hastings algorithms turn a distribution known only up to a normalising constant into a Markov chain whose long-run averages approximate expectations under that distribution. Chapters 1 and 2 developed independent Monte Carlo and importance sampling, and Chapter 4 introduced Markov kernels and invariance; here those ideas are combined into a concrete recipe for constructing target-preserving chains. This chapter changes the viewpoint: instead of trying to sample the target distribution directly, we build a local stochastic dynamics that preserves it.
The central questions are practical as well as theoretical. How can we construct a transition kernel with a prescribed invariant distribution? How should a proposal be scaled so that the chain moves often enough but still travels far enough? How do blocking, componentwise moves, noisy likelihood estimates, and early rejection schemes alter the basic algorithm?
## From Proposals to Metropolis-Hastings Kernels
A common applied difficulty is that the target density can be evaluated only up to proportionality. In Bayesian computation, for instance, the posterior density has the form
\begin{align*}
\pi(\theta \mid y) \propto L(y \mid \theta)p(\theta),
\end{align*}
where the normalising constant is the marginal likelihood. This motivates isolating exactly what information about the target is needed by the algorithm.
[definition: Target Density Known Up To Proportionality]
Let $(E, \mathcal E)$ be a measurable space with reference measure $\lambda$. A target distribution $\pi$ is known up to proportionality if there is a measurable function $\gamma:E\to[0,\infty)$ such that
\begin{align*}
0 < Z := \int_E \gamma(x)\,d\lambda(x) < \infty,
\end{align*}
and
\begin{align*}
\pi(A) = \frac{1}{Z}\int_A \gamma(x)\,d\lambda(x)
\end{align*}
for every $A\in\mathcal E$.
[/definition]
The function $\gamma$ is enough for ratio-based algorithms because the unknown factor $Z$ cancels. The next problem is how to suggest candidate states in a way that can depend on the current state, which motivates the proposal kernel.
[definition: Proposal Kernel]
Let $(E,\mathcal E)$ be a measurable space. A proposal kernel is a Markov kernel $Q$ on $E$, so that $Q(x,A)$ is a probability measure in $A$ for each $x\in E$ and is measurable in $x$ for each $A\in\mathcal E$.
[/definition]
When $E\subseteq\mathbb R^d$ and $Q(x,dy)$ has a density $q(x,y)$ with respect to Lebesgue measure, the proposal density describes how likely the algorithm is to suggest $y$ from $x$. The next task is to decide whether the proposed move should be used, and this motivates an acceptance probability that corrects for both target density and proposal asymmetry.
[definition: Metropolis-Hastings Acceptance Probability]
Let $\pi$ have unnormalised density $\gamma$ with respect to $\lambda$, and let $Q(x,dy)=q(x,y)\,d\lambda(y)$ be a proposal kernel. The Metropolis-Hastings acceptance probability is the map $\alpha:E\times E\to[0,1]$ given by
\begin{align*}
\alpha(x,y) = \min\left\{1, \frac{\gamma(y)q(y,x)}{\gamma(x)q(x,y)}\right\}
\end{align*}
whenever $\gamma(x)q(x,y)>0$, with any value in $[0,1]$ assigned on null cases that do not affect the transition kernel.
[/definition]
The ratio contains two corrections. The factor $\gamma(y)/\gamma(x)$ prefers moves toward high target density, while $q(y,x)/q(x,y)$ corrects for asymmetry in the proposal. This motivates the following definition, which turns the accept-reject rule into a Markov kernel by including the probability of staying put after rejection.
[definition: Metropolis-Hastings Kernel]
Under the assumptions above, the Metropolis-Hastings kernel $P$ is
\begin{align*}
P(x,A) = \int_A \alpha(x,y)q(x,y)\,d\lambda(y) + r(x)\mathbb{1}_A(x),
\end{align*}
where
\begin{align*}
r(x) = 1 - \int_E \alpha(x,y)q(x,y)\,d\lambda(y).
\end{align*}
[/definition]
The second term records rejected proposals: if the proposed move is rejected, the chain stays at its current state. This self-loop is not a numerical nuisance; it is the piece that completes the transition kernel. The immediate question is whether the completed kernel has the intended invariant distribution.
[quotetheorem:7221]
[citeproof:7221]
Detailed balance is stronger than invariance and is the standard design principle for this chapter. It guarantees that the algorithm has the right stationary distribution, but convergence from a starting point still depends on irreducibility, aperiodicity, and mixing. The following example shows how the ratio is used in a posterior computation where the normalising constant is unavailable.
[example: Bayesian Logistic Regression Posterior]
For binary observations $y_i\in\{0,1\}$ with covariates $x_i\in\mathbb R^d$, write $p_i(\beta)=\mathbb P(Y_i=1\mid\beta)$. Then
\begin{align*}
p_i(\beta)=\frac{\exp(x_i\cdot\beta)}{1+\exp(x_i\cdot\beta)}
\end{align*}
and hence
\begin{align*}
1-p_i(\beta)=1-\frac{\exp(x_i\cdot\beta)}{1+\exp(x_i\cdot\beta)}=\frac{1}{1+\exp(x_i\cdot\beta)}.
\end{align*}
Since $y_i$ is binary, the likelihood contribution of observation $i$ is
\begin{align*}
p_i(\beta)^{y_i}\{1-p_i(\beta)\}^{1-y_i}=\left(\frac{\exp(x_i\cdot\beta)}{1+\exp(x_i\cdot\beta)}\right)^{y_i}\left(\frac{1}{1+\exp(x_i\cdot\beta)}\right)^{1-y_i}.
\end{align*}
With the Gaussian prior $\beta\sim\mathcal N(0,\tau^2I_d)$, its density is a constant times $\exp(-|\beta|^2/(2\tau^2))$. Therefore the posterior density is proportional to the likelihood times the prior factor:
\begin{align*}
\gamma(\beta)=\prod_{i=1}^n \left(\frac{\exp(x_i\cdot\beta)}{1+\exp(x_i\cdot\beta)}\right)^{y_i}\left(\frac{1}{1+\exp(x_i\cdot\beta)}\right)^{1-y_i}\exp\left(-\frac{|\beta|^2}{2\tau^2}\right).
\end{align*}
For a random-walk Metropolis proposal $\beta'=\beta+\sigma Z$ with $Z\sim\mathcal N(0,I_d)$, the proposal density has the form
\begin{align*}
q(\beta,\beta')=(2\pi\sigma^2)^{-d/2}\exp\left(-\frac{|\beta'-\beta|^2}{2\sigma^2}\right).
\end{align*}
Because $|\beta'-\beta|^2=|\beta-\beta'|^2$, we have $q(\beta,\beta')=q(\beta',\beta)$. The Metropolis-Hastings acceptance probability is therefore
\begin{align*}
\alpha(\beta,\beta')=\min\left\{1,\frac{\gamma(\beta')q(\beta',\beta)}{\gamma(\beta)q(\beta,\beta')}\right\}=\min\left\{1,\frac{\gamma(\beta')}{\gamma(\beta)}\right\}.
\end{align*}
If the true posterior density is $\pi(\beta\mid y)=\gamma(\beta)/Z$, then the same ratio is
\begin{align*}
\frac{\pi(\beta'\mid y)}{\pi(\beta\mid y)}=\frac{\gamma(\beta')/Z}{\gamma(\beta)/Z}=\frac{\gamma(\beta')}{\gamma(\beta)}.
\end{align*}
Thus posterior mean estimation can use the Markov-chain average of the sampled $\beta$ values without ever evaluating the marginal likelihood $Z$.
[/example]
The example used a symmetric local proposal, so the proposal-density correction disappeared. This motivates naming the resulting special case, since it is the baseline algorithm against which many later variants are compared.
[definition: Random-Walk Metropolis Algorithm]
Let $E=\mathbb R^d$, let $\gamma$ be an unnormalised target density, and let $\varepsilon_1,\varepsilon_2,\dots$ be i.i.d. from a density $s$ satisfying $s(z)=s(-z)$. Given $X_k=x$, propose $Y=x+\varepsilon_{k+1}$ and accept it with probability
\begin{align*}
\alpha(x,Y)=\min\left\{1,\frac{\gamma(Y)}{\gamma(x)}\right\}.
\end{align*}
If the proposal is accepted set $X_{k+1}=Y$; otherwise set $X_{k+1}=x$.
[/definition]
The symmetry assumption removes the proposal-density correction. The resulting chain is easy to code, but its efficiency depends strongly on the scale and covariance structure of $s$. A correlated Gaussian target displays the geometric issue.
[example: Random-Walk Metropolis for a Correlated Gaussian]
Let $\Sigma$ be the $2\times 2$ covariance matrix with diagonal entries $1$ and off-diagonal entries $0.99$. The unit vectors $e_+=(1,1)/\sqrt 2$ and $e_-=(1,-1)/\sqrt 2$ are eigenvectors because
\begin{align*}
\Sigma e_+=1.99e_+.
\end{align*}
Also,
\begin{align*}
\Sigma e_-=0.01e_-.
\end{align*}
Thus the target has standard deviation $\sqrt{1.99}$ along the long direction $e_+$ and standard deviation $\sqrt{0.01}=0.1$ along the narrow direction $e_-$.
For the isotropic proposal $Y=X+\sigma Z$ with $Z\sim\mathcal N(0,I_2)$, write $\Delta=Y-X$. In the rotated coordinates $\Delta_+=\Delta\cdot e_+$ and $\Delta_-=\Delta\cdot e_-$, orthogonal invariance of the standard Gaussian gives
\begin{align*}
(\Delta_+,\Delta_-)\sim\mathcal N(0,\sigma^2I_2).
\end{align*}
So the same proposal scale $\sigma$ is used in both target directions. If $\sigma$ is chosen comparable to the long-axis scale $\sqrt{1.99}$, then the proposal standard deviation in the narrow coordinate is still $\sigma$, while the target standard deviation there is only $0.1$. Such proposals often move several narrow-axis standard deviations at once, placing $Y$ in a much lower-density part of the Gaussian target and causing rejection.
Now take $Y=X+\sigma LZ$, where $LL^\top=\Sigma$. Since $Z\sim\mathcal N(0,I_2)$, we have $\operatorname{Cov}(Z)=I_2$, and therefore
\begin{align*}
\operatorname{Cov}(Y-X)=\operatorname{Cov}(\sigma LZ)=\sigma^2L\operatorname{Cov}(Z)L^\top=\sigma^2LL^\top=\sigma^2\Sigma.
\end{align*}
In the eigenbasis $\{e_+,e_-\}$, the proposal variance is therefore $\sigma^2\cdot1.99$ along $e_+$ and $\sigma^2\cdot0.01$ along $e_-$. The covariance-aligned proposal is long where the target is long and short where the target is narrow, so accepted moves can travel along the high-probability ridge instead of repeatedly jumping across it.
[/example]
Local proposals are not the only option. If a global approximation to the target is available, the next natural idea is to propose independently from that approximation and correct the bias through Metropolis-Hastings.
[definition: Independence Metropolis-Hastings Sampler]
Let $Q(x,dy)=g(y)\,d\lambda(y)$ for a proposal density $g$ independent of $x$. The independence Metropolis-Hastings acceptance probability is
\begin{align*}
\alpha(x,y)=\min\left\{1,\frac{\gamma(y)g(x)}{\gamma(x)g(y)}\right\}.
\end{align*}
[/definition]
The independence sampler can mix very fast if $g$ is a good global approximation. Its failure mode is severe when the proposal puts too little mass in regions where the target has substantial mass, as the next example illustrates.
[example: Independence Sampler Failure Under Light-Tailed Proposals]
Take the unnormalised target density $\gamma(x)=(1+x^2)^{-1}$ and the independence proposal density
\begin{align*}
g(x)=(2\pi)^{-1/2}\exp(-x^2/2).
\end{align*}
For an independence proposal from current state $x$ to proposed state $y$, the acceptance probability is
\begin{align*}
\alpha(x,y)=\min\left\{1,\frac{\gamma(y)g(x)}{\gamma(x)g(y)}\right\}.
\end{align*}
The ratio can be written using the weight $\gamma/g$. Since
\begin{align*}
\frac{\gamma(x)}{g(x)}=\frac{(1+x^2)^{-1}}{(2\pi)^{-1/2}\exp(-x^2/2)}=\sqrt{2\pi}\frac{\exp(x^2/2)}{1+x^2},
\end{align*}
we have $\gamma(x)/g(x)\to\infty$ as $|x|\to\infty$, because the exponential factor $\exp(x^2/2)$ grows faster than the polynomial factor $1+x^2$.
Now suppose the chain is at a large positive or negative value $x$, and the independence proposal suggests a return toward the centre, say $|y|\le 1$. Then
\begin{align*}
\frac{\gamma(y)g(x)}{\gamma(x)g(y)}=\frac{1+x^2}{1+y^2}\exp\left(\frac{y^2-x^2}{2}\right).
\end{align*}
Because $1+y^2\ge 1$ and $y^2\le 1$, this gives the upper bound
\begin{align*}
\frac{\gamma(y)g(x)}{\gamma(x)g(y)}\le (1+x^2)\exp\left(\frac{1-x^2}{2}\right).
\end{align*}
The right-hand side tends to $0$ as $|x|\to\infty$, so proposals back toward the centre are accepted with very small probability once the chain has reached the target's heavy tail. The failure is caused by the light-tailed normal proposal: it does not put enough mass in the tail relative to the Cauchy-like target, so the independence sampler can become stuck for long stretches at large $|x|$.
[/example]
## Scaling, Acceptance, and Autocorrelation
After detailed balance settles correctness, the next question is efficiency. A Metropolis-Hastings chain can have the right invariant distribution and still produce poor estimates if it moves too slowly, rejects too often, or wanders by tiny accepted steps.
[definition: Acceptance Rate]
For a Metropolis-Hastings chain with proposals $Y_k$ and accept indicators $A_k=\mathbb{1}_{\{X_{k+1}=Y_k\}}$, the empirical acceptance rate over $N$ iterations is
\begin{align*}
\hat a_N = \frac{1}{N}\sum_{k=0}^{N-1} A_k.
\end{align*}
[/definition]
Acceptance rate is a diagnostic, not an objective by itself. A chain can accept nearly every proposal by taking tiny steps, but then successive states are almost identical. This motivates measuring dependence between successive values of the actual quantity being estimated.
[definition: Autocorrelation Function]
Let $(X_k)_{k\ge0}$ be a stationary Markov chain and let $h\in L^2(\pi)$. The lag-$\ell$ autocorrelation of $h(X_k)$ is
\begin{align*}
\rho_h(\ell)=\frac{\operatorname{Cov}(h(X_0),h(X_\ell))}{\operatorname{Var}(h(X_0))},
\end{align*}
provided $\operatorname{Var}(h(X_0))>0$.
[/definition]
Autocorrelation measures how much new information each iteration adds for a specific estimand. Since Monte Carlo error accumulates dependence across all lags, the next diagnostic sums the autocorrelation sequence into a single variance inflation factor.
[definition: Integrated Autocorrelation Time]
Let $(X_k)_{k\ge0}$ be stationary and let $h\in L^2(\pi)$ have positive variance. If the series converges, the integrated autocorrelation time is
\begin{align*}
\tau_{\mathrm{int}}(h)=1+2\sum_{\ell=1}^{\infty}\rho_h(\ell).
\end{align*}
[/definition]
For the Monte Carlo average $N^{-1}\sum_{k=0}^{N-1}h(X_k)$, the asymptotic variance is inflated by $\tau_{\mathrm{int}}(h)$ relative to independent sampling when a Markov chain central limit theorem applies. This motivates effective sample size estimates of the form $N/\tau_{\mathrm{int}}(h)$. To compare kernels before running long simulations, it is useful to know when one reversible chain is uniformly better than another.
[quotetheorem:7222]
[citeproof:7222]
Peskun ordering formalises the intuition that avoidable rejections are costly. The finite state-space hypothesis keeps the comparison inside finite-dimensional reversible operator theory; on general state spaces, analogous results require extra measure-theoretic hypotheses and compare transition densities or kernels off the diagonal rather than simple matrix entries. It does not mean that the largest proposal scale is best, because increasing proposal scale can also lower off-diagonal transition probabilities through rejection. This tension motivates an asymptotic calculation for the random-walk scale in high-dimensional product targets.
[quotetheorem:7223]
[citeproof:7223]
The $0.234$ heuristic is most relevant to high-dimensional random-walk proposals that update many weakly dependent coordinates at once. In low dimension, for strongly constrained targets, or for algorithms using gradients, different acceptance ranges can be appropriate. The next example translates the theorem into the tuning tradeoff seen in practice.
[example: Scaling a Gaussian Random Walk]
For a $d$-dimensional target with approximately unit marginal variances, compare Gaussian random-walk proposals
\begin{align*}
Y=X+\sigma Z,\qquad Z\sim\mathcal N(0,I_d).
\end{align*}
The proposed displacement is $\Delta=Y-X=\sigma Z$, so
\begin{align*}
\mathbb E\lVert \Delta\rVert^2=\mathbb E\lVert \sigma Z\rVert^2=\sigma^2\mathbb E\sum_{j=1}^d Z_j^2=\sigma^2\sum_{j=1}^d\mathbb E Z_j^2=d\sigma^2.
\end{align*}
If $\sigma=o(d^{-1/2})$, then $d\sigma^2=o(1)$, so accepted moves have vanishing squared jumping distance. This explains why very small scales can have high acceptance while still exploring slowly.
To see the opposite effect, take the product standard normal target as the unit-variance benchmark, with unnormalised density
\begin{align*}
\gamma(x)=\exp\left(-\frac{\lVert x\rVert^2}{2}\right).
\end{align*}
For a proposed move $y=x+\sigma z$, the log target ratio is
\begin{align*}
\log\frac{\gamma(y)}{\gamma(x)}=-\frac{\lVert x+\sigma z\rVert^2}{2}+\frac{\lVert x\rVert^2}{2}.
\end{align*}
Expanding the square gives
\begin{align*}
\lVert x+\sigma z\rVert^2=\lVert x\rVert^2+2\sigma x\cdot z+\sigma^2\lVert z\rVert^2.
\end{align*}
Substituting this expansion yields
\begin{align*}
\log\frac{\gamma(y)}{\gamma(x)}=-\sigma x\cdot z-\frac{\sigma^2}{2}\lVert z\rVert^2.
\end{align*}
Under stationarity and for typical $z$, $\lVert z\rVert^2$ is of order $d$, so the deterministic penalty term has size about $\sigma^2 d/2$. If $\sigma$ is much larger than $d^{-1/2}$, this penalty is large and proposals often land in much lower-density regions, so the Metropolis acceptance probability is small.
The scaling
\begin{align*}
\sigma=\frac{\ell}{\sqrt d}
\end{align*}
keeps the mean squared proposal length equal to
\begin{align*}
d\sigma^2=d\frac{\ell^2}{d}=\ell^2.
\end{align*}
Thus the proposal neither shrinks to a point nor makes order-$\sqrt d$ jumps across the target, which is the balance captured by the product-target optimal-scaling regime.
[/example]
## Blocking, Componentwise Updates, and Variants
The basic Metropolis-Hastings template leaves open how much of the state to update at once and how expensive each proposed move should be. In hierarchical models, latent-variable models, and inverse problems, a single global proposal may be ineffective because different coordinates have different scales or conditional dependencies.
[definition: Blocked Metropolis-Hastings Update]
Let the state decompose as $x=(x_B,x_{-B})$ for a block of coordinates $B$. A blocked Metropolis-Hastings update proposes $y_B$ from a kernel $Q_B((x_B,x_{-B}),dy_B)$, sets $y=(y_B,x_{-B})$, and applies the Metropolis-Hastings acceptance rule using the full target density on $x$ and $y$.
[/definition]
Blocking lets the algorithm match posterior dependence among selected coordinates. Large blocks can exploit correlation, while small blocks are easier to tune and may have higher acceptance.
The practical obstruction is that a full-block proposal may be expensive, badly scaled, or impossible to tune uniformly across all coordinates. This leads to a deliberately local variant in which each move changes only one coordinate, or a small fixed group, while the rest of the state is held fixed.
[definition: Componentwise Metropolis-Hastings]
Componentwise Metropolis-Hastings is a blocked Metropolis-Hastings scheme in which the blocks are single coordinates or small fixed groups, updated sequentially or according to a random scan rule.
[/definition]
Componentwise updates are useful when conditional scales vary across coordinates. They may, however, move slowly through strong posterior correlations unless blocks are chosen to reflect those correlations. Logistic regression gives a concrete setting where coordinate-specific tuning is natural.
[example: Componentwise Updates in Logistic Regression]
In Bayesian logistic regression, let $\beta=(\beta_1,\dots,\beta_d)$ and propose an update of coordinate $j$ by
\begin{align*}
\beta_j'=\beta_j+\sigma_j Z,\qquad Z\sim\mathcal N(0,1),
\end{align*}
while keeping all other coordinates fixed. Thus the proposed vector is $\beta'=(\beta_1,\dots,\beta_{j-1},\beta_j',\beta_{j+1},\dots,\beta_d)$. Using the unnormalised posterior density from the logistic regression example,
\begin{align*}
\gamma(\beta)=\prod_{i=1}^n \left(\frac{\exp(x_i\cdot\beta)}{1+\exp(x_i\cdot\beta)}\right)^{y_i}\left(\frac{1}{1+\exp(x_i\cdot\beta)}\right)^{1-y_i}\exp\left(-\frac{|\beta|^2}{2\tau^2}\right).
\end{align*}
Since only coordinate $j$ changes,
\begin{align*}
x_i\cdot\beta'=x_i\cdot\beta+x_{ij}(\beta_j'-\beta_j).
\end{align*}
The one-coordinate Gaussian proposal density is symmetric because
\begin{align*}
q_j(\beta,\beta')=(2\pi\sigma_j^2)^{-1/2}\exp\left(-\frac{(\beta_j'-\beta_j)^2}{2\sigma_j^2}\right)=(2\pi\sigma_j^2)^{-1/2}\exp\left(-\frac{(\beta_j-\beta_j')^2}{2\sigma_j^2}\right)=q_j(\beta',\beta).
\end{align*}
Therefore the acceptance probability is
\begin{align*}
\alpha_j(\beta,\beta')=\min\left\{1,\frac{\gamma(\beta')}{\gamma(\beta)}\right\}.
\end{align*}
Writing out the ratio shows that every observation still contributes:
\begin{align*}
\frac{\gamma(\beta')}{\gamma(\beta)}=\prod_{i=1}^n \left(\frac{\exp(x_i\cdot\beta')}{\exp(x_i\cdot\beta)}\right)^{y_i}\left(\frac{1+\exp(x_i\cdot\beta)}{1+\exp(x_i\cdot\beta')}\right)\exp\left(-\frac{|\beta'|^2-|\beta|^2}{2\tau^2}\right).
\end{align*}
The prior term also changes only through coordinate $j$, since
\begin{align*}
|\beta'|^2-|\beta|^2=(\beta_j')^2-\beta_j^2.
\end{align*}
Thus a componentwise proposal changes one coefficient, but its accept-reject decision uses the full likelihood and the full prior ratio. Choosing separate scales $\sigma_j$ lets wide posterior directions take larger coordinate steps and narrow posterior directions take smaller ones.
[/example]
Some targets cannot be evaluated exactly, but an unbiased nonnegative estimate of the unnormalised density is available. This motivates moving the random estimator into the state space rather than treating it as a harmless numerical approximation.
[definition: Pseudo-Marginal Metropolis-Hastings]
Let $\widehat\gamma(x,U)$ be a nonnegative random estimator satisfying
\begin{align*}
\mathbb E[\widehat\gamma(x,U)]=\gamma(x)
\end{align*}
for each $x$, where $U$ has a known simulation law. Pseudo-marginal Metropolis-Hastings runs Metropolis-Hastings on the extended state $(x,U)$ with target density proportional to $\widehat\gamma(x,U)$ times the simulation density of $U$.
[/definition]
The definition raises a subtle correctness question: replacing $\gamma$ by a noisy estimate could have biased the invariant distribution. The issue is not numerical accuracy alone; if the random estimator is collapsed away incorrectly, the chain may target the expectation of the wrong object. Correctness requires viewing the auxiliary randomness as part of the state and checking the marginal distribution of $x$ under the extended target.
[quotetheorem:7224]
[citeproof:7224]
Pseudo-marginal methods are common in state-space models and doubly intractable likelihood problems. The hypotheses are not cosmetic: a negative estimator cannot define a probability density on the extended space, while a biased estimator with $\mathbb E[\widehat\gamma(x,U)]\ne\gamma(x)$ would make the marginal invariant density proportional to the wrong expectation. The estimator variance is part of the algorithm design, since cheaper noisy estimates can reduce computation per step while increasing the number of steps needed. This motivates the following definition, where a cheap deterministic approximation screens proposals before the full target is evaluated.
[definition: Delayed Acceptance Metropolis-Hastings]
Let $\tilde\gamma$ be a cheap approximation to an unnormalised target density $\gamma$. Given a proposal $y\sim q(x,\cdot)$, delayed acceptance first accepts the proposal provisionally with probability
\begin{align*}
\alpha_1(x,y)=\min\left\{1,\frac{\tilde\gamma(y)q(y,x)}{\tilde\gamma(x)q(x,y)}\right\}.
\end{align*}
Conditional on passing the first stage, it accepts the proposal with second-stage probability
\begin{align*}
\alpha_2(x,y)=\min\left\{1,\frac{\gamma(y)\tilde\gamma(x)}{\gamma(x)\tilde\gamma(y)}\right\}.
\end{align*}
[/definition]
The product of the two stages is arranged so that the exact target remains invariant. A poor approximation can reduce efficiency, but a good approximation can save many expensive likelihood evaluations. The final theorem records the detailed balance check behind this correction.
[quotetheorem:7225]
[citeproof:7225]
These variants preserve the same core idea: propose a move using a convenient mechanism, then correct the proposal so that the target distribution is invariant. In applied Monte Carlo work, the mathematical guarantee is only the starting point; performance depends on proposal geometry, computational cost per iteration, and the autocorrelation of the quantities being estimated.
Metropolis-Hastings handles invariance, but it does not remove the computational burden of exploring complicated joint distributions. Chapter 6 shows how Gibbs sampling and data augmentation can break that burden into conditional pieces, turning hard joint simulation into a sequence of simpler draws.
# 6. Gibbs Sampling and Data-Augmentation Methods
Gibbs sampling turns a difficult joint simulation problem into a sequence of conditional simulation problems. The central question is when repeatedly sampling from full conditional distributions produces draws from the desired joint distribution, and how the choice of conditioning blocks changes statistical and computational efficiency. The chapter uses the language of Markov kernels, invariant distributions, conditional expectation, and basic measure-theoretic probability from the previous chapters. It builds from the basic systematic-scan and random-scan Gibbs samplers to data augmentation, collapsed samplers, and practical blocking rules.
## Full Conditionals and Gibbs Transitions
Suppose the target distribution $\pi$ is a probability distribution on a product space $E = E_1 \times \cdots \times E_d$. Direct simulation from $\pi$ may be unavailable even when each coordinate can be simulated from its conditional law given the others. The problem is to convert these conditional simulators into a Markov chain whose invariant distribution is the joint target.
[definition: Full Conditional Distribution]
Let $(E_i,\mathcal E_i)$ be measurable spaces and let $E=E_1\times\cdots\times E_d$ with product $\sigma$-algebra $\mathcal E$. Let $X = (X_1, \dots, X_d)$ be an $E$-valued random variable with joint distribution $\pi$. For $i \in \{1, \dots, d\}$, write $E_{-i}=\prod_{j\ne i}E_j$, $\mathcal E_{-i}=\bigotimes_{j\ne i}\mathcal E_j$, and $x_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)$. A full conditional distribution for coordinate $i$ is a Markov kernel
\begin{align*}
\pi_i: E_{-i}\times \mathcal E_i \to [0,1]
\end{align*}
satisfying
\begin{align*}
\pi_i(A \mid x_{-i}) = \mathbb P(X_i \in A \mid X_{-i} = x_{-i}), \qquad A \in \mathcal E_i.
\end{align*}
[/definition]
A full conditional isolates the part of the joint distribution that remains random when all other coordinates are fixed. In Bayesian computation this is useful because the conditional often belongs to a familiar family even when the posterior joint law does not.
[example: Normal-Normal Full Conditionals]
Consider observations $Y_j \mid \theta \overset{\text{ind}}\sim \mathcal N(\theta,\sigma^2)$ for $j=1,\dots,n$, with $\sigma^2$ known, prior $\theta\mid \tau^2\sim \mathcal N(0,\tau^2)$, and $\tau^2\sim \operatorname{IG}(a_0,b_0)$ in the shape-scale convention. Write $\bar y=n^{-1}\sum_{j=1}^n y_j$. For fixed $\tau^2$, the part of the joint density depending on $\theta$ is
\begin{align*}
p(\theta\mid \tau^2,y)\propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^n(y_j-\theta)^2-\frac{\theta^2}{2\tau^2}\right\}.
\end{align*}
Expanding the sum,
\begin{align*}
\sum_{j=1}^n(y_j-\theta)^2=\sum_{j=1}^n y_j^2-2\theta\sum_{j=1}^n y_j+n\theta^2=\sum_{j=1}^n y_j^2-2n\bar y\,\theta+n\theta^2.
\end{align*}
Terms not involving $\theta$ are absorbed into the proportionality constant, so
\begin{align*}
p(\theta\mid \tau^2,y)\propto \exp\left\{-\frac{1}{2}\left(\left(\frac n{\sigma^2}+\frac1{\tau^2}\right)\theta^2-2\frac{n\bar y}{\sigma^2}\theta\right)\right\}.
\end{align*}
Set $A=n/\sigma^2+1/\tau^2$ and $B=n\bar y/\sigma^2$. Since
\begin{align*}
A\theta^2-2B\theta=A\left(\theta-\frac BA\right)^2-\frac{B^2}{A},
\end{align*}
the factor $\exp\{B^2/(2A)\}$ is constant in $\theta$, and therefore
\begin{align*}
\theta\mid \tau^2,y\sim \mathcal N\left(\frac{n\bar y/\sigma^2}{n/\sigma^2+1/\tau^2},\frac{1}{n/\sigma^2+1/\tau^2}\right).
\end{align*}
For fixed $\theta$, the likelihood does not involve $\tau^2$, so the conditional density of $\tau^2$ is obtained from the normal prior density for $\theta\mid\tau^2$ and the inverse-gamma prior density for $\tau^2$. Thus
\begin{align*}
p(\tau^2\mid \theta,y)\propto (\tau^2)^{-1/2}\exp\left(-\frac{\theta^2}{2\tau^2}\right)(\tau^2)^{-a_0-1}\exp\left(-\frac{b_0}{\tau^2}\right).
\end{align*}
Multiplying the powers and combining the exponent terms gives
\begin{align*}
p(\tau^2\mid \theta,y)\propto (\tau^2)^{-(a_0+1/2)-1}\exp\left(-\frac{b_0+\theta^2/2}{\tau^2}\right).
\end{align*}
This is the shape-scale inverse-gamma density with shape $a_0+1/2$ and scale $b_0+\theta^2/2$, so
\begin{align*}
\tau^2\mid \theta,y\sim \operatorname{IG}\left(a_0+\frac12,b_0+\frac{\theta^2}{2}\right).
\end{align*}
The posterior dependence between $\theta$ and $\tau^2$ remains in the joint law, but each conditional distribution is standard, so alternating these two draws defines the Gibbs update.
[/example]
The normal-normal example leaves an implementation question: should a full iteration mean a complete deterministic sweep through all coordinates, or should each Markov step update a randomly selected coordinate? The first convention leads to the systematic-scan Gibbs sampler.
[definition: Systematic-Scan Gibbs Sampler]
Let $\pi$ be a probability distribution on $(E,\mathcal E)=\prod_{i=1}^d(E_i,\mathcal E_i)$ with full conditional kernels $(\pi_i:E_{-i}\times\mathcal E_i\to[0,1])_{i=1}^d$. The systematic-scan Gibbs sampler is the Markov kernel
\begin{align*}
K_{\mathrm{sys}}: E\times\mathcal E\to[0,1]
\end{align*}
obtained by updating coordinates in the deterministic order $1,2,\dots,d$, sampling each new coordinate from its full conditional given the most recently available values of all other coordinates.
[/definition]
Systematic scan is common in statistical software because a complete sweep has a natural interpretation as one iteration. What transition should be used when the algorithm needs a single coordinate update at each Markov step, possibly with unequal update frequencies? This question leads to the random-scan Gibbs sampler.
[definition: Random-Scan Gibbs Sampler]
Let $\pi$ be a probability distribution on $(E,\mathcal E)=\prod_{i=1}^d(E_i,\mathcal E_i)$ with full conditional kernels $(\pi_i:E_{-i}\times\mathcal E_i\to[0,1])_{i=1}^d$. Let $p_1, \dots, p_d > 0$ with $\sum_{i=1}^d p_i = 1$. The random-scan Gibbs sampler is the Markov kernel
\begin{align*}
K:E\times\mathcal E\to[0,1]
\end{align*}
that first draws an index $I_k$ with $\mathbb P(I_k=i)=p_i$, and then replaces only coordinate $i$ by a draw from $\pi_i(\cdot \mid X_{k,-i})$.
[/definition]
Random scan separates the Markov-chain transition into a mixture of coordinate kernels. The key question for both scan rules is whether these coordinate-wise replacements preserve the original joint distribution rather than only the separate conditional formulas.
[quotetheorem:7226]
[citeproof:7226]
The theorem says only that $\pi$ is left unchanged if the chain already has law $\pi$; it does not say that a chain started elsewhere will reach that law. Irreducibility is needed because the conditional kernels may preserve a hidden invariant set: for example, a target putting mass on two disconnected components can have full conditional updates that never cross from one component to the other. Aperiodicity is a separate condition excluding deterministic cycling, which can occur in specially structured discrete Gibbs chains even when the displayed conditionals are correct. Thus the invariance theorem supplies the candidate stationary distribution, while convergence still requires support and recurrence conditions for the actual transition kernel.
[remark: Support and Irreducibility]
A Gibbs sampler can have the correct invariant distribution on each communicating class while failing to explore the full support of $\pi$. For instance, if a target is supported on two separated curves and every full conditional is degenerate along the current curve, coordinate-wise Gibbs updates cannot move between curves. Diagnostics based only on conditional formulas may miss this obstruction.
[/remark]
## Data Augmentation and Auxiliary Variables
Some targets become easier after adding variables that are not themselves of scientific interest. The question is how to introduce auxiliary variables so that sampling on an enlarged space helps the marginal distribution of the variables we care about.
[definition: Data Augmentation]
Let $(\Theta,\mathcal T)$ be a parameter space, let $(\mathcal Z,\mathcal A)$ be an auxiliary-variable space, and let $\pi$ be a probability distribution on $(\Theta,\mathcal T)$. A data-augmentation scheme is a probability distribution $\tilde\pi$ on $(\Theta\times\mathcal Z,\mathcal T\otimes\mathcal A)$ such that
\begin{align*}
\tilde\pi(A\times \mathcal Z)=\pi(A), \qquad A\in\mathcal T.
\end{align*}
[/definition]
After augmentation, one usually alternates between $\theta \mid z$ and $z \mid \theta$. The auxiliary variable is valuable when these two conditional laws are easier to sample than the original marginal law.
[example: Bayesian Probit Regression via Latent Gaussian Variables]
Let $Y_j \in \{0,1\}$ and $x_j\in\mathbb R^p$. Introduce latent variables
\begin{align*}
Z_j\mid \beta \sim \mathcal N(x_j^\top\beta,1), \qquad Y_j=\mathbb{1}_{\{Z_j>0\}}.
\end{align*}
Then, writing $\varepsilon_j=Z_j-x_j^\top\beta\sim\mathcal N(0,1)$,
\begin{align*}
\mathbb P(Y_j=1\mid \beta)=\mathbb P(Z_j>0\mid\beta)=\mathbb P(\varepsilon_j>-x_j^\top\beta)=\mathbb P(\varepsilon_j\le x_j^\top\beta)=\Phi(x_j^\top\beta),
\end{align*}
so the latent-variable model has the required probit marginal law.
Given $\beta$ and $Y_j$, the density of $Z_j$ is the normal density restricted to the event defining the observed binary response:
\begin{align*}
p(z_j\mid \beta,Y_j=1)\propto \exp\left\{-\frac12(z_j-x_j^\top\beta)^2\right\}\mathbb{1}_{(0,\infty)}(z_j).
\end{align*}
Similarly,
\begin{align*}
p(z_j\mid \beta,Y_j=0)\propto \exp\left\{-\frac12(z_j-x_j^\top\beta)^2\right\}\mathbb{1}_{(-\infty,0]}(z_j).
\end{align*}
Thus $Z_j\mid \beta,Y_j=1$ is $\mathcal N(x_j^\top\beta,1)$ truncated to $(0,\infty)$, and $Z_j\mid \beta,Y_j=0$ is $\mathcal N(x_j^\top\beta,1)$ truncated to $(-\infty,0]$.
Now let $X$ be the $n\times p$ matrix with rows $x_j^\top$, and write $z=(z_1,\dots,z_n)^\top$. Conditional on $z$, the likelihood is
\begin{align*}
p(z\mid\beta)\propto \exp\left\{-\frac12(z-X\beta)^\top(z-X\beta)\right\}.
\end{align*}
The prior density is
\begin{align*}
p(\beta)\propto \exp\left\{-\frac12(\beta-m_0)^\top V_0^{-1}(\beta-m_0)\right\}.
\end{align*}
Multiplying likelihood and prior gives
\begin{align*}
p(\beta\mid z,y)\propto \exp\left\{-\frac12\left[(z-X\beta)^\top(z-X\beta)+(\beta-m_0)^\top V_0^{-1}(\beta-m_0)\right]\right\}.
\end{align*}
Expanding the quadratic terms,
\begin{align*}
(z-X\beta)^\top(z-X\beta)=z^\top z-2\beta^\top X^\top z+\beta^\top X^\top X\beta.
\end{align*}
Also,
\begin{align*}
(\beta-m_0)^\top V_0^{-1}(\beta-m_0)=\beta^\top V_0^{-1}\beta-2\beta^\top V_0^{-1}m_0+m_0^\top V_0^{-1}m_0.
\end{align*}
The terms $z^\top z$ and $m_0^\top V_0^{-1}m_0$ do not involve $\beta$, so
\begin{align*}
p(\beta\mid z,y)\propto \exp\left\{-\frac12\left[\beta^\top(V_0^{-1}+X^\top X)\beta-2\beta^\top(V_0^{-1}m_0+X^\top z)\right]\right\}.
\end{align*}
Set
\begin{align*}
\Lambda=V_0^{-1}+X^\top X, \qquad \eta=V_0^{-1}m_0+X^\top z, \qquad m=\Lambda^{-1}\eta.
\end{align*}
Since $\eta=\Lambda m$,
\begin{align*}
\beta^\top\Lambda\beta-2\beta^\top\eta=(\beta-m)^\top\Lambda(\beta-m)-m^\top\Lambda m.
\end{align*}
The last term is constant in $\beta$, and therefore
\begin{align*}
\beta\mid z,y\sim \mathcal N\left(\Lambda^{-1}\eta,\Lambda^{-1}\right)=\mathcal N\left((V_0^{-1}+X^\top X)^{-1}(V_0^{-1}m_0+X^\top z),(V_0^{-1}+X^\top X)^{-1}\right).
\end{align*}
The augmentation converts the binary-response posterior into two standard Gibbs steps: sample truncated univariate Gaussian latent variables $Z_j$ given $\beta$, then sample a multivariate Gaussian $\beta$ given $z$.
[/example]
The probit construction shows that auxiliary variables can restore conjugacy. The next issue is whether every auxiliary variable should actually be sampled, since some variables increase posterior dependence without adding information about the final estimand; this motivates collapsed Gibbs sampling.
[definition: Collapsed Gibbs Sampler]
Let $(E,\mathcal E)$ and $(F,\mathcal F)$ be measurable spaces, let $\tilde\pi$ be a probability distribution on $(E\times F,\mathcal E\otimes\mathcal F)$, and let $\pi_E$ be the marginal distribution of $\tilde\pi$ on $(E,\mathcal E)$. A collapsed Gibbs sampler for $\pi_E$ is a Markov kernel
\begin{align*}
K_{\mathrm{coll}}:E\times\mathcal E\to[0,1]
\end{align*}
constructed from conditional kernels of $\tilde\pi$ after analytically integrating out at least one $F$-valued variable in at least one update, with invariant distribution $\pi_E$.
[/definition]
Collapsing often reduces autocorrelation because it removes random variables that mediate strong dependence. Care is needed: after integrating out a variable, the remaining updates must be ordered so that they define a valid Markov kernel for the marginal target.
[example: Collapsed Gibbs Sampler for a Finite Mixture Model]
Let $C_i\mid w$ have $\mathbb P(C_i=\ell\mid w)=w_\ell$, and suppose that, conditional on $C_i=\ell$ and $\psi$, the density of $Y_i$ is $f(y_i\mid\psi_\ell)$. Fix $j$ and write
\begin{align*}
n_{-j,\ell}=\sum_{i\ne j}\mathbb 1_{\{c_i=\ell\}}.
\end{align*}
Given $c_{-j}$, the part of the joint density involving $w$ is
\begin{align*}
p(w\mid c_{-j})\propto \prod_{\ell=1}^K w_\ell^{\alpha_\ell-1}\prod_{i\ne j}w_{c_i}.
\end{align*}
Since each factor $w_\ell$ appears once for every $i\ne j$ with $c_i=\ell$, this becomes
\begin{align*}
p(w\mid c_{-j})\propto \prod_{\ell=1}^K w_\ell^{\alpha_\ell+n_{-j,\ell}-1}.
\end{align*}
Thus $w\mid c_{-j}\sim \operatorname{Dirichlet}(\alpha_1+n_{-j,1},\dots,\alpha_K+n_{-j,K})$.
Set $a_\ell=\alpha_\ell+n_{-j,\ell}$ and $A=\sum_{\ell=1}^K a_\ell$. The collapsed predictive probability of assigning observation $j$ to component $k$ is
\begin{align*}
\mathbb P(C_j=k\mid c_{-j})=\int w_k\,\operatorname{Dirichlet}(dw;a_1,\dots,a_K).
\end{align*}
Using the Dirichlet normalizing constant, this integral is
\begin{align*}
\frac{\Gamma(A)}{\prod_{\ell=1}^K\Gamma(a_\ell)}\int_{\sum_\ell w_\ell=1} w_k\prod_{\ell=1}^K w_\ell^{a_\ell-1}\,dw.
\end{align*}
The integrand has exponent $a_k$ on $w_k$ and exponent $a_\ell-1$ on $w_\ell$ for $\ell\ne k$, so the integral equals the reciprocal normalizing constant for $\operatorname{Dirichlet}(a_1,\dots,a_k+1,\dots,a_K)$:
\begin{align*}
\mathbb P(C_j=k\mid c_{-j})=\frac{\Gamma(A)}{\prod_{\ell=1}^K\Gamma(a_\ell)}\frac{\Gamma(a_k+1)\prod_{\ell\ne k}\Gamma(a_\ell)}{\Gamma(A+1)}.
\end{align*}
Since $\Gamma(a_k+1)=a_k\Gamma(a_k)$ and $\Gamma(A+1)=A\Gamma(A)$, this reduces to
\begin{align*}
\mathbb P(C_j=k\mid c_{-j})=\frac{a_k}{A}=\frac{\alpha_k+n_{-j,k}}{\sum_{\ell=1}^K(\alpha_\ell+n_{-j,\ell})}.
\end{align*}
Now include the likelihood contribution of $y_j$. Conditional on assigning $C_j=k$, the factor involving $y_j$ is $f(y_j\mid\psi_k)$, so
\begin{align*}
\mathbb P(C_j=k\mid c_{-j},\psi,y)\propto f(y_j\mid\psi_k)\frac{\alpha_k+n_{-j,k}}{\sum_{\ell=1}^K(\alpha_\ell+n_{-j,\ell})}.
\end{align*}
The denominator is independent of $k$, hence
\begin{align*}
\mathbb P(C_j=k\mid c_{-j},\psi,y)\propto (\alpha_k+n_{-j,k})f(y_j\mid\psi_k).
\end{align*}
Equivalently, the normalized collapsed update is
\begin{align*}
\mathbb P(C_j=k\mid c_{-j},\psi,y)=\frac{(\alpha_k+n_{-j,k})f(y_j\mid\psi_k)}{\sum_{\ell=1}^K(\alpha_\ell+n_{-j,\ell})f(y_j\mid\psi_\ell)}.
\end{align*}
The update replaces an explicit draw of $w$ by its posterior predictive effect on $C_j$, so the allocation step uses the same target distribution while avoiding extra variation from repeatedly sampling weights that are nearly determined by the current counts.
[/example]
A different way to attack dependence is to change the augmented representation without changing the marginal target. This leads to parameter-expanded samplers, which insert redundant parameters to create larger moves after marginalization or transformation.
[definition: Parameter Expansion]
Let $(\Theta,\mathcal T)$ be a parameter space, let $(A,\mathcal A)$ be a working-parameter space, let $(\mathcal Z,\mathcal B)$ be an auxiliary-variable space, and let $\pi$ be a probability distribution on $(\Theta,\mathcal T)$. A parameter expansion for $\pi$ is a probability distribution $\tilde\pi$ on $(\Theta\times A\times\mathcal Z,\mathcal T\otimes\mathcal A\otimes\mathcal B)$ such that
\begin{align*}
\tilde\pi(B\times A\times\mathcal Z)=\pi(B), \qquad B\in\mathcal T.
\end{align*}
[/definition]
Parameter expansion is useful when the working parameter permits rescaling, centering, or rotating latent variables before returning to the original parameter. The transition must be designed so that the marginal law of the original parameter remains unchanged.
## Compatibility, Blocking, and Factorization
In applications, conditionals are often derived piece by piece from a model specification or from local dependence assumptions. The problem is to know whether a proposed collection of conditionals corresponds to a joint distribution, and how to group variables into blocks once a joint target is available.
[definition: Compatible Conditional Distributions]
Let $(E,\mathcal E)=\prod_{i=1}^d(E_i,\mathcal E_i)$ and write $(E_{-i},\mathcal E_{-i})=\prod_{j\ne i}(E_j,\mathcal E_j)$. A collection of Markov kernels
\begin{align*}
q_i:E_{-i}\times\mathcal E_i\to[0,1], \qquad i=1,\dots,d,
\end{align*}
is compatible if there exists a joint distribution $\pi$ on $(E,\mathcal E)$ such that $q_i(\cdot \mid x_{-i})$ is a full conditional distribution of $\pi$ for each $i$.
[/definition]
To build compatible conditionals from a graphical model, we need a result that turns local dependence restrictions into a joint density formula. The Hammersley-Clifford factorization statement supplies that bridge in the finite positive case.
[quotetheorem:7227]
[citeproof:7227]
The theorem justifies local Gibbs updates in finite Markov random fields: if the joint law factorizes into local clique potentials, the full conditional for a site only uses factors touching that site. The finite and positivity hypotheses are substantive. If positivity fails, deterministic constraints can create zero-probability configurations where local conditional independences hold but no strictly positive clique-potential representation exists on the whole configuration space. Outside the finite positive case, factorization theorems require additional measure-theoretic hypotheses and a choice of reference measures, so the displayed statement should not be read as a general theorem for arbitrary continuous random fields. Once the compatible joint target has been identified, the next design question is whether variables should be updated individually or in blocks.
[definition: Blocked Gibbs Sampler]
Let $\pi$ be a probability distribution on $(E,\mathcal E)=\prod_{i=1}^d(E_i,\mathcal E_i)$, and let $B_1,\dots,B_m$ be nonempty subsets of $\{1,\dots,d\}$ whose union is $\{1,\dots,d\}$. For each block $B_r$, write $E_{B_r}=\prod_{i\in B_r}E_i$ and let
\begin{align*}
\pi_{B_r}:E_{-B_r}\times\mathcal E_{B_r}\to[0,1]
\end{align*}
be the conditional kernel of $x_{B_r}$ given $x_{-B_r}$. A blocked Gibbs sampler is the Markov kernel on $E$ that updates $x_{B_r}$ by sampling from $\pi_{B_r}(\cdot\mid x_{-B_r})$ for $r=1,\dots,m$, either in a fixed order or according to a random scan.
[/definition]
Blocking changes the geometry of the transition. Larger blocks can reduce random-walk behaviour and posterior dependence, but they require conditional distributions that can be sampled at acceptable cost.
[example: Blocking Correlated Regression Coefficients]
Let $\beta\mid y\sim \mathcal N(m,V)$ and write $Q=V^{-1}$. For a block $B=\{r,s\}$, set $\delta_B=\beta_B-m_B$ and $\delta_{-B}=\beta_{-B}-m_{-B}$. Partitioning the precision matrix gives
\begin{align*}
(\beta-m)^\top Q(\beta-m)=\delta_B^\top Q_{BB}\delta_B+2\delta_B^\top Q_{B,-B}\delta_{-B}+\delta_{-B}^\top Q_{-B,-B}\delta_{-B}.
\end{align*}
When $\beta_{-B}$ is fixed, the last term is constant in $\beta_B$. Define
\begin{align*}
\mu_B=m_B-Q_{BB}^{-1}Q_{B,-B}(\beta_{-B}-m_{-B}).
\end{align*}
Then $\mu_B-m_B=-Q_{BB}^{-1}Q_{B,-B}\delta_{-B}$, so $Q_{BB}(\mu_B-m_B)=-Q_{B,-B}\delta_{-B}$. Expanding the completed square gives
\begin{align*}
(\beta_B-\mu_B)^\top Q_{BB}(\beta_B-\mu_B)=\delta_B^\top Q_{BB}\delta_B-2\delta_B^\top Q_{BB}(\mu_B-m_B)+(\mu_B-m_B)^\top Q_{BB}(\mu_B-m_B).
\end{align*}
Substituting $Q_{BB}(\mu_B-m_B)=-Q_{B,-B}\delta_{-B}$ gives
\begin{align*}
(\beta_B-\mu_B)^\top Q_{BB}(\beta_B-\mu_B)=\delta_B^\top Q_{BB}\delta_B+2\delta_B^\top Q_{B,-B}\delta_{-B}+(\mu_B-m_B)^\top Q_{BB}(\mu_B-m_B).
\end{align*}
The final term is constant in $\beta_B$, and therefore
\begin{align*}
\beta_B\mid \beta_{-B},y\sim \mathcal N\left(m_B-Q_{BB}^{-1}Q_{B,-B}(\beta_{-B}-m_{-B}),Q_{BB}^{-1}\right).
\end{align*}
For comparison, a one-coordinate update of coordinate $r$ uses
\begin{align*}
\beta_r\mid \beta_{-r},y\sim \mathcal N\left(m_r-Q_{rr}^{-1}Q_{r,-r}(\beta_{-r}-m_{-r}),Q_{rr}^{-1}\right).
\end{align*}
To see why near collinearity makes this inefficient, look only at the two-dimensional marginal law of $(\beta_r,\beta_s)$. Suppose its variances are $s_r^2$ and $s_s^2$, with correlation $\rho$. Its covariance entries are $V_{rr}=s_r^2$, $V_{ss}=s_s^2$, and $V_{rs}=V_{sr}=\rho s_rs_s$, so its determinant is
\begin{align*}
\det(V_{BB})=s_r^2s_s^2-\rho^2s_r^2s_s^2=s_r^2s_s^2(1-\rho^2).
\end{align*}
The $(r,r)$ entry of $V_{BB}^{-1}$ is therefore
\begin{align*}
(V_{BB}^{-1})_{rr}=\frac{s_s^2}{s_r^2s_s^2(1-\rho^2)}=\frac{1}{s_r^2(1-\rho^2)}.
\end{align*}
For a bivariate Gaussian, the conditional precision of $\beta_r$ given $\beta_s$ is this precision entry, hence
\begin{align*}
\operatorname{Var}(\beta_r\mid \beta_s,y)=s_r^2(1-\rho^2).
\end{align*}
When $|\rho|$ is close to $1$, this conditional variance is small, so alternating single-coordinate updates changes one coefficient at a time in short moves across the narrow direction. Sampling $(\beta_r,\beta_s)$ together instead uses the block covariance $Q_{BB}^{-1}$ and can move along the tilted posterior dependence between the two coefficients. This is why useful Gibbs blocks are chosen from posterior dependence, not merely from the order of parameters in the model statement.
[/example]
The blocked regression example concerns the transition itself. A separate source of efficiency comes after the chain has been simulated: Gibbs updates often compute conditional probabilities or conditional means, so the natural question is whether those conditional expectations should replace noisier sampled quantities.
At the level of a single update, this is the conditional-expectation variance reduction principle used earlier in the course: replacing an output by its conditional expectation preserves the target mean and cannot increase its marginal variance. In a Gibbs chain, the conditioning information is often already available before the updated component is sampled. The limitation is that this marginal comparison does not by itself compare asymptotic variances, because autocovariances may change. Rao-Blackwellization is therefore most useful when the conditional expectation is already computed during a Gibbs update and does not introduce a substantially more persistent time series.
A common application is estimating a posterior probability or classification probability. Instead of averaging an indicator sampled inside the chain, average the conditional probability of that indicator given the current block; this keeps the same target expectation and removes conditional simulation noise.
[example: Rao-Blackwellized Mixture Allocation Estimate]
In the finite mixture sampler, suppose the goal is $\mathbb P(C_j=k\mid y)$. At iteration $t$, set
\begin{align*}
q_t=\mathbb P(C_j=k\mid c_{-j,t},\psi_t,y).
\end{align*}
For the collapsed allocation update from the mixture example, this probability is
\begin{align*}
q_t=\frac{(\alpha_k+n_{-j,k,t})f(y_j\mid\psi_{k,t})}{\sum_{\ell=1}^K(\alpha_\ell+n_{-j,\ell,t})f(y_j\mid\psi_{\ell,t})}.
\end{align*}
The raw estimator averages $I_t=\mathbb 1_{\{C_{j,t}=k\}}$, while the Rao-Blackwellized estimator averages $q_t$.
To check that the target expectation is the same, let $\mathcal G_t=\sigma(c_{-j,t},\psi_t,y)$. Since $q_t=\mathbb E[I_t\mid\mathcal G_t]$,
\begin{align*}
\mathbb E[q_t\mid y]=\mathbb E[\mathbb E[I_t\mid\mathcal G_t]\mid y]=\mathbb E[I_t\mid y]=\mathbb P(C_j=k\mid y).
\end{align*}
The variance reduction at one iteration is also explicit. Conditional on $\mathcal G_t$, the variable $I_t$ is Bernoulli with success probability $q_t$, so
\begin{align*}
\operatorname{Var}(I_t\mid\mathcal G_t)=q_t(1-q_t).
\end{align*}
The variance decomposition gives
\begin{align*}
\operatorname{Var}(I_t\mid y)=\operatorname{Var}(q_t\mid y)+\mathbb E[q_t(1-q_t)\mid y].
\end{align*}
Because $q_t(1-q_t)\ge 0$, averaging $q_t$ removes the conditional Bernoulli simulation noise present in $I_t$ while preserving the desired posterior mean $\mathbb P(C_j=k\mid y)$.
[/example]
Practical Gibbs construction therefore has three recurring choices. First, choose a representation in which the necessary conditionals are valid and sampleable. Second, choose blocks that balance dependence reduction against conditional simulation cost. Third, use conditional expectations when available to extract more information from each sweep of the chain.
Gibbs sampling makes the conditional structure of the model central, yet validity alone does not tell us whether a finite run is useful. Chapter 7 focuses on the workflow around the sampler itself, asking how to diagnose convergence, assess uncertainty, and decide when the numerical output is trustworthy.
# 7. Convergence Diagnostics and Monte Carlo Workflow
This chapter turns from designing Monte Carlo algorithms to deciding whether their numerical output is trustworthy. Chapters 4--6 gave Markov chain mechanisms whose validity is asymptotic; here the practical question is what can be learned from a finite run. The main tools are convergence diagnostics, Monte Carlo standard errors, effective sample size estimates, and model-based checks that expose when a sampler is exploring the wrong distribution or only part of the right one.
## Single-Chain Error Assessment
A Markov chain Monte Carlo run produces dependent draws, so the first diagnostic problem is to separate the target expectation from the transient behaviour caused by initialization. If $X_1,\dots,X_N$ is a chain with invariant distribution $\pi$ and $h$ is a scalar function, the estimator
\begin{align*}
\bar h_N = \frac{1}{N}\sum_{n=1}^N h(X_n)
\end{align*}
looks like an ordinary sample mean, but its error depends on serial correlation and on whether the chain has reached its stationary regime. This initialization question motivates the first operational convention: choose which early draws are excluded before reporting the final estimator.
[definition: Burn-In]
For a Markov chain output $X_1,\dots,X_N$, a burn-in length $b$ is a nonnegative integer with $b < N$ such that the Monte Carlo estimator is formed from $X_{b+1},\dots,X_N$.
[/definition]
Burn-in is a pragmatic attempt to reduce initialization bias. It does not create independent samples, and a poor burn-in choice can hide rather than solve nonconvergence. In adaptive algorithms the early stage has another role: the algorithm may be changing its proposal scale, mass matrix, or integration parameters, so we need the broader notion of warmup.
[definition: Warmup]
For an adaptive Markov chain algorithm, warmup is an initial stage during which draws may be used to tune algorithmic parameters and are excluded from the final Monte Carlo estimator.
[/definition]
After warmup, the remaining draws still carry dependence. This is why simply reporting the number of post-warmup iterations can overstate the amount of information in the computation. The first dependence diagnostic is the autocorrelation function, which measures how strongly the chain remembers its earlier states at each lag.
[definition: Autocorrelation Function]
Let $(Y_n)_{n \ge 1}$ be a stationary scalar time series with $\mathbb E[Y_1^2] < \infty$ and variance $\gamma_0 > 0$. Its lag-$k$ autocovariance and autocorrelation are
\begin{align*}
\gamma_k &= \operatorname{Cov}(Y_1,Y_{1+k}), & \rho_k &= \frac{\gamma_k}{\gamma_0}.
\end{align*}
[/definition]
Autocorrelation at a single lag is only a partial description of Monte Carlo error. The variance of a sample mean accumulates dependence across all lags, so we need a single scalar quantity that summarizes the whole autocorrelation sequence. This motivates integrated autocorrelation time.
[definition: Integrated Autocorrelation Time]
For a stationary scalar time series with autocorrelations $(\rho_k)_{k \ge 1}$ such that the series converges, the integrated autocorrelation time is
\begin{align*}
\tau = 1 + 2\sum_{k=1}^{\infty}\rho_k.
\end{align*}
[/definition]
The role of $\tau$ is to convert dependent sampling into an equivalent independent-sampling scale. For a Markov chain average, the long-run variance is $\gamma_0\tau$, so the usual variance divided by $N$ must be replaced by a time-series variance divided by $N$. We need the Monte Carlo standard error to report this simulation uncertainty on the same scale as the estimated quantity.
[definition: Monte Carlo Standard Error]
For an estimator $\hat\theta_N$ of a scalar quantity $\theta$, the Monte Carlo standard error is an estimate of the standard deviation of $\hat\theta_N$ under repeated Monte Carlo runs.
[/definition]
The Monte Carlo standard error quantifies simulation noise, not posterior uncertainty or model uncertainty. In Bayesian computation, a posterior standard deviation describes the spread of the target distribution, while the Monte Carlo standard error describes numerical uncertainty in the reported posterior summary. We need effective sample size as a complementary way to communicate the same loss of information using an independent-sample scale.
[definition: Effective Sample Size]
For a stationary scalar Markov chain estimate of $\mathbb E_\pi[h(X)]$, the effective sample size is
\begin{align*}
N_{\mathrm{eff}} = \frac{N}{\tau},
\end{align*}
where $\tau$ is the integrated autocorrelation time for $h(X_n)$.
[/definition]
Effective sample size depends on the function $h$. A chain can estimate a posterior mean accurately while giving much weaker information about a tail probability or quantile. The Gaussian autoregression gives a concrete scale for how severe this loss can be.
[example: Autocorrelated Gaussian Chain]
Let $(Y_n)$ be the stationary Gaussian autoregression
\begin{align*}
Y_{n+1}=\phi Y_n+\varepsilon_{n+1},
\end{align*}
where $|\phi|<1$, the innovations have mean $0$, are independent of the past, and their variance is chosen so that $\operatorname{Var}(Y_n)=1$. Stationarity gives $\operatorname{Var}(Y_{n+1})=\operatorname{Var}(Y_n)=1$, so
\begin{align*}
1=\operatorname{Var}(Y_{n+1})=\operatorname{Var}(\phi Y_n+\varepsilon_{n+1})=\phi^2\operatorname{Var}(Y_n)+\operatorname{Var}(\varepsilon_{n+1})=\phi^2+\operatorname{Var}(\varepsilon_{n+1}).
\end{align*}
Thus $\operatorname{Var}(\varepsilon_{n+1})=1-\phi^2$.
For the lag-$1$ autocovariance, independence of $\varepsilon_{n+1}$ from $Y_n$ gives
\begin{align*}
\gamma_1=\operatorname{Cov}(Y_n,Y_{n+1})=\operatorname{Cov}(Y_n,\phi Y_n+\varepsilon_{n+1})=\phi\operatorname{Var}(Y_n)+0=\phi.
\end{align*}
For $k\ge 1$, iterating the recursion gives
\begin{align*}
Y_{n+k}=\phi^kY_n+\sum_{j=1}^k \phi^{k-j}\varepsilon_{n+j}.
\end{align*}
Each $\varepsilon_{n+j}$ is independent of $Y_n$, so
\begin{align*}
\gamma_k=\operatorname{Cov}(Y_n,Y_{n+k})=\operatorname{Cov}\left(Y_n,\phi^kY_n+\sum_{j=1}^k \phi^{k-j}\varepsilon_{n+j}\right)=\phi^k\operatorname{Var}(Y_n)=\phi^k.
\end{align*}
Since $\gamma_0=\operatorname{Var}(Y_n)=1$, the autocorrelation is $\rho_k=\gamma_k/\gamma_0=\phi^k$.
The integrated autocorrelation time is therefore
\begin{align*}
\tau=1+2\sum_{k=1}^{\infty}\rho_k=1+2\sum_{k=1}^{\infty}\phi^k.
\end{align*}
Because $|\phi|<1$, the geometric series satisfies $\sum_{k=1}^{\infty}\phi^k=\phi/(1-\phi)$, hence
\begin{align*}
\tau=1+2\frac{\phi}{1-\phi}=\frac{1-\phi}{1-\phi}+\frac{2\phi}{1-\phi}=\frac{1+\phi}{1-\phi}.
\end{align*}
For $N=10000$ and $\phi=0.9$,
\begin{align*}
\tau=\frac{1+0.9}{1-0.9}=\frac{1.9}{0.1}=19.
\end{align*}
The effective sample size is
\begin{align*}
N_{\mathrm{eff}}=\frac{N}{\tau}=\frac{10000}{19}=526.315\ldots.
\end{align*}
Thus ten thousand stored states have the same mean-estimation scale as about $526$ independent draws, so strong positive autocorrelation can remove most of the apparent information in a long run.
[/example]
This example explains why reducing autocorrelation by discarding intermediate draws is not automatically a variance improvement. Discarding draws lowers autocorrelation between retained draws, but it also lowers $N$; unless storage or downstream computation is the bottleneck, the unthinned average normally has no larger Monte Carlo variance than the thinned average. This motivates the definition of thinning as a deliberate retention rule rather than a convergence cure.
[definition: Thinning]
For a Markov chain output $X_1,\dots,X_N$, thinning by lag $m \in \mathbb N$ means retaining only $X_m,X_{2m},X_{3m},\dots$ and discarding the other draws.
[/definition]
Thinning clarifies that storage decisions and error estimation are different tasks. Once the retained or unthinned scalar sequence has been chosen, the next task is to estimate the long-run variance without knowing the infinite autocorrelation series. Batch means gives one standard method by turning a dependent run into many long block averages.
Under stationarity, a suitable moment bound, a strong-mixing summability condition, and a Markov-chain central limit theorem, the batch-means estimator is consistent when both the number of batches $a_N$ and the batch length $m_N$ tend to infinity while $a_N/m_N\to0$. The hypotheses each exclude a concrete failure mode. If $m_N$ stays bounded, each batch mean only sees a fixed local fragment of the chain; for an AR(1) process with $\phi$ close to $1$, such batches estimate the innovation-scale variance rather than the long-run variance inflated by autocorrelation. If $a_N$ stays bounded, the estimator is a sample variance based on only finitely many batch means and does not settle to a deterministic limit. If the batch count grows too quickly relative to the batch length, neighbouring batches remain strongly dependent; a slowly mixing two-state chain can then make the batch sample variance understate the uncertainty caused by rare switches. The moment assumption rules out heavy-tailed cases where a few extreme batches dominate the squared deviations, and the mixing summability rules out long-range dependence for which the usual $\sqrt N$ Monte Carlo standard error is not the right scale.
The theorem does not choose an optimal batch length for a finite run, and it does not prove that a visibly stable trace has reached stationarity. It is a consistency statement for a fixed scalar sequence under a specified asymptotic regime. In practice, batch means is attractive because it can be implemented with little tuning, but its weakness is that the batch length must be long enough to capture dependence while still leaving many batches. A more direct approach estimates the same long-run variance by summing estimated autocovariances with a stabilizing window, leading to spectral variance estimation. This connects Monte Carlo error assessment to the spectral-density viewpoint from time-series analysis: the long-run variance is the zero-frequency spectral mass of the scalar output.
[quotetheorem:7228]
[citeproof:7228]
The lag window, bandwidth, summability, moment, and mixing assumptions are not cosmetic. Without a window, the raw sum of empirical autocovariances includes many high-lag terms that are mostly estimation noise; for independent draws those true autocovariances vanish, but the empirical high-lag terms still fluctuate enough to destabilize the estimate. If $b_N$ is fixed, persistent positive autocorrelation beyond that fixed lag is never included, so an AR(1) chain with large $\phi$ is systematically underestimated. If $b_N$ is of order $N$, the estimator gives too much weight to lags estimated from very few pairs. Absolute summability of $(\gamma_k)$ rules out long-memory processes where the long-run variance series diverges or has a different scaling, and the moment and mixing assumptions rule out heavy-tailed or slowly mixing cases in which empirical autocovariances fail to concentrate.
The theorem also does not identify which scalar summary is hardest to estimate, nor does it diagnose multimodality. Spectral variance estimation is the time-domain form of estimating the spectral density at frequency zero. Both batch means and spectral estimators should be interpreted as estimates for a chosen scalar summary $h(X)$, not as global guarantees for the whole target distribution. This viewpoint will reappear when effective sample size is reported separately for bulk and tail summaries.
## Multiple Chains and Scale Diagnostics
A single chain can look stable while exploring only one region of the state space. The multiple-chain problem is to compare overdispersed runs and ask whether between-chain variation has become comparable to within-chain variation. The first tool is visual: before reducing chains to statistics, we inspect their paths through time.
[definition: Trace Plot]
For scalar summaries $h(X_{m,n})$ from chains $m=1,\dots,M$ and iterations $n=1,\dots,N$, a trace plot is the graphical display of the coordinate sets
\begin{align*}
\{(n,h(X_{m,n})) : 1 \le n \le N\}, \qquad m=1,\dots,M,
\end{align*}
with iteration number on the horizontal axis and sampled value on the vertical axis.
[/definition]
Trace plots are diagnostic pictures rather than tests. They are useful for detecting trends, sudden jumps, sticky behaviour, and chains trapped in different regions, but visual agreement alone cannot certify convergence. To turn the same comparison into a numerical diagnostic, we need a ratio comparing between-chain scale to within-chain scale.
[definition: Potential Scale Reduction Factor]
For integers $M \ge 2$ and $N \ge 2$, the potential scale reduction factor is a statistic
\begin{align*}
\hat R:\mathcal A_{M,N}\to [0,\infty),
\end{align*}
where $\mathcal A_{M,N}\subset \mathbb R^{M \times N}$ is the set of scalar chain arrays for which the within-chain variance estimate $W$ is positive.
[/definition]
For the classical version, $\hat R$ is the square root of a pooled variance estimate divided by the average within-chain variance. The diagnostic is therefore large when the variation between chain means has not yet become comparable with the variation inside each chain.
The potential scale reduction factor formalizes the idea that chains initialized from dispersed points should eventually become indistinguishable at the scale of the target distribution. If the chain means still disagree, the between-chain variance remains large relative to the within-chain variance. The classical Gelman-Rubin formula specifies this comparison for scalar draws of common length.
[definition: Classical Gelman-Rubin Potential Scale Reduction Factor]
Let $Y_{m,n}$ be the scalar draw from chain $m \in \{1,\dots,M\}$ at iteration $n \in \{1,\dots,N\}$, where $M \ge 2$ and $N \ge 2$. Define
\begin{align*}
\bar Y_m &= \frac{1}{N}\sum_{n=1}^N Y_{m,n}, &
\bar Y &= \frac{1}{M}\sum_{m=1}^M \bar Y_m.
\end{align*}
Define also
\begin{align*}
s_m^2 &= \frac{1}{N-1}\sum_{n=1}^N\left(Y_{m,n}-\bar Y_m\right)^2, &
W &= \frac{1}{M}\sum_{m=1}^M s_m^2.
\end{align*}
Finally set
\begin{align*}
B = \frac{N}{M-1}\sum_{m=1}^M\left(\bar Y_m-\bar Y\right)^2.
\end{align*}
The classical potential scale reduction factor is
\begin{align*}
\hat R = \sqrt{\frac{\widehat{\operatorname{var}}^+(Y)}{W}}, \qquad
\widehat{\operatorname{var}}^+(Y)=\frac{N-1}{N}W+\frac{1}{N}B.
\end{align*}
[/definition]
The construction compares two variance estimates. The within-chain estimate $W$ is small when each chain has locally mixed, while $B$ is large when chain means disagree. The pooled estimate $\widehat{\operatorname{var}}^+(Y)$ combines within-chain variation with the observed dispersion of chain averages, so it exceeds $W$ when chains have not reached a common region. Taking the square root places the diagnostic on the scale of standard deviations. Under stationary runs with a common marginal distribution, finite second moment, and increasing common length $N$, both variance estimates target the same marginal variance for a well-mixed scalar summary, so the ratio is expected to approach $1$.
The common-length scalar-chain assumptions are part of what make this variance decomposition meaningful. If chains have different lengths, the simple formulas for $B$ and $W$ no longer weight chain means on the same scale. If the within-chain variance is zero, the ratio is undefined; a stuck deterministic chain can therefore produce no valid reassurance. Stationarity matters because a drifting chain can have a large within-chain variance that hides a still-moving mean, and common marginal distribution matters because chains confined to different symmetric modes can have similar within-chain variances but incompatible averages for label-dependent summaries. Finite second moments matter as well: with a Cauchy marginal, the empirical variances used by the diagnostic do not stabilize in the way the construction assumes.
This construction does not say that $\hat R$ near $1$ proves convergence of the Markov chain, and it only addresses the scalar summary to which it is applied. A pair of chains can have similar means and variances while occupying different parts of a banana-shaped or multimodal target. The classical statistic is also sensitive to non-normal marginal distributions and can miss slow drift that occurs within a single chain. To make drift visible to the same between-chain comparison, each chain is split into an early and late segment before computing the diagnostic.
[definition: Split Chain]
Given a chain $Y_1,\dots,Y_N$ with even $N$, its split chains are $Y_1,\dots,Y_{N/2}$ and $Y_{N/2+1},\dots,Y_N$.
[/definition]
Splitting converts late-vs-early drift into between-chain disagreement. A chain whose first half and second half occupy different regions will inflate $\hat R$ after splitting even if its full-chain mean resembles other full-chain means. We need rank-normalization as a second improvement because heavy tails and skewed marginals can distort variance comparisons on the raw scale.
[definition: Rank-Normalization]
For $S \ge 1$, rank-normalization is the map
\begin{align*}
T:\{y \in \mathbb R^S: y_i \ne y_j \text{ for } i \ne j\}\to \mathbb R^S
\end{align*}
that sends pooled ordered scalar draws $y=(y_1,\dots,y_S)$ to $z=(z_1,\dots,z_S)$ by
\begin{align*}
z_i = \Phi^{-1}\left(\frac{r_i-3/8}{S+1/4}\right),
\end{align*}
where $r_i$ is the rank of $y_i$ among all $S$ pooled draws and $\Phi$ is the standard normal CDF.
[/definition]
Rank-normalization contributes a common transformed scale on which skewness and heavy tails have less influence on the variance comparison. Splitting contributes a way to turn within-chain drift into disagreement between early and late segments. To use both improvements in a workflow, we need a single statistic that specifies the order of operations: split the chains, pool the draws for ranks, transform to normal scores, and then apply the same between-chain versus within-chain comparison as before.
[definition: Rank-Normalized Split R-Hat]
For $M \ge 2$ post-warmup scalar chains of common even length $N$, the rank-normalized split $\hat R$ is the statistic obtained by replacing each chain by its two split chains of length $N/2$, applying rank-normalization to the pooled $MN$ scalar draws, and computing the classical Gelman-Rubin potential scale reduction factor on the resulting $2M$ transformed split chains.
[/definition]
This definition separates the computation from the interpretation. Large finite-sample values mean that the transformed split chains disagree in location or scale, but the numerical threshold is a workflow convention rather than a theorem. The next question is the baseline question: under stationary split chains with a shared marginal distribution and enough mixing for their empirical means and variances to stabilize, the diagnostic should return to $1$ rather than to some method-dependent constant.
[quotetheorem:7229]
[citeproof:7229]
Each hypothesis protects a specific part of the argument. Continuity prevents ties; if the target has an atom, many draws can share the same value and the rank transformation depends on arbitrary tie-breaking unless a separate randomized rule is specified. A common stationary marginal is needed because the diagnostic compares split-chain rank distributions; two chains started in different transient regimes can have inflated or deflated ratios for reasons unrelated to the target. Splitting is needed because a single long chain that drifts from one region to another can have a full-chain mean close to the target while its first and second halves disagree. Finite transformed second moments are automatic for the normal-score ranks at fixed $S$, but they express the variance-comparison requirement in the asymptotic idealization.
The theorem does not give a universal threshold and does not prove that every posterior functional is accurately estimated when $\hat R$ is near $1$. It is a check for agreement among split chains after a rank-based transformation. Rank-normalized split $\hat R$ asks whether chains agree, but it does not say how accurately a particular posterior summary has been estimated. For that we return to effective sample size, now separating central summaries from tail summaries because quantiles and tail probabilities are often the quantities most affected by poor exploration.
[definition: Bulk And Tail Effective Sample Size]
Let $y=(y_1,\dots,y_S)\in \mathbb R^S$ be pooled scalar draws with no ties, and let $T:\mathbb R^S\to \mathbb R^S$ denote the rank-normalization map on its tie-free domain. The bulk effective sample size is the effective sample size of the scalar sequence $T(y)_1,\dots,T(y)_S$. For $q \in (0,1/2)$ and empirical quantiles $\hat Q_q,\hat Q_{1-q}:\mathbb R^S\to \mathbb R$, the lower-tail and upper-tail effective sample sizes are the effective sample sizes of the indicator sequences
\begin{align*}
\mathbb{1}_{(-\infty,\hat Q_q(y)]}(y_i), \qquad
\mathbb{1}_{[\hat Q_{1-q}(y),\infty)}(y_i), \qquad 1 \le i \le S.
\end{align*}
The tail effective sample size is the minimum of these lower-tail and upper-tail effective sample sizes.
[/definition]
A reliable workflow reports $\hat R$, bulk effective sample size, tail effective sample size, and Monte Carlo standard errors for the summaries used in the scientific conclusion. No single scalar diagnostic can rule out every failure mode.
## Posterior Predictive Checks and Calibration
Even a perfectly converged sampler can faithfully sample a poor statistical model. The next question is whether the fitted model can reproduce the features of the observed data that matter for the analysis. In Bayesian computation this question is answered by simulating replicated data from the fitted model.
[definition: Posterior Predictive Distribution]
Let $y$ denote observed data, let $\theta$ be a parameter with posterior distribution $\pi(\theta \mid y)$, and let $p(\tilde y \mid \theta)$ be the sampling model for replicated data. The posterior predictive distribution is
\begin{align*}
p(\tilde y \mid y)=\int p(\tilde y \mid \theta)\,\pi(\theta \mid y)\,d\theta.
\end{align*}
[/definition]
Posterior predictive simulation asks whether replicated data $\tilde y$ generated from the fitted model resemble the observed data $y$ under selected summaries. The summaries should be chosen to probe features relevant to the modelling task, such as tails, dependence, sparsity, or group-level variation. A posterior predictive check is the formal comparison built from those summaries.
[definition: Posterior Predictive Check]
Let $(\mathcal Y,\mathcal A)$ be the data space, let $y \in \mathcal Y$ be the observed dataset, let $\tilde Y:\Omega\to \mathcal Y$ be replicated data with posterior predictive distribution $p(\tilde y \mid y)$, and let $T:\mathcal Y\to \mathbb R^k$ be a measurable discrepancy statistic. A posterior predictive check is the comparison of $T(y)$ with the distribution of $T(\tilde Y)$.
[/definition]
Posterior predictive checks diagnose model fit, not Markov chain convergence. A chain can have excellent $\hat R$ and effective sample size while the posterior predictive distribution fails to reproduce a scientifically important feature of the data. Tail behaviour is a common place where this distinction becomes visible.
[example: Posterior Predictive Tail Check]
Suppose the fitted model is
\begin{align*}
y_i \mid \mu,\sigma^2 \sim \mathcal N(\mu,\sigma^2), \qquad i=1,\dots,n,
\end{align*}
with posterior draws $(\mu^{(s)},\sigma^{2(s)})$ for $s=1,\dots,S$. For each posterior draw, generate a replicated dataset by sampling
\begin{align*}
\tilde y_i^{(s)} \mid \mu^{(s)},\sigma^{2(s)} \sim \mathcal N(\mu^{(s)},\sigma^{2(s)}), \qquad i=1,\dots,n.
\end{align*}
Use the discrepancy statistic $T(z)=\max_{1\le i\le n} z_i$, so the observed discrepancy is
\begin{align*}
T(y)=\max_{1\le i\le n} y_i
\end{align*}
and the replicated discrepancies are
\begin{align*}
T(\tilde y^{(s)})=\max_{1\le i\le n}\tilde y_i^{(s)}, \qquad s=1,\dots,S.
\end{align*}
A numerical tail check can record the posterior predictive tail proportion
\begin{align*}
\hat p=\frac{1}{S}\sum_{s=1}^S \mathbb 1_{\{T(\tilde y^{(s)})\ge T(y)\}}.
\end{align*}
If $T(y)$ is larger than nearly every $T(\tilde y^{(s)})$, then only a small fraction of the indicators in this average equal $1$, so $\hat p$ is close to $0$. This means the fitted Gaussian model rarely generates replicated maxima as large as the observed maximum, indicating that its posterior predictive right tail is too light for this feature of the data.
[/example]
A posterior predictive check conditions on the observed dataset, so it is mainly a model criticism tool. When validating a new computational pipeline, we also want a pre-data calibration test in which the true parameter value is known because it was simulated. We need simulation-based calibration for this broader test of the whole inferential procedure.
[definition: Simulation-Based Calibration]
Let $(\Theta,\mathcal T)$ be a parameter space, let $(\mathcal Y,\mathcal A)$ be a data space, let $\pi$ be a prior distribution on $\Theta$, and let $P_\theta$ be the sampling distribution on $\mathcal Y$ for each $\theta \in \Theta$. For a scalar measurable summary $g:\Theta\to \mathbb R$ and posterior draw count $L$, define the rank map
\begin{align*}
R_L:\mathbb R^{L+1}\to \{1,\dots,L+1\}
\end{align*}
by taking $R_L(u_0,u_1,\dots,u_L)$ to be the rank of $u_0$ among $u_0,u_1,\dots,u_L$, using a specified randomized tie-breaking rule if ties occur. Simulation-based calibration repeatedly samples $\theta_s\sim \pi$, samples $y_s\sim P_{\theta_s}$, runs the posterior algorithm on $y_s$ to obtain draws $\theta_{s,1},\dots,\theta_{s,L}$, and records
\begin{align*}
R_L(g(\theta_s),g(\theta_{s,1}),\dots,g(\theta_{s,L})).
\end{align*}
[/definition]
If the model and computation are calibrated, the true simulated parameter should look like a random posterior draw. Systematic rank deviations reveal bias, underdispersion, overdispersion, or coding errors in the inference algorithm. We need the uniform rank theorem because it gives the mathematical baseline for interpreting SBC rank histograms.
[quotetheorem:6295]
[citeproof:6295]
The assumptions describe the exact calibration experiment. If posterior draws are generated by a biased approximation, such as a variational distribution that is too concentrated, the simulated truth is no longer exchangeable with the reported draws and the rank histogram typically becomes U-shaped. If the posterior draws are Markov chain states with strong dependence and the chain has not mixed, the conditional i.i.d. assumption fails; ranks can then reflect sampler stickiness rather than the posterior law. Continuity prevents ties, which otherwise require a randomized or mid-rank convention: a discrete Bernoulli parameter summary can pile many values at the same rank. Prior-predictive simulation is also essential. If $\theta$ is fixed by hand or data are drawn from a different data-generating process, the original $\theta$ need not have the posterior distribution conditional on $y$, so the exchangeability proof does not start.
The theorem does not say that a finite rank histogram must look flat, and it does not certify performance on datasets far from the simulated prior-predictive regime. SBC is powerful because it tests both the statistical model implementation and the computation. Its conclusion is limited by the simulations used: passing SBC for a simple prior-predictive regime does not guarantee good behaviour for every dataset encountered in applications.
## Multimodality and Workflow Failures
The hardest convergence failures occur when the target has separated regions of high probability. A sampler may generate smooth traces within one region, estimate local expectations accurately, and still miss another region that changes the global answer. We therefore need language for targets whose geometry creates separated regions relative to the sampler.
[definition: Practical Multimodality]
A probability distribution on a state space has practical multimodality for a specified sampler and computational budget when its density or mass function has multiple regions of high probability and the sampler's transition behaviour rarely moves between those regions on that time scale.
[/definition]
Practical multimodality is partly a property of the target and partly a property of the algorithm. Two regions are practically separated if the Markov transition almost never moves between them on the computational time scale, even when a different algorithm might move between them regularly. The funnel distribution illustrates a related geometric failure in which a narrow neck rather than distinct symmetric modes creates poor exploration.
[example: Funnel Distribution]
Consider the hierarchical target with scalar $v$ and vector $x=(x_1,\dots,x_d)\in \mathbb R^d$ given by
\begin{align*}
v \sim \mathcal N(0,9), \qquad x_i \mid v \sim \mathcal N(0,e^v),\quad i=1,\dots,d.
\end{align*}
For fixed $v$, the conditional mean and variance of each coordinate are
\begin{align*}
\mathbb E[x_i\mid v]=0, \qquad \operatorname{Var}(x_i\mid v)=e^v.
\end{align*}
Hence
\begin{align*}
\mathbb E[\|x\|^2\mid v]=\mathbb E\left[\sum_{i=1}^d x_i^2\mid v\right]=\sum_{i=1}^d \mathbb E[x_i^2\mid v]=\sum_{i=1}^d e^v=d e^v.
\end{align*}
Thus the conditional radius of $x$ is on the scale $\sqrt{d e^v}=\sqrt d\,e^{v/2}$: if $v=-10$, this scale is $\sqrt d\,e^{-5}$, while if $v=4$, this scale is $\sqrt d\,e^2$.
The joint density also shows the same funnel geometry. Using the normal density formula,
\begin{align*}
p(v,x)\propto \exp\left(-\frac{v^2}{18}\right)\prod_{i=1}^d \left(e^{-v/2}\exp\left(-\frac{x_i^2}{2e^v}\right)\right).
\end{align*}
Multiplying the factors gives
\begin{align*}
p(v,x)\propto \exp\left(-\frac{v^2}{18}\right)\exp\left(-\frac{dv}{2}\right)\exp\left(-\frac{1}{2e^v}\sum_{i=1}^d x_i^2\right).
\end{align*}
When $v$ is very negative, $e^v$ is small, so the term $\sum_i x_i^2/(2e^v)$ strongly penalizes any $x$ away from $0$; this is the narrow neck. When $v$ is positive, $e^v$ is large, so the same penalty is much weaker and the conditional distribution spreads out. A sampler that moves locally must pass between these very different scales, so trace plots and tail effective sample sizes for $v$ are essential diagnostics.
[/example]
The funnel is not always multimodal, but it is a canonical example of geometry that defeats local exploration. Reparameterization often matters more than increasing the number of iterations. Mixture models provide a different failure mode: the posterior may have genuine symmetric modes because the labels of exchangeable components carry no intrinsic meaning.
[example: Label Switching In Mixture Models]
Consider a two-component mixture model with weights $w=(w_1,w_2)$, component parameters $\theta=(\theta_1,\theta_2)$, and sampling density
\begin{align*}
p(y_i\mid w,\theta)=w_1 f(y_i\mid \theta_1)+w_2 f(y_i\mid \theta_2).
\end{align*}
Let the label-swapped parameters be $w'=(w_2,w_1)$ and $\theta'=(\theta_2,\theta_1)$. For each observation,
\begin{align*}
p(y_i\mid w',\theta')=w_2 f(y_i\mid \theta_2)+w_1 f(y_i\mid \theta_1)=w_1 f(y_i\mid \theta_1)+w_2 f(y_i\mid \theta_2)=p(y_i\mid w,\theta).
\end{align*}
Therefore the full likelihood is unchanged:
\begin{align*}
\prod_{i=1}^n p(y_i\mid w',\theta')=\prod_{i=1}^n p(y_i\mid w,\theta).
\end{align*}
If the prior is exchangeable, then
\begin{align*}
\pi(w',\theta')=\pi(w,\theta),
\end{align*}
so the unnormalized posterior satisfies
\begin{align*}
\pi(w',\theta'\mid y)\propto \pi(w',\theta')\prod_{i=1}^n p(y_i\mid w',\theta')=\pi(w,\theta)\prod_{i=1}^n p(y_i\mid w,\theta)\propto \pi(w,\theta\mid y).
\end{align*}
Thus every posterior region has a label-swapped region with equal posterior density. A chain that remains in the region where, say, $\theta_1<\theta_2$ can estimate the local mean of $\theta_1$ with a small Monte Carlo standard error, but the symmetric posterior also assigns equal mass to the swapped region where the roles of $\theta_1$ and $\theta_2$ are reversed. The label-specific summaries $\mathbb E[\theta_1\mid y]$ and $\mathbb E[\theta_2\mid y]$ therefore depend on arbitrary labels rather than intrinsic mixture components, while label-invariant summaries such as $\{\theta_1,\theta_2\}$ or the mixture density remain meaningful.
[/example]
Label switching shows that some diagnostic failures are also inferential failures. The right response may be to use label-invariant summaries, impose identifiability constraints with care, or redesign the model. Even after the modelling target is appropriate, workflow reporting must attach Monte Carlo standard errors to the actual summaries used, including quantiles.
[example: Monte Carlo Standard Error For A Posterior Quantile]
Let $Y_1,\dots,Y_N$ be the post-warmup draws of a scalar parameter, and suppose $N=Am$ so the draws are split into $A$ consecutive batches of length $m$. Write $\hat q_{0.95}$ for the empirical $0.95$ quantile of all $N$ draws, and write $\hat q_{0.95,j}$ for the empirical $0.95$ quantile computed from batch $j$.
The average of the batch quantiles is
\begin{align*}
\bar q_{\mathrm{batch}}=\frac{1}{A}\sum_{j=1}^A \hat q_{0.95,j}.
\end{align*}
Their sample variance is
\begin{align*}
s_q^2=\frac{1}{A-1}\sum_{j=1}^A(\hat q_{0.95,j}-\bar q_{\mathrm{batch}})^2.
\end{align*}
Because each batch contains $m$ dependent draws, $s_q^2$ estimates the Monte Carlo variance of a length-$m$ quantile estimator. The full estimator uses $N=Am$ draws, so the corresponding variance estimate is scaled down by the number of batches:
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{MC}}(\hat q_{0.95})=\frac{s_q^2}{A}.
\end{align*}
Taking the square root gives the batch-quantile Monte Carlo standard error
\begin{align*}
\widehat{\operatorname{MCSE}}(\hat q_{0.95})=\sqrt{\frac{s_q^2}{A}}.
\end{align*}
For example, if $A=20$ and the sample standard deviation of the $20$ batch quantiles is $s_q=0.14$, then
\begin{align*}
\widehat{\operatorname{MCSE}}(\hat q_{0.95})=\sqrt{\frac{0.14^2}{20}}=\frac{0.14}{\sqrt{20}}\approx 0.0313.
\end{align*}
This number estimates numerical uncertainty in the reported quantile caused by the finite Monte Carlo run; it is not the posterior uncertainty of the scalar parameter itself.
[/example]
Quantiles and tail probabilities usually require larger effective sample sizes than posterior means. This is why modern reports distinguish bulk and tail effective sample size and attach Monte Carlo standard errors to the particular estimates being reported. The complete workflow combines graphical, numerical, and modelling diagnostics rather than relying on any single threshold.
[explanation: A Minimal Monte Carlo Workflow]
A defensible workflow begins by running multiple chains from dispersed initializations, with warmup long enough for adaptation and transient behaviour to settle. The first inspection is graphical: trace plots for important scalar summaries, energy or acceptance diagnostics when the algorithm provides them, and pair plots for parameters suspected to be strongly dependent.
The second inspection is numerical. Report rank-normalized split $\hat R$, bulk effective sample size, tail effective sample size, and Monte Carlo standard errors for posterior means, probabilities, and quantiles used in the analysis. Thresholds such as $\hat R < 1.01$ are useful conventions, but they are not mathematical guarantees.
The third inspection is modelling-oriented. Use posterior predictive checks to compare replicated and observed data, and use simulation-based calibration when validating a new model or implementation. When diagnostics disagree, trust the most specific failure signal: a tail effective sample size warning for a reported tail probability matters even if the posterior mean has high effective sample size.
[/explanation]
The central lesson is that convergence is not a single event. It is a collection of claims about particular summaries, particular chains, particular model checks, and a particular computational budget. Good Monte Carlo practice makes those claims explicit and attaches an error estimate to every numerical conclusion that matters.
Convergence diagnostics tell us how to judge a run after the fact, but many modern targets also demand more efficient proposals before the fact. Chapter 8 brings in gradient information and approximate dynamics, moving from generic Markov chain constructions to Hamiltonian and Langevin methods that exploit the geometry of the target.
# 8. Hamiltonian and Langevin Monte Carlo
This chapter studies Markov chain Monte Carlo methods that use differential structure of the target density. Chapters 4 and 5 built samplers from proposals, accept-reject corrections, and Markov chain invariance; here the proposal itself is guided by gradients and by approximate physical dynamics. The central question is how to move far through a high-dimensional distribution while keeping the Metropolis correction large enough for useful computation.
The target distribution is written as $\pi(dx) = \pi(x)\,d\mathcal L^d(x)$ on $\mathbb R^d$, with density known up to a normalizing constant. We write
\begin{align*}
U(x) = -\log \pi(x) + C.
\end{align*}
The additive constant is irrelevant. Langevin methods use $\nabla \log \pi(x) = -\nabla U(x)$ to bias local motion toward regions of high probability, while Hamiltonian Monte Carlo augments the state with momentum so that the chain follows long, approximately energy-preserving trajectories.
## Langevin Diffusions and Discretization Bias
A random-walk Metropolis proposal treats every direction near the current state as equally plausible. In a target with strong curvature, this wastes proposals in directions where the density changes rapidly and moves too slowly in flatter directions. The first remedy is to use a diffusion whose infinitesimal drift points along the score $\nabla \log \pi$.
[definition: Overdamped Langevin Diffusion]
Let $\pi$ be a positive $C^1$ probability density on $\mathbb R^d$. The overdamped Langevin diffusion with invariant density $\pi$ is the stochastic process $(X_t)_{t \ge 0}$ satisfying
\begin{align*}
dX_t = \frac{1}{2}\nabla \log \pi(X_t)\,dt + dW_t,
\end{align*}
where $(W_t)_{t \ge 0}$ is standard [Brownian motion](/page/Brownian%20Motion) in $\mathbb R^d$.
[/definition]
The factor $1/2$ matches the Brownian variance convention in the generator. The drift pulls the process uphill in log density, while the Brownian term prevents collapse to a mode and supplies exploration. The next question is why this particular drift-noise balance leaves the desired density unchanged rather than creating a different stationary law.
[quotetheorem:7230]
[citeproof:7230]
This theorem gives an exact continuous-time sampler, but simulation requires time discretization. The decay hypothesis is not cosmetic: if boundary terms survive [integration by parts](/theorems/210), the outward probability flux can destroy stationarity. The simplest discretization is Euler-Maruyama, which produces a proposal by taking one score-guided step and adding Gaussian noise.
[definition: Unadjusted Langevin Algorithm]
For step size $h>0$, the unadjusted Langevin algorithm generates a Markov chain $(X_k)_{k \ge 0}$ by
\begin{align*}
X_{k+1} = X_k + \frac{h}{2}\nabla \log \pi(X_k) + \sqrt{h}\,\xi_k,
\end{align*}
where $\xi_1,\xi_2,\dots$ are i.i.d. random vectors with distribution $\mathcal N(0,I_d)$.
[/definition]
This chain is cheap and often effective, but it targets the invariant distribution of the discretized chain rather than exactly $\pi$. The difference between that invariant distribution and $\pi$ is called discretization bias.
[example: Discretization Bias for a Gaussian Target]
Let $\pi=\mathcal N(0,1)$ on $\mathbb R$. Since
\begin{align*}
\log \pi(x)=-\frac12 x^2-\frac12\log(2\pi),
\end{align*}
we have $\nabla \log \pi(x)=-x$. The unadjusted Langevin update therefore becomes
\begin{align*}
X_{k+1}=X_k+\frac h2(-X_k)+\sqrt h\,\xi_k.
\end{align*}
Equivalently,
\begin{align*}
X_{k+1}=\left(1-\frac h2\right)X_k+\sqrt h\,\xi_k.
\end{align*}
Write $a=1-h/2$. If this AR(1) chain has a stationary distribution with mean $0$ and variance $\sigma^2$, then $\xi_k$ is independent of $X_k$ and $\operatorname{Var}(\xi_k)=1$, so
\begin{align*}
\sigma^2=\operatorname{Var}(aX_k+\sqrt h\,\xi_k)=a^2\sigma^2+h.
\end{align*}
Thus
\begin{align*}
(1-a^2)\sigma^2=h.
\end{align*}
Substituting $a=1-h/2$ gives
\begin{align*}
1-a^2=1-\left(1-\frac h2\right)^2.
\end{align*}
Expanding the square,
\begin{align*}
1-\left(1-\frac h2\right)^2=1-\left(1-h+\frac{h^2}{4}\right)=h-\frac{h^2}{4}=h\left(1-\frac h4\right).
\end{align*}
Therefore
\begin{align*}
\sigma^2=\frac{h}{h(1-h/4)}=\frac{1}{1-h/4}.
\end{align*}
The condition $0<h<4$ is exactly $|a|<1$, which is the stability condition for this AR(1) recursion. Since $1/(1-h/4)\ne 1$ for every $h>0$, the Euler discretization changes the Gaussian target variance, with the bias disappearing only in the limit $h\to 0$.
[/example]
The example shows that gradient drift alone does not remove numerical bias. The next problem is to keep the useful Langevin proposal while adding the same exactness mechanism that made Metropolis-Hastings reliable in earlier chapters.
[definition: Metropolis-Adjusted Langevin Algorithm]
Let $h>0$ and define the proposal density
\begin{align*}
q_h(x,y) = (2\pi h)^{-d/2}\exp\left(-\frac{1}{2h}\left|y-x-\frac{h}{2}\nabla\log\pi(x)\right|^2\right).
\end{align*}
The Metropolis-adjusted Langevin algorithm proposes $Y \sim q_h(X_k,\cdot)$ and sets $X_{k+1}=Y$ with probability
\begin{align*}
\alpha(x,y)=\min\left\{1,\frac{\pi(y)q_h(y,x)}{\pi(x)q_h(x,y)}\right\},
\end{align*}
otherwise setting $X_{k+1}=X_k$.
[/definition]
The proposal is no longer symmetric, because the drift at $x$ and the drift at $y$ are different.
This asymmetry is the main correctness obstruction. A drifted proposal may point toward high-density regions, but the discretized move by itself generally has the wrong invariant distribution. The theorem below is needed to check that the Metropolis correction using both forward and reverse proposal densities restores invariance of the target law.
[quotetheorem:7231]
[citeproof:7231]
MALA is a local method, so its efficiency still depends on tuning the step size. The theorem proves invariance, not rapid mixing or geometric ergodicity; for heavy-tailed targets, poorly scaled Langevin proposals can even fail to be stable. Too small a step behaves like a slow diffusion; too large a step produces inaccurate drift moves and many rejections. Hamiltonian Monte Carlo changes the geometry of proposals by adding momentum and following deterministic trajectories before applying the Metropolis correction.
## Hamiltonian Dynamics and Leapfrog Integration
The main difficulty in high dimension is that local random walks diffuse slowly across long posterior valleys. Hamiltonian dynamics replaces diffusive wandering by motion along level sets of an energy function. The method introduces an auxiliary momentum variable so that the sampler can travel across the typical set before randomization is refreshed.
[definition: Hamiltonian for HMC]
Let $\pi(x) \propto e^{-U(x)}$ be a target density on $\mathbb R^d$, and let $M \in \mathbb R^{d \times d}$ be symmetric positive definite. The Hamiltonian on position-momentum space $\mathbb R^d \times \mathbb R^d$ is
\begin{align*}
H(x,p)=U(x)+K(p), \qquad K(p)=\frac12 p^\top M^{-1}p.
\end{align*}
[/definition]
Under this joint density, $x$ has the desired target distribution and $p \sim \mathcal N(0,M)$ independently. The mass matrix $M$ sets the scale and orientation of momentum. The next step is to specify the deterministic motion that uses this energy to convert gradients into long-distance proposals.
[definition: Hamiltonian Dynamics]
Hamiltonian dynamics for $H(x,p)=U(x)+\frac12 p^\top M^{-1}p$ is the ordinary differential equation defined by the two relations
\begin{align*}
\dot{x} = M^{-1}p.
\end{align*}
\begin{align*}
\dot{p} = -\nabla U(x).
\end{align*}
[/definition]
These equations preserve the Hamiltonian in exact arithmetic, so the joint density proportional to $e^{-H(x,p)}$ remains unchanged along the flow. For MCMC, energy preservation is not enough; the proposal map must also transform volume in a controlled way so that a Metropolis ratio can be written without an extra Jacobian factor.
[quotetheorem:7232]
[citeproof:7232]
Exact Hamiltonian flow is not available for most posteriors. The $C^2$ hypothesis supports the mixed-partial cancellation and differentiability of the flow Jacobian; with rougher Hamiltonians, the volume-preservation argument needs separate justification. HMC therefore uses a numerical integrator that is reversible and volume-preserving by construction; the remaining energy error is corrected by Metropolis acceptance.
[definition: Leapfrog Integrator]
For step size $\varepsilon>0$, mass matrix $M$, and potential $U$, one leapfrog step maps $(x,p)$ to $(x',p')$ by first setting
\begin{align*}
p_{1/2} = p - \frac{\varepsilon}{2}\nabla U(x).
\end{align*}
It then sets
\begin{align*}
x' = x + \varepsilon M^{-1}p_{1/2}.
\end{align*}
Finally it sets
\begin{align*}
p' = p_{1/2} - \frac{\varepsilon}{2}\nabla U(x').
\end{align*}
[/definition]
The half-step, full-step, half-step structure is not cosmetic. It makes the map reversible after momentum flip and preserves volume because it is a composition of shear maps with unit Jacobian determinant. These two structural properties are exactly what the next detailed-balance argument requires.
[quotetheorem:7233]
[citeproof:7233]
This result is the algorithmic core of HMC. The integrator may accumulate energy error, but reversibility and volume preservation allow the Metropolis test to remove bias from the numerical approximation.
[example: Correlated Gaussian Posterior]
Take an orthonormal eigenbasis of $\Sigma$ and write $u=Q^\top x$, where
\begin{align*}
Q^\top \Sigma Q=\operatorname{diag}(100,1).
\end{align*}
In these coordinates the Gaussian density is proportional to
\begin{align*}
\exp\left(-\frac12 u^\top \operatorname{diag}(1/100,1)u\right)
=
\exp\left(-\frac{u_1^2}{200}-\frac{u_2^2}{2}\right).
\end{align*}
Thus the target standard deviations are $10$ in the $u_1$ direction and $1$ in the $u_2$ direction, so the density contours are elongated ellipses.
For a random-walk proposal $u'=u+\eta$ with $\eta\sim \mathcal N(0,s^2I_2)$, the same scale $s$ is used in both coordinates. If $s$ is comparable to $10$, then the narrow coordinate changes by an amount comparable to $10$ even though its target standard deviation is $1$, so many proposals land far outside the narrow part of the ellipse. To keep the narrow coordinate stable one takes $s$ comparable to $1$. Then motion in the long direction is diffusive: after $n$ accepted steps with independent increments of variance $s^2$ in the $u_1$ coordinate, the accumulated variance is $ns^2$, so reaching the target variance scale $100$ requires $ns^2\approx 100$.
For HMC, the Gaussian potential is
\begin{align*}
U(x)=\frac12 x^\top \Sigma^{-1}x+C.
\end{align*}
Choose $M=\Sigma^{-1}$, so $M^{-1}=\Sigma$. Hamiltonian dynamics gives
\begin{align*}
\dot x=\Sigma p.
\end{align*}
\begin{align*}
\dot p=-\Sigma^{-1}x.
\end{align*}
In eigen-coordinates $u=Q^\top x$ and $r=Q^\top p$, with $D=\operatorname{diag}(100,1)$, these become
\begin{align*}
\dot u=Dr.
\end{align*}
\begin{align*}
\dot r=-D^{-1}u.
\end{align*}
Differentiating $\dot u=Dr$ gives
\begin{align*}
\ddot u=D\dot r.
\end{align*}
Substituting $\dot r=-D^{-1}u$ gives
\begin{align*}
\ddot u=D(-D^{-1}u)=-u.
\end{align*}
Therefore each coordinate satisfies $\ddot u_j=-u_j$, so both eigendirections oscillate on the same time scale. The mass matrix removes the linear scale mismatch while still allowing large motion in the long direction, which is why HMC can follow coherent elliptical trajectories instead of accumulating many small random-walk displacements.
[/example]
The Gaussian example is favourable because the geometry is global and quadratic. Real posteriors have curvature that changes across the state space, so tuning and diagnostics become part of the algorithm rather than an afterthought.
## No-U-Turn Sampling and Adaptive Geometry
HMC has two tuning parameters that interact strongly: the step size $\varepsilon$ and the number of leapfrog steps $L$. If $L$ is too small, the method behaves like a local proposal; if $L$ is too large, the trajectory may turn back toward where it started, wasting gradient evaluations. The No-U-Turn Sampler chooses a trajectory length automatically by detecting when simulated motion begins to reverse.
[definition: No-U-Turn Criterion]
Let $(x^-,p^-)$ and $(x^+,p^+)$ be the two endpoints of a leapfrog trajectory in position-momentum space. The trajectory satisfies the no-U-turn condition while
\begin{align*}
(x^+ - x^-) \cdot M^{-1}p^- \ge 0
\end{align*}
and
\begin{align*}
(x^+ - x^-) \cdot M^{-1}p^+ \ge 0.
\end{align*}
A U-turn is declared when at least one of these inequalities fails.
[/definition]
The criterion measures whether the current momenta still point away from the opposite end of the trajectory.
NUTS builds a balanced binary tree of leapfrog states and stops expanding when this geometric condition fails or when a numerical divergence is detected. Once trajectory length is handled automatically, the next geometric question is how to scale momentum so that motion is well conditioned.
[definition: Mass Matrix Adaptation]
Mass matrix adaptation estimates a symmetric positive definite matrix $M$ during warm-up and uses it as the covariance of the momentum distribution $p \sim \mathcal N(0,M)$ in HMC.
[/definition]
Mass adaptation is a preconditioning step. In a posterior with strong linear correlation, a well-chosen $M$ makes the kinetic energy match the posterior scale so that Hamiltonian trajectories are less distorted. The remaining tuning question is how accurately the trajectory should be resolved by the numerical integrator.
[definition: Step-Size Adaptation]
Step-size adaptation chooses the leapfrog step size $\varepsilon$ during warm-up, often by dual averaging, to target a desired average Metropolis acceptance probability.
[/definition]
A smaller step size reduces energy error but increases the number of gradient evaluations required for a fixed integration time. A larger step size moves faster per leapfrog step but can create low acceptance probabilities or divergent transitions.
[example: Bayesian Logistic Regression with HMC]
Let observations $(y_i,z_i)_{i=1}^n$ satisfy $y_i \in \{0,1\}$ and $z_i \in \mathbb R^d$, and suppose
\begin{align*}
\mathbb P(y_i=1\mid \beta)=\frac{e^{z_i\cdot\beta}}{1+e^{z_i\cdot\beta}}, \qquad \mathbb P(y_i=0\mid \beta)=\frac{1}{1+e^{z_i\cdot\beta}}.
\end{align*}
For a single observation, the likelihood factor is therefore
\begin{align*}
\left(\frac{e^{z_i\cdot\beta}}{1+e^{z_i\cdot\beta}}\right)^{y_i}\left(\frac{1}{1+e^{z_i\cdot\beta}}\right)^{1-y_i}.
\end{align*}
Combining the powers gives
\begin{align*}
\left(e^{z_i\cdot\beta}\right)^{y_i}\left(1+e^{z_i\cdot\beta}\right)^{-y_i}\left(1+e^{z_i\cdot\beta}\right)^{-(1-y_i)}=e^{y_i z_i\cdot\beta}\left(1+e^{z_i\cdot\beta}\right)^{-1}.
\end{align*}
Taking minus the logarithm of this factor gives
\begin{align*}
-\log\left(e^{y_i z_i\cdot\beta}\left(1+e^{z_i\cdot\beta}\right)^{-1}\right)=\log(1+e^{z_i\cdot\beta})-y_i z_i\cdot\beta.
\end{align*}
The Gaussian prior $\beta\sim \mathcal N(0,\tau^2 I_d)$ has density proportional to $\exp(-|\beta|^2/(2\tau^2))$, so its contribution to the potential is $|\beta|^2/(2\tau^2)$ up to an additive constant. Hence the posterior potential is
\begin{align*}
U(\beta)=\sum_{i=1}^n \left\{\log(1+e^{z_i\cdot \beta})-y_i z_i\cdot\beta\right\}+\frac{1}{2\tau^2}|\beta|^2 + C.
\end{align*}
Its gradient is explicit. For each $i$, the chain rule gives
\begin{align*}
\nabla_\beta \log(1+e^{z_i\cdot\beta})=\frac{1}{1+e^{z_i\cdot\beta}}e^{z_i\cdot\beta}z_i=\frac{e^{z_i\cdot\beta}}{1+e^{z_i\cdot\beta}}z_i.
\end{align*}
Also,
\begin{align*}
\nabla_\beta(y_i z_i\cdot\beta)=y_i z_i.
\end{align*}
And
\begin{align*}
\nabla_\beta\frac{1}{2\tau^2}|\beta|^2=\frac{1}{\tau^2}\beta.
\end{align*}
Therefore
\begin{align*}
\nabla U(\beta)=\sum_{i=1}^n \left(\frac{e^{z_i\cdot\beta}}{1+e^{z_i\cdot\beta}}-y_i\right)z_i+\frac{1}{\tau^2}\beta.
\end{align*}
If $s_i(\beta)=e^{z_i\cdot\beta}/(1+e^{z_i\cdot\beta})$, then differentiating the $i$th gradient contribution gives the Hessian term
\begin{align*}
\nabla_\beta\{s_i(\beta)z_i\}=s_i(\beta)(1-s_i(\beta))z_i z_i^\top.
\end{align*}
Thus
\begin{align*}
\nabla^2 U(\beta)=\sum_{i=1}^n s_i(\beta)(1-s_i(\beta))z_i z_i^\top+\frac{1}{\tau^2}I_d.
\end{align*}
For any $v\in\mathbb R^d$,
\begin{align*}
v^\top \nabla^2U(\beta)v=\sum_{i=1}^n s_i(\beta)(1-s_i(\beta))(z_i\cdot v)^2+\frac{1}{\tau^2}|v|^2\ge 0.
\end{align*}
So the potential is convex, and HMC can use the available gradient $\nabla U(\beta)$ even though the posterior normalizing constant is hidden in the irrelevant additive constant $C$.
[/example]
This logistic regression example has relatively benign geometry. A more difficult case is a hierarchical posterior in which the scale of one variable changes the conditional scale of many others.
[example: Neal's Funnel]
Neal's funnel has a scalar variable $v$ and conditionally Gaussian variables $x_1,\dots,x_m$ with
\begin{align*}
v \sim \mathcal N(0,9), \qquad x_i \mid v \sim \mathcal N(0,e^v).
\end{align*}
The conditional standard deviation of each $x_i$ is $e^{v/2}$, because the conditional variance is $e^v$. Thus $v=-10$ gives conditional standard deviation $e^{-5}$, while $v=10$ gives conditional standard deviation $e^5$; the same model therefore contains both a very narrow neck and a very wide mouth.
The joint density, up to a normalizing constant, is obtained by multiplying the marginal density of $v$ and the conditional densities of the $x_i$:
\begin{align*}
\pi(v,x) \propto \exp\left(-\frac{v^2}{18}\right)\prod_{i=1}^m \left(e^v\right)^{-1/2}\exp\left(-\frac{x_i^2}{2e^v}\right).
\end{align*}
Since $\prod_{i=1}^m (e^v)^{-1/2}=e^{-mv/2}$ and $\prod_{i=1}^m \exp(-x_i^2/(2e^v))=\exp(-\sum_{i=1}^m x_i^2/(2e^v))$, this becomes
\begin{align*}
\pi(v,x) \propto \exp\left(-\frac{v^2}{18}-\frac{mv}{2}-\frac{1}{2e^v}\sum_{i=1}^m x_i^2\right).
\end{align*}
Therefore the potential $U=-\log \pi+C$ is
\begin{align*}
U(v,x)=\frac{v^2}{18}+\frac{mv}{2}+\frac{e^{-v}}{2}\sum_{i=1}^m x_i^2+C.
\end{align*}
The gradients show where the difficult geometry enters. For each $i$,
\begin{align*}
\frac{\partial U}{\partial x_i}(v,x)=e^{-v}x_i.
\end{align*}
Also,
\begin{align*}
\frac{\partial U}{\partial v}(v,x)=\frac{v}{9}+\frac{m}{2}-\frac{e^{-v}}{2}\sum_{i=1}^m x_i^2.
\end{align*}
When $v$ is very negative, $e^{-v}$ is very large, so even moderate changes in $x_i$ produce large changes in the gradient. A leapfrog step size that is adequate in the wide region can therefore be too large in the neck, causing large energy error and possible divergent transitions. This example shows that gradient information helps guide motion, but it does not remove severe geometry caused by changing conditional scales.
[/example]
High-dimensional performance is governed less by isolated modes and more by the typical set: the region where most probability mass lies. In many dimensions, the mode may be far from a representative draw, while Hamiltonian motion is useful because it can follow the local geometry of this typical set.
[remark: Typical Set Geometry]
For a product-like target in $d$ dimensions, most probability mass lies in a shell rather than at the mode. Random-walk proposals need many small accepted moves to traverse this shell, while HMC uses momentum to maintain direction across many gradient evaluations. This is why tuning is judged by effective sample size per gradient evaluation rather than by acceptance probability alone.
[/remark]
The practical workflow is therefore: choose a differentiable target density, run warm-up to adapt the mass matrix and step size, monitor divergences and effective sample size, and then use post-warm-up draws for Monte Carlo estimation. The Metropolis correction protects the invariant distribution, but it does not guarantee that the chain has explored all relevant geometry in finite computation.
Gradient-based MCMC extends the reach of Markov chain methods, yet it still assumes a fixed target distribution. Chapter 9 moves to settings where the target evolves over time or must be built sequentially, using weighted particles, resampling, and mutation to track a changing sequence of distributions.
# 9. Sequential Monte Carlo and Particle Filters
Sequential Monte Carlo methods address the situation where the target distribution changes over time or is too difficult to sample from in one step. The central problem is to represent a sequence of probability distributions by a moving cloud of weighted samples, while controlling the collapse of weights diagnosed for static importance sampling in Chapter 2. This chapter assumes Chapter 2's importance sampling weights, Chapter 4's Markov kernels, Chapter 6's conditional distributions, and Chapter 5's Metropolis-Hastings correction. It connects those ideas into algorithms for filtering, smoothing, likelihood estimation, and static Bayesian inference.
The guiding model is a latent Markov process observed through noisy data. At time $t$, the object of interest is often the filtering distribution of the hidden state given observations so far; as new observations arrive, the algorithm must update the approximation without restarting from scratch. Sequential Monte Carlo solves this by alternating propagation, weighting, and resampling.
## Sequential Importance Sampling and Particle Approximations
The first question is how to update an importance sampler when the target arrives as a sequence rather than as a single distribution. Recomputing an independent importance sampler at each time wastes the Markov structure, while reusing old samples without correction ignores the new observation. Sequential importance sampling keeps trajectories alive and updates their weights recursively, so the model must first specify the latent dynamics, observation mechanism, and posterior path distribution.
[definition: State Space Model]
Let $(E,\mathcal E,\lambda)$ be a state space with reference measure $\lambda$, and let $(F,\mathcal F,\nu)$ be an observation space with reference measure $\nu$. A state space model consists of an initial density $\mu_0:E\to[0,\infty)$ with respect to $\lambda$, transition densities $f_t:E\times E\to[0,\infty)$ with $f_t(x_t\mid x_{t-1})$ denoting the $\lambda$-density of $X_t$ given $X_{t-1}=x_{t-1}$, and observation densities $g_t:F\times E\to[0,\infty)$ with $g_t(y_t\mid x_t)$ denoting the $\nu$-density of $Y_t$ given $X_t=x_t$. For fixed observations $y_{0:t}=(y_0,\dots,y_t)\in F^{t+1}$, the unnormalized joint latent path density $\gamma_t:E^{t+1}\to[0,\infty)$ is
\begin{align*}
\gamma_t(x_{0:t}) = \mu_0(x_0)g_0(y_0\mid x_0)\prod_{s=1}^{t} f_s(x_s\mid x_{s-1})g_s(y_s\mid x_s).
\end{align*}
[/definition]
The normalizing constant
\begin{align*}
Z_t=\int_{E^{t+1}} \gamma_t(x_{0:t})\,d\lambda^{\otimes(t+1)}(x_{0:t})
\end{align*}
is the marginal likelihood of the observations $y_{0:t}$. When $0<Z_t<\infty$, the normalized path posterior is the probability measure
\begin{align*}
\pi_t(dx_{0:t})
=\frac{\gamma_t(x_{0:t})}{Z_t}\,\lambda^{\otimes(t+1)}(dx_{0:t}).
\end{align*}
Filtering often needs only the final coordinate under $\pi_t$, so we introduce a marginal target that records the online inferential objective directly.
[definition: Filtering Distribution]
For a state space model and fixed observations $y_{0:t}$, the filtering distribution at time $t$ is the probability measure $\eta_t:\mathcal E\to[0,1]$ on $(E,\mathcal E)$ defined by
\begin{align*}
\eta_t(A)=\mathbb P(X_t\in A\mid Y_0=y_0,\dots,Y_t=y_t),\qquad A\in\mathcal E.
\end{align*}
[/definition]
The filtering distribution is the online posterior for the current hidden state. To approximate it recursively, we need a sampling scheme that extends each old trajectory by one new state and records the correction for the proposal used at that extension step.
[definition: Sequential Importance Sampling]
Let $N\in\mathbb N$ be the number of particles. Let $q_0:E\to[0,\infty)$ be an initial proposal density and, for $t\ge 1$, let $Q_t:E^t\times E\to[0,\infty)$ be a one-step proposal density, written $Q_t(x_t\mid x_{0:t-1})$. Sequential importance sampling draws particles $X_{0:t,i}$, for $1\le i\le N$, from the joint proposal density $q_{0:t}:E^{t+1}\to[0,\infty)$ defined by
\begin{align*}
q_{0:t}(x_{0:t})=q_0(x_0)\prod_{s=1}^{t}Q_s(x_s\mid x_{0:s-1}),
\end{align*}
and assigns unnormalized path weights
\begin{align*}
W_{t,i}=\frac{\gamma_t(X_{0:t,i})}{q_{0:t}(X_{0:t,i})}.
\end{align*}
The normalized weights are
\begin{align*}
w_{t,i}=\frac{W_{t,i}}{\sum_{j=1}^{N}W_{t,j}}.
\end{align*}
[/definition]
Because the proposal is built sequentially, the weight has a recursion rather than a full recomputation from scratch:
\begin{align*}
W_{t,i} = W_{t-1,i}\,\frac{f_t(X_{t,i}\mid X_{t-1,i})g_t(y_t\mid X_{t,i})}{Q_t(X_{t,i}\mid X_{0:t-1,i})}.
\end{align*}
This recursion is the computational reason the method can operate online. The hidden Markov model case shows how the correction simplifies when the proposal is the latent transition itself.
[example: Hidden Markov Model Filtering]
Consider a finite-state hidden Markov model with $E=\{1,\dots,K\}$, transition matrix $P$, and observation probabilities $g_t(y_t\mid x_t)$. For a particle trajectory $X_{0:t,i}$, the sequential importance sampling weight recursion is
\begin{align*}
W_{t,i}=W_{t-1,i}\frac{P_{X_{t-1,i},X_{t,i}}g_t(y_t\mid X_{t,i})}{q_t(X_{t,i}\mid X_{0:t-1,i})}.
\end{align*}
If the proposal is the transition kernel, then
\begin{align*}
q_t(X_{t,i}\mid X_{0:t-1,i})=P_{X_{t-1,i},X_{t,i}}.
\end{align*}
Substituting this proposal into the recursion gives
\begin{align*}
W_{t,i}=W_{t-1,i}\frac{P_{X_{t-1,i},X_{t,i}}g_t(y_t\mid X_{t,i})}{P_{X_{t-1,i},X_{t,i}}}.
\end{align*}
Whenever $P_{X_{t-1,i},X_{t,i}}>0$ along the sampled transition, the transition factors cancel, so
\begin{align*}
W_{t,i}=W_{t-1,i}g_t(y_t\mid X_{t,i}).
\end{align*}
Thus the incremental correction from time $t-1$ to time $t$ is exactly the likelihood of the new observation at the propagated state. After normalizing
\begin{align*}
w_{t,i}=\frac{W_{t,i}}{\sum_{j=1}^N W_{t,j}},
\end{align*}
the particle approximation of the filtering expectation $\eta_t(h)=\mathbb E[h(X_t)\mid y_{0:t}]$ is
\begin{align*}
\eta_{t,N}(h)=\sum_{i=1}^N w_{t,i}h(X_{t,i}).
\end{align*}
The algorithm therefore updates the filtering expectation by first sampling each new particle through $P$, then increasing or decreasing its contribution according to how likely the new observation $y_t$ is under that sampled state.
[/example]
The example shows the main advantage and the main danger. Propagation is cheap, but after many updates most normalized weights may be nearly zero. We therefore need a scalar diagnostic that warns when the weighted sample behaves like far fewer than $N$ equally weighted samples.
[definition: Effective Sample Size]
Let $\Delta_N=\{w\in[0,1]^N:\sum_{i=1}^{N}w_i=1\}$ be the probability simplex. The effective sample size estimate is the map $\operatorname{ESS}:\Delta_N\to[1,N]$ defined by
\begin{align*}
\operatorname{ESS}(w)=\frac{1}{\sum_{i=1}^{N}w_i^2}.
\end{align*}
For normalized particle weights $w_t=(w_{t,1},\dots,w_{t,N})$, write $\operatorname{ESS}_t=\operatorname{ESS}(w_t)$.
[/definition]
The effective sample size is a diagnostic, not a theorem. It equals $N$ when weights are uniform and equals $1$ when all weight sits on one particle. Once the diagnostic signals collapse, the algorithm needs a mechanism that converts a weighted cloud back into an equally weighted population.
[definition: Resampling Step]
Given weighted particles $(X_{0:t,i},w_{t,i})_{i=1}^{N}$, a resampling step draws ancestor indices $A_{t,1},\dots,A_{t,N}$ from a scheme with marginal probabilities
\begin{align*}
\mathbb P(A_{t,j}=i\mid X_{0:t,1:N},w_{t,1:N})=w_{t,i}.
\end{align*}
The resampled particles are $\widetilde X_{0:t,j}=X_{0:t,A_{t,j}}$ and receive equal weights $1/N$.
[/definition]
Resampling removes particles with small weights and replicates particles with large weights. It reduces weight degeneracy, but it introduces genealogical degeneracy: far enough back in time, many present particles may share the same ancestor. Different resampling schemes preserve the same marginals while changing the variance of the replicated population.
[example: Multinomial Stratified and Systematic Resampling]
Set $C_0=0$ and $I_i=(C_{i-1},C_i]$ for $1\le i\le N$, so the intervals partition $(0,1]$ and
\begin{align*}
|I_i|=C_i-C_{i-1}=\sum_{k=1}^{i}w_{t,k}-\sum_{k=1}^{i-1}w_{t,k}=w_{t,i}.
\end{align*}
Multinomial resampling draws independent $U_1,\dots,U_N\sim\operatorname{Unif}(0,1)$ and sets $A_j=i$ exactly when $U_j\in I_i$. Therefore
\begin{align*}
\mathbb P(A_j=i)=\mathbb P(U_j\in I_i)=|I_i|=w_{t,i}.
\end{align*}
For stratified resampling, write $S_j=((j-1)/N,j/N]$ and draw $U_j\sim\operatorname{Unif}(S_j)$. The probability that stratum $j$ selects ancestor $i$ is
\begin{align*}
\mathbb P(A_j=i)=\frac{|S_j\cap I_i|}{|S_j|}=N|S_j\cap I_i|.
\end{align*}
If $N_i=\sum_{j=1}^{N}\mathbf 1_{\{A_j=i\}}$ is the number of children assigned to ancestor $i$, then
\begin{align*}
\mathbb E[N_i]=\sum_{j=1}^{N}\mathbb P(A_j=i)=\sum_{j=1}^{N}N|S_j\cap I_i|=N\left|\bigcup_{j=1}^{N}(S_j\cap I_i)\right|=N|I_i|=Nw_{t,i}.
\end{align*}
Thus stratified resampling gives ancestor $i$ the correct expected number of offspring, and a final random permutation of the offspring labels gives each displayed label marginal probability $w_{t,i}$.
For systematic resampling, draw one $U\sim\operatorname{Unif}(0,1/N)$ and use points $V_j=U+(j-1)/N$. The number of offspring assigned to ancestor $i$ is
\begin{align*}
N_i(U)=\sum_{j=1}^{N}\mathbf 1_{\{U+(j-1)/N\in I_i\}}.
\end{align*}
Since $U$ has density $N$ on $(0,1/N)$,
\begin{align*}
\mathbb E[N_i]=N\int_0^{1/N}\sum_{j=1}^{N}\mathbf 1_{\{u+(j-1)/N\in I_i\}}\,du.
\end{align*}
With the change of variables $v=u+(j-1)/N$ in each summand, the $N$ integrals over the adjacent intervals $((j-1)/N,j/N)$ combine to give
\begin{align*}
\mathbb E[N_i]=N\int_0^1 \mathbf 1_{\{v\in I_i\}}\,dv=N|I_i|=Nw_{t,i}.
\end{align*}
So multinomial resampling has the correct marginal probability label by label, while stratified and systematic resampling have the correct expected offspring counts and become labelwise marginally correct after random permutation. Their variance is smaller in many cases because the uniforms are forced to occupy different parts of $[0,1]$ instead of clustering independently.
[/example]
The resampling comparison shows algorithmic choices that preserve the same marginal target, but it does not yet provide a common mathematical target for filtering, annealing, and later static-target samplers. To prove consistency and identify what the particle system approximates independently of implementation details, we need a measure recursion built from mutation kernels and potential functions. This motivates the Feynman-Kac flow.
[definition: Feynman Kac Flow]
Let $(E_t,\mathcal E_t)$ be measurable spaces, let $\eta_0$ be a probability measure on $(E_0,\mathcal E_0)$, let $M_t:E_{t-1}\times\mathcal E_t\to[0,1]$ be Markov kernels, and let $G_t:E_t\to[0,\infty)$ be measurable potential functions. Let $B_b^+(E_t)$ denote the nonnegative bounded measurable functions $h:E_t\to[0,\infty)$. The unnormalized Feynman-Kac functional is the map $\gamma_t:B_b^+(E_t)\to[0,\infty)$ defined by
\begin{align*}
\gamma_t(h)=\int h(x_t)\prod_{s=0}^{t}G_s(x_s)\,\eta_0(dx_0)\prod_{s=1}^{t}M_s(x_{s-1},dx_s),
\end{align*}
for every $h\in B_b^+(E_t)$, whenever $0<\gamma_t(1)<\infty$. The normalized probability measure $\eta_t:\mathcal E_t\to[0,1]$ on $(E_t,\mathcal E_t)$ is defined by $\eta_t(A)=\gamma_t(\mathbf{1}_A)/\gamma_t(1)$.
[/definition]
This definition is broad enough to include filtering, annealing, and static-target SMC. For filtering, the important consistency check is that the abstract mutation-potential recursion is not just notation: with the state transition as $M_t$ and the observation likelihood as $G_t(x_t)=g_t(y_t\mid x_t)$, the normalized Feynman-Kac measure should coincide with the Bayesian filtering posterior.
[quotetheorem:7234]
[citeproof:7234]
The theorem explains why particle filters are not ad hoc weighted simulations. They are empirical approximations to a deterministic measure recursion induced by the hidden Markov model. The hypotheses also explain the theorem's limit: the bootstrap filter uses the state transition as the mutation kernel, so the potential is exactly the observation likelihood. If instead particles are propagated from a proposal density $q_t(x_t\mid x_{t-1},y_t)$, the same filtering distribution can still be targeted, but the potential must become $g_t(y_t\mid x_t)f_t(x_t\mid x_{t-1})/q_t(x_t\mid x_{t-1},y_t)$ along the sampled transition. For example, using a Gaussian proposal with variance much smaller than the model transition variance and then weighting only by $g_t(y_t\mid x_t)$ targets the wrong Feynman-Kac flow: it corresponds to the artificial Gaussian proposal transition rather than the original state transition, because the missing factor $f_t/q_t$ is the correction that restores the model dynamics. If $\int G_t\,d(\eta_{t-1}M_t)=0$, every candidate state has zero observation likelihood under the predictive law and the normalization denominator is zero; if this integral is infinite, the normalized ratio is not a probability measure. The next question is whether this empirical approximation converges to the deterministic recursion as the number of particles grows.
[quotetheorem:7235]
[citeproof:7235]
Consistency is an asymptotic statement for fixed time. The boundedness and positivity assumptions are not cosmetic: if $G_t$ is unbounded, rare particles can dominate the normalized estimate and destroy the conditional law-of-large-numbers control; if $\int G_t\,d(\eta_{t-1}M_t)=0$, the update map $\Phi_t$ is not defined. The theorem also does not say that a fixed number of particles remains adequate as $t$ grows, which is why diagnostics and resampling choices remain central in applications.
## Bootstrap and Auxiliary Particle Filters
The next problem is algorithm design. Sequential importance sampling gives the recursion, but the proposal distribution and resampling schedule determine whether particles are placed where the likelihood is large.
[definition: Bootstrap Particle Filter]
The bootstrap particle filter for a state space model uses the transition proposal
\begin{align*}
q_t(x_t\mid x_{0:t-1})=f_t(x_t\mid x_{t-1}).
\end{align*}
At each time $t$, particles are propagated through the state transition, weighted by $g_t(y_t\mid X_{t,i})$, and optionally resampled according to the normalized weights.
[/definition]
The bootstrap filter is robust and simple because it requires simulation from the latent dynamics only. It can perform poorly when the observation is informative relative to the transition, since many proposed particles land in regions with negligible likelihood. The next example isolates this failure mode in a familiar Gaussian observation setting.
[example: Nonlinear Filtering with Informative Observations]
Suppose $\varepsilon_t\sim\mathcal N(0,\tau^2)$ and $\delta_t\sim\mathcal N(0,\sigma_y^2)$, with $\sigma_y^2\ll \tau^2$, so
\begin{align*}
X_t\mid X_{t-1}=x_{t-1}\sim\mathcal N(x_{t-1},\tau^2)
\end{align*}
and
\begin{align*}
Y_t\mid X_t=x_t\sim\mathcal N(x_t,\sigma_y^2).
\end{align*}
The bootstrap proposal samples
\begin{align*}
X_{t,i}\sim \mathcal N(X_{t-1,i},\tau^2),
\end{align*}
and because the proposal equals the transition density, the incremental weight is the observation likelihood
\begin{align*}
\omega_{t,i}=g_t(y_t\mid X_{t,i})=\frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(y_t-X_{t,i})^2}{2\sigma_y^2}\right).
\end{align*}
The small variance $\sigma_y^2$ makes this likelihood sharply concentrated near $y_t$. If two propagated particles satisfy $|y_t-X_{t,a}|=0$ and $|y_t-X_{t,b}|=d$, then their weight ratio is
\begin{align*}
\frac{\omega_{t,b}}{\omega_{t,a}}=\frac{(2\pi\sigma_y^2)^{-1/2}\exp(-d^2/(2\sigma_y^2))}{(2\pi\sigma_y^2)^{-1/2}\exp(0)}=\exp\left(-\frac{d^2}{2\sigma_y^2}\right).
\end{align*}
Thus a particle only a few observation standard deviations from $y_t$ receives exponentially less weight than a particle near $y_t$. If a surprising observation places $y_t$ in a tail of the predictive distribution $\mathcal N(X_{t-1,i},\tau^2)$ for most ancestors, then most sampled $X_{t,i}$ miss the narrow high-likelihood region, their normalized weights become nearly zero, and
\begin{align*}
\operatorname{ESS}_t=\frac{1}{\sum_{i=1}^N w_{t,i}^2}
\end{align*}
can be close to $1$ when one propagated particle carries almost all the mass. The failure is therefore not that the bootstrap correction is wrong; it is that the transition proposal may place too few particles where the informative observation gives non-negligible likelihood.
[/example]
The example shows that the current observation may contain useful information before propagation is completed. A natural response is to look one observation ahead before deciding which ancestors to propagate, which motivates a filter with first-stage ancestor weights.
[definition: Auxiliary Particle Filter]
An auxiliary particle filter is specified by a predictive summary map $m_t:E_{t-1}\times F\to(0,\infty)$ and proposal kernels $R_t:E_{t-1}\times F\times\mathcal E\to[0,1]$ such that, for each $(x_{t-1},y_t)\in E_{t-1}\times F$, $R_t(x_{t-1},y_t,\cdot)$ has density $r_t(\cdot\mid x_{t-1},y_t)$ with respect to $\lambda$. For particles $(X_{t-1,i},w_{t-1,i})_{i=1}^{N}$ and observation $y_t\in F$, it introduces first-stage weights proportional to
\begin{align*}
w_{t-1,i}m_t(X_{t-1,i},y_t).
\end{align*}
Ancestors are sampled using these first-stage weights, new particles are propagated according to $R_t(X_{t-1,A_t},y_t,\cdot)$, and the second-stage unnormalized weight assigned to a propagated particle $X_t$ with ancestor $A_t$ is
\begin{align*}
\widetilde W_t
=\frac{f_t(X_t\mid X_{t-1,A_t})g_t(y_t\mid X_t)}
{m_t(X_{t-1,A_t},y_t)r_t(X_t\mid X_{t-1,A_t},y_t)}.
\end{align*}
[/definition]
The auxiliary filter trades extra approximation work for better particle placement. The correction weights preserve the target distribution, while the first-stage weights reduce wasted propagation from ancestors unlikely to match the new data. Stochastic volatility models give a standard setting where this predictive weighting is useful.
[example: Auxiliary Filtering in a Stochastic Volatility Model]
Consider the stochastic volatility model
\begin{align*}
X_t=\phi X_{t-1}+\sigma\varepsilon_t,\qquad \varepsilon_t\sim\mathcal N(0,1),
\end{align*}
with observation law
\begin{align*}
Y_t\mid X_t=x_t\sim\mathcal N(0,e^{x_t}).
\end{align*}
Thus the transition density is
\begin{align*}
f_t(x_t\mid x_{t-1})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_t-\phi x_{t-1})^2}{2\sigma^2}\right),
\end{align*}
and the observation likelihood at a realized observation $y_t$ is
\begin{align*}
g_t(y_t\mid x_t)=\frac{1}{\sqrt{2\pi e^{x_t}}}\exp\left(-\frac{y_t^2}{2e^{x_t}}\right).
\end{align*}
For an ancestor $X_{t-1,i}$, the conditional mean of the next state is $\phi X_{t-1,i}$. The auxiliary filter can therefore use the predictive representative $\phi X_{t-1,i}$ and assign first-stage weight
\begin{align*}
m_{t,i}=g_t(y_t\mid \phi X_{t-1,i})=\frac{1}{\sqrt{2\pi e^{\phi X_{t-1,i}}}}\exp\left(-\frac{y_t^2}{2e^{\phi X_{t-1,i}}}\right).
\end{align*}
When $|y_t|$ is large, this quantity favors ancestors with larger $\phi X_{t-1,i}$, because larger $\phi X_{t-1,i}$ makes the predictive variance $e^{\phi X_{t-1,i}}$ larger and makes the exponential penalty $y_t^2/(2e^{\phi X_{t-1,i}})$ smaller.
Suppose ancestor $A_t=i$ is selected with probability proportional to $w_{t-1,i}m_{t,i}$, and the propagated state is sampled from a proposal density $r_t(x_t\mid X_{t-1,i},y_t)$. The auxiliary-particle second-stage correction is
\begin{align*}
\widetilde W_t=\frac{f_t(X_t\mid X_{t-1,i})g_t(y_t\mid X_t)}{m_{t,i}r_t(X_t\mid X_{t-1,i},y_t)}.
\end{align*}
Substituting the displayed expression for $m_{t,i}$ gives
\begin{align*}
\widetilde W_t=\frac{f_t(X_t\mid X_{t-1,i})g_t(y_t\mid X_t)}{g_t(y_t\mid \phi X_{t-1,i})r_t(X_t\mid X_{t-1,i},y_t)}.
\end{align*}
This denominator removes the approximate look-ahead likelihood already used to choose the ancestor, while the numerator restores the actual transition density and the actual observation likelihood at the sampled state $X_t$. Thus the first-stage weight steers computation toward ancestors whose predicted volatility explains the large observation, and the second-stage weight keeps the particle system targeted at the original stochastic volatility model rather than at the representative point $\phi X_{t-1,i}$.
[/example]
Filtering estimates the current state, but some inferential tasks need past states after later observations are available. This leads to particle smoothing, where the target is no longer just the final coordinate of the latent path.
[definition: Particle Smoothing]
Particle smoothing refers to Monte Carlo approximation of the smoothing marginal laws $\mathcal L(X_s\mid y_{0:t}):\mathcal E\to[0,1]$ for $s<t$, or of the full path posterior $\mathcal L(X_{0:t}\mid y_{0:t}):\mathcal E^{\otimes(t+1)}\to[0,1]$ on $E^{t+1}$.
[/definition]
Common particle smoothers use stored genealogies, backward simulation kernels, or forward-filtering backward-sampling recursions. Stored genealogies are cheap, but path degeneracy makes early-time smoothing unreliable when $t$ is large. Backward simulation uses the filtering particles together with transition probabilities to sample more diverse trajectories from an approximation of the smoothing distribution. In online settings, a fixed lag is often a practical compromise between latency and genealogical collapse.
[example: Fixed Lag Smoothing]
In an online tracking problem, suppose the required quantity at time $t$ is
\begin{align*}
\mathbb E[h(X_{t-L})\mid y_{0:t}]
\end{align*}
for a fixed lag $L\ge 1$. After filtering to time $t$, let $(X_{0:t,i},w_{t,i})_{i=1}^N$ denote the weighted particle paths, where $X_{t-L:t,i}$ is retained through the ancestry records. The empirical path approximation to the posterior smoothing law assigns mass $w_{t,i}$ to the stored path $X_{0:t,i}$, so applying the [test function](/page/Test%20Function) $x_{0:t}\mapsto h(x_{t-L})$ gives the fixed-lag estimate
\begin{align*}
\widehat h_{t-L\mid t,N}=\sum_{i=1}^{N}w_{t,i}h(X_{t-L,i}).
\end{align*}
Equivalently, if the current particle $X_{t,i}$ has ancestor index $A_{t,t-L,i}$ at time $t-L$, then the retained lagged state is
\begin{align*}
X_{t-L,i}=X_{t-L,A_{t,t-L,i}}.
\end{align*}
Substituting this ancestry identity into the estimator gives
\begin{align*}
\widehat h_{t-L\mid t,N}=\sum_{i=1}^{N}w_{t,i}h(X_{t-L,A_{t,t-L,i}}).
\end{align*}
Thus the filter does not need to store the whole trajectory for all past times: for a fixed lag it only needs enough ancestry to recover the state $L$ steps behind each current particle. The estimate is useful when $L$ is short enough that the current particles have not all coalesced to the same old ancestor, but long enough that observations between times $t-L+1$ and $t$ have informed the state at time $t-L$.
[/example]
## Likelihood Estimation and Particle MCMC
The next question is how particle filters interact with parameter inference. For a parameter $\theta$, Bayesian inference often needs the likelihood $p_\theta(y_{0:M})$, but this likelihood is unavailable after integrating over latent states. A particle filter supplies a random likelihood estimate, so we first define the estimator produced by the filter.
[definition: Particle Likelihood Estimator]
For a particle filter, let $\omega_{t,i}\in[0,\infty)$ denote the incremental unnormalized weight assigned to particle $i$ at time $t$ after propagation from the particle population produced at time $t-1$. The one-step normalizing-constant estimate is the map $\widehat z_{t,N}:(\omega_{t,1},\dots,\omega_{t,N})\mapsto N^{-1}\sum_{i=1}^{N}\omega_{t,i}$. The particle likelihood estimator is
\begin{align*}
\widehat Z_{M,N}=\prod_{t=0}^{M}\widehat z_{t,N}
=\prod_{t=0}^{M}\left(\frac{1}{N}\sum_{i=1}^{N}\omega_{t,i}\right).
\end{align*}
[/definition]
The product form mirrors the likelihood decomposition into predictive densities. Its importance lies in unbiasedness, not only in consistency, because pseudo-marginal MCMC requires an estimator whose expectation is the exact likelihood rather than merely a close numerical approximation.
[quotetheorem:7236]
[citeproof:7236]
This result is stronger than a large-sample approximation statement. Its hypotheses are essential. If incremental weights are allowed to be negative, the estimate cannot be used as a probability density in a Metropolis-Hastings ratio. If a resampling rule always gives one extra copy to particle $1$ whenever possible and removes a copy from the last particle, the conditional offspring mean is no longer $Nw_{t,i}$, so the empirical prediction step is biased before the next propagation. If the factors are normalized by $\sum_i(\omega_{t,i})^2$ rather than $\sum_i\omega_{t,i}$, the product estimates a distorted sequence of predictive quantities rather than the likelihood. Unbiasedness also does not control variance: over long horizons the product estimator can remain unbiased while its variance becomes large enough that typical runs severely under-estimate the likelihood and rare runs dominate the expectation. This is the likelihood-estimation analogue of path degeneracy, and it explains why practical PMMH implementations tune $N$, resampling, and proposal design to control the variance of $\log \widehat Z_{M,N}$. With nonnegative correctly normalized weights and unbiased resampling, the random likelihood estimate can be inserted into a Metropolis-Hastings algorithm without changing the target posterior, provided the auxiliary randomness is included in the Markov chain state. We now formalize that pseudo-marginal construction.
[definition: Particle Marginal Metropolis Hastings]
Let $(\Theta,\mathcal T,\rho)$ be a parameter space with reference measure $\rho$, let $p:\Theta\to[0,\infty)$ be a prior density with respect to $\rho$, and let $Z_M:\Theta\to[0,\infty)$ be the likelihood $Z_M(\theta)=p_\theta(y_{0:M})$. Let $r:\Theta\times\mathcal T\to[0,1]$ be a proposal Markov kernel admitting a density $r(\theta,\theta')$ with respect to $\rho$. Let $(\mathsf U,\mathcal U)$ be the auxiliary random space used by the particle filter, and let $m:\Theta\times\mathcal U\to[0,1]$ be a Markov kernel, with $m_\theta(\cdot)=m(\theta,\cdot)$ denoting the simulation law of the particle-filter random variables at parameter $\theta$. Let $\widehat Z_{M,N}:\Theta\times\mathsf U\to[0,\infty)$ be the resulting likelihood estimator. Particle marginal Metropolis-Hastings is the pseudo-marginal algorithm for the parameter posterior $\pi(\theta\mid y_{0:M})\propto p(\theta)Z_M(\theta)$. Given current parameter $\theta$ and current auxiliary variable $u$, propose $\theta'\sim r(\theta,\cdot)$, draw $u'\sim m_{\theta'}$, compute $\widehat Z_{M,N}(\theta',u')$, and accept $(\theta',u')$ with probability
\begin{align*}
\alpha=1\wedge \frac{p(\theta')\widehat Z_{M,N}(\theta',u')r(\theta',\theta)}{p(\theta)\widehat Z_{M,N}(\theta,u)r(\theta,\theta')}.
\end{align*}
[/definition]
The algorithm targets the exact posterior marginal in $\theta$ even though it uses a random likelihood estimate. The number of particles affects mixing and computational cost, not the invariant marginal distribution. The next theorem states this exactness property as an ordinary Metropolis-Hastings invariance result on an enlarged space.
[quotetheorem:7237]
[citeproof:7237]
This exactness statement does not decide whether PMMH mixes well. Each assumption has a specific role. Nonnegativity is needed because the extended target has density proportional to $\widehat Z_{M,N}(\theta,u)$; a signed likelihood estimate would not define a probability measure. Unbiasedness is needed because the marginal density becomes $p(\theta)\mathbb E[\widehat Z_{M,N}(\theta,U)]$; a biased estimator would target the posterior for a distorted likelihood. The support condition prevents proposals from moving into regions where the reverse move has zero probability. For instance, if $\Theta=\{0,1\}$, both posterior masses are positive, and the proposal has $r(0,1)>0$ but $r(1,0)=0$, then a move from $0$ to $1$ has no reverse proposal term to balance it; the usual Metropolis-Hastings ratio cannot make the two directed probability flows equal. A stochastic volatility likelihood calculation illustrates what the algorithm computes at each parameter proposal and why the variance of $\widehat Z_{M,N}(\theta)$ matters for acceptance behaviour.
[example: Stochastic Volatility Likelihood Estimation]
For $\theta=(\phi,\sigma)$ with $\sigma>0$, the transition and observation densities are
\begin{align*}
f_\theta(x_t\mid x_{t-1})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_t-\phi x_{t-1})^2}{2\sigma^2}\right)
\end{align*}
and
\begin{align*}
g(y_t\mid x_t)=\frac{1}{\sqrt{2\pi e^{x_t}}}\exp\left(-\frac{y_t^2}{2e^{x_t}}\right).
\end{align*}
If $\mu_{0,\theta}$ is the initial density of $X_0$, then the likelihood of $y_{0:M}$ is the path integral
\begin{align*}
Z_M(\theta)=\int \mu_{0,\theta}(x_0)g(y_0\mid x_0)\prod_{t=1}^{M}f_\theta(x_t\mid x_{t-1})g(y_t\mid x_t)\,dx_{0:M}.
\end{align*}
This integral is high-dimensional because every latent state $x_0,\dots,x_M$ must be integrated out.
A bootstrap particle filter proposes from the model transition itself:
\begin{align*}
X_{t,i}\sim f_\theta(\cdot\mid X_{t-1,i}).
\end{align*}
Therefore the transition density in the target path density is the same density used by the proposal, so the incremental importance correction at time $t$ is only the observation likelihood
\begin{align*}
\omega_{t,i}=g(y_t\mid X_{t,i})=\frac{1}{\sqrt{2\pi e^{X_{t,i}}}}\exp\left(-\frac{y_t^2}{2e^{X_{t,i}}}\right).
\end{align*}
The one-step normalizing-constant estimate is
\begin{align*}
\widehat z_{t,N}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\omega_{t,i},
\end{align*}
so the particle likelihood estimate used by the parameter algorithm is
\begin{align*}
\widehat Z_{M,N}(\theta)=\prod_{t=0}^{M}\widehat z_{t,N}(\theta)=\prod_{t=0}^{M}\left(\frac{1}{N}\sum_{i=1}^{N}g(y_t\mid X_{t,i})\right).
\end{align*}
In particle marginal Metropolis-Hastings, a proposed parameter $\theta'=(\phi',\sigma')$ gets its own freshly simulated particle-filter randomness and hence its own estimate $\widehat Z_{M,N}(\theta')$. If the current parameter is $\theta$, the prior density is $p$, and the proposal density is $r$, then the acceptance probability is
\begin{align*}
\alpha=1\wedge \frac{p(\theta')\widehat Z_{M,N}(\theta')r(\theta',\theta)}{p(\theta)\widehat Z_{M,N}(\theta)r(\theta,\theta')}.
\end{align*}
Thus each parameter proposal is judged by the prior ratio, the proposal correction, and the particle estimate of the latent-state likelihood; when $\widehat Z_{M,N}(\theta')$ has high variance, the acceptance decision can be dominated by Monte Carlo noise rather than by the underlying likelihood surface.
[/example]
## SMC Samplers for Static Targets
The final topic asks how sequential methods can help when the target itself is static. The idea is to create an artificial sequence of intermediate distributions leading from an easy initial distribution to the difficult target.
[definition: Tempered SMC Sampler]
Let $(E,\mathcal E,\lambda)$ be a state space with reference measure $\lambda$. Let $\gamma:E\to[0,\infty)$ and $\gamma_0:E\to[0,\infty)$ be measurable unnormalized densities with finite positive normalizing constants, and let $\pi,\pi_0:\mathcal E\to[0,1]$ be the probability measures whose $\lambda$-densities are proportional to $\gamma$ and $\gamma_0$, respectively. A tempered SMC sampler chooses temperatures
\begin{align*}
0=\beta_0<\beta_1<\cdots<\beta_T=1
\end{align*}
and intermediate probability measures $\pi_t:\mathcal E\to[0,1]$ whose $\lambda$-densities are proportional to
\begin{align*}
\gamma_t(x)=\gamma_0(x)^{1-\beta_t}\gamma(x)^{\beta_t}.
\end{align*}
Particles are reweighted from $\pi_{t-1}$ to $\pi_t$, resampled when needed, and moved by Markov kernels $K_t:E\times\mathcal E\to[0,1]$ satisfying $\pi_tK_t=\pi_t$.
[/definition]
Tempering turns a hard global sampling problem into a sequence of smaller changes. The Markov move step restores diversity after resampling without changing the current target. Bayesian posterior tempering is the standard example because the likelihood can be introduced gradually.
[example: Bayesian Posterior Tempering]
Let $\pi(\theta\mid y)\propto p(\theta)L(\theta)$, where $p(\theta)$ is the prior density and $L(\theta)$ is the likelihood. Choose the initial unnormalized density $\gamma_0(\theta)=p(\theta)$ and the final unnormalized density $\gamma(\theta)=p(\theta)L(\theta)$. For temperatures $0=\beta_0<\beta_1<\cdots<\beta_T=1$, the tempered SMC path uses
\begin{align*}
\gamma_t(\theta)=\gamma_0(\theta)^{1-\beta_t}\gamma(\theta)^{\beta_t}.
\end{align*}
Substituting $\gamma_0(\theta)=p(\theta)$ and $\gamma(\theta)=p(\theta)L(\theta)$ gives
\begin{align*}
\gamma_t(\theta)=p(\theta)^{1-\beta_t}\{p(\theta)L(\theta)\}^{\beta_t}.
\end{align*}
Using $\{ab\}^{\beta_t}=a^{\beta_t}b^{\beta_t}$ for nonnegative densities,
\begin{align*}
\gamma_t(\theta)=p(\theta)^{1-\beta_t}p(\theta)^{\beta_t}L(\theta)^{\beta_t}.
\end{align*}
Since $p(\theta)^{1-\beta_t}p(\theta)^{\beta_t}=p(\theta)$ wherever $p(\theta)>0$, the intermediate target has unnormalized density
\begin{align*}
\gamma_t(\theta)=p(\theta)L(\theta)^{\beta_t}.
\end{align*}
If particles are already weighted for $\gamma_{t-1}$ and are reweighted to $\gamma_t$ without changing position, the incremental importance factor at $\theta$ is the ratio
\begin{align*}
\frac{\gamma_t(\theta)}{\gamma_{t-1}(\theta)}=\frac{p(\theta)L(\theta)^{\beta_t}}{p(\theta)L(\theta)^{\beta_{t-1}}}.
\end{align*}
On the support where $p(\theta)>0$ and $L(\theta)>0$, the prior factors cancel and the powers of $L(\theta)$ subtract:
\begin{align*}
\frac{\gamma_t(\theta)}{\gamma_{t-1}(\theta)}=L(\theta)^{\beta_t-\beta_{t-1}}.
\end{align*}
Thus the sampler begins at the prior when $\beta_0=0$, ends at the posterior when $\beta_T=1$, and introduces the likelihood gradually; choosing $\beta_t-\beta_{t-1}$ smaller makes the new weights less uneven, which is why adaptive temperature selection can use the effective sample size to control weight collapse.
[/example]
The example shows the same components as filtering, but the sequence is created for computation rather than imposed by time. This distinction is important when deciding whether the particles represent latent states in a dynamic model or samples from a static posterior path.
[remark: Choosing Between Particle Filters and SMC Samplers]
Particle filters are designed for data arriving through a latent dynamic model, where the sequence of targets is imposed by time. SMC samplers are designed for static targets, where the sequence is chosen by the algorithm. Both use the same mutation-weighting-resampling architecture, but their modelling roles are different.
[/remark]
The chapter closes the course's transition from independent Monte Carlo and Markov chain Monte Carlo to population-based algorithms. Sequential Monte Carlo keeps the importance-sampling identity at its core, but adds resampling and mutation so that the approximation can track a changing or deliberately constructed sequence of targets.
Sequential Monte Carlo adds population structure, but it still leaves open deeper questions about exactness and chain comparison. Chapter 10 turns to couplings and perfect simulation to study convergence at a more fundamental level, asking when one can obtain exact draws or rigorous convergence guarantees from joint constructions.
# 10. Perfect Simulation and Advanced Couplings
After Chapters 4--8 developed approximate Markov chain simulation and Chapter 7 studied diagnostics for finite output, this chapter turns to couplings that compare two chains on the same probability space. The main question is how much information a joint construction can give about convergence, exact stationarity, and bias removal. Coupling turns qualitative mixing statements into random times, and in special finite-state settings it even produces samples exactly from the stationary distribution.
## Coupling Constructions and Meeting Times
The basic problem is to compare two Markov chains that use the same transition rule but start from different initial states. If the two chains can be forced to meet and then move together, the meeting time becomes a certificate that their marginal laws have become close.
[definition: Coupling of Probability Measures]
Let $(E, \mathcal E)$ be a measurable space and let $\mu, \nu$ be probability measures on $E$. A coupling of $\mu$ and $\nu$ is a probability measure $\gamma$ on $E \times E$ such that
\begin{align*}
\gamma(A \times E) &= \mu(A), & \gamma(E \times A) &= \nu(A)
\end{align*}
for every $A \in \mathcal E$.
[/definition]
A coupling is not a new marginal distribution; it is a choice of dependence between already specified marginal laws. To see why the dependence matters, it helps to compare two couplings with identical marginals but different probabilities of equality.
[example: Independent and Synchronous Couplings]
Let $\varphi(t)=(2\pi)^{-1/2}e^{-t^2/2}$ be the standard normal density. In the independent coupling, $X$ and $Y$ have joint density $(x,y)\mapsto \varphi(x)\varphi(y)$ on $\mathbb R^2$, so for every Borel set $A\subset\mathbb R$,
\begin{align*}
\mathbb P(X\in A)=\int_A \varphi(x)\,dx=\mu(A)
\end{align*}
and similarly $\mathbb P(Y\in A)=\nu(A)$. Thus the independent construction is a coupling of $\mu$ and $\nu$. Its meeting event is the diagonal $\{(x,y):x=y\}$, which has two-dimensional Lebesgue measure $0$; since the joint law has a density with respect to two-dimensional Lebesgue measure,
\begin{align*}
\mathbb P(X=Y)=\int_{\{(x,y):x=y\}}\varphi(x)\varphi(y)\,d(x,y)=0.
\end{align*}
In the synchronous coupling, choose one $Z\sim\mathcal N(0,1)$ and set $X=Z$ and $Y=Z$. Then for every Borel set $A\subset\mathbb R$,
\begin{align*}
\mathbb P(X\in A)=\mathbb P(Z\in A)=\mu(A)
\end{align*}
and
\begin{align*}
\mathbb P(Y\in A)=\mathbb P(Z\in A)=\nu(A).
\end{align*}
Because $X=Z=Y$ on every outcome,
\begin{align*}
\mathbb P(X=Y)=1.
\end{align*}
Both constructions therefore have the same prescribed marginals, but their dependence structures give opposite meeting probabilities: $0$ for the independent coupling and $1$ for the synchronous coupling.
[/example]
The example shows that coupling choices can determine whether equality is likely or impossible.
For Markov chains, arbitrary dependence between two copies is not enough. If the joint evolution changes either marginal transition law, then any meeting event would no longer describe the original chain. The definition below isolates the admissible joint kernels whose two coordinates each still evolve according to the intended transition kernel.
[definition: Coupled Markov Chains]
Let $P$ be a Markov transition kernel on $(E,\mathcal E)$. A coupled pair $(X_k,Y_k)_{k\ge 0}$ is a Markov chain on $E \times E$ with transition kernel $\bar P$ such that, for every $(x,y) \in E \times E$, the first marginal of $\bar P((x,y),\cdot)$ is $P(x,\cdot)$ and the second marginal is $P(y,\cdot)$.
[/definition]
A coupled Markov chain preserves the correct one-chain dynamics in each coordinate, but this alone does not say what happens after the coordinates meet. The next definition records the coalescing property needed to turn equality into a persistent event and to define the random time that measures successful coupling.
[definition: Faithful Coupling and Meeting Time]
A coupled pair $(X_k,Y_k)_{k\ge 0}$ for a kernel $P$ is faithful if $X_k=Y_k$ implies $X_{k+1}=Y_{k+1}$ a.s. for every $k\ge 0$. Its meeting time is
\begin{align*}
\tau = \inf\{k\ge 0 : X_k = Y_k\}.
\end{align*}
[/definition]
Faithfulness turns the event $\{\tau\le k\}$ into a guarantee that the two coordinates agree at time $k$.
The unresolved question is how a pathwise meeting statement controls the distributional error of a Markov chain. Total variation distance compares laws, while coupling produces a joint process on paths. The theorem below provides the bridge: failure to have met by time $k$ bounds the possible discrepancy between the two marginal laws.
[quotetheorem:7238]
[citeproof:7238]
The inequality is useful only because the coupling is faithful: without coalescence after meeting, equality at an earlier time would not imply equality at time $k$. It also gives an upper bound rather than an exact formula for an arbitrary coupling, so a poorly chosen joint construction may say little about the true total variation distance. The theorem does not prove that the chain has mixed from every starting point; it reduces that question to controlling a model-dependent meeting time. This motivates the search for couplings that maximise the probability of agreement at a fixed time. The next definition names the optimal case, where the upper bound in the coupling inequality cannot be improved.
[definition: Maximal Coupling]
Let $\mu$ and $\nu$ be probability measures on $(E,\mathcal E)$. A coupling $(X,Y)$ of $\mu$ and $\nu$ is maximal if
\begin{align*}
\mathbb P(X\ne Y)=\|\mu-\nu\|_{\mathrm{TV}}.
\end{align*}
[/definition]
Maximal couplings attain the best possible value in the coupling inequality.
The obstruction is existence: defining an optimal meeting probability does not by itself guarantee that some random pair realizes it, especially on a general measurable space. The theorem below supplies the required existence result under a standard measurability hypothesis by matching the common mass of the two distributions.
[quotetheorem:7239]
[citeproof:7239]
The countably generated assumption is a regularity condition ensuring that the common-mass construction can be realised as an actual random pair. Maximality is also a one-time property of two probability laws, not automatically a useful dynamic construction for a whole Markov chain. In continuous or high-dimensional MCMC, implementing a maximal coupling at every step may require evaluating and sampling from residual densities that are no easier than the original target. Thus maximal couplings identify the best possible benchmark, while practical couplings usually preserve common random numbers, proposals, or accept-reject uniforms. The next example illustrates how shared proposal and acceptance randomness can encourage meeting for random-walk Metropolis chains.
[example: Coupled Random-Walk Metropolis Chains]
Let $\pi$ be a target density on $\mathbb R^d$, and use the symmetric random-walk proposal $q(x,\cdot)=\mathcal N(x,\sigma^2 I_d)$. For current states $X$ and $Y$, draw one $Z\sim\mathcal N(0,\sigma^2 I_d)$ and one $U\sim\mathrm{Unif}(0,1)$, and set
\begin{align*}
X^\ast=X+Z,\qquad Y^\ast=Y+Z.
\end{align*}
Because the proposal density is symmetric, the Metropolis acceptance probabilities are
\begin{align*}
a_X=\min\left\{1,\frac{\pi(X^\ast)}{\pi(X)}\right\},\qquad a_Y=\min\left\{1,\frac{\pi(Y^\ast)}{\pi(Y)}\right\}.
\end{align*}
The coupled update is
\begin{align*}
X_+=X^\ast\mathbf 1_{\{U\le a_X\}}+X\mathbf 1_{\{U>a_X\}},\qquad Y_+=Y^\ast\mathbf 1_{\{U\le a_Y\}}+Y\mathbf 1_{\{U>a_Y\}}.
\end{align*}
If both proposals are accepted, then
\begin{align*}
X_+-Y_+=(X+Z)-(Y+Z)=X-Y.
\end{align*}
If both proposals are rejected, then
\begin{align*}
X_+-Y_+=X-Y.
\end{align*}
Thus the shared proposal preserves the separation whenever the two accept-reject decisions agree. If $X=Y$, then $X^\ast=Y^\ast$, hence $a_X=a_Y$, so the single uniform $U$ makes the same accept-reject decision for both chains and gives $X_+=Y_+$.
Pure synchronous proposals preserve equality once it occurs, but they do not create equality from $X\ne Y$ because $X^\ast-Y^\ast=X-Y$. A maximal-coupling refinement changes the proposal step: instead of forcing $Y^\ast=Y+Z$, it couples the proposal laws $\mathcal N(X,\sigma^2I_d)$ and $\mathcal N(Y,\sigma^2I_d)$ so that the common proposal event has probability
\begin{align*}
\int_{\mathbb R^d}\min\{q(X,w),q(Y,w)\}\,dw.
\end{align*}
For Gaussian random-walk proposals this integral is positive whenever $X$ and $Y$ are finite points, since both Gaussian densities are strictly positive on every nonempty open ball. On the event $X^\ast=Y^\ast=w$, the shared uniform accepts both proposals whenever
\begin{align*}
U\le \min\left\{\min\left(1,\frac{\pi(w)}{\pi(X)}\right),\min\left(1,\frac{\pi(w)}{\pi(Y)}\right)\right\}.
\end{align*}
On that event the next states satisfy $X_+=Y_+=w$, and the shared randomness keeps them together thereafter. This explains the role of the refinement: synchronous randomness preserves closeness and coalescence, while maximal proposal coupling supplies a positive-probability mechanism for exact meeting in continuous state spaces.
[/example]
The example shows how coupling can create exact meetings even in algorithms designed for approximate sampling.
A separate obstruction remains for Monte Carlo averages from a finite Markov chain run: before stationarity, the marginal expectation at time $k$ is biased for the stationary expectation. Meeting times make it possible to stop an infinite telescoping correction at a finite random time, producing an estimator that targets the stationary expectation exactly.
[quotetheorem:7240]
[citeproof:7240]
This theorem is conceptually different from a convergence diagnostic. It does not certify that one realised chain has mixed; instead, it constructs a random variable whose expectation is exactly the stationary quantity. The lagged marginal condition makes $h(Y_{k-1})$ replace $h(X_{k-1})$ in expectation, so the correction estimates successive changes in the marginal means of the $X$-chain. Lag-faithful coalescence then turns the infinite correction into a finite random sum. The convergence assumption connects the telescoping limit to the stationary expectation, and the integrability condition prevents the random correction from having undefined expectation. Without any one of these hypotheses, the estimator may still be computable but the equality $\mathbb E[H_m]=\pi(h)$ need not follow.
## Coupling from the Past and Exact Stationarity
The next problem is stronger than bounding convergence: can an algorithm return a sample whose law is exactly stationary without knowing the mixing time? Coupling from the past answers this in finite state spaces when the chain can be driven by shared random maps and all possible initial states coalesce by time $0$.
[definition: Random Mapping Representation]
Let $S$ be a finite state space, let $(\mathcal U,\mathcal A)$ be a measurable update-noise space, and let $P$ be a Markov transition matrix on $S$. A random mapping representation of $P$ is a measurable function $F:S\times \mathcal U\to S$ and an i.i.d. sequence $(U_t)_{t\in \mathbb Z}$ of $\mathcal U$-valued random variables such that, for each $x\in S$, the random state $F(x,U_t)$ has distribution $P(x,\cdot)$.
[/definition]
A random mapping representation lets us update every possible starting point with the same randomness.
The difficulty is that exact stationarity cannot be certified merely by running one chain for a long time; the output might still depend on its unknown starting state. CFTP looks backward and asks for a stronger event: after applying the same stored updates from time $-T$ to time $0$, every possible initial state has collapsed to one value.
[definition: Coalescence from the Past]
For a random mapping representation $F$ and an integer $T\ge 1$, define the map $\Phi_{-T,0}:S\to S$ by
\begin{align*}
\Phi_{-T,0}:x\mapsto F(F(\cdots F(F(x,U_{-T}),U_{-T+1})\cdots,U_{-2}),U_{-1}).
\end{align*}
Coalescence from time $-T$ occurs if $\Phi_{-T,0}(x)$ is the same element of $S$ for every $x\in S$.
[/definition]
If coalescence occurs, the common value no longer remembers the starting state at time $-T$.
The remaining issue is distributional correctness. Loss of memory from a finite past time rules out dependence on the chosen initial state, but it must still be connected to the stationary law of the transition matrix. The theorem below gives that connection: under the finite-state ergodicity assumptions and finite coalescence, the common value at time $0$ is exactly stationary.
[quotetheorem:7241]
[citeproof:7241]
Each hypothesis has a specific role. Finiteness makes it meaningful, at least in principle, to check all starting states, and irreducibility together with aperiodicity gives a unique stationary distribution for the transition matrix. The random mapping representation is what lets every possible chain use the same stored randomness; without common maps, coalescence from the past would not define a single output. Finite coalescence is the termination certificate, and if it never occurs the algorithm has no exact sample to return. The theorem is therefore often impractical if every state must be tracked. The next definition introduces the monotonicity structure that reduces this burden by allowing all states to be bounded between extremal trajectories.
[definition: Monotone Random Mapping]
Let $(S,\preceq)$ be a finite partially ordered set. A random mapping representation $F:S\times\mathcal U\to S$ is monotone if
\begin{align*}
x\preceq y \implies F(x,u)\preceq F(y,u)
\end{align*}
for every $u\in\mathcal U$.
[/definition]
When the state space has a bottom and top element, monotonicity sandwiches all trajectories between two extremal trajectories.
The computational obstruction is that checking every initial state may be infeasible even on a finite space. If the update maps preserve the partial order, the bottom and top trajectories bound all others after every shared update. The theorem below explains why equality of just these two extremal trajectories certifies coalescence for the whole state space.
[quotetheorem:7242]
[citeproof:7242]
The bottom and top elements are essential because they give trajectories that bound every other possible initial state. Monotonicity is the property that preserves this sandwich after each update; without it, equality of two selected trajectories would not rule out a third trajectory landing elsewhere. The theorem also remains only a correctness statement conditional on detecting extremal coalescence, so a long or heavy-tailed coalescence time can still make the method ineffective. This is the most common route to exact simulation in statistical mechanics models, where the partial order is coordinatewise order and the random updates are arranged so that larger configurations remain larger.
[example: Ising Model on a Finite Lattice]
Let $S=\{-1,+1\}^V$ for a finite graph with edge set $E$, ordered coordinatewise, and consider the ferromagnetic Ising law
\begin{align*}
\pi(\sigma)\propto \exp\left(\beta\sum_{\{i,j\}\in E}\sigma_i\sigma_j+\sum_{i\in V}h_i\sigma_i\right),\qquad \beta\ge 0.
\end{align*}
If site $v$ is updated while all other spins are fixed, write
\begin{align*}
H_v(\sigma)=h_v+\beta\sum_{u\sim v}\sigma_u.
\end{align*}
The heat-bath probability of assigning $+1$ at $v$ is
\begin{align*}
p_v(\sigma)=\frac{\exp(H_v(\sigma))}{\exp(H_v(\sigma))+\exp(-H_v(\sigma))}=\frac{1}{1+\exp(-2H_v(\sigma))}.
\end{align*}
Now couple all starting configurations by using the same chosen site $v$ and the same uniform variable $U\sim\mathrm{Unif}(0,1)$, setting the updated spin at $v$ equal to $+1$ when $U\le p_v(\sigma)$ and equal to $-1$ otherwise. If $\sigma\preceq \eta$, then $\sigma_u\le \eta_u$ for every neighbour $u\sim v$, so
\begin{align*}
\sum_{u\sim v}\sigma_u\le \sum_{u\sim v}\eta_u.
\end{align*}
Because $\beta\ge 0$,
\begin{align*}
H_v(\sigma)=h_v+\beta\sum_{u\sim v}\sigma_u\le h_v+\beta\sum_{u\sim v}\eta_u=H_v(\eta).
\end{align*}
The function $r\mapsto (1+\exp(-2r))^{-1}$ is increasing, hence
\begin{align*}
p_v(\sigma)\le p_v(\eta).
\end{align*}
Therefore the shared uniform update is order-preserving: if the lower chain gets $+1$, then $U\le p_v(\sigma)\le p_v(\eta)$ so the upper chain also gets $+1$; if the upper chain gets $-1$, then $U>p_v(\eta)\ge p_v(\sigma)$ so the lower chain also gets $-1$.
The all-minus configuration $\hat 0$ and all-plus configuration $\hat 1$ satisfy $\hat 0\preceq \sigma\preceq \hat 1$ for every $\sigma\in S$. Since each shared heat-bath update preserves this order, the composed update from time $-T$ to time $0$ satisfies
\begin{align*}
\Phi_{-T,0}(\hat 0)\preceq \Phi_{-T,0}(\sigma)\preceq \Phi_{-T,0}(\hat 1).
\end{align*}
Thus, if the all-minus and all-plus trajectories agree at time $0$, every intermediate trajectory must equal the same spin configuration. By the *Monotone CFTP Correctness Theorem*, that common configuration is an exact draw from the finite-volume Ising stationary distribution.
[/example]
The Ising example is high-dimensional, so the order structure hides some of the mechanics inside the spin configuration. A one-dimensional birth-death chain gives a smaller example where the same sandwiching idea can be seen directly: every initial state lies between the two endpoints, and an order-preserving update keeps it there until the endpoints meet. This makes the birth-death case a useful model for understanding what monotone CFTP is checking before returning an exact sample.
[example: Birth-Death Chain Perfect Simulation]
Let the birth-death chain on $S=\{0,1,\dots,N\}$ have transition probabilities
\begin{align*}
P(i,i+1)=p_i,\quad P(i,i-1)=q_i,\quad P(i,i)=r_i
\end{align*}
where $p_i+q_i+r_i=1$, with $q_0=0$ and $p_N=0$. Suppose these probabilities admit a monotone update map $F:S\times[0,1]\to S$ such that, for $U\sim\mathrm{Unif}(0,1)$, the random variable $F(i,U)$ has the $i$th transition law. For example, the one-step law is correct when the intervals assigned to the three moves have lengths $q_i,r_i,p_i$, since
\begin{align*}
\mathbb P(F(i,U)=i-1)=q_i,\quad \mathbb P(F(i,U)=i)=r_i,\quad \mathbb P(F(i,U)=i+1)=p_i.
\end{align*}
Use the same stored uniforms $U_{-T},\dots,U_{-1}$ for every possible starting state, and write $X_t^x$ for the trajectory started from $x$ at time $-T$. Since $0\le x\le N$, monotonicity gives after the first update
\begin{align*}
X_{-T+1}^0=F(0,U_{-T})\le F(x,U_{-T})=X_{-T+1}^x\le F(N,U_{-T})=X_{-T+1}^N.
\end{align*}
Applying the same argument at each later stored uniform preserves the order, so at time $0$,
\begin{align*}
\Phi_{-T,0}(0)\le \Phi_{-T,0}(x)\le \Phi_{-T,0}(N)
\end{align*}
for every $x\in S$. If the endpoint trajectories meet, say $\Phi_{-T,0}(0)=\Phi_{-T,0}(N)=c$, then
\begin{align*}
c\le \Phi_{-T,0}(x)\le c
\end{align*}
and hence $\Phi_{-T,0}(x)=c$ for every starting state $x$. Thus endpoint coalescence is full coalescence, and by the *Monotone CFTP Correctness Theorem* the returned common state $c$ has exactly the stationary distribution of the birth-death chain.
[/example]
## Limitations and Diagnostics
Perfect simulation is powerful because it removes burn-in error, but its assumptions are restrictive. The final question is how exact simulation should be compared with the approximate diagnostics used for ordinary MCMC.
[remark: Finite State and Coalescence Requirements]
Classical CFTP is most direct on finite state spaces with a reusable random mapping representation. The method also requires detectable coalescence, and for non-monotone chains this may mean tracking too many starting states. In high-dimensional models, the coalescence time can have a heavy tail even when ordinary MCMC estimates appear stable.
[/remark]
This limitation is not merely computational. Exact sampling gives correctness conditional on termination, but it does not by itself make termination fast.
[example: Slow Coalescence at Low Temperature]
Consider the ferromagnetic Ising model on a large finite lattice at inverse temperature $\beta$ with monotone heat-bath updates. Start the monotone CFTP construction from the extremal configurations
\begin{align*}
\hat 0(v)=-1\quad\text{and}\quad \hat 1(v)=+1
\end{align*}
for every lattice site $v$. Since the update map is monotone, the two trajectories satisfy
\begin{align*}
X_t^{\hat 0}\preceq X_t^{\hat 1}
\end{align*}
at every stored time $t$. Therefore CFTP can certify coalescence by checking whether
\begin{align*}
X_0^{\hat 0}=X_0^{\hat 1}.
\end{align*}
At low temperature, changing from a mostly minus configuration to a mostly plus configuration requires many sites to flip against the local field before a macroscopic phase change occurs. In heat-bath form, the probability of assigning $+1$ at site $v$ is
\begin{align*}
p_v(\sigma)=\frac{1}{1+\exp(-2H_v(\sigma))},
\end{align*}
where
\begin{align*}
H_v(\sigma)=h_v+\beta\sum_{u\sim v}\sigma_u.
\end{align*}
If the neighbours of $v$ are mostly $-1$ and $h_v$ is not large enough to offset them, then $H_v(\sigma)<0$, so
\begin{align*}
-2H_v(\sigma)>0.
\end{align*}
Hence
\begin{align*}
\exp(-2H_v(\sigma))>1
\end{align*}
and therefore
\begin{align*}
p_v(\sigma)=\frac{1}{1+\exp(-2H_v(\sigma))}<\frac{1}{2}.
\end{align*}
As $\beta$ increases, the magnitude of the neighbour contribution $\beta\sum_{u\sim v}\sigma_u$ increases, so spins aligned with a local phase become harder to overturn by single-site updates.
Thus the all-minus and all-plus extremal chains can remain separated for many stored updates even though each individual chain may show long intervals of local stability. Monotone CFTP is still correct once the extremal trajectories meet, by the *Monotone CFTP Correctness Theorem*, but the random search time needed to find such a meeting can be much larger than the time scale on which one realised chain appears visually stable. Exact stationarity and practical efficiency are therefore separate properties.
[/example]
Approximate convergence diagnostics answer a different question: whether the finite simulation output appears consistent with stationarity. They are useful safeguards, but they cannot prove exact sampling.
[definition: Approximate MCMC Diagnostic]
An approximate MCMC diagnostic is a statistic or graphical procedure computed from finite simulated output to assess evidence for convergence, mixing, or Monte Carlo error.
[/definition]
Diagnostics are indispensable in applied work because most target distributions do not admit perfect simulation. Their conclusions must be interpreted as evidence about the run, not as a theorem that the sample is stationary.
[remark: Perfect Simulation Versus Diagnostics]
Perfect simulation produces a draw with exactly stationary law when its assumptions hold and the algorithm terminates. Approximate diagnostics such as trace plots, autocorrelation estimates, effective sample size estimates, and between-chain comparisons study finite realised output from algorithms that usually have residual initialization bias. The two approaches are complementary: perfect simulation is a correctness mechanism for special structures, while diagnostics are practical monitoring tools for broad classes of MCMC algorithms.
[/remark]
The course perspective is that coupling provides a common language for both sides. Meeting times can bound total variation distance, exact coalescence from the past can remove burn-in entirely, and lagged coupled chains can remove expectation bias through telescoping. The price is that good couplings are model-dependent, and their computational cost is part of the algorithm rather than a secondary implementation detail.
## Beyond and Connected Topics
Monte Carlo methods sit between probability, statistics, optimization, and [numerical analysis](/page/Numerical%20Analysis). The law-of-large-numbers and central-limit arguments in the first chapters connect directly to [Cambridge IA Probability](/page/Cambridge%20IA%20Probability), [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure), and [Nonparametric Statistics](/page/Nonparametric%20Statistics). The Markov-chain chapters rely on [Cambridge IB Markov Chains](/page/Cambridge%20IB%20Markov%20Chains), [Ergodic Theory I: Foundations](/page/Ergodic%20Theory%20I%3A%20Foundations), and the detailed-balance viewpoint used throughout Bayesian and statistical computation. Importance sampling and sequential Monte Carlo connect to rare-event simulation and filtering, while Hamiltonian Monte Carlo is a bridge to geometric numerical integration, optimization, and gradient-based computation.
Several directions naturally extend these notes. Quasi-Monte Carlo replaces independent randomness with low-discrepancy structure. Multilevel Monte Carlo combines simulations at several discretization levels to reduce cost. Coupling-based unbiased MCMC turns meeting times into debiasing tools, and modern probabilistic programming systems build these algorithms into reusable inference engines. These extensions preserve the course's main lesson: correctness comes from an invariant expectation or distribution, while efficiency comes from matching the sampler to the geometry of the problem.
## References
- Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
- Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
- Androma, [Cambridge IB Markov Chains](/page/Cambridge%20IB%20Markov%20Chains).
- Androma, [Ergodic Theory I: Foundations](/page/Ergodic%20Theory%20I%3A%20Foundations).
- Androma, [Nonparametric Statistics](/page/Nonparametric%20Statistics).
- Androma, [Strong Law of Large Numbers](/theorems/520).
- Androma, [Central Limit Theorem for Standard Importance Sampling](/theorems/521).
- Christian P. Robert and George Casella, *Monte Carlo Statistical Methods*.
- Jun S. Liu, *Monte Carlo Strategies in Scientific Computing*.
- Art B. Owen, *Monte Carlo Theory, Methods and Examples*.
- Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors, *Sequential Monte Carlo Methods in Practice*.
Contents
- Introduction
- What Monte Carlo Computes
- Estimators, Random Error, and Cost
- Why Random Sampling Helps In High Dimension
- The Main Algorithmic Families
- Validity, Diagnostics, and Failure Modes
- Course Roadmap
- 1. Monte Carlo Estimation and Error
- Monte Carlo Estimators for Expectations
- Bias, Variance, and Root Mean Square Error
- Laws of Large Numbers and Error Certification
- Ratios and Posterior Expectations
- Variance Reduction Principles
- 2. Importance Sampling
- Change of Measure and Likelihood Ratios
- Self-Normalized Importance Sampling and Normalizing Constants
- Degeneracy, Effective Sample Size, and Rare Events
- 3. Rejection, Transformation, and Direct Sampling
- Transforming Uniform Random Variables
- Rejection Sampling And Envelope Design
- Adaptive Envelopes And Log-Concavity
- Slice Sampling As A Bridge To Markov Chains
- 4. Markov Chains for Sampling
- Markov Kernels and Stationarity
- Convergence in Total Variation
- Markov Chain Monte Carlo Estimators
- 5. Metropolis-Hastings Algorithms
- From Proposals to Metropolis-Hastings Kernels
- Scaling, Acceptance, and Autocorrelation
- Blocking, Componentwise Updates, and Variants
- 6. Gibbs Sampling and Data-Augmentation Methods
- Full Conditionals and Gibbs Transitions
- Data Augmentation and Auxiliary Variables
- Compatibility, Blocking, and Factorization
- 7. Convergence Diagnostics and Monte Carlo Workflow
- Single-Chain Error Assessment
- Multiple Chains and Scale Diagnostics
- Posterior Predictive Checks and Calibration
- Multimodality and Workflow Failures
- 8. Hamiltonian and Langevin Monte Carlo
- Langevin Diffusions and Discretization Bias
- Hamiltonian Dynamics and Leapfrog Integration
- No-U-Turn Sampling and Adaptive Geometry
- 9. Sequential Monte Carlo and Particle Filters
- Sequential Importance Sampling and Particle Approximations
- Bootstrap and Auxiliary Particle Filters
- Likelihood Estimation and Particle MCMC
- SMC Samplers for Static Targets
- 10. Perfect Simulation and Advanced Couplings
- Coupling Constructions and Meeting Times
- Coupling from the Past and Exact Stationarity
- Limitations and Diagnostics
- Beyond and Connected Topics
- References
Monte Carlo Methods
Content
Problems
History
Created by admin on 6/16/2026 | Last updated on 6/16/2026
Prerequisites (0/7 completed)
Log in to track your prerequisite progress.
Prerequisites Graph
Interactive dependency map showing prerequisite concepts
Loading dependency graph...
Theorem
Definition
Current
Requires
Rate this page
★
★
★
★
★
Poor
Excellent