Convergence In Distribution

Also known as: Weak convergence, Weak convergence of probability measures, Distributional convergence, Convergence in law, Convergence of distributions

Convergence in distribution is the language for limits of probability laws when pointwise convergence of random variables is too strong or unavailable.

## The problem convergence in distribution solves

Many limiting arguments in probability give limiting laws without giving a limiting [random variable](/page/Random%20Variable) on the original [probability space](/page/Probability%20Space). Statistics may be computed on sample spaces that change with $n$, while normalized counts may remain lattice-valued for every $n$ and still approach continuous laws. Convergence in distribution records this law-level limiting behavior. It is weaker than convergence in probability and almost sure convergence, but it is the right notion for limit theorems such as the [central limit theorem](/theorems/521).

The central object is the distribution, not the individual value of a random variable on a sample point. If $X_n$ and $X$ are real-valued random variables, the question is not whether $X_n(\omega)$ is close to $X(\omega)$ for most $\omega$, but whether the law of $X_n$ assigns nearly the same mass as the law of $X$ to the sets that matter for weak limits.

For real-valued variables this can be stated with distribution functions. For random elements in metric spaces it is more natural to use expectations of bounded continuous test functions. Both views express the same idea: only the limiting law matters.

The notation used here is \begin{align*} X_n \Rightarrow X \end{align*} or \begin{align*} \mathcal L(X_n) \Rightarrow \mathcal L(X). \end{align*} The symbol $\mathcal L(X)$ denotes the law of $X$. The arrow $\Rightarrow$ is reserved for convergence in distribution, not almost sure convergence or convergence in probability.

[example: Lattice Laws Approaching a Smooth Law]
Let $Y_1,Y_2,\ldots$ be independent Bernoulli$(p)$ random variables with $0<p<1$, set $S_n=Y_1+\cdots+Y_n$, and define \begin{align*} X_n=\frac{S_n-np}{\sqrt{np(1-p)}}.
\end{align*} For each $i$, the Bernoulli law gives $\mathbb P(Y_i=1)=p$ and $\mathbb P(Y_i=0)=1-p$, so \begin{align*} \mathbb E Y_i=1\cdot p+0\cdot(1-p)=p. \end{align*} Since $Y_i$ only takes the values $0$ and $1$, both possible values satisfy $y^2=y$, so $Y_i^2=Y_i$ almost surely. Therefore \begin{align*} \mathbb E(Y_i^2)=\mathbb E(Y_i)=p. \end{align*} Using $\operatorname{Var}(Y_i)=\mathbb E(Y_i^2)-(\mathbb E Y_i)^2$, we get \begin{align*} \operatorname{Var}(Y_i)=p-p^2=p(1-p). \end{align*} Because $0<p<1$, both $p>0$ and $1-p>0$, hence $p(1-p)>0$. Since each $Y_i$ is either $0$ or $1$, the sum $S_n$ can take only the integer values $0,1,\ldots,n$. If $S_n=k$, then substitution into the definition of $X_n$ gives \begin{align*} X_n=\frac{k-np}{\sqrt{np(1-p)}}. \end{align*} Thus every possible value of $X_n$ belongs to \begin{align*} \left\{\frac{k-np}{\sqrt{np(1-p)}}:k=0,1,\ldots,n\right\}. \end{align*} Conversely, whenever $S_n=k$, the displayed value of $X_n$ occurs by the same substitution. Hence these are exactly the possible values of $X_n$, so every finite-sample law of $X_n$ is supported on a finite lattice. For $k\in\{0,1,\ldots,n\}$, the event $S_n=k$ is the event that exactly $k$ of the variables $Y_1,\ldots,Y_n$ are equal to $1$ and the remaining $n-k$ are equal to $0$. For a fixed set $I\subset\{1,\ldots,n\}$ with $|I|=k$, the corresponding event is \begin{align*} \{Y_i=1\text{ for }i\in I,\ Y_j=0\text{ for }j\notin I\}. \end{align*} By independence, its probability is \begin{align*} \prod_{i\in I}\mathbb P(Y_i=1)\prod_{j\notin I}\mathbb P(Y_j=0)=\prod_{i\in I}p\prod_{j\notin I}(1-p). \end{align*} The set $I$ has $k$ elements and its complement has $n-k$ elements, so \begin{align*} \prod_{i\in I}p\prod_{j\notin I}(1-p)=p^k(1-p)^{n-k}. \end{align*} There are $\binom nk$ choices of such a set $I$, and the corresponding events are disjoint because a single outcome has one fixed set of positions where $Y_i=1$. 
Therefore \begin{align*} \mathbb P(S_n=k)=\binom nk p^k(1-p)^{n-k}. \end{align*} The denominator $\sqrt{np(1-p)}$ is positive, because $n\ge1$ and $p(1-p)>0$. Hence the map $k\mapsto (k-np)/\sqrt{np(1-p)}$ is one-to-one on $\{0,1,\ldots,n\}$, and \begin{align*} \left\{X_n=\frac{k-np}{\sqrt{np(1-p)}}\right\}=\{S_n=k\}. \end{align*} Therefore \begin{align*} \mathbb P\left(X_n=\frac{k-np}{\sqrt{np(1-p)}}\right)=\binom nk p^k(1-p)^{n-k}. \end{align*} The variables $Y_i$ are independent and identically distributed with mean $p$ and variance $p(1-p)>0$, so the [Central Limit Theorem](/theorems/532), applied with $\mu=p$ and $\sigma=\sqrt{p(1-p)}$, gives \begin{align*} \frac{S_n-np}{\sqrt{p(1-p)}\sqrt n}\Rightarrow Z, \end{align*} where $Z$ has the standard normal distribution. Since $n>0$, $p>0$, and $1-p>0$, both $\sqrt{p(1-p)}\sqrt n$ and $\sqrt{np(1-p)}$ are positive. Their squares agree: \begin{align*} \left(\sqrt{p(1-p)}\sqrt n\right)^2=p(1-p)n=np(1-p). \end{align*} Thus \begin{align*} \sqrt{p(1-p)}\sqrt n=\sqrt{np(1-p)}. \end{align*} Substituting this equality into the normalized sum gives \begin{align*} \frac{S_n-np}{\sqrt{p(1-p)}\sqrt n}=\frac{S_n-np}{\sqrt{np(1-p)}}=X_n. \end{align*} Hence \begin{align*} X_n\Rightarrow Z. \end{align*} Let $A$ be a Borel set whose boundary has normal probability zero: \begin{align*} \mathbb P(Z\in\partial A)=0. \end{align*} The law of $Z$ assigns mass $\mathbb P(Z\in B)$ to each Borel set $B$, so the displayed condition says that the law of $Z$ assigns zero mass to $\partial A$. Thus $A$ is a continuity set for the law of $Z$. By the continuity-set form of the *[Portmanteau Theorem](/theorems/1171)*, \begin{align*} \mathbb P(X_n\in A)\to \mathbb P(Z\in A). \end{align*} The finite laws never stop being lattice laws, but convergence in distribution records that their masses on normal-continuity sets approach the corresponding masses of the smooth limiting normal law. 
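
As a numerical companion to this example, the following sketch compares the exact distribution function of $X_n$ with the standard normal distribution function at a few points, each of which is a continuity point of the normal law. The function names are illustrative and not part of the source, and the binomial probabilities are evaluated on the log scale with `math.lgamma` only to avoid overflow for large $n$. It is a sanity check under these assumptions, not a proof.

```python
import math


def binom_pmf(n, k, p):
    """P(S_n = k) for S_n ~ Binomial(n, p), computed on the log scale."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def standardized_binomial_cdf(n, p, x):
    """Exact P(X_n <= x) for X_n = (S_n - n p) / sqrt(n p (1 - p))."""
    scale = math.sqrt(n * p * (1.0 - p))
    # X_n <= x holds exactly when S_n <= n p + x * scale.
    k_max = min(n, math.floor(n * p + x * scale))
    return sum(binom_pmf(n, k, p) for k in range(0, k_max + 1))


def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_gap(n, p, points):
    """Largest |F_{X_n}(x) - Phi(x)| over the supplied points."""
    return max(
        abs(standardized_binomial_cdf(n, p, x) - normal_cdf(x)) for x in points
    )


if __name__ == "__main__":
    points = [-1.5, -0.5, 0.3, 1.0]
    for n in (25, 100, 400):
        print(n, max_gap(n, 0.3, points))
```

The law of $X_n$ stays on its lattice for every $n$, so the exact distribution function is a step function; the gap reported here only measures how far its steps sit from the smooth normal curve at the chosen points.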
[/example]

## Definition

The formal definition captures convergence of the laws of real-valued random variables by checking their distribution functions only at continuity points of the limiting distribution function. Continuity points are used because atoms of the limiting law create jumps where pointwise convergence of distribution functions need not hold.

[definition: Convergence in Distribution]
Let $X_1,X_2,\ldots$ and $X$ be real-valued random variables. For any real-valued random variable $Y$, write $F_Y:\mathbb R\to[0,1]$ for its distribution function, \begin{align*} F_Y(x)=\mathbb P(Y\le x). \end{align*} The sequence $X_n$ converges in distribution to $X$, written $X_n\Rightarrow X$, if \begin{align*} F_{X_n}(x)\to F_X(x) \end{align*} for every continuity point $x$ of $F_X$.
[/definition]

### Distributional Convergence

Distribution functions give a real-valued law a concrete cumulative form. They are the objects whose pointwise convergence detects convergence in distribution on the real line.

### Distribution Functions

The convergence criterion above uses $F_X$ as the object being compared, so we need a precise definition of that object before using the criterion further. The point is to encode the whole law of a real-valued random variable through the probabilities of half-line events $(-\infty,x]$, which can then be compared point by point.

[definition: Distribution Function]
For a real-valued random variable $X$, its distribution function is the map $F_X:\mathbb R\to[0,1]$ defined by \begin{align*} F_X:x\mapsto\mathbb P(X\le x). \end{align*}
[/definition]

### Basic Shape

Distribution functions are nondecreasing, right-continuous, and have limits $0$ at $-\infty$ and $1$ at $+\infty$.

### Jumps and Continuity Points

Distribution functions may have jumps. A jump at $x$ has size $\mathbb P(X=x)$. Those jumps are exactly why the definition only asks for convergence at continuity points of the limiting distribution function.

The restriction to continuity points is essential.
If $X_n=1/n$ deterministically and $X=0$ deterministically, then $X_n\Rightarrow X$. For every $x\ne 0$, the distribution functions eventually agree with the limiting distribution function. At $x=0$, however, \begin{align*} F_{X_n}(0)=0,\qquad F_X(0)=1. \end{align*} The point $0$ is a discontinuity point of $F_X$, so this mismatch does not prevent convergence in distribution. This example also shows that convergence in distribution is about the limiting law. The point masses of $X_n$ sit at $1/n$, not at $0$, but the laws still accumulate at the point mass at $0$. [example: Deterministic Approximation of an Atom] Let $X_n=1/n$ almost surely and $X=0$ almost surely. We verify the law-level convergence by bounded continuous test functions: fix a bounded [continuous function](/page/Continuous%20Function) $f:\mathbb R\to\mathbb R$. Since $X_n=1/n$ almost surely, there is an event $A_n$ with $\mathbb P(A_n)=1$ such that $X_n(\omega)=1/n$ for every $\omega\in A_n$. Applying $f$ on this event gives \begin{align*} f(X_n(\omega))=f(1/n). \end{align*} Thus $f(X_n)=f(1/n)$ almost surely. Since $f$ is bounded, $f(X_n)$ is integrable, and changing an integrable random variable on a null event does not change its expectation. Therefore \begin{align*} \mathbb E f(X_n)=f(1/n). \end{align*} Similarly, because $X=0$ almost surely, there is an event $A$ with $\mathbb P(A)=1$ such that $X(\omega)=0$ for every $\omega\in A$. Hence \begin{align*} f(X(\omega))=f(0) \end{align*} for every $\omega\in A$, so \begin{align*} \mathbb E f(X)=f(0). \end{align*} The numerical sequence $1/n$ converges to $0$, and $f$ is continuous at $0$. Therefore \begin{align*} f(1/n)\to f(0). \end{align*} Using the two expectation identities, \begin{align*} \mathbb E f(X_n)=f(1/n)\to f(0)=\mathbb E f(X). \end{align*} Since this holds for every bounded continuous $f:\mathbb R\to\mathbb R$, the laws of $X_n$ converge weakly to the law of $X$, and therefore \begin{align*} X_n\Rightarrow X. 
\end{align*} The distribution functions still disagree at the atom of the limiting law. At $x=0$, let \begin{align*} A_n=\{X_n=1/n\}. \end{align*} Then $\mathbb P(A_n)=1$. On $A_n$, the event $\{X_n\le0\}$ is the event that $1/n\le0$. Since $n\ge1$, we have $1/n>0$, so $1/n\le0$ is false. Hence \begin{align*} \{X_n\le0\}\cap A_n=\varnothing. \end{align*} Because $\Omega=A_n\cup A_n^c$ and the intersection with $A_n$ is empty, \begin{align*} \{X_n\le0\}\subseteq A_n^c. \end{align*} Therefore \begin{align*} \mathbb P(X_n\le0)\le\mathbb P(A_n^c). \end{align*} Since $\mathbb P(A_n)=1$, \begin{align*} \mathbb P(A_n^c)=1-\mathbb P(A_n)=1-1=0. \end{align*} Thus \begin{align*} 0\le \mathbb P(X_n\le0)\le0, \end{align*} and so \begin{align*} F_{X_n}(0)=\mathbb P(X_n\le0)=0. \end{align*} For the limiting variable, let \begin{align*} A=\{X=0\}. \end{align*} Then $\mathbb P(A)=1$. On $A$, the inequality $X\le0$ becomes $0\le0$, which is true, so \begin{align*} A\subseteq\{X\le0\}. \end{align*} It follows that \begin{align*} 1=\mathbb P(A)\le\mathbb P(X\le0)\le1. \end{align*} Therefore \begin{align*} F_X(0)=\mathbb P(X\le0)=1. \end{align*} To see the jump at $0$ explicitly, fix any $x<0$. On the event $A=\{X=0\}$, the inequality $X\le x$ becomes $0\le x$, which is false because $x<0$. Hence \begin{align*} \{X\le x\}\cap A=\varnothing. \end{align*} As above, this implies \begin{align*} \{X\le x\}\subseteq A^c. \end{align*} Therefore \begin{align*} F_X(x)=\mathbb P(X\le x)\le\mathbb P(A^c). \end{align*} Since $\mathbb P(A)=1$, \begin{align*} \mathbb P(A^c)=1-\mathbb P(A)=0. \end{align*} Thus \begin{align*} 0\le F_X(x)\le0, \end{align*} so \begin{align*} F_X(x)=0 \end{align*} for every $x<0$. Hence the left limit of $F_X$ at $0$ is $0$, while $F_X(0)=1$. The jump size is \begin{align*} F_X(0)-\lim_{x\uparrow0}F_X(x)=1-0=1. \end{align*} So $0$ is a discontinuity point of the limiting distribution function. 
The mismatch $F_{X_n}(0)=0$ and $F_X(0)=1$ is exactly the atom-level disagreement that convergence in distribution excludes from the distribution-function test.
[/example]

## Weak convergence of laws

The distribution-function definition is tied to the order structure of the real line. For random vectors or random elements in a [metric space](/page/Metric%20Space), bounded continuous functions provide a coordinate-free definition.

Let $S$ be a metric space with its Borel sigma-algebra. A [probability measure](/page/Probability%20Measure) $\mu$ on $S$ is the law of an $S$-valued random element $X$ if $\mu(A)=\mathbb P(X\in A)$ for all Borel sets $A$.

[definition: Weak Convergence of Probability Measures]
Let $\mu_1,\mu_2,\ldots$ and $\mu$ be probability measures on a metric space $S$, so each is a map $\mathcal B(S)\to[0,1]$ on the Borel sigma-algebra of $S$. The measures $\mu_n$ converge weakly to $\mu$, written $\mu_n\Rightarrow\mu$, if \begin{align*} \int_S f\,d\mu_n\to \int_S f\,d\mu \end{align*} for every bounded continuous function $f:S\to\mathbb R$.
[/definition]

For random elements this means that $X_n\Rightarrow X$ when the laws $\mathcal L(X_n)$ converge weakly to $\mathcal L(X)$. The test functions act as smooth measuring devices, and the portmanteau theorem below turns this test-function viewpoint into statements about sets whose boundaries have no limiting mass.

For $S=\mathbb R$, this weak-convergence definition agrees with the distribution-function definition above. The test-function formulation is often the most stable definition in applications, because it is compatible with continuous maps, product spaces, random vectors, empirical measures, and stochastic processes.

## The portmanteau principle

The portmanteau theorem collects several equivalent ways to detect convergence in distribution. Each form emphasizes a different kind of test. Open sets test lower bounds, closed sets test upper bounds, and continuity sets test ordinary convergence of probabilities.
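
The three kinds of test can be seen on the point masses $X_n=1/n$ and $X=0$ used earlier. The sketch below assumes the standard form of the inequalities, $\liminf_n\mathbb P(X_n\in G)\ge\mathbb P(X\in G)$ for open $G$ and $\limsup_n\mathbb P(X_n\in F)\le\mathbb P(X\in F)$ for closed $F$, together with plain convergence on sets whose boundary carries no limiting mass. The helper names and the three particular sets are illustrative choices, not taken from the source.

```python
def point_mass_prob(c, contains):
    """P(X in A) when X = c almost surely; contains(x) is True iff x is in A."""
    return 1.0 if contains(c) else 0.0


def in_open_set(x):
    """Membership in the open set G = (0, infinity)."""
    return x > 0


def in_closed_set(x):
    """Membership in the closed set F = (-infinity, 0]."""
    return x <= 0


def in_continuity_set(x):
    """Membership in A = (-infinity, 1/2]; its boundary {1/2} has no mass under X = 0."""
    return x <= 0.5


def probs_along_sequence(contains, n_values):
    """P(X_n in A) for X_n = 1/n, one value per n in n_values."""
    return [point_mass_prob(1.0 / n, contains) for n in n_values]


if __name__ == "__main__":
    ns = range(1, 11)
    for name, test in [
        ("open", in_open_set),
        ("closed", in_closed_set),
        ("continuity", in_continuity_set),
    ]:
        print(name, probs_along_sequence(test, ns), point_mass_prob(0.0, test))
```

For the open and closed sets the probabilities along the sequence do not converge to the limiting probability, because the boundary point $0$ carries all of the limiting mass; only the one-sided inequalities survive. For the set with boundary $\{1/2\}$ the probabilities agree with the limit from $n=2$ onward.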
[definition: Continuity Set]
Let $\mu$ be a probability measure on a metric space $S$. A Borel set $A\subset S$ is a continuity set for $\mu$ if \begin{align*} \mu(\partial A)=0, \end{align*} where $\partial A$ is the boundary of $A$.
[/definition]

To apply convergence in distribution, one needs tests that turn the abstract bounded-continuous-function definition into usable probability bounds. The key obstruction is that arbitrary Borel sets can see boundary mass, so the theorem below identifies exactly which open, closed, and continuity-set tests are stable under weak limits.

[quotetheorem:1171]

The theorem explains why convergence of probabilities for all Borel sets is too strong. If the limiting law has an atom, sets with boundary through that atom may have unstable probabilities. Continuity sets avoid that problem.

On the real line, intervals of the form $(-\infty,x]$ are continuity sets exactly when $F_X$ is continuous at $x$. Thus the portmanteau theorem recovers the distribution-function criterion as a special case.

The open-set and closed-set forms are useful when a set is not described by a finite collection of coordinates. For instance, in function spaces one often proves lower bounds for open neighborhoods and upper bounds for compact or closed constraints.

## Relationship with stronger modes of convergence

Convergence in distribution is weaker than convergence in probability, which is in turn weaker than almost sure convergence. The usual implication chain is \begin{align*} X_n\to X\ \text{almost surely} \quad\Longrightarrow\quad X_n\to X\ \text{in probability} \quad\Longrightarrow\quad X_n\Rightarrow X. \end{align*}

[quotetheorem:10010]

The converse implication does not hold in general. Let $X$ have a non-degenerate distribution and let $X_n$ be independent copies of $X$. Then $X_n\Rightarrow X$, because every $X_n$ has the same law as $X$. If the copies are also independent of $X$, the sequence does not converge to $X$ in probability on the shared probability space: for every sufficiently small $\varepsilon>0$, the probability $\mathbb P(|X_n-X|>\varepsilon)$ is the same positive number for every $n$.
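
A minimal exact computation makes this counterexample concrete. The sketch assumes that $X$ and each $X_n$ are independent Bernoulli$(p)$ variables, which is one particular non-degenerate choice and not the only one; the function name is hypothetical. Exact rational arithmetic is used so that no simulation error enters.

```python
from fractions import Fraction
from itertools import product


def prob_far_apart(p, eps):
    """Exact P(|X_n - X| > eps) for independent Bernoulli(p) variables X_n and X.

    The answer does not depend on n, because every X_n has the same joint
    law with X under the independence assumption.
    """
    law = {0: 1 - p, 1: p}
    total = Fraction(0)
    # Enumerate the four joint outcomes; independence makes the joint
    # probability the product of the marginals.
    for (a, prob_a), (b, prob_b) in product(law.items(), repeat=2):
        if abs(a - b) > eps:
            total += prob_a * prob_b
    return total


if __name__ == "__main__":
    print(prob_far_apart(Fraction(1, 2), Fraction(1, 2)))
```

For any $0<\varepsilon<1$ the result is $2p(1-p)$, a constant that is strictly positive when $0<p<1$, so it cannot tend to $0$ as $n\to\infty$ even though $X_n\Rightarrow X$ holds trivially.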
There is one important special case where convergence in distribution becomes convergence in probability. If the limiting random variable is constant, distributional convergence forces concentration near that constant: the limiting law assigns all probability to one point, so every fixed neighborhood of that point eventually captures almost all of the mass, and [weak convergence](/page/Weak%20Convergence) leaves no room for mass to oscillate between separated neighborhoods. This turns a distributional statement into a probability statement near the single possible limiting value, which makes the result the bridge between distributional limits and the probability bounds used in consistency arguments.

[quotetheorem:10011]

This criterion is common in estimation. Consistency of an estimator is often stated as convergence in probability to a parameter. If a limit theorem first gives convergence in distribution to a point mass at that parameter, the criterion upgrades it to convergence in probability.

## Continuous mapping

Convergence in distribution behaves well under continuous transformations. This is the main way to transfer a limit theorem from one statistic to another. The theorem is exact about the role of continuity: discontinuities may break convergence unless the limiting law avoids them.

[quotetheorem:6304]

For real-valued variables, this allows limits of transformed statistics. If $X_n\Rightarrow X$, then $X_n^2\Rightarrow X^2$, $\exp(X_n)\Rightarrow \exp(X)$, and $\sin(X_n)\Rightarrow \sin(X)$. If the transformation is discontinuous, a modified version can still apply when the limiting law assigns probability zero to the discontinuity set.
For example, the indicator map $g:\mathbb R\to\{0,1\}$ defined by $g(x)=\mathbb{1}_{\{x\le a\}}$ is discontinuous at $a$. It preserves convergence in expectation when $\mathbb P(X=a)=0$. This is another expression of the same boundary principle from the portmanteau theorem. [example: Threshold Indicators] Suppose $X_n\Rightarrow X$, $\mathbb P(X\le a)=p$, and $\mathbb P(X=a)=0$. Define \begin{align*} A=(-\infty,a]. \end{align*} We compute the limit of $\mathbb P(X_n\le a)$ by rewriting the threshold event as the membership event $\{X_n\in A\}$. The complement of $A$ in $\mathbb R$ is \begin{align*} \mathbb R\setminus A=(a,\infty). \end{align*} The interval $(a,\infty)$ is open in $\mathbb R$, so $A$ is closed and hence Borel. Therefore \begin{align*} \overline A=A=(-\infty,a]. \end{align*} Next identify the interior. If $x<a$, set \begin{align*} \delta=\frac{a-x}{2}. \end{align*} Then $\delta>0$. For any $y$ with $|y-x|<\delta$, we have $y<x+\delta$, and \begin{align*} x+\delta=x+\frac{a-x}{2}=\frac{2x+a-x}{2}=\frac{x+a}{2}<a. \end{align*} The last inequality follows from $x<a$. Hence $(x-\delta,x+\delta)\subseteq A$, so every $x<a$ lies in $A^\circ$. If $x=a$, then for every $\delta>0$ the point $a+\delta/2$ satisfies \begin{align*} a-\delta<a+\delta/2<a+\delta \end{align*} and also \begin{align*} a+\delta/2>a. \end{align*} Thus $a+\delta/2\notin A$, so no open interval around $a$ is contained in $A$. If $x>a$, then $x\notin A$, so $x$ cannot lie in $A^\circ$. Therefore \begin{align*} A^\circ=(-\infty,a). \end{align*} Using $\partial A=\overline A\setminus A^\circ$, we get \begin{align*} \partial A=(-\infty,a]\setminus(-\infty,a)=\{a\}. \end{align*} Therefore \begin{align*} \mathbb P(X\in\partial A)=\mathbb P(X\in\{a\})=\mathbb P(X=a)=0. \end{align*} Thus $A$ is a continuity set for the law of $X$. Since $X_n\Rightarrow X$, the continuity-set form of the *Portmanteau Theorem* gives \begin{align*} \mathbb P(X_n\in A)\to\mathbb P(X\in A). 
\end{align*} By the definition of $A$, \begin{align*} \{X_n\in A\}=\{X_n\in(-\infty,a]\}=\{X_n\le a\}. \end{align*} Similarly, \begin{align*} \{X\in A\}=\{X\in(-\infty,a]\}=\{X\le a\}. \end{align*} Substituting these event identities into the preceding limit gives \begin{align*} \mathbb P(X_n\le a)\to\mathbb P(X\le a). \end{align*} Using $\mathbb P(X\le a)=p$, we obtain \begin{align*} \mathbb P(X_n\le a)\to p. \end{align*}

The indicator map $x\mapsto\mathbb 1_{\{x\le a\}}$ fails to be continuous exactly at the threshold $a$, and the condition $\mathbb P(X=a)=0$ makes that discontinuity invisible to the distributional limit.
[/example]

## Slutsky's theorem

Many statistical limits involve one component with a non-degenerate distributional limit and another component that vanishes or approaches a constant. [Slutsky's theorem](/theorems/10012) combines those components. It is used when replacing unknown parameters by consistent estimates or when simplifying normalized statistics.

[quotetheorem:10012]

A typical use has $Y_n=o_p(1)$, which means $Y_n\to0$ in probability. Then $X_n+Y_n\Rightarrow X$. This says that a term which vanishes in probability can be ignored at the distributional scale.

Another typical use has a consistent variance estimate in a denominator. If \begin{align*} \sqrt n(\hat\theta_n-\theta)\Rightarrow N(0,\sigma^2) \end{align*} and $\hat\sigma_n\to\sigma>0$ in probability, then \begin{align*} \frac{\sqrt n(\hat\theta_n-\theta)}{\hat\sigma_n}\Rightarrow N(0,1). \end{align*}

[example: Studentized Mean]
Let $\bar X_n=n^{-1}\sum_{i=1}^n X_i$, and let $S_n$ be the sample standard deviation. Assume $X_1,X_2,\ldots$ are independent identically distributed with mean $\mu$ and variance $\sigma^2>0$, and assume $S_n\to\sigma$ in probability. Define \begin{align*} W_n=\frac{\sqrt n(\bar X_n-\mu)}{S_n} \end{align*} on $\{S_n\ne0\}$, and set $W_n=0$ on $\{S_n=0\}$.
First check that the zero-denominator convention is used only on an event whose probability tends to $0$. Since $\sigma^2>0$, we have $\sigma>0$, so $\sigma/2>0$. If $S_n=0$, then \begin{align*} |S_n-\sigma|=|0-\sigma|=\sigma. \end{align*} Because $\sigma>\sigma/2$, this gives \begin{align*} \{S_n=0\}\subseteq\{|S_n-\sigma|>\sigma/2\}. \end{align*} Therefore \begin{align*} \mathbb P(S_n=0)\le \mathbb P(|S_n-\sigma|>\sigma/2). \end{align*} Since $S_n\to\sigma$ in probability and $\sigma/2>0$, \begin{align*} \mathbb P(|S_n-\sigma|>\sigma/2)\to0. \end{align*} Hence \begin{align*} \mathbb P(S_n=0)\to0. \end{align*} Write \begin{align*} T_n=X_1+\cdots+X_n. \end{align*} Since $\bar X_n=T_n/n$, \begin{align*} \sqrt n(\bar X_n-\mu)=\sqrt n\left(\frac{T_n}{n}-\mu\right). \end{align*} Putting the terms inside the parentheses over the common denominator $n$ gives \begin{align*} \frac{T_n}{n}-\mu=\frac{T_n}{n}-\frac{n\mu}{n}=\frac{T_n-n\mu}{n}. \end{align*} Thus \begin{align*} \sqrt n(\bar X_n-\mu)=\sqrt n\left(\frac{T_n-n\mu}{n}\right). \end{align*} For $n\ge1$, $n=(\sqrt n)^2$, so \begin{align*} \frac{\sqrt n}{n}=\frac{\sqrt n}{(\sqrt n)^2}=\frac{1}{\sqrt n}. \end{align*} Therefore \begin{align*} \sqrt n\left(\frac{T_n-n\mu}{n}\right)=\frac{T_n-n\mu}{\sqrt n}. \end{align*} Dividing by $\sigma>0$ gives \begin{align*} \frac{\sqrt n(\bar X_n-\mu)}{\sigma}=\frac{T_n-n\mu}{\sigma\sqrt n}. \end{align*} By the [Central Limit Theorem](/theorems/532), \begin{align*} \frac{\sqrt n(\bar X_n-\mu)}{\sigma}\Rightarrow Z, \end{align*} where $Z\sim N(0,1)$. Next scale the random denominator by its limit. For every $\varepsilon>0$, \begin{align*} \left|\frac{S_n}{\sigma}-1\right|=\left|\frac{S_n-\sigma}{\sigma}\right|. \end{align*} Since $\sigma>0$, \begin{align*} \left|\frac{S_n-\sigma}{\sigma}\right|=\frac{|S_n-\sigma|}{\sigma}. \end{align*} Hence \begin{align*} \left\{\left|\frac{S_n}{\sigma}-1\right|>\varepsilon\right\}=\{|S_n-\sigma|>\sigma\varepsilon\}. 
\end{align*} Because $\sigma\varepsilon>0$ and $S_n\to\sigma$ in probability, \begin{align*} \mathbb P\left(\left|\frac{S_n}{\sigma}-1\right|>\varepsilon\right)=\mathbb P(|S_n-\sigma|>\sigma\varepsilon)\to0. \end{align*} Thus \begin{align*} \frac{S_n}{\sigma}\to1 \end{align*} in probability.

Define \begin{align*} U_n=\frac{\sqrt n(\bar X_n-\mu)}{\sigma} \end{align*} and \begin{align*} V_n=\frac{S_n}{\sigma}. \end{align*} We have shown that $U_n\Rightarrow Z$ and $V_n\to1$ in probability. On the event $\{S_n\ne0\}$, equivalently $\{V_n\ne0\}$, multiply by $\sigma/\sigma=1$: \begin{align*} \frac{\sqrt n(\bar X_n-\mu)}{S_n}=\frac{\sqrt n(\bar X_n-\mu)}{S_n}\cdot\frac{\sigma}{\sigma}. \end{align*} Rearranging the nonzero factors gives \begin{align*} \frac{\sqrt n(\bar X_n-\mu)}{S_n}\cdot\frac{\sigma}{\sigma}=\frac{\sqrt n(\bar X_n-\mu)}{\sigma}\cdot\frac{\sigma}{S_n}. \end{align*} Since $S_n\ne0$ and $\sigma>0$, \begin{align*} \frac{\sigma}{S_n}=\frac{1}{S_n/\sigma}=\frac{1}{V_n}. \end{align*} Therefore, on $\{S_n\ne0\}$, \begin{align*} W_n=\frac{U_n}{V_n}. \end{align*} On $\{S_n=0\}$, the definition sets $W_n=0$, which is a fixed convention on the zero-denominator event. Because $\mathbb P(S_n=0)\to0$, this convention changes $W_n$ only on an event of vanishing probability and so does not affect the distributional limit.

Since $U_n\Rightarrow Z$, $V_n\to1$ in probability, and $1\ne0$, the quotient form of *Slutsky's Theorem* gives \begin{align*} W_n\Rightarrow \frac{Z}{1}. \end{align*} For every outcome $\omega$, \begin{align*} \frac{Z(\omega)}{1}=Z(\omega). \end{align*} Thus $Z/1=Z$ as random variables, and so \begin{align*} W_n\Rightarrow Z. \end{align*} Since $Z\sim N(0,1)$, this is equivalently \begin{align*} W_n\Rightarrow N(0,1). \end{align*}

Replacing the unknown $\sigma$ by the consistent estimate $S_n$ therefore leaves the limiting standard normal law unchanged.
[/example]

## Moment generating functions

Moment generating functions give another way to identify weak limits, but only when the relevant exponential moments exist near the origin.
For a real-valued random variable $X$, the [moment generating function](/page/Moment%20Generating%20Function) is \begin{align*} M_X(t)=\mathbb E e^{tX}, \end{align*} for those real $t$ where the expectation is finite. Unlike characteristic functions, this transform need not exist for every $t$, so a [continuity theorem for moment generating functions](/theorems/9547) must assume convergence on an interval around $0$.

[quotetheorem:1145]

This theorem is useful because it replaces direct comparison of distribution functions by comparison of transforms in a neighborhood of the origin. It also turns sums of independent variables into products. If $S_n=X_1+\cdots+X_n$ for independent variables and the displayed moment generating functions are finite at $t$, then \begin{align*} M_{S_n}(t)=\prod_{k=1}^n M_{X_k}(t). \end{align*} After normalization, asymptotic expansions of these products often reveal the limiting law.

## The functional central limit theorem as the model case

The ordinary [central limit theorem](/theorems/1848) describes the distribution of one normalized sum at one terminal time. A stronger question asks for the limiting law of the whole running-sum path. Given partial sums $S_k=X_1+\cdots+X_k$, form a continuous random path on $[0,1]$ by plotting the normalized values at times $k/n$ and joining adjacent points by straight line segments.

The space $\mathcal C([0,1],\mathbb R)$ is the set of continuous real-valued functions on $[0,1]$, equipped here with the supremum norm \begin{align*} \|f\|_\infty=\sup_{0\le t\le 1}|f(t)|. \end{align*} A standard [Brownian motion](/page/Brownian%20Motion) is the continuous [stochastic process](/page/Stochastic%20Process), started at $0$, whose increments are independent, centered normal random variables with variance equal to the time increment. Donsker's theorem says that the entire interpolated random path converges in distribution to Brownian motion as a random element of this path space.
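
The path construction described above can be sketched in a few lines. The code assumes that the steps are centered with unit variance, so that the normalized value at time $k/n$ is $S_k/\sqrt n$; that normalization, the function names, and the $\pm1$ steps in the demo are assumptions made for illustration and are not fixed by the text.

```python
import math
import random


def interpolated_path(steps):
    """Return the piecewise-linear path through the points (k/n, S_k / sqrt(n)).

    `steps` holds X_1, ..., X_n; the steps are assumed centered with unit variance.
    """
    n = len(steps)
    if n == 0:
        raise ValueError("need at least one step")
    partial_sums = [0.0]
    for x in steps:
        partial_sums.append(partial_sums[-1] + x)
    scale = math.sqrt(n)

    def path(t):
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        position = t * n
        k = min(int(position), n - 1)   # index of the segment containing t
        weight = position - k           # relative position inside that segment
        left, right = partial_sums[k], partial_sums[k + 1]
        return (left + weight * (right - left)) / scale

    return path


def sup_norm_on_grid(path, n):
    """Supremum norm of the path, evaluated at the nodes k/n.

    For a piecewise-linear path the maximum of |path| is attained at a node.
    """
    return max(abs(path(k / n)) for k in range(n + 1))


if __name__ == "__main__":
    random.seed(0)
    steps = [random.choice((-1.0, 1.0)) for _ in range(1000)]
    w = interpolated_path(steps)
    print(w(1.0), sup_norm_on_grid(w, len(steps)))
```

Evaluating the returned path at $t=1$ gives $S_n/\sqrt n$, the quantity in the ordinary central limit theorem, while the supremum norm is an example of a functional of the whole path rather than of its endpoint.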
[quotetheorem:1189]

The theorem is not merely a statement about the endpoint $S_n$. It says that the fluctuations of the whole normalized walk, viewed as a continuous curve, have a universal Brownian limit. In particular, applying the continuous map that evaluates a path at time $1$ recovers the usual central limit behavior for the terminal normalized sum. Convergence in distribution is designed to describe this passage from discrete laws to a continuous law.

## Tightness and existence of subsequential limits

Convergence in distribution has a compactness aspect. A family of probability measures is tight if most of its mass can be captured inside compact sets uniformly. On the real line this means that the tails can be made uniformly small.

[definition: Tightness on the Real Line]
A sequence of real-valued random variables $X_n$ is tight if for every $\varepsilon>0$ there is $M<\infty$ such that \begin{align*} \sup_n \mathbb P(|X_n|>M)<\varepsilon. \end{align*}
[/definition]

Tightness prevents probability mass from escaping to infinity. If $X_n=n$ deterministically, then no subsequence can converge in distribution to a real-valued random variable. The laws move farther out along the real line.

If a sequence of real-valued random variables is tight, every subsequence has a further subsequence that converges in distribution. In $\mathbb R^d$, [Prokhorov's theorem](/theorems/1172) says that tightness is equivalent to relative compactness for weak convergence.

Tightness is therefore a common first step in proving a distributional limit. One proves that subsequential limits exist, identifies every possible subsequential limit, and then concludes that the whole sequence converges.

## Random vectors

For random vectors in $\mathbb R^d$, convergence in distribution is weak convergence of their laws on $\mathbb R^d$. It can be checked by bounded continuous functions, by continuity sets, or by characteristic functions. The distribution-function analogue uses rectangles \begin{align*} (-\infty,x_1]\times\cdots\times(-\infty,x_d].
\end{align*} Convergence of these multivariate distribution functions is required at continuity points of the limiting distribution function.

The Cramér-Wold device gives another useful criterion.

[quotetheorem:9818]

This reduces multivariate convergence to one-dimensional convergence of all linear projections. It is especially useful when the limiting law is multivariate normal. For example, to prove that a vector of normalized estimators is asymptotically normal, it is often enough to prove asymptotic normality of every fixed linear combination. The device is powerful because the laws of all one-dimensional linear projections together determine a probability law on $\mathbb R^d$.

## Beyond and Connected Topics

Convergence in distribution is the entry point to weak convergence on richer spaces. In [empirical process theory](/page/Empirical%20Process%20Theory), the random objects are functions and the limit may be a Gaussian process. In stochastic process convergence, one studies laws on path spaces such as $C[0,1]$ or $D[0,1]$.

In asymptotic statistics, distributional convergence underlies likelihood approximations, delta-method calculations, bootstrap limits, and confidence procedures. The same boundary and continuity-set principles continue to control which functionals preserve the limit.

For related foundations, see Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure), and Androma, [Cambridge II Principles of Statistics](/page/Cambridge%20II%20Principles%20of%20Statistics).

## Common mistakes

The first common mistake is to demand convergence of $F_{X_n}(x)$ at discontinuities of $F_X$. Those points are intentionally excluded from the definition. Atoms of the limiting law create jumps, and jumps are allowed.

The second common mistake is to treat convergence in distribution as a statement about random variables on a shared probability space. Only the laws matter.
Two versions of a limiting random variable with the same law give the same distributional convergence statement.

The third common mistake is to apply discontinuous transformations without checking the limiting probability of the discontinuity set. Continuous mapping is automatic for continuous functions. For discontinuous functions, the boundary condition must be checked.

The fourth common mistake is to combine convergence in distribution statements as though all algebraic operations were automatically valid. If $X_n\Rightarrow X$ and $Y_n\Rightarrow Y$, it does not by itself follow that $X_n+Y_n\Rightarrow X+Y$. Joint convergence is needed, or a theorem such as Slutsky's theorem must supply the missing control.

## Summary

Convergence in distribution is convergence of probability laws. On the real line it is detected by distribution functions at continuity points of the limiting distribution. On metric spaces it is detected by bounded continuous test functions. The portmanteau theorem explains the equivalent open-set, closed-set, and continuity-set forms.

[Convergence in probability implies convergence in distribution](/theorems/10010), and convergence in distribution to a constant implies convergence in probability. Continuous mapping and Slutsky's theorem make the notion useful for transformed and approximated statistics. Transforms such as moment generating functions and characteristic functions provide a practical analytic route to many limits.

The central limit theorem is the guiding example: normalized sums approach a normal law in distribution, even when the finite-sample laws remain far from normal in their exact form.

## References

Androma. [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).

Androma. [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).

Androma. [Cambridge II Principles of Statistics](/page/Cambridge%20II%20Principles%20of%20Statistics).

Androma. [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).

Created by admin on 6/23/2026 | Last updated on 6/23/2026
