A fair coin produces binomial counts, waiting times produce geometric variables, and many small errors added together produce something different: the bell curve. The Gaussian random variable is the mathematical form of that bell curve. Its importance is not that nature always starts Gaussian, but that sums, projections, noises, and conditional errors so often become Gaussian after the right scaling.
The first warning is that matching a mean and variance does not determine a distribution. Many random variables have mean $0$ and variance $1$, but their tails, densities, and sums behave very differently. Within the Gaussian family, the mean and variance determine the whole law.
[example: Same Mean and Variance, Different Shape]
Let $X$ be real-valued with $\mathbb P(X=1)=\mathbb P(X=-1)=1/2$, and let $Z\sim\mathcal N(0,1)$. For $X$, the expectation is
\begin{align*}
\mathbb E[X]=1\cdot \frac12+(-1)\cdot \frac12=0.
\end{align*}
Also,
\begin{align*}
\mathbb E[X^2]=1^2\cdot \frac12+(-1)^2\cdot \frac12=1.
\end{align*}
Therefore, using $\operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E[X])^2$,
\begin{align*}
\operatorname{Var}(X)=1-0^2=1.
\end{align*}
For $Z$, the theorem *Mean and Variance of a Gaussian* gives $\mathbb E[Z]=0$ and $\operatorname{Var}(Z)=1$.
The two random variables nevertheless have different laws. Since $X$ takes only the values $-1$ and $1$, the event $\{|X|>2\}$ has probability zero, so
\begin{align*}
\mathbb P(|X|>2)=0.
\end{align*}
For $Z$, the standard normal density is $\phi(x)=(2\pi)^{-1/2}e^{-x^2/2}$, so
\begin{align*}
\mathbb P(|Z|>2)=\int_{-\infty}^{-2}\phi(x)\,d\mathcal L^1(x)+\int_2^\infty \phi(x)\,d\mathcal L^1(x).
\end{align*}
Because $\phi(-x)=\phi(x)$, the substitution $u=-x$ in the first integral gives
\begin{align*}
\mathbb P(|Z|>2)=2\int_2^\infty (2\pi)^{-1/2}e^{-x^2/2}\,d\mathcal L^1(x).
\end{align*}
This quantity is strictly positive: on the interval $[2,3]$,
\begin{align*}
(2\pi)^{-1/2}e^{-x^2/2}\ge (2\pi)^{-1/2}e^{-9/2}>0,
\end{align*}
and hence
\begin{align*}
\int_2^\infty (2\pi)^{-1/2}e^{-x^2/2}\,d\mathcal L^1(x)\ge \int_2^3 (2\pi)^{-1/2}e^{-9/2}\,d\mathcal L^1(x)=(2\pi)^{-1/2}e^{-9/2}>0.
\end{align*}
Thus $X$ and $Z$ have the same mean and variance but different tail probabilities. The Gaussian assumption is therefore a structural statement about the entire law, not just a label attached to two moments.
[/example]
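The comparison in this example can also be checked numerically. The following Python sketch uses only the standard library; the names `two_sided_normal_tail` and `x_law` are illustrative, and the identity $\mathbb P(|Z|>r)=\operatorname{erfc}(r/\sqrt2)$ is the usual expression of the normal tail through the complementary error function rather than something stated above.

```python
import math

def two_sided_normal_tail(r):
    """Return P(|Z| > r) for Z ~ N(0, 1), using P(|Z| > r) = erfc(r / sqrt(2))."""
    return math.erfc(r / math.sqrt(2.0))

# Law of X: the values -1 and +1, each with probability 1/2.
x_law = {-1: 0.5, 1: 0.5}

mean_x = sum(value * prob for value, prob in x_law.items())
var_x = sum(value**2 * prob for value, prob in x_law.items()) - mean_x**2
tail_x = sum(prob for value, prob in x_law.items() if abs(value) > 2)
tail_z = two_sided_normal_tail(2.0)

# The crude lower bound from the example: 2 * (2*pi)^(-1/2) * exp(-9/2).
lower_bound = 2.0 * math.exp(-4.5) / math.sqrt(2.0 * math.pi)

print(mean_x, var_x, tail_x)  # mean, variance, and tail probability of X
print(tail_z, lower_bound)    # Gaussian tail and the lower bound it must exceed
```

The two laws agree on mean and variance, while only the Gaussian one puts positive mass beyond distance $2$.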
The chapter develops that structure from the definition upward. We begin with the density and the degenerate case, then study transformations, moments, sums, tails, and the limiting mechanism that explains why Gaussian variables appear so often.
## Definition
### The Normal Family
The parent notion is a [random variable](/page/Random%20Variable): a measurable map from a [probability space](/page/Probability%20Space) into a measurable state space. A Gaussian random variable is the special case where the state space is $\mathbb R$ and the law is one of the normal laws, including the constant limiting case with variance $0$.
[definition: Gaussian Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A real-valued random variable $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ is a Gaussian random variable, or normal random variable, if there exist $\mu\in\mathbb R$ and $\sigma^2\ge 0$ such that
\begin{align*}
X \sim \mathcal N(\mu,\sigma^2).
\end{align*}
[/definition]
The notation $\mathcal N(\mu,\sigma^2)$ records the mean parameter and variance parameter. The case $\sigma^2>0$ is the continuous bell curve; the case $\sigma^2=0$ means $\mathbb P(X=\mu)=1$. Including the constant random variables keeps the family closed under limits and under every affine map $x\mapsto ax+b$, including the constant maps with $a=0$.
A distribution is more fundamental than a formula for a density, but for non-degenerate Gaussian variables the density is the most visible object. The following formula gives the [probability measure](/page/Probability%20Measure) whose mass on each Borel set is computed by integrating the bell-shaped function over that set.
For $\mu\in\mathbb R$ and $\sigma^2>0$, the non-degenerate normal law $\mathcal N(\mu,\sigma^2)$ is the probability measure on $(\mathbb R,\mathcal B(\mathbb R))$ with density $f_{\mu,\sigma^2}:\mathbb R\to\mathbb R$ with respect to [Lebesgue measure](/page/Lebesgue%20Measure) $\mathcal L^1$. For $x\in\mathbb R$, this density is
\begin{align*}
f_{\mu,\sigma^2}(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
\end{align*}
Thus if $X\sim\mathcal N(\mu,\sigma^2)$ with $\sigma^2>0$, probabilities are obtained by integrating this density over Borel sets. To avoid rewriting the same shifted and rescaled integral in every calculation, the theory first isolates a reference member of the family.
### The Standard Normal
Probability calculations should not require a new table for every possible value of $\mu$ and $\sigma^2$. The standard normal is the reference law from which every non-degenerate one-dimensional Gaussian can be obtained by shifting and rescaling.
A real-valued random variable $Z:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ is called standard normal if
\begin{align*}
Z\sim\mathcal N(0,1).
\end{align*}
To compute with the standard normal, we need names for its density and its distribution function. We write $\phi$ for the density and $\Phi$ for the cumulative probability function used in tables, numerical software, and normal approximation.
The standard normal density is the function $\phi:\mathbb R\to\mathbb R$ given, for $x\in\mathbb R$, by
\begin{align*}
\phi(x)=(2\pi)^{-1/2}e^{-x^2/2}.
\end{align*}
The standard normal distribution function is the function $\Phi:\mathbb R\to[0,1]$ given, for $x\in\mathbb R$, by
\begin{align*}
\Phi(x)=\int_{-\infty}^x \phi(t)\,d\mathcal L^1(t).
\end{align*}
The symbol $\Phi$ packages a non-elementary integral into a reusable object. Most numerical probabilities for Gaussian variables are reduced to values of $\Phi$ after standardisation.
[example: Standardising a Gaussian Probability]
Let $X\sim\mathcal N(\mu,\sigma^2)$ with $\sigma^2>0$, and write $\sigma$ for the positive square root of $\sigma^2$. Fix [real numbers](/page/Real%20Numbers) $a<b$ and define
\begin{align*}
Z=\frac{X-\mu}{\sigma}.
\end{align*}
By *Affine Transformations of Gaussian Random Variables*, $Z\sim\mathcal N(0,1)$.
Since $\sigma>0$, subtracting $\mu$ and then dividing by $\sigma$ preserves the order of inequalities. Thus
\begin{align*}
\{a\le X\le b\}=\left\{\frac{a-\mu}{\sigma}\le \frac{X-\mu}{\sigma}\le \frac{b-\mu}{\sigma}\right\}.
\end{align*}
Using the definition of $Z$, this becomes
\begin{align*}
\mathbb P(a\le X\le b)=\mathbb P\left(\frac{a-\mu}{\sigma}\le Z\le \frac{b-\mu}{\sigma}\right).
\end{align*}
Because $Z$ is standard normal, its distribution function is $\Phi(x)=\mathbb P(Z\le x)$, and because $Z$ has a density, $\mathbb P(Z=c)=0$ for every $c\in\mathbb R$. Therefore, for $c<d$,
\begin{align*}
\mathbb P(c\le Z\le d)=\Phi(d)-\Phi(c).
\end{align*}
Taking $c=(a-\mu)/\sigma$ and $d=(b-\mu)/\sigma$ gives
\begin{align*}
\mathbb P(a\le X\le b)=\Phi\left(\frac{b-\mu}{\sigma}\right)-\Phi\left(\frac{a-\mu}{\sigma}\right).
\end{align*}
For example, if $X\sim\mathcal N(10,4)$, then $\mu=10$ and $\sigma=2$. Hence
\begin{align*}
\frac{8-\mu}{\sigma}=\frac{8-10}{2}=-1.
\end{align*}
Also,
\begin{align*}
\frac{13-\mu}{\sigma}=\frac{13-10}{2}=\frac32.
\end{align*}
Substituting these two cutoffs into the standardised formula gives
\begin{align*}
\mathbb P(8\le X\le 13)=\Phi(3/2)-\Phi(-1).
\end{align*}
The calculation reduces a probability for $\mathcal N(10,4)$ to two values of the standard normal distribution function.
[/example]
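As a numerical companion to this example, the following Python sketch evaluates $\Phi(3/2)-\Phi(-1)$ in two ways. The helper `Phi` is an illustrative name; it relies on the standard identity $\Phi(x)=\tfrac12\bigl(1+\operatorname{erf}(x/\sqrt2)\bigr)$, and the cross-check uses `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

def Phi(x):
    """Standard normal distribution function, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sigma = 10.0, 2.0   # X ~ N(10, 4), so the standard deviation is 2
a, b = 8.0, 13.0

# Standardised cutoffs (a - mu)/sigma and (b - mu)/sigma.
lo, hi = (a - mu) / sigma, (b - mu) / sigma
p_standardised = Phi(hi) - Phi(lo)

# Cross-check against the distribution function of N(10, 2^2) itself.
X = NormalDist(mu, sigma)
p_direct = X.cdf(b) - X.cdf(a)

print(lo, hi)
print(p_standardised, p_direct)
```

Both routes should agree up to floating-point rounding, which is the content of the standardisation identity.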
## Density and Basic Shape
### Normalisation
The Gaussian density is not chosen because it looks smooth. It is chosen because the quadratic exponent is stable under completing the square, differentiation, [Fourier transform](/page/Fourier%20Transform), and convolution. That algebraic stability is what makes the distribution usable.
Before doing probability with the density, we need the fact that it really integrates to $1$. This is the point at which the constant $(2\pi)^{-1/2}$ is fixed, so the next theorem is the analytic foundation of the definition.
[quotetheorem:1140]
This theorem turns the bell-shaped formula into a probability density. The usual proof squares the integral and changes to polar coordinates, which explains why the constant involves $2\pi$.
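A numerical sanity check of the normalisation is easy to run. The following Python sketch approximates the integral with a midpoint rule on the truncated interval $[-10,10]$; the function names, the truncation, and the grid size are illustrative choices, not part of the theorem.

```python
import math

def phi(x):
    """Standard normal density (2*pi)^(-1/2) * exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def midpoint_integral(f, a, b, n):
    """Midpoint-rule approximation of the integral of f over [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# Truncating to [-10, 10] discards only a negligible amount of mass.
total_mass = midpoint_integral(phi, -10.0, 10.0, 20_000)

# Without the constant, the integral of exp(-x^2 / 2) should be close to sqrt(2*pi).
raw_integral = midpoint_integral(lambda x: math.exp(-0.5 * x * x), -10.0, 10.0, 20_000)

print(total_mass, raw_integral, math.sqrt(2.0 * math.pi))
```

This does not replace the polar-coordinates proof; it only confirms that the constant $(2\pi)^{-1/2}$ is the right one to within numerical error.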
### Symmetry and Scale
After normalisation, the next feature is symmetry. Symmetry explains why the centre parameter is the balance point of the distribution, and it lets many probability computations be reduced to the right tail.
[quotetheorem:10158]
Symmetry is useful, but it does not describe how quickly probabilities shrink far from the centre. The next example records the simplest visual scale rule.
[example: The Empirical Scale of a Standard Normal]
Let $Z\sim\mathcal N(0,1)$, and recall that $\Phi(x)=\mathbb P(Z\le x)$. For any $r>0$,
\begin{align*}
\mathbb P(|Z|\le r)=\mathbb P(-r\le Z\le r).
\end{align*}
The event $\{Z\le r\}$ is the disjoint union of $\{Z<-r\}$ and $\{-r\le Z\le r\}$, so $\mathbb P(-r\le Z\le r)=\Phi(r)-\mathbb P(Z<-r)$. Because the standard normal has a density, the single point $\{Z=-r\}$ has probability $0$, and so $\mathbb P(Z<-r)=\mathbb P(Z\le -r)=\Phi(-r)$. Hence
\begin{align*}
\mathbb P(-r\le Z\le r)=\Phi(r)-\Phi(-r).
\end{align*}
The standard normal density satisfies $\phi(-x)=\phi(x)$, so symmetry gives $\Phi(-r)=1-\Phi(r)$. Therefore
\begin{align*}
\mathbb P(|Z|\le r)=\Phi(r)-(1-\Phi(r))=2\Phi(r)-1.
\end{align*}
Taking $r=1,2,3$ gives
\begin{align*}
\mathbb P(|Z|\le 1)=2\Phi(1)-1.
\end{align*}
\begin{align*}
\mathbb P(|Z|\le 2)=2\Phi(2)-1.
\end{align*}
\begin{align*}
\mathbb P(|Z|\le 3)=2\Phi(3)-1.
\end{align*}
Using numerical evaluation of the defining integral $\Phi(x)=\int_{-\infty}^x(2\pi)^{-1/2}e^{-t^2/2}\,d\mathcal L^1(t)$,
\begin{align*}
\Phi(1)\approx 0.8413,\quad \Phi(2)\approx 0.9772,\quad \Phi(3)\approx 0.9987.
\end{align*}
Substituting these values gives
\begin{align*}
2\Phi(1)-1\approx 2(0.8413)-1=0.6826.
\end{align*}
\begin{align*}
2\Phi(2)-1\approx 2(0.9772)-1=0.9544.
\end{align*}
\begin{align*}
2\Phi(3)-1\approx 2(0.9987)-1=0.9974.
\end{align*}
With more digits these are approximately $0.6827$, $0.9545$, and $0.9973$, so one, two, and three standard deviations from the mean correspond to sharply increasing central probability bands.
[/example]
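The three central probabilities can be reproduced in a few lines of Python. This is a minimal sketch using the identity $2\Phi(r)-1=\operatorname{erf}(r/\sqrt2)$, which follows from the standard relation between $\Phi$ and the error function; the function name `central_probability` is illustrative.

```python
import math

def central_probability(r):
    """P(|Z| <= r) = 2*Phi(r) - 1 = erf(r / sqrt(2)) for Z ~ N(0, 1)."""
    return math.erf(r / math.sqrt(2.0))

bands = {r: central_probability(r) for r in (1, 2, 3)}
for r, p in bands.items():
    print(f"P(|Z| <= {r}) = {p:.4f}")
```

Evaluating the error function directly avoids the rounding that appears when four-digit table values of $\Phi$ are doubled.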
## Transformations and Standardisation
A useful distribution family should behave well under the elementary operations used to change units. If temperatures are converted from Celsius to Fahrenheit, or errors are shifted by a calibration constant, the class should remain the same. Gaussian random variables have exactly this affine stability.
The transformation itself deserves a name because it is the main bridge between arbitrary Gaussian variables and the standard normal table. It asks: what remains after subtracting the centre and measuring distance in units of the standard deviation?
[definition: Standardisation]
Let $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable with mean $\mu$ and variance $\sigma^2>0$. The standardisation of $X$ is the random variable $Z:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ given by
\begin{align*}
Z(\omega)=\frac{X(\omega)-\mu}{\sigma}.
\end{align*}
[/definition]
For a general random variable, standardisation fixes only the first two moments. For a Gaussian variable it does more: subtracting the mean and dividing by the standard deviation gives another Gaussian variable, in fact a standard normal one.
To use standardisation as a method, we need a stability result: Gaussian laws should survive shifts, rescalings, and the corresponding coordinate changes. Without that fact, the definition of $Z=(X-\mu)/\sigma$ would only be a change of notation, not a bridge to the standard normal distribution.
The theorem quoted here is stated in a slightly broader vector form because the same change-of-units idea is used coordinate-by-coordinate in higher-dimensional Gaussian models. In that notation, a Gaussian random vector $X$ takes values in $\mathbb R^p$, has mean vector $\mu\in\mathbb R^p$, and has covariance matrix $\Sigma$; the notation $X\sim\mathcal N_p(\mu,\Sigma)$ records this multivariate normal law. If $A$ is a matrix and $b$ is a vector, then $AX+b$ is the affine transformation of $X$, and its covariance becomes $A\Sigma A^\top$.
The point of quoting the affine transformation result now is to turn standardisation from an algebraic formula into a distributional statement. Once affine maps are known to preserve Gaussian laws and update the mean and covariance in the indicated way, the standard normal distribution becomes the canonical representative of every non-degenerate one-dimensional Gaussian law.
[quotetheorem:4001]
For the one-dimensional standardisation above, take $p=1$, $A=1/\sigma$, and $b=-\mu/\sigma$. The conclusion is exactly $(X-\mu)/\sigma\sim\mathcal N(0,1)$. The vector statement is not adding a new assumption to the one-dimensional discussion; it is recording the same stability principle in the form needed when several Gaussian quantities are transformed together. The covariance restriction also explains why the variance parameter must be nonnegative and why $\sigma$ appears as a scale rather than as an arbitrary second parameter.
[example: Unit Conversion Preserves Gaussianity]
Suppose $C\sim\mathcal N(20,9)$ is a temperature measurement in degrees Celsius, and define the Fahrenheit value by
\begin{align*}
F=\frac95 C+32.
\end{align*}
Applying *Affine Transformations of Gaussian Random Variables* with $a=9/5$, $b=32$, $\mu=20$, and $\sigma^2=9$ gives
\begin{align*}
F\sim\mathcal N\left(\frac95\cdot 20+32,\left(\frac95\right)^2\cdot 9\right).
\end{align*}
The mean parameter is
\begin{align*}
\frac95\cdot 20+32=9\cdot 4+32=36+32=68.
\end{align*}
The variance parameter is
\begin{align*}
\left(\frac95\right)^2\cdot 9=\frac{81}{25}\cdot 9=\frac{729}{25}.
\end{align*}
Therefore
\begin{align*}
F\sim\mathcal N\left(68,\frac{729}{25}\right).
\end{align*}
The unit conversion shifts and rescales the centre in the same way as the temperature value itself, while the variance is multiplied by the square of the scale factor $9/5$.
[/example]
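The parameter bookkeeping in this example is mechanical enough to script. The following Python sketch uses exact rational arithmetic; the helper `affine_gaussian_params` is a hypothetical name that simply encodes the one-dimensional rule $aX+b\sim\mathcal N(a\mu+b,a^2\sigma^2)$.

```python
from fractions import Fraction

def affine_gaussian_params(mu, var, a, b):
    """Parameters of aX + b when X ~ N(mu, var): mean a*mu + b, variance a^2 * var."""
    return a * mu + b, a * a * var

# Celsius measurement C ~ N(20, 9), converted by F = (9/5) C + 32.
mean_f, var_f = affine_gaussian_params(
    mu=Fraction(20), var=Fraction(9), a=Fraction(9, 5), b=Fraction(32)
)

print(mean_f, var_f)  # exact rational values of the Fahrenheit parameters
```

Using `Fraction` keeps the variance in the exact form $729/25$ instead of a rounded decimal.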
A common mistake is to treat all nonlinear transformations as if they preserved normality. Squaring, exponentiating, or taking absolute values usually leaves the Gaussian family.
[example: Squaring a Standard Normal Is Not Gaussian]
Let $Z\sim\mathcal N(0,1)$ and set $Y=Z^2$. Since $z^2\ge 0$ for every $z\in\mathbb R$, we have $Y(\omega)\ge 0$ for every $\omega$, and therefore
\begin{align*}
\mathbb P(Y<0)=0.
\end{align*}
We show that this prevents $Y$ from being any Gaussian random variable. If $W\sim\mathcal N(\mu,\sigma^2)$ with $\sigma^2>0$, then $W$ has density
\begin{align*}
f_{\mu,\sigma^2}(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
\end{align*}
For every $x\in[-1,0]$, the factor $1/(\sigma\sqrt{2\pi})$ is positive and the exponential is positive, so $f_{\mu,\sigma^2}(x)>0$. Hence
\begin{align*}
\mathbb P(W<0)\ge \int_{-1}^{0} f_{\mu,\sigma^2}(x)\,d\mathcal L^1(x)>0.
\end{align*}
Thus no non-degenerate Gaussian random variable can have probability $0$ on $(-\infty,0)$, while $Y$ does.
It remains to check that $Y$ is not a degenerate Gaussian, meaning not almost surely constant. Since the standard normal density $\phi(x)=(2\pi)^{-1/2}e^{-x^2/2}$ is positive on every interval, we have
\begin{align*}
\mathbb P(Y\le 1)=\mathbb P(-1\le Z\le 1)=\int_{-1}^{1}\phi(x)\,d\mathcal L^1(x)>0.
\end{align*}
Also,
\begin{align*}
\mathbb P(Y>1)\ge \mathbb P(2\le Z\le 3)=\int_{2}^{3}\phi(x)\,d\mathcal L^1(x)>0.
\end{align*}
So $Y$ takes values in both $(-\infty,1]$ and $(1,\infty)$ with positive probability, and therefore cannot be almost surely constant. Thus $Z^2$ is not Gaussian; it is instead $\chi_1^2$, the chi-squared distribution with one degree of freedom.
[/example]
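To see the obstruction numerically, one can compare $Y=Z^2$ with the Gaussian law that has the same mean and variance. The sketch below assumes the standard facts $\mathbb E[Z^2]=1$ and $\operatorname{Var}(Z^2)=\mathbb E[Z^4]-1=2$, so the comparison law is $\mathcal N(1,2)$; the function names are illustrative.

```python
import math

def cdf_of_square(y):
    """P(Z^2 <= y) for Z ~ N(0, 1): zero for y < 0, else P(|Z| <= sqrt(y)) = erf(sqrt(y / 2))."""
    if y < 0:
        return 0.0
    return math.erf(math.sqrt(y / 2.0))

def normal_prob_negative(mu, sigma):
    """P(W < 0) for W ~ N(mu, sigma^2) with sigma > 0, i.e. Phi(-mu / sigma)."""
    return 0.5 * math.erfc(mu / (sigma * math.sqrt(2.0)))

p_y_negative = cdf_of_square(-0.5)                       # Y never goes below zero
p_w_negative = normal_prob_negative(1.0, math.sqrt(2.0))  # matched Gaussian does
p_y_at_most_one = cdf_of_square(1.0)

print(p_y_negative, p_w_negative, p_y_at_most_one)
```

The matched Gaussian assigns substantial probability to the negative half-line, while $Y$ assigns none, so matching two moments cannot repair the mismatch.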
## Moments, Characteristic Functions, and Recognition
The Gaussian law can be recognised in several equivalent ways. The density is the most concrete, but transforms and moments reveal why the distribution is stable under sums and limits.
The [moment generating function](/page/Moment%20Generating%20Function) is useful when exponential moments exist near the origin. It converts moment information into a function, and for the Gaussian this function has a quadratic exponent that is especially easy to manipulate.
[definition: Moment Generating Function]
Let $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. Define
\begin{align*}
D_X=\{t\in\mathbb R:\mathbb E[e^{tX}]<\infty\}.
\end{align*}
The moment [generating function](/page/Generating%20Function) of $X$ is the function $M_X:D_X\to\mathbb R$ given by
\begin{align*}
M_X(t)=\mathbb E[e^{tX}].
\end{align*}
[/definition]
Moment generating functions turn sums of independent random variables into products. For Gaussian variables, this product rule becomes the addition of quadratic polynomials in the exponent, so the next formula is the computational engine behind Gaussian summation.
[quotetheorem:4006]
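The closed form can be compared with a direct numerical integral. The following Python sketch assumes the quoted theorem gives the standard formula $M_X(t)=\exp(\mu t+\sigma^2t^2/2)$ for $X\sim\mathcal N(\mu,\sigma^2)$; the function names, the truncation window of $12\sigma$, and the grid size are illustrative, and the window is only adequate when $|t|\sigma$ is small compared with $12$.

```python
import math

def gaussian_mgf_numeric(t, mu, sigma, half_width=12.0, n=20_000):
    """Midpoint-rule approximation of E[exp(tX)] for X ~ N(mu, sigma^2)."""
    a = mu - half_width * sigma
    h = 2.0 * half_width * sigma / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h
        # Combine both exponents before exponentiating to avoid overflow.
        total += math.exp(t * x - (x - mu) ** 2 / (2.0 * sigma**2))
    return total * h / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_mgf_closed_form(t, mu, sigma):
    """Assumed closed form exp(mu*t + sigma^2 * t^2 / 2)."""
    return math.exp(mu * t + 0.5 * sigma**2 * t**2)

numeric = gaussian_mgf_numeric(0.7, mu=1.0, sigma=2.0)
closed = gaussian_mgf_closed_form(0.7, mu=1.0, sigma=2.0)
print(numeric, closed)
```

Agreement of the two numbers is the numerical shadow of completing the square in the exponent.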
The transform formula is not only a way to compute integrals. In the density definition, the symbols $\mu$ and $\sigma^2$ are parameters; to interpret them probabilistically, one must check that they match the actual centre and spread of the random variable. Differentiating the moment generating function at the origin provides exactly this verification.
This raises the next precise question: when a random variable is declared to have the Gaussian density with parameters $\mu$ and $\sigma^2$, do those parameters equal its expectation and variance? The following result records that identification, so later uses of the notation $\mathcal N(\mu,\sigma^2)$ can treat $\mu$ as the mean and $\sigma^2$ as the variance rather than merely as symbols in the formula.
[quotetheorem:10159]
This theorem is the point where the parameters in the notation $\mathcal N(\mu,\sigma^2)$ stop being only parameters in a density formula. It says that, inside the Gaussian family, the first parameter is exactly the expectation and the second parameter is exactly the variance. That identification is essential because many later arguments describe a Gaussian variable by its centre and spread rather than by rewriting its density each time.
The conclusion is also special to Gaussian laws. The opening example, which compared a symmetric $\pm 1$ variable with a standard normal, showed that knowing a mean and a variance does not usually determine the shape of a distribution, nor does it force the distribution to be Gaussian. What the theorem gives is narrower but stronger: once Gaussianity is known, the two parameters already encode the probabilistic quantities needed for centring, scaling, and comparing Gaussian variables.
This is why standardisation is more than a formal substitution. For $X\sim\mathcal N(\mu,\sigma^2)$, the expression $(X-\mu)/\sigma$ subtracts the actual mean and measures in actual standard-deviation units. The theorem therefore links the density-based definition to the later transform and limit arguments, where tracking means and variances is the natural language.
Moment generating functions are powerful, but requiring finite exponential moments can exclude important distributions. For recognition and [weak convergence](/page/Weak%20Convergence), we need a transform that is defined for every real-valued random variable. The complex exponential has absolute value $1$, so its expectation always exists as a bounded complex-valued expectation.
[definition: Characteristic Function]
Let $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. The characteristic function of $X$ is the function $\phi_X:\mathbb R\to\mathbb C$ given by
\begin{align*}
\phi_X(u)=\mathbb E[e^{iuX}].
\end{align*}
[/definition]
The characteristic function becomes especially useful when it has a recognisable closed form. The recognition question is whether the oscillatory transform of a random variable determines a familiar law. For a one-dimensional Gaussian variable, the closed form has a quadratic exponent.
This motivates a recognition theorem rather than just another computation. If a transform with modulus-one integrand determines the distribution, then identifying the Gaussian characteristic function becomes a practical way to prove that a random variable has a Gaussian law, even when moment generating functions are unavailable.
The quoted recognition formula is again stated in vector notation. For a random vector $X\in\mathbb R^p$, its characteristic function is $\phi_X(u)=\mathbb E[e^{i u^\top X}]$ for $u\in\mathbb R^p$. When $p=1$, this reduces to the definition above because $u^\top X$ is just $uX$.
The next step is to identify the exact transform signature of a Gaussian law. This is the forward use of characteristic functions on this page: instead of starting from a density or a moment generating function, we can recognise Gaussianity from the exponential [quadratic form](/page/Quadratic%20Form) of $\phi_X$ itself.
[quotetheorem:4003]
This transform formula characterises the law, but its hypotheses matter. The covariance matrix must be positive definite in the stated form so that it describes a genuinely non-degenerate Gaussian density; degenerate Gaussian laws require a separate singular-covariance formulation. In the special case $p=1$, the quadratic form $u^\top\Sigma u$ becomes $\sigma^2u^2$, giving the familiar one-dimensional characteristic function $\phi_X(u)=\exp\left(i\mu u-\sigma^2u^2/2\right)$. In practice, many proofs show that a random variable is Gaussian by showing that its characteristic function has this form.
[example: Computing Even Moments]
Let $Z\sim\mathcal N(0,1)$. By *Moment Generating Function of a Gaussian* with $\mu=0$ and $\sigma^2=1$,
\begin{align*}
M_Z(t)=\exp\left(\frac{t^2}{2}\right).
\end{align*}
Using the exponential [power series](/page/Power%20Series) $e^x=\sum_{m=0}^{\infty}x^m/m!$ with $x=t^2/2$, we get
\begin{align*}
e^{t^2/2}=\sum_{m=0}^{\infty}\frac{(t^2/2)^m}{m!}.
\end{align*}
For each $m\ge 0$,
\begin{align*}
\frac{(t^2/2)^m}{m!}=\frac{t^{2m}}{2^m m!}.
\end{align*}
Therefore
\begin{align*}
M_Z(t)=\sum_{m=0}^{\infty}\frac{t^{2m}}{2^m m!}.
\end{align*}
On the other hand, the moment generating function has the Taylor expansion
\begin{align*}
M_Z(t)=\sum_{n=0}^{\infty}\frac{\mathbb E[Z^n]}{n!}t^n
\end{align*}
near $0$. Comparing coefficients of $t^n$ in the two power series, every odd power has coefficient $0$, so for $m\ge 0$,
\begin{align*}
\frac{\mathbb E[Z^{2m+1}]}{(2m+1)!}=0.
\end{align*}
Hence
\begin{align*}
\mathbb E[Z^{2m+1}]=0.
\end{align*}
For the even power $t^{2m}$, the coefficient comparison gives
\begin{align*}
\frac{\mathbb E[Z^{2m}]}{(2m)!}=\frac{1}{2^m m!}.
\end{align*}
Multiplying both sides by $(2m)!$ gives
\begin{align*}
\mathbb E[Z^{2m}]=\frac{(2m)!}{2^m m!}.
\end{align*}
For instance, when $m=2$,
\begin{align*}
\mathbb E[Z^4]=\frac{4!}{2^2 2!}=\frac{24}{4\cdot 2}=3.
\end{align*}
Thus the standard normal has all odd moments equal to $0$, and its even moments are determined by the factorial formula above.
[/example]
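The factorial formula is simple to tabulate. The following Python sketch implements it with integer arithmetic; the function name `standard_normal_moment` is illustrative.

```python
from math import factorial

def standard_normal_moment(n):
    """E[Z^n] for Z ~ N(0, 1): 0 for odd n, and (2m)! / (2^m m!) for n = 2m."""
    if n % 2 == 1:
        return 0
    m = n // 2
    # The quotient is an integer, so floor division is exact here.
    return factorial(2 * m) // (2**m * factorial(m))

moments = [standard_normal_moment(n) for n in range(9)]
print(moments)  # moments of orders 0 through 8
```

The even entries are the products $1\cdot 3\cdot 5\cdots(2m-1)$, which is another way to read the same formula.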
## Sums and Independence
### Independent Addition
The most important algebraic feature of Gaussian variables is closure under independent addition. This is the mechanism behind normal approximations: when independent sources of error are added, their Gaussian components combine into another Gaussian component.
Independence matters. Without it, the variance of a sum involves covariance, and the sum of Gaussian-looking marginals need not behave as expected unless the joint structure is controlled.
[quotetheorem:10160]
The statement is as important for what it excludes as for what it includes. Independence is a structural hypothesis, not a cosmetic add-on: it is what lets moment generating functions multiply, or equivalently what removes covariance terms from the variance of the sum. If the variables are merely marginally Gaussian, the theorem cannot be applied without information about their joint law. Correlation can change the variance, and in poorly controlled joint examples a sum of Gaussian-looking one-dimensional quantities need not be governed by the simple parameter addition in the theorem.
Within those limits, the result is the basic bookkeeping rule for Gaussian models. It says that independent Gaussian errors may be aggregated without leaving the Gaussian family, with centres adding and variances adding. The examples below use exactly this forward direction: first to compute the distribution of an average of independent measurements, and later to justify why normalised sums are naturally compared with a Gaussian law.
[example: Averaging Independent Measurement Errors]
Let $X_1,\dots,X_n$ be independent random variables with $X_j\sim\mathcal N(\mu,\sigma^2)$ for each $j$, where $\sigma^2>0$. We compute the law of the sample average
\begin{align*}
\overline X_n=\frac{1}{n}\sum_{j=1}^n X_j.
\end{align*}
By *[Sum of Independent Gaussian Random Variables](/theorems/10160)*,
\begin{align*}
\sum_{j=1}^n X_j\sim \mathcal N\left(\sum_{j=1}^n\mu,\sum_{j=1}^n\sigma^2\right).
\end{align*}
Since $\sum_{j=1}^n\mu=n\mu$ and $\sum_{j=1}^n\sigma^2=n\sigma^2$, this becomes
\begin{align*}
\sum_{j=1}^n X_j\sim \mathcal N(n\mu,n\sigma^2).
\end{align*}
Now apply *Affine Transformations of Gaussian Random Variables* to $\sum_{j=1}^n X_j$ with scale factor $a=1/n$ and shift $b=0$. Then
\begin{align*}
\frac{1}{n}\sum_{j=1}^n X_j\sim \mathcal N\left(\frac{1}{n}\cdot n\mu,\left(\frac{1}{n}\right)^2 n\sigma^2\right).
\end{align*}
The mean parameter is
\begin{align*}
\frac{1}{n}\cdot n\mu=\mu.
\end{align*}
The variance parameter is
\begin{align*}
\left(\frac{1}{n}\right)^2 n\sigma^2=\frac{1}{n^2}\cdot n\sigma^2=\frac{\sigma^2}{n}.
\end{align*}
Therefore
\begin{align*}
\overline X_n\sim\mathcal N\left(\mu,\frac{\sigma^2}{n}\right).
\end{align*}
Averaging independent Gaussian measurements keeps the same centre $\mu$ while dividing the variance by the number of measurements.
[/example]
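The same two steps, independent addition followed by rescaling, can be mirrored with `statistics.NormalDist` from the Python standard library, whose `+` between two instances treats them as independent and whose `/` by a constant rescales the law. The function name `law_of_average` and the parameter values are illustrative.

```python
from statistics import NormalDist

def law_of_average(mu, sigma, n):
    """Law of the mean of n independent N(mu, sigma^2) variables, built step by step."""
    total = NormalDist(mu, sigma)
    for _ in range(n - 1):
        # Adding two NormalDist objects models a sum of independent variables:
        # the means add and the variances add.
        total = total + NormalDist(mu, sigma)
    # Dividing by a constant rescales the mean and the standard deviation.
    return total / n

avg = law_of_average(mu=5.0, sigma=2.0, n=16)
print(avg.mean, avg.variance)  # compare with mu and sigma^2 / n
```

With $\sigma^2=4$ and $n=16$ the variance of the average should be close to $4/16$, up to floating-point rounding in the repeated additions.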
### Gaussian Vectors
Sums naturally lead from single variables to vectors. A vector should be called Gaussian only when all of its linear shadows are Gaussian; otherwise the coordinates can look normal while the joint law still has non-Gaussian structure.
[definition: Gaussian Random Vector]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A random vector $X:(\Omega,\mathcal F)\to(\mathbb R^n,\mathcal B(\mathbb R^n))$ is a Gaussian random vector if for every $a\in\mathbb R^n$, the real-valued random variable $a\cdot X$ is Gaussian.
[/definition]
This definition is more robust than asking each coordinate to be Gaussian separately. Coordinate-wise normality does not by itself control joint behaviour.
[example: Gaussian Marginals Do Not Force a Gaussian Vector]
Let $Z\sim\mathcal N(0,1)$, let $\varepsilon$ be independent of $Z$ with $\mathbb P(\varepsilon=1)=\mathbb P(\varepsilon=-1)=1/2$, and define
\begin{align*}
X=(X_1,X_2)=(Z,\varepsilon Z).
\end{align*}
The first coordinate satisfies $X_1=Z$, so $X_1\sim\mathcal N(0,1)$.
We next check the second coordinate. For every Borel set $B\subset\mathbb R$, independence of $\varepsilon$ and $Z$ gives
\begin{align*}
\mathbb P(\varepsilon Z\in B)=\frac12\mathbb P(Z\in B)+\frac12\mathbb P(-Z\in B).
\end{align*}
Since $Z$ has density $\phi(x)=(2\pi)^{-1/2}e^{-x^2/2}$ and $\phi(-x)=\phi(x)$, the substitution $u=-x$ shows that $\mathbb P(-Z\in B)=\mathbb P(Z\in B)$. Hence
\begin{align*}
\mathbb P(\varepsilon Z\in B)=\frac12\mathbb P(Z\in B)+\frac12\mathbb P(Z\in B)=\mathbb P(Z\in B).
\end{align*}
Thus $X_2=\varepsilon Z\sim\mathcal N(0,1)$.
However, the linear combination with coefficient vector $(1,1)$ is
\begin{align*}
X_1+X_2=Z+\varepsilon Z=Z(1+\varepsilon).
\end{align*}
On the event $\{\varepsilon=-1\}$, this equals $Z(1-1)=0$, so
\begin{align*}
\mathbb P(X_1+X_2=0)\ge \mathbb P(\varepsilon=-1)=\frac12.
\end{align*}
Also, on the event $\{\varepsilon=1\}\cap\{Z>1/2\}$, we have $X_1+X_2=2Z>1$, and independence gives
\begin{align*}
\mathbb P(X_1+X_2>1)\ge \mathbb P(\varepsilon=1,\ Z>1/2)=\frac12\mathbb P(Z>1/2)>0.
\end{align*}
Therefore $X_1+X_2$ is not almost surely constant.
It is also not a non-degenerate Gaussian random variable, because every non-degenerate Gaussian law has a density with respect to $\mathcal L^1$, so every singleton has probability $0$, while $X_1+X_2$ has positive probability at $0$. Thus $X_1+X_2$ is not Gaussian. Since a Gaussian random vector must have every linear combination Gaussian, the vector $X=(Z,\varepsilon Z)$ is not a Gaussian random vector even though both coordinates are standard normal.
[/example]
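A short simulation makes the atom at $0$ visible. The following Python sketch is illustrative only: the seed, the sample size, and the variable names are arbitrary choices, and the printed frequencies are Monte Carlo estimates rather than exact probabilities.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20_000

zero_count = 0
large_count = 0
for _ in range(n):
    z = rng.gauss(0.0, 1.0)
    eps = rng.choice((-1, 1))
    s = z + eps * z          # X_1 + X_2 = Z (1 + epsilon)
    if s == 0.0:             # happens exactly when epsilon = -1
        zero_count += 1
    if s > 1.0:
        large_count += 1

atom_frequency = zero_count / n
print(atom_frequency, large_count / n)
```

Roughly half of the samples land exactly on $0$, which no law with a density can do, while a visible fraction exceeds $1$, so the sum is not constant either.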
For arbitrary random variables, zero covariance only rules out linear correlation and does not force independence. Gaussian vectors are exceptional because their joint law is rigid enough for covariance information to control independence. This is the practical payoff of using the Gaussian-vector definition rather than checking coordinates separately.
[quotetheorem:4005]
The implication fails for general random variables, where zero covariance does not force independence. Its validity for Gaussian vectors is one of the reasons Gaussian models are tractable: second-order information can determine much more than it usually does.
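The pair $(X_1,X_2)=(Z,\varepsilon Z)$ from the example *Gaussian Marginals Do Not Force a Gaussian Vector* shows concretely how the conclusion can fail without joint Gaussianity. The following computation is a sketch that uses only $\mathbb E[\varepsilon]=0$, $\mathbb E[Z]=0$, $\mathbb E[Z^2]=1$, and the independence of $\varepsilon$ and $Z$. The covariance vanishes:
\begin{align*}
\operatorname{Cov}(X_1,X_2)=\mathbb E[Z\cdot\varepsilon Z]-\mathbb E[Z]\,\mathbb E[\varepsilon Z]=\mathbb E[\varepsilon]\,\mathbb E[Z^2]-0=0.
\end{align*}
The coordinates are nevertheless dependent, because $|X_2|=|\varepsilon Z|=|Z|=|X_1|$ on all of $\Omega$. Hence
\begin{align*}
\mathbb P(|X_1|\le 1,\ |X_2|>1)=0,
\end{align*}
whereas independence would require this probability to equal
\begin{align*}
\mathbb P(|X_1|\le 1)\,\mathbb P(|X_2|>1)=(2\Phi(1)-1)(2-2\Phi(1))>0.
\end{align*}
So $X_1$ and $X_2$ are uncorrelated standard normal variables that are not independent, which is consistent with the fact that $(X_1,X_2)$ is not a Gaussian random vector.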
## Tails and Concentration
Gaussian variables have unbounded support, so extreme deviations are possible. The useful fact is that their probabilities decay exponentially in the square of the distance from the mean. This square in the exponent is the analytic signature of Gaussian concentration.
Tail bounds replace tables when exact values of $\Phi$ are less important than estimates. They also explain why Gaussian errors are often regarded as light-tailed.
The concentration theorem below is phrased for sub-Gaussian random variables, so we pause to identify why it applies here. A centred random variable $X$ is sub-Gaussian with variance proxy $\sigma^2$ if
\begin{align*}
\mathbb E[e^{tX}]\le \exp\left(\frac{\sigma^2t^2}{2}\right)
\end{align*}
for every real $t$. If $X\sim\mathcal N(0,\sigma^2)$, the Gaussian moment generating function gives equality in this bound, so centred Gaussian variables are sub-Gaussian.
[quotetheorem:1953]
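For the standard normal, a bound of this kind can be set beside the exact tail. The exact constants of the quoted theorem are not reproduced here; the Python sketch below assumes the common one-sided Chernoff form $\mathbb P(X\ge t)\le\exp(-t^2/(2\sigma^2))$ for $t\ge 0$, and the function names are illustrative.

```python
import math

def upper_tail(t):
    """Exact P(Z >= t) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def chernoff_bound(t, sigma=1.0):
    """Assumed bound exp(-t^2 / (2 sigma^2)), obtained by optimising exp(-s t) E[exp(s X)] over s."""
    return math.exp(-t * t / (2.0 * sigma * sigma))

comparison = [(t, upper_tail(t), chernoff_bound(t)) for t in (0.5, 1.0, 2.0, 3.0, 4.0)]
for t, exact, bound in comparison:
    print(f"t={t}: exact={exact:.3e}, bound={bound:.3e}")
```

The bound overshoots the exact tail at every listed $t$, but both columns decay at the rate set by $t^2/2$ in the exponent.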
For Gaussian variables, the estimate is not an exact asymptotic, but it captures the correct exponential scale. The sub-Gaussian hypothesis is the part that rules out heavier tails. The next example contrasts this with a distribution having the same variance but larger rare deviations.
[example: Same Variance, Heavier Tails]
Let $Z\sim\mathcal N(0,1)$, and let $Y$ satisfy $\mathbb P(Y=0)=8/9$ and $\mathbb P(Y=3)=\mathbb P(Y=-3)=1/18$. We first compute the first two moments of $Y$:
\begin{align*}
\mathbb E[Y]=0\cdot \frac89+3\cdot\frac{1}{18}+(-3)\cdot\frac{1}{18}=0+\frac16-\frac16=0.
\end{align*}
Also,
\begin{align*}
\mathbb E[Y^2]=0^2\cdot\frac89+3^2\cdot\frac{1}{18}+(-3)^2\cdot\frac{1}{18}=0+\frac{9}{18}+\frac{9}{18}=1.
\end{align*}
Using $\operatorname{Var}(Y)=\mathbb E[Y^2]-(\mathbb E[Y])^2$, we get
\begin{align*}
\operatorname{Var}(Y)=1-0^2=1.
\end{align*}
The tail at distance $3$ is much larger than the standard normal tail. For $Y$,
\begin{align*}
\{|Y|\ge 3\}=\{Y=3\}\cup\{Y=-3\}.
\end{align*}
These two events are disjoint, so
\begin{align*}
\mathbb P(|Y|\ge 3)=\mathbb P(Y=3)+\mathbb P(Y=-3)=\frac{1}{18}+\frac{1}{18}=\frac19.
\end{align*}
For the standard normal variable,
\begin{align*}
\mathbb P(|Z|\ge 3)=\mathbb P(Z\le -3)+\mathbb P(Z\ge 3).
\end{align*}
By symmetry of the standard normal density, this is
\begin{align*}
\mathbb P(|Z|\ge 3)=2\mathbb P(Z\ge 3)=2(1-\Phi(3)).
\end{align*}
Numerical evaluation of $\Phi(3)=\int_{-\infty}^{3}(2\pi)^{-1/2}e^{-t^2/2}\,d\mathcal L^1(t)$ gives $\Phi(3)\approx 0.99865$, hence
\begin{align*}
\mathbb P(|Z|\ge 3)\approx 2(1-0.99865)=0.00270.
\end{align*}
Thus $Y$ and $Z$ both have variance $1$, but $Y$ assigns probability $1/9$ to values at least $3$ units from the mean, while the standard normal assigns only about $0.0027$ to the same event. Equal variance does not force Gaussian tail behaviour.
[/example]
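The arithmetic in this example can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes Python's standard library only, uses exact rational arithmetic for $Y$, and uses the identity $2(1-\Phi(3))=\operatorname{erfc}(3/\sqrt2)$ for the normal tail. The variable names are illustrative.

```python
import math
from fractions import Fraction

# Law of Y from the example: mass 8/9 at 0 and 1/18 at each of +3 and -3.
pmf_y = {0: Fraction(8, 9), 3: Fraction(1, 18), -3: Fraction(1, 18)}

# First two moments and the tail P(|Y| >= 3), computed exactly.
mean_y = sum(y * p for y, p in pmf_y.items())
var_y = sum(y * y * p for y, p in pmf_y.items()) - mean_y ** 2
tail_y = sum(p for y, p in pmf_y.items() if abs(y) >= 3)

# For Z ~ N(0,1): P(|Z| >= 3) = 2(1 - Phi(3)) = erfc(3 / sqrt(2)).
tail_z = math.erfc(3 / math.sqrt(2))

print("E[Y] =", mean_y, " Var(Y) =", var_y)
print("P(|Y| >= 3) =", tail_y, "=", float(tail_y))
print("P(|Z| >= 3) =", tail_z)
```

Using `erfc` directly avoids the cancellation that would occur when subtracting a value of $\Phi$ close to $1$ from $1$.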
After seeing how fast the tails decay, the next question is what kinds of integrability this decay buys. The theorem below turns the tail behaviour into a clean moment statement.
[quotetheorem:10161]
Moment finiteness allows Gaussian variables to be used safely in $L^p$ arguments for every finite $p$. It does not mean they are bounded; no non-degenerate Gaussian random variable is bounded a.s.
[example: A Gaussian Variable Is Not Bounded]
Let $X\sim\mathcal N(\mu,\sigma^2)$ with $\sigma^2>0$, and let $\sigma$ be the positive square root of $\sigma^2$. We show that no finite bound can contain $X$ almost surely. Fix $M>0$ and consider the interval $[M+1,M+2]$. If $x\in[M+1,M+2]$, then $|x|>M$, so
\begin{align*}
\mathbb P(|X|>M)\ge \mathbb P(M+1\le X\le M+2).
\end{align*}
Because $X$ has density
\begin{align*}
f_{\mu,\sigma^2}(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\end{align*}
we have
\begin{align*}
\mathbb P(M+1\le X\le M+2)=\int_{M+1}^{M+2}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\,d\mathcal L^1(x).
\end{align*}
Set
\begin{align*}
A=\max\{(M+1-\mu)^2,(M+2-\mu)^2\}.
\end{align*}
For every $x\in[M+1,M+2]$, the quantity $(x-\mu)^2$ is at most $A$, and therefore
\begin{align*}
\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\ge \exp\left(-\frac{A}{2\sigma^2}\right).
\end{align*}
Hence
\begin{align*}
\mathbb P(M+1\le X\le M+2)\ge \int_{M+1}^{M+2}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{A}{2\sigma^2}\right)\,d\mathcal L^1(x).
\end{align*}
The integrand is constant, so
\begin{align*}
\int_{M+1}^{M+2}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{A}{2\sigma^2}\right)\,d\mathcal L^1(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{A}{2\sigma^2}\right)>0.
\end{align*}
Thus $\mathbb P(|X|>M)>0$ for every $M>0$. By *All Polynomial Moments Exist*, $X$ has finite moments of every order, but it is not essentially bounded because essential boundedness would require some $M$ with $\mathbb P(|X|>M)=0$.
[/example]
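As a sanity check on the lower bound used in this example, the sketch below compares the constant-integrand bound with the exact interval probability for a few choices of $M$. It is only an illustration under assumed parameter values; the function names are hypothetical, and the exact probability is computed from the standard library's `math.erfc`.

```python
import math

def interval_probability(mu, sigma, m):
    """Exact P(m+1 <= X <= m+2) for X ~ N(mu, sigma^2), written with erfc."""
    lo = (m + 1 - mu) / (sigma * math.sqrt(2))
    hi = (m + 2 - mu) / (sigma * math.sqrt(2))
    # Phi(b) - Phi(a) = (erfc(a / sqrt 2) - erfc(b / sqrt 2)) / 2 after standardising.
    return 0.5 * (math.erfc(lo) - math.erfc(hi))

def constant_lower_bound(mu, sigma, m):
    """The bound from the example: density at the worse endpoint times length 1."""
    a = max((m + 1 - mu) ** 2, (m + 2 - mu) ** 2)
    return math.exp(-a / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Assumed illustrative parameters: a standard normal and a few bounds M.
for m in (1, 5, 10):
    bound = constant_lower_bound(0.0, 1.0, m)
    exact = interval_probability(0.0, 1.0, m)
    print(f"M={m}: lower bound {bound:.3e} <= exact {exact:.3e}")
```

The probabilities become extremely small as $M$ grows, but they remain strictly positive, which is all the argument needs.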
## Normal Approximation and Limit Behaviour
Gaussian random variables matter far beyond situations where a model is assumed normal at the start. Their deepest role is as limiting laws for sums of many small independent contributions. This is why the Gaussian distribution appears in statistics, statistical mechanics, numerical error, and stochastic processes.
To state this precisely, we need the standard form of convergence used for laws of random variables. This mode of convergence tracks distributions rather than requiring the random variables to be close on the same sample space.
[definition: Convergence in Distribution]
For each $n\ge 1$, let $(\Omega_n,\mathcal F_n,\mathbb P_n)$ be a probability space and let $X_n:(\Omega_n,\mathcal F_n)\to(\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. We say that $X_n$ converges in distribution to $X$, and write $X_n\xrightarrow{d}X$, if
\begin{align*}
\mathbb E_{\mathbb P_n}[f(X_n)]\to\mathbb E_{\mathbb P}[f(X)]
\end{align*}
for every bounded [continuous function](/page/Continuous%20Function) $f:\mathbb R\to\mathbb R$.
[/definition]
Normal approximation needs a theorem that turns many small independent inputs into one universal output law. The [central limit theorem](/theorems/521) gives that bridge: after subtracting the accumulated mean and dividing by the accumulated standard deviation, the limiting distribution is standard normal.
In the form used here, if $X_1,X_2,\dots$ are independent and identically distributed real-valued random variables with mean $\mu$ and variance $\sigma^2>0$, then
\begin{align*}
\frac{X_1+\cdots+X_n-n\mu}{\sigma\sqrt n}\xrightarrow{d}\mathcal N(0,1).
\end{align*}
This is the one-dimensional normal approximation principle needed for the binomial example below.
The result says that Gaussian variables arise from aggregation after centring and scaling. It also explains why the standard normal is the universal limiting object rather than just a convenient table.
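Convergence in distribution can be observed with a single bounded continuous test function. The sketch below is an illustration under assumptions, not a proof: it takes the summands to be Bernoulli with a hypothetical parameter $p=0.3$, chooses $f=\cos$, and compares $\mathbb E[\cos(W_n)]$ for the standardised sum $W_n$ with $\mathbb E[\cos(Z)]=e^{-1/2}$, the real part of the standard normal characteristic function at $1$. The binomial probabilities are computed through logarithms to avoid overflow.

```python
import math

def binom_log_pmf(n, k, p):
    """log P(S_n = k) for S_n ~ Bin(n, p), computed via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def expected_cos_standardised(n, p):
    """E[cos(W_n)] where W_n = (S_n - n p) / sqrt(n p (1 - p))."""
    scale = math.sqrt(n * p * (1 - p))
    return sum(
        math.exp(binom_log_pmf(n, k, p)) * math.cos((k - n * p) / scale)
        for k in range(n + 1)
    )

# E[cos(Z)] for Z ~ N(0,1) equals exp(-1/2).
limit = math.exp(-0.5)

for n in (10, 100, 1000):
    value = expected_cos_standardised(n, 0.3)
    print(f"n={n}: E[cos(W_n)] = {value:.6f}, gap to limit = {abs(value - limit):.2e}")
```

One test function does not establish convergence in distribution, which requires every bounded continuous $f$; the computation only shows the definition in action.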
[example: Binomial Normal Approximation]
Let $S_n\sim\operatorname{Bin}(n,p)$ with $p\in(0,1)$, and write $S_n=X_1+\cdots+X_n$, where $X_1,\dots,X_n$ are i.i.d. Bernoulli random variables with parameter $p$. For each $j$,
\begin{align*}
\mathbb E[X_j]=1\cdot p+0\cdot(1-p)=p.
\end{align*}
Also,
\begin{align*}
\mathbb E[X_j^2]=1^2\cdot p+0^2\cdot(1-p)=p.
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}(X_j)=\mathbb E[X_j^2]-(\mathbb E[X_j])^2=p-p^2=p(1-p).
\end{align*}
By the *[Central Limit Theorem](/theorems/1848)* applied with $\mu=p$ and $\sigma^2=p(1-p)$,
\begin{align*}
\frac{S_n-np}{\sqrt{np(1-p)}}=\frac{\sum_{j=1}^n X_j-np}{\sqrt{np(1-p)}}\xrightarrow{d}Z,
\end{align*}
where $Z\sim\mathcal N(0,1)$. For integer cutoffs $a\le b$, the event $\{a\le S_n\le b\}$ can be rewritten exactly by standardising both endpoints:
\begin{align*}
a\le S_n\le b \quad \Longleftrightarrow \quad \frac{a-np}{\sqrt{np(1-p)}}\le \frac{S_n-np}{\sqrt{np(1-p)}}\le \frac{b-np}{\sqrt{np(1-p)}}.
\end{align*}
Since the limiting variable has distribution function $\Phi$, this gives the uncorrected normal approximation
\begin{align*}
\mathbb P(a\le S_n\le b)\approx \Phi\left(\frac{b-np}{\sqrt{np(1-p)}}\right)-\Phi\left(\frac{a-np}{\sqrt{np(1-p)}}\right).
\end{align*}
The continuity correction accounts for the fact that $S_n$ takes integer values while the normal approximation is continuous. For integer-valued $S_n$,
\begin{align*}
\{a\le S_n\le b\}=\left\{a-\frac12<S_n<b+\frac12\right\}.
\end{align*}
Standardising these half-integer endpoints gives
\begin{align*}
\frac{a-\frac12-np}{\sqrt{np(1-p)}}<\frac{S_n-np}{\sqrt{np(1-p)}}<\frac{b+\frac12-np}{\sqrt{np(1-p)}}.
\end{align*}
So the continuity-corrected approximation is
\begin{align*}
\mathbb P(a\le S_n\le b)\approx \Phi\left(\frac{b+\frac12-np}{\sqrt{np(1-p)}}\right)-\Phi\left(\frac{a-\frac12-np}{\sqrt{np(1-p)}}\right).
\end{align*}
The correction enlarges the discrete count interval $[a,b]$ to the continuous interval $[a-1/2,b+1/2]$ before standardising, so that each integer value $k$ is represented by the unit interval $[k-1/2,k+1/2]$.
[/example]
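The two approximations can be compared against the exact binomial probability. The following is a minimal sketch under assumed illustrative values $n=100$, $p=1/2$, $a=45$, $b=55$, none of which come from the text; the helper names are hypothetical, and $\Phi$ is evaluated through `math.erfc`.

```python
import math

def std_normal_cdf(x):
    """Phi(x) for the standard normal law."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def binom_interval_exact(n, p, a, b):
    """Exact P(a <= S_n <= b) for S_n ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(a, b + 1))

def normal_approx(n, p, a, b, corrected):
    """Normal approximation, with or without the half-unit continuity correction."""
    shift = 0.5 if corrected else 0.0
    sd = math.sqrt(n * p * (1 - p))
    upper = (b + shift - n * p) / sd
    lower = (a - shift - n * p) / sd
    return std_normal_cdf(upper) - std_normal_cdf(lower)

# Assumed illustrative values.
n, p, a, b = 100, 0.5, 45, 55
exact = binom_interval_exact(n, p, a, b)
plain = normal_approx(n, p, a, b, corrected=False)
corrected = normal_approx(n, p, a, b, corrected=True)

print(f"exact       = {exact:.6f}")
print(f"uncorrected = {plain:.6f}")
print(f"corrected   = {corrected:.6f}")
```

For these values the corrected approximation is noticeably closer to the exact probability, though the size of the improvement depends on $n$, $p$, and the cutoffs.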
The central limit theorem has hypotheses. Infinite variance is a common failure mode, and dependence can also change the limiting law.
[example: Why Finite Variance Matters]
Let $X_1,X_2,\dots$ be i.i.d. standard Cauchy random variables, so $X_1$ has density
\begin{align*}
f(x)=\frac{1}{\pi(1+x^2)}.
\end{align*}
The characteristic function of the standard Cauchy law is $\phi_{X_1}(u)=e^{-|u|}$. For the average
\begin{align*}
A_n=\frac{X_1+\cdots+X_n}{n},
\end{align*}
the definition of the characteristic function gives
\begin{align*}
\phi_{A_n}(u)=\mathbb E\left[\exp\left(iu\frac{X_1+\cdots+X_n}{n}\right)\right].
\end{align*}
Factoring the exponential,
\begin{align*}
\exp\left(iu\frac{X_1+\cdots+X_n}{n}\right)=\prod_{j=1}^n \exp\left(i\frac{u}{n}X_j\right).
\end{align*}
Using independence,
\begin{align*}
\phi_{A_n}(u)=\prod_{j=1}^n \mathbb E\left[\exp\left(i\frac{u}{n}X_j\right)\right].
\end{align*}
Since the $X_j$ have the same standard Cauchy law,
\begin{align*}
\phi_{A_n}(u)=\left(\phi_{X_1}\left(\frac{u}{n}\right)\right)^n.
\end{align*}
Substituting $\phi_{X_1}(v)=e^{-|v|}$ gives
\begin{align*}
\phi_{A_n}(u)=\left(e^{-|u|/n}\right)^n=e^{-|u|}.
\end{align*}
Thus $A_n$ has the same characteristic function as $X_1$, and hence the same standard Cauchy distribution by the uniqueness theorem for characteristic functions.
This common law does not concentrate as $n$ grows. For example,
\begin{align*}
\mathbb P(|A_n|>1)=\mathbb P(|X_1|>1).
\end{align*}
Using the Cauchy density and symmetry,
\begin{align*}
\mathbb P(|X_1|>1)=2\int_1^\infty \frac{1}{\pi(1+x^2)}\,d\mathcal L^1(x).
\end{align*}
Since $\arctan$ is an antiderivative of $x\mapsto(1+x^2)^{-1}$, with $\arctan x\to\pi/2$ as $x\to\infty$ and $\arctan 1=\pi/4$,
\begin{align*}
2\int_1^\infty \frac{1}{\pi(1+x^2)}\,d\mathcal L^1(x)=\frac{2}{\pi}\left(\frac{\pi}{2}-\frac{\pi}{4}\right)=\frac12.
\end{align*}
So the averages keep probability $1/2$ outside the interval $[-1,1]$ for every $n$.
There is also no ordinary standard deviation to use for central-limit scaling. Indeed,
\begin{align*}
\mathbb E[X_1^2]=\int_{\mathbb R}\frac{x^2}{\pi(1+x^2)}\,d\mathcal L^1(x).
\end{align*}
For $x\ge 1$, we have $x^2/(1+x^2)\ge 1/2$, so for every $R>1$,
\begin{align*}
\int_1^R \frac{x^2}{\pi(1+x^2)}\,d\mathcal L^1(x)\ge \int_1^R \frac{1}{2\pi}\,d\mathcal L^1(x)=\frac{R-1}{2\pi}.
\end{align*}
Letting $R\to\infty$ shows $\mathbb E[X_1^2]=\infty$, so the variance is not finite. The finite-variance hypothesis in the central limit theorem is therefore a genuine structural assumption, not a technical decoration.
[/example]
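The failure of concentration can also be seen in simulation. The sketch below is an illustration only: it draws standard Cauchy samples by inverse-CDF sampling, $\tan(\pi(U-\tfrac12))$ with $U$ uniform, and estimates $\mathbb P(|A_n|>1)$ for several $n$. The seed, trial count, and function names are assumptions chosen for reproducibility.

```python
import math
import random

def cauchy_sample(rng):
    """Inverse-CDF sampling: tan(pi (U - 1/2)) is standard Cauchy for uniform U."""
    return math.tan(math.pi * (rng.random() - 0.5))

def fraction_outside_unit_interval(n, trials, seed=0):
    """Monte Carlo estimate of P(|A_n| > 1) for the average A_n of n Cauchy draws."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        avg = sum(cauchy_sample(rng) for _ in range(n)) / n
        if abs(avg) > 1:
            count += 1
    return count / trials

# Exact value from the example: P(|X_1| > 1) = 1 - (2/pi) arctan(1) = 1/2.
exact_tail = 1 - (2 / math.pi) * math.atan(1.0)

for n in (1, 10, 100):
    estimate = fraction_outside_unit_interval(n, trials=2000)
    print(f"n={n}: estimated P(|A_n| > 1) = {estimate:.3f} (exact {exact_tail:.3f})")
```

The estimates should stay near $1/2$ for every $n$ up to Monte Carlo error, in contrast with finite-variance averages, whose mass outside a fixed interval around the mean shrinks as $n$ grows.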
## Beyond and Connected Topics
Gaussian random variables are the one-dimensional entry point to a much larger theory. In multivariate probability, the Gaussian random vector is governed by a mean vector and covariance matrix, and linear algebra becomes part of probability. This leads to conditional Gaussian laws, regression, Kalman filtering, and the geometry of covariance ellipsoids.
In stochastic processes, Gaussian finite-dimensional distributions lead to [Brownian motion](/page/Brownian%20Motion). A standard Brownian motion $(W_t)_{t\ge0}$ has increments $W_t-W_s\sim\mathcal N(0,t-s)$ for $0\le s<t$, and this is the starting point for martingales, Itô integration, and stochastic differential equations. The natural continuation is [Cambridge III Stochastic Calculus and Applications](/page/Cambridge%20III%20Stochastic%20Calculus%20and%20Applications).
In measure-theoretic probability, the Gaussian is a central example for characteristic functions, weak convergence, and limit theorems. These ideas are developed systematically in [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure) and then at a deeper level in [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
In statistics, Gaussian models underlie least squares, maximum likelihood estimation, confidence intervals, and hypothesis tests. The reason is not merely tradition: sums of squared Gaussian errors have chi-squared distributions, and independent Gaussian noise makes linear estimators especially tractable.
In analysis, the function $e^{-x^2/2}$ is also a distinguished object for Fourier analysis and the [heat equation](/page/Heat%20Equation). Gaussian kernels describe diffusion, and the same quadratic exponent that governs probability tails governs smoothing by the heat semigroup.
## References
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
Androma, [Cambridge III Stochastic Calculus and Applications](/page/Cambridge%20III%20Stochastic%20Calculus%20and%20Applications).
William Feller, *An Introduction to Probability Theory and Its Applications, Volume II* (1971).
Patrick Billingsley, *Probability and Measure* (1995).
Rick Durrett, *Probability: Theory and Examples* (2019).