A gambler may know that a game has average payoff zero and still go bankrupt by playing it. The mean tells us where a random quantity balances, but it does not tell us how violently the outcomes move around that balance. A bet that pays $1$ or $-1$ with equal probability and a bet that pays $1000$ or $-1000$ with equal probability both have expectation $0$; they are not the same risk.
The mathematical problem behind variance is to turn this missing information into a number. We want a measure of spread that is insensitive to the sign of deviations, compatible with expectation, and algebraically tractable enough to survive sums, conditioning, and limits. Absolute deviation is often the more intuitive choice, but the square has a decisive advantage: it interacts well with inner products, projections, independence, and martingales.
[example: Two Fair Games with the Same Mean]
Let $X$ and $Y$ be real-valued random variables on $(\Omega,\mathcal F,\mathbb P)$ with
\begin{align*}
\mathbb P(X = 1) &= \mathbb P(X = -1) = \frac{1}{2}, \\
\mathbb P(Y = 1000) &= \mathbb P(Y = -1000) = \frac{1}{2}.
\end{align*}
We first compute the two expectations from their two-point distributions:
\begin{align*}
\mathbb E[X]
&= 1\cdot \mathbb P(X=1) + (-1)\cdot \mathbb P(X=-1) \\
&= 1\cdot \frac{1}{2} - 1\cdot \frac{1}{2} \\
&= 0,
\end{align*}
and
\begin{align*}
\mathbb E[Y]
&= 1000\cdot \mathbb P(Y=1000) + (-1000)\cdot \mathbb P(Y=-1000) \\
&= 1000\cdot \frac{1}{2} - 1000\cdot \frac{1}{2} \\
&= 0.
\end{align*}
Thus both games have mean payoff $0$.
Their squared deviations from their means have different sizes. Since $\mathbb E[X]=0$,
\begin{align*}
\mathbb E[(X-\mathbb E[X])^2]
&= \mathbb E[X^2] \\
&= 1^2\cdot \mathbb P(X=1) + (-1)^2\cdot \mathbb P(X=-1) \\
&= 1\cdot \frac{1}{2} + 1\cdot \frac{1}{2} \\
&= 1.
\end{align*}
Since $\mathbb E[Y]=0$,
\begin{align*}
\mathbb E[(Y-\mathbb E[Y])^2]
&= \mathbb E[Y^2] \\
&= 1000^2\cdot \mathbb P(Y=1000) + (-1000)^2\cdot \mathbb P(Y=-1000) \\
&= 1000000\cdot \frac{1}{2} + 1000000\cdot \frac{1}{2} \\
&= 1000000.
\end{align*}
Expectation alone treats both games as fair, while the squared deviation calculation records that the second game fluctuates on a much larger scale.
[illustration:variance-two-fair-games-spread]
[/example]
The example leaves a precise problem to solve. A spread measure should ignore where the [random variable](/page/Random%20Variable) is located and measure only how much it fluctuates around that location. If we measured $\mathbb E[X^2]$ directly, then shifting every outcome by a constant would change the value even though the amount of fluctuation had not changed. The first operation is therefore to subtract the mean, producing a new random variable whose average level has been removed and whose values record only deviations from the balance point.
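The effect of a shift can be checked directly on the two games from the example. The following Python sketch uses exact rational arithmetic; the helper names (`mean`, `second_moment`, `centered_second_moment`, `shift`) and the encoding of a distribution as a dictionary from values to probabilities are illustrative assumptions, not notation used elsewhere on this page.

```python
from fractions import Fraction

def mean(pmf):
    """Expectation of a finite distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def second_moment(pmf):
    """Raw second moment E[X^2], with no centering."""
    return sum(x**2 * p for x, p in pmf.items())

def centered_second_moment(pmf):
    """E[(X - E[X])^2]: subtract the mean first, then square and average."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

def shift(pmf, c):
    """Distribution of X + c (hypothetical helper for this illustration)."""
    return {x + c: p for x, p in pmf.items()}

half = Fraction(1, 2)
game_x = {1: half, -1: half}
game_y = {1000: half, -1000: half}

# Both games have mean 0 but very different centered second moments.
for name, pmf in [("X", game_x), ("Y", game_y)]:
    print(name, mean(pmf), centered_second_moment(pmf))

# Shifting by 5 changes the raw second moment but not the centered one.
shifted = shift(game_x, 5)
print(second_moment(game_x), second_moment(shifted))
print(centered_second_moment(game_x), centered_second_moment(shifted))
```

In this sketch the raw second moment depends on the shift, while the centered quantity does not, which is the reason for centering before squaring.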
For a probability space $(\Omega,\mathcal F,\mathbb P)$, the notation $L^1(\Omega,\mathcal F,\mathbb P)$ means the space of integrable real-valued random variables, and $L^2(\Omega,\mathcal F,\mathbb P)$ means the space of square-integrable real-valued random variables. Thus $X\in L^1$ means $\mathbb E[|X|]<\infty$, while $X\in L^2$ means $\mathbb E[X^2]<\infty$.
The definition below isolates the centering operation because variance should not measure the absolute location of a random variable. We first need a reusable map that takes an integrable random variable and replaces it by its deviation from its own expectation; only after that subtraction does it make sense to square and average the remaining fluctuation.
[definition: Centered Random Variable]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. The centering map is
\begin{align*}
C: L^1(\Omega, \mathcal F, \mathbb P) &\to L^1(\Omega, \mathcal F, \mathbb P) \\
X &\mapsto X - \mathbb E[X].
\end{align*}
For an integrable real-valued random variable $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$, the centered random variable associated to $X$ is $C(X)$.
[/definition]
Centering turns the question from “where is the variable located?” into “how far does it move away from its own location?” A useful spread measure must then remove cancellation between positive and negative deviations. Squaring is the choice that produces a non-negative quantity and keeps the calculation inside the algebra of expectation.
## Definition
Once deviations have been centered and squared, there is still a domain issue: the expected squared deviation may fail to exist. The useful spread functional should therefore be defined exactly where this quadratic average is finite. The $L^2$ hypothesis is the condition that makes the squared deviation integrable, so the formal definition records both the domain of the construction and the number it assigns to each random variable.
[definition: Variance]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. The variance functional is
\begin{align*}
\operatorname{Var}: L^2(\Omega, \mathcal F, \mathbb P) &\to [0,\infty) \\
X &\mapsto \mathbb E[(X - \mathbb E[X])^2].
\end{align*}
For a square-integrable real-valued random variable $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$, the variance of $X$ is $\operatorname{Var}(X)$.
[/definition]
The definition gives the central quantity of the page. The next computational problem is to evaluate $\mathbb E[(X - \mathbb E[X])^2]$ without expanding the centered square from scratch every time. A formula in terms of the first two moments makes variance usable from tables, densities, and probability mass functions.
[quotetheorem:5015]
This identity is the working formula for most computations. It separates the second moment from the square of the first moment, and it explains why variance is unchanged by translation but changes quadratically under scaling. There is still a reporting problem: if $X$ is measured in metres, then $\operatorname{Var}(X)$ is measured in square metres. The square root restores the original unit while leaving variance as the algebraically preferred object.
[definition: Standard Deviation]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. The standard deviation functional is
\begin{align*}
\sigma: L^2(\Omega, \mathcal F, \mathbb P) &\to [0,\infty) \\
X &\mapsto \sqrt{\operatorname{Var}(X)}.
\end{align*}
For a square-integrable real-valued random variable $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$, the standard deviation of $X$ is $\sigma_X = \sigma(X)$.
[/definition]
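As a small illustration of the definitions so far, the sketch below computes the variance of a fair six-sided die in two ways: from the definition and from the first two moments, $\mathbb E[X^2]-(\mathbb E[X])^2$. The die and the helper `expectation` are assumptions chosen for this illustration only.

```python
from fractions import Fraction
from math import sqrt

# Fair six-sided die: an assumed example distribution, not one from the page.
die = {k: Fraction(1, 6) for k in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a finite distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

m = expectation(die)

# Variance from the definition: average squared deviation from the mean.
var_definition = expectation(die, lambda x: (x - m) ** 2)

# Variance from the first two moments.
var_moments = expectation(die, lambda x: x**2) - m**2

# Standard deviation: the square root restores the unit of X.
std = sqrt(float(var_moments))

print(m, var_definition, var_moments, std)
```

The two variance computations agree exactly because the arithmetic is rational; only the square root is computed in floating point.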
The standard deviation is often better for reporting scale, while variance is usually better for calculation. The simplest discrete example already shows the difference between randomness that is present and randomness that has collapsed into certainty.
[example: Bernoulli Variance]
Let $X \sim \operatorname{Ber}(p)$ with $p \in [0,1]$, so
\begin{align*}
\mathbb P(X=1) &= p, \\
\mathbb P(X=0) &= 1-p.
\end{align*}
Since $X$ only takes the values $0$ and $1$, we have $X^2=X$ pointwise: if $X(\omega)=0$, then $X(\omega)^2=0^2=0$, and if $X(\omega)=1$, then $X(\omega)^2=1^2=1$. The first two moments therefore agree:
\begin{align*}
\mathbb E[X]
&= 1\cdot \mathbb P(X=1) + 0\cdot \mathbb P(X=0) \\
&= 1\cdot p + 0\cdot (1-p) \\
&= p,
\end{align*}
and
\begin{align*}
\mathbb E[X^2]
&= 1^2\cdot \mathbb P(X=1) + 0^2\cdot \mathbb P(X=0) \\
&= 1\cdot p + 0\cdot (1-p) \\
&= p.
\end{align*}
Using the [computational formula for variance](/theorems/5015),
\begin{align*}
\operatorname{Var}(X)
&= \mathbb E[X^2] - (\mathbb E[X])^2 \\
&= p - p^2 \\
&= p(1-p).
\end{align*}
To locate the largest value on $[0,1]$, rewrite
\begin{align*}
p(1-p)
&= p-p^2 \\
&= -\left(p^2-p\right) \\
&= -\left(p^2-p+\frac{1}{4}\right)+\frac{1}{4} \\
&= \frac{1}{4}-\left(p-\frac{1}{2}\right)^2.
\end{align*}
Since $\left(p-\frac{1}{2}\right)^2 \ge 0$, the maximum value is $1/4$, attained at $p=1/2$. At $p=0$ and $p=1$,
\begin{align*}
p(1-p)=0,
\end{align*}
while $p(1-p)>0$ for $0<p<1$ because both factors are positive. So the variance vanishes exactly when the Bernoulli outcome is almost surely constant.
[illustration:bernoulli-variance-curve]
[/example]
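The Bernoulli computation can be checked on a grid of parameter values. This is a minimal sketch; the grid spacing of $1/20$ and the function name `bernoulli_variance` are arbitrary choices made for the illustration.

```python
from fractions import Fraction

def bernoulli_variance(p):
    """Variance of Ber(p) from the first two moments, using E[X^2] = E[X] = p."""
    return p - p**2

# Exact rational grid on [0, 1]; the step 1/20 is an arbitrary choice.
grid = [Fraction(k, 20) for k in range(21)]
values = {p: bernoulli_variance(p) for p in grid}

# The completed-square form 1/4 - (p - 1/2)^2 agrees at every grid point.
for p, v in values.items():
    assert v == Fraction(1, 4) - (p - Fraction(1, 2)) ** 2

best_p = max(values, key=values.get)              # grid point of largest variance
zeros = [p for p, v in values.items() if v == 0]  # grid points with no randomness

print(best_p, values[best_p], zeros)
```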
## Quadratic Structure
### Zero Variance and Constants
Variance is a squared distance from the constants inside the [Hilbert space](/page/Hilbert%20Space) $L^2(\Omega, \mathcal F, \mathbb P)$: among all constants, $\mathbb E[X]$ is the one closest to $X$, and $\operatorname{Var}(X)$ is the squared length of the residual $X-\mathbb E[X]$. This viewpoint explains why variance is non-negative, why constants are exactly the variables with zero variance, and why affine transformations have such simple formulas.
A spread measure should never be negative. For variance this is not a separate axiom; it comes from the fact that it is the expectation of a non-negative random variable. The equality case matters as much as the inequality, because it tells us when all randomness has disappeared.
[quotetheorem:5016]
This result identifies the only variables with no fluctuation: constants, up to null sets. It is important that equality is almost sure rather than pointwise, since random variables in probability are insensitive to events of probability zero.
[example: Zero Variance with a Null-Set Exception]
Let $\Omega = [0,1]$, let $\mathcal F = \mathcal B([0,1])$, and let $\mathbb P$ be [Lebesgue measure](/page/Lebesgue%20Measure) restricted to $[0,1]$. Define
\begin{align*}
X(\omega) =
\begin{cases}
7, & \omega \in [0,1] \setminus \{1/2\}, \\
100, & \omega = 1/2.
\end{cases}
\end{align*}
The singleton $\{1/2\}$ has Lebesgue measure $0$, so
\begin{align*}
\mathbb P(\{1/2\}) &= 0, \\
\mathbb P([0,1]\setminus \{1/2\}) &= 1-\mathbb P(\{1/2\}) \\
&= 1.
\end{align*}
Thus $X=7$ on an event of probability $1$, so $X=7$ $\mathbb P$-a.s. Computing the expectation from the two possible values gives
\begin{align*}
\mathbb E[X]
&= 7\cdot \mathbb P([0,1]\setminus \{1/2\}) + 100\cdot \mathbb P(\{1/2\}) \\
&= 7\cdot 1 + 100\cdot 0 \\
&= 7.
\end{align*}
Therefore the centered random variable is
\begin{align*}
X(\omega)-\mathbb E[X]
=
\begin{cases}
7-7, & \omega \in [0,1] \setminus \{1/2\}, \\
100-7, & \omega = 1/2,
\end{cases}
=
\begin{cases}
0, & \omega \in [0,1] \setminus \{1/2\}, \\
93, & \omega = 1/2.
\end{cases}
\end{align*}
Hence
\begin{align*}
\operatorname{Var}(X)
&= \mathbb E[(X-\mathbb E[X])^2] \\
&= 0^2\cdot \mathbb P([0,1]\setminus \{1/2\}) + 93^2\cdot \mathbb P(\{1/2\}) \\
&= 0\cdot 1 + 8649\cdot 0 \\
&= 0.
\end{align*}
The value $100$ occurs only on a probability-zero event, so it changes neither the expectation nor the variance.
[/example]
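The squared-distance viewpoint from the start of this subsection can be tested numerically. The sketch below checks, for an assumed three-point distribution, the identity $\mathbb E[(X-c)^2]=\operatorname{Var}(X)+(\mathbb E[X]-c)^2$ on a grid of constants $c$, so that the mean is the closest constant and the variance is the minimal squared distance. The distribution, the grid, and the helper names are illustrative assumptions.

```python
from fractions import Fraction

# Assumed three-point distribution, chosen only for illustration.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def squared_distance_to_constant(pmf, c):
    """E[(X - c)^2], the squared L^2 distance from X to the constant c."""
    return sum((x - c) ** 2 * p for x, p in pmf.items())

m = mean(pmf)
variance = squared_distance_to_constant(pmf, m)

# Candidate constants from -2 to 6 in steps of 1/4 (the mean 5/4 is on the grid).
candidates = [Fraction(k, 4) for k in range(-8, 25)]
for c in candidates:
    assert squared_distance_to_constant(pmf, c) == variance + (m - c) ** 2

closest = min(candidates, key=lambda c: squared_distance_to_constant(pmf, c))

# A point mass at 7 lies at distance zero from the constant 7.
constant = {7: Fraction(1)}
print(m, variance, closest, squared_distance_to_constant(constant, 7))
```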
### Translation and Scaling
The next basic test is whether variance responds correctly to changes of origin and scale. Shifting a random variable should not change its spread; multiplying by a number should multiply deviations by that number and squared deviations by its square. This is the calculation that lets variance behave predictably under changes of units.
[quotetheorem:5017]
The translation parameter $b$ disappears, while the scale parameter $a$ is squared. This is why standard deviation, not variance, is the quantity with the same physical units as $X$.
[example: Rescaling a Measurement]
Suppose $T$ is a temperature in degrees Celsius with $\operatorname{Var}(T)=9$, and define
\begin{align*}
S = \frac{9}{5}T + 32.
\end{align*}
Using *[Affine Transformation of Variance](/theorems/5017)* with $a=9/5$ and $b=32$, we get
\begin{align*}
\operatorname{Var}(S)
&= \operatorname{Var}\left(\frac{9}{5}T+32\right) \\
&= \left(\frac{9}{5}\right)^2\operatorname{Var}(T) \\
&= \frac{9^2}{5^2}\cdot 9 \\
&= \frac{81}{25}\cdot 9 \\
&= \frac{729}{25}.
\end{align*}
The additive offset $32$ disappears from the variance calculation, while the change from Celsius units to Fahrenheit units multiplies variance by the square of the scale factor $9/5$.
[/example]
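The rescaling rule can be verified on a concrete distribution. The example above only fixes $\operatorname{Var}(T)=9$; the sketch below additionally assumes, purely for illustration, that $T$ takes the values $17$ and $23$ with equal probability, which gives that variance. The helper `affine` is hypothetical.

```python
from fractions import Fraction

def variance(pmf):
    """Variance of a finite distribution given as {value: probability}."""
    m = sum(x * p for x, p in pmf.items())
    return sum((x - m) ** 2 * p for x, p in pmf.items())

def affine(pmf, a, b):
    """Distribution of a*X + b; probabilities of coinciding values are merged."""
    out = {}
    for x, p in pmf.items():
        y = a * x + b
        out[y] = out.get(y, 0) + p
    return out

# Assumed Celsius distribution with mean 20 and variance 9.
celsius = {17: Fraction(1, 2), 23: Fraction(1, 2)}
fahrenheit = affine(celsius, Fraction(9, 5), 32)

print(variance(celsius), variance(fahrenheit))
# The ratio of the two variances is the squared scale factor; 32 plays no role.
print(variance(fahrenheit) / variance(celsius))
```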
## Covariance and Sums
### Cross Terms
The variance of a sum is the point at which spread meets dependence. Two random variables can each fluctuate substantially, yet their sum may fluctuate little if their deviations cancel. Conversely, positively aligned deviations reinforce each other. Covariance is the term that records this alignment.
Before adding variances, we need a way to measure whether two centered variables move together. Multiplying their centered deviations gives a signed quantity: positive when they deviate in the same direction, negative when they deviate in opposite directions.
[definition: Covariance]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. The covariance functional is
\begin{align*}
\operatorname{Cov}: L^2(\Omega, \mathcal F, \mathbb P) \times L^2(\Omega, \mathcal F, \mathbb P) &\to \mathbb R \\
(X,Y) &\mapsto \mathbb E[(X - \mathbb E[X])(Y - \mathbb E[Y])].
\end{align*}
For square-integrable real-valued random variables $X,Y: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$, the covariance of $X$ and $Y$ is $\operatorname{Cov}(X,Y)$.
[/definition]
Covariance is variance when the two variables are the same. More importantly, it is the cross term that appears when the squared deviation of $X+Y$ is expanded. Before using that cross term in a sum formula, it is worth isolating a normalized version that records the same alignment without the units of $X$ and $Y$.
Covariance has the units of the product $XY$, so its raw size can be hard to compare across problems. When both variables have positive variance, dividing by their standard deviations gives a dimensionless measure of linear alignment.
[definition: Correlation Coefficient]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, and define
\begin{align*}
\mathcal C = \{(X,Y) \in L^2(\Omega, \mathcal F, \mathbb P) \times L^2(\Omega, \mathcal F, \mathbb P) : \operatorname{Var}(X) > 0 \text{ and } \operatorname{Var}(Y) > 0\}.
\end{align*}
The correlation coefficient functional is
\begin{align*}
\rho: \mathcal C &\to [-1,1] \\
(X,Y) &\mapsto \frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}.
\end{align*}
For square-integrable real-valued random variables $X,Y: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$ with positive variances, the correlation coefficient of $X$ and $Y$ is $\rho(X,Y)$.
[/definition]
The codomain $[-1,1]$ is not a convention hidden in the notation; it is forced by Cauchy-Schwarz applied to the centered variables. Correlation normalizes covariance, while covariance itself remains the algebraic term in variance identities.
[quotetheorem:5018]
The inequality says that covariance cannot exceed the product of the two standard deviations in magnitude. For sums, the unnormalized cross term is the term that appears directly after expanding the square.
This prepares the independent case. Independence is often useful precisely because it makes expectations of products factor, $\mathbb E[XY]=\mathbb E[X]\,\mathbb E[Y]$, so the covariance terms vanish and the variance of a sum becomes the sum of the variances.
[quotetheorem:1119]
The hypothesis is independence, not merely that the variables are distinct. If two variables are negatively correlated, the variance of the sum may be smaller than either individual variance.
[example: Cancellation Under Perfect Negative Dependence]
Let $X$ be any square-integrable real-valued random variable, and set $Y=-X$. Then $Y$ is square-integrable because $Y^2=(-X)^2=X^2$, so all variances and covariances below are finite. Since $X+Y=X-X=0$ pointwise,
\begin{align*}
\operatorname{Var}(X+Y)
&= \operatorname{Var}(0) \\
&= \mathbb E[(0-\mathbb E[0])^2] \\
&= \mathbb E[(0-0)^2] \\
&= \mathbb E[0] \\
&= 0.
\end{align*}
The two individual variances are equal because
\begin{align*}
\mathbb E[Y]
&= \mathbb E[-X] \\
&= -\mathbb E[X],
\end{align*}
and therefore
\begin{align*}
Y-\mathbb E[Y]
&= -X-(-\mathbb E[X]) \\
&= -(X-\mathbb E[X]).
\end{align*}
Squaring gives
\begin{align*}
(Y-\mathbb E[Y])^2
&= \left(-(X-\mathbb E[X])\right)^2 \\
&= (X-\mathbb E[X])^2,
\end{align*}
so
\begin{align*}
\operatorname{Var}(Y)
&= \mathbb E[(Y-\mathbb E[Y])^2] \\
&= \mathbb E[(X-\mathbb E[X])^2] \\
&= \operatorname{Var}(X).
\end{align*}
The covariance is negative because
\begin{align*}
\operatorname{Cov}(X,Y)
&= \mathbb E[(X-\mathbb E[X])(Y-\mathbb E[Y])] \\
&= \mathbb E[(X-\mathbb E[X])\left(-(X-\mathbb E[X])\right)] \\
&= \mathbb E[-(X-\mathbb E[X])^2] \\
&= -\mathbb E[(X-\mathbb E[X])^2] \\
&= -\operatorname{Var}(X).
\end{align*}
Thus the variance-of-a-sum identity becomes
\begin{align*}
\operatorname{Var}(X+Y)
&= \operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y) \\
&= \operatorname{Var}(X)+\operatorname{Var}(X)+2(-\operatorname{Var}(X)) \\
&= 2\operatorname{Var}(X)-2\operatorname{Var}(X) \\
&= 0.
\end{align*}
The random variables $X$ and $Y=-X$ may each fluctuate, but their deviations cancel exactly in the sum, so variance is not additive without a condition that removes the covariance term.
[/example]
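The role of the cross term can be seen on a small joint distribution. The sketch below computes covariance, correlation, and the variance of a sum for an assumed joint distribution on $\{0,1\}^2$, then repeats the perfectly negatively dependent case with $X=\pm 1$ and $Y=-X$. The joint probabilities and all helper names are assumptions made for the illustration.

```python
from fractions import Fraction
from math import sqrt

def expect(joint, g):
    """E[g(X, Y)] for a finite joint distribution {(x, y): probability}."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def cov(joint, f, g):
    """Covariance of f(X, Y) and g(X, Y), computed from centered products."""
    mf, mg = expect(joint, f), expect(joint, g)
    return expect(joint, lambda x, y: (f(x, y) - mf) * (g(x, y) - mg))

def first(x, y):
    return x

def second(x, y):
    return y

def total(x, y):
    return x + y

# Assumed joint distribution in which X and Y tend to agree.
joint = {(0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
         (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5)}

var_x, var_y = cov(joint, first, first), cov(joint, second, second)
cov_xy = cov(joint, first, second)
var_sum = cov(joint, total, total)
rho = float(cov_xy) / sqrt(float(var_x) * float(var_y))

# Expanding the square: Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
assert var_sum == var_x + var_y + 2 * cov_xy
print(var_x, var_y, cov_xy, var_sum, rho)

# Perfect negative dependence: Y = -X with X = +1 or -1 equally likely.
anti = {(1, -1): Fraction(1, 2), (-1, 1): Fraction(1, 2)}
print(cov(anti, first, second), cov(anti, total, total))
```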
### Averaging Independent Observations
For repeated independent trials, the additive formula turns variance into a scaling law. Sums grow in variance like $n$, while averages shrink in variance like $1/n$. This is the numerical reason that repeated measurement suppresses random noise.
[example: Variance of a Sample Mean]
Let $X_1,\dots,X_n$ be i.i.d. real-valued random variables with $\mathbb E[X_1^2] < \infty$, mean $\mu$, and variance $\sigma^2$, and assume $n \ge 1$. Define
\begin{align*}
\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
Because the variables are identically distributed, $\mathbb E[X_i]=\mu$ and $\operatorname{Var}(X_i)=\sigma^2$ for every $i \in \{1,\dots,n\}$. By linearity of expectation,
\begin{align*}
\mathbb E[\bar X_n]
&= \mathbb E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] \\
&= \frac{1}{n}\mathbb E\left[\sum_{i=1}^n X_i\right] \\
&= \frac{1}{n}\sum_{i=1}^n \mathbb E[X_i] \\
&= \frac{1}{n}\sum_{i=1}^n \mu \\
&= \frac{1}{n}\cdot n\mu \\
&= \mu.
\end{align*}
For the variance, first apply *Affine Transformation of Variance* with $a=1/n$ and $b=0$:
\begin{align*}
\operatorname{Var}(\bar X_n)
&= \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) \\
&= \left(\frac{1}{n}\right)^2\operatorname{Var}\left(\sum_{i=1}^n X_i\right) \\
&= \frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^n X_i\right).
\end{align*}
Since $X_1,\dots,X_n$ are independent, *Variance Adds for Independent Random Variables* gives
\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n X_i\right)
&= \sum_{i=1}^n \operatorname{Var}(X_i) \\
&= \sum_{i=1}^n \sigma^2 \\
&= n\sigma^2.
\end{align*}
Substituting this into the previous display,
\begin{align*}
\operatorname{Var}(\bar X_n)
&= \frac{1}{n^2}\cdot n\sigma^2 \\
&= \frac{\sigma^2}{n}.
\end{align*}
Thus averaging $n$ independent observations preserves the common mean $\mu$ and reduces the variance from $\sigma^2$ to $\sigma^2/n$.
[/example]
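The $\sigma^2/n$ law can be confirmed by brute force for small $n$. The sketch below assumes the observations are independent fair six-sided dice, for which $\sigma^2=35/12$, and enumerates all $6^n$ equally likely outcomes; the choice of dice and the function name are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
sigma2 = Fraction(35, 12)  # variance of a single fair die

def sample_mean_variance(n):
    """Exact variance of the mean of n independent fair dice, by enumeration."""
    weight = Fraction(1, 6**n)  # independence: every n-tuple is equally likely
    means = [Fraction(sum(rolls), n) for rolls in product(faces, repeat=n)]
    m = sum(means) * weight
    return sum((x - m) ** 2 for x in means) * weight

for n in range(1, 5):
    # Second and third columns should agree: enumeration versus sigma^2 / n.
    print(n, sample_mean_variance(n), sigma2 / n)
```

Enumeration grows like $6^n$, so this check is only practical for small $n$; the formula itself holds for every $n \ge 1$.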
## Moment Bounds and Concentration
Variance becomes powerful when it turns into probability bounds. A small variance says that large deviations from the mean cannot happen too often. This does not require knowing the full distribution; the second moment alone already controls tails at a coarse scale.
The standard variance tail estimate is [Chebyshev's inequality](/theorems/1126). It controls deviations of $X$ from its expectation by applying a non-negative-variable bound to $Z=(X-\mathbb E[X])^2$, so the second moment becomes a direct upper bound for the probability of a large deviation.
[quotetheorem:1126]
Chebyshev's inequality is a bridge from algebra to probability. The next question is whether the variance decay of sample averages is strong enough to force convergence. Since Chebyshev turns small variance into small deviation probability, it gives a direct route from $\operatorname{Var}(\bar X_n)=\operatorname{Var}(X_1)/n$ to the [weak law of large numbers](/theorems/1851).
[quotetheorem:1127]
The theorem says that independent averaging stabilizes. The proof mechanism is the variance computation $\operatorname{Var}(\bar X_n) = \operatorname{Var}(X_1)/n$ combined with Chebyshev's inequality.
[example: A Concrete Chebyshev Bound]
Let $X_1,\dots,X_{100}$ be i.i.d. random variables with mean $10$ and variance $4$, and define
\begin{align*}
\bar X_{100}
&= \frac{1}{100}\sum_{i=1}^{100} X_i.
\end{align*}
First compute its mean by linearity of expectation:
\begin{align*}
\mathbb E[\bar X_{100}]
&= \mathbb E\left[\frac{1}{100}\sum_{i=1}^{100}X_i\right] \\
&= \frac{1}{100}\mathbb E\left[\sum_{i=1}^{100}X_i\right] \\
&= \frac{1}{100}\sum_{i=1}^{100}\mathbb E[X_i] \\
&= \frac{1}{100}\sum_{i=1}^{100}10 \\
&= \frac{1}{100}\cdot 1000 \\
&= 10.
\end{align*}
For the variance, apply *Affine Transformation of Variance* with $a=1/100$ and $b=0$, then use *Variance Adds for Independent Random Variables*:
\begin{align*}
\operatorname{Var}(\bar X_{100})
&= \operatorname{Var}\left(\frac{1}{100}\sum_{i=1}^{100}X_i\right) \\
&= \left(\frac{1}{100}\right)^2\operatorname{Var}\left(\sum_{i=1}^{100}X_i\right) \\
&= \frac{1}{10000}\sum_{i=1}^{100}\operatorname{Var}(X_i) \\
&= \frac{1}{10000}\sum_{i=1}^{100}4 \\
&= \frac{1}{10000}\cdot 400 \\
&= \frac{400}{10000} \\
&= \frac{1}{25}.
\end{align*}
Now apply *Chebyshev Inequality* to $\bar X_{100}$ with $a=1$:
\begin{align*}
\mathbb P(|\bar X_{100}-10|\ge 1)
&= \mathbb P(|\bar X_{100}-\mathbb E[\bar X_{100}]|\ge 1) \\
&\le \frac{\operatorname{Var}(\bar X_{100})}{1^2} \\
&= \frac{1/25}{1} \\
&= \frac{1}{25}.
\end{align*}
The bound uses only independence, the common mean, and the common variance; no assumption about the distributional shape of the $X_i$ is needed.
[/example]
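Chebyshev's bound is distribution-free, so it is usually far from sharp. To see this, the sketch below assumes one specific distribution with mean $10$ and variance $4$, namely $X_i\in\{8,12\}$ with equal probability, and computes the exact tail probability of $\bar X_{100}$ from the binomial distribution. The two-point distribution is an assumption for this illustration; the example above makes no such assumption.

```python
from fractions import Fraction
from math import comb

n = 100

def sample_mean(b):
    """Sample mean when b of the n observations equal 12 and n - b equal 8."""
    return Fraction(12 * b + 8 * (n - b), n)

# The number b of observations equal to 12 is Binomial(n, 1/2).
exact_tail = sum(
    Fraction(comb(n, b), 2**n)
    for b in range(n + 1)
    if abs(sample_mean(b) - 10) >= 1
)

# Chebyshev: Var(mean) / a^2 with Var(mean) = 4 / n and a = 1.
chebyshev_bound = Fraction(4, n) / 1**2

print(float(exact_tail), float(chebyshev_bound))
```

For this particular distribution the exact probability lies well below the bound, which is consistent with Chebyshev's inequality controlling tails only at a coarse scale.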
## Conditional Variance
Variance can be decomposed by information. If a sub-$\sigma$-algebra $\mathcal G \subset \mathcal F$ represents what is known, then part of the uncertainty is the remaining fluctuation after seeing $\mathcal G$, and part is the fluctuation of the conditional mean itself. This is the probabilistic version of decomposing a vector into a projection and an orthogonal residual.
To make that decomposition precise, we measure squared deviation after conditioning on the available information. The result is not a number but a $\mathcal G$-measurable random variable: the amount of residual variance depends on what has been observed.
[definition: Conditional Variance]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, and let $\mathcal G \subset \mathcal F$ be a sub-$\sigma$-algebra. The conditional variance map given $\mathcal G$ is
\begin{align*}
\operatorname{Var}(\cdot \mid \mathcal G): L^2(\Omega, \mathcal F, \mathbb P) &\to L^1(\Omega, \mathcal G, \mathbb P) \\
X &\mapsto \mathbb E[(X - \mathbb E[X \mid \mathcal G])^2 \mid \mathcal G].
\end{align*}
For a square-integrable real-valued random variable $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$, the conditional variance of $X$ given $\mathcal G$ is $\operatorname{Var}(X \mid \mathcal G)$.
[/definition]
Conditional variance isolates the uncertainty that remains after information has been used. The total variance formula should recover the original variance by adding two contributions: average residual uncertainty inside information classes and variability between the conditional means of those classes.
[quotetheorem:5019]
The formula separates noise from signal: $\mathbb E[\operatorname{Var}(X\mid \mathcal G)]$ is the noise that remains after conditioning, and $\operatorname{Var}(\mathbb E[X\mid \mathcal G])$ is the signal carried by the conditional mean.
[illustration:total-variance-decomposition]
[example: Mixture of Two Coins]
First choose a coin type $C \in \{0,1\}$ with $\mathbb P(C=0)=\mathbb P(C=1)=1/2$. Given $C=0$, let $X \sim \operatorname{Ber}(1/4)$; given $C=1$, let $X \sim \operatorname{Ber}(3/4)$. Let $\mathcal G=\sigma(C)$, so conditioning on $\mathcal G$ is the same as conditioning on the observed value of $C$.
On the event $\{C=0\}$,
\begin{align*}
\mathbb E[X\mid C=0]
&= 1\cdot \mathbb P(X=1\mid C=0)+0\cdot \mathbb P(X=0\mid C=0) \\
&= 1\cdot \frac{1}{4}+0\cdot \frac{3}{4} \\
&= \frac{1}{4}.
\end{align*}
On the event $\{C=1\}$,
\begin{align*}
\mathbb E[X\mid C=1]
&= 1\cdot \mathbb P(X=1\mid C=1)+0\cdot \mathbb P(X=0\mid C=1) \\
&= 1\cdot \frac{3}{4}+0\cdot \frac{1}{4} \\
&= \frac{3}{4}.
\end{align*}
Therefore
\begin{align*}
\mathbb E[X\mid \mathcal G]
=
\begin{cases}
1/4, & C=0, \\
3/4, & C=1.
\end{cases}
\end{align*}
Next compute the conditional variances. Since $X^2=X$ for a Bernoulli random variable,
\begin{align*}
\mathbb E[X^2\mid C=0]
&= \mathbb E[X\mid C=0] \\
&= \frac{1}{4},
\end{align*}
and hence
\begin{align*}
\operatorname{Var}(X\mid C=0)
&= \mathbb E[X^2\mid C=0]-\mathbb E[X\mid C=0]^2 \\
&= \frac{1}{4}-\left(\frac{1}{4}\right)^2 \\
&= \frac{1}{4}-\frac{1}{16} \\
&= \frac{4}{16}-\frac{1}{16} \\
&= \frac{3}{16}.
\end{align*}
Similarly,
\begin{align*}
\mathbb E[X^2\mid C=1]
&= \mathbb E[X\mid C=1] \\
&= \frac{3}{4},
\end{align*}
so
\begin{align*}
\operatorname{Var}(X\mid C=1)
&= \mathbb E[X^2\mid C=1]-\mathbb E[X\mid C=1]^2 \\
&= \frac{3}{4}-\left(\frac{3}{4}\right)^2 \\
&= \frac{3}{4}-\frac{9}{16} \\
&= \frac{12}{16}-\frac{9}{16} \\
&= \frac{3}{16}.
\end{align*}
Thus
\begin{align*}
\operatorname{Var}(X\mid \mathcal G)
=
\begin{cases}
3/16, & C=0, \\
3/16, & C=1.
\end{cases}
\end{align*}
Taking expectation gives
\begin{align*}
\mathbb E[\operatorname{Var}(X\mid \mathcal G)]
&= \frac{3}{16}\mathbb P(C=0)+\frac{3}{16}\mathbb P(C=1) \\
&= \frac{3}{16}\cdot \frac{1}{2}+\frac{3}{16}\cdot \frac{1}{2} \\
&= \frac{3}{32}+\frac{3}{32} \\
&= \frac{3}{16}.
\end{align*}
Now let $M=\mathbb E[X\mid \mathcal G]$. Then $M$ takes the values $1/4$ and $3/4$, each with probability $1/2$. Its expectation is
\begin{align*}
\mathbb E[M]
&= \frac{1}{4}\mathbb P(C=0)+\frac{3}{4}\mathbb P(C=1) \\
&= \frac{1}{4}\cdot \frac{1}{2}+\frac{3}{4}\cdot \frac{1}{2} \\
&= \frac{1}{8}+\frac{3}{8} \\
&= \frac{1}{2},
\end{align*}
and its second moment is
\begin{align*}
\mathbb E[M^2]
&= \left(\frac{1}{4}\right)^2\mathbb P(C=0)+\left(\frac{3}{4}\right)^2\mathbb P(C=1) \\
&= \frac{1}{16}\cdot \frac{1}{2}+\frac{9}{16}\cdot \frac{1}{2} \\
&= \frac{1}{32}+\frac{9}{32} \\
&= \frac{10}{32} \\
&= \frac{5}{16}.
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}(\mathbb E[X\mid \mathcal G])
&= \operatorname{Var}(M) \\
&= \mathbb E[M^2]-(\mathbb E[M])^2 \\
&= \frac{5}{16}-\left(\frac{1}{2}\right)^2 \\
&= \frac{5}{16}-\frac{1}{4} \\
&= \frac{5}{16}-\frac{4}{16} \\
&= \frac{1}{16}.
\end{align*}
Adding the within-coin and between-coin contributions gives
\begin{align*}
\mathbb E[\operatorname{Var}(X\mid \mathcal G)]
+\operatorname{Var}(\mathbb E[X\mid \mathcal G])
&= \frac{3}{16}+\frac{1}{16} \\
&= \frac{4}{16} \\
&= \frac{1}{4}.
\end{align*}
This agrees with the unconditional distribution of $X$. Indeed,
\begin{align*}
\mathbb P(X=1)
&= \mathbb P(X=1\mid C=0)\mathbb P(C=0)+\mathbb P(X=1\mid C=1)\mathbb P(C=1) \\
&= \frac{1}{4}\cdot \frac{1}{2}+\frac{3}{4}\cdot \frac{1}{2} \\
&= \frac{1}{8}+\frac{3}{8} \\
&= \frac{1}{2},
\end{align*}
so $X\sim \operatorname{Ber}(1/2)$ and
\begin{align*}
\operatorname{Var}(X)
&= \frac{1}{2}\left(1-\frac{1}{2}\right) \\
&= \frac{1}{2}\cdot \frac{1}{2} \\
&= \frac{1}{4}.
\end{align*}
The total variance splits into residual randomness after the coin type is known, $\frac{3}{16}$, and randomness coming from the unknown coin type itself, $\frac{1}{16}$.
[/example]
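The decomposition in this example can be reproduced mechanically from the joint distribution of $(C,X)$. The following sketch builds that joint distribution, conditions on each value of $C$, and compares the within-coin and between-coin contributions with the unconditional variance. Variable names such as `within` and `between` are illustrative.

```python
from fractions import Fraction

half = Fraction(1, 2)
p_heads = {0: Fraction(1, 4), 1: Fraction(3, 4)}  # P(X = 1 | C = c)

# Joint distribution of (C, X): P(C = c) * P(X = x | C = c).
joint = {}
for c, q in p_heads.items():
    joint[(c, 1)] = half * q
    joint[(c, 0)] = half * (1 - q)

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

# Marginal distributions of C and of X.
p_c, p_x = {}, {}
for (c, x), p in joint.items():
    p_c[c] = p_c.get(c, 0) + p
    p_x[x] = p_x.get(x, 0) + p

# Conditional mean and conditional variance of X on each event {C = c}.
cond_mean, cond_var = {}, {}
for c, pc in p_c.items():
    cond_pmf = {x: p / pc for (cc, x), p in joint.items() if cc == c}
    cond_mean[c] = mean(cond_pmf)
    cond_var[c] = variance(cond_pmf)

within = sum(cond_var[c] * p_c[c] for c in p_c)  # E[Var(X | G)]
grand_mean = sum(cond_mean[c] * p_c[c] for c in p_c)
between = sum((cond_mean[c] - grand_mean) ** 2 * p_c[c] for c in p_c)  # Var(E[X | G])

print(within, between, within + between, variance(p_x))
```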
## Finite and Infinite Variance
Variance is not available for every random variable. Heavy-tailed distributions can have finite means but infinite second moments, and some have no mean at all. This boundary matters because many variance-based arguments, including Chebyshev's inequality and the $L^2$ weak law above, cannot be applied without a second moment.
The cleanest way to state the required hypothesis is through square integrability. As introduced at the start of the page, this is the condition $X\in L^2$, and it is exactly what makes the geometric interpretation of variance valid.
[definition: Square-Integrable Random Variable]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A real-valued random variable $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$ is square-integrable if
\begin{align*}
\mathbb E[X^2] < \infty.
\end{align*}
[/definition]
The definition gives a second-moment condition, but variance also uses the mean. The next question is why a square-integrable random variable is automatically integrable, so that $\mathbb E[X]$ exists before we subtract it. The [Cauchy-Schwarz inequality](/theorems/432) supplies exactly that bridge.
[quotetheorem:5020]
The reverse implication fails. A random variable may have a finite mean while its variance is infinite, so mean-based statements can remain meaningful after variance-based methods break down.
[example: Finite Mean and Infinite Variance]
Let $X$ be a non-negative real-valued random variable with density
\begin{align*}
f_X(x) = \frac{2}{x^3}\mathbb 1_{[1,\infty)}(x).
\end{align*}
First check that this is normalized:
\begin{align*}
\int_{\mathbb R} f_X(x)\,d\mathcal L^1(x)
&= \int_1^\infty \frac{2}{x^3}\,d\mathcal L^1(x) \\
&= \int_1^\infty 2x^{-3}\,d\mathcal L^1(x) \\
&= \left[-x^{-2}\right]_{1}^{\infty} \\
&= \lim_{b\to\infty}\left(-b^{-2}\right)-(-1) \\
&= 0+1 \\
&= 1.
\end{align*}
The first moment is finite:
\begin{align*}
\mathbb E[X]
&= \int_{\mathbb R} x f_X(x)\,d\mathcal L^1(x) \\
&= \int_1^\infty x\frac{2}{x^3}\,d\mathcal L^1(x) \\
&= \int_1^\infty \frac{2}{x^2}\,d\mathcal L^1(x) \\
&= \int_1^\infty 2x^{-2}\,d\mathcal L^1(x) \\
&= \left[-2x^{-1}\right]_{1}^{\infty} \\
&= \lim_{b\to\infty}\left(-\frac{2}{b}\right)-(-2) \\
&= 0+2 \\
&= 2.
\end{align*}
The second moment diverges:
\begin{align*}
\mathbb E[X^2]
&= \int_{\mathbb R} x^2 f_X(x)\,d\mathcal L^1(x) \\
&= \int_1^\infty x^2\frac{2}{x^3}\,d\mathcal L^1(x) \\
&= \int_1^\infty \frac{2}{x}\,d\mathcal L^1(x) \\
&= \int_1^\infty 2x^{-1}\,d\mathcal L^1(x) \\
&= \lim_{b\to\infty}\int_1^b \frac{2}{x}\,d\mathcal L^1(x) \\
&= \lim_{b\to\infty} \left[2\log x\right]_{1}^{b} \\
&= \lim_{b\to\infty} \left(2\log b-2\log 1\right) \\
&= \lim_{b\to\infty} 2\log b \\
&= \infty.
\end{align*}
Thus $X$ has a finite expectation, but it is not square-integrable and so has no finite variance. Chebyshev's inequality therefore gives no usable bound for this $X$: its right-hand side $\operatorname{Var}(X)/a^2$ is infinite for every $a>0$.
[/example]
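The contrast between the two moments is visible numerically when the integrals are truncated at a finite upper limit $b$. The sketch below approximates $\int_1^b x^k f_X(x)\,dx$ for $k=1,2$ by a midpoint rule on a logarithmic grid; the antiderivatives in the example give $2-2/b$ and $2\log b$ for these truncated integrals. The function names, the step count, and the choice of quadrature are assumptions of this illustration.

```python
from math import exp, log

def density(x):
    """f_X(x) = 2 / x^3 on [1, infinity), zero elsewhere."""
    return 2.0 / x**3 if x >= 1 else 0.0

def truncated_moment(k, b, steps=20000):
    """Midpoint-rule approximation of the integral of x^k f_X(x) over [1, b].

    The substitution x = exp(t) spreads the nodes evenly on a log scale.
    """
    h = log(b) / steps
    total = 0.0
    for i in range(steps):
        x = exp((i + 0.5) * h)
        total += x**k * density(x) * x * h  # dx = x dt
    return total

for b in (10.0, 1e3, 1e6):
    # First moment settles near 2; second moment keeps growing like 2 log b.
    print(b, truncated_moment(1, b), truncated_moment(2, b))
```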
There is a further warning. Infinite variance is not a technical inconvenience; it changes limiting behaviour. Heavy-tailed sums may require different scaling and may converge to non-Gaussian stable laws rather than to a normal distribution.
[remark: Variance as an $L^2$ Assumption]
Many classical probability results have versions without finite variance, but their proofs and conclusions change. Variance belongs to the $L^2$ theory: it is strongest when random variables can be treated as vectors with finite squared length.
[/remark]
## Beyond and Connected Topics
Variance is the entry point to covariance matrices for random vectors. If $X: (\Omega, \mathcal F) \to (\mathbb R^n, \mathcal B(\mathbb R^n))$ is square-integrable, the covariance matrix records all pairwise covariances between components. This matrix controls linear projections, principal component analysis, multivariate Gaussian distributions, and the quadratic forms that appear in concentration estimates.
The next probabilistic step is concentration theory. Chebyshev's inequality uses only variance, while sharper inequalities such as Chernoff, Hoeffding, Bernstein, and Azuma inequalities use boundedness, exponential moments, or martingale structure. The common theme is the same: transform information about moments or increments into tail probabilities.
In stochastic processes, variance becomes a time-dependent diagnostic. For [Brownian motion](/page/Brownian%20Motion) $W_t$, the identity $\operatorname{Var}(W_t)=t$ is part of the scaling structure that leads to quadratic variation and the Itô integral. This connects variance to martingales and to [Cambridge III Stochastic Calculus and Applications](/page/Cambridge%20III%20Stochastic%20Calculus%20and%20Applications).
In measure-theoretic probability, conditional variance is best understood through [conditional expectation](/page/Conditional%20Expectation) as an $L^2$ projection. This links the page to Hilbert space methods, martingale convergence, and [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability). At the introductory level, variance supports the law of large numbers and elementary distribution computations developed in [Cambridge IA Probability](/page/Cambridge%20IA%20Probability) and [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
## References
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
Androma, [Cambridge III Stochastic Calculus and Applications](/page/Cambridge%20III%20Stochastic%20Calculus%20and%20Applications).
Billingsley, *Probability and Measure* (1995).
Durrett, *Probability: Theory and Examples* (2019).
Grimmett and Stirzaker, *Probability and Random Processes* (2020).