Two coins can both be fair and still tell completely different stories. In one model the second toss is a fresh experiment; in another model it is a copy of the first toss; in a third model it is forced to be the opposite result. The marginal probability of heads is $1/2$ in all three models, so marginal probabilities do not tell us whether observations carry new information.
Independence is the condition that rules out hidden information flow. It is the reason products of probabilities appear in repeated trials, the reason expectations factor, and the reason long averages stabilise. Without it, the same one-dimensional distributions can produce very different joint behaviour.
[example: Two Fair Coins with Different Dependence]
Let $(\Omega,\mathcal F,\mathbb P)$ be the uniform probability space on $\{HH,HT,TH,TT\}$. Let $X$ be the first coordinate and $Y$ the second coordinate. Since each of the four outcomes has probability $1/4$,
\begin{align*}
\{X=H\}&=\{HH,HT\},&
\{Y=H\}&=\{HH,TH\},&
\{X=H,Y=H\}&=\{HH\}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(X=H)&=\mathbb P(\{HH,HT\})
=\mathbb P(\{HH\})+\mathbb P(\{HT\})
=\frac14+\frac14
=\frac12,\\
\mathbb P(Y=H)&=\mathbb P(\{HH,TH\})
=\mathbb P(\{HH\})+\mathbb P(\{TH\})
=\frac14+\frac14
=\frac12,\\
\mathbb P(X=H,Y=H)&=\mathbb P(\{HH\})
=\frac14.
\end{align*}
Thus
\begin{align*}
\mathbb P(X=H)\mathbb P(Y=H)
=\frac12\cdot\frac12
=\frac14
=\mathbb P(X=H,Y=H),
\end{align*}
so the heads event for the first coordinate factors from the heads event for the second coordinate in this product model.
Now let $(\Omega',\mathcal F',\mathbb P')$ be uniform on $\{H,T\}$, and define $X'(\omega)=\omega$ and $Y'(\omega)=\omega$. Then
\begin{align*}
\{X'=H\}&=\{H\},&
\{Y'=H\}&=\{H\},&
\{X'=H,Y'=H\}&=\{H\}\cap\{H\}=\{H\}.
\end{align*}
Since $\mathbb P'(\{H\})=1/2$,
\begin{align*}
\mathbb P'(X'=H)&=\frac12,&
\mathbb P'(Y'=H)&=\frac12,&
\mathbb P'(X'=H,Y'=H)&=\frac12.
\end{align*}
But
\begin{align*}
\mathbb P'(X'=H)\mathbb P'(Y'=H)
=\frac12\cdot\frac12
=\frac14
\ne
\frac12
=\mathbb P'(X'=H,Y'=H).
\end{align*}
Both $X'$ and $Y'$ are fair, but observing $X'=H$ forces $Y'=H$; the first model describes two fresh tosses, while the second describes the same toss recorded twice.
[/example]
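The comparison can also be checked mechanically. The sketch below is a minimal illustration, assuming each model is encoded as a dictionary from two-letter outcomes to probabilities; the names `models`, `prob`, and the three predicates are hypothetical helpers, and the `opposite` model is our encoding of the third model mentioned in the introduction. The copy model is written as a joint law on pairs rather than on $\{H,T\}$, which does not change any of the probabilities involved.

```python
from fractions import Fraction

# Three joint models for two fair coins, each given as {outcome: probability}.
# "fresh" and "copy" mirror the two models in the example; "opposite" is an
# illustrative encoding of the third model from the introduction.
half, quarter = Fraction(1, 2), Fraction(1, 4)
models = {
    "fresh": {"HH": quarter, "HT": quarter, "TH": quarter, "TT": quarter},
    "copy": {"HH": half, "TT": half},
    "opposite": {"HT": half, "TH": half},
}

def prob(model, event):
    """Total probability of the outcomes satisfying the predicate `event`."""
    return sum(p for outcome, p in model.items() if event(outcome))

def first_heads(outcome):
    return outcome[0] == "H"

def second_heads(outcome):
    return outcome[1] == "H"

def both_heads(outcome):
    return first_heads(outcome) and second_heads(outcome)

for name, model in models.items():
    joint = prob(model, both_heads)
    product = prob(model, first_heads) * prob(model, second_heads)
    print(name, joint, product, joint == product)
```

In all three models the product of the marginals is $1/4$, but only the fresh model has joint probability $1/4$; the copy model gives $1/2$ and the opposite model gives $0$.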
This example gives the central warning for the whole chapter: independence is a property of a joint model, not of separate distributions. We begin with events, move to random variables and sigma-algebras, then reformulate independence through product measures and expectation factorisation.
## Definition
The most concrete question is whether two events influence each other. If $B$ has positive probability, the phrase "knowing $B$ does not change the chance of $A$" means $\mathbb P(A\mid B)=\mathbb P(A)$. Multiplying by $\mathbb P(B)$ gives a formula that remains meaningful when $\mathbb P(B)=0$.
[definition: Independent Events]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Events $A,B\in\mathcal F$ are independent if
\begin{align*}
\mathbb P(A\cap B)=\mathbb P(A)\mathbb P(B).
\end{align*}
[/definition]
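Independence should not be confused with disjointness. Disjoint events cannot occur together, so learning that one of them occurred rules out the other. In fact, two disjoint events are independent only when at least one of them has probability zero, since otherwise $\mathbb P(A)\mathbb P(B)>0=\mathbb P(A\cap B)$. The following computation shows the failure directly.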
[example: Disjoint Events Are Usually Dependent]
Let $\Omega=\{1,2,3,4,5,6\}$ with the uniform probability measure, so each singleton has probability $1/6$. Let
\begin{align*}
A=\{1,2,3\}, \qquad B=\{4,5,6\}.
\end{align*}
The two sets have no common outcome, hence
\begin{align*}
A\cap B
&=\{1,2,3\}\cap\{4,5,6\}
=\varnothing,
\end{align*}
and therefore
\begin{align*}
\mathbb P(A\cap B)
=\mathbb P(\varnothing)
=0.
\end{align*}
On the other hand,
\begin{align*}
\mathbb P(A)
&=\mathbb P(\{1,2,3\})
=\mathbb P(\{1\})+\mathbb P(\{2\})+\mathbb P(\{3\}) \\
&=\frac16+\frac16+\frac16
=\frac36
=\frac12,
\end{align*}
and similarly
\begin{align*}
\mathbb P(B)
&=\mathbb P(\{4,5,6\})
=\mathbb P(\{4\})+\mathbb P(\{5\})+\mathbb P(\{6\}) \\
&=\frac16+\frac16+\frac16
=\frac36
=\frac12.
\end{align*}
Thus
\begin{align*}
\mathbb P(A)\mathbb P(B)
=\frac12\cdot\frac12
=\frac14
\ne 0
=\mathbb P(A\cap B).
\end{align*}
So $A$ and $B$ are not independent: observing that $B$ occurred rules out every outcome in $A$.
[/example]
## Families and Random Variables
### Event Families
Many probabilistic models involve more than two events, and checking pairs is not enough. A hidden relation may become visible only when several events are considered together. The next definition imposes the finite intersection factorisation needed for a whole family to behave like separate sources of randomness.
[definition: Mutual Independence of Events]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $(A_i)_{i\in I}$ be a family of events in $\mathcal F$. The family $(A_i)_{i\in I}$ is mutually independent if, for every finite subset $J\subset I$,
\begin{align*}
\mathbb P\left(\bigcap_{j\in J}A_j\right)=\prod_{j\in J}\mathbb P(A_j).
\end{align*}
[/definition]
The finite-subset condition is the real content of mutual independence: factorisation is required for every finite subfamily, not only for pairs. The next example is the standard small model where pairwise independence survives but mutual independence fails.
[example: Pairwise Independence Without Mutual Independence]
Let $\Omega=\{00,01,10,11\}$ with the uniform probability measure, so each singleton has probability $1/4$. Define
\begin{align*}
A&=\{00,01\},&
B&=\{00,10\},&
C&=\{00,11\}.
\end{align*}
Then
\begin{align*}
\mathbb P(A)
&=\mathbb P(\{00,01\})
=\mathbb P(\{00\})+\mathbb P(\{01\})
=\frac14+\frac14
=\frac12,\\
\mathbb P(B)
&=\mathbb P(\{00,10\})
=\mathbb P(\{00\})+\mathbb P(\{10\})
=\frac14+\frac14
=\frac12,\\
\mathbb P(C)
&=\mathbb P(\{00,11\})
=\mathbb P(\{00\})+\mathbb P(\{11\})
=\frac14+\frac14
=\frac12.
\end{align*}
The pairwise intersections are
\begin{align*}
A\cap B
&=\{00,01\}\cap\{00,10\}
=\{00\},\\
A\cap C
&=\{00,01\}\cap\{00,11\}
=\{00\},\\
B\cap C
&=\{00,10\}\cap\{00,11\}
=\{00\}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(A\cap B)
&=\mathbb P(\{00\})
=\frac14
=\frac12\cdot\frac12
=\mathbb P(A)\mathbb P(B),\\
\mathbb P(A\cap C)
&=\mathbb P(\{00\})
=\frac14
=\frac12\cdot\frac12
=\mathbb P(A)\mathbb P(C),\\
\mathbb P(B\cap C)
&=\mathbb P(\{00\})
=\frac14
=\frac12\cdot\frac12
=\mathbb P(B)\mathbb P(C).
\end{align*}
Thus every pair among $A,B,C$ is independent.
However, the triple intersection is
\begin{align*}
A\cap B\cap C
&=\{00,01\}\cap\{00,10\}\cap\{00,11\}
=\{00\},
\end{align*}
so
\begin{align*}
\mathbb P(A\cap B\cap C)
&=\mathbb P(\{00\})
=\frac14.
\end{align*}
On the other hand,
\begin{align*}
\mathbb P(A)\mathbb P(B)\mathbb P(C)
=\frac12\cdot\frac12\cdot\frac12
=\frac18.
\end{align*}
Since
\begin{align*}
\mathbb P(A\cap B\cap C)
=\frac14
\ne
\frac18
=\mathbb P(A)\mathbb P(B)\mathbb P(C),
\end{align*}
the three events are pairwise independent but not mutually independent.
[/example]
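The same verification can be run over every subfamily at once. The following is a small sketch, assuming the events are stored as Python sets; the helper names `prob` and `factorises` are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

omega = ["00", "01", "10", "11"]           # uniform sample space
events = {"A": {"00", "01"}, "B": {"00", "10"}, "C": {"00", "11"}}

def prob(event):
    """Uniform probability of a subset of omega."""
    return Fraction(len(event), len(omega))

def factorises(names):
    """Check P(intersection) == product of P over the named events."""
    intersection = set(omega).intersection(*(events[n] for n in names))
    return prob(intersection) == prod(prob(events[n]) for n in names)

pairwise = all(factorises(pair) for pair in combinations(events, 2))
triple = factorises(("A", "B", "C"))
print(pairwise, triple)
```

The pairwise check succeeds and the triple check fails, matching the computation above.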
### Random Variables
Events are yes-or-no observations, but random variables carry many observations at once. To ask whether random variables are independent, we must ask whether all events determined by one variable are independent of all events determined by the other. This motivates the sigma-algebra generated by a random variable.
[definition: Sigma-Algebra Generated by a Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $(E,\mathcal E)$ be a measurable space, and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a random variable. The sigma-algebra generated by $X$ is
\begin{align*}
\sigma(X)=\{X^{-1}(A):A\in\mathcal E\}.
\end{align*}
[/definition]
The generated sigma-algebra contains exactly the events whose truth can be decided after observing $X$. The next problem is to turn that information content into an independence condition for variables rather than individual events. The definition below requires the events generated by different variables to factor, for every finite selection of variables and every choice of one generated event per selected variable.
[definition: Independent Random Variables]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $X_i:(\Omega,\mathcal F)\to(E_i,\mathcal E_i)$ be random variables indexed by $i\in I$. The family $(X_i)_{i\in I}$ is independent if the family of sigma-algebras $(\sigma(X_i))_{i\in I}$ is mutually independent.
[/definition]
The definition is abstract because it quantifies over many events. In practice, we often test independence through joint probabilities of rectangles. The following criterion turns the sigma-algebra definition into a usable calculation.
[quotetheorem:4882]
This theorem explains why independence is often checked using joint distribution functions, probability mass functions, or densities. It also shows why independence is stronger than matching marginal distributions.
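For variables with finitely many values, it is enough to compare the joint probability mass function with the product of the marginals at each pair of values. The sketch below does this for the two coin models from the opening example; the dictionary encoding and the helpers `marginals` and `is_product` are assumptions made for illustration.

```python
from fractions import Fraction
from collections import defaultdict

def marginals(joint):
    """Return the two marginal pmfs of a joint pmf {(x, y): probability}."""
    px, py = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)

def is_product(joint):
    """True when joint(x, y) == px(x) * py(y) for every pair of values."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)

quarter, half = Fraction(1, 4), Fraction(1, 2)
fresh = {(x, y): quarter for x in "HT" for y in "HT"}   # two fresh tosses
copy = {("H", "H"): half, ("T", "T"): half}             # one toss recorded twice

print(marginals(fresh) == marginals(copy))   # the marginals agree
print(is_product(fresh), is_product(copy))   # only the first joint pmf factorises
```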
## Conditional Probability and Factorisation
### Conditional Interpretation
The product formula is the safest definition, but conditional probability gives the most direct interpretation. Conditioning renormalises the probability space after some event has occurred. Independence is the special case where this renormalisation leaves the other event's probability unchanged.
[definition: Conditional Probability]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $B\in\mathcal F$ satisfy $\mathbb P(B)>0$. The conditional probability given $B$ is the function
\begin{align*}
\mathbb P(\cdot\mid B):\mathcal F&\to[0,1] \\
A&\mapsto \frac{\mathbb P(A\cap B)}{\mathbb P(B)}.
\end{align*}
[/definition]
We now need to connect the conditional language with the multiplication language. The two formulations agree whenever the conditioning event has positive probability. This equivalence is what justifies reading independence as absence of probabilistic update.
[quotetheorem:4859]
A single conditional equality is not enough to prove independence of random variables. Random variables generate many events, and dependence can hide in an event not yet tested. The next example shows a deterministic dependence that a coarse check might miss.
[example: Deterministic Dependence]
Let $X$ be uniformly distributed on $\{-1,0,1\}$, so
\begin{align*}
\mathbb P(X=-1)=\mathbb P(X=0)=\mathbb P(X=1)=\frac13.
\end{align*}
Set $Y=X^2$. Then
\begin{align*}
\{Y=0\}
&=\{X^2=0\}
=\{X=0\},
\end{align*}
and therefore
\begin{align*}
\mathbb P(Y=0)
=\mathbb P(X=0)
=\frac13.
\end{align*}
Also,
\begin{align*}
\{X=0,Y=0\}
&=\{X=0\}\cap\{Y=0\} \\
&=\{X=0\}\cap\{X=0\}
=\{X=0\},
\end{align*}
so
\begin{align*}
\mathbb P(X=0,Y=0)
=\mathbb P(X=0)
=\frac13.
\end{align*}
If $X$ and $Y$ were independent, the events $\{X=0\}$ and $\{Y=0\}$ would factor:
\begin{align*}
\mathbb P(X=0,Y=0)
&=\mathbb P(X=0)\mathbb P(Y=0).
\end{align*}
But here
\begin{align*}
\mathbb P(X=0)\mathbb P(Y=0)
=\frac13\cdot\frac13
=\frac19
\ne
\frac13
=\mathbb P(X=0,Y=0).
\end{align*}
Equivalently, since $\mathbb P(X=0)>0$,
\begin{align*}
\mathbb P(Y=0\mid X=0)
&=\frac{\mathbb P(Y=0,X=0)}{\mathbb P(X=0)}
=\frac{1/3}{1/3}
=1
\ne
\frac13
=\mathbb P(Y=0).
\end{align*}
Thus $X$ and $Y$ are not independent: $Y$ has nontrivial marginal behaviour, but once $X$ is known, $Y=X^2$ is completely determined.
[/example]
### Complement Patterns
Repeated trials require probabilities of complete success-failure patterns, not only probabilities of all-success intersections. If independent events are replaced by their complements, the independence structure should remain available. This stability lets us compute binomial probabilities from independent Bernoulli trials.
[quotetheorem:4943]
The theorem gives the exact tool needed for counting successes and failures. Each prescribed pattern factors into a product of success probabilities and failure probabilities. Summing over patterns produces the binomial distribution.
[example: Independent Bernoulli Trials]
Let $X_1,\dots,X_n$ be independent random variables with $X_i\sim\operatorname{Ber}(p)$, where $p\in[0,1]$, and put $S_n=\sum_{i=1}^nX_i$. Thus each $X_i$ takes values in $\{0,1\}$ and
\begin{align*}
\mathbb P(X_i=1)=p,\qquad \mathbb P(X_i=0)=1-p.
\end{align*}
Fix $J\subset\{1,\dots,n\}$ with $|J|=k$, and define the event
\begin{align*}
E_J
=\{X_i=1\text{ for }i\in J,\ X_i=0\text{ for }i\notin J\}.
\end{align*}
Because the random variables $X_1,\dots,X_n$ are independent, the coordinate events in this prescribed success-failure pattern factor, so
\begin{align*}
\mathbb P(E_J)
&=\mathbb P\left(\bigcap_{i\in J}\{X_i=1\}\cap\bigcap_{i\notin J}\{X_i=0\}\right)\\
&=\prod_{i\in J}\mathbb P(X_i=1)\prod_{i\notin J}\mathbb P(X_i=0)\\
&=\prod_{i\in J}p\prod_{i\notin J}(1-p)\\
&=p^{|J|}(1-p)^{n-|J|}\\
&=p^k(1-p)^{n-k}.
\end{align*}
Now $S_n=k$ occurs exactly when the set of indices where $X_i=1$ has size $k$. Hence
\begin{align*}
\{S_n=k\}
=\bigcup_{\substack{J\subset\{1,\dots,n\}\\ |J|=k}}E_J.
\end{align*}
The events $E_J$ in this union are pairwise disjoint: if $J\ne K$, choose an index $r$ belonging to exactly one of $J$ and $K$; then $E_J$ requires one of $X_r=1$ or $X_r=0$, while $E_K$ requires the other. Therefore additivity over disjoint events gives
\begin{align*}
\mathbb P(S_n=k)
&=\sum_{\substack{J\subset\{1,\dots,n\}\\ |J|=k}}\mathbb P(E_J)\\
&=\sum_{\substack{J\subset\{1,\dots,n\}\\ |J|=k}}p^k(1-p)^{n-k}\\
&=\binom nk p^k(1-p)^{n-k},
\end{align*}
since there are $\binom nk$ subsets of $\{1,\dots,n\}$ of size $k$. Thus the sum of $n$ independent $\operatorname{Ber}(p)$ trials has the binomial distribution $\operatorname{Bin}(n,p)$.
[/example]
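The counting step can be confirmed by brute force for small $n$. The sketch below enumerates every success-failure pattern, weights it by the product formula, and compares the total with the binomial expression; the values $n=5$ and $p=1/3$ are arbitrary illustrative choices, and `pattern_sum` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product
from math import comb

def pattern_sum(n, k, p):
    """Sum P(E_J) over all 0/1 patterns of length n with exactly k ones."""
    total = Fraction(0)
    for pattern in product((0, 1), repeat=n):
        if sum(pattern) == k:
            # Independence: the pattern probability is a product over coordinates.
            weight = Fraction(1)
            for bit in pattern:
                weight *= p if bit == 1 else 1 - p
            total += weight
    return total

n, p = 5, Fraction(1, 3)   # illustrative values, not from the text
for k in range(n + 1):
    assert pattern_sum(n, k, p) == comb(n, k) * p**k * (1 - p) ** (n - k)
print(sum(pattern_sum(n, k, p) for k in range(n + 1)))
```

The final line prints the total mass over $k=0,\dots,n$, which is $1$.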
## Sigma-Algebras and Information
### Information Sources
The sigma-algebra viewpoint treats independence as a relation between information sources. This matters for stochastic processes, where the information available before a time is usually a sigma-algebra rather than a single random variable. We first name the information generated by a single event.
[definition: Sigma-Algebra Generated by an Event]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $A\in\mathcal F$. The sigma-algebra generated by $A$ is
\begin{align*}
\sigma(A)=\{\varnothing,A,A^c,\Omega\}.
\end{align*}
[/definition]
A single event gives the smallest nontrivial example of generated information. In applications, an information source may contain infinitely many events, so checking only one named event can miss dependencies among other events in the same source. To compare information sources themselves, every event measurable from one source must factor from every finite choice of events measurable from the others.
[definition: Independent Sigma-Algebras]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $(\mathcal G_i)_{i\in I}$ be sub-sigma-algebras of $\mathcal F$. The family $(\mathcal G_i)_{i\in I}$ is mutually independent if, for every finite subset $J\subset I$ and every choice of events $A_j\in\mathcal G_j$ for $j\in J$,
\begin{align*}
\mathbb P\left(\bigcap_{j\in J}A_j\right)=\prod_{j\in J}\mathbb P(A_j).
\end{align*}
[/definition]
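For two sigma-algebras generated by single events, the definition reduces to sixteen product checks. The following sketch runs them on a fair die; the events $A$ (even outcome) and $B=\{1,2\}$ are illustrative choices, picked so that $\mathbb P(A\cap B)=\mathbb P(A)\mathbb P(B)$, and the helpers `prob` and `sigma` are hypothetical.

```python
from fractions import Fraction
from itertools import product

omega = frozenset(range(1, 7))             # fair die, uniform measure
A = frozenset({2, 4, 6})                   # illustrative events, chosen so that
B = frozenset({1, 2})                      # P(A & B) = 1/6 = P(A) * P(B)

def prob(event):
    return Fraction(len(event), len(omega))

def sigma(event):
    """The four-element sigma-algebra generated by a single event."""
    return [frozenset(), event, omega - event, omega]

all_factor = all(
    prob(E & F) == prob(E) * prob(F)
    for E, F in product(sigma(A), sigma(B))
)
print(all_factor)
```

In this model the factorisation of the single pair $(A,B)$ extends to every pair drawn from $\sigma(A)$ and $\sigma(B)$, including the pairs involving complements.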
Once independence is phrased through sigma-algebras, a new permanence problem appears. If we process each independent source separately, we should not create information flow between sources. The theorem below proves this stability under [measurable functions](/page/Measurable%20Functions).
[quotetheorem:1116]
This result is used whenever independent data are transformed separately. Squaring, thresholding, rounding, and applying indicators all preserve independence across distinct inputs. The following example gives the typical calculation-free use.
[example: Squaring and Thresholding]
Let $X$ and $Y$ be independent real-valued random variables. Define
\begin{align*}
g:\mathbb R&\to\mathbb R,&
g(x)&=x^2,\\
h:\mathbb R&\to\mathbb R,&
h(y)&=\mathbb 1_{(0,\infty)}(y).
\end{align*}
The map $g$ is continuous, hence Borel measurable, and $h$ is Borel measurable because $(0,\infty)$ is a Borel set and
\begin{align*}
h^{-1}(\{1\})&=(0,\infty),&
h^{-1}(\{0\})&=(-\infty,0].
\end{align*}
Therefore
\begin{align*}
X^2=g(X), \qquad \mathbb 1_{\{Y>0\}}=h(Y).
\end{align*}
Since these are measurable functions applied separately to the independent variables $X$ and $Y$, *Functions of Independent Random Variables* implies that $X^2$ and $\mathbb 1_{\{Y>0\}}$ are independent. Thus independence is attached to the separate information sources generated by $X$ and $Y$, and it is preserved when each source is processed on its own.
[/example]
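Although the example needs no calculation, a finite model makes the conclusion concrete. The sketch below assumes a hypothetical model in which $X$ is uniform on $\{-2,\dots,2\}$, $Y$ is uniform on $\{-1,0,1\}$, and the two are independent; it then checks that the joint law of $(X^2,\mathbb 1_{\{Y>0\}})$ factorises.

```python
from fractions import Fraction
from collections import Counter
from itertools import product

# Hypothetical finite model: X and Y independent, each uniform on its own range.
xs, ys = [-2, -1, 0, 1, 2], [-1, 0, 1]
joint = {(x, y): Fraction(1, len(xs) * len(ys)) for x, y in product(xs, ys)}

def g(x):
    return x * x                 # squaring

def h(y):
    return 1 if y > 0 else 0     # indicator of (0, infinity)

# Push the joint law forward through (g, h) and record both marginals.
joint_gh, law_g, law_h = Counter(), Counter(), Counter()
for (x, y), p in joint.items():
    joint_gh[(g(x), h(y))] += p
    law_g[g(x)] += p
    law_h[h(y)] += p

independent = all(
    joint_gh[(a, b)] == law_g[a] * law_h[b] for a in law_g for b in law_h
)
print(independent)
```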
### Blocks of Variables
Many arguments group independent variables into blocks. A random walk may split into an initial segment and a future segment; a sample may be divided into batches. We need the fact that disjoint blocks of an independent family remain independent as information sources.
[quotetheorem:4941]
Block independence is the formal reason that separated portions of an independent sequence can be treated as new independent objects. It is a basic tool in limit theorems and stopping-time arguments.
## Joint Laws and Product Measures
### Product Laws
Independence can be stated without mentioning the original sample space. The law of a random variable records all probabilities visible from its target space. To compare joint behaviour with separate behaviour, we first define law and joint law.
[definition: Law of a Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $(E,\mathcal E)$ be a measurable space, and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a random variable. The law of $X$ is the probability measure $\mu_X$ on $(E,\mathcal E)$ defined by
\begin{align*}
\mu_X(A)=\mathbb P(X\in A), \qquad A\in\mathcal E.
\end{align*}
[/definition]
The law of one variable forgets how it sits with other variables. Dependence is a statement about simultaneous outcomes, so we need the law of the vector of variables. The target measurable structure is the product sigma-algebra $\mathcal E_1\otimes\cdots\otimes\mathcal E_n$, meaning the sigma-algebra on $E_1\times\cdots\times E_n$ generated by measurable rectangles $A_1\times\cdots\times A_n$ with $A_i\in\mathcal E_i$. The law of the vector with respect to this sigma-algebra is the joint law.
[definition: Joint Law]
Let $X_i:(\Omega,\mathcal F)\to(E_i,\mathcal E_i)$ be random variables for $i\in\{1,\dots,n\}$. The joint law of $(X_1,\dots,X_n)$ is the law of the map
\begin{align*}
(X_1,\dots,X_n):\Omega&\to E_1\times\cdots\times E_n\\
\omega&\mapsto (X_1(\omega),\dots,X_n(\omega))
\end{align*}
with respect to $\mathcal E_1\otimes\cdots\otimes\mathcal E_n$.
[/definition]
The joint law is the actual simultaneous distribution, but independence needs a benchmark for comparison. Marginal laws alone give probabilities on separate spaces and do not by themselves assign probabilities to simultaneous rectangles such as $A_1\times\cdots\times A_n$. The product measure is the canonical way to assemble those marginals into a joint distribution with no interaction between coordinates.
[definition: Product Measure]
Let $(E_i,\mathcal E_i,\mu_i)$ be probability spaces for $i\in\{1,\dots,n\}$. The product measure $\mu_1\otimes\cdots\otimes\mu_n$ is the unique probability measure on $(E_1\times\cdots\times E_n,\mathcal E_1\otimes\cdots\otimes\mathcal E_n)$ satisfying
\begin{align*}
(\mu_1\otimes\cdots\otimes\mu_n)(A_1\times\cdots\times A_n)=\prod_{i=1}^n\mu_i(A_i)
\end{align*}
for all $A_i\in\mathcal E_i$.
[/definition]
Now the conceptual statement becomes precise: independent coordinates have product law. This formulation compares the actual joint distribution with the distribution obtained by assembling the marginals without interaction. It is the cleanest way to recognise independence from distributions.
[quotetheorem:4861]
For densities, product law becomes a familiar factorisation formula. The joint density must equal the product of the marginal densities, up to null sets. This is often the easiest way to verify independence in continuous examples.
[example: Density Factorisation]
Here $\mathcal L^1$ denotes one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $\mathbb R$, and $\mathcal L^2$ denotes two-dimensional Lebesgue measure on $\mathbb R^2$. Let $(X,Y)$ have joint density $f_{X,Y}$ with respect to $\mathcal L^2$, and let $X$ and $Y$ have marginal densities $f_X$ and $f_Y$ with respect to $\mathcal L^1$. Assume first that
\begin{align*}
f_{X,Y}(x,y)=f_X(x)f_Y(y)
\end{align*}
for $\mathcal L^2$-a.e. $(x,y)\in\mathbb R^2$. For Borel sets $A,B\subset\mathbb R$, the joint law satisfies
\begin{align*}
\mathbb P(X\in A,Y\in B)
&=\int_{A\times B} f_{X,Y}(x,y)\,d\mathcal L^2(x,y)\\
&=\int_{A\times B} f_X(x)f_Y(y)\,d\mathcal L^2(x,y)\\
&=\int_A\int_B f_X(x)f_Y(y)\,d\mathcal L^1(y)\,d\mathcal L^1(x)\\
&=\int_A f_X(x)\left(\int_B f_Y(y)\,d\mathcal L^1(y)\right)d\mathcal L^1(x)\\
&=\left(\int_A f_X(x)\,d\mathcal L^1(x)\right)\left(\int_B f_Y(y)\,d\mathcal L^1(y)\right)\\
&=\mathbb P(X\in A)\mathbb P(Y\in B).
\end{align*}
Thus all Borel rectangles factor, so $X$ and $Y$ are independent by the *Rectangle Criterion for Independence*.
Conversely, suppose $X$ and $Y$ are independent and have marginal densities $f_X$ and $f_Y$. For Borel sets $A,B\subset\mathbb R$, independence gives
\begin{align*}
\mathbb P((X,Y)\in A\times B)
&=\mathbb P(X\in A,Y\in B)\\
&=\mathbb P(X\in A)\mathbb P(Y\in B)\\
&=\left(\int_A f_X(x)\,d\mathcal L^1(x)\right)
\left(\int_B f_Y(y)\,d\mathcal L^1(y)\right)\\
&=\int_A\int_B f_X(x)f_Y(y)\,d\mathcal L^1(y)\,d\mathcal L^1(x)\\
&=\int_{A\times B} f_X(x)f_Y(y)\,d\mathcal L^2(x,y).
\end{align*}
Since Borel rectangles form a $\pi$-system generating $\mathcal B(\mathbb R^2)$ and both sides define probability measures on $\mathbb R^2$, equality on rectangles determines equality on all Borel sets. Hence $f_X(x)f_Y(y)$ is a joint density for $(X,Y)$. Density factorisation is therefore exactly the density-level form of independence.
[/example]
### Construction by Products
Product measures do more than recognise independence; they construct it. If we prescribe marginal probability spaces and take their product, the coordinate maps are automatically independent. This gives the standard model for repeated sampling.
[quotetheorem:4942]
This theorem separates modelling into two steps: choose the one-step law, then use a product space when separate repetitions are intended. Infinite product versions support infinite independent sequences.
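On finite spaces the construction is a few lines long. The sketch below builds a product measure from two illustrative marginals `mu1` and `mu2` (hypothetical values) and checks the rectangle formula for every pair of subsets, which is what independence of the coordinate maps amounts to in this setting.

```python
from fractions import Fraction
from itertools import chain, combinations, product

# Illustrative marginal laws on two small spaces.
mu1 = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
mu2 = {0: Fraction(1, 4), 1: Fraction(3, 4)}

# Product measure on E1 x E2, defined through its point masses.
mu = {(x, y): mu1[x] * mu2[y] for x, y in product(mu1, mu2)}

def subsets(points):
    """All subsets of a finite collection, as tuples."""
    points = list(points)
    return chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )

def measure(pmf, event):
    """Measure of a finite event under a point-mass description."""
    return sum((pmf[point] for point in event), Fraction(0))

rectangles_factor = all(
    measure(mu, product(A1, A2)) == measure(mu1, A1) * measure(mu2, A2)
    for A1 in subsets(mu1)
    for A2 in subsets(mu2)
)
print(rectangles_factor)
```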
## Expectation and Limit Behaviour
### Moment Factorisation
Independence is powerful because it turns products of observations into products of expectations. This is the analytic form of product law. It is the main computational rule behind variance estimates, transforms, and limit theorems.
[quotetheorem:1120]
Factoring one or two moments is weaker than independence. A pair of variables may have zero covariance while one is still determined by the other. The following example is a useful warning against replacing independence by uncorrelatedness.
[example: Uncorrelated but Dependent]
Let $X$ be uniformly distributed on $\{-1,0,1\}$, so
\begin{align*}
\mathbb P(X=-1)=\mathbb P(X=0)=\mathbb P(X=1)=\frac13,
\end{align*}
and set $Y=X^2$. We first compute the relevant moments:
\begin{align*}
\mathbb E[X]
&=(-1)\mathbb P(X=-1)+0\mathbb P(X=0)+1\mathbb P(X=1)\\
&=(-1)\frac13+0\cdot\frac13+1\cdot\frac13\\
&=-\frac13+0+\frac13\\
&=0,
\end{align*}
and, since $Y=X^2$,
\begin{align*}
\mathbb E[Y]
&=\mathbb E[X^2]\\
&=(-1)^2\mathbb P(X=-1)+0^2\mathbb P(X=0)+1^2\mathbb P(X=1)\\
&=1\cdot\frac13+0\cdot\frac13+1\cdot\frac13\\
&=\frac23.
\end{align*}
Also $XY=X\cdot X^2=X^3$, so
\begin{align*}
\mathbb E[XY]
&=\mathbb E[X^3]\\
&=(-1)^3\mathbb P(X=-1)+0^3\mathbb P(X=0)+1^3\mathbb P(X=1)\\
&=(-1)\frac13+0\cdot\frac13+1\cdot\frac13\\
&=0.
\end{align*}
Therefore
\begin{align*}
\operatorname{Cov}(X,Y)
&=\mathbb E[XY]-\mathbb E[X]\mathbb E[Y]\\
&=0-0\cdot\frac23\\
&=0.
\end{align*}
The variables are nevertheless dependent. Indeed,
\begin{align*}
\{Y=0\}
&=\{X^2=0\}
=\{X=0\},
\end{align*}
so
\begin{align*}
\mathbb P(Y=0)
=\mathbb P(X=0)
=\frac13.
\end{align*}
Moreover,
\begin{align*}
\{X=0,Y=0\}
&=\{X=0\}\cap\{Y=0\}\\
&=\{X=0\}\cap\{X=0\}\\
&=\{X=0\},
\end{align*}
and hence
\begin{align*}
\mathbb P(X=0,Y=0)
=\frac13.
\end{align*}
But
\begin{align*}
\mathbb P(X=0)\mathbb P(Y=0)
=\frac13\cdot\frac13
=\frac19
\ne
\frac13
=\mathbb P(X=0,Y=0).
\end{align*}
Thus $X$ and $Y$ have zero covariance, but they are not independent; knowing $X$ determines $Y=X^2$ exactly.
[/example]
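The two halves of this example, zero covariance and failed factorisation, can be reproduced with exact arithmetic. The following is a minimal sketch in which `expect` and `y_of` are hypothetical helper names.

```python
from fractions import Fraction

third = Fraction(1, 3)
law_x = {-1: third, 0: third, 1: third}    # X uniform on {-1, 0, 1}

def expect(f):
    """E[f(X)] for the finite law above."""
    return sum(f(x) * p for x, p in law_x.items())

def y_of(x):
    return x * x                            # Y = X^2

# Covariance: E[XY] - E[X] E[Y].
cov = expect(lambda x: x * y_of(x)) - expect(lambda x: x) * expect(y_of)

# Event check: P(X = 0, Y = 0) against P(X = 0) P(Y = 0).
p_joint = sum(p for x, p in law_x.items() if x == 0 and y_of(x) == 0)
p_x0 = law_x[0]
p_y0 = sum(p for x, p in law_x.items() if y_of(x) == 0)
print(cov, p_joint, p_x0 * p_y0)
```

The covariance is $0$, while the joint probability $1/3$ differs from the product $1/9$.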
For sums, the main obstruction to additivity of variance is the collection of cross terms measuring how different summands move together. Independence removes those cross terms, so fluctuations from separate sources accumulate by addition rather than by hidden covariance. This gives the variance identity needed for concentration and averaging estimates.
[quotetheorem:1119]
This result is deliberately weaker than full mutual independence: because variance only sees second-order cross terms, pairwise independence is enough. That is useful in second-moment methods, where one often controls fluctuations without proving every finite subfamily independent. The limitation is just as important. As soon as the argument needs a product of three or more factors, or needs to identify a full distribution rather than a variance, pairwise independence no longer supplies the required factorisation. The next subsection moves from moments to transforms, where mutual independence again becomes the natural hypothesis.
### Transforms and Averages
To encode an entire distribution, not only a moment, probability uses characteristic functions. They always exist and convert independent sums into products. This is the transform version of expectation factorisation.
[definition: Characteristic Function]
Let $X$ be an $\mathbb R^n$-valued random vector on $(\Omega,\mathcal F,\mathbb P)$. The characteristic function of $X$ is the map
\begin{align*}
\phi_X:\mathbb R^n&\to\mathbb C \\
u&\mapsto \mathbb E[e^{iu\cdot X}].
\end{align*}
[/definition]
Once the transform is defined, we need its rule for sums. For independent variables, the exponential of a sum splits into a product of separate exponentials, and independence factors the expectation. This yields the multiplicative formula below.
[quotetheorem:4944]
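A numerical check of the multiplicative rule is straightforward for finite laws. The sketch below assumes two illustrative laws `law_x` and `law_y` (hypothetical values), forms the law of the sum under independence by multiplying weights, and compares the characteristic function of the sum with the product of the individual characteristic functions at a few test points.

```python
import cmath
from itertools import product

# Illustrative finite laws {value: probability}; not taken from the text.
law_x = {0: 0.5, 1: 0.5}
law_y = {-1: 0.25, 0: 0.5, 2: 0.25}

def char_fn(law, u):
    """phi(u) = E[exp(i u X)] for a finite law."""
    return sum(p * cmath.exp(1j * u * x) for x, p in law.items())

# Law of X + Y when X and Y are independent: weights multiply.
law_sum = {}
for (x, px), (y, py) in product(law_x.items(), law_y.items()):
    law_sum[x + y] = law_sum.get(x + y, 0.0) + px * py

for u in (-2.0, 0.3, 1.0, 5.0):
    lhs = char_fn(law_sum, u)
    rhs = char_fn(law_x, u) * char_fn(law_y, u)
    assert abs(lhs - rhs) < 1e-12   # equal up to floating-point error
```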
The most visible consequence of independence is the stabilisation of averages. Identical marginals alone do not suffice, because repeated copies of the same variable never average out their shared randomness. To state the repeated-sampling hypothesis used in laws of large numbers, we need the definition of an i.i.d. sequence.
[definition: Independent and Identically Distributed Sequence]
Let $(X_n)_{n\in\mathbb N}$ be random variables on $(\Omega,\mathcal F,\mathbb P)$ with values in $(E,\mathcal E)$. The sequence is independent and identically distributed, abbreviated i.i.d., with distribution $\mu$ if $(X_n)_{n\in\mathbb N}$ is independent and $\mu_{X_n}=\mu$ for every $n\in\mathbb N$.
[/definition]
The i.i.d. hypothesis is the standard mathematical form of repeated sampling under the same conditions. The remaining question is whether the random errors in the empirical average actually disperse instead of staying coordinated across time. With independence and a finite second moment, the variance of the average shrinks, forcing convergence in probability to the common mean.
[quotetheorem:1127]
The independence assumption cannot simply be dropped from this theorem. Identical distributions do not prevent perfect coordination across time. The next example shows the failure in the smallest possible way.
[example: Identical Marginals Without Independence]
Let $X$ satisfy
\begin{align*}
\mathbb P(X=0)=\mathbb P(X=1)=\frac12,
\end{align*}
and define $X_n=X$ for every $n\in\mathbb N$. For each $n$,
\begin{align*}
\mathbb P(X_n=0)&=\mathbb P(X=0)=\frac12,\\
\mathbb P(X_n=1)&=\mathbb P(X=1)=\frac12,
\end{align*}
so all the variables $X_n$ have the same marginal distribution.
They are not independent. For example,
\begin{align*}
\{X_1=1,X_2=1\}
&=\{X=1,X=1\}\\
&=\{X=1\},
\end{align*}
and hence
\begin{align*}
\mathbb P(X_1=1,X_2=1)
&=\mathbb P(X=1)
=\frac12.
\end{align*}
On the other hand,
\begin{align*}
\mathbb P(X_1=1)\mathbb P(X_2=1)
&=\mathbb P(X=1)\mathbb P(X=1)\\
&=\frac12\cdot\frac12\\
&=\frac14.
\end{align*}
Since $\frac12\ne\frac14$, the first two variables already fail the independence factorisation.
The sample averages never separate from the original variable. For every $n\ge 1$,
\begin{align*}
\frac1n\sum_{i=1}^nX_i
&=\frac1n\sum_{i=1}^nX\\
&=\frac1n(nX)\\
&=X.
\end{align*}
Also,
\begin{align*}
\mathbb E[X]
&=0\cdot\mathbb P(X=0)+1\cdot\mathbb P(X=1)\\
&=0\cdot\frac12+1\cdot\frac12\\
&=\frac12.
\end{align*}
Taking $\varepsilon=\frac14$, we get
\begin{align*}
\mathbb P\left(\left|\frac1n\sum_{i=1}^nX_i-\frac12\right|>\frac14\right)
&=\mathbb P\left(\left|X-\frac12\right|>\frac14\right)\\
&=\mathbb P(X=0)+\mathbb P(X=1)\\
&=\frac12+\frac12\\
&=1
\end{align*}
for every $n$. Thus identical marginal distributions alone do not make averages converge to the common mean; the repeated variables here carry exactly the same randomness every time.
[/example]
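For contrast, the same deviation probability can be computed exactly for independent fair tosses, using the binomial distribution of the sum. The sketch below is illustrative only; the function names are hypothetical, and the copy model's value is hard-coded from the calculation above.

```python
from fractions import Fraction
from math import comb

def deviation_prob_iid(n):
    """P(|S_n/n - 1/2| > 1/4) for S_n a sum of n independent fair 0/1 variables."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - Fraction(1, 2)) > Fraction(1, 4):
            total += Fraction(comb(n, k), 2**n)
    return total

def deviation_prob_copy(n):
    """Same probability when X_1 = ... = X_n = X, so the average equals X."""
    return Fraction(1)  # |X - 1/2| = 1/2 > 1/4 whatever the value of X

for n in (1, 10, 100):
    print(n, float(deviation_prob_iid(n)), deviation_prob_copy(n))
```

With independent tosses the probability falls from $1$ at $n=1$ to $7/64$ at $n=10$ and is far smaller at $n=100$, whereas in the copy model it stays at $1$ for every $n$.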
## Tail Events and Zero-One Laws
### Tail Information
Infinite independent sequences have events that ignore every finite initial segment. Convergence, boundedness, and occurrence infinitely often are examples of such eventual questions. The sigma-algebra collecting them is the tail sigma-algebra.
[definition: Tail Sigma-Algebra]
Let $(X_n)_{n\in\mathbb N}$ be random variables on $(\Omega,\mathcal F,\mathbb P)$. The tail sigma-algebra is
\begin{align*}
\mathcal T=\bigcap_{n=1}^\infty \sigma(X_n,X_{n+1},X_{n+2},\dots).
\end{align*}
[/definition]
Tail events are independent of every finite initial block in an independent sequence. The subtle point is that a tail event is also determined after discarding any finite initial block, so it cannot retain ordinary partial dependence on the sequence.
This raises the central question for tail information: can an event that is unchanged by removing any finite beginning still have an intermediate probability? For independent sequences the answer is no. The next result gives the needed rigidity principle, showing that tail events have only probability $0$ or probability $1$.
[quotetheorem:512]
A standard tail event is occurrence infinitely often. Independence turns divergent total probability into almost sure repeated occurrence. This is the content of the second Borel-Cantelli lemma.
[example: Infinitely Many Successes]
Let $(A_n)_{n\in\mathbb N}$ be independent events, and set $X_n=\mathbb 1_{A_n}$. The event that the events $A_n$ occur infinitely often is
\begin{align*}
\{A_n\text{ i.o.}\}
&=\{\omega:\omega\in A_m\text{ for infinitely many }m\}\\
&=\bigcap_{n=1}^\infty\bigcup_{m\ge n}A_m.
\end{align*}
For each fixed $n$, every event $A_m$ with $m\ge n$ is determined by $X_m$, since
\begin{align*}
A_m=\{X_m=1\}.
\end{align*}
Hence
\begin{align*}
\bigcup_{m\ge n}A_m
=\bigcup_{m\ge n}\{X_m=1\}
\in \sigma(X_n,X_{n+1},X_{n+2},\dots),
\end{align*}
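The same argument applies to each index $r\ge n$, giving $\bigcup_{m\ge r}A_m\in\sigma(X_r,X_{r+1},\dots)\subset\sigma(X_n,X_{n+1},\dots)$. Since these unions decrease as $r$ grows, dropping the first $n-1$ of them does not change the intersection, and therefore
\begin{align*}
\{A_n\text{ i.o.}\}
=\bigcap_{r=1}^\infty\bigcup_{m\ge r}A_m
=\bigcap_{r=n}^\infty\bigcup_{m\ge r}A_m
\in \sigma(X_n,X_{n+1},X_{n+2},\dots)
\end{align*}
for every $n$. Thus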
\begin{align*}
\{A_n\text{ i.o.}\}
\in \bigcap_{n=1}^\infty \sigma(X_n,X_{n+1},X_{n+2},\dots),
\end{align*}
so it is a tail event for the sequence $(\mathbb 1_{A_n})_{n\in\mathbb N}$.
If
\begin{align*}
\sum_{n=1}^\infty \mathbb P(A_n)=\infty,
\end{align*}
then the independence of the events $A_n$ is exactly the hypothesis needed in *Second Borel-Cantelli Lemma*, and that lemma gives
\begin{align*}
\mathbb P\left(\bigcap_{n=1}^\infty\bigcup_{m\ge n}A_m\right)=1.
\end{align*}
Equivalently,
\begin{align*}
\mathbb P(A_n\text{ i.o.})=1.
\end{align*}
So under independence and divergent total probability, the successes do not merely occur often on average; they occur infinitely many times almost surely.
[/example]
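One way to see where independence enters is through the probability that none of the events $A_n,\dots,A_N$ occurs, which under independence is the product $\prod_{m=n}^N(1-\mathbb P(A_m))$. The sketch below evaluates this product exactly for two illustrative choices of probabilities, $1/m$ (divergent sum) and $1/m^2$ (convergent sum); both choices and the helper `prob_none` are assumptions made for the illustration.

```python
from fractions import Fraction

def prob_none(p, n, N):
    """P(no A_m occurs for n <= m <= N) for independent events with P(A_m) = p(m)."""
    result = Fraction(1)
    for m in range(n, N + 1):
        result *= 1 - p(m)      # complements of independent events also factor
    return result

def harmonic(m):
    return Fraction(1, m)       # sum of 1/m diverges

def square(m):
    return Fraction(1, m * m)   # sum of 1/m^2 converges

for N in (10, 100, 1000):
    print(N, prob_none(harmonic, 2, N), prob_none(square, 2, N))
```

In the divergent case the product equals $1/N$ and tends to $0$, so some later event occurs almost surely from any starting index. In the convergent case it equals $(N+1)/(2N)$ and stays above $1/2$.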
### Common Misreadings
Independence is often inferred from conditions that are easier to see but weaker. Equal marginals, zero covariance, and pairwise independence each miss part of the joint structure. The example and remark below revisit the first two.
[example: Same Marginals and Perfect Dependence]
Let $U\sim\operatorname{Unif}(0,1)$ and set $V=1-U$. We first verify that $V$ has the same marginal distribution as $U$. Since $U$ is uniform on $(0,1)$, for $0\le a\le b\le 1$,
\begin{align*}
\mathbb P(a\le U\le b)=b-a.
\end{align*}
For $t<0$,
\begin{align*}
\mathbb P(V\le t)
=\mathbb P(1-U\le t)
=\mathbb P(U\ge 1-t)
=0,
\end{align*}
because $1-t>1$. For $0\le t\le 1$,
\begin{align*}
\mathbb P(V\le t)
&=\mathbb P(1-U\le t)\\
&=\mathbb P(U\ge 1-t)\\
&=\mathbb P(1-t\le U\le 1)\\
&=1-(1-t)\\
&=t.
\end{align*}
For $t>1$,
\begin{align*}
\mathbb P(V\le t)
=\mathbb P(1-U\le t)
=1,
\end{align*}
because $1-U\in(0,1)$. Thus $V$ has distribution function $0$ for $t<0$, $t$ for $0\le t\le 1$, and $1$ for $t>1$, so $V\sim\operatorname{Unif}(0,1)$.
The variables are nevertheless perfectly dependent. For every outcome,
\begin{align*}
U+V
&=U+(1-U)\\
&=1,
\end{align*}
so
\begin{align*}
\{U+V=1\}=\Omega
\end{align*}
and therefore
\begin{align*}
\mathbb P(U+V=1)=\mathbb P(\Omega)=1.
\end{align*}
To see the failure of independence by an event factorisation, take
\begin{align*}
A=\{U\le \tfrac12\},\qquad B=\{V\le \tfrac12\}.
\end{align*}
Then
\begin{align*}
B
&=\{1-U\le \tfrac12\}\\
&=\{U\ge \tfrac12\},
\end{align*}
and hence
\begin{align*}
A\cap B
&=\{U\le \tfrac12\}\cap\{U\ge \tfrac12\}\\
&=\{U=\tfrac12\}.
\end{align*}
Since the uniform distribution assigns probability $0$ to every single point,
\begin{align*}
\mathbb P(A\cap B)=0.
\end{align*}
But
\begin{align*}
\mathbb P(A)
&=\mathbb P(0\le U\le \tfrac12)
=\frac12-0
=\frac12,
\end{align*}
and
\begin{align*}
\mathbb P(B)
&=\mathbb P(V\le \tfrac12)
=\frac12.
\end{align*}
Therefore
\begin{align*}
\mathbb P(A)\mathbb P(B)
=\frac12\cdot\frac12
=\frac14
\ne 0
=\mathbb P(A\cap B).
\end{align*}
Thus equal marginal distributions do not imply independence: here the joint law is concentrated on the line $u+v=1$, while independent uniform variables would factor on rectangle events.
[/example]
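In this particular example the dependence is linear, so it is also visible in the covariance. The short computation below is a supplementary check; it uses the standard moments $\mathbb E[U]=1/2$ and $\mathbb E[U^2]=1/3$ of the uniform distribution on $(0,1)$.
[example: Covariance of the Reflected Uniform Pair]
Let $U\sim\operatorname{Unif}(0,1)$ and $V=1-U$. Then
\begin{align*}
\mathbb E[UV]
&=\mathbb E[U(1-U)]
=\mathbb E[U]-\mathbb E[U^2]
=\frac12-\frac13
=\frac16,\\
\mathbb E[U]\,\mathbb E[V]
&=\frac12\cdot\frac12
=\frac14,
\end{align*}
so
\begin{align*}
\operatorname{Cov}(U,V)
=\frac16-\frac14
=-\frac1{12}.
\end{align*}
Since $\operatorname{Var}(U)=\operatorname{Var}(V)=\frac13-\frac14=\frac1{12}$, the correlation coefficient equals $-1$, the most extreme value possible.
[/example]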
Zero covariance tests only one second-order statistic. It does not test nonlinear events, generated sigma-algebras, or product laws. This is why uncorrelated variables need not be independent.
[remark: Independence Versus Zero Covariance]
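For square-integrable real-valued random variables $X$ and $Y$ the covariance is always defined, and independence implies that it is zero. The converse fails because covariance only tests the expected product of the two centred variables, $\mathbb E[(X-\mathbb E X)(Y-\mathbb E Y)]$, and says nothing about nonlinear functions of them.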
[/remark]
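A standard counterexample makes the failure of the converse concrete. The variables below are chosen for illustration only and do not appear elsewhere in this section.
[example: Uncorrelated but Dependent]
Let $X$ be uniform on $\{-1,0,1\}$ and set $Y=X^2$. Then
\begin{align*}
\mathbb E[X]
&=\frac{-1+0+1}{3}
=0,&
\mathbb E[XY]
&=\mathbb E[X^3]
=\frac{-1+0+1}{3}
=0,
\end{align*}
so
\begin{align*}
\operatorname{Cov}(X,Y)
=\mathbb E[XY]-\mathbb E[X]\,\mathbb E[Y]
=0-0\cdot\mathbb E[Y]
=0.
\end{align*}
The variables are nevertheless dependent, because $Y$ is a function of $X$. Since $\{Y=0\}=\{X=0\}$,
\begin{align*}
\mathbb P(X=0,Y=0)
=\mathbb P(X=0)
=\frac13
\ne\frac19
=\frac13\cdot\frac13
=\mathbb P(X=0)\,\mathbb P(Y=0).
\end{align*}
[/example]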
Pairwise independence can be useful in second-moment arguments. It is not enough when a proof multiplies probabilities across three or more events. Mutual independence must be checked whenever higher intersections matter.
[remark: Pairwise Independence Is Not Enough]
Pairwise independence does not imply mutual independence. Arguments involving products over finite families require mutual independence or a replacement hypothesis that gives the needed factorisation.
[/remark]
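The gap between pairwise and mutual independence already appears with two fair coins and their parity. The following is a minimal illustration; the names $X$, $Y$, $Z$ are introduced only for this example.
[example: Two Coins and Their Parity]
Let $X$ and $Y$ be independent with $\mathbb P(X=1)=\mathbb P(X=0)=\frac12$ and likewise for $Y$, and let $Z=X+Y \bmod 2$. The triple $(X,Y,Z)$ takes the four values
\begin{align*}
(0,0,0),\qquad (0,1,1),\qquad (1,0,1),\qquad (1,1,0),
\end{align*}
each with probability $\frac14$, so $\mathbb P(Z=1)=\frac12$. For any $a,c\in\{0,1\}$ exactly one of these four triples has $X=a$ and $Z=c$, so
\begin{align*}
\mathbb P(X=a,Z=c)
=\frac14
=\frac12\cdot\frac12
=\mathbb P(X=a)\,\mathbb P(Z=c),
\end{align*}
and the same argument applies to the pair $(Y,Z)$. Hence $X$, $Y$, $Z$ are pairwise independent. They are not mutually independent, because the triple $(1,1,1)$ is impossible:
\begin{align*}
\mathbb P(X=1,Y=1,Z=1)
=0
\ne\frac18
=\mathbb P(X=1)\,\mathbb P(Y=1)\,\mathbb P(Z=1).
\end{align*}
[/example]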
## Beyond and Connected Topics
Independence leads directly to [Conditional Expectation](/page/Conditional%20Expectation). Conditional expectation measures what remains predictable after a sigma-algebra is observed, and independence is the case where observing one information source leaves expectations from another unchanged.
[Product Measure](/page/Product%20Measure) provides the construction principle behind independence. Finite and infinite products build spaces supporting independent coordinates with prescribed marginal laws.
[Martingale](/page/Martingale) theory weakens independence into conditional mean stability. Independent increments often produce martingales, but martingales also describe dependent processes whose future has controlled conditional drift.
The [Law of Large Numbers](/page/Law%20of%20Large%20Numbers) and [Central Limit Theorem](/page/Central%20Limit%20Theorem) show why independence is a structural assumption rather than a cosmetic one. It turns repeated small uncertainties into stable averages and universal fluctuation laws.
In statistics, i.i.d. samples make likelihoods factor. If $X_1,\dots,X_n$ are independent and each has density $f_\theta$, the joint density at $(x_1,\dots,x_n)$ is
\begin{align*}
\prod_{i=1}^n f_\theta(x_i).
\end{align*}
Evaluated at the observed sample and viewed as a function of $\theta$, this product is the likelihood, which is the starting point for maximum likelihood estimation and Bayesian updating.
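As a sketch of how the factorisation is used, the following example assumes an exponential model, $f_\theta(x)=\theta e^{-\theta x}$ for $x>0$ with $\theta>0$. This model is an illustrative choice, not one fixed by the text.
[example: Factorised Likelihood for Exponential Samples]
For independent observations $x_1,\dots,x_n>0$, the product form gives
\begin{align*}
L(\theta)
&=\prod_{i=1}^n \theta e^{-\theta x_i}
=\theta^n\exp\left(-\theta\sum_{i=1}^n x_i\right),\\
\log L(\theta)
&=n\log\theta-\theta\sum_{i=1}^n x_i.
\end{align*}
Differentiating,
\begin{align*}
\frac{d}{d\theta}\log L(\theta)
=\frac n\theta-\sum_{i=1}^n x_i,
\qquad
\frac{d^2}{d\theta^2}\log L(\theta)
=-\frac{n}{\theta^2}<0,
\end{align*}
so the log-likelihood is strictly concave and is maximised at
\begin{align*}
\hat\theta=\frac{n}{\sum_{i=1}^n x_i}.
\end{align*}
Independence is what turns the joint density into a product and the log-likelihood into a sum of one-observation terms.
[/example]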
## References
Billingsley, *Probability and Measure* (1995).
Durrett, *Probability: Theory and Examples* (2019).
Kallenberg, *Foundations of Modern Probability* (2021).
Williams, *Probability with Martingales* (1991).