A probability model begins with a failure of counting. If a dart lands somewhere in the interval $[0,1]$, symmetry says no individual point should be preferred. Giving every point the same positive probability $c$ would give any $n$ distinct points total probability $nc$, which exceeds $1$ once $n>1/c$. Giving every point probability $0$ avoids that contradiction, but then the probability $1$ of the whole interval cannot be recovered by summing point probabilities, because countable additivity covers only countable unions and $[0,1]$ is uncountable. Probability therefore needs more than a set of outcomes: it needs a chosen class of events and a countably additive way to measure them.
[example: A Uniform Point Cannot Be Built from Equal Point Masses]
Let $\Omega=[0,1]$, and suppose first that every singleton has one common positive probability:
\begin{align*}
\mathbb P(\{x\})=c>0\quad\text{for every }x\in[0,1].
\end{align*}
Fix $n\in\mathbb N$ and choose distinct points $x_1,\ldots,x_n\in[0,1]$. Because the points are distinct, if $i\ne j$ then $\{x_i\}\cap\{x_j\}=\varnothing$, so the singleton events are pairwise disjoint.
To apply countable additivity to this finite family, define
\begin{align*}
A_k=\{x_k\}\quad\text{for }1\le k\le n
\end{align*}
and
\begin{align*}
A_k=\varnothing\quad\text{for }k>n.
\end{align*}
The sequence $(A_k)_{k\in\mathbb N}$ is pairwise disjoint, and its union is
\begin{align*}
\bigcup_{k=1}^{\infty}A_k=\{x_1\}\cup\cdots\cup\{x_n\}=\{x_1,\ldots,x_n\}.
\end{align*}
Countable additivity gives
\begin{align*}
\mathbb P(\{x_1,\ldots,x_n\})=\sum_{k=1}^{\infty}\mathbb P(A_k).
\end{align*}
For $k>n$, $A_k=\varnothing$. Also, since $\varnothing$ is disjoint from itself, countable additivity applied to $\varnothing=\bigcup_{k=1}^{\infty}\varnothing$ gives
\begin{align*}
\mathbb P(\varnothing)=\sum_{k=1}^{\infty}\mathbb P(\varnothing).
\end{align*}
The left side is finite because $\mathbb P$ takes values in $[0,1]$, so this equality forces $\mathbb P(\varnothing)=0$. Hence the tail terms vanish, and
\begin{align*}
\sum_{k=1}^{\infty}\mathbb P(A_k)=\mathbb P(\{x_1\})+\cdots+\mathbb P(\{x_n\}).
\end{align*}
Each singleton has probability $c$, so
\begin{align*}
\mathbb P(\{x_1,\ldots,x_n\})=c+\cdots+c.
\end{align*}
There are $n$ summands, each equal to $c$, therefore
\begin{align*}
c+\cdots+c=nc.
\end{align*}
Thus
\begin{align*}
\mathbb P(\{x_1,\ldots,x_n\})=nc.
\end{align*}
Since $c>0$, choose $n\in\mathbb N$ with $n>1/c$. Multiplying the inequality $n>1/c$ by the positive number $c$ gives
\begin{align*}
nc>1.
\end{align*}
For this choice of $n$,
\begin{align*}
\mathbb P(\{x_1,\ldots,x_n\})=nc>1.
\end{align*}
This contradicts the requirement that $\mathbb P$ take values in $[0,1]$.
Thus an equal-mass model on $[0,1]$ cannot assign a positive common probability to all singletons. A uniform probability model on $[0,1]$ must instead have $\mathbb P(\{x\})=0$ for each point $x$, while still having $\mathbb P([0,1])=1$; probabilities on continuous spaces therefore cannot be reconstructed by summing equal point masses.
[/example]
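The contradiction in the example can be checked with exact arithmetic. The following is a minimal sketch, not part of the argument itself: the helper name `points_needed` and the sample value of `c` are illustrative assumptions, and rational numbers stand in for real ones.

```python
from fractions import Fraction
from math import floor

def points_needed(c: Fraction) -> int:
    """Smallest n in N with n > 1/c, so that n * c > 1 (hypothetical helper)."""
    if c <= 0:
        raise ValueError("c must be positive")
    return floor(1 / c) + 1

c = Fraction(1, 1000)      # assumed common mass of every singleton
n = points_needed(c)       # how many distinct points to pick
total = n * c              # P({x_1, ..., x_n}) under the equal-mass assumption
print(n, total, total > 1)
```

However small the assumed `c` is, the returned `n` is finite, which is all the example needs.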
The example raises the central question for this page: what structure must be present before probability statements have mathematical meaning? The next definitions separate the raw outcomes, the observable events, and the probability measure.
## Definition
The opening failure shows why probability cannot be just a function on individual outcomes. A model must say what the possible outcomes are, which subsets are legitimate events, and how probabilities are assigned to those events. The central object packages these three choices at once.
[definition: Probability Space]
A probability space is a triple $(\Omega,\mathcal F,\mathbb P)$ where $\Omega$ is a set, $\mathcal F$ is a $\sigma$-algebra on $\Omega$, and $\mathbb P:\mathcal F\to[0,1]$ is a probability measure on $(\Omega,\mathcal F)$.
[/definition]
The definition is compact, but each component has a distinct job. The first component answers the question of what could happen. It carries no probabilities and no information structure by itself, so the definition below is intentionally minimal.
[definition: Sample Space]
A sample space is a set $\Omega$ whose elements are called outcomes.
[/definition]
A sample space is too raw for probability because most probabilistic statements concern subsets of outcomes. If one can ask whether an event happens, one should also be able to ask whether it does not happen and whether at least one event in a countable list happens. The event collection must therefore be closed under these logical operations.
[definition: $\sigma$-Algebra]
Let $\Omega$ be a set. A $\sigma$-algebra on $\Omega$ is a collection $\mathcal F\subset\mathcal P(\Omega)$ such that:
1. $\Omega\in\mathcal F$.
2. If $A\in\mathcal F$, then $A^c\in\mathcal F$, where the complement is taken in $\Omega$.
3. If $A_1,A_2,\ldots\in\mathcal F$, then $\bigcup_{n=1}^{\infty}A_n\in\mathcal F$.
[/definition]
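On a finite set the three axioms can be tested mechanically. The sketch below assumes a finite $\Omega$, where a family of subsets is itself finite, so closure under pairwise unions already gives closure under countable unions. The function name `is_sigma_algebra` and the sample families are illustrative assumptions.

```python
from itertools import combinations

def is_sigma_algebra(omega: frozenset, family: set) -> bool:
    """Check the sigma-algebra axioms for a family of subsets of a FINITE set."""
    if any(not a <= omega for a in family):
        return False                      # every member must be a subset of omega
    if omega not in family:
        return False                      # axiom 1: omega belongs to the family
    if any(omega - a not in family for a in family):
        return False                      # axiom 2: closed under complements
    # axiom 3: for a finite family, pairwise unions suffice
    return all(a | b in family for a, b in combinations(family, 2))

omega = frozenset({1, 2, 3, 4})
coarse = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}
broken = {frozenset(), frozenset({1}), omega}   # lacks the complement {2, 3, 4}
print(is_sigma_algebra(omega, coarse), is_sigma_algebra(omega, broken))
```

The family `coarse` cannot tell the outcomes $1$ and $2$ apart, yet it still satisfies all three axioms.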
The sets in $\mathcal F$ are the events whose probabilities the model is allowed to discuss. To assign numerical probabilities to those events, we need the following normalized version of a measure.
[definition: Probability Measure]
Let $(\Omega,\mathcal F)$ be a measurable space. A probability measure on $(\Omega,\mathcal F)$ is a function $\mathbb P:\mathcal F\to[0,1]$ such that:
1. $\mathbb P(\Omega)=1$.
2. If $A_1,A_2,\ldots\in\mathcal F$ are pairwise disjoint, then
\begin{align*}
\mathbb P\left(\bigcup_{n=1}^{\infty}A_n\right)=\sum_{n=1}^{\infty}\mathbb P(A_n).
\end{align*}
[/definition]
This definition is the foundation for the rest of probability. The next example shows that even the smallest familiar experiment already uses all three parts of the triple.
[example: A Biased Coin]
Let $\Omega=\{H,T\}$, let $\mathcal F=\mathcal P(\Omega)$, and fix $p\in[0,1]$. Since $\mathcal F$ is the full power set, it contains $\Omega$, is closed under complements in $\Omega$, and is closed under countable unions of subsets of $\Omega$; hence $\mathcal F$ is a $\sigma$-algebra. Define
\begin{align*}
\mathbb P(\varnothing)=0,\quad \mathbb P(\{H\})=p,\quad \mathbb P(\{T\})=1-p,\quad \mathbb P(\Omega)=1.
\end{align*}
Equivalently, for every $A\subset\Omega$,
\begin{align*}
\mathbb P(A)=p\,\mathbf 1_{\{H\in A\}}+(1-p)\,\mathbf 1_{\{T\in A\}}.
\end{align*}
Indeed, for $A=\varnothing$ both indicators are $0$, so the formula gives $0$. For $A=\{H\}$ it gives $p\cdot1+(1-p)\cdot0=p$. For $A=\{T\}$ it gives $p\cdot0+(1-p)\cdot1=1-p$. For $A=\Omega$ it gives
\begin{align*}
p\cdot1+(1-p)\cdot1=p+1-p=1.
\end{align*}
Since $p\in[0,1]$, we have $0\le p\le1$. Subtracting $p\le1$ from $1$ gives $1-p\ge0$, and subtracting $0\le p$ from $1$ gives $1-p\le1$, so $1-p\in[0,1]$. Thus every assigned value lies in $[0,1]$, and
\begin{align*}
\mathbb P(\Omega)=1.
\end{align*}
It remains to verify countable additivity. Let $(A_n)_{n\in\mathbb N}$ be pairwise disjoint subsets of $\Omega$, and put
\begin{align*}
U=\bigcup_{n=1}^{\infty}A_n.
\end{align*}
For the outcome $H$, either $H\notin U$, in which case every term $\mathbf 1_{\{H\in A_n\}}$ is $0$, or $H\in U$, in which case $H\in A_m$ for exactly one index $m$ because the sets $A_n$ are pairwise disjoint. Therefore
\begin{align*}
\mathbf 1_{\{H\in U\}}=\sum_{n=1}^{\infty}\mathbf 1_{\{H\in A_n\}}.
\end{align*}
The same argument for $T$ gives
\begin{align*}
\mathbf 1_{\{T\in U\}}=\sum_{n=1}^{\infty}\mathbf 1_{\{T\in A_n\}}.
\end{align*}
Using the indicator formula for $\mathbb P$ on $U$,
\begin{align*}
\mathbb P(U)=p\,\mathbf 1_{\{H\in U\}}+(1-p)\,\mathbf 1_{\{T\in U\}}.
\end{align*}
Substituting the two indicator identities gives
\begin{align*}
\mathbb P(U)=p\sum_{n=1}^{\infty}\mathbf 1_{\{H\in A_n\}}+(1-p)\sum_{n=1}^{\infty}\mathbf 1_{\{T\in A_n\}}.
\end{align*}
Each of the two indicator sums has at most one nonzero term, so multiplying by the constants and adding term by term gives
\begin{align*}
\mathbb P(U)=\sum_{n=1}^{\infty}\left(p\,\mathbf 1_{\{H\in A_n\}}+(1-p)\,\mathbf 1_{\{T\in A_n\}}\right).
\end{align*}
For each $n$, the expression inside the parentheses is exactly $\mathbb P(A_n)$ by the indicator formula. Hence
\begin{align*}
\mathbb P\left(\bigcup_{n=1}^{\infty}A_n\right)=\sum_{n=1}^{\infty}\mathbb P(A_n).
\end{align*}
Thus $(\Omega,\mathcal F,\mathbb P)$ is a probability space, and the parameter $p$ is precisely the probability assigned to heads.
[/example]
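The indicator formula from the example translates directly into code. This is a small sketch under stated assumptions: the names `coin_measure` and `events` and the bias $p=3/10$ are illustrative. Because the power set of $\{H,T\}$ has only four members, a pairwise disjoint sequence contains at most two nonempty sets, so checking disjoint pairs covers countable additivity here.

```python
from fractions import Fraction

def coin_measure(p: Fraction):
    """Return P on the power set of {H, T}, defined by the indicator formula."""
    def P(event: frozenset) -> Fraction:
        # ("H" in event) is a bool, which acts as the indicator 0 or 1
        return p * ("H" in event) + (1 - p) * ("T" in event)
    return P

p = Fraction(3, 10)                     # assumed bias toward heads
P = coin_measure(p)
events = [frozenset(), frozenset("H"), frozenset("T"), frozenset("HT")]

# additivity on every pair of disjoint events
additive = all(
    P(a | b) == P(a) + P(b)
    for a in events for b in events if not (a & b)
)
print([str(P(e)) for e in events], additive)
```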
The coin example hides a simplification: every subset is measurable. The next section explains why the choice of $\mathcal F$ becomes a real part of the model once the outcome space is uncountable.
## Event Structures
The event collection controls what the model can observe. A very small $\sigma$-algebra represents limited information, while a very large one allows many distinctions among outcomes. The first extreme is useful because it shows that outcomes need not be individually observable.
[definition: Indiscrete $\sigma$-Algebra]
Let $\Omega$ be a set. The indiscrete $\sigma$-algebra on $\Omega$ is $\{\varnothing,\Omega\}$.
[/definition]
With the indiscrete $\sigma$-algebra, the model distinguishes only impossibility from certainty. To model experiments where every subset is observable, we need the opposite extreme below.
[definition: Discrete $\sigma$-Algebra]
Let $\Omega$ be a set. The discrete $\sigma$-algebra on $\Omega$ is the power set $\mathcal P(\Omega)$.
[/definition]
The discrete $\sigma$-algebra is especially effective on countable spaces because every event is then a countable union of singleton events, so countable additivity applies to it directly. The modeling question is then whether assigning nonnegative weights to individual outcomes is enough to determine the probabilities of all events, and whether countable additivity forces the answer to be a sum over those weights.
[quotetheorem:9348]
This theorem explains why elementary discrete probability, finite or countably infinite, can be done by summing weights of outcomes. The next example turns that statement into a standard waiting-time model.
[example: A Geometric Waiting Time]
Let $p\in(0,1]$, let $\Omega=\mathbb N$, and let $\mathcal F=\mathcal P(\mathbb N)$. Put $r=1-p$, so $0\le r<1$, and define $q(n)=r^{n-1}p$ for $n\in\mathbb N$, with $r^0=1$. Since $0\le r<1$, each power $r^{n-1}$ lies in $[0,1]$, and since $0<p\le1$, we have $q(n)=r^{n-1}p\in[0,1]$.
For each $m\in\mathbb N$, substitute $j=n-1$ in the finite sum:
\begin{align*}
\sum_{n=1}^{m}q(n)=\sum_{n=1}^{m}r^{n-1}p=p\sum_{j=0}^{m-1}r^j.
\end{align*}
Since $p=1-r$,
\begin{align*}
p\sum_{j=0}^{m-1}r^j=(1-r)(1+r+\cdots+r^{m-1}).
\end{align*}
Expanding the product as two sums gives
\begin{align*}
(1-r)(1+r+\cdots+r^{m-1})=(1+r+\cdots+r^{m-1})-(r+r^2+\cdots+r^m).
\end{align*}
The terms $r,r^2,\ldots,r^{m-1}$ appear once with sign $+$ and once with sign $-$, so they cancel, leaving
\begin{align*}
(1+r+\cdots+r^{m-1})-(r+r^2+\cdots+r^m)=1-r^m.
\end{align*}
Therefore
\begin{align*}
\sum_{n=1}^{m}q(n)=1-r^m.
\end{align*}
Because $0\le r<1$, $r^m\to0$, and hence
\begin{align*}
\sum_{n=1}^{\infty}q(n)=\lim_{m\to\infty}\sum_{n=1}^{m}q(n)=\lim_{m\to\infty}(1-r^m)=1.
\end{align*}
Thus $q:\mathbb N\to[0,1]$ has total mass $1$, so by *Probability Measures on Countable Spaces* it defines a probability measure $\mathbb P$ on $\mathcal P(\mathbb N)$ satisfying
\begin{align*}
\mathbb P(A)=\sum_{n\in A}q(n)
\end{align*}
for every $A\subset\mathbb N$.
For $k\in\mathbb N$, the event that the first success occurs no earlier than time $k$ is $\{k,k+1,k+2,\ldots\}$. By the formula for $\mathbb P$,
\begin{align*}
\mathbb P(\{k,k+1,\ldots\})=\sum_{n=k}^{\infty}q(n)=\sum_{n=k}^{\infty}r^{n-1}p.
\end{align*}
Writing $n=k+\ell$ with $\ell\in\{0,1,2,\ldots\}$ gives
\begin{align*}
\sum_{n=k}^{\infty}r^{n-1}p=\sum_{\ell=0}^{\infty}r^{k+\ell-1}p.
\end{align*}
Since $r^{k+\ell-1}=r^{k-1}r^\ell$, this becomes
\begin{align*}
\sum_{\ell=0}^{\infty}r^{k+\ell-1}p=\sum_{\ell=0}^{\infty}r^{k-1}r^\ell p.
\end{align*}
The factor $r^{k-1}p$ is independent of $\ell$, so
\begin{align*}
\sum_{\ell=0}^{\infty}r^{k-1}r^\ell p=r^{k-1}p\sum_{\ell=0}^{\infty}r^\ell.
\end{align*}
For each $M\ge0$, the same finite geometric calculation gives
\begin{align*}
p\sum_{\ell=0}^{M}r^\ell=(1-r)(1+r+\cdots+r^M)=1-r^{M+1}.
\end{align*}
Letting $M\to\infty$ and using $r^{M+1}\to0$ gives
\begin{align*}
p\sum_{\ell=0}^{\infty}r^\ell=1.
\end{align*}
Therefore
\begin{align*}
\mathbb P(\{k,k+1,\ldots\})=r^{k-1}.
\end{align*}
Finally, since $r=1-p$,
\begin{align*}
r^{k-1}=(1-p)^{k-1}.
\end{align*}
So the chance of waiting at least until time $k$ is $(1-p)^{k-1}$, exactly the probability of seeing $k-1$ consecutive failures first.
[/example]
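The two closed forms in the example, $\sum_{n=1}^{m}q(n)=1-r^m$ and $\mathbb P(\{k,k+1,\ldots\})=r^{k-1}$, can be confirmed with exact rationals. This is a hedged sketch: the function names and the value $p=1/4$ are illustrative assumptions, and the tail is computed through the complement $1-\mathbb P(\{1,\ldots,k-1\})$ rather than through the series manipulation used in the text.

```python
from fractions import Fraction

def q(n: int, p: Fraction) -> Fraction:
    """Point mass q(n) = (1 - p)^(n - 1) * p for n = 1, 2, ..."""
    return (1 - p) ** (n - 1) * p

def partial_sum(m: int, p: Fraction) -> Fraction:
    """Sum of q(1), ..., q(m); equals 0 when m = 0."""
    return sum((q(n, p) for n in range(1, m + 1)), Fraction(0))

def tail(k: int, p: Fraction) -> Fraction:
    """P({k, k+1, ...}) computed as 1 - P({1, ..., k-1})."""
    return 1 - partial_sum(k - 1, p)

p = Fraction(1, 4)          # assumed success probability
r = 1 - p
for m in (1, 5, 20):
    assert partial_sum(m, p) == 1 - r ** m
for k in (1, 3, 10):
    assert tail(k, p) == r ** (k - 1)
print(partial_sum(5, p), tail(3, p))
```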
Countable spaces do not reveal the main difficulty of event selection. For real-valued outcomes, we need a $\sigma$-algebra large enough to contain intervals and open sets, and the following construction provides the standard choice.
[definition: Borel $\sigma$-Algebra]
Let $(X,\tau)$ be a [topological space](/page/Topological%20Space). The Borel $\sigma$-algebra on $X$, denoted $\mathcal B(X)$, is the smallest $\sigma$-algebra on $X$ that contains every [open set](/page/Open%20Set) in $\tau$.
[/definition]
The Borel $\sigma$-algebra is the default event structure for random variables with values in $\mathbb R$ or $\mathbb R^n$. On an interval, the intended size of a Borel event is measured by restricting [Lebesgue measure](/page/Lebesgue%20Measure) $\mathcal L^1$ from $\mathbb R$ to the Borel subsets of that interval; normalizing this restricted length measure returns to the opening dart model.
[example: Uniform Probability on an Interval]
Let $a<b$, let $\Omega=[a,b]$, and let $\mathcal F=\mathcal B([a,b])$, where Borel subsets of $[a,b]$ are understood through the [subspace topology](/page/Subspace%20Topology). Define
\begin{align*}
\mathbb P(A)=\frac{\mathcal L^1(A)}{b-a}
\end{align*}
for each $A\in\mathcal B([a,b])$. We verify that this is a probability measure and then compute the probabilities of intervals and points.
Since $a<b$, we have $b-a>0$. For any $A\in\mathcal B([a,b])$, Lebesgue measure is nonnegative, so
\begin{align*}
\mathcal L^1(A)\ge0.
\end{align*}
Dividing by the positive number $b-a$ gives
\begin{align*}
\mathbb P(A)=\frac{\mathcal L^1(A)}{b-a}\ge0.
\end{align*}
Also $A\subset[a,b]$, so monotonicity of Lebesgue measure gives
\begin{align*}
\mathcal L^1(A)\le \mathcal L^1([a,b]).
\end{align*}
The length of the closed interval $[a,b]$ is $b-a$, hence
\begin{align*}
\mathcal L^1([a,b])=b-a.
\end{align*}
Substituting this into the previous inequality gives
\begin{align*}
\mathcal L^1(A)\le b-a.
\end{align*}
Dividing by $b-a>0$ gives
\begin{align*}
\mathbb P(A)=\frac{\mathcal L^1(A)}{b-a}\le\frac{b-a}{b-a}.
\end{align*}
Since $b-a\ne0$,
\begin{align*}
\frac{b-a}{b-a}=1.
\end{align*}
Thus $\mathbb P(A)\in[0,1]$. For the whole sample space,
\begin{align*}
\mathbb P([a,b])=\frac{\mathcal L^1([a,b])}{b-a}.
\end{align*}
Using $\mathcal L^1([a,b])=b-a$,
\begin{align*}
\mathbb P([a,b])=\frac{b-a}{b-a}=1.
\end{align*}
Now let $(A_n)_{n\in\mathbb N}$ be pairwise disjoint Borel subsets of $[a,b]$. Since $\mathcal B([a,b])$ is a $\sigma$-algebra,
\begin{align*}
\bigcup_{n=1}^{\infty}A_n\in\mathcal B([a,b]).
\end{align*}
By countable additivity of Lebesgue measure,
\begin{align*}
\mathcal L^1\left(\bigcup_{n=1}^{\infty}A_n\right)=\sum_{n=1}^{\infty}\mathcal L^1(A_n).
\end{align*}
Applying the definition of $\mathbb P$ to the union gives
\begin{align*}
\mathbb P\left(\bigcup_{n=1}^{\infty}A_n\right)=\frac{\mathcal L^1\left(\bigcup_{n=1}^{\infty}A_n\right)}{b-a}.
\end{align*}
Substituting the countable-additivity identity gives
\begin{align*}
\mathbb P\left(\bigcup_{n=1}^{\infty}A_n\right)=\frac{\sum_{n=1}^{\infty}\mathcal L^1(A_n)}{b-a}.
\end{align*}
Because $b-a$ is a positive constant, division by $b-a$ distributes over the nonnegative series:
\begin{align*}
\frac{\sum_{n=1}^{\infty}\mathcal L^1(A_n)}{b-a}=\sum_{n=1}^{\infty}\frac{\mathcal L^1(A_n)}{b-a}.
\end{align*}
For each $n$,
\begin{align*}
\frac{\mathcal L^1(A_n)}{b-a}=\mathbb P(A_n).
\end{align*}
Therefore
\begin{align*}
\mathbb P\left(\bigcup_{n=1}^{\infty}A_n\right)=\sum_{n=1}^{\infty}\mathbb P(A_n).
\end{align*}
So $\mathbb P$ is a probability measure on $\mathcal B([a,b])$.
Now let $a\le c\le d\le b$. Then $[c,d]$ is a closed subset of $[a,b]$, so its complement in $[a,b]$ is open in the subspace topology and therefore $[c,d]\in\mathcal B([a,b])$. The length of $[c,d]$ is
\begin{align*}
\mathcal L^1([c,d])=d-c.
\end{align*}
Using the definition of $\mathbb P$,
\begin{align*}
\mathbb P([c,d])=\frac{\mathcal L^1([c,d])}{b-a}.
\end{align*}
Substituting $\mathcal L^1([c,d])=d-c$ gives
\begin{align*}
\mathbb P([c,d])=\frac{d-c}{b-a}.
\end{align*}
For a point $x\in[a,b]$, the singleton $\{x\}$ is the degenerate interval $[x,x]$. Its length is
\begin{align*}
\mathcal L^1(\{x\})=\mathcal L^1([x,x])=x-x.
\end{align*}
Since $x-x=0$,
\begin{align*}
\mathcal L^1(\{x\})=0.
\end{align*}
Therefore
\begin{align*}
\mathbb P(\{x\})=\frac{\mathcal L^1(\{x\})}{b-a}.
\end{align*}
Substituting $\mathcal L^1(\{x\})=0$ gives
\begin{align*}
\mathbb P(\{x\})=\frac{0}{b-a}=0.
\end{align*}
This normalized length model assigns probability in proportion to interval length, while every individual point has probability zero.
[/example]
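For subintervals the formula $\mathbb P([c,d])=(d-c)/(b-a)$ is easy to evaluate. The sketch below handles only closed subintervals, not general Borel sets. The function name `uniform_interval_prob` and the sample endpoints are illustrative assumptions.

```python
from fractions import Fraction

def uniform_interval_prob(a, b, c, d) -> Fraction:
    """P([c, d]) under the uniform law on [a, b]; requires a < b and a <= c <= d <= b."""
    if not (a < b and a <= c <= d <= b):
        raise ValueError("need a < b and a <= c <= d <= b")
    return Fraction(d - c) / Fraction(b - a)

a, b = 2, 5                                       # assumed interval [2, 5]
whole = uniform_interval_prob(a, b, a, b)         # P([a, b])
middle = uniform_interval_prob(a, b, 3, 4)        # P([3, 4])
point = uniform_interval_prob(a, b, 3, 3)         # P({3}) as the interval [3, 3]
print(whole, middle, point)
```

The pieces $[2,3]$ and $[3,5]$ overlap only in the point $3$, which has probability zero, so their probabilities still add up to $1$.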
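The interval example shows why probability uses countable additivity rather than arbitrary additivity: additivity over uncountable families would force $\mathbb P([a,b])$ to equal the sum of the point probabilities $\mathbb P(\{x\})=0$, contradicting $\mathbb P([a,b])=1$. The next section records the basic consequences of the axioms that make event calculations possible.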
## Axioms and Limiting Events
The axioms of a probability measure are designed to recover the arithmetic of mutually exclusive alternatives. Before probabilities can be added without correction, the model must express that two events have no shared outcomes, so that one occurrence cannot be counted twice.
[definition: Disjoint Events]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Events $A,B\in\mathcal F$ are disjoint if $A\cap B=\varnothing$. A family $(A_i)_{i\in I}$ of events is pairwise disjoint if $A_i\cap A_j=\varnothing$ whenever $i,j\in I$ and $i\ne j$.
[/definition]
Disjointness is the case where probabilities add without an overlap correction. For arbitrary finite unions, however, overlaps create double-counting, and complements require tracking what is left outside an event. The basic event identities organize these corrections into usable formulas.
[quotetheorem:9349]
The overlap term in the union formula is often the first place where event algebra matters. The next example shows how the formula prevents double-counting.
[example: Inclusion-Exclusion for Two Events]
Let $A$ be the event that a randomly chosen student takes analysis, and let $B$ be the event that the student takes statistics. Suppose
\begin{align*}
\mathbb P(A)=0.55,\quad \mathbb P(B)=0.40,\quad \mathbb P(A\cap B)=0.20.
\end{align*}
We compute the probability that the student takes at least one of the two courses, namely $\mathbb P(A\cup B)$.
By *Elementary Probability Identities*,
\begin{align*}
\mathbb P(A\cup B)=\mathbb P(A)+\mathbb P(B)-\mathbb P(A\cap B).
\end{align*}
Substituting the three given values gives
\begin{align*}
\mathbb P(A\cup B)=0.55+0.40-0.20.
\end{align*}
Adding the first two decimals,
\begin{align*}
0.55+0.40=0.95.
\end{align*}
Subtracting the overlap term,
\begin{align*}
0.95-0.20=0.75.
\end{align*}
Therefore
\begin{align*}
\mathbb P(A\cup B)=0.75.
\end{align*}
The subtraction of $\mathbb P(A\cap B)=0.20$ removes the double count: a student taking both courses contributes once to $\mathbb P(A)$ and once to $\mathbb P(B)$, but should contribute only once to $\mathbb P(A\cup B)$.
[/example]
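The same numbers can be realized on a concrete finite model. The sketch below assumes a hypothetical population of $20$ equally likely students, chosen so that the three given probabilities come out exactly. The sets `A` and `B` are invented for illustration and are not data from the example.

```python
from fractions import Fraction

students = range(20)                 # assumed population, uniform probability
A = set(range(0, 11))                # 11 of 20 take analysis   -> 0.55
B = set(range(7, 15))                # 8 of 20 take statistics  -> 0.40

def P(event: set) -> Fraction:
    """Uniform probability on the assumed population."""
    return Fraction(len(event), len(students))

lhs = P(A | B)                       # direct count of the union
rhs = P(A) + P(B) - P(A & B)         # union formula with overlap correction
print(P(A), P(B), P(A & B), lhs, rhs)
```

Dropping the correction would give $\mathbb P(A)+\mathbb P(B)=19/20$, which counts the four students in both courses twice.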
The union formula solves a fixed finite problem, but many probabilistic events are reached only through a sequence of approximations, such as ever seeing a head or eventually remaining inside a prescribed set. To make such limits usable, one first needs vocabulary for sequences whose event sets move in only one direction: accumulating more possible outcomes or imposing stricter requirements at each stage.
[definition: Increasing and Decreasing Events]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A sequence $(A_n)_{n\in\mathbb N}$ of events is increasing if $A_n\subset A_{n+1}$ for every $n\in\mathbb N$. It is decreasing if $A_{n+1}\subset A_n$ for every $n\in\mathbb N$.
[/definition]
Increasing events model accumulating possibilities, while decreasing events model progressively stricter requirements. The remaining issue is whether the probability of the limiting event is determined by the probabilities seen at finite stages; countable additivity is exactly what makes that passage to the limit valid.
[quotetheorem:1107]
[Continuity of probability](/theorems/1107) lets finite approximations determine infinite events. The next example uses the standard infinite product measure for repeated fair coin tosses; its existence is a construction theorem in measure theory, so here it functions as background rather than a new primitive. With that model in place, the probability of eventually seeing a head is computed through an increasing sequence.
[example: Eventually Seeing a Head]
Let $\Omega=\{H,T\}^{\mathbb N}$, let $\mathcal F$ be the product $\sigma$-algebra, and let $\mathbb P$ be the infinite product probability measure for fair coin tosses. Thus, for any prescribed values in the first $n$ coordinates, the corresponding cylinder event has probability $(1/2)^n$. For $n\in\mathbb N$, let $A_n$ be the event that at least one head occurs among the first $n$ tosses. If $\omega\in A_n$, then one of $\omega_1,\ldots,\omega_n$ is equal to $H$, so one of $\omega_1,\ldots,\omega_n,\omega_{n+1}$ is also equal to $H$. Hence $\omega\in A_{n+1}$, so $A_n\subset A_{n+1}$ and $(A_n)_{n\in\mathbb N}$ is increasing.
Let $B_n=A_n^c$. Then $B_n$ is the event that no head occurs among the first $n$ tosses, equivalently
\begin{align*}
B_n=\{\omega\in\Omega:\omega_1=T,\omega_2=T,\ldots,\omega_n=T\}.
\end{align*}
This event fixes each of the first $n$ coordinates to be $T$. Since each fixed coordinate has probability $1/2$ under the fair product measure, the defining finite-cylinder probabilities give
\begin{align*}
\mathbb P(B_n)=\underbrace{\frac12\cdot\frac12\cdots\frac12}_{n\text{ factors}}.
\end{align*}
The product of $n$ factors equal to $1/2$ is
\begin{align*}
\underbrace{\frac12\cdot\frac12\cdots\frac12}_{n\text{ factors}}=\left(\frac12\right)^n.
\end{align*}
Since $\left(\frac12\right)^n=2^{-n}$, we have
\begin{align*}
\mathbb P(B_n)=2^{-n}.
\end{align*}
Because $B_n=A_n^c$, we also have $A_n=B_n^c$. By the complement identity from *Elementary Probability Identities*,
\begin{align*}
\mathbb P(A_n)=1-\mathbb P(B_n).
\end{align*}
Substituting $\mathbb P(B_n)=2^{-n}$ gives
\begin{align*}
\mathbb P(A_n)=1-2^{-n}.
\end{align*}
The event $\bigcup_{n=1}^{\infty}A_n$ is exactly the event that a head occurs at some finite time: membership in the union means membership in $A_n$ for at least one $n$, and membership in $A_n$ means a head occurred among the first $n$ tosses. Since $(A_n)$ is increasing, *Continuity of Probability* gives
\begin{align*}
\mathbb P\left(\bigcup_{n=1}^{\infty}A_n\right)=\lim_{n\to\infty}\mathbb P(A_n).
\end{align*}
Using $\mathbb P(A_n)=1-2^{-n}$,
\begin{align*}
\lim_{n\to\infty}\mathbb P(A_n)=\lim_{n\to\infty}(1-2^{-n}).
\end{align*}
Since $2^{-n}=1/2^n$ and $2^n\to\infty$, we have $2^{-n}\to0$. Therefore
\begin{align*}
\lim_{n\to\infty}(1-2^{-n})=1-0.
\end{align*}
Thus
\begin{align*}
\mathbb P\left(\bigcup_{n=1}^{\infty}A_n\right)=1.
\end{align*}
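So the probability of eventually seeing at least one head is $1$, even though no finite time is guaranteed in advance: for every $n$, the event $B_n$ of seeing no head among the first $n$ tosses has positive probability $2^{-n}$.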
[/example]
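The finite-stage probabilities $\mathbb P(A_n)=1-2^{-n}$ depend only on the first $n$ coordinates, so they can be checked by enumerating the $2^n$ equally likely patterns. The sketch below works only at these finite stages and does not construct the infinite product measure. The function name `prob_head_within` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

def prob_head_within(n: int) -> Fraction:
    """P(A_n): at least one head among the first n fair tosses."""
    patterns = list(product("HT", repeat=n))        # 2^n cylinders, each of mass 2^-n
    favorable = sum(1 for w in patterns if "H" in w)
    return Fraction(favorable, len(patterns))

values = [prob_head_within(n) for n in range(1, 11)]
for n, v in enumerate(values, start=1):
    assert v == 1 - Fraction(1, 2 ** n)             # matches 1 - 2^-n
print([str(v) for v in values[:4]])
```

The list is strictly increasing and approaches $1$, which is the limit used in the continuity argument.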
The next section describes general ways to construct probability spaces rather than merely use them.
## Constructing Probability Spaces
### Finite and Normalized Models
In finite experiments, the most common modeling assumption is symmetry. If no outcome is distinguished from any other, assigning equal mass to each singleton should make an event's probability depend only on how many outcomes it contains. Normalizing by the total number of outcomes turns that counting rule into a probability measure.
[definition: Uniform Probability on a Finite Set]
Let $\Omega$ be a nonempty finite set and let $\mathcal F=\mathcal P(\Omega)$. The uniform probability measure on $\Omega$ is the probability measure $\mathbb P:\mathcal P(\Omega)\to[0,1]$ defined by
\begin{align*}
\mathbb P(A)=\frac{|A|}{|\Omega|}\quad\text{for every }A\subset\Omega.
\end{align*}
[/definition]
Uniform finite probability turns many questions into counting problems. The next example shows how the probability space determines whether ordered or unordered outcomes are being counted.
[example: Rolling Two Dice]
Let $\Omega=\{1,2,3,4,5,6\}\times\{1,2,3,4,5,6\}$, let $\mathcal F=\mathcal P(\Omega)$, and use the uniform probability measure on the finite set $\Omega$. We compute the probability of the event
\begin{align*}
E=\{(i,j)\in\Omega:i+j=7\}.
\end{align*}
For each possible first die value $i\in\{1,2,3,4,5,6\}$, the equation $i+j=7$ is equivalent to
\begin{align*}
j=7-i.
\end{align*}
Substituting $i=1,2,3,4,5,6$ gives
\begin{align*}
7-1=6,\quad 7-2=5,\quad 7-3=4,\quad 7-4=3,\quad 7-5=2,\quad 7-6=1.
\end{align*}
Thus the outcomes in $E$ are exactly
\begin{align*}
E=\{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}.
\end{align*}
The six displayed ordered pairs are distinct, so
\begin{align*}
|E|=6.
\end{align*}
The sample space has $6$ choices for the first coordinate. For each first coordinate, it has $6$ choices for the second coordinate, so by the multiplication rule for finite products,
\begin{align*}
|\Omega|=6\cdot6.
\end{align*}
Multiplying gives
\begin{align*}
6\cdot6=36.
\end{align*}
Hence
\begin{align*}
|\Omega|=36.
\end{align*}
By the definition of the uniform probability measure on a finite set,
\begin{align*}
\mathbb P(E)=\frac{|E|}{|\Omega|}.
\end{align*}
Substituting $|E|=6$ and $|\Omega|=36$ gives
\begin{align*}
\mathbb P(E)=\frac{6}{36}.
\end{align*}
Since $36=6\cdot6$,
\begin{align*}
\frac{6}{36}=\frac{6}{6\cdot6}.
\end{align*}
Cancelling the common positive factor $6$ gives
\begin{align*}
\frac{6}{6\cdot6}=\frac{1}{6}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(E)=\frac16.
\end{align*}
Thus the probability of rolling a sum of $7$ is $1/6$; the ordered outcome space counts $(1,6)$ and $(6,1)$ as different outcomes.
[/example]
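The count can be reproduced by enumeration. The sketch below also shows what happens if one wrongly treats the $21$ unordered pairs as equally likely. The variable names are illustrative assumptions, and the unordered model appears only as a contrast, not as a model of fair dice.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product

# Ordered model: 36 equally likely outcomes (i, j)
omega = list(product(range(1, 7), repeat=2))
E = [(i, j) for (i, j) in omega if i + j == 7]
prob_ordered = Fraction(len(E), len(omega))

# Contrast: unordered pairs treated as equally likely (a different model)
unordered = list(combinations_with_replacement(range(1, 7), 2))
E_unordered = [pair for pair in unordered if sum(pair) == 7]
prob_unordered = Fraction(len(E_unordered), len(unordered))

print(len(omega), len(E), prob_ordered)
print(len(unordered), len(E_unordered), prob_unordered)
```

The two answers differ because the choice of sample space and measure is part of the model, not a bookkeeping detail.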
Counting worked for dice because the outcome set was finite and each elementary outcome had equal mass. For intervals, regions, or other continuous spaces, the relevant size may instead be length, area, or another measure that is finite and positive on the region, and raw size is not yet a probability because the whole region need not have size $1$. The required construction is to restrict attention to the region of interest and divide all measurable sizes by its total size.
[definition: Normalized Restriction of a Measure]
Let $(E,\mathcal E,\mu)$ be a [measure space](/page/Measure%20Space) and let $S\in\mathcal E$ satisfy $0<\mu(S)<\infty$. Let $\mathcal E_S=\{A\subset S:A\in\mathcal E\}$. The normalized restriction of $\mu$ to $S$ is the probability measure $\mathbb P:\mathcal E_S\to[0,1]$ on $(S,\mathcal E_S)$ defined by
\begin{align*}
\mathbb P(A)=\frac{\mu(A)}{\mu(S)}\quad\text{for every }A\in\mathcal E_S.
\end{align*}
[/definition]
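A finite analogue makes the construction concrete. The sketch below assumes a measure on a finite set given by nonnegative point weights, so every subset is measurable and $\mu(S)<\infty$ holds automatically. The names `normalized_restriction` and `mu` and the sample weights are illustrative assumptions.

```python
from fractions import Fraction

def normalized_restriction(mu: dict, S):
    """Return P(A) = mu(A) / mu(S) for subsets A of S, requiring mu(S) > 0."""
    S = frozenset(S)
    total = Fraction(sum(mu[x] for x in S))
    if total <= 0:
        raise ValueError("need 0 < mu(S)")

    def P(A) -> Fraction:
        A = frozenset(A)
        if not A <= S:
            raise ValueError("A must be a subset of S")
        return Fraction(sum(mu[x] for x in A)) / total

    return P

mu = {"a": 2, "b": 3, "c": 5, "d": 10}        # assumed point weights
P = normalized_restriction(mu, {"a", "b", "c"})
print(P({"a"}), P({"b", "c"}), P({"a", "b", "c"}))
```

The point `"d"` lies outside $S$, so its weight plays no role after restriction.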
Normalization turns length, area, volume, and other measures into probabilities on any region of finite positive size. The next construction moves probability in a different way: through an observable map.
### Pushforwards and Laws
Often the sample point itself is not what we record. We observe a function of it, so events about the observed value must be translated back into events about the original sample point before their probabilities can be evaluated. This translation produces a probability measure on the observation space.
[definition: Pushforward Probability Measure]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $(E,\mathcal E)$ be a measurable space, and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be measurable. The pushforward probability measure $\mathbb P\circ X^{-1}$ on $(E,\mathcal E)$ is defined by
\begin{align*}
(\mathbb P\circ X^{-1})(B)=\mathbb P(X^{-1}(B))\quad\text{for every }B\in\mathcal E.
\end{align*}
[/definition]
### Laws as Pushforwards
The pushforward records probabilities of events phrased only in terms of the observed value. In probability this measure has a special name because it lets one discuss the distribution of an observation without keeping the original sample space in view.
[definition: Law of a Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $(E,\mathcal E)$ be a measurable space, and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a measurable map. The law of $X$ is the probability measure $\mu_X:\mathcal E\to[0,1]$ on $(E,\mathcal E)$ defined by
\begin{align*}
\mu_X(B)=\mathbb P(X^{-1}(B))\quad\text{for every }B\in\mathcal E.
\end{align*}
[/definition]
Calling this object a law emphasizes that it is itself a probability measure. That claim is not automatic from terminology alone: the construction measures sets in the observation space by pulling them back to the original space, so one must know that preimages preserve the whole space, disjointness, and countable unions, which are the set operations appearing in the probability axioms.
The obstruction is countable additivity on the target space. If disjoint measurable sets in $E$ are observed, their preimages must be disjoint events in $\Omega$ whose probabilities can be added, and the total mass of the target must still be one. Without this compatibility, a measurable observation would produce only a set function rather than a probability distribution.
[quotetheorem:9350]
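On a finite space the law of $X$ can be computed by accumulating the mass of each outcome at its image. The sketch below assumes a probability measure stored as a dictionary of point masses. The function name `pushforward` and the sample experiments are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def pushforward(P: dict, X) -> dict:
    """Law of X: the mass at a value v is P(X^{-1}({v}))."""
    law = defaultdict(Fraction)            # Fraction() is 0
    for outcome, mass in P.items():
        law[X(outcome)] += mass
    return dict(law)

# Law of the sum of two fair dice
dice = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}
law_sum = pushforward(dice, lambda w: w[0] + w[1])

# Two different experiments with the same law on {True, False}
die = {i: Fraction(1, 6) for i in range(1, 7)}
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
law_even = pushforward(die, lambda i: i % 2 == 0)
law_heads = pushforward(coin, lambda c: c == "H")

print(law_sum[7], sum(law_sum.values()), law_even == law_heads)
```

The total mass of `law_sum` is again $1$ because the preimages of the values $2,\ldots,12$ partition the original sample space.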
Pushforwards explain why different experiments can produce the same distribution. The next section makes the observable maps themselves part of the language.
## Random Variables and Information
A probability space assigns probabilities to events, but data usually arrive as values: numbers, vectors, labels, or paths. To ask for the probability that an observed value lies in a measurable set, the preimage of that set must be an event in the original probability space. This compatibility condition is measurability.
[definition: Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $(E,\mathcal E)$ be a measurable space. A random variable with values in $E$ is a measurable map $X:(\Omega,\mathcal F)\to(E,\mathcal E)$. When $E=\mathbb R$ and $\mathcal E=\mathcal B(\mathbb R)$, $X$ is called a real-valued random variable.
[/definition]
Measurability ensures that every measurable question about $X$ pulls back to an event in $\Omega$. Observing $X$ may reveal less than the full outcome, so the relevant event collection is only the family of questions whose answers are determined by the value of $X$.
[definition: Generated $\sigma$-Algebra of a Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $(E,\mathcal E)$ be a measurable space, and let
\begin{align*}
X:(\Omega,\mathcal F)\to(E,\mathcal E)
\end{align*}
be a random variable. The $\sigma$-algebra generated by $X$ is $\sigma(X)=\{X^{-1}(B):B\in\mathcal E\}$.
[/definition]
The generated $\sigma$-algebra represents the information revealed by $X$. The next example shows how observing a function of an outcome can hide part of the original randomness.
[example: Information Lost by a Sum]
Roll two fair dice on $\Omega=\{1,\ldots,6\}^2$ with $\mathcal F=\mathcal P(\Omega)$ and uniform probability, equip $\{2,\ldots,12\}$ with its discrete $\sigma$-algebra, and define $S:\Omega\to\{2,\ldots,12\}$ by
\begin{align*}
S(i,j)=i+j.
\end{align*}
We compute which events are visible after observing only the sum $S$. By the definition of the generated $\sigma$-algebra,
\begin{align*}
\sigma(S)=\{S^{-1}(B):B\subset\{2,\ldots,12\}\}.
\end{align*}
The event that the sum is $7$ is
\begin{align*}
\{S=7\}=\{(i,j)\in\Omega:S(i,j)=7\}.
\end{align*}
Since $S^{-1}(\{7\})=\{(i,j)\in\Omega:S(i,j)\in\{7\}\}$, and the condition $S(i,j)\in\{7\}$ is equivalent to $S(i,j)=7$, we have
\begin{align*}
\{S=7\}=S^{-1}(\{7\}).
\end{align*}
Also $\{7\}\subset\{2,\ldots,12\}$, so $S^{-1}(\{7\})$ is one of the sets in the displayed description of $\sigma(S)$. Therefore
\begin{align*}
\{S=7\}\in\sigma(S).
\end{align*}
We now show that the singleton $\{(1,6)\}$ is not in $\sigma(S)$. Suppose, for contradiction, that
\begin{align*}
\{(1,6)\}\in\sigma(S).
\end{align*}
Then, by the definition of $\sigma(S)$, there is some subset $B\subset\{2,\ldots,12\}$ such that
\begin{align*}
\{(1,6)\}=S^{-1}(B).
\end{align*}
Because $(1,6)\in\{(1,6)\}$, the equality above gives
\begin{align*}
(1,6)\in S^{-1}(B).
\end{align*}
By the definition of preimage, this means
\begin{align*}
S(1,6)\in B.
\end{align*}
Using the formula for $S$,
\begin{align*}
S(1,6)=1+6.
\end{align*}
Since $1+6=7$,
\begin{align*}
S(1,6)=7.
\end{align*}
Thus
\begin{align*}
7\in B.
\end{align*}
Now compute the sum of the different outcome $(2,5)$:
\begin{align*}
S(2,5)=2+5.
\end{align*}
Since $2+5=7$,
\begin{align*}
S(2,5)=7.
\end{align*}
Because $7\in B$, this implies
\begin{align*}
S(2,5)\in B.
\end{align*}
Again by the definition of preimage,
\begin{align*}
(2,5)\in S^{-1}(B).
\end{align*}
Using $S^{-1}(B)=\{(1,6)\}$, we get
\begin{align*}
(2,5)\in\{(1,6)\}.
\end{align*}
Membership in the singleton $\{(1,6)\}$ means equality with its unique element, so this would force
\begin{align*}
(2,5)=(1,6).
\end{align*}
But equality of ordered pairs requires equality of both coordinates, and the first coordinates give
\begin{align*}
2\ne1.
\end{align*}
Hence
\begin{align*}
(2,5)\ne(1,6),
\end{align*}
a contradiction. Therefore
\begin{align*}
\{(1,6)\}\notin\sigma(S).
\end{align*}
Thus observing $S$ can decide events determined only by the sum, such as $\{S=7\}$, but it cannot distinguish individual outcomes with the same sum; $\sigma(S)$ contains the information in the sum, not the full dice roll.
[/example]
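The membership test in this example can be checked mechanically. The following is a minimal Python sketch, not part of the original argument; the helper names `preimage` and `in_sigma_S` are hypothetical. It relies on the elementary fact that an event $A$ has the form $S^{-1}(B)$ for some $B$ exactly when $A=S^{-1}(S(A))$, that is, when $A$ contains every outcome sharing a sum with one of its members.

```python
from itertools import product

# Sample space for two dice and the sum map S(i, j) = i + j.
omega = frozenset(product(range(1, 7), repeat=2))


def S(outcome):
    i, j = outcome
    return i + j


def preimage(values):
    """Return S^{-1}(B) for a set B of possible sums."""
    return frozenset(w for w in omega if S(w) in values)


def in_sigma_S(event):
    """Test A in sigma(S) via the criterion A == S^{-1}(S(A))."""
    event = frozenset(event)
    image = {S(w) for w in event}
    return preimage(image) == event


sum_is_seven = preimage({7})
print(sorted(sum_is_seven))      # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(in_sigma_S(sum_is_seven))  # True
print(in_sigma_S({(1, 6)}))      # False: (2, 5) has the same sum but is missing
```

The singleton fails the test for the same reason as in the proof: its image is $\{7\}$, and the preimage of $\{7\}$ contains five further outcomes.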
A real-valued random variable assigns a number to each outcome, but probability statements alone do not yet say how to summarize those numbers by a single typical value. The obstruction is that outcomes may have unequal probabilities, or may form a continuous space where ordinary finite averaging is unavailable. Measure-theoretic integration with respect to $\mathbb P$ is the device that weights the values correctly.
[definition: Expectation]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $X:\Omega\to\mathbb R$ be an integrable real-valued random variable. The expectation of $X$ is
\begin{align*}
\mathbb E[X]=\int_\Omega X\,d\mathbb P.
\end{align*}
[/definition]
Expectation is therefore not a separate primitive; it is measure-theoretic integration on the underlying probability space. The next example recovers the elementary average of a die roll.
[example: Expectation of a Die Roll]
Let $\Omega=\{1,2,3,4,5,6\}$ with the uniform probability measure on $\mathcal P(\Omega)$, and define $X:\Omega\to\mathbb R$ by $X(\omega)=\omega$. We compute $\mathbb E[X]$ from the weighted-sum formula for integration against a probability measure on a finite set:
\begin{align*}
\mathbb E[X]=\sum_{\omega\in\Omega}X(\omega)\mathbb P(\{\omega\}).
\end{align*}
Since the elements of $\Omega$ are exactly $1,2,3,4,5,6$, this becomes
\begin{align*}
\mathbb E[X]=\sum_{k=1}^{6}X(k)\mathbb P(\{k\}).
\end{align*}
For each $k\in\{1,\ldots,6\}$, the definition of $X$ gives $X(k)=k$, so
\begin{align*}
\mathbb E[X]=\sum_{k=1}^{6}k\mathbb P(\{k\}).
\end{align*}
For the uniform probability measure on the six-point set $\Omega$, each singleton $\{k\}$ has cardinality $1$, while $\Omega$ has cardinality $6$. Hence
\begin{align*}
\mathbb P(\{k\})=\frac{|\{k\}|}{|\Omega|}=\frac{1}{6}.
\end{align*}
Substituting this value for $k=1,\ldots,6$ gives
\begin{align*}
\mathbb E[X]=1\cdot\frac16+2\cdot\frac16+3\cdot\frac16+4\cdot\frac16+5\cdot\frac16+6\cdot\frac16.
\end{align*}
Factoring out the common factor $1/6$ gives
\begin{align*}
\mathbb E[X]=\frac{1+2+3+4+5+6}{6}.
\end{align*}
Adding the numerator terms in order,
\begin{align*}
1+2=3.
\end{align*}
Then
\begin{align*}
3+3=6.
\end{align*}
Then
\begin{align*}
6+4=10.
\end{align*}
Then
\begin{align*}
10+5=15.
\end{align*}
Finally,
\begin{align*}
15+6=21.
\end{align*}
Therefore
\begin{align*}
1+2+3+4+5+6=21.
\end{align*}
Substituting this sum gives
\begin{align*}
\mathbb E[X]=\frac{21}{6}.
\end{align*}
Since $21=3\cdot7$ and $6=3\cdot2$,
\begin{align*}
\frac{21}{6}=\frac{3\cdot7}{3\cdot2}.
\end{align*}
Cancelling the common nonzero factor $3$ gives
\begin{align*}
\frac{3\cdot7}{3\cdot2}=\frac72.
\end{align*}
Thus
\begin{align*}
\mathbb E[X]=\frac72.
\end{align*}
The expected value of a fair die roll is therefore $7/2$, the probability-weighted average of the six possible outcomes.
[/example]
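The weighted-sum formula translates directly into code. The sketch below uses exact rational arithmetic and is only an illustration; the function name `expectation` and the loaded die are assumptions introduced here, not objects from the text.

```python
from fractions import Fraction


def expectation(f, point_mass):
    """Finite weighted sum: sum of f(w) * P({w}) over all outcomes w."""
    return sum(f(w) * p for w, p in point_mass.items())


def X(w):
    return w


# Uniform measure on {1, ..., 6}: each singleton has probability 1/6.
fair = {w: Fraction(1, 6) for w in range(1, 7)}
print(expectation(X, fair))    # 7/2

# Hypothetical loaded die: face 6 has probability 1/2, the others 1/10 each.
loaded = {w: Fraction(1, 10) for w in range(1, 6)}
loaded[6] = Fraction(1, 2)
print(expectation(X, loaded))  # 9/2
```

The second call shows why the weights matter: the same random variable $X$ has a different expectation once the measure changes.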
Expectations become more powerful when several observations coexist on one probability space. The next section explains how the probability space encodes independence among events and random variables.
## Independence and Products
Events can overlap without one causing the other, and disjointness is the wrong notion of probabilistic unrelatedness: learning that one of two disjoint events occurred rules out the other. The goal is to express that learning whether one event occurred leaves the probability of the other unchanged, while avoiding conditional probabilities, which are undefined when the conditioning event has probability zero. Multiplicative factorization agrees with the conditional formulation whenever the conditional probabilities are defined, and it still makes sense for all events.
[definition: Independent Events]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Events $A,B\in\mathcal F$ are independent if $\mathbb P(A\cap B)=\mathbb P(A)\mathbb P(B)$. A family $(A_i)_{i\in I}$ of events is independent if for every finite subset $J\subset I$, $\mathbb P(\bigcap_{j\in J}A_j)=\prod_{j\in J}\mathbb P(A_j)$.
[/definition]
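The requirement that the product rule hold for every finite subfamily is stronger than requiring it for pairs. The following Python sketch illustrates this with a standard two-coin example that is not taken from the text; the events `A`, `B`, `C` and the helper `is_independent_family` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

# Two fair coin flips with the uniform measure on four outcomes.
omega = set(product("HT", repeat=2))


def prob(event):
    return Fraction(len(event), len(omega))


A = {w for w in omega if w[0] == "H"}   # first flip is heads
B = {w for w in omega if w[1] == "H"}   # second flip is heads
C = {w for w in omega if w[0] == w[1]}  # the two flips agree


def is_independent_family(events):
    """Check the product rule on every subfamily with at least two members.

    Subfamilies of size 0 or 1 satisfy the rule trivially.
    """
    for size in range(2, len(events) + 1):
        for sub in combinations(events, size):
            intersection = omega.intersection(*sub)
            expected = Fraction(1)
            for event in sub:
                expected *= prob(event)
            if prob(intersection) != expected:
                return False
    return True


print(is_independent_family([A, B]))     # True
print(is_independent_family([A, C]))     # True
print(is_independent_family([B, C]))     # True
print(is_independent_family([A, B, C]))  # False: P(A∩B∩C) = 1/4, not 1/8
```

Each pair is independent, yet the triple is not, because any two of the events determine the third.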
Independence should not be confused with mutual exclusivity. The next example shows that disjoint non-null events are dependent because occurrence of one rules out the other.
[example: Disjoint Non-Null Events Are Dependent]
Let $\Omega=\{1,2,3,4\}$ with the uniform probability measure on $\mathcal P(\Omega)$. Take $A=\{1,2\}$ and $B=\{3,4\}$. We compute $\mathbb P(A\cap B)$ and compare it with $\mathbb P(A)\mathbb P(B)$.
First, no outcome belongs to both $A$ and $B$. Indeed, the only elements of $A$ are $1$ and $2$, while the only elements of $B$ are $3$ and $4$, and
\begin{align*}
1\ne3,\quad 1\ne4,\quad 2\ne3,\quad 2\ne4.
\end{align*}
Hence
\begin{align*}
A\cap B=\varnothing.
\end{align*}
Since the uniform probability measure on a finite set is defined by $\mathbb P(E)=|E|/|\Omega|$ for every $E\subset\Omega$,
\begin{align*}
\mathbb P(A\cap B)=\frac{|A\cap B|}{|\Omega|}.
\end{align*}
Using $A\cap B=\varnothing$, this becomes
\begin{align*}
\mathbb P(A\cap B)=\frac{|\varnothing|}{|\Omega|}.
\end{align*}
The empty set has no elements, so
\begin{align*}
|\varnothing|=0.
\end{align*}
Also $\Omega=\{1,2,3,4\}$ has four elements, so
\begin{align*}
|\Omega|=4.
\end{align*}
Substituting these two cardinalities gives
\begin{align*}
\mathbb P(A\cap B)=\frac{0}{4}.
\end{align*}
Since $0/4=0$,
\begin{align*}
\mathbb P(A\cap B)=0.
\end{align*}
Now compute the individual probabilities. The set $A=\{1,2\}$ has two elements, so
\begin{align*}
|A|=2.
\end{align*}
Therefore
\begin{align*}
\mathbb P(A)=\frac{|A|}{|\Omega|}.
\end{align*}
Substituting $|A|=2$ and $|\Omega|=4$ gives
\begin{align*}
\mathbb P(A)=\frac{2}{4}.
\end{align*}
Since $2/4=1/2$,
\begin{align*}
\mathbb P(A)=\frac12.
\end{align*}
Similarly, the set $B=\{3,4\}$ has two elements, so
\begin{align*}
|B|=2.
\end{align*}
Using the same uniform probability formula,
\begin{align*}
\mathbb P(B)=\frac{|B|}{|\Omega|}.
\end{align*}
Substituting $|B|=2$ and $|\Omega|=4$ gives
\begin{align*}
\mathbb P(B)=\frac{2}{4}.
\end{align*}
Since $2/4=1/2$,
\begin{align*}
\mathbb P(B)=\frac12.
\end{align*}
Multiplying the two probabilities,
\begin{align*}
\mathbb P(A)\mathbb P(B)=\frac12\cdot\frac12.
\end{align*}
Multiplying numerators and denominators gives
\begin{align*}
\frac12\cdot\frac12=\frac{1\cdot1}{2\cdot2}.
\end{align*}
Since $1\cdot1=1$ and $2\cdot2=4$,
\begin{align*}
\mathbb P(A)\mathbb P(B)=\frac14.
\end{align*}
By the definition of independent events, independence would require
\begin{align*}
\mathbb P(A\cap B)=\mathbb P(A)\mathbb P(B).
\end{align*}
But the computations above give
\begin{align*}
\mathbb P(A\cap B)=0
\end{align*}
and
\begin{align*}
\mathbb P(A)\mathbb P(B)=\frac14.
\end{align*}
Since $0\ne1/4$, the required equality fails. Thus $A$ and $B$ are not independent: they are disjoint but non-null, so occurrence of one rules out occurrence of the other.
[/example]
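The comparison in this example reduces to one equality test. The sketch below repeats it with exact fractions; the helper `independent` and the contrasting event `C` are hypothetical additions, included only to show a pair on the same space that does satisfy the product rule.

```python
from fractions import Fraction

omega = {1, 2, 3, 4}


def prob(event):
    """Uniform probability |E| / |Omega| on the finite set omega."""
    return Fraction(len(event & omega), len(omega))


def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)


A, B = {1, 2}, {3, 4}
print(prob(A & B), prob(A) * prob(B))  # 0 1/4
print(independent(A, B))               # False

# Hypothetical contrast: C overlaps A in exactly one point.
C = {1, 3}
print(prob(A & C), prob(A) * prob(C))  # 1/4 1/4
print(independent(A, C))               # True
```

Independence here requires the events to overlap in the right proportion, which disjoint non-null events can never do.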
For random variables, independence cannot be checked only by comparing the values themselves, because the values may live in arbitrary measurable spaces and the observable questions are events such as $\{X\in B\}$.
A random variable determines a whole collection of events, not just a list of possible values. To say that several observations are unrelated, any finite selection of questions, each determined by a different observation, must satisfy the product rule. Generated $\sigma$-algebras package exactly those observable questions.
[definition: Independent Random Variables]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. For each $i\in I$, let $(E_i,\mathcal E_i)$ be a measurable space and let $X_i:(\Omega,\mathcal F)\to(E_i,\mathcal E_i)$ be a random variable. The family $(X_i)_{i\in I}$ is independent if the $\sigma$-algebras $(\sigma(X_i))_{i\in I}$ are independent, meaning that for every finite subset $J\subset I$ and every choice of events $A_j\in\sigma(X_j)$, $\mathbb P(\bigcap_{j\in J}A_j)=\prod_{j\in J}\mathbb P(A_j)$.
[/definition]
Recognizing independence inside an existing probability space is not enough when the task is to build a joint model from separate experiments. The obstruction is that the two original spaces have different outcome sets, so events involving both experiments must live on a new space whose rectangles have the expected factorized probabilities and whose measurable sets are large enough for countable operations.
[definition: Product Probability Space]
Let $(\Omega_1,\mathcal F_1,\mathbb P_1)$ and $(\Omega_2,\mathcal F_2,\mathbb P_2)$ be probability spaces. Their product probability space is the probability space $(\Omega_1\times\Omega_2,\mathcal F_1\otimes\mathcal F_2,\mathbb P_1\otimes\mathbb P_2)$.
Here $\mathcal F_1\otimes\mathcal F_2$ denotes the product $\sigma$-algebra: the smallest $\sigma$-algebra on $\Omega_1\times\Omega_2$ containing every measurable rectangle $A_1\times A_2$ with $A_1\in\mathcal F_1$ and $A_2\in\mathcal F_2$. The measure $\mathbb P_1\otimes\mathbb P_2:\mathcal F_1\otimes\mathcal F_2\to[0,1]$ denotes the unique probability measure extending the rectangle rule
\begin{align*}
(\mathbb P_1\otimes\mathbb P_2)(A_1\times A_2)=\mathbb P_1(A_1)\mathbb P_2(A_2)
\end{align*}
for all $A_1\in\mathcal F_1$ and $A_2\in\mathcal F_2$. The existence and uniqueness of this extension are what make the construction a genuine probability space rather than just a rule on rectangles.
[/definition]
The rectangle rule is the mathematical expression of running the two mechanisms separately: the probability of requiring both coordinates to land in prescribed events factors into the two individual probabilities. In the product space, this already gives independence of the coordinate event families, because events of the form $A_1\times\Omega_2$ and $\Omega_1\times A_2$ have intersection $A_1\times A_2$ and therefore factorized probability.
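For finite spaces the construction can be carried out by hand, since the product $\sigma$-algebra of two power sets is again the full power set and the product measure is determined by its point masses. The following Python sketch assumes two hypothetical experiments, a biased coin and a fair die, and checks the rectangle rule and the cylinder-set identity described above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal spaces, given by their point masses.
P1 = {"H": Fraction(1, 3), "T": Fraction(2, 3)}  # biased coin
P2 = {k: Fraction(1, 6) for k in range(1, 7)}    # fair die

# Product measure on Omega1 x Omega2: multiply the point masses.
P = {(a, b): P1[a] * P2[b] for a, b in product(P1, P2)}


def measure(point_mass, event):
    """Sum the point masses of the outcomes in the event."""
    return sum((point_mass[w] for w in event), Fraction(0))


A1 = {"H"}
A2 = {2, 4, 6}
rectangle = set(product(A1, A2))       # A1 x A2
left_cylinder = set(product(A1, P2))   # A1 x Omega2
right_cylinder = set(product(P1, A2))  # Omega1 x A2

print(measure(P, set(P)))                     # 1
print(measure(P, rectangle))                  # 1/6
print(measure(P1, A1) * measure(P2, A2))      # 1/6
print(left_cylinder & right_cylinder == rectangle)  # True
```

The two cylinder events have probabilities $1/3$ and $1/2$ under the product measure, and their intersection is the rectangle with probability $1/6$, so they are independent.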
The criterion quoted below packages the same idea in a more portable form for random variables. Instead of checking independence directly inside one sample space, it says that independence is equivalent to the joint law splitting as a product of the marginal laws. This is the form used when observations are recorded as random variables rather than as bare coordinates of a product outcome space.
The practical question is how to recognize independence once the random quantities have been pushed forward from their original sample space. At that point the coordinate rectangles in a product construction are no longer visible, so the needed test must be stated in terms of distributions alone: the joint distribution must contain no more information than the two marginal distributions taken separately.
[quotetheorem:4861]
The criterion is useful because it separates a modeling claim from a calculation. To prove that two recorded quantities are independent, it is enough to identify their joint distribution and check that it is the product of the marginals; conversely, any dependence must appear as a failure of that factorization. For example, two coordinates in a product model satisfy the product law by construction, while two functions of the same underlying outcome may have the right marginal laws but still fail the joint-law test. This distinction matters in repeated-trial models: identical marginal behavior does not force independence; the product structure of the joint law does.
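On a finite sample space the joint-law test amounts to comparing the joint probability of each pair of values with the product of the marginal probabilities. The sketch below is an illustration under that finite assumption; the helpers `law` and `joint_factorizes` and the three random variables are hypothetical names chosen here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Two fair dice with the uniform measure on 36 outcomes.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))


def law(f):
    """Distribution of f, as a dict mapping each value to its probability."""
    counts = Counter(f(w) for w in omega)
    return {value: count * p for value, count in counts.items()}


def joint_factorizes(f, g):
    """Check P(f=a, g=b) == P(f=a) * P(g=b) for all values a, b."""
    joint = law(lambda w: (f(w), g(w)))
    law_f, law_g = law(f), law(g)
    return all(
        joint.get((a, b), Fraction(0)) == law_f[a] * law_g[b]
        for a in law_f
        for b in law_g
    )


def first(w):
    return w[0]


def second(w):
    return w[1]


def total(w):
    return w[0] + w[1]


print(joint_factorizes(first, second))  # True: coordinates of a product model
print(joint_factorizes(first, total))   # False: the sum depends on the first die
print(law(first) == law(second))        # True: identical marginal laws
print(joint_factorizes(first, first))   # False: same marginals, no factorization
```

The last two lines show the point made above: `first` paired with itself has the same marginal laws as `first` paired with `second`, but its joint law is concentrated on the diagonal and does not factor.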
The final thematic section turns to another feature of probability spaces: events that are not impossible as sets but have probability zero.
## Null Events and Almost Sure Language
In continuous probability, events may be nonempty and still have probability zero. This creates a distinction that ordinary set language does not capture: an event can be possible as a subset of the sample space while still being too small to affect probabilities. The following definition names that probabilistic notion of negligible size.
[definition: Null Event]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. An event $N\in\mathcal F$ is a null event if $\mathbb P(N)=0$.
[/definition]
Null events are common in continuous models, where each individual value may have probability zero. The next example shows why countability matters when combining null events.
[example: Points Are Null in a Uniform Interval]
Let $\Omega=[0,1]$, let $\mathcal F=\mathcal B([0,1])$, and define $\mathbb P(A)=\mathcal L^1(A)$ for every Borel set $A\subset[0,1]$. Fix $x\in[0,1]$. The singleton $\{x\}$ is Borel in $[0,1]$, because $\{x\}=[0,1]\cap\{x\}$ and $\{x\}$ is closed in $\mathbb R$.
For every $m\in\mathbb N$, if $y\in\{x\}$, then $y=x$. Hence
\begin{align*}
x-\frac1m\le y\le x+\frac1m.
\end{align*}
Therefore
\begin{align*}
\{x\}\subset \left[x-\frac1m,x+\frac1m\right].
\end{align*}
Lebesgue measure is nonnegative, so
\begin{align*}
0\le \mathcal L^1(\{x\}).
\end{align*}
By monotonicity of Lebesgue measure and the displayed containment,
\begin{align*}
\mathcal L^1(\{x\})\le \mathcal L^1\left(\left[x-\frac1m,x+\frac1m\right]\right).
\end{align*}
The length of the interval on the right is its right endpoint minus its left endpoint:
\begin{align*}
\mathcal L^1\left(\left[x-\frac1m,x+\frac1m\right]\right)=\left(x+\frac1m\right)-\left(x-\frac1m\right).
\end{align*}
Expanding the subtraction gives
\begin{align*}
\left(x+\frac1m\right)-\left(x-\frac1m\right)=x+\frac1m-x+\frac1m.
\end{align*}
The terms $x$ and $-x$ cancel, so
\begin{align*}
x+\frac1m-x+\frac1m=\frac1m+\frac1m.
\end{align*}
Adding the two equal fractions gives
\begin{align*}
\frac1m+\frac1m=\frac2m.
\end{align*}
Thus, for every $m\in\mathbb N$,
\begin{align*}
0\le \mathcal L^1(\{x\})\le \frac2m.
\end{align*}
Letting $m\to\infty$, the right side satisfies $2/m\to0$, while $\mathcal L^1(\{x\})$ is independent of $m$. Hence the only possible value is
\begin{align*}
\mathcal L^1(\{x\})=0.
\end{align*}
Using the definition of $\mathbb P$,
\begin{align*}
\mathbb P(\{x\})=\mathcal L^1(\{x\}).
\end{align*}
Substituting $\mathcal L^1(\{x\})=0$ gives
\begin{align*}
\mathbb P(\{x\})=0.
\end{align*}
Now consider the union of all singleton sets in the interval. If $y\in[0,1]$, then $y\in\{y\}$, and since $y$ is one of the indices in $[0,1]$, this gives
\begin{align*}
y\in\bigcup_{x\in[0,1]}\{x\}.
\end{align*}
Therefore
\begin{align*}
[0,1]\subset \bigcup_{x\in[0,1]}\{x\}.
\end{align*}
Conversely, if $y\in\bigcup_{x\in[0,1]}\{x\}$, then there exists $x\in[0,1]$ such that $y\in\{x\}$. Membership in the singleton $\{x\}$ means $y=x$, and since $x\in[0,1]$, we get $y\in[0,1]$. Therefore
\begin{align*}
\bigcup_{x\in[0,1]}\{x\}\subset[0,1].
\end{align*}
The two containments imply
\begin{align*}
[0,1]=\bigcup_{x\in[0,1]}\{x\}.
\end{align*}
Finally, using the definition of $\mathbb P$ on the whole interval,
\begin{align*}
\mathbb P([0,1])=\mathcal L^1([0,1]).
\end{align*}
The length of $[0,1]$ is its right endpoint minus its left endpoint, so
\begin{align*}
\mathcal L^1([0,1])=1-0.
\end{align*}
Since $1-0=1$,
\begin{align*}
\mathbb P([0,1])=1.
\end{align*}
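Thus every individual point of the uniform interval is a null event, yet the union of all those points is the whole interval and has probability $1$. This does not contradict countable additivity, because $[0,1]$ is uncountable and the union is therefore not a countable one; countability is essential when combining null events.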
[/example]
If null events are negligible, then a property may be probabilistically decisive even when it does not hold for every outcome.
The obstruction is that set-theoretic certainty is often too strict in continuous models: excluding a null exceptional set should not change the probabilistic content of a statement. The precise issue is how to describe an event whose complement is null, so that the property holds with full probability rather than absolute certainty.
[definition: Almost Sure Event]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. An event $A\in\mathcal F$ occurs almost surely if $\mathbb P(A)=1$.
[/definition]
Almost sure language would be fragile if it could handle only one property at a time. In applications one usually imposes countably many requirements simultaneously, so the key question is whether the union of the corresponding exceptional null sets is still negligible. The following result provides that countable stability.
[quotetheorem:1108]
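The theorem explains why countably many almost sure properties can be imposed simultaneously: each exceptional set is null, so their countable union is null, and the intersection of the almost sure events still has probability $1$. This is essential in convergence theorems and in the study of stochastic processes, where a statement about every term of a sequence must hold at once.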
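The bookkeeping behind this statement can be illustrated on a small example. The Python sketch below is only a finite illustration and cannot exhibit a genuinely countably infinite family; the sample space, the point masses, and the events are hypothetical choices made for the sketch.

```python
from fractions import Fraction

# Hypothetical space: all mass sits on outcomes 0 and 1; outcomes 2..9 are null.
omega = set(range(10))
mass = {w: Fraction(0) for w in omega}
mass[0] = Fraction(1, 2)
mass[1] = Fraction(1, 2)


def prob(event):
    return sum((mass[w] for w in event), Fraction(0))


# Almost sure events A_n = omega minus {n}: each excludes one null point.
almost_sure = [omega - {n} for n in range(2, 10)]
exceptional = [omega - a for a in almost_sure]  # the null sets {n}

all_hold = set.intersection(*almost_sure)       # every property at once
any_fails = set.union(*exceptional)             # union of exceptional sets

print(all(prob(a) == 1 for a in almost_sure))   # True
print(prob(any_fails))                          # 0
print(prob(all_hold))                           # 1
print(all_hold == omega - any_fails)            # True
```

The intersection of the almost sure events is the complement of the union of the exceptional sets, so bounding that union by the sum of zeros is what keeps the intersection at probability $1$.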
## Beyond and Connected Topics
Probability spaces are the base layer for [random variables](/page/Random%20Variable), distributions, expectation, and conditioning. Once measurable maps are available, the same underlying space can carry numerical observations, random vectors, or random functions.
Measure-theoretic probability develops the integration tools that make the abstract definition useful: monotone convergence, dominated convergence, Radon-Nikodym derivatives, and [conditional expectation](/page/Conditional%20Expectation). These topics explain how probabilities, densities, and conditional laws interact.
Independence leads to product spaces, infinite sequences, laws of large numbers, central limit theorems, random walks, and martingales. The course page [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability) is a natural continuation after the measure-theoretic foundations are in place.
Statistics treats a model as a family of probability measures, often on a common measurable space and indexed by an unknown parameter. This connects probability spaces to likelihood, estimation, hypothesis testing, and Bayesian inference.
Stochastic processes add time and information flow. A process $(X_t)_{t\in T}$ is a family of random variables on one probability space, indexed by a time set $T$ such as $\mathbb N$ or $[0,\infty)$, while a filtration $(\mathcal F_t)_{t\in T}$ records what is known by each time $t$.
## References
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
Androma, [Cambridge IB Statistics](/page/Cambridge%20IB%20Statistics).
Androma, [Random Variable](/page/Random%20Variable).
Patrick Billingsley, *Probability and Measure* (1995).
David Williams, *Probability with Martingales* (1991).
Rick Durrett, *Probability: Theory and Examples* (2019).