A probability model starts with outcomes, but the objects we calculate with are usually observations of those outcomes. A die roll becomes a score, a patient record becomes a survival time, and a sequence of coin tosses becomes a number of successes. The mathematical problem is to decide when such an observation is compatible with the events to which probability has been assigned.
The issue is not cosmetic. A function from outcomes to values may assign a number to every outcome while a set of outcomes described through those numbers fails to be an event. Random variables are the class of functions for which every admissible value-based question has a probability.
[example: A Payoff Must Be an Event]
Let $\Omega=\{1,2,3,4,5,6\}$, let $\mathcal F=2^\Omega$, and assign equal point probabilities by $\mathbb P(\{i\})=1/6$ for each $i\in\Omega$. Define $X:\Omega\to\mathbb R$ by $X(i)=10$ for even $i$ and $X(i)=0$ for odd $i$. To compute the payoff event,
\begin{align*}
\{X>0\}
&=\{\omega\in\Omega:X(\omega)>0\}\\
&=\{\omega\in\Omega:X(\omega)=10\}\\
&=\{2,4,6\}.
\end{align*}
Since $\mathcal F=2^\Omega$, every subset of $\Omega$ belongs to $\mathcal F$, so $\{2,4,6\}\in\mathcal F$. Therefore the probability is defined, and finite additivity gives
\begin{align*}
\mathbb P(X>0)
&=\mathbb P(\{2,4,6\})\\
&=\mathbb P(\{2\})+\mathbb P(\{4\})+\mathbb P(\{6\})\\
&=\frac{1}{6}+\frac{1}{6}+\frac{1}{6}\\
&=\frac{3}{6}\\
&=\frac{1}{2}.
\end{align*}
Thus the numerical question "$X>0$" is legitimate because it pulls back to an event in the given sigma-algebra.
[/example]
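The preimage calculation can also be sketched in Python. This is a minimal illustration of the finite die model above; the names `omega`, `prob`, `payoff`, and `probability_of` are hypothetical helpers chosen for this sketch, not part of any library.

```python
from fractions import Fraction

# Finite model of one fair die: every subset of omega is an event.
omega = [1, 2, 3, 4, 5, 6]
prob = {i: Fraction(1, 6) for i in omega}

def payoff(i):
    """The observation X: 10 on even faces, 0 on odd faces."""
    return 10 if i % 2 == 0 else 0

def probability_of(predicate):
    """Return the preimage {w : predicate(X(w))} and its probability."""
    preimage = [w for w in omega if predicate(payoff(w))]
    return preimage, sum((prob[w] for w in preimage), Fraction(0))

event, p = probability_of(lambda x: x > 0)
print(event, p)  # [2, 4, 6] 1/2
```

The probability is computed only after the value-question has been turned into a list of outcomes, which mirrors the order of the argument in the example.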
The finite die example hides the measurability issue because every subset of $\Omega$ is an event. On larger outcome spaces that protection disappears, and a value assignment can fail before any calculation begins.
[example: A Non-Measurable Indicator]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $A\subset\Omega$ with $A\notin\mathcal F$. Define
\begin{align*}
X:\Omega&\to\mathbb R \\
\omega&\mapsto
\begin{cases}
1, & \omega\in A,\\
0, & \omega\notin A.
\end{cases}
\end{align*}
The value-event $\{X=1\}$ pulls back to
\begin{align*}
\{\omega\in\Omega:X(\omega)=1\}
&=\{\omega\in A:X(\omega)=1\}\cup\{\omega\in\Omega\setminus A:X(\omega)=1\}\\
&=A\cup\varnothing\\
&=A.
\end{align*}
Since $A\notin\mathcal F$, this preimage is not an event in the probability space. Therefore $\mathbb P(X=1)$ is not defined: the function assigns a real number to every outcome, but it is not a random variable with respect to the given sigma-algebra.
[/example]
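A finite toy model can show the same failure, provided the sigma-algebra is smaller than the power set. The sketch below rests on assumptions that are not in the example: the four-point space, the coarse sigma-algebra `F`, and the helper `is_measurable` are all hypothetical choices made for illustration.

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4})
# A deliberately coarse sigma-algebra: it cannot separate 1 from 2, or 3 from 4.
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}

def subsets(values):
    """All subsets (as tuples) of a finite collection of values."""
    values = list(values)
    return chain.from_iterable(
        combinations(values, r) for r in range(len(values) + 1)
    )

def is_measurable(X, sigma_algebra):
    """Check that the preimage of every set of attained values is an event."""
    values = {X(w) for w in omega}
    for B in subsets(values):
        preimage = frozenset(w for w in omega if X(w) in B)
        if preimage not in sigma_algebra:
            return False
    return True

A = frozenset({1})  # A is a subset of omega that is not in F
indicator_A = lambda w: 1 if w in A else 0
indicator_12 = lambda w: 1 if w in {1, 2} else 0

print(is_measurable(indicator_A, F))   # False: {X = 1} = {1} is not an event
print(is_measurable(indicator_12, F))  # True: every preimage lies in F
```

Only sets of attained values need to be tested here, because a preimage depends on a value-set only through its intersection with the range of the function.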
## Measurable Framework
Before defining the observation itself, the model must say which subsets of outcomes are legitimate events. This is the bookkeeping that prevents the non-measurable indicator above from pretending to have probabilities.
[definition: Probability Space]
A probability space is a triple $(\Omega,\mathcal F,\mathbb P)$ where $\Omega$ is a set, $\mathcal F$ is a sigma-algebra on $\Omega$, and $\mathbb P:\mathcal F\to[0,1]$ is a measure satisfying $\mathbb P(\Omega)=1$.
[/definition]
The values of an observation also need a declared measurable structure. Without it, the phrase "$X\in A$" has no fixed meaning, because we have not said which value-sets $A$ may be tested.
[definition: Measurable Space]
A measurable space is a pair $(E,\mathcal E)$ where $E$ is a set and $\mathcal E$ is a sigma-algebra on $E$.
[/definition]
The source space tells us which outcome-events have probabilities, and the target space tells us which value-events are allowed. A random variable is exactly the compatibility condition between those two pieces of structure.
## Definition
The point of the definition is not to make $X$ numerically complicated; it is to make every target-side question into an event on $\Omega$. If the target says that $A$ is an observable set of values, then the source must be able to measure the event that $X$ lands in $A$.
[definition: Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $(E,\mathcal E)$ be a measurable space. An $E$-valued random variable is a measurable map
\begin{align*}
X:(\Omega,\mathcal F)&\to(E,\mathcal E),
\end{align*}
meaning that $X^{-1}(A)\in\mathcal F$ for every $A\in\mathcal E$.
[/definition]
This definition is deliberately about preimages, not about formulas. Probability lives on subsets of $\Omega$, so every value-question about $X$ must pull back to an event in $\mathcal F$ before it can be assigned a probability.
## Numerical Observations
Numerical observations deserve their own name because order, integration, moments, and distribution functions all use the measurable structure of the real line. This is the setting of most first probability calculations.
[definition: Real-Valued Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A real-valued random variable is a measurable map
\begin{align*}
X:(\Omega,\mathcal F)&\to(\mathbb R,\mathcal B(\mathbb R)).
\end{align*}
[/definition]
A single real number often records only one feature of an experiment. When several features are observed together, the vector-valued observation keeps dependence information that would be lost by studying coordinates separately.
This separate name matters because the codomain now carries the Borel structure of $\mathbb R^n$, not just $n$ unrelated copies of the real line. The question is whether simultaneous events such as $\{X_1\le a_1,\dots,X_n\le a_n\}$ are measurable.
[definition: Random Vector]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A random vector in $\mathbb R^n$ is a measurable map
\begin{align*}
X:(\Omega,\mathcal F)&\to(\mathbb R^n,\mathcal B(\mathbb R^n)).
\end{align*}
[/definition]
The distinction between a vector observation and a derived scalar is visible even in a finite experiment. Keeping the ordered pair preserves probabilities that disappear when only a summary statistic is retained.
[example: Two Coordinates of One Experiment]
Roll two fair dice, so the outcome space is
\begin{align*}
\Omega=\{1,2,3,4,5,6\}\times\{1,2,3,4,5,6\},
\end{align*}
with each ordered pair having probability $1/36$. Let $X_1(i,j)=i$, let $X_2(i,j)=j$, and let $X=(X_1,X_2)$. The random vector $X$ records the ordered pair, while the sum $S=X_1+X_2$ records only the derived value
\begin{align*}
S(i,j)=X_1(i,j)+X_2(i,j)=i+j.
\end{align*}
For the event $S=7$, we compute the preimage in the joint outcome space:
\begin{align*}
\{S=7\}
&=\{(i,j)\in\Omega:S(i,j)=7\}\\
&=\{(i,j)\in\Omega:i+j=7\}\\
&=\{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}.
\end{align*}
These six ordered pairs are disjoint elementary outcomes, so finite additivity gives
\begin{align*}
\mathbb P(S=7)
&=\mathbb P(\{(1,6)\})+\mathbb P(\{(2,5)\})+\mathbb P(\{(3,4)\})\\
&\quad+\mathbb P(\{(4,3)\})+\mathbb P(\{(5,2)\})+\mathbb P(\{(6,1)\})\\
&=\frac{1}{36}+\frac{1}{36}+\frac{1}{36}+\frac{1}{36}+\frac{1}{36}+\frac{1}{36}\\
&=\frac{6}{36}\\
&=\frac{1}{6}.
\end{align*}
Thus the probability of the scalar event $S=7$ is obtained by counting points in the joint outcome space of $X$. The ordered pair distinguishes outcomes such as $(1,6)$ and $(6,1)$ that the sum identifies, and the event $\{S=7\}$ cannot be decided from either coordinate alone.
[/example]
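The counting argument can be reproduced by enumerating the 36 ordered pairs. This is a small sketch of the example, with hypothetical names `omega`, `law_S`, and `preimage_7`.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint outcome space of two fair dice; each ordered pair has mass 1/36.
omega = list(product(range(1, 7), repeat=2))
mass = Fraction(1, 36)

# Probabilities of the values of S(i, j) = i + j, one preimage at a time.
law_S = defaultdict(Fraction)
for i, j in omega:
    law_S[i + j] += mass

preimage_7 = [(i, j) for i, j in omega if i + j == 7]
print(preimage_7)
print(law_S[7])  # 1/6
```

The table `law_S` keeps only the values of the sum, while `preimage_7` still lists the ordered pairs, which is the distinction between the scalar summary and the vector observation.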
## Laws and Distribution Functions
### Laws Replace Sample Spaces
The original sample space may contain irrelevant detail. If all calculations concern $X$, then the probabilities can be pushed forward to the value space and the underlying outcomes can disappear from the notation.
[definition: Law of a Random Variable]
Let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a random variable on the probability space $(\Omega,\mathcal F,\mathbb P)$. The law, or distribution, of $X$ is the probability measure $\mu_X$ on $(E,\mathcal E)$ defined by
\begin{align*}
\mu_X(A)=\mathbb P(X\in A)=\mathbb P(X^{-1}(A)),\qquad A\in\mathcal E.
\end{align*}
[/definition]
The law remembers the probabilities of value-events, but it does not remember which original outcome produced which value. That loss of detail is often the point: two different experiments can have the same distribution for the statistic being studied.
Once sample spaces have been discarded in favour of laws, we need a way to say that two observations are probabilistically indistinguishable even if they live on unrelated experiments. Equality in distribution is that comparison.
[definition: Equality in Distribution]
Let $X:(\Omega_1,\mathcal F_1)\to(E,\mathcal E)$ and $Y:(\Omega_2,\mathcal F_2)\to(E,\mathcal E)$ be random variables on probability spaces $(\Omega_1,\mathcal F_1,\mathbb P_1)$ and $(\Omega_2,\mathcal F_2,\mathbb P_2)$. The random variables $X$ and $Y$ are equal in distribution, written $X\overset{d}{=}Y$, if $\mu_X=\mu_Y$ as probability measures on $(E,\mathcal E)$.
[/definition]
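Equality in distribution can be illustrated with two unrelated finite experiments. The sketch below is an assumption-labelled toy: the die, the coin, and the helper `law` are hypothetical choices, and the comparison of dictionaries stands in for equality of measures on a finite value space.

```python
from collections import defaultdict
from fractions import Fraction

def law(outcome_probs, X):
    """Push the probabilities on outcomes forward to the values of X."""
    mu = defaultdict(Fraction)
    for w, p in outcome_probs.items():
        mu[X(w)] += p
    return dict(mu)

# Experiment 1: a fair die, observed through the parity of the face.
die = {i: Fraction(1, 6) for i in range(1, 7)}
X = lambda i: i % 2

# Experiment 2: a fair coin, observed through an indicator of heads.
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
Y = lambda side: 1 if side == "H" else 0

print(law(die, X) == law(coin, Y))  # True
```

The two sample spaces have different sizes and no common outcomes, yet the pushed-forward tables coincide, so the two observations are equal in distribution.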
Equality in distribution lets us ignore the original outcome space when only the law matters. Conditioning asks a different question: if the observer is told only the value of $X$, which events in the original experiment can they decide? The answer is a smaller sigma-algebra inside $\mathcal F$, and naming it is what makes "information carried by $X$" a mathematical object rather than a metaphor.
[definition: Sigma-Algebra Generated by a Random Variable]
Let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a random variable. The sigma-algebra generated by $X$ is
\begin{align*}
\sigma(X)=\{X^{-1}(A):A\in\mathcal E\}.
\end{align*}
[/definition]
This collection is a sigma-algebra because preimages preserve complements and countable unions. Thus $\sigma(X)$ is closed under the same event operations as $\mathcal E$, but only after translating value-events back to $\Omega$. For a real-valued random variable, the same sigma-algebra is generated by the threshold events $\{X\le x\}$ with $x\in\mathbb R$, since half-lines generate $\mathcal B(\mathbb R)$.
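For a finite-valued observation, $\sigma(X)$ can be listed explicitly. The following sketch uses the parity payoff on one die as an assumed example; `generated_sigma_algebra` is a hypothetical helper that ranges over sets of attained values only, which suffices because a preimage depends on a value-set only through its intersection with the range of $X$.

```python
from itertools import chain, combinations

omega = frozenset(range(1, 7))

def payoff(i):
    """X(i) = 10 on even faces and 0 on odd faces."""
    return 10 if i % 2 == 0 else 0

def generated_sigma_algebra(X):
    """Collect the preimages of all sets B of attained values."""
    values = sorted({X(w) for w in omega})
    value_sets = chain.from_iterable(
        combinations(values, r) for r in range(len(values) + 1)
    )
    return {frozenset(w for w in omega if X(w) in B) for B in value_sets}

sigma_X = generated_sigma_algebra(payoff)
for event in sorted(sigma_X, key=lambda e: (len(e), sorted(e))):
    print(sorted(event))
```

The result has four events: the empty set, the odd faces, the even faces, and the whole space. An observer who sees only the payoff cannot decide the event $\{2\}$, which is why it is absent.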
### Distribution Functions
For real-valued random variables, intervals of the form $(-\infty,x]$ are enough to recover the entire law. This is why many calculations can be phrased through a single increasing function rather than through an abstract measure.
[definition: Distribution Function]
Let $X$ be a real-valued random variable. The distribution function of $X$ is the function $F_X:\mathbb R\to[0,1]$ defined by
\begin{align*}
F_X(x)=\mathbb P(X\le x).
\end{align*}
[/definition]
Not every nondecreasing curve between $0$ and $1$ is a distribution function. Half-line probabilities must respect countable additivity, which imposes three constraints: the function must be right-continuous, so that its value at a jump is the upper level; it must tend to $0$ as the threshold goes to $-\infty$; and it must tend to $1$ as the threshold goes to $+\infty$. The characterization below separates genuine distribution functions from merely nondecreasing curves by recording exactly these endpoint and continuity constraints.
[quotetheorem:4986]
The characterization tells us which functions can occur; the next practical question is how to read probabilities back from such a function. Probabilities of intervals are recovered by subtracting endpoint values of $F_X$, and a left limit replaces the endpoint value whenever an endpoint must be excluded, because atoms appear as jumps. These interval formulas are the basic decoding rule for using a distribution function in computations.
[quotetheorem:4988]
The endpoint formula is especially useful because it makes atoms visible without naming the law separately. We use the standard notation $X\sim\operatorname{Ber}(p)$ for a Bernoulli random variable with $\mathbb P(X=1)=p$, $X\sim\operatorname{Bin}(n,p)$ for a binomial count from $n$ independent Bernoulli trials, $X\sim\operatorname{Unif}(a,b)$ for the continuous uniform law on $(a,b)$, and $X\sim\operatorname{Exp}(\lambda)$ for the exponential law with rate $\lambda>0$. Bernoulli variables give the smallest example: all mass appears as jumps in the graph of $F_X$.
[example: A Jump in a Distribution Function]
Let $X\sim\operatorname{Ber}(p)$, so $\mathbb P(X=1)=p$ and $\mathbb P(X=0)=1-p$. We compute $F_X(x)=\mathbb P(X\le x)$ by separating the three possible positions of $x$ relative to the two values $0$ and $1$:
\begin{align*}
x<0
&\implies \{X\le x\}=\varnothing,
&
F_X(x)&=\mathbb P(\varnothing)=0,\\
0\le x<1
&\implies \{X\le x\}=\{X=0\},
&
F_X(x)&=\mathbb P(X=0)=1-p,\\
x\ge1
&\implies \{X\le x\}=\{X=0\}\cup\{X=1\},
&
F_X(x)&=\mathbb P(X=0)+\mathbb P(X=1)\\
&&&=(1-p)+p\\
&&&=1.
\end{align*}
Thus
\begin{align*}
F_X(x)=
\begin{cases}
0, & x<0,\\
1-p, & 0\le x<1,\\
1, & x\ge1.
\end{cases}
\end{align*}
The jump at $0$ has size
\begin{align*}
F_X(0)-\lim_{x\uparrow0}F_X(x)=(1-p)-0=1-p,
\end{align*}
and the jump at $1$ has size
\begin{align*}
F_X(1)-\lim_{x\uparrow1}F_X(x)=1-(1-p)=p.
\end{align*}
So the two jumps of the distribution function are exactly the two point masses of the Bernoulli law.
[/example]
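The case split translates directly into a small function. This is a sketch of the Bernoulli distribution function only; `bernoulli_cdf` and `bernoulli_left_limit` are hypothetical names, and the left limit is written out by the same case analysis rather than computed numerically.

```python
from fractions import Fraction

def bernoulli_cdf(p):
    """Return F with F(x) = P(X <= x) for X ~ Ber(p)."""
    def F(x):
        if x < 0:
            return Fraction(0)
        if x < 1:
            return 1 - p
        return Fraction(1)
    return F

def bernoulli_left_limit(p, x):
    """Left limit of the Bernoulli distribution function at x."""
    if x <= 0:
        return Fraction(0)
    if x <= 1:
        return 1 - p
    return Fraction(1)

p = Fraction(3, 10)
F = bernoulli_cdf(p)
jump_at_0 = F(0) - bernoulli_left_limit(p, 0)
jump_at_1 = F(1) - bernoulli_left_limit(p, 1)
print(jump_at_0, jump_at_1)  # 7/10 3/10
```

The only difference between the two functions is whether the boundary cases use a strict or a non-strict inequality, which is exactly the content of right-continuity with left limits.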
### Integrating Through the Law
A law is useful only if it lets us compute. The [pushforward formula](/theorems/4989) says that any expectation depending on $X$ can be evaluated on the value space using $\mu_X$.
[quotetheorem:4989]
This theorem is the bridge from random variables to ordinary measure theory. After it, the sample space is no longer sacred; it is often only a device for producing the distribution.
## Discrete, Continuous, and Mixed Laws
### Discrete Laws
The simplest laws concentrate all mass on a [countable set](/page/Countable%20Set). In that case probabilities can be computed by summing weights instead of integrating over arbitrary Borel sets.
[definition: Discrete Random Variable]
A real-valued random variable $X$ is discrete if there exists a countable set $S\subset\mathbb R$ such that $\mathbb P(X\in S)=1$.
[/definition]
Countable support reduces the whole law to point weights, but the definition of discreteness does not say how to store those weights. To compute with a discrete random variable, we need a table indexed by possible values: it should record the mass at each point and recover the probability of any event by summing the entries that lie in that event.
[definition: Probability Mass Function]
Let $X$ be a discrete real-valued random variable. The probability mass function of $X$ is the function $p_X:\mathbb R\to[0,1]$ defined by
\begin{align*}
p_X(x)=\mathbb P(X=x).
\end{align*}
[/definition]
The binomial law is the standard finite table: every possible value is an integer count, and each probability is obtained by collecting outcome strings with that count.
[example: A Binomial Count]
Let $X\sim\operatorname{Bin}(n,p)$ count successes in $n$ independent Bernoulli trials, so an outcome is a string $\omega=(\omega_1,\dots,\omega_n)\in\{0,1\}^n$ and
\begin{align*}
X(\omega)=\omega_1+\cdots+\omega_n.
\end{align*}
For $k\in\{0,1,\dots,n\}$, the event $\{X=k\}$ is the disjoint union over all subsets $A\subset\{1,\dots,n\}$ with $|A|=k$ of the elementary events
\begin{align*}
E_A=\{\omega:\omega_i=1\text{ for }i\in A,\ \omega_i=0\text{ for }i\notin A\}.
\end{align*}
For each such $A$, independence of the trials gives
\begin{align*}
\mathbb P(E_A)
&=\prod_{i\in A}\mathbb P(\omega_i=1)\prod_{i\notin A}\mathbb P(\omega_i=0)\\
&=\prod_{i\in A}p\prod_{i\notin A}(1-p)\\
&=p^k(1-p)^{n-k}.
\end{align*}
There are $\binom{n}{k}$ subsets $A\subset\{1,\dots,n\}$ of size $k$, and the events $E_A$ are disjoint, so finite additivity gives
\begin{align*}
\mathbb P(X=k)
&=\sum_{\substack{A\subset\{1,\dots,n\}\\ |A|=k}}\mathbb P(E_A)\\
&=\sum_{\substack{A\subset\{1,\dots,n\}\\ |A|=k}}p^k(1-p)^{n-k}\\
&=\binom{n}{k}p^k(1-p)^{n-k}.
\end{align*}
If $x\notin\{0,1,\dots,n\}$, then $\{X=x\}=\varnothing$, so $\mathbb P(X=x)=0$. Hence
\begin{align*}
p_X(x)=
\begin{cases}
\binom{n}{k}p^k(1-p)^{n-k}, & x=k\text{ for some }k\in\{0,1,\dots,n\},\\
0, & x\notin\{0,1,\dots,n\}.
\end{cases}
\end{align*}
The count is a random variable because for every Borel set $B\subset\mathbb R$,
\begin{align*}
X^{-1}(B)=\bigcup_{\substack{k\in\{0,1,\dots,n\}\\ k\in B}}\{X=k\},
\end{align*}
a finite union of the events $\{X=k\}$, each of which is itself a finite union of the elementary events $E_A$.
[/example]
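The derivation can be checked by brute force for small $n$. The sketch below enumerates every string in $\{0,1\}^n$, adds up the string probabilities by count, and compares the result with the closed form; the function name and the choice $n=4$, $p=1/3$ are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binomial_pmf_by_enumeration(n, p):
    """Sum the probability of each string in {0,1}^n, grouped by its number of ones."""
    pmf = {k: Fraction(0) for k in range(n + 1)}
    for string in product((0, 1), repeat=n):
        k = sum(string)
        pmf[k] += p**k * (1 - p) ** (n - k)
    return pmf

n, p = 4, Fraction(1, 3)
enumerated = binomial_pmf_by_enumeration(n, p)
closed_form = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}
print(enumerated == closed_form)  # True
```

Exact rational arithmetic is used so that the comparison is an identity rather than a floating-point approximation.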
### Absolutely Continuous Laws
Other laws do not put mass at individual points. For them, probabilities are spread across intervals and are measured by integrating a density against [Lebesgue measure](/page/Lebesgue%20Measure). In this section, $\mathcal L^1$ denotes Lebesgue measure on $\mathbb R$.
[definition: Absolutely Continuous Random Variable]
A real-valued random variable $X$ is absolutely continuous if its law satisfies $\mu_X\ll\mathcal L^1$.
[/definition]
Absolute continuity says that ordinary length-zero sets receive probability zero, but it does not yet give a computational object. To calculate interval probabilities in the continuous case, we need the analogue of a mass table: a function whose integral over a set gives the probability of landing there.
[definition: Probability Density Function]
Let $X$ be an absolutely continuous real-valued random variable. A probability density function for $X$ is a nonnegative Borel measurable function $f_X:\mathbb R\to[0,\infty]$ such that
\begin{align*}
\mathbb P(X\in A)=\int_A f_X(x)\,d\mathcal L^1(x)
\end{align*}
for every $A\in\mathcal B(\mathbb R)$.
[/definition]
The definition of absolute continuity is a structural condition on the law, not yet a usable formula. Without a theorem connecting that condition to densities, a continuous law could still look like a black box: we would know which null sets have zero probability, but not whether every Borel probability can be computed by integrating one fixed function. The next result supplies exactly that missing bridge, and it also explains why a density is unique only up to changes on sets of Lebesgue measure zero.
[quotetheorem:4991]
A uniform law is the simplest density calculation because the density is constant on its support. Interval probability then becomes length scaled by the total length of the support.
[example: A Uniform Density]
Let $X\sim\operatorname{Unif}(a,b)$ with $a<b$, so its density is constant on $(a,b)$ and zero off that interval:
\begin{align*}
f_X(x)=
\begin{cases}
\frac{1}{b-a}, & a<x<b,\\
0, & x\le a\text{ or }x\ge b.
\end{cases}
\end{align*}
First, this density has total mass one, since
\begin{align*}
\int_{\mathbb R} f_X(x)\,d\mathcal L^1(x)
&=\int_a^b \frac{1}{b-a}\,d\mathcal L^1(x)\\
&=\frac{1}{b-a}\mathcal L^1((a,b))\\
&=\frac{1}{b-a}(b-a)\\
&=1.
\end{align*}
Now let $a\le c<d\le b$. Then $(c,d)\subset(a,b)$, so $f_X$ equals $\frac{1}{b-a}$ on all of $(c,d)$, and the density formula gives
\begin{align*}
\mathbb P(c<X<d)
&=\int_{(c,d)} f_X(x)\,d\mathcal L^1(x)\\
&=\int_c^d \frac{1}{b-a}\,d\mathcal L^1(x)\\
&=\frac{1}{b-a}\mathcal L^1((c,d))\\
&=\frac{1}{b-a}(d-c)\\
&=\frac{d-c}{b-a}.
\end{align*}
Thus a uniform law assigns probability to a subinterval by dividing its length $d-c$ by the total length $b-a$.
[/example]
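The length rule is short enough to write as a function. The sketch below goes slightly beyond the example by clipping the interval $(c,d)$ to the support, which is justified because the density vanishes off $(a,b)$; the function name is hypothetical.

```python
from fractions import Fraction

def uniform_interval_probability(a, b, c, d):
    """P(c < X < d) for X ~ Unif(a, b): overlap length divided by b - a."""
    if not a < b:
        raise ValueError("need a < b")
    # Length of the intersection of (c, d) with (a, b); zero if they do not meet.
    overlap = max(0, min(b, d) - max(a, c))
    return Fraction(overlap) / Fraction(b - a)

print(uniform_interval_probability(0, 4, 1, 3))  # 1/2
```

When $a\le c<d\le b$ the clipping has no effect and the function returns $(d-c)/(b-a)$, in agreement with the computation above.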
### Mixed Laws
Discrete and continuous laws do not exhaust all possibilities. A waiting time might have a positive probability of being exactly $0$ and otherwise have a density on $(0,\infty)$, so the law contains both an atom and a continuous part.
[example: An Atom Plus a Density]
Let $p\in(0,1)$, and define a real-valued random variable $X$ whose law satisfies
\begin{align*}
\mathbb P(X\in A)
=p\,\mathbb{1}_{\{0\in A\}}+(1-p)\int_{A\cap(0,\infty)}e^{-x}\,d\mathcal L^1(x)
\end{align*}
for every $A\in\mathcal B(\mathbb R)$. The formula has total mass one, since
\begin{align*}
\mathbb P(X\in\mathbb R)
&=p\,\mathbb{1}_{\{0\in\mathbb R\}}+(1-p)\int_{\mathbb R\cap(0,\infty)}e^{-x}\,d\mathcal L^1(x)\\
&=p+(1-p)\int_0^\infty e^{-x}\,dx\\
&=p+(1-p)\left[-e^{-x}\right]_{0}^{\infty}\\
&=p+(1-p)(0-(-1))\\
&=p+(1-p)\\
&=1.
\end{align*}
This law is not discrete. If $S\subset\mathbb R$ is countable, then $S\cap(0,\infty)$ has Lebesgue measure zero, so
\begin{align*}
\int_{S\cap(0,\infty)}e^{-x}\,d\mathcal L^1(x)=0.
\end{align*}
Therefore
\begin{align*}
\mathbb P(X\in S)
&=p\,\mathbb{1}_{\{0\in S\}}+(1-p)\int_{S\cap(0,\infty)}e^{-x}\,d\mathcal L^1(x)\\
&=p\,\mathbb{1}_{\{0\in S\}}\\
&\le p\\
&<1.
\end{align*}
No countable set can carry probability one, so $X$ is not discrete.
It is also not absolutely continuous. For the singleton $\{0\}$,
\begin{align*}
\mathbb P(X=0)
&=\mathbb P(X\in\{0\})\\
&=p\,\mathbb{1}_{\{0\in\{0\}\}}+(1-p)\int_{\{0\}\cap(0,\infty)}e^{-x}\,d\mathcal L^1(x)\\
&=p+(1-p)\int_{\varnothing}e^{-x}\,d\mathcal L^1(x)\\
&=p,
\end{align*}
while
\begin{align*}
\mathcal L^1(\{0\})=0.
\end{align*}
Thus a Lebesgue-null set has positive probability under the law of $X$, so the law has an atom at $0$ together with a continuous density on $(0,\infty)$.
[/example]
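The atom is also visible in the distribution function. Applying the defining formula to $A=(-\infty,x]$ gives $F_X(x)=0$ for $x<0$ and $F_X(x)=p+(1-p)(1-e^{-x})$ for $x\ge0$; this closed form is derived here for illustration and is not stated in the example. A minimal sketch, with the hypothetical name `mixed_cdf` and the assumed value $p=0.25$:

```python
import math

def mixed_cdf(p):
    """F(x) = P(X <= x) for the law with mass p at 0 and density (1 - p) e^{-x} on (0, inf)."""
    def F(x):
        if x < 0:
            return 0.0
        return p + (1 - p) * (1 - math.exp(-x))
    return F

p = 0.25
F = mixed_cdf(p)

# F is identically 0 on the negative axis, so any negative point gives the left limit at 0.
jump_at_zero = F(0.0) - F(-1e-12)
print(jump_at_zero)  # 0.25
```

The function jumps by $p$ at the origin and is continuous and strictly increasing afterwards, matching the description of one atom together with a density.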
## Expectation and Moments
### Expectation as Integration
Probabilities answer yes-or-no questions; expectation answers weighted-value questions. The nonnegative case comes first because the integral is always defined, although it may be infinite.
[definition: Expectation of a Nonnegative Random Variable]
Let $L^0_+(\Omega,\mathcal F,\mathbb P)$ be the set of nonnegative extended real-valued random variables modulo almost sure equality. The expectation of nonnegative random variables is the functional
\begin{align*}
\mathbb E:L^0_+(\Omega,\mathcal F,\mathbb P)&\to[0,\infty] \\
X&\mapsto\int_\Omega X\,d\mathbb P.
\end{align*}
For $X\in L^0_+(\Omega,\mathcal F,\mathbb P)$, this value is written
\begin{align*}
\mathbb E[X]=\int_\Omega X\,d\mathbb P.
\end{align*}
[/definition]
Signed random variables require a finiteness condition; otherwise positive and negative infinite contributions could collide. Integrability is exactly the condition that makes expectation a finite number.
[definition: Integrable Random Variable]
A real-valued random variable $X$ is integrable if
\begin{align*}
\mathbb E[|X|]<\infty.
\end{align*}
[/definition]
The nonnegative definition allowed the value $\infty$ because monotone approximation never creates a subtraction problem. Signed variables create a different issue: positive and negative parts may both be infinite, making the expression $\infty-\infty$ meaningless. Integrability removes that ambiguity, so expectation becomes a finite real-valued functional suitable for algebraic identities and limiting arguments.
[definition: Expectation of an Integrable Random Variable]
Let $L^1(\Omega,\mathcal F,\mathbb P)$ be the space of integrable real-valued random variables modulo almost sure equality. The expectation is the map
\begin{align*}
\mathbb E:L^1(\Omega,\mathcal F,\mathbb P)&\to\mathbb R \\
X&\mapsto\int_\Omega X\,d\mathbb P.
\end{align*}
[/definition]
The law of a random variable is enough to compute expectations of functions of that variable. This is the calculation rule usually called the law of the unconscious statistician (LOTUS).
This theorem is needed because many random variables are specified by their distribution rather than by an explicit sample space. It tells us that after the law has been found, expected values of functions of $X$ can be computed entirely on the value space.
[quotetheorem:3536]
LOTUS computes expectation from the law, but in many arguments the law itself is not available. What we can often estimate directly is the chance that $X$ exceeds a level $t$. The next formula converts those tail bounds into an expectation, so it is especially useful for nonnegative variables whose density or mass function is hard to identify.
[quotetheorem:4993]
For an exponential waiting time, the tail probabilities have a simpler form than many interval probabilities. This makes it a natural first use of the [tail integral formula](/theorems/4993).
[example: Exponential Expectation from the Tail]
Let $X\sim\operatorname{Exp}(\lambda)$ with $\lambda>0$, so its survival function is
\begin{align*}
\mathbb P(X>t)=e^{-\lambda t},\qquad t\ge0.
\end{align*}
Since $X$ is nonnegative, the *Tail Integral Formula* gives
\begin{align*}
\mathbb E[X]
&=\int_0^\infty \mathbb P(X>t)\,d\mathcal L^1(t)\\
&=\int_0^\infty e^{-\lambda t}\,d\mathcal L^1(t)\\
&=\lim_{R\to\infty}\int_0^R e^{-\lambda t}\,dt\\
&=\lim_{R\to\infty}\left[-\frac{1}{\lambda}e^{-\lambda t}\right]_{0}^{R}\\
&=\lim_{R\to\infty}\left(-\frac{1}{\lambda}e^{-\lambda R}+\frac{1}{\lambda}e^{0}\right)\\
&=\lim_{R\to\infty}\left(\frac{1}{\lambda}-\frac{1}{\lambda}e^{-\lambda R}\right)\\
&=\frac{1}{\lambda}-\frac{1}{\lambda}\lim_{R\to\infty}e^{-\lambda R}\\
&=\frac{1}{\lambda}.
\end{align*}
Thus the expectation is recovered from the tail probabilities alone, without using the density formula directly.
[/example]
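The same value can be approximated numerically from the survival function alone. The sketch below uses a midpoint rule on a truncated range; the truncation point $40/\lambda$, the step count, and the name `tail_integral` are assumptions made for this illustration, and the result is an approximation rather than an exact evaluation.

```python
import math

def tail_integral(survival, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of survival(t) over [0, upper]."""
    h = upper / steps
    return h * sum(survival((k + 0.5) * h) for k in range(steps))

lam = 2.0
survival = lambda t: math.exp(-lam * t)  # P(X > t) for X ~ Exp(lam), t >= 0
approx = tail_integral(survival, upper=40.0 / lam)
print(approx, 1.0 / lam)
```

The routine never evaluates a density, so it applies equally to a nonnegative variable whose tail probabilities are known only through a bound or a simulation.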
### Moments and Dependence Signals
Expectation gives the centre of mass of a law. The next quantities measure spread and joint linear movement, so they require second moments rather than just integrability.
[definition: Variance]
Let $L^2(\Omega,\mathcal F,\mathbb P)$ be the space of square-integrable real-valued random variables modulo almost sure equality. Variance is the functional
\begin{align*}
\operatorname{Var}:L^2(\Omega,\mathcal F,\mathbb P)&\to[0,\infty) \\
X&\mapsto\mathbb E[(X-\mathbb E[X])^2].
\end{align*}
For $X\in L^2(\Omega,\mathcal F,\mathbb P)$, this value is written
\begin{align*}
\operatorname{Var}(X)=\mathbb E[(X-\mathbb E[X])^2].
\end{align*}
[/definition]
Variance is one-variable spread. To compare two variables on the same probability space, we multiply their centred deviations and average.
Covariance needs its own definition because it is a two-input functional on a common $L^2$ space. The shared probability space is part of the structure: without it, there is no joint averaging operation.
[definition: Covariance]
Let $L^2(\Omega,\mathcal F,\mathbb P)$ be the space of square-integrable real-valued random variables modulo almost sure equality. Covariance is the map
\begin{align*}
\operatorname{Cov}:L^2(\Omega,\mathcal F,\mathbb P)\times L^2(\Omega,\mathcal F,\mathbb P)&\to\mathbb R \\
(X,Y)&\mapsto\mathbb E[(X-\mathbb E[X])(Y-\mathbb E[Y])].
\end{align*}
For $X,Y\in L^2(\Omega,\mathcal F,\mathbb P)$, this value is written
\begin{align*}
\operatorname{Cov}(X,Y)=\mathbb E[(X-\mathbb E[X])(Y-\mathbb E[Y])].
\end{align*}
[/definition]
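On a finite probability space both functionals reduce to finite sums. The sketch below evaluates them on the two-dice space with exact rational arithmetic; the helper names are hypothetical, and the three observations are the first coordinate, the second coordinate, and their sum.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
mass = Fraction(1, 36)

def expectation(X):
    """E[X] as a finite sum over the 36 equally likely ordered pairs."""
    return sum(X(w) * mass for w in omega)

def covariance(X, Y):
    """Cov(X, Y) = E[(X - E[X])(Y - E[Y])]."""
    mx, my = expectation(X), expectation(Y)
    return expectation(lambda w: (X(w) - mx) * (Y(w) - my))

def variance(X):
    """Var(X) = Cov(X, X)."""
    return covariance(X, X)

X1 = lambda w: w[0]
X2 = lambda w: w[1]
S = lambda w: w[0] + w[1]

print(variance(X1), covariance(X1, X2), covariance(X1, S))  # 35/12 0 35/12
```

Both arguments of `covariance` are functions on the same list `omega`, which reflects the requirement that the two variables share a probability space.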
## Joint Behavior, Independence, and Conditioning
### Joint and Marginal Laws
Studying $X_1$ and $X_2$ separately can miss the relation between them. The joint law keeps all simultaneous probabilities, which is where dependence lives.
[definition: Joint Law]
Let $X=(X_1,\dots,X_n)$ be a random vector in $\mathbb R^n$. The joint law of $X_1,\dots,X_n$ is the law $\mu_X$ of $X$ on $(\mathbb R^n,\mathcal B(\mathbb R^n))$.
[/definition]
The joint law contains all simultaneous probabilities, but a single coordinate may still be the object of interest. To compare individual behaviour with dependence, we need a name for the coordinate laws obtained from the joint distribution; these are the marginals, and they are what remain after the other coordinates are ignored.
[definition: Marginal Law]
Let $X=(X_1,\dots,X_n)$ be a random vector in $\mathbb R^n$. The marginal law of $X_i$ is the law of the coordinate random variable $X_i$.
[/definition]
The example below separates the information in marginals from the information in a joint law. It keeps the individual Bernoulli distributions fixed while changing the probability of simultaneous success.
[example: Same Marginals, Different Joint Laws]
Let $X\sim\operatorname{Ber}(1/2)$, so
\begin{align*}
\mathbb P(X=1)&=\frac{1}{2},&
\mathbb P(X=0)&=\frac{1}{2}.
\end{align*}
First set $Y=X$. Then for $a\in\{0,1\}$,
\begin{align*}
\mathbb P(Y=a)
&=\mathbb P(X=a),
\end{align*}
so $Y\sim\operatorname{Ber}(1/2)$ as well. The simultaneous success event is
\begin{align*}
\{X=1,Y=1\}
&=\{\omega:X(\omega)=1\text{ and }Y(\omega)=1\}\\
&=\{\omega:X(\omega)=1\text{ and }X(\omega)=1\}\\
&=\{\omega:X(\omega)=1\},
\end{align*}
because $Y=X$. Therefore
\begin{align*}
\mathbb P(X=1,Y=1)
&=\mathbb P(X=1)\\
&=\frac{1}{2}.
\end{align*}
Now let $Z\sim\operatorname{Ber}(1/2)$ be independent of $X$. Its marginal law is the same as that of $Y$, since
\begin{align*}
\mathbb P(Z=1)&=\frac{1}{2},&
\mathbb P(Z=0)&=\frac{1}{2}.
\end{align*}
Independence gives factorization of the two value-events $\{X=1\}$ and $\{Z=1\}$, so
\begin{align*}
\mathbb P(X=1,Z=1)
&=\mathbb P(\{X=1\}\cap\{Z=1\})\\
&=\mathbb P(X=1)\mathbb P(Z=1)\\
&=\frac{1}{2}\cdot\frac{1}{2}\\
&=\frac{1}{4}.
\end{align*}
Thus the pairs $(X,Y)$ and $(X,Z)$ have the same Bernoulli marginal laws, but their joint laws are different because they assign different probabilities to the same rectangle $\{1\}\times\{1\}$.
[/example]
### Independence
Independence says that knowing some variables gives no probabilistic information about the others. Formally, for every finite subfamily, the probability that all the chosen value-events occur together must equal the product of their individual probabilities.
[definition: Independence of Random Variables]
Let $X_i:(\Omega,\mathcal F)\to(E_i,\mathcal E_i)$ be random variables on $(\Omega,\mathcal F,\mathbb P)$ for $i\in I$. The family $(X_i)_{i\in I}$ is independent if for every finite set $J\subset I$ and every choice of sets $A_j\in\mathcal E_j$,
\begin{align*}
\mathbb P\left(\bigcap_{j\in J}\{X_j\in A_j\}\right)=\prod_{j\in J}\mathbb P(X_j\in A_j).
\end{align*}
[/definition]
The definition quantifies over many choices of measurable sets, which is faithful but cumbersome. Once joint laws are available, independence should be recognizable from the distribution itself: the joint law should contain no coupling information beyond the separate laws. The following theorem makes that compression precise.
[quotetheorem:4861]
The product-law criterion is the usable form of independence. Instead of checking every finite intersection on the original sample space, one can compare a joint distribution with the product of its marginals. This is especially important when the random variables are specified by densities, mass functions, or limiting laws rather than by explicit outcomes in $\Omega$.
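For finitely supported laws the comparison can be carried out point by point, since for discrete variables factorization of the point masses is equivalent to independence. The sketch below applies it to two joint laws with Bernoulli$(1/2)$ marginals, one coupled by $Y=X$ and one independent; the dictionary representation and the helper names are assumptions of this illustration.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Joint laws on {0,1} x {0,1}, stored as {(x, y): probability}.
joint_XY = {(0, 0): half, (1, 1): half}                       # Y = X
joint_XZ = {(x, z): quarter for x in (0, 1) for z in (0, 1)}  # Z independent of X

def marginal(joint, axis):
    """Sum the joint mass over the other coordinate."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], Fraction(0)) + p
    return out

def is_product_law(joint):
    """Compare the joint mass with the product of marginal masses at every point."""
    m0, m1 = marginal(joint, 0), marginal(joint, 1)
    return all(
        joint.get((x, y), Fraction(0)) == m0[x] * m1[y]
        for x in m0
        for y in m1
    )

print(marginal(joint_XY, 1) == marginal(joint_XZ, 1))    # True: same marginals
print(is_product_law(joint_XY), is_product_law(joint_XZ))  # False True
```

No sample space appears in the code: the test uses only the joint table and the marginals computed from it.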
### Conditioning as Coarser Information
Conditioning on a sigma-algebra means replacing a random variable by the best version visible with less information. The defining property is not a formula but agreement of integrals over every event that the coarser information can detect.
[definition: Conditional Expectation]
Let $X$ be an integrable real-valued random variable on $(\Omega,\mathcal F,\mathbb P)$, and let $\mathcal G\subset\mathcal F$ be a sub-sigma-algebra. A [conditional expectation](/page/Conditional%20Expectation) of $X$ given $\mathcal G$ is an integrable real-valued $\mathcal G$-measurable map
\begin{align*}
Y:(\Omega,\mathcal G)&\to(\mathbb R,\mathcal B(\mathbb R))
\end{align*}
such that
\begin{align*}
\int_GY\,d\mathbb P=\int_GX\,d\mathbb P
\end{align*}
for every $G\in\mathcal G$.
[/definition]
The defining integral identity describes what a conditional expectation should do, but it does not itself guarantee that such a coarser summary can be found. This is the main danger in treating conditioning as "averaging over the information in $\mathcal G$": the phrase is meaningful only if there is a $\mathcal G$-measurable random variable with the required integrals, and only if two valid summaries cannot disagree on an event of positive probability. The next theorem removes both obstructions, so $\mathbb E[X\mid\mathcal G]$ becomes an object of the theory rather than notation for a hoped-for formula.
[quotetheorem:1147]
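When $\mathcal G$ is generated by a finite partition into blocks of positive probability, a conditional expectation can be written down as a blockwise average. The sketch below is a toy under that assumption: one fair die, the odd/even partition, and hypothetical helpers `conditional_expectation` and `integral`.

```python
from fractions import Fraction

# One fair die; G is generated by the partition into odd and even faces.
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}
partition = [frozenset({1, 3, 5}), frozenset({2, 4, 6})]

def conditional_expectation(X, partition):
    """Return Y with Y[w] equal to the average of X over the block containing w."""
    Y = {}
    for block in partition:
        block_mass = sum(prob[w] for w in block)  # assumed positive
        block_avg = sum(X(w) * prob[w] for w in block) / block_mass
        for w in block:
            Y[w] = block_avg
    return Y

def integral(f, event):
    """Integral of f over an event, as a finite weighted sum."""
    return sum(f(w) * prob[w] for w in event)

X = lambda w: w
Y = conditional_expectation(X, partition)
print(Y[1], Y[2])  # 3 4
```

The map `Y` is constant on each block, so it is measurable for the coarser sigma-algebra, and its integral over each block agrees with that of `X` by construction.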
### Conditioning in Stages
Conditional expectation is unique only up to almost sure equality. Thus $\mathbb E[X\mid\mathcal G]$ denotes any representative of that almost sure equivalence class, and identities involving conditional expectations are understood almost surely unless a version has been fixed. The next question is whether this equivalence-class object behaves coherently when information is filtered in stages.
If $\mathcal H\subset\mathcal G$, then $\mathcal H$ represents less information than $\mathcal G$. Replacing $X$ first by its $\mathcal G$-visible summary and then by its $\mathcal H$-visible summary should give the same $\mathcal H$-visible information as conditioning directly on $\mathcal H$. The tower property records this consistency principle.
[quotetheorem:1150]
This theorem is not a formula for computing a conditional expectation from scratch; it is a coherence rule for nested information. Once $\mathbb E[X\mid\mathcal G]$ has compressed $X$ to the information in $\mathcal G$, conditioning again on the smaller sigma-algebra $\mathcal H$ loses exactly the same information as conditioning directly on $\mathcal H$. It is the formal reason conditional expectation behaves like successive averaging.
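The consistency of staged conditioning can be checked on a finite space where both sigma-algebras come from partitions. In the sketch below, which is an assumed toy rather than a general implementation, $\mathcal G$ records the first of two dice and $\mathcal H$ records only its parity, so every block of $\mathcal H$ is a union of blocks of $\mathcal G$.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
prob = {w: Fraction(1, 36) for w in omega}

def partition_by(label):
    """Group outcomes into blocks on which label is constant."""
    blocks = {}
    for w in omega:
        blocks.setdefault(label(w), []).append(w)
    return list(blocks.values())

def cond_exp(X, partition):
    """Blockwise average of X; assumes every block has positive probability."""
    Y = {}
    for block in partition:
        block_mass = sum(prob[w] for w in block)
        block_avg = sum(X[w] * prob[w] for w in block) / block_mass
        for w in block:
            Y[w] = block_avg
    return Y

G = partition_by(lambda w: w[0])      # information: the first die
H = partition_by(lambda w: w[0] % 2)  # coarser information: its parity

X = {w: w[0] + w[1] for w in omega}
staged = cond_exp(cond_exp(X, G), H)
direct = cond_exp(X, H)
print(staged == direct)  # True
```

Averaging first over the second die and then over faces of equal parity gives the same table as averaging once over each parity class.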
## Convergence of Random Variables
Sequences of random variables can converge in several inequivalent senses. The most direct mode asks for pointwise convergence outside a null set.
[definition: Almost Sure Convergence]
Let $(X_n)_{n\ge1}$ and $X$ be real-valued random variables on the same probability space. We say $X_n$ converges almost surely to $X$, written $X_n\xrightarrow{a.s.}X$, if
\begin{align*}
\mathbb P\left(\left\{\omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\right\}\right)=1.
\end{align*}
[/definition]
Almost sure convergence can be too strict when rare failures keep moving around. Convergence in probability ignores the location of the exceptional set and asks only that its probability become small.
[definition: Convergence in Probability]
Let $(X_n)_{n\ge1}$ and $X$ be real-valued random variables on the same probability space. We say $X_n$ converges in probability to $X$, written $X_n\xrightarrow{\mathbb P}X$, if for every $\varepsilon>0$,
\begin{align*}
\mathbb P(|X_n-X|>\varepsilon)\to0
\end{align*}
as $n\to\infty$.
[/definition]
Sometimes the variables are not even built on the same probability space. Then only the laws can be compared, which leads to convergence in distribution.
[definition: Convergence in Distribution]
Let $(X_n)_{n\ge1}$ and $X$ be real-valued random variables, not necessarily defined on the same probability space. We say $X_n$ converges in distribution to $X$, written $X_n\xrightarrow{d}X$, if
\begin{align*}
F_{X_n}(x)\to F_X(x)
\end{align*}
for every continuity point $x$ of $F_X$.
[/definition]
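The restriction to continuity points of $F_X$ matters. As a minimal sketch, assume the hypothetical deterministic sequence $X_n=1/n$ with candidate limit $X=0$ (an illustration, not an example from the text above). The distribution functions converge at every $x\ne0$, but not at $x=0$, which is the single discontinuity point of $F_X$.

```python
from fractions import Fraction


def cdf_constant(c, x):
    """CDF of the constant random variable c, evaluated at x: P(c <= x)."""
    return 1 if c <= x else 0


def F_n(n, x):
    """CDF of the assumed sequence X_n = 1/n."""
    return cdf_constant(Fraction(1, n), x)


def F_limit(x):
    """CDF of the candidate limit X = 0."""
    return cdf_constant(0, x)


ns = [1, 10, 100, 1000, 10**6]

# At the discontinuity point x = 0 the CDFs stay at 0 while F_limit(0) = 1.
at_zero = [F_n(n, 0) for n in ns]
# At continuity points the CDFs eventually agree with the limit.
at_positive = [F_n(n, Fraction(1, 100)) for n in ns]
at_negative = [F_n(n, Fraction(-1, 100)) for n in ns]

print("x = 0:     ", at_zero, "limit CDF:", F_limit(0))
print("x = 1/100: ", at_positive, "limit CDF:", F_limit(Fraction(1, 100)))
print("x = -1/100:", at_negative, "limit CDF:", F_limit(Fraction(-1, 100)))
```

If the definition demanded convergence at every real $x$, this sequence would fail to converge in distribution to $0$ even though $X_n\to0$ everywhere.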
Convergence in distribution compares only laws, and convergence in probability controls only the chance of a large error. Many analytic estimates give something stronger: a bound on the average $p$th power of the error. That motivates treating random variables as elements of an $L^p$ space and defining convergence by the corresponding norm.
[definition: $L^p$ Convergence of Random Variables]
Let $p\ge1$, and let $(X_n)_{n\ge1}$ and $X$ be real-valued random variables on the same probability space with $\mathbb E[|X_n|^p]<\infty$ and $\mathbb E[|X|^p]<\infty$. We say $X_n$ converges to $X$ in $L^p$, written $X_n\xrightarrow{L^p}X$, if
\begin{align*}
\mathbb E[|X_n-X|^p]\to0
\end{align*}
as $n\to\infty$.
[/definition]
These modes form a hierarchy, but the arrows mostly point one way. The theorem below records the basic implications used throughout probability limit theory.
[quotetheorem:4994]
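One standard way a converse can fail is that probabilities of large errors shrink while moments do not. The sketch below assumes a hypothetical sequence on $(0,1)$ with Lebesgue measure, $X_n=n\,\mathbb 1_{(0,1/n)}$, which is not taken from the text above. It computes $\mathbb P(|X_n|>\varepsilon)$ and $\mathbb E[|X_n|^p]$ exactly, using the closed forms $1/n$ (for $\varepsilon<n$) and $n^{p-1}$.

```python
from fractions import Fraction


def prob_exceeds(n, eps):
    """P(|X_n| > eps) for X_n = n * 1_{(0, 1/n)} under Lebesgue measure on (0, 1)."""
    # X_n equals n on an interval of length 1/n and 0 elsewhere.
    return Fraction(1, n) if eps < n else Fraction(0)


def pth_moment(n, p):
    """E[|X_n|^p] = n^p * (1/n) = n^(p-1), for an integer p >= 1."""
    return Fraction(n) ** p * Fraction(1, n)


ns = [1, 10, 100, 1000]

tail = [prob_exceeds(n, Fraction(1, 2)) for n in ns]  # tends to 0
first_moment = [pth_moment(n, 1) for n in ns]         # constant, equal to 1
second_moment = [pth_moment(n, 2) for n in ns]        # grows without bound

print(tail)
print(first_moment)
print(second_moment)
```

Under these assumptions $X_n\xrightarrow{\mathbb P}0$, yet $\mathbb E[|X_n-0|]=1$ for every $n$, so the sequence does not converge to $0$ in $L^1$.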
The next example shows why the reverse implication from convergence in probability to almost sure convergence is not available in general. The exceptional sets shrink in probability but keep moving through the sample space.
[example: Convergence in Probability Without Almost Sure Convergence]
Let $\Omega=(0,1)$ with Lebesgue probability measure. For each level $m\ge0$ and each $k=0,1,\dots,2^m-1$, set
\begin{align*}
I_{m,k}=\left(\frac{k}{2^m},\frac{k+1}{2^m}\right]\cap(0,1).
\end{align*}
Enumerate the intervals level by level, and let $X_n=\mathbb 1_{I_n}$, where $I_n=I_{m(n),k(n)}$ is the $n$th interval in that enumeration. Since there are only
\begin{align*}
1+2+\cdots+2^M=2^{M+1}-1
\end{align*}
intervals of levels $0,1,\dots,M$, the level $m(n)$ tends to infinity as $n\to\infty$.
We first show that $X_n\xrightarrow{\mathbb P}0$. Let $\varepsilon>0$. Since $X_n$ takes only the values $0$ and $1$,
\begin{align*}
\{|X_n-0|>\varepsilon\}
&=
\begin{cases}
I_n, & 0<\varepsilon<1,\\
\varnothing, & \varepsilon\ge1.
\end{cases}
\end{align*}
If $0<\varepsilon<1$, then
\begin{align*}
\mathbb P(|X_n|>\varepsilon)
&=\mathbb P(I_n)\\
&=\mathcal L^1\left(\left(\frac{k(n)}{2^{m(n)}},\frac{k(n)+1}{2^{m(n)}}\right]\cap(0,1)\right)\\
&=\frac{k(n)+1}{2^{m(n)}}-\frac{k(n)}{2^{m(n)}}\\
&=\frac{1}{2^{m(n)}}.
\end{align*}
If $\varepsilon\ge1$, then
\begin{align*}
\mathbb P(|X_n|>\varepsilon)=\mathbb P(\varnothing)=0.
\end{align*}
Because $m(n)\to\infty$, we have $2^{-m(n)}\to0$, so for every $\varepsilon>0$,
\begin{align*}
\mathbb P(|X_n-0|>\varepsilon)\to0.
\end{align*}
Thus $X_n\xrightarrow{\mathbb P}0$.
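Almost sure convergence to $0$ fails, because the convergence $X_n(\omega)\to0$ fails at every point. Fix $\omega\in(0,1)$. For each level $m$, let
\begin{align*}
k_m=\lceil 2^m\omega\rceil-1.
\end{align*}
Since $0<2^m\omega<2^m$, the ceiling satisfies $1\le\lceil 2^m\omega\rceil\le2^m$, so $0\le k_m\le2^m-1$. The definition of the ceiling also gives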
\begin{align*}
k_m<2^m\omega\le k_m+1,
\end{align*}
so
\begin{align*}
\frac{k_m}{2^m}<\omega\le\frac{k_m+1}{2^m}.
\end{align*}
Hence $\omega\in I_{m,k_m}$ for every $m$. Since the enumeration contains one such interval at every level, there are infinitely many indices $n$ with $X_n(\omega)=1$. Therefore $X_n(\omega)$ cannot converge to $0$ for any $\omega\in(0,1)$, and so
\begin{align*}
\mathbb P\left(\left\{\omega:\lim_{n\to\infty}X_n(\omega)=0\right\}\right)=0.
\end{align*}
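The same argument shows that $X_n(\omega)=0$ for infinitely many $n$, because every level $m\ge1$ contains an interval that misses $\omega$. Hence $X_n(\omega)$ has no limit at all, and $(X_n)$ does not converge almost surely to any random variable. Convergence in probability allows the exceptional intervals to keep moving, while almost sure convergence requires that, outside a null set, each $\omega$ eventually leaves them for good.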
[/example]
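The moving intervals in the example can be traced by exact computation. The sketch below fixes one concrete enumeration as an assumption: the interval $I_{m,k}$ receives index $n=2^m+k$, so level $m$ occupies indices $2^m,\dots,2^{m+1}-1$. The helper names and the test point $\omega=1/3$ are hypothetical choices made for illustration.

```python
from fractions import Fraction
from math import ceil


def interval(n):
    """Return (m, k) for the n-th interval, n >= 1, assuming the index n = 2**m + k."""
    m = n.bit_length() - 1
    k = n - 2**m
    return m, k


def X(n, omega):
    """Indicator of I_n = (k/2^m, (k+1)/2^m] evaluated at omega in (0, 1)."""
    m, k = interval(n)
    return 1 if Fraction(k, 2**m) < omega <= Fraction(k + 1, 2**m) else 0


def prob_I(n):
    """Lebesgue measure of I_n, which equals P(|X_n| > eps) for 0 < eps < 1."""
    m, _ = interval(n)
    return Fraction(1, 2**m)


omega = Fraction(1, 3)
levels = range(6)  # levels 0..5 occupy the indices 1..63

# Indices n at which X_n(omega) = 1: exactly one per level.
hits = [n for n in range(1, 2**6) if X(n, omega) == 1]

# The example's formula k_m = ceil(2^m * omega) - 1 predicts the same indices.
predicted = [2**m + ceil(2**m * omega) - 1 for m in levels]

# The probabilities of the hit intervals still shrink to 0.
probs = [prob_I(n) for n in hits]

print(hits)
print(predicted)
print(probs)
```

Each level contributes one index with $X_n(\omega)=1$, so the hits never stop, while the corresponding probabilities halve from level to level.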
## Beyond and Connected Topics
Random variables connect elementary probability to measure theory. Their laws are pushforward measures, their expectations are Lebesgue integrals, and their convergence modes compare functions and measures in different ways.
Conditional expectation ties these ideas to information. The sigma-algebra $\sigma(X)$ represents the information carried by an observation $X$, and an expression such as $\mathbb E[Y\mid X]$ is shorthand for $\mathbb E[Y\mid\sigma(X)]$, that is, conditioning on that information.
Limit theorems organize the long-run behavior of random variables. The [weak law of large numbers](/theorems/1127) uses convergence in probability, the strong law uses almost sure convergence, and the [central limit theorem](/theorems/1848) uses convergence in distribution.
For a course-level route, start with [Cambridge IA Probability](/page/Cambridge%20IA%20Probability), then move to [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure). The advanced continuations are [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability) and [Cambridge III Stochastic Calculus and Applications](/page/Cambridge%20III%20Stochastic%20Calculus%20and%20Applications).
## References
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
Androma, [Cambridge III Stochastic Calculus and Applications](/page/Cambridge%20III%20Stochastic%20Calculus%20and%20Applications).
Billingsley, *Probability and Measure* (1995).
Kallenberg, *Foundations of Modern Probability* (2002).
Durrett, *Probability: Theory and Examples* (2019).
Williams, *Probability with Martingales* (1991).