Probability is the mathematical framework for reasoning about uncertainty. At its heart lies a deceptively simple question: given an experiment whose outcome we cannot predict with certainty, how do we assign a consistent numerical measure to the likelihood of different events? The challenge is not merely philosophical — it is structural. We need a framework rich enough to handle finite experiments (rolling a die), countably infinite ones (waiting for the first head in repeated coin tosses), and continuous ones (throwing a needle onto a ruled floor), all within a single coherent theory.
These notes develop probability from the axioms of Kolmogorov, following the Cambridge Part IA course. We begin in Chapter 1 by setting up probability spaces and the combinatorial machinery needed to compute probabilities in finite sample spaces. Chapter 2 introduces conditional probability and independence — the two ideas that make probability theory a powerful modelling language rather than merely a branch of combinatorics. Chapter 3 formalises the notion of a random variable in the discrete setting, and Chapter 4 develops the fundamental tools for studying random variables: expectation, variance, and covariance. Chapter 5 introduces conditional expectation given a random variable, the Law of Total Expectation, joint discrete distributions, and the Gambler's Ruin problem — the principal applications of the conditioning technique. Chapter 6 establishes the key inequalities (Markov, Chebyshev, Jensen, Cauchy-Schwarz) and uses them to prove the Weak Law of Large Numbers, our first "limit theorem." Chapter 7 introduces probability generating functions and applies them to branching processes — a beautiful application where generating functions convert difficult probabilistic recursions into clean algebraic identities. Chapter 8 extends the theory to continuous random variables, where sums become integrals and probability mass functions become density functions. Finally, Chapter 9 introduces moment generating functions and culminates in the Central Limit Theorem, the single most important result in probability: under very mild conditions, the sum of many independent random variables is approximately normally distributed.
Throughout, we emphasise two complementary skills: the ability to set up a rigorous mathematical model of a random experiment, and the ability to compute with that model to extract concrete answers.
# 1. Probability Spaces and Counting
The first task in any probabilistic argument is to specify the mathematical setting: what are the possible outcomes, which collections of outcomes constitute "events" that we can assign probabilities to, and what numerical values do those probabilities take? This chapter develops the axiomatic framework of probability spaces and the combinatorial tools needed to compute probabilities in finite sample spaces.
## 1.1 The Axioms of Probability
To model a random experiment, we need three ingredients: the set of all possible outcomes, a specification of which collections of outcomes we consider as "events," and a rule for assigning probabilities to events. The axioms we adopt must be broad enough to encompass experiments with finitely many outcomes (a die roll), countably many outcomes (the number of coin tosses until the first head), and uncountably many outcomes (the position of a randomly thrown dart on a board).
The need for a restricted collection of events — rather than simply assigning probabilities to every subset of outcomes — is a subtlety that arises only in the uncountable case. For countable sample spaces, we can and do assign probabilities to every subset. The $\sigma$-algebra formalism provides a unified framework that handles all cases. At this level, we state the definition and move on; the reader will encounter $\sigma$-algebras in depth in Part IB Probability and Measure.
[definition: Sigma-Algebra]
Let $\Omega$ be a set. A collection $\mathcal{F}$ of subsets of $\Omega$ is a **$\sigma$-algebra** if:
(i) $\Omega \in \mathcal{F}$.
(ii) If $A \in \mathcal{F}$, then $A^c := \Omega \setminus A \in \mathcal{F}$.
(iii) If $(A_n)_{n \ge 1}$ is a countable collection of sets in $\mathcal{F}$, then $\bigcup_{n \ge 1} A_n \in \mathcal{F}$.
[/definition]
When $\Omega$ is finite or countable, we take $\mathcal{F} = \mathcal{P}(\Omega)$, the power set of all subsets. This is the typical situation in Part IA.
[definition: Probability Measure]
Let $\mathcal{F}$ be a $\sigma$-algebra on $\Omega$. A function $\mathbb{P} : \mathcal{F} \to [0,1]$ is a **probability measure** if:
(i) $\mathbb{P}(\Omega) = 1$.
(ii) For any countable collection $(A_n)_{n \ge 1}$ of pairwise disjoint events in $\mathcal{F}$,
\begin{align*}
\mathbb{P}\left(\bigcup_{n \ge 1} A_n\right) = \sum_{n \ge 1} \mathbb{P}(A_n).
\end{align*}
This property is called **countable additivity**.
[/definition]
The countable additivity axiom is strictly stronger than its finite counterpart. The following example illustrates that finite additivity alone is not sufficient for a coherent theory of probability.
[example: Finite Additivity Does Not Imply Countable Additivity]
Consider $\Omega = \mathbb{N} = \{1, 2, 3, \ldots\}$ and $\mathcal{F} = \mathcal{P}(\Omega)$. Define $\mathbb{P}^* : \mathcal{F} \to [0,1]$ by declaring $\mathbb{P}^*(A) = 0$ if $A$ is finite and $\mathbb{P}^*(A) = 1$ if $A^c$ is finite. (This is not yet fully specified for all sets, but we can extend it using the theory of ultrafilters — the details are beyond this course.) The resulting set function $\mathbb{P}^*$ is finitely additive: if $A$ and $B$ are disjoint, then at most one of them has a finite complement, so $\mathbb{P}^*(A \cup B) = \mathbb{P}^*(A) + \mathbb{P}^*(B)$.
However, $\mathbb{P}^*$ is not countably additive. The events $A_n = \{n\}$ are pairwise disjoint with $\bigcup_{n=1}^{\infty} A_n = \Omega$, but $\sum_{n=1}^{\infty} \mathbb{P}^*(A_n) = \sum_{n=1}^{\infty} 0 = 0 \ne 1 = \mathbb{P}^*(\Omega)$.
Without countable additivity, the continuity of probability (Theorem below) fails, and we lose the ability to work with events defined by infinite limiting processes — which is precisely the setting where probability becomes most useful.
[/example]
With the axioms of a probability measure in place, we can now package all three ingredients — sample space, event collection, and probability measure — into a single object.
[definition: Probability Space]
A **probability space** is a triple $(\Omega, \mathcal{F}, \mathbb{P})$ where $\Omega$ is a set (the **sample space**), $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ (the collection of **events**), and $\mathbb{P}$ is a probability measure on $\mathcal{F}$.
[/definition]
The elements of $\Omega$ are called **outcomes**, and the elements of $\mathcal{F}$ are called **events**. An important point: $\mathbb{P}$ assigns probabilities to *events* (subsets of $\Omega$), not to individual outcomes. When $\Omega$ is uncountable (e.g. $\Omega = [0,1]$ with the uniform distribution), every individual outcome has probability zero, yet many events have positive probability.
The two axioms — normalisation ($\mathbb{P}(\Omega) = 1$) and countable additivity — are remarkably powerful. Together, they force the following basic properties.
[quotetheorem:1106]
Each property follows from the axioms by straightforward decompositions into disjoint unions. The strategy is always the same: rewrite the event in question as a disjoint union, then apply countable additivity.
[citeproof:1106]
The technique used throughout this proof — decomposing into disjoint unions and applying countable additivity — is the most fundamental technique in probability theory. It reappears in the proofs of the Continuity of Probability, Countable Subadditivity, and the Law of Total Probability. Part (ii) gives the technique of **complementary counting**: to compute $\mathbb{P}(A)$, it is often easier to compute $\mathbb{P}(A^c)$ and subtract from $1$.
The countable additivity axiom also has a "continuous" consequence for nested events: probability commutes with monotone limits.
[quotetheorem:1107]
The proof converts a monotone sequence into a disjoint one by taking "differences," then applies countable additivity. This is the same "disjointification" technique used in the basic properties above — it is the canonical way to apply countable additivity to non-disjoint events.
[citeproof:1107]
The telescoping decomposition $B_n = A_n \setminus A_{n-1}$ is a technique that recurs whenever we need to pass from a monotone sequence to a disjoint one — we will use it again in the proof of the Continuity Theorem for mgfs (Chapter 9). The continuity property itself is essential for working with events defined by infinite limiting processes — for example, the event "a random walk eventually returns to zero" is a countable union of events "the walk returns to zero by time $n$."
Another fundamental consequence of countable additivity is the **union bound**, which provides an upper estimate for the probability of a union of events that are not necessarily disjoint.
[quotetheorem:1108]
The inequality is often strict — the sum on the right counts each outcome in the intersection multiple times. Nevertheless, the union bound is one of the most frequently used tools in probability, precisely because it does not require any knowledge of the dependence structure between the events.
The proof uses the same disjointification technique as the Continuity of Probability, combined with monotonicity.
[citeproof:1108]
The union bound is the workhorse of many "counting" arguments in probability: when bounding the probability that *something bad* happens, we express it as the union of many bad events and bound each one individually. The price we pay — ignoring overlaps — is often worth the simplicity of the resulting bound.
## 1.2 Counting in Finite Sample Spaces
Many of the most important probability problems arise in a setting where $\Omega$ is finite and all outcomes are equally likely. In this case, computing a probability reduces to counting: if $\mathbb{P}(\{\omega\}) = 1/|\Omega|$ for every $\omega \in \Omega$, then for any event $A \subset \Omega$,
\begin{align*}
\mathbb{P}(A) = \frac{|A|}{|\Omega|}.
\end{align*}
The art of probability in finite spaces is therefore the art of combinatorics: counting the number of outcomes in both $A$ and $\Omega$, often using clever decompositions and bijections.
[example: Coincident Birthdays]
Suppose $n$ people are chosen and their birthdays are recorded. What is the probability that at least two of them share a birthday?
We model each person's birthday as a number in $\{1, 2, \ldots, 365\}$, chosen uniformly and independently. The sample space is $\Omega = \{1, \ldots, 365\}^n$ with $|\Omega| = 365^n$. Let $A = \{\text{at least two people share a birthday}\}$.
Rather than computing $|A|$ directly — which would require a complicated inclusion-exclusion argument — we use complementary counting. The complementary event $A^c = \{\text{all } n \text{ birthdays are distinct}\}$ is much easier: the first person's birthday can be any of $365$ values, the second any of $364$, and so on. Hence
\begin{align*}
|A^c| = 365 \times 364 \times \cdots \times (365 - n + 1),
\end{align*}
and therefore
\begin{align*}
\mathbb{P}(A) = 1 - \mathbb{P}(A^c) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}.
\end{align*}
For $n = 23$, this probability exceeds $1/2$. The result is often considered surprising: among just $23$ people, it is more likely than not that two share a birthday. The reason is that the number of *pairs* grows quadratically — there are $\binom{23}{2} = 253$ pairs — and each pair has a small but non-negligible chance of a match.
[/example]
### 1.2.1 The Multinomial Coefficient
The fundamental counting problem underlying much of discrete probability is the following: given $n$ objects, how many ways can we partition them into $k$ groups of sizes $n_1, n_2, \ldots, n_k$ with $n_1 + n_2 + \cdots + n_k = n$?
We choose $n_1$ objects for the first group (in $\binom{n}{n_1}$ ways), then $n_2$ from the remaining $n - n_1$ objects, and so on. The total count is
\begin{align*}
\binom{n}{n_1} \binom{n - n_1}{n_2} \cdots \binom{n - n_1 - \cdots - n_{k-1}}{n_k} = \frac{n!}{n_1! \, n_2! \cdots n_k!}.
\end{align*}
[definition: Multinomial Coefficient]
The **multinomial coefficient** is defined by
\begin{align*}
\binom{n}{n_1, n_2, \ldots, n_k} := \frac{n!}{n_1! \, n_2! \cdots n_k!},
\end{align*}
where $n_1 + n_2 + \cdots + n_k = n$ and each $n_i \ge 0$.
[/definition]
When $k = 2$, this reduces to the ordinary binomial coefficient $\binom{n}{n_1}$.
### 1.2.2 Compositions and the Stars-and-Bars Method
A different counting problem arises when we wish to distribute $m$ identical objects into $k$ distinct groups.
[definition: Composition]
A **composition** of $m$ with $k$ parts is a sequence $(m_1, m_2, \ldots, m_k)$ of non-negative integers with $m_1 + m_2 + \cdots + m_k = m$.
[/definition]
To count compositions, we use the **stars-and-bars** bijection: represent the $m$ objects as $m$ stars, and separate the $k$ groups using $k - 1$ dividers. For example, the composition $(3, 0, 1, 2)$ of $6$ with $4$ parts corresponds to the arrangement $\star \star \star \mid \mid \star \mid \star \star$. Each arrangement of $m$ stars and $k - 1$ dividers corresponds to exactly one composition, so
\begin{align*}
\text{number of compositions of } m \text{ with } k \text{ parts} = \binom{m + k - 1}{m} = \binom{m + k - 1}{k - 1}.
\end{align*}
### 1.2.3 Random Walks on $\mathbb{Z}$
A natural setting where counting meets probability is the simple symmetric random walk. Consider a particle starting at the origin that, at each time step, moves one unit to the right (with probability $1/2$) or one unit to the left (with probability $1/2$), independently at each step.
We model this as follows. The sample space is
\begin{align*}
\Omega = \{(X_0, X_1, \ldots, X_n) : X_0 = 0,\; |X_k - X_{k-1}| = 1 \text{ for } k = 1, \ldots, n\},
\end{align*}
with $|\Omega| = 2^n$ (at each step, there are two choices). Under the uniform measure, each path has probability $2^{-n}$.
What is the probability that the walk returns to the origin at time $n$? If $n$ is odd, the walk is at an odd position at time $n$, so $\mathbb{P}(X_n = 0) = 0$. If $n$ is even, we need exactly $n/2$ steps to the right and $n/2$ steps to the left, giving
\begin{align*}
\mathbb{P}(X_n = 0) = \frac{\binom{n}{n/2}}{2^n}.
\end{align*}
To understand the behaviour of this probability for large $n$, we need Stirling's formula.
### 1.2.4 Stirling's Formula
To estimate factorial quantities asymptotically, we introduce a standard notion of asymptotic comparison.
[definition: Asymptotic Equality]
Two sequences $(a_n)$ and $(b_n)$ of positive real numbers satisfy $a_n \sim b_n$ if $a_n / b_n \to 1$ as $n \to \infty$.
[/definition]
The central asymptotic estimate for factorials is the following.
[quotetheorem:1109]
We will not prove the precise form with the constant $\sqrt{2\pi}$, but we can establish the weaker statement $\log(n!) \sim n \log n$, which suffices for many asymptotic estimates.
[quotetheorem:1110]
The proof compares $\log(n!)$, which is a sum, with the integral $\int_1^n \log x \, dx$, using the monotonicity of $\log$. This "comparison with an integral" technique is a standard method for estimating sums with monotone terms.
[citeproof:1110]
The integral comparison technique used here — bounding a sum above and below by integrals of a monotone function — is a workhorse of asymptotic analysis that reappears in Part IB Analysis II.
The full Stirling formula gives $\mathbb{P}(X_{2n} = 0) \sim 1/\sqrt{\pi n}$ for the symmetric random walk. This tends to zero, but slowly — the walk returns to the origin infinitely often.
## 1.3 The Inclusion-Exclusion Principle
The formula $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)$ extends to any finite number of events. The challenge is that as the number of events grows, the number of correction terms grows rapidly — for $n$ events, there are $2^n - 1$ terms. Nevertheless, the resulting identity is one of the most powerful tools in combinatorial probability.
[quotetheorem:751]
The formula alternates between overestimates and underestimates: including all singletons overcounts, subtracting all pairwise intersections undercounts, adding all triple intersections overcounts again, and so on. This alternating behaviour is captured by the Bonferroni inequalities.
The proof proceeds by induction on $n$, using the two-set inclusion-exclusion as the base case and the distributive law for $\cap$ over $\cup$ to reduce the inductive step to an application of the hypothesis.
[citeproof:751]
The induction-on-the-number-of-events technique used here is a recurring pattern in combinatorial probability. The key step — distributing $\cap A_n$ over the union — uses the distributive law of set theory, which converts a union of $n$ events into two unions of $n-1$ events. The Inclusion-Exclusion Principle converts a difficult probability computation (the probability of a union) into a potentially simpler collection of computations (the probabilities of intersections). The Bonferroni inequalities guarantee that truncating the sum gives useful bounds.
[quotetheorem:1111]
The proof proceeds by induction on $n$, using the same decomposition as for Inclusion-Exclusion and the observation that the sign of the inequality flips when we subtract a lower bound (the even case applied to the $B_i$ intersections). In practice, the $r = 1$ case (the union bound) is used most frequently: $\mathbb{P}\left(\bigcup A_i\right) \le \sum \mathbb{P}(A_i)$.
[example: Counting Surjections]
How many functions $f : \{1, \ldots, n\} \to \{1, \ldots, m\}$ are surjective?
The total number of functions is $m^n$. Let $A_i = \{f : i \notin \operatorname{Im}(f)\}$ be the set of functions that miss the value $i$. Then the set of surjections is $A_1^c \cap A_2^c \cap \cdots \cap A_m^c = (A_1 \cup \cdots \cup A_m)^c$. The key observation is that $|A_{i_1} \cap \cdots \cap A_{i_k}| = (m-k)^n$ — if $k$ specified values are excluded from the range, each of the $n$ inputs can map to any of the remaining $m - k$ values. Since the value of the intersection depends only on $k$ (not on which $k$ values are excluded), and there are $\binom{m}{k}$ choices for the excluded values, Inclusion-Exclusion gives:
\begin{align*}
\text{number of surjections} = \sum_{k=0}^{m} (-1)^k \binom{m}{k} (m-k)^n.
\end{align*}
[/example]
The tools of Chapter 1 — the axiomatic framework, complementary counting, inclusion-exclusion, and Stirling's formula — provide the foundation for computing probabilities. But so far, every probability computation has started from scratch. In the next chapter, we develop two ideas — conditional probability and independence — that allow us to decompose complex probability computations into simpler pieces.
# 2. Conditional Probability and Independence
In Chapter 1, we set up the formal machinery of probability spaces and developed combinatorial tools for computing probabilities in finite sample spaces. But many of the most important probabilistic arguments depend on the idea that *observing partial information changes the probabilities of other events.* If we know that it rained yesterday, does that change the probability of rain today? If a medical test comes back positive, how likely is it that the patient actually has the disease? To handle such questions, we need two related ideas: conditional probability (how probabilities update given new information) and independence (the special case where they do not).
## 2.1 Conditional Probability
Suppose we are told that an event $B$ has occurred, and we wish to update our assessment of the probability of another event $A$. The idea is simple: among all the outcomes consistent with $B$, what fraction also lies in $A$? This suggests the ratio $\mathbb{P}(A \cap B) / \mathbb{P}(B)$, provided $\mathbb{P}(B) > 0$.
[definition: Conditional Probability]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $B \in \mathcal{F}$ with $\mathbb{P}(B) > 0$. The **conditional probability** of $A$ given $B$ is
\begin{align*}
\mathbb{P}(A \mid B) := \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.
\end{align*}
[/definition]
This definition is more than a formula — it defines a *new probability measure*. For fixed $B$ with $\mathbb{P}(B) > 0$, the function $A \mapsto \mathbb{P}(A \mid B)$ satisfies all the axioms of a probability measure on $(\Omega, \mathcal{F})$: it is non-negative, $\mathbb{P}(\Omega \mid B) = 1$, and it is countably additive. The countable additivity follows from the next result.
[quotetheorem:1112]
The proof is a direct computation: distribute the intersection with $B$ across the union, apply the countable additivity of $\mathbb{P}$, then divide by $\mathbb{P}(B)$.
[citeproof:1112]
The key step was distributing $\cap B$ inside the union — a technique that recurs whenever we need to verify that a conditional probability measure inherits properties from the unconditional one.
## 2.2 The Law of Total Probability and Bayes' Formula
Conditional probability becomes a powerful computational tool when combined with the idea of **partitioning** the sample space. If we can decompose $\Omega$ into disjoint pieces and compute the conditional probability of $A$ given each piece, we can reconstruct the unconditional probability of $A$.
[definition: Partition]
A **partition** of $\Omega$ is a countable collection of pairwise disjoint events $\{B_n\}_{n \ge 1}$ such that $\bigcup_{n \ge 1} B_n = \Omega$.
[/definition]
Given a partition, we can decompose any event $A$ as a disjoint union $A = \bigcup_n (A \cap B_n)$ and compute $\mathbb{P}(A)$ piece by piece.
[quotetheorem:1113]
The proof decomposes $A$ as a disjoint union $A = \bigcup_n (A \cap B_n)$ and applies countable additivity — the same disjointification technique we have been using throughout.
[citeproof:1113]
The Law of Total Probability is the first instance of a technique that reappears throughout these notes: **conditioning on a partition** (or more generally, on a random variable) to break a complex computation into simpler pieces. We will see this technique applied to compute expectations in Chapter 4, to analyse branching processes in Chapter 7, and to solve the Gambler's Ruin problem in Chapter 5.
The Law of Total Probability, combined with the definition of conditional probability, immediately yields Bayes' formula — a way to "invert" conditional probabilities.
[quotetheorem:1114]
The numerator is $\mathbb{P}(A \cap B_n) = \mathbb{P}(A \mid B_n) \, \mathbb{P}(B_n)$, and the denominator is $\mathbb{P}(A)$ computed via the Law of Total Probability. The formula is therefore a direct consequence of the definitions, but its conceptual importance is enormous: it tells us how to *update* our beliefs about which scenario $B_n$ produced the data $A$, given the prior probabilities $\mathbb{P}(B_n)$ and the likelihoods $\mathbb{P}(A \mid B_n)$.
The proof is immediate: write the definition of $\mathbb{P}(B_n \mid A)$ and substitute the Law of Total Probability for $\mathbb{P}(A)$.
[citeproof:1114]
Bayes' formula is the foundation of Bayesian statistics. The terms $\mathbb{P}(B_n)$ are the *prior* probabilities (before observing $A$), $\mathbb{P}(A \mid B_n)$ are the *likelihoods* (the probability of the data under each hypothesis), and $\mathbb{P}(B_n \mid A)$ are the *posterior* probabilities (updated after observing $A$). The following example illustrates how a test with high accuracy can still produce a low posterior probability when the prior probability is small.
[example: False Positive Paradox]
Suppose $0.1\%$ of the population has a certain disease. A diagnostic test has a $98\%$ detection rate (it is positive for $98\%$ of those who have the disease) and a $1\%$ false positive rate (it is positive for $1\%$ of those who do not have the disease). If a randomly chosen individual tests positive, what is the probability that they actually have the disease?
Let $D$ be the event that the individual has the disease, and $T$ be the event of a positive test. We are given $\mathbb{P}(D) = 0.001$, $\mathbb{P}(T \mid D) = 0.98$, and $\mathbb{P}(T \mid D^c) = 0.01$. By Bayes' formula:
\begin{align*}
\mathbb{P}(D \mid T) = \frac{\mathbb{P}(T \mid D) \, \mathbb{P}(D)}{\mathbb{P}(T \mid D) \, \mathbb{P}(D) + \mathbb{P}(T \mid D^c) \, \mathbb{P}(D^c)} = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.01 \times 0.999} \approx 0.089.
\end{align*}
Despite the test's high accuracy, a positive result indicates only about a $9\%$ chance of disease. The reason is that the disease is rare: in a population of $1000$, roughly $1$ person has the disease and tests positive, while roughly $10$ healthy people also test positive. So among $11$ positive tests, only $1$ is a true positive. The low base rate ($\mathbb{P}(D) = 0.001$) overwhelms the test's discriminatory power.
[/example]
## 2.3 Independence
Conditional probability describes how information changes probabilities. The special case where no change occurs — where knowing that $B$ happened tells us nothing about $A$ — leads to the concept of independence.
If $A$ and $B$ are independent, we want $\mathbb{P}(A \mid B) = \mathbb{P}(A)$, which by the definition of conditional probability amounts to $\mathbb{P}(A \cap B) = \mathbb{P}(A) \, \mathbb{P}(B)$. We take the multiplicative formula as the definition, since it avoids the requirement $\mathbb{P}(B) > 0$.
[definition: Independence of Two Events]
Events $A, B \in \mathcal{F}$ are **independent** if $\mathbb{P}(A \cap B) = \mathbb{P}(A) \, \mathbb{P}(B)$.
[/definition]
For more than two events, independence is a stronger condition than pairwise independence: it requires the multiplicative property for *every* subcollection, not just pairs.
[definition: Independence of a Collection]
A countable collection of events $(A_n)_{n \ge 1}$ is **independent** (or **mutually independent**) if for every finite subcollection $\{A_{i_1}, A_{i_2}, \ldots, A_{i_k}\}$ with distinct indices,
\begin{align*}
\mathbb{P}(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = \prod_{j=1}^{k} \mathbb{P}(A_{i_j}).
\end{align*}
[/definition]
The distinction between pairwise independence and mutual independence is genuine: pairwise independence does *not* imply mutual independence. The following example makes this precise.
[example: Pairwise but Not Mutually Independent]
Toss a fair coin twice, so $\Omega = \{(0,0), (0,1), (1,0), (1,1)\}$ with $\mathbb{P}(\{\omega\}) = 1/4$ for each $\omega$. Define:
\begin{align*}
A = \{(0,0), (0,1)\}, \quad B = \{(0,0), (1,0)\}, \quad C = \{(1,0), (0,1)\}.
\end{align*}
Then $\mathbb{P}(A) = \mathbb{P}(B) = \mathbb{P}(C) = 1/2$. One checks:
\begin{align*}
\mathbb{P}(A \cap B) &= \mathbb{P}(\{(0,0)\}) = \tfrac{1}{4} = \mathbb{P}(A) \, \mathbb{P}(B), \\
\mathbb{P}(A \cap C) &= \mathbb{P}(\{(0,1)\}) = \tfrac{1}{4} = \mathbb{P}(A) \, \mathbb{P}(C), \\
\mathbb{P}(B \cap C) &= \mathbb{P}(\{(1,0)\}) = \tfrac{1}{4} = \mathbb{P}(B) \, \mathbb{P}(C).
\end{align*}
So $A, B, C$ are pairwise independent. However, $A \cap B \cap C = \varnothing$, so $\mathbb{P}(A \cap B \cap C) = 0 \ne 1/8 = \mathbb{P}(A) \, \mathbb{P}(B) \, \mathbb{P}(C)$. The three events are not mutually independent.
[/example]
A useful consequence of independence is that it is preserved under complementation.
[quotetheorem:1115]
The proof subtracts the intersection $A \cap B$ from $A$ to isolate $A \cap B^c$, then uses the independence hypothesis to factor the result.
[citeproof:1115]
By applying this result twice (first to $A, B^c$, then to $A^c, B^c$), one also shows that $A^c$ and $B^c$ are independent whenever $A$ and $B$ are. The technique of expressing $\mathbb{P}(A \cap B^c)$ as $\mathbb{P}(A) - \mathbb{P}(A \cap B)$ is a routine set-decomposition argument, but the conclusion is important: independence is a property of the "information content" of events, not of the specific events themselves, so replacing an event by its complement does not destroy it.
The ideas of conditional probability and independence provide the machinery for decomposing complex experiments into simpler parts. In the next chapter, we use these tools to study numerical quantities derived from random experiments.
# 3. Discrete Random Variables
In Chapters 1 and 2, we studied probabilities of events — subsets of the sample space $\Omega$. But in most applications, we are not interested in the individual outcomes themselves; we care about some *numerical quantity* derived from the outcome. When rolling two dice, we may care about the sum of the faces, not the individual faces. When counting customers arriving in a shop, we care about the total count, not the identity of each customer. This chapter formalises the idea of a "numerical quantity determined by the outcome" through the concept of a random variable.
## 3.1 Random Variables and Their Distributions
[definition: Random Variable]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A **random variable** is a function $X : \Omega \to \mathbb{R}$ such that for every $x \in \mathbb{R}$, the set $\{\omega \in \Omega : X(\omega) \le x\}$ belongs to $\mathcal{F}$.
[/definition]
The measurability condition — requiring $\{X \le x\} \in \mathcal{F}$ for all $x$ — ensures that we can compute $\mathbb{P}(X \le x)$. When $\Omega$ is countable and $\mathcal{F} = \mathcal{P}(\Omega)$, every function $X : \Omega \to \mathbb{R}$ is automatically a random variable.
We adopt the standard shorthand: $\{X \in A\}$ denotes the event $\{\omega \in \Omega : X(\omega) \in A\}$.
[definition: Discrete Random Variable]
A random variable $X$ is **discrete** if it takes values in a countable set $S \subset \mathbb{R}$.
[/definition]
The distribution of a discrete random variable is completely determined by the probabilities $\mathbb{P}(X = x)$ for $x \in S$.
[definition: Probability Mass Function]
The **probability mass function** (pmf) of a discrete random variable $X$ is the function
\begin{align*}
p_X : \mathbb{R} &\to [0, 1] \\
x &\mapsto \mathbb{P}(X = x).
\end{align*}
The pmf satisfies $p_X(x) \ge 0$ for all $x$, $p_X(x) = 0$ for $x \notin S$, and $\sum_{x \in S} p_X(x) = 1$.
[/definition]
A crucial observation: given any function $p : S \to [0,1]$ with $\sum_{x \in S} p(x) = 1$ on a countable set $S$, we can always construct a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a random variable $X$ with pmf $p$. Simply take $\Omega = S$, $\mathcal{F} = \mathcal{P}(S)$, $\mathbb{P}(\{x\}) = p(x)$, and $X(\omega) = \omega$. In practice, we often *define* random variables by specifying their pmf, without explicitly constructing the underlying probability space.
### 3.1.1 The Indicator Random Variable
A particularly important class of random variables arises from events.
[definition: Indicator Random Variable]
Let $A \in \mathcal{F}$ be an event. The **indicator random variable** of $A$ is
\begin{align*}
\mathbb{1}_A : \Omega &\to \{0, 1\} \\
\omega &\mapsto \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A. \end{cases}
\end{align*}
[/definition]
Indicator random variables are the bridge between events and random variables. They satisfy the following algebraic identities, which allow us to translate set operations into arithmetic:
(i) $\mathbb{1}_{A^c} = 1 - \mathbb{1}_A$.
(ii) $\mathbb{1}_{A \cap B} = \mathbb{1}_A \cdot \mathbb{1}_B$.
(iii) $\mathbb{1}_{A \cup B} = \mathbb{1}_A + \mathbb{1}_B - \mathbb{1}_A \cdot \mathbb{1}_B$.
These follow directly from the definition by checking the two cases $\omega \in A$ and $\omega \notin A$.
## 3.2 Discrete Distributions
A recurring task in probability is to model a specific type of random experiment — a coin toss, a sequence of trials, a count of rare events, a waiting time — and encode its structure in a probability mass function. The distributions we now introduce are not arbitrary choices from a catalogue; each arises as the natural answer to a specific probabilistic question. We organise them by the type of question they answer.
### 3.2.1 Counting Successes: Bernoulli and Binomial
The simplest random experiment is a single trial with two outcomes: success or failure. This is the building block from which more complex distributions are constructed.
[definition: Bernoulli Distribution]
A random variable $X$ has the **Bernoulli distribution** with parameter $p \in [0,1]$, written $X \sim \operatorname{Bern}(p)$, if $X$ takes values in $\{0, 1\}$ with $\mathbb{P}(X = 1) = p$ and $\mathbb{P}(X = 0) = 1 - p$.
[/definition]
This models a single trial (e.g. a biased coin toss) that succeeds with probability $p$ and fails with probability $q := 1 - p$. The indicator $\mathbb{1}_A$ of any event $A$ is $\operatorname{Bern}(\mathbb{P}(A))$.
When we repeat a Bernoulli trial $n$ times independently and count the total number of successes, we naturally arrive at the binomial distribution.
[definition: Binomial Distribution]
A random variable $X$ has the **binomial distribution** with parameters $n \in \mathbb{N}$ and $p \in [0,1]$, written $X \sim \operatorname{Bin}(n, p)$, if $X$ takes values in $\{0, 1, \ldots, n\}$ with
\begin{align*}
\mathbb{P}(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.
\end{align*}
[/definition]
One verifies that these probabilities sum to $1$ by the binomial theorem: $\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + 1 - p)^n = 1$.
### 3.2.2 Modelling Rare Events: The Poisson Distribution
A different question arises when we have a very large number of trials but each individual success is very unlikely. What distribution describes the count of successes when the expected number $np = \lambda$ is fixed but $n \to \infty$? The answer is the Poisson distribution.
[definition: Poisson Distribution]
A random variable $X$ has the **Poisson distribution** with parameter $\lambda > 0$, written $X \sim \operatorname{Po}(\lambda)$, if $X$ takes values in $\{0, 1, 2, \ldots\}$ with
\begin{align*}
\mathbb{P}(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots
\end{align*}
[/definition]
The probabilities sum to $1$ since $\sum_{k=0}^{\infty} \lambda^k / k! = e^{\lambda}$.
The Poisson approximation to the binomial is the following: if $X_n \sim \operatorname{Bin}(n, \lambda/n)$, then for each fixed $k$,
\begin{align*}
\mathbb{P}(X_n = k) = \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} \to e^{-\lambda} \frac{\lambda^k}{k!} \quad \text{as } n \to \infty.
\end{align*}
The essential computation is:
\begin{align*}
\binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k}{k!} \cdot \frac{n!}{n^k (n-k)!} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k}.
\end{align*}
The second factor is $n(n-1)\cdots(n-k+1)/n^k \to 1$ as $n \to \infty$ (for fixed $k$), and the third factor converges to $e^{-\lambda}$.
### 3.2.3 Waiting for Success: The Geometric Distribution
Yet another natural question: if we perform independent Bernoulli trials, how many trials must we wait for the first success? This leads to the geometric distribution.
[definition: Geometric Distribution]
A random variable $X$ has the **geometric distribution** with parameter $p \in (0, 1]$, written $X \sim \operatorname{Geom}(p)$, if $X$ takes values in $\{1, 2, 3, \ldots\}$ with
\begin{align*}
\mathbb{P}(X = k) = (1 - p)^{k-1} p, \qquad k = 1, 2, 3, \ldots
\end{align*}
[/definition]
The probabilities sum to $1$:
\begin{align*}
\sum_{k=1}^{\infty} (1-p)^{k-1} p = p \cdot \frac{1}{1 - (1-p)} = 1.
\end{align*}
A key property of the geometric distribution is the **memoryless property**: $\mathbb{P}(X > m + n \mid X > m) = \mathbb{P}(X > n)$. In words, given that the first $m$ trials were failures, the distribution of the remaining waiting time is the same as the original distribution. This is the discrete analogue of the memoryless property of the exponential distribution, which we encounter in Chapter 8.
## 3.3 Independent Random Variables
The notion of independence extends from events to random variables.
[definition: Independent Random Variables]
Discrete random variables $X_1, X_2, \ldots, X_n$ are **independent** if for all $x_1, x_2, \ldots, x_n \in \mathbb{R}$,
\begin{align*}
\mathbb{P}(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} \mathbb{P}(X_i = x_i).
\end{align*}
[/definition]
Independence is preserved under transformations: if $X_1, \ldots, X_n$ are independent and $f_1, \ldots, f_n : \mathbb{R} \to \mathbb{R}$ are any functions, then $f_1(X_1), \ldots, f_n(X_n)$ are independent. This is because the event $\{f_i(X_i) = y_i\}$ is a union of events $\{X_i = x_i\}$, and the factorisation of the joint probability carries through.
[quotetheorem:1116]
The proof writes the joint probability of the $f_i(X_i)$ as a sum over preimages, uses independence to factor the joint probability of the $X_i$, and then factors the resulting sum of products into a product of sums.
[citeproof:1116]
The factorisation of a sum over a Cartesian product into a product of sums is a fundamental algebraic identity that appears throughout probability — we will use it again in the proof that the expectation of a product of independent random variables factors (Chapter 4) and in the proof that pgfs of sums are products (Chapter 7).
With random variables and their distributions in hand, we now turn to the fundamental numerical summaries — expectation, variance, and covariance — that allow us to extract quantitative information from distributions without computing them in full.
# 4. Expectation, Variance, and Covariance
The previous chapter introduced discrete random variables and their distributions — a complete description of all the probabilities associated with a random variable. But a complete description is often more information than we need, and in many situations it is not feasible to compute the full distribution. This chapter develops numerical summaries — expectation, variance, covariance — that capture the essential features of a random variable's distribution in a few numbers. These summaries are not merely descriptive: they are the workhorses of probabilistic computation, and the inequalities relating them (developed in Chapter 6) are the tools that power limit theorems.
## 4.1 Expectation
The expectation of a random variable is its "average value," weighted by probabilities. If we were to repeat the underlying random experiment many times and average the results, the expectation is the value we would expect this average to approach.
[definition: Expectation]
Let $X$ be a discrete random variable on $(\Omega, \mathcal{F}, \mathbb{P})$. The **expectation** (or **mean**, or **expected value**) of $X$ is
\begin{align*}
\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) \, \mathbb{P}(\{\omega\}),
\end{align*}
provided the sum converges absolutely. Equivalently, writing $S$ for the range of $X$,
\begin{align*}
\mathbb{E}[X] = \sum_{x \in S} x \, \mathbb{P}(X = x).
\end{align*}
[/definition]
The equivalence of the two expressions follows by grouping terms: $\sum_{\omega \in \Omega} X(\omega) \, \mathbb{P}(\{\omega\}) = \sum_{x \in S} x \sum_{\omega : X(\omega) = x} \mathbb{P}(\{\omega\}) = \sum_{x \in S} x \, \mathbb{P}(X = x)$. Absolute convergence is required to ensure the sum does not depend on the order of summation.
The requirement of absolute convergence is not a technicality — without it, the expectation may fail to exist. The following classical example demonstrates this.
[example: St. Petersburg Paradox]
A fair coin is tossed repeatedly until the first head appears. If the first head appears on toss $k$, the player receives $2^k$ roubles. What is the expected payout?
The number of tosses is $X \sim \operatorname{Geom}(1/2)$, and the payout is $Y = 2^X$. The expectation of $Y$ is
\begin{align*}
\mathbb{E}[Y] = \sum_{k=1}^{\infty} 2^k \cdot \mathbb{P}(X = k) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^{\infty} 1 = +\infty.
\end{align*}
The expected payout is infinite, yet no reasonable person would pay an enormous sum to play this game. The paradox illustrates that the expectation, when it exists, captures a long-run average — but when the sum diverges, the expectation simply does not exist as a finite number, and the concept of "fair price" requires a more sophisticated analysis (for example, using utility theory).
[/example]
The expectation of an indicator random variable connects expectation back to probability: $\mathbb{E}[\mathbb{1}_A] = 1 \cdot \mathbb{P}(A) + 0 \cdot \mathbb{P}(A^c) = \mathbb{P}(A)$.
[quotetheorem:1117]
Property (iv) is remarkable: it holds regardless of any dependence between $X$ and $Y$. No assumption of independence is required. This is one of the most useful facts in all of probability.
The proof of each property follows directly from the definition of expectation as a sum. The key idea is that summation is a linear operation.
[citeproof:1117]
The linearity of expectation — the combination of (iii) and (iv) — is the single most important computational tool in discrete probability. It does not require independence and works even when the summands are highly dependent. We exploit this repeatedly below.
An important computation technique: if $f : \mathbb{R} \to \mathbb{R}$ is any function and $X$ is a discrete random variable, then $\mathbb{E}[f(X)] = \sum_{x \in S} f(x) \, \mathbb{P}(X = x)$. There is no need to first compute the distribution of $f(X)$ — we can compute the expectation directly from the distribution of $X$.
### 4.1.1 Expectations of Standard Distributions
[example: Expectation of the Binomial]
Let $X \sim \operatorname{Bin}(n, p)$. Then
\begin{align*}
\mathbb{E}[X] &= \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k}.
\end{align*}
Using the identity $k \binom{n}{k} = n \binom{n-1}{k-1}$ (valid for $k \ge 1$), the $k = 0$ term vanishes and:
\begin{align*}
\mathbb{E}[X] &= \sum_{k=1}^{n} n \binom{n-1}{k-1} p^k (1-p)^{n-k} = np \sum_{j=0}^{n-1} \binom{n-1}{j} p^j (1-p)^{n-1-j} = np,
\end{align*}
by the binomial theorem. There is also a much simpler proof using indicators: write $X = \sum_{i=1}^{n} \mathbb{1}_{A_i}$ where $A_i$ is the event that the $i$-th trial is a success. Then $\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[\mathbb{1}_{A_i}] = np$.
[/example]
The two methods illustrate an important contrast: the direct computation requires a non-trivial combinatorial identity, while the indicator method uses only linearity of expectation and the fact that $\mathbb{E}[\mathbb{1}_A] = \mathbb{P}(A)$. The indicator technique avoids computing the full pmf of $X$.
[example: Expectation of the Poisson]
Let $X \sim \operatorname{Po}(\lambda)$. The computation uses the index-shifting trick: extract a factor of $k$ from the numerator to cancel the $k!$ in the denominator.
\begin{align*}
\mathbb{E}[X] = \sum_{k=0}^{\infty} k \, e^{-\lambda} \frac{\lambda^k}{k!} = \sum_{k=1}^{\infty} e^{-\lambda} \frac{\lambda^k}{(k-1)!} = \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = \lambda e^{-\lambda} \cdot e^{\lambda} = \lambda.
\end{align*}
[/example]
The index-shifting technique — replacing the summation index to simplify the expression — is a standard method that appears in every Poisson computation.
[example: Expectation of the Geometric]
Let $X \sim \operatorname{Geom}(p)$ with $q = 1 - p$. The computation uses term-by-term differentiation of a geometric series.
\begin{align*}
\mathbb{E}[X] = \sum_{k=1}^{\infty} k \, q^{k-1} p = p \sum_{k=1}^{\infty} k \, q^{k-1} = p \cdot \frac{d}{dq}\left(\sum_{k=0}^{\infty} q^k\right) = p \cdot \frac{d}{dq}\left(\frac{1}{1-q}\right) = p \cdot \frac{1}{(1-q)^2} = \frac{1}{p}.
\end{align*}
[/example]
The "differentiate the generating function" technique used here for the geometric distribution is a precursor to the probability generating function methods developed in Chapter 7.
### 4.1.2 The Power of Indicator Random Variables
Linearity of expectation, combined with indicator random variables, provides a remarkably powerful technique for computing expectations that would be extremely difficult by direct calculation. The key idea: decompose a complicated random variable as a sum of indicators, then use $\mathbb{E}[\mathbb{1}_A] = \mathbb{P}(A)$.
[example: Couples at a Round Table]
Suppose $n \ge 2$ married couples are seated at random around a circular table, with men and women alternating. Let $N$ be the number of couples seated next to each other. Compute $\mathbb{E}[N]$.
Let $A_i = \{\text{couple } i \text{ are adjacent}\}$. Then $N = \sum_{i=1}^{n} \mathbb{1}_{A_i}$ and $\mathbb{E}[N] = \sum_{i=1}^{n} \mathbb{P}(A_i)$.
By symmetry, $\mathbb{P}(A_i)$ is the same for all $i$. Fix wife $i$: she sits in one of the $n$ "women's seats." Her husband occupies one of the $n$ "men's seats." Of the $n$ men's seats, exactly $2$ are adjacent to wife $i$'s seat (one on each side), so $\mathbb{P}(A_i) = 2/n$. Therefore
\begin{align*}
\mathbb{E}[N] = n \cdot \frac{2}{n} = 2.
\end{align*}
This result is independent of $n$ — whether there are $3$ couples or $300$, the expected number seated together is $2$.
[/example]
## 4.2 Variance
The expectation tells us the "centre" of a random variable's distribution, but nothing about how spread out the distribution is around that centre. A random variable that always equals its mean has no spread; one that takes values far from its mean with appreciable probability has a large spread. The variance quantifies this.
[definition: Variance]
The **variance** of a random variable $X$ with $\mathbb{E}[X^2] < \infty$ is
\begin{align*}
\operatorname{Var}(X) := \mathbb{E}[(X - \mathbb{E}[X])^2].
\end{align*}
The **standard deviation** is $\sigma(X) := \sqrt{\operatorname{Var}(X)}$.
[/definition]
The variance is always non-negative and measures the average squared distance from the mean. The following properties make it computable.
[quotetheorem:1118]
Property (iii) is the standard computational formula: it is usually easier to compute $\mathbb{E}[X^2]$ and $(\mathbb{E}[X])^2$ separately than to expand $\mathbb{E}[(X - \mathbb{E}[X])^2]$ directly. The proof uses the linearity of expectation to expand the square.
[citeproof:1118]
The "expand and take expectations" technique in part (iii) is simple but fundamental — we use it again when computing the variance of a sum and the conditional variance formula in Chapter 5.
Unlike expectation, variance is **not** additive in general. We need independence (or at least uncorrelatedness).
### 4.2.1 Variances of Standard Distributions
[example: Variance of the Binomial]
Let $X \sim \operatorname{Bin}(n, p)$. We compute $\mathbb{E}[X(X-1)]$ using the identity $k(k-1)\binom{n}{k} = n(n-1)\binom{n-2}{k-2}$:
\begin{align*}
\mathbb{E}[X(X-1)] &= \sum_{k=2}^{n} k(k-1) \binom{n}{k} p^k (1-p)^{n-k} = n(n-1)p^2 \sum_{j=0}^{n-2} \binom{n-2}{j} p^j (1-p)^{n-2-j} = n(n-1)p^2.
\end{align*}
Therefore $\mathbb{E}[X^2] = \mathbb{E}[X(X-1)] + \mathbb{E}[X] = n(n-1)p^2 + np$, and
\begin{align*}
\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = n(n-1)p^2 + np - n^2p^2 = np(1-p).
\end{align*}
[/example]
The technique of computing $\mathbb{E}[X(X-1)]$ (the **factorial moment**) rather than $\mathbb{E}[X^2]$ directly is a standard trick: the factorial moment is often easier to evaluate because it cancels the factorial in the denominator of the pmf.
[example: Variance of the Poisson]
Let $X \sim \operatorname{Po}(\lambda)$. The factorial moment computation mirrors the binomial case:
\begin{align*}
\mathbb{E}[X(X-1)] = \sum_{k=2}^{\infty} k(k-1) e^{-\lambda} \frac{\lambda^k}{k!} = \lambda^2 e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = \lambda^2.
\end{align*}
So $\mathbb{E}[X^2] = \lambda^2 + \lambda$ and $\operatorname{Var}(X) = \lambda^2 + \lambda - \lambda^2 = \lambda$. The variance equals the mean — a characteristic property of the Poisson distribution.
[/example]
The equality $\operatorname{Var}(X) = \mathbb{E}[X] = \lambda$ for the Poisson is a useful diagnostic: if data show a variance much larger or smaller than the mean, the Poisson model may be inappropriate.
[example: Variance of the Geometric]
Let $X \sim \operatorname{Geom}(p)$ with $q = 1 - p$. We use the factorial moment approach. First, compute $\mathbb{E}[X(X-1)]$:
\begin{align*}
\mathbb{E}[X(X-1)] = \sum_{k=2}^{\infty} k(k-1) q^{k-1} p = pq \sum_{k=2}^{\infty} k(k-1) q^{k-2}.
\end{align*}
We recognise the sum as the second derivative of the geometric series. Since $\sum_{k=0}^{\infty} q^k = 1/(1-q)$, differentiating twice gives $\sum_{k=2}^{\infty} k(k-1) q^{k-2} = 2/(1-q)^3 = 2/p^3$. Therefore
\begin{align*}
\mathbb{E}[X(X-1)] = pq \cdot \frac{2}{p^3} = \frac{2q}{p^2}.
\end{align*}
Now $\mathbb{E}[X^2] = \mathbb{E}[X(X-1)] + \mathbb{E}[X] = 2q/p^2 + 1/p$, and
\begin{align*}
\operatorname{Var}(X) = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{2q + p - 1}{p^2} = \frac{q}{p^2},
\end{align*}
using $2q + p - 1 = 2(1-p) + p - 1 = 1 - p = q$.
[/example]
Not every random variable has finite variance. The following example shows that even with a finite mean, the variance can be infinite.
[example: Infinite Variance]
Let $X$ take values in $\{1, 2, 3, \ldots\}$ with $\mathbb{P}(X = k) = c/k^3$ for a normalising constant $c > 0$ (which is finite since $\sum_{k=1}^{\infty} 1/k^3$ converges). The expectation $\mathbb{E}[X] = c \sum_{k=1}^{\infty} 1/k^2$ is finite. However,
\begin{align*}
\mathbb{E}[X^2] = c \sum_{k=1}^{\infty} \frac{k^2}{k^3} = c \sum_{k=1}^{\infty} \frac{1}{k} = +\infty.
\end{align*}
So $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = +\infty$. This random variable has a well-defined mean but infinite variance. Chebyshev's inequality (Chapter 6) cannot be applied to such a variable, and the Weak Law of Large Numbers in its basic form does not apply.
[/example]
## 4.3 Covariance and Correlation
When $X$ and $Y$ are not independent, the variance of their sum involves a correction term measuring their dependence.
[definition: Covariance]
The **covariance** of random variables $X$ and $Y$ is
\begin{align*}
\operatorname{Cov}(X, Y) := \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X] \, \mathbb{E}[Y].
\end{align*}
[/definition]
Covariance has the following properties:
(i) $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$ (symmetry).
(ii) $\operatorname{Cov}(X, X) = \operatorname{Var}(X)$.
(iii) $\operatorname{Cov}(X + Z, Y) = \operatorname{Cov}(X, Y) + \operatorname{Cov}(Z, Y)$ (bilinearity).
(iv) $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$.
If $X$ and $Y$ are independent, then $\mathbb{E}[XY] = \mathbb{E}[X] \, \mathbb{E}[Y]$, so $\operatorname{Cov}(X, Y) = 0$. The converse is false: zero covariance does not imply independence.
[example: Zero Covariance Without Independence]
Suppose $(X, Y)$ takes each of the values $(2, 0)$, $(-1, 1)$, $(-1, -1)$ with probability $1/3$. Then $\mathbb{E}[X] = (2 - 1 - 1)/3 = 0$ and $\mathbb{E}[Y] = (0 + 1 - 1)/3 = 0$. Also $\mathbb{E}[XY] = (0 - 1 + 1)/3 = 0$, so $\operatorname{Cov}(X, Y) = 0$. But $X = 2$ if and only if $Y = 0$, so $X$ and $Y$ are not independent.
[/example]
This example shows that covariance measures only *linear* dependence. Two random variables can be perfectly dependent (one determines the other) yet have zero covariance, if the dependence is nonlinear.
[definition: Correlation]
The **correlation** of $X$ and $Y$ (assuming both have positive variance) is
\begin{align*}
\operatorname{Corr}(X, Y) := \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \, \operatorname{Var}(Y)}}.
\end{align*}
[/definition]
The Cauchy-Schwarz inequality (proved in Chapter 6) guarantees that $-1 \le \operatorname{Corr}(X, Y) \le 1$, with equality if and only if $Y$ is an affine function of $X$.
## 4.4 Variance of a Sum
The variance of a sum of random variables involves covariance terms. When the summands are independent (or even just pairwise uncorrelated), the cross-terms vanish.
[quotetheorem:1119]
The proof expands the square of the sum, separates the diagonal and off-diagonal terms, and uses independence to show that the off-diagonal terms vanish.
[citeproof:1119]
The "expand the square, separate diagonal from off-diagonal" technique is a computation pattern that recurs in the variance computation for branching processes (Chapter 7) and in the proof of the Weak Law of Large Numbers (Chapter 6). Note that the theorem requires only *pairwise* independence — a weaker condition than mutual independence.
For independent variables, the expectation of a product also factors.
[quotetheorem:1120]
The proof writes the expectation as a sum over the joint distribution, uses independence to factor the joint pmf, and then factors the resulting multi-sum into a product of single sums — the same algebraic identity used in the proof of Preservation of Independence.
[citeproof:1120]
The factorisation of a multi-sum into a product of single sums used in the second-to-last equality is the same Cartesian-product factorisation that appeared in the Preservation of Independence proof. This algebraic identity is the reason why generating functions (Chapters 7 and 9) are so effective: they encode independence as multiplication.
The expectation and variance tools developed here provide numerical summaries of distributions, but they leave many questions about multiple random variables unanswered. In the next chapter, we extend the conditioning idea from events to random variables, developing conditional expectation $\mathbb{E}[X \mid Y]$ and applying it to joint distributions and the Gambler's Ruin problem.
# 5. Conditional Expectation and Joint Distributions
The tools of Chapter 4 — expectation, variance, covariance — are powerful summaries of a single random variable's distribution, but they do not fully capture the relationship between two or more random variables. When we condition on partial information — knowing the value of one random variable — how does the expected value of another change? This chapter extends the conditioning ideas of Chapter 2 from events to random variables, develops joint, marginal, and conditional distributions for discrete random variables, and applies these tools to the Gambler's Ruin problem.
## 5.1 Conditional Expectation
### 5.1.1 Conditioning on an Event
The simplest form of conditional expectation conditions on a single event $B$ with $\mathbb{P}(B) > 0$. Since $A \mapsto \mathbb{P}(A \mid B)$ is a probability measure, we can define expectation with respect to it.
[definition: Conditional Expectation Given an Event]
Let $X$ be a discrete random variable and $B \in \mathcal{F}$ with $\mathbb{P}(B) > 0$. The **conditional expectation of $X$ given $B$** is
\begin{align*}
\mathbb{E}[X \mid B] := \sum_{x} x \, \mathbb{P}(X = x \mid B).
\end{align*}
[/definition]
All properties of ordinary expectation — linearity, monotonicity, etc. — hold for conditional expectations, since $\mathbb{P}(\cdot \mid B)$ is itself a probability measure.
### 5.1.2 Conditioning on a Random Variable
A more powerful idea is to condition on the value of another random variable. If $Y$ is a discrete random variable taking values $y_1, y_2, \ldots$, then for each $y$ with $\mathbb{P}(Y = y) > 0$, we can compute $\mathbb{E}[X \mid Y = y]$ — the expected value of $X$ given that $Y$ takes the value $y$.
[definition: Conditional Expectation Given a Random Variable]
Let $X$ and $Y$ be discrete random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. The **conditional expectation of $X$ given $Y = y$** is
\begin{align*}
\mathbb{E}[X \mid Y = y] := \sum_{x} x \, \mathbb{P}(X = x \mid Y = y) = \sum_{x} x \, \frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)},
\end{align*}
defined for each $y$ with $\mathbb{P}(Y = y) > 0$.
The **conditional expectation of $X$ given $Y$**, written $\mathbb{E}[X \mid Y]$, is the random variable that takes the value $\mathbb{E}[X \mid Y = y]$ on the event $\{Y = y\}$. That is,
\begin{align*}
\mathbb{E}[X \mid Y] := h(Y), \quad \text{where } h(y) = \mathbb{E}[X \mid Y = y].
\end{align*}
[/definition]
A crucial point: $\mathbb{E}[X \mid Y = y]$ is a *number* (for each fixed $y$), while $\mathbb{E}[X \mid Y]$ is a *random variable* — it is a function of $Y$. When $Y$ is random, $\mathbb{E}[X \mid Y]$ inherits that randomness.
### 5.1.3 The Law of Total Expectation
The Law of Total Probability (Chapter 2) reconstructed $\mathbb{P}(A)$ from conditional probabilities $\mathbb{P}(A \mid B_n)$. The analogous result for expectations is the Law of Total Expectation: we recover $\mathbb{E}[X]$ by averaging the conditional expectations $\mathbb{E}[X \mid Y = y]$ over the distribution of $Y$.
[quotetheorem:1121]
This result is also called the **tower property** (or the **iterated expectation property**), since it says that taking the outer expectation of a conditional expectation recovers the unconditional expectation: $\mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]$.
The proof partitions $\Omega$ according to the values of $Y$ and applies the Law of Total Probability to each term.
[citeproof:1121]
The tower property is one of the most frequently used tools in probability. We have already used it implicitly in the proof of the PGF of a Random Sum (Chapter 7), and it is the foundation of first-step analysis for random walks, which we develop below in the Gambler's Ruin problem.
### 5.1.4 Properties of Conditional Expectation
Conditional expectation inherits the properties of ordinary expectation, with additional rules reflecting its dependence on the conditioning variable.
[quotetheorem:1122]
Property (ii) is intuitive: if $g(Y)$ is a function of the variable we are conditioning on, then $g(Y)$ is "known" in the conditional world and can be pulled outside the conditional expectation, just as a constant can be pulled outside an ordinary expectation.
The proof verifies each property by evaluating on the event $\{Y = y\}$ and applying the definitions.
[citeproof:1122]
The "take out what is known" property is essential for computing conditional expectations in practice: it allows us to simplify expressions by treating functions of the conditioning variable as constants.
## 5.2 Joint, Marginal, and Conditional Distributions
When studying two or more discrete random variables simultaneously, we need the joint distribution — the collection of probabilities $\mathbb{P}(X = x, Y = y)$ for all pairs $(x, y)$.
[definition: Joint Probability Mass Function]
The **joint probability mass function** of discrete random variables $X$ and $Y$ is the function
\begin{align*}
p_{X,Y} : \mathbb{R}^2 &\to [0, 1] \\
(x, y) &\mapsto \mathbb{P}(X = x, Y = y).
\end{align*}
It satisfies $\sum_{x} \sum_{y} p_{X,Y}(x, y) = 1$.
[/definition]
The joint pmf contains all the information about $X$ and $Y$ together. The distributions of $X$ and $Y$ individually — the **marginal distributions** — are obtained by summing out the other variable.
[definition: Marginal and Conditional PMFs]
The **marginal pmf** of $X$ is $p_X(x) = \sum_{y} p_{X,Y}(x, y)$.
The **conditional pmf** of $X$ given $Y = y$ (when $\mathbb{P}(Y = y) > 0$) is
\begin{align*}
p_{X \mid Y}(x \mid y) := \mathbb{P}(X = x \mid Y = y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}.
\end{align*}
[/definition]
The relationship $p_{X,Y}(x,y) = p_{X \mid Y}(x \mid y) \cdot p_Y(y) = p_{Y \mid X}(y \mid x) \cdot p_X(x)$ is the random-variable version of the multiplication rule $\mathbb{P}(A \cap B) = \mathbb{P}(A \mid B) \, \mathbb{P}(B)$.
Independence of $X$ and $Y$ is equivalent to the joint pmf factoring: $p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y)$ for all $(x,y)$.
## 5.3 Discrete Convolution
When $X$ and $Y$ are independent, the distribution of their sum $Z = X + Y$ is determined by the **convolution** of their individual pmfs.
[quotetheorem:1123]
The proof partitions the event $\{X + Y = n\}$ according to the value of $X$ and uses independence to factor each term.
[citeproof:1123]
The convolution formula is the discrete analogue of the continuous convolution integral that appears in Chapter 8. Its power lies in reducing the computation of $\mathbb{P}(X + Y = n)$ — which involves the joint distribution — to a computation involving only the marginal distributions.
[example: Sum of Independent Poissons by Convolution]
Let $X \sim \operatorname{Po}(\lambda)$ and $Y \sim \operatorname{Po}(\mu)$ be independent. We compute the pmf of $Z = X + Y$ using the convolution formula:
\begin{align*}
\mathbb{P}(Z = n) &= \sum_{k=0}^{n} \mathbb{P}(X = k) \, \mathbb{P}(Y = n - k) = \sum_{k=0}^{n} e^{-\lambda} \frac{\lambda^k}{k!} \cdot e^{-\mu} \frac{\mu^{n-k}}{(n-k)!} \\
&= e^{-(\lambda + \mu)} \sum_{k=0}^{n} \frac{\lambda^k \mu^{n-k}}{k!(n-k)!} = \frac{e^{-(\lambda+\mu)}}{n!} \sum_{k=0}^{n} \binom{n}{k} \lambda^k \mu^{n-k} = \frac{e^{-(\lambda+\mu)}(\lambda + \mu)^n}{n!}.
\end{align*}
This is the pmf of $\operatorname{Po}(\lambda + \mu)$, confirming that $X + Y \sim \operatorname{Po}(\lambda + \mu)$.
[/example]
The key step in this computation — recognising the binomial theorem $\sum_{k=0}^{n} \binom{n}{k} \lambda^k \mu^{n-k} = (\lambda + \mu)^n$ — is a technique that recurs in many convolution computations. We will obtain the same result more efficiently using probability generating functions in Chapter 7.
## 5.4 The Gambler's Ruin
One of the most important applications of conditional expectation and first-step analysis is the Gambler's Ruin problem. It models a gambler who plays a sequence of fair (or biased) games, winning or losing one unit at each step, and asks: what is the probability that the gambler is eventually ruined?
### 5.4.1 Setup
Consider a gambler who starts with $£a$, where $a \in \{0, 1, \ldots, N\}$, and plays against a casino with total combined wealth $N$. At each round, the gambler wins $£1$ with probability $p$ and loses $£1$ with probability $q = 1 - p$, independently of previous rounds. The game ends when the gambler's fortune reaches $0$ (ruin) or $N$ (victory).
Let $X_n$ denote the gambler's fortune after $n$ rounds. Then $(X_n)$ is a random walk on $\{0, 1, \ldots, N\}$ with absorbing barriers at $0$ and $N$. Define
\begin{align*}
r_a := \mathbb{P}(\text{ruin} \mid X_0 = a) = \mathbb{P}(X_n \text{ hits } 0 \text{ before } N \mid X_0 = a).
\end{align*}
### 5.4.2 First-Step Analysis
The technique of **first-step analysis** conditions on the outcome of the first step to set up a recurrence relation. After the first round, the gambler's fortune is either $a + 1$ (with probability $p$) or $a - 1$ (with probability $q$). By the Law of Total Probability:
[quotetheorem:1124]
The proof is a direct application of the Law of Total Probability, conditioning on the first step.
[citeproof:1124]
The first-step analysis technique — conditioning on the first transition to set up a recurrence — is one of the most powerful and widely applicable methods in probability. It works for any Markov chain and is the foundation of the theory developed in Part IB Markov Chains.
### 5.4.3 Solving the Recurrence
[quotetheorem:1125]
The proof solves the second-order linear recurrence $r_a = p \, r_{a+1} + q \, r_{a-1}$ by finding its characteristic roots. Rewriting as $p(r_{a+1} - r_a) = q(r_a - r_{a-1})$ reveals that the differences $d_a := r_a - r_{a-1}$ satisfy $d_{a+1} = (q/p) \, d_a$, a geometric progression.
[citeproof:1125]
The technique of solving a second-order linear recurrence via its characteristic equation is the same method used in the Analysis course for solving linear ODEs. The two cases $p = q$ and $p \ne q$ correspond to a repeated root and two distinct roots of the characteristic equation, respectively.
### 5.4.4 Expected Duration
First-step analysis also gives the expected number of rounds until the game ends. Let $D_a = \mathbb{E}[\text{duration} \mid X_0 = a]$. Conditioning on the first step:
\begin{align*}
D_a = 1 + p \, D_{a+1} + q \, D_{a-1}, \qquad 1 \le a \le N-1,
\end{align*}
with $D_0 = D_N = 0$. This is an inhomogeneous version of the ruin recurrence.
For the fair game ($p = q = 1/2$), the recurrence becomes $D_{a+1} - 2D_a + D_{a-1} = -2$. The general solution of the homogeneous equation is $D_a = \alpha + \beta a$; a particular solution of the inhomogeneous equation is $D_a = -a^2$ (check: $-(a+1)^2 + 2a^2 - (a-1)^2 = -2$). So $D_a = -a^2 + \alpha + \beta a$. Applying the boundary conditions $D_0 = 0$ and $D_N = 0$ gives $\alpha = 0$ and $\beta = N$, hence
\begin{align*}
D_a = a(N - a).
\end{align*}
Starting at $a = N/2$, the expected duration is $N^2/4$ — quadratic in $N$.
The tools of conditional expectation and first-step analysis developed in this chapter are among the most versatile techniques in probability. The next chapter develops a different set of tools — inequalities — that constrain distributions using only their moments.
# 6. Inequalities and the Weak Law of Large Numbers
The tools developed in Chapters 4 and 5 — expectation, variance, covariance, conditional expectation — provide numerical summaries of a random variable's distribution. But how tightly do these summaries constrain the distribution itself? If $\mathbb{E}[X] = 0$ and $\operatorname{Var}(X) = 1$, how large can $\mathbb{P}(|X| \ge 10)$ be? This chapter develops the fundamental inequalities that answer such questions: Jensen's inequality (which connects convexity to expectation), Cauchy-Schwarz (which bounds correlations), Markov's inequality (which bounds tail probabilities using the mean), and Chebyshev's inequality (which bounds tail probabilities using the variance). These inequalities culminate in the **Weak Law of Large Numbers**, which states that the sample mean of independent identically distributed random variables converges in probability to the population mean.
## 6.1 Jensen's Inequality
The relationship between a function of an expectation and the expectation of a function is controlled by convexity.
[definition: Convex Function]
A function $f : (a, b) \to \mathbb{R}$ is **convex** if for all $x_1, x_2 \in (a, b)$ and all $\lambda \in [0, 1]$,
\begin{align*}
f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).
\end{align*}
A function $f$ is **concave** if $-f$ is convex. A function is **strictly convex** if equality holds only when $x_1 = x_2$.
[/definition]
If $f$ is twice differentiable and $f''(x) \ge 0$ on $(a, b)$, then $f$ is convex — a fact proved in the Analysis course.
[quotetheorem:515]
The inequality says that applying a convex function "after" averaging gives a smaller result than averaging "after" applying the function. The direction of the inequality reverses for concave functions. The result extends to arbitrary discrete random variables (not just finitely-valued ones) under the additional assumption that $\mathbb{E}[|X|]$ and $\mathbb{E}[|f(X)|]$ are finite — the proof uses an approximation argument that we omit at this level.
The proof uses induction on the number of values that $X$ takes. The inductive step decomposes the $n$-valued random variable into a convex combination of a point and an $(n-1)$-valued random variable.
[citeproof:515]
The inductive decomposition into "one point plus the rest" is a technique that appears in several other proofs by induction on the support size. Jensen's inequality has many useful special cases — we illustrate one.
[example: AM-GM Inequality from Jensen]
For positive reals $x_1, \ldots, x_n$, the AM-GM inequality states
\begin{align*}
\left(\prod_{i=1}^{n} x_i\right)^{1/n} \le \frac{1}{n} \sum_{i=1}^{n} x_i.
\end{align*}
This follows from Jensen's inequality applied to $f(t) = -\log t$ (which is convex on $(0, \infty)$) and the random variable $X$ taking values $x_1, \ldots, x_n$ each with probability $1/n$:
\begin{align*}
-\log\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) \le \frac{1}{n}\sum_{i=1}^{n} (-\log x_i) = -\log\left(\prod_{i=1}^{n} x_i\right)^{1/n}.
\end{align*}
Negating and exponentiating gives the AM-GM inequality.
[/example]
## 6.2 The Cauchy-Schwarz Inequality
[quotetheorem:432]
The version cited above is the general inner-product-space inequality $|(v, w)| \le \|v\| \, \|w\|$. The probabilistic form $(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2] \, \mathbb{E}[Y^2]$ is the special case obtained by taking the inner product $(X, Y) = \mathbb{E}[XY]$ on the space $L^2(\Omega, \mathcal{F}, \mathbb{P})$ of square-integrable random variables.
The proof uses the fact that the expectation of a square is non-negative. The idea is to form a quadratic in an auxiliary parameter $\lambda$ and use the non-negativity of the discriminant.
[citeproof:432]
The "auxiliary parameter" technique — introducing a free parameter, forming a non-negative expression, and extracting information from the discriminant — is a fundamental method in analysis and will reappear in Part IB Linear Algebra. The Cauchy-Schwarz inequality immediately implies $|\operatorname{Corr}(X, Y)| \le 1$: apply it to $X - \mathbb{E}[X]$ and $Y - \mathbb{E}[Y]$.
## 6.3 Markov's and Chebyshev's Inequalities
The following two inequalities provide upper bounds on tail probabilities — the probability that a random variable is far from its mean.
[quotetheorem:514]
The proof is a clean application of indicator random variables: the indicator $\mathbb{1}_{\{|X| \ge a\}}$ is bounded above by $|X|/a$ pointwise.
[citeproof:514]
The pointwise comparison $\mathbb{1}_{\{|X| \ge a\}} \le |X|/a$ is the core idea. Markov's inequality is crude but universal: it applies to any random variable with finite first moment. By applying Markov's inequality to $(X - \mathbb{E}[X])^2$, we obtain a sharper bound involving the variance.
[quotetheorem:1126]
The proof reduces to Markov's inequality by applying it to the non-negative random variable $(X - \mathbb{E}[X])^2$.
[citeproof:1126]
The "apply Markov to a power of the deviation" technique is a general strategy: applying Markov to $|X - \mu|^r$ for different $r$ gives different tail bounds (this leads to the theory of **Chernoff bounds** in more advanced courses).
Chebyshev's inequality tells us that a random variable with small variance is unlikely to be far from its mean. Setting $a = k\sigma$ where $\sigma = \sqrt{\operatorname{Var}(X)}$, we get $\mathbb{P}(|X - \mathbb{E}[X]| \ge k\sigma) \le 1/k^2$. For example, the probability of deviating by more than $10$ standard deviations is at most $1\%$.
## 6.4 The Weak Law of Large Numbers
We now have the tools to prove one of the fundamental limit theorems of probability. If we observe independent copies of a random variable and average them, the average converges to the expectation.
[quotetheorem:1127]
The proof is a direct application of Chebyshev's inequality: the variance of the sample mean $S_n / n$ is $\sigma^2 / n$, which tends to zero.
[citeproof:1127]
The elegance of this proof is striking: one of the deepest results in probability follows from just two ideas — the variance of a sum of independent variables is the sum of variances, and Chebyshev's inequality bounds tail probabilities using the variance. The Weak Law requires finite variance; a stronger version (the Strong Law) requires only finite mean, but its proof uses techniques beyond this course.
The Weak Law tells us that the sample mean $S_n / n$ converges to $\mu$ **in probability**. There is a stronger result — the Strong Law of Large Numbers — which states that $S_n / n \to \mu$ **almost surely** (i.e. with probability $1$). The Strong Law requires more advanced techniques and is not proved in this course.
[remark: Convergence in Probability]
We say a sequence of random variables $(X_n)$ **converges in probability** to $X$, written $X_n \xrightarrow{\mathbb{P}} X$, if for every $\varepsilon > 0$, $\mathbb{P}(|X_n - X| > \varepsilon) \to 0$ as $n \to \infty$.
[/remark]
The inequalities of this chapter constrain the behaviour of random variables using moments. In the next chapter, we develop a more refined tool — the probability generating function — that encodes the *entire* distribution and converts probabilistic problems into algebraic ones.
# 7. Probability Generating Functions and Branching Processes
The previous chapters developed tools — expectation, variance, inequalities — for extracting information from the distribution of a random variable. But these tools capture only partial information: the mean and variance are just two numbers, while the full distribution contains infinitely many probabilities. This chapter introduces a device that encodes the *entire* distribution of a non-negative integer-valued random variable into a single function: the probability generating function (pgf). The pgf transforms difficult probabilistic computations — sums of independent random variables, random sums, recursive structures — into clean algebraic operations on power series.
Branching processes are the canonical application of pgfs: the entire theory of extinction probability and population moments becomes algebraic through the pgf framework.
## 7.1 Probability Generating Functions
[definition: Probability Generating Function]
Let $X$ be a random variable taking values in $\{0, 1, 2, \ldots\}$ with $\mathbb{P}(X = k) = p_k$. The **probability generating function** (pgf) of $X$ is
\begin{align*}
G_X(z) := \mathbb{E}[z^X] = \sum_{k=0}^{\infty} p_k z^k, \qquad |z| \le 1.
\end{align*}
[/definition]
The series converges absolutely for $|z| \le 1$, since $\left|\sum_{k=0}^{\infty} p_k z^k\right| \le \sum_{k=0}^{\infty} p_k = 1$. Hence $G_X(z)$ is well-defined with radius of convergence at least $1$.
Several important facts follow immediately:
- $G_X(1) = \sum_{k=0}^{\infty} p_k = 1$.
- $G_X(0) = p_0 = \mathbb{P}(X = 0)$.
- The probabilities can be recovered: $p_k = G_X^{(k)}(0) / k!$.
[quotetheorem:1128]
The proof uses the fact that a power series is determined by its coefficients, which can be extracted by evaluating derivatives at $0$.
[citeproof:1128]
This uniqueness is what makes pgfs useful: if we can identify the pgf of a random variable, we know its distribution. The "divide by $z^{n+1}$ and take $z \to 0$" technique is equivalent to computing the $(n+1)$-th Taylor coefficient, which is the content of the identity $p_k = G_X^{(k)}(0)/k!$.
## 7.2 Extracting Moments from the PGF
The derivatives of the pgf at $z = 1$ encode the moments of $X$.
[quotetheorem:1129]
Here $G_X'(1^-)$ denotes the left-hand limit $\lim_{z \to 1^-} G_X'(z)$, which may be finite or infinite.
The proof differentiates the power series term by term (justified by absolute convergence on $|z| < 1$) and takes the limit $z \to 1^-$ using the Monotone Convergence Theorem for series.
[citeproof:1129]
The monotone convergence argument — bounding $G_X'(z)$ from below by a finite partial sum and letting $z \to 1^-$ — is the standard method for justifying the interchange of limits with infinite sums of non-negative terms.
## 7.3 PGFs of Standard Distributions
We now compute the pgfs of the standard discrete distributions. A useful check: differentiating and evaluating at $z = 1$ must recover the known mean.
[example: PGF of the Binomial]
If $X \sim \operatorname{Bin}(n, p)$, then
\begin{align*}
G_X(z) = \sum_{k=0}^{n} \binom{n}{k} (pz)^k (1-p)^{n-k} = (pz + 1 - p)^n.
\end{align*}
Check: $G_X'(z) = np(pz + 1 - p)^{n-1}$, so $G_X'(1) = np = \mathbb{E}[X]$. Correct.
[/example]
The binomial pgf is a polynomial of degree $n$, reflecting the fact that $X$ takes only finitely many values.
[example: PGF of the Geometric]
If $X \sim \operatorname{Geom}(p)$ (taking values in $\{1, 2, \ldots\}$), then
\begin{align*}
G_X(z) = \sum_{k=1}^{\infty} p(1-p)^{k-1} z^k = pz \sum_{j=0}^{\infty} ((1-p)z)^j = \frac{pz}{1 - (1-p)z}.
\end{align*}
[/example]
Notice that the geometric pgf is a rational function — this algebraic structure makes it particularly amenable to manipulation.
[example: PGF of the Poisson]
If $X \sim \operatorname{Po}(\lambda)$, then
\begin{align*}
G_X(z) = \sum_{k=0}^{\infty} e^{-\lambda} \frac{(\lambda z)^k}{k!} = e^{-\lambda} e^{\lambda z} = e^{\lambda(z - 1)}.
\end{align*}
[/example]
The three pgfs illustrate a pattern: the Binomial pgf is a polynomial, the Geometric is a rational function, and the Poisson is an exponential. The algebraic form of the pgf often makes computations (sums, products, compositions) far simpler than working directly with the pmf.
## 7.4 Sums of Independent Random Variables
The most powerful property of pgfs is that they convert sums of independent random variables into products.
[quotetheorem:1130]
The proof uses the key property that $z^{X+Y} = z^X \cdot z^Y$ and the expectation of a product of independent random variables factors.
[citeproof:1130]
The conversion of sums to products is the fundamental reason why generating functions are so powerful: addition in the "probability world" becomes multiplication in the "generating function world." By uniqueness, if we can identify the product $G_X \cdot G_Y$ as a known pgf, we immediately know the distribution of $X + Y$.
[example: Sum of Independent Poissons]
If $X \sim \operatorname{Po}(\lambda)$ and $Y \sim \operatorname{Po}(\mu)$ are independent, then $G_{X+Y}(z) = e^{\lambda(z-1)} \cdot e^{\mu(z-1)} = e^{(\lambda + \mu)(z-1)}$, which is the pgf of $\operatorname{Po}(\lambda + \mu)$. By uniqueness, $X + Y \sim \operatorname{Po}(\lambda + \mu)$.
[/example]
More generally, if $X_1, \ldots, X_n$ are independent and identically distributed with common pgf $G$, then $G_{X_1 + \cdots + X_n}(z) = G(z)^n$.
### 7.4.1 Random Sums
A particularly important application arises when the number of summands is itself random.
[quotetheorem:1131]
The formula says: the pgf of a random sum is the *composition* of the pgf of the number of terms with the pgf of each term. The proof conditions on $N$ and uses the tower property (Law of Total Expectation).
[citeproof:1131]
The composition formula $G_{S_N} = G_N \circ G_X$ is remarkable: it converts a random sum — one of the most complex operations in probability — into function composition, one of the simplest operations in analysis. The conditioning technique used in the proof (condition on $N$, compute the conditional expectation, then average over $N$) is a standard pattern that we have already seen in the Law of Total Expectation.
Taking derivatives, one obtains the useful identities:
\begin{align*}
\mathbb{E}[S_N] &= \mathbb{E}[N] \cdot \mathbb{E}[X_1], \\
\operatorname{Var}(S_N) &= \mathbb{E}[N] \cdot \operatorname{Var}(X_1) + \operatorname{Var}(N) \cdot (\mathbb{E}[X_1])^2.
\end{align*}
## 7.5 Branching Processes
A **branching process** models a population that evolves over discrete generations: each individual in one generation independently produces a random number of offspring, and these offspring form the next generation. The question of primary interest is: does the population eventually die out, and if so, with what probability?
Branching processes are the canonical application of the PGF machinery just developed: the population in each generation is a random sum (the number of offspring from each individual), and the PGF of a random sum is a composition. This converts the recursive probabilistic structure into iterated function composition — a purely analytical problem.
Formally, let $X_0 = 1$ (a single founding individual). Each individual lives for one time step and produces $k$ offspring with probability $p_k$ (independently of all other individuals). Let $X_n$ denote the population size in generation $n$. If $Y_1^{(n)}, Y_2^{(n)}, \ldots$ denote the offspring counts of the individuals in generation $n$, then
\begin{align*}
X_{n+1} = Y_1^{(n)} + Y_2^{(n)} + \cdots + Y_{X_n}^{(n)},
\end{align*}
a random sum with $X_n$ terms.
Let $F(z) = \sum_{k=0}^{\infty} p_k z^k$ be the pgf of the offspring distribution (which is the same as the pgf of $X_1$). Define $F_n(z) = \mathbb{E}[z^{X_n}]$, the pgf of the population size in generation $n$.
[quotetheorem:1132]
The proof applies the PGF of a Random Sum theorem to the random sum $X_{n+1} = \sum_{i=1}^{X_n} Y_i$.
[citeproof:1132]
The iteration formula $F_{n+1} = F_n \circ F$ reduces the study of a stochastic population model to the study of iterated function composition — a topic in pure analysis. This is the power of the PGF approach: probabilistic questions become analytical questions.
### 7.5.1 Mean and Variance of the Population Size
[quotetheorem:1133]
The proof of (i) uses the tower property $\mathbb{E}[X_{n+1}] = \mathbb{E}[\mathbb{E}[X_{n+1} \mid X_n]] = \mu \, \mathbb{E}[X_n]$ and induction. The proof of (ii) uses the conditional variance decomposition and solves the resulting linear recurrence for $\operatorname{Var}(X_n)$.
[citeproof:1133]
The conditional variance decomposition $\operatorname{Var}(X_{n+1} \mid X_n) = X_n \sigma^2$ uses the fact that, given $X_n = k$, the variable $X_{n+1}$ is a sum of $k$ iid copies of $X_1$, whose variance is $k\sigma^2$ by the additivity of variance for independent variables. The resulting linear recurrence for $\operatorname{Var}(X_n)$ is solved by the same geometric-series technique used in the Gambler's Ruin problem.
### 7.5.2 Extinction Probability
The fundamental question about a branching process is: what is the probability $q$ that the population eventually goes extinct?
Let $A_n = \{X_n = 0\}$ be the event that extinction has occurred by generation $n$. Since the population cannot revive once extinct, $A_1 \subset A_2 \subset \cdots$, and the event of eventual extinction is $A = \bigcup_{n=1}^{\infty} A_n$. By the continuity of probability for increasing events:
\begin{align*}
q = \mathbb{P}(A) = \lim_{n \to \infty} \mathbb{P}(X_n = 0) = \lim_{n \to \infty} F_n(0).
\end{align*}
Since $F$ is continuous (it is a power series convergent on $[0,1]$) and $F_n(0) \to q$:
\begin{align*}
F(q) = F\left(\lim_{n \to \infty} F_n(0)\right) = \lim_{n \to \infty} F(F_n(0)) = \lim_{n \to \infty} F_{n+1}(0) = q.
\end{align*}
So $q$ is a fixed point of $F$: $F(q) = q$.
There is also a direct argument via the Law of Total Probability: conditioning on $X_1 = k$ and noting that for the whole population to go extinct, each of the $k$ independent sub-populations must go extinct (each with probability $q$):
\begin{align*}
q = \sum_{k=0}^{\infty} p_k q^k = F(q).
\end{align*}
[quotetheorem:1134]
The proof analyses the fixed points of $F$ on $[0,1]$ using convexity. The key observation is that $F$ is convex on $[0,1]$ (since all coefficients are non-negative), so its graph can intersect the line $y = s$ at most twice.
[citeproof:1134]
The convexity analysis in part (ii) is the key insight: a convex function on $[0,1]$ that passes through $(1,1)$ can have at most one additional fixed point in $[0,1)$, and whether it does depends on whether the slope at $1$ exceeds $1$. The critical parameter is $\mu = \mathbb{E}[X_1]$: when $\mu \le 1$, each individual produces at most one offspring on average, and the population is destined to die out. When $\mu > 1$, there is a positive probability of explosive growth.
The generating function methods of this chapter work beautifully for non-negative integer-valued random variables, but they cannot handle continuous random variables or random variables that take negative values. In the next chapter, we extend the theory to continuous random variables, replacing sums with integrals and probability mass functions with probability density functions.
# 8. Continuous Random Variables
In Chapters 3-7, all our random variables have taken values in a countable set, and probabilities have been computed as sums. But many quantities of interest — the position of a randomly thrown dart, the time until a radioactive atom decays, the height of a randomly chosen person — take values in a continuum. For such quantities, individual values have probability zero ($\mathbb{P}(X = x) = 0$ for any specific $x$), and we can only meaningfully assign probabilities to *intervals*. This chapter extends the theory to continuous random variables, replacing sums with integrals and probability mass functions with probability density functions.
## 8.1 Density Functions and the CDF
The transition from discrete to continuous random variables is driven by a fundamental observation: if $X$ can take any value in an interval, then for each $x$ and small $\delta x > 0$, the probability $\mathbb{P}(x \le X \le x + \delta x)$ should be approximately proportional to $\delta x$. The proportionality constant, which may depend on $x$, is the probability density function.
[definition: Continuous Random Variable]
A random variable $X$ is **continuous** if there exists a function $f_X : \mathbb{R} \to [0, \infty)$ such that for all $a \le b$,
\begin{align*}
\mathbb{P}(a \le X \le b) = \int_a^b f_X(x) \, dx.
\end{align*}
The function $f_X$ is called the **probability density function** (pdf) of $X$. It satisfies:
(i) $f_X(x) \ge 0$ for all $x \in \mathbb{R}$.
(ii) $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$.
[/definition]
An important caution: $f_X(x)$ is *not* the probability that $X$ equals $x$. Indeed, $\mathbb{P}(X = a) = \int_a^a f_X(x) \, dx = 0$ for every $a$. The density $f_X(x)$ measures the *concentration* of probability near $x$, not the probability of the value $x$ itself. In particular, the density can exceed $1$ — the following example illustrates this common source of confusion.
[example: A Density Exceeding 1]
Let $X \sim U[0, 1/3]$. Then $f_X(x) = 3$ for $x \in [0, 1/3]$ and $f_X(x) = 0$ otherwise. The density $f_X(x) = 3 > 1$ on the interval $[0, 1/3]$. This does not violate any axiom: the density is *not* a probability. What matters is that $\int_{-\infty}^{\infty} f_X(x) \, dx = 3 \cdot (1/3) = 1$. The density measures the rate of change of the cdf, not a probability.
[/example]
To work with continuous random variables systematically, we need a function that accumulates the probability to the left of each point.
[definition: Cumulative Distribution Function]
The **cumulative distribution function** (cdf) of a random variable $X$ (discrete, continuous, or neither) is
\begin{align*}
F_X(x) := \mathbb{P}(X \le x).
\end{align*}
[/definition]
For continuous random variables, $F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$, so $F_X$ is continuous and differentiable, with $F_X'(x) = f_X(x)$ wherever $F_X$ is differentiable. The cdf is always increasing (in the non-strict sense) with $F_X(x) \to 0$ as $x \to -\infty$ and $F_X(x) \to 1$ as $x \to \infty$. Note that $\mathbb{P}(a < X \le b) = F_X(b) - F_X(a)$.
## 8.2 Continuous Distributions
As in the discrete case, we organise the standard continuous distributions by the type of modelling question they answer.
### 8.2.1 Uniform Randomness: The Uniform Distribution
The simplest continuous distribution assigns equal density to all points in an interval — it models "pure randomness" with no preferred location.
[definition: Uniform Distribution]
A random variable $X$ has the **uniform distribution** on $[a, b]$, written $X \sim U[a, b]$, if its pdf is
\begin{align*}
f_X(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b, \\[4pt] 0 & \text{otherwise.} \end{cases}
\end{align*}
Its cdf is $F_X(x) = (x - a)/(b - a)$ for $x \in [a, b]$.
[/definition]
The uniform distribution assigns equal density to all points in $[a, b]$. One computes $\mathbb{E}[X] = (a + b)/2$ and $\operatorname{Var}(X) = (b - a)^2 / 12$.
### 8.2.2 Waiting Times in Continuous Time: The Exponential Distribution
The exponential distribution is the continuous analogue of the geometric distribution: it models the waiting time until a "success" in continuous time. Just as the geometric distribution answers "how many discrete trials until the first success?", the exponential answers "how long must we wait in continuous time?"
[definition: Exponential Distribution]
A random variable $X$ has the **exponential distribution** with parameter $\lambda > 0$, written $X \sim \operatorname{Exp}(\lambda)$, if its pdf is
\begin{align*}
f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}
\end{align*}
Its cdf is $F_X(x) = 1 - e^{-\lambda x}$ for $x \ge 0$.
[/definition]
The key property of the exponential distribution is the **memoryless property**, which we now state formally.
[quotetheorem:1135]
The proof is a direct computation using the survival function $\mathbb{P}(X \ge t) = e^{-\lambda t}$.
[citeproof:1135]
The computation reveals the algebraic reason for the memoryless property: the exponential function satisfies $e^{-\lambda(x+z)} = e^{-\lambda x} \cdot e^{-\lambda z}$, so the ratio cancels. No other survival function has this multiplicative structure, which is why the exponential is the unique continuous memoryless distribution.
[example: Expectation of the Exponential]
If $X \sim \operatorname{Exp}(\lambda)$, then using the tail formula $\mathbb{E}[X] = \int_0^{\infty} \mathbb{P}(X \ge x) \, dx$ (proved below):
\begin{align*}
\mathbb{E}[X] = \int_0^{\infty} e^{-\lambda x} \, dx = \left[-\frac{1}{\lambda} e^{-\lambda x}\right]_0^{\infty} = \frac{1}{\lambda}.
\end{align*}
[/example]
For the variance, we compute $\mathbb{E}[X^2]$ by integration by parts:
[example: Variance of the Exponential]
\begin{align*}
\mathbb{E}[X^2] = \int_0^{\infty} x^2 \lambda e^{-\lambda x} \, dx.
\end{align*}
Integrating by parts with $u = x^2$ and $dv = \lambda e^{-\lambda x} \, dx$ (so $du = 2x \, dx$ and $v = -e^{-\lambda x}$):
\begin{align*}
\mathbb{E}[X^2] = \left[-x^2 e^{-\lambda x}\right]_0^{\infty} + 2\int_0^{\infty} x e^{-\lambda x} \, dx = 0 + \frac{2}{\lambda} \cdot \frac{1}{\lambda} = \frac{2}{\lambda^2}.
\end{align*}
Here we used $\int_0^{\infty} x e^{-\lambda x} \, dx = 1/\lambda^2$ (which follows from $\mathbb{E}[X] = 1/\lambda$ with $f_X(x) = \lambda e^{-\lambda x}$, or by another integration by parts). Therefore
\begin{align*}
\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.
\end{align*}
[/example]
### 8.2.3 The Gamma Distribution
Summing independent exponentials gives the Gamma distribution, which models the total waiting time for $n$ events.
[definition: Gamma Distribution]
For $n \in \mathbb{N}$ and $\lambda > 0$, a random variable $X$ has the **Gamma distribution**, written $X \sim \Gamma(n, \lambda)$, if its pdf is
\begin{align*}
f_X(x) = \frac{\lambda^n x^{n-1}}{(n-1)!} e^{-\lambda x}, \qquad x \ge 0.
\end{align*}
[/definition]
When $n = 1$, this is $\operatorname{Exp}(\lambda)$. More generally, if $X_1, \ldots, X_n$ are independent $\operatorname{Exp}(\lambda)$ random variables, then $X_1 + \cdots + X_n \sim \Gamma(n, \lambda)$. The Gamma distribution has mean $n/\lambda$ and variance $n/\lambda^2$.
## 8.3 Expectation and Variance for Continuous Random Variables
The definitions of expectation and variance extend to continuous random variables by replacing sums with integrals.
[definition: Expectation of a Continuous Random Variable]
If $X$ is a continuous random variable with pdf $f_X$, its expectation is
\begin{align*}
\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx,
\end{align*}
provided the integral converges absolutely.
[/definition]
All the properties of expectation (linearity, monotonicity, the minimisation property) carry over unchanged. For non-negative continuous random variables, there is a useful alternative formula.
[quotetheorem:1136]
The proof exchanges the order of integration in a double integral. The justification for this exchange is Tonelli's theorem (the version of Fubini's theorem for non-negative integrands), which is proved rigorously in Part IB Probability and Measure.
[citeproof:1136]
The exchange of integration order is the key step. The tail formula is particularly useful when the survival function $\mathbb{P}(X \ge x)$ has a simpler form than the density — as we saw for the exponential distribution above.
The variance of a continuous random variable is $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$, computed as $\int_{-\infty}^{\infty} x^2 f_X(x) \, dx - \left(\int_{-\infty}^{\infty} x f_X(x) \, dx\right)^2$.
## 8.4 Joint Distributions and Independence
When studying multiple continuous random variables simultaneously, we need the notion of a joint density.
[definition: Joint Density]
Continuous random variables $X_1, \ldots, X_n$ are **jointly continuous** with **joint probability density function** $f$ if for any (measurable) set $A \subset \mathbb{R}^n$,
\begin{align*}
\mathbb{P}((X_1, \ldots, X_n) \in A) = \int \cdots \int_A f(x_1, \ldots, x_n) \, dx_1 \cdots dx_n,
\end{align*}
where $f \ge 0$ and $\int_{\mathbb{R}^n} f(x_1, \ldots, x_n) \, dx_1 \cdots dx_n = 1$.
[/definition]
The **marginal density** of $X_i$ is obtained by integrating out all other variables:
\begin{align*}
f_{X_i}(x) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_n) \prod_{j \ne i} dx_j.
\end{align*}
[definition: Independent Continuous Random Variables]
Continuous random variables $X_1, \ldots, X_n$ with joint density $f$ are **independent** if
\begin{align*}
f(x_1, \ldots, x_n) = f_{X_1}(x_1) \cdot f_{X_2}(x_2) \cdots f_{X_n}(x_n)
\end{align*}
for all $(x_1, \ldots, x_n) \in \mathbb{R}^n$.
[/definition]
The following factorisation criterion provides a practical test for independence.
[quotetheorem:1137]
The proof integrates the factored density to obtain the marginal of each $X_i$: $f_{X_i}(x_i) = g_i(x_i) / \int_{-\infty}^{\infty} g_i(t) \, dt$, where $\int_{-\infty}^{\infty} g_i(t) \, dt$ is the normalising constant. The joint density then equals the product of marginals.
[citeproof:1137]
The factorisation criterion is the standard method for verifying independence in practice: if the joint density splits into a product of functions of individual variables, no further computation is needed.
## 8.5 Transformation of Random Variables
A common task is to determine the distribution of $Y = h(X)$ given the distribution of $X$.
[quotetheorem:1138]
The factor $|d h^{-1}/dy|$ is the Jacobian of the transformation, accounting for the stretching or compression of the density. The proof works through the cdf and differentiates.
[citeproof:1138]
The two cases differ only in the direction of the inequality when passing from $\{h(X) \le y\}$ to a condition on $X$. The absolute value in the formula unifies both cases.
[example: Exponential from Uniform]
Let $X \sim U[0, 1]$ and $Y = -\log X$. Then $h(x) = -\log x$ is strictly decreasing on $(0,1)$, with $h^{-1}(y) = e^{-y}$ and $d h^{-1}/dy = -e^{-y}$. So
\begin{align*}
f_Y(y) = f_X(e^{-y}) \cdot e^{-y} = 1 \cdot e^{-y} = e^{-y}, \qquad y \ge 0.
\end{align*}
This is the pdf of $\operatorname{Exp}(1)$. Hence $-\log U \sim \operatorname{Exp}(1)$ when $U \sim U[0,1]$.
[/example]
A more general result provides a method for generating any continuous distribution from a uniform random variable.
[quotetheorem:1139]
The proof is a one-line calculation that exploits the fact that $\mathbb{P}(U \le p) = p$ for $p \in [0,1]$.
[citeproof:1139]
Despite its brevity, this result is the foundation of Monte Carlo simulation: to generate samples from any continuous distribution with cdf $F$, generate $U \sim U[0,1]$ and return $F^{-1}(U)$.
## 8.6 Convolution
When $X$ and $Y$ are independent continuous random variables, the density of their sum $X + Y$ is the **convolution** of their individual densities:
\begin{align*}
f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(z - y) \, f_Y(y) \, dy = \int_{-\infty}^{\infty} f_X(x) \, f_Y(z - x) \, dx.
\end{align*}
This is the continuous analogue of the discrete convolution formula $\mathbb{P}(X + Y = n) = \sum_k \mathbb{P}(X = k) \, \mathbb{P}(Y = n - k)$ developed in Chapter 5.
## 8.7 The Normal Distribution
The normal (Gaussian) distribution is the most important continuous distribution in probability and statistics, owing to the Central Limit Theorem.
[definition: Normal Distribution]
A random variable $X$ has the **normal distribution** with parameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$, written $X \sim N(\mu, \sigma^2)$, if its pdf is
\begin{align*}
f_X(x) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \qquad x \in \mathbb{R}.
\end{align*}
The case $\mu = 0$, $\sigma^2 = 1$ is the **standard normal distribution**.
[/definition]
The normalisation $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$ is not obvious and requires a calculation.
[quotetheorem:1140]
The direct integral appears intractable because the antiderivative of $e^{-x^2/2}$ has no closed form in terms of elementary functions. The trick — due to Poisson — is to *square* the integral, interpret the result as a double integral over the plane, and switch to polar coordinates where the integrand becomes elementary.
[citeproof:1140]
The "square and convert to polar" technique is specific to Gaussian-type integrals — it works because $x^2 + y^2 = r^2$ is rotationally invariant. This is one of the most beautiful calculations in mathematics.
[quotetheorem:1141]
The proof uses the substitution $z = (x - \mu)/\sigma$ to reduce to the standard normal, then applies symmetry (to evaluate the mean) and integration by parts (to evaluate the variance).
[citeproof:1141]
The integration by parts step uses the crucial observation that $z \cdot e^{-z^2/2}$ is the derivative of $-e^{-z^2/2}$ — this is why the substitution $u = z$, $dv = z e^{-z^2/2} \, dz$ works so cleanly. The parameters $\mu$ and $\sigma^2$ in the definition of $N(\mu, \sigma^2)$ are therefore genuinely the mean and variance.
## 8.8 Geometric Probability
Many problems in continuous probability can be solved by visualising the sample space geometrically.
[example: Buffon's Needle]
A needle of length $\ell$ is thrown randomly onto a floor marked with parallel lines a distance $L \ge \ell$ apart. What is the probability that the needle crosses a line?
Let $X \sim U[0, L]$ be the distance from the centre of the needle to the nearest line, and let $\Theta \sim U[0, \pi]$ be the angle the needle makes with the lines. The needle crosses a line if and only if $X \le (\ell/2) \sin \Theta$. The joint density is $f(x, \theta) = 1/(L\pi)$. Therefore:
\begin{align*}
\mathbb{P}(\text{cross}) = \int_0^{\pi} \int_0^{(\ell/2)\sin\theta} \frac{1}{L\pi} \, dx \, d\theta = \frac{1}{L\pi} \int_0^{\pi} \frac{\ell}{2} \sin\theta \, d\theta = \frac{\ell}{L\pi} \left[-\cos\theta\right]_0^{\pi} = \frac{2\ell}{\pi L}.
\end{align*}
The appearance of $\pi$ in the answer suggests a Monte Carlo method for estimating $\pi$: throw the needle $n$ times, observe $N$ crossings, and estimate $\pi \approx 2\ell n / (NL)$.
[/example]
Continuous random variables and their density functions provide the setting for the Central Limit Theorem. In the next chapter, we introduce moment generating functions — a more general encoding device than pgfs — and use them to prove that normalised sums converge to the normal distribution.
# 9. Moment Generating Functions and the Central Limit Theorem
The probability generating function, introduced in Chapter 7, is a powerful tool for non-negative integer-valued random variables. But it does not apply to continuous random variables, and it does not apply to discrete random variables that can take negative values. This chapter introduces a more general device — the **moment generating function** (mgf) — that works for both discrete and continuous random variables, and uses it to state and prove the Central Limit Theorem.
## 9.1 Moment Generating Functions
[definition: Moment Generating Function]
The **moment generating function** (mgf) of a random variable $X$ is
\begin{align*}
M_X(\theta) := \mathbb{E}[e^{\theta X}],
\end{align*}
defined for those $\theta \in \mathbb{R}$ for which the expectation is finite.
[/definition]
For continuous random variables, $M_X(\theta) = \int_{-\infty}^{\infty} e^{\theta x} f_X(x) \, dx$. For discrete random variables, $M_X(\theta) = \sum_x e^{\theta x} \mathbb{P}(X = x)$.
The relationship between the pgf and the mgf is: $G_X(z) = \mathbb{E}[z^X] = M_X(\log z)$ for non-negative integer-valued $X$.
[quotetheorem:1142]
We accept this result without proof. Its proof uses properties of the Laplace transform and belongs to a more advanced course.
## 9.2 Extracting Moments
The name "moment generating function" comes from the following property.
[quotetheorem:1143]
The proof expands $e^{\theta X}$ as a Taylor series, exchanges expectation and summation (justified by the assumed finiteness of $M_X$ in a neighbourhood of $0$), and reads off the coefficients.
[citeproof:1143]
The interchange of sum and expectation is the technical heart of the argument. It requires that $M_X(\theta) < \infty$ in a neighbourhood of $0$, which ensures that the Taylor series converges absolutely.
## 9.3 MGFs and Sums
Like pgfs, mgfs convert sums of independent random variables into products.
[quotetheorem:1144]
The proof is identical in structure to the pgf case: $e^{\theta(X+Y)} = e^{\theta X} \cdot e^{\theta Y}$, and independence allows the expectation to factor.
[citeproof:1144]
This multiplicativity is the fundamental reason why mgfs (and pgfs) are so effective: they encode the additive structure of sums of independent variables as multiplicative structure.
[example: MGF of the Normal]
If $X \sim N(\mu, \sigma^2)$, then
\begin{align*}
M_X(\theta) = \mathbb{E}[e^{\theta X}] = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{\infty} e^{\theta x} e^{-(x-\mu)^2/(2\sigma^2)} \, dx.
\end{align*}
Completing the square in the exponent: $\theta x - (x - \mu)^2/(2\sigma^2) = -(x - \mu - \theta\sigma^2)^2/(2\sigma^2) + \theta\mu + \theta^2\sigma^2/2$. Hence
\begin{align*}
M_X(\theta) = e^{\theta\mu + \theta^2\sigma^2/2} \cdot \underbrace{\frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{\infty} e^{-(x - \mu - \theta\sigma^2)^2/(2\sigma^2)} \, dx}_{= 1} = e^{\theta\mu + \theta^2\sigma^2/2}.
\end{align*}
[/example]
As a consequence, if $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $M_{X+Y}(\theta) = e^{(\mu_1 + \mu_2)\theta + (\sigma_1^2 + \sigma_2^2)\theta^2/2}$, which is the mgf of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. So the sum of independent normals is normal.
## 9.4 Convergence in Distribution
To state the Central Limit Theorem precisely, we need a notion of convergence for random variables.
[definition: Convergence in Distribution]
A sequence of random variables $(X_n)$ **converges in distribution** to a random variable $X$, written $X_n \xrightarrow{d} X$, if
\begin{align*}
F_{X_n}(x) \to F_X(x)
\end{align*}
for all $x \in \mathbb{R}$ at which $F_X$ is continuous.
[/definition]
Convergence in distribution is a statement about the cdfs, not about the random variables themselves — the random variables $X_n$ and $X$ need not be defined on the same probability space.
The connection between mgfs and convergence in distribution is provided by the following result, which we accept without proof.
[quotetheorem:1145]
## 9.5 The Central Limit Theorem
We now state and prove the most important result in probability theory. The Central Limit Theorem says that the sum of many independent random variables, suitably normalised, is approximately normally distributed — regardless of the distribution of the individual summands.
[quotetheorem:532]
The proof uses the Continuity Theorem: we show that the mgf of the normalised sum converges to the mgf of the standard normal. The key step is a Taylor expansion of the mgf to second order, followed by the limit $(1 + a/n)^n \to e^a$.
[citeproof:532]
The proof assumes that $M_{X_i}(\theta)$ is finite in a neighbourhood of $0$ (to justify the Taylor expansion). A fully rigorous proof that avoids this assumption uses characteristic functions instead of moment generating functions, but the essential idea — expanding to second order and using a limit of the form $(1 + a/n)^n \to e^a$ — remains the same.
The CLT requires finite variance. The following example shows what can go wrong without it.
[example: Failure of the CLT Without Finite Variance]
Let $X_1, X_2, \ldots$ be independent and identically distributed with the standard Cauchy distribution, whose pdf is
\begin{align*}
f(x) = \frac{1}{\pi(1 + x^2)}, \qquad x \in \mathbb{R}.
\end{align*}
The Cauchy distribution has $\mathbb{E}[|X|] = \int_{-\infty}^{\infty} |x| / (\pi(1+x^2)) \, dx = +\infty$: the expectation does not exist, and neither does the variance.
For the Cauchy distribution, the sample mean $S_n/n = (X_1 + \cdots + X_n)/n$ does *not* converge to any constant. In fact, $S_n/n$ has the same Cauchy distribution as $X_1$ — the "averaging" produces no concentration at all. This follows from the characteristic function of the Cauchy: $\varphi_{X_i}(t) = e^{-|t|}$, so $\varphi_{S_n/n}(t) = (e^{-|t|/n})^n = e^{-|t|}$, which is again Cauchy.
The CLT fails completely: the distribution of $S_n / \sqrt{n}$ does not converge to any non-degenerate limit, because the tails of the Cauchy distribution are too heavy ($\mathbb{P}(|X| > x) \sim 2/(\pi x)$, which decays only as $1/x$ rather than exponentially).
[/example]
### 9.5.1 Application: Normal Approximation to the Binomial
Let $X_1, \ldots, X_n$ be independent $\operatorname{Bern}(p)$ random variables, so $S_n = X_1 + \cdots + X_n \sim \operatorname{Bin}(n, p)$. The CLT gives
\begin{align*}
\frac{S_n - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0, 1).
\end{align*}
This means that for large $n$, $\operatorname{Bin}(n, p)$ is approximately $N(np, np(1-p))$. This approximation is the basis for constructing confidence intervals and hypothesis tests for proportions.
## 9.6 The Strong Law of Large Numbers
For completeness, we state (without proof) the strengthening of the Weak Law.
[quotetheorem:520]
The Strong Law says that $S_n / n \to \mu$ **almost surely** — that is, the set of outcomes $\omega$ for which $S_n(\omega)/n$ does not converge to $\mu$ has probability zero. This is strictly stronger than the Weak Law, which only says $\mathbb{P}(|S_n/n - \mu| > \varepsilon) \to 0$ for each fixed $\varepsilon$. The proof requires techniques beyond this course.
## 9.7 Multivariate Normal Distribution
We conclude with a brief introduction to Gaussian random vectors, which appear throughout statistics and applied mathematics. The reader who has completed Part IA Vectors and Matrices will recognise the matrix notation used below.
[definition: Gaussian Random Variable]
A random variable $X$ is **Gaussian** (or **normal**) if $X \sim N(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$ and $\sigma^2 \ge 0$. (When $\sigma^2 = 0$, $X = \mu$ with probability $1$.)
[/definition]
The multivariate generalisation defines a Gaussian random vector not by specifying a density, but by requiring that all linear combinations are Gaussian — a more intrinsic characterisation.
[definition: Gaussian Random Vector]
A random vector $(X_1, \ldots, X_n)$ is **Gaussian** if every linear combination $a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$ is a Gaussian random variable.
[/definition]
A Gaussian random vector is completely characterised by its mean vector $\mu = (\mathbb{E}[X_1], \ldots, \mathbb{E}[X_n])$ and its covariance matrix $V_{ij} = \operatorname{Cov}(X_i, X_j)$. When $V$ is positive definite, the joint density is
\begin{align*}
f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2} \sqrt{\det V}} \exp\left(-\frac{1}{2} (x - \mu)^\top V^{-1} (x - \mu)\right).
\end{align*}
A remarkable property of Gaussian random vectors is that uncorrelated components are independent: if $(X, Y)$ is a Gaussian random vector with $\operatorname{Cov}(X, Y) = 0$, then $X$ and $Y$ are independent. This is false for general random variables — recall the example in Chapter 4 where $\operatorname{Cov}(X, Y) = 0$ but $X$ and $Y$ were dependent.
## Summary of Distributions
### Discrete Distributions
| Distribution | Notation | PMF | Mean | Variance | PGF |
|---|---|---|---|---|---|
| Bernoulli | $\operatorname{Bern}(p)$ | $p^k(1-p)^{1-k}$, $k \in \{0,1\}$ | $p$ | $p(1-p)$ | $(1-p) + pz$ |
| Binomial | $\operatorname{Bin}(n,p)$ | $\binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ | $(1-p+pz)^n$ |
| Geometric | $\operatorname{Geom}(p)$ | $(1-p)^{k-1}p$, $k \ge 1$ | $1/p$ | $(1-p)/p^2$ | $pz/(1-(1-p)z)$ |
| Poisson | $\operatorname{Po}(\lambda)$ | $e^{-\lambda}\lambda^k/k!$ | $\lambda$ | $\lambda$ | $e^{\lambda(z-1)}$ |
### Continuous Distributions
| Distribution | Notation | PDF | Mean | Variance | MGF |
|---|---|---|---|---|---|
| Uniform | $U[a,b]$ | $1/(b-a)$ | $(a+b)/2$ | $(b-a)^2/12$ | $(e^{\theta b} - e^{\theta a})/(\theta(b-a))$ |
| Exponential | $\operatorname{Exp}(\lambda)$ | $\lambda e^{-\lambda x}$, $x \ge 0$ | $1/\lambda$ | $1/\lambda^2$ | $\lambda/(\lambda - \theta)$ |
| Normal | $N(\mu, \sigma^2)$ | $\frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$ | $\sigma^2$ | $e^{\mu\theta + \sigma^2\theta^2/2}$ |
| Gamma | $\Gamma(n, \lambda)$ | $\frac{\lambda^n x^{n-1}}{(n-1)!}e^{-\lambda x}$, $x \ge 0$ | $n/\lambda$ | $n/\lambda^2$ | $(\lambda/(\lambda-\theta))^n$ |
### Key Relationships
- $\operatorname{Bern}(p) = \operatorname{Bin}(1, p)$.
- $\operatorname{Bin}(n, \lambda/n) \approx \operatorname{Po}(\lambda)$ for large $n$.
- $\Gamma(1, \lambda) = \operatorname{Exp}(\lambda)$.
- If $X_1, \ldots, X_n \sim \operatorname{Exp}(\lambda)$ independently, then $X_1 + \cdots + X_n \sim \Gamma(n, \lambda)$.
- If $X \sim \operatorname{Po}(\lambda)$ and $Y \sim \operatorname{Po}(\mu)$ independently, then $X + Y \sim \operatorname{Po}(\lambda + \mu)$.
- If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ independently, then $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
- If $U \sim U[0,1]$, then $-\log U \sim \operatorname{Exp}(1)$.
Contents
- 1. Probability Spaces and Counting
- 1.1 The Axioms of Probability
- 1.2 Counting in Finite Sample Spaces
- 1.2.1 The Multinomial Coefficient
- 1.2.2 Compositions and the Stars-and-Bars Method
- 1.2.3 Random Walks on $\mathbb{Z}$
- 1.2.4 Stirling's Formula
- 1.3 The Inclusion-Exclusion Principle
- 2. Conditional Probability and Independence
- 2.1 Conditional Probability
- 2.2 The Law of Total Probability and Bayes' Formula
- 2.3 Independence
- 3. Discrete Random Variables
- 3.1 Random Variables and Their Distributions
- 3.1.1 The Indicator Random Variable
- 3.2 Discrete Distributions
- 3.2.1 Counting Successes: Bernoulli and Binomial
- 3.2.2 Modelling Rare Events: The Poisson Distribution
- 3.2.3 Waiting for Success: The Geometric Distribution
- 3.3 Independent Random Variables
- 4. Expectation, Variance, and Covariance
- 4.1 Expectation
- 4.1.1 Expectations of Standard Distributions
- 4.1.2 The Power of Indicator Random Variables
- 4.2 Variance
- 4.2.1 Variances of Standard Distributions
- 4.3 Covariance and Correlation
- 4.4 Variance of a Sum
- 5. Conditional Expectation and Joint Distributions
- 5.1 Conditional Expectation
- 5.1.1 Conditioning on an Event
- 5.1.2 Conditioning on a Random Variable
- 5.1.3 The Law of Total Expectation
- 5.1.4 Properties of Conditional Expectation
- 5.2 Joint, Marginal, and Conditional Distributions
- 5.3 Discrete Convolution
- 5.4 The Gambler's Ruin
- 5.4.1 Setup
- 5.4.2 First-Step Analysis
- 5.4.3 Solving the Recurrence
- 5.4.4 Expected Duration
- 6. Inequalities and the Weak Law of Large Numbers
- 6.1 Jensen's Inequality
- 6.2 The Cauchy-Schwarz Inequality
- 6.3 Markov's and Chebyshev's Inequalities
- 6.4 The Weak Law of Large Numbers
- 7. Probability Generating Functions and Branching Processes
- 7.1 Probability Generating Functions
- 7.2 Extracting Moments from the PGF
- 7.3 PGFs of Standard Distributions
- 7.4 Sums of Independent Random Variables
- 7.4.1 Random Sums
- 7.5 Branching Processes
- 7.5.1 Mean and Variance of the Population Size
- 7.5.2 Extinction Probability
- 8. Continuous Random Variables
- 8.1 Density Functions and the CDF
- 8.2 Continuous Distributions
- 8.2.1 Uniform Randomness: The Uniform Distribution
- 8.2.2 Waiting Times in Continuous Time: The Exponential Distribution
- 8.2.3 The Gamma Distribution
- 8.3 Expectation and Variance for Continuous Random Variables
- 8.4 Joint Distributions and Independence
- 8.5 Transformation of Random Variables
- 8.6 Convolution
- 8.7 The Normal Distribution
- 8.8 Geometric Probability
- 9. Moment Generating Functions and the Central Limit Theorem
- 9.1 Moment Generating Functions
- 9.2 Extracting Moments
- 9.3 MGFs and Sums
- 9.4 Convergence in Distribution
- 9.5 The Central Limit Theorem
- 9.5.1 Application: Normal Approximation to the Binomial
- 9.6 The Strong Law of Large Numbers
- 9.7 Multivariate Normal Distribution
- Summary of Distributions
- Discrete Distributions
- Continuous Distributions
- Key Relationships
Cambridge IA Probability
Content
Problems
History
Created by admin on 4/18/2026 | Last updated on 4/18/2026
Prerequisites
No prerequisites required for this page.
Rate this page
★
★
★
★
★
Poor
Excellent