What should you pay to play the following game? A fair coin is flipped repeatedly until the first head appears. If the first head appears on flip $n$, you win $2^n$ dollars. The expected winnings, computed by weighting each outcome by its probability, give
\begin{align*}
\sum_{n=1}^\infty 2^n \cdot \frac{1}{2^n} = \sum_{n=1}^\infty 1 = \infty.
\end{align*}
According to the naive formula, no finite entry fee is too large. Yet no rational person pays even a thousand dollars to play. This is the St. Petersburg paradox, and it reveals something profound: the intuitive notion of "average value" for a random quantity is not yet mathematically complete. We need a framework that tells us when an average exists, when it is finite, and how to compute it even when outcomes are neither discrete nor continuous.
The answer is the Lebesgue integral with respect to a probability measure — what probabilists call **expectation**. On a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, the expectation of a random variable $X: \Omega \to \mathbb{R}$ is the integral $\int_\Omega X \, d\mathbb{P}$, built up from simple functions through a limiting procedure. Writing $\mathcal{L}^1$ for one-dimensional Lebesgue measure, this definition unifies the discrete formula $\mathbb{E}[X] = \sum_x x \, \mathbb{P}(X = x)$ and the continuous formula $\mathbb{E}[X] = \int_{-\infty}^\infty x f(x) \, d\mathcal{L}^1(x)$, and handles every case in between — singular distributions, random variables defined on abstract probability spaces, and limits of sequences of random variables.
Three properties make expectation indispensable. First, linearity: $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ regardless of whether $X$ and $Y$ are independent, enabling calculations that would be intractable by any other method. Second, the probability–expectation bridge: $\mathbb{E}[\mathbb{1}_A] = \mathbb{P}(A)$, so every probability is an expectation of an indicator function. Third, controlled interaction with limits: under appropriate conditions, we may pass limits through the expectation sign — a prerequisite for any serious analysis of sequences of random variables.
[example: The St. Petersburg Paradox]
In the St. Petersburg game, let $X$ be the payoff. The sample space can be taken as $\Omega = \mathbb{N}$, where outcome $n$ represents "first head on flip $n$." Setting $\mathbb{P}(X = 2^n) = 2^{-n}$ for $n \in \mathbb{N}$, the sum $\sum_{n=1}^\infty 2^{-n} = 1$ confirms this is a valid probability distribution. The expected payoff is
\begin{align*}
\mathbb{E}[X] = \sum_{n=1}^\infty 2^n \cdot \frac{1}{2^n} = \sum_{n=1}^\infty 1 = \infty.
\end{align*}
What does $\mathbb{E}[X] = \infty$ mean precisely? The payoff $X$ is finite almost surely (with probability one, every play ends after finitely many flips), but the distribution has no finite mean. For any $M \geq 1$, define the truncated payoff $X_M = \min(X, M)$. Then
\begin{align*}
\mathbb{E}[X_M] = \sum_{n=1}^{\lfloor \log_2 M \rfloor} 2^n \cdot \frac{1}{2^n} + M \cdot \mathbb{P}(X > M) = \lfloor \log_2 M \rfloor + M \cdot 2^{-\lfloor \log_2 M \rfloor},
\end{align*}
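which grows without bound as $M \to \infty$. The expectation is $\infty$ in the precise sense $\sup_{M > 0} \mathbb{E}[X_M] = \infty$. The mathematics is unambiguous; the apparent paradox comes from treating expected dollars as the measure of what the game is worth. A standard resolution is that human utility for money is concave rather than linear, so a utility-weighted expectation can be finite even when the dollar-weighted expectation is not. With logarithmic utility, for instance, $\sum_{n=1}^\infty 2^{-n} \log 2^n = 2 \log 2$.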
[/example]
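The truncation formula above can be checked with exact rational arithmetic. The following Python sketch is illustrative only: the function names are invented here, and it assumes an integer cap $M$ with $1 \leq M \leq 2^{200}$ so that $\lfloor \log_2 M \rfloor$ can be read off from the bit length.

```python
from fractions import Fraction

def truncated_mean_direct(M: int, n_max: int = 200) -> Fraction:
    """E[min(X, M)] summed term by term, with the tail beyond n_max added exactly."""
    total = Fraction(0)
    for n in range(1, n_max + 1):
        total += Fraction(min(2**n, M), 2**n)
    # For n > n_max every payoff 2^n exceeds M (assuming M <= 2^n_max),
    # so the remaining terms sum to M * 2^(-n_max).
    total += Fraction(M, 2**n_max)
    return total

def truncated_mean_formula(M: int) -> Fraction:
    """Closed form floor(log2 M) + M * 2^(-floor(log2 M)) for integer M >= 1."""
    k = M.bit_length() - 1  # floor(log2(M)) for a positive integer
    return k + Fraction(M, 2**k)

for cap in (1, 1000, 2**20):
    print(cap, truncated_mean_direct(cap), truncated_mean_formula(cap))
```

The two computations agree, and the slow logarithmic growth in the cap is visible: raising the cap from a thousand to about a million dollars roughly doubles the truncated mean.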
To define expectation in full generality, we build from the simplest possible random variables — those taking only finitely many values — and extend by approximation. The strategy mirrors the construction of the Lebesgue integral: simple functions first, non-negative functions next via monotone approximation, and finally general functions via the positive-negative decomposition.
The first step is to isolate the finite-valued random variables for which expectation can be computed by a direct weighted sum. Naming this class matters because these variables will serve as the approximating building blocks for arbitrary non-negative random variables.
Every random variable taking finitely many values can be written as a linear combination of indicator functions. This is the starting point because indicators are the atoms of the theory: knowing how to integrate them determines everything else.
The obstruction is that an arbitrary random variable may take infinitely many values, so a direct probability-weighted sum over its values need not be available. We first need a finite-valued class whose level sets partition the sample space, because that structure makes expectation a finite sum and later gives the approximants used for general random variables.
[definition: Simple Random Variable]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A random variable $X: \Omega \to \mathbb{R}$ is **simple** if it takes only finitely many values. Any such $X$ can be written in **standard form** as
\begin{align*}
X = \sum_{i=1}^n a_i \mathbb{1}_{A_i},
\end{align*}
where $a_1, \ldots, a_n \in \mathbb{R}$ are the distinct values of $X$ and $A_i = \{X = a_i\} \in \mathcal{F}$ are pairwise disjoint events with $\bigcup_{i=1}^n A_i = \Omega$.
[/definition]
For a simple random variable, expectation requires no limiting procedure and no approximation. The definition is the probability-weighted average of the values.
[definition: Expectation of a Simple Random Variable]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X = \sum_{i=1}^n a_i \mathbb{1}_{A_i}$ be a simple random variable in standard form. The **expectation** of $X$ is
\begin{align*}
\mathbb{E}[X] = \sum_{i=1}^n a_i \, \mathbb{P}(A_i).
\end{align*}
[/definition]
One must check that this formula does not depend on how $X$ is written as a sum of indicators. The standard form is unique, but a random variable can also be represented as $X = \sum_j b_j \mathbb{1}_{B_j}$ with the $B_j$ not necessarily disjoint. In all such representations, the sum $\sum_j b_j \mathbb{P}(B_j)$ gives the same value, which follows from finite additivity of $\mathbb{P}$.
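As a small sanity check of this representation independence, the sketch below uses a hypothetical six-point uniform sample space and an overlapping representation $X = 1 \cdot \mathbb{1}_{B_1} + 2 \cdot \mathbb{1}_{B_2}$. The sets and coefficients are chosen only for illustration.

```python
from collections import defaultdict
from fractions import Fraction

OMEGA = range(6)                          # hypothetical sample space {0, ..., 5}
P = {w: Fraction(1, 6) for w in OMEGA}    # uniform probability

def prob(event):
    return sum((P[w] for w in event), Fraction(0))

# Overlapping representation: B_1 and B_2 share the points 2 and 3.
terms = [(1, {0, 1, 2, 3}), (2, {2, 3, 4})]

def X(w):
    return sum(b for b, B in terms if w in B)

# Sum over the (non-disjoint) representation.
overlapping_sum = sum(b * prob(B) for b, B in terms)

# Standard form: group outcomes by the distinct values of X.
level_sets = defaultdict(set)
for w in OMEGA:
    level_sets[X(w)].add(w)
standard_sum = sum(a * prob(A) for a, A in level_sets.items())

print(overlapping_sum, standard_sum)
```

Both sums come out to the same rational number, as finite additivity predicts.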
The extension to non-negative random variables uses the fact that every non-negative measurable function is the pointwise supremum of all simple functions lying below it. This suggests defining the expectation as the supremum of expectations over all such simple approximants — and this is exactly the right definition.
[definition: Expectation of a Non-Negative Random Variable]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X: \Omega \to [0, \infty]$ be a non-negative random variable. The **expectation** of $X$ is
\begin{align*}
\mathbb{E}[X] = \sup \bigl\{ \mathbb{E}[Z] : Z \text{ is simple and } 0 \leq Z \leq X \bigr\}.
\end{align*}
The expectation takes values in $[0, \infty]$; we allow $\mathbb{E}[X] = \infty$.
[/definition]
For a general random variable, the positive and negative parts may both contribute infinite expectations, making $\mathbb{E}[X^+] - \mathbb{E}[X^-]$ of the form $\infty - \infty$, which is undefined. The expectation is therefore defined only when at least one of $\mathbb{E}[X^+]$ or $\mathbb{E}[X^-]$ is finite. This three-step construction culminates in the primary definition.
## Definition
The positive-negative decomposition is the point where the construction becomes a definition for real-valued random variables. The only obstruction is the indeterminate expression $\infty - \infty$, so the definition must state exactly when the two one-sided expectations can be combined.
[definition: Expectation]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X: \Omega \to \mathbb{R}$ be a random variable. Write $X^+ = \max(X, 0)$ and $X^- = \max(-X, 0)$, so that $X = X^+ - X^-$ and $|X| = X^+ + X^-$. If $\min(\mathbb{E}[X^+], \mathbb{E}[X^-]) < \infty$, the **expectation** of $X$ is
\begin{align*}
\mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-] \in [-\infty, \infty].
\end{align*}
The random variable $X$ is called **integrable** if $\mathbb{E}[|X|] = \mathbb{E}[X^+] + \mathbb{E}[X^-] < \infty$, in which case $\mathbb{E}[X] \in \mathbb{R}$.
[/definition]
The integrability condition $\mathbb{E}[|X|] < \infty$ is precisely $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, placing expectation squarely within the Lebesgue integration framework. The $L^p$ spaces for $p > 1$ give progressively stronger integrability: $X \in L^p$ means $\mathbb{E}[|X|^p] < \infty$, and the inclusion $L^p(\Omega, \mathcal{F}, \mathbb{P}) \subset L^1(\Omega, \mathcal{F}, \mathbb{P})$ holds for probability measures: since $|X| \leq 1 + |X|^p$ pointwise and $\mathbb{P}(\Omega) = 1$, we get $\mathbb{E}[|X|] \leq 1 + \mathbb{E}[|X|^p]$.
The abstract definition recovers the classical formulas. For a discrete random variable with $\mathbb{P}(X = x_k) = p_k$, the expectation is $\mathbb{E}[X] = \sum_k x_k p_k$ when the sum converges absolutely. For a continuous random variable with density $f_X \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mathcal{L}^1)$, the expectation is $\mathbb{E}[X] = \int_{-\infty}^\infty x f_X(x) \, d\mathcal{L}^1(x)$ when $\int_{-\infty}^\infty |x| f_X(x) \, d\mathcal{L}^1(x) < \infty$.
[example: Standard Discrete Distributions]
For $X \sim \operatorname{Ber}(p)$, the simple random variable formula gives
\begin{align*}
\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p.
\end{align*}
For $X \sim \operatorname{Poi}(\lambda)$ with $\mathbb{P}(X = k) = e^{-\lambda}\lambda^k/k!$, the expectation requires summing a series. Shifting the index by writing $k = j + 1$:
\begin{align*}
\mathbb{E}[X] = \sum_{k=0}^\infty k \cdot \frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda}\lambda \sum_{k=1}^\infty \frac{\lambda^{k-1}}{(k-1)!} = e^{-\lambda}\lambda \sum_{j=0}^\infty \frac{\lambda^j}{j!} = e^{-\lambda}\lambda \cdot e^{\lambda} = \lambda.
\end{align*}
The parameter $\lambda$ is both the expectation and (as a separate computation shows) the variance.
[/example]
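The Poisson series can also be summed numerically. This is a minimal sketch with a hypothetical helper name; it assumes a moderate $\lambda$, so that a hundred terms leave a negligible tail, and builds each probability from the previous one to avoid large factorials.

```python
import math

def poisson_mean_partial(lam: float, terms: int = 100) -> float:
    """Partial sum of k * P(X = k) for k = 0, ..., terms - 1."""
    total = 0.0
    pmf = math.exp(-lam)          # P(X = 0)
    for k in range(terms):
        total += k * pmf
        pmf *= lam / (k + 1)      # P(X = k + 1) from P(X = k)
    return total

print(poisson_mean_partial(3.5))
```

For $\lambda = 3.5$ the partial sum agrees with $\lambda$ to within floating-point error.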
## Linearity and Basic Properties
The single most important property of expectation is linearity. Unlike independence or correlation, which depend on the joint structure of random variables, linearity holds for all integrable random variables: if $X,Y\in L^1(\Omega,\mathcal F,\mathbb P)$ and $a,b\in\mathbb R$, then
\begin{align*}
\mathbb{E}[aX+bY]=a\mathbb{E}[X]+b\mathbb{E}[Y].
\end{align*}
This holds whether $X$ and $Y$ are independent, perfectly correlated, or anything in between. That universality is what makes expectation so useful: sums of random variables can be averaged term by term even when their joint distribution is complicated or unknown.
Linearity enables calculations that would otherwise require knowledge of complicated joint distributions. A canonical illustration is the expected number of fixed points in a random permutation, where direct combinatorial computation involves derangements and inclusion-exclusion, but linearity reduces everything to a single calculation.
[example: Fixed Points of a Random Permutation]
Let $\sigma$ be a uniformly random permutation of $\{1, 2, \ldots, n\}$, and let $F$ count the fixed points: $F = |\{i : \sigma(i) = i\}|$. Writing $F$ as a sum of indicators,
\begin{align*}
F = \sum_{i=1}^n \mathbb{1}_{\{\sigma(i) = i\}},
\end{align*}
linearity of expectation gives
\begin{align*}
\mathbb{E}[F] = \sum_{i=1}^n \mathbb{P}(\sigma(i) = i).
\end{align*}
By symmetry of the uniform distribution, each of the $n!$ permutations is equally likely, and exactly $(n-1)!$ of them fix position $i$. So $\mathbb{P}(\sigma(i) = i) = (n-1)!/n! = 1/n$ for each $i$, giving $\mathbb{E}[F] = n \cdot (1/n) = 1$.
The expected number of fixed points equals $1$, regardless of $n$. The indicators $\mathbb{1}_{\{\sigma(i) = i\}}$ are not independent — whether position $i$ is fixed depends on the other positions — but linearity requires no independence at all.
[/example]
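For small $n$ the claim can be confirmed by brute force. The sketch below enumerates all $n!$ permutations, which is only feasible for small $n$, and averages the number of fixed points exactly. The function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n: int) -> Fraction:
    """Average number of fixed points over all permutations of {0, ..., n-1}."""
    total = 0
    count = 0
    for sigma in permutations(range(n)):
        total += sum(1 for i, s in enumerate(sigma) if s == i)
        count += 1
    return Fraction(total, count)

for n in range(1, 7):
    print(n, expected_fixed_points(n))
```

The enumeration never uses independence or the derangement formula. It confirms that the average is exactly $1$ for each $n$ tried.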
The connection between expectation and probability runs deeper than just the definition. To pass between events and random variables, we need a precise translation that turns a probability question into an expectation question. Indicator functions provide that translation: an event is represented by the random variable that records whether the event occurred.
[quotetheorem:3534]
This identity embeds all of probability theory within integration theory: the axioms of a probability measure become special cases of the linearity and monotonicity of integration. Inequalities about expectations — Markov, Chebyshev, and others — become inequalities about probabilities via this bridge.
Once probabilities are viewed as expectations of indicators, the next question is which order and size relations survive after taking expectation. Monotonicity lets inequalities between random variables become inequalities between their expectations, while the triangle inequality controls signed cancellation and gives the basic estimate behind the $L^1$ norm.
[quotetheorem:3535]
These two estimates are the basic stability checks for expectation. Monotonicity is what makes tail events and non-negative bounds useful, while the triangle inequality controls integrability under addition. The integrability hypotheses matter because signed cancellation is only meaningful when the positive and negative parts are not both infinite; later probability estimates rely on exactly this control.
## Convergence Theorems
Among the most important questions in probability theory is: when does $\mathbb{E}[X_n] \to \mathbb{E}[X]$ follow from $X_n \to X$? Pointwise convergence alone is not enough. Consider the sequence $X_n = n \cdot \mathbb{1}_{[0,1/n]}$ on $([0,1], \mathcal{B}([0,1]), \mathcal{L}^1)$. For every $\omega \in (0,1]$, we have $X_n(\omega) = 0$ for all $n > 1/\omega$, so $X_n \to 0$ pointwise on $(0,1]$ and hence almost surely. Yet $\mathbb{E}[X_n] = n \cdot (1/n) = 1$ for all $n$. The expectation of the limit is $0$, while the limit of the expectations is $1$. Without additional structure, these two operations do not commute.
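The counterexample is simple enough to tabulate. In this sketch (names are illustrative), a fixed sample point $\omega = 1/50$ sees large values only while $n \leq 50$, yet every $X_n$ has expectation exactly $1$.

```python
from fractions import Fraction

def X(n: int, omega: Fraction) -> int:
    """X_n = n on [0, 1/n] and 0 elsewhere."""
    return n if 0 <= omega <= Fraction(1, n) else 0

def expectation(n: int) -> Fraction:
    """Height n times the Lebesgue measure 1/n of [0, 1/n]."""
    return n * Fraction(1, n)

omega = Fraction(1, 50)
values = [X(n, omega) for n in (10, 50, 51, 1000)]
means = [expectation(n) for n in (10, 50, 51, 1000)]
print(values, means)
```

The values at $\omega$ drop to $0$ once $n > 1/\omega$, while the means stay at $1$.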
Three theorems identify natural and widely applicable conditions under which they do.
[quotetheorem:509]
The Monotone Convergence Theorem is the engine behind interchanging sums and expectations: if $Y_1, Y_2, \ldots$ are non-negative, take $X_n = \sum_{k=1}^n Y_k$ to obtain $\mathbb{E}\!\left[\sum_{k=1}^\infty Y_k\right] = \sum_{k=1}^\infty \mathbb{E}[Y_k]$. It also confirms that the definition of $\mathbb{E}[X]$ as a supremum over simple functions agrees with the limit of expectations of any increasing sequence of simple functions approximating $X$ from below.
When the sequence is not monotone, equality can fail because positive mass may move around or concentrate on smaller and smaller sets. Still, non-negativity prevents the limiting expectation from overshooting the limiting lower bound of the expectations. The result captures the one-sided continuity that remains without monotonicity.
[quotetheorem:510]
The direction of this inequality is worth absorbing. It says that mass can "escape to infinity": the liminf of the expectations can exceed the expectation of the liminf. The sequence $X_n = n \cdot \mathbb{1}_{[0,1/n]}$ demonstrates the gap: $\liminf_n X_n = 0$ almost surely (so $\mathbb{E}[\liminf_n X_n] = 0$), yet $\liminf_n \mathbb{E}[X_n] = 1$. The inequality holds here strictly, as $0 < 1$.
For the sharpest result — equality rather than an inequality — pointwise convergence must be paired with a uniform integrable bound. Such a bound prevents mass from escaping to infinity or concentrating on shrinking sets, exactly the behavior that defeated the earlier example. This is the hypothesis that allows limits and expectations to commute.
The central question is now no longer whether some one-sided inequality survives, but when actual convergence of random variables forces convergence of their expectations. The answer is the dominated convergence principle, whose hypotheses are designed exactly to rule out the escape of mass seen above.
[quotetheorem:4]
The dominating function $Y$ is indispensable. For the sequence $X_n = n \cdot \mathbb{1}_{[0,1/n]}$, no integrable dominator exists: any function $Y$ satisfying $Y \geq X_n$ pointwise for all $n$ must satisfy $Y(\omega) \geq n$ for every $\omega \in (0, 1/n]$ and every $n$. In particular $Y \geq n$ on $(1/(n+1), 1/n]$, an interval of length $1/(n(n+1))$, so $\mathbb{E}[Y] \geq \sum_{n=1}^\infty n \cdot \frac{1}{n(n+1)} = \sum_{n=1}^\infty \frac{1}{n+1} = \infty$. In practice, the Dominated Convergence Theorem is the most frequently used tool: one identifies an integrable envelope and then passes limits through expectations without further justification.
## Inequalities
Several fundamental inequalities relate the expectation of a function of a random variable to probabilities or to other expectations. These form the core toolkit for bounding probabilities in terms of moments.
The simplest and most general bound — requiring only non-negativity and a finite first moment — is Markov's inequality.
[quotetheorem:514]
Markov's inequality is weak but universal. It says that a non-negative random variable cannot spend too much probability mass far above its mean unless the mean itself is large. The statement is sharp in general, so any stronger tail estimate must use additional information such as variance, boundedness, independence, or exponential moments.
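To see both the sharpness and the typical slack, the sketch below compares an exact tail probability with the Markov bound in its usual form $\mathbb{P}(X \geq a) \leq \mathbb{E}[X]/a$ for $a > 0$. The two distributions are hypothetical examples chosen for illustration.

```python
from fractions import Fraction

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def tail(pmf, a):
    """Exact P(X >= a)."""
    return sum(p for x, p in pmf.items() if x >= a)

def markov_bound(pmf, a):
    return mean(pmf) / a

# Two-point distribution: all mass above 0 sits exactly at the threshold a = 8.
two_point = {0: Fraction(3, 4), 8: Fraction(1, 4)}
# Fair six-sided die.
die = {k: Fraction(1, 6) for k in range(1, 7)}

print(tail(two_point, 8), markov_bound(two_point, 8))  # equality
print(tail(die, 5), markov_bound(die, 5))              # strict inequality
```

The two-point distribution attains the bound, which is why no improvement is possible from the mean alone. The die shows how loose the bound can be for a spread-out distribution.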
When the second moment is finite, we can measure not just the size of a non-negative random variable but its typical deviation from the mean. Large deviations of $X$ force the squared centered variable $(X - \mathbb{E}[X])^2$ to be large, so Markov's inequality applied to that square produces a sharper tail bound.
This shifts the problem from bounding the event that a non-negative variable is large to bounding the event that a general variable is far from its mean. Chebyshev's inequality packages that reduction into the standard variance-based tail estimate.
[quotetheorem:1126]
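Write $\mu = \mathbb{E}[X]$ and $\sigma^2 = \operatorname{Var}(X)$, and assume $\sigma^2 < \infty$. The sample mean $\bar{X}_n = (X_1 + \cdots + X_n)/n$ of $n$ i.i.d. copies of $X$ has mean $\mu$ and variance $\sigma^2/n$, so Chebyshev's inequality gives $\mathbb{P}(|\bar{X}_n - \mu| \geq \varepsilon) \leq \sigma^2/(n\varepsilon^2) \to 0$ as $n \to \infty$ for every $\varepsilon > 0$. This is a direct proof of the weak law of large numbers for variables with finite variance.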
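For Bernoulli trials the deviation probability of the sample mean can be computed exactly from the binomial distribution and compared with the bound $\sigma^2/(n\varepsilon^2)$. The sketch below does this with rational arithmetic. The parameter choices ($p = 1/2$, $\varepsilon = 1/10$) and function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def exact_deviation_prob(n: int, p: Fraction, eps: Fraction) -> Fraction:
    """P(|S_n / n - p| >= eps) for S_n ~ Binomial(n, p), computed exactly."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) >= eps:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

def chebyshev_bound(n: int, p: Fraction, eps: Fraction) -> Fraction:
    """sigma^2 / (n * eps^2) with sigma^2 = p(1 - p) for a Bernoulli(p) variable."""
    return p * (1 - p) / (n * eps**2)

p, eps = Fraction(1, 2), Fraction(1, 10)
for n in (25, 100, 400):
    print(n, float(exact_deviation_prob(n, p, eps)), float(chebyshev_bound(n, p, eps)))
```

The exact probabilities sit below the bound and shrink much faster than $1/n$. Chebyshev is enough for the weak law but far from tight.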
Moment bounds are not only about powers; they also depend on how nonlinear functions interact with averaging. Convex functions penalize spread, so applying a convex function before averaging should be at least as large as applying it after averaging.
The new question is how to compare two quantities that look similar but are not interchangeable: the value of a convex function at the mean, and the mean of that convex function applied pointwise. Jensen's inequality supplies the structural rule for moving convex functions across expectation, and it is the source of many classical mean inequalities.
[quotetheorem:1977]
The geometric content: convexity means $\varphi$ lies below every chord connecting two of its points. Averaging a random variable "flattens" its distribution toward the mean, and a convex function of the mean is no larger than the average of the convex function applied pointwise. The inequality reverses for concave $\varphi$.
[example: Jensen Implies the AM-GM Inequality]
Let $a_1, \ldots, a_n > 0$ and let $W$ be a random variable taking each value $\log a_i$ with probability $1/n$. Apply Jensen's inequality with the convex function $\varphi(t) = e^t$ (convex since $\varphi''(t) = e^t > 0$):
\begin{align*}
\exp(\mathbb{E}[W]) \leq \mathbb{E}[e^W].
\end{align*}
Writing this out explicitly:
\begin{align*}
\exp\!\left(\frac{\log a_1 + \cdots + \log a_n}{n}\right) \leq \frac{e^{\log a_1} + \cdots + e^{\log a_n}}{n} = \frac{a_1 + \cdots + a_n}{n}.
\end{align*}
The left side is $(a_1 \cdots a_n)^{1/n}$, the geometric mean. Jensen gives the arithmetic mean–geometric mean inequality
\begin{align*}
(a_1 \cdots a_n)^{1/n} \leq \frac{a_1 + \cdots + a_n}{n}.
\end{align*}
[/example]
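The two sides of Jensen's inequality in this example are easy to evaluate numerically. The following sketch (function name hypothetical) returns $\exp(\mathbb{E}[W])$ and $\mathbb{E}[e^W]$ for a list of positive numbers, using floating-point arithmetic.

```python
import math

def jensen_sides(a):
    """Return (exp(E[W]), E[exp(W)]) where W is uniform on the values log(a_i)."""
    n = len(a)
    logs = [math.log(x) for x in a]
    lhs = math.exp(sum(logs) / n)               # geometric mean
    rhs = sum(math.exp(w) for w in logs) / n    # arithmetic mean
    return lhs, rhs

print(jensen_sides([1, 2, 4, 8]))
print(jensen_sides([3, 3, 3]))
```

When all the $a_i$ coincide, $W$ is constant and the two sides agree up to rounding. This is the equality case of AM-GM.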
Many estimates in probability require controlling an expectation of a product, such as $\mathbb{E}[|XY|]$, when the two factors are controlled in different moment scales. The correct principle is that integrability can be split between conjugate exponents. Hölder's inequality is the formal version of that tradeoff.
[quotetheorem:516]
The special case $p = q = 2$ is important enough to isolate because $L^2$ is the natural home of variance, covariance, and orthogonality. In that setting, products are controlled by lengths, exactly as in Euclidean geometry, and the resulting estimate is used constantly in second-moment arguments.
The forward need is to control quantities such as covariances and correlations, which are expectations of products of centered variables. These are inner products in disguise: on $L^2(\Omega,\mathcal F,\mathbb P)$, the inner product is
\begin{align*}
\langle X,Y\rangle=\mathbb E[XY],
\end{align*}
and the associated norm is $\|X\|_2=(\mathbb E[X^2])^{1/2}$. With this translation, the abstract Cauchy-Schwarz theorem becomes the probability estimate $|\mathbb E[XY]|\leq \|X\|_2\|Y\|_2$.
[quotetheorem:432]
Applying Cauchy–Schwarz to the centered variables $X - \mathbb{E}[X]$ and $Y - \mathbb{E}[Y]$ gives $|\operatorname{Cov}(X, Y)| \leq \sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}$, confirming that the Pearson correlation coefficient lies in $[-1, 1]$.
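A concrete instance: on the hypothetical sample space of two fair dice, take $X$ to be the first die and $Y$ the sum of both. The sketch computes the covariance and variances exactly and checks the Cauchy–Schwarz bound through the squared correlation.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice
p = Fraction(1, len(outcomes))

def E(f):
    """Expectation of f(d1, d2) under the uniform measure."""
    return sum(f(d1, d2) * p for d1, d2 in outcomes)

def X(d1, d2):
    return d1

def Y(d1, d2):
    return d1 + d2

mx, my = E(X), E(Y)
cov = E(lambda d1, d2: (X(d1, d2) - mx) * (Y(d1, d2) - my))
var_x = E(lambda d1, d2: (X(d1, d2) - mx) ** 2)
var_y = E(lambda d1, d2: (Y(d1, d2) - my) ** 2)
rho_squared = cov**2 / (var_x * var_y)
print(cov, var_x, var_y, rho_squared)
```

Here $\rho^2 = 1/2$: the first die accounts for half the variance of the sum, and the bound $\operatorname{Cov}(X,Y)^2 \leq \operatorname{Var}(X)\operatorname{Var}(Y)$ holds with room to spare.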
## Computing Expectations
The definition of $\mathbb{E}[X] = \int_\Omega X \, d\mathbb{P}$ integrates over the abstract probability space $\Omega$. In practice, we want to compute $\mathbb{E}[g(X)]$ for a function $g$ of a random variable $X$ whose distribution is known. The Law of the Unconscious Statistician allows integration over $\mathbb{R}$ using the distribution of $X$ rather than over $\Omega$.
[quotetheorem:3536]
The Law of the Unconscious Statistician computes expectations from the distribution by integrating over possible values of $X$. A complementary question is whether the mean can be recovered only from tail probabilities, which are often easier to estimate than the full distribution.
For non-negative random variables, the geometric idea is horizontal slicing: $\mathbb{E}[X] = \int_\Omega X \, d\mathbb{P}$ is the area under the graph of $X$ over $\Omega$. Slicing at height $t$ contributes $\mathbb{P}(X > t) \, dt$, and integrating over all levels gives the total area.
This creates a useful computational target: replace an integral of the random variable itself by an integral over the sizes of its upper level sets. The obstruction is that the argument is not a finite sum; it slices an arbitrary non-negative random variable into continuously many level sets. We need the next theorem to justify that tail-integral representation for the expectation and to make this slicing calculation legitimate.
[quotetheorem:1136]
The exponential distribution is the cleanest testing ground for the Layer Cake Formula: its tail probability $\mathbb{P}(X > t) = e^{-\lambda t}$ is a pure exponential, so the expectation reduces to a one-line integral over tail probabilities.
[example: Mean of an Exponential via Layer Cake]
Let $X \sim \operatorname{Exp}(\lambda)$ with $\lambda > 0$. The tail probability is $\mathbb{P}(X > t) = e^{-\lambda t}$ for $t \geq 0$. The Layer Cake Formula gives the mean directly:
\begin{align*}
\mathbb{E}[X] = \int_0^\infty e^{-\lambda t} \, d\mathcal{L}^1(t) = \left[-\frac{1}{\lambda} e^{-\lambda t}\right]_0^\infty = \frac{1}{\lambda}.
\end{align*}
The calculation uses only the survival function, not the density. This is why the tail formula is especially useful in probability estimates: upper bounds on $\mathbb{P}(X>t)$ immediately become upper bounds on $\mathbb{E}[X]$.
[/example]
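The same integral can be approximated numerically from the survival function alone. This is a rough sketch using a midpoint rule on a truncated interval. The helper name, the cutoff `upper`, and the step count are assumptions made for illustration, not tuned choices.

```python
import math

def layer_cake_mean(survival, upper: float, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of P(X > t) over [0, upper]."""
    h = upper / steps
    return h * sum(survival((i + 0.5) * h) for i in range(steps))

lam = 2.0
approx = layer_cake_mean(lambda t: math.exp(-lam * t), upper=40.0)
print(approx, 1 / lam)
```

Only the tail function enters the computation, so the same helper applies to any non-negative random variable whose survival function is known or bounded.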
## Moments and Generating Functions
The expectation $\mathbb{E}[X]$ is the first moment of $X$. Higher moments capture the shape of the distribution — its spread, asymmetry, and tail heaviness — and together they often determine the distribution uniquely.
[definition: Moments of a Random Variable]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X: \Omega \to \mathbb{R}$ be a random variable. For $k \in \mathbb{N}$, the **$k$th moment** of $X$ is $\mathbb{E}[X^k]$, provided it is finite. The **$k$th central moment** is $\mathbb{E}[(X - \mathbb{E}[X])^k]$, when defined. The **variance** of $X$ is the second central moment:
\begin{align*}
\operatorname{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right].
\end{align*}
[/definition]
Expanding the square in the variance formula and applying linearity yields the computing identity $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. Variance measures the expected squared deviation from the mean; its square root $\sigma = \sqrt{\operatorname{Var}(X)}$ is the standard deviation, expressed in the same units as $X$.
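For completeness, the expansion can be written out as follows, using the shorthand $\mu = \mathbb{E}[X]$ (introduced here only for readability) and assuming $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$ so that every term is finite:
\begin{align*}
\operatorname{Var}(X) &= \mathbb{E}\!\left[X^2 - 2\mu X + \mu^2\right] \\
&= \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 \\
&= \mathbb{E}[X^2] - \mu^2.
\end{align*}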
Rather than analyzing moments one at a time, it is often more efficient to encode them all in a single function. The problem is that the raw list $\mathbb{E}[X], \mathbb{E}[X^2], \ldots$ does not by itself provide an analytic object that can be manipulated under sums, limits, or independence.
The exponential function is the natural encoding because its Taylor expansion contains every power of $X$ at once. This motivates the following definition: when $e^{tX}$ is integrable near $t = 0$, the resulting transform is finite in a neighborhood of the origin, and derivatives at zero recover the moments systematically.
[definition: Moment Generating Function]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X: \Omega \to \mathbb{R}$ be a random variable. The **moment generating function** of $X$ is the function
\begin{align*}
M_X: \mathbb{R} &\to [0, \infty] \\
t &\mapsto \mathbb{E}[e^{tX}].
\end{align*}
[/definition]
When $M_X(t) < \infty$ for all $t \in (-\delta, \delta)$ for some $\delta > 0$, the function $M_X$ is infinitely differentiable on $(-\delta, \delta)$, and the $k$th derivative at zero recovers the $k$th moment: $M_X^{(k)}(0) = \mathbb{E}[X^k]$. Differentiating under the expectation is justified by the Dominated Convergence Theorem, using the finiteness of $M_X$ near zero as the domination condition.
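As a worked illustration, take $X \sim \operatorname{Poi}(\lambda)$. The series defining $M_X$ can be summed in closed form:
\begin{align*}
M_X(t) = \sum_{k=0}^\infty e^{tk} \frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^\infty \frac{(\lambda e^t)^k}{k!} = \exp\bigl(\lambda(e^t - 1)\bigr),
\end{align*}
which is finite for every $t \in \mathbb{R}$. Differentiating twice gives
\begin{align*}
M_X'(t) &= \lambda e^t M_X(t), & M_X'(0) &= \lambda, \\
M_X''(t) &= \bigl(\lambda e^t + \lambda^2 e^{2t}\bigr) M_X(t), & M_X''(0) &= \lambda + \lambda^2.
\end{align*}
Hence $\mathbb{E}[X] = \lambda$ and $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \lambda$, in agreement with the direct series computation for the Poisson mean.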
Not every random variable has a moment generating function on any open interval. The Cauchy distribution — with density $f(x) = 1/(\pi(1 + x^2))$ — has $M_X(t) = \infty$ for all $t \neq 0$, and its first moment fails to exist. The characteristic function remedies this: since $|e^{iuX}| = 1$, the expectation always converges.
The forward need is a transform that still encodes the distribution when exponential moments fail to exist. Replacing real exponentials by complex oscillations gives a bounded integrand, so the resulting function is available for every real-valued random variable.
[definition: Characteristic Function]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X: \Omega \to \mathbb{R}$ be a random variable. The **characteristic function** of $X$ is the function
\begin{align*}
\phi_X: \mathbb{R} &\to \mathbb{C} \\
u &\mapsto \mathbb{E}[e^{iuX}].
\end{align*}
[/definition]
The characteristic function satisfies $|\phi_X(u)| \leq 1$ for all $u$ and $\phi_X(0) = 1$. It is uniformly continuous on $\mathbb{R}$ and determines the distribution of $X$ uniquely via the Fourier inversion formula. When $\mathbb{E}[|X|^k] < \infty$, differentiating $k$ times at zero gives $\phi_X^{(k)}(0) = i^k \mathbb{E}[X^k]$.
The next problem is a convergence problem rather than a definition problem: if a sequence of random variables has transforms that converge pointwise, when does that imply convergence of the underlying distributions? Lévy's continuity theorem gives the bridge from pointwise convergence of characteristic functions back to convergence in distribution.
[quotetheorem:519]
This theorem converts convergence in distribution into pointwise convergence of functions on $\mathbb{R}$, which is typically much easier to verify. The Central Limit Theorem, for instance, reduces to showing that the characteristic function of the standardized partial sum converges pointwise to $e^{-u^2/2}$, the characteristic function of the standard normal.
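A simple case where this convergence can be watched directly is a sum of i.i.d. signs. This example is an assumption chosen for illustration: if each $X_k$ equals $\pm 1$ with probability $1/2$, then $\phi_{X_k}(u) = \cos u$, and the standardized sum $(X_1 + \cdots + X_n)/\sqrt{n}$ has characteristic function $\cos(u/\sqrt{n})^n$. The sketch compares this with $e^{-u^2/2}$.

```python
import math

def cf_standardized_sign_sum(u: float, n: int) -> float:
    """Characteristic function of (X_1 + ... + X_n) / sqrt(n) for i.i.d. +/-1 signs."""
    return math.cos(u / math.sqrt(n)) ** n

def cf_standard_normal(u: float) -> float:
    return math.exp(-u * u / 2)

for u in (0.5, 1.0, 2.0):
    errors = [abs(cf_standardized_sign_sum(u, n) - cf_standard_normal(u))
              for n in (10, 100, 1000, 10000)]
    print(u, errors)
```

For each fixed $u$ the error shrinks as $n$ grows, which is the pointwise convergence the continuity theorem requires.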
## Conditional Expectation
The expectation $\mathbb{E}[X]$ averages $X$ over all outcomes. Often we have partial information — we know the value of some other random variable $Y$, or we know which events in a sub-$\sigma$-algebra have occurred — and we want the average of $X$ given that information.
For discrete $Y$ taking values $y_1, y_2, \ldots$ with positive probability, the answer is direct: the conditional expectation on the event $\{Y = y_k\}$ is
\begin{align*}
\mathbb{E}[X \mid Y = y_k] = \frac{\mathbb{E}[X \mathbb{1}_{\{Y = y_k\}}]}{\mathbb{P}(Y = y_k)},
\end{align*}
and $\mathbb{E}[X \mid Y]$ is the random variable that equals $\mathbb{E}[X \mid Y = y_k]$ on $\{Y = y_k\}$. For continuous $Y$, the event $\{Y = y\}$ has probability zero, so this formula breaks down. The measure-theoretic approach defines conditional expectation not pointwise for each value of $Y$, but as a $\mathcal{G}$-measurable random variable characterized by an integral identity — where $\mathcal{G} = \sigma(Y)$ encodes the information carried by $Y$.
[definition: Conditional Expectation]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, and let $\mathcal{G} \subset \mathcal{F}$ be a sub-$\sigma$-algebra. The **conditional expectation** of $X$ given $\mathcal{G}$, written $\mathbb{E}[X \mid \mathcal{G}]$, is any random variable $Z: \Omega \to \mathbb{R}$ satisfying:
(i) $Z$ is $\mathcal{G}$-measurable.
(ii) For every $G \in \mathcal{G}$:
\begin{align*}
\int_G Z \, d\mathbb{P} = \int_G X \, d\mathbb{P}.
\end{align*}
[/definition]
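On a finite sample space the two defining conditions can be verified exhaustively. The sketch below uses a hypothetical two-dice space with $X$ the sum and $Y$ the first die. It builds $Z$ from the elementary formula for discrete $Y$ and then checks the integral identity for every event $G = \{Y \in S\}$ in $\sigma(Y)$.

```python
from fractions import Fraction
from itertools import combinations, product

omega = list(product(range(1, 7), repeat=2))  # two fair dice
P = Fraction(1, 36)                           # probability of each outcome

def X(w):
    return w[0] + w[1]

def Y(w):
    return w[0]

def cond_exp_given_value(y):
    """E[X | Y = y] = E[X 1_{Y=y}] / P(Y = y)."""
    event = [w for w in omega if Y(w) == y]
    return sum(X(w) * P for w in event) / (len(event) * P)

def Z(w):
    # Z depends on w only through Y(w), so it is sigma(Y)-measurable.
    return cond_exp_given_value(Y(w))

def integral(f, event):
    return sum(f(w) * P for w in event)

# Every event in sigma(Y) has the form {Y in S} for a subset S of {1, ..., 6}.
identity_holds = all(
    integral(Z, G) == integral(X, G)
    for r in range(7)
    for S in combinations(range(1, 7), r)
    for G in [[w for w in omega if Y(w) in S]]
)
print(cond_exp_given_value(3), identity_holds)
```

The check covers all $2^6$ events of $\sigma(Y)$, including the empty set and $\Omega$. The case $G = \Omega$ is the statement $\mathbb{E}[Z] = \mathbb{E}[X]$.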
Existence and uniqueness (up to $\mathbb{P}$-almost sure equality) follow from the Radon-Nikodym theorem. The signed measure $\nu(G) = \int_G X \, d\mathbb{P}$ on $(\Omega, \mathcal{G})$ is absolutely continuous with respect to $\mathbb{P}|_\mathcal{G}$, and the Radon-Nikodym derivative $d\nu / d(\mathbb{P}|_\mathcal{G})$ is the conditional expectation.
Once conditional expectation has been defined abstractly, we need the next theorem to settle existence and uniqueness: does the integral identity actually determine a random variable, and is that random variable unique up to null sets? Without this theorem, conditional expectation would be only notation rather than a well-defined object.
[quotetheorem:1147]
With existence and uniqueness secure, the next need is a calculus of conditional expectation. Conditioning should preserve the basic algebraic and order properties of ordinary expectation: constants or known quantities should stay known, independent information should not change the average, positive variables should have positive conditional averages, and linear combinations should condition linearly.
[quotetheorem:1148]
The most used structural rule is the tower property. We need the next theorem to answer what happens when two levels of information are nested: if one first averages using richer information and then averages again using poorer information, the result should be the same as averaging directly with the poorer information.
[quotetheorem:1150]
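The nested-information statement can be checked on a finite example. In this sketch the setup is hypothetical: two fair dice, $X$ their sum, richer information given by the first die, and poorer information given by the parity of the first die. Conditioning on a finite partition of a uniform space amounts to averaging over each block.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure

def cond_exp(f, key):
    """E[f | sigma(key)] on the uniform space: average f over each level set of key."""
    groups = {}
    for w in omega:
        groups.setdefault(key(w), []).append(w)
    averages = {k: sum(f(w) for w in ws) * Fraction(1, len(ws))
                for k, ws in groups.items()}
    return lambda w: averages[key(w)]

def X(w):
    return w[0] + w[1]

def first(w):
    return w[0]          # richer information: the first die

def parity(w):
    return w[0] % 2      # poorer information: parity of the first die

inner = cond_exp(X, first)            # E[X | first die]
two_step = cond_exp(inner, parity)    # E[ E[X | first die] | parity ]
direct = cond_exp(X, parity)          # E[X | parity]

tower_holds = all(two_step(w) == direct(w) for w in omega)
print(tower_holds, direct((1, 1)), direct((2, 1)))
```

Averaging in two stages and averaging once against the coarser partition give the same random variable, outcome by outcome.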
When $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$, conditional expectation has a geometric interpretation: it is the orthogonal projection of $X$ onto the closed subspace $L^2(\Omega, \mathcal{G}, \mathbb{P}) \subset L^2(\Omega, \mathcal{F}, \mathbb{P})$ of $\mathcal{G}$-measurable square-integrable random variables.
This interpretation raises a precise optimization question: among all square-integrable predictions that only use the information in $\mathcal{G}$, which one is closest to $X$ in mean squared error? The projection theorem answers that question and identifies conditional expectation as the best such predictor.
[quotetheorem:3537]
This projection view explains why conditional expectation is the best predictor: among all $\mathcal{G}$-measurable square-integrable functions, $\mathbb{E}[X \mid \mathcal{G}]$ minimizes the mean squared prediction error. In linear regression, the conditional expectation of $Y$ given covariates $(X_1, \ldots, X_k)$ is the optimal predictor; the regression coefficients arise by restricting the class of predictors to linear functions.
[example: Conditional Expectation for a Jointly Gaussian Pair]
Let $(X, Y)$ be jointly Gaussian with means $\mu_X, \mu_Y$, variances $\sigma_X^2, \sigma_Y^2 > 0$, and Pearson correlation $\rho \in (-1, 1)$. The conditional distribution of $X$ given $Y = y$ is Gaussian, and the conditional mean is
\begin{align*}
\mathbb{E}[X \mid Y = y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y).
\end{align*}
As a random variable, $\mathbb{E}[X \mid Y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(Y - \mu_Y)$. The defining conditions hold: the quantity $\mu_X + \rho(\sigma_X/\sigma_Y)(Y - \mu_Y)$ is $\sigma(Y)$-measurable. To verify the integral identity, write
\begin{align*}
X = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(Y - \mu_Y) + \varepsilon,
\end{align*}
where $\varepsilon = X - \mu_X - \rho(\sigma_X/\sigma_Y)(Y - \mu_Y)$. Since $(X, Y)$ is jointly Gaussian, $\varepsilon$ is Gaussian, and
\begin{align*}
\operatorname{Cov}(\varepsilon, Y) = \operatorname{Cov}(X, Y) - \rho \frac{\sigma_X}{\sigma_Y} \operatorname{Var}(Y) = \rho \sigma_X \sigma_Y - \rho \frac{\sigma_X}{\sigma_Y} \cdot \sigma_Y^2 = 0.
\end{align*}
The pair $(\varepsilon, Y)$ is a linear image of $(X, Y)$ and hence jointly Gaussian. Uncorrelated jointly Gaussian random variables are independent, so $\varepsilon \perp Y$. For any Borel set $B \in \mathcal{B}(\mathbb{R})$, the indicator $\mathbb{1}_{\{Y \in B\}}$ is $\sigma(Y)$-measurable and hence independent of $\varepsilon$. Since also $\mathbb{E}[\varepsilon] = 0$, this gives $\mathbb{E}[\varepsilon \, \mathbb{1}_{\{Y \in B\}}] = \mathbb{E}[\varepsilon] \cdot \mathbb{P}(Y \in B) = 0$. Therefore,
\begin{align*}
\int_{\{Y \in B\}} X \, d\mathbb{P} &= \mathbb{E}\!\left[\left(\mu_X + \rho \frac{\sigma_X}{\sigma_Y}(Y - \mu_Y)\right) \mathbb{1}_{\{Y \in B\}}\right] + \mathbb{E}\!\left[\varepsilon \, \mathbb{1}_{\{Y \in B\}}\right] \\
&= \int_{\{Y \in B\}} \left(\mu_X + \rho \frac{\sigma_X}{\sigma_Y}(Y - \mu_Y)\right) d\mathbb{P},
\end{align*}
confirming the defining integral identity.
When $\rho = 0$ (uncorrelated, hence independent for jointly Gaussian variables), $\mathbb{E}[X \mid Y] = \mu_X = \mathbb{E}[X]$: information about $Y$ tells us nothing about $X$. When $|\rho|$ approaches $1$, $Y$ nearly determines $X$ through a linear relation, and the conditional expectation converges to the exact linear predictor. For $\rho \in (-1, 1)$, the correction $\rho(\sigma_X/\sigma_Y)(Y - \mu_Y)$ adjusts the prior mean by the deviation of $Y$ from its mean, weighted by the correlation and the ratio of standard deviations.
[/example]