A calculation with a finite sample space can look harmless until the sample space stops being finite. If a coin is tossed until the first head appears, the possible values are $1,2,3,\dots$, and no finite table contains the whole distribution. Yet the random variable is still described by a list of point probabilities, and the total probability is still obtained by adding them. Discrete random variables are the class for which this pointwise bookkeeping remains exact, even when the list is infinite.
The central issue is how a measurable map $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$ can be studied through the probabilities of its individual values. For a general real-valued random variable, singletons may carry no probability at all. A continuous uniform random variable on $(0,1)$ satisfies $\mathbb P(X=x)=0$ for every $x$, so the values $\mathbb P(X=x)$ do not reconstruct its law. In the discrete case, the opposite happens: the law is concentrated on a [countable set](/page/Countable%20Set), and the masses at the atoms contain all information about probabilities, expectations, and generating functions.
[example: Waiting Time for the First Head]
Let a fair coin be tossed independently until the first head appears, and let $X$ be the toss number on which the first head occurs. For $k \in \mathbb N$, the event $\{X=k\}$ is the event that the first $k-1$ tosses are tails and the $k$-th toss is heads. By independence of the tosses and fairness of the coin,
\begin{align*}
\mathbb P(X=k)=\left(\frac{1}{2}\right)^{k-1}\frac{1}{2}=\left(\frac{1}{2}\right)^k.
\end{align*}
Thus the possible values of $X$ are $1,2,3,\dots$, so the range is countable but not finite. To check that the listed point probabilities carry total mass $1$, compute the partial sums:
\begin{align*}
\sum_{k=1}^{n}\mathbb P(X=k)=\sum_{k=1}^{n}\left(\frac{1}{2}\right)^k=\frac{1}{2}\frac{1-\left(\frac{1}{2}\right)^n}{1-\frac{1}{2}}=1-\left(\frac{1}{2}\right)^n.
\end{align*}
Since $\left(\frac{1}{2}\right)^n \to 0$ as $n\to\infty$, it follows that
\begin{align*}
\sum_{k=1}^{\infty}\mathbb P(X=k)=\lim_{n\to\infty}\left(1-\left(\frac{1}{2}\right)^n\right)=1.
\end{align*}
This example shows why discreteness should not mean finite-valued: what matters is that the possible values can be listed and the probabilities assigned to those listed points exhaust the total mass.
[/example]
The example also shows the main advantage of the discrete theory. Events involving $X$ can be reduced to sums over values: the event $\{X \text{ is even}\}$ has probability
\begin{align*}
\mathbb P(X \text{ is even}) = \sum_{j=1}^{\infty} \mathbb P(X=2j) = \sum_{j=1}^{\infty} \left(\frac{1}{2}\right)^{2j} = \frac{1}{3}.
\end{align*}
Instead of integrating over a continuum, we add over a countable set.
## Definition
The parent concept is a [Random Variable](/page/Random%20Variable): a measurable map from a [probability space](/page/Probability%20Space) into a measurable state space. The discrete version answers a more specific question: when is the random variable's law supported on a list of possible real values? This is the exact setting in which probability mass functions replace densities, sums replace integrals, and distributional questions become questions about weighted countable sets.
[definition: Discrete Random Variable]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A real-valued random variable $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$ is a discrete random variable if there exists a countable set $S \subseteq \mathbb R$ such that
\begin{align*}
\mathbb P(X \in S) = 1.
\end{align*}
[/definition]
The set $S$ is not part of the data of $X$; it is a countable support on which the law is concentrated. It may contain values that occur with probability $0$, so computations need a sharper object than an arbitrary countable carrier. To calculate probabilities, we need a function that records how much probability is placed at each possible value.
[definition: Probability Mass Function]
Let $X$ be a discrete random variable. The probability mass function of $X$ is the function
\begin{align*}
p_X: \mathbb R \to [0,1]
\end{align*}
defined by
\begin{align*}
p_X(x)=\mathbb P(X=x).
\end{align*}
[/definition]
Although $p_X$ is defined on all of $\mathbb R$, it is nonzero at at most countably many points. A table of values is often easier to read if the zero-probability points have been removed. The next definition names the exact set of atoms that carry positive mass.
[definition: Support of a Discrete Random Variable]
Let $X$ be a discrete random variable with probability mass function $p_X$. The support of $X$ is the set
\begin{align*}
\operatorname{supp}(X) = \{x \in \mathbb R : p_X(x) > 0\}.
\end{align*}
[/definition]
To ask probabilities of intervals, inverse images, or Borel events, a list of atoms must be interpreted as a probability measure on the real line. This is the same object attached to every real-valued random variable, but the discrete case lets us later reconstruct it from point masses. We therefore name the induced measure before proving the reconstruction formula.
[definition: Law of a Discrete Random Variable]
Let $X$ be a discrete random variable. The law of $X$ is the probability measure $\mu_X$ on $(\mathbb R, \mathcal B(\mathbb R))$ defined by
\begin{align*}
\mu_X(A) = \mathbb P(X \in A)
\end{align*}
for every $A \in \mathcal B(\mathbb R)$.
[/definition]
For discrete $X$, the law is concentrated on $\operatorname{supp}(X)$, but the law still assigns probabilities to all Borel sets. The point-mass table is useful only if it determines those probabilities without ambiguity. The following result gives exactly that reconstruction principle: on a countable state space, the probability of a set is obtained by adding the masses of the points it contains.
[quotetheorem:9348]
This theorem is the reason discrete random variables can be treated by summation. It also separates two questions that are often conflated: whether a random variable is discrete, and whether a proposed list of numbers really is a probability distribution. It is a statement about countable additivity of point masses, not about distribution functions or jump sizes.
A compact example helps fix the difference between the random variable, its support, and its mass function.
[example: A Non-Injective Discrete Random Variable]
Let $\Omega=\{1,2,3,4\}$, let $\mathcal F$ be the power set of $\Omega$, and suppose each outcome has probability $1/4$. Define $X:\Omega\to\mathbb R$ by $X(1)=0$, $X(2)=0$, $X(3)=1$, and $X(4)=3$. The range of $X$ is $\{0,1,3\}$, which is finite and therefore countable, so $X$ is a discrete random variable.
The mass at $0$ is the probability of the inverse image of $\{0\}$:
\begin{align*}
\{X=0\}=\{\omega\in\Omega:X(\omega)=0\}=\{1,2\}.
\end{align*}
Since the two outcomes $1$ and $2$ are disjoint and each has probability $1/4$,
\begin{align*}
p_X(0)=\mathbb P(X=0)=\mathbb P(\{1,2\})=\mathbb P(\{1\})+\mathbb P(\{2\})=\frac{1}{4}+\frac{1}{4}=\frac{1}{2}.
\end{align*}
Similarly,
\begin{align*}
\{X=1\}=\{3\},
\end{align*}
so
\begin{align*}
p_X(1)=\mathbb P(X=1)=\mathbb P(\{3\})=\frac{1}{4}.
\end{align*}
Also,
\begin{align*}
\{X=3\}=\{4\},
\end{align*}
so
\begin{align*}
p_X(3)=\mathbb P(X=3)=\mathbb P(\{4\})=\frac{1}{4}.
\end{align*}
If $x\notin\{0,1,3\}$, then no outcome is mapped to $x$, hence
\begin{align*}
\{X=x\}=\varnothing
\end{align*}
and therefore
\begin{align*}
p_X(x)=\mathbb P(X=x)=\mathbb P(\varnothing)=0.
\end{align*}
Thus $\operatorname{supp}(X)=\{0,1,3\}$. The value $0$ has larger mass than $1$ or $3$ because two distinct sample outcomes are merged into the same value of the random variable.
[/example]
The example warns against confusing outcomes with values. Discreteness is a property of the range of $X$ under its law, not a statement that the underlying sample space must be finite or countable.
## Mass, Atoms, and Distribution Functions
### Jumps of the Distribution Function
The mass function is local: it tells us the probability at a point. Many probability questions are cumulative: they ask about intervals, inequalities, or thresholds. For a discrete random variable, the [cumulative distribution function](/page/Cumulative%20Distribution%20Function) is still central, but it has a stepped geometry rather than a smooth one.
When we ask for the probability that $X$ is at most $x$, we are accumulating all atoms up to a threshold. This motivates the distribution function in the discrete setting.
[definition: Distribution Function of a Discrete Random Variable]
Let $X$ be a discrete random variable. Its distribution function is the function
\begin{align*}
F_X: \mathbb R \to [0,1]
\end{align*}
defined by
\begin{align*}
F_X(x)=\mathbb P(X \le x).
\end{align*}
[/definition]
The distribution function exists for every real-valued random variable, but in the discrete case it is built from jumps. Since $F_X(x)=\mathbb P(X\le x)$, the interval $(-\infty,x]$ collects exactly those atoms whose locations are at most $x$. By the point-mass reconstruction theorem above, this gives the cumulative formula
\begin{align*}
F_X(x)=\sum_{y\le x} p_X(y),
\end{align*}
where only countably many terms are nonzero.
In applications, however, the distribution function is often the object we are given first. Since a discrete distribution changes only by jumps, the missing question is whether the size of the jump at $x$ is exactly the mass assigned to $x$, and whether this recovers the original mass function without ambiguity.
[remark: Recovering Masses from Jumps]
For a discrete real-valued random variable $X$, the mass at a point $x$ is recovered from the jump of the distribution function at $x$:
\begin{align*}
p_X(x)=\mathbb P(X=x)=F_X(x)-\lim_{t\uparrow x}F_X(t).
\end{align*}
Thus the distribution function determines the mass function by recording both the locations and the sizes of its jumps.
[/remark]
Thus a discrete distribution may be read either from the mass function or from the jumps of its distribution function. The mass function is usually better for computation; the distribution function is better for comparing distributions and studying convergence.
A finite-valued example shows how the jump picture works without an infinite series obscuring the calculation.
[example: Distribution Function of a Loaded Die]
Let $X$ be the result of a die roll with probabilities $p_X(1)=1/10$, $p_X(2)=1/10$, $p_X(3)=1/5$, $p_X(4)=1/5$, $p_X(5)=1/5$, and $p_X(6)=1/5$. We compute $F_X(x)=\mathbb P(X\le x)$ by adding the masses of die values not exceeding $x$.
If $x<1$, then no possible value of $X$ is at most $x$, so
\begin{align*}
F_X(x)=\mathbb P(X\le x)=0.
\end{align*}
If $1\le x<2$, the only possible value at most $x$ is $1$, hence
\begin{align*}
F_X(x)=\mathbb P(X=1)=\frac{1}{10}.
\end{align*}
If $2\le x<3$, the possible values at most $x$ are $1$ and $2$, so
\begin{align*}
F_X(x)=\mathbb P(X=1)+\mathbb P(X=2)=\frac{1}{10}+\frac{1}{10}=\frac{1}{5}.
\end{align*}
If $3\le x<4$, the possible values at most $x$ are $1,2,3$, and therefore
\begin{align*}
F_X(x)=\frac{1}{10}+\frac{1}{10}+\frac{1}{5}=\frac{1}{5}+\frac{1}{5}=\frac{2}{5}.
\end{align*}
Continuing value by value,
\begin{align*}
F_X(x)=\frac{2}{5}+\frac{1}{5}=\frac{3}{5}
\end{align*}
for $4\le x<5$,
\begin{align*}
F_X(x)=\frac{3}{5}+\frac{1}{5}=\frac{4}{5}
\end{align*}
for $5\le x<6$, and
\begin{align*}
F_X(x)=\frac{4}{5}+\frac{1}{5}=1
\end{align*}
for $x\ge 6$. Thus the distribution function is constant between consecutive die values, and at each $k\in\{1,\dots,6\}$ its jump size is the added mass $p_X(k)$.
[/example]
The distribution function is sometimes more awkward than the mass function, but it makes the difference between atomic and non-atomic behaviour visible. A non-atomic law has no jumps in its distribution function, absolutely continuous laws are the most familiar examples of this phenomenon, and a discrete random variable is built entirely out of jumps.
### Infinite Stepped Laws
A finite support gives a distribution function with finitely many jumps. Many important discrete laws have infinitely many possible values, so the same idea must be able to handle infinitely many jumps. The question is whether the cumulative probabilities still converge to $1$ as the threshold tends to infinity.
[example: Geometric Distribution Function]
Let $X \sim \operatorname{Geom}(p)$ with $p \in (0,1]$, so
\begin{align*}
\mathbb P(X=k) = (1-p)^{k-1}p, \qquad k \in \mathbb N.
\end{align*}
If $x<1$, then no positive integer is at most $x$, hence
\begin{align*}
F_X(x)=\mathbb P(X\le x)=0.
\end{align*}
Now fix $n\in\mathbb N$ and suppose $n\le x<n+1$. Since $X$ takes values in $\mathbb N$, the event $\{X\le x\}$ is the same event as $\{X\le n\}$, so
\begin{align*}
F_X(x)=\mathbb P(X\le n)=\sum_{k=1}^{n}\mathbb P(X=k).
\end{align*}
Substituting the mass function gives
\begin{align*}
F_X(x)=\sum_{k=1}^{n}(1-p)^{k-1}p.
\end{align*}
Factoring out $p$ gives
\begin{align*}
F_X(x)=p\sum_{k=1}^{n}(1-p)^{k-1}.
\end{align*}
With the index change $j=k-1$, this becomes
\begin{align*}
F_X(x)=p\sum_{j=0}^{n-1}(1-p)^j.
\end{align*}
If $0<p<1$, the finite geometric-sum identity gives
\begin{align*}
\sum_{j=0}^{n-1}(1-p)^j=\frac{1-(1-p)^n}{1-(1-p)}.
\end{align*}
Since $1-(1-p)=p$, we get
\begin{align*}
F_X(x)=p\frac{1-(1-p)^n}{p}=1-(1-p)^n.
\end{align*}
If $p=1$, then $\mathbb P(X=1)=1$ and $\mathbb P(X=k)=0$ for $k\ge 2$, so for every $n\ge 1$ the same formula reads
\begin{align*}
F_X(x)=1=1-(1-1)^n.
\end{align*}
Thus, for $n\le x<n+1$,
\begin{align*}
F_X(x)=1-(1-p)^n.
\end{align*}
At the integer $m\in\mathbb N$, the jump size is
\begin{align*}
\left(1-(1-p)^m\right)-\left(1-(1-p)^{m-1}\right).
\end{align*}
Cancelling the constant terms gives
\begin{align*}
(1-p)^{m-1}-(1-p)^m=(1-p)^{m-1}p.
\end{align*}
So the distribution function is constant between consecutive positive integers, and its jump at $m$ is exactly the mass $\mathbb P(X=m)=(1-p)^{m-1}p$.
[/example]
The infinite support does not disturb the discrete structure. What changes is that tails become significant, and many questions reduce to controlling infinite sums.
## Expectation and Summation
### Integrals Become Sums
The main computational benefit of discreteness is that integration with respect to the law becomes summation. This is not merely a mnemonic; it is the discrete specialization of the general change-of-variables principle for expectations: functions of a random variable are integrated against the law of that random variable. The general theorem may be stated with Lebesgue integrals, densities, or Radon-Nikodym derivatives, but in the discrete case its content becomes the concrete rule that values are weighted by their point probabilities.
Before writing formulas for means and moments, we need a definition that includes the possibility of infinite support. A random variable can be discrete and still have infinite expectation if the tail masses decay too slowly.
[definition: Expectation of a Discrete Random Variable]
Let $X$ be a discrete random variable with support $S \subseteq \mathbb R$ and probability mass function $p_X$. If
\begin{align*}
\sum_{x \in S} |x|p_X(x) < \infty,
\end{align*}
then the expectation of $X$ is
\begin{align*}
\mathbb E[X] = \sum_{x \in S} x p_X(x).
\end{align*}
[/definition]
The absolute convergence condition is essential for real-valued expectations. Without it, positive and negative contributions may depend on the order in which support points are listed. The same issue appears whenever we apply a real function to $X$, so the next theorem gives the general summation rule for functions of a discrete random variable.
[quotetheorem:3536]
The name is informal, but the content is precise: to compute an expectation of $g(X)$, we do not need to describe the sample space or invert $g$ on outcomes. We only need the distribution of $X$.
[example: Mean and Variance of a Bernoulli Random Variable]
Let $X \sim \operatorname{Ber}(p)$ with $p \in [0,1]$. Its support is contained in $\{0,1\}$, with $p_X(0)=1-p$ and $p_X(1)=p$, so the expectation is the sum of each value weighted by its mass:
\begin{align*}
\mathbb E[X] = 0\cdot p_X(0)+1\cdot p_X(1)
\end{align*}
Substituting the two masses gives
\begin{align*}
\mathbb E[X] = 0\cdot(1-p)+1\cdot p
\end{align*}
Since $0\cdot(1-p)=0$ and $1\cdot p=p$,
\begin{align*}
\mathbb E[X]=p.
\end{align*}
To compute the variance, first compute the second moment. On the support $\{0,1\}$, the identity $x^2=x$ holds because $0^2=0$ and $1^2=1$. Therefore
\begin{align*}
\mathbb E[X^2] = 0^2\cdot p_X(0)+1^2\cdot p_X(1)
\end{align*}
Substituting again,
\begin{align*}
\mathbb E[X^2] = 0^2\cdot(1-p)+1^2\cdot p
\end{align*}
Since $0^2=0$ and $1^2=1$,
\begin{align*}
\mathbb E[X^2]=0\cdot(1-p)+1\cdot p=p.
\end{align*}
Using the variance identity $\operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E[X])^2$, we get
\begin{align*}
\operatorname{Var}(X)=p-p^2.
\end{align*}
Factoring out $p$ gives
\begin{align*}
p-p^2=p(1-p).
\end{align*}
Hence $\mathbb E[X]=p$ and $\operatorname{Var}(X)=p(1-p)$. The calculation works because every moment of a Bernoulli random variable is determined by its two support values, $0$ and $1$.
[/example]
The next failure example shows why discreteness alone does not guarantee finite expectation. Countable support is a structural condition, not an integrability condition.
[example: A Discrete Random Variable with Infinite Mean]
Define $X$ on $\mathbb N$ by
\begin{align*}
\mathbb P(X=k)=\frac{1}{k(k+1)}, \qquad k \in \mathbb N.
\end{align*}
First check that these numbers really form a probability mass function. For each $k\in\mathbb N$,
\begin{align*}
\frac{1}{k(k+1)}=\frac{1}{k}-\frac{1}{k+1},
\end{align*}
because
\begin{align*}
\frac{1}{k}-\frac{1}{k+1}=\frac{k+1-k}{k(k+1)}=\frac{1}{k(k+1)}.
\end{align*}
Therefore the $n$-th partial sum is
\begin{align*}
\sum_{k=1}^{n}\frac{1}{k(k+1)}=\sum_{k=1}^{n}\left(\frac{1}{k}-\frac{1}{k+1}\right).
\end{align*}
Writing out the cancelling terms gives
\begin{align*}
\sum_{k=1}^{n}\left(\frac{1}{k}-\frac{1}{k+1}\right)=\left(1-\frac{1}{2}\right)+\left(\frac{1}{2}-\frac{1}{3}\right)+\cdots+\left(\frac{1}{n}-\frac{1}{n+1}\right).
\end{align*}
All intermediate terms cancel, so
\begin{align*}
\sum_{k=1}^{n}\frac{1}{k(k+1)}=1-\frac{1}{n+1}.
\end{align*}
Since $\frac{1}{n+1}\to 0$ as $n\to\infty$,
\begin{align*}
\sum_{k=1}^{\infty}\mathbb P(X=k)=\lim_{n\to\infty}\left(1-\frac{1}{n+1}\right)=1.
\end{align*}
Now compute the expectation series. For each $k\in\mathbb N$,
\begin{align*}
k\mathbb P(X=k)=k\cdot\frac{1}{k(k+1)}=\frac{1}{k+1}.
\end{align*}
Hence
\begin{align*}
\sum_{k=1}^{n}k\mathbb P(X=k)=\sum_{k=1}^{n}\frac{1}{k+1}.
\end{align*}
The right-hand side is a shifted harmonic partial sum:
\begin{align*}
\sum_{k=1}^{n}\frac{1}{k+1}=\frac{1}{2}+\frac{1}{3}+\cdots+\frac{1}{n+1}.
\end{align*}
For each $m\ge 1$, the block from $k=2^m-1$ to $k=2^{m+1}-2$ contains $2^m$ terms, and each corresponding denominator $k+1$ is at most $2^{m+1}-1$, hence each term is at least $1/2^{m+1}$. Therefore that block contributes at least
\begin{align*}
2^m\cdot\frac{1}{2^{m+1}}=\frac{1}{2}.
\end{align*}
Since there are arbitrarily many such blocks, the partial sums of $\sum_{k=1}^{\infty}\frac{1}{k+1}$ are unbounded. Thus
\begin{align*}
\sum_{k=1}^{\infty}k\mathbb P(X=k)=\infty.
\end{align*}
So $X$ is a discrete random variable whose total mass is $1$, but its expectation is not finite because its tail is too heavy.
[/example]
### Moments and Tails
The previous example shows that moments are additional information about the tail of a discrete distribution. To compare tails and quantify spread, we need a family of expectations that weight large values more heavily.
[definition: Moment of a Discrete Random Variable]
Let $X$ be a discrete random variable and let $r \in \mathbb N$. If $\mathbb E[|X|^r] < \infty$, then the $r$-th moment of $X$ is $\mathbb E[X^r]$.
[/definition]
For nonnegative variables, a common problem is that tail probabilities are easier to estimate than exact densities or point masses. In the continuous case, this relationship is expressed by integrating the tail function. The following theorem records that integral form, which is the continuous analogue of the discrete tail-sum identity.
[quotetheorem:1136]
This formula converts the expectation of a nonnegative continuous random variable into an integral of tail probabilities. It is useful as a conceptual bridge: the discrete version replaces the integral by a sum, while the same principle remains that the size of a nonnegative variable is controlled by how slowly its tail probabilities decay.
## Standard Discrete Laws
### Bernoulli and Binomial Laws
The theory becomes usable through a small collection of standard models. Each model packages a recurring random mechanism: one trial, repeated independent trials, waiting for success, or rare events in many trials.
The simplest model records whether an event occurs. It is the atomic building block for binomial counts and indicator decompositions.
[definition: Bernoulli Distribution]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $X: (\Omega,\mathcal F) \to (\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. For $p \in [0,1]$, $X$ has the Bernoulli distribution with parameter $p$, written $X \sim \operatorname{Ber}(p)$, if
\begin{align*}
\mathbb P(X=1)=p \quad \text{and} \quad \mathbb P(X=0)=1-p.
\end{align*}
[/definition]
A Bernoulli random variable is the same thing as the indicator of an event with probability $p$. To move freely between event language and random-variable language, we need the precise identification between indicators and Bernoulli laws.
[quotetheorem:9475]
Counts of successes in repeated trials lead to the binomial distribution. The finite support is important: there are only $n+1$ possible numbers of successes.
[definition: Binomial Distribution]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $X: (\Omega,\mathcal F) \to (\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. For $n \in \mathbb N$ and $p \in [0,1]$, $X$ has the binomial distribution with parameters $n$ and $p$, written $X \sim \operatorname{Bin}(n,p)$, if
\begin{align*}
\mathbb P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}, \qquad k \in \{0,1,\dots,n\}.
\end{align*}
[/definition]
The binomial formula comes from counting which $k$ of the $n$ trials succeed and multiplying independent success and failure probabilities.
[example: Two Ways to Compute a Binomial Probability]
Let $X \sim \operatorname{Bin}(5,1/3)$. By the binomial mass formula, the probability of exactly two successes is
\begin{align*}
\mathbb P(X=2)=\binom{5}{2}\left(\frac{1}{3}\right)^2\left(1-\frac{1}{3}\right)^{5-2}.
\end{align*}
Since $1-\frac{1}{3}=\frac{2}{3}$ and $5-2=3$, this becomes
\begin{align*}
\mathbb P(X=2)=\binom{5}{2}\left(\frac{1}{3}\right)^2\left(\frac{2}{3}\right)^3.
\end{align*}
The [binomial coefficient](/page/Binomial%20Coefficient) is
\begin{align*}
\binom{5}{2}=\frac{5!}{2!3!}=\frac{5\cdot 4}{2\cdot 1}=10.
\end{align*}
Also,
\begin{align*}
\left(\frac{1}{3}\right)^2=\frac{1}{9}
\end{align*}
and
\begin{align*}
\left(\frac{2}{3}\right)^3=\frac{8}{27}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(X=2)=10\cdot\frac{1}{9}\cdot\frac{8}{27}=\frac{80}{243}.
\end{align*}
For the probability of at least one success, the complement of $\{X\ge 1\}$ is $\{X=0\}$, because $X$ can only take values in $\{0,1,2,3,4,5\}$. Hence
\begin{align*}
\mathbb P(X\ge 1)=1-\mathbb P(X=0).
\end{align*}
Using the binomial mass formula again,
\begin{align*}
\mathbb P(X=0)=\binom{5}{0}\left(\frac{1}{3}\right)^0\left(\frac{2}{3}\right)^5.
\end{align*}
Since $\binom{5}{0}=1$ and $\left(\frac{1}{3}\right)^0=1$,
\begin{align*}
\mathbb P(X=0)=\left(\frac{2}{3}\right)^5=\frac{32}{243}.
\end{align*}
Thus
\begin{align*}
\mathbb P(X\ge 1)=1-\frac{32}{243}=\frac{243}{243}-\frac{32}{243}=\frac{211}{243}.
\end{align*}
The first computation isolates one support value, while the second avoids summing the five probabilities for $X=1,2,3,4,5$ by subtracting the single no-success case.
[/example]
### Waiting Times and Rare Counts
Not every discrete model is a finite count. Some of the most important ones describe waiting times or counts over an unbounded range. These laws force us to combine combinatorial reasoning with convergence of infinite series.
Waiting for the first success gives infinite support with a memoryless tail. The natural value of the random variable is the trial on which success first appears, so the support starts at $1$.
[definition: Geometric Distribution]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $X: (\Omega,\mathcal F) \to (\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. For $p \in (0,1]$, $X$ has the geometric distribution with parameter $p$, written $X \sim \operatorname{Geom}(p)$, if
\begin{align*}
\mathbb P(X=k)=(1-p)^{k-1}p, \qquad k \in \mathbb N.
\end{align*}
[/definition]
This convention counts the trial on which the first success occurs. Some books use a version supported on $\{0,1,2,\dots\}$ that counts failures before the first success; the convention must be checked before using formulas.
Rare-event counts lead to the Poisson distribution. It has infinite support, but only one parameter, and it appears when many possible opportunities each have small success probability.
[definition: Poisson Distribution]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $X: (\Omega,\mathcal F) \to (\mathbb R,\mathcal B(\mathbb R))$ be a real-valued random variable. For $\lambda>0$, $X$ has the Poisson distribution with parameter $\lambda$, written $X \sim \operatorname{Poi}(\lambda)$, if
\begin{align*}
\mathbb P(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}, \qquad k \in \{0,1,2,\dots\}.
\end{align*}
[/definition]
The normalising identity is the exponential series. To justify the Poisson law as a model rather than a formula, we want to see how it arises from a familiar finite experiment. A binomial count with many trials usually depends on two parameters, $n$ and $p$; the sparse regime asks what remains when $n$ grows, $p$ shrinks, and the expected count $np$ stays near a fixed value $\lambda$.
[quotetheorem:9476]
This theorem explains why Poisson laws arise in sparse counting problems: many possible opportunities, each unlikely, with the expected count held near $\lambda$.
## Transform Methods and Generating Functions
### Probability Generating Functions
A probability mass function is a sequence when the support is contained in $\{0,1,2,\dots\}$. Generating functions turn that sequence into a [power series](/page/Power%20Series). This packages convolution, moments, and distributional identities into algebra.
For nonnegative integer-valued random variables, the natural transform records the probability masses as coefficients.
[definition: Probability Generating Function]
Let $X$ be a discrete random variable taking values in $\{0,1,2,\dots\}$. Let
\begin{align*}
D_X = \left\{s \in \mathbb R : \sum_{k=0}^{\infty}\mathbb P(X=k)|s|^k < \infty\right\}.
\end{align*}
The probability [generating function](/page/Generating%20Function) of $X$ is the function
\begin{align*}
G_X: D_X \to \mathbb R
\end{align*}
defined by
\begin{align*}
G_X(s)=\mathbb E[s^X]=\sum_{k=0}^{\infty}\mathbb P(X=k)s^k.
\end{align*}
[/definition]
The probability generating function is especially effective because coefficients recover probabilities. It becomes more than storage when independent sums are involved: convolution of mass functions is cumbersome, while multiplication of power series is algebraic. The next theorem is the main reason generating functions are used for sums of counts.
[quotetheorem:1130]
This theorem is the transform version of the convolution formula for sums. Instead of summing over all decompositions $k=i+j$, we multiply two series and read off coefficients.
[example: Binomial Generating Function]
Let $X \sim \operatorname{Bin}(n,p)$. Its probability generating function is obtained by substituting the binomial masses into the definition:
\begin{align*}
G_X(s)=\sum_{k=0}^{n}\mathbb P(X=k)s^k.
\end{align*}
Since $\mathbb P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}$ for $k\in\{0,1,\dots,n\}$, this gives
\begin{align*}
G_X(s)=\sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}s^k.
\end{align*}
For each term, $p^k s^k=(ps)^k$, so
\begin{align*}
G_X(s)=\sum_{k=0}^{n}\binom{n}{k}(ps)^k(1-p)^{n-k}.
\end{align*}
Applying the finite binomial expansion with $a=ps$ and $b=1-p$,
\begin{align*}
\sum_{k=0}^{n}\binom{n}{k}(ps)^k(1-p)^{n-k}=(ps+1-p)^n.
\end{align*}
Reordering the two summands inside the parentheses,
\begin{align*}
G_X(s)=(1-p+ps)^n.
\end{align*}
For comparison, a Bernoulli random variable $B\sim\operatorname{Ber}(p)$ has
\begin{align*}
G_B(s)=\mathbb P(B=0)s^0+\mathbb P(B=1)s^1.
\end{align*}
Substituting $\mathbb P(B=0)=1-p$, $\mathbb P(B=1)=p$, $s^0=1$, and $s^1=s$ gives
\begin{align*}
G_B(s)=(1-p)\cdot 1+p\cdot s=1-p+ps.
\end{align*}
Thus the formula $G_X(s)=(1-p+ps)^n$ matches the interpretation of a binomial random variable as the sum of $n$ independent Bernoulli trials: the single-trial generating function is raised to the $n$-th power.
[/example]
### Moments from Transforms
Generating functions also encode moments, but they do so through derivatives at $1$ rather than coefficients at $0$. This matters because moments describe the size of a random count, while coefficients describe the probabilities of exact values. Differentiating the series inserts the factor $k$ that appears in the expectation.
[quotetheorem:9477]
Probability generating functions are specialised for nonnegative integer-valued variables. For real-valued discrete random variables, exponential transforms are often more flexible because they do not require integer support. The price is integrability: $M_X(0)=1$ always, but finiteness for $t\ne 0$, or even on a whole interval around $0$, is an extra tail condition rather than a consequence of discreteness. This leads to the moment generating function.
[definition: Moment Generating Function of a Discrete Random Variable]
Let $X$ be a discrete random variable with support $S \subseteq \mathbb R$. Let
\begin{align*}
D_X = \left\{t \in \mathbb R : \sum_{x \in S} e^{tx}p_X(x) < \infty\right\}.
\end{align*}
The moment generating function of $X$ is the function
\begin{align*}
M_X: D_X \to \mathbb R
\end{align*}
defined by
\begin{align*}
M_X(t)=\mathbb E[e^{tX}]=\sum_{x \in S} e^{tx}p_X(x).
\end{align*}
[/definition]
Moment generating functions can determine distributions when they exist in a neighbourhood of $0$, but they may fail to exist away from $0$ for heavy-tailed discrete laws. Characteristic functions avoid this integrability problem because $|e^{iuX}|=1$.
## Functions, Conditioning, and Independence
### Functions of Discrete Variables
Discrete random variables are stable under many natural operations, but the support can change. A function of a discrete random variable is again discrete because the image of a countable set is countable.
This stability matters because random variables are rarely studied alone. We square them, threshold them, group their values, and use them to build estimators.
[quotetheorem:9478]
The mass function of $g(X)$ is obtained by collecting the masses of all points of $X$ that map to the same value. This is where non-injective transformations require care.
[example: Squaring Merges Atoms]
Let $X$ have support $\{-2,-1,1,2\}$ with $\mathbb P(X=-2)=1/8$, $\mathbb P(X=-1)=3/8$, $\mathbb P(X=1)=1/4$, and $\mathbb P(X=2)=1/4$. Let $Y=X^2$. Since $(-1)^2=1$ and $1^2=1$, the event $\{Y=1\}$ is exactly the event that $X$ is either $-1$ or $1$:
\begin{align*}
\{Y=1\}=\{X=-1\}\cup\{X=1\}.
\end{align*}
The two events $\{X=-1\}$ and $\{X=1\}$ are disjoint, so finite additivity gives
\begin{align*}
\mathbb P(Y=1)=\mathbb P(X=-1)+\mathbb P(X=1).
\end{align*}
Substituting the given masses,
\begin{align*}
\mathbb P(Y=1)=\frac{3}{8}+\frac{1}{4}.
\end{align*}
Since $\frac{1}{4}=\frac{2}{8}$,
\begin{align*}
\mathbb P(Y=1)=\frac{3}{8}+\frac{2}{8}=\frac{5}{8}.
\end{align*}
Similarly, since $(-2)^2=4$ and $2^2=4$,
\begin{align*}
\{Y=4\}=\{X=-2\}\cup\{X=2\}.
\end{align*}
These two events are disjoint, hence
\begin{align*}
\mathbb P(Y=4)=\mathbb P(X=-2)+\mathbb P(X=2).
\end{align*}
Substituting the given masses gives
\begin{align*}
\mathbb P(Y=4)=\frac{1}{8}+\frac{1}{4}.
\end{align*}
Since $\frac{1}{4}=\frac{2}{8}$,
\begin{align*}
\mathbb P(Y=4)=\frac{1}{8}+\frac{2}{8}=\frac{3}{8}.
\end{align*}
No other value can occur, because every possible value of $X$ is one of $-2,-1,1,2$, and their squares are only $1$ and $4$. Thus $Y$ has support $\{1,4\}$, with masses $5/8$ and $3/8$. The transformation $x\mapsto x^2$ merges the atoms $-1$ and $1$ into the same value, and also merges $-2$ and $2$ into the same value.
[/example]
### Conditional Mass and Joint Mass
Conditional probabilities become particularly concrete in the discrete setting. Conditioning on $X=x$ is meaningful when $x$ is an atom of positive probability. To compute such conditional probabilities systematically, we need a conditional version of the mass function.
[definition: Conditional Mass Function]
Let $X$ and $Y$ be discrete random variables, and let $y \in \operatorname{supp}(Y)$. The conditional mass function of $X$ given $Y=y$ is the function
\begin{align*}
p_{X\mid Y}(\cdot \mid y): \mathbb R \to [0,1]
\end{align*}
defined by
\begin{align*}
p_{X\mid Y}(x \mid y)=\mathbb P(X=x \mid Y=y)=\frac{\mathbb P(X=x, Y=y)}{\mathbb P(Y=y)}.
\end{align*}
[/definition]
This definition is local to atoms with positive probability. To compute the numerator in the conditional mass function, we need a two-variable table that records simultaneous values of $X$ and $Y$. That table is the joint probability mass function.
[definition: Joint Probability Mass Function]
Let $X$ and $Y$ be discrete random variables. The joint probability mass function of $(X,Y)$ is the function
\begin{align*}
p_{X,Y}: \mathbb R^2 \to [0,1]
\end{align*}
defined by
\begin{align*}
p_{X,Y}(x,y)=\mathbb P(X=x, Y=y).
\end{align*}
[/definition]
To decide whether two discrete variables are independent, it is not enough to inspect their marginal mass functions separately. Dependence is hidden in how the two coordinates are paired, so the joint table must be compared entry by entry with what the marginals would predict if no interaction were present. The obstruction is any cell whose joint mass fails to factor as the product of the corresponding marginal masses.
[quotetheorem:4862]
The criterion is useful because it reduces independence to checking a factorisation over the support. For finite supports, this is a matrix rank-one condition on the probability table.
[example: Pairwise Table that is Not Independent]
Let $X$ and $Y$ take values in $\{0,1\}$ with joint masses
\begin{align*}
p_{X,Y}(0,0)=\frac{1}{2},\quad p_{X,Y}(0,1)=0,\quad p_{X,Y}(1,0)=0,\quad p_{X,Y}(1,1)=\frac{1}{2}.
\end{align*}
The marginal mass of $X$ at $0$ is obtained by summing over the possible values of $Y$:
\begin{align*}
p_X(0)=p_{X,Y}(0,0)+p_{X,Y}(0,1)=\frac{1}{2}+0=\frac{1}{2}.
\end{align*}
Similarly,
\begin{align*}
p_X(1)=p_{X,Y}(1,0)+p_{X,Y}(1,1)=0+\frac{1}{2}=\frac{1}{2}.
\end{align*}
For $Y$, summing over the possible values of $X$ gives
\begin{align*}
p_Y(0)=p_{X,Y}(0,0)+p_{X,Y}(1,0)=\frac{1}{2}+0=\frac{1}{2}.
\end{align*}
Also,
\begin{align*}
p_Y(1)=p_{X,Y}(0,1)+p_{X,Y}(1,1)=0+\frac{1}{2}=\frac{1}{2}.
\end{align*}
If $X$ and $Y$ were independent, the joint mass at every pair would equal the product of the two marginal masses. At the pair $(0,0)$, the product of marginals is
\begin{align*}
p_X(0)p_Y(0)=\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}.
\end{align*}
But the given joint mass is
\begin{align*}
p_{X,Y}(0,0)=\frac{1}{2}.
\end{align*}
Since
\begin{align*}
\frac{1}{2}\ne\frac{1}{4},
\end{align*}
the required factorisation fails. Thus $X$ and $Y$ have balanced marginal distributions, but they are not independent because the joint table places all mass on the diagonal pairs $(0,0)$ and $(1,1)$.
[/example]
Marginals alone do not determine dependence. The joint mass function contains the missing information.
## Approximation, Limits, and Boundaries of Discreteness
### Finite Versus Countable Support
Discrete random variables sit between finite combinatorics and general measure theory. They are simple enough for sums, but rich enough to approximate continuous laws, converge to continuous limits, and exhibit infinite-tail phenomena.
A first boundary is finite versus countably infinite support. Every finite-valued random variable is discrete, but many important discrete random variables are not finite-valued.
[definition: Finite-Valued Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A real-valued random variable $X$ is finite-valued if there exists a finite set $S \subseteq \mathbb R$ such that
\begin{align*}
\mathbb P(X\in S)=1.
\end{align*}
[/definition]
Finite-valued random variables are the easiest discrete random variables: every expectation is a finite sum when the function is finite on the support. Since every finite set is countable, this class must sit inside the class of discrete random variables. The next theorem records that inclusion explicitly so that finite models can be treated as discrete models without further comment.
[quotetheorem:9479]
The converse fails because countable sets may be infinite. The geometric distribution is the standard example, but there are many others.
[example: Discrete but Not Finite-Valued]
Let $X \sim \operatorname{Poi}(\lambda)$ with $\lambda>0$. By the definition of the Poisson distribution, $X$ takes values in $\{0,1,2,\dots\}$ and
\begin{align*}
\mathbb P(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}
\end{align*}
for each $k\in\{0,1,2,\dots\}$.
For such $k$, we have $e^{-\lambda}>0$, $k!>0$, and $\lambda^k>0$ because $\lambda>0$ and $\lambda^0=1$. Therefore
\begin{align*}
e^{-\lambda}\frac{\lambda^k}{k!}>0
\end{align*}
and hence
\begin{align*}
\mathbb P(X=k)>0.
\end{align*}
It follows that every nonnegative integer belongs to $\operatorname{supp}(X)$, so
\begin{align*}
\{0,1,2,\dots\}\subseteq \operatorname{supp}(X).
\end{align*}
Since the Poisson law has no mass outside $\{0,1,2,\dots\}$, the support is exactly
\begin{align*}
\operatorname{supp}(X)=\{0,1,2,\dots\}.
\end{align*}
This set is countable, so $X$ is discrete. It is not finite-valued: if $A\subseteq\mathbb R$ is finite, then $A$ can contain only finitely many nonnegative integers, so choose $m\in\{0,1,2,\dots\}$ with $m\notin A$. Since $\{X=m\}\subseteq\{X\notin A\}$ and $\mathbb P(X=m)>0$, we have $\mathbb P(X\notin A)>0$, and therefore
\begin{align*}
\mathbb P(X\in A)=1-\mathbb P(X\notin A)<1.
\end{align*}
Thus no finite set carries all the probability mass of $X$, even though the countable set $\{0,1,2,\dots\}$ does.
[/example]
### Discrete Versus Continuous Laws
A second boundary is discrete versus continuous. The defining issue is not whether values are numerical, but whether the law is concentrated on a countable set. More generally, a law may have both atoms and non-atomic parts: a random variable can be equal to $0$ with probability $1/2$ and otherwise be uniformly distributed on $(0,1)$. Such a mixed law is not discrete, because no countable set carries all its probability, but it is also not purely described by a density on the whole line. To see what fails outside the discrete setting, it is useful to name the standard continuous contrast case.
[definition: Continuous Uniform Random Variable]
A real-valued random variable $U$ has the continuous uniform distribution on $(0,1)$, written $U \sim \operatorname{Unif}(0,1)$, if for every Borel set $A \in \mathcal B(\mathbb R)$,
\begin{align*}
\mathbb P(U\in A)=\int_{A\cap(0,1)} 1\,d\mathcal L^1(x).
\end{align*}
[/definition]
This definition provides a contrast case rather than a new discrete model. Its mass at every individual point is zero, even though the interval as a whole has probability one. If one tried to treat $U$ as discrete, one would need a countable set of points carrying all the probability, but countably many zero point masses cannot add up to the probability of the whole interval.
[quotetheorem:9480]
The reason is countable additivity: any countable set has [Lebesgue measure](/page/Lebesgue%20Measure) zero, so no countable set can carry probability one under the continuous uniform law.
Discrete laws can nevertheless converge to continuous laws. This is not a contradiction; discreteness is not preserved under convergence in distribution.
[example: Discrete Uniform Approximation to a Continuous Uniform Law]
For each $n\in\mathbb N$, let $X_n$ be uniformly distributed on the grid
\begin{align*}
\left\{\frac{1}{n},\frac{2}{n},\dots,\frac{n}{n}\right\}.
\end{align*}
This grid has $n$ points, and uniformity means that for each $k\in\{1,2,\dots,n\}$,
\begin{align*}
\mathbb P\left(X_n=\frac{k}{n}\right)=\frac{1}{n}.
\end{align*}
Since the grid is finite, each $X_n$ is a discrete random variable.
Fix $x\in[0,1]$. The event $\{X_n\le x\}$ consists exactly of those grid points $\frac{k}{n}$ such that $1\le k\le n$ and $\frac{k}{n}\le x$. Because $n>0$, the inequality $\frac{k}{n}\le x$ is equivalent to $k\le nx$. Hence the allowed indices are
\begin{align*}
k=1,2,\dots,\lfloor nx\rfloor.
\end{align*}
There are $\lfloor nx\rfloor$ such indices, so by finite additivity over disjoint atoms,
\begin{align*}
\mathbb P(X_n\le x)=\sum_{k=1}^{\lfloor nx\rfloor}\mathbb P\left(X_n=\frac{k}{n}\right).
\end{align*}
Substituting the uniform masses gives
\begin{align*}
\mathbb P(X_n\le x)=\sum_{k=1}^{\lfloor nx\rfloor}\frac{1}{n}.
\end{align*}
A sum of $\lfloor nx\rfloor$ identical terms equal to $\frac{1}{n}$ is
\begin{align*}
\sum_{k=1}^{\lfloor nx\rfloor}\frac{1}{n}=\frac{\lfloor nx\rfloor}{n}.
\end{align*}
Thus
\begin{align*}
\mathbb P(X_n\le x)=\frac{\lfloor nx\rfloor}{n}.
\end{align*}
The floor inequalities give
\begin{align*}
\lfloor nx\rfloor\le nx<\lfloor nx\rfloor+1.
\end{align*}
Dividing by $n$ gives
\begin{align*}
\frac{\lfloor nx\rfloor}{n}\le x<\frac{\lfloor nx\rfloor}{n}+\frac{1}{n}.
\end{align*}
Therefore
\begin{align*}
0\le x-\frac{\lfloor nx\rfloor}{n}<\frac{1}{n}.
\end{align*}
Since $\frac{1}{n}\to 0$, it follows that
\begin{align*}
\frac{\lfloor nx\rfloor}{n}\to x.
\end{align*}
So the distribution functions of the discrete variables $X_n$ converge pointwise on $[0,1]$ to the distribution function of $\operatorname{Unif}(0,1)$ there. This shows that discrete laws can converge to a continuous limiting law, even though each approximating law is supported on only finitely many points.
[/example]
This approximation is a useful mental model for simulation and numerical probability. A finite grid can approximate a continuum, but the limiting object may leave the discrete category.
## Beyond and Connected Topics
Discrete random variables are the entry point to a large part of probability. The immediate next layer is the general theory of [Random Variable](/page/Random%20Variable), where discreteness is replaced by measurability and laws may have atoms, densities, singular parts, or mixtures of these.
The measure-theoretic viewpoint in [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure) explains why the formula $\mu_X=\mathbb P\circ X^{-1}$ is the right definition of the law, and why countable additivity is the principle behind summing masses. That perspective is necessary once random variables are no longer countably supported.
For elementary computation, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability) develops the standard finite and countable models: Bernoulli, binomial, geometric, Poisson, [conditional probability](/page/Conditional%20Probability), and independence. Discrete random variables are the natural language of those examples.
In statistics, [Cambridge II Principles of Statistics](/page/Cambridge%20II%20Principles%20of%20Statistics) uses discrete distributions as sampling models and likelihoods. A mass function becomes a likelihood function once the observed value is fixed and the parameter is allowed to vary.
In advanced probability, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability) places discrete variables inside broader themes such as convergence in distribution, [conditional expectation](/page/Conditional%20Expectation), martingales, and stochastic processes. Discrete-time processes such as Markov chains are built from sequences of discrete random variables, but their long-term behaviour often requires measure-theoretic tools.
## References
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge II Principles of Statistics](/page/Cambridge%20II%20Principles%20of%20Statistics).
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
Grimmett and Stirzaker, *Probability and Random Processes* (2001).
Williams, *Probability with Martingales* (1991).
Durrett, *Probability: Theory and Examples* (2019).
Discrete Random Variable
Also known as: Discrete RV, Discrete random variable, Countable-valued random variable, Atomic random variable, Probability mass function variable, PMF random variable