Probability Mass Function

Also known as: PMF, Probability mass, Mass function, Discrete probability distribution, Discrete mass function

Counting outcomes is the first kind of probability most people meet, but it is also where a subtle question appears: what exactly is the distribution of a [random variable](/page/Random%20Variable) if the possible values are isolated points rather than intervals? For a continuous random variable, asking for $\mathbb P(X=x)$ usually loses all information because each single point has probability $0$. For a [discrete random variable](/page/Discrete%20Random%20Variable), the single points carry the whole story. The probability mass function is the device that records that story without returning to the underlying sample space each time.

A die roll already shows the issue. If $X$ is the number shown by a fair six-sided die, the events $\{X=1\}, \ldots, \{X=6\}$ are the natural building blocks. If we know their probabilities, then every event described in terms of $X$ can be reconstructed by summing the appropriate entries. The mass function is therefore not a decorative table of values; it is the compressed representation of the law of a discrete random variable.

[example: A Fair Die as a Distribution]
Let $X$ be the result of rolling a fair six-sided die, so $X$ takes values in $\{1,2,3,4,5,6\}$ and each face has probability $1/6$. We compute the probability that the die shows an even number. Since \begin{align*} \{X\in\{2,4,6\}\}=\{X=2\}\cup\{X=4\}\cup\{X=6\}, \end{align*} and the three events on the right are pairwise disjoint, finite additivity gives \begin{align*} \mathbb P(X\in\{2,4,6\})=\mathbb P(X=2)+\mathbb P(X=4)+\mathbb P(X=6). \end{align*} For a fair die, \begin{align*} \mathbb P(X=2)+\mathbb P(X=4)+\mathbb P(X=6)=\frac{1}{6}+\frac{1}{6}+\frac{1}{6}. \end{align*} Adding the three equal fractions, \begin{align*} \frac{1}{6}+\frac{1}{6}+\frac{1}{6}=\frac{3}{6}. \end{align*} Reducing the fraction gives \begin{align*} \frac{3}{6}=\frac{1}{2}. \end{align*} Therefore \begin{align*} \mathbb P(X\in\{2,4,6\})=\frac{1}{2}. \end{align*} The probability mass function packages exactly the six point probabilities $p_X(1),\ldots,p_X(6)$, and event probabilities are recovered by summing the entries corresponding to the event.
[/example]

That example also hints at what can go wrong if we use the wrong language. A histogram bar has width and area, while a mass at a point has no width. Treating a mass function as if it were a density confuses sums with integrals and point probabilities with interval probabilities. The rest of this chapter builds the discrete theory from this distinction.

## Discrete Values

Before defining the mass function, we need to isolate the kind of random variable for which point probabilities are allowed to carry the distribution. The essential feature is not that the values are numbers like $1,2,3$; it is that the possible values can be listed, so that probability can be assembled by countable summation.

[definition: Discrete Random Variable]
Let $(\Omega, \mathcal F, \mathbb P)$ be a [probability space](/page/Probability%20Space) and let $(E, \mathcal E)$ be a measurable space. A random variable $X: (\Omega, \mathcal F) \to (E, \mathcal E)$ is discrete if there exists a [countable set](/page/Countable%20Set) $S \subset E$ such that \begin{align*} \mathbb P(X \in S) &= 1. \end{align*}
[/definition]

## Definition

The set $S$ need not be unique: adding points with probability $0$ does not change anything. This creates the next problem: among all points in the codomain, we need a function that records how much probability is attached to each point.

[definition: Probability Mass Function]
Let $X: (\Omega, \mathcal F, \mathbb P) \to (E, \mathcal E)$ be a discrete random variable, and assume that every singleton $\{x\}$ with $x\in E$ belongs to $\mathcal E$. The probability mass function of $X$ is the function $p_X: E \to [0,1]$ defined by \begin{align*} p_X(x) &= \mathbb P(X\in\{x\}) \end{align*} for $x \in E$.
[/definition]

The notation $p_X(x)$ should be read as mass at the point $x$, not as height of a curve whose area is probability. Many points of the codomain may carry no mass at all; the useful points are those where this function is positive.

[example: A Countably Infinite Support]
Let $X \sim \operatorname{Geom}(p)$ with $p \in (0,1]$, using the convention that $X$ is the trial number of the first success. Its support is $\mathbb N=\{1,2,3,\ldots\}$, and its mass at $k$ is \begin{align*} p_X(k)=(1-p)^{k-1}p, \qquad k\in\mathbb N. \end{align*} To check that these masses sum to $1$, first substitute the formula: \begin{align*} \sum_{k=1}^{\infty}p_X(k)=\sum_{k=1}^{\infty}(1-p)^{k-1}p. \end{align*} Since $p$ is constant in the summation, factoring it out and reindexing with $j=k-1$ gives \begin{align*} \sum_{k=1}^{\infty}(1-p)^{k-1}p=p\sum_{j=0}^{\infty}(1-p)^j. \end{align*} Because $p\in(0,1]$, we have $0\leq 1-p<1$, so the geometric series formula gives \begin{align*} \sum_{j=0}^{\infty}(1-p)^j=\frac{1}{1-(1-p)}. \end{align*} The denominator is \begin{align*} 1-(1-p)=p. \end{align*} Therefore \begin{align*} p\sum_{j=0}^{\infty}(1-p)^j=p\cdot \frac{1}{p}. \end{align*} Since $p>0$, \begin{align*} p\cdot \frac{1}{p}=1. \end{align*} Thus \begin{align*} \sum_{k=1}^{\infty}p_X(k)=1. \end{align*} This example shows why countably infinite supports are still discrete: probabilities are recovered by series over points rather than by integrals over intervals.
[/example]

## Mass Functions and Laws

### Support and Discrete Laws

Since a mass function may be defined on a large codomain while assigning probability only to a small list of points, we need a name for the part that actually carries mass. This is the set that appears in sums, tables, and computations.

[definition: Support of a Probability Mass Function]
Let $p_X: E \to [0,1]$ be the probability mass function of a discrete random variable $X$. The support of $p_X$ is \begin{align*} \operatorname{supp}(p_X) &= \{x \in E : p_X(x)>0\}. \end{align*}
[/definition]

For a discrete random variable, the support is countable. It may be finite, as for a die, or countably infinite, as for a geometric or Poisson random variable. To separate the distribution itself from any particular random variable that realizes it, we also need the measure-level version of discreteness.

[definition: Discrete Probability Distribution]
Let $(E, \mathcal E)$ be a measurable space. A function $\mu:\mathcal E\to[0,1]$ is a discrete probability distribution on $(E,\mathcal E)$ if $\mu$ is a probability measure and there is a countable set $S \subset E$ such that \begin{align*} \mu(S) &= 1. \end{align*}
[/definition]

If $X$ has distribution $\mu_X=\mathbb P\circ X^{-1}$, then $p_X(x)=\mu_X(\{x\})$. Thus the mass function can be viewed either as a property of the random variable $X$ or as a coordinate description of its law.

### Recovering Event Probabilities

The point of a probability mass function is that it remembers exactly the distribution of a discrete random variable. The underlying sample space may be complicated, but after applying $X$, every event of the form $\{X \in A\}$ should be determined by the masses of the points in $A$. We need a theorem that turns that intuition into a countable-additivity statement.

[quotetheorem:9348]

This theorem is the reason a probability mass function deserves to be called a distributional object. Once the masses are known, there is no need to inspect the original experiment again when computing events determined by $X$.

The reverse modelling question is now unavoidable: if we invent a list of non-negative masses, the first obstruction is normalisation. The total mass must be exactly $1$, because otherwise the proposed weights cannot be the values of a probability measure on all of $E$. In practice, this is how named distributions are introduced: the formula comes first, and the experiment may be supplied later as interpretation.
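
The two ideas above, recovering event probabilities by summing masses and checking that a proposed list of masses is normalised, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mass function is stored as a dictionary with finite support, and the helper names `event_probability` and `is_normalised` are hypothetical, not part of any library.

```python
from fractions import Fraction

def event_probability(pmf, event):
    """Sum the masses of the support points that lie in `event`."""
    return sum((mass for x, mass in pmf.items() if x in event), Fraction(0))

def is_normalised(pmf):
    """Check non-negativity and total mass exactly 1 (finite support only)."""
    return all(mass >= 0 for mass in pmf.values()) and sum(pmf.values()) == 1

# Fair six-sided die: each face carries mass 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(event_probability(die, {2, 4, 6}))  # 1/2
print(is_normalised(die))                 # True
```

Exact fractions are used so that the normalisation test is an equality rather than a floating-point tolerance; for infinite supports a check like this would have to be replaced by an analytic argument.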
The next example isolates the normalisation check, including a case where a plausible-looking formula fails to define a probability mass function.

[example: Normalisation and a Failed Mass Function]
Define $q:\mathbb N\to[0,1]$ by \begin{align*} q(k)=\frac{1}{k(k+1)}. \end{align*} For each $k\in\mathbb N$, both $k$ and $k+1$ are positive, so $q(k)\geq 0$. To compute the total mass, first decompose each summand: \begin{align*} \frac{1}{k}-\frac{1}{k+1}=\frac{k+1-k}{k(k+1)}=\frac{1}{k(k+1)}. \end{align*} Therefore, for $n\in\mathbb N$, \begin{align*} \sum_{k=1}^{n}\frac{1}{k(k+1)}=\sum_{k=1}^{n}\left(\frac{1}{k}-\frac{1}{k+1}\right). \end{align*} Writing out the cancellation gives \begin{align*} \sum_{k=1}^{n}\left(\frac{1}{k}-\frac{1}{k+1}\right)=\left(1-\frac{1}{2}\right)+\left(\frac{1}{2}-\frac{1}{3}\right)+\cdots+\left(\frac{1}{n}-\frac{1}{n+1}\right). \end{align*} All intermediate terms cancel, leaving \begin{align*} \left(1-\frac{1}{2}\right)+\left(\frac{1}{2}-\frac{1}{3}\right)+\cdots+\left(\frac{1}{n}-\frac{1}{n+1}\right)=1-\frac{1}{n+1}. \end{align*} Since $\frac{1}{n+1}\to 0$ as $n\to\infty$, \begin{align*} \sum_{k=1}^{\infty}q(k)=\lim_{n\to\infty}\left(1-\frac{1}{n+1}\right)=1. \end{align*} Thus $q$ is non-negative and normalised, so it is a probability mass function on $\mathbb N$.

By contrast, define $r:\mathbb N\to[0,\infty)$ by $r(k)=1/k$. This function is non-negative, but its total mass is not finite. For $m\in\mathbb N$, group the terms into the blocks from $2^j$ to $2^{j+1}-1$ for $j=0,\ldots,m$; these blocks partition $\{1,\ldots,2^{m+1}-1\}$, so \begin{align*} \sum_{k=1}^{2^{m+1}-1}\frac{1}{k}=\sum_{j=0}^{m}\sum_{k=2^j}^{2^{j+1}-1}\frac{1}{k}. \end{align*} For $2^j\leq k\leq 2^{j+1}-1$, we have $k<2^{j+1}$, hence \begin{align*} \frac{1}{k}>\frac{1}{2^{j+1}}. \end{align*} There are $2^{j+1}-2^j=2^j$ integers in the block $\{2^j,\ldots,2^{j+1}-1\}$, so \begin{align*} \sum_{k=2^j}^{2^{j+1}-1}\frac{1}{k}>2^j\cdot\frac{1}{2^{j+1}}=\frac{1}{2}. \end{align*} Consequently, \begin{align*} \sum_{k=1}^{2^{m+1}-1}\frac{1}{k}>\sum_{j=0}^{m}\frac{1}{2}=\frac{m+1}{2}. \end{align*} The lower bound $\frac{m+1}{2}$ tends to infinity as $m\to\infty$, so \begin{align*} \sum_{k=1}^{\infty}\frac{1}{k}=\infty. \end{align*} Therefore $r$ cannot be a probability mass function. Non-negativity alone is not enough; the masses must also sum to $1$.
[/example]

## Expectation and Moments

A mass function does more than assign probabilities to events. It also turns averages over outcomes into weighted sums. This is the discrete version of integration, and it is the point at which probability mass functions connect to statistics, estimation, and long-run behaviour.

To define expectation in the discrete setting, the random variable must be summable with respect to its own masses. The formula is familiar, but the condition matters: an infinite support can assign small probabilities to large values in a way that makes the average fail to exist as a finite number.

[definition: Expectation from a Probability Mass Function]
Let $X$ be a real-valued discrete random variable with probability mass function $p_X$. If \begin{align*} \sum_{x\in\operatorname{supp}(p_X)} |x|p_X(x)&<\infty, \end{align*} then the expectation of $X$ is \begin{align*} \mathbb E[X]&=\sum_{x\in\operatorname{supp}(p_X)} xp_X(x). \end{align*}
[/definition]

The absolute summability condition prevents the value of the expectation from depending on a rearrangement of positive and negative contributions. In applications we often average a transformed variable $g(X)$ rather than $X$ itself, so we need a rule that computes this average without first finding a new mass function from scratch.

[quotetheorem:4989]

The name is playful, but the theorem is serious: it says we can compute the expectation of $g(X)$ by weighting each original value of $X$. This saves work whenever many values of $X$ are collapsed by $g$.
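
As a rough numerical illustration of that rule, the sketch below computes $\mathbb E[g(X)]$ in two ways for a finite-support mass function: directly, by weighting each value of $X$, and indirectly, by first building the mass function of $g(X)$. The mass function `p_x`, the choice $g(x)=x^2$, and the helper names are assumptions made up for this example.

```python
from collections import defaultdict
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] as a weighted sum over the support of X (finite support assumed)."""
    return sum((g(x) * mass for x, mass in pmf.items()), Fraction(0))

def pmf_of_transformed(pmf, g):
    """Mass function of g(X): merge the masses of all x sharing the same g(x)."""
    out = defaultdict(Fraction)
    for x, mass in pmf.items():
        out[g(x)] += mass
    return dict(out)

# Illustrative mass function on {-2, -1, 0, 1, 2}; the masses are made up.
p_x = {-2: Fraction(1, 8), -1: Fraction(1, 4), 0: Fraction(1, 4),
       1: Fraction(1, 4), 2: Fraction(1, 8)}

def square(x):
    return x * x

direct = expectation(p_x, square)                           # weight original values
via_new_pmf = expectation(pmf_of_transformed(p_x, square))  # build the new PMF first
print(direct, via_new_pmf)  # 3/2 3/2
```

Both routes agree, but the direct route never needs the intermediate dictionary, which is the saving the theorem describes.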
[example: Mean and Variance of a Bernoulli Variable]
Let $X\sim\operatorname{Ber}(p)$ with $p\in[0,1]$, so $X$ has support $\{0,1\}$ and mass function $p_X(0)=1-p$ and $p_X(1)=p$. Using the expectation formula for a discrete random variable, \begin{align*} \mathbb E[X]=\sum_{x\in\{0,1\}}xp_X(x). \end{align*} Substituting the two support points gives \begin{align*} \mathbb E[X]=0\cdot p_X(0)+1\cdot p_X(1). \end{align*} Using $p_X(0)=1-p$ and $p_X(1)=p$, \begin{align*} \mathbb E[X]=0\cdot(1-p)+1\cdot p. \end{align*} Since $0\cdot(1-p)=0$ and $1\cdot p=p$, \begin{align*} \mathbb E[X]=p. \end{align*} For the second moment, apply the same weighted-sum formula to the function $g(x)=x^2$: \begin{align*} \mathbb E[X^2]=\sum_{x\in\{0,1\}}x^2p_X(x). \end{align*} Substituting the two support points, \begin{align*} \mathbb E[X^2]=0^2p_X(0)+1^2p_X(1). \end{align*} Since $0^2=0$ and $1^2=1$, \begin{align*} \mathbb E[X^2]=0\cdot p_X(0)+1\cdot p_X(1). \end{align*} Using the Bernoulli masses again, \begin{align*} \mathbb E[X^2]=0\cdot(1-p)+1\cdot p. \end{align*} Therefore \begin{align*} \mathbb E[X^2]=p. \end{align*} The variance is $\operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E[X])^2$, so substituting the two values just computed gives \begin{align*} \operatorname{Var}(X)=p-p^2. \end{align*} Factoring out $p$, \begin{align*} p-p^2=p(1-p). \end{align*} Hence \begin{align*} \operatorname{Var}(X)=p(1-p). \end{align*} This example is the template for many discrete computations: list the support, weight each value by its mass, and sum.
[/example]

Moments are not the only summaries encoded by a mass function, but they are among the most widely used. They compress the distribution into numerical features, while the full mass function preserves all discrete distributional information.

## Transformations and Joint Mass Functions

### Functions of a Discrete Variable

Many random variables are built from other random variables. If $X$ is known and $Y=g(X)$, then the mass of a value $y$ for $Y$ comes from all values of $X$ that map to $y$. The language needed here is aggregation over preimages, because several old atoms may merge into one new atom.

[definition: Pushforward of a Probability Mass Function]
Let $X$ be a discrete random variable with values in a countable set $S\subset E$ and probability mass function $p_X$. Let $g:E\to F$ be a function. The pushforward mass function of $p_X$ along $g$ is the function $p_{g(X)}:F\to[0,1]$ defined by \begin{align*} p_{g(X)}(y)&=\sum_{x\in S: g(x)=y}p_X(x). \end{align*}
[/definition]

The formula says that when several inputs lead to the same output, their masses merge. This is why the distribution of $X^2$ loses the sign information from a symmetric distribution of $X$.

[example: Squaring a Symmetric Discrete Variable]
Let $X$ take values in $\{-1,0,1\}$ with $p_X(-1)=1/4$, $p_X(0)=1/2$, and $p_X(1)=1/4$. Set $Y=X^2$. Since \begin{align*} (-1)^2=1, \end{align*} \begin{align*} 0^2=0, \end{align*} and \begin{align*} 1^2=1, \end{align*} the possible values of $Y$ are $0$ and $1$. To compute the mass at $0$, identify the preimage of $0$ under the map $x\mapsto x^2$: \begin{align*} \{x\in\{-1,0,1\}:x^2=0\}=\{0\}. \end{align*} Therefore \begin{align*} p_Y(0)=p_X(0). \end{align*} Using the given mass of $X$ at $0$, \begin{align*} p_Y(0)=\frac{1}{2}. \end{align*} For the mass at $1$, the preimage of $1$ is \begin{align*} \{x\in\{-1,0,1\}:x^2=1\}=\{-1,1\}. \end{align*} Hence \begin{align*} p_Y(1)=p_X(-1)+p_X(1). \end{align*} Substituting the two masses gives \begin{align*} p_Y(1)=\frac{1}{4}+\frac{1}{4}. \end{align*} Since the denominators are equal, \begin{align*} \frac{1}{4}+\frac{1}{4}=\frac{2}{4}. \end{align*} Reducing the fraction, \begin{align*} \frac{2}{4}=\frac{1}{2}. \end{align*} Thus \begin{align*} p_Y(1)=\frac{1}{2}. \end{align*} The squaring map keeps the atom at $0$ separate but merges the two atoms at $-1$ and $1$ into the single atom at $1$.
[/example]

### Several Discrete Variables

To study several discrete random variables at once, we need a mass function on tuples. This keeps track not only of the marginal behaviour of each variable, but also of how their values occur together.

[definition: Joint Probability Mass Function]
Let $X:(\Omega,\mathcal F,\mathbb P)\to(E,\mathcal E)$ and $Y:(\Omega,\mathcal F,\mathbb P)\to(F,\mathcal G)$ be discrete random variables, and assume that every singleton in $E$ belongs to $\mathcal E$ and every singleton in $F$ belongs to $\mathcal G$. The joint probability mass function of $(X,Y)$ is the function $p_{X,Y}:E\times F\to[0,1]$ defined by \begin{align*} p_{X,Y}(x,y)&=\mathbb P(X\in\{x\}, Y\in\{y\}). \end{align*}
[/definition]

A joint mass function is a two-dimensional table when both supports are finite. To answer questions about one coordinate alone, we need a way to collapse the table by summing over the other coordinate.

[definition: Marginal Probability Mass Function]
Let $X:(\Omega,\mathcal F,\mathbb P)\to(E,\mathcal E)$ and $Y:(\Omega,\mathcal F,\mathbb P)\to(F,\mathcal G)$ be discrete random variables with joint probability mass function $p_{X,Y}:E\times F\to[0,1]$. Let $S_X\subset E$ and $S_Y\subset F$ be countable sets such that $\mathbb P(X\in S_X,Y\in S_Y)=1$. The marginal probability mass functions are the functions $p_X:E\to[0,1]$ and $p_Y:F\to[0,1]$ defined by \begin{align*} p_X(x)&=\sum_{y\in S_Y} p_{X,Y}(x,y) \end{align*} and \begin{align*} p_Y(y)&=\sum_{x\in S_X} p_{X,Y}(x,y). \end{align*}
[/definition]

Marginals answer questions about one coordinate at a time. They do not usually determine the joint law, because they may forget dependence. The next example shows how two different joint distributions can share the same marginals.
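
Computationally, the marginalisation formulas amount to row and column sums of the joint table. The following is a minimal sketch, assuming the joint mass function is stored as a dictionary keyed by pairs `(x, y)`; the helper name `marginals` and the particular joint masses are illustrative choices, not taken from the text.

```python
from collections import defaultdict
from fractions import Fraction

def marginals(joint):
    """Return (p_X, p_Y) from a joint PMF stored as {(x, y): mass}."""
    p_x, p_y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), mass in joint.items():
        p_x[x] += mass  # sum over y for fixed x
        p_y[y] += mass  # sum over x for fixed y
    return dict(p_x), dict(p_y)

# Illustrative joint table on {0,1} x {0,1}; the masses are assumed for this sketch.
joint = {(0, 0): Fraction(1, 3), (0, 1): Fraction(1, 6),
         (1, 0): Fraction(1, 6), (1, 1): Fraction(1, 3)}

p_x, p_y = marginals(joint)
print(p_x)  # both values of X carry mass 1/2
print(p_y)  # both values of Y carry mass 1/2
```

This table has Bernoulli marginals with parameter $1/2$ even though its cells are neither concentrated on the diagonal nor all equal, which already suggests that marginals leave the joint law underdetermined.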
[example: Same Marginals, Different Joint Behaviour]
Let $X$ and $Y$ take values in $\{0,1\}$. In the first model, define the joint masses by \begin{align*} p_{X,Y}(0,0)=\frac{1}{2},\quad p_{X,Y}(1,1)=\frac{1}{2},\quad p_{X,Y}(0,1)=0,\quad p_{X,Y}(1,0)=0. \end{align*} Since the only pairs with positive mass are $(0,0)$ and $(1,1)$, every outcome with positive probability satisfies $X=Y$. The marginal mass of $X$ at $0$ is obtained by summing over the possible values of $Y$: \begin{align*} p_X(0)=p_{X,Y}(0,0)+p_{X,Y}(0,1). \end{align*} Substituting the joint masses gives \begin{align*} p_X(0)=\frac{1}{2}+0=\frac{1}{2}. \end{align*} Similarly, \begin{align*} p_X(1)=p_{X,Y}(1,0)+p_{X,Y}(1,1). \end{align*} Substituting again, \begin{align*} p_X(1)=0+\frac{1}{2}=\frac{1}{2}. \end{align*} For $Y$, \begin{align*} p_Y(0)=p_{X,Y}(0,0)+p_{X,Y}(1,0). \end{align*} Hence \begin{align*} p_Y(0)=\frac{1}{2}+0=\frac{1}{2}. \end{align*} Also, \begin{align*} p_Y(1)=p_{X,Y}(0,1)+p_{X,Y}(1,1). \end{align*} Thus \begin{align*} p_Y(1)=0+\frac{1}{2}=\frac{1}{2}. \end{align*} So both marginals are Bernoulli with parameter $1/2$.

In the second model, define \begin{align*} p_{X,Y}(0,0)=p_{X,Y}(0,1)=p_{X,Y}(1,0)=p_{X,Y}(1,1)=\frac{1}{4}. \end{align*} The marginal mass of $X$ at $0$ is \begin{align*} p_X(0)=p_{X,Y}(0,0)+p_{X,Y}(0,1). \end{align*} Substitution gives \begin{align*} p_X(0)=\frac{1}{4}+\frac{1}{4}=\frac{2}{4}=\frac{1}{2}. \end{align*} Likewise, \begin{align*} p_X(1)=p_{X,Y}(1,0)+p_{X,Y}(1,1). \end{align*} Therefore \begin{align*} p_X(1)=\frac{1}{4}+\frac{1}{4}=\frac{2}{4}=\frac{1}{2}. \end{align*} For $Y$, \begin{align*} p_Y(0)=p_{X,Y}(0,0)+p_{X,Y}(1,0). \end{align*} Thus \begin{align*} p_Y(0)=\frac{1}{4}+\frac{1}{4}=\frac{2}{4}=\frac{1}{2}. \end{align*} Finally, \begin{align*} p_Y(1)=p_{X,Y}(0,1)+p_{X,Y}(1,1). \end{align*} Substituting the two joint masses gives \begin{align*} p_Y(1)=\frac{1}{4}+\frac{1}{4}=\frac{2}{4}=\frac{1}{2}. \end{align*} Thus the two models have the same marginal mass functions, but they have different joint behaviour: in the first model $X=Y$ with probability $1$, while in the second model \begin{align*} \mathbb P(X=Y)=p_{X,Y}(0,0)+p_{X,Y}(1,1)=\frac{1}{4}+\frac{1}{4}=\frac{1}{2}. \end{align*} The marginals record the one-coordinate distributions, while the joint mass function also records how the two coordinates are coupled.
[/example]

## Independence and Convolution

Independence is the condition under which the joint mass function factors into the product of its marginals. For discrete random variables, this condition is especially concrete: every cell of the joint table must equal the product of its row and column masses.

[quotetheorem:4862]

The criterion is the discrete form of the idea that independent coordinates carry no hidden coupling. Once the marginal masses are fixed, the joint masses are forced by multiplication. The next natural operation is addition, where many possible pairs of values may lead to the same total; we need an operation on mass functions that performs exactly that summation.

[definition: Convolution of Probability Mass Functions]
Let $p$ and $q$ be probability mass functions on $\mathbb Z$. Their convolution is the function $p*q:\mathbb Z\to[0,1]$ defined by \begin{align*} (p*q)(k)&=\sum_{j\in\mathbb Z}p(j)q(k-j). \end{align*}
[/definition]

Convolution is designed to count all decompositions of a total $k$ as $j+(k-j)$. The key question it answers is distributional: when two independent integer-valued random variables are added, what is the probability mass function of the sum?

[quotetheorem:9488]

This theorem explains why binomial probabilities contain binomial coefficients: when independent Bernoulli variables are added, many different success-failure sequences yield the same total number of successes.

[example: Adding Two Fair Dice]
Let $X$ and $Y$ be independent fair die rolls, and set $Z=X+Y$. Since $X$ and $Y$ are integer-valued and independent, the mass function of $Z$ is given by *Sum of Independent Integer-Valued Random Variables*: \begin{align*} p_Z(k)=\sum_{j\in\mathbb Z}p_X(j)p_Y(k-j). \end{align*} For a fair die, $p_X(j)=1/6$ when $j\in\{1,\ldots,6\}$ and $p_X(j)=0$ otherwise; the same is true for $p_Y$. Thus a summand is nonzero exactly when \begin{align*} 1\leq j\leq 6 \text{ and } 1\leq k-j\leq 6. \end{align*} The second condition is equivalent to \begin{align*} k-6\leq j\leq k-1. \end{align*} So the valid values of $j$ are exactly the integers satisfying \begin{align*} \max(1,k-6)\leq j\leq \min(6,k-1). \end{align*} For each valid $j$, \begin{align*} p_X(j)p_Y(k-j)=\frac{1}{6}\cdot\frac{1}{6}=\frac{1}{36}. \end{align*} When $2\leq k\leq 7$, the valid values are $j=1,\ldots,k-1$, so there are $k-1$ of them. Hence \begin{align*} p_Z(k)=(k-1)\cdot\frac{1}{36}=\frac{k-1}{36}. \end{align*} When $8\leq k\leq 12$, the valid values are $j=k-6,\ldots,6$, so their number is \begin{align*} 6-(k-6)+1=13-k. \end{align*} Therefore \begin{align*} p_Z(k)=(13-k)\cdot\frac{1}{36}=\frac{13-k}{36}. \end{align*} In particular, \begin{align*} p_Z(2)=\frac{1}{36}, \end{align*} \begin{align*} p_Z(7)=\frac{6}{36}, \end{align*} and \begin{align*} p_Z(12)=\frac{1}{36}. \end{align*} The triangular shape of the distribution is exactly the count of how many ordered pairs of die faces produce each possible sum.
[/example]

## Named Discrete Models

### Bernoulli and Binomial Masses

A probability mass function often enters practice through a named family. The family name gives a modelling story and a parameter range, while the formula gives exact probabilities. A Bernoulli variable models one trial with two outcomes, usually coded as failure $0$ and success $1$, so it is the atomic building block for finite-counting models.

[definition: Bernoulli Distribution]
Let $p\in[0,1]$. A random variable $X:(\Omega,\mathcal F,\mathbb P)\to(\{0,1\},2^{\{0,1\}})$ has the Bernoulli distribution with parameter $p$, written $X\sim\operatorname{Ber}(p)$, if its probability mass function $p_X:\{0,1\}\to[0,1]$ is defined by \begin{align*} p_X(0)&=1-p \end{align*} and \begin{align*} p_X(1)&=p. \end{align*}
[/definition]

Adding independent Bernoulli variables leads to the binomial distribution. The mass at $k$ must count both the probability of a particular pattern with $k$ successes and the number of such patterns, which motivates the [binomial coefficient](/page/Binomial%20Coefficient) in the formula.

[definition: Binomial Distribution]
Let $n\in\mathbb N$ and $p\in[0,1]$. A random variable $X:(\Omega,\mathcal F,\mathbb P)\to(\{0,1,\ldots,n\},2^{\{0,1,\ldots,n\}})$ has the binomial distribution with parameters $n$ and $p$, written $X\sim\operatorname{Bin}(n,p)$, if its probability mass function $p_X:\{0,1,\ldots,n\}\to[0,1]$ is defined by \begin{align*} p_X(k)&=\binom{n}{k}p^k(1-p)^{n-k}, \qquad k\in\{0,1,\ldots,n\}. \end{align*}
[/definition]

### Infinite-Support Models

When the number of trials is large and each individual success is rare, the Poisson distribution often replaces the binomial distribution. Its support is infinite, so the formula must be normalised by a factor that makes the infinite sum equal to $1$.

[definition: Poisson Distribution]
Let $\lambda>0$. A random variable $X:(\Omega,\mathcal F,\mathbb P)\to(\{0,1,2,\ldots\},2^{\{0,1,2,\ldots\}})$ has the Poisson distribution with parameter $\lambda$, written $X\sim\operatorname{Poi}(\lambda)$, if its probability mass function $p_X:\{0,1,2,\ldots\}\to[0,1]$ is defined by \begin{align*} p_X(k)&=e^{-\lambda}\frac{\lambda^k}{k!}, \qquad k\in\{0,1,2,\ldots\}. \end{align*}
[/definition]

The normalisation of the Poisson mass function is the exponential series. This is a useful reminder that infinite-support mass functions are often justified by familiar analytic identities.
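
A quick numerical sanity check of this normalisation is sketched below: it evaluates the Poisson masses in floating point and sums the first few dozen terms. The parameter value $\lambda=3$, the cutoff of $50$ terms, and the function name `poisson_pmf` are arbitrary choices for illustration; a truncated sum can only suggest, not prove, that the total mass is $1$.

```python
import math

def poisson_pmf(k, lam):
    """Poisson mass e^{-lam} * lam^k / k! at a non-negative integer k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # illustrative parameter
partial = sum(poisson_pmf(k, lam) for k in range(50))

# The tail beyond k = 49 is far below floating-point resolution for lam = 3.
print(abs(partial - 1.0) < 1e-12)  # True
```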
[example: Normalising the Poisson Mass Function]
For $\lambda>0$, define \begin{align*} p(k)=e^{-\lambda}\frac{\lambda^k}{k!}, \qquad k\in\{0,1,2,\ldots\}. \end{align*} Because $e^{-\lambda}>0$, $\lambda^k\geq 0$, and $k!>0$ for every $k\geq 0$, each value satisfies \begin{align*} p(k)\geq 0. \end{align*} We now compute the total mass. Substituting the definition of $p(k)$ gives \begin{align*} \sum_{k=0}^{\infty}p(k)=\sum_{k=0}^{\infty}e^{-\lambda}\frac{\lambda^k}{k!}. \end{align*} Since $e^{-\lambda}$ does not depend on $k$, it factors out of the series: \begin{align*} \sum_{k=0}^{\infty}e^{-\lambda}\frac{\lambda^k}{k!}=e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^k}{k!}. \end{align*} By the exponential series identity, \begin{align*} \sum_{k=0}^{\infty}\frac{\lambda^k}{k!}=e^\lambda. \end{align*} Therefore \begin{align*} \sum_{k=0}^{\infty}p(k)=e^{-\lambda}e^\lambda. \end{align*} Using $e^{-\lambda}=1/e^\lambda$ and $e^\lambda>0$, \begin{align*} e^{-\lambda}e^\lambda=\frac{1}{e^\lambda}e^\lambda=1. \end{align*} Hence \begin{align*} \sum_{k=0}^{\infty}p(k)=1. \end{align*} Thus the formula gives non-negative masses with total mass $1$, so it defines a probability mass function on $\{0,1,2,\ldots\}$. The factor $e^{-\lambda}$ is exactly the normalising constant that cancels the exponential series sum $e^\lambda$.
[/example]

The geometric distribution models waiting time until the first success. We need a separate named family because the support is $\mathbb N$ and the mass decays geometrically with the number of failures before the first success.

[definition: Geometric Distribution]
Let $p\in(0,1]$. A random variable $X:(\Omega,\mathcal F,\mathbb P)\to(\mathbb N,2^{\mathbb N})$ has the geometric distribution with parameter $p$, written $X\sim\operatorname{Geom}(p)$, if its probability mass function $p_X:\mathbb N\to[0,1]$ is defined by \begin{align*} p_X(k)&=(1-p)^{k-1}p, \qquad k\in\mathbb N. \end{align*}
[/definition]

These named families should not be memorised as isolated formulas. Each is a mass assignment on a countable support, and each becomes usable because set probabilities, expectations, and transformations reduce to sums over that support.

## Comparing Mass and Density

The probability mass function is sometimes confused with a probability density function because both are non-negative functions that describe distributions. The distinction is structural: masses are summed over points, while densities are integrated over sets with respect to [Lebesgue measure](/page/Lebesgue%20Measure). To state this distinction cleanly, we need the language of atoms.

[definition: Atom of a Probability Measure]
Let $(E,\mathcal E,\mu)$ be a probability space. A point $x\in E$ is an atom of $\mu$ if \begin{align*} \mu(\{x\})&>0. \end{align*}
[/definition]

For a discrete random variable $X$, the atoms of its law are exactly the points in the support of $p_X$. Continuous distributions such as the normal distribution have no point atoms, so a probability mass function would miss their distribution entirely.

[example: Why a Density Is Not a Mass Function]
Let $X\sim\operatorname{Unif}(0,1)$, so its density is $f_X(x)=1$ for $0<x<1$ and $f_X(x)=0$ otherwise. For a fixed $a\in\mathbb R$, the singleton event has probability \begin{align*} \mathbb P(X=a)=\int_{\{a\}} f_X(x)\,d\mathcal L^1(x). \end{align*} Since $f_X(x)$ is zero off $(0,1)$ and equal to $1$ on $(0,1)$, \begin{align*} \int_{\{a\}} f_X(x)\,d\mathcal L^1(x)=\int_{\{a\}\cap(0,1)}1\,d\mathcal L^1. \end{align*} The set $\{a\}\cap(0,1)$ is either empty or a singleton, and both have Lebesgue measure $0$, so \begin{align*} \int_{\{a\}\cap(0,1)}1\,d\mathcal L^1=\mathcal L^1(\{a\}\cap(0,1))=0. \end{align*} Therefore \begin{align*} \mathbb P(X=a)=0. \end{align*} If we tried to define a mass function by $p_X(a)=\mathbb P(X=a)$, then $p_X(a)=0$ for every $a\in\mathbb R$. Its positive support would be \begin{align*} \{a\in\mathbb R:p_X(a)>0\}=\varnothing, \end{align*} so the total mass recorded by this point-mass function would be \begin{align*} \sum_{a\in\varnothing}p_X(a)=0. \end{align*} This cannot describe the law of $X$, because a probability mass function must have total mass $1$. The distribution is instead visible in interval probabilities. For example, \begin{align*} \mathbb P(X\in(1/4,3/4))=\int_{(1/4,3/4)}1\,d\mathcal L^1. \end{align*} The integral of $1$ over an interval is its Lebesgue length, and the length is \begin{align*} \mathcal L^1((1/4,3/4))=\frac{3}{4}-\frac{1}{4}. \end{align*} Subtracting the fractions gives \begin{align*} \frac{3}{4}-\frac{1}{4}=\frac{2}{4}. \end{align*} Reducing the fraction, \begin{align*} \frac{2}{4}=\frac{1}{2}. \end{align*} Hence \begin{align*} \mathbb P(X\in(1/4,3/4))=\frac{1}{2}. \end{align*} This information lives in probabilities of intervals, not in masses at individual points.
[/example]

Some distributions have both discrete and continuous parts. In that setting a single probability mass function describes only the atomic part, while a density describes only the absolutely continuous part. A full measure-theoretic description is then needed.

[remark: Mixed Distributions]
A random variable can assign positive probability to a point and also have a continuous component. For instance, a measurement device may record $0$ with positive probability when it fails and otherwise produce a continuous reading. Such a law is not fully described by a probability mass function or by a density alone.
[/remark]

## Beyond and Connected Topics

Probability mass functions are the entry point to discrete probability, but the same idea reappears throughout probability and statistics in more flexible language. The mass function of $X$ is the coordinate form of the law $\mu_X=\mathbb P\circ X^{-1}$ on a countable support. In measure-theoretic probability, this becomes a special case of pushforward measures and integration with respect to a probability measure.

The next natural topic is expectation. For a discrete random variable, expectation is a sum weighted by the probability mass function; in general measure spaces it becomes the integral $\mathbb E[X]=\int_\Omega X\,d\mathbb P$. This transition is one of the main themes of [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).

Joint mass functions lead directly to independence, conditioning, and Markov chains. In finite or countable state spaces, transition probabilities are organised as matrices, and the distribution at a later time is computed by summing over intermediate states. This is the discrete ancestor of the kernel notation used in advanced probability.

In statistics, a probability mass function becomes a likelihood when it is viewed as a function of unknown parameters after the observed data are fixed. For example, the binomial mass function can be read either as $\mathbb P(X=k)$ for fixed $p$, or as a likelihood for $p$ after observing $k$ successes. This statistical viewpoint is developed further in [Cambridge IB Statistics](/page/Cambridge%20IB%20Statistics).

Generating functions and characteristic functions provide another way to encode a mass function. For an integer-valued random variable, the probability [generating function](/page/Generating%20Function) stores the coefficients $p_X(k)$ in a [power series](/page/Power%20Series), while the characteristic function stores the same law through complex exponentials. These encodings become especially powerful for sums of independent random variables.

## References

Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).

Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).

Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).

Androma, [Cambridge IB Statistics](/page/Cambridge%20IB%20Statistics).

Grimmett and Stirzaker, *Probability and Random Processes* (2001).

Feller, *An Introduction to Probability Theory and Its Applications, Volume I* (1968).

Ross, *A First Course in Probability* (2014).

Created by admin on 6/21/2026 | Last updated on 6/21/2026
