Counting random events by listing all their probabilities is honest but quickly becomes unwieldy. If $X$ counts the number of successes in a collection of trials, then the distribution of $X$ is a sequence $p_X(0),p_X(1),p_X(2),\dots$. When another independent count $Y$ is added, the probability of $X+Y=n$ is a convolution of two such sequences, and repeated convolution is exactly the kind of bookkeeping that hides the structure of the problem.
The probability [generating function](/page/Generating%20Function) replaces the sequence by a [power series](/page/Power%20Series). The coefficient of $z^n$ remembers $\mathbb P(X=n)$, while multiplication of power series remembers convolution. This is the same idea that makes ordinary generating functions useful in combinatorics, but the coefficients are probabilities, so the series also carries normalization, moments, and distributional information.
[example: Two Bernoulli Trials Before the Definition]
Let $X_1,X_2$ be independent random variables with $X_i\sim\operatorname{Ber}(p)$, so $\mathbb P(X_i=1)=p$ and $\mathbb P(X_i=0)=1-p$. Since each $X_i$ takes only the values $0$ and $1$, the sum $X_1+X_2$ can take only the values $0,1,2$. For the value $0$, both trials must be $0$, and independence gives
\begin{align*}
\mathbb P(X_1+X_2=0)=\mathbb P(X_1=0,X_2=0)=\mathbb P(X_1=0)\mathbb P(X_2=0)=(1-p)^2.
\end{align*}
For the value $1$, exactly one trial must be $1$, so the two disjoint cases are $(X_1,X_2)=(1,0)$ and $(X_1,X_2)=(0,1)$. Hence
\begin{align*}
\mathbb P(X_1+X_2=1)=\mathbb P(X_1=1,X_2=0)+\mathbb P(X_1=0,X_2=1).
\end{align*}
Using independence in each term,
\begin{align*}
\mathbb P(X_1+X_2=1)=p(1-p)+(1-p)p=2p(1-p).
\end{align*}
For the value $2$, both trials must be $1$, so
\begin{align*}
\mathbb P(X_1+X_2=2)=\mathbb P(X_1=1,X_2=1)=\mathbb P(X_1=1)\mathbb P(X_2=1)=p^2.
\end{align*}
The polynomial $(1-p)+pz$ stores the two probabilities of one Bernoulli trial as the constant coefficient and the coefficient of $z$. Multiplying the two copies gives
\begin{align*}
((1-p)+pz)^2=((1-p)+pz)((1-p)+pz).
\end{align*}
Expanding term by term,
\begin{align*}
((1-p)+pz)((1-p)+pz)=(1-p)^2+(1-p)pz+pz(1-p)+p^2z^2.
\end{align*}
Since $(1-p)pz+pz(1-p)=2p(1-p)z$, this becomes
\begin{align*}
((1-p)+pz)^2=(1-p)^2+2p(1-p)z+p^2z^2.
\end{align*}
The coefficient of $z^0$ is $\mathbb P(X_1+X_2=0)$, the coefficient of $z^1$ is $\mathbb P(X_1+X_2=1)$, and the coefficient of $z^2$ is $\mathbb P(X_1+X_2=2)$. The point is not the algebra for two trials, but that multiplication is already doing the same bookkeeping as the distribution of an independent sum.
[/example]
This example suggests that the variable $z$ is a placeholder for the count, not a [random variable](/page/Random%20Variable). The power series is a compact encoding of a distribution on $\{0,1,2,\dots\}$, and the probabilistic content is recovered from coefficients and derivatives.
## Definition
The construction begins with random variables whose values are non-negative integers. This restriction is not cosmetic: powers $z^k$ naturally index counts $k=0,1,2,\dots$, and the resulting series has no need for negative powers or fractional exponents.
[definition: Probability Generating Function]
Let $(\Omega,\mathcal F,\mathbb P)$ be a [probability space](/page/Probability%20Space), and let $X: \Omega \to \{0,1,2,\dots\}$ be a random variable. Write
\begin{align*}
p_X(k)=\mathbb P(X=k), \qquad k\in \{0,1,2,\dots\}.
\end{align*}
The probability generating function of $X$ is the function $G_X:\{z\in\mathbb C:|z|\le1\}\to\mathbb C$ defined by
\begin{align*}
G_X(z)=\mathbb E[z^X]=\sum_{k=0}^{\infty}p_X(k)z^k.
\end{align*}
[/definition]
For $|z|\le1$, the series is absolutely bounded by $\sum_{k=0}^{\infty}p_X(k)=1$, so the definition always makes sense on the closed unit disk. On the real interval $[0,1]$, the pgf takes values in $[0,1]$ and has the direct probabilistic interpretation used in many arguments. Sometimes the same formula defines an [analytic function](/page/Analytic%20Function) on a larger disk, and that analytic viewpoint gives sharper coefficient and limit tools.
Because the coefficients are probabilities, the value at $1$ is fixed:
\begin{align*}
G_X(1)=\sum_{k=0}^{\infty}p_X(k)=1.
\end{align*}
Also $G_X(0)=p_X(0)$, so the first visible value already records the atom at zero.
A pgf is useful only if the encoding can be inverted one probability at a time. Values such as $G_X(0)$ recover only special atoms, and derivatives recover moments rather than the full mass function. To recover the law itself, we need a notation for isolating the coefficient of a specified power of $z$.
[definition: Coefficient Extraction]
Let $\mathcal A$ be the class of formal power series in $z$ with complex coefficients, together with power series in $z$ that converge in a neighbourhood of $0$. For each $n\in\{0,1,2,\dots\}$, coefficient extraction is the map $[z^n]:\mathcal A\to\mathbb C$ defined by
\begin{align*}
[z^n]\sum_{k=0}^{\infty}a_kz^k=a_n.
\end{align*}
[/definition]
With this notation, the probabilities encoded by a probability generating function are $\mathbb P(X=n)=[z^n]G_X(z)$. This is the first reason pgfs are faithful encodings rather than summaries: no probability is discarded.
The coefficient operation is algebraic, but many pgf arguments also use limits and derivatives. To know where those analytic manipulations are valid, we need the convergence disk of the underlying power series.
[definition: Radius of Convergence]
Let $\mathcal P$ be the class of complex power series in $z$. The [radius of convergence](/theorems/262) is the map
\begin{align*}
R:\mathcal P\to[0,\infty],
\end{align*}
where $R\left(\sum_{k=0}^{\infty}a_kz^k\right)$ is the unique number such that the series converges absolutely for $|z|<R$ and diverges for $|z|>R$.
[/definition]
For a probability generating function, $R\ge 1$ because $\sum_k p_X(k)|z|^k\le 1$ whenever $|z|\le 1$. If $R>1$, then for every $\rho$ with $1<\rho<R$ the weighted sum $\sum_{k=0}^{\infty}p_X(k)\rho^k$ is finite, so the probabilities have exponential control at every rate strictly below $R$. If $R=1$, the distribution may still have finite moments, but the [analytic continuation](/page/Analytic%20Continuation) is more delicate.
## Extracting Probabilities and Moments
The first task after defining a pgf is to recover what it stores. Coefficients recover the mass function; derivatives at $1$ recover factorial moments, which are often better adapted to counting than ordinary powers.
### Recovering the Law
If two pgfs agree, then every coefficient agrees, so the two random variables have the same law. This matters because it lets us prove distributional identities by proving identities between functions.
[quotetheorem:1128]
This theorem makes the pgf a lossless encoding of the law. Equality of functions on the real interval $[0,1]$ is enough because the coefficients of the corresponding power series are determined by the function near $0$.
After knowing that the law is determined, the next question is how to recover the individual probabilities in a form suitable for calculation. Coefficient extraction gives the direct answer, while derivatives at zero give the same answer when the pgf is being treated analytically.
[quotetheorem:9499]
The second formula is useful for polynomials and analytic expressions, but coefficient extraction is often cleaner. For example, expanding $(1-p+pz)^m$ directly gives the binomial distribution.
### Factorial Moments
Derivatives at $1$ answer a different question from coefficients at $0$: they measure the size of the count rather than the probability of one count value. The terms produced by differentiating $z^k$ are falling products, so the relevant moment scale is factorial rather than ordinary.
[definition: Falling Factorial]
For each $r\in\{0,1,2,\dots\}$, the falling factorial of order $r$ is the map
\begin{align*}
(\cdot)_r:\mathbb R\to\mathbb R,
\end{align*}
where $(x)_0=1$ and, for $r\ge1$,
\begin{align*}
(x)_r=x(x-1)\cdots(x-r+1).
\end{align*}
[/definition]
Falling factorials appear because differentiating $z^k$ lowers the exponent and multiplies by $k(k-1)\cdots(k-r+1)$. Ordinary powers $X^r$ are not what differentiation sees directly, so the immediate obstruction is to identify which moments are encoded by derivatives of $G_X$. The answer is that derivatives recover factorial moments, with limits at $1$ handling distributions whose pgf need not be analytic beyond the unit disk.
[quotetheorem:1129]
This result is mainly a translation rule: derivatives of the pgf give factorial moments, and the displayed variance formula records how the first two factorial moments determine the ordinary variance. The next examples use this rule as a calculation device rather than re-proving it.
[example: Moments of a Binomial Count]
Let $X\sim\operatorname{Bin}(m,p)$ with $m\in\mathbb N$ and $p\in[0,1]$. Its pgf is
\begin{align*}
G_X(z)=(1-p+pz)^m.
\end{align*}
Using the chain rule on $u(z)=1-p+pz$, with $u'(z)=p$, gives
\begin{align*}
G_X'(z)=m(1-p+pz)^{m-1}p=mp(1-p+pz)^{m-1}.
\end{align*}
For $m\ge2$, differentiating once more gives
\begin{align*}
G_X''(z)=mp(m-1)(1-p+pz)^{m-2}p=m(m-1)p^2(1-p+pz)^{m-2}.
\end{align*}
When $m=1$, the same second factorial moment is $0$, since $G_X(z)=1-p+pz$ is linear.
At $z=1$,
\begin{align*}
1-p+p\cdot 1=1,
\end{align*}
so *[PGF and Moments](/theorems/1129)* gives
\begin{align*}
\mathbb E[X]=G_X'(1)=mp\cdot 1^{m-1}=mp.
\end{align*}
It also gives
\begin{align*}
\mathbb E[(X)_2]=G_X''(1)=m(m-1)p^2\cdot 1^{m-2}=m(m-1)p^2.
\end{align*}
Since $X^2=(X)_2+X$, we have
\begin{align*}
\mathbb E[X^2]=\mathbb E[(X)_2]+\mathbb E[X]=m(m-1)p^2+mp.
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E[X])^2=m(m-1)p^2+mp-m^2p^2.
\end{align*}
Expanding $m(m-1)p^2=m^2p^2-mp^2$ gives
\begin{align*}
\operatorname{Var}(X)=m^2p^2-mp^2+mp-m^2p^2.
\end{align*}
The $m^2p^2$ terms cancel, so
\begin{align*}
\operatorname{Var}(X)=mp-mp^2=mp(1-p).
\end{align*}
The computation shows why derivatives of the pgf first produce factorial moments, and why ordinary moments are recovered by rewriting powers such as $X^2$ in terms of falling factorials.
[/example]
A common mistake is to treat the [second derivative](/page/Second%20Derivative) term itself as the variance. The binomial example shows the missing correction term: the second derivative counts ordered pairs of distinct successes, not the square of the total count.
## Algebra of Independent Counts
The real strength of pgfs is that independence turns sums into products. This is the probabilistic version of the elementary rule that the generating function of a convolution is the product of the generating functions.
### Sums and Products
If two counts are independent, then raising $z$ to their sum splits into a product. Taking expectation then separates the product, so an operation on random variables becomes multiplication of functions.
[quotetheorem:1130]
This theorem is the engine behind many distributional identities. It moves work from probabilities to algebra: convolution becomes multiplication, repeated independent summation becomes exponentiation, and random sums become composition.
[example: Recovering the Binomial Law]
Let $X_1,\dots,X_m$ be i.i.d. Bernoulli random variables with parameter $p$, and set $S_m=X_1+\cdots+X_m$. For one summand, the two possible values give
\begin{align*}
G_{X_i}(z)=\mathbb P(X_i=0)z^0+\mathbb P(X_i=1)z^1.
\end{align*}
Since $\mathbb P(X_i=0)=1-p$ and $\mathbb P(X_i=1)=p$, this is
\begin{align*}
G_{X_i}(z)=(1-p)+pz.
\end{align*}
Because the $X_i$ are independent, *[PGF of a Sum](/theorems/1130)* gives
\begin{align*}
G_{S_m}(z)=\prod_{i=1}^{m}G_{X_i}(z).
\end{align*}
Each factor is the same polynomial $(1-p)+pz$, so
\begin{align*}
G_{S_m}(z)=((1-p)+pz)^m.
\end{align*}
Apply the [binomial theorem](/theorems/750) with $a=1-p$ and $b=pz$:
\begin{align*}
((1-p)+pz)^m=\sum_{k=0}^{m}\binom{m}{k}(1-p)^{m-k}(pz)^k.
\end{align*}
For each $k$, $(pz)^k=p^kz^k$, so
\begin{align*}
((1-p)+pz)^m=\sum_{k=0}^{m}\binom{m}{k}(1-p)^{m-k}p^kz^k.
\end{align*}
Thus the coefficient of $z^k$ is
\begin{align*}
[z^k]G_{S_m}(z)=\binom{m}{k}p^k(1-p)^{m-k}, \qquad 0\le k\le m.
\end{align*}
By *[Coefficient Recovery for Probability Generating Functions](/theorems/9499)*,
\begin{align*}
\mathbb P(S_m=k)=[z^k]G_{S_m}(z).
\end{align*}
Therefore
\begin{align*}
\mathbb P(S_m=k)=\binom{m}{k}p^k(1-p)^{m-k}, \qquad 0\le k\le m.
\end{align*}
The pgf calculation recovers the binomial law by turning the independent sum of Bernoulli trials into multiplication of their one-trial generating functions.
[/example]
### Random Sums and Composition
Sometimes the number of summands is itself random: a random number of customers, claims, children, or clusters each contributes its own count. Conditioning on the number of summands gives a power of the inner pgf, and averaging those powers produces composition.
[quotetheorem:1131]
The formula says that the outer generating function counts how many clusters occur, while the inner generating function counts the size of each cluster. Substitution is therefore the analytic signature of a random number of independent copies.
[example: Compound Poisson Counts]
Let $N\sim\operatorname{Poi}(\lambda)$ with $\lambda>0$, and let $X_1,X_2,\dots$ be i.i.d. non-negative integer-valued random variables independent of $N$. For
\begin{align*}
S=\sum_{i=1}^{N}X_i,
\end{align*}
the *[PGF of a Random Sum](/theorems/1131)* gives
\begin{align*}
G_S(z)=G_N(G_X(z)).
\end{align*}
For $w\in[0,1]$, the Poisson pgf is
\begin{align*}
G_N(w)=\sum_{n=0}^{\infty}e^{-\lambda}\frac{\lambda^n}{n!}w^n.
\end{align*}
Factoring out $e^{-\lambda}$ and combining powers gives
\begin{align*}
G_N(w)=e^{-\lambda}\sum_{n=0}^{\infty}\frac{(\lambda w)^n}{n!}.
\end{align*}
Using the exponential series,
\begin{align*}
G_N(w)=e^{-\lambda}e^{\lambda w}.
\end{align*}
Since $e^{-\lambda}e^{\lambda w}=e^{\lambda w-\lambda}$, this is
\begin{align*}
G_N(w)=\exp(\lambda(w-1)).
\end{align*}
Substituting $w=G_X(z)$ gives
\begin{align*}
G_S(z)=\exp(\lambda(G_X(z)-1)).
\end{align*}
Now suppose $X_i\sim\operatorname{Ber}(p)$. Then
\begin{align*}
G_X(z)=\mathbb P(X_i=0)z^0+\mathbb P(X_i=1)z^1.
\end{align*}
Using $\mathbb P(X_i=0)=1-p$, $\mathbb P(X_i=1)=p$, $z^0=1$, and $z^1=z$, this becomes
\begin{align*}
G_X(z)=1-p+pz.
\end{align*}
Therefore
\begin{align*}
G_S(z)=\exp(\lambda((1-p+pz)-1)).
\end{align*}
Inside the exponent,
\begin{align*}
(1-p+pz)-1=pz-p.
\end{align*}
Factoring out $p$ gives
\begin{align*}
pz-p=p(z-1).
\end{align*}
Hence
\begin{align*}
G_S(z)=\exp(\lambda p(z-1)).
\end{align*}
This is the pgf of a Poisson random variable with parameter $\lambda p$, so by *Uniqueness of Probability Generating Functions*, $S\sim\operatorname{Poi}(\lambda p)$. Thus the pgf calculation shows Poisson thinning: keeping each point independently with probability $p$ changes the rate from $\lambda$ to $\lambda p$.
[/example]
The independence hypothesis cannot be silently dropped. If $Y=X$ for a Bernoulli random variable $X$, then $X+Y=2X$, so $G_{X+Y}(z)=1-p+pz^2$, while $G_X(z)G_Y(z)=(1-p+pz)^2$. The discrepancy measures dependence.
## Standard Distributions
A pgf is useful only if standard distributions have recognizable forms. The familiar families lead to polynomials, rational functions, and exponentials, each reflecting the structure of the underlying count.
The following formulas form the basic lookup table for count distributions. They also explain many closure properties: powers correspond to fixed sums, exponentials to Poisson superposition, and rational forms to waiting times.
[quotetheorem:9500]
The geometric formula uses the count of failures before the first success, so its support starts at $0$. This differs from Androma's default named convention $\operatorname{Geom}(p)$, where $\mathbb P(X=k)=(1-p)^{k-1}p$ for $k\ge1$. Both are common, and pgf calculations require stating the support.
[example: Two Geometric Conventions]
Let $X$ count failures before the first success, so for $k\in\{0,1,2,\dots\}$,
\begin{align*}
\mathbb P(X=k)=(1-p)^kp.
\end{align*}
By the definition of the probability generating function,
\begin{align*}
G_X(z)=\sum_{k=0}^{\infty}\mathbb P(X=k)z^k.
\end{align*}
Substituting the probabilities of $X$ gives
\begin{align*}
G_X(z)=\sum_{k=0}^{\infty}(1-p)^kpz^k.
\end{align*}
Since $p$ does not depend on $k$ and $(1-p)^kz^k=((1-p)z)^k$,
\begin{align*}
G_X(z)=p\sum_{k=0}^{\infty}((1-p)z)^k.
\end{align*}
For $z\in[0,1]$, we have $0\le (1-p)z<1$ when $p\in(0,1]$, so the geometric-series identity $\sum_{k=0}^{\infty}r^k=1/(1-r)$ gives
\begin{align*}
G_X(z)=\frac{p}{1-(1-p)z}.
\end{align*}
Let $Y=X+1$ count the trial on which the first success occurs. Then $Y$ takes values in $\{1,2,3,\dots\}$, and for $j\ge1$ the event $\{Y=j\}$ is the same as the event $\{X=j-1\}$. Hence
\begin{align*}
\mathbb P(Y=j)=\mathbb P(X=j-1)=(1-p)^{j-1}p.
\end{align*}
Using the pgf definition for $Y$,
\begin{align*}
G_Y(z)=\sum_{j=1}^{\infty}\mathbb P(Y=j)z^j.
\end{align*}
Substituting the displayed formula for $\mathbb P(Y=j)$ gives
\begin{align*}
G_Y(z)=\sum_{j=1}^{\infty}(1-p)^{j-1}pz^j.
\end{align*}
Write $j=k+1$, so $k$ ranges over $\{0,1,2,\dots\}$. Then
\begin{align*}
G_Y(z)=\sum_{k=0}^{\infty}(1-p)^kpz^{k+1}.
\end{align*}
Since $z^{k+1}=zz^k$ and $z$ is constant with respect to the summation index,
\begin{align*}
G_Y(z)=z\sum_{k=0}^{\infty}(1-p)^kpz^k.
\end{align*}
The sum is exactly $G_X(z)$, so
\begin{align*}
G_Y(z)=zG_X(z).
\end{align*}
Substituting the formula for $G_X(z)$,
\begin{align*}
G_Y(z)=\frac{pz}{1-(1-p)z}.
\end{align*}
The extra factor $z$ records the shift of the support from $\{0,1,2,\dots\}$ to $\{1,2,3,\dots\}$: every probability formerly attached to $z^k$ is now attached to $z^{k+1}$.
[/example]
The Poisson pgf is exponential, which explains why sums and thinnings of Poisson variables stay Poisson. The binomial pgf is a finite power, reflecting a fixed number of trials. The geometric pgf is rational, reflecting repeated independent waiting with no fixed endpoint.
## Branching and Counting Processes
Generating functions become especially powerful when a count at one time produces a random count at the next time. Instead of tracking a full infinite transition matrix, the process can often be described by iteration or differential equations for pgfs.
### Branching by Iteration
A branching process begins with individuals, each producing a random number of offspring. The whole population size is a count, and the law of one individual's reproduction is the object whose pgf will be iterated.
[definition: Offspring Probability Generating Function]
Let $\xi$ be a random variable taking values in $\{0,1,2,\dots\}$, where $\xi$ represents the number of offspring produced by one individual. The offspring probability generating function is the function $f:[0,1]\to[0,1]$ defined by
\begin{align*}
f(z)=\mathbb E[z^\xi]=\sum_{k=0}^{\infty}\mathbb P(\xi=k)z^k.
\end{align*}
[/definition]
The population in the next generation is the sum of one offspring count for each current individual. To state that recursion as a stochastic process, we specify an independent copy of the offspring count for each individual in each generation.
[definition: Galton-Watson Process]
Let $Z_0$ be a random variable taking values in $\{0,1,2,\dots\}$. Let $(\xi_{n,j})_{n,j\in\mathbb N}$ be i.i.d. random variables taking values in $\{0,1,2,\dots\}$, independent of $Z_0$. A Galton-Watson process with offspring distribution $\xi_{1,1}$ is the stochastic process $(Z_n)_{n\ge0}$ defined by
\begin{align*}
Z_{n+1}=\sum_{j=1}^{Z_n}\xi_{n+1,j}, \qquad n\ge0,
\end{align*}
where the empty sum is $0$.
[/definition]
This definition is probabilistic, but its distributional recursion is analytic. In the one-ancestor specialization $Z_0=1$, the pgf of $Z_n$ is obtained by iterating the offspring pgf.
[quotetheorem:1132]
The composition formula translates genealogy into function iteration. Extinction, survival, and expected population growth become questions about fixed points and derivatives of $f$.
The event of eventual extinction depends on every future generation, so it is not obtained by reading a single coefficient from a finite-time pgf. The fixed-point equation below is the bridge from finite-time iteration to the long-term survival question.
[quotetheorem:1134]
This theorem is a typical example of why pgfs matter beyond calculation. A probability of an event over infinitely many generations is characterized by a fixed point of a function on $[0,1]$.
[example: Binary or No-Offspring Branching]
Suppose each individual has either no children or two children, with
\begin{align*}
\mathbb P(\xi=0)=1-p, \qquad \mathbb P(\xi=2)=p.
\end{align*}
All other offspring counts have probability $0$, so the offspring pgf is
\begin{align*}
f(z)=\mathbb P(\xi=0)z^0+\mathbb P(\xi=2)z^2.
\end{align*}
Using $z^0=1$, this becomes
\begin{align*}
f(z)=(1-p)+pz^2.
\end{align*}
By *[Extinction Probability](/theorems/1134)*, the extinction probability $q$ is the smallest solution $s\in[0,1]$ of $f(s)=s$. Here that equation is
\begin{align*}
1-p+ps^2=s.
\end{align*}
Moving all terms to the left gives
\begin{align*}
ps^2-s+1-p=0.
\end{align*}
The claimed factorization is verified by expanding:
\begin{align*}
(s-1)(ps-(1-p))=s\cdot ps-s(1-p)-1\cdot ps+1\cdot(1-p).
\end{align*}
Since $s\cdot ps=ps^2$, this is
\begin{align*}
ps^2-s(1-p)-ps+1-p.
\end{align*}
Expanding $-s(1-p)=-s+ps$ gives
\begin{align*}
ps^2-s+ps-ps+1-p.
\end{align*}
The $ps$ terms cancel, so
\begin{align*}
(s-1)(ps-(1-p))=ps^2-s+1-p.
\end{align*}
Thus the fixed-point equation is equivalent to
\begin{align*}
(s-1)(ps-(1-p))=0.
\end{align*}
One solution is always $s=1$. If $p>0$, the second factor gives
\begin{align*}
ps-(1-p)=0,
\end{align*}
hence
\begin{align*}
ps=1-p.
\end{align*}
Dividing by $p$ gives
\begin{align*}
s=\frac{1-p}{p}.
\end{align*}
If $0<p\le 1/2$, then $1-p\ge p$, so $(1-p)/p\ge1$; the smallest solution in $[0,1]$ is therefore $q=1$. If $p=0$, the equation is $1=s$, so again $q=1$. If $p>1/2$, then $1-p<p$, so $0\le(1-p)/p<1$, and the smallest solution in $[0,1]$ is
\begin{align*}
q=\frac{1-p}{p}.
\end{align*}
Finally,
\begin{align*}
f'(z)=2pz,
\end{align*}
so
\begin{align*}
f'(1)=2p.
\end{align*}
The transition from certain extinction to possible survival occurs at $p=1/2$, exactly where the mean offspring number $2p$ crosses $1$.
[/example]
### Time-Dependent Count Laws
Branching is one way a count evolves, but many count processes are indexed by continuous time. When the probabilities $P_n(t)$ change with $t$, a two-variable pgf stores the whole time-dependent law and can turn infinitely many differential equations into one equation.
[definition: Time-Dependent Probability Generating Function]
Let $(P_n(t))_{n\ge0}$ be a probability distribution on $\{0,1,2,\dots\}$ for each $t\in[0,\infty)$. The time-dependent probability generating function is the function $P:[0,1]\times[0,\infty)\to[0,1]$ defined by
\begin{align*}
P(z,t)=\sum_{n=0}^{\infty}P_n(t)z^n.
\end{align*}
[/definition]
This is the form used when the distribution evolves with time. Differential equations for the probabilities $P_n(t)$ can often be compressed into a partial differential equation or [ordinary differential equation](/page/Ordinary%20Differential%20Equation) for $P(z,t)$.
[example: Pure Birth Process Generating Function]
Consider a process $(N_t)_{t\ge0}$ with $N_0=1$ in which each individual gives birth at rate $\lambda>0$ independently. Write $P_n(t)=\mathbb P(N_t=n)$ for $n\ge1$ and
\begin{align*}
P(z,t)=\sum_{n=1}^{\infty}P_n(t)z^n.
\end{align*}
The forward equations are
\begin{align*}
\frac{dP_1}{dt}(t)=-\lambda P_1(t)
\end{align*}
and, for $n\ge2$,
\begin{align*}
\frac{dP_n}{dt}(t)=\lambda(n-1)P_{n-1}(t)-\lambda nP_n(t),
\end{align*}
with $P_1(0)=1$ and $P_n(0)=0$ for $n\ge2$.
For $|z|<1$, multiplying the first equation by $z$ and the $n$th equation by $z^n$ gives
\begin{align*}
\frac{\partial P}{\partial t}(z,t)=-\lambda P_1(t)z+\sum_{n=2}^{\infty}\lambda(n-1)P_{n-1}(t)z^n-\sum_{n=2}^{\infty}\lambda nP_n(t)z^n.
\end{align*}
In the inflow sum, put $k=n-1$. Then $n\ge2$ is the same as $k\ge1$, and
\begin{align*}
\sum_{n=2}^{\infty}\lambda(n-1)P_{n-1}(t)z^n=\lambda\sum_{k=1}^{\infty}kP_k(t)z^{k+1}.
\end{align*}
Since
\begin{align*}
z\frac{\partial P}{\partial z}(z,t)=\sum_{k=1}^{\infty}kP_k(t)z^k,
\end{align*}
the inflow term is
\begin{align*}
\lambda\sum_{k=1}^{\infty}kP_k(t)z^{k+1}=\lambda z^2\frac{\partial P}{\partial z}(z,t).
\end{align*}
The outflow terms combine as
\begin{align*}
-\lambda P_1(t)z-\sum_{n=2}^{\infty}\lambda nP_n(t)z^n=-\lambda\sum_{n=1}^{\infty}nP_n(t)z^n.
\end{align*}
Again using $zP_z(z,t)=\sum_{n=1}^{\infty}nP_n(t)z^n$, this is
\begin{align*}
-\lambda\sum_{n=1}^{\infty}nP_n(t)z^n=-\lambda z\frac{\partial P}{\partial z}(z,t).
\end{align*}
Therefore
\begin{align*}
\frac{\partial P}{\partial t}(z,t)=\lambda z^2\frac{\partial P}{\partial z}(z,t)-\lambda z\frac{\partial P}{\partial z}(z,t).
\end{align*}
Factoring out $\lambda zP_z(z,t)$ gives
\begin{align*}
\frac{\partial P}{\partial t}(z,t)=\lambda z(z-1)\frac{\partial P}{\partial z}(z,t),
\end{align*}
and the initial condition is
\begin{align*}
P(z,0)=\sum_{n=1}^{\infty}P_n(0)z^n=z.
\end{align*}
Set
\begin{align*}
a(t)=e^{-\lambda t}
\end{align*}
and
\begin{align*}
D(z,t)=1-z(1-a(t)).
\end{align*}
We verify that
\begin{align*}
P(z,t)=\frac{za(t)}{D(z,t)}
\end{align*}
satisfies the pgf equation. First, at $t=0$ we have $a(0)=1$ and $D(z,0)=1$, so
\begin{align*}
P(z,0)=\frac{z\cdot 1}{1}=z.
\end{align*}
For the $z$-derivative, $D_z(z,t)=-(1-a(t))$, so the quotient rule gives
\begin{align*}
\frac{\partial P}{\partial z}(z,t)=\frac{a(t)D(z,t)-za(t)D_z(z,t)}{D(z,t)^2}.
\end{align*}
Substituting $D_z(z,t)=-(1-a(t))$ gives
\begin{align*}
\frac{\partial P}{\partial z}(z,t)=\frac{a(t)D(z,t)+za(t)(1-a(t))}{D(z,t)^2}.
\end{align*}
Since $D(z,t)=1-z(1-a(t))$, the numerator factor in parentheses is
\begin{align*}
D(z,t)+z(1-a(t))=1.
\end{align*}
Hence
\begin{align*}
\frac{\partial P}{\partial z}(z,t)=\frac{a(t)}{D(z,t)^2}.
\end{align*}
For the $t$-derivative, $a'(t)=-\lambda a(t)$ and
\begin{align*}
D_t(z,t)=z a'(t)=-\lambda za(t).
\end{align*}
The quotient rule gives
\begin{align*}
\frac{\partial P}{\partial t}(z,t)=\frac{za'(t)D(z,t)-za(t)D_t(z,t)}{D(z,t)^2}.
\end{align*}
Substituting $a'(t)=-\lambda a(t)$ and $D_t(z,t)=-\lambda za(t)$ gives
\begin{align*}
\frac{\partial P}{\partial t}(z,t)=\frac{-\lambda za(t)D(z,t)+\lambda z^2a(t)^2}{D(z,t)^2}.
\end{align*}
Factoring $\lambda za(t)$ gives
\begin{align*}
\frac{\partial P}{\partial t}(z,t)=\frac{\lambda za(t)(za(t)-D(z,t))}{D(z,t)^2}.
\end{align*}
Because $D(z,t)=1-z+za(t)$, we have
\begin{align*}
za(t)-D(z,t)=za(t)-(1-z+za(t))=z-1.
\end{align*}
Therefore
\begin{align*}
\frac{\partial P}{\partial t}(z,t)=\frac{\lambda za(t)(z-1)}{D(z,t)^2}.
\end{align*}
On the other hand,
\begin{align*}
\lambda z(z-1)\frac{\partial P}{\partial z}(z,t)=\lambda z(z-1)\frac{a(t)}{D(z,t)^2}.
\end{align*}
The two displayed expressions are equal, so the rational function satisfies the differential equation and the initial condition:
\begin{align*}
P(z,t)=\frac{ze^{-\lambda t}}{1-z(1-e^{-\lambda t})}.
\end{align*}
Now expand this pgf to recover the probabilities. Since $0\le 1-e^{-\lambda t}<1$ for $t>0$, the geometric-series identity gives
\begin{align*}
\frac{1}{1-z(1-e^{-\lambda t})}=\sum_{k=0}^{\infty}z^k(1-e^{-\lambda t})^k
\end{align*}
for $|z|<1/(1-e^{-\lambda t})$. Multiplying by $ze^{-\lambda t}$ gives
\begin{align*}
P(z,t)=\sum_{k=0}^{\infty}e^{-\lambda t}(1-e^{-\lambda t})^k z^{k+1}.
\end{align*}
Writing $n=k+1$, the coefficient of $z^n$ is
\begin{align*}
P_n(t)=e^{-\lambda t}(1-e^{-\lambda t})^{n-1}, \qquad n\ge1.
\end{align*}
At $t=0$, this gives $P_1(0)=1$ and $P_n(0)=0$ for $n\ge2$, matching the initial condition. The generating function converts the infinite forward system into one differential equation, and coefficient extraction then recovers the full time-dependent law.
[/example]
The time parameter does not change the coefficient logic: $P_n(t)=[z^n]P(z,t)$. It adds a second axis along which the encoded distribution evolves.
## Analytic Behaviour and Limitations
A pgf is both a probability object and an analytic object. The positivity of its coefficients gives monotonicity and convexity on $[0,1]$, while the endpoint $z=1$ stores normalization and moments.
Positive coefficients force strong shape restrictions. Those restrictions are useful in fixed-point problems, because the graph of a pgf cannot oscillate arbitrarily on the unit interval.
[quotetheorem:9501]
These shape properties are what make fixed-point arguments in branching processes work. A convex curve crossing the diagonal has a controlled geometry, and the slope at $1$ determines the subcritical, critical, and supercritical regimes.
The endpoint $1$ requires care. A pgf is always defined at $1$, but derivatives at $1$ may be infinite, and the [radius of convergence](/theorems/265) may be exactly $1$.
[example: A Probability Generating Function with Infinite Mean]
Let $X$ have probabilities
\begin{align*}
\mathbb P(X=k)=\frac{c}{(k+1)^2}, \qquad k\in\{0,1,2,\dots\},
\end{align*}
where $c>0$ is chosen so that
\begin{align*}
\sum_{k=0}^{\infty}\frac{c}{(k+1)^2}=1.
\end{align*}
The corresponding probability generating function is
\begin{align*}
G_X(z)=\sum_{k=0}^{\infty}\mathbb P(X=k)z^k.
\end{align*}
Substituting the displayed probabilities gives
\begin{align*}
G_X(z)=\sum_{k=0}^{\infty}\frac{c}{(k+1)^2}z^k.
\end{align*}
At $z=1$,
\begin{align*}
G_X(1)=\sum_{k=0}^{\infty}\frac{c}{(k+1)^2}1^k.
\end{align*}
Since $1^k=1$ for every $k$,
\begin{align*}
G_X(1)=\sum_{k=0}^{\infty}\frac{c}{(k+1)^2}=1.
\end{align*}
Now compute the mean from the definition of expectation for a non-negative integer-valued random variable:
\begin{align*}
\mathbb E[X]=\sum_{k=0}^{\infty}k\mathbb P(X=k).
\end{align*}
Substituting $\mathbb P(X=k)=c/(k+1)^2$ gives
\begin{align*}
\mathbb E[X]=\sum_{k=0}^{\infty}\frac{ck}{(k+1)^2}.
\end{align*}
The $k=0$ term is $0$, so
\begin{align*}
\mathbb E[X]=c\sum_{k=1}^{\infty}\frac{k}{(k+1)^2}.
\end{align*}
For $k\ge1$, we have $k+1\le 2k$, hence
\begin{align*}
(k+1)^2\le 4k^2.
\end{align*}
Taking reciprocals of positive quantities gives
\begin{align*}
\frac{1}{(k+1)^2}\ge \frac{1}{4k^2}.
\end{align*}
Multiplying by $k>0$ gives
\begin{align*}
\frac{k}{(k+1)^2}\ge \frac{1}{4k}.
\end{align*}
Therefore
\begin{align*}
c\sum_{k=1}^{\infty}\frac{k}{(k+1)^2}\ge \frac{c}{4}\sum_{k=1}^{\infty}\frac{1}{k}.
\end{align*}
The harmonic series diverges, so
\begin{align*}
\mathbb E[X]=\infty.
\end{align*}
Thus the pgf is perfectly normalized at the endpoint, with $G_X(1)=1$, but the first moment is infinite. Endpoint differentiability of a pgf is a moment condition, not automatic regularity.
[/example]
This example also shows why pgf arguments often evaluate derivatives first at $z<1$ and then take a limit as $z\uparrow1$. Monotone convergence justifies many such passages when the relevant moment is allowed to be infinite.
Not every discrete distribution is naturally handled by a pgf. The support must sit inside $\{0,1,2,\dots\}$ unless the definition is modified.
[example: Why Negative Values Do Not Fit the Basic Definition]
Let $X$ take values $-1$ and $1$, with
\begin{align*}
\mathbb P(X=-1)=\frac12
\end{align*}
and
\begin{align*}
\mathbb P(X=1)=\frac12.
\end{align*}
If we try to imitate the probability generating function formula, then expectation over the two possible values gives
\begin{align*}
\mathbb E[z^X]=z^{-1}\mathbb P(X=-1)+z^1\mathbb P(X=1).
\end{align*}
Substituting the two probabilities,
\begin{align*}
\mathbb E[z^X]=z^{-1}\cdot\frac12+z\cdot\frac12.
\end{align*}
Reordering the scalar factors gives
\begin{align*}
\mathbb E[z^X]=\frac12 z^{-1}+\frac12 z.
\end{align*}
The term $\frac12z^{-1}$ has exponent $-1$, so this expression is not an ordinary power series of the form
\begin{align*}
\sum_{k=0}^{\infty}a_kz^k.
\end{align*}
It is instead a Laurent polynomial, with one negative-power term and one positive-power term. Coefficient extraction from non-negative powers can read
\begin{align*}
[z^1]\left(\frac12z^{-1}+\frac12z\right)=\frac12,
\end{align*}
but it has no non-negative index corresponding to the mass at $-1$. Thus the basic pgf framework cannot encode this law as probabilities attached only to $z^0,z^1,z^2,\dots$. Characteristic functions or moment generating functions are better suited to random variables with unrestricted integer or real support.
[/example]
A pgf is therefore a specialized tool, not a universal transform. Its specialization is also its strength: for non-negative counts, it aligns exactly with sums, coefficients, and branching recursions.
## Beyond and Connected Topics
Probability generating functions sit at the meeting point of probability, combinatorics, and analysis. The most immediate continuation is the study of discrete distributions in [Cambridge IA Probability](/page/Cambridge%20IA%20Probability), where Bernoulli, binomial, geometric, and Poisson variables supply the first examples of the pgf method.
A second direction is measure-theoretic probability. In [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure), random variables are treated as measurable maps and distributions as pushforward measures. From that viewpoint, $G_X(z)=\int z^x\,d\mu_X(x)$ for a law supported on $\{0,1,2,\dots\}$, so pgfs become a special case of integrating test functions against a probability measure.
A third direction is advanced stochastic processes. In [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability), generating functions appear naturally in branching processes, martingales associated with population growth, and limit theorems for random counts.
Generating functions also connect to analytic combinatorics. When coefficients count weighted structures rather than probabilities, the same algebraic operations describe products, sequences, and recursive constructions. Probability enters when the coefficients are normalized to sum to $1$.
Finally, pgfs should be compared with moment generating functions and characteristic functions. Moment generating functions encode ordinary exponential moments but may fail to exist near $0$. Characteristic functions always exist for real-valued random variables and determine the law, but they do not turn sums of non-negative integer counts into coefficient extraction as directly as pgfs do.
## References
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
William Feller, *An Introduction to Probability Theory and Its Applications, Volume I* (1968).
Geoffrey Grimmett and David Stirzaker, *Probability and Random Processes* (2001).
Herbert S. Wilf, *Generatingfunctionology* (1994).
Probability Generating Function
Also known as: Probability generating function, PGF, probability generating functions, discrete generating function, count distribution generating function