Moment Generating Function

Also known as: MGF, Moment-generating function, Exponential moment transform, Moment transform, Moment generating transform

A moment [generating function](/page/Generating%20Function) answers a practical question: how much of a distribution can be recovered from the exponential averages $\mathbb E[e^{tX}]$? Ordinary moments $\mathbb E[X^n]$ are useful, but they arrive one at a time and can be hard to assemble into a recognizable distribution. The exponential function packages them into a single analytic object when the expectation exists near $0$, and that package turns sums of independent random variables into products. The central tension is that this package is powerful only when it exists on a genuine interval around $0$. Some random variables have every ordinary moment but no moment generating function away from $0$; others have a moment generating function on one side but not the other. The theory is therefore not just about a formula. It is about the domain of finiteness, the analytic information carried near the origin, and the way independence turns convolution into multiplication. [example: The Exponential Distribution and the Boundary of Its MGF Domain] Let $X \sim \operatorname{Exp}(\lambda)$ with $\lambda>0$, so its density is $f_X(x)=\lambda e^{-\lambda x}$ on $[0,\infty)$. Throughout this example, $d\mathcal L^1(x)$ means integration with respect to one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on the real line. We write $M_X(t):=\mathbb E[e^{tX}]$ whenever this expectation is finite, and $D_X:=\{t\in\mathbb R:\mathbb E[e^{tX}]<\infty\}$ for the set of such parameters. For $t \in \mathbb R$, the expectation formula for a [random variable](/page/Random%20Variable) with density gives \begin{align*} \mathbb E[e^{tX}]=\int_0^\infty e^{tx}\lambda e^{-\lambda x}\,d\mathcal L^1(x). \end{align*} Since $e^{tx}e^{-\lambda x}=e^{(t-\lambda)x}=e^{-(\lambda-t)x}$, this becomes \begin{align*} \mathbb E[e^{tX}]=\lambda\int_0^\infty e^{-(\lambda-t)x}\,d\mathcal L^1(x).
\end{align*} If $t<\lambda$, then $\lambda-t>0$, and for each $R>0$, \begin{align*} \lambda\int_0^R e^{-(\lambda-t)x}\,d\mathcal L^1(x)=\lambda\left[\frac{-e^{-(\lambda-t)x}}{\lambda-t}\right]_{0}^{R}. \end{align*} Evaluating the endpoints gives \begin{align*} \lambda\left[\frac{-e^{-(\lambda-t)x}}{\lambda-t}\right]_{0}^{R}=\frac{\lambda}{\lambda-t}\left(1-e^{-(\lambda-t)R}\right). \end{align*} Letting $R\to\infty$ and using $e^{-(\lambda-t)R}\to 0$ gives \begin{align*} \mathbb E[e^{tX}]=\frac{\lambda}{\lambda-t}. \end{align*} If $t=\lambda$, then the integrand is $\lambda e^0=\lambda$, so \begin{align*} \lambda\int_0^R 1\,d\mathcal L^1(x)=\lambda R, \end{align*} which tends to $\infty$ as $R\to\infty$. If $t>\lambda$, then $t-\lambda>0$, and \begin{align*} \lambda\int_0^R e^{(t-\lambda)x}\,d\mathcal L^1(x)=\frac{\lambda}{t-\lambda}\left(e^{(t-\lambda)R}-1\right), \end{align*} which also tends to $\infty$ as $R\to\infty$. Hence \begin{align*} D_X=(-\infty,\lambda) \end{align*} and \begin{align*} M_X(t)=\frac{\lambda}{\lambda-t} \end{align*} for $t<\lambda$. This example shows that the formula for $M_X(t)$ is inseparable from its domain: the same expression that is finite near $0$ reaches a boundary at $t=\lambda$, where the positive exponential weight overwhelms the exponential tail of $X$. [/example] The exponential example already contains the main theme. An mgf is not a decorative transform; it measures the ability of the distribution to absorb exponential weights. Positive $t$ probes the right tail of $X$, negative $t$ probes the left tail, and a neighbourhood of $0$ gives enough two-sided control to unlock uniqueness, moments, and convergence theorems. ## Definition ### The Transform The ordinary moments of $X$ are coefficients one would expect to see in the expansion of $e^{tX}$. To make that idea legitimate, the transform must remember where the expectation is finite. This prevents the notation $M_X(t)$ from hiding an infinite quantity. 
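One way to keep the domain attached to the formula in computations is to let the function report infinity outside $D_X$. The following Python sketch does this for the exponential example above and compares the closed form $\lambda/(\lambda-t)$ with a crude numerical integral. The function names `exp_mgf` and `exp_mgf_numeric`, the convention of returning `math.inf` outside $D_X$, and the truncation point and step count are illustrative assumptions, not part of the theory.

```python
import math


def exp_mgf(t: float, lam: float) -> float:
    """Closed-form mgf of Exp(lam), with the domain D_X = (-inf, lam) made explicit."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if t >= lam:
        # Outside D_X the expectation E[e^{tX}] is infinite.
        return math.inf
    return lam / (lam - t)


def exp_mgf_numeric(t: float, lam: float, upper: float = 60.0, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of lam * integral_0^upper e^{-(lam - t) x} dx.

    Only meaningful for t < lam; `upper` and `steps` are illustrative choices.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += lam * math.exp(-(lam - t) * x)
    return total * h


if __name__ == "__main__":
    print(exp_mgf(1.0, 2.0))          # closed form at t = 1, lam = 2
    print(exp_mgf_numeric(1.0, 2.0))  # numerical approximation of the same value
    print(exp_mgf(2.0, 2.0))          # boundary point t = lam lies outside D_X
```

Returning infinity rather than evaluating $\lambda/(\lambda-t)$ blindly avoids the negative values the raw formula would produce for $t>\lambda$, which is exactly the kind of hidden infinite quantity the definition is designed to exclude.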
[definition: Moment Generating Function] Let $X: (\Omega, \mathcal F, \mathbb P) \to (\mathbb R, \mathcal B(\mathbb R))$ be a real-valued random variable, and set \begin{align*} D_X := \{t \in \mathbb R : \mathbb E[e^{tX}] < \infty\}. \end{align*} The moment generating function of $X$ is the map \begin{align*} M_X: D_X \to (0,\infty). \end{align*} For $t \in D_X$, it is defined by \begin{align*} M_X(t):=\mathbb E[e^{tX}]. \end{align*} [/definition] This definition matches the undergraduate convention $M_X(t)=\mathbb E[e^{tX}]$, but it makes the hidden qualifier explicit: the expression is part of the mgf only for those $t \in \mathbb R$ where the expectation is finite. ### Domain of Finiteness The set $D_X$ always contains $0$, because $e^{0X}=1$. What matters is whether $D_X$ contains more than that single point. Since this set controls which exponential tilts are legitimate, it is useful to name it separately. [definition: Domain of the Moment Generating Function] Let $X: (\Omega, \mathcal F, \mathbb P) \to (\mathbb R, \mathcal B(\mathbb R))$ be a real-valued random variable. The domain of the moment generating function of $X$ is \begin{align*} D_X := \{t \in \mathbb R : \mathbb E[e^{tX}] < \infty\}. \end{align*} [/definition] The domain records the exponential weights that the distribution can absorb. A wide domain signals strong tail decay; a one-sided domain signals asymmetry; and the absence of any open interval around $0$ blocks the strongest mgf theorems. A common habit is to calculate a formula and then forget where it is valid. The basic finite-distribution case shows why bounded random variables create no domain restriction. [example: Bernoulli Moment Generating Function] Let $X \sim \operatorname{Ber}(p)$ with $p \in [0,1]$, so $\mathbb P(X=0)=1-p$ and $\mathbb P(X=1)=p$. 
For any $t \in \mathbb R$, the expectation of the function $x \mapsto e^{tx}$ is the finite weighted sum over the two possible values of $X$: \begin{align*} M_X(t)=\mathbb E[e^{tX}]=(1-p)e^{t\cdot 0}+p e^{t\cdot 1}. \end{align*} Since $t\cdot 0=0$, $t\cdot 1=t$, and $e^0=1$, this gives \begin{align*} M_X(t)=(1-p)\cdot 1+pe^t. \end{align*} Therefore \begin{align*} M_X(t)=1-p+pe^t. \end{align*} Both $1-p$ and $p$ are finite constants, and $e^t<\infty$ for every real $t$, so $M_X(t)<\infty$ for every $t \in \mathbb R$. Hence $D_X=\mathbb R$. This example shows why finite-valued random variables are analytically simple: their mgfs are finite sums of exponential functions. [/example] The transform is especially natural in probability because it reacts well to independence. Before reaching that, we need to understand what the domain looks like and why a small interval around $0$ is a much stronger assumption than mere existence at isolated points. ## Finiteness and Exponential Tails ### Convexity of the Domain The parameter $t$ is not a formal symbol. It changes the distribution by weighting outcomes according to $e^{tX}$. Large positive values of $X$ are magnified when $t>0$, while large negative values are magnified when $t<0$. Thus the domain of the mgf records two-sided tail information. The domain is not an arbitrary subset of $\mathbb R$. Convexity is the structural fact that makes the theory stable: if two exponential moments are finite, then all exponential moments between them are finite. This is the first reason the interval around $0$ becomes the natural object. [quotetheorem:6043] Convexity explains why mgf domains are intervals, possibly with endpoints omitted. It also tells us that checking a small amount of two-sided exponential integrability near $0$ is enough to obtain a whole open interval of finite values. Some distributions fail exactly because one tail is too heavy. 
The next example is the standard warning: all positive moments need not imply a useful mgf. [example: Lognormal Moments Do Not Force an MGF] Let $Y \sim \mathcal N(0,1)$ and set $X=e^Y$. Then $X$ is lognormal. For each $n \in \mathbb N$, the density formula for $Y$ gives \begin{align*} \mathbb E[X^n]=\mathbb E[e^{nY}]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{ny}e^{-y^2/2}\,d\mathcal L^1(y). \end{align*} Since \begin{align*} ny-\frac{y^2}{2}=-\frac{(y-n)^2}{2}+\frac{n^2}{2}, \end{align*} we get \begin{align*} \mathbb E[X^n]=e^{n^2/2}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(y-n)^2/2}\,d\mathcal L^1(y). \end{align*} With the change of variables $z=y-n$, \begin{align*} \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(y-n)^2/2}\,d\mathcal L^1(y)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,d\mathcal L^1(z)=1. \end{align*} Therefore \begin{align*} \mathbb E[X^n]=e^{n^2/2}<\infty. \end{align*} So every ordinary moment exists. Now fix $t>0$. Again using the density of $Y$, \begin{align*} \mathbb E[e^{tX}]=\mathbb E[e^{t e^Y}]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{t e^y}e^{-y^2/2}\,d\mathcal L^1(y). \end{align*} Since $e^y/y^2\to\infty$ as $y\to\infty$, there exists $R>0$ such that $t e^y\ge y^2$ for every $y\ge R$. Hence, for $y\ge R$, \begin{align*} e^{t e^y}e^{-y^2/2}\ge e^{y^2}e^{-y^2/2}=e^{y^2/2}. \end{align*} Therefore \begin{align*} \mathbb E[e^{tX}]\ge \frac{1}{\sqrt{2\pi}}\int_R^\infty e^{y^2/2}\,d\mathcal L^1(y). \end{align*} For $y\ge R$, $e^{y^2/2}\ge 1$, so \begin{align*} \int_R^\infty e^{y^2/2}\,d\mathcal L^1(y)\ge \int_R^\infty 1\,d\mathcal L^1(y)=\infty. \end{align*} Thus $\mathbb E[e^{tX}]=\infty$ for every $t>0$. If $t\le 0$, then $X=e^Y\ge 0$, so $tX\le 0$ and $0<e^{tX}\le 1$. Hence \begin{align*} \mathbb E[e^{tX}]\le 1<\infty. \end{align*} Consequently $D_X=(-\infty,0]$. 
This example shows that ordinary moments only test powers of the tail, while positive exponential moments test a much stronger right-tail condition. [/example] This example separates two ideas that are often conflated. Moments ask whether powers of $X$ are integrable. The mgf asks whether an exponential of $X$ is integrable. Exponential integrability is stronger in the relevant tail and carries analytic consequences that ordinary moments alone may not provide. ### Local Exponential Integrability For many estimates, we do not need the exact mgf. We need to know that it is finite near the origin. This condition deserves its own name because it is the hypothesis behind Chernoff bounds, uniqueness, and convergence by mgfs. [definition: Exponential Integrability Near Zero] A real-valued random variable $X: (\Omega, \mathcal F, \mathbb P) \to (\mathbb R, \mathcal B(\mathbb R))$ is exponentially integrable near zero if there exists $a>0$ such that \begin{align*} \mathbb E[e^{tX}] < \infty \end{align*} for every $t \in (-a,a)$. [/definition] Exponential integrability near zero is the condition that turns the mgf from a partially defined expectation into a local analytic transform. The word "near" is essential: finiteness at $t=0$ alone is automatic and carries no tail information. ## Moments from Differentiation ### Derivatives at the Origin The name "moment generating function" comes from differentiating at the origin. Formally, expanding $e^{tX}$ gives powers of $X$. The mathematical issue is whether expectation and differentiation may be interchanged. A neighbourhood of finite exponential moments supplies the domination needed for this operation. The first derivative should recover the mean, the [second derivative](/page/Second%20Derivative) should recover the second moment, and higher derivatives should recover higher moments. The following theorem states the precise local version. [quotetheorem:9544] This theorem explains why the mgf is more than a transform. 
It is a compact storage device for all moments, provided the storage device exists locally around $0$. The first two derivatives give the most familiar statistics. It is useful to see them as derivatives before introducing centered variants, because the same calculation underlies cumulants and Gaussian computations. [example: Mean and Variance from an MGF] Assume $M_X$ is finite on an open interval around $0$. The derivative identities for local moment generating functions give, at $t=0$, \begin{align*} M_X'(0)=\mathbb E[Xe^{0X}]. \end{align*} Since $e^{0X}=1$, this becomes \begin{align*} M_X'(0)=\mathbb E[X]. \end{align*} The same theorem with $n=2$ gives \begin{align*} M_X''(0)=\mathbb E[X^2e^{0X}]. \end{align*} Again using $e^{0X}=1$, \begin{align*} M_X''(0)=\mathbb E[X^2]. \end{align*} By the definition of variance, \begin{align*} \operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E[X])^2. \end{align*} Substituting the two derivative identities gives \begin{align*} \operatorname{Var}(X)=M_X''(0)-(M_X'(0))^2. \end{align*} For $X\sim\operatorname{Ber}(p)$, the Bernoulli mgf is \begin{align*} M_X(t)=1-p+pe^t. \end{align*} Differentiating term by term, \begin{align*} M_X'(t)=pe^t. \end{align*} Evaluating at $t=0$ gives \begin{align*} M_X'(0)=pe^0=p. \end{align*} Differentiating once more, \begin{align*} M_X''(t)=pe^t. \end{align*} Thus \begin{align*} M_X''(0)=pe^0=p. \end{align*} Therefore \begin{align*} \operatorname{Var}(X)=p-p^2. \end{align*} Factoring out $p$ gives \begin{align*} p-p^2=p(1-p). \end{align*} So $\operatorname{Var}(X)=p(1-p)$. This example shows how a distributional calculation becomes a differentiation calculation once the mgf is known. [/example] ### Analytic Expansion The derivative identities invite a stronger question: can the whole mgf be reconstructed from the moments as a [power series](/page/Power%20Series)? 
Local exponential integrability is precisely the hypothesis that makes this analytic expansion a theorem rather than a formal manipulation. [quotetheorem:9545] The theorem also warns against reversing the logic without hypotheses. A sequence of moments may exist, and the formal series may have radius zero or fail to determine the law. The mgf hypothesis gives a genuine [analytic function](/page/Analytic%20Function), not only a list of coefficients. ## Independence and Sums ### Products Replace Convolutions A large part of probability studies sums of independent random variables. Densities of sums require convolution, and distribution functions can be awkward. Moment generating functions replace that operation by multiplication, which is the main computational reason they are so useful. The multiplication rule rests on the fact that independence lets expectations of products factor. The exponential function turns a sum into a product before the expectation is taken, but this is useful only on values of $t$ where all the relevant exponential expectations are finite. Thus the rule is not merely an algebraic slogan: it is a statement about independent random variables whose mgfs exist on a common interval around $0$. [quotetheorem:1120] Here the hypotheses do real work. Independence permits \begin{align*} e^{t(X+Y)}=e^{tX}e^{tY} \end{align*} to pass through expectation as a product, while finite exponential expectations ensure that both sides are genuine mgfs rather than formal expressions. The theorem is therefore the computational engine behind the product rule for mgfs; by itself it only computes the [mgf of a sum](/theorems/1144). The later uniqueness theorem is what lets an equality of mgfs identify the distribution. The following examples should be read in that order: first compute the mgf of the sum by multiplication, then use the uniqueness principle stated in the next section to justify the final distributional identification. 
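For finitely supported random variables the product rule can be checked directly, since every mgf is a finite sum and $D_X=\mathbb R$. The Python sketch below builds the law of $X+Y$ by convolving two probability mass functions and compares its mgf with the product $M_X(t)M_Y(t)$. The helper names `mgf` and `convolve`, the dictionary representation of a pmf, and the sample distributions are hypothetical choices made only for illustration.

```python
import math
from itertools import product


def mgf(pmf: dict, t: float) -> float:
    """E[e^{tX}] for a finitely supported pmf given as {value: probability}."""
    return sum(p * math.exp(t * x) for x, p in pmf.items())


def convolve(pmf_x: dict, pmf_y: dict) -> dict:
    """pmf of X + Y when X and Y are independent with the given pmfs."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        # Independence: P(X = x, Y = y) = P(X = x) * P(Y = y).
        out[x + y] = out.get(x + y, 0.0) + px * py
    return out


if __name__ == "__main__":
    pmf_x = {0: 0.7, 1: 0.3}          # a Bernoulli(0.3) law
    pmf_y = {0: 0.2, 2: 0.5, 5: 0.3}  # an arbitrary three-point law
    pmf_sum = convolve(pmf_x, pmf_y)
    for t in (-1.0, 0.5, 2.0):
        lhs = mgf(pmf_sum, t)
        rhs = mgf(pmf_x, t) * mgf(pmf_y, t)
        print(t, lhs, rhs, math.isclose(lhs, rhs, rel_tol=1e-12))
```

The convolution step is the operation that the mgf replaces: the left-hand side needs the full law of the sum, while the right-hand side needs only the two individual transforms.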
The binomial distribution is the cleanest illustration. Instead of counting the number of successes directly, we build it as a sum of Bernoulli trials and multiply their mgfs. [example: Binomial Distribution from Bernoulli Sums] Let $X_1,\ldots,X_n$ be independent random variables with $X_i \sim \operatorname{Ber}(p)$ for each $i$, and set \begin{align*} S_n=\sum_{i=1}^n X_i. \end{align*} For each $i$, the Bernoulli mgf is \begin{align*} M_{X_i}(t)=\mathbb E[e^{tX_i}]=1-p+pe^t. \end{align*} Since the variables are independent, the product rule for mgfs of independent sums gives \begin{align*} M_{S_n}(t)=\prod_{i=1}^n M_{X_i}(t). \end{align*} Substituting the common Bernoulli mgf into each factor gives \begin{align*} M_{S_n}(t)=\prod_{i=1}^n (1-p+pe^t). \end{align*} There are $n$ identical factors, so \begin{align*} M_{S_n}(t)=(1-p+pe^t)^n. \end{align*} To compare this with the binomial distribution, let $B \sim \operatorname{Bin}(n,p)$. Then \begin{align*} \mathbb P(B=k)=\binom nk p^k(1-p)^{n-k} \end{align*} for $k=0,\ldots,n$, so \begin{align*} M_B(t)=\mathbb E[e^{tB}]=\sum_{k=0}^n e^{tk}\binom nk p^k(1-p)^{n-k}. \end{align*} Since $e^{tk}=(e^t)^k$, this becomes \begin{align*} M_B(t)=\sum_{k=0}^n \binom nk (pe^t)^k(1-p)^{n-k}. \end{align*} By the [binomial theorem](/theorems/750), \begin{align*} M_B(t)=(1-p+pe^t)^n. \end{align*} Thus $M_{S_n}(t)=M_B(t)$ for every $t \in \mathbb R$. By the *Uniqueness Theorem for Moment Generating Functions*, $S_n$ has the same distribution as $B$, so \begin{align*} S_n \sim \operatorname{Bin}(n,p). \end{align*} This shows that the law of the number of successes is recovered by multiplying the mgfs of the independent Bernoulli trials. [/example] ### Stable Families The same multiplication principle explains why Gaussian random variables are stable under independent addition. The calculation is short, but it encodes one of the central closure properties of the normal family. 
[example: Gaussian Sums] Let $X \sim \mathcal N(\mu_1,\sigma_1^2)$ and $Y \sim \mathcal N(\mu_2,\sigma_2^2)$ be independent, where $\sigma_1,\sigma_2>0$. For every $t \in \mathbb R$, the normal mgf formula gives \begin{align*} M_X(t)=\exp\left(\mu_1 t+\frac{\sigma_1^2t^2}{2}\right) \end{align*} and \begin{align*} M_Y(t)=\exp\left(\mu_2 t+\frac{\sigma_2^2t^2}{2}\right). \end{align*} By the product rule for mgfs of independent sums, \begin{align*} M_{X+Y}(t)=M_X(t)M_Y(t). \end{align*} Substituting the two formulas gives \begin{align*} M_{X+Y}(t)=\exp\left(\mu_1 t+\frac{\sigma_1^2t^2}{2}\right)\exp\left(\mu_2 t+\frac{\sigma_2^2t^2}{2}\right). \end{align*} Using $\exp(a)\exp(b)=\exp(a+b)$, \begin{align*} M_{X+Y}(t)=\exp\left(\mu_1 t+\frac{\sigma_1^2t^2}{2}+\mu_2 t+\frac{\sigma_2^2t^2}{2}\right). \end{align*} Grouping the linear terms and the quadratic terms, \begin{align*} \mu_1 t+\mu_2 t=(\mu_1+\mu_2)t \end{align*} and \begin{align*} \frac{\sigma_1^2t^2}{2}+\frac{\sigma_2^2t^2}{2}=\frac{(\sigma_1^2+\sigma_2^2)t^2}{2}. \end{align*} Therefore \begin{align*} M_{X+Y}(t)=\exp\left((\mu_1+\mu_2)t+\frac{(\sigma_1^2+\sigma_2^2)t^2}{2}\right). \end{align*} This is the mgf of $\mathcal N(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)$. Since the two mgfs agree for every $t \in \mathbb R$, hence on an open interval around $0$, the *Uniqueness Theorem for Moment Generating Functions* gives \begin{align*} X+Y \sim \mathcal N(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2). \end{align*} The computation shows that independence preserves the Gaussian family, with means adding and variances adding. [/example] The calculation relies on a fact not yet stated: if two random variables have the same mgf near $0$, then they have the same distribution. That uniqueness principle is what allows us to identify the law from the transform. ## Uniqueness and Identification of Laws A transform is useful for identification only if it does not lose information.
For mgfs, the correct uniqueness statement is local: agreement on an open interval around $0$ determines the distribution. Agreement at isolated points does not have the same force. The theorem below is the reason mgf tables work. Once a computed mgf matches a known mgf near $0$, the distribution has been identified. [quotetheorem:1142] The local interval hypothesis is the important part. It is not enough to match moments, that is, derivatives at $0$, unless a moment determinacy theorem is available. The mgf uniqueness theorem avoids that difficulty by using the analytic transform itself. A useful practical workflow is therefore: compute the mgf, check its domain includes a neighbourhood of $0$, compare it with a known transform, and then invoke uniqueness. [example: Identifying a Poisson Law] Let $\lambda>0$, and suppose $X$ is a real-valued random variable whose moment generating function satisfies $M_X(t)=\exp(\lambda(e^t-1))$ for every $t\in\mathbb R$. We show that $X\sim\operatorname{Poi}(\lambda)$. Let $Y \sim \operatorname{Poi}(\lambda)$. We compute its moment generating function and compare it with the given transform of $X$. For every $t\in\mathbb R$, \begin{align*} M_Y(t)=\mathbb E[e^{tY}]. \end{align*} Since $Y$ takes values in $\mathbb N\cup\{0\}$ and $\mathbb P(Y=k)=e^{-\lambda}\lambda^k/k!$, the expectation is \begin{align*} M_Y(t)=\sum_{k=0}^{\infty} e^{tk} e^{-\lambda}\frac{\lambda^k}{k!}. \end{align*} Using $e^{tk}=(e^t)^k$, we get \begin{align*} M_Y(t)=e^{-\lambda}\sum_{k=0}^{\infty}\frac{(\lambda e^t)^k}{k!}. \end{align*} By the exponential series $\exp u=\sum_{k=0}^{\infty}u^k/k!$, with $u=\lambda e^t$, this becomes \begin{align*} M_Y(t)=e^{-\lambda}\exp(\lambda e^t). \end{align*} Since $e^a e^b=e^{a+b}$, \begin{align*} M_Y(t)=\exp(-\lambda+\lambda e^t). \end{align*} Factoring out $\lambda$ in the exponent gives \begin{align*} M_Y(t)=\exp(\lambda(e^t-1)). \end{align*} The hypothesis says that \begin{align*} M_X(t)=\exp(\lambda(e^t-1)) \end{align*} for every $t\in\mathbb R$, so $M_X(t)=M_Y(t)$ for every $t\in\mathbb R$, in particular on an open interval around $0$.
By the *Uniqueness Theorem for Moment Generating Functions*, $X$ and $Y$ have the same distribution. Therefore \begin{align*} X\sim \operatorname{Poi}(\lambda). \end{align*} The mgf identifies the law because it agrees locally with the Poisson mgf, not merely because it has the right-looking formula. [/example] Uniqueness also explains why the Gaussian computation above was not merely suggestive. Matching the normal mgf on a neighbourhood of $0$ proves the resulting distribution, rather than only matching its mean and variance. ## Cumulants and Logarithmic Structure Products of mgfs are natural for sums, but logarithms turn those products into sums. This is the reason cumulants are additive under independence. They isolate distributional quantities that behave linearly when independent random variables are added. To use logarithms, the mgf must be positive and finite near $0$, so its logarithm is well-defined there. This motivates a transform that stores the same local information as the mgf but is adapted to addition of independent variables. [definition: Cumulant Generating Function] Let $X: (\Omega, \mathcal F, \mathbb P) \to (\mathbb R, \mathcal B(\mathbb R))$ be a real-valued random variable whose moment generating function is finite on an open interval containing $0$. Let $I_X$ be the [connected component](/page/Connected%20Component) of the interior of $D_X$ that contains $0$. The cumulant generating function of $X$ is the map $K_X: I_X \to \mathbb R$ defined by \begin{align*} K_X(t):=\log M_X(t). \end{align*} [/definition] The cumulant generating function is not a new distributional transform with different information near $0$; it repackages the same local mgf in a way adapted to sums. To use its Taylor expansion as data, we now name the individual coefficients produced by differentiating $K_X$ at the origin. 
[definition: Cumulant] Let $X: (\Omega, \mathcal F, \mathbb P) \to (\mathbb R, \mathcal B(\mathbb R))$ be a real-valued random variable whose cumulant generating function $K_X$ is defined on an open interval containing $0$. For $n \in \mathbb N$, the $n$th cumulant of $X$ is \begin{align*} \kappa_n(X):=K_X^{(n)}(0). \end{align*} [/definition] The first cumulant is $\mathbb E[X]$, and the second is $\operatorname{Var}(X)$. Raw moments of a sum mix together many cross terms, so they are often awkward to track directly. Cumulants are designed to remove that obstruction: after passing through the logarithm of the mgf, independence turns multiplication of transforms into addition of logarithms, and the derivatives at $0$ separate into additive pieces. [quotetheorem:9546] This theorem is often the cleanest way to track variance and higher-order corrections in sums. For independent identically distributed variables, cumulants scale linearly with the number of summands. The Gaussian family is the model case where cumulants terminate after order two. That is one reason the normal distribution appears as the limiting shape after centering and scaling sums. [example: Gaussian Cumulants] Let $X \sim \mathcal N(\mu,\sigma^2)$ with $\sigma>0$. The normal moment generating function is finite for every $t\in\mathbb R$ and is \begin{align*} M_X(t)=\exp\left(\mu t+\frac{\sigma^2t^2}{2}\right). \end{align*} Since $\exp(u)>0$ for every real $u$, the cumulant generating function is defined by taking the logarithm of $M_X(t)$: \begin{align*} K_X(t)=\log M_X(t)=\log\left(\exp\left(\mu t+\frac{\sigma^2t^2}{2}\right)\right). \end{align*} Using $\log(e^u)=u$ for real $u$, this becomes \begin{align*} K_X(t)=\mu t+\frac{\sigma^2t^2}{2}. \end{align*} By the definition of cumulants, $\kappa_n(X)=K_X^{(n)}(0)$. Differentiating once gives \begin{align*} K_X'(t)=\mu+\sigma^2t. \end{align*} Evaluating at $t=0$ gives \begin{align*} \kappa_1(X)=K_X'(0)=\mu+\sigma^2\cdot 0=\mu. 
\end{align*} Differentiating a second time gives \begin{align*} K_X''(t)=\sigma^2. \end{align*} Therefore \begin{align*} \kappa_2(X)=K_X''(0)=\sigma^2. \end{align*} Differentiating a third time gives \begin{align*} K_X^{(3)}(t)=0. \end{align*} Since the derivative of the zero function is again zero, repeated differentiation gives \begin{align*} K_X^{(n)}(t)=0 \end{align*} for every $n\ge 3$. Hence \begin{align*} \kappa_n(X)=K_X^{(n)}(0)=0 \end{align*} for every $n\ge 3$. Thus a Gaussian distribution has first cumulant $\mu$, second cumulant $\sigma^2$, and no nonzero cumulants of order three or higher. [/example] The logarithmic viewpoint also connects mgfs with large deviations, where $K_X(t)$ controls exponential rates. The next section records the simplest and most widely used consequence: exponential tail bounds. ## Exponential Bounds and Concentration ### Chernoff's Method Markov's inequality becomes much sharper when applied to $e^{tX}$ instead of $X$ itself. This is the basic Chernoff method: transform a tail event into an exponential moment, then optimize over the parameter $t$. The point is not that the mgf gives exact probabilities. Instead, it converts tail estimation into deterministic calculus involving $M_X(t)$ or $K_X(t)$. [quotetheorem:6052] The parameter $t$ should be chosen after the bound is written. Different choices expose different parts of the tail, and optimizing produces the Legendre-transform structure seen in large deviation theory. The Gaussian case gives the standard sub-Gaussian tail estimate. The calculation is a model for many concentration inequalities. [example: Gaussian Chernoff Bound] Let $Z \sim \mathcal N(0,1)$, and fix $a>0$. The standard normal moment generating function is \begin{align*} M_Z(t)=e^{t^2/2} \end{align*} for every $t\in\mathbb R$. Therefore, for every $t>0$, the *[Chernoff Bound](/theorems/6038)* gives \begin{align*} \mathbb P(Z\ge a)\le e^{-ta}M_Z(t).
\end{align*} Indeed, this inequality is Markov's inequality applied to the nonnegative random variable $e^{tZ}$: \begin{align*} \mathbb P(Z\ge a)=\mathbb P(e^{tZ}\ge e^{ta})\le e^{-ta}\mathbb E[e^{tZ}]. \end{align*} Substituting $M_Z(t)=e^{t^2/2}$ gives \begin{align*} \mathbb P(Z\ge a)\le e^{-ta}e^{t^2/2}. \end{align*} Using $e^u e^v=e^{u+v}$, this becomes \begin{align*} \mathbb P(Z\ge a)\le \exp\left(-ta+\frac{t^2}{2}\right). \end{align*} It remains to choose the best positive value of $t$. Complete the square in the exponent: \begin{align*} -ta+\frac{t^2}{2}=\frac{t^2-2at}{2}. \end{align*} Since \begin{align*} (t-a)^2=t^2-2at+a^2, \end{align*} we have \begin{align*} t^2-2at=(t-a)^2-a^2. \end{align*} Thus \begin{align*} -ta+\frac{t^2}{2}=\frac{(t-a)^2}{2}-\frac{a^2}{2}. \end{align*} Because $(t-a)^2\ge 0$, the exponent is minimized when $t=a$. This choice is allowed because $a>0$, and it gives \begin{align*} -a\cdot a+\frac{a^2}{2}=-\frac{a^2}{2}. \end{align*} Hence \begin{align*} \mathbb P(Z\ge a)\le e^{-a^2/2}. \end{align*} The bound is not the exact Gaussian tail probability, but it shows that the upper tail decays at least as fast as an exponential with quadratic exponent. [/example] ### Sums of Independent Trials For sums, the Chernoff method combines with the product rule. This is why mgfs are central in probability estimates for independent trials. [example: Upper Tail for a Binomial Random Variable] Let $S_n \sim \operatorname{Bin}(n,p)$ with $p\in(0,1)$, fix a threshold $a$, and write $q=a/n$, so that $\{S_n\ge a\}=\{S_n\ge nq\}$. We prove the upper-tail bound for $q\in(p,1)$. For $t>0$, the function $x\mapsto e^{tx}$ is increasing, so \begin{align*} \{S_n\ge nq\}\subseteq \{e^{tS_n}\ge e^{tnq}\}. \end{align*} By Markov's inequality applied to the nonnegative random variable $e^{tS_n}$, \begin{align*} \mathbb P(S_n\ge nq)\le e^{-tnq}\mathbb E[e^{tS_n}]. \end{align*} Since $S_n\sim\operatorname{Bin}(n,p)$, its mgf is \begin{align*} \mathbb E[e^{tS_n}]=(1-p+pe^t)^n.
\end{align*} Therefore \begin{align*} \mathbb P(S_n\ge nq)\le \exp\left(n\left[-tq+\log(1-p+pe^t)\right]\right). \end{align*} Set \begin{align*} F(t):=-tq+\log(1-p+pe^t). \end{align*} Then \begin{align*} F'(t)=-q+\frac{pe^t}{1-p+pe^t}. \end{align*} The critical point satisfies \begin{align*} q=\frac{pe^t}{1-p+pe^t}. \end{align*} Multiplying by $1-p+pe^t$ gives \begin{align*} q(1-p)+qpe^t=pe^t. \end{align*} Moving the $qpe^t$ term to the right gives \begin{align*} q(1-p)=p(1-q)e^t. \end{align*} Since $p\in(0,1)$ and $q\in(p,1)$, this is equivalent to \begin{align*} e^t=\frac{q(1-p)}{p(1-q)}. \end{align*} Thus the critical point is \begin{align*} t_*=\log\frac{q(1-p)}{p(1-q)}. \end{align*} Because $q>p$, we have $q(1-p)>p(1-q)$, so $t_*>0$. To check that this critical point minimizes $F$, differentiate once more: \begin{align*} F''(t)=\frac{pe^t(1-p+pe^t)-pe^tpe^t}{(1-p+pe^t)^2}. \end{align*} The numerator reduces to \begin{align*} pe^t(1-p+pe^t)-p^2e^{2t}=p(1-p)e^t. \end{align*} Hence \begin{align*} F''(t)=\frac{p(1-p)e^t}{(1-p+pe^t)^2}>0. \end{align*} So $F$ is strictly convex and $t_*$ is the unique minimizer. Now evaluate the exponent at $t_*$. From \begin{align*} e^{t_*}=\frac{q(1-p)}{p(1-q)}, \end{align*} we get \begin{align*} 1-p+pe^{t_*}=1-p+\frac{q(1-p)}{1-q}. \end{align*} Factoring out $1-p$ gives \begin{align*} 1-p+pe^{t_*}=(1-p)\left(1+\frac{q}{1-q}\right). \end{align*} Since \begin{align*} 1+\frac{q}{1-q}=\frac{1}{1-q}, \end{align*} we have \begin{align*} 1-p+pe^{t_*}=\frac{1-p}{1-q}. \end{align*} Therefore \begin{align*} F(t_*)=-q\log\frac{q(1-p)}{p(1-q)}+\log\frac{1-p}{1-q}. \end{align*} Splitting the first logarithm gives \begin{align*} F(t_*)=-q\log\frac{q}{p}-q\log\frac{1-p}{1-q}+\log\frac{1-p}{1-q}. \end{align*} Combining the last two terms gives \begin{align*} F(t_*)=-q\log\frac{q}{p}+(1-q)\log\frac{1-p}{1-q}. \end{align*} Equivalently, \begin{align*} F(t_*)=-\left(q\log\frac{q}{p}+(1-q)\log\frac{1-q}{1-p}\right). 
\end{align*} Substituting this minimizing value into the Chernoff bound yields \begin{align*} \mathbb P(S_n\ge nq)\le \exp\left(-n\left(q\log\frac{q}{p}+(1-q)\log\frac{1-q}{1-p}\right)\right). \end{align*} This example shows how the mgf reduces the binomial upper tail to minimizing a one-variable [convex function](/page/Convex%20Function), and the resulting exponent is the Bernoulli relative entropy rate. [/example] These bounds also reveal why domain information matters. If $M_X(t)$ is finite only for a small range of positive $t$, then the available upper-tail estimates are restricted to that range. ## Convergence Through Moment Generating Functions Transforms are also used to prove convergence in distribution. Characteristic functions always exist and are the most general tool, but mgfs give a real-variable route when a common neighbourhood of $0$ is available. The theorem below is a convergence analogue of uniqueness. Pointwise convergence of mgfs near $0$ determines convergence in distribution to the law with the limiting mgf. [quotetheorem:9547] This theorem is powerful but must be used with care. The common interval is part of the hypothesis, and the limiting function must be the mgf of the proposed limit on that interval. The [central limit theorem](/theorems/521) has a compact mgf proof under suitable exponential-moment assumptions. The calculation shows why centering and scaling leave a quadratic term in the exponent. [example: MGF Proof under Exponential Integrability] Let $X_1,X_2,\ldots$ be i.i.d. real-valued random variables with $\mathbb E[X_1]=0$, $\operatorname{Var}(X_1)=\sigma^2>0$, and $M_{X_1}$ finite on some interval $(-b,b)$. Define \begin{align*} Z_n=\frac{X_1+\cdots+X_n}{\sigma\sqrt n}. \end{align*} If $|t|<\sigma b$, then $|t/(\sigma\sqrt n)|<b$ for every $n\ge 1$, so the following mgfs are finite. 
By the definition of the mgf, \begin{align*} M_{Z_n}(t)=\mathbb E\left[\exp\left(\frac{t}{\sigma\sqrt n}(X_1+\cdots+X_n)\right)\right]. \end{align*} Since $e^{a_1+\cdots+a_n}=e^{a_1}\cdots e^{a_n}$, this is \begin{align*} M_{Z_n}(t)=\mathbb E\left[\prod_{i=1}^n \exp\left(\frac{tX_i}{\sigma\sqrt n}\right)\right]. \end{align*} By independence, the expectation of the product factors: \begin{align*} M_{Z_n}(t)=\prod_{i=1}^n \mathbb E\left[\exp\left(\frac{tX_i}{\sigma\sqrt n}\right)\right]. \end{align*} Since the $X_i$ have the same distribution, \begin{align*} M_{Z_n}(t)=\left(M_{X_1}\left(\frac{t}{\sigma\sqrt n}\right)\right)^n. \end{align*} Because $M_{X_1}$ is finite near $0$, the local Taylor expansion of the mgf gives \begin{align*} M_{X_1}(u)=M_{X_1}(0)+M_{X_1}'(0)u+\frac{M_{X_1}''(0)}{2}u^2+u^2r(u) \end{align*} where $r(u)\to 0$ as $u\to 0$. The derivative identities for mgfs give \begin{align*} M_{X_1}(0)=\mathbb E[1]=1. \end{align*} They also give \begin{align*} M_{X_1}'(0)=\mathbb E[X_1]=0. \end{align*} Since $\operatorname{Var}(X_1)=\mathbb E[X_1^2]-(\mathbb E[X_1])^2$ and $\mathbb E[X_1]=0$, we have \begin{align*} \mathbb E[X_1^2]=\sigma^2. \end{align*} Hence \begin{align*} M_{X_1}''(0)=\mathbb E[X_1^2]=\sigma^2. \end{align*} Substituting these values into the Taylor expansion gives \begin{align*} M_{X_1}(u)=1+\frac{\sigma^2u^2}{2}+u^2r(u). \end{align*} Now set \begin{align*} u_n=\frac{t}{\sigma\sqrt n}. \end{align*} Then $u_n\to 0$, so $r(u_n)\to 0$, and \begin{align*} \sigma^2u_n^2=\sigma^2\frac{t^2}{\sigma^2 n}=\frac{t^2}{n}. \end{align*} Also \begin{align*} u_n^2r(u_n)=\frac{t^2}{\sigma^2n}r(u_n). \end{align*} Therefore \begin{align*} M_{X_1}(u_n)=1+\frac{t^2}{2n}+\frac{t^2}{\sigma^2n}r(u_n). \end{align*} Define \begin{align*} c_n=\frac{t^2}{\sigma^2}r(u_n). \end{align*} Then $c_n\to 0$, and \begin{align*} M_{Z_n}(t)=\left(1+\frac{t^2/2+c_n}{n}\right)^n. \end{align*} Let $a_n=t^2/2+c_n$. 
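As a numerical aside, the quantities $a_n$ and $M_{Z_n}(t)$ can be tabulated for one concrete law. The sketch below assumes a Rademacher variable, $\mathbb P(X_1=1)=\mathbb P(X_1=-1)=1/2$, which is not part of the example above; it is chosen only because $\mathbb E[X_1]=0$, $\sigma=1$, and $M_{X_1}(u)=\cosh u$ is finite everywhere. The function names are hypothetical.

```python
import math

def mgf_rademacher(u: float) -> float:
    """mgf of the assumed law P(X=1)=P(X=-1)=1/2, which equals cosh(u)."""
    return math.cosh(u)

def standardized_sum_mgf(t: float, n: int, sigma: float = 1.0) -> tuple:
    """Return (a_n, M_{Z_n}(t)), where M_{X_1}(u_n) = 1 + a_n / n."""
    u_n = t / (sigma * math.sqrt(n))   # argument after centering and scaling
    m = mgf_rademacher(u_n)            # single-summand mgf at u_n
    a_n = n * (m - 1.0)                # a_n = t^2/2 + c_n in the text's notation
    return a_n, m ** n                 # n i.i.d. factors multiply

t = 1.5
for n in (1, 10, 100, 10_000):
    a_n, m_zn = standardized_sum_mgf(t, n)
    print(f"n={n:>6}  a_n={a_n:.6f}  M_Zn(t)={m_zn:.6f}")
print(f"limit    a={t**2 / 2:.6f}  exp(t^2/2)={math.exp(t**2 / 2):.6f}")
```

For this assumed law the printed $a_n$ values should settle near $t^2/2$ and the mgf values near $e^{t^2/2}$, which is the behaviour the remainder of the argument establishes in general.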
For $t=0$ every $M_{Z_n}(0)=1$, so assume $t\neq 0$. Then $a_n\to t^2/2>0$, so $a_n\neq 0$ for all large $n$ and $a_n/n\to 0$. Since $\log(1+x)/x\to 1$ as $x\to 0$, \begin{align*} n\log\left(1+\frac{a_n}{n}\right)=a_n\frac{\log(1+a_n/n)}{a_n/n}\to \frac{t^2}{2}. \end{align*} Exponentiating this limit gives \begin{align*} M_{Z_n}(t)=\exp\left(n\log\left(1+\frac{a_n}{n}\right)\right)\to e^{t^2/2}. \end{align*} The function $t\mapsto e^{t^2/2}$ is the mgf of $\mathcal N(0,1)$ on every neighbourhood of $0$. Since the convergence holds for every $t\in(-\sigma b,\sigma b)$, the *[Continuity Theorem for Moment Generating Functions](/theorems/9547)* gives \begin{align*} Z_n \xrightarrow{d} \mathcal N(0,1). \end{align*} The calculation shows explicitly that centering removes the linear term, scaling makes the quadratic term equal to $t^2/(2n)$, and the $n$ independent factors combine to leave the Gaussian exponent $t^2/2$. [/example] Characteristic functions remove the exponential-integrability hypothesis and prove the full classical [central limit theorem](/theorems/1848). The mgf proof is nevertheless valuable because it makes the mechanism visible in real-variable calculus. ## Comparison with Related Transforms Moment generating functions sit among several transforms of probability laws. The closest are probability generating functions, characteristic functions, and Laplace transforms. Each solves a slightly different existence problem. ### Probability Generating Functions For nonnegative integer-valued random variables, powers of a variable $s$ are often more natural than exponentials. This leads to the [probability generating function](/page/Probability%20Generating%20Function), which stores point probabilities as coefficients. [definition: Probability Generating Function] Let $X: (\Omega, \mathcal F, \mathbb P) \to (\mathbb N \cup \{0\},2^{\mathbb N\cup\{0\}})$ be a nonnegative integer-valued random variable, and set \begin{align*} D_X^{\mathrm{pgf}} := \{s \in \mathbb R : \mathbb E[|s|^X] < \infty\}. 
\end{align*} The probability generating function of $X$ is the map \begin{align*} G_X: D_X^{\mathrm{pgf}} \to \mathbb R. \end{align*} For $s \in D_X^{\mathrm{pgf}}$, it is defined by \begin{align*} G_X(s):=\mathbb E[s^X]. \end{align*} [/definition] The relation to the mgf is obtained by setting $s=e^t$. Thus, where both sides are finite, \begin{align*} M_X(t)=G_X(e^t). \end{align*} This change of variables is useful because the probability generating function encodes point probabilities as coefficients, while the mgf is better adapted to sums and exponential bounds. ### Characteristic Functions Characteristic functions solve the existence problem by using complex exponentials of modulus one. They always exist, even when the mgf does not exist away from $0$. [definition: Characteristic Function] Let $X: (\Omega, \mathcal F, \mathbb P) \to (\mathbb R, \mathcal B(\mathbb R))$ be a real-valued random variable. The characteristic function of $X$ is the map \begin{align*} \phi_X: \mathbb R \to \mathbb C. \end{align*} For $u \in \mathbb R$, it is defined by \begin{align*} \phi_X(u):=\mathbb E[e^{iuX}]. \end{align*} [/definition] The characteristic function requires no integrability assumption and is therefore more general. The mgf, when available near $0$, gives stronger real-variable analytic control and more direct exponential tail estimates. ### Laplace Transforms For nonnegative random variables, the left side of the mgf, $t\le 0$, always exists, because $0<e^{tX}\le 1$ there. Writing that side with a nonnegative parameter produces the [Laplace transform](/page/Laplace%20Transform), which is the standard language for waiting times and nonnegative distributions. [definition: Laplace Transform of a Nonnegative Random Variable] Let $X: (\Omega, \mathcal F, \mathbb P) \to ([0,\infty),\mathcal B([0,\infty)))$ be a nonnegative real-valued random variable. The Laplace transform of $X$ is the map \begin{align*} L_X: [0,\infty) \to (0,1]. 
\end{align*} For $s \in [0,\infty)$, it is defined by \begin{align*} L_X(s):=\mathbb E[e^{-sX}]. \end{align*} [/definition] For $X\ge 0$, $L_X(s)=M_X(-s)$ for $s\ge 0$. This transform always exists on $[0,\infty)$, while the positive side of the mgf may fail because it tests the right tail. ## Beyond and Connected Topics Moment generating functions are a gateway to [Probability](/page/Cambridge%20IA%20Probability), [measure-theoretic probability](/page/Cambridge%20IB%20Probability%20and%20Measure), asymptotic statistics, and large deviations. The key next step is to compare mgfs with characteristic functions, because characteristic functions provide the general uniqueness and convergence theory without exponential-integrability assumptions. In statistics, mgfs help identify sampling distributions and compute moments of estimators. They appear naturally beside likelihoods and exponential families, especially when differentiating normalizing constants produces expectations and variances. In advanced probability, cumulant generating functions lead to concentration inequalities and large deviation principles. The Legendre transform of $K_X$ controls exponential decay rates for sums of independent random variables, and Chernoff bounds are the first finite-sample shadow of that theory. The limitations are as important as the uses. Heavy-tailed distributions may have no mgf near $0$, even when many or all ordinary moments exist. In those settings, characteristic functions, tail estimates, regular variation, or direct measure-theoretic arguments replace mgf methods. ## References Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability). Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure). Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability). Androma, [Cambridge IB Statistics](/page/Cambridge%20IB%20Statistics). Patrick Billingsley, *Probability and Measure* (1995). 
Rick Durrett, *Probability: Theory and Examples* (2019). Allan Gut, *Probability: A Graduate Course* (2013). Geoffrey Grimmett and David Stirzaker, *Probability and Random Processes* (2020).

Created by admin on 6/22/2026 | Last updated on 6/22/2026
