Cramér's Theorem: Statement & Proof

Proof

[proofplan] We establish the large deviation [limit](/page/Limit) $\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(S_n \geq na) = \Psi^*(a)$ through matching upper and lower bounds. The upper bound is a direct exponential Chebyshev argument: for each $\lambda \geq 0$, $\mathbb{P}(S_n \geq na) \leq e^{-\lambda na} \mathbb{E}[e^{\lambda S_n}] = e^{-n(\lambda a - \Psi(\lambda))}$, and optimising over $\lambda$ gives $\Psi^*(a)$. The lower bound requires a change-of-measure (exponential tilting) argument: we shift to $X_i - a$, tilt the [distribution](/page/Distribution) to recentre the mean at $0$, and use the [Central Limit Theorem](/theorems/521) under the tilted measure to show that the tilted probability of $\{0 \leq S_n \leq \varepsilon n\}$ converges to $1/2$. The general case reduces to the case of bounded support via truncation and a compactness argument. [/proofplan] [step:Upper bound: apply exponential Chebyshev and optimise over $\lambda \geq 0$] For any $\lambda \geq 0$, the [Markov Inequality](/theorems/514) applied to the non-negative random variable $e^{\lambda S_n}$ at threshold $e^{\lambda na}$ gives \begin{align*} \mathbb{P}(S_n \geq na) \leq \mathbb{P}(e^{\lambda S_n} \geq e^{\lambda na}) \leq e^{-\lambda na} \, \mathbb{E}[e^{\lambda S_n}], \end{align*} where the first inequality is an equality for $\lambda > 0$ and trivial for $\lambda = 0$. Since the $X_i$ are i.i.d., $\mathbb{E}[e^{\lambda S_n}] = \mathbb{E}[e^{\lambda X_1}]^n = e^{n\Psi(\lambda)}$ (where $\Psi(\lambda) = \log \mathbb{E}[e^{\lambda X_1}]$ is the cumulant generating [function](/page/Function)). Therefore \begin{align*} \mathbb{P}(S_n \geq na) \leq e^{-n(\lambda a - \Psi(\lambda))}. \end{align*} Taking logarithms, dividing by $-n$, and optimising over $\lambda \geq 0$: \begin{align*} -\frac{1}{n} \log \mathbb{P}(S_n \geq na) \geq \sup_{\lambda \geq 0} (\lambda a - \Psi(\lambda)) = \Psi^*(a). \end{align*} If $\Psi^*$ is defined as the supremum over all $\lambda \in \mathbb{R}$, the restriction to $\lambda \geq 0$ loses nothing for $a \geq \bar{x}$: by Jensen's inequality $\Psi(\lambda) \geq \lambda \bar{x}$, so $\lambda a - \Psi(\lambda) \leq \lambda(a - \bar{x}) \leq 0$ for $\lambda \leq 0$, while the value at $\lambda = 0$ is $0$. Since the right-hand side of the display is independent of $n$, $\liminf_{n \to \infty} \left(-\frac{1}{n} \log \mathbb{P}(S_n \geq na)\right) \geq \Psi^*(a)$. 
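The upper bound can be checked numerically in a case where everything is explicit. The following is a minimal sketch, assuming (purely for illustration, this is not part of the proof) that the $X_i$ are Bernoulli($p$) with $p = 0.5$ and $a = 0.75$; the function names are hypothetical. It compares the exact value of $-\frac{1}{n}\log\mathbb{P}(S_n \geq na)$ with $\Psi^*(a)$, computed from the optimiser $\lambda^*$ solving $\Psi'(\lambda^*) = a$.

```python
import math

def cgf_bernoulli(lam: float, p: float) -> float:
    """Psi(lam) = log E[exp(lam * X)] for X ~ Bernoulli(p)."""
    return math.log(1.0 - p + p * math.exp(lam))

def optimal_lambda(a: float, p: float) -> float:
    """Solution of Psi'(lam) = a for the Bernoulli case, valid for p < a < 1."""
    return math.log(a * (1.0 - p) / (p * (1.0 - a)))

def rate(a: float, p: float) -> float:
    """Psi^*(a) = lam* a - Psi(lam*), evaluated at the optimal lam*."""
    lam_star = optimal_lambda(a, p)
    return lam_star * a - cgf_bernoulli(lam_star, p)

def exact_tail(n: int, a: float, p: float) -> float:
    """Exact P(S_n >= n a) for S_n ~ Binomial(n, p)."""
    k_min = math.ceil(n * a)
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
               for k in range(k_min, n + 1))

# Illustrative parameters (assumed); both are exactly representable in binary.
p, a = 0.5, 0.75
psi_star = rate(a, p)

results = []
for n in (8, 40, 200, 800):
    decay = -math.log(exact_tail(n, a, p)) / n
    results.append((n, decay, psi_star))
    print(f"n={n:4d}  -log(P)/n={decay:.4f}  Psi*(a)={psi_star:.4f}")
```

For every $n$ the computed decay rate should sit above $\Psi^*(a)$, as the Chebyshev bound requires, with the gap shrinking as $n$ grows.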
[guided] This is the exponential Chebyshev method, the workhorse of large deviation upper bounds. To bound the probability that $S_n$ is atypically large ($\geq na$ with $a > \bar{x}$), we exponentiate both sides ($e^{\lambda S_n} \geq e^{\lambda na}$) and apply [Markov's Inequality](/theorems/514). The exponential transform turns the sum $S_n$ into the product $\prod_{i=1}^n e^{\lambda X_i}$, whose expectation factorises by independence of the $X_i$. The key identity $\mathbb{E}[e^{\lambda S_n}] = e^{n\Psi(\lambda)}$ comes from exactly this: $\mathbb{E}[e^{\lambda(X_1 + \cdots + X_n)}] = \prod_{i=1}^n \mathbb{E}[e^{\lambda X_i}] = \mathbb{E}[e^{\lambda X_1}]^n$, and taking logarithms gives $\log \mathbb{E}[e^{\lambda S_n}] = n \Psi(\lambda)$. The bound $e^{-n(\lambda a - \Psi(\lambda))}$ is valid for every $\lambda \geq 0$, and the best exponent is the supremum of $\lambda a - \Psi(\lambda)$, which is the Legendre transform $\Psi^*(a)$. If $\Psi$ is differentiable and the supremum is attained at $\lambda^*$, then $\Psi'(\lambda^*) = a$; that is, $\lambda^*$ is the tilt parameter that recentres the distribution at $a$. This connection between the optimal Chebyshev bound and exponential tilting is the conceptual backbone of Cramér's theorem. [/guided] [/step] [step:Reduce the lower bound to the case $a = 0$, $\bar{x} \leq 0$] Replacing $X_i$ by $\tilde{X}_i := X_i - a$ shifts the mean to $\tilde{\bar{x}} = \bar{x} - a \leq 0$ (since $a \geq \bar{x}$) and transforms the cumulant generating function: \begin{align*} \tilde{\Psi}(\lambda) = \log \mathbb{E}[e^{\lambda(X_1 - a)}] = \Psi(\lambda) - \lambda a, \end{align*} so $\tilde{\Psi}^*(0) = \sup_{\lambda \geq 0}(-\tilde{\Psi}(\lambda)) = \sup_{\lambda \geq 0}(\lambda a - \Psi(\lambda)) = \Psi^*(a)$, that is, $\inf_{\lambda \geq 0} \tilde{\Psi}(\lambda) = -\Psi^*(a)$. The event $\{S_n \geq na\}$ becomes $\{\tilde{S}_n \geq 0\}$. Because the lower bound on the probability must hold along the whole sequence, it is a statement about the $\liminf$. It therefore suffices to prove (dropping the tildes): if $\bar{x} \leq 0$, then \begin{align*} \liminf_{n \to \infty} \frac{1}{n} \log \mathbb{P}(S_n \geq 0) \geq \inf_{\lambda \geq 0} \Psi(\lambda). 
\end{align*} [/step] [step:Case 1: $\mathbb{P}(X_1 > 0) = 0$] If $X_1 \leq 0$ a.s., then $S_n \geq 0$ if and only if $S_n = 0$, which (almost surely) requires $X_i = 0$ for all $i$. By independence, \begin{align*} \mathbb{P}(S_n \geq 0) = \mathbb{P}(X_1 = 0)^n. \end{align*} For the rate function: $\Psi(\lambda) = \log \mathbb{E}[e^{\lambda X_1}]$, and since $X_1 \leq 0$ a.s., $e^{\lambda X_1} \leq 1$ for $\lambda \geq 0$, with equality on $\{X_1 = 0\}$. By the [Monotone Convergence Theorem](/theorems/509) applied to the increasing family $1 - e^{\lambda X_1} \uparrow 1 - \mathbb{1}_{\{X_1 = 0\}}$ (equivalently, $e^{\lambda X_1} \downarrow \mathbb{1}_{\{X_1 = 0\}}$ for $X_1 \leq 0$), $\lim_{\lambda \to +\infty} \mathbb{E}[e^{\lambda X_1}] = \mathbb{P}(X_1 = 0)$. Therefore \begin{align*} \inf_{\lambda \geq 0} \Psi(\lambda) \leq \lim_{\lambda \to \infty} \Psi(\lambda) = \log \mathbb{P}(X_1 = 0), \end{align*} and for every $n$, $\frac{1}{n} \log \mathbb{P}(S_n \geq 0) = \log \mathbb{P}(X_1 = 0) \geq \inf_{\lambda \geq 0} \Psi(\lambda)$ (both sides are $-\infty$ if $\mathbb{P}(X_1 = 0) = 0$). [/step] [step:Case 2: $\mathbb{E}[e^{\lambda X_1}] < \infty$ for all $\lambda$ and $\mathbb{P}(X_1 > 0) > 0$] Since $\mathbb{E}[e^{\lambda X_1}] < \infty$ for all $\lambda \in \mathbb{R}$, the cumulant generating function $\Psi$ is $C^\infty$ on $\mathbb{R}$ (by differentiation under the [integral](/page/Integral), justified by dominated convergence). Define \begin{align*} M(\lambda) := \mathbb{E}[e^{\lambda X_1}], \quad \text{so } \Psi(\lambda) = \log M(\lambda), \quad \Psi'(\lambda) = \frac{M'(\lambda)}{M(\lambda)} = \frac{\mathbb{E}[X_1 e^{\lambda X_1}]}{M(\lambda)}. \end{align*} At $\lambda = 0$: $\Psi'(0) = \mathbb{E}[X_1] = \bar{x} \leq 0$. As $\lambda \to +\infty$: since $\mathbb{P}(X_1 > 0) > 0$, there exists $\delta > 0$ with $\mathbb{P}(X_1 \geq \delta) > 0$. Splitting the expectation according to the sign of $X_1$ and using $x e^{\lambda x} \geq -1/(e\lambda)$ for $x \leq 0$ and $\lambda > 0$, we get $\mathbb{E}[X_1 e^{\lambda X_1}] \geq \delta \cdot e^{\lambda \delta} \cdot \mathbb{P}(X_1 \geq \delta) - \frac{1}{e\lambda} \to +\infty$. In particular the numerator is positive for all sufficiently large $\lambda$, and since $0 < M(\lambda) < \infty$, we get $\Psi'(\lambda) > 0$ for all sufficiently large $\lambda$. (Note that $\Psi'$ itself need not tend to $+\infty$: for bounded $X_1$ it stays below the essential supremum of $X_1$.) 
By the [intermediate value theorem](/theorems/629) applied to the continuous function $\Psi'$ on an interval $[0, \lambda_1]$ with $\Psi'(\lambda_1) > 0$, there exists $\theta \geq 0$ with $\Psi'(\theta) = 0$ (with $\theta > 0$ when $\bar{x} < 0$). Since $\Psi$ is convex, $\theta$ minimises $\Psi$, so $\Psi(\theta) = \inf_{\lambda \geq 0} \Psi(\lambda)$. [guided] The parameter $\theta$ is the exponential tilt that recentres the mean at $0$. We are looking for a change of measure under which $S_n/n$ concentrates near $0$ (instead of near $\bar{x} \leq 0$), so that $\mathbb{P}_\theta(S_n \geq 0)$ is bounded away from $0$. The existence of $\theta$ requires two ingredients: (i) $\Psi'(0) = \bar{x} \leq 0$ (the original mean is non-positive), and (ii) $\Psi'(\lambda) > 0$ for large $\lambda$ (possible because $X_1$ has positive mass on $(0, \infty)$, so a strong enough tilt makes the mean positive). The [intermediate value theorem](/theorems/180) bridges the gap. The condition $\mathbb{P}(X_1 > 0) > 0$ is essential: without it, we are in Case 1. [/guided] [/step] [step:Tilt the measure by $\theta$ and apply the [Central Limit Theorem](/theorems/521)] Define the tilted probability measure $\mathbb{P}_\theta$ (on the $\sigma$-algebra generated by $X_1, \dots, X_n$) by the Radon-Nikodym [derivative](/page/Derivative) \begin{align*} \frac{d\mathbb{P}_\theta}{d\mathbb{P}} = \frac{e^{\theta S_n}}{M(\theta)^n}. \end{align*} Under $\mathbb{P}_\theta$, the $X_i$ are i.i.d. with common distribution having density $e^{\theta x}/M(\theta)$ with respect to the original law of $X_1$. The mean and variance under $\mathbb{P}_\theta$ are \begin{align*} \mathbb{E}_\theta[X_1] = \Psi'(\theta) = 0, \quad \operatorname{Var}_\theta(X_1) = \Psi''(\theta) =: \sigma_\theta^2. \end{align*} The variance $\sigma_\theta^2$ is finite and positive: finiteness follows from the hypothesis $M(\lambda) < \infty$ for all $\lambda$ (which guarantees all moments are finite under $\mathbb{P}_\theta$), and positivity holds because $X_1$ is not a.s. constant (a constant value would have to be positive since $\mathbb{P}(X_1 > 0) > 0$, contradicting $\mathbb{E}_\theta[X_1] = 0$). 
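The choice of tilt can be made concrete on a small example. The sketch below assumes a hypothetical three-point law for $X_1$ (values $-2, -1, 1$ with probabilities $0.3, 0.4, 0.3$, so $\bar{x} = -0.7 \leq 0$ and $\mathbb{P}(X_1 > 0) > 0$), which is not taken from the text. It finds the root of $\Psi'$ by bisection and then forms the tilted law with density $e^{\theta x}/M(\theta)$.

```python
import math

# Hypothetical discrete law of X_1 (assumed for illustration only).
values = [-2.0, -1.0, 1.0]
probs = [0.3, 0.4, 0.3]

def mgf(lam: float) -> float:
    """M(lam) = E[exp(lam * X_1)]."""
    return sum(p * math.exp(lam * x) for x, p in zip(values, probs))

def cgf(lam: float) -> float:
    """Psi(lam) = log M(lam)."""
    return math.log(mgf(lam))

def cgf_prime(lam: float) -> float:
    """Psi'(lam) = E[X_1 exp(lam * X_1)] / M(lam)."""
    num = sum(x * p * math.exp(lam * x) for x, p in zip(values, probs))
    return num / mgf(lam)

def find_tilt(tol: float = 1e-12) -> float:
    """Bisection for the root of Psi' on [0, hi]; Psi' is non-decreasing."""
    lo, hi = 0.0, 1.0
    while cgf_prime(hi) <= 0.0:  # enlarge the bracket until Psi'(hi) > 0
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cgf_prime(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta = find_tilt()

# Tilted law: reweight each atom by exp(theta * x) / M(theta).
tilted = [p * math.exp(theta * x) / mgf(theta) for x, p in zip(values, probs)]
tilted_mean = sum(x * q for x, q in zip(values, tilted))
tilted_var = sum(x * x * q for x, q in zip(values, tilted)) - tilted_mean**2

print(f"theta={theta:.6f}  tilted mean={tilted_mean:.2e}  tilted var={tilted_var:.4f}")
```

Bisection is used here only because $\Psi'$ is monotone and the bracket is easy to find; any one-dimensional root finder would serve equally well.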
For any $\varepsilon > 0$, we bound $\mathbb{P}(S_n \geq 0)$ from below by restricting to the event $\{S_n \in [0, \varepsilon n]\}$: \begin{align*} \mathbb{P}(S_n \geq 0) &\geq \mathbb{P}(S_n \in [0, \varepsilon n]) \\ &= \mathbb{E}\!\left[\mathbb{1}_{\{S_n \in [0, \varepsilon n]\}}\right] \\ &= \mathbb{E}_\theta\!\left[\mathbb{1}_{\{S_n \in [0, \varepsilon n]\}} \cdot \frac{M(\theta)^n}{e^{\theta S_n}}\right] \\ &\geq M(\theta)^n \cdot e^{-\theta \varepsilon n} \cdot \mathbb{P}_\theta(S_n \in [0, \varepsilon n]), \end{align*} where the last inequality uses $e^{\theta S_n} \leq e^{\theta \varepsilon n}$ on $\{S_n \in [0, \varepsilon n]\}$ (since $\theta > 0$ and $S_n \leq \varepsilon n$). By the [Central Limit Theorem](/theorems/521) applied under $\mathbb{P}_\theta$ (the $X_i$ are i.i.d. with mean $0$ and finite variance $\sigma_\theta^2$), $S_n / (\sigma_\theta \sqrt{n}) \xrightarrow{d} \mathcal{N}(0,1)$ under $\mathbb{P}_\theta$. The event $\{S_n \in [0, \varepsilon n]\}$ in the scaled variable is $\{S_n / (\sigma_\theta \sqrt{n}) \in [0, \varepsilon \sqrt{n} / \sigma_\theta]\}$. Since $\varepsilon \sqrt{n} / \sigma_\theta \to +\infty$, \begin{align*} \mathbb{P}_\theta(S_n \in [0, \varepsilon n]) \to \mathbb{P}(Z \geq 0) = \frac{1}{2}, \end{align*} where $Z \sim \mathcal{N}(0,1)$. [guided] The exponential tilting (or change of measure) is the central technique in the lower bound. The idea: the event $\{S_n \geq 0\}$ is a large deviation event under $\mathbb{P}$ (since $\mathbb{E}[S_n] = n\bar{x} \leq 0$, the sum $S_n$ must fluctuate above its mean). Under $\mathbb{P}_\theta$, the same event is typical (since $\mathbb{E}_\theta[S_n] = 0$). The Radon-Nikodym derivative $d\mathbb{P}_\theta/d\mathbb{P} = e^{\theta S_n}/M(\theta)^n$ is a product of i.i.d. factors $e^{\theta X_i}/M(\theta)$, which makes $\mathbb{P}_\theta$ another product measure. This is the key structural feature: exponential tilting preserves independence. 
The restriction to $\{S_n \in [0, \varepsilon n]\}$ (instead of $\{S_n \geq 0\}$) ensures an upper bound on $e^{\theta S_n}$, which is needed to convert the change-of-measure identity into a lower bound. On the larger event $\{S_n \geq 0\}$, the factor $e^{-\theta S_n}$ could be arbitrarily small (when $S_n$ is large), preventing a useful bound. The CLT under $\mathbb{P}_\theta$ is applicable because $\theta$ was chosen to make $\mathbb{E}_\theta[X_1] = 0$, and the variance $\sigma_\theta^2 = \Psi''(\theta)$ is finite (all moments exist since $M(\lambda) < \infty$ for all $\lambda$). The scaling $[0, \varepsilon n] = \sigma_\theta \sqrt{n} \cdot [0, \varepsilon \sqrt{n}/\sigma_\theta]$ shows that the interval grows to $[0, +\infty)$ in the CLT scale, capturing half the Gaussian mass. [/guided] [/step] [step:Extract the exponential rate for Case 2] Taking logarithms in the bound $\mathbb{P}(S_n \geq 0) \geq M(\theta)^n e^{-\theta \varepsilon n} \mathbb{P}_\theta(S_n \in [0, \varepsilon n])$, dividing by $n$, and taking $\liminf$ (a lower bound on the probability along the whole sequence is what the theorem needs): \begin{align*} \liminf_{n \to \infty} \frac{1}{n} \log \mathbb{P}(S_n \geq 0) &\geq \log M(\theta) - \theta \varepsilon + \lim_{n \to \infty} \frac{1}{n} \log \mathbb{P}_\theta(S_n \in [0, \varepsilon n]) \\ &= \Psi(\theta) - \theta \varepsilon + 0, \end{align*} since $\mathbb{P}_\theta(S_n \in [0, \varepsilon n]) \to 1/2 > 0$ implies $\frac{1}{n} \log \mathbb{P}_\theta(S_n \in [0, \varepsilon n]) \to 0$. Letting $\varepsilon \downarrow 0$: \begin{align*} \liminf_{n \to \infty} \frac{1}{n} \log \mathbb{P}(S_n \geq 0) \geq \Psi(\theta) \geq \inf_{\lambda \geq 0} \Psi(\lambda). \end{align*} [/step] [step:Case 3 (general): reduce to bounded support via truncation and compactness] For general $X_1$ with $\mathbb{P}(X_1 > 0) > 0$ but possibly $M(\lambda) = \infty$ for some $\lambda$, we truncate. 
For each $K > 0$ with $\mu([-K, K]) > 0$, let $\nu_K$ denote the law of $X_1$ conditioned on $|X_1| \leq K$, and define \begin{align*} \Psi_K(\lambda) := \log \int_{-K}^{K} e^{\lambda x} \, d\mu(x), \end{align*} where $\mu$ is the law of $X_1$. Since $|x| \leq K$, $M_K(\lambda) := \int_{-K}^K e^{\lambda x} \, d\mu(x)$ is finite for all $\lambda$. After normalising to a probability measure, the cumulant generating function of $\nu_K$ is $\Psi_K(\lambda) - \log \mu([-K, K])$, and Case 2 applies to $\nu_K$-distributed variables once $K$ is large enough that $\mu((0, K]) > 0$. (Case 2 was stated for a non-positive mean. If the mean of $\nu_K$ is positive, the same bound holds trivially: by the weak law of large numbers the probability that the sum is non-negative tends to $1$, while the infimum over $\lambda \geq 0$ of the cumulant generating function is at most its value $0$ at $\lambda = 0$.) Let $\mu_n$ and $\nu_{K,n}$ denote the laws of $S_n$ under $\mu$ and $\nu_K$ respectively. Restricting to the event that all $n$ draws fall in $[-K, K]$, which has probability $\mu([-K, K])^n$ and on which the draws are i.i.d. with law $\nu_K$, gives \begin{align*} \mu_n([0, \infty)) \geq \nu_{K,n}([0, \infty)) \cdot \mu([-K, K])^n. \end{align*} Taking logarithms, dividing by $n$, and applying Case 2 to the bounded distribution, the normalising constants cancel: \begin{align*} \liminf_{n \to \infty} \frac{1}{n} \log \mu_n([0, \infty)) \geq \inf_{\lambda \geq 0} \bigl(\Psi_K(\lambda) - \log \mu([-K, K])\bigr) + \log \mu([-K, K]) = \inf_{\lambda \geq 0} \Psi_K(\lambda). \end{align*} As $K \to \infty$, $\Psi_K(\lambda) \uparrow \Psi(\lambda)$ for each $\lambda \geq 0$ (by the [Monotone Convergence Theorem](/theorems/509)), and the compactness argument in the guided note shows that $\inf_{\lambda \geq 0} \Psi_K(\lambda) \uparrow \inf_{\lambda \geq 0} \Psi(\lambda)$, which completes Case 3. [guided] The convergence of infima requires a compactness argument. Define $J_K := \inf_{\lambda \geq 0} \Psi_K(\lambda)$ and $J := \inf_{\lambda \geq 0} \Psi(\lambda)$. Since $\Psi_K \leq \Psi_{K'} \leq \Psi$ for $K \leq K'$ (the integration domain $[-K,K]$ is enlarged), $J_K$ is non-decreasing in $K$ and $J_K \leq J$ for all $K$. Hence $J_\infty := \lim_{K \to \infty} J_K$ exists and $J_\infty \leq J \leq \Psi(0) = 0$. We need $J_\infty \geq J$. Fix $\delta > 0$ and $K_0$ with $\mu([\delta, K_0]) > 0$ (possible since $\mathbb{P}(X_1 > 0) > 0$), and for $K \geq K_0$ consider the level [sets](/page/Set) $L_K := \{\lambda \geq 0 : \Psi_K(\lambda) \leq J_\infty\}$. 
Each $\Psi_K$ is finite, convex and continuous on $[0, \infty)$, and $\Psi_K(\lambda) \geq \lambda \delta + \log \mu([\delta, K_0]) \to +\infty$ as $\lambda \to +\infty$. Consequently the infimum $J_K$ is finite and attained at some $\lambda_K \geq 0$, and since $J_K \leq J_\infty$, the set $L_K$ is non-empty. It is closed (by continuity), an interval (by convexity) and bounded (by the linear growth), hence compact. The sets $L_K$ are nested: $L_{K'} \subset L_K$ for $K' \geq K$ (since $\Psi_{K'} \geq \Psi_K$). By the finite intersection property, $\bigcap_{K \geq K_0} L_K \neq \varnothing$, so there exists $\lambda_0 \geq 0$ with $\Psi_K(\lambda_0) \leq J_\infty$ for all $K \geq K_0$. Letting $K \to \infty$: $\Psi(\lambda_0) = \lim_K \Psi_K(\lambda_0) \leq J_\infty$, so $J \leq \Psi(\lambda_0) \leq J_\infty \leq J$, confirming $J_K \to J$. [/guided] [/step] [step:Combine the upper and lower bounds] The upper bound on the probability (valid for all $n$) gives \begin{align*} \liminf_{n \to \infty} \left(-\frac{1}{n} \log \mathbb{P}(S_n \geq na)\right) \geq \Psi^*(a). \end{align*} The lower bound on the probability (from Cases 1-3 and the reduction to $a = 0$) gives \begin{align*} \limsup_{n \to \infty} \left(-\frac{1}{n} \log \mathbb{P}(S_n \geq na)\right) \leq \Psi^*(a). \end{align*} Together, the limit exists and $\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(S_n \geq na) = \Psi^*(a)$. [/step]
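As a sanity check of the statement, here is a worked instance under the assumption (not part of the proof above) that the $X_i$ are standard Gaussian, so $\bar{x} = 0$, and that $a > 0$. Every quantity is explicit: \begin{align*} \Psi(\lambda) &= \log \mathbb{E}[e^{\lambda X_1}] = \frac{\lambda^2}{2}, \\ \Psi^*(a) &= \sup_{\lambda \geq 0}\left(\lambda a - \frac{\lambda^2}{2}\right) = \frac{a^2}{2}, \end{align*} with the supremum attained at $\lambda^* = a$, where $\Psi'(\lambda^*) = a$. On the other hand $S_n \sim \mathcal{N}(0, n)$, so $\mathbb{P}(S_n \geq na) = \mathbb{P}(Z \geq a\sqrt{n})$ with $Z \sim \mathcal{N}(0,1)$. The standard Gaussian tail bounds $\frac{x}{1+x^2}\,\varphi(x) \leq \mathbb{P}(Z \geq x) \leq \frac{1}{x}\,\varphi(x)$ for $x > 0$, where $\varphi(x) = e^{-x^2/2}/\sqrt{2\pi}$, then give \begin{align*} -\frac{1}{n} \log \mathbb{P}(S_n \geq na) = \frac{a^2}{2} + O\!\left(\frac{\log n}{n}\right) \to \frac{a^2}{2} = \Psi^*(a), \end{align*} in agreement with the theorem.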


Cramer's Theorem (Theorem # 1173)
