[proofplan]
The claim follows by chaining two established results. First, Bernoulli sources satisfy the AEP with constant $H(X_1)$; second, Shannon's First Coding Theorem converts this AEP into the statement that the information rate equals the AEP constant. Verifying the hypotheses of each result amounts to checking that the Bernoulli source is what it is.
[/proofplan]
[step:Invoke that Bernoulli sources satisfy the AEP with constant $H(X_1)$]
The source $X_1, X_2, \ldots$ is Bernoulli by hypothesis, which is the sole requirement of [Bernoulli Sources Satisfy AEP](/theorems/1645). Applying that theorem, the source satisfies the AEP with constant $H(X_1)$.
[/step]
[step:Apply Shannon's First Coding Theorem to convert the AEP into an information-rate statement]
[Shannon's First Coding Theorem](/theorems/1644) states: if a source satisfies the AEP with constant $H$, then its information rate equals $H$. The single hypothesis — that the source satisfies the AEP with constant $H$ — was established in the previous step with $H = H(X_1)$. Therefore the source has information rate $H(X_1)$.
[guided]
The two-step chain compresses what is actually happening under the hood. The AEP for a Bernoulli source is a probabilistic fact: the empirical log-likelihood concentrates at its expectation $H(X_1)$ thanks to the WLLN applied to the i.i.d. random variables $-\log_2 p(X_i)$. Shannon's First Coding Theorem converts concentration of probability mass on the typical set $T_n$ (with known exponential size $2^{nH}$) into matching upper and lower bounds on the number of bits needed to describe the source reliably. Combining the two yields the headline equality: the optimal compression rate of a Bernoulli source equals the Shannon entropy of its one-step marginal.
No hypothesis beyond the Bernoulli structure is needed. The finite-alphabet assumption embedded in the definition of a Bernoulli source (inherited by both [Bernoulli Sources Satisfy AEP](/theorems/1645) and [Shannon's First Coding Theorem](/theorems/1644)) guarantees that $H(X_1)$ is a finite non-negative real number.
The theorem has two practical interpretations. Operationally, it gives an encoder-side promise: for any $\varepsilon > 0$, a sufficiently long block of a Bernoulli source can be compressed to roughly $n(H(X_1) + \varepsilon)$ bits with vanishing probability of error, and no scheme can do strictly better than $n(H(X_1) - \varepsilon)$ bits while maintaining reliability. Informationally, it identifies the single-symbol Shannon entropy $H(X_1) = -\sum_i p_i \log_2 p_i$ as the precise per-symbol information content of the source.
The result does not address two important questions. It is silent on the rate of convergence — how large must $n$ be before the compression bound is achieved within a prescribed error? — and it is silent on constructive schemes — the proof via AEP shows existence of good codes without specifying them. These two gaps are addressed separately by refinements such as finite-blocklength coding theory and by explicit algorithms such as Huffman coding and arithmetic coding.
The theorem also identifies a structural coincidence: two a priori different quantities — the operational information rate (an inf over codes achieving reliable compression) and the Shannon entropy of the marginal distribution (a closed-form sum) — agree for Bernoulli sources. This coincidence is the signature insight of information theory: an operational quantity defined by a limit of combinatorial constructions coincides with an algebraic expression in the source probabilities. The Bernoulli case is the cleanest instance; more general sources (stationary ergodic, Markov, etc.) satisfy analogous equalities whose proofs require progressively stronger versions of the AEP.
Finally, notice the asymmetry between the two ingredients of the proof. The AEP for Bernoulli sources is a probabilistic statement about empirical averages; Shannon's First Coding Theorem is a combinatorial-information-theoretic statement about the existence and non-existence of codes. The two are connected by the typical set $T_n$: concentration of probability on $T_n$ (the AEP) together with knowledge of $|T_n|$ (combinatorial) yields both the achievability (encode $T_n$ efficiently, fail outside $T_n$) and the converse (any reliable code must at least enumerate $T_n$).
Some extensions worth noting follow immediately from the same proof structure.
Extension 1: Replace the Bernoulli hypothesis with stationary ergodicity.
The Shannon–McMillan–Breiman theorem plays the role of the AEP in this more general setting.
The conclusion becomes: the information rate equals the entropy rate $\lim_n \frac{1}{n} H(X_1, \ldots, X_n)$.
Extension 2: Replace the single-letter entropy with a conditional entropy in the Markov case.
For a first-order Markov chain $X_n$, the relevant entropy is $H(X_2 \mid X_1)$.
The typical set is adapted to count conditional empirical frequencies, and the proof otherwise goes through unchanged.
These refinements do not alter the structure of the compositional argument given here; they only replace the AEP-ingredient and the entropy-ingredient with their appropriate generalisations.
[/guided]
[/step]