Shannon's Noiseless Coding Theorem — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The theorem has two independent halves. For the lower bound, we take an arbitrary decipherable code with word lengths $l_1, \ldots, l_m$ and convert the (possibly sub-probability) weights $a^{-l_i}$ into an honest probability distribution $q_i$ by normalisation; Gibbs' inequality then compares $H(X)$ with the cross-entropy against $q$, and [McMillan's inequality](/theorems/???) forces the normalising constant to be at most $1$. For the upper bound, we construct a concrete prefix-free code by choosing $l_i = \lceil -\log_a p_i \rceil$: the ceiling makes the Kraft sum fit under $1$, so [Kraft's inequality](/theorems/???) produces a prefix-free code, and the one-sided ceiling bound yields $\mathbb{E}[S] < H(X)/\log_2 a + 1$. [/proofplan] [step:Fix an arbitrary decipherable code and normalise its Kraft weights into a probability distribution] Let $c$ be any decipherable code on the alphabet $B$ with $|B| = a$, assigning to $\mu_i$ a codeword of length $l_i \in \mathbb{N}_{\geq 1}$. Define the Kraft sum \begin{align*} D := \sum_{i=1}^{m} a^{-l_i}. \end{align*} By [McMillan's inequality](/theorems/???), which asserts that every decipherable $a$-ary code satisfies $\sum_{i=1}^m a^{-l_i} \leq 1$, we have $D \leq 1$, and in particular $D > 0$. Define \begin{align*} q: \{1, \ldots, m\} &\to (0, 1] \\ i &\mapsto \frac{a^{-l_i}}{D}. \end{align*} Since $q_i > 0$ and $\sum_{i=1}^m q_i = D^{-1} \sum_{i=1}^m a^{-l_i} = D^{-1} \cdot D = 1$, the tuple $(q_1, \ldots, q_m)$ is a probability distribution on $\{1, \ldots, m\}$. [guided] The central trick of the lower bound is to manufacture a probability distribution out of the word lengths. From Kraft/McMillan we know that $\sum a^{-l_i} \leq 1$, so the numbers $a^{-l_i}$ behave like sub-probabilities. The only obstacle to calling them probabilities is that they may fail to sum to exactly $1$ — which we fix by dividing through by their sum $D$. Concretely, let $c$ be any decipherable code on $B$ (we are not assuming prefix-free; decipherability is the weaker hypothesis). Each $\mu_i$ receives a codeword of positive integer length $l_i$. By [McMillan's inequality](/theorems/???), decipherability forces \begin{align*} D := \sum_{i=1}^{m} a^{-l_i} \leq 1, \end{align*} and $D > 0$ because every summand is positive. Define \begin{align*} q: \{1, \ldots, m\} &\to (0, 1] \\ i &\mapsto \frac{a^{-l_i}}{D}. \end{align*} Then $q_i > 0$ for each $i$ and $\sum_i q_i = D^{-1} \sum_i a^{-l_i} = 1$, so $q$ is a genuine probability mass function — the price of normalisation is the factor $D \leq 1$, and this factor is exactly what we will exploit in the next step. [/guided] [/step] [step:Compare $H(X)$ with the cross-entropy against $q$ via Gibbs' inequality] We apply [Gibbs' inequality](/theorems/???), which states that for any two probability distributions $(p_i)_{i=1}^m$ and $(q_i)_{i=1}^m$ on the same finite set with $p_i, q_i > 0$, one has \begin{align*} -\sum_{i=1}^m p_i \log_2 p_i \leq -\sum_{i=1}^m p_i \log_2 q_i, \end{align*} with equality if and only if $p_i = q_i$ for all $i$. We verify the hypotheses: $(p_i)$ is the law of $X$, so $\sum p_i = 1$ and we may assume $p_i > 0$ (symbols with $p_i = 0$ contribute zero to both $H(X)$ and $\mathbb{E}[S]$ and may be discarded); $(q_i)$ was shown to be a strictly positive probability distribution in the previous step. Hence Gibbs applies and gives \begin{align*} H(X) = -\sum_{i=1}^{m} p_i \log_2 p_i \leq -\sum_{i=1}^{m} p_i \log_2 q_i. \end{align*} [/step] [step:Expand the cross-entropy and invoke $D \leq 1$ to extract $\mathbb{E}[S]$] Substituting $q_i = a^{-l_i}/D$ into the right-hand side: \begin{align*} -\sum_{i=1}^{m} p_i \log_2 q_i &= -\sum_{i=1}^{m} p_i \bigl( -l_i \log_2 a - \log_2 D \bigr) \\ &= \log_2 a \cdot \sum_{i=1}^{m} p_i l_i + \log_2 D \cdot \sum_{i=1}^{m} p_i \\ &= \log_2 a \cdot \mathbb{E}[S] + \log_2 D, \end{align*} where we used $\mathbb{E}[S] = \sum_i p_i l_i$ and $\sum_i p_i = 1$. Combining with Gibbs: \begin{align*} H(X) \leq \log_2 a \cdot \mathbb{E}[S] + \log_2 D. \end{align*} Since $D \leq 1$ by McMillan, $\log_2 D \leq 0$, so \begin{align*} H(X) \leq \log_2 a \cdot \mathbb{E}[S]. \end{align*} Dividing by $\log_2 a > 0$ (valid because $a \geq 2$) yields the lower bound \begin{align*} \frac{H(X)}{\log_2 a} \leq \mathbb{E}[S]. \end{align*} This holds for every decipherable code, in particular for every optimal code. [guided] Substituting $q_i = a^{-l_i}/D$ into the cross-entropy and using the additivity of the logarithm: \begin{align*} -\sum_{i=1}^{m} p_i \log_2 q_i &= -\sum_{i=1}^{m} p_i \log_2\bigl(a^{-l_i}/D\bigr) \\ &= -\sum_{i=1}^{m} p_i \bigl(-l_i \log_2 a - \log_2 D\bigr) \\ &= \log_2 a \cdot \sum_{i=1}^{m} p_i l_i + \log_2 D \cdot \sum_{i=1}^{m} p_i. \end{align*} The first sum is exactly the expected word length $\mathbb{E}[S] = \sum_i p_i l_i$, and the second sum is $\sum_i p_i = 1$ because $p$ is a probability distribution. So: \begin{align*} H(X) \leq \log_2 a \cdot \mathbb{E}[S] + \log_2 D. \tag{$\ast$} \end{align*} Now the finishing move: we want to drop the $\log_2 D$ term. This is where McMillan is consumed. Because $D \leq 1$, we have $\log_2 D \leq 0$, so the left-hand side of ($\ast$) is bounded by $\log_2 a \cdot \mathbb{E}[S]$ alone: \begin{align*} H(X) \leq \log_2 a \cdot \mathbb{E}[S]. \end{align*} Dividing by $\log_2 a$ — legal because $a \geq 2$ so $\log_2 a \geq 1 > 0$ — gives \begin{align*} \frac{H(X)}{\log_2 a} \leq \mathbb{E}[S], \end{align*} as claimed. Note that the argument used decipherability only through the McMillan bound $D \leq 1$; prefix-freeness was never required. This is why the lower bound applies to every decipherable code, not just every prefix-free code. [/guided] [/step] [step:Construct a prefix-free code with word lengths $l_i = \lceil -\log_a p_i \rceil$ to establish the upper bound] For the upper bound we construct a specific prefix-free code and show its expected length is less than $H(X)/\log_2 a + 1$; since an optimal code has expected length no larger than this, the same bound follows. Again we may assume $p_i > 0$ for all $i$. Define \begin{align*} l_i := \lceil -\log_a p_i \rceil \in \mathbb{N}_{\geq 1}. \end{align*} The ceiling function satisfies $-\log_a p_i \leq l_i < -\log_a p_i + 1$ for every $i$. Exponentiating the left inequality with base $a > 1$ (which preserves the direction): \begin{align*} a^{-l_i} \leq a^{\log_a p_i} = p_i. \end{align*} Summing over $i$ and using $\sum_i p_i = 1$: \begin{align*} \sum_{i=1}^{m} a^{-l_i} \leq \sum_{i=1}^{m} p_i = 1. \end{align*} By [Kraft's inequality](/theorems/???), which states that integers $l_1, \ldots, l_m \geq 1$ satisfying $\sum_i a^{-l_i} \leq 1$ are the word lengths of some prefix-free $a$-ary code, there exists a prefix-free code $c^\star$ with these word lengths. [/step] [step:Estimate the expected length of $c^\star$ from the ceiling bound] The expected length of $c^\star$ is \begin{align*} \mathbb{E}[S^\star] = \sum_{i=1}^{m} p_i l_i. \end{align*} Using the right inequality $l_i < -\log_a p_i + 1$: \begin{align*} \mathbb{E}[S^\star] = \sum_{i=1}^{m} p_i l_i < \sum_{i=1}^{m} p_i \bigl(-\log_a p_i + 1\bigr) = -\sum_{i=1}^{m} p_i \log_a p_i + \sum_{i=1}^{m} p_i. \end{align*} The first sum equals $H(X)/\log_2 a$ because $\log_a p_i = \log_2 p_i / \log_2 a$, so $-\sum_i p_i \log_a p_i = H(X)/\log_2 a$. The second sum equals $1$. Therefore \begin{align*} \mathbb{E}[S^\star] < \frac{H(X)}{\log_2 a} + 1. \end{align*} [guided] The expected length of the constructed code $c^\star$ is \begin{align*} \mathbb{E}[S^\star] = \sum_{i=1}^{m} p_i l_i, \end{align*} and we bound each $l_i$ from above using the defining property of the ceiling: $l_i = \lceil -\log_a p_i \rceil < -\log_a p_i + 1$. Multiplying by $p_i \geq 0$ (inequality preserved) and summing: \begin{align*} \mathbb{E}[S^\star] < \sum_{i=1}^{m} p_i \bigl(-\log_a p_i + 1\bigr) = -\sum_{i=1}^{m} p_i \log_a p_i + \sum_{i=1}^{m} p_i. \end{align*} We want to rewrite the first sum in terms of $H(X) = -\sum_i p_i \log_2 p_i$. The change-of-base identity $\log_a p_i = \log_2 p_i / \log_2 a$ gives \begin{align*} -\sum_{i=1}^{m} p_i \log_a p_i = -\frac{1}{\log_2 a} \sum_{i=1}^{m} p_i \log_2 p_i = \frac{H(X)}{\log_2 a}. \end{align*} The second sum is $\sum_i p_i = 1$. Combining: \begin{align*} \mathbb{E}[S^\star] < \frac{H(X)}{\log_2 a} + 1. \end{align*} The strict inequality is inherited from the strict inequality $l_i < -\log_a p_i + 1$ (unless $-\log_a p_i$ happens to be an integer for every $i$, in which case the bound is tight in the weak sense but the proof gives strict inequality once any single $p_i$ is not a power of $a$). [/guided] [/step] [step:Combine the bounds on $c^\star$ and an optimal code to conclude] Let $c^{\mathrm{opt}}$ be any optimal decipherable code, with expected length $\mathbb{E}[S^{\mathrm{opt}}]$. By optimality, \begin{align*} \mathbb{E}[S^{\mathrm{opt}}] \leq \mathbb{E}[S^\star] < \frac{H(X)}{\log_2 a} + 1, \end{align*} since $c^\star$ is a prefix-free (hence decipherable) code. Combining with the lower bound established earlier — which applies to every decipherable code, in particular to $c^{\mathrm{opt}}$ — we obtain \begin{align*} \frac{H(X)}{\log_2 a} \leq \mathbb{E}[S^{\mathrm{opt}}] < \frac{H(X)}{\log_2 a} + 1, \end{align*} which is the stated inequality. This completes the proof. [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Shannon's Noiseless Coding Theorem (Theorem # 1631)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Shannon's Noiseless Coding Theorem (Theorem # 1631)

Discussion

Proof

Explore Further