Capacity of the nth Extension — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We prove the two inequalities $C^{(n)} \leq nC$ and $C^{(n)} \geq nC$. For the upper bound, we use memorylessness to split the conditional entropy $H(Y \mid X)$ into a sum of single-letter conditional entropies, and subadditivity of entropy to bound $H(Y)$ by $\sum_i H(Y_i)$. Rearranging yields $I(X; Y) \leq \sum_{i=1}^n I(X_i; Y_i) \leq nC$. For the lower bound, we exhibit a product input distribution whose components each achieve (or come $\varepsilon$-close to) the single-use capacity; memorylessness then makes the outputs independent, so the mutual information splits additively and equals (or approaches) $nC$. [/proofplan] [step:Decompose $H(Y \mid X)$ using memorylessness] Let $p_{X}$ be any probability mass function on $\mathcal{A}^n$ and let $X = (X_1, \ldots, X_n) \sim p_{X}$, with $Y = (Y_1, \ldots, Y_n)$ the output of the $n$th extension. By definition of the $n$th extension, the conditional distribution of $Y$ given $X = (a_1, \ldots, a_n)$ is the product $\prod_{i=1}^n P(\cdot \mid a_i)$. Equivalently, conditionally on $X$, the coordinates $Y_1, \ldots, Y_n$ are independent, and $Y_i$ depends on $X$ only through $X_i$. Therefore, for each realisation $x = (a_1, \ldots, a_n) \in \mathcal{A}^n$ with $p_{X}(x) > 0$, \begin{align*} H(Y \mid X = x) = \sum_{i=1}^n H(Y_i \mid X_i = a_i), \end{align*} by additivity of entropy under independence of the conditional law. Taking the expectation over $X$, \begin{align*} H(Y \mid X) = \sum_{i=1}^n \mathbb{E}_{X}\bigl[ H(Y_i \mid X_i) \bigr] = \sum_{i=1}^n H(Y_i \mid X_i), \end{align*} where the last equality is the definition of $H(Y_i \mid X_i)$. [guided] Memorylessness has two equivalent formulations: the transition kernel factors as a product, and the output coordinates are conditionally independent given the input. We use the second. Concretely: for any $x = (a_1, \ldots, a_n) \in \mathcal{A}^n$ and any $y = (b_1, \ldots, b_n) \in \mathcal{B}^n$, \begin{align*} \mathbb{P}(Y = y \mid X = x) = \prod_{i=1}^n P(b_i \mid a_i) = \prod_{i=1}^n \mathbb{P}(Y_i = b_i \mid X_i = a_i). \end{align*} This is exactly the statement that, given $X = x$, the $Y_i$ are independent and $Y_i$ has law $P(\cdot \mid a_i)$. For any random variables $W_1, \ldots, W_n$ on finite alphabets that are independent, $H(W_1, \ldots, W_n) = \sum_i H(W_i)$. Applying this with $W_i$ being $Y_i$ under the conditional law given $X = x$, \begin{align*} H(Y \mid X = x) = \sum_{i=1}^n H(Y_i \mid X_i = a_i). \end{align*} Why only $H(Y_i \mid X_i = a_i)$ rather than $H(Y_i \mid X = x)$ on the right? Because the conditional law of $Y_i$ given $X = x$ depends on $x$ only through its $i$th coordinate $a_i$, as shown above. Now average over $X$. By definition of conditional entropy, \begin{align*} H(Y \mid X) = \sum_{x} p_{X}(x) H(Y \mid X = x) = \sum_{x} p_{X}(x) \sum_{i=1}^n H(Y_i \mid X_i = a_i). \end{align*} Swapping the finite sums (legitimate because everything is finite and non-negative), \begin{align*} H(Y \mid X) = \sum_{i=1}^n \sum_{x} p_{X}(x) H(Y_i \mid X_i = a_i) = \sum_{i=1}^n \sum_{a_i \in \mathcal{A}} p_{X_i}(a_i) H(Y_i \mid X_i = a_i), \end{align*} where in the last step the inner sum over $x$ was marginalised to a sum over $a_i$ alone (since $H(Y_i \mid X_i = a_i)$ depends only on $a_i$). The inner sum is $H(Y_i \mid X_i)$ by definition, yielding \begin{align*} H(Y \mid X) = \sum_{i=1}^n H(Y_i \mid X_i). \end{align*} [/guided] [/step] [step:Bound $H(Y)$ by subadditivity of entropy] The random variable $Y$ takes values in the finite set $\mathcal{B}^n$, and $Y_1, \ldots, Y_n$ are its coordinates. By iterated application of [Subadditivity of Joint Entropy](/theorems/???) (equivalently, by induction on $n$ applied to the case $n = 2$), \begin{align*} H(Y) = H(Y_1, \ldots, Y_n) \leq \sum_{i=1}^n H(Y_i). \end{align*} No independence assumption on the $Y_i$ is required; subadditivity holds for arbitrary joint distributions. [/step] [step:Assemble the upper bound $I(X; Y) \leq nC$] Using the definition of mutual information as $I(X; Y) = H(Y) - H(Y \mid X)$, together with the two previous steps, \begin{align*} I(X; Y) &= H(Y) - H(Y \mid X) \\ &\leq \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i) \\ &= \sum_{i=1}^n \bigl(H(Y_i) - H(Y_i \mid X_i)\bigr) \\ &= \sum_{i=1}^n I(X_i; Y_i). \end{align*} For each $i \in \{1, \ldots, n\}$, the pair $(X_i, Y_i)$ is formed from the input $X_i$ and the output $Y_i$ of a single use of the DMC: memorylessness ensures that the conditional law of $Y_i$ given $X_i$ is exactly $P(\cdot \mid X_i)$, independently of the other coordinates. Hence by the definition of the information capacity $C$ as a supremum over input distributions, \begin{align*} I(X_i; Y_i) \leq C. \end{align*} Summing, \begin{align*} I(X; Y) \leq \sum_{i=1}^n I(X_i; Y_i) \leq n C. \end{align*} Taking the supremum over $p_{X}$ gives $C^{(n)} \leq nC$. [guided] The chain so far gives \begin{align*} I(X; Y) = H(Y) - H(Y \mid X) \leq \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i) = \sum_{i=1}^n I(X_i; Y_i). \end{align*} Two remarks. First, we combined an upper bound on $H(Y)$ with an exact equality for $H(Y \mid X)$. The subtraction preserves the inequality direction because we subtract equal quantities, not bounds. Second, the decomposition \begin{align*} \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i) = \sum_{i=1}^n \bigl(H(Y_i) - H(Y_i \mid X_i)\bigr) = \sum_{i=1}^n I(X_i; Y_i) \end{align*} is just rearrangement of a finite sum and the definition of mutual information. To bound each $I(X_i; Y_i)$ by $C$, we must check that $(X_i, Y_i)$ is a valid single-use input/output pair. The marginal law of $X_i$ is $p_{X_i}$, a probability mass function on $\mathcal{A}$. The conditional law of $Y_i$ given $X_i = a$ is $P(\cdot \mid a)$: indeed, marginalising the memorylessness identity over all $Y_j$ with $j \neq i$, \begin{align*} \mathbb{P}(Y_i = b \mid X = x) = \sum_{b_j, j \neq i} \prod_{j=1}^n P(b_j \mid a_j) = P(b \mid a_i) \prod_{j \neq i} \underbrace{\sum_{b_j} P(b_j \mid a_j)}_{= 1} = P(b \mid a_i), \end{align*} and a further marginalisation over $X_j$, $j \neq i$, shows that $\mathbb{P}(Y_i = b \mid X_i = a) = P(b \mid a)$. So $(X_i, Y_i)$ is indeed distributed as a single use of the DMC with input marginal $p_{X_i}$, hence $I(X_i; Y_i) \leq \sup_{p_X} I(X; Y) = C$. Summing and taking the supremum over $p_{X}$ gives $C^{(n)} = \sup_{p_{X}} I(X; Y) \leq nC$. [/guided] [/step] [step:Construct a product input distribution achieving the lower bound $C^{(n)} \geq nC$] It remains to show $C^{(n)} \geq nC$. Fix $\varepsilon > 0$. By definition of the supremum $C = \sup_{p_X} I(X; Y)$, there exists a probability mass function $p^*$ on $\mathcal{A}$ such that the mutual information $I^*$ between an input $X^* \sim p^*$ and its corresponding single-use output $Y^*$ satisfies \begin{align*} I^* = I(X^*; Y^*) \geq C - \varepsilon. \end{align*} Let $p_{X}$ be the product distribution $p_{X}(x) = \prod_{i=1}^n p^*(a_i)$, and let $X = (X_1, \ldots, X_n) \sim p_{X}$ and $Y = (Y_1, \ldots, Y_n)$ the corresponding output of the $n$th extension. Under this choice the $X_i$ are i.i.d. with law $p^*$. Combined with memorylessness, this makes the $n$ pairs $(X_i, Y_i)$ mutually independent: for any $(a, b) \in \mathcal{A}^n \times \mathcal{B}^n$, \begin{align*} \mathbb{P}(X = a, Y = b) = \prod_{i=1}^n p^*(a_i) \cdot \prod_{i=1}^n P(b_i \mid a_i) = \prod_{i=1}^n \mathbb{P}(X_i = a_i, Y_i = b_i). \end{align*} Independence of the $(X_i, Y_i)$ implies independence of the $Y_i$, so \begin{align*} H(Y) = \sum_{i=1}^n H(Y_i), \end{align*} i.e. subadditivity is tight. Each $Y_i$ has the same law as $Y^*$ (since $X_i \sim p^*$ and the channel is the same) and each $Y_i$ given $X_i$ has the same law as $Y^*$ given $X^*$, so $I(X_i; Y_i) = I^*$ for every $i$. From the chain of equalities established in step 3 (with all inequalities now equalities), \begin{align*} I(X; Y) = \sum_{i=1}^n I(X_i; Y_i) = n I^* \geq n(C - \varepsilon) = nC - n\varepsilon. \end{align*} Taking the supremum over input distributions, \begin{align*} C^{(n)} \geq I(X; Y) \geq nC - n\varepsilon. \end{align*} Since $\varepsilon > 0$ was arbitrary, $C^{(n)} \geq nC$. [guided] The upper bound traces the two sources of slack: the subadditivity $H(Y) \leq \sum_i H(Y_i)$ and the single-letter bound $I(X_i; Y_i) \leq C$. To achieve equality, we must (i) make the $Y_i$ independent and (ii) make each $I(X_i; Y_i)$ equal to (or arbitrarily close to) $C$. Both are accomplished by the same construction: take the $X_i$ to be i.i.d. with a common distribution $p^*$ that is $\varepsilon$-optimal for the single-use channel. Why $\varepsilon$-optimal rather than exactly optimal? The supremum defining $C$ is attained in the discrete-memoryless setting because the input distribution ranges over the compact probability simplex on the finite alphabet $\mathcal{A}$ and $p \mapsto I(X; Y)$ is continuous. So we could take $p^*$ exactly attaining $C$. We phrase the argument with $\varepsilon$ for clarity: it shows the lower bound follows from the defining property of the supremum alone, independent of attainment. The key point is that i.i.d. inputs on a memoryless channel produce independent output pairs. Let us verify this explicitly. The joint law factors as \begin{align*} \mathbb{P}(X = a, Y = b) = p_{X}(a) \mathbb{P}(Y = b \mid X = a) = \prod_{i=1}^n p^*(a_i) \prod_{i=1}^n P(b_i \mid a_i) = \prod_{i=1}^n \bigl(p^*(a_i) P(b_i \mid a_i)\bigr), \end{align*} using i.i.d. inputs for the first product and memorylessness for the second. The right-hand side is the product of the marginal laws $\mathbb{P}(X_i = a_i, Y_i = b_i) = p^*(a_i) P(b_i \mid a_i)$, so the pairs $(X_i, Y_i)$ are mutually independent. Independence of tuples implies independence of the projections, so $(Y_i)_{i=1}^n$ is independent. With the $Y_i$ independent, $H(Y_1, \ldots, Y_n) = \sum_i H(Y_i)$, so subadditivity is tight. Combined with the exact equality $H(Y \mid X) = \sum_i H(Y_i \mid X_i)$ from memorylessness, \begin{align*} I(X; Y) = H(Y) - H(Y \mid X) = \sum_{i=1}^n \bigl(H(Y_i) - H(Y_i \mid X_i)\bigr) = \sum_{i=1}^n I(X_i; Y_i). \end{align*} Each $X_i$ has law $p^*$ and each $Y_i$ given $X_i$ has the DMC transition law, so $(X_i, Y_i)$ is distributed as $(X^*, Y^*)$ and $I(X_i; Y_i) = I^* \geq C - \varepsilon$. Summing, $I(X; Y) \geq n(C - \varepsilon)$. Taking the supremum over $p_{X}$ — a supremum over a set that contains the particular product distribution just constructed — gives $C^{(n)} \geq n(C - \varepsilon)$, and letting $\varepsilon \downarrow 0$ yields $C^{(n)} \geq nC$. [/guided] [/step] [step:Combine the two bounds] From the third step, $C^{(n)} \leq nC$, and from the fourth step, $C^{(n)} \geq nC$. Therefore \begin{align*} C^{(n)} = nC, \end{align*} completing the proof. [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Capacity of the nth Extension (Theorem # 1656)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Capacity of the nth Extension (Theorem # 1656)

Discussion

Proof

Explore Further