[proofplan]
We prove the two inequalities $C^{(n)} \leq nC$ and $C^{(n)} \geq nC$. For the upper bound, we use memorylessness to split the conditional entropy $H(Y \mid X)$ into a sum of single-letter conditional entropies, and subadditivity of entropy to bound $H(Y)$ by $\sum_i H(Y_i)$. Rearranging yields $I(X; Y) \leq \sum_{i=1}^n I(X_i; Y_i) \leq nC$. For the lower bound, we exhibit a product input distribution whose components each achieve (or come $\varepsilon$-close to) the single-use capacity; memorylessness then makes the outputs independent, so the mutual information splits additively and equals (or approaches) $nC$.
[/proofplan]
[step:Decompose $H(Y \mid X)$ using memorylessness]
Let $p_{X}$ be any probability mass function on $\mathcal{A}^n$ and let $X = (X_1, \ldots, X_n) \sim p_{X}$, with $Y = (Y_1, \ldots, Y_n)$ the output of the $n$th extension. By definition of the $n$th extension, the conditional distribution of $Y$ given $X = (a_1, \ldots, a_n)$ is the product $\prod_{i=1}^n P(\cdot \mid a_i)$. Equivalently, conditionally on $X$, the coordinates $Y_1, \ldots, Y_n$ are independent, and $Y_i$ depends on $X$ only through $X_i$.
Therefore, for each realisation $x = (a_1, \ldots, a_n) \in \mathcal{A}^n$ with $p_{X}(x) > 0$,
\begin{align*}
H(Y \mid X = x) = \sum_{i=1}^n H(Y_i \mid X_i = a_i),
\end{align*}
by additivity of entropy under independence of the conditional law. Taking the expectation over $X$,
\begin{align*}
H(Y \mid X) = \sum_{i=1}^n \mathbb{E}_{X}\bigl[ H(Y_i \mid X_i) \bigr] = \sum_{i=1}^n H(Y_i \mid X_i),
\end{align*}
where the last equality is the definition of $H(Y_i \mid X_i)$.
[guided]
Memorylessness has two equivalent formulations: the transition kernel factors as a product, and the output coordinates are conditionally independent given the input. We use the second. Concretely: for any $x = (a_1, \ldots, a_n) \in \mathcal{A}^n$ and any $y = (b_1, \ldots, b_n) \in \mathcal{B}^n$,
\begin{align*}
\mathbb{P}(Y = y \mid X = x) = \prod_{i=1}^n P(b_i \mid a_i) = \prod_{i=1}^n \mathbb{P}(Y_i = b_i \mid X_i = a_i).
\end{align*}
This is exactly the statement that, given $X = x$, the $Y_i$ are independent and $Y_i$ has law $P(\cdot \mid a_i)$.
For any random variables $W_1, \ldots, W_n$ on finite alphabets that are independent, $H(W_1, \ldots, W_n) = \sum_i H(W_i)$. Applying this with $W_i$ being $Y_i$ under the conditional law given $X = x$,
\begin{align*}
H(Y \mid X = x) = \sum_{i=1}^n H(Y_i \mid X_i = a_i).
\end{align*}
Why only $H(Y_i \mid X_i = a_i)$ rather than $H(Y_i \mid X = x)$ on the right? Because the conditional law of $Y_i$ given $X = x$ depends on $x$ only through its $i$th coordinate $a_i$, as shown above.
Now average over $X$. By definition of conditional entropy,
\begin{align*}
H(Y \mid X) = \sum_{x} p_{X}(x) H(Y \mid X = x) = \sum_{x} p_{X}(x) \sum_{i=1}^n H(Y_i \mid X_i = a_i).
\end{align*}
Swapping the finite sums (legitimate because everything is finite and non-negative),
\begin{align*}
H(Y \mid X) = \sum_{i=1}^n \sum_{x} p_{X}(x) H(Y_i \mid X_i = a_i) = \sum_{i=1}^n \sum_{a_i \in \mathcal{A}} p_{X_i}(a_i) H(Y_i \mid X_i = a_i),
\end{align*}
where in the last step the inner sum over $x$ was marginalised to a sum over $a_i$ alone (since $H(Y_i \mid X_i = a_i)$ depends only on $a_i$). The inner sum is $H(Y_i \mid X_i)$ by definition, yielding
\begin{align*}
H(Y \mid X) = \sum_{i=1}^n H(Y_i \mid X_i).
\end{align*}
[/guided]
[/step]
[step:Bound $H(Y)$ by subadditivity of entropy]
The random variable $Y$ takes values in the finite set $\mathcal{B}^n$, and $Y_1, \ldots, Y_n$ are its coordinates. By iterated application of [Subadditivity of Joint Entropy](/theorems/???) (equivalently, by induction on $n$ applied to the case $n = 2$),
\begin{align*}
H(Y) = H(Y_1, \ldots, Y_n) \leq \sum_{i=1}^n H(Y_i).
\end{align*}
No independence assumption on the $Y_i$ is required; subadditivity holds for arbitrary joint distributions.
[/step]
[step:Assemble the upper bound $I(X; Y) \leq nC$]
Using the definition of mutual information as $I(X; Y) = H(Y) - H(Y \mid X)$, together with the two previous steps,
\begin{align*}
I(X; Y) &= H(Y) - H(Y \mid X) \\
&\leq \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i) \\
&= \sum_{i=1}^n \bigl(H(Y_i) - H(Y_i \mid X_i)\bigr) \\
&= \sum_{i=1}^n I(X_i; Y_i).
\end{align*}
For each $i \in \{1, \ldots, n\}$, the pair $(X_i, Y_i)$ is formed from the input $X_i$ and the output $Y_i$ of a single use of the DMC: memorylessness ensures that the conditional law of $Y_i$ given $X_i$ is exactly $P(\cdot \mid X_i)$, independently of the other coordinates. Hence by the definition of the information capacity $C$ as a supremum over input distributions,
\begin{align*}
I(X_i; Y_i) \leq C.
\end{align*}
Summing,
\begin{align*}
I(X; Y) \leq \sum_{i=1}^n I(X_i; Y_i) \leq n C.
\end{align*}
Taking the supremum over $p_{X}$ gives $C^{(n)} \leq nC$.
[guided]
The chain so far gives
\begin{align*}
I(X; Y) = H(Y) - H(Y \mid X) \leq \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i) = \sum_{i=1}^n I(X_i; Y_i).
\end{align*}
Two remarks. First, we combined an upper bound on $H(Y)$ with an exact equality for $H(Y \mid X)$. The subtraction preserves the inequality direction because we subtract equal quantities, not bounds. Second, the decomposition
\begin{align*}
\sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i) = \sum_{i=1}^n \bigl(H(Y_i) - H(Y_i \mid X_i)\bigr) = \sum_{i=1}^n I(X_i; Y_i)
\end{align*}
is just rearrangement of a finite sum and the definition of mutual information.
To bound each $I(X_i; Y_i)$ by $C$, we must check that $(X_i, Y_i)$ is a valid single-use input/output pair. The marginal law of $X_i$ is $p_{X_i}$, a probability mass function on $\mathcal{A}$. The conditional law of $Y_i$ given $X_i = a$ is $P(\cdot \mid a)$: indeed, marginalising the memorylessness identity over all $Y_j$ with $j \neq i$,
\begin{align*}
\mathbb{P}(Y_i = b \mid X = x) = \sum_{b_j, j \neq i} \prod_{j=1}^n P(b_j \mid a_j) = P(b \mid a_i) \prod_{j \neq i} \underbrace{\sum_{b_j} P(b_j \mid a_j)}_{= 1} = P(b \mid a_i),
\end{align*}
and a further marginalisation over $X_j$, $j \neq i$, shows that $\mathbb{P}(Y_i = b \mid X_i = a) = P(b \mid a)$. So $(X_i, Y_i)$ is indeed distributed as a single use of the DMC with input marginal $p_{X_i}$, hence $I(X_i; Y_i) \leq \sup_{p_X} I(X; Y) = C$.
Summing and taking the supremum over $p_{X}$ gives $C^{(n)} = \sup_{p_{X}} I(X; Y) \leq nC$.
[/guided]
[/step]
[step:Construct a product input distribution achieving the lower bound $C^{(n)} \geq nC$]
It remains to show $C^{(n)} \geq nC$. Fix $\varepsilon > 0$. By definition of the supremum $C = \sup_{p_X} I(X; Y)$, there exists a probability mass function $p^*$ on $\mathcal{A}$ such that the mutual information $I^*$ between an input $X^* \sim p^*$ and its corresponding single-use output $Y^*$ satisfies
\begin{align*}
I^* = I(X^*; Y^*) \geq C - \varepsilon.
\end{align*}
Let $p_{X}$ be the product distribution $p_{X}(x) = \prod_{i=1}^n p^*(a_i)$, and let $X = (X_1, \ldots, X_n) \sim p_{X}$ and $Y = (Y_1, \ldots, Y_n)$ the corresponding output of the $n$th extension. Under this choice the $X_i$ are i.i.d. with law $p^*$. Combined with memorylessness, this makes the $n$ pairs $(X_i, Y_i)$ mutually independent: for any $(a, b) \in \mathcal{A}^n \times \mathcal{B}^n$,
\begin{align*}
\mathbb{P}(X = a, Y = b) = \prod_{i=1}^n p^*(a_i) \cdot \prod_{i=1}^n P(b_i \mid a_i) = \prod_{i=1}^n \mathbb{P}(X_i = a_i, Y_i = b_i).
\end{align*}
Independence of the $(X_i, Y_i)$ implies independence of the $Y_i$, so
\begin{align*}
H(Y) = \sum_{i=1}^n H(Y_i),
\end{align*}
i.e. subadditivity is tight. Each $Y_i$ has the same law as $Y^*$ (since $X_i \sim p^*$ and the channel is the same) and each $Y_i$ given $X_i$ has the same law as $Y^*$ given $X^*$, so $I(X_i; Y_i) = I^*$ for every $i$. From the chain of equalities established in step 3 (with all inequalities now equalities),
\begin{align*}
I(X; Y) = \sum_{i=1}^n I(X_i; Y_i) = n I^* \geq n(C - \varepsilon) = nC - n\varepsilon.
\end{align*}
Taking the supremum over input distributions,
\begin{align*}
C^{(n)} \geq I(X; Y) \geq nC - n\varepsilon.
\end{align*}
Since $\varepsilon > 0$ was arbitrary, $C^{(n)} \geq nC$.
[guided]
The upper bound traces the two sources of slack: the subadditivity $H(Y) \leq \sum_i H(Y_i)$ and the single-letter bound $I(X_i; Y_i) \leq C$. To achieve equality, we must (i) make the $Y_i$ independent and (ii) make each $I(X_i; Y_i)$ equal to (or arbitrarily close to) $C$. Both are accomplished by the same construction: take the $X_i$ to be i.i.d. with a common distribution $p^*$ that is $\varepsilon$-optimal for the single-use channel.
Why $\varepsilon$-optimal rather than exactly optimal? The supremum defining $C$ is attained in the discrete-memoryless setting because the input distribution ranges over the compact probability simplex on the finite alphabet $\mathcal{A}$ and $p \mapsto I(X; Y)$ is continuous. So we could take $p^*$ exactly attaining $C$. We phrase the argument with $\varepsilon$ for clarity: it shows the lower bound follows from the defining property of the supremum alone, independent of attainment.
The key point is that i.i.d. inputs on a memoryless channel produce independent output pairs. Let us verify this explicitly. The joint law factors as
\begin{align*}
\mathbb{P}(X = a, Y = b) = p_{X}(a) \mathbb{P}(Y = b \mid X = a) = \prod_{i=1}^n p^*(a_i) \prod_{i=1}^n P(b_i \mid a_i) = \prod_{i=1}^n \bigl(p^*(a_i) P(b_i \mid a_i)\bigr),
\end{align*}
using i.i.d. inputs for the first product and memorylessness for the second. The right-hand side is the product of the marginal laws $\mathbb{P}(X_i = a_i, Y_i = b_i) = p^*(a_i) P(b_i \mid a_i)$, so the pairs $(X_i, Y_i)$ are mutually independent. Independence of tuples implies independence of the projections, so $(Y_i)_{i=1}^n$ is independent.
With the $Y_i$ independent, $H(Y_1, \ldots, Y_n) = \sum_i H(Y_i)$, so subadditivity is tight. Combined with the exact equality $H(Y \mid X) = \sum_i H(Y_i \mid X_i)$ from memorylessness,
\begin{align*}
I(X; Y) = H(Y) - H(Y \mid X) = \sum_{i=1}^n \bigl(H(Y_i) - H(Y_i \mid X_i)\bigr) = \sum_{i=1}^n I(X_i; Y_i).
\end{align*}
Each $X_i$ has law $p^*$ and each $Y_i$ given $X_i$ has the DMC transition law, so $(X_i, Y_i)$ is distributed as $(X^*, Y^*)$ and $I(X_i; Y_i) = I^* \geq C - \varepsilon$. Summing, $I(X; Y) \geq n(C - \varepsilon)$.
Taking the supremum over $p_{X}$ — a supremum over a set that contains the particular product distribution just constructed — gives $C^{(n)} \geq n(C - \varepsilon)$, and letting $\varepsilon \downarrow 0$ yields $C^{(n)} \geq nC$.
[/guided]
[/step]
[step:Combine the two bounds]
From the third step, $C^{(n)} \leq nC$, and from the fourth step, $C^{(n)} \geq nC$. Therefore
\begin{align*}
C^{(n)} = nC,
\end{align*}
completing the proof.
[/step]