[proofplan]
The theorem has two independent halves. For the lower bound, we take an arbitrary decipherable code with word lengths $l_1, \ldots, l_m$ and convert the (possibly sub-probability) weights $a^{-l_i}$ into an honest probability distribution $q_i$ by normalisation; Gibbs' inequality then compares $H(X)$ with the cross-entropy against $q$, and [McMillan's inequality](/theorems/???) forces the normalising constant to be at most $1$. For the upper bound, we construct a concrete prefix-free code by choosing $l_i = \lceil -\log_a p_i \rceil$: the ceiling makes the Kraft sum fit under $1$, so [Kraft's inequality](/theorems/???) produces a prefix-free code, and the one-sided ceiling bound yields $\mathbb{E}[S] < H(X)/\log_2 a + 1$.
[/proofplan]
[step:Fix an arbitrary decipherable code and normalise its Kraft weights into a probability distribution]
Let $c$ be any decipherable code on the alphabet $B$ with $|B| = a$, assigning to $\mu_i$ a codeword of length $l_i \in \mathbb{N}_{\geq 1}$. Define the Kraft sum
\begin{align*}
D := \sum_{i=1}^{m} a^{-l_i}.
\end{align*}
By [McMillan's inequality](/theorems/???), which asserts that every decipherable $a$-ary code satisfies $\sum_{i=1}^m a^{-l_i} \leq 1$, we have $D \leq 1$, and in particular $D > 0$. Define
\begin{align*}
q: \{1, \ldots, m\} &\to (0, 1] \\
i &\mapsto \frac{a^{-l_i}}{D}.
\end{align*}
Since $q_i > 0$ and $\sum_{i=1}^m q_i = D^{-1} \sum_{i=1}^m a^{-l_i} = D^{-1} \cdot D = 1$, the tuple $(q_1, \ldots, q_m)$ is a probability distribution on $\{1, \ldots, m\}$.
[guided]
The central trick of the lower bound is to manufacture a probability distribution out of the word lengths. From Kraft/McMillan we know that $\sum a^{-l_i} \leq 1$, so the numbers $a^{-l_i}$ behave like sub-probabilities. The only obstacle to calling them probabilities is that they may fail to sum to exactly $1$ — which we fix by dividing through by their sum $D$.
Concretely, let $c$ be any decipherable code on $B$ (we are not assuming prefix-free; decipherability is the weaker hypothesis). Each $\mu_i$ receives a codeword of positive integer length $l_i$. By [McMillan's inequality](/theorems/???), decipherability forces
\begin{align*}
D := \sum_{i=1}^{m} a^{-l_i} \leq 1,
\end{align*}
and $D > 0$ because every summand is positive. Define
\begin{align*}
q: \{1, \ldots, m\} &\to (0, 1] \\
i &\mapsto \frac{a^{-l_i}}{D}.
\end{align*}
Then $q_i > 0$ for each $i$ and $\sum_i q_i = D^{-1} \sum_i a^{-l_i} = 1$, so $q$ is a genuine probability mass function — the price of normalisation is the factor $D \leq 1$, and this factor is exactly what we will exploit in the next step.
[/guided]
[/step]
[step:Compare $H(X)$ with the cross-entropy against $q$ via Gibbs' inequality]
We apply [Gibbs' inequality](/theorems/???), which states that for any two probability distributions $(p_i)_{i=1}^m$ and $(q_i)_{i=1}^m$ on the same finite set with $p_i, q_i > 0$, one has
\begin{align*}
-\sum_{i=1}^m p_i \log_2 p_i \leq -\sum_{i=1}^m p_i \log_2 q_i,
\end{align*}
with equality if and only if $p_i = q_i$ for all $i$. We verify the hypotheses: $(p_i)$ is the law of $X$, so $\sum p_i = 1$ and we may assume $p_i > 0$ (symbols with $p_i = 0$ contribute zero to both $H(X)$ and $\mathbb{E}[S]$ and may be discarded); $(q_i)$ was shown to be a strictly positive probability distribution in the previous step. Hence Gibbs applies and gives
\begin{align*}
H(X) = -\sum_{i=1}^{m} p_i \log_2 p_i \leq -\sum_{i=1}^{m} p_i \log_2 q_i.
\end{align*}
[/step]
[step:Expand the cross-entropy and invoke $D \leq 1$ to extract $\mathbb{E}[S]$]
Substituting $q_i = a^{-l_i}/D$ into the right-hand side:
\begin{align*}
-\sum_{i=1}^{m} p_i \log_2 q_i
&= -\sum_{i=1}^{m} p_i \bigl( -l_i \log_2 a - \log_2 D \bigr) \\
&= \log_2 a \cdot \sum_{i=1}^{m} p_i l_i + \log_2 D \cdot \sum_{i=1}^{m} p_i \\
&= \log_2 a \cdot \mathbb{E}[S] + \log_2 D,
\end{align*}
where we used $\mathbb{E}[S] = \sum_i p_i l_i$ and $\sum_i p_i = 1$. Combining with Gibbs:
\begin{align*}
H(X) \leq \log_2 a \cdot \mathbb{E}[S] + \log_2 D.
\end{align*}
Since $D \leq 1$ by McMillan, $\log_2 D \leq 0$, so
\begin{align*}
H(X) \leq \log_2 a \cdot \mathbb{E}[S].
\end{align*}
Dividing by $\log_2 a > 0$ (valid because $a \geq 2$) yields the lower bound
\begin{align*}
\frac{H(X)}{\log_2 a} \leq \mathbb{E}[S].
\end{align*}
This holds for every decipherable code, in particular for every optimal code.
[guided]
Substituting $q_i = a^{-l_i}/D$ into the cross-entropy and using the additivity of the logarithm:
\begin{align*}
-\sum_{i=1}^{m} p_i \log_2 q_i
&= -\sum_{i=1}^{m} p_i \log_2\bigl(a^{-l_i}/D\bigr) \\
&= -\sum_{i=1}^{m} p_i \bigl(-l_i \log_2 a - \log_2 D\bigr) \\
&= \log_2 a \cdot \sum_{i=1}^{m} p_i l_i + \log_2 D \cdot \sum_{i=1}^{m} p_i.
\end{align*}
The first sum is exactly the expected word length $\mathbb{E}[S] = \sum_i p_i l_i$, and the second sum is $\sum_i p_i = 1$ because $p$ is a probability distribution. So:
\begin{align*}
H(X) \leq \log_2 a \cdot \mathbb{E}[S] + \log_2 D. \tag{$\ast$}
\end{align*}
Now the finishing move: we want to drop the $\log_2 D$ term. This is where McMillan is consumed. Because $D \leq 1$, we have $\log_2 D \leq 0$, so the left-hand side of ($\ast$) is bounded by $\log_2 a \cdot \mathbb{E}[S]$ alone:
\begin{align*}
H(X) \leq \log_2 a \cdot \mathbb{E}[S].
\end{align*}
Dividing by $\log_2 a$ — legal because $a \geq 2$ so $\log_2 a \geq 1 > 0$ — gives
\begin{align*}
\frac{H(X)}{\log_2 a} \leq \mathbb{E}[S],
\end{align*}
as claimed. Note that the argument used decipherability only through the McMillan bound $D \leq 1$; prefix-freeness was never required. This is why the lower bound applies to every decipherable code, not just every prefix-free code.
[/guided]
[/step]
[step:Construct a prefix-free code with word lengths $l_i = \lceil -\log_a p_i \rceil$ to establish the upper bound]
For the upper bound we construct a specific prefix-free code and show its expected length is less than $H(X)/\log_2 a + 1$; since an optimal code has expected length no larger than this, the same bound follows. Again we may assume $p_i > 0$ for all $i$. Define
\begin{align*}
l_i := \lceil -\log_a p_i \rceil \in \mathbb{N}_{\geq 1}.
\end{align*}
The ceiling function satisfies $-\log_a p_i \leq l_i < -\log_a p_i + 1$ for every $i$. Exponentiating the left inequality with base $a > 1$ (which preserves the direction):
\begin{align*}
a^{-l_i} \leq a^{\log_a p_i} = p_i.
\end{align*}
Summing over $i$ and using $\sum_i p_i = 1$:
\begin{align*}
\sum_{i=1}^{m} a^{-l_i} \leq \sum_{i=1}^{m} p_i = 1.
\end{align*}
By [Kraft's inequality](/theorems/???), which states that integers $l_1, \ldots, l_m \geq 1$ satisfying $\sum_i a^{-l_i} \leq 1$ are the word lengths of some prefix-free $a$-ary code, there exists a prefix-free code $c^\star$ with these word lengths.
[/step]
[step:Estimate the expected length of $c^\star$ from the ceiling bound]
The expected length of $c^\star$ is
\begin{align*}
\mathbb{E}[S^\star] = \sum_{i=1}^{m} p_i l_i.
\end{align*}
Using the right inequality $l_i < -\log_a p_i + 1$:
\begin{align*}
\mathbb{E}[S^\star]
= \sum_{i=1}^{m} p_i l_i
< \sum_{i=1}^{m} p_i \bigl(-\log_a p_i + 1\bigr)
= -\sum_{i=1}^{m} p_i \log_a p_i + \sum_{i=1}^{m} p_i.
\end{align*}
The first sum equals $H(X)/\log_2 a$ because $\log_a p_i = \log_2 p_i / \log_2 a$, so $-\sum_i p_i \log_a p_i = H(X)/\log_2 a$. The second sum equals $1$. Therefore
\begin{align*}
\mathbb{E}[S^\star] < \frac{H(X)}{\log_2 a} + 1.
\end{align*}
[guided]
The expected length of the constructed code $c^\star$ is
\begin{align*}
\mathbb{E}[S^\star] = \sum_{i=1}^{m} p_i l_i,
\end{align*}
and we bound each $l_i$ from above using the defining property of the ceiling: $l_i = \lceil -\log_a p_i \rceil < -\log_a p_i + 1$. Multiplying by $p_i \geq 0$ (inequality preserved) and summing:
\begin{align*}
\mathbb{E}[S^\star]
< \sum_{i=1}^{m} p_i \bigl(-\log_a p_i + 1\bigr)
= -\sum_{i=1}^{m} p_i \log_a p_i + \sum_{i=1}^{m} p_i.
\end{align*}
We want to rewrite the first sum in terms of $H(X) = -\sum_i p_i \log_2 p_i$. The change-of-base identity $\log_a p_i = \log_2 p_i / \log_2 a$ gives
\begin{align*}
-\sum_{i=1}^{m} p_i \log_a p_i = -\frac{1}{\log_2 a} \sum_{i=1}^{m} p_i \log_2 p_i = \frac{H(X)}{\log_2 a}.
\end{align*}
The second sum is $\sum_i p_i = 1$. Combining:
\begin{align*}
\mathbb{E}[S^\star] < \frac{H(X)}{\log_2 a} + 1.
\end{align*}
The strict inequality is inherited from the strict inequality $l_i < -\log_a p_i + 1$ (unless $-\log_a p_i$ happens to be an integer for every $i$, in which case the bound is tight in the weak sense but the proof gives strict inequality once any single $p_i$ is not a power of $a$).
[/guided]
[/step]
[step:Combine the bounds on $c^\star$ and an optimal code to conclude]
Let $c^{\mathrm{opt}}$ be any optimal decipherable code, with expected length $\mathbb{E}[S^{\mathrm{opt}}]$. By optimality,
\begin{align*}
\mathbb{E}[S^{\mathrm{opt}}] \leq \mathbb{E}[S^\star] < \frac{H(X)}{\log_2 a} + 1,
\end{align*}
since $c^\star$ is a prefix-free (hence decipherable) code. Combining with the lower bound established earlier — which applies to every decipherable code, in particular to $c^{\mathrm{opt}}$ — we obtain
\begin{align*}
\frac{H(X)}{\log_2 a} \leq \mathbb{E}[S^{\mathrm{opt}}] < \frac{H(X)}{\log_2 a} + 1,
\end{align*}
which is the stated inequality. This completes the proof.
[/step]