[proofplan]
Because the decryption rule $d(e(m, k), k) = m$ expresses the message $M$ as a deterministic function of $(C, K)$, conditional entropy theory gives $H(M \mid C, K) = 0$. Applying the chain rule to the joint entropy $H(M, K \mid C)$ in two different orders and equating the two expressions yields the identity $H(K \mid C) = H(M \mid C) + H(K \mid M, C)$. Non-negativity of $H(K \mid M, C)$ yields the claimed inequality.
[/proofplan]
[step:Establish that $M$ is determined by the pair $(C, K)$]
By the cryptosystem axiom, for all $m \in \mathcal{M}$ and $k \in \mathcal{K}$ we have $d(e(m, k), k) = m$. Since $C = e(M, K)$, this gives the pointwise identity
\begin{align*}
M = d(C, K) \qquad \text{with probability } 1.
\end{align*}
In particular, $M$ is a deterministic (measurable) function of the pair $(C, K)$.
A standard property of Shannon entropy states: if $X$ is a deterministic function of $Y$, then $H(X \mid Y) = 0$. Applying this with $X = M$ and $Y = (C, K)$,
\begin{align*}
H(M \mid C, K) = 0.
\end{align*}
[guided]
The inequality $H(M \mid C) \leq H(K \mid C)$ expresses the intuition that "the key hides the message at least as well as it hides itself": once an eavesdropper learns the ciphertext, their residual uncertainty about the message cannot exceed their residual uncertainty about the key.
Our starting point is the observation that the cryptosystem is **invertible given the key**. The axiom $d(e(m, k), k) = m$ says that if we know the ciphertext $c = e(m, k)$ and the key $k$, we can recover $m$ via $m = d(c, k)$. This determinism translates directly into vanishing conditional entropy: $H(M \mid C, K) = 0$. Entropy measures residual uncertainty, and there is no uncertainty in $M$ once $(C, K)$ is given.
Why is this the essential input? Without invertibility, learning the key would not determine the plaintext, and the bound could fail. It is exactly the decryption axiom of the cryptosystem that makes key-equivocation a meaningful upper bound for message-equivocation.
[/guided]
[/step]
[step:Apply the chain rule for conditional entropy in two orders]
Consider the joint conditional entropy $H(M, K \mid C)$. By the chain rule for conditional entropy, for random variables $X, Y, Z$ with finite ranges,
\begin{align*}
H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z).
\end{align*}
Applying this first with $X = K$, $Y = M$, $Z = C$:
\begin{align*}
H(M, K \mid C) = H(K \mid C) + H(M \mid K, C). \qquad (\star)
\end{align*}
Applying it with $X = M$, $Y = K$, $Z = C$:
\begin{align*}
H(M, K \mid C) = H(M \mid C) + H(K \mid M, C). \qquad (\star\star)
\end{align*}
Both expressions compute the same quantity $H(M, K \mid C)$, so the right-hand sides of $(\star)$ and $(\star\star)$ are equal:
\begin{align*}
H(K \mid C) + H(M \mid K, C) = H(M \mid C) + H(K \mid M, C).
\end{align*}
[guided]
We apply the chain rule for conditional entropy twice, with the variables in different orders. The chain rule
\begin{align*}
H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z)
\end{align*}
is the conditional analogue of the joint-entropy chain rule $H(X, Y) = H(X) + H(Y \mid X)$. It decomposes the joint uncertainty of $(X, Y)$ given $Z$ as "uncertainty in $X$ alone (given $Z$) plus uncertainty in $Y$ given $X$ and $Z$."
Now here is the key observation: we compute $H(M, K \mid C)$ in **two different ways** by swapping the roles of $M$ and $K$. Both decompositions give the same quantity because the chain rule is symmetric in which variable is peeled off first. Equating the two expansions links $H(K \mid C)$ to $H(M \mid C)$, with the remainder terms being $H(M \mid K, C)$ and $H(K \mid M, C)$. This symmetric-expansion trick is a standard technique for comparing entropies: pick a joint entropy you can expand two different ways, and equate.
[/guided]
[/step]
[step:Use the decryption identity to drop $H(M \mid K, C)$]
From Step 1 we have $H(M \mid C, K) = 0$. The conditional entropy $H(M \mid K, C)$ is the same quantity (conditional entropy depends only on the $\sigma$-algebra generated by the conditioning variables, which is independent of the order of $K$ and $C$). Substituting into the equation from Step 2,
\begin{align*}
H(K \mid C) + 0 = H(M \mid C) + H(K \mid M, C),
\end{align*}
i.e.,
\begin{align*}
H(K \mid C) = H(M \mid C) + H(K \mid M, C).
\end{align*}
[/step]
[step:Apply non-negativity of conditional entropy]
For any discrete random variables $X, Y$ with finite ranges, the conditional entropy $H(X \mid Y)$ is non-negative: it is an average of Shannon entropies $H(X \mid Y = y)$, each of which is non-negative as an average of terms $-p \log p \geq 0$ for $p \in [0, 1]$.
Applied with $X = K$ and $Y = (M, C)$,
\begin{align*}
H(K \mid M, C) \geq 0.
\end{align*}
Substituting into the identity from Step 3,
\begin{align*}
H(K \mid C) = H(M \mid C) + H(K \mid M, C) \geq H(M \mid C),
\end{align*}
which rearranges to
\begin{align*}
H(M \mid C) \leq H(K \mid C).
\end{align*}
This completes the proof.
[/step]