[proofplan]
We expand the joint entropy $H(X, Y, Z)$ in two ways using the [Chain Rule for Entropy](/theorems/???): once peeling off $Z$ last, and once peeling off $X$ last. Equating the two expansions produces an identity relating $H(X \mid Y)$, $H(X \mid Y, Z)$, and the conditional entropies of $Z$. Using the non-negativity of conditional entropy and the fact that conditioning reduces entropy, both $H(Z \mid X, Y) \geq 0$ and $H(Z \mid Y) \leq H(Z)$, we bound the difference by $H(Z)$.
[/proofplan]
[step:Expand $H(X, Y, Z)$ peeling off $Z$ first]
Applying the [Chain Rule for Entropy](/theorems/???) repeatedly, grouping $(X, Y)$ first and then peeling off $Y$,
\begin{align*}
H(X, Y, Z) &= H(X, Y) + H(Z \mid X, Y) \\
&= H(Y) + H(X \mid Y) + H(Z \mid X, Y).
\end{align*}
Each application of the chain rule requires only that the relevant joint distributions be well defined, which holds since $X$, $Y$, $Z$ are defined on a common probability space with finite alphabets.
[/step]
[step:Expand $H(X, Y, Z)$ peeling off $X$ first]
Again by the [Chain Rule for Entropy](/theorems/???), this time grouping $(Y, Z)$ first and then peeling off $Y$,
\begin{align*}
H(X, Y, Z) &= H(Y, Z) + H(X \mid Y, Z) \\
&= H(Y) + H(Z \mid Y) + H(X \mid Y, Z).
\end{align*}
[/step]
[step:Equate the two expansions and cancel $H(Y)$]
Setting the two expressions for $H(X, Y, Z)$ equal,
\begin{align*}
H(Y) + H(X \mid Y) + H(Z \mid X, Y) = H(Y) + H(Z \mid Y) + H(X \mid Y, Z).
\end{align*}
Subtracting $H(Y)$ (a finite real number, since the alphabets are finite) from both sides,
\begin{align*}
H(X \mid Y) + H(Z \mid X, Y) = H(Z \mid Y) + H(X \mid Y, Z).
\end{align*}
Rearranging,
\begin{align*}
H(X \mid Y) - H(X \mid Y, Z) = H(Z \mid Y) - H(Z \mid X, Y).
\end{align*}
[guided]
The two chain-rule expansions both decompose $H(X, Y, Z)$, so equating them is a bookkeeping identity that must hold. The value of this manoeuvre is that it produces, on the two sides, exactly the quantities we need to compare: $H(X \mid Y)$ and $H(X \mid Y, Z)$ on the left, and $H(Z \mid Y)$ and $H(Z \mid X, Y)$ on the right. The symmetry between the two triples is the key idea — the identity converts a statement about the $X$-side into a statement about the $Z$-side, which turns out to be easier to bound.
Cancelling $H(Y)$ from both sides is valid because $H(Y)$ is a finite non-negative real number (finite because $\mathcal{Y}$ is finite and the Shannon entropy of a random variable with finite support is finite). Rearranging so that both "$X$-quantities" are on the left and both "$Z$-quantities" are on the right yields
\begin{align*}
H(X \mid Y) - H(X \mid Y, Z) = H(Z \mid Y) - H(Z \mid X, Y).
\end{align*}
This identity is the core of the proof: it reduces the desired inequality to bounding the right-hand side.
[/guided]
[/step]
[step:Bound the right-hand side by $H(Z)$]
By [non-negativity of conditional entropy](/theorems/???), $H(Z \mid X, Y) \geq 0$. Hence
\begin{align*}
H(Z \mid Y) - H(Z \mid X, Y) \leq H(Z \mid Y).
\end{align*}
By [Conditioning Reduces Entropy](/theorems/1652) applied to the pair $(Z, Y)$,
\begin{align*}
H(Z \mid Y) \leq H(Z).
\end{align*}
Chaining these bounds,
\begin{align*}
H(Z \mid Y) - H(Z \mid X, Y) \leq H(Z \mid Y) \leq H(Z).
\end{align*}
[guided]
The right-hand side of the identity established above is $H(Z \mid Y) - H(Z \mid X, Y)$. Two facts deliver the bound we need.
First, conditional entropy is non-negative: for any random variables $U, V$ on finite alphabets,
\begin{align*}
H(U \mid V) = \sum_v p_V(v) \, H(U \mid V = v) \geq 0,
\end{align*}
since each $H(U \mid V = v) \geq 0$ (entropy of a probability distribution on a finite set is non-negative) and $p_V(v) \geq 0$. Applying this to $(U, V) = (Z, (X, Y))$,
\begin{align*}
H(Z \mid X, Y) \geq 0,
\end{align*}
so dropping this term can only increase the right-hand side:
\begin{align*}
H(Z \mid Y) - H(Z \mid X, Y) \leq H(Z \mid Y).
\end{align*}
Second, [Conditioning Reduces Entropy](/theorems/1652) states: for any discrete random variables $U, V$ on finite alphabets, $H(U \mid V) \leq H(U)$. Applying this with $(U, V) = (Z, Y)$,
\begin{align*}
H(Z \mid Y) \leq H(Z).
\end{align*}
Both hypotheses of theorem 1652 are satisfied: $Z$ and $Y$ are discrete random variables on finite alphabets defined on the same probability space, by the hypotheses of the theorem being proved.
Chaining,
\begin{align*}
H(Z \mid Y) - H(Z \mid X, Y) \leq H(Z \mid Y) \leq H(Z).
\end{align*}
[/guided]
[/step]
[step:Combine the identity and the bound to conclude]
Combining the identity from the third step,
\begin{align*}
H(X \mid Y) - H(X \mid Y, Z) = H(Z \mid Y) - H(Z \mid X, Y),
\end{align*}
with the bound from the fourth step,
\begin{align*}
H(Z \mid Y) - H(Z \mid X, Y) \leq H(Z),
\end{align*}
we obtain
\begin{align*}
H(X \mid Y) - H(X \mid Y, Z) \leq H(Z),
\end{align*}
equivalently
\begin{align*}
H(X \mid Y) \leq H(X \mid Y, Z) + H(Z),
\end{align*}
which is the statement of the theorem.
[/step]