[proofplan]
We rewrite $H(X \mid Y)$ using the chain rule $H(X \mid Y) = H(X, Y) - H(Y)$, reducing the claim to the subadditivity of joint entropy $H(X, Y) \leq H(X) + H(Y)$. Subadditivity and its equality case both follow from the non-negativity of the Kullback–Leibler divergence (Gibbs' inequality), applied to the joint distribution and the product of the marginals. Rearranging gives the desired inequality and the independence characterisation.
[/proofplan]
[step:Apply the chain rule to express $H(X \mid Y)$ via the joint entropy]
Let $p_{X,Y}$ denote the joint probability mass function of $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, and let $p_X$, $p_Y$ denote the marginal mass functions. By the [Chain Rule for Entropy](/theorems/???),
\begin{align*}
H(X, Y) = H(Y) + H(X \mid Y).
\end{align*}
Rearranging,
\begin{align*}
H(X \mid Y) = H(X, Y) - H(Y).
\end{align*}
The inequality $H(X \mid Y) \leq H(X)$ is therefore equivalent to
\begin{align*}
H(X, Y) \leq H(X) + H(Y),
\end{align*}
which is the statement of subadditivity.
[/step]
[step:Prove subadditivity $H(X, Y) \leq H(X) + H(Y)$ via non-negativity of KL divergence]
Let $q: \mathcal{X} \times \mathcal{Y} \to [0, 1]$ be the product distribution $q(x, y) := p_X(x) p_Y(y)$. Both $p_{X,Y}$ and $q$ are probability mass functions on the finite set $\mathcal{X} \times \mathcal{Y}$, and $q(x, y) = 0$ implies $p_{X,Y}(x, y) = 0$ (because if $p_X(x) = 0$ then $p_{X,Y}(x, y) = 0$ for all $y$ by marginalisation, and symmetrically for $p_Y(y) = 0$). Hence $p_{X,Y} \ll q$, and the Kullback–Leibler divergence is well defined:
\begin{align*}
D(p_{X,Y} \,\|\, q) := \sum_{(x, y): p_{X,Y}(x, y) > 0} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)}.
\end{align*}
Expanding the logarithm of the quotient and using the conventions $0 \log 0 := 0$ and $\sum_{(x, y): p_{X,Y}(x,y) > 0} p_{X,Y}(x, y) \log p_{X,Y}(x, y) = -H(X, Y)$,
\begin{align*}
D(p_{X,Y} \,\|\, q) &= -H(X, Y) - \sum_{x, y} p_{X,Y}(x, y) \log\bigl(p_X(x) p_Y(y)\bigr) \\
&= -H(X, Y) - \sum_{x} p_X(x) \log p_X(x) - \sum_y p_Y(y) \log p_Y(y) \\
&= -H(X, Y) + H(X) + H(Y),
\end{align*}
where the second equality uses $\sum_y p_{X,Y}(x, y) = p_X(x)$ and $\sum_x p_{X,Y}(x, y) = p_Y(y)$.
By [Gibbs' Inequality](/theorems/???), $D(p_{X,Y} \,\|\, q) \geq 0$, with equality if and only if $p_{X,Y} = q$ pointwise. Therefore
\begin{align*}
H(X) + H(Y) - H(X, Y) \geq 0,
\end{align*}
i.e. $H(X, Y) \leq H(X) + H(Y)$.
[guided]
The strategy is to exhibit the deficit $H(X) + H(Y) - H(X, Y)$ as a Kullback–Leibler divergence and invoke Gibbs' inequality. The natural candidate for a reference distribution is $q(x, y) := p_X(x) p_Y(y)$: the product of the marginals, which is the joint distribution that $(X, Y)$ would have if they were independent.
First we verify the prerequisites for defining $D(p_{X,Y} \,\|\, q)$. Both are probability mass functions on the finite set $\mathcal{X} \times \mathcal{Y}$: $q$ sums to $\sum_{x,y} p_X(x) p_Y(y) = (\sum_x p_X(x))(\sum_y p_Y(y)) = 1$. We need absolute continuity, $p_{X,Y} \ll q$ — that is, whenever $q(x, y) = 0$ we must have $p_{X,Y}(x, y) = 0$. Suppose $q(x_0, y_0) = 0$. Then $p_X(x_0) = 0$ or $p_Y(y_0) = 0$. If $p_X(x_0) = 0$, then by marginalisation $\sum_y p_{X,Y}(x_0, y) = 0$, and non-negativity of the summands forces $p_{X,Y}(x_0, y) = 0$ for every $y$, in particular $p_{X,Y}(x_0, y_0) = 0$. The case $p_Y(y_0) = 0$ is symmetric. Hence $p_{X,Y} \ll q$.
Now we compute:
\begin{align*}
D(p_{X,Y} \,\|\, q)
&= \sum_{x, y} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)} \\
&= \sum_{x, y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) - \sum_{x, y} p_{X,Y}(x, y) \log p_X(x) - \sum_{x, y} p_{X,Y}(x, y) \log p_Y(y) \\
&= -H(X, Y) - \sum_x \log p_X(x) \sum_y p_{X,Y}(x, y) - \sum_y \log p_Y(y) \sum_x p_{X,Y}(x, y) \\
&= -H(X, Y) - \sum_x p_X(x) \log p_X(x) - \sum_y p_Y(y) \log p_Y(y) \\
&= -H(X, Y) + H(X) + H(Y).
\end{align*}
The term-by-term splitting of the logarithm, the switch in summation order, and the marginalisations are all legitimate because the sums are finite. (Where $p_{X,Y}(x, y) = 0$ we use the convention $0 \log 0 := 0$, consistent with $\lim_{t \downarrow 0} t \log t = 0$; where $p_X(x) = 0$ or $p_Y(y) = 0$ but $p_{X,Y}(x, y) = 0$, the entire term vanishes so no $\log 0$ issue arises.)
[Gibbs' Inequality](/theorems/???) states: if $p$ and $q$ are probability mass functions on a countable set with $p \ll q$, then $D(p \,\|\, q) \geq 0$, with equality if and only if $p = q$. Its hypotheses are met ($p_{X,Y} \ll q$ was just verified), so
\begin{align*}
H(X) + H(Y) - H(X, Y) = D(p_{X,Y} \,\|\, q) \geq 0,
\end{align*}
i.e. $H(X, Y) \leq H(X) + H(Y)$.
Why start from KL divergence rather than directly convexifying with Jensen? Both work; KL packages the computation more cleanly and simultaneously delivers the equality case for free via the Gibbs equality condition, which is exactly what we need next.
[/guided]
[/step]
[step:Combine the chain rule with subadditivity]
Substituting subadditivity into the chain-rule identity from the first step,
\begin{align*}
H(X \mid Y) = H(X, Y) - H(Y) \leq \bigl(H(X) + H(Y)\bigr) - H(Y) = H(X).
\end{align*}
This proves the inequality.
[/step]
[step:Characterise the equality case via independence]
Equality $H(X \mid Y) = H(X)$ holds if and only if subadditivity is tight, i.e.
\begin{align*}
H(X, Y) = H(X) + H(Y),
\end{align*}
which by the computation in the second step is equivalent to $D(p_{X,Y} \,\|\, q) = 0$. By the equality clause of [Gibbs' Inequality](/theorems/???), this holds if and only if $p_{X,Y}(x, y) = p_X(x) p_Y(y)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, which is the definition of $X$ and $Y$ being independent. This completes the proof of both the inequality and the equality characterisation.
[/step]