[guided]The strategy is to exhibit the deficit $H(X) + H(Y) - H(X, Y)$ as a Kullback–Leibler divergence and invoke Gibbs' inequality. The natural candidate for a reference distribution is $q(x, y) := p_X(x) p_Y(y)$: the product of the marginals, which is the joint distribution that $(X, Y)$ would have if they were independent.
First we verify the prerequisites for defining $D(p_{X,Y} \,\|\, q)$. Both are probability mass functions on the finite set $\mathcal{X} \times \mathcal{Y}$: $q$ sums to $\sum_{x,y} p_X(x) p_Y(y) = (\sum_x p_X(x))(\sum_y p_Y(y)) = 1$. We need absolute continuity, $p_{X,Y} \ll q$ — that is, whenever $q(x, y) = 0$ we must have $p_{X,Y}(x, y) = 0$. Suppose $q(x_0, y_0) = 0$. Then $p_X(x_0) = 0$ or $p_Y(y_0) = 0$. If $p_X(x_0) = 0$, then by marginalisation $\sum_y p_{X,Y}(x_0, y) = 0$, and non-negativity of the summands forces $p_{X,Y}(x_0, y) = 0$ for every $y$, in particular $p_{X,Y}(x_0, y_0) = 0$. The case $p_Y(y_0) = 0$ is symmetric. Hence $p_{X,Y} \ll q$.
Now we compute:
\begin{align*}
D(p_{X,Y} \,\|\, q)
&= \sum_{x, y} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)} \\
&= \sum_{x, y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) - \sum_{x, y} p_{X,Y}(x, y) \log p_X(x) - \sum_{x, y} p_{X,Y}(x, y) \log p_Y(y) \\
&= -H(X, Y) - \sum_x \log p_X(x) \sum_y p_{X,Y}(x, y) - \sum_y \log p_Y(y) \sum_x p_{X,Y}(x, y) \\
&= -H(X, Y) - \sum_x p_X(x) \log p_X(x) - \sum_y p_Y(y) \log p_Y(y) \\
&= -H(X, Y) + H(X) + H(Y).
\end{align*}
The term-by-term splitting of the logarithm, the switch in summation order, and the marginalisations are all legitimate because the sums are finite. (Where $p_{X,Y}(x, y) = 0$ we use the convention $0 \log 0 := 0$, consistent with $\lim_{t \downarrow 0} t \log t = 0$; where $p_X(x) = 0$ or $p_Y(y) = 0$ but $p_{X,Y}(x, y) = 0$, the entire term vanishes so no $\log 0$ issue arises.)
[Gibbs' Inequality](/theorems/???) states: if $p$ and $q$ are probability mass functions on a countable set with $p \ll q$, then $D(p \,\|\, q) \geq 0$, with equality if and only if $p = q$. Its hypotheses are met ($p_{X,Y} \ll q$ was just verified), so
\begin{align*}
H(X) + H(Y) - H(X, Y) = D(p_{X,Y} \,\|\, q) \geq 0,
\end{align*}
i.e. $H(X, Y) \leq H(X) + H(Y)$.
Why start from KL divergence rather than directly convexifying with Jensen? Both work; KL packages the computation more cleanly and simultaneously delivers the equality case for free via the Gibbs equality condition, which is exactly what we need next.[/guided]