Conditioning Reduces Entropy — Statement & Proof

Conditioning Reduces Entropy (Theorem # 1652)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We rewrite $H(X \mid Y)$ using the chain rule $H(X \mid Y) = H(X, Y) - H(Y)$, reducing the claim to the subadditivity of joint entropy $H(X, Y) \leq H(X) + H(Y)$. Subadditivity and its equality case both follow from the non-negativity of the Kullback–Leibler divergence (Gibbs' inequality), applied to the joint distribution and the product of the marginals. Rearranging gives the desired inequality and the independence characterisation. [/proofplan] [step:Apply the chain rule to express $H(X \mid Y)$ via the joint entropy] Let $p_{X,Y}$ denote the joint probability mass function of $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, and let $p_X$, $p_Y$ denote the marginal mass functions. By the [Chain Rule for Entropy](/theorems/???), \begin{align*} H(X, Y) = H(Y) + H(X \mid Y). \end{align*} Rearranging, \begin{align*} H(X \mid Y) = H(X, Y) - H(Y). \end{align*} The inequality $H(X \mid Y) \leq H(X)$ is therefore equivalent to \begin{align*} H(X, Y) \leq H(X) + H(Y), \end{align*} which is the statement of subadditivity. [/step] [step:Prove subadditivity $H(X, Y) \leq H(X) + H(Y)$ via non-negativity of KL divergence] Let $q: \mathcal{X} \times \mathcal{Y} \to [0, 1]$ be the product distribution $q(x, y) := p_X(x) p_Y(y)$. Both $p_{X,Y}$ and $q$ are probability mass functions on the finite set $\mathcal{X} \times \mathcal{Y}$, and $q(x, y) = 0$ implies $p_{X,Y}(x, y) = 0$ (because if $p_X(x) = 0$ then $p_{X,Y}(x, y) = 0$ for all $y$ by marginalisation, and symmetrically for $p_Y(y) = 0$). Hence $p_{X,Y} \ll q$, and the Kullback–Leibler divergence is well defined: \begin{align*} D(p_{X,Y} \,\|\, q) := \sum_{(x, y): p_{X,Y}(x, y) > 0} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)}. \end{align*} Expanding the logarithm of the quotient and using the conventions $0 \log 0 := 0$ and $\sum_{(x, y): p_{X,Y}(x,y) > 0} p_{X,Y}(x, y) \log p_{X,Y}(x, y) = -H(X, Y)$, \begin{align*} D(p_{X,Y} \,\|\, q) &= -H(X, Y) - \sum_{x, y} p_{X,Y}(x, y) \log\bigl(p_X(x) p_Y(y)\bigr) \\ &= -H(X, Y) - \sum_{x} p_X(x) \log p_X(x) - \sum_y p_Y(y) \log p_Y(y) \\ &= -H(X, Y) + H(X) + H(Y), \end{align*} where the second equality uses $\sum_y p_{X,Y}(x, y) = p_X(x)$ and $\sum_x p_{X,Y}(x, y) = p_Y(y)$. By [Gibbs' Inequality](/theorems/???), $D(p_{X,Y} \,\|\, q) \geq 0$, with equality if and only if $p_{X,Y} = q$ pointwise. Therefore \begin{align*} H(X) + H(Y) - H(X, Y) \geq 0, \end{align*} i.e. $H(X, Y) \leq H(X) + H(Y)$. [guided] The strategy is to exhibit the deficit $H(X) + H(Y) - H(X, Y)$ as a Kullback–Leibler divergence and invoke Gibbs' inequality. The natural candidate for a reference distribution is $q(x, y) := p_X(x) p_Y(y)$: the product of the marginals, which is the joint distribution that $(X, Y)$ would have if they were independent. First we verify the prerequisites for defining $D(p_{X,Y} \,\|\, q)$. Both are probability mass functions on the finite set $\mathcal{X} \times \mathcal{Y}$: $q$ sums to $\sum_{x,y} p_X(x) p_Y(y) = (\sum_x p_X(x))(\sum_y p_Y(y)) = 1$. We need absolute continuity, $p_{X,Y} \ll q$ — that is, whenever $q(x, y) = 0$ we must have $p_{X,Y}(x, y) = 0$. Suppose $q(x_0, y_0) = 0$. Then $p_X(x_0) = 0$ or $p_Y(y_0) = 0$. If $p_X(x_0) = 0$, then by marginalisation $\sum_y p_{X,Y}(x_0, y) = 0$, and non-negativity of the summands forces $p_{X,Y}(x_0, y) = 0$ for every $y$, in particular $p_{X,Y}(x_0, y_0) = 0$. The case $p_Y(y_0) = 0$ is symmetric. Hence $p_{X,Y} \ll q$. Now we compute: \begin{align*} D(p_{X,Y} \,\|\, q) &= \sum_{x, y} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)} \\ &= \sum_{x, y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) - \sum_{x, y} p_{X,Y}(x, y) \log p_X(x) - \sum_{x, y} p_{X,Y}(x, y) \log p_Y(y) \\ &= -H(X, Y) - \sum_x \log p_X(x) \sum_y p_{X,Y}(x, y) - \sum_y \log p_Y(y) \sum_x p_{X,Y}(x, y) \\ &= -H(X, Y) - \sum_x p_X(x) \log p_X(x) - \sum_y p_Y(y) \log p_Y(y) \\ &= -H(X, Y) + H(X) + H(Y). \end{align*} The term-by-term splitting of the logarithm, the switch in summation order, and the marginalisations are all legitimate because the sums are finite. (Where $p_{X,Y}(x, y) = 0$ we use the convention $0 \log 0 := 0$, consistent with $\lim_{t \downarrow 0} t \log t = 0$; where $p_X(x) = 0$ or $p_Y(y) = 0$ but $p_{X,Y}(x, y) = 0$, the entire term vanishes so no $\log 0$ issue arises.) [Gibbs' Inequality](/theorems/???) states: if $p$ and $q$ are probability mass functions on a countable set with $p \ll q$, then $D(p \,\|\, q) \geq 0$, with equality if and only if $p = q$. Its hypotheses are met ($p_{X,Y} \ll q$ was just verified), so \begin{align*} H(X) + H(Y) - H(X, Y) = D(p_{X,Y} \,\|\, q) \geq 0, \end{align*} i.e. $H(X, Y) \leq H(X) + H(Y)$. Why start from KL divergence rather than directly convexifying with Jensen? Both work; KL packages the computation more cleanly and simultaneously delivers the equality case for free via the Gibbs equality condition, which is exactly what we need next. [/guided] [/step] [step:Combine the chain rule with subadditivity] Substituting subadditivity into the chain-rule identity from the first step, \begin{align*} H(X \mid Y) = H(X, Y) - H(Y) \leq \bigl(H(X) + H(Y)\bigr) - H(Y) = H(X). \end{align*} This proves the inequality. [/step] [step:Characterise the equality case via independence] Equality $H(X \mid Y) = H(X)$ holds if and only if subadditivity is tight, i.e. \begin{align*} H(X, Y) = H(X) + H(Y), \end{align*} which by the computation in the second step is equivalent to $D(p_{X,Y} \,\|\, q) = 0$. By the equality clause of [Gibbs' Inequality](/theorems/???), this holds if and only if $p_{X,Y}(x, y) = p_X(x) p_Y(y)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, which is the definition of $X$ and $Y$ being independent. This completes the proof of both the inequality and the equality characterisation. [/step]

Explore Further

Minimum Distance Equals Minimum Weight Information Theory Achievability for the BSC Information Theory Parity Check Matrix from $h(X)$ Information Theory Generator Polynomial Information Theory Non-Zero BSC Capacity Information Theory Hamming and Gilbert–Varshamov Bounds Information Theory Rank and Distance of Bar Product Information Theory Concentration of Log Bankroll Information Theory

What brings you to Androma?

Start with a route through the knowledge graph.

Conditioning Reduces Entropy (Theorem # 1652)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Conditioning Reduces Entropy (Theorem # 1652)

Discussion

Proof

Explore Further