Pinsker Inequality — Statement & Proof

Pinsker Inequality (Theorem # 5890)

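Theorem

Let $P$ and $Q$ be probability measures on a measurable space $(\Omega,\mathcal{F})$, and let $\operatorname{TV}(P,Q)=\sup_{A\in\mathcal{F}}|P(A)-Q(A)|$ denote their total variation distance. Then \begin{align*} \operatorname{TV}(P,Q)^2 \leq \frac{1}{2}D_{\mathrm{KL}}(P \mid Q). \end{align*}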

Proof

[proofplan] If $P$ is not absolutely continuous with respect to $Q$, the right-hand side is infinite and the result is immediate. Otherwise, after disposing of the zero total-variation case, we reduce the problem to a two-point probability space by choosing a measurable set $A_\varepsilon$ whose probability gap is almost maximal. The key estimate is Jensen's inequality for the convex function $t\mapsto t\log t$ (a form of the log-sum inequality), which shows that relative entropy does not increase under the coarse-graining into $A_\varepsilon$ and $A_\varepsilon^c$. We then prove the remaining Bernoulli inequality directly and let the approximating set approach the total variation supremum. [/proofplan]
[step:Dispose of the singular case and fix an almost extremal set] Let $(\Omega,\mathcal{F})$ be the measurable space on which the probability measures $P$ and $Q$ are defined. If $P \not\ll Q$, then $D_{\mathrm{KL}}(P \mid Q)=+\infty$, and the asserted inequality holds because its left-hand side is finite: $\operatorname{TV}(P,Q) \leq 1$. Assume henceforth that $P \ll Q$. If $\operatorname{TV}(P,Q)=0$, then the asserted inequality reduces to $0 \leq \frac{1}{2}D_{\mathrm{KL}}(P \mid Q)$, which holds by nonnegativity of relative entropy. Hence assume $\operatorname{TV}(P,Q)>0$, and let $\varepsilon$ satisfy $0<\varepsilon<\operatorname{TV}(P,Q)$. By the definition of the supremum, choose a measurable set $A_\varepsilon \in \mathcal{F}$ such that \begin{align*} |P(A_\varepsilon)-Q(A_\varepsilon)| \geq \operatorname{TV}(P,Q)-\varepsilon. \end{align*} Replacing $A_\varepsilon$ by its complement if necessary, which leaves the absolute value unchanged, we may assume \begin{align*} P(A_\varepsilon)-Q(A_\varepsilon) \geq \operatorname{TV}(P,Q)-\varepsilon. \end{align*} Define $a_\varepsilon := P(A_\varepsilon)$ and $b_\varepsilon := Q(A_\varepsilon)$. Since $0<\varepsilon<\operatorname{TV}(P,Q)$, the preceding lower bound is positive, so $a_\varepsilon-b_\varepsilon>0$. Hence $0 \leq b_\varepsilon < a_\varepsilon \leq 1$, and \begin{align*} a_\varepsilon-b_\varepsilon \geq \operatorname{TV}(P,Q)-\varepsilon.
\end{align*} [/step]
[step:Coarse-grain the relative entropy to the two sets $A_\varepsilon$ and $A_\varepsilon^c$] Since $P \ll Q$, the [Radon-Nikodym theorem](/page/Radon-Nikodym%20Theorem) provides a measurable function $r:\Omega\to[0,\infty]$, $r(\omega)=\frac{dP}{dQ}(\omega)$, the Radon-Nikodym derivative of $P$ with respect to $Q$, so that $P(E)=\int_E r\,dQ$ for every $E\in\mathcal{F}$. Since $\int_\Omega r\,dQ=P(\Omega)=1<\infty$, the derivative $r$ is finite $Q$-a.e.; redefining it on a $Q$-null set, we may assume $r$ is finite everywhere, which does not affect any integral below. Define the function $\varphi:[0,\infty)\to\mathbb{R}$ by $\varphi(t)=t\log t$ for $t>0$ and $\varphi(0)=0$. The function $\varphi$ is convex on $[0,\infty)$ because $\varphi''(t)=1/t>0$ for $t>0$ and $\varphi$ is the continuous extension at $0$ of this convex function. Then \begin{align*} D_{\mathrm{KL}}(P \mid Q)=\int_\Omega \varphi(r)\,dQ. \end{align*} We claim that \begin{align*} D_{\mathrm{KL}}(P \mid Q) \geq a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right) + (1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right), \end{align*} with the usual conventions $0\log(0/c)=0$ for $c\geq 0$ and $c\log(c/0)=+\infty$ for $c>0$. Indeed, on the measurable set $A_\varepsilon$, [Jensen's inequality](/page/Jensen%27s%20Inequality) for the probability measure $Q(\cdot \cap A_\varepsilon)/Q(A_\varepsilon)$ gives, when $b_\varepsilon>0$, \begin{align*} \int_{A_\varepsilon}\varphi(r)\,dQ \geq b_\varepsilon\, \varphi\left( \frac{1}{b_\varepsilon}\int_{A_\varepsilon} r\,dQ \right) = b_\varepsilon\, \varphi\left(\frac{a_\varepsilon}{b_\varepsilon}\right) = a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right). \end{align*} If $b_\varepsilon=0$, then $P \ll Q$ implies $a_\varepsilon=0$, so both the integral over $A_\varepsilon$ and the corresponding entropy term are $0$.
If $1-b_\varepsilon>0$, applying [Jensen's inequality](/page/Jensen%27s%20Inequality) to the probability measure $Q(\cdot\cap A_\varepsilon^c)/Q(A_\varepsilon^c)$ gives \begin{align*} \int_{A_\varepsilon^c}\varphi(r)\,dQ \geq (1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right). \end{align*} If $1-b_\varepsilon=0$, then $Q(A_\varepsilon^c)=0$, and $P\ll Q$ implies $P(A_\varepsilon^c)=1-a_\varepsilon=0$; the complementary entropy term is therefore $0$. Adding the two estimates proves the claimed coarse-grained lower bound. [guided] The purpose of this step is to replace the original probability space by the two events $A_\varepsilon$ and $A_\varepsilon^c$. This is useful because the total variation distance only asks for probability gaps of measurable sets, and a single nearly extremal set contains enough information to prove the estimate. Since $P \ll Q$, the [Radon-Nikodym theorem](/page/Radon-Nikodym%20Theorem) gives a Radon-Nikodym derivative. Let $r:\Omega\to[0,\infty]$ be the measurable function given by $r(\omega)=\frac{dP}{dQ}(\omega)$ for $\omega\in\Omega$. Thus, for every measurable set $E \in \mathcal{F}$, \begin{align*} P(E)=\int_E r\,dQ. \end{align*} Because $P(\Omega)=1<\infty$, we may choose this derivative finite $Q$-a.e.; if necessary, redefine $r$ on the $Q$-null set where it is infinite. This redefinition preserves the displayed identity for every $E\in\mathcal{F}$ and preserves all integrals with respect to $Q$. We also define the function $\varphi:[0,\infty)\to\mathbb{R}$ by $\varphi(t)=t\log t$ for $t>0$ and $\varphi(0)=0$. The function $\varphi$ is convex on $[0,\infty)$: on $(0,\infty)$ it has second derivative $\varphi''(t)=1/t\geq 0$, and the value $\varphi(0)=0$ is the continuous endpoint extension of that convex function. The relative entropy can be written as \begin{align*} D_{\mathrm{KL}}(P \mid Q)=\int_\Omega \varphi(r)\,dQ. \end{align*} We now estimate the contribution from $A_\varepsilon$. 
If $b_\varepsilon=Q(A_\varepsilon)>0$, then $Q(\cdot \cap A_\varepsilon)/b_\varepsilon$ is a probability measure on $A_\varepsilon$. [Jensen's inequality](/page/Jensen%27s%20Inequality) applied to the convex function $\varphi$ gives \begin{align*} \frac{1}{b_\varepsilon}\int_{A_\varepsilon}\varphi(r)\,dQ \geq \varphi\left( \frac{1}{b_\varepsilon}\int_{A_\varepsilon}r\,dQ \right). \end{align*} Multiplying by $b_\varepsilon$ and using $\int_{A_\varepsilon}r\,dQ=P(A_\varepsilon)=a_\varepsilon$, we get \begin{align*} \int_{A_\varepsilon}\varphi(r)\,dQ \geq b_\varepsilon\varphi\left(\frac{a_\varepsilon}{b_\varepsilon}\right) = a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right). \end{align*} If $b_\varepsilon=0$, absolute continuity $P \ll Q$ forces $a_\varepsilon=P(A_\varepsilon)=0$, so the corresponding entropy term is interpreted as $0$ and the inequality remains valid. We next apply the same Jensen mechanism to the complement $A_\varepsilon^c$, but we must first check the endpoint. Since $Q(A_\varepsilon^c)=1-b_\varepsilon$ and $P(A_\varepsilon^c)=1-a_\varepsilon$, if $1-b_\varepsilon>0$, then $Q(\cdot\cap A_\varepsilon^c)/(1-b_\varepsilon)$ is a probability measure on $A_\varepsilon^c$ and [Jensen's inequality](/page/Jensen%27s%20Inequality) gives \begin{align*} \int_{A_\varepsilon^c}\varphi(r)\,dQ \geq (1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right). \end{align*} If $1-b_\varepsilon=0$, then $Q(A_\varepsilon^c)=0$, and absolute continuity $P\ll Q$ gives $P(A_\varepsilon^c)=1-a_\varepsilon=0$, so the complementary entropy term is $0$. Adding the two estimates yields \begin{align*} D_{\mathrm{KL}}(P \mid Q) \geq a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right) + (1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right). \end{align*} This is exactly the relative entropy of the two-point distributions $(a_\varepsilon,1-a_\varepsilon)$ and $(b_\varepsilon,1-b_\varepsilon)$. 
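As an illustrative numerical sketch of this coarse-graining step (not part of the proof), the following Python snippet takes two hypothetical distributions on a three-point space, picks the event on which $P$ exceeds $Q$, and compares the full relative entropy with that of the induced two-point distributions. The function names `kl_divergence` and `coarse_grain` and the sample values are assumptions made for the example only.

```python
import math


def kl_divergence(p, q):
    """Relative entropy sum_i p_i * log(p_i / q_i) in nats.

    Uses the conventions 0*log(0/c) = 0 and c*log(c/0) = +inf for c > 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / c) = 0
        if qi == 0.0:
            return math.inf  # P is not absolutely continuous w.r.t. Q
        total += pi * math.log(pi / qi)
    return total


def coarse_grain(p, q, event):
    """Push P and Q forward to the two-point space {event, complement}."""
    a = sum(p[i] for i in event)  # a = P(A)
    b = sum(q[i] for i in event)  # b = Q(A)
    return (a, 1.0 - a), (b, 1.0 - b)


# Hypothetical example distributions on {0, 1, 2}.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# On a finite space the event {i : p_i > q_i} attains the total variation supremum.
event = [i for i in range(len(p)) if p[i] > q[i]]

p2, q2 = coarse_grain(p, q, event)
full_kl = kl_divergence(p, q)
binary_kl = kl_divergence(p2, q2)

# Jensen's inequality predicts full_kl >= binary_kl.
print(f"D(P|Q) = {full_kl:.6f}, two-point relative entropy = {binary_kl:.6f}")
```

On a finite space no $\varepsilon$ is needed because the supremum is attained; the sketch only checks that merging outcomes cannot increase relative entropy in this one example.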
[/guided] [/step]
[step:Prove the Bernoulli form of Pinsker's inequality] We prove that for $0 \leq b \leq a \leq 1$, \begin{align*} a\log\left(\frac{a}{b}\right) + (1-a)\log\left(\frac{1-a}{1-b}\right) \geq 2(a-b)^2, \end{align*} with the same conventions for the logarithmic terms as before. We first handle the equality and endpoint cases. If $a=b$, then both sides are $0$, so the inequality holds; this includes $a=b=0$ and $a=b=1$. If $b=0$ and $a>0$, the left-hand side is $+\infty$ by the convention $a\log(a/0)=+\infty$, so the inequality holds. If $a=1$ and $0<b<1$, the inequality becomes \begin{align*} \log\left(\frac{1}{b}\right) \geq 2(1-b)^2. \end{align*} This follows from the interior argument below by letting $a\uparrow 1$, because the function \begin{align*} a \mapsto a\log\left(\frac{a}{b}\right)+(1-a)\log\left(\frac{1-a}{1-b}\right)-2(a-b)^2 \end{align*} is continuous on $(b,1]$ after setting $(1-a)\log((1-a)/(1-b))=0$ at $a=1$. It remains to consider $0<b<a<1$. For fixed $b\in(0,1)$, define $F_b:[b,1)\to\mathbb{R}$ by \begin{align*} F_b(a) = a\log\left(\frac{a}{b}\right) + (1-a)\log\left(\frac{1-a}{1-b}\right) - 2(a-b)^2. \end{align*} Then $F_b(b)=0$, and for $a\in[b,1)$, \begin{align*} F_b'(a) = \log\left(\frac{a(1-b)}{b(1-a)}\right)-4(a-b), \end{align*} so that $F_b'(b)=\log 1=0$, and \begin{align*} F_b''(a) = \frac{1}{a}+\frac{1}{1-a}-4 = \frac{1}{a(1-a)}-4 = \frac{(2a-1)^2}{a(1-a)} \geq 0. \end{align*} Thus $F_b'$ is nondecreasing on $[b,1)$, and therefore $F_b'(a)\geq F_b'(b)=0$ for $a\in[b,1)$. Hence $F_b$ is nondecreasing on $[b,1)$, and we obtain $F_b(a)\geq F_b(b)=0$. This proves the Bernoulli inequality. [/step]
[step:Apply the Bernoulli estimate and pass to the total variation supremum] Applying the Bernoulli inequality with $a=a_\varepsilon$ and $b=b_\varepsilon$, and using the coarse-grained entropy bound, gives \begin{align*} D_{\mathrm{KL}}(P \mid Q) \geq 2(a_\varepsilon-b_\varepsilon)^2.
\end{align*} Since $a_\varepsilon-b_\varepsilon \geq \operatorname{TV}(P,Q)-\varepsilon$, we have \begin{align*} D_{\mathrm{KL}}(P \mid Q) \geq 2(\operatorname{TV}(P,Q)-\varepsilon)^2. \end{align*} Letting $\varepsilon \downarrow 0$ gives \begin{align*} D_{\mathrm{KL}}(P \mid Q) \geq 2\operatorname{TV}(P,Q)^2. \end{align*} Equivalently, \begin{align*} \operatorname{TV}(P,Q)^2 \leq \frac{1}{2}D_{\mathrm{KL}}(P \mid Q). \end{align*} This is the desired inequality. [/step]
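As a hedged sanity check of the final inequality (a numerical sketch, not a substitute for the proof), the Python snippet below evaluates both the Bernoulli form on a grid and the general bound $\operatorname{TV}(P,Q)^2 \leq \frac{1}{2}D_{\mathrm{KL}}(P \mid Q)$ on randomly generated finite distributions. The helper names, the grid size, the seed, and the sample count are all assumptions chosen for illustration.

```python
import math
import random


def kl_divergence(p, q):
    """Relative entropy in nats with 0*log(0/c) = 0 and c*log(c/0) = +inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def total_variation(p, q):
    """sup_A |P(A) - Q(A)|; on a finite space it is attained at {i : p_i > q_i}."""
    return sum(pi - qi for pi, qi in zip(p, q) if pi > qi)


def random_distribution(n, rng):
    """A hypothetical sampler: normalise n weights drawn from (0, 1]."""
    weights = [1.0 - rng.random() for _ in range(n)]
    norm = sum(weights)
    return [w / norm for w in weights]


# Bernoulli form: a*log(a/b) + (1-a)*log((1-a)/(1-b)) - 2*(a-b)^2 for b <= a.
grid = [k / 50 for k in range(51)]
min_bernoulli_gap = min(
    kl_divergence((a, 1.0 - a), (b, 1.0 - b)) - 2.0 * (a - b) ** 2
    for a in grid
    for b in grid
    if b <= a
)

# General form on random finite distributions with 2 to 6 outcomes.
rng = random.Random(0)
worst_slack = math.inf
for _ in range(1000):
    n = rng.randint(2, 6)
    p = random_distribution(n, rng)
    q = random_distribution(n, rng)
    slack = 0.5 * kl_divergence(p, q) - total_variation(p, q) ** 2
    worst_slack = min(worst_slack, slack)

# Both quantities should be nonnegative up to floating-point rounding.
print(min_bernoulli_gap >= -1e-12, worst_slack >= -1e-12)
```

The tolerance `1e-12` only absorbs rounding error; the inequality itself predicts that both minima are nonnegative.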
