Tensorization of Entropy for Product Measures — Statement & Proof

Tensorization of Entropy for Product Measures (Theorem # 6725)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We prove the result by induction on the number of tensor factors for the product probability measure $\mu$. The main identity is the two-factor entropy decomposition: entropy with respect to a product measure equals the average entropy in one coordinate plus the entropy of the marginal density. The induction then reduces the marginal entropy to coordinate entropies of the marginal density, and convexity of the entropy functional shows that each marginal coordinate entropy is bounded by the corresponding full slice entropy of $f$. [/proofplan] [step:Record the convexity estimate for entropy under mixtures] Let $(S,\mathcal{S},\rho)$ be a probability space. Define the entropy functional $H_\rho$ on non-negative measurable maps $v:S\to[0,\infty]$ by \begin{align*} H_\rho(v):=\operatorname{Ent}_\rho(v). \end{align*} We use the following convexity fact. Let $(T,\mathcal{T},\sigma)$ be another probability space, let $(s,t)\mapsto u_t(s)$ be jointly measurable from $S\times T$ to $[0,\infty]$, and define \begin{align*} U(s):=\int_T u_t(s)\,d\sigma(t). \end{align*} For each $t$, set \begin{align*} m_t:=\int_S u_t(s)\,d\rho(s), \end{align*} and set \begin{align*} M:=\int_T m_t\,d\sigma(t)=\int_S U(s)\,d\rho(s), \end{align*} where the equality follows from Tonelli's theorem applied to the non-negative function $u_t(s)$. Suppose that \begin{align*} M<\infty. \end{align*} Then \begin{align*} H_\rho(U)\leq \int_T H_\rho(u_t)\,d\sigma(t), \end{align*} with both sides interpreted in the extended sense. If the right-hand side is infinite, the assertion is immediate. Assume it is finite. Then $m_t<\infty$ for $\sigma$-almost every $t$ that contributes positive mass. If $M=0$, then $U=0$ $\rho$-almost everywhere and the desired inequality is trivial, so assume $0<M<\infty$. Define a probability measure $\alpha$ on $T$ by \begin{align*} d\alpha(t):=\frac{m_t}{M}\,d\sigma(t), \end{align*} with the convention that points with $m_t=0$ contribute no mass. For $\alpha$-almost every $t$, define the probability density \begin{align*} p_t(s):=\frac{u_t(s)}{m_t} \end{align*} with respect to $\rho$. The normalized mixture density is \begin{align*} p(s):=\frac{U(s)}{M}=\int_T p_t(s)\,d\alpha(t). \end{align*} Relative entropy is convex in its first argument. Equivalently, this is the log-sum inequality, first proved for finite partitions and then extended to non-negative measurable densities by monotone approximation. Therefore \begin{align*} D(p\rho\|\rho)\leq \int_T D(p_t\rho\|\rho)\,d\alpha(t). \end{align*} Multiplying by $M$ and using \begin{align*} H_\rho(v)=m(v)D\left(\frac{v}{m(v)}\rho\,\middle\|\,\rho\right), \end{align*} whenever $0<m(v)<\infty$, where \begin{align*} m(v):=\int_S v(s)\,d\rho(s), \end{align*} we obtain \begin{align*} H_\rho(U)\leq \int_T H_\rho(u_t)\,d\sigma(t). \end{align*} This proves the mixture estimate without applying Tonelli to a sign-changing integrand. [guided] The useful point is that a marginal density is an average of slices, and entropy is convex with respect to such averages. Let $(s,t)\mapsto u_t(s)$ be a non-negative jointly measurable family and define \begin{align*} U(s):=\int_T u_t(s)\,d\sigma(t). \end{align*} We write entropy on $(S,\mathcal S,\rho)$ as \begin{align*} H_\rho(v):=\operatorname{Ent}_\rho(v). \end{align*} The desired estimate is \begin{align*} H_\rho(U)\leq \int_T H_\rho(u_t)\,d\sigma(t). \end{align*} This estimate is used under the finite-total-mass hypothesis \begin{align*} \int_S U(s)\,d\rho(s)<\infty. \end{align*} If \begin{align*} \int_T H_\rho(u_t)\,d\sigma(t)=\infty, \end{align*} there is nothing to prove, so assume this integral is finite. Write the slice masses and total mass as \begin{align*} m_t:=\int_S u_t(s)\,d\rho(s), \end{align*} and \begin{align*} M:=\int_T m_t\,d\sigma(t). \end{align*} Tonelli's theorem also gives \begin{align*} M=\int_S U(s)\,d\rho(s). \end{align*} The finite-total-mass hypothesis says \begin{align*} M<\infty. \end{align*} If $M=0$, then all functions involved vanish in the only way relevant to the integrals, and the inequality is immediate. If $M>0$, use the masses to turn $\sigma$ into a probability measure: \begin{align*} d\alpha(t):=\frac{m_t}{M}\,d\sigma(t). \end{align*} For $m_t>0$, the normalized slice density is \begin{align*} p_t(s):=\frac{u_t(s)}{m_t}. \end{align*} This satisfies \begin{align*} \int_S p_t(s)\,d\rho(s)=1. \end{align*} The normalized density of the average is the $\alpha$-mixture \begin{align*} \frac{U(s)}{M}=\int_T p_t(s)\,d\alpha(t). \end{align*} It is also a probability density: \begin{align*} \int_S \frac{U(s)}{M}\,d\rho(s)=1. \end{align*} For any probability density $q$ with respect to $\rho$, write \begin{align*} D(q\rho\|\rho)=\int_S q(s)\log q(s)\,d\rho(s). \end{align*} The entropy of a non-negative function with finite positive mass $m(v)$ is related to this relative entropy by \begin{align*} H_\rho(v)=m(v)D\left(\frac{v}{m(v)}\rho\,\middle\|\,\rho\right). \end{align*} Here the mass notation means \begin{align*} m(v):=\int_S v(s)\,d\rho(s). \end{align*} In particular, \begin{align*} H_\rho(U)=M D\left(\frac{U}{M}\rho\,\middle\|\,\rho\right). \end{align*} For the slices, \begin{align*} H_\rho(u_t)=m_t D(p_t\rho\|\rho) \end{align*} whenever $m_t>0$. By convexity of relative entropy in the first measure, or by the log-sum inequality followed by monotone approximation, \begin{align*} D\left(\frac{U}{M}\rho\,\middle\|\,\rho\right) \leq \int_T D(p_t\rho\|\rho)\,d\alpha(t). \end{align*} Finally multiply by $M$. Since \begin{align*} H_\rho(U)=M D\left(\frac{U}{M}\rho\,\middle\|\,\rho\right) \end{align*} and \begin{align*} M\,d\alpha(t)=m_t\,d\sigma(t), \end{align*} the right-hand side becomes \begin{align*} \int_T H_\rho(u_t)\,d\sigma(t). \end{align*} Thus averaging slices cannot increase entropy. [/guided] [/step] [step:Decompose entropy across one product coordinate] Let $(S,\mathcal{S},\rho)$ and $(T,\mathcal{T},\sigma)$ be probability spaces, and let $u:S\times T\to[0,\infty]$ be measurable with $\operatorname{Ent}_{\rho\otimes\sigma}(u)<\infty$. Define the marginal function $g:T\to[0,\infty]$ by \begin{align*} g(t):=\int_S u(s,t)\,d\rho(s). \end{align*} Then \begin{align*} \operatorname{Ent}_{\rho\otimes\sigma}(u)=\int_T \operatorname{Ent}_\rho(u(\cdot,t))\,d\sigma(t)+\operatorname{Ent}_\sigma(g). \end{align*} Use the function $\Phi:[0,\infty)\to[-e^{-1},\infty]$ defined by $\Phi(0)=0$ and $\Phi(r)=r\log r$ for $r>0$. Since $\operatorname{Ent}_{\rho\otimes\sigma}(u)<\infty$, the assumption means that the entropy expression is well-defined as a finite real number: the mass $\int_{S\times T}u(s,t)\,d(\rho\otimes\sigma)(s,t)$ is finite and $\int_{S\times T}\Phi(u(s,t))\,d(\rho\otimes\sigma)(s,t)$ is finite. [Jensen's inequality](/theorems/9) applied on each $S$-fiber gives $\Phi(g(t))\leq\int_S\Phi(u(s,t))\,d\rho(s)$ for $\sigma$-a.e. $t$, so $\int_T\Phi(g(t))\,d\sigma(t)$ is also finite. Expanding entropy through $\Phi$ gives \begin{align*} \int_T \operatorname{Ent}_\rho(u(\cdot,t))\,d\sigma(t)=\int_{S\times T}\Phi(u(s,t))\,d(\rho\otimes\sigma)(s,t)-\int_T \Phi(g(t))\,d\sigma(t), \end{align*} where Tonelli's theorem is applied to the non-negative function $\Phi(u)+e^{-1}$ and then the finite constants are subtracted. Also, \begin{align*} \operatorname{Ent}_\sigma(g)=\int_T \Phi(g(t))\,d\sigma(t)-\Phi\left(\int_T g(t)\,d\sigma(t)\right). \end{align*} Finally, Tonelli's theorem applied to the non-negative function $u$ gives \begin{align*} \int_T g(t)\,d\sigma(t)=\int_{S\times T}u(s,t)\,d(\rho\otimes\sigma)(s,t). \end{align*} Adding the two displayed identities cancels the marginal term and gives the claimed decomposition without any undefined $\infty-\infty$ expression. [/step] [step:Prove the tensorization inequality by induction over the number of coordinates] We prove the theorem for all $n\in\mathbb{N}$ by induction. For $n=1$, the assertion is an equality: \begin{align*} \operatorname{Ent}_{\mu_1}(f)=\int_{E_{\varnothing}}\operatorname{Ent}_{\mu_1,1}(f)\,d\mu_{\varnothing}. \end{align*} Here $E_{\varnothing}$ is a one-point space and $\mu_{\varnothing}$ is its unit mass. Assume the theorem holds for $n-1$ factors. Define the product space $E'$ and product probability measure $\mu'$ by \begin{align*} E' := \prod_{i=1}^{n-1}E_i. \end{align*} \begin{align*} \mu' := \bigotimes_{i=1}^{n-1}\mu_i. \end{align*} Define the marginal function $g:E'\to[0,\infty]$ by \begin{align*} g(x'):=\int_{E_n} f(x',z_n)\,d\mu_n(z_n). \end{align*} Applying the two-factor decomposition with $(S,\rho)=(E_n,\mu_n)$ and $(T,\sigma)=(E',\mu')$ gives \begin{align*} \operatorname{Ent}_\mu(f)=\int_{E'}\operatorname{Ent}_{\mu_n,n}(f)(x')\,d\mu'(x')+\operatorname{Ent}_{\mu'}(g). \end{align*} The first term on the right is non-negative and $\operatorname{Ent}_\mu(f)<\infty$ by hypothesis, hence $\operatorname{Ent}_{\mu'}(g)<\infty$. Therefore the induction hypothesis applies to $g$ on $E'$, and gives \begin{align*} \operatorname{Ent}_{\mu'}(g)\leq \sum_{i=1}^{n-1}\int_{E'_{-i}}\operatorname{Ent}_{\mu_i,i}(g)(y_{-i})\,d\mu'_{-i}(y_{-i}), \end{align*} where $E'_{-i}:=\prod_{1\leq j\leq n-1,\ j\ne i}E_j$ and $\mu'_{-i}:=\bigotimes_{1\leq j\leq n-1,\ j\ne i}\mu_j$. Since $\operatorname{Ent}_\mu(f)<\infty$, the mass of $f$ is finite: \begin{align*} \int_E f\,d\mu<\infty. \end{align*} Fix $i\in\{1,\dots,n-1\}$. By Tonelli's theorem, for $\mu'_{-i}$-almost every $y_{-i}\in E'_{-i}$ the two-coordinate slice has finite total mass: \begin{align*} \int_{E_i}\int_{E_n} f(z_i,y_{-i},z_n)\,d\mu_n(z_n)\,d\mu_i(z_i)<\infty. \end{align*} Fix such a $y_{-i}$. For each $z_n\in E_n$, define the slice map $u_{z_n}:E_i\to[0,\infty]$ by \begin{align*} u_{z_n}(z_i):=f(z_i,y_{-i},z_n). \end{align*} The $i$-th slice of $g$ at $y_{-i}$ is the mixture map $g_{y_{-i}}:E_i\to[0,\infty]$ defined by \begin{align*} g_{y_{-i}}(z_i):=\int_{E_n}u_{z_n}(z_i)\,d\mu_n(z_n). \end{align*} By the convexity estimate from the first step, \begin{align*} \operatorname{Ent}_{\mu_i,i}(g)(y_{-i})\leq \int_{E_n}\operatorname{Ent}_{\mu_i}(u_{z_n})\,d\mu_n(z_n). \end{align*} Integrating this inequality with respect to $d\mu'_{-i}(y_{-i})$ gives \begin{align*} \int_{E'_{-i}}\operatorname{Ent}_{\mu_i,i}(g)(y_{-i})\,d\mu'_{-i}(y_{-i})\leq \int_{E_{-i}}\operatorname{Ent}_{\mu_i,i}(f)(x_{-i})\,d\mu_{-i}(x_{-i}). \end{align*} Here the exceptional $y_{-i}$ set has $\mu'_{-i}$-measure zero, and the identification of the right-hand side is Tonelli's theorem for the non-negative coordinate entropy terms together with product-measure associativity. Combining the last estimates with the decomposition of $\operatorname{Ent}_\mu(f)$ yields \begin{align*} \operatorname{Ent}_\mu(f)\leq \sum_{i=1}^{n}\int_{E_{-i}}\operatorname{Ent}_{\mu_i,i}(f)(x_{-i})\,d\mu_{-i}(x_{-i}). \end{align*} This is the desired tensorization inequality. [/step]

Explore Further

What brings you to Androma?

Start with a route through the knowledge graph.

Tensorization of Entropy for Product Measures (Theorem # 6725)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Tensorization of Entropy for Product Measures (Theorem # 6725)

Discussion

Proof

Explore Further