Tensorization of Entropy — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We prove the result by revealing the coordinates one at a time. For each $k$, we average $f$ over the coordinates after $k$ and write the total entropy as a telescoping sum of entropy increments. Each increment is an entropy in the $k$-th coordinate of an averaged function. Finally, convexity of $t \mapsto t\log t$ shows that averaging over the later coordinates can only decrease this one-coordinate entropy, giving the desired sum of coordinate entropies. [/proofplan] [step:Define the partial averages and the entropy function] Let $\Phi: [0,\infty) \to \mathbb{R}$ be the convex function defined by $\Phi(t) := t\log t$ for $t>0$ and $\Phi(0) := 0$. Entropy is understood in the standard measure-theoretic sense \begin{align*} \operatorname{Ent}_\nu(g) := \int \Phi(g)\,d\nu - \Phi\left(\int g\,d\nu\right), \end{align*} with all integrals taken with respect to the displayed measure. For $k \in \{0,\dots,n\}$, define the product probability measure \begin{align*} \mu_{\leq k} := \mu_1 \otimes \cdots \otimes \mu_k \end{align*} on $E_{\leq k} := E_1 \times \cdots \times E_k$, with the convention that $E_{\leq 0}$ is a one-point space and $\mu_{\leq 0}$ is the unit mass on it. Also define \begin{align*} \mu_{>k} := \mu_{k+1} \otimes \cdots \otimes \mu_n \end{align*} on $E_{>k} := E_{k+1} \times \cdots \times E_n$, with the analogous convention when $k=n$. For $x=(x_1,\dots,x_n) \in E$, write $x_{\leq k}:=(x_1,\dots,x_k) \in E_{\leq k}$ and $x_{-k}:=(x_1,\dots,x_{k-1},x_{k+1},\dots,x_n) \in E_1\times\cdots\times E_{k-1}\times E_{k+1}\times\cdots\times E_n$. For each $k \in \{0,\dots,n\}$, define the partial average $F_k: E_{\leq k} \to [0,\infty)$ by \begin{align*} F_k(x_1,\dots,x_k) := \int_{E_{>k}} f(x_1,\dots,x_k,z_{k+1},\dots,z_n)\, d\mu_{>k}(z_{k+1},\dots,z_n). \end{align*} Thus $F_n=f$ as a function on $E$, and $F_0=\int_E f\,d\mu$ is a constant. Since $f$ is non-negative and measurable, Tonelli's theorem applies to the non-negative sections defining $F_k$, so each $F_k$ is measurable. Since $f$ is also $\mu$-integrable and all factors are probability measures, [Fubini's theorem](/theorems/2961) gives that each $F_k$ is $\mu_{\leq k}$-integrable and \begin{align*} \int_{E_{\leq k}} F_k\,d\mu_{\leq k} = \int_E f\,d\mu. \end{align*} [/step] [step:Decompose the total entropy into one-coordinate increments] We telescope the entropy along the sequence $F_0,F_1,\dots,F_n$. First we justify that all terms in the telescoping sum are finite. We use the convention in the definition of finite entropy that $\operatorname{Ent}_\mu(f)<\infty$ means both defining quantities \begin{align*} \int_E \Phi(f)\,d\mu(x) \end{align*} and \begin{align*} \Phi\left(\int_E f\,d\mu(x)\right) \end{align*} are finite [real numbers](/page/Real%20Numbers). For each $k$, [Jensen's Inequality](/theorems/9) for the convex function $\Phi$ applied to the probability measure $\mu_{>k}$ gives, for $\mu_{\leq k}$-almost every $x_{\leq k}$, \begin{align*} \Phi(F_k(x_{\leq k})) \leq \int_{E_{>k}} \Phi(f(x_{\leq k},z))\,d\mu_{>k}(z). \end{align*} Integrating with respect to $\mu_{\leq k}$ and applying Tonelli's theorem to the positive and negative parts, which is legitimate because $\Phi$ is bounded below on $[0,\infty)$, yields \begin{align*} \int_{E_{\leq k}} \Phi(F_k)\,d\mu_{\leq k} \leq \int_E \Phi(f)\,d\mu(x) < \infty. \end{align*} Thus every intermediate entropy term is a real number, so the following telescoping cancellation is an ordinary finite cancellation. Since $F_n=f$ and \begin{align*} F_0 = \int_E f\,d\mu(x), \end{align*} we have \begin{align*} \operatorname{Ent}_\mu(f) = \int_E \Phi(F_n)\,d\mu(x) - \Phi(F_0). \end{align*} Adding and subtracting the intermediate terms gives \begin{align*} \operatorname{Ent}_\mu(f) = \sum_{k=1}^n \left(\int_{E_{\leq k}} \Phi(F_k)\,d\mu_{\leq k} - \int_{E_{\leq k-1}} \Phi(F_{k-1})\,d\mu_{\leq k-1}\right). \end{align*} For fixed $k$, [Fubini's theorem](/theorems/2961) applies because $f$ is non-negative and integrable on the product probability space, and it gives \begin{align*} F_{k-1}(x_1,\dots,x_{k-1}) = \int_{E_k} F_k(x_1,\dots,x_{k-1},y)\,d\mu_k(y) \end{align*} for $\mu_{\leq k-1}$-almost every $(x_1,\dots,x_{k-1})$. Therefore the $k$-th summand is \begin{align*} \int_{E_{\leq k-1}} \operatorname{Ent}_{\mu_k}\bigl(F_k(x_1,\dots,x_{k-1},\cdot)\bigr)\,d\mu_{\leq k-1}(x_1,\dots,x_{k-1}). \end{align*} [guided] The purpose of the functions $F_k$ is to expose the variables gradually. The function $F_k$ depends only on the first $k$ coordinates; it is obtained from $f$ by integrating out the last $n-k$ coordinates. Tonelli's theorem guarantees these sections are measurable because $f$ is non-negative and measurable, and [Fubini's theorem](/theorems/2961) is available because $f$ is integrable with respect to the product probability measure. Thus $F_n$ is the original function, while $F_0$ is the global average \begin{align*} F_0 = \int_E f\,d\mu(x). \end{align*} Before telescoping, we must check that we are not subtracting undefined extended quantities. We use the convention in the definition of finite entropy that $\operatorname{Ent}_\mu(f)<\infty$ means both \begin{align*} \int_E \Phi(f)\,d\mu(x) \end{align*} and \begin{align*} \Phi\left(\int_E f\,d\mu(x)\right) \end{align*} are finite real numbers. [Jensen's Inequality](/theorems/9) for the convex function $\Phi$ applied to the probability measure $\mu_{>k}$ gives \begin{align*} \Phi(F_k(x_{\leq k})) \leq \int_{E_{>k}} \Phi(f(x_{\leq k},z))\,d\mu_{>k}(z) \end{align*} for $\mu_{\leq k}$-almost every $x_{\leq k}$. Because $\Phi$ is bounded below on $[0,\infty)$, Tonelli's theorem applies to the positive and negative parts, and hence \begin{align*} \int_{E_{\leq k}}\Phi(F_k)\,d\mu_{\leq k} \leq \int_E \Phi(f)\,d\mu(x) < \infty. \end{align*} Therefore all intermediate quantities in the telescoping identity are finite real numbers. The entropy of $f$ is \begin{align*} \operatorname{Ent}_\mu(f) = \int_E \Phi(f)\,d\mu - \Phi\left(\int_E f\,d\mu\right). \end{align*} Because $F_n=f$ and $F_0=\int_E f\,d\mu$, this becomes \begin{align*} \operatorname{Ent}_\mu(f) = \int_E \Phi(F_n)\,d\mu - \Phi(F_0). \end{align*} Now insert every intermediate quantity $\int_{E_{\leq k}}\Phi(F_k)\,d\mu_{\leq k}$. The resulting telescoping identity is \begin{align*} \operatorname{Ent}_\mu(f) = \sum_{k=1}^n \left(\int_{E_{\leq k}} \Phi(F_k)\,d\mu_{\leq k} - \int_{E_{\leq k-1}} \Phi(F_{k-1})\,d\mu_{\leq k-1}\right). \end{align*} We now identify each difference as an entropy in the newly revealed coordinate $x_k$. Fix $k \in \{1,\dots,n\}$ and fix $(x_1,\dots,x_{k-1}) \in E_{\leq k-1}$ outside the null set where the Fubini identity can fail. By definition, \begin{align*} F_k(x_1,\dots,x_{k-1},y) = \int_{E_{>k}} f(x_1,\dots,x_{k-1},y,z_{k+1},\dots,z_n)\,d\mu_{>k}(z_{k+1},\dots,z_n). \end{align*} Averaging this expression over $y \in E_k$ and applying [Fubini's theorem](/theorems/2961) to the non-negative integrable function $f$ gives exactly $F_{k-1}(x_1,\dots,x_{k-1})$: \begin{align*} F_{k-1}(x_1,\dots,x_{k-1}) = \int_{E_k} F_k(x_1,\dots,x_{k-1},y)\,d\mu_k(y). \end{align*} Hence, for this fixed past coordinate vector, the expression \begin{align*} \int_{E_k}\Phi(F_k(x_1,\dots,x_{k-1},y))\,d\mu_k(y) - \Phi(F_{k-1}(x_1,\dots,x_{k-1})) \end{align*} is precisely \begin{align*} \operatorname{Ent}_{\mu_k}\bigl(F_k(x_1,\dots,x_{k-1},\cdot)\bigr). \end{align*} Integrating over the past variables gives the claimed representation of the $k$-th telescoping increment. [/guided] [/step] [step:Use joint convexity of relative entropy to contract the later-coordinate average] Fix $k \in \{1,\dots,n\}$ and fix $x_{\leq k-1}=(x_1,\dots,x_{k-1}) \in E_{\leq k-1}$ outside the null set where the relevant sections are measurable and integrable. Define the measurable function $h: E_k \times E_{>k} \to [0,\infty)$ by \begin{align*} h(y,z) := f(x_1,\dots,x_{k-1},y,z_{k+1},\dots,z_n). \end{align*} Define the averaged slice $H: E_k \to [0,\infty)$ by \begin{align*} H(y) := \int_{E_{>k}} h(y,z)\,d\mu_{>k}(z). \end{align*} Then $H(y)=F_k(x_1,\dots,x_{k-1},y)$ for $\mu_k$-almost every $y$. Let $\Psi:[0,\infty)\times[0,\infty)\to[-e^{-1},\infty]$ be defined by $\Psi(u,v):=u\log(u/v)$ when $u>0$ and $v>0$, by $\Psi(0,v):=0$ when $v\geq0$, and by $\Psi(u,0):=+\infty$ when $u>0$. The function $\Psi$ is the perspective of the convex function $\Phi$, hence is jointly convex. For each $v\geq0$ and $u\geq0$, the lower bound $r\log r\geq -e^{-1}$ for $r\geq0$ gives $\Psi(u,v)\geq -v/e$, with the convention already stated when $v=0$. For each $z\in E_{>k}$ for which the section $h(\cdot,z)$ is integrable, define the map $m:E_{>k}\to[0,\infty]$ by \begin{align*} m(z) := \int_{E_k} h(y,z)\,d\mu_k(y). \end{align*} Also define \begin{align*} M := \int_{E_{>k}} m(z)\,d\mu_{>k}(z) = \int_{E_k} H(y)\,d\mu_k(y), \end{align*} where the equality follows from [Fubini's theorem](/theorems/2961) applied to the non-negative integrable function $h$. Using the representation of entropy by $\Psi$, we have \begin{align*} \operatorname{Ent}_{\mu_k}(H) = \int_{E_k} \Psi(H(y),M)\,d\mu_k(y). \end{align*} For each fixed $y\in E_k$, [Jensen's Inequality](/theorems/9) for the jointly convex function $\Psi$ on the probability space $(E_{>k},\mu_{>k})$ gives \begin{align*} \Psi(H(y),M) \leq \int_{E_{>k}} \Psi(h(y,z),m(z))\,d\mu_{>k}(z). \end{align*} Integrating over $E_k$ is legitimate after controlling negative parts. Indeed, the lower bound above gives \begin{align*} \Psi(H(y),M)^- \leq M/e \end{align*} for $\mu_k$-almost every $y$, and \begin{align*} \Psi(h(y,z),m(z))^- \leq m(z)/e \end{align*} for $\mu_k\otimes\mu_{>k}$-almost every $(y,z)$. Since \begin{align*} \int_{E_{>k}}m(z)\,d\mu_{>k}(z)=M<\infty, \end{align*} the negative parts are integrable. Applying [Fubini's theorem](/theorems/2961) to the integrable negative parts and Tonelli's theorem to the positive parts yields \begin{align*} \operatorname{Ent}_{\mu_k}(H) \leq \int_{E_{>k}}\int_{E_k}\Psi(h(y,z),m(z))\,d\mu_k(y)\,d\mu_{>k}(z). \end{align*} The inner integral is exactly $\operatorname{Ent}_{\mu_k}(h(\cdot,z))$, interpreted in the extended sense if the section is outside the entropy domain. Therefore, \begin{align*} \operatorname{Ent}_{\mu_k}\bigl(F_k(x_1,\dots,x_{k-1},\cdot)\bigr) \leq \int_{E_{>k}}\operatorname{Ent}_{\mu_k,k}(f)(x_1,\dots,x_{k-1},z_{k+1},\dots,z_n)\,d\mu_{>k}(z). \end{align*} [guided] We need to compare the entropy of the averaged slice $H$ with the average of the entropies of the unaveraged slices $h(\cdot,z)$. A direct convexity argument for $g\mapsto \operatorname{Ent}_{\mu_k}(g)$ is delicate because entropy contains a negative term involving the mean. The clean scalar way to handle this is to rewrite entropy as a relative entropy integrand. Fix $k\in\{1,\dots,n\}$ and fix $x_{\leq k-1}=(x_1,\dots,x_{k-1})\in E_{\leq k-1}$ outside the null set where the relevant sections are measurable and integrable. Define the measurable map $h:E_k\times E_{>k}\to[0,\infty)$ by \begin{align*} h(y,z) := f(x_1,\dots,x_{k-1},y,z_{k+1},\dots,z_n). \end{align*} Define the average over the later variables as the measurable map $H:E_k\to[0,\infty)$ with \begin{align*} H(y) := \int_{E_{>k}} h(y,z)\,d\mu_{>k}(z). \end{align*} Then $H(y)=F_k(x_1,\dots,x_{k-1},y)$ for $\mu_k$-almost every $y$. Now define the scalar function $\Psi:[0,\infty)\times[0,\infty)\to[-e^{-1},\infty]$ by $\Psi(u,v)=u\log(u/v)$ for $u>0$ and $v>0$, by $\Psi(0,v)=0$ for $v\geq0$, and by $\Psi(u,0)=+\infty$ for $u>0$. This function is the perspective of the convex function $\Phi(t)=t\log t$, so $\Psi$ is jointly convex in the two scalar variables. This joint convexity is the correct replacement for the invalid subtraction of two convexity inequalities. We will also need a lower bound: for $v>0$, writing $r=u/v$ gives $\Psi(u,v)=v r\log r$, and $r\log r\geq -e^{-1}$ for $r\geq0$. Thus $\Psi(u,v)\geq -v/e$, and the same lower bound holds under the stated conventions when $v=0$. For each later-coordinate value $z\in E_{>k}$ whose section is integrable, define its $E_k$-mean as the map $m:E_{>k}\to[0,\infty]$ by \begin{align*} m(z) := \int_{E_k} h(y,z)\,d\mu_k(y). \end{align*} Define the mean of $H$ by \begin{align*} M := \int_{E_k} H(y)\,d\mu_k(y). \end{align*} [Fubini's theorem](/theorems/2961) applies to the non-negative integrable function $h$, so the two ways of computing the total mean agree: \begin{align*} M = \int_{E_{>k}} m(z)\,d\mu_{>k}(z). \end{align*} The entropy of $H$ can now be written as \begin{align*} \operatorname{Ent}_{\mu_k}(H) = \int_{E_k}\Psi(H(y),M)\,d\mu_k(y). \end{align*} For fixed $y\in E_k$, the pair $(H(y),M)$ is the $\mu_{>k}$-average of the pairs $(h(y,z),m(z))$. Therefore [Jensen's Inequality](/theorems/9) for the jointly convex scalar function $\Psi$ gives \begin{align*} \Psi(H(y),M) \leq \int_{E_{>k}}\Psi(h(y,z),m(z))\,d\mu_{>k}(z). \end{align*} Integrating this inequality over $E_k$ requires a signed-integral justification, because $\Psi$ is not non-negative. The lower bound gives \begin{align*} \Psi(H(y),M)^- \leq M/e \end{align*} for $\mu_k$-almost every $y$, and \begin{align*} \Psi(h(y,z),m(z))^- \leq m(z)/e \end{align*} for $\mu_k\otimes\mu_{>k}$-almost every $(y,z)$. Since Fubini gave \begin{align*} \int_{E_{>k}}m(z)\,d\mu_{>k}(z)=M<\infty, \end{align*} these negative parts are integrable. We may therefore apply [Fubini's theorem](/theorems/2961) to the negative parts and Tonelli's theorem to the positive parts, obtaining \begin{align*} \operatorname{Ent}_{\mu_k}(H) \leq \int_{E_{>k}}\int_{E_k}\Psi(h(y,z),m(z))\,d\mu_k(y)\,d\mu_{>k}(z). \end{align*} For fixed $z$, the inner integral is exactly the entropy of the slice $h(\cdot,z)$ with respect to $\mu_k$: \begin{align*} \int_{E_k}\Psi(h(y,z),m(z))\,d\mu_k(y)=\operatorname{Ent}_{\mu_k}(h(\cdot,z)). \end{align*} Thus \begin{align*} \operatorname{Ent}_{\mu_k}\bigl(F_k(x_1,\dots,x_{k-1},\cdot)\bigr) \leq \int_{E_{>k}}\operatorname{Ent}_{\mu_k,k}(f)(x_1,\dots,x_{k-1},z_{k+1},\dots,z_n)\,d\mu_{>k}(z), \end{align*} with the right-hand side interpreted in the extended sense for exceptional slices. [/guided] [/step] [step:Integrate the coordinate bounds and sum over all coordinates] Integrating the preceding inequality over $E_{\leq k-1}$ with respect to $\mu_{\leq k-1}$ gives \begin{align*} \int_{E_{\leq k-1}} \operatorname{Ent}_{\mu_k}\bigl(F_k(x_1,\dots,x_{k-1},\cdot)\bigr)\,d\mu_{\leq k-1}(x_1,\dots,x_{k-1}) \end{align*} \begin{align*} \leq \int_{E_{\leq k-1}}\int_{E_{>k}}\operatorname{Ent}_{\mu_k,k}(f)(x_1,\dots,x_{k-1},z_{k+1},\dots,z_n)\,d\mu_{>k}(z)\,d\mu_{\leq k-1}(x_1,\dots,x_{k-1}). \end{align*} Here $\operatorname{Ent}_{\mu_k,k}(f)(x_{-k})$ denotes the entropy, with respect to $\mu_k$, of the one-coordinate slice obtained by fixing all coordinates of $x$ except the $k$-th. Since the integrand on the right is non-negative and independent of the $k$-th coordinate, Tonelli's theorem permits inserting the probability integration over $E_k$ without changing its value. Thus \begin{align*} \int_{E_{\leq k-1}}\int_{E_{>k}}\operatorname{Ent}_{\mu_k,k}(f)\,d\mu_{>k}\,d\mu_{\leq k-1} = \int_E \operatorname{Ent}_{\mu_k,k}(f)(x_{-k})\,d\mu(x). \end{align*} Combining this with the telescoping decomposition gives \begin{align*} \operatorname{Ent}_\mu(f) \leq \sum_{k=1}^n \int_E \operatorname{Ent}_{\mu_k,k}(f)(x_{-k})\,d\mu(x). \end{align*} This is the asserted tensorization inequality. [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Real Numbers

What brings you to Androma?

Start with a route through the knowledge graph.

Tensorization of Entropy (Theorem # 6733)

Discussion

Proof

Prerequisites (0/1 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Tensorization of Entropy (Theorem # 6733)

Discussion

Proof

Prerequisites (0/1 completed)

Prerequisites Graph

Explore Further