Tensorization of Talagrand's $T_2$ Inequality — Statement & Proof

Theorem

Let $n\in\{1,2,\dots\}$ and let $C\in(0,+\infty)$. For each $i\in\{1,\dots,n\}$, let $(X_i,d_i)$ be a Polish [metric space](/page/Metric%20Space) with Borel $\sigma$-algebra $\mathcal B(X_i)$, and let $\rho_i\in\mathcal P(X_i)$. For probability measures $\alpha,\beta\in\mathcal P(X_i)$, define the extended quadratic Wasserstein distance by \begin{align*} W_{2,d_i}(\alpha,\beta)^2:=\inf_{\pi\in\Gamma(\alpha,\beta)}\int_{X_i\times X_i}d_i(x_i,y_i)^2\,d\pi(x_i,y_i), \end{align*} where $\Gamma(\alpha,\beta)$ is the set of couplings of $\alpha$ and $\beta$. Define \begin{align*} \operatorname{Ent}_{\rho_i}(\alpha):=\int_{X_i}\log\left(\frac{d\alpha}{d\rho_i}(x_i)\right)\,d\alpha(x_i) \end{align*} if $\alpha\ll\rho_i$, and $\operatorname{Ent}_{\rho_i}(\alpha):=+\infty$ otherwise. Assume that each $\rho_i$ satisfies Talagrand's $T_2(C)$ inequality, namely \begin{align*} W_{2,d_i}(\alpha,\rho_i)^2\le 2C\operatorname{Ent}_{\rho_i}(\alpha) \end{align*} for every $\alpha\in\mathcal P(X_i)$. Let \begin{align*} X:=\prod_{i=1}^n X_i \end{align*} and \begin{align*} \rho:=\rho_1\otimes\cdots\otimes\rho_n. \end{align*} Equip $X$ with the product Borel $\sigma$-algebra, equivalently the Borel $\sigma$-algebra of the product Polish topology, and with the [product metric](/page/Product%20Metric) $d_2:X\times X\to[0,+\infty)$ defined by \begin{align*} d_2(x,y)^2=\sum_{i=1}^n d_i(x_i,y_i)^2 \end{align*} for $x=(x_1,\dots,x_n)\in X$ and $y=(y_1,\dots,y_n)\in X$. For probability measures $\mu,\nu\in\mathcal P(X)$, define \begin{align*} W_{2,d_2}(\mu,\nu)^2:=\inf_{\Pi\in\Gamma(\mu,\nu)}\int_{X\times X}d_2(x,y)^2\,d\Pi(x,y), \end{align*} where $\Gamma(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$. Define \begin{align*} \operatorname{Ent}_{\rho}(\nu):=\int_X \log\left(\frac{d\nu}{d\rho}(x)\right)\,d\nu(x) \end{align*} if $\nu\ll\rho$, and $\operatorname{Ent}_{\rho}(\nu):=+\infty$ otherwise. 
Then $\rho$ satisfies Talagrand's $T_2(C)$ inequality on $(X,d_2)$, namely \begin{align*} W_{2,d_2}(\nu,\rho)^2\le 2C\operatorname{Ent}_{\rho}(\nu) \end{align*} for every $\nu\in\mathcal P(X)$.
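
As a sanity check on the statement, here is a worked instance that is not taken from the source page. It assumes the standard fact (Talagrand's original theorem) that the standard Gaussian $\gamma$ on $\mathbb R$ satisfies $T_2(1)$ for the metric $d_i(x,y)=|x-y|$. Take $X_i=\mathbb R$ and $\rho_i=\gamma$ for every $i$, so that $\rho=\gamma^{\otimes n}$ is the standard Gaussian on $\mathbb R^n$ and $d_2$ is the Euclidean metric. For the illustrative choice $\nu=N(m,I_n)$ with $m\in\mathbb R^n$, the translation $x\mapsto x+m$ is an optimal transport map from $\rho$ to $\nu$, and $\frac{d\nu}{d\rho}(x)=\exp\left(\langle m,x\rangle-\tfrac12|m|^2\right)$, so \begin{align*} W_{2,d_2}(\nu,\rho)^2&=|m|^2,\\ \operatorname{Ent}_{\rho}(\nu)&=\int_{\mathbb R^n}\left(\langle m,x\rangle-\tfrac12|m|^2\right)\,d\nu(x)=|m|^2-\tfrac12|m|^2=\tfrac12|m|^2. \end{align*} Hence $W_{2,d_2}(\nu,\rho)^2=2\cdot 1\cdot\operatorname{Ent}_{\rho}(\nu)$ in this example: the inequality holds with equality for $C=1$, with the same constant in every dimension $n$, as the theorem predicts.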

Discussion

Proof

[proofplan] We prove the inequality for an arbitrary law $\nu$ on the product. When the relative entropy is finite, we disintegrate $\nu$ into its successive conditional laws and use the entropy chain rule to decompose $\operatorname{Ent}_{\rho}(\nu)$ into a sum of conditional entropies. For each coordinate and each previous-coordinate history, the assumed one-coordinate $T_2(C)$ inequality gives a near-optimal coupling from the conditional law to $\rho_i$. A triangular gluing of these coordinate couplings produces a coupling of $\nu$ with $\rho$, and the square of the [product metric](/page/Product%20Metric) splits exactly as the sum of the coordinate costs. [/proofplan] [step:Reduce to laws with finite relative entropy] Let $\nu\in\mathcal P(X)$ be arbitrary. If $\operatorname{Ent}_{\rho}(\nu)=+\infty$, then \begin{align*} W_{2,d_2}(\nu,\rho)^2\le+\infty=2C\operatorname{Ent}_{\rho}(\nu), \end{align*} so the desired inequality holds. Henceforth assume that \begin{align*} \operatorname{Ent}_{\rho}(\nu)<+\infty. \end{align*} In particular, $\nu\ll\rho$ by the definition of relative entropy. Since each $X_i$ is Polish, each $X_i$ is a standard Borel space, and the finite product $X$ is again a standard Borel space. The regular conditional distribution theorem for standard Borel spaces therefore applies to the coordinate projections used below. [/step] [step:Disintegrate $\nu$ into successive conditional coordinate laws] For $i\in\{1,\dots,n\}$, define the previous-coordinate space $X_{<i}$ by \begin{align*} X_{<i}:=\prod_{j=1}^{i-1}X_j, \end{align*} with the convention that $X_{<1}$ is a one-point space. Let $\nu_{<i}\in\mathcal P(X_{<i})$ denote the marginal of $\nu$ on $X_{<i}$. 
By the regular conditional distribution theorem, for each $i\in\{1,\dots,n\}$ there exists a Borel probability kernel \begin{align*} x_{<i}\mapsto \nu_i^{x_{<i}}\in\mathcal P(X_i) \end{align*} from $X_{<i}$ to $X_i$ such that $\nu$ is represented sequentially by these kernels: \begin{align*} d\nu(x_1,\dots,x_n)=d\nu_1^{\ast}(x_1)\prod_{i=2}^n d\nu_i^{x_{<i}}(x_i), \end{align*} where $\ast$ denotes the unique point of $X_{<1}$ and $\nu_1^{\ast}$ is the first marginal of $\nu$. Define the conditional entropy function $h_i:X_{<i}\to[0,+\infty]$ by \begin{align*} h_i(x_{<i})=\operatorname{Ent}_{\rho_i}(\nu_i^{x_{<i}}). \end{align*} This function is Borel measurable because $x_{<i}\mapsto\nu_i^{x_{<i}}$ is a Borel probability kernel and the map $\alpha\mapsto\operatorname{Ent}_{\rho_i}(\alpha)$ is lower semicontinuous, hence Borel, on $\mathcal P(X_i)$ with the [weak topology](/page/Weak%20Topology). The hypotheses of the entropy chain rule are satisfied: all coordinate spaces are standard Borel, $\rho$ is the product measure $\rho_1\otimes\cdots\otimes\rho_n$, $\nu\ll\rho$, and the displayed kernels are regular conditional distributions of $\nu$. More explicitly, write $f:X\to[0,+\infty)$ for the Radon-Nikodym derivative $d\nu/d\rho$. For fixed $i$, let $f_{<i}:X_{<i}\to[0,+\infty)$ denote the marginal density of $\nu_{<i}$ with respect to $\rho_1\otimes\cdots\otimes\rho_{i-1}$, obtained from $f$ by integrating over the remaining coordinates with respect to the corresponding product reference measure. [Fubini's theorem](/theorems/2961) gives, for $\nu_{<i}$-a.e. $x_{<i}\in X_{<i}$ with $f_{<i}(x_{<i})>0$, the conditional density \begin{align*} \frac{d\nu_i^{x_{<i}}}{d\rho_i}(x_i)=\frac{\int_{\prod_{j=i+1}^n X_j} f(x_1,\dots,x_n)\,d(\rho_{i+1}\otimes\cdots\otimes\rho_n)(x_{>i})}{f_{<i}(x_{<i})}, \end{align*} where for $i=n$ the integral in the numerator is over an empty product and is read as $f(x_1,\dots,x_n)$ itself. The set on which $f_{<i}=0$ is $\nu_{<i}$-null, so \begin{align*} \nu_i^{x_{<i}}\ll\rho_i \end{align*} for $\nu_{<i}$-a.e. $x_{<i}$. 
Hence the entropy chain rule under disintegration gives \begin{align*} \operatorname{Ent}_{\rho}(\nu)=\sum_{i=1}^n\int_{X_{<i}}h_i(x_{<i})\,d\nu_{<i}(x_{<i}). \end{align*} Since the left-hand side is finite and every term in the sum is non-negative, each integral in the sum is finite, and therefore $h_i(x_{<i})<+\infty$ for $\nu_{<i}$-a.e. $x_{<i}$. [/step] [step:Choose near-optimal coordinate couplings from the one-coordinate $T_2(C)$ inequalities] Fix $\varepsilon>0$. For each $i\in\{1,\dots,n\}$ and for $\nu_{<i}$-a.e. $x_{<i}\in X_{<i}$, the measure $\nu_i^{x_{<i}}$ has finite relative entropy with respect to $\rho_i$. The assumed $T_2(C)$ inequality on $(X_i,d_i)$ gives \begin{align*} W_{2,d_i}(\nu_i^{x_{<i}},\rho_i)^2\le 2C h_i(x_{<i}). \end{align*} The cost map \begin{align*} (x_i,y_i)\mapsto d_i(x_i,y_i)^2 \end{align*} from $X_i\times X_i$ to $[0,+\infty)$ is Borel and lower semicontinuous because $d_i$ is a finite continuous metric. Let $G_i\subset X_{<i}$ be a Borel full-$\nu_{<i}$-measure set on which $h_i<+\infty$ and $\nu_i^{x_{<i}}\ll\rho_i$. On $G_i$, the admissible coupling correspondence has non-empty Borel graph on Polish spaces, and the cost functional on $\mathcal P(X_i\times X_i)$ is lower semicontinuous. Therefore the measurable near-optimal coupling selection theorem for lower semicontinuous costs on Polish spaces applies. By the definition of the Wasserstein distance as an infimum over couplings and this selection theorem, we may choose a Borel probability kernel on $G_i$ whose value at $x_{<i}$ is denoted by $K_i^{x_{<i}}\in\mathcal P(X_i\times X_i)$, whose first marginal is $\nu_i^{x_{<i}}$, whose second marginal is $\rho_i$, and such that \begin{align*} \int_{X_i\times X_i}d_i(x_i,y_i)^2\,dK_i^{x_{<i}}(x_i,y_i)\le 2C h_i(x_{<i})+\frac{\varepsilon}{n} \end{align*} for every $x_{<i}\in G_i$. 
Extend this kernel measurably to all of $X_{<i}$ by setting \begin{align*} K_i^{x_{<i}}:=\nu_i^{x_{<i}}\otimes\rho_i \end{align*} for $x_{<i}\in X_{<i}\setminus G_i$. This product-coupling extension is a Borel probability kernel, has first marginal $\nu_i^{x_{<i}}$ and second marginal $\rho_i$, and does not change any integral with respect to $\nu_{<i}$ because $X_{<i}\setminus G_i$ is $\nu_{<i}$-null. [guided] Fix $\varepsilon>0$. The goal of this step is to prepare one transport plan for each coordinate, but the source law in coordinate $i$ depends on the already chosen history $x_{<i}=(x_1,\dots,x_{i-1})$. For that reason the object we need is a measurable family of couplings indexed by $x_{<i}$. For each $i\in\{1,\dots,n\}$, the preceding disintegration step defined a Borel probability kernel \begin{align*} x_{<i}\mapsto \nu_i^{x_{<i}}\in\mathcal P(X_i) \end{align*} and the conditional entropy function \begin{align*} h_i:X_{<i}\to[0,+\infty],\qquad x_{<i}\mapsto \operatorname{Ent}_{\rho_i}(\nu_i^{x_{<i}}). \end{align*} The entropy chain rule applied there gives \begin{align*} \operatorname{Ent}_{\rho}(\nu)=\sum_{i=1}^n\int_{X_{<i}}h_i(x_{<i})\,d\nu_{<i}(x_{<i}). \end{align*} Since the left-hand side is finite and the terms are non-negative, each integral is finite. Therefore, for each fixed $i$, one has \begin{align*} h_i(x_{<i})<+\infty \end{align*} for $\nu_{<i}$-a.e. $x_{<i}\in X_{<i}$. In particular, for those histories, $\nu_i^{x_{<i}}\ll\rho_i$ and $\nu_i^{x_{<i}}$ is an admissible input to the one-coordinate Talagrand inequality. The assumed one-coordinate Talagrand inequality applies to $\nu_i^{x_{<i}}\in\mathcal P(X_i)$, because the hypothesis says that $\rho_i$ satisfies $T_2(C)$ against every probability law on $X_i$. Thus, for $\nu_{<i}$-a.e. $x_{<i}$, \begin{align*} W_{2,d_i}(\nu_i^{x_{<i}},\rho_i)^2\le 2C\operatorname{Ent}_{\rho_i}(\nu_i^{x_{<i}})=2C h_i(x_{<i}). 
\end{align*} By definition, $W_{2,d_i}(\nu_i^{x_{<i}},\rho_i)^2$ is the infimum of \begin{align*} \int_{X_i\times X_i}d_i(x_i,y_i)^2\,dK(x_i,y_i) \end{align*} over all couplings $K\in\mathcal P(X_i\times X_i)$ whose first marginal is $\nu_i^{x_{<i}}$ and whose second marginal is $\rho_i$. The cost map $(x_i,y_i)\mapsto d_i(x_i,y_i)^2$ is Borel and lower semicontinuous because $d_i$ is a finite continuous metric on the Polish space $X_i$. The coupling correspondence has non-empty Borel graph, and the cost functional is lower semicontinuous on $\mathcal P(X_i\times X_i)$. Hence the measurable near-optimal coupling selection theorem for lower semicontinuous costs on Polish spaces applies. Let $G_i\subset X_{<i}$ be a Borel full-$\nu_{<i}$-measure set on which $h_i<+\infty$ and $\nu_i^{x_{<i}}\ll\rho_i$. Consequently, we may choose a Borel probability kernel on $G_i$ whose value at $x_{<i}$ is denoted by \begin{align*} K_i^{x_{<i}}\in\mathcal P(X_i\times X_i) \end{align*} such that $K_i^{x_{<i}}$ has first marginal $\nu_i^{x_{<i}}$, second marginal $\rho_i$, and \begin{align*} \int_{X_i\times X_i}d_i(x_i,y_i)^2\,dK_i^{x_{<i}}(x_i,y_i)\le 2C h_i(x_{<i})+\frac{\varepsilon}{n} \end{align*} for every $x_{<i}\in G_i$. On the exceptional set $X_{<i}\setminus G_i$, define \begin{align*} K_i^{x_{<i}}:=\nu_i^{x_{<i}}\otimes\rho_i. \end{align*} This is a measurable product-coupling extension of the selected kernel to all histories, and the exceptional histories cannot affect any subsequent integral with respect to $\nu_{<i}$. [/guided] [/step] [step:Glue the coordinate couplings into a product coupling] For each $i\in\{1,\dots,n\}$, disintegrate the coupling kernel $K_i^{x_{<i}}$ with respect to its first marginal. The parameter space $X_{<i}$ and the coordinate space $X_i$ are standard Borel, and $x_{<i}\mapsto K_i^{x_{<i}}$ is a Borel probability kernel. 
Hence the parameterized disintegration theorem for probability kernels on standard Borel spaces gives a Borel probability kernel \begin{align*} (x_{<i},x_i)\mapsto L_i^{x_{<i},x_i}\in\mathcal P(X_i) \end{align*} such that \begin{align*} dK_i^{x_{<i}}(x_i,y_i)=d\nu_i^{x_{<i}}(x_i)\,dL_i^{x_{<i},x_i}(y_i) \end{align*} for $\nu_{<i}$-a.e. $x_{<i}$. Define a [probability measure](/page/Probability%20Measure) $\Pi\in\mathcal P(X\times X)$ as follows. First sample $x=(x_1,\dots,x_n)$ according to $\nu$. Conditional on this $x$, sample $y_i\in X_i$ independently across $i$ with conditional law $L_i^{x_{<i},x_i}$. Equivalently, for bounded Borel functions $F:X\times X\to\mathbb R$, define \begin{align*} \int_{X\times X}F(x,y)\,d\Pi(x,y)=\int_X\int_{X_1}\cdots\int_{X_n}F(x,y)\prod_{i=1}^n dL_i^{x_{<i},x_i}(y_i)\,d\nu(x). \end{align*} The Ionescu-Tulcea construction for probability kernels applies because the base law $\nu$ is a probability measure on the standard Borel space $X$ and each $L_i$ is a Borel probability kernel. It gives a well-defined probability measure $\Pi$. The first marginal of $\Pi$ is $\nu$ by construction. We prove that the second marginal is $\rho$. For $m\in\{1,\dots,n\}$, let $\Pi_{<m}\in\mathcal P(X_{<m}\times X_{<m})$ denote the marginal of $\Pi$ on the first $m-1$ coordinate pairs, with the convention that $\Pi_{<1}$ is the unit mass on a one-point space. Let $\varphi_i:X_i\to\mathbb R$ be bounded Borel functions for $i\in\{1,\dots,n\}$. Since the second marginal of $K_n^{x_{<n}}$ is $\rho_n$ for $\nu_{<n}$-a.e. $x_{<n}$, integrating first in $x_n$ and $y_n$ gives \begin{align*} \int_{X\times X}\prod_{i=1}^n\varphi_i(y_i)\,d\Pi(x,y)=\int_{X_{<n}\times X_{<n}}\prod_{i=1}^{n-1}\varphi_i(y_i)\left(\int_{X_n}\varphi_n(y_n)\,d\rho_n(y_n)\right)\,d\Pi_{<n}(x_{<n},y_{<n}). \end{align*} Iterating this identity over the remaining coordinates yields the product formula for the second marginal, as follows. 
For $k\in\{1,\dots,n\}$, let $A_k$ be the assertion \begin{align*} \int_{X\times X}\prod_{i=1}^n\varphi_i(y_i)\,d\Pi(x,y)=\left(\prod_{i=n-k+1}^n\int_{X_i}\varphi_i(y_i)\,d\rho_i(y_i)\right)\int_{X_{<n-k+1}\times X_{<n-k+1}}\prod_{i=1}^{n-k}\varphi_i(y_i)\,d\Pi_{<n-k+1}(x_{<n-k+1},y_{<n-k+1}). \end{align*} The computation just made proves $A_1$. If $A_k$ holds with $k<n$, then the second marginal property of $K_{n-k}^{x_{<n-k}}$ gives \begin{align*} \int_{X_{<n-k+1}\times X_{<n-k+1}}\prod_{i=1}^{n-k}\varphi_i(y_i)\,d\Pi_{<n-k+1}(x_{<n-k+1},y_{<n-k+1})=\left(\int_{X_{n-k}}\varphi_{n-k}(y_{n-k})\,d\rho_{n-k}(y_{n-k})\right)\int_{X_{<n-k}\times X_{<n-k}}\prod_{i=1}^{n-k-1}\varphi_i(y_i)\,d\Pi_{<n-k}(x_{<n-k},y_{<n-k}). \end{align*} Substitution proves $A_{k+1}$. Taking $k=n$, where $\Pi_{<1}$ is the unit mass on a one-point space and the empty product equals $1$, gives \begin{align*} \int_{X\times X}\prod_{i=1}^n\varphi_i(y_i)\,d\Pi(x,y)=\prod_{i=1}^n\int_{X_i}\varphi_i(y_i)\,d\rho_i(y_i). \end{align*} Integrals of bounded Borel product functions determine a probability measure on a finite product of Polish spaces, because indicators of measurable rectangles form a $\pi$-system generating the product $\sigma$-algebra. Hence the second marginal of $\Pi$ is $\rho_1\otimes\cdots\otimes\rho_n=\rho$, and $\Pi$ is a coupling of $\nu$ and $\rho$. [/step] [step:Estimate the product transport cost by the entropy] Since $\Pi$ is a coupling of $\nu$ and $\rho$, the definition of the quadratic Wasserstein distance gives \begin{align*} W_{2,d_2}(\nu,\rho)^2\le\int_{X\times X}d_2(x,y)^2\,d\Pi(x,y). \end{align*} By the definition of $d_2$ and Tonelli's theorem applied to the non-negative functions $d_i(x_i,y_i)^2$, we obtain \begin{align*} \int_{X\times X}d_2(x,y)^2\,d\Pi(x,y)=\sum_{i=1}^n\int_{X\times X}d_i(x_i,y_i)^2\,d\Pi(x,y). \end{align*} For each $i$, the construction of $\Pi$ and the definition of $K_i^{x_{<i}}$ imply \begin{align*} \int_{X\times X}d_i(x_i,y_i)^2\,d\Pi(x,y)=\int_{X_{<i}}\int_{X_i\times X_i}d_i(x_i,y_i)^2\,dK_i^{x_{<i}}(x_i,y_i)\,d\nu_{<i}(x_{<i}). 
\end{align*} The near-optimality estimate for $K_i^{x_{<i}}$ holds for $\nu_{<i}$-a.e. $x_{<i}$. Integrating it against the probability measure $\nu_{<i}$, we get \begin{align*} \int_{X\times X}d_i(x_i,y_i)^2\,d\Pi(x,y)\le 2C\int_{X_{<i}}h_i(x_{<i})\,d\nu_{<i}(x_{<i})+\frac{\varepsilon}{n}. \end{align*} Summing over $i$ and using the entropy chain rule gives \begin{align*} \int_{X\times X}d_2(x,y)^2\,d\Pi(x,y)\le 2C\operatorname{Ent}_{\rho}(\nu)+\varepsilon. \end{align*} Therefore \begin{align*} W_{2,d_2}(\nu,\rho)^2\le 2C\operatorname{Ent}_{\rho}(\nu)+\varepsilon. \end{align*} This inequality holds for every $\varepsilon>0$, and neither the left-hand side nor the entropy term depends on $\varepsilon$, so letting $\varepsilon\downarrow 0$ gives \begin{align*} W_{2,d_2}(\nu,\rho)^2\le 2C\operatorname{Ent}_{\rho}(\nu). \end{align*} This is precisely the $T_2(C)$ inequality for $\rho$ on $(X,d_2)$. [/step]
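
The entropy chain rule used in the disintegration step can be checked numerically on a finite space, where every measure is a probability vector and conditional laws are row-normalized tables. The following is a minimal sketch for $n=2$ on two-point spaces. The helper name `kl`, the arrays `rho1`, `rho2`, `nu`, and their numerical values are hypothetical choices made for illustration; none of them come from the source page.

```python
import numpy as np


def kl(p, q):
    """Relative entropy sum_x p(x) log(p(x)/q(x)), with 0 log 0 = 0.

    Returns +inf when p is not absolutely continuous with respect to q.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


# Hypothetical reference measures rho_1, rho_2 on the two-point space {0, 1}.
rho1 = np.array([0.5, 0.5])
rho2 = np.array([0.3, 0.7])
rho = np.outer(rho1, rho2)  # product measure rho_1 (x) rho_2

# Hypothetical joint law nu on {0, 1} x {0, 1}; entry [x1, x2] is nu({(x1, x2)}).
nu = np.array([[0.1, 0.2],
               [0.3, 0.4]])

nu1 = nu.sum(axis=1)       # first marginal of nu
cond = nu / nu1[:, None]   # row x1 holds the conditional law of x2 given x1

# Left side: Ent_rho(nu). Right side: Ent_{rho_1}(nu_1) plus the
# nu_1-average of the conditional entropies h_2(x1) = Ent_{rho_2}(nu_2^{x1}).
lhs = kl(nu.ravel(), rho.ravel())
rhs = kl(nu1, rho1) + sum(nu1[x1] * kl(cond[x1], rho2) for x1 in range(2))

print("Ent_rho(nu)          =", lhs)
print("sum of chain terms   =", rhs)
```

The two printed values should agree up to floating-point rounding, which is the $n=2$ case of the identity $\operatorname{Ent}_{\rho}(\nu)=\sum_{i}\int h_i\,d\nu_{<i}$. The check covers only the entropy decomposition, not the coupling construction.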

Tensorization of Talagrand's $T_2$ Inequality (Theorem # 10033)
