Androma — The Home of Mathematics on the Internet

[step:Control one-coordinate mismatch by conditional entropy]We shall use the following form of Pinsker's inequality. [claim:Pinsker inequality for total variation] Let $(E,\mathcal{E})$ be a measurable space, and let $\rho$ and $\pi$ be probability measures on it. Define \begin{align*} \|\rho-\pi\|_{\mathrm{TV}} := \sup_{A\in\mathcal{E}} |\rho(A)-\pi(A)|. \end{align*} Then \begin{align*} \|\rho-\pi\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\rho\mid\pi)}. \end{align*} This is Pinsker's inequality, included here with proof to fix the normalization of total variation used in this argument. [/claim] [proof] If $\rho\not\ll\pi$, then $H(\rho\mid\pi)=\infty$ and the inequality is immediate. Assume $\rho\ll\pi$, and let $f:E\to[0,\infty)$ be the Radon-Nikodym derivative $f=d\rho/d\pi$. Then \begin{align*} H(\rho\mid\pi)=\int_E f\log f\,d\pi. \end{align*} Since $\int_E(f-1)\,d\pi=0$, this can also be written as \begin{align*} H(\rho\mid\pi)=\int_E (f\log f-f+1)\,d\pi. \end{align*} For every $t\geq0$, the elementary calculus inequality \begin{align*} t\log t-t+1 \geq \frac{3(t-1)^2}{2(t+2)} \end{align*} holds, with the value at $t=0$ interpreted by continuity. Indeed, the difference $g(t)$ of the two sides satisfies $g(1)=g'(1)=0$ and $g''(t)=\frac{(t-1)^2(t+8)}{t(t+2)^3}\geq0$ for $t>0$, so $g\geq0$. Hence \begin{align*} H(\rho\mid\pi)\geq \int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi. \end{align*} Applying the [Cauchy-Schwarz inequality](/theorems/432) in $L^2(E,\mathcal{E},\pi)$ to the product \begin{align*} |f-1|=\frac{\sqrt{3}\,|f-1|}{\sqrt{2(f+2)}}\cdot\sqrt{\frac{2(f+2)}{3}} \end{align*} gives \begin{align*} \left(\int_E |f-1|\,d\pi\right)^2 \leq \left(\int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi\right) \left(\int_E \frac{2(f+2)}{3}\,d\pi\right). \end{align*} Because $\int_E f\,d\pi=\rho(E)=1$ and $\pi(E)=1$, the second factor is $\frac{2}{3}(1+2)=2$. Therefore \begin{align*} \int_E |f-1|\,d\pi \leq \sqrt{2H(\rho\mid\pi)}. \end{align*} Finally, \begin{align*} \|\rho-\pi\|_{\mathrm{TV}} = \frac{1}{2}\int_E |f-1|\,d\pi, \end{align*} so \begin{align*} \|\rho-\pi\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\rho\mid\pi)}. 
\end{align*} [/proof] For each $i\in\{1,\dots,n\}$ and each previous coordinate value $x_{<i}\in X_{<i}$ for which the conditional relative entropy is defined, Pinsker's inequality applied on $(X_i,\mathcal{X}_i)$ gives \begin{align*} \|\nu_i(\,\cdot\mid x_{<i})-\mu_i\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid x_{<i})\mid\mu_i)}. \end{align*}[/step]

[guided]We first make the setup used throughout the proof explicit. If $H(\nu\mid\mu)=\infty$, the asserted inequality is vacuous because the right-hand side is $+\infty$, so assume $H(\nu\mid\mu)<\infty$; then $\nu\ll\mu$. For $i\in\{1,\dots,n\}$, define \begin{align*} X_{<i}:=X_1\times\cdots\times X_{i-1} \end{align*} and \begin{align*} \mathcal{X}_{<i}:=\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_{i-1}. \end{align*} For $i=1$, use the convention that $X_{<1}$ is a singleton with its one-point $\sigma$-algebra. Since the coordinate spaces are standard Borel, the [disintegration of measures](/theorems/971) applies to the projection onto the previous coordinates. Thus, for each $i$, choose a regular conditional probability kernel \begin{align*} \nu_i(\,\cdot\mid x_{<i}):\mathcal{X}_i\to[0,1] \end{align*} from $(X_{<i},\mathcal{X}_{<i})$ to $(X_i,\mathcal{X}_i)$ giving the conditional law of the $i$-th coordinate under $\nu$ given $x_{<i}=(x_1,\dots,x_{i-1})$. The quantity we need to control is the probability that the two coupled coordinates differ. For a single coordinate space $(X_i,\mathcal{X}_i)$, the best possible mismatch probability between two probability measures is their total variation distance. Thus, before constructing the full product coupling, we need a way to bound total variation by entropy. Let $(E,\mathcal{E})$ be any measurable space, and let $\rho$ and $\pi$ be probability measures on it. Define \begin{align*} \|\rho-\pi\|_{\mathrm{TV}} := \sup_{A\in\mathcal{E}} |\rho(A)-\pi(A)|. \end{align*} If $\rho\not\ll\pi$, then $H(\rho\mid\pi)=\infty$, so the desired bound is immediate. Assume $\rho\ll\pi$, and let $f:E\to[0,\infty)$ be the Radon-Nikodym derivative $f=d\rho/d\pi$. Then \begin{align*} H(\rho\mid\pi)=\int_E f\log f\,d\pi. \end{align*} Because both $\rho$ and $\pi$ are probability measures, $\int_E f\,d\pi=1$ and $\int_E 1\,d\pi=1$, so \begin{align*} \int_E(f-1)\,d\pi=0. 
\end{align*} Therefore \begin{align*} H(\rho\mid\pi)=\int_E (f\log f-f+1)\,d\pi. \end{align*} The elementary inequality \begin{align*} t\log t-t+1 \geq \frac{3(t-1)^2}{2(t+2)} \end{align*} holds for every $t\geq0$, with $0\log 0:=0$: the difference $g(t)$ of the two sides satisfies $g(1)=g'(1)=0$ and $g''(t)=\frac{(t-1)^2(t+8)}{t(t+2)^3}\geq0$ for $t>0$. Applying it pointwise to $t=f(z)$ and integrating with respect to $\pi$ gives \begin{align*} H(\rho\mid\pi)\geq \int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi. \end{align*} Now we convert this integral lower bound into a total variation bound. The identity \begin{align*} |f-1|=\frac{\sqrt{3}\,|f-1|}{\sqrt{2(f+2)}}\cdot\sqrt{\frac{2(f+2)}{3}} \end{align*} lets us apply the [Cauchy-Schwarz inequality](/theorems/432) in $L^2(E,\mathcal{E},\pi)$: \begin{align*} \left(\int_E |f-1|\,d\pi\right)^2 \leq \left(\int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi\right)\left(\int_E \frac{2(f+2)}{3}\,d\pi\right). \end{align*} Since $\int_E f\,d\pi=1$ and $\pi(E)=1$, we have $\int_E\frac{2(f+2)}{3}\,d\pi=2$, and we obtain \begin{align*} \left(\int_E |f-1|\,d\pi\right)^2 \leq 2H(\rho\mid\pi). \end{align*} Using \begin{align*} \|\rho-\pi\|_{\mathrm{TV}}=\frac{1}{2}\int_E |f-1|\,d\pi, \end{align*} we conclude \begin{align*} \|\rho-\pi\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\rho\mid\pi)}. \end{align*} Applying this result with $\rho=\nu_i(\,\cdot\mid x_{<i})$ and $\pi=\mu_i$ yields, for each coordinate $i$ and each admissible conditioning value $x_{<i}$, \begin{align*} \|\nu_i(\,\cdot\mid x_{<i})-\mu_i\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid x_{<i})\mid\mu_i)}. \end{align*}[/guided]
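As a sanity check of the normalization used above, the bound $\|\rho-\pi\|_{\mathrm{TV}}\leq\sqrt{\tfrac12 H(\rho\mid\pi)}$ can be evaluated numerically on a finite space, where total variation is half the $\ell^1$ distance and relative entropy is a finite sum with natural logarithm. The following is only a minimal sketch under that finiteness assumption; the function names and the sample vectors are illustrative and not part of the proof.

```python
import math

def total_variation(rho, pi):
    """sup_A |rho(A) - pi(A)| on a finite space, i.e. half the l1 distance."""
    return 0.5 * sum(abs(r - p) for r, p in zip(rho, pi))

def relative_entropy(rho, pi):
    """H(rho | pi) with natural log; +inf if rho is not absolutely continuous w.r.t. pi."""
    total = 0.0
    for r, p in zip(rho, pi):
        if r == 0.0:
            continue  # convention 0 log 0 = 0
        if p == 0.0:
            return math.inf  # rho charges a pi-null point
        total += r * math.log(r / p)
    return total

def pinsker_bound(rho, pi):
    """Right-hand side sqrt(H(rho | pi) / 2) of Pinsker's inequality."""
    return math.sqrt(0.5 * relative_entropy(rho, pi))

# Illustrative (hypothetical) probability vectors on a three-point space.
rho = [0.5, 0.3, 0.2]
pi = [0.25, 0.25, 0.5]
tv = total_variation(rho, pi)
bound = pinsker_bound(rho, pi)
print(tv, bound, tv <= bound)
```

When $\rho\not\ll\pi$ the sketch returns an infinite bound, matching the first case of the argument.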

[step:Construct a coordinatewise coupling of $\nu$ and $\mu$]Let $X$ denote the product measurable space $X_1\times\cdots\times X_n$ equipped with $\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_n$. On an auxiliary probability space $(\Omega,\mathcal{F},\mathbb{P})$, we construct sequentially random vectors $Y:\Omega\to X$ and $Z:\Omega\to X$ with coordinate maps $Y_i:\Omega\to X_i$ and $Z_i:\Omega\to X_i$. At stage $i$, after $Y_{<i}=(Y_1,\dots,Y_{i-1})$ and $Z_{<i}=(Z_1,\dots,Z_{i-1})$ have been constructed, choose $(Y_i,Z_i)$ from a maximal coupling of $\nu_i(\,\cdot\mid Y_{<i})$ and $\mu_i$, so that the first marginal is $\nu_i(\,\cdot\mid Y_{<i})$, the second marginal is $\mu_i$, and \begin{align*} \mathbb{P}(Z_i\neq Y_i\mid Y_{<i},Z_{<i}) = \|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}. \end{align*} The required parameter-measurable choice is obtained as follows. For a probability kernel $x\mapsto\rho_x$ and a fixed probability measure $\pi$ on the same standard Borel space, use the finite dominating measure $\lambda_x:=\rho_x+\pi$ and define the common subprobability kernel by the density $\min\{d\rho_x/d\lambda_x,d\pi/d\lambda_x\}$ with respect to $\lambda_x$. The measurable [Radon-Nikodym theorem](/theorems/1247) for probability kernels on standard Borel spaces makes these densities measurable in the parameter $x$, after changing them on a $\lambda_x$-null set for each parameter if necessary. The diagonal $\{(u,u):u\in X_i\}$ is measurable in $X_i\times X_i$ because $X_i$ is standard Borel. The two residual kernels obtained after subtracting the common part have the same total mass $m_x=\|\rho_x-\pi\|_{\mathrm{TV}}$, which is measurable in $x$. Placing the common part on the diagonal and, when $m_x>0$, adding $m_x$ times the product of the two normalized residual kernels gives a measurable coupling kernel whose marginals are $\rho_x$ and $\pi$ and whose mismatch probability is $\|\rho_x-\pi\|_{\mathrm{TV}}$. 
On conditioning values outside the full-measure set where the chosen regular conditional probabilities satisfy their defining identities, define the kernels arbitrarily, for instance equal to $\mu_i$; this does not change any law or expectation used below. This recursive construction gives a coupling of $\nu$ and $\mu$. The law of $Y$ is $\nu$ because its conditional law at stage $i$ given $Y_{<i}$ is exactly $\nu_i(\,\cdot\mid Y_{<i})$. Moreover, for every $i\in\{1,\dots,n\}$ and every $A\in\mathcal{X}_i$, the maximal coupling kernel at stage $i$ has second marginal $\mu_i$, so \begin{align*} \mathbb{P}(Z_i\in A\mid Y_{<i},Z_{<i})=\mu_i(A). \end{align*} Taking [conditional expectation](/page/Conditional%20Expectation) with respect to $Z_{<i}$ gives \begin{align*} \mathbb{P}(Z_i\in A\mid Z_{<i})=\mu_i(A). \end{align*} Thus the conditional law of $Z_i$ given $Z_{<i}$ is $\mu_i$, so induction over $i$ gives that the law of $Z$ is $\mu_1\otimes\cdots\otimes\mu_n=\mu$. Here and below, $\operatorname{Law}(Z)$ denotes the pushforward probability measure $\mathbb{P}\circ Z^{-1}$ on $(X,\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_n)$. Because $a_i\ge 0$ for every $i$, the weighted Hamming cost is nonnegative and, for this coupling, \begin{align*} \mathbb{E}[d_a(Y,Z)] = \sum_{i=1}^n a_i\mathbb{P}(Y_i\neq Z_i). \end{align*} Using the conditional mismatch identity and then Pinsker's inequality from the previous step, \begin{align*} \mathbb{P}(Y_i\neq Z_i) = \mathbb{E}\left[\|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}\right] \leq \mathbb{E}\left[\sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid Y_{<i})\mid\mu_i)}\right]. \end{align*}[/step]
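On a finite coordinate space the common-part construction of a maximal coupling can be written out explicitly. The sketch below is an illustration under that finiteness assumption only: it ignores all measurability questions, and the name `maximal_coupling` and the sample vectors are hypothetical. It places the pointwise minimum on the diagonal and spreads the residual mass as a rescaled product of the two residuals.

```python
def maximal_coupling(rho, pi):
    """Joint matrix joint[u][v] with row marginal rho, column marginal pi,
    and off-diagonal mass equal to the total variation distance."""
    n = len(rho)
    common = [min(r, p) for r, p in zip(rho, pi)]        # common part
    res_rho = [r - c for r, c in zip(rho, common)]       # residual of rho
    res_pi = [p - c for p, c in zip(pi, common)]         # residual of pi
    mass = sum(res_rho)                                  # residual mass = TV distance
    joint = [[0.0] * n for _ in range(n)]
    for u in range(n):
        joint[u][u] = common[u]                          # common part sits on the diagonal
    if mass > 0:
        for u in range(n):
            for v in range(n):
                # The residuals have disjoint supports, so this adds nothing on the diagonal.
                joint[u][v] += res_rho[u] * res_pi[v] / mass
    return joint

# Illustrative (hypothetical) probability vectors on a three-point space.
rho = [0.5, 0.3, 0.2]
pi = [0.25, 0.25, 0.5]
joint = maximal_coupling(rho, pi)
rows = [sum(row) for row in joint]
cols = [sum(joint[u][v] for u in range(3)) for v in range(3)]
mismatch = sum(joint[u][v] for u in range(3) for v in range(3) if u != v)
print(rows, cols, mismatch)
```

In this example the mismatch mass agrees with $\|\rho-\pi\|_{\mathrm{TV}}=0.3$ up to floating-point rounding.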

[guided]The construction must be sequential because the conditional law of the next $Y$-coordinate depends on the already chosen previous $Y$-coordinates. We therefore do not first sample the entire vector $Y$ and then resample its coordinates. Instead, at stage $i$, once $Y_{<i}$ and $Z_{<i}$ are already defined, we choose the pair $(Y_i,Z_i)$ from a maximal coupling of the two probability measures $\nu_i(\,\cdot\mid Y_{<i})$ and $\mu_i$ on $(X_i,\mathcal{X}_i)$. The maximal coupling is chosen so that its first marginal is $\nu_i(\,\cdot\mid Y_{<i})$ and its second marginal is $\mu_i$. Its defining property is \begin{align*} \mathbb{P}(Z_i\neq Y_i\mid Y_{<i},Z_{<i}) = \|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}. \end{align*} The standard Borel hypothesis is used here in two places. First, it gives regular conditional probability kernels for $\nu_i(\,\cdot\mid x_{<i})$. Second, it permits the common-part construction of maximal couplings to be made measurably in $x_{<i}$: using the dominating measure $\nu_i(\,\cdot\mid x_{<i})+\mu_i$, the pointwise minimum of the two Radon-Nikodym densities defines the common mass, the diagonal is measurable in $X_i\times X_i$ because $X_i$ is standard Borel, and the normalized residual kernels define the off-diagonal part. The regular conditional kernels are only determined outside null sets, so on exceptional conditioning values we define the coupling kernel arbitrarily, for instance by the product coupling $\mu_i\otimes\mu_i$; those exceptional choices do not affect the induced joint law. Thus the coordinatewise prescription defines a genuine probability kernel on the product space. We now verify the marginals. For $Y$, the conditional law of $Y_i$ given $Y_{<i}$ is $\nu_i(\,\cdot\mid Y_{<i})$ at every stage, so the chain of conditional distributions reconstructs the law $\nu$. For $Z$, fix $i\in\{1,\dots,n\}$ and $A\in\mathcal{X}_i$. 
Since the second marginal of the maximal coupling at stage $i$ is $\mu_i$, \begin{align*} \mathbb{P}(Z_i\in A\mid Y_{<i},Z_{<i})=\mu_i(A). \end{align*} Taking conditional expectation over the extra information $Y_{<i}$ gives \begin{align*} \mathbb{P}(Z_i\in A\mid Z_{<i})=\mu_i(A). \end{align*} Hence the conditional law of $Z_i$ given its previous coordinates is always $\mu_i$. Writing $\operatorname{Law}(Z):=\mathbb{P}\circ Z^{-1}$ for the pushforward probability measure on $(X,\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_n)$, induction over the coordinates gives $\operatorname{Law}(Z)=\mu_1\otimes\cdots\otimes\mu_n=\mu$. Finally we estimate the cost of this particular coupling. Since each weight satisfies $a_i\ge 0$, \begin{align*} \mathbb{E}[d_a(Y,Z)] = \mathbb{E}\left[\sum_{i=1}^n a_i\mathbb{1}_{\{Y_i\neq Z_i\}}\right] = \sum_{i=1}^n a_i\mathbb{P}(Y_i\neq Z_i). \end{align*} For each coordinate, taking expectation in the conditional mismatch identity and then applying the Pinsker estimate proved in the previous step gives \begin{align*} \mathbb{P}(Y_i\neq Z_i) = \mathbb{E}\left[\|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}\right] \leq \mathbb{E}\left[\sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid Y_{<i})\mid\mu_i)}\right]. \end{align*}[/guided]
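For two finite coordinates the cost of the sequential coupling can be computed exactly from the identity $\mathbb{P}(Y_i\neq Z_i)=\mathbb{E}\left[\|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}\right]$ and compared with the entropy bound. The following is a hedged sketch assuming $n=2$ and finite spaces; the joint law `nu`, the weights, and all function names are made up for illustration.

```python
import math

def tv(rho, pi):
    """Total variation distance on a finite space."""
    return 0.5 * sum(abs(r - p) for r, p in zip(rho, pi))

def rel_entropy(rho, pi):
    """Relative entropy H(rho | pi) with natural log and 0 log 0 = 0."""
    total = 0.0
    for r, p in zip(rho, pi):
        if r == 0.0:
            continue
        if p == 0.0:
            return math.inf
        total += r * math.log(r / p)
    return total

def sequential_cost_and_bound(nu, mu1, mu2, a):
    """nu[x1][x2] is a joint law on X_1 x X_2; a = (a_1, a_2) are nonnegative weights.
    Returns (sum_i a_i P(Y_i != Z_i), sum_i a_i E[sqrt(H_i / 2)])."""
    nu1 = [sum(row) for row in nu]                       # law of the first coordinate
    cost = a[0] * tv(nu1, mu1)
    bound = a[0] * math.sqrt(0.5 * rel_entropy(nu1, mu1))
    for x1, row in enumerate(nu):
        if nu1[x1] == 0.0:
            continue                                     # null conditioning value
        cond = [w / nu1[x1] for w in row]                # nu_2( . | x1)
        cost += a[1] * nu1[x1] * tv(cond, mu2)
        bound += a[1] * nu1[x1] * math.sqrt(0.5 * rel_entropy(cond, mu2))
    return cost, bound

# Hypothetical example: correlated coordinates against a uniform product measure.
nu = [[0.4, 0.1], [0.1, 0.4]]
mu1 = [0.5, 0.5]
mu2 = [0.5, 0.5]
cost, bound = sequential_cost_and_bound(nu, mu1, mu2, (1.0, 2.0))
print(cost, bound, cost <= bound)
```

Here the first coordinate already has law $\mu_1$, so the whole cost comes from the second coordinate, whose conditional laws differ from $\mu_2$.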
