[proofplan]
We prove the weighted statement directly by constructing a coordinatewise coupling of $\nu$ and $\mu$. At each coordinate, conditional on the previously coupled coordinates, we couple the conditional law of $\nu$ to the product coordinate law $\mu_i$ so that the mismatch probability is the total variation distance. Pinsker's inequality bounds this mismatch probability by the square root of the corresponding conditional relative entropy, and the chain rule identifies the sum of those conditional entropies with $H(\nu\mid\mu)$. A final Cauchy-Schwarz estimate gives the weighted bound, and the unweighted Hamming case follows by taking all weights equal to $1$.
[/proofplan]
[step:Reduce to the case of finite relative entropy]
If $H(\nu\mid\mu)=\infty$, the asserted inequality holds trivially because the right-hand side is $+\infty$. We therefore assume throughout the proof that
\begin{align*}
H(\nu\mid\mu)<\infty.
\end{align*}
In particular $\nu \ll \mu$.
For $i \in \{1,\dots,n\}$, define the previous-coordinate product space $X_{<i}$ by
\begin{align*}
X_{<i} := X_1 \times \cdots \times X_{i-1}.
\end{align*}
Define the corresponding product $\sigma$-algebra $\mathcal{X}_{<i}$ by
\begin{align*}
\mathcal{X}_{<i} := \mathcal{X}_1 \otimes \cdots \otimes \mathcal{X}_{i-1}.
\end{align*}
We use the convention that $X_{<1}$ is a singleton and $\mathcal{X}_{<1}$ is the one-point $\sigma$-algebra on it. Since each $(X_i,\mathcal{X}_i)$ is standard Borel by hypothesis, so is each finite product $(X_{<i},\mathcal{X}_{<i})$. Therefore the [disintegration of measures](/theorems/971) applies to the coordinate projection from $X_1\times\cdots\times X_i$ onto $X_{<i}$ and gives regular conditional distributions for the $i$-th coordinate under $\nu$ conditional on the previous coordinates. Let
\begin{align*}
\nu_i(\,\cdot \mid x_{<i})
\end{align*}
denote a regular [conditional probability](/page/Conditional%20Probability) kernel from $(X_{<i},\mathcal{X}_{<i})$ to $(X_i,\mathcal{X}_i)$ for the $i$-th coordinate under $\nu$, conditional on the previous coordinates $x_{<i}=(x_1,\dots,x_{i-1})$.
[/step]
[step:Control one-coordinate mismatch by conditional entropy]
We shall use the following form of Pinsker's inequality.
[claim:Pinsker inequality for total variation]
Let $(E,\mathcal{E})$ be a measurable space, and let $\rho$ and $\pi$ be probability measures on it. Define
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}} := \sup_{A\in\mathcal{E}} |\rho(A)-\pi(A)|.
\end{align*}
Then
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\rho\mid\pi)}.
\end{align*}
This is Pinsker's inequality, included here with proof to fix the normalization of total variation used in this argument.
[/claim]
[proof]
If $\rho\not\ll\pi$, then $H(\rho\mid\pi)=\infty$ and the inequality is immediate. Assume $\rho\ll\pi$, and let $f:E\to[0,\infty)$ be the Radon-Nikodym derivative $f=d\rho/d\pi$. Then
\begin{align*}
H(\rho\mid\pi)=\int_E f\log f\,d\pi.
\end{align*}
Since $\int_E(f-1)\,d\pi=0$, this can also be written as
\begin{align*}
H(\rho\mid\pi)=\int_E (f\log f-f+1)\,d\pi.
\end{align*}
For every $t\geq0$, the elementary calculus inequality
\begin{align*}
t\log t-t+1 \geq \frac{3(t-1)^2}{2(t+2)}
\end{align*}
holds, with the value at $t=0$ interpreted by continuity. Indeed, the function $g(t):=(2t+4)(t\log t-t+1)-3(t-1)^2$ satisfies $g(1)=g'(1)=0$ and $g''(t)=4(\log t+1/t-1)\geq0$ on $(0,\infty)$, so $g\geq0$. Hence
\begin{align*}
H(\rho\mid\pi)\geq \int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi.
\end{align*}
Applying the [Cauchy-Schwarz inequality](/theorems/432) in $L^2(E,\mathcal{E},\pi)$ to the product
\begin{align*}
|f-1|=\frac{\sqrt{3}\,|f-1|}{\sqrt{2(f+2)}}\sqrt{\frac{2(f+2)}{3}}
\end{align*}
gives
\begin{align*}
\left(\int_E |f-1|\,d\pi\right)^2
\leq
\left(\int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi\right)
\left(\int_E \frac{2(f+2)}{3}\,d\pi\right).
\end{align*}
Because $\int_E f\,d\pi=\rho(E)=1$ and $\pi(E)=1$, the second factor is $\frac{2}{3}(1+2)=2$. Therefore
\begin{align*}
\int_E |f-1|\,d\pi \leq \sqrt{2H(\rho\mid\pi)}.
\end{align*}
Finally,
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}}
=
\frac{1}{2}\int_E |f-1|\,d\pi,
\end{align*}
so
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\rho\mid\pi)}.
\end{align*}
[/proof]
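As a quick sanity check on this normalization (an illustration with values chosen here, not used later), take $E=\{0,1\}$, $\rho=\mathrm{Bernoulli}(1/2)$ and $\pi=\mathrm{Bernoulli}(1/4)$, with $\log$ the natural logarithm. Then
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}} &= \tfrac{1}{2}-\tfrac{1}{4} = 0.25,\\
H(\rho\mid\pi) &= \tfrac{1}{2}\log\frac{1/2}{1/4}+\tfrac{1}{2}\log\frac{1/2}{3/4} = \tfrac{1}{2}\log\tfrac{4}{3}\approx 0.1438,\\
\sqrt{\tfrac{1}{2}H(\rho\mid\pi)} &\approx 0.2682 \geq 0.25,
\end{align*}
which is consistent with the claim when total variation is normalized as a supremum over events.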
For each $i\in\{1,\dots,n\}$ and each previous coordinate value $x_{<i}\in X_{<i}$ for which the conditional relative entropy is defined, Pinsker's inequality applied on $(X_i,\mathcal{X}_i)$ gives
\begin{align*}
\|\nu_i(\,\cdot\mid x_{<i})-\mu_i\|_{\mathrm{TV}}
\leq
\sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid x_{<i})\mid\mu_i)}.
\end{align*}
[guided]
We first make the setup used throughout the proof explicit. If $H(\nu\mid\mu)=\infty$, the asserted inequality holds trivially because the right-hand side is $+\infty$, so assume $H(\nu\mid\mu)<\infty$; then $\nu\ll\mu$. For $i\in\{1,\dots,n\}$, define
\begin{align*}
X_{<i}:=X_1\times\cdots\times X_{i-1}
\end{align*}
and
\begin{align*}
\mathcal{X}_{<i}:=\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_{i-1}.
\end{align*}
For $i=1$, use the convention that $X_{<1}$ is a singleton with its one-point $\sigma$-algebra. Since the coordinate spaces are standard Borel, the [disintegration of measures](/theorems/971) applies to the projection onto the previous coordinates. Thus, for each $i$, choose a regular conditional probability kernel
\begin{align*}
\nu_i(\,\cdot\mid x_{<i}):\mathcal{X}_i\to[0,1]
\end{align*}
from $(X_{<i},\mathcal{X}_{<i})$ to $(X_i,\mathcal{X}_i)$ giving the conditional law of the $i$-th coordinate under $\nu$ given $x_{<i}=(x_1,\dots,x_{i-1})$.
The quantity we need to control is the probability that the two coupled coordinates differ. For a single coordinate space $(X_i,\mathcal{X}_i)$, the best possible mismatch probability between two probability measures is their total variation distance. Thus, before constructing the full product coupling, we need a way to bound total variation by entropy.
Let $(E,\mathcal{E})$ be any measurable space, and let $\rho$ and $\pi$ be probability measures on it. Define
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}} := \sup_{A\in\mathcal{E}} |\rho(A)-\pi(A)|.
\end{align*}
If $\rho\not\ll\pi$, then $H(\rho\mid\pi)=\infty$, so the desired bound is immediate. Assume $\rho\ll\pi$, and let $f:E\to[0,\infty)$ be the Radon-Nikodym derivative $f=d\rho/d\pi$. Then
\begin{align*}
H(\rho\mid\pi)=\int_E f\log f\,d\pi.
\end{align*}
Because both $\rho$ and $\pi$ are probability measures, $\int_E f\,d\pi=1$ and $\int_E 1\,d\pi=1$, so
\begin{align*}
\int_E(f-1)\,d\pi=0.
\end{align*}
Therefore
\begin{align*}
H(\rho\mid\pi)=\int_E (f\log f-f+1)\,d\pi.
\end{align*}
The elementary inequality
\begin{align*}
t\log t-t+1 \geq \frac{3(t-1)^2}{2(t+2)}
\end{align*}
holds for every $t\geq0$, with the value at $t=0$ interpreted by continuity. To see this, set $g(t):=(2t+4)(t\log t-t+1)-3(t-1)^2$; then $g(1)=g'(1)=0$ and $g''(t)=4(\log t+1/t-1)\geq0$ for $t>0$, so the convex function $g$ is nonnegative. Applying the inequality pointwise to $t=f(z)$ and integrating with respect to $\pi$ gives
\begin{align*}
H(\rho\mid\pi)\geq \int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi.
\end{align*}
Now we convert this integral lower bound into a total variation bound. The identity
\begin{align*}
|f-1|=\frac{\sqrt{3}\,|f-1|}{\sqrt{2(f+2)}}\sqrt{\frac{2(f+2)}{3}}
\end{align*}
lets us apply the [Cauchy-Schwarz inequality](/theorems/432) in $L^2(E,\mathcal{E},\pi)$:
\begin{align*}
\left(\int_E |f-1|\,d\pi\right)^2 \leq \left(\int_E \frac{3(f-1)^2}{2(f+2)}\,d\pi\right)\left(\int_E \frac{2(f+2)}{3}\,d\pi\right).
\end{align*}
Since $\int_E f\,d\pi=1$, the second factor is $\int_E \frac{2(f+2)}{3}\,d\pi=2$, and we obtain
\begin{align*}
\left(\int_E |f-1|\,d\pi\right)^2 \leq 2H(\rho\mid\pi).
\end{align*}
Using
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}}=\frac{1}{2}\int_E |f-1|\,d\pi,
\end{align*}
we conclude
\begin{align*}
\|\rho-\pi\|_{\mathrm{TV}} \leq \sqrt{\frac{1}{2}H(\rho\mid\pi)}.
\end{align*}
Applying this result with $\rho=\nu_i(\,\cdot\mid x_{<i})$ and $\pi=\mu_i$ yields, for each coordinate $i$ and each admissible conditioning value $x_{<i}$,
\begin{align*}
\|\nu_i(\,\cdot\mid x_{<i})-\mu_i\|_{\mathrm{TV}}
\leq
\sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid x_{<i})\mid\mu_i)}.
\end{align*}
[/guided]
[/step]
[step:Construct a coordinatewise coupling of $\nu$ and $\mu$]
Let $X$ denote the product measurable space $X_1\times\cdots\times X_n$ equipped with $\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_n$. We construct random vectors $Y:\Omega\to X$ and $Z:\Omega\to X$ with coordinate maps $Y_i:\Omega\to X_i$ and $Z_i:\Omega\to X_i$ on an auxiliary probability space $(\Omega,\mathcal{F},\mathbb{P})$ sequentially. At stage $i$, after $Y_{<i}=(Y_1,\dots,Y_{i-1})$ and $Z_{<i}=(Z_1,\dots,Z_{i-1})$ have been constructed, choose $(Y_i,Z_i)$ from a maximal coupling of $\nu_i(\,\cdot\mid Y_{<i})$ and $\mu_i$, so that the first marginal is $\nu_i(\,\cdot\mid Y_{<i})$, the second marginal is $\mu_i$, and
\begin{align*}
\mathbb{P}(Z_i\neq Y_i\mid Y_{<i},Z_{<i})
=
\|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}.
\end{align*}
The required parameter-measurable choice is obtained as follows. Let $x\mapsto\rho_x$ be a probability kernel and let $\pi$ be a fixed probability measure on the same standard Borel space. Set the finite dominating measure $\lambda_x:=\rho_x+\pi$ and define the common subprobability kernel by the density $\min\{d\rho_x/d\lambda_x,d\pi/d\lambda_x\}$ with respect to $\lambda_x$. The measurable [Radon-Nikodym theorem](/theorems/1247) for probability kernels on standard Borel spaces makes these densities measurable in the parameter $x$, after changing them on a $\lambda_x$-null set for each parameter if necessary. The diagonal $\{(u,u):u\in X_i\}$ is measurable in $X_i\times X_i$ because $X_i$ is standard Borel. The two residual kernels obtained after subtracting the common part are mutually singular and have the same total mass, which is measurable in $x$. Placing the common part on the diagonal and, when the residual mass is positive, adding the product of the two residual kernels divided by that mass gives a measurable coupling kernel whose marginals are $\rho_x$ and $\pi$ and whose mismatch probability is $\|\rho_x-\pi\|_{\mathrm{TV}}$. On conditioning values outside the full-measure set where the chosen regular conditional probabilities satisfy their defining identities, define the kernels arbitrarily, for instance equal to $\mu_i$; this does not change any law or expectation used below.
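In symbols, the common-part construction can be sketched as follows. The notation $r_x$, $p_x$, $\rho_x\wedge\pi$, $c(x)$, $\Delta$ and $\gamma_x$ is introduced here only for illustration. Write $r_x:=d\rho_x/d\lambda_x$, $p_x:=d\pi/d\lambda_x$, and $\Delta(u):=(u,u)$ for the diagonal embedding of $X_i$ into $X_i\times X_i$. Then
\begin{align*}
(\rho_x\wedge\pi)(A) &:= \int_A \min\{r_x,p_x\}\,d\lambda_x,
\qquad
c(x) := (\rho_x\wedge\pi)(X_i),\\
\gamma_x &:= (\rho_x\wedge\pi)\circ\Delta^{-1}
+\frac{(\rho_x-\rho_x\wedge\pi)\otimes(\pi-\rho_x\wedge\pi)}{1-c(x)}
\qquad\text{if } c(x)<1,
\end{align*}
and $\gamma_x:=\rho_x\circ\Delta^{-1}$ if $c(x)=1$. The residual measures have $\lambda_x$-densities $(r_x-p_x)^+$ and $(p_x-r_x)^+$, which are supported on disjoint sets, so the product term charges only off-diagonal pairs and
\begin{align*}
\gamma_x(\{(u,v):u\neq v\}) = 1-c(x) = \|\rho_x-\pi\|_{\mathrm{TV}}.
\end{align*}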
This recursive construction gives a coupling of $\nu$ and $\mu$. The law of $Y$ is $\nu$ because its conditional law at stage $i$ given $Y_{<i}$ is exactly $\nu_i(\,\cdot\mid Y_{<i})$. Moreover, for every $i\in\{1,\dots,n\}$ and every $A\in\mathcal{X}_i$, the maximal coupling kernel at stage $i$ has second marginal $\mu_i$, so
\begin{align*}
\mathbb{P}(Z_i\in A\mid Y_{<i},Z_{<i})=\mu_i(A).
\end{align*}
Taking [conditional expectation](/page/Conditional%20Expectation) with respect to $Z_{<i}$ gives
\begin{align*}
\mathbb{P}(Z_i\in A\mid Z_{<i})=\mu_i(A).
\end{align*}
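Written out, this is the tower property for the nested $\sigma$-algebras $\sigma(Z_{<i})\subseteq\sigma(Y_{<i},Z_{<i})$:
\begin{align*}
\mathbb{P}(Z_i\in A\mid Z_{<i})
=
\mathbb{E}\left[\mathbb{P}(Z_i\in A\mid Y_{<i},Z_{<i})\,\middle|\,Z_{<i}\right]
=
\mathbb{E}\left[\mu_i(A)\mid Z_{<i}\right]
=
\mu_i(A).
\end{align*}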
Thus the conditional law of $Z_i$ given $Z_{<i}$ is $\mu_i$, so induction over $i$ gives $\operatorname{Law}(Z)=\mu_1\otimes\cdots\otimes\mu_n=\mu$, where $\operatorname{Law}(Z)$ denotes the pushforward probability measure $\mathbb{P}\circ Z^{-1}$ on $(X,\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_n)$.
Since $d_a(y,z)=\sum_{i=1}^n a_i\mathbb{1}_{\{y_i\neq z_i\}}$ with weights $a_i\ge 0$, linearity of expectation gives, for this coupling,
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
=
\sum_{i=1}^n a_i\mathbb{P}(Y_i\neq Z_i).
\end{align*}
Using the conditional mismatch identity and then Pinsker's inequality from the previous step,
\begin{align*}
\mathbb{P}(Y_i\neq Z_i)
=
\mathbb{E}\left[\|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}\right]
\leq
\mathbb{E}\left[\sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid Y_{<i})\mid\mu_i)}\right].
\end{align*}
[guided]
The construction must be sequential because the conditional law of the next $Y$-coordinate depends on the already chosen previous $Y$-coordinates. We therefore do not first sample the entire vector $Y$ and then resample its coordinates. Instead, at stage $i$, once $Y_{<i}$ and $Z_{<i}$ are already defined, we choose the pair $(Y_i,Z_i)$ from a maximal coupling of the two probability measures $\nu_i(\,\cdot\mid Y_{<i})$ and $\mu_i$ on $(X_i,\mathcal{X}_i)$.
The maximal coupling is chosen so that its first marginal is $\nu_i(\,\cdot\mid Y_{<i})$ and its second marginal is $\mu_i$. Its defining property is
\begin{align*}
\mathbb{P}(Z_i\neq Y_i\mid Y_{<i},Z_{<i})
=
\|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}.
\end{align*}
The standard Borel hypothesis is used here in two places. First, it gives regular conditional probability kernels for $\nu_i(\,\cdot\mid x_{<i})$. Second, it permits the common-part construction of maximal couplings to be made measurably in $x_{<i}$: using the dominating measure $\nu_i(\,\cdot\mid x_{<i})+\mu_i$, the pointwise minimum of the two Radon-Nikodym densities defines the common mass, the diagonal is measurable in $X_i\times X_i$ because $X_i$ is standard Borel, and the normalized residual kernels define the off-diagonal part. The regular conditional kernels are only determined outside null sets, so on exceptional conditioning values we define the coupling kernel arbitrarily, for instance by the product coupling $\mu_i\otimes\mu_i$; those exceptional choices do not affect the induced joint law. Thus the coordinatewise prescription defines a genuine probability kernel on the product space.
We now verify the marginals. For $Y$, the conditional law of $Y_i$ given $Y_{<i}$ is $\nu_i(\,\cdot\mid Y_{<i})$ at every stage, so the chain of conditional distributions reconstructs the law $\nu$. For $Z$, fix $i\in\{1,\dots,n\}$ and $A\in\mathcal{X}_i$. Since the second marginal of the maximal coupling at stage $i$ is $\mu_i$,
\begin{align*}
\mathbb{P}(Z_i\in A\mid Y_{<i},Z_{<i})=\mu_i(A).
\end{align*}
Taking conditional expectation over the extra information $Y_{<i}$ gives
\begin{align*}
\mathbb{P}(Z_i\in A\mid Z_{<i})=\mu_i(A).
\end{align*}
Hence the conditional law of $Z_i$ given its previous coordinates is always $\mu_i$. Writing $\operatorname{Law}(Z):=\mathbb{P}\circ Z^{-1}$ for the pushforward probability measure on $(X,\mathcal{X}_1\otimes\cdots\otimes\mathcal{X}_n)$, induction over the coordinates gives $\operatorname{Law}(Z)=\mu_1\otimes\cdots\otimes\mu_n=\mu$.
Finally we estimate the cost of this particular coupling. Since each weight satisfies $a_i\ge 0$,
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
=
\mathbb{E}\left[\sum_{i=1}^n a_i\mathbb{1}_{\{Y_i\neq Z_i\}}\right]
=
\sum_{i=1}^n a_i\mathbb{P}(Y_i\neq Z_i).
\end{align*}
For each coordinate, taking expectation in the conditional mismatch identity and then applying the Pinsker estimate proved in the previous step gives
\begin{align*}
\mathbb{P}(Y_i\neq Z_i)
=
\mathbb{E}\left[\|\nu_i(\,\cdot\mid Y_{<i})-\mu_i\|_{\mathrm{TV}}\right]
\leq
\mathbb{E}\left[\sqrt{\frac{1}{2}H(\nu_i(\,\cdot\mid Y_{<i})\mid\mu_i)}\right].
\end{align*}
[/guided]
[/step]
[step:Sum the coordinate estimates by Cauchy-Schwarz]
Define
\begin{align*}
h_i:X_{<i}\to[0,\infty]
\end{align*}
by
\begin{align*}
h_i(x_{<i}) := H(\nu_i(\,\cdot\mid x_{<i})\mid\mu_i).
\end{align*}
The previous step gives
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
\leq
\sum_{i=1}^n a_i\,\mathbb{E}\left[\sqrt{\frac{1}{2}h_i(Y_{<i})}\right].
\end{align*}
Applying the [Cauchy-Schwarz inequality](/theorems/432) to each expectation, against the constant function $1$, gives
\begin{align*}
\mathbb{E}\left[\sqrt{\frac{1}{2}h_i(Y_{<i})}\right]
\leq
\sqrt{\frac{1}{2}\mathbb{E}[h_i(Y_{<i})]},
\end{align*}
and a second application, to the finite sum over $i$ with the vectors $(a_i)_{i=1}^n$ and $(\sqrt{\mathbb{E}[h_i(Y_{<i})]/2})_{i=1}^n$, yields
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
\leq
\sqrt{\sum_{i=1}^n a_i^2}
\sqrt{\frac{1}{2}\sum_{i=1}^n \mathbb{E}[h_i(Y_{<i})]}.
\end{align*}
[guided]
We now turn the coordinatewise mismatch estimates into one weighted estimate. Define the measurable function $h_i:X_{<i}\to[0,\infty]$ by
\begin{align*}
h_i(x_{<i}) := H(\nu_i(\,\cdot\mid x_{<i})\mid\mu_i).
\end{align*}
The previous construction and Pinsker estimate give
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
\leq
\sum_{i=1}^n a_i\,\mathbb{E}\left[\sqrt{\frac{1}{2}h_i(Y_{<i})}\right].
\end{align*}
For each fixed $i$, apply the [Cauchy-Schwarz inequality](/theorems/432) in the probability space carrying $Y$ to the product of the random variables $1$ and $\sqrt{h_i(Y_{<i})/2}$. Since $\mathbb{E}[1^2]=1$, this gives
\begin{align*}
\mathbb{E}\left[\sqrt{\frac{1}{2}h_i(Y_{<i})}\right]
\leq
\sqrt{\frac{1}{2}\mathbb{E}[h_i(Y_{<i})]}.
\end{align*}
Substituting this estimate into the sum gives
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
\leq
\sum_{i=1}^n a_i\sqrt{\frac{1}{2}\mathbb{E}[h_i(Y_{<i})]}.
\end{align*}
Now apply the finite-dimensional [Cauchy-Schwarz inequality](/theorems/432) in $\mathbb{R}^n$ to the two vectors $(a_i)_{i=1}^n$ and $(\sqrt{\mathbb{E}[h_i(Y_{<i})]/2})_{i=1}^n$. Because each $a_i\ge 0$, the weighted sum is bounded by
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
\leq
\sqrt{\sum_{i=1}^n a_i^2}
\sqrt{\frac{1}{2}\sum_{i=1}^n \mathbb{E}[h_i(Y_{<i})]}.
\end{align*}
This is the exact point where the Euclidean norm of the weight vector appears.
[/guided]
[/step]
[step:Identify the entropy sum by the chain rule and take the infimum]
Since $H(\nu\mid\mu)<\infty$, we have $\nu\ll\mu$. The product spaces are standard Borel, so the regular conditional distributions chosen above exist. The map $h_i:X_{<i}\to[0,\infty]$ is measurable because relative entropy is measurable as a function of a probability kernel on a standard Borel space, and finite total entropy makes $\sum_{i=1}^n\mathbb{E}[h_i(Y_{<i})]$ finite. Applying the [chain rule for relative entropy](/theorems/6731) to the regular conditional distributions of $\nu$ over the product reference measure $\mu_1\otimes\cdots\otimes\mu_n$ gives
\begin{align*}
H(\nu\mid\mu)
=
\sum_{i=1}^n
\mathbb{E}\left[
H(\nu_i(\,\cdot\mid Y_{<i})\mid\mu_i)
\right]
=
\sum_{i=1}^n \mathbb{E}[h_i(Y_{<i})].
\end{align*}
Substituting this identity into the previous estimate yields
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
\leq
\sqrt{\frac{1}{2}\left(\sum_{i=1}^n a_i^2\right)H(\nu\mid\mu)}.
\end{align*}
Recall that $W_1(\nu,\mu;d_a)$ denotes the infimum of $\mathbb{E}[d_a(\widetilde{Y},\widetilde{Z})]$ over all couplings $(\widetilde{Y},\widetilde{Z})$ of $\nu$ and $\mu$. Therefore the same upper bound holds for $W_1(\nu,\mu;d_a)$:
\begin{align*}
W_1(\nu,\mu;d_a)
\leq
\sqrt{\frac{1}{2}\left(\sum_{i=1}^n a_i^2\right)H(\nu\mid\mu)}.
\end{align*}
Taking $a_i=1$ for every $i\in\{1,\dots,n\}$ gives $\sum_{i=1}^n a_i^2=n$ and $d_a=d_H$, so
\begin{align*}
W_1(\nu,\mu;d_H)
\leq
\sqrt{\frac{n}{2}H(\nu\mid\mu)}.
\end{align*}
This proves both assertions.
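As an illustration of how the bound scales with $n$ (an example chosen here, not part of the proof), take $X_i=\{0,1\}$, $\mu=\mathrm{Bernoulli}(q)^{\otimes n}$ and $\nu=\mathrm{Bernoulli}(p)^{\otimes n}$ with $q\in(0,1)$, and write $H_1:=H(\mathrm{Bernoulli}(p)\mid\mathrm{Bernoulli}(q))$. Every coupling satisfies $\mathbb{P}(Y_i\neq Z_i)\geq|p-q|$ in each coordinate, and independent coordinatewise maximal couplings attain equality, so
\begin{align*}
W_1(\nu,\mu;d_H) = n\,|p-q|,
\qquad
H(\nu\mid\mu) = n\,H_1,
\qquad
\sqrt{\frac{n}{2}H(\nu\mid\mu)} = n\sqrt{\frac{1}{2}H_1}.
\end{align*}
In this product case the Hamming bound is exactly $n$ times the one-coordinate Pinsker inequality $|p-q|\leq\sqrt{H_1/2}$.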
[guided]
The remaining task is to replace the sum of coordinatewise conditional entropies by the single entropy $H(\nu\mid\mu)$. We may use the chain rule for relative entropy because its hypotheses have been verified: finite entropy gives $\nu\ll\mu$, the spaces are standard Borel, the regular conditional distributions given the previous coordinates exist by disintegration, and the conditional entropy functions $h_i$ are measurable and integrable with finite total sum. For the product reference measure $\mu=\mu_1\otimes\cdots\otimes\mu_n$, the chain rule gives
\begin{align*}
H(\nu\mid\mu)
=
\sum_{i=1}^n
\mathbb{E}\left[
H(\nu_i(\,\cdot\mid Y_{<i})\mid\mu_i)
\right].
\end{align*}
By the definition of $h_i$, this is exactly
\begin{align*}
H(\nu\mid\mu)
=
\sum_{i=1}^n \mathbb{E}[h_i(Y_{<i})].
\end{align*}
This identity is the entropy analogue of decomposing a product density into successive conditional densities; the product structure of $\mu$ is what makes the reference law in the $i$-th conditional entropy equal to $\mu_i$.
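For $n=2$ the identity can be checked by hand. The following sketch uses notation introduced here for illustration: $\nu_1$ is the law of the first coordinate under $\nu$, $f_1:=d\nu_1/d\mu_1$, and $f_2(\,\cdot\mid x_1):=d\nu_2(\,\cdot\mid x_1)/d\mu_2$, so that $\frac{d\nu}{d\mu}(x_1,x_2)=f_1(x_1)f_2(x_2\mid x_1)$. Assuming all integrals below are finite,
\begin{align*}
H(\nu\mid\mu)
&= \int \log\bigl(f_1(x_1)f_2(x_2\mid x_1)\bigr)\,\nu(dx_1,dx_2)\\
&= \int \log f_1\,d\nu_1
+\int\left(\int \log f_2(x_2\mid x_1)\,\nu_2(dx_2\mid x_1)\right)\nu_1(dx_1)\\
&= H(\nu_1\mid\mu_1)+\mathbb{E}\left[H(\nu_2(\,\cdot\mid Y_1)\mid\mu_2)\right].
\end{align*}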
Substituting the chain-rule identity into the estimate obtained after Cauchy-Schwarz gives
\begin{align*}
\mathbb{E}[d_a(Y,Z)]
\leq
\sqrt{\frac{1}{2}\left(\sum_{i=1}^n a_i^2\right)H(\nu\mid\mu)}.
\end{align*}
This is an estimate for one explicitly constructed coupling $(Y,Z)$ of $\nu$ and $\mu$. Since $W_1(\nu,\mu;d_a)$ is defined as the infimum of $\mathbb{E}[d_a(\widetilde{Y},\widetilde{Z})]$ over all couplings $(\widetilde{Y},\widetilde{Z})$ of $\nu$ and $\mu$, an upper bound for this particular coupling is also an upper bound for the infimum. Therefore
\begin{align*}
W_1(\nu,\mu;d_a)
\leq
\sqrt{\frac{1}{2}\left(\sum_{i=1}^n a_i^2\right)H(\nu\mid\mu)}.
\end{align*}
For the unweighted Hamming distance, set $a_i=1$ for every $i\in\{1,\dots,n\}$. Then $\sum_{i=1}^n a_i^2=n$ and $d_a=d_H$, hence
\begin{align*}
W_1(\nu,\mu;d_H)
\leq
\sqrt{\frac{n}{2}H(\nu\mid\mu)}.
\end{align*}
This proves both the weighted and unweighted assertions.
[/guided]
[/step]