[proofplan]
We prove the rule by introducing an auxiliary post-$do(X=x)$ law in which the variables in $Z$ are still generated by their usual mechanisms, but every child-mechanism that would read $Z$ instead reads the fixed value $z$. This auxiliary law factorizes according to the comparison graph $G_{\overline X\underline Z}$. The [truncated factorization formula](/theorems/9677) identifies its conditional law of $Y$ given $Z=z,W=w$ with the observational kernel $P_x(Y\in B\mid Z=z,W=w)$, and identifies its conditional law of $Y$ given $W=w$ with the intervention kernel $P_{x,z}(Y\in B\mid W=w)$. The assumed $d$-separation then gives equality of those two auxiliary conditional laws.
[/proofplan]
[step:Declare the factorization kernels and the intervened laws]
Choose a topological ordering of $G$ and write $V=\{V_1,\dots,V_n\}$ in that order. For each $i\in\{1,\dots,n\}$, let $\mathcal X_i$ be the state space of $V_i$, let $\mathcal B_i$ be its measurable structure, and let
\begin{align*}
\kappa_i: \prod_{V_j\in \operatorname{pa}_G(V_i)} \mathcal X_j \to \mathcal P(\mathcal X_i)
\end{align*}
be the [conditional probability](/page/Conditional%20Probability) kernel for $V_i$ given its parents in the causal model. Here $\operatorname{pa}_G(V_i)$ denotes the set of parents of $V_i$ in $G$, and $\mathcal P(\mathcal X_i)$ denotes the set of probability measures on $(\mathcal X_i,\mathcal B_i)$.
For a set $A\subset V$ and a value $a$ of $A$, let $P_a$ denote the interventional law obtained by replacing every kernel for variables in $A$ by the corresponding point mass at $a$ and leaving all other kernels unchanged. This is exactly the truncated factorization formula. In particular, $P_x$ and $P_{x,z}$ are the laws obtained from the interventions $do(X=x)$ and $do(X=x),do(Z=z)$.
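For concreteness, the definition of $P_a$ can be written out as a point-probability formula. The display below is a sketch that assumes all state spaces are discrete (an assumption made only for this illustration); the notation $v=(v_1,\dots,v_n)$ for a joint value and $v_{\operatorname{pa}(i)}$ for its restriction to $\operatorname{pa}_G(V_i)$ is introduced here and is not part of the original statement.
\begin{align*}
P_a(V=v)=\prod_{V_i\in A}\mathbf 1\{v_i=a_i\}\;\prod_{V_i\in V\setminus A}\kappa_i\bigl(v_{\operatorname{pa}(i)}\bigr)(\{v_i\}).
\end{align*}
Taking $A=X$ with value $x$ gives $P_x$, and taking $A=X\cup Z$ with value $(x,z)$ gives $P_{x,z}$.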
[/step]
[step:Build the auxiliary law that deletes the outgoing arrows from $Z$]
For each variable $V_i\in V\setminus (X\cup Z)$, define a modified parent-evaluation map
\begin{align*}
\pi_i^{z}: \prod_{V_j\in V\setminus X} \mathcal X_j \to \prod_{V_j\in \operatorname{pa}_G(V_i)} \mathcal X_j
\end{align*}
as follows: if a parent $V_j$ lies in $X$, the $V_j$-coordinate of $\pi_i^z$ is the fixed value $x_j$; if a parent $V_j$ lies in $Z$, the $V_j$-coordinate of $\pi_i^z$ is the fixed value $z_j$; and if a parent $V_j$ lies in $V\setminus (X\cup Z)$, the $V_j$-coordinate of $\pi_i^z$ is the corresponding coordinate of the input point. For $V_i\in Z$, define instead a map $\pi_i^{\mathrm{obs}}$ with the same domain and codomain by fixing parents in $X$ at $x$ but reading every parent in $V\setminus X$, including parents in $Z$, from the coordinates of the input point.
Define an auxiliary probability law $Q_{x,z}$ on the variables $V\setminus X$ by the product of kernels in the chosen topological order: variables in $Z$ are generated using $\kappa_i\circ \pi_i^{\mathrm{obs}}$, and variables in $V\setminus (X\cup Z)$ are generated using $\kappa_i\circ \pi_i^z$. This construction is a valid Ionescu-Tulcea product because $V$ is finite and the kernels are regular conditional probability kernels.
By construction, $Q_{x,z}$ factorizes with respect to the DAG $G_{\overline X\underline Z}$ after the variables in $X$ are fixed at $x$: incoming arrows into $X$ do not appear because $X$ is fixed, and outgoing arrows from $Z$ do not appear because the kernels for children of $Z$ use the constant value $z$ rather than the random coordinate $Z$.
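As an illustration, the product defining $Q_{x,z}$ can be written explicitly when all state spaces are discrete (an assumption made only for this sketch). Here $v$ denotes a joint value in $\prod_{V_j\in V\setminus X}\mathcal X_j$; its $Z$-coordinates are free arguments and should not be confused with the fixed value $z$.
\begin{align*}
Q_{x,z}(V\setminus X=v)=\prod_{V_i\in Z}\kappa_i\bigl(\pi_i^{\mathrm{obs}}(v)\bigr)(\{v_i\})\;\prod_{V_i\in V\setminus (X\cup Z)}\kappa_i\bigl(\pi_i^{z}(v)\bigr)(\{v_i\}).
\end{align*}
The second product does not depend on the $Z$-coordinates of $v$, which is the algebraic counterpart of deleting the arrows out of $Z$.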
[guided]
We need one probability law to which the graphical separation hypothesis can be applied. The law $P_x$ is Markov with respect to $G_{\overline X}$, while the law $P_{x,z}$ is Markov with respect to $G_{\overline X\overline Z}$. The graph in the hypothesis is neither of these: it is $G_{\overline X\underline Z}$, where arrows into $X$ are removed and arrows out of $Z$ are removed. So we explicitly construct a law whose factorization graph is exactly this comparison graph.
For each variable $V_i$, the model supplies a causal kernel
\begin{align*}
\kappa_i: \prod_{V_j\in \operatorname{pa}_G(V_i)} \mathcal X_j \to \mathcal P(\mathcal X_i).
\end{align*}
The intervention $do(X=x)$ means that every occurrence of a parent in $X$ is evaluated at the fixed value $x$. To delete arrows out of $Z$ without deleting the mechanism that generates $Z$ itself, we do two different things. If $V_i\in Z$, we keep the ordinary post-$do(X=x)$ mechanism for $V_i$. If $V_i\notin X\cup Z$, then every parent of $V_i$ that lies in $Z$ is evaluated at the fixed value $z$.
This gives the auxiliary law $Q_{x,z}$. It is a legitimate probability law because the variables are finite in number, the graph is acyclic, and the kernels can be multiplied in a topological order. Its factorization has no incoming arrows into $X$, because $X$ is fixed at $x$. It also has no outgoing arrows from $Z$, because no non-$Z$ kernel depends on the random coordinate of $Z$; such kernels see only the constant value $z$. Therefore the Markov factorization graph for $Q_{x,z}$ is precisely $G_{\overline X\underline Z}$, with $X$ regarded as fixed at $x$.
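A small worked example may help. The model below is hypothetical and chosen only for illustration: take $X=\varnothing$, $W=\{U\}$, single variables $Z$ and $Y$, discrete state spaces, and the graph $U\to Z$, $U\to Y$, $Z\to Y$. Write $z'$ for the random coordinate of $Z$ and $z$ for the fixed value, and write $\kappa_U(\{u\})$ for the kernel of the parentless variable $U$. Then
\begin{align*}
Q_{z}(U=u,Z=z',Y=y)=\kappa_U(\{u\})\,\kappa_Z(u)(\{z'\})\,\kappa_Y(u,z)(\{y\}).
\end{align*}
The factor for $Y$ involves $z$ but not $z'$, so the only arrows in this factorization are $U\to Z$ and $U\to Y$, which is exactly $G_{\underline Z}$. Whenever the conditioning events have positive probability, both conditionals of $Y$ reduce to the same kernel value:
\begin{align*}
Q_{z}(Y=y\mid Z=z',U=u)=\kappa_Y(u,z)(\{y\})=Q_{z}(Y=y\mid U=u).
\end{align*}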
[/guided]
[/step]
[step:Identify the auxiliary conditionals with the two causal kernels]
Let $B$ be a measurable subset of the state space of $Y$. Since the relevant conditional laws are assumed to exist, define
\begin{align*}
L_{\mathrm{obs}}: \mathcal B_Y \times \mathcal X_Z \times \mathcal X_W \to [0,1]
\end{align*}
by
\begin{align*}
L_{\mathrm{obs}}(B,z,w)=P_x(Y\in B\mid Z=z,W=w),
\end{align*}
and define
\begin{align*}
L_{\mathrm{int}}: \mathcal B_Y \times \mathcal X_Z \times \mathcal X_W \to [0,1]
\end{align*}
by
\begin{align*}
L_{\mathrm{int}}(B,z,w)=P_{x,z}(Y\in B\mid W=w).
\end{align*}
Here $\mathcal B_Y$ is the measurable structure on the state space of $Y$, and $\mathcal X_Z$ and $\mathcal X_W$ are the product state spaces of $Z$ and $W$.
First condition $Q_{x,z}$ on $Z=z$ and $W=w$. Under this conditioning, every kernel for variables in $Z$ contributes only the likelihood of the already fixed value $z$. Every kernel outside $X\cup Z$ is evaluated with the same parent values as in $P_x$ conditioned on $Z=z,W=w$: parents in $X$ are fixed at $x$, parents in $Z$ are fixed at $z$, and all remaining parents retain their coordinates. Thus
\begin{align*}
Q_{x,z}(Y\in B\mid Z=z,W=w)=P_x(Y\in B\mid Z=z,W=w).
\end{align*}
Next condition $Q_{x,z}$ only on $W=w$ and marginalize over $Z$. The non-$Z$ kernels in $Q_{x,z}$ are exactly the kernels appearing in the truncated factorization for $P_{x,z}$, because they use $X=x$ and $Z=z$ as fixed inputs. The remaining kernels generating $Z$ are upstream of no non-$Z$ variable in the auxiliary factorization, so integrating over the random coordinates of $Z$ contributes total mass one and does not change the conditional law of $Y$ given $W=w$. Therefore
\begin{align*}
Q_{x,z}(Y\in B\mid W=w)=P_{x,z}(Y\in B\mid W=w).
\end{align*}
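In the discrete case, both identifications can be read off from the product of kernels. The following is a sketch under the assumptions that all state spaces are discrete and that the conditioning events have positive probability; $v$ denotes a joint value of $V\setminus X$, $v_Z$ its $Z$-coordinates, and $v_{\setminus Z}$ its remaining coordinates (notation introduced here).
\begin{align*}
Q_{x,z}(V\setminus X=v)&=P_x(V\setminus X=v) \quad\text{whenever } v_Z=z,\\
\sum_{v_Z}Q_{x,z}(V\setminus X=v)&=\prod_{V_i\in V\setminus (X\cup Z)}\kappa_i\bigl(\pi_i^{z}(v)\bigr)(\{v_i\})=P_{x,z}\bigl(V\setminus (X\cup Z)=v_{\setminus Z}\bigr).
\end{align*}
The first line holds because, on the slice $v_Z=z$, every kernel reads the same parent values under $Q_{x,z}$ as under $P_x$. The second line holds because the non-$Z$ kernels do not depend on $v_Z$, while the $Z$-kernels sum to one when the $Z$-coordinates are summed out in reverse topological order. Summing over the coordinates outside $Y\cup Z\cup W$ and dividing by the corresponding marginals gives the two displayed conditional identities.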
[/step]
[step:Apply the Markov property in the comparison graph]
The standard global Markov theorem for finite Bayesian networks says that if a probability law factorizes according to a finite DAG, then every $d$-separation in that DAG implies the corresponding conditional independence for the law. Its hypotheses hold for $Q_{x,z}$ because the preceding construction gave a finite product factorization over the DAG $G_{\overline X\underline Z}$.
By assumption,
\begin{align*}
Y \perp_G Z \mid X,W \quad \text{in } G_{\overline X\underline Z}.
\end{align*}
Since $X$ is fixed at the value $x$ throughout $Q_{x,z}$, conditioning on $X=x$ is already built into the sample space of the auxiliary law. Hence the same separation gives the conditional independence of the random coordinates $Y$ and $Z$ given $W$ under $Q_{x,z}$. Therefore, for every measurable $B$,
\begin{align*}
Q_{x,z}(Y\in B\mid Z=z,W=w)=Q_{x,z}(Y\in B\mid W=w).
\end{align*}
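For discrete variables with $Q_{x,z}(Z=z\mid W=w)>0$ (an assumption made only for this illustration), the last step can be spelled out. Conditional independence of $Y$ and $Z$ given $W$ under $Q_{x,z}$ is the product rule
\begin{align*}
Q_{x,z}(Y\in B,Z=z\mid W=w)=Q_{x,z}(Y\in B\mid W=w)\,Q_{x,z}(Z=z\mid W=w),
\end{align*}
and dividing both sides by $Q_{x,z}(Z=z\mid W=w)$ gives the displayed equality of conditional laws.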
[/step]
[step:Combine the two identifications to obtain Rule Two]
The first kernel identification gives
\begin{align*}
P_x(Y\in B\mid Z=z,W=w)=Q_{x,z}(Y\in B\mid Z=z,W=w).
\end{align*}
The conditional independence under $Q_{x,z}$ gives
\begin{align*}
Q_{x,z}(Y\in B\mid Z=z,W=w)=Q_{x,z}(Y\in B\mid W=w).
\end{align*}
Finally, the second kernel identification gives
\begin{align*}
Q_{x,z}(Y\in B\mid W=w)=P_{x,z}(Y\in B\mid W=w).
\end{align*}
Chaining these three equalities yields
\begin{align*}
P_{x,z}(Y\in B\mid W=w)=P_x(Y\in B\mid Z=z,W=w).
\end{align*}
For discrete variables, taking $B=\{y\}$ gives the point-probability form. If the conditional laws admit densities with respect to a common reference measure, equality of the conditional probability kernels gives equality of the corresponding densities almost everywhere with respect to that reference measure. In either case, this is the identity
\begin{align*}
P_{x,z}(y\mid w)=P_x(y\mid z,w).
\end{align*}
[/step]