Pearl's Rule Two — Statement & Proof

Pearl's Rule Two (Theorem # 9679)

Proof

[proofplan] We prove the rule by introducing an auxiliary post-$do(X=x)$ law in which the variables in $Z$ are still generated by their usual mechanisms, but every child-mechanism that would read $Z$ instead reads the fixed value $z$. This auxiliary law factorizes according to the comparison graph $G_{\overline X\underline Z}$. The [truncated factorization formula](/theorems/9677) identifies its conditional law of $Y$ given $Z=z,W=w$ with the kernel $P_x(Y\in B\mid Z=z,W=w)$, in which $Z$ is only observed, and identifies its conditional law of $Y$ given $W=w$ with the kernel $P_{x,z}(Y\in B\mid W=w)$, in which $Z$ is intervened on. The assumed $d$-separation then gives equality of those two auxiliary conditional laws. [/proofplan] [step:Declare the factorization kernels and the intervened laws] Choose a topological ordering of $G$ and write $V=\{V_1,\dots,V_n\}$ in that order. For each $i\in\{1,\dots,n\}$, let $\mathcal X_i$ be the state space of $V_i$, let $\mathcal B_i$ be its measurable structure, and let \begin{align*} \kappa_i: \prod_{V_j\in \operatorname{pa}_G(V_i)} \mathcal X_j \to \mathcal P(\mathcal X_i) \end{align*} be the [conditional probability](/page/Conditional%20Probability) kernel for $V_i$ given its parents in the causal model. Here $\operatorname{pa}_G(V_i)$ denotes the set of parents of $V_i$ in $G$, and $\mathcal P(\mathcal X_i)$ denotes the set of probability measures on $(\mathcal X_i,\mathcal B_i)$. For a set $A\subset V$ and a value $a$ of $A$, let $P_a$ denote the interventional law obtained by replacing every kernel for variables in $A$ by the corresponding point mass at $a$ and leaving all other kernels unchanged. This is exactly the law prescribed by the truncated factorization formula. In particular, $P_x$ and $P_{x,z}$ are the laws obtained from the interventions $do(X=x)$ and $do(X=x),do(Z=z)$, respectively. 
[/step] [step:Build the auxiliary law that deletes the outgoing arrows from $Z$] For each variable $V_i\in V\setminus X$, define a modified parent-evaluation map \begin{align*} \pi_i^{z}: \prod_{V_j\in V\setminus X} \mathcal X_j \to \prod_{V_j\in \operatorname{pa}_G(V_i)} \mathcal X_j \end{align*} as follows: if a parent $V_j$ lies in $X$, the $V_j$-coordinate of $\pi_i^z$ is the fixed value $x_j$; if a parent $V_j$ lies in $Z$, the $V_j$-coordinate of $\pi_i^z$ is the fixed value $z_j$; and if a parent $V_j$ lies in $V\setminus (X\cup Z)$, the $V_j$-coordinate is the corresponding coordinate of the input point. For $V_i\in Z$, define instead $\pi_i^{\mathrm{obs}}$ by fixing parents in $X$ at $x$ but leaving parents in $V\setminus X$ as coordinates of the input. Define an auxiliary probability law $Q_{x,z}$ on the variables $V\setminus X$ by the product of kernels in the chosen topological order: variables in $Z$ are generated using $\kappa_i\circ \pi_i^{\mathrm{obs}}$, and variables in $V\setminus (X\cup Z)$ are generated using $\kappa_i\circ \pi_i^z$. This construction is a valid Ionescu-Tulcea product because $V$ is finite and the kernels are regular conditional probability kernels. By construction, $Q_{x,z}$ factorizes with respect to the DAG $G_{\overline X\underline Z}$ after the variables in $X$ are fixed at $x$: incoming arrows into $X$ do not appear because $X$ is fixed, and outgoing arrows from $Z$ do not appear because the kernels for children of $Z$ use the constant value $z$ rather than the random coordinate $Z$. [guided] We need one probability law to which the graphical separation hypothesis can be applied. The law $P_x$ is Markov with respect to $G_{\overline X}$, while the law $P_{x,z}$ is Markov with respect to $G_{\overline X\overline Z}$. The graph in the hypothesis is neither of these: it is $G_{\overline X\underline Z}$, where arrows into $X$ are removed and arrows out of $Z$ are removed. 
So we explicitly construct a law whose factorization graph is exactly this comparison graph. For each variable $V_i$, the model supplies a causal kernel \begin{align*} \kappa_i: \prod_{V_j\in \operatorname{pa}_G(V_i)} \mathcal X_j \to \mathcal P(\mathcal X_i). \end{align*} The intervention $do(X=x)$ means that every occurrence of a parent in $X$ is evaluated at the fixed value $x$. To delete arrows out of $Z$ without deleting the mechanism that generates $Z$ itself, we do two different things. If $V_i\in Z$, we keep the ordinary post-$do(X=x)$ mechanism for $V_i$. If $V_i\notin X\cup Z$, then every parent of $V_i$ that lies in $Z$ is evaluated at the fixed value $z$. This gives the auxiliary law $Q_{x,z}$. It is a legitimate probability law because the variables are finite in number, the graph is acyclic, and the kernels can be multiplied in a topological order. Its factorization has no incoming arrows into $X$, because $X$ is fixed at $x$. It also has no outgoing arrows from $Z$, because no non-$Z$ kernel depends on the random coordinate of $Z$; such kernels see only the constant value $z$. Therefore the Markov factorization graph for $Q_{x,z}$ is precisely $G_{\overline X\underline Z}$, with $X$ regarded as fixed at $x$. [/guided] [/step] [step:Identify the auxiliary conditionals with the two causal kernels] Let $B$ be a measurable subset of the state space of $Y$. Since the relevant conditional laws are assumed to exist, define \begin{align*} L_{\mathrm{obs}}: \mathcal B_Y \times \mathcal X_Z \times \mathcal X_W \to [0,1] \end{align*} by \begin{align*} L_{\mathrm{obs}}(B,z,w)=P_x(Y\in B\mid Z=z,W=w), \end{align*} and define \begin{align*} L_{\mathrm{int}}: \mathcal B_Y \times \mathcal X_Z \times \mathcal X_W \to [0,1] \end{align*} by \begin{align*} L_{\mathrm{int}}(B,z,w)=P_{x,z}(Y\in B\mid W=w). 
\end{align*} Here $\mathcal B_Y$ is the measurable structure on the state space of $Y$, and $\mathcal X_Z$ and $\mathcal X_W$ are the product state spaces of $Z$ and $W$. First condition $Q_{x,z}$ on $Z=z$ and $W=w$. Under this conditioning, every kernel for variables in $Z$ contributes only the likelihood of the already fixed value $z$. Every kernel outside $X\cup Z$ is evaluated with the same parent values as in $P_x$ conditioned on $Z=z,W=w$: parents in $X$ are fixed at $x$, parents in $Z$ are fixed at $z$, and all remaining parents retain their coordinates. Thus \begin{align*} Q_{x,z}(Y\in B\mid Z=z,W=w)=P_x(Y\in B\mid Z=z,W=w). \end{align*} Next condition $Q_{x,z}$ only on $W=w$ and marginalize over $Z$. The non-$Z$ kernels in $Q_{x,z}$ are exactly the kernels appearing in the truncated factorization for $P_{x,z}$, because they use $X=x$ and $Z=z$ as fixed inputs. The remaining kernels generating $Z$ are upstream of no non-$Z$ variable in the auxiliary factorization, so integrating over the random coordinates of $Z$ contributes total mass one and does not change the conditional law of $Y$ given $W=w$. Therefore \begin{align*} Q_{x,z}(Y\in B\mid W=w)=P_{x,z}(Y\in B\mid W=w). \end{align*} [/step] [step:Apply the Markov property in the comparison graph] The standard global Markov theorem for finite Bayesian networks says that if a probability law factorizes according to a finite DAG, then every $d$-separation in that DAG implies the corresponding conditional independence for the law. Its hypotheses hold for $Q_{x,z}$ because the preceding construction gave a finite product factorization over the DAG $G_{\overline X\underline Z}$. By assumption, \begin{align*} Y \perp_G Z \mid X,W \quad \text{in } G_{\overline X\underline Z}. \end{align*} Since $X$ is fixed at the value $x$ throughout $Q_{x,z}$, conditioning on $X=x$ is already built into the sample space of the auxiliary law. 
Hence the same separation gives the conditional independence of the random coordinates $Y$ and $Z$ given $W$ under $Q_{x,z}$. Therefore, for every measurable $B$, \begin{align*} Q_{x,z}(Y\in B\mid Z=z,W=w)=Q_{x,z}(Y\in B\mid W=w). \end{align*} [/step] [step:Combine the two identifications to obtain Rule Two] Chaining the two kernel identifications with the conditional independence under $Q_{x,z}$ gives, for every measurable $B$, \begin{align*} P_x(Y\in B\mid Z=z,W=w) &= Q_{x,z}(Y\in B\mid Z=z,W=w) \\ &= Q_{x,z}(Y\in B\mid W=w) \\ &= P_{x,z}(Y\in B\mid W=w). \end{align*} The first and third equalities are the two identifications of the auxiliary conditionals, and the second is the conditional independence. Thus \begin{align*} P_{x,z}(Y\in B\mid W=w)=P_x(Y\in B\mid Z=z,W=w). \end{align*} For discrete variables, taking $B=\{y\}$ gives the point-probability form. If the conditional laws admit densities with respect to a common reference measure, equality of the conditional probability kernels gives equality of the corresponding density representatives at every value where those representatives are defined. This is the displayed identity \begin{align*} P_{x,z}(y\mid w)=P_x(y\mid z,w). \end{align*} [/step]
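The chain of equalities in the proof can be checked numerically on a small discrete model. The following is a minimal sketch, not part of the original proof. It assumes a hypothetical DAG with binary variables and edges $W\to Z$, $W\to Y$, $X\to Z$, $X\to Y$, $Z\to M$, $M\to Y$, for which $Y$ and $Z$ are $d$-separated given $X,W$ in $G_{\overline X\underline Z}$. The kernel tables are random, and the names `random_kernel`, `law`, `cond_y`, and `check_rule_two` are illustrative only. The sketch builds $P_x$, $P_{x,z}$, and the auxiliary law $Q_{x,z}$ by brute-force enumeration and compares the four conditional probabilities that appear in the proof.

```python
import itertools
import random

random.seed(0)
VALS = (0, 1)

def random_kernel(n_parents):
    """Return a table: parent assignment -> (P(value=0), P(value=1))."""
    table = {}
    for parents in itertools.product(VALS, repeat=n_parents):
        p1 = random.uniform(0.1, 0.9)  # keep every outcome possible
        table[parents] = (1.0 - p1, p1)
    return table

# Hypothetical DAG (an assumption of this sketch):
# W->Z, W->Y, X->Z, X->Y, Z->M, M->Y.
# The kernel of X is omitted because do(X=x) discards it.
k_W = random_kernel(0)  # P(W)
k_Z = random_kernel(2)  # P(Z | X, W)
k_M = random_kernel(1)  # P(M | Z)
k_Y = random_kernel(3)  # P(Y | X, M, W)

def law(x, mode, z=None):
    """Joint law of (W, Z, M, Y) after do(X=x).

    mode "P_x":  no further change.
    mode "P_xz": the kernel of Z is replaced by a point mass at z.
    mode "Q_xz": Z keeps its kernel, but its child M reads the constant z.
    """
    out = {}
    for w, zc, m, y in itertools.product(VALS, repeat=4):
        if mode == "P_xz":
            p_z = 1.0 if zc == z else 0.0
        else:
            p_z = k_Z[(x, w)][zc]
        z_read = zc if mode == "P_x" else z
        out[(w, zc, m, y)] = (
            k_W[()][w] * p_z * k_M[(z_read,)][m] * k_Y[(x, m, w)][y]
        )
    return out

def cond_y(joint, y, w, z=None):
    """P(Y=y | W=w) or, if z is given, P(Y=y | Z=z, W=w)."""
    def matches(key):
        return key[0] == w and (z is None or key[1] == z)
    den = sum(p for key, p in joint.items() if matches(key))
    num = sum(p for key, p in joint.items() if matches(key) and key[3] == y)
    return num / den

def check_rule_two():
    """Largest discrepancy across the three equalities of the proof."""
    worst = 0.0
    for x, z, w, y in itertools.product(VALS, repeat=4):
        p_x, p_xz, q = law(x, "P_x"), law(x, "P_xz", z), law(x, "Q_xz", z)
        a = cond_y(p_x, y, w, z)  # P_x(y | z, w)
        b = cond_y(q, y, w, z)    # Q_{x,z}(y | z, w)
        c = cond_y(q, y, w)       # Q_{x,z}(y | w)
        d = cond_y(p_xz, y, w)    # P_{x,z}(y | w)
        worst = max(worst, abs(a - b), abs(b - c), abs(c - d))
    return worst

max_gap = check_rule_two()
print(max_gap < 1e-9)
```

In this sketch the three laws differ only in how $Z$ is generated and in which value its child $M$ reads, which mirrors the distinction between $\pi_i^{\mathrm{obs}}$ and $\pi_i^{z}$ in the construction. Any gap between the four quantities should be at the level of floating-point rounding. Adding a hidden common parent of $Z$ and $Y$ would break the $d$-separation, and the equality would then generally fail.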

Prerequisites

Theorems

Truncated Factorization Formula

Definitions & Concepts

Event
