Causal Inference I: Foundations introduces the core ideas behind reasoning about cause and effect from observational data, experiments, and structural assumptions. The course asks what it means to intervene on a system, how causal questions differ from purely associational ones, and when a causal effect can be learned at all. It is set in the language of probability, but its emphasis is on interpretation: moving from describing how variables co-vary to understanding how outcomes would change under hypothetical actions.
The main themes are intervention, identification, graphical structure, and assumptions. The early chapters establish causal questions, interventional probability, potential outcomes, and structural causal models as complementary frameworks. From there, directed acyclic graphs, Markov factorisation, and d-separation provide a precise way to encode and test conditional independence structure. The middle chapters develop the main identification tools: adjustment via the back-door criterion, mediation and the front-door criterion, do-calculus, and the ID algorithm for deciding whether an effect is identifiable. Later chapters turn to instrumental variables and the foundations of causal discovery, showing how causal conclusions can sometimes be extracted even with hidden confounding.
The chapters are arranged to build a complete workflow for causal analysis. The course begins with conceptual language, then adds formal models, then derives identification criteria and algorithms, and finally addresses more advanced settings where standard adjustment fails. It ends with synthesis: how to assemble a causal analysis, check whether its assumptions are defensible, and audit the result for hidden gaps or nonidentifiability.
# Introduction
This opening chapter fixes the scope and language of the course. Causal inference begins from a mismatch: probability theory describes distributions of observed random variables, while scientific questions often ask what would happen under an intervention that was not observed. The course develops mathematical conditions under which those interventional and counterfactual quantities can be expressed using observable probability laws.
The main objects of the course are potential outcomes, structural causal models, directed acyclic graphs, and identification formulas. These notes assume familiarity with measure-theoretic probability, [conditional expectation](/page/Conditional%20Expectation), elementary statistical inference, linear regression, and basic graph theory. The purpose of the introduction is to separate causal questions from associational summaries and to explain why extra structure is needed before data can answer causal questions.
## Why Association Is Not Causation
A first obstacle is that conditional distributions are symmetric descriptions of a joint law, while causal statements are asymmetric claims about changing a system. If $X$ and $Y$ are random variables on a [probability space](/page/Probability%20Space) $(\Omega, \mathcal F, \mathbb P)$, the conditional law of $Y$ given $X=x$ describes the units for which $X$ was observed to equal $x$. It does not, by itself, describe the law of $Y$ after forcing $X$ to equal $x$.
[example: Treatment And Recovery]
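Let $X\in\{0,1\}$ indicate treatment and $Y\in\{0,1\}$ indicate recovery, and let $Z=0$ mean mild illness and $Z=1$ mean severe illness. Suppose severe illness is much more common among treated patients than among untreated patients: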
\begin{align*}
\mathbb P(Z=1\mid X=1)=0.90,\qquad \mathbb P(Z=1\mid X=0)=0.10.
\end{align*}
Within each severity stratum, treatment improves recovery:
\begin{align*}
\mathbb P(Y=1\mid X=1,Z=0)=0.90>0.80=\mathbb P(Y=1\mid X=0,Z=0).
\end{align*}
\begin{align*}
\mathbb P(Y=1\mid X=1,Z=1)=0.30>0.20=\mathbb P(Y=1\mid X=0,Z=1).
\end{align*}
Now compute the marginal recovery probabilities by partitioning on $Z$. For treated patients,
\begin{align*}
\mathbb P(Y=1\mid X=1)=\mathbb P(Y=1\mid X=1,Z=0)\mathbb P(Z=0\mid X=1)+\mathbb P(Y=1\mid X=1,Z=1)\mathbb P(Z=1\mid X=1).
\end{align*}
\begin{align*}
\mathbb P(Y=1\mid X=1)=(0.90)(0.10)+(0.30)(0.90)=0.09+0.27=0.36.
\end{align*}
For untreated patients,
\begin{align*}
\mathbb P(Y=1\mid X=0)=\mathbb P(Y=1\mid X=0,Z=0)\mathbb P(Z=0\mid X=0)+\mathbb P(Y=1\mid X=0,Z=1)\mathbb P(Z=1\mid X=0).
\end{align*}
\begin{align*}
\mathbb P(Y=1\mid X=0)=(0.80)(0.90)+(0.20)(0.10)=0.72+0.02=0.74.
\end{align*}
Thus $\mathbb P(Y=1\mid X=1)=0.36<0.74=\mathbb P(Y=1\mid X=0)$, even though treatment raises recovery probability in both severity groups. The marginal recovery probability is lower for treated patients because it combines treatment response with the fact that treated patients are much more likely to be severely ill.
[/example]
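The arithmetic in this example can be checked mechanically. The following Python sketch applies the law of total probability to the conditional probabilities stated above; the dictionary layout and the helper name `marginal_recovery` are illustrative choices, not notation from these notes.

```python
from fractions import Fraction as F

# P(Y=1 | X=x, Z=z) from the example, keyed by (x, z).
recovery = {
    (1, 0): F(90, 100), (0, 0): F(80, 100),
    (1, 1): F(30, 100), (0, 1): F(20, 100),
}
# P(Z=1 | X=x) from the example, keyed by x.
severe_given_x = {1: F(90, 100), 0: F(10, 100)}


def marginal_recovery(x):
    """P(Y=1 | X=x), obtained by partitioning on the severity Z."""
    p_severe = severe_given_x[x]
    return recovery[(x, 0)] * (1 - p_severe) + recovery[(x, 1)] * p_severe


p_treated = marginal_recovery(1)
p_untreated = marginal_recovery(0)
print(float(p_treated), float(p_untreated))
```

Exact fractions are used so that the comparison with $0.36$ and $0.74$ does not depend on floating-point rounding.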
This example introduces the central problem of confounding: the observed treatment group and control group may differ in ways that also affect the outcome. The comparison $\mathbb P(Y=1\mid X=1)-\mathbb P(Y=1\mid X=0)$ is still a well-defined feature of the joint law of $(X,Y)$, but its interpretation is associational rather than causal. Before introducing interventions, we first name the ordinary probabilistic questions that are already determined by the observational distribution, so that the later causal notions are not mistaken for conditional-probability notation in disguise.
[definition: Associational Query]
Let $W:(\Omega,\mathcal F)\to(\mathcal W,\mathcal E_{\mathcal W})$ denote the observed variables and let $P_W=\mathbb P\circ W^{-1}$ be their observational law. An associational query is a question whose answer is a value determined by $P_W$.
[/definition]
For instance, when the relevant conditional quantities are well-defined, $\mathbb P(Y \in A \mid X=x)$ and $\mathbb E[Y \mid X=x]$ are associational queries because they are features of the observational joint law. Associational queries are legitimate probabilistic objects, and many statistical procedures estimate them accurately. The next question is how to formalise a different operation: setting a variable externally rather than restricting attention to units where that variable already took a value.
[definition: Interventional Query]
Let $\mathcal I$ be a set of interventions, let $W_i:(\Omega_i,\mathcal F_i)\to(\mathcal W,\mathcal E_{\mathcal W})$ denote the variables observed under intervention $i\in\mathcal I$, and let $P_i=\mathbb P_i\circ W_i^{-1}$ be the corresponding intervention law on $(\mathcal W,\mathcal E_{\mathcal W})$. An interventional query is a question whose answer is a value determined by one of these intervention laws.
[/definition]
For example, $\mathbb P(Y \in A \mid \operatorname{do}(X=x))$ denotes a probability under the intervention law generated by setting $X$ to $x$. The notation $\operatorname{do}(X=x)$ is not an event in the original probability space; it denotes a modified data-generating regime. This raises an even finer question: how to discuss the outcome a particular unit would have had under an intervention that may differ from the one observed.
[definition: Counterfactual Query]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, let $\mathcal X$ be the intervention state space for a variable $X$, and let $(\mathcal Y,\mathcal G)$ be the outcome measurable space. A counterfactual query is a question about a [random variable](/page/Random%20Variable) $Y_x:(\Omega,\mathcal F)\to(\mathcal Y,\mathcal G)$ indexed by an intervention value $x\in\mathcal X$.
[/definition]
Counterfactual queries are finer than interventional distributional queries because they can relate several incompatible interventions for the same unit. Later chapters will make this precise using potential outcomes and structural causal models.
## The Three Languages of the Course
A causal question can fail in several different ways: the target may be poorly defined, the intervention may be ambiguous, or the assumptions needed for identification may be invisible in the notation. No single formal language handles all three tasks gracefully. The course therefore moves between potential outcomes, structural causal models, and directed acyclic graphs, using each language where it exposes the relevant obstruction most directly.
The first language is designed to define causal contrasts at the level of units. It asks what outcome the same unit would have under different treatment levels, even though only one of those outcomes can be observed in the factual data.
[definition: Potential Outcome]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $A:(\Omega,\mathcal F)\to(\mathcal A,\mathcal E_{\mathcal A})$ be a treatment variable, and let $(\mathcal Y,\mathcal E_{\mathcal Y})$ be an outcome measurable space. For $a\in\mathcal A$, the potential outcome under treatment level $a$ is a random variable $Y(a):(\Omega,\mathcal F)\to(\mathcal Y,\mathcal E_{\mathcal Y})$.
[/definition]
Potential outcomes allow causal contrasts to be stated as ordinary expectations of random variables that may not be jointly observed. This is already a conceptual shift: the target is no longer a conditional mean among treated or untreated units, but a comparison between two intervention-indexed variables defined on the same underlying population. The most common contrast asks for the average population-level change when a binary treatment is set from $0$ to $1$.
[definition: Average Treatment Effect]
For a binary treatment $A \in \{0,1\}$ and real-valued potential outcomes $Y(1)$ and $Y(0)$ with finite expectations, the average treatment effect is
\begin{align*}
\operatorname{ATE} := \mathbb E[Y(1)-Y(0)].
\end{align*}
[/definition]
The average treatment effect is a parameter of the causal model, not automatically a parameter of the observed distribution of $(A,Y)$. A probability law for observed variables says what co-occurs, but it does not by itself say what would remain fixed and what would be replaced if treatment were externally set.
The next modeling step is therefore to make interventions part of the mathematical object itself. A structural causal model does this by specifying, for each variable, a local assignment from its direct causes and background noise; an intervention can then be represented by replacing one of these assignments while leaving the others fixed.
[definition: Structural Causal Model]
A structural causal model consists of a finite directed acyclic graph $G=(V,E)$, measurable state spaces $(\mathcal X_i,\mathcal E_i)$ for endogenous variables, measurable state spaces $(\mathcal U_i,\mathcal H_i)$ for exogenous variables, a jointly distributed exogenous vector $(U_i)_{i\in V}$ with $U_i:(\Omega,\mathcal F)\to(\mathcal U_i,\mathcal H_i)$, and measurable structural maps
\begin{align*}
f_i:\prod_{j\in\operatorname{pa}(i)}\mathcal X_j\times\mathcal U_i\to\mathcal X_i, \qquad i\in V,
\end{align*}
where $\operatorname{pa}(i)=\{j\in V:(j,i)\in E\}$. The endogenous variables $(X_i)_{i\in V}$ are defined recursively in a topological ordering of $G$ by
\begin{align*}
X_i := f_i\bigl((X_j)_{j\in\operatorname{pa}(i)},U_i\bigr).
\end{align*}
[/definition]
The structural assignments specify how the system responds when an assignment is replaced by an intervention. Since large systems are difficult to read from equations alone, we need a graphical representation of which variables enter which assignments.
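A small computational sketch may help fix the idea that an intervention replaces one structural assignment and leaves the rest alone. The model below is hypothetical: the graph $Z\to X$, $Z\to Y$, $X\to Y$, the binary noise probabilities, and the particular maps $f_X$ and $f_Y$ are all assumptions chosen for illustration. Because every variable is binary, the induced laws can be computed exactly by enumerating the noise.

```python
from fractions import Fraction
from itertools import product

# Hypothetical model on the DAG Z -> X, Z -> Y, X -> Y with independent
# binary noise; the probabilities P(U_i = 1) are illustrative assumptions.
P_NOISE_ONE = {"Z": Fraction(1, 2), "X": Fraction(1, 4), "Y": Fraction(1, 4)}


def f_x_observed(z, u_x):
    """Natural assignment for X: follow severity Z unless the noise flips it."""
    return z ^ u_x


def solve(u, f_x):
    """Evaluate Z, X, Y in topological order from the noise values u."""
    z = u["Z"]                            # Z := U_Z
    x = f_x(z, u["X"])                    # X := f_X(Z, U_X)
    y = (1 - z) if u["Y"] == 0 else x     # Y := f_Y(X, Z, U_Y)
    return z, x, y


def joint_law(f_x):
    """Exact law of (Z, X, Y) induced by the noise law and the map f_x."""
    law = {}
    for bits in product((0, 1), repeat=3):
        u = dict(zip(("Z", "X", "Y"), bits))
        weight = Fraction(1)
        for name, bit in u.items():
            p_one = P_NOISE_ONE[name]
            weight *= p_one if bit == 1 else 1 - p_one
        outcome = solve(u, f_x)
        law[outcome] = law.get(outcome, Fraction(0)) + weight
    return law


def prob_y1_given_x(law, x):
    """P(Y=1 | X=x) under the supplied law."""
    p_x = sum(p for (_, xv, _), p in law.items() if xv == x)
    p_xy = sum(p for (_, xv, yv), p in law.items() if xv == x and yv == 1)
    return p_xy / p_x


observational = joint_law(f_x_observed)
p_cond_treated = prob_y1_given_x(observational, 1)
p_cond_untreated = prob_y1_given_x(observational, 0)

# do(X = x): replace f_X by a constant map; f_Z, f_Y and the noise are untouched.
p_do_treated = prob_y1_given_x(joint_law(lambda z, u_x: 1), 1)
p_do_untreated = prob_y1_given_x(joint_law(lambda z, u_x: 0), 0)

print(p_cond_treated, p_cond_untreated)  # conditioning on X
print(p_do_treated, p_do_untreated)      # intervening on X
```

In this toy model, conditioning gives $7/16$ for treated units against $9/16$ for untreated units, while intervening gives $5/8$ against $3/8$. The sign of the contrast flips because $Z$ influences both the natural assignment of $X$ and the outcome.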
[definition: Directed Acyclic Graph]
A directed acyclic graph is a finite directed graph $G=(V,E)$ with no directed cycle. If $(i,j) \in E$, then $i$ is called a parent of $j$ and $j$ is called a child of $i$.
[/definition]
Graphs provide a compact language for assumptions. In later chapters, paths, colliders, and d-separation will turn graphical structure into conditional independence statements and identification criteria.
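As a concrete illustration of the definition, the sketch below stores a hypothetical four-node graph as an edge list, derives the parent and child sets, and uses the Python standard library's `graphlib.TopologicalSorter` to produce a topological ordering. The node names and edges are assumptions made for the example, and `is_acyclic` is an illustrative helper.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical edge set E: Z -> X, Z -> Y, X -> M, M -> Y.
nodes = {"Z", "X", "M", "Y"}
edges = [("Z", "X"), ("Z", "Y"), ("X", "M"), ("M", "Y")]

# pa(j) = {i : (i, j) in E}; children are the reverse lookup.
parents = {v: set() for v in nodes}
children = {v: set() for v in nodes}
for i, j in edges:
    parents[j].add(i)
    children[i].add(j)


def is_acyclic(parent_map):
    """True if the graph given as node -> set of parents has no directed cycle."""
    try:
        list(TopologicalSorter(parent_map).static_order())
        return True
    except CycleError:
        return False


# TopologicalSorter expects node -> predecessors, which is exactly `parents`.
order = list(TopologicalSorter(parents).static_order())
print(order)
print(is_acyclic(parents), is_acyclic({"A": {"B"}, "B": {"A"}}))
```

A topological ordering lists every parent before its children, which is the order in which the structural assignments of a structural causal model can be evaluated.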
## Identification As A Mathematical Problem
The course is organised around the following question: when is a causal quantity determined by the observational distribution? This question separates definition from estimation. Identification is about whether the target parameter is a function of the observable law; estimation is about learning that function from finite samples.
[definition: Identifiability]
Let $\mathcal M$ be a class of causal models, let $\mathcal P$ be a space of observational laws over the observed variables, let $L:\mathcal M\to\mathcal P$ send a model to its observational law, and let $\Psi:\mathcal M\to\mathcal S$ be a causal parameter with values in a measurable space $\mathcal S$. The parameter $\Psi$ is identifiable from the observational law if, for all $M_1,M_2\in\mathcal M$,
\begin{align*}
L(M_1)=L(M_2) \implies \Psi(M_1)=\Psi(M_2).
\end{align*}
[/definition]
This definition captures why causal inference needs assumptions. The obstruction is not finite-sample noise but logical non-uniqueness: two causal models may agree on every observable probability while assigning different values to unobserved potential outcomes. In that situation, the observational law has no information with which to choose between the competing causal answers.
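The non-uniqueness can be made concrete with a two-line population. The pair of models below is a hypothetical construction chosen for illustration and is not claimed to be the one used in any proof in these notes. Each model lists unit types with their probability, treatment $A$, and potential outcomes $(Y(0),Y(1))$; the observed outcome is $Y=Y(A)$.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Each row is (probability, A, Y(0), Y(1)).
# Model 1: treatment has no effect, but treated units are exactly the ones
# whose outcome is 1 regardless of treatment.
model_1 = [(HALF, 1, 1, 1), (HALF, 0, 0, 0)]
# Model 2: treatment always switches the outcome from 0 to 1, and A is
# independent of the potential outcomes.
model_2 = [(HALF, 1, 0, 1), (HALF, 0, 0, 1)]


def observational_law(model):
    """Law of the observed pair (A, Y), with Y = Y(A)."""
    law = {}
    for p, a, y0, y1 in model:
        y = y1 if a == 1 else y0
        law[(a, y)] = law.get((a, y), Fraction(0)) + p
    return law


def average_treatment_effect(model):
    """E[Y(1) - Y(0)] computed from the full table of potential outcomes."""
    return sum(p * (y1 - y0) for p, _, y0, y1 in model)


print(observational_law(model_1) == observational_law(model_2))
print(average_treatment_effect(model_1), average_treatment_effect(model_2))
```

Both models put probability $1/2$ on $(A,Y)=(1,1)$ and $1/2$ on $(A,Y)=(0,0)$, yet their average treatment effects are $0$ and $1$. In the notation of the definition, $L(M_1)=L(M_2)$ while $\Psi(M_1)\neq\Psi(M_2)$.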
[quotetheorem:9653]
[citeproof:9653]
This theorem is the negative starting point for the course, but its conclusion is deliberately limited. It does not say that causal effects are never identifiable from observational data; it says that the observational law alone does not determine them when the model class is unrestricted. Consistency is not the obstruction in the construction, since both models satisfy $Y=Y(A)$ exactly. The obstruction is unrestricted dependence between treatment assignment and the pair of potential outcomes: the observed table records only $Y(A)$, while the average treatment effect depends on the unobserved components of $(Y(0),Y(1))$. Once additional restrictions such as randomisation, exchangeability, or graphical separation assumptions are imposed, two models with the same observed law may be forced to agree on the causal target. To obtain positive identification results, the first bridge between observed and potential outcomes is a condition saying that the factual outcome equals the potential outcome corresponding to the treatment actually received.
[definition: Consistency]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $A:(\Omega,\mathcal F)\to(\mathcal A,\mathcal E_{\mathcal A})$ be a treatment variable, let $(\mathcal Y,\mathcal E_{\mathcal Y})$ be an outcome measurable space, let $Y:(\Omega,\mathcal F)\to(\mathcal Y,\mathcal E_{\mathcal Y})$ be the observed outcome, and let $Y(a):(\Omega,\mathcal F)\to(\mathcal Y,\mathcal E_{\mathcal Y})$ be a potential outcome for each $a\in\mathcal A$. Consistency is the condition
\begin{align*}
A=a \implies Y=Y(a)
\end{align*}
for every $a \in \mathcal A$.
[/definition]
Consistency links the observed outcome to the relevant potential outcome for units that actually received treatment level $a$. It does not say how treatment was assigned, so a separate assumption is needed to compare the treated and untreated groups as if treatment assignment did not reveal potential-outcome information.
[definition: Exchangeability]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $A:(\Omega,\mathcal F)\to(\{0,1\},2^{\{0,1\}})$ be a binary treatment, and let $Y(0),Y(1):(\Omega,\mathcal F)\to(\mathcal Y,\mathcal E_{\mathcal Y})$ be potential outcomes in a measurable outcome space $(\mathcal Y,\mathcal E_{\mathcal Y})$. Exchangeability is the condition
\begin{align*}
(Y(0),Y(1)) \perp A.
\end{align*}
If $Z:(\Omega,\mathcal F)\to(\mathcal Z,\mathcal E_{\mathcal Z})$ is a covariate random variable, conditional exchangeability given $Z$ is the condition
\begin{align*}
(Y(0),Y(1)) \perp A \mid Z.
\end{align*}
[/definition]
Exchangeability says that treatment assignment carries no information about the potential outcomes, either marginally or after conditioning on measured covariates. Even with this independence, the comparison cannot use covariate strata in which a treatment level never occurs, which motivates the following support condition.
[definition: Positivity]
Let $A:(\Omega,\mathcal F)\to(\mathcal A,\mathcal E_{\mathcal A})$ be a treatment variable, let $Z:(\Omega,\mathcal F)\to(\mathcal Z,\mathcal E_{\mathcal Z})$ be covariates, let $\mu_Z$ be a probability law on $(\mathcal Z,\mathcal E_{\mathcal Z})$, and let $\mathcal A_0\subset\mathcal A$ be the treatment levels under comparison. Positivity on the target covariate support is the condition that, for every $a\in\mathcal A_0$,
\begin{align*}
\mathbb P(A=a\mid Z)>0
\end{align*}
holds $\mu_Z$-almost surely.
[/definition]
This point-mass form is the discrete-treatment version used in the first adjustment theorem. For continuous treatments, the analogous condition is stated relative to a dominating measure on $\mathcal A$: the conditional law of $A$ given $Z=z$ must assign positive density, or more generally contain the target treatment values in its conditional support, for $\mu_Z$-almost every $z$.
Positivity prevents identification formulas from asking for conditional means in strata where no units received a treatment level. In the basic adjustment formula below, the target covariate law $\mu_Z$ is the observed marginal law $\mathbb P\circ Z^{-1}$. With consistency, conditional exchangeability, and positivity in place, the following theorem can express potential-outcome means using observed conditional expectations.
[quotetheorem:9654]
[citeproof:9654]
This theorem is the prototype for the whole course, but each hypothesis carries real mathematical content. If consistency fails, the observed $Y$ among units with $A=a$ is not the same object as $Y(a)$: for example, if $A=1$ records "received surgery" but two hospitals use materially different surgical protocols, then a single potential outcome $Y(1)$ is not the outcome attached to every observed treated unit. If exchangeability fails, treated and untreated units may have different potential-outcome distributions even after conditioning on $Z$: for example, if unmeasured disease severity affects both the decision to treat and recovery, then $\mathbb E[Y\mid A=1,Z]$ combines treatment response with selection into treatment. If positivity fails, a stratum with $\mathbb P(A=a\mid Z)=0$ asks for a conditional mean that the observed law does not supply: for example, if no patients above a clinical risk threshold receive the control treatment, the observed data contain no within-stratum value of $\mathbb E[Y\mid A=0,Z]$ for that risk group. Later identification results have the same form: a causal target is transformed, under assumptions, into a functional of the observed law, and the assumptions specify exactly which substitutions are licensed.
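For a discrete covariate, the usual standardisation formula reads $\mathbb E[Y(a)]=\sum_z \mathbb E[Y\mid A=a,Z=z]\,\mathbb P(Z=z)$; the sketch below assumes this is the form in which the adjustment result is applied. It reuses the numbers from the treatment-and-recovery example, writing the treatment as $A$. Two things are assumptions not found in the notes: the treatment marginal $\mathbb P(A=1)=1/2$, and that conditional exchangeability given $Z$ holds in that example. The helper names are illustrative.

```python
from fractions import Fraction as Fr

# ASSUMPTION: P(A=a) is not given in the notes; 1/2 is chosen for illustration.
p_a = {0: Fr(1, 2), 1: Fr(1, 2)}
# P(Z=z | A=a) and E[Y | A=a, Z=z] from the treatment-and-recovery example.
p_z_given_a = {0: {0: Fr(9, 10), 1: Fr(1, 10)},
               1: {0: Fr(1, 10), 1: Fr(9, 10)}}
mean_y = {(0, 0): Fr(8, 10), (1, 0): Fr(9, 10),
          (0, 1): Fr(2, 10), (1, 1): Fr(3, 10)}  # keyed by (a, z)

# Observed marginal law of Z, which plays the role of mu_Z.
p_z = {z: sum(p_a[a] * p_z_given_a[a][z] for a in p_a) for z in (0, 1)}


def propensity(a, z):
    """P(A=a | Z=z) by Bayes' rule."""
    return p_a[a] * p_z_given_a[a][z] / p_z[z]


def standardized_mean(a):
    """sum_z E[Y | A=a, Z=z] P(Z=z), refusing strata that violate positivity."""
    for z, pz in p_z.items():
        if pz > 0 and propensity(a, z) == 0:
            raise ValueError(f"positivity fails for a={a} in stratum z={z}")
    return sum(mean_y[(a, z)] * pz for z, pz in p_z.items())


adjusted_contrast = standardized_mean(1) - standardized_mean(0)
print(standardized_mean(1), standardized_mean(0), adjusted_contrast)
```

Under these assumptions the standardised means are $3/5$ and $1/2$, so the adjusted contrast is $+0.10$, whereas the unadjusted contrast from the same example is $0.36-0.74=-0.38$.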
## How the Remaining Chapters Fit Together
The remaining chapters return repeatedly to the same obstruction: the data reveal the factual outcome, while causal questions compare factual and non-factual regimes. The first part of the course develops potential outcomes and the assumptions behind adjustment. It introduces average treatment effects, conditional average treatment effects, the stable unit treatment value assumption, propensity scores, and the role of randomisation.
The second part shifts to structural causal models and directed acyclic graphs because adjustment alone does not explain which covariates should be conditioned on. Graphs connect causal inference to conditional independence in probability, to path structure in graph theory, to factorisation ideas from graphical models, and to modelling choices familiar from scientific mechanisms. They make it possible to reason about whole systems of variables, distinguish confounders from mediators and colliders, and prove identification criteria such as the back-door and front-door formulas.
The final part studies more advanced identification tools for settings where simple covariate adjustment is unavailable. Do-calculus gives a formal calculus for manipulating interventional distributions, while instrumental variables connect causal inference to moment equations, natural experiments, and regression-based estimation under hidden confounding. These topics also connect the course to optimisation and semiparametric statistics, where the identified functional must be estimated efficiently under infinite-dimensional nuisance structure.
[remark: Estimation Comes After Identification]
An estimator can be consistent for the wrong target if the causal parameter is not identified by the estimand it uses. For this reason, the course first proves identification formulas and only then discusses regression, weighting, matching, and instrumental-variable estimation as ways of estimating the identified functionals.
[/remark]
The guiding discipline is to write every causal claim as a precise mathematical object before trying to estimate it. Once the target is specified, the course asks which assumptions connect it to observable data, which formulas follow from those assumptions, and which failures of the assumptions break the conclusion.
With the basic problem fixed, the next step is to introduce the probability language that lets causal targets be written precisely and compared with observable data.
# 1. Causal Questions and Interventional Probability
Building on the introduction, this chapter sets up the language in which causal inference turns scientific questions into probability statements. The prerequisites are the basic language of probability spaces, random variables, [conditional probability](/page/Conditional%20Probability), independence, expectation, and distributions; the chapter recalls the extra causal notation as it is introduced. Ordinary conditioning answers questions about association in an observed population, while intervention and counterfactual notation aim to describe what would happen under actions that may not have occurred. The main goal is to separate these three kinds of query, then isolate the mathematical assumptions under which observational quantities can identify causal contrasts.
## From Association to Intervention
A first difficulty in causal inference is that the same symbols used in probability can hide different questions. If $X$ is a treatment or exposure and $Y$ is an outcome, the conditional law of $Y$ given $X=x$ describes units whose exposure was observed to be $x$. A causal question instead asks about the outcome distribution after forcing the exposure to equal $x$, possibly changing the mechanism that generated $X$.
[definition: Associational Query]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, let $X: \Omega \to \mathcal X$ be an exposure random variable, let $(\mathcal Y,\mathcal A_{\mathcal Y})$ be an outcome measurable space, and let $Y: \Omega \to \mathcal Y$ be a measurable outcome. Let $x \in \mathcal X$ satisfy $\mathbb P(X=x)>0$ in the discrete case. An associational query is a statement about the conditional distribution
\begin{align*}
\mathbb P(Y \in A \mid X=x), \qquad A \in \mathcal A_{\mathcal Y}.
\end{align*}
[/definition]
The associational query is determined by the joint law of $(X,Y)$ when the relevant conditional probabilities are defined, so it can be estimated from observational data. It does not by itself say what would happen if an external regime set $X$ to $x$.
[example: Treatment Choice and Recovery]
Suppose patients have baseline severity $L \in \{\text{mild},\text{severe}\}$, and severe patients are more likely to choose treatment. Within each severity level, treatment improves recovery:
\begin{align*}
\mathbb P(Y=1 \mid X=1,L=\text{mild})=0.95>0.90=\mathbb P(Y=1 \mid X=0,L=\text{mild}).
\end{align*}
\begin{align*}
\mathbb P(Y=1 \mid X=1,L=\text{severe})=0.55>0.50=\mathbb P(Y=1 \mid X=0,L=\text{severe}).
\end{align*}
Now suppose the treated group is mostly severe, while the untreated group is mostly mild:
\begin{align*}
\mathbb P(L=\text{severe}\mid X=1)=0.90,\qquad \mathbb P(L=\text{mild}\mid X=1)=0.10.
\end{align*}
\begin{align*}
\mathbb P(L=\text{severe}\mid X=0)=0.10,\qquad \mathbb P(L=\text{mild}\mid X=0)=0.90.
\end{align*}
Because the two severity levels are exhaustive and disjoint, conditioning on $X=1$ and summing over $L$ gives
\begin{align*}
\mathbb P(Y=1\mid X=1)=\mathbb P(Y=1\mid X=1,L=\text{mild})\mathbb P(L=\text{mild}\mid X=1)+\mathbb P(Y=1\mid X=1,L=\text{severe})\mathbb P(L=\text{severe}\mid X=1).
\end{align*}
Substituting the given values,
\begin{align*}
\mathbb P(Y=1\mid X=1)=0.95\cdot 0.10+0.55\cdot 0.90.
\end{align*}
The two products are
\begin{align*}
0.95\cdot 0.10=0.095
\end{align*}
and
\begin{align*}
0.55\cdot 0.90=0.495.
\end{align*}
Therefore
\begin{align*}
\mathbb P(Y=1\mid X=1)=0.095+0.495=0.590.
\end{align*}
For the untreated group, the same decomposition gives
\begin{align*}
\mathbb P(Y=1\mid X=0)=\mathbb P(Y=1\mid X=0,L=\text{mild})\mathbb P(L=\text{mild}\mid X=0)+\mathbb P(Y=1\mid X=0,L=\text{severe})\mathbb P(L=\text{severe}\mid X=0).
\end{align*}
Substituting the given values,
\begin{align*}
\mathbb P(Y=1\mid X=0)=0.90\cdot 0.90+0.50\cdot 0.10.
\end{align*}
The two products are
\begin{align*}
0.90\cdot 0.90=0.810
\end{align*}
and
\begin{align*}
0.50\cdot 0.10=0.050.
\end{align*}
Therefore
\begin{align*}
\mathbb P(Y=1\mid X=0)=0.810+0.050=0.860.
\end{align*}
Thus
\begin{align*}
\mathbb P(Y=1\mid X=1)=0.590<0.860=\mathbb P(Y=1\mid X=0).
\end{align*}
The observed treated group has the lower recovery probability even though treatment raises recovery probability within both severity strata, so $\mathbb P(Y=1\mid X=1)$ is associational: it mixes the within-stratum treatment comparison with the fact that treated patients are much more likely to be severe.
[/example]
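The same conclusion can be reached by building the full joint law of $(X,L,Y)$ and conditioning on it directly. The sketch below does this with the chain rule. The treatment marginal $\mathbb P(X=1)=1/2$ is an assumption introduced only to make the joint table concrete; it cancels out of every conditional probability computed here. The helper `cond_recovery` is illustrative.

```python
from fractions import Fraction as Fr

p_x1 = Fr(1, 2)  # ASSUMPTION: not given in the example; cancels when conditioning on X
p_severe_given_x = {1: Fr(9, 10), 0: Fr(1, 10)}
p_recover = {(1, "mild"): Fr(95, 100), (0, "mild"): Fr(90, 100),
             (1, "severe"): Fr(55, 100), (0, "severe"): Fr(50, 100)}

# Joint law of (X, L, Y) via the chain rule P(x) P(l | x) P(y | x, l).
joint = {}
for x in (0, 1):
    px = p_x1 if x == 1 else 1 - p_x1
    for sev in ("mild", "severe"):
        p_sev = p_severe_given_x[x] if sev == "severe" else 1 - p_severe_given_x[x]
        for y in (0, 1):
            py = p_recover[(x, sev)] if y == 1 else 1 - p_recover[(x, sev)]
            joint[(x, sev, y)] = px * p_sev * py


def cond_recovery(x, level=None):
    """P(Y=1 | X=x), or P(Y=1 | X=x, L=level) when a level is supplied."""
    kept = [(key, p) for key, p in joint.items()
            if key[0] == x and (level is None or key[1] == level)]
    total = sum(p for _, p in kept)
    return sum(p for key, p in kept if key[2] == 1) / total


marginal_gap = cond_recovery(1) - cond_recovery(0)
stratum_gaps = {sev: cond_recovery(1, sev) - cond_recovery(0, sev)
                for sev in ("mild", "severe")}
print(float(cond_recovery(1)), float(cond_recovery(0)), float(marginal_gap))
print({sev: float(gap) for sev, gap in stratum_gaps.items()})
```

The marginal gap is negative while both stratum-specific gaps are positive, which is the reversal described in the example.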
To express the result of an external action, causal inference introduces intervention notation. The symbol $do(X=x)$ is not an event in the original probability space; it labels a modified data-generating regime.
[definition: Interventional Query]
Let $(\mathcal Y,\mathcal A_{\mathcal Y})$ be the measurable outcome space for an outcome $Y$, and let $\mathcal X$ be the set of possible exposure values. An interventional law for $Y$ is a family of probability measures $(\mathbb P_Y^x)_{x \in \mathcal X}$ on $(\mathcal Y,\mathcal A_{\mathcal Y})$. For $x \in \mathcal X$ and $A \in \mathcal A_{\mathcal Y}$, the interventional query
\begin{align*}
\mathbb P(Y \in A \mid do(X=x)) := \mathbb P_Y^x(A)
\end{align*}
asks for the probability of $A$ under the regime in which the mechanism assigning $X$ is replaced by the fixed assignment $X=x$.
[/definition]
This definition is semantic rather than purely measure-theoretic: the interventional law must be supplied by a causal model, an experiment, or assumptions linking it to the observational law. The point of the notation is to keep the action $do(X=x)$ distinct from the observed event $X=x$.
[remark: Conditioning Is Not Acting]
The expression $\mathbb P(Y \in A \mid X=x)$ restricts attention to units whose exposure equals $x$ under the observed assignment mechanism. The expression $\mathbb P(Y \in A \mid do(X=x))$ changes that mechanism. Confounding is the failure of these two operations to agree for the causal question under study.
[/remark]
To compare the same unit under different intervention regimes, we need notation that records the outcome each fixed exposure would produce. Potential outcomes provide that unit-level representation.
[definition: Potential Outcome]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $(\mathcal Y,\mathcal A_{\mathcal Y})$ be an outcome measurable space, and let $\mathcal X$ be the exposure space. For each exposure value $x \in \mathcal X$, the potential outcome is a measurable random variable
\begin{align*}
Y_x : \Omega \to \mathcal Y
\end{align*}
representing the outcome that would be observed for the same unit under the intervention setting $X=x$.
[/definition]
When interventions are well defined, the distribution of $Y_x$ is the interventional outcome law. In this language, $\mathbb P(Y \in A \mid do(X=x))$ is written as $\mathbb P(Y_x \in A)$.
[example: Two Treatment Arms]
Let $X \in \{0,1\}$ indicate treatment, and let $Y_1,Y_0 \in \{0,1\}$ be the potential recovery indicators under treatment and control. For patient $a$, suppose treatment would lead to recovery and control would not:
\begin{align*}
(Y_1(a),Y_0(a))=(1,0).
\end{align*}
For patient $b$, suppose treatment would not lead to recovery and control would:
\begin{align*}
(Y_1(b),Y_0(b))=(0,1).
\end{align*}
Now assign patient $a$ to treatment and patient $b$ to control, so
\begin{align*}
X(a)=1.
\end{align*}
\begin{align*}
X(b)=0.
\end{align*}
The observed outcome is the potential outcome indexed by the treatment actually received. For patient $a$, substituting $X(a)=1$ gives
\begin{align*}
Y(a)=Y_{X(a)}(a)=Y_1(a).
\end{align*}
Since $Y_1(a)=1$, this becomes
\begin{align*}
Y(a)=1.
\end{align*}
For patient $b$, substituting $X(b)=0$ gives
\begin{align*}
Y(b)=Y_{X(b)}(b)=Y_0(b).
\end{align*}
Since $Y_0(b)=1$, this becomes
\begin{align*}
Y(b)=1.
\end{align*}
Thus the observed data are
\begin{align*}
(X(a),Y(a))=(1,1)
\end{align*}
and
\begin{align*}
(X(b),Y(b))=(0,1).
\end{align*}
The unit-level causal effect for patient $a$ is the treated potential outcome minus the control potential outcome:
\begin{align*}
Y_1(a)-Y_0(a)=1-0=1.
\end{align*}
The unit-level causal effect for patient $b$ is
\begin{align*}
Y_1(b)-Y_0(b)=0-1=-1.
\end{align*}
So both observed outcomes equal $1$, but the missing potential outcomes show opposite individual effects: treatment helps patient $a$ and harms patient $b$.
[/example]
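The bookkeeping in this example is short enough to write out as a table. The following Python sketch stores both potential outcomes for each patient, reveals only the one indexed by the assigned treatment, and computes the unit-level effects; the dictionary keys and function names are illustrative.

```python
# Potential outcomes (Y_1, Y_0) and assigned treatment X for the two patients.
units = {
    "a": {"Y1": 1, "Y0": 0, "X": 1},
    "b": {"Y1": 0, "Y0": 1, "X": 0},
}


def observed_outcome(unit):
    """Y = Y_X: only the potential outcome for the received treatment is seen."""
    return unit["Y1"] if unit["X"] == 1 else unit["Y0"]


def unit_effect(unit):
    """Y_1 - Y_0, which needs both potential outcomes and is never observed."""
    return unit["Y1"] - unit["Y0"]


observed = {name: (u["X"], observed_outcome(u)) for name, u in units.items()}
effects = {name: unit_effect(u) for name, u in units.items()}
print(observed)
print(effects)
```

The `observed` table is all that a data set would contain; the `effects` table can be computed only because the sketch has access to both potential outcomes for every unit.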
## Causal Contrasts and Effect Scales
Once causal laws have been separated from associational laws, the next question is how to summarise a causal difference. For binary exposures, most introductory causal analyses compare the distributions of the two potential outcomes $Y_1$ and $Y_0$. Different effect scales answer different scientific questions and behave differently under aggregation.
[definition: Average Causal Effect]
Let $X \in \{0,1\}$ be a binary exposure and let $Y_1,Y_0$ be real-valued potential outcomes with $\mathbb E[|Y_1|]+\mathbb E[|Y_0|]<\infty$. The average causal effect is
\begin{align*}
\operatorname{ACE} := \mathbb E[Y_1]-\mathbb E[Y_0].
\end{align*}
[/definition]
The average causal effect is measured on the additive scale of the outcome. When the outcome is binary, the same additive idea becomes a difference between two event probabilities, which motivates naming the risk-scale version separately.
[definition: Risk Difference]
Let $Y_1,Y_0 \in \{0,1\}$. The causal risk difference is
\begin{align*}
\operatorname{RD} := \mathbb P(Y_1=1)-\mathbb P(Y_0=1).
\end{align*}
[/definition]
The risk difference is collapsible: the population risk difference is the average of the stratum-specific risk differences, weighted by the stratum probabilities. This makes it convenient for population-level summaries. Some scientific questions ask instead for proportional change relative to baseline risk, so the next effect scale compares risks by division rather than subtraction.
[definition: Risk Ratio]
Let $Y_1,Y_0 \in \{0,1\}$ and assume $\mathbb P(Y_0=1)>0$. The causal risk ratio is
\begin{align*}
\operatorname{RR} := \frac{\mathbb P(Y_1=1)}{\mathbb P(Y_0=1)}.
\end{align*}
[/definition]
The risk ratio measures proportional change in risk. Epidemiology and logistic modelling often work with odds rather than risks; this motivates the following odds-based definition.
[definition: Odds Ratio]
Let $Y_1,Y_0 \in \{0,1\}$ and assume $0<\mathbb P(Y_x=1)<1$ for $x \in \{0,1\}$. The causal odds ratio is
\begin{align*}
\operatorname{OR} := \frac{\mathbb P(Y_1=1)/(1-\mathbb P(Y_1=1))}{\mathbb P(Y_0=1)/(1-\mathbb P(Y_0=1))}.
\end{align*}
[/definition]
Odds ratios are common because they arise naturally in logistic models and retrospective sampling designs. They require careful interpretation: equality of odds ratios across strata is not the same as equality of risk differences.
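The three binary-outcome scales defined above are simple functions of the two interventional risks $\mathbb P(Y_1=1)$ and $\mathbb P(Y_0=1)$. The sketch below collects them in one illustrative helper, `effect_scales`, and evaluates it on two pairs of risks with the same absolute change. It requires both risks to lie strictly between $0$ and $1$ so that all three quantities are defined at once, which is stricter than the risk ratio alone needs.

```python
from fractions import Fraction


def odds(p):
    """Odds p / (1 - p) of an event with probability p in (0, 1)."""
    return p / (1 - p)


def effect_scales(risk_treated, risk_control):
    """Return (RD, RR, OR) for the risks P(Y_1=1) and P(Y_0=1)."""
    if not (0 < risk_treated < 1 and 0 < risk_control < 1):
        raise ValueError("both risks must lie strictly between 0 and 1")
    rd = risk_treated - risk_control
    rr = risk_treated / risk_control
    odds_ratio = odds(risk_treated) / odds(risk_control)
    return rd, rr, odds_ratio


# Two populations with the same absolute change in risk.
low_baseline = effect_scales(Fraction(20, 100), Fraction(10, 100))
high_baseline = effect_scales(Fraction(50, 100), Fraction(40, 100))
print([float(v) for v in low_baseline])
print([float(v) for v in high_baseline])
```

Exact fractions avoid rounding noise in the ratios. With a control risk of $0.10$ the scales are $(0.10,\,2,\,2.25)$, and with a control risk of $0.40$ they are $(0.10,\,1.25,\,1.5)$.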
[example: Same Data, Different Scales]
Suppose an intervention changes the event probability from $0.10$ under control to $0.20$ under treatment. The risk difference subtracts the control risk from the treated risk, so
\begin{align*}
\operatorname{RD}=0.20-0.10=0.10.
\end{align*}
The risk ratio divides the treated risk by the control risk, and the control risk is nonzero, so
\begin{align*}
\operatorname{RR}=\frac{0.20}{0.10}=2.
\end{align*}
For the odds ratio, first compute the treated odds:
\begin{align*}
\frac{0.20}{1-0.20}=\frac{0.20}{0.80}.
\end{align*}
Since $0.20/0.80=20/80=1/4$, this is
\begin{align*}
\frac{0.20}{0.80}=0.25.
\end{align*}
The control odds are
\begin{align*}
\frac{0.10}{1-0.10}=\frac{0.10}{0.90}.
\end{align*}
Since $0.10/0.90=10/90=1/9$, this is
\begin{align*}
\frac{0.10}{0.90}=\frac{1}{9}.
\end{align*}
Therefore the odds ratio is
\begin{align*}
\operatorname{OR}=\frac{0.25}{1/9}.
\end{align*}
Dividing by $1/9$ is multiplying by $9$, and $0.25=1/4$, so
\begin{align*}
\frac{0.25}{1/9}=0.25\cdot 9=\frac{1}{4}\cdot 9=\frac{9}{4}=2.25.
\end{align*}
Now consider another population where the event probability changes from $0.40$ under control to $0.50$ under treatment. The risk difference is again
\begin{align*}
\operatorname{RD}=0.50-0.40=0.10.
\end{align*}
The risk ratio is
\begin{align*}
\operatorname{RR}=\frac{0.50}{0.40}.
\end{align*}
Since $0.50/0.40=50/40=5/4$, this gives
\begin{align*}
\operatorname{RR}=\frac{5}{4}=1.25.
\end{align*}
The treated odds are
\begin{align*}
\frac{0.50}{1-0.50}=\frac{0.50}{0.50}=1.
\end{align*}
The control odds are
\begin{align*}
\frac{0.40}{1-0.40}=\frac{0.40}{0.60}.
\end{align*}
Since $0.40/0.60=40/60=2/3$, this is
\begin{align*}
\frac{0.40}{0.60}=\frac{2}{3}.
\end{align*}
Thus the odds ratio is
\begin{align*}
\operatorname{OR}=\frac{1}{2/3}.
\end{align*}
Dividing by $2/3$ is multiplying by $3/2$, so
\begin{align*}
\frac{1}{2/3}=1\cdot \frac{3}{2}=\frac{3}{2}=1.5.
\end{align*}
The two populations have the same risk difference, $0.10$, but their risk ratios are $2$ and $1.25$, and their odds ratios are $2.25$ and $1.5$. The numerical comparison therefore depends on the effect scale, even when the absolute change in risk is the same.
[/example]
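The arithmetic in this example can be reproduced with a short script. The following is a minimal Python sketch, not part of the course material: the function name `effect_scales` is an illustrative choice, and its two inputs are the interventional risks $\mathbb P(Y_1=1)$ and $\mathbb P(Y_0=1)$, assumed to lie strictly between $0$ and $1$.

```python
from fractions import Fraction


def effect_scales(risk_treated, risk_control):
    """Return (RD, RR, OR) for two interventional risks.

    Hypothetical helper: both risks must lie strictly between 0 and 1
    so that the risk ratio and both odds are defined.
    """
    for risk in (risk_treated, risk_control):
        if not 0 < risk < 1:
            raise ValueError("risks must lie strictly between 0 and 1")
    rd = risk_treated - risk_control
    rr = risk_treated / risk_control
    odds_treated = risk_treated / (1 - risk_treated)
    odds_control = risk_control / (1 - risk_control)
    return rd, rr, odds_treated / odds_control


# Exact rational arithmetic avoids floating-point rounding noise.
pop_a = effect_scales(Fraction(20, 100), Fraction(10, 100))
pop_b = effect_scales(Fraction(50, 100), Fraction(40, 100))

print([float(v) for v in pop_a])  # [0.1, 2.0, 2.25]
print([float(v) for v in pop_b])  # [0.1, 1.25, 1.5]
```

Both populations return the same first entry and different second and third entries, matching the hand computation above.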
This example compares common scalar summaries, but interventions can change more than a single number. The following general definition covers means, risks, quantiles, and whole distributions in one language.
[definition: Causal Contrast]
Let $\mathcal X$ be the exposure space, let $(\mathcal Y,\mathcal A_{\mathcal Y})$ be an outcome measurable space, and let $\mathcal P(\mathcal Y)$ denote the set of probability measures on it. For $x,x' \in \mathcal X$, a causal contrast is a map
\begin{align*}
C: \mathcal P(\mathcal Y) \times \mathcal P(\mathcal Y) \to \mathcal S
\end{align*}
from a pair of interventional outcome laws to a summary space $\mathcal S$, evaluated at $(\mathcal L(Y_x),\mathcal L(Y_{x'}))$.
[/definition]
The choice of contrast determines what must be identified from data. A mean contrast requires only the two means $\mathbb E[Y_x]$ and $\mathbb E[Y_{x'}]$, while distributional contrasts require more of the interventional laws $\mathcal L(Y_x)$ and $\mathcal L(Y_{x'})$.
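To make the map $C$ concrete, the sketch below represents a finite discrete law as a Python dictionary from values to probabilities and implements two contrasts on pairs of such laws. The representation, the function names, and the two laws are illustrative assumptions, not notation from the course.

```python
def mean(law):
    """Mean of a finite discrete law given as {value: probability}."""
    return sum(value * prob for value, prob in law.items())


def mean_difference(law_x, law_x_prime):
    """Contrast with summary space S = R: difference of the two means."""
    return mean(law_x) - mean(law_x_prime)


def tail_difference(threshold):
    """Build a contrast comparing P(Y > threshold) under the two laws."""
    def tail(law):
        return sum(prob for value, prob in law.items() if value > threshold)

    def contrast(law_x, law_x_prime):
        return tail(law_x) - tail(law_x_prime)

    return contrast


# Hypothetical interventional laws of Y_x and Y_x', chosen for illustration.
law_y_x = {0: 0.5, 2: 0.5}
law_y_x_prime = {1: 1.0}

print(mean_difference(law_y_x, law_y_x_prime))     # 0.0
print(tail_difference(1)(law_y_x, law_y_x_prime))  # 0.5
```

The two laws have the same mean, so the mean contrast is zero, while the tail contrast is not. Knowing $\mathbb E[Y_x]$ alone would therefore not determine the second contrast.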
## Consistency and Observed Outcomes
The next problem is how observed data relate to potential outcomes. A dataset contains $(X,Y)$, not both $Y_0$ and $Y_1$ for each unit. Consistency is the assumption that the observed outcome equals the potential outcome under the exposure actually received.
[definition: Consistency]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $X:\Omega \to \mathcal X$ be the observed exposure, let $(\mathcal Y,\mathcal A_{\mathcal Y})$ be the outcome measurable space, let $(Y_x)_{x \in \mathcal X}$ be potential outcomes with $Y_x:\Omega \to \mathcal Y$, and let $Y:\Omega \to \mathcal Y$ be the observed outcome. Consistency holds if
\begin{align*}
Y = Y_X.
\end{align*}
[/definition]
For binary exposure this means $Y=Y_1$ among treated units and $Y=Y_0$ among untreated units. The immediate use of the assumption is to replace observed outcomes by the relevant potential outcomes inside realised exposure strata. Without consistency this replacement can fail even in a randomized experiment: if $X=1$ is recorded as receiving a treatment but some recorded treated units do not actually receive the protocol, then $Y$ need not equal $Y_1$ on the event $\{X=1\}$. The following lemma isolates exactly the substitution that consistency permits.
[quotetheorem:9655]
[citeproof:9655]
The lemma does not identify $\mathbb P(Y_x \in A)$ by itself, because it only links observed and potential outcomes within the stratum $X=x$. It also shows why consistency is indispensable: in the scenario described before the lemma, where some recorded treated units do not receive the protocol, the observed $Y$ among recorded treated units need not be $Y_1$, so the substitution in the proof breaks down. Even when consistency holds, treated units may be systematically sicker than untreated units, so the law of $Y_1$ among treated units can differ from the law of $Y_1$ in the full population. Thus the lemma says nothing about how treatment was assigned, whether all treatment levels occur, or whether an observed exposure group is representative. The next assumptions address exactly those missing pieces: exchangeability controls selection into treatment, and positivity ensures that the relevant exposure strata exist in the data.
[example: Randomized Trial Consistency]
In a two-arm randomized trial, let $X=1$ denote receipt of the active treatment protocol and let $X=0$ denote receipt of the control protocol. If these two protocols are well specified and treatment receipt is measured without error, consistency asserts the pointwise identity
\begin{align*}
Y=Y_X.
\end{align*}
This means that for each participant, the observed outcome is obtained by evaluating the potential-outcome family at that participant's realized treatment value.
For a treated participant, $X=1$. Substituting this value into the index of $Y_X$ gives
\begin{align*}
Y_X=Y_1.
\end{align*}
Combining this with consistency yields
\begin{align*}
Y=Y_X=Y_1.
\end{align*}
Thus the observed outcome for a treated participant is the treated potential outcome.
For a control participant, $X=0$. Substituting this value into the index of $Y_X$ gives
\begin{align*}
Y_X=Y_0.
\end{align*}
Combining this with consistency yields
\begin{align*}
Y=Y_X=Y_0.
\end{align*}
Thus the observed outcome for a control participant is the control potential outcome. Consistency identifies which potential outcome is visible in the observed data for each participant; randomization is the separate design condition that describes how the value of $X$ was assigned.
[/example]
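The selection performed by $Y=Y_X$ can be shown on a small table. The following Python sketch uses made-up potential outcomes for five participants; all numbers are hypothetical, and in real data only `x` and `y` would be available.

```python
# Hypothetical potential outcomes for five participants (illustrative values).
y0 = [3.0, 5.0, 2.0, 4.0, 6.0]   # outcome under the control protocol
y1 = [4.0, 5.0, 1.0, 7.0, 6.5]   # outcome under the active protocol
x = [1, 0, 0, 1, 1]              # protocol actually received

# Consistency: the observed outcome is the potential outcome indexed by X.
y = [treated if arm == 1 else control
     for control, treated, arm in zip(y0, y1, x)]

# The potential outcome under the protocol not received stays unobserved.
unobserved = [control if arm == 1 else treated
              for control, treated, arm in zip(y0, y1, x)]

print(y)           # [4.0, 5.0, 2.0, 7.0, 6.5]
print(unobserved)  # [3.0, 5.0, 1.0, 4.0, 6.0]
```

Each participant contributes exactly one entry of the pair $(Y_1,Y_0)$ to the data, which is why consistency alone does not reveal any individual contrast $Y_1-Y_0$.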
## Exchangeability and Identification
Even after consistency, causal inference still faces a selection problem. The treated units reveal $Y_1$ under consistency, but they may not have the same distribution of $Y_1$ as the whole target population. Exchangeability is the condition that treatment assignment carries no information about the relevant potential outcomes.
[definition: Exchangeability]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $X:\Omega \to \{0,1\}$ be a binary exposure, let $(\mathcal Y,\mathcal A_{\mathcal Y})$ be the outcome measurable space, and let $Y_1,Y_0:\Omega \to \mathcal Y$ be potential outcomes. Marginal exchangeability holds if
\begin{align*}
(Y_1,Y_0) \perp X.
\end{align*}
For a covariate random vector $L:\Omega \to \mathcal L$, conditional exchangeability holds if
\begin{align*}
(Y_1,Y_0) \perp X \mid L.
\end{align*}
[/definition]
Marginal exchangeability is delivered by ideal randomized treatment assignment. Its absence can be seen in a two-type population: suppose severe patients have $(Y_1,Y_0)=(0,0)$ and mild patients have $(Y_1,Y_0)=(1,1)$, and suppose severe patients are more likely to receive treatment. Then treatment status predicts the potential outcomes even though treatment has no effect for either type, and under consistency $\mathbb P(Y=1 \mid X=1)$ is lower than $\mathbb P(Y=1 \mid X=0)$ for purely selective reasons. The definition is an independence statement about potential outcomes, which are never jointly observed, so it cannot be checked directly against data; the next step is to convert it into an identified probability formula involving only observed quantities. The following theorem performs that conversion: exchangeability removes selection bias across treatment arms, consistency says which potential outcome appears inside each arm, and positivity of the arm probability makes the observed conditional law meaningful.
[quotetheorem:9656]
[citeproof:9656]
This formula is the mathematical basis for estimating treatment effects in a randomized trial by comparing sample means across arms. Each hypothesis has a distinct role: without consistency, the observed outcome in arm $x$ may not be $Y_x$; without exchangeability, the conditional law of $Y_x$ among units with $X=x$ may not equal its marginal law; without $\mathbb P(X=x)>0$, the relevant arm is absent. The formula identifies the marginal interventional distribution for the population represented by the probability space, but it does not by itself justify transport to a different population, adjustment for post-treatment variables, or any finite-sample estimator. Those questions require additional design or modelling arguments, which is why the conditional version below separates causal identification from statistical implementation.
[example: Treatment Effect Estimation from a Randomized Trial]
In a two-arm randomized trial, suppose $\mathbb P(X=1)=p$ with $0<p<1$. Then
\begin{align*}
\mathbb P(X=0)=1-\mathbb P(X=1)=1-p>0.
\end{align*}
Thus both trial arms have positive probability. Suppose also that randomization gives
\begin{align*}
X \perp (Y_1,Y_0).
\end{align*}
Since $Y_1$ and $Y_0$ are components of $(Y_1,Y_0)$, this implies that each potential outcome is independent of the assigned arm in the sense required by the *[Exchangeability Identification Formula](/theorems/9656)*.
Consistency gives the pointwise identity
\begin{align*}
Y=Y_X.
\end{align*}
On the event $\{X=1\}$, substituting $X=1$ into $Y_X$ gives
\begin{align*}
Y_X=Y_1.
\end{align*}
Therefore, on $\{X=1\}$,
\begin{align*}
Y=Y_X=Y_1.
\end{align*}
On the event $\{X=0\}$, substituting $X=0$ into $Y_X$ gives
\begin{align*}
Y_X=Y_0.
\end{align*}
Therefore, on $\{X=0\}$,
\begin{align*}
Y=Y_X=Y_0.
\end{align*}
Apply the *Exchangeability Identification Formula* with $x=1$. The positivity condition is $\mathbb P(X=1)>0$, which holds because $\mathbb P(X=1)=p$ and $p>0$. Hence
\begin{align*}
\mathbb E[Y_1]=\mathbb E[Y\mid X=1].
\end{align*}
Apply the same formula with $x=0$. The positivity condition is $\mathbb P(X=0)>0$, which holds because $\mathbb P(X=0)=1-p$ and $p<1$. Hence
\begin{align*}
\mathbb E[Y_0]=\mathbb E[Y\mid X=0].
\end{align*}
By the definition of the average causal effect,
\begin{align*}
\operatorname{ACE}=\mathbb E[Y_1]-\mathbb E[Y_0].
\end{align*}
Substituting the identified expression for $\mathbb E[Y_1]$ gives
\begin{align*}
\operatorname{ACE}=\mathbb E[Y\mid X=1]-\mathbb E[Y_0].
\end{align*}
Substituting the identified expression for $\mathbb E[Y_0]$ gives
\begin{align*}
\operatorname{ACE}=\mathbb E[Y\mid X=1]-\mathbb E[Y\mid X=0].
\end{align*}
Thus the population causal contrast equals the difference between the observed treated-arm and control-arm conditional means. In a finite randomized trial, the corresponding difference in sample group means estimates these two ordinary conditional means, so the remaining error is sampling error rather than causal bias from treatment assignment.
[/example]
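The identity $\operatorname{ACE}=\mathbb E[Y\mid X=1]-\mathbb E[Y\mid X=0]$, and its failure under selective assignment, can be checked exactly in the two-type population used earlier to illustrate exchangeability. The sketch below is a minimal Python illustration; the equal type shares and the specific treatment probabilities are assumptions made here, not values from the text.

```python
from fractions import Fraction as F

# Severe units have (Y_1, Y_0) = (0, 0) and mild units have (1, 1).
# The type shares are an illustrative assumption.
types = {
    "severe": {"share": F(1, 2), "y1": 0, "y0": 0},
    "mild": {"share": F(1, 2), "y1": 1, "y0": 1},
}


def arm_mean(p_treat, arm):
    """E[Y | X = arm] under consistency, with P(X=1 | type) = p_treat[type]."""
    num = den = F(0)
    for name, unit in types.items():
        p_arm = p_treat[name] if arm == 1 else 1 - p_treat[name]
        outcome = unit["y1"] if arm == 1 else unit["y0"]  # Y = Y_X on {X = arm}
        num += unit["share"] * p_arm * outcome
        den += unit["share"] * p_arm
    return num / den


ace = sum(unit["share"] * (unit["y1"] - unit["y0"]) for unit in types.values())

randomized = {"severe": F(1, 2), "mild": F(1, 2)}  # X independent of type
selective = {"severe": F(4, 5), "mild": F(1, 5)}   # severe treated more often

print(ace)                                                # 0
print(arm_mean(randomized, 1) - arm_mean(randomized, 0))  # 0
print(arm_mean(selective, 1) - arm_mean(selective, 0))    # -3/5
```

Under the randomized assignment the arm difference equals the average causal effect, which is zero here. Under the selective assignment the same arm difference is negative although no unit is affected by treatment.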
In observational studies, exchangeability usually requires adjustment for covariates. A simple failure example is age confounding: let older patients be more likely to receive treatment and also less likely to recover under either treatment level. Marginal exchangeability fails because $X$ predicts $(Y_1,Y_0)$, but conditional exchangeability may hold after conditioning on age $L$ if, within each age level, treatment assignment is otherwise unrelated to the potential outcomes. The corresponding formula averages stratum-specific observed outcome laws over the covariate distribution in the target population.
[quotetheorem:9657]
[citeproof:9657]
The formula explains the role of regression adjustment, standardization, inverse-probability weighting, and matching. The hypotheses are again doing separate work: conditional exchangeability removes within-stratum selection, consistency connects observed and potential outcomes inside treatment strata, and positivity prevents the formula from asking for an unobserved conditional law. A concrete positivity failure occurs if patients above age $90$ never receive surgery: the term $\mathbb P(Y \in A \mid X=1,L=\ell)$ for those ages is not learned from observed surgical outcomes. A concrete exchangeability failure remains if, within the same recorded age and baseline-health stratum, unmeasured frailty still affects both treatment choice and recovery. A concrete consistency failure occurs if $X=1$ records assignment to surgery but some assigned patients receive only medication, so their observed outcome is not the potential outcome under the surgery intervention. The theorem does not say that every covariate should be adjusted for, since conditioning on variables affected by treatment or on colliders can create bias rather than remove it. It also does not choose an estimator; it identifies the population functional that several statistical procedures can target.
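For a discrete covariate, the adjusted functional can be computed directly from the observed law. The sketch below is a minimal Python illustration with a two-level age covariate; every probability in it is hypothetical. It evaluates the standardized form and an inverse-probability-weighted form of the same functional, alongside the unadjusted arm comparison.

```python
from fractions import Fraction as F

# Hypothetical observed law with age band as the only covariate L.
p_l = {"young": F(3, 5), "old": F(2, 5)}        # P(L = l)
p_treated = {"young": F(1, 4), "old": F(3, 4)}  # P(X = 1 | L = l)
p_recover = {                                   # P(Y = 1 | X = x, L = l)
    (1, "young"): F(9, 10), (0, "young"): F(4, 5),
    (1, "old"): F(3, 5), (0, "old"): F(1, 2),
}


def propensity(x, stratum):
    """P(X = x | L = stratum)."""
    return p_treated[stratum] if x == 1 else 1 - p_treated[stratum]


def standardized(x):
    """Sum over strata of P(Y=1 | X=x, L=l) P(L=l)."""
    return sum(p_recover[(x, s)] * p_l[s] for s in p_l)


def ipw(x):
    """E[1{X=x} Y / P(X=x | L)]: each cell mass is reweighted by 1/propensity."""
    return sum(
        p_l[s] * propensity(x, s) * p_recover[(x, s)] / propensity(x, s)
        for s in p_l
    )


def unadjusted(x):
    """P(Y=1 | X=x): strata are weighted by P(L=l | X=x) instead of P(L=l)."""
    joint = {s: p_l[s] * propensity(x, s) for s in p_l}
    return sum(p_recover[(x, s)] * joint[s] for s in p_l) / sum(joint.values())


print(standardized(1) - standardized(0))  # 1/10
print(ipw(1) - ipw(0))                    # 1/10
print(unadjusted(1) - unadjusted(0))      # -1/22
```

Standardization and weighting agree because they are two expressions for the same population functional; the propensity cancels within each cell. The unadjusted difference has the opposite sign in this hypothetical law because older patients are treated more often and recover less often. The division by `propensity(x, s)` is only defined when positivity holds in every stratum.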
## Positivity and the Limits of Adjustment
The final identification condition in this chapter addresses a support problem. Even if exchangeability would hold after conditioning on $L$, the data cannot reveal what happens under treatment level $x$ inside covariate strata where that treatment never occurs. For example, if a medical policy never gives a risky surgery to patients above age $90$, then observational data contain no surgical outcomes for that age stratum. A model can still extrapolate, but extrapolation is an additional assumption rather than identification from the observed law.
[definition: Positivity]
Let $X$ take values in $\mathcal X$, let $L$ have marginal law $\mathbb P_L$, and let $S \subseteq \operatorname{supp}(\mathbb P_L)$ be the target covariate support. Positivity for treatment value $x \in \mathcal X$ on $S$ holds if
\begin{align*}
\mathbb P(X=x \mid L=\ell)>0
\end{align*}
for $\mathbb P_L$-a.e. $\ell \in S$.
[/definition]
Positivity is not a statement about confounding; it is a statement about overlap. Its role in identification is negative as well as positive, and this motivates the following theorem: when a treatment value has zero probability in a target stratum, the observed law has no information about that stratum-specific intervention.
[quotetheorem:9658]
[citeproof:9658]
This result is why causal analyses diagnose overlap before relying on adjusted estimates. The assumption cannot be replaced by exchangeability: even if treatment assignment would be as good as random within every stratum where both treatments occur, a zero-probability treatment arm leaves the missing stratum-specific interventional law unrestricted. Nor can structural nonpositivity be repaired by larger sample size, because no amount of sampling creates observations from an impossible treatment-covariate combination. The available remedies are to change the target population, redefine the intervention, or state modelling assumptions that extrapolate beyond the observed support.
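A first overlap diagnostic is to tabulate which treatment-covariate cells are empty in the sample. The sketch below is a minimal Python illustration on made-up records; the age bands, the data, and the helper name `empty_cells` are hypothetical. An empty sample cell does not by itself prove structural nonpositivity, since it can also arise by chance in a small sample.

```python
from collections import Counter

# Hypothetical (age_band, treatment) records; values are illustrative only.
records = [
    ("60-74", 1), ("60-74", 0), ("60-74", 1),
    ("75-89", 0), ("75-89", 1), ("75-89", 0),
    ("90+", 0), ("90+", 0),
]


def empty_cells(rows, treatment_levels=(0, 1)):
    """Return the (stratum, x) pairs with no observations in the sample."""
    counts = Counter(rows)
    strata = sorted({stratum for stratum, _ in rows})
    return [
        (stratum, x)
        for stratum in strata
        for x in treatment_levels
        if counts[(stratum, x)] == 0
    ]


print(empty_cells(records))  # [('90+', 1)]
```

Whether a flagged cell reflects an impossible treatment-covariate combination or only sparse data is a subject-matter question that the tabulation cannot answer.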
[example: Smoking and Lung Cancer]
Let $X=1$ indicate smoking, $X=0$ indicate not smoking, let $Y=1$ indicate lung cancer, and let $L$ collect age, occupational exposure, and baseline health variables. The observed smoker-versus-non-smoker association compares
\begin{align*}
\mathbb P(Y=1\mid X=1)
\end{align*}
with
\begin{align*}
\mathbb P(Y=1\mid X=0).
\end{align*}
These condition on the exposure actually observed, so they compare the lung-cancer rate among people who smoke with the lung-cancer rate among people who do not smoke. If the smoker group and non-smoker group have different distributions of $L$, then this comparison mixes smoking status with age, occupational exposure, and baseline health.
The causal comparison instead uses the potential outcomes $Y_1$ and $Y_0$. The risk under the intervention that sets everyone to smoking is
\begin{align*}
\mathbb P(Y_1=1).
\end{align*}
The risk under the intervention that sets everyone to not smoking is
\begin{align*}
\mathbb P(Y_0=1).
\end{align*}
Adjustment for $L$ is justified by conditional exchangeability,
\begin{align*}
(Y_1,Y_0)\perp X\mid L,
\end{align*}
together with consistency and positivity. Conditional exchangeability says that, within a fixed covariate value $L=\ell$, conditioning further on whether $X=1$ or $X=0$ does not change the conditional law of $(Y_1,Y_0)$. Positivity requires that both exposure groups occur on the relevant covariate support:
\begin{align*}
\mathbb P(X=1\mid L=\ell)>0
\end{align*}
and
\begin{align*}
\mathbb P(X=0\mid L=\ell)>0.
\end{align*}
For the smoking intervention, start from the [law of total probability](/theorems/1113) over $L$:
\begin{align*}
\mathbb P(Y_1=1)=\mathbb E[\mathbb P(Y_1=1\mid L)].
\end{align*}
By conditional exchangeability, the conditional distribution of $Y_1$ given $L$ agrees with its conditional distribution given $X=1$ and $L$, so
\begin{align*}
\mathbb E[\mathbb P(Y_1=1\mid L)]=\mathbb E[\mathbb P(Y_1=1\mid X=1,L)].
\end{align*}
By consistency, on the event $\{X=1\}$ the observed outcome satisfies $Y=Y_X=Y_1$, and therefore
\begin{align*}
\mathbb P(Y_1=1\mid X=1,L)=\mathbb P(Y=1\mid X=1,L).
\end{align*}
Substituting this into the previous display gives the standardized formula
\begin{align*}
\mathbb P(Y_1=1)=\mathbb E[\mathbb P(Y=1\mid X=1,L)].
\end{align*}
The same steps identify the non-smoking intervention. First,
\begin{align*}
\mathbb P(Y_0=1)=\mathbb E[\mathbb P(Y_0=1\mid L)].
\end{align*}
By conditional exchangeability,
\begin{align*}
\mathbb E[\mathbb P(Y_0=1\mid L)]=\mathbb E[\mathbb P(Y_0=1\mid X=0,L)].
\end{align*}
By consistency, on the event $\{X=0\}$ the observed outcome satisfies $Y=Y_X=Y_0$, so
\begin{align*}
\mathbb P(Y_0=1\mid X=0,L)=\mathbb P(Y=1\mid X=0,L).
\end{align*}
Thus
\begin{align*}
\mathbb P(Y_0=1)=\mathbb E[\mathbb P(Y=1\mid X=0,L)].
\end{align*}
These two equalities are the binary-outcome case of the *[Conditional Exchangeability Identification Formula](/theorems/9657)*. The adjusted comparison is therefore not just a numerical correction: it rests on the claim that, within levels of the measured covariates $L$, smoking status is independent of the potential lung-cancer outcomes, while positivity ensures that the needed smoker and non-smoker conditional risks are observed across the target covariate support.
[/example]
## Simpson Reversals and the Meaning of Confounding
A central warning in causal inference is that marginal associations can reverse after stratifying by a covariate. Simpson's paradox is not a paradox of probability; it is a warning that the estimand must be specified before data are aggregated.
[definition: Confounding]
Let $X: \Omega \to \mathcal X$ be an exposure, let $Y: \Omega \to \mathcal Y$ be an observed outcome, and let $(Y_x)_{x \in \mathcal X}$ be potential outcomes with $Y_x: \Omega \to \mathcal Y$. For a target class of measurable outcome events $A \subseteq \mathcal Y$ or a target causal contrast $C$, confounding is present when the associational laws $\mathcal L(Y \mid X=x)$ fail to equal the corresponding interventional laws $\mathcal L(Y_x)$ in the way required for that event class or contrast, because treatment assignment is statistically related to the potential outcomes.
[/definition]
This definition ties confounding to a failure of exchangeability. A variable is not a confounder merely because it predicts the outcome; it matters when it also helps explain treatment assignment in a way that distorts the causal comparison.
[example: Simpson's Paradox]
Suppose severity $L$ has two levels, $\text{mild}$ and $\text{severe}$, and let $Y=1$ mean recovery. Within each severity stratum, treatment has the higher recovery probability:
\begin{align*}
\mathbb P(Y=1\mid X=1,L=\text{mild})=0.95>0.90=\mathbb P(Y=1\mid X=0,L=\text{mild}).
\end{align*}
\begin{align*}
\mathbb P(Y=1\mid X=1,L=\text{severe})=0.55>0.50=\mathbb P(Y=1\mid X=0,L=\text{severe}).
\end{align*}
Now suppose treatment is used mostly in severe cases, while non-treatment is used mostly in mild cases:
\begin{align*}
\mathbb P(L=\text{severe}\mid X=1)=0.90,\qquad \mathbb P(L=\text{mild}\mid X=1)=0.10.
\end{align*}
\begin{align*}
\mathbb P(L=\text{severe}\mid X=0)=0.10,\qquad \mathbb P(L=\text{mild}\mid X=0)=0.90.
\end{align*}
Because the two severity strata are exhaustive and disjoint, conditioning on $X=1$ and summing over $L$ gives
\begin{align*}
\mathbb P(Y=1\mid X=1)=\mathbb P(Y=1\mid X=1,L=\text{mild})\mathbb P(L=\text{mild}\mid X=1)+\mathbb P(Y=1\mid X=1,L=\text{severe})\mathbb P(L=\text{severe}\mid X=1).
\end{align*}
Substituting the given conditional probabilities gives
\begin{align*}
\mathbb P(Y=1\mid X=1)=0.95\cdot 0.10+0.55\cdot 0.90.
\end{align*}
The mild-stratum contribution is
\begin{align*}
0.95\cdot 0.10=0.095.
\end{align*}
The severe-stratum contribution is
\begin{align*}
0.55\cdot 0.90=0.495.
\end{align*}
Therefore the marginal treated recovery probability is
\begin{align*}
\mathbb P(Y=1\mid X=1)=0.095+0.495=0.590.
\end{align*}
For the untreated group, the same two-stratum decomposition gives
\begin{align*}
\mathbb P(Y=1\mid X=0)=\mathbb P(Y=1\mid X=0,L=\text{mild})\mathbb P(L=\text{mild}\mid X=0)+\mathbb P(Y=1\mid X=0,L=\text{severe})\mathbb P(L=\text{severe}\mid X=0).
\end{align*}
Substituting the given conditional probabilities gives
\begin{align*}
\mathbb P(Y=1\mid X=0)=0.90\cdot 0.90+0.50\cdot 0.10.
\end{align*}
The mild-stratum contribution is
\begin{align*}
0.90\cdot 0.90=0.810.
\end{align*}
The severe-stratum contribution is
\begin{align*}
0.50\cdot 0.10=0.050.
\end{align*}
Therefore the marginal untreated recovery probability is
\begin{align*}
\mathbb P(Y=1\mid X=0)=0.810+0.050=0.860.
\end{align*}
Combining the two marginal probabilities,
\begin{align*}
\mathbb P(Y=1\mid X=1)=0.590<0.860=\mathbb P(Y=1\mid X=0).
\end{align*}
The marginal association reverses the within-stratum comparisons because the treated group contains many more severe cases, and severe cases have lower recovery probabilities under both treatment levels.
[/example]
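The reversal can be reproduced numerically and compared with a standardized contrast that uses the same severity weights in both arms. The sketch below is a minimal Python illustration. The stratum probabilities are taken from the example; the value $\mathbb P(X=1)=1/2$ is an assumption added here, because the example does not specify the treated share.

```python
from fractions import Fraction as F

# Stratum-specific recovery probabilities P(Y=1 | X=x, L=l) from the example.
p_y = {
    (1, "mild"): F(95, 100), (0, "mild"): F(90, 100),
    (1, "severe"): F(55, 100), (0, "severe"): F(50, 100),
}
# Severity mix within each treatment arm, P(L=l | X=x), from the example.
p_l_given_x = {
    1: {"mild": F(10, 100), "severe": F(90, 100)},
    0: {"mild": F(90, 100), "severe": F(10, 100)},
}
# Assumption not in the example: half of the population is treated.
p_x = {1: F(1, 2), 0: F(1, 2)}


def marginal(x):
    """P(Y=1 | X=x): strata weighted by the arm-specific severity mix."""
    return sum(p_y[(x, s)] * w for s, w in p_l_given_x[x].items())


# Population severity distribution P(L=l), by total probability over X.
p_l = {
    s: sum(p_l_given_x[x][s] * p_x[x] for x in p_x)
    for s in ("mild", "severe")
}


def standardized(x):
    """Sum over strata of P(Y=1 | X=x, L=l) P(L=l): same weights in both arms."""
    return sum(p_y[(x, s)] * p_l[s] for s in p_l)


print(float(marginal(1)), float(marginal(0)))          # 0.59 0.86
print(float(standardized(1)), float(standardized(0)))  # 0.75 0.7
```

With common weights the treated recovery probability is the larger one, in line with the within-stratum comparisons. The standardized contrast has a causal reading only if conditional exchangeability given severity holds, which is an assumption about the causal structure and not something the numbers establish.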
The causal lesson is not that stratified analyses are always correct. The right adjustment set must be justified by the causal structure: conditioning on common causes can remove confounding, while conditioning on variables affected by treatment or on colliders can introduce bias.
[remark: Identification Before Estimation]
Identification asks whether a causal quantity is a functional of the observed probability law. Estimation asks how accurately that functional can be learned from finite data. A precise causal analysis states the estimand, the identification assumptions, and then the statistical estimator.
[/remark]
This chapter leaves us with three layers of notation: observed laws such as $\mathbb P(Y \in A \mid X=x)$, interventional laws such as $\mathbb P(Y \in A \mid do(X=x))$, and potential outcomes such as $Y_x$. The rest of the course develops richer models and graphical criteria for proving when these layers coincide in the ways required for causal conclusions.
Chapter 1 turns informal causal questions into the formal objects of interventional probability and potential outcomes. With that notation in place, the course can move from defining the estimands to building a model in which those counterfactual quantities are generated and related to observed variables.
# 2. Potential Outcomes and the Rubin Framework
This chapter makes the counterfactual language from Chapter 1 concrete by assigning each unit a pair of potential outcomes. The central move is to separate the causal object, which compares outcomes under different treatments for the same unit, from the observed data, which reveal only the outcome under the treatment actually received. The prerequisites are conditional expectation, conditional independence, basic random variables, and the distinction between a statistical estimand and an estimator. We then ask what assumptions let population-level causal contrasts be recovered from observational distributions, and why propensity scores can reduce a high-dimensional adjustment problem to a one-dimensional one.
## Unit-Level Potential Outcomes and Average Effects
The first problem is that a causal effect compares two incompatible states of the same unit. A patient either receives the treatment or does not; a school either adopts a policy or does not. The Rubin framework keeps both counterfactual outcomes in the mathematical object, even though only one is observed.
[definition: Binary Potential Outcomes]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space representing a population of units. Let $T: \Omega \to \{0,1\}$ be a treatment indicator and let $Y(1),Y(0):\Omega \to \mathbb R$ be real-valued random variables. The random variable $Y(1)$ is the outcome under treatment, and $Y(0)$ is the outcome under control. The indicator $T$ plays the same role as the binary exposure $X$ in Chapter 1; in this chapter the symbol $X$ denotes covariates instead.
[/definition]
The pair $(Y(1),Y(0))$ is part of the causal model, not part of the raw observed dataset. To turn this pair into a causal contrast, we need a unit-level quantity that subtracts the untreated outcome from the treated outcome for that same unit.
[definition: Individual Treatment Effect]
For binary potential outcomes $Y(1),Y(0):\Omega\to\mathbb R$, the individual treatment effect is the random variable $\tau:\Omega\to\mathbb R$ defined by
\begin{align*}
\tau(u) := Y(1)(u)-Y(0)(u)
\end{align*}
for $u \in \Omega$.
[/definition]
Individual effects are conceptually primary, but the fundamental problem of causal inference is that $\tau(u)$ is not observed for any unit unless both potential outcomes can somehow be measured on the same unit. This pushes the course toward population summaries, which are estimands rather than unit-level observables.
[definition: Average Treatment Effect]
Assume $Y(1),Y(0) \in L^1(\Omega,\mathcal F,\mathbb P)$. The average treatment effect is
\begin{align*}
\operatorname{ATE} := \mathbb E[Y(1)-Y(0)].
\end{align*}
[/definition]
The ATE averages over the whole target population, including units who in the realised data may or may not receive treatment. Some studies instead ask for the effect among those who were actually treated, because that is the group for which policy reversal, medical continuation, or compensation is most relevant.
[definition: Average Treatment Effect on the Treated]
Assume $Y(1),Y(0) \in L^1(\Omega,\mathcal F,\mathbb P)$ and $\mathbb P(T=1)>0$. The average treatment effect on the treated is
\begin{align*}
\operatorname{ATT} := \mathbb E[Y(1)-Y(0) \mid T=1].
\end{align*}
[/definition]
The ATT differs from the ATE when treatment selection is related to treatment effect heterogeneity. This motivates a conditional estimand: before averaging over the whole population or over the treated subpopulation, we record how the causal contrast varies with pre-treatment covariates.
[definition: Conditional Average Treatment Effect]
Let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a covariate random variable. Assume $Y(1),Y(0) \in L^1(\Omega,\mathcal F,\mathbb P)$. A conditional average treatment effect is a real-valued random variable
\begin{align*}
\operatorname{CATE}_X:\Omega\to\mathbb R
\end{align*}
for which $\operatorname{CATE}_X=\mathbb E[Y(1)-Y(0) \mid X]$ a.s. and there exists a measurable map $g:(E,\mathcal E)\to(\mathbb R,\mathcal B(\mathbb R))$ such that $\operatorname{CATE}_X=g(X)$ a.s.
[/definition]
The CATE records systematic effect heterogeneity across covariate strata. It also supplies a route back to the ATE by iterated expectation, because $\mathbb E[\operatorname{CATE}_X] = \operatorname{ATE}$ whenever the integrability assumptions above hold.
[example: Medical Treatment with Effect Heterogeneity]
Consider a population of patients where $T=1$ denotes receiving a new blood-pressure medication, $Y$ is systolic blood pressure after one month, and $X\in\{\text{mild},\text{severe}\}$ records baseline severity. Let $\tau=Y(1)-Y(0)$, so negative values of $\tau$ mean the medication lowers blood pressure. Suppose
\begin{align*}
c_{\mathrm{sev}}:=\mathbb E[\tau\mid X=\text{severe}] < c_{\mathrm{mild}}:=\mathbb E[\tau\mid X=\text{mild}] < 0.
\end{align*}
Then the severe-patient CATE is more negative than the mild-patient CATE, so the treatment effect is heterogeneous across baseline severity.
If $p=\mathbb P(X=\text{severe})$, then averaging over the two severity strata gives
\begin{align*}
\operatorname{ATE}=\mathbb E[\tau]=p\,c_{\mathrm{sev}}+(1-p)c_{\mathrm{mild}}.
\end{align*}
This number lies between $c_{\mathrm{sev}}$ and $c_{\mathrm{mild}}$, so it records a population average rather than either stratum-specific effect. If severe patients are preferentially treated, write $q=\mathbb P(X=\text{severe}\mid T=1)$ with $q>p$, and assume the mean effect within each severity stratum is the same among treated patients as in the full stratum. Then
\begin{align*}
\operatorname{ATT}=q\,c_{\mathrm{sev}}+(1-q)c_{\mathrm{mild}}.
\end{align*}
Since $c_{\mathrm{mild}}-c_{\mathrm{sev}}>0$, the distance from the severe-patient effect is
\begin{align*}
\operatorname{ATE}-c_{\mathrm{sev}}=(1-p)(c_{\mathrm{mild}}-c_{\mathrm{sev}}).
\end{align*}
For the treated population,
\begin{align*}
\operatorname{ATT}-c_{\mathrm{sev}}=(1-q)(c_{\mathrm{mild}}-c_{\mathrm{sev}}).
\end{align*}
Because $q>p$, we have $1-q<1-p$, so the ATT is closer to the severe-patient effect than the ATE is. Thus the CATE displays the heterogeneity, while the ATE and ATT average it using different population weights.
[/example]
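The weighting argument can be checked with concrete numbers. The sketch below is a minimal Python illustration; the stratum effects and the shares $p$ and $q$ are hypothetical values chosen to satisfy $c_{\mathrm{sev}}<c_{\mathrm{mild}}<0$ and $q>p$, and the ATT line relies on the example's assumption that the stratum mean effects are the same among treated patients.

```python
from fractions import Fraction as F

# Hypothetical inputs (not from the example): effects in mmHg and mix shares.
c_sev, c_mild = F(-12), F(-4)  # stratum CATEs with c_sev < c_mild < 0
p = F(3, 10)                   # P(X = severe) in the whole population
q = F(3, 5)                    # P(X = severe | T = 1), with q > p


def weighted_effect(share_severe):
    """Average the two stratum effects with the given severe-patient share."""
    return share_severe * c_sev + (1 - share_severe) * c_mild


ate = weighted_effect(p)
att = weighted_effect(q)  # uses the example's assumption on treated strata

print(float(ate), float(att))  # -6.4 -8.8
```

Here the ATT is more negative than the ATE because the treated group places more weight on the severe stratum, where the effect is larger in magnitude.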
This example also shows why causal notation must distinguish observed outcomes from potential outcomes. To connect the estimands above to data, we need a consistency equation specifying which potential outcome becomes the observed outcome after treatment assignment.
[definition: Observed Outcome Under Consistency]
For binary treatment $T:\Omega\to\{0,1\}$ and potential outcomes $Y(1),Y(0):\Omega\to\mathbb R$, the observed outcome is the random variable $Y:\Omega\to\mathbb R$ defined by
\begin{align*}
Y := TY(1)+(1-T)Y(0).
\end{align*}
[/definition]
This formula is the bridge from the counterfactual model to data. It says that treatment assignment selects which potential outcome is revealed, while the unrevealed potential outcome remains missing.
## SUTVA and the Meaning of a Treatment
The next problem is that the symbols $Y(1)$ and $Y(0)$ are only meaningful if the treatment labels specify complete regimes. If treatment status hides different doses, delivery methods, clinicians, or spillovers from other units, then the same symbol may combine several causal questions.
[definition: Stable Unit Treatment Value Assumption]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space representing a population of units, let $T:(\Omega,\mathcal F)\to(\{0,1\},2^{\{0,1\}})$ be a binary treatment random variable, and let $Y(1),Y(0):(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be real-valued potential outcome random variables. The stable unit treatment value assumption holds for this tuple when $Y(1)$ and $Y(0)$ are well-defined potential outcomes and, for each $t\in\{0,1\}$, a unit's potential outcome under treatment level $t$ depends only on that unit's assigned treatment level $t$.
[/definition]
SUTVA packages two separate requirements. The first rules out interference across units; the second requires the treatment levels themselves to be sufficiently well specified.
[definition: No Interference]
For units indexed by $i=1,\dots,n$, let $Y_i:\{0,1\}^n\to\mathbb R$ be the assignment-vector potential-outcome map for unit $i$. No interference holds when, for each $i$, there exists a map $Y_i^\ast:\{0,1\}\to\mathbb R$ such that
\begin{align*}
Y_i(t_1,\dots,t_n)=Y_i^\ast(t_i)
\end{align*}
for every assignment vector $(t_1,\dots,t_n)\in\{0,1\}^n$.
[/definition]
No interference is plausible for some laboratory experiments and implausible for settings with contagion, peer effects, network spillovers, or market equilibrium effects. Vaccination, tutoring programmes, and advertising campaigns often require potential outcomes indexed by whole assignment vectors rather than by individual treatment alone.
[example: Vaccination Spillovers]
Let $T_i=1$ mean that person $i$ receives a vaccine, and let $Y_i=1$ indicate infection during a fixed follow-up period. For three people, write the assignment vector as $(t_1,t_2,t_3)\in\{0,1\}^3$, so that $Y_i(t_1,t_2,t_3)$ is the infection outcome for person $i$ under the whole assignment vector. If person $1$ is the unit of interest, then the two assignments
\begin{align*}
(1,0,0)\quad\text{and}\quad (1,1,1)
\end{align*}
both give person $1$ the vaccine, because the first coordinate is $1$ in both vectors. They differ only in the neighbours' assignments: in $(1,0,0)$ the other two people are unvaccinated, while in $(1,1,1)$ they are vaccinated.
No interference would require a one-person potential-outcome map $Y_1^\ast:\{0,1\}\to\mathbb R$ such that
\begin{align*}
Y_1(t_1,t_2,t_3)=Y_1^\ast(t_1)
\end{align*}
for every assignment vector. Applying this requirement to $(1,0,0)$ gives
\begin{align*}
Y_1(1,0,0)=Y_1^\ast(1).
\end{align*}
Applying it to $(1,1,1)$ gives
\begin{align*}
Y_1(1,1,1)=Y_1^\ast(1).
\end{align*}
Since both right-hand sides are the same number, no interference would imply
\begin{align*}
Y_1(1,0,0)=Y_1(1,1,1).
\end{align*}
If vaccination of neighbours lowers the infection risk for person 1, these two potential outcomes can differ, so the displayed equality fails. The two-level notation $Y_1(1)$ therefore hides the neighbour-assignment information, and a study that ignores the spillover may estimate a mixture of direct and indirect effects rather than the individual direct effect.
[/example]
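The check in this example can be sketched as a brute-force search over assignment vectors. The function `satisfies_no_interference` and both outcome maps below are hypothetical; the spillover map is just one invented way in which neighbours' vaccination could matter.

```python
from itertools import product


def satisfies_no_interference(y_i, i, n):
    """True when y_i(t_1, ..., t_n) depends only on coordinate i (0-based)."""
    seen = {}  # own treatment level -> outcome first observed at that level
    for t in product((0, 1), repeat=n):
        value = y_i(t)
        if t[i] in seen and seen[t[i]] != value:
            return False
        seen.setdefault(t[i], value)
    return True


def y1_spillover(t):
    # Hypothetical: person 1 avoids infection only if vaccinated and at
    # least one neighbour is vaccinated as well.
    return 0 if t[0] == 1 and (t[1] + t[2]) >= 1 else 1


def y1_isolated(t):
    # Hypothetical: infection depends on person 1's own vaccination only.
    return 1 - t[0]


print(y1_spillover((1, 0, 0)), y1_spillover((1, 1, 1)))
print(satisfies_no_interference(y1_spillover, 0, 3))
print(satisfies_no_interference(y1_isolated, 0, 3))
```

Under the spillover map the two assignments $(1,0,0)$ and $(1,1,1)$ give different outcomes for person 1, so no single-argument map $Y_1^\ast$ can reproduce it.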
The vaccination example is about other units' assignments, but ambiguity can also arise within a single unit's own treatment label. This motivates defining when a treatment level is well-defined: the intervention attached to each label must be specified enough that it corresponds to a single potential outcome.
[definition: Well-Defined Treatment]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space representing a population of units. Let $(\mathcal T,\mathcal A)$ be the measurable treatment space, let $T:(\Omega,\mathcal F)\to(\mathcal T,\mathcal A)$ be a treatment random variable, and let $t\in\mathcal T$ be a treatment level. The treatment level $t$ is well-defined for the causal question under study when all interventions classified as $t$ induce the same real-valued potential outcome random variable $Y(t):(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$.
[/definition]
This definition is deliberately tied to the question under study. For a trial comparing two specific pill formulations, treatment may be well-defined; for an observational study recording only whether a patient had "usual care," the treatment category may contain versions with different causal consequences.
[remark: Versions of Treatment]
When treatment has versions, the notation can be refined by writing $Y(t,v)$, where $v$ records the version of treatment level $t$. A causal contrast based on $Y(1)-Y(0)$ is then shorthand for a more detailed contrast, and it is interpretable only after the version distribution has been specified.
[/remark]
The framework therefore treats notation as a modelling commitment. Before any identification theorem is applied, the analyst must decide whether the potential outcomes indexed by the treatment variable correspond to interventions that could be assigned and interpreted.
## Ignorability and Identification from Observed Data
The final problem in this chapter is identification: the estimand involves $Y(1)$ and $Y(0)$, while the observed distribution contains $Y$, $T$, and covariates $X$. We need assumptions that connect the missing potential outcomes to observed conditional distributions.
[definition: Ignorability]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $T:(\Omega,\mathcal F)\to(\{0,1\},2^{\{0,1\}})$ be a binary treatment random variable and let $Y(1),Y(0):(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be real-valued potential outcome random variables. Ignorability holds when
\begin{align*}
(Y(1),Y(0)) \perp T.
\end{align*}
[/definition]
Ignorability is the Rubin-framework name for the exchangeability condition used in Chapter 1: treatment assignment carries no information about the pair of potential outcomes. Randomized experiments are designed to make this condition credible, but observational studies usually need covariate adjustment.
[definition: Conditional Ignorability]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $T:(\Omega,\mathcal F)\to(\{0,1\},2^{\{0,1\}})$ be a binary treatment random variable, let $Y(1),Y(0):(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be real-valued potential outcome random variables, and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a covariate random variable. Conditional ignorability holds when
\begin{align*}
(Y(1),Y(0)) \perp T \mid X.
\end{align*}
[/definition]
Conditional ignorability permits treatment assignment to depend on observed covariates, but not on the remaining potential-outcome information after conditioning on those covariates. For identification, this exchangeability condition must be paired with an overlap condition so that both treatment levels are represented in each relevant covariate stratum.
[definition: Positivity]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $T:(\Omega,\mathcal F)\to(\{0,1\},2^{\{0,1\}})$ be a binary treatment random variable and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a covariate random variable. Positivity holds when
\begin{align*}
0 < \mathbb P(T=1 \mid X) < 1
\end{align*}
a.s.
[/definition]
Positivity is needed because conditional means such as $\mathbb E[Y \mid T=1,X]$ and $\mathbb E[Y \mid T=0,X]$ must both be estimable in the covariate strata that occur in the target population. This motivates naming the combined assumption used in the main identification theorem, rather than restating exchangeability and overlap each time.
[definition: Strong Ignorability]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $T:(\Omega,\mathcal F)\to(\{0,1\},2^{\{0,1\}})$ be a binary treatment random variable, let $Y(1),Y(0):(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be real-valued potential outcome random variables, and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a covariate random variable. Strong ignorability holds when conditional ignorability and positivity both hold for this tuple $(Y(1),Y(0),T,X)$.
[/definition]
Strong ignorability is the main identification assumption for the Rubin framework in this chapter. It combines exchangeability with overlap, and it is precisely the condition that lets the missing potential-outcome means be replaced by observed conditional means within covariate strata. This replacement happens in two stages: consistency identifies the observed outcome with the appropriate potential outcome inside each treatment arm, and conditional ignorability says that conditioning on $T$ adds no further potential-outcome information once $X$ is fixed. Positivity then ensures that both observed treatment arms are available in the covariate strata over which the target population is averaged. The next theorem packages these steps into the standard identification formula for the ATE.
[quotetheorem:9659]
[citeproof:9659]
The theorem explains why adjustment works when its assumptions hold: within each covariate stratum, treated and untreated units are comparable, and then the stratum-specific causal contrasts are averaged over the covariate distribution. Each assumption is doing real work. If SUTVA fails because a vaccination changes neighbours' infection risks, then $Y_i(1)$ is not a single outcome indexed only by the treatment assigned to person $i$, so the displayed formula may identify a mixture of direct and spillover effects rather than the advertised ATE. If ignorability fails because doctors prescribe using unrecorded frailty, then treated and untreated patients with the same recorded $X$ can still differ in their untreated potential outcomes, so $\mathbb E[Y\mid T=0,X]$, which by consistency equals $\mathbb E[Y(0)\mid T=0,X]$, need not equal $\mathbb E[Y(0)\mid T=1,X]$ and hence need not equal $\mathbb E[Y(0)\mid X]$. If positivity fails because severe older patients are always treated, then the term $\mathbb E[Y\mid T=0,X]$ cannot be learned for that stratum. The theorem also does not identify unit-level effects, the ATT without the corresponding target-population weighting, or effects for covariate values outside the support of the observed population; it identifies the population ATE under the stated target distribution. These limitations motivate the next step: finding lower-dimensional adjustment summaries that preserve the same identifying assumptions without pretending to repair violations of them.
[example: Covariate Adjustment in a Medical Study]
Suppose $T=1$ denotes receiving a medication, $Y$ is a post-treatment health score, and $X$ records a patient's baseline age-severity stratum. Write the possible strata as a finite set $\mathcal S$, and for $x\in\mathcal S$ define
\begin{align*}
\mu_1(x):=\mathbb E[Y\mid T=1,X=x]
\end{align*}
and
\begin{align*}
\mu_0(x):=\mathbb E[Y\mid T=0,X=x].
\end{align*}
If treatment assignment uses only information contained in $X$, then conditional ignorability says $(Y(1),Y(0))\perp T\mid X$. If every relevant stratum contains both treatment arms, then positivity gives
\begin{align*}
0<\mathbb P(T=1\mid X=x)<1
\end{align*}
for each stratum $x$ with $\mathbb P(X=x)>0$.
For such a stratum, consistency gives
\begin{align*}
\mathbb E[Y\mid T=1,X=x]=\mathbb E[Y(1)\mid T=1,X=x].
\end{align*}
Conditional ignorability then removes the conditioning on $T=1$:
\begin{align*}
\mathbb E[Y(1)\mid T=1,X=x]=\mathbb E[Y(1)\mid X=x].
\end{align*}
The same two steps for controls give
\begin{align*}
\mathbb E[Y\mid T=0,X=x]=\mathbb E[Y(0)\mid X=x].
\end{align*}
Therefore the stratum-specific causal contrast is
\begin{align*}
\mathbb E[Y(1)-Y(0)\mid X=x]=\mu_1(x)-\mu_0(x).
\end{align*}
Averaging over the distribution of $X$ gives
\begin{align*}
\operatorname{ATE}=\sum_{x\in\mathcal S}\mathbb P(X=x)\bigl(\mu_1(x)-\mu_0(x)\bigr).
\end{align*}
In a sample, the corresponding plug-in estimate replaces $\mathbb P(X=x)$ by the empirical fraction of patients in stratum $x$, and replaces $\mu_1(x)$ and $\mu_0(x)$ by the treated and untreated sample means inside that stratum.
If severe older patients are always treated in some stratum $x_0$, then
\begin{align*}
\mathbb P(T=1\mid X=x_0)=1
\end{align*}
and hence
\begin{align*}
\mathbb P(T=0\mid X=x_0)=0.
\end{align*}
There are then no untreated observations from that stratum, so $\mathbb E[Y\mid T=0,X=x_0]$ is not learned from the observed sample. The adjustment formula works only where both treatment arms are represented.
[/example]
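The plug-in estimate described in this example can be sketched as follows. The function `stratified_ate`, the stratum labels, and the eight records are hypothetical; the sketch simply refuses to produce a number when some stratum lacks a treatment arm, mirroring the positivity discussion.

```python
from collections import defaultdict


def stratified_ate(records):
    """Plug-in ATE from (x, t, y) records; raises if a stratum lacks an arm."""
    sums = defaultdict(lambda: [0.0, 0.0])  # x -> [sum of y at t=0, at t=1]
    counts = defaultdict(lambda: [0, 0])    # x -> [n_0(x), n_1(x)]
    for x, t, y in records:
        sums[x][t] += y
        counts[x][t] += 1
    n = len(records)
    ate = 0.0
    for x, (n0, n1) in counts.items():
        if n0 == 0 or n1 == 0:
            raise ValueError(f"positivity fails in stratum {x!r}")
        mu1 = sums[x][1] / n1          # treated sample mean in stratum x
        mu0 = sums[x][0] / n0          # untreated sample mean in stratum x
        ate += (n0 + n1) / n * (mu1 - mu0)  # weight by empirical P(X = x)
    return ate


# Hypothetical records: (stratum, treatment, health score).
records = [
    ("mild", 1, 8.0), ("mild", 1, 10.0), ("mild", 0, 6.0), ("mild", 0, 8.0),
    ("severe", 1, 4.0), ("severe", 1, 6.0), ("severe", 1, 5.0),
    ("severe", 0, 2.0),
]
print(stratified_ate(records))
```

Appending a record such as `("severe-old", 1, 3.0)` creates a stratum with no untreated units, and the sketch then raises instead of silently dropping or extrapolating that stratum.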
High-dimensional covariates create a practical and conceptual difficulty: exact adjustment for $X$ may require comparing units in very small strata. For instance, if $X$ records ten binary pre-treatment covariates, exact stratification already creates up to $2^{10}=1024$ cells; with a few hundred observations, many cells will contain only treated units, only control units, or no units at all. Continuous covariates make the same problem sharper, because exact matches may not exist. Propensity scores solve part of this problem by replacing the covariates with a scalar score that still balances the covariates across treatment arms.
[definition: Propensity Score]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $T:\Omega\to\{0,1\}$ be a binary treatment random variable and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a covariate random variable. A propensity score is a measurable map $e:(E,\mathcal E)\to([0,1],\mathcal B([0,1]))$ such that the random variable $e(X):\Omega\to[0,1]$ is a version of the conditional probability
\begin{align*}
e(X)=\mathbb P(T=1\mid X).
\end{align*}
[/definition]
The propensity score is not a causal parameter by itself. Its role is to summarize the treatment-assignment mechanism, and this motivates the more general notion of a balancing score: a covariate summary that keeps treated and untreated groups comparable with respect to the original covariates.
[definition: Balancing Score]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $T:\Omega\to\{0,1\}$ be a binary treatment random variable and let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a covariate random variable. A balancing score is a measurable map $b:(E,\mathcal E)\to(B,\mathcal B)$ such that the random variable $b(X):\Omega\to B$ satisfies
\begin{align*}
T \perp X \mid b(X).
\end{align*}
[/definition]
A balancing score coarsens the covariate information while preserving the part relevant to treatment assignment. The key question is whether the propensity score actually has this balancing property, and whether ignorability given the full covariate vector descends to ignorability given the score.
[quotetheorem:9660]
[citeproof:9660]
The theorem has two separate hypotheses doing different jobs. First, the score used for adjustment must be a genuine balancing score, and the displayed balancing conclusion is guaranteed for the true propensity score $e(X)=\mathbb P(T=1\mid X)$, not for an arbitrary scalar summary of covariates. If a proposed score is misspecified so that units with the same score can still have different conditional treatment probabilities once $X$ is known, then $T\perp X\mid e(X)$ fails and score strata need not make the original covariates comparable across treatment arms. Second, even a correct balancing score does not create ignorability from missing covariates. Conditional ignorability remains essential: if a patient's unmeasured frailty affects both treatment and outcome, two patients with the same treatment probability may still have different potential-outcome distributions because the relevant confounder was never in $X$. Positivity is also essential: if $e(X)=1$ for a high-risk stratum, then that score stratum contains no controls, so balancing cannot supply the missing untreated outcome distribution. The theorem is therefore a dimension-reduction result inside an already credible adjustment set, not a licence to replace design knowledge by a score model.
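The contrast between a genuine balancing score and an arbitrary covariate summary can be checked exactly on a small discrete population. The three-cell population, the function names, and the coarse summary below are assumptions made for illustration, and the sketch assumes $0<e(x)<1$ in every cell.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical population: cell x -> (P(X = x), P(T = 1 | X = x)).
population = {
    "a": (F(1, 8), F(1, 5)),
    "b": (F(3, 8), F(1, 5)),
    "c": (F(1, 2), F(3, 5)),
}


def covariate_law_by_arm(population, score):
    """Return {(s, t): {x: P(X = x | T = t, score(X) = s)}}."""
    joint = defaultdict(dict)
    for x, (p_x, e_x) in population.items():
        s = score(x)
        joint[(s, 1)][x] = p_x * e_x        # P(X = x, T = 1)
        joint[(s, 0)][x] = p_x * (1 - e_x)  # P(X = x, T = 0)
    return {
        key: {x: m / sum(masses.values()) for x, m in masses.items()}
        for key, masses in joint.items()
    }


def is_balancing(population, score):
    """Check T independent of X given score(X) by comparing the two arms."""
    law = covariate_law_by_arm(population, score)
    return all(law[(s, 1)] == law[(s, 0)] for s in {s for s, _ in law})


def true_score(x):
    return population[x][1]  # e(x) = P(T = 1 | X = x)


def coarse_score(x):
    return 0 if x == "a" else 1  # merges cells with different e(x)


print(is_balancing(population, true_score))
print(is_balancing(population, coarse_score))
```

The coarse summary pools cells `b` and `c`, whose treatment probabilities differ, so treated and untreated units in that pooled stratum have different covariate distributions.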
The next identification formula uses both parts of the theorem in exactly this order. The balancing result justifies replacing high-dimensional covariate strata by propensity-score strata, while the inherited conditional ignorability result keeps the potential-outcome comparison valid after that replacement. Together with consistency and positivity, these facts let the full-covariate adjustment formula be rewritten with $e(X)$ in place of $X$. The point is not that the score introduces a new causal estimand, but that it expresses the same ATE through lower-dimensional observed conditional means.
[quotetheorem:9661]
[citeproof:9661]
This formula is the basis for matching, stratification, inverse-probability weighting, and regression adjustment using the propensity score. Its hypotheses matter in the same way as the full-covariate adjustment theorem: without SUTVA the potential outcomes indexed by treatment level are not the target objects, without conditional ignorability the score-adjusted treated and untreated groups can still differ in unmeasured prognosis, and without positivity some score values contain only one treatment arm. The formula also identifies an average over the population distribution of $e(X)$, not the effect for a particular unit and not effects in score regions absent from the data. In applications, this distinction separates identification from estimation: after the estimand is written in terms of observed conditional distributions, the next task is to estimate or approximate those distributions by matching, subclassification, weighting, or outcome regression while checking overlap and covariate balance.
[example: Exact Matching Versus Propensity Score Stratification]
Suppose $X=(X_1,\dots,X_{20})$ records twenty pre-treatment covariates for each student in a study. Exact matching compares treated and untreated students only inside cells of the form $X=x$. If each $X_j$ were binary, the number of possible exact cells would be
\begin{align*}
|\{0,1\}^{20}|=2^{20}=1{,}048{,}576.
\end{align*}
For a cell $x$, exact matching can use that cell only when both counts
\begin{align*}
n_1(x):=\#\{i:T_i=1,\ X_i=x\}
\end{align*}
and
\begin{align*}
n_0(x):=\#\{i:T_i=0,\ X_i=x\}
\end{align*}
are positive. If $n_1(x)>0$ but $n_0(x)=0$, treated students in that cell have no exact untreated matches; if $n_0(x)>0$ but $n_1(x)=0$, untreated students in that cell have no exact treated matches. With many discrete levels, and especially with continuous covariates where exact equality $X_i=X_j$ is rarely observed, many units can therefore be unmatched by full-vector exact matching.
Propensity score stratification replaces the full vector by
\begin{align*}
e(X)=\mathbb P(T=1\mid X).
\end{align*}
Under strong ignorability given $X$, the *[Rosenbaum-Rubin Propensity Score Theorem](/theorems/9660)* gives
\begin{align*}
T\perp X\mid e(X)
\end{align*}
and
\begin{align*}
(Y(1),Y(0))\perp T\mid e(X).
\end{align*}
For a score value $s$ with both treatment arms represented, consistency gives
\begin{align*}
\mathbb E[Y\mid T=1,e(X)=s]=\mathbb E[Y(1)\mid T=1,e(X)=s].
\end{align*}
Conditional ignorability given $e(X)$ then gives
\begin{align*}
\mathbb E[Y(1)\mid T=1,e(X)=s]=\mathbb E[Y(1)\mid e(X)=s].
\end{align*}
Similarly,
\begin{align*}
\mathbb E[Y\mid T=0,e(X)=s]=\mathbb E[Y(0)\mid e(X)=s].
\end{align*}
Chaining the first two identities for the treated arm and subtracting the control identity yields
\begin{align*}
\mathbb E[Y\mid T=1,e(X)=s]-\mathbb E[Y\mid T=0,e(X)=s]=\mathbb E[Y(1)-Y(0)\mid e(X)=s].
\end{align*}
Thus the scalar score preserves the causal comparison needed for identification, while practical stratification by estimated or binned score values still depends on correct score modelling and on overlap within the score strata.
[/example]
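A small exact calculation can illustrate that averaging score-stratum contrasts recovers the same ATE as averaging covariate-stratum contrasts. The table of cells below is hypothetical, and the sketch assumes consistency and conditional ignorability given $X$, so that $\mathbb E[Y\mid T=t,X=x]$ may be read off as $\mathbb E[Y(t)\mid X=x]$.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical cells: x -> (P(X=x), e(x), E[Y(1) | X=x], E[Y(0) | X=x]).
cells = {
    "a": (F(1, 8), F(1, 5), F(10), F(7)),
    "b": (F(3, 8), F(1, 5), F(6), F(5)),
    "c": (F(1, 2), F(3, 5), F(4), F(1)),
}


def ate_by_covariate(cells):
    """Full-covariate adjustment: sum_x P(x) * (mu_1(x) - mu_0(x))."""
    return sum(p * (m1 - m0) for p, _, m1, m0 in cells.values())


def ate_by_score(cells):
    """Adjustment on e(X): sum_s P(e=s) * (E[Y|T=1,e=s] - E[Y|T=0,e=s])."""
    strata = defaultdict(list)
    for p, e, m1, m0 in cells.values():
        strata[e].append((p, e, m1, m0))
    total = F(0)
    for members in strata.values():
        p_s = sum(p for p, _, _, _ in members)
        # Arm-specific means weight each cell by P(X = x, T = t).
        treated = (sum(p * e * m1 for p, e, m1, _ in members)
                   / sum(p * e for p, e, _, _ in members))
        control = (sum(p * (1 - e) * m0 for p, e, _, m0 in members)
                   / sum(p * (1 - e) for p, e, _, _ in members))
        total += p_s * (treated - control)
    return total


print(ate_by_covariate(cells), ate_by_score(cells))
```

Cells `a` and `b` share the score $1/5$ and are pooled into one stratum, yet both routes give the same population average because the score is constant within the pooled stratum.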
The chapter's conclusion is that the Rubin framework separates three tasks. First, define the causal estimand using potential outcomes and a meaningful treatment intervention. Second, state assumptions such as SUTVA, ignorability, and positivity that connect the estimand to observed data. Third, use covariates or balancing scores to express the estimand in terms of observable conditional distributions.
Potential outcomes give a clear way to define causal effects, but they still need a data-generating story that explains when they can be identified. Structural causal models supply that story by expressing interventions and counterfactuals through equations, which prepares the ground for graphical representations.
# 3. Structural Causal Models
Structural causal models give a probabilistic semantics for causal mechanisms. Chapters 1 and 2 described interventions and potential outcomes as mathematical objects; this chapter builds a model in which those objects are generated by equations. The central idea is that each observed variable is assigned by a structural equation, driven by exogenous noise and by other variables in the system, and an intervention modifies the equations rather than merely conditioning on an event.
The point of this chapter is not to assume that every system is deterministic in a philosophical sense. Instead, an SCM separates unexplained variation into exogenous variables and records which endogenous variables are determined from which others. This lets us define interventional distributions, nested counterfactuals, and potential outcomes in one common language.
## Structural Equations and Recursive Models
What kind of mathematical object can distinguish seeing $A=a$ from setting $A=a$? A joint distribution for $(A,Y)$ cannot do this by itself: it describes which values co-occur, but not which mechanisms would remain unchanged after an external manipulation. Structural equations add this missing mechanism-level information.
[definition: Structural Causal Model]
A structural causal model is a tuple
\begin{align*}
M=(U,V,(\mathcal X_W)_{W\in U\cup V},P_U,(f_X)_{X\in V}),
\end{align*}
where $U$ is a finite set of exogenous variables, $V$ is a finite set of endogenous variables, $\mathcal X_W$ is the state space of variable $W$, $P_U$ is a probability measure on $\prod_{U_i\in U}\mathcal X_{U_i}$, and for each $X\in V$ there is a parent set $\operatorname{pa}(X)\subseteq U\cup V$ together with a measurable structural function
\begin{align*}
f_X:\prod_{W\in \operatorname{pa}(X)}\mathcal X_W\to \mathcal X_X
\end{align*}
defined on the product of the state spaces of those parent variables.
[/definition]
The exogenous variables represent background factors not explained inside the model. The endogenous variables are the variables whose values the model assigns. The parent set $\operatorname{pa}(X)$ records the arguments that actually enter the equation for $X$, so the equation has the form
\begin{align*}
X=f_X(\operatorname{pa}(X)).
\end{align*}
[example: Linear Structural Equation Model With Confounding]
Let $U_A,U_Y$ be real-valued exogenous variables whose joint law $P_U$ may make them dependent, and define
\begin{align*}A=U_A.\end{align*}
\begin{align*}Y=\beta A+U_Y.\end{align*}
Substituting the treatment equation into the outcome equation gives
\begin{align*}Y=\beta U_A+U_Y.\end{align*}
Thus the observational conditional mean, at values where the conditional expectations are defined, satisfies
\begin{align*}\mathbb E[Y\mid A=a]=\mathbb E[\beta A+U_Y\mid A=a].\end{align*}
Since $A=a$ inside the conditional expectation,
\begin{align*}\mathbb E[\beta A+U_Y\mid A=a]=\beta a+\mathbb E[U_Y\mid A=a].\end{align*}
Using $A=U_A$, this is
\begin{align*}\mathbb E[Y\mid A=a]=\beta a+\mathbb E[U_Y\mid U_A=a].\end{align*}
So when the distribution of $U_Y$ among units with $U_A=a$ differs from its population distribution, the observational mean changes with both the causal term $\beta a$ and the selected noise term $\mathbb E[U_Y\mid U_A=a]$.
If second moments exist and $\operatorname{Var}(A)>0$, the observational least-squares slope of $Y$ on $A$ is
\begin{align*}\frac{\operatorname{Cov}(A,Y)}{\operatorname{Var}(A)}=\frac{\operatorname{Cov}(A,\beta A+U_Y)}{\operatorname{Var}(A)}.\end{align*}
By bilinearity of covariance,
\begin{align*}\operatorname{Cov}(A,\beta A+U_Y)=\beta\operatorname{Cov}(A,A)+\operatorname{Cov}(A,U_Y).\end{align*}
Since $\operatorname{Cov}(A,A)=\operatorname{Var}(A)$, the slope is
\begin{align*}\frac{\beta\operatorname{Var}(A)+\operatorname{Cov}(A,U_Y)}{\operatorname{Var}(A)}=\beta+\frac{\operatorname{Cov}(U_A,U_Y)}{\operatorname{Var}(U_A)}.\end{align*}
The extra covariance term is the regression expression of confounding.
Under the intervention $do(A=a)$, the equation $A=U_A$ is replaced by the constant equation $A=a$, while the outcome equation is retained. Therefore
\begin{align*}Y=\beta a+U_Y.\end{align*}
When $\mathbb E[U_Y]$ exists, linearity of expectation gives
\begin{align*}\mathbb E[Y\mid do(A=a)]=\mathbb E[\beta a+U_Y]=\beta a+\mathbb E[U_Y].\end{align*}
The observational calculation conditions on the units whose noise helped produce $A=a$, while the interventional calculation sets $A$ externally and leaves the marginal distribution of $U_Y$ unchanged.
[/example]
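The gap between the observational slope and the interventional effect can be seen in a short simulation. This is only a sketch under added assumptions: Gaussian noise, $\beta=2$, and the particular dependence $U_Y=\rho U_A+\varepsilon$ with $\rho=0.5$ are choices made here so that $\operatorname{Cov}(U_A,U_Y)/\operatorname{Var}(U_A)=\rho$.

```python
import random


def simulate(beta=2.0, rho=0.5, n=50_000, seed=0):
    """Draw (A, Y, U_Y) from A = U_A, Y = beta*A + U_Y with dependent noise."""
    rng = random.Random(seed)
    u_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Assumed source of confounding: U_Y depends on U_A through rho.
    u_y = [rho * ua + rng.gauss(0.0, 1.0) for ua in u_a]
    a = u_a                                          # A = U_A
    y = [beta * ai + uy for ai, uy in zip(a, u_y)]   # Y = beta*A + U_Y
    return a, y, u_y


def ols_slope(x, y):
    """Least-squares slope Cov(x, y) / Var(x) from sample moments."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var


def interventional_mean(a_value, beta, u_y):
    """Mean of Y under do(A = a): A is set, U_Y keeps its marginal law."""
    return sum(beta * a_value + uy for uy in u_y) / len(u_y)


a, y, u_y = simulate()
print("observational slope:", ols_slope(a, y))
print("interventional contrast:",
      interventional_mean(1.0, 2.0, u_y) - interventional_mean(0.0, 2.0, u_y))
```

With these settings the observational slope should land near $\beta+\rho=2.5$, while the contrast between $do(A=1)$ and $do(A=0)$ equals $\beta=2$ up to floating-point rounding.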
This example shows why equations contain more information than an observational law. The same joint distribution for $(A,Y)$ can sometimes be generated by different structural systems, and those systems may disagree about what changes after an intervention. To make the equations solvable without circularity, the course next isolates the recursive case. Cycles are not automatically invalid, but they can stop the equations from defining a probability model: over binary state spaces, the cyclic system $X=Y$ and $Y=X$ has two solutions for each exogenous realization when no other equation fixes either variable, while the one-variable equation $X=1-X$ has no binary solution.
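The two cyclic systems just mentioned are small enough to check by enumeration. The helper `binary_solutions` is a hypothetical brute-force search over binary assignments, included only to make the counts visible.

```python
from itertools import product


def binary_solutions(equations):
    """All 0/1 assignments v with v[name] == f(v) for every equation."""
    names = sorted(equations)
    solutions = []
    for values in product((0, 1), repeat=len(names)):
        v = dict(zip(names, values))
        if all(v[name] == f(v) for name, f in equations.items()):
            solutions.append(v)
    return solutions


two_cycle = {"X": lambda v: v["Y"], "Y": lambda v: v["X"]}  # X = Y, Y = X
self_negation = {"X": lambda v: 1 - v["X"]}                 # X = 1 - X
chain = {"X": lambda v: 1, "Y": lambda v: v["X"]}           # X = 1, Y = X

print(binary_solutions(two_cycle))
print(binary_solutions(self_negation))
print(binary_solutions(chain))
```

Only the acyclic chain has exactly one solution, which is the behaviour the recursive case is designed to guarantee.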
[definition: Recursive Structural Causal Model]
A structural causal model is recursive if there is an ordering $X_1,\dots,X_n$ of the endogenous variables $V$ such that, for every $i$, the structural function $f_{X_i}$ depends only on exogenous variables and on endogenous variables among $X_1,\dots,X_{i-1}$.
[/definition]
Recursiveness is the equation-level analogue of acyclicity. Once the exogenous variables are sampled, the endogenous variables should be generated in order: first $X_1$, then $X_2$, and so on. The first mathematical task is to prove that this informal sequential construction defines a genuine random vector with a probability law.
[quotetheorem:9662]
[citeproof:9662]
This result is the technical reason recursive SCMs can be treated as probability models. Recursiveness matters in two separate ways: it gives an order in which equations can be evaluated, and it prevents a later value from being needed to define an earlier one. Without such a condition, the phrase "the random vector generated by the equations" may be ambiguous or empty; the binary system $X=Y$, $Y=X$ gives two compatible endogenous vectors, while $X=1-X$ gives none. The theorem therefore does not say that every nonrecursive SCM is ill-behaved, and it does not identify the structural equations from the observational law. It only says that a recursive system with measurable structural functions induces a well-defined observational probability measure.
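The sequential construction behind this result can be sketched for a single exogenous draw. The function `solve_recursive` and the three-variable model are hypothetical; the sketch illustrates evaluation in a recursive order and does not address measurability.

```python
def solve_recursive(order, parents, functions, u):
    """Evaluate endogenous variables in the given order for one draw u."""
    values = dict(u)  # start from the exogenous realization
    for name in order:
        missing = [p for p in parents[name] if p not in values]
        if missing:
            raise ValueError(f"{name} needs {missing} before they are defined")
        values[name] = functions[name](*(values[p] for p in parents[name]))
    return {name: values[name] for name in order}


# Hypothetical recursive model: L = U_L, A = 1{L + U_A > 0}, Y = 2A + L + U_Y.
parents = {"L": ["U_L"], "A": ["L", "U_A"], "Y": ["A", "L", "U_Y"]}
functions = {
    "L": lambda u_l: u_l,
    "A": lambda l, u_a: 1 if l + u_a > 0 else 0,
    "Y": lambda a, l, u_y: 2 * a + l + u_y,
}
u = {"U_L": 1, "U_A": -2, "U_Y": 3}

print(solve_recursive(["L", "A", "Y"], parents, functions, u))
```

Passing an order such as `["A", "L", "Y"]` makes the sketch raise, because $A$ would need the value of $L$ before $L$ has been generated.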
With this probability-model semantics in place, the next question is how an external intervention modifies such a model.
## Interventions as Equation Replacement
Why is $P(Y\mid A=a)$ usually not the same object as $P(Y\mid do(A=a))$? Conditioning restricts attention to units whose assigned treatment happens to be $a$; intervention replaces the assignment mechanism for $A$ and leaves the other mechanisms fixed. SCMs express this distinction by replacing equations.
[definition: Atomic Intervention]
Let $M$ be an SCM with endogenous variables $V$, and let $A\in V$. For $a\in\mathcal X_A$, the atomic intervention $do(A=a)$ forms a new structural model $M_{do(A=a)}$ by replacing the equation for $A$ with the constant equation
\begin{align*}
A=a,
\end{align*}
and leaving all other structural equations and the exogenous distribution $P_U$ unchanged.
[/definition]
Thus the intervened object has the same exogenous variables, the same endogenous variables, the same state spaces, and the same exogenous law as $M$; only the structural function for $A$ is changed to the constant map from the relevant parent product to $\mathcal X_A$ with value $a$.
The phrase "leaving all other equations unchanged" is the modularity assumption built into SCMs. It says that forcing treatment does not by itself alter the outcome mechanism and that mechanisms upstream of treatment are not rewritten. The values of variables downstream of treatment may still change, but only because their parent value changed, not because their equations did. Since realistic interventions may set several variables at once, we need the multi-variable version of the same operation.
[definition: Truncated Structural System]
Let $I\subset V$ and let $x_I=(x_i)_{X_i\in I}\in\prod_{X_i\in I}\mathcal X_{X_i}$. The truncated structural system $M_{do(I=x_I)}$ is obtained from $M$ by replacing, for every $X_i\in I$, the equation for $X_i$ by $X_i=x_i$, and retaining the structural equations for all variables in $V\setminus I$.
[/definition]
Formally, if
\begin{align*}
M=(U,V,(\mathcal X_W)_{W\in U\cup V},P_U,(f_X)_{X\in V}),
\end{align*}
then $M_{do(I=x_I)}$ has the same $U$, $V$, state spaces, and $P_U$, and has structural functions $(f_X^{do(I=x_I)})_{X\in V}$ where $f_X^{do(I=x_I)}$ is the constant map with value $x_i$ for $X=X_i\in I$ and $f_X^{do(I=x_I)}=f_X$ for $X\notin I$.
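Equation replacement can be sketched by representing the structural functions as a dictionary and swapping in constant maps. The helper names `solve` and `do` and the three-variable model are assumptions for illustration; the constant map keeps the original parent arguments and ignores them, matching the description of a constant map on the parent product.

```python
def solve(order, parents, functions, u):
    """Evaluate a recursive system in order for one exogenous draw u."""
    values = dict(u)
    for name in order:
        values[name] = functions[name](*(values[p] for p in parents[name]))
    return {name: values[name] for name in order}


def do(functions, intervention):
    """Structural functions of M_do(I = x_I); the input dict is not mutated."""
    new_functions = dict(functions)
    for name, value in intervention.items():
        if name not in functions:
            raise KeyError(f"{name} is not an endogenous variable")
        # Constant map: accepts the parent values and returns x_i regardless.
        new_functions[name] = lambda *_parents, value=value: value
    return new_functions


# Hypothetical model: L = U_L, A = 1{L + U_A > 0}, Y = 2A + L + U_Y.
order = ["L", "A", "Y"]
parents = {"L": ["U_L"], "A": ["L", "U_A"], "Y": ["A", "L", "U_Y"]}
functions = {
    "L": lambda u_l: u_l,
    "A": lambda l, u_a: 1 if l + u_a > 0 else 0,
    "Y": lambda a, l, u_y: 2 * a + l + u_y,
}
u = {"U_L": 1, "U_A": -2, "U_Y": 3}

before = solve(order, parents, functions, u)
after = solve(order, parents, do(functions, {"A": 1}), u)
print(before, after)
```

For this draw the upstream value of $L$ is identical in both systems, the value of $A$ is forced to $1$, and $Y$ changes only through its retained equation.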
The word truncated refers to removing the original assignment mechanisms for the intervened variables. In graph language, the incoming arrows into the intervened variables are cut, while their outgoing influence through the remaining equations is retained. The course proves that the truncated system still produces a probability law whenever the original model was recursive.
[quotetheorem:9663]
[citeproof:9663]
This theorem is the SCM version of the idea that a randomized trial creates a new probability law. The law is not produced by conditioning within $P_M$; it is produced by a new model with different equations. Conditioning may agree with intervention under special assumptions, but the definitions are distinct.
The hypotheses again do real work. Recursiveness makes the truncated equations solvable in the inherited order after some assignments have been replaced by constants, and measurability of the structural functions makes the resulting solution map a random vector rather than only a pointwise construction. A nonrecursive system can fail after intervention even when every displayed equation is total and measurable: over binary state spaces, suppose an intervention fixes $X=0$ while a retained cyclic block satisfies $Z=W$ and $W=Z$. The intervened system has two compatible solutions for the retained block, $(Z,W)=(0,0)$ and $(Z,W)=(1,1)$, so it does not define a unique random vector without an additional selection rule. The theorem therefore does not claim that equation replacement is automatically well-defined for arbitrary cyclic systems. It applies to recursive SCMs with total [measurable functions](/page/Measurable%20Functions), and it gives existence of the interventional law, not identification of that law from observational data.
[example: Deterministic Threshold Treatment Model]
Let $U,L,E$ be real-valued exogenous variables, with $L$ a pre-treatment covariate and $E$ outcome noise. The treatment and outcome equations are
\begin{align*}A=\mathbb 1_{\{L+U>0\}}.\end{align*}
\begin{align*}Y=\alpha A+\gamma L+E.\end{align*}
We compare conditioning on the observed treatment value with replacing the treatment equation by an intervention.
Because $A$ is the indicator of the event $\{L+U>0\}$, the event $\{A=1\}$ is exactly
\begin{align*}\{A=1\}=\{L+U>0\}.\end{align*}
On this event, the outcome equation becomes
\begin{align*}Y=\alpha\cdot 1+\gamma L+E.\end{align*}
Thus
\begin{align*}Y=\alpha+\gamma L+E\quad\text{on }\{A=1\}.\end{align*}
At values where the conditional expectations are defined,
\begin{align*}\mathbb E[Y\mid A=1]=\mathbb E[\alpha+\gamma L+E\mid L+U>0].\end{align*}
By linearity of conditional expectation,
\begin{align*}\mathbb E[Y\mid A=1]=\alpha+\gamma\mathbb E[L\mid L+U>0]+\mathbb E[E\mid L+U>0].\end{align*}
So the observational treated mean depends not only on the outcome equation, but also on the distribution of $(L,E)$ among units selected by the threshold $L+U>0$.
Under $do(A=1)$, the structural equation $A=\mathbb 1_{\{L+U>0\}}$ is replaced by the constant equation
\begin{align*}A=1.\end{align*}
The outcome equation is retained, so substituting the intervened value gives
\begin{align*}Y=\alpha\cdot 1+\gamma L+E.\end{align*}
Equivalently,
\begin{align*}Y=\alpha+\gamma L+E\quad\text{under }do(A=1).\end{align*}
When the expectations exist, the interventional mean is
\begin{align*}\mathbb E[Y\mid do(A=1)]=\mathbb E[\alpha+\gamma L+E].\end{align*}
By linearity of expectation,
\begin{align*}\mathbb E[Y\mid do(A=1)]=\alpha+\gamma\mathbb E[L]+\mathbb E[E].\end{align*}
The observational calculation uses the threshold-selected distribution given $L+U>0$, while the intervention sets treatment for every unit and leaves the marginal distribution of the exogenous variables unchanged.
[/example]
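A simulation sketch can make the gap between the two means visible. It adds assumptions that the example does not make: $L$, $U$, $E$ are taken to be independent standard normal, with $\alpha=1$ and $\gamma=2$. Under those assumptions $\mathbb E[Y\mid do(A=1)]=1$, while $\mathbb E[L\mid L+U>0]=1/\sqrt{\pi}$ gives $\mathbb E[Y\mid A=1]=1+2/\sqrt{\pi}\approx 2.13$.

```python
import random


def simulate_threshold(alpha=1.0, gamma=2.0, n=50_000, seed=1):
    """Return (mean of Y given A=1, mean of Y under do(A=1))."""
    rng = random.Random(seed)
    treated_outcomes = []  # Y for units whose own equation produced A = 1
    forced_outcomes = []   # Y for every unit after replacing A by 1
    for _ in range(n):
        # Assumption: independent standard normal exogenous variables.
        l, u, e = (rng.gauss(0.0, 1.0) for _ in range(3))
        a = 1 if l + u > 0 else 0          # A = 1{L + U > 0}
        y = alpha * a + gamma * l + e      # Y = alpha*A + gamma*L + E
        if a == 1:
            treated_outcomes.append(y)
        forced_outcomes.append(alpha * 1 + gamma * l + e)  # do(A = 1)
    conditional = sum(treated_outcomes) / len(treated_outcomes)
    interventional = sum(forced_outcomes) / len(forced_outcomes)
    return conditional, interventional


conditional, interventional = simulate_threshold()
print("E[Y | A=1]     ~", conditional)
print("E[Y | do(A=1)] ~", interventional)
```

The conditional mean averages over the threshold-selected units only, whereas the interventional mean averages the retained outcome equation over all units.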
The threshold model also previews identification questions. The SCM defines both $P(Y\mid A=1)$ and $P(Y\mid do(A=1))$, but observational data alone need not reveal the second from the first. Later chapters use graphs and adjustment criteria to decide when such recovery is possible.
## Counterfactuals by Abduction, Action, and Prediction
Interventions define population laws after a manipulation. Counterfactuals ask a sharper unit-level question: after observing information about a unit, what would the model say about that same unit under a different intervention? SCMs answer this through a three-step procedure: infer exogenous information from observations, modify equations, then compute the target variable.
[definition: Counterfactual Variable]
Let $M$ be an SCM, let $I\subset V$, and let $x_I$ be an intervention value. Suppose the intervened model $M_{do(I=x_I)}$ has a measurable solution map
\begin{align*}
F_{M,x_I}:\prod_{U_i\in U}\mathcal X_{U_i}\to \prod_{X\in V}\mathcal X_X.
\end{align*}
For an endogenous variable $Y\in V$, the counterfactual variable $Y_{x_I}$ is the measurable map
\begin{align*}
Y_{x_I}:\prod_{U_i\in U}\mathcal X_{U_i}\to \mathcal X_Y
\end{align*}
obtained by composing $F_{M,x_I}$ with the coordinate projection onto $\mathcal X_Y$.
[/definition]
This definition makes counterfactuals random variables on the same exogenous probability space as the observed variables. The same background realization drives both the actual-world value and the hypothetical-world value, which is why joint statements about the pair $(Y_0,Y_1)$ are meaningful inside an SCM even when only one outcome is observed for each unit. To use counterfactuals after observing a unit, however, the exogenous distribution must first be updated by that observation.
[definition: Abduction Action Prediction]
For an SCM $M$, observed event $E\subseteq\prod_{X\in V}\mathcal X_X$ for which the conditional law of the exogenous variables given that the original model generates $E$ is defined, intervention $do(I=x_I)$, and target variable $Y$, the abduction-action-prediction procedure consists of:
1. replacing $P_U$ by the conditional distribution of exogenous variables given the event that the original model generates $E$;
2. replacing the equations for variables in $I$ by $X_i=x_i$;
3. computing the distribution of $Y$ in the modified model under the updated exogenous distribution.
[/definition]
For elementary finite or discrete models, this usually means conditioning on a positive-probability event. In standard Borel state spaces, the same language is interpreted through a chosen regular conditional distribution.
The abduction step is where observed evidence about a unit enters. If a patient had unusually high observed outcome under treatment, the posterior distribution of that patient's background noise may differ from the population distribution, and the counterfactual prediction under no treatment should use that updated information.
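For a finite model the three steps reduce to filtering and renormalizing a table of exogenous probabilities. The sketch below uses a hypothetical binary SCM that does not appear in the text, with $A=U_A$ and $Y=\max(A,U_Y)$; the function names `solve` and `counterfactual_prob` and the exogenous probabilities are likewise assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical binary SCM (illustration only): A = U_A, Y = max(A, U_Y).
def solve(u, do_a=None):
    """Solve the model for one exogenous realization u = (u_a, u_y)."""
    u_a, u_y = u
    a = u_a if do_a is None else do_a  # action: replace the equation for A
    y = max(a, u_y)                    # the outcome equation is retained
    return {"A": a, "Y": y}

def counterfactual_prob(p_u, evidence, do_a, target, value):
    # Step 1 (abduction): condition P_U on the event that the ORIGINAL
    # model generates the observed evidence.
    posterior = {u: p for u, p in p_u.items()
                 if all(solve(u)[k] == v for k, v in evidence.items())}
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("evidence has probability zero in this model")
    # Steps 2-3 (action, prediction): solve the intervened model under
    # the updated exogenous distribution.
    hit = sum(p for u, p in posterior.items()
              if solve(u, do_a=do_a)[target] == value)
    return hit / total

# Assumed exogenous law: U_A ~ Bernoulli(1/2), U_Y ~ Bernoulli(1/4), independent.
p_u = {(u_a, u_y): Fraction(1, 2) * (Fraction(1, 4) if u_y else Fraction(3, 4))
       for u_a, u_y in product((0, 1), repeat=2)}

# P(Y_0 = 1 | A = 1, Y = 1): treated units with a positive outcome.
print(counterfactual_prob(p_u, {"A": 1, "Y": 1}, do_a=0, target="Y", value=1))
```

In this toy model the evidence $A=1$, $Y=1$ carries no information about $U_Y$, so the counterfactual probability equals the prior $\mathbb P(U_Y=1)$. Evidence such as $A=0$, $Y=1$ would instead pin down $U_Y=1$ exactly.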
[example: Counterfactual Prediction in a Linear SCM]
Consider the linear SCM
\begin{align*}A=U_A.\end{align*}
\begin{align*}Y=\beta A+U_Y.\end{align*}
Suppose a unit is observed with $A=a$ and $Y=y$. The first equation gives
\begin{align*}U_A=A.\end{align*}
Therefore the observation $A=a$ implies
\begin{align*}U_A=a.\end{align*}
Substituting $A=a$ into the outcome equation gives
\begin{align*}y=\beta a+U_Y.\end{align*}
Solving this equation for $U_Y$ gives
\begin{align*}U_Y=y-\beta a.\end{align*}
Under the action $do(A=a')$, the treatment equation is replaced by the constant equation
\begin{align*}A=a'.\end{align*}
The outcome equation is retained, so the counterfactual outcome for the same exogenous realization is
\begin{align*}Y_{a'}=\beta a'+U_Y.\end{align*}
Using the abducted value $U_Y=y-\beta a$, we get
\begin{align*}Y_{a'}=\beta a'+(y-\beta a).\end{align*}
By associativity and commutativity of addition,
\begin{align*}Y_{a'}=y+\beta a'-\beta a.\end{align*}
Factoring out $\beta$ gives
\begin{align*}Y_{a'}=y+\beta(a'-a).\end{align*}
Thus, in the deterministic-noise interpretation, the observed unit keeps the same inferred background outcome noise $U_Y=y-\beta a$, and changing treatment from $a$ to $a'$ changes the predicted outcome by $\beta(a'-a)$.
[/example]
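The three steps of the linear example translate directly into a few lines of code. This is a minimal sketch; the function name `counterfactual_linear` and the numerical inputs are hypothetical.

```python
def counterfactual_linear(beta, a_obs, y_obs, a_new):
    """Counterfactual outcome Y_{a_new} for a unit observed at (a_obs, y_obs)
    in the linear SCM A = U_A, Y = beta * A + U_Y."""
    # Abduction: recover the unit's outcome noise from the observation.
    u_y = y_obs - beta * a_obs
    # Action: the treatment equation is replaced by A = a_new.
    # Prediction: evaluate the retained outcome equation with the same noise.
    return beta * a_new + u_y

# Hypothetical unit: beta = 2, observed (a, y) = (1, 5), counterfactual a' = 3.
print(counterfactual_linear(beta=2.0, a_obs=1.0, y_obs=5.0, a_new=3.0))  # 9.0
```

The result agrees with the closed form $y+\beta(a'-a)=5+2\cdot(3-1)=9$, and setting `a_new` equal to `a_obs` returns the observed outcome.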
The linear example shows how SCM counterfactuals retain unit-level information not present in a marginal interventional mean. The potential outcome notation from Chapter 2 can now be recovered by naming these counterfactual variables.
[quotetheorem:9664]
[citeproof:9664]
This is the bridge between the Rubin framework and structural causal models. Potential outcomes can be treated as primitive random variables, or they can be generated from a structural system. The structural approach adds a disciplined way to discuss multiple interventions, nested counterfactuals, and compatibility with causal graphs.
The assumption that the original and intervened systems have unique solutions is essential for this statement to have a definite meaning. In the cyclic binary system $A=Y$ and $Y=A$, the event $A=0$ is compatible with the solution $(A,Y)=(0,0)$ and the event $A=1$ is compatible with $(1,1)$; without a rule selecting one solution from the same exogenous realization, the observed value and the counterfactual value are not functions on a common probability space. The shared exogenous realization is equally important: comparing the observed $Y$ from one background draw with $Y(a)$ from an independent background draw would describe two different units, not the consistency assertion for the same unit. Thus the theorem proves consistency for recursive SCMs, or more generally for SCMs where the original and relevant intervened systems admit compatible unique measurable solution maps; it does not grant consistency to arbitrary cyclic equations.
Consistency also fixes what later identification assumptions are trying to recover. In a randomized trial, random assignment is designed so that the observed treated group represents the distribution of $Y(1)$ and the observed control group represents the distribution of $Y(0)$; the consistency equality supplies the link between the recorded outcome and the relevant potential outcome within each arm. In observational studies, the same equality is only the starting point: exchangeability, positivity, and graphical adjustment criteria are the additional conditions that decide whether the interventional distribution defined by the SCM can be expressed using the observed law.
[remark: Consistency Is Not Identification]
The equality $Y=Y(A)$ says that the observed outcome is the potential outcome under the treatment actually received. It does not say that the distribution of $Y(a)$ can be learned from the observed conditional distribution of $Y$ among units with $A=a$. Identification still requires assumptions such as exchangeability, positivity, or graphical adjustment conditions.
[/remark]
The chapter ends with three objects now in place: observational laws from the original equations, interventional laws from equation replacement, and counterfactual variables from shared exogenous realizations across modified models. These are the formal targets behind statistical identification practice: data analysis observes samples from the observational law, while causal questions ask about interventional or counterfactual laws. The next part of the course adds directed acyclic graphs as a compact language for reading conditional independences and intervention effects from the qualitative structure of an SCM.
Structural causal models make the causal mechanisms explicit, and directed acyclic graphs distill that structure into a compact visual language. Once the graph is available, the course can ask which statistical relations follow from the causal assumptions encoded in it.
# 4. Directed Acyclic Graphs and Markov Factorization
This chapter turns causal models into graph-theoretic objects. Chapters 1 through 3 described interventions, potential outcomes, and structural equations; here we introduce directed acyclic graphs as a language for recording which variables are taken to be direct causes of which other variables. The main question is how a graph constrains the observational distribution, and which probabilistic independences should be expected from the graph alone.
The chapter has two complementary themes. First, the Causal Markov condition turns a qualitative causal ordering into a factorization of the joint law. Second, faithfulness and minimality describe when the graphical independences are informative about the distribution, and when algebraic cancellations or missing arrows limit what the graph can tell us.
## Graphical Vocabulary for Causal Models
A causal diagram is meant to separate direct causal input from indirect association. Before using a graph to make probabilistic claims, we need precise language for paths, parent sets, and special nodes on paths.
[definition: Directed Acyclic Graph]
A directed graph $G=(V,E)$ consists of a finite vertex set $V$ and an edge set $E \subset V \times V$ whose elements are ordered pairs. A directed edge $(u,v) \in E$ is written $u \to v$. A directed path from $v_0$ to $v_k$ is a sequence $v_0,\dots,v_k$ of distinct vertices such that $v_{i-1} \to v_i$ for $1 \le i \le k$. A directed cycle is a sequence $v_0 \to v_1 \to \dots \to v_k$ with $k \ge 1$ and $v_k=v_0$; a self-loop $v \to v$ is the case $k=1$. A directed acyclic graph, or DAG, is a directed graph with no directed cycle, so that no vertex can be reached from itself by following one or more arrows.
[/definition]
Acyclicity encodes the idea that variables can be arranged in a causal order, so that direct causes occur before their effects in the graph. This does not require chronological time in a literal sense, but it rules out feedback loops at the level of the variables represented in the DAG.
[example: Treatment Outcome DAG]
Let $V=\{L,A,Y\}$, where $L$ is a pre-treatment covariate, $A$ is treatment, and $Y$ is outcome, and take
\begin{align*}
E=\{(L,A),(L,Y),(A,Y)\}.
\end{align*}
Thus the arrows are exactly $L \to A$, $L \to Y$, and $A \to Y$: $L$ is a direct cause of both treatment and outcome, and $A$ is a direct cause of the outcome.
The directed paths starting at $L$ are $L \to A$, $L \to Y$, and $L \to A \to Y$, because the first step from $L$ can go to $A$ or $Y$, and from $A$ there is the arrow $A \to Y$. Hence $L$ is an ancestor of $A$ and of $Y$, and $A$ is also an ancestor of $Y$. There is no directed path from $A$ back to $L$, from $Y$ back to $A$, or from $Y$ back to $L$, since no arrow leaves $Y$ and no arrow points into $L$. Therefore no vertex lies on a directed path back to itself, so this graph is acyclic.
[/example]
The treatment-outcome example already uses two different notions of causal relatedness: direct input, as in $A \to Y$, and indirect reachability, as in $L \to A \to Y$. To state Markov conditions later, we must distinguish direct causes from more remote upstream variables and from downstream effects.
[definition: Parents Ancestors And Descendants]
Let $G=(V,E)$ be a DAG and let $v \in V$. The parent set of $v$ is
\begin{align*}
\operatorname{pa}_G(v) = \{u \in V : u \to v\}.
\end{align*}
The child set of $v$ is
\begin{align*}
\operatorname{ch}_G(v) = \{w \in V : v \to w\}.
\end{align*}
A vertex $u$ is an ancestor of $v$ if $u=v$ or there is a directed path from $u$ to $v$. A vertex $w$ is a descendant of $v$ if $w=v$ or there is a directed path from $v$ to $w$. The corresponding sets are denoted by $\operatorname{an}_G(v)$ and $\operatorname{de}_G(v)$.
[/definition]
Including each vertex as its own ancestor and descendant is convenient when statements involve conditioning sets or induced subgraphs. When the graph is understood, we often omit the subscript $G$.
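These set operations are easy to compute from an edge list. The sketch below represents a DAG as a set of ordered pairs and follows the convention above that every vertex is its own ancestor and descendant; the function names are illustrative rather than taken from any library.

```python
def parents(edges, v):
    """Vertices u with an arrow u -> v."""
    return {u for (u, w) in edges if w == v}

def children(edges, v):
    """Vertices w with an arrow v -> w."""
    return {w for (u, w) in edges if u == v}

def descendants(edges, v):
    """Vertices reachable from v along directed paths, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        node = stack.pop()
        for child in children(edges, node):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def ancestors(edges, v):
    """Ancestors of v are the descendants of v in the reversed graph."""
    reversed_edges = {(w, u) for (u, w) in edges}
    return descendants(reversed_edges, v)

# Treatment-outcome DAG from the earlier example: L -> A, L -> Y, A -> Y.
EDGES = {("L", "A"), ("L", "Y"), ("A", "Y")}
print(sorted(parents(EDGES, "Y")))      # ['A', 'L']
print(sorted(ancestors(EDGES, "Y")))    # ['A', 'L', 'Y']
print(sorted(descendants(EDGES, "A")))  # ['A', 'Y']
```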
[remark: Topological Order]
Every finite DAG admits an ordering $V=\{v_1,\dots,v_n\}$ such that all arrows point from earlier vertices to later vertices. Such an ordering is called a topological order.
[/remark]
The topological order is the bridge between graphical structure and probability. It permits us to write the joint distribution by successively conditioning on earlier variables, and then to ask which of those earlier variables are really needed.
[quotetheorem:9665]
[citeproof:9665]
The finiteness and acyclicity hypotheses are doing real work here. A directed cycle such as $X \to Y \to Z \to X$ cannot be placed in an order where every arrow points forward, and an infinite acyclic graph may require extra ordering arguments not covered by this finite theorem. The theorem also does not say that the topological order is unique; for instance, two unrelated source vertices may be swapped. Its role is structural rather than causal by itself: once we have such an order, the probabilistic chain rule can be aligned with the graph in the factorization theorem below.
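A topological order can be computed by repeatedly removing vertices that have no remaining incoming arrows. The sketch below uses this standard procedure (often called Kahn's algorithm), which is one of several possible constructions and is not necessarily the argument used in the proof above; the function name is hypothetical.

```python
from collections import deque

def topological_order(vertices, edges):
    """Return one topological order of a finite directed graph,
    or raise ValueError if the graph contains a directed cycle."""
    indegree = {v: 0 for v in vertices}
    for _, w in edges:
        indegree[w] += 1
    # Start from the source vertices (no incoming arrows).
    queue = deque(sorted(v for v in indegree if indegree[v] == 0))
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        # Removing v deletes its outgoing arrows.
        for (u, w) in sorted(edges):
            if u == v:
                indegree[w] -= 1
                if indegree[w] == 0:
                    queue.append(w)
    if len(order) != len(indegree):
        raise ValueError("graph contains a directed cycle")
    return order

print(topological_order({"L", "A", "Y"}, {("L", "A"), ("L", "Y"), ("A", "Y")}))
# ['L', 'A', 'Y']
```

Applied to the cycle $X \to Y \to Z \to X$, no vertex ever has zero remaining incoming arrows, so the function raises `ValueError`, matching the remark that a cyclic graph has no such order.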
Topological order describes arrows, but conditional independence depends on undirected paths as well. The next vocabulary separates the three path patterns that determine whether conditioning blocks or opens association.
[definition: Chain Fork And Collider]
Let $G$ be a DAG. A path segment on three distinct vertices is a chain if it has the form $X \to Z \to Y$ or $X \leftarrow Z \leftarrow Y$. It is a fork if it has the form $X \leftarrow Z \to Y$. It is a collider if it has the form $X \to Z \leftarrow Y$.
[/definition]
Chains and forks transmit association unless the middle variable is conditioned on. Colliders behave in the opposite way: without conditioning, the two incoming sides need not be associated, while conditioning on the collider or on information downstream of it can induce association.
[example: Collider Bias]
Let $X$ and $Y$ be independent Bernoulli variables with $\mathbb P(X=1)=\mathbb P(Y=1)=1/2$, and let admission be $S=1$ exactly when at least one of $X$ or $Y$ equals $1$. This is a concrete distribution compatible with the collider graph $X \to S \leftarrow Y$: the two causes $X$ and $Y$ are marginally independent, and $S$ is determined by their joint values.
In the full population,
\begin{align*}
\mathbb P(X=1,Y=1)=\frac14=\frac12\cdot\frac12=\mathbb P(X=1)\mathbb P(Y=1).
\end{align*}
Thus $X$ and $Y$ have no marginal association. Now condition on admission. Since $S=1$ excludes only the case $(X,Y)=(0,0)$,
\begin{align*}
\mathbb P(S=1)=1-\mathbb P(X=0,Y=0)=1-\frac14=\frac34.
\end{align*}
The conditional probability that both causes are present among admitted individuals is
\begin{align*}
\mathbb P(X=1,Y=1\mid S=1)=\frac{\mathbb P(X=1,Y=1,S=1)}{\mathbb P(S=1)}=\frac{1/4}{3/4}=\frac13.
\end{align*}
Also,
\begin{align*}
\mathbb P(X=1\mid S=1)=\frac{\mathbb P(X=1,S=1)}{\mathbb P(S=1)}=\frac{1/2}{3/4}=\frac23.
\end{align*}
By the same calculation, $\mathbb P(Y=1\mid S=1)=2/3$, so
\begin{align*}
\mathbb P(X=1\mid S=1)\mathbb P(Y=1\mid S=1)=\frac23\cdot\frac23=\frac49.
\end{align*}
Because $\frac13 \ne \frac49$, the variables $X$ and $Y$ are not independent after conditioning on $S=1$. In fact,
\begin{align*}
\mathbb P(Y=1\mid X=1,S=1)=\frac{\mathbb P(X=1,Y=1\mid S=1)}{\mathbb P(X=1\mid S=1)}=\frac{1/3}{2/3}=\frac12,
\end{align*}
whereas
\begin{align*}
\mathbb P(Y=1\mid X=0,S=1)=1.
\end{align*}
Among admitted individuals, learning that $X=1$ lowers the probability that $Y=1$ from $1$ to $1/2$, so conditioning on the common effect $S$ creates a negative association that was absent in the full population.
[/example]
The collider example is the first warning that adjustment is not the same as conditioning on every observed variable. Graphs will let us state which conditioning operations block paths and which open them.
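The arithmetic of the collider example can be reproduced by exact enumeration over the four equally likely values of $(X,Y)$. This is a small sketch using rational arithmetic; the helper names `prob` and `cond_prob` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Joint law of (X, Y, S): X, Y independent fair coins, S = 1 iff X = 1 or Y = 1.
joint = {}
for x, y in product((0, 1), repeat=2):
    s = 1 if (x == 1 or y == 1) else 0
    joint[(x, y, s)] = Fraction(1, 4)

def prob(event):
    """Probability of the set of outcomes (x, y, s) satisfying `event`."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def cond_prob(event, given):
    """Conditional probability P(event | given), assuming P(given) > 0."""
    return prob(lambda x, y, s: event(x, y, s) and given(x, y, s)) / prob(given)

admitted = lambda x, y, s: s == 1

p_both = cond_prob(lambda x, y, s: x == 1 and y == 1, admitted)  # 1/3
p_x = cond_prob(lambda x, y, s: x == 1, admitted)                # 2/3
p_y = cond_prob(lambda x, y, s: y == 1, admitted)                # 2/3

print(p_both, p_x * p_y)  # 1/3 versus 4/9: dependent given S = 1
```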
## Markov Conditions and Factorization
The core probabilistic question is: if the graph is causal, what restrictions should the observational law satisfy? The answer is that each variable should be conditionally independent of its non-descendants once its direct causes are known.
[definition: Distribution Markov To A DAG]
Let $G=(V,E)$ be a DAG with vertices $V=\{1,\dots,n\}$. For each $j$, let $(S_j,\mathcal S_j)$ be a measurable space and let $X_j:(\Omega,\mathcal F)\to (S_j,\mathcal S_j)$ be a random variable on a probability space $(\Omega,\mathcal F,\mathbb P)$. Write $X=(X_1,\dots,X_n)$ and let $\mathbb P_X$ be the joint distribution on $\prod_{j=1}^n S_j$. The distribution $\mathbb P_X$ is Markov with respect to $G$ if, for each vertex $j$,
\begin{align*}
X_j \perp\!\!\!\perp X_{\operatorname{nd}(j) \setminus \operatorname{pa}(j)} \mid X_{\operatorname{pa}(j)},
\end{align*}
where $\operatorname{nd}(j)$ denotes the set of non-descendants of $j$.
[/definition]
This is the local Markov property. It says that after conditioning on the direct causes of $X_j$, other variables that are not downstream of $X_j$ add no further predictive information about $X_j$.
[example: Common Cause Graph]
Consider the DAG on $\{L,A,Y\}$ with arrows $L \to A$ and $L \to Y$ and no arrow between $A$ and $Y$. The parent sets are
\begin{align*}
\operatorname{pa}(L)=\varnothing,\qquad \operatorname{pa}(A)=\{L\},\qquad \operatorname{pa}(Y)=\{L\}.
\end{align*}
Since no arrow leaves $A$, the only descendant of $A$ is $A$ itself. Thus $Y$ is a non-descendant of $A$ and is not a parent of $A$, so the local Markov property gives
\begin{align*}
A \perp\!\!\!\perp Y \mid L.
\end{align*}
Equivalently, for values with $\mathbb P(L=l)>0$,
\begin{align*}
\mathbb P(A=a,Y=y\mid L=l)=\mathbb P(A=a\mid L=l)\mathbb P(Y=y\mid L=l).
\end{align*}
The same conclusion would fail to be graphically justified if an arrow $A \to Y$ were added, because then $Y$ would be a descendant of $A$, not a non-descendant. If an arrow $Y \to A$ were added, then $Y$ would be a parent of $A$, not a non-parent non-descendant. Thus, in this simple common-cause graph, conditioning on $L$ removes the association transmitted through the fork $A \leftarrow L \to Y$ exactly because the graph says that there is no remaining direct causal relation between $A$ and $Y$.
[/example]
The local Markov property is useful, but local conditional independences are awkward to use one at a time when writing likelihoods or intervention formulas. What is needed is a single product expression that records, for every vertex, that its conditional law depends only on its parents. Such a formula turns graphical separation information into an algebraic description of the whole observational law.
[quotetheorem:9666]
[citeproof:9666]
The Markov hypothesis is essential: if the graph omits an arrow but the corresponding conditional independence fails, the parent-only product formula need not equal the true joint law. For example, in the graph $L \to A$ and $L \to Y$ with no edge between $A$ and $Y$, the factorization $p(l,a,y)=p(l)p(a\mid l)p(y\mid l)$ fails whenever $A$ and $Y$ remain dependent conditional on $L$. The theorem also does not identify the graph from the distribution, nor does it by itself say what happens under intervention; it is an observational factorization statement. Its importance is that it converts local causal inputs into a modular product form, which is exactly the form later modified by intervention rules.
The factorization theorem is what makes DAGs computational. Instead of specifying one joint distribution on all variables at once, we specify one conditional distribution per vertex.
[example: Factorization In A Treatment Graph]
For the graph with arrows $L \to A$, $L \to Y$, and $A \to Y$, the parent sets are
\begin{align*}
\operatorname{pa}(L)=\varnothing,\qquad \operatorname{pa}(A)=\{L\},\qquad \operatorname{pa}(Y)=\{L,A\}.
\end{align*}
If the observational distribution is Markov with respect to this DAG, the *[DAG Factorization Theorem](/theorems/9666)* gives
\begin{align*}
p(l,a,y)=p(l\mid x_{\operatorname{pa}(L)})p(a\mid x_{\operatorname{pa}(A)})p(y\mid x_{\operatorname{pa}(Y)}).
\end{align*}
Since $\operatorname{pa}(L)=\varnothing$, the first conditional density is the marginal density $p(l)$. Since $x_{\operatorname{pa}(A)}=l$ and $x_{\operatorname{pa}(Y)}=(l,a)$, this becomes
\begin{align*}
p(l,a,y)=p(l)p(a\mid l)p(y\mid l,a).
\end{align*}
Equivalently, because conditioning on the pair $(l,a)$ is the same information as conditioning on $(a,l)$,
\begin{align*}
p(l,a,y)=p(l)p(a\mid l)p(y\mid a,l).
\end{align*}
The factor $p(a\mid l)$ describes treatment assignment in the observed regime, while $p(y\mid a,l)$ describes the conditional outcome law; the displayed product is still an observational factorization, not yet an interventional formula.
[/example]
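The modular form of the factorization can be made concrete by assembling a joint law from one conditional table per vertex. In the sketch below all three variables are binary and every numerical entry is a hypothetical value chosen for illustration; only the product structure $p(l)\,p(a\mid l)\,p(y\mid l,a)$ comes from the example.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical conditional tables, one per vertex of L -> A, L -> Y, A -> Y.
p_l = {0: F(1, 2), 1: F(1, 2)}                          # p(l)
p_a1_given_l = {0: F(1, 4), 1: F(3, 4)}                 # p(A=1 | l)
p_y1_given_la = {(0, 0): F(1, 10), (0, 1): F(3, 10),
                 (1, 0): F(2, 5), (1, 1): F(4, 5)}      # p(Y=1 | l, a)

def bernoulli(p_one, value):
    """Probability that a binary variable with P(1) = p_one equals `value`."""
    return p_one if value == 1 else 1 - p_one

def joint(l, a, y):
    """Observational joint density p(l, a, y) built from the parent-only factors."""
    return (p_l[l]
            * bernoulli(p_a1_given_l[l], a)
            * bernoulli(p_y1_given_la[(l, a)], y))

# The product of conditional tables is a valid joint distribution.
total = sum(joint(l, a, y) for l, a, y in product((0, 1), repeat=3))

# Conditioning the joint recovers the table that was used to build it.
p_l1_a1 = sum(joint(1, 1, y) for y in (0, 1))
recovered = joint(1, 1, 1) / p_l1_a1

print(total, recovered)  # 1 and 4/5
```

Each table can be changed independently of the others, which is the modularity that intervention rules later exploit; the code itself remains a purely observational computation.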
The factorization example used one small graph, but causal models usually require independence statements about sets of variables connected by longer paths. To read those statements from the graph, we need a path-blocking criterion that handles chains, forks, colliders, and descendants of colliders in a single rule.
[definition: D-Separation]
Let $G$ be a DAG and let $A,B,C \subset V$ be disjoint vertex sets. A path between a vertex in $A$ and a vertex in $B$ is blocked by $C$ if at least one of the following holds: the path contains a non-collider that belongs to $C$; or the path contains a collider such that neither the collider nor any of its descendants belongs to $C$. The sets $A$ and $B$ are d-separated by $C$ if every path from $A$ to $B$ is blocked by $C$.
[/definition]
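For small graphs the definition can be checked directly by enumerating paths. The sketch below is a brute-force implementation intended only to mirror the wording of the definition; it is not an efficient algorithm, and the function names are hypothetical. Vertex sets are assumed to be disjoint, as in the definition.

```python
def descendants(edges, v):
    """Vertices reachable from v along directed paths, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        node = stack.pop()
        for (u, w) in edges:
            if u == node and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def simple_paths(edges, start, goal):
    """All paths of distinct vertices from start to goal, ignoring arrow direction."""
    neighbors = {}
    for u, w in edges:
        neighbors.setdefault(u, set()).add(w)
        neighbors.setdefault(w, set()).add(u)

    def extend(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in path:
                yield from extend(path + [nxt])

    return list(extend([start]))

def is_blocked(edges, path, cond):
    """Apply the two blocking clauses to every interior vertex of the path."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edges and (nxt, mid) in edges
        if is_collider:
            # Blocked if neither the collider nor any descendant is in cond.
            if not (descendants(edges, mid) & cond):
                return True
        elif mid in cond:
            # Blocked by a conditioned non-collider.
            return True
    return False

def d_separated(edges, xs, ys, cond):
    """True if every path between xs and ys is blocked by cond."""
    cond = set(cond)
    return all(is_blocked(edges, path, cond)
               for x in xs for y in ys
               for path in simple_paths(edges, x, y))

fork = {("L", "A"), ("L", "Y")}
collider = {("X", "S"), ("Y", "S"), ("S", "R")}
print(d_separated(fork, {"A"}, {"Y"}, {"L"}))      # True
print(d_separated(collider, {"X"}, {"Y"}, set()))  # True
print(d_separated(collider, {"X"}, {"Y"}, {"R"}))  # False
```

The last call shows the descendant clause at work: conditioning on $R$, a child of the collider $S$, is enough to unblock the path $X \to S \leftarrow Y$.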
D-separation formalizes the different roles of chains, forks, and colliders. The next question is whether this purely graphical blocking rule is sound as a statement about conditional independence for every law that factorizes over the DAG.
[quotetheorem:9667]
[citeproof:9667]
The factorization assumption is essential: d-separation is not a statement about arbitrary distributions placed on the same vertex set. For instance, in the empty graph on $X$ and $Y$, d-separation predicts $X \perp\!\!\!\perp Y$, but an arbitrary joint law with dependent $X$ and $Y$ does not satisfy the empty-graph factorization $p(x,y)=p(x)p(y)$. The theorem is also one-way; it says graphical separation implies independence for all factorizing laws, not that every observed independence must come from separation. That reverse direction is exactly what faithfulness tries to supply in the next part of the chapter.
The global Markov property is the main rule for reading conditional independences from a DAG. It is sound: graphical separation gives a probabilistic independence for every distribution that factorizes according to the graph.
[example: Selection On A Descendant Of A Collider]
Let $X$ and $Y$ be independent Bernoulli variables with $\mathbb P(X=1)=\mathbb P(Y=1)=1/2$. Define $S=1$ exactly when $X=1$ or $Y=1$, and define $R=S$. This distribution has graph $X \to S \leftarrow Y$ and $S \to R$, with $R$ a descendant of the collider $S$.
Marginally,
\begin{align*}
\mathbb P(X=1,Y=1)=\frac14=\frac12\cdot\frac12=\mathbb P(X=1)\mathbb P(Y=1).
\end{align*}
Thus $X$ and $Y$ are independent before conditioning. Since $R=S$, conditioning on $R=1$ is the same as conditioning on $S=1$. The event $S=1$ excludes only $(X,Y)=(0,0)$, so
\begin{align*}
\mathbb P(R=1)=\mathbb P(S=1)=1-\mathbb P(X=0,Y=0)=1-\frac14=\frac34.
\end{align*}
Also, whenever $X=1$ and $Y=1$, we have $S=1$ and hence $R=1$, so
\begin{align*}
\mathbb P(X=1,Y=1\mid R=1)=\frac{\mathbb P(X=1,Y=1,R=1)}{\mathbb P(R=1)}=\frac{1/4}{3/4}=\frac13.
\end{align*}
Similarly, whenever $X=1$, we have $S=1$ and hence $R=1$, so
\begin{align*}
\mathbb P(X=1\mid R=1)=\frac{\mathbb P(X=1,R=1)}{\mathbb P(R=1)}=\frac{1/2}{3/4}=\frac23.
\end{align*}
By the same argument,
\begin{align*}
\mathbb P(Y=1\mid R=1)=\frac23.
\end{align*}
Therefore
\begin{align*}
\mathbb P(X=1\mid R=1)\mathbb P(Y=1\mid R=1)=\frac23\cdot\frac23=\frac49.
\end{align*}
Since $\frac13 \ne \frac49$, the variables $X$ and $Y$ are not independent conditional on $R=1$. Thus conditioning on the descendant $R$ of the collider $S$ opens the path $X \to S \leftarrow Y$ even though $S$ itself is not conditioned on, because $R$ is reached from the collider through the arrow $S \to R$. In dataset terms, analysing only records with $R=1$ can introduce selection bias even though $X$ and $Y$ were independent in the full population.
[/example]
This example matters in causal inference because datasets are often produced by sampling, consent, survival, or measurement mechanisms. A variable that looks like a harmless post-selection indicator may carry information about a collider upstream.
## Faithfulness, Minimality, and Graphical Limits
The Markov property tells us which independences the graph guarantees. The reverse problem asks whether observed conditional independences reveal the graph. The answer requires extra assumptions, because different graphs and parameter values can produce the same independences.
[definition: Faithfulness]
Let $G$ be a DAG and let $\mathbb P_X$ be Markov with respect to $G$. The distribution $\mathbb P_X$ is faithful to $G$ if, for all disjoint $A,B,C \subset V$,
\begin{align*}
X_A \perp\!\!\!\perp X_B \mid X_C
\end{align*}
holds under $\mathbb P_X$ only when $A$ and $B$ are d-separated by $C$ in $G$.
[/definition]
Faithfulness says that the graph accounts for all and only the conditional independences in the distribution. Without it, some independences may be caused by special parameter cancellations rather than by missing active paths.
[example: Unfaithful Linear Cancellation]
Consider the linear structural equations
\begin{align*}
X=\varepsilon_X.
\end{align*}
\begin{align*}
Z=aX+\varepsilon_Z.
\end{align*}
\begin{align*}
Y=bZ+cX+\varepsilon_Y.
\end{align*}
Assume that $\varepsilon_X,\varepsilon_Z,\varepsilon_Y$ are mutually independent mean-zero Gaussian noise variables with positive variances, and take the graph to have arrows $X \to Z$, $Z \to Y$, and $X \to Y$. Substituting the equation for $Z$ into the equation for $Y$ gives
\begin{align*}
Y=b(aX+\varepsilon_Z)+cX+\varepsilon_Y.
\end{align*}
By distributivity,
\begin{align*}
Y=abX+b\varepsilon_Z+cX+\varepsilon_Y.
\end{align*}
Collecting the two terms involving $X$,
\begin{align*}
Y=(ab+c)X+b\varepsilon_Z+\varepsilon_Y.
\end{align*}
If $ab+c=0$, then this becomes
\begin{align*}
Y=0\cdot X+b\varepsilon_Z+\varepsilon_Y.
\end{align*}
Hence
\begin{align*}
Y=b\varepsilon_Z+\varepsilon_Y.
\end{align*}
Since $X=\varepsilon_X$, and $\varepsilon_X$ is independent of the pair $(\varepsilon_Z,\varepsilon_Y)$, $X$ is independent of every [measurable function](/page/Measurable%20Function) of $(\varepsilon_Z,\varepsilon_Y)$. Therefore $X$ is independent of $Y=b\varepsilon_Z+\varepsilon_Y$. This independence is not represented by d-separation in the graph: the single-edge path $X \to Y$ has no interior vertex that could block it, and the path $X \to Z \to Y$ has middle vertex $Z$ as a non-collider which is not conditioned on. Thus the graph contains active paths from $X$ to $Y$, but the distribution still has $X \perp\!\!\!\perp Y$ because the two directed effects cancel when $ab+c=0$. The distribution is therefore Markov-compatible in its structural form but not faithful to that DAG.
[/example]
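The cancellation can be observed in simulated data. The sketch below assumes unit-variance noise and uses the hypothetical coefficients $a=1$, $b=2$, $c=-2$, so that $ab+c=0$; it estimates the sample correlation between $X$ and $Y$ and compares it with a nearby parameter choice where no cancellation occurs. For jointly Gaussian variables zero correlation is equivalent to independence, so correlation is a reasonable diagnostic in this specific model.

```python
import random

def sample_correlation(a, b, c, n=100_000, seed=0):
    """Sample correlation of X and Y in the linear SCM
    X = eps_X, Z = a*X + eps_Z, Y = b*Z + c*X + eps_Y,
    with independent N(0, 1) noise (unit variances are an assumption)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = a * x + rng.gauss(0.0, 1.0)
        y = b * z + c * x + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / (var_x * var_y) ** 0.5

print(sample_correlation(a=1.0, b=2.0, c=-2.0))  # cancellation: near zero
print(sample_correlation(a=1.0, b=2.0, c=1.0))   # no cancellation: clearly positive
```

The first parameter setting lies on the surface $ab+c=0$; perturbing $c$ slightly restores the dependence, which is why such cancellations are usually regarded as non-generic.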
Faithfulness is stronger than Markovness. The cancellation example shows that extra independences may appear even when every displayed arrow participates in the data-generating equations. To separate this severe failure from the weaker problem of arrows that do no conditional-distribution work, we introduce minimality.
[definition: Causal Minimality]
Let $G$ be a DAG and let $\mathbb P_X$ be Markov with respect to $G$. The pair $(G,\mathbb P_X)$ satisfies causal minimality if $\mathbb P_X$ is not Markov with respect to any proper subgraph obtained by deleting at least one arrow from $G$.
[/definition]
Minimality rules out arrows that make no difference to the observational Markov structure. The comparison problem is whether faithfulness, which rules out all non-graphical independences, automatically rules out such removable arrows. The next theorem answers this and places minimality below faithfulness in the hierarchy of assumptions used for structure learning.
[quotetheorem:9668]
[citeproof:9668]
Faithfulness is the hypothesis that prevents hidden cancellations from imitating missing arrows. Without it, the implication can fail: in a graph $X \to Y$, a law with independent $X$ and $Y$ is Markov with respect to the graph but also Markov with respect to the proper subgraph with no arrow, so minimality fails. The theorem does not say that minimality implies faithfulness; minimality rules out redundant arrows, while faithfulness rules out all extra conditional independences. This distinction matters for structure learning because edge deletion tests use the weaker idea, whereas recovering a full equivalence class from independences usually assumes the stronger one.
This theorem explains why faithfulness is often used in structure learning: it connects missing edges with testable conditional independences. Minimality captures only the weakest part of that connection.
[example: M-Bias]
Let $U_1$ and $U_2$ be independent Bernoulli variables with $\mathbb P(U_1=1)=\mathbb P(U_2=1)=1/2$. Define $A=U_1$, $Y=U_2$, and let $M=1$ exactly when $U_1=U_2$. This realizes the graph with arrows $U_1 \to A$, $U_1 \to M$, $U_2 \to M$, and $U_2 \to Y$, with no causal arrow from $A$ to $Y$.
Marginally, $A$ and $Y$ are independent because they are functions of the independent variables $U_1$ and $U_2$ separately. Explicitly,
\begin{align*}
\mathbb P(A=1,Y=1)=\mathbb P(U_1=1,U_2=1)=\frac14.
\end{align*}
Also,
\begin{align*}
\mathbb P(A=1)\mathbb P(Y=1)=\mathbb P(U_1=1)\mathbb P(U_2=1)=\frac12\cdot\frac12=\frac14.
\end{align*}
The path $A \leftarrow U_1 \to M \leftarrow U_2 \to Y$ is blocked at the collider $M$ when we do not condition on $M$, matching this marginal independence.
Now condition on $M=1$. Since $M=1$ means $U_1=U_2$, the pairs compatible with this event are $(U_1,U_2)=(0,0)$ and $(1,1)$. Therefore
\begin{align*}
\mathbb P(M=1)=\mathbb P(U_1=0,U_2=0)+\mathbb P(U_1=1,U_2=1)=\frac14+\frac14=\frac12.
\end{align*}
Moreover,
\begin{align*}
\mathbb P(A=1,Y=1\mid M=1)=\frac{\mathbb P(U_1=1,U_2=1,M=1)}{\mathbb P(M=1)}=\frac{1/4}{1/2}=\frac12.
\end{align*}
On the other hand,
\begin{align*}
\mathbb P(A=1\mid M=1)=\frac{\mathbb P(U_1=1,M=1)}{\mathbb P(M=1)}=\frac{1/4}{1/2}=\frac12.
\end{align*}
By the same calculation,
\begin{align*}
\mathbb P(Y=1\mid M=1)=\frac12.
\end{align*}
Thus
\begin{align*}
\mathbb P(A=1\mid M=1)\mathbb P(Y=1\mid M=1)=\frac12\cdot\frac12=\frac14.
\end{align*}
Since $\frac12 \ne \frac14$, $A$ and $Y$ are not independent conditional on $M=1$. Adjusting for the collider $M$ opens the path between $A$ and $Y$ and creates an association even though there is no causal arrow from $A$ to $Y$; this is M-bias because the active path has the shape of the letter M.
[/example]
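The same conclusion can be checked for every stratum of $M$ at once. The sketch below enumerates the joint law of $(A,Y,M)$ exactly and tests the product rule for all value pairs; the function name `independent_given` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Joint law of (A, Y, M) with A = U1, Y = U2, M = 1 iff U1 = U2,
# where U1 and U2 are independent fair coins.
joint = {}
for u1, u2 in product((0, 1), repeat=2):
    outcome = (u1, u2, int(u1 == u2))
    joint[outcome] = joint.get(outcome, 0) + Fraction(1, 4)

def prob(event):
    """Probability of the set of outcomes (a, y, m) satisfying `event`."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def independent_given(m_value=None):
    """Check A independent of Y, marginally (m_value=None) or within M = m_value."""
    def in_stratum(a, y, m):
        return m_value is None or m == m_value

    total = prob(in_stratum)
    for a0, y0 in product((0, 1), repeat=2):
        p_ay = prob(lambda a, y, m: in_stratum(a, y, m) and a == a0 and y == y0) / total
        p_a = prob(lambda a, y, m: in_stratum(a, y, m) and a == a0) / total
        p_y = prob(lambda a, y, m: in_stratum(a, y, m) and y == y0) / total
        if p_ay != p_a * p_y:
            return False
    return True

print(independent_given(None))  # True: the collider M blocks the path
print(independent_given(1))     # False: conditioning on M = 1 opens it
print(independent_given(0))     # False: so does conditioning on M = 0
```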
M-bias is a graphical limit on naive adjustment rules. Pre-treatment status alone does not make a covariate safe to condition on; its position on paths between treatment and outcome matters.
[remark: Markov Equivalence]
Several DAGs can encode the same d-separation relations. In particular, DAGs with the same skeleton and the same unshielded colliders are Markov equivalent. Observational conditional independence information can identify only a Markov equivalence class unless additional assumptions, interventions, temporal information, or parametric restrictions are supplied.
[/remark]
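The criterion in the remark is simple to check mechanically. The sketch below compares skeletons and unshielded colliders of two DAGs given as edge sets; it assumes both graphs have the same vertex set with no isolated vertices, and the function names are illustrative.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected adjacencies: each arrow with its direction forgotten."""
    return {frozenset(edge) for edge in edges}

def unshielded_colliders(edges):
    """Triples (x, z, y) with x -> z <- y and no edge between x and y."""
    skel = skeleton(edges)
    vertices = {v for edge in edges for v in edge}
    found = set()
    for z in vertices:
        pa = sorted(u for (u, w) in edges if w == z)
        for x, y in combinations(pa, 2):
            if frozenset((x, y)) not in skel:
                found.add((x, z, y))
    return found

def same_skeleton_and_colliders(edges1, edges2):
    """The graphical condition for Markov equivalence stated in the remark."""
    return (skeleton(edges1) == skeleton(edges2)
            and unshielded_colliders(edges1) == unshielded_colliders(edges2))

chain = {("X", "Z"), ("Z", "Y")}      # X -> Z -> Y
fork = {("Z", "X"), ("Z", "Y")}       # X <- Z -> Y
collider = {("X", "Z"), ("Y", "Z")}   # X -> Z <- Y

print(same_skeleton_and_colliders(chain, fork))      # True
print(same_skeleton_and_colliders(chain, collider))  # False
```

The chain and the fork cannot be distinguished by conditional independences alone, whereas the collider has the same skeleton but a different independence pattern.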
The final message of this chapter is that DAGs are not pictures of data alone. They combine causal assumptions with probability laws: the Markov condition lets causal structure imply factorization and conditional independence, while faithfulness-type assumptions determine how much graphical structure can be recovered from observed independences.
D-separation translates graph structure into conditional independence statements. That bridge is what allows us to move from a qualitative DAG to quantitative criteria for adjustment, path blocking, and eventually identification.
# 5. D-Separation and Conditional Independence
D-separation is the bridge between the graphical language of a directed acyclic graph and probabilistic statements of conditional independence. Chapter 4 introduced causal DAGs as qualitative descriptions of structural dependence; this chapter asks which independences are forced by the graph alone. The prerequisites are the graph-theoretic notions of paths, directed paths, parents, descendants, and induced subgraphs, together with conditional independence and the DAG Markov factorization from the earlier probability and causal-model chapters. The central issue is that conditioning can either block information flow along a path or open a path that was previously inactive, so ordinary graph connectivity is not the right criterion.
## Paths and Conditioning
When two variables are joined by several routes in a DAG, some following the arrows and others running against them, the first problem is to decide which routes can transmit statistical association after we condition on other variables. A path is allowed to ignore arrow direction while it is being traced, but the arrow pattern at each intermediate vertex matters. The three local patterns are chains, forks, and colliders, and they respond to conditioning in different ways.
[definition: Path In A DAG]
Let $G=(V,E)$ be a directed acyclic graph. A path from $X$ to $Y$ is a finite sequence of distinct vertices $(v_0,\dots,v_k)$ such that $v_0=X$, $v_k=Y$, and for each $i=1,\dots,k$, either $v_{i-1} \to v_i$ or $v_i \to v_{i-1}$ is an edge of $G$.
[/definition]
The definition treats a path as an undirected route through a directed graph. This is deliberate: association can travel against an arrow through a common cause, as in $X \leftarrow Z \to Y$, even though causal influence does not travel from $X$ to $Z$. To decide whether such a route is blocked by conditioning, we next need to classify the interior vertices by the direction of the two arrowheads touching them.
[definition: Collider On A Path]
Let $(v_0,\dots,v_k)$ be a path in a DAG. An interior vertex $v_i$, where $1 \le i \le k-1$, is a collider on the path if the two adjacent edges on the path have arrowheads pointing into $v_i$:
\begin{align*}
v_{i-1} \to v_i \leftarrow v_{i+1}.
\end{align*}
An interior vertex that is not a collider on the path is a non-collider on the path.
[/definition]
Colliders are the source of the main new phenomenon. In a chain or fork, conditioning on the middle variable blocks the path because it accounts for the relevant transmitted information. A collider behaves in the opposite way: the path is closed until we condition on the collider, and observing a downstream effect of the collider can open it as well, so the next definition records which vertices lie downstream of a set.
[definition: Descendant Set]
Let $G=(V,E)$ be a directed acyclic graph. The descendant-set operator is the map
\begin{align*}
\operatorname{De}_G:\mathcal P(V) \to \mathcal P(V)
\end{align*}
defined as follows. For $A \subset V$, a vertex $w \in V$ belongs to $\operatorname{De}_G(A)$ if there exists $a \in A$ and a directed path from $a$ to $w$ in $G$, allowing paths of length zero.
[/definition]
Descendants matter because observing an effect of a collider gives indirect information about the collider itself. Thus a path-blocking rule must mention both conditioned colliders and conditioned descendants of colliders. We can now state the complete local criterion that decides whether a single path is available after conditioning.
[definition: Blocked Path And Active Path]
Let $G=(V,E)$ be a directed acyclic graph, let $Z \subset V$, and let $\pi$ be a path in $G$. The path $\pi$ is blocked by $Z$ if at least one of the following conditions holds:
1. some non-collider on $\pi$ belongs to $Z$;
2. some collider $c$ on $\pi$ satisfies $c \notin Z$ and $\operatorname{De}_G(\{c\}) \cap Z = \varnothing$.
The path $\pi$ is active given $Z$ if it is not blocked by $Z$.
[/definition]
This definition encodes all three elementary patterns in one rule. Conditioning blocks chains and forks through their middle vertex, while conditioning activates colliders through the collider itself or through a descendant of the collider.
[example: Chain Fork And Collider Rules]
Consider the three elementary paths from $X$ to $Y$ with middle vertex $Z$. In the chain
\begin{align*}
X \to Z \to Y,
\end{align*}
the only path is $(X,Z,Y)$, and the interior vertex $Z$ is a non-collider because the two adjacent arrows do not both point into $Z$. Given $\varnothing$, this non-collider is not conditioned on, and there is no collider on the path, so neither blocking condition holds. Hence the path is active given $\varnothing$. Given $\{Z\}$, the non-collider $Z$ belongs to the conditioning set, so the path is blocked.
In the fork
\begin{align*}
X \leftarrow Z \to Y,
\end{align*}
the only path is again $(X,Z,Y)$, and $Z$ is also a non-collider: one arrow leaves $Z$ toward $X$, and the other leaves $Z$ toward $Y$. Thus the same check applies. Given $\varnothing$, no non-collider on the path is conditioned on, so the path is active; given $\{Z\}$, the non-collider $Z$ is conditioned on, so the path is blocked.
In the collider
\begin{align*}
X \to Z \leftarrow Y,
\end{align*}
the only path $(X,Z,Y)$ has $Z$ as a collider, since both adjacent arrowheads point into $Z$. Given $\varnothing$, we have $Z \notin \varnothing$ and $\operatorname{De}_G(\{Z\}) \cap \varnothing=\varnothing$, so the collider blocking condition holds and the path is blocked. Given $\{Z\}$, the collider itself is conditioned on, so the condition requiring $Z \notin \{Z\}$ fails; there are no non-colliders on the path, so the path is active. If the graph also has $Z \to W$, then $W \in \operatorname{De}_G(\{Z\})$, so
\begin{align*}
\operatorname{De}_G(\{Z\}) \cap \{W\}=\{W\}.
\end{align*}
Therefore the collider blocking condition fails given $\{W\}$ as well, and the path $X \to Z \leftarrow Y$ is active given $\{W\}$.
[/example]
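The blocking rule can be checked mechanically on the three elementary patterns. The sketch below assumes a DAG is stored as a set of `(parent, child)` pairs and a path as a tuple of vertices; the helper names are hypothetical. As in the definition of $\operatorname{De}_G$, a vertex counts as its own descendant.

```python
def descendants(edges, v):
    """Vertices reachable from v by a directed path, including v (length zero)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for p, c in edges:
            if p == u and c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def is_collider(edges, prev, mid, nxt):
    """True when both path edges point into mid: prev -> mid <- nxt."""
    return (prev, mid) in edges and (nxt, mid) in edges

def is_blocked(edges, path, Z):
    """Apply the two blocking conditions to every interior vertex of the path."""
    Z = set(Z)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        if is_collider(edges, prev, mid, nxt):
            # Condition 2: neither the collider nor any descendant is in Z.
            if not (descendants(edges, mid) & Z):
                return True
        elif mid in Z:
            # Condition 1: a conditioned non-collider.
            return True
    return False

path = ("X", "Z", "Y")
chain = {("X", "Z"), ("Z", "Y")}
fork = {("Z", "X"), ("Z", "Y")}
collider = {("X", "Z"), ("Y", "Z"), ("Z", "W")}   # includes the extra edge Z -> W

for name, g in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, is_blocked(g, path, set()), is_blocked(g, path, {"Z"}))
print("collider given W:", is_blocked(collider, path, {"W"}))
```

Because `descendants` includes the vertex itself, the single test `descendants(edges, mid) & Z` covers both parts of the collider condition.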
The collider case is often the first place where causal graphs depart from everyday intuition. Observing a common effect can make its causes statistically dependent, even when the causes have no direct causal connection.
[example: Explaining Away]
Let $B$ denote burglary, $E$ earthquake, and $A$ alarm, with DAG $B \to A \leftarrow E$. The only path from $B$ to $E$ is $(B,A,E)$, and $A$ is a collider because the adjacent arrows have the form $B \to A \leftarrow E$. Given $\varnothing$, the collider satisfies $A \notin \varnothing$ and $\operatorname{De}_G(\{A\}) \cap \varnothing=\varnothing$, so this path is blocked. Since it is the only path, $B$ and $E$ are d-separated given $\varnothing$, which is the graphical counterpart of marginal independence of burglary and earthquake. Given $\{A\}$, the collider itself is conditioned on, so the collider-blocking condition fails; there are no non-colliders on the path, so the path is active and the DAG no longer entails $B \perp\!\!\!\perp E \mid A$.
A concrete probability model shows the signed effect. Suppose $B$ and $E$ are independent Bernoulli variables with $\mathbb P(B=1)=p$ and $\mathbb P(E=1)=q$, where $0<p,q<1$, and suppose the alarm is triggered exactly when at least one cause occurs:
\begin{align*}
A=1 \quad \text{if and only if} \quad B=1 \text{ or } E=1.
\end{align*}
Then
\begin{align*}
\mathbb P(A=1)=\mathbb P(B=1 \text{ or } E=1)=p+q-pq.
\end{align*}
Since $B=1$ implies $A=1$ in this model,
\begin{align*}
\mathbb P(B=1 \mid A=1)=\frac{\mathbb P(B=1,A=1)}{\mathbb P(A=1)}=\frac{\mathbb P(B=1)}{\mathbb P(A=1)}=\frac{p}{p+q-pq}.
\end{align*}
If we also observe $E=1$, then $A=1$ is already guaranteed by the earthquake, so
\begin{align*}
\mathbb P(B=1 \mid A=1,E=1)=\mathbb P(B=1 \mid E=1)=p,
\end{align*}
using the independence of $B$ and $E$. Finally,
\begin{align*}
p+q-pq=1-(1-p)(1-q)<1,
\end{align*}
so
\begin{align*}
\frac{p}{p+q-pq}>p.
\end{align*}
Thus observing the alarm raises the posterior probability of burglary above $p$, but observing the earthquake as an additional explanation lowers it back to $p$; this model-specific negative conditional association is the explaining-away phenomenon.
[/example]
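The closed-form posteriors in the explaining-away example can be confirmed by brute-force enumeration. This is a small sketch using exact rational arithmetic; the values $p=\tfrac1{10}$ and $q=\tfrac15$ are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

def joint(p, q):
    """Joint pmf of (B, E, A) with B, E independent and A = 1 iff B = 1 or E = 1."""
    table = {}
    for b, e in product((0, 1), repeat=2):
        pb = p if b else 1 - p
        pe = q if e else 1 - q
        table[(b, e, int(b or e))] = pb * pe
    return table

def prob(table, event, given=lambda b, e, a: True):
    """P(event | given), computed by summing the joint table."""
    num = sum(w for k, w in table.items() if event(*k) and given(*k))
    den = sum(w for k, w in table.items() if given(*k))
    return num / den

p, q = Fraction(1, 10), Fraction(1, 5)   # hypothetical parameter values
t = joint(p, q)

prior = prob(t, lambda b, e, a: b == 1)
after_alarm = prob(t, lambda b, e, a: b == 1, lambda b, e, a: a == 1)
after_both = prob(t, lambda b, e, a: b == 1, lambda b, e, a: a == 1 and e == 1)

print(prior, after_alarm, after_both)    # 1/10 5/14 1/10
```

The middle value agrees with $\frac{p}{p+q-pq}$, and the last value returns to $p$ once the earthquake is also observed.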
The same mechanism appears in selection bias. Conditioning on being included in a sample, admitted to a programme, or diagnosed through a criterion can turn unrelated causes of selection into associated variables.
[example: Berkson Paradox]
Suppose $D_1$ and $D_2$ are two diseases and $H$ is hospital admission, with DAG $D_1 \to H \leftarrow D_2$. The only path from $D_1$ to $D_2$ is $(D_1,H,D_2)$, and $H$ is a collider because the adjacent arrows have the form $D_1 \to H \leftarrow D_2$. Given $\varnothing$, the collider satisfies $H \notin \varnothing$ and $\operatorname{De}_G(\{H\})\cap \varnothing=\varnothing$, so the path is blocked. Given $\{H\}$, the collider itself is conditioned on, so the collider-blocking condition fails; there are no non-colliders on the path, so the path is active.
A concrete model shows the resulting selection bias. Let $D_1$ and $D_2$ be independent Bernoulli variables with $\mathbb P(D_1=1)=p$ and $\mathbb P(D_2=1)=q$, where $0<p,q<1$, and suppose a patient is admitted exactly when at least one disease is present:
\begin{align*}
H=1 \quad \text{if and only if} \quad D_1=1 \text{ or } D_2=1.
\end{align*}
By independence,
\begin{align*}
\mathbb P(D_1=0,D_2=0)=(1-p)(1-q).
\end{align*}
Therefore
\begin{align*}
\mathbb P(H=1)=1-\mathbb P(D_1=0,D_2=0)=1-(1-p)(1-q)=p+q-pq.
\end{align*}
Since $D_1=1$ implies $H=1$,
\begin{align*}
\mathbb P(D_1=1 \mid H=1)=\frac{\mathbb P(D_1=1,H=1)}{\mathbb P(H=1)}=\frac{\mathbb P(D_1=1)}{p+q-pq}=\frac{p}{p+q-pq}.
\end{align*}
If we also observe $D_2=1$, then $H=1$ is already guaranteed, so independence of $D_1$ and $D_2$ gives
\begin{align*}
\mathbb P(D_1=1 \mid H=1,D_2=1)=\mathbb P(D_1=1 \mid D_2=1)=p.
\end{align*}
Because
\begin{align*}
p+q-pq=1-(1-p)(1-q)<1,
\end{align*}
we have
\begin{align*}
p<\frac{p}{p+q-pq}.
\end{align*}
Thus among admitted patients, learning that $D_2$ is present lowers the probability of $D_1$ from $\frac{p}{p+q-pq}$ to $p$; the graphical path $D_1 \to H \leftarrow D_2$ records that the association is created by conditioning on the collider $H$.
[/example]
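The same model also quantifies the induced association directly. As a supplementary calculation under the assumptions of the example above (independent diseases, admission exactly when at least one disease is present), write $s=p+q-pq$. Since $D_1=1$ or $D_2=1$ each implies $H=1$, among admitted patients we have
\begin{align*}
\mathbb P(D_1=1,D_2=1 \mid H=1)=\frac{pq}{s},\qquad \mathbb P(D_1=1 \mid H=1)=\frac{p}{s},\qquad \mathbb P(D_2=1 \mid H=1)=\frac{q}{s},
\end{align*}
so the conditional covariance is
\begin{align*}
\operatorname{Cov}(D_1,D_2 \mid H=1)=\frac{pq}{s}-\frac{pq}{s^2}=\frac{pq(s-1)}{s^2}=-\frac{pq(1-p)(1-q)}{s^2}<0,
\end{align*}
using $s-1=-(1-p)(1-q)$. The two diseases are therefore negatively associated in the hospital population even though they are independent in the general population.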
## D-Separation and Conditional Independence
The next question is global: given two sets of variables $A$ and $B$, and a conditioning set $Z$, when does the graph force every path from $A$ to $B$ to be inactive? The answer is d-separation. It is a purely graphical statement, but its value comes from the theorem connecting it to conditional independence in probability distributions that factorize according to the DAG.
[definition: D-Separation]
Let $G=(V,E)$ be a directed acyclic graph and let $A,B,Z \subset V$ be pairwise disjoint vertex sets. The set $Z$ d-separates $A$ and $B$ in $G$, written $A \perp_G B \mid Z$, if every path from a vertex in $A$ to a vertex in $B$ is blocked by $Z$.
[/definition]
D-separation is a separation criterion for conditional independence, not a statement about a particular dataset. To turn it into probability, we need the Markov factorization for DAGs, which is the formal condition saying that each node is generated from its parents.
[definition: DAG Markov Distribution]
Let $G=(V,E)$ be a directed acyclic graph with vertex set $V=\{X_1,\dots,X_n\}$. A probability distribution $\mathbb P$ on the variables $V$ is Markov with respect to $G$ if it factorizes as
\begin{align*}
\mathbb P(X_1,\dots,X_n)=\prod_{i=1}^n \mathbb P(X_i \mid \operatorname{Pa}_G(X_i)),
\end{align*}
with the usual interpretation using conditional densities or probability mass functions when such versions exist.
[/definition]
The Markov property says that each variable is generated from its parents and independent noise, so the graph supplies a collection of conditional independence constraints. The natural question is whether the path-blocking rule is sound for every distribution with this factorization. The next theorem gives the guarantee that justifies using d-separation as a graphical test for conditional independence.
[quotetheorem:9669]
[citeproof:9669]
The Markov factorization hypothesis is essential. For example, in the chain $X \to Z \to Y$, the graph d-separates $X$ and $Y$ given $Z$, but an arbitrary distribution on $(X,Z,Y)$ need not satisfy $X \perp\!\!\!\perp Y \mid Z$ unless it factorizes as $\mathbb P(X)\mathbb P(Z \mid X)\mathbb P(Y \mid Z)$. Thus d-separation is a theorem about distributions compatible with the graph, not about every joint distribution that happens to use the same variable names.
The pairwise-disjointness condition keeps the graphical and probabilistic statements from degenerating. If a variable belongs to both $A$ and $Z$, then the query is asking about a variable after conditioning on itself, so the usual path language no longer matches the intended comparison between two unobserved sets given observed information. In applications the sets $A$, $B$, and $Z$ are therefore chosen as target variables, comparison variables, and conditioned variables with no overlap.
The theorem is also only a soundness result: a d-separation gives a conditional independence in every Markov distribution for the graph. The converse can fail for special parameter values, such as exact cancellations in linear Gaussian models, because those independences come from numerical coincidences rather than the graph. For instance, a graph may contain a directed edge $X \to Y$ while a particular parameter choice makes $Y$ independent of $X$; that independence is not forced by d-separation and would disappear after a generic perturbation of the parameters.
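A cancellation of this kind can be written down explicitly. The sketch below assumes a hypothetical linear model on the DAG with edges $X \to Z$, $Z \to Y$, and $X \to Y$, with independent unit-variance noise terms: $X=\varepsilon_X$, $Z=aX+\varepsilon_Z$, $Y=cX+bZ+\varepsilon_Y$. Then $\operatorname{Cov}(X,Y)=c+ab$, which vanishes when $c=-ab$ even though the edge $X \to Y$ is present. For jointly Gaussian noise, zero covariance is independence.

```python
from fractions import Fraction

def noise_loadings(a, b, c):
    """Express X, Z, Y as linear combinations of independent unit-variance noises."""
    X = {"eX": Fraction(1)}
    Z = {"eX": a, "eZ": Fraction(1)}                     # Z = a*X + eZ
    Y = {"eX": c + b * a, "eZ": b, "eY": Fraction(1)}    # Y = c*X + b*Z + eY
    return X, Z, Y

def cov(u, v):
    """Covariance of two linear combinations of independent unit-variance noises."""
    return sum(u[k] * v[k] for k in u.keys() & v.keys())

a, b = Fraction(2), Fraction(3)          # hypothetical coefficients

# Exact cancellation: the direct effect c offsets the indirect effect a*b.
X, Z, Y = noise_loadings(a, b, -a * b)
print(cov(X, Y))                         # 0

# A small perturbation of c restores the dependence.
X2, Z2, Y2 = noise_loadings(a, b, -a * b + Fraction(1, 100))
print(cov(X2, Y2))                       # 1/100
```

The independence in the first case holds only on the measure-zero set of parameters with $c=-ab$, which is the sense in which it is a numerical coincidence rather than a consequence of the graph.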
[remark: Faithfulness]
A distribution $\mathbb P$ is faithful to a DAG $G$ if every conditional independence in $\mathbb P$ is represented by a d-separation in $G$. Faithfulness is an additional assumption, not part of the Markov property. Many structure-learning methods require faithfulness or a weaker variant because they infer edges from observed independences.
[/remark]
D-separation is not specific to causal inference; it belongs to the general theory of graphical models. In Bayesian networks, the same DAG Markov factorization is used to encode conditional independences for prediction and probabilistic computation, even when the arrows are not interpreted causally. Causal DAGs add intervention semantics later, but the separation calculus in this chapter is shared with that broader graphical-model setting.
D-separation is often used as a checklist in a concrete causal graph. The following diagnostic testing DAG illustrates how marginal and conditional independence claims are read off from paths.
[example: Diagnostic Testing DAG]
Let $D$ be disease status, $T$ a diagnostic test result, $S$ symptoms, and $R$ a risk factor, with DAG edges
\begin{align*}
R \to D,\quad D \to T,\quad D \to S.
\end{align*}
The only path from $R$ to $T$ is $(R,D,T)$, whose edge pattern is
\begin{align*}
R \to D \to T.
\end{align*}
The interior vertex $D$ is a non-collider, because the two adjacent arrowheads do not both point into $D$. Given $\varnothing$, this non-collider is not in the conditioning set, and the path has no collider, so neither blocking condition holds. Hence the path is active given $\varnothing$, so $R$ and $T$ are not d-separated marginally.
Given $\{D\}$, the same path $(R,D,T)$ has the non-collider $D$ in the conditioning set. Therefore the path is blocked by $\{D\}$. Since it is the only path from $R$ to $T$, every path from $R$ to $T$ is blocked by $\{D\}$, so
\begin{align*}
R \perp_G T \mid D.
\end{align*}
By *Soundness Of D-Separation*, every distribution Markov with respect to this DAG satisfies
\begin{align*}
R \perp\!\!\!\perp T \mid D.
\end{align*}
For $T$ and $S$, the only path is $(T,D,S)$, with edge pattern
\begin{align*}
T \leftarrow D \to S.
\end{align*}
Again $D$ is a non-collider, now because both arrows leave $D$. Given $\varnothing$, the path is active, so the graph does not force marginal independence of $T$ and $S$. Given $\{D\}$, the non-collider $D$ is conditioned on, so the path is blocked; since no other path connects $T$ to $S$, we get
\begin{align*}
T \perp_G S \mid D.
\end{align*}
Thus disease status explains both the test result and the symptoms: conditioning on $D$ removes the graphical route through which $R$ can be associated with $T$, and the route through which $T$ can be associated with $S$.
[/example]
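The independences read off from the diagnostic testing DAG can be verified numerically on a concrete Markov distribution. The sketch below builds a joint law from the factorization $\mathbb P(R)\mathbb P(D\mid R)\mathbb P(T\mid D)\mathbb P(S\mid D)$ with binary variables; all conditional probability values are hypothetical, and exact rational arithmetic avoids rounding issues in the equality checks.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical conditional probability tables (each gives P(variable = 1 | parent))
pR = F(3, 10)
pD = {0: F(1, 20), 1: F(1, 4)}     # P(D=1 | R=r)
pT = {0: F(1, 10), 1: F(9, 10)}    # P(T=1 | D=d)
pS = {0: F(1, 5), 1: F(7, 10)}     # P(S=1 | D=d)

def bern(p, x):
    return p if x == 1 else 1 - p

NAMES = ("R", "D", "T", "S")

# Joint pmf built from the DAG Markov factorization
joint = {
    (r, d, t, s): bern(pR, r) * bern(pD[r], d) * bern(pT[d], t) * bern(pS[d], s)
    for r, d, t, s in product((0, 1), repeat=4)
}

def marginal(**fixed):
    """Probability that the named variables take the given values."""
    idx = {NAMES.index(k): v for k, v in fixed.items()}
    return sum(w for k, w in joint.items() if all(k[i] == v for i, v in idx.items()))

def cond_indep(x, y, z):
    """Exact check that P(x,y,z) P(z) = P(x,z) P(y,z) for all binary values."""
    for a, b, c in product((0, 1), repeat=3):
        lhs = marginal(**{x: a, y: b, z: c}) * marginal(**{z: c})
        rhs = marginal(**{x: a, z: c}) * marginal(**{y: b, z: c})
        if lhs != rhs:
            return False
    return True

print(cond_indep("R", "T", "D"))   # True
print(cond_indep("T", "S", "D"))   # True
print(marginal(R=1, T=1) == marginal(R=1) * marginal(T=1))   # False
```

The last line shows that $R$ and $T$ are marginally dependent for these parameter values, consistent with the active path $R \to D \to T$ given $\varnothing$.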
## Moralized Ancestral Graphs
Checking all paths in a large DAG can be inefficient, and collider activation makes manual reasoning error-prone. The final question in this chapter is how to convert a d-separation query into an ordinary separation query in an undirected graph. The construction first removes irrelevant vertices by taking ancestors, then removes arrow direction after marrying parents of common children.
[definition: Ancestral Set]
Let $G=(V,E)$ be a directed acyclic graph. The ancestor-set operator is the map
\begin{align*}
\operatorname{An}_G:\mathcal P(V) \to \mathcal P(V)
\end{align*}
defined by
\begin{align*}
\operatorname{An}_G(S)=\{v \in V : \text{there exists a directed path from } v \text{ to some } s \in S\},
\end{align*}
where directed paths of length zero are allowed.
[/definition]
Only ancestors of the queried variables and conditioning variables can participate in an active path relevant to the query. Non-ancestors cannot affect whether a collider is opened by conditioning, because they are not upstream of the observed or queried variables. After this restriction, we still need to replace directed collider structure by undirected adjacency, which motivates the moral graph.
[definition: Moral Graph]
Let $\mathsf{DAG}$ denote the class of finite directed acyclic graphs and let $\mathsf{UG}$ denote the class of finite undirected graphs. The moralization operator is the map
\begin{align*}
(-)^m:\mathsf{DAG} \to \mathsf{UG}
\end{align*}
which sends a DAG $G=(V,E)$ to the undirected graph $G^m$ obtained by replacing every directed edge by an undirected edge and adding an undirected edge between every pair of distinct vertices that share a child in $G$.
[/definition]
The extra edges between parents of a common child encode the fact that conditioning on a collider or its descendant can connect those parents. However, moralizing the whole DAG can introduce irrelevant connections through vertices outside the query. To avoid those irrelevant connections, we need a query-specific graph transformation that first keeps only ancestors of $A \cup B \cup Z$ and then moralizes that induced subgraph.
[definition: Moralized Ancestral Graph]
Let $\mathsf{QueryDAG}$ be the class of quadruples $(G,A,B,Z)$ where $G=(V,E)$ is a finite directed acyclic graph and $A,B,Z \subset V$. The moralized-ancestral-graph operator is the map
\begin{align*}
\mathsf{MAG}:\mathsf{QueryDAG} \to \mathsf{UG}
\end{align*}
defined by
\begin{align*}
\mathsf{MAG}(G,A,B,Z)=\left(G_{\operatorname{An}_G(A \cup B \cup Z)}\right)^m.
\end{align*}
[/definition]
The undirected graph $\mathsf{MAG}(G,A,B,Z)$ is called the moralized ancestral graph for the query $(A,B,Z)$. The name records the two operations in order: first restrict to ancestors of the queried and conditioned variables, then moralize the remaining directed graph.
This graph is designed so that collider activation has already been accounted for by ancestry and moral edges. The remaining operation is ordinary undirected separation after removing the conditioned vertices. The following theorem states that this procedure is exactly equivalent to d-separation, so it gives both an algorithm and an alternative proof strategy for soundness.
[quotetheorem:9670]
[proofunderconstruction:9670]
The theorem gives a practical algorithm: restrict $G$ to the ancestral set $\operatorname{An}_G(A \cup B \cup Z)$, moralize the induced subgraph, delete the vertices in $Z$, and check whether any undirected path remains between $A$ and $B$. The order of these operations matters: ancestral restriction must come before moralization, because moralizing the whole graph can marry parents of irrelevant descendants and create undirected connections that have no bearing on the query. The criterion is graphical rather than probabilistic; by itself it establishes d-separation, and it yields a conditional-independence statement only when combined with the DAG Markov soundness theorem above. This division of labour is useful in later chapters, because adjustment and do-calculus arguments first reduce causal questions to graphical separation checks and only then translate those checks into independence statements under explicit model assumptions.
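The procedure can be sketched in a few lines of code. This is an illustrative implementation under stated assumptions, not a reference one: the DAG is a set of `(parent, child)` pairs, the sets $A$, $B$, $Z$ are pairwise disjoint, and the function names are hypothetical.

```python
def ancestors(edges, S):
    """An_G(S): vertices with a directed path (possibly of length zero) into S."""
    seen, stack = set(S), list(S)
    while stack:
        v = stack.pop()
        for p, c in edges:
            if c == v and p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def moralize(edges):
    """Drop edge directions and marry distinct parents of a common child."""
    und = {frozenset(e) for e in edges}
    for p1, c1 in edges:
        for p2, c2 in edges:
            if c1 == c2 and p1 != p2:
                und.add(frozenset((p1, p2)))
    return und

def d_separated(edges, A, B, Z):
    """Moralized ancestral graph test for A and B given Z."""
    A, B, Z = set(A), set(B), set(Z)
    keep = ancestors(edges, A | B | Z)                    # step 1: ancestral restriction
    sub = {(p, c) for p, c in edges if p in keep and c in keep}
    und = {e for e in moralize(sub) if not (e & Z)}       # steps 2-3: moralize, delete Z
    seen, stack = set(A), list(A)                         # step 4: undirected search from A
    while stack:
        v = stack.pop()
        for e in und:
            if v in e:
                (w,) = e - {v}
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return not (seen & B)

g = {("A", "C"), ("B", "C"), ("C", "D")}
print(d_separated(g, {"A"}, {"B"}, set()))    # True
print(d_separated(g, {"A"}, {"B"}, {"C"}))    # False
print(d_separated(g, {"A"}, {"B"}, {"D"}))    # False
```

Conditioning on the descendant $D$ enlarges the ancestral set to the whole graph, so the moral edge between $A$ and $B$ appears and survives the deletion of $D$. With the empty conditioning set, the ancestral restriction removes $C$ before any parents can be married.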
[example: Why Ancestry Comes Before Moralization]
Let $G$ have vertices $A,B,C,D$ and directed edges $A \to C$, $B \to C$, and $C \to D$. For the query whether $A$ and $B$ are d-separated given $\varnothing$, the relevant ancestral set is
\begin{align*}
\operatorname{An}_G(\{A,B\})=\{A,B\}.
\end{align*}
Indeed, $A$ and $B$ are included by length-zero directed paths, while $C$ has no directed path to $A$ or $B$, and $D$ has no directed path to $A$ or $B$. Thus the induced subgraph on $\{A,B\}$ has no directed edge, and its moralization has two isolated vertices. Since deleting the conditioning set $\varnothing$ deletes no vertices, there is still no undirected path from $A$ to $B$. By *[Moralized Ancestral Graph Criterion](/theorems/9670)*, this reports that $A$ and $B$ are d-separated given $\varnothing$.
If one moralizes the whole DAG first, the directed edges become undirected edges $A-C$, $B-C$, and $C-D$, and moralization also adds $A-B$ because $A$ and $B$ are distinct parents of the common child $C$. With conditioning set $\varnothing$, this added edge remains, so the wrongly constructed undirected graph contains the path
\begin{align*}
A-B.
\end{align*}
That path is spurious for this query: in the original DAG, the only path from $A$ to $B$ is $(A,C,B)$, with local pattern
\begin{align*}
A \to C \leftarrow B.
\end{align*}
The vertex $C$ is a collider, and with $Z=\varnothing$ we have $C\notin Z$ and
\begin{align*}
\operatorname{De}_G(\{C\})\cap Z=\operatorname{De}_G(\{C\})\cap \varnothing=\varnothing.
\end{align*}
So the path is blocked. The mistake is that moralizing before restricting to ancestors marries $A$ and $B$ through the collider $C$, even though neither $C$ nor its descendant $D$ is conditioned on and $C$ is not an ancestor of any variable in the query.
[/example]
With the need for the ancestral restriction made explicit, we can return to the familiar explaining-away graph and see how the same construction changes when the collider itself is part of the conditioning set.
[example: Moralization For Explaining Away]
In the DAG $B \to A \leftarrow E$, test whether $B$ and $E$ are d-separated given $\{A\}$. The query set is $\{B,E\} \cup \{A\}=\{B,E,A\}$. Since directed paths of length zero are allowed, each of $B$, $E$, and $A$ is an ancestor of itself, and there are no other vertices in the graph. Hence
\begin{align*}
\operatorname{An}_G(\{B,E,A\})=\{B,E,A\}.
\end{align*}
The induced ancestral subgraph is therefore the original three-vertex collider $B \to A \leftarrow E$. Moralization first replaces the directed edges by undirected edges $B-A$ and $E-A$. It also adds the undirected edge $B-E$, because $B$ and $E$ are distinct parents of the common child $A$. Thus the moralized ancestral graph has undirected edge set
\begin{align*}
\{B-A,\ E-A,\ B-E\}.
\end{align*}
Deleting the conditioned vertex $A$ removes the edges $B-A$ and $E-A$, but it does not remove $B-E$. The remaining graph still contains the undirected path
\begin{align*}
B-E.
\end{align*}
Therefore $\{A\}$ does not separate $B$ from $E$ in the moralized ancestral graph, so by *Moralized Ancestral Graph Criterion*, $B$ and $E$ are not d-separated given $A$.
[/example]
The same method handles longer graphs without drawing every possible path. It is the standard graphical test behind many algorithms for causal discovery and adjustment-set checking.
[example: Independence Checks In A Diagnostic Testing DAG]
For the DAG with edges $R \to D$, $D \to T$, and $D \to S$, first test $R \perp_G T \mid D$ using the moralized ancestral graph criterion. The query involves
\begin{align*}
\{R,T\}\cup \{D\}=\{R,D,T\}.
\end{align*}
The ancestors of these vertices are exactly
\begin{align*}
\operatorname{An}_G(\{R,D,T\})=\{R,D,T\}.
\end{align*}
Indeed, $R$, $D$, and $T$ are included by length-zero directed paths; $S$ is not included because the only edge touching $S$ is $D \to S$, so there is no directed path from $S$ to $R$, $D$, or $T$.
The induced ancestral subgraph has directed edges $R \to D$ and $D \to T$. Moralization replaces them by undirected edges $R-D$ and $D-T$. No extra moral edge is added, because within this induced subgraph no two distinct vertices share a child. Thus the moralized ancestral graph is the undirected chain
\begin{align*}
R-D-T.
\end{align*}
Deleting the conditioned vertex $D$ removes both edges $R-D$ and $D-T$, leaving $R$ and $T$ in different connected components. Hence $D$ separates $R$ from $T$ in the moralized ancestral graph, so by *Moralized Ancestral Graph Criterion*,
\begin{align*}
R \perp_G T \mid D.
\end{align*}
By *Soundness Of D-Separation*, every distribution Markov with respect to this DAG satisfies
\begin{align*}
R \perp\!\!\!\perp T \mid D.
\end{align*}
For the query $T \perp_G S \mid D$, the vertices relevant to the query are
\begin{align*}
\{T,S\}\cup \{D\}=\{D,T,S\}.
\end{align*}
Their ancestor set is
\begin{align*}
\operatorname{An}_G(\{D,T,S\})=\{R,D,T,S\}.
\end{align*}
Here $D$, $T$, and $S$ are included by length-zero directed paths, and $R$ is included because $R \to D$ is a directed path from $R$ to a queried or conditioned vertex. The induced ancestral subgraph is therefore the whole DAG. Moralization replaces $R \to D$, $D \to T$, and $D \to S$ by undirected edges $R-D$, $D-T$, and $D-S$. There is still no added moral edge, because no vertex has two distinct parents. After deleting $D$, all three edges incident to $D$ are removed, so $T$ and $S$ lie in different connected components. Therefore $D$ separates $T$ from $S$ in the moralized ancestral graph, and the criterion gives
\begin{align*}
T \perp_G S \mid D.
\end{align*}
Thus every DAG Markov distribution satisfies
\begin{align*}
T \perp\!\!\!\perp S \mid D.
\end{align*}
The graph records the same causal reading in both checks: once disease status is fixed, the routes through which risk, test result, and symptoms can be associated are cut at the common disease vertex.
[/example]
D-separation is therefore the first major graphical calculus of the course. It tells us which conditional independence statements are structural consequences of a causal DAG, before any estimation or identification argument begins. Later chapters will use these independences to justify adjustment sets, to detect backdoor paths, and to manipulate interventional distributions through do-calculus.
The conditional independences implied by a DAG are the tools that make adjustment arguments work. With that machinery established, the course can now ask when an intervention effect is recoverable by conditioning on observed covariates and when the back-door criterion applies.
# 6. Adjustment and the Back-Door Criterion
The previous chapters made interventions precise through potential outcomes, structural causal models, and directed acyclic graphs. This chapter asks when an interventional distribution can be recovered by conditioning on measured covariates in an observational distribution. The central point is that adjustment is not a generic recipe: the covariates must block the right non-causal paths without opening new ones or conditioning away part of the causal effect.
## Identifying Causal Effects by Conditioning
Which variables should be conditioned on when estimating the effect of a treatment? The answer depends on the graph and on the estimand. For a total effect, adjustment should remove spurious association between treatment and outcome while leaving the directed causal pathways from treatment to outcome intact.
Let $A$ denote a treatment, let $Y$ denote an outcome, and let $L$ denote a collection of pre-treatment covariates. The target intervention law is the distribution of $Y_a$. In a structural causal model, the same object is written through the law of $Y$ under $do(A=a)$.
A first way to formalise adjustment is through conditional exchangeability. The covariates $L$ are intended to make treatment assignment comparable across levels of $A$.
[definition: Conditional Exchangeability for Treatment Effects]
Let $A$ be a treatment, $Y_a$ the potential outcome under treatment level $a$, and $L$ a random element taking values in a measurable space $(\mathcal L,\mathcal A)$. Conditional exchangeability for the treatment level $a$ holds if
\begin{align*}
Y_a \perp\!\!\!\perp A \mid L.
\end{align*}
For a family of treatment levels $\mathcal T$, conditional exchangeability holds on $\mathcal T$ if the displayed condition holds for every $a \in \mathcal T$.
[/definition]
This condition says that, after fixing $L$, the observed treatment level carries no further information about the potential outcome under $a$. To turn this into an estimable formula, we also need the consistency and positivity conditions introduced in Chapters 1 and 2: the observed outcome equals the potential outcome $Y_a$ for units whose observed treatment is $A=a$, and each treatment level has positive conditional probability on the covariate strata that matter.
[quotetheorem:9671]
[citeproof:9671]
The theorem is the probabilistic heart of covariate adjustment, but each hypothesis is doing separate work. If consistency fails, as when the recorded treatment $A=1$ mixes different versions of a vaccine with different biological effects, then $Y=Y_1$ on $\{A=1\}$ is not a well-defined replacement. If exchangeability fails, as when older high-risk individuals are more likely to be vaccinated even after the recorded covariates are fixed, then $\mathbb E[Y\mid A=1,L=l]$ is not the mean outcome those same strata would have under universal vaccination. If positivity fails, as when no elderly patients in a stratum remain unvaccinated, then $\mathbb P(Y\mid A=0,L=l)$ is not learned from the data in that stratum, so the formula asks for an unsupported conditional law. The theorem also does not say which $L$ should be used or whether $L$ is measured without error; directed acyclic graphs give a systematic language for judging those issues.
[example: Vaccine Effectiveness with Age Adjustment]
Suppose $A\in\{0,1\}$ records vaccination, $Y\in\{0,1\}$ records infection during follow-up, and $L$ is age group with possible values $l$. In the graph $L\to A$, $L\to Y$, and $A\to Y$, the only back-door path from vaccination to infection is $A\leftarrow L\to Y$, so fixing $L$ blocks the age route that makes vaccinated and unvaccinated people incomparable.
Assuming consistency, positivity, and conditional exchangeability given $L$, the infection risk under vaccination is obtained by averaging the age-specific observed risks over the population age distribution:
\begin{align*}
\mathbb P(Y_1=1)=\sum_l \mathbb P(Y_1=1\mid L=l)\mathbb P(L=l).
\end{align*}
Conditional exchangeability gives $\mathbb P(Y_1=1\mid L=l)=\mathbb P(Y_1=1\mid A=1,L=l)$, and consistency on the stratum $A=1$ gives $\mathbb P(Y_1=1\mid A=1,L=l)=\mathbb P(Y=1\mid A=1,L=l)$. Hence
\begin{align*}
\mathbb P(Y_1=1)=\sum_l \mathbb P(Y=1\mid A=1,L=l)\mathbb P(L=l).
\end{align*}
Similarly,
\begin{align*}
\mathbb P(Y_0=1)=\sum_l \mathbb P(Y=1\mid A=0,L=l)\mathbb P(L=l).
\end{align*}
The adjusted risk difference comparing vaccination with no vaccination is therefore
\begin{align*}
\mathbb P(Y_1=1)-\mathbb P(Y_0=1)=\sum_l \{\mathbb P(Y=1\mid A=1,L=l)-\mathbb P(Y=1\mid A=0,L=l)\}\mathbb P(L=l).
\end{align*}
This standardizes both treatment groups to the same age distribution, so the contrast is not driven by older individuals being more likely to be vaccinated.
[/example]
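The standardization formula can be illustrated with a small numerical sketch. All stratum probabilities below are hypothetical and chosen so that older people are both more likely to be vaccinated and at higher baseline risk; the function names are illustrative.

```python
from fractions import Fraction as F

# Hypothetical strata: (P(L=l), P(A=1|L=l), P(Y=1|A=1,L=l), P(Y=1|A=0,L=l))
strata = {
    "young": (F(3, 5), F(1, 5), F(1, 50), F(1, 25)),
    "old":   (F(2, 5), F(4, 5), F(1, 10), F(1, 5)),
}

def standardized_risk(a):
    """Sum over l of P(Y=1 | A=a, L=l) P(L=l): the adjustment formula."""
    idx = 2 if a == 1 else 3
    return sum(s[0] * s[idx] for s in strata.values())

def crude_risk(a):
    """P(Y=1 | A=a): strata are weighted by P(L=l | A=a) instead of P(L=l)."""
    idx = 2 if a == 1 else 3
    def p_a(s):                                  # P(A=a | L=l)
        return s[1] if a == 1 else 1 - s[1]
    num = sum(s[0] * p_a(s) * s[idx] for s in strata.values())
    den = sum(s[0] * p_a(s) for s in strata.values())
    return num / den

adjusted = standardized_risk(1) - standardized_risk(0)
crude = crude_risk(1) - crude_risk(0)
print("adjusted risk difference:", adjusted)     # negative: vaccination lowers risk
print("crude risk difference:   ", crude)        # positive: confounded by age
```

In this made-up population vaccination halves the risk within each age group, yet the crude comparison has the opposite sign because the vaccinated group is older. Standardizing both arms to the same age distribution removes that distortion, provided the identifying assumptions of the example hold.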
This example shows that $L$ is related both to treatment assignment and to the outcome, but that verbal description is too imprecise for complicated graphs. To specify which routes of association must be controlled, we first need the graph-theoretic notion of a path.
[definition: Path]
In a directed acyclic graph $G$, a path between vertices $V_0$ and $V_m$ is a sequence of distinct vertices $V_0,V_1,\dots,V_m$ such that each adjacent pair is joined by an edge in either direction.
[/definition]
A path records potential flow of association, while the edge directions determine whether that path is causal, confounding, or blocked by a collider. Since adjustment for a total effect should remove only the non-causal routes into the treatment, the next definition isolates paths that enter the treatment node from the back.
[definition: Back-Door Path]
Let $G$ be a directed acyclic graph containing vertices $A$ and $Y$. A back-door path from $A$ to $Y$ is a path whose first edge has an arrowhead into $A$.
[/definition]
Back-door paths are the graphical representation of confounding paths for the effect of $A$ on $Y$. The next section turns this intuition into a criterion for valid adjustment.
## The Back-Door Criterion
How can a graph certify that a proposed adjustment set is valid? The back-door criterion gives a sufficient condition based on blocking every non-causal path from treatment to outcome while avoiding descendants of the treatment.
The criterion is stated using d-separation. Recall that a non-collider on a path blocks the path when conditioned on, while a collider blocks the path unless the collider or one of its descendants is conditioned on. This asymmetry is the source of many adjustment mistakes.
[definition: Back-Door Criterion]
Let $G$ be a directed acyclic graph with treatment vertex $A$, outcome vertex $Y$, and covariate set $Z$. The set $Z$ satisfies the back-door criterion relative to $(A,Y)$ if:
1. no element of $Z$ is a descendant of $A$;
2. $Z$ blocks every path from $A$ to $Y$ that begins with an arrow into $A$.
[/definition]
The first condition protects the total-effect interpretation. The second condition removes all open back-door paths, so any remaining association after conditioning travels through directed causal paths from $A$ to $Y$.
[quotetheorem:9672]
[citeproof:9672]
This theorem is a sufficient criterion rather than a definition of all valid adjustment sets. Its hypotheses also mark the main failure modes. Conditioning on a collider can violate the blocking requirement: in $A\to C\leftarrow Y$, adjusting for $C$ opens a non-causal path and creates selection bias rather than removing confounding. Positivity remains a separate statistical requirement: even if $Z$ blocks every back-door path, the conditional law of $Y$ given $A=a$ and $Z=z$ is determined by the data in the stratum $Z=z$ only if the treatment level $A=a$ occurs there with positive probability. The theorem is also only as good as the causal graph supplied to it; if an unmeasured common cause $U$ with $U\to A$ and $U\to Y$ is omitted, a measured set may appear to block all back-door paths in the drawn graph while leaving confounding in the data-generating system. Thus the criterion certifies adjustment relative to a causal DAG and a positivity regime, not relative to the observational distribution alone.
[example: Education and Wages]
Let $A$ be years of education and let $Y$ be later wages. Family background $L_1$ and baseline academic ability $L_2$ are pre-treatment variables. In the graph with edges $L_1\to A$, $L_1\to Y$, $L_2\to A$, $L_2\to Y$, and $A\to Y$, the only back-door paths from education to wages are $A\leftarrow L_1\to Y$ and $A\leftarrow L_2\to Y$. With $Z=(L_1,L_2)$, the variable $L_1$ is a non-collider on the first path and is conditioned on, so $A\leftarrow L_1\to Y$ is blocked; similarly, $L_2$ is a conditioned non-collider on $A\leftarrow L_2\to Y$, so that path is blocked. Since neither $L_1$ nor $L_2$ is a descendant of $A$, $Z$ satisfies the back-door criterion for the total effect of education on wages.
Assuming positivity for the education level $a$, the *[Back-Door Adjustment Theorem](/theorems/9672)* gives the mean wage under the intervention setting education to $a$ as
\begin{align*}
\mathbb E[Y\mid do(A=a)] = \int \mathbb E[Y\mid A=a,Z=z]\,d\mu_Z(z).
\end{align*}
Writing $z=(l_1,l_2)$ and $\mu_Z=\mu_{(L_1,L_2)}$, this is exactly
\begin{align*}
\mathbb E[Y\mid do(A=a)] = \int \mathbb E[Y\mid A=a,L_1=l_1,L_2=l_2]\,d\mu_{(L_1,L_2)}(l_1,l_2).
\end{align*}
Thus the adjusted mean compares people at education level $a$ within fixed family-background and ability strata, then averages those stratum-specific means using the population distribution of $(L_1,L_2)$.
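The path-by-path check can also be mechanised. The following is a minimal sketch, not a library API: the function names are hypothetical, d-separation is tested through the moralised ancestral graph, and the back-door clause is checked as d-separation of $A$ and $Y$ given $Z$ after deleting the edges out of $A$, which leaves exactly the back-door paths once $Z$ contains no descendant of $A$.

```python
from itertools import combinations

def descendants(edges, node):
    """All vertices reachable from `node` along directed edges."""
    children = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def d_separated(edges, x, y, z):
    """Test whether z d-separates x and y, via the moralised ancestral graph."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    # 1. Keep x, y, z and all of their ancestors.
    keep, stack = set(), [x, y, *z]
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralise: link each vertex to its parents, link co-parents, drop directions.
    nbrs = {n: set() for n in keep}
    for v in keep:
        ps = parents.get(v, set())
        for p in ps:
            nbrs[p].add(v)
            nbrs[v].add(p)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # 3. Delete z and ask whether y is still reachable from x.
    seen, stack = {x}, [x]
    while stack:
        for m in nbrs[stack.pop()] - set(z):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return y not in seen

def satisfies_backdoor(edges, a, y, z):
    """Clause 1: no descendant of a in z. Clause 2: z blocks all back-door paths."""
    if set(z) & descendants(edges, a):
        return False
    pruned = [(u, v) for u, v in edges if u != a]  # remove edges out of a
    return d_separated(pruned, a, y, z)

wage_graph = [("L1", "A"), ("L1", "Y"), ("L2", "A"), ("L2", "Y"), ("A", "Y")]
print(satisfies_backdoor(wage_graph, "A", "Y", {"L1", "L2"}))
print(satisfies_backdoor(wage_graph, "A", "Y", {"L1"}))
```

In this graph the full set $\{L_1,L_2\}$ passes, while $\{L_1\}$ alone leaves $A\leftarrow L_2\to Y$ open. The check is only relative to the drawn graph; it says nothing about omitted common causes.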
[/example]
The wage example also shows a limitation. If parental ambition, school quality, or local labour market structure are unmeasured common causes, then the displayed adjustment set may not block every back-door path in the true graph.
[remark: Minimality Is Not Required]
A valid adjustment set need not be minimal. If $Z$ blocks all back-door paths and $W$ is an additional pre-treatment variable that does not open a blocked path, then $Z\cup\{W\}$ is also valid. In finite samples, however, adding unnecessary variables can increase variance and create positivity problems, so statistical efficiency is separate from graphical validity.
[/remark]
This distinction matters because the back-door theorem is about identification, not estimation performance. Once a formula is identified, modelling choices and finite-sample behaviour become a second layer of the analysis.
## Bad Controls in Adjustment Sets
What goes wrong if we condition on every variable we have measured? Some variables destroy the estimand, and others create bias by opening paths that were previously closed. The term bad control refers to a covariate whose inclusion invalidates the intended adjustment formula or changes the causal question.
A mediator is a variable on a directed causal path from treatment to outcome. Conditioning on it is appropriate for certain direct-effect estimands, but it is not appropriate when the target is the total effect.
[definition: Mediator]
Let $A$ be a treatment and $Y$ an outcome in a directed acyclic graph $G$. A vertex $M$ is a mediator for the effect of $A$ on $Y$ if $M$ lies on a directed path from $A$ to $Y$.
[/definition]
Conditioning on a mediator blocks part of the causal pathway. This changes the estimand from a total effect toward a direct-effect contrast, and it may introduce additional bias if the mediator and outcome have a common cause $U$: the mediator is then a collider on the path $A\to M\leftarrow U\to Y$, and conditioning on it opens that path.
[example: Conditioning on Post-Treatment Blood Pressure]
Suppose $A$ is a medication, $M$ is blood pressure measured after treatment, and $Y$ is stroke occurrence. In a graph containing the directed path $A\to M\to Y$, the total effect of medication includes the part of the effect transmitted through blood pressure. The total-effect comparison is based on quantities such as
\begin{align*}
\mathbb P(Y_1=1)-\mathbb P(Y_0=1).
\end{align*}
If instead we adjust for the post-treatment value $M=m$, the comparison becomes a stratum-specific contrast of the form
\begin{align*}
\mathbb P(Y=1\mid A=1,M=m)-\mathbb P(Y=1\mid A=0,M=m).
\end{align*}
This contrast fixes a variable that lies downstream of treatment. Because $A$ affects $M$ and $M$ affects $Y$, holding $M=m$ prevents the comparison from including changes in stroke risk that operate through the medication-induced change in blood pressure. Thus the adjusted contrast compares treated and untreated individuals at the same post-treatment blood pressure, so it is not the total effect of medication on stroke occurrence.
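A small numerical sketch makes the gap concrete. All numbers are hypothetical, and the sketch assumes that $A$ is randomised, that there is no direct arrow $A\to Y$, and that $M$ and $Y$ share no common cause, so that the conditional laws below coincide with interventional ones.

```python
from fractions import Fraction as F

# Hypothetical numbers; assumptions: A randomised, no direct A -> Y arrow,
# no common cause of M and Y.
p_m1_given_a = {0: F(4, 5), 1: F(1, 5)}          # P(M=1 | A=a), M=1 is high blood pressure
p_y1 = {(0, 0): F(1, 10), (1, 0): F(1, 10),      # P(Y=1 | A=a, M=m), keyed by (a, m);
        (0, 1): F(3, 10), (1, 1): F(3, 10)}      # it does not depend on a here

def risk_under(a):
    """P(Y_a = 1) = sum over m of P(Y=1 | A=a, M=m) P(M=m | A=a)."""
    p_m1 = p_m1_given_a[a]
    return p_y1[(a, 1)] * p_m1 + p_y1[(a, 0)] * (1 - p_m1)

def stratum_contrast(m):
    """P(Y=1 | A=1, M=m) - P(Y=1 | A=0, M=m)."""
    return p_y1[(1, m)] - p_y1[(0, m)]

total_effect = risk_under(1) - risk_under(0)
print(total_effect, stratum_contrast(0), stratum_contrast(1))
```

Under these assumptions the total effect is nonzero while both mediator-stratified contrasts vanish, because the whole effect travels through $M$.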
[/example]
Mediators are not the only dangerous controls. Colliders create a different problem: they can open a path that was closed before conditioning.
[definition: Collider]
Let $G$ be a directed acyclic graph and let $P$ be a path in $G$. A non-endpoint vertex $C$ on $P$ is a collider on that path if the two edges adjacent to $C$ both have arrowheads into $C$ along the path.
[/definition]
A collider blocks a path unless the collider or a descendant of the collider is conditioned on. Inadvertent collider adjustment is common in applications, because colliders are often outcomes of selection, participation, hospital admission, or data availability, and the analysis is then implicitly restricted to one level of the collider.
[example: Selection by Hospital Admission]
Let $A$ be smoking, $Y$ be respiratory disease, and $C$ be hospital admission. In the graph $A\to C\leftarrow Y$, the vertex $C$ is a collider on the path $A\to C\leftarrow Y$, so the path is closed in the full population unless we condition on $C$.
To see explicitly how conditioning on admission can create association, suppose for illustration that $A$ and $Y$ are independent in the full population, with
\begin{align*}
\mathbb P(A=1)=\mathbb P(Y=1)=\frac12.
\end{align*}
Let admission be more likely for smokers and for people with respiratory disease:
\begin{align*}
\mathbb P(C=1\mid A=0,Y=0)=0.1,\quad \mathbb P(C=1\mid A=1,Y=0)=0.5,\quad \mathbb P(C=1\mid A=0,Y=1)=0.5,\quad \mathbb P(C=1\mid A=1,Y=1)=0.9.
\end{align*}
In the full population, independence gives
\begin{align*}
\mathbb P(Y=1\mid A=1)=\mathbb P(Y=1\mid A=0)=\frac12.
\end{align*}
Among individuals with $A=1$, the probability of admission is
\begin{align*}
\mathbb P(C=1\mid A=1)=\mathbb P(C=1\mid A=1,Y=1)\mathbb P(Y=1)+\mathbb P(C=1\mid A=1,Y=0)\mathbb P(Y=0)=0.9\cdot\frac12+0.5\cdot\frac12=0.7.
\end{align*}
Therefore
\begin{align*}
\mathbb P(Y=1\mid A=1,C=1)=\frac{\mathbb P(C=1\mid A=1,Y=1)\mathbb P(Y=1\mid A=1)}{\mathbb P(C=1\mid A=1)}=\frac{0.9\cdot \frac12}{0.7}=\frac{9}{14}.
\end{align*}
Among individuals with $A=0$, the probability of admission is
\begin{align*}
\mathbb P(C=1\mid A=0)=0.5\cdot\frac12+0.1\cdot\frac12=0.3.
\end{align*}
Thus
\begin{align*}
\mathbb P(Y=1\mid A=0,C=1)=\frac{0.5\cdot \frac12}{0.3}=\frac{5}{6}.
\end{align*}
Since $\frac{9}{14}\ne \frac{5}{6}$, smoking and respiratory disease are associated after conditioning on hospital admission even though they were independent in the full population. Restricting analysis to admitted patients has opened the collider path $A\to C\leftarrow Y$, so the resulting association is selection bias rather than evidence of a causal effect.
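The arithmetic can be checked with exact fractions. This is a sketch using the probabilities stated in the example; the variable and function names are illustrative choices.

```python
from fractions import Fraction as F

p_y = {0: F(1, 2), 1: F(1, 2)}                      # P(Y=y); Y is independent of A
p_c = {(0, 0): F(1, 10), (1, 0): F(1, 2),           # P(C=1 | A=a, Y=y), keyed by (a, y)
       (0, 1): F(1, 2), (1, 1): F(9, 10)}

def admission_prob(a):
    """P(C=1 | A=a), summing over Y and using independence of A and Y."""
    return sum(p_c[(a, y)] * p_y[y] for y in (0, 1))

def risk_among_admitted(a):
    """P(Y=1 | A=a, C=1) by Bayes' rule."""
    return p_c[(a, 1)] * p_y[1] / admission_prob(a)

print(risk_among_admitted(1), risk_among_admitted(0))
```

The two conditional risks differ although $A$ and $Y$ were constructed to be independent, which is the selection effect described above.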
[/example]
The selection example explains why a variable can be harmful even when it is strongly associated with treatment. A final common mistake is to adjust for a variable that affects treatment but is designed to have no direct route to the outcome; this motivates the next definition of an instrument.
[definition: Instrument]
Let $A$ be a treatment and $Y$ an outcome. A variable $Z$ is an instrument for the effect of $A$ on $Y$ if $Z$ affects $A$, has no directed effect on $Y$ except through $A$, and has no unblocked common cause with $Y$ in the relevant graph.
[/definition]
Instruments are useful for instrumental-variable methods, but they are usually poor adjustment variables for ordinary covariate adjustment. Conditioning on a strong instrument can amplify residual confounding by forcing comparison within levels of a variable that predicts treatment but does not explain outcome risk except through treatment.
[remark: Instruments and Adjustment]
If all confounding has already been blocked, adding an instrument need not introduce bias at the population-identification level. The warning is that instruments do not solve back-door confounding by themselves, and in the presence of unmeasured treatment-outcome confounding they can make adjusted associations more unstable and more biased in common parametric settings.
[/remark]
The practical rule is therefore not to adjust for variables solely because they predict treatment. A control variable must be judged by the paths it blocks and the paths it opens.
## The Single-World Intervention Graph Refinement
The back-door criterion is useful but conservative. Which graphical criterion describes adjustment more closely for a specified intervention? The single-world intervention graph gives a refinement by representing the intervention directly in the graph and checking separation in that modified graph.
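For a treatment set $A$ and intervention value $a$, the intervention graph splits each treatment node in two because the intervention sets treatment externally: a random part keeps the incoming arrows that describe natural treatment assignment, and a fixed part set to $a$ carries the outgoing arrows to downstream variables. This graph records the post-intervention factorisation while staying within one intervention world.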
[definition: Single-World Intervention Graph]
Let $G$ be a directed acyclic graph and let $A$ be a set of treatment vertices. The single-world intervention graph $G(a)$ is obtained by replacing each treatment vertex $A_i\in A$ by a random node $A_i$ and a fixed node $a_i$: incoming arrows into $A_i$ remain attached to the random node, outgoing arrows from $A_i$ are attached to the fixed node $a_i$, and no edge is drawn between $A_i$ and $a_i$. Non-treatment descendants of $A$ are interpreted as their potential-outcome versions under the intervention value $a$.
[/definition]
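The node-splitting construction can be sketched on an edge list. The naming scheme is a convention of this sketch only: the fixed node for a treatment named `"A"` is written `"a"`, and non-treatment vertices downstream of a fixed node receive a suffix listing all fixed values, which is a simplification when there are several treatments.

```python
def swig(edges, treatments):
    """Build a single-world intervention graph from a DAG edge list.

    Assumes vertex names are upper-case strings, so the lower-case
    fixed-node names do not collide with existing vertices.
    """
    fixed = {t: t.lower() for t in treatments}
    # Outgoing treatment edges move to the fixed node; incoming edges
    # stay on the random node. No edge joins the two halves.
    split = [(fixed.get(u, u), v) for u, v in edges]
    # Collect every vertex downstream of some fixed node.
    children = {}
    for u, v in split:
        children.setdefault(u, []).append(v)
    downstream, stack = set(), list(fixed.values())
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in downstream:
                downstream.add(c)
                stack.append(c)
    tag = "(" + ",".join(sorted(fixed.values())) + ")"

    def label(n):
        # Relabel non-treatment descendants as potential-outcome versions.
        return n + tag if n in downstream and n not in fixed else n

    return [(label(u), label(v)) for u, v in split]

print(swig([("L", "A"), ("L", "Y"), ("A", "Y")], ["A"]))
```

For the confounding triangle $L\to A$, $L\to Y$, $A\to Y$, the output keeps $L\to A$ on the random node and attaches the edge into $Y(a)$ to the fixed node $a$.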
The graph $G(a)$ keeps outgoing arrows from treatment to downstream variables, now attached to the fixed node, so it is suited to total-effect questions. To identify effects beyond the original back-door checklist, we need a criterion that checks separation in the intervention graph, where the random treatment node retains its assignment arrows but no longer has outgoing arrows.
[quotetheorem:9673]
The separation condition is a graphical version of conditional exchangeability: after $Z$ is fixed, the random mechanism that assigned $A$ is separated from the potential outcomes $Y(a)$. The descendant restriction is needed because conditioning on a downstream variable on a causal path can remove part of the total effect, even when a separation statement can be made in a modified graph. The criterion still does not identify every causal effect: if $A$ and $Y$ share an unmeasured common cause that remains connected to $Y(a)$ after conditioning on all measured $Z$, no adjustment formula of this form follows. It also relies on the proposed causal graph; if the graph omits a selection mechanism or misorients a collider, the d-separation statement can certify the wrong formula.
[example: A Harmless Pre-Treatment Predictor]
Suppose the graph contains $W\to A$ and $A\to Y$, but no route from $W$ to $Y$ except the directed route through $A$. Suppose also that the measured set $L$ blocks every genuine back-door path from $A$ to $Y$. For a fixed treatment value $a$, the single-world intervention graph separates the random treatment node $A$ from $Y(a)$ after conditioning on $L$, and conditioning additionally on $W$ does not open a collider path or block a directed path from the fixed intervention node $a$ to $Y(a)$.
Thus, under consistency and positivity on the strata of $(L,W)$, the adjusted law using $L\cup\{W\}$ is
\begin{align*}
\mathbb P(Y\in B\mid do(A=a))=\int \mathbb P(Y\in B\mid A=a,L=l,W=w)\,d\mu_{(L,W)}(l,w).
\end{align*}
Writing the joint law as $d\mu_{(L,W)}(l,w)=d\mu_{W\mid L=l}(w)\,d\mu_L(l)$, the same expression is
\begin{align*}
\mathbb P(Y\in B\mid do(A=a))=\int\left(\int \mathbb P(Y\in B\mid A=a,L=l,W=w)\,d\mu_{W\mid L=l}(w)\right)d\mu_L(l).
\end{align*}
The variable $W$ can improve prediction of who received treatment, but in this graph it supplies no additional route into the outcome once the intervention fixes $A=a$. Including $W$ therefore changes the standardization strata from $L$ to $(L,W)$, but it does not by itself create bias for the total effect; the important distinction is between predictors of treatment and variables that open non-causal paths or condition on causal descendants.
[/example]
This refinement prepares the transition to do-calculus. Back-door adjustment is the main conditioning formula used in applied work, while intervention-graph criteria explain why some valid formulas lie beyond the simplest back-door checklist.
Back-door adjustment handles many identification problems, but not all of them. When direct conditioning fails, the next chapter studies whether an effect can still be recovered through an observed mediator, even in the presence of unmeasured confounding.
# 7. The Front-Door Criterion and Mediation Structure
This chapter studies a second route to causal identification when the usual back-door adjustment is unavailable. The central question is how an effect of a treatment $A$ on an outcome $Y$ can be recovered when there is unmeasured confounding between $A$ and $Y$, but the causal pathway from $A$ to $Y$ passes through an observed intermediate variable $M$. The front-door criterion answers this by replacing adjustment for the unobserved common cause with two observable exchangeability steps involving the mediator. This chapter also separates front-door identification from mediation analysis: the same diagram may contain a mediator, but the target estimand and the needed assumptions are different.
## Mediators and Indirect Causal Paths
When treatment and outcome share an unmeasured cause, the back-door strategy from earlier chapters cannot be applied directly. The problem is whether observing a variable downstream of treatment can still carry enough information to reconstruct the interventional law of the outcome. This is possible only when the observed downstream variable captures the causal transmission from treatment to outcome in a strong graphical sense.
[definition: Mediator]
Let $(\Omega,\mathcal F,\mathbb P)$ be the observational probability space of a causal model, and let $A:\Omega\to\mathcal A$, $M:\Omega\to\mathcal M$, and $Y:\Omega\to\mathcal Y$ be random variables. The variable $M$ is a mediator from $A$ to $Y$ if $M$ lies on a directed path from $A$ to $Y$ in the causal graph, as in $A \to M \to Y$.
[/definition]
A mediator records part of the mechanism by which changing $A$ can change $Y$. The word by itself does not imply that all causal influence goes through $M$, nor does it imply that the effect is identifiable from observational data. The next condition isolates the stronger case needed for front-door identification: every directed route carrying the effect from $A$ to $Y$ must pass through the observed mediator.
[definition: Complete Mediation for Front-Door Identification]
Let $G=(V,E)$ be a directed acyclic graph with vertices $A,M,Y\in V$, where $A$ is the treatment, $M$ is the mediator, and $Y$ is the outcome. The mediator $M$ intercepts all directed paths from $A$ to $Y$ if every directed path from $A$ to $Y$ contains $M$.
[/definition]
This condition is stronger than saying that $M$ is affected by $A$ and affects $Y$. It rules out a direct arrow $A \to Y$ and any alternative directed route from $A$ to $Y$ that avoids $M$. In a structural causal model, it says that the causal effect of intervening on $A$ reaches $Y$ only through its effect on the distribution of $M$.
[example: Smoking Tar Deposits and Lung Cancer]
Let $A=1$ indicate smoking, $M$ measure tar deposits, and $Y=1$ indicate lung cancer. Suppose an unobserved factor $U$ affects both $A$ and $Y$, so the path $A \leftarrow U \to Y$ is open and $\mathbb P(Y=1\mid A=1)-\mathbb P(Y=1\mid A=0)$ is an associational contrast rather than a causal effect.
The front-door idea is that the causal path is still recoverable if tar deposits carry all directed influence from smoking to cancer. In graphical terms, every directed path from $A$ to $Y$ contains $M$, there is no open back-door path from $A$ to $M$, and every back-door path from $M$ to $Y$ is blocked after conditioning on $A$. Under these conditions, the mediator distribution under smoking is identified by the observed smoking-tar relation:
\begin{align*}
\mathbb P(M=m\mid do(A=1))=\mathbb P(M=m\mid A=1).
\end{align*}
For a fixed tar level $m$, the cancer risk under an intervention setting $M=m$ is identified by averaging the observed cancer risks within smoking strata:
\begin{align*}
\mathbb P(Y=1\mid do(M=m))=\sum_{a'\in\{0,1\}}\mathbb P(Y=1\mid M=m,A=a')\mathbb P(A=a').
\end{align*}
Combining the tar distribution induced by smoking with the stratum-averaged cancer response gives
\begin{align*}
\mathbb P(Y=1\mid do(A=1))=\sum_m \mathbb P(M=m\mid A=1)\sum_{a'\in\{0,1\}}\mathbb P(Y=1\mid M=m,A=a')\mathbb P(A=a').
\end{align*}
Thus the unobserved $U$ may still confound the smoking-cancer association, but it does not prevent identification if tar deposits intercept the whole directed effect and the two observable replacement steps are valid.
[/example]
The example shows the conceptual appeal of the front-door idea, but it also shows why the criterion is demanding. If smoking affects cancer through inflammation, screening behaviour, or another unobserved mediator not represented by $M$, then $M$ no longer intercepts all directed causal paths. If tar deposits and cancer share an unmeasured cause after conditioning on smoking, the second step of the argument also fails.
[remark: Association Through a Mediator Is Not Enough]
A strong empirical association between $A$ and $M$, and between $M$ and $Y$, does not establish a front-door structure. The criterion is causal and graphical: it is about blocked paths, directed paths, and the absence of certain unmeasured common causes. Statistical regression through a mediator may therefore estimate an associational decomposition even when the front-door estimand is not identified.
[/remark]
This distinction will matter throughout the chapter. Front-door adjustment is not a general recipe for mediation problems; it is a special identification theorem for $\mathbb P(Y \in B \mid do(A=a))$ under a particular causal graph.
## The Front-Door Criterion
The question now becomes exact: which graphical conditions allow us to identify the interventional distribution of $Y$ under $do(A=a)$ despite unmeasured $A$-$Y$ confounding? The answer requires three conditions. First, the mediator must intercept the directed causal effect. Second, the treatment-mediator relation must be unconfounded after the graph is read appropriately. Third, the mediator-outcome relation must become unconfounded after conditioning on the treatment.
[definition: Front-Door Criterion]
Let $G$ be a directed acyclic graph with treatment $A \in \mathcal A$, mediator $M \in \mathcal M$, and outcome $Y \in \mathcal Y$. In this chapter we use the single-mediator specialization of the front-door criterion: the singleton mediator $M$ satisfies the front-door criterion relative to $(A,Y)$ if:
1. every directed path from $A$ to $Y$ is intercepted by $M$;
2. there is no unblocked back-door path from $A$ to $M$;
3. every back-door path from $M$ to $Y$ is blocked by $A$.
[/definition]
The first condition says that the causal effect must pass through the observed mediator. The second condition identifies the effect of $A$ on $M$ from the observed conditional law of $M$ given $A$. The third condition identifies the effect of $M$ on $Y$ after adjusting for $A$, so the natural next problem is to combine these three replacements into one observable expression for the interventional law.
[quotetheorem:9674]
[citeproof:9674]
The formula has two visible stages. The outer factor $\mathbb P(M=m \mid A=a)$ describes how the intervention on $A$ changes the mediator. The inner averaged conditional outcome law describes what would happen if the mediator were set to $m$, with the averaging over $A$ removing the association induced by unmeasured $A$-$Y$ confounding.
This theorem is often the first identification result that feels different from back-door adjustment. Instead of making treatment as-if randomized by conditioning on observed pre-treatment variables, it uses a post-treatment variable to reconstruct the causal pathway. That is why the mediator cannot be treated as an ordinary covariate.
Each hypothesis has a distinct failure mode. If there is also a direct arrow $A \to Y$, then fixing the mediator does not remove all directed influence of treatment, so the formula misses part of the total effect. If an unmeasured common cause affects both $A$ and $M$, then $\mathbb P(M=m \mid A=a)$ is an associational mediator distribution rather than the law under $do(A=a)$. If an unmeasured common cause affects both $M$ and $Y$ within levels of $A$, then $\mathbb P(Y \in B \mid M=m,A=a')$ is not the response to setting $M=m$. The theorem identifies the total interventional law of $Y$ under $do(A=a)$; the effect of the mediator on the outcome enters only as an intermediate step, and the theorem does not deliver a direct effect bypassing the mediator or a natural direct/indirect decomposition without further assumptions. These limitations motivate the potential-outcome version below, where the replacements in the formula are made explicit.
[example: Binary Front-Door Formula]
Suppose $A,M,Y \in \{0,1\}$ and $M$ satisfies the front-door criterion. We compute the causal risk under $do(A=1)$ by applying the discrete front-door formula with $a=1$ and $B=\{1\}$.
\begin{align*}
\mathbb P(Y=1 \mid do(A=1))=\sum_{m\in\{0,1\}} \mathbb P(M=m \mid A=1)\sum_{a'\in\{0,1\}}\mathbb P(Y=1 \mid M=m,A=a')\mathbb P(A=a').
\end{align*}
Because the mediator is binary, the outer sum has exactly the two terms $m=0$ and $m=1$:
\begin{align*}
\mathbb P(Y=1 \mid do(A=1))=\mathbb P(M=0 \mid A=1)\sum_{a'\in\{0,1\}}\mathbb P(Y=1 \mid M=0,A=a')\mathbb P(A=a')+\mathbb P(M=1 \mid A=1)\sum_{a'\in\{0,1\}}\mathbb P(Y=1 \mid M=1,A=a')\mathbb P(A=a').
\end{align*}
For each fixed mediator value $m\in\{0,1\}$, the inner treatment average expands as
\begin{align*}
\sum_{a'\in\{0,1\}}\mathbb P(Y=1 \mid M=m,A=a')\mathbb P(A=a')=\mathbb P(Y=1 \mid M=m,A=0)\mathbb P(A=0)+\mathbb P(Y=1 \mid M=m,A=1)\mathbb P(A=1).
\end{align*}
Substituting this expansion into the outer sum gives
\begin{align*}
\mathbb P(Y=1 \mid do(A=1))=\sum_{m=0}^{1} \mathbb P(M=m \mid A=1)\left[\mathbb P(Y=1 \mid M=m,A=0)\mathbb P(A=0)+\mathbb P(Y=1 \mid M=m,A=1)\mathbb P(A=1)\right].
\end{align*}
Thus both treatment groups contribute to the outcome component for each mediator level. Even when the target is the causal risk under setting $A=1$, observed outcomes among units with $A=0$ remain part of the front-door adjustment through the treatment-averaged mediator-outcome response.
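The binary formula can be tested against a model in which the truth is known. The sketch below builds a hypothetical structural model with a hidden $U$ (structure $U\to A$, $U\to Y$, $A\to M$, $M\to Y$; all probabilities invented), derives the observed law of $(A,M,Y)$ by summing out $U$, and compares the front-door expression with the interventional risk computed inside the model.

```python
from itertools import product
from math import isclose

# Hypothetical SCM (all numbers invented): U -> A, U -> Y, A -> M, M -> Y.
p_u = {0: 0.6, 1: 0.4}                                            # P(U=u), U unobserved
p_a1_u = {0: 0.2, 1: 0.7}                                         # P(A=1 | U=u)
p_m1_a = {0: 0.1, 1: 0.8}                                         # P(M=1 | A=a)
p_y1_mu = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.8}    # P(Y=1 | M=m, U=u)

def bern(p1, x):
    return p1 if x == 1 else 1.0 - p1

# Observed joint law of (A, M, Y): the analyst never sees U.
joint = {
    (a, m, y): sum(
        p_u[u] * bern(p_a1_u[u], a) * bern(p_m1_a[a], m) * bern(p_y1_mu[(m, u)], y)
        for u in p_u
    )
    for a, m, y in product((0, 1), repeat=3)
}

def prob(**fixed):
    """Probability of the event fixing any of the coordinates a, m, y."""
    return sum(
        p for (a, m, y), p in joint.items()
        if all({"a": a, "m": m, "y": y}[k] == v for k, v in fixed.items())
    )

def front_door(a):
    """Front-door expression for P(Y=1 | do(A=a)) from the observed law only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_a = prob(a=a, m=m) / prob(a=a)
        inner = sum(
            prob(a=ap, m=m, y=1) / prob(a=ap, m=m) * prob(a=ap) for ap in (0, 1)
        )
        total += p_m_given_a * inner
    return total

def truth(a):
    """P(Y=1 | do(A=a)) computed inside the SCM, using the hidden U."""
    return sum(p_u[u] * bern(p_m1_a[a], m) * p_y1_mu[(m, u)] for u in p_u for m in (0, 1))

naive = prob(a=1, y=1) / prob(a=1)   # associational P(Y=1 | A=1)
print(front_door(1), truth(1), naive)
```

In this model the front-door expression agrees with the interventional risk, while the associational risk $\mathbb P(Y=1\mid A=1)$ does not, because $U$ confounds $A$ and $Y$.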
[/example]
The presence of $\mathbb P(A=a')$ in the inner average is sometimes surprising. It appears because the outcome response to a mediator value $m$ is learned across the distribution of treatment values, not only among units with the intervention value $a$.
## Conditional Exchangeability Steps Behind the Formula
The front-door formula can look algebraic, but its logic is a chain of exchangeability claims. The problem is to see exactly where observational quantities replace interventional quantities. Writing the argument in this way also makes the failure modes easier to diagnose.
[definition: Front-Door Exchangeability Conditions]
Let $A \in \mathcal A$, $M \in \mathcal M$, and $Y \in \mathcal Y$ with potential mediator $M_a$, potential outcome $Y_m$ under intervention on the mediator, and joint-intervention potential outcome $Y_{a,m}$. The front-door exchangeability conditions are:
1. $M_a \perp\!\!\!\perp A$ for each $a \in \mathcal A$;
2. $Y_m \perp\!\!\!\perp M \mid A$ for each $m \in \mathcal M$;
3. $Y_{a,m}=Y_m$ for all $a \in \mathcal A$ and $m \in \mathcal M$, and $Y_a=Y_{M_a}$;
4. $Y_m \perp\!\!\!\perp M_a$ for all $a \in \mathcal A$ and $m \in \mathcal M$.
[/definition]
These statements are a potential-outcome sufficient condition for the same formula, not a minimal restatement of the graph. The third condition is the no-direct-effect and composition content: once the mediator is fixed, changing treatment has no remaining effect on $Y$, and the outcome under $do(A=a)$ is obtained by feeding the mediator value $M_a$ into the mediator intervention response. The fourth condition is the cross-world independence needed to multiply the distribution of $Y_m$ and $M_a$; without it, the algebra below would not follow from the first two exchangeability conditions alone.
[quotetheorem:9675]
[citeproof:9675]
This version states the same target in potential-outcome notation. It emphasizes that the final formula is not a modelling assumption such as linearity or additivity; it follows from consistency, positivity, and exchangeability statements corresponding to the graph.
This derivation helps identify which empirical comparisons are being made. The treatment-mediator comparison compares mediator distributions across treatment groups. The mediator-outcome comparison compares outcomes among units with the same observed mediator and treatment, then averages over treatment prevalence.
The assumptions are strong enough to expose the failure modes. If $M_a$ is associated with $A$, as when latent health-awareness affects both treatment uptake and the mediator, the first replacement uses the wrong mediator distribution. If $Y_m$ is associated with observed $M$ within treatment strata, as when latent motivation affects both mediator choice and outcome, the second replacement confuses selection with mediator response. If there is a direct effect of $A$ on $Y$ or if $Y_m$ is dependent on $M_a$, the opening factorization itself fails: the outcome response under a mediator intervention cannot be multiplied by the mediator distribution under a different treatment intervention. Thus the theorem is best read as a careful counterfactual analogue of front-door identification, not as a general mediation decomposition.
[example: Necessity of the Counterfactual Front-Door Assumptions]
Let $A$ be treatment uptake, $M$ adherence, and $Y$ recovery. The front-door counterfactual derivation first needs the mediator replacement
\begin{align*}
\mathbb P(M_a=m)=\mathbb P(M_a=m\mid A=a)=\mathbb P(M=m\mid A=a),
\end{align*}
where the first equality uses $M_a\perp\!\!\!\perp A$ and the second uses consistency among units with $A=a$. If latent health-awareness $U$ increases both uptake and adherence, then conditioning on $A=a$ changes the distribution of $U$. The interventional mediator law averages over the population distribution of $U$:
\begin{align*}
\mathbb P(M_a=m)=\sum_u \mathbb P(M_a=m\mid U=u)\mathbb P(U=u).
\end{align*}
The observed treatment-stratum law corresponds to
\begin{align*}
\mathbb P(M_a=m\mid A=a)=\sum_u \mathbb P(M_a=m\mid U=u,A=a)\mathbb P(U=u\mid A=a).
\end{align*}
If $U$ affects adherence under treatment, the factor $\mathbb P(M_a=m\mid U=u,A=a)$ varies with $u$; if $U$ also affects uptake, then $\mathbb P(U=u\mid A=a)$ differs from $\mathbb P(U=u)$. The two weighted sums can therefore differ, so $\mathbb P(M=m\mid A=a)$ is a selected adherence distribution rather than the interventional distribution of $M_a$.
The second replacement needs, for each treatment stratum $A=a'$,
\begin{align*}
\mathbb P(Y_m\in B\mid A=a')=\mathbb P(Y_m\in B\mid M=m,A=a')=\mathbb P(Y\in B\mid M=m,A=a'),
\end{align*}
where the first equality uses $Y_m\perp\!\!\!\perp M\mid A$ and the second uses consistency among units with $M=m$ and $A=a'$. If latent motivation $V$ affects both adherence and recovery within treatment arms, then
\begin{align*}
\mathbb P(Y_m\in B\mid A=a')=\sum_v \mathbb P(Y_m\in B\mid V=v,A=a')\mathbb P(V=v\mid A=a').
\end{align*}
But among units observed with adherence level $M=m$,
\begin{align*}
\mathbb P(Y_m\in B\mid M=m,A=a')=\sum_v \mathbb P(Y_m\in B\mid V=v,M=m,A=a')\mathbb P(V=v\mid M=m,A=a').
\end{align*}
When $V$ affects adherence, $\mathbb P(V=v\mid M=m,A=a')$ differs from $\mathbb P(V=v\mid A=a')$; when $V$ affects recovery under fixed adherence, the outcome terms vary with $v$. Thus comparing high-adherence and low-adherence units at fixed treatment level mixes the mediator response with motivation.
The no-direct-effect part requires a single mediator-intervention response $Y_m$ satisfying
\begin{align*}
Y_{a,m}=Y_m
\end{align*}
for every treatment value $a$. If treatment has a pharmacological effect on recovery in addition to changing adherence, then fixing $M=m$ does not fix all causal inputs to $Y$. For the same adherence level $m$, one may have
\begin{align*}
\mathbb P(Y_{1,m}\in B)\ne \mathbb P(Y_{0,m}\in B),
\end{align*}
so there is no treatment-invariant variable $Y_m$ that can replace both joint-intervention outcomes.
Finally, the opening factorization needs the cross-world product
\begin{align*}
\mathbb P(Y_m\in B,M_a=m)=\mathbb P(Y_m\in B)\mathbb P(M_a=m).
\end{align*}
If a latent frailty variable $F$ makes units who would adhere under treatment also respond differently to a fixed adherence intervention, then the joint probability is
\begin{align*}
\mathbb P(Y_m\in B,M_a=m)=\sum_f \mathbb P(Y_m\in B,M_a=m\mid F=f)\mathbb P(F=f).
\end{align*}
When $F$ drives both quantities, this joint law is not determined by the two marginals
\begin{align*}
\mathbb P(Y_m\in B)=\sum_f \mathbb P(Y_m\in B\mid F=f)\mathbb P(F=f)
\end{align*}
and
\begin{align*}
\mathbb P(M_a=m)=\sum_f \mathbb P(M_a=m\mid F=f)\mathbb P(F=f).
\end{align*}
The product $\mathbb P(Y_m\in B)\mathbb P(M_a=m)$ then uses an independent coupling of the recovery response and the treatment-induced adherence value, while the causal setting contains a shared frailty coupling them.
[/example]
This example also separates causal assumptions from support assumptions. Even if all counterfactual independences hold, positivity can fail: if treatment always produces high adherence and control never produces high adherence, the term $\mathbb P(Y\in B\mid M=\text{high},A=0)$ has no observational support. Consistency can fail when the labels $A=a$ or $M=m$ hide multiple versions of intervention, such as two adherence programmes that produce the same recorded adherence level but different clinical monitoring. In that case the observed equality $M=m$ or $A=a$ does not select a unique potential outcome, so the replacement of potential outcomes by observed outcomes is not well-defined.
The next operational check is positivity. Even when the exchangeability statements have the right causal interpretation, the front-door formula still demands observed support for the mediator-outcome comparison in every treatment stratum that receives positive weight.
[remark: Positivity in the Front-Door Formula]
The conditional probabilities $\mathbb P(Y \in B \mid M=m,A=a')$ must be estimable wherever the formula uses them. Thus if $\mathbb P(M=m \mid A=a)>0$ and $\mathbb P(A=a')>0$, then the cell $(M=m,A=a')$ must have positive probability. Lack of such support is a failure of identification from the observed law, even if the graph has the right shape.
[/remark]
Positivity is a practical as well as mathematical restriction. A mediator value produced only by treated units cannot be used to learn the outcome response among untreated units, yet the front-door formula requires exactly that comparison when $\mathbb P(A=0)>0$.
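The support requirement in the remark can be checked mechanically before any estimation. The sketch below is one possible check, assuming binary or otherwise discrete $A$ and $M$; the function name `front_door_positivity_gaps` and the input format (a dictionary from `(a, m)` to $\mathbb P(A=a,M=m)$) are illustrative choices, not part of the course material.

```python
def front_door_positivity_gaps(p_am):
    """List the cells (a, m, a_prime) where the front-door formula needs
    P(Y | M=m, A=a_prime) but the cell (M=m, A=a_prime) has no support.

    p_am maps (a, m) -> P(A=a, M=m); missing keys are treated as probability 0.
    """
    # Marginal law of the treatment.
    p_a = {}
    for (a, _m), p in p_am.items():
        p_a[a] = p_a.get(a, 0.0) + p

    gaps = []
    for (a, m), p in p_am.items():
        if p <= 0:
            continue  # P(M=m | A=a) = 0, so this mediator value gets no weight
        for a_prime, pa in p_a.items():
            if pa > 0 and p_am.get((a_prime, m), 0.0) <= 0:
                gaps.append((a, m, a_prime))
    return sorted(gaps)


# Deterministic mediator M = A: every cross cell is unsupported.
print(front_door_positivity_gaps({(0, 0): 0.5, (1, 1): 0.5}))
# Full support: nothing is missing.
print(front_door_positivity_gaps({(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}))
```

An empty list means every conditional probability that enters the formula is supported in the observed law; it says nothing about whether the causal clauses of the criterion hold.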
## Failure Cases and Diagnostic Graphs
The next problem is to recognize when a tempting mediator does not give a valid front-door adjustment. Most failures come from violating one of the three graphical clauses while leaving the overall story superficially plausible. Each failure corresponds to a specific term in the formula becoming non-identifiable.
[example: Direct Effect Bypassing the Mediator]
Let $A$ be an educational programme, $M$ test-taking confidence, and $Y$ exam performance. The front-door condition fails if the programme also teaches examinable material directly, because then the directed path $A\to Y$ reaches the outcome without passing through $M$.
To see the failure algebraically, suppose $A,M,Y\in\{0,1\}$, the programme always raises confidence so $M=A$, and the taught material directly determines passing so $Y=A$. The true causal risk under the programme is
\begin{align*}
\mathbb P(Y=1\mid do(A=1))=1.
\end{align*}
The confidence-only front-door expression would give
\begin{align*}
\sum_{m=0}^{1}\mathbb P(M=m\mid A=1)\sum_{a'=0}^{1}\mathbb P(Y=1\mid M=m,A=a')\mathbb P(A=a').
\end{align*}
Since $M=A$, we have $\mathbb P(M=1\mid A=1)=1$ and $\mathbb P(M=0\mid A=1)=0$, so the expression reduces to
\begin{align*}
\sum_{a'=0}^{1}\mathbb P(Y=1\mid M=1,A=a')\mathbb P(A=a').
\end{align*}
The cell $(M=1,A=0)$ has probability $0$ because $M=A$, so the term $\mathbb P(Y=1\mid M=1,A=0)$ has no observational support. Even if one supplied a model-based value for that unsupported cell, the expression would only describe an outcome response at fixed confidence, whereas the true intervention $do(A=1)$ also changes exam performance through the direct teaching path. The step that fails is the replacement $Y_{a,m}=Y_m$: when teaching material has a direct effect, the same confidence level $m$ can have different outcome laws under $A=1$ and $A=0$. Thus confidence alone cannot identify the full causal effect of the programme, and it also does not identify a mediated effect without additional assumptions.
[/example]
This example violates the first clause. The observed mediator records one mechanism but not the whole causal transmission, so reconstructing the full effect from the mediator distribution loses the direct pathway.
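To see the size of the error when every cell has support, one can use a noisy variant of the programme example. The sketch below assumes a hypothetical binary model with no unmeasured confounding, a mediator that responds to $A$ with noise, and a direct term in the outcome mechanism; all coefficients are invented for illustration.

```python
# Hypothetical binary model: A -> M, M -> Y, and a direct path A -> Y.
p_a = {0: 0.5, 1: 0.5}  # P(A = a)

def p_m_given_a(m, a):
    """P(M=m | A=a): the programme raises confidence, but not deterministically."""
    p1 = 0.2 + 0.6 * a
    return p1 if m == 1 else 1.0 - p1

def p_y1_given_am(a, m):
    """P(Y=1 | A=a, M=m); the 0.5 * a term is the direct teaching effect."""
    return 0.1 + 0.5 * a + 0.3 * m

def true_do(a):
    # A is unconfounded in this sketch, so P(Y=1 | do(A=a)) = sum_m P(m|a) P(Y=1|a,m).
    return sum(p_m_given_a(m, a) * p_y1_given_am(a, m) for m in (0, 1))

def front_door_expression(a):
    # sum_m P(m|a) sum_a' P(Y=1 | m, a') P(a'), applied although the first clause fails.
    return sum(
        p_m_given_a(m, a) * sum(p_y1_given_am(ap, m) * p_a[ap] for ap in (0, 1))
        for m in (0, 1)
    )

for a in (0, 1):
    print(a, round(true_do(a), 2), round(front_door_expression(a), 2))
```

In this model the true contrast between $do(A=1)$ and $do(A=0)$ is about $0.68$, while the front-door expression yields a contrast of about $0.18$: averaging the outcome response over $a'$ discards the direct teaching term.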
[example: Confounding of Treatment and Mediator]
Let $A$ be participation in a health intervention, $M$ be adherence to a medication schedule, and $Y$ be recovery. Suppose an unmeasured health-awareness variable $U$ affects both $A$ and $M$. Then the path $A \leftarrow U \to M$ is a back-door path from treatment to mediator, so the observed mediator law within the treatment group need not equal the mediator law under an intervention setting treatment.
The interventional mediator distribution averages over the marginal distribution of $U$:
\begin{align*}
\mathbb P(M=m\mid do(A=a))=\sum_u \mathbb P(M=m\mid do(A=a),U=u)\mathbb P(U=u).
\end{align*}
The observed mediator distribution among units with $A=a$ averages over the selected distribution of $U$ in that treatment stratum:
\begin{align*}
\mathbb P(M=m\mid A=a)=\sum_u \mathbb P(M=m\mid A=a,U=u)\mathbb P(U=u\mid A=a).
\end{align*}
If health-awareness affects participation, then for some $u$ one has $\mathbb P(U=u\mid A=a)\ne \mathbb P(U=u)$. If health-awareness also affects adherence under treatment, then the mediator probabilities vary with $u$. Therefore the two weighted averages can differ:
\begin{align*}
\sum_u \mathbb P(M=m\mid do(A=a),U=u)\mathbb P(U=u)\ne \sum_u \mathbb P(M=m\mid A=a,U=u)\mathbb P(U=u\mid A=a).
\end{align*}
Thus $\mathbb P(M=m\mid A=a)$ is an adherence distribution among the selected people who participated, not necessarily the adherence distribution that would be produced by intervening to set participation to $a$. This violates the treatment-mediator component of the front-door formula.
[/example]
This failure corrupts the first observable component of the formula. Even if the mediator-outcome relation were otherwise well behaved, the formula would use the wrong distribution of mediator values under intervention.
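The two weighted averages in the example can be compared numerically. The following sketch uses hypothetical mechanisms with a binary $U$ and assumes that $U$ is the only common cause of $A$ and $M$, so that $\mathbb P(M=m\mid do(A=a),U=u)=\mathbb P(M=m\mid A=a,U=u)$.

```python
# Hypothetical mechanisms with a latent health-awareness variable U.
p_u = {0: 0.5, 1: 0.5}  # P(U = u)

def p_a_given_u(a, u):
    """P(A=a | U=u): more aware units participate more often."""
    p1 = 0.2 + 0.6 * u
    return p1 if a == 1 else 1.0 - p1

def p_m1_given_au(a, u):
    """P(M=1 | A=a, U=u): awareness also raises adherence."""
    return 0.1 + 0.3 * a + 0.5 * u

def p_m1_do(a):
    # Interventional law: average over the marginal distribution of U.
    return sum(p_m1_given_au(a, u) * p_u[u] for u in p_u)

def p_m1_obs(a):
    # Observational law: average over P(U=u | A=a), obtained by Bayes' formula.
    weights = {u: p_a_given_u(a, u) * p_u[u] for u in p_u}
    total = sum(weights.values())
    return sum(p_m1_given_au(a, u) * weights[u] / total for u in p_u)

print(round(p_m1_do(1), 2), round(p_m1_obs(1), 2))
```

Here $\mathbb P(M=1\mid do(A=1))$ is about $0.65$ while $\mathbb P(M=1\mid A=1)$ is about $0.80$, because participants are disproportionately high-awareness units.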
[example: Mediator Outcome Confounding After Conditioning on Treatment]
Let $A$ be an encouragement to use a tutoring platform, $M$ be actual platform usage, and $Y$ be final exam score. Suppose a latent motivation variable $V$ affects both platform usage and exam score within each encouragement arm. Fix an encouragement level $a'$ and a usage level $m$. The mediator-outcome replacement required by the front-door formula would be
\begin{align*}
\mathbb P(Y\in B\mid do(M=m),A=a')=\mathbb P(Y\in B\mid M=m,A=a').
\end{align*}
The left side averages the fixed-usage outcome over the motivation distribution among all units with encouragement level $a'$:
\begin{align*}
\mathbb P(Y\in B\mid do(M=m),A=a')=\sum_v \mathbb P(Y\in B\mid do(M=m),A=a',V=v)\mathbb P(V=v\mid A=a').
\end{align*}
The observed conditional risk among units who actually used the platform at level $m$ averages over the motivation distribution selected by observing $M=m$:
\begin{align*}
\mathbb P(Y\in B\mid M=m,A=a')=\sum_v \mathbb P(Y\in B\mid M=m,A=a',V=v)\mathbb P(V=v\mid M=m,A=a').
\end{align*}
If motivation affects platform usage inside the encouragement stratum, then $\mathbb P(V=v\mid M=m,A=a')$ differs from $\mathbb P(V=v\mid A=a')$ for some $v$. If motivation also affects exam score under fixed usage, then the outcome terms vary with $v$. The two weighted averages can therefore differ:
\begin{align*}
\mathbb P(Y\in B\mid do(M=m),A=a')\ne \mathbb P(Y\in B\mid M=m,A=a').
\end{align*}
Thus conditioning on encouragement leaves the back-door path $M\leftarrow V\to Y$ open, so observed high-usage and low-usage students within the same encouragement arm are still selected by motivation rather than differing only by mediator value.
[/example]
This failure is common in applications because mediators are often behavioural variables. When the mediator is chosen by the unit, unmeasured preferences, ability, or motivation may affect both the mediator and the outcome.
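A small worked computation makes the gap in the tutoring example concrete. The numbers are hypothetical and chosen only for illustration: within the arm $A=a'$, let $V$ be binary with $\mathbb P(V=1\mid A=a')=0.5$, let $\mathbb P(M=m\mid A=a',V=v)$ equal $0.2$ for $v=0$ and $0.8$ for $v=1$, and let the fixed-usage outcome probability $\mathbb P(Y\in B\mid do(M=m),A=a',V=v)=\mathbb P(Y\in B\mid M=m,A=a',V=v)$ equal $0.3$ for $v=0$ and $0.7$ for $v=1$, assuming $V$ is the only common cause of $M$ and $Y$ in that arm. The interventional average is
\begin{align*}
\mathbb P(Y\in B\mid do(M=m),A=a')=0.3\cdot 0.5+0.7\cdot 0.5=0.50.
\end{align*}
By Bayes' formula, the motivation distribution among observed users at level $m$ is
\begin{align*}
\mathbb P(V=1\mid M=m,A=a')=\frac{0.8\cdot 0.5}{0.2\cdot 0.5+0.8\cdot 0.5}=0.8,
\end{align*}
so the observed conditional risk is
\begin{align*}
\mathbb P(Y\in B\mid M=m,A=a')=0.3\cdot 0.2+0.7\cdot 0.8=0.62.
\end{align*}
The difference $0.62-0.50$ is produced entirely by the selection of motivated students into usage level $m$.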
[remark: Conditioning on a Mediator Can Create Bias]
In ordinary effect estimation, conditioning on a post-treatment variable can block part of the causal effect or open collider paths. Front-door adjustment is not an instruction to condition on any mediator in a regression. It is a particular functional of the observational law justified only when the front-door criterion holds.
[/remark]
The diagnostic lesson is that front-door identification should be checked graphically before any estimation step. The formula is a consequence of causal structure, not a data-driven selection rule for useful intermediate variables.
## Relation to Mediation Analysis and Path-Specific Effects
The final question is how front-door identification relates to mediation analysis. Both topics use variables on the causal pathway, but they ask different questions. Front-door identification asks for the total effect of $A$ on $Y$ when back-door adjustment is unavailable because the treatment-outcome confounders are unmeasured. Mediation analysis usually asks how much of an already identified effect travels through a specified pathway.
[definition: Controlled Direct Effect]
Let $A \in \mathcal A$, $M \in \mathcal M$, and let $Y$ be an integrable real-valued outcome. For treatment values $a,a' \in \mathcal A$ and mediator value $m \in \mathcal M$, the controlled direct effect on a mean scale is
\begin{align*}
\mathbb E[Y_{a,m}] - \mathbb E[Y_{a',m}],
\end{align*}
where $Y_{a,m}$ denotes the potential outcome under the joint intervention setting $A=a$ and $M=m$.
[/definition]
The controlled direct effect fixes the mediator and compares treatment values. It is therefore not the target of the front-door formula, which identifies the total effect of changing $A$ while allowing $M$ to respond naturally to that change. To discuss mediation rather than total-effect identification, we also need a quantity that changes the mediator pathway while holding the treatment component of the outcome fixed.
[definition: Natural Indirect Effect]
Let $A \in \mathcal A$, $M \in \mathcal M$, and let $Y$ be an integrable real-valued outcome. For treatment values $a,a' \in \mathcal A$, the natural indirect effect on a mean scale is
\begin{align*}
\mathbb E[Y_{a,M_a}] - \mathbb E[Y_{a,M_{a'}}].
\end{align*}
[/definition]
Natural effects compare nested counterfactuals in which the mediator is set to the value it would have taken under another treatment. These quantities require assumptions beyond the front-door criterion in many settings, especially assumptions linking counterfactual mediator values and counterfactual outcomes under conflicting interventions.
[remark: Caution About Path-Specific Language]
The front-door criterion can identify a total effect using an observed mediator, but this does not automatically identify a natural direct effect, a natural indirect effect, or every path-specific effect. Those targets involve different counterfactuals and may require cross-world independence assumptions or additional structural restrictions. Therefore a front-door analysis should state its target estimand before using mediation terminology.
[/remark]
This cautious separation prevents a common interpretation error. A valid front-door estimate is an estimate of a total causal effect, not a decomposition of that effect into direct and indirect components.
[example: Encouragement Design with an Observed Mediator]
Let $A=1$ denote encouragement to attend training, $M=1$ denote actual attendance, and $Y$ denote later earnings, and suppose encouragement is randomized. Then observing $A=a$ gives the same distribution of attendance as intervening to set encouragement to $a$:
\begin{align*}
\mathbb P(M=m\mid do(A=a))=\mathbb P(M=m\mid A=a).
\end{align*}
If every directed effect of encouragement on earnings passes through attendance, and if conditioning on $A$ blocks the back-door paths from attendance to earnings, then for each attendance value $m$ the mediator-response component is
\begin{align*}
\mathbb P(Y\in B\mid do(M=m))=\sum_{a'\in\{0,1\}}\mathbb P(Y\in B\mid M=m,A=a')\mathbb P(A=a').
\end{align*}
Combining the attendance distribution induced by encouragement with this treatment-averaged earnings response gives the total effect target:
\begin{align*}
\mathbb P(Y\in B\mid do(A=1))=\sum_{m\in\{0,1\}}\mathbb P(M=m\mid A=1)\sum_{a'\in\{0,1\}}\mathbb P(Y\in B\mid M=m,A=a')\mathbb P(A=a').
\end{align*}
Likewise, replacing $A=1$ by $A=0$ gives $\mathbb P(Y\in B\mid do(A=0))$, so the front-door analysis compares earnings under encouragement versus no encouragement while allowing attendance to change naturally under each encouragement policy. That is a total effect of encouragement, not automatically a causal effect of attendance itself; a separate attendance intervention would require a well-defined way to set $M=m$ and assumptions justifying the corresponding mediator-intervention contrast.
[/example]
The example sits between instrumental-variable reasoning and mediation language. The observed mediator is central, but the front-door conclusion is about the intervention on encouragement, not automatically about every causal contrast involving attendance. The remaining issue is therefore conceptual: even a valid front-door structure may leave finer mediation decompositions unidentified.
[quotetheorem:9676]
[citeproof:9676]
This negative result is a warning about target estimands. Identification is always relative to a query; identifying one interventional distribution does not identify all counterfactual contrasts involving the same variables.
The hypotheses matter in a different way from the front-door theorem itself. If the graph fails the front-door criterion, the total effect may not be identified at all; if the graph satisfies it but no cross-world assumptions are added, the total effect can be identified while nested mediation contrasts remain unspecified. The binary construction in the proof is a concrete version of this distinction: the same mediator marginals and outcome-response marginals can be coupled differently across worlds, changing $\mathbb E[Y_{1,M_0}]$ without changing the front-door total-effect calculation. The limitation is therefore not computational but conceptual: front-door identification answers a total-effect query, and any later path-specific claim must state and defend its own identifying assumptions.
[example: Front-Door Estimation Followed by a Mediation Claim]
Suppose a study estimates the effect of encouragement $A$ on earnings $Y$ through attendance $M$ by the front-door functional. For a measurable earnings set $B$, the reported estimand for encouragement is the total-effect interventional law
\begin{align*}
\mathbb P(Y\in B\mid do(A=1)).
\end{align*}
Under the front-door conditions, this is written in observable terms as
\begin{align*}
\mathbb P(Y\in B\mid do(A=1))=\sum_m \mathbb P(M=m\mid A=1)\sum_{a'\in\{0,1\}}\mathbb P(Y\in B\mid M=m,A=a')\mathbb P(A=a').
\end{align*}
Expanding the inner average shows exactly which observed earnings comparisons enter:
\begin{align*}
\sum_{a'\in\{0,1\}}\mathbb P(Y\in B\mid M=m,A=a')\mathbb P(A=a')=\mathbb P(Y\in B\mid M=m,A=0)\mathbb P(A=0)+\mathbb P(Y\in B\mid M=m,A=1)\mathbb P(A=1).
\end{align*}
Thus the front-door estimate answers the intervention question, "what would the earnings distribution be if encouragement were set to $1$ and attendance changed naturally according to the distribution induced by encouragement?"
A natural mediation claim asks a different question. On a mean scale, the natural indirect contrast from no encouragement to encouragement would compare
\begin{align*}
\mathbb E[Y_{1,M_1}]-\mathbb E[Y_{1,M_0}].
\end{align*}
The first term fixes the outcome response to the encouragement world $A=1$ and feeds in the attendance value that would occur under encouragement, while the second fixes the same outcome response to $A=1$ but feeds in the attendance value that would occur under no encouragement. Written by conditioning on the cross-world mediator value, the second term is
\begin{align*}
\mathbb E[Y_{1,M_0}]=\sum_m \mathbb E[Y_{1,m}\mid M_0=m]\mathbb P(M_0=m).
\end{align*}
The front-door total-effect formula identifies observable components such as $\mathbb P(M=m\mid A=1)$ and treatment-averaged laws for $Y$ under mediator values; it does not identify the cross-world conditional mean $\mathbb E[Y_{1,m}\mid M_0=m]$ without an additional assumption connecting $Y_{1,m}$ to $M_0$.
A valid report should therefore state the front-door estimate as a total effect of encouragement on earnings. A later decomposition into attendance-mediated and non-attendance-mediated parts requires separate cross-world or structural assumptions, because it involves nested counterfactuals rather than only the interventional law $\mathbb P(Y\in B\mid do(A=1))$.
[/example]
The chapter's main message is therefore precise. A mediator can identify a total causal effect in the presence of unmeasured treatment-outcome confounding when it satisfies the front-door criterion. That fact is powerful, but it does not license arbitrary adjustment for post-treatment variables or automatic claims about direct and indirect effects.
The front-door criterion shows that some effects are identifiable through mediation even when back-door adjustment is impossible. To go beyond a handful of special patterns, the course next develops do-calculus as a general symbolic system for manipulating interventional distributions.
# 8. Do-Calculus
This chapter turns graphical identification into a calculus. Chapters 6 and 7 gave criteria such as back-door and front-door adjustment, each tailored to a recognizable graphical pattern. Do-calculus is the general symbolic language for transforming interventional distributions into observational distributions whenever the graph permits it. The chapter assumes familiarity with causal DAGs, structural causal models, intervention notation, $d$-separation, latent projections, and the back-door and front-door criteria.
The central question is: when may an expression containing $do(X=x)$ be simplified, exchanged for conditioning, or removed? Pearl's answer is a set of three rules, each justified by a conditional independence statement in a modified graph. These rules are local, but their consequences are global: together with ordinary probability manipulations, they characterize nonparametric identification in directed acyclic graph models.
## Intervention Distributions and Mutilated Graphs
The notation $P(Y \mid do(X=x))$ should be read as a distribution under a new data-generating regime, not as an ordinary conditional distribution. Conditioning restricts attention to units for which $X=x$ occurred naturally; intervention replaces the mechanism assigning $X$ by the constant value $x$. To make this distinction mathematical, we first name the post-intervention law.
[definition: Intervention Distribution]
Let $G$ be a causal DAG with observed variables $V$, and let $X,Y \subset V$ be disjoint. For a value $x$ of $X$, the intervention distribution of $Y$ under $do(X=x)$ is the probability measure
\begin{align*}
P_x^Y : \mathcal Y \to [0,1],
\end{align*}
where $(\mathsf Y,\mathcal Y)$ is the measurable state space of $Y$. It is defined by
\begin{align*}
P_x^Y(A)=\mathbb P_x(Y\in A), \qquad A\in \mathcal Y,
\end{align*}
where $\mathbb P_x$ is the law of the structural causal model obtained by replacing the structural equations for variables in $X$ by the constant assignment $X=x$. When $Y$ is discrete, $P_x(y)$ abbreviates $P_x^Y(\{y\})$; with densities, $P_x(y)$ denotes the corresponding density value relative to the chosen reference measure.
[/definition]
This definition separates two operations that ordinary conditioning often conflates: observing $X=x$ gives information about the causes of $X$, while setting $X=x$ breaks the dependence of $X$ on its causes. Since do-calculus will decide validity by reading conditional independences from graphs, we need a graphical operation that records exactly which mechanisms have been replaced and which causal arrows should be ignored in a rule premise. The next definition introduces the two graph modifications that appear in the three rules.
[definition: Mutilated Graph]
Let $G=(V,E)$ be a DAG. For each subset $X \subset V$, define graph transformations
\begin{align*}
\overline{(\cdot)}_X : \{\text{DAGs on }V\} \to \{\text{DAGs on }V\}
\end{align*}
by $\overline{(\cdot)}_X(G)=G_{\overline{X}}$, and
\begin{align*}
\underline{(\cdot)}_X : \{\text{DAGs on }V\} \to \{\text{DAGs on }V\}
\end{align*}
by $\underline{(\cdot)}_X(G)=G_{\underline{X}}$. Here $G_{\overline{X}}=(V,E_{\overline{X}})$ with
\begin{align*}
E_{\overline{X}}=E\setminus \{A\to B\in E:B\in X\},
\end{align*}
and $G_{\underline{X}}=(V,E_{\underline{X}})$ with
\begin{align*}
E_{\underline{X}}=E\setminus \{A\to B\in E:A\in X\}.
\end{align*}
[/definition]
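The two edge deletions are simple enough to state as code. The sketch below represents a DAG by a set of `(parent, child)` pairs; the function names `remove_incoming` and `remove_outgoing` are illustrative, not standard library calls.

```python
def remove_incoming(edges, xs):
    """Edge set of G with every arrow INTO a node of xs deleted (the overline graph)."""
    return {(a, b) for (a, b) in edges if b not in xs}

def remove_outgoing(edges, xs):
    """Edge set of G with every arrow OUT OF a node of xs deleted (the underline graph)."""
    return {(a, b) for (a, b) in edges if a not in xs}

# Example graph: U -> X, U -> Y, X -> Y.
edges = {("U", "X"), ("U", "Y"), ("X", "Y")}

print(sorted(remove_incoming(edges, {"X"})))  # U -> X is removed
print(sorted(remove_outgoing(edges, {"X"})))  # X -> Y is removed
```

Both operations leave the vertex set unchanged and can only delete edges, so the result of either one is again a DAG on $V$.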
The overline notation corresponds to intervention because the incoming mechanisms for $X$ have been removed. The underline notation is used in do-calculus when an action is temporarily treated as an observation and outgoing causal influence from that action is suppressed for the relevant separation statement. With these modified graphs available, we can connect the symbolic intervention law to the factorization of the original causal model.
[example: Randomized Treatment Intervention]
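Consider the graph with edges $U\to X$, $U\to Y$, and $X\to Y$, where $U$ is health status, $X$ is treatment, and $Y$ is the outcome. For discrete $U$, the observational conditional distribution decomposes as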
\begin{align*}
P(y \mid x)=\sum_u P(y \mid x,u)P(u \mid x).
\end{align*}
Because the graph contains $U \to X$, observing $X=x$ generally changes the distribution of $U$: by [Bayes' formula](/theorems/1114),
\begin{align*}
P(u \mid x)=\frac{P(x \mid u)P(u)}{\sum_{u'}P(x \mid u')P(u')}.
\end{align*}
Thus $P(y\mid x)$ averages the outcome mechanism $P(y\mid x,u)$ using the distribution of health status among units who naturally received treatment value $x$.
Under $do(X=x)$, the structural mechanism for $X$ is replaced by the constant assignment $x$, so the incoming edge $U\to X$ is removed in $G_{\overline X}$. The law of $U$ is not changed by this intervention, and the outcome mechanism is evaluated at the imposed treatment value:
\begin{align*}
P_x(y)=\sum_u P(y \mid x,u)P(u).
\end{align*}
The difference between the two formulas is the weighting term: conditioning uses $P(u\mid x)$, while intervention uses the natural baseline distribution $P(u)$. Therefore $P_x(Y)$ represents the outcome distribution when treatment is externally assigned, rather than the outcome distribution among units who happened to receive that treatment.
[/example]
The graph operation is not merely a mnemonic. Once an intervention replaces the assignment for $X$, the observational factor for $X$ should no longer be used, while the other structural mechanisms should remain in force. The algebraic problem is to express the post-intervention joint law by deleting exactly the mechanism that has been externally fixed and retaining the mechanisms that have not been changed.
[quotetheorem:9677]
[citeproof:9677]
This formula is often enough for randomized or fully specified models, but it is not itself an identification formula when unobserved variables or unknown mechanisms remain in the product. Each hypothesis matters. The structural causal model assumption is what says that non-intervened mechanisms are invariant under the intervention; without it, the observational factorization alone would not rule out a regime in which setting $X$ also changes the conditional law of some child of $X$. The causal Markov factorization is also essential: if the observed joint law is merely decomposed by an arbitrary chain rule order, deleting the factor for $X$ has no causal interpretation.
The observed-variable scope is another limitation. In the confounded graph $U\to X$, $U\to Y$, and $X\to Y$, the truncated product for $P_x(y)$ still involves the distribution of the latent $U$ and the outcome mechanism conditional on $U$. Replacing that expression by $P(y\mid x)$ would misuse the theorem: conditioning on $X=x$ retains information about $U$, while the intervention law has removed the mechanism by which $U$ affected $X$. Thus the theorem tells us how an intervention changes a compatible structural model; it does not by itself guarantee that the resulting interventional distribution is expressible from the observed marginal law on $V$. Identification asks for expressions in terms of the observed law when those structural factors are not available. Do-calculus gives transformations that are valid across every model inducing the same graph.
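The contrast between the truncated product and ordinary conditioning in the confounded graph $U\to X$, $U\to Y$, $X\to Y$ can be sketched as follows. The mechanisms are hypothetical binary ones, and the code is given access to the latent $U$, which is exactly what an analyst working from the observed law does not have.

```python
def bern(p, v):
    """Probability that a Bernoulli(p) variable equals v."""
    return p if v == 1 else 1.0 - p

# Hypothetical structural factors for U -> X, U -> Y, X -> Y.
def p_u(u):
    return bern(0.5, u)

def p_x_given_u(x, u):
    return bern(0.2 + 0.6 * u, x)

def p_y_given_xu(y, x, u):
    return bern(0.2 + 0.2 * x + 0.4 * u, y)

def interventional_y(y, x_do):
    # Truncated product: the factor P(x | u) is deleted, all other factors are kept.
    return sum(p_u(u) * p_y_given_xu(y, x_do, u) for u in (0, 1))

def conditional_y(y, x):
    # Ordinary conditioning keeps P(x | u), which reweights U towards units with X = x.
    num = sum(p_u(u) * p_x_given_u(x, u) * p_y_given_xu(y, x, u) for u in (0, 1))
    den = sum(p_u(u) * p_x_given_u(x, u) for u in (0, 1))
    return num / den

print(round(interventional_y(1, 1), 2), round(conditional_y(1, 1), 2))
```

With these values $P_x(y)$ is about $0.60$ and $P(y\mid x)$ is about $0.72$ at $x=y=1$. Both numbers were computed using $U$; identification asks whether the first can be recovered without it.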
## The Three Rules of Do-Calculus
Each rule answers a different symbolic question. Can an observation be inserted or deleted? Can an action be exchanged with an observation? Can an action be inserted or deleted? In each case the rule licenses the step whenever a corresponding $d$-separation statement holds in the appropriate mutilated graph.
Throughout this section, let $G$ be a causal DAG with observed variables $V$. For disjoint subsets $X,Y,Z,W \subset V$, write $P_x(y \mid z,w)$ for $P(Y=y \mid do(X=x),Z=z,W=w)$ whenever the conditioning event has positive probability under the relevant intervention distribution.
[quotetheorem:9678]
[citeproof:9678]
Rule one says that irrelevant observations may be ignored after the intervention has been represented in the graph. The hypothesis is not cosmetic: if $Z$ remains connected to $Y$ by an open path in $G_{\overline{X}}$ after conditioning on $W$, then observing $Z$ may change the distribution of $Y$ even though $X$ has already been fixed by intervention. For instance, if $Z\to Y$ and neither variable is conditioned away, deleting $Z$ from $P_x(y\mid z)$ would discard genuine predictive information about the outcome. Rule one is therefore a conditional irrelevance rule, not a license to remove arbitrary covariates.
The limitation of rule one also explains why a second rule is needed. Sometimes $Z$ is not irrelevant as an observation, but the graph says that actively setting $Z$ and passively observing $Z$ have the same consequence for the target distribution. The next rule formalizes exactly that action-observation exchange.
[quotetheorem:9679]
[citeproof:9679]
Rule two is the algebraic heart of adjustment. It is the step that turns a causal action into an observed covariate when the modified graph says that the action-observation contrast is irrelevant for the target distribution. The underline operation is only a test used in the rule premise; it is not a claim that the structural model has had the causal effect of $Z$ permanently deleted. The theorem says that two particular conditional distributions agree under the stated separation and positivity conditions. It does not say that $Z$ has no causal effect on $Y$ in the original model, nor that every expression involving $do(Z=z)$ may be replaced by conditioning on $Z=z$.
The condition cannot be dropped. If $Z\to Y$ and $U\to Z$, $U\to Y$, then $P_z(y)$ generally differs from $P(y\mid z)$: intervention sets $Z$ while leaving the distribution of $U$ unchanged, whereas conditioning on $Z=z$ selects units with a different distribution of $U$. Suppressing the outgoing arrow from $Z$ in the graph check asks whether this selection difference is all that remains relevant after $X,W$ are fixed. When the required separation fails, the attempted exchange may mix causal influence and selection information.
This distinction is why rule two appears in both back-door and front-door derivations. In back-door adjustment it exchanges the treatment intervention for treatment observation within covariate strata; in front-door adjustment it exchanges the mediator intervention for mediator observation after conditioning on the treatment. Yet exchanging an action for an observation is not always the final step. Many derivations also create intervention symbols that are no longer meant to be observed at all: after other variables have been fixed, the remaining question is whether the action still changes the conditional law of the target. This is a stronger request than rule two answers, because deleting an intervention removes the intervention symbol entirely rather than replacing it by an observed value. Conditioning variables create the main obstruction: if the action affects a variable in the conditioning set, then conditioning may select a different subpopulation under intervention than under the natural regime. Rule three is introduced to handle exactly this action-deletion problem while keeping those ancestor effects visible in the graphical premise.
[quotetheorem:9680]
[citeproof:9680]
The third rule is the deletion rule for actions. Its hypothesis is stricter than ordinary conditional independence because conditioning on $W$ may make some components of $Z$ relevant through their influence on $W$; this is the reason for the special subset $Z(W)$. A concrete failure occurs in the graph $Z\to W$ and $Z\to Y$ with independent noise in the equations for $W$ and $Y$. For example, if $Z$ is binary, $W=Z\oplus N_W$ with noise $N_W$, and $Y=Z\oplus N_Y$ with noise $N_Y$, then $P_z(y\mid w)$ fixes the value of the common cause $Z$ of both $W$ and $Y$, while $P(y\mid w)$ averages over the posterior mixture of $Z$ values among units with $W=w$. These two conditional laws generally differ. The ancestor qualification keeps this possibility visible in the graph premise instead of allowing an intervention on an ancestor of $W$ to disappear without checking how conditioning on $W$ selects units. When the separation condition does hold, the intervention on $Z$ has no remaining route by which it can alter the law of $Y$ once $X$ and $W$ are fixed.
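The failure case in the graph $Z\to W$, $Z\to Y$ can be verified by brute-force enumeration. The sketch below assumes hypothetical noise levels $\mathbb P(N_W=1)=0.1$ and $\mathbb P(N_Y=1)=0.2$ and a fair coin for $Z$ in the natural regime; the function name `p_y_given_w` is illustrative.

```python
from itertools import product

def bern(p, v):
    """Probability that a Bernoulli(p) variable equals v."""
    return p if v == 1 else 1.0 - p

def p_y_given_w(y, w, z_do=None):
    """P(Y=y | W=w) in the natural regime, or P_z(Y=y | W=w) when z_do is set.

    Structural equations: W = Z xor N_W and Y = Z xor N_Y, with independent noise.
    """
    num = den = 0.0
    for z, n_w, n_y in product((0, 1), repeat=3):
        if z_do is None:
            p_z = bern(0.5, z)                 # natural mechanism for Z
        else:
            p_z = 1.0 if z == z_do else 0.0    # mechanism replaced by a constant
        p = p_z * bern(0.1, n_w) * bern(0.2, n_y)
        if (z ^ n_w) == w:
            den += p
            if (z ^ n_y) == y:
                num += p
    return num / den

print(round(p_y_given_w(1, 1, z_do=1), 2))  # action kept
print(round(p_y_given_w(1, 1), 2))          # action deleted
```

Under these assumptions $P_z(y\mid w)$ is about $0.80$ and $P(y\mid w)$ is about $0.74$ at $z=y=w=1$, so deleting the action changes the conditional law, as the ancestor qualification predicts.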
This rule is especially useful near the end of an identification derivation. After rule two has converted some actions into observations, rule three often removes the remaining actions whose causal influence has been blocked or intercepted by variables already in the expression.
[example: A Simple Action Deletion]
Suppose the observed DAG is the directed chain $Z \to X \to Y$, so that this chain is the only path connecting $Z$ to $Y$. Its observational factorization is
\begin{align*}
P(z,x,y)=P(z)P(x\mid z)P(y\mid x).
\end{align*}
Under $do(X=x,Z=z)$, the mechanisms for both $X$ and $Z$ are replaced, so the only remaining factor involving $Y$ is evaluated at the imposed value $x$:
\begin{align*}
P_{x,z}(y)=P(y\mid x).
\end{align*}
Under $do(X=x)$ alone, the mechanism for $Z$ remains natural while the outcome factor is still evaluated at $x$:
\begin{align*}
P_x(y)=\sum_{z'} P_x(y,z').
\end{align*}
The truncated product gives
\begin{align*}
P_x(y,z')=P(z')P(y\mid x).
\end{align*}
Therefore
\begin{align*}
P_x(y)=\sum_{z'}P(z')P(y\mid x)=P(y\mid x)\sum_{z'}P(z')=P(y\mid x).
\end{align*}
Thus $P_{x,z}(y)=P_x(y)$. Graphically, *Pearl's Rule Three* gives the same deletion: in $G_{\overline X\overline Z}$ the incoming edge $Z\to X$ has been removed, and by assumption no other path connects $Z$ to $Y$, so $Y$ and $Z$ are $d$-separated after $X$ is fixed.
[/example]
## Deriving Back-Door Adjustment
The back-door criterion was introduced earlier as a standalone identification result. Do-calculus explains why it works: after intervening on $X$, a valid adjustment set $Z$ may be inserted as an observation, the action on $X$ may be exchanged for observation within each stratum of $Z$, and the resulting expression is observational.
[quotetheorem:9681]
[citeproof:9681]
The derivation shows that adjustment is not a separate principle from do-calculus. It is a short sequence of probability expansion plus rules two and three. Each hypothesis has a distinct job: blocking back-door paths justifies replacing the action on $X$ by observation of $X$, while excluding descendants of $X$ ensures that the distribution of $Z$ is not changed by the intervention. If $Z$ were a post-treatment variable, such as $X\to Z\to Y$, then replacing $P_x(z)$ by $P(z)$ would generally be false because treatment changes the mediator distribution. If $Z$ failed to block a confounding path, such as $X\leftarrow U\to Y$, then $P(y\mid x,z)$ would still mix causal effect with selection information. Positivity is also part of the statement: if a stratum has $P(X=x,Z=z)=0$, the conditional term $P(y\mid x,z)$ cannot be estimated or interpreted as an ordinary conditional probability in that stratum.
The theorem does not say that every regression adjustment using $Z$ is efficient, stable, or appropriate for finite data, and it does not identify effects outside the support where the required conditional laws exist. It also does not claim that $Z$ is the only valid adjustment set; different sets may satisfy the same graphical requirements. The result is therefore best read as a diagnostic recipe: it tells us which graphical facts are needed before a familiar regression-style adjustment formula can be interpreted causally. The next section uses the same rules in a setting where no single pre-treatment adjustment set is available, so identification must pass through a mediator instead of through direct covariate adjustment.
[example: Confounded Treatment with Measured Covariate]
Let $Z$ be socioeconomic status, $X$ a treatment, and $Y$ a recovery outcome, with arrows $Z \to X$, $Z \to Y$, and $X \to Y$. We identify the intervention distribution $P_x(y)$ by expanding over the covariate strata:
\begin{align*}
P_x(y)=\sum_z P_x(y \mid z)P_x(z).
\end{align*}
Because $Z$ is pre-treatment, intervening on $X$ does not change the marginal law of $Z$; by *Pearl's Rule Three*, $P_x(z)=P(z)$. Because $Z$ blocks the only back-door path $X \leftarrow Z \to Y$, the action $do(X=x)$ may be exchanged for observing $X=x$ within each stratum of $Z$; by *[Pearl's Rule Two](/theorems/9679)*, $P_x(y\mid z)=P(y\mid x,z)$. Substituting these two identities into the expansion gives
\begin{align*}
P_x(y)=\sum_z P(y \mid x,z)P(z).
\end{align*}
Thus the causal effect is obtained by taking the treatment-outcome association inside each socioeconomic stratum and averaging those stratum-specific quantities using the natural population distribution of $Z$.
[/example]
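The adjustment formula from this example can be checked numerically on a small discrete model. The sketch below assumes a hypothetical binary structural model for the graph $Z \to X$, $Z \to Y$, $X \to Y$; every probability table and helper name (`bern`, `prob`, `backdoor`, `truth`, `naive`) is an illustrative assumption, not part of the course material. It compares the observational expression $\sum_z P(y\mid x,z)P(z)$ with the interventional law obtained from the truncated factorization.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical binary model for Z -> X, Z -> Y, X -> Y (all numbers are illustrative).
p_z = {0: F(3, 5), 1: F(2, 5)}                      # P(Z=z)
p_x1_given_z = {0: F(1, 5), 1: F(7, 10)}            # P(X=1 | Z=z)
p_y1_given_xz = {(0, 0): F(1, 10), (1, 0): F(3, 10),
                 (0, 1): F(1, 2), (1, 1): F(4, 5)}  # P(Y=1 | X=x, Z=z), keyed by (x, z)

def bern(p1, value):
    """Probability of a binary value when P(value=1) = p1."""
    return p1 if value == 1 else 1 - p1

NAMES = ("z", "x", "y")
# Observational joint P(z, x, y) from the Markov factorization of the DAG.
joint = {(z, x, y): p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_xz[(x, z)], y)
         for z, x, y in product((0, 1), repeat=3)}

def prob(**fixed):
    """Observational probability of the event that fixes some of z, x, y."""
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

def backdoor(x, y):
    """sum_z P(y | x, z) P(z), using only the observational joint."""
    return sum(prob(z=z, x=x, y=y) / prob(z=z, x=x) * prob(z=z) for z in (0, 1))

def truth(x, y):
    """P_x(y) from the truncated factorization: the factor P(x | z) is dropped."""
    return sum(p_z[z] * bern(p_y1_given_xz[(x, z)], y) for z in (0, 1))

def naive(x, y):
    """The unadjusted conditional P(y | x)."""
    return prob(x=x, y=y) / prob(x=x)
```

In this hypothetical model the adjusted expression agrees exactly with the truncated factorization, while the unadjusted conditional `naive(1, 1)` does not, which is the numerical face of the confounding path through $Z$.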
## Deriving Front-Door Adjustment
Front-door adjustment handles a different obstruction. The treatment-outcome relation is confounded, so $P(y \mid x)$ is not causal, but a measured mediator $Z$ transmits all directed causal influence from $X$ to $Y$. Do-calculus identifies the effect by splitting it into the effect of $X$ on $Z$ and the effect of $Z$ on $Y$ after removing the confounding through a second adjustment.
[quotetheorem:9682]
[citeproof:9682]
The front-door formula illustrates why do-calculus is more flexible than a single adjustment criterion. Its hypotheses also show why the result is delicate. If $Z$ fails to intercept a directed path from $X$ to $Y$, then deleting $do(X=x)$ after fixing $Z$ would miss the remaining direct effect. If the $X$-$Z$ relation is confounded, then $P(z\mid x)$ is not the causal effect of $X$ on the mediator. If the $Z$-$Y$ relation has an unblocked back-door path after conditioning on $X$, then the inner term $P(y\mid z,x')$ is not the causal response to the mediator. The formula works because the three conditions identify three different pieces of the derivation, not because the mediator is merely observed.
The front-door formula is the first place where do-calculus genuinely outperforms ordinary adjustment. Its derivation moves between several intervention regimes and uses different separation statements at different stages.
[example: Smoking, Tar, and Disease]
Let $X$ denote smoking, $Z$ tar deposits, and $Y$ lung disease. Suppose an unobserved genetic factor confounds $X$ and $Y$, all directed effect of smoking on disease passes through tar, there is no unobserved confounding between $X$ and $Z$, every back-door path from $Z$ to $Y$ runs through $X$, and the relevant conditional probabilities are defined on the displayed supports. We identify $P_x(y)$ by first expanding over the mediator:
\begin{align*}
P_x(y)=\sum_z P_x(y\mid z)P_x(z).
\end{align*}
Because there is no unblocked back-door path from $X$ to $Z$, *Pearl's Rule Two* exchanges the action $do(X=x)$ for the observation $X=x$ in the mediator factor:
\begin{align*}
P_x(z)=P(z\mid x).
\end{align*}
Substituting this into the expansion gives
\begin{align*}
P_x(y)=\sum_z P_x(y\mid z)P(z\mid x).
\end{align*}
For the outcome factor, observing tar level $Z=z$ after setting smoking is equivalent to setting tar level once the front-door separation condition is checked, so *Pearl's Rule Two* gives
\begin{align*}
P_x(y\mid z)=P_{x,z}(y).
\end{align*}
Since every directed path from $X$ to $Y$ passes through $Z$, fixing $Z=z$ blocks the remaining directed causal influence of smoking on disease; *Pearl's Rule Three* deletes the smoking intervention:
\begin{align*}
P_{x,z}(y)=P_z(y).
\end{align*}
Therefore
\begin{align*}
P_x(y)=\sum_z P(z\mid x)P_z(y).
\end{align*}
It remains to express $P_z(y)$ observationally. Expanding over smoking status under $do(Z=z)$ gives
\begin{align*}
P_z(y)=\sum_{x'}P_z(y\mid x')P_z(x').
\end{align*}
The intervention on tar does not change the marginal distribution of smoking, since $Z$ is downstream of $X$ and has no directed effect back into $X$; by *Pearl's Rule Three*,
\begin{align*}
P_z(x')=P(x').
\end{align*}
Every back-door path from $Z$ to $Y$ is blocked by conditioning on $X$, so *Pearl's Rule Two* exchanges the action $do(Z=z)$ for the observation $Z=z$ inside each smoking stratum:
\begin{align*}
P_z(y\mid x')=P(y\mid z,x').
\end{align*}
Substituting these two identities into the expansion for $P_z(y)$ yields
\begin{align*}
P_z(y)=\sum_{x'}P(y\mid z,x')P(x').
\end{align*}
Putting this expression back into the earlier formula gives
\begin{align*}
P_x(y)=\sum_z P(z\mid x)\sum_{x'}P(y\mid z,x')P(x').
\end{align*}
The factor $P(z\mid x)$ measures how smoking changes tar deposits, while the inner sum averages the tar-disease association within smoking strata using the natural smoking distribution.
[/example]
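A numerical check makes the front-door derivation concrete. The sketch below assumes a hypothetical binary model with a hidden variable $U$ and arrows $U \to X$, $U \to Y$, $X \to Z$, $Z \to Y$; the probability tables and function names are illustrative assumptions. The hidden variable is used only to generate the observed law and the ground truth; the function `front_door` reads nothing but the observed joint over $(X,Z,Y)$.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical model: U -> X, U -> Y, X -> Z, Z -> Y, with U unobserved.
p_u1 = F(1, 2)                                     # P(U=1)
p_x1_given_u = {0: F(1, 5), 1: F(4, 5)}            # P(X=1 | U=u)
p_z1_given_x = {0: F(1, 10), 1: F(9, 10)}          # P(Z=1 | X=x)
p_y1_given_zu = {(0, 0): F(1, 10), (0, 1): F(1, 2),
                 (1, 0): F(3, 10), (1, 1): F(9, 10)}  # P(Y=1 | Z=z, U=u), keyed by (z, u)

def bern(p1, value):
    return p1 if value == 1 else 1 - p1

NAMES = ("x", "z", "y")
obs = {}  # observed joint P(x, z, y), obtained by summing U out
for u, x, z, y in product((0, 1), repeat=4):
    p = (bern(p_u1, u) * bern(p_x1_given_u[u], x)
         * bern(p_z1_given_x[x], z) * bern(p_y1_given_zu[(z, u)], y))
    obs[(x, z, y)] = obs.get((x, z, y), 0) + p

def prob(**fixed):
    return sum(p for key, p in obs.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

def front_door(x, y):
    """sum_z P(z | x) sum_x' P(y | z, x') P(x'), from the observed joint only."""
    total = 0
    for z in (0, 1):
        inner = sum(prob(x=xp, z=z, y=y) / prob(x=xp, z=z) * prob(x=xp) for xp in (0, 1))
        total += prob(x=x, z=z) / prob(x=x) * inner
    return total

def truth(x, y):
    """P_x(y) from the truncated factorization of the full model (uses the hidden U)."""
    return sum(bern(p_u1, u) * bern(p_z1_given_x[x], z) * bern(p_y1_given_zu[(z, u)], y)
               for u, z in product((0, 1), repeat=2))

def naive(x, y):
    return prob(x=x, y=y) / prob(x=x)
```

Exact rational arithmetic is used so that agreement between `front_door` and `truth` is an identity rather than a floating-point coincidence. In this model `naive(1, 1)` differs from `truth(1, 1)`, so the agreement is not inherited from an unconfounded treatment.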
## Conditional Effects and Four-Node Identification
Many applied questions ask for conditional intervention distributions such as $P(Y \mid do(X=x),Z=z)$. These are not obtained by conditioning an already identified marginal effect unless the conditioning variable has the right causal status. Do-calculus keeps the conditioning variable inside the query and tests which actions and observations can be moved around it.
[example: Identifying a Conditional Effect in a Four-Node DAG]
Consider the observed DAG with arrows $W \to X$, $W \to Z$, $X \to Z$, $Z \to Y$, and $W \to Y$, and assume the displayed conditional probabilities are defined. We identify the conditional intervention distribution $P_x(y\mid z)$ by conditioning on the possible values of $W$ under the law induced by $do(X=x)$:
\begin{align*}
P_x(y\mid z)=\sum_w P_x(y,w\mid z)=\sum_w P_x(y\mid z,w)P_x(w\mid z).
\end{align*}
For the outcome factor, in $G_{\underline X}$ the edge $X\to Z$ is removed. Two paths from $X$ to $Y$ remain. The path $X\leftarrow W\to Y$ is blocked by conditioning on $W$; the path $X\leftarrow W\to Z\to Y$ is blocked by conditioning on $W$ and also by conditioning on $Z$. Hence $Y\perp X\mid Z,W$ in $G_{\underline X}$, so *Pearl's Rule Two* gives
\begin{align*}
P_x(y\mid z,w)=P(y\mid x,z,w).
\end{align*}
It remains to identify $P_x(w\mid z)$. By the definition of conditional probability under the law induced by $do(X=x)$,
\begin{align*}
P_x(w\mid z)=\frac{P_x(w,z)}{P_x(z)}.
\end{align*}
The numerator factors as
\begin{align*}
P_x(w,z)=P_x(z\mid w)P_x(w).
\end{align*}
Since $W$ is not downstream of $X$, intervening on $X$ does not change the marginal law of $W$; equivalently, *Pearl's Rule Three* gives
\begin{align*}
P_x(w)=P(w).
\end{align*}
Also, after conditioning on $W$, the only back-door path from $X$ to $Z$ is blocked, so *Pearl's Rule Two* gives
\begin{align*}
P_x(z\mid w)=P(z\mid x,w).
\end{align*}
Therefore
\begin{align*}
P_x(w,z)=P(z\mid x,w)P(w).
\end{align*}
Summing this identity over all values of $W$, written $w'$ to keep them apart from the fixed value $w$, gives the denominator:
\begin{align*}
P_x(z)=\sum_{w'}P_x(w',z)=\sum_{w'}P(z\mid x,w')P(w').
\end{align*}
Thus
\begin{align*}
P_x(w\mid z)=\frac{P(z\mid x,w)P(w)}{\sum_{w'}P(z\mid x,w')P(w')}.
\end{align*}
Substituting the two identified factors into the expansion over $W$ yields
\begin{align*}
P_x(y\mid z)=\sum_w P(y\mid x,z,w)\frac{P(z\mid x,w)P(w)}{\sum_{w'}P(z\mid x,w')P(w')}.
\end{align*}
The formula is observational, but it still conditions on the post-treatment value $Z=z$ after the intervention, so it identifies the conditional causal distribution rather than conditioning an already-marginalized effect.
[/example]
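The identified expression can be compared with the post-intervention law on a small discrete model. The sketch below assumes a hypothetical binary model for the DAG $W \to X$, $W \to Z$, $X \to Z$, $Z \to Y$, $W \to Y$; the probability tables and helper names are illustrative assumptions. The function `formula` uses only the observational joint, while `truth` conditions the truncated factorization on $Z=z$.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical binary model for W -> X, W -> Z, X -> Z, Z -> Y, W -> Y.
p_w1 = F(2, 5)
p_x1_given_w = {0: F(3, 10), 1: F(7, 10)}
p_z1_given_xw = {(0, 0): F(1, 5), (0, 1): F(2, 5),
                 (1, 0): F(3, 5), (1, 1): F(9, 10)}   # keyed by (x, w)
p_y1_given_zw = {(0, 0): F(1, 10), (0, 1): F(2, 5),
                 (1, 0): F(1, 2), (1, 1): F(4, 5)}    # keyed by (z, w)

def bern(p1, value):
    return p1 if value == 1 else 1 - p1

NAMES = ("w", "x", "z", "y")
joint = {(w, x, z, y): bern(p_w1, w) * bern(p_x1_given_w[w], x)
         * bern(p_z1_given_xw[(x, w)], z) * bern(p_y1_given_zw[(z, w)], y)
         for w, x, z, y in product((0, 1), repeat=4)}

def prob(**fixed):
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

def cond(event, given):
    """Observational conditional probability P(event | given)."""
    return prob(**event, **given) / prob(**given)

def formula(x, z, y):
    """sum_w P(y|x,z,w) P(z|x,w) P(w) / sum_w' P(z|x,w') P(w')."""
    weights = {w: cond({"z": z}, {"x": x, "w": w}) * prob(w=w) for w in (0, 1)}
    norm = sum(weights.values())
    return sum(cond({"y": y}, {"x": x, "z": z, "w": w}) * weights[w] / norm for w in (0, 1))

def truth(x, z, y):
    """P_x(y | z): condition the truncated factorization P(w) P(z|x,w) P(y|z,w) on Z=z."""
    num = sum(bern(p_w1, w) * bern(p_z1_given_xw[(x, w)], z) * bern(p_y1_given_zw[(z, w)], y)
              for w in (0, 1))
    den = sum(bern(p_w1, w) * bern(p_z1_given_xw[(x, w)], z) for w in (0, 1))
    return num / den

def marginal_effect(x, y):
    """P_x(y), shown only for comparison with the conditional query."""
    return sum(bern(p_w1, w) * bern(p_z1_given_xw[(x, w)], z) * bern(p_y1_given_zw[(z, w)], y)
               for w, z in product((0, 1), repeat=2))
```

In this model `formula(1, 1, 1)` matches `truth(1, 1, 1)` and differs from `marginal_effect(1, 1)`, so the conditional query is a different estimand from the marginal effect.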
This example also warns against a common shortcut. If $Z$ is affected by $X$, then conditioning on $Z=z$ after intervention describes a selected post-treatment subpopulation, so the conditional causal estimand must be identified directly rather than inferred by ordinary conditioning from $P_x(y)$.
## Soundness and Completeness
The three rules would be of limited value if they were only plausible algebraic moves. Soundness says that every permitted move preserves the interventional distribution in every causal model compatible with the graph. Completeness says that the rules are not missing any graphical identification moves for nonparametric models.
[quotetheorem:9683]
[citeproof:9683]
Soundness provides the safety guarantee: a derivation never identifies a false formula if each graphical premise has been checked in the right graph and every displayed conditional distribution is defined. The hypotheses are not interchangeable. Positivity or definedness is needed at every rule invocation: if $P_x(W=w)=0$, then an expression such as $P_x(y\mid w)$ has no ordinary conditional probability meaning without an additional regular conditional probability convention, so a displayed equality involving that term is not a valid finite-probability statement. In a discrete model, if a derivation divides by $P(z\mid x)$ and this probability is zero, the resulting observational formula may be undefined even though the formal graph manipulation looks syntactically correct.
Structural compatibility with the causal graph is equally necessary. If the graph omits a latent common cause of $X$ and $Y$, then it may falsely report $X \perp Y$ after a mutilation and license replacement of $P_x(y)$ by $P(y\mid x)$, even though conditioning on $X=x$ still carries selection information about the unmodeled cause. This is a failure of the SCM compatibility hypothesis, not a failure of algebra. Markov compatibility after intervention is also needed: a distribution that happens to be Markov with respect to the observational DAG but is not generated by mechanisms that remain invariant under interventions need not obey the truncated factorization used in the proof of each rule.
The graph and derivation hypotheses have their own counterexamples. If a latent projection is treated as an ordinary DAG and $d$-separation is applied while bidirected confounding edges are ignored, a derivation may delete a conditioning variable or an action that still carries confounding information. If separation is checked in the original graph instead of the required mutilated graph, rule two can confuse observation with intervention because the outgoing arrows from the action have not been suppressed for the comparison. Finally, soundness applies to finite derivations: an expression obtained by an informal limiting procedure or by rearranging infinitely many terms needs its own convergence argument, since ordinary probability rules justify only the finite algebraic steps actually used. Soundness therefore protects only derivations whose positivity, mutilation, separation, finite-derivation, and graph-compatibility hypotheses match the causal representation.
Soundness does not answer the search question: when a target effect is identifiable, are the three rules powerful enough to find an observational expression? The completeness theorem answers this by connecting do-calculus to the ID algorithm, which systematically decomposes interventional queries into graph-theoretic subproblems.
[quotetheorem:9684]
The theorem is usually read as completeness of do-calculus for identification: if a causal effect is identifiable from the observed law in the nonparametric graph model, then there is a do-calculus derivation of an observational formula. The proof analyzes the recursive structure of the ID algorithm, especially the decomposition into confounded components, and is part of the deeper identification theory developed after the foundational rules.
[remark: Interpretation of Completeness]
Completeness does not say that every causal query is identifiable. It says that failure to find a do-calculus derivation is not a limitation of the three rules when the search is organized by the complete identification algorithm. Non-identifiability is then a property of the graph model itself, usually witnessed by two causal models with the same observational distribution but different interventional distributions.
[/remark]
The practical lesson is that do-calculus is both a proof language and an algorithmic foundation. For small graphs, hand derivations reveal why a formula is valid; for larger graphs, automated ID procedures use the same principles to determine whether identification is possible.
Do-calculus unifies the ad hoc-looking rules from back-door and front-door reasoning into a general identification calculus. From there, the natural next question is whether every causal query can be decided algorithmically, and if not, how nonidentifiability can be certified.
# 9. The ID Algorithm and Nonidentifiability
This chapter moves from graphical criteria such as back-door, front-door, and do-calculus to the general identification problem for directed acyclic graphs with hidden variables. The central question is algorithmic: given a causal graph and a target intervention, can $\mathbb P(Y \mid do(X=x))$ be written uniquely in terms of the observed law? The answer is governed by how latent common causes appear after we project them out, and by special subgraphs that witness failure of identification.
## Hidden Confounding and C-Components
When hidden variables are present, ordinary $d$-separation among the observed variables no longer records all dependence that matters for interventions. The first problem is to encode the observable consequences of latent common causes without carrying every unobserved variable through the graph. This leads to latent projections and to the bidirected connected pieces called districts or C-components.
[definition: Acyclic Directed Mixed Graph]
An acyclic directed mixed graph on vertex set $V$ is a graph with directed edges $A \to B$ and bidirected edges $A \leftrightarrow B$, with no directed cycle.
[/definition]
A directed edge records a possible direct causal effect among observed variables, while a bidirected edge records the presence of some hidden common cause after latent variables have been removed. The absence of directed cycles keeps the recursive structural interpretation available, and the first place where both kinds of edge interact is the two-variable bow-arc graph.
[example: Bow-Arc Graph]
Consider the acyclic directed mixed graph on observed vertices $\{X,Y\}$ with edges $X \to Y$ and $X \leftrightarrow Y$. The directed edge $X \to Y$ records a possible direct causal effect of $X$ on $Y$. The bidirected edge $X \leftrightarrow Y$ records a latent common cause, equivalently an unobserved variable $U$ in an underlying DAG with $U \to X$ and $U \to Y$.
For the effect of $X$ on $Y$, any ordinary adjustment set must be made from observed variables other than $X$ and $Y$. Here
\begin{align*}
\{X,Y\}\setminus \{X,Y\}=\varnothing .
\end{align*}
Thus the only possible observed adjustment set is the empty set. Adjusting for the empty set gives no conditioning variable at all, so it cannot block the path represented by
\begin{align*}
X \leftrightarrow Y .
\end{align*}
In the underlying latent-variable picture, that same confounding path is
\begin{align*}
X \leftarrow U \to Y .
\end{align*}
Since $U$ is not observed, there is no observed variable available in this graph that can be conditioned on to remove the association induced by $U$.
The bow-arc graph is therefore the smallest mixed graph in which a directed causal effect and unobserved confounding between the same two observed variables coexist. It shows that adjustment can fail not because the right observed covariate was overlooked, but because the graph contains no observed covariate capable of removing the latent confounding.
[/example]
The bow-arc graph is the smallest setting in which an interventional distribution may fail to be identifiable. To pass from a full graph containing hidden variables to such a mixed graph, we need the projection operation that preserves the observable directed and confounding structure.
[definition: Latent Projection]
Let $G$ be a directed acyclic graph with observed vertices $V$ and latent vertices $U$. The latent projection is the map
\begin{align*}
\operatorname{LP}_{V}:\{ \text{DAGs with observed set } V \text{ and latent set } U\}\to \{\text{acyclic directed mixed graphs on }V\}.
\end{align*}
For such a DAG $G$, the projected graph $\operatorname{LP}_{V}(G)$ has vertex set $V$ and is defined as follows. Put $A \to B$ if there is a directed path from $A$ to $B$ in $G$ whose non-endpoint vertices are all latent. Put $A \leftrightarrow B$ if there is a path from $A$ to $B$ in $G$ whose non-endpoint vertices are all latent, whose first edge has an arrowhead into $A$, whose last edge has an arrowhead into $B$, and on which every non-endpoint vertex is a non-collider.
[/definition]
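The definition can be turned into a short computation. The sketch below is one possible implementation under stated assumptions, not a reference algorithm: the function name `latent_projection`, the edge-list input format, and the string vertex labels are all hypothetical. It relies on the observation that a collider-free path with arrowheads into both endpoints and latent interior vertices must fan out from a single latent vertex along two directed arms, so bidirected edges can be found by asking which observed vertices each latent vertex reaches through latent-only directed paths.

```python
def latent_projection(edges, observed):
    """Project a DAG onto its observed vertices.

    edges: iterable of (parent, child) pairs; observed: set of observed vertex labels.
    Returns (directed, bidirected), where bidirected edges are frozensets of two labels.
    """
    observed = set(observed)
    children, vertices = {}, set(observed)
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        vertices.update((a, b))

    def observed_reach(start):
        """Observed vertices reachable from start by directed paths with latent interiors."""
        found, stack, seen = set(), [start], {start}
        while stack:
            v = stack.pop()
            for c in children.get(v, ()):
                if c in observed:
                    found.add(c)        # an observed vertex ends the path
                elif c not in seen:
                    seen.add(c)
                    stack.append(c)     # a latent vertex may be passed through
        return found

    directed = {(a, b) for a in observed for b in observed_reach(a)}
    bidirected = set()
    for u in vertices - observed:       # each latent vertex is a candidate common cause
        reach = sorted(observed_reach(u))
        bidirected.update(frozenset((a, b))
                          for i, a in enumerate(reach) for b in reach[i + 1:])
    return directed, bidirected


# Usage: the underlying DAG U -> X, U -> Y, X -> Y with U latent projects to the bow-arc graph.
bow_directed, bow_bidirected = latent_projection(
    [("U", "X"), ("U", "Y"), ("X", "Y")], observed={"X", "Y"})
```

A latent vertex that lies on a directed chain, such as $L$ in $X \to L \to Y$, contributes a directed edge $X \to Y$ and no bidirected edge, because it reaches only one observed vertex.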
The projected graph is not meant to identify the hidden variables themselves. It preserves exactly the kinds of observed causal and confounding paths that the identification algorithm uses, and once those paths have been encoded we need to know which observed variables are tied together by chains of hidden confounding.
[definition: District]
Let $G$ be an acyclic directed mixed graph on vertex set $V$. A district, also called a C-component, is a maximal subset $D \subset V$ such that every pair of vertices in $D$ is connected by a path consisting only of bidirected edges.
[/definition]
Districts are the units on which hidden confounding couples variables. Directed edges may run between districts, but bidirected paths keep track of variables whose observational factorization cannot be separated into ordinary conditional distributions.
[example: Two Latent Confounders with Separated Districts]
Let the observed variables be $X,Z,Y,W$. Suppose the projected graph has directed edges $X \to Z \to Y$ and $X \to W \to Y$, and bidirected edges
\begin{align*}
X \leftrightarrow Z
\end{align*}
and
\begin{align*}
W \leftrightarrow Y .
\end{align*}
There is no bidirected path connecting any vertex in $\{X,Z\}$ to any vertex in $\{W,Y\}$.
The bidirected edge $X \leftrightarrow Z$ gives a bidirected path from $X$ to $Z$, so $X$ and $Z$ lie in the same district. Since no bidirected edge or bidirected path connects $X$ or $Z$ to $W$ or $Y$, this bidirected-connected set cannot be enlarged past
\begin{align*}
\{X,Z\}.
\end{align*}
Thus one district is $\{X,Z\}$. Similarly, the bidirected edge $W \leftrightarrow Y$ gives a bidirected path from $W$ to $Y$, and the assumed absence of any bidirected path from $\{W,Y\}$ to $\{X,Z\}$ prevents this set from being enlarged. Thus the other district is
\begin{align*}
\{W,Y\}.
\end{align*}
So the district decomposition is
\begin{align*}
\{\{X,Z\},\{W,Y\}\}.
\end{align*}
The directed paths $X \to Z \to Y$ and $X \to W \to Y$ may carry causal influence between variables in different districts, but districts are determined only by bidirected connectivity. Hence two latent confounders need not create one large confounded block; the graph retains two separate hidden-confounding pieces that identification can handle separately.
[/example]
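Because districts depend only on bidirected connectivity, they can be computed as connected components of the graph that keeps only the bidirected edges. The following sketch assumes a hypothetical representation in which bidirected edges are given as pairs of vertex labels; the function name `districts` is illustrative.

```python
def districts(vertices, bidirected):
    """Return the set of districts (as frozensets) of a mixed graph.

    Directed edges are deliberately not an argument: they play no role in the definition.
    """
    neighbours = {v: set() for v in vertices}
    for a, b in bidirected:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, result = set(), set()
    for v in vertices:
        if v in seen:
            continue
        component, stack = set(), [v]
        while stack:                      # depth-first search along bidirected edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.add(frozenset(component))
    return result


# Usage on the graph of the example: bidirected edges X <-> Z and W <-> Y.
example_districts = districts(["X", "Z", "Y", "W"], [("X", "Z"), ("W", "Y")])
```

A vertex with no bidirected edge forms a district by itself, which matches the maximality clause of the definition.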
The separated-district example shows that the graph can contain hidden confounding while still having useful algebraic structure. The next result explains the structure: the observed joint law splits into factors indexed by districts, so these are the pieces passed through the identification recursion. The graph here is interpreted as the latent projection of an underlying semi-Markovian directed acyclic graph, so the bidirected edges stand for independent latent common causes rather than arbitrary mixed-graph dependence.
[quotetheorem:9685]
[citeproof:9685]
The theorem is the algebraic reason districts matter, but it is not an adjustment theorem. A district factor may still depend on variables outside the district through earlier variables in the topological order, so it should not be read as a marginal law on the district alone or as an ordinary conditional law given the complement. The positivity condition is also doing work: if a conditioning history has probability zero, the conditional kernel is not determined by the observational distribution there, and the displayed product must be interpreted only on the observed support. For example, if $A$ is binary and $\mathbb P(A=0)=0$, then $\mathbb P(B\mid A=0)$ can be changed without changing the observed joint law on its support; any district factor using that conditional at $A=0$ is therefore not determined by $\mathbb P(A,B)$. Thus the theorem says how to decompose the observed law into identifiable kernels; it does not by itself say that every intervention is identifiable.
## Recursive Identification of Interventional Laws
The next problem is to decide whether a target such as $\mathbb P(Y \mid do(X=x))$ can be computed from $\mathbb P(V)$. Do-calculus is complete, but using it by hand can be opaque. The ID algorithm packages the graphical reductions into a recursive procedure whose failure has a precise meaning.
[definition: Ancestor Set in a Mixed Graph]
Let $G$ be an acyclic directed mixed graph on $V$. The ancestor operator is the map
\begin{align*}
\operatorname{An}_G:\mathcal P(V)\to \mathcal P(V).
\end{align*}
For $A \subset V$, the ancestor set $\operatorname{An}_G(A)$ is the set of all vertices $v \in V$ for which there is a directed path from $v$ to some element of $A$, including the elements of $A$ themselves.
[/definition]
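The ancestor operator is a reverse reachability computation along directed edges. A minimal sketch, assuming directed edges are supplied as hypothetical `(parent, child)` pairs and bidirected edges are simply not passed in:

```python
def ancestors(directed, targets):
    """Return An_G(targets): the targets together with every vertex that has a
    directed path into some target. Bidirected edges are ignored by definition."""
    parents = {}
    for a, b in directed:
        parents.setdefault(b, set()).add(a)

    result, stack = set(targets), list(targets)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result


# Usage: directed part of a graph with X -> Z -> Y and X -> W -> Y.
example_edges = [("X", "Z"), ("Z", "Y"), ("X", "W"), ("W", "Y")]
```

Here `ancestors(example_edges, {"Z"})` contains $X$ and $Z$ but not $W$ or $Y$, since neither has a directed path into $Z$.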
Ancestor restriction matters because variables that cannot reach $Y$ by directed paths cannot affect the interventional distribution of $Y$ after the intervention has been set. To state the target of the recursion precisely, we next record the intervention law that the algorithm is trying to express in terms of observations.
[definition: Interventional Distribution]
Let $G$ be a causal graph on observed variables $V$, and let $X,Y \subset V$ be disjoint. The interventional distribution $\mathbb P(Y \mid do(X=x))$ is the law of $Y$ under the intervention replacing the structural equations for $X$ by constants $x$.
[/definition]
For identification, the value $x$ is treated as fixed, and variables not in $Y$ are summed or integrated out after the relevant post-intervention law has been found. In the finite case, the target can be written as a marginal of $\mathbb P_{x}(V \setminus X)$.
[explanation: Sketch of the ID Algorithm]
The ID algorithm takes as input a target set $Y$, an intervention set $X$, the observed vertex set $V$, an acyclic directed mixed graph $G$, and the observational law $\mathbb P(V)$. It first removes variables outside the ancestors of $Y$ in $G$, marginalizing them out of the observational law; these variables cannot change the target. It then augments the intervention set by the variables outside $X$ that are non-ancestors of $Y$ in the graph with incoming arrows into $X$ deleted, since fixing them does not change the target law.
After these reductions, the algorithm decomposes the graph with $X$ removed into districts. If that graph splits into several districts, the target is expressed as a sum over non-target variables of a product of district-level terms, and ID is called recursively on each district. If it is a single district $S$, three cases remain. If $S$ is already a district of the current graph, its factor from the Tian-Pearl factorization is read off the observational law and marginalized down to the target. If $S$ is strictly contained in a larger district of the current graph that is not the whole graph, the algorithm restricts to that larger district and uses the corresponding district factor as the new input distribution. The failure case is the remaining one: the current graph is itself a single district that contains $S$ and meets the intervention set, and this nested pair is what produces a hedge.
The output, when the algorithm succeeds, is an expression involving only sums, products, and conditional probabilities computed from $\mathbb P(V)$. Thus success gives identification, not merely an estimation strategy.
[/explanation]
The recursive form can seem abstract because it manipulates subgraphs and kernels rather than single adjustment sets. The following statement combines soundness and completeness: if the algorithm returns a formula, the formula is valid; if it fails, no formula exists under the nonparametric graph model.
[quotetheorem:9686]
This theorem is stated as the main completeness result for identification under hidden confounding. Its hypotheses are part of the statement: the graph is a semi-Markovian acyclic mixed graph and the model is nonparametric. If directed cycles, selection variables, or parametric restrictions are added, the same algorithm need not be complete without modification. Conversely, a negative answer from ID is not a statement about every possible statistical model with the same graph; linear-Gaussian, monotone, or other restricted models may identify special numerical summaries even when the unrestricted nonparametric effect is not identifiable. Its proof uses the soundness of the recursive reductions and the hedge criterion for the failure case; the full proof belongs to the general theory of graphical identification algorithms.
[example: Identifiable District Decomposition with Two Latent Confounders]
Let the observed variables be $X,Z,W,Y$, with directed edges $X \to W \to Y$ and bidirected edges $X \leftrightarrow Z$ and $W \leftrightarrow Y$. There is no directed edge out of $Z$ and no bidirected path connecting $\{X,Z\}$ to $\{W,Y\}$, so the bidirected districts are exactly $\{X,Z\}$ and $\{W,Y\}$. We compute the effect $\mathbb P(Y=y \mid do(X=x))$.
With topological order $X,Z,W,Y$, the chain rule gives
\begin{align*}
\mathbb P(x,z,w,y)=\mathbb P(x)\mathbb P(z\mid x)\mathbb P(w\mid x,z)\mathbb P(y\mid x,z,w).
\end{align*}
Grouping the chain-rule factors by district gives
\begin{align*}
Q_{\pi}[\{X,Z\}](x,z,w,y)=\mathbb P(x)\mathbb P(z\mid x).
\end{align*}
For the second district,
\begin{align*}
Q_{\pi}[\{W,Y\}](x,z,w,y)=\mathbb P(w\mid x,z)\mathbb P(y\mid x,z,w).
\end{align*}
The only paths from $Z$ to $W$ or $Y$ pass through $X$, so conditioning on $X$ blocks the dependence of $W$ on $Z$, and conditioning on $X,W$ blocks the dependence of $Y$ on $Z$. Thus $Z \perp W \mid X$ and $Z \perp Y \mid X,W$ in the observational law. On histories with positive denominators,
\begin{align*}
\mathbb P(w\mid x,z)=\frac{\mathbb P(w,z\mid x)}{\mathbb P(z\mid x)}=\frac{\mathbb P(w\mid x)\mathbb P(z\mid x)}{\mathbb P(z\mid x)}=\mathbb P(w\mid x).
\end{align*}
Similarly,
\begin{align*}
\mathbb P(y\mid x,z,w)=\frac{\mathbb P(y,z\mid x,w)}{\mathbb P(z\mid x,w)}=\frac{\mathbb P(y\mid x,w)\mathbb P(z\mid x,w)}{\mathbb P(z\mid x,w)}=\mathbb P(y\mid x,w).
\end{align*}
Therefore the district factor for $\{W,Y\}$ can be read as
\begin{align*}
Q_{\pi}[\{W,Y\}](x,w,y)=\mathbb P(w\mid x)\mathbb P(y\mid x,w).
\end{align*}
Under the intervention $do(X=x)$, the district $\{X,Z\}$ is not part of the post-intervention outcome calculation, while the post-treatment variables $W,Y$ are governed by the remaining district factor. Marginalizing over $W$ gives
\begin{align*}
\mathbb P(Y=y\mid do(X=x))=\sum_w Q_{\pi}[\{W,Y\}](x,w,y).
\end{align*}
Substituting the reduced district factor yields
\begin{align*}
\mathbb P(Y=y\mid do(X=x))=\sum_w \mathbb P(w\mid x)\mathbb P(y\mid x,w).
\end{align*}
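Thus the latent confounder between $X$ and $Z$ does not obstruct this effect: after the irrelevant district is removed, the remaining post-treatment district gives an expression entirely in the observed law. By the law of total probability the sum over $w$ equals $\mathbb P(y\mid x)$, as expected, because $Z$ has no other edges and so $X$ has no open back-door path to $Y$ in this graph.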
[/example]
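The conclusion of this example can be checked against an explicit latent-variable model. The sketch below assumes a hypothetical underlying DAG with two hidden binary variables, $U_1 \to X$, $U_1 \to Z$ and $U_2 \to W$, $U_2 \to Y$, together with $X \to W \to Y$; all tables and names are illustrative. The hidden variables generate the observed law and the ground truth, and `id_formula` uses only the observed joint over $(X,Z,W,Y)$.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical latent model: U1 -> X, U1 -> Z, U2 -> W, U2 -> Y, X -> W -> Y.
p_u1, p_u2 = F(1, 2), F(1, 3)                      # P(U1=1), P(U2=1)
p_x1_given_u1 = {0: F(1, 4), 1: F(3, 4)}
p_z1_given_u1 = {0: F(1, 5), 1: F(7, 10)}
p_w1_given_xu2 = {(0, 0): F(1, 10), (0, 1): F(1, 2),
                  (1, 0): F(3, 5), (1, 1): F(9, 10)}  # keyed by (x, u2)
p_y1_given_wu2 = {(0, 0): F(1, 5), (0, 1): F(7, 10),
                  (1, 0): F(2, 5), (1, 1): F(9, 10)}  # keyed by (w, u2)

def bern(p1, value):
    return p1 if value == 1 else 1 - p1

NAMES = ("x", "z", "w", "y")
obs = {}  # observed joint P(x, z, w, y) with both latent variables summed out
for u1, u2, x, z, w, y in product((0, 1), repeat=6):
    p = (bern(p_u1, u1) * bern(p_u2, u2)
         * bern(p_x1_given_u1[u1], x) * bern(p_z1_given_u1[u1], z)
         * bern(p_w1_given_xu2[(x, u2)], w) * bern(p_y1_given_wu2[(w, u2)], y))
    obs[(x, z, w, y)] = obs.get((x, z, w, y), 0) + p

def prob(**fixed):
    return sum(p for key, p in obs.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

def id_formula(x, y):
    """sum_w P(w | x) P(y | x, w), from the observed joint only."""
    return sum(prob(x=x, w=w) / prob(x=x) * prob(x=x, w=w, y=y) / prob(x=x, w=w)
               for w in (0, 1))

def truth(x, y):
    """P(Y=y | do(X=x)) from the truncated factorization of the latent model."""
    return sum(bern(p_u2, u2) * bern(p_w1_given_xu2[(x, u2)], w)
               * bern(p_y1_given_wu2[(w, u2)], y)
               for u2, w in product((0, 1), repeat=2))
```

The same observed joint can also be used to confirm the independence $Z \perp W \mid X$ invoked in the example, for instance by comparing `prob(x=1, z=1, w=1) / prob(x=1, z=1)` with `prob(x=1, w=1) / prob(x=1)`.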
This example illustrates why hidden confounding does not automatically imply nonidentifiability. The obstruction is not the mere presence of bidirected edges, but a specific nesting pattern of districts that prevents the recursive factorization from isolating the intervention.
## Hedges and Nonidentifiability
When ID fails, the failure is not a numerical accident or a weakness of the algorithm. The graph contains a finite witness proving that two causal models can agree on the observed distribution while disagreeing on the intervention. The witness is called a hedge.
[definition: Rooted C-Forest]
Let $G$ be an acyclic directed mixed graph, and let $F \subset V$. The induced subgraph $G[F]$ is an $R$-rooted C-forest if $F$ is a single district, every observed vertex in $F$ has at most one child in $G[F]$, all directed edges are oriented along directed paths toward the root set, and $R \subset F$ is exactly the set of vertices in $F$ with no children in $G[F]$.
[/definition]
C-forests are graph shapes in which hidden confounding is sufficiently connected while the directed structure remains tree-like and directed arrows flow toward the root set. Hedges are pairs of such forests with the same roots, arranged so that the larger forest still contains treatment variables but the smaller one does not.
[definition: Hedge]
Let $G$ be an acyclic directed mixed graph and let $X,Y \subset V$ be disjoint. A hedge for $\mathbb P(Y \mid do(X=x))$ is a pair of $R$-rooted C-forests $F,F'$ in $G$ such that $F' \subset F$, $F \cap X \ne \varnothing$, $F' \cap X = \varnothing$, and
\begin{align*}
R \subset \operatorname{An}_{G[V\setminus X]}(Y).
\end{align*}
[/definition]
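The conditions in the two preceding definitions are finite graph checks, so a candidate pair $F, F'$ can be tested mechanically. The sketch below is an illustrative checker under stated assumptions: the function names, the edge representation (directed edges as pairs, bidirected edges as two-element frozensets), and the choice to return `None` for a non-forest are all hypothetical. It assumes the input graph is acyclic; under that assumption the at-most-one-child condition already forces every directed path in $G[F]$ to end at a childless vertex, so the orientation clause is not tested separately.

```python
def induced(directed, bidirected, keep):
    """Edges of the induced subgraph on the vertex set keep."""
    keep = set(keep)
    d = {(a, b) for a, b in directed if a in keep and b in keep}
    b = {frozenset(e) for e in bidirected if set(e) <= keep}
    return d, b

def is_single_district(vertices, bidirected):
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:                          # search along bidirected edges only
        v = stack.pop()
        for e in bidirected:
            if v in e:
                for w in e - seen:
                    seen.add(w)
                    stack.append(w)
    return seen == vertices

def roots_if_c_forest(directed, bidirected, F):
    """Root set of G[F] if G[F] is a rooted C-forest, otherwise None."""
    d, b = induced(directed, bidirected, F)
    if not is_single_district(F, b):
        return None
    child_count = {v: 0 for v in F}
    for a, _ in d:
        child_count[a] += 1
    if any(c > 1 for c in child_count.values()):
        return None
    return {v for v, c in child_count.items() if c == 0}

def ancestors(directed, targets):
    parents = {}
    for a, b in directed:
        parents.setdefault(b, set()).add(a)
    result, stack = set(targets), list(targets)
    while stack:
        v = stack.pop()
        for p in parents.get(v, set()) - result:
            result.add(p)
            stack.append(p)
    return result

def is_hedge(directed, bidirected, V, X, Y, F, F_small):
    """Check the hedge conditions for P(Y | do(X)) on the candidate pair F, F_small."""
    V, X, Y, F, F_small = map(set, (V, X, Y, F, F_small))
    R = roots_if_c_forest(directed, bidirected, F)
    R_small = roots_if_c_forest(directed, bidirected, F_small)
    if R is None or R_small is None or R != R_small:
        return False
    d_without_x, _ = induced(directed, bidirected, V - X)
    return (F_small <= F and bool(F & X) and not (F_small & X)
            and R <= ancestors(d_without_x, Y))


# Usage: the bow-arc graph X -> Y, X <-> Y with F = {X, Y} and F' = {Y}.
bow_d, bow_b = {("X", "Y")}, {frozenset(("X", "Y"))}
bow_is_hedge = is_hedge(bow_d, bow_b, {"X", "Y"}, {"X"}, {"Y"}, {"X", "Y"}, {"Y"})
```

Removing the bidirected edge from the bow-arc graph makes the check fail, because $\{X,Y\}$ is then no longer a single district.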
The smaller forest $F'$ records the part still relevant to $Y$ after intervention, while the larger forest $F$ records the same confounded structure before the treatment variables are removed. This containment is the proposed graphical signature of nonidentifiability, and the next criterion makes that signature exact.
[quotetheorem:9687]
This result is stated as the graphical certificate of nonidentifiability, and its assumptions are essential. The criterion is for unrestricted nonparametric semi-Markovian models; it does not rule out identification of a parametric coefficient, a local average treatment effect, or a monotone-response summary under extra assumptions. The shared-root condition in the definition is also not cosmetic: without common roots, two confounded forests may be present without forming the nested obstruction that ID reads as failure. In the full proof, a hedge is used to construct two semi-Markovian causal models with the same observational law and different interventional laws, while the converse follows by reading the failed recursive call of ID as a nested pair of rooted C-forests.
[example: Bow-Arc Hedge]
In the bow-arc graph $G$ on $V=\{X,Y\}$ with edges $X \to Y$ and $X \leftrightarrow Y$, take
\begin{align*}
F=\{X,Y\}
\end{align*}
and
\begin{align*}
F'=\{Y\}.
\end{align*}
We verify that these two induced subgraphs form a hedge for the effect of $X$ on $Y$.
The induced subgraph $G[F]$ contains both vertices $X,Y$ and the bidirected edge $X \leftrightarrow Y$, so $X$ and $Y$ are connected by a path consisting only of bidirected edges. Hence $F$ is a single district. Inside $G[F]$, the only directed edge is $X \to Y$. Therefore $X$ has child $Y$, while $Y$ has no child in $G[F]$, so the root set of $G[F]$ is
\begin{align*}
R=\{Y\}.
\end{align*}
The induced subgraph $G[F']$ has the single vertex $Y$ and no directed edge. Since $Y$ has no child in this one-vertex induced graph, its root set is again
\begin{align*}
R=\{Y\}.
\end{align*}
Thus $G[F]$ and $G[F']$ are both $R$-rooted C-forests with the same root set $R=\{Y\}$.
The containment and treatment-intersection conditions are also explicit:
\begin{align*}
F'=\{Y\}\subset \{X,Y\}=F.
\end{align*}
For the treatment set $\{X\}$,
\begin{align*}
F\cap \{X\}=\{X,Y\}\cap \{X\}=\{X\}\ne \varnothing .
\end{align*}
Also,
\begin{align*}
F'\cap \{X\}=\{Y\}\cap \{X\}=\varnothing .
\end{align*}
Finally, deleting $X$ leaves the vertex set
\begin{align*}
V\setminus \{X\}=\{Y\}.
\end{align*}
In the induced graph on $\{Y\}$, every vertex is an ancestor of itself by definition, so
\begin{align*}
Y\in \operatorname{An}_{G[V\setminus \{X\}]}(Y).
\end{align*}
Therefore
\begin{align*}
R=\{Y\}\subset \operatorname{An}_{G[V\setminus \{X\}]}(Y).
\end{align*}
All conditions in the definition of a hedge are satisfied: $F' \subset F$, the larger forest contains the treatment $X$, the smaller forest does not, and the common root lies among the ancestors of $Y$ after $X$ is removed. Hence $F,F'$ form a hedge witnessing nonidentifiability of $\mathbb P(Y \mid do(X=x))$ from the observational law alone.
[/example]
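What the hedge certifies can be seen in a pair of explicit models. The sketch below builds two hypothetical structural models compatible with the bow-arc graph, both with a hidden binary $U$ and $X := U$. In model `"A"` the outcome is drawn from a kernel evaluated at $U$, so $X$ has no effect on $Y$; in model `"B"` the same kernel is evaluated at $X$. The numbers and names are illustrative assumptions, and this simple pair is not claimed to be the construction used in the general proof.

```python
from fractions import Fraction as F

p_u1 = F(2, 5)                       # P(U=1), hypothetical
k_y1 = {0: F(1, 5), 1: F(7, 10)}     # hypothetical kernel: P(Y=1 | argument)

def bern(p1, value):
    return p1 if value == 1 else 1 - p1

def outcome_argument(model, u, x):
    """Model A feeds the hidden U into the kernel; model B feeds the treatment X."""
    return u if model == "A" else x

def observational(model):
    """Observed joint P(x, y) with U summed out; X := U in both models."""
    joint = {}
    for u in (0, 1):
        x = u
        for y in (0, 1):
            p = bern(p_u1, u) * bern(k_y1[outcome_argument(model, u, x)], y)
            joint[(x, y)] = joint.get((x, y), 0) + p
    return joint

def interventional(model, x, y):
    """P(Y=y | do(X=x)): X is set to x while U keeps its own distribution."""
    return sum(bern(p_u1, u) * bern(k_y1[outcome_argument(model, u, x)], y)
               for u in (0, 1))
```

The two models return the same observed joint, with every cell strictly positive, yet `interventional("A", 1, 1)` and `interventional("B", 1, 1)` differ. No function of the observed law can therefore recover the intervention in both models.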
The bow-arc example shows the smallest hedge, but the same idea appears in more familiar econometric graphs. In particular, an instrumental variable can identify certain parametric or monotonicity-based summaries, while the fully nonparametric average effect is not identified without extra assumptions.
[example: Instrumental Variable Graph and a Nonidentifiable Average Effect]
[claim]In the instrumental-variable graph with $Z \to X \to Y$ and $X \leftrightarrow Y$, the nonparametric effect of $X$ on $Y$ is not identifiable from the observed law.[/claim]
[proof]The vertex $Z$ satisfies the graphical instrument features: there is a directed path $Z \to X \to Y$, there is no directed edge $Z \to Y$, and there is no bidirected path from $Z$ to $X$ or $Y$. Thus $Z$ is excluded from directly causing $Y$ and is not latently confounded with the treatment-outcome pair.
To see why this still does not identify the unrestricted effect of $X$ on $Y$, restrict attention to the induced subgraph on $\{X,Y\}$. This subgraph has directed edge $X \to Y$ and bidirected edge $X \leftrightarrow Y$. Take
\begin{align*}
F=\{X,Y\}.
\end{align*}
Take
\begin{align*}
F'=\{Y\}.
\end{align*}
Inside $G[F]$, the bidirected edge $X \leftrightarrow Y$ makes $F$ one district. The only directed edge in $G[F]$ is $X \to Y$, so $X$ has child $Y$ and $Y$ has no child. Hence the root set is
\begin{align*}
R=\{Y\}.
\end{align*}
Inside $G[F']$, the graph has the single vertex $Y$, so $Y$ has no child and the root set is again
\begin{align*}
R=\{Y\}.
\end{align*}
Thus $G[F]$ and $G[F']$ are $R$-rooted C-forests with the same root set.
The containment condition is
\begin{align*}
F'=\{Y\}\subset \{X,Y\}=F.
\end{align*}
For the treatment set $\{X\}$,
\begin{align*}
F\cap \{X\}=\{X,Y\}\cap \{X\}=\{X\}.
\end{align*}
Therefore
\begin{align*}
F\cap \{X\}\ne \varnothing .
\end{align*}
For the smaller forest,
\begin{align*}
F'\cap \{X\}=\{Y\}\cap \{X\}=\varnothing .
\end{align*}
After deleting the treatment vertex $X$, the remaining vertex set relevant to this subgraph is
\begin{align*}
\{X,Y\}\setminus \{X\}=\{Y\}.
\end{align*}
A vertex is an ancestor of itself, so
\begin{align*}
Y\in \operatorname{An}_{G[\{Y\}]}(Y).
\end{align*}
Hence
\begin{align*}
R=\{Y\}\subset \operatorname{An}_{G[\{X,Y\}\setminus \{X\}]}(Y).
\end{align*}
All hedge conditions are satisfied, so $F,F'$ form a hedge for the effect of $X$ on $Y$. By the *[Hedge Nonidentifiability Criterion](/theorems/9687)*, the interventional law $\mathbb P(Y\mid do(X=x))$ is not identifiable in the unrestricted nonparametric model.[/proof]
The instrument can still identify special parameters under extra assumptions, such as a linear coefficient or a local average treatment effect with compliance restrictions, but the graph alone does not identify the full average causal effect of $X$ on $Y$.
[/example]
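The hedge conditions verified by hand in this example can also be checked mechanically. The sketch below is a minimal illustration: the edge encoding and the helper names (`is_district`, `root_set`, `ancestors`) are assumptions made here and not part of any library. It computes ancestors in the induced graph on $V\setminus\{X\}=\{Z,Y\}$, which yields the same ancestor set $\{Y\}$ as the restricted calculation in the proof.

```python
# Hypothetical encoding of the graph Z -> X -> Y with X <-> Y.
DIRECTED = {("Z", "X"), ("X", "Y")}
BIDIRECTED = {frozenset({"X", "Y"})}
V = {"Z", "X", "Y"}

def is_district(nodes):
    """True if `nodes` is connected by bidirected edges lying inside `nodes`."""
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for edge in BIDIRECTED:
            if v in edge and edge.issubset(nodes):
                for w in edge - seen:
                    seen.add(w)
                    stack.append(w)
    return seen == nodes

def children(v, nodes):
    """Children of v in the induced subgraph on `nodes`."""
    return {b for a, b in DIRECTED if a == v and b in nodes}

def is_forest(nodes):
    """Each vertex has at most one child in the induced subgraph."""
    return all(len(children(v, nodes)) <= 1 for v in nodes)

def root_set(nodes):
    """Vertices with no child in the induced subgraph."""
    return {v for v in nodes if not children(v, nodes)}

def ancestors(target, nodes):
    """Ancestors of `target`, itself included, in the induced subgraph."""
    found, stack = {target}, [target]
    while stack:
        v = stack.pop()
        for a, b in DIRECTED:
            if b == v and a in nodes and a not in found:
                found.add(a)
                stack.append(a)
    return found

F, F_PRIME, TREATMENT, OUTCOME = {"X", "Y"}, {"Y"}, {"X"}, "Y"

hedge_checks = {
    "C-forests": all(is_district(S) and is_forest(S) for S in (F, F_PRIME)),
    "same root set": root_set(F) == root_set(F_PRIME) == {"Y"},
    "F' strictly inside F": F_PRIME.issubset(F) and F_PRIME != F,
    "F meets treatment": bool(F & TREATMENT),
    "F' avoids treatment": not (F_PRIME & TREATMENT),
    "roots are ancestors of Y": root_set(F).issubset(ancestors(OUTCOME, V - TREATMENT)),
}
print(hedge_checks)
```

Every entry evaluates to true for this graph, matching the hand verification. The vertex $Z$ plays no role in any of the checks, which is the graphical reason the instrument does not rescue nonparametric identification.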
This distinction is important in applications: an instrument is not a universal identification device. It identifies particular causal parameters only after the target and the modeling assumptions have been specified.
[remark: What the ID Algorithm Adds]
Back-door and front-door criteria give memorable sufficient conditions. The ID algorithm gives a complete decision procedure for nonparametric identification in semi-Markovian graphs, and hedges explain every negative answer. The practical workflow is therefore: form the latent projection, compute districts, run the recursive reductions, and interpret any failure through the corresponding hedge.
[/remark]
The ID algorithm answers the identification question systematically and explains when no observational formula exists. Once that limit is clear, instrumental variables provide another route: under additional assumptions, they exploit special sources of variation to identify causal parameters that standard adjustment cannot recover.
# 10. Instrumental Variables
Instrumental variables address a specific failure of the identification strategies developed so far: the treatment is not exchangeable, but some external source of variation moves the treatment in a way that is plausibly unrelated to the potential outcomes. The chapter formalises this idea in two languages. First, we study instruments through structural and graphical restrictions; then we translate the same restrictions into potential outcomes and derive the Wald and LATE identification formulas.
The guiding example is an encouragement design. An encouragement, assignment, or preference variable $Z$ changes the probability of receiving treatment $A$, while the outcome $Y$ is affected by $Z$ only through the treatment actually received. This gives causal information even when $A$ itself is confounded.
## Why Instruments Can Help When Treatment Is Confounded
The central problem is that exchangeability for $A$ may fail: the people who receive treatment may differ systematically from those who do not in ways that also affect $Y$. An instrument is a variable whose variation can be treated as a partial randomisation of treatment. The price is that an instrument identifies a more specialised causal contrast, usually for units whose treatment status is moved by the instrument.
Before defining the assumptions, it helps to state the roles of the variables. Throughout this chapter, $Z$ is the candidate instrument, $A$ is a binary treatment, and $Y$ is an outcome. Unless otherwise stated, $Z,A \in \{0,1\}$ and $Y$ is integrable.
[definition: Instrumental Variable Conditions]
A variable $Z$ is an instrumental variable for the effect of $A$ on $Y$ if the following conditions hold:
1. Relevance: $\mathbb P(A=1\mid Z=1) \ne \mathbb P(A=1\mid Z=0)$.
2. Exclusion: $Z$ has no causal effect on $Y$ except through $A$.
3. Instrument independence: $Z$ is independent of the common causes of $A$ and $Y$.
[/definition]
Relevance is empirically checkable in the observed distribution, while exclusion and independence are causal assumptions. This asymmetry is a recurring theme: the strongest parts of an IV argument are usually not testable from $(Z,A,Y)$ alone, so the next example shows how design can make them credible.
[example: Randomized Encouragement Design]
Suppose patients are randomly assigned a letter encouraging vaccination, with $Z=1$ if the letter is sent and $Z=0$ otherwise. Let $A\in\{0,1\}$ indicate vaccine uptake and let $Y\in\{0,1\}$ indicate infection during follow-up. Random assignment is meant to make $Z$ independent of baseline infection risk, while the exclusion condition says that changing $Z$ can change $Y$ only by changing $A$.
If the letter raises vaccine uptake, then
\begin{align*}
\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]>0.
\end{align*}
The observed outcome contrast is
\begin{align*}
\mathbb E[Y\mid Z=1]-\mathbb E[Y\mid Z=0].
\end{align*}
The observed uptake contrast is
\begin{align*}
\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0].
\end{align*}
Thus the IV ratio is
\begin{align*}
\frac{\mathbb E[Y\mid Z=1]-\mathbb E[Y\mid Z=0]}{\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]}.
\end{align*}
The numerator measures the effect of assignment to encouragement on infection, and the denominator measures the effect of assignment to encouragement on vaccination. Dividing the first contrast by the second converts the effect per assigned letter into an effect per vaccination induced by the letter, so the target group is patients who would vaccinate if encouraged but would not vaccinate otherwise.
[/example]
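As a numerical sketch of the ratio in this example, the snippet below plugs in arm-level summaries. The four numbers are hypothetical, chosen only to show the arithmetic, and the helper name `wald_ratio` is introduced here for illustration.

```python
from fractions import Fraction

def wald_ratio(y_z1, y_z0, a_z1, a_z0):
    """Outcome contrast divided by uptake contrast across instrument arms."""
    first_stage = a_z1 - a_z0
    if first_stage == 0:
        raise ValueError("first stage is zero: the IV ratio is undefined")
    return (y_z1 - y_z0) / first_stage

# Hypothetical arm-level summaries (not data from the text).
infection_letter, infection_no_letter = Fraction(8, 100), Fraction(10, 100)
uptake_letter, uptake_no_letter = Fraction(60, 100), Fraction(40, 100)

ratio = wald_ratio(infection_letter, infection_no_letter,
                   uptake_letter, uptake_no_letter)
print(ratio)  # -1/10
```

With these numbers the letter lowers infection risk by $0.02$ and raises uptake by $0.20$, so the ratio is $-0.10$. It can be read as a change in infection risk per letter-induced vaccination only if the IV conditions hold.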
The exclusion condition is often the most fragile part of the argument. If the encouragement changes other behaviours, such as masking or testing, then the path from $Z$ to $Y$ does not pass only through $A$. This motivates separating observable diagnostics from causal restrictions.
[remark: Testable and Untestable Components]
The association between $Z$ and $A$ checks relevance, but it does not check whether $Z$ is independent of unmeasured causes of $Y$, nor whether $Z$ affects $Y$ only through $A$. A strong first stage is therefore not enough for a valid IV design.
[/remark]
This distinction motivates graphical criteria. Graphs do not prove assumptions from data, but they make the required absence of paths explicit.
## Graphical IV Conditions
The graphical question is: which directed acyclic graphs represent a variable that perturbs treatment while blocking all non-treatment routes to the outcome? Let $U$ denote unmeasured common causes of $A$ and $Y$. The ideal IV graph has arrows $Z \to A \to Y$ and $U \to A$, $U \to Y$, with no arrow $Z \to Y$ and no open back-door path from $Z$ to $Y$.
[definition: Graphical Instrument]
In a directed acyclic graph with observed variables $Z,A,Y$ and possibly unobserved variables, $Z$ is a graphical instrument for the effect of $A$ on $Y$ if:
1. There is a directed path from $Z$ to $A$.
2. Every directed path from $Z$ to $Y$ passes through $A$.
3. All back-door paths from $Z$ to $Y$ are blocked by the empty set.
[/definition]
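The three conditions can be checked by brute force on a small graph. The following sketch enumerates simple paths in the skeleton and uses the fact that a path is blocked by the empty set exactly when it contains a collider. The edge-set encoding and the function names are assumptions made here for illustration, and path enumeration is only practical for small graphs.

```python
# Hypothetical DAG: latent U is a common cause of A and Y.
IDEAL = {("Z", "A"), ("A", "Y"), ("U", "A"), ("U", "Y")}

def simple_paths(edges, start, end):
    """All simple paths from start to end in the skeleton, as vertex lists."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for w in nbrs.get(path[-1], ()):
            if w not in path:
                stack.append(path + [w])
    return paths

def is_directed(edges, path):
    """True if every step of the path follows an arrow forwards."""
    return all((a, b) in edges for a, b in zip(path, path[1:]))

def has_collider(edges, path):
    """True if some interior vertex has both path arrows pointing into it."""
    return any((path[i - 1], path[i]) in edges and (path[i + 1], path[i]) in edges
               for i in range(1, len(path) - 1))

def is_graphical_instrument(edges, z, a, y):
    relevance = any(is_directed(edges, p) for p in simple_paths(edges, z, a))
    zy_paths = simple_paths(edges, z, y)
    exclusion = all(a in p for p in zy_paths if is_directed(edges, p))
    back_door = [p for p in zy_paths if (p[1], p[0]) in edges]  # arrow into z
    independence = all(has_collider(edges, p) for p in back_door)
    return relevance and exclusion and independence

print(is_graphical_instrument(IDEAL, "Z", "A", "Y"))
print(is_graphical_instrument(IDEAL | {("Z", "Y")}, "Z", "A", "Y"))
print(is_graphical_instrument(IDEAL | {("W", "Z"), ("W", "Y")}, "Z", "A", "Y"))
```

The ideal graph passes. Adding a direct edge $Z\to Y$ violates condition 2, and adding a common cause $W$ of $Z$ and $Y$ violates condition 3, so both modified graphs fail.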
Condition 1 is the graphical form of relevance, condition 2 is exclusion, and condition 3 is independence. Conditions 2 and 3 are what make the graph useful for IV reasoning: together they rule out every open non-treatment route by which $Z$ could be associated with $Y$.
The formal question is whether these three graphical restrictions are exactly strong enough to translate the diagram into the usual instrumental-variable conditions. The theorem records that implication: the graph must make the $Z$-$Y$ association pass through treatment rather than through a direct effect or an open common-cause path.
[quotetheorem:9688]
[citeproof:9688]
The theorem explains why IV assumptions are often justified by design or substantive knowledge, but it also shows where the graph can mislead if read too strongly. Dropping each graphical condition produces a different failure. If relevance is dropped, an encouragement letter that nobody reads may satisfy the other causal restrictions but has $\mathbb E[A\mid Z=1]=\mathbb E[A\mid Z=0]$, so the Wald denominator is zero. If exclusion is dropped, the same letter might change testing behaviour as well as vaccine uptake; then a directed route $Z\to Y$ avoiding $A$ contributes to the numerator, so part of the $Z$-$Y$ association is not a treatment effect. If the blocked back-door condition is dropped, take $Z$ to be a physician's prescribing preference rather than a randomised letter: clinic resources may affect both physician preference and patient outcomes, creating a noncausal association between $Z$ and $Y$ even when every directed route from $Z$ to $Y$ passes through $A$.
These three counterexamples are logically separate. A variable can be relevant but invalid because it has a direct effect on the outcome; it can satisfy exclusion but be invalid because its levels are confounded with prognosis; and it can satisfy the two causal separation requirements but be useless because it does not move treatment. Separating the failures matters because the empirical first-stage check diagnoses only the last of these problems.
Thus the graphical theorem is not an identification formula by itself. It does not prove that the missing arrows are substantively correct, and it does not replace the empirical check that $Z$ is associated with $A$. Its role is to organise the assumptions that will later appear algebraically in the Wald estimand: relevance supplies the denominator, exclusion makes the numerator operate through treatment, and independence prevents the instrument-outcome contrast from mixing causal and confounded variation.
[example: Physician Preference as an Instrument]
Let $Z$ indicate a physician's preference for prescribing a new drug, let $A\in\{0,1\}$ indicate whether the patient actually receives the drug, and let $Y$ denote the patient's outcome. The relevance condition says that preference must change prescribing rates:
\begin{align*}
\mathbb P(A=1\mid Z=1)\ne \mathbb P(A=1\mid Z=0).
\end{align*}
For example, if physicians with $Z=1$ prescribe the drug to $70\%$ of eligible patients and physicians with $Z=0$ prescribe it to $30\%$, then the first-stage contrast is
\begin{align*}
0.70-0.30=0.40,
\end{align*}
so preference moves treatment receipt.
The remaining IV conditions are not checked by this first-stage contrast. Instrument independence requires physician preference to be unrelated to patient prognosis except through the prescription: patients seen by high-prescribing and low-prescribing physicians must not differ systematically in baseline severity, comorbidities, socioeconomic status, or other causes of $Y$. Exclusion requires every causal path from $Z$ to $Y$ to pass through $A$; thus preference would be invalid if it also changed monitoring intensity, referral decisions, dose adjustment, or follow-up care. Physician preference can therefore supply useful treatment variation only when the clinical pathways justify both conditions, not merely because prescribing rates differ by preference.
[/example]
Having separated the assumptions, we now turn to the estimand they identify. In the simplest linear setting, the IV ratio appears as the slope in a causal relation.
## Linear IV and the Wald Estimand
The basic numerical question is how much outcome change is associated with the instrument per unit of treatment change caused by the instrument. With binary $Z$ and $A$, this ratio is called the Wald estimand. In linear structural models it recovers the causal slope when the instrument is valid.
[definition: Wald Estimand]
Assume $Z\in\{0,1\}$ and $\mathbb E[Y\mid Z=z]$, $\mathbb E[A\mid Z=z]$ exist for $z\in\{0,1\}$. If $\mathbb E[A\mid Z=1] \ne \mathbb E[A\mid Z=0]$, the Wald estimand is
\begin{align*}
\beta_{\mathrm{Wald}}
&:= \frac{\mathbb E[Y\mid Z=1]-\mathbb E[Y\mid Z=0]}{\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]}.
\end{align*}
[/definition]
The numerator is the intention-to-treat effect of the instrument on the outcome; the denominator is the first-stage effect of the instrument on treatment. The ratio therefore rescales an effect of assignment or encouragement into an effect per treatment induced by the instrument, and the next theorem verifies that this rescaling recovers the structural slope in the linear model.
[quotetheorem:9689]
[citeproof:9689]
This theorem is the algebraic core of IV, and each hypothesis has a precise algebraic job. If the equal error-mean condition fails and
\begin{align*}
\mathbb E[\varepsilon\mid Z=1]-\mathbb E[\varepsilon\mid Z=0]=\gamma,
\end{align*}
while the first stage is $\pi=\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]$, the Wald ratio becomes $\beta+\gamma/\pi$ rather than $\beta$. If $\pi=0$, the ratio is undefined because the instrument has not generated any treatment contrast to rescale. If $\pi$ is nonzero but small, the same formula shows why small sampling errors or assumption violations can produce large changes in the ratio.
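The bias term can be derived in two lines. As a sketch, assume the structural equation has the form $Y=\beta A+\varepsilon$; this form is an assumption made here for illustration, and the exact hypotheses are those of the theorem above. Taking conditional expectations given $Z=z$ and subtracting the $Z=0$ expression from the $Z=1$ expression gives
\begin{align*}
\mathbb E[Y\mid Z=1]-\mathbb E[Y\mid Z=0]
&=\beta\{\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]\}+\{\mathbb E[\varepsilon\mid Z=1]-\mathbb E[\varepsilon\mid Z=0]\}\\
&=\beta\pi+\gamma.
\end{align*}
Dividing by $\pi\ne 0$,
\begin{align*}
\beta_{\mathrm{Wald}}=\frac{\beta\pi+\gamma}{\pi}=\beta+\frac{\gamma}{\pi}.
\end{align*}
The theorem's conclusion is the special case $\gamma=0$.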
The linear structural form is also doing real work. Outside a model with a constant additive slope $\beta$, the Wald ratio need not equal an average treatment effect for the whole population; with heterogeneous treatment effects it averages only the treatment variation induced by the instrument, and with nonlinear outcome models the ratio may not have a simple structural-slope interpretation. This limitation motivates the potential-outcome development below, where the same ratio is reinterpreted as a local causal effect under stronger individual-level assumptions. The next example turns the denominator problem into a numerical warning.
[example: Weak Instrument Pathology in a Linear Model]
Suppose $Y=2A+\varepsilon$, with $\mathbb E[\varepsilon\mid Z=1]=\mathbb E[\varepsilon\mid Z=0]$, and suppose the first stage is only
\begin{align*}
\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]=0.01.
\end{align*}
Taking conditional expectations of $Y=2A+\varepsilon$ given $Z=z$ gives
\begin{align*}
\mathbb E[Y\mid Z=z]=2\mathbb E[A\mid Z=z]+\mathbb E[\varepsilon\mid Z=z].
\end{align*}
Subtracting the expression for $Z=0$ from the expression for $Z=1$ gives
\begin{align*}
\mathbb E[Y\mid Z=1]-\mathbb E[Y\mid Z=0]=2\{\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]\}+\{\mathbb E[\varepsilon\mid Z=1]-\mathbb E[\varepsilon\mid Z=0]\}.
\end{align*}
The error-mean difference is $0$, so the true numerator is
\begin{align*}
2(0.01)+0=0.02.
\end{align*}
Thus the population Wald ratio is
\begin{align*}
\frac{0.02}{0.01}=2.
\end{align*}
Now let the estimated numerator equal the true numerator plus a sampling error $\eta$ with $|\eta|=0.02$. The estimated ratio is
\begin{align*}
\frac{0.02+\eta}{0.01}.
\end{align*}
Its error relative to the population ratio is
\begin{align*}
\frac{0.02+\eta}{0.01}-\frac{0.02}{0.01}=\frac{\eta}{0.01}.
\end{align*}
Since $|\eta|=0.02$, the magnitude of the induced Wald-ratio error is
\begin{align*}
\frac{0.02}{0.01}=2.
\end{align*}
The instrument is formally valid in the population calculation, but the very small first stage amplifies a modest numerator error into an error as large as the structural effect itself.
[/example]
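The amplification in this example depends only on the ratio $\eta/\pi$, so it is easy to tabulate. The sketch below holds the numerator error fixed at the example's value $0.02$ and varies the first stage. The two larger first-stage values are hypothetical comparison points and are not taken from the text.

```python
from fractions import Fraction

def wald_error(numerator_error, first_stage):
    """Error in the Wald ratio caused by an additive error in the numerator."""
    return numerator_error / first_stage

eta = Fraction(2, 100)  # numerator error from the example
# First entry is the example's first stage; the other two are hypothetical.
first_stages = [Fraction(1, 100), Fraction(1, 10), Fraction(1, 2)]

errors = {pi: wald_error(eta, pi) for pi in first_stages}
for pi, err in errors.items():
    print(f"first stage {float(pi):.2f} -> ratio error {float(err):.2f}")
```

The same numerator error of $0.02$ shifts the ratio by $2$ when the first stage is $0.01$, by $0.2$ when it is $0.10$, and by $0.04$ when it is $0.50$.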
The single binary-instrument ratio is only the simplest member of a broader linear IV family. To handle continuous instruments, multiple instruments, or covariates, we need the projection formulation used by two-stage least squares.
[definition: Population Two-Stage Least Squares]
Let $Y$ and $A$ be square-integrable real-valued random variables, and let $Z$ be a square-integrable instrument. The population two-stage least squares coefficient of $Y$ on $A$ using $Z$ is the coefficient on $\widehat A$ in the population linear projection of $Y$ onto $(1,\widehat A)$, where $\widehat A$ is the population linear projection of $A$ onto $(1,Z)$.
[/definition]
This projection definition is the population version of the two regressions used in finite samples. The first stage keeps the component of treatment that is linearly predicted by the instrument; the second stage relates the outcome to that instrument-induced component rather than to the full, confounded treatment. In the binary single-instrument case, population two-stage least squares equals the Wald estimand because the fitted value $\widehat A$ takes two values and the projection slope reduces to a difference in conditional means.
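The equality between population two-stage least squares and the Wald estimand for a binary instrument can be checked exactly on a small discrete distribution. In the sketch below the joint law of $(Z,A,Y)$ is hypothetical, and the helper names are introduced only for this illustration. The code carries out the two projections literally and compares the result with the ratio of conditional-mean differences.

```python
from fractions import Fraction

# Hypothetical joint pmf of (Z, A, Y) on {0,1}^3; weights are in sixteenths.
WEIGHTS = {
    (0, 0, 0): 3, (0, 0, 1): 3, (0, 1, 0): 1, (0, 1, 1): 1,
    (1, 0, 0): 1, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 5,
}
PMF = {k: Fraction(w, 16) for k, w in WEIGHTS.items()}

def mean(f):
    """Expectation of f(Z, A, Y) under PMF."""
    return sum(p * f(z, a, y) for (z, a, y), p in PMF.items())

def cov(f, g):
    """Covariance of f(Z, A, Y) and g(Z, A, Y) under PMF."""
    return mean(lambda z, a, y: f(z, a, y) * g(z, a, y)) - mean(f) * mean(g)

def cond_mean(f, z_value):
    """Conditional expectation of f given Z = z_value."""
    cells = [(k, p) for k, p in PMF.items() if k[0] == z_value]
    return sum(p * f(*k) for k, p in cells) / sum(p for _, p in cells)

def Z(z, a, y):
    return z

def A(z, a, y):
    return a

def Y(z, a, y):
    return y

# First stage: linear projection of A onto (1, Z).
slope = cov(A, Z) / cov(Z, Z)
intercept = mean(A) - slope * mean(Z)

def A_hat(z, a, y):
    return intercept + slope * z

# Second stage: coefficient on A_hat in the projection of Y onto (1, A_hat).
tsls = cov(Y, A_hat) / cov(A_hat, A_hat)
wald = (cond_mean(Y, 1) - cond_mean(Y, 0)) / (cond_mean(A, 1) - cond_mean(A, 0))
print(tsls, wald)
```

Both quantities equal $1/2$ for this distribution. Exact rational arithmetic is used so that the agreement is an identity and not a floating-point coincidence.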
With covariates $X$, the same construction includes $X$ in both stages: project $A$ onto the linear span of $(1,Z,X)$, then project $Y$ onto $(1,\widehat A,X)$. Equivalently, after residualising $Y$, $A$, and $Z$ with respect to $X$, the IV coefficient uses the part of residual treatment predicted by residual instrument variation. This is the regression-adjusted version of the same argument: covariates are used to make the instrument-as-good-as-random condition more plausible within covariate strata, while the instrument still supplies the treatment movement.
Multiple instruments fit the same projection language. If $Z=(Z_1,\dots,Z_k)$, the first stage projects $A$ onto the span of all instruments, and the second stage uses the fitted treatment value. Population 2SLS then combines the instruments according to how much treatment variation they predict; if different instruments move different complier groups and treatment effects are heterogeneous, that combined coefficient can depend on the chosen instrument set.
This connects IV to ordinary least squares and projection geometry. OLS projects $Y$ directly onto the observed treatment $A$, which is biased when $A$ contains confounded variation. IV replaces $A$ by the projection of $A$ onto variation supplied by $Z$, so the identifying content is not a new algebraic trick but a restriction on which direction in the data is allowed to carry causal meaning.
[remark: IV as a Causal Estimand]
The IV coefficient is not automatically the average treatment effect. It becomes a causal estimand only after the structural or potential outcome assumptions specify which counterfactual contrast the ratio represents.
[/remark]
The potential outcome formulation makes that contrast precise. It also explains why the relevant population is not all units, but the units whose treatment is changed by the instrument.
## Potential Outcome IV Assumptions
The potential outcome question is: what individual-level causal types are compatible with the way $Z$ changes $A$? For each unit, let $A(1)$ and $A(0)$ denote the treatment that would be received under instrument values $1$ and $0$. Let $Y(a)$ denote the outcome that would be observed under treatment value $a$.
[definition: IV Potential Outcomes]
For binary instrument $Z$ and binary treatment $A$, the treatment potential outcomes are $A(1),A(0)\in\{0,1\}$. The outcome potential outcomes are $Y(1)$ and $Y(0)$, where $Y(a)$ is the outcome under treatment value $a$.
[/definition]
The exclusion restriction is encoded by writing outcomes as $Y(a)$ rather than $Y(z,a)$. This notation asserts that once treatment is fixed, changing $Z$ has no remaining causal effect on the outcome. To understand who is affected by the instrument, we classify units by the pair of treatment potential outcomes.
[definition: Principal Strata for a Binary Instrument]
The principal stratum of a unit is the pair $(A(0),A(1))$. The four strata are:
1. Compliers: $A(0)=0$ and $A(1)=1$.
2. Always-takers: $A(0)=1$ and $A(1)=1$.
3. Never-takers: $A(0)=0$ and $A(1)=0$.
4. Defiers: $A(0)=1$ and $A(1)=0$.
[/definition]
These strata describe how treatment receipt responds to the instrument. The IV ratio can identify a causal effect for compliers only after excluding defiers, so the next assumption supplies the needed one-directional response condition.
[definition: Monotonicity]
The monotonicity assumption is
\begin{align*}
A(1) \ge A(0) \quad \text{a.s.}
\end{align*}
[/definition]
Monotonicity says that the instrument can encourage treatment or leave treatment unchanged, but it cannot discourage treatment for any unit. In encouragement designs this is often plausible when $Z=1$ gives extra access or information and $Z=0$ withholds it, as the following example illustrates.
[example: Principal Strata in an Encouragement Trial]
Let $A(1)$ be vaccine uptake if the encouragement letter is sent and $A(0)$ be vaccine uptake if it is not sent. The four possible pairs $(A(0),A(1))$ are exactly
$(0,1)$, $(1,1)$, $(0,0)$, and $(1,0)$.
Units with $(A(0),A(1))=(0,1)$ are compliers: they do not vaccinate without encouragement and do vaccinate with encouragement. For them,
\begin{align*}
A(1)-A(0)=1-0=1.
\end{align*}
Units with $(A(0),A(1))=(1,1)$ are always-takers, since
\begin{align*}
A(1)-A(0)=1-1=0.
\end{align*}
Units with $(A(0),A(1))=(0,0)$ are never-takers, since
\begin{align*}
A(1)-A(0)=0-0=0.
\end{align*}
Units with $(A(0),A(1))=(1,0)$ are defiers: they vaccinate only when not encouraged, and
\begin{align*}
A(1)-A(0)=0-1=-1.
\end{align*}
Thus the no-defiers condition is exactly the requirement that the last case, $(A(0),A(1))=(1,0)$, does not occur. If the letter only adds information and does not reduce access to vaccination, then a person who would vaccinate without the letter should still be able to vaccinate with the letter, so $A(0)=1$ implies $A(1)=1$. That implication rules out $(1,0)$ and gives $A(1)\ge A(0)$ for every unit, but it remains an assumption about the unobserved pair $(A(0),A(1))$.
[/example]
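The classification can be written as a small lookup, which makes the no-defiers condition a one-line check. This is a sketch on hypothetical units whose pairs $(A(0),A(1))$ are assumed known. In real data only one member of each pair is observed, so the function cannot be applied to observed records.

```python
STRATA = {
    (0, 1): "complier",
    (1, 1): "always-taker",
    (0, 0): "never-taker",
    (1, 0): "defier",
}

def principal_stratum(a0, a1):
    """Name the stratum of a unit from its pair (A(0), A(1))."""
    return STRATA[(a0, a1)]

def monotone(units):
    """True if A(1) >= A(0) for every unit, i.e. there are no defiers."""
    return all(a1 >= a0 for a0, a1 in units)

# Hypothetical units, written as (A(0), A(1)).
units = [(0, 1), (1, 1), (0, 0), (0, 1)]
labels = [principal_stratum(a0, a1) for a0, a1 in units]
print(labels)
print(monotone(units), monotone(units + [(1, 0)]))
```

Adding a single unit with pair $(1,0)$ is enough to make the monotonicity check fail.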
The example shows why the target group is defined by response to the instrument rather than by observed treatment status. This motivates naming the causal contrast among compliers before proving the identification theorem.
[definition: Local Average Treatment Effect]
Assume $\mathbb P(A(1)>A(0))>0$. The local average treatment effect is
\begin{align*}
\operatorname{LATE}
&:= \mathbb E[Y(1)-Y(0)\mid A(1)>A(0)].
\end{align*}
[/definition]
The conditioning event $A(1)>A(0)$ selects compliers. Thus LATE is local to the subpopulation whose treatment is affected by the instrument.
## The Angrist-Imbens-Rubin LATE Theorem
We now prove the main identification result for binary instruments and binary treatments. The theorem explains exactly what the Wald estimand equals under the potential outcome IV assumptions.
[quotetheorem:9690]
[citeproof:9690]
This theorem is often the most important conceptual correction to the naive IV interpretation. A valid binary instrument identifies the average effect for compliers, not necessarily the average effect for everyone. The proof explains the algebraic cancellation, but the assumptions are also necessary in substantive ways. If independence fails, for example because $Z$ is physician preference and sicker patients are sorted to physicians with different prescribing habits, the instrument-outcome contrast already contains prognosis differences before treatment is considered. If exclusion fails, such as an offer of job training also providing counselling to people who never attend, the numerator contains a direct effect of $Z$. If relevance fails, there are no compliers and the denominator is zero.
Monotonicity is the assumption that turns the first stage into a complier probability. If defiers exist, the denominator is the difference between the proportion of compliers and the proportion of defiers, and the numerator subtracts the treatment effects of defiers rather than merely averaging complier effects. Thus the Wald ratio may still be a well-defined number, but it is no longer the complier average treatment effect. The theorem also does not say that the complier effect equals the population average treatment effect, nor that two different instruments identify the same causal contrast.
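The role of monotonicity can be seen in an exact calculation on a population of principal strata. The sketch below uses hypothetical shares and potential outcomes, and assumes that $Z$ is independent of type and affects the outcome only through $A(Z)$. It compares the Wald ratio with the complier average effect, first without defiers and then with them.

```python
from fractions import Fraction

def wald_and_late(types):
    """types: list of (share, A(0), A(1), Y(0), Y(1)); Z is independent of type."""
    def mean_a(z):
        return sum(s * (a1 if z else a0) for s, a0, a1, _, _ in types)

    def mean_y(z):
        # Exclusion: the outcome depends on z only through the treatment A(z).
        return sum(s * (y1 if (a1 if z else a0) else y0)
                   for s, a0, a1, y0, y1 in types)

    wald = (mean_y(1) - mean_y(0)) / (mean_a(1) - mean_a(0))
    compliers = [(s, y1 - y0) for s, a0, a1, y0, y1 in types if a1 > a0]
    late = sum(s * tau for s, tau in compliers) / sum(s for s, _ in compliers)
    return wald, late

half, quarter = Fraction(1, 2), Fraction(1, 4)

monotone = [
    (half, 0, 1, 0, 2),     # compliers, effect 2
    (quarter, 1, 1, 1, 5),  # always-takers, effect 4
    (quarter, 0, 0, 3, 3),  # never-takers, effect 0
]
with_defiers = [
    (half, 0, 1, 0, 2),     # compliers, effect 2
    (quarter, 1, 0, 0, 6),  # defiers, effect 6
    (quarter, 0, 0, 3, 3),  # never-takers, effect 0
]
print(wald_and_late(monotone))
print(wald_and_late(with_defiers))
```

In the monotone population the Wald ratio and the complier effect are both $2$. In the population with defiers the complier effect is still $2$, but the Wald ratio is $(\tfrac12\cdot 2-\tfrac14\cdot 6)/(\tfrac12-\tfrac14)=-2$: the ratio is negative even though no unit has a negative treatment effect.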
[remark: Why the Estimand Is Local]
Always-takers and never-takers help determine the observed treatment probabilities, but they do not contribute treatment variation induced by $Z$. Since their treatment is unchanged across instrument values, the Wald ratio cannot reveal their individual treatment effects without additional assumptions. Compliers enter because they are exactly the units for whom $A(1)-A(0)=1$, so the first stage weights the population by responsiveness to this particular instrument.
[/remark]
The theorem also clarifies why different instruments can answer different causal questions. A reminder letter, a price subsidy, and geographic distance to a clinic may each define a different complier group, so the next example compares the resulting target populations.
[example: Different Instruments, Different Compliers]
Let $Z_1$ indicate a reminder letter and $Z_2$ indicate a voucher. For each instrument $Z_j$, write $A_j(1)$ and $A_j(0)$ for the treatment that would be received when $Z_j$ is set to $1$ or $0$. Suppose half the population is reminder-responsive and half is voucher-responsive. For reminder-responsive units,
\begin{align*}
A_1(1)-A_1(0)=1-0=1
\end{align*}
and
\begin{align*}
A_2(1)-A_2(0)=0-0=0.
\end{align*}
For voucher-responsive units,
\begin{align*}
A_1(1)-A_1(0)=0-0=0
\end{align*}
and
\begin{align*}
A_2(1)-A_2(0)=1-0=1.
\end{align*}
Thus the compliers for $Z_1$ are exactly the reminder-responsive units, while the compliers for $Z_2$ are exactly the voucher-responsive units. If the individual treatment effect $\tau=Y(1)-Y(0)$ equals $1$ for reminder-responsive units and $3$ for voucher-responsive units, then
\begin{align*}
\operatorname{LATE}_{Z_1}=\mathbb E[\tau\mid A_1(1)>A_1(0)]=1.
\end{align*}
For the voucher instrument,
\begin{align*}
\operatorname{LATE}_{Z_2}=\mathbb E[\tau\mid A_2(1)>A_2(0)]=3.
\end{align*}
Both instruments can be valid for the same treatment, but they answer different local questions because they move different units into treatment.
[/example]
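The two local effects in this example can be reproduced as Wald ratios, one per instrument. In the sketch below the baseline outcome $Y(0)=0$ is a hypothetical choice, since only $\tau=Y(1)-Y(0)$ is fixed by the example. Each instrument is assumed to be independent of type and to satisfy exclusion.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Each type: (share, (A_1(0), A_1(1)), (A_2(0), A_2(1)), Y(0), Y(1)).
# Y(0) = 0 is a hypothetical baseline; tau = Y(1) - Y(0) follows the example.
population = [
    (half, (0, 1), (0, 0), 0, 1),  # reminder-responsive, tau = 1
    (half, (0, 0), (0, 1), 0, 3),  # voucher-responsive, tau = 3
]

def wald(population, j):
    """Wald ratio for instrument j (0 = reminder Z_1, 1 = voucher Z_2)."""
    def arm_means(z):
        a_mean = y_mean = Fraction(0)
        for share, resp_1, resp_2, y0, y1 in population:
            a = (resp_1, resp_2)[j][z]  # treatment A_j(z) for this type
            a_mean += share * a
            y_mean += share * (y1 if a else y0)
        return a_mean, y_mean

    (a_lo, y_lo), (a_hi, y_hi) = arm_means(0), arm_means(1)
    return (y_hi - y_lo) / (a_hi - a_lo)

ate = sum(share * (y1 - y0) for share, _, _, y0, y1 in population)
print(wald(population, 0), wald(population, 1), ate)
```

The ratios are $1$ for the reminder and $3$ for the voucher, while the population average effect is $2$. Neither instrument recovers the population average in this heterogeneous population.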
## Failures and Diagnostics
The practical question is how an IV analysis can fail and what the observed data can warn us about. Relevance can be checked directly, but exclusion, independence, and monotonicity require design knowledge, sensitivity analysis, or auxiliary evidence.
[definition: Weak Instrument]
An instrument is weak when the first-stage contrast
\begin{align*}
\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]
\end{align*}
is close to $0$ in the scale relevant for estimation.
[/definition]
Weak instruments do not merely reduce precision. They can make finite-sample IV estimators highly unstable and sensitive to small violations of independence or exclusion. The next remark quantifies the same issue at the level of the ratio.
[remark: Direction of Bias]
When the first stage is small, even a small direct effect of $Z$ on $Y$ can dominate the IV ratio. If the exclusion violation contributes $\delta$ to the numerator and the first stage is $\pi$, the induced distortion is $\delta/\pi$: its sign is determined by the signs of $\delta$ and $\pi$, and its magnitude grows as $\pi$ shrinks.
[/remark]
Sensitivity analysis asks how large an exclusion or independence violation would need to be to change the substantive conclusion. This is often more informative than reporting the IV estimate alone, and the next example shows the kind of direct path such an analysis should consider.
[example: Direct Effect Violating Exclusion]
In a job-training encouragement design, let $Z$ be an offer of programme access, $A$ attendance, and $Y$ earnings. Suppose the offer changes attendance for compliers, but also changes earnings directly through motivation or job-search counselling. Write the first stage as
\begin{align*}
\pi=\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0].
\end{align*}
If the potential-outcome IV assumptions except exclusion hold, and if the direct contribution of the offer to the outcome contrast is $\delta$, then the observed numerator decomposes as
\begin{align*}
\mathbb E[Y\mid Z=1]-\mathbb E[Y\mid Z=0]
=
\pi\,\mathbb E[Y(1)-Y(0)\mid A(1)>A(0)]+\delta.
\end{align*}
Dividing by the first stage gives
\begin{align*}
\frac{\mathbb E[Y\mid Z=1]-\mathbb E[Y\mid Z=0]}{\mathbb E[A\mid Z=1]-\mathbb E[A\mid Z=0]}
=
\frac{\pi\,\mathbb E[Y(1)-Y(0)\mid A(1)>A(0)]+\delta}{\pi}.
\end{align*}
Since $\pi\ne 0$, the right-hand side separates into
\begin{align*}
\mathbb E[Y(1)-Y(0)\mid A(1)>A(0)]+\frac{\delta}{\pi}.
\end{align*}
Thus the Wald ratio is no longer only the effect of attendance among compliers: it equals that local attendance effect plus the direct offer effect divided by the first stage. A small first stage makes the exclusion violation especially damaging, because the extra term $\delta/\pi$ becomes large when $\pi$ is close to $0$.
[/example]
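The decomposition above suggests a simple sensitivity calculation: if the complier effect were actually zero, a direct effect of size $\delta=\beta_{\mathrm{Wald}}\,\pi$ would reproduce the observed ratio on its own. The sketch below tabulates this threshold. The Wald value of $1000$ and the first-stage values are hypothetical, and the function names are introduced only for illustration.

```python
from fractions import Fraction

def exclusion_bias(delta, first_stage):
    """Extra term delta / pi added to the Wald ratio by a direct effect delta."""
    return delta / first_stage

def delta_explaining_away(wald, first_stage):
    """Direct effect that alone reproduces `wald` if the complier effect is zero."""
    return wald * first_stage

wald_estimate = Fraction(1000)  # hypothetical earnings effect per attendance
thresholds = {}
for pi in (Fraction(1, 2), Fraction(1, 5), Fraction(1, 20)):
    thresholds[pi] = delta_explaining_away(wald_estimate, pi)
    print(f"first stage {float(pi):.2f}: "
          f"direct effect of {float(thresholds[pi]):.0f} would suffice")
```

With a first stage of $0.05$, a direct effect of only $50$ would account for the whole estimate, whereas a first stage of $0.50$ would require a direct effect of $500$. The weaker the instrument, the smaller the exclusion violation needed to overturn the conclusion.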
Instrumental variables therefore trade one identification problem for another. They can handle unmeasured confounding of treatment when a credible source of treatment variation is available, but the resulting estimand is local and the assumptions are largely causal rather than statistical.
Instrumental variables thus replace exchangeability of the treatment with causal design assumptions about the instrument. The course now shifts from using a known graph to infer effects to asking what parts of the graph can be learned from data alone.
# 11. Causal Discovery Foundations
This chapter turns the graphical language of causal DAGs into a discovery problem: given only the observational law of a random vector, what features of the underlying causal graph can be recovered? The earlier chapters used a known graph to decide adjustment, interventions, and identification. Here the direction is reversed: conditional independences in the distribution are used as evidence about the graph, but the evidence is incomplete because distinct DAGs can encode the same independence model.
The main mathematical theme is that observational discovery is possible only up to Markov equivalence. We first describe the equivalence class of a DAG through skeletons and v-structures, then formulate the population PC algorithm as a constraint-based procedure under causal sufficiency and faithfulness, and finally isolate which causal directions are identifiable without interventions.
## Conditional Independence as Graphical Evidence
The discovery problem begins with a tension. Conditional independence statements are features of the observational distribution, while arrows are features of a causal graph. A discovery method needs assumptions connecting these two objects before any graph-theoretic conclusion can be drawn.
[definition: Skeleton]
Let $G=(V,E)$ be a directed acyclic graph. The skeleton of $G$ is the undirected graph on vertex set $V$ with an undirected edge $A-B$ whenever either $A \to B$ or $B \to A$ is an edge of $G$.
[/definition]
The skeleton records adjacency but discards orientation. This is the part of a graph most directly tested by asking whether two variables can be separated by conditioning on other variables.
[example: Chain And Fork Have The Same Skeleton]
Consider the three DAGs $X \to Y \to Z$, $X \leftarrow Y \to Z$, and $X \leftarrow Y \leftarrow Z$. In all three graphs, the only adjacencies are between $X$ and $Y$ and between $Y$ and $Z$, so the skeleton is the undirected path $X-Y-Z$.
For $X \to Y \to Z$, the only path from $X$ to $Z$ is $X \to Y \to Z$; the middle vertex $Y$ is a non-collider, so the path is open with no conditioning and blocked after conditioning on $Y$. Thus a faithful distribution has $X \not\perp\!\!\!\perp Z$ and $X \perp\!\!\!\perp Z \mid Y$. For $X \leftarrow Y \to Z$, the only path is $X \leftarrow Y \to Z$, again with $Y$ a non-collider, so the same conclusion holds. For $X \leftarrow Y \leftarrow Z$, the only path is $X \leftarrow Y \leftarrow Z$, and $Y$ is still a non-collider, so again $X \not\perp\!\!\!\perp Z$ but $X \perp\!\!\!\perp Z \mid Y$ under faithfulness. The same skeleton and the same endpoint conditional independence therefore leave unresolved whether $Y$ is a mediator or a common cause.
[/example]
The example shows that adjacency information is too coarse: it keeps the path $X-Y-Z$ but loses the causal role of the middle node. To recover any orientation from conditional independence, we need a local pattern whose independence signature changes when the middle node is conditioned on. The key pattern is an unshielded collider, because it blocks rather than transmits association along the path until conditioning activates it.
[definition: V-Structure]
Let $G$ be a directed acyclic graph. A v-structure in $G$ is an ordered triple $(A,B,C)$ of distinct vertices such that $A \to B \leftarrow C$ in $G$ and $A$ and $C$ are not adjacent in the skeleton of $G$.
[/definition]
The requirement that $A$ and $C$ are not adjacent matters. If an extra edge joins them, the same local arrow pattern no longer creates the same conditional independence signature.
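The two definitions above can be read as small computations on an edge list. The following is a minimal sketch, assuming a DAG is given as a list of `(parent, child)` pairs; the function names `skeleton` and `v_structures` are illustrative choices, not notation from the course.

```python
from itertools import combinations

def skeleton(edges):
    """Skeleton as a set of unordered pairs: orientation is discarded."""
    return {frozenset(edge) for edge in edges}

def v_structures(edges):
    """Unshielded colliders A -> B <- C, stored once as (A, B, C) with A < C."""
    skel = skeleton(edges)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    found = set()
    for b, pa in parents.items():
        for a, c in combinations(sorted(pa), 2):
            # The triple counts only when the two parents are non-adjacent.
            if frozenset((a, c)) not in skel:
                found.add((a, b, c))
    return found

chain = [("X", "Y"), ("Y", "Z")]
fork = [("Y", "X"), ("Y", "Z")]
collider = [("X", "Y"), ("Z", "Y")]
shielded = collider + [("X", "Z")]   # same arrows into Y, plus an edge X -> Z

print(skeleton(chain) == skeleton(fork) == skeleton(collider))  # True
print(v_structures(collider))   # {('X', 'Y', 'Z')}
print(v_structures(shielded))   # set()
```

The last line illustrates the non-adjacency requirement: once $X$ and $Z$ are joined, the arrows into $Y$ no longer form a v-structure.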
[example: Identifying A Collider]
Let $G$ be the DAG $X \to Y \leftarrow Z$, and suppose the observational law $P$ is faithful to $G$. The only path from $X$ to $Z$ is
\begin{align*}
X \to Y \leftarrow Z.
\end{align*}
The middle vertex $Y$ is a collider on this path, because both arrows on the path point into $Y$. With no conditioning, this collider blocks the path, so $X$ and $Z$ are d-separated by $\varnothing$. By the Markov direction of the faithfulness assumption, the graphical separation implies
\begin{align*}
X \perp\!\!\!\perp Z.
\end{align*}
Now condition on $Y$. Conditioning on a collider opens the path through that collider, so the same path
\begin{align*}
X \to Y \leftarrow Z
\end{align*}
is active given $\{Y\}$. Since this is the only path between $X$ and $Z$, they are d-connected given $Y$, and faithfulness gives
\begin{align*}
X \not\perp\!\!\!\perp Z \mid Y.
\end{align*}
If the skeleton is known to be $X-Y-Z$, this separates the collider from all three non-collider orientations: in $X \to Y \to Z$, $X \leftarrow Y \to Z$, and $X \leftarrow Y \leftarrow Z$, the middle vertex $Y$ is a non-collider, so conditioning on $Y$ blocks the only path between $X$ and $Z$ and gives $X \perp\!\!\!\perp Z \mid Y$ instead. Thus the change from marginal independence to conditional dependence is exactly the signature of the unshielded collider $X \to Y \leftarrow Z$.
[/example]
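The graphical half of these arguments can be checked mechanically. The sketch below tests d-separation with the ancestral moral graph criterion, a standard equivalent of path blocking: restrict to ancestors of the nodes involved, join parents of a common child, drop directions, delete the conditioning set, and test reachability. The function name `d_separated` and the edge-list encoding are assumptions made for illustration.

```python
def d_separated(edges, a, b, given=()):
    """Return True when a and b are d-separated by `given` in the DAG `edges`."""
    given = set(given)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # Step 1: ancestral closure of {a, b} and the conditioning set.
    keep, stack = set(), [a, b, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # Step 2: moral graph (drop directions, join parents of a common child).
    nbrs = {v: set() for v in keep}
    for child in keep:
        family = [child, *parents.get(child, ())]
        for u in family:
            for v in family:
                if u != v:
                    nbrs[u].add(v)

    # Step 3: search from a without passing through conditioned nodes.
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        if node not in seen and node not in given:
            seen.add(node)
            stack.extend(nbrs[node])
    return True

orientations = {
    "X -> Y -> Z": [("X", "Y"), ("Y", "Z")],
    "X <- Y -> Z": [("Y", "X"), ("Y", "Z")],
    "X <- Y <- Z": [("Y", "X"), ("Z", "Y")],
    "X -> Y <- Z": [("X", "Y"), ("Z", "Y")],
}
for name, dag in orientations.items():
    # Columns: separated by the empty set, separated by {Y}.
    print(name, d_separated(dag, "X", "Z"), d_separated(dag, "X", "Z", {"Y"}))
```

The three non-collider orientations print `False True`, while the collider prints `True False`, which is the reversal of signature used in the example.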
The collider example gives the first genuinely directional information, while the chain and fork example shows that many orientations remain unresolved. We therefore need a relation that groups DAGs according to exactly the conditional independences they imply, rather than according to their literal arrow sets. That relation is the right target for discovery from observational independences.
[definition: Markov Equivalence]
Two directed acyclic graphs $G_1$ and $G_2$ on the same vertex set are Markov equivalent if they imply the same set of conditional independence statements by d-separation.
[/definition]
Markov equivalence is a graphical relation, not a finite-sample statistical relation. The definition raises a structural question: how can we decide equivalence without comparing every possible conditioning set? The answer is that the skeleton and v-structures are exactly the graphical information carried by d-separation.
[quotetheorem:9691]
[proofunderconstruction:9691]
This theorem is the central negative and positive result for observational graph recovery. It says that all information carried by a DAG's d-separation model is compressed into the skeleton and the unshielded colliders. The common vertex set hypothesis is essential: if the variable set changes, marginalising or adding variables can create new independences and can turn a DAG problem into one involving latent projections rather than ordinary DAG equivalence. The theorem is also about graphical Markov models, not finite-sample distributions; two empirical distributions may look similar because of sampling error, while two population distributions can agree on many numerical features without having the same conditional independence model. Finally, the result does not identify every arrow: shielded collider status and many chain-versus-fork orientations can vary across equivalent DAGs without changing any d-separation statement.
## CPDAGs and Equivalence Classes
A Markov equivalence class may contain many DAGs, so discovery output needs a compact representation. The question is which arrows are common to all DAGs in the class and which arrows can be reversed without changing the independence model.
[definition: CPDAG]
A completed partially directed acyclic graph, or CPDAG, is a mixed graph representing a Markov equivalence class of DAGs. It has a directed edge $A \to B$ when every DAG in the equivalence class contains $A \to B$, and an undirected edge $A-B$ when some DAGs in the class contain $A \to B$ and others contain $B \to A$.
[/definition]
A directed edge in a CPDAG is also called compelled, while an undirected edge is reversible within the equivalence class. The CPDAG is therefore a map of what observational conditional independences can force.
[example: Three Node Equivalence Class]
For the skeleton $X-Y-Z$, there are four acyclic orientations:
\begin{align*}
X \to Y \to Z,\quad X \leftarrow Y \to Z,\quad X \leftarrow Y \leftarrow Z,\quad X \to Y \leftarrow Z.
\end{align*}
In the first three DAGs, the triple $X-Y-Z$ is not a v-structure: in $X \to Y \to Z$ only one arrow points into $Y$, in $X \leftarrow Y \to Z$ neither arrow points into $Y$, and in $X \leftarrow Y \leftarrow Z$ only one arrow points into $Y$. Thus these three DAGs have the same skeleton and the same set of v-structures, namely none. By the *Verma Pearl Markov Equivalence Characterization*, they are Markov equivalent, so the CPDAG for this equivalence class leaves both reversible edges unoriented:
\begin{align*}
X-Y-Z.
\end{align*}
The remaining orientation $X \to Y \leftarrow Z$ has the same skeleton but now $X$ and $Z$ are non-adjacent and both arrows point into $Y$, so $(X,Y,Z)$ is a v-structure. Since no DAG without this collider is Markov equivalent to it, both arrowheads into $Y$ are compelled in its equivalence class. Its CPDAG is therefore
\begin{align*}
X \to Y \leftarrow Z.
\end{align*}
The example shows that the CPDAG records exactly the orientations forced by the v-structure information and leaves the chain-versus-fork ambiguity unresolved.
[/example]
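For small graphs the equivalence classes and their CPDAGs can be found by brute force. The sketch below enumerates every acyclic orientation of a given skeleton, groups the orientations by their v-structure sets (which, for a shared skeleton, is the criterion used in the example above), and reads off compelled and reversible edges. The helper names are hypothetical, and exhaustive enumeration is only sensible for a handful of edges.

```python
from itertools import combinations, product

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edge; failure means a cycle."""
    nodes, edges = set(nodes), set(edges)
    while nodes:
        sources = {n for n in nodes if not any(v == n for _, v in edges)}
        if not sources:
            return False
        nodes -= sources
        edges = {(u, v) for u, v in edges if u not in sources}
    return True

def v_structures(edges):
    adjacent = {frozenset(e) for e in edges}
    found = set()
    for (a, b), (c, d) in combinations(sorted(edges), 2):
        # Two arrows into the same head whose tails are non-adjacent.
        if b == d and frozenset((a, c)) not in adjacent:
            found.add((min(a, c), b, max(a, c)))
    return frozenset(found)

def equivalence_classes(nodes, skeleton):
    """Group every acyclic orientation of `skeleton` by its v-structure set."""
    classes = {}
    for flips in product([False, True], repeat=len(skeleton)):
        dag = frozenset((v, u) if flip else (u, v)
                        for (u, v), flip in zip(skeleton, flips))
        if is_acyclic(nodes, dag):
            classes.setdefault(v_structures(dag), []).append(dag)
    return list(classes.values())

def cpdag(dags):
    """Directed edges shared by all members; every other edge is undirected."""
    compelled = set.intersection(*(set(d) for d in dags))
    all_edges = set().union(*dags)
    undirected = {frozenset(e) for e in all_edges if e not in compelled}
    return compelled, undirected

nodes = ["X", "Y", "Z"]
skel = [("X", "Y"), ("Y", "Z")]
classes = equivalence_classes(nodes, skel)
for members in classes:
    print(len(members), cpdag(members))
```

On the path skeleton this yields one class of three DAGs with no compelled edges and one singleton class in which both arrows into $Y$ are compelled.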
This representation also clarifies why orienting one edge can sometimes force another orientation. A proposed orientation is valid only if it preserves acyclicity and does not introduce or destroy v-structures relative to the equivalence class.
[remark: Orientation Propagation]
CPDAG construction is not just a list of colliders. After the v-structures are oriented, additional arrows may be forced to avoid directed cycles or new unshielded colliders. The standard orientation rules, often called Meek rules, repeatedly apply these constraints until no further compelled directions remain.
[/remark]
The CPDAG will be the output target for the population discovery algorithm below. In finite samples, statistical errors can disturb the learned skeleton or collider set, but the population version isolates the logical content of the method.
## Constraint-Based Discovery and the PC Algorithm
The PC algorithm answers the following problem: if an oracle tells us exactly which conditional independences hold in the observational distribution, can we recover the CPDAG of the true causal graph? The answer is yes under causal sufficiency, the Markov condition, and faithfulness.
[definition: Causal Sufficiency]
A causal model over observed variables $V$ is causally sufficient if there is no unobserved common cause of two or more variables in $V$.
[/definition]
Causal sufficiency lets the observed causal structure be represented by a DAG on the observed variables themselves. Without it, latent confounding can create dependence patterns that require mixed graphs rather than ordinary DAGs.
[definition: Faithfulness]
Let $P$ be a probability distribution on variables indexed by $V$, and let $G$ be a directed acyclic graph on $V$. The distribution $P$ is faithful to $G$ if, for all disjoint subsets $A,B,S \subset V$, the conditional independence $X_A \perp\!\!\!\perp X_B \mid X_S$ holds under $P$ if and only if $A$ and $B$ are d-separated by $S$ in $G$.
[/definition]
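Faithfulness rules out accidental cancellations in the distribution. The Markov condition already gives every d-separation as a conditional independence; the new content of faithfulness is the reverse implication, that every conditional independence in $P$ corresponds to a d-separation in $G$, which is what discovery needs.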
[example: Cancellation Violates Faithfulness]
Consider the linear Gaussian structural equations
\begin{align*}
X=\varepsilon_X,\qquad Y=aX+\varepsilon_Y,\qquad Z=bY+cX+\varepsilon_Z,
\end{align*}
where $a$, $b$, and $c$ are nonzero coefficients, $\varepsilon_X,\varepsilon_Y,\varepsilon_Z$ are mutually independent, centered Gaussian noises, and $\operatorname{Var}(X)>0$. The graph has arrows $X \to Y$, $Y \to Z$, and $X \to Z$, so $X$ and $Z$ are d-connected because the one-edge path $X \to Z$ is open.
We compute the marginal covariance. Since $Z=bY+cX+\varepsilon_Z$,
\begin{align*}
\operatorname{Cov}(X,Z)=\operatorname{Cov}(X,bY+cX+\varepsilon_Z).
\end{align*}
By bilinearity of covariance,
\begin{align*}
\operatorname{Cov}(X,Z)=b\operatorname{Cov}(X,Y)+c\operatorname{Cov}(X,X)+\operatorname{Cov}(X,\varepsilon_Z).
\end{align*}
Because $Y=aX+\varepsilon_Y$,
\begin{align*}
\operatorname{Cov}(X,Y)=\operatorname{Cov}(X,aX+\varepsilon_Y).
\end{align*}
Again by bilinearity,
\begin{align*}
\operatorname{Cov}(X,Y)=a\operatorname{Var}(X)+\operatorname{Cov}(X,\varepsilon_Y).
\end{align*}
The noises are independent of $X$, so $\operatorname{Cov}(X,\varepsilon_Y)=0$ and $\operatorname{Cov}(X,\varepsilon_Z)=0$. Hence
\begin{align*}
\operatorname{Cov}(X,Z)=b\,a\operatorname{Var}(X)+c\operatorname{Var}(X).
\end{align*}
Factoring out $\operatorname{Var}(X)$ gives
\begin{align*}
\operatorname{Cov}(X,Z)=(ab+c)\operatorname{Var}(X).
\end{align*}
If the direct coefficient is chosen as $c=-ab$, then
\begin{align*}
\operatorname{Cov}(X,Z)=(ab-ab)\operatorname{Var}(X)=0.
\end{align*}
Since $(X,Z)$ is jointly Gaussian, zero covariance implies $X \perp\!\!\!\perp Z$. Thus the distribution contains a marginal independence even though $X$ and $Z$ are d-connected in the DAG, so the distribution is not faithful. A population constraint-based method would remove the edge $X-Z$, because its oracle reports the separating set $\varnothing$.
[/example]
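The covariance identity can be checked with exact arithmetic. This is a minimal sketch that writes each variable as a linear combination of the three noises and assumes unit noise variances; the coefficient values $a=2$, $b=3$ and the helper names are hypothetical choices for illustration.

```python
from fractions import Fraction

def noise_loadings(a, b, c):
    """Write X, Y, Z as linear combinations of (eps_X, eps_Y, eps_Z)."""
    x = (Fraction(1), Fraction(0), Fraction(0))               # X = eps_X
    y = tuple(a * xi + e for xi, e in zip(x, (0, 1, 0)))      # Y = aX + eps_Y
    z = tuple(b * yi + c * xi + e                             # Z = bY + cX + eps_Z
              for yi, xi, e in zip(y, x, (0, 0, 1)))
    return {"X": x, "Y": y, "Z": z}

def cov(u, v, noise_var=(1, 1, 1)):
    """Covariance of two linear combinations of independent noises."""
    return sum(ui * vi * s for ui, vi, s in zip(u, v, noise_var))

def partial_cov_xz_given_y(load, noise_var=(1, 1, 1)):
    """Cov(X, Z) - Cov(X, Y) Cov(Y, Z) / Var(Y)."""
    x, y, z = load["X"], load["Y"], load["Z"]
    return (cov(x, z, noise_var)
            - cov(x, y, noise_var) * cov(y, z, noise_var) / cov(y, y, noise_var))

generic = noise_loadings(2, 3, 1)     # ab + c = 7
cancel = noise_loadings(2, 3, -6)     # c = -ab

print(cov(generic["X"], generic["Z"]))   # 7
print(cov(cancel["X"], cancel["Z"]))     # 0
print(partial_cov_xz_given_y(cancel))    # -6/5
```

In this hypothetical parametrization the cancelling model has zero marginal covariance between $X$ and $Z$ but a nonzero partial covariance given $Y$, so the unfaithful distribution shows the marginal-independence, conditional-dependence pattern otherwise associated with a collider at $Y$.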
The cancellation example explains why the algorithm must assume that every independence is graphically meaningful. Under faithfulness, a conditional independence oracle can be treated as a d-separation oracle, so graph search can proceed by deleting adjacencies and then orienting the colliders those deletions reveal. This motivates the following definition of the population PC algorithm, which is the idealized constraint-based procedure studied in the correctness theorem.
[definition: Population PC Algorithm]
The population PC algorithm takes as input the joint distribution $P$ of variables indexed by $V$ and an oracle for conditional independence under $P$.
It begins with the complete undirected graph on $V$. It removes an edge $A-B$ whenever it finds a conditioning set $S \subset V \setminus \{A,B\}$ such that $X_A \perp\!\!\!\perp X_B \mid X_S$, recording one such separating set as $S_{AB}$. After no more edges can be removed, it orients each unshielded triple $A-B-C$ as $A \to B \leftarrow C$ when $B \notin S_{AC}$. It then applies sound orientation rules that preserve the learned v-structures and acyclicity until no further orientations are forced.
[/definition]
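A small sketch of this procedure follows. It makes several simplifying assumptions: the oracle is simulated by d-separation on a known DAG (which is what the Markov and faithfulness conditions license), the skeleton phase searches all subsets of the remaining variables as in the definition rather than using adjacency-restricted searches, and the propagation phase applies only the first Meek rule, so it is not a complete implementation of the orientation rules. The names `population_pc` and `d_separated` are illustrative.

```python
from itertools import combinations

def d_separated(edges, a, b, given):
    """d-separation oracle via the ancestral moral graph."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    keep, stack = set(), [a, b, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    nbrs = {v: set() for v in keep}
    for child in keep:
        family = [child, *parents.get(child, ())]
        for u in family:
            nbrs[u].update(w for w in family if w != u)
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        if node not in seen and node not in given:
            seen.add(node)
            stack.extend(nbrs[node])
    return True

def population_pc(nodes, indep):
    """Return (arrows, undirected) from an oracle indep(a, b, S) -> bool."""
    # Skeleton phase: start complete, delete A-B once some S separates them.
    adjacent = {frozenset(pair) for pair in combinations(nodes, 2)}
    sepset = {}
    for a, b in combinations(nodes, 2):
        others = [v for v in nodes if v not in (a, b)]
        subsets = (set(s) for k in range(len(others) + 1)
                   for s in combinations(others, k))
        witness = next((s for s in subsets if indep(a, b, s)), None)
        if witness is not None:
            adjacent.discard(frozenset((a, b)))
            sepset[frozenset((a, b))] = witness

    # Collider phase: orient unshielded A-B-C as A -> B <- C when B not in S_AC.
    arrows = set()
    for b in nodes:
        nbrs = [v for v in nodes if frozenset((v, b)) in adjacent]
        for a, c in combinations(nbrs, 2):
            pair = frozenset((a, c))
            if pair not in adjacent and b not in sepset[pair]:
                arrows |= {(a, b), (c, b)}
    undirected = {e for e in adjacent if not any(set(ar) == e for ar in arrows)}

    # Propagation, first Meek rule only: A -> B, B - C, A and C non-adjacent.
    changed = True
    while changed:
        changed = False
        for a, b in list(arrows):
            for e in [e for e in undirected if b in e]:
                (c,) = e - {b}
                if frozenset((a, c)) not in adjacent:
                    arrows.add((b, c))
                    undirected.discard(e)
                    changed = True
    return arrows, undirected

true_dag = [("X", "Y"), ("Z", "Y"), ("Y", "W")]
oracle = lambda a, b, s: d_separated(true_dag, a, b, s)
print(population_pc(["X", "Y", "Z", "W"], oracle))
```

In this four-variable example the collider phase orients $X \to Y \leftarrow Z$, and the propagation step then orients $Y \to W$, since leaving it reversible would allow a new unshielded collider at $Y$.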
The skeleton phase searches for conditional independences that witness non-adjacency, and the collider phase uses the stored separating sets to decide whether the middle node of an unshielded triple behaved like a collider. What remains to prove is that these local decisions assemble into the correct global equivalence-class representative. The correctness theorem states that, in the population setting, PC recovers exactly the CPDAG rather than a heuristic approximation to it.
[quotetheorem:9692]
[citeproof:9692]
This theorem is a population statement, and each assumption is doing real work. If causal sufficiency fails, a latent common cause $U$ of $X$ and $Y$ can produce dependence between the observed variables without any observed directed edge, so an ordinary DAG on $V$ is the wrong target and PC may retain or orient edges that represent confounding rather than direct causation. If the Markov condition fails, d-separation no longer guarantees the conditional independences used to delete edges, so the skeleton phase is not licensed. If faithfulness fails, as in the cancellation example above, a genuine adjacency or active path can produce an accidental conditional independence, causing PC to delete an edge that belongs to the causal graph. A sample implementation replaces the independence oracle with statistical tests, so consistency also requires assumptions about testing, sample size, and the growth of the conditioning sets.
## What Observational Data Can Identify
The final question is not algorithmic but conceptual. If the best possible observational procedure recovers only a CPDAG, then the CPDAG describes the boundary between identifiable and non-identifiable causal directions.
[definition: Identifiable Direction From Observational Independences]
Let $G$ be a DAG and let $\mathcal{E}(G)$ be its Markov equivalence class. An arrow $A \to B$ in $G$ is identifiable from observational conditional independences if every DAG in $\mathcal{E}(G)$ contains the arrow $A \to B$.
[/definition]
This definition says that a direction is identifiable only when it is compelled throughout the equivalence class. It excludes directions that are plausible in one DAG but reversed in another observationally equivalent DAG. The next theorem connects this semantic notion of identifiability to the concrete graphical output of PC and other equivalence-class procedures.
[quotetheorem:9693]
[citeproof:9693]
The theorem should be read as a limitation of a particular information source, not as a statement that direction is never learnable. Its assumptions matter in the same way as for PC correctness. Under unfaithfulness, the observed conditional independences may correspond to the wrong skeleton or the wrong collider set, so the CPDAG computed from them need not represent the true causal equivalence class. With latent variables, the observed independence model may be better represented by a mixed graph with bidirected edges, and a directed edge in a DAG CPDAG can then be an artefact of forcing a hidden-confounding problem into a causally sufficient model. Time order, interventions, background knowledge, parametric restrictions, non-Gaussian noise, or nonstationarity can add information beyond conditional independences.
[example: What The Collider Reveals]
Suppose the complete observed conditional-independence information on $(X,Y,Z)$ says
\begin{align*}
X \perp\!\!\!\perp Z
\end{align*}
and
\begin{align*}
X \not\perp\!\!\!\perp Z \mid Y.
\end{align*}
Since there are no other variables, the only possible conditioning sets for testing adjacency between $X$ and $Z$ are $\varnothing$ and $\{Y\}$. The marginal independence gives a separating set for $X$ and $Z$, so under faithfulness $X$ and $Z$ are not adjacent in the causal DAG. The absence of any corresponding independence involving $X$ and $Y$, or involving $Y$ and $Z$, leaves the adjacencies $X-Y$ and $Y-Z$. Thus the learned skeleton is
\begin{align*}
X-Y-Z.
\end{align*}
Now the unshielded triple $X-Y-Z$ has only four acyclic orientations:
\begin{align*}
X \to Y \to Z,\quad X \leftarrow Y \to Z,\quad X \leftarrow Y \leftarrow Z,\quad X \to Y \leftarrow Z.
\end{align*}
In the first three orientations, $Y$ is a non-collider on the only path between $X$ and $Z$, so conditioning on $Y$ blocks that path and the Markov condition gives $X \perp\!\!\!\perp Z \mid Y$. This contradicts the observed statement $X \not\perp\!\!\!\perp Z \mid Y$. The only remaining orientation is therefore
\begin{align*}
X \to Y \leftarrow Z.
\end{align*}
Here $Y$ is a collider, so the path is blocked marginally and opened when conditioning on $Y$, matching exactly the two observed statements. The CPDAG is therefore $X \to Y \leftarrow Z$, and both arrowheads into $Y$ are identifiable because every DAG compatible with the observed conditional independences contains this v-structure.
[/example]
The contrast with chains and forks is the basic lesson of causal discovery from observational independences. Conditional independences can identify missing edges and unshielded colliders, but they do not generally identify a complete causal ordering.
[example: What The Chain And Fork Do Not Reveal]
Suppose instead that the complete observational information on $(X,Y,Z)$ says that $X$ and $Z$ are marginally dependent, while
\begin{align*}
X \perp\!\!\!\perp Z \mid Y.
\end{align*}
Under the Markov and faithfulness assumptions, the conditional independence means that $\{Y\}$ is a separating set for $X$ and $Z$, so $X$ and $Z$ are not adjacent in the causal DAG. The marginal dependence means that the empty set is not a separating set for $X$ and $Z$. If the remaining adjacencies are $X-Y$ and $Y-Z$, the learned skeleton is therefore
\begin{align*}
X-Y-Z.
\end{align*}
There are four acyclic orientations of this skeleton:
\begin{align*}
X \to Y \to Z,\quad X \leftarrow Y \to Z,\quad X \leftarrow Y \leftarrow Z,\quad X \to Y \leftarrow Z.
\end{align*}
In $X \to Y \leftarrow Z$, the middle vertex $Y$ is a collider on the only path from $X$ to $Z$, so the path is blocked with no conditioning and opened after conditioning on $Y$. That would give marginal independence and conditional dependence, the opposite of the observed pattern. Thus the collider orientation is excluded.
In each of the other three orientations, $Y$ is a non-collider on the only path between $X$ and $Z$. For
\begin{align*}
X \to Y \to Z,
\end{align*}
conditioning on $Y$ blocks the path. For
\begin{align*}
X \leftarrow Y \to Z,
\end{align*}
conditioning on $Y$ also blocks the path. For
\begin{align*}
X \leftarrow Y \leftarrow Z,
\end{align*}
conditioning on $Y$ again blocks the path. With no conditioning, the same non-collider path is open in all three cases, so each orientation matches the observed pattern: $X$ and $Z$ are dependent marginally but independent given $Y$.
Therefore the compatible Markov equivalence class contains $X \to Y \to Z$, $X \leftarrow Y \to Z$, and $X \leftarrow Y \leftarrow Z$. Observational independences identify the path skeleton and rule out the collider at $Y$, but they do not identify whether $Y$ is a common cause of the endpoints or a mediator between them.
[/example]
The practical output of this chapter is therefore a disciplined interpretation of discovery algorithms. A learned CPDAG is not a fully oriented causal graph; it is a representation of all DAGs that remain possible after using the conditional independence information licensed by the assumptions.
Causal discovery closes the loop between assumptions and observed independence patterns, but it rarely yields a fully oriented graph. The synthesis chapter then brings the whole framework together, using the course's notation to combine estimands, graphs, identification, and robustness into one coherent analysis.
# 12. Synthesis: Building and Auditing a Causal Analysis
This chapter synthesises the course's main tools: counterfactual estimands, graphical causal models, identification arguments, support conditions, and sensitivity analysis. It assumes the earlier chapters' notation for potential outcomes, interventions, DAGs, conditional independence, and observational laws. The guiding theme is that a causal analysis is not an estimator attached to a dataset. It is a chain of mathematical claims linking a scientific intervention to an interventional law, and then linking that interventional law to observable quantities. Each link should be stated explicitly enough that another reader can locate the step where the analysis would break under a different substantive story.
## From Scientific Question to Estimand
A causal project begins with a question posed in ordinary scientific language: would changing a treatment, exposure, policy, diagnostic procedure, or environment change an outcome? The first mathematical problem is to translate that question into a probability statement whose terms have well-defined interventions and populations.
[definition: Causal Analysis Target]
A causal analysis target consists of a population law, an intervention or intervention regime, an outcome, a causal contrast, and a target population.
[/definition]
The target is deliberately broader than a single formula. It records whether the goal is an average treatment effect, a conditional effect, a distributional contrast, a mediation contrast, or an effect among a restricted subpopulation. A concrete policy-style example shows how to check the target before any graph is drawn.
[example: Vaccine Policy Estimand]
Consider a population of eligible adults, with binary vaccination decision $A \in \{0,1\}$, baseline covariates $L$, and infection outcome $Y \in \{0,1\}$ over a fixed follow-up period. For the scientific question “what would change if everyone in this target population were vaccinated rather than unvaccinated?”, the risk under universal vaccination is $\mathbb E[Y_1]$ and the risk under universal non-vaccination is $\mathbb E[Y_0]$, because $Y_1$ and $Y_0$ are binary infection indicators under the two interventions. The risk-difference estimand is therefore
\begin{align*}
\mathbb E[Y_1]-\mathbb E[Y_0].
\end{align*}
Since $Y_a \in \{0,1\}$, each mean is also an infection probability:
\begin{align*}
\mathbb E[Y_a]=1\cdot \mathbb P(Y_a=1)+0\cdot \mathbb P(Y_a=0)=\mathbb P(Y_a=1).
\end{align*}
Thus the same estimand can be written as
\begin{align*}
\mathbb P(Y_1=1)-\mathbb P(Y_0=1).
\end{align*}
This is not the observed difference $\mathbb E[Y\mid A=1]-\mathbb E[Y\mid A=0]$; it compares two intervention distributions in the same target population, rather than the infection rates among the people who happened to be vaccinated and unvaccinated.
[/example]
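The gap between the estimand and the observed contrast can be made concrete with a toy population in which both potential outcomes are listed for every unit. All counts below are hypothetical and chosen only for illustration; in real data only one of $Y_0$, $Y_1$ is ever observed per unit.

```python
from fractions import Fraction

# Hypothetical population table: (L, A, Y0, Y1, count).
# L = 1 marks a high-risk stratum in which vaccination is less common.
rows = [
    (0, 1, 0, 0, 32), (0, 1, 1, 0, 4), (0, 1, 1, 1, 4),
    (0, 0, 0, 0, 8),  (0, 0, 1, 0, 1), (0, 0, 1, 1, 1),
    (1, 1, 0, 0, 4),  (1, 1, 1, 0, 3), (1, 1, 1, 1, 3),
    (1, 0, 0, 0, 16), (1, 0, 1, 0, 12), (1, 0, 1, 1, 12),
]

def mean(value, keep=lambda r: True):
    """Count-weighted mean of value(r) over the rows selected by keep."""
    num = sum(value(r) * r[4] for r in rows if keep(r))
    den = sum(r[4] for r in rows if keep(r))
    return Fraction(num, den)

# Estimand: E[Y_1] - E[Y_0], averaging over the whole target population.
causal_rd = mean(lambda r: r[3]) - mean(lambda r: r[2])

# Observed contrast: E[Y | A=1] - E[Y | A=0], with Y = Y_A by consistency.
observed_y = lambda r: r[3] if r[1] == 1 else r[2]
assoc_rd = (mean(observed_y, lambda r: r[1] == 1)
            - mean(observed_y, lambda r: r[1] == 0))

print(causal_rd)  # -1/5
print(assoc_rd)   # -19/50
```

In this constructed population the observed contrast overstates the benefit, because the unvaccinated group is concentrated in the high-risk stratum.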
The example shows why notation alone is not enough: the same variables can support several causal questions. Once the target is fixed, the next task is to choose a causal representation that records the assumed ordering and dependence structure among the variables.
[definition: Analysis Graph]
An analysis graph is a directed acyclic graph $G$ whose vertices are the substantive variables used in the causal analysis, with directed edges representing direct causal dependence relative to the chosen level of description.
[/definition]
The graph is not a decorative summary. It determines which conditional independences are being asserted, which variables are pre-treatment, which variables are mediators or colliders, and which adjustment or do-calculus arguments are available; the next remark explains why this graph must be tied to the estimand rather than copied mechanically from a data dictionary.
[remark: Graphs Depend on the Question]
The same dataset may support different analysis graphs for different estimands. A biomarker measured after treatment may be a mediator for the total effect, an outcome for an early mechanistic question, or a collider if selection depends on the biomarker and an unmeasured risk factor.
[/remark]
This dependence on the question means that graph choice and target choice must be joined by a formal claim. Without such a claim, a graph and an estimand remain only causal language: they do not specify whether the desired interventional or counterfactual quantity is determined by the observed law, or which functional of the data is supposed to recover it.
[definition: Identification Claim]
An identification claim is a statement that an interventional or counterfactual quantity is a well-defined functional of the observational law under a specified set of causal assumptions. Formally, for a causal model class $\mathcal M$ inducing a set $\mathcal P_{\mathrm{obs}}$ of observational laws, the claim gives a map
\begin{align*}
\Psi : \mathcal P_{\mathrm{obs}} \to \mathcal T,
\end{align*}
where $\mathcal T$ is the target space, such as $\mathbb R$ for a mean contrast or the set of probability measures on the outcome space for an interventional distribution.
[/definition]
Identification is the bridge from causal language to data language. In a randomized trial, this bridge is clearest because assignment is independent of the potential outcomes by design. The remaining question is which trial conditions are enough to replace an intervention distribution by an observed arm-specific outcome distribution.
[quotetheorem:9694]
[citeproof:9694]
Each hypothesis has a distinct job. Consistency is needed because the observed outcome in arm $a$ must be the same variable as the potential outcome $Y_a$; if different versions of treatment are hidden under the label $A=a$, the conditional law of $Y$ among assigned units need not equal the law under the intended intervention. No interference is needed because otherwise one unit's outcome may depend on other units' assignments, so $Y_a$ is not enough notation to describe the intervention. Randomization is the exchangeability step: if sicker patients are preferentially assigned one arm, then $\mathbb P(Y_a \in B \mid A=a)$ can differ from $\mathbb P(Y_a \in B)$ even when consistency holds. Positivity is the support condition: if no unit is assigned arm $a$, the observed conditional distribution given $A=a$ is not defined.
The theorem also says less than a trial report usually wants. It identifies the arm-specific intervention distribution in the trial population; it does not by itself give transportability to another population, handle non-adherence, or justify comparing ill-defined treatment versions. Observational identification tries to reproduce exactly these four trial ingredients after conditioning, graph surgery, or instrumental-variable restrictions, so the later audit asks which trial ingredient each observational assumption is replacing.
## Graphical Proofs and Algebraic Identification
After the estimand and assumptions are fixed, the next problem is to write a proof that a reader can audit. A causal proof should not jump from a DAG to a final summation formula; it should identify the graphical criterion being used and then translate that criterion into algebra.
[definition: Graphical Identification Proof]
A graphical identification proof consists of an estimand, a causal graph, a list of assumptions, a graphical separation or do-calculus argument, and an algebraic expression involving only the observational law.
[/definition]
This definition names the components of a proof rather than introducing a new identification criterion. It motivates the [Back-Door Adjustment Formula](/theorems/9695) because adjustment is the first setting where a graphical blocking statement must be converted into an observed probability functional.
[quotetheorem:9695]
[citeproof:9695]
The hypotheses mark the exact places where an adjustment analysis can fail. If $L$ does not block a back-door path, for example $A \leftarrow U \to Y$ remains open through an unobserved $U$, the conditional exchangeability step is false and the integral averages confounded outcome regressions. If consistency fails because treatment level $a$ bundles several versions with different outcome laws, then $Y$ among units with $A=a$ need not be the potential outcome for the intervention named in the estimand. If positivity fails, the formula asks for $\mathbb P(Y \in B \mid A=a,L=\ell)$ in a stratum where treatment $a$ is absent, so the conditional distribution is not learned from the observational law.
The template is still a reusable proof pattern: the graphical step supplies conditional exchangeability, and the algebraic step applies consistency and iterated expectation. A small baseline-covariate example is the right first application because every part of the proof has a visible counterpart in the graph and the formula.
[example: Back-Door Analysis with Baseline Covariates]
Let $L$ be age and baseline health, $A \in \{0,1\}$ a treatment, and $Y$ a one-year outcome. In the graph $L \to A$, $L \to Y$, and $A \to Y$, the only back-door path from $A$ to $Y$ is $A \leftarrow L \to Y$, and conditioning on $L$ blocks that path. Thus, under consistency, positivity, and the adjustment result in *Back-Door Adjustment Formula*, we compute the mean under treatment $a$ by integrating the identified conditional law of $Y_a$ over the target distribution of $L$.
By the [law of total expectation](/theorems/1121) applied to $Y_a$ with respect to $L$,
\begin{align*}
\mathbb E[Y_a]=\int \mathbb E[Y_a \mid L=\ell]\,d\mathbb P_L(\ell).
\end{align*}
Because $L$ blocks the back-door path, the graphical criterion gives conditional exchangeability $Y_a \perp\!\!\!\perp A \mid L$, so for every $\ell$ in the support with $\mathbb P(A=a \mid L=\ell)>0$,
\begin{align*}
\mathbb E[Y_a \mid L=\ell]=\mathbb E[Y_a \mid A=a,L=\ell].
\end{align*}
Consistency then replaces $Y_a$ by the observed outcome $Y$ among units with $A=a$:
\begin{align*}
\mathbb E[Y_a \mid A=a,L=\ell]=\mathbb E[Y \mid A=a,L=\ell].
\end{align*}
Substituting these two equalities into the first display gives
\begin{align*}
\mathbb E[Y_a]=\int \mathbb E[Y \mid A=a,L=\ell]\,d\mathbb P_L(\ell).
\end{align*}
Therefore the total-effect contrast $\mathbb E[Y_1]-\mathbb E[Y_0]$ compares two treatment-specific outcome regressions averaged over the same baseline distribution of age and baseline health.
[/example]
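For discrete $L$ the final display is a weighted sum of stratum-specific outcome means. The sketch below computes it from a table of observed counts; the counts, the binary coding of $L$, and the function name `standardized_mean` are hypothetical, and the explicit positivity check is an illustrative design choice.

```python
from fractions import Fraction

# Hypothetical observed counts indexed by (l, a, y).
counts = {
    (0, 1, 1): 4,  (0, 1, 0): 36,
    (0, 0, 1): 2,  (0, 0, 0): 8,
    (1, 1, 1): 3,  (1, 1, 0): 7,
    (1, 0, 1): 24, (1, 0, 0): 16,
}

def standardized_mean(counts, a):
    """Sum over l of E[Y | A=a, L=l] * P(L=l), using empirical frequencies."""
    total = sum(counts.values())
    result = Fraction(0)
    for l in {key[0] for key in counts}:
        n_l = sum(n for (l2, _, _), n in counts.items() if l2 == l)
        n_al = sum(n for (l2, a2, _), n in counts.items() if (l2, a2) == (l, a))
        if n_al == 0:
            # Positivity fails: the stratum-specific regression is undefined.
            raise ValueError(f"no units with A={a} in stratum L={l}")
        n_y = sum(n * y for (l2, a2, y), n in counts.items() if (l2, a2) == (l, a))
        result += Fraction(n_y, n_al) * Fraction(n_l, total)
    return result

print(standardized_mean(counts, 1))  # 1/5
print(standardized_mean(counts, 0))  # 2/5
print(standardized_mean(counts, 1) - standardized_mean(counts, 0))  # -1/5
```

Both treatment-specific means are averaged over the same marginal distribution of $L$, which is the feature that distinguishes standardization from a comparison of raw arm means.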
The example also indicates a limitation: not every variable related to treatment and outcome should be adjusted for. Conditioning on a variable downstream of treatment can change the causal question, because it may hold fixed part of the response that would normally be allowed to change after treatment. To avoid confusing that controlled comparison with the effect of setting treatment alone, the target must specify that all causal pathways from treatment to outcome remain included.
[definition: Total Effect]
For treatment $A$ and outcome $Y$, the total effect of setting $A=a$ rather than $A=a'$ is any contrast between the distributions of $Y_a$ and $Y_{a'}$ in the target population.
[/definition]
The total effect includes all causal pathways from treatment to outcome. This is why post-treatment variables require special care, and the next example shows how mediator adjustment can change the estimand.
[example: Mediator Adjustment Changes the Question]
Let $A$ be treatment, $M$ a biomarker measured after treatment, $Y$ the outcome, and $L$ baseline covariates. In the graph $L \to A$, $L \to Y$, $A \to M$, $M \to Y$, and $A \to Y$, the path $A \leftarrow L \to Y$ is a back-door path, and conditioning on $L$ blocks it. Thus, under consistency, positivity, and the adjustment result in *Back-Door Adjustment Formula*, the total-effect mean under treatment level $a$ is
\begin{align*}
\mathbb E[Y_a]=\sum_\ell \mathbb E[Y\mid A=a,L=\ell]\mathbb P(L=\ell).
\end{align*}
Because $M$ lies downstream of $A$, the total intervention setting only $A=a$ lets the biomarker take whatever value it would take under that treatment. Writing the biomarker-specific outcome mean as $\mu(a,m,\ell)=\mathbb E[Y\mid A=a,M=m,L=\ell]$, the total-effect functional averages with treatment-specific mediator weights:
\begin{align*}
\mathbb E[Y_a]=\sum_\ell\sum_m \mu(a,m,\ell)\mathbb P(M=m\mid A=a,L=\ell)\mathbb P(L=\ell).
\end{align*}
If instead one adjusts for both $L$ and $M$ in the same standardization form, the resulting quantity is
\begin{align*}
\Phi(a)=\sum_\ell\sum_m \mu(a,m,\ell)\mathbb P(M=m\mid L=\ell)\mathbb P(L=\ell).
\end{align*}
The two displays differ in their mediator weights: $\mathbb P(M=m\mid A=a,L=\ell)$ is the distribution of the post-treatment biomarker under treatment level $a$, while $\mathbb P(M=m\mid L=\ell)$ is the same baseline-stratum biomarker distribution used for both treatment levels. Therefore $\Phi(1)-\Phi(0)$ compares treated and untreated outcomes at fixed biomarker distributions, so it changes the question rather than identifying the total effect of setting only $A$.
[/example]
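The difference between the two mediator weightings can be checked numerically. The following is a minimal sketch on a hypothetical discrete law: every probability and the linear form of `mu` are illustrative assumptions, not values from the course, and the sketch assumes that $L$ is the only common cause in the graph, so the first functional equals $\mathbb E[Y_a]$.

```python
# Hypothetical exact law for binary L, A, M; all numbers are illustrative assumptions.
P_L = {0: 0.5, 1: 0.5}


def p_a_given_l(a, l):
    """P(A=a | L=l): treatment depends on the baseline covariate."""
    p1 = 0.3 + 0.4 * l
    return p1 if a == 1 else 1 - p1


def p_m_given_al(m, a, l):
    """P(M=m | A=a, L=l): the biomarker responds to treatment."""
    p1 = 0.2 + 0.6 * a
    return p1 if m == 1 else 1 - p1


def mu(a, m, l):
    """Outcome regression E[Y | A=a, M=m, L=l]."""
    return 0.1 + 0.2 * a + 0.4 * m + 0.1 * l


def p_m_given_l(m, l):
    """P(M=m | L=l), obtained by averaging over observed treatment."""
    return sum(p_m_given_al(m, a, l) * p_a_given_l(a, l) for a in (0, 1))


def total_effect_mean(a):
    """Standardization with treatment-specific mediator weights."""
    return sum(mu(a, m, l) * p_m_given_al(m, a, l) * P_L[l]
               for l in P_L for m in (0, 1))


def mediator_adjusted(a):
    """Phi(a): the same form, but with mediator weights P(M=m | L=l)."""
    return sum(mu(a, m, l) * p_m_given_l(m, l) * P_L[l]
               for l in P_L for m in (0, 1))


total_contrast = total_effect_mean(1) - total_effect_mean(0)
phi_contrast = mediator_adjusted(1) - mediator_adjusted(0)
print(total_contrast, phi_contrast)
```

In this sketch the total-effect contrast is about 0.44, while the mediator-adjusted contrast is about 0.2: the second quantity keeps only the part of the effect that does not pass through the biomarker.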
This mediator example rules out a naive adjustment strategy but does not rule out identification. The obstacle arises when treatment and outcome share unmeasured common causes, so that no observed back-door adjustment set is available, while adjusting for the mediator in the usual way changes the estimand.
The remaining question is whether the mediator can be used without conditioning away part of the effect. The front-door result answers this by imposing graphical conditions under which the treatment-to-mediator law and the mediator-to-outcome law can be learned separately from the observed distribution and then assembled into the total effect.
[quotetheorem:9696]
[citeproof:9696]
Each front-door condition removes a different obstruction. If $M$ does not intercept all directed paths from $A$ to $Y$, then part of the treatment effect bypasses the mediator and the decomposition through $M_a$ omits a causal pathway. If $A$ and $M$ have an unblocked common cause, the observed law of $M$ given $A=a$ need not be the intervention law of $M_a$. If conditioning on $A$ does not block back-door paths from $M$ to $Y$, the mediator-outcome component remains confounded. The formula identifies the total effect of setting $A=a$ under these conditions; it does not identify natural direct or indirect effects, nor does it justify using an error-prone proxy for the mediator without additional assumptions.
The front-door formula is useful because it makes non-adjustment identification visible. It also shows why proof writing matters: the formula is not guessed from regression; it is assembled by a sequence of valid transformations.
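To make the assembly concrete, the sketch below evaluates the standard front-door functional $\sum_m \mathbb P(M=m\mid A=a)\sum_{a'}\mathbb E[Y\mid M=m,A=a']\mathbb P(A=a')$ on a hypothetical model with a latent binary $U$ that affects $A$ and $Y$ but not $M$. The model and all of its numbers are assumptions chosen for illustration. The front-door computation uses only the observed law of $(A,M,Y)$, while `true_mean` uses the latent structure as a check.

```python
from itertools import product


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p


def joint_full(u, a, m, y):
    """Hypothetical model: U -> A, U -> Y, A -> M, M -> Y, with U latent."""
    return (bern(0.5, u) * bern(0.2 + 0.6 * u, a)
            * bern(0.1 + 0.7 * a, m) * bern(0.1 + 0.5 * m + 0.3 * u, y))


# Observed law of (A, M, Y): marginalize the latent U.
obs = {}
for u, a, m, y in product((0, 1), repeat=4):
    obs[(a, m, y)] = obs.get((a, m, y), 0.0) + joint_full(u, a, m, y)


def prob(pred):
    """Probability of an event described by a predicate on (A, M, Y)."""
    return sum(p for key, p in obs.items() if pred(*key))


def front_door_mean(a):
    """Front-door functional, computed from the observed law only."""
    p_a = prob(lambda A, M, Y: A == a)
    total = 0.0
    for m in (0, 1):
        p_m_given_a = prob(lambda A, M, Y: A == a and M == m) / p_a
        inner = 0.0
        for a2 in (0, 1):
            p_a2 = prob(lambda A, M, Y: A == a2)
            p_a2m = prob(lambda A, M, Y: A == a2 and M == m)
            e_y = prob(lambda A, M, Y: A == a2 and M == m and Y == 1) / p_a2m
            inner += e_y * p_a2
        total += p_m_given_a * inner
    return total


def true_mean(a):
    """E[Y_a] from the latent structure: set A=a, let M respond, average over U."""
    return sum(bern(0.5, u) * bern(0.1 + 0.7 * a, m) * (0.1 + 0.5 * m + 0.3 * u)
               for u in (0, 1) for m in (0, 1))


def naive_mean(a):
    """Unadjusted E[Y | A=a], which is confounded by U."""
    return prob(lambda A, M, Y: A == a and Y == 1) / prob(lambda A, M, Y: A == a)


print(front_door_mean(1) - front_door_mean(0))  # matches the true contrast
print(true_mean(1) - true_mean(0))
print(naive_mean(1) - naive_mean(0))            # differs because of U
```

In this hypothetical model the front-door contrast and the true contrast are both about 0.35, while the unadjusted contrast is about 0.53. If an arrow $U \to M$ were added to `joint_full`, the agreement would not be expected to survive.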
## Sensitivity to Assumption Failures
Once identification has been written down, the next problem is not computation but fragility. A formula may be correct under the graph and assumptions, yet scientifically unreliable if the assumptions are unstable under plausible changes to the data-generating story.
[definition: Sensitivity Analysis]
A sensitivity analysis studies how an identified causal estimand or estimator changes when one or more identifying assumptions are weakened, perturbed, or replaced by parameters describing departures from the ideal model.
[/definition]
Sensitivity analysis belongs in the mathematical analysis, not only in the discussion section. It motivates the definition of Unmeasured Confounding because hidden common causation is the standard departure that attacks the exchangeability step in adjustment proofs.
[definition: Unmeasured Confounding]
Unmeasured confounding for the effect of $A$ on $Y$ is the presence of a common cause $U$ of $A$ and $Y$ such that $U$ is not included among the observed adjustment variables.
[/definition]
Unmeasured confounding breaks the conditional exchangeability step in an adjustment proof. The next example places that failure in the treatment-biomarker-outcome setting that will return in the synthetic analysis.
[example: Latent Socioeconomic Confounding]
Let $A$ be access to a preventive treatment, $M$ a biomarker, $Y$ a health outcome, $L$ recorded baseline clinical covariates, and $U$ latent socioeconomic status. Suppose $U \to A$, $U \to M$, and $U \to Y$, while $A \to M$ and $M \to Y$. Conditioning only on $L$ does not block the back-door path $A \leftarrow U \to Y$, because $U$ is a non-collider on that path and is not included in the conditioning set.
For discrete $U$, the observed outcome regression within a baseline stratum is
\begin{align*}
\mathbb E[Y \mid A=a,L=\ell]=\sum_u \mathbb E[Y \mid A=a,L=\ell,U=u]\mathbb P(U=u \mid A=a,L=\ell).
\end{align*}
The causal mean in the same baseline stratum averages over the latent socioeconomic distribution in that stratum:
\begin{align*}
\mathbb E[Y_a \mid L=\ell]=\sum_u \mathbb E[Y_a \mid L=\ell,U=u]\mathbb P(U=u \mid L=\ell).
\end{align*}
If consistency holds, then among units with $A=a$,
\begin{align*}
\mathbb E[Y \mid A=a,L=\ell,U=u]=\mathbb E[Y_a \mid A=a,L=\ell,U=u].
\end{align*}
Within strata of $(L,U)$ the back-door path through $U$ is blocked. If, as this graph assumes, $U$ is the only unmeasured common cause of $A$ and $Y$, then conditional exchangeability holds given $(L,U)$:
\begin{align*}
\mathbb E[Y_a \mid A=a,L=\ell,U=u]=\mathbb E[Y_a \mid L=\ell,U=u].
\end{align*}
The failure is in the weights. Because $U$ affects access to treatment, in general
\begin{align*}
\mathbb P(U=u \mid A=a,L=\ell)\neq \mathbb P(U=u \mid L=\ell).
\end{align*}
Therefore $\mathbb E[Y \mid A=a,L=\ell]$ averages the latent-stratum means $\mathbb E[Y_a \mid L=\ell,U=u]$ over the socioeconomic distribution of the units who received $a$, whereas $\mathbb E[Y_a \mid L=\ell]$ averages them over the socioeconomic distribution of the whole stratum. Conditional exchangeability given $L$ alone fails, and adjusting only for $L$ mixes the effect of treatment with differences in latent socioeconomic risk, rather than isolating the total effect of setting $A=a$.
[/example]
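A small numeric sketch can make the weight discrepancy visible. It works inside a single baseline stratum (so $L$ is suppressed), uses a binary latent $U$, and assumes by construction that exchangeability holds within levels of $U$. All numbers are hypothetical.

```python
# Hypothetical single-stratum model with binary latent U; numbers are illustrative.
P_U1 = 0.5                      # P(U=1) in the stratum


def p_a1_given_u(u):
    """P(A=1 | U=u): latent status raises access to treatment."""
    return 0.2 + 0.6 * u


def causal_mean(a, u):
    """E[Y_a | U=u]; assumed equal to E[Y | A=a, U=u] within levels of U."""
    return 0.2 + 0.1 * a + 0.4 * u


def p_u(u):
    return P_U1 if u == 1 else 1 - P_U1


def p_a(a):
    """Marginal P(A=a) in the stratum."""
    return sum((p_a1_given_u(u) if a == 1 else 1 - p_a1_given_u(u)) * p_u(u)
               for u in (0, 1))


def p_u_given_a(u, a):
    """P(U=u | A=a) by Bayes' rule: the weights the observed regression uses."""
    like = p_a1_given_u(u) if a == 1 else 1 - p_a1_given_u(u)
    return like * p_u(u) / p_a(a)


def observed_regression(a):
    """E[Y | A=a]: latent-stratum means weighted by P(U=u | A=a)."""
    return sum(causal_mean(a, u) * p_u_given_a(u, a) for u in (0, 1))


def interventional_mean(a):
    """E[Y_a]: the same means weighted by P(U=u)."""
    return sum(causal_mean(a, u) * p_u(u) for u in (0, 1))


print(p_u_given_a(1, 1), p_u_given_a(1, 0), p_u(1))       # weights differ
print(observed_regression(1) - observed_regression(0))    # confounded contrast
print(interventional_mean(1) - interventional_mean(0))    # causal contrast
```

Here the causal contrast is about 0.1, but the observed contrast is about 0.34, because treated units are drawn mostly from $U=1$ and untreated units mostly from $U=0$.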
The latent-confounding example concerns bias from an omitted common cause. A second failure mode concerns support. Even if the right covariates are observed and adjusted for, the adjustment formula can require an outcome regression for treatment level $a$ inside a stratum where no one receives $a$. Then the target population asks for a comparison that the observed data contain no within-stratum evidence for.
[definition: Positivity Violation]
A positivity violation occurs when an intervention level has zero or near-zero probability in a covariate stratum that has positive probability in the target population.
[/definition]
Exact violations are identification failures. Near-violations are statistical and substantive warnings. The next example shows an exact violation, in which the formula asks for a conditional mean in a stratum where the relevant treatment is absent.
[example: Treatment Contraindication]
Suppose a drug is never prescribed to patients with severe renal impairment, but the target estimand is the mean outcome under prescribing the drug to the whole eligible population. Let $L$ record renal-impairment status, and write one stratum as $L=\text{severe}$. If
\begin{align*}
\mathbb P(A=1 \mid L=\text{severe})=0
\end{align*}
and
\begin{align*}
\mathbb P(L=\text{severe})>0,
\end{align*}
then the discrete adjustment formula for the treatment intervention $A=1$ would require
\begin{align*}
\mathbb E[Y_1]=\sum_\ell \mathbb E[Y \mid A=1,L=\ell]\mathbb P(L=\ell).
\end{align*}
Separating the severe stratum from the remaining strata gives
\begin{align*}
\mathbb E[Y_1]=\mathbb E[Y \mid A=1,L=\text{severe}]\mathbb P(L=\text{severe})+\sum_{\ell\neq \text{severe}}\mathbb E[Y \mid A=1,L=\ell]\mathbb P(L=\ell).
\end{align*}
The first conditional mean is not defined by the observational law, because the conditioning event $\{A=1,L=\text{severe}\}$ has probability zero:
\begin{align*}
\mathbb P(A=1,L=\text{severe})=\mathbb P(A=1 \mid L=\text{severe})\mathbb P(L=\text{severe})=0\cdot \mathbb P(L=\text{severe})=0.
\end{align*}
Thus the formula assigns positive target-population weight to a stratum in which no treated outcomes are observed, so the effect of prescribing the drug to the whole eligible population is not identified by ordinary adjustment without an additional modeling or extrapolation assumption.
[/example]
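In an implementation, the same failure can be surfaced before any number is reported. The sketch below is one hypothetical way to guard the discrete adjustment formula: the function name `standardized_mean`, the stratum labels, and all probabilities are assumptions for illustration.

```python
# Hypothetical inputs; stratum labels and numbers are illustrative assumptions.
P_L = {"none": 0.6, "mild": 0.3, "severe": 0.1}           # P(L = l)
P_A1_GIVEN_L = {"none": 0.5, "mild": 0.3, "severe": 0.0}  # P(A = 1 | L = l)

# E[Y | A=a, L=l], available only where the conditioning event has positive probability.
OUTCOME_MEAN = {
    (0, "none"): 0.2, (0, "mild"): 0.3, (0, "severe"): 0.6,
    (1, "none"): 0.1, (1, "mild"): 0.2,
}


def standardized_mean(a, p_l, p_a1_given_l, outcome_mean):
    """Discrete adjustment functional; raises if positivity fails exactly."""
    total = 0.0
    for stratum, weight in p_l.items():
        if weight == 0:
            continue  # strata outside the target population carry no weight
        p_a = p_a1_given_l[stratum] if a == 1 else 1 - p_a1_given_l[stratum]
        if p_a == 0:
            raise ValueError(
                f"positivity violated: P(A={a} | L={stratum}) = 0 "
                f"but P(L={stratum}) = {weight}"
            )
        total += outcome_mean[(a, stratum)] * weight
    return total


print(standardized_mean(0, P_L, P_A1_GIVEN_L, OUTCOME_MEAN))
try:
    standardized_mean(1, P_L, P_A1_GIVEN_L, OUTCOME_MEAN)
except ValueError as err:
    print(err)
```

The untreated mean is computable because every stratum contains untreated patients. The treated mean raises an error instead of silently extrapolating into the severe stratum. A near-violation would need a threshold rather than an exact zero check, and that threshold is a separate modeling choice.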
The contraindication example shows a support failure rather than a wrong variable definition. It motivates the definition of Measurement Error in a Causal Variable, where the proof may use a recorded proxy in place of the causal variable named by the graph.
[definition: Measurement Error in a Causal Variable]
Measurement error in a causal variable occurs when the recorded variable $W$ is used in the analysis in place of a target variable $V$, with $W$ not equal to $V$ as a random variable under the population law.
[/definition]
The effect of measurement error depends on the role of the variable. Misclassified treatment changes the intervention being compared; noisy confounders may leave residual confounding; outcome error changes the estimand unless the outcome model connects the recorded and target outcomes, and mediator error threatens mediator-based identification directly.
[example: Misclassified Biomarker Mediator]
Let $M$ be the true biomarker mediator and $W$ its noisy laboratory measurement. To see why replacing $M$ by $W$ changes the claim, consider the binary structural model
\begin{align*}
A \sim \operatorname{Ber}(1/2),\quad M=A,\quad Y=M,
\end{align*}
and let $W$ be a misclassified measurement of $M$ with
\begin{align*}
\mathbb P(W=M \mid M)=q
\end{align*}
for some $0<q<1$. The true mediator lies on the directed path $A \to M \to Y$, but $W$ is only a proxy for $M$; the directed path from $A$ to $Y$ does not pass through $W$.
The actual intervention means are
\begin{align*}
\mathbb E[Y_1]=1
\end{align*}
because setting $A=1$ gives $M=1$ and hence $Y=1$, while
\begin{align*}
\mathbb E[Y_0]=0
\end{align*}
because setting $A=0$ gives $M=0$ and hence $Y=0$. Thus the true total-effect contrast is
\begin{align*}
\mathbb E[Y_1]-\mathbb E[Y_0]=1-0=1.
\end{align*}
If one incorrectly inserts $W$ into the front-door formula in place of $M$, the corresponding functional is
\begin{align*}
\Phi_W(a)=\sum_w \sum_{a'} \mathbb E[Y \mid W=w,A=a']\mathbb P(A=a')\mathbb P(W=w \mid A=a).
\end{align*}
In this model, $Y=A$ observationally, so for each $w$ and $a'$,
\begin{align*}
\mathbb E[Y \mid W=w,A=a']=a'.
\end{align*}
Substituting this into the inner sum gives
\begin{align*}
\sum_{a'} \mathbb E[Y \mid W=w,A=a']\mathbb P(A=a')=0\cdot \mathbb P(A=0)+1\cdot \mathbb P(A=1)=1/2.
\end{align*}
Therefore
\begin{align*}
\Phi_W(a)=\sum_w \frac{1}{2}\mathbb P(W=w \mid A=a)=\frac{1}{2}\sum_w \mathbb P(W=w \mid A=a)=\frac{1}{2}.
\end{align*}
Hence
\begin{align*}
\Phi_W(1)-\Phi_W(0)=\frac{1}{2}-\frac{1}{2}=0,
\end{align*}
which is not the true total-effect contrast $1$. The failure is not an algebraic accident: using $W$ asks the front-door condition to hold for the recorded proxy, while the causal pathway from treatment to outcome runs through the unobserved true mediator $M$.
[/example]
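The calculation in the example can be replayed mechanically. The sketch below builds the observed law of $(A,W,Y)$ for the binary model above and evaluates $\Phi_W(a)$ with exact rational arithmetic. The function names and the choice $q=9/10$ are illustrative assumptions.

```python
from fractions import Fraction


def observed_law(q):
    """Law of (A, W, Y) when A ~ Ber(1/2), M = A, Y = M, and P(W = M | M) = q."""
    law = {}
    for a in (0, 1):
        m = a
        y = m
        for w in (0, 1):
            p_w = q if w == m else 1 - q
            law[(a, w, y)] = Fraction(1, 2) * p_w
    return law


def phi_w(a, q):
    """Front-door functional with the proxy W inserted in place of M."""
    law = observed_law(q)

    def prob(pred):
        return sum(p for key, p in law.items() if pred(*key))

    p_a = prob(lambda A, W, Y: A == a)
    total = Fraction(0)
    for w in (0, 1):
        p_w_given_a = prob(lambda A, W, Y: A == a and W == w) / p_a
        inner = Fraction(0)
        for a2 in (0, 1):
            p_a2 = prob(lambda A, W, Y: A == a2)
            p_a2w = prob(lambda A, W, Y: A == a2 and W == w)
            e_y = prob(lambda A, W, Y: A == a2 and W == w and Y == 1) / p_a2w
            inner += e_y * p_a2
        total += p_w_given_a * inner
    return total


q = Fraction(9, 10)  # any 0 < q < 1 keeps every conditioning event possible
print(phi_w(1, q), phi_w(0, q), phi_w(1, q) - phi_w(0, q))
```

For any $0<q<1$ the proxy-based contrast is zero, while the true total-effect contrast is one. Even a proxy that agrees with the mediator 90 percent of the time does not rescue the formula in this model.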
The three failure modes should be recorded next to the proof steps they threaten. This motivates a structured audit table rather than a loose paragraph of limitations.
[definition: Causal Audit Table]
A causal audit table is a structured record whose rows list the estimand, graph, assumptions, identification steps, observed variables, potential violations, and planned sensitivity analyses for a causal analysis.
[/definition]
The audit table is especially valuable when multiple identification strategies are possible. It lets the analyst compare, for instance, a back-door analysis with a front-door or instrumental-variable analysis by asking which assumptions are being traded.
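One way to keep such a table honest is to store it as a typed record and check it for empty rows. The following is a hypothetical sketch, not a prescribed format: the class name, field names, and example entries are assumptions that mirror the rows named in the definition.

```python
from dataclasses import dataclass, field
from typing import List

# Row names follow the definition of a causal audit table.
ROW_NAMES = (
    "estimand", "graph", "assumptions", "identification_steps",
    "observed_variables", "potential_violations", "sensitivity_analyses",
)


@dataclass
class CausalAuditTable:
    strategy: str
    estimand: str = ""
    graph: List[str] = field(default_factory=list)
    assumptions: List[str] = field(default_factory=list)
    identification_steps: List[str] = field(default_factory=list)
    observed_variables: List[str] = field(default_factory=list)
    potential_violations: List[str] = field(default_factory=list)
    sensitivity_analyses: List[str] = field(default_factory=list)

    def missing_rows(self):
        """Names of rows that are still empty."""
        return [name for name in ROW_NAMES if not getattr(self, name)]


back_door = CausalAuditTable(
    strategy="back-door adjustment for L",
    estimand="E[Y_1] - E[Y_0] in the study population",
    graph=["L -> A", "L -> Y", "A -> M", "M -> Y", "A -> Y"],
    assumptions=["consistency", "Y_a independent of A given L", "positivity"],
    identification_steps=["total expectation over L", "exchangeability", "consistency"],
    observed_variables=["L", "A", "Y"],
    potential_violations=["latent U -> A and U -> Y", "sparse treated strata"],
    sensitivity_analyses=["vary strength of U -> A and U -> Y"],
)

front_door_draft = CausalAuditTable(
    strategy="front-door through M",
    estimand="E[Y_1] - E[Y_0] in the study population",
    graph=["U -> A", "U -> Y", "A -> M", "M -> Y"],
    assumptions=["M intercepts all directed paths from A to Y"],
)

print(back_door.missing_rows())
print(front_door_draft.missing_rows())
```

Placing two records side by side makes the trade explicit: the back-door record depends on the absence of latent confounding of $A$ and $Y$, while the front-door draft still lacks its identification steps, violation list, and sensitivity plan.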
## A Worked Synthetic Analysis
We now assemble the workflow in a single example. The problem is to estimate the effect of a treatment on an outcome when a post-treatment biomarker is recorded, baseline covariates are available, and latent socioeconomic status may confound several relationships.
[example: Treatment Biomarker Outcome Analysis]
Let $L$ denote observed baseline clinical covariates, $U$ latent socioeconomic status, $A \in \{0,1\}$ treatment, $M$ a biomarker measured after treatment, and $Y$ a binary outcome. The target is the total-effect contrast in the study population,
\begin{align*}
\mathbb E[Y_1]-\mathbb E[Y_0].
\end{align*}
First suppose the analysis graph contains $L \to A$, $L \to Y$, $A \to M$, $M \to Y$, and $A \to Y$, with no latent common cause of $A$ and $Y$ after conditioning on $L$. For discrete $L$, the law of total expectation gives
\begin{align*}
\mathbb E[Y_a]=\sum_\ell \mathbb E[Y_a \mid L=\ell]\mathbb P(L=\ell).
\end{align*}
The back-door path $A \leftarrow L \to Y$ is blocked by conditioning on $L$, so the adjustment argument gives conditional exchangeability $Y_a \perp A \mid L$. Hence, for each $\ell$ with $\mathbb P(A=a \mid L=\ell)>0$,
\begin{align*}
\mathbb E[Y_a \mid L=\ell]=\mathbb E[Y_a \mid A=a,L=\ell].
\end{align*}
By consistency, among units with $A=a$ the observed outcome equals the potential outcome $Y_a$, so
\begin{align*}
\mathbb E[Y_a \mid A=a,L=\ell]=\mathbb E[Y \mid A=a,L=\ell].
\end{align*}
Substituting into the total-expectation formula gives
\begin{align*}
\mathbb E[Y_a]=\sum_\ell \mathbb E[Y \mid A=a,L=\ell]\mathbb P(L=\ell).
\end{align*}
Applying this once with $a=1$ and once with $a=0$ identifies the total-effect contrast as
\begin{align*}
\mathbb E[Y_1]-\mathbb E[Y_0]=\sum_\ell \mathbb E[Y \mid A=1,L=\ell]\mathbb P(L=\ell)-\sum_\ell \mathbb E[Y \mid A=0,L=\ell]\mathbb P(L=\ell).
\end{align*}
Because both sums use the same marginal distribution of $L$, this is equivalently
\begin{align*}
\mathbb E[Y_1]-\mathbb E[Y_0]=\sum_\ell \{\mathbb E[Y \mid A=1,L=\ell]-\mathbb E[Y \mid A=0,L=\ell]\}\mathbb P(L=\ell).
\end{align*}
If substantive knowledge adds $U \to A$ and $U \to Y$, then conditioning only on $L$ leaves the path $A \leftarrow U \to Y$ open. In that graph, the equality
\begin{align*}
\mathbb E[Y_a \mid L=\ell]=\mathbb E[Y_a \mid A=a,L=\ell]
\end{align*}
is no longer justified, so the same standardized expression need not equal the total effect. If $U$ also affects $M$, then a front-door analysis using $M$ must separately verify that $M$ intercepts all directed paths from $A$ to $Y$, that there is no unblocked back-door path from $A$ to $M$, and that conditioning on $A$ blocks all back-door paths from $M$ to $Y$. When an unblocked mediator-outcome path such as $M \leftarrow U \to Y$ remains after conditioning on $A$, the mediator-outcome component is still confounded, so the front-door formula is not justified. The audit therefore records baseline adjustment as valid only under the no-latent-confounding claim for $A$ and $Y$, and records sensitivity analysis for the strength of the $U \to A$ and $U \to Y$ relationships.
[/example]
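As a plug-in illustration of the final standardized contrast, the sketch below applies it to a hypothetical table of counts with binary $L$. The counts are invented for illustration, and the result has a causal reading only under the first graph, in which there is no latent common cause of $A$ and $Y$ given $L$.

```python
from fractions import Fraction

# Hypothetical counts: (l, a) -> (number of units, number with Y = 1).
COUNTS = {
    (0, 0): (60, 12),
    (0, 1): (40, 4),
    (1, 0): (30, 15),
    (1, 1): (70, 21),
}
STRATA = sorted({l for (l, _a) in COUNTS})
TOTAL_N = sum(n for (n, _events) in COUNTS.values())


def stratum_weight(l):
    """Empirical P(L = l), shared by both treatment levels."""
    n_l = sum(n for (l2, _a), (n, _events) in COUNTS.items() if l2 == l)
    return Fraction(n_l, TOTAL_N)


def standardized_risk(a):
    """Plug-in version of sum_l E[Y | A=a, L=l] P(L=l)."""
    total = Fraction(0)
    for l in STRATA:
        n_al, events = COUNTS[(l, a)]
        if n_al == 0:
            raise ValueError(f"no units with A={a} in stratum L={l}")
        total += Fraction(events, n_al) * stratum_weight(l)
    return total


def crude_risk(a):
    """Unadjusted E[Y | A=a], for comparison."""
    n = sum(n_al for (_l, a2), (n_al, _e) in COUNTS.items() if a2 == a)
    events = sum(e for (_l, a2), (_n, e) in COUNTS.items() if a2 == a)
    return Fraction(events, n)


print(standardized_risk(1) - standardized_risk(0))  # -3/20
print(crude_risk(1) - crude_risk(0))                # -4/55
```

The standardized and crude contrasts differ because treated units are concentrated in the higher-risk stratum. If the graph with $U \to A$ and $U \to Y$ is the correct one, neither number is guaranteed to equal the total effect.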
This example shows how the answer changes as the graph changes. The observed variables and data table may be identical in the two versions, which motivates a formal nonidentifiability result explaining why observational equality need not imply causal equality.
[quotetheorem:9697]
[citeproof:9697]
Nonidentifiability is not a technical inconvenience. It says that observational equality of the recorded variables does not rule out different causal mechanisms. The theorem does not say that causal inference is impossible; it says that identification requires assumptions that exclude at least one of the observationally equivalent mechanisms. Randomization, valid adjustment, front-door structure, instrumental-variable restrictions, or sensitivity parameters are different ways of adding information beyond the marginal observational law of $(A,Y)$.
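A two-model construction illustrates the point. The sketch below defines two hypothetical structural models that induce the same observational law of $(A,Y)$ but different interventional contrasts. It is a standard style of counterexample and is offered as an illustration, not as a reproduction of the theorem's own proof.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def model_causal(noise, do_a=None):
    """A is a fair coin and Y = A, so treatment fully determines the outcome."""
    a = noise if do_a is None else do_a
    y = a
    return a, y


def model_confounded(noise, do_a=None):
    """Latent U is a fair coin with A = U and Y = U; A has no effect on Y."""
    u = noise
    a = u if do_a is None else do_a
    y = u
    return a, y


def observational_law(model):
    """Law of (A, Y) with no intervention, enumerating the exogenous coin."""
    law = {}
    for noise in (0, 1):
        key = model(noise)
        law[key] = law.get(key, Fraction(0)) + HALF
    return law


def interventional_mean(model, a):
    """E[Y_a]: replace the equation for A by the constant a."""
    return sum(HALF * model(noise, do_a=a)[1] for noise in (0, 1))


def effect(model):
    return interventional_mean(model, 1) - interventional_mean(model, 0)


print(observational_law(model_causal) == observational_law(model_confounded))  # True
print(effect(model_causal), effect(model_confounded))                         # 1 0
```

Both models place probability one half on $(A,Y)=(0,0)$ and one half on $(1,1)$, so no function of the observed law alone can return the contrast 1 for the first model and 0 for the second.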
## Writing the Final Causal Proof
The final problem is presentation. A reader should be able to separate the scientific claim, the causal assumptions, the graphical criterion, the identifying algebra, and the empirical estimation plan.
[definition: Causal Proof Skeleton]
A causal proof skeleton is an ordered outline consisting of the target estimand, observed law, causal model, assumptions, graphical criterion, identification derivation, positivity statement, and final observational functional.
[/definition]
The skeleton prevents a common error: placing a regression model where an identification argument should be. Regression may estimate part of the final functional, while the next checklist records the proof components that must already be in place before estimation begins.
[explanation: Checklist for a Causal Analysis]
Begin by writing the estimand as an interventional or counterfactual quantity, such as $\mathbb E[Y_1-Y_0]$, $\mathbb P(Y_a \in B)$, or a local average treatment effect. State the target population and the intervention versions being compared.
Next draw or describe the graph. Declare which variables are baseline covariates, treatments, mediators, instruments, outcomes, selection variables, and latent variables. For every omitted arrow or unmeasured node, record the substantive reason for the omission.
Then state the assumptions in mathematical form. Exchangeability should be written as a conditional independence statement; positivity as a support condition; consistency as an equality linking observed and counterfactual outcomes; instrument assumptions as relevance, independence, exclusion, and any monotonicity or compliance condition required by the estimand.
After that, give the identification proof. Name the graphical criterion or do-calculus steps, derive the relevant conditional independence, apply consistency, and integrate or sum over the required observed variables. End with a formula involving only the observational law.
Finally, audit the formula. Identify which variables are measured with error, where positivity may be weak, what unmeasured common causes are plausible, and which sensitivity parameters would express those departures.
[/explanation]
The checklist is procedural, but identification itself is a property of the observational law. This motivates the theorem on invariance under equivalent observational factorizations, which records that an exact identified functional cannot depend on which statistical factorization was used to compute the same law.
[quotetheorem:9698]
[citeproof:9698]
This theorem is a useful guardrail for practice, but its hypotheses matter. The functional must be defined on the same observational law and evaluated exactly; if a conditional component is undefined because positivity fails, then two factorizations may not both support the claimed calculation. The theorem also does not protect finite-sample estimates from numerical instability, regularization choices, or model misspecification. If two exact algebraic routes claim to identify the same estimand under the same assumptions but give different functionals of the same law, then at least one route has changed the assumptions, the estimand, the support conditions, or the conditioning events.
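One reading of this guardrail can be checked on a small example. The sketch below evaluates the adjustment functional for $\mathbb E[Y_a]$ along two routes, outcome regression followed by standardization and inverse-probability weighting, on the same hypothetical law of $(L,A,Y)$. The law is an illustrative assumption. Exact rational arithmetic is used so that agreement is not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical exact law of (L, A, Y); the weights sum to 200.
LAW = {key: Fraction(weight, 200) for key, weight in {
    (0, 0, 1): 12, (0, 0, 0): 48, (0, 1, 1): 4, (0, 1, 0): 36,
    (1, 0, 1): 15, (1, 0, 0): 15, (1, 1, 1): 21, (1, 1, 0): 49,
}.items()}


def prob(pred):
    """Probability of an event described by a predicate on (L, A, Y)."""
    return sum(p for key, p in LAW.items() if pred(*key))


def via_outcome_regression(a):
    """sum_l E[Y | A=a, L=l] P(L=l)."""
    total = Fraction(0)
    for l in (0, 1):
        p_l = prob(lambda L, A, Y: L == l)
        p_al = prob(lambda L, A, Y: L == l and A == a)
        e_y = prob(lambda L, A, Y: L == l and A == a and Y == 1) / p_al
        total += e_y * p_l
    return total


def via_inverse_weighting(a):
    """E[ Y 1{A=a} / P(A=a | L) ], using the propensity factorization."""
    total = Fraction(0)
    for (l, a_obs, y), p in LAW.items():
        if a_obs != a:
            continue
        propensity = (prob(lambda L, A, Y: L == l and A == a)
                      / prob(lambda L, A, Y: L == l))
        total += p * y / propensity
    return total


for a in (0, 1):
    print(a, via_outcome_regression(a), via_inverse_weighting(a))
```

Both routes return 7/20 for $a=0$ and 1/5 for $a=1$. With estimated rather than exact components, the two routes would generally give different finite-sample numbers, which is the estimation issue the paragraph above separates from identification.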
The course closes with a simple principle: causal inference is a proof discipline before it is an estimation discipline. The proof says what would be learned from the population law under stated assumptions; estimation says how well the available data approximate that population-level answer.
## Beyond and Connections
The material here connects most directly to later study of graphical models, conditional independence, missing data, and semiparametric estimation. A natural next step is to study how d-separation and $m$-separation support identification algorithms, then compare those symbolic identification results with estimators such as inverse-probability weighting, outcome regression, doubly robust estimation, and targeted learning.
For applications, the same proof-first discipline reappears in settings with instruments, mediators, transport between populations, interference, and longitudinal treatment regimes. Each setting changes the graph or the observed-data structure, but the workflow remains the same: state the causal query, encode the assumptions, prove identification from the population law, and only then choose an estimator suited to the data at hand.
## References
### External References
- Judea Pearl, *Causality: Models, Reasoning, and Inference*, 2nd ed., Cambridge University Press, 2009.
- Miguel A. Hernán and James M. Robins, *Causal Inference: What If*, Chapman & Hall/CRC, 2020.
- Guido W. Imbens and Donald B. Rubin, *Causal Inference for Statistics, Social, and Biomedical Sciences*, Cambridge University Press, 2015.
- Stephen L. Morgan and Christopher Winship, *Counterfactuals and Causal Inference*, 2nd ed., Cambridge University Press, 2015.
Causal Inference I: Foundations
Also known as: Causal inference foundations, causal inference basics, potential outcomes, structural causal models, graphical causal models, do-calculus, treatment effect identification
Created by admin on 6/22/2026 | Last updated on 6/22/2026