Probability is often first introduced as a way of assigning weights to events before an experiment is performed. That is only half of the story. In practice, information arrives after the model is built: a test result is positive, a card is known to be red, a [random variable](/page/Random%20Variable) has fallen in a given range, or a filtration has revealed the past of a stochastic process. Conditional probability is the operation that rebuilds probability inside the world that remains possible after such information is known.
The danger is that new information does not merely add a number to the calculation. It changes the sample space being used. A rare disease can have an accurate test and still be unlikely after a positive result, because most positive tests may come from healthy people when the disease is rare. This is the first lesson of conditioning: the event being conditioned on is not a decoration after the vertical bar. It is the reference universe in which the probability is being measured.
[example: A Positive Test for a Rare Disease]
Suppose $1$ percent of a population has a disease. Let $D$ be the event that a randomly chosen person has the disease, and let $T$ be the event that the test is positive. The given data are
\begin{align*}
\mathbb P(D) &= 0.01, &
\mathbb P(T \mid D) &= 0.99, &
\mathbb P(T \mid D^c) &= 0.05.
\end{align*}
Since $D$ and $D^c$ partition the population and $\mathbb P(D^c)=1-\mathbb P(D)=1-0.01=0.99$, *[Law of Total Probability](/theorems/1113)* gives
\begin{align*}
\mathbb P(T)
&= \mathbb P(T \mid D)\mathbb P(D)+\mathbb P(T \mid D^c)\mathbb P(D^c) \\
&= 0.99\cdot 0.01+0.05\cdot 0.99 \\
&= 0.0099+0.0495 \\
&= 0.0594.
\end{align*}
Now *[Bayes' Formula](/theorems/1114) for a Partition*, applied to the partition $\{D,D^c\}$, gives
\begin{align*}
\mathbb P(D \mid T)
&= \frac{\mathbb P(T \mid D)\mathbb P(D)}{\mathbb P(T)} \\
&= \frac{0.99\cdot 0.01}{0.0594} \\
&= \frac{0.0099}{0.0594} \\
&= \frac{99}{594} \\
&= \frac{1}{6}.
\end{align*}
The test is highly sensitive, but a positive result gives probability $1/6$, not $0.99$, because the large healthy population contributes $0.0495$ of the total positive-test probability through false positives.
[/example]
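The same arithmetic can be checked with exact fractions. The sketch below is only an illustration: the function name `posterior_positive` and its parameter names are assumptions chosen for this example, not notation from the text.

```python
from fractions import Fraction


def posterior_positive(prior, sensitivity, false_positive_rate):
    """Return (P(T), P(D | T)) for a binary test, using exact arithmetic."""
    p_not_d = 1 - prior
    # Law of total probability over the partition {D, D^c}.
    p_t = sensitivity * prior + false_positive_rate * p_not_d
    # Bayes' formula reverses the direction of conditioning.
    return p_t, sensitivity * prior / p_t


# Data from the example: P(D) = 0.01, P(T | D) = 0.99, P(T | D^c) = 0.05.
p_t, p_d_given_t = posterior_positive(
    Fraction(1, 100), Fraction(99, 100), Fraction(5, 100)
)
```

Raising the prior while keeping the test characteristics fixed shows how strongly the posterior depends on the base rate.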
This example is not a paradox. It is the arithmetic of changing the reference population. Conditional probability formalizes that change and gives a reliable language for every later subject in [probability theory](/page/Cambridge%20IA%20Probability), from independence to martingales.
Before defining conditional probability, we need to isolate the kind of event on which conditioning by a ratio is possible. If an event has probability zero, then dividing by its probability cannot define a new probability measure. Conditioning on probability-zero information is possible in more advanced settings, but it requires conditional distributions or regular conditional probabilities rather than the elementary ratio.
[definition: Positive Probability Event]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. An event $B \in \mathcal F$ is a positive probability event if
\begin{align*}
\mathbb P(B) > 0.
\end{align*}
[/definition]
This preliminary definition marks the boundary of the elementary theory. Once the conditioning event has positive probability, the updated probability is obtained by restricting attention to that event and renormalizing.
## Definition
After identifying the events whose probability can be used as a normalizing factor, the next problem is to define the updated probability of an arbitrary event $A$. The only outcomes that still matter are those in $B$, so the numerator must be $\mathbb P(A \cap B)$; the denominator must make the conditioned universe $B$ have total probability $1$.
[definition: Conditional Probability]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, and let $B \in \mathcal F$ satisfy $\mathbb P(B) > 0$. The conditional probability given $B$ is the map
\begin{align*}
\mathbb P(\cdot \mid B): \mathcal F &\to [0,1] \\
A &\mapsto \frac{\mathbb P(A \cap B)}{\mathbb P(B)}.
\end{align*}
[/definition]
The definition says that only the overlap $A \cap B$ matters after $B$ is known. The next useful question is how to recover the probability of a joint event from a conditional probability. This is the form needed when an experiment is performed in stages and we want to multiply probabilities along the path.
[quotetheorem:4995]
The multiplication rule is the bridge between intersections and updates. It also warns that $\mathbb P(A \mid B)$ and $\mathbb P(B \mid A)$ are different ratios with the same numerator but different denominators. A finite sample-space computation makes this change of denominator visible.
[example: Two Dice and a Restricted Universe]
Roll two fair six-sided dice, so the sample space has $36$ equally likely ordered pairs. Let $A$ be the event that the sum is $8$, and let $B$ be the event that the first die is $3$. The conditioning event is
\begin{align*}
B=\{(3,1),(3,2),(3,3),(3,4),(3,5),(3,6)\},
\end{align*}
so $|B|=6$ and
\begin{align*}
\mathbb P(B)=\frac{6}{36}=\frac{1}{6}>0.
\end{align*}
The event $A$ is
\begin{align*}
A=\{(2,6),(3,5),(4,4),(5,3),(6,2)\},
\end{align*}
so
\begin{align*}
A\cap B=\{(3,5)\}
\end{align*}
and therefore
\begin{align*}
\mathbb P(A\cap B)=\frac{1}{36}.
\end{align*}
Using the definition of conditional probability,
\begin{align*}
\mathbb P(A\mid B)
&=\frac{\mathbb P(A\cap B)}{\mathbb P(B)} \\
&=\frac{1/36}{6/36} \\
&=\frac{1}{36}\cdot \frac{36}{6} \\
&=\frac{1}{6}.
\end{align*}
Without the information $B$, the sum $8$ has five favorable outcomes among the thirty-six equally likely outcomes, so
\begin{align*}
\mathbb P(A)=\frac{5}{36}.
\end{align*}
Conditioning has changed the reference set from all $36$ outcomes to the six outcomes compatible with the first die being $3$.
[/example]
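A brute-force enumeration makes the change of denominator concrete. This is a minimal sketch for a finite uniform sample space; the helper names `prob` and `cond_prob` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered pairs (first die, second die).
OUTCOMES = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OUTCOMES if event(w)), len(OUTCOMES))


def cond_prob(a, b):
    """P(A | B) by the ratio definition; requires P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("the conditioning event must have positive probability")
    return prob(lambda w: a(w) and b(w)) / p_b


def sum_is_8(w):
    return w[0] + w[1] == 8


def first_is_3(w):
    return w[0] == 3


p_a_given_b = cond_prob(sum_is_8, first_is_3)  # denominator is P(B)
p_b_given_a = cond_prob(first_is_3, sum_is_8)  # denominator is P(A)
```

The two conditional probabilities share the numerator $\mathbb P(A\cap B)$ but divide by different events, so they need not agree.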
## Conditioning on Events
### Probability After Information
The formula for $\mathbb P(A \mid B)$ resembles an ordinary probability measure in the variable $A$. We need this resemblance to be genuine, because once information has arrived we still want to take complements, count disjoint alternatives, and reason with the probability axioms inside the new universe.
[quotetheorem:4972]
This theorem justifies treating conditional probabilities like ordinary probabilities once the conditioning event has been fixed. The next problem is to handle several pieces of information arriving in sequence, where each new event is conditioned on all previous events.
[quotetheorem:4996]
The chain rule is often the cleanest way to compute probabilities in sampling without replacement, card problems, and reliability models. A card draw illustrates why each factor has its own denominator.
[example: Drawing Cards Without Replacement]
Draw three cards without replacement from a standard deck of $52$ cards, and let $A_i$ be the event that the $i$th card drawn is an ace. Since the first card is drawn from $52$ cards, of which $4$ are aces,
\begin{align*}
\mathbb P(A_1)=\frac{4}{52}.
\end{align*}
Given $A_1$, one ace has already been removed, so the second draw is from $51$ remaining cards, of which $3$ are aces:
\begin{align*}
\mathbb P(A_2 \mid A_1)=\frac{3}{51}.
\end{align*}
Given $A_1 \cap A_2$, two aces have already been removed, so the third draw is from $50$ remaining cards, of which $2$ are aces:
\begin{align*}
\mathbb P(A_3 \mid A_1 \cap A_2)=\frac{2}{50}.
\end{align*}
By *[Chain Rule for Events](/theorems/4996)*,
\begin{align*}
\mathbb P(A_1 \cap A_2 \cap A_3)
&= \mathbb P(A_1)\mathbb P(A_2 \mid A_1)\mathbb P(A_3 \mid A_1 \cap A_2) \\
&= \frac{4}{52}\cdot \frac{3}{51}\cdot \frac{2}{50} \\
&= \frac{1}{13}\cdot \frac{1}{17}\cdot \frac{1}{25} \\
&= \frac{1}{13\cdot 17\cdot 25} \\
&= \frac{1}{5525}.
\end{align*}
The conditional probabilities shrink because each observed ace removes one ace and one card from the deck before the next draw.
[/example]
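As a sanity check, the chain-rule product can be compared with a direct count over ordered draws. The sketch below assumes cards are labelled $0,\ldots,51$ with labels $0$ to $3$ standing for the aces; that labelling and the function name are illustrative choices.

```python
from fractions import Fraction
from itertools import permutations


def chain_rule_all_aces(deck_size=52, aces=4, draws=3):
    """Multiply the conditional factors (aces - k) / (deck_size - k)."""
    p = Fraction(1)
    for k in range(draws):
        # After k aces are removed, aces - k remain among deck_size - k cards.
        p *= Fraction(aces - k, deck_size - k)
    return p


# Direct count: ordered triples of distinct cards, all with labels below 4.
triples = list(permutations(range(52), 3))
favourable = sum(1 for t in triples if all(card < 4 for card in t))
by_counting = Fraction(favourable, len(triples))
by_chain_rule = chain_rule_all_aces()
```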
### The Boundary of the Ratio Definition
The event in the denominator must have positive probability. When information singles out a probability-zero set in a continuous space, such as a single point or a curve, the elementary ratio no longer applies, and a different kind of conditional object is needed.
[example: Conditioning on a Point in a Square]
Let $(X,Y)$ be uniformly distributed on $[0,1]^2$, so probabilities are areas. The diagonal event $\{X=Y\}$ is a line segment and has area $0$, hence $\mathbb P(Y \le 1/2 \mid X=Y)$ is not defined by the elementary conditional-probability ratio.
For $0<\varepsilon<1/2$, condition instead on the strip
\begin{align*}
S_\varepsilon=\{(x,y)\in[0,1]^2: |x-y|<\varepsilon\}.
\end{align*}
The two excluded corner triangles are right triangles with legs of length $1-\varepsilon$, so
\begin{align*}
\mathbb P(S_\varepsilon)
&=1-2\cdot \frac{(1-\varepsilon)^2}{2} \\
&=1-(1-2\varepsilon+\varepsilon^2) \\
&=2\varepsilon-\varepsilon^2.
\end{align*}
The part of the strip with $Y\le 1/2$ has area
\begin{align*}
\mathbb P(S_\varepsilon\cap\{Y\le 1/2\})
&=\int_0^\varepsilon (y+\varepsilon)\,d\mathcal L^1(y)
+\int_\varepsilon^{1/2} 2\varepsilon\,d\mathcal L^1(y) \\
&=\left[\frac{y^2}{2}+\varepsilon y\right]_0^\varepsilon
+2\varepsilon\left(\frac{1}{2}-\varepsilon\right) \\
&=\frac{\varepsilon^2}{2}+\varepsilon^2+\varepsilon-2\varepsilon^2 \\
&=\varepsilon-\frac{\varepsilon^2}{2}.
\end{align*}
Therefore
\begin{align*}
\mathbb P(Y\le 1/2\mid S_\varepsilon)
&=\frac{\varepsilon-\varepsilon^2/2}{2\varepsilon-\varepsilon^2} \\
&=\frac{\varepsilon(1-\varepsilon/2)}{\varepsilon(2-\varepsilon)} \\
&=\frac{(2-\varepsilon)/2}{2-\varepsilon} \\
&=\frac{1}{2}.
\end{align*}
The limiting answer can change if the neighborhoods approach the diagonal with non-uniform thickness. If the transverse width near $(t,t)$ is proportional to $\varepsilon w(t)$ for a nonnegative [continuous function](/page/Continuous%20Function) $w:[0,1]\to[0,\infty)$ that is not identically zero, then the mass assigned to parameters $t\in[a,b]$ is proportional, in the limit, to
\begin{align*}
\int_a^b w(t)\,d\mathcal L^1(t).
\end{align*}
After normalization, the limiting density of the diagonal parameter $t$ is
\begin{align*}
\frac{w(t)}{\int_0^1 w(s)\,d\mathcal L^1(s)}.
\end{align*}
For $w(t)=1$,
\begin{align*}
\mathbb P(t\le 1/2)
&=\frac{\int_0^{1/2}1\,d\mathcal L^1(t)}{\int_0^1 1\,d\mathcal L^1(t)} \\
&=\frac{1/2}{1} \\
&=\frac{1}{2}.
\end{align*}
For $w(t)=2t$,
\begin{align*}
\mathbb P(t\le 1/2)
&=\frac{\int_0^{1/2}2t\,d\mathcal L^1(t)}{\int_0^1 2t\,d\mathcal L^1(t)} \\
&=\frac{[t^2]_0^{1/2}}{[t^2]_0^1} \\
&=\frac{1/4}{1} \\
&=\frac{1}{4}.
\end{align*}
Thus the phrase "condition on $X=Y$" does not determine a unique elementary conditional probability; probability-zero conditioning needs extra structure specifying how the zero-probability event is approached.
[/example]
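The dependence on the shape of the neighborhoods can be seen numerically. The following sketch estimates areas on a grid of cell midpoints and assumes the strip has the specific form $\{|x-y|<\varepsilon\, w((x+y)/2)\}$, which is one way to realize a transverse width proportional to $\varepsilon w(t)$; the function name `strip_ratio` and the grid size are illustrative.

```python
def strip_ratio(width, eps, n=400):
    """Grid estimate of P(Y <= 1/2 | |X - Y| < eps * width((X + Y) / 2))."""
    in_strip = 0
    in_strip_and_low = 0
    for i in range(n):
        x = (i + 0.5) / n  # midpoint of the i-th column of cells
        for j in range(n):
            y = (j + 0.5) / n  # midpoint of the j-th row of cells
            if abs(x - y) < eps * width((x + y) / 2):
                in_strip += 1
                if y <= 0.5:
                    in_strip_and_low += 1
    return in_strip_and_low / in_strip


# Uniform thickness, w(t) = 1: the estimate should be close to 1/2.
uniform_ratio = strip_ratio(lambda t: 1.0, eps=0.05)
# Thickness growing along the diagonal, w(t) = 2t: close to 1/4 for small eps.
tapered_ratio = strip_ratio(lambda t: 2.0 * t, eps=0.05)
```

For fixed $\varepsilon>0$ the tapered estimate differs slightly from $1/4$; the limiting values in the example are approached only as $\varepsilon\to 0$.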
## Partitions and Bayes' Formula
### Decomposing by Cases
Many applications start with hidden causes and observed evidence. The sample point belongs to exactly one possible cause class, but the observer sees only evidence generated from that class. To compute the probability of the evidence, we need a formal way to split the whole space into mutually exclusive cases.
[definition: Countable Measurable Partition]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A family $(B_i)_{i \in I}$ of events in $\mathcal F$ is a countable measurable partition of $\Omega$ if the index set $I$ is finite or countably infinite and
\begin{align*}
B_i \cap B_j &= \varnothing \quad \text{for } i \ne j, \\
\bigcup_{i \in I} B_i &= \Omega.
\end{align*}
[/definition]
A partition is a bookkeeping device for mutually exclusive explanations. The next step is to compute the probability of an event $A$ by summing the contributions coming from the cells $B_i$.
[quotetheorem:1113]
The first formula decomposes an event into disjoint pieces. The second formula gives those pieces a causal reading: likelihood inside a cell times prior probability of that cell. Once this denominator for the evidence is known, we can reverse the direction of conditioning and ask which cell is plausible after the evidence is observed.
[quotetheorem:1114]
The denominator is the total probability of the evidence $A$, obtained by summing over all possible cells of the partition. A box experiment shows how the prior weight of a cell and the likelihood of the observation compete.
[example: Which Box Was Chosen]
A box is chosen at random. Box $B_1$ is chosen with probability $2/3$ and contains $1$ red ball and $1$ blue ball, while box $B_2$ is chosen with probability $1/3$ and contains $3$ red balls and $1$ blue ball. After the box is chosen, one ball is drawn uniformly from that box. Let $R$ be the event that the drawn ball is red. From the contents of the boxes,
\begin{align*}
\mathbb P(R \mid B_1) &= \frac{1}{2}, &
\mathbb P(R \mid B_2) &= \frac{3}{4},
\end{align*}
because $B_1$ has $1$ red ball among $2$ balls and $B_2$ has $3$ red balls among $4$ balls.
The events $B_1$ and $B_2$ form a partition of the possible box choices, so by *Law of Total Probability*,
\begin{align*}
\mathbb P(R)
&= \mathbb P(R \mid B_1)\mathbb P(B_1)+\mathbb P(R \mid B_2)\mathbb P(B_2) \\
&= \frac{1}{2}\cdot \frac{2}{3}+\frac{3}{4}\cdot \frac{1}{3} \\
&= \frac{2}{6}+\frac{3}{12} \\
&= \frac{1}{3}+\frac{1}{4} \\
&= \frac{4}{12}+\frac{3}{12} \\
&= \frac{7}{12}.
\end{align*}
After observing a red ball, *Bayes' Formula for a Partition* gives
\begin{align*}
\mathbb P(B_2 \mid R)
&= \frac{\mathbb P(R \mid B_2)\mathbb P(B_2)}{\mathbb P(R)} \\
&= \frac{(3/4)(1/3)}{7/12} \\
&= \frac{3}{12}\cdot \frac{12}{7} \\
&= \frac{3}{7}.
\end{align*}
The red ball favors $B_2$, because red is more likely from $B_2$ than from $B_1$, but the larger prior weight of $B_1$ remains part of the posterior calculation.
[/example]
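The two steps, total probability followed by Bayes' formula, can be packaged as one small routine for a finite partition. This is a sketch under the assumption that priors and likelihoods are supplied as dictionaries keyed by cell; the name `bayes_over_partition` is hypothetical.

```python
from fractions import Fraction


def bayes_over_partition(priors, likelihoods):
    """Return (P(A), posterior) for evidence A over a finite partition.

    priors[c] is P(B_c); likelihoods[c] is P(A | B_c).
    """
    # Law of total probability: sum likelihood times prior over the cells.
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    if evidence == 0:
        raise ValueError("the evidence must have positive probability")
    # Bayes' formula: each numerator is one term of the evidence sum.
    posterior = {c: likelihoods[c] * priors[c] / evidence for c in priors}
    return evidence, posterior


box_priors = {"B1": Fraction(2, 3), "B2": Fraction(1, 3)}
red_likelihood = {"B1": Fraction(1, 2), "B2": Fraction(3, 4)}
p_red, box_posterior = bayes_over_partition(box_priors, red_likelihood)
```

The posterior weights sum to $1$ because each numerator is exactly one term of the denominator.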
### Bayesian Updating
Bayes' formula is also the language behind statistical updating. To use it as a reusable model, we separate the unknown parameter, the data-generating probabilities, and the updated probabilities after an observation.
[definition: Prior, Likelihood, and Posterior on a Finite Parameter Space]
Let $\Theta$ be a finite set, let $S$ be a [countable set](/page/Countable%20Set), let $\Theta_0: \Omega \to \Theta$ be a parameter random variable, and let $X: \Omega \to S$ be an observable random variable. The prior is the map
\begin{align*}
\pi: \Theta &\to [0,1] \\
\theta &\mapsto \mathbb P(\Theta_0 = \theta),
\end{align*}
and the likelihood is the map
\begin{align*}
L: S \times \Theta &\to [0,1] \\
(x,\theta) &\mapsto \mathbb P(X=x \mid \Theta_0=\theta),
\end{align*}
for parameters $\theta$ with $\mathbb P(\Theta_0=\theta)>0$. For $x \in S$ satisfying
\begin{align*}
\sum_{\eta \in \Theta} L(x,\eta)\pi(\eta)>0,
\end{align*}
the posterior after observing $X=x$ is the map
\begin{align*}
\pi(\cdot \mid x): \Theta &\to [0,1] \\
\theta &\mapsto \frac{L(x,\theta)\pi(\theta)}{\sum_{\eta \in \Theta} L(x,\eta)\pi(\eta)}.
\end{align*}
[/definition]
If $\mathbb P(\Theta_0=\theta)=0$, the likelihood values at $\theta$ do not affect the posterior, since the corresponding prior weight $\pi(\theta)$ is zero in every numerator and denominator term. This finite version avoids measure-theoretic technicalities while preserving the main idea of Bayesian inference: update by multiplying prior weight by likelihood and then normalizing. When repeated observations are conditionally independent given the parameter, the likelihood of the whole data set is the product of the one-observation likelihoods.
[example: Updating a Biased Coin Model]
A coin is either fair or biased toward heads. Let $\Theta=\{F,B\}$, with prior probabilities
\begin{align*}
\pi(F)=\frac{1}{2}, \qquad \pi(B)=\frac{1}{2}.
\end{align*}
Suppose the one-toss head probabilities are
\begin{align*}
\mathbb P(H \mid F) &= \frac{1}{2}, &
\mathbb P(H \mid B) &= \frac{3}{4}.
\end{align*}
We observe three heads in three conditionally independent tosses. Hence the likelihood of the data $HHH$ under the fair model is
\begin{align*}
L(HHH, F)
&= \mathbb P(H \mid F)\mathbb P(H \mid F)\mathbb P(H \mid F) \\
&= \frac{1}{2}\cdot \frac{1}{2}\cdot \frac{1}{2} \\
&= \frac{1}{8},
\end{align*}
and the likelihood under the biased model is
\begin{align*}
L(HHH, B)
&= \mathbb P(H \mid B)\mathbb P(H \mid B)\mathbb P(H \mid B) \\
&= \frac{3}{4}\cdot \frac{3}{4}\cdot \frac{3}{4} \\
&= \frac{27}{64}.
\end{align*}
Using the finite posterior formula from the definition of prior, likelihood, and posterior,
\begin{align*}
\mathbb P(B \mid HHH)
&= \frac{L(HHH, B)\pi(B)}{L(HHH, F)\pi(F)+L(HHH, B)\pi(B)} \\
&= \frac{(27/64)(1/2)}{(1/8)(1/2)+(27/64)(1/2)} \\
&= \frac{27/128}{1/16+27/128} \\
&= \frac{27/128}{8/128+27/128} \\
&= \frac{27/128}{35/128} \\
&= \frac{27}{128}\cdot \frac{128}{35} \\
&= \frac{27}{35}.
\end{align*}
Three heads increase the posterior probability of the biased model from $1/2$ to $27/35$, because the biased model assigns larger likelihood to the observed sequence while the denominator still includes the prior-weighted fair-model contribution.
[/example]
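Because the tosses are conditionally independent given the coin type, updating one toss at a time gives the same posterior as using the whole sequence at once. The sketch below illustrates this; the dictionary `HEAD_PROB` and the function `update` are names introduced only for this example.

```python
from fractions import Fraction

# One-toss head probabilities for the fair (F) and biased (B) models.
HEAD_PROB = {"F": Fraction(1, 2), "B": Fraction(3, 4)}


def update(prior, toss):
    """One Bayesian update of a prior over {F, B} after a toss 'H' or 'T'."""
    weights = {}
    for theta, weight in prior.items():
        likelihood = HEAD_PROB[theta] if toss == "H" else 1 - HEAD_PROB[theta]
        weights[theta] = likelihood * weight  # prior times likelihood
    total = sum(weights.values())  # normalizing constant
    return {theta: w / total for theta, w in weights.items()}


posterior = {"F": Fraction(1, 2), "B": Fraction(1, 2)}
history = []
for toss in "HHH":
    posterior = update(posterior, toss)
    history.append(posterior["B"])
```

Each intermediate posterior serves as the prior for the next toss, which is the sequential form of multiplying likelihoods.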
## Independence and Conditional Independence
### No Change Under New Information
Independence is often described as the absence of influence, but conditional probability gives the sharper formulation: learning $B$ does not change the probability of $A$. To make that idea symmetric and usable even before writing a conditional ratio, we start with the product condition.
[definition: Independent Events]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. Events $A,B \in \mathcal F$ are independent if
\begin{align*}
\mathbb P(A \cap B)=\mathbb P(A)\mathbb P(B).
\end{align*}
[/definition]
When $\mathbb P(B)>0$, there is a concrete way to test whether learning $B$ changes the chance of $A$: compare $\mathbb P(A\mid B)$ with $\mathbb P(A)$. The only obstruction is that conditional probability is a ratio, so this comparison is meaningful only when the conditioning event has positive probability.
[quotetheorem:4859]
The condition $\mathbb P(B)>0$ is essential for the conditional expression. The product definition of independence still makes sense when probabilities vanish, but the ratio form does not. A simple die roll also separates independence from disjointness.
[example: Independence Does Not Mean Disjointness]
Roll one fair six-sided die, so each outcome in $\{1,2,3,4,5,6\}$ has probability $1/6$. Let $A$ be the event that the outcome is even, and let $B$ be the event that the outcome is at least $5$. Then
\begin{align*}
A &= \{2,4,6\}, &
B &= \{5,6\}, &
A \cap B &= \{6\}.
\end{align*}
Counting favorable outcomes in the uniform sample space gives
\begin{align*}
\mathbb P(A)
&= \frac{|A|}{6}
= \frac{3}{6}
= \frac{1}{2}, \\
\mathbb P(B)
&= \frac{|B|}{6}
= \frac{2}{6}
= \frac{1}{3}, \\
\mathbb P(A \cap B)
&= \frac{|A\cap B|}{6}
= \frac{1}{6}.
\end{align*}
The product of the two marginal probabilities is
\begin{align*}
\mathbb P(A)\mathbb P(B)
&= \frac{1}{2}\cdot \frac{1}{3} \\
&= \frac{1}{6}.
\end{align*}
Hence
\begin{align*}
\mathbb P(A\cap B)=\mathbb P(A)\mathbb P(B),
\end{align*}
so $A$ and $B$ are independent by the definition of independent events, even though $A\cap B=\{6\}$ is nonempty.
This separates independence from disjointness. If two disjoint events $E$ and $F$ have positive probability, then
\begin{align*}
\mathbb P(E\cap F)&=\mathbb P(\varnothing)=0,\\
\mathbb P(E)\mathbb P(F)&>0,
\end{align*}
so they cannot satisfy $\mathbb P(E\cap F)=\mathbb P(E)\mathbb P(F)$.
[/example]
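The product condition is easy to test mechanically on a finite uniform space. In this sketch events are Python sets of die faces, and the helper `independent` is a hypothetical name.

```python
from fractions import Fraction


def prob(event):
    """Probability of a set of faces of one fair six-sided die."""
    return Fraction(len(event), 6)


def independent(a, b):
    """Check the product condition P(A ∩ B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)


even = {2, 4, 6}
at_least_5 = {5, 6}
odd = {1, 3, 5}

overlapping_but_independent = independent(even, at_least_5)
disjoint_and_independent = independent(even, odd)
```

The second check fails because disjoint events of positive probability have intersection probability $0$ but a strictly positive product.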
### Independence After Background Information
Unconditional independence may be the wrong question when a hidden environment affects both events. In such situations, the events can become independent after the environment is fixed. This motivates an independence notion inside a conditioned probability measure.
[definition: Conditional Independence Given an Event]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, and let $C \in \mathcal F$ satisfy $\mathbb P(C)>0$. Events $A,B \in \mathcal F$ are conditionally independent given $C$ if
\begin{align*}
\mathbb P(A \cap B \mid C)=\mathbb P(A \mid C)\mathbb P(B \mid C).
\end{align*}
[/definition]
The phrase "given $C$" matters. Conditional independence is independence under the [conditional probability measure](/theorems/4972) $\mathbb P(\cdot \mid C)$, and it need not agree with unconditional independence. A shared random environment shows one direction of this mismatch.
[example: Conditional Independence Without Unconditional Independence]
Choose a coin type at random: with probability $1/2$ choose a fair coin, and with probability $1/2$ choose a coin that has heads on both sides. Let $C_F$ be the event that the fair coin was chosen, let $C_H$ be the event that the two-headed coin was chosen, let $A$ be the event that the first toss is heads, and let $B$ be the event that the second toss is heads.
Given $C_F$, the two tosses are independent fair-coin tosses, so
\begin{align*}
\mathbb P(A \mid C_F)&=\frac{1}{2},&
\mathbb P(B \mid C_F)&=\frac{1}{2},&
\mathbb P(A\cap B \mid C_F)&=\frac{1}{2}\cdot \frac{1}{2}=\frac{1}{4}.
\end{align*}
Thus
\begin{align*}
\mathbb P(A\cap B \mid C_F)
&=\frac{1}{4} \\
&=\frac{1}{2}\cdot \frac{1}{2} \\
&=\mathbb P(A \mid C_F)\mathbb P(B \mid C_F),
\end{align*}
so $A$ and $B$ are conditionally independent given $C_F$.
Given $C_H$, both tosses are heads with probability $1$, so
\begin{align*}
\mathbb P(A \mid C_H)&=1,&
\mathbb P(B \mid C_H)&=1,&
\mathbb P(A\cap B \mid C_H)&=1.
\end{align*}
Hence
\begin{align*}
\mathbb P(A\cap B \mid C_H)
&=1 \\
&=1\cdot 1 \\
&=\mathbb P(A \mid C_H)\mathbb P(B \mid C_H),
\end{align*}
so $A$ and $B$ are also conditionally independent given $C_H$.
Unconditionally, the events $C_F$ and $C_H$ form a partition of the possible coin choices, with $\mathbb P(C_F)=\mathbb P(C_H)=1/2$. By *Law of Total Probability*,
\begin{align*}
\mathbb P(A)
&=\mathbb P(A\mid C_F)\mathbb P(C_F)+\mathbb P(A\mid C_H)\mathbb P(C_H) \\
&=\frac{1}{2}\cdot\frac{1}{2}+1\cdot\frac{1}{2} \\
&=\frac{1}{4}+\frac{1}{2} \\
&=\frac{1}{4}+\frac{2}{4} \\
&=\frac{3}{4},
\end{align*}
and the same calculation gives
\begin{align*}
\mathbb P(B)
&=\mathbb P(B\mid C_F)\mathbb P(C_F)+\mathbb P(B\mid C_H)\mathbb P(C_H) \\
&=\frac{1}{2}\cdot\frac{1}{2}+1\cdot\frac{1}{2} \\
&=\frac{1}{4}+\frac{1}{2} \\
&=\frac{3}{4}.
\end{align*}
Again by *Law of Total Probability*,
\begin{align*}
\mathbb P(A\cap B)
&=\mathbb P(A\cap B\mid C_F)\mathbb P(C_F)+\mathbb P(A\cap B\mid C_H)\mathbb P(C_H) \\
&=\frac{1}{4}\cdot\frac{1}{2}+1\cdot\frac{1}{2} \\
&=\frac{1}{8}+\frac{1}{2} \\
&=\frac{1}{8}+\frac{4}{8} \\
&=\frac{5}{8}.
\end{align*}
But
\begin{align*}
\mathbb P(A)\mathbb P(B)
&=\frac{3}{4}\cdot\frac{3}{4} \\
&=\frac{9}{16},
\end{align*}
while
\begin{align*}
\mathbb P(A\cap B)
&=\frac{5}{8} \\
&=\frac{10}{16}.
\end{align*}
Since $10/16\ne 9/16$, the product condition for independence fails unconditionally. The shared random choice of coin type creates dependence before the environment event is known.
[/example]
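The mixture can be written out as an explicit weighted sample space, which lets both the conditional and the unconditional product conditions be checked directly. This is a sketch; the outcome encoding `(coin, first toss, second toss)` and the helper names are assumptions made for the illustration.

```python
from fractions import Fraction

# Fair coin: chosen with probability 1/2, then four equally likely toss pairs.
WEIGHTS = {("fair", a, b): Fraction(1, 8) for a in "HT" for b in "HT"}
# Two-headed coin: chosen with probability 1/2, both tosses are heads.
WEIGHTS[("two_headed", "H", "H")] = Fraction(1, 2)


def prob(event, given=lambda w: True):
    """P(event | given) by the ratio definition on the weighted space."""
    denom = sum(p for w, p in WEIGHTS.items() if given(w))
    numer = sum(p for w, p in WEIGHTS.items() if given(w) and event(w))
    return numer / denom


def first_heads(w):
    return w[1] == "H"


def second_heads(w):
    return w[2] == "H"


def both_heads(w):
    return first_heads(w) and second_heads(w)


def is_fair(w):
    return w[0] == "fair"


cond_product_holds = prob(both_heads, is_fair) == (
    prob(first_heads, is_fair) * prob(second_heads, is_fair)
)
uncond_product_holds = prob(both_heads) == prob(first_heads) * prob(second_heads)
```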
Conditioning can also create dependence. If two independent measurements are observed only through their sum, then knowing one tells us about the other because the sum constrains them together.
[example: Conditioning Can Create Dependence]
Roll two independent fair six-sided dice, and let $X$ and $Y$ be the first and second outcomes. In the unconditioned sample space there are $36$ equally likely ordered pairs. For
\begin{align*}
A=\{X=1\}, \qquad B=\{Y=1\},
\end{align*}
we have
\begin{align*}
\mathbb P(A)&=\frac{6}{36}=\frac{1}{6},\\
\mathbb P(B)&=\frac{6}{36}=\frac{1}{6},\\
\mathbb P(A\cap B)&=\mathbb P(\{(1,1)\})=\frac{1}{36}.
\end{align*}
Also
\begin{align*}
\mathbb P(A)\mathbb P(B)
&=\frac{1}{6}\cdot \frac{1}{6}\\
&=\frac{1}{36}\\
&=\mathbb P(A\cap B),
\end{align*}
so $A$ and $B$ are independent before conditioning.
Now condition on
\begin{align*}
C=\{X+Y=4\}.
\end{align*}
The outcomes in $C$ are exactly
\begin{align*}
C=\{(1,3),(2,2),(3,1)\},
\end{align*}
so
\begin{align*}
\mathbb P(C)=\frac{3}{36}=\frac{1}{12}>0.
\end{align*}
Inside this conditioned universe,
\begin{align*}
A\cap C&=\{(1,3)\},\\
B\cap C&=\{(3,1)\},\\
A\cap B\cap C&=\varnothing,
\end{align*}
because no ordered pair can satisfy $X=1$, $Y=1$, and $X+Y=4$ simultaneously. Hence, using the definition of conditional probability,
\begin{align*}
\mathbb P(A\mid C)
&=\frac{\mathbb P(A\cap C)}{\mathbb P(C)}\\
&=\frac{1/36}{3/36}\\
&=\frac{1}{36}\cdot\frac{36}{3}\\
&=\frac{1}{3},
\end{align*}
and similarly
\begin{align*}
\mathbb P(B\mid C)
&=\frac{\mathbb P(B\cap C)}{\mathbb P(C)}\\
&=\frac{1/36}{3/36}\\
&=\frac{1}{3}.
\end{align*}
For the intersection,
\begin{align*}
\mathbb P(A\cap B\mid C)
&=\frac{\mathbb P(A\cap B\cap C)}{\mathbb P(C)}\\
&=\frac{0}{3/36}\\
&=0.
\end{align*}
But
\begin{align*}
\mathbb P(A\mid C)\mathbb P(B\mid C)
&=\frac{1}{3}\cdot \frac{1}{3}\\
&=\frac{1}{9},
\end{align*}
and $0\ne 1/9$. Therefore $A$ and $B$ are not conditionally independent given $C$. The constraint $X+Y=4$ couples the two dice: once one die is known to be $1$, the other must be $3$, not $1$.
[/example]
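The same comparison can be run by enumeration. In a uniform space, conditioning amounts to counting inside the restricted pool of outcomes, which is what the hypothetical helper `prob` below does.

```python
from fractions import Fraction
from itertools import product

PAIRS = list(product(range(1, 7), repeat=2))  # uniform on ordered pairs


def prob(event, given=None):
    """P(event | given) by counting; given=None means no conditioning."""
    pool = [w for w in PAIRS if given is None or given(w)]
    # An empty pool would mean conditioning on an impossible event.
    return Fraction(sum(1 for w in pool if event(w)), len(pool))


def x_is_1(w):
    return w[0] == 1


def y_is_1(w):
    return w[1] == 1


def both_are_1(w):
    return x_is_1(w) and y_is_1(w)


def sum_is_4(w):
    return w[0] + w[1] == 4


independent_before = prob(both_are_1) == prob(x_is_1) * prob(y_is_1)
independent_given_sum = prob(both_are_1, sum_is_4) == (
    prob(x_is_1, sum_is_4) * prob(y_is_1, sum_is_4)
)
```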
A common statistical error is to compare conditional probabilities across groups without tracking how the groups are mixed. Aggregation can reverse inequalities that hold in every subgroup.
[example: Simpson's Paradox]
Suppose a treatment is tested separately in two severity groups. Let $T$ denote receiving the treatment, let $C$ denote receiving the control, let $M$ denote being in the mild group, let $S$ denote being in the severe group, and let $R$ denote successful recovery.
In the mild group, treatment succeeds for $90$ of $100$ patients, while control succeeds for $80$ of $100$ patients. Therefore
\begin{align*}
\mathbb P(R\mid T\cap M)
&=\frac{90}{100}
=\frac{9}{10},\\
\mathbb P(R\mid C\cap M)
&=\frac{80}{100}
=\frac{4}{5}
=\frac{8}{10}.
\end{align*}
Since $9/10>8/10$, treatment has the higher success rate among mild cases.
In the severe group, treatment succeeds for $10$ of $100$ patients, while control succeeds for $1$ of $20$ patients. Therefore
\begin{align*}
\mathbb P(R\mid T\cap S)
&=\frac{10}{100}
=\frac{1}{10},\\
\mathbb P(R\mid C\cap S)
&=\frac{1}{20}.
\end{align*}
Since
\begin{align*}
\frac{1}{10}
&=\frac{2}{20}
>\frac{1}{20},
\end{align*}
treatment also has the higher success rate among severe cases.
Now aggregate over severity. In the treatment group there are $100$ mild patients and $100$ severe patients, so there are
\begin{align*}
100+100=200
\end{align*}
treated patients in total, with
\begin{align*}
90+10=100
\end{align*}
successful recoveries. Hence
\begin{align*}
\mathbb P(R\mid T)
&=\frac{100}{200}
=\frac{1}{2}.
\end{align*}
In the control group there are $100$ mild patients and $20$ severe patients, so there are
\begin{align*}
100+20=120
\end{align*}
control patients in total, with
\begin{align*}
80+1=81
\end{align*}
successful recoveries. Hence
\begin{align*}
\mathbb P(R\mid C)
&=\frac{81}{120}
=\frac{27}{40}.
\end{align*}
To compare the two aggregate rates,
\begin{align*}
\frac{1}{2}
&=\frac{20}{40}
<\frac{27}{40}.
\end{align*}
Thus treatment has the higher success rate in each severity group, but control has the higher success rate after the groups are combined.
The reversal occurs because the treatment group contains a much larger proportion of severe cases:
\begin{align*}
\mathbb P(S\mid T)
&=\frac{100}{200}
=\frac{1}{2},\\
\mathbb P(S\mid C)
&=\frac{20}{120}
=\frac{1}{6}.
\end{align*}
Success rates conditioned on both treatment arm and severity and success rates conditioned on treatment arm alone answer different questions, so comparing the aggregated rates without tracking the severity mix can reverse the within-group comparisons.
[/example]
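The reversal can be reproduced from the raw counts. The table layout and the function `rate` below are illustrative choices, using exactly the patient numbers from the example.

```python
from fractions import Fraction

# (successful recoveries, patients) for each (arm, severity) cell.
COUNTS = {
    ("treatment", "mild"): (90, 100),
    ("treatment", "severe"): (10, 100),
    ("control", "mild"): (80, 100),
    ("control", "severe"): (1, 20),
}


def rate(arm, severity=None):
    """Success rate in one arm, within a severity group or aggregated."""
    cells = [
        cell
        for (a, s), cell in COUNTS.items()
        if a == arm and severity in (None, s)
    ]
    successes = sum(r for r, _ in cells)
    patients = sum(n for _, n in cells)
    return Fraction(successes, patients)


treatment_wins_mild = rate("treatment", "mild") > rate("control", "mild")
treatment_wins_severe = rate("treatment", "severe") > rate("control", "severe")
treatment_wins_overall = rate("treatment") > rate("control")
```

Equalizing the severity mix across the two arms in `COUNTS` removes the reversal, which is one way to see that the mix, not the treatment, drives it.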
## Conditioning Random Variables
### Conditional Laws in the Discrete Case
Events are not the only kind of information. Often we observe a random variable: the number of heads, the value of a signal, the past of a process, or a statistic computed from data. Conditioning on $X=x$ is elementary when $X=x$ has positive probability, and it leads naturally to conditional distributions.
The measure-theoretic way to say "we observed $X$" is "we know every event in $\sigma(X)$", the sigma-algebra generated by $X$. In the discrete case, this information is organized by the atoms $\{X=x\}$ with positive probability, so conditioning on the random variable can be built from ordinary event conditioning on those atoms. In continuous cases the atoms usually have probability zero, which is why the later density and sigma-algebra formulations are not cosmetic extensions but replacements for the elementary ratio.
[definition: Conditional Distribution Given a Discrete Random Variable]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, let $X: \Omega \to S$ be a random variable taking values in a countable set $S$, and let $Y: \Omega \to T$ be a random variable taking values in a countable set $T$. For $x \in S$ with $\mathbb P(X=x)>0$, the conditional distribution of $Y$ given $X=x$ is the probability mass function
\begin{align*}
p_{Y \mid X}(\cdot \mid x): T &\to [0,1] \\
y &\mapsto \mathbb P(Y=y \mid X=x).
\end{align*}
[/definition]
This definition packages many conditional probabilities into a single object: for each observed value $x$, it gives the whole distribution of $Y$ after the observation $X=x$. A sum of dice shows how the support of a conditional law can be smaller than the original support.
[example: Conditional Distribution of a Sum]
Roll two independent fair six-sided dice. Let $X$ be the first die, let $Y$ be the second die, and let $S=X+Y$. Since the dice are fair and independent, each ordered pair in $\{1,\ldots,6\}^2$ has probability $1/36$. The event $\{S=5\}$ is
\begin{align*}
\{S=5\}
&=\{(x,y)\in\{1,\ldots,6\}^2:x+y=5\}\\
&=\{(1,4),(2,3),(3,2),(4,1)\},
\end{align*}
so
\begin{align*}
\mathbb P(S=5)
&=\frac{4}{36}
=\frac{1}{9}>0.
\end{align*}
For $k\in\{1,2,3,4\}$, the event $\{X=k\}\cap\{S=5\}$ consists of the single pair $(k,5-k)$, hence
\begin{align*}
\mathbb P(X=k\mid S=5)
&=\frac{\mathbb P(\{X=k\}\cap\{S=5\})}{\mathbb P(S=5)}\\
&=\frac{1/36}{4/36}\\
&=\frac{1}{36}\cdot\frac{36}{4}\\
&=\frac{1}{4}.
\end{align*}
For $k=5$ or $k=6$, there is no die value $y\in\{1,\ldots,6\}$ with $k+y=5$, so
\begin{align*}
\{X=k\}\cap\{S=5\}=\varnothing
\end{align*}
and therefore
\begin{align*}
\mathbb P(X=k\mid S=5)
&=\frac{0}{4/36}
=0.
\end{align*}
Thus the conditional distribution of $X$ given $S=5$ is uniform on the four values $\{1,2,3,4\}$ and assigns probability $0$ to the incompatible values $5$ and $6$.
[/example]
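A conditional probability mass function can be tabulated for any attainable value of the sum. The sketch below is a minimal illustration; the function name `conditional_pmf_of_first` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def conditional_pmf_of_first(total):
    """Return the pmf k -> P(X = k | X + Y = total) for two fair dice."""
    # First-die values of the equally likely pairs compatible with the sum.
    compatible = [x for x, y in product(range(1, 7), repeat=2) if x + y == total]
    if not compatible:
        raise ValueError("the conditioning event has probability zero")
    counts = Counter(compatible)
    # Values of k that never occur get count 0, hence probability 0.
    return {k: Fraction(counts[k], len(compatible)) for k in range(1, 7)}


pmf_given_5 = conditional_pmf_of_first(5)
```

Calling the function with a total of $7$ gives a conditional law with full support, while totals near $2$ or $12$ shrink the support further.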
### Conditional Averages
The same idea defines [conditional expectation](/page/Conditional%20Expectation) in the discrete setting. Instead of asking for every conditional probability, we ask for the average value of a random variable under the conditional distribution.
[definition: Conditional Expectation Given a Positive Probability Event]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, and let $B \in \mathcal F$ satisfy $\mathbb P(B)>0$. The conditional expectation given $B$ is the map
\begin{align*}
\mathbb E[\cdot \mid B]: L^1(\Omega,\mathcal F,\mathbb P) &\to \mathbb R \\
X &\mapsto \frac{\mathbb E[X\mathbb{1}_B]}{\mathbb P(B)}.
\end{align*}
For an integrable random variable $X: \Omega \to \mathbb R$, its value is denoted
\begin{align*}
\mathbb E[X \mid B] := \frac{\mathbb E[X\mathbb{1}_B]}{\mathbb P(B)}.
\end{align*}
[/definition]
This is the average of $X$ after restricting to $B$. When the information is the value of another random variable, we need the conditional average to depend on the value observed, and then to become a random variable before the observation is known.
[definition: Conditional Expectation Given a Discrete Random Variable]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, let $X: \Omega \to S$ be a random variable taking values in a countable set $S$, and let $Y: \Omega \to \mathbb R$ be an integrable random variable. Let
\begin{align*}
S_+ := \{x \in S : \mathbb P(X=x)>0\}.
\end{align*}
Let $g: S \to \mathbb R$ be any function satisfying
\begin{align*}
g(x) = \mathbb E[Y \mid X=x]
\end{align*}
for every $x \in S_+$. A version of the conditional expectation of $Y$ given $X$ is the random variable
\begin{align*}
\mathbb E[Y \mid X]: \Omega &\to \mathbb R \\
\omega &\mapsto g(X(\omega)).
\end{align*}
[/definition]
The definition only fixes $g$ on the values of $X$ that occur with positive probability. Values assigned outside $S_+$ do not change $g(X)$ on any event of positive probability, so different choices there give the same random variable $\mathbb P$-a.s. Thus $\mathbb E[Y \mid X]$ is understood as an almost-sure equivalence class unless a particular version has been chosen. The output is a random variable because the observed value $X$ is itself random. A natural consistency question now appears: if we average these conditional averages over all possible observations, should we recover the original average?
[quotetheorem:1121]
The [law of total expectation](/theorems/1121) says that an average can be computed by averaging conditional averages. This principle is one of the main reasons conditional expectation becomes central in [measure-theoretic probability](/page/Cambridge%20IB%20Probability%20and%20Measure).
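On a finite probability space the function $g$ and the random variable $\mathbb E[Y \mid X]$ can be tabulated directly. The sketch below uses a hypothetical model chosen only for illustration (two fair dice, $X$ the first die, $Y$ the sum) and checks numerically that averaging the conditional averages returns $\mathbb E[Y]$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite model: two fair dice, X = first die, Y = sum of both.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}


def X(w):
    return w[0]


def Y(w):
    return w[0] + w[1]


def expectation(f):
    """E[f] as a finite weighted sum over the sample space."""
    return sum(f(w) * P[w] for w in omega)


def cond_exp_given_value(x):
    """g(x) = E[Y 1_{X=x}] / P(X=x), defined only when P(X=x) > 0."""
    p_x = sum(P[w] for w in omega if X(w) == x)
    return sum(Y(w) * P[w] for w in omega if X(w) == x) / p_x


# g is tabulated on S_+, the values of X with positive probability.
S_plus = sorted({X(w) for w in omega if P[w] > 0})
g = {x: cond_exp_given_value(x) for x in S_plus}


def cond_exp(w):
    """The random variable E[Y | X], evaluated at the outcome w."""
    return g[X(w)]


# Law of total expectation: E[E[Y | X]] should equal E[Y].
print(expectation(cond_exp), expectation(Y))
```

Note that `g` is stored only on `S_plus`, mirroring the fact that the definition leaves $g$ unconstrained outside $S_+$.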
[example: Expected Winnings After Seeing a Signal]
A fair die is rolled, and the player is paid the face value in pounds. Let $X$ be the payment. Since each outcome in $\{1,2,3,4,5,6\}$ has probability $1/6$, the expected payment before any signal is
\begin{align*}
\mathbb E[X]
&=1\cdot \frac{1}{6}+2\cdot \frac{1}{6}+3\cdot \frac{1}{6}
+4\cdot \frac{1}{6}+5\cdot \frac{1}{6}+6\cdot \frac{1}{6}\\
&=\frac{1+2+3+4+5+6}{6}\\
&=\frac{21}{6}\\
&=\frac{7}{2}.
\end{align*}
Suppose the player is told whether the result is even, and let
\begin{align*}
E=\{2,4,6\}.
\end{align*}
Then
\begin{align*}
E^c=\{1,3,5\},
\end{align*}
so
\begin{align*}
\mathbb P(E)
&=\frac{3}{6}
=\frac{1}{2},&
\mathbb P(E^c)
&=\frac{3}{6}
=\frac{1}{2}.
\end{align*}
Using the definition of conditional expectation given a positive probability event, the conditional expected payment after being told that the result is even is
\begin{align*}
\mathbb E[X\mid E]
&=\frac{\mathbb E[X\mathbb 1_E]}{\mathbb P(E)}\\
&=\frac{2\cdot(1/6)+4\cdot(1/6)+6\cdot(1/6)}{3/6}\\
&=\frac{(2+4+6)/6}{3/6}\\
&=\frac{12/6}{3/6}\\
&=\frac{12}{6}\cdot \frac{6}{3}\\
&=4.
\end{align*}
Similarly, after being told that the result is odd,
\begin{align*}
\mathbb E[X\mid E^c]
&=\frac{\mathbb E[X\mathbb 1_{E^c}]}{\mathbb P(E^c)}\\
&=\frac{1\cdot(1/6)+3\cdot(1/6)+5\cdot(1/6)}{3/6}\\
&=\frac{(1+3+5)/6}{3/6}\\
&=\frac{9/6}{3/6}\\
&=\frac{9}{6}\cdot \frac{6}{3}\\
&=3.
\end{align*}
Averaging the two conditional expectations over the two possible signals gives
\begin{align*}
\mathbb E[X\mid E]\mathbb P(E)+\mathbb E[X\mid E^c]\mathbb P(E^c)
&=4\cdot \frac{1}{2}+3\cdot \frac{1}{2}\\
&=\frac{4}{2}+\frac{3}{2}\\
&=\frac{7}{2}.
\end{align*}
The signal replaces the unconditional average $7/2$ by a conditional average of either $4$ or $3$, while weighting those two conditional averages by the probabilities of the signals recovers the original expectation.
[/example]
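The arithmetic of this example can be reproduced in a few lines. This is a sketch rather than a general-purpose routine: the helper `cond_exp` is a hypothetical name, and it implements only the formula $\mathbb E[X\mathbb 1_B]/\mathbb P(B)$ for the fair-die payment.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                      # fair die: each face has probability 1/6


def cond_exp(event):
    """E[X | B] = E[X 1_B] / P(B) for the payment X(face) = face."""
    p_b = sum(p for k in faces if event(k))
    return sum(k * p for k in faces if event(k)) / p_b


def is_even(k):
    return k % 2 == 0


def is_odd(k):
    return k % 2 == 1


e_even = cond_exp(is_even)
e_odd = cond_exp(is_odd)
p_even = sum(p for k in faces if is_even(k))

# Weight the two conditional averages by the probabilities of the signals.
total = e_even * p_even + e_odd * (1 - p_even)
print(e_even, e_odd, total)
```

The three printed values should match $4$, $3$, and $7/2$ from the computation above.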
### Conditional Densities
When $X$ has a density, the event $\{X=x\}$ has probability zero for every $x$, so the elementary ratio defining conditional probability would divide by zero and no longer applies. To condition on a continuously distributed observation, we need a density-level replacement for the ratio of probabilities.
[definition: Conditional Density]
Let $(X,Y)$ be an $\mathbb R^2$-valued random vector with a chosen version of a joint density $f_{X,Y}: \mathbb R^2 \to [0,\infty)$ with respect to $\mathcal L^2$. The associated chosen marginal density of $X$ is the map
\begin{align*}
f_X: \mathbb R &\to [0,\infty) \\
x &\mapsto \int_{\mathbb R} f_{X,Y}(x,y)\, d\mathcal L^1(y).
\end{align*}
For every $x \in \mathbb R$ with $f_X(x)>0$, the corresponding version of the conditional density of $Y$ given $X=x$ is the map
\begin{align*}
f_{Y \mid X}(\cdot \mid x): \mathbb R &\to [0,\infty) \\
y &\mapsto \frac{f_{X,Y}(x,y)}{f_X(x)}.
\end{align*}
[/definition]
The conditional density is not obtained by conditioning on the event $X=x$ using the elementary ratio. It is a density-level object that describes how the joint density distributes mass along the slice $\{x\}\times\mathbb R$; it can be motivated as the limit of conditioning on the positive-probability events $\{x\le X\le x+h\}$ as $h\downarrow 0$. Since densities are defined only up to null sets, changing the chosen joint density can change the displayed pointwise formula on exceptional $x$-values; the conditional density should therefore be read as a chosen version, with probabilistic statements made for $f_X(x)\,d\mathcal L^1(x)$-a.e. value of $x$.
[example: Conditional Density on a Triangle]
Let $(X,Y)$ have joint density $f_{X,Y}(x,y)=2$ on the triangle $0<y<x<1$, and $f_{X,Y}(x,y)=0$ elsewhere. For a fixed $x$, the possible values of $y$ are exactly $0<y<x$ when $0<x<1$, so the marginal density of $X$ is
\begin{align*}
f_X(x)
&=\int_{\mathbb R} f_{X,Y}(x,y)\,d\mathcal L^1(y)\\
&=\int_0^x 2\,d\mathcal L^1(y)\\
&=\left[2y\right]_0^x\\
&=2x-0\\
&=2x,
\end{align*}
for $0<x<1$. For $x\le 0$ or $x\ge 1$, no $y$ satisfies $0<y<x<1$, so $f_X(x)=0$.
Now fix $0<x<1$. Since $f_X(x)=2x>0$, the conditional density formula gives
\begin{align*}
f_{Y \mid X}(y \mid x)
&=\frac{f_{X,Y}(x,y)}{f_X(x)}.
\end{align*}
Thus, for $0<y<x$,
\begin{align*}
f_{Y \mid X}(y \mid x)
&=\frac{2}{2x}\\
&=\frac{1}{x},
\end{align*}
and for $y\notin(0,x)$,
\begin{align*}
f_{Y \mid X}(y \mid x)
&=\frac{0}{2x}\\
&=0.
\end{align*}
This is the uniform density on $(0,x)$, since
\begin{align*}
\int_{\mathbb R} f_{Y\mid X}(y\mid x)\,d\mathcal L^1(y)
&=\int_0^x \frac{1}{x}\,d\mathcal L^1(y)\\
&=\left[\frac{y}{x}\right]_0^x\\
&=1.
\end{align*}
Therefore the conditional expectation is
\begin{align*}
\mathbb E[Y\mid X=x]
&=\int_{\mathbb R} y f_{Y\mid X}(y\mid x)\,d\mathcal L^1(y)\\
&=\int_0^x y\cdot \frac{1}{x}\,d\mathcal L^1(y)\\
&=\frac{1}{x}\int_0^x y\,d\mathcal L^1(y)\\
&=\frac{1}{x}\left[\frac{y^2}{2}\right]_0^x\\
&=\frac{1}{x}\cdot \frac{x^2}{2}\\
&=\frac{x}{2}.
\end{align*}
Given $X=x$, the conditional law spreads mass uniformly over the vertical slice $0<Y<x$, so the conditional average is the midpoint of that slice.
[/example]
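The closed forms $f_X(x)=2x$ and $\mathbb E[Y\mid X=x]=x/2$ can be sanity-checked numerically. The sketch below approximates the integrals by a midpoint Riemann sum; the function names and the number of cells are arbitrary illustrative choices, and the output is an approximation, not an exact evaluation.

```python
def joint_density(x, y):
    """f_{X,Y}(x, y) = 2 on the triangle 0 < y < x < 1, and 0 elsewhere."""
    return 2.0 if 0.0 < y < x < 1.0 else 0.0


def integrate(f, a, b, n=20_000):
    """Midpoint Riemann sum of f over [a, b] with n equal cells."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def marginal_x(x):
    # The joint density vanishes for y outside (0, 1), so integrate there.
    return integrate(lambda y: joint_density(x, y), 0.0, 1.0)


def cond_mean_y(x):
    """Approximate E[Y | X = x] = (integral of y f(x, y) dy) / f_X(x)."""
    fx = marginal_x(x)
    if fx <= 0.0:
        raise ValueError("conditional density is undefined where f_X(x) = 0")
    return integrate(lambda y: y * joint_density(x, y), 0.0, 1.0) / fx


for x in (0.25, 0.5, 0.8):
    print(x, marginal_x(x), cond_mean_y(x))
```

The guard in `cond_mean_y` reflects the definition: the conditional density is only specified at points where the chosen marginal density is positive.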
## Conditioning and Information
### Sigma-Algebras as Information
Conditional probability becomes more conceptual when the information is not a single event but a collection of events that can be distinguished. A sigma-algebra represents information: if an event belongs to the sigma-algebra, then the information available is enough to decide whether that event occurred.
[definition: Conditioning Sigma-Algebra]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A conditioning sigma-algebra is a sub-$\sigma$-algebra $\mathcal G \subset \mathcal F$.
[/definition]
This definition is intentionally spare. The mathematical content lies in the interpretation: $\mathcal G$ is the family of events whose truth is known at a given stage. To condition an integrable random variable on this information, we need a replacement for pointwise averaging that only uses events visible to $\mathcal G$.
[definition: Conditional Expectation Given a Sigma-Algebra]
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space, let $X: \Omega \to \mathbb R$ be integrable, and let $\mathcal G \subset \mathcal F$ be a sub-$\sigma$-algebra. A conditional expectation of $X$ given $\mathcal G$ is an integrable, $\mathcal G$-measurable random variable $Z: \Omega \to \mathbb R$ such that
\begin{align*}
\int_G Z\, d\mathbb P = \int_G X\, d\mathbb P
\end{align*}
for every $G \in \mathcal G$.
[/definition]
The definition no longer gives a pointwise formula. Instead, it specifies the integrals that a conditioned random variable must have over every event whose occurrence is known at the information level. The next result is included as an advanced orientation theorem: it explains why the notation $\mathbb E[X \mid \mathcal G]$ is legitimate, but its proof belongs to the measure-theoretic development using Radon-Nikodym theory rather than to the elementary ratio calculus of this page.
[quotetheorem:1147]
This theorem is the gateway from elementary conditional probability to martingales. In advanced probability, the expression $\mathbb E[X \mid \mathcal G]$ is not a number but a $\mathcal G$-measurable random variable, determined almost surely by the information encoded in $\mathcal G$. A finite sigma-algebra gives the most concrete model.
[example: A Finite Sigma-Algebra]
Let $\Omega=\{1,2,3,4\}$ with the uniform probability measure, so each point has probability $1/4$, and let
\begin{align*}
\mathcal G = \{\varnothing, \{1,2\}, \{3,4\}, \Omega\}.
\end{align*}
Define $X: \Omega \to \mathbb R$ by
\begin{align*}
X(1)&=1, & X(2)&=5, & X(3)&=2, & X(4)&=10.
\end{align*}
The nonempty atoms of $\mathcal G$ are $\{1,2\}$ and $\{3,4\}$, so a $\mathcal G$-measurable random variable must be constant on each of these two sets. Define $Z:\Omega\to\mathbb R$ by
\begin{align*}
Z(1)&=3, & Z(2)&=3, & Z(3)&=6, & Z(4)&=6.
\end{align*}
Then $Z$ is $\mathcal G$-measurable because it is constant on $\{1,2\}$ and constant on $\{3,4\}$.
To verify that $Z$ is a conditional expectation of $X$ given $\mathcal G$, we check the defining integral identity on every event in $\mathcal G$. For the empty set,
\begin{align*}
\int_{\varnothing} Z\,d\mathbb P
&=0
=\int_{\varnothing} X\,d\mathbb P.
\end{align*}
On the first atom,
\begin{align*}
\int_{\{1,2\}} Z\,d\mathbb P
&=Z(1)\mathbb P(\{1\})+Z(2)\mathbb P(\{2\})\\
&=3\cdot \frac{1}{4}+3\cdot \frac{1}{4}\\
&=\frac{3}{4}+\frac{3}{4}\\
&=\frac{6}{4}\\
&=\frac{3}{2},
\end{align*}
while
\begin{align*}
\int_{\{1,2\}} X\,d\mathbb P
&=X(1)\mathbb P(\{1\})+X(2)\mathbb P(\{2\})\\
&=1\cdot \frac{1}{4}+5\cdot \frac{1}{4}\\
&=\frac{1}{4}+\frac{5}{4}\\
&=\frac{6}{4}\\
&=\frac{3}{2}.
\end{align*}
On the second atom,
\begin{align*}
\int_{\{3,4\}} Z\,d\mathbb P
&=Z(3)\mathbb P(\{3\})+Z(4)\mathbb P(\{4\})\\
&=6\cdot \frac{1}{4}+6\cdot \frac{1}{4}\\
&=\frac{6}{4}+\frac{6}{4}\\
&=\frac{12}{4}\\
&=3,
\end{align*}
while
\begin{align*}
\int_{\{3,4\}} X\,d\mathbb P
&=X(3)\mathbb P(\{3\})+X(4)\mathbb P(\{4\})\\
&=2\cdot \frac{1}{4}+10\cdot \frac{1}{4}\\
&=\frac{2}{4}+\frac{10}{4}\\
&=\frac{12}{4}\\
&=3.
\end{align*}
Finally, on $\Omega$,
\begin{align*}
\int_{\Omega} Z\,d\mathbb P
&=3\cdot \frac{1}{4}+3\cdot \frac{1}{4}+6\cdot \frac{1}{4}+6\cdot \frac{1}{4}\\
&=\frac{3+3+6+6}{4}\\
&=\frac{18}{4}\\
&=\frac{9}{2},
\end{align*}
and
\begin{align*}
\int_{\Omega} X\,d\mathbb P
&=1\cdot \frac{1}{4}+5\cdot \frac{1}{4}+2\cdot \frac{1}{4}+10\cdot \frac{1}{4}\\
&=\frac{1+5+2+10}{4}\\
&=\frac{18}{4}\\
&=\frac{9}{2}.
\end{align*}
Thus
\begin{align*}
\mathbb E[X \mid \mathcal G](1)&=3, & \mathbb E[X \mid \mathcal G](2)&=3, \\
\mathbb E[X \mid \mathcal G](3)&=6, & \mathbb E[X \mid \mathcal G](4)&=6.
\end{align*}
The values are the averages of $X$ on the information cells:
\begin{align*}
3&=\frac{1+5}{2},&
6&=\frac{2+10}{2}.
\end{align*}
Conditioning on $\mathcal G$ forgets the distinction between points inside the same cell and keeps only the average visible at that information level.
[/example]
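For a sigma-algebra generated by a finite partition, the verification above can be automated: average $X$ over each atom, then test the defining identity on every union of atoms. The following is a minimal sketch for this specific four-point space, with illustrative names `atoms`, `integral`, and `events`.

```python
from fractions import Fraction
from itertools import combinations

P = {w: Fraction(1, 4) for w in (1, 2, 3, 4)}    # uniform measure on Omega
X = {1: 1, 2: 5, 3: 2, 4: 10}
atoms = [frozenset({1, 2}), frozenset({3, 4})]   # cells generating G


def integral(f, event):
    """Integral of f over `event` with respect to P (a finite sum)."""
    return sum(f[w] * P[w] for w in event)


# On each atom A, Z takes the constant value E[X 1_A] / P(A).
Z = {}
for atom in atoms:
    value = integral(X, atom) / sum(P[w] for w in atom)
    for w in atom:
        Z[w] = value

# Every event in G is a union of atoms (including the empty union).
events = [
    frozenset().union(*chosen)
    for r in range(len(atoms) + 1)
    for chosen in combinations(atoms, r)
]

# Defining identity: the integrals of Z and X agree on every event of G.
for G in events:
    assert integral(Z, G) == integral(X, G)
print(Z)
```

Because $Z$ is built to be constant on atoms, its $\mathcal G$-measurability holds by construction; only the integral identity needs to be checked.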
### Conditioning in Stages
Once conditional expectation is a random variable, information can be organized in layers. If a finer sigma-algebra records more information and a coarser one records less, we need a rule saying that averaging in stages is consistent.
[quotetheorem:1150]
The tower property is the formal reason that repeated updating does not double-count information. It is also the algebraic engine behind martingales, dynamic programming, and many stopping-time arguments.
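On a finite space with nested partitions, averaging in stages can be compared directly with averaging once. The sketch below assumes the usual form of the tower property, $\mathbb E[\mathbb E[X \mid \mathcal G] \mid \mathcal H] = \mathbb E[X \mid \mathcal H]$ for $\mathcal H \subset \mathcal G$; the six-point space, the payoff $X(\omega)=\omega^2$, and the two partitions are hypothetical choices made only for illustration.

```python
from fractions import Fraction

omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}            # hypothetical payoff X(w) = w^2

fine = [{1, 2}, {3, 4}, {5, 6}]          # atoms of the finer sigma-algebra G
coarse = [{1, 2, 3, 4}, {5, 6}]          # atoms of the coarser H (unions of fine atoms)


def cond_exp(f, partition):
    """Conditional expectation of f given the sigma-algebra generated by `partition`."""
    result = {}
    for atom in partition:
        mass = sum(P[w] for w in atom)
        average = sum(f[w] * P[w] for w in atom) / mass
        for w in atom:
            result[w] = average
    return result


staged = cond_exp(cond_exp(X, fine), coarse)   # E[E[X | G] | H]
direct = cond_exp(X, coarse)                   # E[X | H]
print(staged == direct)
```

The comparison uses exact fractions, so equality of the two dictionaries is a genuine check of the identity on this example rather than an approximate one.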
## Beyond and Connected Topics
Conditional probability is the entry point to measure-theoretic probability. In [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure), conditional expectation is developed using sigma-algebras and integration, which allows conditioning on random variables with continuous distributions and on information generated by stochastic processes.
In [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability), conditioning becomes part of the language of filtrations, martingales, Markov processes, and stopping times. A martingale is a process whose conditional expectation at a future time, given the present information, is the present value. The notation $\mathbb E[X_t \mid \mathcal F_s]$ is therefore not an optional refinement; it is the central object.
Bayes' formula connects conditional probability with statistics and inference. In finite models it is a normalized product of prior and likelihood. In continuous models it becomes a statement about densities, and in general Bayesian statistics it is expressed using probability measures and Radon-Nikodym derivatives.
Conditional independence leads toward graphical models and Markov properties. The statement that the future and past are conditionally independent given the present is one way to express the Markov property. This idea becomes a structural principle in stochastic processes and in probabilistic models with many interacting variables.
There is also a useful bridge to number theory. Probabilistic models for residues, random divisibility, and distribution of arithmetic functions often use conditional probability informally to ask how a random integer behaves after congruence information is imposed. The elementary arithmetic background for such examples belongs naturally beside [Cambridge II Number Theory](/page/Cambridge%20II%20Number%20Theory).
## References
Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
Androma, [Cambridge II Number Theory](/page/Cambridge%20II%20Number%20Theory).
Grimmett and Stirzaker, *Probability and Random Processes* (2020).
Williams, *Probability with Martingales* (1991).
Billingsley, *Probability and Measure* (1995).