A probability model often contains more information than we are in a position to use. A gambler sees the first few tosses but not the future; a statistician observes a noisy measurement but not the hidden parameter; a stochastic process is known up to time $t$ but not beyond it. In each case the same problem appears: replace an unknown random variable by the best prediction that is allowed to depend only on the information currently available.
Ordinary expectation answers a question with no information: what single number best summarizes $X$ before anything is observed? Conditional expectation answers the same question after restricting the observer to a chosen collection of observable events. The answer is not usually a number. It is a new random variable, constant on indistinguishable states and measurable with respect to the information being used.
[example: A Prediction That Cannot See the Whole Outcome]
Let $\nu$ be the uniform probability measure on $\Omega=\{1,2,3,4,5,6\}$, let $X(k)=k$, and suppose the observer is only told whether the die roll is even or odd. The available information is
\begin{align*}
\mathcal G=\{\varnothing,\{1,3,5\},\{2,4,6\},\Omega\}.
\end{align*}
Since the atoms of $\mathcal G$ are $\{1,3,5\}$ and $\{2,4,6\}$, any $\mathcal G$-measurable predictor must be constant on each of these two sets.
On the odd atom,
\begin{align*}
\nu(\{1,3,5\})&=\frac{3}{6}=\frac{1}{2},\\
\int_{\{1,3,5\}} X\,d\nu
&=1\cdot \nu(\{1\})+3\cdot \nu(\{3\})+5\cdot \nu(\{5\})\\
&=1\cdot\frac{1}{6}+3\cdot\frac{1}{6}+5\cdot\frac{1}{6}\\
&=\frac{1+3+5}{6}
=\frac{9}{6}
=\frac{3}{2}.
\end{align*}
Therefore the constant value on $\{1,3,5\}$ must be
\begin{align*}
\frac{\int_{\{1,3,5\}}X\,d\nu}{\nu(\{1,3,5\})}
=\frac{3/2}{1/2}
=3.
\end{align*}
Similarly, on the even atom,
\begin{align*}
\nu(\{2,4,6\})&=\frac{3}{6}=\frac{1}{2},\\
\int_{\{2,4,6\}} X\,d\nu
&=2\cdot\frac{1}{6}+4\cdot\frac{1}{6}+6\cdot\frac{1}{6}\\
&=\frac{2+4+6}{6}
=\frac{12}{6}
=2,
\end{align*}
so the constant value on $\{2,4,6\}$ must be
\begin{align*}
\frac{\int_{\{2,4,6\}}X\,d\nu}{\nu(\{2,4,6\})}
=\frac{2}{1/2}
=4.
\end{align*}
Thus the conditional expectation of $X$ given parity is the random variable
\begin{align*}
Y(k)=
\begin{cases}
3, & k\in \{1,3,5\},\\
4, & k\in \{2,4,6\}.
\end{cases}
\end{align*}
It is $\mathcal G$-measurable because the preimages of its possible values are $\{1,3,5\}$ and $\{2,4,6\}$, both of which belong to $\mathcal G$.
Finally, $Y$ preserves the averages over every observable event. For the odd atom,
\begin{align*}
\int_{\{1,3,5\}}Y\,d\nu
=3\cdot\nu(\{1,3,5\})
=3\cdot\frac{1}{2}
=\frac{3}{2}
=\int_{\{1,3,5\}}X\,d\nu,
\end{align*}
and for the even atom,
\begin{align*}
\int_{\{2,4,6\}}Y\,d\nu
=4\cdot\nu(\{2,4,6\})
=4\cdot\frac{1}{2}
=2
=\int_{\{2,4,6\}}X\,d\nu.
\end{align*}
For $\varnothing$, both integrals are $0$, and for $\Omega$,
\begin{align*}
\int_\Omega Y\,d\nu
&=\int_{\{1,3,5\}}Y\,d\nu+\int_{\{2,4,6\}}Y\,d\nu\\
&=\frac{3}{2}+2
=\frac{7}{2},\\
\int_\Omega X\,d\nu
&=\frac{1+2+3+4+5+6}{6}
=\frac{21}{6}
=\frac{7}{2}.
\end{align*}
The predictor cannot distinguish individual die rolls inside a parity class, so it replaces each class by its average while preserving the total integral on every event the observer can describe.
[/example]
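The integral-preservation condition verified at the end of the example is the key. The observer cannot distinguish points inside an atom of $\mathcal G$, but the total mass of $X$ over every event the observer can describe must be preserved. Conditional expectation is the random variable, unique up to almost sure equality, with exactly those two properties: $\mathcal G$-measurability and preservation of integrals over every event in $\mathcal G$.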
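The cell-averaging recipe used in the die example can be sketched in a few lines of Python. This is only an illustration on a finite sample space: the function name `conditional_expectation` and the dictionary representation of $\nu$ and $X$ are assumptions made for this sketch, and every cell is assumed to have positive probability. Exact fractions are used so that the values $3$ and $4$ are reproduced without rounding.

```python
from fractions import Fraction

def conditional_expectation(prob, X, cells):
    """Return E[X | sigma(cells)] as a dict outcome -> value.

    prob  : dict outcome -> probability (exact Fractions)
    X     : dict outcome -> value of the random variable
    cells : disjoint sets covering the sample space, each of positive mass
    """
    Y = {}
    for cell in cells:
        mass = sum(prob[w] for w in cell)              # nu(cell)
        integral = sum(X[w] * prob[w] for w in cell)   # integral of X over the cell
        for w in cell:
            Y[w] = integral / mass                     # one constant value per cell
    return Y

# Hypothetical encoding of the die example: uniform measure, X(k) = k.
nu = {k: Fraction(1, 6) for k in range(1, 7)}
X = {k: k for k in range(1, 7)}
parity_cells = [{1, 3, 5}, {2, 4, 6}]

Y = conditional_expectation(nu, X, parity_cells)
print(Y[1], Y[2])
```

The returned dictionary is constant on each parity class, which is the finite-space form of $\mathcal G$-measurability.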
## Information and Measurability
Before conditional expectation itself can be defined, the word "information" has to be turned into a mathematical object. The events visible to an observer should be closed under complements and countable unions, because these are the logical operations used to combine observable questions. A smaller collection of events records less information while still supporting those operations.
[definition: Sub-$\sigma$-Algebra]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A sub-$\sigma$-algebra of $\mathcal F$ is a collection $\mathcal G\subset\mathcal F$ such that $(\Omega,\mathcal G)$ is a measurable space.
[/definition]
Once the visible events have been isolated, the next question is which random variables can be computed from them. This motivates the measurability condition that turns dependence on information into a precise statement.
[definition: $\mathcal G$-Measurable Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $\mathcal G\subset\mathcal F$ be a sub-$\sigma$-algebra. A real-valued random variable $Y:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ is $\mathcal G$-measurable if $Y^{-1}(B)\in\mathcal G$ for every Borel set $B\in\mathcal B(\mathbb R)$.
[/definition]
These two definitions separate the information itself from the random variables that can be computed from it. With that language in place, the prediction problem becomes a preservation problem: replace $X$ by a $\mathcal G$-measurable random variable without changing integrals over $\mathcal G$-observable events.
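On a finite sample space whose information is generated by a partition into nonempty cells, $\mathcal G$-measurability reduces to being constant on every cell. The sketch below tests that criterion; the helper name `is_measurable` and the dictionary encoding are assumptions of this illustration, and the shortcut does not apply to general $\sigma$-algebras.

```python
def is_measurable(Y, cells):
    """Check that Y (dict outcome -> value) is constant on every nonempty cell."""
    return all(len({Y[w] for w in cell}) == 1 for cell in cells)

parity_cells = [{1, 3, 5}, {2, 4, 6}]

# The identity X(k) = k separates points inside a parity class.
identity = {k: k for k in range(1, 7)}

# A predictor that only reports one value per parity class.
parity_average = {1: 3, 3: 3, 5: 3, 2: 4, 4: 4, 6: 4}

print(is_measurable(identity, parity_cells))
print(is_measurable(parity_average, parity_cells))
```

The first call reports that the identity is not measurable with respect to parity, while the second predictor passes because its level sets are unions of cells.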
## Definition
The main definition states the prediction problem as a preservation problem. We seek a $\mathcal G$-measurable replacement for $X$ whose integrals over all $\mathcal G$-observable events are the same as those of $X$.
[definition: Conditional Expectation Given a Sub-$\sigma$-Algebra]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $\mathcal G\subset\mathcal F$ be a sub-$\sigma$-algebra, and let $X\in L^1(\Omega,\mathcal F,\mathbb P)$, meaning that $X$ is integrable and $\mathcal F$-measurable. A conditional expectation of $X$ given $\mathcal G$ is a real-valued random variable $Y\in L^1(\Omega,\mathcal G,\mathbb P)$ such that
\begin{align*}
\int_A Y \, d\mathbb P = \int_A X \, d\mathbb P
\end{align*}
for every $A\in\mathcal G$.
[/definition]
The definition asks for a random variable that is visible to the smaller information structure while preserving all integrals over that information. It is not obvious that such a visible representative exists, because the original variable may depend on distinctions that $\mathcal G$ cannot see. The existence question is resolved by treating the integrals of $X$ over $\mathcal G$-events as a measure on the smaller measurable space.
[quotetheorem:1147]
This theorem says that the preservation problem is not merely a useful definition but a well-posed operation. The output is determined only up to almost sure equality, which is exactly the right uniqueness notion for random variables in $L^1$. Conceptually, conditional expectation discards distinctions invisible to $\mathcal G$ while retaining every average that $\mathcal G$ can test. The next step is to connect this sub-$\sigma$-algebra formulation with the more common situation where the available information is generated by an observed random variable.
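For a finite partition the defining identity can be tested on every event of the generated $\sigma$-algebra, since those events are exactly the unions of cells. The following sketch does this by brute force for the parity information on a fair die; the function names are hypothetical and the approach is only feasible for small partitions, because the number of events grows like $2^n$.

```python
from fractions import Fraction
from itertools import combinations

def generated_events(cells):
    """All unions of cells: the sigma-algebra generated by a finite partition."""
    events = []
    for r in range(len(cells) + 1):
        for chosen in combinations(cells, r):
            events.append(frozenset().union(*chosen))
    return events

def integral(f, prob, event):
    """Integral of f over an event of a finite probability space."""
    return sum(f[w] * prob[w] for w in event)

def preserves_integrals(Y, X, prob, cells):
    """Check the defining identity on every event generated by the cells."""
    return all(integral(Y, prob, A) == integral(X, prob, A)
               for A in generated_events(cells))

nu = {k: Fraction(1, 6) for k in range(1, 7)}
X = {k: k for k in range(1, 7)}
cells = [{1, 3, 5}, {2, 4, 6}]

Y = {1: 3, 3: 3, 5: 3, 2: 4, 4: 4, 6: 4}        # cell averages
wrong = {k: 3 for k in range(1, 7)}             # measurable, but wrong on the even cell

print(preserves_integrals(Y, X, nu, cells))
print(preserves_integrals(wrong, X, nu, cells))
```

The second candidate is $\mathcal G$-measurable yet fails the identity, which shows that measurability alone does not determine the conditional expectation.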
## Conditioning on Random Variables
An observer often reports a random variable rather than naming a $\sigma$-algebra. To condition on that observation, the theory keeps exactly the events determined by the observed value.
[definition: Conditional Expectation Given a Random Variable]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, let $X\in L^1(\Omega,\mathcal F,\mathbb P)$, and let $Z:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a random variable. The conditional expectation of $X$ given $Z$ is
\begin{align*}
\mathbb E[X\mid Z] := \mathbb E[X\mid \sigma(Z)].
\end{align*}
[/definition]
After this definition, the phrase “given $Z$” should be read as “given every event determined by $Z$”. This viewpoint is essential when $Z$ is continuous, since events of the form $\{Z=z\}$ often have probability zero and cannot support elementary conditional averages.
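When $Z$ takes finitely many values, the events determined by $Z$ are unions of its level sets, so conditioning on $Z$ amounts to averaging over those level sets. A minimal sketch under that assumption follows; the names `level_sets` and `cond_exp_given`, and the choice $Z(k)=k \bmod 3$ on a fair die, are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction

def level_sets(Z, outcomes):
    """Group outcomes by the observed value of Z."""
    cells = defaultdict(set)
    for w in outcomes:
        cells[Z(w)].add(w)
    return dict(cells)

def cond_exp_given(X, Z, prob):
    """Return (g, Y): g maps a value z to the average of X on {Z = z}, and Y = g(Z)."""
    g = {}
    for z, cell in level_sets(Z, prob).items():
        mass = sum(prob[w] for w in cell)
        g[z] = sum(X(w) * prob[w] for w in cell) / mass
    Y = {w: g[Z(w)] for w in prob}
    return g, Y

prob = {k: Fraction(1, 6) for k in range(1, 7)}
g, Y = cond_exp_given(lambda k: k, lambda k: k % 3, prob)

print(g)
```

The function `g` plays the role of $z\mapsto\mathbb E[X\mid Z=z]$, while `Y` is the random variable $g(Z)$ defined on the sample space.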
## Finite Information and Partitions
The abstract definition becomes concrete when the information has finitely many atoms. This case should be the reader's mental model: conditional expectation averages $X$ over the pieces that the observer can distinguish.
[definition: Finite Information Partition]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A finite information partition is a finite collection $(A_i)_{i=1}^n$ of pairwise disjoint events in $\mathcal F$ such that
\begin{align*}
\Omega=\bigcup_{i=1}^n A_i.
\end{align*}
[/definition]
A partition becomes information once we declare that the observer only knows which cell occurred. On each cell, a predictor that is measurable with respect to the $\sigma$-algebra generated by the partition cannot vary, so the integral-preservation condition forces a single constant value there: whenever the cell has positive probability, that value is the integral of $X$ over the cell divided by the probability of the cell.
The useful next question is whether this informal cell-average rule is exactly the conditional expectation required by the integral identities. The finite formula below makes that bridge explicit: it turns the abstract definition into a computable random variable by assigning one average to each atom of the observed partition.
[quotetheorem:4938]
This formula explains why conditional expectation is a random variable. The answer changes from cell to cell because the available information changes the relevant average.
[example: Two-Step Coin Toss]
Let $\Omega=\{HH,HT,TH,TT\}$ with the uniform probability measure, so each outcome has probability $1/4$. Let $X$ be the number of heads in two tosses:
\begin{align*}
X(HH)=2,\qquad X(HT)=1,\qquad X(TH)=1,\qquad X(TT)=0.
\end{align*}
The information $\mathcal G$ records only the first toss:
\begin{align*}
\mathcal G=\sigma(\{HH,HT\},\{TH,TT\}).
\end{align*}
Thus a $\mathcal G$-measurable predictor must be constant on
\begin{align*}
A=\{HH,HT\},\qquad B=\{TH,TT\}.
\end{align*}
On $A$, we have
\begin{align*}
\mathbb P(A)&=\mathbb P(\{HH\})+\mathbb P(\{HT\})
=\frac{1}{4}+\frac{1}{4}
=\frac{1}{2},\\
\int_A X\,d\mathbb P
&=X(HH)\mathbb P(\{HH\})+X(HT)\mathbb P(\{HT\})\\
&=2\cdot\frac{1}{4}+1\cdot\frac{1}{4}\\
&=\frac{2}{4}+\frac{1}{4}
=\frac{3}{4}.
\end{align*}
Therefore the constant value on $A$ must be
\begin{align*}
\frac{\int_A X\,d\mathbb P}{\mathbb P(A)}
=\frac{3/4}{1/2}
=\frac{3}{4}\cdot 2
=\frac{3}{2}.
\end{align*}
On $B$, we have
\begin{align*}
\mathbb P(B)&=\mathbb P(\{TH\})+\mathbb P(\{TT\})
=\frac{1}{4}+\frac{1}{4}
=\frac{1}{2},\\
\int_B X\,d\mathbb P
&=X(TH)\mathbb P(\{TH\})+X(TT)\mathbb P(\{TT\})\\
&=1\cdot\frac{1}{4}+0\cdot\frac{1}{4}\\
&=\frac{1}{4}.
\end{align*}
Therefore the constant value on $B$ must be
\begin{align*}
\frac{\int_B X\,d\mathbb P}{\mathbb P(B)}
=\frac{1/4}{1/2}
=\frac{1}{4}\cdot 2
=\frac{1}{2}.
\end{align*}
Hence
\begin{align*}
\mathbb E[X\mid\mathcal G]
=\frac{3}{2}\mathbb 1_{\{HH,HT\}}+\frac{1}{2}\mathbb 1_{\{TH,TT\}}.
\end{align*}
This random variable is $\mathcal G$-measurable because its two level sets are exactly $\{HH,HT\}$ and $\{TH,TT\}$, both of which belong to $\mathcal G$.
It also preserves the integral over each observable event. For $A$,
\begin{align*}
\int_A \mathbb E[X\mid\mathcal G]\,d\mathbb P
=\frac{3}{2}\mathbb P(A)
=\frac{3}{2}\cdot\frac{1}{2}
=\frac{3}{4}
=\int_A X\,d\mathbb P,
\end{align*}
and for $B$,
\begin{align*}
\int_B \mathbb E[X\mid\mathcal G]\,d\mathbb P
=\frac{1}{2}\mathbb P(B)
=\frac{1}{2}\cdot\frac{1}{2}
=\frac{1}{4}
=\int_B X\,d\mathbb P.
\end{align*}
For $\varnothing$, both integrals are $0$, and for $\Omega=A\cup B$,
\begin{align*}
\int_\Omega \mathbb E[X\mid\mathcal G]\,d\mathbb P
&=\int_A \mathbb E[X\mid\mathcal G]\,d\mathbb P+\int_B \mathbb E[X\mid\mathcal G]\,d\mathbb P\\
&=\frac{3}{4}+\frac{1}{4}
=1,\\
\int_\Omega X\,d\mathbb P
&=2\cdot\frac{1}{4}+1\cdot\frac{1}{4}+1\cdot\frac{1}{4}+0\cdot\frac{1}{4}\\
&=\frac{2+1+1+0}{4}
=1.
\end{align*}
The conditional expectation keeps exactly the information contained in the first toss and replaces the unseen second toss by its average contribution.
[/example]
Finite partitions also expose a limitation of the elementary formula
\begin{align*}
\mathbb E[X\mid A]=\frac{\mathbb E[X\mathbb 1_A]}{\mathbb P(A)}.
\end{align*}
This formula gives a number after conditioning on a single positive-probability event. Conditional expectation given information must give compatible numbers on all observable events at once.
[example: A Single Event Does Not Describe the Information]
Let $\Omega=\{1,2,3,4\}$ with the uniform probability measure, let $X(k)=k$, and set $A=\{1,2\}$. Since each point has probability $1/4$,
\begin{align*}
\mathbb P(A)
&=\mathbb P(\{1\})+\mathbb P(\{2\})\\
&=\frac{1}{4}+\frac{1}{4}
=\frac{1}{2},
\end{align*}
and
\begin{align*}
\mathbb E[X\mathbb 1_A]
&=X(1)\mathbb P(\{1\})+X(2)\mathbb P(\{2\})+X(3)\mathbb 1_A(3)\mathbb P(\{3\})+X(4)\mathbb 1_A(4)\mathbb P(\{4\})\\
&=1\cdot\frac{1}{4}+2\cdot\frac{1}{4}+3\cdot 0\cdot\frac{1}{4}+4\cdot 0\cdot\frac{1}{4}\\
&=\frac{1}{4}+\frac{2}{4}
=\frac{3}{4}.
\end{align*}
Thus the elementary conditional mean after learning that $A$ occurred is
\begin{align*}
\mathbb E[X\mid A]
=\frac{\mathbb E[X\mathbb 1_A]}{\mathbb P(A)}
=\frac{3/4}{1/2}
=\frac{3}{4}\cdot 2
=\frac{3}{2}.
\end{align*}
The information generated by the event is
\begin{align*}
\sigma(A)=\{\varnothing,A,A^c,\Omega\},
\end{align*}
where $A^c=\{3,4\}$. A $\sigma(A)$-measurable predictor must be constant on $A$ and on $A^c$. On the complementary atom,
\begin{align*}
\mathbb P(A^c)
&=\mathbb P(\{3\})+\mathbb P(\{4\})\\
&=\frac{1}{4}+\frac{1}{4}
=\frac{1}{2},
\end{align*}
and
\begin{align*}
\int_{A^c}X\,d\mathbb P
&=X(3)\mathbb P(\{3\})+X(4)\mathbb P(\{4\})\\
&=3\cdot\frac{1}{4}+4\cdot\frac{1}{4}\\
&=\frac{3+4}{4}
=\frac{7}{4}.
\end{align*}
Therefore the constant value forced on $A^c$ is
\begin{align*}
\frac{\int_{A^c}X\,d\mathbb P}{\mathbb P(A^c)}
=\frac{7/4}{1/2}
=\frac{7}{4}\cdot 2
=\frac{7}{2}.
\end{align*}
Hence
\begin{align*}
\mathbb E[X\mid\sigma(A)]
=\frac{3}{2}\mathbb 1_A+\frac{7}{2}\mathbb 1_{A^c}.
\end{align*}
This random variable is $\sigma(A)$-measurable because its level sets are $A$ and $A^c$, both of which belong to $\sigma(A)$.
It preserves the integral over every event visible to $\sigma(A)$. For $A$,
\begin{align*}
\int_A \mathbb E[X\mid\sigma(A)]\,d\mathbb P
=\frac{3}{2}\mathbb P(A)
=\frac{3}{2}\cdot\frac{1}{2}
=\frac{3}{4}
=\int_A X\,d\mathbb P,
\end{align*}
and for $A^c$,
\begin{align*}
\int_{A^c}\mathbb E[X\mid\sigma(A)]\,d\mathbb P
=\frac{7}{2}\mathbb P(A^c)
=\frac{7}{2}\cdot\frac{1}{2}
=\frac{7}{4}
=\int_{A^c}X\,d\mathbb P.
\end{align*}
For $\varnothing$, both integrals are $0$, and for $\Omega=A\cup A^c$,
\begin{align*}
\int_\Omega \mathbb E[X\mid\sigma(A)]\,d\mathbb P
&=\int_A \mathbb E[X\mid\sigma(A)]\,d\mathbb P+\int_{A^c}\mathbb E[X\mid\sigma(A)]\,d\mathbb P\\
&=\frac{3}{4}+\frac{7}{4}
=\frac{10}{4}
=\frac{5}{2},\\
\int_\Omega X\,d\mathbb P
&=1\cdot\frac{1}{4}+2\cdot\frac{1}{4}+3\cdot\frac{1}{4}+4\cdot\frac{1}{4}\\
&=\frac{1+2+3+4}{4}
=\frac{10}{4}
=\frac{5}{2}.
\end{align*}
The single number $\mathbb E[X\mid A]$ records only the average after learning that $A$ occurred; the conditional expectation given $\sigma(A)$ is the whole prediction rule, with one value on $A$ and another value on $A^c$.
[/example]
[remark: Conditioning Is Not Only Division by Probability]
The expression $\mathbb E[X\mid A]$ for an event $A$ with $\mathbb P(A)>0$ is an elementary conditional mean. The expression $\mathbb E[X\mid\mathcal G]$ is a random variable. They are related, but the second object cannot be reduced to the first unless the whole information structure is specified.
[/remark]
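The distinction between the number $\mathbb E[X\mid A]$ and the random variable $\mathbb E[X\mid\sigma(A)]$ can be made concrete in code. The sketch below reuses the four-point example with $A=\{1,2\}$; the function names are hypothetical, and it assumes $0<\mathbb P(A)<1$ so that both $A$ and $A^c$ have positive probability.

```python
from fractions import Fraction

prob = {k: Fraction(1, 4) for k in range(1, 5)}
X = {k: k for k in prob}
A = {1, 2}

def elementary_mean(X, prob, event):
    """E[X | event] = E[X 1_event] / P(event): a single number."""
    return sum(X[w] * prob[w] for w in event) / sum(prob[w] for w in event)

def cond_exp_sigma(X, prob, event):
    """E[X | sigma(event)]: a value for every outcome, constant on event and its complement."""
    complement = set(prob) - set(event)
    return {w: elementary_mean(X, prob, event if w in event else complement)
            for w in prob}

number = elementary_mean(X, prob, A)   # one number
rule = cond_exp_sigma(X, prob, A)      # a full prediction rule

print(number)
print(rule)
```

The number agrees with the rule on $A$, but the rule also carries the value on $A^c$, which the elementary formula applied to $A$ alone never computes.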
## Conditional Densities and Regression
In applications, information often comes from an observed random variable $Z$. The reader may expect $\mathbb E[X\mid Z=z]$ to be defined by conditioning on the event $\{Z=z\}$, but that event may have probability zero. Conditional densities give a way to recover the familiar formula when the joint distribution has enough regularity.
[definition: Conditional Density]
Let $(X,Z)$ be an $\mathbb R^m\times\mathbb R^n$-valued random vector with joint density $f_{X,Z}$ with respect to $\mathcal L^{m+n}$, where $\mathcal L^k$ denotes $k$-dimensional [Lebesgue measure](/page/Lebesgue%20Measure). Let $f_Z$ be the density of $Z$. For $z\in\mathbb R^n$ with $f_Z(z)>0$, the conditional density of $X$ given $Z=z$ is the function
\begin{align*}
f_{X\mid Z}(\cdot\mid z):\mathbb R^m&\to[0,\infty) \\
x&\mapsto \frac{f_{X,Z}(x,z)}{f_Z(z)}.
\end{align*}
[/definition]
The conditional density is a pointwise object, while conditional expectation given $Z$ is an a.s. class of random variables. That creates a genuine compatibility question: a formula in the variable $z$ must become a measurable, integrable random variable after replacing $z$ by $Z(\omega)$.
The result below supplies this compatibility in the absolutely continuous setting. It identifies the conditional expectation with the function obtained by integrating against the conditional density, provided the regularity needed to make that substitution legitimate is present.
[quotetheorem:4939]
This theorem legitimizes the regression notation $g(z)=\mathbb E[X\mid Z=z]$ in the absolutely continuous setting. The rigorous conditional expectation remains $g(Z)$, a random variable on $\Omega$.
[example: Conditional Mean in a Gaussian Model]
Let $Z\sim\mathcal N(0,1)$, let $\varepsilon\sim\mathcal N(0,\sigma^2)$ with $\sigma>0$, assume $Z$ and $\varepsilon$ are independent, and set
\begin{align*}
X=aZ+\varepsilon
\end{align*}
for a fixed $a\in\mathbb R$. Since $\varepsilon$ is independent of $Z$, conditioning on $Z=z$ leaves the noise density equal to
\begin{align*}
f_\varepsilon(e)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{e^2}{2\sigma^2}\right).
\end{align*}
For fixed $z$, the relation $X=az+\varepsilon$ gives $e=x-az$, so the conditional density of $X$ given $Z=z$ is
\begin{align*}
f_{X\mid Z}(x\mid z)
=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-az)^2}{2\sigma^2}\right).
\end{align*}
Thus the conditional mean function is
\begin{align*}
g(z)
&=\int_{\mathbb R} x f_{X\mid Z}(x\mid z)\,dx\\
&=\int_{\mathbb R}x\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-az)^2}{2\sigma^2}\right)\,dx.
\end{align*}
With $u=x-az$, so that $x=u+az$ and $dx=du$,
\begin{align*}
g(z)
&=\int_{\mathbb R}(u+az)\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{u^2}{2\sigma^2}\right)\,du\\
&=\int_{\mathbb R}u\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{u^2}{2\sigma^2}\right)\,du
+az\int_{\mathbb R}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{u^2}{2\sigma^2}\right)\,du\\
&=0+az\cdot 1\\
&=az.
\end{align*}
The first integral is $0$ because its integrand is odd, and the second integral is $1$ because it is the total mass of the $\mathcal N(0,\sigma^2)$ density. By *[Conditional Expectation from a Conditional Density](/theorems/4939)*, the conditional expectation is $g(Z)$, hence
\begin{align*}
\mathbb E[X\mid Z]=aZ
\end{align*}
$\mathbb P$-a.s. The conditional expectation keeps the linear part determined by the observed value of $Z$ and averages the independent noise to $0$.
[/example]
The same example also gives a warning. If we condition on a noisy observation rather than the signal itself, the best predictor is usually shrunk toward the mean. Conditional expectation is sensitive to the information actually observed.
[example: Linear Gaussian Filtering]
Let $S\sim\mathcal N(0,\tau^2)$ and $N\sim\mathcal N(0,\sigma^2)$ be independent, with $\tau,\sigma>0$, and let
\begin{align*}
Z=S+N.
\end{align*}
We compute the conditional density of $S$ given the observed value $Z=z$. Since $N=Z-S$, independence gives the joint density
\begin{align*}
f_{S,Z}(s,z)
&=f_S(s)f_N(z-s)\\
&=\frac{1}{\tau\sqrt{2\pi}}\exp\left(-\frac{s^2}{2\tau^2}\right)
\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(z-s)^2}{2\sigma^2}\right)\\
&=\frac{1}{2\pi\tau\sigma}
\exp\left(-\frac{s^2}{2\tau^2}-\frac{(z-s)^2}{2\sigma^2}\right).
\end{align*}
Writing $v=\tau^2+\sigma^2$, $m(z)=\frac{\tau^2}{v}z$, and $r=\frac{\tau^2\sigma^2}{v}$, the exponent separates by completing the square:
\begin{align*}
\frac{s^2}{2\tau^2}+\frac{(z-s)^2}{2\sigma^2}
&=\frac{s^2}{2\tau^2}+\frac{z^2-2zs+s^2}{2\sigma^2}\\
&=\frac{v s^2-2\tau^2zs+\tau^2 z^2}{2\tau^2\sigma^2}\\
&=\frac{(s-m(z))^2}{2r}+\frac{z^2}{2v}.
\end{align*}
Indeed,
\begin{align*}
\frac{(s-m(z))^2}{2r}+\frac{z^2}{2v}
&=\frac{v}{2\tau^2\sigma^2}\left(s-\frac{\tau^2}{v}z\right)^2+\frac{z^2}{2v}\\
&=\frac{v s^2}{2\tau^2\sigma^2}-\frac{zs}{\sigma^2}
+\frac{\tau^2 z^2}{2\sigma^2v}+\frac{z^2}{2v}\\
&=\frac{v s^2}{2\tau^2\sigma^2}-\frac{zs}{\sigma^2}
+\frac{z^2}{2\sigma^2}\\
&=\frac{s^2}{2\tau^2}+\frac{z^2-2zs+s^2}{2\sigma^2}.
\end{align*}
Thus, as a function of $s$ with $z$ fixed,
\begin{align*}
f_{S,Z}(s,z)
=\frac{1}{\sqrt{v}\sqrt{2\pi}}\exp\left(-\frac{z^2}{2v}\right)
\frac{1}{\sqrt r\sqrt{2\pi}}\exp\left(-\frac{(s-m(z))^2}{2r}\right).
\end{align*}
The first factor is the density of $Z\sim\mathcal N(0,v)$ at $z$, and the second factor is a normal density in $s$ with mean $m(z)$ and variance $r$. Therefore
\begin{align*}
f_{S\mid Z}(s\mid z)
=\frac{1}{\sqrt r\sqrt{2\pi}}
\exp\left(-\frac{(s-m(z))^2}{2r}\right),
\end{align*}
so the conditional mean function is
\begin{align*}
g(z)
&=\int_{\mathbb R}s f_{S\mid Z}(s\mid z)\,ds\\
&=\int_{\mathbb R}s\frac{1}{\sqrt r\sqrt{2\pi}}
\exp\left(-\frac{(s-m(z))^2}{2r}\right)\,ds.
\end{align*}
With $u=s-m(z)$, so that $s=u+m(z)$ and $ds=du$,
\begin{align*}
g(z)
&=\int_{\mathbb R}(u+m(z))\frac{1}{\sqrt r\sqrt{2\pi}}
\exp\left(-\frac{u^2}{2r}\right)\,du\\
&=\int_{\mathbb R}u\frac{1}{\sqrt r\sqrt{2\pi}}
\exp\left(-\frac{u^2}{2r}\right)\,du
+m(z)\int_{\mathbb R}\frac{1}{\sqrt r\sqrt{2\pi}}
\exp\left(-\frac{u^2}{2r}\right)\,du\\
&=0+m(z)\cdot 1\\
&=\frac{\tau^2}{\tau^2+\sigma^2}z.
\end{align*}
The first integral is $0$ because its integrand is odd, and the second integral is $1$ because it is the total mass of the $\mathcal N(0,r)$ density. By *[Conditional Expectation from a Conditional Density](/theorems/4939)*,
\begin{align*}
\mathbb E[S\mid Z]
=g(Z)
=\frac{\tau^2}{\tau^2+\sigma^2}Z
\end{align*}
$\mathbb P$-a.s. Since $0<\frac{\tau^2}{\tau^2+\sigma^2}<1$, the best predictor shrinks the observation toward $0$, with stronger shrinkage when the noise variance $\sigma^2$ is larger.
[/example]
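The shrinkage factor can also be checked numerically without completing the square, by integrating against the joint density $f_S(s)f_N(z-s)$ directly. The following is a rough sketch using a plain Riemann sum; the function names, the grid width, and the number of grid points are arbitrary choices made for this illustration, not part of the derivation above.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_mean(z, tau2, sigma2, n=20_000):
    """Approximate g(z) = int s f_{S,Z}(s,z) ds / int f_{S,Z}(s,z) ds."""
    span = 10.0 * math.sqrt(tau2 + sigma2) + abs(z)   # generous integration window
    h = 2.0 * span / n
    num = den = 0.0
    for i in range(n + 1):
        s = -span + i * h
        joint = normal_pdf(s, tau2) * normal_pdf(z - s, sigma2)  # f_S(s) f_N(z - s)
        num += s * joint
        den += joint
    return num / den   # the grid step h cancels in the ratio

tau2, sigma2, z = 4.0, 1.0, 1.5
print(posterior_mean(z, tau2, sigma2), tau2 / (tau2 + sigma2) * z)
```

With $\tau^2=4$, $\sigma^2=1$, and $z=1.5$, the closed form gives $\frac{4}{5}\cdot 1.5=1.2$, and the numerical ratio should agree with it to many decimal places.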
## Algebra of Conditional Expectation
Conditional expectation behaves like an averaging operator that respects the information $\mathcal G$. Its algebraic rules are used constantly in probability, statistics, martingales, and stochastic analysis. The first rules record that averaging is linear and preserves order.
[quotetheorem:4904]
Linearity says that conditional expectation preserves the vector-space structure of integrable random variables. Positivity is what makes the operation an average rather than a formal projection alone. The next rule addresses a different question: what happens when a factor is already known under the conditioning information?
[quotetheorem:1151]
The boundedness assumption keeps $XY$ integrable without extra hypotheses. More general versions replace boundedness by appropriate integrability conditions. The theorem formalizes the phrase “known factors come outside the conditional expectation”.
[example: Pulling Out a Known Stake]
Let $R\in L^1(\Omega,\mathcal F,\mathbb P)$ be a random payoff, and let $S$ be a bounded stake chosen using the information $\mathcal G$. Since $S$ is chosen from $\mathcal G$, it is $\mathcal G$-measurable. If $|S|\le M$ for some $M<\infty$, then
\begin{align*}
|SR|\le M|R|,
\end{align*}
so
\begin{align*}
\mathbb E[|SR|]\le M\mathbb E[|R|]<\infty.
\end{align*}
Thus $SR$ is integrable.
By *[Taking Out What Is Known](/theorems/1151)*, the bounded $\mathcal G$-measurable factor $S$ can be pulled outside the conditional expectation:
\begin{align*}
\mathbb E[SR\mid\mathcal G]
=S\mathbb E[R\mid\mathcal G].
\end{align*}
In particular, if the conditional expected return is zero, meaning
\begin{align*}
\mathbb E[R\mid\mathcal G]=0
\end{align*}
$\mathbb P$-a.s., then
\begin{align*}
\mathbb E[SR\mid\mathcal G]
&=S\mathbb E[R\mid\mathcal G]\\
&=S\cdot 0\\
&=0
\end{align*}
$\mathbb P$-a.s. A bounded stake based only on the available information can scale the payoff, but it cannot turn zero conditional expected return into positive conditional expected gain.
[/example]
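The pull-out rule can be verified exactly on a small finite model. In the sketch below the sample space is two coin tosses, $\mathcal G$ records the first toss, and the payoff `R` and stake `S` are made-up numbers chosen for illustration; `cond_exp` is a hypothetical helper that averages over the cells of a partition.

```python
from fractions import Fraction

def cond_exp(f, prob, cells):
    """Average f over each cell of a finite partition (cells of positive mass)."""
    out = {}
    for cell in cells:
        mass = sum(prob[w] for w in cell)
        avg = sum(f[w] * prob[w] for w in cell) / mass
        out.update({w: avg for w in cell})
    return out

prob = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}
cells = [{"HH", "HT"}, {"TH", "TT"}]           # G records the first toss

R = {"HH": 3, "HT": -1, "TH": 1, "TT": -1}     # illustrative payoff
S = {"HH": 5, "HT": 5, "TH": 2, "TT": 2}       # stake depends on the first toss only

SR = {w: S[w] * R[w] for w in prob}
lhs = cond_exp(SR, prob, cells)                # E[SR | G]
ER = cond_exp(R, prob, cells)                  # E[R | G]
rhs = {w: S[w] * ER[w] for w in prob}          # S * E[R | G]

print(lhs == rhs)
```

On the cell where the first toss is tails, the conditional expected payoff is $0$, and the stake of $2$ leaves it at $0$, matching the conclusion of the example.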
The next rule describes what happens when the random variable being conditioned is already known. Averaging should not change information that is already visible.
[quotetheorem:4903]
If the information is already sufficient to know $X$, conditioning returns $X$. The complementary problem is to understand information that has no relation to $X$, where conditioning should add nothing to the unconditional average.
[quotetheorem:1152]
This theorem is not saying that conditioning always produces a simpler formula. It says that irrelevant information is ignored. The difficulty in applications is often recognizing which part of the observed information is relevant.
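A finite illustration, assuming the quoted result is the usual independence rule $\mathbb E[X\mid\mathcal G]=\mathbb E[X]$ when $X$ is independent of $\mathcal G$: with two fair coin tosses, the first toss carries no information about the second, but it does carry information about the total number of heads. The helper `cond_exp` and the variable names are hypothetical.

```python
from fractions import Fraction

def cond_exp(f, prob, cells):
    """Average f over each cell of a finite partition (cells of positive mass)."""
    out = {}
    for cell in cells:
        mass = sum(prob[w] for w in cell)
        avg = sum(f[w] * prob[w] for w in cell) / mass
        out.update({w: avg for w in cell})
    return out

prob = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}
first_toss = [{"HH", "HT"}, {"TH", "TT"}]

second_is_heads = {w: int(w[1] == "H") for w in prob}   # independent of the first toss
heads_count = {w: w.count("H") for w in prob}           # depends on the first toss

irrelevant = cond_exp(second_is_heads, prob, first_toss)
relevant = cond_exp(heads_count, prob, first_toss)

print(set(irrelevant.values()))
print(set(relevant.values()))
```

The first conditional expectation is the constant $1/2$, the unconditional mean, while the second takes the two values $3/2$ and $1/2$ computed in the two-step coin toss example.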
## Towers and Iterated Information
Information often arrives in stages. A filtration in stochastic processes, a sequence of experiments in statistics, and a growing database in Bayesian inference all produce nested $\sigma$-algebras. To state the rule for conditioning through stages, we first name the inclusion pattern.
[definition: Nested Information]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Two sub-$\sigma$-algebras $\mathcal H\subset\mathcal G\subset\mathcal F$ are nested information structures.
[/definition]
The phrase means that every event visible under $\mathcal H$ is also visible under $\mathcal G$. If a prediction is first made using the finer information $\mathcal G$ and is then viewed through the coarser information $\mathcal H$, there is a possible ambiguity: the two-stage average might seem to depend on the intermediate information. The tower property says that no such extra dependence remains once the final information level is only $\mathcal H$.
[quotetheorem:1150]
The tower property is the computational engine behind most conditional expectation calculations. It allows a complicated expectation to be broken into stages that match the information structure of the problem.
[example: Total Expectation Through a First Observation]
Let $X\in L^1(\Omega,\mathcal F,\mathbb P)$ and let $Z$ be a random variable. Write $\mathbb E[X\mid Z]=\mathbb E[X\mid\sigma(Z)]$. If $\mathcal H=\{\varnothing,\Omega\}$ and $\mathcal G=\sigma(Z)$, then $\mathcal H\subset\mathcal G$, so by the *Tower Property*,
\begin{align*}
\mathbb E[\mathbb E[X\mid Z]\mid\mathcal H]
&=\mathbb E[\mathbb E[X\mid\sigma(Z)]\mid\mathcal H]\\
&=\mathbb E[X\mid\mathcal H]
\end{align*}
$\mathbb P$-a.s. Since an $\mathcal H$-measurable random variable is constant, the defining integral condition over $\Omega$ gives
\begin{align*}
\mathbb E[\mathbb E[X\mid Z]\mid\mathcal H]
&=\mathbb E[\mathbb E[X\mid Z]],\\
\mathbb E[X\mid\mathcal H]
&=\mathbb E[X].
\end{align*}
Therefore
\begin{align*}
\mathbb E[\mathbb E[X\mid Z]]
=\mathbb E[X].
\end{align*}
Now suppose $Z$ takes finitely many values $z_1,\dots,z_n$, and set
\begin{align*}
A_i=\{Z=z_i\}.
\end{align*}
The atoms of $\sigma(Z)$ are the nonempty sets among $A_1,\dots,A_n$. For each $i$ with $\mathbb P(A_i)>0$, the conditional expectation is constant on $A_i$, with value
\begin{align*}
c_i
&=\frac{1}{\mathbb P(A_i)}\int_{A_i}X\,d\mathbb P\\
&=\frac{\mathbb E[X\mathbb 1_{A_i}]}{\mathbb P(A_i)}\\
&=\mathbb E[X\mid Z=z_i].
\end{align*}
Hence, up to null atoms,
\begin{align*}
\mathbb E[X\mid Z]
=\sum_{\mathbb P(A_i)>0} c_i\mathbb 1_{A_i}.
\end{align*}
Taking expectations gives
\begin{align*}
\mathbb E[\mathbb E[X\mid Z]]
&=\int_\Omega \sum_{\mathbb P(A_i)>0} c_i\mathbb 1_{A_i}\,d\mathbb P\\
&=\sum_{\mathbb P(A_i)>0} c_i\int_\Omega \mathbb 1_{A_i}\,d\mathbb P\\
&=\sum_{\mathbb P(A_i)>0} c_i\mathbb P(A_i)\\
&=\sum_{\mathbb P(A_i)>0}\mathbb E[X\mid Z=z_i]\mathbb P(Z=z_i).
\end{align*}
Combining this with $\mathbb E[\mathbb E[X\mid Z]]=\mathbb E[X]$ yields
\begin{align*}
\mathbb E[X]
=\sum_{\mathbb P(Z=z)>0}\mathbb E[X\mid Z=z]\mathbb P(Z=z).
\end{align*}
Thus the ordinary expectation is recovered by first averaging inside each observable value of $Z$ and then weighting those conditional averages by the probabilities of the observed values.
[/example]
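Both the tower property and the total expectation formula can be confirmed exactly on a finite space with nested partitions. The sketch below uses a fair die with $X(k)=k^2$; the two partitions and the helper names are assumptions chosen for illustration, with the coarse partition obtained by merging cells of the fine one.

```python
from fractions import Fraction

def cond_exp(f, prob, cells):
    """Average f over each cell of a finite partition (cells of positive mass)."""
    out = {}
    for cell in cells:
        mass = sum(prob[w] for w in cell)
        avg = sum(f[w] * prob[w] for w in cell) / mass
        out.update({w: avg for w in cell})
    return out

def expectation(f, prob):
    return sum(f[w] * prob[w] for w in prob)

prob = {k: Fraction(1, 6) for k in range(1, 7)}
X = {k: k * k for k in prob}
fine = [{1, 2}, {3, 4}, {5, 6}]      # generates G
coarse = [{1, 2, 3, 4}, {5, 6}]      # generates H, which is contained in G

two_stage = cond_exp(cond_exp(X, prob, fine), prob, coarse)   # E[E[X|G] | H]
one_stage = cond_exp(X, prob, coarse)                         # E[X | H]

print(two_stage == one_stage)
print(expectation(cond_exp(X, prob, fine), prob) == expectation(X, prob))
```

Averaging first within the fine cells and then within the coarse cells gives the same result as averaging within the coarse cells directly, and averaging over the whole space recovers $\mathbb E[X]=91/6$.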
Nested information is not symmetric. Conditioning on $\mathcal G$ and then on $\mathcal H$ only collapses to conditioning on $\mathcal H$ when $\mathcal H\subset\mathcal G$. Without nesting, iterated conditioning can depend on the order.
[remark: The Tower Needs Nested Information]
If $\mathcal G$ and $\mathcal H$ are unrelated sub-$\sigma$-algebras, the expressions $\mathbb E[\mathbb E[X\mid\mathcal G]\mid\mathcal H]$ and $\mathbb E[\mathbb E[X\mid\mathcal H]\mid\mathcal G]$ need not agree. The tower property is a theorem about loss of information along an inclusion $\mathcal H\subset\mathcal G$.
[/remark]
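A three-point example is enough to see the order dependence. In the sketch below, the two partitions are chosen so that neither generated $\sigma$-algebra contains the other; the specific space, partitions, and helper name are illustrative assumptions.

```python
from fractions import Fraction

def cond_exp(f, prob, cells):
    """Average f over each cell of a finite partition (cells of positive mass)."""
    out = {}
    for cell in cells:
        mass = sum(prob[w] for w in cell)
        avg = sum(f[w] * prob[w] for w in cell) / mass
        out.update({w: avg for w in cell})
    return out

prob = {k: Fraction(1, 3) for k in (1, 2, 3)}
X = {k: k for k in prob}
G_cells = [{1, 2}, {3}]
H_cells = [{1}, {2, 3}]     # neither partition refines the other

G_then_H = cond_exp(cond_exp(X, prob, G_cells), prob, H_cells)   # E[E[X|G] | H]
H_then_G = cond_exp(cond_exp(X, prob, H_cells), prob, G_cells)   # E[E[X|H] | G]

print(G_then_H)
print(H_then_G)
```

The two iterated averages take the values $(3/2,\,9/4,\,9/4)$ and $(7/4,\,7/4,\,5/2)$ on the points $1,2,3$, so they disagree at every outcome.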
## Projection in $L^2$
For square-integrable random variables, conditional expectation has a geometric interpretation. It is the [orthogonal projection](/theorems/437) of $X$ onto the closed subspace of $\mathcal G$-measurable square-integrable random variables. To make that statement precise, we first identify the subspace of predictions allowed by the information $\mathcal G$.
[definition: $L^2$ Space of Observable Random Variables]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $\mathcal G\subset\mathcal F$ be a sub-$\sigma$-algebra. The $L^2$ space of $\mathcal G$-observable random variables is
\begin{align*}
L^2(\Omega,\mathcal G,\mathbb P)=\{Y\in L^2(\Omega,\mathcal F,\mathbb P):Y\text{ has a }\mathcal G\text{-measurable representative}\}.
\end{align*}
[/definition]
This is a closed subspace of the Hilbert space $L^2(\Omega,\mathcal F,\mathbb P)$. A square-integrable predictor based on $\mathcal G$ must lie in this subspace, and the defining integral identities should force the prediction error to be perpendicular to every observable square-integrable test variable.
The next point is to justify that this orthogonality is not only a useful analogy but exactly characterizes conditional expectation in $L^2$. The [projection theorem](/theorems/1985) supplies that Hilbert-space identification and turns the integral identities into a geometric certificate.
[quotetheorem:3537]
The geometric formulation identifies conditional expectation inside the Hilbert space, including its best-approximation meaning. For probabilistic use, one still wants the same conclusion stated directly in the language of predictors and mean-square loss, where competitors are actual $\mathcal G$-measurable random variables and equality is understood up to almost sure agreement.
This shift matters because the projection theorem speaks about an abstract closed subspace, while applications usually ask a concrete prediction question: if only the information in $\mathcal G$ is available, which random variable should be reported as the estimate of $X$? The orthogonality condition should imply that every competing $\mathcal G$-measurable predictor has an error that decomposes into the unavoidable residual plus an extra squared distance from the conditional expectation. This motivates a second formulation: among all predictors that can only use the information in $\mathcal G$, the conditional expectation should be the unique mean-square optimal choice up to almost sure equality.
[quotetheorem:3621]
This formulation packages the projection result as a prediction rule: once the information is restricted to $\mathcal G$, no other square-integrable $\mathcal G$-measurable predictor improves the mean-square error.
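The error decomposition behind this statement can be checked exactly in the die example with parity information. The sketch below compares the conditional expectation $Y$ (value $3$ on odd rolls, $4$ on even rolls) with a few competing parity-based predictors; the competitors and function names are arbitrary illustrative choices.

```python
from fractions import Fraction

prob = {k: Fraction(1, 6) for k in range(1, 7)}
X = {k: k for k in prob}
Y = {k: 3 if k % 2 else 4 for k in prob}      # E[X | parity] from the die example

def mse(U, V):
    """Mean-square distance E[(U - V)^2]."""
    return sum((U[w] - V[w]) ** 2 * prob[w] for w in prob)

def parity_predictor(odd_value, even_value):
    """A predictor that may only use the parity of the roll."""
    return {k: odd_value if k % 2 else even_value for k in prob}

residual = mse(X, Y)                           # unavoidable error
competitors = [(2, 5), (0, 0), (Fraction(7, 2), Fraction(7, 2))]
gaps = [mse(X, parity_predictor(a, b)) - residual for a, b in competitors]
extra = [mse(Y, parity_predictor(a, b)) for a, b in competitors]

print(residual, gaps == extra)
```

Each competitor's excess error equals its squared distance from $Y$, so no parity-based predictor can beat the residual $8/3$.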
[example: Projection onto Constants]
Let $X\in L^2(\Omega,\mathcal F,\mathbb P)$ and let $\mathcal G=\{\varnothing,\Omega\}$. If $Y$ is $\mathcal G$-measurable and $Y(\omega_1)<Y(\omega_2)$ for two points $\omega_1,\omega_2\in\Omega$, then choosing $t$ with $Y(\omega_1)<t<Y(\omega_2)$ would make
\begin{align*}
Y^{-1}((-\infty,t))
\end{align*}
a nonempty proper subset of $\Omega$, which cannot belong to $\mathcal G$. Hence the $\mathcal G$-measurable real-valued random variables are constants.
Set $m=\mathbb E[X]$. Since $X\in L^2$ and $\mathbb P(\Omega)=1$,
\begin{align*}
\mathbb E[|X|]\le \left(\mathbb E[X^2]\right)^{1/2}<\infty,
\end{align*}
so $m$ is finite. The constant random variable $m$ is $\mathcal G$-measurable and integrable. It satisfies the defining integral condition for conditional expectation: on $\varnothing$,
\begin{align*}
\int_\varnothing m\,d\mathbb P
=0
=\int_\varnothing X\,d\mathbb P,
\end{align*}
and on $\Omega$,
\begin{align*}
\int_\Omega m\,d\mathbb P
=m\mathbb P(\Omega)
=m
=\mathbb E[X]
=\int_\Omega X\,d\mathbb P.
\end{align*}
Thus
\begin{align*}
\mathbb E[X\mid\mathcal G]=\mathbb E[X]
\end{align*}
$\mathbb P$-a.s.
For any constant predictor $c\in\mathbb R$,
\begin{align*}
X-c
&=(X-m)+(m-c),
\end{align*}
so
\begin{align*}
(X-c)^2
&=\left((X-m)+(m-c)\right)^2\\
&=(X-m)^2+2(X-m)(m-c)+(m-c)^2.
\end{align*}
Taking expectations term by term gives
\begin{align*}
\mathbb E[(X-c)^2]
&=\mathbb E[(X-m)^2]+2(m-c)\mathbb E[X-m]+\mathbb E[(m-c)^2]\\
&=\mathbb E[(X-m)^2]+2(m-c)(\mathbb E[X]-m)+(m-c)^2\mathbb P(\Omega)\\
&=\mathbb E[(X-m)^2]+2(m-c)(m-m)+(m-c)^2\\
&=\mathbb E[(X-\mathbb E[X])^2]+(\mathbb E[X]-c)^2.
\end{align*}
Since $(\mathbb E[X]-c)^2\ge0$, the mean-square error is minimized exactly when $c=\mathbb E[X]$. Thus projection onto the constants replaces $X$ by its ordinary mean.
[/example]
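The decomposition in this example can be sanity-checked numerically. The sketch below assumes a hypothetical three-point distribution for $X$ (the values and probabilities in `dist` are chosen only for illustration) and verifies in exact arithmetic that $\mathbb E[(X-c)^2]=\mathbb E[(X-m)^2]+(m-c)^2$ over a grid of constants $c$, so the minimizer on the grid is the mean.

```python
from fractions import Fraction

# Hypothetical finite distribution of X: value -> probability.
dist = {
    Fraction(0): Fraction(1, 2),
    Fraction(1): Fraction(1, 4),
    Fraction(4): Fraction(1, 4),
}


def expect(f):
    """E[f(X)] for the finite distribution above."""
    return sum(f(x) * p for x, p in dist.items())


m = expect(lambda x: x)                     # ordinary mean E[X]
variance = expect(lambda x: (x - m) ** 2)   # residual error E[(X - m)^2]


def mse_constant(c):
    """Mean-square error of the constant predictor c."""
    return expect(lambda x: (x - c) ** 2)


# Grid of constant predictors from -2 to 4 in steps of 1/4.
candidates = [Fraction(n, 4) for n in range(-8, 17)]
for c in candidates:
    assert mse_constant(c) == variance + (m - c) ** 2

best_c = min(candidates, key=mse_constant)   # grid minimizer; equals m here
```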
## Martingales and Dynamic Prediction
### Time-Indexed Information
The theory becomes especially powerful when the information varies with time. Conditional expectation describes fair prediction under the information available at each time, which is the foundation of martingales. The first object needed is a time-indexed family of $\sigma$-algebras that records what is observable at each time.
[definition: Filtration]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and let $T\subset[0,\infty)$ or $T\subset\mathbb N$. A filtration indexed by $T$ is a family $(\mathcal F_t)_{t\in T}$ of sub-$\sigma$-algebras of $\mathcal F$ such that $\mathcal F_s\subset\mathcal F_t$ whenever $s\le t$.
[/definition]
### Adapted Processes
A filtration records the growth of information, but it does not yet say how a random process is related to that information. To condition on the present state of a process, the value at time $t$ must already be visible through $\mathcal F_t$.
This observability requirement is called adaptedness. It rules out processes whose value at time $t$ depends on information that appears only later, so that $X_t$ can be treated as known whenever predictions are made from $\mathcal F_t$.
[definition: Adapted Process]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space with filtration $(\mathcal F_t)_{t\in T}$. A real-valued stochastic process $(X_t)_{t\in T}$ is adapted to $(\mathcal F_t)_{t\in T}$ if, for every $t\in T$, the map $X_t:\Omega\to\mathbb R$ is $\mathcal F_t$-measurable, that is, $X_t^{-1}(B)\in\mathcal F_t$ for every $B\in\mathcal B(\mathbb R)$.
[/definition]
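On a finite sample space both definitions reduce to statements about partitions, which makes them easy to test. The following is a minimal sketch, assuming three coin tosses coded as $\pm1$ and the filtration generated by the tosses seen so far; the helpers `partition`, `refines`, and `is_measurable` are illustrative names. A $\sigma$-algebra is represented by its atoms, inclusion of $\sigma$-algebras by refinement of partitions, and measurability by being constant on atoms.

```python
from itertools import product

# Sample space: all sequences of three coin tosses, coded as +1 / -1.
omega = list(product([1, -1], repeat=3))


def partition(n):
    """Atoms of F_n = sigma(first n tosses): outcomes sharing their first n tosses."""
    atoms = {}
    for w in omega:
        atoms.setdefault(w[:n], []).append(w)
    return [frozenset(a) for a in atoms.values()]


def refines(fine, coarse):
    """True if every atom of `fine` lies inside some atom of `coarse`."""
    return all(any(a <= b for b in coarse) for a in fine)


def is_measurable(f, atoms):
    """On a finite space, f is measurable iff it is constant on each atom."""
    return all(len({f(w) for w in atom}) == 1 for atom in atoms)


filtration = [partition(n) for n in range(4)]   # F_0, F_1, F_2, F_3

# F_n subset of F_{n+1} corresponds to the later partition refining the earlier one.
nested = all(refines(filtration[n + 1], filtration[n]) for n in range(3))


def S(n):
    """Partial sum of the first n tosses, viewed as a function on omega."""
    return lambda w: sum(w[:n])


# Each S_n depends only on the first n tosses, so the process is adapted.
adapted = all(is_measurable(S(n), filtration[n]) for n in range(4))

# S_2 needs the second toss, so it is not measurable with respect to F_1.
looks_ahead = not is_measurable(S(2), filtration[1])
```

The last check illustrates what adaptedness excludes: a quantity that depends on the second toss cannot be treated as known at time $1$.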
### Fair Prediction
Adaptedness says that the process does not look into the future, but it does not by itself impose any fairness or no-drift condition. A process could be adapted and still have a predictable upward trend.
The martingale condition adds the missing prediction rule. It requires that, after conditioning a later value on the information currently available, the conditional expectation returns the present value itself.
[definition: Martingale]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space with filtration $(\mathcal F_t)_{t\in T}$. A real-valued stochastic process $(X_t)_{t\in T}$ is a martingale with respect to $(\mathcal F_t)_{t\in T}$ if it is adapted, meaning that $X_t$ is $\mathcal F_t$-measurable for every $t\in T$, if $X_t\in L^1(\Omega,\mathcal F,\mathbb P)$ for every $t\in T$, and if
\begin{align*}
\mathbb E[X_t\mid\mathcal F_s]=X_s
\end{align*}
$\mathbb P$-a.s. whenever $s\le t$.
[/definition]
The definition is compact because conditional expectation carries the entire information structure. It says that the best prediction of the future, using present information, is the present value.
[example: Centered Random Walk]
Let $\xi_1,\xi_2,\dots$ be i.i.d. real-valued random variables with $\mathbb E[|\xi_1|]<\infty$ and $\mathbb E[\xi_1]=0$. Define
\begin{align*}
S_n=\sum_{k=1}^n \xi_k,\qquad \mathcal F_n=\sigma(\xi_1,\dots,\xi_n).
\end{align*}
We show that $(S_n)_{n\in\mathbb N}$ is a martingale with respect to $(\mathcal F_n)_{n\in\mathbb N}$.
First, $S_n$ is $\mathcal F_n$-measurable because each of $\xi_1,\dots,\xi_n$ is $\mathcal F_n$-measurable and finite sums of measurable random variables are measurable. Also,
\begin{align*}
\mathbb E[|S_n|]
&=\mathbb E\left[\left|\sum_{k=1}^n \xi_k\right|\right]\\
&\le \mathbb E\left[\sum_{k=1}^n |\xi_k|\right]\\
&=\sum_{k=1}^n \mathbb E[|\xi_k|]\\
&=\sum_{k=1}^n \mathbb E[|\xi_1|]\\
&=n\mathbb E[|\xi_1|]
<\infty,
\end{align*}
so each $S_n$ is integrable.
Now fix $n<m$ (for $n=m$ the martingale identity is immediate, since $S_n$ is $\mathcal F_n$-measurable). Since
\begin{align*}
S_m
=\sum_{k=1}^m \xi_k
=\sum_{k=1}^n \xi_k+\sum_{k=n+1}^m \xi_k
=S_n+\sum_{k=n+1}^m \xi_k,
\end{align*}
linearity of conditional expectation gives
\begin{align*}
\mathbb E[S_m\mid\mathcal F_n]
&=\mathbb E\left[S_n+\sum_{k=n+1}^m \xi_k\mid\mathcal F_n\right]\\
&=\mathbb E[S_n\mid\mathcal F_n]
+\sum_{k=n+1}^m \mathbb E[\xi_k\mid\mathcal F_n].
\end{align*}
Because $S_n$ is $\mathcal F_n$-measurable and integrable, it is its own conditional expectation given $\mathcal F_n$:
\begin{align*}
\mathbb E[S_n\mid\mathcal F_n]=S_n
\end{align*}
$\mathbb P$-a.s. For each $k>n$, the random variable $\xi_k$ is independent of $\mathcal F_n=\sigma(\xi_1,\dots,\xi_n)$, so conditioning on independent information gives
\begin{align*}
\mathbb E[\xi_k\mid\mathcal F_n]
=\mathbb E[\xi_k]
=\mathbb E[\xi_1]
=0
\end{align*}
$\mathbb P$-a.s. Therefore
\begin{align*}
\mathbb E[S_m\mid\mathcal F_n]
&=S_n+\sum_{k=n+1}^m 0\\
&=S_n.
\end{align*}
Thus the best prediction of the future sum using the first $n$ increments is the current sum: the later increments have conditional mean zero because they are independent of the present information and centered.
[/example]
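For a concrete instance the martingale identity can be verified by exhaustive enumeration. The sketch below assumes fair $\pm1$ steps over a short horizon `N = 4`, one illustrative choice of centered i.i.d. increments; `cond_exp_S` is a hypothetical helper that computes $\mathbb E[S_m\mid\mathcal F_n]$ at a path by averaging $S_m$ over all equally likely paths sharing the same first $n$ steps.

```python
from fractions import Fraction
from itertools import product

N = 4  # horizon; the steps are fair +1 / -1 coin flips
omega = list(product([1, -1], repeat=N))   # all paths, each with probability 2**-N


def S(n, w):
    """Partial sum S_n along the path w (S_0 = 0)."""
    return sum(w[:n])


def cond_exp_S(m, n, w):
    """E[S_m | F_n] at path w: average of S_m over paths agreeing with w up to time n.

    The paths are equally likely, so the conditional expectation on an atom
    of F_n is the plain average over that atom.
    """
    atom = [v for v in omega if v[:n] == w[:n]]
    return sum(Fraction(S(m, v)) for v in atom) / len(atom)


# Check E[S_m | F_n] = S_n for every pair n <= m and every path.
is_martingale = all(
    cond_exp_S(m, n, w) == S(n, w)
    for n in range(N + 1)
    for m in range(n, N + 1)
    for w in omega
)
```

The enumeration mirrors the proof: averaging over an atom of $\mathcal F_n$ keeps the first $n$ steps fixed and averages out the later ones, each of which contributes zero.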
The example presents martingales through values, but in many processes the easier quantities to inspect are the increments from one time to the next. The value condition says that no future level has a predictable excess over the present level; subtracting the present value should therefore leave an increment whose conditional mean is zero. This gives a practical test for absence of predictable drift.
[quotetheorem:4940]
This form focuses on increments. Future gain has conditional mean zero given present information, so the process has no predictable drift.
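The increment test is easy to run on a finite model. The following sketch assumes the criterion is used in its one-step discrete-time form $\mathbb E[X_{n+1}-X_n\mid\mathcal F_n]=0$ and compares two adapted processes built from fair $\pm1$ steps: the centered walk, and a hypothetical trending walk `trending` that adds $1$ at every step. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

N = 3
omega = list(product([1, -1], repeat=N))   # equally likely paths of fair +1 / -1 steps


def increment_drift(process, n, w):
    """E[X_{n+1} - X_n | F_n] at path w, averaged over paths agreeing with w up to time n."""
    atom = [v for v in omega if v[:n] == w[:n]]
    return sum(Fraction(process(n + 1, v) - process(n, v)) for v in atom) / len(atom)


def max_abs_drift(process):
    """Largest absolute conditional mean increment over all times and paths."""
    return max(abs(increment_drift(process, n, w)) for n in range(N) for w in omega)


def walk(n, w):
    """Centered random walk S_n."""
    return sum(w[:n])


def trending(n, w):
    """Adapted process S_n + n, which gains 1 per step on average."""
    return sum(w[:n]) + n


walk_drift = max_abs_drift(walk)        # no predictable drift
trend_drift = max_abs_drift(trending)   # predictable drift of size 1 per step
```

Both processes are adapted, but only the first passes the increment test, which is the distinction between adaptedness and the martingale property.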
## Beyond and Connected Topics
Conditional expectation is the entry point to [martingales](/page/Martingale). Once a [filtration](/page/Filtration) is present, the tower property becomes the main algebraic rule behind optional stopping, Doob inequalities, convergence theorems, and stochastic integration. In that setting, conditional expectation is not just a way to compute averages; it is the language for saying what a process should predict from its own past.
It is also central in [Bayesian inference](/page/Bayesian%20Inference), where conditioning updates a prior distribution after data are observed. The conditional expectation of a parameter or future observation is the posterior mean, and the distinction between conditioning on an event and conditioning on a $\sigma$-algebra avoids the errors that arise when one conditions naively on events of probability zero. The same distinction leads naturally to [conditional probability](/page/Conditional%20Probability), regular conditional distributions, and the problem of making pointwise conditioning rigorous.
In statistics, conditional expectation is the mathematical form of regression. Composing the regression function $z\mapsto\mathbb E[X\mid Z=z]$ with $Z$ gives the best mean-square predictor of $X$ among square-integrable functions of $Z$ whenever $X\in L^2$, connecting this page to least squares, Gaussian conditioning, and prediction theory. The $L^2$ viewpoint also links conditional expectation to [Hilbert space](/page/Hilbert%20Space) geometry, where prediction becomes orthogonal projection onto the subspace of observable random variables.
In measure theory, conditional expectation is a Radon-Nikodym derivative on a smaller $\sigma$-algebra. This connects it to [absolute continuity](/page/Absolute%20Continuity), [disintegration of measures](/theorems/971), regular conditional probabilities, and conditional distributions on standard Borel spaces.
In [Cambridge III Stochastic Calculus and Applications](/page/Cambridge%20III%20Stochastic%20Calculus%20and%20Applications), conditional expectation drives the construction of Itô integrals and the analysis of adapted processes. The rule that known factors can be pulled outside conditioning is one of the basic mechanisms behind [Itô isometry](/theorems/3544) and martingale representation theorems.
## References
Androma, *Probability Theory Notes*.
Androma, [Martingale](/page/Martingale).
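Androma, [Filtration](/page/Filtration).
Androma, [Bayesian Inference](/page/Bayesian%20Inference).
Androma, [Conditional Probability](/page/Conditional%20Probability).
Androma, [Hilbert Space](/page/Hilbert%20Space).
Androma, [Absolute Continuity](/page/Absolute%20Continuity).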
David Williams, *Probability with Martingales* (1991).
Rick Durrett, *Probability: Theory and Examples* (2019).
Patrick Billingsley, *Probability and Measure* (1995).
Olav Kallenberg, *Foundations of Modern Probability* (2021).