[proofplan]
We prove the two directions separately. For the forward direction, assume $T$ is sufficient for $\theta$. By the definition of sufficiency, the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$. We decompose the joint density $f(x; \theta)$ as the product of the marginal density of $T(X)$ and the conditional density of $X$ given $T(X)$, then identify the factorisation. For the reverse direction, assume $f(x; \theta) = g(T(x); \theta)\,h(x)$. We compute the conditional density of $X$ given $T(X) = t$ by summing (or integrating) $f$ over the level set $\{T = t\}$; the factor $g(t; \theta)$ cancels, leaving a quantity that depends only on $x$ and $t$. We give the discrete case first and then indicate the modifications for the absolutely continuous case.
[/proofplan]
[step:Fix the statistical model and the notion of sufficiency]
Let $X = (X_1, \ldots, X_n)$ take values in a sample space $\mathcal{X}$, and let $\{f(\cdot; \theta) : \theta \in \Theta\}$ be a family of joint densities with respect to a dominating measure $\nu$ on $\mathcal{X}$ (counting measure in the discrete case, Lebesgue measure on $\mathbb{R}^n$ in the continuous case). Let $T: \mathcal{X} \to \mathcal{T}$ be a measurable statistic. Recall that $T$ is [sufficient](/page/Sufficient%20Statistic) for $\theta$ if, for every $t$ in the range of $T$, the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$.
[guided]
Before proving anything we need to be precise about what sufficiency means and what a "density" means here. The theorem is stated in generality sufficient to cover both the discrete case (where $f$ is a probability mass function and $\nu$ is counting measure) and the absolutely continuous case (where $f$ is a Lebesgue density). We write integrals $\int \cdot \, d\nu$ throughout; in the discrete case this is a sum, and in the continuous case an integral with respect to Lebesgue measure $\mathcal{L}^n$.
A statistic is a measurable map $T: \mathcal{X} \to \mathcal{T}$. We say $T$ is [sufficient](/page/Sufficient%20Statistic) for $\theta$ if the conditional distribution $\mathbb{P}_\theta(X \in \cdot \mid T(X) = t)$ is the same probability measure for every $\theta$. In other words, once we are told the value $t$ of $T$, the data carry no further information about $\theta$. Equivalently, the conditional density $f(x \mid T(X) = t; \theta)$ does not depend on $\theta$.
The content of the Factorisation Criterion is that this statistical definition, which is intrinsically about conditional distributions, is equivalent to a purely algebraic factorisation of the joint density. We prove both directions.
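As a small illustration of the definition (the Bernoulli model here is an assumed example, not part of the theorem), let $X = (X_1, X_2)$ with $X_1, X_2$ independent Bernoulli($\theta$), $\theta \in (0, 1)$, and compare $T(X) = X_1 + X_2$ with $S(X) = X_1$:
\begin{align*}
\mathbb{P}_\theta(X = (1,0) \mid T(X) = 1) &= \frac{\theta(1-\theta)}{2\theta(1-\theta)} = \frac{1}{2}, \\
\mathbb{P}_\theta(X = (1,1) \mid S(X) = 1) &= \frac{\theta^2}{\theta} = \theta.
\end{align*}
The first conditional probability is free of $\theta$ (the same holds for $t = 0$ and $t = 2$, where the level set is a single point), consistent with $T$ being sufficient. The second depends on $\theta$, so $S$ is not sufficient.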
[/guided]
[/step]
[step:Sufficiency implies factorisation]
Suppose $T$ is sufficient. For any $x \in \mathcal{X}$, let $t = T(x)$. By the definition of conditional density (the chain rule), the joint density factors as
\begin{align*}
f(x; \theta) = f_T(t; \theta) \cdot f(x \mid T(X) = t; \theta),
\end{align*}
where $f_T(t; \theta)$ is the marginal density of $T(X)$ under $\mathbb{P}_\theta$ and $f(\cdot \mid T = t; \theta)$ is the conditional density. By sufficiency, the second factor does not depend on $\theta$; denote it $h(x) := f(x \mid T(X) = t; \theta)$, which is well-defined because $t = T(x)$ is determined by $x$. Setting $g(t; \theta) := f_T(t; \theta)$, we obtain
\begin{align*}
f(x; \theta) = g(T(x); \theta)\, h(x),
\end{align*}
which is the required factorisation.
[guided]
We assume $T$ is sufficient and must produce functions $g: \mathcal{T} \times \Theta \to [0, \infty)$ and $h: \mathcal{X} \to [0, \infty)$ with $f(x; \theta) = g(T(x); \theta)\,h(x)$.
The natural candidates come from the chain rule for densities. For any $x \in \mathcal{X}$, write $t = T(x)$. Because $T$ is a function of $X$, the event $\{X = x\}$ is contained in $\{T(X) = t\}$, so the joint event $\{X = x,\, T(X) = t\}$ is just $\{X = x\}$. The definition of conditional density then gives
\begin{align*}
f(x; \theta) = f_T(t; \theta) \cdot f(x \mid T(X) = t; \theta),
\end{align*}
where $f_T(t; \theta)$ is the marginal of $T$ under $\mathbb{P}_\theta$.
Now we use sufficiency: by definition, $f(x \mid T(X) = t; \theta)$ is the same function of $x$ for every $\theta$. So we may define
\begin{align*}
h(x) &:= f(x \mid T(X) = T(x); \theta), & g(t; \theta) &:= f_T(t; \theta),
\end{align*}
where the definition of $h$ is unambiguous because the right-hand side does not depend on $\theta$. The factorisation $f(x; \theta) = g(T(x); \theta)\,h(x)$ is then a rewriting of the chain rule. This is the forward direction.
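To see the construction in a concrete case, here is a sketch under the assumption (ours, for illustration) that $X_1, \ldots, X_n$ are independent Bernoulli($\theta$) with $\theta \in (0, 1)$ and $T(x) = \sum_i x_i$, which is known to be sufficient in this model. Then $T(X)$ is Binomial($n, \theta$), and the recipe above gives
\begin{align*}
g(t; \theta) &= f_T(t; \theta) = \binom{n}{t}\theta^t(1-\theta)^{n-t}, \\
h(x) &= f(x \mid T(X) = T(x); \theta) = \frac{\theta^{T(x)}(1-\theta)^{n-T(x)}}{\binom{n}{T(x)}\theta^{T(x)}(1-\theta)^{n-T(x)}} = \binom{n}{T(x)}^{-1}, \\
g(T(x); \theta)\,h(x) &= \theta^{T(x)}(1-\theta)^{n-T(x)} = f(x; \theta).
\end{align*}
The pair $(g, h)$ is not unique: moving the binomial coefficient from $g$ to $h$ gives $g(t; \theta) = \theta^t(1-\theta)^{n-t}$ and $h \equiv 1$, which is also a valid factorisation.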
[/guided]
[/step]
[step:Factorisation implies sufficiency, discrete case]
Conversely, suppose $f(x; \theta) = g(T(x); \theta)\,h(x)$ for some non-negative functions $g$ and $h$. We first treat the case where $X$ takes values in a countable set, so that $f$ is a probability mass function.
Fix $\theta \in \Theta$ and $t \in \mathcal{T}$ with $\mathbb{P}_\theta(T(X) = t) > 0$, so that conditioning on $\{T(X) = t\}$ is defined under $\mathbb{P}_\theta$. The marginal of $T(X)$ is obtained by summing the joint mass function over the level set $\{T = t\}$:
\begin{align*}
\mathbb{P}_\theta(T(X) = t) = \sum_{y : T(y) = t} f(y; \theta) = g(t; \theta) \sum_{y : T(y) = t} h(y),
\end{align*}
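where we factored $g(T(y); \theta) = g(t; \theta)$ out of the sum because $T(y) = t$ on the domain of summation. For any $x$ with $T(x) = t$, the event $\{X = x\}$ is contained in $\{T(X) = t\}$, so $\mathbb{P}_\theta(X = x,\, T(X) = t) = \mathbb{P}_\theta(X = x)$ and the definition of conditional probability gives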
\begin{align*}
\mathbb{P}_\theta(X = x \mid T(X) = t)
&= \frac{\mathbb{P}_\theta(X = x)}{\mathbb{P}_\theta(T(X) = t)} \\
&= \frac{g(T(x); \theta)\,h(x)}{g(t; \theta)\,\sum_{y: T(y) = t} h(y)} \\
&= \frac{h(x)}{\sum_{y: T(y) = t} h(y)}.
\end{align*}
In the last step the factor $g(t; \theta)$, which is nonzero because $\mathbb{P}_\theta(T(X) = t) > 0$, cancels between numerator and denominator. The resulting expression depends on $x$ and $t$ but not on $\theta$. For $x$ with $T(x) \ne t$, $\mathbb{P}_\theta(X = x \mid T(X) = t) = 0$, again independently of $\theta$. Hence the conditional distribution of $X$ given $T(X) = t$ is parameter-free, and $T$ is sufficient by definition.
[guided]
We assume the factorisation $f(x; \theta) = g(T(x); \theta)\,h(x)$ and must show that the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$. We do this in the discrete case, where "density" is a probability mass function and conditional probabilities are ratios of probabilities.
The plan is to compute $\mathbb{P}_\theta(X = x \mid T(X) = t)$ directly and verify that all $\theta$-dependence cancels. The ratio is
\begin{align*}
\mathbb{P}_\theta(X = x \mid T(X) = t) = \frac{\mathbb{P}_\theta(X = x,\, T(X) = t)}{\mathbb{P}_\theta(T(X) = t)}.
\end{align*}
The numerator equals $f(x; \theta)$ if $T(x) = t$ and zero otherwise — since $\{X = x\}$ implies $\{T(X) = T(x)\}$. In the former case, we substitute the factorisation:
\begin{align*}
\mathbb{P}_\theta(X = x,\, T(X) = t) = f(x; \theta) = g(T(x); \theta)\,h(x) = g(t; \theta)\,h(x).
\end{align*}
For the denominator we must sum the joint mass function over all points mapped to $t$. Let $\Lambda_t := \{y \in \mathcal{X} : T(y) = t\}$ be the level set. Then
\begin{align*}
\mathbb{P}_\theta(T(X) = t) = \sum_{y \in \Lambda_t} f(y; \theta) = \sum_{y \in \Lambda_t} g(T(y); \theta)\,h(y) = g(t; \theta) \sum_{y \in \Lambda_t} h(y),
\end{align*}
where we pulled $g(t; \theta)$ out of the sum, using that $T(y) = t$ for every $y \in \Lambda_t$. Taking the ratio for $x \in \Lambda_t$:
\begin{align*}
\mathbb{P}_\theta(X = x \mid T(X) = t) = \frac{g(t; \theta)\,h(x)}{g(t; \theta) \sum_{y \in \Lambda_t} h(y)} = \frac{h(x)}{\sum_{y \in \Lambda_t} h(y)}.
\end{align*}
This is the cancellation that makes the theorem work: the $\theta$-dependent factor $g(t; \theta)$ appears identically in numerator and denominator and cancels. The remaining expression depends on $x$ and on the level set $\Lambda_t$ (hence on $t$), but not on $\theta$. By definition, $T$ is sufficient.
For $x \notin \Lambda_t$, the conditional probability is zero — again not depending on $\theta$.
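As a worked instance of this cancellation, assume (for illustration only) that $X_1, \ldots, X_n$ are independent Poisson($\theta$) with $\theta > 0$ and $T(x) = \sum_i x_i$. The joint mass function factorises with $g(t; \theta) = e^{-n\theta}\theta^t$ and $h(x) = \prod_i (x_i!)^{-1}$, and the multinomial theorem evaluates the sum over the level set:
\begin{align*}
f(x; \theta) &= \prod_{i=1}^n \frac{e^{-\theta}\theta^{x_i}}{x_i!} = e^{-n\theta}\theta^{T(x)} \cdot \prod_{i=1}^n \frac{1}{x_i!}, \\
\sum_{y \in \Lambda_t} h(y) &= \sum_{y_1 + \cdots + y_n = t}\ \prod_{i=1}^n \frac{1}{y_i!} = \frac{n^t}{t!}, \\
\mathbb{P}_\theta(X = x \mid T(X) = t) &= \frac{h(x)}{\sum_{y \in \Lambda_t} h(y)} = \frac{t!}{x_1! \cdots x_n!}\, n^{-t} \qquad (x \in \Lambda_t).
\end{align*}
This is the multinomial distribution with $t$ trials and equal cell probabilities $1/n$, which involves no $\theta$.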
[/guided]
[/step]
[step:Factorisation implies sufficiency, absolutely continuous case]
In the absolutely continuous case, sums over level sets are replaced by integrals with respect to an appropriate reference measure on each level set. Let $f_T(t; \theta)$ denote the marginal density of $T(X)$ on $\mathcal{T}$. By the [disintegration of measures](/theorems/971) (or, concretely, by the change-of-variables formula for densities when $T$ is a smooth map with non-degenerate differential), we have
\begin{align*}
f_T(t; \theta) = g(t; \theta) \int_{\{T = t\}} h(y) \, d\mu_t(y),
\end{align*}
where $\mu_t$ is the reference measure on the level set $\{y \in \mathcal{X} : T(y) = t\}$ supplied by the disintegration. For $t$ with $f_T(t; \theta) > 0$ and $x$ with $T(x) = t$, the conditional density of $X$ given $T(X) = t$ (with respect to $\mu_t$) is then
\begin{align*}
f(x \mid T(X) = t; \theta)
= \frac{f(x; \theta)}{f_T(t; \theta)}
= \frac{g(t; \theta)\,h(x)}{g(t; \theta)\,\int_{\{T = t\}} h(y)\,d\mu_t(y)}
= \frac{h(x)}{\int_{\{T = t\}} h(y)\,d\mu_t(y)}.
\end{align*}
Again the factor $g(t; \theta)$ cancels, leaving a quantity independent of $\theta$. Hence $T$ is sufficient.
[guided]
The continuous case works exactly as in the discrete case, with sums replaced by integrals over the level set $\{T = t\}$ against the appropriate conditional measure. The cleanest way to phrase this is via the change-of-variables / disintegration formula: if $X$ has density $f(\cdot; \theta)$ with respect to Lebesgue measure $\mathcal{L}^n$, and $T: \mathcal{X} \to \mathcal{T}$ is a smooth map with non-degenerate differential, then $T(X)$ has marginal density
\begin{align*}
f_T(t; \theta) = \int_{\{T = t\}} f(y; \theta)\,d\mu_t(y) = \int_{\{T = t\}} g(t; \theta)\,h(y)\,d\mu_t(y) = g(t; \theta)\int_{\{T = t\}} h(y)\,d\mu_t(y),
\end{align*}
where $\mu_t$ is the natural reference measure on the level set (e.g., when $T$ is smooth, the Hausdorff measure of appropriate dimension weighted by the reciprocal of the Jacobian factor of $T$, as in the coarea formula). Pulling $g(t; \theta)$ out of the integral uses that $T(y) = t$ on the level set, exactly as in the discrete argument.
The conditional density is the ratio
\begin{align*}
f(x \mid T(X) = t; \theta) = \frac{f(x; \theta)}{f_T(t; \theta)} = \frac{g(t; \theta)\,h(x)}{g(t; \theta)\int_{\{T = t\}} h(y)\,d\mu_t(y)} = \frac{h(x)}{\int_{\{T = t\}} h(y)\,d\mu_t(y)}.
\end{align*}
The $g(t; \theta)$ factors cancel identically, leaving a quantity depending on $x$ and $t$ but not on $\theta$. Therefore $T$ is sufficient.
The measure-theoretic subtleties (null sets, regular conditional distributions) are the standard ones for continuous conditioning; they do not affect the algebra of cancellation, which is the actual content of the theorem.
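A minimal worked example, assuming (for illustration) $n = 2$, $X_1, X_2$ independent exponential with rate $\theta > 0$, and $T(x) = x_1 + x_2$: here $g(t; \theta) = \theta^2 e^{-\theta t}$ and $h(x) = \mathbf{1}\{x_1 > 0,\, x_2 > 0\}$. Parametrising the level set $\{T = t\}$ by $y_1$ (so that $y_2 = t - y_1$, a change of variables with unit Jacobian), we may take $\mu_t$ to be Lebesgue measure in $y_1$, and
\begin{align*}
f_T(t; \theta) &= g(t; \theta) \int_{\{T = t\}} h(y)\,d\mu_t(y) = \theta^2 e^{-\theta t} \int_0^t dy_1 = \theta^2\, t\, e^{-\theta t}, \\
f(x \mid T(X) = t; \theta) &= \frac{\theta^2 e^{-\theta t}}{\theta^2\, t\, e^{-\theta t}} = \frac{1}{t} \qquad (0 < x_1 < t,\ x_2 = t - x_1).
\end{align*}
The first line recovers the Gamma($2, \theta$) density of $T(X)$, and the second says that, given $T(X) = t$, $X_1$ is uniform on $(0, t)$ whatever the value of $\theta$.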
[/guided]
[/step]
[step:Combine the two directions to conclude]
The forward and reverse implications together establish the equivalence
\begin{align*}
T \text{ is sufficient for } \theta \iff f(x; \theta) = g(T(x); \theta)\,h(x) \text{ for some } g, h \ge 0.
\end{align*}
This is the Factorisation Criterion.
[/step]