Factorisation Criterion — Statement & Proof

Theorem

Let $X$ have joint density $f(\cdot; \theta)$, $\theta \in \Theta$, with respect to a dominating measure $\nu$ on $\mathcal{X}$, and let $T: \mathcal{X} \to \mathcal{T}$ be a measurable statistic. Then $T$ is sufficient for $\theta$ if and only if there exist non-negative functions $g$ and $h$ such that $f(x; \theta) = g(T(x); \theta)\,h(x)$ for all $x \in \mathcal{X}$ and all $\theta \in \Theta$.

Proof

[proofplan] We prove the two directions separately. For the forward direction, assume $T$ is sufficient for $\theta$. By the definition of sufficiency, the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$. We decompose the joint density $f(x; \theta)$ as the product of the marginal density of $T(X)$ and the conditional density of $X$ given $T(X)$, then identify the factorisation. For the reverse direction, assume $f(x; \theta) = g(T(x); \theta)\,h(x)$. We compute the conditional density of $X$ given $T(X) = t$ by summing (or integrating) $f$ over the level set $\{T = t\}$; the factor $g(t; \theta)$ cancels, leaving a quantity that depends only on $x$ and $t$. We give the discrete case first and then indicate the modifications for the absolutely continuous case. [/proofplan] [step:Fix the statistical model and the notion of sufficiency] Let $X = (X_1, \ldots, X_n)$ take values in a sample space $\mathcal{X}$, and let $\{f(\cdot; \theta) : \theta \in \Theta\}$ be a family of joint densities (with respect to a dominating measure $\nu$ on $\mathcal{X}$ — counting measure in the discrete case, Lebesgue measure on $\mathbb{R}^n$ in the continuous case). Let $T: \mathcal{X} \to \mathcal{T}$ be a measurable statistic. Recall that $T$ is [sufficient](/page/Sufficient%20Statistic) for $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$ for any $t$ in the range of $T$. [guided] Before proving anything we need to be precise about what sufficiency means and what a "density" means here. The theorem is stated in generality sufficient to cover both the discrete case (where $f$ is a probability mass function and $\nu$ is counting measure) and the absolutely continuous case (where $f$ is a Lebesgue density). We write integrals $\int \cdot \, d\nu$ throughout; in the discrete case this is a sum, and in the continuous case an integral with respect to Lebesgue measure $\mathcal{L}^n$. 
A statistic is a measurable map $T: \mathcal{X} \to \mathcal{T}$. We say $T$ is [sufficient](/page/Sufficient%20Statistic) for $\theta$ if the conditional distribution $\mathbb{P}_\theta(X \in \cdot \mid T(X) = t)$ is the same probability measure for every $\theta$; in other words, once we are told the value $t$ of $T$, the data carry no further information about $\theta$. Equivalently, the conditional density $f(x \mid T(X) = t; \theta)$ does not depend on $\theta$. The content of the Factorisation Criterion is that this statistical definition, which is intrinsically about conditional distributions, is equivalent to a purely algebraic factorisation of the joint density. We prove both directions. [/guided] [/step] [step:Sufficiency implies factorisation] Suppose $T$ is sufficient. For any $x \in \mathcal{X}$, let $t = T(x)$. By the chain rule for densities (that is, the definition of conditional density), the joint density factors as \begin{align*} f(x; \theta) = f_T(t; \theta) \cdot f(x \mid T(X) = t; \theta), \end{align*} where $f_T(t; \theta)$ is the marginal density of $T(X)$ under $\mathbb{P}_\theta$ and $f(\cdot \mid T(X) = t; \theta)$ is the conditional density of $X$ given $T(X) = t$. By sufficiency, the second factor does not depend on $\theta$; denote it $h(x) := f(x \mid T(X) = T(x); \theta)$, which is well-defined because $t = T(x)$ is determined by $x$ and the right-hand side is the same for every $\theta$. Setting $g(t; \theta) := f_T(t; \theta)$, we obtain \begin{align*} f(x; \theta) = g(T(x); \theta)\, h(x), \end{align*} which is the required factorisation. [guided] We assume $T$ is sufficient and must produce functions $g: \mathcal{T} \times \Theta \to [0, \infty)$ and $h: \mathcal{X} \to [0, \infty)$ with $f(x; \theta) = g(T(x); \theta)\,h(x)$. The natural candidates come from the chain rule for densities. For any $x \in \mathcal{X}$, write $t = T(x)$. Because $T$ is a function of $X$, the event $\{X = x\}$ is contained in $\{T(X) = t\}$, so $\{X = x\} \cap \{T(X) = t\} = \{X = x\}$ and the joint density of the data together with the statistic is just $f(x; \theta)$. 
The chain rule (equivalently, the definition of conditional density) gives \begin{align*} f(x; \theta) = f_T(t; \theta) \cdot f(x \mid T(X) = t; \theta), \end{align*} where $f_T(t; \theta)$ is the marginal density of $T(X)$ under $\mathbb{P}_\theta$. Now we use sufficiency: by definition, $f(x \mid T(X) = t; \theta)$ is the same function of $x$ for every $\theta$. So we may define \begin{align*} h(x) &:= f(x \mid T(X) = T(x); \theta), & g(t; \theta) &:= f_T(t; \theta), \end{align*} where the definition of $h$ is unambiguous because the right-hand side does not depend on $\theta$. The factorisation $f(x; \theta) = g(T(x); \theta)\,h(x)$ is then a rewriting of the chain rule. This is the forward direction. [/guided] [/step] [step:Factorisation implies sufficiency, discrete case] Conversely, suppose $f(x; \theta) = g(T(x); \theta)\,h(x)$ for some non-negative functions $g$ and $h$. We first treat the case where $X$ takes values in a countable set, so that $f$ is a probability mass function. Fix $\theta \in \Theta$ and $t \in \mathcal{T}$ with $\mathbb{P}_\theta(T(X) = t) > 0$, so that conditioning on the event $\{T(X) = t\}$ is elementary. The marginal of $T(X)$ is obtained by summing the joint mass function over the level set $\{T = t\}$: \begin{align*} \mathbb{P}_\theta(T(X) = t) = \sum_{y : T(y) = t} f(y; \theta) = g(t; \theta) \sum_{y : T(y) = t} h(y), \end{align*} where we factored $g(T(y); \theta) = g(t; \theta)$ out of the sum because $T(y) = t$ on the domain of summation. Since the left-hand side is positive, both $g(t; \theta)$ and $\sum_{y : T(y) = t} h(y)$ are strictly positive, so we may divide by them. For any $x$ with $T(x) = t$, the event $\{X = x\}$ is contained in $\{T(X) = t\}$, and the definition of conditional probability gives \begin{align*} \mathbb{P}_\theta(X = x \mid T(X) = t) &= \frac{\mathbb{P}_\theta(X = x)}{\mathbb{P}_\theta(T(X) = t)} \\ &= \frac{g(T(x); \theta)\,h(x)}{g(t; \theta)\,\sum_{y: T(y) = t} h(y)} \\ &= \frac{h(x)}{\sum_{y: T(y) = t} h(y)}. \end{align*} In the last step the factor $g(t; \theta)$ cancels between numerator and denominator. 
The resulting expression depends on $x$ and $t$ but not on $\theta$, so the conditional distribution of $X$ given $T(X) = t$ is parameter-free. By definition, $T$ is sufficient. For $x$ with $T(x) \ne t$, $\mathbb{P}_\theta(X = x \mid T(X) = t) = 0$, independently of $\theta$. [guided] We assume the factorisation $f(x; \theta) = g(T(x); \theta)\,h(x)$ and must show that the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$. We do this in the discrete case, where "density" is a probability mass function and conditional probabilities are ratios of probabilities. The plan is to compute $\mathbb{P}_\theta(X = x \mid T(X) = t)$ directly and verify that all $\theta$-dependence cancels. The ratio is \begin{align*} \mathbb{P}_\theta(X = x \mid T(X) = t) = \frac{\mathbb{P}_\theta(X = x,\, T(X) = t)}{\mathbb{P}_\theta(T(X) = t)}. \end{align*} The numerator equals $f(x; \theta)$ if $T(x) = t$ and zero otherwise — since $\{X = x\}$ implies $\{T(X) = T(x)\}$. In the former case, we substitute the factorisation: \begin{align*} \mathbb{P}_\theta(X = x,\, T(X) = t) = f(x; \theta) = g(T(x); \theta)\,h(x) = g(t; \theta)\,h(x). \end{align*} For the denominator we must sum the joint mass function over all points mapped to $t$. Let $\Lambda_t := \{y \in \mathcal{X} : T(y) = t\}$ be the level set. Then \begin{align*} \mathbb{P}_\theta(T(X) = t) = \sum_{y \in \Lambda_t} f(y; \theta) = \sum_{y \in \Lambda_t} g(T(y); \theta)\,h(y) = g(t; \theta) \sum_{y \in \Lambda_t} h(y), \end{align*} where we pulled $g(t; \theta)$ out of the sum, using that $T(y) = t$ for every $y \in \Lambda_t$. Taking the ratio for $x \in \Lambda_t$: \begin{align*} \mathbb{P}_\theta(X = x \mid T(X) = t) = \frac{g(t; \theta)\,h(x)}{g(t; \theta) \sum_{y \in \Lambda_t} h(y)} = \frac{h(x)}{\sum_{y \in \Lambda_t} h(y)}. \end{align*} This is the cancellation that makes the theorem work: the $\theta$-dependent factor $g(t; \theta)$ appears identically in numerator and denominator and cancels. 
The remaining expression depends on $x$ and on the level set $\Lambda_t$ (hence on $t$), but not on $\theta$. By definition, $T$ is sufficient. For $x \notin \Lambda_t$, the conditional probability is zero, which again does not depend on $\theta$. [/guided] [/step] [step:Factorisation implies sufficiency, absolutely continuous case] In the absolutely continuous case, sums over level sets are replaced by integrals with respect to the appropriate conditional measure. Let $f_T(t; \theta)$ denote the marginal density of $T(X)$ on $\mathcal{T}$. By the [disintegration of measures](/theorems/971) (or, concretely, when $T$ is smooth, by the change-of-variables formula for densities), we have \begin{align*} f_T(t; \theta) = g(t; \theta) \int_{\{T = t\}} h(y) \, d\mu_t(y), \end{align*} where $\mu_t$ is the conditional reference measure on the level set $\{y \in \mathcal{X} : T(y) = t\}$. For $t$ with $f_T(t; \theta) > 0$ and $x$ with $T(x) = t$, the conditional density of $X$ given $T(X) = t$ (with respect to $\mu_t$) is then \begin{align*} f(x \mid T(X) = t; \theta) = \frac{f(x; \theta)}{f_T(t; \theta)} = \frac{g(t; \theta)\,h(x)}{g(t; \theta)\,\int_{\{T = t\}} h(y)\,d\mu_t(y)} = \frac{h(x)}{\int_{\{T = t\}} h(y)\,d\mu_t(y)}. \end{align*} Again the factor $g(t; \theta)$ cancels, leaving a quantity independent of $\theta$. Hence $T$ is sufficient. [guided] The continuous case works exactly as in the discrete case, with sums replaced by integrals over the level set $\{T = t\}$ against the appropriate conditional measure. 
The cleanest way to phrase this is via the change-of-variables / disintegration formula: if $X$ has density $f(\cdot; \theta)$ with respect to Lebesgue measure $\mathcal{L}^n$, and $T: \mathcal{X} \to \mathcal{T}$ is a smooth map with non-degenerate differential, then $T(X)$ has marginal density \begin{align*} f_T(t; \theta) = \int_{\{T = t\}} f(y; \theta)\,d\mu_t(y) = \int_{\{T = t\}} g(t; \theta)\,h(y)\,d\mu_t(y) = g(t; \theta)\int_{\{T = t\}} h(y)\,d\mu_t(y), \end{align*} where $\mu_t$ is the natural reference measure on the level set (e.g., a Hausdorff measure of appropriate dimension when $T$ is smooth). Pulling $g(t; \theta)$ out of the integral uses that $T(y) = t$ on the level set — identical to the discrete argument. The conditional density is the ratio \begin{align*} f(x \mid T(X) = t; \theta) = \frac{f(x; \theta)}{f_T(t; \theta)} = \frac{g(t; \theta)\,h(x)}{g(t; \theta)\int_{\{T = t\}} h(y)\,d\mu_t(y)} = \frac{h(x)}{\int_{\{T = t\}} h(y)\,d\mu_t(y)}. \end{align*} The $g(t; \theta)$ factors cancel identically, leaving a quantity depending on $x$ and $t$ but not on $\theta$. Therefore $T$ is sufficient. The measure-theoretic subtleties (null sets, regular conditional distributions) are the standard ones for continuous conditioning; they do not affect the algebra of cancellation, which is the actual content of the theorem. [/guided] [/step] [step:Combine the two directions to conclude] The forward and reverse implications together establish the equivalence \begin{align*} T \text{ is sufficient for } \theta \iff f(x; \theta) = g(T(x); \theta)\,h(x) \text{ for some } g, h \ge 0. \end{align*} This is the Factorisation Criterion. [/step]
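
As an illustration of the discrete argument, consider an assumed example that is not part of the proof: $X_1, \ldots, X_n$ i.i.d. Bernoulli($\theta$) with $T(x) = \sum_i x_i$. Here $f(x; \theta) = \theta^{T(x)} (1 - \theta)^{n - T(x)}$, so one may take $g(t; \theta) = \theta^t (1 - \theta)^{n - t}$ and $h(x) = 1$, and the proof predicts $\mathbb{P}_\theta(X = x \mid T(X) = t) = 1 / \binom{n}{t}$ for every $\theta$. The following minimal Python sketch checks this by brute-force summation over the level set, and contrasts it with the statistic $T(x) = x_1$, which is not sufficient. The function names `joint_pmf` and `conditional_pmf` and the chosen values of $x$ and $\theta$ are illustrative choices only.

```python
from itertools import product
from math import comb


def joint_pmf(x, theta):
    """Joint pmf of i.i.d. Bernoulli(theta) observations x (a tuple of 0/1)."""
    t = sum(x)
    return theta ** t * (1 - theta) ** (len(x) - t)


def conditional_pmf(x, statistic, theta):
    """P_theta(X = x | T(X) = T(x)), by summing the pmf over the level set of T."""
    t = statistic(x)
    level_set = [y for y in product((0, 1), repeat=len(x)) if statistic(y) == t]
    marginal = sum(joint_pmf(y, theta) for y in level_set)
    return joint_pmf(x, theta) / marginal


x = (1, 0, 1, 0)           # illustrative sample, n = 4 and sum(x) = 2
thetas = (0.2, 0.5, 0.9)   # illustrative parameter values

# T(x) = sum(x): the factor g(t; theta) cancels, leaving h(x) / sum of h = 1 / C(n, t).
given_sum = [conditional_pmf(x, sum, th) for th in thetas]
predicted = 1 / comb(len(x), sum(x))

# T(x) = x[0]: no such cancellation, so the conditional law still varies with theta.
given_first = [conditional_pmf(x, lambda y: y[0], th) for th in thetas]

print("given sum:  ", given_sum, "predicted:", predicted)
print("given first:", given_first)
```

Up to floating-point rounding, every entry of `given_sum` agrees with `predicted`, while the entries of `given_first` equal $\theta (1 - \theta)^2$ and so change with $\theta$.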

Factorisation Criterion (Theorem # 1425)
