Fisher–Neyman Sufficiency of the Sample Mean and Sample Covariance for the Multivariate Normal Family

Fisher–Neyman Sufficiency of the Sample Mean and Sample Covariance for the Multivariate Normal Family (Theorem # 4014)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We work on the sample space $(\mathbb{R}^p)^n$ and write the joint density of the sample with respect to Lebesgue measure. The only algebraic point is to decompose the centered quadratic form into a within-sample covariance term and a sample-mean term. This shows that the likelihood ratio between any parameter value and a fixed reference parameter is measurable with respect to $\sigma(\bar{X},S)$. A conditional-expectation argument then proves sufficiency directly from the definition. [/proofplan] [step:Write the joint density on the product sample space] Let \begin{align*} \Theta := \{(\mu,\Sigma) : \mu \in \mathbb{R}^p,\ \Sigma \in \mathbb{R}^{p \times p}\text{ is symmetric positive definite}\} \end{align*} denote the parameter space. Let \begin{align*} \Omega_0 := (\mathbb{R}^p)^n \end{align*} be the sample space, equipped with its Borel $\sigma$-algebra $\mathcal{B}(\Omega_0)$. For $\theta=(\mu,\Sigma)\in\Theta$, let $P_\theta$ denote the joint law of $(X_1,\dots,X_n)$ on $\Omega_0$. Let $\mathcal{L}^{np}$ denote Lebesgue measure on $\Omega_0$. For a point $x=(x_1,\dots,x_n)\in\Omega_0$, define the sample mean map \begin{align*} \bar{x}: \Omega_0 &\to \mathbb{R}^p \\ (x_1,\dots,x_n) &\mapsto \frac{1}{n}\sum_{i=1}^{n}x_i \end{align*} and the sample covariance map \begin{align*} s: \Omega_0 &\to \mathbb{R}^{p\times p} \\ (x_1,\dots,x_n) &\mapsto \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^\top . \end{align*} Then $P_\theta$ has density \begin{align*} f_\theta:\Omega_0 &\to [0,\infty) \\ x &\mapsto (2\pi)^{-np/2}(\det \Sigma)^{-n/2} \exp\left( -\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^\top\Sigma^{-1}(x_i-\mu) \right) \end{align*} with respect to $\mathcal{L}^{np}$. [guided] We first place the statistical problem on a fixed measurable space. The observations form a point $x=(x_1,\dots,x_n)$ in $\Omega_0=(\mathbb{R}^p)^n$, and all parameter values use the same dominating measure, namely Lebesgue measure $\mathcal{L}^{np}$ on this product space. For each parameter $\theta=(\mu,\Sigma)$, where $\Sigma$ is symmetric positive definite, the inverse matrix $\Sigma^{-1}$ exists and $\det\Sigma>0$. Therefore the joint density of independent $\mathcal{N}_p(\mu,\Sigma)$ observations is the product of the $n$ multivariate normal densities: \begin{align*} f_\theta(x) = (2\pi)^{-np/2}(\det \Sigma)^{-n/2} \exp\left( -\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^\top\Sigma^{-1}(x_i-\mu) \right). \end{align*} The statistic in the theorem is the measurable map \begin{align*} T:\Omega_0 &\to \mathbb{R}^p\times \mathbb{R}^{p\times p}\\ x &\mapsto (\bar{x},s(x)). \end{align*} Thus proving sufficiency means proving that the conditional law of the full sample given $T$ can be chosen independently of $\theta$. [/guided] [/step] [step:Decompose the normal quadratic form through the sample mean and covariance] For each $x=(x_1,\dots,x_n)\in\Omega_0$, we have \begin{align*} \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top = \sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^\top + n(\bar{x}-\mu)(\bar{x}-\mu)^\top . \end{align*} Indeed, \begin{align*} x_i-\mu=(x_i-\bar{x})+(\bar{x}-\mu), \end{align*} so expanding the outer product gives \begin{align*} \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top &= \sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^\top \\ &\quad+ \sum_{i=1}^{n}(x_i-\bar{x})(\bar{x}-\mu)^\top \\ &\quad+ \sum_{i=1}^{n}(\bar{x}-\mu)(x_i-\bar{x})^\top \\ &\quad+ \sum_{i=1}^{n}(\bar{x}-\mu)(\bar{x}-\mu)^\top . \end{align*} The two cross terms vanish because \begin{align*} \sum_{i=1}^{n}(x_i-\bar{x})=\sum_{i=1}^{n}x_i-n\bar{x}=0. \end{align*} Since \begin{align*} \sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^\top=(n-1)s(x), \end{align*} we obtain \begin{align*} \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top = (n-1)s(x)+n(\bar{x}-\mu)(\bar{x}-\mu)^\top . \end{align*} [guided] The exponent in the normal likelihood contains a sum of quadratic forms. To expose the statistic $(\bar{x},s(x))$, we rewrite every centered observation around the sample mean: \begin{align*} x_i-\mu=(x_i-\bar{x})+(\bar{x}-\mu). \end{align*} Expanding the outer product and summing over $i$ gives \begin{align*} \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top &= \sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^\top \\ &\quad+ \left(\sum_{i=1}^{n}(x_i-\bar{x})\right)(\bar{x}-\mu)^\top \\ &\quad+ (\bar{x}-\mu)\left(\sum_{i=1}^{n}(x_i-\bar{x})\right)^\top \\ &\quad+ n(\bar{x}-\mu)(\bar{x}-\mu)^\top . \end{align*} The defining property of the sample mean is \begin{align*} \sum_{i=1}^{n}(x_i-\bar{x})=0. \end{align*} Therefore both cross terms vanish. The within-sample term is exactly $(n-1)s(x)$ by the definition of the sample covariance map. Hence \begin{align*} \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top = (n-1)s(x)+n(\bar{x}-\mu)(\bar{x}-\mu)^\top . \end{align*} This is the algebraic reason no additional feature of the data enters the likelihood. [/guided] [/step] [step:Factor the density through the statistic $(\bar{x},s(x))$] For every symmetric positive definite matrix $\Sigma$, the identity \begin{align*} v^\top\Sigma^{-1}v=\operatorname{tr}\left(\Sigma^{-1}vv^\top\right) \end{align*} holds for every $v\in\mathbb{R}^p$. Applying this identity with $v=x_i-\mu$ and summing over $i$ gives \begin{align*} \sum_{i=1}^{n}(x_i-\mu)^\top\Sigma^{-1}(x_i-\mu) = \operatorname{tr}\left( \Sigma^{-1} \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top \right). \end{align*} Using the decomposition from the previous step, \begin{align*} \sum_{i=1}^{n}(x_i-\mu)^\top\Sigma^{-1}(x_i-\mu) = \operatorname{tr}\left( \Sigma^{-1} \left[(n-1)s(x)+n(\bar{x}-\mu)(\bar{x}-\mu)^\top\right] \right). \end{align*} Define, for each $\theta=(\mu,\Sigma)\in\Theta$, the map \begin{align*} g_\theta:\mathbb{R}^p\times\mathbb{R}^{p\times p} &\to [0,\infty)\\ (m,A) &\mapsto (2\pi)^{-np/2}(\det \Sigma)^{-n/2} \exp\left( -\frac{1}{2}\operatorname{tr}\left( \Sigma^{-1}\left[(n-1)A+n(m-\mu)(m-\mu)^\top\right] \right) \right). \end{align*} Then \begin{align*} f_\theta(x)=g_\theta(\bar{x},s(x)) \end{align*} for every $x\in\Omega_0$ and every $\theta\in\Theta$. [guided] We now translate the matrix decomposition into the scalar exponent of the density. For a vector $v\in\mathbb{R}^p$ and a symmetric positive definite matrix $\Sigma$, the scalar quadratic form can be written as a trace: \begin{align*} v^\top\Sigma^{-1}v=\operatorname{tr}(\Sigma^{-1}vv^\top). \end{align*} This identity is useful because the previous step decomposed the matrix sum of outer products. Applying the trace identity to each vector $x_i-\mu$ gives \begin{align*} \sum_{i=1}^{n}(x_i-\mu)^\top\Sigma^{-1}(x_i-\mu) = \operatorname{tr}\left( \Sigma^{-1} \sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top \right). \end{align*} Substituting the decomposition already proved yields \begin{align*} \sum_{i=1}^{n}(x_i-\mu)^\top\Sigma^{-1}(x_i-\mu) = \operatorname{tr}\left( \Sigma^{-1} \left[(n-1)s(x)+n(\bar{x}-\mu)(\bar{x}-\mu)^\top\right] \right). \end{align*} Thus the density depends on the full sample $x$ only through the pair $(\bar{x},s(x))$. Formally, defining \begin{align*} g_\theta(m,A) = (2\pi)^{-np/2}(\det \Sigma)^{-n/2} \exp\left( -\frac{1}{2}\operatorname{tr}\left( \Sigma^{-1}\left[(n-1)A+n(m-\mu)(m-\mu)^\top\right] \right) \right) \end{align*} for $(m,A)\in\mathbb{R}^p\times\mathbb{R}^{p\times p}$, we have \begin{align*} f_\theta(x)=g_\theta(\bar{x},s(x)). \end{align*} So every likelihood ratio between two parameter values is measurable with respect to the $\sigma$-algebra generated by $(\bar{x},s(x))$. [/guided] [/step] [step:Use likelihood-ratio measurability to prove sufficiency] Let $\theta_0=(0,I_p)\in\Theta$, where $I_p$ is the $p\times p$ identity matrix. Since $f_{\theta_0}(x)>0$ for every $x\in\Omega_0$, define the likelihood-ratio map \begin{align*} L_\theta:\Omega_0 &\to (0,\infty)\\ x &\mapsto \frac{f_\theta(x)}{f_{\theta_0}(x)}. \end{align*} Because each density factors through $T(x):=(\bar{x},s(x))$, the map $L_\theta$ is $\sigma(T)$-measurable. Let $A\in\mathcal{B}(\Omega_0)$. Choose a version of the conditional expectation \begin{align*} K_A:\Omega_0 &\to [0,1]\\ x &\mapsto \mathbb{E}_{\theta_0}[\mathbb{1}_A\mid\sigma(T)](x). \end{align*} We claim that the same $K_A$ is a version of $\mathbb{E}_\theta[\mathbb{1}_A\mid\sigma(T)]$ for every $\theta\in\Theta$. Indeed, for every $B\in\sigma(T)$, using $dP_\theta=L_\theta\,dP_{\theta_0}$ and the $\sigma(T)$-measurability of $L_\theta$ and $K_A$, we obtain \begin{align*} \int_B K_A\,dP_\theta &= \int_B K_A L_\theta\,dP_{\theta_0}\\ &= \int_B L_\theta\,\mathbb{E}_{\theta_0}[\mathbb{1}_A\mid\sigma(T)]\,dP_{\theta_0}\\ &= \int_B L_\theta\mathbb{1}_A\,dP_{\theta_0}\\ &= \int_B \mathbb{1}_A\,dP_\theta. \end{align*} Thus $K_A$ is $\sigma(T)$-measurable and satisfies the defining property of $\mathbb{E}_\theta[\mathbb{1}_A\mid\sigma(T)]$, independently of $\theta$. Hence $\sigma(T)$ is sufficient for $\theta$, and therefore $(\bar{X},S)$ is sufficient for $(\mu,\Sigma)$. [guided] To finish, we verify sufficiency directly from conditional expectations. Fix the reference parameter \begin{align*} \theta_0=(0,I_p). \end{align*} The corresponding density $f_{\theta_0}$ is strictly positive everywhere on $\Omega_0$, so for each parameter $\theta\in\Theta$ we can form the likelihood ratio \begin{align*} L_\theta(x)=\frac{f_\theta(x)}{f_{\theta_0}(x)}. \end{align*} The previous step shows that both $f_\theta$ and $f_{\theta_0}$ are functions of $T(x)=(\bar{x},s(x))$. Hence $L_\theta$ is also a function of $T(x)$, and therefore $L_\theta$ is $\sigma(T)$-measurable. Now let $A$ be any Borel subset of $\Omega_0$. Under the reference law $P_{\theta_0}$, choose a version \begin{align*} K_A(x)=\mathbb{E}_{\theta_0}[\mathbb{1}_A\mid\sigma(T)](x). \end{align*} This function is $\sigma(T)$-measurable and takes values in $[0,1]$. We show that the same function works as the conditional probability of $A$ given $T$ under every parameter value. Let $B\in\sigma(T)$. Since $P_\theta$ has Radon-Nikodym derivative $L_\theta$ with respect to $P_{\theta_0}$, we have \begin{align*} \int_B K_A\,dP_\theta = \int_B K_A L_\theta\,dP_{\theta_0}. \end{align*} Because $L_\theta$ is $\sigma(T)$-measurable, the product $K_A L_\theta$ is an admissible $\sigma(T)$-measurable multiplier in the defining identity for conditional expectation under $P_{\theta_0}$. Therefore \begin{align*} \int_B K_A L_\theta\,dP_{\theta_0} = \int_B L_\theta\mathbb{1}_A\,dP_{\theta_0}. \end{align*} Converting back from $P_{\theta_0}$ to $P_\theta$ gives \begin{align*} \int_B L_\theta\mathbb{1}_A\,dP_{\theta_0} = \int_B \mathbb{1}_A\,dP_\theta. \end{align*} Thus \begin{align*} \int_B K_A\,dP_\theta=\int_B \mathbb{1}_A\,dP_\theta \end{align*} for every $B\in\sigma(T)$. This is exactly the defining property of $\mathbb{E}_\theta[\mathbb{1}_A\mid\sigma(T)]$. Since the version $K_A$ was chosen under the fixed reference law and does not depend on $\theta$, the conditional distribution of the full sample given $(\bar{X},S)$ is independent of $(\mu,\Sigma)$. Therefore $(\bar{X},S)$ is sufficient for $(\mu,\Sigma)$. [/guided] [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

Distribution Definition Expectation Definition Orthogonal Equivariance of Covariance Principal Components probability Asymptotic Normality of ARMA Maximum Likelihood Estimators probability Conditional Distribution of a Multivariate Normal Vector probability Causal ARMA Infinite Moving-Average Representation probability Wold Decomposition Theorem probability Gaussian Copula Has Zero Tail Dependence probability Unbiasedness of the Sample Mean and Sample Covariance Matrix probability Noncentral F Distribution of Hotelling's $T^2$ Statistic probability

What brings you to Androma?

Start with a route through the knowledge graph.