Pearson's Chi-Squared Test — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] For the simple null, the proof has three ingredients. First, a multivariate CLT turns the centred and scaled count vector $W_n := (O_i - e_i)/\sqrt n$ into a multivariate normal with a rank-$(k-1)$ covariance matrix $\Sigma$ determined by the multinomial structure. Second, the Pearson statistic is the squared length of $W_n$ in the diagonal metric $D^{-1}$ where $D := \operatorname{diag}(\tilde p_1, \ldots, \tilde p_k)$, and the key algebraic identity $D^{-1/2} \Sigma D^{-1/2} = I_k - \sqrt{\tilde p}\sqrt{\tilde p}^\top$ shows this limiting covariance, viewed in the rescaled coordinates, is the orthogonal projection onto the $(k-1)$-dimensional hyperplane $\sqrt{\tilde p}^\perp$. Third, a projected multivariate standard normal onto a $(k-1)$-dimensional subspace has squared norm distributed as $\chi^2_{k-1}$. The composite-null extension applies Wilks' theorem to the $q$-parameter submodel, subtracting $q$ extra degrees of freedom. [/proofplan] [step:Set up the centred and scaled count vector and compute its limiting distribution] Let $(N_1, \ldots, N_k) \sim \mathrm{Multinomial}(n; \tilde p_1, \ldots, \tilde p_k)$ under $H_0$, with $\tilde p_i > 0$ for all $i$ and $\sum_{i=1}^k \tilde p_i = 1$. Write $O_i := N_i$ for the observed count, $e_i := n \tilde p_i$ for the expected count, and define the centred and scaled count vector \begin{align*} W_n: \mathbb{R}^n &\to \mathbb{R}^k, \\ (X_1, \ldots, X_n) &\mapsto \left(\frac{O_1 - e_1}{\sqrt n}, \ldots, \frac{O_k - e_k}{\sqrt n}\right)^\top, \end{align*} where each $X_j$ is an i.i.d. categorical draw with $\mathbb{P}(X_j = i) = \tilde p_i$. Representing $X_j$ by the indicator vector $e_{X_j} \in \mathbb{R}^k$ (the standard basis vector at coordinate $X_j$), \begin{align*} W_n = \frac{1}{\sqrt n}\sum_{j=1}^n (e_{X_j} - \tilde{p}), \qquad \tilde{p} := (\tilde p_1, \ldots, \tilde p_k)^\top. \end{align*} The summands are i.i.d. with $\mathbb{E}[e_{X_j}] = \tilde{p}$ and covariance matrix \begin{align*} \Sigma &:= \operatorname{Cov}(e_{X_j}) \in \mathbb{R}^{k \times k}, & \Sigma_{i\ell} &= \begin{cases} \tilde p_i(1 - \tilde p_i), & i = \ell, \\ -\tilde p_i \tilde p_\ell, & i \ne \ell. \end{cases} \end{align*} In compact form, $\Sigma = D - \tilde{p}\tilde{p}^\top$ where $D := \operatorname{diag}(\tilde p_1, \ldots, \tilde p_k)$. The multivariate Central Limit Theorem applies to the i.i.d. sequence $(e_{X_j} - \tilde{p})_{j \ge 1}$ (mean zero, finite covariance $\Sigma$), giving \begin{align*} W_n \xrightarrow{d} W \sim N_k(\mathbf{0}, \Sigma). \end{align*} [guided] The data $(N_1, \ldots, N_k)$ are the counts of $n$ i.i.d. categorical draws; this is the standard representation of the multinomial as a sum of indicator vectors. We write $X_j \in \{1, \ldots, k\}$ for the category of the $j$-th draw and $e_{X_j}$ for its one-hot encoding in $\mathbb{R}^k$. Then $O_i = \sum_{j=1}^n \mathbb{1}\{X_j = i\} = \sum_{j=1}^n (e_{X_j})_i$. The Pearson statistic is a quadratic form in $O - e$, where $e = n\tilde{p}$. To apply the CLT, centre by $\mathbb{E}[O] = n\tilde{p} = e$ and scale by $\sqrt n$. The result is \begin{align*} W_n = n^{-1/2}\sum_{j=1}^n (e_{X_j} - \tilde{p}), \end{align*} a sample mean of $n$ i.i.d. mean-zero bounded random vectors — precisely the setup of the multivariate CLT. We need the covariance of a single summand $e_{X_j}$. Its components: $\mathbb{E}[(e_{X_j})_i] = \mathbb{P}(X_j = i) = \tilde p_i$. And $(e_{X_j})_i (e_{X_j})_\ell$ equals $\mathbb{1}\{X_j = i = \ell\}$ — nonzero only when $i = \ell$. So $\mathbb{E}[(e_{X_j})_i (e_{X_j})_\ell] = \tilde p_i \mathbb{1}\{i = \ell\}$. Subtracting $\mathbb{E}[(e_{X_j})_i] \mathbb{E}[(e_{X_j})_\ell] = \tilde p_i \tilde p_\ell$ gives \begin{align*} \Sigma_{i\ell} = \tilde p_i \mathbb{1}\{i = \ell\} - \tilde p_i \tilde p_\ell, \end{align*} or $\Sigma = D - \tilde{p}\tilde{p}^\top$ with $D = \operatorname{diag}(\tilde{p})$. The multivariate CLT is applicable: the summands are i.i.d., bounded (so all moments finite), with mean zero and covariance $\Sigma$. It yields $W_n \xrightarrow{d} N_k(\mathbf{0}, \Sigma)$. The covariance $\Sigma$ is singular: $\Sigma \mathbf{1} = D\mathbf{1} - \tilde{p}(\tilde{p}^\top \mathbf{1}) = \tilde{p} - \tilde{p} \cdot 1 = \mathbf{0}$, reflecting the constraint $\sum (O_i - e_i) = 0$. So the limiting normal is supported on the hyperplane $\{w \in \mathbb{R}^k : \mathbf{1}^\top w = 0\}$, a $(k-1)$-dimensional subspace. [/guided] [/step] [step:Rescale to standardise the diagonal and compute the rescaled covariance] Let $\sqrt{\tilde{p}} := (\sqrt{\tilde p_1}, \ldots, \sqrt{\tilde p_k})^\top$ and $D^{1/2} := \operatorname{diag}(\sqrt{\tilde p_1}, \ldots, \sqrt{\tilde p_k})$, $D^{-1/2} := \operatorname{diag}(1/\sqrt{\tilde p_1}, \ldots, 1/\sqrt{\tilde p_k})$, well-defined since each $\tilde p_i > 0$. Define the rescaled vector \begin{align*} V_n &:= D^{-1/2} W_n, & (V_n)_i &= \frac{O_i - e_i}{\sqrt{n \tilde p_i}} = \frac{O_i - e_i}{\sqrt{e_i}}. \end{align*} By linearity, $V_n \xrightarrow{d} D^{-1/2} W \sim N_k(\mathbf{0}, \Sigma')$ where \begin{align*} \Sigma' := D^{-1/2} \Sigma D^{-1/2} = D^{-1/2}(D - \tilde{p}\tilde{p}^\top) D^{-1/2} = I_k - (D^{-1/2}\tilde{p})(D^{-1/2}\tilde{p})^\top. \end{align*} Now $D^{-1/2}\tilde{p}$ has $i$-th entry $\tilde p_i / \sqrt{\tilde p_i} = \sqrt{\tilde p_i}$, so $D^{-1/2}\tilde{p} = \sqrt{\tilde{p}}$, and \begin{align*} \Sigma' = I_k - \sqrt{\tilde{p}}\,\sqrt{\tilde{p}}^\top. \end{align*} This matrix is the orthogonal projection $P_L$ onto $L := \sqrt{\tilde{p}}^\perp$, the $(k-1)$-dimensional hyperplane in $\mathbb{R}^k$ orthogonal to $\sqrt{\tilde{p}}$: indeed $\sqrt{\tilde{p}}$ is a unit vector since $\sum_i (\sqrt{\tilde p_i})^2 = \sum_i \tilde p_i = 1$, so $\sqrt{\tilde{p}}\,\sqrt{\tilde{p}}^\top$ is the orthogonal projection onto $\operatorname{span}(\sqrt{\tilde{p}})$, and $\Sigma' = I_k - \sqrt{\tilde{p}}\,\sqrt{\tilde{p}}^\top$ is its complementary projection onto $L$. [guided] The Pearson statistic in the theorem is a squared norm of $V_n$. Recall $W_n = (O - e)/\sqrt n$ and $V_n = D^{-1/2}W_n$, so \begin{align*} (V_n)_i = \frac{O_i - e_i}{\sqrt n \sqrt{\tilde p_i}} = \frac{O_i - e_i}{\sqrt{n\tilde p_i}} = \frac{O_i - e_i}{\sqrt{e_i}}, \end{align*} using $e_i = n\tilde p_i$. Therefore \begin{align*} X^2 = \sum_{i=1}^k \frac{(O_i - e_i)^2}{e_i} = \sum_{i=1}^k (V_n)_i^2 = \|V_n\|^2, \end{align*} which we will identify as a chi-squared-distributed quantity in the next step. Computing the limiting covariance of $V_n$: linear transformations of a normal distribution transform the covariance by conjugation, $\operatorname{Cov}(Av) = A \operatorname{Cov}(v) A^\top$. Here $A = D^{-1/2}$ (symmetric), so \begin{align*} \Sigma' = D^{-1/2}\Sigma D^{-1/2} = D^{-1/2} D D^{-1/2} - (D^{-1/2}\tilde{p})(D^{-1/2}\tilde{p})^\top = I_k - \sqrt{\tilde{p}}\sqrt{\tilde{p}}^\top. \end{align*} The key identity $D^{-1/2}\tilde{p} = \sqrt{\tilde{p}}$ is why rescaling produces such a clean answer: the vector $\sqrt{\tilde{p}}$ is a unit vector (its squared length is $\sum \tilde p_i = 1$), and $\Sigma'$ is the complement of the rank-one projection onto it — namely the rank-$(k-1)$ projection onto its orthogonal complement $L$. This is the geometric picture we will exploit: the limiting distribution of $V_n$ is a standard Gaussian on $\mathbb{R}^k$ projected onto the $(k-1)$-dimensional hyperplane $L$. Its squared norm is therefore $\chi^2_{k-1}$. [/guided] [/step] [step:Identify the limiting distribution of $\|V_n\|^2$ as $\chi^2_{k-1}$] Let $V := D^{-1/2}W \sim N_k(\mathbf{0}, P_L)$ where $P_L = I_k - \sqrt{\tilde{p}}\,\sqrt{\tilde{p}}^\top$ is the orthogonal projection onto $L = \sqrt{\tilde{p}}^\perp$, a $(k-1)$-dimensional subspace. The map $v \mapsto \|v\|^2$ is continuous on $\mathbb{R}^k$, so by the Continuous Mapping Theorem, \begin{align*} X^2 = \|V_n\|^2 \xrightarrow{d} \|V\|^2. \end{align*} It remains to identify $\|V\|^2$. Since $\operatorname{Cov}(V) = P_L$ and $\mathbb{E}[V] = \mathbf{0}$, the vector $V$ is a centred Gaussian with covariance equal to an orthogonal projection of rank $k-1$. By the [Quadratic Forms and Idempotent Matrices](/theorems/1441) lemma applied to the standard $N_k(\mathbf{0}, I_k)$ distribution (we supply the details next), the squared norm of such a Gaussian is exactly $\chi^2_{k-1}$. Explicitly: let $\{u_1, \ldots, u_{k-1}\}$ be an orthonormal basis of $L$ and extend to an orthonormal basis $\{u_1, \ldots, u_{k-1}, \sqrt{\tilde{p}}\}$ of $\mathbb{R}^k$. Let $Q := [u_1 | \cdots | u_{k-1} | \sqrt{\tilde{p}}] \in \mathbb{R}^{k \times k}$ be the orthogonal matrix whose columns are this basis. Define $Z := Q^\top V \in \mathbb{R}^k$. By orthogonality of $Q$ and the [transformation law for multivariate normals](/theorems/1434), \begin{align*} Z \sim N_k(\mathbf{0}, Q^\top P_L Q). \end{align*} Now $P_L$ acts as the identity on $L = \operatorname{span}(u_1, \ldots, u_{k-1})$ and annihilates $\sqrt{\tilde{p}}$, so $Q^\top P_L Q = \operatorname{diag}(1, \ldots, 1, 0)$ with $k-1$ ones and one zero. Thus \begin{align*} Z_1, \ldots, Z_{k-1} \;&\text{i.i.d.}\; N(0, 1), & Z_k &= 0 \text{ a.s.} \end{align*} Since $Q$ is orthogonal, $\|V\|^2 = \|Q^\top V\|^2 = \|Z\|^2 = \sum_{i=1}^{k-1} Z_i^2 + 0$. This is a sum of $k - 1$ independent squared $N(0,1)$ variables, which by definition has the $\chi^2_{k-1}$ distribution. Therefore \begin{align*} X^2 \xrightarrow{d} \chi^2_{k-1}. \end{align*} For the size of the test: $\chi^2_{k-1}(\alpha)$ is a continuity point of the $\chi^2_{k-1}$ distribution function (the density is continuous on $(0, \infty)$), so \begin{align*} \mathbb{P}_{H_0}(X^2 > \chi^2_{k-1}(\alpha)) \to \mathbb{P}(\chi^2_{k-1} > \chi^2_{k-1}(\alpha)) = \alpha. \end{align*} This establishes the simple-null claim. [guided] We have $V_n \xrightarrow{d} V \sim N_k(\mathbf{0}, P_L)$. Convergence in distribution is preserved under continuous maps, so $\|V_n\|^2 \xrightarrow{d} \|V\|^2$. We must identify the distribution of $\|V\|^2$. A centred Gaussian vector with covariance equal to an orthogonal projection of rank $r$ can be decomposed into $r$ independent standard normals along the range of the projection plus zeros along the kernel. This is a standard diagonalisation: $P_L$ is symmetric idempotent, with eigenvalue $1$ (multiplicity $r = k-1$) on $L$ and eigenvalue $0$ (multiplicity $1$) on $L^\perp = \operatorname{span}(\sqrt{\tilde{p}})$. To make this concrete, choose an orthonormal basis adapted to the split: pick $\{u_1, \ldots, u_{k-1}\}$ orthonormal in $L$ (they exist because $\dim L = k-1$) and use $\sqrt{\tilde{p}}$ as the $k$-th basis vector (it is a unit vector, orthogonal to $L$ by definition of $L$). Stack these columns into an orthogonal matrix $Q$; then $Q^\top P_L Q$ is diagonal with $(k-1)$ ones followed by a single zero — the eigenvalue decomposition of $P_L$. Applying the transformation law for multivariate normals (Theorem 1434), $Z = Q^\top V$ is normal with mean zero and covariance $Q^\top P_L Q = \operatorname{diag}(1, \ldots, 1, 0)$. Componentwise: $Z_1, \ldots, Z_{k-1}$ are i.i.d.\ $N(0,1)$ (uncorrelated normals are independent), and $Z_k$ has variance zero, hence $Z_k = 0$ a.s. Orthogonality preserves Euclidean norm: $\|V\|^2 = \|Q^\top V\|^2 = \|Z\|^2$. So \begin{align*} \|V\|^2 = Z_1^2 + \cdots + Z_{k-1}^2 + 0 \sim \chi^2_{k-1} \end{align*} by the definition of the chi-squared distribution (a sum of $r$ independent squared standard normals is $\chi^2_r$). This is the degree-of-freedom count: the constraint $\sum (O_i - e_i) = 0$ cuts one dimension off the naive $k$-dimensional Gaussian, leaving a $\chi^2_{k-1}$ in place of $\chi^2_k$. [/guided] [/step] [step:Extend to the composite-null case with $q$-dimensional parameter] Now suppose $H_0$ specifies $p_i = p_i(\theta)$ for $\theta \in \Theta^{(q)} \subseteq \mathbb{R}^q$ open, with the map $\theta \mapsto (p_1(\theta), \ldots, p_k(\theta))$ a $C^2$ embedding into the open simplex interior (the Jacobian has full column rank $q$), and let $\hat\theta$ be the MLE of $\theta$ under $H_0$. The modified Pearson statistic is \begin{align*} X^2 = \sum_{i=1}^k \frac{(O_i - n p_i(\hat\theta))^2}{n p_i(\hat\theta)}. \end{align*} We now show that $X^2$ and the generalised likelihood ratio statistic $2\log\Lambda_n$ — comparing the unconstrained multinomial MLE $\hat p_i = O_i/n$ to the constrained MLE $p_i(\hat\theta)$ — differ by $o_{\mathbb{P}}(1)$ under $H_0$. Write $\hat e_i := n p_i(\hat\theta)$. A second-order Taylor expansion of each summand in the log-likelihood ratio gives \begin{align*} \log\frac{O_i}{\hat e_i} = \log\!\left(1 + \frac{O_i - \hat e_i}{\hat e_i}\right) = \frac{O_i - \hat e_i}{\hat e_i} - \frac{(O_i - \hat e_i)^2}{2\hat e_i^2} + O\!\left(\frac{|O_i - \hat e_i|^3}{\hat e_i^3}\right). \end{align*} Multiplying by $2 O_i$ and summing over $i$, the linear terms cancel because $\sum_i O_i = \sum_i \hat e_i = n$, leaving \begin{align*} 2\log\Lambda_n = 2\sum_{i=1}^k O_i \log\frac{O_i}{\hat e_i} = \sum_{i=1}^k \frac{(O_i - \hat e_i)^2}{\hat e_i} + o_{\mathbb{P}}(1) = X^2 + o_{\mathbb{P}}(1), \end{align*} where the cubic remainder is $o_{\mathbb{P}}(1)$ because $\hat p_i = O_i/n \xrightarrow{\mathbb{P}} \tilde p_i > 0$ and $(O_i - \hat e_i)/\hat e_i = O_{\mathbb{P}}(n^{-1/2})$. Hence \begin{align*} X^2 = 2\log\Lambda_n + o_{\mathbb{P}}(1). \end{align*} The unconstrained parameter space (the open $(k-1)$-simplex) has dimension $k-1$, and the constrained parameter set $\{p(\theta): \theta \in \Theta^{(q)}\}$ has dimension $q$. By [Wilks' Theorem](/theorems/1431), \begin{align*} 2\log\Lambda_n \xrightarrow{d} \chi^2_{(k-1) - q}. \end{align*} Slutsky's theorem combined with convergence in distribution transfers the limit to $X^2$: \begin{align*} X^2 \xrightarrow{d} \chi^2_{k-1-q}. \end{align*} The asymptotic size of the test rejecting when $X^2 > \chi^2_{k-1-q}(\alpha)$ is then $\alpha$, by the same continuity-point argument as in the simple-null case. [guided] The composite-null case adds one twist: we no longer know the expected counts $e_i = n\tilde p_i$ exactly, because $\tilde p_i$ depends on an unknown parameter $\theta$ which must be estimated from the data. Substituting $\hat\theta$ (the MLE of $\theta$ under $H_0$) makes the expected counts $\hat e_i = n p_i(\hat\theta)$ themselves random. There are two equivalent routes to the $\chi^2_{k-1-q}$ conclusion. The route via Wilks' theorem is the cleanest. The multinomial log-likelihood, restricted to the $q$-dimensional submodel $\{p_i = p_i(\theta): \theta \in \Theta^{(q)}\}$ embedded in the $(k-1)$-dimensional open simplex, satisfies the hypotheses of Wilks' theorem: smoothness of $\theta \mapsto p(\theta)$, positive Fisher information, interior true parameter. Wilks then gives $2\log\Lambda_n \xrightarrow{d} \chi^2_p$ with $p = \dim(\text{full}) - \dim(\text{null}) = (k-1) - q$. To see $X^2 = 2\log\Lambda_n + o_{\mathbb{P}}(1)$, expand each log-ratio via $\log(1 + x) = x - x^2/2 + O(x^3)$. Set $x_i := (O_i - \hat e_i)/\hat e_i$ so that $\hat p_i = O_i/n$ gives $O_i/\hat e_i = 1 + x_i$. Then \begin{align*} 2\log\Lambda_n &= 2\sum_{i=1}^k O_i \log\frac{O_i}{\hat e_i} = 2\sum_{i=1}^k O_i\!\left(x_i - \frac{x_i^2}{2} + O(x_i^3)\right). \end{align*} Since $\sum_i O_i = \sum_i \hat e_i = n$, we have $\sum_i O_i x_i = \sum_i (O_i - \hat e_i) + \sum_i (O_i - \hat e_i)x_i$, and the first sum vanishes. Retaining the leading quadratic term: \begin{align*} 2\log\Lambda_n = \sum_{i=1}^k \frac{(O_i - \hat e_i)^2}{\hat e_i} + R_n, \end{align*} where $R_n$ collects cubic and higher terms. Under $H_0$, the consistency of $\hat\theta$ gives $p_i(\hat\theta) \xrightarrow{\mathbb{P}} \tilde p_i > 0$, and the CLT gives $(O_i - \hat e_i)/\hat e_i = O_{\mathbb{P}}(n^{-1/2})$. Each cubic term is $O_{\mathbb{P}}(n^{-3/2})$, summed over $k$ fixed categories: $R_n = O_{\mathbb{P}}(n^{-1/2}) = o_{\mathbb{P}}(1)$. Applying Slutsky to combine $2\log\Lambda_n \xrightarrow{d} \chi^2_{k-1-q}$ with $X^2 - 2\log\Lambda_n \xrightarrow{\mathbb{P}} 0$: since adding a $o_{\mathbb{P}}(1)$ to a convergent-in-distribution sequence preserves the limit, \begin{align*} X^2 \xrightarrow{d} \chi^2_{k-1-q}. \end{align*} The degree-of-freedom accounting is transparent: the simple-null gave $k-1$ (unrestricted full simplex minus the constraint $\sum p_i = 1$); each of the $q$ estimated parameters consumes one additional degree of freedom, leaving $k - 1 - q$. [/guided] [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Pearson's Chi-Squared Test (Theorem # 1432)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Pearson's Chi-Squared Test (Theorem # 1432)

Discussion

Proof

Explore Further