Wilks' Theorem — Statement & Proof

Theorem



Proof

[proofplan] The proof is a second-order Taylor expansion of the log-likelihood around the unrestricted MLE, followed by identification of the resulting quadratic form. Write $\Theta \subseteq \mathbb{R}^d$ with $\Theta_0 \subseteq \Theta$ having dimension $d - p$. Under $H_0$, the unrestricted MLE $\hat\theta_n$ and the constrained MLE $\hat\theta_n^{(0)}$ are both consistent for the true parameter $\theta_0 \in \Theta_0$ and satisfy their respective first-order conditions: the full score equation for $\hat\theta_n$, and its component tangent to $\Theta_0$ for $\hat\theta_n^{(0)}$. Taylor-expanding the log-likelihood $\ell_n$ around $\hat\theta_n$, the linear term vanishes (first-order condition) and the quadratic term carries the Hessian $-n\,I(\theta_0) + o_{\mathbb{P}}(n)$. Asymptotic normality of $\sqrt n(\hat\theta_n - \hat\theta_n^{(0)})$ restricted to the $p$-dimensional subspace orthogonal to $\Theta_0$ (relative to $I(\theta_0)$) reduces $2\log\Lambda_n$ to the squared norm of a $p$-dimensional standard normal, which is $\chi^2_p$. [/proofplan] [step:Fix the parameterisation and identify the null manifold] Assume $\Theta \subseteq \mathbb{R}^d$ is open and the model $\{f(\cdot \mid \theta): \theta \in \Theta\}$ satisfies the standard regularity conditions of the Asymptotic Normality of the MLE: the log-density $\log f(x \mid \theta)$ is thrice continuously differentiable in $\theta$ on $\Theta$; the Fisher information \begin{align*} I(\theta) &:= \mathbb{E}_\theta\!\left[\nabla_\theta \log f(X \mid \theta)\, \nabla_\theta \log f(X \mid \theta)^\top\right] \in \mathbb{R}^{d \times d} \end{align*} is finite, continuous, and positive-definite on $\Theta$; the true parameter $\theta_0 \in \Theta_0$ lies in the interior of $\Theta$; and the model is identifiable. Let the null set $\Theta_0 \subseteq \Theta$ be a smooth submanifold of codimension $p$: there exist a $C^2$ map $g: \Theta \to \mathbb{R}^p$ with $\nabla g(\theta_0)$ of full rank $p$, and \begin{align*} \Theta_0 &= \{\theta \in \Theta: g(\theta) = 0\}.
\end{align*} (The condition $|\Theta| - |\Theta_0| = p$ in the theorem statement is the informal dimension count $\dim \Theta - \dim \Theta_0 = p$; this is the precise statement under which the proof proceeds.) Equivalently, by the [Implicit Function Theorem](/theorems/52), near $\theta_0$ there exists a $C^2$ chart $\psi: V \to \Theta$ from an open neighbourhood $V$ of $0 \in \mathbb{R}^d$ such that \begin{align*} \psi(V \cap (\mathbb{R}^{d-p} \times \{0\}^p)) \subseteq \Theta_0. \end{align*} By reparameterisation and a linear change of coordinates realised by the inverse square root of $I(\theta_0)$, we may and do assume $\theta_0 = 0$, $I(\theta_0) = I_d$ (the identity), and $\Theta_0 = (\mathbb{R}^{d-p} \times \{0\}^p) \cap \Theta$ locally near $0$. This orthogonal decomposition — $\mathbb{R}^d = T_{\theta_0}\Theta_0 \oplus N_{\theta_0}\Theta_0$ where the tangent space is $\mathbb{R}^{d-p} \times \{0\}$ and the normal space is $\{0\} \times \mathbb{R}^p$ — is crucial. Write a generic $\theta \in \Theta$ in this chart as $\theta = (\xi, \eta)$ with $\xi \in \mathbb{R}^{d-p}$ and $\eta \in \mathbb{R}^p$. [guided] Wilks' theorem is a statement about a degree-of-freedom count: the null distribution of $2\log\Lambda$ has $p$ degrees of freedom, where $p$ is the *codimension* of the null — the number of constraints $H_0$ places on $\theta$. We need a setup that makes this count geometric. The regularity conditions supply four ingredients: smoothness (for Taylor expansion to terminate at second order with a controlled remainder), invertibility of $I$ (so the quadratic form is non-degenerate), interior point (so perturbations in every direction are legitimate), and identifiability (so MLEs are well-defined). The most conceptually clarifying simplification is to choose coordinates in which $I(\theta_0) = I_d$ and $\Theta_0$ is a coordinate hyperplane near $\theta_0$. 
Both can be arranged:
- Rescale: let $\tilde\theta = I(\theta_0)^{1/2}(\theta - \theta_0)$, equivalently $\theta = \theta_0 + I(\theta_0)^{-1/2}\tilde\theta$. In the new coordinate the Fisher information at zero is $I(\theta_0)^{-1/2}\, I(\theta_0)\, I(\theta_0)^{-1/2} = I_d$.
- Straighten $\Theta_0$ via the implicit function theorem: since $\nabla g(\theta_0)$ has full rank $p$, we can choose a $C^2$ local chart in which $\Theta_0$ is defined by $(\theta_{d-p+1}, \ldots, \theta_d) = 0$.
Composing, we work in coordinates $\theta = (\xi, \eta)$ where $\xi$ is tangent to $\Theta_0$ (free under $H_0$) and $\eta$ is normal to $\Theta_0$ (forced to zero under $H_0$). The directional independence of $I(\theta_0) = I_d$ is what reduces the final quadratic form to an unrotated sum of squares; it is what makes the answer "$p$" rather than "a trace involving projection onto the normal space of $\Theta_0$ in the $I$-inner product". [/guided] [/step] [step:Set up both MLEs and derive their score expansions] Write the log-likelihood \begin{align*} \ell_n: \Theta &\to \mathbb{R}, \\ \theta &\mapsto \sum_{i=1}^n \log f(X_i \mid \theta), \end{align*} the score $s_n(\theta) := \nabla \ell_n(\theta)$, and the observed information $\mathcal{J}_n(\theta) := -\nabla^2 \ell_n(\theta)$. Let \begin{align*} \hat\theta_n &:= \arg\max_{\theta \in \Theta} \ell_n(\theta), & \hat\theta_n^{(0)} &:= \arg\max_{\theta \in \Theta_0} \ell_n(\theta), \end{align*} both of which exist and are consistent for $\theta_0 = 0$ by the regularity assumptions. The generalised likelihood ratio statistic is \begin{align*} 2\log\Lambda_n &= 2\left[\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)})\right]. \end{align*} By the [Weak Law of Large Numbers](/theorems/1127) applied componentwise, the regularity hypothesis that second derivatives of $\log f$ have integrable envelopes near $\theta_0$, and the information equality $-\mathbb{E}_{\theta_0}\!\left[\nabla^2_\theta \log f(X \mid \theta_0)\right] = I(\theta_0)$, \begin{align*} n^{-1}\mathcal{J}_n(\theta_0) \xrightarrow{\mathbb{P}} I(\theta_0) = I_d. \end{align*} By the [Central Limit Theorem](/theorems/521) applied to the i.i.d. 
mean-zero score contributions, \begin{align*} n^{-1/2} s_n(\theta_0) \xrightarrow{d} Z \sim N_d(0, I_d). \end{align*} At the unrestricted MLE, $s_n(\hat\theta_n) = 0$ by the first-order condition (interior maximum). At the constrained MLE, by Lagrange multipliers applied to the constraint $\eta = 0$, we have $s_n^{(\xi)}(\hat\theta_n^{(0)}) = 0$ — the score has zero component in the tangent directions — while the normal component may be nonzero. Writing $\hat\theta_n^{(0)} = (\hat\xi_n^{(0)}, 0)$ and decomposing $s_n = (s_n^{(\xi)}, s_n^{(\eta)})$, the constrained first-order conditions read \begin{align*} s_n^{(\xi)}(\hat\xi_n^{(0)}, 0) = 0, \qquad \hat\theta_n^{(0)} \in \Theta_0. \end{align*} [guided] We have two maximisers: $\hat\theta_n$ over the full $\Theta$, and $\hat\theta_n^{(0)}$ over the null manifold $\Theta_0$. Both converge in probability to the true parameter $\theta_0 = 0$ because the true parameter is assumed to lie in $\Theta_0$ under $H_0$. The score $s_n = \nabla \ell_n$ measures the slope of the log-likelihood. At an interior maximum of a smooth function, the score vanishes. So $s_n(\hat\theta_n) = 0$ unconditionally: this is the unrestricted first-order condition. At the constrained maximum, the first-order condition is more subtle. We are maximising $\ell_n$ over $\Theta_0 = \{\eta = 0\}$, so we can vary only the $\xi$ coordinates; the derivative with respect to $\xi$ must vanish at $\hat\theta_n^{(0)}$, but the derivative with respect to $\eta$ need not — we are not free to perturb $\eta$ away from $0$. This is exactly the Lagrange-multiplier condition: $\nabla \ell_n$ must be orthogonal to the feasible directions, which are the $\xi$ directions. Hence $s_n^{(\xi)}(\hat\theta_n^{(0)}) = 0$ and $s_n^{(\eta)}(\hat\theta_n^{(0)})$ is in general nonzero — it is the Lagrange multiplier of the constraint $\eta = 0$. 
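As a concrete illustration (an assumed toy model, not part of the theorem's hypotheses), take $d = 2$, $p = 1$ and $X_i \sim N_2((\xi, \eta), I_2)$ i.i.d., with $H_0: \eta = 0$. Writing $\bar X = (\bar X_1, \bar X_2)$ for the sample mean, the score and both first-order conditions can be worked out in closed form: \begin{align*} s_n(\xi, \eta) &= \bigl(n(\bar X_1 - \xi),\; n(\bar X_2 - \eta)\bigr), \\ \hat\theta_n &= (\bar X_1, \bar X_2), & s_n(\hat\theta_n) &= (0,\, 0), \\ \hat\theta_n^{(0)} &= (\bar X_1, 0), & s_n(\hat\theta_n^{(0)}) &= (0,\; n \bar X_2). \end{align*} In this example the tangential component of the score vanishes at the constrained MLE, while the normal component $n \bar X_2$ is nonzero and of order $\sqrt n$ under $H_0$.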
The CLT and LLN inputs, $n^{-1/2} s_n(\theta_0) \xrightarrow{d} N_d(0, I_d)$ and $n^{-1}\mathcal{J}_n(\theta_0) \xrightarrow{\mathbb{P}} I_d$, are standard under the regularity hypotheses. The score is a sum of i.i.d. vectors with mean zero (the first Bartlett identity) and covariance $I(\theta_0) = I_d$ (the definition of the Fisher information). The observed information, after division by $n$, is the sample mean of the i.i.d. matrices $-\nabla^2_\theta \log f(X_i \mid \theta_0)$, whose common expectation equals $I(\theta_0)$ by the second Bartlett identity (the information equality), so it converges to $I(\theta_0)$ by the LLN. [/guided] [/step] [step:Expand $\ell_n$ to second order around $\hat\theta_n$ and evaluate the difference] Apply the multivariate [Taylor expansion with Integral Remainder](/theorems/189) to $\ell_n$ around $\hat\theta_n$, evaluated at $\hat\theta_n^{(0)}$. Setting $h := \hat\theta_n^{(0)} - \hat\theta_n \in \mathbb{R}^d$, \begin{align*} \ell_n(\hat\theta_n^{(0)}) &= \ell_n(\hat\theta_n) + s_n(\hat\theta_n)^\top h - \tfrac{1}{2} h^\top \mathcal{J}_n(\tilde\theta_n) h, \end{align*} for some $\tilde\theta_n$ on the segment between $\hat\theta_n^{(0)}$ and $\hat\theta_n$ (in the componentwise mean-value form of the remainder, $\tilde\theta_n$ depends on the component; for a clean statement use the integral form, which gives the same asymptotic outcome). Since $s_n(\hat\theta_n) = 0$, \begin{align*} 2\log\Lambda_n &= 2\left[\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)})\right] = h^\top \mathcal{J}_n(\tilde\theta_n) h. \end{align*} Both $\hat\theta_n$ and $\hat\theta_n^{(0)}$ are consistent for $\theta_0$, so $\tilde\theta_n \xrightarrow{\mathbb{P}} \theta_0 = 0$, and by continuity of $\theta \mapsto I(\theta)$ together with the LLN for the observed information, applied uniformly on a neighbourhood of $\theta_0$ (which the integrable-envelope hypothesis permits), \begin{align*} n^{-1}\mathcal{J}_n(\tilde\theta_n) \xrightarrow{\mathbb{P}} I(\theta_0) = I_d. \end{align*} Therefore \begin{align*} 2\log\Lambda_n &= (\sqrt n\, h)^\top \left[n^{-1}\mathcal{J}_n(\tilde\theta_n)\right] (\sqrt n\, h) = \|\sqrt n\, h\|^2 + o_{\mathbb{P}}(\|\sqrt n\, h\|^2). 
\end{align*} It remains to identify the asymptotic distribution of $\sqrt n\, h = \sqrt n(\hat\theta_n^{(0)} - \hat\theta_n)$. [guided] Since the score vanishes at $\hat\theta_n$ (unrestricted first-order condition) and $\ell_n$ is smooth, the value of $\ell_n$ near $\hat\theta_n$ is dominated by the quadratic term of Taylor's expansion. Taylor's theorem in the form with Lagrange remainder gives \begin{align*} \ell_n(\hat\theta_n^{(0)}) = \ell_n(\hat\theta_n) + \underbrace{s_n(\hat\theta_n)^\top h}_{= 0} + \tfrac{1}{2} h^\top \nabla^2\ell_n(\tilde\theta_n) h, \end{align*} where $\nabla^2\ell_n(\tilde\theta) = -\mathcal{J}_n(\tilde\theta)$. So \begin{align*} \ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)}) = \tfrac{1}{2} h^\top \mathcal{J}_n(\tilde\theta_n) h, \end{align*} and $2\log\Lambda_n = h^\top \mathcal{J}_n(\tilde\theta_n) h$. We want this to be asymptotically $\chi^2_p$. The natural move is to rescale: write $h^\top \mathcal{J}_n(\tilde\theta_n) h = (\sqrt n h)^\top [n^{-1}\mathcal{J}_n(\tilde\theta_n)] (\sqrt n h)$. The middle factor is a (consistent) estimator of the Fisher information, converging in probability to $I_d$. The outer factor, $\sqrt n h$, is a rescaled displacement of the constrained MLE from the unrestricted one. If we can show $\sqrt n h$ converges in distribution to a vector $W \in \mathbb{R}^d$ that is degenerate on the $\xi$-directions (first $d - p$ coordinates) and standard normal on the $\eta$-directions (last $p$ coordinates) — i.e. $W = (0, \zeta)$ with $\zeta \sim N_p(0, I_p)$ — then the quadratic form becomes $W^\top I_d W = \|W\|^2 = \|\zeta\|^2 \sim \chi^2_p$, and we are done. This is the content of the next step. [/guided] [/step] [step:Compute the asymptotic distribution of $\sqrt n(\hat\theta_n - \hat\theta_n^{(0)})$] Recall the chart splits $\theta = (\xi, \eta)$ with $\Theta_0 = \{\eta = 0\}$ locally, and $I(\theta_0) = I_d$. 
By asymptotic normality of the MLE, \begin{align*} \sqrt n\, \hat\theta_n &= \sqrt n\,(\hat\theta_n - \theta_0) \xrightarrow{d} Z = (Z^{(\xi)}, Z^{(\eta)}) \sim N_d(0, I_d), \end{align*} with $Z^{(\xi)} \in \mathbb{R}^{d-p}$ and $Z^{(\eta)} \in \mathbb{R}^p$ independent standard Gaussians. The constrained MLE is the projection of $\hat\theta_n$ onto $\Theta_0$ in the information metric. Under our choice of coordinates where $I(\theta_0) = I_d$, this is the Euclidean orthogonal projection onto $\mathbb{R}^{d-p} \times \{0\}$, up to first order: we claim \begin{align*} \sqrt n\, \hat\theta_n^{(0)} &= \sqrt n\, (\hat\xi_n^{(0)}, 0) = (\sqrt n\, \hat\xi_n, 0) + o_{\mathbb{P}}(1). \end{align*} [claim:First-order equivalence of $\hat\xi_n^{(0)}$ and $\hat\xi_n$] $\sqrt n\,(\hat\xi_n^{(0)} - \hat\xi_n) \xrightarrow{\mathbb{P}} 0$. [proof] By the first-order condition at the unrestricted MLE, $s_n(\hat\theta_n) = 0$, so in particular $s_n^{(\xi)}(\hat\xi_n, \hat\eta_n) = 0$. By the constrained first-order condition, $s_n^{(\xi)}(\hat\xi_n^{(0)}, 0) = 0$. Define \begin{align*} \Phi_n: \mathbb{R}^{d-p} \times \mathbb{R}^p &\to \mathbb{R}^{d-p}, \\ (\xi, \eta) &\mapsto n^{-1} s_n^{(\xi)}(\xi, \eta). \end{align*} Then $\Phi_n(\hat\xi_n, \hat\eta_n) = \Phi_n(\hat\xi_n^{(0)}, 0) = 0$. The map $\Phi_n$ is continuously differentiable, and by the LLN $\nabla_\xi \Phi_n(\theta_0) \xrightarrow{\mathbb{P}} -I_{\xi\xi}(\theta_0) = -I_{d-p}$ (the upper-left block of $I(\theta_0) = I_d$). By the implicit function theorem applied pathwise in a neighbourhood where $\nabla_\xi \Phi_n$ is invertible (which happens with probability $\to 1$), there is a $C^1$ solution map $\xi = \chi_n(\eta)$ of $\Phi_n(\xi, \eta) = 0$, and both $\hat\xi_n = \chi_n(\hat\eta_n)$ and $\hat\xi_n^{(0)} = \chi_n(0)$. 
Taylor expanding $\chi_n$ around $0$, \begin{align*} \sqrt n\,(\hat\xi_n - \hat\xi_n^{(0)}) = \sqrt n\,[\chi_n(\hat\eta_n) - \chi_n(0)] = \nabla \chi_n(0) \cdot \sqrt n\, \hat\eta_n + o_{\mathbb{P}}(1). \end{align*} Implicit differentiation of $\Phi_n(\chi_n(\eta), \eta) = 0$ gives $\nabla\chi_n(0) = -[\nabla_\xi \Phi_n]^{-1}\nabla_\eta \Phi_n$, with both derivatives evaluated at $(\chi_n(0), 0)$. Since $\nabla_\xi \Phi_n \xrightarrow{\mathbb{P}} -I_{\xi\xi}(\theta_0) = -I_{d-p}$ and $\nabla_\eta \Phi_n \xrightarrow{\mathbb{P}} -I_{\xi\eta}(\theta_0)$, we get $\nabla\chi_n(0) \xrightarrow{\mathbb{P}} -I_{d-p}^{-1} \cdot I_{\xi\eta}(\theta_0) = 0$, where the cross-block $I_{\xi\eta}(\theta_0)$ vanishes because $I(\theta_0) = I_d$. Since $\sqrt n\, \hat\eta_n = O_{\mathbb{P}}(1)$ by asymptotic normality of the MLE, the product is $o_{\mathbb{P}}(1)$. [/proof] [/claim] Combining the claim with the decomposition of $\sqrt n\, h = \sqrt n\, \hat\theta_n^{(0)} - \sqrt n\, \hat\theta_n$, \begin{align*} \sqrt n\, h = (\sqrt n\,(\hat\xi_n^{(0)} - \hat\xi_n), -\sqrt n\, \hat\eta_n) = (o_{\mathbb{P}}(1), -\sqrt n\, \hat\eta_n). \end{align*} By asymptotic normality, $\sqrt n\, \hat\eta_n \xrightarrow{d} Z^{(\eta)} \sim N_p(0, I_p)$. Therefore \begin{align*} \sqrt n\, h \xrightarrow{d} (0, -Z^{(\eta)}), \end{align*} and its squared Euclidean norm converges in distribution to $\|Z^{(\eta)}\|^2 \sim \chi^2_p$. [guided] Under $H_0$ both MLEs are close to $\theta_0$, at scale $n^{-1/2}$. The question is: how close are they to each other, along each coordinate direction? In the $\eta$-directions, the constrained MLE is pinned to $\hat\eta_n^{(0)} = 0$ while the unrestricted MLE has $\hat\eta_n$ of order $n^{-1/2}$, tending to a standard normal $Z^{(\eta)} \sim N_p(0, I_p)$ after scaling. So $\sqrt n\,(\hat\eta_n^{(0)} - \hat\eta_n) = -\sqrt n\,\hat\eta_n \xrightarrow{d} -Z^{(\eta)}$. In the $\xi$-directions, the directions tangent to $\Theta_0$, the story is more delicate. Both estimators optimise in $\xi$, one subject to $\eta = 0$ and one freely. 
If the Fisher information has block-diagonal structure in our chart (so the $\xi$- and $\eta$-components of the score are asymptotically independent), then constraining $\eta$ does not affect the $\xi$-optimiser at first order. That is the content of the claim: $\sqrt n(\hat\xi_n - \hat\xi_n^{(0)}) \xrightarrow{\mathbb{P}} 0$. The proof of the claim uses the implicit function theorem. Both $\hat\theta_n$ and $\hat\theta_n^{(0)}$ satisfy $s_n^{(\xi)} = 0$; by the IFT this determines $\xi$ as a function $\chi_n$ of $\eta$. The difference $\hat\xi_n - \hat\xi_n^{(0)} = \chi_n(\hat\eta_n) - \chi_n(0)$ equals the derivative $\nabla \chi_n(0)$ times $\hat\eta_n$ plus lower-order terms. The derivative $\nabla\chi_n(0)$ involves the cross-block $I_{\xi\eta}$ of the Fisher information, which vanishes in our chosen chart (where $I(\theta_0) = I_d$). Hence the first-order change in $\hat\xi_n$ due to moving from $\eta = \hat\eta_n$ to $\eta = 0$ is zero. This is where the orthogonality $I(\theta_0) = I_d$ pays off: in a general chart, the constrained and unconstrained $\xi$-estimates would differ by an amount proportional to $I_{\xi\eta}(\theta_0) \hat\eta_n$, and unscrambling the quadratic form $h^\top I(\theta_0) h$ would require projecting onto the $\eta$-direction in the $I$-inner product. We have arranged coordinates so that the $I$-inner product is the Euclidean inner product, the $\eta$-direction is already $I$-orthogonal to the $\xi$-direction, and the projection reduces to the Euclidean projection. Putting it together, $\sqrt n h \xrightarrow{d} (0, -Z^{(\eta)})$ with $Z^{(\eta)} \sim N_p(0, I_p)$. The length squared of this vector is $\|Z^{(\eta)}\|^2$, a sum of $p$ independent squared $N(0,1)$'s — by definition $\chi^2_p$. 
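The conclusion of this step can be checked exactly in an assumed Gaussian location model (an illustration only, not part of the proof): let $X_i \sim N_d(\theta, I_d)$ i.i.d. with $\Theta_0 = \{\eta = 0\}$, so that $I(\theta) = I_d$ and $\mathcal{J}_n(\theta) = n I_d$ for every $\theta$. Writing the sample mean as $\bar X = (\bar X^{(\xi)}, \bar X^{(\eta)})$, \begin{align*} \hat\theta_n &= (\bar X^{(\xi)}, \bar X^{(\eta)}), \qquad \hat\theta_n^{(0)} = (\bar X^{(\xi)}, 0), \\ \sqrt n\, h &= (0,\, -\sqrt n\, \bar X^{(\eta)}), \\ 2\log\Lambda_n &= h^\top (n I_d)\, h = n\,\|\bar X^{(\eta)}\|^2. \end{align*} Under $H_0$, $\sqrt n\, \bar X^{(\eta)} \sim N_p(0, I_p)$ exactly, so in this model $2\log\Lambda_n \sim \chi^2_p$ for every $n$ with no remainder terms, and the $\xi$-components of the two MLEs coincide exactly rather than only to first order.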
[/guided] [/step] [step:Assemble the pieces to conclude $2\log\Lambda_n \xrightarrow{d} \chi^2_p$] From Step 3, \begin{align*} 2\log\Lambda_n = (\sqrt n h)^\top [n^{-1}\mathcal{J}_n(\tilde\theta_n)] (\sqrt n h), \end{align*} with $n^{-1}\mathcal{J}_n(\tilde\theta_n) \xrightarrow{\mathbb{P}} I_d$ and $\sqrt n h \xrightarrow{d} (0, -Z^{(\eta)})$ from Step 4. By the Continuous Mapping Theorem and Slutsky's Theorem applied to the map $(A, v) \mapsto v^\top A v$ (linear in $A$, quadratic in $v$, and jointly continuous on $\mathbb{R}^{d \times d} \times \mathbb{R}^d$, with the first factor converging in probability to a constant), \begin{align*} 2\log\Lambda_n \xrightarrow{d} (0, -Z^{(\eta)})^\top I_d (0, -Z^{(\eta)}) = \|Z^{(\eta)}\|^2. \end{align*} Since $Z^{(\eta)} \sim N_p(0, I_p)$, we have $\|Z^{(\eta)}\|^2 = \sum_{j=1}^p (Z^{(\eta)}_j)^2 \sim \chi^2_p$ by Definition of the Chi-Squared Distribution. Hence \begin{align*} 2\log\Lambda_n \xrightarrow{d} \chi^2_p, \end{align*} which is the stated asymptotic distribution. For the second part (that the GLR test of approximate size $\alpha$ rejects when $2\log\Lambda_n > \chi^2_p(\alpha)$), observe that $\chi^2_p(\alpha)$ is the upper $\alpha$ quantile of $\chi^2_p$, which is a continuity point of the $\chi^2_p$ CDF (since the $\chi^2_p$ distribution has a smooth density on $(0, \infty)$ for all $p \ge 1$). By convergence in distribution at continuity points of the limit CDF, \begin{align*} \mathbb{P}_{\theta_0}\!\left(2\log\Lambda_n > \chi^2_p(\alpha)\right) \xrightarrow[n \to \infty]{} \mathbb{P}(\chi^2_p > \chi^2_p(\alpha)) = \alpha. \end{align*} So the asymptotic size of the test is $\alpha$: under $H_0$ the test rejects with probability approaching $\alpha$, which confirms the "approximate size $\alpha$" interpretation. 
The stochastic-dominance claim under $H_1$ (that $2\log\Lambda_n$ tends to be stochastically larger when $H_0$ is false) follows from consistency of the test: if $\theta_0 \notin \Theta_0$, the restricted MLE is bounded away from $\theta_0$ and the log-likelihood difference $\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)})$ grows linearly in $n$, so $2\log\Lambda_n \to \infty$ in probability. This completes the proof. [/step]
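
The two conclusions above (asymptotic size $\alpha$ under $H_0$, and $2\log\Lambda_n \to \infty$ under $H_1$) can be sanity-checked by simulation. The following is a minimal sketch under assumptions that are not in the source: the model is two independent exponential samples with rates $(\lambda_1, \lambda_2)$, so $d = 2$, and the null is $\lambda_1 = \lambda_2$, so $p = 1$. The function names, sample size, replication count and seed are all illustrative choices, and the $\chi^2_1$ upper 5% point is hard-coded to avoid any dependency outside the standard library.

```python
import math
import random

# Upper 5% point of the chi-squared distribution with 1 degree of freedom.
CHI2_1_UPPER_5PCT = 3.841458820694124


def two_log_lambda(x1, x2):
    """Return 2 log Lambda for H0: equal rates in two exponential samples."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = sum(x1), sum(x2)
    # For m exponential observations with sum s, the MLE of the rate is m / s
    # and the maximised log-likelihood is m * log(m / s) - m.
    unrestricted = n1 * math.log(n1 / s1) + n2 * math.log(n2 / s2)
    restricted = (n1 + n2) * math.log((n1 + n2) / (s1 + s2))
    # The "- m" terms cancel between the unrestricted and restricted fits.
    return 2.0 * (unrestricted - restricted)


def rejection_rate(rate1, rate2, n=200, reps=2000, seed=0):
    """Monte Carlo frequency of rejecting H0 at nominal level 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x1 = [rng.expovariate(rate1) for _ in range(n)]
        x2 = [rng.expovariate(rate2) for _ in range(n)]
        if two_log_lambda(x1, x2) > CHI2_1_UPPER_5PCT:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print("rejection rate under H0:", rejection_rate(1.0, 1.0))
    print("rejection rate under H1:", rejection_rate(1.0, 1.5))
```

If the theorem applies to this model, the first printed frequency should be close to 0.05 and the second close to 1; Monte Carlo error of roughly half a percentage point is expected at 2000 replications.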
