Asymptotic Equivalence of Wald, LRT, and Score Tests

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We express all three statistics in terms of the common building block $\Delta_n := \sqrt{n}(\hat{\theta}_n - \theta_0)$ and show that each equals $\Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1)$. The Wald statistic is already a quadratic form in $\Delta_n$ up to replacing $\hat{i}_n$ by $I(\theta_0)$. The LRT is related to the Wald statistic by a second-order Taylor expansion of the log-likelihood. The score statistic is linked via the first-order Taylor expansion of the score at $\theta_0$. Once all three are shown to differ from $\Delta_n^\top I(\theta_0) \Delta_n$ by $o_{\mathbb{P}}(1)$, the asymptotic equivalence follows by the triangle inequality. [/proofplan] [step:Express the Wald statistic as $\Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1)$] Define $\Delta_n := \sqrt{n}(\hat{\theta}_n - \theta_0)$. By the [Asymptotic Normality of the MLE](/theorems/1857), $\Delta_n \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$ under $\mathbb{P}_{\theta_0}$. In particular, $\Delta_n = O_{\mathbb{P}}(1)$. The Wald statistic is $W_n(\theta_0) = \Delta_n^\top \hat{i}_n \Delta_n$, where $\hat{i}_n$ is the observed Fisher information. By the [Consistency of Observed Fisher Information](/theorems/1862), $\hat{i}_n \xrightarrow{\mathbb{P}_{\theta_0}} I(\theta_0)$. Write \begin{align*} W_n(\theta_0) = \Delta_n^\top I(\theta_0) \Delta_n + \Delta_n^\top (\hat{i}_n - I(\theta_0)) \Delta_n. \end{align*} The error term satisfies $|\Delta_n^\top (\hat{i}_n - I(\theta_0)) \Delta_n| \leq |\Delta_n|^2 \|\hat{i}_n - I(\theta_0)\|_{\mathrm{op}}$. Since $|\Delta_n|^2 = O_{\mathbb{P}}(1)$ and $\|\hat{i}_n - I(\theta_0)\|_{\mathrm{op}} \xrightarrow{\mathbb{P}} 0$, the product converges to zero in probability. Hence \begin{align*} W_n(\theta_0) = \Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1). \end{align*} [/step] [step:Express the LRT statistic as $\Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1)$ via Taylor expansion] The log-likelihood ratio statistic is $\Lambda_n = 2(\ell_n(\hat{\theta}_n) - \ell_n(\theta_0)) = 2n(\bar{\ell}_n(\hat{\theta}_n) - \bar{\ell}_n(\theta_0))$. Perform a second-order Taylor expansion of $\bar{\ell}_n$ around $\hat{\theta}_n$. Since $\nabla_\theta \bar{\ell}_n(\hat{\theta}_n) = 0$ at the MLE (on the event that $\hat{\theta}_n \in \operatorname{int}(\Theta)$, which has probability tending to 1), we have \begin{align*} \bar{\ell}_n(\theta_0) = \bar{\ell}_n(\hat{\theta}_n) + \frac{1}{2}(\theta_0 - \hat{\theta}_n)^\top \nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n)(\theta_0 - \hat{\theta}_n), \end{align*} where $\bar{\theta}_n$ lies on the segment $[\theta_0, \hat{\theta}_n]$. Rearranging: \begin{align*} \Lambda_n = -n(\theta_0 - \hat{\theta}_n)^\top \nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n)(\theta_0 - \hat{\theta}_n) = \Delta_n^\top (-\nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n)) \Delta_n. \end{align*} By the uniform law of large numbers for the Hessian and the fact that $\bar{\theta}_n \xrightarrow{\mathbb{P}} \theta_0$ (since $\hat{\theta}_n \xrightarrow{\mathbb{P}} \theta_0$ and $\bar{\theta}_n$ lies between the two), $-\nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n) \xrightarrow{\mathbb{P}} I(\theta_0)$. Therefore \begin{align*} \Lambda_n = \Delta_n^\top I(\theta_0) \Delta_n + \Delta_n^\top (-\nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n) - I(\theta_0)) \Delta_n. \end{align*} The error term vanishes in probability by the same argument as for the Wald statistic: $|\Delta_n|^2 = O_{\mathbb{P}}(1)$ and $\|-\nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n) - I(\theta_0)\|_{\mathrm{op}} \xrightarrow{\mathbb{P}} 0$. Hence \begin{align*} \Lambda_n = \Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1). \end{align*} [guided] The key idea is that the log-likelihood ratio is well-approximated by a quadratic form at the MLE. The proof of [Wilks' Theorem](/theorems/1864) uses the same Taylor expansion. The first-order term vanishes because $\hat{\theta}_n$ is a critical point of $\bar{\ell}_n$. The Hessian matrix $\nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n)$ evaluated at an intermediate point converges to $-I(\theta_0)$ by the uniform law of large numbers for the Hessian (regularity assumption) and the squeeze $\bar{\theta}_n \in [\theta_0, \hat{\theta}_n]$ combined with $\hat{\theta}_n \xrightarrow{\mathbb{P}} \theta_0$. After substitution, $\Lambda_n = \Delta_n^\top (-\nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n)) \Delta_n$, which differs from $\Delta_n^\top I(\theta_0) \Delta_n$ by a term bounded by $|\Delta_n|^2 \cdot \|-\nabla^2_\theta \bar{\ell}_n(\bar{\theta}_n) - I(\theta_0)\|_{\mathrm{op}}$. Since the first factor is $O_{\mathbb{P}}(1)$ and the second vanishes in probability, the product is $o_{\mathbb{P}}(1)$. [/guided] [/step] [step:Express the score statistic as $\Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1)$ via the score expansion] The score statistic is $T_n(\theta_0) = \frac{1}{n} S_n(\theta_0)^\top I(\theta_0)^{-1} S_n(\theta_0)$, where $S_n(\theta_0) = \sum_{i=1}^n \nabla_\theta \log f(X_i, \theta_0) = n \nabla_\theta \bar{\ell}_n(\theta_0)$. Hence \begin{align*} T_n(\theta_0) = n\, \nabla_\theta \bar{\ell}_n(\theta_0)^\top I(\theta_0)^{-1} \nabla_\theta \bar{\ell}_n(\theta_0). \end{align*} From the proof of [Asymptotic Normality of the MLE](/theorems/1857), the mean value expansion of the score gives \begin{align*} \nabla_\theta \bar{\ell}_n(\theta_0) = \nabla_\theta \bar{\ell}_n(\hat{\theta}_n) - \nabla^2_\theta \bar{\ell}_n(\tilde{\theta}_n)(\theta_0 - \hat{\theta}_n) = -\nabla^2_\theta \bar{\ell}_n(\tilde{\theta}_n)(\theta_0 - \hat{\theta}_n), \end{align*} where $\tilde{\theta}_n$ lies on the segment $[\theta_0, \hat{\theta}_n]$ and we used $\nabla_\theta \bar{\ell}_n(\hat{\theta}_n) = 0$. Define $\bar{H}_n := -\nabla^2_\theta \bar{\ell}_n(\tilde{\theta}_n)$, which satisfies $\bar{H}_n \xrightarrow{\mathbb{P}} I(\theta_0)$. Then $\sqrt{n}\, \nabla_\theta \bar{\ell}_n(\theta_0) = \bar{H}_n \Delta_n$, and \begin{align*} T_n(\theta_0) = \Delta_n^\top \bar{H}_n^\top I(\theta_0)^{-1} \bar{H}_n \Delta_n. \end{align*} Since $\bar{H}_n \xrightarrow{\mathbb{P}} I(\theta_0)$, the matrix $\bar{H}_n^\top I(\theta_0)^{-1} \bar{H}_n \xrightarrow{\mathbb{P}} I(\theta_0)^\top I(\theta_0)^{-1} I(\theta_0) = I(\theta_0)$ (using symmetry of $I(\theta_0)$). Therefore \begin{align*} T_n(\theta_0) = \Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1), \end{align*} by the same operator-norm bound used in the previous steps. [guided] Why does the score statistic also reduce to $\Delta_n^\top I(\theta_0) \Delta_n$? The score at $\theta_0$ is linked to $\Delta_n = \sqrt{n}(\hat{\theta}_n - \theta_0)$ through the Taylor expansion of the score equation. Specifically, since the MLE zeroes the score, $\nabla_\theta \bar{\ell}_n(\hat{\theta}_n) = 0$, and expanding around $\hat{\theta}_n$: \begin{align*} \nabla_\theta \bar{\ell}_n(\theta_0) = \nabla^2_\theta \bar{\ell}_n(\tilde{\theta}_n)(\theta_0 - \hat{\theta}_n) = -\bar{H}_n \cdot \frac{1}{\sqrt{n}} \Delta_n \cdot \frac{1}{\sqrt{n}}, \end{align*} so $\frac{1}{\sqrt{n}} S_n(\theta_0) = \sqrt{n}\, \nabla_\theta \bar{\ell}_n(\theta_0) = \bar{H}_n \Delta_n$. Substituting into the score statistic: \begin{align*} T_n(\theta_0) = (\bar{H}_n \Delta_n)^\top I(\theta_0)^{-1} (\bar{H}_n \Delta_n) = \Delta_n^\top \bar{H}_n^\top I(\theta_0)^{-1} \bar{H}_n \Delta_n. \end{align*} Now $\bar{H}_n^\top I(\theta_0)^{-1} \bar{H}_n - I(\theta_0) \xrightarrow{\mathbb{P}} 0$ because $\bar{H}_n \xrightarrow{\mathbb{P}} I(\theta_0)$ and matrix multiplication is continuous. The error $|\Delta_n^\top (\bar{H}_n^\top I(\theta_0)^{-1} \bar{H}_n - I(\theta_0)) \Delta_n| \leq |\Delta_n|^2 \|\bar{H}_n^\top I(\theta_0)^{-1} \bar{H}_n - I(\theta_0)\|_{\mathrm{op}} = o_{\mathbb{P}}(1)$. [/guided] [/step] [step:Conclude asymptotic equivalence by combining the three representations] From the three preceding steps: \begin{align*} W_n(\theta_0) &= \Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1), \\ \Lambda_n(\Theta, \{\theta_0\}) &= \Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1), \\ T_n(\theta_0) &= \Delta_n^\top I(\theta_0) \Delta_n + o_{\mathbb{P}}(1). \end{align*} Subtracting pairwise: \begin{align*} W_n(\theta_0) - \Lambda_n(\Theta, \{\theta_0\}) &= o_{\mathbb{P}}(1), \\ W_n(\theta_0) - T_n(\theta_0) &= o_{\mathbb{P}}(1). \end{align*} Since all three statistics share the same asymptotic distribution $\chi^2_p$ (established in the [Wald Confidence Ellipsoids](/theorems/1863), [Wilks' Theorem](/theorems/1864), and [Asymptotic Distribution of the Score Statistic](/theorems/1865)), and their pairwise differences vanish in probability, any test that rejects $H_0$ when the statistic exceeds a threshold $\xi_\alpha$ (the $(1-\alpha)$-quantile of $\chi^2_p$) has the same asymptotic size $\alpha$. Under a sequence of local alternatives $\theta_n = \theta_0 + h/\sqrt{n}$, all three statistics have the same limiting non-central $\chi^2$ distribution (since the $o_{\mathbb{P}}(1)$ remainders do not depend on whether the true parameter is $\theta_0$ or $\theta_n$), so the asymptotic power is also identical. [/step]

What brings you to Androma?

Start with a route through the knowledge graph.