Wald, Likelihood-Ratio, and Score Tests in Regular Generalized Linear Models

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The likelihood regularity assumptions give a local quadratic expansion of the log-likelihood around the true parameter and asymptotic normality of the maximum likelihood estimator. The restriction map is then linearized at $\beta_0$, reducing the null set locally to the tangent space $\{h \in \mathbb{R}^p : R_0h=0\}$. The Wald, likelihood-ratio, and score statistics all become the same quadratic form in the normal vector $R_0\sqrt{n}(\hat{\beta}_n-\beta_0)$. Since this vector has a nonsingular $q$-dimensional Gaussian limit, the common limiting quadratic form has the $\chi_q^2$ distribution. [/proofplan] [step:Extract the local Gaussian limit for the unrestricted estimator] Let $\ell_n:\Theta\to\mathbb{R}$ denote the sample log-likelihood based on $n$ observations, and let the score map $U_n:\Theta\to\mathbb{R}^p$ be defined by \begin{align*} U_n(\beta):=\nabla \ell_n(\beta), \qquad \beta\in\Theta. \end{align*} Define the normalized score vector \begin{align*} \Delta_n := n^{-1/2} U_n(\beta_0) \in \mathbb{R}^p. \end{align*} By the score [central limit theorem](/theorems/1848) included in the regular GLM assumptions at the interior point $\beta_0$, \begin{align*} \Delta_n \xrightarrow{d} Z, \qquad Z \sim \mathcal{N}(0,I_0), \end{align*} where $I_0:=I(\beta_0)$. The no-separation and identifiability assumptions ensure that the unrestricted maximum likelihood estimator exists and is consistent, so the local parameter $\sqrt{n}(\hat{\beta}_n-\beta_0)$ is $O_{\mathbb{P}}(1)$ under the stated likelihood regularity. The uniform local asymptotic quadratic expansion of $\ell_n$ gives \begin{align*} \ell_n(\beta_0+n^{-1/2}h)-\ell_n(\beta_0) = h^\top \Delta_n-\frac{1}{2}h^\top I_0h+o_{\mathbb{P}}(1) \end{align*} uniformly for $h$ in compact subsets of $\mathbb{R}^p$. Since $I_0$ is positive definite, the limiting quadratic function \begin{align*} Q_Z: \mathbb{R}^p &\to \mathbb{R} \\ h &\mapsto h^\top Z-\frac{1}{2}h^\top I_0h \end{align*} has unique maximizer $I_0^{-1}Z$. The regular GLM likelihood assumptions include the standard local asymptotic normality maximizer result: [uniform convergence](/page/Uniform%20Convergence) of the local likelihood process on compact subsets, tightness of the local maximizer, and uniqueness of the limiting maximizer imply convergence of the local maximizers. Applying this result, with tightness of $\sqrt{n}(\hat{\beta}_n-\beta_0)$ supplied by consistency and nonsingular information, gives \begin{align*} \sqrt{n}(\hat{\beta}_n-\beta_0) = I_0^{-1}\Delta_n+o_{\mathbb{P}}(1). \end{align*} Set \begin{align*} H_n := \sqrt{n}(\hat{\beta}_n-\beta_0). \end{align*} Then \begin{align*} H_n \xrightarrow{d} H, \qquad H \sim \mathcal{N}(0,I_0^{-1}). \end{align*} [/step] [step:Linearize the restriction map and identify the null-direction covariance] Let $R:\Theta\to\mathbb{R}^{q\times p}$ denote the derivative map of the restriction function, defined by \begin{align*} R(\beta):=Dr_\beta, \qquad \beta\in\Theta, \end{align*} and set $R_0:=R(\beta_0)$. Let \begin{align*} A_0 := R_0 I_0^{-1} R_0^\top \in \mathbb{R}^{q \times q}. \end{align*} Because $R_0$ has rank $q$ and $I_0^{-1}$ is positive definite, $A_0$ is positive definite. Indeed, for every nonzero $a \in \mathbb{R}^q$, \begin{align*} a^\top A_0 a = (R_0^\top a)^\top I_0^{-1}(R_0^\top a)>0, \end{align*} since $R_0^\top a \ne 0$ by the full row rank of $R_0$. Since $r$ is differentiable at $\beta_0$ and $r(\beta_0)=0$, \begin{align*} \sqrt{n}\,r(\hat{\beta}_n) = R_0 H_n+o_{\mathbb{P}}(1). \end{align*} Thus \begin{align*} R_0H_n \xrightarrow{d} R_0H, \qquad R_0H \sim \mathcal{N}(0,A_0). \end{align*} [/step] [step:Show that the Wald statistic converges to the Gaussian quadratic form] By continuity of $R$ and $I$ at $\beta_0$, consistency of $\hat{\beta}_n$, and nonsingularity of $A_0$, \begin{align*} R(\hat{\beta}_n) I(\hat{\beta}_n)^{-1} R(\hat{\beta}_n)^\top \xrightarrow{\mathbb{P}} A_0. \end{align*} Therefore \begin{align*} W_n &= \left(\sqrt{n}\,r(\hat{\beta}_n)\right)^\top \left[ R(\hat{\beta}_n) I(\hat{\beta}_n)^{-1} R(\hat{\beta}_n)^\top \right]^{-1} \left(\sqrt{n}\,r(\hat{\beta}_n)\right) \\ &= (R_0H_n)^\top A_0^{-1}(R_0H_n)+o_{\mathbb{P}}(1). \end{align*} Since $R_0H \sim \mathcal{N}(0,A_0)$ and $A_0$ is positive definite, choose the positive definite square root $A_0^{1/2}$ and write $R_0H=A_0^{1/2}G$ with $G \sim \mathcal{N}(0,I_q)$. Then \begin{align*} (R_0H)^\top A_0^{-1}(R_0H) = G^\top G = \sum_{k=1}^q G_k^2 \sim \chi_q^2. \end{align*} Hence \begin{align*} W_n \xrightarrow{d} \chi_q^2. \end{align*} [/step] [step:Compare the likelihood-ratio statistic with the same quadratic form] Define the normalized constrained estimator \begin{align*} K_n := \sqrt{n}(\tilde{\beta}_n-\beta_0) \in \mathbb{R}^p. \end{align*} The stated regularity assumptions include existence and consistency of $\tilde{\beta}_n$ under no separation, hence $K_n=O_{\mathbb{P}}(1)$. Since $r$ is continuously differentiable near $\beta_0$ and $r(\beta_0)=0$, Taylor expansion of $r$ at $\beta_0$ gives, uniformly for $h=O_{\mathbb{P}}(1)$, \begin{align*} \sqrt{n}\,r(\beta_0+n^{-1/2}h)=R_0h+o_{\mathbb{P}}(1). \end{align*} Applying this with $h=K_n$ and using $r(\tilde{\beta}_n)=0$ gives \begin{align*} R_0K_n=o_{\mathbb{P}}(1). \end{align*} Thus the nonlinear null surface is replaced locally, at first order and uniformly on the $O_{\mathbb{P}}(1)$ region containing $K_n$, by the tangent space \begin{align*} T_0:=\{h \in \mathbb{R}^p : R_0h=0\}. \end{align*} Let $P_0:\mathbb{R}^p\to T_0$ denote the $I_0$-[orthogonal projection](/theorems/437), where the $I_0$ inner product is \begin{align*} (u,v)_{I_0}:=u^\top I_0v, \qquad u,v\in\mathbb{R}^p. \end{align*} We use the following constrained local asymptotic normality maximizer result: if the local likelihood expansion is uniform on compact subsets, the constrained local maximizer is tight, the constraint set is locally approximated by a linear tangent space, and the limiting quadratic objective is strictly concave, then the constrained local maximizer converges to the maximizer of the limiting quadratic objective over that tangent space. Applying this result on the regular null manifold $\{\beta\in\Theta:r(\beta)=0\}$ with full-rank derivative $R_0$ gives \begin{align*} K_n=P_0H_n+o_{\mathbb{P}}(1). \end{align*} The hypotheses are verified as follows: the local likelihood expansion is uniform on compact subsets; $K_n$ is tight by constrained consistency; the Taylor expansion of $r$ identifies the feasible local directions with $T_0$ up to $o_{\mathbb{P}}(1)$; and the limiting quadratic objective is strictly concave because $I_0$ is positive definite. We compute $P_0$ explicitly. The $I_0$-normal space to $T_0$ is $I_0^{-1}\operatorname{Range}(R_0^\top)$, because for every $a\in\mathbb{R}^q$ and every $v\in T_0$, \begin{align*} (I_0^{-1}R_0^\top a,v)_{I_0}=a^\top R_0v=0. \end{align*} Therefore $H_n-P_0H_n=I_0^{-1}R_0^\top\lambda_n$ for a vector $\lambda_n\in\mathbb{R}^q$. Imposing $R_0P_0H_n=0$ gives \begin{align*} 0 &=R_0\left(H_n-I_0^{-1}R_0^\top\lambda_n\right) \\ &=R_0H_n-A_0\lambda_n, \end{align*} so $\lambda_n=A_0^{-1}R_0H_n$. Hence \begin{align*} H_n-K_n = I_0^{-1}R_0^\top A_0^{-1}R_0H_n+o_{\mathbb{P}}(1). \end{align*} Using the uniform local quadratic expansion at both $H_n$ and $K_n$, which is valid because both sequences are $O_{\mathbb{P}}(1)$, and using the first-order condition $\Delta_n=I_0H_n+o_{\mathbb{P}}(1)$, we obtain \begin{align*} \Lambda_n &= 2\left(\ell_n(\hat{\beta}_n)-\ell_n(\tilde{\beta}_n)\right) \\ &= (H_n-K_n)^\top I_0(H_n-K_n)+o_{\mathbb{P}}(1) \\ &= (R_0H_n)^\top A_0^{-1}(R_0H_n)+o_{\mathbb{P}}(1). \end{align*} The final expression has already been shown to converge in distribution to $\chi_q^2$: the quadratic map $y\mapsto y^\top A_0^{-1}y$ is continuous, and adding an $o_{\mathbb{P}}(1)$ remainder does not change the distributional limit. Hence \begin{align*} \Lambda_n \xrightarrow{d} \chi_q^2. \end{align*} [guided] Define the normalized constrained estimator \begin{align*} K_n := \sqrt{n}(\tilde{\beta}_n-\beta_0) \in \mathbb{R}^p. \end{align*} The no-separation assumption is used here: together with the regular GLM hypotheses, it gives existence and consistency of the restricted maximum likelihood estimator. Consequently $K_n=O_{\mathbb{P}}(1)$, so the local likelihood expansion may be evaluated at $K_n$. The first question is whether the curved constraint $r(\beta)=0$ may be replaced by its tangent space. Since $r:\Theta\to\mathbb{R}^q$ is continuously differentiable near $\beta_0$, $r(\beta_0)=0$, and $R_0:=Dr_{\beta_0}$ has rank $q$, Taylor expansion gives, uniformly for bounded local parameters $h\in\mathbb{R}^p$, \begin{align*} \sqrt{n}\,r(\beta_0+n^{-1/2}h)=R_0h+o(1). \end{align*} Applying this expansion along the tight sequence $h=K_n$ and using the exact constrained identity $r(\tilde{\beta}_n)=0$ gives \begin{align*} R_0K_n=o_{\mathbb{P}}(1). \end{align*} Thus the feasible local directions are, to first order, the tangent space \begin{align*} T_0:=\{h\in\mathbb{R}^p:R_0h=0\}. \end{align*} This is the point at which the full-rank assumption on $R_0$ matters: it makes the null surface a regular codimension-$q$ manifold near $\beta_0$ and makes the tangent approximation stable. Now apply the constrained local asymptotic normality maximizer result to the local likelihood process. Its hypotheses are satisfied: the local quadratic expansion is uniform on compact subsets of $\mathbb{R}^p$; $K_n$ is tight by constrained consistency; the preceding Taylor expansion identifies the feasible local set with $T_0$ up to $o_{\mathbb{P}}(1)$; and the limiting quadratic objective is strictly concave because $I_0$ is positive definite. Therefore the constrained local maximizer is the $I_0$-orthogonal projection of the unrestricted local maximizer onto $T_0$: \begin{align*} K_n=P_0H_n+o_{\mathbb{P}}(1), \end{align*} where $P_0:\mathbb{R}^p\to T_0$ denotes the projection for the inner product \begin{align*} (u,v)_{I_0}:=u^\top I_0v, \qquad u,v\in\mathbb{R}^p. \end{align*} We compute this projection because the likelihood-ratio statistic is exactly the local quadratic loss from forcing $H_n$ into $T_0$. The $I_0$-normal directions to $T_0$ have the form $I_0^{-1}R_0^\top a$ with $a\in\mathbb{R}^q$, since for every $v\in T_0$, \begin{align*} (I_0^{-1}R_0^\top a,v)_{I_0}=a^\top R_0v=0. \end{align*} Thus $H_n-P_0H_n=I_0^{-1}R_0^\top\lambda_n$ for some $\lambda_n\in\mathbb{R}^q$. The condition $P_0H_n\in T_0$ gives \begin{align*} 0 &=R_0\left(H_n-I_0^{-1}R_0^\top\lambda_n\right) \\ &=R_0H_n-A_0\lambda_n. \end{align*} Since $A_0=R_0I_0^{-1}R_0^\top$ is positive definite, it is invertible, and hence \begin{align*} \lambda_n=A_0^{-1}R_0H_n. \end{align*} Therefore \begin{align*} H_n-K_n = I_0^{-1}R_0^\top A_0^{-1}R_0H_n+o_{\mathbb{P}}(1). \end{align*} Finally use the local quadratic expansion of the log-likelihood at the two tight local parameters $H_n$ and $K_n$. Since $H_n$ is the unrestricted local maximizer, the linear term cancels through $\Delta_n=I_0H_n+o_{\mathbb{P}}(1)$, and the likelihood-ratio statistic becomes \begin{align*} \Lambda_n &=2\left(\ell_n(\hat{\beta}_n)-\ell_n(\tilde{\beta}_n)\right) \\ &=(H_n-K_n)^\top I_0(H_n-K_n)+o_{\mathbb{P}}(1) \\ &=(R_0H_n)^\top A_0^{-1}(R_0H_n)+o_{\mathbb{P}}(1). \end{align*} The last expression is the same quadratic form already obtained in the Wald calculation. Since the quadratic map is continuous and the $o_{\mathbb{P}}(1)$ remainder is negligible for distributional convergence, \begin{align*} \Lambda_n\xrightarrow{d}\chi_q^2. \end{align*} [/guided] [/step] [step:Compare the score statistic with the same quadratic form] Define the score statistic for the restriction $r(\beta)=0$ by \begin{align*} S_n := \left(n^{-1/2}U_n(\tilde{\beta}_n)\right)^\top I(\tilde{\beta}_n)^{-1}R(\tilde{\beta}_n)^\top \left[R(\tilde{\beta}_n)I(\tilde{\beta}_n)^{-1}R(\tilde{\beta}_n)^\top\right]^{-1} R(\tilde{\beta}_n)I(\tilde{\beta}_n)^{-1} \left(n^{-1/2}U_n(\tilde{\beta}_n)\right). \end{align*} The score expansion at the constrained estimator gives \begin{align*} n^{-1/2}U_n(\tilde{\beta}_n) = \Delta_n-I_0K_n+o_{\mathbb{P}}(1). \end{align*} Since $H_n=I_0^{-1}\Delta_n+o_{\mathbb{P}}(1)$, this becomes \begin{align*} n^{-1/2}U_n(\tilde{\beta}_n) = I_0(H_n-K_n)+o_{\mathbb{P}}(1). \end{align*} Using the projection formula from the preceding step, \begin{align*} n^{-1/2}U_n(\tilde{\beta}_n) = R_0^\top A_0^{-1}R_0H_n+o_{\mathbb{P}}(1). \end{align*} By consistency of $\tilde{\beta}_n$, continuity of $R$ and $I$ at $\beta_0$, and nonsingularity of $A_0$, the plug-in matrices in the definition of $S_n$ converge in probability to their corresponding limits. Substituting the preceding score expansion into the definition of $S_n$ gives \begin{align*} S_n &= (R_0H_n)^\top A_0^{-1}R_0I_0^{-1}R_0^\top A_0^{-1}R_0H_n+o_{\mathbb{P}}(1) \\ &= (R_0H_n)^\top A_0^{-1}A_0A_0^{-1}(R_0H_n)+o_{\mathbb{P}}(1) \\ &= (R_0H_n)^\top A_0^{-1}(R_0H_n)+o_{\mathbb{P}}(1). \end{align*} Therefore \begin{align*} S_n \xrightarrow{d} \chi_q^2. \end{align*} [/step] [step:Recover the scalar Wald form] For the scalar coordinate null hypothesis $H_0:\beta_j=0$, take \begin{align*} r: \Theta \to \mathbb{R}, \qquad \beta \mapsto \beta_j. \end{align*} Then $q=1$ and $R(\beta)$ is the row vector selecting the $j$th coordinate. The estimated variance of $\hat{\beta}_{n,j}$ is the $j$th diagonal entry of the estimated covariance matrix of $\hat{\beta}_n$, denoted \begin{align*} \operatorname{se}(\hat{\beta}_{n,j})^2. \end{align*} Thus the Wald statistic reduces to \begin{align*} W_n = \frac{\hat{\beta}_{n,j}^2}{\operatorname{se}(\hat{\beta}_{n,j})^2} = \left(\frac{\hat{\beta}_{n,j}}{\operatorname{se}(\hat{\beta}_{n,j})}\right)^2. \end{align*} Combining the preceding steps proves the stated asymptotic $\chi_q^2$ null distributions for the Wald, likelihood-ratio, and score tests. [/step]

Prerequisites (0/3 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Orthogonal Projection

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.