Consistency of the Cluster-Robust Sandwich Variance Estimator Under Many Clusters

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The argument is a sandwich-estimator consistency proof with clusters as the independent summands. First, rewrite the cluster-robust estimator using $X_G^\top X_G = GQ_G$, which exposes the sample cluster score covariance between the two curvature factors. Then use the convergence of $Q_G$ and the continuity of matrix inversion to obtain $Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}$. Finally, combine this with the assumed convergence of the residual cluster covariance to get consistency of the scaled sandwich estimator, and combine the score expansion with the cluster [central limit theorem](/theorems/1848) to get the limiting distribution. [/proofplan] [step:Rewrite the sandwich estimator in cluster scale] For each $G \in \mathbb N$, the theorem statement defines $X_G \in \mathbb R^{N_G \times d}$, $\hat\beta_G: \Omega \to \mathbb R^d$, and residual cluster scores $\hat S_{g,G}:\Omega\to\mathbb R^d$ for $g=1,\dots,G$. By the definition of the normalized curvature matrix, \begin{align*} Q_G = \frac{1}{G}X_G^\top X_G, \end{align*} so $X_G^\top X_G = GQ_G$. Since $Q$ is positive definite and $Q_G \xrightarrow{\mathbb{P}} Q$, the matrix $Q_G$ is invertible with probability tending to $1$. On this event, \begin{align*} (X_G^\top X_G)^{-1} = (GQ_G)^{-1} = \frac{1}{G}Q_G^{-1}. \end{align*} Substituting this identity into the definition of $\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ gives \begin{align*} \widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) &= \frac{1}{G}Q_G^{-1} \left(\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right) \frac{1}{G}Q_G^{-1} \\ &= \frac{1}{G} Q_G^{-1} \left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right) Q_G^{-1}. \end{align*} Multiplying by $G$ yields the scaled sandwich identity \begin{align*} G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) = Q_G^{-1} \left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right) Q_G^{-1}. \end{align*} [guided] The theorem statement works with a triangular array indexed by the number of clusters $G$. For each $G \in \mathbb N$, the design matrix is $X_G \in \mathbb R^{N_G \times d}$, the OLS estimator is the random vector $\hat\beta_G: \Omega \to \mathbb R^d$, and the residual cluster score in cluster $g$ is $\hat S_{g,G}:\Omega\to\mathbb R^d$. The purpose of this step is to expose the three factors whose probability limits are known or can be derived. The curvature matrix is not $X_G^\top X_G$ itself but its cluster-normalized version \begin{align*} Q_G = \frac{1}{G}X_G^\top X_G. \end{align*} Therefore $X_G^\top X_G = GQ_G$. Since $Q$ is positive definite, its smallest eigenvalue is strictly positive. The convergence $Q_G \xrightarrow{\mathbb{P}} Q$ implies that $Q_G$ is nonsingular with probability tending to $1$, so the inverse below is well-defined on events whose probabilities tend to $1$. On those events, \begin{align*} (X_G^\top X_G)^{-1} = (GQ_G)^{-1} = \frac{1}{G}Q_G^{-1}. \end{align*} Now substitute this into the cluster-robust sandwich estimator: \begin{align*} \widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) &= (X_G^\top X_G)^{-1} \left(\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right) (X_G^\top X_G)^{-1} \\ &= \frac{1}{G}Q_G^{-1} \left(\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right) \frac{1}{G}Q_G^{-1} \\ &= \frac{1}{G} Q_G^{-1} \left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right) Q_G^{-1}. \end{align*} Thus multiplying by $G$ gives \begin{align*} G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) = Q_G^{-1} \left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right) Q_G^{-1}. \end{align*} This is the desired form: the middle factor is exactly the assumed residual cluster score covariance, and the two outer factors are sample curvature inverses. [/guided] [/step] [step:Pass the curvature inverse to its probability limit] Because $Q$ is positive definite, $Q$ is invertible. The inversion map \begin{align*} \operatorname{Inv}: GL_d(\mathbb{R}) &\to GL_d(\mathbb{R}) \\ A &\mapsto A^{-1} \end{align*} is continuous at $Q$. For each $G$, define an everywhere-defined random matrix $R_G: \Omega \to \mathbb{R}^{d \times d}$ by \begin{align*} R_G := \begin{cases} Q_G^{-1}, & \det Q_G \neq 0, \\ 0, & \det Q_G = 0. \end{cases} \end{align*} Let $\varepsilon>0$ be fixed. Since inversion is continuous at $Q$, there is $\delta>0$ such that whenever $A\in\mathbb R^{d\times d}$ satisfies $|A-Q|<\delta$, the matrix $A$ is invertible and $|A^{-1}-Q^{-1}|<\varepsilon$. Since $Q_G \xrightarrow{\mathbb{P}} Q$, \begin{align*} \mathbb P\bigl(|R_G-Q^{-1}|>\varepsilon\bigr) \leq \mathbb P\bigl(|Q_G-Q|\geq\delta\bigr) \to 0. \end{align*} Therefore \begin{align*} R_G \xrightarrow{\mathbb{P}} Q^{-1}. \end{align*} Moreover $\mathbb{P}(R_G = Q_G^{-1}) \to 1$, so all subsequent products involving $Q_G^{-1}$ may be read on these high-probability events, equivalently with $R_G$ replacing $Q_G^{-1}$ without changing the probability limit. [guided] We need to justify replacing $Q_G^{-1}$ by $Q^{-1}$ in the sandwich expression. The limit matrix $Q$ is positive definite, hence invertible. In finite dimensions, the determinant map is continuous, and the set \begin{align*} GL_d(\mathbb{R}) := \{A \in \mathbb{R}^{d \times d} : \det A \neq 0\} \end{align*} is the domain of the inverse map. The inverse map \begin{align*} \operatorname{Inv}: GL_d(\mathbb{R}) &\to GL_d(\mathbb{R}) \\ A &\mapsto A^{-1} \end{align*} is continuous at every invertible matrix, in particular at $Q$. Because $Q_G$ may be singular on exceptional events, we make the inverse an everywhere-defined random matrix. Define $R_G: \Omega \to \mathbb{R}^{d \times d}$ by \begin{align*} R_G := \begin{cases} Q_G^{-1}, & \det Q_G \neq 0, \\ 0, & \det Q_G = 0. \end{cases} \end{align*} The value $0$ on the singular event is arbitrary; it is only used to give a total definition. Since $Q_G \xrightarrow{\mathbb{P}} Q$ and $\det Q \neq 0$, the event $\{\det Q_G \neq 0\}$ has probability tending to $1$. On that high-probability event, $R_G$ is exactly $Q_G^{-1}$. To prove convergence without treating the piecewise definition of $R_G$ as a globally continuous map, fix $\varepsilon>0$. Continuity of inversion at $Q$ gives a number $\delta>0$ such that $|A-Q|<\delta$ implies that $A$ is invertible and $|A^{-1}-Q^{-1}|<\varepsilon$. Therefore \begin{align*} \mathbb P\bigl(|R_G-Q^{-1}|>\varepsilon\bigr) \leq \mathbb P\bigl(|Q_G-Q|\geq\delta\bigr) \to 0, \end{align*} so \begin{align*} R_G \xrightarrow{\mathbb{P}} Q^{-1}. \end{align*} Thus we may write $Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}$ with the understood convention that the inverse is taken on events whose probabilities tend to $1$. This is where positive definiteness is used: without invertibility of $Q$, there would be no finite matrix $Q^{-1}$ to serve as the limit. [/guided] [/step] [step:Combine the three sandwich factors] Define the residual cluster covariance matrix $M_G: \Omega \to \mathbb{R}^{d \times d}$ by \begin{align*} M_G := \frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}. \end{align*} By hypothesis, \begin{align*} M_G \xrightarrow{\mathbb{P}} \Omega_C. \end{align*} From the previous step, \begin{align*} Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}. \end{align*} Matrix multiplication is continuous as a map \begin{align*} \mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d} &\to \mathbb{R}^{d \times d} \\ (A,B,C) &\mapsto ABC. \end{align*} Therefore, \begin{align*} Q_G^{-1}M_GQ_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}\Omega_CQ^{-1}. \end{align*} Using the scaled sandwich identity from the first step, \begin{align*} G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) = Q_G^{-1}M_GQ_G^{-1}, \end{align*} we obtain \begin{align*} G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) \xrightarrow{\mathbb{P}} Q^{-1}\Omega_CQ^{-1}. \end{align*} [guided] Let us isolate the middle factor by defining \begin{align*} M_G := \frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}. \end{align*} This is the empirical residual cluster score covariance. The hypothesis says exactly that \begin{align*} M_G \xrightarrow{\mathbb{P}} \Omega_C. \end{align*} The previous step proved \begin{align*} Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}. \end{align*} Now consider the deterministic multiplication map \begin{align*} \mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d} &\to \mathbb{R}^{d \times d} \\ (A,B,C) &\mapsto ABC. \end{align*} This map is continuous because each entry of $ABC$ is a finite sum of products of entries of $A$, $B$, and $C$. Hence convergence in probability of the three matrix factors implies convergence in probability of their product: \begin{align*} Q_G^{-1}M_GQ_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}\Omega_CQ^{-1}. \end{align*} Finally, the algebraic identity from the first step says \begin{align*} G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) = Q_G^{-1}M_GQ_G^{-1}. \end{align*} Substituting this identity into the preceding convergence proves \begin{align*} G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G) \xrightarrow{\mathbb{P}} Q^{-1}\Omega_CQ^{-1}. \end{align*} This proves the consistency of the scaled cluster-robust variance estimator. [/guided] [/step] [step:Identify the limiting distribution of the scaled estimator] By assumption, \begin{align*} \frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G} \xrightarrow{d} \mathcal{N}(0,\Omega_C), \end{align*} and the previous curvature argument gives $Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}$. Applying Slutsky's theorem to the product \begin{align*} Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G} \end{align*} gives \begin{align*} Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G} \xrightarrow{d} Q^{-1}Z, \end{align*} where $Z \sim \mathcal{N}(0,\Omega_C)$. Since linear transformations of Gaussian random vectors are Gaussian, \begin{align*} Q^{-1}Z \sim \mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}). \end{align*} The expansion \begin{align*} \sqrt{G}(\hat{\beta}_G-\beta_0) = Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G} + o_{\mathbb{P}}(1) \end{align*} and Slutsky's theorem again imply \begin{align*} \sqrt{G}(\hat{\beta}_G-\beta_0) \xrightarrow{d} \mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}). \end{align*} Thus $\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ is on the covariance scale of $\hat{\beta}_G$, while $G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ is on the covariance scale of $\sqrt{G}(\hat{\beta}_G-\beta_0)$. This completes the proof. [guided] The variance consistency result identifies the limiting covariance matrix. We now connect it to the limiting distribution of the estimator itself. The assumed cluster [central limit theorem](/theorems/521) gives \begin{align*} \frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G} \xrightarrow{d} \mathcal{N}(0,\Omega_C). \end{align*} Here each population cluster score is the random vector $S_{g,G}:\Omega\to\mathbb R^d$ introduced in the theorem statement. The curvature convergence already proved gives \begin{align*} Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}. \end{align*} Slutsky's theorem applies to a product of a random matrix converging in probability to a constant matrix and a random vector converging in distribution. Therefore \begin{align*} Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G} \xrightarrow{d} Q^{-1}Z, \end{align*} where $Z$ is a Gaussian random vector with distribution $\mathcal{N}(0,\Omega_C)$. It remains to compute the covariance of $Q^{-1}Z$. Since $Q$ is positive definite, it is symmetric, so $(Q^{-1})^\top = Q^{-1}$. Hence \begin{align*} \operatorname{Var}(Q^{-1}Z) = Q^{-1}\operatorname{Var}(Z)(Q^{-1})^\top = Q^{-1}\Omega_CQ^{-1}. \end{align*} Thus \begin{align*} Q^{-1}Z \sim \mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}). \end{align*} Finally, the estimator admits the asymptotic linear expansion \begin{align*} \sqrt{G}(\hat{\beta}_G-\beta_0) = Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G} + o_{\mathbb{P}}(1). \end{align*} The remainder term converges to $0$ in probability. Applying Slutsky's theorem once more gives \begin{align*} \sqrt{G}(\hat{\beta}_G-\beta_0) \xrightarrow{d} \mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}). \end{align*} The estimator $\hat{\beta}_G$ fluctuates at order $G^{-1/2}$, so its covariance is of order $G^{-1}$. This is exactly why the unscaled estimator $\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ estimates the covariance scale of $\hat{\beta}_G$, while the scaled matrix $G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ estimates the asymptotic covariance of $\sqrt{G}(\hat{\beta}_G-\beta_0)$. This completes the proof. [/guided] [/step]

Prerequisites (0/3 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.