[proofplan]
The argument is a sandwich-estimator consistency proof with clusters as the independent summands. First, rewrite the cluster-robust estimator using $X_G^\top X_G = GQ_G$, which exposes the sample cluster score covariance between the two curvature factors. Then use the convergence of $Q_G$ and the continuity of matrix inversion to obtain $Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}$. Finally, combine this with the assumed convergence of the residual cluster covariance to get consistency of the scaled sandwich estimator, and combine the score expansion with the cluster [central limit theorem](/theorems/1848) to get the limiting distribution.
[/proofplan]
[step:Rewrite the sandwich estimator in cluster scale]
For each $G \in \mathbb N$, the theorem statement defines $X_G \in \mathbb R^{N_G \times d}$, $\hat\beta_G: \Omega \to \mathbb R^d$, and residual cluster scores $\hat S_{g,G}:\Omega\to\mathbb R^d$ for $g=1,\dots,G$. By the definition of the normalized curvature matrix,
\begin{align*}
Q_G = \frac{1}{G}X_G^\top X_G,
\end{align*}
so $X_G^\top X_G = GQ_G$. Since $Q$ is positive definite and $Q_G \xrightarrow{\mathbb{P}} Q$, the matrix $Q_G$ is invertible with probability tending to $1$. On this event,
\begin{align*}
(X_G^\top X_G)^{-1}
=
(GQ_G)^{-1}
=
\frac{1}{G}Q_G^{-1}.
\end{align*}
Substituting this identity into the definition of $\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ gives
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
&=
\frac{1}{G}Q_G^{-1}
\left(\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right)
\frac{1}{G}Q_G^{-1} \\
&=
\frac{1}{G}
Q_G^{-1}
\left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right)
Q_G^{-1}.
\end{align*}
Multiplying by $G$ yields the scaled sandwich identity
\begin{align*}
G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
=
Q_G^{-1}
\left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right)
Q_G^{-1}.
\end{align*}
[guided]
The theorem statement works with a triangular array indexed by the number of clusters $G$. For each $G \in \mathbb N$, the design matrix is $X_G \in \mathbb R^{N_G \times d}$, the OLS estimator is the random vector $\hat\beta_G: \Omega \to \mathbb R^d$, and the residual cluster score in cluster $g$ is $\hat S_{g,G}:\Omega\to\mathbb R^d$. The purpose of this step is to expose the three factors whose probability limits are known or can be derived. The curvature matrix is not $X_G^\top X_G$ itself but its cluster-normalized version
\begin{align*}
Q_G = \frac{1}{G}X_G^\top X_G.
\end{align*}
Therefore $X_G^\top X_G = GQ_G$. Since $Q$ is positive definite, its smallest eigenvalue is strictly positive. The convergence $Q_G \xrightarrow{\mathbb{P}} Q$ implies that $Q_G$ is nonsingular with probability tending to $1$, so the inverse below is well-defined on events whose probabilities tend to $1$.
On those events,
\begin{align*}
(X_G^\top X_G)^{-1}
=
(GQ_G)^{-1}
=
\frac{1}{G}Q_G^{-1}.
\end{align*}
Now substitute this into the cluster-robust sandwich estimator:
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
&=
(X_G^\top X_G)^{-1}
\left(\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right)
(X_G^\top X_G)^{-1} \\
&=
\frac{1}{G}Q_G^{-1}
\left(\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right)
\frac{1}{G}Q_G^{-1} \\
&=
\frac{1}{G}
Q_G^{-1}
\left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right)
Q_G^{-1}.
\end{align*}
Thus multiplying by $G$ gives
\begin{align*}
G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
=
Q_G^{-1}
\left(\frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}\right)
Q_G^{-1}.
\end{align*}
This is the desired form: the middle factor is exactly the assumed residual cluster score covariance, and the two outer factors are sample curvature inverses.
[/guided]
[/step]
[step:Pass the curvature inverse to its probability limit]
Because $Q$ is positive definite, $Q$ is invertible. The inversion map
\begin{align*}
\operatorname{Inv}: GL_d(\mathbb{R}) &\to GL_d(\mathbb{R}) \\
A &\mapsto A^{-1}
\end{align*}
is continuous at $Q$. For each $G$, define an everywhere-defined random matrix $R_G: \Omega \to \mathbb{R}^{d \times d}$ by
\begin{align*}
R_G :=
\begin{cases}
Q_G^{-1}, & \det Q_G \neq 0, \\
0, & \det Q_G = 0.
\end{cases}
\end{align*}
Let $\varepsilon>0$ be fixed. Since inversion is continuous at $Q$, there is $\delta>0$ such that whenever $A\in\mathbb R^{d\times d}$ satisfies $|A-Q|<\delta$, the matrix $A$ is invertible and $|A^{-1}-Q^{-1}|<\varepsilon$. Since $Q_G \xrightarrow{\mathbb{P}} Q$,
\begin{align*}
\mathbb P\bigl(|R_G-Q^{-1}|>\varepsilon\bigr)
\leq
\mathbb P\bigl(|Q_G-Q|\geq\delta\bigr)
\to 0.
\end{align*}
Therefore
\begin{align*}
R_G \xrightarrow{\mathbb{P}} Q^{-1}.
\end{align*}
Moreover $\mathbb{P}(R_G = Q_G^{-1}) \to 1$, so all subsequent products involving $Q_G^{-1}$ may be read on these high-probability events, equivalently with $R_G$ replacing $Q_G^{-1}$ without changing the probability limit.
[guided]
We need to justify replacing $Q_G^{-1}$ by $Q^{-1}$ in the sandwich expression. The limit matrix $Q$ is positive definite, hence invertible. In finite dimensions, the determinant map is continuous, and the set
\begin{align*}
GL_d(\mathbb{R}) := \{A \in \mathbb{R}^{d \times d} : \det A \neq 0\}
\end{align*}
is the domain of the inverse map. The inverse map
\begin{align*}
\operatorname{Inv}: GL_d(\mathbb{R}) &\to GL_d(\mathbb{R}) \\
A &\mapsto A^{-1}
\end{align*}
is continuous at every invertible matrix, in particular at $Q$.
Because $Q_G$ may be singular on exceptional events, we make the inverse an everywhere-defined random matrix. Define $R_G: \Omega \to \mathbb{R}^{d \times d}$ by
\begin{align*}
R_G :=
\begin{cases}
Q_G^{-1}, & \det Q_G \neq 0, \\
0, & \det Q_G = 0.
\end{cases}
\end{align*}
The value $0$ on the singular event is arbitrary; it is only used to give a total definition. Since $Q_G \xrightarrow{\mathbb{P}} Q$ and $\det Q \neq 0$, the event $\{\det Q_G \neq 0\}$ has probability tending to $1$. On that high-probability event, $R_G$ is exactly $Q_G^{-1}$. To prove convergence without treating the piecewise definition of $R_G$ as a globally continuous map, fix $\varepsilon>0$. Continuity of inversion at $Q$ gives a number $\delta>0$ such that $|A-Q|<\delta$ implies that $A$ is invertible and $|A^{-1}-Q^{-1}|<\varepsilon$. Therefore
\begin{align*}
\mathbb P\bigl(|R_G-Q^{-1}|>\varepsilon\bigr)
\leq
\mathbb P\bigl(|Q_G-Q|\geq\delta\bigr)
\to 0,
\end{align*}
so
\begin{align*}
R_G \xrightarrow{\mathbb{P}} Q^{-1}.
\end{align*}
Thus we may write $Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}$ with the understood convention that the inverse is taken on events whose probabilities tend to $1$. This is where positive definiteness is used: without invertibility of $Q$, there would be no finite matrix $Q^{-1}$ to serve as the limit.
[/guided]
[/step]
[step:Combine the three sandwich factors]
Define the residual cluster covariance matrix $M_G: \Omega \to \mathbb{R}^{d \times d}$ by
\begin{align*}
M_G := \frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}.
\end{align*}
By hypothesis,
\begin{align*}
M_G \xrightarrow{\mathbb{P}} \Omega_C.
\end{align*}
From the previous step,
\begin{align*}
Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}.
\end{align*}
Matrix multiplication is continuous as a map
\begin{align*}
\mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d}
&\to \mathbb{R}^{d \times d} \\
(A,B,C) &\mapsto ABC.
\end{align*}
Therefore,
\begin{align*}
Q_G^{-1}M_GQ_G^{-1}
\xrightarrow{\mathbb{P}}
Q^{-1}\Omega_CQ^{-1}.
\end{align*}
Using the scaled sandwich identity from the first step,
\begin{align*}
G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
=
Q_G^{-1}M_GQ_G^{-1},
\end{align*}
we obtain
\begin{align*}
G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
\xrightarrow{\mathbb{P}}
Q^{-1}\Omega_CQ^{-1}.
\end{align*}
[guided]
Let us isolate the middle factor by defining
\begin{align*}
M_G := \frac{1}{G}\sum_{g=1}^G \hat{S}_{g,G}\hat{S}_{g,G}^{\top}.
\end{align*}
This is the empirical residual cluster score covariance. The hypothesis says exactly that
\begin{align*}
M_G \xrightarrow{\mathbb{P}} \Omega_C.
\end{align*}
The previous step proved
\begin{align*}
Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}.
\end{align*}
Now consider the deterministic multiplication map
\begin{align*}
\mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d}\times \mathbb{R}^{d \times d}
&\to \mathbb{R}^{d \times d} \\
(A,B,C) &\mapsto ABC.
\end{align*}
This map is continuous because each entry of $ABC$ is a finite sum of products of entries of $A$, $B$, and $C$. Hence convergence in probability of the three matrix factors implies convergence in probability of their product:
\begin{align*}
Q_G^{-1}M_GQ_G^{-1}
\xrightarrow{\mathbb{P}}
Q^{-1}\Omega_CQ^{-1}.
\end{align*}
Finally, the algebraic identity from the first step says
\begin{align*}
G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
=
Q_G^{-1}M_GQ_G^{-1}.
\end{align*}
Substituting this identity into the preceding convergence proves
\begin{align*}
G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)
\xrightarrow{\mathbb{P}}
Q^{-1}\Omega_CQ^{-1}.
\end{align*}
This proves the consistency of the scaled cluster-robust variance estimator.
[/guided]
[/step]
[step:Identify the limiting distribution of the scaled estimator]
By assumption,
\begin{align*}
\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G}
\xrightarrow{d}
\mathcal{N}(0,\Omega_C),
\end{align*}
and the previous curvature argument gives $Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}$. Applying Slutsky's theorem to the product
\begin{align*}
Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G}
\end{align*}
gives
\begin{align*}
Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G}
\xrightarrow{d}
Q^{-1}Z,
\end{align*}
where $Z \sim \mathcal{N}(0,\Omega_C)$. Since linear transformations of Gaussian random vectors are Gaussian,
\begin{align*}
Q^{-1}Z \sim \mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}).
\end{align*}
The expansion
\begin{align*}
\sqrt{G}(\hat{\beta}_G-\beta_0)
=
Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G}
+
o_{\mathbb{P}}(1)
\end{align*}
and Slutsky's theorem again imply
\begin{align*}
\sqrt{G}(\hat{\beta}_G-\beta_0)
\xrightarrow{d}
\mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}).
\end{align*}
Thus $\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ is on the covariance scale of $\hat{\beta}_G$, while $G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ is on the covariance scale of $\sqrt{G}(\hat{\beta}_G-\beta_0)$. This completes the proof.
[guided]
The variance consistency result identifies the limiting covariance matrix. We now connect it to the limiting distribution of the estimator itself.
The assumed cluster [central limit theorem](/theorems/521) gives
\begin{align*}
\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G}
\xrightarrow{d}
\mathcal{N}(0,\Omega_C).
\end{align*}
Here each population cluster score is the random vector $S_{g,G}:\Omega\to\mathbb R^d$ introduced in the theorem statement. The curvature convergence already proved gives
\begin{align*}
Q_G^{-1} \xrightarrow{\mathbb{P}} Q^{-1}.
\end{align*}
Slutsky's theorem applies to a product of a random matrix converging in probability to a constant matrix and a random vector converging in distribution. Therefore
\begin{align*}
Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G}
\xrightarrow{d}
Q^{-1}Z,
\end{align*}
where $Z$ is a Gaussian random vector with distribution $\mathcal{N}(0,\Omega_C)$.
It remains to compute the covariance of $Q^{-1}Z$. Since $Q$ is positive definite, it is symmetric, so $(Q^{-1})^\top = Q^{-1}$. Hence
\begin{align*}
\operatorname{Var}(Q^{-1}Z)
=
Q^{-1}\operatorname{Var}(Z)(Q^{-1})^\top
=
Q^{-1}\Omega_CQ^{-1}.
\end{align*}
Thus
\begin{align*}
Q^{-1}Z \sim \mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}).
\end{align*}
Finally, the estimator admits the asymptotic linear expansion
\begin{align*}
\sqrt{G}(\hat{\beta}_G-\beta_0)
=
Q_G^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^G S_{g,G}
+
o_{\mathbb{P}}(1).
\end{align*}
The remainder term converges to $0$ in probability. Applying Slutsky's theorem once more gives
\begin{align*}
\sqrt{G}(\hat{\beta}_G-\beta_0)
\xrightarrow{d}
\mathcal{N}(0,Q^{-1}\Omega_CQ^{-1}).
\end{align*}
The estimator $\hat{\beta}_G$ fluctuates at order $G^{-1/2}$, so its covariance is of order $G^{-1}$. This is exactly why the unscaled estimator $\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ estimates the covariance scale of $\hat{\beta}_G$, while the scaled matrix $G\,\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat{\beta}_G)$ estimates the asymptotic covariance of $\sqrt{G}(\hat{\beta}_G-\beta_0)$. This completes the proof.
[/guided]
[/step]