Plug-In Variance Consistency for Nondegenerate U-Statistics

Theorem

Proof

[proofplan] We compare each leave-one empirical projection with the true first Hoeffding projection $h_1(X_i)$. The key estimate is that the average squared error of these projection estimates converges to $0$ in probability; this follows by expanding the conditional U-statistic error, observing that disjoint index configurations have conditional covariance $0$, and using the fourth-moment assumption to obtain the square-integrability needed for the overlapping configurations. Once the projection estimates are close in empirical $L^2$, their centred empirical variance is close to the empirical variance of $h_1(X_i)$, which converges to $\zeta_1$ by the ordinary law of large numbers. The studentised limit then follows from the nondegenerate U-statistic [central limit theorem](/theorems/521) and Slutsky's theorem. [/proofplan] [step:Introduce the leave-one conditional averages and isolate the projection error] Let $(\Omega,\mathcal F,\mathbb P)$ denote the probability space on which the i.i.d. $E$-valued sample $(X_i)_{i\geq 1}$ is defined. For $1\leq p<\infty$, write $L^p(\Omega,\mathcal F,\mathbb P)$ for the space of real-valued random variables $Z:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ satisfying $\mathbb E[|Z|^p]<\infty$, modulo equality $\mathbb P$-a.s. Define the target parameter $\theta:=\mathbb E[h(X_1,\dots,X_m)]$. For $n\geq m$, define the U-statistic \begin{align*} U_n:=\binom{n}{m}^{-1}\sum_{A\in\mathcal A_n} h((X_j)_{j\in A}), \end{align*} where $\mathcal A_n$ denotes the family of subsets $A\subset\{1,\dots,n\}$ with $|A|=m$. Define the first conditional projection function $g:E\to\mathbb R$ by \begin{align*} g(x):=\mathbb E[h(x,X_2,\dots,X_m)] \end{align*} for $x\in E$, with the convention that when $m=1$ this means $g(x)=h(x)$. Thus $h_1:E\to\mathbb R$ is defined by \begin{align*} h_1(x):=g(x)-\theta \end{align*} for $x\in E$, and $h_1(X_i)=g(X_i)-\theta$ for every $i\in\{1,\dots,n\}$. 
For $i \in \{1,\dots,n\}$, let $\mathcal A_{n,i}$ denote the family of subsets $A \subset \{1,\dots,n\}\setminus\{i\}$ with $|A|=m-1$. Define \begin{align*} V_{n,i}:=\binom{n-1}{m-1}^{-1}\sum_{A\in\mathcal A_{n,i}} h\bigl(X_i,(X_j)_{j\in A}\bigr). \end{align*} By the definition of the leave-one empirical projection in the theorem statement, $\widehat h_{1,i}=V_{n,i}-U_n$. Define the error variable \begin{align*} R_{n,i}:=V_{n,i}-g(X_i). \end{align*} Since $h_1(X_i)=g(X_i)-\theta$, we have \begin{align*} \widehat h_{1,i}-h_1(X_i)=R_{n,i}-(U_n-\theta). \end{align*} It is therefore enough to prove \begin{align*} \frac{1}{n}\sum_{i=1}^n R_{n,i}^2 \xrightarrow{\mathbb P}0 \qquad\text{and}\qquad U_n\xrightarrow{\mathbb P}\theta. \end{align*} The second convergence follows from the [Strong Law for U-Statistics](/theorems/4938), which gives the stronger conclusion $U_n\xrightarrow{a.s.}\theta$ and hence convergence in probability. Its hypotheses hold here because the sample $(X_i)_{i\geq 1}$ is i.i.d., the kernel $h$ is symmetric and measurable, and the fourth-moment assumption implies $h(X_1,\dots,X_m)\in L^1(\Omega,\mathcal F,\mathbb P)$. [/step] [step:Bound the average squared projection error] We prove \begin{align*} \mathbb E\left[\frac{1}{n}\sum_{i=1}^n R_{n,i}^2\right]\longrightarrow 0. \end{align*} If $m=1$, then $V_{n,i}=h(X_i)$ and $g(X_i)=h(X_i)$ for every $i$, so $R_{n,i}=0$ for every $i$ and the desired convergence holds. Hence assume $m\geq 2$ for the rest of this step. By exchangeability of $X_1,\dots,X_n$, \begin{align*} \mathbb E\left[\frac{1}{n}\sum_{i=1}^n R_{n,i}^2\right]=\mathbb E[R_{n,1}^2]. \end{align*} Condition on $X_1$. For each subset $A \subset \{2,\dots,n\}$ with $|A|=m-1$, define $Y_A:=h(X_1,(X_j)_{j\in A})-\mathbb E[h(X_1,X_2,\dots,X_m)\mid X_1]$. Then, using the already defined family $\mathcal A_{n,1}$, \begin{align*} R_{n,1}=\binom{n-1}{m-1}^{-1}\sum_{A\in\mathcal A_{n,1}}Y_A. 
\end{align*} For every $A\in\mathcal A_{n,1}$, the random vector $(X_j)_{j\in A}$ has the same law as $(X_2,\dots,X_m)$ and is independent of $X_1$, so \begin{align*} \mathbb E[h(X_1,(X_j)_{j\in A})\mid X_1]=g(X_1) \end{align*} and therefore $\mathbb E[Y_A\mid X_1]=0$. For two such subsets $A$ and $B$, if $A\cap B=\varnothing$, then $Y_A$ and $Y_B$ are conditionally independent given $X_1$, hence \begin{align*} \mathbb E[Y_A Y_B\mid X_1]=\mathbb E[Y_A\mid X_1]\mathbb E[Y_B\mid X_1]=0. \end{align*} Thus only pairs with $A\cap B\neq \varnothing$ contribute after taking expectation. The fourth-moment assumption implies $h(X_1,\dots,X_m)\in L^2(\Omega,\mathcal F,\mathbb P)$. By [Jensen's inequality](/theorems/9) for [conditional expectation](/page/Conditional%20Expectation), $g(X_1)=\mathbb E[h(X_1,\dots,X_m)\mid X_1]$ also belongs to $L^2(\Omega,\mathcal F,\mathbb P)$. Hence there is a finite constant \begin{align*} C_h:=\mathbb E\left[\left(h(X_1,\dots,X_m)-g(X_1)\right)^2\right]<\infty \end{align*} such that, by the [Cauchy-Schwarz Inequality](/theorems/432) in $L^2(\Omega,\mathcal F,\mathbb P)$ applied to $Y_A$ and $Y_B$, \begin{align*} |\mathbb E[Y_A Y_B]|\leq \mathbb E[|Y_A Y_B|]\leq \left(\mathbb E[Y_A^2]\right)^{1/2}\left(\mathbb E[Y_B^2]\right)^{1/2}=C_h \end{align*} for all $A,B$. The number of subsets $A \subset \{2,\dots,n\}$ with $|A|=m-1$ is $\binom{n-1}{m-1}$. For a fixed $A$, the number of subsets $B \subset \{2,\dots,n\}$ with $|B|=m-1$ and $A\cap B\neq \varnothing$ is at most \begin{align*} (m-1)\binom{n-2}{m-2}. \end{align*} Therefore \begin{align*} \mathbb E[R_{n,1}^2]\leq \binom{n-1}{m-1}^{-2}\binom{n-1}{m-1}(m-1)\binom{n-2}{m-2}C_h. \end{align*} Equivalently, \begin{align*} \mathbb E[R_{n,1}^2]\leq C_h(m-1)\frac{\binom{n-2}{m-2}}{\binom{n-1}{m-1}}. \end{align*} Since \begin{align*} \frac{\binom{n-2}{m-2}}{\binom{n-1}{m-1}}=\frac{m-1}{n-1}, \end{align*} we obtain \begin{align*} \mathbb E[R_{n,1}^2]\leq C_h(m-1)\frac{m-1}{n-1}\longrightarrow 0. 
\end{align*} The [Markov Inequality](/theorems/514) applied to the nonnegative [random variable](/page/Random%20Variable) \begin{align*} \frac{1}{n}\sum_{i=1}^n R_{n,i}^2 \end{align*} gives \begin{align*} \frac{1}{n}\sum_{i=1}^n R_{n,i}^2 \xrightarrow{\mathbb P}0. \end{align*} [guided] We want to show that the empirical leave-one averages behave like the conditional expectation defining $h_1$. Recall the objects being compared. The conditional projection function $g:E\to\mathbb R$ is defined by $g(x)=\mathbb E[h(x,X_2,\dots,X_m)]$ for $x\in E$, with $g(x)=h(x)$ when $m=1$. For $i\in\{1,\dots,n\}$, the leave-one average is \begin{align*} V_{n,i}:=\binom{n-1}{m-1}^{-1}\sum_{A\in\mathcal A_{n,i}} h\bigl(X_i,(X_j)_{j\in A}\bigr), \end{align*} where $\mathcal A_{n,i}$ is the family of subsets $A\subset\{1,\dots,n\}\setminus\{i\}$ with $|A|=m-1$. The projection error is $R_{n,i}:=V_{n,i}-g(X_i)$. The natural quantity to control is the average squared error \begin{align*} \frac{1}{n}\sum_{i=1}^n R_{n,i}^2. \end{align*} If $m=1$, then $V_{n,i}=h(X_i)$ and $g(X_i)=h(X_i)$ for every $i$, so $R_{n,i}=0$ for every $i$. Thus the desired convergence is immediate in that case. We now assume $m\geq 2$. Because the observations are i.i.d. and the construction is symmetric in the index $i$, exchangeability gives \begin{align*} \mathbb E\left[\frac{1}{n}\sum_{i=1}^n R_{n,i}^2\right]=\mathbb E[R_{n,1}^2]. \end{align*} So it is enough to prove that the error for the first observation has vanishing second moment. Fix $X_1$ and average over the remaining observations. For every subset $A \subset \{2,\dots,n\}$ with $|A|=m-1$, define $Y_A:=h(X_1,(X_j)_{j\in A})-\mathbb E[h(X_1,X_2,\dots,X_m)\mid X_1]$. This is the centred contribution of the tuple using $X_1$ and the observations indexed by $A$. 
The vector $(X_j)_{j\in A}$ has the same distribution as $(X_2,\dots,X_m)$ and is independent of $X_1$, so \begin{align*} \mathbb E[h(X_1,(X_j)_{j\in A})\mid X_1]=g(X_1), \end{align*} and therefore $\mathbb E[Y_A\mid X_1]=0$. Moreover, using $\mathcal A_{n,1}$ for the subsets of $\{2,\dots,n\}$ with cardinality $m-1$, \begin{align*} R_{n,1}=\binom{n-1}{m-1}^{-1}\sum_{A\in\mathcal A_{n,1}}Y_A. \end{align*} The important point is that most pairs of summands are conditionally uncorrelated. If $A\cap B=\varnothing$, then the random vectors $(X_j)_{j\in A}$ and $(X_j)_{j\in B}$ are conditionally independent given $X_1$. Since both $Y_A$ and $Y_B$ have conditional mean $0$, conditional independence gives \begin{align*} \mathbb E[Y_A Y_B\mid X_1] = \mathbb E[Y_A\mid X_1]\mathbb E[Y_B\mid X_1] = 0. \end{align*} Taking expectations, $\mathbb E[Y_A Y_B]=0$ for disjoint $A$ and $B$, so only overlapping pairs of subsets can contribute to $\mathbb E[R_{n,1}^2]$. For overlapping pairs, we use a uniform second-moment bound. The fourth-moment assumption gives $h(X_1,\dots,X_m)\in L^2(\Omega,\mathcal F,\mathbb P)$. [Jensen's inequality](/theorems/1977) for conditional expectation gives \begin{align*} \mathbb E[g(X_1)^2]\leq \mathbb E[h(X_1,\dots,X_m)^2]<\infty. \end{align*} Therefore the constant \begin{align*} C_h:=\mathbb E\left[\left(h(X_1,\dots,X_m)-g(X_1)\right)^2\right]<\infty \end{align*} is finite. For any two subsets $A$ and $B$ of $\{2,\dots,n\}$ with $|A|=|B|=m-1$, the random variables $Y_A$ and $Y_B$ have the same second moment $C_h$, because the observations other than $X_1$ are i.i.d. and independent of $X_1$, and the kernel is symmetric. The [Cauchy-Schwarz Inequality](/theorems/432) in $L^2(\Omega,\mathcal F,\mathbb P)$ gives \begin{align*} |\mathbb E[Y_A Y_B]|\leq \mathbb E[|Y_A Y_B|]\leq \left(\mathbb E[Y_A^2]\right)^{1/2}\left(\mathbb E[Y_B^2]\right)^{1/2}=C_h. \end{align*} This is the bound used for every overlapping pair $A,B$. It remains to count how many overlapping pairs there are. There are $\binom{n-1}{m-1}$ choices for $A$. 
Once $A$ is fixed, every subset $B$ with $|B|=m-1$ and $A\cap B\neq\varnothing$ can be obtained by first choosing one element of $A$ to lie in $B$, in $m-1$ ways, and then choosing the remaining $m-2$ elements of $B$ from the other $n-2$ indices in $\{2,\dots,n\}$. This gives the upper bound \begin{align*} (m-1)\binom{n-2}{m-2}. \end{align*} This procedure counts a set $B$ more than once when it shares more than one element with $A$, but an upper bound is all we need. Expanding the square of $R_{n,1}$ and bounding each overlapping pair by $C_h$ therefore gives \begin{align*} \mathbb E[R_{n,1}^2]\leq \binom{n-1}{m-1}^{-2}\binom{n-1}{m-1}(m-1)\binom{n-2}{m-2}C_h. \end{align*} Canceling one copy of $\binom{n-1}{m-1}$ gives \begin{align*} \mathbb E[R_{n,1}^2]\leq C_h(m-1)\frac{\binom{n-2}{m-2}}{\binom{n-1}{m-1}}. \end{align*} Since $\binom{n-1}{m-1}=\frac{n-1}{m-1}\binom{n-2}{m-2}$, the ratio of binomial coefficients is \begin{align*} \frac{\binom{n-2}{m-2}}{\binom{n-1}{m-1}}=\frac{m-1}{n-1}. \end{align*} Consequently \begin{align*} \mathbb E[R_{n,1}^2]\leq C_h(m-1)\frac{m-1}{n-1}\longrightarrow 0. \end{align*} Since the expectation of the nonnegative random variable \begin{align*} \frac{1}{n}\sum_{i=1}^n R_{n,i}^2 \end{align*} tends to $0$, the [Markov Inequality](/theorems/514) implies \begin{align*} \frac{1}{n}\sum_{i=1}^n R_{n,i}^2 \xrightarrow{\mathbb P}0. \end{align*} [/guided] [/step] [step:Transfer empirical second moments from estimated projections to true projections] For $i\in\{1,\dots,n\}$, define $a_{n,i}:=\widehat h_{1,i}$, $b_{n,i}:=h_1(X_i)$, and $d_{n,i}:=a_{n,i}-b_{n,i}$. Since $d_{n,i}=R_{n,i}-(U_n-\theta)$, the elementary inequality $(x-y)^2\leq 2x^2+2y^2$, the preceding step, and $U_n\xrightarrow{\mathbb P}\theta$ give \begin{align*} \frac{1}{n}\sum_{i=1}^n d_{n,i}^2 \leq \frac{2}{n}\sum_{i=1}^n R_{n,i}^2 + 2(U_n-\theta)^2 \xrightarrow{\mathbb P}0. \end{align*} Writing $a_{n,i}^2-b_{n,i}^2=d_{n,i}^2+2b_{n,i}d_{n,i}$ and applying the [Cauchy-Schwarz Inequality](/theorems/432) for finite sums to the cross term, \begin{align*} \left|\frac{1}{n}\sum_{i=1}^n a_{n,i}^2-\frac{1}{n}\sum_{i=1}^n b_{n,i}^2\right|\leq \frac{1}{n}\sum_{i=1}^n d_{n,i}^2+2\left(\frac{1}{n}\sum_{i=1}^n d_{n,i}^2\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^n b_{n,i}^2\right)^{1/2}. 
\end{align*} Since $h_1(X_1)\in L^2(\Omega,\mathcal F,\mathbb P)$, in particular $h_1(X_1)^2\in L^1(\Omega,\mathcal F,\mathbb P)$, the [Weak Law of Large Numbers](/theorems/1851) gives \begin{align*} \frac{1}{n}\sum_{i=1}^n b_{n,i}^2 \xrightarrow{\mathbb P} \mathbb E[h_1(X_1)^2] = \zeta_1. \end{align*} Thus \begin{align*} \frac{1}{n}\sum_{i=1}^n \widehat h_{1,i}^2 - \frac{1}{n}\sum_{i=1}^n h_1(X_i)^2 \xrightarrow{\mathbb P}0. \end{align*} The same [Cauchy-Schwarz Inequality](/theorems/432) estimate gives \begin{align*} \left| \frac{1}{n}\sum_{i=1}^n \widehat h_{1,i} - \frac{1}{n}\sum_{i=1}^n h_1(X_i) \right| \leq \left(\frac{1}{n}\sum_{i=1}^n d_{n,i}^2\right)^{1/2} \xrightarrow{\mathbb P}0. \end{align*} By the [tower property of conditional expectation](/theorems/1150), \begin{align*} \mathbb E[h_1(X_1)] = \mathbb E[g(X_1)]-\theta = \mathbb E[h(X_1,\dots,X_m)]-\theta =0. \end{align*} Since $\mathbb E[h_1(X_1)]=0$ and $h_1(X_1)\in L^1(\Omega,\mathcal F,\mathbb P)$, the [Weak Law of Large Numbers](/theorems/1851) also gives \begin{align*} \frac{1}{n}\sum_{i=1}^n h_1(X_i)\xrightarrow{\mathbb P}0. \end{align*} Therefore \begin{align*} \frac{1}{n}\sum_{i=1}^n \widehat h_{1,i}\xrightarrow{\mathbb P}0. \end{align*} [/step] [step:Identify the limit of the centred empirical variance] The quantity $\widehat\zeta_1$ is the centred empirical variance of the estimated first-projection values $\widehat h_{1,1},\dots,\widehat h_{1,n}$. By the algebraic identity for the centred empirical second moment, \begin{align*} \widehat\zeta_1 = \frac{1}{n}\sum_{i=1}^n \widehat h_{1,i}^2 - \left(\frac{1}{n}\sum_{i=1}^n \widehat h_{1,i}\right)^2. \end{align*} The previous step gives \begin{align*} \frac{1}{n}\sum_{i=1}^n \widehat h_{1,i}^2 \xrightarrow{\mathbb P} \zeta_1 \qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^n \widehat h_{1,i} \xrightarrow{\mathbb P}0. 
\end{align*} Since the square map $x\mapsto x^2$ is continuous on $\mathbb R$, the [Continuous Mapping Theorem](/theorems/1847) gives \begin{align*} \left(\frac{1}{n}\sum_{i=1}^n \widehat h_{1,i}\right)^2 \xrightarrow{\mathbb P}0. \end{align*} Hence \begin{align*} \widehat\zeta_1\xrightarrow{\mathbb P}\zeta_1. \end{align*} [/step] [step:Apply studentisation through Slutsky's theorem] By assumption $\zeta_1>0$, and the consistency just proved implies \begin{align*} m\sqrt{\widehat\zeta_1}\xrightarrow{\mathbb P}m\sqrt{\zeta_1}. \end{align*} Since $\widehat\zeta_1\geq 0$ by definition and $\widehat\zeta_1\xrightarrow{\mathbb P}\zeta_1>0$, we also have $\mathbb P(\widehat\zeta_1>0)\to 1$. Thus the reciprocal factor $1/\sqrt{\widehat\zeta_1}$ is asymptotically well-defined. Define the continuous map $q:(0,\infty)\to\mathbb R$ by \begin{align*} q(x):=\frac{m\sqrt{\zeta_1}}{m\sqrt{x}}. \end{align*} The [Continuous Mapping Theorem](/theorems/1847) applied to $q$ gives \begin{align*} \frac{m\sqrt{\zeta_1}}{m\sqrt{\widehat\zeta_1}}\xrightarrow{\mathbb P}1. \end{align*} The [Central Limit Theorem for Nondegenerate U-Statistics](/theorems/4939) applies in the following standard form: if $(X_i)_{i\geq1}$ is i.i.d., $h$ is symmetric and measurable, $h(X_1,\dots,X_m)\in L^2(\Omega,\mathcal F,\mathbb P)$, and the first Hoeffding projection satisfies $\operatorname{Var}(h_1(X_1))=\zeta_1>0$, then $\sqrt n(U_n-\theta)/(m\sqrt{\zeta_1})$ converges in distribution to $\mathcal N(0,1)$. Its hypotheses hold here because $(X_i)_{i\geq1}$ is i.i.d., the kernel $h$ is symmetric and measurable, $h(X_1,\dots,X_m)\in L^2(\Omega,\mathcal F,\mathbb P)$ by the fourth-moment assumption, and the first Hoeffding projection has positive variance $\zeta_1>0$. Therefore \begin{align*} \frac{\sqrt n(U_n-\theta)}{m\sqrt{\zeta_1}}\xrightarrow{d}\mathcal N(0,1). 
\end{align*} [Slutsky's Lemma](/theorems/1850), applied to the preceding convergence in distribution and to the denominator ratio, which converges in probability to $1$, yields \begin{align*} \frac{\sqrt n(U_n-\theta)}{m\sqrt{\widehat\zeta_1}}=\frac{\sqrt n(U_n-\theta)}{m\sqrt{\zeta_1}}\cdot\frac{m\sqrt{\zeta_1}}{m\sqrt{\widehat\zeta_1}}\xrightarrow{d}\mathcal N(0,1). \end{align*} This proves both the plug-in variance consistency $\widehat\zeta_1\xrightarrow{\mathbb P}\zeta_1$ and the studentised [central limit theorem](/theorems/1848). [/step]
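
The quantities in the proof can be checked numerically. The following is a minimal sketch, not part of the source: it assumes order $m=2$, the illustrative kernel $h(x,y)=(x-y)^2/2$ (for which $U_n$ is the unbiased sample variance), and standard normal data, so that $\theta=1$, $h_1(x)=(x^2-1)/2$ and $\zeta_1=1/2$. The function names `half_squared_difference`, `plug_in_variance` and `studentised` are hypothetical.

```python
import numpy as np


def half_squared_difference(x, y):
    """Illustrative symmetric kernel h(x, y) = (x - y)^2 / 2 (an assumption)."""
    return 0.5 * (x - y) ** 2


def plug_in_variance(x, kernel):
    """Return (U_n, zeta1_hat) for a symmetric kernel of order m = 2."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # H[i, j] = h(X_i, X_j); diagonal entries are not part of the U-statistic.
    H = kernel(x[:, None], x[None, :])
    diag = np.diag(H)
    # V_{n,i}: average of h(X_i, X_j) over the n - 1 indices j != i.
    V = (H.sum(axis=1) - diag) / (n - 1)
    # U_n: average over ordered off-diagonal pairs, which equals the
    # average over unordered pairs because the kernel is symmetric.
    U = (H.sum() - diag.sum()) / (n * (n - 1))
    # Leave-one empirical projections and their centred empirical variance.
    h_hat = V - U
    zeta1_hat = np.mean(h_hat ** 2) - np.mean(h_hat) ** 2
    return U, zeta1_hat


def studentised(x, kernel, theta):
    """Studentised statistic sqrt(n) (U_n - theta) / (m sqrt(zeta1_hat)), m = 2."""
    U, zeta1_hat = plug_in_variance(x, kernel)
    n = np.asarray(x).size
    return np.sqrt(n) * (U - theta) / (2.0 * np.sqrt(zeta1_hat))


rng = np.random.default_rng(0)
sample = rng.standard_normal(2000)
U_n, zeta1_hat = plug_in_variance(sample, half_squared_difference)
t_n = studentised(sample, half_squared_difference, theta=1.0)
print(U_n, zeta1_hat, t_n)
```

Under these assumptions `zeta1_hat` should be close to $1/2$ for large samples, and repeating the experiment over many independent samples should give values of `t_n` that look approximately standard normal. The sketch stores the full $n\times n$ kernel matrix, so it is meant for moderate $n$ only.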
