Asymptotic Normality of Two-Sample Linear Rank Statistics (Theorem #6345)

Proof

[proofplan]
The proof reduces the rank statistic to a finite-population sampling problem. Under the continuous null hypothesis, the ranks occupied by the $X$-sample form a uniformly chosen subset of size $m_N$ of $\{1,\dots,N\}$. After centering the deterministic scores, the statistic is therefore the sum of $m_N$ elements sampled by simple random sampling without replacement from a finite population with mean zero and variance $s_N^2$. The stated maximum-score assumption is exactly the Lindeberg-type hypothesis in the Hájek finite-population [central limit theorem](/theorems/521) for simple random sampling without replacement, which gives convergence in distribution to the standard normal distribution after normalization by the finite-population variance factor.
[/proofplan]

[step:Show that the $X$-sample ranks form a uniform subset]
For each $N$, let $R_{N,i}$ denote the rank of $X_{N,i}$ among the $N$ pooled observations, and define the random subset $S_N=\{R_{N,1},\dots,R_{N,m_N}\}$ of $\{1,\dots,N\}$. Because the common distribution under the null is continuous, ties occur with probability zero. Since the $N$ pooled observations are independent and identically distributed, they are exchangeable, so each of the $N!$ strict orderings has probability $1/N!$. Hence, for every subset $A\subset \{1,\dots,N\}$ with $|A|=m_N$,
\begin{align*}
\mathbb{P}(S_N=A)=\frac{m_N!n_N!}{N!}=\binom{N}{m_N}^{-1}.
\end{align*}
Thus $S_N$ is a uniformly chosen subset of $\{1,\dots,N\}$ of cardinality $m_N$.

[guided]
For each $N$, let $R_{N,i}$ denote the rank of $X_{N,i}$ among the $N$ pooled observations, and define $S_N=\{R_{N,1},\dots,R_{N,m_N}\}$ as the subset of $\{1,\dots,N\}$ occupied by the observations from the first sample. The continuity assumption is used here and only here: since the common distribution is continuous, the probability that two pooled observations are equal is zero. Therefore the pooled observations have a strict ordering with probability one.
Because the $N$ observations are independent and identically distributed, their joint distribution is exchangeable. Equivalently, on the no-tie event every permutation of the $N$ labels among the ordered positions has the same probability. Since there are $N!$ possible strict orderings, each ordering has probability $1/N!$.

Now fix a subset $A\subset \{1,\dots,N\}$ with $|A|=m_N$. The event $S_N=A$ says exactly that the $m_N$ observations labelled $X$ occupy the ordered positions in $A$, while the $n_N$ observations labelled $Y$ occupy the complement $\{1,\dots,N\}\setminus A$. Once the positions in $A$ are fixed, the $X$ labels may be assigned to those positions in $m_N!$ ways, and the $Y$ labels may be assigned to the remaining positions in $n_N!$ ways. Hence
\begin{align*}
\mathbb{P}(S_N=A) = \frac{m_N!n_N!}{N!} = \binom{N}{m_N}^{-1}.
\end{align*}
This probability is the same for every subset $A$ of size $m_N$, so $S_N$ is a uniformly chosen subset of $\{1,\dots,N\}$ with cardinality $m_N$.
[/guided]
[/step]

[step:Rewrite the centered rank statistic as a sampled finite-population sum]
Define the centered score population $b_N:\{1,\dots,N\}\to \mathbb{R}$ by
\begin{align*}
b_N(r)=a_N(r)-\bar a_N.
\end{align*}
By the definitions of $\bar a_N$ and $s_N^2$,
\begin{align*}
\frac{1}{N}\sum_{r=1}^{N} b_N(r)=0, \qquad \frac{1}{N}\sum_{r=1}^{N} b_N(r)^2=s_N^2.
\end{align*}
Define the two-sample linear rank statistic $L_N^{(a)}:\mathbb{R}^{N}\to\mathbb{R}$ on no-tie pooled samples by
\begin{align*}
L_N^{(a)}(Z_N)=\sum_{i=1}^{m_N} a_N(R_{N,i}),
\end{align*}
where $R_{N,i}$ is the rank of $X_{N,i}$ among the pooled observations. Subtracting $\bar a_N$ from each of the $m_N$ selected scores gives
\begin{align*}
L_N^{(a)}(Z_N)-m_N\bar a_N = \sum_{i=1}^{m_N}\bigl(a_N(R_{N,i})-\bar a_N\bigr).
\end{align*}
Since $S_N$ is the set of the selected ranks and $b_N(r)=a_N(r)-\bar a_N$ for each $r\in\{1,\dots,N\}$, this is equivalently
\begin{align*}
L_N^{(a)}(Z_N)-m_N\bar a_N = \sum_{r\in S_N} b_N(r).
\end{align*}
Thus the centered rank statistic is the sum of $m_N$ values sampled without replacement from the finite population $\{b_N(1),\dots,b_N(N)\}$.

[guided]
The purpose of this step is to remove the rank-statistic notation and replace it with a finite-population sum. Define $b_N:\{1,\dots,N\}\to\mathbb{R}$ by $b_N(r)=a_N(r)-\bar a_N$. By the definition of $\bar a_N$, this population has mean zero, and by the definition of $s_N^2$, it has variance $s_N^2$:
\begin{align*}
\frac{1}{N}\sum_{r=1}^{N} b_N(r)=0, \qquad \frac{1}{N}\sum_{r=1}^{N} b_N(r)^2=s_N^2.
\end{align*}
We also define the statistic explicitly. On the no-tie event, let $L_N^{(a)}:\mathbb{R}^{N}\to\mathbb{R}$ be given by
\begin{align*}
L_N^{(a)}(Z_N)=\sum_{i=1}^{m_N}a_N(R_{N,i}),
\end{align*}
where $R_{N,i}$ is the rank of $X_{N,i}$ among the pooled observations. Subtracting $m_N\bar a_N$ subtracts the same centering constant from each selected score:
\begin{align*}
L_N^{(a)}(Z_N)-m_N\bar a_N = \sum_{i=1}^{m_N}\bigl(a_N(R_{N,i})-\bar a_N\bigr).
\end{align*}
Since $S_N$ is exactly the set of selected ranks and $b_N(r)=a_N(r)-\bar a_N$, the last display is the same as
\begin{align*}
L_N^{(a)}(Z_N)-m_N\bar a_N = \sum_{r\in S_N}b_N(r).
\end{align*}
This is the finite-population reformulation needed for Hájek's theorem: the randomness is only the uniformly selected subset $S_N$, while the values $b_N(1),\dots,b_N(N)$ are deterministic.
[/guided]
[/step]

[step:Apply Hájek's finite-population central limit theorem]
Define the normalized finite population $c_N:\{1,\dots,N\}\to \mathbb{R}$ by
\begin{align*}
c_N(r)=\frac{b_N(r)}{s_N}.
\end{align*}
Then
\begin{align*}
\frac{1}{N}\sum_{r=1}^{N} c_N(r)=0, \qquad \frac{1}{N}\sum_{r=1}^{N} c_N(r)^2=1,
\end{align*}
and the assumed maximum condition becomes
\begin{align*}
\frac{\max_{1\leq r\leq N}|c_N(r)|}{\sqrt{N}}\to 0.
\end{align*}
We use the following form of the Hájek finite-population [central limit theorem](/theorems/1848) for simple random sampling without replacement. If $d_N:\{1,\dots,N\}\to\mathbb{R}$ is a deterministic population satisfying
\begin{align*}
\frac{1}{N}\sum_{r=1}^{N} d_N(r)=0,
\end{align*}
\begin{align*}
\frac{1}{N}\sum_{r=1}^{N} d_N(r)^2=1,
\end{align*}
and
\begin{align*}
\frac{\max_{1\leq r\leq N}|d_N(r)|}{\sqrt{N}}\to 0,
\end{align*}
and if $T_N$ is a uniformly chosen subset of $\{1,\dots,N\}$ with $|T_N|=m_N$ and $m_N/N\to\lambda\in(0,1)$, then
\begin{align*}
\frac{\sum_{r\in T_N} d_N(r)}{\sqrt{m_N(N-m_N)/(N-1)}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}
This statement uses the finite-population variance convention $N^{-1}\sum_{r=1}^{N}d_N(r)^2=1$, so the sampling variance factor is $m_N(N-m_N)/(N-1)$.

We apply it with $d_N=c_N$ and $T_N=S_N$. The mean, variance, and maximum conditions were verified above, the sampling fraction condition $m_N/N\to\lambda\in(0,1)$ is a hypothesis of the theorem, and the first step proved that $S_N$ is uniformly sampled without replacement. Since $N-m_N=n_N$, the theorem gives
\begin{align*}
\frac{\sum_{r\in S_N} c_N(r)}{\sqrt{\frac{m_Nn_N}{N-1}}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}
Multiplying numerator and denominator by $s_N$, and using $b_N(r)=s_Nc_N(r)$, gives
\begin{align*}
\frac{\sum_{r\in S_N} b_N(r)}{\sqrt{\frac{m_Nn_N}{N-1}s_N^2}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}

[guided]
We now apply the finite-population central limit theorem to the deterministic population after normalizing it to have variance one. Define $c_N:\{1,\dots,N\}\to\mathbb{R}$ by
\begin{align*}
c_N(r)=\frac{b_N(r)}{s_N}.
\end{align*}
This definition is valid because $s_N^2>0$, hence $s_N>0$. The centering and variance computations give
\begin{align*}
\frac{1}{N}\sum_{r=1}^{N}c_N(r)=0, \qquad \frac{1}{N}\sum_{r=1}^{N}c_N(r)^2=1.
\end{align*}
The maximum condition in the theorem statement becomes
\begin{align*}
\frac{\max_{1\leq r\leq N}|c_N(r)|}{\sqrt{N}} = \frac{\max_{1\leq r\leq N}|a_N(r)-\bar a_N|}{\sqrt{N}s_N} \to 0.
\end{align*}
The Hájek finite-population central limit theorem applies to a deterministic population $d_N:\{1,\dots,N\}\to\mathbb{R}$ with mean zero, variance one, and maximum absolute value $o(\sqrt{N})$, sampled by a uniformly chosen subset $T_N$ of size $m_N$ with $m_N/N\to\lambda\in(0,1)$. We apply it with $d_N=c_N$ and $T_N=S_N$. The uniform-subset hypothesis was proved in the first step, and the sampling fraction hypothesis is part of the theorem statement. Therefore
\begin{align*}
\frac{\sum_{r\in S_N}c_N(r)}{\sqrt{m_N(N-m_N)/(N-1)}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}
Since $N-m_N=n_N$ and $b_N(r)=s_Nc_N(r)$ for every $r\in\{1,\dots,N\}$, this is equivalently
\begin{align*}
\frac{\sum_{r\in S_N}b_N(r)}{\sqrt{\frac{m_Nn_N}{N-1}s_N^2}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}
[/guided]
[/step]

[step:Return from the sampled-score sum to the rank statistic]
From the finite-population representation already proved,
\begin{align*}
\sum_{r\in S_N} b_N(r)=L_N^{(a)}(Z_N)-m_N\bar a_N.
\end{align*}
Substituting this identity into the convergence obtained from Hájek's theorem yields
\begin{align*}
\frac{L_N^{(a)}(Z_N)-m_N\bar a_N}{\sqrt{\frac{m_Nn_N}{N-1}s_N^2}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}
This is precisely the asserted asymptotic normality of the two-sample linear rank statistic for the sample sizes $m_N$ and $n_N$.

[guided]
The previous step proved the asymptotic normality of the sampled centered-score sum:
\begin{align*}
\frac{\sum_{r\in S_N} b_N(r)}{\sqrt{\frac{m_Nn_N}{N-1}s_N^2}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}
The finite-population representation proved earlier identifies this numerator exactly with the centered linear rank statistic:
\begin{align*}
\sum_{r\in S_N} b_N(r)=L_N^{(a)}(Z_N)-m_N\bar a_N.
\end{align*}
Substituting this identity gives
\begin{align*}
\frac{L_N^{(a)}(Z_N)-m_N\bar a_N}{\sqrt{\frac{m_Nn_N}{N-1}s_N^2}} \xrightarrow{d} \mathcal{N}(0,1).
\end{align*}
This is exactly the claimed convergence under the continuous two-sample null.
[/guided]
[/step]
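
As a numerical sanity check of the finite-population reformulation, the following Python sketch may be helpful. It assumes Wilcoxon scores $a_N(r)=r$, which are an illustrative choice and not part of the theorem statement. It first enumerates every size-$m_N$ subset for a small $N$ to confirm that $\sum_{r\in S_N}b_N(r)$ has mean zero and variance $\frac{m_Nn_N}{N-1}s_N^2$, then simulates pooled uniform samples and standardizes the rank statistic as in the theorem. The function names, sample sizes, and seed are hypothetical and exist only for this example.

```python
import itertools
import math
import random


def centered_scores(a):
    """Return b(r) = a(r) - mean(a) and the population variance s^2 (divisor N)."""
    N = len(a)
    a_bar = sum(a) / N
    b = [x - a_bar for x in a]
    s2 = sum(x * x for x in b) / N
    return b, s2


def exact_moments(a, m):
    """Exact mean and variance of sum_{r in S} b(r) when S is a uniformly
    chosen size-m subset, obtained by enumerating all subsets."""
    b, _ = centered_scores(a)
    sums = [sum(b[r] for r in S)
            for S in itertools.combinations(range(len(a)), m)]
    mean = sum(sums) / len(sums)
    var = sum((t - mean) ** 2 for t in sums) / len(sums)
    return mean, var


def simulate_standardized(a, m, reps, rng):
    """Draw pooled i.i.d. uniform samples, rank them, and return the
    standardized linear rank statistic for each replication."""
    N = len(a)
    n = N - m
    a_bar = sum(a) / N
    _, s2 = centered_scores(a)
    scale = math.sqrt(m * n / (N - 1) * s2)
    out = []
    for _ in range(reps):
        z = [rng.random() for _ in range(N)]         # continuous law: ties have probability zero
        order = sorted(range(N), key=z.__getitem__)  # order[k] is the index holding rank k + 1
        rank = [0] * N
        for k, i in enumerate(order):
            rank[i] = k + 1
        L = sum(a[rank[i] - 1] for i in range(m))    # the first m observations play the X-sample
        out.append((L - m * a_bar) / scale)
    return out


# Exact check with Wilcoxon scores a(r) = r (assumed for illustration).
N, m = 8, 3
a = list(range(1, N + 1))
b, s2 = centered_scores(a)
mean, var = exact_moments(a, m)
predicted_var = m * (N - m) / (N - 1) * s2

# Monte Carlo check of approximate normality for a larger pooled sample.
rng = random.Random(0)
t = simulate_standardized(list(range(1, 41)), 15, 5000, rng)
sim_mean = sum(t) / len(t)
sim_var = sum((x - sim_mean) ** 2 for x in t) / len(t)
frac_within_1_96 = sum(abs(x) <= 1.96 for x in t) / len(t)
```

For Wilcoxon scores, $s_N^2=(N^2-1)/12$, so the variance factor reduces to $m_Nn_N(N+1)/12$, which equals $11.25$ for $N=8$ and $m_N=3$. In the simulation, `sim_mean`, `sim_var`, and `frac_within_1_96` should land near $0$, $1$, and $0.95$ respectively if the normal approximation is reasonable at these sample sizes.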
