Uniform Stochastic Rate for Kernel Density Estimators over Compact Sets

Theorem

Proof

[proofplan] We rewrite the centered kernel density estimator as the supremum of an empirical process over a fixed-bandwidth translate class. The boundedness and compact support of $K$, together with the boundedness of $f$, give an envelope of size $h^{-d}$ and a variance bound of size $h^{-d}$. The pointwise measurable VC-subgraph assumption supplies measurable suprema and polynomial covering numbers, so the standard VC-type empirical-process maximal inequality, followed by Talagrand concentration, gives a stochastic bound with square-root and linear terms. The bandwidth assumptions replace $\log(1/h_n)$ by $\log n$ and make the linear term negligible. [/proofplan] [step:Rewrite the centered estimator as an empirical-process supremum] Let $\mathcal L^d$ denote $d$-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $(\mathbb R^d,\mathcal B(\mathbb R^d))$. Let $P$ denote the probability law of $X_1$ on $(\mathbb R^d,\mathcal B(\mathbb R^d))$. For $y\in\mathbb R^d$, let $\delta_y$ denote the Dirac probability measure at $y$, defined by $\delta_y(B)=\mathbb 1_B(y)$ for every $B\in\mathcal B(\mathbb R^d)$. Define the empirical probability measure $P_n$ by \begin{align*} P_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}. \end{align*} For each $h>0$, the theorem statement defines the kernel density estimator $\hat f_{n,h}:\mathbb R^d\to\mathbb R$ by \begin{align*} \hat f_{n,h}(x)=\frac{1}{n}\sum_{i=1}^n h^{-d}K\left(\frac{x-X_i}{h}\right). \end{align*} For each $h>0$ and $x\in A$, define the function $g_{x,h}:\mathbb R^d\to\mathbb R$ by \begin{align*} g_{x,h}(u):=h^{-d}K\left(\frac{x-u}{h}\right). \end{align*} This function is Borel measurable because $K$ is Borel measurable and the affine map $u\mapsto (x-u)/h$ from $\mathbb R^d$ to $\mathbb R^d$ is continuous. Define the fixed-bandwidth kernel class $\mathcal F_h$ by \begin{align*} \mathcal F_h := \{g_{x,h}:x\in A\}. \end{align*} For every $x \in A$, \begin{align*} \hat f_{n,h}(x)-\mathbb E[\hat f_{n,h}(x)] = (P_n-P)g_{x,h}.
\end{align*} Therefore \begin{align*} \|\hat f_{n,h}-\mathbb E[\hat f_{n,h}]\|_{\infty,A} = \sup_{g \in \mathcal F_h}|(P_n-P)g|. \end{align*} [guided] For each bandwidth $h>0$, the estimator is the map $\hat f_{n,h}:\mathbb R^d\to\mathbb R$ given by \begin{align*} \hat f_{n,h}(x)=\frac{1}{n}\sum_{i=1}^n h^{-d}K\left(\frac{x-X_i}{h}\right). \end{align*} The purpose of introducing $g_{x,h}$ is to view the value of the estimator at $x$ as an empirical average of one function. Define $g_{x,h}:\mathbb R^d\to\mathbb R$ by \begin{align*} g_{x,h}(u):=h^{-d}K\left(\frac{x-u}{h}\right). \end{align*} This map is Borel measurable because $K$ is Borel measurable and $u\mapsto (x-u)/h$ is continuous. With $P_n=n^{-1}\sum_{i=1}^n\delta_{X_i}$ and $P$ the law of $X_1$, we have \begin{align*} P_ng_{x,h}=\frac{1}{n}\sum_{i=1}^n g_{x,h}(X_i)=\hat f_{n,h}(x), \qquad Pg_{x,h}=\mathbb E[g_{x,h}(X_1)]=\mathbb E[\hat f_{n,h}(x)]. \end{align*} Therefore \begin{align*} \hat f_{n,h}(x)-\mathbb E[\hat f_{n,h}(x)]=(P_n-P)g_{x,h}. \end{align*} Taking the supremum over $x\in A$, equivalently over $\mathcal F_h=\{g_{x,h}:x\in A\}$, gives \begin{align*} \|\hat f_{n,h}-\mathbb E[\hat f_{n,h}]\|_{\infty,A} = \sup_{g\in\mathcal F_h}|(P_n-P)g|. \end{align*} [/guided] [/step] [step:Compute the envelope and variance scales] Set \begin{align*} M := \|K\|_{\infty,\mathbb R^d}. \end{align*} Since $K$ is bounded and compactly supported, $M<\infty$ and \begin{align*} L_K^2 := \int_{\mathbb R^d} |K(v)|^2\,d\mathcal L^d(v)<\infty. \end{align*} Moreover $L_K>0$: if $L_K=0$, then $K=0$ $\mathcal L^d$-a.e., contradicting $\int_{\mathbb R^d}K(v)\,d\mathcal L^d(v)=1$. Define the function $F_h:\mathbb R^d\to[0,\infty)$ by \begin{align*} F_h(u):=Mh^{-d}. \end{align*} Then $F_h$ is an envelope for $\mathcal F_h$. For $g_{x,h}\in\mathcal F_h$, using the density $f$ of $X_1$ gives \begin{align*} \mathbb E[g_{x,h}(X_1)^2]=h^{-2d}\int_{\mathbb R^d} K\left(\frac{x-u}{h}\right)^2 f(u)\,d\mathcal L^d(u). 
\end{align*} Apply the change of variables $v=(x-u)/h$, equivalently $u=x-hv$, under which $d\mathcal L^d(u)=h^d\,d\mathcal L^d(v)$ and the domain $\mathbb R^d$ maps onto $\mathbb R^d$. This yields \begin{align*} \mathbb E[g_{x,h}(X_1)^2]=h^{-d}\int_{\mathbb R^d} K(v)^2 f(x-hv)\,d\mathcal L^d(v). \end{align*} Since $f$ is bounded and $L_K^2$ is the squared $L^2(\mathcal L^d)$ norm of $K$ defined above, \begin{align*} \mathbb E[g_{x,h}(X_1)^2]\le h^{-d}\|f\|_{\infty,\mathbb R^d} L_K^2. \end{align*} Thus, with \begin{align*} \sigma_h^2 := \|f\|_{\infty,\mathbb R^d}L_K^2 h^{-d}, \end{align*} we have \begin{align*} \sup_{g\in\mathcal F_h}\operatorname{Var}(g(X_1)) \le \sup_{g\in\mathcal F_h}\mathbb E[g(X_1)^2] \le \sigma_h^2. \end{align*} [guided] The empirical-process inequalities used below need two numerical inputs: a uniform envelope and a uniform variance bound. The envelope is obtained directly from boundedness of $K$. Define \begin{align*} M := \|K\|_{\infty,\mathbb R^d}. \end{align*} Then every function $g_{x,h}\in\mathcal F_h$ satisfies \begin{align*} |g_{x,h}(u)| = h^{-d}\left|K\left(\frac{x-u}{h}\right)\right| \le M h^{-d} \end{align*} for every $u\in\mathbb R^d$. Hence $F_h(u)=Mh^{-d}$ is an envelope for $\mathcal F_h$. Next we estimate the variance. Since $\operatorname{Var}(Y)\le \mathbb E[Y^2]$ for every square-integrable real-valued [random variable](/page/Random%20Variable) $Y$, it is enough to control $\mathbb E[g_{x,h}(X_1)^2]$. The random vector $X_1$ has density $f$ with respect to $\mathcal L^d$, so \begin{align*} \mathbb E[g_{x,h}(X_1)^2] = h^{-2d}\int_{\mathbb R^d} K\left(\frac{x-u}{h}\right)^2 f(u)\,d\mathcal L^d(u). \end{align*} We now perform the explicit change of variables $v=(x-u)/h$, so $u=x-hv$ and $d\mathcal L^d(u)=h^d\,d\mathcal L^d(v)$. This gives \begin{align*} \mathbb E[g_{x,h}(X_1)^2] = h^{-d}\int_{\mathbb R^d}K(v)^2 f(x-hv)\,d\mathcal L^d(v).
\end{align*} Because $K$ is bounded with compact support, the constant \begin{align*} L_K^2 := \int_{\mathbb R^d}|K(v)|^2\,d\mathcal L^d(v) \end{align*} is finite. It is also positive: if $L_K=0$, then $K=0$ $\mathcal L^d$-a.e., contradicting $\int_{\mathbb R^d}K(v)\,d\mathcal L^d(v)=1$. Therefore, since $\|f\|_{\infty,\mathbb R^d}<\infty$, \begin{align*} \mathbb E[g_{x,h}(X_1)^2] \le h^{-d}\|f\|_{\infty,\mathbb R^d}L_K^2. \end{align*} Thus the variance scale is $h^{-d}$, so the standard deviation scale is $h^{-d/2}$, whereas the envelope scale is $h^{-d}$. This gap between the standard deviation and the envelope is what produces the leading square-root term of order $\sqrt{1/(nh^d)}$, with the envelope entering only through the lower-order term of order $1/(nh^d)$ (both up to logarithmic factors). [/guided] [/step] [step:Apply the VC maximal inequality at bandwidth $h$] We use the [VC-type maximal inequality for bounded empirical processes, van der Vaart and Wellner Theorem 2.14.1](external:van-der-vaart-wellner-1996-theorem-2-14-1) for pointwise measurable VC-subgraph classes. By the pointwise measurability assumption on $\mathcal K$, the fixed-bandwidth subclass $\mathcal F_h\subset h^{-d}\mathcal K$ is pointwise measurable: for this $h$ there is a countable subclass $\mathcal F_{h,0}\subset\mathcal F_h$ such that every $g\in\mathcal F_h$ is the pointwise limit of a sequence in $\mathcal F_{h,0}$. Since every member of $\mathcal F_h$ is bounded by the integrable envelope $F_h$ under $P$, dominated convergence gives $Pg_k\to Pg$ along such an approximating sequence $(g_k)_{k=1}^\infty$; at the finitely many sample points, pointwise convergence also gives $P_ng_k\to P_ng$. Hence the supremum over $\mathcal F_h$ agrees with the supremum over the countable class $\mathcal F_{h,0}$ and is measurable. Since $\mathcal K$ is VC-subgraph with bounded envelope $M$, rescaling by the positive scalar $h^{-d}$ and restricting the translation parameter to $x\in A$ preserve the VC-subgraph entropy exponents.
Thus there exist constants $a\ge e$ and $v\ge 1$, depending only on the VC characteristics of the original translated-dilated kernel class, and a constant $C_0>0$ from the maximal inequality, depending only on these VC characteristics, such that $\mathcal F_h$ has envelope $Mh^{-d}$ and for every $h\in(0,1)$, \begin{align*} \mathbb E\left[\sup_{g\in\mathcal F_h}|(P_n-P)g|\right] \le C_0\left( \sqrt{\frac{\sigma_h^2}{n}\log\left(\frac{aMh^{-d}}{\sigma_h}\right)} + \frac{Mh^{-d}}{n}\log\left(\frac{aMh^{-d}}{\sigma_h}\right) \right). \end{align*} The hypotheses required for this standard empirical-process inequality are satisfied: $\mathcal F_h$ is pointwise measurable, it is a subclass of the rescaled VC-subgraph class $h^{-d}\mathcal K$, its envelope is $Mh^{-d}$, its VC entropy constants are the fixed constants $a$ and $v$ above, and its variance is bounded by $\sigma_h^2$. Since $\sigma_h=\|f\|_{\infty,\mathbb R^d}^{1/2}L_Kh^{-d/2}$, we have \begin{align*} \frac{aMh^{-d}}{\sigma_h}=\frac{aM}{\|f\|_{\infty,\mathbb R^d}^{1/2}L_K}\,h^{-d/2}. \end{align*} Therefore there is a constant $C_1=C_1(C_0,d,K,f,a,v)>0$ such that, for all sufficiently small $h$, \begin{align*} \mathbb E\left[\sup_{g\in\mathcal F_h}|(P_n-P)g|\right] \le C_1\left( \sqrt{\frac{\log(1/h)}{nh^d}} + \frac{\log(1/h)}{nh^d} \right). \end{align*} [guided] The maximal inequality is applied to an uncountable class, so we must first make sure the supremum is a measurable random variable rather than only an outer supremum. The theorem assumes that $\mathcal K$ is pointwise measurable. Since $\mathcal F_h\subset h^{-d}\mathcal K$ is obtained by restriction of parameters and multiplication by the fixed positive scalar $h^{-d}$, it is also pointwise measurable. Thus there is a countable subclass $\mathcal F_{h,0}\subset\mathcal F_h$ such that each member of $\mathcal F_h$ is the pointwise limit of a sequence from $\mathcal F_{h,0}$.
Along such an approximating sequence, the empirical averages converge because there are only finitely many sample points, and the expectations converge by dominated convergence using the integrable envelope $F_h$ under $P$. Hence the supremum over $\mathcal F_h$ agrees with the supremum over the countable class $\mathcal F_{h,0}$, so it is measurable. Now the standard VC-type maximal inequality for bounded empirical processes applies. Its hypotheses are: pointwise measurability, VC-subgraph entropy with fixed characteristics $a\ge e$ and $v\ge1$ inherited from $\mathcal K$, an envelope $Mh^{-d}$, and a variance bound $\sigma_h^2$. The first condition was just verified, the VC-subgraph condition is inherited from $\mathcal K$ by rescaling and restriction to $x\in A$, the envelope was computed above, and the variance bound is \begin{align*} \sup_{g\in\mathcal F_h}\operatorname{Var}(g(X_1))\le \sigma_h^2. \end{align*} Therefore there is a constant $C_0>0$, depending only on the VC characteristics of $\mathcal K$, such that \begin{align*} \mathbb E\left[\sup_{g\in\mathcal F_h}|(P_n-P)g|\right]\le C_0\left(\sqrt{\frac{\sigma_h^2}{n}\log\left(\frac{aMh^{-d}}{\sigma_h}\right)}+\frac{Mh^{-d}}{n}\log\left(\frac{aMh^{-d}}{\sigma_h}\right)\right). \end{align*} Since $\sigma_h=\|f\|_{\infty,\mathbb R^d}^{1/2}L_Kh^{-d/2}$, the logarithm is bounded by a constant multiple of $\log(1/h)$ for all sufficiently small $h$. Absorbing the fixed constants into $C_1=C_1(C_0,d,K,f,a,v)>0$ gives \begin{align*} \mathbb E\left[\sup_{g\in\mathcal F_h}|(P_n-P)g|\right] \le C_1\left( \sqrt{\frac{\log(1/h)}{nh^d}} + \frac{\log(1/h)}{nh^d} \right). \end{align*} [/guided] [/step] [step:Use concentration to pass from the mean bound to a probability bound] We use [Talagrand's concentration inequality for bounded empirical processes in Bousquet's form](external:bousquet-2002-talagrand-bounded-empirical-process-concentration) indexed by pointwise measurable classes. 
The pointwise measurability verification from the preceding step applies to the same class $\mathcal F_h$, so the supremum is measurable and Talagrand's inequality applies with ordinary probability. There is a constant $C_2>0$, depending only on the universal constants in Talagrand's inequality, such that for every $t>0$, \begin{align*} \mathbb P\left( \sup_{g\in\mathcal F_h}|(P_n-P)g| > \mathbb E\left[\sup_{g\in\mathcal F_h}|(P_n-P)g|\right] + C_2\left(\sqrt{\frac{\sigma_h^2 t}{n}}+\frac{Mh^{-d}t}{n}\right) \right) \le e^{-t}. \end{align*} Taking $t=\log n$ and inserting $\sigma_h^2=\|f\|_{\infty,\mathbb R^d}L_K^2h^{-d}$ gives \begin{align*} \sup_{g\in\mathcal F_h}|(P_n-P)g|=O_{\mathbb P}\left(\sqrt{\frac{\log(1/h)}{nh^d}}+\frac{\log(1/h)}{nh^d}+\sqrt{\frac{\log n}{nh^d}}+\frac{\log n}{nh^d}\right). \end{align*} [guided] [Talagrand's concentration inequality for bounded empirical processes in Bousquet's form](external:bousquet-2002-talagrand-bounded-empirical-process-concentration) is applied to the same pointwise measurable class $\mathcal F_h$. The hypotheses are the measurable supremum condition, the uniform envelope bound $|g|\le Mh^{-d}$ for all $g\in\mathcal F_h$, and the variance bound $\sup_{g\in\mathcal F_h}\operatorname{Var}(g(X_1))\le\sigma_h^2$. These have all been verified, so for every $t>0$, \begin{align*} \mathbb P\left( \sup_{g\in\mathcal F_h}|(P_n-P)g| > \mathbb E\left[\sup_{g\in\mathcal F_h}|(P_n-P)g|\right] + C_2\left( \sqrt{\frac{\sigma_h^2t}{n}} + \frac{Mh^{-d}t}{n} \right) \right) \le e^{-t}. \end{align*} Choose $t=\log n$. Then $e^{-t}=n^{-1}\to0$, so the displayed bound holds with probability tending to one. Substituting the mean bound from the previous step and the identity $\sigma_h^2=\|f\|_{\infty,\mathbb R^d}L_K^2h^{-d}$ yields \begin{align*} \sup_{g\in\mathcal F_h}|(P_n-P)g|=O_{\mathbb P}\left(\sqrt{\frac{\log(1/h)}{nh^d}}+\frac{\log(1/h)}{nh^d}+\sqrt{\frac{\log n}{nh^d}}+\frac{\log n}{nh^d}\right). 
\end{align*} [/guided] [/step] [step:Insert the bandwidth assumptions and conclude] Now take $h=h_n$. Since $\log(1/h_n)=O(\log n)$, the preceding bound becomes \begin{align*} \sup_{g\in\mathcal F_{h_n}} |(P_n-P)g|=O_{\mathbb P}\left(\sqrt{\frac{\log n}{n h_n^d}}+\frac{\log n}{n h_n^d}\right). \end{align*} The assumption \begin{align*} \frac{n h_n^d}{\log n}\to\infty \end{align*} implies \begin{align*} \frac{\log n}{n h_n^d} = o\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} Therefore \begin{align*} \sup_{g\in\mathcal F_{h_n}} |(P_n-P)g|=O_{\mathbb P}\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} Using the empirical-process identity from the first step, this is exactly \begin{align*} \|\hat f_{n,h_n}-\mathbb E[\hat f_{n,h_n}]\|_{\infty,A}=O_{\mathbb P}\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} This proves the asserted maximal deviation rate. [guided] Set $h=h_n$. From the preceding concentration step, \begin{align*} \sup_{g\in\mathcal F_{h_n}} |(P_n-P)g|=O_{\mathbb P}\left(\sqrt{\frac{\log(1/h_n)}{n h_n^d}}+\frac{\log(1/h_n)}{n h_n^d}+\sqrt{\frac{\log n}{n h_n^d}}+\frac{\log n}{n h_n^d}\right). \end{align*} The assumption $\log(1/h_n)=O(\log n)$ means that the terms involving $\log(1/h_n)$ are bounded by constant multiples of the corresponding terms involving $\log n$. Hence \begin{align*} \sup_{g\in\mathcal F_{h_n}} |(P_n-P)g|=O_{\mathbb P}\left(\sqrt{\frac{\log n}{n h_n^d}}+\frac{\log n}{n h_n^d}\right). \end{align*} Now use the second bandwidth assumption. Since \begin{align*} \frac{n h_n^d}{\log n}\to\infty, \end{align*} we have $\log n/(n h_n^d)\to0$, and therefore \begin{align*} \frac{\log n}{n h_n^d} = o\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} So the linear term is absorbed into the square-root term: \begin{align*} \sup_{g\in\mathcal F_{h_n}} |(P_n-P)g|=O_{\mathbb P}\left(\sqrt{\frac{\log n}{n h_n^d}}\right). 
\end{align*} The first step identified this supremum with the uniform centered estimator deviation over $A$, namely \begin{align*} \|\hat f_{n,h_n}-\mathbb E[\hat f_{n,h_n}]\|_{\infty,A}=\sup_{g\in\mathcal F_{h_n}} |(P_n-P)g|. \end{align*} Thus \begin{align*} \|\hat f_{n,h_n}-\mathbb E[\hat f_{n,h_n}]\|_{\infty,A}=O_{\mathbb P}\left(\sqrt{\frac{\log n}{n h_n^d}}\right), \end{align*} which is the claimed maximal deviation rate. [/guided] [/step]
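As an informal numerical check of the rate just proved, the following Python sketch simulates the case $d=1$ under assumptions that are not part of the theorem statement: $X_i$ uniform on $[0,1]$, $A=[0,1]$, the Epanechnikov kernel, and the bandwidth $h_n=n^{-1/5}$, which satisfies both bandwidth conditions. For this choice $\mathbb E[\hat f_{n,h}(x)]$ has a closed form, so the centered deviation can be evaluated on a finite grid, which only approximates the supremum over $A$. All function names are hypothetical and chosen for illustration.

```python
import math
import random


def epanechnikov(v):
    """Epanechnikov kernel: bounded, supported on [-1, 1], integrates to 1."""
    return 0.75 * (1.0 - v * v) if abs(v) <= 1.0 else 0.0


def epanechnikov_cdf(t):
    """Integral of the kernel from -1 to t, with t clipped to [-1, 1]."""
    t = max(-1.0, min(1.0, t))
    return 0.5 + 0.75 * t - 0.25 * t ** 3


def kde(x, sample, h):
    """Kernel density estimate at x for d = 1: mean of h^{-1} K((x - X_i) / h)."""
    return sum(epanechnikov((x - xi) / h) for xi in sample) / (len(sample) * h)


def expected_kde_uniform(x, h):
    """Exact E[f_hat(x)] for X_1 ~ Uniform(0, 1) (an assumption of this sketch).

    Equals the integral of K(v) over {v : 0 <= x - h v <= 1}.
    """
    return epanechnikov_cdf(x / h) - epanechnikov_cdf((x - 1.0) / h)


def sup_deviation(n, h, rng, grid_size=101):
    """Grid approximation of sup over A = [0, 1] of |f_hat - E f_hat|."""
    sample = [rng.random() for _ in range(n)]
    grid = [j / (grid_size - 1) for j in range(grid_size)]
    return max(abs(kde(x, sample, h) - expected_kde_uniform(x, h)) for x in grid)


def rate(n, h, d=1):
    """The rate sqrt(log n / (n h^d)) from the theorem."""
    return math.sqrt(math.log(n) / (n * h ** d))


rng = random.Random(0)  # fixed seed so the run is reproducible
results = []
for n in (250, 1000, 4000):
    h = n ** (-0.2)  # illustrative bandwidth choice h_n = n^{-1/5}
    dev = sup_deviation(n, h, rng)
    results.append((n, h, dev, rate(n, h)))
    print(f"n={n:5d}  h={h:.3f}  sup deviation={dev:.4f}  "
          f"rate={rate(n, h):.4f}  ratio={dev / rate(n, h):.3f}")
```

If the rate is of the right order, the printed ratio should stay bounded as $n$ grows. A single simulated run only illustrates the statement and does not verify it.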
