Uniform Consistency of Kernel Density Estimators

Uniform Consistency of Kernel Density Estimators (Theorem # 6320)

Theorem

Proof

[proofplan] We split the uniform error into a stochastic fluctuation term and a deterministic bias term. The bias is the difference between $Pg_{x,h}$ and $f(x)$, and it vanishes uniformly because $f$ is uniformly continuous while $K$ is compactly supported and has integral $1$. The stochastic term is controlled by a VC-type maximal inequality for the empirical process indexed by the bandwidth-scaled kernel class. The bandwidth assumptions force both the square-root and linear empirical-process rates to vanish, so the two pieces combine to give convergence in probability. [/proofplan] [step:Decompose the estimator into empirical fluctuation and convolution bias] Let $P$ denote the common law of the $X_i$, and for each $n \in \mathbb{N}$ let $P_n$ denote the empirical probability measure \begin{align*} P_n(B):=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_B(X_i) \end{align*} for Borel sets $B \subset \mathbb{R}^d$. For $h>0$ and $x \in \mathbb{R}^d$, define the measurable function $g_{x,h}: \mathbb{R}^d \to \mathbb{R}$ by \begin{align*} g_{x,h}(u)=h^{-d}K\left(\frac{x-u}{h}\right). \end{align*} The kernel density estimator at bandwidth $h$ is therefore \begin{align*} \hat f_h(x) = P_n g_{x,h}, \end{align*} and, since each $X_i$ has law $P$, its expectation is \begin{align*} \mathbb{E}[\hat f_h(x)] = P g_{x,h}. \end{align*} Hence, by the triangle inequality, for every compact set $A \subset \mathbb{R}^d$, \begin{align*} \|\hat f_h-f\|_{\infty,A} \leq \sup_{x \in A}|(P_n-P)g_{x,h}| + \sup_{x \in A}|Pg_{x,h}-f(x)|. \end{align*} It remains to prove that, when $h=h_n$, the first term on the right converges to $0$ in probability and the second term converges to $0$ deterministically. [/step] [step:Show that the convolution bias vanishes uniformly on compact sets] Let $R>0$ be such that $\operatorname{supp}K \subset \overline{B}(0,R)$. Since $K$ is bounded and compactly supported, $K \in L^1(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d),\mathcal{L}^d)$, and we define \begin{align*} M_K:=\int_{\mathbb{R}^d}|K(z)|\,d\mathcal{L}^d(z)<\infty. 
\end{align*} For $x \in \mathbb{R}^d$ and $h>0$, since $P$ has density $f$, \begin{align*} Pg_{x,h}=\int_{\mathbb{R}^d} h^{-d}K\left(\frac{x-u}{h}\right)f(u)\,d\mathcal{L}^d(u). \end{align*} The substitution $z=(x-u)/h$, equivalently $u=x-hz$, transforms $d\mathcal{L}^d(u)$ into $h^d\,d\mathcal{L}^d(z)$, so the same quantity is \begin{align*} Pg_{x,h}=\int_{\mathbb{R}^d}K(z)f(x-hz)\,d\mathcal{L}^d(z). \end{align*} Using $\int_{\mathbb{R}^d}K(z)\,d\mathcal{L}^d(z)=1$, we obtain \begin{align*} |Pg_{x,h}-f(x)|\leq \int_{\mathbb{R}^d}|K(z)|\,|f(x-hz)-f(x)|\,d\mathcal{L}^d(z). \end{align*} Since $K(z)=0$ for $z \notin \overline{B}(0,R)$, this bound becomes \begin{align*} |Pg_{x,h}-f(x)|\leq \int_{\overline{B}(0,R)}|K(z)|\,|f(x-hz)-f(x)|\,d\mathcal{L}^d(z). \end{align*} Because $f$ is uniformly continuous on $\mathbb{R}^d$, for every $\varepsilon>0$ there exists $\delta>0$ such that \begin{align*} |f(y)-f(x)|<\frac{\varepsilon}{1+M_K} \end{align*} whenever $|y-x|<\delta$. If $h>0$ satisfies \begin{align*} 0<h<\frac{\delta}{1+R} \end{align*} and $z \in \overline{B}(0,R)$, then $|(x-hz)-x|=h|z|\leq hR<\delta$, so the integrand is at most $|K(z)|\,\varepsilon/(1+M_K)$ for every $x \in \mathbb{R}^d$, and therefore \begin{align*} \sup_{x \in A}|Pg_{x,h}-f(x)| \leq \frac{\varepsilon}{1+M_K}M_K <\varepsilon. \end{align*} Thus \begin{align*} \sup_{x \in A}|Pg_{x,h}-f(x)| \to 0 \end{align*} as $h \downarrow 0$. [guided] The deterministic term is the bias of the kernel estimator. We first rewrite it as an approximate identity acting on $f$. Let $R>0$ be chosen so that $\operatorname{supp}K \subset \overline{B}(0,R)$. Since $K$ is bounded and vanishes outside the compact set $\overline{B}(0,R)$, it is absolutely integrable with respect to [Lebesgue measure](/page/Lebesgue%20Measure), so the finite constant \begin{align*} M_K:=\int_{\mathbb{R}^d}|K(z)|\,d\mathcal{L}^d(z) \end{align*} is well-defined. For $x \in \mathbb{R}^d$ and $h>0$, compute the expectation using the density of $X_1$. 
The map $u \mapsto h^{-d}K((x-u)/h)$ is Borel measurable and bounded because $K$ is Borel measurable and bounded. Thus \begin{align*} Pg_{x,h} = \int_{\mathbb{R}^d} h^{-d}K\left(\frac{x-u}{h}\right)f(u)\,d\mathcal{L}^d(u). \end{align*} Now use the affine change of variables $z=(x-u)/h$, or $u=x-hz$. The Jacobian determinant has absolute value $h^d$, so \begin{align*} d\mathcal{L}^d(u)=h^d\,d\mathcal{L}^d(z). \end{align*} The whole domain $\mathbb{R}^d$ maps onto $\mathbb{R}^d$, and therefore \begin{align*} Pg_{x,h} = \int_{\mathbb{R}^d}K(z)f(x-hz)\,d\mathcal{L}^d(z). \end{align*} Since the kernel has unit integral, \begin{align*} f(x)=f(x)\int_{\mathbb{R}^d}K(z)\,d\mathcal{L}^d(z). \end{align*} Subtracting this identity from the previous display and applying the triangle inequality gives \begin{align*} |Pg_{x,h}-f(x)|\leq \int_{\mathbb{R}^d}|K(z)|\,|f(x-hz)-f(x)|\,d\mathcal{L}^d(z). \end{align*} Since $K(z)=0$ for $z \notin \overline{B}(0,R)$, the integral over $\mathbb{R}^d$ is the same as the integral over $\overline{B}(0,R)$, so \begin{align*} |Pg_{x,h}-f(x)|\leq \int_{\overline{B}(0,R)}|K(z)|\,|f(x-hz)-f(x)|\,d\mathcal{L}^d(z). \end{align*} The reason compact support matters is that $hz$ is uniformly small for every $z$ contributing to the integral. Let $\varepsilon>0$. [Uniform continuity](/page/Uniform%20Continuity) of $f$ on all of $\mathbb{R}^d$ gives a number $\delta>0$ such that \begin{align*} |y-x|<\delta \implies |f(y)-f(x)|<\frac{\varepsilon}{1+M_K} \end{align*} for all $x,y \in \mathbb{R}^d$. If $h>0$ satisfies \begin{align*} 0<h<\frac{\delta}{1+R} \end{align*} and $z \in \overline{B}(0,R)$, then \begin{align*} |x-hz-x|=h|z|\leq hR<\delta. \end{align*} Hence \begin{align*} \sup_{x \in A}|Pg_{x,h}-f(x)|\leq \int_{\overline{B}(0,R)}|K(z)|\frac{\varepsilon}{1+M_K}\,d\mathcal{L}^d(z). \end{align*} Using the definition of $M_K$, this gives \begin{align*} \sup_{x \in A}|Pg_{x,h}-f(x)|\leq \frac{\varepsilon M_K}{1+M_K}<\varepsilon. 
\end{align*} This proves that the bias converges to zero uniformly on $A$ as $h \downarrow 0$. [/guided] [/step] [step:Control the empirical fluctuation by the VC kernel maximal inequality] Define, for each $h>0$ and compact $A \subset \mathbb{R}^d$, the function class \begin{align*} \mathcal{G}_{h,A}:=\{g_{x,h}: x \in A\}. \end{align*} We use the VC-subgraph and pointwise-measurability assumptions on $\mathcal K$. For fixed $h>0$ and compact $A$, the restricted scaled class $\mathcal{G}_{h,A}$ is pointwise measurable because pointwise measurability is inherited by restricting the index set and multiplying every function by the fixed positive scalar $h^{-d}$. The subgraph VC dimension is unchanged by multiplication by the positive scalar $h^{-d}$, because \begin{align*} \{(u,t):h^{-d}K((x-u)/h)>t\}=\{(u,t):K((x-u)/h)>h^d t\}, \end{align*} and the map $(u,t)\mapsto (u,h^dt)$ is a bijection of $\mathbb{R}^d\times\mathbb{R}$ preserving shattering relations. Let $B_K:=\sup_{z \in \mathbb{R}^d}|K(z)|<\infty$. Define the constant envelope $G_h: \mathbb{R}^d \to [0,\infty)$ by \begin{align*} G_h(u)=h^{-d}B_K. \end{align*} This is a measurable envelope for $\mathcal{G}_{h,A}$, because $|g_{x,h}(u)|\leq h^{-d}B_K$ for all $x$ and $u$. Since $\mathcal{K}$ is VC-subgraph, there exist constants $a\geq e$ and $v\geq 1$, depending only on the VC characteristics of $\mathcal{K}$, such that for every probability measure $Q$ on $\mathbb{R}^d$ and every $0<\eta\leq 1$, \begin{align*} N\left(\eta\|G_h\|_{L^2(Q)},\mathcal{G}_{h,A},L^2(Q)\right) \leq \left(\frac{a}{\eta}\right)^v. \end{align*} This is the polynomial entropy hypothesis required by the VC maximal inequality used below. The variance is bounded uniformly in $x \in A$. First, \begin{align*} P g_{x,h}^2=\int_{\mathbb{R}^d}h^{-2d}K\left(\frac{x-u}{h}\right)^2 f(u)\,d\mathcal{L}^d(u). \end{align*} Using the affine substitution $z=(x-u)/h$, equivalently $u=x-hz$, the measure transforms as $d\mathcal{L}^d(u)=h^d\,d\mathcal{L}^d(z)$ and the domain remains $\mathbb{R}^d$. 
Hence \begin{align*} P g_{x,h}^2=h^{-d}\int_{\mathbb{R}^d}K(z)^2f(x-hz)\,d\mathcal{L}^d(z). \end{align*} Since $f$ is bounded and $K$ is bounded with compact support, \begin{align*} P g_{x,h}^2\leq h^{-d}\|f\|_{\infty}\int_{\mathbb{R}^d}K(z)^2\,d\mathcal{L}^d(z). \end{align*} Set \begin{align*} C_{K,f}:= \|f\|_{\infty} \int_{\mathbb{R}^d}K(z)^2\,d\mathcal{L}^d(z)<\infty. \end{align*} Thus $\sup_{g\in\mathcal{G}_{h,A}}Pg^2\leq C_{K,f}h^{-d}$. We use the following VC empirical-process maximal inequality with Bernstein-Talagrand concentration, in the form of Theorem 2.14.1 in van der Vaart and Wellner's Weak Convergence and Empirical Processes together with the standard Bernstein-Talagrand tail integration: if a pointwise measurable class $\mathcal{F}$ has measurable envelope $F$, polynomial covering numbers \begin{align*} N\left(\eta\|F\|_{L^2(Q)},\mathcal{F},L^2(Q)\right)\leq \left(\frac{a}{\eta}\right)^v \end{align*} for all probability measures $Q$ and all $0<\eta\leq 1$, and if $\sup_{g\in\mathcal F}Pg^2\leq \sigma^2$, then the measurable [random variable](/page/Random%20Variable) $\sup_{g\in\mathcal F}|(P_n-P)g|$ satisfies \begin{align*} \sup_{g\in\mathcal F}|(P_n-P)g| = O_{\mathbb P}\left(\sqrt{\frac{\sigma^2\Lambda}{n}}\right) + O_{\mathbb P}\left(\frac{\|F\|_{\infty}\Lambda}{n}\right), \end{align*} where \begin{align*} \Lambda:=1+\log\left(\frac{a\|F\|_{\infty}}{\sigma}\right), \end{align*} and the implicit constants depend only on $a$ and $v$. Apply this result with $\mathcal F=\mathcal G_{h,A}$, $F=G_h$, and $\sigma^2=C_{K,f}h^{-d}$. The preceding paragraphs verify pointwise measurability, polynomial entropy, measurability of the envelope, and the variance bound. Moreover $\|G_h\|_{\infty}=B_Kh^{-d}$, so \begin{align*} \Lambda = 1+\log\left(\frac{aB_Kh^{-d}}{C_{K,f}^{1/2}h^{-d/2}}\right) = 1+\log\left(\frac{aB_K}{C_{K,f}^{1/2}}\right)+\frac{d}{2}\log\left(\frac{1}{h}\right) \leq C_1\log(e/h) \end{align*} for a constant $C_1=C_1(a,B_K,C_{K,f},d)>0$ and all sufficiently small $h>0$. 
Therefore, along $h=h_n$, \begin{align*} \sup_{x \in A}|(P_n-P)g_{x,h}| = O_{\mathbb{P}}\left(\sqrt{\frac{\log(e/h)}{n h^d}}\right)+O_{\mathbb{P}}\left(\frac{\log(e/h)}{n h^d}\right) \end{align*} as $n \to \infty$ and $h=h_n \downarrow 0$, with implicit constants depending only on $a$, $v$, $B_K$, $C_{K,f}$, and $d$. [guided] The technically important point is that the empirical-process theorem is applied to the scaled class $\mathcal G_{h,A}$, so both the variance and the envelope depend on $h$. We verify each hypothesis in turn. For fixed $h>0$ and compact $A \subset \mathbb{R}^d$, define \begin{align*} \mathcal{G}_{h,A}:=\{g_{x,h}:x\in A\}, \end{align*} where $g_{x,h}:\mathbb{R}^d\to\mathbb{R}$ is the measurable function \begin{align*} g_{x,h}(u)=h^{-d}K\left(\frac{x-u}{h}\right). \end{align*} We use the maximal inequality in its ordinary-probability form, so we must check that the restricted class $\mathcal G_{h,A}$ is pointwise measurable. This property is inherited by restricting the original index set and multiplying the resulting countable approximating subclass by $h^{-d}$. The VC-subgraph property is also available for the scaled class: multiplication by $h^{-d}>0$ sends subgraphs to subgraphs after the vertical change of variables $t\mapsto h^dt$, and this change preserves the relevant shattering numbers. If \begin{align*} B_K:=\sup_{z\in\mathbb{R}^d}|K(z)|, \end{align*} then the constant function $G_h:\mathbb{R}^d\to[0,\infty)$ defined by \begin{align*} G_h(u)=h^{-d}B_K \end{align*} is a measurable envelope for $\mathcal G_{h,A}$, and \begin{align*} \|G_h\|_\infty=B_Kh^{-d}. 
\end{align*} Because the VC-subgraph property is preserved under multiplication by a positive scalar, the scaled class has polynomial entropy: there are constants $a\geq e$ and $v\geq 1$, depending only on the VC characteristics of the kernel class, such that for every probability measure $Q$ on $\mathbb{R}^d$ and every $0<\eta\leq 1$, \begin{align*} N\left(\eta\|G_h\|_{L^2(Q)},\mathcal{G}_{h,A},L^2(Q)\right) \leq \left(\frac{a}{\eta}\right)^v. \end{align*} It remains to compute the second-moment scale. For $x\in A$, \begin{align*} Pg_{x,h}^2=\int_{\mathbb{R}^d}h^{-2d}K\left(\frac{x-u}{h}\right)^2f(u)\,d\mathcal{L}^d(u). \end{align*} Use the affine substitution $z=(x-u)/h$, equivalently $u=x-hz$. This maps $\mathbb{R}^d$ onto $\mathbb{R}^d$ and transforms the measure by \begin{align*} d\mathcal{L}^d(u)=h^d\,d\mathcal{L}^d(z). \end{align*} Therefore \begin{align*} Pg_{x,h}^2=h^{-d}\int_{\mathbb{R}^d}K(z)^2f(x-hz)\,d\mathcal{L}^d(z). \end{align*} Since $f$ is bounded and $K$ is bounded with compact support, define \begin{align*} C_{K,f}:= \|f\|_{\infty} \int_{\mathbb{R}^d}K(z)^2\,d\mathcal{L}^d(z)<\infty. \end{align*} Then \begin{align*} \sup_{g\in\mathcal G_{h,A}}Pg^2\leq C_{K,f}h^{-d}. \end{align*} We now apply the VC empirical-process maximal inequality with Bernstein-Talagrand concentration, in the cited entropy form for pointwise measurable VC-type classes. Its hypotheses in this formulation are pointwise measurability, the measurable envelope, polynomial entropy, and the second-moment bound verified above. Applying it with $\mathcal F=\mathcal G_{h,A}$, $F=G_h$, and $\sigma^2=C_{K,f}h^{-d}$ gives the ordinary-probability estimate \begin{align*} \sup_{g\in\mathcal G_{h,A}}|(P_n-P)g| = O_{\mathbb P}\left(\sqrt{\frac{\sigma^2\Lambda}{n}}\right) + O_{\mathbb P}\left(\frac{\|G_h\|_\infty\Lambda}{n}\right), \end{align*} where \begin{align*} \Lambda:=1+\log\left(\frac{a\|G_h\|_\infty}{\sigma}\right). 
\end{align*} Substituting $\|G_h\|_\infty=B_Kh^{-d}$ and $\sigma=C_{K,f}^{1/2}h^{-d/2}$ yields \begin{align*} \Lambda = 1+\log\left(\frac{aB_Kh^{-d}}{C_{K,f}^{1/2}h^{-d/2}}\right) \leq C_1\log(e/h) \end{align*} for a constant $C_1=C_1(a,B_K,C_{K,f},d)>0$ and all sufficiently small $h>0$. This logarithmic factor is the entropy logarithm evaluated at the ratio of the envelope size to the standard-deviation scale. Substituting this bound and $\sigma^2=C_{K,f}h^{-d}$ into the maximal inequality gives \begin{align*} \sup_{x \in A}|(P_n-P)g_{x,h}| = O_{\mathbb{P}}\left(\sqrt{\frac{\log(e/h)}{n h^d}}\right)+O_{\mathbb{P}}\left(\frac{\log(e/h)}{n h^d}\right), \end{align*} with constants depending only on $a$, $v$, $B_K$, $C_{K,f}$, and $d$. [/guided] [/step] [step:Use the bandwidth assumptions to force the stochastic term to vanish] Since $\log(1/h_n)=O(\log n)$, there is a constant $C_h>0$ and an integer $N_h \in \mathbb{N}$ such that \begin{align*} \log(e/h_n) \leq C_h \log n \end{align*} for all $n \geq N_h$. Therefore \begin{align*} \frac{\log(e/h_n)}{n h_n^d} \leq C_h\frac{\log n}{n h_n^d} \to 0 \end{align*} because $n h_n^d/\log n \to \infty$. It follows also that \begin{align*} \sqrt{\frac{\log(e/h_n)}{n h_n^d}}\to 0. \end{align*} The stochastic estimate from the previous step therefore gives \begin{align*} \sup_{x \in A}|(P_n-P)g_{x,h_n}| \xrightarrow{\mathbb{P}}0. \end{align*} This is ordinary convergence in probability because the pointwise-measurability hypothesis made the indexed supremum a measurable random variable in the maximal inequality. [/step] [step:Combine the stochastic and bias estimates] For the fixed compact set $A \subset \mathbb{R}^d$, the decomposition from the first step gives \begin{align*} \|\hat f_{h_n}-f\|_{\infty,A} \leq \sup_{x \in A}|(P_n-P)g_{x,h_n}| + \sup_{x \in A}|Pg_{x,h_n}-f(x)|. \end{align*} The first term on the right converges to $0$ in probability by the VC empirical-process estimate and the bandwidth assumptions. 
The second term converges to $0$ deterministically because $h_n \downarrow 0$ and the convolution bias vanishes uniformly. Hence, by the preceding inequality, \begin{align*} \|\hat f_{h_n}-f\|_{\infty,A}\xrightarrow{\mathbb{P}}0. \end{align*} This proves the asserted uniform consistency on every compact set $A \subset \mathbb{R}^d$. [/step]
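
As an informal numerical companion to the decomposition used in the proof, the following Python sketch (standard library only) takes $d=1$ and reports the stochastic term, the bias term, and the total sup-error on a grid. Everything specific in it is an assumption chosen for illustration and not part of the theorem: the Epanechnikov kernel (bounded, supported in $[-1,1]$, unit integral), the standard normal density for $f$, the compact set $A=[-2,2]$, and the bandwidth $h_n=(\log n/n)^{1/5}$, which satisfies $h_n \downarrow 0$, $n h_n/\log n \to \infty$, and $\log(1/h_n)=O(\log n)$. The function names are hypothetical, and the supremum over $A$ is only approximated on a finite grid, so the output suggests the behavior without proving anything.

```python
import math
import random


def epanechnikov(z):
    """Bounded kernel supported in [-1, 1] with integral 1 (illustrative choice)."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def normal_density(x):
    """Standard normal density: bounded and uniformly continuous (illustrative f)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """P_n g_{x,h} = (1/n) sum_i h^{-1} K((x - X_i)/h), for d = 1."""
    return sum(epanechnikov((x - xi) / h) for xi in sample) / (len(sample) * h)


def smoothed_density(x, h, m=400):
    """P g_{x,h} = int K(z) f(x - h z) dz, approximated by the midpoint rule."""
    dz = 2.0 / m
    total = 0.0
    for j in range(m):
        z = -1.0 + (j + 0.5) * dz
        total += epanechnikov(z) * normal_density(x - h * z)
    return total * dz


def bandwidth(n):
    """Hypothetical bandwidth sequence h_n = (log n / n)^(1/5)."""
    return (math.log(n) / n) ** 0.2


def sup_errors(n, h, grid, rng):
    """Return grid approximations of (stochastic term, bias term, total error)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    est = [kde(x, sample, h) for x in grid]
    smooth = [smoothed_density(x, h) for x in grid]
    truth = [normal_density(x) for x in grid]
    stochastic = max(abs(e - s) for e, s in zip(est, smooth))
    bias = max(abs(s - t) for s, t in zip(smooth, truth))
    total = max(abs(e - t) for e, t in zip(est, truth))
    return stochastic, bias, total


# Grid approximation of the compact set A = [-2, 2].
grid = [-2.0 + 0.05 * k for k in range(81)]
rng = random.Random(0)  # fixed seed so repeated runs agree

for n in (250, 1000, 4000):
    h = bandwidth(n)
    s, b, t = sup_errors(n, h, grid, rng)
    print(f"n={n:5d} h={h:.3f} stochastic={s:.4f} bias={b:.4f} total={t:.4f}")
```

In each printed row the total error is bounded by the sum of the stochastic and bias columns, mirroring the inequality from the first step of the proof. The midpoint rule and the grid resolution are design conveniences and can be refined without changing the structure of the sketch.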
