Uniform Supremum-Norm Bias-Variance Tradeoff for Kernel Density Estimators (Theorem # 6327)

Proof

[proofplan] We decompose the uniform error into a deterministic bias term and a centered stochastic term. The stochastic term is exactly the conclusion of the assumed uniform maximal deviation theorem for kernel density estimators. The bias term is estimated uniformly on $A$ by changing variables in the expectation, applying Taylor expansion on a fixed neighbourhood of $A$, and using the moment cancellations of the order-$m$ kernel. Finally, the heuristic bandwidth is obtained by equating the bias scale $h^m$ with the stochastic scale \begin{align*} \left(\frac{\log n}{nh^d}\right)^{1/2}. \end{align*} [/proofplan] [step:Decompose the estimator error into bias and centered fluctuation] For each $n \in \mathbb{N}$, write $h=h_n$. The random variables $X_i:(\Omega,\mathcal F,\mathbb P)\to(\mathbb R^d,\mathcal B(\mathbb R^d))$ for $i\in\mathbb N$ are the i.i.d. sample declared in the theorem statement, and $\hat f_h:A\to\mathbb R$ denotes the corresponding kernel density estimator. Define the deterministic bias map $b_h:A\to \mathbb{R}$ by \begin{align*} b_h(x)&=\mathbb{E}[\hat f_h(x)]-f(x) \end{align*} for $x\in A$, and define the centered stochastic fluctuation map $Z_h:A\to \mathbb{R}$ by \begin{align*} Z_h(x)&=\hat f_h(x)-\mathbb{E}[\hat f_h(x)] \end{align*} for $x\in A$. Then, for every $x \in A$, \begin{align*} \hat f_h(x)-f(x)=Z_h(x)+b_h(x). \end{align*} Taking suprema over $A$ and applying the triangle inequality gives \begin{align*} \|\hat f_h-f\|_{\infty,A} \le \|Z_h\|_{\infty,A}+\|b_h\|_{\infty,A}. \end{align*} [/step] [step:Compute the expectation as a kernel smoothing of the density] Define the support of $K$ by $\operatorname{supp}K:=\overline{\{u\in\mathbb R^d:K(u)\neq 0\}}$, and define the distance from a point $y\in\mathbb R^d$ to $A$ by \begin{align*} \operatorname{dist}(y,A):=\inf_{a\in A}|y-a|. \end{align*} Since $K$ is compactly supported by hypothesis, let $R>0$ be such that $\operatorname{supp}K \subset \overline{B}(0,R)$. 
Since $A$ is compact and $U$ is the open neighbourhood of $A$ declared in the theorem statement, choose $\delta>0$ such that the closed $\delta$-neighbourhood of $A$ is contained in $U$: \begin{align*} \{y \in \mathbb{R}^d:\operatorname{dist}(y,A)\le\delta\}\subset U. \end{align*} For all sufficiently large $n$, $hR<\delta$. Fix such an $n$ and fix $x\in A$. The function $y\mapsto h^{-d}K((x-y)/h)f(y)$ is integrable with respect to $\mathcal L^d$: its support is contained in $x-h\operatorname{supp}K\subset U$, the density $f$ is bounded on the compact set $\{y\in\mathbb R^d:\operatorname{dist}(y,A)\le \delta\}$, and $K\in L^1(\mathcal L^d)$. By the definition of $\hat f_h$ and since the $X_i$ have density $f$ with respect to $\mathcal{L}^d$, \begin{align*} \mathbb{E}[\hat f_h(x)] &= \frac{1}{h^d}\int_{\mathbb{R}^d} K\left(\frac{x-y}{h}\right)f(y)\,d\mathcal{L}^d(y). \end{align*} Apply the change of variables $u=(x-y)/h$, equivalently $y=x-hu$. Under this affine substitution, $d\mathcal{L}^d(y)=h^d\,d\mathcal{L}^d(u)$ and the domain remains $\mathbb{R}^d$. Hence \begin{align*} \mathbb{E}[\hat f_h(x)] &= \int_{\mathbb{R}^d} K(u)f(x-hu)\,d\mathcal{L}^d(u). \end{align*} Because $K(u)=0$ for $u\notin \overline{B}(0,R)$ and $hR<\delta$, every point $x-thu$ with $t\in[0,1]$ and $u\in\operatorname{supp}K$ belongs to $U$. This lets us apply the one-dimensional Taylor formula with integral remainder to the function $g_{x,u}:[0,1]\to\mathbb{R}$ defined by $g_{x,u}(t)=f(x-thu)$, whose required derivatives exist by the assumed $C^m$ regularity of $f$ on $U$. [guided] The expectation is the place where the random estimator becomes a deterministic smoothing of the density. We use $\operatorname{supp}K:=\overline{\{u\in\mathbb R^d:K(u)\neq 0\}}$ for the support of the kernel and $\operatorname{dist}(y,A):=\inf_{a\in A}|y-a|$ for the distance from $y\in\mathbb R^d$ to $A$. For each fixed $x\in A$, first verify that the summand is integrable. 
Since $K$ is compactly supported, $K((x-y)/h)=0$ unless $y\in x-h\operatorname{supp}K$; for all sufficiently large $n$, this set lies in the compact neighbourhood $\{y\in\mathbb R^d:\operatorname{dist}(y,A)\le \delta\}\subset U$. The density $f$ is bounded there and $K\in L^1(\mathcal L^d)$, so the map $y\mapsto h^{-d}K((x-y)/h)f(y)$ is $\mathcal L^d$-integrable. The definition of $\hat f_h$ gives \begin{align*} \mathbb{E}[\hat f_h(x)] &= \mathbb{E}\left[ \frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right) \right]. \end{align*} Linearity of expectation and identical distribution of the $X_i$ imply \begin{align*} \mathbb{E}[\hat f_h(x)] &= \frac{1}{h^d}\mathbb{E}\left[K\left(\frac{x-X_1}{h}\right)\right]. \end{align*} Since $X_1$ has density $f$ with respect to $\mathcal{L}^d$, this expectation is \begin{align*} \mathbb{E}[\hat f_h(x)] &= \frac{1}{h^d} \int_{\mathbb{R}^d} K\left(\frac{x-y}{h}\right)f(y)\,d\mathcal{L}^d(y). \end{align*} Now use the affine substitution $u=(x-y)/h$, so $y=x-hu$. The Jacobian determinant of the map $u\mapsto x-hu$ has absolute value $h^d$, hence $d\mathcal{L}^d(y)=h^d\,d\mathcal{L}^d(u)$. Therefore \begin{align*} \mathbb{E}[\hat f_h(x)] &= \int_{\mathbb{R}^d}K(u)f(x-hu)\,d\mathcal{L}^d(u). \end{align*} We also need the Taylor expansion below to take place inside the region where the derivatives of $f$ are controlled. Since $K$ is compactly supported by hypothesis, choose $R>0$ with $\operatorname{supp}K\subset \overline{B}(0,R)$. Since $A$ is compact and contained in the [open set](/page/Open%20Set) $U$ from the theorem statement, choose $\delta>0$ such that the closed $\delta$-neighbourhood of $A$ lies in $U$. For all sufficiently large $n$, $hR<\delta$, so if $x\in A$, $u\in\operatorname{supp}K$, and $t\in[0,1]$, then $\operatorname{dist}(x-thu,A)\le th|u|\le hR<\delta$, hence $x-thu\in U$. This verifies the domain condition needed for the Taylor formula. 
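As a numerical illustration of the change of variables above, the following Python sketch compares the two integral forms of $\mathbb{E}[\hat f_h(x)]$ in dimension $d=1$. The Epanechnikov kernel, the standard normal density standing in for $f$, the midpoint quadrature, and all function names are assumptions made for this illustration; none of them come from the theorem statement.

```python
import math

def epanechnikov(u):
    """Compactly supported kernel on [-1, 1] with total mass 1 (illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def normal_density(y):
    """Standard normal density, standing in for the unknown density f."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def midpoint(g, a, b, n=4000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / n
    return step * sum(g(a + (i + 0.5) * step) for i in range(n))

def mean_in_y(x, h):
    """(1/h) * integral of K((x - y)/h) f(y) dy over the support [x - h, x + h]."""
    integrand = lambda y: epanechnikov((x - y) / h) * normal_density(y)
    return midpoint(integrand, x - h, x + h) / h

def mean_in_u(x, h):
    """Integral of K(u) f(x - h u) du over [-1, 1], after substituting u = (x - y)/h."""
    integrand = lambda u: epanechnikov(u) * normal_density(x - h * u)
    return midpoint(integrand, -1.0, 1.0)

# The two columns should agree up to floating-point rounding.
for x, h in [(0.0, 0.5), (0.7, 0.2), (-1.3, 0.05)]:
    print(x, h, mean_in_y(x, h), mean_in_u(x, h))
```

Here $R=1$ for the chosen kernel, so the integration range in $y$ is exactly $x-h\operatorname{supp}K$.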
[/guided] [/step] [step:Use the kernel moment cancellations to bound the uniform bias] For each multi-index $\alpha\in\mathbb{N}_0^d$ with $|\alpha|=m$, define \begin{align*} M_\alpha:=\sup_{y\in U}|D^\alpha f(y)|. \end{align*} This number is finite by the boundedness hypothesis on the order-$m$ partial derivatives. Throughout this remainder computation, $\mathcal L^1$ denotes one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $[0,1]$. For $x\in A$ and $u\in\operatorname{supp}K$, Taylor expansion of $f$ at $x$ to order $m-1$ in the direction $-hu$ gives \begin{align*} f(x-hu) &= \sum_{|\alpha|\le m-1} \frac{D^\alpha f(x)}{\alpha!}(-hu)^\alpha + m\sum_{|\alpha|=m} \frac{(-hu)^\alpha}{\alpha!} \int_0^1 (1-t)^{m-1}D^\alpha f(x-thu)\,d\mathcal{L}^1(t). \end{align*} Multiplying by $K(u)$ and integrating with respect to $\mathcal{L}^d(u)$, the term with $\alpha=0$ equals $f(x)$ because $\int_{\mathbb{R}^d}K(u)\,d\mathcal{L}^d(u)=1$, and all terms with $1\le |\alpha|\le m-1$ vanish by the moment conditions. Thus \begin{align*} b_h(x) &= m\sum_{|\alpha|=m} \frac{(-h)^{|\alpha|}}{\alpha!} \int_{\mathbb{R}^d} u^\alpha K(u) \int_0^1 (1-t)^{m-1}D^\alpha f(x-thu)\,d\mathcal{L}^1(t) \,d\mathcal{L}^d(u). \end{align*} Taking absolute values and using $|(-h)^{|\alpha|}|=h^m$ gives \begin{align*} |b_h(x)| &\le m h^m \sum_{|\alpha|=m} \frac{M_\alpha}{\alpha!} \int_{\mathbb{R}^d} |u^\alpha|\,|K(u)| \int_0^1 (1-t)^{m-1}\,d\mathcal{L}^1(t) \,d\mathcal{L}^d(u). \end{align*} Since \begin{align*} \int_0^1(1-t)^{m-1}\,d\mathcal{L}^1(t)=\frac{1}{m}, \end{align*} we obtain \begin{align*} |b_h(x)| &\le h^m \sum_{|\alpha|=m} \frac{M_\alpha}{\alpha!} \int_{\mathbb{R}^d}|u^\alpha|\,|K(u)|\,d\mathcal{L}^d(u). \end{align*} Define \begin{align*} C_{\mathrm{bias}} := \sum_{|\alpha|=m} \frac{M_\alpha}{\alpha!} \int_{\mathbb{R}^d}|u^\alpha|\,|K(u)|\,d\mathcal{L}^d(u). 
\end{align*} Since $K\in L^1(\mathcal L^d)$ and $\operatorname{supp}K\subset\overline B(0,R)$, for every multi-index $\alpha$ with $|\alpha|=m$, \begin{align*} \int_{\mathbb{R}^d}|u^\alpha|\,|K(u)|\,d\mathcal{L}^d(u) \le R^m\int_{\mathbb{R}^d}|K(u)|\,d\mathcal{L}^d(u) <\infty. \end{align*} Thus $C_{\mathrm{bias}}<\infty$. Therefore, uniformly for $x\in A$, \begin{align*} |b_h(x)|\le C_{\mathrm{bias}}h^m, \end{align*} and hence \begin{align*} \|b_h\|_{\infty,A}=O(h^m). \end{align*} [guided] The bias estimate comes from two ingredients: Taylor expansion and moment cancellation. The Taylor expansion turns the local smoothing error $f(x-hu)-f(x)$ into polynomial terms in $u$ plus a remainder. The order-$m$ kernel conditions then remove every polynomial term below order $m$. Fix $x\in A$ and $u\in\operatorname{supp}K$. From the previous step, the entire line segment \begin{align*} \{x-thu:t\in[0,1]\} \end{align*} lies in $U$ for all sufficiently large $n$. Therefore the derivatives of $f$ needed for Taylor expansion are defined and bounded along this segment. We write $\mathcal L^1$ for one-dimensional Lebesgue measure on $[0,1]$, which is the measure used in the integral remainder term. Taylor expansion at $x$ to order $m-1$ in the vector $-hu$ gives \begin{align*} f(x-hu) &= \sum_{|\alpha|\le m-1} \frac{D^\alpha f(x)}{\alpha!}(-hu)^\alpha + m\sum_{|\alpha|=m} \frac{(-hu)^\alpha}{\alpha!} \int_0^1 (1-t)^{m-1}D^\alpha f(x-thu)\,d\mathcal{L}^1(t). \end{align*} Here $\alpha=(\alpha_1,\dots,\alpha_d)$ is a multi-index, $\alpha!:=\alpha_1!\cdots\alpha_d!$, and $u^\alpha:=u_1^{\alpha_1}\cdots u_d^{\alpha_d}$. Now multiply this identity by $K(u)$ and integrate over $\mathbb{R}^d$ with respect to $\mathcal{L}^d(u)$. The constant term gives \begin{align*} f(x)\int_{\mathbb{R}^d}K(u)\,d\mathcal{L}^d(u)=f(x), \end{align*} because the kernel has total mass $1$. 
If $1\le |\alpha|\le m-1$, the corresponding Taylor term contributes \begin{align*} \frac{D^\alpha f(x)(-h)^{|\alpha|}}{\alpha!} \int_{\mathbb{R}^d}u^\alpha K(u)\,d\mathcal{L}^d(u)=0, \end{align*} because the integer moments of $K$ vanish through order $m-1$. Therefore the only contribution to $\mathbb{E}[\hat f_h(x)]-f(x)$ is the Taylor remainder: \begin{align*} b_h(x) &= m\sum_{|\alpha|=m} \frac{(-h)^m}{\alpha!} \int_{\mathbb{R}^d} u^\alpha K(u) \int_0^1 (1-t)^{m-1}D^\alpha f(x-thu)\,d\mathcal{L}^1(t) \,d\mathcal{L}^d(u). \end{align*} To bound this uniformly in $x$, define \begin{align*} M_\alpha:=\sup_{y\in U}|D^\alpha f(y)| \end{align*} for each $\alpha$ with $|\alpha|=m$. These quantities are finite by the boundedness hypothesis on the order-$m$ derivatives. Taking absolute values yields \begin{align*} |b_h(x)| &\le m h^m \sum_{|\alpha|=m} \frac{M_\alpha}{\alpha!} \int_{\mathbb{R}^d} |u^\alpha|\,|K(u)| \int_0^1 (1-t)^{m-1}\,d\mathcal{L}^1(t) \,d\mathcal{L}^d(u). \end{align*} The one-dimensional integral is \begin{align*} \int_0^1(1-t)^{m-1}\,d\mathcal{L}^1(t)=\frac{1}{m}, \end{align*} so \begin{align*} |b_h(x)| &\le h^m \sum_{|\alpha|=m} \frac{M_\alpha}{\alpha!} \int_{\mathbb{R}^d}|u^\alpha|\,|K(u)|\,d\mathcal{L}^d(u). \end{align*} The right-hand side is independent of $x$. It is finite because $K\in L^1(\mathcal L^d)$ and $\operatorname{supp}K\subset\overline B(0,R)$ imply \begin{align*} \int_{\mathbb{R}^d}|u^\alpha|\,|K(u)|\,d\mathcal{L}^d(u) \le R^m\int_{\mathbb{R}^d}|K(u)|\,d\mathcal{L}^d(u) <\infty \end{align*} for every $\alpha$ with $|\alpha|=m$. Thus, with \begin{align*} C_{\mathrm{bias}} := \sum_{|\alpha|=m} \frac{M_\alpha}{\alpha!} \int_{\mathbb{R}^d}|u^\alpha|\,|K(u)|\,d\mathcal{L}^d(u), \end{align*} we have \begin{align*} \|b_h\|_{\infty,A}\le C_{\mathrm{bias}}h^m. \end{align*} This proves the deterministic bias estimate $\|b_h\|_{\infty,A}=O(h^m)$. 
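The bound $\|b_h\|_{\infty,A}\le C_{\mathrm{bias}}h^m$ can be checked numerically in a simple special case. The sketch below assumes $d=1$, $m=2$, $A=[-1,1]$, the Epanechnikov kernel, and the standard normal density in place of $f$; these choices, the midpoint quadrature, the finite grid replacing the supremum, and all names are illustrative assumptions rather than part of the theorem.

```python
import math

def epanechnikov(u):
    """Order-2 kernel on [-1, 1]: mass 1 and vanishing first moment (illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def normal_density(y):
    """Standard normal density, standing in for the unknown density f."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def midpoint(g, a, b, n=4000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / n
    return step * sum(g(a + (i + 0.5) * step) for i in range(n))

def bias(x, h):
    """b_h(x) = integral of K(u) * (f(x - h u) - f(x)) du, using that K has mass 1."""
    fx = normal_density(x)
    integrand = lambda u: epanechnikov(u) * (normal_density(x - h * u) - fx)
    return midpoint(integrand, -1.0, 1.0)

# For d = 1 and m = 2: C_bias = (M_2 / 2!) * integral of u^2 |K(u)| du.
# For the standard normal, M_2 = sup |f''| = f(0); for this kernel the
# second absolute moment equals 1/5.
M2 = normal_density(0.0)
C_BIAS = 0.5 * M2 * 0.2

# Finite grid on A = [-1, 1], used as a stand-in for the supremum over A.
GRID = [-1.0 + 0.1 * k for k in range(21)]

def sup_bias(h):
    """Largest |b_h(x)| over the grid."""
    return max(abs(bias(x, h)) for x in GRID)

for h in (0.4, 0.2, 0.1, 0.05):
    print(h, sup_bias(h), C_BIAS * h * h)
```

Halving $h$ should divide the computed bias by roughly $4$, which is the $h^m$ scaling with $m=2$.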
[/guided] [/step] [step:Apply the maximal deviation theorem to the centered term] By the uniform maximal deviation condition included in the theorem statement, applied to the bandwidth sequence $(h_n)_{n\in\mathbb{N}}$, the required hypotheses are met as follows: $A$ is compact by assumption; the variables $X_i$ are i.i.d. with common density $f$; the preceding neighbourhood construction gives boundedness of $f$ on the region reached by the kernels for all sufficiently large $n$; the entropy, envelope, separability, and measurability conditions are precisely the kernel-class hypotheses imposed on $K$ in the theorem statement; and the bandwidth restrictions \begin{align*} \frac{n h_n^d}{\log n}\to\infty \end{align*} and \begin{align*} \log(1/h_n)=O(\log n) \end{align*} are exactly the displayed assumptions. Hence the assumed maximal deviation condition gives \begin{align*} \|Z_{h_n}\|_{\infty,A} = \|\hat f_{h_n}-\mathbb{E}[\hat f_{h_n}]\|_{\infty,A} = O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} This is exactly the maximal deviation estimate included among the hypotheses, with each structural assumption matched to the present kernel density estimator setting. [guided] The centered term is the random part of the estimator after subtracting its mean. We use the uniform maximal deviation condition that is explicitly included in the theorem statement. In the present notation, the empirical process is indexed by the compact set $A$ through the class of functions $y\mapsto K((x-y)/h_n)$ with $x\in A$. We verify the hypotheses one by one. The index set is compact because $A$ is compact by hypothesis. The sampling hypothesis holds because the random variables $X_i:(\Omega,\mathcal F,\mathbb P)\to(\mathbb R^d,\mathcal B(\mathbb R^d))$ are i.i.d. with common density $f$ with respect to $\mathcal L^d$ on the neighbourhood reached by the kernels. 
The local boundedness condition on the density holds on the region reached by the kernels: the previous neighbourhood construction produced $\delta>0$ such that $\{y\in\mathbb R^d:\operatorname{dist}(y,A)\le\delta\}\subset U$, and $f$ is bounded there because $f$ is continuous on $U$ and this closed $\delta$-neighbourhood of the compact set $A$ is compact. The entropy, envelope, separability, and measurability assumptions required for the kernel class generated by $K$ are exactly the kernel-class conditions imposed on $K$ in the theorem statement. Finally, the bandwidth conditions required by the assumed maximal deviation condition are \begin{align*} \frac{n h_n^d}{\log n}\to\infty \end{align*} and \begin{align*} \log(1/h_n)=O(\log n), \end{align*} which are also hypotheses of the theorem statement. Therefore the maximal deviation condition applies to the process $x\mapsto \hat f_{h_n}(x)-\mathbb E[\hat f_{h_n}(x)]$ indexed by $x\in A$ and yields \begin{align*} \|Z_{h_n}\|_{\infty,A} = \|\hat f_{h_n}-\mathbb{E}[\hat f_{h_n}]\|_{\infty,A} = O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} This estimate supplies the variance part of the bias-variance tradeoff. [/guided] [/step] [step:Combine the deterministic and stochastic bounds] Combining the decomposition from the first step with the bias estimate and the maximal deviation estimate, the triangle inequality gives \begin{align*} \|\hat f_{h_n}-f\|_{\infty,A} &\le \|\hat f_{h_n}-\mathbb{E}[\hat f_{h_n}]\|_{\infty,A} + \|\mathbb{E}[\hat f_{h_n}]-f\|_{\infty,A}. \end{align*} Substituting the two estimates into the right-hand side gives \begin{align*} \|\hat f_{h_n}-f\|_{\infty,A} = O(h_n^m) + O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n h_n^d}}\right), \end{align*} as claimed. [guided] The first step decomposed the estimator error as the sum of the centered fluctuation and the deterministic bias. 
Taking the supremum norm over $A$ and applying the triangle inequality gives \begin{align*} \|\hat f_{h_n}-f\|_{\infty,A} &\le \|\hat f_{h_n}-\mathbb{E}[\hat f_{h_n}]\|_{\infty,A} + \|\mathbb{E}[\hat f_{h_n}]-f\|_{\infty,A}. \end{align*} The bias step proved \begin{align*} \|\mathbb{E}[\hat f_{h_n}]-f\|_{\infty,A}=O(h_n^m), \end{align*} and the maximal deviation step proved \begin{align*} \|\hat f_{h_n}-\mathbb{E}[\hat f_{h_n}]\|_{\infty,A} = O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} Substituting these two bounds into the triangle inequality yields \begin{align*} \|\hat f_{h_n}-f\|_{\infty,A} = O(h_n^m) + O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n h_n^d}}\right). \end{align*} This is the asserted uniform bias-variance tradeoff. [/guided] [/step] [step:Solve the heuristic balance equation for the bandwidth scale] To balance the deterministic and stochastic terms, set their magnitudes equal: \begin{align*} h^m \asymp \left(\frac{\log n}{n h^d}\right)^{1/2}. \end{align*} Squaring both sides gives \begin{align*} h^{2m} \asymp \frac{\log n}{n h^d}. \end{align*} Multiplying by $h^d$ gives \begin{align*} h^{2m+d} \asymp \frac{\log n}{n}. \end{align*} Taking the power $1/(2m+d)$ yields the heuristic bandwidth scale \begin{align*} h \asymp \left(\frac{\log n}{n}\right)^{1/(2m+d)}. \end{align*} This proves the stated bias-variance tradeoff and the displayed balancing rule. [/step]
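The balance computation can be illustrated with a short Python sketch. It sets all hidden constants to $1$, which is an assumption made only for illustration (the theorem fixes the scale of $h$, not its constant), and the function names are hypothetical. With this choice both terms equal $(\log n/n)^{m/(2m+d)}$.

```python
import math

def balanced_bandwidth(n, m, d):
    """h = (log n / n)^(1/(2m+d)), the scale that equates the two error terms."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log n > 0")
    return (math.log(n) / n) ** (1.0 / (2 * m + d))

def bias_scale(h, m):
    """Deterministic scale h^m (constant taken to be 1)."""
    return h ** m

def stochastic_scale(n, h, d):
    """Stochastic scale sqrt(log n / (n h^d)) (constant taken to be 1)."""
    return math.sqrt(math.log(n) / (n * h ** d))

# Example with kernel order m = 2 in dimension d = 1.
m, d = 2, 1
for n in (10**3, 10**5, 10**7):
    h = balanced_bandwidth(n, m, d)
    print(n, h, bias_scale(h, m), stochastic_scale(n, h, d))
```

At the balanced bandwidth the last two printed columns coincide up to rounding, and both shrink as $n$ grows.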
