Pointwise Bias and Variance Expansion for the Nadaraya-Watson Estimator (Theorem #6348)
Theorem
Let $(X_i,Y_i)_{i=1}^n$ be i.i.d. copies of $(X,Y)$ in the nonparametric regression model, with regression function $m(t)=\mathbb E[Y\mid X=t]$ and conditional variance $\sigma^2(t)=\operatorname{Var}(Y\mid X=t)$. Fix an interior point $x$ such that the design density $f_X$ satisfies $f_X(x)>0$. Assume $K$ is symmetric, nonnegative, bounded, compactly supported, integrates to $1$, is uniformly positive on a neighbourhood of $0$, has finite second moment $\mu_2(K)=\int u^2K(u)\,d\mathcal L^1(u)$, and satisfies $R(K)=\int K(u)^2\,d\mathcal L^1(u)<\infty$. For $h>0$, define $K_h:\mathbb R\to[0,\infty)$ by $K_h(u):=h^{-1}K(u/h)$. Let
\begin{align*}
A_{n,h}(x)=\left\{\sum_{i=1}^n K_h(x-X_i)\ne0\right\}.
\end{align*}
Define $\hat m_{NW}(x):\Omega\to\mathbb R$ by
\begin{align*}
\hat m_{NW}(x):=
\frac{\sum_{i=1}^nY_iK_h(x-X_i)}{\sum_{i=1}^nK_h(x-X_i)}
\end{align*}
on $A_{n,h}(x)$ and by $0$ on $A_{n,h}(x)^c$. Assume $m$ and $f_X$ are twice continuously differentiable near $x$, $\sigma^2$ is continuous at $x$, $\mathbb E[Y^2\mid X=t]$ is bounded for $t$ near $x$, and $h\to0$, $nh\to\infty$, $nh^3\to\infty$ as $n\to\infty$. Then there is a constant $c_x>0$ such that $\mathbb P(A_{n,h}(x)^c)\le\exp(-c_xnh)$ for all sufficiently small $h$, and the bias satisfies
\begin{align*}
\mathbb E[\hat m_{NW}(x)]-m(x)=\frac{h^2\mu_2(K)}{2}\left(m''(x)+2m'(x)\frac{f_X'(x)}{f_X(x)}\right)+o(h^2).
\end{align*}
The variance satisfies
\begin{align*}
\operatorname{Var}(\hat m_{NW}(x))=\frac{1}{nh}\frac{\sigma^2(x)}{f_X(x)}R(K)+o\left(\frac{1}{nh}\right).
\end{align*}
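The estimator in the statement, including the convention that it equals $0$ when the denominator vanishes, can be sketched as follows. This is a minimal illustration: the Epanechnikov kernel is an assumed choice that satisfies the kernel hypotheses, and the function names `epanechnikov` and `nadaraya_watson` are hypothetical.

```python
from typing import Sequence


def epanechnikov(u: float) -> float:
    """Assumed kernel: symmetric, nonnegative, bounded, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nadaraya_watson(x: float, xs: Sequence[float], ys: Sequence[float], h: float) -> float:
    """Nadaraya-Watson estimate at x; returns 0.0 when all kernel weights vanish."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    # K_h(x - X_i) = K((x - X_i) / h) / h
    weights = [epanechnikov((x - xi) / h) / h for xi in xs]
    denom = sum(weights)
    if denom == 0.0:
        # Complement of A_{n,h}(x): the estimator is defined to be 0 there.
        return 0.0
    numer = sum(w * yi for w, yi in zip(weights, ys))
    return numer / denom
```

Because the weights are nonnegative, the returned value is a weighted average of the responses whose design points fall within distance $h$ of $x$.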
Probability & Statistics
Discussion
This theorem is a core [nonparametric statistics](/page/Nonparametric%20Statistics) result for kernel regression. It shows that, at an interior design point, the Nadaraya-Watson estimator has bias of order $h^2$ and variance of order $(nh)^{-1}$, with explicit leading constants. The bias constant contains the design-dependent term $2m'(x)f_X'(x)/f_X(x)$ in addition to $m''(x)$, and the variance constant is $\sigma^2(x)R(K)/f_X(x)$.
Proof
[proofplan]
We first show that the kernel denominator is positive with exponentially high probability, using the uniform lower bound of $K$ near $0$ and the positivity of $f_X(x)$. We then replace the random ratio by the ratio of its expectations and prove that the error is $o(h^2)$, where the condition $nh^3\to\infty$ makes denominator fluctuations smaller than the deterministic second-order bias. Finally, Taylor expansion of the numerator and denominator gives the bias constant, and a first-order linearisation of the ratio gives the variance constant.
[/proofplan]
[step:Bound the probability that the denominator vanishes]
For each $h>0$, define the rescaled kernel map $K_h:\mathbb R\to[0,\infty)$ by
\begin{align*}
K_h(u):=h^{-1}K(u/h).
\end{align*}
Since $K$ is uniformly positive on a neighbourhood of $0$, choose constants $a_K>0$ and $c_K>0$ such that $K(u)\ge c_K$ whenever $|u|\le a_K$. For $h>0$ and $i\in\{1,\dots,n\}$, define the kernel weight $Z_{i,h}:\Omega\to[0,\infty)$ by
\begin{align*}
Z_{i,h}:=K_h(x-X_i)=h^{-1}K((x-X_i)/h).
\end{align*}
Define the denominator $S_{n,h}:\Omega\to[0,\infty)$ by
\begin{align*}
S_{n,h}:=\sum_{i=1}^n Z_{i,h}.
\end{align*}
Since $K\ge0$, the event $A_{n,h}(x)^c$ is contained in the event that no observation lies in $[x-a_Kh,x+a_Kh]$. Let
\begin{align*}
p_h:=\mathbb P(|X-x|\le a_Kh).
\end{align*}
Because $f_X$ is continuous at $x$ and $f_X(x)>0$, there are constants $h_0>0$ and $c_x>0$ such that $p_h\ge c_x h$ for $0<h<h_0$. Independence gives
\begin{align*}
\mathbb P(A_{n,h}(x)^c)\le (1-p_h)^n\le \exp(-np_h)\le \exp(-c_xnh).
\end{align*}
Thus $\mathbb P(A_{n,h}(x)^c)$ is exponentially small in $nh$.
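The chain of bounds $(1-p_h)^n\le\exp(-np_h)\le\exp(-c_xnh)$ can be checked numerically. The sketch below assumes a design density that is constant near $x$, so that $p_h=2a_Kf_X(x)h$ exactly and $c_x=a_Kf_X(x)$; the function name `vanishing_bounds` and the parameter values in the usage are hypothetical.

```python
import math
from typing import Tuple


def vanishing_bounds(n: int, h: float, a_k: float, f_x: float) -> Tuple[float, float, float]:
    """Return ((1 - p_h)^n, exp(-n p_h), exp(-c_x n h)) for a locally flat density.

    Assumes p_h = 2 * a_K * f_X(x) * h (density constant on the window) and
    c_x = a_K * f_X(x), matching the constants used in this step.
    """
    p_h = 2.0 * a_k * f_x * h
    if not 0.0 <= p_h <= 1.0:
        raise ValueError("window probability must lie in [0, 1]")
    c_x = a_k * f_x
    return (1.0 - p_h) ** n, math.exp(-n * p_h), math.exp(-c_x * n * h)
```

For example, `vanishing_bounds(200, 0.05, 0.5, 1.0)` returns three values in nondecreasing order, the last being $\exp(-5)$.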
[guided]
The denominator can vanish only if every kernel weight is zero. The nonnegativity hypothesis is essential here: with $K\ge0$, no cancellation among positive and negative weights is possible. Since $K(u)\ge c_K>0$ whenever $|u|\le a_K$, any observation satisfying $|X_i-x|\le a_Kh$ contributes at least $h^{-1}c_K>0$ to $S_{n,h}$. Therefore
\begin{align*}
A_{n,h}(x)^c\subseteq \bigcap_{i=1}^n\{|X_i-x|>a_Kh\}.
\end{align*}
Define $p_h:=\mathbb P(|X-x|\le a_Kh)$. Since $f_X$ is continuous at $x$ and $f_X(x)>0$, there is $h_0>0$ such that $f_X(t)\ge f_X(x)/2$ for $|t-x|\le a_Kh_0$. Hence, for $0<h<h_0$,
\begin{align*}
p_h=\int_{x-a_Kh}^{x+a_Kh} f_X(t)\,d\mathcal L^1(t)\ge a_K f_X(x)h.
\end{align*}
Set $c_x:=a_Kf_X(x)$. The variables $X_1,\dots,X_n$ are independent, so
\begin{align*}
\mathbb P(A_{n,h}(x)^c)\le (1-p_h)^n.
\end{align*}
For $0\le r\le1$, the elementary exponential inequality $1-r\le e^{-r}$ follows from convexity of the exponential function. Applying this with $r=p_h$ gives
\begin{align*}
\mathbb P(A_{n,h}(x)^c)\le \exp(-np_h)\le \exp(-c_xnh).
\end{align*}
This proves the claimed exponential smallness.
[/guided]
[/step]
[step:Linearise the random ratio around its deterministic numerator and denominator]
Define the random numerator $T_{n,h}:\Omega\to\mathbb R$ by
\begin{align*}
T_{n,h}:=\sum_{i=1}^n Y_iZ_{i,h}.
\end{align*}
Define the deterministic quantities $a_h\in\mathbb R$, $b_h\in\mathbb R$, and $r_h\in\mathbb R$ by
\begin{align*}
a_h:=\mathbb E[Z_{1,h}],\qquad b_h:=\mathbb E[Y_1Z_{1,h}],\qquad r_h:=b_h/a_h.
\end{align*}
Let $R_K>0$ be such that $\operatorname{supp}K\subset[-R_K,R_K]$, and let $M_K:=\sup_{u\in\mathbb R}|K(u)|$. Fix the regular conditional version near $x$ used to define $m(t)=\mathbb E[Y\mid X=t]$ and $\sigma^2(t)=\operatorname{Var}(Y\mid X=t)$. The local second-moment hypothesis gives $h_0>0$ and $M_2<\infty$ such that $\mathbb E[Y^2\mid X=t]\le M_2$ whenever $|t-x|\le R_Kh_0$. Since $K$ is compactly supported and $f_X$ is continuous at $x$, $a_h\to f_X(x)>0$, so after decreasing $h_0$ we have $a_h\ge f_X(x)/2$ for $0<h<h_0$.
For $0<h<h_0$, the support condition implies $Z_{1,h}=0$ unless $|X_1-x|\le R_Kh$. We use the [law of total expectation](/theorems/1121) with respect to the regular conditional law of $Y$ given $X=t$, and then use that $X$ has density $f_X$ with respect to $\mathcal L^1$. Hence
\begin{align*}
\mathbb E[Z_{1,h}^2]
&=h^{-1}\int_{\mathbb R}K(u)^2f_X(x-hu)\,d\mathcal L^1(u)=O(h^{-1}),
\end{align*}
and
\begin{align*}
\mathbb E[Y_1^2Z_{1,h}^2]
&=h^{-1}\int_{\mathbb R}K(u)^2\mathbb E[Y^2\mid X=x-hu]f_X(x-hu)\,d\mathcal L^1(u)=O(h^{-1}).
\end{align*}
Therefore
\begin{align*}
\mathbb E[(S_{n,h}-na_h)^2]=O(nh^{-1}).
\end{align*}
Also,
\begin{align*}
\mathbb E[(T_{n,h}-nb_h)^2]=O(nh^{-1}).
\end{align*}
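The scaling $\mathbb E[Z_{1,h}^2]=O(h^{-1})$ can be illustrated by evaluating $h\,\mathbb E[Z_{1,h}^2]=\int K(u)^2f_X(x-hu)\,d\mathcal L^1(u)$ numerically. The sketch assumes the Epanechnikov kernel and a Beta(2,2) design density on $[0,1]$; both choices and all function names are hypothetical and serve only as an example.

```python
def epanechnikov(u: float) -> float:
    """Assumed kernel supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def simpson(func, lo: float, hi: float, panels: int = 2000) -> float:
    """Composite Simpson rule with an even number of panels."""
    if panels % 2:
        raise ValueError("panels must be even")
    step = (hi - lo) / panels
    total = func(lo) + func(hi)
    for j in range(1, panels):
        total += (4.0 if j % 2 else 2.0) * func(lo + j * step)
    return total * step / 3.0


def beta22_density(t: float) -> float:
    """Hypothetical design density f_X(t) = 6 t (1 - t) on [0, 1]."""
    return 6.0 * t * (1.0 - t) if 0.0 <= t <= 1.0 else 0.0


def scaled_second_moment(h: float, x: float, density) -> float:
    """h * E[Z_{1,h}^2] = integral of K(u)^2 f_X(x - h u) over the kernel support."""
    return simpson(lambda u: epanechnikov(u) ** 2 * density(x - h * u), -1.0, 1.0)
```

As $h\to0$ the returned value approaches $R(K)f_X(x)$, which stays bounded, so $\mathbb E[Z_{1,h}^2]$ itself grows like $h^{-1}$.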
We verify the inputs for the scalar concentration bound stated below. Since $|K|\le M_K$, each centered summand satisfies
\begin{align*}
|Z_{i,h}-a_h|\le h^{-1}M_K+a_h\le C_Zh^{-1}
\end{align*}
for a constant $C_Z>0$. Also $\operatorname{Var}(Z_{i,h})\le\mathbb E[Z_{i,h}^2]\le C_Zh^{-1}$, so
\begin{align*}
\sum_{i=1}^n\operatorname{Var}(Z_{i,h})\le C_Znh^{-1}.
\end{align*}
We use the following scalar [Bernstein Inequality](/theorems/1200): if independent centered variables are bounded in absolute value by $M>0$ and have total variance at most $v>0$, then for every $t>0$ their sum exceeds $t$ in absolute value with probability at most
\begin{align*}
2\exp\left(-\frac{t^2}{2(v+Mt/3)}\right).
\end{align*}
Applying this bound to the independent centered variables $Z_{i,h}-a_h$, with $M=C_Zh^{-1}$, $v=C_Znh^{-1}$, and $t=na_h/2$, and using $a_h\ge f_X(x)/2$, gives constants $c_1,c_2>0$ depending only on $x$, $K$, and an upper bound for $f_X$ near $x$ such that
\begin{align*}
\mathbb P\left(|S_{n,h}-na_h|>\frac{na_h}{2}\right)\le c_1\exp(-c_2nh).
\end{align*}
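To see why the Bernstein exponent is proportional to $nh$, one can substitute $M=C_Zh^{-1}$, $v=C_Znh^{-1}$ and $t=na_h/2$ directly. The sketch below does this; the function names and the numerical values of $a_h$ and $C_Z$ in the usage are hypothetical.

```python
import math


def bernstein_exponent(t: float, v: float, m_bound: float) -> float:
    """Exponent t^2 / (2 (v + M t / 3)) in the scalar Bernstein bound."""
    return t * t / (2.0 * (v + m_bound * t / 3.0))


def bernstein_tail(t: float, v: float, m_bound: float) -> float:
    """Two-sided Bernstein bound 2 exp(-t^2 / (2 (v + M t / 3)))."""
    return 2.0 * math.exp(-bernstein_exponent(t, v, m_bound))


def denominator_exponent(n: int, h: float, a_h: float, c_z: float) -> float:
    """Exponent for the denominator with M = C_Z / h, v = C_Z n / h, t = n a_h / 2."""
    return bernstein_exponent(n * a_h / 2.0, c_z * n / h, c_z / h)
```

With these substitutions the exponent equals $nh\,a_h^2/\bigl(8C_Z(1+a_h/6)\bigr)$, so `denominator_exponent(n, h, a_h, c_z) / (n * h)` does not depend on $n$ or $h$.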
Let
\begin{align*}
B_{n,h}:=\left\{|S_{n,h}-na_h|\le \frac{na_h}{2}\right\}.
\end{align*}
On $B_{n,h}$, $S_{n,h}\ge na_h/2$, and the algebraic identity
\begin{align*}
\frac{T_{n,h}}{S_{n,h}}-r_h
=\frac{T_{n,h}-r_hS_{n,h}}{na_h}
-\frac{(T_{n,h}-r_hS_{n,h})(S_{n,h}-na_h)}{na_hS_{n,h}}
\end{align*}
holds. Since $\mathbb E[T_{n,h}-r_hS_{n,h}]=nb_h-r_hna_h=0$, taking expectations on $B_{n,h}$ and subtracting the missing part of the centered linear term gives
\begin{align*}
\left|\mathbb E\left[\left(\frac{T_{n,h}}{S_{n,h}}-r_h\right)\mathbb 1_{B_{n,h}}\right]\right|
\le \frac{2}{n^2a_h^2}\mathbb E\left[|T_{n,h}-r_hS_{n,h}|\,|S_{n,h}-na_h|\right]+\frac{1}{na_h}\mathbb E\left[|T_{n,h}-r_hS_{n,h}|\mathbb 1_{B_{n,h}^c}\right].
\end{align*}
Applying the [Cauchy-Schwarz Inequality](/theorems/1201), in the [Hilbert space](/page/Hilbert%20Space) $L^2(\Omega)$, to the two centered sums and to the indicator term gives
\begin{align*}
\left|\mathbb E\left[\left(\frac{T_{n,h}}{S_{n,h}}-r_h\right)\mathbb 1_{B_{n,h}}\right]\right|
\le C(nh)^{-1}+C(nh)^{-1/2}\mathbb P(B_{n,h}^c)^{1/2}.
\end{align*}
The exponential bound for $B_{n,h}^c$ makes the second term $O((nh)^{-1})$, so the contribution from $B_{n,h}$ is $O((nh)^{-1})$. Here the constant $C>0$ is chosen large enough to dominate the preceding second-moment bounds; it depends only on $x$, $K$, $f_X$, and the local bounded second-moment hypothesis for $Y$ near $x$.
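The algebraic ratio identity used in this step holds for any real numbers with $S_{n,h}\ne0$ and $na_h\ne0$, and it can be sanity-checked numerically. In the sketch below the function name `ratio_identity_gap` is hypothetical, and the arguments stand for $T_{n,h}$, $S_{n,h}$, $a_h$, $r_h$ and $n$.

```python
def ratio_identity_gap(t: float, s: float, a: float, r: float, n: int) -> float:
    """Return (T/S - r) minus the sum of the linear term and the remainder term."""
    lhs = t / s - r
    linear = (t - r * s) / (n * a)
    remainder = -(t - r * s) * (s - n * a) / (n * a * s)
    return lhs - (linear + remainder)
```

Up to floating-point rounding the returned gap is zero, whatever the inputs. The centering $r_h=b_h/a_h$ is not needed for the identity itself; it is used only to make the linear term have mean zero.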
It remains to bound the actual ratio on $B_{n,h}^c$. Recall that $m(t)=\mathbb E[Y\mid X=t]$ is the regression function, and choose $\delta>0$ and $M_2<\infty$ such that $\mathbb E[Y^2\mid X=t]\le M_2$ whenever $|t-x|<\delta$. Since $K$ is compactly supported, for all sufficiently small $h$ every nonzero weight $K_h(x-X_i)$ has $|X_i-x|<\delta$. Because $K\ge0$, on $A_{n,h}(x)$ the estimator is a weighted average of those $Y_i$ with nonzero weight, so [Jensen's inequality](/theorems/9) for the finite probability weights gives
\begin{align*}
\hat m_{NW}(x)^2\le \frac{\sum_{i=1}^n K_h(x-X_i)Y_i^2}{\sum_{i=1}^n K_h(x-X_i)}.
\end{align*}
Conditioning on $X_1,\dots,X_n$ and using the local second-moment bound gives $\mathbb E[\hat m_{NW}(x)^2\mid X_1,\dots,X_n]\le M_2$ on $A_{n,h}(x)$, while $\hat m_{NW}(x)=0$ on $A_{n,h}(x)^c$. Since $r_h\to m(x)$, there is $M_r<\infty$ such that $|r_h|\le M_r$ for all sufficiently small $h$. Therefore the [Cauchy-Schwarz inequality](/theorems/432) gives
\begin{align*}
\left|\mathbb E\left[(\hat m_{NW}(x)-r_h)\mathbb 1_{B_{n,h}^c}\right]\right|
\le \left(2M_2+2M_r^2\right)^{1/2}\mathbb P(B_{n,h}^c)^{1/2}=O((nh)^{-1}).
\end{align*}
Thus
\begin{align*}
\mathbb E[\hat m_{NW}(x)]-r_h=O((nh)^{-1}).
\end{align*}
Since $nh^3\to\infty$, we have $(nh)^{-1}=o(h^2)$, and therefore
\begin{align*}
\mathbb E[\hat m_{NW}(x)]-r_h=o(h^2).
\end{align*}
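The role of the condition $nh^3\to\infty$ can be illustrated by the ratio $(nh)^{-1}/h^2=1/(nh^3)$. The bandwidth sequences $h=n^{-1/5}$ and $h=n^{-1/3}$ used below are assumptions chosen for illustration only, and the function name is hypothetical.

```python
def fluctuation_to_bias_ratio(n: int, h: float) -> float:
    """(nh)^{-1} / h^2 = 1 / (n h^3); it must tend to 0 for the ratio error to be o(h^2)."""
    return 1.0 / (n * h ** 3)


def ratios_along(exponent: float, sample_sizes=(10 ** 2, 10 ** 4, 10 ** 6)):
    """Evaluate the ratio along the bandwidth sequence h = n^(-exponent)."""
    return [fluctuation_to_bias_ratio(n, n ** (-exponent)) for n in sample_sizes]
```

Along $h=n^{-1/5}$ the ratio equals $n^{-2/5}$ and tends to $0$, so the condition holds. Along $h=n^{-1/3}$ the ratio stays equal to $1$, so the ratio error is no longer negligible relative to $h^2$.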
[/step]
[step:Expand the deterministic numerator and denominator to second order]
Choose an open interval $I\subset\mathbb R$ containing $x$ on which $m$ and $f_X$ are twice continuously differentiable. Define the product map $g:I\to\mathbb R$ by $g(t)=m(t)f_X(t)$.
Since $m$ and $f_X$ are twice continuously differentiable on $I$, the function $g$ is twice continuously differentiable on $I$. For all sufficiently small $h$, the compact support of $K$ ensures that $x-hu\in I$ for every $u\in\operatorname{supp}K$. Since $X$ has density $f_X$ with respect to $\mathcal L^1$, the change of variables $t=x-hu$, with $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$, gives
\begin{align*}
a_h=\int_{\mathbb R}K(u)f_X(x-hu)\,d\mathcal L^1(u).
\end{align*}
The [Taylor Theorem With Remainder](/theorems/1202), applied at $x$, together with symmetry of $K$ and $\int_{\mathbb R}K(u)\,d\mathcal L^1(u)=1$, gives
\begin{align*}
a_h=f_X(x)+\frac{h^2\mu_2(K)}{2}f_X''(x)+o(h^2).
\end{align*}
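Similarly, the law of total expectation applied to the regular conditional law of $Y$ given $X=t$ gives $b_h=\int_{\mathbb R}K_h(x-t)m(t)f_X(t)\,d\mathcal L^1(t)$, and the same change of variables and Taylor expansion yield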
\begin{align*}
b_h=\int_{\mathbb R}K(u)g(x-hu)\,d\mathcal L^1(u)=g(x)+\frac{h^2\mu_2(K)}{2}g''(x)+o(h^2).
\end{align*}
Dividing the two expansions and using $g(x)=m(x)f_X(x)$ yields
\begin{align*}
r_h-m(x)=\frac{h^2\mu_2(K)}{2}\left(\frac{g''(x)}{f_X(x)}-m(x)\frac{f_X''(x)}{f_X(x)}\right)+o(h^2).
\end{align*}
Since $g''=m''f_X+2m'f_X'+mf_X''$, this becomes
\begin{align*}
r_h-m(x)=\frac{h^2\mu_2(K)}{2}\left(m''(x)+2m'(x)\frac{f_X'(x)}{f_X(x)}\right)+o(h^2).
\end{align*}
Combining this with $\mathbb E[\hat m_{NW}(x)]-r_h=o(h^2)$ proves the bias expansion.
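The expansion of $r_h-m(x)$ can be compared with a direct numerical evaluation of $b_h/a_h$. The sketch below is an illustration under assumed choices that are not part of the theorem: the Epanechnikov kernel (for which $\mu_2(K)=1/5$), the regression function $m(t)=\sin t$, and the design density $f_X(t)=1/2+t$ on $[0,1]$. All function names are hypothetical.

```python
import math


def epanechnikov(u: float) -> float:
    """Assumed kernel on [-1, 1]; its second moment mu_2(K) equals 1/5."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def simpson(func, lo: float, hi: float, panels: int = 2000) -> float:
    """Composite Simpson rule with an even number of panels."""
    if panels % 2:
        raise ValueError("panels must be even")
    step = (hi - lo) / panels
    total = func(lo) + func(hi)
    for j in range(1, panels):
        total += (4.0 if j % 2 else 2.0) * func(lo + j * step)
    return total * step / 3.0


def m_reg(t: float) -> float:
    """Hypothetical regression function m(t) = sin(t)."""
    return math.sin(t)


def density(t: float) -> float:
    """Hypothetical design density f_X(t) = 1/2 + t on [0, 1]."""
    return 0.5 + t if 0.0 <= t <= 1.0 else 0.0


def deterministic_ratio(x: float, h: float) -> float:
    """r_h = b_h / a_h, computed from the integral representations of a_h and b_h."""
    a_h = simpson(lambda u: epanechnikov(u) * density(x - h * u), -1.0, 1.0)
    b_h = simpson(lambda u: epanechnikov(u) * m_reg(x - h * u) * density(x - h * u), -1.0, 1.0)
    return b_h / a_h


def leading_bias(x: float, h: float) -> float:
    """(h^2 mu_2 / 2) (m''(x) + 2 m'(x) f_X'(x) / f_X(x)) for the assumed m and f_X."""
    mu2 = 0.2
    m1, m2 = math.cos(x), -math.sin(x)  # m'(x) and m''(x) for m = sin
    f1 = 1.0                            # f_X'(x) for f_X(t) = 1/2 + t
    return 0.5 * h * h * mu2 * (m2 + 2.0 * m1 * f1 / density(x))
```

At $x=0.5$, the quotient `(deterministic_ratio(x, h) - m_reg(x)) / leading_bias(x, h)` moves closer to $1$ as $h$ decreases, in line with the $o(h^2)$ remainder.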
[guided]
The deterministic ratio $r_h=b_h/a_h$ is the ratio obtained by replacing the random numerator and denominator by their expectations. Choose an open interval $I\subset\mathbb R$ containing $x$ on which $m$ and $f_X$ are twice continuously differentiable, and define the product map $g:I\to\mathbb R$ by $g(t)=m(t)f_X(t)$.
Then $g$ is twice continuously differentiable on $I$ by the product rule. For all sufficiently small $h$, the compact support of $K$ ensures that $x-hu\in I$ for every $u\in\operatorname{supp}K$. We first condition on $X=t$ and use the regular conditional law of $Y$ given $X=t$; the law of total expectation then writes expectations involving $Y$ as integrals against the density $f_X(t)\,d\mathcal L^1(t)$. Applying the change of variables $t=x-hu$, so that $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$, gives
\begin{align*}
a_h=\int_{\mathbb R}K(u)f_X(x-hu)\,d\mathcal L^1(u)
\end{align*}
and
\begin{align*}
b_h=\int_{\mathbb R}K(u)g(x-hu)\,d\mathcal L^1(u).
\end{align*}
The [Taylor Theorem With Remainder](/theorems/1202), applied at $x$ to the one-dimensional function $f_X:I\to\mathbb R$, gives
\begin{align*}
f_X(x-hu)=f_X(x)-hu f_X'(x)+\frac{h^2u^2}{2}f_X''(x)+o(h^2u^2)
\end{align*}
uniformly for $u\in\operatorname{supp}K$, because $\operatorname{supp}K$ is compact and $f_X''$ is continuous near $x$. Since $K$ is symmetric, $\int_{\mathbb R}uK(u)\,d\mathcal L^1(u)=0$, and since $\int_{\mathbb R}K(u)\,d\mathcal L^1(u)=1$, we obtain
\begin{align*}
a_h=f_X(x)+\frac{h^2\mu_2(K)}{2}f_X''(x)+o(h^2).
\end{align*}
The same argument applied to $g$ gives
\begin{align*}
b_h=g(x)+\frac{h^2\mu_2(K)}{2}g''(x)+o(h^2).
\end{align*}
Because $f_X(x)>0$, division of the two expansions is valid and yields
\begin{align*}
r_h-m(x)=\frac{h^2\mu_2(K)}{2}\left(\frac{g''(x)}{f_X(x)}-m(x)\frac{f_X''(x)}{f_X(x)}\right)+o(h^2).
\end{align*}
Finally, differentiating $g=mf_X$ twice gives
\begin{align*}
g''(x)=m''(x)f_X(x)+2m'(x)f_X'(x)+m(x)f_X''(x).
\end{align*}
Substitution cancels the final $m(x)f_X''(x)$ term and gives
\begin{align*}
r_h-m(x)=\frac{h^2\mu_2(K)}{2}\left(m''(x)+2m'(x)\frac{f_X'(x)}{f_X(x)}\right)+o(h^2).
\end{align*}
The previous step proved $\mathbb E[\hat m_{NW}(x)]-r_h=o(h^2)$, so the same expansion holds for $\mathbb E[\hat m_{NW}(x)]-m(x)$.
[/guided]
[/step]
[step:Compute the leading variance from the linearised centered numerator]
Define $U_{i,h}:\Omega\to\mathbb R$ by $U_{i,h}:=(Y_i-r_h)Z_{i,h}$ for $i\in\{1,\dots,n\}$. On $B_{n,h}\cap A_{n,h}(x)$, the ratio identity from the preceding step gives
\begin{align*}
\hat m_{NW}(x)-r_h
=\frac{1}{na_h}\sum_{i=1}^n U_{i,h}+R_{n,h},
\end{align*}
where
\begin{align*}
R_{n,h}:=-\frac{\left(\sum_{i=1}^n U_{i,h}\right)(S_{n,h}-na_h)}{na_hS_{n,h}}.
\end{align*}
Let $V_{n,h}:=\sum_{i=1}^n U_{i,h}$ and $W_{n,h}:=S_{n,h}-na_h$. On $B_{n,h}$, $S_{n,h}\ge na_h/2$, so
\begin{align*}
R_{n,h}^2\mathbb 1_{B_{n,h}}
\le \frac{4V_{n,h}^2W_{n,h}^2}{n^4a_h^4}.
\end{align*}
We estimate this remainder using conditional second moments, without requiring fourth moments of $Y$. Conditional on $X_1,\dots,X_n$, the quantity $W_{n,h}$ is fixed. The summands in $V_{n,h}$ are not conditionally centered at $r_h$, so we separate their conditional means. Define the conditional noise variables $\varepsilon_i:=Y_i-m(X_i)$ and the drift term, which is deterministic given $X_1,\dots,X_n$,
\begin{align*}
D_{n,h}:=\sum_{i=1}^n (m(X_i)-r_h)Z_{i,h}.
\end{align*}
Then $V_{n,h}=\sum_{i=1}^n \varepsilon_iZ_{i,h}+D_{n,h}$, and the conditional independence of the observations gives
\begin{align*}
\mathbb E\left[\left(\sum_{i=1}^n \varepsilon_iZ_{i,h}\right)^2\mid X_1,\dots,X_n\right]
\le C_0\sum_{i=1}^n Z_{i,h}^2
\end{align*}
for a constant $C_0>0$ depending only on the local conditional variance bound. Since $m$ is continuously differentiable near $x$ and $r_h=m(x)+O(h^2)$ by the deterministic expansion, there is $C_m>0$ such that $|m(t)-r_h|\le C_mh$ whenever $|t-x|\le R_Kh$ and $h$ is sufficiently small. Hence
\begin{align*}
D_{n,h}^2\le C_m^2h^2S_{n,h}^2.
\end{align*}
Since $|K|\le M_K$ and $K$ is supported in $[-R_K,R_K]$, we have $Z_{i,h}^2\le M_Kh^{-1}Z_{i,h}$ and therefore $\sum_i Z_{i,h}^2\le M_Kh^{-1}S_{n,h}$. On $B_{n,h}$, also $S_{n,h}\le 3na_h/2$, so
\begin{align*}
\mathbb E[V_{n,h}^2\mid X_1,\dots,X_n]\mathbb 1_{B_{n,h}}
\le C_1(nh^{-1}+n^2h^2)\mathbb 1_{B_{n,h}}
\end{align*}
for a constant $C_1>0$. Hence
\begin{align*}
\mathbb E[R_{n,h}^2\mathbb 1_{B_{n,h}}]
\le C_2\left(\frac{1}{n^3h}+\frac{h^2}{n^2}\right)\mathbb E[W_{n,h}^2].
\end{align*}
The denominator calculation from the preceding step gives $\mathbb E[W_{n,h}^2]=O(nh^{-1})$, so
\begin{align*}
\mathbb E[R_{n,h}^2\mathbb 1_{B_{n,h}}]
=O(n^{-2}h^{-2})+O(h/n)=o((nh)^{-1}),
\end{align*}
because $nh\to\infty$ and $h\to0$.
Define the linearised leading term
\begin{align*}
L_{n,h}:=\frac{1}{na_h}\sum_{i=1}^n U_{i,h}=\frac{V_{n,h}}{na_h}.
\end{align*}
It remains to control the complement of $B_{n,h}$. On $A_{n,h}(x)$, all nonzero weights are positive and come from observations with $|X_i-x|\le R_Kh$. For sufficiently small $h$, the local second-moment assumption gives $M_2<\infty$ such that $\mathbb E[Y_i^2\mid X_i=t]\le M_2$ throughout this local support. Since $\hat m_{NW}(x)$ is a weighted average on $A_{n,h}(x)$, [Jensen's inequality](/theorems/1977) for the finite probability weights and conditioning on $X_1,\dots,X_n$ give $\mathbb E[\hat m_{NW}(x)^2\mid X_1,\dots,X_n]\le M_2$ on $A_{n,h}(x)$, while $\hat m_{NW}(x)=0$ on $A_{n,h}(x)^c$. Since $r_h\to m(x)$, choose $M_r<\infty$ so that $|r_h|\le M_r$ for all sufficiently small $h$. Therefore
\begin{align*}
\mathbb E[(\hat m_{NW}(x)-r_h)^2\mathbb 1_{B_{n,h}^c}]
\le (2M_2+2M_r^2)\mathbb P(B_{n,h}^c)
=o((nh)^{-1})
\end{align*}
by the exponential bound for $B_{n,h}^c$. The leading term must also be controlled on this same bad event. The bounds on the noise sum and on $D_{n,h}$ above, together with $\sum_i Z_{i,h}^2\le M_Kh^{-1}S_{n,h}$, give the conditional estimate
\begin{align*}
\mathbb E[V_{n,h}^2\mid X_1,\dots,X_n]
\le C\left(h^{-1}S_{n,h}+h^2S_{n,h}^2\right)
\end{align*}
and the definition $L_{n,h}=V_{n,h}/(na_h)$, we obtain
\begin{align*}
\mathbb E[L_{n,h}^2\mathbb 1_{B_{n,h}^c}]
\le \frac{C}{n^2}\left(h^{-1}\mathbb E[S_{n,h}\mathbb 1_{B_{n,h}^c}]+h^2\mathbb E[S_{n,h}^2\mathbb 1_{B_{n,h}^c}]\right).
\end{align*}
By the [Cauchy-Schwarz Inequality](/theorems/1201),
\begin{align*}
\mathbb E[S_{n,h}\mathbb 1_{B_{n,h}^c}]
\le \mathbb E[S_{n,h}^2]^{1/2}\mathbb P(B_{n,h}^c)^{1/2}.
\end{align*}
Again by the same inequality,
\begin{align*}
\mathbb E[S_{n,h}^2\mathbb 1_{B_{n,h}^c}]
\le \mathbb E[S_{n,h}^4]^{1/2}\mathbb P(B_{n,h}^c)^{1/2}.
\end{align*}
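The boundedness of $K$ gives $S_{n,h}\le nM_Kh^{-1}$, hence bounds for $\mathbb E[S_{n,h}^2]$ and $\mathbb E[S_{n,h}^4]$ that are polynomial in $n$ and $h^{-1}$, while $\mathbb P(B_{n,h}^c)$ is exponentially small in $nh$. Since $nh^3\to\infty$ implies $h^{-1}=o((nh)^{1/2})$, these polynomial factors are dominated by the exponential term. Therefore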
\begin{align*}
\mathbb E[L_{n,h}^2\mathbb 1_{B_{n,h}^c}]=o((nh)^{-1}).
\end{align*}
Combining the good-event remainder estimate with the two bad-event estimates and the inequality $(a+b)^2\le2a^2+2b^2$ gives
\begin{align*}
\mathbb E[(\hat m_{NW}(x)-r_h-L_{n,h})^2]=o((nh)^{-1}).
\end{align*}
Also $\mathbb E[L_{n,h}^2]=O((nh)^{-1})$. Applying the [Cauchy-Schwarz Inequality](/theorems/1201) in $L^2(\Omega)$ to the covariance term gives
\begin{align*}
\left|\mathbb E[L_{n,h}(\hat m_{NW}(x)-r_h-L_{n,h})]\right|=o((nh)^{-1}).
\end{align*}
The preceding bias estimate gives $\mathbb E[\hat m_{NW}(x)]-r_h=O((nh)^{-1})$, so
\begin{align*}
\left(\mathbb E[\hat m_{NW}(x)]-r_h\right)^2=O((nh)^{-2})=o((nh)^{-1})
\end{align*}
because $nh\to\infty$. Since subtracting the deterministic constant $r_h$ does not change variance, this yields
\begin{align*}
\operatorname{Var}(\hat m_{NW}(x))=\operatorname{Var}(L_{n,h})+o((nh)^{-1})
\end{align*}
and therefore
\begin{align*}
\operatorname{Var}(\hat m_{NW}(x))=\frac{1}{n^2a_h^2}\operatorname{Var}\left(\sum_{i=1}^n U_{i,h}\right)+o((nh)^{-1}).
\end{align*}
The variables $U_{1,h},\dots,U_{n,h}$ are independent and identically distributed, so
\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n U_{i,h}\right)=n\operatorname{Var}(U_{1,h}).
\end{align*}
Moreover $\mathbb E[U_{1,h}]=0$ by the definition $r_h=b_h/a_h$, hence $\operatorname{Var}(U_{1,h})=\mathbb E[U_{1,h}^2]$. Choose a regular conditional version near $x$ for which $t\mapsto\mathbb E[(Y-m(t))^2\mid X=t]$ is defined at $x$ and in a neighbourhood of $x$. Choose an open interval $I\subset\mathbb R$ containing $x$ on which this version is defined, and define the conditional variance map $v:I\to[0,\infty)$ by $v(t)=\mathbb E[(Y-m(t))^2\mid X=t]$.
Then $v(x)=\sigma^2(x)$. Using the law of total expectation with respect to the regular conditional law of $Y$ given $X=t$, and then making the change of variables $t=x-hu$, with $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$, gives
\begin{align*}
\mathbb E[U_{1,h}^2]=\frac{1}{h}\int_{\mathbb R}K(u)^2\mathbb E[(Y-r_h)^2\mid X=x-hu]f_X(x-hu)\,d\mathcal L^1(u).
\end{align*}
For $u\in\operatorname{supp}K$ and all sufficiently small $h$, the point $x-hu$ lies in $I$. For every $t\in I$, the identity $\mathbb E[Y\mid X=t]=m(t)$ gives
\begin{align*}
\mathbb E[(Y-r_h)^2\mid X=t]=\sigma^2(t)+(m(t)-r_h)^2.
\end{align*}
Since $r_h\to m(x)$, $m$ is continuous at $x$, $\sigma^2$ is continuous at $x$, and $f_X$ is continuous at $x$, the integrand converges pointwise to
\begin{align*}
K(u)^2\sigma^2(x)f_X(x).
\end{align*}
The local bound on the conditional second moments of $Y$, boundedness of $m$ near $x$, boundedness of $f_X$ near $x$, and compact support of $K$ give an integrable dominating function $C_3K(u)^2\mathbb 1_{[-R_K,R_K]}(u)$ with respect to $\mathcal L^1$, where $C_3>0$ is chosen large enough to dominate these local bounds and depends only on the same local quantities. By the [Dominated Convergence Theorem](/theorems/4),
\begin{align*}
h\mathbb E[U_{1,h}^2]\to \sigma^2(x)f_X(x)R(K).
\end{align*}
Since $a_h\to f_X(x)>0$, we obtain
\begin{align*}
\operatorname{Var}(\hat m_{NW}(x))=\frac{1}{nh}\frac{\sigma^2(x)}{f_X(x)}R(K)+o((nh)^{-1}).
\end{align*}
This is the stated variance expansion.
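The variance constant can be compared with a small Monte Carlo experiment. The sketch below rests on assumed choices that are not part of the theorem: a uniform design on $[0,1]$ (so $f_X(x)=1$), regression function $m\equiv0$, Gaussian noise with constant variance $\sigma^2$, and the Epanechnikov kernel, for which $R(K)=3/5$. All function names are hypothetical, and the simulated value depends on the seed.

```python
import random
import statistics

R_K = 0.6  # R(K) = integral of K^2 for the assumed Epanechnikov kernel


def epanechnikov(u: float) -> float:
    """Assumed kernel supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nadaraya_watson(x, xs, ys, h):
    """Nadaraya-Watson estimate at x, set to 0.0 when the denominator vanishes."""
    weights = [epanechnikov((x - xi) / h) / h for xi in xs]
    denom = sum(weights)
    if denom == 0.0:
        return 0.0
    return sum(w * yi for w, yi in zip(weights, ys)) / denom


def simulate_variance(n, h, reps, seed=0, x=0.5, sigma=1.0):
    """Sample variance of the estimator over independent replications."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]            # uniform design on [0, 1]
        ys = [rng.gauss(0.0, sigma) for _ in range(n)]   # m = 0, constant noise variance
        estimates.append(nadaraya_watson(x, xs, ys, h))
    return statistics.variance(estimates)


def leading_variance(n, h, sigma=1.0, f_x=1.0):
    """Leading term sigma^2(x) R(K) / (n h f_X(x)) from the theorem."""
    return sigma ** 2 * R_K / (n * h * f_x)
```

For instance, one may compare `simulate_variance(400, 0.25, 1000)` with `leading_variance(400, 0.25)`; the two should agree up to Monte Carlo error and the $o((nh)^{-1})$ remainder.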
[guided]
The variance calculation has two separate tasks: first show that replacing the ratio by its first-order linearisation costs only $o((nh)^{-1})$, and then compute the variance of that linearised term. Define
\begin{align*}
U_{i,h}:=(Y_i-r_h)Z_{i,h}
\end{align*}
for $i\in\{1,\dots,n\}$. On the good denominator event $B_{n,h}\cap A_{n,h}(x)$, the exact algebraic identity gives
\begin{align*}
\hat m_{NW}(x)-r_h
=\frac{1}{na_h}\sum_{i=1}^n U_{i,h}
-\frac{\left(\sum_{i=1}^n U_{i,h}\right)(S_{n,h}-na_h)}{na_hS_{n,h}}.
\end{align*}
Set $V_{n,h}:=\sum_{i=1}^n U_{i,h}$ and $W_{n,h}:=S_{n,h}-na_h$. Since $S_{n,h}\ge na_h/2$ on $B_{n,h}$, the remainder satisfies
\begin{align*}
R_{n,h}^2\mathbb 1_{B_{n,h}}
\le \frac{4V_{n,h}^2W_{n,h}^2}{n^4a_h^4}.
\end{align*}
The local second-moment hypothesis is enough if we condition on the design points, but the summands are not conditionally centered at $r_h$. Define $\varepsilon_i:=Y_i-m(X_i)$ and
\begin{align*}
D_{n,h}:=\sum_{i=1}^n(m(X_i)-r_h)Z_{i,h}.
\end{align*}
Then $V_{n,h}=\sum_i\varepsilon_iZ_{i,h}+D_{n,h}$. Conditional on $X_1,\dots,X_n$, the quantity $W_{n,h}$ is fixed, and conditional independence gives
\begin{align*}
\mathbb E\left[\left(\sum_{i=1}^n\varepsilon_iZ_{i,h}\right)^2\mid X_1,\dots,X_n\right]\le C_0\sum_{i=1}^n Z_{i,h}^2.
\end{align*}
Because $m$ is continuously differentiable near $x$ and $r_h=m(x)+O(h^2)$, we also have $|m(t)-r_h|\le C_mh$ on the local kernel support, hence $D_{n,h}^2\le C_m^2h^2S_{n,h}^2$. Since $K$ is bounded and nonnegative, $Z_{i,h}^2\le M_Kh^{-1}Z_{i,h}$, so $\sum_i Z_{i,h}^2\le M_Kh^{-1}S_{n,h}$. On $B_{n,h}$, $S_{n,h}\le 3na_h/2$, and therefore
\begin{align*}
\mathbb E[V_{n,h}^2\mid X_1,\dots,X_n]\mathbb 1_{B_{n,h}}\le C_1(nh^{-1}+n^2h^2)\mathbb 1_{B_{n,h}}.
\end{align*}
Using this in the displayed remainder bound gives
\begin{align*}
\mathbb E[R_{n,h}^2\mathbb 1_{B_{n,h}}]
\le C_2\left(\frac{1}{n^3h}+\frac{h^2}{n^2}\right)\mathbb E[W_{n,h}^2]
=O(n^{-2}h^{-2})+O(h/n)=o((nh)^{-1}),
\end{align*}
because $\mathbb E[W_{n,h}^2]=O(nh^{-1})$, $nh\to\infty$, and $h\to0$.
We must also control the event $B_{n,h}^c$, where the denominator might be small. On this event, a second-moment bound for the numerator alone would not control the ratio. Instead, use nonnegativity of the weights: on $A_{n,h}(x)$, the estimator is a weighted average of those $Y_i$ whose $X_i$ lie within the compact local support of the kernel. Jensen's inequality for these finite probability weights gives
\begin{align*}
\hat m_{NW}(x)^2\le \frac{\sum_{i=1}^n K_h(x-X_i)Y_i^2}{\sum_{i=1}^n K_h(x-X_i)}.
\end{align*}
After conditioning on $X_1,\dots,X_n$, the local second-moment bound makes the [conditional expectation](/page/Conditional%20Expectation) of the right-hand side uniformly bounded. On $A_{n,h}(x)^c$ the estimator is defined to be $0$. Since $r_h\to m(x)$, the squared error on $B_{n,h}^c$ is bounded in expectation by a fixed constant times $\mathbb P(B_{n,h}^c)$, which is exponentially small by Bernstein's inequality. Thus the bad-denominator contribution from $\hat m_{NW}(x)-r_h$ is $o((nh)^{-1})$. We must also control the leading linearised term on $B_{n,h}^c$, because the ratio identity was only used on $B_{n,h}$. With $L_{n,h}:=(na_h)^{-1}\sum_{i=1}^n U_{i,h}=V_{n,h}/(na_h)$, the conditional estimate $\mathbb E[V_{n,h}^2\mid X_1,\dots,X_n]\le C(h^{-1}S_{n,h}+h^2S_{n,h}^2)$ gives
\begin{align*}
\mathbb E[L_{n,h}^2\mathbb 1_{B_{n,h}^c}]
\le \frac{C}{n^2}\left(h^{-1}\mathbb E[S_{n,h}\mathbb 1_{B_{n,h}^c}]+h^2\mathbb E[S_{n,h}^2\mathbb 1_{B_{n,h}^c}]\right).
\end{align*}
Applying the [Cauchy-Schwarz Inequality](/theorems/1201) to $S_{n,h}$ and $\mathbb 1_{B_{n,h}^c}$, and again to $S_{n,h}^2$ and $\mathbb 1_{B_{n,h}^c}$, reduces the two terms to polynomial moments of $S_{n,h}$ multiplied by $\mathbb P(B_{n,h}^c)^{1/2}$. These polynomial moments are finite with polynomial growth because $K$ is bounded and compactly supported, while the Bernstein bound for $B_{n,h}^c$ is exponentially small in $nh$. Therefore
\begin{align*}
\mathbb E[L_{n,h}^2\mathbb 1_{B_{n,h}^c}]=o((nh)^{-1}).
\end{align*}
Combining this with the good-event remainder estimate, the bad-event bound for $\hat m_{NW}(x)-r_h$, and $(a+b)^2\le2a^2+2b^2$ shows that $\hat m_{NW}(x)-r_h-L_{n,h}$ has second moment $o((nh)^{-1})$. Since $\mathbb E[L_{n,h}^2]=O((nh)^{-1})$, the [Cauchy-Schwarz Inequality](/theorems/1201) in $L^2(\Omega)$ bounds the covariance between $L_{n,h}$ and the remainder by $o((nh)^{-1})$. Therefore the variance of $\hat m_{NW}(x)$ is the variance of $L_{n,h}$ up to $o((nh)^{-1})$.
It remains to compute the leading variance. Independence gives
\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n U_{i,h}\right)=n\operatorname{Var}(U_{1,h}),
\end{align*}
and $\mathbb E[U_{1,h}]=0$ because $r_h=b_h/a_h$. Hence $\operatorname{Var}(U_{1,h})=\mathbb E[U_{1,h}^2]$. Using the regular conditional version fixed above, define the conditional variance map $v:I\to[0,\infty)$ near $x$ by $v(t)=\mathbb E[(Y-m(t))^2\mid X=t]$, so that $v(x)=\sigma^2(x)$. Using the law of total expectation with respect to the regular conditional law of $Y$ given $X=t$, and then setting $t=x-hu$ with $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$,
\begin{align*}
\mathbb E[U_{1,h}^2]=\frac{1}{h}\int_{\mathbb R}K(u)^2\mathbb E[(Y-r_h)^2\mid X=x-hu]f_X(x-hu)\,d\mathcal L^1(u).
\end{align*}
For $t\in I$, the regression identity $\mathbb E[Y\mid X=t]=m(t)$ gives
\begin{align*}
\mathbb E[(Y-r_h)^2\mid X=t]=\sigma^2(t)+(m(t)-r_h)^2.
\end{align*}
The integrand converges pointwise to $K(u)^2\sigma^2(x)f_X(x)$ because $r_h\to m(x)$ and $m$, the conditional variance, and the density are continuous at $x$. The same local boundedness assumptions provide the integrable domination $C_3K(u)^2\mathbb 1_{[-R_K,R_K]}(u)$, where $C_3>0$ is chosen to dominate the local bounds. The [Dominated Convergence Theorem](/theorems/4) gives
\begin{align*}
h\mathbb E[U_{1,h}^2]\to \sigma^2(x)f_X(x)R(K).
\end{align*}
Combining this with $a_h\to f_X(x)>0$ gives
\begin{align*}
\operatorname{Var}(\hat m_{NW}(x))=\frac{1}{nh}\frac{\sigma^2(x)}{f_X(x)}R(K)+o((nh)^{-1}).
\end{align*}
[/guided]
[/step]
Requires
Theorems
- Jensen's Inequality (Theorem #1977)
- Law of Total Expectation (Theorem #1121)
- Jensen's Inequality for finite measure spaces (Theorem #8)
- Jensen's Inequality (Theorem #9)
Definitions & Concepts
- Event
- Expectation
- Variance