[proofplan]
The boundary effect comes from the fact that, at $a$, the kernel sees only the one-sided interval $[a,a+h]$. After the change of variables $t=a+hu$, the local constant target is a ratio of one-sided weighted moments, and Taylor expansion shows a nonzero first-order term proportional to $\mu_1^+/\mu_0^+$. For local linear regression, the normal equations reproduce every affine function of $t-a$, so the constant and first-order Taylor terms of $m$ are fitted exactly at the intercept, leaving only the quadratic remainder. Finally, the empirical estimators are smooth functions of finitely many local averages; their variances are of order $1/(nh)$ after the natural powers of $h$ are factored out, and stability of the limiting moment matrices transfers this fluctuation scale to the fitted intercepts.
[/proofplan]
[step:Rewrite the population local constant target as a one-sided weighted ratio]
Let $(X,Y)$ denote a generic observation with the same law as $(X_i,Y_i)$, and define the regression error $\varepsilon:=Y-m(X)$. The model assumption $\mathbb E[\varepsilon\mid X]=0$ gives $\mathbb E[Y\mid X=t]=m(t)$ for $f_X$-almost every $t$. Define the population local constant criterion $Q_{0,h}:\mathbb R\to\mathbb R$ by
\begin{align*}
Q_{0,h}(c):=\mathbb E\left[K\!\left(\frac{X-a}{h}\right)(Y-c)^2\right].
\end{align*}
Expanding the square and conditioning on $X$ show that the minimizer over $c\in\mathbb R$ is the weighted conditional mean whenever the denominator is nonzero. For $h<\min\{\eta,b-a\}$, the support condition on $K$ gives
\begin{align*}
K\!\left(\frac{t-a}{h}\right)=0
\end{align*}
unless $t\in[a,a+h]$. Hence the population local constant objective is minimized at
\begin{align*}
m_{0,h,\mathrm{pop}}(a)
=
\frac{
\int_a^{a+h} K\!\left(\frac{t-a}{h}\right)m(t)f_X(t)\,d\mathcal L^1(t)
}{
\int_a^{a+h} K\!\left(\frac{t-a}{h}\right)f_X(t)\,d\mathcal L^1(t)
}.
\end{align*}
Using the substitution $t=a+hu$, the measure transforms as $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$ and the interval $[a,a+h]$ becomes $[0,1]$. Therefore
\begin{align*}
m_{0,h,\mathrm{pop}}(a)
=
\frac{
\int_0^1 K(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u)
}{
\int_0^1 K(u)f_X(a+hu)\,d\mathcal L^1(u)
}.
\end{align*}
The denominator is positive for all sufficiently small $h$, because $f_X(a)>0$ and $\mu_0^+>0$.
[guided]
The first task is to remove the statistical notation and identify the deterministic quantity being minimized. Let $(X,Y)$ denote a generic observation and define $\varepsilon:=Y-m(X)$. The model assumption $\mathbb E[\varepsilon\mid X]=0$ implies $\mathbb E[Y\mid X=t]=m(t)$ for $f_X$-almost every $t$. Define $Q_{0,h}:\mathbb R\to\mathbb R$ by
\begin{align*}
Q_{0,h}(c):=\mathbb E\left[K\!\left(\frac{X-a}{h}\right)(Y-c)^2\right].
\end{align*}
Expanding the square and conditioning on $X$ show that the minimizer over constants is the weighted conditional mean:
\begin{align*}
m_{0,h,\mathrm{pop}}(a)
=
\frac{
\int_{[a,b]} K\!\left(\frac{t-a}{h}\right)m(t)f_X(t)\,d\mathcal L^1(t)
}{
\int_{[a,b]} K\!\left(\frac{t-a}{h}\right)f_X(t)\,d\mathcal L^1(t)
}.
\end{align*}
Because $K$ is supported on $[-1,1]$ and $t\geq a$ on the design support, the only values that contribute satisfy
\begin{align*}
0\leq \frac{t-a}{h}\leq 1,
\end{align*}
that is, $t\in[a,a+h]$. Thus, for $h<\min\{\eta,b-a\}$,
\begin{align*}
m_{0,h,\mathrm{pop}}(a)
=
\frac{
\int_a^{a+h} K\!\left(\frac{t-a}{h}\right)m(t)f_X(t)\,d\mathcal L^1(t)
}{
\int_a^{a+h} K\!\left(\frac{t-a}{h}\right)f_X(t)\,d\mathcal L^1(t)
}.
\end{align*}
Now set $t=a+hu$. Under this substitution, $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$ and the interval $[a,a+h]$ is transformed into $[0,1]$. The common factor $h$ cancels between numerator and denominator, giving
\begin{align*}
m_{0,h,\mathrm{pop}}(a)
=
\frac{
\int_0^1 K(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u)
}{
\int_0^1 K(u)f_X(a+hu)\,d\mathcal L^1(u)
}.
\end{align*}
The denominator is nonzero for small $h$: since $f_X$ is continuous and $f_X(a)>0$, $f_X(a+hu)$ remains positive uniformly for $u\in[0,1]$ when $h$ is small, and $\mu_0^+=\int_0^1K(u)\,d\mathcal L^1(u)>0$ because $K$ is nonnegative and positive on a set of positive measure in $[0,1]$.
[/guided]
[/step]
[step:Expand the local constant ratio to first order]
Define $m_0:=m(a)$ and $m_1:=m'(a)$. Since $m\in C^2([a,a+\eta])$, define the Taylor remainder map $R_{m,h}:[0,1]\to\mathbb R$ by
\begin{align*}
R_{m,h}(u):=\frac{m(a+hu)-m_0-hm_1u}{h^2}
\end{align*}
for $h>0$. [Taylor's theorem](/theorems/827) on the compact interval $[a,a+\eta]$ gives a constant $C_m>0$ such that $|R_{m,h}(u)|\leq C_m$ for every $u\in[0,1]$ and all sufficiently small $h$.
Define the denominator moment $D_h\in\mathbb R$ by
\begin{align*}
D_h:=\int_0^1K(u)f_X(a+hu)\,d\mathcal L^1(u),
\end{align*}
define the first weighted design moment $I_h\in\mathbb R$ by
\begin{align*}
I_h:=\int_0^1uK(u)f_X(a+hu)\,d\mathcal L^1(u),
\end{align*}
and define the bounded remainder moment $J_h\in\mathbb R$ by
\begin{align*}
J_h:=\int_0^1K(u)R_{m,h}(u)f_X(a+hu)\,d\mathcal L^1(u).
\end{align*}
The function $J_h$ is bounded uniformly in $h$, because $K$ is bounded on $[0,1]$, $R_{m,h}$ is uniformly bounded, and $f_X$ is continuous on $[a,a+\eta]$. Substituting the Taylor formula for $m$ into the numerator gives
\begin{align*}
\int_0^1K(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u)
=
m_0D_h+hm_1I_h+h^2J_h.
\end{align*}
Therefore
\begin{align*}
m_{0,h,\mathrm{pop}}(a)-m(a)
=
hm_1\frac{I_h}{D_h}+h^2\frac{J_h}{D_h}.
\end{align*}
The denominator satisfies $D_h\to f_X(a)\mu_0^+>0$, so $J_h/D_h=O(1)$.
It remains to identify $I_h/D_h$ to zeroth order with an $O(h)$ error. Define $f_0:=f_X(a)$ and $f_1:=f_X'(a)$. Since $f_X\in C^1([a,a+\eta])$, uniformly for $u\in[0,1]$,
\begin{align*}
f_X(a+hu)=f_0+hf_1u+h\rho_h(u),
\end{align*}
where the remainder map $\rho_h:[0,1]\to\mathbb R$ satisfies $\sup_{u\in[0,1]}|\rho_h(u)|\to0$. Hence
\begin{align*}
D_h=f_0\mu_0^+ + hf_1\mu_1^+ + h\theta_{0,h}
\end{align*}
and
\begin{align*}
I_h=f_0\mu_1^+ + hf_1\mu_2^+ + h\theta_{1,h},
\end{align*}
where $\theta_{0,h}\to0$ and $\theta_{1,h}\to0$. Since $f_0\mu_0^+>0$, the quotient expansion with denominator bounded away from zero gives
\begin{align*}
\frac{I_h}{D_h}=\frac{\mu_1^+}{\mu_0^+}+O(h).
\end{align*}
Combining this with the previous identity yields
\begin{align*}
m_{0,h,\mathrm{pop}}(a)-m(a)
=
h\,m'(a)\frac{\mu_1^+}{\mu_0^+}
+
O(h^2).
\end{align*}
[/step]
[step:Express the local linear normal equations in rescaled coordinates]
Define the population local linear criterion $Q_{1,h}:\mathbb R^2\to\mathbb R$ by
\begin{align*}
Q_{1,h}(\beta_0,\beta_1):=\mathbb E\left[K\!\left(\frac{X-a}{h}\right)(Y-\beta_0-\beta_1(X-a))^2\right].
\end{align*}
Define the rescaled slope parameter $\gamma_1:=h\beta_1$. The criterion $Q_{1,h}$ has the same minimizer in its intercept coordinate as
\begin{align*}
(\gamma_0,\gamma_1)
\mapsto
\int_0^1
K(u)
\left(m(a+hu)-\gamma_0-\gamma_1u\right)^2
f_X(a+hu)\,d\mathcal L^1(u),
\end{align*}
where $\gamma_0=\beta_0$. Define $M^+\in\mathbb R^{2\times2}$ by
\begin{align*}
(M^+)_{jk}:=\mu_{j+k}^+,\qquad 0\leq j,k\leq1.
\end{align*}
Define the $2\times2$ matrix $G_h$ by
\begin{align*}
(G_h)_{jk}
:=
\int_0^1 u^{j+k}K(u)f_X(a+hu)\,d\mathcal L^1(u),
\qquad 0\leq j,k\leq1.
\end{align*}
Define the vector $r_h\in\mathbb R^2$ by
\begin{align*}
(r_h)_j
:=
\int_0^1 u^jK(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u),
\qquad 0\leq j\leq1.
\end{align*}
The normal equations are
\begin{align*}
G_h(\gamma_{0,h},\gamma_{1,h})^\top=r_h.
\end{align*}
Since $f_X(a)>0$ and $G_h\to f_X(a)M^+$ entrywise, the nonsingularity of $M^+$ implies that $G_h$ is nonsingular for all sufficiently small $h$.
[/step]
[step:Use affine reproduction to remove the first-order boundary term]
Define the affine Taylor polynomial $\ell_h:[0,1]\to\mathbb R$ by
\begin{align*}
\ell_h(u):=m(a)+h m'(a)u.
\end{align*}
Because $m\in C^2([a,a+\eta])$, there is a constant $C_m>0$ such that
\begin{align*}
|m(a+hu)-\ell_h(u)|\leq C_mh^2
\end{align*}
for every $u\in[0,1]$ and all sufficiently small $h$.
Let $c_h\in\mathbb R^2$ be defined by
\begin{align*}
c_h=(m(a),h m'(a))^\top,
\end{align*}
and let $e_h\in\mathbb R^2$ be defined by
\begin{align*}
e_h=r_h-G_hc_h.
\end{align*}
The vector $e_h$ has components
\begin{align*}
(e_h)_j
=
\int_0^1 u^jK(u)\bigl(m(a+hu)-\ell_h(u)\bigr)f_X(a+hu)\,d\mathcal L^1(u),
\qquad 0\leq j\leq1.
\end{align*}
Since $K$ is bounded, $f_X$ is bounded on $[a,a+\eta]$, and $|m(a+hu)-\ell_h(u)|\leq C_mh^2$, there is a constant $C_e>0$ such that $|(e_h)_j|\leq C_eh^2$ for $j=0,1$. Therefore
\begin{align*}
(\gamma_{0,h},\gamma_{1,h})^\top-c_h
=
G_h^{-1}e_h
=
O(h^2),
\end{align*}
because $\|G_h^{-1}\|_{\mathrm{op}}$ is bounded for small $h$. Taking the first coordinate gives
\begin{align*}
m_{1,h,\mathrm{pop}}(a)-m(a)
=
\gamma_{0,h}-m(a)
=
O(h^2).
\end{align*}
[guided]
The reason local linear regression removes the boundary first-order bias is not symmetry; near $a$ there is no symmetry. The reason is exact reproduction of affine functions.
Work in the rescaled coordinate $u=(t-a)/h$. In that coordinate, the first two Taylor terms of $m(a+hu)$ form the affine map $\ell_h:[0,1]\to\mathbb R$ defined by
\begin{align*}
\ell_h(u)=m(a)+hm'(a)u.
\end{align*}
If the regression target were exactly $\ell_h(u)$, then the weighted least-squares minimizer would be exactly $(m(a),hm'(a))^\top$, because the model class $\gamma_0+\gamma_1u$ contains $\ell_h$ itself. Thus the intercept would be exactly $m(a)$.
Now quantify the error from replacing $m(a+hu)$ by $\ell_h(u)$. Since $m\in C^2([a,a+\eta])$, its second derivative is bounded on this compact interval. Choose
\begin{align*}
C_m:=\frac12\sup_{s\in[a,a+\eta]}|m''(s)|.
\end{align*}
Taylor's formula with remainder gives, for every $u\in[0,1]$ and all $h<\eta$,
\begin{align*}
|m(a+hu)-m(a)-hm'(a)u|
\leq
C_mh^2u^2
\leq
C_mh^2.
\end{align*}
Define
\begin{align*}
G_h\in\mathbb R^{2\times2},\qquad
(G_h)_{jk}
:=
\int_0^1u^{j+k}K(u)f_X(a+hu)\,d\mathcal L^1(u),
\quad 0\leq j,k\leq1,
\end{align*}
and define
\begin{align*}
r_h\in\mathbb R^2,\qquad
(r_h)_j
:=
\int_0^1u^jK(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u),
\quad 0\leq j\leq1.
\end{align*}
The local linear normal equations are
\begin{align*}
G_h(\gamma_{0,h},\gamma_{1,h})^\top=r_h.
\end{align*}
Also define $c_h\in\mathbb R^2$ by
\begin{align*}
c_h:=(m(a),hm'(a))^\top.
\end{align*}
Then $G_hc_h$ is exactly the right-hand side that would occur if $m(a+hu)$ were replaced by $\ell_h(u)$. Therefore the only error is
\begin{align*}
e_h:=r_h-G_hc_h,
\end{align*}
whose components are
\begin{align*}
(e_h)_j
=
\int_0^1u^jK(u)
\bigl(m(a+hu)-m(a)-hm'(a)u\bigr)
f_X(a+hu)\,d\mathcal L^1(u).
\end{align*}
The integrand is bounded in absolute value by a constant times $h^2K(u)$, because $u^j\leq1$, $K$ is bounded and integrable on $[0,1]$, $f_X$ is bounded near $a$, and the Taylor remainder is $O(h^2)$ uniformly in $u$. Hence $e_h=O(h^2)$ in $\mathbb R^2$.
Finally, $G_h\to f_X(a)M^+$ entrywise. Since $f_X(a)>0$ and $M^+$ is nonsingular, $f_X(a)M^+$ is nonsingular. By continuity of the determinant, $\det G_h$ stays bounded away from $0$ for all sufficiently small $h$; using the adjugate formula for a $2\times2$ inverse, $G_h^{-1}$ exists and is uniformly bounded for all sufficiently small $h$. Thus
\begin{align*}
(\gamma_{0,h},\gamma_{1,h})^\top-c_h
=
G_h^{-1}e_h
=
O(h^2).
\end{align*}
Taking the first coordinate yields
\begin{align*}
m_{1,h,\mathrm{pop}}(a)-m(a)
=
\gamma_{0,h}-m(a)
=
O(h^2).
\end{align*}
This is the precise sense in which local linear fitting removes the boundary term of order $h$.
[/guided]
[/step]
[step:Control the empirical local averages on the boundary scale]
For integers $j\geq0$, define the normalized empirical design moment by
\begin{align*}
\hat s_j
:=
\frac1{nh}\sum_{i=1}^n
\left(\frac{X_i-a}{h}\right)^j
K\!\left(\frac{X_i-a}{h}\right).
\end{align*}
Define the normalized empirical response moment by
\begin{align*}
\hat q_j
:=
\frac1{nh}\sum_{i=1}^n
\left(\frac{X_i-a}{h}\right)^j
K\!\left(\frac{X_i-a}{h}\right)Y_i.
\end{align*}
Let $s_j:=\mathbb E[\hat s_j]$ and $q_j:=\mathbb E[\hat q_j]$. For each fixed $j$ needed below, namely $0\leq j\leq2$ for design moments,
\begin{align*}
\hat s_j-s_j=O_{\mathbb P}((nh)^{-1/2}).
\end{align*}
For each fixed $j$ with $0\leq j\leq1$ for response moments,
\begin{align*}
\hat q_j-q_j=O_{\mathbb P}((nh)^{-1/2}).
\end{align*}
Indeed, for $\hat s_j$, boundedness of $K$ and support in $[0,1]$ after restriction to the design support give
\begin{align*}
\operatorname{Var}(\hat s_j)
=
\frac1{nh^2}
\operatorname{Var}\!\left[
\left(\frac{X-a}{h}\right)^jK\!\left(\frac{X-a}{h}\right)
\right].
\end{align*}
Using $0\leq (X-a)/h\leq1$ on the effective support and bounding variance by the second moment gives
\begin{align*}
\operatorname{Var}(\hat s_j)
\leq
\frac1{nh^2}
\mathbb E\!\left[
K\!\left(\frac{X-a}{h}\right)^2
\right].
\end{align*}
Changing variables $t=a+hu$ gives
\begin{align*}
\mathbb E\!\left[
K\!\left(\frac{X-a}{h}\right)^2
\right]
=
h\int_0^1K(u)^2f_X(a+hu)\,d\mathcal L^1(u).
\end{align*}
Since $K$ is bounded on $[0,1]$ and $f_X$ is bounded on $[a,a+\eta]$, this expectation is $O(h)$,
so $\operatorname{Var}(\hat s_j)=O((nh)^{-1})$.
For $\hat q_j$, write $Y=m(X)+\varepsilon$. On $[a,a+h]$, the function $m$ is bounded for small $h$. Since $\mathbb E[\varepsilon\mid X=t]=0$, the boundedness of $\sigma^2(t)=\operatorname{Var}(\varepsilon\mid X=t)$ on $[a,a+\eta]$ gives a uniform bound on $\mathbb E[\varepsilon^2\mid X=t]$ there. Hence $\mathbb E[Y^2\mid X=t]$ is bounded uniformly for $t\in[a,a+\eta]$. Therefore
\begin{align*}
\operatorname{Var}(\hat q_j)
\leq
\frac1{nh^2}
\mathbb E\!\left[
K\!\left(\frac{X-a}{h}\right)^2Y^2
\right].
\end{align*}
By the same change of variables and the uniform bound on $\mathbb E[Y^2\mid X=t]$ for $t\in[a,a+\eta]$, the right-hand side is $O((nh)^{-1})$. For each fixed $h$, the summands defining $\hat s_j$ and $\hat q_j$ are i.i.d. finite-variance random variables; as $h\to0$ they form a triangular array indexed by the bandwidth. Applying [Chebyshev's inequality](/theorems/1126) to each centered average gives the asserted stochastic orders.
[/step]
[step:Transfer local average fluctuations to the fitted intercepts]
For the local constant estimator,
\begin{align*}
\hat m_{0,h}(a)=\frac{\hat q_0}{\hat s_0}.
\end{align*}
The corresponding population target is
\begin{align*}
m_{0,h,\mathrm{pop}}(a)=\frac{q_0}{s_0}.
\end{align*}
Since $s_0\to f_X(a)\mu_0^+>0$, there are constants $h_0>0$ and $c_s>0$ such that $s_0\geq c_s$ for $0<h<h_0$. Since $\hat s_0-s_0=O_{\mathbb P}((nh)^{-1/2})$ and $nh\to\infty$, the denominator $\hat s_0$ is bounded below by $c_s/2$ with probability tending to $1$. On the compact neighborhood where $s\geq c_s/2$, the quotient map $(q,s)\mapsto q/s$ is continuously differentiable with bounded derivative. The finite-dimensional [delta method](/theorems/1861), equivalently the [mean value theorem](/theorems/186) on this compact neighborhood, gives
\begin{align*}
\hat m_{0,h}(a)-m_{0,h,\mathrm{pop}}(a)
=
O_{\mathbb P}((nh)^{-1/2}).
\end{align*}
For local linear regression, define $\hat G_h\in\mathbb R^{2\times2}$ by $(\hat G_h)_{jk}=\hat s_{j+k}$ for $0\leq j,k\leq1$, and define $\hat r_h\in\mathbb R^2$ by
\begin{align*}
\hat r_h=(\hat q_0,\hat q_1)^\top.
\end{align*}
Define $G_h\in\mathbb R^{2\times2}$ by $(G_h)_{jk}=s_{j+k}$ for $0\leq j,k\leq1$, and define $r_h\in\mathbb R^2$ by
\begin{align*}
r_h=(q_0,q_1)^\top.
\end{align*}
The rescaled empirical local linear coefficients satisfy
\begin{align*}
\hat G_h(\hat\gamma_{0,h},\hat\gamma_{1,h})^\top=\hat r_h,
\end{align*}
and the population coefficients satisfy
\begin{align*}
G_h(\gamma_{0,h},\gamma_{1,h})^\top=r_h.
\end{align*}
From the preceding step,
\begin{align*}
\|\hat G_h-G_h\|_{\mathrm{op}}
=
O_{\mathbb P}((nh)^{-1/2}).
\end{align*}
Also,
\begin{align*}
|\hat r_h-r_h|
=
O_{\mathbb P}((nh)^{-1/2}).
\end{align*}
Also $G_h\to f_X(a)M^+$. Since the smallest singular value of the nonsingular matrix $f_X(a)M^+$ is positive, there are constants $h_1>0$ and $c_G>0$ such that the smallest singular value of $G_h$ is at least $c_G$ for $0<h<h_1$. The perturbation bound $\|\hat G_h-G_h\|_{\mathrm{op}}=O_{\mathbb P}((nh)^{-1/2})$ and $nh\to\infty$ imply that the smallest singular value of $\hat G_h$ is at least $c_G/2$ with probability tending to $1$. Therefore $\|G_h^{-1}\|_{\mathrm{op}}\leq c_G^{-1}$ for small $h$, and $\|\hat G_h^{-1}\|_{\mathrm{op}}\leq 2c_G^{-1}$ with probability tending to $1$. On this event,
\begin{align*}
(\hat\gamma_{0,h},\hat\gamma_{1,h})^\top
-
(\gamma_{0,h},\gamma_{1,h})^\top
=
\hat G_h^{-1}(\hat r_h-r_h)
+
\hat G_h^{-1}(G_h-\hat G_h)(\gamma_{0,h},\gamma_{1,h})^\top.
\end{align*}
The population coefficient vector is bounded, because it converges to $(m(a),0)^\top$ in the rescaled coordinates. Hence
\begin{align*}
(\hat\gamma_{0,h},\hat\gamma_{1,h})^\top
-
(\gamma_{0,h},\gamma_{1,h})^\top
=
O_{\mathbb P}((nh)^{-1/2}).
\end{align*}
Taking first coordinates, and using $\hat m_{1,h}(a)=\hat\gamma_{0,h}$ and $m_{1,h,\mathrm{pop}}(a)=\gamma_{0,h}$, gives
\begin{align*}
\hat m_{1,h}(a)-m_{1,h,\mathrm{pop}}(a)
=
O_{\mathbb P}((nh)^{-1/2}).
\end{align*}
Together with the local constant conclusion, this proves the empirical part and completes the theorem.
[/step]