Boundary Bias Reduction by Local Linear Regression

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The boundary effect comes from the fact that, at $a$, the kernel sees only the one-sided interval $[a,a+h]$. After the change of variables $t=a+hu$, the local constant target is a ratio of one-sided weighted moments, and Taylor expansion shows a nonzero first-order term proportional to $\mu_1^+/\mu_0^+$. For local linear regression, the normal equations reproduce every affine function of $t-a$, so the constant and first-order Taylor terms of $m$ are fitted exactly at the intercept, leaving only the quadratic remainder. Finally, the empirical estimators are smooth functions of finitely many local averages; their variances are of order $1/(nh)$ after the natural powers of $h$ are factored out, and stability of the limiting moment matrices transfers this fluctuation scale to the fitted intercepts. [/proofplan] [step:Rewrite the population local constant target as a one-sided weighted ratio] Let $(X,Y)$ denote a generic observation with the same law as $(X_i,Y_i)$, and define the regression error $\varepsilon:=Y-m(X)$. The model assumption $\mathbb E[\varepsilon\mid X]=0$ gives $\mathbb E[Y\mid X=t]=m(t)$ for $f_X$-almost every $t$. Define the population local constant criterion $Q_{0,h}:\mathbb R\to\mathbb R$ by \begin{align*} Q_{0,h}(c):=\mathbb E\left[K\!\left(\frac{X-a}{h}\right)(Y-c)^2\right]. \end{align*} Expanding the square and conditioning on $X$ show that the minimizer over $c\in\mathbb R$ is the weighted conditional mean whenever the denominator is nonzero. For $h<\min\{\eta,b-a\}$, the support condition on $K$ gives \begin{align*} K\!\left(\frac{t-a}{h}\right)=0 \end{align*} unless $t\in[a,a+h]$. Hence the population local constant objective is minimized at \begin{align*} m_{0,h,\mathrm{pop}}(a) = \frac{ \int_a^{a+h} K\!\left(\frac{t-a}{h}\right)m(t)f_X(t)\,d\mathcal L^1(t) }{ \int_a^{a+h} K\!\left(\frac{t-a}{h}\right)f_X(t)\,d\mathcal L^1(t) }. \end{align*} Using the substitution $t=a+hu$, the measure transforms as $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$ and the interval $[a,a+h]$ becomes $[0,1]$. Therefore \begin{align*} m_{0,h,\mathrm{pop}}(a) = \frac{ \int_0^1 K(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u) }{ \int_0^1 K(u)f_X(a+hu)\,d\mathcal L^1(u) }. \end{align*} The denominator is positive for all sufficiently small $h$, because $f_X(a)>0$ and $\mu_0^+>0$. [guided] The first task is to remove the statistical notation and identify the deterministic quantity being minimized. Let $(X,Y)$ denote a generic observation and define $\varepsilon:=Y-m(X)$. The model assumption $\mathbb E[\varepsilon\mid X]=0$ implies $\mathbb E[Y\mid X=t]=m(t)$ for $f_X$-almost every $t$. Define $Q_{0,h}:\mathbb R\to\mathbb R$ by \begin{align*} Q_{0,h}(c):=\mathbb E\left[K\!\left(\frac{X-a}{h}\right)(Y-c)^2\right]. \end{align*} Expanding the square and conditioning on $X$ show that the minimizer over constants is the weighted conditional mean: \begin{align*} m_{0,h,\mathrm{pop}}(a) = \frac{ \int_{[a,b]} K\!\left(\frac{t-a}{h}\right)m(t)f_X(t)\,d\mathcal L^1(t) }{ \int_{[a,b]} K\!\left(\frac{t-a}{h}\right)f_X(t)\,d\mathcal L^1(t) }. \end{align*} Because $K$ is supported on $[-1,1]$ and $t\geq a$ on the design support, the only values that contribute satisfy \begin{align*} 0\leq \frac{t-a}{h}\leq 1, \end{align*} that is, $t\in[a,a+h]$. Thus, for $h<\min\{\eta,b-a\}$, \begin{align*} m_{0,h,\mathrm{pop}}(a) = \frac{ \int_a^{a+h} K\!\left(\frac{t-a}{h}\right)m(t)f_X(t)\,d\mathcal L^1(t) }{ \int_a^{a+h} K\!\left(\frac{t-a}{h}\right)f_X(t)\,d\mathcal L^1(t) }. \end{align*} Now set $t=a+hu$. Under this substitution, $d\mathcal L^1(t)=h\,d\mathcal L^1(u)$ and the interval $[a,a+h]$ is transformed into $[0,1]$. The common factor $h$ cancels between numerator and denominator, giving \begin{align*} m_{0,h,\mathrm{pop}}(a) = \frac{ \int_0^1 K(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u) }{ \int_0^1 K(u)f_X(a+hu)\,d\mathcal L^1(u) }. \end{align*} The denominator is nonzero for small $h$: since $f_X$ is continuous and $f_X(a)>0$, $f_X(a+hu)$ remains positive uniformly for $u\in[0,1]$ when $h$ is small, and $\mu_0^+=\int_0^1K(u)\,d\mathcal L^1(u)>0$ because $K$ is nonnegative and positive on a set of positive measure in $[0,1]$. [/guided] [/step] [step:Expand the local constant ratio to first order] Define $m_0:=m(a)$ and $m_1:=m'(a)$. Since $m\in C^2([a,a+\eta])$, define the Taylor remainder map $R_{m,h}:[0,1]\to\mathbb R$ by \begin{align*} R_{m,h}(u):=\frac{m(a+hu)-m_0-hm_1u}{h^2} \end{align*} for $h>0$. [Taylor's theorem](/theorems/827) on the compact interval $[a,a+\eta]$ gives a constant $C_m>0$ such that $|R_{m,h}(u)|\leq C_m$ for every $u\in[0,1]$ and all sufficiently small $h$. Define the denominator moment $D_h\in\mathbb R$ by \begin{align*} D_h:=\int_0^1K(u)f_X(a+hu)\,d\mathcal L^1(u), \end{align*} define the first weighted design moment $I_h\in\mathbb R$ by \begin{align*} I_h:=\int_0^1uK(u)f_X(a+hu)\,d\mathcal L^1(u), \end{align*} and define the bounded remainder moment $J_h\in\mathbb R$ by \begin{align*} J_h:=\int_0^1K(u)R_{m,h}(u)f_X(a+hu)\,d\mathcal L^1(u). \end{align*} The function $J_h$ is bounded uniformly in $h$, because $K$ is bounded on $[0,1]$, $R_{m,h}$ is uniformly bounded, and $f_X$ is continuous on $[a,a+\eta]$. Substituting the Taylor formula for $m$ into the numerator gives \begin{align*} \int_0^1K(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u) = m_0D_h+hm_1I_h+h^2J_h. \end{align*} Therefore \begin{align*} m_{0,h,\mathrm{pop}}(a)-m(a) = hm_1\frac{I_h}{D_h}+h^2\frac{J_h}{D_h}. \end{align*} The denominator satisfies $D_h\to f_X(a)\mu_0^+>0$, so $J_h/D_h=O(1)$. It remains to identify $I_h/D_h$ to zeroth order with an $O(h)$ error. Define $f_0:=f_X(a)$ and $f_1:=f_X'(a)$. Since $f_X\in C^1([a,a+\eta])$, uniformly for $u\in[0,1]$, \begin{align*} f_X(a+hu)=f_0+hf_1u+h\rho_h(u), \end{align*} where the remainder map $\rho_h:[0,1]\to\mathbb R$ satisfies $\sup_{u\in[0,1]}|\rho_h(u)|\to0$. Hence \begin{align*} D_h=f_0\mu_0^+ + hf_1\mu_1^+ + h\theta_{0,h} \end{align*} and \begin{align*} I_h=f_0\mu_1^+ + hf_1\mu_2^+ + h\theta_{1,h}, \end{align*} where $\theta_{0,h}\to0$ and $\theta_{1,h}\to0$. Since $f_0\mu_0^+>0$, the quotient expansion with denominator bounded away from zero gives \begin{align*} \frac{I_h}{D_h}=\frac{\mu_1^+}{\mu_0^+}+O(h). \end{align*} Combining this with the previous identity yields \begin{align*} m_{0,h,\mathrm{pop}}(a)-m(a) = h\,m'(a)\frac{\mu_1^+}{\mu_0^+} + O(h^2). \end{align*} [/step] [step:Express the local linear normal equations in rescaled coordinates] Define the population local linear criterion $Q_{1,h}:\mathbb R^2\to\mathbb R$ by \begin{align*} Q_{1,h}(\beta_0,\beta_1):=\mathbb E\left[K\!\left(\frac{X-a}{h}\right)(Y-\beta_0-\beta_1(X-a))^2\right]. \end{align*} Define the rescaled slope parameter $\gamma_1:=h\beta_1$. The criterion $Q_{1,h}$ has the same minimizer in its intercept coordinate as \begin{align*} (\gamma_0,\gamma_1) \mapsto \int_0^1 K(u) \left(m(a+hu)-\gamma_0-\gamma_1u\right)^2 f_X(a+hu)\,d\mathcal L^1(u), \end{align*} where $\gamma_0=\beta_0$. Define $M^+\in\mathbb R^{2\times2}$ by \begin{align*} (M^+)_{jk}:=\mu_{j+k}^+,\qquad 0\leq j,k\leq1. \end{align*} Define the $2\times2$ matrix $G_h$ by \begin{align*} (G_h)_{jk} := \int_0^1 u^{j+k}K(u)f_X(a+hu)\,d\mathcal L^1(u), \qquad 0\leq j,k\leq1. \end{align*} Define the vector $r_h\in\mathbb R^2$ by \begin{align*} (r_h)_j := \int_0^1 u^jK(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u), \qquad 0\leq j\leq1. \end{align*} The normal equations are \begin{align*} G_h(\gamma_{0,h},\gamma_{1,h})^\top=r_h. \end{align*} Since $f_X(a)>0$ and $G_h\to f_X(a)M^+$ entrywise, the nonsingularity of $M^+$ implies that $G_h$ is nonsingular for all sufficiently small $h$. [/step] [step:Use affine reproduction to remove the first-order boundary term] Define the affine Taylor polynomial $\ell_h:[0,1]\to\mathbb R$ by \begin{align*} \ell_h(u):=m(a)+h m'(a)u. \end{align*} Because $m\in C^2([a,a+\eta])$, there is a constant $C_m>0$ such that \begin{align*} |m(a+hu)-\ell_h(u)|\leq C_mh^2 \end{align*} for every $u\in[0,1]$ and all sufficiently small $h$. Let $c_h\in\mathbb R^2$ be defined by \begin{align*} c_h=(m(a),h m'(a))^\top, \end{align*} and let $e_h\in\mathbb R^2$ be defined by \begin{align*} e_h=r_h-G_hc_h. \end{align*} The vector $e_h$ has components \begin{align*} (e_h)_j = \int_0^1 u^jK(u)\bigl(m(a+hu)-\ell_h(u)\bigr)f_X(a+hu)\,d\mathcal L^1(u), \qquad 0\leq j\leq1. \end{align*} Since $K$ is bounded, $f_X$ is bounded on $[a,a+\eta]$, and $|m(a+hu)-\ell_h(u)|\leq C_mh^2$, there is a constant $C_e>0$ such that $|(e_h)_j|\leq C_eh^2$ for $j=0,1$. Therefore \begin{align*} (\gamma_{0,h},\gamma_{1,h})^\top-c_h = G_h^{-1}e_h = O(h^2), \end{align*} because $\|G_h^{-1}\|_{\mathrm{op}}$ is bounded for small $h$. Taking the first coordinate gives \begin{align*} m_{1,h,\mathrm{pop}}(a)-m(a) = \gamma_{0,h}-m(a) = O(h^2). \end{align*} [guided] The reason local linear regression removes the boundary first-order bias is not symmetry; near $a$ there is no symmetry. The reason is exact reproduction of affine functions. Work in the rescaled coordinate $u=(t-a)/h$. In that coordinate, the first two Taylor terms of $m(a+hu)$ form the affine map $\ell_h:[0,1]\to\mathbb R$ defined by \begin{align*} \ell_h(u)=m(a)+hm'(a)u. \end{align*} If the regression target were exactly $\ell_h(u)$, then the weighted least-squares minimizer would be exactly $(m(a),hm'(a))^\top$, because the model class $\gamma_0+\gamma_1u$ contains $\ell_h$ itself. Thus the intercept would be exactly $m(a)$. Now quantify the error from replacing $m(a+hu)$ by $\ell_h(u)$. Since $m\in C^2([a,a+\eta])$, its second derivative is bounded on this compact interval. Choose \begin{align*} C_m:=\frac12\sup_{s\in[a,a+\eta]}|m''(s)|. \end{align*} Taylor's formula with remainder gives, for every $u\in[0,1]$ and all $h<\eta$, \begin{align*} |m(a+hu)-m(a)-hm'(a)u| \leq C_mh^2u^2 \leq C_mh^2. \end{align*} Define \begin{align*} G_h\in\mathbb R^{2\times2},\qquad (G_h)_{jk} := \int_0^1u^{j+k}K(u)f_X(a+hu)\,d\mathcal L^1(u), \quad 0\leq j,k\leq1, \end{align*} and define \begin{align*} r_h\in\mathbb R^2,\qquad (r_h)_j := \int_0^1u^jK(u)m(a+hu)f_X(a+hu)\,d\mathcal L^1(u), \quad 0\leq j\leq1. \end{align*} The local linear normal equations are \begin{align*} G_h(\gamma_{0,h},\gamma_{1,h})^\top=r_h. \end{align*} Also define $c_h\in\mathbb R^2$ by \begin{align*} c_h:=(m(a),hm'(a))^\top. \end{align*} Then $G_hc_h$ is exactly the right-hand side that would occur if $m(a+hu)$ were replaced by $\ell_h(u)$. Therefore the only error is \begin{align*} e_h:=r_h-G_hc_h, \end{align*} whose components are \begin{align*} (e_h)_j = \int_0^1u^jK(u) \bigl(m(a+hu)-m(a)-hm'(a)u\bigr) f_X(a+hu)\,d\mathcal L^1(u). \end{align*} The integrand is bounded in absolute value by a constant times $h^2K(u)$, because $u^j\leq1$, $K$ is bounded and integrable on $[0,1]$, $f_X$ is bounded near $a$, and the Taylor remainder is $O(h^2)$ uniformly in $u$. Hence $e_h=O(h^2)$ in $\mathbb R^2$. Finally, $G_h\to f_X(a)M^+$ entrywise. Since $f_X(a)>0$ and $M^+$ is nonsingular, $f_X(a)M^+$ is nonsingular. By continuity of the determinant, $\det G_h$ stays bounded away from $0$ for all sufficiently small $h$; using the adjugate formula for a $2\times2$ inverse, $G_h^{-1}$ exists and is uniformly bounded for all sufficiently small $h$. Thus \begin{align*} (\gamma_{0,h},\gamma_{1,h})^\top-c_h = G_h^{-1}e_h = O(h^2). \end{align*} Taking the first coordinate yields \begin{align*} m_{1,h,\mathrm{pop}}(a)-m(a) = \gamma_{0,h}-m(a) = O(h^2). \end{align*} This is the precise sense in which local linear fitting removes the boundary term of order $h$. [/guided] [/step] [step:Control the empirical local averages on the boundary scale] For integers $j\geq0$, define the normalized empirical design moment by \begin{align*} \hat s_j := \frac1{nh}\sum_{i=1}^n \left(\frac{X_i-a}{h}\right)^j K\!\left(\frac{X_i-a}{h}\right). \end{align*} Define the normalized empirical response moment by \begin{align*} \hat q_j := \frac1{nh}\sum_{i=1}^n \left(\frac{X_i-a}{h}\right)^j K\!\left(\frac{X_i-a}{h}\right)Y_i. \end{align*} Let $s_j:=\mathbb E[\hat s_j]$ and $q_j:=\mathbb E[\hat q_j]$. For each fixed $j$ needed below, namely $0\leq j\leq2$ for design moments, \begin{align*} \hat s_j-s_j=O_{\mathbb P}((nh)^{-1/2}). \end{align*} For each fixed $j$ with $0\leq j\leq1$ for response moments, \begin{align*} \hat q_j-q_j=O_{\mathbb P}((nh)^{-1/2}). \end{align*} Indeed, for $\hat s_j$, boundedness of $K$ and support in $[0,1]$ after restriction to the design support give \begin{align*} \operatorname{Var}(\hat s_j) = \frac1{nh^2} \operatorname{Var}\!\left[ \left(\frac{X-a}{h}\right)^jK\!\left(\frac{X-a}{h}\right) \right]. \end{align*} Using $0\leq (X-a)/h\leq1$ on the effective support and bounding variance by the second moment gives \begin{align*} \operatorname{Var}(\hat s_j) \leq \frac1{nh^2} \mathbb E\!\left[ K\!\left(\frac{X-a}{h}\right)^2 \right]. \end{align*} Changing variables $t=a+hu$ gives \begin{align*} \mathbb E\!\left[ K\!\left(\frac{X-a}{h}\right)^2 \right] = h\int_0^1K(u)^2f_X(a+hu)\,d\mathcal L^1(u). \end{align*} Since $K$ is bounded on $[0,1]$ and $f_X$ is bounded on $[a,a+\eta]$, this expectation is $O(h)$, so $\operatorname{Var}(\hat s_j)=O((nh)^{-1})$. For $\hat q_j$, write $Y=m(X)+\varepsilon$. On $[a,a+h]$, the function $m$ is bounded for small $h$. Since $\mathbb E[\varepsilon\mid X=t]=0$, the boundedness of $\sigma^2(t)=\operatorname{Var}(\varepsilon\mid X=t)$ on $[a,a+\eta]$ gives a uniform bound on $\mathbb E[\varepsilon^2\mid X=t]$ there. Hence $\mathbb E[Y^2\mid X=t]$ is bounded uniformly for $t\in[a,a+\eta]$. Therefore \begin{align*} \operatorname{Var}(\hat q_j) \leq \frac1{nh^2} \mathbb E\!\left[ K\!\left(\frac{X-a}{h}\right)^2Y^2 \right]. \end{align*} By the same change of variables and the uniform bound on $\mathbb E[Y^2\mid X=t]$ for $t\in[a,a+\eta]$, the right-hand side is $O((nh)^{-1})$. For each fixed $h$, the summands defining $\hat s_j$ and $\hat q_j$ are i.i.d. finite-variance random variables; as $h\to0$ they form a triangular array indexed by the bandwidth. Applying [Chebyshev's inequality](/theorems/1126) to each centered average gives the asserted stochastic orders. [/step] [step:Transfer local average fluctuations to the fitted intercepts] For the local constant estimator, \begin{align*} \hat m_{0,h}(a)=\frac{\hat q_0}{\hat s_0}. \end{align*} The corresponding population target is \begin{align*} m_{0,h,\mathrm{pop}}(a)=\frac{q_0}{s_0}. \end{align*} Since $s_0\to f_X(a)\mu_0^+>0$, there are constants $h_0>0$ and $c_s>0$ such that $s_0\geq c_s$ for $0<h<h_0$. Since $\hat s_0-s_0=O_{\mathbb P}((nh)^{-1/2})$ and $nh\to\infty$, the denominator $\hat s_0$ is bounded below by $c_s/2$ with probability tending to $1$. On the compact neighborhood where $s\geq c_s/2$, the quotient map $(q,s)\mapsto q/s$ is continuously differentiable with bounded derivative. The finite-dimensional [delta method](/theorems/1861), equivalently the [mean value theorem](/theorems/186) on this compact neighborhood, gives \begin{align*} \hat m_{0,h}(a)-m_{0,h,\mathrm{pop}}(a) = O_{\mathbb P}((nh)^{-1/2}). \end{align*} For local linear regression, define $\hat G_h\in\mathbb R^{2\times2}$ by $(\hat G_h)_{jk}=\hat s_{j+k}$ for $0\leq j,k\leq1$, and define $\hat r_h\in\mathbb R^2$ by \begin{align*} \hat r_h=(\hat q_0,\hat q_1)^\top. \end{align*} Define $G_h\in\mathbb R^{2\times2}$ by $(G_h)_{jk}=s_{j+k}$ for $0\leq j,k\leq1$, and define $r_h\in\mathbb R^2$ by \begin{align*} r_h=(q_0,q_1)^\top. \end{align*} The rescaled empirical local linear coefficients satisfy \begin{align*} \hat G_h(\hat\gamma_{0,h},\hat\gamma_{1,h})^\top=\hat r_h, \end{align*} and the population coefficients satisfy \begin{align*} G_h(\gamma_{0,h},\gamma_{1,h})^\top=r_h. \end{align*} From the preceding step, \begin{align*} \|\hat G_h-G_h\|_{\mathrm{op}} = O_{\mathbb P}((nh)^{-1/2}). \end{align*} Also, \begin{align*} |\hat r_h-r_h| = O_{\mathbb P}((nh)^{-1/2}). \end{align*} Also $G_h\to f_X(a)M^+$. Since the smallest singular value of the nonsingular matrix $f_X(a)M^+$ is positive, there are constants $h_1>0$ and $c_G>0$ such that the smallest singular value of $G_h$ is at least $c_G$ for $0<h<h_1$. The perturbation bound $\|\hat G_h-G_h\|_{\mathrm{op}}=O_{\mathbb P}((nh)^{-1/2})$ and $nh\to\infty$ imply that the smallest singular value of $\hat G_h$ is at least $c_G/2$ with probability tending to $1$. Therefore $\|G_h^{-1}\|_{\mathrm{op}}\leq c_G^{-1}$ for small $h$, and $\|\hat G_h^{-1}\|_{\mathrm{op}}\leq 2c_G^{-1}$ with probability tending to $1$. On this event, \begin{align*} (\hat\gamma_{0,h},\hat\gamma_{1,h})^\top - (\gamma_{0,h},\gamma_{1,h})^\top = \hat G_h^{-1}(\hat r_h-r_h) + \hat G_h^{-1}(G_h-\hat G_h)(\gamma_{0,h},\gamma_{1,h})^\top. \end{align*} The population coefficient vector is bounded, because it converges to $(m(a),0)^\top$ in the rescaled coordinates. Hence \begin{align*} (\hat\gamma_{0,h},\hat\gamma_{1,h})^\top - (\gamma_{0,h},\gamma_{1,h})^\top = O_{\mathbb P}((nh)^{-1/2}). \end{align*} Taking first coordinates, and using $\hat m_{1,h}(a)=\hat\gamma_{0,h}$ and $m_{1,h,\mathrm{pop}}(a)=\gamma_{0,h}$, gives \begin{align*} \hat m_{1,h}(a)-m_{1,h,\mathrm{pop}}(a) = O_{\mathbb P}((nh)^{-1/2}). \end{align*} Together with the local constant conclusion, this proves the empirical part and completes the theorem. [/step]

Prerequisites (0/5 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.