Unbiased Risk Estimate for Least-Squares Cross-Validation

Unbiased Risk Estimate for Least-Squares Cross-Validation (Theorem # 6330)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We expand the mean integrated squared error into its quadratic, cross, and constant terms. The quadratic term already appears in $\operatorname{LSCV}(h)$, so the main point is to show that the leave-one-out average is an unbiased estimator of the cross term $\mathbb E[\int_{\mathbb R}\hat f_h(x) f(x)\,d\mathcal L^1(x)]$. This follows by conditioning on all observations except $X_i$: after conditioning, $\hat f_{h,-i}$ is fixed while $X_i$ still has conditional density $f$ with respect to $\mathcal L^1$. Finally, the pathwise identity that the average of the leave-one-out estimators equals the full estimator converts the averaged leave-one-out expectation into the full cross term. [/proofplan] [step:Expand the mean integrated squared error into three finite terms] Because $f\in L^2(\mathbb R)$ and \begin{align*} \mathbb E\left[\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)\right]<\infty, \end{align*} the cross term is integrable. Indeed, for almost every $\omega\in\Omega$, the [Cauchy-Schwarz inequality](/theorems/432) in $L^2(\mathbb R)$ gives the following bound: \begin{align*} \int_{\mathbb R}|\hat f_h(x)f(x)|\,d\mathcal L^1(x) \leq \left(\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)\right)^{1/2}\left(\int_{\mathbb R}f(x)^2\,d\mathcal L^1(x)\right)^{1/2}. \end{align*} Taking expectations and applying the [Cauchy-Schwarz inequality](/theorems/432) for real random variables gives \begin{align*} \mathbb E\left[\int_{\mathbb R}|\hat f_h(x)f(x)|\,d\mathcal L^1(x)\right] \leq \left(\mathbb E\left[\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)\right]\right)^{1/2}\left(\int_{\mathbb R}f(x)^2\,d\mathcal L^1(x)\right)^{1/2}<\infty. \end{align*} Therefore we may expand the square and take expectations term by term: \begin{align*} \operatorname{MISE}(h) = \mathbb E\left[\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)\right] -2\mathbb E\left[\int_{\mathbb R}\hat f_h(x)f(x)\,d\mathcal L^1(x)\right] +\int_{\mathbb R}f(x)^2\,d\mathcal L^1(x). \end{align*} [/step] [step:Condition on the observations outside the leave-one-out index] Fix $i\in\{1,\dots,n\}$. Let $\mathcal F_{-i}:=\sigma(X_j:1\leq j\leq n,\ j\neq i)$ be the $\sigma$-algebra generated by all observations except $X_i$, and let $\mathcal B(\mathbb R)$ denote the Borel $\sigma$-algebra on $\mathbb R$. By the theorem hypotheses, $K:\mathbb R\to\mathbb R$ is Borel-measurable and $\hat f_{h,-i}$ is given by its finite-sum formula in the variables $X_j$ with $j\neq i$. Therefore the map $(\omega,x)\mapsto \hat f_{h,-i}(\omega,x)$ is $\mathcal F_{-i}\otimes\mathcal B(\mathbb R)$-measurable. We use the following conditional integration fact. If $G:\Omega\times\mathbb R\to\mathbb R$ is $\mathcal F_{-i}\otimes\mathcal B(\mathbb R)$-measurable and $G(\cdot,X_i)$ is integrable, then independence of $X_i$ from $\mathcal F_{-i}$ and the product-measure construction imply \begin{align*} \mathbb E\left[G(\cdot,X_i)\mid \mathcal F_{-i}\right] = \int_{\mathbb R}G(\cdot,x)f(x)\,d\mathcal L^1(x) \end{align*} almost surely, whenever the right-hand side is interpreted as an absolutely finite [random variable](/page/Random%20Variable). Indeed, the identity first holds for non-negative $G$ by Tonelli's theorem applied to the joint law of $(\omega,X_i)$ over $\mathcal F_{-i}\otimes\mathcal B(\mathbb R)$, and the signed case follows by applying the non-negative case to $G^+$ and $G^-$ whenever $G(\cdot,X_i)$ is integrable. Apply this fact to $G(\omega,x):=\hat f_{h,-i}(\omega,x)$. The theorem hypothesis $\mathbb E[|\hat f_{h,-i}(X_i)|]<\infty$ gives the required integrability. Hence \begin{align*} \mathbb E\left[\hat f_{h,-i}(X_i)\mid \mathcal F_{-i}\right] = \int_{\mathbb R}\hat f_{h,-i}(x)f(x)\,d\mathcal L^1(x) \end{align*} almost surely, and the conditional integral is absolutely finite almost surely. Taking expectations and using the defining property of [conditional expectation](/page/Conditional%20Expectation) yields \begin{align*} \mathbb E[\hat f_{h,-i}(X_i)] = \mathbb E\left[\int_{\mathbb R}\hat f_{h,-i}(x)f(x)\,d\mathcal L^1(x)\right]. \end{align*} [guided] Fix one index $i\in\{1,\dots,n\}$. The reason for deleting $X_i$ from the estimator is that the remaining estimator no longer depends on $X_i$. Define \begin{align*} \mathcal F_{-i}:=\sigma(X_j:1\leq j\leq n,\ j\neq i). \end{align*} Let $\mathcal B(\mathbb R)$ denote the Borel $\sigma$-algebra on $\mathbb R$. By construction from the theorem statement, $\hat f_{h,-i}:\mathbb R\to\mathbb R$ is determined by the random variables $X_j$ with $j\neq i$. More explicitly, its finite-sum formula is built from the Borel-measurable kernel $K:\mathbb R\to\mathbb R$ and the measurable coordinate maps $X_j$, so $(\omega,x)\mapsto \hat f_{h,-i}(\omega,x)$ is $\mathcal F_{-i}\otimes\mathcal B(\mathbb R)$-measurable. Thus it is legitimate to treat $\hat f_{h,-i}$ as an $\mathcal F_{-i}$-measurable random function. Since the theorem assumes that $X_1,\dots,X_n$ are independent identically distributed random variables with common density $f$, the random variable $X_i$ is independent of $\mathcal F_{-i}$ and its law has density $f$ with respect to $\mathcal L^1$. The precise tool is the following conditional integration identity. Suppose $G:\Omega\times\mathbb R\to\mathbb R$ is $\mathcal F_{-i}\otimes\mathcal B(\mathbb R)$-measurable and $G(\cdot,X_i)$ is integrable. Then \begin{align*} \mathbb E\left[G(\cdot,X_i)\mid \mathcal F_{-i}\right] = \int_{\mathbb R}G(\cdot,x)f(x)\,d\mathcal L^1(x) \end{align*} almost surely. Why is this true? For non-negative $G$, independence says that the joint law factors into the law generated by $\mathcal F_{-i}$ and the law of $X_i$, and the latter has density $f$ with respect to $\mathcal L^1$. Tonelli's theorem then gives the displayed identity after testing against every bounded $\mathcal F_{-i}$-measurable random variable. For signed $G$, we apply the non-negative result to $G^+$ and $G^-$ and subtract the two finite identities; integrability of $G(\cdot,X_i)$ ensures that this subtraction is legitimate. Now take $G(\omega,x):=\hat f_{h,-i}(\omega,x)$. The integrability hypothesis \begin{align*} \mathbb E\left[\left|\hat f_{h,-i}(X_i)\right|\right]<\infty \end{align*} ensures that $G(\cdot,X_i)$ is integrable. Therefore the conditional integration identity gives \begin{align*} \mathbb E\left[\hat f_{h,-i}(X_i)\mid \mathcal F_{-i}\right] = \int_{\mathbb R}\hat f_{h,-i}(x)f(x)\,d\mathcal L^1(x) \end{align*} almost surely. The same identity applied to $|G|$ shows that the integral on the right is absolutely finite almost surely. Taking expectation on both sides and using the defining property of [conditional expectation](/page/Conditional%20Expectation) gives \begin{align*} \mathbb E[\hat f_{h,-i}(X_i)] = \mathbb E\left[\int_{\mathbb R}\hat f_{h,-i}(x)f(x)\,d\mathcal L^1(x)\right]. \end{align*} This is the key unbiasedness mechanism: leave-one-out removes the dependence between the evaluation point $X_i$ and the estimator being evaluated there. [/guided] [/step] [step:Average the leave-one-out cross terms before taking expectation] For each $i\in\{1,\dots,n\}$, define the random variable \begin{align*} A_i:=\int_{\mathbb R}\hat f_{h,-i}(x)f(x)\,d\mathcal L^1(x). \end{align*} The previous step shows that $A_i$ is integrable and that \begin{align*} \mathbb E[A_i]=\mathbb E[\hat f_{h,-i}(X_i)]. \end{align*} We compare the average of the random variables $A_i$ with the full cross term pathwise, avoiding pointwise expectations of the kernel. By the theorem statement, the estimators are the displayed finite-sum random fields, so the finite sums below define measurable random functions and the evaluation maps used above are measurable. By linearity in the definition of the [Lebesgue integral](/page/Lebesgue%20Integral) for finite sums, \begin{align*} \frac{1}{n}\sum_{i=1}^n A_i = \int_{\mathbb R}\left(\frac{1}{n}\sum_{i=1}^n \hat f_{h,-i}(x)\right)f(x)\,d\mathcal L^1(x). \end{align*} For each $x\in\mathbb R$, the finite-sum formula gives \begin{align*} \frac{1}{n}\sum_{i=1}^n \hat f_{h,-i}(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{(n-1)h}\sum_{\substack{1\leq j\leq n,\ j\neq i}} K\left(\frac{x-X_j}{h}\right). \end{align*} Since each kernel term with index $j$ appears exactly $n-1$ times in the double sum, this becomes \begin{align*} \frac{1}{n}\sum_{i=1}^n \hat f_{h,-i}(x) = \frac{1}{nh}\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right). \end{align*} By the definition of the full kernel density estimator, the last expression equals $\hat f_h(x)$, so \begin{align*} \frac{1}{n}\sum_{i=1}^n \hat f_{h,-i}(x)=\hat f_h(x). \end{align*} Hence \begin{align*} \frac{1}{n}\sum_{i=1}^n A_i = \int_{\mathbb R}\hat f_h(x)f(x)\,d\mathcal L^1(x). \end{align*} The right-hand side is integrable by the cross-term estimate in the first step, and the left-hand side is integrable because each $A_i$ is integrable. Taking expectations gives \begin{align*} \frac{1}{n}\sum_{i=1}^n\mathbb E[\hat f_{h,-i}(X_i)] = \mathbb E\left[\int_{\mathbb R}\hat f_h(x)f(x)\,d\mathcal L^1(x)\right]. \end{align*} [/step] [step:Average over the leave-one-out terms and identify the risk] Using linearity in the definition of expectation and the identity from the previous step, \begin{align*} \mathbb E[\operatorname{LSCV}(h)] = \mathbb E\left[\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)\right]-\frac{2}{n}\sum_{i=1}^n \mathbb E[\hat f_{h,-i}(X_i)]. \end{align*} Substituting the averaged leave-one-out identity gives \begin{align*} \mathbb E[\operatorname{LSCV}(h)] = \mathbb E\left[\int_{\mathbb R}\hat f_h(x)^2\,d\mathcal L^1(x)\right]-2\mathbb E\left[\int_{\mathbb R}\hat f_h(x)f(x)\,d\mathcal L^1(x)\right]. \end{align*} From the expansion of $\operatorname{MISE}(h)$, the preceding right-hand side is exactly \begin{align*} \operatorname{MISE}(h) - \int_{\mathbb R}f(x)^2\,d\mathcal L^1(x). \end{align*} Therefore \begin{align*} \mathbb E[\operatorname{LSCV}(h)] = \operatorname{MISE}(h) - \int_{\mathbb R}f(x)^2\,d\mathcal L^1(x), \end{align*} which is the desired identity. [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Explore Further

Random Variable Definition Expectation Definition Moments and Asymptotic Normality of the Wilcoxon Signed-Rank Statistic Probability & Statistics Stochastic Scale of Local Polynomial Derivative Estimators Probability & Statistics Tensorization of Hellinger Affinity Probability & Statistics Consistency of Random Design Ordinary Least Squares Probability & Statistics Nonexistence of Fully Adaptive Honest Supremum-Norm Confidence Bands Probability & Statistics Hoeffding Confidence Radius for Bounded Independent Means Probability & Statistics Strong Markov Property of Brownian Motion Brownian Motion Cauchy-Schwarz Inequality for Covariance Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.