Unbiasedness of $K$-Fold Cross-Validation Risk for Fold-Trained Predictors (Theorem #4471)




Proof

[proofplan]
The proof conditions on the fold assignment and on the training data used to construct a fixed fold predictor $\hat f_{-k}$. Given this information, every held-out observation in $I_k$ is independent of the fitted rule and has the same law as a fresh observation $(X,Y)$. Therefore each held-out conditional expected loss equals the conditional fresh-observation risk of $\hat f_{-k}$. Averaging first over the held-out indices, then over folds, and finally taking expectations gives the claimed identity.
[/proofplan]

[step:Condition on the fold-complement information for one fold]
Fix $k \in \{1,\dots,K\}$. Let $\mathcal G_k$ be the $\sigma$-algebra generated by the random partition $(I_1,\dots,I_K)$ and by the fold-complement training sample $D_{-k}$. Since the learning procedure and all preprocessing steps used to construct $\hat f_{-k}$ depend only on $D_{-k}$, the fitted map
\begin{align*}
\hat f_{-k}: \mathcal X \to \mathcal A
\end{align*}
is $\mathcal G_k$-measurable as a random prediction rule.

For every index $i \in I_k$, the observation $Z_i=(X_i,Y_i)$ is independent of $\mathcal G_k$. Indeed, conditional on the fold assignment, $\mathcal G_k$ carries information only about observations with indices outside $I_k$, and the fold assignment is independent of the i.i.d. sample $(Z_1,\dots,Z_n)$. Thus $Z_i$ has the same distribution as a fresh observation $(X,Y)$ and is independent of the $\mathcal G_k$-measurable rule $\hat f_{-k}$.

[guided]
Fix a fold index $k \in \{1,\dots,K\}$. We isolate exactly the information used to build the predictor being evaluated on the held-out fold. Define $\mathcal G_k$ to be the $\sigma$-algebra generated by the random fold partition $(I_1,\dots,I_K)$ and by the training sample
\begin{align*}
D_{-k} := (Z_i)_{i \notin I_k}.
\end{align*}
The fitted rule $\hat f_{-k}: \mathcal X \to \mathcal A$ is obtained by applying the fixed learning procedure to $D_{-k}$. Since all preprocessing parameters are also estimated only from $D_{-k}$, the entire fitted prediction rule is determined by $\mathcal G_k$.

Now take an index $i \in I_k$. Conditional on the fold assignment, the training data $D_{-k}$ consists precisely of observations with indices outside $I_k$. Because the observations $Z_1,\dots,Z_n$ are i.i.d., the held-out observation $Z_i=(X_i,Y_i)$ is independent of those training observations. Because the fold assignment itself was chosen independently of the data, adding the fold assignment to the conditioning information does not introduce dependence between $Z_i$ and the fold-complement sample. Therefore $Z_i$ is independent of $\mathcal G_k$ and has the same distribution as a fresh observation $(X,Y)$.

This is the exact point where the “preprocessing inside each training fold” hypothesis is used. If preprocessing were fitted using all observations, then $\hat f_{-k}$ would depend on the held-out observations in $I_k$, and $Z_i$ would no longer be independent of the rule whose loss is being evaluated.
[/guided]
[/step]

[step:Identify the conditional held-out loss with the conditional fresh-observation risk]
Define the conditional risk random variable
\begin{align*}
\rho_k := \mathbb E\left[L(Y,\hat f_{-k}(X)) \mid \mathcal G_k\right],
\end{align*}
where $(X,Y)$ is independent of $\mathcal G_k$ and has the same distribution as $(X_1,Y_1)$. For each $i \in I_k$, the independence established above and the $\mathcal G_k$-measurability of $\hat f_{-k}$ give
\begin{align*}
\mathbb E\left[L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k\right] = \rho_k.
\end{align*}
Averaging over the held-out indices in $I_k$ and using linearity of conditional expectation,
\begin{align*}
\mathbb E\left[ \frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k \right] = \rho_k.
\end{align*}

[guided]
We now compare the held-out loss with the loss on a genuinely fresh observation. Define
\begin{align*}
\rho_k := \mathbb E\left[L(Y,\hat f_{-k}(X)) \mid \mathcal G_k\right],
\end{align*}
where $(X,Y)$ is an independent copy of $(X_1,Y_1)$, independent of $\mathcal G_k$. The random variable $\rho_k$ is the conditional test risk of the already-fitted rule $\hat f_{-k}$.

For an index $i \in I_k$, the previous step gives two facts: $Z_i=(X_i,Y_i)$ is independent of $\mathcal G_k$, and $Z_i$ has the same law as $(X,Y)$. Also, $\hat f_{-k}$ is $\mathcal G_k$-measurable, because it was constructed only from $D_{-k}$. Therefore, after conditioning on $\mathcal G_k$, the fitted rule is fixed while the held-out observation is distributed as a fresh independent draw. Hence
\begin{align*}
\mathbb E\left[L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k\right] = \mathbb E\left[L(Y,\hat f_{-k}(X)) \mid \mathcal G_k\right] = \rho_k.
\end{align*}
Since $I_k$ is nonempty by hypothesis, the fold average is well-defined. Applying linearity of conditional expectation to the finite sum over held-out indices gives
\begin{align*}
\mathbb E\left[ \frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k \right]
&= \frac{1}{|I_k|}\sum_{i \in I_k} \mathbb E\left[ L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k \right] \\
&= \frac{1}{|I_k|}\sum_{i \in I_k} \rho_k \\
&= \rho_k.
\end{align*}
[/guided]
[/step]

[step:Average the conditional identity over all folds]
Taking expectations in the identity from the previous step and using the tower property,
\begin{align*}
\mathbb E\left[ \frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i)) \right]
&= \mathbb E[\rho_k] \\
&= \mathbb E\left[L(Y,\hat f_{-k}(X))\right].
\end{align*}
Therefore, by linearity of expectation applied to the definition of $\hat R_{\mathrm{CV}}$,
\begin{align*}
\mathbb E[\hat R_{\mathrm{CV}}]
&= \mathbb E\left[ \frac{1}{K}\sum_{k=1}^K \frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i)) \right] \\
&= \frac{1}{K}\sum_{k=1}^K \mathbb E\left[ \frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i)) \right] \\
&= \frac{1}{K}\sum_{k=1}^K \mathbb E\left[L(Y,\hat f_{-k}(X))\right].
\end{align*}
This is the asserted cross-validation risk identity.
[/step]
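As an informal numerical sanity check of the identity, the following Python sketch estimates $\mathbb E[\hat R_{\mathrm{CV}}]$ by Monte Carlo in one toy setting. Everything specific here is an assumption made for illustration and is not part of the theorem: the data are $Y_i \sim N(0,1)$ with no covariates, the learning procedure is the training-fold mean, the loss is squared error, and the helper names `make_folds`, `cv_risk`, and `simulate` are hypothetical. With $n=20$ and $K=5$, every fold-complement sample has $16$ points, so each term $\mathbb E[L(Y,\hat f_{-k}(X))]$ equals $1 + 1/16 = 1.0625$, and the simulated average should land close to that value.

```python
import random
from statistics import fmean


def make_folds(n, K, rng):
    """Random partition of range(n) into K nonempty folds (requires K <= n).

    The partition is drawn without looking at the data, as the proof assumes.
    """
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[k::K] for k in range(K)]


def cv_risk(ys, folds):
    """K-fold CV risk of the mean predictor under squared loss (toy assumption)."""
    fold_losses = []
    for held_out in folds:
        held = set(held_out)
        # Fit on the fold complement D_{-k} only: no held-out point is used.
        train = [y for i, y in enumerate(ys) if i not in held]
        pred = fmean(train)
        # Average held-out loss over the indices in I_k.
        fold_losses.append(fmean([(ys[i] - pred) ** 2 for i in held_out]))
    # Average over the K folds.
    return fmean(fold_losses)


def simulate(n=20, K=5, reps=5000, seed=0):
    """Monte Carlo estimate of E[R_CV] for i.i.d. N(0, 1) responses."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(cv_risk(ys, make_folds(n, K, rng)))
    return fmean(estimates)


if __name__ == "__main__":
    m = 20 - 20 // 5  # size of each fold-complement sample
    print("simulated E[R_CV]:", simulate())
    print("fresh-observation risk 1 + 1/m:", 1 + 1 / m)
```

The fold partition is drawn without reference to the data and the mean is fitted on $D_{-k}$ only, mirroring the two independence hypotheses used in the first step. In this toy setting, fitting the mean on all $n$ observations instead would give an expected held-out loss of $1 - 1/n = 0.95$, which no longer matches the fresh-observation risk.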
