Unbiasedness of the Holdout Risk Estimator — Statement & Proof

Unbiasedness of the Holdout Risk Estimator (Theorem # 4470)

Theorem



Proof

[proofplan]
We condition on the training data, so the fitted predictor $\hat f$ is fixed relative to the remaining randomness. Each test observation is independent of the training data and has the same distribution as a fresh observation $(X,Y)$, hence each test loss has conditional expectation equal to the conditional risk $R(\hat f)$. Linearity of conditional expectation then shows that the average of the test losses has the same conditional expectation. Taking expectations gives the unconditional identity.
[/proofplan]

[step:Condition on the training data and define the test losses]
Let
\begin{align*}
\mathcal G:=\sigma(D_{\mathrm{train}})
\end{align*}
be the $\sigma$-algebra generated by the training sample. Since $A$ is measurable and $\hat f=A(D_{\mathrm{train}})$, the random hypothesis $\hat f$ is $\mathcal G$-measurable.

For each $i\in I_{\mathrm{test}}$, define the real-valued random variable
\begin{align*}
Z_i:\Omega\to\mathbb R, \qquad Z_i:=L(Y_i,\hat f(X_i)).
\end{align*}
The integrability hypothesis in the statement gives
\begin{align*}
\mathbb E\left[|L(Y,\hat f(X))|\right]<\infty.
\end{align*}
Because $(X_i,Y_i)$ has the same distribution as $(X,Y)$ and is independent of $\mathcal G$, the same kernel argument used in the next step, applied to the non-negative measurable function $u\mapsto |u|$, gives
\begin{align*}
\mathbb E[|Z_i|] = \mathbb E\left[|L(Y,\hat f(X))|\right] < \infty.
\end{align*}
Thus each $Z_i$ is integrable, so the ordinary signed conditional expectations $\mathbb E[Z_i\mid\mathcal G]$ are well-defined.
[/step]

[step:Compute the conditional expectation of one test loss]
Fix $i\in I_{\mathrm{test}}$. Since $I_{\mathrm{train}}$ and $I_{\mathrm{test}}$ are disjoint and the observations are i.i.d., the random pair $(X_i,Y_i)$ is independent of $\mathcal G=\sigma(D_{\mathrm{train}})$ and has the same law as the independent copy $(X,Y)$.

We justify the conditional distribution identity by testing against bounded [measurable functions](/page/Measurable%20Functions). Let $\varphi:\mathbb R\to\mathbb R$ be bounded and Borel measurable. Since $\hat f$ is $\mathcal G$-measurable and $(X_i,Y_i)$ is independent of $\mathcal G$, conditioning on $\mathcal G$ freezes the value of $\hat f$ and integrates only over the law of $(X_i,Y_i)$. Since $(X_i,Y_i)$ and $(X,Y)$ have the same law, we obtain
\begin{align*}
\mathbb E\left[\varphi\left(L(Y_i,\hat f(X_i))\right)\mid\mathcal G\right] = \mathbb E\left[\varphi\left(L(Y,\hat f(X))\right)\mid\mathcal G\right] \quad\text{a.s.}
\end{align*}
This equality for all bounded Borel $\varphi$ identifies the conditional laws. Applying it to the bounded truncations $\varphi_n(u):=\max(-n,\min(u,n))$ of the identity map and letting $n\to\infty$, conditional dominated convergence (justified by the integrability of $Z_i$ and $L(Y,\hat f(X))$) gives
\begin{align*}
\mathbb E[Z_i\mid\mathcal G] &= \mathbb E\left[L(Y,\hat f(X))\mid\mathcal G\right] \\
&= R(\hat f) \quad\text{a.s.}
\end{align*}

[guided]
Fix $i\in I_{\mathrm{test}}$. The purpose of conditioning on $\mathcal G=\sigma(D_{\mathrm{train}})$ is to freeze the fitted predictor. Since $\hat f=A(D_{\mathrm{train}})$ and $A$ is measurable, $\hat f$ is $\mathcal G$-measurable.

The test observation $(X_i,Y_i)$ is independent of $\mathcal G$ because the observations are i.i.d. and the test index $i$ does not belong to the training index set. It also has the same distribution as the fresh observation $(X,Y)$.

The loss is integrable: by the argument from the first step,
\begin{align*}
\mathbb E\left[|L(Y_i,\hat f(X_i))|\right] = \mathbb E\left[|L(Y,\hat f(X))|\right] < \infty.
\end{align*}

To make the conditioning argument precise, let $\varphi:\mathbb R\to\mathbb R$ be bounded and Borel measurable. Because $\hat f$ is $\mathcal G$-measurable, conditioning on $\mathcal G$ freezes the value of $\hat f$. Because $(X_i,Y_i)$ is independent of $\mathcal G$ and has the same law as $(X,Y)$, the conditional expectations of the bounded test functions agree:
\begin{align*}
\mathbb E\left[\varphi\left(L(Y_i,\hat f(X_i))\right)\mid\mathcal G\right] = \mathbb E\left[\varphi\left(L(Y,\hat f(X))\right)\mid\mathcal G\right] \quad\text{a.s.}
\end{align*}
Taking the bounded truncations $\varphi_n(u):=\max(-n,\min(u,n))$ of the identity map as $\varphi$, letting $n\to\infty$, and using integrability to apply conditional dominated convergence gives
\begin{align*}
\mathbb E\left[L(Y_i,\hat f(X_i))\mid\mathcal G\right] = \mathbb E\left[L(Y,\hat f(X))\mid\mathcal G\right] \quad\text{a.s.}
\end{align*}
By the definition of the conditional risk,
\begin{align*}
R(\hat f) &:= \mathbb E\left[L(Y,\hat f(X))\mid D_{\mathrm{train}}\right] \\
&= \mathbb E\left[L(Y,\hat f(X))\mid\mathcal G\right].
\end{align*}
Combining these identities gives
\begin{align*}
\mathbb E[Z_i\mid\mathcal G]=R(\hat f) \quad\text{a.s.}
\end{align*}
[/guided]
[/step]

[step:Average the conditional expectations over the test sample]
By definition of the holdout estimator,
\begin{align*}
\hat R_{\mathrm{test}} = \frac{1}{m}\sum_{i\in I_{\mathrm{test}}}Z_i.
\end{align*}
Since $m=|I_{\mathrm{test}}|\ge 1$ and each $Z_i$ is integrable, the [linearity of conditional expectation](/page/Conditional%20Expectation) gives
\begin{align*}
\mathbb E[\hat R_{\mathrm{test}}\mid\mathcal G] &= \mathbb E\left[\frac{1}{m}\sum_{i\in I_{\mathrm{test}}}Z_i\mid\mathcal G\right] \\
&= \frac{1}{m}\sum_{i\in I_{\mathrm{test}}}\mathbb E[Z_i\mid\mathcal G] \quad\text{a.s.}
\end{align*}
Using the identity from the previous step for every $i\in I_{\mathrm{test}}$,
\begin{align*}
\mathbb E[\hat R_{\mathrm{test}}\mid\mathcal G] &= \frac{1}{m}\sum_{i\in I_{\mathrm{test}}}R(\hat f) \\
&= \frac{m}{m}R(\hat f) \\
&= R(\hat f) \quad\text{a.s.}
\end{align*}
Since conditioning on $D_{\mathrm{train}}$ is conditioning on $\mathcal G=\sigma(D_{\mathrm{train}})$, this proves
\begin{align*}
\mathbb E[\hat R_{\mathrm{test}}\mid D_{\mathrm{train}}] = R(\hat f) \quad\text{a.s.}
\end{align*}
[/step]

[step:Take expectations to obtain unconditional unbiasedness]
Since each $Z_i$ is integrable, $\hat R_{\mathrm{test}}$ is integrable. Also $R(\hat f)=\mathbb E[L(Y,\hat f(X))\mid\mathcal G]$ is integrable by the integrability hypothesis and the defining integrability property of conditional expectation. Taking expectations in the conditional identity and using the [tower property of conditional expectation](/page/Conditional%20Expectation) yields
\begin{align*}
\mathbb E[\hat R_{\mathrm{test}}] &= \mathbb E\left[\mathbb E[\hat R_{\mathrm{test}}\mid D_{\mathrm{train}}]\right] \\
&= \mathbb E[R(\hat f)].
\end{align*}
Thus the holdout estimator is conditionally unbiased for the conditional risk, and its unconditional expectation equals the expected conditional risk.
[/step]
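As a numerical sanity check of the conditional identity $\mathbb E[\hat R_{\mathrm{test}}\mid D_{\mathrm{train}}]=R(\hat f)$, the following minimal sketch fixes one training sample, fits a predictor once, and averages the holdout estimate over many independent test samples. Everything concrete here is an illustrative assumption and not part of the theorem: the data model $Y=2X+1+\varepsilon$ with $X\sim N(0,1)$ and $\varepsilon\sim N(0,\sigma^2)$, the squared loss, the least-squares line standing in for the algorithm $A$, and the helper names `fit_line`, `draw`, and `holdout_risk`. Under this model the conditional risk of the fitted line $x\mapsto ax+b$ has the closed form $(2-a)^2+(1-b)^2+\sigma^2$, which gives an exact target to compare against.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y ~ a*x + b; plays the role of the algorithm A."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b


def draw(rng, n, sigma):
    """Draw n i.i.d. pairs from the assumed model Y = 2X + 1 + eps."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def holdout_risk(a, b, xs, ys):
    """Average squared loss of the fixed predictor on a test sample."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
sigma = 0.5

# Fix the training data once: this is the conditioning on D_train.
x_tr, y_tr = draw(rng, 20, sigma)
a, b = fit_line(x_tr, y_tr)

# Closed-form conditional risk R(f_hat) under the assumed model:
# E[((2 - a) X + (1 - b) + eps)^2] with E[X] = 0, Var(X) = 1.
conditional_risk = (2.0 - a) ** 2 + (1.0 - b) ** 2 + sigma ** 2

# Average the holdout estimator over many independent test samples of size m.
m, reps = 5, 20000
estimates = []
for _ in range(reps):
    x_te, y_te = draw(rng, m, sigma)
    estimates.append(holdout_risk(a, b, x_te, y_te))
mean_estimate = sum(estimates) / reps

print("conditional risk:", conditional_risk)
print("mean holdout estimate:", mean_estimate)
```

The two printed numbers should agree up to Monte Carlo error. The test size $m$ is deliberately small: the theorem gives unbiasedness for every $m\ge 1$, while a larger $m$ only reduces the variance of each individual estimate.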
