[proofplan]
The proof conditions on the fold assignment and on the training data used to construct a fixed fold predictor $\hat f_{-k}$. Given this information, every held-out observation in $I_k$ is independent of the fitted rule and has the same law as a fresh observation $(X,Y)$. Therefore each held-out conditional expected loss equals the conditional fresh-observation risk of $\hat f_{-k}$. Averaging first over the held-out indices, then over folds, and finally taking expectations gives the claimed identity.
[/proofplan]
[step:Condition on the fold-complement information for one fold]
Fix $k \in \{1,\dots,K\}$. Let $\mathcal G_k$ be the $\sigma$-algebra generated by the random partition $(I_1,\dots,I_K)$ and by the fold-complement training sample $D_{-k} := (Z_i)_{i \notin I_k}$, where $Z_i=(X_i,Y_i)$. Since the learning procedure and all preprocessing steps used to construct $\hat f_{-k}$ depend only on $D_{-k}$, the fitted map
\begin{align*}
\hat f_{-k}: \mathcal X \to \mathcal A
\end{align*}
is $\mathcal G_k$-measurable as a random prediction rule.
For every index $i \in I_k$, the observation $Z_i=(X_i,Y_i)$ is independent of $\mathcal G_k$. Indeed, the fold assignment is independent of the i.i.d. sample $(Z_1,\dots,Z_n)$, and given the fold assignment, the only observations entering $\mathcal G_k$ are those with indices outside $I_k$. Thus $Z_i$ has the same distribution as an independent fresh observation $(X,Y)$, and since $\hat f_{-k}$ is $\mathcal G_k$-measurable, $Z_i$ is also independent of $\hat f_{-k}$.
[guided]
Fix a fold index $k \in \{1,\dots,K\}$. We isolate exactly the information used to build the predictor being evaluated on the held-out fold. Define $\mathcal G_k$ to be the $\sigma$-algebra generated by the random fold partition $(I_1,\dots,I_K)$ and by the training sample
\begin{align*}
D_{-k} := (Z_i)_{i \notin I_k}.
\end{align*}
The fitted rule $\hat f_{-k}: \mathcal X \to \mathcal A$ is obtained by applying the fixed learning procedure to $D_{-k}$. Since all preprocessing parameters are also estimated only from $D_{-k}$, the entire fitted prediction rule is determined by $\mathcal G_k$.
Now take an index $i \in I_k$. Conditional on the fold assignment, the training data $D_{-k}$ consists precisely of observations with indices outside $I_k$. Because the observations $Z_1,\dots,Z_n$ are i.i.d., the held-out observation $Z_i=(X_i,Y_i)$ is independent of those training observations. Because the fold assignment itself was chosen independently of the data, adding the fold assignment to the conditioning information does not introduce dependence between $Z_i$ and the fold-complement sample. Therefore $Z_i$ is independent of $\mathcal G_k$ and has the same distribution as a fresh observation $(X,Y)$.
This is the exact point where the “preprocessing inside each training fold” hypothesis is used. If preprocessing were fitted using all observations, then $\hat f_{-k}$ would depend on the held-out observations in $I_k$, and $Z_i$ would no longer be independent of the rule whose loss is being evaluated.
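As a hedged illustration (the centering map below is a hypothetical preprocessing step, not part of the stated hypotheses), suppose $\mathcal X = \mathbb R^d$ and the features are centered before fitting. Centering with the full-sample mean uses
\begin{align*}
\bar X_n
:= \frac{1}{n}\sum_{j=1}^n X_j
= \frac{1}{n}\sum_{j \notin I_k} X_j + \frac{1}{n}\sum_{j \in I_k} X_j,
\end{align*}
and the second sum involves $X_i$ for every $i \in I_k$, so a rule fitted after this centering is in general not $\mathcal G_k$-measurable. Centering with the fold-complement mean, assuming the training set is nonempty,
\begin{align*}
\bar X_{-k}
:= \frac{1}{n-|I_k|}\sum_{j \notin I_k} X_j,
\end{align*}
uses only $D_{-k}$ and the partition, so it is $\mathcal G_k$-measurable and the independence argument applies unchanged.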
[/guided]
[/step]
[step:Identify the conditional held-out loss with the conditional fresh-observation risk]
Define the conditional risk random variable
\begin{align*}
\rho_k
:= \mathbb E\left[L(Y,\hat f_{-k}(X)) \mid \mathcal G_k\right],
\end{align*}
where $(X,Y)$ is independent of $\mathcal G_k$ and has the same distribution as $(X_1,Y_1)$. For each $i \in I_k$, the independence established above and the $\mathcal G_k$-measurability of $\hat f_{-k}$ give
\begin{align*}
\mathbb E\left[L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k\right]
= \rho_k.
\end{align*}
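Since the partition is $\mathcal G_k$-measurable, the index set $I_k$ and the factor $1/|I_k|$ are fixed under the conditioning. Averaging over the held-out indices in $I_k$ and using linearity of conditional expectation,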
\begin{align*}
\mathbb E\left[
\frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i))
\mid \mathcal G_k
\right]
= \rho_k.
\end{align*}
[guided]
We now compare the held-out loss with the loss on a genuinely fresh observation. Define
\begin{align*}
\rho_k
:= \mathbb E\left[L(Y,\hat f_{-k}(X)) \mid \mathcal G_k\right],
\end{align*}
where $(X,Y)$ is an independent copy of $(X_1,Y_1)$, independent of $\mathcal G_k$. The random variable $\rho_k$ is the conditional test risk of the already-fitted rule $\hat f_{-k}$.
For an index $i \in I_k$, the previous step gives two facts: $Z_i=(X_i,Y_i)$ is independent of $\mathcal G_k$, and $Z_i$ has the same law as $(X,Y)$. Also, $\hat f_{-k}$ is $\mathcal G_k$-measurable, because it was constructed only from $D_{-k}$. Therefore, after conditioning on $\mathcal G_k$, the fitted rule is fixed while the held-out observation is distributed as a fresh independent draw. Hence
\begin{align*}
\mathbb E\left[L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k\right]
= \mathbb E\left[L(Y,\hat f_{-k}(X)) \mid \mathcal G_k\right]
= \rho_k.
\end{align*}
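Since $I_k$ is nonempty by hypothesis, the fold average is well-defined. Because the fold partition is part of the information in $\mathcal G_k$, both the index set $I_k$ and the factor $1/|I_k|$ are $\mathcal G_k$-measurable and can be treated as fixed under the conditioning. Applying linearity of conditional expectation to the finite sum over held-out indices gives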
\begin{align*}
\mathbb E\left[
\frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i))
\mid \mathcal G_k
\right]
&=
\frac{1}{|I_k|}\sum_{i \in I_k}
\mathbb E\left[
L(Y_i,\hat f_{-k}(X_i))
\mid \mathcal G_k
\right] \\
&=
\frac{1}{|I_k|}\sum_{i \in I_k} \rho_k \\
&= \rho_k.
\end{align*}
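One way to make the single-index identity explicit, sketched here under the assumption that $L$ is jointly measurable and the losses are integrable (neither is restated in this proof), is to introduce the risk functional of a deterministic rule $f$,
\begin{align*}
R(f) := \mathbb E\left[L(Y,f(X))\right].
\end{align*}
The symbol $R$ is introduced only for this illustration. Since $\hat f_{-k}$ is $\mathcal G_k$-measurable and both $Z_i$ and $(X,Y)$ are independent of $\mathcal G_k$ with the same law, substituting the fitted rule into $R$ gives
\begin{align*}
\mathbb E\left[L(Y_i,\hat f_{-k}(X_i)) \mid \mathcal G_k\right]
= R(f)\big|_{f=\hat f_{-k}}
= \mathbb E\left[L(Y,\hat f_{-k}(X)) \mid \mathcal G_k\right]
= \rho_k.
\end{align*}
For example, with real-valued predictions and squared loss $L(y,a)=(y-a)^2$, this reads
\begin{align*}
\rho_k = \int \left(y-\hat f_{-k}(x)\right)^2 \, dP(x,y),
\end{align*}
where $P$ denotes the law of $(X,Y)$ and the fitted rule is held fixed inside the integral.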
[/guided]
[/step]
[step:Average the conditional identity over all folds]
Taking expectations on both sides of the conditional identity from the previous step and applying the tower property to each side,
\begin{align*}
\mathbb E\left[
\frac{1}{|I_k|}\sum_{i \in I_k} L(Y_i,\hat f_{-k}(X_i))
\right]
&=
\mathbb E[\rho_k] \\
&=
\mathbb E\left[L(Y,\hat f_{-k}(X))\right].
\end{align*}
Therefore, by linearity of expectation applied to the definition of $\hat R_{\mathrm{CV}}$,
\begin{align*}
\mathbb E[\hat R_{\mathrm{CV}}]
&=
\mathbb E\left[
\frac{1}{K}\sum_{k=1}^K
\frac{1}{|I_k|}\sum_{i \in I_k}
L(Y_i,\hat f_{-k}(X_i))
\right] \\
&=
\frac{1}{K}\sum_{k=1}^K
\mathbb E\left[
\frac{1}{|I_k|}\sum_{i \in I_k}
L(Y_i,\hat f_{-k}(X_i))
\right] \\
&=
\frac{1}{K}\sum_{k=1}^K
\mathbb E\left[L(Y,\hat f_{-k}(X))\right].
\end{align*}
This is the asserted cross-validation risk identity.
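As an illustrative special case, under assumptions not made in the statement itself, suppose every fold has the same deterministic size $m = n/K$ and the learning procedure is a fixed deterministic map from training samples to prediction rules. Given the partition, each $D_{-k}$ is then an i.i.d. sample of size $n-m$, so the law of $\hat f_{-k}$ does not depend on $k$ and
\begin{align*}
\mathbb E[\hat R_{\mathrm{CV}}]
&=
\frac{1}{K}\sum_{k=1}^K
\mathbb E\left[L(Y,\hat f_{-k}(X))\right] \\
&=
\mathbb E\left[L(Y,\hat f_{-1}(X))\right].
\end{align*}
In this case the cross-validation estimate is unbiased for the expected risk of the procedure trained on $n-m$ observations, which generally differs from its expected risk at the full sample size $n$.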
[/step]