[proofplan]
We split the excess risk $R(\hat{h}) - R(h^*)$ into two pieces using the empirical risk as an intermediary, then bound each piece separately. The term involving the fixed optimal hypothesis $h^*$ is controlled by a direct application of [Hoeffding's Inequality](/theorems/1962). The term involving the data-dependent ERM $\hat{h}$ is handled by a union bound over the finite class $\mathcal{H}$, reducing it to Hoeffding bounds for individual fixed hypotheses. Combining the two bounds and solving the resulting tail probability for the confidence parameter $\delta$ yields the stated high-probability bound.
[/proofplan]
[step:Decompose the excess risk via the empirical risk]
Write $R(h) = \mathbb{E}[\ell(h(X), Y)]$ for the population risk and $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(X_i), Y_i)$ for the empirical risk. Since $\hat{h}$ is the empirical risk minimiser, $\hat{R}(\hat{h}) \leq \hat{R}(h^*)$. Fix $t > 0$. On the event $\{R(\hat{h}) - R(h^*) > t\}$, we have
\begin{align*}
t < R(\hat{h}) - R(h^*) = \bigl[R(\hat{h}) - \hat{R}(\hat{h})\bigr] + \bigl[\hat{R}(\hat{h}) - \hat{R}(h^*)\bigr] + \bigl[\hat{R}(h^*) - R(h^*)\bigr].
\end{align*}
Since $\hat{R}(\hat{h}) - \hat{R}(h^*) \leq 0$ by the ERM property, the middle term is non-positive, so
\begin{align*}
t < \bigl[R(\hat{h}) - \hat{R}(\hat{h})\bigr] + \bigl[\hat{R}(h^*) - R(h^*)\bigr].
\end{align*}
For the sum of two quantities to exceed $t$, at least one must exceed $t/2$. Applying the union bound:
\begin{align*}
\mathbb{P}\bigl(R(\hat{h}) - R(h^*) > t\bigr) \leq \mathbb{P}\!\left(R(\hat{h}) - \hat{R}(\hat{h}) > \frac{t}{2}\right) + \mathbb{P}\!\left(\hat{R}(h^*) - R(h^*) > \frac{t}{2}\right).
\end{align*}
[guided]
The decomposition inserts $\pm \hat{R}(\hat{h})$ and $\pm \hat{R}(h^*)$ to create three differences. The key insight is that the ERM condition $\hat{R}(\hat{h}) \leq \hat{R}(h^*)$ makes the middle difference non-positive, so it can be dropped. This leaves two terms: one measuring how much the population risk of $\hat{h}$ exceeds its empirical risk (the "overfitting" of $\hat{h}$), and one measuring the fluctuation of the empirical risk of $h^*$ around its population risk.
Why split at $t/2$? If $A + B > t$ with $A, B$ real, then $A > t/2$ or $B > t/2$ (otherwise $A + B \leq t$). This gives $\mathbb{P}(A + B > t) \leq \mathbb{P}(A > t/2) + \mathbb{P}(B > t/2)$, a simple union bound that allows us to handle the two terms independently.
[/guided]
[/step]
[step:Bound the second term using Hoeffding's inequality for the fixed hypothesis $h^*$]
The hypothesis $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ is a fixed (non-random) element of $\mathcal{H}$, determined by the population distribution. For this fixed $h^*$, the losses $\ell(h^*(X_i), Y_i) = \mathbb{1}_{h^*(X_i) \neq Y_i}$ for $i = 1, \ldots, n$ are i.i.d. random variables taking values in $[0, 1]$. Their empirical average is $\hat{R}(h^*)$ and their common mean is $R(h^*)$.
Applying [Hoeffding's Inequality](/theorems/1962) with $a_i = 0$, $b_i = 1$ for all $i$ (so that $\sum_{i=1}^n (b_i - a_i)^2 = n$):
\begin{align*}
\mathbb{P}\!\left(\hat{R}(h^*) - R(h^*) \geq \frac{t}{2}\right) \leq \exp\!\left(-\frac{2n^2(t/2)^2}{n}\right) = \exp\!\left(-\frac{nt^2}{2}\right).
\end{align*}
[guided]
Why does Hoeffding apply? We need independent random variables with known bounded ranges. Since $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. and $h^*$ is non-random, the losses $\ell(h^*(X_i), Y_i)$ are i.i.d. with values in $\{0, 1\} \subset [0, 1]$. The independence hypothesis of Hoeffding is satisfied, and each variable is bounded in $[0, 1]$.
The computation: with $a_i = 0$ and $b_i = 1$, Hoeffding gives
\begin{align*}
\mathbb{P}\!\left(\frac{1}{n}\sum_{i=1}^n \bigl[\ell(h^*(X_i), Y_i) - R(h^*)\bigr] \geq \frac{t}{2}\right) \leq \exp\!\left(-\frac{2n^2(t/2)^2}{\sum_{i=1}^n 1^2}\right) = \exp\!\left(-\frac{nt^2}{2}\right).
\end{align*}
[/guided]
[/step]
[step:Bound the first term via a union bound over $\mathcal{H} \setminus \{h^*\}$]
The first term $\mathbb{P}(R(\hat{h}) - \hat{R}(\hat{h}) > t/2)$ is more delicate because $\hat{h}$ depends on the data. Since $t > 0$ and $R(\hat{h}) - R(h^*) = 0$ when $\hat{h} = h^*$, the event $\{R(\hat{h}) - R(h^*) > t\}$ forces $\hat{h} \neq h^*$, so $\hat{h}$ lies in $\mathcal{H}^- := \mathcal{H} \setminus \{h^*\}$. Therefore
\begin{align*}
\mathbb{P}\!\left(R(\hat{h}) - \hat{R}(\hat{h}) > \frac{t}{2},\; \hat{h} \neq h^*\right) \leq \mathbb{P}\!\left(\sup_{h \in \mathcal{H}^-}\bigl[R(h) - \hat{R}(h)\bigr] > \frac{t}{2}\right).
\end{align*}
For each fixed $h \in \mathcal{H}^-$, the losses $\ell(h(X_i), Y_i)$ are i.i.d. in $[0, 1]$ with mean $R(h)$. By [Hoeffding's Inequality](/theorems/1962) applied in the same way as the previous step (now bounding $R(h) - \hat{R}(h) = -[\hat{R}(h) - R(h)]$, using the one-sided bound on $-\frac{1}{n}\sum_{i=1}^n[\ell(h(X_i), Y_i) - R(h)]$):
\begin{align*}
\mathbb{P}\!\left(R(h) - \hat{R}(h) \geq \frac{t}{2}\right) \leq \exp\!\left(-\frac{nt^2}{2}\right).
\end{align*}
Applying the union bound over $\mathcal{H}^-$, which has $|\mathcal{H}| - 1$ elements:
\begin{align*}
\mathbb{P}\!\left(\sup_{h \in \mathcal{H}^-}\bigl[R(h) - \hat{R}(h)\bigr] \geq \frac{t}{2}\right) \leq \sum_{h \in \mathcal{H}^-} \exp\!\left(-\frac{nt^2}{2}\right) = (|\mathcal{H}| - 1)\exp\!\left(-\frac{nt^2}{2}\right).
\end{align*}
[guided]
The difficulty here is that $\hat{h}$ is random -- it depends on the training data $(X_1, Y_1), \ldots, (X_n, Y_n)$. We cannot directly apply Hoeffding to a data-dependent hypothesis. The union bound resolves this: instead of asking "does the particular $\hat{h}$ chosen by ERM overfit by more than $t/2$?", we ask "does any hypothesis in $\mathcal{H}^-$ overfit by more than $t/2$?". This is a stronger event, so its probability is an upper bound.
Why $\mathcal{H}^- = \mathcal{H} \setminus \{h^*\}$ rather than all of $\mathcal{H}$? On the event $\{R(\hat{h}) - R(h^*) > t\}$ with $t > 0$, the excess risk is strictly positive, which forces $\hat{h} \neq h^*$. So $\hat{h}$ ranges over at most $|\mathcal{H}| - 1$ hypotheses, and the union bound need only cover these. This saves one term in the sum, which will combine exactly with the Hoeffding bound for $h^*$ in the next step.
For each individual $h \in \mathcal{H}^-$ (which is a fixed function, not data-dependent), Hoeffding applies exactly as in the previous step, yielding $\mathbb{P}(R(h) - \hat{R}(h) \geq t/2) \leq e^{-nt^2/2}$. The union bound then sums over $|\mathcal{H}| - 1$ hypotheses:
\begin{align*}
\mathbb{P}\!\left(\exists\, h \in \mathcal{H}^- : R(h) - \hat{R}(h) \geq \frac{t}{2}\right) \leq \sum_{h \in \mathcal{H}^-} e^{-nt^2/2} = (|\mathcal{H}| - 1)\,e^{-nt^2/2}.
\end{align*}
This is where the finiteness of $\mathcal{H}$ is essential. For infinite hypothesis classes, this union bound would produce an infinite (and hence useless) upper bound, motivating the need for more refined complexity measures such as Rademacher complexity or VC dimension.
[/guided]
[/step]
[step:Combine the two bounds and solve for the confidence level $\delta$]
Adding the bounds from the previous two steps:
\begin{align*}
\mathbb{P}\bigl(R(\hat{h}) - R(h^*) > t\bigr) \leq (|\mathcal{H}| - 1)\exp\!\left(-\frac{nt^2}{2}\right) + \exp\!\left(-\frac{nt^2}{2}\right) = |\mathcal{H}|\exp\!\left(-\frac{nt^2}{2}\right).
\end{align*}
Setting this tail probability equal to $\delta$ and solving for $t$:
\begin{align*}
\delta = |\mathcal{H}|\exp\!\left(-\frac{nt^2}{2}\right) \quad \Longrightarrow \quad \frac{nt^2}{2} = \log|\mathcal{H}| + \log(1/\delta) \quad \Longrightarrow \quad t = \sqrt{\frac{2(\log|\mathcal{H}| + \log(1/\delta))}{n}}.
\end{align*}
With probability at least $1 - \delta$:
\begin{align*}
R(\hat{h}) - R(h^*) \leq \sqrt{\frac{2(\log|\mathcal{H}| + \log(1/\delta))}{n}}.
\end{align*}
[guided]
The two bounds combine cleanly because the union bound in the previous step was taken over $\mathcal{H}^- = \mathcal{H} \setminus \{h^*\}$, which has exactly $|\mathcal{H}| - 1$ elements, while the Hoeffding bound for the fixed hypothesis $h^*$ contributes one additional $\exp(-nt^2/2)$ term. The sum $(|\mathcal{H}| - 1) + 1 = |\mathcal{H}|$ accounts for every hypothesis in $\mathcal{H}$ exactly once: the $|\mathcal{H}| - 1$ non-optimal hypotheses are covered by the union bound (controlling the overfitting of the data-dependent $\hat{h}$), and the single optimal hypothesis $h^*$ is covered by the direct Hoeffding bound (controlling the fluctuation of empirical risk around population risk).
Setting $|\mathcal{H}|\exp(-nt^2/2) = \delta$ and solving:
\begin{align*}
\frac{nt^2}{2} = \log|\mathcal{H}| + \log(1/\delta) \quad \Longrightarrow \quad t = \sqrt{\frac{2(\log|\mathcal{H}| + \log(1/\delta))}{n}},
\end{align*}
which gives the stated bound directly. Had we instead taken the union bound over all of $\mathcal{H}$ (including $h^*$), the first term would be $|\mathcal{H}|\exp(-nt^2/2)$ and the total would be $(|\mathcal{H}| + 1)\exp(-nt^2/2)$. Loosening to $2|\mathcal{H}|\exp(-nt^2/2)$ would introduce an extra $\log 2$ inside the square root, yielding a strictly weaker bound. The observation that the event $\{R(\hat{h}) - R(h^*) > t\}$ forces $\hat{h} \neq h^*$ is what allows the tighter counting.
[/guided]
[/step]