[proofplan]
We work in the real Hilbert space $H = L^2(\Omega, \mathcal{F}, \mathbb{P})$, where the white-noise condition makes the rescaled innovations $(Z_t/\sigma)$ an orthonormal family. After confirming that the causal series converges in $H$, we expand $Y_{t+h}$ and split the sum at the index $j = h$: the terms with $j < h$ involve only the future innovations $Z_{t+1}, \dots, Z_{t+h}$, while the terms with $j \ge h$ involve only innovations dated at or before $t$. The first block is orthogonal to $\mathcal{H}_t$ and the second block lies in $\mathcal{H}_t$, so by the [Orthogonal Decomposition Theorem](/theorems/241) the second block is exactly the projection $\hat{Y}_t(h)$ and the first block is the forecast error. Pairwise orthogonality of the innovations then collapses the variance of the error to the displayed sum of squared coefficients.
[/proofplan]
[step:Realize the process in $L^2$ and confirm the causal series converges there]
Let $H := L^2(\Omega, \mathcal{F}, \mathbb{P}; \mathbb{R})$, the space of square-integrable real random variables, equipped with the inner product
\begin{align*}
(\cdot, \cdot)_H : H \times H &\to \mathbb{R}, \\
(X, Y)_H &\mapsto \mathbb{E}[XY],
\end{align*}
and norm $\|X\|_H = (\mathbb{E}[X^2])^{1/2}$. Since $H$ is complete (by the [Completeness of $L^p$ Spaces](/theorems/892) with $p = 2$), it is a Hilbert space. The white-noise hypothesis $\mathbb{E}[Z_s Z_t] = \sigma^2 \mathbb{1}_{\{s = t\}}$ states precisely that
\begin{align*}
(Z_s, Z_t)_H = \sigma^2\, \mathbb{1}_{\{s = t\}}, \qquad s, t \in \mathbb{Z},
\end{align*}
so each $Z_t \in H$ with $\|Z_t\|_H = \sigma$, and distinct innovations are orthogonal.
For $N \in \mathbb{N}$ define the partial sum $S_N := \sum_{j=0}^{N} \psi_j Z_{t-j} \in H$. For $M < N$, the orthogonality relation gives
\begin{align*}
\|S_N - S_M\|_H^2 = \Big\| \sum_{j=M+1}^{N} \psi_j Z_{t-j} \Big\|_H^2 = \sum_{j=M+1}^{N} \sum_{k=M+1}^{N} \psi_j \psi_k (Z_{t-j}, Z_{t-k})_H = \sigma^2 \sum_{j=M+1}^{N} \psi_j^2.
\end{align*}
Because $\sum_{j=0}^{\infty} |\psi_j| < \infty$ we have $\psi_j \to 0$, so there is $J$ with $|\psi_j| \le 1$, hence $\psi_j^2 \le |\psi_j|$, for all $j \ge J$; therefore $\sum_{j=0}^{\infty} \psi_j^2 < \infty$. The tail $\sigma^2 \sum_{j=M+1}^{N} \psi_j^2$ thus tends to $0$ as $M, N \to \infty$, so $(S_N)$ is Cauchy in $H$ and converges there. This is the asserted $L^2$-convergence, and it identifies $Y_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ as a well-defined element of $H$ with finite variance.
[guided]
Why pass to $L^2$ at all? Forecasting is an $L^2$ (least-squares) problem: the "best linear predictor" is by definition the element of a linear subspace minimizing mean-squared error, and that is exactly the orthogonal projection in the Hilbert space $H = L^2(\Omega, \mathcal{F}, \mathbb{P})$. So our first task is to install the right Hilbert structure and check that every object in the statement actually lives in it.
The inner product is $(X, Y)_H = \mathbb{E}[XY]$, and $H$ is complete by the [Completeness of $L^p$ Spaces](/theorems/892) with $p = 2$, so $H$ is a Hilbert space. Now read the white-noise hypothesis geometrically: $\mathbb{E}[Z_s Z_t] = \sigma^2 \mathbb{1}_{\{s=t\}}$ says
\begin{align*}
(Z_s, Z_t)_H = \sigma^2\, \mathbb{1}_{\{s=t\}}.
\end{align*}
Hence $\|Z_t\|_H = \sigma$ and $Z_s \perp Z_t$ whenever $s \ne t$: the family $(Z_t/\sigma)_{t \in \mathbb{Z}}$ is orthonormal. This single fact is the engine of the entire proof.
Next we must make sure the defining series for $Y_t$ converges in $H$, since the statement asserts this and the rest of the argument manipulates it. Set $S_N = \sum_{j=0}^N \psi_j Z_{t-j}$. To prove convergence we show $(S_N)$ is Cauchy and invoke completeness. For $M < N$, orthogonality of the innovations turns the squared norm of the increment into a bare sum of squares (all cross terms $(Z_{t-j}, Z_{t-k})_H$ with $j \ne k$ vanish):
\begin{align*}
\|S_N - S_M\|_H^2 = \sum_{j=M+1}^{N} \sum_{k=M+1}^{N} \psi_j \psi_k (Z_{t-j}, Z_{t-k})_H = \sigma^2 \sum_{j=M+1}^{N} \psi_j^2.
\end{align*}
It remains to see $\sum_j \psi_j^2 < \infty$. The hypothesis supplies the $\ell^1$ bound $\sum_j |\psi_j| < \infty$, which is the stronger condition because $\ell^1 \subseteq \ell^2$: since $\psi_j \to 0$, eventually $|\psi_j| \le 1$, so $\psi_j^2 \le |\psi_j|$, and summability of $|\psi_j|$ forces summability of $\psi_j^2$. Therefore the tail $\sigma^2\sum_{j=M+1}^N \psi_j^2 \to 0$ as $M, N \to \infty$, the sequence $(S_N)$ is Cauchy, and completeness delivers a limit $Y_t \in H$. Since each $S_N$ has mean zero and $|\mathbb{E}[X]| \le \|X\|_H$ for every $X \in H$, the limit also satisfies $\mathbb{E}[Y_t] = 0$, so $\operatorname{Var}(Y_t) = \|Y_t\|_H^2 < \infty$ and the variance in the conclusion is meaningful.
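For a concrete instance, consider the illustrative choice $\psi_j = \phi^j$ with $|\phi| < 1$, the coefficients of a causal AR(1) process. This choice is an assumption made for the example and is not part of the statement. Then $\sum_j |\psi_j| = 1/(1 - |\phi|) < \infty$, and the Cauchy estimate becomes an explicit geometric bound:
\begin{align*}
\|S_N - S_M\|_H^2 = \sigma^2 \sum_{j=M+1}^{N} \phi^{2j} = \sigma^2\, \phi^{2(M+1)}\, \frac{1 - \phi^{2(N-M)}}{1 - \phi^2} \le \frac{\sigma^2\, \phi^{2(M+1)}}{1 - \phi^2} \to 0 \qquad (M \to \infty).
\end{align*}
In this example the same computation applied to $\|S_N\|_H^2$, together with continuity of the norm, gives $\|Y_t\|_H^2 = \sigma^2/(1 - \phi^2)$.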
[/guided]
[/step]
[step:Split the causal expansion of $Y_{t+h}$ at the forecast horizon]
Applying the causal representation at time $t + h$,
\begin{align*}
Y_{t+h} = \sum_{j=0}^{\infty} \psi_j\, Z_{t+h-j},
\end{align*}
with convergence in $H$ by Step 1. Split the index set $\{0, 1, 2, \dots\}$ at $j = h$ and use continuity of addition in $H$ to write $Y_{t+h} = A + B$, where
\begin{align*}
A := \sum_{j=0}^{h-1} \psi_j\, Z_{t+h-j}, \qquad B := \sum_{j=h}^{\infty} \psi_j\, Z_{t+h-j}.
\end{align*}
Here $A$ is a finite sum and $B$ is the $H$-limit of its partial sums (a convergent tail of a convergent series). As $j$ ranges over $\{0, \dots, h-1\}$, the index $t + h - j$ ranges over $\{t+1, \dots, t+h\}$, so $A$ involves only innovations strictly after time $t$. Reindexing $B$ by $i := j - h \ge 0$ gives
\begin{align*}
B = \sum_{i=0}^{\infty} \psi_{i+h}\, Z_{t-i},
\end{align*}
so $B$ involves only innovations $Z_{t-i}$ with $t - i \le t$, i.e. dated at or before time $t$.
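To see the split concretely, take the horizon $h = 2$ (an illustrative choice). The first two terms of the expansion form $A$, and everything from $j = 2$ onward forms $B$:
\begin{align*}
Y_{t+2} = \underbrace{\psi_0 Z_{t+2} + \psi_1 Z_{t+1}}_{A} + \underbrace{\psi_2 Z_{t} + \psi_3 Z_{t-1} + \psi_4 Z_{t-2} + \cdots}_{B}.
\end{align*}
In reindexed form $B = \sum_{i=0}^{\infty} \psi_{i+2} Z_{t-i}$, whose leading term $\psi_2 Z_t$ carries the most recent innovation available at time $t$.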
[/step]
[step:Identify the projection and the forecast error via orthogonal decomposition]
Recall $\mathcal{H}_t = \overline{\operatorname{sp}}\{Z_s : s \le t\}$, a closed subspace of the Hilbert space $H$.
First, $B \in \mathcal{H}_t$. Each partial sum $\sum_{i=0}^{N} \psi_{i+h} Z_{t-i}$ is a finite linear combination of the vectors $Z_{t-i}$ with $t - i \le t$, hence lies in $\operatorname{sp}\{Z_s : s \le t\} \subseteq \mathcal{H}_t$. By Step 2 these partial sums converge in $H$ to $B$, and $\mathcal{H}_t$ is closed, so $B \in \mathcal{H}_t$.
Second, $A \in \mathcal{H}_t^{\perp}$. The vector $A$ is a finite linear combination of $Z_{t+1}, \dots, Z_{t+h}$. For any $s \le t$ and any $r \in \{t+1, \dots, t+h\}$ we have $r \ne s$, so $(Z_r, Z_s)_H = \sigma^2 \mathbb{1}_{\{r = s\}} = 0$; thus $(A, Z_s)_H = 0$ for every $s \le t$. By bilinearity of the inner product, $A$ is orthogonal to every element of $\operatorname{sp}\{Z_s : s \le t\}$; by continuity of the inner product, taking limits, $A$ is orthogonal to every element of the closure $\mathcal{H}_t$ as well. Therefore $A \in \mathcal{H}_t^{\perp}$.
We have produced a decomposition $Y_{t+h} = B + A$ with $B \in \mathcal{H}_t$ and $A \in \mathcal{H}_t^{\perp}$. By the [Orthogonal Decomposition Theorem](/theorems/241), applied to the closed subspace $\mathcal{H}_t \subseteq H$, every element $x \in H$ has a *unique* such decomposition, and its $\mathcal{H}_t$-component is the orthogonal projection $P_{\mathcal{H}_t} x$ of $x$ onto $\mathcal{H}_t$. Matching components yields
\begin{align*}
\hat{Y}_t(h) = P_{\mathcal{H}_t} Y_{t+h} = B = \sum_{j=h}^{\infty} \psi_j\, Z_{t+h-j},
\end{align*}
and therefore the forecast error is
\begin{align*}
Y_{t+h} - \hat{Y}_t(h) = Y_{t+h} - B = A = \sum_{j=0}^{h-1} \psi_j\, Z_{t+h-j},
\end{align*}
which is the first asserted identity.
[guided]
This is the conceptual heart of the proof. We have written $Y_{t+h} = A + B$, and we want to argue that $B$ *is* the forecast $\hat{Y}_t(h)$ and $A$ *is* the error. The forecast is defined as the orthogonal projection $P_{\mathcal{H}_t} Y_{t+h}$ onto the innovation history $\mathcal{H}_t = \overline{\operatorname{sp}}\{Z_s : s \le t\}$. The strategy is to show that the split $A + B$ already *is* the orthogonal decomposition of $Y_{t+h}$ relative to $\mathcal{H}_t$, and then quote the uniqueness of that decomposition to read off the projection.
So we verify the two membership facts.
Why is $B \in \mathcal{H}_t$? After reindexing in Step 2, $B = \sum_{i=0}^{\infty} \psi_{i+h} Z_{t-i}$. Every finite partial sum is a linear combination of innovations $Z_{t-i}$ with time index $t - i \le t$, so it lies in $\operatorname{sp}\{Z_s : s \le t\}$. The infinite sum is the $H$-limit of these partial sums; since $\mathcal{H}_t$ is the *closed* span, it contains all such limits. This is exactly why we take the closure in the definition of $\mathcal{H}_t$ — without it, the infinite-order moving average $B$ might escape the subspace.
Why is $A \in \mathcal{H}_t^{\perp}$? The point of splitting at $j = h$ is that $A$ collects precisely the innovations $Z_{t+1}, \dots, Z_{t+h}$ dated *after* $t$. These are the parts of $Y_{t+h}$ that have not yet been "observed" at time $t$. Concretely, for any spanning vector $Z_s$ with $s \le t$, the index $s$ differs from each of $t+1, \dots, t+h$, so $(Z_r, Z_s)_H = \sigma^2 \mathbb{1}_{\{r=s\}} = 0$. Hence $A \perp Z_s$ for all $s \le t$. Orthogonality to a spanning set extends to the whole subspace: it is preserved under linear combinations (bilinearity of $(\cdot,\cdot)_H$) and under limits (continuity of $(\cdot,\cdot)_H$), so $A \perp \mathcal{H}_t$, i.e. $A \in \mathcal{H}_t^{\perp}$.
Now we invoke the [Orthogonal Decomposition Theorem](/theorems/241). Its hypothesis is exactly what we have arranged: $\mathcal{H}_t$ is a closed subspace of the Hilbert space $H$. Its conclusion is that every $x \in H$ — here $x = Y_{t+h}$ — has a *unique* decomposition $x = m + m^{\perp}$ with $m \in \mathcal{H}_t$, $m^{\perp} \in \mathcal{H}_t^{\perp}$, and that the $\mathcal{H}_t$-component is the orthogonal projection $m = P_{\mathcal{H}_t} x$. We have exhibited one such decomposition, $Y_{t+h} = B + A$; by uniqueness it must be *the* decomposition. Reading off the components gives $\hat{Y}_t(h) = P_{\mathcal{H}_t} Y_{t+h} = B$ and the error $Y_{t+h} - \hat{Y}_t(h) = A = \sum_{j=0}^{h-1} \psi_j Z_{t+h-j}$. Notice that the error depends only on the future innovations and the first $h$ coefficients $\psi_0, \dots, \psi_{h-1}$ — the unpredictable part of the process over the next $h$ steps.
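As an illustration, suppose $\psi_j = \phi^j$ with $|\phi| < 1$, the coefficients of a causal AR(1) process (an assumed example, not part of the statement). Writing $B$ in its reindexed form and pulling the constant $\phi^h$ out of the convergent series, which is legitimate because scalar multiplication is continuous in $H$, we get
\begin{align*}
\hat{Y}_t(h) = \sum_{i=0}^{\infty} \phi^{i+h} Z_{t-i} = \phi^h \sum_{i=0}^{\infty} \phi^{i} Z_{t-i} = \phi^h\, Y_t, \qquad Y_{t+h} - \hat{Y}_t(h) = \sum_{j=0}^{h-1} \phi^{j} Z_{t+h-j}.
\end{align*}
In this case the forecast simply scales the current value $Y_t$ by $\phi^h$, so it decays geometrically toward the mean $0$ as the horizon $h$ grows.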
[/guided]
[/step]
[step:Compute the error variance by orthogonality of the innovations]
The error $Y_{t+h} - \hat{Y}_t(h) = A = \sum_{j=0}^{h-1} \psi_j Z_{t+h-j}$ is a finite linear combination of mean-zero random variables, so $\mathbb{E}[A] = \sum_{j=0}^{h-1} \psi_j \mathbb{E}[Z_{t+h-j}] = 0$. Hence its variance equals its second moment, i.e. its squared $H$-norm:
\begin{align*}
\operatorname{Var}\big(Y_{t+h} - \hat{Y}_t(h)\big) = \mathbb{E}[A^2] = (A, A)_H.
\end{align*}
Expanding the finite double sum and using $(Z_{t+h-j}, Z_{t+h-k})_H = \sigma^2 \mathbb{1}_{\{j = k\}}$ (distinct innovations are orthogonal, equal ones have squared norm $\sigma^2$),
\begin{align*}
(A, A)_H = \sum_{j=0}^{h-1} \sum_{k=0}^{h-1} \psi_j \psi_k\, (Z_{t+h-j}, Z_{t+h-k})_H = \sum_{j=0}^{h-1} \sum_{k=0}^{h-1} \psi_j \psi_k\, \sigma^2 \mathbb{1}_{\{j=k\}} = \sigma^2 \sum_{j=0}^{h-1} \psi_j^2.
\end{align*}
Combining the two displays gives
\begin{align*}
\operatorname{Var}\big(Y_{t+h} - \hat{Y}_t(h)\big) = \sigma^2 \sum_{j=0}^{h-1} \psi_j^2,
\end{align*}
which is the second asserted identity. Together with the error representation from Step 3, this completes the proof.
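For a quick check of the formula, consider the illustrative assumption $\psi_j = \phi^j$ with $|\phi| < 1$ (causal AR(1) coefficients, not part of the statement). The sum is then geometric:
\begin{align*}
\operatorname{Var}\big(Y_{t+h} - \hat{Y}_t(h)\big) = \sigma^2 \sum_{j=0}^{h-1} \phi^{2j} = \sigma^2\, \frac{1 - \phi^{2h}}{1 - \phi^2}.
\end{align*}
At $h = 1$ this equals $\sigma^2$, the variance of a single innovation, and as $h \to \infty$ it converges to $\sigma^2/(1 - \phi^2) = \sigma^2 \sum_{j=0}^{\infty} \psi_j^2 = \operatorname{Var}(Y_t)$, so in this example long-horizon forecasts are no more accurate than predicting the mean.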
[/step]