[proofplan]
The forecast operator $P_t(\cdot) := \mathbb{E}[\cdot \mid \mathcal{F}_t]$ is linear, fixes $\mathcal{F}_t$-measurable random variables, and sends any random variable independent of $\mathcal{F}_t$ to its mean, hence to zero when that mean is zero. The whole proof rests on two structural facts that the causal/invertible representation supplies: every past innovation $Z_s$ with $s \le t$ is $\mathcal{F}_t$-measurable (from invertibility), while every future innovation $Z_{t+k}$ with $k \ge 1$ is independent of $\mathcal{F}_t$ (from causality together with independence of the white noise). We apply $P_t$ to the ARMA identity written at index $t+h$ and use linearity; the two facts then collapse each term into an observed quantity, an earlier forecast, or zero, which yields the stated recursion. Specialising to $h = 1$ gives the one-step formula, and a finite induction on $h$ shows every forecast is a function of the observed past.
[/proofplan]
[step:Set up the forecast operator and record its three elementary properties]
Since $(Y_t)$ is causal, $Y_{t+h} = \sum_{j=0}^\infty \psi_j Z_{t+h-j}$ with $\sum_j |\psi_j| < \infty$, so $\|Y_{t+h}\|_{L^2}^2 = \sigma^2 \sum_{j=0}^\infty \psi_j^2 < \infty$; in particular $Y_{t+h} \in L^2(\Omega, \mathcal{F}, \mathbb{P}) \subseteq L^1(\Omega, \mathcal{F}, \mathbb{P})$, so the conditional expectation $\hat{Y}_t(h) = \mathbb{E}[Y_{t+h} \mid \mathcal{F}_t]$ is well defined. Likewise $Z_s \in L^2 \subseteq L^1$ for every $s$.
Write $P_t X := \mathbb{E}[X \mid \mathcal{F}_t]$ for $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$. By the [Basic Properties of Conditional Expectation](/theorems/1148), the operator $P_t$ satisfies:
- **(L) Linearity:** $P_t(\alpha X + \beta W) = \alpha\, P_t X + \beta\, P_t W$ a.s. for $\alpha, \beta \in \mathbb{R}$ and $X, W \in L^1$ (property (v)).
- **(M) Fixes known random variables:** if $X$ is $\mathcal{F}_t$-measurable, then $P_t X = X$ a.s. (property (ii)).
- **(I) Annihilates independent centred random variables:** if $X$ is independent of $\mathcal{F}_t$, then $P_t X = \mathbb{E}[X]$ a.s. (property (iii)); in particular $P_t X = 0$ a.s. when additionally $\mathbb{E}[X] = 0$.
These three properties are the only facts about conditional expectation used below.
[/step]
[step:Show every past innovation is observable and every future innovation is independent of the present]
[claim:For every integer $s \le t$, the innovation $Z_s$ is $\mathcal{F}_t$-measurable]
[proof]
By invertibility, $Z_s = \sum_{j=0}^\infty \pi_j Y_{s-j}$ with convergence in $L^2$. For each $j \ge 0$ the index satisfies $s - j \le s \le t$, so $Y_{s-j}$ is $\mathcal{F}_t$-measurable by definition of $\mathcal{F}_t = \sigma(Y_u : u \le t)$. Hence each partial sum $S_N := \sum_{j=0}^N \pi_j Y_{s-j}$ is $\mathcal{F}_t$-measurable. Since $S_N \to Z_s$ in $L^2$, there is a subsequence $S_{N_k} \to Z_s$ almost surely; an a.s. limit of $\mathcal{F}_t$-measurable random variables is $\mathcal{F}_t$-measurable (after completing $\mathcal{F}_t$ with the $\mathbb{P}$-null sets, which does not change any conditional expectation). Therefore $Z_s$ is $\mathcal{F}_t$-measurable.
[/proof]
[/claim]
[claim:For every integer $k \ge 1$, the innovation $Z_{t+k}$ is independent of $\mathcal{F}_t$]
[proof]
Let $\mathcal{G}_t := \sigma(Z_v : v \le t)$. By causality, for each $u \le t$ we have $Y_u = \sum_{j=0}^\infty \psi_j Z_{u-j}$ as an $L^2$ limit of random variables measurable with respect to $\sigma(Z_v : v \le u) \subseteq \mathcal{G}_t$ (every index $u - j \le u \le t$); arguing as in the previous claim, $Y_u$ is $\mathcal{G}_t$-measurable. Hence $\mathcal{F}_t = \sigma(Y_u : u \le t) \subseteq \mathcal{G}_t$. Because $(Z_v)_{v \in \mathbb{Z}}$ is an independent family and $t + k > t$ for $k \ge 1$, the random variable $Z_{t+k}$ is independent of $\mathcal{G}_t = \sigma(Z_v : v \le t)$, hence independent of the smaller $\sigma$-algebra $\mathcal{F}_t$.
[/proof]
[/claim]
[guided]
We need two facts before any computation, and both are exactly where the hypotheses "causal" and "invertible" are consumed.
**Why these two facts?** When we condition the ARMA recursion on $\mathcal{F}_t$, each term is one of $Y_{t+h-i}$ or $Z_{t+h-j}$. Property (M) lets us keep a term unchanged precisely when it is $\mathcal{F}_t$-measurable, and property (I) lets us delete a centred term precisely when it is independent of $\mathcal{F}_t$. So we must know, for an arbitrary index, whether the corresponding $Z$ is "known" (measurable) or "unseen" (independent). The dividing line is the time $t$.
**Past innovations are known (invertibility).** Intuitively, once we have observed $Y_s, Y_{s-1}, \dots$ we can reconstruct the shock $Z_s$ that drove the system at time $s \le t$. Formally, invertibility gives $Z_s = \sum_{j=0}^\infty \pi_j Y_{s-j}$ in $L^2$. Every index $s - j$ is $\le s \le t$, so each $Y_{s-j}$ is $\mathcal{F}_t$-measurable; the partial sums $S_N = \sum_{j=0}^N \pi_j Y_{s-j}$ are therefore $\mathcal{F}_t$-measurable, and since $S_N \to Z_s$ in $L^2$ a subsequence converges a.s., so the limit $Z_s$ is $\mathcal{F}_t$-measurable. (We tacitly complete $\mathcal{F}_t$ with null sets; this changes no conditional expectation.) Without invertibility we could not assert that the *past* innovations appearing in the recursion are observable, and the clean formula would break.
**Future innovations are unseen (causality).** Intuitively a causal system depends only on past and present shocks, so observing $Y$ up to time $t$ tells us nothing about a shock $Z_{t+k}$ that has not yet occurred. Formally, set $\mathcal{G}_t = \sigma(Z_v : v \le t)$. Causality $Y_u = \sum_{j \ge 0}\psi_j Z_{u-j}$ expresses each $Y_u$ ($u \le t$) as an $L^2$ limit of $\mathcal{G}_t$-measurable variables, so $Y_u$ is $\mathcal{G}_t$-measurable and therefore $\mathcal{F}_t = \sigma(Y_u : u \le t) \subseteq \mathcal{G}_t$. Now $Z_{t+k}$ with $k \ge 1$ is, by the independence of the white-noise family, independent of $\sigma(Z_v : v \le t) = \mathcal{G}_t$; independence from a $\sigma$-algebra is inherited by every sub-$\sigma$-algebra, so $Z_{t+k}$ is independent of $\mathcal{F}_t$. This is the only place the *independence* (not merely uncorrelatedness) of the noise is used; it is what makes $\mathbb{E}[Z_{t+k}\mid\mathcal{F}_t]=\mathbb{E}[Z_{t+k}]=0$ via property (I), so that the conditional expectation coincides with the linear forecast.
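The reconstruction of a past innovation from observed values can be illustrated numerically. The following is a minimal sketch, not part of the proof, for an ARMA(1,1) model with hypothetical coefficients $\phi_1 = 0.6$, $\theta_1 = 0.5$; in that case $\pi_0 = 1$ and $\pi_j = -(\phi_1 + \theta_1)(-\theta_1)^{j-1}$ for $j \ge 1$. The helper names (`pi_weights`, `simulate_arma11`, `partial_sum`) are illustrative, and the simulation assumes zero pre-sample values rather than a stationary start.

```python
import random


def pi_weights(phi, theta, n_terms):
    """First n_terms coefficients pi_j of (1 - phi z) / (1 + theta z)."""
    weights = [1.0]
    for j in range(1, n_terms):
        weights.append(-(phi + theta) * (-theta) ** (j - 1))
    return weights[:n_terms]


def simulate_arma11(phi, theta, n, seed=0):
    """Simulate Y_t = phi Y_{t-1} + Z_t + theta Z_{t-1}.

    Assumption: pre-sample values Y_{-1} and Z_{-1} are set to zero.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = []
    for t in range(n):
        y_prev = y[t - 1] if t > 0 else 0.0
        z_prev = z[t - 1] if t > 0 else 0.0
        y.append(phi * y_prev + z[t] + theta * z_prev)
    return y, z


def partial_sum(y, s, phi, theta, n_terms):
    """S_N = sum_{j=0}^{N} pi_j Y_{s-j}; only values Y_u with 0 <= u <= s are used."""
    weights = pi_weights(phi, theta, n_terms + 1)
    return sum(w * y[s - j] for j, w in enumerate(weights) if s - j >= 0)


phi, theta = 0.6, 0.5  # hypothetical coefficients with |theta| < 1 (invertible)
y, z = simulate_arma11(phi, theta, n=200)
s = 150
# Absolute error |S_N - Z_s| for increasing truncation levels N.
errors = [abs(partial_sum(y, s, phi, theta, N) - z[s]) for N in (5, 20, 60)]
print(errors)
```

Each partial sum uses only values $Y_u$ with $u \le s$, mirroring the measurability argument, and the printed errors should shrink as $N$ grows because $|\pi_j|$ decays geometrically when $|\theta_1| < 1$.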
[/guided]
[/step]
[step:Condition the ARMA identity at index $t+h$ and collapse each term]
Fix $h \ge 1$. The defining ARMA identity, written at time index $t + h$, reads
\begin{align*}
Y_{t+h} = \sum_{i=1}^p \phi_i\, Y_{t+h-i} + Z_{t+h} + \sum_{j=1}^q \theta_j\, Z_{t+h-j}.
\end{align*}
Every random variable here lies in $L^2 \subseteq L^1$, so we may apply $P_t$ and use linearity **(L)**:
\begin{align*}
\hat{Y}_t(h) = P_t Y_{t+h} = \sum_{i=1}^p \phi_i\, P_t Y_{t+h-i} + P_t Z_{t+h} + \sum_{j=1}^q \theta_j\, P_t Z_{t+h-j}.
\end{align*}
We evaluate each term using the facts from the previous step.
*Autoregressive terms $P_t Y_{t+h-i}$ for $1 \le i \le p$.* If $h - i \ge 1$, then $Y_{t+(h-i)}$ is a future value and $P_t Y_{t+(h-i)} = \hat{Y}_t(h-i)$ by definition of the forecast. If $h - i \le 0$, then $t + (h-i) \le t$, so $Y_{t+h-i}$ is $\mathcal{F}_t$-measurable and $P_t Y_{t+h-i} = Y_{t+h-i}$ by **(M)**; under the convention $\hat{Y}_t(r) = Y_{t+r}$ for $r \le 0$ this again equals $\hat{Y}_t(h-i)$. In both cases
\begin{align*}
P_t Y_{t+h-i} = \hat{Y}_t(h - i).
\end{align*}
*Leading innovation $P_t Z_{t+h}$.* Since $h \ge 1$, $Z_{t+h}$ is independent of $\mathcal{F}_t$ with $\mathbb{E}[Z_{t+h}] = 0$, so $P_t Z_{t+h} = 0$ by **(I)**. With the convention $\hat{Z}_t(h) = 0$ for $h \ge 1$ this is $\hat{Z}_t(h)$.
*Moving-average terms $P_t Z_{t+h-j}$ for $1 \le j \le q$.* If $h - j \ge 1$, then $Z_{t+(h-j)}$ is a future innovation, independent of $\mathcal{F}_t$ with mean $0$, so $P_t Z_{t+h-j} = 0$ by **(I)**, which equals $\hat{Z}_t(h-j)$. If $h - j \le 0$, then $t + (h-j) \le t$, so $Z_{t+h-j}$ is $\mathcal{F}_t$-measurable and $P_t Z_{t+h-j} = Z_{t+h-j} = \hat{Z}_t(h-j)$ by **(M)**. In both cases
\begin{align*}
P_t Z_{t+h-j} = \hat{Z}_t(h - j).
\end{align*}
Substituting these three evaluations gives, for every $h \ge 1$,
\begin{align*}
\hat{Y}_t(h) = \sum_{i=1}^p \phi_i\, \hat{Y}_t(h - i) + \hat{Z}_t(h) + \sum_{j=1}^q \theta_j\, \hat{Z}_t(h - j),
\end{align*}
which is the asserted recursion.
[guided]
The strategy is to write the ARMA law $h$ steps into the future, at index $t+h$, and then take the conditional expectation $P_t = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ of both sides. Because $P_t$ is linear **(L)**, the conditional expectation of the sum is the sum of the conditional expectations, and the only work left is to classify each individual term.
The classification is governed entirely by the sign of the time offset of each index relative to the present $t$. An index $> t$ is in the future; an index $\le t$ is observed.
- **Autoregressive part.** The term $Y_{t+h-i}$ has offset $h - i$. If $h - i \ge 1$ the value is still in the future and its conditional expectation is, by definition, the forecast $\hat{Y}_t(h-i)$ — this is the recursive coupling that makes long-horizon forecasts depend on shorter-horizon ones. If $h - i \le 0$ the value has already been observed, so it is $\mathcal{F}_t$-measurable and property (M) returns it unchanged. The convention $\hat{Y}_t(r) := Y_{t+r}$ for $r \le 0$ is chosen precisely so these two cases read identically as $\hat{Y}_t(h-i)$; it is not an extra assumption but a bookkeeping device consistent with (M), since $\mathbb{E}[Y_{t+r}\mid\mathcal{F}_t] = Y_{t+r}$ when $r \le 0$.
- **Leading innovation.** The shock $Z_{t+h}$ driving $Y_{t+h}$ always has offset $h \ge 1$, so it is a genuine future innovation: independent of $\mathcal{F}_t$ (Claim 2) and centred, hence killed by property (I). This is the analytic content of "replace future innovations by $0$."
- **Moving-average part.** The term $Z_{t+h-j}$ has offset $h - j$. If $h - j \ge 1$ it is a future shock and is annihilated by (I), matching $\hat{Z}_t(h-j) = 0$. If $h - j \le 0$ it is a past shock, $\mathcal{F}_t$-measurable by Claim 1, and (M) returns it unchanged, matching $\hat{Z}_t(h-j) = Z_{t+h-j}$. Again the convention for $\hat{Z}_t$ merges the two cases.
Substituting the three evaluated pieces into the linear decomposition of $P_t Y_{t+h}$ gives the single recursion valid for all $h \ge 1$. Notice how the two structural claims did all the heavy lifting: every term was kept, replaced by an earlier forecast, or deleted, with no residual error, which is exactly why the practical forecasting rule is exact rather than approximate.
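As a worked instance, consider the special case $p = q = 1$ (an illustrative choice, not part of the theorem statement). The conventions $\hat{Y}_t(0) = Y_t$, $\hat{Z}_t(0) = Z_t$ and $\hat{Z}_t(r) = 0$ for $r \ge 1$ reduce the recursion to
\begin{align*}
\hat{Y}_t(1) &= \phi_1\, \hat{Y}_t(0) + \hat{Z}_t(1) + \theta_1\, \hat{Z}_t(0) = \phi_1\, Y_t + \theta_1\, Z_t, \\
\hat{Y}_t(2) &= \phi_1\, \hat{Y}_t(1) + \hat{Z}_t(2) + \theta_1\, \hat{Z}_t(1) = \phi_1\, \hat{Y}_t(1), \\
\hat{Y}_t(h) &= \phi_1\, \hat{Y}_t(h-1) = \phi_1^{\,h-1}\, \hat{Y}_t(1) \qquad (h \ge 2).
\end{align*}
The moving-average term contributes only at $h = 1$; for $h \ge 2$ every innovation in the recursion is a future one, so the forecast is driven by the autoregressive part alone and decays geometrically when $|\phi_1| < 1$.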
[/guided]
[/step]
[step:Read off the one-step forecast]
Take $h = 1$ in the recursion. For each $1 \le i \le p$ the offset $1 - i \le 0$, so $\hat{Y}_t(1 - i) = Y_{t+1-i}$ by the convention. The leading term is $\hat{Z}_t(1) = 0$. For each $1 \le j \le q$ the offset $1 - j \le 0$, so $\hat{Z}_t(1 - j) = Z_{t+1-j}$. Hence
\begin{align*}
\hat{Y}_t(1) = \sum_{i=1}^p \phi_i\, Y_{t+1-i} + \sum_{j=1}^q \theta_j\, Z_{t+1-j},
\end{align*}
a quantity built solely from observed values $Y_{t}, \dots, Y_{t+1-p}$ and past innovations $Z_t, \dots, Z_{t+1-q}$, all of which are $\mathcal{F}_t$-measurable by Step 2.
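Subtracting this formula from the ARMA identity at index $t+1$ leaves $Y_{t+1} - \hat{Y}_t(1) = Z_{t+1}$, so the one-step forecast error is exactly the next innovation. The sketch below checks this numerically on a simulated path; the ARMA(2,1) coefficients, the seed, and the function name `one_step_forecast` are hypothetical choices for illustration, and the simulation assumes zero pre-sample values.

```python
import random


def one_step_forecast(y, z, t, phi, theta):
    """Y-hat_t(1) = sum_i phi_i Y_{t+1-i} + sum_j theta_j Z_{t+1-j}."""
    ar = sum(c * y[t + 1 - i] for i, c in enumerate(phi, start=1))
    ma = sum(c * z[t + 1 - j] for j, c in enumerate(theta, start=1))
    return ar + ma


# Hypothetical ARMA(2,1) coefficients inside the causal/invertible region.
phi, theta = [0.5, -0.3], [0.4]
rng = random.Random(1)
n = 300
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.0] * n
for t in range(n):
    # Assumption: values with negative time index are treated as zero.
    ar = sum(c * y[t - i] for i, c in enumerate(phi, start=1) if t - i >= 0)
    ma = sum(c * z[t - j] for j, c in enumerate(theta, start=1) if t - j >= 0)
    y[t] = ar + z[t] + ma

t = 200
forecast = one_step_forecast(y, z, t, phi, theta)
residual = y[t + 1] - forecast  # one-step forecast error
print(abs(residual - z[t + 1]))  # discrepancy from Z_{t+1}, up to rounding
```

Only entries of `y` and `z` with index at most `t` enter `one_step_forecast`, in line with the $\mathcal{F}_t$-measurability of the formula.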
[/step]
[step:Conclude by induction that every forecast depends only on the observed past]
We show by strong induction on $h \ge 1$ that $\hat{Y}_t(h)$ is a (deterministic, linear) function of the observed data $\{Y_s : s \le t\}$ and $\{Z_s : s \le t\}$ together with the earlier forecasts $\{\hat{Y}_t(r) : 1 \le r < h\}$.
*Base case $h = 1$.* The previous step expresses $\hat{Y}_t(1)$ as a linear combination of $Y_{t+1-i}$ ($1 \le i \le p$) and $Z_{t+1-j}$ ($1 \le j \le q$), all with time index $\le t$, hence all observed. No earlier forecast is needed.
*Inductive step.* Fix $h \ge 2$ and assume the statement for all $1 \le r < h$. By the recursion of Step 3,
\begin{align*}
\hat{Y}_t(h) = \sum_{i=1}^p \phi_i\, \hat{Y}_t(h - i) + \hat{Z}_t(h) + \sum_{j=1}^q \theta_j\, \hat{Z}_t(h - j).
\end{align*}
Classify the right-hand side. Each autoregressive term $\hat{Y}_t(h-i)$ with $h - i \ge 1$ is an earlier forecast with horizon $h - i < h$, available by the inductive hypothesis; each with $h - i \le 0$ equals the observed value $Y_{t+h-i}$ (index $\le t$). The leading term is $\hat{Z}_t(h) = 0$. Each moving-average term $\hat{Z}_t(h-j)$ with $h - j \ge 1$ equals $0$, while each with $h - j \le 0$ equals the observed innovation $Z_{t+h-j}$ (index $\le t$). Thus $\hat{Y}_t(h)$ is a linear combination of observed values, observed innovations, and earlier forecasts, completing the induction.
Since the recursion determines $\hat{Y}_t(1), \hat{Y}_t(2), \dots$ successively, every $h$-step forecast is obtained from the ARMA recursion at index $t+h$ by replacing the future innovations $Z_{t+1}, \dots, Z_{t+h}$ by $0$ and the future values $Y_{t+1}, \dots, Y_{t+h-1}$ by their already-computed forecasts, while retaining the observed values and past innovations. This is precisely the recursive construction asserted in the theorem.
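The successive construction can be sketched in code. The following is a minimal illustration under stated assumptions, not a reference implementation: the function name `arma_forecasts` and the sample data are hypothetical, and the past innovations are taken as given (in practice they would first have to be reconstructed from the observations).

```python
def arma_forecasts(y, z, phi, theta, horizon):
    """Return [Y-hat_t(1), ..., Y-hat_t(horizon)], where t is the last index of y.

    y and z hold the observed values and past innovations up to time t.
    """
    if len(y) != len(z) or len(y) < max(len(phi), len(theta), 1):
        raise ValueError("need equally long y, z with at least max(p, q) entries")
    t = len(y) - 1
    forecasts = {}  # horizon r >= 1 -> Y-hat_t(r)

    def y_hat(r):
        # Convention: Y-hat_t(r) = Y_{t+r} for r <= 0.
        return forecasts[r] if r >= 1 else y[t + r]

    def z_hat(r):
        # Convention: Z-hat_t(r) = 0 for r >= 1 and Z_{t+r} for r <= 0.
        return 0.0 if r >= 1 else z[t + r]

    for h in range(1, horizon + 1):
        ar = sum(c * y_hat(h - i) for i, c in enumerate(phi, start=1))
        ma = sum(c * z_hat(h - j) for j, c in enumerate(theta, start=1))
        forecasts[h] = ar + z_hat(h) + ma
    return [forecasts[h] for h in range(1, horizon + 1)]


# Hypothetical ARMA(1,1) data; only the last value and innovation matter here.
y_obs, z_obs = [0.2, -1.0, 1.5], [0.1, -0.9, 2.0]
print(arma_forecasts(y_obs, z_obs, phi=[0.6], theta=[0.5], horizon=3))
```

The loop fills `forecasts` in increasing order of `h`, so every lookup `y_hat(h - i)` with `h - i >= 1` refers to a forecast that has already been computed, which is the content of the induction.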
[/step]