What brings you to Androma?

Time Series Analysis studies data observed sequentially over time, with an emphasis on how dependence across observations shapes modeling, inference, and prediction. The course develops the probabilistic foundations needed to understand temporal structure, from basic notions of stationarity and correlation to more advanced models for persistence, seasonality, changing variance, and latent dynamics. Along the way, it connects theoretical tools with the practical goal of building models that forecast well and explain the mechanisms behind observed data. The main themes are second-order structure, linear modeling, spectral methods, and state-space inference. The early chapters establish the language of stationary processes and linear processes, then move to ARMA modeling, estimation, prediction, and model selection. From there, the course broadens to nonstationary phenomena through differencing and ARIMA models, then deepens the theory with Wold decomposition and linear prediction. Later chapters shift between time and frequency domains, introducing spectral analysis and frequency-domain inference, before covering state space models, the Kalman filter, and conditional heteroskedasticity via GARCH. The chapters are arranged to build progressively from core concepts to integrated applications. After the foundational material, each unit extends the previous one: ARMA models motivate estimation and forecasting; nonstationarity leads naturally to ARIMA; linear prediction theory explains why these methods work; spectral methods provide an alternative representation of dependence; state-space methods unify filtering and latent-variable modeling; and GARCH addresses time-varying volatility. The final chapter brings these strands together through an integrated workflow and case studies, showing how the theory supports end-to-end analysis of real time series problems. # Introduction These notes study random quantities observed in time order: daily returns, quarterly inflation, hourly demand, signal measurements, and many other data streams where dependence across time is the central feature. The course develops the probabilistic structure behind such sequences and then turns that structure into statistical models for prediction, filtering, and inference. The organizing theme is that useful time-series models separate persistent dependence, transient shocks, frequency content, and changing conditional variance. The main object will be a discrete-time stochastic process $(X_t)_{t \in \mathbb Z}$ or $(X_t)_{t \ge 1}$ on a probability space $(\Omega, \mathcal F, \mathbb P)$. We usually observe one finite stretch $X_1,\dots,X_n$, so the theory must explain what can be learned from a single dependent trajectory. This is the first conceptual difference from independent-sample statistics: time order is information, and dependence is not a nuisance to be averaged away. ## What Makes Time Series Different from Independent Data In independent-sample models, repeating the experiment produces a collection $Y_1,\dots,Y_n$ whose joint law is controlled by a common marginal distribution and independence. For a time series, the marginal distribution of $X_t$ is only a small part of the model: the joint behaviour of $(X_s,X_t)$, $(X_r,X_s,X_t)$, and longer blocks contains information about persistence, seasonality, cycles, volatility clustering, and predictability. [definition: Discrete Time Series] A discrete time series is a stochastic process $(X_t)_{t \in T}$, where $T$ is either $\mathbb Z$, $\mathbb N$, or a finite subset of consecutive integers, and each $X_t$ is a random variable on a common probability space $(\Omega, \mathcal F, \mathbb P)$. [/definition] The index $t$ is part of the structure. Reordering $X_1,\dots,X_n$ usually changes the statistical question, because lags such as $X_t$ and $X_{t-1}$ encode temporal dependence. [example: Independent Noise and Persistent Signal] Let $(Z_t)_{t\in\mathbb Z}$ be i.i.d. with $\mathbb E[Z_t]=0$ and $\operatorname{Var}(Z_t)=\sigma^2$. For $h\ne 0$, the variables $Z_t$ and $Z_{t-h}$ are independent, so \begin{align*} \operatorname{Cov}(Z_t,Z_{t-h}) &=\mathbb E[Z_tZ_{t-h}]-\mathbb E[Z_t]\mathbb E[Z_{t-h}]\\ &=\mathbb E[Z_t]\mathbb E[Z_{t-h}]-0\cdot 0\\ &=0. \end{align*} Thus independent noise has no linear dependence across nonzero lags. Now consider the stationary causal autoregression \begin{align*} X_t=\phi X_{t-1}+Z_t,\qquad |\phi|<1, \end{align*} written as \begin{align*} X_t=\sum_{j=0}^{\infty}\phi^j Z_{t-j}. \end{align*} Since $\sum_{j=0}^{\infty}|\phi|^j<\infty$ and $\mathbb E[Z_{t-j}]=0$, \begin{align*} \mathbb E[X_t] &=\sum_{j=0}^{\infty}\phi^j\mathbb E[Z_{t-j}]\\ &=\sum_{j=0}^{\infty}\phi^j\cdot 0\\ &=0. \end{align*} For $h\ge 0$, \begin{align*} \operatorname{Cov}(X_t,X_{t-h}) &=\mathbb E[X_tX_{t-h}]\\ &=\mathbb E\left[\left(\sum_{j=0}^{\infty}\phi^j Z_{t-j}\right) \left(\sum_{k=0}^{\infty}\phi^k Z_{t-h-k}\right)\right]\\ &=\sum_{j=0}^{\infty}\sum_{k=0}^{\infty}\phi^{j+k} \mathbb E[Z_{t-j}Z_{t-h-k}]. \end{align*} The expectation $\mathbb E[Z_{t-j}Z_{t-h-k}]$ equals $\sigma^2$ when $t-j=t-h-k$, i.e. when $j=h+k$, and equals $0$ otherwise by independence and zero means. Hence \begin{align*} \operatorname{Cov}(X_t,X_{t-h}) &=\sum_{k=0}^{\infty}\phi^{(h+k)+k}\sigma^2\\ &=\sigma^2\phi^h\sum_{k=0}^{\infty}\phi^{2k}\\ &=\sigma^2\phi^h\frac{1}{1-\phi^2}. \end{align*} By symmetry of covariance, \begin{align*} \operatorname{Cov}(X_t,X_{t-h})=\frac{\sigma^2}{1-\phi^2}\phi^{|h|},\qquad h\in\mathbb Z. \end{align*} So both processes have mean zero, but their dependence structures differ: $Z_t$ has zero autocovariance at every nonzero lag, while the autoregression carries memory across lags through the factor $\phi^{|h|}$. [/example] This example previews a recurring course pattern. We start with a probabilistic description of dependence, derive covariance or spectral consequences, and then use those consequences to build estimators and predictors. ## The Modelling Questions A time-series model should answer several connected questions. What aspects of the distribution remain stable over time? Which transformations remove trends or changing level? How far into the future is prediction possible? Which frequencies dominate the series? How should uncertainty be updated when new observations arrive? [definition: Forecasting Problem] Given observations $X_1,\dots,X_n$ and an integer $h\ge1$, the $h$-step forecasting problem is to construct an estimator $\widehat X_{n+h}$ of $X_{n+h}$ using the information generated by the observed data. [/definition] The information generated by the observed data is often represented by $\mathcal F_n^X=\sigma(X_1,\dots,X_n)$. In the full probabilistic problem, the optimal square-integrable forecast is a conditional expectation. In linear time-series theory, the optimal linear forecast is an orthogonal projection in a Hilbert space. [quotetheorem:3621] [citeproof:3621] This theorem gives the ideal forecast when the joint law is known. The course then asks how close practical models can get when the law must be inferred from one observed path. [example: Forecasting an Autoregression] Suppose $X_t=\phi X_{t-1}+Z_t$, where $|\phi|<1$, $(Z_t)$ is i.i.d. with $\mathbb E[Z_t]=0$ and finite variance, and $Z_{n+1}$ is independent of $\mathcal F_n^X=\sigma(X_1,\dots,X_n)$. We compute the one-step conditional mean forecast by conditioning the recursion \begin{align*} X_{n+1}=\phi X_n+Z_{n+1}. \end{align*} Since $X_n$ is $\mathcal F_n^X$-measurable and $\phi$ is deterministic, \begin{align*} \mathbb E[X_{n+1}\mid \mathcal F_n^X] &=\mathbb E[\phi X_n+Z_{n+1}\mid \mathcal F_n^X]\\ &=\mathbb E[\phi X_n\mid \mathcal F_n^X]+\mathbb E[Z_{n+1}\mid \mathcal F_n^X]\\ &=\phi X_n+\mathbb E[Z_{n+1}\mid \mathcal F_n^X]. \end{align*} Independence of $Z_{n+1}$ from $\mathcal F_n^X$ gives \begin{align*} \mathbb E[Z_{n+1}\mid \mathcal F_n^X] &=\mathbb E[Z_{n+1}]\\ &=0. \end{align*} Therefore \begin{align*} \mathbb E[X_{n+1}\mid \mathcal F_n^X] &=\phi X_n+0\\ &=\phi X_n. \end{align*} The forecast keeps exactly the predictable autoregressive part $\phi X_n$ and discards the new innovation because its conditional mean, given the observed past, is zero. [/example] ## Stationarity as the Starting Assumption The first major theory block asks when time averages and lagged dependence can be treated as stable features of the process. A single path cannot reveal an arbitrary changing distribution at every time, so time-series analysis needs assumptions that connect different dates. Stationarity is the basic assumption of this kind. [definition: Strict Stationarity] A process $(X_t)_{t\in\mathbb Z}$ is strictly stationary if, for every $k\in\mathbb N$, every $t_1,\dots,t_k\in\mathbb Z$, and every $h\in\mathbb Z$, the random vectors $(X_{t_1},\dots,X_{t_k})$ and $(X_{t_1+h},\dots,X_{t_k+h})$ have the same distribution. [/definition] Strict stationarity says that all finite-dimensional distributions are invariant under time shifts. It is powerful, but often more than is needed for linear prediction and ARMA theory. For many questions the full joint law is less important than the behaviour of means, variances, and covariances. Linear forecasts, autocovariance plots, and spectral methods all depend only on these second-order quantities, so the course isolates the weaker stability assumption that is sufficient for them. The obstruction is that second-order calculations are not meaningful if the mean drifts with time or if the covariance between two observations depends on their calendar dates rather than their separation. To make autocovariances, spectra, and linear predictors reusable across the series, the formal condition must require a constant mean and lag-only covariance without requiring full distributional invariance. [definition: Weak Stationarity] A process $(X_t)_{t\in\mathbb Z}$ with $\mathbb E[X_t^2]<\infty$ for all $t$ is weakly stationary if there exists $\mu\in\mathbb R$ and a function $\gamma:\mathbb Z\to\mathbb R$ such that \begin{align*} \mathbb E[X_t]&=\mu,\\ \operatorname{Cov}(X_t,X_{t-h})&=\gamma(h) \end{align*} for all $t,h\in\mathbb Z$. [/definition] Weak stationarity is the natural setting for autocovariances, spectral density, and best linear prediction. Much of ARMA theory can be developed under second-order assumptions rather than full distributional invariance. [example: Random Walk as a Warning] Let $X_0=0$ and, for $t\ge 1$, let \begin{align*} X_t=X_{t-1}+Z_t, \end{align*} where $(Z_t)$ is i.i.d., $\mathbb E[Z_t]=0$, and $\operatorname{Var}(Z_t)=\sigma^2$. Iterating the recursion gives \begin{align*} X_1&=Z_1,\\ X_2&=X_1+Z_2=Z_1+Z_2,\\ X_3&=X_2+Z_3=Z_1+Z_2+Z_3, \end{align*} and therefore, by induction, \begin{align*} X_t=\sum_{j=1}^{t}Z_j. \end{align*} Its mean is \begin{align*} \mathbb E[X_t] &=\mathbb E\left[\sum_{j=1}^{t}Z_j\right]\\ &=\sum_{j=1}^{t}\mathbb E[Z_j]\\ &=\sum_{j=1}^{t}0\\ &=0. \end{align*} Since the innovations are independent, $\operatorname{Cov}(Z_i,Z_j)=0$ for $i\ne j$, while $\operatorname{Cov}(Z_j,Z_j)=\operatorname{Var}(Z_j)=\sigma^2$. Hence \begin{align*} \operatorname{Var}(X_t) &=\operatorname{Var}\left(\sum_{j=1}^{t}Z_j\right)\\ &=\sum_{i=1}^{t}\sum_{j=1}^{t}\operatorname{Cov}(Z_i,Z_j)\\ &=\sum_{j=1}^{t}\operatorname{Cov}(Z_j,Z_j)\\ &=\sum_{j=1}^{t}\sigma^2\\ &=t\sigma^2. \end{align*} Thus the variance depends on the time index $t$. Since weak stationarity requires the second-order distributional quantities, in particular $\operatorname{Var}(X_t)=\operatorname{Cov}(X_t,X_t)$, to be constant over time, the random walk is not weakly stationary even though its increments have the same mean and variance at every date. [/example] The random walk is central because differencing turns it into stationary noise. This is the first appearance of the ARIMA idea: nonstationarity may be modelled by applying a difference operator before fitting stationary dynamics. ## Autocovariance, Linear Prediction, and Spectra Once stationarity fixes the meaning of a lag, the next problem is how to encode dependence across lags. The autocovariance function records second-order time dependence, but it is constrained by positivity conditions because it must come from variances of linear combinations. [definition: Autocovariance Function] For a weakly stationary process $(X_t)_{t\in\mathbb Z}$ with mean $\mu$, the autocovariance function is \begin{align*} \gamma(h)=\operatorname{Cov}(X_t,X_{t-h})=\mathbb E[(X_t-\mu)(X_{t-h}-\mu)],\qquad h\in\mathbb Z. \end{align*} [/definition] The autocovariance function is not an arbitrary sequence. If $\gamma$ is to describe a genuine process, every finite linear combination of observations must have nonnegative variance. This requirement is exactly the positivity constraint that separates valid covariance kernels from plausible-looking but impossible lag summaries. Before using proposed covariance sequences in models, we need a criterion that can be checked without constructing the whole process first. The quoted result supplies that criterion by characterising the positive-definiteness condition forced by nonnegative variances. [quotetheorem:3622] [citeproof:3622] The condition is useful because it converts a modelling question into a check on quadratic forms. It also explains why autocovariance sequences have a harmonic representation: positive-definite lag functions are the objects that Fourier analysis can represent by nonnegative frequency content. The next viewpoint asks how the same second-order dependence looks when decomposed by frequency rather than by lag. Instead of reading $\gamma(h)$ as a sequence of covariances, spectral analysis records how much variance is carried by oscillations at each angular frequency. [definition: Spectral Density] Let $(X_t)_{t\in\mathbb Z}$ be weakly stationary with autocovariance function $\gamma$. A spectral density is an integrable function $f:[-\pi,\pi]\to[0,\infty)$ such that \begin{align*} \gamma(h)=\int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\,d\lambda \end{align*} for every $h\in\mathbb Z$. [/definition] When a spectral density exists, peaks in $f$ indicate frequencies that contribute strongly to the variance of the process. The course uses this viewpoint to connect ARMA models, filters, periodograms, and frequency-domain inference. [example: Periodic Component with Random Phase] Let $X_t=A\cos(\lambda t+U)$, where $A\in\mathbb R$, $\lambda\in(0,\pi)$, and $U\sim\operatorname{Unif}(0,2\pi)$. Since $|X_t|\le |A|$, each $X_t$ has finite second moment. We first compute the mean by using the density $(2\pi)^{-1}$ of $U$: \begin{align*} \mathbb E[X_t] &=A\,\mathbb E[\cos(\lambda t+U)]\\ &=\frac{A}{2\pi}\int_0^{2\pi}\cos(\lambda t+u)\,du\\ &=\frac{A}{2\pi}\left[\sin(\lambda t+u)\right]_{u=0}^{u=2\pi}\\ &=\frac{A}{2\pi}\left(\sin(\lambda t+2\pi)-\sin(\lambda t)\right)\\ &=0. \end{align*} For $h\in\mathbb Z$, the zero mean gives $\operatorname{Cov}(X_t,X_{t-h})=\mathbb E[X_tX_{t-h}]$. Using the identity $\cos a\cos b=\frac12(\cos(a-b)+\cos(a+b))$ with $a=\lambda t+U$ and $b=\lambda(t-h)+U$, \begin{align*} \operatorname{Cov}(X_t,X_{t-h}) &=A^2\mathbb E[\cos(\lambda t+U)\cos(\lambda(t-h)+U)]\\ &=\frac{A^2}{2}\mathbb E\left[\cos(\lambda h)+\cos(2\lambda t-\lambda h+2U)\right]\\ &=\frac{A^2}{2}\cos(\lambda h) +\frac{A^2}{2}\mathbb E[\cos(2\lambda t-\lambda h+2U)]. \end{align*} The remaining expectation is \begin{align*} \mathbb E[\cos(2\lambda t-\lambda h+2U)] &=\frac{1}{2\pi}\int_0^{2\pi}\cos(2\lambda t-\lambda h+2u)\,du\\ &=\frac{1}{2\pi}\left[\frac12\sin(2\lambda t-\lambda h+2u)\right]_{u=0}^{u=2\pi}\\ &=\frac{1}{4\pi}\left(\sin(2\lambda t-\lambda h+4\pi)-\sin(2\lambda t-\lambda h)\right)\\ &=0. \end{align*} Therefore \begin{align*} \operatorname{Cov}(X_t,X_{t-h}) &=\frac{A^2}{2}\cos(\lambda h). \end{align*} The mean is constant in $t$, and the covariance depends on the pair $(t,t-h)$ only through the lag $h$, so $(X_t)$ is weakly stationary with autocovariance $\gamma(h)=\frac{A^2}{2}\cos(\lambda h)$. The random phase removes dependence on calendar time, while the covariance still oscillates at frequency $\lambda$. [/example] ## ARIMA Models as a Course Spine The course is centered on ARIMA modelling because it provides a compact language for many dependent series encountered in applications. Autoregressive terms describe feedback from the past, moving-average terms describe finite memory of shocks, and differencing handles certain forms of nonstationary level behaviour. [definition: Backshift Operator] For a time series $(X_t)$, the backshift operator $B$ acts by \begin{align*} BX_t=X_{t-1}. \end{align*} [/definition] Backshift notation turns recursions into algebraic-looking equations. For example, an autoregression $X_t=\phi X_{t-1}+Z_t$ becomes $(1-\phi B)X_t=Z_t$, and differencing is written $(1-B)X_t=X_t-X_{t-1}$. With this notation in place, the course can state one model class that contains autoregressions, moving averages, and integrated nonstationary series. The definition packages three modelling choices at once: how many past values enter, how many past shocks enter, and how many differences are needed before stationary ARMA behaviour is expected. The formal definition is needed because later estimation and forecasting procedures refer to the orders $(p,d,q)$ and the polynomials $\phi$ and $\theta$ rather than to a separate verbal description of each special case. It fixes that shared notation before examples and root conditions appear. [definition: ARIMA Model] Let $(Z_t)_{t\in\mathbb Z}$ be white noise. A process $(X_t)$ is an $\operatorname{ARIMA}(p,d,q)$ process if \begin{align*} \phi(B)(1-B)^dX_t=\theta(B)Z_t, \end{align*} where $d\in\mathbb N\cup\{0\}$ and \begin{align*} \phi(z)&=1-\phi_1z-\cdots-\phi_pz^p,\\ \theta(z)&=1+\theta_1z+\cdots+\theta_qz^q. \end{align*} [/definition] The definition is deliberately terse; the substance lies in the conditions under which the equation has a stationary causal solution after differencing. Later chapters make the words causal, invertible, and identifiable precise. [example: Differencing a Random Walk] Let $X_t=X_{t-1}+Z_t$, where $(Z_t)$ is white noise. We show that $(X_t)$ satisfies the defining equation of an $\operatorname{ARIMA}(0,1,0)$ process. By the definition of the backshift operator, $BX_t=X_{t-1}$, so \begin{align*} (1-B)X_t &=X_t-BX_t\\ &=X_t-X_{t-1}\\ &=(X_{t-1}+Z_t)-X_{t-1}\\ &=Z_t. \end{align*} For $\operatorname{ARIMA}(0,1,0)$, the polynomials are $\phi(z)=1$ and $\theta(z)=1$, and $d=1$. Hence the ARIMA equation \begin{align*} \phi(B)(1-B)^dX_t=\theta(B)Z_t \end{align*} becomes \begin{align*} 1\cdot(1-B)^1X_t=1\cdot Z_t, \end{align*} which is exactly \begin{align*} (1-B)X_t=Z_t. \end{align*} Thus the random walk itself is not the stationary noise sequence; its first difference is the white-noise increment process. [/example] ## Inference from One Dependent Path The statistical part of the course asks how parameters, forecasts, and uncertainty bands can be estimated from $X_1,\dots,X_n$. Dependence changes both the estimators and the asymptotic arguments: effective sample size, long-run variance, and model diagnostics all depend on serial correlation. [definition: Sample Autocovariance] Given observations $X_1,\dots,X_n$ and sample mean $\bar X_n=n^{-1}\sum_{t=1}^{n}X_t$, the sample autocovariance at lag $h\ge0$ is \begin{align*} \widehat\gamma_n(h)=\frac{1}{n}\sum_{t=h+1}^{n}(X_t-\bar X_n)(X_{t-h}-\bar X_n). \end{align*} [/definition] This estimator is biased in finite samples, but it is a basic diagnostic and a building block for many frequency-domain methods. The course will distinguish descriptive plots from assumptions justified by a probabilistic model. The deeper issue is whether one observed path can consistently reveal population quantities at all. Since time-series data are not independent replications, convergence of sample averages requires a replacement for the usual law of large numbers, and that replacement is expressed through stationarity plus an ergodic condition. [quotetheorem:3623] This is the stationary ergodic theorem applied to the coordinate process. The result explains why stationarity alone is not enough for learning from one path; ergodicity rules out persistent hidden regimes that prevent time averages from identifying ensemble means. [example: Non-Ergodic Stationary Sequence] Let $Y$ be Bernoulli with parameter $1/2$, so $\mathbb P(Y=1)=1/2$ and $\mathbb P(Y=0)=1/2$, and define $X_t=Y$ for every $t\in\mathbb Z$. We show that $(X_t)$ is strictly stationary but that its time average does not converge to the constant $\mathbb E[X_0]=1/2$. Fix $k\in\mathbb N$, times $t_1,\dots,t_k\in\mathbb Z$, and a shift $h\in\mathbb Z$. Since every coordinate is equal to the same random variable $Y$, \begin{align*} (X_{t_1},\dots,X_{t_k}) &=(Y,\dots,Y),\\ (X_{t_1+h},\dots,X_{t_k+h}) &=(Y,\dots,Y). \end{align*} Therefore the two random vectors are equal as random vectors, and hence have the same distribution. This proves strict stationarity. The mean of $X_0$ is the mean of $Y$: \begin{align*} \mathbb E[X_0] &=\mathbb E[Y]\\ &=0\cdot \mathbb P(Y=0)+1\cdot \mathbb P(Y=1)\\ &=0\cdot \frac12+1\cdot \frac12\\ &=\frac12. \end{align*} For every $n\ge1$, the sample average is \begin{align*} \frac1n\sum_{t=1}^{n}X_t &=\frac1n\sum_{t=1}^{n}Y\\ &=\frac1n\cdot nY\\ &=Y. \end{align*} Thus \begin{align*} \lim_{n\to\infty}\frac1n\sum_{t=1}^{n}X_t=Y. \end{align*} Since $Y$ takes only the values $0$ and $1$, \begin{align*} \mathbb P\left(Y=\frac12\right)=0, \end{align*} so the time average does not converge almost surely to $\mathbb E[X_0]=1/2$. The sequence is stationary because its joint laws do not change under time shifts, but one observed path remains trapped forever in the hidden regime $Y=0$ or $Y=1$. [/example] This example marks a theme that returns throughout the course: a time-series assumption is not merely a property of a plotted sequence. It is a probabilistic statement connecting the observed path to hypothetical repetitions of the whole process. ## State Space Models and Changing Volatility ARIMA models give a strong foundation, but many modern applications need hidden states, noisy measurements, or time-varying conditional variance. The later part of the course introduces state space representations, Kalman filtering, and conditional heteroskedasticity as extensions of the same prediction principle. [definition: State Space Model] A state space model consists of an unobserved process $(\alpha_t)$ and an observed process $(Y_t)$ satisfying equations of the form \begin{align*} \alpha_{t+1}&=F_t\alpha_t+G_t\eta_t,\\ Y_t&=H_t\alpha_t+\varepsilon_t, \end{align*} where $F_t,G_t,H_t$ are matrices of compatible dimensions and $(\eta_t)$, $(\varepsilon_t)$ are noise processes. [/definition] State space notation separates the evolution of the latent signal from the measurement equation. The Kalman filter is the recursive conditional-mean algorithm for the linear Gaussian case, and its derivation is another application of Hilbert-space projection. A different extension keeps the observed series in view but allows the uncertainty itself to evolve over time. This is needed when the main dependence is not in the conditional mean but in bursts of volatility, so the formal object to define is the conditional variance given the past. The definition below isolates the phenomenon before any ARCH or GARCH formula is imposed. It gives a model-free way to say that predictability can live in the scale of the next observation even when the conditional mean contains little signal. [definition: Conditional Heteroskedasticity] A time series $(X_t)$ has conditional heteroskedasticity relative to a filtration $(\mathcal F_t)$ if the conditional variance $\operatorname{Var}(X_t\mid\mathcal F_{t-1})$ is not constant in $t$ or not deterministic. [/definition] This concept addresses data, especially financial returns, where conditional means may be weak but conditional variances show strong dependence. ARCH and GARCH models replace the question "Can we predict the sign or level?" with "Can we predict the scale of uncertainty?" [example: ARCH Motivation] Let $\mathcal F_{t-1}=\sigma(X_s:s\le t-1)$, and suppose $Z_t$ is independent of $\mathcal F_{t-1}$. Since $\sigma_t^2=\alpha_0+\alpha_1X_{t-1}^2$ is a function of $X_{t-1}$, the random variable $\sigma_t$ is $\mathcal F_{t-1}$-measurable. We compute the conditional mean: \begin{align*} \mathbb E[X_t\mid\mathcal F_{t-1}] &=\mathbb E[\sigma_t Z_t\mid\mathcal F_{t-1}]\\ &=\sigma_t\,\mathbb E[Z_t\mid\mathcal F_{t-1}]\\ &=\sigma_t\,\mathbb E[Z_t]\\ &=\sigma_t\cdot 0\\ &=0. \end{align*} Thus the process has no predictable conditional mean. For the conditional variance, use the conditional mean just computed: \begin{align*} \operatorname{Var}(X_t\mid\mathcal F_{t-1}) &=\mathbb E\left[(X_t-\mathbb E[X_t\mid\mathcal F_{t-1}])^2\mid\mathcal F_{t-1}\right]\\ &=\mathbb E[X_t^2\mid\mathcal F_{t-1}]\\ &=\mathbb E[\sigma_t^2 Z_t^2\mid\mathcal F_{t-1}]\\ &=\sigma_t^2\,\mathbb E[Z_t^2\mid\mathcal F_{t-1}]\\ &=\sigma_t^2\,\mathbb E[Z_t^2]. \end{align*} Since $\mathbb E[Z_t]=0$ and $\operatorname{Var}(Z_t)=1$, \begin{align*} \mathbb E[Z_t^2] &=\operatorname{Var}(Z_t)+(\mathbb E[Z_t])^2\\ &=1+0^2\\ &=1. \end{align*} Therefore \begin{align*} \operatorname{Var}(X_t\mid\mathcal F_{t-1}) &=\sigma_t^2\\ &=\alpha_0+\alpha_1X_{t-1}^2. \end{align*} The conditional mean is zero, but the conditional variance reacts to the previous squared observation; when $\alpha_1>0$ and $X_{t-1}^2$ is not almost surely constant, the scale of uncertainty changes over time. [/example] ## How to Read the Course The course proceeds in a sequence of translations. Stationary probability models translate into autocovariances; autocovariances translate into prediction equations and spectra; ARMA equations translate into filters; state space models translate into recursive updates; heteroskedastic models translate into conditional variance forecasts. Each translation preserves part of the underlying probability law while making a different task computationally accessible. The reader should keep three levels separate. The process $(X_t)$ is the probabilistic object. The observed path $X_1,\dots,X_n$ is the data. The fitted ARIMA, state space, or GARCH model is a statistical approximation whose adequacy must be checked through residuals, diagnostics, and forecast performance. [remark: Course Convention] Unless stated otherwise, time is discrete, random variables are real-valued, and filtrations are generated by the observed process. Expectations are written as $\mathbb E[X]$, variances as $\operatorname{Var}(X)$, and conditional expectations as $\mathbb E[X\mid\mathcal G]$. [/remark] Chapter 0 established the language we will use throughout: stochastic processes observed in time, conditional expectation, and the basic questions of dependence and prediction. Chapter 1 now makes those ideas precise by introducing stationarity, second-order structure, and the geometry of linear prediction that will support everything that follows. # 1. Stationary Time Series and Second-Order Structure As introduced in Chapter 0, a time series is a stochastic process observed in a fixed order, usually time. The first formal question is how to describe dependence without trying to specify a full joint distribution for infinitely many random variables. Stationarity supplies the central simplification: although individual observations change, the probabilistic rules governing their dependence do not change with calendar time. This chapter introduces the strict and second-order versions of that idea, explains the covariance sequences that drive linear prediction, and separates mathematical stationarity from the empirical question of whether a model is useful for data. ## Describing Dependence Without Tracking Absolute Time If observations are indexed by $t \in \mathbb Z$, then dependence can involve every finite collection $(X_{t_1},\dots,X_{t_n})$. A usable model should say which aspects of that dependence remain stable when all observation times are shifted together. Strict stationarity records invariance of all finite-dimensional distributions, while weak stationarity keeps only the first two moments and is the natural assumption for linear prediction and spectral analysis. [definition: Strict Stationarity] A stochastic process $(X_t)_{t \in \mathbb Z}$ with values in a measurable space $(E,\mathcal E)$ is strictly stationary if for every $n \in \mathbb N$, every $t_1,\dots,t_n \in \mathbb Z$, and every $h \in \mathbb Z$, the random vectors $(X_{t_1},\dots,X_{t_n})$ and $(X_{t_1+h},\dots,X_{t_n+h})$ have the same distribution. [/definition] Strict stationarity is a distributional symmetry. It says that the origin of time has no probabilistic meaning, so a block observed during days $1$ to $n$ has the same law as the corresponding block observed during days $101$ to $100+n$. [example: Gaussian White Noise] Let $(Z_t)_{t\in\mathbb Z}$ be i.i.d. with $Z_t\sim\mathcal N(0,\sigma^2)$, where $\sigma^2>0$. We show that the process is strictly stationary and that distinct times have zero covariance. Fix $n\in\mathbb N$, times $t_1,\dots,t_n\in\mathbb Z$, and a shift $h\in\mathbb Z$. Let $u_1,\dots,u_m$ be the distinct values among $t_1,\dots,t_n$, and write $t_i=u_{\pi(i)}$. Since the variables are i.i.d., \begin{align*} (Z_{u_1},\dots,Z_{u_m}) \end{align*} has product law $\mathcal N(0,\sigma^2)^{\otimes m}$. The shifted indices $u_1+h,\dots,u_m+h$ are also distinct, so \begin{align*} (Z_{u_1+h},\dots,Z_{u_m+h}) \end{align*} has the same product law. Applying the same coordinate-repetition map \begin{align*} (x_1,\dots,x_m)\mapsto (x_{\pi(1)},\dots,x_{\pi(n)}) \end{align*} to both vectors gives \begin{align*} (Z_{t_1},\dots,Z_{t_n}) \stackrel{d}{=} (Z_{t_1+h},\dots,Z_{t_n+h}). \end{align*} Thus $(Z_t)_{t\in\mathbb Z}$ is strictly stationary. For second moments, $\mathbb E[Z_t]=0$ and $\mathbb E[Z_t^2]=\sigma^2$ for every $t$. If $t\ne s$, independence gives \begin{align*} \operatorname{Cov}(Z_t,Z_s) &=\mathbb E[(Z_t-\mathbb E[Z_t])(Z_s-\mathbb E[Z_s])]\\ &=\mathbb E[Z_tZ_s]\\ &=\mathbb E[Z_t]\mathbb E[Z_s]\\ &=0\cdot 0\\ &=0. \end{align*} For $t=s$, \begin{align*} \operatorname{Var}(Z_t) =\mathbb E[(Z_t-0)^2] =\mathbb E[Z_t^2] =\sigma^2. \end{align*} Hence its autocovariance is $\gamma(0)=\sigma^2$ and $\gamma(h)=0$ for $h\ne0$, so Gaussian white noise is stationary while carrying no linear dependence across distinct times. [/example] For much of time series analysis, it is too demanding and unnecessary to control all finite-dimensional distributions. Prediction based on linear combinations uses only means, variances, and covariances, so the second-order formulation is the main working assumption. [definition: Weak Stationarity] A real-valued process $(X_t)_{t \in \mathbb Z}$ is weakly stationary if $\mathbb E[X_t^2]<\infty$ for all $t \in \mathbb Z$, there exists $\mu \in \mathbb R$ such that $\mathbb E[X_t]=\mu$ for all $t \in \mathbb Z$, and there exists a function $\gamma:\mathbb Z\to\mathbb R$ such that $\operatorname{Cov}(X_{t+h},X_t)=\gamma(h)$ for all $t,h \in \mathbb Z$. [/definition] The function $\gamma$ records dependence by lag rather than by absolute time. The variance is $\gamma(0)$, and the symmetry of covariance gives $\gamma(-h)=\gamma(h)$ for real-valued processes. To use this lag function in calculations, it is helpful to name it separately and fix the convention for which pair of times appears in the covariance. That convention matters when comparing formulas across prediction, spectral analysis, and sample estimates. [definition: Autocovariance Function] For a weakly stationary real-valued process $(X_t)_{t \in \mathbb Z}$ with mean $\mu$, the autocovariance function is \begin{align*} \gamma(h) = \mathbb E[(X_{t+h}-\mu)(X_t-\mu)], \qquad h \in \mathbb Z. \end{align*} [/definition] When $\gamma(0)>0$, it is often more interpretable to divide by the variance and work with correlations. Raw covariances carry the units of the observations, so two series measured on different scales can have covariance sequences that are hard to compare. Normalising by $\gamma(0)$ removes this scale and leaves a lag-by-lag measure of linear dependence. [definition: Autocorrelation Function] For a weakly stationary real-valued process $(X_t)_{t \in \mathbb Z}$ with autocovariance function $\gamma$ and $\gamma(0)>0$, the autocorrelation function is \begin{align*} \rho(h)=\frac{\gamma(h)}{\gamma(0)}, \qquad h \in \mathbb Z. \end{align*} [/definition] Autocorrelation is dimensionless, satisfies $\rho(0)=1$, and is the object usually estimated first from data. It is also where many modelling errors show up: a proposed covariance formula cannot be accepted merely because it is symmetric and bounded. The same second-order perspective also determines how infinite series of random variables should converge. ARMA and linear-process constructions often define $X_t$ as a limit of partial sums, and the relevant limit must control squared prediction error rather than only pointwise behaviour. Ordinary pointwise convergence does not by itself guarantee that variances, covariances, or least-squares prediction errors pass to the limit. Since the theory will build processes by truncating and then extending infinite filters, it needs a convergence notion measured in expected squared error, the same scale used for second-order prediction. [definition: Mean Square Convergence] Let $(Y_n)_{n \in \mathbb N}$ and $Y$ be real-valued random variables with finite second moments. We say $Y_n$ converges to $Y$ in mean square if \begin{align*} \mathbb E[|Y_n-Y|^2] \to 0 \end{align*} as $n \to \infty$. [/definition] Mean square convergence is convergence in the Hilbert space $L^2(\Omega,\mathcal F,\mathbb P)$. It is stronger than convergence in probability and is the right notion for least-squares prediction errors. The basic input sequence for these constructions is a stream of new shocks. Before building filters or recursions from such shocks, the course needs a minimal definition that says the innovations have constant variance and no linear dependence across time. [definition: White Noise] A real-valued process $(Z_t)_{t \in \mathbb Z}$ is white noise with variance $\sigma^2>0$ if $\mathbb E[Z_t]=0$, $\mathbb E[Z_t^2]=\sigma^2$, and $\operatorname{Cov}(Z_t,Z_s)=0$ whenever $t\ne s$. [/definition] White noise need not be independent unless independence is explicitly assumed. In Gaussian models, uncorrelatedness and independence coincide, so Gaussian white noise is the canonical innovation sequence used throughout ARMA theory. [example: Random Walk As A Nonstationary Process] Let $X_0=0$ and, for $t\ge 1$, \begin{align*} X_t=Z_1+\cdots+Z_t, \end{align*} where $(Z_t)_{t\ge 1}$ are i.i.d., $\mathbb E[Z_t]=0$, and $\operatorname{Var}(Z_t)=\sigma^2>0$. We show that $(X_t)$ is not weakly stationary. For every $t\ge 1$, linearity of expectation gives \begin{align*} \mathbb E[X_t] &=\mathbb E[Z_1+\cdots+Z_t]\\ &=\mathbb E[Z_1]+\cdots+\mathbb E[Z_t]\\ &=0+\cdots+0\\ &=0. \end{align*} Also $\mathbb E[X_0]=0$, so the mean is constant. The variance, however, changes with $t$. Since the $Z_i$ are independent, distinct increments have covariance zero, and therefore \begin{align*} \operatorname{Var}(X_t) &=\operatorname{Var}(Z_1+\cdots+Z_t)\\ &=\sum_{i=1}^t\operatorname{Var}(Z_i) +2\sum_{1\le i<j\le t}\operatorname{Cov}(Z_i,Z_j)\\ &=\sum_{i=1}^t\sigma^2 +2\sum_{1\le i<j\le t}0\\ &=t\sigma^2. \end{align*} Thus $\operatorname{Var}(X_0)=0$, $\operatorname{Var}(X_1)=\sigma^2$, and $\operatorname{Var}(X_2)=2\sigma^2$. Since weak stationarity would require the variance $\operatorname{Cov}(X_t,X_t)$ to be independent of $t$, the random walk is not weakly stationary. The failure belongs to the level process: its increments satisfy $X_t-X_{t-1}=Z_t$, so they have constant mean $0$ and constant variance $\sigma^2$. [/example] ## Which Sequences Can Be Autocovariances? A formula for $\gamma(h)$ is not automatically a covariance function. Since variances of linear combinations must be nonnegative, every finite covariance matrix built from $\gamma$ must be positive semidefinite. This condition is also sufficient: any sequence satisfying it can occur as the autocovariance sequence of a weakly stationary process. [definition: Positive Definite Sequence] A function $\gamma:\mathbb Z\to\mathbb C$ is positive definite if $\gamma(-h)=\overline{\gamma(h)}$ for every $h\in\mathbb Z$ and, for every $n\in\mathbb N$, every $t_1,\dots,t_n\in\mathbb Z$, and every $a_1,\dots,a_n\in\mathbb C$, \begin{align*} \sum_{i=1}^n\sum_{j=1}^n a_i\overline{a_j}\,\gamma(t_i-t_j) \in [0,\infty). \end{align*} [/definition] For real-valued time series, this condition says that every covariance matrix formed from lags is symmetric and positive semidefinite. The point is not merely that individual variances are nonnegative; every proposed covariance sequence must give a nonnegative variance to every finite linear combination of observations. This turns autocovariance modelling into a matrix positivity problem. [quotetheorem:3624] [citeproof:3624] This criterion explains why autocovariance modelling is constrained. The hypotheses are needed because a covariance sequence must survive all finite tests at once: bounds such as $|\rho(h)|\le 1$ check only two-by-two covariance matrices, while positive definiteness checks arbitrary finite collections. For instance, assigning large positive correlation at lag $1$ and large negative correlation at lag $2$ can violate positive definiteness even if each individual value lies between $-1$ and $1$. The theorem also does not say that the resulting process is unique; many processes can share the same autocovariance sequence, and the Gaussian construction is only one canonical realisation. This distinction becomes important when second-order methods are used for prediction, because two non-Gaussian processes can have identical linear prediction theory but different higher-order behaviour. [example: Stationary AR(1)] Let $(Z_t)_{t\in\mathbb Z}$ be white noise with variance $\sigma^2$, and let $|\phi|<1$. For $m\ge0$ define \begin{align*} S_{t,m}=\sum_{j=0}^m \phi^j Z_{t-j}. \end{align*} If $m>n$, then the white-noise covariance conditions give \begin{align*} \mathbb E\left[|S_{t,m}-S_{t,n}|^2\right] &=\mathbb E\left[\left(\sum_{j=n+1}^m \phi^j Z_{t-j}\right) \left(\sum_{k=n+1}^m \phi^k Z_{t-k}\right)\right]\\ &=\sum_{j=n+1}^m\sum_{k=n+1}^m \phi^{j+k}\mathbb E[Z_{t-j}Z_{t-k}]\\ &=\sum_{j=n+1}^m \phi^{2j}\mathbb E[Z_{t-j}^2] +\sum_{\substack{n+1\le j,k\le m\\ j\ne k}} \phi^{j+k}\mathbb E[Z_{t-j}Z_{t-k}]\\ &=\sum_{j=n+1}^m \phi^{2j}\sigma^2 +\sum_{\substack{n+1\le j,k\le m\\ j\ne k}} \phi^{j+k}\operatorname{Cov}(Z_{t-j},Z_{t-k})\\ &=\sigma^2\sum_{j=n+1}^m \phi^{2j}. \end{align*} Since $|\phi|<1$, the geometric tail $\sum_{j=n+1}^m\phi^{2j}$ tends to $0$ as $n,m\to\infty$. Thus $(S_{t,m})_{m\ge0}$ is Cauchy in mean square, so \begin{align*} X_t=\sum_{j=0}^\infty \phi^j Z_{t-j} \end{align*} is well-defined as a mean-square limit. The mean is constant. Indeed, \begin{align*} \mathbb E[S_{t,m}] &=\sum_{j=0}^m \phi^j\mathbb E[Z_{t-j}]\\ &=0, \end{align*} and mean-square convergence implies convergence in mean, because \begin{align*} |\mathbb E[X_t-S_{t,m}]| \le \mathbb E[|X_t-S_{t,m}|] \le \left(\mathbb E[|X_t-S_{t,m}|^2]\right)^{1/2}\to0. \end{align*} Hence $\mathbb E[X_t]=0$ for every $t$. For $h\ge0$, covariance is preserved under the mean-square limits because products converge in $L^1$ by Cauchy-Schwarz. Therefore \begin{align*} \operatorname{Cov}(X_{t+h},X_t) &=\mathbb E[X_{t+h}X_t]\\ &=\sum_{j=0}^\infty\sum_{k=0}^\infty \phi^{j+k}\mathbb E[Z_{t+h-j}Z_{t-k}]. \end{align*} The term $\mathbb E[Z_{t+h-j}Z_{t-k}]$ is nonzero exactly when $t+h-j=t-k$, which is equivalent to $j=h+k$. Thus \begin{align*} \operatorname{Cov}(X_{t+h},X_t) &=\sum_{k=0}^\infty \phi^{h+k+k}\mathbb E[Z_{t-k}^2]\\ &=\sigma^2\phi^h\sum_{k=0}^\infty \phi^{2k}\\ &=\frac{\sigma^2\phi^h}{1-\phi^2}. \end{align*} For $h<0$, covariance symmetry gives \begin{align*} \operatorname{Cov}(X_{t+h},X_t) =\operatorname{Cov}(X_t,X_{t+h}) =\frac{\sigma^2\phi^{-h}}{1-\phi^2}. \end{align*} Hence \begin{align*} \gamma(h)=\operatorname{Cov}(X_{t+h},X_t) =\frac{\sigma^2\phi^{|h|}}{1-\phi^2},\qquad h\in\mathbb Z, \end{align*} which depends only on the lag $h$. In particular, \begin{align*} \mathbb E[X_t^2]=\gamma(0)=\frac{\sigma^2}{1-\phi^2}<\infty. \end{align*} Thus $(X_t)_{t\in\mathbb Z}$ is weakly stationary. It remains to verify the recursion in mean square. For each $m\ge1$, \begin{align*} S_{t,m} &=Z_t+\sum_{j=1}^m \phi^j Z_{t-j}\\ &=Z_t+\phi\sum_{j=1}^m \phi^{j-1}Z_{t-j}\\ &=Z_t+\phi\sum_{\ell=0}^{m-1}\phi^\ell Z_{t-1-\ell}\\ &=Z_t+\phi S_{t-1,m-1}. \end{align*} Taking mean-square limits as $m\to\infty$ gives \begin{align*} X_t=Z_t+\phi X_{t-1} \end{align*} in mean square. The construction therefore gives the stationary solution to the AR(1) recursion, and its autocovariance decays geometrically with lag. [/example] The AR(1) example shows geometric decay of dependence, which is the simplest pattern produced by a linear recursion. Stationary processes can also have persistent oscillatory dependence without any decay, as the next example shows. [example: Deterministic Sinusoid With Random Phase] Let $\Theta\sim \operatorname{Unif}(0,2\pi)$ and define \begin{align*} X_t=A\cos(\lambda t+\Theta),\qquad t\in\mathbb Z, \end{align*} where $A>0$ and $\lambda\in\mathbb R$. We show that shifting time only shifts the random phase, so the finite-dimensional distributions do not change. Fix $n\in\mathbb N$, times $t_1,\dots,t_n\in\mathbb Z$, and a shift $h\in\mathbb Z$. Then \begin{align*} (X_{t_1+h},\dots,X_{t_n+h}) &=\bigl(A\cos(\lambda(t_1+h)+\Theta),\dots,A\cos(\lambda(t_n+h)+\Theta)\bigr)\\ &=\bigl(A\cos(\lambda t_1+(\Theta+\lambda h)),\dots, A\cos(\lambda t_n+(\Theta+\lambda h))\bigr). \end{align*} For any bounded measurable function $f$ on $[0,2\pi)$ and any constant $c\in\mathbb R$, translation invariance of Lebesgue measure on the circle gives \begin{align*} \mathbb E[f((\Theta+c)\bmod 2\pi)] &=\frac{1}{2\pi}\int_0^{2\pi} f((\theta+c)\bmod 2\pi)\,d\theta\\ &=\frac{1}{2\pi}\int_0^{2\pi} f(u)\,du\\ &=\mathbb E[f(\Theta)]. \end{align*} Thus $(\Theta+\lambda h)\bmod 2\pi$ has the same distribution as $\Theta$, and therefore \begin{align*} (X_{t_1+h},\dots,X_{t_n+h}) \stackrel{d}{=} (X_{t_1},\dots,X_{t_n}). \end{align*} Hence the process is strictly stationary. The mean is independent of $t$ and equals zero: \begin{align*} \mathbb E[X_t] &=\frac{A}{2\pi}\int_0^{2\pi}\cos(\lambda t+\theta)\,d\theta\\ &=\frac{A}{2\pi}\Bigl[\sin(\lambda t+\theta)\Bigr]_{\theta=0}^{2\pi}\\ &=\frac{A}{2\pi}\bigl(\sin(\lambda t+2\pi)-\sin(\lambda t)\bigr)\\ &=0. \end{align*} For the autocovariance, since the mean is zero, \begin{align*} \operatorname{Cov}(X_{t+h},X_t) &=\mathbb E[X_{t+h}X_t]\\ &=\frac{A^2}{2\pi}\int_0^{2\pi} \cos(\lambda(t+h)+\theta)\cos(\lambda t+\theta)\,d\theta. \end{align*} Using $2\cos u\cos v=\cos(u-v)+\cos(u+v)$ with $u=\lambda(t+h)+\theta$ and $v=\lambda t+\theta$, \begin{align*} \operatorname{Cov}(X_{t+h},X_t) &=\frac{A^2}{4\pi}\int_0^{2\pi} \left(\cos(\lambda h)+\cos(2\lambda t+\lambda h+2\theta)\right)\,d\theta\\ &=\frac{A^2}{4\pi} \left( \int_0^{2\pi}\cos(\lambda h)\,d\theta +\int_0^{2\pi}\cos(2\lambda t+\lambda h+2\theta)\,d\theta \right)\\ &=\frac{A^2}{4\pi} \left( 2\pi\cos(\lambda h) +\left[\frac{1}{2}\sin(2\lambda t+\lambda h+2\theta)\right]_{\theta=0}^{2\pi} \right)\\ &=\frac{A^2}{4\pi} \left( 2\pi\cos(\lambda h) +\frac{1}{2}\sin(2\lambda t+\lambda h+4\pi) -\frac{1}{2}\sin(2\lambda t+\lambda h) \right)\\ &=\frac{A^2}{4\pi}\,2\pi\cos(\lambda h)\\ &=\frac{A^2}{2}\cos(\lambda h). \end{align*} Thus $(X_t)$ is weakly stationary with autocovariance \begin{align*} \gamma(h)=\frac{A^2}{2}\cos(\lambda h),\qquad h\in\mathbb Z. \end{align*} This is a stationary process with dependence that oscillates forever rather than decaying with the lag. [/example] ## Linear Prediction As Orthogonal Projection Given past observations, the forecasting problem asks for the best approximation to a future value using an allowed class of predictors. If the allowed class consists of linear combinations and the loss is mean squared error, the problem becomes a projection problem in a Hilbert space. This viewpoint is the basis for Yule-Walker equations, innovations algorithms, and the prediction theory developed later in the course. [definition: Linear Prediction Space] Let $(X_t)_{t\in\mathbb Z}$ be a weakly stationary process with mean $\mu$. For a set $I\subset\mathbb Z$, the linear prediction space generated by $(X_s)_{s\in I}$ is the closed linear subspace \begin{align*} \mathcal H_I = \overline{\operatorname{span}\{X_s-\mu:s\in I\}}^{L^2} \end{align*} of $L^2(\Omega,\mathcal F,\mathbb P)$. [/definition] The closure matters because optimal predictors may be limits of finite linear combinations rather than finite linear combinations themselves. This is especially important when the predictor is allowed to use the entire infinite past. [quotetheorem:3625] [citeproof:3625] For prediction, take $Y=X_{t+h}-\mu$ and $H=\mathcal H_{\{s:s\le t\}}$. Closedness of $H$ is essential: without taking the $L^2$ closure, the best predictor might be approached by longer and longer finite linear combinations without being attained by any finite combination. The $L^2$ structure is also essential because the proof uses Hilbert space geometry; for other losses, such as absolute error, the optimal predictor need not be an orthogonal projection. The theorem gives existence and characterisation, not a computational formula by itself. Computation begins when the orthogonality conditions are written against the spanning variables, producing normal equations whose coefficients are autocovariances. ## Ergodicity And What Data Can Reveal Stationarity is an assumption about the probability law, not a feature that can be verified from a single finite path. What a single long path can support are ergodic conclusions: time averages along that path converge to population quantities. Without an ergodic condition, a stationary process may carry a hidden random environment that never averages out over time. [definition: Ergodicity For Stationary Processes] A strictly stationary process $(X_t)_{t\in\mathbb Z}$ is ergodic if every shift-invariant event $A$ satisfies $\mathbb P(A)\in\{0,1\}$. [/definition] Shift-invariant events are events whose truth is unchanged by moving the whole sample path forward or backward in time. Ergodicity says that there is no non-degenerate random variable describing a permanent regime of the process. The reason this definition matters is that statistical averages are computed along one realised path, not over repeated universes. The fundamental theorem needed here identifies the limit of those path averages and shows precisely where the absence of hidden invariant information enters. To connect the zero-one definition of ergodicity with statistics, we need a result about the actual averages observed along time. The relevant form of Birkhoff's ergodic theorem says that, for an integrable stationary process, time averages converge almost surely to the conditional expectation given the invariant information; under ergodicity this limit reduces to the population expectation $\mathbb E[X_0]$. The integrability hypothesis is needed so that the population mean and the limiting conditional expectation are well-defined. Stationarity alone is not enough to identify the limit with $\mathbb E[X_0]$: a process can be stationary because it first chooses a hidden regime and then remains in that regime forever. Ergodicity rules out exactly this kind of persistent random environment. The result therefore explains why time averages estimate expectations only after stationarity is supplemented by a condition that makes one long path representative of the whole probability law. [quotetheorem:3623] [citeproof:3623] This consistency statement is weaker than independence-based laws of large numbers in some directions and stronger in others: it permits dependence, but it requires the dependence not to encode a persistent random regime. The need for ergodicity is not cosmetic. If a stationary process carries a hidden random constant, every time average can converge, yet it converges to the realised constant rather than to the ensemble mean. The following example is the basic failure mode. [example: Stationary But Non-Ergodic Constant Regime] Let $Y$ be a non-degenerate square-integrable real random variable, and define $X_t=Y$ for every $t\in\mathbb Z$. We show that $(X_t)$ is strictly stationary but not ergodic, and that its sample mean converges to the hidden regime $Y$ rather than to $\mathbb E[Y]$. Fix $n\in\mathbb N$, times $t_1,\dots,t_n\in\mathbb Z$, and a shift $h\in\mathbb Z$. Since every coordinate of the process equals $Y$, \begin{align*} (X_{t_1},\dots,X_{t_n}) &=(Y,\dots,Y),\\ (X_{t_1+h},\dots,X_{t_n+h}) &=(Y,\dots,Y). \end{align*} The two random vectors are therefore equal pointwise, hence have the same distribution. Thus $(X_t)_{t\in\mathbb Z}$ is strictly stationary. Because $Y$ is non-degenerate, there is a Borel set $B\subseteq\mathbb R$ such that \begin{align*} 0<\mathbb P(Y\in B)<1. \end{align*} Let \begin{align*} A=\{\omega:Y(\omega)\in B\}. \end{align*} If the whole path is shifted by any integer $h$, the shifted path still has every coordinate equal to the same value $Y(\omega)$, so membership in $A$ is unchanged by the shift. Hence $A$ is shift-invariant. Since \begin{align*} \mathbb P(A)=\mathbb P(Y\in B)\in(0,1), \end{align*} the process is not ergodic. For every $n\ge1$, the sample mean is \begin{align*} \bar X_n &=\frac{1}{n}\sum_{t=1}^n X_t\\ &=\frac{1}{n}\sum_{t=1}^n Y\\ &=\frac{1}{n}\,nY\\ &=Y. \end{align*} Thus $\bar X_n=Y$ for every $n$, so $\bar X_n\to Y$ almost surely. Since $Y$ is non-degenerate, $Y$ is not almost surely equal to the constant $\mathbb E[Y]$. A single path therefore reveals only the realised constant regime, not the population mean. [/example] In applications, stationarity should be treated as a modelling decision. A plot may reveal trends, changing variance, seasonality, or structural breaks that make a stationary model implausible, but no finite plot can prove stationarity. The practical workflow is to transform or difference data when needed, fit a stationary second-order model to the transformed series, and then check whether residual dependence is consistent with the assumptions that drove the model. Chapter 1 showed how weak stationarity, autocovariances, and white noise give a tractable description of dependence without requiring a full joint law. Chapter 2 builds directly on that framework by representing stationary series as linear filters of innovations and then specializing to the ARMA models that will serve as the main parametric class. # 2. Linear Processes and ARMA Models This chapter turns the second-order language of stationarity from Chapter 1 into a flexible class of models. We assume the basic notions of weak stationarity, autocovariance, and autocorrelation, and ask how much structure is gained by writing a stationary series as the output of a linear filter applied to white noise. We begin with infinite moving averages, then impose finite autoregressive and moving-average equations, and finally use covariance identities to identify model order from data. ## Linear Filters and Infinite Moving Averages How can a process have dependence over time while still being built from independent or uncorrelated shocks? The basic answer is to let each observation combine present and past innovations through a deterministic sequence of weights. Absolute summability is the condition that makes this construction stable and keeps second moments under control. [definition: White Noise] A real-valued process $(Z_t)_{t \in \mathbb Z}$ is white noise with mean $0$ and variance $\sigma^2 > 0$ if $\mathbb E[Z_t] = 0$, $\operatorname{Var}(Z_t) = \sigma^2$, and $\operatorname{Cov}(Z_t, Z_s) = 0$ whenever $t \ne s$. [/definition] White noise is the input sequence. It need not be independent unless this is stated separately, but uncorrelatedness is enough for the second-order calculations in this chapter. [definition: Linear Process] Let $(Z_t)_{t \in \mathbb Z}$ be white noise with mean $0$ and variance $\sigma^2$. A process $(X_t)_{t \in \mathbb Z}$ is a linear process with coefficients $(\psi_j)_{j \in \mathbb Z}$ if \begin{align*} X_t = \sum_{j \in \mathbb Z} \psi_j Z_{t-j} \end{align*} in $L^2$, where $\sum_{j \in \mathbb Z} |\psi_j| < \infty$. [/definition] The index convention says that positive $j$ refers to past shocks $Z_{t-j}$ and negative $j$ refers to future shocks. For time-series modelling we usually require causality, which removes future shocks from the representation. [definition: Causal Linear Process] A linear process $(X_t)_{t \in \mathbb Z}$ is causal with respect to $(Z_t)_{t \in \mathbb Z}$ if $\psi_j = 0$ for every $j < 0$. [/definition] After causality fixes which shocks may enter the representation, the next question is how the filter coefficients determine observable second-order behaviour. The covariance formula below is what lets a linear-process representation be turned into autocovariances, autocorrelations, and concrete model diagnostics. [quotetheorem:3627] [citeproof:3627] This formula is the main computational reason linear filters are useful: second-order properties are determined by the filter coefficients. The absolute summability hypothesis is not cosmetic: without it the series may fail to converge in $L^2$, and even if finite-dimensional covariances can be assigned formally, the filter need not define a stable stationary process. The theorem is a second-order result only; it does not assert independence, Gaussianity, or any Markov property. It prepares the finite-order theory below by showing that moving-average cutoffs and autoregressive tail behaviour are both consequences of the coefficient sequence. [example: Finite Moving Average As Linear Process] Let $X_t=Z_t+\theta Z_{t-1}$, where $(Z_t)$ is white noise with variance $\sigma^2$. This is a causal linear process because \begin{align*} X_t=\psi_0Z_t+\psi_1Z_{t-1} \end{align*} with $\psi_0=1$, $\psi_1=\theta$, and $\psi_j=0$ for every $j\notin\{0,1\}$. Using the *Linear Process Covariance Formula*, \begin{align*} \gamma_X(h) &=\sigma^2\sum_{j\in\mathbb Z}\psi_{j+h}\psi_j. \end{align*} For $h=0$, \begin{align*} \gamma_X(0) &=\sigma^2\sum_{j\in\mathbb Z}\psi_j^2\\ &=\sigma^2(\psi_0^2+\psi_1^2)\\ &=\sigma^2(1+\theta^2). \end{align*} For $h=1$, \begin{align*} \gamma_X(1) &=\sigma^2\sum_{j\in\mathbb Z}\psi_{j+1}\psi_j\\ &=\sigma^2\psi_1\psi_0\\ &=\sigma^2\theta. \end{align*} For $h=-1$, \begin{align*} \gamma_X(-1) &=\sigma^2\sum_{j\in\mathbb Z}\psi_{j-1}\psi_j\\ &=\sigma^2\psi_0\psi_1\\ &=\sigma^2\theta. \end{align*} If $|h|>1$, then no integer $j$ has both $j\in\{0,1\}$ and $j+h\in\{0,1\}$, so every product $\psi_{j+h}\psi_j$ is zero and \begin{align*} \gamma_X(h)=0. \end{align*} Thus this finite moving average has nonzero second-order dependence only up to lag $1$, illustrating the autocorrelation cutoff at the order of the filter. [/example] Lag-polynomial notation packages linear filters as formal power series. On the vector space of real-valued two-sided sequences $\mathbb R^{\mathbb Z}$, let $B: \mathbb R^{\mathbb Z} \to \mathbb R^{\mathbb Z}$ denote the backshift operator, defined by $(Bx)_t=x_{t-1}$ for $x=(x_t)_{t\in\mathbb Z}$. Thus, when applied to a process, $BX_t=X_{t-1}$ and $B^jX_t=X_{t-j}$. A causal linear process can then be written as \begin{align*} X_t = \psi(B)Z_t, \qquad \psi(B)=\sum_{j=0}^{\infty}\psi_j B^j. \end{align*} This notation is algebraic shorthand for an $L^2$ convergent series, not an assertion that $B$ is a numerical variable. ## ARMA Equations and Causal Solutions When can a short recursive equation define a stationary process? Autoregressive and moving-average models answer this by replacing an infinite list of coefficients with finite polynomials. The central issue is whether the recursion can be solved forward from past shocks without using future innovations. [definition: Autoregressive Moving-Average Process] Let $(Z_t)_{t \in \mathbb Z}$ be white noise with mean $0$ and variance $\sigma^2$. A process $(X_t)_{t \in \mathbb Z}$ is an $\operatorname{ARMA}(p,q)$ process with autoregressive polynomial $\phi$ and moving-average polynomial $\theta$ if \begin{align*} \phi(B)X_t = \theta(B)Z_t, \end{align*} where \begin{align*} \phi(z) &= 1-\phi_1z-\cdots-\phi_pz^p, & \theta(z) &= 1+\theta_1z+\cdots+\theta_qz^q. \end{align*} [/definition] The special cases are $\operatorname{AR}(p)$ when $q=0$ and $\operatorname{MA}(q)$ when $p=0$. The minus signs in $\phi$ match the recursion $X_t=\phi_1X_{t-1}+\cdots+\phi_pX_{t-p}+\text{noise}$. An ARMA equation by itself is only a recursion; it does not yet say which solution is usable for forecasting from the past. The same algebraic equation can admit formal solutions that depend on future shocks or on impulse responses whose coefficients do not decay. For prediction and interpretation, the needed solution is the one generated by current and past innovations through an absolutely summable filter. The next formal condition isolates exactly that one-sided, stable representation. [definition: Causal ARMA Solution] An $\operatorname{ARMA}(p,q)$ solution is causal if there are coefficients $(\psi_j)_{j\ge 0}$ with $\sum_{j=0}^{\infty}|\psi_j|<\infty$ such that \begin{align*} X_t = \sum_{j=0}^{\infty}\psi_j Z_{t-j}. \end{align*} [/definition] The definition does not guarantee that such a solution exists. For instance, the equation $X_t=X_{t-1}+Z_t$ would require the non-summable formal expansion $1+B+B^2+\cdots$, while $X_t=2X_{t-1}+Z_t$ has the forward recursion pointing in the wrong time direction for a stable causal representation. The next theorem identifies the precise polynomial condition that rules out these failures. [quotetheorem:3628] [citeproof:3628] The root condition is the stationarity condition for the autoregressive part. It is needed even in the simplest case: $\phi(z)=1-z$ gives the random-walk equation $X_t=X_{t-1}+Z_t$, whose variance grows with time, while $\phi(z)=1-2z$ has a reciprocal root inside the unit circle and produces an unstable causal expansion. The theorem does not say that every stationary solution is causal; non-causal stationary solutions can appear when the recursion is solved using future shocks. What it gives is the stable representation needed for forecasting, likelihood calculations, and the impulse-response interpretation used next. [quotetheorem:3629] [citeproof:3629] This representation theorem is mainly a translation device: it turns a finite ARMA equation into the infinite moving-average language where covariance calculations are direct. The root condition is inherited from the previous theorem; if it fails, the quotient $\theta/\phi$ may have a pole on or inside the unit circle, so the coefficients cannot be absolutely summable. The theorem also does not determine whether the innovations are recoverable from observations; that separate question is invertibility. [example: AR(1) Causal Representation] For $X_t=\phi X_{t-1}+Z_t$, the autoregressive polynomial is $\phi(z)=1-\phi z$. If $|\phi|<1$, then the geometric series satisfies \begin{align*} (1-\phi z)\sum_{j=0}^{n}\phi^jz^j &=\sum_{j=0}^{n}\phi^jz^j-\sum_{j=0}^{n}\phi^{j+1}z^{j+1}\\ &=1-\phi^{n+1}z^{n+1}. \end{align*} For $|z|<|\phi|^{-1}$, the term $\phi^{n+1}z^{n+1}$ tends to $0$, so \begin{align*} \frac{1}{1-\phi z}=\sum_{j=0}^{\infty}\phi^jz^j. \end{align*} Also, \begin{align*} \sum_{j=0}^{\infty}|\phi^j| =\sum_{j=0}^{\infty}|\phi|^j =\frac{1}{1-|\phi|}, \end{align*} so the coefficients are absolutely summable. By the *Causal ARMA Representation*, the causal solution is therefore \begin{align*} X_t=\sum_{j=0}^{\infty}\phi^jZ_{t-j}. \end{align*} Here $\psi_j=\phi^j$ for $j\ge0$ and $\psi_j=0$ for $j<0$. Using the *Linear Process Covariance Formula*, for $h\ge0$, \begin{align*} \gamma_X(h) &=\sigma^2\sum_{j\in\mathbb Z}\psi_{j+h}\psi_j\\ &=\sigma^2\sum_{j=0}^{\infty}\phi^{j+h}\phi^j\\ &=\sigma^2\phi^h\sum_{j=0}^{\infty}\phi^{2j}\\ &=\sigma^2\phi^h\frac{1}{1-\phi^2}\\ &=\frac{\sigma^2\phi^h}{1-\phi^2}. \end{align*} Since autocovariance is symmetric, \begin{align*} \gamma_X(-h)=\gamma_X(h), \end{align*} so for every $h\in\mathbb Z$, \begin{align*} \gamma_X(h)=\frac{\sigma^2\phi^{|h|}}{1-\phi^2}. \end{align*} In particular, \begin{align*} \gamma_X(0)=\frac{\sigma^2}{1-\phi^2}, \end{align*} and hence \begin{align*} \rho_X(h) &=\frac{\gamma_X(h)}{\gamma_X(0)}\\ &=\frac{\sigma^2\phi^{|h|}/(1-\phi^2)}{\sigma^2/(1-\phi^2)}\\ &=\phi^{|h|}. \end{align*} Thus the causal AR(1) model has geometrically decaying autocorrelation, with sign alternation when $\phi<0$. [/example] The AR(1) example shows monotone geometric memory when $\phi>0$ and alternating geometric memory when $\phi<0$. Higher-order autoregressions allow the same decay mechanism to combine with rotation, producing the oscillatory behaviour seen in the next example. [example: AR(2) With Complex Roots] Consider $X_t=\phi_1X_{t-1}+\phi_2X_{t-2}+Z_t$, where the zeros of \begin{align*} \phi(z)=1-\phi_1z-\phi_2z^2 \end{align*} are $r^{-1}e^{i\omega}$ and $r^{-1}e^{-i\omega}$, with $0<r<1$ and $0<\omega<\pi$. Hence \begin{align*} \phi(z) &=(1-re^{i\omega}z)(1-re^{-i\omega}z)\\ &=1-r(e^{i\omega}+e^{-i\omega})z+r^2z^2\\ &=1-2r\cos(\omega)z+r^2z^2, \end{align*} so $\phi_1=2r\cos(\omega)$ and $\phi_2=-r^2$. The causal impulse response is determined by \begin{align*} \sum_{j=0}^{\infty}\psi_jz^j =\frac{1}{(1-re^{i\omega}z)(1-re^{-i\omega}z)}. \end{align*} Put $a=re^{i\omega}$ and $b=re^{-i\omega}$. Since $a\ne b$, \begin{align*} \frac{1}{(1-az)(1-bz)} &=\frac{a}{a-b}\frac{1}{1-az}-\frac{b}{a-b}\frac{1}{1-bz}\\ &=\frac{a}{a-b}\sum_{j=0}^{\infty}a^jz^j-\frac{b}{a-b}\sum_{j=0}^{\infty}b^jz^j, \end{align*} where the two geometric series converge for $|z|<1/r$. Therefore \begin{align*} \psi_j &=\frac{a^{j+1}-b^{j+1}}{a-b}\\ &=\frac{r^{j+1}e^{i(j+1)\omega}-r^{j+1}e^{-i(j+1)\omega}}{re^{i\omega}-re^{-i\omega}}\\ &=r^j\frac{e^{i(j+1)\omega}-e^{-i(j+1)\omega}}{e^{i\omega}-e^{-i\omega}}\\ &=r^j\frac{\sin((j+1)\omega)}{\sin(\omega)}. \end{align*} Equivalently, \begin{align*} \psi_j &=\frac{r^j}{\sin(\omega)}\{\sin(\omega)\cos(j\omega)+\cos(\omega)\sin(j\omega)\}\\ &=\frac{1}{\sin(\omega)}r^j\cos\left(j\omega-\left(\frac{\pi}{2}-\omega\right)\right). \end{align*} Thus the oscillatory impulse response has amplitude scale $C=1/\sin(\omega)$ and phase shift $\delta=\pi/2-\omega$. For $h\ge0$, the covariance formula gives \begin{align*} \gamma(h) &=\sigma^2\sum_{j=0}^{\infty}\psi_{j+h}\psi_j\\ &=\frac{\sigma^2r^h}{\sin^2(\omega)} \sum_{j=0}^{\infty}r^{2j}\sin((j+h+1)\omega)\sin((j+1)\omega). \end{align*} Since the sine factors are bounded by $1$ in absolute value, \begin{align*} |\gamma(h)| &\le \frac{\sigma^2r^h}{\sin^2(\omega)} \sum_{j=0}^{\infty}r^{2j} =\frac{\sigma^2r^h}{\sin^2(\omega)(1-r^2)}. \end{align*} The factor $\sin((j+1)\omega)$ in $\psi_j$ makes the impulse response oscillate, while the factor $r^j$ damps its size geometrically; the same geometric damping bounds the autocovariance and hence the autocorrelation. [/example] [illustration:ar2-damped-oscillatory-impulse-response] ## Invertibility and Non-Uniqueness of Moving-Average Models If $X_t$ is observed but the innovations are hidden, when can the shocks be recovered from the present and past of the process? This is the invertibility problem. It matters because different moving-average parameter values can generate the same autocovariance function unless we impose a convention that selects the recoverable representation. [definition: Invertible ARMA Process] An $\operatorname{ARMA}(p,q)$ process satisfying $\phi(B)X_t=\theta(B)Z_t$ is invertible if there are coefficients $(\pi_j)_{j\ge0}$ with $\sum_{j=0}^{\infty}|\pi_j|<\infty$ such that \begin{align*} Z_t = \sum_{j=0}^{\infty}\pi_jX_{t-j}. \end{align*} [/definition] Invertibility is the moving-average analogue of causality. Causality solves for $X_t$ from current and past shocks; invertibility solves for $Z_t$ from current and past observations. The obstruction is that formally dividing by the moving-average polynomial may produce an inverse filter whose coefficients grow or depend on future observations. To use residuals as recovered innovations in estimation and diagnostics, one needs a stable one-sided inverse, and this becomes a concrete question about the zeros of $\theta$. [quotetheorem:3630] [citeproof:3630] The zero condition on $\theta$ is essential. For an $\operatorname{MA}(1)$ model with $|\theta|>1$, the formal inverse has coefficients proportional to $(-1/\theta)^j$ only after rewriting the model with reciprocal parameter and rescaled noise; the original shocks are not recoverable by a stable filter from past observations. The theorem does not claim that the parameterisation is statistically identifiable without conventions such as removing common factors and imposing invertibility. It leads directly to the non-uniqueness phenomenon in finite moving-average models. [example: MA(1) Non-Uniqueness Without Invertibility] Let $X_t=Z_t+\theta Z_{t-1}$, where $(Z_t)$ is white noise with variance $\sigma^2$. In the linear-process notation, $\psi_0=1$, $\psi_1=\theta$, and $\psi_j=0$ for $j\notin\{0,1\}$. By the *Linear Process Covariance Formula*, \begin{align*} \gamma(0) &=\sigma^2\sum_{j\in\mathbb Z}\psi_j^2\\ &=\sigma^2(\psi_0^2+\psi_1^2)\\ &=\sigma^2(1+\theta^2), \end{align*} and \begin{align*} \gamma(1) &=\sigma^2\sum_{j\in\mathbb Z}\psi_{j+1}\psi_j\\ &=\sigma^2\psi_1\psi_0\\ &=\sigma^2\theta. \end{align*} Similarly, \begin{align*} \gamma(-1) &=\sigma^2\sum_{j\in\mathbb Z}\psi_{j-1}\psi_j\\ &=\sigma^2\psi_0\psi_1\\ &=\sigma^2\theta. \end{align*} If $|h|>1$, then no integer $j$ has both $j\in\{0,1\}$ and $j+h\in\{0,1\}$, so $\psi_{j+h}\psi_j=0$ for every $j$ and $\gamma(h)=0$. Now assume $\theta\ne0$ and define another white-noise sequence $(\widetilde Z_t)$ with variance \begin{align*} \operatorname{Var}(\widetilde Z_t)=\sigma^2\theta^2. \end{align*} For the model \begin{align*} \widetilde X_t=\widetilde Z_t+\theta^{-1}\widetilde Z_{t-1}, \end{align*} the same covariance calculation, with coefficient $\theta^{-1}$ and innovation variance $\sigma^2\theta^2$, gives \begin{align*} \widetilde\gamma(0) &=\sigma^2\theta^2\left(1+\theta^{-2}\right)\\ &=\sigma^2(\theta^2+1)\\ &=\sigma^2(1+\theta^2), \end{align*} and \begin{align*} \widetilde\gamma(1) &=\sigma^2\theta^2\theta^{-1}\\ &=\sigma^2\theta. \end{align*} The same argument gives $\widetilde\gamma(-1)=\sigma^2\theta$ and $\widetilde\gamma(h)=0$ for $|h|>1$. Thus the two parameterisations have identical autocovariance functions, even though their moving-average coefficients are reciprocal. The invertibility condition for an $\operatorname{MA}(1)$ coefficient is $|\theta|<1$, so away from the boundary $|\theta|=1$ it selects exactly one member of the reciprocal pair $\theta$ and $\theta^{-1}$. [/example] The preceding example explains why invertibility is a modelling convention rather than a visible feature of the autocovariance function alone. The next calculation shows how AR and MA components combine: the moving-average part changes the first few covariances, while the autoregressive part controls the eventual tail. [example: ARMA(1,1) Autocorrelation Calculation] Let $X_t-\phi X_{t-1}=Z_t+\theta Z_{t-1}$, where $|\phi|<1$ and $(Z_t)$ is white noise with variance $\sigma^2$. The ARMA polynomials are $\phi(z)=1-\phi z$ and $\theta(z)=1+\theta z$, so by *Causal ARMA Representation* the causal coefficients are determined by \begin{align*} \psi(z) &=\frac{1+\theta z}{1-\phi z}. \end{align*} Since $|\phi|<1$, \begin{align*} (1-\phi z)\sum_{j=0}^{n}\phi^jz^j &=1-\phi^{n+1}z^{n+1}, \end{align*} and hence, for $|z|<|\phi|^{-1}$, \begin{align*} \frac{1}{1-\phi z} &=\sum_{j=0}^{\infty}\phi^jz^j. \end{align*} Thus \begin{align*} \psi(z) &=(1+\theta z)\sum_{j=0}^{\infty}\phi^jz^j\\ &=\sum_{j=0}^{\infty}\phi^jz^j+\theta\sum_{j=0}^{\infty}\phi^jz^{j+1}\\ &=1+\sum_{j=1}^{\infty}\phi^jz^j+\sum_{j=1}^{\infty}\theta\phi^{j-1}z^j\\ &=1+\sum_{j=1}^{\infty}(\phi^j+\theta\phi^{j-1})z^j\\ &=1+\sum_{j=1}^{\infty}(\phi+\theta)\phi^{j-1}z^j. \end{align*} Therefore $\psi_0=1$ and $\psi_j=(\phi+\theta)\phi^{j-1}$ for $j\ge1$. Using *Linear Process Covariance Formula*, \begin{align*} \gamma(0) &=\sigma^2\sum_{j=0}^{\infty}\psi_j^2\\ &=\sigma^2\left\{\psi_0^2+\sum_{j=1}^{\infty}\psi_j^2\right\}\\ &=\sigma^2\left\{1+\sum_{j=1}^{\infty}(\phi+\theta)^2\phi^{2j-2}\right\}\\ &=\sigma^2\left\{1+(\phi+\theta)^2\sum_{j=1}^{\infty}(\phi^2)^{j-1}\right\}\\ &=\sigma^2\left\{1+\frac{(\phi+\theta)^2}{1-\phi^2}\right\}\\ &=\sigma^2\frac{1-\phi^2+\phi^2+2\phi\theta+\theta^2}{1-\phi^2}\\ &=\sigma^2\frac{1+2\phi\theta+\theta^2}{1-\phi^2}. \end{align*} For $h\ge1$, \begin{align*} \gamma(h) &=\sigma^2\sum_{j=0}^{\infty}\psi_{j+h}\psi_j\\ &=\sigma^2\left\{\psi_h\psi_0+\sum_{j=1}^{\infty}\psi_{j+h}\psi_j\right\}\\ &=\sigma^2\left\{(\phi+\theta)\phi^{h-1}+\sum_{j=1}^{\infty}(\phi+\theta)\phi^{j+h-1}(\phi+\theta)\phi^{j-1}\right\}\\ &=\sigma^2\left\{(\phi+\theta)\phi^{h-1}+(\phi+\theta)^2\sum_{j=1}^{\infty}\phi^{2j+h-2}\right\}\\ &=\sigma^2\left\{(\phi+\theta)\phi^{h-1}+(\phi+\theta)^2\phi^h\sum_{j=1}^{\infty}(\phi^2)^{j-1}\right\}\\ &=\sigma^2\left\{(\phi+\theta)\phi^{h-1}+\frac{(\phi+\theta)^2\phi^h}{1-\phi^2}\right\}\\ &=\sigma^2(\phi+\theta)\phi^{h-1}\left\{1+\frac{(\phi+\theta)\phi}{1-\phi^2}\right\}\\ &=\sigma^2(\phi+\theta)\phi^{h-1}\frac{1-\phi^2+\phi^2+\theta\phi}{1-\phi^2}\\ &=\frac{\sigma^2(\phi+\theta)(1+\theta\phi)\phi^{h-1}}{1-\phi^2}. \end{align*} Since autocovariance is symmetric, $\gamma(-h)=\gamma(h)$ for $h\ge1$. Dividing by $\gamma(0)$ gives, for $h\ge1$, \begin{align*} \rho(h) &=\frac{\gamma(h)}{\gamma(0)}\\ &=\frac{\sigma^2(\phi+\theta)(1+\theta\phi)\phi^{h-1}/(1-\phi^2)} {\sigma^2(1+2\phi\theta+\theta^2)/(1-\phi^2)}\\ &=\frac{(\phi+\theta)(1+\theta\phi)}{1+2\phi\theta+\theta^2}\phi^{h-1}. \end{align*} Thus the autocorrelation is scaled at lag $1$ by both $\phi$ and $\theta$, and from then on each additional lag multiplies it by $\phi$. [/example] ## Yule-Walker Equations and Model Identification How can the autoregressive coefficients be read from second-order structure? For a pure autoregression, multiplying the recursion by lagged observations gives linear equations for the autocovariances. These equations are the bridge between probabilistic model definition and statistical fitting. [quotetheorem:3631] [citeproof:3631] Writing $\rho(h)=\gamma(h)/\gamma(0)$ gives the correlation version of the same equations. For $h=1,\dots,p$, they form a $p\times p$ linear system for $(\phi_1,\dots,\phi_p)$: \begin{align*} \begin{pmatrix} \rho(0) & \rho(1) & \cdots & \rho(p-1)\\ \rho(1) & \rho(0) & \cdots & \rho(p-2)\\ \vdots & \vdots & \ddots & \vdots\\ \rho(p-1) & \rho(p-2) & \cdots & \rho(0) \end{pmatrix} \begin{pmatrix} \phi_1\\ \phi_2\\ \vdots\\ \phi_p \end{pmatrix} = \begin{pmatrix} \rho(1)\\ \rho(2)\\ \vdots\\ \rho(p) \end{pmatrix}. \end{align*} The Toeplitz structure reflects stationarity: entries depend only on lag differences. The assumption that $Z_t$ is uncorrelated with the past is essential; without it, multiplying by $X_{t-h}$ would leave extra covariance terms and the displayed linear system would no longer identify the autoregressive coefficients. The theorem is not an estimation result, since sample autocovariances only approximate the population quantities. Its importance is that it connects Hilbert-space linear projection, covariance structure, and the practical estimation algorithms used for autoregressions. [definition: Partial Autocorrelation] For a weakly stationary process $(X_t)$, the partial autocorrelation function is the scalar-valued map $\alpha: \mathbb N \to \mathbb R$ defined as follows. For $h\ge1$, $\alpha(h)$ is the coefficient of $X_t$ in the best linear prediction of $X_{t+h}$ from $X_{t+h-1},\dots,X_t$, after the intervening lags have been included. [/definition] For an $\operatorname{AR}(p)$ process, the partial autocorrelation is zero after lag $p$. This is why the sample partial autocorrelation function is a diagnostic for autoregressive order, while the autocorrelation function of an autoregression usually tails off rather than cutting off. [illustration:acf-pacf-identification-schematic] [example: Identification Heuristics] For a finite $\operatorname{MA}(q)$ model, \begin{align*} X_t &=Z_t+\theta_1Z_{t-1}+\cdots+\theta_qZ_{t-q} =\sum_{j=0}^{q}\theta_jZ_{t-j}, \end{align*} where $\theta_0=1$. By the *Linear Process Covariance Formula*, \begin{align*} \gamma(h) &=\sigma^2\sum_{j\in\mathbb Z}\theta_{j+h}\theta_j, \end{align*} with $\theta_j=0$ for $j\notin\{0,\dots,q\}$. If $h>q$, then $j\in\{0,\dots,q\}$ implies $j+h>q$, so $\theta_{j+h}\theta_j=0$ for every $j$. Hence \begin{align*} \gamma(h)=0 \qquad \text{for } h>q. \end{align*} By symmetry of autocovariance, $\gamma(-h)=\gamma(h)$, so \begin{align*} \rho(h)=\frac{\gamma(h)}{\gamma(0)}=0 \qquad \text{whenever } |h|>q. \end{align*} Thus the autocorrelation function of a finite moving average cuts off after lag $q$. For a causal $\operatorname{AR}(p)$ model, \begin{align*} X_t=\phi_1X_{t-1}+\cdots+\phi_pX_{t-p}+Z_t. \end{align*} Shifting the recursion forward by $h$ gives \begin{align*} X_{t+h} &=\phi_1X_{t+h-1}+\cdots+\phi_pX_{t+h-p}+Z_{t+h}. \end{align*} When $h>p$, all variables $X_{t+h-1},\dots,X_{t+h-p}$ are among the intervening predictors $X_{t+h-1},\dots,X_{t+1}$ already included before $X_t$. Since the current innovation $Z_{t+h}$ is uncorrelated with the past, adding $X_t$ contributes no further linear-prediction coefficient after those $p$ nearer lags have been included. Therefore \begin{align*} \alpha(h)=0 \qquad \text{for } h>p. \end{align*} This is the partial-autocorrelation cutoff for autoregressive order. For a causal $\operatorname{ARMA}(p,q)$ model, the representation \begin{align*} X_t=\sum_{j=0}^{\infty}\psi_jZ_{t-j}, \qquad \psi(z)=\frac{\theta(z)}{\phi(z)}, \end{align*} typically has infinitely many nonzero coefficients. Then \begin{align*} \gamma(h)=\sigma^2\sum_{j=0}^{\infty}\psi_{j+h}\psi_j \end{align*} need not become exactly zero after any finite lag. If the roots of $\phi$ are outside the unit circle, the impulse-response coefficients decay geometrically, possibly with oscillation when complex roots are present, so the autocorrelation and partial autocorrelation tend to tail off rather than cut off. These patterns are diagnostics, not proofs of model order: sample autocorrelations fluctuate around their population values, different nearby orders can produce similar plots, and transformations such as differencing may be needed before the stationary ARMA calculations apply. [/example] Chapter 2 turned stationarity into concrete model classes and showed how ARMA structure connects autocovariances to recursive representations. Chapter 3 takes the next step from model specification to statistical practice, asking how to estimate those parameters, assess uncertainty, and choose among competing ARMA orders from finite data. # 3. Estimation, Prediction, and Model Selection for ARMA Estimation and forecasting turn the ARMA models of Chapter 2 into statistical procedures. The central questions are: how should the unknown parameters be estimated from a finite sample, how can the fitted covariance structure be used for prediction, and how should competing orders be compared without fitting noise as signal? This chapter treats estimation and prediction as two faces of the same projection problem: likelihoods are built from one-step-ahead prediction errors, while forecasts are conditional linear projections computed from the fitted model. ## Estimating ARMA Parameters from a Finite Sample Given observations $x_1,\dots,x_n$ from a stationary ARMA process, the first difficulty is that the innovations $Z_t$ are not observed. Estimation therefore replaces the inaccessible infinite past by initial conditions, prediction residuals, or an exact Gaussian covariance calculation. The resulting estimators agree asymptotically under standard regularity assumptions, but their finite-sample behaviour and computational costs differ. We write a causal invertible ARMA$(p,q)$ model as \begin{align*} \phi(B)(X_t-\mu)=\theta(B)Z_t, \end{align*} where $Z_t$ is white noise with variance $\sigma^2$, $\phi(z)=1-\phi_1z-\cdots-\phi_pz^p$, and $\theta(z)=1+\theta_1z+\cdots+\theta_qz^q$. Causality and invertibility keep the AR and MA roots outside the closed unit disc, so the model has both an infinite moving-average representation and an infinite autoregressive representation. [definition: Conditional Sum Of Squares] Let $\Theta\subseteq \mathbb R^{p+q+1}$ be an admissible parameter set for $\vartheta=(\mu,\phi_1,\dots,\phi_p,\theta_1,\dots,\theta_q)$, and fix initialization data $x_0,x_{-1},\dots$ and $e_0,e_{-1},\dots$. For each $t=m+1,\dots,n$, the conditional residual is the map $e_t:\Theta\to\mathbb R$ defined recursively by \begin{align*} e_t(\vartheta)=x_t-\mu-\sum_{i=1}^p\phi_i(x_{t-i}-\mu)-\sum_{j=1}^q\theta_j e_{t-j}(\vartheta), \end{align*} where $m\ge \max\{p,q\}$. The conditional sum of squares is the map $S_n:\Theta\to[0,\infty)$ given by \begin{align*} S_n(\vartheta)=\sum_{t=m+1}^{n}e_t(\vartheta)^2. \end{align*} [/definition] The conditional least-squares estimator minimizes $S_n(\vartheta)$ over $\Theta$. For pure autoregressions, the residuals are linear in the AR coefficients and the least-squares problem reduces to ordinary regression. For models with moving-average terms, the residual recursion makes the objective nonlinear, and numerical optimization is usually required. This is the first place where the unobserved innovations matter statistically: different initializations define slightly different finite-sample objective functions, even though their effect disappears asymptotically under the usual stability assumptions. [example: Estimating AR Two By Yule Walker] Suppose $X_t$ is a mean-zero causal AR$(2)$ process \begin{align*} X_t=\phi_1X_{t-1}+\phi_2X_{t-2}+Z_t, \end{align*} where $(Z_t)$ is white noise with variance $\sigma^2$. Causality means $X_{t-1}$ and $X_{t-2}$ are linear functions of innovations dated at most $t-1$, so $\mathbb E[Z_tX_{t-h}]=0$ for $h\ge 1$. Multiplying the model equation by $X_{t-1}$ and taking expectations gives \begin{align*} \mathbb E[X_tX_{t-1}] &=\phi_1\mathbb E[X_{t-1}^2]+\phi_2\mathbb E[X_{t-2}X_{t-1}]+\mathbb E[Z_tX_{t-1}]\\ \gamma(1)&=\phi_1\gamma(0)+\phi_2\gamma(1)+0. \end{align*} Multiplying instead by $X_{t-2}$ gives \begin{align*} \mathbb E[X_tX_{t-2}] &=\phi_1\mathbb E[X_{t-1}X_{t-2}]+\phi_2\mathbb E[X_{t-2}^2]+\mathbb E[Z_tX_{t-2}]\\ \gamma(2)&=\phi_1\gamma(1)+\phi_2\gamma(0)+0. \end{align*} Thus the population Yule-Walker equations are \begin{align*} \begin{pmatrix} \gamma(0) & \gamma(1)\\ \gamma(1) & \gamma(0) \end{pmatrix} \begin{pmatrix} \phi_1\\ \phi_2 \end{pmatrix} = \begin{pmatrix} \gamma(1)\\ \gamma(2) \end{pmatrix}. \end{align*} The Yule-Walker estimator replaces $\gamma(h)$ by $\hat\gamma(h)$, so $\hat\phi_1,\hat\phi_2$ are defined by \begin{align*} \begin{pmatrix} \hat\gamma(0) & \hat\gamma(1)\\ \hat\gamma(1) & \hat\gamma(0) \end{pmatrix} \begin{pmatrix} \hat\phi_1\\ \hat\phi_2 \end{pmatrix} = \begin{pmatrix} \hat\gamma(1)\\ \hat\gamma(2) \end{pmatrix}. \end{align*} To estimate the innovation variance, multiply the model equation by $X_t$ and take expectations: \begin{align*} \mathbb E[X_t^2] &=\phi_1\mathbb E[X_{t-1}X_t]+\phi_2\mathbb E[X_{t-2}X_t]+\mathbb E[Z_tX_t]. \end{align*} Since \begin{align*} X_t=\phi_1X_{t-1}+\phi_2X_{t-2}+Z_t \end{align*} and $Z_t$ is uncorrelated with $X_{t-1}$ and $X_{t-2}$, \begin{align*} \mathbb E[Z_tX_t] &=\phi_1\mathbb E[Z_tX_{t-1}]+\phi_2\mathbb E[Z_tX_{t-2}]+\mathbb E[Z_t^2]\\ &=0+0+\sigma^2. \end{align*} Therefore \begin{align*} \gamma(0)=\phi_1\gamma(1)+\phi_2\gamma(2)+\sigma^2, \end{align*} and the fitted variance estimate is \begin{align*} \hat\sigma^2=\hat\gamma(0)-\hat\phi_1\hat\gamma(1)-\hat\phi_2\hat\gamma(2). \end{align*} The calculation shows why the AR$(2)$ parameters can be estimated from the first three autocovariances: the covariance equations close after lags $1$ and $2$. [/example] The Yule-Walker calculation is special to autoregressive models because the covariance equations close after finitely many lags. General ARMA estimation more often proceeds through likelihood, where the residuals are treated as Gaussian prediction errors after conditioning on the chosen starting values. The first likelihood version is conditional because it deliberately fixes or discards the pre-sample history needed to start the recursion. Defining it separately makes clear which objective is actually being optimized when estimation is implemented through residual sums of squares. [definition: Conditional Gaussian Likelihood] Let $\Theta\subseteq\mathbb R^{p+q+1}$ be the admissible parameter set used for the conditional residual maps $e_t:\Theta\to\mathbb R$. For a Gaussian ARMA model, the conditional Gaussian likelihood is the map $L_c:\Theta\times(0,\infty)\to(0,\infty)$ defined by \begin{align*} L_c(\vartheta,\sigma^2)=\prod_{t=m+1}^{n}\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{e_t(\vartheta)^2}{2\sigma^2}\right). \end{align*} The corresponding negative log-likelihood, up to constants independent of the parameters, is the map $\ell_c:\Theta\times(0,\infty)\to\mathbb R$ defined by \begin{align*} \ell_c(\vartheta,\sigma^2)=\frac{n-m}{2}\log\sigma^2+\frac{1}{2\sigma^2}\sum_{t=m+1}^{n}e_t(\vartheta)^2. \end{align*} [/definition] For fixed $\vartheta$, minimizing $\ell_c$ over $\sigma^2$ gives $\hat\sigma^2=S_n(\vartheta)/(n-m)$. Thus conditional Gaussian maximum likelihood and conditional least squares choose the same $\vartheta$ when the same initialization is used. The limitation is also visible from the formula: conditioning replaces the unknown pre-sample history by convention, so the likelihood is not exactly the joint density of $(X_1,\dots,X_n)$. To remove that approximation, one can model the full observed vector at once under its Gaussian covariance matrix. The exact likelihood below records this joint-density viewpoint and provides the benchmark that conditional and innovations-based computations approximate or re-express. [definition: Exact Gaussian Likelihood] Let $\Theta\subseteq\mathbb R^{p+q+1}$ be an admissible set such that, for each $(\vartheta,\sigma^2)\in\Theta\times(0,\infty)$, the ARMA autocovariance function determines a nonsingular covariance matrix $\Gamma_n(\vartheta,\sigma^2)\in\mathbb R^{n\times n}$. For $x=(x_1,\dots,x_n)^\top\in\mathbb R^n$ and $\iota=(1,\dots,1)^\top$, the exact Gaussian likelihood is the map $L:\Theta\times(0,\infty)\to(0,\infty)$ defined by \begin{align*} L(\vartheta,\sigma^2)=\frac{1}{(2\pi)^{n/2}\det(\Gamma_n(\vartheta,\sigma^2))^{1/2}} \exp\left(-\frac{1}{2}(x-\mu\iota)^\top\Gamma_n(\vartheta,\sigma^2)^{-1}(x-\mu\iota)\right). \end{align*} [/definition] Exact likelihood uses the joint distribution of the observed vector rather than conditioning on artificial starting values. Direct evaluation through the $n\times n$ covariance matrix is expensive for long series, but the innovations representation below rewrites the same likelihood as a product of prediction-error densities. [quotetheorem:3632] [citeproof:3632] The theorem motivates the innovations likelihood: compute the successive predictors and error variances from the model covariance sequence, then multiply the resulting Gaussian densities. Gaussianity is what upgrades orthogonality of the prediction errors to independence, so the product is a product of genuine conditional densities rather than just an orthogonal decomposition of second moments. Nonsingularity is needed because a zero prediction variance would make one observation an exact linear combination of its predecessors, producing a degenerate Gaussian law with no ordinary density on $\mathbb R^n$. The mean-zero assumption is only a notational simplification; a known or parametrized mean is first subtracted, as in the exact likelihood above. For non-Gaussian ARMA models the same prediction errors can still define useful least-squares or quasi-likelihood criteria, but the displayed factorization is then not the exact joint density without additional distributional assumptions. [quotetheorem:3633] This theorem is what turns Gaussian likelihood from a fitting rule into an inferential procedure. It justifies the usual practice of reporting approximate standard errors and confidence intervals for ARMA parameters, because the normalized estimation error has a limiting normal distribution with covariance determined by the likelihood curvature. The assumptions are not cosmetic: causality and invertibility keep the parametrization stable, while moment and identifiability hypotheses prevent the likelihood from flattening or converging to the wrong parameter. The result also marks a boundary of the chapter's estimation narrative. Conditional least squares and Gaussian likelihood may give similar point estimates in stable models, but asymptotic normality is the reason likelihood-based software can attach uncertainty estimates to those fits. If the innovations are heavy-tailed, the model is near a unit root, or the order has been selected after looking at the data, the displayed asymptotic approximation can be misleading and should be checked by diagnostics or resampling rather than treated as automatic. ## Linear Prediction And The Innovations Algorithm Forecasting asks for the best estimate of a future value from the information currently available. In a Gaussian ARMA model, the best mean-square predictor conditional on the past is linear; for general second-order models, the best linear predictor is still the orthogonal projection onto the closed linear span of the observed variables. The algorithms in this section compute these projections using only autocovariances, continuing the Hilbert-space projection viewpoint introduced in Chapter 1 and used for ARMA prediction in Chapter 2. [definition: Best Linear Predictor] Let $X_1,\dots,X_n,Y\in L^2(\Omega,\mathcal F,\mathbb P)$ have mean zero, and set $V_n=\operatorname{span}\{X_1,\dots,X_n\}\subseteq L^2(\Omega,\mathcal F,\mathbb P)$. The best linear prediction operator is the orthogonal projection $P_{V_n}:L^2(\Omega,\mathcal F,\mathbb P)\to V_n$. The best linear predictor of $Y$ from $X_1,\dots,X_n$ is $\hat Y=P_{V_n}Y$. The coefficient set is \begin{align*} \operatorname*{argmin}_{(a_1,\dots,a_n)\in\mathbb R^n}\mathbb E[(Y-a_1X_1-\cdots-a_nX_n)^2]. \end{align*} [/definition] The normal equations say that the forecast error must be orthogonal to each variable used for prediction. If $\Gamma_n=(\gamma(i-j))_{1\le i,j\le n}$ and $c=(\operatorname{Cov}(Y,X_1),\dots,\operatorname{Cov}(Y,X_n))^\top$, then the coefficient vector solves $\Gamma_n a=c$ whenever $\Gamma_n$ is nonsingular. Solving a fresh linear system for every forecast length is conceptually simple but computationally wasteful. As each new observation arrives, most of the old projection geometry should be reusable; the only genuinely new information is the component orthogonal to the previous observations. The recursive forecasting problem is therefore to update predictors and prediction-error variances from the covariance sequence without rebuilding the full normal-equation system at every step. The innovations algorithm supplies that update when the successive prediction errors remain nonzero. [explanation: Innovations Algorithm] The innovations algorithm is a recursive projection method, not a theorem-level structural claim. Starting from the covariance sequence, it orthogonalises $X_1,X_2,\ldots$ one observation at a time and records the one-step prediction errors. At step $t$, the new prediction error variance $v_t$ must be positive; then the update coefficients are obtained by taking covariances with the previous innovations and dividing by their variances. The output is the same best linear predictor that one would obtain by solving the full normal equations, but the recursion reuses the earlier orthogonalisation rather than starting again from a growing covariance matrix. [/explanation] The nonsingularity assumption ensures $v_t>0$ at every stage, so division by $v_t$ is legitimate and each new observation contributes a genuinely new direction in the prediction space. If $v_t=0$, then $X_t$ lies exactly in the span of its predecessors, the innovation basis stops growing, and the recursive update has a zero denominator. The algorithm is conceptual until the covariance sequence is known or estimated; in ARMA work that sequence comes from the fitted parameters, while in empirical forecasting it may be replaced by sample autocovariances. This distinction matters because errors in covariance estimation can be amplified when a prediction variance is small. [example: Forecasting An ARMA One One] Let $\mathcal F_n$ denote the information generated by the fitted past up to time $n$, so $X_n$ and the recovered fitted innovation $\hat Z_n$ are known at forecast origin $n$. For the ARMA$(1,1)$ model \begin{align*} X_t=\phi X_{t-1}+Z_t+\theta Z_{t-1}, \qquad |\phi|<1, \end{align*} the one-step forecast replaces the future innovation $Z_{n+1}$ by its linear prediction $0$ and replaces $Z_n$ by the recovered value $\hat Z_n$: \begin{align*} X_{n+1} &=\phi X_n+Z_{n+1}+\theta Z_n,\\ \hat X_{n+1|n} &=\phi X_n+0+\theta \hat Z_n\\ &=\phi X_n+\theta \hat Z_n. \end{align*} For two steps ahead, \begin{align*} X_{n+2} &=\phi X_{n+1}+Z_{n+2}+\theta Z_{n+1}. \end{align*} Both $Z_{n+2}$ and $Z_{n+1}$ are future innovations relative to time $n$, so their linear predictions from the fitted past are $0$. Therefore \begin{align*} \hat X_{n+2|n} &=\phi \hat X_{n+1|n}+0+\theta\cdot 0\\ &=\phi(\phi X_n+\theta \hat Z_n)\\ &=\phi^2X_n+\phi\theta \hat Z_n. \end{align*} The same calculation gives the recursion, for $h\ge 2$, \begin{align*} \hat X_{n+h|n} &=\phi \hat X_{n+h-1|n}, \end{align*} and hence \begin{align*} \hat X_{n+h|n} &=\phi^hX_n+\phi^{h-1}\theta\hat Z_n,\qquad h\ge 1. \end{align*} Thus the known current moving-average shock affects the whole forecast path through AR propagation, while all genuinely future innovations have forecast value zero. [/example] For stationary processes, the Toeplitz structure of the covariance matrix allows a more specialized recursion. The Durbin-Levinson algorithm is especially important for autoregressive fitting and for computing partial autocorrelations, because it updates the whole prediction problem when one additional lag is added rather than solving a new linear system from scratch. [definition: Partial Autocorrelation] For a mean-zero stationary process with autocorrelation function $\rho(h)$, the partial autocorrelation at lag $k$ is the coefficient of $X_1$ in the best linear predictor of $X_{k+1}$ from $X_k,X_{k-1},\dots,X_1$. [/definition] The partial autocorrelation measures the remaining linear relation between observations $k$ time steps apart after removing the intermediate linear effects. For an AR$(p)$ process, the population partial autocorrelation vanishes for all lags greater than $p$, which makes it a useful order-identification diagnostic. [explanation: Durbin-Levinson Recursion] The Durbin-Levinson recursion is the Toeplitz special case of recursive linear prediction. For a stationary process, the covariance matrix of $(X_t,\ldots,X_{t-k+1})$ has Toeplitz form, so the order-$k$ prediction coefficients can be updated from the order-$(k-1)$ coefficients by one scalar reflection coefficient. That final coefficient is the partial autocorrelation at lag $k$, and the prediction-error variance is updated at the same time. Computationally, this replaces repeated solution of growing Toeplitz systems by a stable one-step update whenever the prediction variances stay positive. [/explanation] The last coefficient $\phi_{k,k}$ is the partial autocorrelation at lag $k$. The Toeplitz hypothesis is the stationarity input: without it the backward and forward prediction equations do not share the same covariance structure, so this compact recursion is replaced by the more general innovations algorithm. Nonsingularity is again essential because $v_{k-1}=0$ would mean the next normal equation is not an invertible prediction problem. A simple boundary case is a deterministic periodic sequence viewed as a second-order process; after enough observations, a future value may be exactly linearly determined, and the variance update reaches zero. Durbin-Levinson therefore gives both a fast solver for stationary Toeplitz prediction systems and a diagnostic route to AR order, but its partial-autocorrelation cutoff interpretation is reliable only when the fitted stationary AR model is a plausible description of the dependence. ## Model Selection And Diagnostic Checking After fitting several candidate ARMA models, the statistical problem changes from estimation within one model to comparison between models. A model with more parameters can reduce residual sums of squares by adapting to random fluctuations, so fit must be balanced against complexity. Information criteria, residual tests, and diagnostic plots provide complementary checks rather than a mechanical decision rule. [definition: Akaike Information Criterion] For a fitted model with maximized likelihood $\hat L$ and $k$ estimated parameters, the Akaike information criterion is \begin{align*} \operatorname{AIC}=-2\log \hat L+2k. \end{align*} [/definition] AIC estimates out-of-sample predictive discrepancy and tends to favour models with good forecast performance. Because its penalty does not grow with $n$, it may choose models larger than the data-generating finite-order ARMA model even when such a model exists. This makes AIC a prediction-oriented criterion rather than a consistent order-identification rule. When the goal is order identification rather than only predictive fit, a fixed penalty per parameter is too weak as the sample size grows. This motivates a second information criterion whose penalty depends explicitly on the sample size. The next definition introduces that stronger large-sample penalty and gives a criterion aimed at separating genuine structure from small likelihood gains. [definition: Bayesian Information Criterion] For a fitted model with maximized likelihood $\hat L$, sample size $n$, and $k$ estimated parameters, the Bayesian information criterion is \begin{align*} \operatorname{BIC}=-2\log \hat L+k\log n. \end{align*} [/definition] BIC imposes a stronger penalty for large samples. Under suitable finite-dimensional assumptions it is consistent for selecting the true order among candidates, but this consistency is not the same as optimal forecasting under misspecification. After an order has been chosen, the fitted residuals must still be checked, since a low information criterion does not guarantee that the remaining dependence has been removed. [definition: Residual Autocorrelation Function] For fitted residuals $\hat e=(\hat e_1,\dots,\hat e_n)\in\mathbb R^n$ with sample mean $\bar e$, the residual autocorrelation at lag $h\in\{0,1,\dots,n-1\}$ is the map $\hat r:\{0,1,\dots,n-1\}\to[-1,1]$ defined by \begin{align*} \hat r(h)=\frac{\sum_{t=h+1}^{n}(\hat e_t-\bar e)(\hat e_{t-h}-\bar e)}{\sum_{t=1}^{n}(\hat e_t-\bar e)^2}. \end{align*} [/definition] If the ARMA model has captured the linear dependence in the series, the fitted innovations should resemble white noise. Residual autocorrelations outside their sampling range indicate remaining serial dependence, structural change, seasonality, or an omitted transformation. [definition: Ljung Box Statistic] For a fixed maximum lag $H\in\{1,\dots,n-1\}$, residual autocorrelations $\hat r(1),\dots,\hat r(H)$ from a fitted model with sample size $n$ define the Ljung-Box statistic map $Q:\{1,\dots,n-1\}\to[0,\infty)$ by \begin{align*} Q(H)=n(n+2)\sum_{h=1}^{H}\frac{\hat r(h)^2}{n-h}. \end{align*} [/definition] For a fitted ARMA$(p,q)$ model, $Q_H$ is commonly compared with a $\chi^2_{H-p-q}$ reference distribution when $H>p+q$ and the sample is large. The reference distribution is an approximation, so the statistic should be read together with residual plots, residual ACF plots, and sensitivity to the chosen maximum lag $H$. [example: Residual ACF Diagnostic For A Misspecified AR One] Suppose the data satisfy the mean-zero AR$(2)$ equation \begin{align*} X_t=\phi_1X_{t-1}+\phi_2X_{t-2}+Z_t, \end{align*} but an AR$(1)$ model with fitted coefficient $\hat a$ is used. The fitted residual is \begin{align*} \hat e_t &=X_t-\hat a X_{t-1}\\ &=\phi_1X_{t-1}+\phi_2X_{t-2}+Z_t-\hat a X_{t-1}\\ &=Z_t+\phi_2X_{t-2}+(\phi_1-\hat a)X_{t-1}. \end{align*} Thus the residual is not just the innovation $Z_t$: it also contains the omitted lag-two term $\phi_2X_{t-2}$ and the estimation-error term $(\phi_1-\hat a)X_{t-1}$. To see why lag-two residual correlation can appear, first ignore the estimation error by setting $\hat a=\phi_1$. Then \begin{align*} u_t:=X_t-\phi_1X_{t-1}=Z_t+\phi_2X_{t-2}. \end{align*} For lag $2$, \begin{align*} \operatorname{Cov}(u_t,u_{t-2}) &=\operatorname{Cov}(Z_t+\phi_2X_{t-2},\,Z_{t-2}+\phi_2X_{t-4})\\ &=\operatorname{Cov}(Z_t,Z_{t-2}) +\phi_2\operatorname{Cov}(Z_t,X_{t-4})\\ &\qquad+\phi_2\operatorname{Cov}(X_{t-2},Z_{t-2}) +\phi_2^2\operatorname{Cov}(X_{t-2},X_{t-4}). \end{align*} Because $Z_t$ is uncorrelated with innovations and observations dated before $t$, the first two terms are $0$. Also \begin{align*} X_{t-2}=\phi_1X_{t-3}+\phi_2X_{t-4}+Z_{t-2}, \end{align*} so \begin{align*} \operatorname{Cov}(X_{t-2},Z_{t-2}) &=\phi_1\operatorname{Cov}(X_{t-3},Z_{t-2}) +\phi_2\operatorname{Cov}(X_{t-4},Z_{t-2}) +\operatorname{Var}(Z_{t-2})\\ &=0+0+\sigma^2. \end{align*} Writing $\gamma(2)=\operatorname{Cov}(X_{t-2},X_{t-4})$, we obtain \begin{align*} \operatorname{Cov}(u_t,u_{t-2}) &=\phi_2\sigma^2+\phi_2^2\gamma(2). \end{align*} This quantity is not forced to be zero when $\phi_2\ne 0$, so the residual autocorrelation at lag $2$ can have a systematic component. With the actual fitted coefficient $\hat a$, the residual is \begin{align*} \hat e_t=u_t+(\phi_1-\hat a)X_{t-1}. \end{align*} Therefore products such as $\hat e_t\hat e_{t-h}$ contain the omitted-term products $u_tu_{t-h}$ together with terms involving $X_{t-1}$ and $X_{t-h-1}$. The sample residual ACF may therefore show correlation at lag $2$ and nearby lags, indicating that the fitted AR$(1)$ model has not removed all linear dependence from an AR$(2)$ series. [/example] This diagnostic example shows why model selection cannot stop at minimizing AIC or BIC. Residual tests ask a different question: conditional on the fitted order and estimated parameters, is there still visible serial structure left in the supposed innovations? [remark: Overfitting Risk] Adding AR or MA terms can reduce in-sample prediction errors while increasing parameter uncertainty and producing unstable forecasts. Extra nearly cancelling AR and MA factors are especially dangerous: they may improve the likelihood slightly while making the fitted model hard to interpret and sensitive to small data changes. [/remark] A practical ARMA analysis therefore ends with several checks. The chosen model should have a defensible order, estimated roots outside the unit circle, residuals with little remaining autocorrelation, and forecasts that remain stable under nearby candidate specifications. Information criteria summarize part of this evidence, but the final judgement is a modelling decision about the time series rather than a property of a formula. Chapter 3 developed estimation, prediction, and model selection under a stationary ARMA framework, where fitted models are judged by residual behavior and forecast performance. Chapter 4 relaxes the stationarity assumption and explains how differencing, unit roots, and ARIMA models separate persistent stochastic trends from short-run dependence. # 4. Nonstationarity, Differencing, and ARIMA Models This chapter builds on the stationary ARMA theory of Chapters 2 and 3. In the stationary setting, forecasts, autocovariances, and innovations were all interpreted around a fixed mean level; here the observed level may itself drift, so we must separate persistence in the level from short-run dependence in changes. The main tools are lag polynomials, differencing operators, integrated processes, ARIMA models, and unit-root diagnostics. Nonstationary time series force us to separate two questions that were merged in the stationary ARMA theory: what part of the observed level is persistent, and what part of its increments has a stable short-run dependence structure? The integrated-process viewpoint recovers stationarity after applying difference operators. The resulting ARIMA models keep the ARMA machinery from the previous chapters, but reinterpret forecasts, trends, and shocks through cumulative sums. The main tension is that deterministic trends, stochastic trends, and nearly persistent stationary dynamics can look similar in finite samples. We therefore introduce unit-root tests as diagnostic tools, while keeping track of their nonstandard limiting theory and the risks of using pretests as automatic model-selection rules. ## Integrated Processes and Difference Operators A series with a changing level need not be unusable: the question is whether differencing removes the persistent component and leaves a stationary process. Differencing is the basic operation that turns level statements into statements about changes. [definition: Backshift Operator] Let $\mathbb R^{\mathbb Z}$ denote the set of real-valued two-sided sequences indexed by $\mathbb Z$. The backshift operator is the map \begin{align*} B: \mathbb R^{\mathbb Z} &\to \mathbb R^{\mathbb Z}, & (X_t)_{t\in\mathbb Z} &\mapsto (X_{t-1})_{t\in\mathbb Z}. \end{align*} For a polynomial $a(z)=a_0+a_1z+\cdots+a_mz^m$, define $a(B):\mathbb R^{\mathbb Z}\to\mathbb R^{\mathbb Z}$ by \begin{align*} a(B)X_t = a_0X_t+a_1X_{t-1}+\cdots+a_mX_{t-m}. \end{align*} [/definition] The first difference operator is the map $\Delta=1-B:\mathbb R^{\mathbb Z}\to\mathbb R^{\mathbb Z}$, so $\Delta X_t=X_t-X_{t-1}$. Higher differences are powers $\Delta^d=(1-B)^d$, and the seasonal difference operator with period $s$ is $1-B^s:\mathbb R^{\mathbb Z}\to\mathbb R^{\mathbb Z}$. Many economic and environmental series are not weakly stationary in levels, but their changes may have stable second-order behaviour. The key modelling question is how many ordinary differences are needed before stationarity appears, because that number determines whether shocks have temporary or permanent effects on the level. [definition: Integrated Process] A real-valued time series $(X_t)_{t\in\mathbb Z}$ is integrated of order $d$, written $X_t\sim I(d)$, if $\Delta^d X_t$ is weakly stationary and $\Delta^{d-1}X_t$ is not weakly stationary, for an integer $d\ge 1$. [/definition] This is a statement about the order of differencing needed to obtain second-order stationarity. It does not say that differencing is always statistically harmless: differencing changes the object being forecast from a level to an increment. [example: Random Walk With Drift] Let $(Z_t)$ be white noise with $\mathbb E[Z_t]=0$ and $\operatorname{Var}(Z_t)=\sigma^2$, and take $X_0$ to be fixed. From \begin{align*} X_t-X_{t-1}=\mu+Z_t, \end{align*} summing both sides from $1$ to $t$ gives, for $t\ge 1$, \begin{align*} \sum_{k=1}^t (X_k-X_{k-1}) &=\sum_{k=1}^t(\mu+Z_k) \\ X_t-X_0 &=\mu t+\sum_{k=1}^t Z_k, \end{align*} so \begin{align*} X_t=X_0+\mu t+\sum_{k=1}^t Z_k. \end{align*} Taking expectations, \begin{align*} \mathbb E[X_t] &=X_0+\mu t+\sum_{k=1}^t\mathbb E[Z_k] \\ &=X_0+\mu t. \end{align*} Thus the mean depends on $t$ when $\mu\ne0$. If $\mu=0$, stationarity still fails when $\sigma^2>0$, because white-noise terms are uncorrelated and therefore \begin{align*} \operatorname{Var}(X_t) &=\operatorname{Var}\left(\sum_{k=1}^t Z_k\right) \\ &=\sum_{k=1}^t\operatorname{Var}(Z_k) +2\sum_{1\le i<j\le t}\operatorname{Cov}(Z_i,Z_j) \\ &=t\sigma^2+2\sum_{1\le i<j\le t}0 \\ &=t\sigma^2, \end{align*} which depends on $t$. Hence $(X_t)$ is not weakly stationary for $\sigma^2>0$. On the other hand, \begin{align*} \Delta X_t &=X_t-X_{t-1} \\ &=\mu+Z_t. \end{align*} Therefore \begin{align*} \mathbb E[\Delta X_t]&=\mu+\mathbb E[Z_t]=\mu, \end{align*} and, for any lag $h$, \begin{align*} \operatorname{Cov}(\Delta X_t,\Delta X_{t-h}) &=\operatorname{Cov}(\mu+Z_t,\mu+Z_{t-h}) \\ &=\operatorname{Cov}(Z_t,Z_{t-h}) \\ &= \begin{cases} \sigma^2, & h=0,\\ 0, & h\ne0. \end{cases} \end{align*} The differenced process has constant mean and autocovariance depending only on the lag, so $\Delta X_t=\mu+Z_t$ is weakly stationary. Differencing removes the accumulated shocks from the level and leaves a stable increment process. [/example] The random walk with drift shows why a linear-looking trajectory can be caused either by deterministic drift or by the accumulation of random shocks. The next distinction formalises that contrast. [definition: Deterministic Trend] A time series $(X_t)$ has a deterministic polynomial trend of degree $r$ if it can be written as \begin{align*} X_t=m(t)+Y_t, \end{align*} where $m(t)$ is a polynomial in $t$ of degree $r$ and $(Y_t)$ is weakly stationary. [/definition] A deterministic trend can be removed by regression on time, after which the residual process is modeled as stationary. This differs from a stochastic trend, where the persistent component is created by accumulated shocks. For autoregressive models, the algebraic signal of a stochastic trend is a factor that prevents the recursion from pulling the series back toward a stable mean. Detecting that factor is essential because detrending and differencing answer different problems and lead to different forecast uncertainty. [definition: Unit Root] Let $\phi(z)=1-\phi_1z-\cdots-\phi_pz^p$. An autoregressive equation \begin{align*} \phi(B)X_t=Z_t \end{align*} has a unit root if $\phi(1)=0$. [/definition] Having a unit root means that $1-B$ is a factor of the autoregressive polynomial. The series then contains a stochastic trend unless the corresponding component is cancelled elsewhere in the model. The next issue is what remains after that single nonstationary factor is removed. If differencing leaves an autoregressive polynomial with another unit root, the process is still nonstationary; if the remaining polynomial is stable, the differenced series can be handled by stationary ARMA theory. [quotetheorem:3636] [citeproof:3636] The important point is that the differenced series carries the stationary ARMA structure. The simple unit-root hypothesis matters because exactly one factor $1-B$ is removed; if instead $(1-B)^2X_t=Z_t$, then $\Delta X_t$ is still a random walk and is not weakly stationary. The root condition on $\psi$ prevents further autoregressive nonstationarity in the differenced process; for example, $\psi(z)=1-z$ would leave another unit root after differencing. The white-noise assumption separates the autoregressive source of persistence from short-run innovation dependence, so the theorem identifies persistence through the polynomial factorisation rather than through serially correlated errors. The original level is recovered only by summing the increments, so shocks have permanent effects on the level. ## ARIMA Models and Forecasts After Differencing Once differencing has produced a stationary series, the next question is how to model the dependence in those differences. ARIMA models do this by applying an ARMA model to $\Delta^dX_t$ rather than to $X_t$ itself. [definition: Arima Model] Let $d\ge0$ be an integer. A real-valued time series $(X_t)$ follows an $\operatorname{ARIMA}(p,d,q)$ model if \begin{align*} \phi(B)\Delta^d X_t = c + \theta(B)Z_t, \end{align*} where $(Z_t)$ is white noise with mean $0$ and variance $\sigma^2$, $\phi(z)=1-\phi_1z-\cdots-\phi_pz^p$, and $\theta(z)=1+\theta_1z+\cdots+\theta_qz^q$. [/definition] Here $c$ is a mean term for the differenced process. When $d=1$, it becomes a drift term in the level, because summing a nonzero mean increment produces a linear mean path. For identification it is usual to take $\phi$ and $\theta$ to be coprime; for innovation recovery from observations, the moving-average polynomial is usually assumed invertible, with its zeros outside the closed unit disk. Defining an ARIMA equation does not by itself guarantee that the differenced process is a stationary ARMA process. The autoregressive part still needs the same root condition as before, now applied after the differencing factors have been separated from the stationary dynamics. [quotetheorem:3637] [citeproof:3637] Thus ARIMA is not a stationarity assumption on the observed levels. It is a stationarity assumption on the appropriately differenced series, and the autoregressive root condition is what makes the stationary ARMA theorem available after differencing. The result also says less than it may first appear to say: when $d>0$, the level process generally has a variance that changes with time, and its forecasts inherit accumulated future shocks. If the root condition fails, the differenced process may still contain a unit root; if the model is overdifferenced, a stationary level process can be turned into a difference with unnecessary moving-average dependence, often visible as strong negative autocorrelation at lag one. Causality and, when innovations are to be recovered uniquely, invertibility remain separate modelling conventions rather than consequences of differencing alone. [example: Arima Zero One One And Simple Exponential Smoothing] Let $(X_t)$ satisfy \begin{align*} \Delta X_t=Z_t+\theta Z_{t-1}, \qquad |\theta|<1, \end{align*} and let $\mathcal F_t$ denote the information available through time $t$, including the recovered innovations up to $Z_t$. Since $\Delta X_{t+1}=X_{t+1}-X_t$, the model at time $t+1$ gives \begin{align*} X_{t+1}-X_t&=Z_{t+1}+\theta Z_t,\\ X_{t+1}&=X_t+Z_{t+1}+\theta Z_t. \end{align*} Taking conditional expectations given $\mathcal F_t$, using that $X_t$ and $Z_t$ are $\mathcal F_t$-measurable and that the next innovation has conditional mean $0$, gives \begin{align*} \hat X_{t+1\mid t} &=\mathbb E[X_{t+1}\mid \mathcal F_t]\\ &=\mathbb E[X_t+Z_{t+1}+\theta Z_t\mid \mathcal F_t]\\ &=X_t+\mathbb E[Z_{t+1}\mid \mathcal F_t]+\theta Z_t\\ &=X_t+\theta Z_t. \end{align*} The current innovation is the current forecast error. Indeed, applying the same formula one period earlier gives \begin{align*} \hat X_{t\mid t-1}=X_{t-1}+\theta Z_{t-1}, \end{align*} while the model equation at time $t$ gives \begin{align*} X_t=X_{t-1}+Z_t+\theta Z_{t-1}. \end{align*} Subtracting the forecast from the observation, \begin{align*} X_t-\hat X_{t\mid t-1} &=(X_{t-1}+Z_t+\theta Z_{t-1})-(X_{t-1}+\theta Z_{t-1})\\ &=Z_t. \end{align*} Substituting this into the one-step forecast formula, \begin{align*} \hat X_{t+1\mid t} &=X_t+\theta(X_t-\hat X_{t\mid t-1})\\ &=(1+\theta)X_t-\theta\hat X_{t\mid t-1}. \end{align*} Writing $\alpha=1+\theta$ turns this into \begin{align*} \hat X_{t+1\mid t} =\alpha X_t+(1-\alpha)\hat X_{t\mid t-1}, \end{align*} which is the simple exponential smoothing recursion. In the usual smoothing range $0<\alpha<1$, this corresponds to $-1<\theta<0$. [/example] This example explains why exponential smoothing often forecasts well for series with a moving local level: it is implicitly using an integrated moving-average model for the changes. Many series have dependence at both adjacent times and at repeated seasonal positions, such as the same month in successive years. Ordinary differencing can remove a local level, but it does not by itself remove a yearly seasonal level or seasonal autocorrelation. A model for such data needs both nonseasonal and seasonal factors, with the seasonal factors acting through powers of the backshift operator. [definition: Seasonal Arima Model] Let $s\ge2$ be an integer seasonal period. A time series $(X_t)$ follows an $\operatorname{ARIMA}(p,d,q)(P,D,Q)_s$ model if \begin{align*} \Phi(B^s)\phi(B)(1-B^s)^D(1-B)^dX_t = c+\Theta(B^s)\theta(B)Z_t, \end{align*} where $\phi,\theta$ are nonseasonal autoregressive and moving-average polynomials, and $\Phi,\Theta$ are seasonal autoregressive and moving-average polynomials. [/definition] The seasonal difference compares an observation with its value one seasonal cycle earlier. For monthly data, $s=12$, so $(1-B^{12})X_t=X_t-X_{t-12}$. [example: Seasonal Airline Model] For a monthly series, consider \begin{align*} (1-B)(1-B^{12})X_t=(1+\theta B)(1+\theta_sB^{12})Z_t. \end{align*} The left-hand side applies one ordinary difference and one seasonal difference. Expanding the operator product gives \begin{align*} (1-B)(1-B^{12}) &=1-B^{12}-B+B^{13}, \end{align*} so \begin{align*} (1-B)(1-B^{12})X_t &=X_t-X_{t-12}-X_{t-1}+X_{t-13}. \end{align*} Equivalently, applying the seasonal difference first, \begin{align*} (1-B)(1-B^{12})X_t &=(1-B)(X_t-X_{t-12})\\ &=(X_t-X_{t-12})-(X_{t-1}-X_{t-13}), \end{align*} so the model compares the current annual change with the previous month's annual change. The moving-average side has one nonseasonal lag and one seasonal lag. Expanding the product, \begin{align*} (1+\theta B)(1+\theta_sB^{12}) &=1+\theta_sB^{12}+\theta B+\theta\theta_sB^{13}, \end{align*} and therefore \begin{align*} (1+\theta B)(1+\theta_sB^{12})Z_t &=Z_t+\theta Z_{t-1}+\theta_sZ_{t-12}+\theta\theta_sZ_{t-13}. \end{align*} Thus the equation can be written as \begin{align*} X_t-X_{t-1}-X_{t-12}+X_{t-13} = Z_t+\theta Z_{t-1}+\theta_sZ_{t-12}+\theta\theta_sZ_{t-13}. \end{align*} Comparing this with the definition of an $\operatorname{ARIMA}(p,d,q)(P,D,Q)_s$ model, there are no autoregressive factors, one ordinary difference, one ordinary moving-average factor, one seasonal difference with $s=12$, and one seasonal moving-average factor. Hence the model is $\operatorname{ARIMA}(0,1,1)(0,1,1)_{12}$. The ordinary difference removes persistence in the level, the seasonal difference removes persistence at the annual frequency, and a single innovation enters the differenced equation at lags $0$, $1$, $12$, and $13$. [/example] Forecasts from differenced models must be interpreted on the original scale by undoing the differences. For $d=1$, a forecast of future increments is cumulatively summed from the last observed level; for seasonal differencing, the forecast also uses the corresponding previously observed season as an anchor. [remark: Forecast Variance Under Integration] Forecast uncertainty in an integrated model usually grows with the horizon because future shocks are accumulated. In a stationary ARMA model, the long-horizon forecast variance approaches the unconditional variance; in a random walk, the $h$-step forecast error variance is $h\sigma^2$. [/remark] ## Stochastic Trends and the Beveridge Nelson Decomposition When a differenced process is stationary, we can ask how much of the level is permanent and how much is transitory. The Beveridge-Nelson decomposition answers this by splitting an integrated process into a random-walk component and a stationary remainder. [quotetheorem:3638] For modeling, the decomposition says that the long-run multiplier $\Psi(1)$ is the permanent effect of a unit innovation on the level, while the remaining summable tail is transitory. This is the interpretation needed for forecast paths and for deciding whether shocks should be described as persistent or mean-reverting. [explanation: Permanent And Transitory Shocks] The coefficient $\Psi(1)$ is the long-run effect of an innovation on the level. If $\Psi(1)\ne0$, a shock contributes permanently to the stochastic trend. If $\Psi(1)=0$, the accumulated effect cancels in the long run, and the process behaves more like a trend-stationary series after accounting for deterministic terms. The decomposition also clarifies why differencing is a modeling choice with interpretive consequences. Modeling $\Delta X_t$ describes the short-run law of motion, while the decomposition describes how those short-run shocks accumulate into the observed level. [/explanation] The summability assumptions are what make this split stable. Absolute summability gives a well-defined stationary short-run component, while the weighted condition $\sum_j j|\psi_j|<\infty$ controls the accumulated tail terms that remain after the long-run multiplier has been separated. Without this tail control, the proposed transitory component need not be stationary, so the permanent-transitory interpretation can break down even when the differenced process is stationary. [example: Long Run Multiplier In An Integrated Arma Model] Suppose \begin{align*} \Delta X_t=\frac{1+\theta B}{1-\rho B}Z_t, \qquad |\rho|<1. \end{align*} Because $|\rho|<1$, the geometric expansion of the lag-polynomial inverse is \begin{align*} \frac{1}{1-\rho B} &=1+\rho B+\rho^2B^2+\rho^3B^3+\cdots. \end{align*} Multiplying by $1+\theta B$ gives \begin{align*} \frac{1+\theta B}{1-\rho B} &=(1+\theta B)(1+\rho B+\rho^2B^2+\rho^3B^3+\cdots)\\ &=1+\rho B+\rho^2B^2+\rho^3B^3+\cdots\\ &\qquad+\theta B+\theta\rho B^2+\theta\rho^2B^3+\theta\rho^3B^4+\cdots\\ &=1+(\rho+\theta)B+(\rho^2+\theta\rho)B^2+(\rho^3+\theta\rho^2)B^3+\cdots. \end{align*} Thus the differenced process has linear-process form \begin{align*} \Delta X_t &=Z_t+\sum_{j=1}^{\infty}\left(\rho^j+\theta\rho^{j-1}\right)Z_{t-j}. \end{align*} The coefficient sequence is therefore \begin{align*} \psi_0&=1,\\ \psi_j&=\rho^j+\theta\rho^{j-1}, \qquad j\ge1. \end{align*} The long-run multiplier is the sum of these impulse-response coefficients: \begin{align*} \Psi(1) &=\sum_{j=0}^{\infty}\psi_j\\ &=1+\sum_{j=1}^{\infty}\left(\rho^j+\theta\rho^{j-1}\right)\\ &=1+\sum_{j=1}^{\infty}\rho^j+\theta\sum_{j=1}^{\infty}\rho^{j-1}\\ &=1+\frac{\rho}{1-\rho}+\theta\frac{1}{1-\rho}\\ &=\frac{1-\rho}{1-\rho}+\frac{\rho}{1-\rho}+\frac{\theta}{1-\rho}\\ &=\frac{1+\theta}{1-\rho}. \end{align*} By the *Beveridge-Nelson Decomposition*, this long-run multiplier is the permanent effect of a unit innovation on the level $X_t$. Hence a unit shock changes the stochastic trend in $X_t$ by $(1+\theta)/(1-\rho)$, and this effect becomes larger in magnitude as $\rho$ moves closer to $1$ when $1+\theta$ is held fixed. [/example] ## Unit Root Testing and the Limits of Pretesting The practical modeling question is often whether a series should be treated as stationary, trend-stationary, or difference-stationary. Unit-root tests formalise this question, but their null distributions are not the usual normal or Student distributions. [definition: Dickey Fuller Regression] The Dickey-Fuller regression for testing a unit root in an autoregressive model with possible deterministic terms is \begin{align*} \Delta X_t = a + bt + \gamma X_{t-1}+u_t. \end{align*} The unit-root null hypothesis is $H_0:\gamma=0$. [/definition] Under the alternative $\gamma<0$, the autoregressive root is less than one in magnitude in the corresponding AR(1) formulation. The deterministic terms included in the regression change the limiting distribution, so the version of the test must match the deterministic specification. The testing difficulty is that under the unit-root null, the regressor $X_{t-1}$ is itself an accumulated shock rather than a stationary variable. As a result, the usual least-squares $t$ distribution is not valid. A separate asymptotic result is needed to identify the nonstandard null law that replaces the familiar regression distribution. In the statement below, $\mathcal L^1$ denotes one-dimensional Lebesgue measure, so integrals such as $\int_0^1 W(r)^2\,d\mathcal L^1(r)$ are ordinary integrals over the unit interval. [quotetheorem:3639] The theorem explains why Dickey-Fuller critical values depend on the unit-root asymptotics rather than on ordinary regression tables. In applications, however, the next obstacle is serial dependence in the regression disturbance: if the short-run dynamics remain in the error term, the basic Dickey-Fuller regression is misspecified even when the unit-root question is the same. The augmented regression introduces lagged differences precisely to separate this short-run dependence from the coefficient that tests persistence. [definition: Augmented Dickey Fuller Regression] The augmented Dickey-Fuller regression of order $k$ is \begin{align*} \Delta X_t=a+bt+\gamma X_{t-1}+\sum_{j=1}^{k}\alpha_j\Delta X_{t-j}+u_t. \end{align*} The unit-root null hypothesis is $H_0:\gamma=0$. [/definition] The lagged differences are included to absorb short-run serial correlation in the errors. The limiting null distribution remains Dickey-Fuller type under suitable conditions, but the finite-sample behavior depends on lag choice, deterministic terms, and the strength of the moving-average component. [remark: Pretesting Risk] A unit-root pretest should not be treated as a mechanical instruction to difference or not difference. Tests have low power against highly persistent stationary alternatives, and overdifferencing can introduce unnecessary moving-average dependence. Model selection should combine plots, autocorrelation diagnostics, subject-matter constraints, forecast performance, and the consequences of interpreting shocks as permanent. [/remark] The following example illustrates the practical difficulty behind that warning. A stationary autoregression can have such slow mean reversion that, over the sample sizes usually available in economic and environmental data, its path and autocorrelation function look close to those of a unit-root process. [example: Nearly Integrated Autoregression] Consider the stationary AR(1) model \begin{align*} X_t=0.98X_{t-1}+Z_t, \end{align*} where $(Z_t)$ is i.i.d. white noise with $\mathbb E[Z_t]=0$ and $\operatorname{Var}(Z_t)=\sigma^2$, and the process is initialized in its stationary distribution. Since $|0.98|<1$, the stationary causal representation is \begin{align*} X_t &=Z_t+0.98Z_{t-1}+0.98^2Z_{t-2}+\cdots \\ &=\sum_{j=0}^{\infty}0.98^jZ_{t-j}. \end{align*} Thus \begin{align*} \mathbb E[X_t] &=\sum_{j=0}^{\infty}0.98^j\mathbb E[Z_{t-j}] =0, \end{align*} and, using independence of distinct innovations, \begin{align*} \operatorname{Var}(X_t) &=\operatorname{Var}\left(\sum_{j=0}^{\infty}0.98^jZ_{t-j}\right)\\ &=\sum_{j=0}^{\infty}0.98^{2j}\operatorname{Var}(Z_{t-j})\\ &=\sigma^2\sum_{j=0}^{\infty}0.98^{2j}\\ &=\frac{\sigma^2}{1-0.98^2}. \end{align*} For $h\ge0$, write \begin{align*} X_{t+h} &=0.98^hX_t+\sum_{j=0}^{h-1}0.98^jZ_{t+h-j}, \end{align*} which follows by iterating the autoregressive equation $h$ times. The future innovations in the second term are independent of $X_t$, so \begin{align*} \operatorname{Cov}(X_{t+h},X_t) &=\operatorname{Cov}\left(0.98^hX_t+\sum_{j=0}^{h-1}0.98^jZ_{t+h-j},X_t\right)\\ &=0.98^h\operatorname{Var}(X_t)+\sum_{j=0}^{h-1}0.98^j\operatorname{Cov}(Z_{t+h-j},X_t)\\ &=0.98^h\operatorname{Var}(X_t). \end{align*} Therefore the autocorrelation at lag $h$ is \begin{align*} \rho(h) &=\frac{\operatorname{Cov}(X_{t+h},X_t)}{\operatorname{Var}(X_t)} =0.98^h. \end{align*} This decay is slow: for example, \begin{align*} 0.98^{10}&\approx0.817,\\ 0.98^{50}&\approx0.364,\\ 0.98^{100}&\approx0.133. \end{align*} In the Dickey-Fuller form, \begin{align*} \Delta X_t &=X_t-X_{t-1}\\ &=0.98X_{t-1}+Z_t-X_{t-1}\\ &=-0.02X_{t-1}+Z_t. \end{align*} The unit-root null corresponds to coefficient $0$ on $X_{t-1}$, while the true coefficient here is only $-0.02$, so moderate samples may not contain enough mean reversion to distinguish the stationary model from a unit-root model. The forecast contrast is nevertheless real: under this stationary AR(1), \begin{align*} \mathbb E[X_{t+h}\mid X_t] &=0.98^hX_t, \end{align*} and $0.98^hX_t\to0$ as $h\to\infty$, whereas a random walk forecast stays at the current level. [/example] Chapter 4 focused on what to do when stationarity is not plausible, using differencing and ARIMA models to restore a workable second-order description. Chapter 5 returns to the stationary case and asks what linear prediction looks like in its most general form, before any finite ARMA structure is imposed. # 5. Wold Decomposition and Linear Prediction Theory This chapter explains why linear prediction in stationary time series has a universal structure. Chapters 2 and 3 described ARMA models and their prediction algorithms; Chapter 6 will develop the spectral representation in detail. Here we ask what remains true before choosing a finite parametric model. The answer is Wold's decomposition: every second-order stationary process splits into a perfectly predictable part and a moving-average part driven by orthogonal innovations. ## Linear Prediction Spaces and Innovations Prediction begins with a choice of information. In nonlinear probability theory the past is represented by a sigma-field, but ARMA theory uses only closed linear spans in $L^2$. This distinction matters because the best linear predictor is an orthogonal projection, while the best predictor among all square-integrable functions is a conditional expectation. [definition: Linear Past] Let $(X_t)_{t \in \mathbb Z}$ be a real-valued second-order stationary process on $(\Omega, \mathcal F, \mathbb P)$ with $\mathbb E[|X_t|^2] < \infty$. For $t \in \mathbb Z$, the linear past of $X_t$ is \begin{align*} \mathcal H_t^X := \overline{\operatorname{span}}\{X_s : s \le t\} \subset L^2(\Omega, \mathcal F, \mathbb P), \end{align*} where the closure is taken in the $L^2$ norm. [/definition] The space $\mathcal H_t^X$ contains limits of finite linear combinations of past observations. To turn this information set into an actual forecast, we need the Hilbert-space operation that chooses the closest element of a closed linear subspace. That operation is the linear projection predictor. [definition: Linear Projection Predictor] Let $Y \in L^2(\Omega, \mathcal F, \mathbb P)$ and let $M \subset L^2(\Omega, \mathcal F, \mathbb P)$ be a closed linear subspace. The linear projection operator onto $M$ is the map \begin{align*} P_M : L^2(\Omega, \mathcal F, \mathbb P) \to M \end{align*} such that $P_MY \in M$ satisfies \begin{align*} \mathbb E[|Y - P_M Y|^2] = \inf_{Z \in M} \mathbb E[|Y-Z|^2]. \end{align*} [/definition] The projection theorem from Hilbert space theory says that the prediction error is orthogonal to every admissible linear predictor. For time series, the most important such error is the part of the next observation that cannot be recovered from the linear past. Naming this error isolates the new information arriving at each time. [definition: Innovation] Let $(X_t)_{t \in \mathbb Z}$ be a second-order stationary process. The one-step innovation at time $t$ is \begin{align*} \varepsilon_t := X_t - P_{\mathcal H_{t-1}^X}X_t. \end{align*} The innovation variance is $\sigma^2 := \mathbb E[|\varepsilon_t|^2]$. [/definition] By stationarity, the innovation variance does not depend on $t$. If $\sigma^2=0$, then $X_t$ is recoverable in mean square from its linear past; if $\sigma^2>0$, each new observation contains a component not linearly predicted by previous observations. The next structural fact records the key property that makes innovations useful for prediction: errors from different forecast dates do not overlap in the linear second-order sense. [quotetheorem:3640] [citeproof:3640] This result is the linear analogue of martingale differences, although the conditioning object is a closed linear subspace rather than a sigma-field. The closedness of the past is essential: without passing to the closure, a best approximating finite linear combination may fail to exist. The theorem does not assert independence of innovations; it only gives pairwise orthogonality, which is the second-order property needed for linear prediction. The next question is whether these innovations account for all of the process, or whether some component remains visible from the arbitrarily remote past. [definition: Pure Nondeterminism] A second-order stationary process $(X_t)_{t \in \mathbb Z}$ is purely nondeterministic if \begin{align*} \bigcap_{t \in \mathbb Z}\mathcal H_t^X = \{0\}. \end{align*} [/definition] The intersection is the part of the process that remains linearly visible no matter how far into the past we move. Pure nondeterminism means there is no non-zero component determined by the infinite remote past. The opposite extreme also needs a name, because Wold's decomposition separates a stationary process into a predictable remote-past component and an innovation-driven component. A process can be random and still have no new linear information arriving over time if each value is already recoverable from its past. [definition: Deterministic Process] A second-order stationary process $(X_t)_{t \in \mathbb Z}$ is deterministic if \begin{align*} X_t \in \mathcal H_{t-1}^X \quad \text{for every } t \in \mathbb Z. \end{align*} [/definition] Deterministic in this sense does not mean non-random. A sinusoid with random phase can be random as a function on $\Omega$, while still being linearly recoverable from its own past. [example: Random Phase Sinusoid] Let $\Theta \sim \operatorname{Unif}(0,2\pi)$ and set $X_t=\cos(\lambda t+\Theta)$, where $\lambda\in(0,\pi)$. First, \begin{align*} \mathbb E[X_t] &=\frac{1}{2\pi}\int_0^{2\pi}\cos(\lambda t+\theta)\,d\theta\\ &=\frac{1}{2\pi}\left[\sin(\lambda t+\theta)\right]_{\theta=0}^{2\pi}\\ &=\frac{\sin(\lambda t+2\pi)-\sin(\lambda t)}{2\pi}\\ &=0. \end{align*} For $s,t\in\mathbb Z$, using $\cos a\cos b=\frac12(\cos(a-b)+\cos(a+b))$, \begin{align*} \mathbb E[X_tX_s] &=\frac{1}{2\pi}\int_0^{2\pi}\cos(\lambda t+\theta)\cos(\lambda s+\theta)\,d\theta\\ &=\frac{1}{4\pi}\int_0^{2\pi}\left\{\cos(\lambda(t-s))+\cos(\lambda(t+s)+2\theta)\right\}\,d\theta\\ &=\frac{1}{4\pi}\left(2\pi\cos(\lambda(t-s))+\left[\frac12\sin(\lambda(t+s)+2\theta)\right]_{\theta=0}^{2\pi}\right)\\ &=\frac{1}{4\pi}\left(2\pi\cos(\lambda(t-s))+\frac12\{\sin(\lambda(t+s)+4\pi)-\sin(\lambda(t+s))\}\right)\\ &=\frac12\cos(\lambda(t-s)). \end{align*} Thus the mean is constant and the covariance depends only on $t-s$, so $(X_t)$ is second-order stationary. It remains to check determinism in the linear prediction sense. Put $u=\lambda(t-1)+\Theta$. Then \begin{align*} 2\cos(\lambda)X_{t-1}-X_{t-2} &=2\cos(\lambda)\cos(u)-\cos(u-\lambda)\\ &=2\cos(\lambda)\cos(u)-\{\cos(u)\cos(\lambda)+\sin(u)\sin(\lambda)\}\\ &=\cos(\lambda)\cos(u)-\sin(\lambda)\sin(u)\\ &=\cos(u+\lambda)\\ &=\cos(\lambda t+\Theta)\\ &=X_t. \end{align*} Since $X_{t-1}$ and $X_{t-2}$ belong to $\mathcal H_{t-1}^X$, the displayed identity gives $X_t\in\mathcal H_{t-1}^X$ for every $t$. The process is therefore random as a function of $\Theta$, but its present value is linearly recoverable from its own past. [/example] ## Wold Decomposition The main structural question is whether the innovations recover the whole process. For an ARMA model with a causal representation, repeated substitution writes $X_t$ as a moving average of current and past shocks. Wold's theorem says that every second-order stationary process has exactly this form after separating off a deterministic component. [quotetheorem:3641] [citeproof:3641] The theorem is the reason moving-average language is not merely a modeling convenience. The stationarity and finite second-moment hypotheses are doing real work: without stationarity the lag coefficients need not be time-independent, and without $L^2$ the projection geometry is unavailable. The result does not say that the shocks are independent or Gaussian, nor that the moving-average series is finite. What it supplies is the canonical second-order coordinate system that the forecasting results below will use. [remark: Non-Zero Mean] If $\mathbb E[X_t]=\mu$, apply the theorem to $X_t-\mu$. The decomposition becomes \begin{align*} X_t = \mu + D_t + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}. \end{align*} [/remark] The most familiar case is a causal autoregression, where the Wold coefficients are already visible by solving the recursion backward. [example: Stationary AR(1)] Let $X_t=\phi X_{t-1}+Z_t$ with $|\phi|<1$, where $(Z_t)$ is white noise with variance $\sigma_Z^2$, and assume $(X_t)$ is the stationary solution. We show that its Wold representation is \begin{align*} X_t=\sum_{j=0}^{\infty}\phi^j Z_{t-j}, \end{align*} so the deterministic component is zero and $\psi_j=\phi^j$. Iterating the recursion once gives \begin{align*} X_t &=\phi X_{t-1}+Z_t \\ &=\phi(\phi X_{t-2}+Z_{t-1})+Z_t \\ &=Z_t+\phi Z_{t-1}+\phi^2X_{t-2}. \end{align*} Repeating the same substitution $m+1$ times yields \begin{align*} X_t &=\sum_{j=0}^{m}\phi^j Z_{t-j}+\phi^{m+1}X_{t-m-1}. \end{align*} The remainder converges to zero in $L^2$, because stationarity gives $\mathbb E[|X_{t-m-1}|^2]=\mathbb E[|X_0|^2]$ and therefore \begin{align*} \mathbb E\left[\left|\phi^{m+1}X_{t-m-1}\right|^2\right] &=|\phi|^{2m+2}\mathbb E[|X_{t-m-1}|^2] \\ &=|\phi|^{2m+2}\mathbb E[|X_0|^2] \\ &\to 0. \end{align*} Hence \begin{align*} X_t=\sum_{j=0}^{\infty}\phi^j Z_{t-j} \end{align*} in $L^2$. Also, \begin{align*} \sum_{j=0}^{\infty}|\phi^j|^2 =\sum_{j=0}^{\infty}|\phi|^{2j} =\frac{1}{1-|\phi|^2}<\infty, \end{align*} so the coefficients are square-summable. The innovation at time $t$ is $Z_t$. Indeed, the expansion above shows that $X_s\in\overline{\operatorname{span}}\{Z_r:r\le s\}$ for every $s\le t-1$, so $\mathcal H_{t-1}^X\subseteq\overline{\operatorname{span}}\{Z_r:r\le t-1\}$. Since white-noise variables are mutually orthogonal, $Z_t$ is orthogonal to every $Z_r$ with $r\le t-1$, and hence $Z_t\perp \mathcal H_{t-1}^X$. Therefore \begin{align*} P_{\mathcal H_{t-1}^X}X_t &=P_{\mathcal H_{t-1}^X}(\phi X_{t-1}+Z_t) \\ &=\phi X_{t-1}+0, \end{align*} and \begin{align*} X_t-P_{\mathcal H_{t-1}^X}X_t &=(\phi X_{t-1}+Z_t)-\phi X_{t-1} \\ &=Z_t. \end{align*} Finally, the deterministic component is zero. Since \begin{align*} Z_s=X_s-\phi X_{s-1}, \end{align*} we have $Z_s\in\mathcal H_s^X$, and the expansion gives $\mathcal H_t^X=\overline{\operatorname{span}}\{Z_s:s\le t\}$. If $Y\in\bigcap_t\mathcal H_t^X$, then for each fixed $k$ choose $t<k$; because $Y\in\mathcal H_t^X=\overline{\operatorname{span}}\{Z_s:s\le t\}$, white-noise orthogonality gives $Y\perp Z_k$. Thus $Y$ is orthogonal to every $Z_k$. Since $Y\in\mathcal H_0^X=\overline{\operatorname{span}}\{Z_s:s\le0\}$, it is orthogonal to a dense subspace of the space containing it, so $\mathbb E[|Y|^2]=0$. Hence $\bigcap_t\mathcal H_t^X=\{0\}$, and the AR(1) is purely nondeterministic with Wold coefficients $\psi_j=\phi^j$. [/example] The decomposition is more informative when both predictable oscillation and fresh noise are present, because it separates these effects rather than forcing a single ARMA interpretation. [example: Periodic Component Plus Noise] Let \begin{align*} S_t:=A\cos(\lambda t)+B\sin(\lambda t), \qquad X_t=S_t+Z_t, \end{align*} where $\lambda\in(0,\pi)$ and $(Z_t)$ is white noise orthogonal to $A$ and $B$. The sinusoidal coordinates evolve in a two-dimensional linear space. For $u=\lambda(t-1)$, \begin{align*} 2\cos(\lambda)S_{t-1}-S_{t-2} &=2\cos(\lambda)\{A\cos u+B\sin u\}-\{A\cos(u-\lambda)+B\sin(u-\lambda)\}\\ &=2\cos(\lambda)\{A\cos u+B\sin u\}\\ &\quad -A\{\cos u\cos\lambda+\sin u\sin\lambda\} -B\{\sin u\cos\lambda-\cos u\sin\lambda\}\\ &=A\{\cos\lambda\cos u-\sin\lambda\sin u\} +B\{\cos\lambda\sin u+\sin\lambda\cos u\}\\ &=A\cos(u+\lambda)+B\sin(u+\lambda)\\ &=S_t. \end{align*} Thus once the two-dimensional sinusoidal component is known at two consecutive times, every other value is obtained by the same fixed recurrence. The component $S_t$ is linearly recoverable from the infinite past of $X$. For example, the orthogonality of the white-noise sequence implies that long trigonometric averages of past observations cancel the $Z$-terms in $L^2$, while the averages of $\cos^2(\lambda s)$, $\sin^2(\lambda s)$, and $\sin(\lambda s)\cos(\lambda s)$ isolate the two coefficients $A$ and $B$ along subsequential Cesaro limits. Hence $A$ and $B$ belong to the closed linear span generated by the remote observations of $X$, and therefore $S_t\in\mathcal H_{t-1}^X$ for every $t$. The new noise $Z_t$ is orthogonal to $A$, $B$, and to every $Z_s$ with $s<t$. Since every element of $\mathcal H_{t-1}^X$ is an $L^2$ limit of linear combinations of variables of the form \begin{align*} X_s=A\cos(\lambda s)+B\sin(\lambda s)+Z_s, \qquad s\le t-1, \end{align*} it follows that $Z_t\perp \mathcal H_{t-1}^X$. Consequently \begin{align*} P_{\mathcal H_{t-1}^X}X_t &=P_{\mathcal H_{t-1}^X}(S_t+Z_t)\\ &=S_t+0\\ &=S_t, \end{align*} and the innovation is \begin{align*} X_t-P_{\mathcal H_{t-1}^X}X_t &=(S_t+Z_t)-S_t\\ &=Z_t. \end{align*} So the Wold decomposition separates the predictable oscillation $S_t$ from the purely nondeterministic white-noise component $Z_t$. [/example] In spectral language, atoms in the spectral distribution give standard examples of deterministic components, such as random-amplitude sinusoids. This gives the applied rule of thumb needed here: sharp spectral lines correspond to predictable oscillatory structure, while the innovation-driven part is the part modeled by moving-average shocks. A full spectral classification of deterministic components is more delicate and is beyond the scope of these notes. ## Moving-Average Coordinates Once the deterministic component is removed, the process can be studied through its innovation coordinates. The practical content is that a purely nondeterministic stationary process can be represented, at the level of second-order linear prediction, by its innovation sequence together with the Wold moving-average coefficients. This continues the projection geometry from Chapters 1 and 3 without requiring spectral operator theory. [explanation: Why the Coordinates Matter] No second-order information used for linear forecasting is lost by replacing a purely nondeterministic process by its innovation sequence and Wold coefficients. Forecasts, forecast errors, and autocovariances can be computed in these coordinates. In statistical modeling, an estimated ARMA model is therefore an attempt to approximate the infinite Wold filter by a rational transfer function. [/explanation] ## Prediction Consequences The immediate practical problem is forecasting. Wold's representation separates what can be predicted from what cannot: past innovations are known from the linear past, while future innovations are orthogonal to it. [quotetheorem:3643] [citeproof:3643] The formula gives a direct interpretation of shocks. A shock at time $t$ affects forecasts at future horizons through the impulse response coefficients $\psi_0,\psi_1,\dots$. [example: Two-Step Forecast in an AR(1)] For the stationary AR(1) process $X_t=\phi X_{t-1}+Z_t$ with $|\phi|<1$, the Wold coefficients are $\psi_j=\phi^j$. Substituting $h=2$ into *Linear Forecast from the Wold Representation* gives \begin{align*} P_{\mathcal H_t^X}X_{t+2} &=\sum_{j=2}^{\infty}\psi_j\varepsilon_{t+2-j}\\ &=\sum_{j=2}^{\infty}\phi^j Z_{t+2-j}. \end{align*} Put $k=j-2$. Then $j=k+2$ and $t+2-j=t-k$, so \begin{align*} \sum_{j=2}^{\infty}\phi^j Z_{t+2-j} &=\sum_{k=0}^{\infty}\phi^{k+2}Z_{t-k}\\ &=\phi^2\sum_{k=0}^{\infty}\phi^kZ_{t-k}\\ &=\phi^2X_t. \end{align*} Thus $P_{\mathcal H_t^X}X_{t+2}=\phi^2X_t$. The same forecast error is obtained by splitting off the two future innovation terms: \begin{align*} X_{t+2}-P_{\mathcal H_t^X}X_{t+2} &=\sum_{j=0}^{1}\phi^j Z_{t+2-j}\\ &=Z_{t+2}+\phi Z_{t+1}. \end{align*} Since $(Z_t)$ is white noise, $Z_{t+2}\perp Z_{t+1}$ and \begin{align*} \mathbb E\left[\left|Z_{t+2}+\phi Z_{t+1}\right|^2\right] &=\mathbb E[|Z_{t+2}|^2] +\phi\,\mathbb E[Z_{t+1}\overline{Z_{t+2}}] +\phi\,\mathbb E[Z_{t+2}\overline{Z_{t+1}}] +\phi^2\mathbb E[|Z_{t+1}|^2]\\ &=\sigma_Z^2+\phi\cdot 0+\phi\cdot 0+\phi^2\sigma_Z^2\\ &=\sigma_Z^2(1+\phi^2). \end{align*} So the two-step forecast keeps the part already encoded in $X_t$, while the remaining uncertainty is exactly the contribution of the next two white-noise shocks. [/example] ## ARMA Approximation and Finite Autoregressions Wold's theorem gives an infinite moving average, but applied time series usually fits finite ARMA or finite AR models. The approximation question is how a finite model can represent the prediction structure of a general stationary process. [definition: Finite Linear Predictor] For $p\in\mathbb N$, the order-$p$ linear predictor of $X_t$ is the projection of $X_t$ onto \begin{align*} \operatorname{span}\{X_{t-1},\dots,X_{t-p}\}. \end{align*} [/definition] As $p$ increases, the finite prediction spaces grow toward the full linear past. The finite autoregressive coefficients are therefore approximations to the infinite linear prediction coefficients, when such coefficients are well defined. To justify finite autoregressions as more than a computational shortcut, we need a convergence result showing that these finite projections approach the ideal infinite-past predictor. [quotetheorem:3644] [citeproof:3644] This theorem justifies fitting high-order autoregressions as nonparametric approximations to linear prediction. It does not say that every process is genuinely finite-order autoregressive; it says finite-order autoregressions can approximate the best linear predictor when enough lags are used. [example: Approximating a General Stationary Process by AR Models] Let $(X_t)$ be purely nondeterministic with Wold representation \begin{align*} X_t=\sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \qquad \sum_{j=0}^{\infty}|\psi_j|^2<\infty, \end{align*} and suppose the sequence $(\psi_j)$ is not eventually zero, so the Wold moving-average representation is genuinely infinite. For $p\ge 1$, put \begin{align*} M_p=\operatorname{span}\{X_{t-1},\dots,X_{t-p}\}, \qquad P_pX_t=\sum_{k=1}^p a_{p,k}X_{t-k}. \end{align*} The projection condition defining $P_pX_t$ is \begin{align*} X_t-\sum_{k=1}^p a_{p,k}X_{t-k}\perp M_p. \end{align*} Equivalently, for each $\ell=1,\dots,p$, \begin{align*} 0 &=\mathbb E\left[\left(X_t-\sum_{k=1}^p a_{p,k}X_{t-k}\right)\overline{X_{t-\ell}}\right]\\ &=\mathbb E[X_t\overline{X_{t-\ell}}] -\sum_{k=1}^p a_{p,k}\mathbb E[X_{t-k}\overline{X_{t-\ell}}]. \end{align*} Writing $\gamma(h)=\mathbb E[X_{t+h}\overline{X_t}]$, stationarity gives \begin{align*} \mathbb E[X_t\overline{X_{t-\ell}}]&=\gamma(\ell),\\ \mathbb E[X_{t-k}\overline{X_{t-\ell}}]&=\gamma(\ell-k), \end{align*} so the finite Yule-Walker equations are \begin{align*} \sum_{k=1}^p a_{p,k}\gamma(\ell-k)=\gamma(\ell), \qquad \ell=1,\dots,p. \end{align*} Let $P_{\mathcal H_{t-1}^X}X_t$ be the optimal one-step linear predictor from the full linear past. Since \begin{align*} \overline{\bigcup_{p\ge1}M_p} =\overline{\operatorname{span}}\{X_{t-1},X_{t-2},\dots\} =\mathcal H_{t-1}^X, \end{align*} the projection convergence theorem gives \begin{align*} P_pX_t\to P_{\mathcal H_{t-1}^X}X_t \end{align*} in $L^2$. Therefore, if \begin{align*} e_p=X_t-P_pX_t, \qquad e=X_t-P_{\mathcal H_{t-1}^X}X_t, \end{align*} then \begin{align*} \|e_p-e\|_2 &=\|(X_t-P_pX_t)-(X_t-P_{\mathcal H_{t-1}^X}X_t)\|_2\\ &=\|P_{\mathcal H_{t-1}^X}X_t-P_pX_t\|_2\\ &\to 0. \end{align*} Also, \begin{align*} \left|\|e_p\|_2-\|e\|_2\right|\le \|e_p-e\|_2\to0, \end{align*} so the finite AR prediction error variances converge to the optimal one-step prediction error variance: \begin{align*} \mathbb E[|X_t-P_pX_t|^2]\to \mathbb E[|X_t-P_{\mathcal H_{t-1}^X}X_t|^2]. \end{align*} Thus finite autoregressions need not give an exact finite equation for the process; their role is to approximate the projection onto the full linear past, and that approximation becomes exact in mean square as the number of lags increases. [/example] This distinction between exact model class and approximation scheme is the right way to interpret finite ARMA fitting after Wold's theorem, and it prepares the frequency-domain reinterpretation of the same second-order structure in Chapter 6. [remark: Interpretation of ARMA Modeling] A finite MA model truncates the Wold impulse response, while an ARMA model replaces it by a rational filter. Good ARMA modeling is therefore not only curve fitting to autocovariances; it is an attempt to capture the geometry of innovations and linear prediction with a small number of parameters. [/remark] Chapter 5 showed that every purely nondeterministic stationary series has a canonical linear prediction structure through the Wold decomposition. Chapter 6 reformulates the same second-order information in the frequency domain, where dependence is described by the spectral density rather than by autocovariances alone. # 6. Spectral Analysis of Stationary Processes Spectral analysis asks how the second-order variation of a stationary process is distributed across oscillatory components. Chapters 1 through 5 described dependence through autocovariances, ARMA recursions, and prediction; this chapter rewrites the same second-order information in the frequency domain. The central bridge is that every valid autocovariance sequence has a spectral measure, and in common models that measure has a density which can be read as power per unit frequency. ## Covariance as Frequency Content The starting question is: when a stationary process has persistent cycles or seasonal behaviour, where is that behaviour visible in its covariance sequence? A slowly decaying autocovariance is often hard to interpret directly, while its Fourier representation separates low-frequency persistence, seasonal peaks, and short-memory noise. [definition: Spectral Distribution Function] Let $(X_t)_{t \in \mathbb Z}$ be a weakly stationary real-valued process with autocovariance function $\gamma(h)=\operatorname{Cov}(X_{t+h},X_t)$. A spectral distribution function for $(X_t)$ is a right-continuous nondecreasing function $F:[-\pi,\pi]\to\mathbb R$ such that $F(-\pi)=0$ and \begin{align*} \gamma(h)=\int_{-\pi}^{\pi} e^{ih\lambda}\,dF(\lambda), \qquad h\in\mathbb Z. \end{align*} [/definition] The function $F$ encodes how much variance lies in frequency intervals. At lag $0$ the representation gives $\gamma(0)=F(\pi)$, so the total mass of the spectral measure is the variance of the process. Many common short-memory models have no point masses in frequency and can be described by a density with respect to Lebesgue measure. The next definition records this absolutely continuous case. [definition: Spectral Density] Let $(X_t)_{t \in \mathbb Z}$ be weakly stationary with spectral distribution function $F$. A spectral density is an integrable function $f:[-\pi,\pi]\to[0,\infty)$ satisfying \begin{align*} F(b)-F(a)=\int_a^b f(\lambda)\,d\lambda \end{align*} for all $-\pi\le a<b\le \pi$. [/definition] When a spectral density exists, the covariance representation becomes the Fourier coefficient formula \begin{align*} \gamma(h)=\int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\,d\lambda. \end{align*} With the convention used here, the inverse relation is \begin{align*} f(\lambda)=\frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)e^{-ih\lambda} \end{align*} whenever the covariance sequence is absolutely summable. [example: White Noise Spectrum] Let $(Z_t)_{t\in\mathbb Z}$ be white noise with mean $0$ and variance $\sigma^2$, so its autocovariance sequence is \begin{align*} \gamma(h) = \operatorname{Cov}(Z_{t+h},Z_t) = \begin{cases} \sigma^2, & h=0,\\ 0, & h\ne 0. \end{cases} \end{align*} Using the absolutely summable covariance formula for the spectral density, \begin{align*} f(\lambda) &=\frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)e^{-ih\lambda}\\ &=\frac{1}{2\pi}\left(\gamma(0)e^{-i0\lambda}+\sum_{\substack{h\in\mathbb Z\\ h\ne0}}\gamma(h)e^{-ih\lambda}\right)\\ &=\frac{1}{2\pi}\left(\sigma^2\cdot 1+\sum_{\substack{h\in\mathbb Z\\ h\ne0}}0\cdot e^{-ih\lambda}\right)\\ &=\frac{\sigma^2}{2\pi}. \end{align*} Thus white noise has the same spectral density at every $\lambda\in[-\pi,\pi]$, meaning its variance is spread uniformly across all frequencies. [/example] A constant spectrum means that no frequency band is preferred. Peaks in $f$ indicate frequencies at which oscillations contribute disproportionately to the variance. With finite data, the spectrum itself is not observed, so we need a sample quantity that decomposes the observed series by frequency. The periodogram supplies this empirical frequency-domain summary. [definition: Periodogram] Let $X_1,\dots,X_n$ be observations from a real-valued time series. The discrete Fourier transform is \begin{align*} d_n(\lambda)=\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n} X_t e^{-it\lambda}, \end{align*} and the periodogram is \begin{align*} I_n(\lambda)=|d_n(\lambda)|^2. \end{align*} [/definition] The periodogram is the sample analogue of the spectral density, but it is not a consistent pointwise estimator without smoothing. Its role in this chapter is mainly interpretive: it reveals empirical frequency peaks and motivates the theoretical spectrum. [example: Seasonal Periodogram] Let $\omega_0=2\pi/12=\pi/6$ and write the deterministic seasonal part as $C_t=A\cos(\omega_0 t)$. By Euler's identity, \begin{align*} C_t e^{-it\lambda} &=\frac{A}{2}\left(e^{i\omega_0 t}+e^{-i\omega_0 t}\right)e^{-it\lambda}\\ &=\frac{A}{2}\left(e^{-it(\lambda-\omega_0)}+e^{-it(\lambda+\omega_0)}\right). \end{align*} Therefore its contribution to the discrete Fourier transform is \begin{align*} d_{n,C}(\lambda) &=\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}C_t e^{-it\lambda}\\ &=\frac{A}{2\sqrt{2\pi n}} \left( \sum_{t=1}^{n}e^{-it(\lambda-\omega_0)} + \sum_{t=1}^{n}e^{-it(\lambda+\omega_0)} \right). \end{align*} For $\alpha\notin 2\pi\mathbb Z$, \begin{align*} \sum_{t=1}^{n}e^{-it\alpha} &=e^{-i\alpha}\frac{1-e^{-in\alpha}}{1-e^{-i\alpha}}, \end{align*} while for $\alpha\in2\pi\mathbb Z$ the same sum equals $n$. Hence at the seasonal frequency $\lambda=\omega_0$, \begin{align*} d_{n,C}(\omega_0) &=\frac{A}{2\sqrt{2\pi n}} \left( n+\sum_{t=1}^{n}e^{-it(2\omega_0)} \right). \end{align*} Since $2\omega_0=\pi/3$ is not a multiple of $2\pi$, \begin{align*} \left|\sum_{t=1}^{n}e^{-it(2\omega_0)}\right| &= \left|\frac{e^{-i2\omega_0}(1-e^{-in2\omega_0})}{1-e^{-i2\omega_0}}\right|\\ &\le \frac{|1-e^{-in2\omega_0}|}{|1-e^{-i2\omega_0}|}\\ &\le \frac{2}{|1-e^{-i\pi/3}|}, \end{align*} so the leading term in $d_{n,C}(\omega_0)$ has size $A n/(2\sqrt{2\pi n})=A\sqrt{n}/(2\sqrt{2\pi})$. Thus the deterministic seasonal component contributes a periodogram value of order $n$ at $\omega_0$ and similarly at $-\omega_0$. For the white-noise part, \begin{align*} d_{n,Z}(\lambda) &=\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}Z_t e^{-it\lambda}, \end{align*} and, using $\operatorname{Cov}(Z_s,Z_t)=0$ for $s\ne t$ and $\operatorname{Var}(Z_t)=\sigma^2$, \begin{align*} \mathbb E|d_{n,Z}(\lambda)|^2 &=\frac{1}{2\pi n}\mathbb E\left[ \left(\sum_{t=1}^{n}Z_t e^{-it\lambda}\right) \left(\sum_{s=1}^{n}Z_s e^{is\lambda}\right) \right]\\ &=\frac{1}{2\pi n}\sum_{t=1}^{n}\sum_{s=1}^{n} e^{-it\lambda}e^{is\lambda}\mathbb E[Z_tZ_s]\\ &=\frac{1}{2\pi n}\sum_{t=1}^{n}\sigma^2\\ &=\frac{\sigma^2}{2\pi}. \end{align*} The cosine therefore creates growing peaks at the signed seasonal frequencies $\pm 2\pi/12$, while the white noise supplies a frequency-independent background; this is why monthly seasonality appears as a peak in the periodogram. [/example] ## Herglotz Representation and Wiener-Khinchin The next question is: which sequences can occur as autocovariances, and why must a spectral measure exist? The answer is a Fourier-analytic form of positive definiteness. Stationarity says covariance matrices built from $\gamma$ are nonnegative definite; Herglotz says this condition is exactly the condition for being the Fourier transform of a finite positive measure on the unit circle. [definition: Positive Definite Sequence] A sequence $\gamma:\mathbb Z\to\mathbb C$ is positive definite if, for every $n\in\mathbb N$, every $t_1,\dots,t_n\in\mathbb Z$, and every $c_1,\dots,c_n\in\mathbb C$, \begin{align*} \sum_{j=1}^{n}\sum_{k=1}^{n} c_j\overline{c_k}\,\gamma(t_j-t_k)\ge0. \end{align*} [/definition] For a weakly stationary process, this condition follows from the variance identity for $\sum_j c_j X_{t_j}$. The converse is the crucial representation question: if a sequence has exactly the positivity required of covariances, can it always be realized as Fourier coefficients of a finite positive measure? Herglotz answers this question and supplies the existence theorem behind spectral distributions. [quotetheorem:3645] [citeproof:3645] The theorem is the mathematical justification for spectral analysis: frequency-domain objects are not extra modelling assumptions but another representation of second-order stationarity. Its hypotheses are also sharp for covariance sequences, since positive definiteness is exactly what is forced by nonnegative variances of finite linear combinations. The representing measure may still contain atoms or singular parts, so Herglotz alone does not guarantee a spectral density. For many statistical procedures it is more convenient to work with an ordinary function $f$ than with an arbitrary measure. The additional assumption that the autocovariances are absolutely summable is a standard short-memory condition that rules out too much long-range dependence and leads to a continuous density representation. [quotetheorem:3646] [citeproof:3646] The absolute summability condition is a short-memory condition. It is strong enough to produce a continuous density, while processes with deterministic sinusoidal components have spectral measures with atoms. [example: Random Phase Sinusoid] [claim]Let $X_t=A\cos(\omega t+\Theta)$, where $A>0$, $\omega\in(0,\pi)$, and $\Theta\sim\operatorname{Unif}(0,2\pi)$. Then $(X_t)$ is weakly stationary, and its spectral measure is \begin{align*} \mu=\frac{A^2}{4}\delta_{\omega}+\frac{A^2}{4}\delta_{-\omega}. \end{align*} [/claim] To verify this, for each $t\in\mathbb Z$, \begin{align*} \mathbb E[X_t] &=\frac{A}{2\pi}\int_0^{2\pi}\cos(\omega t+\theta)\,d\theta\\ &=\frac{A}{2\pi}\left[\sin(\omega t+\theta)\right]_{\theta=0}^{2\pi}\\ &=\frac{A}{2\pi}\left(\sin(\omega t+2\pi)-\sin(\omega t)\right)\\ &=0. \end{align*} Thus the mean is constant. Since the mean is $0$, the autocovariance is \begin{align*} \gamma(h) &=\operatorname{Cov}(X_{t+h},X_t)\\ &=\mathbb E[X_{t+h}X_t]\\ &=\frac{A^2}{2\pi}\int_0^{2\pi} \cos(\omega(t+h)+\theta)\cos(\omega t+\theta)\,d\theta. \end{align*} Using $\cos u\cos v=\frac{1}{2}\{\cos(u-v)+\cos(u+v)\}$ with $u=\omega(t+h)+\theta$ and $v=\omega t+\theta$, \begin{align*} \gamma(h) &=\frac{A^2}{4\pi}\int_0^{2\pi} \left\{\cos(\omega h)+\cos(2\omega t+\omega h+2\theta)\right\}\,d\theta\\ &=\frac{A^2}{4\pi} \left( \int_0^{2\pi}\cos(\omega h)\,d\theta + \int_0^{2\pi}\cos(2\omega t+\omega h+2\theta)\,d\theta \right)\\ &=\frac{A^2}{4\pi} \left( 2\pi\cos(\omega h) + \frac{1}{2}\left[\sin(2\omega t+\omega h+2\theta)\right]_{\theta=0}^{2\pi} \right)\\ &=\frac{A^2}{4\pi} \left( 2\pi\cos(\omega h) + \frac{1}{2}\left(\sin(2\omega t+\omega h+4\pi)-\sin(2\omega t+\omega h)\right) \right)\\ &=\frac{A^2}{2}\cos(\omega h). \end{align*} This depends only on the lag $h$, so $(X_t)$ is weakly stationary. Now define \begin{align*} \mu=\frac{A^2}{4}\delta_{\omega}+\frac{A^2}{4}\delta_{-\omega}. \end{align*} For every $h\in\mathbb Z$, \begin{align*} \int_{-\pi}^{\pi}e^{ih\lambda}\,d\mu(\lambda) &=\frac{A^2}{4}e^{ih\omega}+\frac{A^2}{4}e^{-ih\omega}\\ &=\frac{A^2}{4}\left(e^{ih\omega}+e^{-ih\omega}\right)\\ &=\frac{A^2}{2}\cos(\omega h)\\ &=\gamma(h), \end{align*} where the third equality uses $e^{ix}+e^{-ix}=2\cos x$. Hence the spectral measure consists of two equal atoms at the signed frequencies $\omega$ and $-\omega$. Because $\omega\in(0,\pi)$, the two points $\omega$ and $-\omega$ are distinct, and each has positive mass $A^2/4$; therefore this stationary process has a spectral measure with atoms rather than a spectral density. [/example] ## Spectra of ARMA Processes The modelling question is now: how do ARMA parameters shape the spectrum? In the time domain an ARMA equation describes recursion and shocks; in the frequency domain it describes a transfer function that amplifies or attenuates white-noise input at each frequency. [definition: ARMA Transfer Function] Let $(X_t)$ be a causal ARMA$(p,q)$ process satisfying \begin{align*} \phi(B)X_t=\theta(B)Z_t, \end{align*} where $(Z_t)$ is white noise with variance $\sigma^2$, $\phi(z)=1-\phi_1z-\cdots-\phi_pz^p$, and $\theta(z)=1+\theta_1z+\cdots+\theta_qz^q$. The transfer function is \begin{align*} H(z)=\frac{\theta(z)}{\phi(z)}. \end{align*} [/definition] Causality means that $1/\phi(z)$ has a power-series expansion on a neighbourhood of the closed unit disc. The coefficients of $H(z)$ are the impulse-response coefficients in the infinite moving-average representation. To turn this transfer-function description into a spectrum, we need to ask how white-noise power is amplified or damped at each frequency by $H(e^{-i\lambda})$. The following result supplies that frequency-by-frequency conversion. [quotetheorem:3647] [citeproof:3647] The denominator controls resonance. Frequencies where $|\phi(e^{-i\lambda})|$ is small produce high spectral power, especially when autoregressive roots lie close to the unit circle. [example: Spectrum of AR(1)] Let $X_t=\phi X_{t-1}+Z_t$, where $|\phi|<1$ and $(Z_t)$ is white noise with variance $\sigma^2$. This is an ARMA$(1,0)$ process with \begin{align*} \theta(z)&=1,\\ \phi_{\operatorname{AR}}(z)&=1-\phi z. \end{align*} Since $|\phi|<1$, the zero of $1-\phi z$ is $z=1/\phi$, and $|1/\phi|>1$, so the AR root lies outside the closed unit disc. By *Spectral Density of a Causal ARMA Process*, \begin{align*} f_X(\lambda) &=\frac{\sigma^2}{2\pi} \left|\frac{\theta(e^{-i\lambda})}{\phi_{\operatorname{AR}}(e^{-i\lambda})}\right|^2\\ &=\frac{\sigma^2}{2\pi} \left|\frac{1}{1-\phi e^{-i\lambda}}\right|^2\\ &=\frac{\sigma^2}{2\pi} \frac{1}{|1-\phi e^{-i\lambda}|^2}. \end{align*} To write the denominator in real trigonometric form, \begin{align*} |1-\phi e^{-i\lambda}|^2 &=(1-\phi e^{-i\lambda})\overline{(1-\phi e^{-i\lambda})}\\ &=(1-\phi e^{-i\lambda})(1-\phi e^{i\lambda})\\ &=1-\phi e^{i\lambda}-\phi e^{-i\lambda}+\phi^2 e^{-i\lambda}e^{i\lambda}\\ &=1+\phi^2-\phi(e^{i\lambda}+e^{-i\lambda})\\ &=1+\phi^2-2\phi\cos\lambda. \end{align*} Therefore \begin{align*} f_X(\lambda) &=\frac{\sigma^2}{2\pi}\frac{1}{1+\phi^2-2\phi\cos\lambda}. \end{align*} If $\phi>0$, the denominator $1+\phi^2-2\phi\cos\lambda$ is smallest when $\cos\lambda$ is largest, namely at $\lambda=0$, so the spectrum is largest at low frequency. If $\phi<0$, writing $\phi=-a$ with $a>0$ gives \begin{align*} 1+\phi^2-2\phi\cos\lambda &=1+a^2+2a\cos\lambda, \end{align*} which is smallest when $\cos\lambda=-1$, namely at $\lambda=\pm\pi$ on $[-\pi,\pi]$. Thus positive AR(1) dependence concentrates power near frequency $0$, while negative AR(1) dependence concentrates power near the alternating frequency $\pi$. [/example] The AR(1) example shows how the sign of a single coefficient moves spectral mass between low and alternating frequencies. A second-order autoregression can do something richer: complex roots create a preferred oscillation whose frequency is determined by the root angle. [example: Spectral Peak of an AR(2) Oscillator] Let $(Z_t)$ be white noise with variance $\sigma^2$, and let \begin{align*} X_t=2r\cos(\omega_0)X_{t-1}-r^2X_{t-2}+Z_t, \qquad 0<r<1,\quad \omega_0\in(0,\pi). \end{align*} Writing the recursion as an autoregression gives \begin{align*} (1-2r\cos(\omega_0)B+r^2B^2)X_t=Z_t, \end{align*} so the autoregressive polynomial is \begin{align*} \phi(z) &=1-2r\cos(\omega_0)z+r^2z^2. \end{align*} Using $2\cos(\omega_0)=e^{i\omega_0}+e^{-i\omega_0}$, \begin{align*} (1-re^{i\omega_0}z)(1-re^{-i\omega_0}z) &=1-re^{-i\omega_0}z-re^{i\omega_0}z+r^2z^2\\ &=1-r(e^{i\omega_0}+e^{-i\omega_0})z+r^2z^2\\ &=1-2r\cos(\omega_0)z+r^2z^2\\ &=\phi(z). \end{align*} The zeros are $z=e^{-i\omega_0}/r$ and $z=e^{i\omega_0}/r$, both of modulus $1/r>1$, so the causal ARMA spectral formula applies by *Spectral Density of a Causal ARMA Process*. Hence \begin{align*} f_X(\lambda) &=\frac{\sigma^2}{2\pi}\frac{1}{|\phi(e^{-i\lambda})|^2}. \end{align*} Now \begin{align*} \phi(e^{-i\lambda}) &=(1-re^{i\omega_0}e^{-i\lambda})(1-re^{-i\omega_0}e^{-i\lambda})\\ &=(1-re^{-i(\lambda-\omega_0)})(1-re^{-i(\lambda+\omega_0)}), \end{align*} and therefore \begin{align*} |\phi(e^{-i\lambda})|^2 &=|1-re^{-i(\lambda-\omega_0)}|^2 |1-re^{-i(\lambda+\omega_0)}|^2. \end{align*} For any real $u$, \begin{align*} |1-re^{-iu}|^2 &=(1-re^{-iu})(1-re^{iu})\\ &=1-re^{iu}-re^{-iu}+r^2\\ &=1+r^2-r(e^{iu}+e^{-iu})\\ &=1+r^2-2r\cos u\\ &=(1-r)^2+2r(1-\cos u)\\ &=(1-r)^2+4r\sin^2(u/2). \end{align*} Substituting $u=\lambda-\omega_0$ and $u=\lambda+\omega_0$ gives \begin{align*} |\phi(e^{-i\lambda})|^2 &=\left((1-r)^2+4r\sin^2\left(\frac{\lambda-\omega_0}{2}\right)\right) \left((1-r)^2+4r\sin^2\left(\frac{\lambda+\omega_0}{2}\right)\right). \end{align*} The first factor is smallest when $\lambda=\omega_0$ modulo $2\pi$, and the second factor is smallest when $\lambda=-\omega_0$ modulo $2\pi$. Since the spectral density is proportional to the reciprocal of this product, the spectrum is large near the signed frequencies $\pm\omega_0$. Near $\lambda=\omega_0$, the factor \begin{align*} (1-r)^2+4r\sin^2\left(\frac{\lambda-\omega_0}{2}\right) \end{align*} has minimum value $(1-r)^2$ and rises as $|\lambda-\omega_0|$ moves away from $0$. As $r$ approaches $1$, this minimum becomes smaller and the interval on which the factor stays small becomes narrower. Thus roots close to the unit circle create sharper spectral peaks, which is the frequency-domain signature of a damped stochastic oscillator. [/example] MA terms shape spectra through zeros of $\theta(e^{-i\lambda})$. They can create notches, flatten peaks, or implement finite-window smoothing. ## Linear Filters and Frequency Response The practical question is: what happens to the spectrum when we smooth, difference, or otherwise filter a time series? Time-domain filtering is convolution with filter coefficients, and frequency-domain filtering is multiplication by a squared gain. [definition: Linear Filter] Let $(X_t)_{t\in\mathbb Z}$ be a weakly stationary process. A linear filter with coefficients $(a_j)_{j\in\mathbb Z}$ satisfying $\sum_{j\in\mathbb Z}|a_j|<\infty$ produces the process \begin{align*} Y_t=\sum_{j\in\mathbb Z}a_jX_{t-j}. \end{align*} Its frequency response is \begin{align*} A(\lambda)=\sum_{j\in\mathbb Z}a_j e^{-ij\lambda}. \end{align*} [/definition] Absolute summability ensures the filtered process is well defined in mean square and that the frequency response is continuous. The key question is how filtering changes the distribution of variance across frequencies: a filter may leave the time-domain process stationary while selectively suppressing or amplifying parts of its spectrum. The next result gives the exact spectral transformation induced by the response $A(\lambda)$. [quotetheorem:3648] [citeproof:3648] The theorem is the main frequency-domain rule used in time-series preprocessing. Differencing suppresses low frequencies, moving averages suppress high frequencies, and seasonal filters suppress selected periodic components. [example: Moving-Average Low-Pass Filter] Let \begin{align*} Y_t=\frac{1}{m}(X_t+X_{t-1}+\cdots+X_{t-m+1}), \end{align*} so the filter coefficients are $a_j=1/m$ for $j=0,\dots,m-1$ and $a_j=0$ otherwise. Its frequency response is \begin{align*} A(\lambda) &=\sum_{j\in\mathbb Z}a_j e^{-ij\lambda}\\ &=\frac{1}{m}\sum_{j=0}^{m-1}e^{-ij\lambda}. \end{align*} If $\lambda\notin 2\pi\mathbb Z$, then with $q=e^{-i\lambda}$, \begin{align*} (1-q)\sum_{j=0}^{m-1}q^j &=(1+q+\cdots+q^{m-1})-(q+q^2+\cdots+q^m)\\ &=1-q^m, \end{align*} and hence \begin{align*} A(\lambda) &=\frac{1}{m}\frac{1-e^{-im\lambda}}{1-e^{-i\lambda}}. \end{align*} Therefore, for $\lambda\notin 2\pi\mathbb Z$, \begin{align*} |A(\lambda)|^2 &=\frac{1}{m^2}\left|\frac{1-e^{-im\lambda}}{1-e^{-i\lambda}}\right|^2. \end{align*} For any real $u$, \begin{align*} 1-e^{-iu} &=e^{-iu/2}(e^{iu/2}-e^{-iu/2})\\ &=2i e^{-iu/2}\sin(u/2), \end{align*} so, since $|e^{-iu/2}|=1$ and $|i|=1$, \begin{align*} |1-e^{-iu}|^2 &=4\sin^2(u/2). \end{align*} Applying this with $u=m\lambda$ and $u=\lambda$ gives \begin{align*} |A(\lambda)|^2 &=\frac{1}{m^2}\frac{4\sin^2(m\lambda/2)}{4\sin^2(\lambda/2)}\\ &=\frac{1}{m^2}\left(\frac{\sin(m\lambda/2)}{\sin(\lambda/2)}\right)^2. \end{align*} At $\lambda=0$, \begin{align*} A(0) &=\frac{1}{m}\sum_{j=0}^{m-1}1\\ &=1, \end{align*} so $|A(0)|^2=1$. Also, for every $\lambda$, \begin{align*} |A(\lambda)| &=\left|\frac{1}{m}\sum_{j=0}^{m-1}e^{-ij\lambda}\right|\\ &\le \frac{1}{m}\sum_{j=0}^{m-1}|e^{-ij\lambda}|\\ &=\frac{1}{m}\sum_{j=0}^{m-1}1\\ &=1, \end{align*} so the gain is maximized at frequency $0$. The numerator $\sin(m\lambda/2)$ vanishes when $\lambda=2\pi \ell/m$, and the denominator is nonzero unless $\lambda\in2\pi\mathbb Z$; hence the gain has zeros at the nonzero multiples of $2\pi/m$ in the frequency range under consideration. The moving average therefore preserves low-frequency components most strongly and cancels selected faster oscillations. [/example] Moving averages smooth by retaining low-frequency variation, but time series work also needs filters that remove slow movement. The first difference is the basic high-pass filter, so the next example records its frequency response explicitly. [example: First Difference Filter] Let $Y_t=X_t-X_{t-1}$, and suppose $X_t$ has spectral density $f_X$. This is a linear filter with coefficients \begin{align*} a_0&=1,\\ a_1&=-1,\\ a_j&=0,\qquad j\notin\{0,1\}. \end{align*} Its frequency response is \begin{align*} A(\lambda) &=\sum_{j\in\mathbb Z}a_j e^{-ij\lambda}\\ &=a_0e^{-i0\lambda}+a_1e^{-i1\lambda}\\ &=1-e^{-i\lambda}. \end{align*} By *Linear Filter Spectral Transformation*, \begin{align*} f_Y(\lambda) &=|A(\lambda)|^2f_X(\lambda)\\ &=|1-e^{-i\lambda}|^2f_X(\lambda). \end{align*} To put the gain in real form, \begin{align*} 1-e^{-i\lambda} &=e^{-i\lambda/2}\left(e^{i\lambda/2}-e^{-i\lambda/2}\right)\\ &=2i e^{-i\lambda/2}\sin(\lambda/2), \end{align*} so, using $|i|=1$ and $|e^{-i\lambda/2}|=1$, \begin{align*} |1-e^{-i\lambda}|^2 &=\left|2i e^{-i\lambda/2}\sin(\lambda/2)\right|^2\\ &=4|i|^2|e^{-i\lambda/2}|^2\sin^2(\lambda/2)\\ &=4\sin^2(\lambda/2). \end{align*} Therefore \begin{align*} f_Y(\lambda) &=4\sin^2(\lambda/2)f_X(\lambda). \end{align*} At $\lambda=0$, \begin{align*} 4\sin^2(0/2)=0, \end{align*} so the first difference filter removes frequency $0$ exactly and strongly attenuates components whose frequencies are close to $0$. [/example] ## Aliasing and Discrete Observation The final question is: what spectral information is lost when a continuous-time process or a high-frequency signal is sampled at discrete times? Sampling folds frequencies outside the observable band back into $[-\pi,\pi]$, so different continuous-time spectra can generate the same discrete-time covariance sequence. [definition: Aliasing] Aliasing is the identification of frequencies that differ by integer multiples of $2\pi$ in a discretely sampled time series. [/definition] Because $e^{it(\lambda+2\pi k)}=e^{it\lambda}$ for every $t\in\mathbb Z$ and $k\in\mathbb Z$, discrete observations cannot distinguish those frequencies. This is why spectra for discrete-time processes are represented on a fundamental interval such as $[-\pi,\pi]$. When a continuous-time signal is sampled, however, all frequencies outside that interval still contribute after being folded back into it. The next result formalizes this folding as a statement about the sampled spectrum. [quotetheorem:3649] [citeproof:3649] Aliasing is not a defect of the Fourier transform; it is a limitation of the observation scheme. Anti-aliasing filters reduce high-frequency power before sampling so that the folded terms contribute little to the observed spectrum. [example: Aliased Sinusoid] Sample the continuous-time sinusoid $Y_t=\cos(\omega t)$ at integer times, and write \begin{align*} X_n=Y_n=\cos(\omega n),\qquad n\in\mathbb Z. \end{align*} For any $k\in\mathbb Z$, the sinusoid with frequency $\omega+2\pi k$ gives the sampled sequence \begin{align*} \widetilde X_n &=\cos((\omega+2\pi k)n)\\ &=\cos(\omega n+2\pi kn). \end{align*} Using the addition formula $\cos(a+b)=\cos a\cos b-\sin a\sin b$ with $a=\omega n$ and $b=2\pi kn$, \begin{align*} \widetilde X_n &=\cos(\omega n)\cos(2\pi kn)-\sin(\omega n)\sin(2\pi kn). \end{align*} Since $k,n\in\mathbb Z$, the product $kn$ is an integer, so \begin{align*} \cos(2\pi kn)&=1,\\ \sin(2\pi kn)&=0. \end{align*} Therefore \begin{align*} \widetilde X_n &=\cos(\omega n)\cdot 1-\sin(\omega n)\cdot 0\\ &=\cos(\omega n)\\ &=X_n. \end{align*} Thus integer-time sampling cannot distinguish $\omega$ from $\omega+2\pi k$; in discrete time, frequency is identifiable only modulo $2\pi$. [/example] Chapter 6 replaced the time-domain view of dependence with a frequency-domain one, emphasizing how spectra encode cycles, filters, and scale-dependent behavior. Chapter 7 then turns that representation into inference, showing how finite samples can be used to estimate spectra and test frequency-domain features of a stationary process. # 7. Frequency-Domain Inference Frequency-domain inference turns the covariance structure of a stationary time series into an inference problem about power at different frequencies. Chapters 2 and 6 described linear filters and ARMA models through autocovariances, backshift polynomials, and spectra; this chapter uses the Fourier transform of those autocovariances as the central object for inference. The guiding questions are: what can be estimated from a finite stretch of data, why the raw periodogram is not a consistent estimator, and how frequency-domain likelihoods approximate time-domain Gaussian inference. ## Periodograms, Smoothing, Tapering, and Leakage A finite sample $X_1,\dots,X_n$ cannot reveal the whole spectral density directly. The first object it gives us is a discrete Fourier transform, but the squared modulus of that transform is too variable to be treated as an ordinary nonparametric estimate without further averaging. [definition: Fourier Frequencies] For a sample of length $n$, the Fourier frequencies are \begin{align*} \omega_j = \frac{2\pi j}{n}, \qquad j = 0,1,\dots,n-1. \end{align*} [/definition] The frequencies $2\pi j/n$ are the grid on which the discrete Fourier basis is orthogonal over the observed sample. For real-valued data the information at $\omega$ and $2\pi-\omega$ is redundant through complex conjugacy, so inference is usually reported on $[0,\pi]$. To measure how much of the observed series oscillates at one of these frequencies, we project the data onto the corresponding complex exponential. This projection is the discrete Fourier transform. [definition: Discrete Fourier Transform] For observations $X_1,\dots,X_n$, the discrete Fourier transform at frequency $\omega$ is \begin{align*} d_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n} X_t e^{-i\omega t}. \end{align*} [/definition] For estimation of a univariate spectrum, the basic observable quantity is not the complex transform itself but its energy at each frequency. The obstruction is that phase varies with the time origin, while the second-order dependence we want to estimate is encoded in frequency-by-frequency power. Squaring the modulus removes phase and produces a raw sample-power quantity on the same scale as the spectral density when $n$ is large. [definition: Periodogram] The periodogram is \begin{align*} I_n(\omega) = |d_n(\omega)|^2 = \frac{1}{2\pi n}\left|\sum_{t=1}^{n} X_t e^{-i\omega t}\right|^2. \end{align*} [/definition] The periodogram is easy to compute and often gives the first visual evidence of periodic behaviour, but its peaks must be interpreted through the model. A concrete autoregressive oscillator shows how the location of a periodogram peak reflects the denominator of the spectral density rather than a deterministic sinusoid alone. [example: Periodogram Peak For An AR(2) Series] Let \begin{align*} X_t=1.2X_{t-1}-0.7X_{t-2}+Z_t, \end{align*} where $(Z_t)$ is white noise with variance $\sigma^2$. The autoregressive polynomial is \begin{align*} \phi(z)=1-1.2z+0.7z^2 =1-\frac65 z+\frac7{10}z^2. \end{align*} Its roots solve \begin{align*} 0.7z^2-1.2z+1&=0,\\ z&=\frac{1.2\pm\sqrt{1.2^2-4(0.7)(1)}}{2(0.7)}\\ &=\frac{1.2\pm\sqrt{1.44-2.8}}{1.4}\\ &=\frac{1.2\pm i\sqrt{1.36}}{1.4}\\ &=\frac{6\pm i\sqrt{34}}{7}. \end{align*} The modulus of each root is \begin{align*} \left|\frac{6\pm i\sqrt{34}}{7}\right| =\frac{\sqrt{6^2+(\sqrt{34})^2}}{7} =\frac{\sqrt{70}}{7} =\sqrt{\frac{10}{7}}>1, \end{align*} so the stationarity condition is consistent with the stated assumption. The root argument is \begin{align*} \lambda=\arg(6+i\sqrt{34}) =\arctan\left(\frac{\sqrt{34}}{6}\right) \approx 0.771. \end{align*} For this AR(2) model the spectral density is \begin{align*} f(\omega) =\frac{\sigma^2}{2\pi |1-1.2e^{-i\omega}+0.7e^{-2i\omega}|^2}. \end{align*} To locate its peak, compute the denominator. With $q=e^{-i\omega}$, $a=1.2$, and $b=0.7$, \begin{align*} |1-aq+bq^2|^2 &=(1-aq+bq^2)(1-aq^{-1}+bq^{-2})\\ &=1-aq^{-1}+bq^{-2}-aq+a^2-abq^{-1}+bq^2-abq+b^2\\ &=1+a^2+b^2-a(1+b)(q+q^{-1})+b(q^2+q^{-2}). \end{align*} Since $q+q^{-1}=2\cos\omega$ and $q^2+q^{-2}=2\cos(2\omega)$, \begin{align*} |1-1.2e^{-i\omega}+0.7e^{-2i\omega}|^2 &=1+1.2^2+0.7^2-2(1.2)(1.7)\cos\omega+2(0.7)\cos(2\omega)\\ &=2.93-4.08\cos\omega+1.4\cos(2\omega). \end{align*} Using $\cos(2\omega)=2\cos^2\omega-1$, this becomes \begin{align*} 2.93-4.08\cos\omega+1.4(2\cos^2\omega-1) &=1.53-4.08\cos\omega+2.8\cos^2\omega. \end{align*} Writing $x=\cos\omega$, the denominator is the quadratic \begin{align*} D(x)=2.8x^2-4.08x+1.53. \end{align*} Its minimum occurs at \begin{align*} x=\frac{4.08}{2(2.8)}=\frac{4.08}{5.6}\approx 0.729, \end{align*} so the spectral density is largest near \begin{align*} \omega=\arccos(0.729)\approx 0.754. \end{align*} This is close to the oscillation frequency $\lambda\approx0.771$ determined by the complex autoregressive roots. In a simulated sample, the largest periodogram ordinates therefore tend to occur near Fourier frequencies $2\pi j/n$ closest to this value, although the exact largest ordinate varies from sample to sample because the unsmoothed periodogram has high variance. [/example] The example shows both the attraction and the danger of the periodogram. It locates dominant cycles well, but its usefulness depends on the sampling distribution of its ordinates, not just on the position of its peaks. Under standard short-range dependence and moment assumptions, distinct Fourier ordinates behave asymptotically like independent complex normal variables with variances governed by the spectral density. This volatility raises the key consistency question for the raw periodogram. Even when each ordinate is centered near the right spectral value, the variance may fail to vanish, so we need to know whether an unsmoothed periodogram can converge pointwise to the spectrum at all. [quotetheorem:3651] [citeproof:3651] The failure is not bias but variance. Individual periodogram ordinates remain highly variable because each one uses only one frequency projection, so increasing the sample length mainly refines the frequency grid rather than averaging repeated observations at the same frequency. This explains why visual peaks in a raw periodogram are suggestive but not automatically reliable estimates of the height of the spectrum. A practical estimator must borrow information from nearby frequencies. The obstruction is not how to compute more Fourier coefficients, but how to average enough neighboring ordinates to reduce variance without washing out local spectral shape. This leads to a weighted local average whose bandwidth will later control the bias-variance tradeoff. [definition: Smoothed Periodogram] Let $W_m(k)$ be nonnegative weights indexed by $|k|\le m$ such that $\sum_{k=-m}^{m}W_m(k)=1$. The smoothed periodogram at a Fourier frequency $\omega_j$ is \begin{align*} \hat f_n(\omega_j)=\sum_{k=-m}^{m} W_m(k) I_n(\omega_{j+k}), \end{align*} with frequencies interpreted modulo $2\pi$ or restricted away from the endpoints. [/definition] The span $m=m_n$ is the bandwidth in frequency-index units. In angular-frequency units the bandwidth is of order $m_n/n$. The consistency question is therefore a balance between two limits: the number of averaged ordinates should grow, but the angular width of the averaging window should shrink. The next theorem records the standard condition under which that balance succeeds. [quotetheorem:3652] [citeproof:3652] The theorem encodes the bandwidth tradeoff. A small bandwidth keeps sharp peaks but leaves high variance; a large bandwidth stabilises the estimate but smears nearby spectral features. Even with an appropriate bandwidth, another finite-sample problem remains: observing only a finite stretch of data acts like multiplying the infinite series by a rectangular window. That abrupt window can leak power across nearby frequencies, so we introduce tapers to control the boundary effect before forming Fourier transforms. [definition: Data Taper] A data taper is a deterministic sequence $h_{t,n}$, $1\le t\le n$, used to replace $X_t$ by $h_{t,n}X_t$ before computing the Fourier transform. [/definition] Tapering reduces discontinuities created by observing a finite segment as though it were periodically extended. It lowers contamination from strong frequencies into nearby frequencies, but it also changes the effective sample size and broadens the main lobe of the spectral window. To describe the finite-sample artifact that tapers are designed to mitigate, we need a separate term for the spreading of power caused by windowing. This terminology lets us distinguish genuine nearby spectral features from energy that has been redistributed by the observation scheme. [definition: Spectral Leakage] Spectral leakage is the spreading of power from one frequency into nearby estimated frequencies caused by finite-sample windowing. [/definition] The definition becomes visible already for a single sinusoid whose frequency misses the Fourier grid. Instead of appearing at one ordinate, its energy is spread by the finite observation window, which is exactly the practical artifact that tapering tries to reduce. [example: Leakage From A Single Sinusoid] Take $X_t=\cos(\lambda t)$ for $1\le t\le n$, and suppose $\lambda\ne 2\pi j/n$ for every integer $j$. We compute the periodogram at the Fourier frequency $\omega_j=2\pi j/n$ and show that the finite observation window spreads the sinusoid across nearby Fourier ordinates. Using $2\cos u=e^{iu}+e^{-iu}$, \begin{align*} \sum_{t=1}^{n}X_t e^{-i\omega_j t} &=\frac12\sum_{t=1}^{n}\left(e^{i\lambda t}+e^{-i\lambda t}\right)e^{-i\omega_j t}\\ &=\frac12\sum_{t=1}^{n}e^{i(\lambda-\omega_j)t} +\frac12\sum_{t=1}^{n}e^{-i(\lambda+\omega_j)t}. \end{align*} For any real $\alpha$ with $e^{i\alpha}\ne1$, \begin{align*} \sum_{t=1}^{n}e^{i\alpha t} &=e^{i\alpha}+e^{2i\alpha}+\cdots+e^{ni\alpha}\\ &=e^{i\alpha}\left(1+e^{i\alpha}+\cdots+e^{i(n-1)\alpha}\right)\\ &=e^{i\alpha}\frac{1-e^{in\alpha}}{1-e^{i\alpha}}. \end{align*} Rewrite the numerator and denominator as \begin{align*} 1-e^{in\alpha} &=e^{in\alpha/2}\left(e^{-in\alpha/2}-e^{in\alpha/2}\right) =-2i e^{in\alpha/2}\sin\left(\frac{n\alpha}{2}\right),\\ 1-e^{i\alpha} &=e^{i\alpha/2}\left(e^{-i\alpha/2}-e^{i\alpha/2}\right) =-2i e^{i\alpha/2}\sin\left(\frac{\alpha}{2}\right). \end{align*} Therefore \begin{align*} \sum_{t=1}^{n}e^{i\alpha t} &=e^{i\alpha}\frac{-2i e^{in\alpha/2}\sin(n\alpha/2)} {-2i e^{i\alpha/2}\sin(\alpha/2)}\\ &=e^{i(n+1)\alpha/2}\frac{\sin(n\alpha/2)}{\sin(\alpha/2)}. \end{align*} Applying this with $\alpha=\lambda-\omega_j$ and with $\alpha=-(\lambda+\omega_j)$ gives \begin{align*} \sum_{t=1}^{n}X_t e^{-i\omega_j t} &=\frac12 e^{i(n+1)(\lambda-\omega_j)/2} \frac{\sin(n(\lambda-\omega_j)/2)}{\sin((\lambda-\omega_j)/2)}\\ &\quad+\frac12 e^{-i(n+1)(\lambda+\omega_j)/2} \frac{\sin(n(\lambda+\omega_j)/2)}{\sin((\lambda+\omega_j)/2)}. \end{align*} Hence \begin{align*} I_n(\omega_j) &=\frac{1}{2\pi n}\left| \frac12 e^{i(n+1)(\lambda-\omega_j)/2} \frac{\sin(n(\lambda-\omega_j)/2)}{\sin((\lambda-\omega_j)/2)} +\frac12 e^{-i(n+1)(\lambda+\omega_j)/2} \frac{\sin(n(\lambda+\omega_j)/2)}{\sin((\lambda+\omega_j)/2)} \right|^2. \end{align*} The factor \begin{align*} D_n(\alpha)=\frac{\sin(n\alpha/2)}{\sin(\alpha/2)} \end{align*} is the Dirichlet-kernel amplitude. It is largest when $\alpha$ is close to $0$, so the first term is largest for Fourier frequencies $\omega_j$ close to $\lambda$, while the second is largest for frequencies close to $2\pi-\lambda$. Because $\lambda$ is not exactly on the Fourier grid, neither denominator corresponds to a single isolated Fourier ordinate, and the periodogram shows a main peak with side lobes rather than all its mass at one frequency. [/example] ## Whittle Likelihood and Gaussian Frequency-Domain Inference Exact Gaussian likelihood for a stationary time series involves the inverse and determinant of an $n\times n$ covariance matrix. The frequency-domain alternative asks whether the Fourier transform almost diagonalises that covariance structure, replacing a large correlated Gaussian likelihood by a product of approximate independent likelihood contributions. [definition: Gaussian Time-Domain Likelihood] Let $X=(X_1,\dots,X_n)^\top$ have mean $0$ and covariance matrix $\Gamma_n(\theta)$ under parameter $\theta\in\Theta$. The Gaussian log-likelihood, up to an additive constant independent of $\theta$, is \begin{align*} \ell_n(\theta)=-\frac{1}{2}\log\det \Gamma_n(\theta)-\frac{1}{2}X^\top\Gamma_n(\theta)^{-1}X. \end{align*} [/definition] For long series this likelihood can be computationally expensive. If the model is most naturally specified by its spectral density $f_\theta$, the useful approximation is to replace the time-domain covariance calculation by approximately independent frequency-domain contributions. The resulting criterion compares the observed periodogram ordinate with the model spectrum at the same frequency. This gives a likelihood-like objective that can be evaluated from $f_\theta$ without forming or inverting the full covariance matrix. The next definition names that approximation so frequency-domain parametric estimation can be discussed with the same precision as Gaussian likelihood in the time domain. [definition: Whittle Likelihood] For a parametric spectral density $f_\theta(\omega)>0$, the Whittle objective is \begin{align*} Q_n(\theta)=\sum_{j=1}^{\lfloor (n-1)/2\rfloor}\left\{\log f_\theta(\omega_j)+\frac{I_n(\omega_j)}{f_\theta(\omega_j)}\right\}. \end{align*} A Whittle estimator is any measurable minimiser $\hat\theta_n$ of $Q_n(\theta)$ over $\Theta$. [/definition] The sum is taken over positive interior Fourier frequencies for real-valued data to avoid double counting conjugate ordinates. Constants independent of $\theta$ are omitted because they do not affect estimation. Once the objective has been defined, the statistical question is whether minimising it behaves like maximum likelihood for large samples. Under the short-memory Gaussian regularity assumptions used in Whittle theory, the Toeplitz covariance matrix from the exact likelihood can be replaced asymptotically by the circulant matrix diagonalised by the Fourier basis. In the spectral-density convention used throughout these notes, \begin{align*} \gamma_\theta(h)=\int_{-\pi}^{\pi}e^{ih\lambda}f_\theta(\lambda)\,d\lambda, \end{align*} this means that the exact Gaussian negative log-likelihood per observation and the corresponding frequency-domain Whittle objective differ by a term that converges to $0$ in probability, uniformly over compact parameter sets satisfying the usual bounded-away-from-zero and smoothness assumptions. This asymptotic equivalence explains why Whittle estimation is often close to exact Gaussian maximum likelihood while being simpler and faster. It is an approximation to the likelihood, not a claim that the raw periodogram is a consistent estimate of $f$ pointwise. [example: Comparing Time-Domain And Whittle Fits] Let $X_t-\phi_1X_{t-1}=Z_t+\theta_1Z_{t-1}$, where $(Z_t)$ is white noise with variance $\sigma^2$, $|\phi_1|<1$, and $|\theta_1|<1$. To compare exact Gaussian fitting with Whittle fitting on the same grid of $(\phi_1,\theta_1)$ values, write \begin{align*} f_{\phi_1,\theta_1,\sigma^2}(\omega) &=\frac{\sigma^2}{2\pi}\frac{|1+\theta_1e^{-i\omega}|^2}{|1-\phi_1e^{-i\omega}|^2}. \end{align*} The two squared moduli are \begin{align*} |1+\theta_1e^{-i\omega}|^2 &=(1+\theta_1e^{-i\omega})(1+\theta_1e^{i\omega})\\ &=1+\theta_1e^{i\omega}+\theta_1e^{-i\omega}+\theta_1^2\\ &=1+\theta_1^2+\theta_1(e^{i\omega}+e^{-i\omega})\\ &=1+\theta_1^2+2\theta_1\cos\omega, \end{align*} and \begin{align*} |1-\phi_1e^{-i\omega}|^2 &=(1-\phi_1e^{-i\omega})(1-\phi_1e^{i\omega})\\ &=1-\phi_1e^{i\omega}-\phi_1e^{-i\omega}+\phi_1^2\\ &=1+\phi_1^2-\phi_1(e^{i\omega}+e^{-i\omega})\\ &=1+\phi_1^2-2\phi_1\cos\omega. \end{align*} Thus \begin{align*} f_{\phi_1,\theta_1,\sigma^2}(\omega) =\frac{\sigma^2}{2\pi} \frac{1+\theta_1^2+2\theta_1\cos\omega} {1+\phi_1^2-2\phi_1\cos\omega}. \end{align*} For the time-domain likelihood, build the covariance matrix $\Gamma_n(\phi_1,\theta_1,\sigma^2)$ from the ARMA$(1,1)$ autocovariances. Since \begin{align*} \frac{1+\theta_1B}{1-\phi_1B} &=(1+\theta_1B)(1+\phi_1B+\phi_1^2B^2+\cdots)\\ &=1+(\phi_1+\theta_1)B+(\phi_1^2+\theta_1\phi_1)B^2+\cdots, \end{align*} the moving-average coefficients are $\psi_0=1$ and \begin{align*} \psi_k=(\phi_1+\theta_1)\phi_1^{k-1},\qquad k\ge1. \end{align*} Therefore \begin{align*} \gamma(0) &=\sigma^2\sum_{k=0}^{\infty}\psi_k^2\\ &=\sigma^2\left\{1+\sum_{k=1}^{\infty}(\phi_1+\theta_1)^2\phi_1^{2k-2}\right\}\\ &=\sigma^2\left\{1+(\phi_1+\theta_1)^2\sum_{r=0}^{\infty}\phi_1^{2r}\right\}\\ &=\sigma^2\left\{1+\frac{(\phi_1+\theta_1)^2}{1-\phi_1^2}\right\}\\ &=\sigma^2\frac{1-\phi_1^2+\phi_1^2+2\phi_1\theta_1+\theta_1^2}{1-\phi_1^2}\\ &=\sigma^2\frac{1+2\phi_1\theta_1+\theta_1^2}{1-\phi_1^2}, \end{align*} and for $h\ge1$, \begin{align*} \gamma(h) &=\sigma^2\sum_{k=0}^{\infty}\psi_{k+h}\psi_k\\ &=\sigma^2\psi_h\psi_0+\sigma^2\sum_{k=1}^{\infty}\psi_{k+h}\psi_k\\ &=\sigma^2(\phi_1+\theta_1)\phi_1^{h-1} +\sigma^2\sum_{k=1}^{\infty}(\phi_1+\theta_1)^2\phi_1^{k+h-1}\phi_1^{k-1}\\ &=\sigma^2(\phi_1+\theta_1)\phi_1^{h-1} +\sigma^2(\phi_1+\theta_1)^2\phi_1^h\sum_{r=0}^{\infty}\phi_1^{2r}\\ &=\sigma^2(\phi_1+\theta_1)\phi_1^{h-1} +\sigma^2\frac{(\phi_1+\theta_1)^2\phi_1^h}{1-\phi_1^2}\\ &=\sigma^2(\phi_1+\theta_1)\phi_1^{h-1} \left\{1+\frac{(\phi_1+\theta_1)\phi_1}{1-\phi_1^2}\right\}\\ &=\sigma^2(\phi_1+\theta_1)\phi_1^{h-1} \frac{1-\phi_1^2+\phi_1^2+\phi_1\theta_1}{1-\phi_1^2}\\ &=\sigma^2\frac{(\phi_1+\theta_1)(1+\phi_1\theta_1)\phi_1^{h-1}}{1-\phi_1^2}. \end{align*} The exact Gaussian objective on the grid is then obtained from \begin{align*} -\ell_n(\phi_1,\theta_1,\sigma^2) =\frac12\log\det\Gamma_n(\phi_1,\theta_1,\sigma^2) +\frac12X^\top\Gamma_n(\phi_1,\theta_1,\sigma^2)^{-1}X, \end{align*} up to constants independent of the parameters. For the Whittle comparison, put \begin{align*} g_{\phi_1,\theta_1}(\omega) =\frac{1}{2\pi} \frac{1+\theta_1^2+2\theta_1\cos\omega} {1+\phi_1^2-2\phi_1\cos\omega}, \end{align*} so that $f_{\phi_1,\theta_1,\sigma^2}(\omega)=\sigma^2g_{\phi_1,\theta_1}(\omega)$. With $m=\lfloor(n-1)/2\rfloor$, the Whittle objective becomes \begin{align*} Q_n(\phi_1,\theta_1,\sigma^2) &=\sum_{j=1}^{m} \left\{\log\bigl(\sigma^2g_{\phi_1,\theta_1}(\omega_j)\bigr) +\frac{I_n(\omega_j)}{\sigma^2g_{\phi_1,\theta_1}(\omega_j)}\right\}\\ &=\sum_{j=1}^{m} \left\{\log\sigma^2+\log g_{\phi_1,\theta_1}(\omega_j) +\frac{I_n(\omega_j)}{\sigma^2g_{\phi_1,\theta_1}(\omega_j)}\right\}\\ &=m\log\sigma^2 +\sum_{j=1}^{m}\log g_{\phi_1,\theta_1}(\omega_j) +\frac{1}{\sigma^2}\sum_{j=1}^{m}\frac{I_n(\omega_j)}{g_{\phi_1,\theta_1}(\omega_j)}. \end{align*} For each fixed grid point $(\phi_1,\theta_1)$, the Whittle estimate of $\sigma^2$ is found by differentiating: \begin{align*} \frac{\partial Q_n}{\partial \sigma^2} &=\frac{m}{\sigma^2} -\frac{1}{(\sigma^2)^2}\sum_{j=1}^{m}\frac{I_n(\omega_j)}{g_{\phi_1,\theta_1}(\omega_j)}. \end{align*} Setting this derivative equal to $0$ and multiplying by $(\sigma^2)^2$ gives \begin{align*} m\sigma^2-\sum_{j=1}^{m}\frac{I_n(\omega_j)}{g_{\phi_1,\theta_1}(\omega_j)}=0, \end{align*} so \begin{align*} \widehat{\sigma}^2_W(\phi_1,\theta_1) =\frac1m\sum_{j=1}^{m}\frac{I_n(\omega_j)}{g_{\phi_1,\theta_1}(\omega_j)}. \end{align*} The comparison is now concrete: evaluate the profiled exact Gaussian objective and the profiled Whittle objective at the same grid points $(\phi_1,\theta_1)$, and compare their minimisers. For a long short-memory series, *Whittle Likelihood Approximation* says that the two normalised objectives differ by a negligible amount after constants independent of the parameters are matched, so their minimisers are often close. For short samples, or when $\theta_1$ is close to the noninvertible boundary $|\theta_1|=1$, the finite-sample covariance matrix and the frequency-domain approximation can differ enough that the two fitted parameter values are visibly separated. [/example] Whittle inference also gives approximate standard errors through the curvature of $Q_n$ at the minimiser. In correctly specified Gaussian short-memory models, these standard errors match the Fisher information limit obtained from exact likelihood. ## Cross-Spectra, Coherency, and Phase For two time series, frequency-domain inference asks not only where each series has power but where they share power. Cross-spectral methods separate common oscillations by frequency and identify whether one series tends to lead the other at those frequencies. [definition: Cross-Covariance Function] Let $(X_t,Y_t)_{t\in\mathbb Z}$ be a jointly weakly stationary bivariate process with finite second moments. The cross-covariance function from $X$ to $Y$ is \begin{align*} \gamma_{XY}(h)=\operatorname{Cov}(X_{t+h},Y_t), \qquad h\in\mathbb Z. \end{align*} [/definition] The order of the variables matters. In general $\gamma_{XY}(h)=\gamma_{YX}(-h)$ for real-valued processes, so positive lags encode a directional timing convention. As in the univariate case, a covariance sequence is easier to interpret after decomposing it by frequency. The problem now is that dependence between two series has both magnitude and timing direction, so an ordinary real-valued spectrum is no longer enough. Fourier transforming the cross-covariances produces a frequency-by-frequency object that records both the strength of common oscillations and their timing relationship. [definition: Cross-Spectrum] If $\sum_{h\in\mathbb Z}|\gamma_{XY}(h)|<\infty$, the cross-spectrum from $X$ to $Y$ is \begin{align*} f_{XY}(\omega)=\frac{1}{2\pi}\sum_{h\in\mathbb Z}\gamma_{XY}(h)e^{-i\omega h}, \qquad \omega\in[-\pi,\pi]. \end{align*} [/definition] The raw cross-spectrum still depends on the marginal scales of the two series, so its magnitude is not directly comparable across frequencies or data sets. The obstruction is the same one faced by covariance before correlation: a large value may reflect scale rather than strong association. Normalising by the two marginal spectra gives a dimensionless frequency-domain analogue of correlation. [definition: Coherency And Coherence] Let $f_X(\omega)>0$ and $f_Y(\omega)>0$ be the spectral densities of $X$ and $Y$. The coherency is \begin{align*} \kappa_{XY}(\omega)=\frac{f_{XY}(\omega)}{\{f_X(\omega)f_Y(\omega)\}^{1/2}}. \end{align*} The coherence is $|\kappa_{XY}(\omega)|^2$. [/definition] Coherence is a frequency-by-frequency analogue of squared correlation, but magnitude alone still discards timing. High coherence identifies frequencies at which a shared oscillation is meaningful; the remaining obstruction is to decide whether one component tends to lead or lag the other. That directional information is carried by the argument of the complex cross-spectrum, so it is recorded separately as phase. [definition: Phase] At a frequency where $f_{XY}(\omega)\ne0$, the phase of $Y$ relative to $X$ is \begin{align*} \varphi_{XY}(\omega)=\arg f_{XY}(\omega). \end{align*} [/definition] A nonzero phase means that the common oscillatory component is shifted in time. Interpreting phase as a lead or lag requires a sign convention and is most stable near frequencies with high coherence. [quotetheorem:3654] [citeproof:3654] The bound makes coherence a normalised quantity rather than an arbitrary rescaling of the cross-spectrum. It also warns that phase estimates are unreliable at frequencies where either marginal spectrum is close to zero. [example: Detecting Common Cycles In Two Macroeconomic Series] Let $X_t$ be quarterly output growth and $Y_t$ be quarterly investment growth after removing a fitted trend and seasonal mean, so both series are treated as approximately mean zero. For a taper $h_{t,n}$, set $H_n=\sum_{t=1}^n h_{t,n}^2$ and compute the tapered Fourier transforms \begin{align*} d_{X,h}(\omega_j) &=\frac{1}{\sqrt{2\pi H_n}}\sum_{t=1}^n h_{t,n}X_t e^{-i\omega_j t},\\ d_{Y,h}(\omega_j) &=\frac{1}{\sqrt{2\pi H_n}}\sum_{t=1}^n h_{t,n}Y_t e^{-i\omega_j t}, \end{align*} where $\omega_j=2\pi j/n$. The tapered auto-periodograms and cross-periodogram are \begin{align*} I_{X,h}(\omega_j)&=d_{X,h}(\omega_j)\overline{d_{X,h}(\omega_j)},\\ I_{Y,h}(\omega_j)&=d_{Y,h}(\omega_j)\overline{d_{Y,h}(\omega_j)},\\ I_{XY,h}(\omega_j)&=d_{X,h}(\omega_j)\overline{d_{Y,h}(\omega_j)}. \end{align*} Choose nonnegative smoothing weights $W_m(k)$ with $\sum_{k=-m}^m W_m(k)=1$. The smoothed spectral estimates at $\omega_j$ are \begin{align*} \hat f_X(\omega_j) &=\sum_{k=-m}^m W_m(k)I_{X,h}(\omega_{j+k}),\\ \hat f_Y(\omega_j) &=\sum_{k=-m}^m W_m(k)I_{Y,h}(\omega_{j+k}),\\ \hat f_{XY}(\omega_j) &=\sum_{k=-m}^m W_m(k)I_{XY,h}(\omega_{j+k}), \end{align*} with the frequency indices restricted to the interior or interpreted modulo $2\pi$. The estimated coherency and coherence are then \begin{align*} \hat\kappa_{XY}(\omega_j) &=\frac{\hat f_{XY}(\omega_j)} {\{\hat f_X(\omega_j)\hat f_Y(\omega_j)\}^{1/2}},\\ |\hat\kappa_{XY}(\omega_j)|^2 &=\frac{|\hat f_{XY}(\omega_j)|^2} {\hat f_X(\omega_j)\hat f_Y(\omega_j)}. \end{align*} A business-cycle period of $L$ quarters corresponds to angular frequency \begin{align*} \omega=\frac{2\pi}{L}. \end{align*} Thus cycles of length $6$ to $32$ quarters correspond to \begin{align*} \frac{2\pi}{32}\le \omega \le \frac{2\pi}{6}. \end{align*} Inspecting $|\hat\kappa_{XY}(\omega_j)|^2$ on this band measures how much of the variation in output growth and investment growth is shared at each cycle length, after normalising by their separate powers. The estimated phase is \begin{align*} \hat\varphi_{XY}(\omega_j)=\arg \hat f_{XY}(\omega_j). \end{align*} With the convention $\gamma_{XY}(h)=\operatorname{Cov}(X_{t+h},Y_t)$, a phase $\hat\varphi_{XY}(\omega_j)$ corresponds to the approximate time displacement \begin{align*} \hat L_j=\frac{\hat\varphi_{XY}(\omega_j)}{\omega_j} \end{align*} quarters, modulo the cycle length $2\pi/\omega_j$. Positive $\hat L_j$ means investment is shifted later than output at that frequency, while negative $\hat L_j$ means investment leads output. Thus high coherence in the band $2\pi/32\le\omega_j\le2\pi/6$, together with a stable phase across nearby smoothed frequencies, gives evidence for a common business-cycle component and an estimated lead-lag relation between investment and output. [/example] In practice, cross-spectral inference needs the same smoothing logic as univariate spectral estimation. The raw cross-periodogram is variable, while smoothed cross-spectra trade frequency resolution against variance. Taper choice and bandwidth should therefore be reported alongside coherence and phase estimates, because they determine how local the frequency-domain claims are. Chapter 7 developed estimation and testing in the frequency domain, where periodograms, smoothing, and cross-spectral quantities make second-order structure observable in practice. Chapter 8 shifts to state space models, where the same prediction problems are handled recursively through latent states and the Kalman filter. # 8. State Space Models and the Kalman Filter State space methods recast a time series as the noisy observation of an evolving latent state. Building on ARMA models from Chapters 2 and 3, likelihood-based estimation from Chapter 3, and prediction-error decompositions from Chapter 7, this chapter explains the linear Gaussian case, where conditional distributions remain Gaussian and all updating reduces to matrix recursions for means and covariance matrices. The main questions are: how do we update beliefs about the hidden state when a new observation arrives, how do we evaluate the likelihood efficiently, and how do familiar ARMA models fit into this representation? ## From Observed Series to Hidden States A univariate ARMA model describes dependence directly through past observations and past innovations. A state space model separates the mechanism into a hidden Markovian state and an observation equation, which is useful when the measured series is noisy, incomplete, multivariate, or built from several latent components. [definition: Linear Gaussian State Space Model] A linear Gaussian state space model consists of random vectors $(\alpha_t)_{t \ge 1}$ and $(Y_t)_{t \ge 1}$ satisfying \begin{align*} Y_t &= Z_t\alpha_t + d_t + \varepsilon_t, \\ \alpha_{t+1} &= T_t\alpha_t + c_t + R_t\eta_t, \end{align*} where $Y_t \in \mathbb R^{m}$ is observed, $\alpha_t \in \mathbb R^{r}$ is the latent state, $Z_t \in \mathbb R^{m \times r}$, $T_t \in \mathbb R^{r \times r}$, $R_t \in \mathbb R^{r \times s}$, $d_t \in \mathbb R^{m}$, and $c_t \in \mathbb R^{r}$. The noise vectors satisfy \begin{align*} \varepsilon_t &\sim \mathcal N(0, H_t), & \eta_t &\sim \mathcal N(0, Q_t), \end{align*} and the families $(\varepsilon_t)$, $(\eta_t)$, and the initial state $\alpha_1$ are mutually independent. The initial state has distribution $\alpha_1 \sim \mathcal N(a_1, P_1)$. [/definition] The first equation is the observation equation: it tells us how the latent state is measured. The second equation is the transition equation: it tells us how the state evolves before the next observation is made. The matrices $H_t$ and $Q_t$ are the observation-noise and system-noise covariance matrices. [example: Local Level Model] Let $Y_t$ be scalar and let the latent state be the scalar level \begin{align*} \alpha_t &= \mu_t. \end{align*} With this choice, the observation equation becomes \begin{align*} Y_t &= \mu_t+\varepsilon_t \\ &= 1\cdot \alpha_t+0+\varepsilon_t, \end{align*} so it matches the linear Gaussian observation form $Y_t=Z_t\alpha_t+d_t+\varepsilon_t$ with \begin{align*} Z_t &= 1, & d_t &= 0, & H_t &= \sigma_\varepsilon^2. \end{align*} The transition equation is \begin{align*} \alpha_{t+1} &= \mu_{t+1} \\ &= \mu_t+\eta_t \\ &= 1\cdot \alpha_t+0+1\cdot \eta_t, \end{align*} so it matches $\alpha_{t+1}=T_t\alpha_t+c_t+R_t\eta_t$ with \begin{align*} T_t &= 1, & c_t &= 0, & R_t &= 1, & Q_t &= \sigma_\eta^2. \end{align*} The assumed independence of $\varepsilon_t$ and $\eta_t$, together with their Gaussian laws, is exactly the noise structure required in the linear Gaussian state space model. Thus the local-level model is the scalar state space model in which the observed series is a noisy measurement of a random-walk level. [/example] The local-level model separates short-run observational disturbance from persistent movement in the underlying mean. This distinction is difficult to express cleanly using only an ARMA equation, but it is natural in state space form. ## Filtering and Prediction At time $t$, the statistical problem is to combine the prediction from the transition equation with the new information in $Y_t$. Since all variables are jointly Gaussian, the conditional law of the state is determined by its conditional mean and covariance matrix. [definition: Filtered and Predicted State] For observations $Y_1, \dots, Y_t$, define \begin{align*} a_{t|t} &= \mathbb E[\alpha_t \mid Y_1, \dots, Y_t], & P_{t|t} &= \operatorname{Var}(\alpha_t \mid Y_1, \dots, Y_t), \\ a_{t|t-1} &= \mathbb E[\alpha_t \mid Y_1, \dots, Y_{t-1}], & P_{t|t-1} &= \operatorname{Var}(\alpha_t \mid Y_1, \dots, Y_{t-1}). \end{align*} [/definition] The pair $(a_{t|t-1}, P_{t|t-1})$ is the one-step-ahead prediction before observing $Y_t$. The pair $(a_{t|t}, P_{t|t})$ is the filtered estimate after observing $Y_t$. Direct conditioning on the full vector $(Y_1,\dots,Y_t)$ would require repeatedly forming and inverting growing covariance matrices. The Kalman filter avoids this obstruction by carrying forward only the current conditional mean and covariance, so each new observation is processed through matrices of the state and observation dimensions rather than the whole sample length. [explanation: Kalman Filter Recursion] The Kalman filter is a recursive Gaussian conditioning algorithm. Given one-step prediction quantities $a_{t|t-1}$ and $P_{t|t-1}$, define \begin{align*} v_t &= Y_t-Z_ta_{t|t-1}-d_t,\\ F_t &= Z_tP_{t|t-1}Z_t^\top+H_t,\\ K_t &= P_{t|t-1}Z_t^\top F_t^{-1}. \end{align*} The measurement update is \begin{align*} a_{t|t} &= a_{t|t-1}+K_tv_t,\\ P_{t|t} &= P_{t|t-1}-K_tF_tK_t^\top, \end{align*} and the transition equation then propagates these filtered quantities to the next one-step prediction. These formulas are the computational update rules used throughout the state-space chapter. [/explanation] The innovation $v_t$ is the part of the new observation not predicted by the past. The gain $K_t$ balances state uncertainty against observation noise: when $H_t$ is large relative to $P_{t|t-1}$, the update moves less in the direction of the new observation. The invertibility of $F_t$ is not a cosmetic hypothesis. If an observation is measured without noise and is already determined by the predicted state distribution, then $F_t$ may be singular and the displayed inverse does not exist; in that case the model imposes an exact linear constraint rather than an ordinary Gaussian density update. Numerical implementations then use a reduced update, a generalized inverse, or a reparameterisation of the observation equation. The theorem also does not estimate unknown parameters by itself: it gives the recursive conditional distributions once the system matrices and covariance matrices have been fixed. The next section uses the same innovations $v_t$ and covariance matrices $F_t$ to evaluate the likelihood for those parameters. [example: Scalar Kalman Update in the Local Level Model] In the local-level model, $\alpha_t=\mu_t$, $Z_t=1$, $d_t=0$, and $H_t=\sigma_\varepsilon^2$. If the predicted distribution is $\mu_t\mid Y_1,\dots,Y_{t-1}\sim \mathcal N(a_{t|t-1},P_{t|t-1})$, then substituting these scalar quantities into the filtering recursions gives \begin{align*} v_t &= Y_t-Z_ta_{t|t-1}-d_t \\ &= Y_t-1\cdot a_{t|t-1}-0 \\ &= Y_t-a_{t|t-1}, \end{align*} and \begin{align*} F_t &= Z_tP_{t|t-1}Z_t^\top+H_t \\ &= 1\cdot P_{t|t-1}\cdot 1+\sigma_\varepsilon^2 \\ &= P_{t|t-1}+\sigma_\varepsilon^2. \end{align*} Since this is a scalar covariance, its inverse is \begin{align*} F_t^{-1}=\frac{1}{P_{t|t-1}+\sigma_\varepsilon^2}, \end{align*} whenever $P_{t|t-1}+\sigma_\varepsilon^2>0$. Hence the Kalman gain is \begin{align*} K_t &= P_{t|t-1}Z_t^\top F_t^{-1} \\ &= P_{t|t-1}\cdot 1\cdot \frac{1}{P_{t|t-1}+\sigma_\varepsilon^2} \\ &= \frac{P_{t|t-1}}{P_{t|t-1}+\sigma_\varepsilon^2}. \end{align*} The filtered mean is therefore \begin{align*} a_{t|t} &= a_{t|t-1}+K_tv_t \\ &= a_{t|t-1}+\frac{P_{t|t-1}}{P_{t|t-1}+\sigma_\varepsilon^2}\bigl(Y_t-a_{t|t-1}\bigr). \end{align*} Equivalently, \begin{align*} a_{t|t} &= \left(1-\frac{P_{t|t-1}}{P_{t|t-1}+\sigma_\varepsilon^2}\right)a_{t|t-1} +\frac{P_{t|t-1}}{P_{t|t-1}+\sigma_\varepsilon^2}Y_t \\ &= \frac{\sigma_\varepsilon^2}{P_{t|t-1}+\sigma_\varepsilon^2}a_{t|t-1} +\frac{P_{t|t-1}}{P_{t|t-1}+\sigma_\varepsilon^2}Y_t. \end{align*} Thus the new estimate is a weighted average of the previous prediction and the observed value: larger state uncertainty $P_{t|t-1}$ gives more weight to $Y_t$, while larger observation noise $\sigma_\varepsilon^2$ gives more weight to the prediction. [/example] ## Likelihood from Prediction Errors For estimation, the state space representation is valuable because the likelihood can be computed as a product of one-step prediction densities. The filter produces exactly the quantities needed for this factorisation: the innovation $v_t$ and its covariance matrix $F_t$. [quotetheorem:3656] [citeproof:3656] This formula is the basis of maximum likelihood estimation for state space models. It also gives diagnostics: standardised prediction errors $F_t^{-1/2}v_t$ should behave like independent standard Gaussian noise under a correctly specified Gaussian model. The assumption that each $F_t$ is invertible is needed because the likelihood uses the ordinary multivariate Gaussian density with covariance matrix $F_t$. If $F_t$ is singular, the conditional distribution of $Y_t$ is supported on a lower-dimensional affine subspace, so the displayed density with respect to Lebesgue measure is not the right object. The formula is also conditional on the chosen treatment of the initial state, and misspecified Gaussian noise can make maximum likelihood estimates efficient for the wrong criterion rather than correct for the data-generating process. In applications, unusually large standardised innovations, serial correlation in the innovations, or repeated near-singularity of $F_t$ are warning signs that the observation equation, noise covariance, or initialisation needs to be reconsidered. [remark: Diffuse Initialisation] When the initial state is poorly known, $P_1$ is often taken very large or handled by an exact diffuse initialisation. The practical goal is to avoid forcing the early likelihood contributions to reflect an arbitrary precise prior about the state. [/remark] ## Smoothing After Filtering Filtering estimates the current state using observations up to the current time. Smoothing asks a different question: after seeing the whole sample, what should we infer about a past state? [definition: Smoothed State] For $1 \le t \le n$, the smoothed state mean and covariance are \begin{align*} a_{t|n} &= \mathbb E[\alpha_t \mid Y_1, \dots, Y_n], & P_{t|n} &= \operatorname{Var}(\alpha_t \mid Y_1, \dots, Y_n). \end{align*} [/definition] The forward filter has already computed the distributions needed to connect adjacent states. The smoother uses these stored quantities in reverse order, so it is another recursive Gaussian conditioning argument rather than a new model. What remains is to state the backward recursion that turns filtered quantities into smoothed quantities. The formula below identifies the gain that transfers information from the future state estimate back to the current state estimate. [explanation: Rauch-Tung-Striebel Smoothing Recursion] The Rauch-Tung-Striebel smoother is the backward pass paired with the forward Kalman filter. After the filter has computed $a_{t|t}$, $P_{t|t}$, $a_{t+1|t}$, and $P_{t+1|t}$, define the smoothing gain \begin{align*} J_t=P_{t|t}T_t^\top P_{t+1|t}^{-1}. \end{align*} Then the smoothed mean and covariance are updated backward by \begin{align*} a_{t|n} &= a_{t|t}+J_t(a_{t+1|n}-a_{t+1|t}),\\ P_{t|n} &= P_{t|t}+J_t(P_{t+1|n}-P_{t+1|t})J_t^\top. \end{align*} This is a Gaussian-conditioning recursion for using future observations to refine an earlier latent-state estimate. [/explanation] The recursion is usually treated as an application of Gaussian conditioning to the pair $(\alpha_t,\alpha_{t+1})$ conditional on $Y_1,\dots,Y_t$; in a full treatment it is derived from the same Gaussian conditioning formula used in the Kalman filter, applied backward through the Markov chain. In this course we use the result rather than derive it in full, because the main focus is filtering and likelihood evaluation. The invertibility of $P_{t+1|t}$ is the backward analogue of the non-degeneracy condition in the filtering theorem. If the transition is deterministic in some direction, then $P_{t+1|t}$ can be singular and the displayed gain $J_t$ must be replaced by a reduced-rank or generalized-inverse version. The smoother also does not change the likelihood, since it uses future observations only after the filtering pass has already evaluated the prediction densities. Its role is inferential: it reconstructs latent states, supplies inputs for expectation-maximisation algorithms, and explains why later observations can sharpen uncertainty about earlier states. ## Missing Observations Real time series often contain gaps, and a state space model handles these gaps without changing the transition equation. If $Y_t$ is missing, there is no measurement update at time $t$; the filter simply propagates the state forward. [quotetheorem:3658] [citeproof:3658] The hypothesis here is that the whole observation vector at time $t$ is absent. In a multivariate series, some components of $Y_t$ may be observed and others missing; then the update is performed only with the observed rows of $Z_t$, $d_t$, and the corresponding covariance submatrix of $H_t$. A common failure mode is to treat a missing value as the numerical value $0$, which creates an artificial innovation and incorrectly reduces the filtered covariance. Correct missing-data handling preserves uncertainty until actual information arrives, and the same bookkeeping is what later allows smoothing across gaps. [example: Missing Monthly Observation] Suppose April is time $t=4$ in the local-level model \begin{align*} Y_t &= \mu_t+\varepsilon_t, \\ \mu_{t+1} &= \mu_t+\eta_t, \end{align*} with $\operatorname{Var}(\eta_t)=\sigma_\eta^2$. If the April observation is missing, then conditioning on the data available through April is the same as conditioning on the data available through March, so \begin{align*} a_{4|4} &= \mathbb E[\mu_4\mid Y_1,Y_2,Y_3] \\ &= a_{4|3}, \end{align*} and \begin{align*} P_{4|4} &= \operatorname{Var}(\mu_4\mid Y_1,Y_2,Y_3) \\ &= P_{4|3}. \end{align*} The prediction from April into May uses the transition equation $\mu_5=\mu_4+\eta_4$. Since $\eta_4$ has mean $0$ and is independent of the observations through March, \begin{align*} a_{5|4} &= \mathbb E[\mu_5\mid Y_1,Y_2,Y_3] \\ &= \mathbb E[\mu_4+\eta_4\mid Y_1,Y_2,Y_3] \\ &= \mathbb E[\mu_4\mid Y_1,Y_2,Y_3]+\mathbb E[\eta_4] \\ &= a_{4|4}+0 \\ &= a_{4|4}. \end{align*} For the covariance, \begin{align*} P_{5|4} &= \operatorname{Var}(\mu_5\mid Y_1,Y_2,Y_3) \\ &= \operatorname{Var}(\mu_4+\eta_4\mid Y_1,Y_2,Y_3) \\ &= \operatorname{Var}(\mu_4\mid Y_1,Y_2,Y_3)+\operatorname{Var}(\eta_4) \\ &= P_{4|4}+\sigma_\eta^2. \end{align*} Thus the missing April value produces no innovation and no covariance reduction; uncertainty is carried forward and then increased by the system-noise variance before the May observation is processed. [/example] Missing data increase uncertainty because the covariance matrix is propagated without being reduced by an observation. This is why smoothed estimates across a gap usually have wider uncertainty bands than estimates near dense observations. This treatment is also one of the practical advantages of state space models over directly fitted ARMA equations: irregular observation patterns change the measurement update, not the latent transition law. The distinction is useful in economic and environmental time series, where missingness often reflects reporting schedules rather than a change in the underlying dynamics. ## ARMA Models in State Space Form A final reason for learning state space methods is that they include ARMA models as a special case. This gives a unified likelihood computation for ARMA models and makes extensions such as missing data and time-varying coefficients accessible. [example: AR One State Space Form] Let $X_t$ satisfy the stationary AR(1) equation \begin{align*} X_t &= \phi X_{t-1}+e_t, & |\phi|&<1, \end{align*} where the innovations $e_t\sim\mathcal N(0,\sigma^2)$ are independent. Set the one-dimensional state equal to the observed process, \begin{align*} \alpha_t &= X_t. \end{align*} Then the observation equation is \begin{align*} Y_t &= X_t \\ &= \alpha_t \\ &= 1\cdot \alpha_t+0+\varepsilon_t, \end{align*} with \begin{align*} Z_t &= 1, & d_t &= 0, & H_t &= 0, \end{align*} because there is no separate observation noise: $\varepsilon_t=0$ with variance $0$. For the transition equation, rewrite the AR(1) recursion one period ahead: \begin{align*} X_{t+1} &= \phi X_t+e_{t+1}. \end{align*} Since $\alpha_t=X_t$ and $\alpha_{t+1}=X_{t+1}$, this becomes \begin{align*} \alpha_{t+1} &= X_{t+1} \\ &= \phi X_t+e_{t+1} \\ &= \phi\alpha_t+e_{t+1} \\ &= \phi\alpha_t+0+1\cdot e_{t+1}. \end{align*} Thus the transition matrices and noise covariance are \begin{align*} T_t &= \phi, & c_t &= 0, & R_t &= 1, & Q_t &= \sigma^2. \end{align*} For the stationary initialization, the mean must satisfy \begin{align*} \mathbb E[X_t] &= \mathbb E[\phi X_{t-1}+e_t] \\ &= \phi\mathbb E[X_{t-1}]+\mathbb E[e_t] \\ &= \phi\mathbb E[X_{t-1}], \end{align*} so stationarity gives $m=\phi m$, hence $(1-\phi)m=0$ and therefore $m=0$ because $|\phi|<1$. The variance $v=\operatorname{Var}(X_t)$ satisfies, using independence of $X_{t-1}$ and $e_t$, \begin{align*} v &= \operatorname{Var}(X_t) \\ &= \operatorname{Var}(\phi X_{t-1}+e_t) \\ &= \phi^2\operatorname{Var}(X_{t-1})+\operatorname{Var}(e_t) \\ &= \phi^2v+\sigma^2. \end{align*} Therefore \begin{align*} (1-\phi^2)v &= \sigma^2, \\ v &= \frac{\sigma^2}{1-\phi^2}, \end{align*} where $1-\phi^2>0$ because $|\phi|<1$. Hence the stationary state initialization is \begin{align*} \alpha_1=X_1\sim \mathcal N\left(0,\frac{\sigma^2}{1-\phi^2}\right). \end{align*} This representation is the same AR(1) model written as a state space model; the zero observation-noise covariance is admissible here because uncertainty still enters through the latent state and the innovation variance $\sigma^2$. [/example] Here the observation noise is zero because the observed series is the state itself. The state space representation is not adding a new model; it is changing the computational form of the same AR(1) process. [example: ARMA One One State Representation] Let \begin{align*} X_t=\phi X_{t-1}+e_t+\theta e_{t-1}, \end{align*} where $e_t\sim\mathcal N(0,\sigma^2)$ are independent, and set the state vector to be \begin{align*} \alpha_t &= \begin{pmatrix} X_t\\ e_t \end{pmatrix}. \end{align*} We show that this gives a linear Gaussian state space representation when the transition noise at time $t$ is the next innovation $e_{t+1}$. First shift the ARMA recursion forward by one period: \begin{align*} X_{t+1} &=\phi X_t+e_{t+1}+\theta e_t \\ &=\phi X_t+\theta e_t+e_{t+1}. \end{align*} The second component of the next state is, by the definition of the state vector, \begin{align*} \text{second component of }\alpha_{t+1} &= e_{t+1}. \end{align*} Therefore \begin{align*} \alpha_{t+1} &= \begin{pmatrix} X_{t+1}\\ e_{t+1} \end{pmatrix} \\ &= \begin{pmatrix} \phi X_t+\theta e_t+e_{t+1}\\ e_{t+1} \end{pmatrix}. \end{align*} Now compute the matrix product \begin{align*} \begin{pmatrix} \phi & \theta\\ 0 & 0 \end{pmatrix} \alpha_t &= \begin{pmatrix} \phi & \theta\\ 0 & 0 \end{pmatrix} \begin{pmatrix} X_t\\ e_t \end{pmatrix} \\ &= \begin{pmatrix} \phi X_t+\theta e_t\\ 0 \end{pmatrix}, \end{align*} and \begin{align*} \begin{pmatrix} 1\\ 1 \end{pmatrix}e_{t+1} &= \begin{pmatrix} e_{t+1}\\ e_{t+1} \end{pmatrix}. \end{align*} Adding these two vectors gives \begin{align*} \begin{pmatrix} \phi & \theta\\ 0 & 0 \end{pmatrix}\alpha_t + \begin{pmatrix} 1\\ 1 \end{pmatrix}e_{t+1} &= \begin{pmatrix} \phi X_t+\theta e_t\\ 0 \end{pmatrix} + \begin{pmatrix} e_{t+1}\\ e_{t+1} \end{pmatrix} \\ &= \begin{pmatrix} \phi X_t+\theta e_t+e_{t+1}\\ e_{t+1} \end{pmatrix} \\ &=\alpha_{t+1}. \end{align*} Thus the transition equation has the state space form \begin{align*} \alpha_{t+1} &= T\alpha_t+R\eta_t, \end{align*} with \begin{align*} T&= \begin{pmatrix} \phi & \theta\\ 0 & 0 \end{pmatrix}, & R&= \begin{pmatrix} 1\\ 1 \end{pmatrix}, & \eta_t&=e_{t+1}, & Q&=\sigma^2. \end{align*} There is no deterministic drift term, so $c=0$. The observation is the first component of the state: \begin{align*} Y_t &=X_t \\ &= \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} X_t\\ e_t \end{pmatrix} \\ &= \begin{pmatrix} 1 & 0 \end{pmatrix}\alpha_t. \end{align*} Hence the observation equation has \begin{align*} Z&= \begin{pmatrix} 1 & 0 \end{pmatrix}, & d&=0, & H&=0. \end{align*} Since $e_{t+1}$ is independent of $e_t$ and of the past innovations that determine $X_t$, it is independent of $\alpha_t$. Therefore the state is Markovian once $(X_t,e_t)$ is carried forward: the next state depends on the past only through $\alpha_t$ and the new Gaussian innovation $e_{t+1}$. [/example] The ARMA(1,1) example shows the general pattern: enlarge the state until it contains enough past information for Markovian evolution. The Kalman filter can then compute the Gaussian likelihood, with exact or diffuse treatment of the initial state depending on the modelling convention. [remark: State Space Form Is Not Unique] A single time series model may have several state space representations. Different choices of state vector can give the same distribution for the observed process, but they may differ in numerical stability, state dimension, and interpretability. [/remark] Chapter 8 recast time series as noisy observations of an evolving hidden state and showed how the Kalman filter delivers recursive prediction and estimation. Chapter 9 changes the focus from time-varying hidden states to time-varying conditional variance, introducing ARCH and GARCH models for volatility clustering. # 9. Conditional Heteroskedasticity and GARCH The ARMA, ARIMA, spectral, and state space models studied so far primarily explain dependence in the conditional mean or second-order linear prediction: after fitting such a mean model, the remaining innovations are meant to have no predictable sign or level. Financial and economic series often pass that test while still displaying long stretches of high variability followed by calmer periods. This chapter models that second kind of dependence by making the conditional variance evolve over time. The central idea is to separate the innovation $X_t$ into a scale $\sigma_t$ known from the past and a new shock $Z_t$ with unit variance. Autocorrelation in $X_t$ is then a mean phenomenon, while autocorrelation in $X_t^2$ is evidence that the size of shocks is predictable even when their signs are not. ## Conditional Variance as a Dynamic Object What does it mean for a time series to have no predictable level but still have predictable risk? The right condition is not independence, but a martingale difference property together with a time-varying conditional second moment. [definition: Martingale Difference Innovation] Let $(\mathcal F_t)_{t \in \mathbb Z}$ be a filtration. A real-valued process $(X_t)_{t \in \mathbb Z}$ is a martingale difference innovation sequence with respect to $(\mathcal F_t)$ if $X_t$ is $\mathcal F_t$-measurable, $\mathbb E[|X_t|] < \infty$, and \begin{align*} \mathbb E[X_t \mid \mathcal F_{t-1}] = 0 \end{align*} for every $t \in \mathbb Z$. [/definition] For such a sequence the best one-step conditional mean forecast is zero, but the conditional variance may still depend on $\mathcal F_{t-1}$. This is the opening for ARCH and GARCH models. To model predictable risk, we therefore track the variance that is known conditionally on the past rather than the unconditional variance alone. This process becomes the state variable for volatility models even when the conditional mean remains zero. [definition: Conditional Variance Process] Let $(X_t)_{t \in \mathbb Z}$ be a martingale difference innovation sequence with $\mathbb E[X_t^2] < \infty$. Its conditional variance process is \begin{align*} \sigma_t^2 = \operatorname{Var}(X_t \mid \mathcal F_{t-1}) = \mathbb E[X_t^2 \mid \mathcal F_{t-1}]. \end{align*} [/definition] The equality with the conditional second moment uses the martingale difference condition. In applications, $X_t$ is often the residual after removing a conditional mean model, rather than the original observed series. The empirical pattern that motivates dynamic variance models is persistence in the size of shocks. We name this pattern separately because it can occur even when the shocks themselves have no linear autocorrelation. [definition: Volatility Clustering] A martingale difference process $(X_t)$ exhibits volatility clustering if the conditional variance process $(\sigma_t^2)$ is serially dependent and persistent, so that large values of $X_t^2$ tend to be followed by periods with elevated conditional variance. [/definition] Volatility clustering is a statement about dependence in magnitudes, not signs. A process can have $\operatorname{Cov}(X_t, X_{t-h})=0$ for all $h \ne 0$ while $\operatorname{Cov}(X_t^2, X_{t-h}^2)$ is positive for some lags. [example: Uncorrelated Innovations With Dependent Squares] Let $X_t=\sigma_t Z_t$, where $Z_t$ is independent of $\mathcal F_{t-1}$, $\mathbb E[Z_t]=0$, $\mathbb E[Z_t^2]=1$, and $\sigma_t$ is $\mathcal F_{t-1}$-measurable. Since $\sigma_t$ is known at time $t-1$ and $Z_t$ is the new shock, \begin{align*} \mathbb E[X_t\mid \mathcal F_{t-1}] &=\mathbb E[\sigma_t Z_t\mid \mathcal F_{t-1}]\\ &=\sigma_t\mathbb E[Z_t\mid \mathcal F_{t-1}]\\ &=\sigma_t\mathbb E[Z_t]\\ &=\sigma_t\cdot 0\\ &=0. \end{align*} Thus $(X_t)$ has no predictable signed level. For $h\ge 1$, the lagged variable $X_{t-h}$ is $\mathcal F_{t-1}$-measurable, so \begin{align*} \mathbb E[X_tX_{t-h}] &=\mathbb E\!\left[\mathbb E[X_tX_{t-h}\mid \mathcal F_{t-1}]\right]\\ &=\mathbb E\!\left[X_{t-h}\mathbb E[X_t\mid \mathcal F_{t-1}]\right]\\ &=\mathbb E[X_{t-h}\cdot 0]\\ &=0. \end{align*} So the innovations are uncorrelated across time, even though their magnitudes may not be. Indeed, \begin{align*} \mathbb E[X_t^2\mid \mathcal F_{t-1}] &=\mathbb E[\sigma_t^2 Z_t^2\mid \mathcal F_{t-1}]\\ &=\sigma_t^2\mathbb E[Z_t^2\mid \mathcal F_{t-1}]\\ &=\sigma_t^2\mathbb E[Z_t^2]\\ &=\sigma_t^2. \end{align*} If $\sigma_t^2$ is persistent as a function of the past, for example if large past squared shocks make future $\sigma_t^2$ large, then $X_t^2=\sigma_t^2Z_t^2$ inherits predictable variation in its conditional mean. The signs remain unpredictable, but the sizes of future shocks can still be forecast from the past. [/example] ## ARCH and GARCH Models How can conditional variance depend on the past without making the conditional mean depend on the past? ARCH models use past squared innovations as regressors for the current variance; GARCH models add lagged variances, giving a more parsimonious description of persistent volatility. [definition: ARCH Model] Let $(Z_t)_{t \in \mathbb Z}$ be i.i.d. with $\mathbb E[Z_t]=0$ and $\mathbb E[Z_t^2]=1$. An ARCH$(q)$ process is a process $(X_t)$ satisfying \begin{align*} X_t &= \sigma_t Z_t,\\ \sigma_t^2 &= \omega + \sum_{i=1}^{q} \alpha_i X_{t-i}^2, \end{align*} where $\omega>0$ and $\alpha_i \ge 0$ for $i=1,\dots,q$. [/definition] The nonnegativity constraints make the conditional variance nonnegative for every history. Large squared shocks feed directly into future volatility, so ARCH models capture short-run bursts of variability. [example: ARCH(1) Unconditional Variance] Suppose $X_t=\sigma_t Z_t$ and $\sigma_t^2=\omega+\alpha X_{t-1}^2$, where $\omega>0$, $0\le \alpha<1$, $\mathbb E[Z_t]=0$, $\mathbb E[Z_t^2]=1$, and $Z_t$ is independent of $\mathcal F_{t-1}$. Since $\sigma_t$ is $\mathcal F_{t-1}$-measurable, \begin{align*} \mathbb E[X_t\mid \mathcal F_{t-1}] &=\mathbb E[\sigma_t Z_t\mid \mathcal F_{t-1}]\\ &=\sigma_t\mathbb E[Z_t\mid \mathcal F_{t-1}]\\ &=\sigma_t\mathbb E[Z_t]\\ &=0. \end{align*} Thus $\mathbb E[X_t]=0$, so $\operatorname{Var}(X_t)=\mathbb E[X_t^2]$ whenever the second moment is finite. Now assume a second-order stationary solution exists, and write $m=\mathbb E[X_t^2]$. From $X_t^2=\sigma_t^2Z_t^2$, \begin{align*} \mathbb E[X_t^2\mid \mathcal F_{t-1}] &=\mathbb E[\sigma_t^2Z_t^2\mid \mathcal F_{t-1}]\\ &=\sigma_t^2\mathbb E[Z_t^2\mid \mathcal F_{t-1}]\\ &=\sigma_t^2\mathbb E[Z_t^2]\\ &=\sigma_t^2. \end{align*} Taking expectations and using the ARCH$(1)$ recursion gives \begin{align*} \mathbb E[X_t^2] &=\mathbb E[\sigma_t^2]\\ &=\mathbb E[\omega+\alpha X_{t-1}^2]\\ &=\omega+\alpha\mathbb E[X_{t-1}^2]. \end{align*} Second-order stationarity gives $\mathbb E[X_{t-1}^2]=\mathbb E[X_t^2]=m$, so \begin{align*} m&=\omega+\alpha m,\\ m-\alpha m&=\omega,\\ (1-\alpha)m&=\omega,\\ m&=\frac{\omega}{1-\alpha}. \end{align*} Therefore \begin{align*} \operatorname{Var}(X_t)=\mathbb E[X_t^2]=\frac{\omega}{1-\alpha}. \end{align*} The parameter $\alpha$ controls how much past squared shocks inflate future variance, and the condition $\alpha<1$ is exactly what makes the unconditional second moment finite in this ARCH$(1)$ calculation. [/example] ARCH$(q)$ can need many lags to reproduce slow volatility decay. GARCH solves this by allowing the conditional variance itself to be autoregressive. [definition: GARCH Model] Let $(Z_t)_{t \in \mathbb Z}$ be i.i.d. with $\mathbb E[Z_t]=0$ and $\mathbb E[Z_t^2]=1$. A GARCH$(p,q)$ process is a process $(X_t)$ satisfying \begin{align*} X_t &= \sigma_t Z_t,\\ \sigma_t^2 &= \omega + \sum_{i=1}^{q} \alpha_i X_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2, \end{align*} where $\omega>0$, $\alpha_i \ge 0$, and $\beta_j \ge 0$. [/definition] The most common case is GARCH$(1,1)$: \begin{align*} X_t &= \sigma_t Z_t,\\ \sigma_t^2 &= \omega + \alpha X_{t-1}^2 + \beta \sigma_{t-1}^2. \end{align*} Here $\alpha$ measures immediate reaction to new squared shocks, while $\beta$ measures persistence in the latent variance recursion. [remark: Positivity and Interpretation] The condition $\omega>0$ prevents the variance recursion from collapsing to zero after a run of small shocks. The restrictions $\alpha_i,\beta_j\ge 0$ are sufficient for positivity, although more general parameterisations can impose positivity by other means. [/remark] ## Stationarity and Moment Conditions for GARCH(1,1) When does the recursive variance equation define a stable process rather than a variance that drifts upward indefinitely? For GARCH$(1,1)$, strict stationarity is governed by a random-coefficient product, while covariance stationarity is governed by the simpler mean coefficient $\alpha+\beta$. Substituting $X_{t-1}=\sigma_{t-1}Z_{t-1}$ gives the stochastic recurrence \begin{align*} \sigma_t^2 = \omega + (\alpha Z_{t-1}^2+\beta)\sigma_{t-1}^2. \end{align*} The coefficient is random because the previous innovation changes how strongly past volatility is transmitted. [quotetheorem:3659] [citeproof:3659] This theorem is about strict stationarity, so it controls the full distribution under time shifts. The logarithmic condition says that the random multiplicative coefficient contracts on average in the long run, even though individual shocks can temporarily amplify volatility. It is therefore a stability condition for the recurrence as a stochastic process, not a moment calculation. The distinction is important because a strictly stationary volatility recursion may still have very heavy tails. To know whether the usual variance and autocovariance calculations are available, we need a separate condition controlling the first moment of $\sigma_t^2$. [quotetheorem:3660] [citeproof:3660] The distinction between the two conditions matters. Since $\mathbb E[\log(\alpha Z_0^2+\beta)] < \log(\alpha+\beta)$ by Jensen's inequality when $\alpha>0$, strict stationarity may hold even when the second moment is infinite. [example: Strictly Stationary but Infinite Variance Regime] Take $\omega>0$, $\alpha=1$, $\beta=0$, and let $(Z_t)$ be i.i.d. standard normal, so $\mathbb E[Z_t]=0$ and $\mathbb E[Z_t^2]=1$. Then the GARCH$(1,1)$ variance recursion becomes \begin{align*} \sigma_t^2 &=\omega+X_{t-1}^2\\ &=\omega+\sigma_{t-1}^2Z_{t-1}^2\\ &=\omega+(1\cdot Z_{t-1}^2+0)\sigma_{t-1}^2. \end{align*} Here $\alpha+\beta=1+0=1$, so the covariance-stationarity mean condition $\alpha+\beta<1$ fails. The strict-stationarity condition is nevertheless satisfied. Since $Z_0^2\sim \chi_1^2$, equivalently $Z_0^2\sim \operatorname{Gamma}(1/2,2)$, the gamma log-moment formula gives \begin{align*} \mathbb E[\log Z_0^2] &=\psi(1/2)+\log 2\\ &=(-\gamma-2\log 2)+\log 2\\ &=-\gamma-\log 2\\ &<0, \end{align*} where $\gamma>0$ is Euler's constant. Therefore \begin{align*} \mathbb E[\log(\alpha Z_0^2+\beta)] &=\mathbb E[\log(1\cdot Z_0^2+0)]\\ &=\mathbb E[\log Z_0^2]\\ &<0. \end{align*} By *Strict Stationarity Condition for GARCH(1,1)*, the recursion has a unique strictly stationary causal solution. Its unconditional second moment cannot be finite. If $m=\mathbb E[\sigma_t^2]<\infty$, then stationarity and independence of $Z_{t-1}$ from $\mathcal F_{t-2}$ give \begin{align*} m &=\mathbb E[\sigma_t^2]\\ &=\mathbb E[\omega+\sigma_{t-1}^2Z_{t-1}^2]\\ &=\omega+\mathbb E[\sigma_{t-1}^2Z_{t-1}^2]\\ &=\omega+\mathbb E[\sigma_{t-1}^2]\mathbb E[Z_{t-1}^2]\\ &=\omega+m\cdot 1\\ &=\omega+m. \end{align*} Subtracting $m$ from both sides gives $0=\omega$, contradicting $\omega>0$. Hence $\mathbb E[\sigma_t^2]=\infty$, and since \begin{align*} \mathbb E[X_t^2\mid \mathcal F_{t-1}] &=\mathbb E[\sigma_t^2Z_t^2\mid \mathcal F_{t-1}]\\ &=\sigma_t^2\mathbb E[Z_t^2]\\ &=\sigma_t^2, \end{align*} we also have $\mathbb E[X_t^2]=\infty$. This example separates strict stationarity from finite variance: the process is stable in distribution under time shifts, but its unconditional variance is infinite. [/example] ## Quasi-Maximum Likelihood Estimation How should we estimate a GARCH model when the conditional variance recursion is plausible but the exact distribution of $Z_t$ is uncertain? Quasi-maximum likelihood uses the Gaussian likelihood as a fitting criterion even when the innovations are not Gaussian. For observed residuals $x_1,\dots,x_n$, the Gaussian quasi-log-likelihood for parameter $\theta$ has the form \begin{align*} \ell_n(\theta) = -\frac{1}{2}\sum_{t=1}^{n}\left(\log h_t(\theta)+\frac{x_t^2}{h_t(\theta)}\right), \end{align*} where $h_t(\theta)$ is the conditional variance generated recursively from the model. Constants not depending on $\theta$ are omitted. [definition: Gaussian Quasi-Maximum Likelihood Estimator] Let $\Theta$ be a parameter space for a conditional heteroskedastic model and let $h_t(\theta)$ be the recursively computed conditional variance. A Gaussian quasi-maximum likelihood estimator is any measurable maximiser \begin{align*} \hat\theta_n \in \operatorname*{arg\,max}_{\theta\in\Theta} \ell_n(\theta). \end{align*} [/definition] The word quasi signals that the Gaussian density is used for estimation, not asserted as the true conditional distribution. The resulting estimator targets the variance recursion through conditional second moments. The next issue is whether this misspecified Gaussian objective still identifies the volatility parameters. The point of quasi-likelihood theory is that the variance recursion can be consistently estimated under moment and stability assumptions even when the innovations are not actually Gaussian. [quotetheorem:3661] [citeproof:3661] In practice, standard errors require additional assumptions, especially on fourth moments. The consistency theorem explains why Gaussian QMLE is widely used even for heavy-tailed return innovations. [remark: Mean Model First] GARCH estimation is usually applied to residuals from a mean model such as ARMA. If the conditional mean is misspecified, apparent volatility dependence can be contaminated by unmodelled level dependence. [/remark] ## Forecasting Volatility Once a GARCH model is fitted, what exactly is being forecast? The forecast is the future conditional variance, not the future shock itself. For GARCH$(1,1)$, the forecast recursion has the same autoregressive persistence coefficient $\alpha+\beta$ that appears in the unconditional variance formula. Let $\mathcal F_t$ be the information available at time $t$. For GARCH$(1,1)$ with $\alpha+\beta<1$, define the long-run variance \begin{align*} \bar\sigma^2 = \frac{\omega}{1-\alpha-\beta}. \end{align*} The one-step forecast is known after observing $X_t$ and $\sigma_t^2$: \begin{align*} \mathbb E[X_{t+1}^2\mid\mathcal F_t] = \mathbb E[\sigma_{t+1}^2\mid\mathcal F_t] = \omega+\alpha X_t^2+\beta\sigma_t^2. \end{align*} For horizons $h\ge 2$, repeated conditioning gives \begin{align*} \mathbb E[\sigma_{t+h}^2\mid\mathcal F_t]-\bar\sigma^2 = (\alpha+\beta)^{h-1}(\sigma_{t+1}^2-\bar\sigma^2). \end{align*} [example: GARCH(1,1) Volatility Forecast] Let $\omega=0.02$, $\alpha=0.08$, $\beta=0.90$, and suppose that after observing time $t$ the next conditional variance is \begin{align*} \sigma_{t+1}^2=0.50. \end{align*} Since \begin{align*} \alpha+\beta &=0.08+0.90\\ &=0.98, \end{align*} the long-run variance is \begin{align*} \bar\sigma^2 &=\frac{\omega}{1-\alpha-\beta}\\ &=\frac{0.02}{1-0.08-0.90}\\ &=\frac{0.02}{1-0.98}\\ &=\frac{0.02}{0.02}\\ &=1. \end{align*} For $h\ge 1$, the GARCH$(1,1)$ variance forecast recursion around the long-run mean is \begin{align*} \mathbb E[\sigma_{t+h}^2\mid \mathcal F_t]-\bar\sigma^2 = (\alpha+\beta)^{h-1}(\sigma_{t+1}^2-\bar\sigma^2). \end{align*} Substituting the parameter values and the current update gives \begin{align*} \mathbb E[\sigma_{t+h}^2\mid \mathcal F_t]-1 &=0.98^{h-1}(0.50-1). \end{align*} Adding $1$ to both sides yields \begin{align*} \mathbb E[\sigma_{t+h}^2\mid \mathcal F_t] &=1+0.98^{h-1}(0.50-1). \end{align*} Equivalently, \begin{align*} \mathbb E[\sigma_{t+h}^2\mid \mathcal F_t] &=1-0.50\cdot 0.98^{h-1}. \end{align*} The forecast starts below the long-run variance because $\sigma_{t+1}^2=0.50<1$, and it moves back toward $1$ at the persistence rate $0.98$ per horizon step. [/example] The formula shows mean reversion in variance forecasts. When $\alpha+\beta$ is close to one, shocks to volatility decay slowly, producing long periods of elevated predicted risk. ## Diagnosing Conditional Heteroskedasticity How do we distinguish dependence in levels from dependence in volatility? The diagnostic sequence is: model the conditional mean first, inspect residual autocorrelation, and then inspect squared residual autocorrelation. Suppose an ARMA model has been fitted and produces residuals $\hat X_t$. If the mean model is adequate, the sample autocorrelations of $\hat X_t$ should be small. Conditional heteroskedasticity is suggested when the sample autocorrelations of $\hat X_t^2$ remain substantial. [definition: Squared Residual Diagnostic] Given residuals $\hat X_1,\dots,\hat X_n$ from a fitted conditional mean model, the squared residual diagnostic examines the sample autocorrelation function of $(\hat X_t^2)$ and tests whether these autocorrelations are jointly compatible with white noise in squares. [/definition] This diagnostic asks whether the second conditional moment still contains predictable structure. It should be interpreted after checking the residuals themselves, because level autocorrelation can create spurious dependence in squared residuals. [example: Squared Residual Diagnostic After ARMA Fitting] Fit an ARMA model to the returns $Y_t$ and write the fitted residuals as $\hat X_t$. The first diagnostic is applied to the residuals themselves: for each lag $h\ge 1$, the sample autocorrelation \begin{align*} \hat\rho_{\hat X}(h) &= \frac{\sum_{t=h+1}^{n}(\hat X_t-\bar X)(\hat X_{t-h}-\bar X)} {\sum_{t=1}^{n}(\hat X_t-\bar X)^2} \end{align*} estimates remaining linear dependence in the conditional mean, where \begin{align*} \bar X=\frac{1}{n}\sum_{t=1}^{n}\hat X_t. \end{align*} If the plot of $\hat\rho_{\hat X}(h)$ has no significant spikes, then the fitted ARMA model has not left visible autocorrelation in the signed residuals. The second diagnostic applies the same calculation to squared residuals. Let \begin{align*} \bar Q=\frac{1}{n}\sum_{t=1}^{n}\hat X_t^2. \end{align*} For each lag $h\ge 1$, compute \begin{align*} \hat\rho_{\hat X^2}(h) &= \frac{\sum_{t=h+1}^{n}(\hat X_t^2-\bar Q)(\hat X_{t-h}^2-\bar Q)} {\sum_{t=1}^{n}(\hat X_t^2-\bar Q)^2}. \end{align*} Positive spikes in this second plot mean that large values of $\hat X_{t-h}^2$ tend to be followed by large values of $\hat X_t^2$, so the magnitude of the next shock is predictable even when its sign is not. In that situation, a GARCH model is a natural next step because it models $\operatorname{Var}(X_t\mid\mathcal F_{t-1})$ using past squared shocks and past conditional variances. [/example] This example separates two ideas that are often conflated: unpredictability of signed returns and predictability of risk. The distinction is important enough to record before turning the pieces into a full modeling workflow. [remark: No Arbitrage Interpretation] In financial return applications, martingale difference behaviour of returns is compatible with predictable volatility. Predictable variance changes risk and option prices, but it does not by itself give a predictable signed return. [/remark] Chapter 9 extended the theory of dependence from conditional means to conditional variances, explaining how volatility can evolve even when the mean is well modeled. Chapter 10 brings the full toolkit together in a practical workflow, moving from initial exploration to fitted models, diagnostics, and forecasting in real case studies. # 10. Integrated Workflow and Case Studies ## From a Time Plot to an ARIMA Forecast A time series analysis begins with a concrete tension: the observed data are finite, often nonstationary, and measured on a scale chosen by the application, while the ARMA theory developed earlier assumes a weakly stationary process with stable second-order structure. The Box-Jenkins workflow, foreshadowed by the ARIMA discussion in Chapter 4 and the diagnostics in Chapters 3 and 9, is a disciplined response to that tension. It gives a reproducible sequence of decisions whose failures can be diagnosed. [explanation: Box-Jenkins Modelling Cycle] The classical Box-Jenkins cycle has six recurring stages. First, transform the observed series when the variance or seasonal amplitude changes with the level. Logarithms are common for positive economic or demographic series because multiplicative changes become additive changes. Second, difference the transformed series until its mean behaviour is compatible with stationarity. Ordinary differencing removes stochastic trends of random-walk type, while seasonal differencing removes repeated seasonal levels. Third, identify tentative ARMA orders for the differenced series using the sample autocorrelation function, the sample partial autocorrelation function, information criteria, and subject-matter constraints. Fourth, estimate the model parameters, usually by maximum likelihood or conditional least squares. Fifth, diagnose the fitted residuals. The residuals should resemble an innovation sequence: uncorrelated, centred, and with any remaining volatility structure explicitly modelled. Sixth, forecast on the original scale when needed, carrying the uncertainty introduced by transformation and differencing through the forecast calculation. [/explanation] Differencing creates a bookkeeping problem for the final model: the stationary equation is fitted to transformed increments, but forecasts and interpretation are usually required on the original level scale. The risk is to fit an ARMA equation to a differenced series while losing track of the original forecasting target. ARIMA notation packages the differencing order and the short-run ARMA dynamics into one formal model family. [definition: ARIMA Model] Let $(X_t)_{t \in \mathbb Z}$ be a real-valued time series. For integers $p,d,q \ge 0$, $(X_t)$ follows an $\operatorname{ARIMA}(p,d,q)$ model if $Y_t := (1-B)^d X_t$ is a causal $\operatorname{ARMA}(p,q)$ process satisfying \begin{align*} \phi(B)Y_t = \theta(B)Z_t, \end{align*} where $B$ is the backshift operator, $\phi(z)=1-\phi_1z-\cdots-\phi_pz^p$, $\theta(z)=1+\theta_1z+\cdots+\theta_qz^q$, and $(Z_t)$ is white noise with mean $0$ and variance $\sigma^2$. [/definition] The definition separates two tasks. Differencing supplies a stationary series $Y_t$, while ARMA modelling describes the short-run dependence left after that operation. [example: Monthly Airline Passengers] Let $X_t$ denote the passenger total in month $t$, and set $W_t=\log X_t$. The log transformation is used because a multiplicative seasonal factor becomes additive on the log scale: if $X_t=L_tS_tE_t$ with positive level $L_t$, seasonal factor $S_t$, and error factor $E_t$, then \begin{align*} W_t=\log X_t=\log L_t+\log S_t+\log E_t. \end{align*} Thus changes in seasonal amplitude that grow with the level of $X_t$ become closer to stable additive seasonal effects in $W_t$. We compute the differenced series used for the mean model. Since $BW_t=W_{t-1}$ and $B^{12}W_t=W_{t-12}$, \begin{align*} (1-B^{12})W_t &=W_t-B^{12}W_t \\ &=W_t-W_{t-12}. \end{align*} Applying the ordinary difference to this seasonal difference gives \begin{align*} Y_t &=(1-B)(1-B^{12})W_t \\ &=(1-B)(W_t-W_{t-12}) \\ &=(W_t-W_{t-12})-B(W_t-W_{t-12}) \\ &=(W_t-W_{t-12})-(W_{t-1}-W_{t-13}) \\ &=W_t-W_{t-1}-W_{t-12}+W_{t-13}. \end{align*} Equivalently, \begin{align*} Y_t &=(W_t-W_{t-1})-(W_{t-12}-W_{t-13}), \end{align*} so $Y_t$ compares the current monthly log growth with the log growth for the same month one year earlier. This construction removes a slowly changing log level through $W_t-W_{t-1}$ and removes a repeated annual seasonal level through $W_t-W_{t-12}$. If the sample autocorrelation of $Y_t$ then has prominent short-run dependence at lag $1$ and seasonal dependence at lag $12$, the usual airline specification is a low-order seasonal moving-average model, with moving-average effects at those two lags. [/example] After a fitted ARIMA model has been selected, the next task is prediction rather than description. Forecasts are computed on the stationary differenced scale and then translated back to the original series, so the first theorem records the basic forecast recursion before interval uncertainty is added. [quotetheorem:3662] [citeproof:3662] For an ARIMA model, the forecast of $Y_t=(1-B)^dX_t$ must be integrated back to the scale of $X_t$. For example, if $d=1$, then $X_{t+h}=X_t+Y_{t+1}+\cdots+Y_{t+h}$, so forecasts of future differences add to the last observed level. This distinction matters operationally: a good forecast for the stationary increment is not yet a forecast on the scale reported to the user. Point forecasts alone do not indicate how much uncertainty accumulates with horizon. The obstruction is that forecast errors depend on all future innovations not yet observed, so their variance grows according to the moving-average coefficients. The mean-square forecast error formula is the quantity used to build approximate prediction intervals for ARMA and ARIMA forecasts. [quotetheorem:3663] [citeproof:3663] This result is the computational core of ARMA prediction intervals. It also explains why long-horizon forecast uncertainty for a stationary ARMA process approaches the unconditional variance, while integrated models accumulate uncertainty with horizon. The formula is conditional on the fitted model, so the next diagnostic question is whether the residuals actually look like the innovations that the model assumes. ## Residual Diagnostics and Portmanteau Tests After a model has been estimated, the central question becomes whether the residuals still contain serial dependence that the fitted model failed to explain. Residual plots catch changes in scale, outliers, and structural breaks; residual autocorrelations test whether linear dependence remains. A fitted model should be treated as provisional until this check has been made. [definition: Residual Autocorrelation] Let $\hat Z_1,\dots,\hat Z_n$ be residuals from a fitted time series model, and let $\bar Z=n^{-1}\sum_{t=1}^n\hat Z_t$. The residual autocorrelation at lag $k$ is \begin{align*} \hat r_k = \frac{\sum_{t=k+1}^{n}(\hat Z_t-\bar Z)(\hat Z_{t-k}-\bar Z)}{\sum_{t=1}^{n}(\hat Z_t-\bar Z)^2}. \end{align*} [/definition] Residual autocorrelations near zero do not prove the model is correct. They only indicate that this diagnostic has not found remaining linear dependence at the lags examined. [quotetheorem:3664] The Box-Pierce statistic aggregates several small residual autocorrelations into one test. A significant value suggests that the fitted mean model has left serial dependence in the residuals. The Box-Pierce statistic is asymptotic and can be rough in moderate samples because it treats each residual autocorrelation with the same large-sample scaling. A finite-sample correction adjusts the weighting of the autocorrelations while testing the same practical obstruction: whether residual dependence remains after fitting the model. [quotetheorem:3665] The Ljung-Box version is a finite-sample correction of the same idea. In applied reports it is useful to state the maximum lag $m$, the fitted mean-model order, and whether the test is being applied to residuals or squared residuals. [example: Diagnostic Failure After Underdifferencing] Suppose the data satisfy the random-walk recursion $X_t=X_{t-1}+Z_t$, where $(Z_t)$ is white noise with mean $0$ and variance $\sigma^2$, but the analyst fits the stationary AR(1) equation $X_t=\phi X_{t-1}+e_t$ directly to the levels. The least-squares autoregressive coefficient, ignoring an intercept for clarity, is \begin{align*} \hat\phi &=\frac{\sum_{t=1}^{n}X_tX_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2}. \end{align*} Using $X_t=X_{t-1}+Z_t$ in the numerator gives \begin{align*} \sum_{t=1}^{n}X_tX_{t-1} &=\sum_{t=1}^{n}(X_{t-1}+Z_t)X_{t-1} \\ &=\sum_{t=1}^{n}X_{t-1}^2+\sum_{t=1}^{n}Z_tX_{t-1}, \end{align*} and therefore \begin{align*} \hat\phi &=\frac{\sum_{t=1}^{n}X_{t-1}^2+\sum_{t=1}^{n}Z_tX_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2} \\ &=1+\frac{\sum_{t=1}^{n}Z_tX_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2}. \end{align*} The fitted AR(1) coefficient is therefore driven toward $1$: the model can only imitate the random-walk identity $X_t=X_{t-1}+Z_t$ by placing its autoregressive coefficient near the boundary of stationarity. If a stationary coefficient $\phi<1$ is fitted, the residual is \begin{align*} e_t &=X_t-\phi X_{t-1} \\ &=(X_{t-1}+Z_t)-\phi X_{t-1} \\ &=Z_t+(1-\phi)X_{t-1}. \end{align*} Thus the residual is not just the innovation $Z_t$; it also contains the undifferenced level $X_{t-1}$. For lag $k\ge1$, writing $a=1-\phi$ and taking $X_0$ fixed, \begin{align*} \operatorname{Cov}(e_t,e_{t-k}) &=\operatorname{Cov}(Z_t+aX_{t-1},\,Z_{t-k}+aX_{t-k-1}) \\ &=\operatorname{Cov}(Z_t,Z_{t-k}) +a\operatorname{Cov}(Z_t,X_{t-k-1}) \\ &\quad +a\operatorname{Cov}(X_{t-1},Z_{t-k}) +a^2\operatorname{Cov}(X_{t-1},X_{t-k-1}) \\ &=0+0+a\sigma^2+a^2\operatorname{Var}(X_{t-k-1}). \end{align*} The last expression is generally nonzero, so the residuals inherit serial dependence from the undifferenced random-walk level. Differencing removes the source of the failure: \begin{align*} (1-B)X_t &=X_t-X_{t-1} \\ &=Z_t. \end{align*} The diagnostic message is that the level series should not have been treated as a stationary AR(1) process; the ARMA mean model should be fitted after differencing. [/example] ## Combining ARIMA Means with GARCH Volatility A second diagnostic question is whether the residuals are uncorrelated but not independent. Financial returns often have little linear autocorrelation in the returns themselves, while the squared returns are strongly dependent. In that setting an ARMA or ARIMA mean model is not wrong because it misses volatility clustering; it is incomplete until paired with a conditional variance model. [definition: ARMA-GARCH Model] A real-valued process $(X_t)$ follows an ARMA-GARCH specification if \begin{align*} X_t &= \mu_t + \varepsilon_t,\\ \mu_t &= \sum_{i=1}^{p}\phi_iX_{t-i}+\sum_{j=1}^{q}\theta_j\varepsilon_{t-j},\\ \varepsilon_t &= \sigma_t\eta_t,\\ \sigma_t^2 &= \omega+\sum_{i=1}^{r}\alpha_i\varepsilon_{t-i}^2+\sum_{j=1}^{s}\beta_j\sigma_{t-j}^2, \end{align*} where $(\eta_t)$ is i.i.d. with mean $0$ and variance $1$, $\omega>0$, and $\alpha_i,\beta_j\ge 0$. [/definition] Here the conditional mean $\mu_t$ and conditional variance $\sigma_t^2$ answer different questions. The mean equation predicts the direction or level; the GARCH equation predicts the size of the next innovation. [quotetheorem:3666] [citeproof:3666] [example: Daily Financial Returns With ARMA-GARCH] Let $X_t$ be daily log returns of a liquid asset, and suppose a preliminary plot shows bursts of large positive and negative returns while the sample autocorrelation of $X_t$ is small at most lags. An ARMA(1,1)-GARCH(1,1) specification separates these two features by writing \begin{align*} X_t&=\mu_t+\varepsilon_t,\\ \mu_t&=\phi X_{t-1}+\theta\varepsilon_{t-1},\\ \varepsilon_t&=\sigma_t\eta_t,\\ \sigma_t^2&=\omega+\alpha\varepsilon_{t-1}^2+\beta\sigma_{t-1}^2, \end{align*} where $\mathbb E[\eta_t]=0$, $\mathbb E[\eta_t^2]=1$, and $\eta_t$ is independent of past information $\mathcal F_{t-1}$. The ARMA equation handles linear dependence in the conditional mean: \begin{align*} \mathbb E[X_t\mid \mathcal F_{t-1}] &=\mathbb E[\phi X_{t-1}+\theta\varepsilon_{t-1}+\varepsilon_t\mid \mathcal F_{t-1}]\\ &=\phi X_{t-1}+\theta\varepsilon_{t-1}+\mathbb E[\sigma_t\eta_t\mid \mathcal F_{t-1}]\\ &=\phi X_{t-1}+\theta\varepsilon_{t-1}+\sigma_t\mathbb E[\eta_t]\\ &=\phi X_{t-1}+\theta\varepsilon_{t-1}. \end{align*} Thus, after fitting the mean equation, the residual $\hat\varepsilon_t=X_t-\hat\mu_t$ should have little remaining autocorrelation if the ARMA part has captured the short-run linear dependence. The GARCH equation explains why the residuals can still have dependent magnitudes. Since $\varepsilon_t=\sigma_t\eta_t$, \begin{align*} \mathbb E[\varepsilon_t^2\mid \mathcal F_{t-1}] &=\mathbb E[\sigma_t^2\eta_t^2\mid \mathcal F_{t-1}]\\ &=\sigma_t^2\mathbb E[\eta_t^2]\\ &=\sigma_t^2, \end{align*} and therefore \begin{align*} \mathbb E[\varepsilon_{t+1}^2\mid \mathcal F_t] &=\sigma_{t+1}^2\\ &=\omega+\alpha\varepsilon_t^2+\beta\sigma_t^2. \end{align*} If $|\varepsilon_t|$ is large, then $\varepsilon_t^2$ is large, and the term $\alpha\varepsilon_t^2$ raises the next conditional variance. One step later, \begin{align*} \mathbb E[\sigma_{t+2}^2\mid \mathcal F_t] &=\mathbb E[\omega+\alpha\varepsilon_{t+1}^2+\beta\sigma_{t+1}^2\mid \mathcal F_t]\\ &=\omega+\alpha\mathbb E[\varepsilon_{t+1}^2\mid \mathcal F_t]+\beta\mathbb E[\sigma_{t+1}^2\mid \mathcal F_t]\\ &=\omega+\alpha\sigma_{t+1}^2+\beta\sigma_{t+1}^2\\ &=\omega+(\alpha+\beta)\sigma_{t+1}^2, \end{align*} so when $\alpha+\beta$ is close to $1$, a volatility shock decays slowly. The fitted diagnostics should therefore be split by purpose: autocorrelations of $\hat\varepsilon_t$ check the ARMA mean equation, autocorrelations of $\hat\eta_t^2=(\hat\varepsilon_t/\hat\sigma_t)^2$ check whether volatility dependence remains, and the empirical distribution of $\hat\eta_t$ checks the innovation law used for forecast intervals. The model can produce nearly uncorrelated returns because the residuals have conditional mean zero, while still producing volatility clustering because the conditional variance depends on past squared residuals. [/example] ## Forecast Intervals and Model Uncertainty Point forecasts answer only one part of the forecasting problem. A useful forecast also reports the distributional assumptions, the estimated scale of future errors, and the uncertainty introduced by model choice. This is where the distinction between mathematical prediction under a model and statistical forecasting from data becomes most important. [definition: Gaussian Prediction Interval] Let $\hat X_t(h)$ be an $h$-step forecast of $X_{t+h}$ and let $s_t^2(h)$ be the model-based forecast error variance. Under a Gaussian innovation approximation, an approximate $100(1-\alpha)\%$ prediction interval is \begin{align*} \hat X_t(h) \pm z_{1-\alpha/2}s_t(h), \end{align*} where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of $\mathcal N(0,1)$. [/definition] The interval is conditional on the fitted model and its estimated parameters. The remaining obstruction is calibration: even a mathematically valid interval formula can perform poorly if the model was selected on the same data or if the dependence structure changes over time. Before validation, the asymptotic coverage theorem records what the interval is designed to approximate under the fitted model assumptions. [quotetheorem:3667] [citeproof:3667] In practice, $\sigma^2$ and the coefficients $\psi_j$ are estimated, so the interval is approximate. If the forecast has been built on a transformed scale, the back-transformed interval may be asymmetric, and the median forecast on the original scale may differ from the mean forecast. The theorem therefore gives a model-based target, not a substitute for empirical checking of forecast performance over held-out origins. The practical question is whether the entire modeling procedure would have produced accurate forecasts when used in real time. That requires repeatedly fitting or updating the model using only past data and then scoring the forecasts against observations that were not available at the validation origin. [definition: Rolling Forecast Validation] Rolling forecast validation is a procedure in which a model is repeatedly refitted or updated on observations up to time $t$, used to forecast observations after $t$, and scored by comparing forecasts with realised values over a sequence of validation origins. [/definition] Rolling validation asks whether the modelling procedure would have worked when applied in real time. It is especially valuable for comparing several plausible ARIMA or ARMA-GARCH specifications. [example: Rolling Validation of Two ARIMA Specifications] Suppose the observed monthly demands are $D_1,\dots,D_n$, with $D_t>0$, and let $W_t=\log D_t$. Compare two forecasting procedures: model $A$, an $\operatorname{ARIMA}(0,1,1)$ fitted to $W_t$, and model $S$, a seasonal ARIMA model fitted to $W_t$. Choose validation origins \begin{align*} \mathcal T_h=\{T_0,T_0+1,\dots,n-h\}, \end{align*} so that at each origin $t\in\mathcal T_h$ the observations $D_1,\dots,D_t$ are available and the realised value $D_{t+h}$ is still held out. For each model $M\in\{A,S\}$, fit the model using only $W_1,\dots,W_t$, compute the $h$-step forecast on the log scale $\hat W_{M,t}(h)$, and convert it to a demand-scale forecast by \begin{align*} \hat D_{M,t}(h)=\exp(\hat W_{M,t}(h)). \end{align*} The corresponding forecast error is \begin{align*} e_{M,t}(h) &=D_{t+h}-\hat D_{M,t}(h) \\ &=D_{t+h}-\exp(\hat W_{M,t}(h)). \end{align*} For one-step forecasting take $h=1$; for annual seasonal forecasting take $h=12$. The mean absolute error for model $M$ at horizon $h$ is \begin{align*} \operatorname{MAE}_M(h) &=\frac{1}{|\mathcal T_h|}\sum_{t\in\mathcal T_h}|e_{M,t}(h)|, \end{align*} and the root mean squared error is \begin{align*} \operatorname{RMSE}_M(h) &=\left(\frac{1}{|\mathcal T_h|}\sum_{t\in\mathcal T_h}e_{M,t}(h)^2\right)^{1/2}. \end{align*} Thus the seasonal model improves the average absolute error at horizon $h$ exactly when \begin{align*} \operatorname{MAE}_S(h)<\operatorname{MAE}_A(h), \end{align*} and improves the squared-error score exactly when \begin{align*} \operatorname{RMSE}_S(h)<\operatorname{RMSE}_A(h). \end{align*} If model $M$ also supplies a prediction interval $[L_{M,t}(h),U_{M,t}(h)]$ on the demand scale, its empirical coverage over the rolling origins is \begin{align*} \widehat{\operatorname{Cov}}_M(h) &=\frac{1}{|\mathcal T_h|}\sum_{t\in\mathcal T_h} \mathbf 1\{L_{M,t}(h)\le D_{t+h}\le U_{M,t}(h)\}. \end{align*} A useful seasonal model should not merely reduce $\operatorname{MAE}$ or $\operatorname{RMSE}$; its intervals should also have coverage close to their nominal level and its errors should be inspected by calendar date to see whether the improvement persists during unusual months. Rolling validation therefore tests the modelling procedure as it would have been used in real time, rather than evaluating a single fit on data it was allowed to see. [/example] ## Structural Breaks and Communicating Assumptions The final modelling question is what to say when the model is useful but fragile. Stationary ARMA and ARIMA models extrapolate past dependence into the future. A structural break violates this premise by changing the mean, variance, dependence parameters, or innovation distribution. [definition: Structural Break] A structural break in a time series model is a time point or period at which at least one feature of the data-generating mechanism changes, such as the mean level, trend, autocovariance structure, innovation variance, or parameter vector. [/definition] Breaks matter because a model fitted across regimes estimates an average of incompatible behaviours. Diagnostics should therefore include not only residual autocorrelation tests but also residual plots, rolling parameter estimates, and validation performance across time. [example: Simulated Structural Break in an AR Process] Let $m=n/2$, and write the simulated process as \begin{align*} X_t=a_tX_{t-1}+Z_t, \qquad a_t= \begin{cases} 0.4, & t\le m,\\ 0.9, & t>m, \end{cases} \end{align*} where $(Z_t)$ is white noise with mean $0$ and variance $\sigma^2$. If a single no-intercept AR(1) model $X_t=\phi X_{t-1}+e_t$ is fitted to the full sample, its least-squares coefficient is \begin{align*} \hat\phi &=\frac{\sum_{t=1}^{n}X_tX_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2}\\ &=\frac{\sum_{t=1}^{n}(a_tX_{t-1}+Z_t)X_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2}\\ &=\frac{\sum_{t=1}^{n}a_tX_{t-1}^2+\sum_{t=1}^{n}Z_tX_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2}\\ &=\frac{0.4\sum_{t=1}^{m}X_{t-1}^2+0.9\sum_{t=m+1}^{n}X_{t-1}^2}{\sum_{t=1}^{n}X_{t-1}^2} +\frac{\sum_{t=1}^{n}Z_tX_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2}. \end{align*} Define \begin{align*} w_1=\frac{\sum_{t=1}^{m}X_{t-1}^2}{\sum_{t=1}^{n}X_{t-1}^2}, \qquad w_2=\frac{\sum_{t=m+1}^{n}X_{t-1}^2}{\sum_{t=1}^{n}X_{t-1}^2}. \end{align*} Then $w_1,w_2\ge0$ and $w_1+w_2=1$, so \begin{align*} \hat\phi &=0.4w_1+0.9w_2+\frac{\sum_{t=1}^{n}Z_tX_{t-1}}{\sum_{t=1}^{n}X_{t-1}^2}. \end{align*} Thus, apart from the sample innovation term, the fitted coefficient is a weighted average of the two regime coefficients $0.4$ and $0.9$. For a fixed fitted value $\phi$, the residual in regime $a_t$ is \begin{align*} e_t &=X_t-\phi X_{t-1}\\ &=a_tX_{t-1}+Z_t-\phi X_{t-1}\\ &=Z_t+(a_t-\phi)X_{t-1}. \end{align*} Before the break this is \begin{align*} e_t=Z_t+(0.4-\phi)X_{t-1}, \end{align*} while after the break it is \begin{align*} e_t=Z_t+(0.9-\phi)X_{t-1}. \end{align*} If $\phi$ lies between $0.4$ and $0.9$, the coefficient error has opposite signs in the two regimes. A full-sample residual autocorrelation can therefore average over two different behaviours, while a residual plot over time or rolling estimates of $\phi$ show the change in parameter value directly. The forecasting problem after the break is visible from the one-step error. If $t\ge m$ and the pooled model forecasts by $\hat X_t(1)=\phi X_t$, then the true next value satisfies \begin{align*} X_{t+1}=0.9X_t+Z_{t+1}, \end{align*} so \begin{align*} X_{t+1}-\hat X_t(1) &=0.9X_t+Z_{t+1}-\phi X_t\\ &=(0.9-\phi)X_t+Z_{t+1}. \end{align*} Conditioning on the information available at time $t$ gives \begin{align*} \mathbb E[X_{t+1}-\hat X_t(1)\mid \mathcal F_t] &=(0.9-\phi)X_t+\mathbb E[Z_{t+1}\mid \mathcal F_t]\\ &=(0.9-\phi)X_t, \end{align*} and \begin{align*} \mathbb E[(X_{t+1}-\hat X_t(1))^2\mid \mathcal F_t] &=\mathbb E[((0.9-\phi)X_t+Z_{t+1})^2\mid \mathcal F_t]\\ &=(0.9-\phi)^2X_t^2 +2(0.9-\phi)X_t\mathbb E[Z_{t+1}\mid \mathcal F_t] +\mathbb E[Z_{t+1}^2\mid \mathcal F_t]\\ &=(0.9-\phi)^2X_t^2+\sigma^2. \end{align*} A pooled AR(1) interval that treats $\phi$ as the correct stable coefficient accounts for the innovation variance, but it does not account for the regime error term $(0.9-\phi)X_t$. The example therefore shows why structural breaks should be reported separately from ordinary residual diagnostics: the fitted model may look acceptable on average while giving biased forecasts and overconfident intervals in the new regime. [/example] The example shows that a technically fitted model can still be unsuitable for reporting if its assumptions changed over time. A final report should therefore state the modeling choices and the validation evidence, rather than only the selected orders and parameter estimates. [remark: Reporting a Time Series Model] A complete forecast report should state the transformation, differencing order, fitted mean model, volatility model if used, estimation method, residual diagnostics, validation design, interval construction, and known threats to stability. The report should distinguish uncertainty conditional on the model from uncertainty about the model itself. [/remark] The report checklist is useful because it keeps the workflow from collapsing into a sequence of software commands. The last interpretation step is to connect each diagnostic decision back to the mathematical object it changed. [explanation: Practical Interpretation of the Workflow] The integrated workflow is best viewed as a hierarchy of questions. Is the scale suitable for additive modelling? Is differencing needed to make the mean behaviour stable? Does an ARMA structure explain the remaining serial dependence? Do residuals still have volatility dependence? Are forecast intervals calibrated under rolling validation? Are there external reasons to expect a break from the historical pattern? Each affirmative or negative answer changes the mathematical object being fitted. A log transformation changes the interpretation of errors. Differencing changes the forecast target from levels to increments. Adding GARCH changes the conditional distribution without necessarily changing the conditional mean. Rolling validation changes the evaluation from in-sample fit to out-of-sample performance. The purpose of the workflow is therefore not to find a final formula in isolation. It is to produce forecasts whose assumptions are explicit enough that another analyst can reproduce, challenge, and revise them when new data arrive. [/explanation] ## References - Brockwell, P. J. and Davis, R. A., *Time Series: Theory and Methods*, Springer. - Brockwell, P. J. and Davis, R. A., *Introduction to Time Series and Forecasting*, Springer. - Hamilton, J. D., *Time Series Analysis*, Princeton University Press. - Shumway, R. H. and Stoffer, D. S., *Time Series Analysis and Its Applications*, Springer. - Box, G. E. P., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M., *Time Series Analysis: Forecasting and Control*, Wiley. - Tsay, R. S., *Analysis of Financial Time Series*, Wiley.

Created by admin on 5/25/2026 | Last updated on 6/1/2026

Sign in to Androma

Check your inbox

One last step

Time Series Analysis

Prerequisites

Rate this page