Strong Consistency of the Gaussian Quasi-Maximum Likelihood Estimator for GARCH Models (Theorem # 3661)
Theorem
Let $(\epsilon_t)_{t \in \mathbb{Z}}$ be a real-valued process generated by the GARCH$(p,q)$ model
\begin{align*}
\epsilon_t &= \sigma_t\, \eta_t, & \sigma_t^2 &= \omega_0 + \sum_{i=1}^{q} \alpha_{0i}\, \epsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_{0j}\, \sigma_{t-j}^2,
\end{align*}
where $(\eta_t)_{t \in \mathbb{Z}}$ is an i.i.d. sequence with $\mathbb{E}[\eta_0] = 0$, $\mathbb{E}[\eta_0^2] = 1$, and $\eta_t$ is independent of $\mathcal{F}_{t-1} := \sigma(\epsilon_{t-1}, \epsilon_{t-2}, \dots)$. Write the parameter as $\theta = (\omega, \alpha_1, \dots, \alpha_q, \beta_1, \dots, \beta_p)^\top$, with true value $\theta_0 = (\omega_0, \alpha_{01}, \dots, \alpha_{0q}, \beta_{01}, \dots, \beta_{0p})^\top$. For each $\theta \in \Theta$ define the stationary conditional variance $\sigma_t^2(\theta)$ as the strictly stationary, nonanticipative solution of
\begin{align*}
\sigma_t^2(\theta) = \omega + \sum_{i=1}^{q} \alpha_i\, \epsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j\, \sigma_{t-j}^2(\theta),
\end{align*}
so that $\sigma_t^2(\theta_0) = \sigma_t^2$. Given observations $\epsilon_1, \dots, \epsilon_n$ and arbitrary fixed initial values $(\tilde\sigma_s^2)_{s \le 0}$, $(\epsilon_s^2)_{s \le 0}$, let $\tilde\sigma_t^2(\theta)$ be generated by the same recursion for $t \ge 1$ from these initial values, and define the Gaussian quasi-log-likelihood criterion and the quasi-maximum likelihood estimator (QMLE)
\begin{align*}
\tilde L_n(\theta) &= \frac{1}{n} \sum_{t=1}^{n} \tilde\ell_t(\theta), & \tilde\ell_t(\theta) &= -\frac{1}{2}\left( \log \tilde\sigma_t^2(\theta) + \frac{\epsilon_t^2}{\tilde\sigma_t^2(\theta)} \right), & \hat\theta_n &\in \operatorname*{arg\,max}_{\theta \in \Theta} \tilde L_n(\theta).
\end{align*}
Assume:
- **(A1) Compact parameter space.** $\Theta \subset (0,\infty) \times [0,\infty)^{p+q}$ is compact and $\theta_0 \in \Theta$.
- **(A2) Stationarity and ergodicity.** $(\epsilon_t)_{t\in\mathbb{Z}}$ is strictly stationary and ergodic, and the distribution of $\eta_0^2$ is nondegenerate.
- **(A3) Uniform positivity and stability.** There is $\underline\omega > 0$ with $\omega \ge \underline\omega$ for all $\theta \in \Theta$, and $\bar\beta := \sup_{\theta \in \Theta} \sum_{j=1}^{p} \beta_j < 1$.
- **(A4) Identifiability.** If $\theta \in \Theta$ satisfies $\sigma_t^2(\theta) = \sigma_t^2(\theta_0)$ $\mathbb{P}$-almost surely, then $\theta = \theta_0$.
- **(A5) Integrability.** $\mathbb{E}\big[\log^+ \epsilon_0^2\big] < \infty$ and $\mathbb{E}\big[\log^+ \sigma_0^2\big] < \infty$.
Then the QMLE is strongly consistent:
\begin{align*}
\hat\theta_n \xrightarrow{a.s.} \theta_0 \qquad (n \to \infty).
\end{align*}
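To make the objects in the statement concrete, here is a minimal sketch for the GARCH(1,1) case. It simulates a path and evaluates the initialized criterion $\tilde L_n(\theta)$. The function names `simulate_garch11` and `qml_criterion`, the Gaussian innovations, the burn-in length, the parameter values, and the initial values are all illustrative assumptions, not part of the theorem.

```python
import math
import random


def simulate_garch11(n, omega, alpha, beta, seed=0, burn_in=500):
    """Simulate eps_t = sigma_t * eta_t with Gaussian eta_t (an illustrative choice)."""
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    eps = []
    for _ in range(n + burn_in):
        e = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        eps.append(e)
        sigma2 = omega + alpha * e * e + beta * sigma2
    return eps[burn_in:]  # drop the burn-in so the path is close to stationary


def qml_criterion(theta, eps, sigma2_init=1.0, eps2_init=0.0):
    """Initialized Gaussian quasi-log-likelihood: the average of the per-observation terms."""
    omega, alpha, beta = theta
    sigma2, prev_eps2 = sigma2_init, eps2_init
    total = 0.0
    for e in eps:
        sigma2 = omega + alpha * prev_eps2 + beta * sigma2  # initialized variance at time t
        total += -0.5 * (math.log(sigma2) + e * e / sigma2)
        prev_eps2 = e * e
    return total / len(eps)


theta0 = (0.1, 0.1, 0.8)
eps = simulate_garch11(5000, *theta0, seed=1)
at_truth = qml_criterion(theta0, eps)
far_away = qml_criterion((5.0, 0.1, 0.8), eps)  # a deliberately poor candidate
print(at_truth, far_away)
```

With a sample of this size one would expect the criterion at $\theta_0$ to exceed its value at the distant candidate, which is the finite-sample shadow of the population statement proved below.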
Proof
[proofplan]
We realize $\tilde L_n$ as a perturbation of the stationary criterion $L_n(\theta) = \tfrac1n\sum_{t=1}^n \ell_t(\theta)$ built from the infinite-past conditional variances $\sigma_t^2(\theta)$, and show first that the choice of initial values is asymptotically irrelevant: the difference $\sup_{\theta}|\tilde L_n - L_n|$ vanishes almost surely because the stability condition $\bar\beta<1$ makes the variance recursion forget its start geometrically. The population objective $L(\theta) = \mathbb{E}[\ell_t(\theta)]$ is then shown to be uniquely maximized at $\theta_0$: conditioning on the past and using $\mathbb{E}[\epsilon_t^2\mid\mathcal F_{t-1}]=\sigma_t^2$ reduces the gap $L(\theta_0)-L(\theta)$ to the elementary inequality $\log x + 1/x \ge 1$, with equality only when $\sigma_t^2(\theta)=\sigma_t^2$ almost surely, which by identifiability forces $\theta=\theta_0$. A neighbourhood-covering argument driven by the Birkhoff Ergodic Theorem upgrades pointwise optimality to a uniform strict-domination bound on $\{|\theta-\theta_0|\ge\varepsilon\}$, and a standard argmax comparison then forces $\hat\theta_n$ into every $\varepsilon$-ball around $\theta_0$ eventually, giving almost sure convergence.
[/proofplan]
[step:Fix the stationary conditional variances, the criteria, and their basic regularity]
For $\theta \in \Theta$ the recursion
\begin{align*}
\sigma_t^2(\theta) = \omega + \sum_{i=1}^{q} \alpha_i\, \epsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j\, \sigma_{t-j}^2(\theta)
\end{align*}
has, under (A3), a unique strictly stationary nonanticipative solution given by the convergent expansion
\begin{align*}
\sigma_t^2(\theta) &= \frac{\omega}{1 - \sum_{j=1}^p \beta_j} + \sum_{k=1}^{\infty} c_k(\theta)\, \epsilon_{t-k}^2,
\end{align*}
where the coefficients $c_k(\theta) \ge 0$ are obtained from the formal power series identity $\mathcal{A}_\theta(z)/\mathcal{B}_\theta(z) = \sum_{k\ge1} c_k(\theta) z^k$ with $\mathcal{A}_\theta(z) = \sum_{i=1}^q \alpha_i z^i$ and $\mathcal{B}_\theta(z) = 1 - \sum_{j=1}^p \beta_j z^j$. Because $\beta_j \ge 0$ and $\sum_j \beta_j < 1$, every root of $\mathcal{B}_\theta$ lies outside the closed unit disk: for $|z| \le 1$, $\left|\sum_{j} \beta_j z^j\right| \le \sum_j \beta_j < 1$, so $\mathcal{B}_\theta(z) \ne 0$. Hence the $c_k(\theta)$ decay geometrically, uniformly on the compact set $\Theta$. Since each $c_k(\theta)$ is continuous in $\theta$, the series converges almost surely and uniformly in $\theta \in \Theta$, and the map $\theta \mapsto \sigma_t^2(\theta)$ is therefore continuous.
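The power series identity can be unwound into a recursion: matching coefficients in $\mathcal{A}_\theta(z) = \mathcal{B}_\theta(z)\sum_{k \ge 1} c_k(\theta) z^k$ gives $c_k = \alpha_k + \sum_{j=1}^{p} \beta_j c_{k-j}$, with $\alpha_k = 0$ for $k > q$ and $c_m = 0$ for $m \le 0$. The sketch below implements this recursion; the function name `arch_inf_coefficients` and the parameter values are illustrative assumptions.

```python
def arch_inf_coefficients(alphas, betas, n_terms):
    """Return [c_1, ..., c_n_terms] with A(z)/B(z) = sum_k c_k z^k."""
    c = [0.0] * (n_terms + 1)  # c[0] is a placeholder; the series starts at k = 1
    for k in range(1, n_terms + 1):
        a_k = alphas[k - 1] if k <= len(alphas) else 0.0
        c[k] = a_k + sum(
            b * c[k - j] for j, b in enumerate(betas, start=1) if k - j >= 1
        )
    return c[1:]


# GARCH(1,1): the recursion reduces to the closed form c_k = alpha * beta**(k - 1).
alpha, beta = 0.1, 0.8
coeffs = arch_inf_coefficients([alpha], [beta], 10)
closed_form = [alpha * beta ** (k - 1) for k in range(1, 11)]

# GARCH(2,2) with beta_1 + beta_2 = 0.7 < 1: nonnegative, geometrically decaying coefficients.
coeffs22 = arch_inf_coefficients([0.05, 0.10], [0.4, 0.3], 60)
```

In the GARCH(2,2) example the coefficients sum to approximately $\mathcal{A}_\theta(1)/\mathcal{B}_\theta(1) = 0.15/0.3 = 0.5$, as the identity evaluated at $z = 1$ predicts.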
Two uniform bounds follow. First, dropping all terms but the constant,
\begin{align*}
\sigma_t^2(\theta) \ge \frac{\omega}{1 - \sum_j \beta_j} \ge \omega \ge \underline\omega > 0 \qquad (\theta \in \Theta).
\end{align*}
The bound $\tilde\sigma_t^2(\theta) \ge \omega \ge \underline\omega$ also holds for the initialized version whenever the initial values are nonnegative, since the recursion then adds only nonnegative terms to $\omega$.
Define the infinite-past per-observation criterion and its population average:
\begin{align*}
\ell_t(\theta) &= -\frac{1}{2}\left( \log \sigma_t^2(\theta) + \frac{\epsilon_t^2}{\sigma_t^2(\theta)} \right), & L_n(\theta) &= \frac{1}{n}\sum_{t=1}^n \ell_t(\theta), & L(\theta) &= \mathbb{E}\big[ \ell_t(\theta) \big].
\end{align*}
For the second bound, since $\sigma_t^2(\theta) \ge \underline\omega$ and $\epsilon_t^2/\sigma_t^2(\theta) \ge 0$,
\begin{align*}
\ell_t(\theta) \le -\tfrac{1}{2}\log \underline\omega =: c \qquad \text{for all } \theta \in \Theta,\ t \in \mathbb{Z}. \tag{1}
\end{align*}
Thus $\ell_t(\theta)^+ \le c^+$ is integrable for every $\theta$, and $L(\theta) = \mathbb{E}[\ell_t(\theta)]$ is well defined in $[-\infty, c]$. Because $\theta \mapsto \tilde\sigma_t^2(\theta)$ is continuous (the initialized recursion propagates $\omega, \alpha_i, \beta_j$ polynomially through finitely many steps starting from fixed initial values, and such a polynomial is continuous in its coefficients) and bounded below by $\underline\omega$, the map $\theta \mapsto \tilde\ell_t(\theta)$ is continuous on $\Theta$; hence $\theta\mapsto\tilde L_n(\theta)$ is continuous on the compact set $\Theta$, so the maximum defining $\hat\theta_n$ is attained (extreme value theorem) and $\hat\theta_n$ exists.
Finally, each $\ell_t(\theta)$ is a fixed measurable functional of the stationary ergodic sequence $(\epsilon_{t}, \epsilon_{t-1}, \dots)$, so $(\ell_t(\theta))_{t}$ is itself strictly stationary and ergodic for each fixed $\theta$.
[/step]
[step:Reduce to the stationary criterion by showing the initial values are asymptotically negligible]
[claim:Geometric forgetting of initial values]
There exist an almost surely finite random variable $C > 0$ and a constant $\rho \in (0,1)$, with $C$ and $\rho$ independent of $\theta$, such that
\begin{align*}
\Delta_t := \sup_{\theta \in \Theta} \left| \tilde\sigma_t^2(\theta) - \sigma_t^2(\theta) \right| \le C\, \rho^{\,t} \qquad \text{for all } t \ge 1.
\end{align*}
[/claim]
[proof]
Fix $\theta \in \Theta$ and set $\delta_t(\theta) = \tilde\sigma_t^2(\theta) - \sigma_t^2(\theta)$. Let $t_0 = \max(p,q) + 1$. For $t \ge t_0$ and every $1 \le i \le q$ we have $t - i \ge t_0 - q \ge 1$, so both $\tilde\sigma_t^2(\theta)$ and $\sigma_t^2(\theta)$ use the genuinely observed values $\epsilon_{t-i}^2$; the terms $\omega + \sum_i \alpha_i \epsilon_{t-i}^2$ therefore cancel in the difference, leaving the homogeneous recursion
\begin{align*}
\delta_t(\theta) = \sum_{j=1}^{p} \beta_j\, \delta_{t-j}(\theta) \qquad (t \ge t_0).
\end{align*}
Writing $\Delta_t = \sup_\theta |\delta_t(\theta)|$ and using $\beta_j \ge 0$ with $\sum_j \beta_j \le \bar\beta$,
\begin{align*}
\Delta_t \le \bar\beta \max_{1 \le j \le p} \Delta_{t-j} \qquad (t \ge t_0). \tag{2}
\end{align*}
The initial discrepancies $\Delta_{t_0 - p}, \dots, \Delta_{t_0 - 1}$ are almost surely finite: each $\tilde\sigma_s^2(\theta)$ is a finite combination of the fixed initial values and of finitely many observed $\epsilon_u^2$ with coefficients bounded on $\Theta$, while $\sup_{\theta}\sigma_s^2(\theta) < \infty$ a.s. by the uniform geometric decay of the $c_k(\theta)$. Set $M_0 = \max_{t_0 - p \le s \le t_0 - 1} \Delta_s < \infty$ a.s. Iterating (2) in blocks of length $p$ gives $\Delta_t \le \bar\beta^{\lfloor (t - t_0)/p \rfloor} M_0$ for $t \ge t_0$. With $\rho := \bar\beta^{1/p} \in (0,1)$ and a suitable a.s.-finite $C = C(M_0, \bar\beta, t_0, p)$, enlarged if necessary to cover the finitely many indices $t < t_0$, this yields $\Delta_t \le C \rho^{\,t}$ for all $t \ge 1$.
[/proof]
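The forgetting mechanism can be seen numerically in the GARCH(1,1) case. The stationary $\sigma_t^2(\theta)$ needs the infinite past and cannot be computed, so this sketch instead compares two initialized paths driven by the same data; their difference obeys the same homogeneous recursion $\delta_t = \beta\,\delta_{t-1}$ for $t \ge 2$. The function name `initialized_variances`, the data, and all numerical values are illustrative assumptions.

```python
import random


def initialized_variances(eps, theta, sigma2_init, eps2_init):
    """GARCH(1,1) variance path at times 1..n, started from the given initial values."""
    omega, alpha, beta = theta
    sigma2, prev_eps2 = sigma2_init, eps2_init
    path = []
    for e in eps:
        sigma2 = omega + alpha * prev_eps2 + beta * sigma2
        path.append(sigma2)
        prev_eps2 = e * e
    return path


rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(40)]  # any data would do for this check
theta = (0.1, 0.1, 0.8)

path_a = initialized_variances(eps, theta, sigma2_init=0.5, eps2_init=0.0)
path_b = initialized_variances(eps, theta, sigma2_init=20.0, eps2_init=4.0)

gaps = [abs(a - b) for a, b in zip(path_a, path_b)]
# From time 2 on both paths see the same eps_{t-1}^2, so the gap shrinks by beta each step.
predicted = [gaps[0] * theta[2] ** t for t in range(len(gaps))]
```

Here `gaps` and `predicted` agree up to rounding, so the discrepancy is exactly geometric with ratio $\beta = 0.8$, matching $\rho = \bar\beta^{1/p}$ with $p = 1$.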
Using the claim we bound $\sup_\theta |\tilde\ell_t(\theta) - \ell_t(\theta)|$. Since $\tilde\sigma_t^2(\theta), \sigma_t^2(\theta) \ge \underline\omega$, the mean value theorem gives $|\log a - \log b| \le |a-b|/\min(a,b)$ and $|1/a - 1/b| \le |a-b|/\min(a,b)^2$, so
\begin{align*}
\left| \tilde\ell_t(\theta) - \ell_t(\theta) \right| &= \frac{1}{2}\left| \log \tilde\sigma_t^2(\theta) - \log \sigma_t^2(\theta) + \epsilon_t^2\!\left( \frac{1}{\tilde\sigma_t^2(\theta)} - \frac{1}{\sigma_t^2(\theta)} \right) \right| \\
&\le \frac{\Delta_t}{2\underline\omega} + \frac{\epsilon_t^2\, \Delta_t}{2\underline\omega^2} \le C'\, \rho^{\,t}\,(1 + \epsilon_t^2),
\end{align*}
with $C' := \tfrac{C}{2}\max(\underline\omega^{-1}, \underline\omega^{-2})$ a.s. finite and independent of $\theta$. Therefore
\begin{align*}
\sup_{\theta \in \Theta} \left| \tilde L_n(\theta) - L_n(\theta) \right| \le \frac{1}{n}\sum_{t=1}^{n} \sup_{\theta} |\tilde\ell_t(\theta) - \ell_t(\theta)| \le \frac{C'}{n}\sum_{t=1}^{\infty} \rho^{\,t}\,(1 + \epsilon_t^2). \tag{3}
\end{align*}
[claim:The series in (3) converges almost surely]
$\sum_{t \ge 1} \rho^{\,t}(1 + \epsilon_t^2) < \infty$ almost surely.
[/claim]
[proof]
The geometric part $\sum_t \rho^t$ converges. For the second part set $a = -\tfrac{1}{2}\log\rho > 0$. By (A5) and stationarity, $\mathbb{P}(\log^+\epsilon_t^2 > a t) = \mathbb{P}(\log^+\epsilon_0^2 > a t)$, and
\begin{align*}
\sum_{t=1}^{\infty} \mathbb{P}\big( \log^+ \epsilon_0^2 > a t \big) \le \frac{1}{a}\,\mathbb{E}\big[ \log^+ \epsilon_0^2 \big] + 1 < \infty.
\end{align*}
By the [Borel–Cantelli Lemma](/theorems/507), almost surely $\log^+\epsilon_t^2 \le a t$ for all large $t$, hence $\rho^t \epsilon_t^2 \le \rho^t e^{a t} = \rho^{t/2} \to 0$ and $\sum_t \rho^t \epsilon_t^2 < \infty$ a.s.
[/proof]
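For completeness, the tail-sum bound used in the proof can be spelled out as follows, writing $X := \log^+ \epsilon_0^2 \ge 0$ (a shorthand introduced only for this display). The interchange of sum and expectation is justified by nonnegativity:
\begin{align*}
\sum_{t=1}^{\infty} \mathbb{P}(X > a t) = \sum_{t=1}^{\infty} \mathbb{E}\big[ \mathbf{1}\{ X/a > t \} \big] = \mathbb{E}\big[ \#\{ t \in \mathbb{N},\ t \ge 1 : t < X/a \} \big] \le \mathbb{E}\big[ X/a \big] = \frac{1}{a}\, \mathbb{E}\big[ \log^+ \epsilon_0^2 \big].
\end{align*}
The additive constant $1$ in the bound stated in the proof is therefore slack and is not needed for the conclusion.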
The right-hand side of (3) is $n^{-1}$ times an a.s.-finite quantity, so
\begin{align*}
\sup_{\theta \in \Theta} \left| \tilde L_n(\theta) - L_n(\theta) \right| \xrightarrow{a.s.} 0 \qquad (n \to \infty). \tag{4}
\end{align*}
It therefore suffices to analyze the stationary criterion $L_n$.
[guided]
The point of this step is that the econometrician never observes the infinite past, so $\tilde\sigma_t^2(\theta)$ is computed from invented initial values, whereas the object we can analyze probabilistically is the stationary $\sigma_t^2(\theta)$. We must show that the discrepancy between the two is asymptotically negligible.
*Why do initial values wash out?* The two recursions are driven by the **same** observed data $\epsilon_{t-i}^2$ once $t$ is large enough that all lags $t-i$ are positive (this is why $t_0 = \max(p,q)+1$ appears). After that point the only thing distinguishing $\tilde\sigma_t^2$ from $\sigma_t^2(\theta)$ is the leftover discrepancy from before $t_0$, and that discrepancy is fed only through the $\beta$-part of the recursion. Since $\sum_j \beta_j \le \bar\beta < 1$ uniformly (this is exactly what (A3) buys us), each pass through the recursion shrinks the discrepancy by a factor $\bar\beta$, giving the geometric bound $\Delta_t \le C\rho^t$. This is the precise meaning of "the variance recursion forgets its starting point."
*Why does the geometric bound on variances control the likelihood?* The criterion involves $\log \sigma_t^2(\theta)$ and $\epsilon_t^2/\sigma_t^2(\theta)$. Both are Lipschitz in $\sigma_t^2(\theta)$ on the region $\{\sigma^2 \ge \underline\omega\}$ — the logarithm with constant $1/\underline\omega$, the reciprocal with constant $1/\underline\omega^2$ — and this is exactly where the uniform positivity in (A3) is consumed: without a positive lower bound on the variances, the logarithm and reciprocal would have unbounded derivatives and a small error in $\sigma_t^2(\theta)$ could blow up the likelihood. The reciprocal term carries a factor $\epsilon_t^2$, which is why the bound on $\sup_\theta|\tilde\ell_t - \ell_t|$ contains $\rho^t(1+\epsilon_t^2)$ rather than just $\rho^t$.
*Why does $\rho^t \epsilon_t^2$ still vanish even though $\epsilon_t^2$ may have no finite moments?* This is the subtle part. We do not assume $\mathbb{E}[\epsilon_0^2] < \infty$, since GARCH returns can be heavy-tailed. We only assume the logarithmic moment $\mathbb{E}[\log^+\epsilon_0^2] < \infty$ in (A5). A finite logarithmic moment is exactly the condition that controls the almost sure growth rate: by the [Borel–Cantelli Lemma](/theorems/507), for every $a > 0$ we have $\log^+\epsilon_t^2 \le a t$ for all large $t$ almost surely, so $\epsilon_t^2$ grows subexponentially and is crushed by the geometric weight $\rho^t$. Once the series $\sum_t \rho^t(1+\epsilon_t^2)$ converges almost surely, its partial sums are bounded pathwise by the a.s.-finite limit, and dividing by $n$ sends $\sup_\theta|\tilde L_n - L_n|$ to zero. This is a pathwise bounded-numerator-over-$n$ argument, not an appeal to any law of large numbers.
[/guided]
[/step]
[step:Establish the elementary domination inequality $\log x + 1/x \ge 1$]
[claim:Logarithmic domination inequality]
For every $x > 0$,
\begin{align*}
\log x + \frac{1}{x} \ge 1,
\end{align*}
with equality if and only if $x = 1$.
[/claim]
[proof]
The logarithm is concave with $\log y \le y - 1$ for all $y > 0$, equality iff $y = 1$ (the line $y \mapsto y-1$ is the tangent to $\log$ at $y=1$). Apply this with $y = 1/x$:
\begin{align*}
-\log x = \log\frac{1}{x} \le \frac{1}{x} - 1,
\end{align*}
which rearranges to $\log x + 1/x \ge 1$. Equality holds iff $1/x = 1$, i.e. $x = 1$.
[/proof]
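A quick numerical sanity check of the claim, not a substitute for the proof: the sketch below evaluates $\log x + 1/x - 1$ on a logarithmic grid (the grid range and the name `gap` are arbitrary illustrative choices) and locates its minimum.

```python
import math


def gap(x):
    """log(x) + 1/x - 1, which the claim says is >= 0 with equality only at x = 1."""
    return math.log(x) + 1.0 / x - 1.0


grid = [10 ** (k / 50.0) for k in range(-200, 201)]  # x from 1e-4 to 1e4, including x = 1
values = [gap(x) for x in grid]
x_min = grid[values.index(min(values))]
print(x_min, min(values))
```

On this grid the smallest value is attained at $x = 1$, where the gap is zero, and the gap is strictly positive at every other grid point.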
[/step]
[step:Identify $\theta_0$ as the unique maximizer of the population objective]
We show $L(\theta) \le L(\theta_0)$ for all $\theta \in \Theta$, with equality iff $\theta = \theta_0$, and that $L(\theta_0)$ is finite.
**Finiteness at $\theta_0$.** Here $\sigma_t^2(\theta_0) = \sigma_t^2$ and $\epsilon_t^2 = \sigma_t^2 \eta_t^2$, so
\begin{align*}
\ell_t(\theta_0) = -\frac{1}{2}\big( \log \sigma_t^2 + \eta_t^2 \big).
\end{align*}
Now $\mathbb{E}[\eta_t^2] = 1 < \infty$, and $\log\sigma_t^2 \ge \log\omega_0$ is bounded below while $\mathbb{E}[\log^+\sigma_0^2] < \infty$ by (A5); hence $\mathbb{E}\big[ |\log\sigma_t^2| \big] < \infty$ and $L(\theta_0) = -\tfrac12\big(\mathbb{E}[\log\sigma_t^2] + 1\big)$ is finite.
**Conditional gap.** Fix $\theta \in \Theta$. Both $\sigma_t^2(\theta)$ and $\sigma_t^2 = \sigma_t^2(\theta_0)$ are $\mathcal{F}_{t-1}$-measurable, and $\mathbb{E}[\epsilon_t^2 \mid \mathcal{F}_{t-1}] = \sigma_t^2\, \mathbb{E}[\eta_t^2 \mid \mathcal{F}_{t-1}] = \sigma_t^2$, since $\eta_t$ is independent of $\mathcal{F}_{t-1}$ with $\mathbb{E}[\eta_t^2]=1$ (by the model assumptions on $(\eta_t)$). Therefore
\begin{align*}
\mathbb{E}\big[ \ell_t(\theta_0) - \ell_t(\theta) \mid \mathcal{F}_{t-1} \big]
&= \frac{1}{2}\left( \log \frac{\sigma_t^2(\theta)}{\sigma_t^2} + \frac{\mathbb{E}[\epsilon_t^2 \mid \mathcal{F}_{t-1}]}{\sigma_t^2(\theta)} - \frac{\mathbb{E}[\epsilon_t^2 \mid \mathcal{F}_{t-1}]}{\sigma_t^2} \right) \\
&= \frac{1}{2}\left( \log \frac{\sigma_t^2(\theta)}{\sigma_t^2} + \frac{\sigma_t^2}{\sigma_t^2(\theta)} - 1 \right). \tag{5}
\end{align*}
Setting $x = \sigma_t^2(\theta)/\sigma_t^2 > 0$, the bracket in (5) equals $\log x + 1/x - 1 \ge 0$ by the domination inequality, pointwise almost surely, with equality iff $x = 1$, i.e. $\sigma_t^2(\theta) = \sigma_t^2$.
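As a worked illustration, write $g(x) := \tfrac12(\log x + 1/x - 1)$ for the conditional gap in (5); the values of $x$ below are chosen purely for illustration:
\begin{align*}
g(1) &= 0, & g(2) &= \tfrac12\big( \log 2 + \tfrac12 - 1 \big) \approx 0.0966, & g\big(\tfrac12\big) &= \tfrac12\big( -\log 2 + 2 - 1 \big) \approx 0.1534.
\end{align*}
So a candidate variance equal to half the true one is penalized more heavily than a candidate equal to twice the true one, while any ratio other than $1$ incurs a strictly positive loss.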
**Unconditional gap.** By (1), $\ell_t(\theta) \le c$, so $\ell_t(\theta_0) - \ell_t(\theta) \ge \ell_t(\theta_0) - c$, and the negative part of $\ell_t(\theta_0) - \ell_t(\theta)$ is dominated by the integrable variable $|\ell_t(\theta_0)| + |c|$. Hence $\mathbb{E}[\ell_t(\theta_0) - \ell_t(\theta)]$ is well defined in $(-\infty, +\infty]$, and taking expectations of the a.s.-nonnegative quantity (5),
\begin{align*}
L(\theta_0) - L(\theta) = \mathbb{E}\big[ \ell_t(\theta_0) - \ell_t(\theta) \big] = \mathbb{E}\!\left[ \frac{1}{2}\left( \log \frac{\sigma_t^2(\theta)}{\sigma_t^2} + \frac{\sigma_t^2}{\sigma_t^2(\theta)} - 1 \right) \right] \ge 0,
\end{align*}
so $L(\theta) \le L(\theta_0)$ (with $L(\theta) = -\infty$ permitted). Equality $L(\theta) = L(\theta_0)$ forces the nonnegative integrand in (5) to vanish almost surely, i.e. $\sigma_t^2(\theta) = \sigma_t^2$ $\mathbb{P}$-a.s. By the identifiability hypothesis (A4) this gives $\theta = \theta_0$. Thus $\theta_0$ is the unique maximizer of $L$ on $\Theta$.
[guided]
We want to prove the population objective $L$ peaks uniquely at the truth. The strategy is to compute the gap $L(\theta_0) - L(\theta)$ and show it is nonnegative, vanishing only at $\theta_0$.
*Why condition on the past?* The criterion mixes the parameter-dependent volatility $\sigma_t^2(\theta)$ (which is $\mathcal{F}_{t-1}$-measurable, i.e. known given the past) with the random innovation through $\epsilon_t^2$. Conditioning on $\mathcal{F}_{t-1}$ freezes the volatilities and isolates the only genuinely random ingredient, $\epsilon_t^2$. The model identity $\epsilon_t = \sigma_t \eta_t$ with $\eta_t \perp \mathcal{F}_{t-1}$ and $\mathbb{E}[\eta_t^2]=1$ gives $\mathbb{E}[\epsilon_t^2 \mid \mathcal{F}_{t-1}] = \sigma_t^2$ — note it is the **true** $\sigma_t^2$, not $\sigma_t^2(\theta)$, that appears, because the data are generated at $\theta_0$. This asymmetry is what breaks the tie in favour of $\theta_0$.
*Where does the inequality enter?* After substituting $\mathbb{E}[\epsilon_t^2\mid\mathcal F_{t-1}]=\sigma_t^2$, the conditional gap (5) collapses to $\tfrac12(\log x + 1/x - 1)$ with $x = \sigma_t^2(\theta)/\sigma_t^2$ the ratio of the candidate variance to the true variance. The domination inequality $\log x + 1/x \ge 1$ says this is always $\ge 0$ and is zero **only** when the ratio is exactly $1$. So the quasi-likelihood, in expectation, always prefers the true volatility, and strictly so whenever the candidate volatility differs.
*Why is the integrability bookkeeping necessary?* We must be careful because $L(\theta)$ might be $-\infty$ for some bad $\theta$ (the candidate volatility could be wildly wrong). We do not need $L(\theta)$ finite — only that the gap $L(\theta_0)-L(\theta)$ is well defined and nonnegative. The uniform upper bound (1), $\ell_t(\theta) \le c$, guarantees the negative part of $\ell_t(\theta_0)-\ell_t(\theta)$ is integrable (it is dominated by $|\ell_t(\theta_0)|+|c|$, and $\ell_t(\theta_0)$ is integrable thanks to $\mathbb{E}[\log^+\sigma_0^2]<\infty$ and $\mathbb{E}[\eta_0^2]=1$). So the expectation is well defined in $(-\infty,+\infty]$ and equals the expectation of the nonnegative conditional gap, which is $\ge 0$.
*Where is identifiability used?* Equality in the inequality is a statement about volatilities: $\sigma_t^2(\theta) = \sigma_t^2$ almost surely. Translating this back into a statement about parameters — that the only $\theta$ producing the true volatility path is $\theta_0$ itself — is precisely the content of (A4). Without identifiability, two distinct parameter vectors could generate the same conditional variance process and the maximizer would not be unique. This is why the theorem must assume it.
[/guided]
[/step]
[step:Convert pointwise optimality into a uniform strict-domination bound via the ergodic theorem]
We use the following one-sided version of the ergodic theorem for sequences whose positive part is integrable but whose mean may equal $-\infty$.
[claim:Ergodic upper bound for sequences with integrable positive part]
Let $(X_t)_{t \ge 1}$ be strictly stationary and ergodic with $\mathbb{E}[X_t^+] < \infty$ (so $\mathbb{E}[X_t] \in [-\infty, \infty)$ is well defined). Then
\begin{align*}
\limsup_{n \to \infty} \frac{1}{n}\sum_{t=1}^{n} X_t \le \mathbb{E}[X_t] \qquad \text{almost surely.}
\end{align*}
[/claim]
[proof]
For $M > 0$ put $X_t^{(M)} = \max(X_t, -M)$, which is integrable since $|X_t^{(M)}| \le X_t^+ + M$. The sequence $(X_t^{(M)})_t$ is strictly stationary and ergodic (a fixed measurable function of $(X_t)_t$), so the [Birkhoff Ergodic Theorem](/theorems/518), applied to the measure-preserving shift on the path space of $(X_t)_t$ which is ergodic by hypothesis, yields $\tfrac1n\sum_{t=1}^n X_t^{(M)} \xrightarrow{a.s.} \mathbb{E}[X_t^{(M)}]$. Since $X_t \le X_t^{(M)}$,
\begin{align*}
\limsup_{n} \frac{1}{n}\sum_{t=1}^{n} X_t \le \limsup_{n} \frac{1}{n}\sum_{t=1}^{n} X_t^{(M)} = \mathbb{E}\big[ X_t^{(M)} \big] \quad \text{a.s.}
\end{align*}
As $M \uparrow \infty$, $X_t^{(M)} \downarrow X_t$, with the integrable upper bound $X_t^{(M)} \le X_t^+ \le X_t^+ + 1$; by the [Monotone Convergence Theorem](/theorems/509) applied to the nonnegative increasing sequence $X_t^+ + 1 - X_t^{(M)} \uparrow X_t^+ + 1 - X_t$, we get $\mathbb{E}[X_t^{(M)}] \downarrow \mathbb{E}[X_t]$. Taking $M \to \infty$ gives the claim.
[/proof]
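The truncation device has a simple sample analogue. The sketch below draws data with integrable positive part but mean $-\infty$, namely $X = -1/U^2$ with $U$ uniform on $(0,1]$, and computes the sample means of $X^{(M)} = \max(X, -M)$. The distribution, sample size, and truncation levels are illustrative assumptions.

```python
import random


def truncated_mean(xs, M):
    """Sample mean of X^(M) = max(X, -M)."""
    return sum(max(x, -M) for x in xs) / len(xs)


rng = random.Random(0)
# 1 - rng.random() lies in (0, 1], which avoids division by zero.
xs = [-1.0 / (1.0 - rng.random()) ** 2 for _ in range(2000)]

levels = [1.0, 10.0, 100.0, 1000.0, float("inf")]
means = [truncated_mean(xs, M) for M in levels]  # M = inf recovers the plain sample mean
for M, m in zip(levels, means):
    print(M, m)
```

The truncated means decrease as $M$ grows and always sit above the untruncated mean, which mirrors the inequality $X_t \le X_t^{(M)}$ used in the proof.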
For $\theta^* \in \Theta$ and $k \in \mathbb{N}$ let $V_k(\theta^*) = \{\theta \in \Theta : |\theta - \theta^*| < 1/k\}$ and define
\begin{align*}
u_t^{(k)}(\theta^*) = \sup_{\theta \in V_k(\theta^*)} \ell_t(\theta).
\end{align*}
By continuity of $\theta \mapsto \ell_t(\theta)$ the supremum over $V_k(\theta^*)$ equals the supremum over a countable dense subset of $V_k(\theta^*)$, so $u_t^{(k)}(\theta^*)$ is measurable; it is a fixed measurable functional of $(\epsilon_t, \epsilon_{t-1}, \dots)$, hence $(u_t^{(k)}(\theta^*))_t$ is strictly stationary and ergodic. By (1), $u_t^{(k)}(\theta^*) \le c$, so $\mathbb{E}[(u_t^{(k)}(\theta^*))^+] < \infty$ and the claim applies. Since $\tfrac1n\sum_t \sup_{\theta\in V_k} \ell_t(\theta) \ge \sup_{\theta \in V_k} \tfrac1n\sum_t \ell_t(\theta)$,
\begin{align*}
\limsup_{n \to \infty} \sup_{\theta \in V_k(\theta^*)} L_n(\theta) \le \limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} u_t^{(k)}(\theta^*) \le \mathbb{E}\big[ u_t^{(k)}(\theta^*) \big] \quad \text{a.s.} \tag{6}
\end{align*}
As $k \uparrow \infty$, $V_k(\theta^*)$ shrinks to $\{\theta^*\}$ and, by continuity of $\theta \mapsto \ell_t(\theta)$, $u_t^{(k)}(\theta^*) \downarrow \ell_t(\theta^*)$ pointwise. Since $u_t^{(1)}(\theta^*) \le c$ provides an integrable upper envelope, the [Monotone Convergence Theorem](/theorems/509) (applied to the increasing sequence $c - u_t^{(k)}(\theta^*) \uparrow c - \ell_t(\theta^*)$) gives
\begin{align*}
\mathbb{E}\big[ u_t^{(k)}(\theta^*) \big] \downarrow \mathbb{E}\big[ \ell_t(\theta^*) \big] = L(\theta^*) \qquad (k \to \infty). \tag{7}
\end{align*}
Now fix $\theta^* \ne \theta_0$. By the previous step $L(\theta^*) < L(\theta_0)$; combining with (7), there exists $k(\theta^*) \in \mathbb{N}$ with
\begin{align*}
\mathbb{E}\big[ u_t^{(k(\theta^*))}(\theta^*) \big] < L(\theta_0). \tag{8}
\end{align*}
(If $L(\theta^*) = -\infty$, (7) gives $\mathbb{E}[u_t^{(k)}(\theta^*)] \to -\infty$ and (8) holds for $k$ large.) Writing $V(\theta^*) := V_{k(\theta^*)}(\theta^*)$, (6) and (8) yield an event of probability one on which
\begin{align*}
\limsup_{n \to \infty} \sup_{\theta \in V(\theta^*)} L_n(\theta) \le \mathbb{E}\big[ u_t^{(k(\theta^*))}(\theta^*) \big] < L(\theta_0). \tag{9}
\end{align*}
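The interplay between (6), (7) and (9) can be illustrated in the simplest special case $\alpha_i = \beta_j = 0$, where $\sigma_t^2(\theta) = \omega$ and $\theta$ reduces to a scalar. The sketch below takes $\theta_0 = 1$ and $\theta^* = 2$ and approximates each neighbourhood $V_k(\theta^*)$ by a finite grid. The grid, sample size, and function names are illustrative assumptions.

```python
import math
import random


def ell(theta, x):
    """Per-observation criterion in the pure-scale special case sigma_t^2(theta) = theta."""
    return -0.5 * (math.log(theta) + x * x / theta)


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # data generated at theta_0 = 1

theta_star = 2.0
fine_grid = [theta_star + j / 100.0 for j in range(-99, 100)]


def neighbourhood(k):
    """Grid approximation of V_k(theta*) = {theta : |theta - theta*| < 1/k}."""
    return [th for th in fine_grid if abs(th - theta_star) < 1.0 / k]


def sup_of_averages(k):
    return max(sum(ell(th, x) for x in xs) / len(xs) for th in neighbourhood(k))


def average_of_sups(k):
    return sum(max(ell(th, x) for th in neighbourhood(k)) for x in xs) / len(xs)


rows = [(k, sup_of_averages(k), average_of_sups(k)) for k in (1, 2, 5, 20)]
L_n_at_truth = sum(ell(1.0, x) for x in xs) / len(xs)
for row in rows:
    print(row)
```

The average of suprema always dominates the supremum of averages and decreases as the neighbourhood shrinks. For a small enough neighbourhood one would expect it to fall below the criterion at $\theta_0$, which is the strict domination that (9) formalizes.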
[guided]
We have established that $\theta_0$ beats every other single point in expectation. But $\hat\theta_n$ ranges over a continuum, so pointwise domination is not enough — we need to dominate $L_n$ *uniformly* over a whole region away from $\theta_0$. This step manufactures, around each bad point $\theta^*$, a small ball on which $L_n$ is eventually strictly below $L(\theta_0)$.
*Why the unusual ergodic statement?* The standard Birkhoff theorem requires $X_t \in L^1$. But $\ell_t(\theta)$ — and a fortiori its supremum over a neighbourhood — can have expectation $-\infty$ for badly-fitting $\theta$. What saves us is that $\ell_t(\theta)$ is uniformly bounded *above* by $c = -\tfrac12\log\underline\omega$ (again from the positivity condition (A3)). So the positive part is integrable and the mean is well defined in $[-\infty,\infty)$. The truncation argument — apply Birkhoff to $\max(X_t,-M)$, then send $M\to\infty$ by monotone convergence — converts this one-sided integrability into the one-sided ergodic conclusion $\limsup \tfrac1n\sum X_t \le \mathbb{E}[X_t]$, which is exactly the direction we need for an upper bound.
*Why take the supremum inside the average?* We want to control $\sup_{\theta \in V_k} L_n(\theta) = \sup_{\theta\in V_k}\tfrac1n\sum_t \ell_t(\theta)$, but the supremum and the sum do not commute. The inequality $\sup_\theta \sum_t \le \sum_t \sup_\theta$ lets us pull the supremum inside, replacing the awkward object by the genuine time average of the single stationary ergodic sequence $u_t^{(k)} = \sup_{\theta\in V_k}\ell_t(\theta)$, to which the ergodic theorem applies. We pay for this by overestimating, but the overestimate is harmless: we are seeking an upper bound.
*Why shrink the neighbourhood?* The crude bound (6) replaces $L(\theta^*)$ by $\mathbb{E}[u_t^{(k)}(\theta^*)]$, which exceeds $L(\theta^*)$. To recover strict domination by $L(\theta_0)$ we must drive $\mathbb{E}[u_t^{(k)}(\theta^*)]$ back down to $L(\theta^*)$, which we do by shrinking the ball ($k \to \infty$). Continuity of $\theta \mapsto \ell_t(\theta)$ makes $u_t^{(k)}(\theta^*) \downarrow \ell_t(\theta^*)$, and the integrable upper envelope $c$ legitimizes passing the limit through the expectation by monotone convergence. Since $L(\theta^*) < L(\theta_0)$ strictly, a finite radius $1/k(\theta^*)$ already brings the average below $L(\theta_0)$ — giving the uniform bound (9) on a fixed ball.
[/guided]
[/step]
[step:Conclude almost sure convergence by an argmax comparison on compacta]
First, the Birkhoff Ergodic Theorem applied to the integrable stationary ergodic sequence $(\ell_t(\theta_0))_t$ (integrability shown in the identification step) gives
\begin{align*}
L_n(\theta_0) \xrightarrow{a.s.} L(\theta_0). \tag{10}
\end{align*}
Fix $\varepsilon > 0$ and let $K_\varepsilon = \{\theta \in \Theta : |\theta - \theta_0| \ge \varepsilon\}$, a closed subset of the compact set $\Theta$, hence compact. The balls $\{V(\theta^*)\}_{\theta^* \in K_\varepsilon}$ from the previous step form an open cover of $K_\varepsilon$; by compactness extract a finite subcover $V(\theta_1^*), \dots, V(\theta_m^*)$. Intersecting the finitely many probability-one events from (9), (10) and (4), we obtain an event $\Omega_\varepsilon$ of probability one on which all of the following hold:
\begin{align*}
\limsup_{n} \sup_{\theta \in K_\varepsilon} L_n(\theta) &\le \max_{1 \le i \le m} \limsup_{n} \sup_{\theta \in V(\theta_i^*)} L_n(\theta) \le \max_{1 \le i \le m} \mathbb{E}\big[ u_t^{(k(\theta_i^*))}(\theta_i^*) \big] =: \kappa < L(\theta_0), \\
L_n(\theta_0) &\to L(\theta_0), \qquad \sup_{\theta \in \Theta} |\tilde L_n(\theta) - L_n(\theta)| \to 0.
\end{align*}
If $\kappa = -\infty$, replace $\kappa$ by any finite number below $L(\theta_0)$; the bounds in the display above remain valid. Choose $\eta := \tfrac{1}{5}\big( L(\theta_0) - \kappa \big) > 0$. Since $4\eta = \tfrac{4}{5}(L(\theta_0)-\kappa) < L(\theta_0)-\kappa$, we have the strict gap $\kappa + 2\eta < L(\theta_0) - 2\eta$. On $\Omega_\varepsilon$, for all $n$ sufficiently large: (4) gives $\sup_{\Theta}|\tilde L_n - L_n| < \eta$; (10) gives $|L_n(\theta_0)-L(\theta_0)| < \eta$; and (9) (combined with the definition of $\limsup$) gives $\sup_{\theta\in K_\varepsilon} L_n(\theta) \le \kappa + \eta$. Therefore
\begin{align*}
\sup_{\theta \in K_\varepsilon} \tilde L_n(\theta) &\le \sup_{\theta \in K_\varepsilon} L_n(\theta) + \eta \le \kappa + 2\eta, \\
\tilde L_n(\theta_0) &\ge L_n(\theta_0) - \eta \ge L(\theta_0) - 2\eta,
\end{align*}
and hence, for all large $n$ on $\Omega_\varepsilon$,
\begin{align*}
\sup_{\theta \in K_\varepsilon} \tilde L_n(\theta) \le \kappa + 2\eta < L(\theta_0) - 2\eta \le \tilde L_n(\theta_0).
\end{align*}
Since $\hat\theta_n$ maximizes $\tilde L_n$ over $\Theta$, we have $\tilde L_n(\hat\theta_n) \ge \tilde L_n(\theta_0) > \sup_{\theta \in K_\varepsilon}\tilde L_n(\theta)$, which is incompatible with $\hat\theta_n \in K_\varepsilon$. Hence on $\Omega_\varepsilon$, $\hat\theta_n \notin K_\varepsilon$ for all large $n$, i.e. $|\hat\theta_n - \theta_0| < \varepsilon$ eventually.
Applying this to the countable sequence $\varepsilon = 1/r$, $r \in \mathbb{N}$, and intersecting the corresponding probability-one events $\Omega_{1/r}$ produces a single probability-one event on which $|\hat\theta_n - \theta_0| < 1/r$ eventually for every $r$, i.e. $\hat\theta_n \to \theta_0$. Therefore
\begin{align*}
\hat\theta_n \xrightarrow{a.s.} \theta_0 \qquad (n \to \infty),
\end{align*}
which is the assertion of the theorem. $\qquad\blacksquare$
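The conclusion can be watched at work in a toy experiment. The sketch below uses a deliberately coarse finite parameter set $\Theta$ (a finite set is compact, so (A1) holds) containing $\theta_0$, and computes the argmax of the initialized criterion for increasing sample sizes. Gaussian innovations, the seed, the grid, the initial values, and the function names are all illustrative assumptions; a practical implementation would use a numerical optimizer over a continuous $\Theta$ instead.

```python
import itertools
import math
import random


def simulate_garch11(n, theta, seed, burn_in=500):
    """Simulate a GARCH(1,1) path with Gaussian innovations (an illustrative choice)."""
    omega, alpha, beta = theta
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)
    eps = []
    for _ in range(n + burn_in):
        e = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        eps.append(e)
        sigma2 = omega + alpha * e * e + beta * sigma2
    return eps[burn_in:]


def qml_criterion(theta, eps, sigma2_init=1.0, eps2_init=0.0):
    """Initialized Gaussian quasi-log-likelihood averaged over the sample."""
    omega, alpha, beta = theta
    sigma2, prev_eps2, total = sigma2_init, eps2_init, 0.0
    for e in eps:
        sigma2 = omega + alpha * prev_eps2 + beta * sigma2
        total += -0.5 * (math.log(sigma2) + e * e / sigma2)
        prev_eps2 = e * e
    return total / len(eps)


theta0 = (0.1, 0.1, 0.8)
# Finite candidate set for (omega, alpha, beta); it contains theta0.
Theta = list(itertools.product((0.1, 1.0), (0.1, 0.5), (0.3, 0.8)))

eps = simulate_garch11(4000, theta0, seed=2)
estimates = {
    n: max(Theta, key=lambda th: qml_criterion(th, eps[:n]))
    for n in (250, 1000, 4000)
}
print(estimates)
```

Because the candidates are well separated, the estimate at the largest sample size should coincide with $\theta_0$; on a finite $\Theta$, strong consistency means exactly that $\hat\theta_n = \theta_0$ for all large $n$ almost surely.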
[guided]
This final step is the classical Wald-type argmax comparison, adapted to the stationary criterion. The logic: if the maximizer $\hat\theta_n$ stayed a distance $\ge\varepsilon$ from $\theta_0$ infinitely often, it would have to lie in the compact set $K_\varepsilon$ infinitely often; but the criterion on $K_\varepsilon$ is eventually strictly below its value at $\theta_0$, contradicting maximality.
*Why compactness and a finite subcover?* The uniform bound (9) is local — it holds on one small ball $V(\theta^*)$ around each bad point. To control the entire region $K_\varepsilon$ we must glue these local bounds. Each local bound holds on its own probability-one event, and we can only intersect **finitely** many probability-one events and still have probability one. Compactness of $K_\varepsilon$ (it is closed in the compact $\Theta$) lets us cover it by finitely many of the balls $V(\theta_i^*)$, so finitely many local bounds suffice, and their maximum $\kappa$ is still strictly below $L(\theta_0)$.
*How do the three ingredients combine?* On the common probability-one event we have (i) $\sup_{K_\varepsilon} L_n \le \kappa < L(\theta_0)$ eventually, (ii) $L_n(\theta_0) \to L(\theta_0)$ from Birkhoff, and (iii) $\sup_\Theta|\tilde L_n - L_n| \to 0$ from Step 2. Ingredient (iii) lets us transfer everything from the analyzable stationary criterion $L_n$ to the actually-computed criterion $\tilde L_n$ that $\hat\theta_n$ maximizes. Choosing the margin $\eta = \tfrac15(L(\theta_0)-\kappa)$ leaves room for two perturbation errors on each side, so that $\tilde L_n$ at $\theta_0$ strictly exceeds $\tilde L_n$ everywhere on $K_\varepsilon$. A maximizer cannot sit where the function is strictly dominated by its value at another admissible point, so $\hat\theta_n \notin K_\varepsilon$.
*Why does eventual closeness for each $\varepsilon$ give almost sure convergence?* Almost sure convergence is the statement that for almost every $\omega$, for every $\varepsilon$ there is $N$ with $|\hat\theta_n - \theta_0| < \varepsilon$ for $n \ge N$. We proved this for fixed $\varepsilon$ on a full event $\Omega_\varepsilon$. The quantifier "for every $\varepsilon$" is handled by restricting to the countable sequence $\varepsilon = 1/r$ and intersecting the countably many $\Omega_{1/r}$ — still probability one — which delivers convergence simultaneously for all $r$, hence genuine almost sure convergence.
[/guided]
[/step]