Numerical Analysis II extends foundational methods to the realm of partial differential equations and large-scale linear algebra. Where Numerical Analysis I covers ordinary differential equations and basic matrix operations, this course confronts the computational challenges of modeling continuous phenomena in space and time: heat diffusion, wave propagation, fluid flow, and quantum mechanics. These problems demand specialized discretization techniques, efficient solvers for massive systems, and careful analysis of stability and convergence.
The course is organized around two complementary themes. Chapters 1 and 2 develop finite difference methods for elliptic and parabolic PDEs, starting with the prototypical Poisson equation and advancing to time-dependent problems where spatial and temporal accuracy must be balanced. Chapters 3, 4, and 5 address the computational challenges that emerge: spectral methods exploit smoothness to achieve high accuracy with fewer unknowns; iterative solvers replace direct factorization for large-dimensional problems; and eigenvalue algorithms reveal the spectral properties that govern convergence and stability.
The architecture is recursive: PDE discretization produces matrices whose eigenvalues predict convergence rates of iterative solvers, which in turn inform the choice between dense spectral representations and sparse finite difference grids. By the end, you will recognize which method suits which problem and understand the theoretical foundations that make each approach work.
# 1. The Poisson Equation
## The Poisson Equation Setup
The Poisson equation is one of the most fundamental elliptic PDEs, arising in electrostatics, fluid mechanics, and heat conduction. The aim of this chapter is to develop a systematic finite difference approach to solving it numerically, and to understand rigorously why the approach converges.
The exact solution is frequently inaccessible in closed form. The strategy is therefore to replace the continuous problem by a finite-dimensional linear system, which we know how to solve.
[definition: Poisson Problem]
Let $\Omega \subset \mathbb{R}^2$ be an open connected domain whose boundary $\partial\Omega$ is a Jordan curve (a non-self-intersecting continuous closed loop). Given $f \in C(\Omega)$ and $\phi \in C^2(\partial\Omega)$, the **Poisson problem** asks for $u : \overline{\Omega} \to \mathbb{R}$ satisfying
\begin{align*}
\Delta u(x, y) &= f(x, y), \quad (x, y) \in \Omega, \\
u(x, y) &= \phi(x, y), \quad (x, y) \in \partial\Omega.
\end{align*}
Here $\Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}$ is the Laplacian.
[/definition]
[remark: Why Jordan boundary]
The Jordan curve condition on $\partial\Omega$ ensures that the boundary separates $\Omega$ from its complement in a topologically clean way. It excludes pathological boundaries such as fractal curves, which would make the notion of a boundary value $\phi$ ill-posed.
[/remark]
The idea behind the finite difference method is simple: we know how to solve the linear system $Ax = y$ for $A \in \mathbb{R}^{N \times N}$ and $x, y \in \mathbb{R}^N$. If we can approximate the function $u$ by a vector of its values at grid points, and approximate the differential operator $\Delta$ by a matrix acting on that vector, the PDE becomes a linear system.
To see the analogy concretely, consider a function $g : [0, 10] \to \mathbb{R}$. We can approximate $g$ by the vector
\begin{align*}
v = (g(0),\, g(h),\, g(2h),\, \dots,\, g(10))^\top,
\end{align*}
where $h > 0$ is the step size. The $i$th entry of $v$ represents $g(ih)$. The key question then becomes: what matrix should approximate the operator $\Delta$?
## The Second Central Difference
The difficulty is immediate: we have a continuous second derivative $g''(x)$ appearing in the Laplacian, but we only have access to the discrete values $g(x-h), g(x), g(x+h)$ at grid points. We need a formula that extracts second-derivative information from these sampled values alone, with a quantifiable error. The most natural and accurate such formula is the **second central difference**.
To see why we need this, note that the naive approach of writing $g'(x) \approx \frac{g(x+h)-g(x)}{h}$ and repeating gives a formula for $g''(x)$ in terms of three grid values — but we need to know precisely how large the error is, and how it scales with $h$, before we can trust any convergence claim.
[definition: Second Central Difference]
For $h > 0$ and a function $g : \mathbb{R} \to \mathbb{R}$, the **second central difference** of $g$ at $x$ is
\begin{align*}
\Delta_h^* g(x) := g(x - h) - 2g(x) + g(x + h).
\end{align*}
[/definition]
The motivation is clear from the definition of the derivative: $g'(x) = \lim_{h \to 0} \frac{g(x+h) - g(x)}{h}$, and iterating this idea twice yields a formula involving $g(x \pm h)$ and $g(x)$. The precise relationship between $\Delta_h^* g(x)$ and $g''(x)$ is given by a Taylor expansion.
[quotetheorem:1363]
[explanation: Why $C^4$ is Required]
The hypothesis $g \in C^4[a,b]$ is not a luxury. The formula shows that the error in approximating $g''(x)$ by $\frac{1}{h^2}\Delta_h^* g(x)$ is $\frac{h^2}{12} g^{(4)}(x) + O(h^4)$. For this to tend to zero as $h \to 0$, we need $g^{(4)}(x)$ to exist and be bounded — continuity of $g^{(4)}$ on the closed interval $[a,b]$ ensures both. If $g$ were only $C^3$, the Taylor expansion would break off one term earlier, and the error formula would involve the third derivative, losing the crucial $h^2$ convergence rate. In practice, solutions to the Poisson equation on smooth domains are as smooth as the data allows, so $C^4$ regularity is a reasonable assumption.
[/explanation]
[proof]
Expand $g(x + h)$ and $g(x - h)$ in Taylor series about $x$:
\begin{align*}
g(x + h) - g(x) &= h g'(x) + \frac{h^2}{2!} g''(x) + \frac{h^3}{3!} g'''(x) + \frac{h^4}{4!} g^{(4)}(x) + \cdots, \\
g(x - h) - g(x) &= -h g'(x) + \frac{h^2}{2!} g''(x) - \frac{h^3}{3!} g'''(x) + \frac{h^4}{4!} g^{(4)}(x) + \cdots.
\end{align*}
Adding these two equations, all terms with odd-order derivatives cancel, leaving
\begin{align*}
\Delta_h^* g(x) = 2 \cdot \frac{h^2}{2!} g''(x) + 2 \cdot \frac{h^4}{4!} g^{(4)}(x) + \mathcal{O}(h^6) = h^2 g''(x) + \frac{h^4}{12} g^{(4)}(x) + \mathcal{O}(h^6).
\end{align*}
More generally, the sum equals $\sum_{k=1}^m \frac{2}{(2k)!} h^{2k} g^{(2k)}(x) + \mathcal{O}(h^{2m+2})$; taking $m = 2$ gives the stated result.
[/proof]
[remark: Higher-Order Expansion]
Dividing through by $h^2$ gives a useful formula for the error in approximating $g''(x)$ by $\frac{1}{h^2}\Delta_h^* g(x)$:
\begin{align*}
\frac{1}{h^2}\bigl[g(x - h) - 2g(x) + g(x + h)\bigr] = g''(x) + \frac{1}{12} h^2 g^{(4)}(x) + \frac{1}{360} h^4 g^{(6)}(x) + \mathcal{O}(h^6).
\end{align*}
The leading error term is $\frac{h^2}{12} g^{(4)}$, so the second central difference gives a second-order approximation to $g''(x)$.
[/remark]
Applying the same idea in both the $x$- and $y$-directions simultaneously yields an approximation to the full Laplacian.
[quotetheorem:1364]
This theorem says that the five values $u(x \pm h, y)$, $u(x, y \pm h)$, and $u(x,y)$ collectively carry second-order information about $\Delta u$ at $(x,y)$. Dividing by $h^2$ gives the approximation $\Delta u \approx h^{-2}(\Delta_{h,x}^* + \Delta_{h,y}^*) u$ with an error of order $h^2$. This is the starting point for the five-point formula: we replace $\Delta u = f$ by the discrete equation at each interior grid point.
[proof]
Apply the Taylor expansion theorem separately in the $x$- and $y$-directions. For the $x$-direction:
\begin{align*}
u(x-h, y) + u(x+h, y) - 2u(x, y) = h^2 \frac{\partial^2 u}{\partial x^2}(x, y) + \frac{h^4}{12} \frac{\partial^4 u}{\partial x^4}(x, y) + \mathcal{O}(h^6),
\end{align*}
and for the $y$-direction:
\begin{align*}
u(x, y-h) + u(x, y+h) - 2u(x, y) = h^2 \frac{\partial^2 u}{\partial y^2}(x, y) + \frac{h^4}{12} \frac{\partial^4 u}{\partial y^4}(x, y) + \mathcal{O}(h^6).
\end{align*}
Adding these gives
\begin{align*}
\bigl(\Delta_{h,x}^* + \Delta_{h,y}^*\bigr) u(x, y) = h^2 \Delta u(x, y) + \frac{h^4}{12}\left(\frac{\partial^4 u}{\partial x^4} + \frac{\partial^4 u}{\partial y^4}\right)(x, y) + \mathcal{O}(h^6),
\end{align*}
so $h^2 \Delta u(x, y) - \bigl(\Delta_{h,x}^* + \Delta_{h,y}^*\bigr) u(x, y) = \mathcal{O}(h^4)$ as required.
[/proof]
## The Five-Point Formula
Having approximated the second derivative in one dimension, the challenge is to organise the two-dimensional approximation into a linear system we can actually solve. We must decide how to label the interior grid points, how to handle boundary values, and what matrix structure results from these choices — because the structure of the matrix determines whether efficient solvers are available. We impose a uniform square grid of spacing $h > 0$ on $\Omega$, and for simplicity we require that the boundary $\partial\Omega$ aligns with the grid: if a grid point lies inside $\Omega$, then all four of its neighbours lie in $\overline{\Omega}$. (The case of boundaries that cut across grid lines requires additional treatment and is discussed briefly later.)
[definition: Five-Point Method]
Consider the Poisson problem with Dirichlet boundary condition $u|_{\partial\Omega} = g$. Write $u_{i,j} \approx u(ih, jh)$ and $f_{i,j} = f(ih, jh)$. The **five-point method** replaces the PDE by the discrete equations
\begin{align*}
u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j} = h^2 f_{i,j}, \quad (ih, jh) \in \Omega.
\end{align*}
Whenever $(ih, jh) \in \partial\Omega$, we substitute the boundary value $g(ih, jh)$ directly in place of the unknown.
[/definition]
[remark: Boundary Function Notation]
The boundary condition here uses $g$ for the prescribed boundary data, consistent with standard finite difference notation where $g$ denotes the discrete boundary values. In the earlier definition of the Poisson problem, the boundary function was denoted $\phi$; the two are the same object, with $\phi$ the continuous function and $g$ its restriction to grid points on $\partial\Omega$.
[/remark]
The name "five-point" refers to the stencil of grid points involved: the central point $(i, j)$ and its four axis-aligned neighbours $(i \pm 1, j)$ and $(i, j \pm 1)$.
The discrete equations can be assembled into a linear system. Let $S_n = \{1, \ldots, n-1\}$ index the interior grid points in each direction, so there are $(n-1)^2$ unknowns in total. Stack the interior values $u_{i,j}$ into a vector $u$ by traversing grid points in some order, and collect the right-hand side values (incorporating boundary contributions) into a vector $f$. The five-point equations then take the matrix form
\begin{align*}
Au = f.
\end{align*}
[example: Natural Ordering of the Linear System]
In the **natural ordering**, we enumerate interior grid points row by row:
\begin{align*}
u = (u_{1,1},\, u_{1,2},\, \dots,\, u_{1,n-1},\, u_{2,1},\, \dots,\, u_{n-1,n-1})^\top.
\end{align*}
With this ordering, the matrix $A$ has block tridiagonal structure:
\begin{align*}
A = \begin{pmatrix}
B & I & & \\
I & B & I & \\
& \ddots & \ddots & \ddots \\
& & I & B
\end{pmatrix},
\end{align*}
where $I$ is the $(n-1) \times (n-1)$ identity matrix and
\begin{align*}
B = \begin{pmatrix}
-4 & 1 & & \\
1 & -4 & 1 & \\
& \ddots & \ddots & \ddots \\
& & 1 & -4
\end{pmatrix}.
\end{align*}
The block structure reflects the fact that points in the same row interact through $B$, while points in adjacent rows interact through $I$ (the coupling across rows enters only through the $u_{i, j \pm 1}$ terms).
[/example]
The matrix $A$ depends on the ordering chosen for the grid points. Different orderings — natural, diagonal, red-black — give different matrices, all representing the same discrete equations. Despite this dependence on ordering, three structural properties persist across all orderings:
- $A$ is **symmetric**,
- all diagonal entries equal $-4$,
- $A$ has the same sparsity pattern (each row has at most five nonzero entries).
Symmetry and the constant diagonal are consequences of the symmetry of the five-point stencil. The matrix is not invariant under reordering, but its eigenvalues are, since reordering corresponds to a similarity transformation by a permutation matrix.
[remark: Nine-Point Method]
A higher-order alternative, the **nine-point method**, uses all eight neighbours of each interior point and produces a local truncation error of $\mathcal{O}(h^4)$ rather than $\mathcal{O}(h^2)$ from the five-point stencil. However, a clever enhancement to the five-point method — adding the term $\frac{h^4}{12} \Delta_h f$ (approximated by another application of the five-point Laplacian) to the right-hand side — increases the order of the five-point scheme to $\mathcal{O}(h^4)$, matching the nine-point method without the extra stencil complexity. See Exercise 1.
[/remark]
## The Gershgorin Circle Theorem
To prove convergence, we need to control the spectral radius $\rho(A)$ of the five-point matrix $A$. The natural approach would be to compute the eigenvalues exactly — but for a general matrix this is as hard as solving the original problem. We need a cheap eigenvalue bound: one that requires only the entries of $A$, with no diagonalisation. The Gershgorin circle theorem provides exactly this.
[quotetheorem:1365]
[citeproof:1365]
[remark: Real Matrices]
For real matrices, the diagonal entries $a_{ii}$ are real, so the Gershgorin discs are centred on the real axis. This means that for real symmetric matrices, all eigenvalues already lie on the real line (by spectral theory), and the Gershgorin theorem can still give useful two-sided bounds on their range.
[/remark]
The Gershgorin theorem is a remarkably cheap estimate: it requires only the entries of $A$, with no diagonalisation. It will be applied in the next section to analyse the convergence of the five-point method.
[example: Gershgorin Bounds for the Five-Point Matrix]
Consider the block $B$ from the natural ordering example. Each diagonal entry equals $-4$, and in the interior rows each off-diagonal entry in $B$ contributes $1$, so $r_i = 2$ (within $B$). The coupling through the identity blocks contributes a further $1$ or $2$ to the Gershgorin radius of the full matrix $A$ (depending on whether the point is near the boundary). Hence the eigenvalues of $A$ lie in the union of discs centred at $-4$ with radii at most $4$, giving $\sigma(A) \subset \{z \in \mathbb{C} : |z + 4| \leq 4\}$. In particular all eigenvalues have nonpositive real part, consistent with the negative semi-definiteness of the discrete Laplacian.
[/example]
## Stability via Gershgorin: an Application
When is a finite difference time-stepping scheme stable? The question is whether the iteration matrix $A$ satisfies $\rho(A) \leq 1$ — if so, errors do not grow. For constant-coefficient problems this can be checked directly, but for variable-coefficient equations the matrix entries change from row to row, making direct eigenvalue computation hard. The Gershgorin theorem turns this into a simple entry-level computation.
[example: Stability of the Variable-Diffusion Explicit Scheme]
Let $a : [0, 1] \to \mathbb{R}$ be smooth with $a(x) > 0$. Consider the diffusion equation $u_t = (a u_x)_x$ with initial and Dirichlet boundary data. The explicit finite difference scheme is
\begin{align*}
u_m^{n+1} = u_m^n + \mu \left[ a_{m-1/2} u_{m-1}^n - \left(a_{m-1/2} + a_{m+1/2}\right) u_m^n + a_{m+1/2} u_{m+1}^n \right],
\end{align*}
where $a_s = a(sh)$, $h = \Delta x = \frac{1}{M+1}$, $\mu = \frac{\Delta t}{(\Delta x)^2}$, and $m = 1, \dots, M$. We show the method is stable for $0 < \mu < \frac{1}{2a_{\max}}$, where $a_{\max} = \max_{x \in [0,1]} a(x)$.
Write $u^{n+1} = Au^n$ where $A = I + \mu A_*$ and $A_*$ is the tridiagonal matrix with entries
\begin{align*}
(A_*)_{m,m} &= -(a_{m-1/2} + a_{m+1/2}), \\
(A_*)_{m,m+1} &= a_{m+1/2}, \quad (A_*)_{m,m-1} = a_{m-1/2}.
\end{align*}
By the Gershgorin theorem, $\sigma(A) \subset \bigcup_{m=1}^M \Gamma_m$. For an interior index $m \in \{2, \dots, M-1\}$, the diagonal entry of $A$ is
\begin{align*}
1 - \mu(a_{m-1/2} + a_{m+1/2}),
\end{align*}
and the Gershgorin radius is $\mu(a_{m-1/2} + a_{m+1/2})$. So
\begin{align*}
\Gamma_m = \left\{ z \in \mathbb{C} : \left| z - 1 + \mu(a_{m-1/2} + a_{m+1/2}) \right| \leq \mu(a_{m-1/2} + a_{m+1/2}) \right\}.
\end{align*}
This disc is centred at $1 - \mu(a_{m-1/2} + a_{m+1/2})$ on the real axis, with radius equal to the distance from the centre to $1$. Such a disc lies entirely within $[-1, 1]$ provided $\mu(a_{m-1/2} + a_{m+1/2}) \leq 1$, which holds when $\mu \leq \frac{1}{2a_{\max}}$. The boundary indices $m = 1$ and $m = M$ are handled analogously (they have only one neighbour, so the Gershgorin radius is smaller). Therefore $\sigma(A) \subset \overline{B}(0, 1)$, which gives $\rho(A) \leq 1$ and hence the method is stable.
[/example]
## The Choice of Grid Ordering
Does convergence depend on our arbitrary choice of how to number the grid points? We chose the natural ordering — row by row — but we could equally have numbered diagonally, in a red-black pattern, or in any other order. Each choice yields a different matrix $A^*$, and it is not obvious that the convergence analysis carries over.
[remark: Non-Examinable]
The material in this subsection is non-examinable. It provides structural context for why the results we prove about $A$ in the natural ordering are actually general.
[/remark]
The following result answers the question: all orderings are equivalent up to orthogonal conjugation, so any spectral or algebraic conclusion drawn from the natural ordering applies to every other ordering.
[quotetheorem:1366]
[citeproof:1366]
The significance of this result is that any spectral or algebraic property of $A$ that is preserved under orthogonal conjugation — such as eigenvalues, positive or negative definiteness, and the spectral radius — is shared by every ordering matrix $A^*$. For the purpose of this course we need only one specific ordering to work, so we do not push this further. But the proposition tells us that none of the convergence results we prove below depend on the particular labelling we chose.
## Spectral Properties of the Five-Point Matrix
The convergence proof we are building requires bounding $\|A^{-1}\|$. For a symmetric matrix, $\|A^{-1}\| = 1/|\lambda_{\min}|$, so we need to know the smallest eigenvalue in magnitude. Computing eigenvalues of a general matrix is expensive — but the five-point matrix on a square grid has a very special structure that makes an exact, closed-form computation possible.
[quotetheorem:1367]
[citeproof:1367]
[remark: Semi-Definiteness Without Boundary Conditions]
The argument above uses the Dirichlet boundary condition $u = 0$ on $\partial\Omega$ in an essential way: it is the boundary rows of $A$ (those with fewer than four off-diagonal entries) that rule out $\lambda = 0$. Without any boundary conditions, the matrix would be symmetric and negative semi-definite, but not necessarily negative definite.
[/remark]
Having established negative definiteness, we can go further and compute the eigenvalues of $A$ exactly. This is a remarkable feature of the five-point formula on a square grid: the eigenvectors are products of sines, matching the structure of the Laplacian on a square domain.
[quotetheorem:1368]
[citeproof:1368]
[remark: Linear Independence]
To confirm that these are all $m^2$ eigenvalues, one should check that the $m^2$ vectors $v^{(k,\ell)}$ are linearly independent. This can be verified, and is taken as given for the purposes of this course.
[/remark]
The formula for $\lambda_{k,\ell}$ tells us two things immediately. First, since $\sin^2\theta > 0$ for $\theta \in (0, \pi/2)$, all eigenvalues are strictly negative — consistent with the negative definiteness proved above. Second, the smallest eigenvalue in magnitude (closest to zero) occurs at $k = \ell = 1$:
\begin{align*}
\lambda_{\min} = \lambda_{1,1} = -8\sin^2\frac{\pi h}{2}.
\end{align*}
Since $h = \frac{1}{m+1}$ is small, $\sin(\pi h/2) \approx \pi h/2$, so $|\lambda_{\min}| \approx 2\pi^2 h^2$. This will control the norm of $A^{-1}$.
## Convergence of the Five-Point Formula
Does the discrete solution $u_{i,j}$ actually approximate the PDE solution $u(ih, jh)$ as the grid is refined? The five-point equations are consistent — they approximate $\Delta u = f$ to order $h^2$ locally — but local accuracy does not automatically imply global accuracy. With $O(h^{-2})$ grid points, local errors could in principle accumulate and overwhelm the approximation. The proof that follows shows they do not.
Let $\widehat{u}_{i,j} = u(ih, jh)$ denote the exact solution evaluated at grid points, and define the pointwise error
\begin{align*}
e_{i,j} = u_{i,j} - \widehat{u}_{i,j}.
\end{align*}
Collect these into a vector $e = (e_{i,j}) \in \mathbb{R}^n$ where $n = m^2$, equipped with the Euclidean norm
\begin{align*}
\|x\|^2 = \sum_{i=1}^m \sum_{j=1}^m |x_{i,j}|^2.
\end{align*}
[quotetheorem:1369]
[citeproof:1369]
[remark: The Role of $h = \frac{1}{m+1}$]
The bound $m^2 h^8 < h^6$ in Step 2 used $m < \frac{1}{h}$, which is exactly the content of $h = \frac{1}{m+1}$. This is the structural assumption that makes the grid fit neatly inside the unit square with zero boundary conditions. If $h$ were not of this form, the counting argument would require adjustment.
[/remark]
[remark: Operator Norm of a Symmetric Matrix]
In Step 3 we used the fact that $\|B\| = \rho(B)$ for symmetric $B$. This follows because: for symmetric $B$, the operator 2-norm satisfies $\|B\| = \sqrt{\rho(B^\top B)} = \sqrt{\rho(B^2)} = \rho(B)$, using $B^\top = B$ and the fact that $\rho(B^2) = \rho(B)^2$ for symmetric matrices. It would be incorrect to claim $\|B\| = \max_i |\lambda_i|$ for general (non-symmetric) matrices.
[/remark]
The theorem establishes that the five-point formula converges at rate $O(h)$ in the Euclidean norm — the global error decays linearly in the mesh spacing. This may seem surprising given that the local truncation error is $O(h^4)$; the gap arises because there are $O(h^{-2})$ grid points, which inflates $\|\eta\|$ from the pointwise bound $O(h^4)$ to the vector norm bound $O(h^3)$.
## Towards Efficient Solvers: The Block Structure of $A$
Gaussian elimination applied naively to the $n \times n$ system — with $n = m^2$ unknowns for an $m \times m$ grid — costs $O(n^3) = O(m^6)$ operations. For a modest grid of $m = 100$ this is $10^{12}$ operations, roughly a second on a fast processor; for $m = 1000$ it is $10^{18}$, or decades of computation. To make the five-point method usable on realistic problems, we need solvers that exploit the sparse block structure of $A$.
In natural (column-by-column) ordering, the system $Au = b$ has a block tridiagonal form:
\begin{align*}
\begin{bmatrix} B & I & & \\ I & B & \ddots & \\ & \ddots & \ddots & I \\ & & I & B \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix},
\end{align*}
where $u_k, b_k \in \mathbb{R}^m$ are the $k$-th column of unknowns and right-hand side values, and $B$ is the $m \times m$ tridiagonal matrix
\begin{align*}
B = \begin{bmatrix} -4 & 1 & & \\ 1 & -4 & \ddots & \\ & \ddots & \ddots & 1 \\ & & 1 & -4 \end{bmatrix}.
\end{align*}
The matrix $B$ is a TST (tridiagonal symmetric Toeplitz) matrix. From the appendix, its eigenvectors and eigenvalues are known explicitly:
\begin{align*}
Bq_\ell = \lambda_\ell q_\ell, \qquad \lambda_\ell = -4 + 2\cos\frac{\ell\pi}{m+1}, \qquad (q_\ell)_j = \gamma_m \sin\frac{j\ell\pi}{m+1}, \quad \ell = 1, \ldots, m,
\end{align*}
where $\gamma_m = \sqrt{\frac{2}{m+1}}$ is a normalisation constant. This gives $B = QDQ^{-1}$ where $D = \operatorname{diag}(\lambda_\ell)$ and $Q$ is the orthogonal matrix of eigenvectors. A key feature is that $Q$ is symmetric ($Q = Q^\top$), so $Q^{-1} = Q$ and $B = QDQ$.
Setting $v_k = Qu_k$ and $c_k = Qb_k$ and substituting $B = QDQ$ into the block system, one obtains the transformed system
\begin{align*}
\begin{bmatrix} D & I & & \\ I & D & \ddots & \\ & \ddots & \ddots & I \\ & & I & D \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix}.
\end{align*}
Now $D$ is diagonal rather than tridiagonal, but the blocks are still coupled. The Hockney method resolves this by a further reordering — by rows rather than columns — which decouples the system completely.
[definition: The Hockney Method]
Suppose the five-point linear system can be written in block form with a TST diagonal block $B$ and identity off-diagonal blocks, as above. The **Hockney method** proceeds as follows.
1. Diagonalise $B = QDQ$ using the explicit eigenvectors of the TST matrix, with $Q$ the symmetric orthogonal matrix of eigenvectors.
2. Apply $Q$ column-by-column: set $v_k = Qu_k$ and $c_k = Qb_k$ to obtain a system with diagonal block $D$.
3. Reorder from column-first to row-first enumeration via the permutation matrix $P$ satisfying $P^\top = P^{-1}$ and
\begin{align*}
P : (v_i)_j \mapsto (v_j)_i.
\end{align*}
Under this reordering,
\begin{align*}
P^\top \begin{bmatrix} D & I & & \\ I & D & \ddots & \\ & \ddots & \ddots & I \\ & & I & D \end{bmatrix} P = \begin{bmatrix} \Lambda_1 & & & \\ & \Lambda_2 & & \\ & & \ddots & \\ & & & \Lambda_m \end{bmatrix},
\end{align*}
where each diagonal block is the TST matrix
\begin{align*}
\Lambda_k = \begin{bmatrix} \lambda_k & 1 & & \\ 1 & \lambda_k & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & \lambda_k \end{bmatrix}_{m \times m}, \qquad k = 1, \ldots, m.
\end{align*}
4. The resulting block-diagonal system decomposes into $m$ independent $m \times m$ tridiagonal systems $\Lambda_k \widehat{v}_k = \widehat{c}_k$, each of which can be solved in $O(m)$ operations.
[/definition]
The computational bottleneck of the Hockney method lies in applying $Q$ to each of the $m$ column vectors $u_k$ and $b_k$, requiring $2m$ matrix-vector multiplications with $Q$. Each naive multiplication costs $O(m^2)$, giving $O(m^3)$ total — better than Gaussian elimination at $O(m^6)$, but still significant.
The decisive observation is that multiplication by $Q$ is equivalent to a discrete sine transform. Indeed, for a vector $y \in \mathbb{R}^m$:
\begin{align*}
(Qy)_\ell = \gamma_m \sum_{j=1}^m \sin\frac{\pi j\ell}{m+1} y_j = \gamma_m \operatorname{Im}\sum_{j=0}^{2m+1} \exp\frac{2\pi i j\ell}{2(m+1)} y_j, \qquad \ell = 1, \ldots, m,
\end{align*}
where we extended the sum with zeros ($y_{m+1} = \cdots = y_{2m+1} = 0$). This is a Discrete Fourier Transform (DFT) of a zero-padded vector of length $2(m+1)$. The DFT can be computed in $O(m \log m)$ operations via the Fast Fourier Transform (FFT), reducing the total cost of the Hockney method to $O(m^2 \log m)$, or equivalently $O(n \log n)$ where $n = m^2$ is the number of unknowns. This is the subject of the next section.
## The Fast Poisson Solver via the DFT
The Hockney method reduces the problem to $2m$ matrix–vector multiplications with the orthogonal matrix $Q$. Each multiplication costs $O(m^2)$ naively, giving $O(m^3)$ total — identical to band Gaussian elimination, and no real improvement. The bottleneck is not the tridiagonal solves; it is the multiplications by $Q$. The question is whether these can be done faster than $O(m^2)$ per multiply. The answer hinges on recognising that $Q$ is not an arbitrary orthogonal matrix: its entries are sines, and the multiplication by $Q$ is a discrete sine transform, which is a variant of the Discrete Fourier Transform (DFT). The DFT can be computed in $O(m \log m)$ operations via the Fast Fourier Transform (FFT), bringing the total solver cost to $O(m^2 \log m)$.
The bottleneck computation has the form
\begin{align*}
(Qy)_{\ell} = \sum_{j=1}^{m} \sin \frac{\pi j \ell}{m+1} y_j = \operatorname{Im} \sum_{j=0}^{2m+1} \exp \frac{2\pi i j \ell}{2m+2} y_j, \quad \ell = 1, \ldots, m,
\end{align*}
where $y_{m+1} = \cdots = y_{2m+1} = 0$. The right-hand side is the imaginary part of a discrete Fourier transform of a zero-padded vector. This means that if we can compute the DFT efficiently, we can solve the five-point equations fast. In this section we develop the necessary machinery: the DFT and the Fast Fourier Transform (FFT).
### Periodic Sequences and the DFT
[definition: Space of Bi-Infinite Sequences]
The **space of all bi-infinite complex sequences** is
\begin{align*}
\Pi = \left\{ x = \{x_\ell\}_{\ell \in \mathbb{Z}} : x_\ell \in \mathbb{C} \right\}.
\end{align*}
[/definition]
Among all bi-infinite sequences, the periodic ones play a special role because they are completely determined by finitely many values: a single period. A sequence $x \in \Pi$ is called **$n$-periodic** if there exists $n \in \mathbb{N}$ such that $x_{\ell+n} = x_\ell$ for all $\ell \in \mathbb{Z}$. We write $\Pi_n \subset \Pi$ for the space of all bi-infinite complex $n$-periodic sequences. Such a sequence is completely determined by its values on any $n$ consecutive indices, so $\Pi_n$ is isomorphic to $\mathbb{C}^n$.
This periodicity is precisely what the DFT exploits: by treating input vectors as one period of a periodic sequence, the DFT decomposes them into a discrete sum of complex exponentials, each corresponding to a different frequency.
[definition: Discrete Fourier Transform]
Let $x \in \Pi_n$. Set $\omega_n = \exp\!\left(\frac{2\pi i}{n}\right)$, the primitive $n$th root of unity. The **discrete Fourier transform (DFT)** of $x$ is the map
\begin{align*}
\mathcal{F}_n : \Pi_n &\to \Pi_n \\
x &\mapsto y : \quad y_j = \sum_{\ell=0}^{n-1} \omega_n^{j\ell} x_\ell, \quad j = 0, \ldots, n-1.
\end{align*}
As a linear map, $\mathcal{F}_n$ is represented by the **DFT matrix**
\begin{align*}
\mathcal{F}_n = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & \omega_n & \cdots & \omega_n^{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_n^{n-1} & \cdots & \omega_n^{(n-1)^2} \end{bmatrix}.
\end{align*}
[/definition]
The DFT matrix has a clear structure: the $(j,\ell)$ entry is $\omega_n^{j\ell}$, where indices run from $0$ to $n-1$. The map $\mathcal{F}_n$ is invertible, which is the content of the following proposition.
[quotetheorem:1370]
[citeproof:1370]
The inversion formula is the analogue of the continuous Fourier inversion theorem. The key point for the Poisson solver is that $\mathcal{F}_n^{-1}$ has the same structure as $\mathcal{F}_n$ itself — complex exponentials at negative frequencies, with a normalisation factor of $1/n$. This means the FFT algorithm for $\mathcal{F}_n$ can be run in reverse (with conjugated twiddle factors) to compute $\mathcal{F}_n^{-1}$ at the same $O(n \log n)$ cost. The total cost of applying $Q$ and $Q^{-1}$ in the Hockney method is therefore $O(m \log m)$ per column, giving $O(m^2 \log m)$ overall.
[remark: Normalisation Convention]
The course uses the unnormalised forward transform $y_j = \sum_\ell \omega_n^{j\ell} x_\ell$ and places the factor of $1/n$ in the inverse. Some references split the factor symmetrically as $1/\sqrt{n}$ in each direction; the choice does not affect the structure of the algorithm, only the normalisation of the output.
[/remark]
The naive computation of $\mathcal{F}_n x$ by matrix–vector multiplication costs $O(n^2)$ arithmetic operations: each of the $n$ output components requires a dot product of length $n$. For the Poisson solver on an $m \times m$ grid we need $2m$ such transforms of vectors of length $\approx 2m$, giving a total cost of $O(m^3)$. This is no better than a direct solve by Gaussian elimination. The key insight is that the structure of $\mathcal{F}_n$ can be exploited to reduce this dramatically.
### The Fast Fourier Transform
The FFT is an algorithm that computes $\mathcal{F}_n x$ in $O(n \log n)$ operations when $n = 2^p$ is a power of two. The idea is a divide-and-conquer recursion: express the DFT of a vector of length $2m$ in terms of the DFTs of two vectors of length $m$, obtained by separating even and odd indexed entries, then repeat.
[definition: Even and Odd Subsequences]
For a vector $z \in \mathbb{C}^{2m}$, define the **even subsequence** $z^{(\mathrm{e})} \in \mathbb{C}^m$ and the **odd subsequence** $z^{(\mathrm{o})} \in \mathbb{C}^m$ by
\begin{align*}
z_j^{(\mathrm{e})} = z_{2j}, \quad z_j^{(\mathrm{o})} = z_{2j+1}, \quad j = 0, \ldots, m-1.
\end{align*}
[/definition]
The splitting identity underlying the FFT is the following.
[quotetheorem:1371]
[citeproof:1371]
The splitting identity says that to compute both $(\mathcal{F}_{2m}z)_\ell$ and $(\mathcal{F}_{2m}z)_{\ell+m}$ from the half-length transforms, we need only one multiplication by $\omega_{2m}^\ell$. This is the "butterfly" operation at the heart of every FFT implementation.
[remark: Matrix Factorisation View]
The splitting identity can be encoded as a matrix factorisation. Writing $n = 2^k$, we have
\begin{align*}
\mathcal{F}_{2^k} = \begin{pmatrix} I & D \\ I & -D \end{pmatrix} \begin{pmatrix} \mathcal{F}_{2^{k-1}} & 0 \\ 0 & \mathcal{F}_{2^{k-1}} \end{pmatrix} P,
\end{align*}
where $D = \operatorname{diag}(1, \omega_{2^k}, \omega_{2^k}^2, \ldots, \omega_{2^k}^{2^{k-1}-1})$ is the diagonal twiddle-factor matrix, and $P$ is the permutation matrix that interleaves the even and odd entries. Applying this factorisation recursively gives the FFT.
[/remark]
The algorithm resulting from this recursion is described precisely as follows.
[definition: Fast Fourier Transform Algorithm]
The **Fast Fourier Transform (FFT)** computes $\mathcal{F}_n y$ for $y \in \mathbb{C}^n$ where $n = 2^p$ as follows.
**Downward pass (splitting):** Recursively split $y$ into even and odd subsequences until reaching $2^p$ vectors of length $1$. The Fourier transform of a length-$1$ vector is the vector itself.
**Upward pass (reconstruction):** At stage $s = 1, 2, \ldots, p$, reconstruct all DFTs of length $2^s$ from the DFTs of their even and odd children of length $2^{s-1}$. For each parent vector of length $2^s$ at stage $s$, set
\begin{align*}
q_\ell^{(\mathrm{parent})} &= q_\ell^{(\mathrm{e})} + \omega_{2^s}^\ell q_\ell^{(\mathrm{o})}, \quad \ell = 0, \ldots, 2^{s-1}-1, \\
q_{\ell + 2^{s-1}}^{(\mathrm{parent})} &= q_\ell^{(\mathrm{e})} - \omega_{2^s}^\ell q_\ell^{(\mathrm{o})}, \quad \ell = 0, \ldots, 2^{s-1}-1.
\end{align*}
After $p$ stages, the top-level vector is $\mathcal{F}_n y$.
[/definition]
[quotetheorem:1372]
[citeproof:1372]
The contrast with the naive $O(n^2)$ cost is dramatic. For $n = 2^{20} \approx 10^6$, the FFT requires roughly $10^7$ operations while direct matrix–vector multiplication would require $10^{12}$. This makes the FFT one of the most important algorithms in applied mathematics and signal processing.
[example: FFT of an Alternating Vector]
We compute $\mathcal{F}_4 y$ for $y = (1, -1, 1, -1)^\top$ using the FFT recursion. Here $n = 4 = 2^2$, so $p = 2$ and $\omega_4 = e^{2\pi i/4} = i$.
**Level 0 (base case):** Split $y$ into even and odd subsequences:
\begin{align*}
y^{(\mathrm{e})} = (y_0, y_2) = (1, 1), \quad y^{(\mathrm{o})} = (y_1, y_3) = (-1, -1).
\end{align*}
Split each again:
\begin{align*}
y^{(\mathrm{ee})} = (1), \quad y^{(\mathrm{eo})} = (1), \quad y^{(\mathrm{oe})} = (-1), \quad y^{(\mathrm{oo})} = (-1).
\end{align*}
The DFT of a scalar is itself: $\mathcal{F}_1(1) = (1)$ and $\mathcal{F}_1(-1) = (-1)$.
**Stage 1 (reconstruct length-2 DFTs):** For $y^{(\mathrm{e})} = (1,1)$ with $\omega_2 = e^{i\pi} = -1$:
\begin{align*}
(\mathcal{F}_2 y^{(\mathrm{e})})_0 &= 1 + \omega_2^0 \cdot 1 = 1 + 1 = 2, \\
(\mathcal{F}_2 y^{(\mathrm{e})})_1 &= 1 - \omega_2^0 \cdot 1 = 1 - 1 = 0.
\end{align*}
For $y^{(\mathrm{o})} = (-1, -1)$:
\begin{align*}
(\mathcal{F}_2 y^{(\mathrm{o})})_0 &= -1 + (-1) = -2, \\
(\mathcal{F}_2 y^{(\mathrm{o})})_1 &= -1 - (-1) = 0.
\end{align*}
**Stage 2 (reconstruct length-4 DFT):** With $\omega_4 = i$:
\begin{align*}
(\mathcal{F}_4 y)_0 &= 2 + \omega_4^0 \cdot (-2) = 2 + (-2) = 0, \\
(\mathcal{F}_4 y)_1 &= 0 + \omega_4^1 \cdot 0 = 0 + i \cdot 0 = 0, \\
(\mathcal{F}_4 y)_2 &= 2 - \omega_4^0 \cdot (-2) = 2 - (-2) = 4, \\
(\mathcal{F}_4 y)_3 &= 0 - \omega_4^1 \cdot 0 = 0.
\end{align*}
Therefore $\mathcal{F}_4(1, -1, 1, -1)^\top = (0, 0, 4, 0)^\top$.
This result makes sense: the sequence $(1, -1, 1, -1)$ is a pure oscillation at frequency $\ell = 2$ (half the Nyquist frequency for $n = 4$), so the DFT is concentrated at index $j = 2$.
[/example]
### Completing the Fast Poisson Solver
We now return to the Hockney method and explain how the FFT gives an $O(m^2 \log m)$ solver for the five-point equations on an $m \times m$ grid.
Recall from Section B that the matrix–vector products with $Q$ took the form
\begin{align*}
(Qy)_\ell = \operatorname{Im} \sum_{j=0}^{2m+1} \exp\!\left(\frac{2\pi i j\ell}{2m+2}\right) y_j, \quad \ell = 1, \ldots, m,
\end{align*}
where $y_{m+1} = \cdots = y_{2m+1} = 0$. The sum on the right is precisely $(\mathcal{F}_{2m+2}\,\tilde{y})_\ell$ where $\tilde{y}$ is $y$ zero-padded to length $2m+2$. Provided $2m+2$ is a power of two (which can be arranged by choosing $m$ appropriately, or by zero-padding further), this DFT can be computed by the FFT in $O(m \log m)$ operations.
The complete fast Poisson solver then runs as follows:
1. Apply $Q$ to each of the $m$ column vectors $b_1, \ldots, b_m$, using the FFT. Cost: $m \cdot O(m \log m) = O(m^2 \log m)$.
2. Re-index to obtain the decoupled system with diagonal blocks $\Lambda_k$ (as in the Hockney method).
3. Solve each of the $m$ tridiagonal systems $\Lambda_k \hat{v}_k = \hat{c}_k$ for $k = 1, \ldots, m$. Each tridiagonal solve costs $O(m)$. Total: $O(m^2)$.
4. Invert the re-indexing and apply $Q^{-1} = Q^\top = Q$ (recall $Q$ is symmetric orthogonal) using the FFT. Cost: $O(m^2 \log m)$.
The overall cost is $O(m^2 \log m)$. Since the grid has $n = m^2$ unknowns, this is $O(n \log n)$, a substantial improvement over the $O(n^{3/2})$ cost of band Gaussian elimination or the $O(n^3)$ cost of dense linear algebra.
[remark: Scope of the Hockney–FFT Approach]
The fast Poisson solver described here applies directly to the square domain with homogeneous Dirichlet boundary conditions, where the coefficient matrix has the exact block-tridiagonal TST structure. For other geometries or boundary conditions, the approach must be adapted — for instance, using the FFT along one direction while solving tridiagonal systems along the other — but the core idea of exploiting eigendecompositions of TST matrices via the FFT remains central to many fast solvers in computational science.
[/remark]
The Poisson solver's $O(m^2 \log m)$ complexity via FFT demonstrates the power of frequency-domain methods for elliptic problems. Time-dependent PDEs require a fundamentally different discretization strategy: we must evolve solutions forward in time while maintaining spatial accuracy, introducing new stability constraints that neither direct solvers nor spectral methods alone can fully address.
# 2. PDEs of Evolution
## From Elliptic to Parabolic: Equations in Time
The preceding chapter studied the Poisson equation — a static, elliptic PDE describing steady-state phenomena such as temperature distributions at equilibrium or electrostatic potentials. this chapter turns to the other fundamental class: **PDEs of evolution**, which describe how physical fields change over time. The canonical examples are the heat equation and the first-order advection equation, and both demand numerical methods that advance a spatial discretisation forward in time.
The central challenge is that time introduces a second discretisation parameter, and the interaction between the spatial step $h = \Delta x$ and the time step $k = \Delta t$ can cause a numerical method to behave catastrophically. The ratio $\mu = k/h^2$ governs whether computed solutions remain bounded or grow without limit — a phenomenon called **instability**. This section develops the finite difference method for the heat equation, establishes the notions of consistency, stability, and convergence, and proves the Lax Equivalence Theorem connecting all three.
[remark: Heat and Diffusion]
Throughout this chapter, the heat equation and the diffusion equation refer to the same PDE. The two names reflect different physical contexts (thermal conduction versus particle diffusion) but the mathematical structure is identical.
[/remark]
## The Heat Equation and Explicit Euler Discretisation
The model problem for this chapter is the one-dimensional heat equation with homogeneous Dirichlet boundary conditions:
[definition: Heat Equation Initial-Boundary Value Problem]
The **heat equation IBVP** on the unit interval is the problem
\begin{align*}
\frac{\partial u}{\partial t} &= \frac{\partial^2 u}{\partial x^2}, \quad 0 \leq x \leq 1, \quad t \geq 0, \\
u(x, 0) &= u_0(x), \\
u(0, t) &= u(1, t) = 0,
\end{align*}
where $u_0 : [0,1] \to \mathbb{R}$ is a given initial profile.
[/definition]
To discretise this problem we introduce a uniform rectangular mesh. In the spatial direction, set $h = \Delta x = 1/(M+1)$ and grid points $x_m = mh$ for $m = 0, 1, \ldots, M+1$. In the time direction, set $k = \Delta t > 0$ and time levels $t_n = nk$ for $n = 0, 1, 2, \ldots$. We seek numerical approximations $U_m^n \approx u(x_m, t_n)$.
Applying Taylor's theorem to the exact solution $u$ at the point $(x, t)$:
\begin{align*}
\frac{\partial u(x, t)}{\partial t} &= \frac{1}{k}\bigl[u(x, t+k) - u(x, t)\bigr] + \mathcal{O}(k), \\
\frac{\partial^2 u(x, t)}{\partial x^2} &= \frac{1}{h^2}\bigl[u(x-h, t) - 2u(x, t) + u(x+h, t)\bigr] + \mathcal{O}(h^2).
\end{align*}
Substituting into the PDE and rearranging, the true solution satisfies
\begin{align*}
u(x, t+k) = u(x, t) + \frac{k}{h^2}\bigl[u(x-h, t) - 2u(x,t) + u(x+h, t)\bigr] + \eta(x, t)
\end{align*}
where $\eta(x, t) = \mathcal{O}(k^2 + kh^2)$. More precisely, by Taylor's theorem,
\begin{align*}
|\eta(x,t)| \leq c_1 k^2 + c_2 k h^2
\end{align*}
where $c_1 = \frac{1}{2} \max_{\xi, \tau} \left|\frac{\partial^2 u}{\partial t^2}(\xi, \tau)\right|$ and $c_2 = \frac{1}{12} \max_{\xi, \tau}\left|\frac{\partial^4 u}{\partial x^4}(\xi, \tau)\right|$.
Dropping the remainder term $\eta$ motivates the following numerical scheme.
[definition: Explicit Euler Finite Difference Scheme]
The **explicit Euler scheme** (or forward-time centred-space scheme) for the heat equation is
\begin{align*}
U_m^{n+1} = U_m^n + \mu\bigl(U_{m-1}^n - 2U_m^n + U_{m+1}^n\bigr), \quad m = 1, \ldots, M,
\end{align*}
where $\mu = k/h^2 = \Delta t / (\Delta x)^2$ is called the **Courant number** (or mesh ratio). The boundary conditions are enforced by setting $U_0^n = U_{M+1}^n = 0$ for all $n$.
[/definition]
The Courant number $\mu$ encodes the relative rate of time stepping to spatial resolution. With $\mu$ held fixed as $h \to 0$, we have $k = \mu h^2$, so that the local truncation error $\eta = \mathcal{O}(k^2 + kh^2) = \mathcal{O}(\mu h^4)$ is of order $h^4$.
[remark: Method of Lines Interpretation]
An alternative perspective on the scheme above is the **method of lines**: first discretise spatially to obtain a system of ODEs $\dot{u}_h = A_* u_h$ (where $A_*$ is the standard second-difference matrix), then apply a time-stepping method to that ODE system. The explicit Euler scheme arises from applying forward Euler to the spatial semi-discretisation. The implicit Euler and Crank–Nicolson schemes arise from applying backward Euler and the trapezoidal rule respectively. This viewpoint is developed more carefully in the next section on semidiscretisation.
[/remark]
## Convergence and the Role of the Courant Number
The explicit Euler scheme is **convergent** when the numerical solution approaches the exact solution as $h \to 0$ (with $\mu$ fixed). We formalise this precisely.
[definition: Convergent Method]
The explicit Euler scheme is said to be **convergent** if, for every fixed $\mu > 0$ and every $T > 0$,
\begin{align*}
\lim_{\substack{h \to 0,\; k \to 0 \\ k/h^2 = \mu}} \max_{\substack{1 \leq m \leq M \\ 0 \leq n \leq T/k}} \left|U_m^n - u(mh, nk)\right| = 0.
\end{align*}
[/definition]
[quotetheorem:1373]
[citeproof:1373]
[remark: The Bound Motivates Lax Equivalence]
The estimate $\|e^{n+1}\|_\infty \leq \|e^n\|_\infty + ch^4$ is the prototype of a more abstract pattern: stability (no amplification) plus consistency (small local error) implies convergence. The Lax Equivalence Theorem formalises this observation precisely.
[/remark]
## Stability, Consistency, and the Lax Equivalence Theorem
The convergence proof above can be abstracted to apply to any time-stepping method expressible in matrix-vector form. Suppose the numerical scheme can be written as
\begin{align*}
u^{n+1} = A_h u^n,
\end{align*}
where $u^n \in \mathbb{R}^M$ and $A_h \in \mathbb{R}^{M \times M}$ depends on the mesh spacing $h = 1/M$.
[definition: Stability of a Time-Stepping Method]
A numerical method of the form $u^{n+1} = A_h u^n$ is **stable** with respect to a norm $\|\cdot\|$ if the sequence $(\|u^n\|)_{n \geq 0}$ is bounded for every initial vector $u^0$.
[/definition]
Fix a norm $\|\cdot\|$ and let $\|A_h\| := \sup_{u \neq 0} \|A_h u\| / \|u\|$ denote the induced operator norm. Since $\|u^{n+1}\| \leq \|A_h\| \|u^n\|$, the condition
\begin{align*}
\|A_h\| \leq 1 \quad \text{for all sufficiently small } h
\end{align*}
implies stability: $\|u^n\| \leq \|A_h\|^n \|u^0\| \leq \|u^0\|$.
The local truncation error can also be written in vector form. Let $\hat{u}^n = (u(x_m, t_n))_{1 \leq m \leq M}$ denote the vector of exact solution values. Define
\begin{align*}
\eta^n := \hat{u}^{n+1} - A_h \hat{u}^n.
\end{align*}
The error vector $e^n = \hat{u}^n - u^n$ satisfies the recurrence
\begin{align*}
e^{n+1} = A_h e^n + \eta^n.
\end{align*}
Applying the triangle inequality repeatedly and using $\|A_h\| \leq 1$ with $e^0 = \mathbf{0}$:
\begin{align*}
\|e^n\| \leq \|\eta^0\| + \|\eta^1\| + \cdots + \|\eta^{n-1}\|.
\end{align*}
[definition: Consistency]
A numerical method is **consistent** with respect to a norm $\|\cdot\|$ if the local truncation error satisfies $\|\eta^n\| = \mathcal{O}(k^2)$ as $k \to 0$.
[/definition]
If consistency holds, then $\|\eta^n\| \leq ck^2$ for some constant $c > 0$, and therefore
\begin{align*}
\|e^n\| \leq n c k^2 \leq \frac{T}{k} \cdot c k^2 = cTk \to 0 \quad (k \to 0),
\end{align*}
uniformly in $n \leq T/k$. Stability plus consistency thus implies convergence. This is the content of the Lax Equivalence Theorem.
[quotetheorem:1374]
[citeproof:1374]
The Lax Equivalence Theorem is the central organising principle of numerical analysis for linear evolution equations: it reduces the question of convergence (which requires tracking errors over many time steps) to two independent questions — consistency (a local, algebraic computation) and stability (a norm bound on $A_h$).
## Norms for Stability Analysis
The stability condition $\|A_h\| \leq 1$ depends on the choice of norm. Two choices are standard in practice.
### The Sup-Norm
Choose the $\ell^\infty$ norm $\|u\|_\infty = \max_{i=1,\ldots,M} |u_i|$. The induced matrix norm is the maximum absolute row sum:
\begin{align*}
\|A\|_{\infty \to \infty} = \max_{i=1,\ldots,M} \sum_{j=1}^M |A_{ij}|.
\end{align*}
To see why: the bound $\|Au\|_\infty \leq \|A\|_{\infty \to \infty} \|u\|_\infty$ follows from the triangle inequality. Equality is achieved by choosing $u_j = \operatorname{sign}(A_{ij^*})$ where $i^* = \arg\max_i \sum_j |A_{ij}|$.
For the explicit Euler scheme, the amplification matrix is the tridiagonal matrix
\begin{align*}
A_h = \begin{pmatrix} 1-2\mu & \mu & & \\ \mu & \ddots & \ddots & \\ & \ddots & \ddots & \mu \\ & & \mu & 1-2\mu \end{pmatrix}_{M \times M}.
\end{align*}
Each row sum is $|1-2\mu| + 2\mu$. When $\mu \leq 1/2$, this equals $(1-2\mu) + 2\mu = 1$, so $\|A_h\|_{\infty \to \infty} = 1 \leq 1$ and the scheme is stable. This recovers the stability condition found in the convergence proof.
### The Normalised Euclidean Norm
Another natural choice is the normalised $\ell^2$ norm
\begin{align*}
\|u\| := \left(\frac{1}{M} \sum_{i=1}^M |u_i|^2\right)^{1/2}.
\end{align*}
The normalisation by $1/M$ is chosen so that $\|u\|$ approximates the $L^2([0,1])$ norm of the piecewise constant interpolant: as $h = 1/(M+1) \to 0$,
\begin{align*}
\left(\frac{1}{M} \sum_{i=1}^M |u_i|^2\right)^{1/2} \to \left(\int_0^1 |u(x)|^2 \, dx\right)^{1/2} = \|u\|_{L^2}.
\end{align*}
The induced matrix norm is the **spectral norm** $\|A\|_2 := \sup_{u \neq 0} \|Au\|_2 / \|u\|_2$, which equals $[\rho(A A^\top)]^{1/2}$ where $\rho$ denotes the spectral radius. For a special class of matrices, the spectral norm and the spectral radius coincide.
[definition: Normal Matrix]
A matrix $A \in \mathbb{C}^{M \times M}$ is **normal** if it commutes with its conjugate transpose: $A \bar{A}^\top = \bar{A}^\top A$.
[/definition]
The class of normal matrices includes real symmetric matrices, Hermitian matrices, real skew-symmetric matrices, skew-Hermitian matrices, and unitary matrices. For each of these, the spectral norm equals the spectral radius.
[quotetheorem:1375]
[citeproof:1375]
## Spectral Analysis of the Explicit Euler Scheme
The spectral norm result can be used to determine the stability of the explicit Euler scheme under the normalised $L^2$ norm.
[example: Spectral Stability Analysis of Explicit Euler]
Write the explicit Euler scheme in matrix form as $u_h^{n+1} = A_h u_h^n$ with
\begin{align*}
A_h = I + \mu A_*, \qquad A_* = \begin{pmatrix} -2 & 1 & & \\ 1 & \ddots & \ddots & \\ & \ddots & \ddots & 1 \\ & & 1 & -2 \end{pmatrix}_{M \times M}.
\end{align*}
The matrix $A_*$ is a Toeplitz symmetric tridiagonal (TST) matrix. By the eigenvalue formula for TST matrices (see Appendix), its eigenvalues are
\begin{align*}
\lambda_\ell(A_*) = -4\sin^2\frac{\pi \ell h}{2}, \quad \ell = 1, \ldots, M,
\end{align*}
so the eigenvalues of $A_h$ are
\begin{align*}
\lambda_\ell(A_h) = 1 - 4\mu \sin^2\frac{\pi \ell h}{2}, \quad \ell = 1, \ldots, M.
\end{align*}
Since $A_h$ is real and symmetric, it is normal, and $\|A_h\|_2 = \rho(A_h)$.
The spectrum of $A_h$ lies in the interval $[1 - 4\mu, 1]$ (since $\sin^2(\cdot) \in [0,1]$). Therefore:
**Case $\mu \leq 1/2$:** Every eigenvalue satisfies $|\lambda_\ell(A_h)| \leq 1$, so $\|A_h\|_2 = \rho(A_h) \leq 1$. The scheme is stable and $\|u^n\|_2 \leq \|u^0\|_2$ for all $n$.
**Case $\mu > 1/2$:** For sufficiently small $h$ (specifically, for $h \leq h_\mu$ where $h_\mu$ is determined by the condition that $|1 - 4\mu \cos^2(\pi h/2)| > 1$), the eigenvalue corresponding to $\ell = M$ satisfies $|\lambda_M(A_h)| > 1$. Choosing $u^0$ as the corresponding eigenvector gives $u^n = \lambda_M^n u^0$, which grows without bound.
The condition $\mu \leq 1/2$ is thus both necessary and sufficient for stability of the explicit Euler scheme in the $L^2$ norm.
[/example]
[explanation: Why the Stability Condition Matters in Practice]
The constraint $\mu = k/h^2 \leq 1/2$ is a severe restriction when $h$ is small: halving the spatial step requires reducing the time step by a factor of four to maintain stability. This quadratic coupling between $h$ and $k$ makes the explicit Euler scheme computationally expensive for fine spatial grids. For example, to resolve spatial features at scale $h = 10^{-3}$, one must take $k \leq h^2/2 = 5 \times 10^{-7}$, requiring two million time steps per unit time. Implicit methods circumvent this restriction at the cost of solving a linear system at each time step.
[/explanation]
## Implicit Euler and the Theta-Method
The explicit Euler scheme evaluates the right-hand side of the heat equation at the current time level $t_n$. The **implicit Euler scheme** instead evaluates it at the next time level $t_{n+1}$, leading to an unconditionally stable method.
[definition: Implicit Euler Scheme]
The **implicit Euler scheme** for the heat equation is
\begin{align*}
U_m^{n+1} = U_m^n + \mu\bigl(U_{m-1}^{n+1} - 2U_m^{n+1} + U_{m+1}^{n+1}\bigr), \quad m = 1, \ldots, M.
\end{align*}
In matrix form: $(I - \mu A_*)u^{n+1} = u^n$, so each time step requires solving the linear system $(I - \mu A_*)u^{n+1} = u^n$.
[/definition]
The amplification matrix for the implicit Euler scheme is $B_h = (I - \mu A_*)^{-1}$. Since $A_*$ has eigenvalues $\lambda_\ell(A_*) = -4\sin^2(\pi \ell h / 2) \leq 0$, the eigenvalues of $B_h$ are
\begin{align*}
\lambda_\ell(B_h) = \frac{1}{1 + 4\mu\sin^2\frac{\pi \ell h}{2}} \in (0, 1]
\end{align*}
for all $\mu > 0$. Thus $\|B_h\|_2 = \rho(B_h) \leq 1$ for every $\mu > 0$, and the implicit Euler scheme is unconditionally stable — it is stable for any time step $k$, regardless of $h$.
Both the explicit and implicit Euler schemes are only first-order accurate in time (truncation error $\mathcal{O}(k)$ in time, $\mathcal{O}(h^2)$ in space). The **theta-method** provides a one-parameter family that includes both as special cases and admits second-order accuracy.
[definition: Theta-Method]
For $\theta \in [0,1]$, the **theta-method** (or $\theta$-scheme) for the heat equation is
\begin{align*}
U_m^{n+1} - U_m^n = \mu \Bigl[\theta\bigl(U_{m-1}^{n+1} - 2U_m^{n+1} + U_{m+1}^{n+1}\bigr) + (1-\theta)\bigl(U_{m-1}^n - 2U_m^n + U_{m+1}^n\bigr)\Bigr].
\end{align*}
In matrix form: $(I - \theta\mu A_*)u^{n+1} = (I + (1-\theta)\mu A_*)u^n$.
Special cases: $\theta = 0$ gives the explicit Euler scheme; $\theta = 1$ gives the implicit Euler scheme; $\theta = 1/2$ gives the **Crank–Nicolson scheme**.
[/definition]
The amplification matrix of the theta-method is $A_h^\theta = (I - \theta\mu A_*)^{-1}(I + (1-\theta)\mu A_*)$, with eigenvalues
\begin{align*}
\lambda_\ell(A_h^\theta) = \frac{1 - 4(1-\theta)\mu\sin^2\frac{\pi\ell h}{2}}{1 + 4\theta\mu\sin^2\frac{\pi\ell h}{2}}.
\end{align*}
For stability we need $|\lambda_\ell(A_h^\theta)| \leq 1$ for all $\ell$. Since the eigenvalues of $A_*$ are non-positive, the numerator can be negative. One checks:
- $\theta \geq 1/2$: for all $\mu > 0$, $|\lambda_\ell(A_h^\theta)| \leq 1$ — unconditionally stable.
- $\theta < 1/2$: stable only if $\mu \leq \frac{1}{2(1-2\theta)}$ — a conditional stability requirement.
[example: Crank-Nicolson Stability]
For $\theta = 1/2$ (Crank–Nicolson), the eigenvalues of the amplification matrix are
\begin{align*}
\lambda_\ell = \frac{1 - 2\mu\sin^2\frac{\pi\ell h}{2}}{1 + 2\mu\sin^2\frac{\pi\ell h}{2}} \in (-1, 1)
\end{align*}
for all $\mu > 0$ and all $\ell$. Thus $\rho(A_h^{1/2}) < 1$ for all mesh parameters, confirming unconditional stability.
Moreover, Crank–Nicolson is second-order accurate in time: the local truncation error is $\mathcal{O}(k^2 + h^2)$. This follows from the fact that the theta-method at $\theta = 1/2$ corresponds to the trapezoidal rule for the ODE system $\dot{u}_h = A_* u_h$, which has error $\mathcal{O}(k^2)$ per step.
[/example]
[remark: Choosing Between Explicit and Implicit Schemes]
The explicit Euler scheme is computationally cheap per time step (matrix-vector multiplication only) but requires $k = O(h^2)$ for stability — a severe constraint at fine spatial resolutions. The implicit Euler and Crank–Nicolson schemes are unconditionally stable, allowing larger time steps, but require solving a tridiagonal linear system at each step. For the heat equation in one space dimension, this system has bandwidth 3 and can be solved in $O(M)$ operations using the Thomas algorithm (a specialised form of Gaussian elimination). The extra cost per step is modest, making implicit methods practically preferred for parabolic problems.
[/remark]
## Summary of Stability Conditions
The table below summarises the stability properties of the main schemes for the heat equation under the spectral ($L^2$) norm.
| Scheme | $\theta$ | Stability condition | Order in time |
|---|---|---|---|
| Explicit Euler | $0$ | $\mu \leq 1/2$ | $\mathcal{O}(k)$ |
| Implicit Euler | $1$ | Unconditional | $\mathcal{O}(k)$ |
| Crank–Nicolson | $1/2$ | Unconditional | $\mathcal{O}(k^2)$ |
| Theta-method | $\theta \geq 1/2$ | Unconditional | $\mathcal{O}(k)$ for $\theta \neq 1/2$ |
All schemes are second-order accurate in space: the spatial truncation error from the central difference approximation of $\partial^2 u/\partial x^2$ is $\mathcal{O}(h^2)$.
## Semidiscretization and the Method of Lines
The finite difference approach from Section A treats both space and time simultaneously. A cleaner alternative is to discretize space first, producing a system of ordinary differential equations in time, and then apply any ODE solver of our choosing. This two-stage strategy is called **semidiscretization**.
[definition: Semidiscretization]
**Semidiscretization** is a method for solving PDEs by first discretizing the spatial variable(s) and leaving the time variable $t$ continuous. The result is a system of ODEs in time, which may then be solved by any numerical or analytical ODE method.
When both stages are handled numerically — spatial discretization followed by a numerical ODE solver — the combined procedure is called the **method of lines**.
[/definition]
The method of lines reframes PDE numerics as ODE numerics. Every scheme from the ODE toolbox (forward Euler, backward Euler, the trapezoidal rule, Runge–Kutta methods, etc.) becomes a candidate time-integrator, and the analysis of each resulting scheme decomposes cleanly into a spatial truncation error from the first stage and a time-integration error from the second.
## The Heat Equation via Semidiscretization
[example: Semidiscretization of the Diffusion Equation]
Consider the diffusion (heat) equation on $[0, 1]$:
\begin{align*}
\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}, \qquad 0 \leq x \leq 1,\; t \geq 0,
\end{align*}
with $u(x, 0) = u_0(x)$ and Dirichlet boundary conditions $u(0, t) = u(1, t) = 0$.
Introduce a uniform spatial mesh $x_m = mh$ with $h = \frac{1}{M+1}$ and $m = 1, \ldots, M$. Replacing $\frac{\partial^2 u}{\partial x^2}$ by the second central difference (and leaving $t$ continuous) gives the ODE system
\begin{align*}
\frac{d u_m}{d t} = \frac{1}{h^2}(u_{m-1} - 2u_m + u_{m+1}), \qquad m = 1, \ldots, M.
\end{align*}
In vector form, writing $u(t) = [u_1(t), \ldots, u_M(t)]^\top$, this is
\begin{align*}
\frac{d u}{d t} = \frac{1}{h^2} A_* u,
\end{align*}
where $A_*$ is the $M \times M$ tridiagonal matrix with $-2$ on the diagonal and $1$ on the sub- and superdiagonals (a TST matrix). The boundary values $u_0 = u_{M+1} = 0$ enter via the boundary conditions.
**Applying forward Euler.** Setting $u_m^n \approx u_m(t_n)$ with $t_n = nk$ and applying the forward Euler step $y_{n+1} = y_n + k f(t_n, y_n)$ to the ODE system yields
\begin{align*}
u_m^{n+1} = u_m^n + \mu(u_{m-1}^n - 2u_m^n + u_{m+1}^n), \qquad m = 1, \ldots, M,
\end{align*}
where $\mu = k/h^2$ is the Courant number. This is precisely the explicit scheme derived directly in Section A, confirming that the method of lines reproduces familiar schemes when standard ODE solvers are applied.
**Applying backward Euler.** The backward Euler step $y_{n+1} = y_n + k f(t_{n+1}, y_{n+1})$ gives instead the implicit scheme
\begin{align*}
u_m^{n+1} - \mu(u_{m-1}^{n+1} - 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n,
\end{align*}
which requires solving a tridiagonal linear system at each time step.
[/example]
[remark: Courant Number]
The dimensionless ratio $\mu = k/h^2$ controls the relative step sizes. For the heat equation, $\mu$ appears as the natural coupling constant between spatial and temporal resolution. Both the stability condition and the truncation error are naturally expressed in terms of $\mu$.
[/remark]
## The Crank–Nicolson Scheme
Among all ODE integrators applied to the semidiscrete heat equation, the **trapezoidal rule** produces a scheme that balances accuracy and stability in a particularly favorable way. The resulting method is called the **Crank–Nicolson scheme**.
[example: The Crank–Nicolson Scheme for the Heat Equation]
Applying the trapezoidal rule,
\begin{align*}
y_{n+1} = y_n + \frac{k}{2}\bigl[f(t_n, y_n) + f(t_{n+1}, y_{n+1})\bigr],
\end{align*}
to the semidiscrete heat equation yields
\begin{align*}
u_m^{n+1} - \frac{\mu}{2}(u_{m-1}^{n+1} - 2u_m^{n+1} + u_{m+1}^{n+1}) = u_m^n + \frac{\mu}{2}(u_{m-1}^n - 2u_m^n + u_{m+1}^n), \qquad m = 1, \ldots, M.
\end{align*}
In matrix form, $u^{n+1} = B^{-1} C u^n$, where
\begin{align*}
B = I - \tfrac{\mu}{2} A_*, \qquad C = I + \tfrac{\mu}{2} A_*,
\end{align*}
and $A_*$ is the same tridiagonal TST matrix as before. Each time step requires solving the $M \times M$ tridiagonal system $B u^{n+1} = C u^n$, which can be done in $\mathcal{O}(M)$ operations.
**Consistency.** Let $u(x, t)$ be the exact solution. Writing $u_m^n = u(mh, nk)$ and expanding using the exact PDE and Taylor's theorem, one finds
\begin{align*}
u_m^{n+1} &= u_m^n + \int_{nk}^{(n+1)k} \frac{\partial u}{\partial t}(mh, \tau)\, d\tau \\
&= u_m^n + \int_{nk}^{(n+1)k} \frac{\partial^2 u}{\partial x^2}(mh, \tau)\, d\tau \\
&= u_m^n + \frac{k}{2h^2}\bigl(u_{m+1}^{n+1} - 2u_m^{n+1} + u_{m-1}^{n+1} + u_{m+1}^n - 2u_m^n + u_{m-1}^n\bigr) + \mathcal{O}(kh^2 + k^3).
\end{align*}
Hence the local truncation error satisfies $\eta_m^n = \mathcal{O}(k^3 + kh^2)$ (the $k^3$ term is inherited from the trapezoidal rule, which is second-order in time; compare $\mathcal{O}(k^2)$ for Euler). In the mesh norm $\|\eta^n\| = \{h \sum_{m=1}^M |\eta_m^n|^2\}^{1/2}$, this gives $\|\eta^n\| = \mathcal{O}(k^3 + kh^2)$.
**Stability.** Since $A_*$ is a TST matrix, all $M \times M$ TST matrices share the same orthonormal eigenvectors (see the appendix). In particular, $B$ and $C$ are simultaneously diagonalizable: writing $A_* = V D V^\top$ with $V$ orthogonal, we have
\begin{align*}
B = V(I - \tfrac{\mu}{2} D)V^\top, \qquad C = V(I + \tfrac{\mu}{2} D)V^\top,
\end{align*}
and therefore
\begin{align*}
B^{-1}C = V(I - \tfrac{\mu}{2} D)^{-1}(I + \tfrac{\mu}{2} D)V^\top.
\end{align*}
The eigenvalues of $A_*$ are $\lambda_k(A_*) = -2 + 2\cos\frac{\pi k}{M+1} = -4\sin^2\frac{\pi k h}{2} \leq 0$. The corresponding eigenvalue of $B^{-1}C$ is
\begin{align*}
\Lambda_k = \frac{1 + \tfrac{\mu}{2}\lambda_k(A_*)}{1 - \tfrac{\mu}{2}\lambda_k(A_*)} = \frac{1 - 2\mu\sin^2\frac{\pi k h}{2}}{1 + 2\mu\sin^2\frac{\pi k h}{2}}.
\end{align*}
Since $\lambda_k(A_*) \leq 0$, both numerator and denominator are positive, with the numerator no larger than the denominator. Therefore $|\Lambda_k| \leq 1$ for all $k = 1, \ldots, M$ and all $\mu > 0$. Since $B^{-1}C$ is normal (being unitarily diagonalizable), its operator norm equals its spectral radius, so $\|B^{-1}C\| \leq 1$.
**Error accumulation.** The error $e^n = u^n - [u(mh, nk)]_m$ satisfies $Be^{n+1} = Ce^n + \eta^n$, which gives
\begin{align*}
\|e^{n+1}\| \leq \|B^{-1}C\|\, \|e^n\| + \|B^{-1}\|\, \|\eta^n\|.
\end{align*}
Since all eigenvalues of $B$ satisfy $|\lambda_k(B)| = 1 + 2\mu\sin^2\frac{\pi kh}{2} \geq 1$ (Gershgorin's theorem also gives $\|B^{-1}\| \leq 1$), we obtain $\|e^{n+1}\| \leq \|e^n\| + \|\eta^n\|$. With $e^0 = 0$ and $n \leq T/k$,
\begin{align*}
\|e^n\| \leq \frac{T}{k}\|\eta\| \leq \frac{cT}{k}(k^3 + kh^2) = cT(k^2 + h^2).
\end{align*}
Choosing $k = \alpha h$ for any fixed $\alpha > 0$ therefore gives $\mathcal{O}(h^2)$ convergence — second order in both space and time.
[/example]
The key conclusion is that Crank–Nicolson is **unconditionally stable**: unlike the forward Euler scheme, which requires $\mu \leq \frac{1}{2}$, Crank–Nicolson is stable for any $\mu > 0$. This removes the severe restriction linking the time step to the square of the spatial step size.
[remark: Cost Per Step]
Each Crank–Nicolson step requires solving a tridiagonal system of size $M$, which costs $\mathcal{O}(M)$ operations. Despite being implicit, the scheme is computationally efficient because the TST structure of $B$ makes the linear solve cheap.
[/remark]
## The Advection Equation
The method of lines applies equally to hyperbolic equations. We now apply it to the advection equation
\begin{align*}
\frac{\partial u}{\partial t} = \frac{\partial u}{\partial x},
\end{align*}
which models pure transport. Here the spatial operator is a first derivative rather than the Laplacian, and we will see that the stability picture changes significantly depending on the time integrator.
Discretizing the right-hand side with the second-order centered difference $\frac{\partial u}{\partial x} \approx \frac{1}{2h}(u(x+h,t) - u(x-h,t))$ gives the semidiscrete system
\begin{align*}
\frac{d u_m}{d t} = \frac{1}{2h}(u_{m+1} - u_{m-1}), \qquad m = 1, \ldots, M.
\end{align*}
In matrix form this is $\frac{du}{dt} = \frac{1}{2h} A_* u$, where $A_*$ is now the $M \times M$ skew-symmetric tridiagonal matrix with $+1$ on the superdiagonal and $-1$ on the subdiagonal. Note that $A_*^\top = -A_*$, so $A_*$ is normal with purely imaginary eigenvalues $\lambda_k(A_*) = i\gamma_k$ for real $\gamma_k$.
[example: Crank–Nicolson for the Advection Equation]
Applying the trapezoidal rule with step $k$ and writing $\mu = k/h$ gives the scheme
\begin{align*}
u_m^{n+1} = u_m^n + \frac{\mu}{4}(u_{m+1}^{n+1} - u_{m-1}^{n+1}) + \frac{\mu}{4}(u_{m+1}^n - u_{m-1}^n), \qquad m = 1, \ldots, M.
\end{align*}
In matrix form, $u^{n+1} = B^{-1}Cu^n$ with
\begin{align*}
B = I - \tfrac{\mu}{4} A_*, \qquad C = I + \tfrac{\mu}{4} A_*,
\end{align*}
where $A_*$ is skew-symmetric. Since $A_*$ is normal, $B$ and $C$ are simultaneously diagonalizable in an orthonormal basis. The eigenvalues of $B^{-1}C$ are
\begin{align*}
\Lambda_k = \frac{1 + \frac{\mu}{4} i\gamma_k}{1 - \frac{\mu}{4} i\gamma_k}.
\end{align*}
This is a ratio of a complex number and its conjugate, so $|\Lambda_k| = 1$ for all $k = 1, \ldots, M$ and all $\mu > 0$. Therefore $\|B^{-1}C\| = 1$, and Crank–Nicolson is stable for the advection equation for all $\mu > 0$.
[/example]
[remark: Why Eigenvalue Ratios Factor]
In both the heat and advection examples we used the identity $\lambda_k(B^{-1}C) = \lambda_k(C)/\lambda_k(B)$. This factoring holds because $B$ and $C$ commute and are simultaneously diagonalizable — a special property of TST (and more generally, circulant-type) matrices. In general, $\lambda(BC) \neq \lambda(B)\lambda(C)$.
[/remark]
[example: Forward Euler for the Advection Equation]
Suppose instead we apply forward Euler to the semidiscrete advection equation:
\begin{align*}
u_m^{n+1} - u_m^n = \frac{\mu}{2}(u_{m+1}^n - u_{m-1}^n), \qquad m = 1, \ldots, M.
\end{align*}
Then $u^{n+1} = Au^n$ where $A = I + \frac{\mu}{2} A_*$ and $A_*$ is the skew-symmetric tridiagonal matrix above.
The only eigenvalue of $A$ is $\lambda = 1$ (since $A_*$ has purely imaginary eigenvalues, $A$ has eigenvalues $1 + \frac{\mu}{2}i\gamma_k$, so $|{\lambda_k(A)}| = \sqrt{1 + \frac{\mu^2\gamma_k^2}{4}} > 1$ for all $k$ with $\gamma_k \neq 0$). More directly, $A$ is not normal (since $A_*$ is skew-symmetric, $A = I + \frac{\mu}{2}A_*$ satisfies $A^\top = I - \frac{\mu}{2}A_*$, and $AA^\top \neq A^\top A$ in general), so the spectral radius does not control stability. Instead, the relevant quantity is the induced norm. Using the row-sum formula $\|A\|_{\infty \to \infty}$ (the maximum absolute row sum), one finds $\|A\|_{\infty \to \infty} = |1 - \mu| + \mu$ when the scheme is written out explicitly. This is less than or equal to 1 precisely when $\mu \leq 1$. For $\mu > 1$, the method is unstable.
In contrast to Crank–Nicolson, forward Euler for the advection equation requires a CFL condition $\mu = k/h \leq 1$ for stability.
[/example]
The comparison of the two time integrators on the same spatial discretization reveals a fundamental pattern: **implicit methods (backward Euler, trapezoidal rule) tend to be unconditionally stable**, while **explicit methods (forward Euler) impose a step-size restriction** that couples $k$ to $h$. For the heat equation the restriction is $\mu = k/h^2 \leq \frac{1}{2}$, which forces $k \ll h$ when $h$ is small. For the advection equation the restriction is $\mu = k/h \leq 1$, which is less severe. This difference reflects the different character of the two equations: the heat equation's second-order spatial operator amplifies the stiffness of the ODE system much more than the first-order advection operator. The next section introduces Fourier analysis as a systematic framework for understanding and proving these stability results.
## Fourier Analysis of Finite Difference Schemes
The semidiscretization approach of Section B transforms a PDE into a system of ODEs, but it does not immediately tell us whether the resulting fully discrete scheme is stable. Stability analysis via the eigenvalue structure of the iteration matrix works when the matrix is explicit and manageable, but becomes impractical for large systems or multidimensional problems. A far more systematic and computationally tractable approach is **von Neumann stability analysis**, which uses the Fourier transform to convert a spatial recurrence into a scalar amplitude equation. The key idea is simple: each spatial frequency evolves independently, and stability holds if and only if no frequency grows in amplitude.
### Fourier Analysis on $\ell_2(\mathbb{Z})$
We consider finite difference schemes posed on the infinite integer lattice — the **Cauchy problem** setting, where the spatial domain is all of $\mathbb{R}$ and there are no boundary conditions. The initial data is required to be square-integrable, which ensures that Fourier methods are applicable.
The general class of schemes we analyse has the form
\begin{align*}
\sum_{k=r}^{s} b_k U_{m+k}^{n+1} = \sum_{k=r}^{s} c_k U_{m+k}^{n}, \quad m \in \mathbb{Z},\; n \in \mathbb{Z}^+,
\end{align*}
where the coefficients $b_k$ and $c_k$ are independent of $m$ and $n$ but may depend on the Courant number $\mu$. A crucial observation is that stability is a property of this algebraic recurrence, not of the underlying PDE: different PDEs can lead to the same recurrence structure, and the Fourier analysis below applies uniformly.
[definition: Fourier Transform on $\ell_2(\mathbb{Z})$]
Let $v = (v_m)_{m \in \mathbb{Z}} \in \ell_2(\mathbb{Z})$. The **Fourier transform** of $v$ is the function $\hat{v}: [-\pi, \pi] \to \mathbb{C}$ defined by
\begin{align*}
\hat{v}(\theta) = \sum_{m \in \mathbb{Z}} e^{-im\theta} v_m, \quad \theta \in [-\pi, \pi].
\end{align*}
[/definition]
We equip sequences and their Fourier transforms with the norms
\begin{align*}
\|v\| = \left(\sum_{m \in \mathbb{Z}} |v_m|^2\right)^{1/2}, \qquad \|\hat{v}\|_* = \left(\frac{1}{2\pi} \int_{-\pi}^{\pi} |\hat{v}(\theta)|^2\, d\theta\right)^{1/2}.
\end{align*}
The fundamental isometry property of this transform is Parseval's identity.
[quotetheorem:434]
[citeproof:434]
Parseval's identity tells us that the Fourier transform is an isometry: it preserves the $\ell_2$-norm. This is the key that makes Fourier analysis useful for stability: if we can control $\|\hat{u}^n\|_*$, we control $\|U^n\|$.
### The Amplification Factor and the von Neumann Criterion
To analyse the recurrence, we take the Fourier transform of both sides. Fix $\theta \in [-\pi, \pi]$ and let $\hat{U}^n(\theta) = \sum_{m \in \mathbb{Z}} e^{-im\theta} U_m^n$. Multiplying both sides of the recurrence by $e^{-im\theta}$ and summing over $m \in \mathbb{Z}$, we use the shift property: if we replace $m$ by $m-k$ as the summation index, then
\begin{align*}
\sum_{m \in \mathbb{Z}} e^{-im\theta} U_{m+k}^n = e^{ik\theta} \hat{U}^n(\theta).
\end{align*}
The recurrence therefore transforms to the scalar equation
\begin{align*}
\hat{U}^{n+1}(\theta) = H(\theta)\, \hat{U}^n(\theta),
\end{align*}
where
[definition: Amplification Factor]
The **amplification factor** of the recurrence $\sum_k b_k U_{m+k}^{n+1} = \sum_k c_k U_{m+k}^n$ is the function $H: [-\pi,\pi] \to \mathbb{C}$ defined by
\begin{align*}
H(\theta) = \frac{\sum_{k=r}^{s} c_k\, e^{ik\theta}}{\sum_{k=r}^{s} b_k\, e^{ik\theta}}.
\end{align*}
[/definition]
The amplification factor encodes exactly how each Fourier mode $e^{im\theta}$ is scaled at each time step. Iterating the transformed recurrence gives $\hat{U}^n(\theta) = [H(\theta)]^n \hat{U}^0(\theta)$. For the numerical solution to remain bounded, we need $[H(\theta)]^n$ to stay bounded for all $n$, which requires $|H(\theta)| \leq 1$.
[quotetheorem:1376]
[citeproof:1376]
[remark: von Neumann as a Necessary and Sufficient Condition]
For one-step schemes ($r = s = 0$ on the left and the right sums are two-point), von Neumann's condition $|H(\theta)| \leq 1$ for all $\theta$ is both necessary and sufficient for $\ell_2$-stability. For multi-step schemes, the situation is more delicate and requires the root condition discussed below.
[/remark]
### Stability Analysis for the Diffusion Equation
The von Neumann criterion gives an explicit algebraic condition on the scheme parameters. We now apply it to the three main schemes for the heat equation $\partial_t u = \partial_x^2 u$.
[example: Stability of Euler, Backward Euler, and Crank-Nicolson for Diffusion]
For the Cauchy problem for the diffusion equation $u_t = u_{xx}$, with Courant number $\mu = k/h^2$:
**Forward Euler.** The scheme $U_m^{n+1} = U_m^n + \mu(U_{m-1}^n - 2U_m^n + U_{m+1}^n)$ has amplification factor
\begin{align*}
H(\theta) = 1 + \mu(e^{-i\theta} - 2 + e^{i\theta}) = 1 - 4\mu \sin^2\frac{\theta}{2}.
\end{align*}
Since $\sin^2(\theta/2) \in [0, 1]$, we have $H(\theta) \in [1 - 4\mu,\, 1]$. The condition $|H(\theta)| \leq 1$ requires $1 - 4\mu \geq -1$, that is, $\mu \leq 1/2$. The scheme is therefore stable if and only if $\mu \leq 1/2$.
**Backward Euler.** The scheme $U_m^{n+1} - \mu(U_{m-1}^{n+1} - 2U_m^{n+1} + U_{m+1}^{n+1}) = U_m^n$ gives
\begin{align*}
H(\theta) = \frac{1}{1 - \mu(e^{-i\theta} - 2 + e^{i\theta})} = \frac{1}{1 + 4\mu\sin^2\frac{\theta}{2}} \in (0, 1].
\end{align*}
Since $4\mu\sin^2(\theta/2) \geq 0$ for all $\mu > 0$, we have $|H(\theta)| \leq 1$ for all $\theta$ and all $\mu > 0$. Backward Euler is unconditionally stable.
**Crank-Nicolson.** The scheme averages the explicit and implicit diffusion operators:
\begin{align*}
U_m^{n+1} - \tfrac{1}{2}\mu(U_{m-1}^{n+1} - 2U_m^{n+1} + U_{m+1}^{n+1}) = U_m^n + \tfrac{1}{2}\mu(U_{m-1}^n - 2U_m^n + U_{m+1}^n).
\end{align*}
The amplification factor is
\begin{align*}
H(\theta) = \frac{1 - 2\mu\sin^2\frac{\theta}{2}}{1 + 2\mu\sin^2\frac{\theta}{2}} \in (-1, 1].
\end{align*}
Since both numerator and denominator are real with $|1 - 2\mu\sin^2(\theta/2)| \leq 1 + 2\mu\sin^2(\theta/2)$ for all $\mu > 0$, we have $|H(\theta)| \leq 1$ for all $\mu > 0$. Crank-Nicolson is unconditionally stable.
[/example]
These three examples illustrate a general principle: explicit schemes are conditionally stable (they impose a time-step restriction $k \lesssim h^2$ for diffusion), while implicit schemes are unconditionally stable. Crank-Nicolson achieves unconditional stability while retaining second-order accuracy in both space and time, making it the preferred scheme when computational cost permits.
### Stability of Advection Schemes
The advection equation $u_t = u_x$ is deceptively simple — its exact solution is just a shift, $u(x,t) = \varphi(x+t)$ — but its numerical discretization is sensitive to the direction of the spatial difference. This sensitivity is captured precisely by the von Neumann analysis.
[example: Downwind Instability]
Approximating the spatial derivative with the **downwind** (backward) difference $\partial_x u \approx (U_m - U_{m-1})/h$ and applying forward Euler gives
\begin{align*}
U_m^{n+1} = U_m^n + \mu(U_m^n - U_{m-1}^n), \quad \mu = k/h.
\end{align*}
The amplification factor is $H(\theta) = 1 + \mu - \mu e^{-i\theta}$. At $\theta = \pi/2$:
\begin{align*}
|H(\pi/2)|^2 = (1+\mu)^2 + \mu^2 > 1
\end{align*}
for any $\mu > 0$. The downwind scheme is therefore unconditionally unstable for the advection equation $u_t = u_x$.
[/example]
The geometric reason for the failure is that the downwind difference "looks in the wrong direction": the advection carries information from left to right (for $u_t = u_x$), but the difference is drawing on information from the wrong side of the stencil.
[example: Upwind Scheme]
Approximating with the **upwind** (forward) difference $\partial_x u \approx (U_{m+1} - U_m)/h$ and applying forward Euler gives
\begin{align*}
U_m^{n+1} = U_m^n + \mu(U_{m+1}^n - U_m^n), \quad \mu = k/h.
\end{align*}
The amplification factor is $H(\theta) = 1 - \mu + \mu e^{i\theta}$. Then
\begin{align*}
|H(\theta)|^2 = (1-\mu)^2 + 2\mu(1-\mu)\cos\theta + \mu^2 = 1 - 2\mu(1-\mu)(1 - \cos\theta).
\end{align*}
For $\mu \in [0,1]$ we have $\mu(1-\mu) \geq 0$ and $1 - \cos\theta \geq 0$, so $|H(\theta)| \leq 1$ for all $\theta$. The upwind scheme is stable for $\mu \leq 1$.
[/example]
The condition $\mu = k/h \leq 1$ for the advection equation is an instance of the **Courant–Friedrichs–Lewy (CFL) condition**: the numerical domain of dependence must contain the physical domain of dependence.
### Multi-Step Schemes and the Root Condition
The amplification factor framework as stated applies directly to one-step schemes. For multi-step schemes, the Fourier transform converts the spatial recurrence into a temporal linear recurrence of higher order, and the stability analysis must track the roots of the associated characteristic polynomial.
[example: Leapfrog Scheme for the Advection Equation]
Semidiscretizing the advection equation with the centred difference $\partial_x u \approx (U_{m+1} - U_{m-1})/(2h)$ and advancing in time with the midpoint rule $y_{n+1} = y_{n-1} + 2k f(t_n, y_n)$ gives the **leapfrog method**:
\begin{align*}
U_m^{n+1} = \mu(U_{m+1}^n - U_{m-1}^n) + U_m^{n-1}, \quad \mu = k/h.
\end{align*}
Taking the Fourier transform in space, we obtain the second-order temporal recurrence
\begin{align*}
\hat{U}^{n+1}(\theta) - 2i\mu\sin\theta\; \hat{U}^n(\theta) - \hat{U}^{n-1}(\theta) = 0.
\end{align*}
The general solution of a recurrence $w_{n+1} + bw_n + cw_{n-1} = 0$ is $w_n = C_1 \lambda_1^n + C_2 \lambda_2^n$ where $\lambda_1, \lambda_2$ are the roots of $\lambda^2 + b\lambda + c = 0$, provided the roots are distinct. When $\lambda_1 = \lambda_2 = \lambda$, the general solution is $w_n = (C_1 + C_2 n)\lambda^n$, which grows linearly in $n$ even when $|\lambda| = 1$.
The characteristic equation here is $\lambda^2 - 2i\mu\sin\theta \cdot \lambda - 1 = 0$, with roots
\begin{align*}
\lambda_{1,2}(\theta) = i\mu\sin\theta \pm \sqrt{1 - \mu^2\sin^2\theta}.
\end{align*}
We analyse three cases:
- $\mu < 1$: The discriminant $1 - \mu^2\sin^2\theta > 0$ for all $\theta$, so the square root is real and the roots are distinct. Then $|\lambda_{1,2}|^2 = (1 - \mu^2\sin^2\theta) + \mu^2\sin^2\theta = 1$. Both roots lie on the unit circle and are distinct, so $|\hat{U}^n(\theta)|$ is bounded. The scheme is stable.
- $\mu = 1$: At $\theta = \pi/2$ we get $\lambda_1 = \lambda_2 = i$. The solution is $(C_1 + C_2 n)\lambda^n$, which grows like $n$. The scheme is unstable.
- $\mu > 1$: At $\theta = \pi/2$ the roots are $i\mu \pm \sqrt{\mu^2 - 1} \cdot i$, both pure imaginary with $|\lambda| = |\mu \pm \sqrt{\mu^2-1}|$. The larger root has modulus $\mu + \sqrt{\mu^2 - 1} > 1$, so the scheme is unstable.
[/example]
[remark: The Root Condition for Multi-Step Schemes]
For a multi-step scheme, $\hat{U}^n(\theta)$ satisfies a linear recurrence whose characteristic roots $\lambda_1(\theta), \ldots, \lambda_p(\theta)$ depend on $\theta$. The scheme is stable if and only if: (i) all roots satisfy $|\lambda_j(\theta)| \leq 1$ for all $\theta$, and (ii) any root with $|\lambda_j(\theta)| = 1$ is a simple root (not a repeated root). The leapfrog example shows that equality in (i) alone is not sufficient: one must also verify simplicity.
[/remark]
### Fourier Analysis for the Wave Equation
The same Fourier framework extends beyond first-order hyperbolic equations. The wave equation $u_{tt} = u_{xx}$ is a second-order hyperbolic PDE, and its standard second-order finite difference discretization is a two-step scheme whose stability analysis mirrors the leapfrog analysis.
[example: Stability of the Wave Equation Scheme]
Consider the wave equation $u_{tt} = u_{xx}$ for $t \geq 0$, given with initial data $u(x,0)$ and $u_t(x,0)$. The standard second-order scheme is
\begin{align*}
U_m^{n+1} - 2U_m^n + U_m^{n-1} = \mu(U_{m+1}^n - 2U_m^n + U_{m-1}^n),
\end{align*}
where now $\mu = k^2/h^2$. Taking the Fourier transform in $m$ gives the temporal recurrence
\begin{align*}
\hat{U}^{n+1}(\theta) - 2\left(1 - 2\mu\sin^2\frac{\theta}{2}\right)\hat{U}^n(\theta) + \hat{U}^{n-1}(\theta) = 0.
\end{align*}
The characteristic polynomial is $\lambda^2 - 2\beta\lambda + 1 = 0$ where $\beta = 1 - 2\mu\sin^2(\theta/2)$. Two structural observations:
1. The product of the roots equals $1$ (the constant term of the characteristic polynomial).
2. The coefficients of the polynomial are real, so the roots are complex conjugates when they are non-real.
These two facts together force $|\lambda_1| = |\lambda_2|$. Combined with $\lambda_1\lambda_2 = 1$, we get $|\lambda_1| = |\lambda_2| = 1$. However, as in the leapfrog case, we must exclude repeated roots. The roots coincide when the discriminant vanishes: $\beta^2 = 1$, i.e., $\beta = \pm 1$. The case $\beta = 1$ gives $\mu\sin^2(\theta/2) = 0$, which is unavoidable at $\theta = 0$. The case $\beta = -1$ gives $1 - 2\mu\sin^2(\theta/2) = -1$, i.e., $\mu\sin^2(\theta/2) = 1$. For this to hold for some $\theta \in (-\pi,\pi)$ we need $\mu > 1$, and at $\mu = 1$ it holds at $\theta = \pi/2$. Thus the scheme is stable if and only if $\mu = k^2/h^2 < 1$.
[/example]
[remark: Structural Argument for Wave Equation Stability]
The argument above — product of roots is $1$ plus real coefficients implies both roots lie on the unit circle — is a useful pattern. When analyzing a two-step scheme, always check first whether the characteristic polynomial has product-of-roots equal to $1$ and real coefficients. If so, the unit-circle constraint is automatic, and the analysis reduces to verifying that no repeated roots occur.
[/remark]
### Fourier Analysis in Two Space Dimensions
So far we have worked with one spatial dimension. For problems in $\mathbb{R}^2$, the Fourier analysis generalises directly: one introduces a two-dimensional spatial frequency $(\theta, \psi) \in [-\pi,\pi]^2$ and analyses the amplification factor as a function of this pair.
[definition: Two-Dimensional Fourier Transform on $\ell_2(\mathbb{Z}^2)$]
For a doubly-indexed sequence $U = (U_{\ell,m})_{\ell,m \in \mathbb{Z}} \in \ell_2(\mathbb{Z}^2)$, the **two-dimensional Fourier transform** is
\begin{align*}
\hat{U}(\theta,\psi) = \sum_{\ell,m \in \mathbb{Z}} U_{\ell,m}\, e^{-i(\ell\theta + m\psi)}, \quad (\theta,\psi) \in [-\pi,\pi]^2.
\end{align*}
The transform is an isometry from $\ell_2(\mathbb{Z}^2)$ to $L^2([-\pi,\pi]^2)$:
\begin{align*}
\left(\sum_{\ell,m \in \mathbb{Z}} |U_{\ell,m}|^2\right)^{1/2} = \left(\frac{1}{4\pi^2}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} |\hat{U}(\theta,\psi)|^2\, d\theta\, d\psi\right)^{1/2}.
\end{align*}
[/definition]
The stability criterion carries over without change: the scheme is stable if and only if $|H(\theta,\psi)| \leq 1$ for all $(\theta,\psi) \in [-\pi,\pi]^2$.
[example: Forward Euler for Two-Dimensional Diffusion]
For the diffusion equation $u_t = \Delta u$ on $[0,1]^2$, extended to the Cauchy problem on $\mathbb{R}^2$, the five-point Euler scheme is
\begin{align*}
U_{\ell,m}^{n+1} = U_{\ell,m}^n + \mu\left(U_{\ell-1,m}^n + U_{\ell+1,m}^n + U_{\ell,m-1}^n + U_{\ell,m+1}^n - 4U_{\ell,m}^n\right),
\end{align*}
with $\mu = k/h^2$. The two-dimensional amplification factor is
\begin{align*}
H(\theta,\psi) &= 1 + \mu(e^{-i\theta} + e^{i\theta} + e^{-i\psi} + e^{i\psi} - 4) \\
&= 1 - 4\mu\left(\sin^2\frac{\theta}{2} + \sin^2\frac{\psi}{2}\right).
\end{align*}
The worst case is $\theta = \psi = \pi$, giving $H = 1 - 8\mu$. The condition $|H| \leq 1$ requires $\mu \leq 1/4$. The two-dimensional Euler scheme is thus stable if and only if $\mu \leq 1/4$, which is a factor of $2$ more restrictive than the one-dimensional case.
[/example]
This tightened stability condition reflects the additive nature of the Laplacian: each spatial direction contributes independently to the amplification factor, and the worst-case frequency excites both directions simultaneously.
### Operator Splitting for Multi-Dimensional Problems
The stability restriction $\mu \leq 1/(2d)$ for explicit schemes in $d$ spatial dimensions becomes increasingly severe as $d$ grows. One remedy is to use implicit methods, but fully implicit multi-dimensional schemes require solving large, dense linear systems. A more efficient alternative is **operator splitting**, which breaks the multi-dimensional problem into a sequence of one-dimensional solves.
The starting point is the observation that after semidiscretization, the 2D diffusion equation with zero boundary conditions takes the form
\begin{align*}
U' = AU, \quad A = \frac{1}{h^2}(A_x + A_y),
\end{align*}
where $A_x$ and $A_y$ are the matrices corresponding to the one-dimensional second-difference operators in the $x$- and $y$-directions respectively. The exact solution is $U(t) = e^{tA}U(0)$.
[definition: Kronecker Product]
For matrices $A \in \mathbb{R}^{n_A \times m_A}$ and $B \in \mathbb{R}^{n_B \times m_B}$, the **Kronecker product** $A \otimes B \in \mathbb{R}^{n_A n_B \times m_A m_B}$ is the block matrix
\begin{align*}
A \otimes B = \begin{bmatrix} A_{11}B & A_{12}B & \cdots & A_{1,m_A}B \\ A_{21}B & A_{22}B & \cdots & A_{2,m_A}B \\ \vdots & & & \vdots \\ A_{n_A,1}B & \cdots & \cdots & A_{n_A,m_A}B \end{bmatrix}.
\end{align*}
[/definition]
With grid points ordered by columns, the splitting matrices satisfy $A_x = G \otimes I$ and $A_y = I \otimes G$, where $G$ is the tridiagonal second-difference matrix of size $M \times M$. The Kronecker product satisfies the mixed-product property $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$, which confirms that $A_x A_y = (G \otimes I)(I \otimes G) = G \otimes G = (I \otimes G)(G \otimes I) = A_y A_x$. That is, $A_x$ and $A_y$ commute for the constant-coefficient diffusion equation — a discrete reflection of the fact that $\partial_x^2$ and $\partial_y^2$ commute.
The importance of commutativity is that it allows exact factorisation of the matrix exponential.
[quotetheorem:1377]
[citeproof:1377]
For the 2D diffusion equation with constant coefficients, commutativity is exact: $e^{kA} = e^{kA_x/h^2} e^{kA_y/h^2}$. This motivates the **Split Crank-Nicolson** scheme, which approximates each exponential factor $e^{kA_x/h^2}$ by the Padé approximant $r(z) = (1 + z/2)(1 - z/2)^{-1}$:
\begin{align*}
U^{n+1} = \left(I + \tfrac{\mu}{2} A_x\right)\left(I - \tfrac{\mu}{2} A_x\right)^{-1} \left(I + \tfrac{\mu}{2} A_y\right)\left(I - \tfrac{\mu}{2} A_y\right)^{-1} U^n.
\end{align*}
[explanation: Computational Efficiency of Split Crank-Nicolson]
The split formulation reduces the two-dimensional implicit solve to a sequence of one-dimensional solves. Define the intermediate state $U^{n+1/2} = (I + \frac{\mu}{2}A_y)(I - \frac{\mu}{2}A_y)^{-1}U^n$. Since $A_y = I \otimes G$ is block diagonal, the solve $(I - \frac{\mu}{2}A_y)V = (I + \frac{\mu}{2}A_y)U^n$ decomposes into $M$ independent tridiagonal systems of size $M$, each solvable in $O(M)$ operations. The total cost is $O(M^2)$.
For the second half-step $U^{n+1} = (I + \frac{\mu}{2}A_x)(I - \frac{\mu}{2}A_x)^{-1}U^{n+1/2}$, the matrix $A_x = G \otimes I$ is block diagonal when grid points are ordered by rows instead of columns. By permuting the ordering, this solve also reduces to $M$ independent tridiagonal systems of size $M$, again $O(M^2)$ total. The full time step therefore costs $O(M^2)$ and involves only tridiagonal solves — no full matrix factorisation or FFT is needed.
[/explanation]
Stability of the split Crank-Nicolson scheme follows from properties of the Padé approximant $r$. Since $A_x$ and $A_y$ are symmetric with non-positive eigenvalues (by Gershgorin's theorem applied to the tridiagonal matrices), and since $|r(z)| \leq 1$ for all $z \in \mathbb{C}$ with $\operatorname{Re} z \leq 0$, we have $\rho(r(\mu A_x)) \leq 1$ and $\rho(r(\mu A_y)) \leq 1$. As these matrices are symmetric, their operator norm equals their spectral radius, and therefore
\begin{align*}
\|U^{n+1}\|_2 \leq \|r(\mu A_x)\|_2\, \|r(\mu A_y)\|_2\, \|U^n\|_2 \leq \|U^n\|_2.
\end{align*}
The split Crank-Nicolson scheme is unconditionally stable for all $\mu > 0$.
### Splitting Methods for Non-Commuting Operators
The commutativity of $A_x$ and $A_y$ is special to the constant-coefficient case. When the diffusion coefficient $a(x,y)$ varies in space, the operators $A_x$ and $A_y$ no longer commute, and the splitting $e^{kA} \approx e^{kA_x/h^2} e^{kA_y/h^2}$ introduces a splitting error of order $O(k^2)$. This is acceptable when combined with a first-order time integrator but becomes the dominant error for higher-order methods.
The general inhomogeneous system arising in practice is
\begin{align*}
U' = AU + b(t), \quad U(0) = U_0,
\end{align*}
where $b(t)$ encodes boundary conditions (and possibly a forcing term). The exact solution is given by the variation of constants formula
\begin{align*}
U(t_{n+1}) = e^{kA}U(t_n) + \int_{t_n}^{t_{n+1}} e^{(t_{n+1}-s)A}b(s)\, ds.
\end{align*}
Applying the trapezoidal rule to the integral gives the second-order approximation
\begin{align*}
U(t_{n+1}) \approx e^{kA}U(t_n) + \frac{k}{2}\left[e^{kA}b(t_n) + b(t_{n+1})\right].
\end{align*}
**Strang splitting** improves on the simple sequential split by using the symmetric factorisation
\begin{align*}
e^{t(B+C)} = e^{tB/2}\, e^{tC}\, e^{tB/2} + O(t^3).
\end{align*}
Setting $A = B + C$ with $B = A_x/h^2$ and $C = A_y/h^2$, the Strang split advances from $t_n$ to $t_{n+1}$ in three stages: half a step with $B$, a full step with $C$, then another half step with $B$. When each exponential is replaced by the Padé approximant $r(z)$ and the boundary term is included via the trapezoidal rule, the full scheme reads
\begin{align*}
U^{n+1} = r\!\left(\tfrac{1}{2}kB\right) r(kC)\, r\!\left(\tfrac{1}{2}kB\right) \left[U^n + \tfrac{k}{2}b^n\right] + \tfrac{k}{2}b^{n+1}.
\end{align*}
Each application of $r(\cdot)$ involves only a tridiagonal solve, so the computational cost remains $O(M^2)$ per time step while the method achieves second-order accuracy even when $B$ and $C$ do not commute.
Operator splitting elegantly decouples the spatial and temporal challenges of multi-dimensional evolution equations, yet it remains tied to finite differences or finite elements. Spectral methods offer an alternative: by representing solutions on Fourier or Chebyshev bases, we achieve exponential convergence in space, allowing coarser grids and larger stable time steps.
# 3. Spectral Methods
## Why Spectral Methods?
Finite difference methods approximate derivatives by linear combinations of nearby function values. This produces large, sparse linear systems whose size is dictated by the slow, algebraic convergence of the underlying approximation: to halve the error, one must roughly double the number of grid points in each coordinate direction. Spectral methods offer a fundamentally different philosophy. Instead of sampling the unknown function on a grid, one expands it in a globally-supported basis of smooth functions — typically trigonometric exponentials or orthogonal polynomials — and seeks the best approximation within the corresponding finite-dimensional subspace. The payoff is dramatic: for sufficiently smooth or analytic functions, the error decays exponentially fast in the number of basis functions used, a phenomenon the course calls convergence at spectral speed.
### The Galerkin Framework
The general setup applies to any PDE of the form $\mathcal{L} u = f$, where $\mathcal{L}: H^k([-1,1]) \to L^2([-1,1])$ is a linear differential operator and $f: [-1,1] \to \mathbb{R}$ is a given right-hand side. We fix a finite-dimensional subspace $V$ spanned by an orthonormal basis $\psi_1, \ldots, \psi_N$ with respect to some inner product $(\cdot, \cdot)$, and look for an approximate solution of the form
\begin{align*}
u_N(x) = \sum_{n=1}^{N} c_n \psi_n(x).
\end{align*}
Substituting $u_N$ into the PDE gives $\sum_{n=1}^{N} c_n \mathcal{L}\psi_n = f$, which generically has no solution since there is no reason for the true solution to lie in $V$. Instead, we require only that the residual $\mathcal{L}u_N - f$ is orthogonal to $V$. Since $(\psi_m)$ is orthonormal, this condition becomes
\begin{align*}
\sum_{n=1}^{N} c_n (\mathcal{L}\psi_n, \psi_m) = (f, \psi_m), \quad m = 1, \ldots, N.
\end{align*}
[definition: Galerkin Equations]
The system of equations
\begin{align*}
\sum_{n=1}^{N} c_n (\mathcal{L}\psi_n, \psi_m) = (f, \psi_m), \quad m = 1, \ldots, N,
\end{align*}
arising from the requirement that the residual $\mathcal{L}u_N - f$ is orthogonal to the approximation subspace $V = \operatorname{span}(\psi_1, \ldots, \psi_N)$, are called the **Galerkin equations**.
[/definition]
Setting $A_{m,n} = (\mathcal{L}\psi_n, \psi_m)$ and $\tilde{f}_m = (f, \psi_m)$ gives the $N \times N$ linear system $Ac = \tilde{f}$. The key observation is that the structure of $A$ — and in particular whether it is dense or sparse — depends entirely on the choice of basis $(\psi_n)$. For the two bases studied in this chapter, the Fourier basis and the Chebyshev basis, differentiation acts particularly cleanly, which makes $A$ amenable to efficient computation.
## Fourier Approximation of Functions
What basis should we choose to represent a smooth function on $[-1,1]$, and how fast does the resulting truncated series converge? The trigonometric exponentials $e^{i\pi nx}$ are a natural answer: they are smooth, mutually orthogonal, and correspond to the Fourier analysis of the periodically extended function. The question of convergence rate — and in particular whether it can be made exponentially fast — drives the developments of this section.
We work on the interval $[-1,1]$ and take the Fourier basis
\begin{align*}
\psi_n(x) = e^{i\pi n x}, \quad n \in \mathbb{Z}.
\end{align*}
These functions are orthonormal with respect to the normalised $L^2$ inner product on $[-1,1]$:
\begin{align*}
(\psi_n, \psi_m) = \frac{1}{2}\int_{-1}^{1} \psi_n(x)\,\overline{\psi_m(x)}\,dx = \begin{cases} 1 & \text{if } n = m, \\ 0 & \text{otherwise.}\end{cases}
\end{align*}
[definition: Fourier Approximation]
Let $N \geq 2$ be an even integer. Given $f: [-1,1] \to \mathbb{R}$, its **truncated Fourier approximation** is
\begin{align*}
\phi_N(x) = \sum_{n=-N/2}^{N/2} \widehat{f}_n\, e^{i\pi n x}, \quad x \in [-1,1],
\end{align*}
where the **Fourier coefficients** are
\begin{align*}
\widehat{f}_n = \frac{1}{2}\int_{-1}^{1} f(t)\,e^{-i\pi n t}\,dt, \quad n \in \mathbb{Z}.
\end{align*}
[/definition]
Since every basis function $\psi_n(x) = e^{i\pi nx}$ is 2-periodic, a necessary condition for $\phi_N \to f$ is that $f$ itself be 2-periodic. The following result, whose proof belongs to functional analysis and measure theory, gives a pointwise sufficient condition.
[quotetheorem:1378]
The proof uses techniques from Probability and Measure; it is not covered in this course.
Lipschitz continuity is a sufficient condition but far from optimal. The natural question is how the convergence rate improves as we assume more regularity of $f$. For Lipschitz functions, the Fourier coefficients decay at rate $O(|n|^{-2})$, which gives uniform convergence but no quantitative rate beyond what the general theory guarantees. For $C^k$ functions the decay improves to $O(|n|^{-k-1})$, yielding algebraic rates; for analytic functions we will see exponential decay, which is the main result of this section. The reason it is needed is that Lipschitz functions cannot produce a jump discontinuity anywhere in their range of values, so the Dirichlet kernel used to represent the partial sums has no oscillation large enough to prevent convergence. By contrast, for a function with a jump discontinuity — even one that is $C^\infty$ away from the jump — the Fourier partial sums overshoot near the discontinuity by about $9\%$ of the jump height regardless of $N$. This is the Gibbs phenomenon: convergence degrades from uniform to pointwise, with permanent oscillations near the jump. The course's main interest is in smoother functions, where one obtains exponentially fast convergence.
### Exponential Convergence for Analytic Functions
The rate at which Fourier coefficients decay governs how quickly $\phi_N$ approximates $f$. For merely $C^k$ functions one finds algebraic decay: $\|\phi_N - f\|_\infty = O(N^{-k'})$ for some $k'$ related to $k$. For analytic functions the situation is dramatically better.
[quotetheorem:1379]
[citeproof:1379]
The bound is striking: the error decays like $c^{N/2}$ where $c < 1$ is determined by the width of the analyticity strip. Doubling $N$ squares the convergence factor — this is fundamentally different from finite differences, where halving $h$ only halves the error for a first-order scheme.
[remark: Algebraic Rate for Finite Smoothness]
If $f$ is only assumed to be $C^k$ for some integer $k$, the Fourier coefficients decay at rate $O(|n|^{-k'})$ (with $k'$ typically $k+1$), and one obtains the algebraic rate $\|\phi_N - f\|_\infty = O(N^{-k'})$. Analyticity, which corresponds to the entire strip $|{\rm Im}\, z| < a$, is what promotes this to exponential decay.
[/remark]
The gap between algebraic and exponential decay is decisive for practical computation. An algebraic rate $O(N^{-k})$ means that gaining each additional digit of accuracy requires increasing $N$ by a fixed factor; an exponential rate $c^{N/2}$ means that each fixed increment in $N$ adds a constant number of digits. This distinction explains why spectral methods, when they apply, need far fewer degrees of freedom than finite differences. The following definition gives a name to this phenomenon.
[definition: Spectral Speed Convergence]
An $N$-term approximation $\phi_N$ of a function $f$ **converges to $f$ at spectral speed** if $\|\phi_N - f\|$ decays faster than $O(N^{-p})$ for every $p = 1, 2, \ldots$ — that is, faster than any algebraic rate.
[/definition]
Exponential decay $\|\phi_N - f\| \lesssim c^{N/2}$ (with $c < 1$ fixed) is the principal example of spectral speed convergence.
## The Algebra of Fourier Expansions
Can we use Fourier expansions to solve PDEs, and if so, what operations do we need to be able to perform on them? Any PDE involves differentiation, and most interesting PDEs also involve variable coefficients, which means we will need to multiply Fourier series together. The question is whether these operations preserve the spectral structure — that is, whether the result of differentiating or multiplying two analytic functions still belongs to the class of analytic functions with rapidly decaying Fourier coefficients. The following theorem says yes, and makes the algebra explicit.
Let $\mathcal{A}$ denote the class of all 2-periodic functions analytic on some horizontal strip $\{z \in \mathbb{C} : -a < \operatorname{Im} z < a\}$ (the strip width $a > 0$ may depend on the function). Write a function $f \in \mathcal{A}$ in its Fourier series $f(x) = \sum_{n=-\infty}^{\infty}\widehat{f}_n\,e^{i\pi nx}$.
[quotetheorem:1380]
The proof of linearity and closure under products follows directly from the definition and the analyticity of sums and products. The differentiation formula is obtained by termwise differentiation of the Fourier series, which is justified precisely because the coefficients $\widehat{f}_n$ decay exponentially — multiplying by $n$ does not destroy absolute convergence.
The content of the theorem is more than closure: it tells us precisely how the operations on functions translate into operations on Fourier coefficients. Addition maps to coordinatewise addition; products map to discrete convolution (the frequency-domain analogue of multiplication); differentiation maps to pointwise multiplication by $i\pi n$. The limitation of the class $\mathcal{A}$ is that it requires periodicity — the Fourier basis functions $e^{i\pi nx}$ are themselves 2-periodic, so any function outside $\mathcal{A}$ either fails to have a convergent Fourier series or has only algebraically decaying coefficients. The Chebyshev basis in the next section addresses exactly this restriction for non-periodic problems.
[remark: Differentiation as a Diagonal Operator]
In frequency space, differentiation acts by multiplication: the Fourier coefficient $\widehat{f}_n$ is mapped to $i\pi n\,\widehat{f}_n$. In matrix form (indexing by $n$), differentiation is represented by the diagonal matrix $D$ with entries $D_{nn} = i\pi n$. This is the spectral method's analogue of the finite difference stencil, but unlike the finite difference matrix (which is banded), $D$ is diagonal — inversion and composition reduce to scalar arithmetic on each frequency mode.
[/remark]
### Spectral Discretisation of a Boundary Value Problem
The algebraic structure above translates directly into a numerical algorithm for ODEs and PDEs with periodic boundary conditions. The Fourier coefficients of variable coefficients and right-hand sides interact through discrete convolution, which at finite truncation order reduces to a structured matrix system.
[example: Spectral Method for a Two-Point BVP]
Consider the two-point boundary value problem: find $y: [-1,1] \to \mathbb{R}$ satisfying
\begin{align*}
y'' + a(x)\,y' + b(x)\,y = f(x), \quad y(-1) = y(1),
\end{align*}
where $a, b, f \in \mathcal{A}$ and we seek a periodic solution $y \in \mathcal{A}$.
Substituting the Fourier series of $y$, $a$, $b$, and $f$ and applying the differentiation formula twice, the equation becomes an infinite linear system for the coefficients $\widehat{y}_n$:
\begin{align*}
-\pi^2 n^2\,\widehat{y}_n + i\pi\sum_{m=-\infty}^{\infty} m\,\widehat{a}_{n-m}\,\widehat{y}_m + \sum_{m=-\infty}^{\infty}\widehat{b}_{n-m}\,\widehat{y}_m = \widehat{f}_n, \quad n \in \mathbb{Z}.
\end{align*}
Since $a, b, f \in \mathcal{A}$, their Fourier coefficients decay exponentially. Truncating at $|n|, |m| \leq N/2$ introduces an exponentially small error and gives the finite $N \times N$ linear system
\begin{align*}
-\pi^2 n^2\,\widehat{y}_n + i\pi\sum_{m=-N/2+1}^{N/2} m\,\widehat{a}_{n-m}\,\widehat{y}_m + \sum_{m=-N/2+1}^{N/2}\widehat{b}_{n-m}\,\widehat{y}_m = \widehat{f}_n,
\end{align*}
for $n = -N/2+1, \ldots, N/2$. The matrix of this system has a block structure involving a diagonal second-derivative part $-\pi^2 n^2 \delta_{nm}$ perturbed by convolution matrices for $a$ and $b$.
[/example]
### Numerical Quadrature for Fourier Coefficients
To assemble the linear system in practice, we need to compute the Fourier coefficients $\widehat{f}_n = \frac{1}{2}\int_{-1}^{1} f(t)e^{-i\pi nt}\,dt$ numerically. Setting $h(t) = f(t)e^{-i\pi nt}$, we note that $h \in \mathcal{A}$ whenever $f \in \mathcal{A}$. The natural numerical method is the rectangle rule.
[definition: Rectangle Rule]
The **rectangle rule** for approximating $\int_{-1}^{1} h(t)\,dt$, using $N$ equally spaced nodes $t_k = 2k/N$ for $k = -N/2+1, \ldots, N/2$, is
\begin{align*}
I_N(h) = \frac{2}{N}\sum_{k=-N/2+1}^{N/2} h\!\left(\frac{2k}{N}\right).
\end{align*}
[/definition]
The rectangle rule is often disregarded as a low-order method because on general smooth functions it is only first-order accurate. However, for analytic periodic functions the situation is qualitatively different: the error is controlled not by the smoothness of the integrand but by the rate of decay of its Fourier coefficients, which is exponential. The following two results make this precise. The first identifies the exact form of the quadrature error in terms of aliased Fourier modes; the second bounds that error when the integrand is analytic.
[quotetheorem:1381]
[citeproof:1381]
The formula shows that the quadrature error consists entirely of the Fourier coefficients of $h$ at multiples $Nr$ of the sampling frequency — the so-called aliased modes. If $|\widehat{h}_n|$ decays exponentially in $|n|$, then $|\widehat{h}_{Nr}|$ decays like $c^{N|r|}$, making the error exponentially small.
[quotetheorem:1382]
[citeproof:1382]
The bound $4Mc^N/(1 - c^N)$ deserves a moment's reflection. For a function that is only $C^k$ (not analytic), the Fourier coefficients decay at rate $O(|n|^{-k-1})$, and the aliasing formula gives a quadrature error of $O(N^{-k-1})$ — the rectangle rule inherits the algebraic rate of the coefficient decay. Analyticity is what promotes this to exponential, and the rate constant $c = e^{-a\pi}$ is smaller (hence the convergence faster) the wider the strip of analyticity. In practice, computing the Fourier coefficients via the rectangle rule is equivalent to applying the FFT: evaluating $h$ at $N$ equally spaced points and performing a discrete Fourier transform costs $O(N \log N)$ operations, and the exponential accuracy of the rectangle rule guarantees that the discrete Fourier coefficients approximate the true ones to spectral precision.
### Applying the Spectral Method: The Poisson Equation
We now have all the ingredients to apply the Fourier spectral method to a concrete PDE. The Poisson equation is a natural first example, though the course notes that it is somewhat atypical due to the special structure of the Laplacian.
[example: Fourier Spectral Method for the Poisson Equation]
Consider the Poisson equation on $[-1,1]^2$:
\begin{align*}
\Delta u = f, \quad -1 \leq x, y \leq 1,
\end{align*}
where $f$ is analytic and satisfies doubly periodic boundary conditions: $f(-1,y) = f(1,y)$ and $f(x,-1) = f(x,1)$. We impose the same periodic boundary conditions on $u$, together with the normalisation $\int_{-1}^{1}\int_{-1}^{1} u(x,y)\,dx\,dy = 0$ to fix the additive constant.
Expanding both $f$ and $u$ in 2D Fourier series,
\begin{align*}
f(x,y) = \sum_{k,l=-\infty}^{\infty}\widehat{f}_{k,l}\,e^{i\pi(kx+ly)}, \quad u(x,y) = \sum_{k,l=-\infty}^{\infty}\widehat{u}_{k,l}\,e^{i\pi(kx+ly)},
\end{align*}
the normalisation condition forces $\widehat{u}_{0,0} = 0$. Applying the Laplacian to the series for $u$ gives
\begin{align*}
\Delta u(x,y) = -\pi^2\sum_{k,l=-\infty}^{\infty}(k^2+l^2)\,\widehat{u}_{k,l}\,e^{i\pi(kx+ly)}.
\end{align*}
Matching coefficients in $\Delta u = f$ yields the explicit solution in frequency space:
\begin{align*}
\widehat{u}_{k,l} = -\frac{\widehat{f}_{k,l}}{\pi^2(k^2+l^2)}, \quad (k,l) \neq (0,0), \qquad \widehat{u}_{0,0} = 0.
\end{align*}
The Fourier coefficients of $u$ are recovered from those of $f$ by a pointwise division — in matrix terms, the operator $-\pi^2(k^2+l^2)\delta_{(k,l),(k',l')}$ is diagonal in the Fourier basis. Truncating at $|k|, |l| \leq N/2$ and using the rectangle rule to compute $\widehat{f}_{k,l}$ produces a fully discrete scheme that converges at spectral speed.
[/example]
The Poisson example shows the spectral method at its simplest: the operator is diagonal in the Fourier basis, and inversion reduces to pointwise division in frequency space. This diagonality is special to constant-coefficient operators whose eigenfunctions coincide with the chosen basis. For variable-coefficient operators — such as the diffusion equation with non-constant $a(x)$ studied later in this chapter — the Galerkin matrix acquires off-diagonal entries from the convolution of the coefficient's Fourier series, and the solve requires a full linear system inversion. The next remark makes this distinction precise.
[remark: Why Poisson Is Atypical]
The Poisson equation admits such a clean spectral solution because the basis functions $\phi_{k,l}(x,y) = e^{i\pi(kx+ly)}$ are eigenfunctions of the Laplacian: $\Delta\phi_{k,l} = -\pi^2(k^2+l^2)\phi_{k,l}$. For a general differential operator $\mathcal{L}$, the Galerkin matrix $A_{m,n} = (\mathcal{L}\psi_n, \psi_m)$ is not diagonal, and the convolution structure of variable-coefficient terms must be handled more carefully.
[/remark]
### Periodicity as a Limitation
The Fourier spectral method achieves exponential convergence, but it requires $f$ to be 2-periodic. For problems on $[-1,1]$ without periodic boundary conditions, this is a serious restriction: a non-periodic function, when extended periodically, will have a jump discontinuity at the endpoints, and the Gibbs phenomenon degrades convergence to algebraic rate. The way around this limitation, studied in the next section, is to replace the Fourier basis with the **Chebyshev polynomial** basis, which is naturally adapted to the interval $[-1,1]$ without any periodicity requirement.
## The Algebra of Chebyshev Expansions
Non-periodic functions present a fundamental difficulty for Fourier methods: when extended periodically to all of $\mathbb{R}$, a function that is smooth on $[-1,1]$ but with different values at $-1$ and $1$ will have a jump discontinuity, and the Gibbs phenomenon forces convergence to degrade to algebraic rate. Does a basis exist that captures the algebraic structure we need — multiplication, differentiation, and spectral-speed convergence — without requiring periodicity? The Chebyshev polynomials provide exactly this: they are orthogonal on $[-1,1]$ with respect to the weight $(1-x^2)^{-1/2}$, they encode no periodicity, and as the subsection below shows, they preserve all the algebraic properties of Fourier series via the cosine substitution.
The preceding section developed Fourier approximation on the interval $[-1,1]$ for periodic functions. When the function is not periodic, or when the boundary behaviour matters, Chebyshev polynomials offer a more natural basis. This subsection records the algebraic structure of Chebyshev series — in particular how they multiply and how derivatives behave — that we will need for the spectral method.
Recall that the Chebyshev polynomials $T_n : [-1,1] \to \mathbb{R}$ are defined by
\begin{align*}
T_n(x) = \cos(n\theta), \quad x = \cos\theta, \quad \theta \in [0, \pi].
\end{align*}
Given a function $f : [-1,1] \to \mathbb{R}$, its Chebyshev series is
\begin{align*}
f(x) = \sum_{n=0}^{\infty} \breve{f}_n T_n(x),
\end{align*}
where the coefficients $\breve{f}_n$ are determined by orthogonality. The substitution $x = \cos\theta$ converts this into a cosine series: $f(\cos\theta) = \sum_{n=0}^{\infty} \breve{f}_n \cos(n\theta)$, so every result about Fourier cosine series translates directly into a result about Chebyshev series. This is the central observation linking the two bases.
### Multiplication of Chebyshev Series
The first algebraic fact we need is how to multiply two Chebyshev expansions. The key identity is the product formula for cosines:
\begin{align*}
\cos(m\theta)\cos(n\theta) = \frac{1}{2}[\cos((m-n)\theta) + \cos((m+n)\theta)],
\end{align*}
which, upon the substitution $x = \cos\theta$, becomes a product formula for Chebyshev polynomials.
[quotetheorem:1383]
[citeproof:1383]
This formula has an immediate practical consequence: multiplying two truncated Chebyshev series of degree at most $N$ produces a series of degree at most $2N$, which can be truncated back to degree $N$ if desired. The structure mirrors convolution in Fourier space, reflecting the underlying cosine substitution.
The absolute convergence of both Chebyshev series is essential: without it, rearranging the double sum over $m$ and $n$ into the two families indexed by $k = m+n$ and $k = |m-n|$ is not justified, and the product formula may fail. For analytic functions on $[-1,1]$, the Chebyshev coefficients decay exponentially, so absolute convergence is automatic. For functions of finite smoothness $C^k$, the coefficients decay only polynomially, and the product formula holds provided $k$ is large enough that $\sum_{n}|\breve{f}_n|$ and $\sum_{n}|\breve{g}_n|$ converge — equivalently, that both series converge absolutely in the supremum norm. The product rule for Chebyshev series feeds directly into the Galerkin discretization of variable-coefficient differential equations: if $a(x) = \sum_n \breve{a}_n T_n(x)$ is a known coefficient, then multiplying $a$ by the unknown $u$ produces a Chebyshev series whose coefficients are the convolution of $\breve{a}$ and $\breve{u}$, and truncating this convolution at degree $N$ gives the entries of the Galerkin matrix.
### Differentiation in Chebyshev Coordinates
To apply spectral methods to differential equations, we need to express the Chebyshev coefficients of $f'$ in terms of those of $f$. The starting point is the derivative formula:
\begin{align*}
T_n'(x) = \frac{n\sin(n\theta)}{\sin\theta}, \quad x = \cos\theta,
\end{align*}
which comes from differentiating $T_n(x) = \cos(n\theta)$ with respect to $x$ using the chain rule. Expanding $\sin(n\theta)/\sin\theta$ in terms of cosines gives the following.
[quotetheorem:1384]
[citeproof:1384]
The parity structure visible in these formulas is not incidental: it is a consequence of the fact that $T_{2n}$ is an even function of $x$ and $T_{2n+1}$ is an odd function, so their derivatives must have the opposite parity. The finiteness of the sums is equally important: for each $n$, the derivative $T_n'$ is a linear combination of at most $n$ Chebyshev polynomials, all of index strictly less than $n$. This upper-triangular structure (with respect to the degree filtration) means that the matrix expressing differentiation in the Chebyshev basis is upper triangular with zero diagonal, which guarantees that the associated linear system for boundary value problems is non-singular once boundary conditions are imposed. Without the strict decrease in index — if, say, $T_n'$ involved $T_n$ itself — the differentiation matrix would have diagonal entries and the spectral discretization of a boundary value problem could become singular for certain values of $N$.
[remark: Scope of the Formula]
The key feature of these formulas is that $T_n'$ is expressed as a finite linear combination of $T_k$ with $k < n$, and the indices alternate in parity: the derivative of an even-degree polynomial involves only odd-degree polynomials, and vice versa. This parity structure is essential when building differentiation matrices.
[/remark]
The derivative formulas for individual Chebyshev polynomials lead directly to a recurrence for the Chebyshev coefficients of an arbitrary derivative $f^{(k)}$. The key step is to expand $f' = \sum_n \breve{f'}_n T_n$ using the linearity of differentiation and the polynomial-level formulas above, then read off the coefficient $\breve{f'}_n$ by collecting all contributions from $T_m'$ that produce a $T_n$ term. Because $T_m'$ involves only $T_k$ with $k < m$ and $k \equiv m + 1 \pmod{2}$, the result is a triangular recurrence that propagates from high to low index.
[quotetheorem:1385]
The recurrence computes the Chebyshev coefficients of $f'$ from those of $f$ in a single backward sweep from $n = N$ down to $n = 0$, each step involving only a multiplication and an addition. The parity constraint $n + m$ odd ensures that even-indexed coefficients of $f$ contribute only to odd-indexed coefficients of $f'$ and vice versa, which halves the work and reflects the derivative's parity flip seen in the polynomial formulas above.
This recurrence is the foundation for constructing the Chebyshev differentiation matrix $D_N$, which maps the vector of Chebyshev coefficients $(\breve{u}_0, \dots, \breve{u}_{N-1})$ to the Chebyshev coefficients of $u'$. When $u$ is discretized at the $N$ Chebyshev points $x_j = \cos(\pi j / N)$ for $j = 0, \dots, N$, the differentiation matrix acts in physical space rather than coefficient space, and its entries can be computed explicitly. A crucial contrast with finite differences: the Chebyshev differentiation matrix is dense — every grid point communicates with every other — but the resulting scheme is spectrally accurate.
[example: Differentiation Matrix for $N = 4$]
For $N = 4$ Chebyshev points, the differentiation matrix $D_4$ has entries that can be read off from the recurrence above. For instance, $\breve{f'}_0 = \sum_{m \text{ odd}} m \breve{f}_m = \breve{f}_1 + 3\breve{f}_3 + 5\breve{f}_5 + \cdots$, which in the truncated basis gives the first row of $D_4$ as $(0, 1, 0, 3)$ (after adjusting for the $c_0 = 1$ normalisation). Higher rows follow from the same recurrence with the appropriate $c_n$ factor. The pattern confirms that only coefficients of opposite parity (relative to the output index) contribute to each entry. This structure has a notable consequence: the $4 \times 4$ matrix $D_4$ has a checkerboard sparsity pattern (zero entries wherever the row and column indices share the same parity), which halves its non-zero count. As $N$ grows, this checkerboard pattern persists but the matrix remains dense within each parity block, reflecting the global coupling inherent in spectral differentiation. The eigenvalues of $D_N$ cluster near the imaginary axis and grow like $O(N^2)$, which is the spectral analogue of the CFL constraint: the stiffness of the semi-discrete ODE system scales with $N^2$, imposing the same $k \lesssim N^{-2}$ time-step restriction seen later in the evolutionary PDE section.
[/example]
### Connection Between Fourier and Chebyshev via Cosine Substitution
The substitution $x = \cos\theta$ establishes a precise correspondence between Chebyshev series on $[-1,1]$ and cosine (i.e., even Fourier) series on $[0,\pi]$. Concretely:
\begin{align*}
f(x) = \sum_{n=0}^{\infty} \breve{f}_n T_n(x) \quad \longleftrightarrow \quad f(\cos\theta) = \sum_{n=0}^{\infty} \breve{f}_n \cos(n\theta).
\end{align*}
This means that all convergence theorems for Fourier cosine series carry over directly. In particular, if $f$ is analytic in a neighbourhood of $[-1,1]$, its Chebyshev coefficients decay exponentially, recovering the spectral accuracy established in the Fourier setting. The difference from the Fourier basis on $[-1,1]$ is that the Chebyshev grid is non-uniform: the Chebyshev points cluster near the endpoints $\pm 1$, which counteracts the Runge phenomenon that plagues uniform-grid polynomial interpolation.
---
## The Spectral Method for Evolutionary PDEs
With the algebraic infrastructure in place, we can now apply spectral discretization to time-dependent problems. The setup mirrors the method-of-lines approach from Chapter 2 (Section: Semidiscretization and the Method of Lines), but replaces spatial finite differences by a spectral expansion.
### The General Framework
Consider the Cauchy problem on the spatial domain $[-1,1]$:
\begin{align*}
\frac{\partial u}{\partial t}(x,t) &= \mathcal{L}\, u(x,t), && x \in [-1,1],\; t \geq 0, \\
u(x, 0) &= g(x), && x \in [-1,1],
\end{align*}
supplemented with appropriate boundary conditions at $x = \pm 1$, where $\mathcal{L}$ is a linear differential operator acting in the spatial variable $x$.
The idea is to separate the problem into two stages: discretize in space using a spectral expansion, reducing the PDE to a finite system of ODEs in time; then integrate those ODEs with a standard time-stepping scheme.
[definition: Spectral Method for Evolutionary PDEs]
Given an evolutionary PDE $u_t = \mathcal{L} u$ with $\mathcal{L}: H^k([-1,1]) \to L^2([-1,1])$ a linear differential operator, the **spectral method** proceeds as follows.
1. **Spatial discretization.** Approximate $u(x,t)$ by a partial sum of $N$ basis functions $\psi_k(x)$ (either Fourier exponentials or Chebyshev polynomials):
\begin{align*}
u(x,t) \approx u_N(x,t) = \sum_{k=0}^{N-1} c_k(t)\,\psi_k(x).
\end{align*}
2. **ODE reduction.** Substitute $u_N$ into the PDE and project onto the basis (Galerkin conditions). This yields an $N \times N$ system of ODEs for the time-dependent coefficients $c(t) = (c_0(t), \dots, c_{N-1}(t))^\top$:
\begin{align*}
c'(t) = B\,c(t), \qquad B \in \mathbb{R}^{N \times N}.
\end{align*}
3. **Time integration.** Solve the ODE system using a standard scheme (forward Euler, Crank–Nicolson, etc.), which approximates the matrix exponential in the exact solution $c(t) = e^{tB}c(0)$.
[/definition]
The matrix $B$ encodes the action of the differential operator $\mathcal{L}$ in the chosen spectral basis. For the Fourier basis, $B$ is diagonal if $\mathcal{L}$ has constant coefficients (since the basis functions are eigenfunctions of constant-coefficient operators). For variable-coefficient operators, $B$ is typically dense. This density is the price paid for exponential accuracy: each spectral mode couples to every other, so the per-step cost is $O(N^2)$ rather than $O(N)$. The following remark compares this trade-off with the finite-difference approach of Chapter 2.
[remark: Comparison with Finite Differences]
In the finite-difference approach of Chapters 1 and 2, the spatial variable is discretized on a uniform grid of $M$ points, and $\mathcal{L}$ is replaced by a sparse $M \times M$ matrix. The resulting ODE system has the same structure $U'(t) = A\,U(t)$, but the matrix $A$ is banded, so each grid point couples only to its immediate neighbours. In the spectral method, $B$ is generally dense, coupling all $N$ modes, but $N$ can be much smaller than $M$ for the same accuracy because spectral methods converge exponentially for smooth data.
[/remark]
### Application: Diffusion Equation with Variable Coefficient
We illustrate the method by applying it to the diffusion equation with a non-constant positive coefficient $a = a(x) > 0$:
\begin{align*}
u_t &= (a(x)\,u_x)_x, && (x,t) \in [-1,1] \times [0,\infty), \\
u(x,0) &= g(x), && x \in [-1,1].
\end{align*}
When $a \equiv 1$ this reduces to the standard heat equation, and the Fourier basis diagonalizes the operator. For general $a(x)$, we use the Fourier expansion and let the off-diagonal coupling encode the variable coefficient.
[example: Spectral Discretization of Diffusion with Variable Coefficient]
Write $u(x,t) = \sum_{n} \widehat{u}_n(t)\,e^{i\pi nx}$ and $a(x) = \sum_{n} \widehat{a}_n\,e^{i\pi nx}$. Substituting into $u_t = (a\,u_x)_x$ and projecting onto $e^{i\pi mx}$ yields the ODE
\begin{align*}
\widehat{u}_n'(t) = -\pi^2 \sum_{m=-N/2+1}^{N/2} mn\,\widehat{a}_{n-m}\,\widehat{u}_m(t), \qquad n = -N/2+1, \dots, N/2.
\end{align*}
This can be written in vector form as $\widehat{u}'(t) = B\,\widehat{u}(t)$ where the matrix $B$ has entries $B_{n,m} = -\pi^2 mn\,\widehat{a}_{n-m}$.
Applying forward Euler with time step $\Delta t = k$ gives the fully discrete scheme
\begin{align*}
\widehat{u}_n^{\ell+1} = \widehat{u}_n^\ell - k\pi^2 \sum_{m=-N/2+1}^{N/2} mn\,\widehat{a}_{n-m}\,\widehat{u}_m^\ell,
\end{align*}
or in vector form $\widehat{u}^{\ell+1} = (I + kB)\,\widehat{u}^\ell$. As in the finite-difference setting, stability of the Euler step requires $\|I + kB\| \leq 1$.
[/example]
When $a \equiv 1$, the matrix $B$ is diagonal with entries $B_{n,n} = -\pi^2 n^2$, and the Euler stability condition becomes $|1 - k\pi^2 n^2| \leq 1$ for all modes $n$. The most restrictive constraint comes from the highest mode $|n| = N/2$:
\begin{align*}
k \leq \frac{2}{\pi^2 (N/2)^2} = \frac{8}{\pi^2 N^2}.
\end{align*}
This is a severe restriction: as $N$ increases, the maximum stable time step decreases as $N^{-2}$. The same CFL-type constraint arises in finite differences (where it is $k \lesssim h^2$), but spectral methods typically reach the required accuracy with much smaller $N$, making the constraint more manageable in practice. An implicit scheme such as Crank–Nicolson removes the stability restriction entirely (since $|1 + z/2|/|1 - z/2| = 1$ for purely imaginary $z$) at the cost of solving a linear system at each time step.
### Spectral vs Finite-Difference Accuracy
The central advantage of spectral methods is their accuracy for smooth solutions. For a finite-difference scheme of order $p$, the global error is $O(h^p)$ where $h = \Delta x$ is the grid spacing. For a spectral method applied to a function analytic in a neighbourhood of $[-1,1]$, the error in the $N$-mode approximation decays exponentially in $N$. This is the content of the exponential convergence theorem established earlier: the tail of the Fourier or Chebyshev series is bounded by $C e^{-\alpha N}$ for some $\alpha > 0$ depending on the width of the analyticity strip.
To make this concrete, consider approximating $f(x) = \cos(8\pi x)$ on $[-1,1]$ to a maximum pointwise error of $10^{-6}$. A second-order finite-difference scheme achieves error $O(h^2) = O(N^{-2})$, so one needs roughly $N_{\mathrm{FD}} \sim (10^{-6})^{-1/2} = 10^3$ grid points. The Fourier method, by contrast, represents $f$ exactly with just $N = 16$ Fourier modes (since $f$ is itself a finite Fourier series). For a general analytic function with analyticity strip of width $a = 0.3$, the exponential convergence factor is $c = e^{-0.3\pi} \approx 0.39$, and the bound $C c^{N/2} \leq 10^{-6}$ is satisfied with $N \approx 16$ — while the second-order finite-difference scheme would require $N \approx 1000$ points. This gap widens further as the target accuracy increases.
[explanation: Spectral Accuracy in Practice]
The practical consequence is that spectral methods reach machine precision with $N$ of order 10–100 for smooth problems where a finite-difference scheme might need $M = 10^3$–$10^5$ grid points. The trade-off is that the method matrix $B$ is dense (cost per time step $O(N^2)$ rather than $O(N)$ for a banded matrix), and the exponential convergence only holds for smooth data — if $u$ or $a$ has a discontinuity, the Gibbs phenomenon causes the convergence rate to degrade to algebraic.
In the context of the diffusion equation with smooth $a$ and smooth initial data $g$, the spectral method with Fourier or Chebyshev basis is the natural choice. When boundary layers or steep gradients are present, adaptive or hybrid methods may be preferable.
[/explanation]
This completes Chapter 3. The next chapter turns to iterative methods for the large linear systems that arise from spatial discretization, where the structure of the matrix (sparse in finite differences, dense in spectral methods) determines the choice of solver.
Spectral discretizations produce small but dense linear systems with remarkable conditioning, while finite difference methods yield large sparse systems requiring iterative solution. This chapter shifts focus: regardless of how the PDE was discretized, we now develop the solver algorithms that make all large systems tractable.
# 4. Iterative Methods for Linear Systems
## Why Iterative Methods?
The earlier chapters of this course have produced large sparse linear systems: the five-point finite difference scheme for the Poisson equation (Chapter 1) generates an $N \times N$ matrix where $N$ can be on the order of millions, yet each row contains at most five nonzero entries, and the implicit schemes for the heat equation (Chapter 2) produce a tridiagonal system at every time step. Direct methods such as Gaussian elimination cost $O(N^3)$ operations and $O(N^2)$ storage, which is completely impractical at that scale. Iterative methods exploit sparsity directly: each iteration requires only a matrix-vector product (costing $O(\text{nnz}(A))$, the number of nonzero entries), and one hopes that only a modest number of iterations are needed to reach the desired accuracy. This chapter develops the theory of such methods, starting with the simplest stationary iterations and building up to the convergence criteria that govern them.
## Stationary Linear Iterations
We need a systematic way to produce a sequence of approximations $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$ that converges to the exact solution $A^{-1}b$ without ever forming $A^{-1}$ explicitly. The challenge is that each update must be both cheap to compute and guaranteed to reduce the error — two requirements that are in natural tension. This section develops the framework of stationary linear iterations, the simplest class of methods that resolve this tension through a fixed algebraic update rule.
[definition: General Iterative Method]
A **general iterative method** for solving $Ax = b$ is a rule
\begin{align*}
x^{(k+1)} = f_k\!\left(x^{(0)}, x^{(1)}, \ldots, x^{(k)}\right).
\end{align*}
[/definition]
This section focuses on the most structured subfamily: **stationary linear iterations**, which have the form
\begin{align*}
x^{(k+1)} = Hx^{(k)} + v, \qquad x^{(0)}, v \in \mathbb{R}^n,
\end{align*}
where the matrix $H$ and vector $v$ do not depend on the iteration index $k$. The pair $(H, v)$ is chosen so that the exact solution is a fixed point: $x^* = Hx^* + v$. A method that uses the entire history $(x^{(0)}, \ldots, x^{(k)})$ rather than just the last iterate is called **non-stationary**; Krylov subspace methods (covered in the next section) are the leading examples.
To measure progress, we track two quantities.
[definition: Error and Residual]
Given the exact solution $x^*$ and the $k$-th iterate $x^{(k)}$, the **error** is
\begin{align*}
e^{(k)} := x^* - x^{(k)},
\end{align*}
and the **residual** is
\begin{align*}
r^{(k)} := b - Ax^{(k)} = Ae^{(k)}.
\end{align*}
The matrix $H$ is called the **iteration matrix** of the scheme.
[/definition]
The residual is computable (we know $b$ and $A$), whereas the error requires knowledge of $x^*$. Since $A$ is nonsingular, the residual is small if and only if the error is small, but the proportionality constant is the condition number $\kappa(A)$.
[definition: Convergent Iterative Method]
An iterative method $x^{(k+1)} = Hx^{(k)} + v$ is **convergent** if $x^{(k)} \to A^{-1}b$ for every starting value $x^{(0)} \in \mathbb{R}^n$.
[/definition]
The key observation is that subtracting the fixed-point equation $x^* = Hx^* + v$ from the iteration gives
\begin{align*}
e^{(k+1)} = He^{(k)},
\end{align*}
so by induction $e^{(k)} = H^k e^{(0)}$. Convergence for every starting point is therefore exactly the condition that $H^k \to 0$ as $k \to \infty$.
[quotetheorem:1386]
[citeproof:1386]
This theorem says that the question of convergence is purely a question about the iteration matrix $H$: the right-hand side $b$ and the splitting vector $v$ are irrelevant. The initial error $e^{(0)}$ is amplified or damped by repeated multiplication by $H$, so the long-run behavior of the method depends entirely on whether this repeated multiplication contracts all vectors to zero. The next theorem identifies exactly when this happens in terms of the spectrum of $H$, giving us a computable criterion. The decisive quantity is the spectral radius.
[quotetheorem:1387]
[citeproof:1387]
Combining the two results: a stationary linear iteration converges if and only if $\rho(H) < 1$. This is the fundamental criterion around which the rest of the chapter is organised.
## Splitting Methods
A principled way to construct iteration matrices is through **matrix splittings**. The idea is to decompose $A$ as a sum of two matrices, one of which is easy to invert.
[definition: Splitting Scheme]
Let $A = B + C$ where $B$ is nonsingular and systems of the form $By = c$ are inexpensive to solve. The **splitting iteration** associated with this decomposition is
\begin{align*}
Bx^{(k+1)} = -Cx^{(k)} + b,
\end{align*}
equivalently $x^{(k+1)} = Hx^{(k)} + v$ where
\begin{align*}
H = -B^{-1}C, \qquad v = B^{-1}b.
\end{align*}
[/definition]
The iteration matrix $H = -B^{-1}C = -B^{-1}(A - B) = I - B^{-1}A$ encodes the entire convergence behaviour. Different choices of $B$ give the classical methods.
### The Jacobi and Gauss-Seidel Methods
Write $A = L_0 + D + U_0$, where $D$ is the diagonal part, $L_0$ is the strictly lower-triangular part, and $U_0$ is the strictly upper-triangular part.
[definition: Jacobi Method]
The **Jacobi method** takes $B = D$, so that $C = L_0 + U_0 = A - D$. The iteration is
\begin{align*}
Dx^{(k+1)} = -(L_0 + U_0)x^{(k)} + b,
\end{align*}
with iteration matrix $H_J = -D^{-1}(L_0 + U_0)$. Because $D$ is diagonal, the update decouples into scalar equations:
\begin{align*}
x_i^{(k+1)} = \frac{1}{a_{ii}}\!\left(b_i - \sum_{j \neq i} a_{ij} x_j^{(k)}\right).
\end{align*}
Each iteration costs $O(\text{nnz}(A - D))$ operations (dominated by the off-diagonal matrix-vector product).
[/definition]
The Jacobi update uses all old values $x^{(k)}$ to compute $x^{(k+1)}$. Once a new component $x_i^{(k+1)}$ is available, Gauss-Seidel uses it immediately.
[definition: Gauss-Seidel Method]
The **Gauss-Seidel method** takes $B = L_0 + D$ (the lower-triangular part of $A$), so $C = U_0$. The iteration solves the triangular system
\begin{align*}
(L_0 + D)x^{(k+1)} = -U_0x^{(k)} + b,
\end{align*}
with iteration matrix $H_{GS} = -(L_0 + D)^{-1}U_0$. The component-wise form is
\begin{align*}
x_i^{(k+1)} = \frac{1}{a_{ii}}\!\left(b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)}\right),
\end{align*}
using freshly updated components for indices $j < i$.
[/definition]
[remark: Gauss-Seidel as sequential Jacobi]
Gauss-Seidel can be thought of as Jacobi with an in-place update: instead of waiting until the end of each sweep to replace $x^{(k)}$ with $x^{(k+1)}$, each component is updated as soon as it is computed. This often (though not always) accelerates convergence, as the next convergence theorem will show.
[/remark]
The ordering of components matters for Gauss-Seidel but not for Jacobi. The following example illustrates both methods on a concrete small system, and shows the faster convergence of Gauss-Seidel.
[example: Jacobi and Gauss-Seidel on a 3x3 system]
Consider the system $Ax = b$ where
\begin{align*}
A = \begin{pmatrix} 4 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 4 \end{pmatrix}, \qquad b = \begin{pmatrix} 3 \\ 2 \\ 3 \end{pmatrix}.
\end{align*}
The exact solution is $x^* = (1, 1, 1)^\top$. With $D = 4I$, $L_0 + U_0 = A - 4I$, the Jacobi iteration matrix is
\begin{align*}
H_J = -D^{-1}(L_0 + U_0) = \begin{pmatrix} 0 & 1/4 & 0 \\ 1/4 & 0 & 1/4 \\ 0 & 1/4 & 0 \end{pmatrix}.
\end{align*}
The eigenvalues of $H_J$ are $0$ and $\pm 1/\sqrt{2}$, so $\rho(H_J) = 1/\sqrt{2} \approx 0.707 < 1$ and Jacobi converges. For Gauss-Seidel one can verify that $\rho(H_{GS}) = 1/2 < 1$, and since $1/2 < 1/\sqrt{2}$, Gauss-Seidel converges faster on this system — a pattern that holds more generally for symmetric positive definite matrices.
[/example]
## Convergence Theory for Splitting Methods
When does a splitting method converge? The spectral radius criterion answers this in principle — $\rho(H) < 1$ is necessary and sufficient — but the spectral radius is hard to compute directly for large sparse matrices. We need structural conditions on $A$ itself that guarantee $\rho(H) < 1$ without requiring us to compute eigenvalues. Two natural classes of matrices admit such conditions: strictly diagonally dominant matrices, where convergence follows from Gershgorin's theorem, and symmetric positive definite matrices, where convergence follows from a deeper algebraic argument.
### Diagonal Dominance
[definition: Strictly Diagonally Dominant Matrix]
A matrix $A \in \mathbb{R}^{n \times n}$ is **strictly diagonally dominant by rows** if
\begin{align*}
|a_{ii}| > \sum_{j \neq i} |a_{ij}| \qquad \text{for all } i = 1, \ldots, n.
\end{align*}
It is **strictly diagonally dominant by columns** if $A^\top$ is strictly diagonally dominant by rows.
[/definition]
Strict diagonal dominance prevents zero eigenvalues:
[quotetheorem:1388]
[citeproof:1388]
Nonsingularity alone is not sufficient to guarantee convergence of the iterative methods — the diagonal of $A$ must genuinely dominate its off-diagonal entries in a quantitative sense. The next theorem shows that the same Gershgorin argument that keeps $0$ out of the spectrum of $A$ also keeps all eigenvalues of the iteration matrix inside the unit disk.
[quotetheorem:1389]
[citeproof:1389]
The diagonal dominance condition is sufficient but not necessary for convergence. There exist diagonally balanced matrices for which Gauss-Seidel diverges. As an instructive counterexample, consider the non-SPD matrix
\begin{align*}
A = \begin{pmatrix} 1 & 2 \\ -1 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.
\end{align*}
Here $|a_{11}| = 1 < |a_{12}| = 2$, so $A$ is not strictly diagonally dominant. The Gauss-Seidel iteration matrix is $H_{GS} = -(L_0 + D)^{-1}U_0$; a direct computation gives $H_{GS} = \begin{pmatrix} 0 & -2 \\ 0 & 2 \end{pmatrix}$, whose eigenvalues are $0$ and $2$. Since $\rho(H_{GS}) = 2 > 1$, Gauss-Seidel diverges. The failure here is not symmetry but the lack of any dominance structure.
### Symmetric Positive Definite Matrices
A sharper result is available for symmetric positive definite (SPD) systems. It relies on a general algebraic criterion.
[quotetheorem:1390]
[citeproof:1390]
The Householder-John theorem is an algebraic master key: by choosing $B$ appropriately, it yields convergence criteria for both Gauss-Seidel and Jacobi as immediate corollaries. The condition $A - B - B^\top \succ 0$ has a clean interpretation: it means the symmetric part of $A - B$ is positive definite, which ensures that the splitting does not over-correct the solution. When $B$ is the lower-triangular factor, this condition reduces to positivity of the diagonal — a consequence of positive definiteness of $A$.
[quotetheorem:1391]
This is the main convergence result for Gauss-Seidel: positive definiteness alone suffices, without any diagonal dominance requirement. The theorem explains why Gauss-Seidel is the smoother of choice in multigrid methods applied to elliptic PDE systems — those systems always produce SPD matrices after appropriate sign adjustment.
[proof]
Apply the Householder-John theorem with $B = U_0$ (the strictly upper-triangular part of $A$), so that $A - B = L_0 + D$, which is the Gauss-Seidel splitting matrix. Then
\begin{align*}
A - B - B^\top = A - U_0 - U_0^\top = A - U_0 - L_0 = D,
\end{align*}
where the last step uses $A = L_0 + D + U_0$ and $U_0^\top = L_0$ (symmetry of $A$). Since $A$ is symmetric positive definite, all its diagonal entries are positive, so $D$ is positive definite. The Householder-John theorem now applies with both $A$ and $A - B - B^\top = D$ positive definite, giving $\rho(H_{GS}) < 1$.
[/proof]
The Jacobi method does not automatically converge for SPD matrices; positive definiteness of the diagonal is necessary but not sufficient, since the Jacobi iteration matrix also depends on the off-diagonal structure.
[quotetheorem:1392]
[citeproof:1392]
The condition $2D - A \succ 0$ means each diagonal entry satisfies $2a_{ii} > \sum_j a_{ij}$, which is equivalent to saying the diagonal entries are large relative to the row sums of $A$. For the five-point Poisson matrix, the diagonal is $4$ and each row sum is at most $0$ (the off-diagonals are $-1$), so $2D - A$ has all diagonal entries equal to $8$ and the same off-diagonals $1$ — it is diagonally dominant, hence positive definite. In contrast, for matrices where the diagonal is close to the row sum (as in certain reaction–diffusion problems with small diagonal), the condition $2D - A \succ 0$ may fail and Jacobi diverges even though Gauss-Seidel converges.
These results apply directly to the linear systems arising from finite difference discretisations of elliptic PDEs, such as the five-point Poisson system analysed in Chapter 1.
[quotetheorem:1393]
[citeproof:1393]
## Stationary vs Non-Stationary Iterations
The splitting methods studied so far are **stationary**: the same matrix $H$ and vector $v$ are applied at every step, regardless of the current iterate. Their simplicity is both a strength and a limitation. The convergence rate is determined by $\rho(H)$, which is fixed throughout the iteration, and the asymptotic error reduction per step is at best $\rho(H)$. For many problems arising from discretised PDEs, $\rho(H)$ is close to $1$ (it behaves like $1 - O(h^2)$ for mesh spacing $h$), making stationary methods impractically slow.
**Non-stationary** methods choose a different update rule at each step, typically by computing the new iterate as an element of a growing Krylov subspace
\begin{align*}
\mathcal{K}_k(A, r^{(0)}) = \operatorname{span}\!\left\{r^{(0)}, Ar^{(0)}, A^2r^{(0)}, \ldots, A^{k-1}r^{(0)}\right\},
\end{align*}
where $r^{(0)} = b - Ax^{(0)}$ is the initial residual. The conjugate gradient method (for SPD systems) and GMRES (for general systems) are the canonical examples. These methods can achieve convergence in far fewer iterations than stationary methods, particularly when combined with preconditioning, but their analysis requires tools from polynomial approximation theory and is deferred to the next section.
The SOR (Successive Over-Relaxation) method occupies an intermediate position: it is stationary but introduces a relaxation parameter $\omega$ that can be tuned to accelerate convergence beyond what Gauss-Seidel achieves, at the cost of a more delicate analysis. It is the subject of the next section.
[remark: Practical Stopping Criteria]
In practice, one never iterates until exact convergence. Instead, the iteration is stopped when the relative residual $\|r^{(k)}\| / \|r^{(0)}\|$ falls below a prescribed tolerance $\varepsilon$ (typically $10^{-6}$ to $10^{-12}$). The number of iterations required to achieve this is approximately $\log(1/\varepsilon) / \log(1/\rho(H))$, so a spectral radius of $0.99$ requires roughly $10 \log(1/\varepsilon)$ times as many iterations as a spectral radius of $0.9$. This makes the choice of splitting, or the use of preconditioning, genuinely critical.
[/remark]
## Relaxation
The splitting methods covered in the previous section — Jacobi, Gauss-Seidel, SOR — all produce an iteration matrix $H$ whose spectral radius $\rho(H)$ governs the rate of convergence. A natural question is: given a convergent method with iteration matrix $H$, can we systematically modify the scheme to reduce $\rho(H)$ further? The answer is yes, through a technique called **relaxation**.
The idea is to blend the new iterate $\widehat{x}^{(k+1)} = Hx^{(k)} + v$ with the current iterate $x^{(k)}$, using a scalar parameter $\omega$ to control the blend. If $\omega > 1$ we extrapolate past the new iterate (over-relaxation); if $\omega < 1$ we take a more cautious step toward it (under-relaxation).
[definition: Relaxation Method]
Let $x^{(k+1)} = Hx^{(k)} + v$ be a convergent iterative method. The **relaxation method** with parameter $\omega \in \mathbb{R}$ proceeds in two steps. First, compute the unrelaxed update:
\begin{align*}
\widehat{x}^{(k+1)} := Hx^{(k)} + v.
\end{align*}
Then set:
\begin{align*}
x^{(k+1)} &= \omega\,\widehat{x}^{(k+1)} + (1-\omega)x^{(k)} \\
&= H_\omega\,x^{(k)} + \omegav,
\end{align*}
where the **relaxed iteration matrix** is
\begin{align*}
H_\omega = \omega H + (1-\omega)I,
\end{align*}
and $\omega$ is called the **relaxation parameter**. The eigenvalues of $H_\omega$ are related to those of $H$ by $\lambda_\omega = \omega\lambda + (1-\omega)$.
[/definition]
The choice $\omega = 1$ recovers the original method exactly. For $\omega > 1$ we have over-relaxation, and for $\omega < 1$ we have under-relaxation. The method converges when $\rho(H_\omega) < 1$, and our goal is to choose $\omega$ to minimise
\begin{align*}
\rho(H_\omega) = \max\{|\omega\lambda + (1-\omega)| : \lambda \in \sigma(H)\}.
\end{align*}
### The Optimal Relaxation Parameter
In general $\sigma(H)$ is not known explicitly, but when some information about the spectrum is available it can be exploited to find a good value of $\omega$.
[example: Optimal Relaxation for Real Spectrum]
Suppose $\sigma(H) \subset [\alpha, \beta]$ where $-1 < \alpha < \beta < 1$, so all eigenvalues are real. We seek to minimise
\begin{align*}
\max\{|\omega\lambda + (1-\omega)| : \lambda \in [\alpha, \beta]\}.
\end{align*}
For fixed $\lambda < 1$, the function $f(\omega) = \omega\lambda + (1-\omega) = \omega(\lambda - 1) + 1$ is strictly decreasing in $\omega$ (since $\lambda - 1 < 0$). As $\omega$ increases from $1$, the whole spectrum of $H_\omega$ shifts to the left; as $\omega$ decreases below $1$, it shifts to the right. Crucially, relaxation only translates the spectrum — it does not change its width. Therefore the spectral radius $\rho(H_\omega)$ is minimised when the image of $[\alpha, \beta]$ under $\lambda \mapsto \omega\lambda + (1-\omega)$ is **centred at the origin**.
Setting the two endpoints to be negatives of each other:
\begin{align*}
-[\omega\alpha + (1-\omega)] = \omega\beta + (1-\omega),
\end{align*}
we solve to obtain the **optimal relaxation parameter**
\begin{align*}
\omega_{\mathrm{opt}} = \frac{2}{2 - (\alpha + \beta)}.
\end{align*}
With this choice, the spectral radius of $H_{\omega_{\mathrm{opt}}}$ equals $\frac{\beta - \alpha}{2 - \alpha - \beta}$, which can be substantially smaller than $\rho(H) = \max(|\alpha|, |\beta|)$ when $\alpha + \beta > 0$.
[/example]
[remark: Relaxation Generalises SOR]
The Successive Over-Relaxation (SOR) method introduced in Section A is precisely the relaxation technique applied to the Gauss-Seidel iteration matrix $H_{GS}$. The parameter $\omega$ in SOR is this relaxation parameter, which explains why SOR with the optimal $\omega$ can converge much faster than plain Gauss-Seidel.
[/remark]
## Multigrid Methods
Relaxation can improve the constant in front of the convergence rate, but for large systems arising from PDE discretisations it does not address a more fundamental bottleneck. To understand this bottleneck — and the idea that resolves it — we need to look at how the error behaves in frequency space.
### Frequency Analysis of Jacobi Iteration
[example: Jacobi on the 1D Poisson Equation]
Consider the boundary value problem
\begin{align*}
\begin{cases}
u'' = f, \\
u(0) = u(1) = 0.
\end{cases}
\end{align*}
Discretising on a uniform grid $\Omega_h = \{x_n = nh : n = 1, \ldots, m\}$ with $h = 1/(m+1)$ via the three-point formula gives $Au = b$ with
\begin{align*}
A = \begin{bmatrix}
2 & -1 & & \\
-1 & 2 & \ddots & \\
& \ddots & \ddots & -1 \\
& & -1 & 2
\end{bmatrix} \in \mathbb{R}^{m \times m}.
\end{align*}
The Jacobi iteration matrix for this system is $H = I - \frac{1}{2}A$. Applying relaxation with parameter $\omega$ gives the relaxed Jacobi iteration
\begin{align*}
x^{(n+1)} = \underbrace{\left(I - \frac{\omega}{2}A\right)}_{H_\omega} x^{(n)} + \omega D^{-1}b.
\end{align*}
Since $H_\omega$ is a TST (tridiagonal symmetric Toeplitz) matrix, its eigenvectors and eigenvalues are explicitly known:
\begin{align*}
w^k = \left[\sin\!\left(\frac{ik\pi}{m+1}\right)\right]_{i=1}^{m}, \qquad \lambda_k(\omega) = 1 - 2\omega\sin^2\!\frac{k\pi}{2(m+1)}, \quad k = 1, \ldots, m.
\end{align*}
Taking $\omega = 1/2$, the eigenvalues simplify to $\lambda_k = \cos^2\!\frac{k\pi}{2(m+1)}$, which are positive and **decreasing** in $k$. The spectral radius is $\rho(H_\omega) = \lambda_1 = \cos^2\!\frac{\pi}{2(m+1)} \approx 1 - \frac{\pi^2}{4m^2}$, which is very close to $1$ for large $m$ — so overall convergence is slow.
However, the error decomposes along the eigenvector basis as
\begin{align*}
e^{(\nu)} = \sum_{k=1}^m a_k^{(\nu)} w^k, \qquad |a_k^{(\nu)}| = |\lambda_k(\omega)|^\nu |a_k^{(0)}|.
\end{align*}
Since $\lambda_k$ decreases with $k$, **high-frequency error components** (those with large $k$) are damped much faster than low-frequency ones. For $k$ close to $m$, the eigenvalue $\lambda_k$ is close to $0$, and those components decay rapidly. For $k$ close to $1$, $\lambda_k \approx 1$, and those components decay almost not at all.
[/example]
This frequency-dependent behaviour motivates a precise definition of what counts as a high-frequency component relative to a given grid.
[definition: High Frequency Components]
Let $\Omega_h$ be a uniform grid on $[0,1]$ with step $h = 1/(m+1)$, so grid indices run from $1$ to $m$. We say that index $k \in \{1, \ldots, m\}$ is **high frequency** with respect to $\Omega_h$ if
\begin{align*}
kh \geq \frac{1}{2},
\end{align*}
i.e., $k \geq m/2$. The remaining indices, with $kh < 1/2$, are called **low frequency**.
[/definition]
[remark: Eigenvalue Structure for TST Systems]
For systems arising from standard PDE discretisations, the iteration matrix is typically TST and the eigenvalues take the form $\lambda_k = \gamma - \delta\sin^2\!\frac{k\pi}{2(m+1)}$ with $\gamma, \delta > 0$. In this case the high-frequency components (large $k$) always correspond to the smallest eigenvalues, hence the fastest-converging modes. This makes the above definition robust across a broad class of problems.
[/remark]
### The Multigrid Observation
Returning to the Jacobi iteration with $\omega = 1/2$, the decay rate for the high-frequency components is at least
\begin{align*}
\mu_* = |\lambda_{(m+1)/2}| = 1 - \sin^2(\pi/4) = \frac{1}{2}.
\end{align*}
After $\nu$ iterations, the high-frequency coefficients satisfy
\begin{align*}
|a_k^{(\nu)}| \leq \left(\frac{1}{2}\right)^\nu |a_k^{(0)}|,
\end{align*}
so they are reduced by half with each iteration. This is called the **smoothing property** of Jacobi: after a few iterations, the error is dominated entirely by smooth (low-frequency) components.
The key observation driving the multigrid idea is the following: **the low-frequency components on the fine grid $\Omega_h$ become high-frequency components on the coarser grid $\Omega_{2h}$**. Concretely, if $k \in \left(\frac{1}{4h}, \frac{1}{2h}\right)$ is a low-frequency index on $\Omega_h$, then $k(2h) \geq 1/2$, so this same $k$ is high frequency on $\Omega_{2h}$.
This suggests the following strategy: apply a few smoothing iterations on $\Omega_h$ to eliminate the high-frequency error, then transfer the residual to the coarser grid $\Omega_{2h}$ where the remaining smooth error becomes rough, apply smoothing there, and continue down a hierarchy of grids.
### Grid Transfer Operators
To move between grids, we need two operators.
[definition: Restriction Operator]
The **restriction operator** $I_h^{2h} : \mathbb{R}^m \to \mathbb{R}^{m'}$ transfers a residual from the fine grid $\Omega_h$ to the coarse grid $\Omega_{2h}$, where $m' \approx m/2$. A standard choice is **full weighting**:
\begin{align*}
(I_h^{2h}\, r)_j = \frac{1}{4}\left(r_{2j-1} + 2r_{2j} + r_{2j+1}\right), \qquad j = 1, \ldots, m'.
\end{align*}
A simpler alternative is **injection**, which simply takes every other fine-grid value: $(I_h^{2h}\,r)_j = r_{2j}$.
[/definition]
[definition: Prolongation Operator]
The **prolongation operator** (or **interpolation operator**) $I_{2h}^h : \mathbb{R}^{m'} \to \mathbb{R}^m$ transfers a correction from the coarse grid $\Omega_{2h}$ back to the fine grid $\Omega_h$. Standard linear interpolation gives:
\begin{align*}
(I_{2h}^h\,e)_{2j} &= e_j, \\
(I_{2h}^h\,e)_{2j-1} &= \frac{1}{2}(e_{j-1} + e_j),
\end{align*}
for $j = 1, \ldots, m'$. The prolongation is the transpose of full-weighting restriction up to scaling.
[/definition]
### Coarse-Grid Correction
With these operators in hand, we can describe the fundamental two-grid correction step. Suppose we have a current approximation $x^{(k)}$ on the fine grid. After applying $\nu_1$ smoothing iterations, the residual $r = b - Ax$ is smooth. We transfer it to the coarse grid, solve the **coarse-grid residual equation**, and use the coarse-grid correction to update the fine-grid solution.
[definition: Coarse-Grid Correction]
Given a fine-grid approximation $x$ with residual $r = b - A_hx$, the **coarse-grid correction** proceeds as follows:
1. **Restrict**: compute the coarse-grid right-hand side $r_{2h} = I_h^{2h}\,r$.
2. **Solve**: solve the coarse-grid equation $A_{2h}\,e_{2h} = r_{2h}$ for the error $e_{2h}$.
3. **Prolongate**: compute the fine-grid correction $e_h = I_{2h}^h\,e_{2h}$.
4. **Update**: set $x \leftarrow x + e_h$.
Here $A_{2h}$ is the coarse-grid operator, which can be either the discretisation of the PDE on $\Omega_{2h}$, or formed by the **Galerkin condition** $A_{2h} = I_h^{2h}\,A_h\,I_{2h}^h$.
[/definition]
The coarse-grid solve in step 2 can itself be solved approximately — by recursion, applying the same two-grid strategy on $\Omega_{2h}$ with an even coarser grid $\Omega_{4h}$.
### The V-Cycle Algorithm
A multigrid V-cycle applies this idea recursively across a hierarchy of grids
\begin{align*}
\Omega_h \subset \Omega_{2h} \subset \Omega_{4h} \subset \cdots \subset \Omega_{2^j h},
\end{align*}
going down through increasingly coarse grids (the "down-stroke" of the V) and then back up (the "up-stroke").
[definition: Multigrid V-Cycle]
Let $J$ be the number of grid levels. A single **V-cycle** $\mathrm{V}(x, b, h)$ is defined recursively:
**Base case** (coarsest grid): solve $A_{2^J h}\,x = b$ directly (e.g. by Cholesky).
**Recursive step** on grid $\Omega_{h'}$:
1. **Pre-smoothing**: apply $\nu_1$ iterations of the smoother (e.g. Jacobi or Gauss-Seidel) to $A_{h'}x = b$.
2. **Restrict**: compute $r_{2h'} = I_{h'}^{2h'}(b - A_{h'}x)$.
3. **Coarse-grid correction**: set $e_{2h'} = 0$, then call $\mathrm{V}(e_{2h'}, r_{2h'}, 2h')$ recursively.
4. **Prolongate and correct**: update $x \leftarrow x + I_{2h'}^{h'}\,e_{2h'}$.
5. **Post-smoothing**: apply $\nu_2$ further iterations of the smoother.
[/definition]
[explanation: Why the V-Cycle Works]
The V-cycle exploits the complementary strengths of two components. The smoother (Jacobi or Gauss-Seidel) is highly effective at eliminating high-frequency error but nearly useless against low-frequency error — its convergence factor for smooth modes is very close to $1$. The coarse-grid correction is highly effective at capturing smooth error because those modes, when restricted to $\Omega_{2h}$, are representable on the coarser grid and are high-frequency relative to it. Together, these two operations attack the full error spectrum.
On the coarsest grid, where there are only $O(1)$ unknowns, a direct solve costs $O(1)$ operations. At each finer level, the number of unknowns roughly doubles, but the recursive solve requires only one V-cycle on the next coarser grid. This telescoping structure makes the total work per V-cycle proportional to the number of fine-grid unknowns, giving $O(m)$ complexity. For comparison, classical iterative methods like Gauss-Seidel require $O(m^2)$ iterations to reduce the error by a fixed factor on an $m$-point grid.
[/explanation]
[remark: Multigrid Convergence Rates]
Under appropriate conditions on the smoother and the grid transfer operators, the convergence factor of a V-cycle is bounded by a constant $q < 1$ that is **independent of the grid spacing $h$**. This is the fundamental advantage of multigrid: where Gauss-Seidel has convergence factor $\rho \approx 1 - \pi^2/m^2 \to 1$ as $m \to \infty$, multigrid achieves a fixed reduction per cycle regardless of how fine the grid is. In practice, $q \approx 0.1$ to $0.3$ with standard choices of $\nu_1 = \nu_2 = 1$ or $2$ smoothing steps.
[/remark]
[example: Multigrid on a Two-Level Grid]
Return to the 1D Poisson problem with $m = 7$ fine-grid interior points ($h = 1/8$) and $m' = 3$ coarse-grid interior points ($2h = 1/4$). Starting from an initial guess $x^{(0)} = 0$, the initial error $e^{(0)}$ can be written as a superposition of all seven eigenvectors $w^1, \ldots, w^7$.
After $\nu_1 = 2$ pre-smoothing steps with Jacobi ($\omega = 1/2$), the high-frequency components ($k = 4, 5, 6, 7$) are each reduced by a factor of at most $(1/2)^2 = 1/4$. The remaining error is concentrated in modes $k = 1, 2, 3$.
The residual is then restricted to the coarse grid $\Omega_{2h}$. On this grid, the mode $k = 3$ corresponds to index $k(2h) = 3/4 \geq 1/2$, placing it in the high-frequency regime. The coarse-grid solve exactly captures the error in modes $k = 1, 2, 3$ (up to the accuracy of the restriction and prolongation operators). After prolongation and correction, the low-frequency error on $\Omega_h$ is substantially reduced.
One to two post-smoothing steps then clean up any high-frequency artifacts introduced by the prolongation, leaving an error that is smaller by a mesh-independent factor in all modes.
[/example]
The multigrid idea extends naturally to higher dimensions and to other elliptic PDE operators. In two dimensions, the fine grid $\Omega_h$ has $O(m^2)$ unknowns, and a V-cycle costs $O(m^2)$ operations with an $h$-independent convergence factor — making multigrid one of the most computationally efficient solvers available for elliptic problems.
## Steepest Descent and Conjugate Gradient Methods
The iterative methods studied in Sections A and B — Jacobi, Gauss-Seidel, SOR, and multigrid — are all motivated by splitting or smoothing ideas. We now change perspective entirely. The key observation is that solving $Ax = b$ with $A$ symmetric positive definite is equivalent to minimizing a convex quadratic function. This variational reformulation opens the door to optimization-based iterative methods, culminating in the conjugate gradient method — one of the most important algorithms in scientific computing.
### The Quadratic Minimization Formulation
When $A$ is symmetric positive definite, solving the linear system $Ax = b$ is equivalent to minimizing the quadratic functional
\begin{align*}
F(x) := \frac{1}{2}\langle x, Ax \rangle - \langle b, x \rangle,
\end{align*}
where $\langle u, v \rangle = u^\top v$ is the Euclidean inner product. The following proposition makes this precise.
[quotetheorem:1394]
[citeproof:1394]
A useful rewriting of $F$ relates it directly to the error $e = x^* - x$. Define the $A$-norm of a vector $y \in \mathbb{R}^n$ by
\begin{align*}
\|y\|_A := \langle y, Ay \rangle^{1/2} = \sqrt{y^\top A y}.
\end{align*}
Then one can verify the identity
\begin{align*}
F(x) = \frac{1}{2}\|x^* - x\|_A^2 - \frac{1}{2}b^\top A^{-1} b.
\end{align*}
The second term is a constant, so minimizing $F$ is the same as minimizing the $A$-norm of the error. This observation will guide the convergence analysis throughout this section.
### The Steepest Descent Method
The most direct approach to minimizing $F$ is to move, at each step, in the direction of steepest decrease of $F$ — that is, along the negative gradient.
[definition: Gradient Descent Method]
The **gradient descent method** (or steepest descent method) for minimizing $F(x) = \frac{1}{2}\langle x, Ax \rangle - \langle b, x \rangle$ produces iterates
\begin{align*}
x^{(k+1)} = x^{(k)} - \alpha_k \nabla F(x^{(k)}),
\end{align*}
where $\alpha_k > 0$ is the step size chosen at each iteration.
[/definition]
To apply this concretely, we need the gradient of $F$. A direct calculation shows that the gradient is minus the residual.
[quotetheorem:1395]
[citeproof:1395]
This identification of the gradient with the negative residual is conceptually important: it says that moving in the direction of steepest descent of $F$ is the same as moving in the direction of the residual vector $r^{(k)} = b - Ax^{(k)}$. A large residual means the current iterate is far from satisfying the system, and the gradient descent step moves toward correcting this imbalance. The step size $\alpha_k$ then controls how far to travel in this direction, which the next result optimizes exactly.
The gradient descent iteration thus takes the form $x^{(k+1)} = x^{(k)} + \alpha_k r^{(k)}$: we move in the direction of the residual. The residual is large when the current iterate is far from the solution, so this is geometrically sensible.
There are two natural strategies for choosing $\alpha_k$: a **constant step size** $\alpha_k = \alpha$, or an **exact line search** which chooses the optimal step at each iteration.
[definition: Exact Line Search]
In the exact line search strategy, the step size $\alpha_k$ is chosen to minimize $F$ exactly along the current search direction $d^{(k)}$:
\begin{align*}
\alpha_k = \arg\min_{\alpha > 0} F\left(x^{(k)} + \alpha d^{(k)}\right).
\end{align*}
[/definition]
Because $F$ is quadratic, the function $\alpha \mapsto F(x^{(k)} + \alpha d^{(k)})$ is a one-dimensional quadratic in $\alpha$, and its minimum can be found in closed form.
[quotetheorem:1396]
[citeproof:1396]
Applying this with $d^{(k)} = r^{(k)}$ (the steepest descent direction) gives an explicit update formula.
[quotetheorem:1397]
[citeproof:1397]
Steepest descent converges, but can be slow when the condition number $\kappa(A) = \lambda_{\max}(A)/\lambda_{\min}(A)$ is large. The difficulty is that consecutive search directions tend to be nearly orthogonal, causing the iterates to zig-zag. The conjugate gradient method resolves this by choosing search directions that are mutually $A$-orthogonal.
## Conjugate Directions and the CG Algorithm
The steepest descent method has a fundamental geometric flaw: consecutive search directions are always $A$-orthogonal to each other (one can verify this from the exact line search formula), which forces the iterates to zig-zag rather than advance directly toward the solution. On an elongated ellipsoid — which is what the level sets of $F$ look like when $\kappa(A)$ is large — this zig-zag path can be arbitrarily inefficient. The remedy is to give up the requirement that we always move along the gradient, and instead choose search directions that are $A$-orthogonal to all previous directions. This ensures that each direction is used exactly once and that progress in one direction is never undone by a later step.
[example: Zig-Zag Failure of Steepest Descent]
Consider $A = \begin{pmatrix} 10 & 0 \\ 0 & 1 \end{pmatrix}$ and $b = \mathbf{0}$, so $x^* = \mathbf{0}$. The level sets of $F(x) = \frac{1}{2}(10x_1^2 + x_2^2)$ are ellipses with semi-axes in ratio $1:\sqrt{10}$. Starting from $x^{(0)} = (1, 1)^\top$, the gradient is $\nabla F = (10, 1)^\top$, which points nearly along the $x_1$-axis. After one exact line search step, $x^{(1)}$ lies near the $x_2$-axis. The new gradient is nearly along $x_1$ again, and the method alternates between two nearly orthogonal directions indefinitely, converging at rate $\rho = (\kappa - 1)/(\kappa + 1) = 9/11 \approx 0.82$ per step. In contrast, conjugate gradient solves this $2 \times 2$ system exactly in $2$ steps because the two conjugate directions span all of $\mathbb{R}^2$.
[/example]
### Conjugate Directions
The key idea is to generalize the steepest descent scheme: instead of always moving along the negative gradient, we allow an arbitrary sequence of search directions, but require them to be $A$-orthogonal to each other. This ensures that each direction is "used exactly once" and that no work is wasted.
[definition: Conjugate Directions]
Vectors $u, v \in \mathbb{R}^n$ are **conjugate** with respect to a symmetric positive definite matrix $A$ if they are nonzero and satisfy the $A$-orthogonality condition
\begin{align*}
\langle u, v \rangle_A := \langle u, Av \rangle = u^\top A v = 0.
\end{align*}
A set $\{d^{(0)}, d^{(1)}, \ldots, d^{(n-1)}\}$ is called a set of pairwise conjugate directions if $\langle d^{(i)}, d^{(j)} \rangle_A = 0$ for all $i \neq j$.
[/definition]
Ordinary orthogonality (with respect to the Euclidean inner product) is the special case $A = I$. For a general SPD matrix, conjugacy is orthogonality in the inner product defined by $A$. Since $A$ is positive definite, any set of pairwise $A$-orthogonal nonzero vectors is linearly independent, so we can have at most $n$ conjugate directions in $\mathbb{R}^n$.
The geometric significance of conjugacy appears in how the error evolves when we take an exact line search step. Writing $e^{(k)} = x^* - x^{(k)}$ for the error vector, and using the exact line search formula:
[quotetheorem:1398]
[citeproof:1398]
Geometrically, $e^{(k+1)}$ is the $A$-orthogonal projection of $e^{(k)}$ onto the hyperplane perpendicular to $d^{(k)}$ in the $A$-inner product. Each step eliminates the component of the error in one direction, and if the directions are pairwise conjugate, the errors eliminated do not overlap.
### Finite Termination
The following theorem is the cornerstone result for conjugate direction methods: with $n$ pairwise conjugate directions in $\mathbb{R}^n$, the method terminates in at most $n$ steps.
[quotetheorem:1399]
[citeproof:1399]
The challenge is how to generate the conjugate directions. The conjugate gradient method provides an elegant answer: generate them from the residuals using a Gram–Schmidt-like procedure in the $A$-inner product.
### The Conjugate Gradient Algorithm
The conjugate gradient method builds pairwise conjugate directions incrementally by $A$-orthogonalizing each new residual against all previous directions.
[definition: Conjugate Gradient Method (Gram–Schmidt Form)]
The **conjugate gradient method** starting from $x^{(0)}$ proceeds as follows.
Initialize: $d^{(0)} = r^{(0)} = b - Ax^{(0)}$.
At each step $k \geq 0$, as long as $r^{(k)} \neq 0$:
\begin{align*}
d^{(k)} &= r^{(k)} - \sum_{i < k} \frac{\langle r^{(k)}, d^{(i)} \rangle_A}{\langle d^{(i)}, d^{(i)} \rangle_A} d^{(i)}, \\
\alpha_k &= \frac{\langle r^{(k)}, d^{(k)} \rangle}{\langle d^{(k)}, Ad^{(k)} \rangle}, \\
x^{(k+1)} &= x^{(k)} + \alpha_k d^{(k)}, \\
r^{(k+1)} &= b - Ax^{(k+1)}.
\end{align*}
[/definition]
As written, this requires computing $k$ inner products at step $k$ (one for each previous direction), which grows with $k$. A crucial structural theorem shows that many of these inner products are zero, reducing the algorithm to a three-term recurrence.
### Krylov Subspaces and Properties of CG
[definition: Krylov Subspace]
Let $A \in \mathbb{R}^{n \times n}$ and $v \in \mathbb{R}^n$. The $m$-th **Krylov subspace** of $A$ with respect to $v$ is
\begin{align*}
K_m(A, v) = \operatorname{span}\{A^i v\}_{i=0}^{m-1} = \operatorname{span}\{v, Av, A^2v, \ldots, A^{m-1}v\}.
\end{align*}
[/definition]
The Krylov subspace captures the information accessible to a polynomial-in-$A$ method starting from $v$. The following theorem reveals that the CG iterates live in Krylov subspaces of $A$ and $r^{(0)}$, and that the residuals and search directions inherit strong orthogonality properties.
[quotetheorem:1400]
[citeproof:1400]
Property (3) means that, at step $k$, all Gram–Schmidt correction terms with index $i < k-1$ vanish (because $\langle r^{(k)}, d^{(i)} \rangle_A = 0$ for $i < k-1$, by properties (1) and (2)). Only the correction against the immediately preceding direction $d^{(k-1)}$ is needed. This collapses the Gram–Schmidt sum to a single term.
[definition: Conjugate Gradient Method (Standard Form)]
The **conjugate gradient method** in its standard, computationally efficient form starts from $x^{(0)} = 0$, $r^{(0)} = b$, $d^{(0)} = r^{(0)}$, and iterates:
\begin{align*}
\alpha_k &= \frac{\langle r^{(k)}, d^{(k)} \rangle}{\|d^{(k)}\|_A^2}, \\
x^{(k+1)} &= x^{(k)} + \alpha_k d^{(k)}, \\
r^{(k+1)} &= r^{(k)} - \alpha_k Ad^{(k)}, \\
d^{(k+1)} &= r^{(k+1)} - \frac{\langle r^{(k+1)}, d^{(k)} \rangle_A}{\|d^{(k)}\|_A^2} d^{(k)}.
\end{align*}
[/definition]
This is a genuine three-term recurrence: each step requires only one matrix-vector product ($Ad^{(k)}$) and a fixed number of inner products, independent of $k$. The cost per iteration is $O(n^2)$ for a dense matrix (dominated by the matrix-vector product), or $O(\mathrm{nnz})$ for sparse systems. This is the form actually implemented in practice.
## Convergence Analysis of the Conjugate Gradient Method
The finite termination property tells us CG converges in at most $n$ steps, but for large problems we stop long before $n$ iterations — the question is how rapidly the error decreases in the early iterations. The answer depends on the eigenvalue distribution of $A$, and the analysis proceeds in two stages: first, we show that the number of steps required is controlled by the number of distinct eigenvalues; then, we derive a quantitative convergence rate using the theory of Chebyshev polynomials. Together these results explain why CG is dramatically faster than steepest descent for well-conditioned or clustered spectra.
### Finite Termination in Terms of Eigenvalues
The finite termination property says CG converges in at most $n$ steps. The following result sharpens this: it is the number of distinct eigenvalues of $A$ that controls termination.
[quotetheorem:1401]
[citeproof:1401]
[example: Rapid Termination When Eigenvalues Cluster]
Suppose $A \in \mathbb{R}^{100 \times 100}$ has only $3$ distinct eigenvalues. Then the conjugate gradient method solves $Ax = b$ exactly in at most $3$ iterations, regardless of the system size.
This explains why CG is particularly effective when the eigenvalues of $A$ cluster into a small number of distinct groups — as happens, for instance, when $A$ is a discretization of a differential operator that has been well preconditioned.
[/example]
### Error Bounds via Chebyshev Polynomials
While finite termination guarantees exactness in $n$ steps, for large problems we need bounds on the error after $k \ll n$ steps. The polynomial characterization of CG provides the right framework.
[quotetheorem:1402]
[citeproof:1402]
To turn this into a computable bound, we bound $\|P_k(A)e^{(0)}\|_A$ by the maximum of $|P_k(\lambda)|$ over the spectrum of $A$, and then choose $P_k$ to minimize this maximum. This is exactly a Chebyshev approximation problem.
[quotetheorem:1403]
[citeproof:1403]
The bound shows that CG converges geometrically at rate $\rho = (\sqrt{L} - \sqrt{l})/(\sqrt{L} + \sqrt{l})$. Expressing this in terms of the condition number $\kappa(A) = L/l$:
[definition: Condition Number of a Symmetric Positive Definite Matrix]
For a symmetric positive definite matrix $A$, the **condition number** is
\begin{align*}
\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.
\end{align*}
[/definition]
Substituting $L = \lambda_{\max}$ and $l = \lambda_{\min}$, the convergence rate becomes
\begin{align*}
\rho = \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1},
\end{align*}
so the error bound takes the form $\|e^{(k)}\|_A \leq 2\rho^k \|e^{(0)}\|_A$.
[example: Condition Number and Convergence Rate]
If $\kappa(A) = 100$, the convergence rate is $\rho = (10-1)/(10+1) = 9/11 \approx 0.818$. To reduce the error by a factor of $10^{-6}$, we need roughly $k \geq 6 \log(10)/\log(11/9) \approx 75$ iterations.
If instead $\kappa(A) = 10^4$, then $\rho \approx 0.98$ and the same reduction requires about $750$ iterations. A well-conditioned system converges dramatically faster, which motivates preconditioning.
[/example]
[remark: Square-Root Improvement over Steepest Descent]
Steepest descent converges at rate $(\kappa(A) - 1)/(\kappa(A) + 1) \approx 1 - 2/\kappa(A)$, while CG converges at rate $(\sqrt{\kappa(A)} - 1)/(\sqrt{\kappa(A)} + 1) \approx 1 - 2/\sqrt{\kappa(A)}$. For large condition numbers, CG requires roughly $\sqrt{\kappa(A)}$ times fewer iterations. This is a substantial gain: a problem with $\kappa(A) = 10^6$ requires $10^6$ steepest descent steps but only $10^3$ CG steps to achieve comparable accuracy.
[/remark]
## Preconditioning
### Motivation
The convergence rate of CG depends critically on the condition number $\kappa(A)$. For systems arising from discretizations of elliptic PDEs, $\kappa(A)$ typically grows as $h^{-2}$ when the mesh spacing is $h$, meaning that finer discretizations produce worse-conditioned systems. Preconditioning addresses this by transforming the system into an equivalent one with a smaller condition number before applying CG.
The ideal preconditioner would make $\kappa(\hat{A}) = 1$, i.e., $\hat{A} = I$. This requires $\hat{A} = PAP^\top = I$, which means $P = A^{-1/2}$. Since computing $A^{-1/2}$ is equivalent to solving the original problem, we cannot achieve $\kappa = 1$ directly. The art of preconditioning is finding $P$ such that $\hat{A} = PAP^\top$ has small condition number while $P$ is cheap to apply.
### The Preconditioned System
[definition: Preconditioned Conjugate Gradient Method]
To solve $Ax = b$ with $A$ symmetric positive definite via **preconditioning**:
(i) Choose a nonsingular matrix $P$ such that $\hat{A} := PAP^\top$ has small condition number.
(ii) Change variables $x = P^\top \hat{x}$ to obtain the equivalent system
\begin{align*}
PAP^\top \hat{x} = Pb \iff \hat{A}\hat{x} = \hat{b},
\end{align*}
where $\hat{b} = Pb$.
(iii) Apply the conjugate gradient method to the preconditioned system $\hat{A}\hat{x} = \hat{b}$.
(iv) Recover the original solution via $x = P^\top \hat{x}$.
The matrix $P$ is called the **preconditioner**.
[/definition]
The preconditioned system inherits the symmetry and positive definiteness of the original.
[quotetheorem:1404]
[citeproof:1404]
[remark: Error Propagation Under Preconditioning]
Suppose the CG iterations on the preconditioned system satisfy $\|\hat{x}^{(k)} - \hat{x}^*\| < \varepsilon$. Recovering the original solution via $x^{(k)} = P^\top \hat{x}^{(k)}$ gives
\begin{align*}
\|x^{(k)} - x^*\| = \|P^\top(\hat{x}^{(k)} - \hat{x}^*)\| \leq \|P\| \cdot \|\hat{x}^{(k)} - \hat{x}^*\| \leq \|P\|\,\varepsilon.
\end{align*}
Small error in the preconditioned system translates directly to small error in the original system, up to the factor $\|P\|$.
[/remark]
### Common Preconditioners
In practice, $P$ is not formed explicitly. Instead, one works with the matrix $M = P^\top P$ (the **preconditioner matrix**), and the algorithm requires only the ability to solve systems of the form $Mz = r$. The convergence rate of the preconditioned CG depends on $\kappa(M^{-1}A)$ rather than $\kappa(A)$.
Several families of preconditioners are commonly used:
**Diagonal (Jacobi) preconditioning.** Take $M = \operatorname{diag}(a_{11}, a_{22}, \ldots, a_{nn})$. This is trivial to apply but only effective when diagonal entries dominate.
**Incomplete LU and incomplete Cholesky.** Compute a sparse approximate factorization $A \approx L L^\top$ (for SPD $A$) by dropping fill-in entries, and set $M = LL^\top$. Applying $M^{-1}$ amounts to two sparse triangular solves.
**SSOR preconditioning.** The Symmetric Successive Over-Relaxation splitting gives $M = (D + \omega L)D^{-1}(D + \omega L)^\top$, which connects naturally to the SOR iteration studied in Section B.
**Multigrid preconditioning.** One or a few multigrid V-cycles (from Section B) can serve as a single preconditioner application. Multigrid-preconditioned CG is among the most powerful algorithms for elliptic PDE systems, achieving convergence independent of the mesh spacing.
The closer $M^{-1}A$ is to the identity matrix, the smaller $\kappa(M^{-1}A)$ and the faster the convergence. The ideal is $M = A$, but applying $M^{-1} = A^{-1}$ exactly would solve the problem directly. Effective preconditioners balance approximation quality against the cost of applying $M^{-1}$.
[example: Preconditioned CG for a Discrete Laplacian]
Consider the linear system $Ax = b$ arising from a five-point finite difference discretization of the Laplacian on an $N \times N$ grid, as studied in Chapter 1. The system has $n = N^2$ unknowns. The matrix $A$ has condition number $\kappa(A) = O(N^2) = O(n)$.
Without preconditioning, CG requires $O(\sqrt{\kappa(A)}) = O(N)$ iterations to reduce the error by a fixed factor, for a total cost of $O(N^3) = O(n^{3/2})$.
With an incomplete Cholesky preconditioner, $\kappa(M^{-1}A) = O(N)$, reducing iteration count to $O(\sqrt{N}) = O(n^{1/4})$ and total cost to $O(n^{5/4})$.
With an optimal multigrid preconditioner, $\kappa(M^{-1}A) = O(1)$, so the iteration count is bounded independently of $N$. Each iteration costs $O(n)$, giving total cost $O(n)$ — optimal among iterative methods.
[/example]
This concludes Chapter 4 on iterative methods for linear systems. We have developed a complete toolkit: classical splitting methods (Jacobi, Gauss-Seidel, SOR), geometric multigrid for elliptic problems, and Krylov subspace methods culminating in preconditioned CG. Each method reflects a different strategy — algebraic decomposition, multilevel smoothing, or polynomial optimization over subspaces — and together they cover the landscape of practical large-scale linear solvers.
Preconditioning reduces the condition number of the system matrix, directly accelerating iterative convergence. Understanding why a preconditioner works requires analyzing the spectrum of the preconditioned operator — the eigenvalues and eigenvectors that determine convergence rates and stability boundaries for the iterations themselves.
# 5. Eigenvalues and Eigenvectors
## The Eigenvalue Problem
Many problems in science and engineering reduce to finding eigenvalues and eigenvectors of a matrix. Structural engineers compute natural frequencies of oscillation, stability analysts ask whether small perturbations grow or decay, and data scientists seek principal components of large datasets — all of these questions have the same algebraic core: given $A \in \mathbb{C}^{n \times n}$, find scalars $\lambda$ and nonzero vectors $x$ satisfying $Ax = \lambda x$. Chapter 4 developed iterative solvers for linear systems; this chapter turns to iterative methods for the eigenvalue problem, beginning with algorithms that converge to a single eigenvalue-eigenvector pair.
Throughout this chapter, $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{C}^n$, defined by
\begin{align*}
\|x\|^2 = x^* x = \sum_{i=1}^n |x_i|^2.
\end{align*}
The adjoint (conjugate transpose) of a vector or matrix is written with a superscript star, so $x^* \in \mathbb{C}^{1 \times n}$ is the row vector with entries $\overline{x_i}$.
A key geometric quantity throughout is the distance from a unit vector to a one-dimensional subspace. For unit vectors $x, w \in \mathbb{C}^n$ (with $\|x\| = \|w\| = 1$), this distance satisfies
\begin{align*}
\operatorname{dist}(x, \operatorname{span}(w))^2 = \min_{\alpha \in \mathbb{C}} \|x - \alpha w\|^2 = 1 - |w^* x|^2,
\end{align*}
attained at $\alpha = w^* x$. This formula converts convergence of inner products to convergence of directions, which is what the power method naturally produces.
## The Power Method
Given a matrix $A$, we need its dominant eigenvalue — the one with largest modulus. The obvious approach is to use the characteristic polynomial $\det(A - \lambda I) = 0$, but for large matrices this polynomial is numerically catastrophic: tiny perturbations in its coefficients can cause the roots to jump wildly, a phenomenon made precise by Wilkinson's polynomial. We need a method that does not touch the characteristic polynomial at all. As an illustration of the difficulty: for the matrix
\begin{align*}
A = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix},
\end{align*}
the two eigenvalues are equal in magnitude, and any scalar multiple of any vector is "dominated" — no single direction is preferred. This is the degenerate case where the method below cannot converge to a unique direction; knowing when such a failure can occur is just as important as knowing when convergence is guaranteed.
The key observation that sidesteps the characteristic polynomial is this: if $A$ has a unique largest eigenvalue $\lambda_1$ in modulus, then $A^k$ amplifies the $\lambda_1$-component of any starting vector exponentially faster than all other components. After normalising, the result converges to the corresponding eigenvector — and the characteristic polynomial never appears.
[definition: Power Method]
Let $A \in \mathbb{C}^{n \times n}$ be a matrix whose eigenvalues we wish to estimate. The **power method** is the following iterative algorithm:
1. Choose a nonzero vector $x^{(0)} \in \mathbb{C}^n$.
2. For $k = 0, 1, 2, \ldots$, set
\begin{align*}
y^{(k)} &= A x^{(k)}, \\
x^{(k+1)} &= \frac{y^{(k)}}{\|y^{(k)}\|}.
\end{align*}
[/definition]
The renormalisation step ensures $\|x^{(k)}\| = 1$ at every iteration, preventing numerical overflow and keeping the iterates on the unit sphere. The next theorem quantifies how fast this sequence converges.
[quotetheorem:1405]
The theorem says convergence is geometric with ratio $\rho$, but the choice to measure distance to a span rather than to a specific vector deserves a word of explanation.
[remark: On the span formulation]
The eigenvectors $w_i$ need not be orthogonal — they are just the eigenvectors of the diagonalizable matrix $A$. The theorem measures distance to the span of $w_1$ rather than to $w_1$ itself because eigenvectors are only defined up to scalar multiples: the power method might converge to $+w_1$ or $-w_1$ or $e^{i\theta}w_1$, and we cannot predict which. Since $\|x^{(k)}\| = 1$ by construction, having $\operatorname{dist}(x^{(k)}, \operatorname{span}(w_1)) \to 0$ does in fact force $x^{(k)} \to e^{i\theta}w_1$ for some phase $e^{i\theta}$.
[/remark]
[proof]
Since $A$ is diagonalizable, an explicit formula for the iterates is available. Note that
\begin{align*}
x^{(k)} = \frac{A^k x^{(0)}}{\|A^k x^{(0)}\|}.
\end{align*}
Expanding $x^{(0)} = \sum_i c_i w_i$ and applying $A^k$ term by term gives
\begin{align*}
A^k x^{(0)} = \sum_{i=1}^n c_i \lambda_i^k w_i = c_1 \lambda_1^k \Bigl(w_1 + v^{(k)}\Bigr),
\end{align*}
where $v^{(k)} = \sum_{i=2}^n (c_i/c_1)(\lambda_i/\lambda_1)^k w_i$. Because $|\lambda_i/\lambda_1| \leq \rho < 1$ for $i \geq 2$, each term in $v^{(k)}$ decays geometrically, so $\|v^{(k)}\| = O(\rho^k)$ and $\|w_1 + v^{(k)}\| \to 1$. After normalising, $x^{(k)} = e^{i\theta_k}(w_1 + v^{(k)})/\|w_1 + v^{(k)}\|$ for some phase $e^{i\theta_k}$. Using $\operatorname{dist}(x^{(k)}, \operatorname{span}(w_1))^2 = 1 - |w_1^* x^{(k)}|^2$, one computes
\begin{align*}
|w_1^* x^{(k)}|^2 = \frac{|1 + w_1^* v^{(k)}|^2}{\|w_1 + v^{(k)}\|^2} = 1 + \frac{|w_1^* v^{(k)}|^2 - \|v^{(k)}\|^2}{\|w_1 + v^{(k)}\|^2}.
\end{align*}
Since $\|v^{(k)}\| = O(\rho^k)$ and $|w_1^* v^{(k)}| \leq \|w_1\|\|v^{(k)}\| = O(\rho^k)$ by the Cauchy–Schwarz inequality, the numerator is $O(\rho^{2k})$ and the denominator tends to 1. Therefore $\operatorname{dist}(x^{(k)}, \operatorname{span}(w_1))^2 = O(\rho^{2k})$, which gives $\operatorname{dist}(x^{(k)}, \operatorname{span}(w_1)) = O(\rho^k)$.
[/proof]
The convergence rate $\rho = |\lambda_2/\lambda_1|$ tells us everything about the practical performance of the power method. When the two largest eigenvalues are nearly equal in magnitude, $\rho$ is close to 1 and convergence is very slow. The degenerate example from the section opening — $A = 2I$ — illustrates the worst case: $|\lambda_1| = |\lambda_2| = 2$, so $\rho = 1$ and the iterate $x^{(k)}$ never moves at all. A subtler failure: if $A$ is real and nonsymmetric with a complex conjugate pair of eigenvalues at the spectral radius, then $|\lambda_1| = |\lambda_2|$ and the iterates oscillate between two directions without converging.
The condition $c_1 \neq 0$ in the theorem is also a potential source of degeneracy. If the starting vector $x^{(0)}$ happens to be orthogonal to the dominant eigenvector, then $c_1 = 0$ and the power method converges to $w_2$ instead — or to whatever eigenvector has the next largest eigenvalue for which the initial coefficient is nonzero. In finite-precision arithmetic rounding errors typically introduce a nonzero $c_1$ component automatically, but this is not something to rely on in practice.
## The Rayleigh Quotient
The convergence theorem tells us that $x^{(k)}$ approaches the dominant eigenvector, but it does not directly yield the corresponding eigenvalue $\lambda_1$. The Rayleigh quotient provides a natural way to extract an eigenvalue estimate from any approximate eigenvector.
[definition: Rayleigh Quotient]
Let $A \in \mathbb{C}^{n \times n}$ and let $x \in \mathbb{C}^n$ be a nonzero vector. The **Rayleigh quotient** of $A$ at $x$ is
\begin{align*}
r(x) = \frac{x^* A x}{x^* x}.
\end{align*}
[/definition]
If $x$ is an exact eigenvector with $Ax = \lambda x$, then $r(x) = \lambda$. For a general unit vector, the Rayleigh quotient has a variational characterisation: it equals the scalar $\alpha$ that minimises $\|Ax - \alpha x\|$, making it the best scalar approximation to $A$ in the direction of $x$.
[quotetheorem:1406]
[citeproof:1406]
Combining the power method with the Rayleigh quotient gives both a sequence of approximate eigenvectors and a sequence of approximate eigenvalues. The variational characterisation also explains why the Rayleigh quotient is the right statistic: it is not merely a formula that happens to recover $\lambda$, but the unique scalar that minimises the residual $\|Ax - \alphax\|$. Any other scalar would produce a larger residual and therefore a worse eigenvalue estimate.
[quotetheorem:1407]
This follows from the fact that $x^{(k)} \to w_1$ at rate $O(\rho^k)$ and $r$ is a continuous function of $x$ near an eigenvector. The Rayleigh quotient is therefore a natural companion to the power method, providing an eigenvalue estimate from the eigenvector iterates at essentially no additional cost.
There is a subtlety for non-Hermitian matrices. For a Hermitian matrix $A$, the Rayleigh quotient $r(x)$ is always real and the eigenvalues of $r$ are stationary values (critical points) of $r$ on the unit sphere, a property that underpins the variational characterisation of eigenvalues. For a non-Hermitian matrix, the Rayleigh quotient is generally complex and $r(x)$ is no longer a real-valued function on the sphere; in this setting the power method is still applicable, but the Rayleigh quotient loses its variational interpretation and one should treat it simply as a scalar estimate rather than a variational quantity. For this reason, the most reliable application of Rayleigh quotient iteration (Section below) is in the Hermitian setting.
## Inverse Iteration
The power method converges to the eigenvector associated with the eigenvalue of largest magnitude. What if we want the eigenvalue closest to some target value $s \in \mathbb{R}$? The key observation is that the eigenvalues of $(A - sI)^{-1}$ are $(\lambda_i - s)^{-1}$, where $\lambda_i$ are the eigenvalues of $A$, and the eigenvectors are unchanged. If $\lambda$ is the eigenvalue of $A$ closest to $s$, then $(\lambda - s)^{-1}$ is the largest in magnitude among all $(\lambda_i - s)^{-1}$, so the power method applied to $(A - sI)^{-1}$ converges to the corresponding eigenvector.
[definition: Inverse Iteration]
Let $A \in \mathbb{C}^{n \times n}$ and let $s \in \mathbb{C}$ be a shift such that $A - sI$ is invertible. **Inverse iteration** with shift $s$ is the following algorithm:
1. Choose a nonzero vector $x^{(0)} \in \mathbb{C}^n$.
2. For $k = 0, 1, 2, \ldots$, solve the linear system
\begin{align*}
(A - sI)y^{(k)} = x^{(k)}
\end{align*}
and set $x^{(k+1)} = y^{(k)} / \|y^{(k)}\|$.
[/definition]
The definition does not say anything about how to implement the solve efficiently. In practice, inverting $A - sI$ at each step would be prohibitive; instead, one exploits its structure.
[remark: Inverse iteration as power method]
Inverse iteration is exactly the power method applied to $(A - sI)^{-1}$. Each iteration requires solving a linear system with matrix $A - sI$ rather than computing $(A - sI)^{-1}$ explicitly. In practice one factorises $A - sI$ once (e.g.\ using LU factorisation) and solves with the same factors at each step, making the per-iteration cost $O(n^2)$ after an initial $O(n^3)$ factorisation.
[/remark]
The convergence rate of inverse iteration inherits directly from the power method via this identification, with the spectral ratio now measuring how separated the target eigenvalue is from its nearest neighbour relative to $s$.
[quotetheorem:1408]
[citeproof:1408]
The practical power of inverse iteration lies in the choice of shift $s$. If $s$ is close to $\lambda$, then $|\lambda - s| \ll |\lambda' - s|$ and $\rho \approx 0$, giving extremely rapid convergence. A shift that is within $\varepsilon$ of $\lambda$ produces $\rho \approx \varepsilon/|\lambda' - s|$, which can be made as small as desired. This suggests a natural improvement: update the shift at each step using the best available eigenvalue estimate.
[example: Interior eigenvalue via inverse iteration]
Suppose $A \in \mathbb{R}^{4 \times 4}$ has eigenvalues $\lambda_1 = 10, \lambda_2 = 3, \lambda_3 = 1, \lambda_4 = -2$ and we want $\lambda_2 = 3$. The power method converges to $\lambda_1 = 10$ and cannot reach $\lambda_2$ directly. With shift $s = 3.1$, the distances to the shift are $|10 - 3.1| = 6.9$, $|3 - 3.1| = 0.1$, $|1 - 3.1| = 2.1$, $|-2 - 3.1| = 5.1$. The eigenvalue closest to $s$ is $\lambda_2 = 3$, and the convergence rate is $\rho = 0.1/2.1 \approx 0.048$. Even with this modest shift, each iteration reduces the error by roughly a factor of 20.
[/example]
## Rayleigh Quotient Iteration
The discussion so far leads to a natural synthesis. Inverse iteration converges fastest when the shift $s$ is close to the target eigenvalue, and the Rayleigh quotient provides an improving estimate of the eigenvalue from the current iterate. Combining them — updating the shift at every step — produces Rayleigh quotient iteration.
[definition: Rayleigh Quotient Iteration]
**Rayleigh quotient iteration** is the following algorithm: choose a nonzero $x^{(0)}$ and for $k = 0, 1, 2, \ldots$:
1. Compute the shift $s_k = r(x^{(k)}) = (x^{(k)})^* A x^{(k)}$ (using the fact that $\|x^{(k)}\| = 1$).
2. Solve $(A - s_k I)y = x^{(k)}$.
3. Set $x^{(k+1)} = y/\|y\|$.
[/definition]
The cost is higher than fixed-shift inverse iteration because the coefficient matrix changes at each step, requiring a new factorisation at each iteration. This $O(n^3)$ per-step cost is often acceptable because the algorithm converges in very few iterations.
[remark: Cubic convergence]
Rayleigh quotient iteration exhibits **cubically convergent** behaviour: the error satisfies
\begin{align*}
\|e^{(k+1)}\| = O\bigl(\|e^{(k)}\|^3\bigr).
\end{align*}
This means that once the iterates are close to an eigenpair, three digits of accuracy become nine in the next step, and nine become twenty-seven after that. Such convergence is qualitatively different from the linear convergence of the power method or fixed-shift inverse iteration, and explains why Rayleigh quotient iteration typically converges in three to five iterations from a reasonable starting guess. This cubic rate is established on Example Sheet 4.
[/remark]
[explanation: Comparison of the four methods]
The four algorithms in this section form a hierarchy based on the cost per iteration versus the convergence rate.
The **power method** costs $O(n^2)$ per iteration (one matrix-vector product) and converges at rate $\rho = |\lambda_2/\lambda_1|$. It is cheap but may be slow when eigenvalues cluster near the spectral radius, and it only finds the dominant eigenvalue.
The **Rayleigh quotient** adds negligible cost to the power method (one inner product) and improves the eigenvalue estimate from $O(\rho^k)$ to the same asymptotic rate, but with better constants.
**Inverse iteration** with a fixed shift $s$ costs $O(n^3)$ for the initial LU factorisation and $O(n^2)$ per subsequent iteration. It converges at rate $\rho = |\lambda - s|/|\lambda' - s|$, which can be made arbitrarily small by choosing $s$ close to the target eigenvalue. It can find any eigenvalue, not just the dominant one.
**Rayleigh quotient iteration** costs $O(n^3)$ per iteration but achieves cubic convergence. The high per-step cost is offset by the extremely small number of iterations required. For symmetric (Hermitian) matrices, this method is particularly reliable because the Rayleigh quotient is guaranteed to converge to an actual eigenvalue.
In practice, inverse iteration with a fixed shift (or a shift updated infrequently) is the method of choice when a good initial estimate is available, as in eigenvalue refinement after an approximate solution has been found. Rayleigh quotient iteration is used when the highest possible accuracy per iteration count is needed and the $O(n^3)$ cost is acceptable.
[/explanation]
## Simultaneous Iteration
The power method and Rayleigh quotient iteration each target a single eigenvalue at a time. When all eigenvalues are needed, running these methods repeatedly would be wasteful and numerically fragile. Simultaneous iteration is the natural generalisation: instead of iterating a single vector, we iterate a matrix whose columns span a $p$-dimensional subspace, letting each column track a different eigenvalue concurrently.
Throughout this section we assume $A \in \mathbb{R}^{n \times n}$ is symmetric, so eigenvectors belonging to distinct eigenvalues are automatically orthogonal. We write $|\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_n|$ for the eigenvalues and $w_1, \ldots, w_n$ for corresponding orthonormal eigenvectors.
[definition: Simultaneous Iteration]
Let $X^{(0)} \in \mathbb{R}^{n \times p}$ have orthonormal columns. For $k = 0, 1, 2, \ldots$, define the iteration:
1. Set $Y = A X^{(k)}$.
2. Compute the QR factorisation $Y = X^{(k+1)} R$, where $X^{(k+1)}$ has orthonormal columns and $R$ is upper triangular.
[/definition]
At each step the columns of $X^{(k)}$ span a $p$-dimensional subspace; the QR factorisation reorthonormalises after multiplication by $A$, preventing the columns from collapsing onto the dominant eigenvector. The following proposition identifies exactly what $X^{(k)}$ computes.
[quotetheorem:1409]
[citeproof:1409]
[explanation: Connections to the power method and inverse iteration]
The proposition has two instructive special cases.
**Connection to the power method.** Since $R^{(k)}$ is upper triangular, the first column of $X^{(k)} R^{(k)}$ is $R^{(k)}_{11}$ times the first column of $X^{(k)}$. Therefore the first column of $A^k X^{(0)}$ is a scalar multiple of the first column $X^{(k)}_1$ of $X^{(k)}$. After normalisation this gives
\begin{align*}
X^{(k)}_1 = \frac{A^k X^{(0)}_1}{\|A^k X^{(0)}_1\|},
\end{align*}
which is precisely the power method applied to the initial vector $X^{(0)}_1$.
**Connection to inverse iteration.** Take $p = n$ so that $X^{(0)}$ is square. From $X^{(k)} R^{(k)} = A^k X^{(0)}$ we obtain
\begin{align*}
A^{-k} X^{(0)} = X^{(k)} \bigl(R^{(k)}\bigr)^{-\top} \cdot \bigl(R^{(k)}\bigr)^{\top} \bigl(R^{(k)}\bigr)^{-\top}.
\end{align*}
More directly: since $(R^{(k)})^{-\top}$ is lower triangular, the last column of $A^{-k} X^{(0)}$ is a multiple of the last column of $X^{(k)}$. By normalisation,
\begin{align*}
X^{(k)}_n = \frac{A^{-k} X^{(0)}_n}{\|A^{-k} X^{(0)}_n\|},
\end{align*}
which is inverse iteration (with shift $s = 0$) applied to $X^{(0)}_n$.
These observations confirm that simultaneous iteration interpolates continuously between forward and inverse iteration as one moves from the first to the last column.
[/explanation]
### Convergence of Simultaneous Iteration
The convergence analysis mirrors that of the power method, with the ratio $|\lambda_{p+1}/\lambda_p|$ playing the role of the convergence factor. We first establish a technical lemma bounding the component of $X^{(k)}$ lying outside the dominant $p$-dimensional eigenspace.
[quotetheorem:1410]
[citeproof:1410]
The lemma says that the columns of $X^{(k)}$ become increasingly orthogonal to $V$, i.e., they become increasingly confined to the span of $w_1, \ldots, w_p$. The convergence theorem follows by applying the lemma inductively to every prefix of columns.
[quotetheorem:1411]
[citeproof:1411]
[remark: Spectral Gap Assumption]
The strict inequalities $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$ guarantee a nonzero spectral gap between every pair of consecutive eigenvalues. The rate of convergence for the $i$-th column is governed by $|\lambda_{i+1}/\lambda_i|^k$. If two consecutive eigenvalues have equal magnitude, the corresponding columns do not converge to definite eigenvectors, though the subspace they span still converges.
[/remark]
---
## The QR Algorithm
Simultaneous iteration requires maintaining $n$ separate orthonormal column vectors and passing them through $A$ at each step. The formulation is somewhat unwieldy: we carry an $n \times n$ orthogonal matrix $X^{(k)}$ and update it at every iteration. Can we reformulate the same computation entirely in terms of a single evolving matrix, without explicitly tracking $X^{(k)}$? The answer is yes, and the resulting algorithm — the QR algorithm — is the workhorse of practical eigenvalue computation.
Taking $X^{(0)} = I$ in simultaneous iteration, the sequence $(X^{(k)})^\top A X^{(k)}$ converges to a diagonal matrix of eigenvalues. It turns out that this process is equivalent to a purely matrix-level algorithm that never references the orthogonal accumulation $X^{(k)}$ directly.
[definition: Basic QR Iteration]
Let $A^{(0)} = A$. For $k = 0, 1, 2, \ldots$, define the iteration:
1. Compute the QR factorisation $A^{(k)} = QR$.
2. Set $A^{(k+1)} = RQ$.
[/definition]
The key observation is the identity $A^{(k+1)} = RQ = Q^\top(QR)Q = Q^\top A^{(k)} Q$, so each step is an orthogonal similarity transformation. Since similar matrices share all eigenvalues, the sequence $(A^{(k)})$ is isospectral. The following proposition makes the connection to simultaneous iteration precise.
[quotetheorem:1412]
[citeproof:1412]
[explanation: What the QR algorithm does geometrically]
The QR algorithm can be understood as simultaneous iteration in disguise. Each step applies $A$ to the current basis $X^{(k)}$ and then reorthonormalises via QR. By working with $A^{(k)} = (X^{(k)})^\top A X^{(k)}$ instead of $X^{(k)}$ directly, the algorithm avoids storing large orthogonal matrices and instead tracks the eigenvalue information in the evolving similarity transform. As $k \to \infty$, the off-diagonal entries of $A^{(k)}$ tend to zero and the diagonal entries converge to the eigenvalues $\lambda_1, \ldots, \lambda_n$.
The matrix $A^{(k)}$ remains symmetric throughout (since $A^{(k+1)} = Q^\top A^{(k)} Q$ preserves symmetry), so for symmetric $A$ the sequence converges to a diagonal matrix. For non-symmetric $A$, the general theory shows convergence to a quasi-upper-triangular Schur form.
[/explanation]
### Shifts in the QR Algorithm
As with inverse iteration, introducing shifts accelerates convergence dramatically by reducing the effective spectral ratio.
[definition: Shifted QR Iteration]
Let $A^{(0)} = A$. For $k = 0, 1, 2, \ldots$, choose a shift $s_k$ (e.g. $s_k = A^{(k)}_{nn}$, the bottom-right entry) and define:
1. Compute the QR factorisation $A^{(k)} - s_k I = QR$.
2. Set $A^{(k+1)} = RQ + s_k I = Q^\top A^{(k)} Q$.
[/definition]
The shift $s_k = A^{(k)}_{nn}$ is the Rayleigh quotient estimate for the smallest eigenvalue (when $A$ is symmetric), and once $A^{(k)}_{nn}$ is close to an eigenvalue $\lambda_n$, convergence of the last row to zero becomes cubic. Correspondingly, at the level of matrices there is an equivalent shifted simultaneous iteration: set $X^{(0)} = I$ and at each step compute the QR factorisation of $(A - s_k I) X^{(k)}$ to obtain $X^{(k+1)}$; the proposition above extends to show that $A^{(k)} = (X^{(k)})^\top A X^{(k)}$ still holds.
[example: QR Iteration on a 2x2 Symmetric Matrix]
Let
\begin{align*}
A = \begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix}.
\end{align*}
The eigenvalues are $\lambda_1 = 2 + \sqrt{2} \approx 3.414$ and $\lambda_2 = 2 - \sqrt{2} \approx 0.586$. The convergence ratio for basic QR iteration is $|\lambda_2/\lambda_1| \approx 0.172$. After one step:
- Compute QR of $A$: $A = QR$ with $Q$ orthogonal and $R$ upper triangular.
- Form $A^{(1)} = RQ = Q^\top A Q$.
The off-diagonal entry of $A^{(1)}$ is scaled by $|\lambda_2/\lambda_1| \approx 0.172$ relative to the off-diagonal entry of $A^{(0)}$. With the Rayleigh quotient shift $s_0 = A_{22} = 1$, the shifted matrix $A - I$ has eigenvalues $1 + \sqrt{2}$ and $1 - \sqrt{2}$, so the new convergence ratio is
\begin{align*}
\left|\frac{1-\sqrt{2}}{1+\sqrt{2}}\right| = \frac{\sqrt{2}-1}{\sqrt{2}+1} = (\sqrt{2}-1)^2 = 3 - 2\sqrt{2} \approx 0.172.
\end{align*}
In this particular example the shift $s_0 = 1$ does not improve the convergence ratio: the ratio $(3 - 2\sqrt{2}) \approx 0.172$ equals the unshifted ratio $\lambda_2/\lambda_1 \approx 0.172$. This is because $s_0 = 1$ is not close to either eigenvalue ($\lambda_2 \approx 0.586$), so the shift does not concentrate the effective spectral gap. Meaningful acceleration occurs when the shift is close to the eigenvalue being deflated — for example, using the Wilkinson shift, which is chosen to be a root of the characteristic polynomial of the lower-right $2 \times 2$ submatrix.
[/example]
---
## Deflation
Suppose we have found one eigenvalue $\lambda$ of $A$ by the power method or Rayleigh quotient iteration. If we now run the power method again from a new starting vector, there is no guarantee it will converge to a different eigenvalue: the dominant eigenvalue is still $\lambda_1$, and the algorithm will return to the same eigenvector. A concrete illustration: for $A = \operatorname{diag}(5, 3, 1)$ with $\lambda_1 = 5$ already found, rerunning the power method still converges to the eigenvector $e_1$ because the eigenvalue 5 dominates all others. The found eigenvalue interferes with every subsequent iteration.
The question is: after finding one eigenvalue-eigenvector pair $(\lambda, w)$, can we modify $A$ to eliminate $\lambda$ entirely, leaving a smaller matrix whose eigenvalues are exactly the remaining ones? This is deflation, and it is what converts a single-eigenpair method into a complete algorithm for all eigenvalues.
[definition: Deflation]
Given a square matrix $A$ of order $n$ with a known eigenvalue $\lambda$ and eigenvector $w$, **deflation** is the construction of an $(n-1) \times (n-1)$ matrix $B$ such that the eigenvalues of $B$ are the eigenvalues of $A$ other than $\lambda$.
The construction proceeds by finding a nonsingular matrix $S$ such that $Sw = ce_1$ for some nonzero scalar $c$; then the first column of $SAS^{-1}$ is $\lambda e_1$, and the block structure
\begin{align*}
SAS^{-1} = \begin{pmatrix} \lambda & h^\top \\ \mathbf{0} & B \end{pmatrix}
\end{align*}
shows that the eigenvalues of $SAS^{-1}$ are $\lambda$ together with those of $B$.
[/definition]
The following observation justifies why a similarity transformation preserves eigenvalues.
[quotetheorem:1413]
[citeproof:1413]
The condition we need on $S$ is simply $Sw = ce_1$, which ensures $(SAS^{-1})e_1 = \lambdae_1$ so the first column of $SAS^{-1}$ has the required form.
### Deflation for Symmetric Matrices
For symmetric matrices the right choice of $S$ is a Householder reflection, which is orthogonal and therefore preserves symmetry: $HAH^\top$ is again symmetric, and the block $\tilde{A}$ in the lower-right corner is symmetric too.
[definition: Deflation via Householder Reflection]
Let $A \in \mathbb{R}^{n \times n}$ be symmetric with eigenvector $w$ and eigenvalue $\lambda$. Define
\begin{align*}
u_i = \begin{cases} w_1 + \operatorname{sgn}(w_1)\|w\|_2 & i = 1, \\ w_i & i \neq 1, \end{cases}
\end{align*}
and form the Householder reflection
\begin{align*}
H = I - 2\frac{uu^\top}{\|u\|_2^2}.
\end{align*}
Then $H$ is orthogonal ($H^{-1} = H^\top$), and
\begin{align*}
HAH^\top = \begin{pmatrix} \lambda & \mathbf{0}^\top \\ \mathbf{0} & \tilde{A} \end{pmatrix}
\end{align*}
for some symmetric $(n-1) \times (n-1)$ matrix $\tilde{A}$.
[/definition]
We verify the two key properties in turn.
[quotetheorem:1414]
[citeproof:1414]
With the eigenvector mapped to a scalar multiple of $e_1$, the block structure of $HAH^\top$ now follows.
[quotetheorem:1415]
[citeproof:1415]
[remark: Symmetric Deflation Gives Clean Block Structure]
Because $H$ is orthogonal, symmetric deflation produces a matrix with zero off-diagonal blocks in both the first row and first column. This means $\tilde{A}$ itself is a symmetric matrix with eigenvalues $\{\lambda_i\} \setminus \{\lambda\}$, and eigenvectors of $\tilde{A}$ can be lifted back to eigenvectors of $A$ using $H^\top$.
[/remark]
### Deflation for Non-Symmetric Matrices
For non-symmetric matrices, orthogonal transformations may not be available or natural. An elementary matrix constructed from the eigenvector components serves as a simpler alternative.
[definition: Deflation of Non-Symmetric Matrices]
Let $A \in \mathbb{R}^{n \times n}$ (not necessarily symmetric) have eigenvector $w$ with eigenvalue $\lambda$, and assume $w_1 \neq 0$. Construct the $n \times n$ matrix $S$ which agrees with the identity except that its first column has entries $S_{i1} = -w_i/w_1$ for $i = 2, \ldots, n$. The inverse $S^{-1}$ agrees with the identity except that $(S^{-1})_{i1} = w_i/w_1$ for $i = 2, \ldots, n$.
Then
\begin{align*}
SAS^{-1} = \begin{pmatrix} \lambda & h^\top \\ \mathbf{0} & B \end{pmatrix}
\end{align*}
for some $(n-1) \times (n-1)$ matrix $B$, whose eigenvalues are the eigenvalues of $A$ other than $\lambda$.
[/definition]
The block structure claimed in the definition requires a short verification that $S$ does indeed place $\lambda$ in the top-left corner.
[quotetheorem:1416]
[citeproof:1416]
[remark: Non-Symmetric Deflation Cannot Recover Eigenvectors]
Unlike the symmetric case, the first row of $SAS^{-1}$ is $(\lambda, h^\top)$ with $h$ generally nonzero. The matrix is block upper triangular, not block diagonal. This is sufficient for computing eigenvalues (the characteristic polynomial factors as $(\mu - \lambda)\det(\mu I - B)$), but $h \neq \mathbf{0}$ means eigenvectors of $A$ cannot be read off directly from eigenvectors of $B$.
[/remark]
The construction of $B$ can be expressed without forming $S$ explicitly.
[quotetheorem:1417]
[citeproof:1417]
[example: Non-Symmetric Deflation of a 3x3 Matrix]
Let
\begin{align*}
A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 3 & 1 \\ 0 & 0 & 4 \end{pmatrix}.
\end{align*}
The eigenvalue $\lambda = 2$ has eigenvector $w = (1, 0, 0)^\top$. Since $w_1 = 1$ and $w_2 = w_3 = 0$, the deflated rows are $B_1 = A_2 - 0 \cdot A_1 = (0, 3, 1)$ and $B_2 = A_3 - 0 \cdot A_1 = (0, 0, 4)$. Discarding the first (zero) component gives
\begin{align*}
B = \begin{pmatrix} 3 & 1 \\ 0 & 4 \end{pmatrix},
\end{align*}
whose eigenvalues are $3$ and $4$, as expected.
[/example]
[explanation: Using Deflation Iteratively]
Deflation becomes most useful as part of an iterative scheme. Starting from the original $n \times n$ matrix $A$:
1. Apply a single-vector method (power method, inverse iteration, or Rayleigh quotient iteration) to find an eigenpair $(\lambda, w)$.
2. Deflate $A$ to produce an $(n-1) \times (n-1)$ matrix $B$.
3. Repeat on $B$.
Each deflation reduces the problem size by one. For symmetric matrices, Householder deflation is preferred because orthogonal similarity keeps $\tilde{A}$ well-conditioned. For non-symmetric matrices the row formula is efficient but errors in the numerically computed eigenvector $w$ can cause the deflated matrix $B$ to inherit inaccuracies, so deflation must be applied with care.
In practice, the QR algorithm handles deflation automatically: when an off-diagonal entry $A^{(k)}_{i,i-1}$ drops below a tolerance threshold, the matrix is split into two independent subproblems, effectively deflating without explicitly constructing $S$.
[/explanation]
## Reduction to Tridiagonal Form for Symmetric Matrices
The QR iteration from the previous sections is already a powerful tool for computing eigenvalues, but applying it naively to a dense $n \times n$ symmetric matrix is expensive. Each QR factorisation of a general matrix costs $O(n^3)$ operations, and convergence may require many iterations. The key observation that makes practical eigenvalue algorithms efficient is that we can first reduce $A$ to a simpler form — tridiagonal — using a one-time preprocessing step, after which every subsequent QR iteration becomes dramatically cheaper.
### Why Tridiagonalisation Pays Off
For a symmetric tridiagonal matrix, the QR factorisation can be executed in $O(n)$ operations using Givens rotations, since only the subdiagonal entries need to be eliminated. Moreover, the structure is preserved:
[quotetheorem:1418]
This $O(n)$ cost is striking: a general dense matrix requires $O(n^3)$ operations for QR factorisation, so tridiagonal structure provides a factor of $n^2$ improvement. The savings come from the fact that only $n-1$ Givens rotations are needed — one to eliminate each subdiagonal entry — and each rotation touches at most two rows at a time in a matrix where all entries outside a band of width 1 are already zero.
[quotetheorem:1419]
[citeproof:1419]
The symmetry assumption is essential here. For a non-symmetric Hessenberg matrix, a QR step also preserves the Hessenberg structure, but the result is Hessenberg and not tridiagonal — there is no second half of the argument forcing the superdiagonal entries to mirror the subdiagonal. The tridiagonal preservation theorem is what makes symmetric eigenvalue algorithms so efficient: each QR step takes a tridiagonal matrix to a tridiagonal matrix, and the $O(n)$ cost per step then follows automatically.
These two facts together mean that the initial tridiagonalisation is the only $O(n^3)$ step; all subsequent QR iterations each cost only $O(n^2)$ (dominated by forming the product $RQ$), and we typically need $O(n)$ iterations to convergence, giving an overall cost of $O(n^2)$ for phase 2.
### Householder Reduction to Tridiagonal Form
The reduction itself is carried out using a sequence of Householder reflections. Recall that a Householder matrix is of the form $H = I - 2vv^\top$ for a unit vector $v$, and it is orthogonal and symmetric. The procedure is a symmetric variant of the Householder algorithm for QR factorisation.
[definition: Reduction to Tridiagonal Form (Symmetric Case)]
Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$, the **Householder tridiagonalisation** constructs an orthogonal matrix $P$ such that $P^\top A P = T$ is tridiagonal via the following steps.
**Step 1.** Let $\hat{H}_1 \in \mathbb{R}^{(n-1) \times (n-1)}$ be the Householder reflection such that
\begin{align*}
\hat{H}_1 A[2:n, 1] = \alpha_1 e_1
\end{align*}
for some scalar $\alpha_1$, where $e_1 \in \mathbb{R}^{n-1}$ is the first standard basis vector. Form the block matrix
\begin{align*}
H_1 = \begin{bmatrix} 1 & 0 \\ 0 & \hat{H}_1 \end{bmatrix} \in \mathbb{R}^{n \times n}.
\end{align*}
Since $H_1$ is orthogonal and symmetric, the two-sided similarity transformation $H_1 A H_1^\top$ zeroes out the entries below the first subdiagonal in column 1 and, by symmetry, to the right of the first superdiagonal in row 1:
\begin{align*}
H_1 A H_1^\top = \begin{bmatrix} A_{11} & (\hat{H}_1 A[2:n,1])^\top \\ \hat{H}_1 A[2:n,1] & \hat{H}_1 A[2:n,2:n] \hat{H}_1^\top \end{bmatrix}.
\end{align*}
The first column and row now have nonzero entries only in positions 1 and 2.
**Step 2.** Set $A \leftarrow H_1 A H_1^\top$ and construct $H_2 = \begin{bmatrix} I_2 & 0 \\ 0 & \hat{H}_2 \end{bmatrix}$ where $\hat{H}_2 \in \mathbb{R}^{(n-2)\times(n-2)}$ maps $A[3:n, 2]$ to a multiple of $e_1 \in \mathbb{R}^{n-2}$. The transformation $H_2 A H_2^\top$ zeroes entries below the second subdiagonal in column 2 and their symmetric counterparts in row 2, while leaving the already-achieved zeros in positions $(k, 1)$ and $(1, k)$ for $k \geq 3$ untouched.
**Induction.** Continuing for $n-2$ steps produces orthogonal matrices $H_1, \ldots, H_{n-2}$ such that
\begin{align*}
P^\top A P = T, \qquad P = H_1 H_2 \cdots H_{n-2},
\end{align*}
where $T$ is symmetric tridiagonal and $P$ is orthogonal. Since $T$ is similar to $A$, they share the same eigenvalues.
[/definition]
The algorithm as stated is efficient because each Householder step costs $O(n^2)$ operations (updating both sides of a symmetric matrix), and there are $n-2$ steps, giving a total of $O(n^3)$. The critical structural point — that two-sided transformations are what produce tridiagonal rather than Hessenberg form — is worth spelling out.
[remark: Symmetry is Essential]
The two-sided transformation $H A H^\top$ (rather than the one-sided $H A$ used in QR factorisation) is what makes the reduction tridiagonal rather than merely Hessenberg. Without symmetry, a one-sided reduction would only zero out a half of the off-tridiagonal entries; the two-sided version exploits the fact that row and column operations are mirror images of each other for symmetric $A$.
[/remark]
[example: Tridiagonalisation of a 4×4 Symmetric Matrix]
Let
\begin{align*}
A = \begin{bmatrix} 4 & 1 & -2 & 2 \\ 1 & 2 & 0 & 1 \\ -2 & 0 & 3 & -2 \\ 2 & 1 & -2 & -1 \end{bmatrix}.
\end{align*}
The first Householder step targets the subvector $A[2:4, 1] = (1, -2, 2)^\top$. Compute $\sigma = \|(1,-2,2)^\top\| = 3$, and choose the reflection $\hat{H}_1$ such that $\hat{H}_1(1,-2,2)^\top = (-3, 0, 0)^\top$. Forming $H_1 = \operatorname{diag}(1, \hat{H}_1)$ and computing $H_1 A H_1^\top$ zeros out the $(3,1), (4,1), (1,3), (1,4)$ entries. The resulting matrix has a tridiagonal first row and column. A second Householder step on the $(3:4, 2)$ block completes the reduction to a $4 \times 4$ symmetric tridiagonal matrix.
[/example]
### The Efficient QR Algorithm for Symmetric Matrices
Combining tridiagonalisation with QR iteration gives the practical algorithm.
[definition: QR Algorithm for Symmetric Matrices]
To compute the eigenvalues of a symmetric matrix $A \in \mathbb{R}^{n \times n}$:
1. **Phase 1 (one-time, $O(n^3)$):** Compute $P^\top A P = T$ using Householder tridiagonalisation.
2. **Phase 2 (iterative, $O(n^2)$ total):** Apply shifted QR iteration to $T$. At each step, choose a shift $s$ (e.g., the Wilkinson shift), form $T - sI = QR$, and update $T \leftarrow RQ + sI$. By the structure-preservation theorem, $T$ remains symmetric tridiagonal throughout.
[/definition]
The two-phase structure is the key design decision: a single expensive preprocessing step buys cheap subsequent iterations. The cost breakdown makes this precise.
[remark: Cost Analysis]
Phase 1 costs $O(n^3)$. In phase 2, each QR factorisation of the tridiagonal matrix costs $O(n)$, and forming the product $RQ$ costs $O(n^2)$. In practice, $O(n)$ iterations suffice (with good shift strategies), giving a total phase-2 cost of $O(n^2)$. The eigenvalues appear on the diagonal of $T$ after convergence.
[/remark]
The connection to the material in Sections A and B is direct: the shifted QR iteration applied to the tridiagonal matrix is exactly the algorithm analysed there, but now it runs on a structured object that makes each step $O(n^2)$ instead of $O(n^3)$.
---
## The Nonsymmetric Eigenvalue Problem
For nonsymmetric matrices the situation is more complex. The eigenvalues can be complex even when $A$ is real, and the eigenvector decomposition $A = VDV^{-1}$ may not exist (if $A$ is defective). The appropriate substitute for the spectral theorem is the Schur decomposition.
### Eigenvalue-Revealing Factorisations
[definition: Schur Decomposition]
Let $A \in \mathbb{C}^{n \times n}$. A **Schur decomposition** of $A$ is a factorisation
\begin{align*}
A = P T P^*,
\end{align*}
where $P \in \mathbb{C}^{n \times n}$ is unitary ($P^* P = I$) and $T \in \mathbb{C}^{n \times n}$ is upper triangular. The diagonal entries of $T$ are the eigenvalues of $A$ (with multiplicity). Such a decomposition always exists.
[/definition]
The Schur decomposition is more general than the eigenvalue decomposition, and it is important to understand how they relate.
[remark: Schur vs Eigenvalue Decomposition]
The eigenvalue decomposition $A = VDV^{-1}$ requires $A$ to be diagonalisable and $V$ to be invertible, which may fail. The Schur decomposition always exists and uses a unitary factor, which is numerically stable. The columns of $P$ are called **Schur vectors** and form an orthonormal basis adapted to the eigenstructure of $A$.
[/remark]
When $A$ is real, complex eigenvalues come in conjugate pairs. It is numerically and practically advantageous to stay in real arithmetic throughout. This motivates the real Schur form.
[definition: Real Schur Decomposition]
Let $A \in \mathbb{R}^{n \times n}$. A **real Schur decomposition** of $A$ is a factorisation
\begin{align*}
A = Q T Q^\top,
\end{align*}
where $Q \in \mathbb{R}^{n \times n}$ is orthogonal and $T \in \mathbb{R}^{n \times n}$ is **quasi upper-triangular**: it is block upper triangular with diagonal blocks $T_{ii}$ that are either $1 \times 1$ (corresponding to a real eigenvalue) or $2 \times 2$ (corresponding to a conjugate pair of complex eigenvalues). Every real matrix has a real Schur decomposition.
[/definition]
The $2 \times 2$ diagonal blocks in the real Schur form each have the form $\begin{bmatrix} a & b \\ -b & a \end{bmatrix}$ with $b \neq 0$, whose eigenvalues are $a \pm ib$. This lets all computation stay real.
### Hessenberg Reduction
Eigenvalue algorithms for nonsymmetric matrices also proceed in two phases, mirroring the symmetric case, but the reduced form is Hessenberg rather than tridiagonal.
[definition: Upper Hessenberg Matrix]
A matrix $H \in \mathbb{R}^{n \times n}$ is in **upper Hessenberg form** if $H_{ij} = 0$ for all $i > j+1$, that is, all entries strictly below the first subdiagonal are zero:
\begin{align*}
H = \begin{bmatrix}
* & * & * & * & * \\
* & * & * & * & * \\
0 & * & * & * & * \\
0 & 0 & * & * & * \\
0 & 0 & 0 & * & *
\end{bmatrix}.
\end{align*}
[/definition]
[explanation: Why Hessenberg and Not Tridiagonal]
For a nonsymmetric matrix, the two-sided Householder reduction $H_k A H_k^\top$ cannot simultaneously zero entries below row $k+1$ in column $k$ and their transpose counterparts, because the matrix is not symmetric. Instead, only a one-sided reduction is used in phase 1 (Householder applied from the left), which zeroes entries below the subdiagonal in each column but has no control over the corresponding row. The resulting reduced form is therefore Hessenberg rather than tridiagonal.
In the symmetric case, we used two-sided transformations $H A H^\top$, and the symmetry of $A$ automatically mirrored every zero introduced in a column into the corresponding row. The Hessenberg form is the correct analogue of tridiagonal for general matrices: it is the sparsest form achievable by an orthogonal similarity transformation on a nonsymmetric matrix.
[/explanation]
The reduction to Hessenberg form uses the same Householder algorithm described for the symmetric case, but applied one-sidedly.
[definition: Hessenberg Reduction]
Given $A \in \mathbb{R}^{n \times n}$, the **Hessenberg reduction** constructs orthogonal matrices $U_1, \ldots, U_{n-2}$ using Householder reflections such that
\begin{align*}
U^\top A U = H
\end{align*}
is upper Hessenberg, where $U = U_1 \cdots U_{n-2}$. At step $k$, the Householder matrix $U_k$ is chosen to zero out entries in positions $k+2, \ldots, n$ of column $k$ of the current matrix, without disturbing columns $1, \ldots, k-1$ already processed. The computational cost of this phase is $O(n^3)$.
[/definition]
Once $A$ has been reduced to Hessenberg form, phase 2 applies QR iteration to $H$. A single QR step on a Hessenberg matrix costs $O(n^2)$ rather than $O(n^3)$, and the Hessenberg structure is preserved under QR iteration, just as tridiagonal structure is preserved in the symmetric case. The iteration converges to the real Schur form of $H$ (and hence of $A$), revealing the eigenvalues on the diagonal or in the $2 \times 2$ blocks.
[remark: Connection to the Symmetric Case]
In the symmetric case, the Hessenberg reduction produces a tridiagonal matrix because $U_k A U_k^\top$ automatically symmetrises. In the nonsymmetric case, we get Hessenberg. The rest of the algorithm — repeated QR steps with shifts — is structurally the same in both cases, but for nonsymmetric matrices the convergence theory is more delicate and the shifts must account for complex eigenvalues.
[/remark]
[example: Hessenberg Form]
For the matrix
\begin{align*}
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix},
\end{align*}
the first Householder step targets the subvector $(4, 7)^\top$ in column 1. The reflection $\hat{H}_1$ maps $(4, 7)^\top$ to $(-\sqrt{65}, 0)^\top$, zeroing out the $(3,1)$ entry. The transformed matrix $H_1 A H_1^\top$ is upper Hessenberg (with a single subdiagonal). Since $n = 3$, one Householder step suffices, completing the reduction in $O(n^3)$ operations.
[/example]
This completes the treatment of eigenvalue algorithms. The full pipeline — tridiagonalisation or Hessenberg reduction followed by iterated QR steps with shifts — underpins virtually all practical eigenvalue solvers for dense matrices, including the LAPACK routines `dsyev` (symmetric) and `dgeev` (nonsymmetric). The QR factorisation and Householder reflections used throughout this chapter are reviewed in the Appendix (Chapter 6).
The QR algorithm and Hessenberg reduction complete our toolkit for computing the full spectrum of nonsymmetric matrices, bridging eigenvalue theory with practical computation. The Appendix formalizes the matrix decompositions, calculus facts, and structural results referenced throughout, providing a self-contained foundation for the numerical techniques developed above.
# Appendix
## Time Complexity
Throughout this course we frequently need to count operations and classify how the running time of a numerical algorithm grows with the problem size. The key idea is that not all arithmetic operations are equally expensive: modern floating-point hardware executes addition much faster than trigonometric evaluation, and we account for this by ranking the elementary operations.
The **complexity ordering** used in this course is
\begin{align*}
\text{addition, subtraction} \;<\; \text{multiplication} \;<\; \text{division} \;<\; \text{exponentiation} \;<\; \text{logarithm} \;<\; \text{trigonometric functions}.
\end{align*}
Given an algorithm, the **dominant complexity** is the most expensive type of operation that appears. The **time complexity** is determined by counting how many of the dominant operations are performed, then expressing that count as a function of $n$ (or whatever natural size parameter governs the problem), and retaining only the leading term.
[definition: Dominant Complexity]
The **dominant complexity** of an algorithm is the class of the most expensive elementary operation appearing in that algorithm, according to the ordering above. The **time complexity** is $O(g(n))$, where $g(n)$ is the leading-order count of the dominant operations as the problem size $n \to \infty$.
[/definition]
Two observations follow immediately from this definition. First, time complexity is insensitive to the absolute cost assigned to the dominant step — whether one multiplication costs $1$ second or $5$ seconds, the leading-order structure is unchanged. Second, cheaper operations (additions, subtractions) never determine the time complexity on their own; they contribute only lower-order terms.
[example: Matrix-Vector Multiplication]
Let $A \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^{n}$. Find the dominant complexity and time complexity of computing $Ax$.
Each entry of $Ax$ is an inner product of a row of $A$ with $x$, requiring $n$ multiplications and $n-1$ additions. Since there are $n$ rows, the total operation count is $n^2$ multiplications and $n(n-1)$ additions.
The dominant step is multiplication, so the dominant complexity is $n^2$ multiplications. The additions contribute the lower-order term $n^2 - n$. Whether we assign cost $1$ or cost $5$ to each multiplication, the time complexity remains $O(n^2)$.
[/example]
The principle extends immediately: matrix–matrix multiplication of two $n \times n$ matrices has dominant complexity $n^3$ (each of the $n^2$ output entries requires $n$ multiplications), giving time complexity $O(n^3)$. A back-substitution step for an $n \times n$ upper triangular system costs $O(n^2)$ multiplications plus $O(n^2)$ additions, so its time complexity is $O(n^2)$.
## TST Matrices
The acronym TST stands for **tridiagonal, symmetric, Toeplitz**. These matrices appear repeatedly in finite difference discretisations of differential operators on uniform grids, and their explicit eigenstructure is what makes Fourier-mode analysis tractable.
[definition: TST Matrix]
A matrix $A \in \mathbb{R}^{m \times m}$ is a **TST matrix** if it has the form
\begin{align*}
A = \begin{pmatrix} \alpha & \beta & & \\ \beta & \alpha & \ddots & \\ & \ddots & \ddots & \beta \\ & & \beta & \alpha \end{pmatrix}_{m \times m}
\end{align*}
for some $\alpha, \beta \in \mathbb{R}$. Equivalently, $A_{ij} = \alpha$ if $i = j$, $A_{ij} = \beta$ if $|i - j| = 1$, and $A_{ij} = 0$ otherwise.
[/definition]
The structure is fully determined by the pair $(\alpha, \beta)$. Two such matrices are $A = \alpha I + \beta T$ and $B = \alpha' I + \beta' T$, where $T$ is the fixed $m \times m$ tridiagonal matrix with $1$s on both off-diagonals and $0$s on the main diagonal. Since both $A$ and $B$ are polynomial expressions in the same matrix $T$, they commute: $AB = BA$. This algebraic fact drives the next result.
[quotetheorem:1420]
[citeproof:1420]
The eigenvectors can in fact be written down explicitly, which is the more concrete result used throughout the course.
[quotetheorem:1421]
[citeproof:1421]
[remark: Application to Discretisations]
In Chapter 1 (Poisson equation) and Chapter 2 (diffusion equation), the matrices arising from second-order finite differences on a uniform grid with Dirichlet boundary conditions are precisely TST matrices. The explicit eigenvalues $\lambda_k = \alpha + 2\beta\cos\frac{\pi k}{n+1}$ allow one to compute spectral radii, condition numbers, and stability thresholds analytically rather than numerically.
[/remark]
## Common Finite Difference Schemes
The following table collects the schemes discussed in Chapter 2 and throughout the course. Each scheme approximates a time derivative using a combination of values at the current and/or next time step.
Let $y_n \approx y(t_n)$, where $t_n = nk$ and $k > 0$ is the step size. All schemes below approximate the equation $y' = f(t, y)$.
| Name | Formula | Type |
|---|---|---|
| Forward Euler | $y_{n+1} = y_n + k\, f(t_n, y_n)$ | Explicit |
| Backward Euler | $y_{n+1} = y_n + k\, f(t_{n+1}, y_{n+1})$ | Implicit |
| Trapezoid rule | $y_{n+1} = y_n + \tfrac{k}{2}\bigl(f(t_n, y_n) + f(t_{n+1}, y_{n+1})\bigr)$ | Implicit |
| Theta method | $y_{n+1} = y_n + k\bigl(\theta\, f(t_n, y_n) + (1-\theta) f(t_{n+1}, y_{n+1})\bigr)$ | Implicit ($\theta < 1$) |
The theta method unifies the three preceding schemes: $\theta = 1$ recovers Forward Euler, $\theta = 0$ recovers Backward Euler, and $\theta = \tfrac{1}{2}$ recovers the Trapezoid rule.
[definition: Step Size]
Given a sequence of time nodes $t_n = nk$, the **step size** is the uniform spacing
\begin{align*}
k := t_{n+1} - t_n.
\end{align*}
[/definition]
For a PDE on a spatial interval $[0, 1]$ with $M$ interior grid points, the spatial step size is $h = \frac{1}{M+1}$ and grid points are $x_m = mh$. The numerical approximation is denoted $u_m^n \approx u(x_m, t_n)$.
[definition: Courant Number]
For an explicit finite difference scheme applied to the diffusion equation $u_t = u_{xx}$, the **Courant number** (or mesh ratio) is
\begin{align*}
\mu := \frac{k}{h^2} = \frac{\Delta t}{(\Delta x)^2}.
\end{align*}
[/definition]
The Forward Euler scheme for the diffusion equation reads $u_m^{n+1} = u_m^n + \mu\bigl(u_{m-1}^n - 2u_m^n + u_{m+1}^n\bigr)$. As established in Chapter 2, this scheme is stable and convergent when $\mu \leq \tfrac{1}{2}$.
## Taylor's Theorem
Taylor's theorem underlies virtually every error analysis in the course: truncation error estimates for finite difference schemes, convergence proofs for Runge–Kutta methods, and the derivation of the second central difference formula all rely on expanding smooth functions into Taylor series with explicit remainder terms.
[quotetheorem:188]
The remainder term $\frac{h^n}{n!} f^{(n)}(x + \theta h)$ is the **Lagrange form**. It is useful when one can bound $|f^{(n)}|$ uniformly, as is typical in truncation error estimates.
The following consequence is used repeatedly in finite difference derivations.
[quotetheorem:1422]
[citeproof:1422]
The same technique in two spatial dimensions gives the five-point stencil approximation to the Laplacian: for $u \in C^4$,
\begin{align*}
u(x-h,y) + u(x+h,y) + u(x,y-h) + u(x,y+h) - 4u(x,y) = h^2 \Delta u(x,y) + O(h^4),
\end{align*}
which is derived by applying the one-dimensional expansion independently in $x$ and $y$.
## QR Factorisation
Many algorithms in numerical linear algebra — including the QR algorithm for computing eigenvalues (Chapter 5) — rest on the QR factorisation of a matrix. The factorisation decomposes any matrix into an orthogonal part and an upper triangular part.
[definition: QR Factorisation]
A **QR factorisation** of a matrix $A \in \mathbb{R}^{m \times n}$ is a representation
\begin{align*}
A = QR,
\end{align*}
where $Q \in \mathbb{R}^{m \times m}$ is an orthogonal matrix (so $Q^\top Q = I_m$) and $R \in \mathbb{R}^{m \times n}$ is upper triangular.
[/definition]
[remark: Uniqueness]
The QR factorisation exists for every matrix $A$. If one additionally requires the diagonal entries of $R$ to be positive, the factorisation is unique (when $A$ has full column rank).
[/remark]
The Gram–Schmidt process provides a constructive method for computing the QR factorisation. Write $A = (a_1 \mid \cdots \mid a_n)$ and $Q = (q_1 \mid \cdots \mid q_n)$ in terms of their columns. The QR relation $A = QR$ with $R$ upper triangular means that each column $a_j$ is a linear combination of $q_1, \dots, q_j$ only:
\begin{align*}
a_j = \sum_{i=1}^{j} R_{ij}\, q_i, \quad j = 1, \dots, n.
\end{align*}
[quotetheorem:1423]
The Gram–Schmidt algorithm is straightforward to implement but can suffer from loss of orthogonality due to floating-point cancellation in the computation of $d_k$. In practice, the **modified Gram–Schmidt** variant (which subtracts each projection one at a time rather than all at once) or the **Householder** approach (see below) is preferred for numerical stability.
## Householder Reflections
A Householder reflection is an orthogonal transformation that maps any given vector to a scalar multiple of a coordinate axis. This property makes Householder matrices the standard tool for reducing a matrix to upper triangular form via QR factorisation without the numerical issues that can affect Gram–Schmidt.
[definition: Householder Reflection]
Given a nonzero vector $u \in \mathbb{R}^m$, the associated **Householder reflection matrix** is
\begin{align*}
H = I - 2\frac{u\,u^\top}{\|u\|^2}.
\end{align*}
[/definition]
The matrix $H$ is symmetric ($H^\top = H$) and orthogonal ($H^\top H = I$), so $H^2 = I$ — applying the reflection twice returns to the identity. Geometrically, $Hx$ is the reflection of $x$ through the hyperplane
\begin{align*}
\{y \in \mathbb{R}^m : u^\top y = 0\},
\end{align*}
the hyperplane orthogonal to $u$ passing through the origin.
[remark: Geometric Interpretation]
Decompose any vector $x$ into its component along $u$ and its component in the hyperplane:
\begin{align*}
x = \underbrace{\frac{u^\top x}{\|u\|^2}u}_{\text{along }u} + \underbrace{x - \frac{u^\top x}{\|u\|^2}u}_{\text{in hyperplane}}.
\end{align*}
The reflection $H$ negates the $u$-component and leaves the hyperplane component unchanged:
\begin{align*}
Hx = -\frac{u^\top x}{\|u\|^2}u + \left(x - \frac{u^\top x}{\|u\|^2}u\right) = x - 2\frac{u^\top x}{\|u\|^2}u.
\end{align*}
[/remark]
The key property used in QR algorithms is that for any two vectors $x, y \in \mathbb{R}^m$ with $\|x\| = \|y\|$, one can find a Householder vector $u$ such that $Hx = y$. In the QR algorithm, this is used to zero out the entries below the diagonal in each column of $A$: given the $j$-th column $a_j$ (restricted to entries in rows $j$ through $m$), one chooses $u$ so that $Ha_j = \|a_j\|e_1$ (a multiple of the first standard basis vector), which makes all sub-diagonal entries in that column zero.
The QR factorisation via Householder reflections applies a sequence of orthogonal transformations $H_1, H_2, \dots, H_n$ from the left, reducing $A$ to upper triangular form:
\begin{align*}
H_n \cdots H_2 H_1 A = R \implies A = (H_1 H_2 \cdots H_n) R = QR,
\end{align*}
where $Q = H_1 H_2 \cdots H_n$ is orthogonal since it is a product of orthogonal matrices.
[example: Householder Step for a Single Column]
Let $x = (3, 4)^\top \in \mathbb{R}^2$. We want $Hx = (5, 0)^\top$, i.e., $\|x\| = 5$ times the first standard basis vector.
Take $u = x - \|x\|e_1 = (3 - 5, 4)^\top = (-2, 4)^\top$. Then $\|u\|^2 = 4 + 16 = 20$, and
\begin{align*}
H = I - 2\frac{uu^\top}{20} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \frac{1}{10}\begin{pmatrix} 4 & -8 \\ -8 & 16 \end{pmatrix} = \begin{pmatrix} 3/5 & 4/5 \\ 4/5 & -3/5 \end{pmatrix}.
\end{align*}
One checks directly that $Hx = \bigl(\tfrac{3}{5}\cdot 3 + \tfrac{4}{5}\cdot 4,\; \tfrac{4}{5}\cdot 3 - \tfrac{3}{5}\cdot 4\bigr)^\top = (5, 0)^\top$.
[/example]
The choice $u = x \pm \|x\|e_1$ always works; in numerical practice one takes the sign that avoids cancellation: $u = x + \operatorname{sgn}(x_1)\|x\|e_1$, so that the dominant term in $u$ is the sum rather than the difference of two nearly-equal quantities.
## References
This page is based on the Part II Numerical Analysis II course at the University of Cambridge, Michaelmas 2023, lectured by Hamza Fauzi.
Contents
- 1. The Poisson Equation
- The Poisson Equation Setup
- The Second Central Difference
- The Five-Point Formula
- The Gershgorin Circle Theorem
- Stability via Gershgorin: an Application
- The Choice of Grid Ordering
- Spectral Properties of the Five-Point Matrix
- Convergence of the Five-Point Formula
- Towards Efficient Solvers: The Block Structure of $A$
- The Fast Poisson Solver via the DFT
- Periodic Sequences and the DFT
- The Fast Fourier Transform
- Completing the Fast Poisson Solver
- 2. PDEs of Evolution
- From Elliptic to Parabolic: Equations in Time
- The Heat Equation and Explicit Euler Discretisation
- Convergence and the Role of the Courant Number
- Stability, Consistency, and the Lax Equivalence Theorem
- Norms for Stability Analysis
- The Sup-Norm
- The Normalised Euclidean Norm
- Spectral Analysis of the Explicit Euler Scheme
- Implicit Euler and the Theta-Method
- Summary of Stability Conditions
- Semidiscretization and the Method of Lines
- The Heat Equation via Semidiscretization
- The Crank–Nicolson Scheme
- The Advection Equation
- Fourier Analysis of Finite Difference Schemes
- Fourier Analysis on $\ell_2(\mathbb{Z})$
- The Amplification Factor and the von Neumann Criterion
- Stability Analysis for the Diffusion Equation
- Stability of Advection Schemes
- Multi-Step Schemes and the Root Condition
- Fourier Analysis for the Wave Equation
- Fourier Analysis in Two Space Dimensions
- Operator Splitting for Multi-Dimensional Problems
- Splitting Methods for Non-Commuting Operators
- 3. Spectral Methods
- Why Spectral Methods?
- The Galerkin Framework
- Fourier Approximation of Functions
- Exponential Convergence for Analytic Functions
- The Algebra of Fourier Expansions
- Spectral Discretisation of a Boundary Value Problem
- Numerical Quadrature for Fourier Coefficients
- Applying the Spectral Method: The Poisson Equation
- Periodicity as a Limitation
- The Algebra of Chebyshev Expansions
- Multiplication of Chebyshev Series
- Differentiation in Chebyshev Coordinates
- Connection Between Fourier and Chebyshev via Cosine Substitution
- The Spectral Method for Evolutionary PDEs
- The General Framework
- Application: Diffusion Equation with Variable Coefficient
- Spectral vs Finite-Difference Accuracy
- 4. Iterative Methods for Linear Systems
- Why Iterative Methods?
- Stationary Linear Iterations
- Splitting Methods
- The Jacobi and Gauss-Seidel Methods
- Convergence Theory for Splitting Methods
- Diagonal Dominance
- Symmetric Positive Definite Matrices
- Stationary vs Non-Stationary Iterations
- Relaxation
- The Optimal Relaxation Parameter
- Multigrid Methods
- Frequency Analysis of Jacobi Iteration
- The Multigrid Observation
- Grid Transfer Operators
- Coarse-Grid Correction
- The V-Cycle Algorithm
- Steepest Descent and Conjugate Gradient Methods
- The Quadratic Minimization Formulation
- The Steepest Descent Method
- Conjugate Directions and the CG Algorithm
- Conjugate Directions
- Finite Termination
- The Conjugate Gradient Algorithm
- Krylov Subspaces and Properties of CG
- Convergence Analysis of the Conjugate Gradient Method
- Finite Termination in Terms of Eigenvalues
- Error Bounds via Chebyshev Polynomials
- Preconditioning
- Motivation
- The Preconditioned System
- Common Preconditioners
- 5. Eigenvalues and Eigenvectors
- The Eigenvalue Problem
- The Power Method
- The Rayleigh Quotient
- Inverse Iteration
- Rayleigh Quotient Iteration
- Simultaneous Iteration
- Convergence of Simultaneous Iteration
- The QR Algorithm
- Shifts in the QR Algorithm
- Deflation
- Deflation for Symmetric Matrices
- Deflation for Non-Symmetric Matrices
- Reduction to Tridiagonal Form for Symmetric Matrices
- Why Tridiagonalisation Pays Off
- Householder Reduction to Tridiagonal Form
- The Efficient QR Algorithm for Symmetric Matrices
- The Nonsymmetric Eigenvalue Problem
- Eigenvalue-Revealing Factorisations
- Hessenberg Reduction
- Appendix
- Time Complexity
- TST Matrices
- Common Finite Difference Schemes
- Taylor's Theorem
- QR Factorisation
- Householder Reflections
- References
Cambridge II Numerical Analysis II
Content
Problems
History
Created by admin on 4/24/2026 | Last updated on 4/24/2026
Prerequisites
No prerequisites required for this page.
Rate this page
★
★
★
★
★
Poor
Excellent