[proofplan]
We prove three properties of the conjugate gradient method by induction on $m$. Property (1): the span of residuals equals the span of search directions equals the Krylov subspace $K_{m+1}(A, r^{(0)})$. The equality of the first two spans follows from the Gram-Schmidt-like construction of $d^{(m)}$ from $r^{(m)}$; the identification with the Krylov subspace uses the residual update $r^{(m+1)} = r^{(m)} - \alpha_m Ad^{(m)}$ and a dimension argument. Property (2): mutual orthogonality of residuals follows from the [Finite Termination of Conjugate Direction Methods](/theorems/1399) and the span equality. Property (3): conjugacy of search directions holds by construction.
[/proofplan]
[step:Prove $\operatorname{span}\{d^{(i)}\}_{i=0}^m = \operatorname{span}\{r^{(i)}\}_{i=0}^m$ from the CG construction]
The conjugate gradient method constructs $d^{(m)}$ from $r^{(m)}$ by the recurrence:
\begin{align*}
d^{(0)} &= r^{(0)}, \\
d^{(m)} &= r^{(m)} + \beta_{m-1}\,d^{(m-1)}, \qquad m \geq 1,
\end{align*}
where $\beta_{m-1} = \|r^{(m)}\|_2^2 / \|r^{(m-1)}\|_2^2$. This is a Gram-Schmidt-type process that produces search directions from residuals. We prove $\operatorname{span}\{d^{(i)}\}_{i=0}^m = \operatorname{span}\{r^{(i)}\}_{i=0}^m$ by induction on $m$.
**Base case ($m = 0$).** $d^{(0)} = r^{(0)}$, so $\operatorname{span}\{d^{(0)}\} = \operatorname{span}\{r^{(0)}\}$.
**Inductive step.** Assume $\operatorname{span}\{d^{(i)}\}_{i=0}^{m-1} = \operatorname{span}\{r^{(i)}\}_{i=0}^{m-1}$. The recurrence $d^{(m)} = r^{(m)} + \beta_{m-1}d^{(m-1)}$ shows $d^{(m)} \in \operatorname{span}\{r^{(m)}\} + \operatorname{span}\{d^{(i)}\}_{i=0}^{m-1} = \operatorname{span}\{r^{(m)}\} + \operatorname{span}\{r^{(i)}\}_{i=0}^{m-1}$. Conversely, $r^{(m)} = d^{(m)} - \beta_{m-1}d^{(m-1)} \in \operatorname{span}\{d^{(i)}\}_{i=0}^m$. Therefore:
\begin{align*}
\operatorname{span}\{d^{(i)}\}_{i=0}^m = \operatorname{span}\{r^{(i)}\}_{i=0}^m.
\end{align*}
[/step]
[step:Prove $\operatorname{span}\{r^{(i)}\}_{i=0}^m = K_{m+1}(A, r^{(0)})$ by induction using the residual update]
Recall that $K_{m+1}(A, r^{(0)}) := \operatorname{span}\{r^{(0)}, Ar^{(0)}, \ldots, A^mr^{(0)}\}$. We prove $\operatorname{span}\{r^{(i)}\}_{i=0}^m = K_{m+1}(A, r^{(0)})$ by induction on $m$.
**Base case ($m = 0$).** Both sides equal $\operatorname{span}\{r^{(0)}\}$.
**Inductive step.** Assume $\operatorname{span}\{r^{(i)}\}_{i=0}^{m-1} = K_m(A, r^{(0)})$. The residual update in CG is:
\begin{align*}
r^{(m)} = r^{(m-1)} - \alpha_{m-1} Ad^{(m-1)}.
\end{align*}
By the previous step, $d^{(m-1)} \in \operatorname{span}\{r^{(i)}\}_{i=0}^{m-1} = K_m(A, r^{(0)})$. Therefore $Ad^{(m-1)} \in AK_m(A, r^{(0)}) \subset K_{m+1}(A, r^{(0)})$. Since $r^{(m-1)} \in K_m(A, r^{(0)}) \subset K_{m+1}(A, r^{(0)})$, we get $r^{(m)} \in K_{m+1}(A, r^{(0)})$, hence:
\begin{align*}
\operatorname{span}\{r^{(i)}\}_{i=0}^m \subset K_{m+1}(A, r^{(0)}).
\end{align*}
For the reverse inclusion, $K_{m+1}(A, r^{(0)}) = K_m(A, r^{(0)}) + \operatorname{span}\{A^m r^{(0)}\}$. By the inductive hypothesis, $K_m(A, r^{(0)}) = \operatorname{span}\{r^{(i)}\}_{i=0}^{m-1}$. Since $r^{(m)} \notin \operatorname{span}\{r^{(i)}\}_{i=0}^{m-1}$ (because $r^{(m)}$ is orthogonal to all $r^{(0)}, \ldots, r^{(m-1)}$ by property (2), which we establish independently below, and $r^{(m)} \neq 0$ by hypothesis), the dimension satisfies:
\begin{align*}
\dim \operatorname{span}\{r^{(i)}\}_{i=0}^m = m + 1 = \dim K_{m+1}(A, r^{(0)}),
\end{align*}
where the last equality holds because $K_{m+1}(A, r^{(0)})$ has dimension at most $m + 1$ and contains the $(m+1)$-dimensional subspace $\operatorname{span}\{r^{(i)}\}_{i=0}^m$. Since $\operatorname{span}\{r^{(i)}\}_{i=0}^m \subset K_{m+1}(A, r^{(0)})$ and both have dimension $m+1$, equality follows.
[guided]
The residual update $r^{(m)} = r^{(m-1)} - \alpha_{m-1} Ad^{(m-1)}$ is the mechanism that connects residuals to the Krylov subspace. At each step, the new residual $r^{(m)}$ is formed by taking a linear combination of $r^{(m-1)}$ (which lives in $K_m$) and $Ad^{(m-1)}$ (which lives in $AK_m \subset K_{m+1}$). This is how $A$ enters: the multiplication by $A$ in the residual update generates the next power of $A$ needed to extend the Krylov subspace.
The dimension argument is crucial: we need $\operatorname{span}\{r^{(i)}\}_{i=0}^m$ to have full dimension $m + 1$ (rather than being a proper subspace of $K_{m+1}$). This is guaranteed by the orthogonality of the residuals: since $r^{(0)}, \ldots, r^{(m)}$ are mutually orthogonal and all nonzero (by hypothesis $r^{(m)} \neq 0$), they are linearly independent. Hence $\dim \operatorname{span}\{r^{(i)}\}_{i=0}^m = m + 1$.
Why does the Krylov subspace $K_{m+1}(A, r^{(0)})$ have dimension exactly $m + 1$? The dimension is at most $m + 1$ by definition (it is spanned by $m + 1$ vectors). It is at least $m + 1$ because it contains the $(m+1)$-dimensional subspace $\operatorname{span}\{r^{(i)}\}_{i=0}^m$. So both spaces have the same dimension, and the inclusion $\operatorname{span}\{r^{(i)}\}_{i=0}^m \subset K_{m+1}$ forces equality.
[/guided]
[/step]
[step:Prove mutual orthogonality of residuals from the finite termination theorem and the span equality]
By the [Finite Termination of Conjugate Direction Methods](/theorems/1399), $r^{(k)}$ is orthogonal to $\operatorname{span}\{d^{(0)}, \ldots, d^{(k-1)}\}$ for each $k = 1, \ldots, n$. By property (1), $\operatorname{span}\{d^{(0)}, \ldots, d^{(k-1)}\} = \operatorname{span}\{r^{(0)}, \ldots, r^{(k-1)}\}$. Therefore:
\begin{align*}
\langle r^{(m)}, r^{(i)} \rangle = 0 \qquad \text{for all } i < m.
\end{align*}
Since $\operatorname{span}\{d^{(i)}\}_{i=0}^{m-1} = \operatorname{span}\{r^{(i)}\}_{i=0}^{m-1}$, orthogonality to the residuals immediately gives:
\begin{align*}
\langle r^{(m)}, d^{(i)} \rangle = 0 \qquad \text{for all } i < m.
\end{align*}
[/step]
[step:Verify pairwise conjugacy of search directions from the Gram-Schmidt construction]
The conjugacy $\langle d^{(m)}, d^{(i)} \rangle_A = 0$ for $i < m$ follows from the choice of $\beta_{m-1}$ in the CG recurrence. By the recurrence $d^{(m)} = r^{(m)} + \beta_{m-1}d^{(m-1)}$:
\begin{align*}
\langle d^{(m)}, d^{(i)} \rangle_A = \langle r^{(m)}, d^{(i)} \rangle_A + \beta_{m-1}\langle d^{(m-1)}, d^{(i)} \rangle_A.
\end{align*}
For $i < m - 1$, both terms vanish: $\langle r^{(m)}, d^{(i)} \rangle_A = \langle r^{(m)}, Ad^{(i)} \rangle$, and $Ad^{(i)} = \frac{1}{\alpha_i}(r^{(i)} - r^{(i+1)})$ (from the residual update), so $\langle r^{(m)}, Ad^{(i)} \rangle = \frac{1}{\alpha_i}(\langle r^{(m)}, r^{(i)} \rangle - \langle r^{(m)}, r^{(i+1)} \rangle) = 0$ by the mutual orthogonality of residuals (property (2), since $i < m$ and $i + 1 \leq m - 1 < m$). By induction on $m$, $\langle d^{(m-1)}, d^{(i)} \rangle_A = 0$ for $i < m - 1$.
For $i = m - 1$, the coefficient $\beta_{m-1}$ is chosen precisely so that $\langle d^{(m)}, d^{(m-1)} \rangle_A = 0$:
\begin{align*}
\langle d^{(m)}, d^{(m-1)} \rangle_A = \langle r^{(m)}, d^{(m-1)} \rangle_A + \beta_{m-1}\langle d^{(m-1)}, d^{(m-1)} \rangle_A = 0,
\end{align*}
which gives $\beta_{m-1} = -\langle r^{(m)}, d^{(m-1)} \rangle_A / \langle d^{(m-1)}, d^{(m-1)} \rangle_A$. This is a one-term Gram-Schmidt $A$-orthogonalisation of $r^{(m)}$ against $d^{(m-1)}$ (only one correction term is needed because $r^{(m)}$ is already $A$-orthogonal to $d^{(0)}, \ldots, d^{(m-2)}$).
[guided]
A natural question is: why does the CG recurrence only involve a single correction term $\beta_{m-1}d^{(m-1)}$ rather than a full Gram-Schmidt process against all previous directions? The answer is that the mutual orthogonality of the residuals (property (2)) already ensures $\langle r^{(m)}, d^{(i)} \rangle_A = 0$ for $i < m - 1$. The only direction against which $r^{(m)}$ needs correction is $d^{(m-1)}$.
To see why $\langle r^{(m)}, d^{(i)} \rangle_A = 0$ for $i < m - 1$: from the residual update $r^{(i+1)} = r^{(i)} - \alpha_i Ad^{(i)}$, we get $Ad^{(i)} = \frac{1}{\alpha_i}(r^{(i)} - r^{(i+1)})$. Therefore:
\begin{align*}
\langle r^{(m)}, d^{(i)} \rangle_A = \langle r^{(m)}, Ad^{(i)} \rangle = \frac{1}{\alpha_i}\bigl(\langle r^{(m)}, r^{(i)} \rangle - \langle r^{(m)}, r^{(i+1)} \rangle\bigr).
\end{align*}
When $i + 1 < m$ (i.e., $i \leq m - 2$), both $\langle r^{(m)}, r^{(i)} \rangle$ and $\langle r^{(m)}, r^{(i+1)} \rangle$ vanish by mutual orthogonality. This is the "short recurrence" property that makes CG computationally efficient: each new direction requires only one correction, not $m$ corrections.
This short recurrence is special to the symmetric positive definite case. For non-symmetric $A$, the tridiagonal structure breaks down, and one must use full orthogonalisation methods like GMRES, which require $O(m)$ storage for $m$ iterations.
[/guided]
[/step]