[proofplan]
We derive the Chebyshev convergence bound for CG by combining the polynomial minimization characterization from the [CG Minimizes Error over Polynomials](/theorems/1402) theorem with an explicit upper bound from shifted Chebyshev polynomials. First, we bound $\|P_k(A)e^{(0)}\|_A$ by $\max_{\lambda \in \sigma(A)} |P_k(\lambda)| \cdot \|e^{(0)}\|_A$ using the eigenbasis expansion. Then we construct the optimal polynomial from the Chebyshev polynomial $T_k$ shifted to $[l, L]$, and evaluate the resulting bound using the identity $T_k(t) \geq \frac{1}{2}(t + \sqrt{t^2 - 1})^k$ for $t \geq 1$.
[/proofplan]
[step:Bound $\|P_k(A)e^{(0)}\|_A$ using the eigenbasis expansion]
Since $A$ is symmetric positive definite, let $u_1, \ldots, u_n$ be an orthonormal eigenbasis with eigenvalues $\lambda_1, \ldots, \lambda_n \in [l, L]$. Expand $e^{(0)} = \sum_{i=1}^{n} \hat{e}_i u_i$ where $\hat{e}_i = \langle e^{(0)}, u_i \rangle$. Applying $P_k(A)$:
\begin{align*}
P_k(A)e^{(0)} = \sum_{i=1}^{n} P_k(\lambda_i) \hat{e}_i u_i.
\end{align*}
Computing the squared $A$-norm using orthonormality of $(u_i)$ and $Au_i = \lambda_i u_i$:
\begin{align*}
\|P_k(A)e^{(0)}\|_A^2 = \langle P_k(A)e^{(0)}, AP_k(A)e^{(0)} \rangle = \sum_{i=1}^{n} |P_k(\lambda_i)|^2 \lambda_i |\hat{e}_i|^2.
\end{align*}
Bounding $|P_k(\lambda_i)|^2 \leq \max_{\lambda \in [l,L]} |P_k(\lambda)|^2$ for each $i$:
\begin{align*}
\|P_k(A)e^{(0)}\|_A^2 \leq \left(\max_{\lambda \in [l,L]} |P_k(\lambda)|\right)^2 \sum_{i=1}^{n} \lambda_i |\hat{e}_i|^2 = \left(\max_{\lambda \in [l,L]} |P_k(\lambda)|\right)^2 \|e^{(0)}\|_A^2.
\end{align*}
By the [CG Minimizes Error over Polynomials](/theorems/1402):
\begin{align*}
\|e^{(k)}\|_A \leq \min_{\substack{P_k : \deg P_k \leq k \\ P_k(0) = 1}} \max_{\lambda \in [l,L]} |P_k(\lambda)| \cdot \|e^{(0)}\|_A.
\end{align*}
[guided]
The eigenbasis expansion converts the matrix polynomial $P_k(A)$ into scalar evaluations $P_k(\lambda_i)$. Since $Au_i = \lambda_i u_i$, any polynomial in $A$ acts on $u_i$ by evaluating that polynomial at $\lambda_i$:
\begin{align*}
P_k(A)u_i = P_k(\lambda_i)u_i.
\end{align*}
The $A$-norm computation uses $\|v\|_A^2 = \langle v, Av \rangle$. When $v = \sum_i c_i u_i$, orthonormality gives $\langle v, Av \rangle = \sum_i \lambda_i |c_i|^2$. With $c_i = P_k(\lambda_i)\hat{e}_i$:
\begin{align*}
\|P_k(A)e^{(0)}\|_A^2 = \sum_{i=1}^{n} \lambda_i |P_k(\lambda_i)|^2 |\hat{e}_i|^2 \leq \left(\max_i |P_k(\lambda_i)|^2\right) \sum_i \lambda_i |\hat{e}_i|^2.
\end{align*}
The inequality $\max_i |P_k(\lambda_i)| \leq \max_{\lambda \in [l,L]} |P_k(\lambda)|$ holds because all eigenvalues lie in $[l, L]$. The problem has now been reduced to a scalar approximation theory question: find the degree-$k$ polynomial that is smallest on $[l, L]$ subject to being $1$ at the origin.
[/guided]
[/step]
[step:Construct the optimal polynomial from the shifted Chebyshev polynomial on $[l, L]$]
The Chebyshev polynomial $T_k$ satisfies $|T_k(x)| \leq 1$ for $x \in [-1, 1]$ and is the polynomial of degree $k$ with leading coefficient $2^{k-1}$ that deviates least from zero on $[-1, 1]$. To transfer this to $[l, L]$, define the affine map $\varphi: [l, L] \to [-1, 1]$ by:
\begin{align*}
\varphi(x) = \frac{2x - (L + l)}{L - l} = \frac{2(L - x)}{L - l} - 1,
\end{align*}
so $\varphi(l) = -1$ and $\varphi(L) = 1$. Note that $0 \notin [l, L]$ since $l > 0$, and $\varphi(0) = -(L + l)/(L - l) < -1$. Define the shifted polynomial:
\begin{align*}
P_k(x) := \frac{T_k(\varphi(x))}{T_k(\varphi(0))}.
\end{align*}
This satisfies $P_k(0) = 1$ (by construction) and $\deg P_k = k$ (since $\varphi$ is affine and $\deg T_k = k$). For $x \in [l, L]$, $\varphi(x) \in [-1, 1]$, so $|T_k(\varphi(x))| \leq 1$, giving:
\begin{align*}
\max_{x \in [l, L]} |P_k(x)| = \frac{1}{|T_k(\varphi(0))|} = \frac{1}{\left|T_k\!\left(\frac{L + l}{L - l}\right)\right|}.
\end{align*}
[/step]
[step:Evaluate $T_k\!\left(\frac{L+l}{L-l}\right)$ using the Chebyshev lower bound]
Set $\sigma := \frac{L + l}{L - l}$. Since $L > l > 0$, we have $\sigma > 1$. For $t > 1$, the Chebyshev polynomial satisfies $T_k(t) = \cosh(k \operatorname{arccosh}(t))$, and the identity $\cosh(\theta) \geq \frac{1}{2}e^{\theta}$ for $\theta \geq 0$ gives:
\begin{align*}
T_k(\sigma) \geq \frac{1}{2}\left(\sigma + \sqrt{\sigma^2 - 1}\right)^k.
\end{align*}
We compute $\sigma + \sqrt{\sigma^2 - 1}$. Setting $\sigma = (L + l)/(L - l)$:
\begin{align*}
\sigma^2 - 1 = \frac{(L+l)^2 - (L-l)^2}{(L-l)^2} = \frac{4Ll}{(L-l)^2},
\end{align*}
so $\sqrt{\sigma^2 - 1} = \frac{2\sqrt{Ll}}{L - l} = \frac{2\sqrt{L}\sqrt{l}}{L - l}$. Therefore:
\begin{align*}
\sigma + \sqrt{\sigma^2 - 1} = \frac{L + l + 2\sqrt{Ll}}{L - l} = \frac{(\sqrt{L} + \sqrt{l})^2}{(\sqrt{L} - \sqrt{l})(\sqrt{L} + \sqrt{l})} = \frac{\sqrt{L} + \sqrt{l}}{\sqrt{L} - \sqrt{l}}.
\end{align*}
Hence:
\begin{align*}
T_k(\sigma) \geq \frac{1}{2}\left(\frac{\sqrt{L} + \sqrt{l}}{\sqrt{L} - \sqrt{l}}\right)^k.
\end{align*}
[guided]
The identity $\cosh(k\theta) = T_k(\cosh\theta)$ is the defining relation for Chebyshev polynomials outside $[-1, 1]$: while $T_k(\cos\theta) = \cos(k\theta)$ for the oscillatory regime, the hyperbolic version $T_k(\cosh\theta) = \cosh(k\theta)$ governs the exponential growth regime. Since $\sigma > 1$, we are in the latter regime, and $T_k(\sigma)$ grows exponentially in $k$.
The lower bound $\cosh(\theta) \geq \frac{1}{2}e^\theta$ follows from $\cosh(\theta) = \frac{e^\theta + e^{-\theta}}{2} \geq \frac{e^\theta}{2}$. This gives $T_k(\sigma) = \cosh(k\operatorname{arccosh}(\sigma)) \geq \frac{1}{2}e^{k\operatorname{arccosh}(\sigma)} = \frac{1}{2}(\sigma + \sqrt{\sigma^2 - 1})^k$, using $e^{\operatorname{arccosh}(t)} = t + \sqrt{t^2 - 1}$.
The algebraic simplification of $\sigma + \sqrt{\sigma^2 - 1}$ uses the perfect square factorisation $(L+l)^2 - (L-l)^2 = 4Ll$ (difference of squares) and $L + l + 2\sqrt{Ll} = (\sqrt{L} + \sqrt{l})^2$. The denominator factors as $L - l = (\sqrt{L} - \sqrt{l})(\sqrt{L} + \sqrt{l})$, yielding the clean ratio $(\sqrt{L} + \sqrt{l})/(\sqrt{L} - \sqrt{l})$.
[/guided]
[/step]
[step:Combine the bounds to obtain the Chebyshev convergence estimate]
From the first step:
\begin{align*}
\|e^{(k)}\|_A \leq \max_{x \in [l,L]} |P_k(x)| \cdot \|e^{(0)}\|_A = \frac{1}{T_k(\sigma)} \|e^{(0)}\|_A.
\end{align*}
Substituting the lower bound on $T_k(\sigma)$:
\begin{align*}
\frac{1}{T_k(\sigma)} \leq \frac{2}{\left(\frac{\sqrt{L} + \sqrt{l}}{\sqrt{L} - \sqrt{l}}\right)^k} = 2\left(\frac{\sqrt{L} - \sqrt{l}}{\sqrt{L} + \sqrt{l}}\right)^k.
\end{align*}
Therefore:
\begin{align*}
\|e^{(k)}\|_A \leq 2\left(\frac{\sqrt{L} - \sqrt{l}}{\sqrt{L} + \sqrt{l}}\right)^k \|e^{(0)}\|_A.
\end{align*}
The convergence rate $\rho = \frac{\sqrt{L} - \sqrt{l}}{\sqrt{L} + \sqrt{l}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ where $\kappa = L/l = \kappa(A)$ is the spectral condition number. Since $\kappa > 1$, we have $0 < \rho < 1$, and the error decreases geometrically. The factor $2$ accounts for the Chebyshev polynomial not being exactly $\frac{1}{2}(\sigma + \sqrt{\sigma^2 - 1})^k$ but only bounded below by this quantity.
[guided]
The convergence rate $\rho = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$ reveals why CG is dramatically faster than steepest descent for ill-conditioned systems. Steepest descent has convergence rate $(\kappa - 1)/(\kappa + 1)$, which approaches $1$ linearly as $\kappa \to \infty$. CG has convergence rate $(\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$, which approaches $1$ much more slowly.
For example, if $\kappa = 100$: steepest descent has rate $99/101 \approx 0.98$ (needing about $115$ iterations to reduce the error by a factor of $10$), while CG has rate $9/11 \approx 0.82$ (needing about $13$ iterations for the same reduction). The square root in CG's rate comes from the Chebyshev polynomial's ability to exploit the gap between $0$ and $[l, L]$ — the polynomial must equal $1$ at $t = 0$ but can oscillate on $[l, L]$, and the "leverage" from the gap at the origin is what produces the $\sqrt{\kappa}$ improvement.
The factor of $2$ in the bound is an artefact of the lower bound $\cosh(\theta) \geq \frac{1}{2}e^\theta$. In practice, CG often converges faster than this bound suggests, especially when the eigenvalues are clustered (the bound is sharp only when the eigenvalues are spread uniformly across $[l, L]$).
[/guided]
[/step]