What brings you to Androma?

Approximation theory studies how complicated functions, data, and operators can be represented by simpler and more tractable objects, and how well that representation can be made. The course begins with the basic notion of best approximation in normed spaces, then develops the central problem of measuring and controlling approximation error in concrete settings. From there it moves through the classical questions of density, existence, and construction: when polynomials, splines, rational functions, or orthogonal expansions can approximate a target function, and how the smoothness or structure of the target governs the rate of approximation. The main themes are a balance between qualitative existence results and quantitative error estimates. The chapters build from foundational normed-space formulations to constructive Weierstrass theory, then to moduli of smoothness, direct and inverse approximation theorems, and Bernstein-type inequalities that connect approximation rates with regularity. Later chapters specialize to minimax and Chebyshev approximation, orthogonal polynomials and spectral methods, spline spaces and B-splines, and rational and Padé approximation. The course then broadens to approximation in `L^p` spaces and nonlinear approximation, before ending with applications in numerical analysis, where these ideas become practical tools for computation, interpolation, quadrature, and the stable solution of analytic and differential problems. # Introduction Approximation theory begins with a practical tension: many functions that arise in analysis, geometry, physics, and computation are too complicated to manipulate exactly, while finite-dimensional objects are stable enough to store, differentiate, integrate, and evaluate. The course studies how a function can be replaced by a simpler surrogate, how the error of replacement is measured, and how qualitative information such as smoothness or analyticity predicts quantitative convergence rates. Polynomial, trigonometric, spline, and rational approximants will appear as different answers to the same guiding problem: choose a structured family and understand the best error it can achieve. The point of this introductory chapter is to fix the language used throughout the course. We separate the approximation problem itself from the method used to construct an approximant, and we distinguish existence, uniqueness, stability, and convergence as different questions. This viewpoint links real analysis, functional analysis, Fourier analysis, and numerical linear algebra into a single framework. ## What Is Being Approximated? The first question is not how to approximate, but what kind of object is being approximated and in what sense two objects are considered close. A [continuous function](/page/Continuous%20Function) on a compact interval, an $L^2$ function, a periodic signal, and a function with isolated singularities demand different approximation spaces and different error measures. [definition: Approximation Problem] An approximation problem consists of a normed space $(X, \|\cdot\|_X)$, a target element $f \in X$, and a subset $A \subset X$ whose elements are called admissible approximants. [/definition] The subset $A$ is usually much simpler than the ambient space $X$: it might be a finite-dimensional subspace of polynomials, a nonlinear family of rational functions, or a space of splines with prescribed knots. This definition identifies the data of the problem, but it does not yet assign a number to the quality of the best possible replacement. To compare different approximation families, we need a single quantity that records the smallest error that the family permits. [definition: Best Approximation Error] Let $(X, \|\cdot\|_X)$ be a normed space and let $A \subset X$. The best approximation error from $A$ is the map \begin{align*} E_A(\cdot)_X &: X \to [0,\infty], & E_A(f)_X &:= \inf_{a \in A} \|f-a\|_X. \end{align*} [/definition] The infimum records the performance limit of the family $A$, regardless of whether a minimiser has been found. A substantial part of approximation theory asks whether this infimum is attained, whether the minimiser is unique, and whether a computable procedure can find it without amplifying data errors. [definition: Best Approximant] Let $(X, \|\cdot\|_X)$ be a normed space, let $f \in X$, and let $A \subset X$. An element $a^* \in A$ is a best approximant to $f$ from $A$ if \begin{align*} \|f-a^*\|_X = E_A(f)_X. \end{align*} [/definition] A best approximant is a geometric projection of $f$ onto the model class, but it need not behave like an [orthogonal projection](/theorems/437) unless the space has Hilbert structure. This distinction explains why least-squares approximation and uniform approximation have rather different theories. [example: Best Constant Approximation In Two Norms] Consider $f(x)=x$ on $[0,1]$ and approximate it by constants, where $\mathcal L^1$ denotes one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure). For a constant $c\in\mathbb R$, the squared $L^2$ error is \begin{align*} \int_0^1 (x-c)^2\,d\mathcal L^1(x)=\int_0^1 (x^2-2cx+c^2)\,d\mathcal L^1(x)=\frac13-c+c^2. \end{align*} Completing the square gives \begin{align*} \frac13-c+c^2=\left(c-\frac12\right)^2+\frac1{12}. \end{align*} The term $\left(c-\frac12\right)^2$ is minimised exactly when $c=1/2$, so the best constant in $L^2(0,1)$ is $1/2$, with squared error $1/12$. In $C[0,1]$ with the uniform norm, the error of the same constant $c$ is \begin{align*} \|x-c\|_\infty=\sup_{0\le x\le 1}|x-c|. \end{align*} Every $c$ satisfies the endpoint lower bound \begin{align*} 1=|1-0|=|(1-c)+(c-0)|\le |1-c|+|c|\le 2\sup_{0\le x\le 1}|x-c|. \end{align*} Hence $\|x-c\|_\infty\ge 1/2$ for every constant $c$. For $c=1/2$, every $x\in[0,1]$ satisfies $-1/2\le x-1/2\le 1/2$, so \begin{align*} \left\|x-\frac12\right\|_\infty=\frac12. \end{align*} Thus the best constant is again $1/2$, but the $L^2$ computation comes from minimising an integral, while the uniform-norm computation comes from balancing the endpoint deviations. [/example] ## Measuring Error The second question is how the error should be measured. The norm encodes the features of the target that the approximation is required to preserve: average accuracy, pointwise worst-case accuracy, smoothness of derivatives, or stability under an operator. [definition: Uniform Norm] Let $K$ be a compact [topological space](/page/Topological%20Space). The uniform norm is the map \begin{align*} \|\cdot\|_\infty &: C(K) \to [0,\infty), & \|f\|_\infty &:= \sup_{x \in K} |f(x)|. \end{align*} [/definition] Uniform approximation is strong because it controls the error at every point of the domain. It is the natural setting for polynomial density on compact intervals and for Chebyshev approximation, where the maximal deviation determines the quality of the approximant. Many applications instead care about accumulated or average error, so the next norm squares the pointwise discrepancy and integrates it over the domain. [definition: Least-Squares Error] Let $(E, \mathcal E, \mu)$ be a [measure space](/page/Measure%20Space). The least-squares error is the map \begin{align*} L^2(E,\mathcal E,\mu) \times L^2(E,\mathcal E,\mu) &\to [0,\infty), & (f,g) &\mapsto \|f-g\|_{L^2}^2, \end{align*} where \begin{align*} \|f-g\|_{L^2}^2 = \int_E |f-g|^2\,d\mu. \end{align*} [/definition] Least-squares approximation replaces pointwise control by averaged quadratic control. The square is not only a modelling choice: it turns $L^2$ into a Hilbert space, so the geometry of angles and orthogonality becomes available. The next theorem answers the basic computational question for finite-dimensional least-squares problems: how can we recognise the minimiser without testing every candidate in the approximation space? [quotetheorem:4858] [citeproof:4858] This theorem is the engine behind least-squares methods. The Hilbert-space hypothesis supplies an [inner product](/page/Inner%20Product), so errors can be decomposed by orthogonality; in a normed space such as $C[0,1]$ with the uniform norm there is no comparable angle condition, and best uniform approximation is governed instead by alternation. Finite-dimensionality ensures that $V$ is closed and that the orthogonality equations reduce to finitely many scalar equations after a basis is chosen. If a subspace is not closed, a vector may have distance zero from the subspace without belonging to it, so an orthogonal projection onto that subspace need not exist. The theorem also separates approximation theory from numerical conditioning. It identifies the exact minimiser, but after choosing a non-orthogonal basis the resulting Gram matrix may be ill-conditioned, so solving the normal equations can amplify rounding or sampling errors. In coordinates the theorem becomes a linear system, and numerical linear algebra studies how accurately that system can be solved. [example: Least-Squares Projection Onto Linear Polynomials] Let $H=L^2(0,1)$ and let $V=\operatorname{span}\{1,x\}$. We approximate $f(x)=x^2$ by $p(x)=a+bx$; by *Orthogonality Characterisation Of Hilbert Best Approximation*, the residual $x^2-a-bx$ must be orthogonal to the basis vectors $1$ and $x$. For the basis vector $1$, the condition is \begin{align*} \int_0^1 (x^2-a-bx)\,d\mathcal L^1(x)=0. \end{align*} Using $\int_0^1 x^2\,d\mathcal L^1(x)=1/3$, $\int_0^1 1\,d\mathcal L^1(x)=1$, and $\int_0^1 x\,d\mathcal L^1(x)=1/2$, this becomes \begin{align*} \frac13-a-\frac b2=0. \end{align*} Equivalently, \begin{align*} a+\frac b2=\frac13. \end{align*} For the basis vector $x$, the condition is \begin{align*} \int_0^1 x(x^2-a-bx)\,d\mathcal L^1(x)=0. \end{align*} Expanding the integrand gives \begin{align*} x(x^2-a-bx)=x^3-ax-bx^2. \end{align*} Using $\int_0^1 x^3\,d\mathcal L^1(x)=1/4$, $\int_0^1 x\,d\mathcal L^1(x)=1/2$, and $\int_0^1 x^2\,d\mathcal L^1(x)=1/3$, the second condition becomes \begin{align*} \frac14-\frac a2-\frac b3=0. \end{align*} Equivalently, \begin{align*} \frac a2+\frac b3=\frac14. \end{align*} Thus $a$ and $b$ satisfy \begin{align*} a+\frac b2=\frac13. \end{align*} \begin{align*} \frac a2+\frac b3=\frac14. \end{align*} Multiplying the first equation by $1/2$ gives \begin{align*} \frac a2+\frac b4=\frac16. \end{align*} Subtracting this equation from $\frac a2+\frac b3=\frac14$ gives \begin{align*} \frac b3-\frac b4=\frac14-\frac16. \end{align*} Since $\frac13-\frac14=\frac1{12}$ and $\frac14-\frac16=\frac1{12}$, we get \begin{align*} \frac b{12}=\frac1{12}. \end{align*} Hence $b=1$. Substituting $b=1$ into $a+b/2=1/3$ gives \begin{align*} a+\frac12=\frac13. \end{align*} Therefore \begin{align*} a=-\frac16. \end{align*} The least-squares linear approximant is therefore \begin{align*} p(x)=x-\frac16. \end{align*} This example shows concretely how the abstract best-approximation condition becomes a finite system of normal equations once a basis of the approximation space is chosen. [/example] ## Linear And Nonlinear Families The third question is what structure the admissible approximants should have. Linear spaces are geometrically tractable, while nonlinear families often approximate better with fewer parameters but bring harder existence and stability questions. [definition: Linear Approximation Scheme] A linear approximation scheme in a normed space $X$ is a sequence $(V_n)_{n\ge 1}$ of finite-dimensional subspaces of $X$. [/definition] Polynomial spaces, trigonometric polynomial spaces, and spline spaces with fixed knots are the main linear schemes of the course. Their errors form sequences $E_{V_n}(f)_X$, and convergence means these errors tend to zero as the dimension grows. Linear schemes allow projections and basis expansions, but they do not always match the shape of the function being approximated. To capture moving singularities, sharp layers, or adaptive local structure, we also need model classes that are not closed under addition. [definition: Nonlinear Approximation Family] A nonlinear approximation family in a normed space $X$ is a subset $A \subset X$ that is not assumed to be a linear subspace. [/definition] Rational approximation is the principal nonlinear family in the course. It can capture poles and boundary layers more efficiently than polynomials, but minimisation over rational functions can have degeneracies when numerator and denominator share factors or when poles approach the domain. [example: Polynomial Versus Rational Approximation Near A Singularity] On $[-1,1]$, the function $f(x)=1/(x+2)$ has its only singularity at $x=-2$. Its distance from the interval is \begin{align*} \inf_{x\in[-1,1]}|x-(-2)|=\inf_{x\in[-1,1]}|x+2|=1, \end{align*} with equality at $x=-1$, so the pole is separated from the interval. Now move the pole to $-1-\varepsilon$ by setting \begin{align*} f_\varepsilon(x)=\frac{1}{x+1+\varepsilon}, \qquad \varepsilon>0. \end{align*} The distance from this pole to $[-1,1]$ is \begin{align*} \inf_{x\in[-1,1]}|x-(-1-\varepsilon)|=\inf_{x\in[-1,1]}|x+1+\varepsilon|=\varepsilon, \end{align*} again with equality at $x=-1$. Thus the singularity approaches the interval as $\varepsilon\downarrow 0$. For polynomial approximation, this matters because the largest Bernstein ellipse around $[-1,1]$ that avoids the pole $-a$ has parameter \begin{align*} \rho(a)=a+\sqrt{a^2-1}\qquad(a>1). \end{align*} For $f(x)=1/(x+2)$, this gives \begin{align*} \rho(2)=2+\sqrt3>1. \end{align*} For $f_\varepsilon$, where $a=1+\varepsilon$, it gives \begin{align*} \rho(1+\varepsilon)=1+\varepsilon+\sqrt{(1+\varepsilon)^2-1}. \end{align*} Expanding the square shows \begin{align*} (1+\varepsilon)^2-1=1+2\varepsilon+\varepsilon^2-1=2\varepsilon+\varepsilon^2, \end{align*} so \begin{align*} \rho(1+\varepsilon)=1+\varepsilon+\sqrt{2\varepsilon+\varepsilon^2}. \end{align*} As $\varepsilon\downarrow 0$, both $\varepsilon$ and $\sqrt{2\varepsilon+\varepsilon^2}$ tend to $0$, hence $\rho(1+\varepsilon)\downarrow 1$. Since polynomial error estimates controlled by Bernstein ellipses decay like powers of $\rho^{-1}$, the decay becomes slower when $\rho$ approaches $1$. By contrast, a rational approximant whose denominator is allowed to contain the factor $x+1+\varepsilon$ can represent $f_\varepsilon$ exactly: \begin{align*} r_\varepsilon(x)=\frac{1}{x+1+\varepsilon}. \end{align*} Then \begin{align*} f_\varepsilon(x)-r_\varepsilon(x)=\frac{1}{x+1+\varepsilon}-\frac{1}{x+1+\varepsilon}=0 \end{align*} for every $x\in[-1,1]$. This illustrates why nearby singularities slow polynomial approximation but can be captured efficiently by rational families whose denominators encode the pole. [/example] ## Existence, Uniqueness, And Stability After the model class and norm are fixed, the next questions are qualitative. Does a best approximant exist, is it unique, and does it depend continuously on the data? These issues are separate: existence can hold without uniqueness, and uniqueness can be useless computationally if the approximant is unstable. [definition: Proximinal Set] Let $(X,\|\cdot\|_X)$ be a normed space. A subset $A \subset X$ is proximinal if every $f \in X$ has at least one best approximant from $A$. [/definition] Finite-dimensional subspaces of many familiar function spaces are proximinal, but this is a theorem rather than a formal consequence of the definitions. Compactness often provides existence; convexity or strict convexity often provides uniqueness. Before studying special uniqueness principles such as alternation, we need the basic existence result that justifies talking about best approximants in finite-dimensional linear spaces. [quotetheorem:6860] [citeproof:6860] This compactness argument will recur in several disguises. Finite-dimensionality matters because closed bounded sets are compact there; in infinite-dimensional spaces this compactness fails. For example, in $X=C[0,1]$ with the uniform norm, a non-closed linear subspace can have a target $f$ with $E_V(f)_X=0$ but $f \notin V$, so no best approximant is attained even though arbitrarily accurate elements of $V$ exist. The result also proves only existence. It permits several distinct minimisers, as happens when a flat portion of a unit ball meets the nearest point set, and it gives no formula for selecting one. Even when the minimiser is unique, small perturbations of sampled data can produce large changes in a computed approximant if the chosen basis or parametrisation is unstable. Later chapters add structure such as orthogonality, alternation, or interpolation conditions to turn existence into identification. [remark: Computation Is A Separate Question] An existence theorem says that an optimal approximant is present in the model class. It does not say that a numerical algorithm can find it accurately from sampled data. Approximation theory therefore sits between analysis, which describes the limiting object, and numerical analysis, which studies stable computation. [/remark] ## Smoothness And Rates The final organising question is how much approximation error remains when the dimension of the model class grows. Density theorems say that the error tends to zero; constructive theorems estimate the rate and relate it to smoothness. [definition: Approximation Rate] Let $(V_n)_{n\ge 1}$ be a linear approximation scheme in a normed space $X$. A function $f \in X$ has approximation rate $(r_n)$ with respect to $(V_n)$ if \begin{align*} E_{V_n}(f)_X \le C r_n \qquad \text{for all } n\ge 1, \end{align*} for some constant $C>0$ independent of $n$. [/definition] Rates are the bridge from qualitative approximation to quantitative analysis. In polynomial approximation, smoothness on the interval gives algebraic decay in $n$; analyticity in a complex neighbourhood gives faster decay governed by the nearest singularity. The starting point, however, is the density question: do polynomials at least approximate every continuous function on a compact interval to arbitrary uniform accuracy? [quotetheorem:480] [citeproof:480] Weierstrass answers the question of density for continuous functions on compact intervals, but each hypothesis is doing work. Continuity is necessary for uniform approximation by polynomials because a uniform limit of continuous functions is continuous; hence a jump function such as $\mathbf{1}_{[1/2,1]}$ cannot be uniformly approximated by polynomials on $[0,1]$. Compactness is also essential for this formulation: on the noncompact domain $\mathbb R$, the function $e^{-x^2}$ cannot be uniformly approximated by polynomials, since any polynomial uniformly close to it would have to remain bounded at infinity and therefore be constant. The interval hypothesis is what makes the Bernstein construction available in this elementary form, although more general compact-space density theorems appear later under different hypotheses. The theorem also gives less information than many applications require. It guarantees some polynomial with error below a prescribed tolerance, but it does not identify the best polynomial of a fixed degree, give a sharp convergence rate, control numerical stability, or explain which basis should be used in computation. The rest of the course refines this theorem in several directions: Bernstein and Stone-Weierstrass density in Chapter 2, Jackson-type rate estimates in Chapter 3, trigonometric analogues, spline approximation for local control, and rational approximation for singular behaviour. [example: Smoothness Predicts Error Decay] For $f(x)=|x|$ on $[-1,1]$, write $x=\cos\theta$ with $0\le \theta\le \pi$. The Chebyshev coefficient of $T_n$ is a cosine coefficient of $|\cos\theta|$. For $n=2m$ with $m\ge 1$, \begin{align*} a_{2m}=\frac{2}{\pi}\int_0^\pi |\cos\theta|\cos(2m\theta)\,d\theta. \end{align*} Using the symmetry of $|\cos\theta|$ about $\pi/2$ and the evenness of $\cos(2m\theta)$ about $\pi/2$, \begin{align*} a_{2m}=\frac{4}{\pi}\int_0^{\pi/2}\cos\theta\cos(2m\theta)\,d\theta. \end{align*} The product-to-sum identity gives \begin{align*} \cos\theta\cos(2m\theta)=\frac12\cos((2m-1)\theta)+\frac12\cos((2m+1)\theta). \end{align*} Therefore \begin{align*} a_{2m}=\frac{2}{\pi}\left(\frac{\sin((2m-1)\pi/2)}{2m-1}+\frac{\sin((2m+1)\pi/2)}{2m+1}\right). \end{align*} Since $\sin((2m-1)\pi/2)=(-1)^{m-1}$ and $\sin((2m+1)\pi/2)=(-1)^m$, \begin{align*} a_{2m}=\frac{2}{\pi}(-1)^{m-1}\left(\frac{1}{2m-1}-\frac{1}{2m+1}\right). \end{align*} The difference of fractions is \begin{align*} \frac{1}{2m-1}-\frac{1}{2m+1}=\frac{(2m+1)-(2m-1)}{(2m-1)(2m+1)}=\frac{2}{4m^2-1}, \end{align*} so \begin{align*} a_{2m}=\frac{4(-1)^{m-1}}{\pi(4m^2-1)}. \end{align*} Thus the nonzero Chebyshev coefficients of $|x|$ decay like $m^{-2}$, reflecting the loss of smoothness at the corner $x=0$. By contrast, for a function that extends holomorphically to a complex neighbourhood of $[-1,1]$, the corresponding Chebyshev coefficients have geometric decay: for some $\rho>1$ and constant $C>0$, \begin{align*} |a_n|\le C\rho^{-n}. \end{align*} Since $\rho^{-n}$ decreases exponentially while $m^{-2}$ decreases only algebraically, the analytic function admits much faster polynomial approximation. The comparison shows that the regularity of the target function controls the rate, not merely the dimension of the polynomial space. [/example] ## Themes Of The Course The course now proceeds from abstract approximation problems to concrete approximation systems. Chapter 1 develops Hilbert-space projection for least-squares approximation and Haar theory for uniqueness in uniform approximation; Chapter 2 uses Bernstein polynomials for constructive density; Chapter 3 introduces moduli of continuity and smoothness for rates; Chapter 6 returns to orthogonal and Chebyshev polynomials for stable high-degree approximation. Spline theory then introduces local basis functions and piecewise polynomial spaces, which are central in numerical methods. Trigonometric approximation connects the subject with Fourier analysis and periodic phenomena. Rational approximation completes the main arc by showing how nonlinear approximants respond to singularities in ways that polynomial spaces cannot match. A useful habit throughout the course is to ask four questions for every approximation method. What is the admissible family? Which norm measures the error? Does a best approximant exist and is it unique? What regularity of the target controls the rate of convergence? Keeping these questions separate prevents many common confusions and makes the later theorems fit into a coherent picture. The opening chapter separated the notion of an approximation problem from the existence and uniqueness of a best approximant, and that distinction now becomes the formal starting point. In the normed-space setting, we can ask precise questions about the admissible set, the error norm, and the geometry of best approximation. # 1. Normed Approximation Problems and Best Approximation Building on the introductory distinction between an approximation problem, a best error, and a best approximant, this chapter fixes the basic language in a more systematic normed-space form. The guiding question is: given a function or vector $x$ in a normed space, and a simpler class $M$ of admissible approximants, when does there exist a best choice $m \in M$, is it unique, and how can it be found? The chapter assumes the standard first course background on normed spaces, compactness in finite-dimensional normed spaces, continuous functions on compact metric spaces, and Hilbert space inner products. It begins with normed best approximation, then specialises to Hilbert spaces where geometry turns minimisation into orthogonality, and ends with Haar spaces, which explain why uniform approximation on an interval often has uniqueness despite the non-Hilbert nature of the sup norm. ## Normed Approximation Problems The first problem is to turn the informal phrase "best approximation" into a minimisation problem. The data are a normed space, an element to be approximated, and a subset representing the permitted approximants; the norm decides what kind of error the problem measures. [definition: Best Approximation Problem] Let $(X, \|\cdot\|_X)$ be a normed space, let $M \subset X$, and let $x \in X$. The distance from $x$ to $M$ is \begin{align*} \operatorname{dist}(x,M) := \inf_{m \in M} \|x-m\|_X. \end{align*} An element $m_0 \in M$ is a best approximation to $x$ from $M$ if \begin{align*} \|x-m_0\|_X = \operatorname{dist}(x,M). \end{align*} [/definition] This definition separates two issues that are often conflated. The number $\operatorname{dist}(x,M)$ always exists as an infimum in $[0,\infty)$, but there need not be an element of $M$ attaining it. This gap is the first obstruction in best approximation: before asking whether a best approximation is unique or computable, one must know whether the infimum is actually realised in the approximation set. The following terminology names the subsets for which this existence problem never occurs. [definition: Proximinal Set] Let $(X,\|\cdot\|_X)$ be a normed space. A subset $M \subset X$ is proximinal in $X$ if every $x \in X$ has at least one best approximation from $M$. [/definition] Proximinality is an existence property, not a uniqueness property. Approximation theory often asks for both, because an algorithm can be stable only after the target object it is trying to compute has been identified unambiguously. [definition: Chebyshev Set] Let $(X,\|\cdot\|_X)$ be a normed space. A subset $M \subset X$ is a Chebyshev set in $X$ if every $x \in X$ has a unique best approximation from $M$. [/definition] The terminology is tied to uniform approximation, where the norm records the largest pointwise error. The comparison case is least-squares approximation, where errors are averaged quadratically and Hilbert space geometry becomes available. [example: Best Constant Approximation In Two Norms] Let $[a,b]$ be a nondegenerate compact interval and let $f\in C[a,b]$. For least-squares approximation by constants, set \begin{align*} A=\int_a^b f(t)\,dt \end{align*} and define \begin{align*} c_0=\frac{A}{b-a}. \end{align*} For any constant $c\in\mathbb R$, the squared $L^2$ error is \begin{align*} Q(c)=\int_a^b (f(t)-c)^2\,dt=\int_a^b f(t)^2\,dt-2cA+c^2(b-a). \end{align*} Since $A=c_0(b-a)$, subtracting the value at $c_0$ gives \begin{align*} Q(c)-Q(c_0)=-2(c-c_0)c_0(b-a)+(c^2-c_0^2)(b-a). \end{align*} Factoring the right-hand side, \begin{align*} Q(c)-Q(c_0)=(b-a)(c-c_0)(c+c_0-2c_0)=(b-a)(c-c_0)^2. \end{align*} Because $b-a>0$, this is nonnegative for every $c$, and it is zero only when $c=c_0$. Thus the least-squares best constant is \begin{align*} c_0=\frac{1}{b-a}\int_a^b f(t)\,dt. \end{align*} For the uniform norm, let \begin{align*} \alpha=\min_{t\in[a,b]}f(t) \end{align*} and \begin{align*} \beta=\max_{t\in[a,b]}f(t), \end{align*} which exist because $f$ is continuous on a compact interval. Put \begin{align*} d_0=\frac{\alpha+\beta}{2}. \end{align*} For any constant $c$, the uniform error is at least the larger of the errors at a point where $f=\alpha$ and a point where $f=\beta$, so \begin{align*} \|f-c\|_\infty\ge \max\{|\beta-c|,|\alpha-c|\}. \end{align*} Also, \begin{align*} \max\{|\beta-c|,|\alpha-c|\}\ge \frac{|\beta-c|+|\alpha-c|}{2}\ge \frac{|\beta-\alpha|}{2}=\frac{\beta-\alpha}{2}. \end{align*} For $d_0$, every value $f(t)$ lies between $\alpha$ and $\beta$, hence \begin{align*} |f(t)-d_0|\le \frac{\beta-\alpha}{2}\quad\text{for all }t\in[a,b]. \end{align*} At points where $f=\alpha$ or $f=\beta$, equality holds. Therefore \begin{align*} \|f-d_0\|_\infty=\frac{\beta-\alpha}{2}. \end{align*} So the best uniform constant is \begin{align*} d_0=\frac{\max_{t\in[a,b]}f(t)+\min_{t\in[a,b]}f(t)}{2}, \end{align*} and the minimum uniform error is half the oscillation of $f$. The two norms therefore choose different centers: the least-squares norm centers the average value, while the uniform norm centers the extreme range. [/example] The example shows that the same approximation class can give different answers under different norms, so later results must track the optimal error as the target varies. This leads to a function on the ambient space whose continuity will measure stability of the approximation problem. [definition: Error Functional] Let $(X,\|\cdot\|_X)$ be a normed space and let $M \subset X$. The error functional associated to $M$ is the map \begin{align*} E_M : X &\to [0,\infty), & E_M(x) &:= \operatorname{dist}(x,M). \end{align*} [/definition] For a linear subspace $M$, the error functional measures how far $x$ is from satisfying the structural restriction imposed by $M$. The next elementary estimate is useful because it says that small perturbations of the target cannot change the optimal error by more than the perturbation size. [quotetheorem:6861] [citeproof:6861] The Lipschitz estimate gives stability of the optimal error, but it is deliberately only a statement about the value of the optimisation problem. The hypothesis that $M$ is nonempty is needed because the comparison with a point $m \in M$ starts the triangle-inequality argument; if $M=\varnothing$, the distance is not a finite real-valued error functional in the sense used here. The theorem also does not assert that a best approximant exists, nor that best approximants vary continuously when they do exist. For example, in $X=\mathbb R$ with $M=\{-1,1\}$, the error value is Lipschitz, but near $x=0$ the preferred nearest point jumps from $-1$ to $1$, and at $x=0$ there are two nearest points. Thus the value of the error can be stable while the selector of minimisers is nonunique or discontinuous. This distinction motivates the next existence theorem. In polynomial, spline, and numerical trial-space problems, the approximating class is usually finite-dimensional, so compactness can often be used to show that the infimum defining the distance is actually attained. [quotetheorem:6862] [citeproof:6862] This theorem proves existence but says nothing about uniqueness. Each hypothesis marks a real boundary of the argument. Compactness of $K$ ensures that continuous functions have finite uniform norm and that the usual space $C(K)$ is the right ambient object; on a noncompact domain such as $\mathbb R$, the continuous function $f(t)=t$ cannot even be approximated by constants with finite uniform error. Finite-dimensionality is also essential in this compactness proof: the polynomial subspace inside $C[0,1]$ is infinite-dimensional and not closed, and for a continuous non-polynomial function $f$ the Weierstrass theorem gives distance $0$ from $f$ to the polynomials without giving a polynomial best approximant. This is the same [closedness obstruction](/theorems/3673) in another form: a nonclosed trial space may have an infimum that is approached by admissible approximants but not attained inside the trial space. The uniform norm is not strictly convex, so even after existence is secured, uniqueness needs extra structure on the approximating subspace. Haar spaces provide that structure later in the chapter by imposing a zero-counting condition that rules out the flat faces responsible for multiple best approximants. ## Hilbert Space Projection And Normal Equations The next question is what additional information an inner product gives. In Hilbert spaces, minimising distance to a closed convex set is governed by orthogonality, and minimising distance to a finite-dimensional subspace becomes a linear system. [definition: Orthogonal Projection Onto A Subspace] Let $H$ be a Hilbert space and let $M \subset H$ be a linear subspace. An element $m_0 \in M$ is the orthogonal projection of $x \in H$ onto $M$ if \begin{align*} x-m_0 \in M^\perp. \end{align*} When such an element exists uniquely for each $x \in H$, the associated projection map is \begin{align*} P_M : H &\to M, & P_Mx&=m_0. \end{align*} [/definition] The definition is phrased as an orthogonality condition rather than as a minimisation condition. The [projection theorem](/theorems/1985) is the result that connects these two viewpoints when the subspace is closed. [quotetheorem:240] [citeproof:240] The theorem turns a nonlinear minimisation problem into a linear orthogonality condition for closed subspaces, but its hypotheses cannot be treated as decoration. Closedness is needed for existence: in $\ell^2$, the subspace $c_{00}$ of finitely supported sequences is dense but not closed, so a vector such as $(1/n)_{n\ge 1}\in \ell^2$ has distance $0$ from $c_{00}$ without having a nearest finitely supported point. Convexity is needed for uniqueness: the [closed set](/page/Closed%20Set) $\{-1,1\}\subset \mathbb R$ has two nearest points to $0$. The real Hilbert-space assumption matches the scalar variation used in the proof; over complex Hilbert spaces the same projection result holds, but the orthogonality condition must be checked by varying in both real and imaginary directions. For a general closed convex set $M$, the nearest point is better characterised by the variational inequality $(x-m_0,m-m_0)_H\le 0$ for all $m\in M$, rather than by an orthogonal residual against every element of $M$. Thus the normal-equation method belongs to the linear-subspace case. With these restrictions in place, when a closed subspace has a chosen basis, the orthogonality condition becomes the normal equations used in least-squares computation. [quotetheorem:6863] [citeproof:6863] Normal equations are especially useful when the approximating functions are chosen to be orthogonal. Then the Gram matrix is diagonal, and the projection coefficients are obtained independently. Without orthogonality, the Gram matrix can be badly conditioned even when it is invertible; this is a numerical linear algebra issue rather than an approximation-theoretic failure, but it affects computed least-squares approximants. The hypotheses separate three different failure modes. [Linear independence](/page/Linear%20Independence) is the algebraic condition that prevents singularity of the Gram matrix. If the spanning family is dependent, the best approximating vector $p \in V$ is still unique by the projection theorem, but its coefficient vector need not be unique and the Gram matrix is singular. Closedness is the analytic condition behind the same projection picture: in $\ell^2$, the dense nonclosed subspace $c_{00}$ has distance $0$ from every vector in $\ell^2$, but a vector with infinite support has no nearest point in $c_{00}$. Thus finite-dimensionality is not just a convenient way to write matrices; it guarantees closedness and hence attainment of the projected vector. The Hilbert-space inner product is the geometric condition that turns best approximation into an orthogonality equation. Outside an inner product setting there is no residual vector orthogonal to the trial space, so best approximation in a general normed space must be characterised by other tools such as supporting functionals, convexity, or alternation in the uniform norm. This same distinction appears in optimisation: a Hilbert projection is a quadratic normal-equation problem, while a uniform approximation problem behaves more like a constrained minimax problem. The limitation is not merely technical: in non-Hilbert norms, the geometry of nearest points can have flat faces rather than a single perpendicular residual. A Euclidean normal equation may therefore select one convenient candidate without describing the whole best-approximation set for the original norm. There is no analogous normal-equation system for best uniform approximation, where the replacement principle is alternation rather than orthogonality. [example: Least-Squares Polynomial Fit] Let $H=L^2(a,b)$ with inner product $(f,g)_{L^2}=\int_a^b f(t)g(t)\,dt$, and let $V=\mathcal P_n$. Suppose $\{\varphi_0,\dots,\varphi_n\}$ is an orthogonal basis of $V$, so each $\varphi_j$ is nonzero and $\|\varphi_j\|_{L^2}^2>0$. Define \begin{align*} p_n(t)=\sum_{j=0}^n \frac{(f,\varphi_j)_{L^2}}{\|\varphi_j\|_{L^2}^2}\varphi_j(t). \end{align*} Then $p_n\in \mathcal P_n$. To verify that it is the least-squares approximant, fix $k\in\{0,\dots,n\}$. By linearity of the inner product in the first variable, \begin{align*} (p_n,\varphi_k)_{L^2}=\sum_{j=0}^n \frac{(f,\varphi_j)_{L^2}}{\|\varphi_j\|_{L^2}^2}(\varphi_j,\varphi_k)_{L^2}. \end{align*} Orthogonality gives $(\varphi_j,\varphi_k)_{L^2}=0$ for $j\ne k$ and $(\varphi_k,\varphi_k)_{L^2}=\|\varphi_k\|_{L^2}^2$, hence all terms vanish except the $j=k$ term: \begin{align*} (p_n,\varphi_k)_{L^2}=\frac{(f,\varphi_k)_{L^2}}{\|\varphi_k\|_{L^2}^2}\|\varphi_k\|_{L^2}^2=(f,\varphi_k)_{L^2}. \end{align*} Therefore \begin{align*} (f-p_n,\varphi_k)_{L^2}=(f,\varphi_k)_{L^2}-(p_n,\varphi_k)_{L^2}=0. \end{align*} Now let $q\in\mathcal P_n$. Since the $\varphi_k$ form a basis, there are scalars $b_0,\dots,b_n$ with \begin{align*} q=\sum_{k=0}^n b_k\varphi_k. \end{align*} Using linearity in the second variable, \begin{align*} (f-p_n,q)_{L^2}=\sum_{k=0}^n b_k(f-p_n,\varphi_k)_{L^2}=\sum_{k=0}^n b_k\cdot 0=0. \end{align*} Thus $f-p_n$ is orthogonal to every polynomial in $\mathcal P_n$. By *[Projection onto Closed Convex Sets](/theorems/240)*, applied to the closed linear subspace $\mathcal P_n$, this orthogonality condition characterises the best approximation from $\mathcal P_n$, so $p_n$ is the least-squares polynomial approximant to $f$. Its coefficients are independent because the orthogonal basis diagonalises the normal equations. [/example] The Hilbert space picture is powerful, but many approximation problems in numerical analysis and classical approximation theory use the uniform norm. The rest of the chapter explains the replacement for orthogonality in that setting. ## Haar Spaces And Uniform Uniqueness Uniform approximation asks for the largest pointwise error to be as small as possible. The existence theorem already applies to finite-dimensional subspaces of $C(K)$, but uniqueness can fail unless the subspace oscillates in a controlled way. [definition: Haar Space] Let $V \subset C[a,b]$ be an $n$-dimensional subspace. The space $V$ is a Haar space on $[a,b]$ if every nonzero $v \in V$ has at most $n-1$ distinct zeros in $[a,b]$. [/definition] The zero-counting condition counts distinct points, not multiplicity. It says that no nonzero element of $V$ can vanish at too many separate locations. This is the uniform approximation analogue of nonsingularity of a Gram matrix: it prevents two different best approximants from sharing the same extremal error pattern. [example: Polynomial Haar Space] Let $V=\mathcal P_{n-1}$ be the space of real polynomials of degree at most $n-1$ on $[a,b]$. We show that $V$ is a Haar space. If $p\in V$ is nonzero and has distinct zeros $x_1,\dots,x_k$ in $[a,b]$, then polynomial division by $t-x_1$ gives \begin{align*} p(t)=(t-x_1)q_1(t) \end{align*} because the remainder is the constant $p(x_1)=0$. Since $x_2,\dots,x_k$ are also zeros of $p$ and none of them equals $x_1$, each satisfies $q_1(x_i)=0$. Repeating the same division step gives \begin{align*} p(t)=(t-x_1)(t-x_2)\cdots(t-x_k)q_k(t) \end{align*} for some real polynomial $q_k$. Therefore $k\le \deg p$. But $p\in\mathcal P_{n-1}$, so $\deg p\le n-1$, and hence $k\le n-1$. Thus every nonzero element of $\mathcal P_{n-1}$ has at most $n-1$ distinct zeros on $[a,b]$, which is exactly the Haar condition. Classical polynomial uniform approximation therefore fits the Haar framework. [/example] The example identifies the basic model Haar space, so the next question is what the Haar hypothesis buys for an arbitrary finite-dimensional subspace. Its main role is to upgrade the compactness-based existence result into a uniqueness theorem for every continuous target. [quotetheorem:6864] Statement-level reference. This lemma is the uniform-norm replacement for an orthogonality condition. In this chapter it identifies the extremal sign pattern that separates a merely existing best approximant from one that is forced by the geometry of a Haar space. The following uniqueness theorem is the main consequence: the zero-counting condition prevents the flat directions that would otherwise allow multiple best approximants. [quotetheorem:6865] [citeproof:6865] This result points toward the equioscillation theorem of Chapter 5, where alternation becomes a central computational criterion for best uniform approximation. The Haar hypothesis is not merely technical: if $V=\operatorname{span}\{t^2\}\subset C[-1,1]$, then $V$ is not Haar for $n=1$ because $t^2$ has a zero, and the constant function $f(t)=1$ has many best approximants from $V$. Indeed, every $a t^2$ with $0\le a\le 2$ has uniform error $1$, since the error at $t=0$ is always $1$ and the endpoint errors remain bounded by $1$. The theorem also gives uniqueness rather than a practical construction; to find the approximant, one needs the alternation theorem and algorithms such as the Remez exchange method. For now, the important message is that uniqueness in the sup norm comes from oscillation control rather than from an inner product. [example: Best Uniform Constant On An Interval] Let $V$ be the constant functions on $[a,b]$, so $\dim V=1$. A nonzero constant function has no zeros on $[a,b]$, hence every nonzero element of $V$ has at most $0=1-1$ zeros, so $V$ is a Haar space. Let \begin{align*} \alpha=\min_{t\in[a,b]}f(t) \end{align*} and \begin{align*} \beta=\max_{t\in[a,b]}f(t), \end{align*} which exist because $f$ is continuous on the compact interval $[a,b]$. Define \begin{align*} c_0=\frac{\alpha+\beta}{2}. \end{align*} Choose points $s,u\in[a,b]$ with $f(s)=\alpha$ and $f(u)=\beta$. For any constant $c\in\mathbb R$, the uniform error satisfies \begin{align*} \|f-c\|_\infty\ge \max\{|\beta-c|,|\alpha-c|\}. \end{align*} If $M=\max\{|\beta-c|,|\alpha-c|\}$, then \begin{align*} \beta-\alpha=|(\beta-c)+(c-\alpha)|\le |\beta-c|+|c-\alpha|\le 2M. \end{align*} Therefore \begin{align*} \|f-c\|_\infty\ge M\ge \frac{\beta-\alpha}{2}. \end{align*} Now fix $t\in[a,b]$. Since $\alpha\le f(t)\le \beta$, \begin{align*} f(t)-c_0\le \beta-\frac{\alpha+\beta}{2}=\frac{\beta-\alpha}{2}. \end{align*} Also, \begin{align*} c_0-f(t)\le \frac{\alpha+\beta}{2}-\alpha=\frac{\beta-\alpha}{2}. \end{align*} Thus \begin{align*} |f(t)-c_0|\le \frac{\beta-\alpha}{2} \end{align*} for every $t\in[a,b]$. At the maximum point $u$, \begin{align*} f(u)-c_0=\beta-\frac{\alpha+\beta}{2}=\frac{\beta-\alpha}{2}, \end{align*} and at the minimum point $s$, \begin{align*} f(s)-c_0=\alpha-\frac{\alpha+\beta}{2}=-\frac{\beta-\alpha}{2}. \end{align*} Hence \begin{align*} \|f-c_0\|_\infty=\frac{\beta-\alpha}{2}. \end{align*} It remains to see uniqueness. If another constant $c$ has the same minimal error, then $|\beta-c|\le(\beta-\alpha)/2$ and $|\alpha-c|\le(\beta-\alpha)/2$. The [first inequality](/theorems/2897) gives \begin{align*} \frac{\alpha+\beta}{2}\le c\le \frac{3\beta-\alpha}{2}, \end{align*} while the second gives \begin{align*} \frac{3\alpha-\beta}{2}\le c\le \frac{\alpha+\beta}{2}. \end{align*} Both inequalities can hold only when \begin{align*} c=\frac{\alpha+\beta}{2}=c_0. \end{align*} Therefore the unique best uniform constant is \begin{align*} c_0=\frac{\max_{t\in[a,b]}f(t)+\min_{t\in[a,b]}f(t)}{2}, \end{align*} and the minimum uniform error is \begin{align*} \frac{\max_{t\in[a,b]}f(t)-\min_{t\in[a,b]}f(t)}{2}. \end{align*} When $\alpha<\beta$, the residual $f-c_0$ reaches this error with opposite signs at points where $f$ attains its maximum and minimum; when $\alpha=\beta$, the error is $0$ and $f$ is already constant. [/example] This final example ties the three viewpoints together. The abstract normed formulation identifies the minimisation problem, Hilbert space geometry solves the least-squares version by projection, and Haar theory gives uniqueness for the uniform version on compact intervals. Chapter 2 now turns from existence and uniqueness in fixed finite-dimensional spaces to density as the polynomial degree grows. Chapter 1 established the language of best error and best approximants in fixed spaces such as Haar and polynomial subspaces. Chapter 2 now asks a different question: as the polynomial degree increases, can these spaces become dense enough to approximate every continuous function on a compact interval? # 2. Polynomial Density and Constructive Weierstrass Theory Polynomial approximation is the first major density result in the course: after Chapter 1 asked whether best approximants exist inside fixed finite-dimensional spaces such as Haar and polynomial spaces, this chapter asks whether the increasing polynomial spaces $\mathcal P_n$ can approximate every continuous function. The central norm is the uniform norm on a compact set, so the main analytic input is [uniform continuity](/page/Uniform%20Continuity). Bernstein polynomials give a constructive proof on an interval, while Stone-Weierstrass explains why the same phenomenon is really about algebras of continuous functions rather than about monomials alone. ## Uniform Approximation on Compact Intervals The guiding question is: given $f \in C([a,b])$ and an error tolerance $\varepsilon > 0$, can a polynomial $p$ be chosen so that $|f(x)-p(x)|<\varepsilon$ for every $x \in [a,b]$? Chapter 1 gives existence of best approximants in each fixed polynomial space $\mathcal P_n$, but it does not say whether the best error tends to $0$ as $n \to \infty$. [definition: Uniform Polynomial Density] Let $K \subset \mathbb R$ be compact. The polynomials are uniformly dense in $C(K)$ if for every $f \in C(K)$ and every $\varepsilon > 0$, there exists a polynomial $p$ such that \begin{align*} \|f-p\|_{C(K)} := \sup_{x \in K}|f(x)-p(x)| < \varepsilon. \end{align*} [/definition] This definition turns approximation into a closure statement: the closure of the polynomial subspace in the Banach space $C(K)$ is all of $C(K)$. The immediate problem is to decide whether this closure statement is actually true for the compact intervals that support most numerical approximation schemes. Weierstrass' theorem is the foundational affirmative answer, and the rest of the chapter is organised around proving and generalising it. [quotetheorem:480] [citeproof:480] The theorem says that continuous functions on compact intervals have no uniform obstruction to polynomial approximation. Both hypotheses are doing work. Compactness cannot simply be removed: on the unbounded set $\mathbb R$, the continuous function $f(x)=e^x$ cannot be uniformly approximated by a polynomial, since $e^x-p(x)$ is unbounded as $x\to \infty$ for every polynomial $p$. Continuity is also essential: on $[0,1]$, the step function $\mathbf{1}_{[1/2,1]}$ is not a uniform limit of polynomials, because every uniform limit of continuous functions is continuous. The theorem does not claim that the polynomial is unique, nor does it give an efficient degree for a prescribed tolerance; those quantitative questions reappear in Chapter 3 through moduli of continuity and in Chapter 6 through orthogonal polynomial methods. [example: Approximating a Corner] Let $f(x)=|x|$ on $[-1,1]$. The function is continuous, since for all $x,y\in[-1,1]$, \begin{align*} \bigl||x|-|y|\bigr|\le |x-y| \end{align*} by the [reverse triangle inequality](/theorems/2300). Hence the *[Weierstrass Approximation Theorem](/theorems/480)* applies and gives, for each $n$, a polynomial $p_n$ such that \begin{align*} \|f-p_n\|_{C([-1,1])}<\frac{1}{n}. \end{align*} Therefore \begin{align*} 0\le \|f-p_n\|_{C([-1,1])}<\frac{1}{n}, \end{align*} so $\|f-p_n\|_{C([-1,1])}\to 0$. This happens even though $f$ is not differentiable at $0$. Indeed, for $h>0$, \begin{align*} \frac{f(h)-f(0)}{h}=\frac{|h|-0}{h}=1, \end{align*} while for $h<0$, \begin{align*} \frac{f(h)-f(0)}{h}=\frac{|h|-0}{h}=\frac{-h}{h}=-1. \end{align*} The one-sided difference quotients are different, so the derivative at $0$ does not exist. Thus uniform polynomial density is controlled by continuity of the target function, not by differentiability; differentiability affects quantitative approximation behavior rather than the existence of uniformly approximating polynomials. [/example] The compactness of the interval is used through uniform continuity. The corner example shows that differentiability is not required, but the Bernstein proof still needs a uniform way to compare $f(k/n)$ with $f(x)$ simultaneously for all $x$. The next theorem supplies precisely that global comparison scale. [quotetheorem:280] [citeproof:280] This compactness lemma is the reason Bernstein's construction works uniformly rather than only pointwise. Compactness is not a cosmetic assumption: $f(x)=x^2$ is continuous on $\mathbb R$ but is not uniformly continuous, since increments of a fixed small size produce arbitrarily large changes when $|x|$ is large. On a compact interval, by contrast, every pointwise continuity scale can be replaced by one global scale. That global scale is exactly what the Bernstein proof needs when it controls all $x\in[0,1]$ at once, and it is also the reason compact domains are the natural setting for uniform approximation in this chapter. ## Bernstein Polynomials and Constructive Approximation How can Weierstrass' theorem be proved by an explicit formula rather than an abstract existence argument? Bernstein's idea is to sample the function on the grid $0,1/n,\dots,1$ and blend the samples using binomial probabilities. [definition: Bernstein Polynomial] Let $f \in C([0,1])$ and $n \in \mathbb N$. The $n$-th Bernstein polynomial of $f$ is \begin{align*} B_n f(x) := \sum_{k=0}^{n} f\left(\frac{k}{n}\right) \binom{n}{k} x^k(1-x)^{n-k}, \qquad x \in [0,1]. \end{align*} [/definition] The formula is useful only because its coefficients behave like averaging weights rather than arbitrary polynomial coefficients. The next result verifies the two algebraic facts that make this possible: the weights have total mass $1$, and no cancellation occurs on the interval. These facts turn $B_n f(x)$ into a weighted average of the sampled values of $f$ near $x$. [quotetheorem:6866] [citeproof:6866] The probabilistic interpretation packages the same algebra into a useful estimate, but it depends crucially on staying inside $[0,1]$. For $x\in[0,1]$, the Bernstein basis functions are non-negative and sum to $1$, so they can be read as probabilities. If $x$ lies outside this interval, the factors $x^k(1-x)^{n-k}$ can change sign; the partition-of-unity identity still holds algebraically, but the averaging interpretation and the positivity estimates fail. This is why the constructive proof is first stated on $[0,1]$ and only then transferred to $[a,b]$ by an affine change of variables. [remark: Probabilistic Interpretation] If $X_n \sim \operatorname{Bin}(n,x)$, then \begin{align*} B_n f(x)=\mathbb E\left[f\left(\frac{X_n}{n}\right)\right]. \end{align*} Thus $B_n f(x)$ averages $f$ near $x$, because \begin{align*} \mathbb E\left[\frac{X_n}{n}\right]=x, \qquad \operatorname{Var}\left(\frac{X_n}{n}\right)=\frac{x(1-x)}{n}. \end{align*} [/remark] This interpretation leads to the constructive convergence theorem. The point is that $X_n/n$ concentrates near $x$, and uniform continuity controls $f(X_n/n)-f(x)$ on the high-probability part of the average. [quotetheorem:1215] [citeproof:1215] Because this proof is quantitative in the variance, it also explains the slow convergence near points where the function has low regularity. The continuity and compact-domain hypotheses are essential for the stated uniform conclusion. If $f$ has a jump discontinuity on $[0,1]$, then no sequence of Bernstein polynomials can converge uniformly to $f$, since a uniform limit of continuous functions is continuous. If the domain is made non-compact, uniform approximation can fail even for continuous functions, as with $e^x$ on $\mathbb R$. Thus Bernstein polynomials are excellent for proving density and preserving shape on compact intervals, but the theorem does not by itself give optimal rates, best approximants, or efficient numerical schemes; those are separate approximation-theoretic questions. [example: Bernstein Approximation of the Square Root] Let $f(x)=\sqrt{x}$ on $[0,1]$. Since $f$ is continuous on $[0,1]$, the *Bernstein Approximation Theorem* applies to the Bernstein polynomials \begin{align*} B_n f(x)=\sum_{k=0}^{n} f\left(\frac{k}{n}\right)\binom{n}{k}x^k(1-x)^{n-k}. \end{align*} Substituting $f(k/n)=\sqrt{k/n}$ gives \begin{align*} B_n f(x)=\sum_{k=0}^{n}\sqrt{\frac{k}{n}}\binom{n}{k}x^k(1-x)^{n-k}. \end{align*} Therefore \begin{align*} \|B_n f-f\|_{C([0,1])}=\sup_{x\in[0,1]}\left|B_n f(x)-\sqrt{x}\right|\to 0. \end{align*} This convergence does not require bounded derivative. For $x>0$, the derivative is computed from the difference quotient: \begin{align*} \frac{\sqrt{x+h}-\sqrt{x}}{h} = \frac{(\sqrt{x+h}-\sqrt{x})(\sqrt{x+h}+\sqrt{x})}{h(\sqrt{x+h}+\sqrt{x})} = \frac{x+h-x}{h(\sqrt{x+h}+\sqrt{x})} = \frac{1}{\sqrt{x+h}+\sqrt{x}}, \end{align*} so letting $h\to 0$ gives \begin{align*} f'(x)=\frac{1}{2\sqrt{x}}. \end{align*} As $x\downarrow 0$, the values $1/(2\sqrt{x})$ become arbitrarily large; for example, if $0<x<1/(4M^2)$, then $1/(2\sqrt{x})>M$. Thus the Bernstein polynomials converge uniformly to $\sqrt{x}$ even though the derivative has an endpoint singularity at $0$. [/example] The square-root example highlights a special feature at the boundary: although the approximants smooth the graph in the interior, they do not distort the two endpoint values. This matters in numerical applications where boundary data must be retained exactly by the approximating polynomial. The following short computation records this endpoint interpolation property. [quotetheorem:6867] [citeproof:6867] Endpoint interpolation is one consequence of the special Bernstein weights, not a general feature of polynomial approximation. For example, if $f(0)=0$ and $f(1)=1$, the constant polynomial $p(x)=1/2$ may be a rough uniform approximant, but it preserves neither endpoint value. Even sequences of good approximating polynomials need not interpolate endpoints unless that condition is built into the construction. The more robust Bernstein property is positivity: it is the structural feature that lets an approximation operator preserve inequalities and allows convergence to be tested using comparison functions. To state this independently of Bernstein polynomials, we name the operator-theoretic property. [definition: Positive Linear Operator] Let $K$ be a compact [Hausdorff space](/page/Hausdorff%20Space). A [linear map](/page/Linear%20Map) $L:C(K)\to C(K)$ is positive if $f(x)\ge 0$ for all $x\in K$ implies $(Lf)(x)\ge 0$ for all $x\in K$. [/definition] This definition isolates an order-preserving mechanism on $C(K)$. For Bernstein operators, the mechanism comes from non-negative basis functions, and it immediately implies that inequalities between functions are inherited by their Bernstein approximants. The next theorem makes that inheritance explicit before we use it in Korovkin's theorem. [quotetheorem:6868] [citeproof:6868] Order preservation controls inequalities between two functions, and positivity is the essential ingredient. Linearity alone does not preserve order: the linear operator $L:C([0,1])\to C([0,1])$ given by $Lf=-f$ sends every non-negative non-zero function to a non-positive function, so it reverses the order rather than preserving it. Positivity rules out this cancellation and makes the Bernstein operator behave like an average. The next problem is whether the operator also preserves geometric shape of a single function, such as monotonicity or convexity, which is important when approximants represent monotone data or convex profiles. Bernstein polynomials answer this problem through finite differences of sampled values, so they preserve qualitative information that arbitrary approximating polynomials may lose. [quotetheorem:6869] [citeproof:6869] Shape preservation is a strong advantage of Bernstein approximation, but it is not a consequence of uniform approximation alone. A sequence of arbitrary polynomial approximants to an increasing or convex function may oscillate and lose those qualitative features before eventually converging in norm. Bernstein polynomials avoid this because their derivative formulas are controlled by non-negative finite differences of the sampled data. The next theorem abstracts a different part of the Bernstein proof: instead of proving convergence on every continuous function directly, it suffices for positive operators to test three functions. The shift from finite differences to test functions is the bridge from a special constructive formula to a general positive-operator principle. [quotetheorem:6870] [citeproof:6870] Korovkin's theorem explains why Bernstein approximation can be checked by moments. Positivity is indispensable here: without it, convergence on $1$, $x$, and $x^2$ does not prevent a linear operator from having large oscillatory errors on other continuous functions. The three test functions are special because they control constants, location, and quadratic spread; together with positivity and uniform continuity, this is enough to dominate every continuous function by quadratic comparison functions. For Bernstein operators, the test functions satisfy $B_n1=1$, $B_nx=x$, and \begin{align*} B_n(x^2)=x^2+\frac{x(1-x)}{n}, \end{align*} so the three Korovkin hypotheses hold. The theorem does not apply to arbitrary non-positive approximation schemes, and it is tied to the compact interval structure through the choice of these test functions. ## Stone-Weierstrass and Polynomial Density on Compact Spaces The interval theorem raises a broader question: which families of continuous functions are dense in $C(K)$ when $K$ is no longer an interval? The answer is that the approximating family must be closed under algebraic operations and must distinguish different points of the [compact space](/page/Compact%20Space). [definition: Separating Subalgebra] Let $K$ be a compact Hausdorff space. A subset $A\subset C(K;\mathbb R)$ is a separating subalgebra containing constants if $A$ is a real vector subspace of $C(K;\mathbb R)$, $fg\in A$ whenever $f,g\in A$, $1\in A$, and for every $x,y\in K$ with $x\ne y$, there exists $f\in A$ such that $f(x)\ne f(y)$. [/definition] The algebra condition lets approximants be combined by addition and multiplication, while the separation condition prevents the algebra from identifying two distinct points forever. The resulting problem is whether these necessary conditions are sufficient for uniform density. Stone-Weierstrass answers yes: for real subalgebras on compact Hausdorff spaces, [algebraic closure](/page/Algebraic%20Closure) under products plus point separation gives enough functions to approximate every continuous target. [quotetheorem:886] [citeproof:886] This theorem shifts the focus from explicit polynomial formulas to structural hypotheses, and each hypothesis prevents a concrete obstruction. Constants are needed because an algebra without them may be unable to approximate even constant functions; for instance, functions in $C([0,1])$ that vanish at $0$ form a closed algebra but cannot approximate the constant function $1$ uniformly. Separation is needed because if every function in $A$ has $f(x)=f(y)$ for two distinct points, then the same equality holds in the uniform closure, so functions separating those points are unreachable. Compactness is the setting in which the uniform norm and finite subcover argument work; on non-compact spaces the corresponding statement requires extra hypotheses or a different function space. The real hypothesis also matters: complex Stone-Weierstrass needs a conjugation condition, and without it a complex subalgebra may separate points and contain constants but still fail to be dense. The Bernstein theorem proves density on $[0,1]$ constructively; Stone-Weierstrass proves density wherever the algebra has enough functions to distinguish points and no structural obstruction remains. [example: Polynomial Approximation on Compact Subsets of Euclidean Space] Let $K\subset \mathbb R^m$ be compact, and let $A$ be the set of functions on $K$ of the form $p|_K$, where $p$ is a real polynomial in the variables $x_1,\dots,x_m$. We verify the hypotheses of the *[Stone-Weierstrass Theorem](/theorems/886)*. If $c\in\mathbb R$, then the constant polynomial $p(x_1,\dots,x_m)=c$ satisfies $p|_K=c$, so $A$ contains the constant functions. If $p|_K,q|_K\in A$ and $\alpha,\beta\in\mathbb R$, then \begin{align*} \alpha p|_K+\beta q|_K=(\alpha p+\beta q)|_K, \end{align*} and $\alpha p+\beta q$ is again a polynomial. Also, \begin{align*} (p|_K)(q|_K)=(pq)|_K, \end{align*} and $pq$ is again a polynomial. Thus $A$ is a real subalgebra of $C(K;\mathbb R)$. It remains to check point separation. For each $i=1,\dots,m$, the coordinate function $x\mapsto x_i$ belongs to $A$, because it is the restriction to $K$ of the polynomial $p_i(x_1,\dots,x_m)=x_i$. If $x,y\in K$ and $x\ne y$, then, since equality in $\mathbb R^m$ is coordinatewise equality, there is some index $i$ with $x_i\ne y_i$. For this coordinate polynomial, \begin{align*} p_i|_K(x)=x_i \end{align*} and \begin{align*} p_i|_K(y)=y_i, \end{align*} so $p_i|_K(x)\ne p_i|_K(y)$. Hence $A$ separates points of $K$. By the *Stone-Weierstrass Theorem*, for every $f\in C(K)$ and every $\varepsilon>0$, there is a polynomial $p$ in $m$ variables such that \begin{align*} \sup_{x\in K}|f(x)-p(x)|<\varepsilon. \end{align*} Thus every continuous function on a compact subset of Euclidean space can be uniformly approximated by restrictions of multivariable polynomials. [/example] Stone-Weierstrass also clarifies why some approximation spaces fail. If an algebra does not separate two points, every function in its uniform closure takes the same value at those points, so it cannot approximate a continuous function that assigns them different values. [example: A Non-Separating Algebra] Let $K=[-1,1]$ and let $A$ be the algebra of even polynomials restricted to $K$. If $p\in A$, then there is a polynomial $q$ with only even powers such that $p=q|_K$. Thus \begin{align*} q(t)=a_0+a_1t^2+a_2t^4+\cdots+a_Nt^{2N} \end{align*} for some real coefficients $a_0,\dots,a_N$. For every $x\in[-1,1]$, \begin{align*} p(-x)=q(-x)=a_0+a_1(-x)^2+a_2(-x)^4+\cdots+a_N(-x)^{2N}. \end{align*} Since $(-x)^{2j}=x^{2j}$ for each $j$, this becomes \begin{align*} p(-x)=a_0+a_1x^2+a_2x^4+\cdots+a_Nx^{2N}=q(x)=p(x). \end{align*} Hence every function in $A$ takes the same value at $x$ and $-x$, so $A$ does not separate these two points whenever $x\ne 0$. We can see the failure of density directly using $f(x)=x$. Let $p\in A$. Since $p(1)=p(-1)$, \begin{align*} 2=f(1)-f(-1). \end{align*} Also, \begin{align*} f(1)-f(-1)=(f(1)-p(1))+(p(1)-p(-1))+(p(-1)-f(-1)). \end{align*} The middle term is $p(1)-p(-1)=0$, so \begin{align*} 2=(f(1)-p(1))+(p(-1)-f(-1)). \end{align*} Taking absolute values and applying the triangle inequality gives \begin{align*} 2\le |f(1)-p(1)|+|p(-1)-f(-1)|. \end{align*} Each term is bounded by $\|f-p\|_{C([-1,1])}$, so \begin{align*} 2\le 2\|f-p\|_{C([-1,1])}. \end{align*} Therefore \begin{align*} \|f-p\|_{C([-1,1])}\ge 1 \end{align*} for every $p\in A$. Thus even polynomials cannot uniformly approximate $f(x)=x$, and the obstruction is exactly that they identify the distinct points $1$ and $-1$. [/example] The chapter's main arc is now complete: Weierstrass gives the density target, Bernstein gives an explicit constructive sequence with positivity and shape preservation, Korovkin isolates the finite test-function mechanism behind positive approximation processes, and Stone-Weierstrass generalises polynomial density to compact spaces through algebraic separation. The next chapter keeps the same uniform norm but asks for rates, using finite differences to measure how fast the best errors decay. The constructive results of Chapter 2 showed that polynomials can approximate continuous functions, but they did not yet quantify how smoothness influences the speed of convergence. This chapter keeps the uniform norm but shifts to rates, using moduli of smoothness and finite differences to measure approximation power precisely. # 3. Moduli of Smoothness and Direct Approximation Theorems Chapter 2 gave constructive polynomial approximation through Bernstein polynomials, Korovkin's theorem, and Stone-Weierstrass density. This chapter asks for more quantitative information: if a function has a specified amount of smoothness, how fast should its best approximation error decrease? The prerequisites are the uniform norm on compact spaces, elementary polynomial and trigonometric polynomial approximation, and the finite-difference notation introduced below. The answer is expressed through moduli of smoothness, which measure finite-difference stability rather than differentiability alone. These moduli lead to direct theorems of Jackson type, where smoothness bounds approximation error by algebraic or trigonometric polynomials. ## Measuring Smoothness by Finite Differences A differentiability class is often too rigid for approximation estimates: functions such as $|x|$ are not differentiable at one point but are still well approximated by polynomials. The first problem is how to quantify the size of small oscillations using only the uniform norm. [definition: Modulus Of Continuity] Let $f \in C([a,b])$. The modulus of continuity of $f$ is the function $\omega(f,\cdot):[0,\infty)\to[0,\infty)$ defined by \begin{align*} \omega(f,t) = \sup\{|f(x)-f(y)| : x,y\in [a,b],\ |x-y|\le t\}. \end{align*} [/definition] This quantity is adapted to uniform approximation because it uses the same norm and the same compact interval as the approximation problem. To see what the definition records, it is useful to compute it for the basic nonsmooth function that appears throughout direct approximation theory. [example: Modulus Of Continuity Of The Absolute Value] On $[-1,1]$, let $f(x)=|x|$. For any $x,y\in[-1,1]$, the triangle inequality gives \begin{align*} |x|=|(x-y)+y|\le |x-y|+|y|. \end{align*} Subtracting $|y|$ gives $|x|-|y|\le |x-y|$. Interchanging $x$ and $y$ gives $|y|-|x|\le |x-y|$, so \begin{align*} \bigl||x|-|y|\bigr|\le |x-y|. \end{align*} Therefore, whenever $|x-y|\le t$, we have $|f(x)-f(y)|\le t$, and hence $\omega(f,t)\le t$. For the opposite inequality, assume $0\le t\le 1$ and choose $x=t$, $y=0$. Both points lie in $[-1,1]$, and $|x-y|=t$, while \begin{align*} |f(t)-f(0)|=\bigl||t|-|0|\bigr|=t. \end{align*} Thus $\omega(f,t)\ge t$. Combining the two inequalities gives $\omega(f,t)=t$ for $0\le t\le1$, so the corner at $0$ still has first-order uniform smoothness. [/example] The example shows that first-order oscillation detects the scale of a corner, but polynomial approximation also responds to cancellation: constants, affine functions, and higher polynomials should not be treated as equally rough. This motivates a higher order difference operator that vanishes on low-degree polynomial behaviour. [definition: Forward Difference] Let $r\in\mathbb N$, let $h>0$, and let $I\subset\mathbb R$ be an interval. Define \begin{align*} I_{r,h}=\{x\in I:x,x+h,\dots,x+rh\in I\}. \end{align*} For each $f\in C(I)$, the forward difference of order $r$ and step $h$ is the map $\Delta_h^r f:I_{r,h}\to\mathbb R$ given by \begin{align*} \Delta_h^r f(x)=\sum_{j=0}^{r}(-1)^{r-j}\binom{r}{j}f(x+jh). \end{align*} [/definition] The alternating binomial coefficients make constants vanish for $r=1$, affine functions vanish for $r=2$, and polynomials of degree less than $r$ vanish for order $r$. Since approximation error should ignore the polynomial part already captured locally, we now take the largest such difference over all steps up to a scale $t$. [definition: Modulus Of Smoothness] Let $f\in C([a,b])$ and let $r\in\mathbb N$. The $r$-th modulus of smoothness of $f$ is the map $\omega_r(f,\cdot):[0,\infty)\to[0,\infty)$ defined by $\omega_r(f,0)=0$ and, for $t>0$, \begin{align*} \omega_r(f,t) = \sup\{|\Delta_h^r f(x)| : 0< h\le t,\ x,x+h,\dots,x+rh\in [a,b]\}. \end{align*} [/definition] The first modulus of smoothness agrees with the usual modulus of continuity up to the harmless choice of forward rather than symmetric intervals. Before using $\omega_r$ in approximation theorems, we need to compare it with the familiar derivative scale when derivatives exist. [quotetheorem:6871] [citeproof:6871] This theorem explains why finite differences extend derivative-based estimates: whenever the $r$-th derivative exists continuously, the finite-difference modulus has the expected order. The $C^r$ hypothesis is doing real work, because the proof bounds an $r$-fold integral by the uniform norm of $f^{(r)}$; without such a derivative, the same inequality may have no derivative term to control it. The estimate is also only one-sided: small $\|f^{(r)}\|_\infty$ forces small $r$-th differences, but small finite differences do not by themselves produce a classical $r$-th derivative. The direct theorems below only need decay of $\omega_r(f,t)$, so we name the functions whose moduli decay like a power of $t$. [definition: Lipschitz-Hölder Class] Let $0<\alpha\le r$ and let $M>0$. A function $f\in C([a,b])$ belongs to the Lipschitz-Hölder class $\operatorname{Lip}(\alpha,r;M)$ if \begin{align*} \omega_r(f,t)\le M t^\alpha \end{align*} for all $t>0$. [/definition] When $r=1$, this is the usual Hölder condition. Allowing $r>1$ separates the smoothness order from the finite-difference order used to measure it, which is useful for approximation estimates near integer smoothness thresholds. [example: Hölder Smoothness Of A Power] Let $f(x)=|x|^\alpha$ on $[-1,1]$, where $0<\alpha<1$. We first prove the estimate that controls the modulus. If $u\ge v\ge0$ and $d=u-v$, then $d=0$ gives equality, while for $d>0$ we write $u=v+d$ and set $s=v/d$. The function \begin{align*} \phi(s)=(s+1)^\alpha-s^\alpha \end{align*} satisfies \begin{align*} \phi'(s)=\alpha\bigl((s+1)^{\alpha-1}-s^{\alpha-1}\bigr)\le0 \end{align*} because $\alpha-1<0$ and $s+1>s$. Hence $\phi(s)\le \phi(0)=1$, so \begin{align*} u^\alpha-v^\alpha=d^\alpha\bigl((s+1)^\alpha-s^\alpha\bigr)\le d^\alpha=(u-v)^\alpha. \end{align*} Applying this with $u=\max\{|x|,|y|\}$ and $v=\min\{|x|,|y|\}$ gives \begin{align*} \bigl||x|^\alpha-|y|^\alpha\bigr|\le \bigl||x|-|y|\bigr|^\alpha. \end{align*} The reverse triangle inequality gives $\bigl||x|-|y|\bigr|\le |x-y|$, and therefore \begin{align*} |f(x)-f(y)|=\bigl||x|^\alpha-|y|^\alpha\bigr|\le |x-y|^\alpha. \end{align*} Thus, whenever $|x-y|\le t$, we have $|f(x)-f(y)|\le t^\alpha$, so $\omega(f,t)\le t^\alpha$ for every $t>0$. For the matching lower bound at small scales, let $0<t\le1$ and choose $x=t$, $y=0$. Then both points lie in $[-1,1]$, $|x-y|=t$, and \begin{align*} |f(t)-f(0)|=\bigl||t|^\alpha-|0|^\alpha\bigr|=t^\alpha. \end{align*} Hence $\omega(f,t)\ge t^\alpha$ for $0<t\le1$, and together with the upper bound this gives $\omega(f,t)=t^\alpha$ on $(0,1]$. Therefore $f\in \operatorname{Lip}(\alpha,1;1)$, and no larger exponent can hold with a finite constant: if $\omega(f,t)\le M t^\beta$ for some $\beta>\alpha$, then $t^\alpha\le M t^\beta$ for all $0<t\le1$, equivalently $t^{\alpha-\beta}\le M$, which fails as $t$ approaches $0$. [/example] ## Periodic Jackson Inequalities For periodic functions, trigonometric polynomials are the natural approximants, and translation is compatible with the domain. The central problem is to turn the size of $\omega_r(f,t)$ at scale $t\approx 1/n$ into a bound for approximation by frequencies at most $n$. [definition: Best Trigonometric Approximation Error] Let $n\in\mathbb N$ and let $\mathbb T=\mathbb R/(2\pi\mathbb Z)$. The best uniform trigonometric approximation error of degree at most $n$ is the functional $E_{n,\operatorname{trig}}(\cdot)_\infty:C(\mathbb T)\to[0,\infty)$ defined by \begin{align*} E_{n,\operatorname{trig}}(f)_\infty =\inf\{\|f-T\|_\infty : T(x)=\sum_{|k|\le n} c_k e^{ikx}\}. \end{align*} [/definition] The quantity $E_{n,\operatorname{trig}}(f)_\infty$ is zero exactly for trigonometric polynomials of degree at most $n$. The main direct question is whether finite-difference smoothness at frequency scale $1/n$ always produces a trigonometric approximant with comparable error. [quotetheorem:6910] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. The result is constructive in spirit: it does not merely say a good trigonometric polynomial exists, but identifies smoothing as the mechanism that produces one. Each hypothesis has a concrete role. If $f$ is the jump function on $\mathbb T$ that is $1$ on one semicircle and $0$ on the other, then no sequence of continuous trigonometric polynomials can converge to $f$ uniformly, so the continuity assumption cannot be dropped in a uniform theorem. The degree restriction is also part of the assertion: if arbitrary degrees were allowed, Weierstrass approximation would make the best error vanish in the limit without relating the error to the prescribed resolution $1/n$; for example $e^{i(n+1)x}$ has modulus of smoothness at scale $1/n$ bounded independently of $n$ but cannot be represented by degree at most $n$ trigonometric polynomials. Periodicity is needed because the finite differences use translations on the whole domain; applying the same formula to a nonperiodic function on $[0,2\pi]$, such as $f(x)=x$, creates an artificial jump when the endpoint is wrapped to the beginning. Finally, the order-$r$ modulus is matched to the cancellation demanded by the estimate: first-order control is not the same information as $r$th-order smoothness. The theorem is a direct estimate only: it says smoothness controls approximation error, but it does not say that the displayed decay rate is necessary without an inverse theorem. [example: Jackson Rate For A Periodic Holder Function] Suppose $f\in C(\mathbb T)$ satisfies $\omega_r(f,t)\le M t^\alpha$ for every $t>0$, where $0<\alpha\le r$. For each $n\ge1$, the *[Jackson Theorem For Periodic Functions](/theorems/6910)* gives \begin{align*} E_{n,\operatorname{trig}}(f)_\infty\le C_r\,\omega_r\left(f,\frac{1}{n}\right). \end{align*} Applying the assumed Hölder-type modulus bound with $t=1/n$ gives \begin{align*} \omega_r\left(f,\frac{1}{n}\right)\le M\left(\frac{1}{n}\right)^\alpha. \end{align*} Since $n>0$, \begin{align*} \left(\frac{1}{n}\right)^\alpha=n^{-\alpha}. \end{align*} Substituting this into the Jackson estimate yields \begin{align*} E_{n,\operatorname{trig}}(f)_\infty\le C_r M n^{-\alpha}. \end{align*} Thus a modulus of smoothness bounded by $t^\alpha$ at small scales gives trigonometric approximation error of order $n^{-\alpha}$ at degree $n$. [/example] The preceding example uses an abstract Hölder assumption. A second common source of modulus estimates is bounded variation of a derivative, which gives finite-difference control even when a classical second derivative need not exist. [example: Functions With One Derivative Of Bounded Variation] Let $V=\operatorname{Var}_{[0,2\pi]}(f')$, computed over one period, and interpret $f'$ periodically. For $h>0$ and admissible $x$, the second forward difference is \begin{align*} \Delta_h^2 f(x)=f(x+2h)-2f(x+h)+f(x). \end{align*} Grouping the two increments gives \begin{align*} \Delta_h^2 f(x)=\bigl(f(x+2h)-f(x+h)\bigr)-\bigl(f(x+h)-f(x)\bigr). \end{align*} By the [fundamental theorem of calculus](/theorems/632) on the two intervals, \begin{align*} f(x+2h)-f(x+h)=\int_0^h f'(x+h+s)\,ds. \end{align*} Similarly, \begin{align*} f(x+h)-f(x)=\int_0^h f'(x+s)\,ds. \end{align*} Subtracting these two integral formulas gives \begin{align*} \Delta_h^2 f(x)=\int_0^h \bigl(f'(x+h+s)-f'(x+s)\bigr)\,ds. \end{align*} For each $s\in[0,h]$, the absolute difference $|f'(x+h+s)-f'(x+s)|$ is at most $V$, because the total variation on one period dominates the variation along any arc between two points. Hence \begin{align*} |\Delta_h^2 f(x)|\le \int_0^h V\,ds. \end{align*} Since $V$ is constant in $s$, \begin{align*} \int_0^h V\,ds=hV. \end{align*} Therefore, for every $0<h\le t$, \begin{align*} |\Delta_h^2 f(x)|\le t\,\operatorname{Var}_{[0,2\pi]}(f'). \end{align*} Taking the supremum over $x$ and $0<h\le t$ gives \begin{align*} \omega_2(f,t)\le t\,\operatorname{Var}_{[0,2\pi]}(f'). \end{align*} Applying *Jackson Theorem For Periodic Functions* with $r=2$ gives \begin{align*} E_{n,\operatorname{trig}}(f)_\infty\le C_2\,\omega_2\left(f,\frac{1}{n}\right). \end{align*} Using the bound just proved with $t=1/n$ gives \begin{align*} \omega_2\left(f,\frac{1}{n}\right)\le \frac{1}{n}\operatorname{Var}_{[0,2\pi]}(f'). \end{align*} Substitution yields \begin{align*} E_{n,\operatorname{trig}}(f)_\infty\le C_2 n^{-1}\operatorname{Var}_{[0,2\pi]}(f'). \end{align*} Thus bounded variation of one derivative controls second differences at first order in $t$, which gives an $n^{-1}$ Jackson rate; a jump in $f'$ can contribute linearly to these increments, so one cannot expect a general $t^2$ estimate from bounded variation alone. [/example] ## Algebraic Polynomial Approximation On Intervals On a compact interval, translation is no longer compatible with the endpoints: $x+h$ may leave the interval. The direct theorem must account for boundary effects, and this is why algebraic polynomial approximation has a slightly different shape from the periodic theory. [definition: Best Algebraic Polynomial Approximation Error] Let $n\in\mathbb N$ and let $[a,b]\subset\mathbb R$ be a compact interval. The best uniform algebraic polynomial approximation error of degree at most $n$ is the functional $E_n(\cdot)_{[a,b]}:C([a,b])\to[0,\infty)$ defined by \begin{align*} E_n(f)_{[a,b]}=\inf\{\|f-p\|_\infty : p\in \mathbb P_n\}, \end{align*} where $\mathbb P_n$ denotes the real polynomials of degree at most $n$. [/definition] The notation suppresses the interval when it is fixed. Since affine changes of variables transfer $[a,b]$ to $[-1,1]$, the search for a general interval estimate motivates the following standard form on the model interval. [quotetheorem:6872] [citeproof:6872] This form is sufficient for many qualitative and rate estimates, but its hypotheses should be read literally. Continuity is required for a uniform statement: the step function $\mathbf{1}_{[0,1]}$ on $[-1,1]$ cannot be the uniform limit of algebraic polynomials, because every polynomial is continuous and a uniform limit of continuous functions is continuous. The degree restriction is the scale parameter: if the degree is not tied to $n$, the assertion stops measuring approximation at resolution $1/n$; for instance, the Chebyshev polynomial $T_{n+1}$ is already an algebraic polynomial but has best error $1$ from $\mathbb P_n$ in the alternation sense. The compact interval assumption is also essential for the unweighted uniform norm: on $\mathbb R$, no polynomial can uniformly approximate $e^{-x^2}$, since a nonconstant polynomial is unbounded and a constant polynomial misses the decay at infinity. The proof through $x=\cos\theta$ shows why endpoints require care: equal increments in $\theta$ correspond to shorter increments in $x$ near $\pm1$, so sharper versions use weighted moduli that reflect this endpoint geometry. The displayed inequality captures the core direct principle used in this course, while the weighted theory refines it when endpoint singularities dominate the approximation rate. [example: Best Approximation Rate For The Absolute Value] Let $f(x)=|x|$ on $[-1,1]$. From the computation of the first modulus for the absolute value, we have \begin{align*} \omega_1(f,t)=t\quad\text{for }0<t\le1. \end{align*} Apply the *[Algebraic Jackson Inequality](/theorems/6872)* with $r=1$. For every $n\ge1$, \begin{align*} E_n(f)_{[-1,1]}\le C_1\,\omega_1\left(f,\frac{1}{n}\right). \end{align*} Since $n\ge1$, we have $0<1/n\le1$, so the modulus formula gives \begin{align*} \omega_1\left(f,\frac{1}{n}\right)=\frac{1}{n}. \end{align*} Substituting this value into the Jackson estimate yields \begin{align*} E_n(f)_{[-1,1]}\le C_1\frac{1}{n}. \end{align*} Writing $C=C_1$, this is \begin{align*} E_n(f)_{[-1,1]}\le Cn^{-1}. \end{align*} A matching lower bound, proved by sharper methods not used in this direct estimate, gives $E_n(f)_{[-1,1]}\ge c n^{-1}$ for some constant $c>0$. Thus the corner at $0$ produces exactly first-order decay of the best algebraic polynomial approximation error. [/example] The absolute value example is governed by a first-order corner. Replacing the corner by a fractional power gives a family of test cases where the exponent in the modulus is visible in the approximation rate. [example: Best Approximation Rate For A Power Singularity] Let $f(x)=|x|^\alpha$ on $[-1,1]$, where $0<\alpha<1$. From the computation of the modulus of continuity for this power function, for every $0<t\le1$ we have \begin{align*} \omega_1(f,t)=\omega(f,t)=t^\alpha. \end{align*} Apply the *Algebraic Jackson Inequality* with $r=1$. For every $n\ge1$, \begin{align*} E_n(f)_{[-1,1]}\le C_1\,\omega_1\left(f,\frac{1}{n}\right). \end{align*} Since $n\ge1$, we have $0<1/n\le1$, so the modulus formula applies with $t=1/n$: \begin{align*} \omega_1\left(f,\frac{1}{n}\right)=\left(\frac{1}{n}\right)^\alpha. \end{align*} Because $n>0$, \begin{align*} \left(\frac{1}{n}\right)^\alpha=n^{-\alpha}. \end{align*} Substituting this value into the Jackson estimate gives \begin{align*} E_n(f)_{[-1,1]}\le C_1 n^{-\alpha}. \end{align*} Thus the fractional-power singularity at $0$ gives the Jackson upper rate $n^{-\alpha}$; away from $0$, the function is smoother than this estimate needs. [/example] ## Smoothing Kernels And Finite Difference Cancellation The Jackson inequalities above are direct theorems: smoothness implies approximation. Their proofs all use the same mechanism, so it is worth isolating the analytic idea behind them. Approximation is obtained by smoothing, and the error is controlled because smoothing kernels average away finite differences. [definition: Approximate Identity] A family $(K_n)_{n\ge1}$ with $K_n:\mathbb T\to\mathbb C$ integrable for every $n$ is an approximate identity if \begin{align*} \int_{\mathbb T}K_n(t)\,dt=1 \end{align*} for all $n$, and for every $\delta>0$, \begin{align*} \int_{\delta\le |t|\le \pi}|K_n(t)|\,dt\to0 \end{align*} as $n\to\infty$. [/definition] Approximate identities explain convergence of convolution means. To obtain a Jackson estimate rather than only convergence, the kernels must also concentrate quantitatively and interact with $r$-th differences through cancellation; this motivates the following abstract kernel principle. [quotetheorem:6873] [citeproof:6873] This principle is the analytic core of the chapter, and the hypotheses are not cosmetic. Normalisation alone preserves constants but gives no rate: the constant kernels $K_n(t)=(2\pi)^{-1}$ are normalised, yet $f*K_n$ is the mean of $f$ for every $n$, so a nonconstant continuous function such as $f(x)=\cos x$ is never approximated in norm. Approximate-identity concentration without the moment bound is still too weak for a Jackson estimate: a kernel whose mass sits at scale $1/\sqrt n$ converges to a point mass, but for $f(x)=|x|$ near the origin it produces an error of order $n^{-1/2}$ rather than the $n^{-1}$ scale predicted by $\omega_1(f,1/n)$. Cancellation is the separate ingredient that makes the order $r$ visible; convolution with a positive kernel may smooth $f$, but without a representation through $\Delta_t^r f$ it cannot discard local polynomial pieces of degree less than $r$. Thus the measure identity supplies the cancellation, and the moment estimate supplies the quantitative concentration. ## Local Polynomial Approximation And Whitney's Inequality Global approximation by one polynomial is built from local behaviour. Whitney's inequality makes this precise by comparing the best polynomial approximation on an interval with the modulus of smoothness on that same interval. [definition: Local Best Polynomial Error] Let $I\subset\mathbb R$ be a compact interval and let $r\in\mathbb N$. The local best error by polynomials of degree less than $r$ is the functional $E_{r-1}(\cdot;I):C(I)\to[0,\infty)$ defined by \begin{align*} E_{r-1}(f;I)=\inf\{\|f-p\|_{C(I)}:p\in\mathbb P_{r-1}\}. \end{align*} [/definition] This local error vanishes exactly when $f$ is already a polynomial of degree less than $r$ on $I$. The next theorem asks whether the converse holds quantitatively: if all $r$-th differences are small at the length scale of $I$, then $f$ should be close to some local polynomial. [quotetheorem:6874] [citeproof:6874] Whitney's inequality is local, while Jackson's inequalities are global. Compactness of $I$ is used so that the uniform norm and the best approximating polynomial are controlled on a bounded set; on unbounded intervals the same statement is not meaningful without weights or growth restrictions. Continuity is also part of the setup because the error is measured in $C(I)$, and the degree bound less than $r$ matches the fact that $r$-th finite differences vanish on exactly those lower-degree polynomial pieces. The theorem does not produce one global polynomial on a large interval; it produces local polynomial control at the scale of $I$. In applications one often partitions an interval, applies Whitney on each subinterval, and then pieces together the local approximants by splines or other structured approximating spaces. [example: Local Behaviour Near A Corner] Let $f(x)=|x|$ and let $I=[-h,h]$ with $0<h\le 1/2$. For constants, we compute \begin{align*} E_0(f;I)=\inf_{c\in\mathbb R}\sup_{x\in[-h,h]}||x|-c|. \end{align*} For any constant $c$, evaluating only at $x=0$ and $x=h$ gives \begin{align*} \sup_{x\in[-h,h]}||x|-c|\ge \max\{|c|,|h-c|\}. \end{align*} Since \begin{align*} |c|+|h-c|\ge |c+(h-c)|=h, \end{align*} at least one of $|c|$ and $|h-c|$ is at least $h/2$, so \begin{align*} \sup_{x\in[-h,h]}||x|-c|\ge \frac h2. \end{align*} The constant $c=h/2$ attains this bound, because $0\le |x|\le h$ on $[-h,h]$, and therefore \begin{align*} \left||x|-\frac h2\right|\le \frac h2. \end{align*} Thus \begin{align*} E_0(f;[-h,h])=\frac h2. \end{align*} Using the previously computed modulus of continuity of $|x|$ on $[-1,1]$, and using $0<2h\le1$, we have \begin{align*} \omega_1(f,|I|)=\omega_1(f,2h)=2h. \end{align*} Hence the local constant-approximation error has the same order as the first modulus at the interval length: \begin{align*} E_0(f;[-h,h])=\frac h2=\frac14\,\omega_1(f,|I|). \end{align*} This is the scale predicted by *[Whitney Inequality For Local Polynomial Approximation](/theorems/6874)*: the corner forces an order-$h$ local error when only constants are allowed. If an interval $J$ does not cross $0$, then either $J\subset[0,\infty)$ or $J\subset(-\infty,0]$. On $J\subset[0,\infty)$, $|x|=x$, so the degree-one polynomial $p(x)=x$ gives \begin{align*} \|f-p\|_{C(J)}=0. \end{align*} On $J\subset(-\infty,0]$, $|x|=-x$, so $p(x)=-x$ gives the same zero error. Thus the nonzero local approximation error comes from intervals that see the corner at $0$, not from the affine pieces away from it. [/example] The chapter's main message is that approximation rates are not guessed from differentiability labels alone. They are read from finite differences at the resolution $1/n$, and smoothing kernels convert those finite-difference bounds into explicit approximating polynomials. Chapter 4 asks for the converse direction: how much smoothness can be recovered from a known rate of best approximation? Chapter 3 translated smoothness into explicit upper bounds for best approximation error. The converse problem is the next natural step: if those errors decay quickly, what must the function look like, and how much regularity can be recovered from the rate alone? # 4. Inverse Theorems and Bernstein Inequalities The Jackson and Whitney direct theorems from Chapter 3 gave estimates of the form: smooth functions are well approximated by polynomials or trigonometric polynomials. This chapter asks for converse information. If the errors of best approximation decay at a specified rate, what smoothness of the original function can be recovered? The answer depends on derivative inequalities for finite-dimensional approximation spaces, and it also reveals a limitation of many positive linear methods: after a certain order, faster convergence forces the target function to belong to a small exceptional class. ## Derivative Bounds for Algebraic Polynomials A polynomial of degree at most $n$ can oscillate rapidly, but its derivative is still controlled by its size and degree. The guiding question is: if $p$ is small in the uniform norm on $[-1,1]$, how large can $p'$ be on the same interval? This is the first place where the geometry of the interval matters, because endpoints allow faster growth than interior points. [definition: Algebraic Polynomial Space] For $n \in \mathbb N$, let $\mathcal P_n$ denote the [vector space](/page/Vector%20Space) of real or complex algebraic polynomials of degree at most $n$ on $[-1,1]$. [/definition] The norm used throughout this section is $\|p\|_\infty = \sup_{x \in [-1,1]} |p(x)|$. Derivative inequalities compare $\|p'\|_\infty$ with $\|p\|_\infty$ inside this finite-dimensional space. The first problem is to find the largest possible growth factor that works for every polynomial in $\mathcal P_n$. Since inverse theorems later differentiate approximants whose only known control is uniform error, a degree-explicit derivative estimate is needed before any smoothness can be recovered. [quotetheorem:6875] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. This theorem is much stronger near the endpoints than what local intuition from trigonometric polynomials would predict. The factor $n^2$ is the price of compressing $n$ oscillations into an interval with endpoints. The hypothesis that the degree is at most $n$ is essential: without a degree bound, the functions $p_m(x)=x^m$ have bounded uniform norm on $[-1,1]$ but derivatives $p_m'(1)=m$ with no fixed bound. The theorem also does not assert that every bounded polynomial has large derivative; constants have derivative zero, so Markov's inequality is a worst-case estimate rather than a typical-size statement. [example: Chebyshev Derivative Growth] Let $T_n$ be defined by $T_n(\cos\theta)=\cos(n\theta)$. Since every $x\in[-1,1]$ can be written as $x=\cos\theta$ for some $\theta\in[0,\pi]$, we have \begin{align*} |T_n(x)|=|\cos(n\theta)|\le 1. \end{align*} At $x=1$, take $\theta=0$, so $T_n(1)=\cos 0=1$. Hence $\|T_n\|_\infty=1$. For $0<\theta<\pi$, differentiate $T_n(\cos\theta)=\cos(n\theta)$ with respect to $\theta$. The chain rule gives \begin{align*} -\sin\theta\, T_n'(\cos\theta)=-n\sin(n\theta). \end{align*} Since $\sin\theta\ne0$ on $(0,\pi)$, \begin{align*} T_n'(\cos\theta)=n\frac{\sin(n\theta)}{\sin\theta}. \end{align*} Letting $\theta\to0^+$ and using continuity of the polynomial derivative, \begin{align*} T_n'(1)=n\lim_{\theta\to0^+}\frac{\sin(n\theta)}{\sin\theta}. \end{align*} The limit is \begin{align*} \lim_{\theta\to0^+}\frac{\sin(n\theta)}{\sin\theta}=n\lim_{\theta\to0^+}\frac{\sin(n\theta)}{n\theta}\cdot\frac{\theta}{\sin\theta}=n. \end{align*} Therefore \begin{align*} T_n'(1)=n^2. \end{align*} Thus the Chebyshev polynomial remains bounded by $1$ on the whole interval, but its endpoint derivative grows exactly like $n^2$. [/example] The example shows that the endpoint exponent in Markov's inequality cannot be improved. Away from the endpoints, however, Chebyshev oscillation behaves more like periodic oscillation under the change of variables $x=\cos\theta$. This motivates a local inequality that keeps the degree factor $n$ but records endpoint degeneration through a weight. [quotetheorem:6876] [citeproof:6876] The two inequalities are compatible: near $x=\pm1$, the factor $\sqrt{1-x^2}$ degenerates, and Markov's $n^2$ estimate supplies the endpoint control. The weight cannot be removed: the Chebyshev example has $|T_n'(1)|=n^2$, so an unweighted inequality $|p'(x)|\le Cn\|p\|_\infty$ up to the endpoints would fail. At the same time, Bernstein's inequality does not control endpoint derivatives; it is an interior estimate whose degenerating weight records exactly where the interval geometry becomes singular. This mixture of global and local derivative control is the mechanism behind inverse theorems. ## Trigonometric Bernstein Inequalities Periodic approximation removes endpoints, so the natural scale for derivative growth is degree $n$ rather than $n^2$. The problem now is to control derivatives of trigonometric polynomials in terms of the uniform norm of the polynomial itself. [definition: Trigonometric Polynomial] A trigonometric polynomial of degree at most $n$ is a function $t:\mathbb R\to\mathbb C$ of the form \begin{align*} t(x)=\sum_{k=-n}^{n} c_k e^{ikx}, \end{align*} where $c_k\in\mathbb C$. [/definition] Such a function is $2\pi$-periodic, and its natural uniform norm is $\|t\|_\infty=\sup_{x\in\mathbb R}|t(x)|$. The derivative multiplies the $k$th Fourier mode by $ik$, so the highest frequency suggests a factor of $n$. To make this heuristic usable in inverse arguments, we need an estimate depending only on the uniform norm of $t$, not on its Fourier coefficients. [quotetheorem:6877] [citeproof:6877] This inequality is the periodic analogue of the algebraic Bernstein inequality. It gives the correct derivative scale for Fourier sums and is the main tool for turning approximation-rate estimates into difference estimates. The degree hypothesis is indispensable: the functions $e^{imx}$ all have uniform norm $1$, but their derivative norms are $m$, so no bound independent of frequency is possible. The theorem also does not estimate derivatives of arbitrary periodic functions from their uniform norm; it is a finite-dimensional estimate for functions whose Fourier spectrum is cut off at frequency $n$. [example: Single Frequency Extremiser] For $t(x)=e^{inx}$ with $n\ge 1$, Euler's identity gives \begin{align*} e^{inx}=\cos(nx)+i\sin(nx). \end{align*} Hence, for every $x\in\mathbb R$, \begin{align*} |t(x)|^2=\cos^2(nx)+\sin^2(nx)=1. \end{align*} Therefore $|t(x)|=1$ for all $x$, so \begin{align*} \|t\|_\infty=\sup_{x\in\mathbb R}|t(x)|=1. \end{align*} Differentiating the single Fourier mode gives \begin{align*} t'(x)=\frac{d}{dx}e^{inx}=in e^{inx}. \end{align*} Thus, for every $x\in\mathbb R$, \begin{align*} |t'(x)|=|in|\,|e^{inx}|=n\cdot 1=n. \end{align*} Taking the supremum over $x$ gives \begin{align*} \|t'\|_\infty=n. \end{align*} Since $\|t'\|_\infty=n\|t\|_\infty$, the constant in *[Bernstein Inequality for Trigonometric Polynomials](/theorems/6877)* is attained by a single Fourier mode. The example shows that the highest frequency present, not the number of nonzero Fourier coefficients, determines the derivative scale. [/example] Higher derivatives follow by applying the first derivative estimate repeatedly. For a trigonometric polynomial $t$ of degree at most $n$, this gives $\|t^{(r)}\|_\infty\le n^r\|t\|_\infty$, which is the periodic model for many inverse estimates. ## Recovering Smoothness from Best Approximation Direct approximation theorems usually say that a modulus of smoothness controls the approximation error. The inverse problem starts with the error sequence and asks how regular the function must have been. The central idea is to compare best approximants at dyadic degrees and use [Bernstein inequalities](/theorems/3188) to prove convergence of their derivatives or differences. [definition: Best Polynomial Approximation Error] For $n\in\mathbb N$, the best uniform algebraic polynomial approximation error is the functional \begin{align*} E_n:C[-1,1]\to[0,\infty) \end{align*} defined by \begin{align*} E_n(f) := \inf_{p\in\mathcal P_n}\|f-p\|_\infty. \end{align*} [/definition] The sequence $E_n(f)$ is non-increasing, and its decay records how efficiently $f$ can be represented by low-degree polynomials. To translate this decay into regularity, we need a vocabulary for measuring oscillation of $f$ at small spatial scales. The simplest scale in this chapter is the Lipschitz scale. [definition: Lipschitz Class] For $0<\alpha\le 1$, a function $f\in C[-1,1]$ belongs to $\operatorname{Lip}(\alpha)$ if there exists $C>0$ such that \begin{align*} |f(x)-f(y)|\le C|x-y|^\alpha \end{align*} for all $x,y\in[-1,1]$. [/definition] A Lipschitz condition is exactly a bound on increments, while the approximation hypothesis is a bound on polynomial tails. The next theorem connects these by decomposing $f$ into dyadic polynomial increments and estimating each increment at a spatial scale $h$. This is the inverse counterpart to Jackson-type direct approximation. [quotetheorem:6878] [citeproof:6878] This theorem expresses the inverse philosophy. Approximation at degree $2^m$ isolates behaviour at length scale roughly $2^{-m}$; in the stated algebraic-polynomial form, a tail of order $2^{-m\alpha}$ forces the weaker conclusion $f\in\operatorname{Lip}(\alpha/2)$. The loss in exponent is part of this particular inverse statement, so it should not be confused with sharper direct-inverse equivalences for moduli of smoothness. The assumption that the estimate holds uniformly for all degrees is important: finite numerical evidence at a bounded list of degrees cannot rule out later small-scale oscillations. The restriction $0<\alpha<1$ is also part of the statement; at endpoint exponents, inverse results require more delicate formulations rather than this simple Lipschitz conclusion. [example: Detecting Lipsch Regularity] Suppose $f\in C[-1,1]$ and that the approximation estimate \begin{align*} E_n(f)\le 5n^{-1/2} \end{align*} is known for every $n\in\mathbb N$, not only for the degrees checked numerically. This has the form required by *Bernstein Inverse Theorem* with $\alpha=1/2$ and $C=5$, since \begin{align*} 0<\frac12<1 \end{align*} and \begin{align*} 5n^{-1/2}=5n^{-\alpha}. \end{align*} Therefore $f\in\operatorname{Lip}(1/4)$, meaning that there is a constant $C'>0$ such that \begin{align*} |f(x)-f(y)|\le C'|x-y|^{1/4} \end{align*} for all $x,y\in[-1,1]$. This conclusion is stronger than the numerical inequality by itself: the analytically proved bound for all degrees gives a uniform Hölder estimate on the function. The theorem does not say that this estimate is sharp for every individual function; it gives a guaranteed regularity floor from the approximation rate. In particular, a jump discontinuity is excluded because every $\operatorname{Lip}(1/4)$ function is continuous: if $x_m\to x$, then \begin{align*} |f(x_m)-f(x)|\le C'|x_m-x|^{1/4}\to0. \end{align*} [/example] The example illustrates how an observed rate can be read as a regularity certificate once the estimate is proved for all degrees. Direct and inverse theorems are most useful when they are stated with matching smoothness scales and precise constants; in those sharper forms, the decay of $E_n(f)$ becomes a quantitative measurement of regularity. A particular sequence of positive linear operators may converge at the Jackson rate for a whole class of functions but fail to improve when the function is smoother. This gap between best possible approximation and the behaviour of a fixed method leads to saturation: the leading error term of the method may vanish only on a small reproduced class. ## Saturation of Linear Approximation Processes Best approximation can improve as the target function becomes smoother, but a fixed linear approximation process may have a built-in ceiling. The question is: when does a sequence of operators stop reflecting additional smoothness, and what functions can converge faster than the natural rate? [definition: Saturation Class] Let $X$ be a normed space, and let $L_n:X\to X$ be linear operators such that $L_n f\to f$ in $X$ for each $f$ in a subspace of $X$. A saturation pair for $(L_n)$ consists of a positive sequence $(\varphi_n)$ with $\varphi_n\to0$ and a subspace $\mathcal S\subset X$ such that the implication \begin{align*} \|L_n f-f\|_X=o(\varphi_n) \end{align*} forces $f\in\mathcal S$. [/definition] The accompanying class $\mathcal S$ is called the saturation class at order $(\varphi_n)$. In applications, $\mathcal S$ is usually identified by proving a converse statement: every function in $\mathcal S$ has error smaller than the generic order, and no function outside $\mathcal S$ does. This separates the usual order of convergence from the exceptional set where the leading error term vanishes. To identify such a set, we need a specific operator sequence whose moments can be computed. This motivates returning to the Bernstein operators from the constructive proof of Weierstrass, because their positivity and binomial formula make the leading error term accessible. [definition: Bernstein Operator] For $n\in\mathbb N$, the $n$th Bernstein operator is the linear map \begin{align*} B_n:C[0,1]\to \mathcal P_n \end{align*} defined by \begin{align*} (B_n f)(x)=\sum_{k=0}^{n} f\left(\frac{k}{n}\right) \binom{n}{k} x^k(1-x)^{n-k},\qquad x\in[0,1]. \end{align*} [/definition] The probabilistic interpretation is $B_n f(x)=\mathbb E[f(S_n/n)]$, where $S_n\sim\operatorname{Bin}(n,x)$. It explains convergence, but not the sharper question of what the first error term is after the averaging has already concentrated near $x$. Since the averaging scale is $n^{-1/2}$, a first-order Taylor term should cancel and the first non-vanishing correction should be governed by the variance $x(1-x)/n$. The issue is to turn this heuristic into a uniform asymptotic statement under a regularity hypothesis strong enough to control the Taylor remainder. [quotetheorem:6880] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. Voronovskaya's formula identifies the leading error term for smooth functions. The $C^2$ hypothesis is not cosmetic: the Taylor expansion with a uniform second-order remainder is the mechanism that produces the limit. For merely continuous functions, the theorem by itself gives no pointwise asymptotic formula, because a second derivative may not exist and the error can reflect nonsmooth local behaviour rather than a quadratic Taylor term. The uniform conclusion is stronger than the pointwise conclusion: it controls the whole interval at once and is the form needed for saturation in the uniform norm. The formula also does not claim that the Bernstein error is always of order exactly $1/n$; if $f$ is affine, then the variance term is multiplied by $f''=0$ and the error vanishes identically. If a Bernstein error is smaller than $1/n$, then that leading term must vanish. This creates the saturation question: does vanishing of the leading term force the function into the finite-dimensional space reproduced exactly by the operators? [quotetheorem:6881] [citeproof:6881] This theorem says that Bernstein operators are saturated at order $1/n$: extra smoothness alone does not force $B_n f$ to converge faster. The affine conclusion is sharp because affine functions are exactly the functions reproduced by every $B_n$. The theorem does not say that non-affine functions cannot have small error at selected values of $n$ or at selected points $x$; it rules out a uniform $o(n^{-1})$ rate along the whole sequence. To obtain higher-order convergence for non-affine functions, the approximation process itself must be changed, for example by modifying moments, using higher-order operators, or switching to best approximation. [example: Quadratic Function and the Saturation Rate] Let $f(x)=x^2$ and fix $x\in[0,1]$. Using the Bernstein operator formula, \begin{align*} (B_n f)(x)=\sum_{k=0}^{n}\left(\frac{k}{n}\right)^2\binom{n}{k}x^k(1-x)^{n-k}. \end{align*} If $S_n\sim\operatorname{Bin}(n,x)$, this sum is $\mathbb E[(S_n/n)^2]$. Since $\mathbb E[S_n]=nx$ and $\operatorname{Var}(S_n)=nx(1-x)$, \begin{align*} \mathbb E[S_n^2]=\operatorname{Var}(S_n)+(\mathbb E[S_n])^2=nx(1-x)+n^2x^2. \end{align*} Dividing by $n^2$ gives \begin{align*} (B_n f)(x)=\mathbb E\left[\left(\frac{S_n}{n}\right)^2\right]=\frac{nx(1-x)+n^2x^2}{n^2}. \end{align*} Thus \begin{align*} (B_n f)(x)=x^2+\frac{x(1-x)}{n}. \end{align*} Therefore \begin{align*} (B_n f)(x)-f(x)=\frac{x(1-x)}{n}. \end{align*} For $x\in[0,1]$, \begin{align*} x(1-x)=\frac14-\left(x-\frac12\right)^2. \end{align*} Hence $0\le x(1-x)\le 1/4$, with equality exactly at $x=1/2$. Taking the supremum over $[0,1]$ gives \begin{align*} \|B_n f-f\|_\infty=\frac{1}{4n}. \end{align*} The quadratic function is smooth, but its Bernstein error is exactly a nonzero constant times $n^{-1}$, so additional smoothness does not improve the natural saturation rate of the Bernstein operators. [/example] Saturation results refine the direct-inverse picture. Best approximation adapts to higher smoothness through increasing approximation spaces, while a fixed operator sequence may reproduce only a finite-dimensional class exactly and therefore cannot exploit all available smoothness. The next chapter returns to the uniform norm from Chapter 1 and studies how best approximants are recognised by alternation rather than by smoothness estimates. Chapter 4 showed that approximation rates can reveal hidden smoothness through inverse theorems and Bernstein-type inequalities. The focus now returns to the uniform norm on intervals, but with a sharper geometric goal: characterise best approximants by alternation instead of by smoothness estimates. # 5. Chebyshev Systems and Minimax Approximation This chapter turns from approximation in general normed spaces to the special geometry of the uniform norm on an interval. The guiding question is: how can a finite-dimensional family be recognised as giving the smallest possible worst-case error? The answer is governed by alternation: a best uniform approximant is forced to balance its positive and negative errors at enough points, and this balancing principle leads directly to computable algorithms. Chapters 1 and 2 supplied existence, uniqueness in Haar spaces, and constructive polynomial approximation. Here we use those tools in the opposite direction: rather than asking whether approximation is possible, we ask how the optimal approximant announces itself and how it can be found numerically. ## Chebyshev Polynomials and Alternating Error The first problem is to understand why the uniform norm behaves differently from Hilbert-space norms. In $L^2$, best approximation is detected by orthogonality. In $C[a,b]$ with the norm \begin{align*} \|f\|_\infty = \sup_{x \in [a,b]} |f(x)|, \end{align*} the geometry is polyhedral in spirit: the error is controlled by the points where its absolute value is largest. Before stating the minimax principle, we isolate the polynomial family that models perfect alternation on $[-1,1]$. [definition: Chebyshev Polynomial] For $n \in \mathbb N \cup \{0\}$, the $n$-th Chebyshev polynomial of the first kind is the polynomial function \begin{align*} T_n:\mathbb R\to\mathbb R \end{align*} whose representing polynomial lies in $\mathbb R[x]$ and is defined on $[-1,1]$ by \begin{align*} T_n(\cos \theta) = \cos(n\theta), \qquad \theta \in \mathbb R. \end{align*} [/definition] The definition gives a polynomial because the recurrence \begin{align*} T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x), \qquad T_0(x)=1, \quad T_1(x)=x, \end{align*} produces the same functions on $[-1,1]$ and hence as polynomials. The extremal feature is that $T_n$ takes the values $1,-1,1,-1,\dots$ at $n+1$ ordered points in $[-1,1]$. [example: Alternation of Chebyshev Polynomials] For $n \ge 1$, define \begin{align*} x_j=\cos\left(\frac{j\pi}{n}\right), \qquad j=0,\dots,n. \end{align*} Using the defining identity $T_n(\cos\theta)=\cos(n\theta)$ with $\theta=j\pi/n$, we get \begin{align*} T_n(x_j)=T_n\left(\cos\left(\frac{j\pi}{n}\right)\right)=\cos\left(n\cdot \frac{j\pi}{n}\right)=\cos(j\pi)=(-1)^j. \end{align*} Since $j\pi/n$ runs from $0$ to $\pi$ as $j$ runs from $0$ to $n$, and $\cos\theta$ is decreasing on $[0,\pi]$, the points satisfy \begin{align*} x_0=1>x_1>\cdots>x_n=-1. \end{align*} Reversing the order gives an increasing sequence \begin{align*} y_i=x_{n-i}, \qquad i=0,\dots,n, \end{align*} from $-1$ to $1$, and on this sequence \begin{align*} T_n(y_i)=T_n(x_{n-i})=(-1)^{n-i}. \end{align*} Consecutive signs are opposite because \begin{align*} (-1)^{n-(i+1)}=-(-1)^{n-i}. \end{align*} Thus $T_n$ reaches alternating values of equal magnitude at $n+1$ ordered points in $[-1,1]$, which is the model pattern for a best uniform approximation error: the largest positive and negative deviations are equally large and interlaced. [/example] The example gives an oscillating polynomial with small uniform norm after normalisation. The sharper question is whether any monic polynomial can do better, since a lower uniform norm would mean that oscillation alone is not enough to certify extremality. The next result says that the Chebyshev oscillation pattern is already optimal. [quotetheorem:476] [citeproof:476] The central idea is that too many alternating sign constraints force too many zeros. The hypotheses are doing real work. The restriction $n\ge 1$ is needed because the normalising formula gives $2T_0$ when $n=0$, whereas the only monic constant polynomial is $1$. The monic condition is also essential: without fixing the leading coefficient, the zero polynomial would minimise the norm among polynomials of degree at most $n$. Finally, the theorem is not a statement about best approximation to an arbitrary function; it is an extremal normalisation result that supplies the model oscillation pattern used later for minimax errors. The same idea works for approximation from spaces more general than polynomials, provided that the space has the right zero-counting property. ## Haar Spaces and Best Uniform Approximation As in Chapter 1, the next question is which finite-dimensional spaces behave like polynomials for minimax approximation. We need a structural hypothesis ensuring that a nonzero approximant cannot vanish at too many points. [definition: Haar Space] Let $X$ be an $n$-dimensional subspace of $C[a,b]$. The space $X$ is a Haar space on $[a,b]$ if every nonzero $g \in X$ has at most $n-1$ distinct zeros in $[a,b]$. [/definition] Polynomial spaces $\mathcal P_{n-1}$ are the basic examples. The definition also includes many non-polynomial systems, such as spans of suitable exponentials or rational basis functions on intervals where they have no singularities. [example: Polynomial Haar Space] The space $\mathcal P_{n-1}$ has basis \begin{align*} 1,x,x^2,\dots,x^{n-1}, \end{align*} so it is $n$-dimensional. We verify the Haar condition on an interval $[a,b]$: let $p\in\mathcal P_{n-1}$ be nonzero, and suppose its distinct real zeros in $[a,b]$ are $z_1,\dots,z_r$. Applying the [factor theorem](/theorems/3235) successively gives \begin{align*} p(x)=(x-z_1)(x-z_2)\cdots(x-z_r)q(x) \end{align*} for some polynomial $q\in\mathbb R[x]$. Taking degrees gives \begin{align*} \deg p=r+\deg q. \end{align*} Since $q$ is nonzero, $\deg q\ge 0$, and since $p\in\mathcal P_{n-1}$, \begin{align*} r\le \deg p\le n-1. \end{align*} Thus every nonzero element of $\mathcal P_{n-1}$ has at most $n-1$ distinct zeros on $[a,b]$, so $\mathcal P_{n-1}$ is a Haar space. This is exactly the zero-counting mechanism behind polynomial equioscillation: too many forced sign changes would force too many zeros for a nonzero polynomial of degree at most $n-1$. [/example] The polynomial example identifies the zero-counting property, but in applications the approximating functions often come from a chosen basis rather than from an abstract subspace. To test such a basis, we need an interpolation criterion: if values at $n$ points always determine the approximant, then hidden excessive zeros cannot occur. [definition: Chebyshev System] A family $u_1,\dots,u_n \in C[a,b]$ is a Chebyshev system on $[a,b]$ if, for every choice of distinct points $x_1,\dots,x_n \in [a,b]$, the interpolation matrix \begin{align*} \big(u_j(x_i)\big)_{i,j=1}^n \end{align*} is nonsingular. [/definition] Thus the span of a Chebyshev system is a Haar space, and conversely a Haar space admits bases with this nonsingularity property. The determinant condition says that interpolation at $n$ distinct nodes determines a unique element of the space. This prepares the main certification theorem: once interpolation and zero-counting are controlled, best approximation should be recognisable from finitely many extremal errors. [quotetheorem:6882] [citeproof:6882] This theorem is the equioscillation principle in its Haar-space form. It replaces Hilbert-space orthogonality by a finite certificate: $n+1$ alternating maximal errors prove global optimality. The Haar hypothesis is what makes the certificate decisive; without it, a nonzero direction in $X$ may vanish at the proposed alternating points and still change the error elsewhere. For example, if $X=\operatorname{span}\{x^2\}$ on $[-1,1]$, then the nonzero element $x^2$ has the zero $0$ without behaving like a one-dimensional Haar space on sign changes, and interpolation at a single node does not control the direction globally. [Finite dimensionality](/theorems/1534) is also part of the theorem's scope: in an infinite-dimensional subspace of $C[a,b]$, no fixed number of active points can certify optimality, because perturbations can be arranged to vanish on any prescribed finite set while changing the function elsewhere. Thus alternation is not merely a visual pattern. It is a certificate only in spaces where zeros control functions strongly enough, and the theorem does not say that arbitrary finite-dimensional spaces or infinite-dimensional approximation classes admit such a finite test. [example: Minimax Linear Approximation to the Exponential] Let $X=\mathcal P_1$ and write the candidate line as $p(x)=\alpha x+\beta$. We look for an error \begin{align*} r(x)=e^x-p(x)=e^x-\alpha x-\beta \end{align*} that alternates with signs $+,-,+$ at $0$, an interior point $c$, and $1$. Since $r''(x)=e^x>0$, any interior extremum of $r$ is a minimum, so the middle alternating point must satisfy $r'(c)=0$. The alternating equations are therefore \begin{align*} 1-\beta=E, \end{align*} \begin{align*} e^c-\alpha c-\beta=-E, \end{align*} \begin{align*} e-\alpha-\beta=E, \end{align*} \begin{align*} e^c-\alpha=0. \end{align*} Subtracting the first endpoint equation from the second endpoint equation gives \begin{align*} (e-\alpha-\beta)-(1-\beta)=E-E. \end{align*} Hence \begin{align*} e-\alpha-1=0, \end{align*} so \begin{align*} \alpha=e-1. \end{align*} The derivative equation gives $e^c=\alpha$, and therefore \begin{align*} e^c=e-1. \end{align*} Taking logarithms gives \begin{align*} c=\log(e-1). \end{align*} Because $1<e-1<e$, this point satisfies $0<c<1$. It remains to solve for $\beta$ and $E$. From the first equation, \begin{align*} E=1-\beta. \end{align*} Substituting this into the middle equation gives \begin{align*} e^c-\alpha c-\beta=-(1-\beta). \end{align*} Thus \begin{align*} e^c-\alpha c-\beta=-1+\beta. \end{align*} Moving the $\beta$ terms to the right and the constant term to the left gives \begin{align*} e^c-\alpha c+1=2\beta. \end{align*} Therefore \begin{align*} \beta=\frac{e^c-\alpha c+1}{2}. \end{align*} Using $\alpha=e-1$ and $e^c=e-1$, \begin{align*} \beta=\frac{(e-1)-(e-1)c+1}{2}. \end{align*} Since $c=\log(e-1)$, \begin{align*} \beta=\frac{(e-1)(1-\log(e-1))+1}{2}. \end{align*} The corresponding line is \begin{align*} p(x)=(e-1)x+\frac{(e-1)(1-\log(e-1))+1}{2}. \end{align*} Its endpoint error level is \begin{align*} E=1-\beta. \end{align*} Substituting the value of $\beta$ gives \begin{align*} E=1-\frac{(e-1)(1-\log(e-1))+1}{2}. \end{align*} Combining the terms over a common denominator gives \begin{align*} E=\frac{2-(e-1)(1-\log(e-1))-1}{2}. \end{align*} Expanding the numerator gives \begin{align*} E=\frac{2-e+(e-1)\log(e-1)}{2}. \end{align*} The error therefore takes the values $E,-E,E$ at $0,\log(e-1),1$, so by the Haar-space alternation criterion this line is the minimax linear approximant to $e^x$ on $[0,1]$. [/example] The example shows how alternation converts a global optimisation problem into nonlinear equations for extremal points and coefficients. The same reduction is the basis for the exchange algorithm in the next section. ## De la Vallee Poussin Theorem and Near-Best Approximants In computation, an approximant rarely arrives already equipped with the exact alternating set required by the equioscillation theorem. The practical question is how much can be certified from an imperfect pattern of alternating errors. [quotetheorem:6883] [citeproof:6883] The theorem turns partial alternation into a lower bound for the unknown optimal error. Each hypothesis prevents a different false certificate. The Haar condition rules out an improvement direction with too many zeros; without it, $q-p$ could change sign at all sampled points and still belong to $X$. For a concrete failure, take $X=\operatorname{span}\{x^2-1\}$ on $[-1,1]$, which is not Haar because its nonzero generator vanishes at both endpoints. The function $h(x)=x^2-1$ has the sign pattern $0,-,0$ at $-1,0,1$ and can be scaled to disturb sampled alternating inequalities without being controlled by a one-zero rule. The requirement $m>0$ excludes vacuous alternation, since inequalities with $m=0$ would be satisfied by many errors and would give no positive lower bound. The use of $n+1$ points is also sharp in this zero-counting argument: with only $n$ alternating inequalities, the difference $q-p$ would be forced to have only $n-1$ intervening zeros, which a nonzero element of an $n$-dimensional Haar space may have. If $p$ also has known error $M=\|f-p\|_\infty$, then \begin{align*} m \le E^* \le M. \end{align*} This two-sided estimate motivates a separate definition for approximants whose error is controlled up to a fixed multiplicative factor, because Remez computations often stop with such a certificate rather than an exact alternation proof. [definition: Near-Best Uniform Approximant] Let $X \subset C[a,b]$ and $f \in C[a,b]$. An element $p \in X$ is a $\rho$-near-best uniform approximant, with $\rho \ge 1$, if \begin{align*} \|f-p\|_\infty \le \rho \inf_{q \in X}\|f-q\|_\infty. \end{align*} [/definition] The de la Vallee Poussin theorem is useful because it converts sampled alternation into a near-best certificate. If \begin{align*} \frac{M}{m} \le \rho, \end{align*} then $p$ is $\rho$-near-best. [example: A Near-Best Certificate from Sampled Error] Suppose $X=\mathcal P_2$ on $[-1,1]$, so $\dim X=3$, and suppose the sampled errors at four increasing points are \begin{align*} 0.101,\ -0.099,\ 0.100,\ -0.098. \end{align*} These signs alternate, and their absolute values are \begin{align*} |0.101|=0.101,\quad |-0.099|=0.099,\quad |0.100|=0.100,\quad |-0.098|=0.098. \end{align*} The smallest sampled alternating magnitude is therefore \begin{align*} m=\min\{0.101,0.099,0.100,0.098\}=0.098. \end{align*} By the *De La Vallee Poussin Theorem*, the optimal error \begin{align*} E^*=\inf_{q\in\mathcal P_2}\|f-q\|_\infty \end{align*} satisfies \begin{align*} E^*\ge 0.098. \end{align*} If a separate dense-grid check gives \begin{align*} \|f-p\|_\infty\le 0.102, \end{align*} then the near-best ratio is bounded by \begin{align*} \frac{\|f-p\|_\infty}{E^*}\le \frac{0.102}{0.098}. \end{align*} Writing both decimals with denominator $1000$ gives \begin{align*} \frac{0.102}{0.098}=\frac{102/1000}{98/1000}=\frac{102}{98}=\frac{51}{49}. \end{align*} Thus $p$ is certified to be at worst $\frac{51}{49}$-near-best, approximately $1.041$-near-best. The grid supplies an upper bound on the error of this particular $p$, while the sampled alternation supplies a lower bound on the unknown optimal error. [/example] Near-best bounds are central in numerical approximation because floating-point arithmetic, incomplete maximisation of the error curve, and ill-conditioned bases often make exact certificates unrealistic. ## Remez Exchange Algorithm The equioscillation theorem suggests an algorithm: guess the alternating points, solve for the approximant that alternates there, then replace the guessed points by better extrema of the resulting error. This is the Remez exchange algorithm. [definition: Exchange Set] Let $X=\operatorname{span}\{u_1,\dots,u_n\} \subset C[a,b]$. An exchange set is an ordered $(n+1)$-tuple \begin{align*} a \le x_0 < x_1 < \dots < x_n \le b \end{align*} [/definition] Given an exchange set and a sign $\sigma\in\{-1,1\}$, the Remez linear step asks for an approximant $p\in X$ and a scalar $E$ satisfying the alternating interpolation conditions \begin{align*} f(x_i)-p(x_i)=\sigma(-1)^i E, \qquad i=0,\dots,n. \end{align*} For a fixed exchange set, the unknowns are the $n$ coefficients of $p$ and the scalar error level $E$. The resulting system has exactly as many equations as unknowns. To justify the algorithm, we need to know that every proposed exchange set produces a unique candidate rather than a singular or ambiguous linear problem. [quotetheorem:6884] [citeproof:6884] This solvability result is the algebraic hinge of Remez exchange. It converts a proposed alternation pattern into a concrete candidate whose error curve can be tested against the equioscillation theorem and the de la Vallee Poussin lower bound. Thus the theorem supplies the exact linear solve inside each iteration, while the next stage of the algorithm supplies the nonlinear information: where the actual extrema of $f-p$ occur and whether they exceed the provisional level $|E|$. The result has a precise scope. It does not prove convergence of the Remez algorithm, justify a particular exchange rule, or guarantee that the next extremum search has been performed correctly. If the basis is not a Chebyshev system, the interpolation matrix can be singular for some distinct nodes, and the alternating equations may have no solution or many solutions. If two exchange points collide, the system repeats an evaluation constraint instead of imposing $n+1$ independent conditions, so the error level and coefficients need not be determined. These failures are numerical as well as theoretical: near-colliding points can make the system badly conditioned even when exact uniqueness remains true. Having solved on one exchange set, the algorithm searches for extrema of the actual error $e=f-p$. If the largest absolute error is larger than $|E|$, one of the old points is exchanged for a new extremal point while preserving alternation order. [explanation: Remez Iteration] A Remez step begins with an exchange set $x_0<\dots<x_n$. Solving the alternating system gives $p$ and a provisional error level $E$. The error curve $e=f-p$ is then searched for local extrema, including endpoints. If the largest extremum has magnitude $|E|$ and the signs alternate on $n+1$ extrema, the equioscillation theorem certifies optimality. If a larger extremum is found, it is inserted into the ordered list and one neighbouring point is removed so that the sign pattern alternates. This replacement raises the provisional lower certificate in typical exact arithmetic settings and moves the exchange set toward the true alternation set. In practice, the difficult part is not the linear algebra alone but the reliable location of extrema of the error function. [/explanation] The preceding description is abstract, so it is helpful to see what the exchange step asks us to compute. The linear approximation problem for $e^x$ is small enough that the first exchange system and the extremum update can be written down explicitly. [example: First Remez Step for Linear Approximation] For $f(x)=e^x$ on $[0,1]$ and $X=\mathcal P_1$, write the provisional line as $p(x)=\alpha x+\beta$ and the error as \begin{align*} r(x)=e^x-\alpha x-\beta. \end{align*} Starting from the exchange set $0,1/2,1$ with signs $+,-,+$, the alternating equations are \begin{align*} 1-\beta=E, \end{align*} \begin{align*} e^{1/2}-\frac{\alpha}{2}-\beta=-E, \end{align*} \begin{align*} e-\alpha-\beta=E. \end{align*} Subtracting the first endpoint equation from the last endpoint equation gives \begin{align*} (e-\alpha-\beta)-(1-\beta)=E-E. \end{align*} Thus \begin{align*} e-\alpha-1=0, \end{align*} so \begin{align*} \alpha=e-1. \end{align*} From the first equation, \begin{align*} E=1-\beta. \end{align*} Substituting this into the middle equation gives \begin{align*} e^{1/2}-\frac{\alpha}{2}-\beta=-(1-\beta). \end{align*} Hence \begin{align*} e^{1/2}-\frac{\alpha}{2}-\beta=-1+\beta. \end{align*} Moving the $\beta$ terms to the right and the constant term to the left gives \begin{align*} e^{1/2}-\frac{\alpha}{2}+1=2\beta. \end{align*} Therefore \begin{align*} \beta=\frac{e^{1/2}+1-\alpha/2}{2}. \end{align*} Using $\alpha=e-1$, \begin{align*} \beta=\frac{e^{1/2}+1-(e-1)/2}{2}. \end{align*} Putting the numerator over the common denominator $2$ gives \begin{align*} \beta=\frac{2e^{1/2}+3-e}{4}. \end{align*} Consequently \begin{align*} E=1-\frac{2e^{1/2}+3-e}{4}. \end{align*} Combining terms gives \begin{align*} E=\frac{e+1-2e^{1/2}}{4}. \end{align*} Equivalently, \begin{align*} E=\frac{(e^{1/2}-1)^2}{4}. \end{align*} The first Remez solve therefore gives \begin{align*} p(x)=(e-1)x+\frac{2e^{1/2}+3-e}{4}. \end{align*} The error derivative is \begin{align*} r'(x)=e^x-\alpha=e^x-(e-1). \end{align*} An interior critical point satisfies \begin{align*} e^x=e-1, \end{align*} so \begin{align*} x=\log(e-1). \end{align*} Since $1<e-1<e$, this point lies in $(0,1)$. At this point the error is \begin{align*} r(\log(e-1))=(e-1)-(e-1)\log(e-1)-\frac{2e^{1/2}+3-e}{4}. \end{align*} The exchange step compares the absolute values of $r$ at $0$, $1/2$, $1$, and $\log(e-1)$. The old exchange errors are $E,-E,E$ by construction, while $\log(e-1)$ is the newly found interior extremum because $r'(\log(e-1))=0$. If this interior error has the largest relevant magnitude, the next exchange set replaces the old interior point $1/2$ by $\log(e-1)$, preserving the ordered alternating pattern used in the next Remez solve. [/example] Numerical conditioning matters because the exchange equations may be ill-conditioned in monomial bases or with clustered points. Stable implementations use better bases, scaling, high-accuracy extremum searches, and stopping criteria based on both upper and lower error certificates. [remark: Conditioning of Exchange Systems] In polynomial Remez computations, the monomial basis $1,x,x^2,\dots$ can produce badly conditioned Vandermonde-type matrices. Chebyshev bases on a scaled interval reduce this effect because the basis functions remain uniformly bounded on $[-1,1]$. For high degrees, barycentric or orthogonal-polynomial formulations are usually preferable to solving raw monomial systems. [/remark] The algorithmic lesson is that equioscillation is an exact theorem, while Remez is a numerical method that must manage floating-point conditioning and extremum detection. ## Weighted Minimax Approximation Uniform error treats all points of the interval equally. In applications, some regions matter more than others: endpoint behaviour may dominate, a boundary layer may require extra accuracy, or a nonuniform physical scale may make equal absolute error inappropriate. [definition: Weighted Uniform Norm] Let $w:[a,b]\to(0,\infty)$ be continuous. The weighted uniform norm is the map \begin{align*} \|\cdot\|_{\infty,w}:C[a,b]\to[0,\infty) \end{align*} defined by \begin{align*} \|g\|_{\infty,w}=\sup_{x\in[a,b]} w(x)|g(x)|. \end{align*} [/definition] Weighted minimax approximation asks for $p\in X$ minimising $\|f-p\|_{\infty,w}$. Since $w$ is positive and continuous, this is the same as ordinary uniform approximation of the weighted error, but the approximating set becomes $wX$ rather than $X$. [quotetheorem:6885] [citeproof:6885] The positivity assumption on $w$ is structural, not cosmetic. If the weight vanishes at isolated points, the formula may still define a norm on $C[a,b]$, but multiplication by $w$ no longer preserves zeros: with $w(x)=|x|$ on $[-1,1]$, the nonzero constant function $g=1$ is sent to $wg=|x|$, which has an extra zero at $0$. Thus the Haar reduction can fail even though $\|g\|_{\infty,w}=\sup |x||g(x)|$ remains a genuine norm on $C[-1,1]$. Normhood itself fails when $w$ vanishes on a nontrivial set with interior: a nonzero continuous function supported inside that set has weighted norm $0$. The reduction therefore has a precise limitation: it preserves the Haar argument only because multiplication by $w$ is injective and zero-preserving on $C[a,b]$. If the weight is allowed to be discontinuous, sign changes and attainment of the weighted maximum may no longer be controlled by the same compactness argument. [example: Endpoint-Emphasised Approximation] On the truncated interval $I_\delta=[-1+\delta,1-\delta]$, where $0<\delta<1$, define \begin{align*} w(x)=(1-x^2)^{-1/2}. \end{align*} For $x\in I_\delta$ we have $|x|\le 1-\delta<1$, so \begin{align*} 1-x^2>0. \end{align*} Thus $w$ is positive and continuous on $I_\delta$, and weighted equioscillation applies to weighted minimax approximation on this interval. If $p$ is a weighted minimax polynomial and $x_0<\cdots<x_n$ are its alternating points, the weighted alternation condition is \begin{align*} (1-x_i^2)^{-1/2}(f(x_i)-p(x_i))=\sigma(-1)^iE. \end{align*} Multiplying both sides by $(1-x_i^2)^{1/2}$ gives \begin{align*} f(x_i)-p(x_i)=\sigma(-1)^iE(1-x_i^2)^{1/2}. \end{align*} Taking absolute values gives \begin{align*} |f(x_i)-p(x_i)|=E(1-x_i^2)^{1/2}. \end{align*} At the centre, $x=0$, the factor is \begin{align*} (1-0^2)^{1/2}=1. \end{align*} At the truncated endpoints, $x=\pm(1-\delta)$, the factor is \begin{align*} \bigl(1-(1-\delta)^2\bigr)^{1/2}. \end{align*} Expanding the square gives \begin{align*} (1-\delta)^2=1-2\delta+\delta^2. \end{align*} Therefore \begin{align*} 1-(1-\delta)^2=2\delta-\delta^2=\delta(2-\delta). \end{align*} So the endpoint-adjacent unweighted error scale is \begin{align*} E\sqrt{\delta(2-\delta)}. \end{align*} Since $0<\delta<1$, we have $0<\delta(2-\delta)<1$, and hence \begin{align*} E\sqrt{\delta(2-\delta)}<E. \end{align*} Thus the same weighted error level $E$ forces smaller ordinary errors near the truncated endpoints than near the centre. This is why the weight $w(x)=(1-x^2)^{-1/2}$ models approximation problems where endpoint-adjacent accuracy is more valuable. [/example] Weighted approximation also handles nonuniform domains after a change of variables. If $\phi:[c,d]\to[a,b]$ is a homeomorphism, approximating $f$ on $[a,b]$ can be pulled back to approximating $f\circ\phi$ on $[c,d]$, with the choice of $\phi$ concentrating resolution where $f$ changes rapidly. ## Sign-Like Functions and Limits of Polynomial Minimax Approximation Alternation is powerful, but it also reveals when approximation is intrinsically expensive. Functions with steep transitions force many oscillations in the error before a low-degree polynomial can track them well. [example: Best Polynomial Approximation to a Sign-Like Function] Let \begin{align*} f_\varepsilon(x)=\tanh(x/\varepsilon),\qquad 0<\varepsilon\ll 1, \end{align*} and approximate $f_\varepsilon$ on $[-1,1]$ from $\mathcal P_n$. Since $\mathcal P_n$ has basis $1,x,\dots,x^n$, it has dimension $n+1$, and every nonzero element of $\mathcal P_n$ has at most $n$ distinct zeros. Thus $\mathcal P_n$ is a Haar space, so the *[Chebyshev Alternation Theorem for Haar Spaces](/theorems/6882)* applies: the best minimax polynomial $p_n\in\mathcal P_n$ has an error \begin{align*} r_n(x)=f_\varepsilon(x)-p_n(x) \end{align*} that reaches alternating values \begin{align*} r_n(x_i)=\sigma(-1)^i\|r_n\|_\infty,\qquad i=0,\dots,n+1, \end{align*} at $n+2$ ordered points $-1\le x_0<\cdots<x_{n+1}\le 1$. The steepness of the transition is visible from the derivative. Since \begin{align*} \frac{d}{dx}\tanh(x/\varepsilon)=\frac{1}{\varepsilon}\operatorname{sech}^2(x/\varepsilon), \end{align*} we have \begin{align*} f_\varepsilon'(0)=\frac{1}{\varepsilon}\operatorname{sech}^2(0)=\frac{1}{\varepsilon}. \end{align*} Thus the slope at the origin grows without bound as $\varepsilon\to 0$. Away from the origin the function is already close to its limiting signs: for $x\ge \varepsilon$, \begin{align*} f_\varepsilon(x)=\tanh(x/\varepsilon)\ge \tanh(1), \end{align*} and for $x\le -\varepsilon$, \begin{align*} f_\varepsilon(x)=\tanh(x/\varepsilon)\le -\tanh(1), \end{align*} because $\tanh$ is increasing. More generally, if $K>0$ and $x\ge K\varepsilon$, then \begin{align*} 1-\tanh(x/\varepsilon)\le 1-\tanh K. \end{align*} Using \begin{align*} \tanh K=\frac{e^K-e^{-K}}{e^K+e^{-K}}=\frac{e^{2K}-1}{e^{2K}+1}, \end{align*} we get \begin{align*} 1-\tanh K=1-\frac{e^{2K}-1}{e^{2K}+1}=\frac{2}{e^{2K}+1}. \end{align*} So outside a layer of width comparable to $\varepsilon$, the target is close to $1$ on the right and close to $-1$ on the left, while inside that layer it changes with slope of size $1/\varepsilon$. The alternation condition therefore forces the error peaks for a low-degree minimax polynomial to balance three competing regions: the left plateau, the right plateau, and the narrow transition layer near $0$. As $n$ increases, more alternating points are available, and in the sign-like regime they are naturally driven toward the transition where the function changes fastest. This clustering records the main obstruction: a polynomial of fixed degree cannot follow a sharp continuous jump without paying uniform error somewhere. [/example] The discontinuous sign function itself lies outside $C[-1,1]$, so the uniform framework on the closed interval does not give small polynomial error. Indeed, uniform limits of polynomials are continuous, and a discontinuous target cannot be uniformly approximated by continuous functions. [remark: Uniform Approximation Cannot Remove a Jump] If $f$ has a jump discontinuity on $[a,b]$, no sequence of continuous functions can converge uniformly to $f$. Therefore polynomial minimax approximation in the uniform norm is meaningful for such targets only after modifying the problem, for example by excluding a neighbourhood of the jump, using a weighted norm, or measuring error in an integral norm. [/remark] This limitation is not a defect of the theory. It identifies which features of the target are compatible with uniform approximation and which require a different norm or a changed domain. ## Geometry of the Uniform Norm The final viewpoint of the chapter is geometric. Best approximation in $C[a,b]$ is a convex optimisation problem, but unlike Hilbert projection it is usually controlled by finitely many active constraints. [definition: Extremal Set of the Error] Let $X\subset C[a,b]$, let $f\in C[a,b]$, and let $p\in X$. For each error $e=f-p\in C[a,b]$, the extremal-set map \begin{align*} \operatorname{Ext}:C[a,b]\to\mathcal P([a,b]) \end{align*} is defined by \begin{align*} \operatorname{Ext}(e)=\{x\in[a,b]: |e(x)|=\|e\|_\infty\}. \end{align*} [/definition] The extremal set records exactly where the uniform norm is active. A best approximant cannot be improved by moving in any direction from $X$, so the signed active evaluations must block every such movement. The next theorem states this blocking condition as a convex balance of functionals. [quotetheorem:6886] [citeproof:6886] This theorem explains why minimax approximation is sensitive to active points. The best approximant is determined not by average error, but by a finite balance of worst errors that prevents any direction in $X$ from lowering them all at once. This is the same convex geometry behind supporting hyperplanes in finite-dimensional optimisation and the same active-set principle that appears in linear programming: an optimum is certified by a small collection of saturated constraints. Compactness of $K$ ensures that the maximum error is attained; on a noncompact domain such as $(0,1)$, a bounded continuous error may have a supremum that is never reached, leaving no active evaluation points to balance. Continuity keeps point evaluations compatible with the uniform norm and ensures that the active set is closed. Finite dimensionality of $X$ keeps the convex separation problem inside a finite-dimensional [dual space](/page/Dual%20Space); in infinite-dimensional subspaces, subdifferential statements still exist in functional-analytic form, but a finite convex hull of point evaluations need not capture all possible perturbation directions. [example: Best Constant Approximation Revisited] Let \begin{align*} M=\max_{x\in[a,b]} f(x) \end{align*} and \begin{align*} m=\min_{x\in[a,b]} f(x). \end{align*} Since $f$ is continuous on the compact interval $[a,b]$, both extrema are attained. We show that the best constant approximant from $X=\mathbb R\cdot 1$ is \begin{align*} c=\frac{M+m}{2}. \end{align*} For this value of $c$, the largest positive error is \begin{align*} M-c=M-\frac{M+m}{2}=\frac{2M-M-m}{2}=\frac{M-m}{2}, \end{align*} and the largest negative error is \begin{align*} m-c=m-\frac{M+m}{2}=\frac{2m-M-m}{2}=-\frac{M-m}{2}. \end{align*} Thus \begin{align*} \|f-c\|_\infty=\frac{M-m}{2}. \end{align*} Now let $d\in\mathbb R$ be any constant. At a point where $f=M$, \begin{align*} |f-d|=|M-d|. \end{align*} At a point where $f=m$, \begin{align*} |f-d|=|m-d|. \end{align*} Therefore \begin{align*} \|f-d\|_\infty\ge \max\{|M-d|,|m-d|\}. \end{align*} For any two [real numbers](/page/Real%20Numbers) $A$ and $B$, \begin{align*} \max\{|A|,|B|\}\ge \frac{|A|+|B|}{2}\ge \frac{|A-B|}{2}. \end{align*} Applying this with $A=M-d$ and $B=m-d$ gives \begin{align*} \max\{|M-d|,|m-d|\}\ge \frac{|(M-d)-(m-d)|}{2}. \end{align*} The difference in the numerator is \begin{align*} (M-d)-(m-d)=M-m. \end{align*} Hence \begin{align*} \|f-d\|_\infty\ge \frac{M-m}{2}. \end{align*} Since the constant $c=(M+m)/2$ attains this lower bound, it is a best constant approximant. The error $f-c$ takes the values $(M-m)/2$ and $-(M-m)/2$ at maximum and minimum points of $f$, so the two active errors have equal magnitude and opposite signs. On the one-dimensional space of constants, the signed evaluations balance because increasing the constant decreases the positive error at a maximum point while increasing the magnitude of the negative error at a minimum point. [/example] The chapter's main message is that uniform best approximation is governed by extremal geometry. Chebyshev polynomials display the ideal oscillation pattern, Haar spaces are precisely the finite-dimensional spaces where zero-counting makes that pattern decisive, de la Vallee Poussin bounds turn partial alternation into certification, and Remez exchange converts the theory into a practical computational method. Chapter 6 keeps Chebyshev polynomials in view but shifts from minimax certificates to stable bases, coefficients, quadrature, and spectral computation. In Chapter 5, minimax approximation was certified by alternation, vallee Poussin bounds, and Remez exchange. Chapter 6 keeps the same polynomial setting but shifts to orthogonal polynomials and spectral methods, where stable bases and coefficient formulas replace extremal characterisations. # 6. Orthogonal Polynomials and Spectral Approximation After Chapter 5 used Chebyshev polynomials as extremal objects for minimax approximation, this chapter moves to the structured bases used in spectral computation. It assumes the preceding material on best approximation, polynomial interpolation, and uniform norms, together with basic inner-product and complex-variable tools. The main questions are how to choose polynomial bases that interact well with integration, how smoothness or complex analyticity controls coefficient decay, and why interpolation at the right nodes is stable enough for high-order approximation. The guiding principle is that orthogonality converts many approximation questions into statements about coefficients, while carefully chosen nodes prevent the instability seen in equispaced interpolation. The same ideas also connect approximation theory with Fourier analysis, Sturm-Liouville eigenfunction expansions, and computational methods for differential equations. Orthogonal polynomial bases behave like finite-interval analogues of trigonometric bases, while spectral differentiation turns approximation estimates into numerical schemes for boundary value problems. ## Orthogonal Polynomial Families and Three-Term Recurrences A first problem in high-degree polynomial approximation is that the monomial basis $1,x,x^2,\dots$ is badly conditioned for projection, interpolation, and quadrature. Orthogonal polynomials replace monomials by a basis adapted to an inner product, so that coefficient extraction and error estimates become diagonal or nearly diagonal operations. [definition: Orthogonal Polynomial Sequence] Let $w:(a,b)\to[0,\infty)$ be a weight function with finite moments of all orders and with $w>0$ on a set of positive measure. An orthogonal polynomial sequence associated to $w$ is a sequence $(p_n)_{n\ge0}$ such that each $p_n:\mathbb R\to\mathbb R$ is a polynomial map of degree $n$ and \begin{align*} \int_a^b p_m(x)p_n(x)w(x)\,dx=0 \end{align*} whenever $m\ne n$. [/definition] The weight records which parts of the interval the approximation process should emphasise. To see what this definition produces in the unweighted projection problem, the Legendre family gives the basic model example. [example: Legendre Orthogonality] On $[-1,1]$ with $w(x)=1$, the first Legendre polynomials are normalized by $P_n(1)=1$ and begin with \begin{align*} P_0(x)=1,\qquad P_1(x)=x,\qquad P_2(x)=\frac{3x^2-1}{2}. \end{align*} For the first orthogonality checks, compute \begin{align*} \int_{-1}^1 P_0(x)P_1(x)\,dx=\int_{-1}^1 x\,dx=\left[\frac{x^2}{2}\right]_{-1}^{1}=\frac{1}{2}-\frac{1}{2}=0. \end{align*} Next, \begin{align*} \int_{-1}^1 P_0(x)P_2(x)\,dx=\int_{-1}^1 \frac{3x^2-1}{2}\,dx. \end{align*} Since \begin{align*} \int_{-1}^1 \frac{3x^2-1}{2}\,dx=\left[\frac{x^3-x}{2}\right]_{-1}^{1}=\frac{1-1}{2}-\frac{-1+1}{2}=0, \end{align*} the quadratic polynomial $P_2$ is orthogonal to constants. Also, \begin{align*} \int_{-1}^1 P_1(x)P_2(x)\,dx=\int_{-1}^1 \frac{3x^3-x}{2}\,dx=0, \end{align*} because $\frac{3x^3-x}{2}$ is odd and the interval is symmetric about $0$. Thus the correction term $-1/2$ in $P_2(x)=(3x^2-1)/2$ is exactly what removes the constant component of $x^2$ under the unweighted inner product on $[-1,1]$. [/example] Legendre polynomials solve the unweighted least-squares problem, but approximation on an interval is often controlled by endpoint behaviour and by a hidden trigonometric structure. This motivates the Chebyshev polynomial basis, whose definition builds the change of variables $x=\cos\theta$ directly into the polynomial sequence. [definition: Chebyshev Polynomial] For $n\ge0$, the Chebyshev polynomial of the first kind is the real polynomial map $T_n:\mathbb R\to\mathbb R$ determined by \begin{align*} T_n(\cos\theta)=\cos(n\theta) \end{align*} for all $\theta\in\mathbb R$. [/definition] This definition turns polynomial questions on $[-1,1]$ into cosine-series questions on $[0,\pi]$. The next computation shows that the basis is also practical: it can be generated recursively without forming powers of $x$ directly. [example: Chebyshev Recurrence] Let $n\ge1$ and write $x=\cos\theta$. The angle-addition identities give \begin{align*} \cos((n+1)\theta)=\cos(n\theta)\cos\theta-\sin(n\theta)\sin\theta,\qquad \cos((n-1)\theta)=\cos(n\theta)\cos\theta+\sin(n\theta)\sin\theta. \end{align*} Adding these two identities cancels the sine terms: \begin{align*} \cos((n+1)\theta)+\cos((n-1)\theta)=2\cos\theta\cos(n\theta). \end{align*} Using $T_k(\cos\theta)=\cos(k\theta)$, this becomes \begin{align*} T_{n+1}(x)+T_{n-1}(x)=2xT_n(x). \end{align*} Therefore \begin{align*} T_{n+1}(x)=2xT_n(x)-T_{n-1}(x),\qquad T_0(x)=1,\quad T_1(x)=x. \end{align*} The identity holds for every $x\in[-1,1]$, hence as a polynomial identity it holds for all real $x$. For $n=1$, the recurrence gives \begin{align*} T_2(x)=2xT_1(x)-T_0(x)=2x\cdot x-1=2x^2-1. \end{align*} For $n=2$, it gives \begin{align*} T_3(x)=2xT_2(x)-T_1(x)=2x(2x^2-1)-x=4x^3-2x-x=4x^3-3x. \end{align*} Thus the Chebyshev basis is generated by multiplying the two previous terms by $2x$ and subtracting, so high-degree terms can be built recursively instead of by expanding powers of $x$ from scratch. [/example] The Chebyshev recurrence shows that a special endpoint-weighted family can have the same short recurrence structure as Legendre polynomials. To vary endpoint weights systematically and include both of those examples in one framework, we need the Jacobi polynomial family. [definition: Jacobi Polynomial] Let $\alpha,\beta>-1$. The Jacobi polynomials $(J_{n,\alpha,\beta})_{n\ge0}$ are real polynomial maps $J_{n,\alpha,\beta}:\mathbb R\to\mathbb R$ with $\deg J_{n,\alpha,\beta}=n$, forming the orthogonal polynomial sequence on $[-1,1]$ for the weight \begin{align*} w_{\alpha,\beta}(x)=(1-x)^\alpha(1+x)^\beta. \end{align*} [/definition] Legendre polynomials correspond to $\alpha=\beta=0$, while Chebyshev polynomials of the first kind correspond, up to normalisation, to $\alpha=\beta=-1/2$. These examples suggest that short recurrences are not accidents of closed-form formulae, but a structural consequence of orthogonality. The problem is to justify why multiplying an orthogonal polynomial by $x$ can only couple it to the neighbouring degrees, and to identify the nondegeneracy assumptions that make those recurrence coefficients meaningful. [quotetheorem:478] [citeproof:478] The positivity hypothesis ensures that the inner product is non-degenerate on polynomials. For instance, if the measure were the point mass at $0$, then $x$ would be a nonzero polynomial with zero norm, and projection onto polynomial degrees would no longer determine recurrence coefficients. The monic normalisation is not mathematically essential, but it fixes the recurrence coefficients uniquely; with a different scaling the same three-term structure has different constants. The theorem does not give explicit values of $\alpha_n$ and $\beta_n$, nor does it by itself give numerical stability of evaluating the recurrence. What it does give is the key structural fact that multiplication by $x$ has a tridiagonal matrix in an orthogonal polynomial basis. To turn this into projection kernels and quadrature formulae, we need a compact identity for sums of products of orthogonal polynomials; this is the role of the [Christoffel-Darboux formula](/theorems/6887). [quotetheorem:6887] [citeproof:6887] The nonzero norm hypothesis is essential because the summands are normalised by $h_j$. In the degenerate point-mass example at $0$, the polynomial $p_1(x)=x$ has $h_1=0$, so the term $p_1(x)p_1(y)/h_1$ and the corresponding projection kernel are not defined. The condition $x\ne y$ avoids the removable singularity in the displayed quotient; the diagonal value is obtained by taking the limit as $y\to x$, not by substituting directly. The formula does not estimate the size of the kernel or give asymptotics for large $n$ without additional information about the polynomial family. Its immediate use here is algebraic: at [zeros of orthogonal polynomials](/theorems/482) it controls the weights and exactness of quadrature rules. ## Gaussian Quadrature and Exact Integration of Polynomials Numerical integration asks for weighted sums of function values that approximate an integral. Orthogonal polynomials solve the design problem: place nodes at zeros of the next orthogonal polynomial, and the resulting rule integrates twice as many polynomial degrees as interpolation alone would suggest. [definition: Quadrature Rule] A quadrature rule with nodes $x_1,\dots,x_N\in[a,b]$ and weights $\lambda_1,\dots,\lambda_N\in\mathbb R$ is the linear functional \begin{align*} Q:C([a,b];\mathbb R)&\to\mathbb R, & Q(f)&=\sum_{j=1}^N \lambda_j f(x_j). \end{align*} [/definition] The rule is designed to approximate the weighted integration functional \begin{align*} I:C([a,b];\mathbb R)&\to\mathbb R, & I(f)&=\int_a^b f(x)w(x)\,dx. \end{align*} The accuracy of this approximation depends on the chosen nodes and weights. A quadrature rule is useful only after we know which test functions it integrates exactly. Exactness cannot be measured just by counting nodes, because different choices of nodes and weights with the same number of samples can integrate different polynomial classes. The needed invariant is the largest polynomial degree for which the quadrature rule still agrees with the target integral, with failure required at the next degree so the number records the precise algebraic limit of the rule. [definition: Degree of Exactness] A quadrature rule has degree of exactness $m$ if it is exact for every polynomial $q$ with $\deg q\le m$ and is not exact for at least one polynomial of degree $m+1$. [/definition] The degree of exactness measures the algebraic strength of the rule. Since $N$ nodes give $N$ function values, the central question is how orthogonal-polynomial nodes achieve the maximal degree $2N-1$. [quotetheorem:483] [citeproof:483] The positivity of the weight is what guarantees the standard orthogonal-polynomial structure, including real distinct zeros in the interval; without distinct nodes the interpolatory rule is not the ordinary $N$-point Lagrange rule. Choosing the zeros of $p_N$ is essential: arbitrary nodes generally give exactness only through degree $N-1$, as ordinary interpolation alone guarantees. The theorem does not give an error estimate for non-polynomial functions, nor does it say that every function is well approximated by the quadrature rule. It explains only the algebraic exactness mechanism, and the two-point Legendre rule is the smallest nontrivial case where this doubling of exactness can be checked by direct computation. [example: Two-Point Gauss-Legendre Rule] For $w(x)=1$ on $[-1,1]$, the degree two Legendre polynomial is \begin{align*} P_2(x)=\frac{3x^2-1}{2}. \end{align*} Its zeros are obtained from $3x^2-1=0$, hence $x^2=1/3$ and therefore \begin{align*} x_1=-\frac{1}{\sqrt{3}},\qquad x_2=\frac{1}{\sqrt{3}}. \end{align*} For an interpolatory two-node rule \begin{align*} Q(f)=\lambda_1 f\left(-\frac{1}{\sqrt{3}}\right)+\lambda_2 f\left(\frac{1}{\sqrt{3}}\right), \end{align*} exactness on $1$ gives \begin{align*} \lambda_1+\lambda_2=\int_{-1}^1 1\,dx=2. \end{align*} Exactness on $x$ gives \begin{align*} -\frac{\lambda_1}{\sqrt{3}}+\frac{\lambda_2}{\sqrt{3}}=\int_{-1}^1 x\,dx=0. \end{align*} Multiplying the second equation by $\sqrt{3}$ gives $-\lambda_1+\lambda_2=0$, so $\lambda_1=\lambda_2$. Combining this with $\lambda_1+\lambda_2=2$ gives \begin{align*} \lambda_1=\lambda_2=1. \end{align*} Thus the rule is \begin{align*} Q(f)=f\left(-\frac{1}{\sqrt{3}}\right)+f\left(\frac{1}{\sqrt{3}}\right). \end{align*} To verify the advertised exactness through degree $3$, it is enough by linearity to check $1,x,x^2,x^3$. The cases $1$ and $x$ were already used to determine the weights. For $x^2$, \begin{align*} Q(x^2)=\left(-\frac{1}{\sqrt{3}}\right)^2+\left(\frac{1}{\sqrt{3}}\right)^2=\frac{1}{3}+\frac{1}{3}=\frac{2}{3}. \end{align*} Also, \begin{align*} \int_{-1}^1 x^2\,dx=\left[\frac{x^3}{3}\right]_{-1}^{1}=\frac{1}{3}-\left(-\frac{1}{3}\right)=\frac{2}{3}. \end{align*} For $x^3$, \begin{align*} Q(x^3)=\left(-\frac{1}{\sqrt{3}}\right)^3+\left(\frac{1}{\sqrt{3}}\right)^3=-\frac{1}{3\sqrt{3}}+\frac{1}{3\sqrt{3}}=0. \end{align*} Also, \begin{align*} \int_{-1}^1 x^3\,dx=\left[\frac{x^4}{4}\right]_{-1}^{1}=\frac{1}{4}-\frac{1}{4}=0. \end{align*} Therefore the two-point Gauss-Legendre rule integrates every polynomial of degree at most $3$ exactly, even though it uses only two function values. [/example] Gaussian quadrature should be compared with Newton-Cotes rules, where equispaced nodes often produce weights with poor stability for high degree. The same stability theme reappears in interpolation: choosing nodes from orthogonal polynomial structure controls oscillation. ## Fourier-Chebyshev Expansions and Coefficient Decay A second problem is how to predict convergence rates before computing an approximation. Chebyshev expansions answer this by converting approximation on $[-1,1]$ into Fourier cosine analysis, where smoothness and analyticity are visible in coefficient decay. [definition: Fourier-Chebyshev Coefficients] Let \begin{align*} \mathcal A=\{f:[-1,1]\to\mathbb C\mid \theta\mapsto f(\cos\theta) \text{ is integrable on } [0,\pi]\}. \end{align*} The zeroth Fourier-Chebyshev coefficient is the functional $a_0:\mathcal A\to\mathbb C$ defined by \begin{align*} a_0(f)=\frac{1}{\pi}\int_0^\pi f(\cos\theta)\,d\theta. \end{align*} For $n\ge1$, the $n$th Fourier-Chebyshev coefficient is the functional $a_n:\mathcal A\to\mathbb C$ defined by \begin{align*} a_n(f)=\frac{2}{\pi}\int_0^\pi f(\cos\theta)\cos(n\theta)\,d\theta. \end{align*} [/definition] These are cosine-series coefficients for the even periodic function $F(\theta)=f(\cos\theta)$. The first test of the definition is that ordinary polynomial identities become sparse Chebyshev coefficient identities. [example: Polynomial Chebyshev Expansion] For $f(x)=x^2$, use $T_0(x)=1$ and $T_2(x)=2x^2-1$. Solving the second identity for $x^2$ gives \begin{align*} T_2(x)+1=2x^2. \end{align*} Dividing both sides by $2$ gives \begin{align*} x^2=\frac{1}{2}+\frac{1}{2}T_2(x). \end{align*} Since $T_0(x)=1$, this is the Chebyshev expansion \begin{align*} x^2=\frac{1}{2}T_0(x)+\frac{1}{2}T_2(x). \end{align*} Thus the expansion has no $T_1,T_3,T_4,\dots$ terms: the polynomial $x^2$ has exactly a constant Chebyshev component and a degree-two Chebyshev component. [/example] Finite smoothness gives algebraic decay, but spectral approximation is most powerful when the target function is analytic. To state the analytic convergence theorem in the geometry naturally associated with Chebyshev polynomials, we introduce Bernstein ellipses. [definition: Bernstein Ellipse] For $\rho>1$, let $\phi_\rho:\{z\in\mathbb C:|z|=\rho\}\to\mathbb C$ be the map \begin{align*} \phi_\rho(z)=\frac{1}{2}\left(z+z^{-1}\right). \end{align*} The Bernstein ellipse $E_\rho$ is the image $\phi_\rho(\{z\in\mathbb C:|z|=\rho\})$. [/definition] We write $\overline{\Omega}_\rho$ for the closed region bounded by $E_\rho$ and $\Omega_\rho$ for its interior. The interval $[-1,1]$ lies inside $\overline{\Omega}_\rho$, with the two foci of $E_\rho$ at $-1$ and $1$. The parameter $\rho$ measures how far the function can be analytically continued from $[-1,1]$ before meeting a complex singularity. This motivates the Bernstein ellipse convergence theorem, which converts that geometric distance into coefficient decay and truncation error. [quotetheorem:6888] [citeproof:6888] The estimate uses $|T_n(x)|\le1$ on $[-1,1]$, and the displayed tail bound is obtained by summing the coefficients with indices $n\ge N+1$ after the degree $N$ truncation. Analyticity in the full ellipse is essential: if a singularity lies inside or on the ellipse, Cauchy's estimate on that contour is unavailable and the stated $\rho^{-n}$ bound can fail. For the Runge function, any Bernstein ellipse crossing the poles at $\pm i/5$ is unusable for this theorem, no matter how well behaved the function is on the real interval. The uniform bound $|f|\le M$ supplies the constant in the coefficient estimate; without such a bound the rate alone would not control the magnitude of the coefficients. The theorem does not say that $\rho$ is optimal, and it does not rule out faster decay when the function extends to a larger ellipse. Its practical message is that convergence is governed by the nearest complex obstruction in Bernstein-ellipse coordinates. The Runge function illustrates how a pair of nearby poles fixes the limiting convergence rate. [example: Chebyshev Approximation of the Runge Function] For \begin{align*} f(x)=\frac{1}{1+25x^2}, \end{align*} the singularities occur where the denominator vanishes. Solving \begin{align*} 1+25x^2=0 \end{align*} gives \begin{align*} 25x^2=-1 \end{align*} and hence \begin{align*} x^2=-\frac{1}{25}. \end{align*} Therefore \begin{align*} x=\pm \frac{i}{5}. \end{align*} To find the Bernstein ellipse that first reaches these poles, use the Bernstein parametrisation \begin{align*} x=\frac{1}{2}\left(z+z^{-1}\right),\qquad |z|=\rho. \end{align*} For the pole $x=i/5$, this means \begin{align*} \frac{1}{2}\left(z+z^{-1}\right)=\frac{i}{5}. \end{align*} Multiplying by $2z$ gives \begin{align*} z^2+1=\frac{2i}{5}z. \end{align*} Thus \begin{align*} z^2-\frac{2i}{5}z+1=0. \end{align*} The [quadratic formula](/theorems/1301) gives \begin{align*} z=\frac{\frac{2i}{5}\pm \sqrt{\left(-\frac{2i}{5}\right)^2-4}}{2}. \end{align*} Since \begin{align*} \left(-\frac{2i}{5}\right)^2-4=-\frac{4}{25}-4=-\frac{104}{25}, \end{align*} we have \begin{align*} \sqrt{-\frac{104}{25}}=\frac{2i\sqrt{26}}{5}. \end{align*} Therefore \begin{align*} z=\frac{\frac{2i}{5}\pm \frac{2i\sqrt{26}}{5}}{2}=\frac{i(1\pm \sqrt{26})}{5}. \end{align*} The two moduli are \begin{align*} \left|\frac{i(1+\sqrt{26})}{5}\right|=\frac{1+\sqrt{26}}{5} \end{align*} and \begin{align*} \left|\frac{i(1-\sqrt{26})}{5}\right|=\frac{\sqrt{26}-1}{5}. \end{align*} These are reciprocal radii, so the ellipse through the poles has parameter \begin{align*} \rho_*=\frac{1+\sqrt{26}}{5}. \end{align*} For every $1<\rho<\rho_*$, the function is analytic on and inside $E_\rho$, so the *Bernstein Ellipse Convergence Theorem* gives Chebyshev coefficient decay bounded by a constant times $\rho^{-n}$. No ellipse with $\rho>\rho_*$ avoids the poles at $\pm i/5$, so this nearest pair of imaginary-axis singularities fixes the limiting geometric rate available from Bernstein-ellipse estimates. Thus increasing the degree gives geometric improvement, but the rate is limited by $\rho_*^{-1}=5/(1+\sqrt{26})$. [/example] Analyticity gives the cleanest rate, but the same coefficient viewpoint also handles differentiability classes. Repeated [integration by parts](/theorems/210) in the cosine variable gives algebraic decay when only finitely many derivatives are available, so the next remark records the non-analytic comparison case. [remark: Smoothness and Algebraic Decay] If $F(\theta)=f(\cos\theta)$ has $r$ derivatives of bounded variation, then its cosine coefficients decay like a constant times $n^{-r-1}$. Endpoint regularity matters because differentiability of $f$ in $x$ does not always translate into periodic smoothness of $F$ at $0$ and $\pi$. [/remark] ## Interpolation Stability and the Lebesgue Constant Interpolation is attractive because it uses function values, but its stability depends strongly on the nodes. The question is not only whether the interpolating polynomial exists, but whether small data errors and approximation errors are amplified by a large factor. [definition: Interpolation Operator] Given distinct nodes $x_0,\dots,x_N\in[-1,1]$, the interpolation operator $I_N:C([-1,1])\to\mathcal P_N$ maps $f$ to the unique polynomial $I_Nf$ of degree at most $N$ satisfying \begin{align*} I_Nf(x_j)=f(x_j),\qquad 0\le j\le N. \end{align*} [/definition] The interpolation operator turns nodal data into a polynomial, but its usefulness depends on its operator norm in the uniform norm. This motivates the Lebesgue constant, which is the standard way to quantify interpolation stability. [definition: Lebesgue Constant] Let $\ell_0,\dots,\ell_N$ be the Lagrange basis polynomials for nodes $x_0,\dots,x_N$. The Lebesgue function is \begin{align*} \Lambda_N:[-1,1]&\to\mathbb R, & \Lambda_N(x)&=\sum_{j=0}^N |\ell_j(x)|, \end{align*} and the Lebesgue constant is \begin{align*} \lambda_N&\in\mathbb R, & \lambda_N&=\max_{x\in[-1,1]}\Lambda_N(x). \end{align*} [/definition] A small Lebesgue constant means interpolation is nearly as good as best polynomial approximation, while a large one means interpolation may amplify the best-approximation error. The obstruction is that polynomial density alone only supplies nearby polynomials; it does not say that the interpolation operator will preserve that nearness when forced to match point values at the chosen nodes. We therefore need an inequality that splits the interpolation error into an approximation term and a stability factor depending only on the node set. [quotetheorem:6889] [citeproof:6889] The distinct-node hypothesis is essential because repeated nodes do not determine a unique Lagrange interpolant without adding derivative data. Continuity of $f$ puts the problem in the uniform norm; for rougher functions the point values or the supremum norm may not be the right objects. The inequality does not estimate the best approximation error and does not by itself bound $\lambda_N$; it only separates the approximation term from the stability term. This is why polynomial density alone is not enough: if $\lambda_N$ grows too fast, interpolation can diverge even when good polynomial approximants exist. To identify stable interpolation schemes, we need a quantitative bound for the Chebyshev-Lobatto nodes used in spectral approximation. [quotetheorem:6890] [citeproof:6890] The Chebyshev-Lobatto node formula is essential because the endpoint clustering changes the spacing in the angle variable; equispaced physical nodes do not satisfy a comparable logarithmic bound and can have exponentially large Lebesgue constants. The absolute constant $C$ is independent of $N$, which is what makes the estimate asymptotically useful. The theorem does not say that the interpolant converges for every continuous function, since the best approximation error must still tend to zero fast enough relative to the logarithmic factor. It does show that Chebyshev interpolation inherits the convergence rate of near-best polynomial approximation up to a slowly growing factor, explaining the contrast with equispaced interpolation and the Runge phenomenon. [example: Comparing Equispaced and Chebyshev Interpolation] For \begin{align*} f(x)=\frac{1}{1+25x^2}, \end{align*} the denominator vanishes only when \begin{align*} 1+25x^2=0, \end{align*} so \begin{align*} x^2=-\frac{1}{25} \end{align*} and hence the nearest complex singularities are \begin{align*} x=\pm \frac{i}{5}. \end{align*} Thus $f$ is analytic on a complex neighbourhood of $[-1,1]$, so its best polynomial approximation error decreases geometrically on any Bernstein ellipse that stays inside those two poles, by the *Bernstein Ellipse Convergence Theorem*. Let $I_Nf$ be the interpolant at some node set and let $\lambda_N$ be the corresponding Lebesgue constant. The *Lebesgue Inequality* gives \begin{align*} \|f-I_Nf\|_\infty\le (1+\lambda_N)\inf_{p\in\mathcal P_N}\|f-p\|_\infty. \end{align*} For Chebyshev-Lobatto nodes, the *Lebesgue Constants for Chebyshev Interpolation* theorem gives \begin{align*} \lambda_N\le \frac{2}{\pi}\log(N+1)+C. \end{align*} Combining this logarithmic stability factor with geometric best-approximation decay gives a product of the form \begin{align*} (1+\lambda_N)\inf_{p\in\mathcal P_N}\|f-p\|_\infty\le \left(1+\frac{2}{\pi}\log(N+1)+C\right)A\rho^{-N} \end{align*} for constants $A>0$ and $1<\rho<(1+\sqrt{26})/5$. Since $\rho^{-N}$ decays geometrically while $\log(N+1)$ grows only logarithmically, the right-hand side tends to $0$, so Chebyshev-Lobatto interpolation converges uniformly. For equispaced nodes, the Lebesgue constants grow much faster, and for this Runge function the endpoint amplification is strong enough to produce the classical oscillations near $\pm1$. The function itself is not the obstruction: the poles at $\pm i/5$ still allow geometric polynomial approximation. The obstruction is the unstable interpolation operator produced by equispaced nodes. [/example] The Runge phenomenon is therefore not a paradox about polynomials; it is a warning about node choice. Stable interpolation requires matching the node distribution to polynomial geometry, and the following remark records the terminology used for this instability. [remark: Runge Phenomenon] The Runge phenomenon refers to the growth of oscillations, especially near endpoints, in polynomial interpolation at equispaced nodes. It can occur even for smooth or analytic functions on the interval. Chebyshev clustering near endpoints reduces this endpoint amplification and restores convergence for broad classes of functions. [/remark] ## Spectral Differentiation and Polynomial Collocation Spectral methods use global polynomial interpolants not only to approximate functions but also to approximate derivatives. The practical question is how to differentiate nodal data without first converting everything by hand into coefficients. [definition: Spectral Differentiation Matrix] Let $x_0,\dots,x_N$ be distinct nodes with Lagrange basis $\ell_0,\dots,\ell_N$. The spectral differentiation matrix is the linear map \begin{align*} D:\mathbb R^{N+1}&\to\mathbb R^{N+1} \end{align*} represented in the standard basis by the entries \begin{align*} D_{ij}=\ell_j'(x_i),\qquad 0\le i,j\le N. \end{align*} [/definition] If $v\in\mathbb R^{N+1}$ is given by $v_j=f(x_j)$, then $(Dv)_i=(I_Nf)'(x_i)$. The next example shows how this matrix differentiates the interpolating polynomial exactly whenever the data come from a polynomial within the interpolation space. [example: Three-Node Chebyshev Differentiation] For the Chebyshev-Lobatto nodes $x_0=1$, $x_1=0$, and $x_2=-1$, the Lagrange basis polynomial $\ell_j$ is the unique quadratic satisfying $\ell_j(x_j)=1$ and $\ell_j(x_k)=0$ for $k\ne j$. Thus \begin{align*} \ell_0(x)=\frac{x-0}{1-0}\frac{x-(-1)}{1-(-1)}=\frac{x(x+1)}{2}=\frac{x^2+x}{2}. \end{align*} Similarly, \begin{align*} \ell_1(x)=\frac{x-1}{0-1}\frac{x-(-1)}{0-(-1)}=-(x-1)(x+1)=1-x^2. \end{align*} Finally, \begin{align*} \ell_2(x)=\frac{x-1}{-1-1}\frac{x-0}{-1-0}=\frac{x(x-1)}{2}=\frac{x^2-x}{2}. \end{align*} Differentiating these three polynomials gives \begin{align*} \ell_0'(x)=x+\frac{1}{2}. \end{align*} \begin{align*} \ell_1'(x)=-2x. \end{align*} \begin{align*} \ell_2'(x)=x-\frac{1}{2}. \end{align*} Since the differentiation matrix is defined by $D_{ij}=\ell_j'(x_i)$, evaluating at $x_0=1$ gives \begin{align*} (D_{00},D_{01},D_{02})=\left(\ell_0'(1),\ell_1'(1),\ell_2'(1)\right)=\left(\frac{3}{2},-2,\frac{1}{2}\right). \end{align*} Evaluating at $x_1=0$ gives \begin{align*} (D_{10},D_{11},D_{12})=\left(\ell_0'(0),\ell_1'(0),\ell_2'(0)\right)=\left(\frac{1}{2},0,-\frac{1}{2}\right). \end{align*} Evaluating at $x_2=-1$ gives \begin{align*} (D_{20},D_{21},D_{22})=\left(\ell_0'(-1),\ell_1'(-1),\ell_2'(-1)\right)=\left(-\frac{1}{2},2,-\frac{3}{2}\right). \end{align*} For $f(x)=x^2$, the nodal data are \begin{align*} (f(x_0),f(x_1),f(x_2))^\top=(1^2,0^2,(-1)^2)^\top=(1,0,1)^\top. \end{align*} Applying $D$ to this vector gives first component \begin{align*} \frac{3}{2}\cdot 1+(-2)\cdot 0+\frac{1}{2}\cdot 1=2. \end{align*} The second component is \begin{align*} \frac{1}{2}\cdot 1+0\cdot 0-\frac{1}{2}\cdot 1=0. \end{align*} The third component is \begin{align*} -\frac{1}{2}\cdot 1+2\cdot 0-\frac{3}{2}\cdot 1=-2. \end{align*} Therefore \begin{align*} D(1,0,1)^\top=(2,0,-2)^\top. \end{align*} Since $f'(x)=2x$, the exact derivative values at the nodes are \begin{align*} (f'(1),f'(0),f'(-1))^\top=(2,0,-2)^\top. \end{align*} The differentiation matrix therefore differentiates the sampled quadratic exactly, because the interpolation space already contains $x^2$. [/example] The example works because the sampled function already lies in the interpolation space; for non-polynomial data, differentiation also amplifies high-frequency error. To know when spectral differentiation remains accurate, we need the spectral differentiation accuracy theorem, which combines Chebyshev coefficient decay with the polynomial growth introduced by differentiating Chebyshev modes. [quotetheorem:6891] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. Analyticity throughout the ellipse interior is essential for the geometric factor; if $f$ has only finitely many derivatives, the coefficient tail is generally algebraic rather than exponential. For a concrete finitely smooth case, $f(x)=|x|$ is continuous but not differentiable at $0$, so its Chebyshev coefficients decay algebraically and the displayed geometric bound is not expected. The derivative order is fixed because the constants and the polynomial power may deteriorate as $m$ increases, and the Chebyshev-Lobatto convention matters because the interpolation and aliasing estimates are tied to those nodes. The theorem does not assert uniform accuracy for derivatives whose order grows with $N$, and it does not remove conditioning issues in the resulting differentiation matrices. For fixed-order derivatives of analytic functions, however, the geometric factor dominates the polynomial amplification. In applications to differential equations, this is the reason spectral collocation can achieve very high accuracy with relatively few nodes when the solution is analytic, as the final example indicates. [example: Collocation View of a Boundary Value Problem] Consider a solution $u$ of $-u''=g$ on $[-1,1]$ with $u(-1)=u(1)=0$, and choose the Chebyshev-Lobatto nodes \begin{align*} x_j=\cos\left(\frac{j\pi}{N}\right),\qquad 0\le j\le N. \end{align*} Let $\ell_0,\dots,\ell_N$ be the corresponding Lagrange basis and let \begin{align*} p(x)=I_Nu(x)=\sum_{j=0}^N u(x_j)\ell_j(x). \end{align*} Differentiating this identity twice gives, by linearity of differentiation, \begin{align*} p''(x)=\sum_{j=0}^N u(x_j)\ell_j''(x). \end{align*} If $D^{(2)}$ denotes the second differentiation matrix with entries \begin{align*} D^{(2)}_{ij}=\ell_j''(x_i), \end{align*} and if $v=(u(x_0),\dots,u(x_N))^\top$, then its $i$th component is \begin{align*} (D^{(2)}v)_i=\sum_{j=0}^N D^{(2)}_{ij}u(x_j). \end{align*} Substituting $D^{(2)}_{ij}=\ell_j''(x_i)$ gives \begin{align*} (D^{(2)}v)_i=\sum_{j=0}^N u(x_j)\ell_j''(x_i). \end{align*} Comparing this with the formula for $p''$ shows that \begin{align*} (D^{(2)}v)_i=p''(x_i)=(I_Nu)''(x_i). \end{align*} The collocation equations therefore replace the differential equation at the interior nodes by \begin{align*} -(D^{(2)}v)_i=g(x_i),\qquad 1\le i\le N-1, \end{align*} while the boundary rows impose \begin{align*} v_0=u(x_0)=u(1)=0 \end{align*} and \begin{align*} v_N=u(x_N)=u(-1)=0. \end{align*} Thus the method is not differentiating nodal data heuristically: it differentiates the unique interpolating polynomial and enforces the differential equation on the chosen grid. If $u$ is analytic in $\Omega_\rho$ and continuous on $\overline{\Omega}_\rho$, then the *[Spectral Differentiation Accuracy for Analytic Functions](/theorems/6891)* theorem with $m=2$ gives constants $C_2>0$ and $c_2>0$ such that \begin{align*} \|u''-(I_Nu)''\|_\infty\le C_2N^{c_2}\rho^{-N}. \end{align*} Since $-u''=g$, the interior residual produced by the interpolant satisfies \begin{align*} |-p''(x_i)-g(x_i)|=|-(I_Nu)''(x_i)+u''(x_i)|. \end{align*} Taking the maximum over the interior nodes is bounded by the uniform norm, so \begin{align*} \max_{1\le i\le N-1}|-p''(x_i)-g(x_i)|\le C_2N^{c_2}\rho^{-N}. \end{align*} The geometric factor $\rho^{-N}$ is the spectral part of the accuracy, while the power of $N$ records the cost of taking two derivatives. [/example] The chapter closes the loop between approximation theory and computation. Orthogonal polynomials provide stable bases, Gaussian quadrature exploits their zeros, Chebyshev expansions turn analyticity into exponential convergence, and Lebesgue constants explain why the nodes used by spectral methods are not arbitrary. Chapter 7 now replaces global high-degree polynomial bases by local spline bases, where stability comes from locality and B-spline structure. Chapter 6 used orthogonal expansions to connect approximation with quadrature, interpolation, and spectral convergence. The next chapter moves away from global bases and toward splines, where locality and B-spline structure provide a different kind of stability. # 7. Splines and B-Splines Splines enter approximation theory at the point where global polynomials become too rigid. Chapters 2 through 6 used algebraic and trigonometric polynomials to approximate smooth functions, but high-degree global interpolation is sensitive to endpoint behaviour and can be expensive to update locally. This chapter replaces a single global polynomial by a chain of low-degree polynomial pieces joined at prescribed knots, giving approximation spaces that combine smoothness, locality, and stable computation. The guiding questions are: how much smoothness should be imposed at the joins, how can the resulting space be represented by a numerically stable basis, and how should splines be chosen when data are noisy rather than exact? The answer begins with piecewise polynomial spaces, passes through B-splines as the canonical local basis, and ends with smoothing splines as variational minimisers balancing fit and curvature. ## Piecewise Polynomial Spaces and Knots The first problem is to keep polynomial approximation local without losing the differentiability needed for analysis and numerical computation. A piecewise polynomial may approximate sharply changing functions better than a single polynomial of the same local degree, but the pieces must be tied together by continuity conditions at the breakpoints. [definition: Knot Sequence] Let $a,b \in \mathbb R$ with $a < b$. A knot sequence on $[a,b]$ is a nondecreasing finite sequence \begin{align*} \tau = (t_0,t_1,\dots,t_m), \qquad a = t_0 \le t_1 \le \dots \le t_m = b. \end{align*} [/definition] Repeated knots are allowed because they encode weaker smoothness at a location. The distinct values of the knots divide $[a,b]$ into subintervals, and the multiplicity of a knot records how strongly neighbouring polynomial pieces are forced to match there. We now use this data to define the actual finite-dimensional approximation space. [definition: Spline Space] Let $r \ge 0$, let $\tau$ be a knot sequence on $[a,b]$, and let \begin{align*} \nu_0<\nu_1<\dots<\nu_N \end{align*} be the distinct knot values. For each interior distinct knot $\nu_j$, choose an integer $\rho_j$ with $-1\le \rho_j\le r-1$. The spline space $S_r(\tau;\rho)$ is the set of functions $s:[a,b]\to\mathbb R$ such that $s|_{(\nu_j,\nu_{j+1})}$ agrees with a polynomial of degree at most $r$ for each $0\le j<N$, and such that $s$ has matching one-sided derivatives through order $\rho_j$ at each interior knot $\nu_j$. [/definition] A common convention is that a knot of multiplicity $q$ in a degree $r$ spline imposes continuity through order $r-q$ at that knot, whenever $1 \le q \le r+1$. Thus simple interior knots give $C^{r-1}$ splines, while a knot of multiplicity $r+1$ permits a jump discontinuity. [example: Piecewise Linear Splines] Let $h_i=x_{i+1}-x_i>0$ for $0\le i<n$. On the interval $[x_i,x_{i+1}]$, define \begin{align*} s(x)=\frac{x_{i+1}-x}{h_i}y_i+\frac{x-x_i}{h_i}y_{i+1}. \end{align*} The two coefficients add to \begin{align*} \frac{x_{i+1}-x}{h_i}+\frac{x-x_i}{h_i}=\frac{x_{i+1}-x_i}{h_i}=1, \end{align*} so $s$ is affine on $[x_i,x_{i+1}]$. At the left endpoint, \begin{align*} s(x_i)=\frac{x_{i+1}-x_i}{h_i}y_i+\frac{x_i-x_i}{h_i}y_{i+1}=y_i, \end{align*} and at the right endpoint, \begin{align*} s(x_{i+1})=\frac{x_{i+1}-x_{i+1}}{h_i}y_i+\frac{x_{i+1}-x_i}{h_i}y_{i+1}=y_{i+1}. \end{align*} Thus neighbouring affine pieces agree at every shared knot $x_i$, so the piecewise definition gives a continuous piecewise linear spline interpolating the data. To prove uniqueness, let $u$ be another continuous function that is affine on each $[x_i,x_{i+1}]$ and satisfies $u(x_i)=y_i$. Then $w=u-s$ is affine on each $[x_i,x_{i+1}]$ and has \begin{align*} w(x_i)=u(x_i)-s(x_i)=y_i-y_i=0 \end{align*} and \begin{align*} w(x_{i+1})=u(x_{i+1})-s(x_{i+1})=y_{i+1}-y_{i+1}=0. \end{align*} Since $w$ is affine on this interval, it has the form $w(x)=\alpha_i x+\beta_i$. The endpoint equations give \begin{align*} \alpha_i x_i+\beta_i=0 \end{align*} and \begin{align*} \alpha_i x_{i+1}+\beta_i=0. \end{align*} Subtracting the first equation from the second gives $\alpha_i(x_{i+1}-x_i)=0$, and $x_{i+1}-x_i>0$, so $\alpha_i=0$; then $\beta_i=0$. Hence $w=0$ on every interval, so $u=s$. The interpolating spline is therefore exactly the polygonal line through the points $(x_i,y_i)$. [/example] This example shows the basic advantage of splines: changing one datum affects only the neighbouring affine pieces if the representation is local. For higher degree splines, the same principle remains useful, but continuity constraints couple more coefficients and require a dimension count before interpolation conditions can be posed responsibly. [quotetheorem:6892] [citeproof:6892] The dimension count predicts how many basis functions should exist and how many interpolation conditions can be imposed, but it depends on the hypotheses in a concrete way. For instance, take $r=1$ on $[0,1]$ with a single interior knot at $1/2$. With a simple knot the space of continuous piecewise affine splines has dimension $3$, while allowing a repeated knot and no continuity condition gives two independent affine pieces and dimension $4$. If the displayed chain is weakened to $0=x_0<x_1=x_2< x_3=1$ without a multiplicity convention, the phrase "polynomial on each subinterval" includes the collapsed interval $[x_1,x_2]$ and the coefficient count no longer names a unique spline space. The theorem also does not assert existence or uniqueness of an interpolating spline for arbitrary data; it only gives the size of the ambient space. For cubic splines with simple knots this leaves three more degrees of freedom than the number of subintervals, so boundary conditions are needed before interpolation becomes unique. [example: Natural Cubic Spline Interpolation] Let $a=x_0<\dots<x_n=b$, let $h_i=x_{i+1}-x_i>0$, and let $M_i$ denote the intended value of $s''(x_i)$. The natural boundary conditions are $M_0=0$ and $M_n=0$. On $[x_i,x_{i+1}]$, define \begin{align*} s_i(x)=\frac{M_i(x_{i+1}-x)^3}{6h_i}+\frac{M_{i+1}(x-x_i)^3}{6h_i}+\left(\frac{y_i}{h_i}-\frac{M_i h_i}{6}\right)(x_{i+1}-x)+\left(\frac{y_{i+1}}{h_i}-\frac{M_{i+1}h_i}{6}\right)(x-x_i). \end{align*} At $x=x_i$, all terms vanish except the first cubic correction and the coefficient of $x_{i+1}-x_i$, so \begin{align*} s_i(x_i)=\frac{M_i h_i^3}{6h_i}+\left(\frac{y_i}{h_i}-\frac{M_i h_i}{6}\right)h_i=\frac{M_i h_i^2}{6}+y_i-\frac{M_i h_i^2}{6}=y_i. \end{align*} Similarly, \begin{align*} s_i(x_{i+1})=\frac{M_{i+1}h_i^3}{6h_i}+\left(\frac{y_{i+1}}{h_i}-\frac{M_{i+1}h_i}{6}\right)h_i=y_{i+1}. \end{align*} Differentiating twice gives \begin{align*} s_i''(x)=\frac{M_i(x_{i+1}-x)}{h_i}+\frac{M_{i+1}(x-x_i)}{h_i}, \end{align*} so $s_i''(x_i)=M_i$ and $s_i''(x_{i+1})=M_{i+1}$. Thus adjacent pieces automatically have matching second derivatives once they share the same $M_i$. It remains to impose matching first derivatives at the interior knots. Differentiating once gives \begin{align*} s_i'(x)=-\frac{M_i(x_{i+1}-x)^2}{2h_i}+\frac{M_{i+1}(x-x_i)^2}{2h_i}-\frac{y_i}{h_i}+\frac{M_i h_i}{6}+\frac{y_{i+1}}{h_i}-\frac{M_{i+1}h_i}{6}. \end{align*} At the right endpoint of $[x_{i-1},x_i]$ this becomes \begin{align*} s_{i-1}'(x_i)=\frac{M_i h_{i-1}}{2}-\frac{y_{i-1}}{h_{i-1}}+\frac{M_{i-1}h_{i-1}}{6}+\frac{y_i}{h_{i-1}}-\frac{M_i h_{i-1}}{6}. \end{align*} Combining the $M_i$ terms, \begin{align*} s_{i-1}'(x_i)=\frac{y_i-y_{i-1}}{h_{i-1}}+\frac{h_{i-1}}{6}M_{i-1}+\frac{h_{i-1}}{3}M_i. \end{align*} At the left endpoint of $[x_i,x_{i+1}]$, \begin{align*} s_i'(x_i)=-\frac{M_i h_i}{2}-\frac{y_i}{h_i}+\frac{M_i h_i}{6}+\frac{y_{i+1}}{h_i}-\frac{M_{i+1}h_i}{6}. \end{align*} Combining the $M_i$ terms, \begin{align*} s_i'(x_i)=\frac{y_{i+1}-y_i}{h_i}-\frac{h_i}{3}M_i-\frac{h_i}{6}M_{i+1}. \end{align*} The condition $s_{i-1}'(x_i)=s_i'(x_i)$ therefore gives, for $1\le i\le n-1$, \begin{align*} h_{i-1}M_{i-1}+2(h_{i-1}+h_i)M_i+h_iM_{i+1}=6\left(\frac{y_{i+1}-y_i}{h_i}-\frac{y_i-y_{i-1}}{h_{i-1}}\right). \end{align*} Together with $M_0=M_n=0$, these are $n-1$ linear equations for the $n-1$ unknowns $M_1,\dots,M_{n-1}$. The coefficient matrix is tridiagonal, and its $i$th diagonal entry is $2(h_{i-1}+h_i)$ while the sum of the off-diagonal entries in that row is at most $h_{i-1}+h_i$; since each $h_i>0$, the matrix is strictly diagonally dominant and hence nonsingular. Thus the second derivatives $M_i$ are uniquely determined, and the displayed formula then gives the unique natural cubic spline interpolating the data. [/example] Natural cubic splines are global in the sense that each datum influences the linear system, but their polynomial pieces remain local and their smoothness is controlled. The next goal is to replace the coefficient description by basis functions that are themselves local. ## B-Spline Bases and the Cox-de Boor Recursion The main representation problem is to find basis functions that reflect the locality of the knots. Monomials on each interval are poorly scaled and must be constrained across knots, while a B-spline basis builds the continuity conditions into the basis functions from the start. [definition: B-Spline of Degree Zero] Given a nondecreasing knot sequence $(t_i)$, the degree zero B-splines are the functions $N_{i,0}:[a,b]\to\mathbb R$ satisfying \begin{align*} N_{i,0}(x)=1 \quad \text{for } t_i\le x<t_{i+1}, \end{align*} and $N_{i,0}(x)=0$ otherwise. [/definition] These functions are the characteristic functions of knot intervals. To obtain continuous and differentiable splines, we need a recursive averaging process that replaces sharp characteristic functions by higher-degree local functions. [definition: Cox-de Boor Recursion] For $r\ge 1$, the B-splines associated with a knot sequence $(t_i)$ are functions $N_{i,r}:[a,b]\to\mathbb R$ defined recursively by \begin{align*} N_{i,r}(x) &= \frac{x-t_i}{t_{i+r}-t_i}N_{i,r-1}(x) + \frac{t_{i+r+1}-x}{t_{i+r+1}-t_{i+1}}N_{i+1,r-1}(x), \end{align*} where any term with zero denominator is interpreted as $0$. [/definition] The recursion is local: $N_{i,r}$ is built only from $N_{i,r-1}$ and $N_{i+1,r-1}$. It follows by induction that $N_{i,r}$ is supported in $[t_i,t_{i+r+1}]$. The next theorem says that these locally constructed functions are not merely useful examples; they give the canonical basis of the whole spline space. [quotetheorem:6893] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. This theorem makes B-splines the analogue of Bernstein polynomials for spline spaces: they are nonnegative, local, and adapted to the geometry of the approximation problem. The extended endpoint knots are not cosmetic. For degree $1$ on $[0,1]$, using only the knots $0<t_1<1$ without endpoint repetition leaves no active hat function that can take arbitrary nonzero endpoint values, so constants on the closed interval are not spanned by the resulting interior functions. Excessive multiplicity also changes the statement: for cubic splines, an interior knot of multiplicity $5$ would force denominators over completely collapsed knot spans and cannot be reconciled with the rule "multiplicity $q$ gives continuity through order $3-q$." Ordering is essential as well; replacing an ordered vector by $(0,1,1/2,1)$ makes supports such as $[t_i,t_{i+r+1}]$ fail to describe intervals in the geometric order of $[0,1]$. The theorem also does not say that every arbitrary family of local piecewise polynomials is a B-spline basis; it identifies the basis tied to this ordered knot vector and its multiplicities. Their most important structural identity is the [partition of unity](/page/Partition%20of%20Unity) property. [quotetheorem:6894] [citeproof:6894] Partition of unity is the reason B-spline expansions preserve constants and behave well under convex combinations. The finite active index set and endpoint convention matter: if one omits the endpoint B-splines on an open knot vector, the sum near $a$ or $b$ can be less than $1$, so constants are not reproduced at the boundary. The ordered knot-vector hypothesis is also needed. If a degree-zero "knot vector" is listed as $0,1,1/2$, the half-open intervals do not partition $[0,1]$, and a point such as $3/4$ may be counted by the wrong interval or not counted by the intended geometric ordering. The active-index range is part of the formula: for degree $1$ with clamped knots $0,0,1,1$, summing only the single basis function $N_{0,1}$ gives $1-x$ rather than $1$, while including both active basis functions gives $(1-x)+x=1$. The identity also does not control derivatives or approximation error by itself; it controls affine combinations of coefficients. If coefficients are sampled from a bounded function, the spline cannot develop large artificial oscillations merely from the basis. [example: Local Refinement by Knot Insertion] Let \begin{align*} s(x)=\sum_i c_iN_{i,3}(x) \end{align*} be the cubic B-spline expansion for the ordered knot vector $(t_i)$, and insert a new knot $\xi$ with $t_j<\xi<t_{j+1}$. Let $(\widehat t_i)$ be the refined knot vector and let $\widehat N_{i,3}$ be its cubic B-splines. The insertion affects exactly those old basis functions whose supports contain $\xi$, namely those with \begin{align*} t_i<\xi<t_{i+4}. \end{align*} Since $t_j<\xi<t_{j+1}$, this condition is equivalent to \begin{align*} i\le j \quad \text{and} \quad i+4\ge j+1, \end{align*} so the affected indices are \begin{align*} j-3\le i\le j. \end{align*} For these indices define \begin{align*} \alpha_i=\frac{\xi-t_i}{t_{i+3}-t_i}, \end{align*} with the convention that a zero denominator gives the corresponding term value $0$, as in the Cox-de Boor recursion. Knot insertion rewrites the same spline in the refined basis with coefficients \begin{align*} \widehat c_i=c_i \quad \text{for } i\le j-3, \end{align*} \begin{align*} \widehat c_i=\alpha_i c_i+(1-\alpha_i)c_{i-1} \quad \text{for } j-2\le i\le j, \end{align*} and \begin{align*} \widehat c_i=c_{i-1} \quad \text{for } i\ge j+1. \end{align*} Indeed, each old cubic B-spline is split between two refined cubic B-splines: \begin{align*} N_{i,3}(x)=\alpha_i\widehat N_{i,3}(x)+(1-\alpha_{i+1})\widehat N_{i+1,3}(x). \end{align*} Substituting this identity into the old expansion gives \begin{align*} s(x)=\sum_i c_i\alpha_i\widehat N_{i,3}(x)+\sum_i c_i(1-\alpha_{i+1})\widehat N_{i+1,3}(x). \end{align*} In the second sum, put $k=i+1$, so $i=k-1$: \begin{align*} \sum_i c_i(1-\alpha_{i+1})\widehat N_{i+1,3}(x)=\sum_k c_{k-1}(1-\alpha_k)\widehat N_{k,3}(x). \end{align*} Therefore the coefficient of $\widehat N_{k,3}$ is \begin{align*} c_k\alpha_k+c_{k-1}(1-\alpha_k). \end{align*} This is exactly the displayed formula for $\widehat c_k$ on the affected range, while outside that range one of the two terms is absent because the corresponding support does not meet $\xi$. Thus the refined spline space contains the old spline without changing the represented function. Knot insertion changes only the local coefficients attached to supports meeting the inserted knot; away from $\xi$, the coefficients are merely copied, with an index shift after the insertion point. [/example] Knot insertion is the computational expression of locality. It lets an approximation be refined near a singularity, boundary layer, or rapidly varying region without rebuilding the entire representation from the beginning. For interpolation and collocation, however, locality alone is not enough: the matrices $(N_{i,r}(x_j))$ built from B-spline values must not introduce artificial sign changes or unstable cancellations when the sites $x_j$ are ordered. The next theorem supplies the missing stability mechanism by controlling all ordered minors of these collocation matrices, which is the algebraic form behind variation-diminishing behaviour and stable B-spline interpolation. [quotetheorem:6895] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. Total positivity is a structural stability statement, and its hypotheses are part of the statement rather than decoration. If the sites are not ordered, a row swap changes the sign of a determinant; if sites are placed at endpoints with ambiguous half-open conventions, limiting arguments are needed; and if the functions are rescaled with signs, nonnegativity of minors is lost. The theorem also does not say that every collocation matrix is nonsingular: zero minors may occur when too few basis functions are active at the chosen sites. What it does provide is the sign structure behind variation-diminishing behaviour and the good conditioning of many interpolation and collocation schemes built from B-splines. ## Approximation and Stability of Spline Expansions Once the basis is available, the next question is whether splines approximate smooth functions at the expected rate and whether the coefficients provide a stable numerical representation. The approximation result is due to Schoenberg, and the stability estimate is usually associated with de Boor's analysis of B-spline bases. [quotetheorem:6896] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. This theorem is the spline analogue of polynomial density, but it is constructive and local. The mesh assumption cannot be dropped: with a fixed partition and fixed degree, the spline space is finite-dimensional and therefore cannot be dense in $C([a,b])$; for example, on a fixed partition of $[0,1]$, the functions $\sin(2\pi kx)$ for arbitrarily large $k$ cannot all lie in the same finite-dimensional spline space. The continuity hypothesis is also tied to uniform approximation by the quasi-interpolants used here: for the step function $\mathbb{1}_{[1/2,1]}$, any local averaging construction near $1/2$ samples values from both sides of the jump and cannot force the uniform error to zero without placing a break exactly at the discontinuity. Fixed degree limits the Jackson order: piecewise constants can approximate continuous functions uniformly as the mesh shrinks, but they cannot reproduce a nonzero linear function exactly on every interval unless the degree is at least $1$. The smoothness clause likewise needs the stated derivative hypothesis; for $f(x)=|x|$, a bound involving $\omega(f',h)$ is unavailable because $f'$ does not exist at $0$. This explains why spline methods are effective for functions whose smoothness varies across the interval. [example: Approximation of a Function with a Corner] Let $f(x)=|x|$ on $[-1,1]$, and place the knot at $0$. On the two knot intervals define \begin{align*} s(x)=-x \quad \text{for } -1\le x\le 0, \qquad s(x)=x \quad \text{for } 0\le x\le 1. \end{align*} Each piece is a polynomial of degree $1$, hence also of degree at most $3$. At the shared knot, \begin{align*} \lim_{x\to 0^-}s(x)=-0=0 \end{align*} and \begin{align*} \lim_{x\to 0^+}s(x)=0, \end{align*} so the two pieces match continuously at $0$. For $-1\le x\le 0$ we have $|x|=-x=s(x)$, and for $0\le x\le 1$ we have $|x|=x=s(x)$. Therefore \begin{align*} \|f-s\|_\infty=\sup_{-1\le x\le 1}|f(x)-s(x)|=0. \end{align*} The corner appears in the derivative rather than in the function value. On the left interval $s'(x)=-1$, while on the right interval $s'(x)=1$, so \begin{align*} \lim_{x\to 0^-}s'(x)=-1 \end{align*} and \begin{align*} \lim_{x\to 0^+}s'(x)=1. \end{align*} Thus a knot placed at the nonsmooth point allows the spline to keep the change in slope local. Increasing the degree of one global polynomial cannot create an actual derivative jump at $0$, but the spline represents the corner exactly because its polynomial pieces are allowed to meet there with only the required continuity. [/example] Approximation by splines is useful only if the coefficient map is stable. A basis could span the correct space and still lead to large coefficient cancellations; de Boor's theorem rules this out for B-splines in the uniform norm up to constants depending only on the degree. [quotetheorem:6897] [citeproof:6897] The theorem says that B-spline coefficients are meaningful numerical data, not merely formal coordinates. The restrictions are necessary. Normalisation cannot be omitted: replacing one basis function by $\varepsilon N_{i,r}$ and using coefficient $1/\varepsilon$ leaves the represented function $N_{i,r}$ unchanged but makes the coefficient maximum arbitrarily large. The multiplicity restriction prevents degenerate basis elements; for degree $1$, three identical consecutive interior knots create zero knot spans in the recursion and can leave a nominal basis function identically zero, so a nonzero coefficient may represent the zero function. Fixed degree is also part of the uniformity: even in polynomial bases related to high-degree splines, increasingly high degree allows sharper cancellations and the constants controlling coefficients cannot be chosen independently of $r$. The result is not an approximation theorem; it compares a spline with its own coordinates, but it does not say that the spline is close to an external function. Within those limits, it justifies algorithms that manipulate coefficients directly, including refinement, differentiation, and least-squares fitting. ## Smoothing Splines and Penalised Curvature Interpolation is inappropriate when data contain measurement error: exact fitting may reproduce noise rather than the underlying trend. The smoothing spline replaces the interpolation problem by an optimisation problem that penalises both residual error and curvature. [definition: Cubic Smoothing Spline Functional] Let \begin{align*} a\le x_1<\dots<x_n\le b, \end{align*} where $n\ge2$. Let $y_1,\dots,y_n\in\mathbb R$, and let $\lambda>0$. The cubic smoothing spline functional is the map $J_\lambda:H^2([a,b])\to\mathbb R$ defined by \begin{align*} J_\lambda[s] &= \sum_{i=1}^n |y_i-s(x_i)|^2 +\lambda \int_a^b |s''(x)|^2\,dx. \end{align*} [/definition] This functional raises two questions that must be answered before it can be used: whether a unique minimiser exists, and whether that minimiser has a computable spline form. The next theorem answers both questions and identifies the curvature penalty as the source of cubic spline structure. [quotetheorem:6898] [citeproof:6898] This theorem explains why smoothing splines are not an ad hoc numerical recipe. The assumptions are doing real work. If $\lambda=0$, any $H^2$ function interpolating the data has objective value $0$, so uniqueness is lost whenever at least one interpolant exists. If there is only one data site, every affine function passing through that one value has zero curvature penalty and the same zero residual, so the minimiser is not unique; this is why at least two distinct sites are required. If two data sites coincide with different values, for example $x_1=x_2$ but $y_1\ne y_2$, the residual term asks for two incompatible point values at the same location and the data no longer describe independent samples of a function graph. If the sites are not ordered, the set of knots is the same after sorting, but the statement "knots among $x_1,\dots,x_n$" no longer matches the interval-by-interval variational argument until the sites are reordered. The space $H^2([a,b])$ is also necessary for this formulation: in $L^2([a,b])$ point values are not defined for equivalence classes, while in $H^1([a,b])$ the curvature penalty $\int_a^b |s''(x)|^2\,dx$ is not part of the normed data. The theorem also does not choose the statistically optimal smoothing parameter. It says that once $\lambda$ is fixed, the minimiser is forced by the curvature penalty. [example: Fitting Noisy Data with Cubic Smoothing Splines] Suppose $y_i=g(x_i)+\varepsilon_i$ with $\mathbb E[\varepsilon_i]=0$, and write \begin{align*} R(s)=\sum_{i=1}^n |y_i-s(x_i)|^2 \end{align*} and \begin{align*} E(s)=\int_a^b |s''(x)|^2\,dx. \end{align*} Then the smoothing spline minimises \begin{align*} J_\lambda[s]=R(s)+\lambda E(s). \end{align*} At a data site $x_i$, \begin{align*} y_i-s(x_i)=g(x_i)-s(x_i)+\varepsilon_i. \end{align*} Expanding the expected squared residual gives \begin{align*} \mathbb E|y_i-s(x_i)|^2=\mathbb E\bigl[(g(x_i)-s(x_i)+\varepsilon_i)^2\bigr]. \end{align*} Since $g(x_i)-s(x_i)$ is deterministic, \begin{align*} \mathbb E|y_i-s(x_i)|^2=(g(x_i)-s(x_i))^2+2(g(x_i)-s(x_i))\mathbb E[\varepsilon_i]+\mathbb E[\varepsilon_i^2]. \end{align*} Using $\mathbb E[\varepsilon_i]=0$, this becomes \begin{align*} \mathbb E|y_i-s(x_i)|^2=(g(x_i)-s(x_i))^2+\mathbb E[\varepsilon_i^2]. \end{align*} Thus the residual term pulls $s(x_i)$ toward the noisy datum $y_i$, while the curvature term penalises changes in slope. If $\lambda$ is small, the term $\lambda E(s)$ has small weight compared with $R(s)$, so minimising $J_\lambda$ favours splines with small residuals at the data sites. If $\lambda$ is large, compare the minimiser $s_\lambda$ with any affine function $p(x)=\alpha+\beta x$. Since $p''(x)=0$ for every $x$, we have \begin{align*} E(p)=\int_a^b |p''(x)|^2\,dx=\int_a^b 0\,dx=0. \end{align*} Minimality gives \begin{align*} R(s_\lambda)+\lambda E(s_\lambda)\le R(p)+\lambda E(p)=R(p). \end{align*} Because $R(s_\lambda)\ge 0$, it follows that \begin{align*} \lambda E(s_\lambda)\le R(p). \end{align*} Dividing by $\lambda>0$ gives \begin{align*} E(s_\lambda)\le \frac{R(p)}{\lambda}. \end{align*} For fixed $p$, the right-hand side tends to $0$ as $\lambda\to\infty$, so large $\lambda$ forces the curvature energy toward zero. Functions with zero curvature energy satisfy $s''=0$ almost everywhere and hence are affine, so the large-$\lambda$ limit is the affine function that minimises \begin{align*} \sum_{i=1}^n |y_i-(\alpha+\beta x_i)|^2. \end{align*} The smoothing parameter therefore moves the fit between close tracking of the noisy observations and the least-squares affine trend. [/example] The smoothing parameter is therefore a modelling choice as well as a numerical parameter. In statistical applications it is often selected by cross-validation, while in deterministic approximation it may be chosen from an expected noise level or desired smoothness scale. [remark: Interpolation and Smoothing as Endpoints] Natural cubic interpolation and cubic smoothing splines are two ends of the same philosophy. Interpolation enforces exact data constraints and chooses the minimum-curvature interpolant through boundary conditions; smoothing relaxes the data constraints and penalises curvature directly. Both constructions lead to cubic splines because the second derivative is the quantity being controlled. [/remark] Splines and B-splines complete the local approximation part of the course. They retain the approximation-theoretic themes of density, stability, and error control, while introducing locality as a first-class principle. Chapter 8 turns to a different response to the same obstacle: rational approximants adapt not by moving knots, but by moving poles. Chapter 7 introduced splines as a local alternative to global polynomial approximation, emphasizing knot placement, smoothness, and compact support. Chapter 8 now responds to the same limitations with rational functions, whose movable poles give another flexible nonlinear approximation family. # 8. Rational Approximation and Pade Methods This chapter continues the course's progression from linear polynomial and spline methods to nonlinear approximation families. The preceding chapters developed uniform approximation, Chebyshev ideas, polynomial convergence, and local spline adaptation; here those tools are reused, but the approximants are allowed to have movable denominators. The main prerequisites are polynomial approximation in the uniform norm, Taylor series, and the basic language of holomorphic and meromorphic functions. The goals are to understand how rational functions encode singularities, how Padé approximants extract information from Taylor data, and how minimax approximation changes when poles are part of the choice. Rational approximation begins from a limitation of polynomial approximation: polynomials have no finite poles, so they often spend many degrees imitating singularities that rational functions can represent directly. This chapter studies approximants represented by \begin{align*} r=\frac{p}{q}, \end{align*} where both numerator and denominator are chosen, and the resulting approximation problem becomes nonlinear. The central questions are how poles encode singular behaviour, how Taylor data determine Padé approximants, and how minimax ideas extend from polynomials to rational families. ## Rational Functions as Nonlinear Approximants What changes when the approximating family is no longer a linear subspace? For polynomials of degree at most $n$, the coefficients are linear parameters and the error $f-p$ depends linearly on them. For rational functions, the denominator moves the poles, so a small change in coefficients can move a singularity across the approximation set and dramatically alter the error. [definition: Rational Function of Type] Let $K \neq \varnothing$ be a compact subset of $\mathbb C$. A rational function of type $(m,n)$ on $K$ is a map \begin{align*} r:K\to\mathbb C,\qquad r(z)=\frac{p(z)}{q(z)}, \end{align*} where $p,q \in \mathbb C[z]$, $\deg p \le m$, $\deg q \le n$, and $q(z) \neq 0$ for all $z \in K$. [/definition] The exclusion of poles on $K$ is part of the approximation problem, not a cosmetic condition. If a pole lies on the compact set, the uniform norm is infinite; if a pole lies just outside $K$, it may represent nearby singular behaviour very efficiently. [example: A Pole Beats Many Polynomial Terms] Let \begin{align*} f:[-1,1]\to\mathbb R,\qquad f(x)=\frac{1}{1+25x^2}. \end{align*} Its complex singularities are obtained by solving the denominator equation \begin{align*} 1+25z^2=0. \end{align*} Subtracting $1$ and dividing by $25$ gives \begin{align*} z^2=-\frac{1}{25}. \end{align*} Hence \begin{align*} z=\pm \frac{i}{5}. \end{align*} These poles lie a distance $1/5$ from the real interval, so the [analytic continuation](/page/Analytic%20Continuation) of $f$ has nearby singularities even though $f$ is smooth on $[-1,1]$. The same function is represented exactly by a rational function of type $(0,2)$: \begin{align*} r(x)=\frac{p(x)}{q(x)},\qquad p(x)=1,\qquad q(x)=1+25x^2. \end{align*} Here $\deg p=0$ and $\deg q=2$. Also, for every $x\in[-1,1]$, \begin{align*} 25x^2\ge 0. \end{align*} Therefore \begin{align*} q(x)=1+25x^2\ge 1, \end{align*} so $q$ has no zero on $[-1,1]$. Thus $r$ is an admissible type $(0,2)$ rational approximant and satisfies $r=f$ exactly on the interval. The example shows the basic advantage of rational approximation: one quadratic denominator can encode the nearby pair of poles that polynomial approximants must imitate indirectly. [/example] This example also exposes a bookkeeping problem: the displayed type records the degrees before cancellation, while approximation theory often needs the number of effective parameters after cancellation. To separate genuine pole freedom from removable common factors, we record the defect of a representation. [definition: Defect of a Rational Representation] Let $K \neq \varnothing$ be a compact subset of $\mathbb C$, and let $r:K\to\mathbb C$ be a rational function of type $(m,n)$ represented by \begin{align*} r(z)=\frac{p(z)}{q(z)}. \end{align*} Here $p,q \in \mathbb C[z]$, $q \not\equiv 0$, and $p$ and $q$ have a greatest common divisor of degree $d$. The defect of the representation is $d$. [/definition] After cancellation, the same function has type at most $(m-d,n-d)$; if the displayed numerator and denominator had full nominal degrees $m$ and $n$, then the reduced representation has exactly those bounds. Since defective representatives lose parameters, the next notion singles out the nondegenerate case in which the nominal type matches the actual numerator and denominator degrees. [definition: Normal Rational Function] Let $K\subset \mathbb C$ be nonempty and compact. A map $r:K\to\mathbb C$ represented by \begin{align*} r(z)=\frac{p(z)}{q(z)} \end{align*} is normal of type $(m,n)$ on $K$ if $p,q\in\mathbb C[z]$, $\deg p=m$, $\deg q=n$, $p$ and $q$ are coprime, and $q(z)\neq 0$ for every $z\in K$. [/definition] Normality is the rational analogue of using the full polynomial degree. It is the hypothesis under which dimension counting, alternation, and local perturbation arguments match the expected number $m+n+1$ of effective free parameters after fixing the scale of $q$. [remark: Why Rational Approximation Is Nonlinear] If $q$ is fixed, then the problem of choosing $p$ to minimise \begin{align*} \left\|f-\frac{p}{q}\right\|_K \end{align*} is a linear approximation problem after multiplying by $q$. Once $q$ is also chosen, the set of possible errors is no longer an affine space. This is why the existence and uniqueness theory for best rational approximants is subtler than the polynomial theory from earlier chapters. [/remark] ## Padé Approximation from Taylor Data Suppose the available information about $f$ is not its values on an interval but its Taylor coefficients at a point. Can a rational function match more Taylor coefficients than a polynomial with the same number of numerator coefficients? Padé approximation answers this by using the denominator coefficients to force additional cancellation in the [power series](/page/Power%20Series) error. [definition: Padé Approximant] Let \begin{align*} f(z)=\sum_{k=0}^{\infty}a_k z^k \end{align*} be a formal power series at $0$. A Padé approximant of type $(m,n)$ to $f$ is the germ at $0$ of a map $r_{m,n}:U\to\mathbb C$ on some neighbourhood $U$ of $0$, represented by \begin{align*} r_{m,n}(z)=\frac{p(z)}{q(z)}. \end{align*} Here $\deg p\le m$, $\deg q\le n$, $q(0)=1$, and \begin{align*} q(z)f(z)-p(z)=O(z^{m+n+1}) \end{align*} as $z\to 0$. [/definition] The normalisation $q(0)=1$ removes the scalar ambiguity in the quotient representation. The defining condition suggests a natural existence question: do the finitely many Taylor-matching equations always have a rational solution of the prescribed type? [quotetheorem:6899] [citeproof:6899] This theorem is linear algebra in approximation-theoretic form, but the normalisation is a real additional condition rather than a harmless convention. For instance, if $m=0$, $n=1$, and $f(z)=z$, the equations with $q(0)=1$ would require \begin{align*} (1+q_1z)z-p_0=O(z^2), \end{align*} whose coefficient of $z$ cannot be cancelled by a constant numerator. The homogeneous theorem still has the solution $q(z)=z$, $p(z)=0$, but this denominator vanishes at the expansion point, so it does not define the normalised Padé approximant. Thus Padé tables must distinguish guaranteed homogeneous solutions from regular entries where $q(0)\neq 0$ and the rational quotient is genuinely defined near $0$. The theorem also does not assert uniqueness of a normalised approximant, nor does it imply convergence of any row, column, or diagonal of the Padé table; those are separate analytic questions depending on the singularities of the represented function. [example: The First Padé Approximant to the Exponential] For $f(z)=e^z$, seek a type $(1,1)$ Padé approximant with normalised denominator \begin{align*} r_{1,1}(z)=\frac{a_0+a_1z}{1+b_1z}. \end{align*} The type $(1,1)$ Padé condition is \begin{align*} (1+b_1z)e^z-(a_0+a_1z)=O(z^3). \end{align*} Using the Taylor expansion \begin{align*} e^z=1+z+\frac{z^2}{2}+O(z^3), \end{align*} we multiply by $1+b_1z$ and keep all terms through degree $2$: \begin{align*} (1+b_1z)e^z=(1+b_1z)\left(1+z+\frac{z^2}{2}+O(z^3)\right). \end{align*} Expanding the product gives \begin{align*} (1+b_1z)e^z=1+z+\frac{z^2}{2}+b_1z+b_1z^2+O(z^3). \end{align*} Combining like powers of $z$, \begin{align*} (1+b_1z)e^z=1+(1+b_1)z+\left(\frac12+b_1\right)z^2+O(z^3). \end{align*} Therefore \begin{align*} (1+b_1z)e^z-(a_0+a_1z)=(1-a_0)+(1+b_1-a_1)z+\left(\frac12+b_1\right)z^2+O(z^3). \end{align*} For this expression to be $O(z^3)$, the coefficients of $1$, $z$, and $z^2$ must vanish: \begin{align*} 1-a_0=0,\qquad 1+b_1-a_1=0,\qquad \frac12+b_1=0. \end{align*} Thus $a_0=1$, $b_1=-1/2$, and \begin{align*} a_1=1+b_1=1-\frac12=\frac12. \end{align*} Hence \begin{align*} r_{1,1}(z)=\frac{1+\frac12 z}{1-\frac12 z}. \end{align*} This rational function agrees with $e^z$ through the quadratic Taylor term because the difference $(1-\frac12 z)e^z-(1+\frac12 z)$ has no terms of degree $0$, $1$, or $2$. [/example] This computation shows that the denominator equations are the hidden core of the construction. To compare Padé approximation with orthogonality and quadrature, it is useful to rewrite those equations as moment conditions on the Taylor coefficients. [definition: Moment Matching Form of Padé Conditions] Let \begin{align*} f(z)=\sum_{k=0}^{\infty}a_kz^k,\qquad q(z)=\sum_{j=0}^n q_jz^j \end{align*} with $q_0=1$. The Padé denominator equations of type $(m,n)$ are \begin{align*} \sum_{j=0}^{n} q_j a_{k-j}=0,\qquad k=m+1,\dots,m+n, \end{align*} where $a_l=0$ for $l<0$. [/definition] These equations have the same structure as orthogonality conditions: the denominator is chosen so that a finite block of moments vanishes. This viewpoint connects Padé approximation to continued fractions, Gaussian quadrature, and Krylov methods. [remark: Continued Fractions] For many special functions, successive Padé approximants appear as convergents of a continued fraction. The continued fraction organises the same Taylor information recursively, while the Padé table organises it by numerator and denominator degree. This is especially useful for Stieltjes functions, where positivity of moments gives strong convergence and pole-location information. [/remark] The existence theorem does not by itself guarantee convergence of the Padé table. Poles may drift, cancel with nearby zeros, or converge to sets that represent [branch cuts](/page/Branch%20Cuts) rather than isolated singularities. [example: Padé Poles for Functions with Branch Points] Consider $f(z)=\sqrt{1+z}$ as the branch determined near $0$ by $f(0)=1$. Its singular point occurs where the expression under the square root vanishes: \begin{align*} 1+z=0. \end{align*} Subtracting $1$ gives \begin{align*} z=-1. \end{align*} Thus the nearest obstruction to analytic continuation from $0$ is at $z=-1$. This obstruction is not a pole. If it were a pole of order $k$, then $(z+1)^k\sqrt{1+z}$ would have a nonzero finite limit as $z\to -1$. Writing $w=z+1$, we get \begin{align*} (z+1)^k\sqrt{1+z}=w^k\sqrt{w}=w^{k+\frac12}. \end{align*} For every integer $k\ge 0$, this tends to $0$ along positive real $w$, not to a nonzero finite limit. Also, analytic continuation once around $w=0$ changes $\sqrt{w}$ to $-\sqrt{w}$, so no single-valued [meromorphic function](/page/Meromorphic%20Function) near $z=-1$ has this local behaviour. A diagonal Padé approximant is still a rational function, so its denominator contributes only finitely many isolated poles. A single isolated pole cannot reproduce the branch point at $-1$. Instead, as the diagonal degrees grow, the approximants use many simple poles; these poles tend to accumulate along a curve representing the usual branch cut from $-1$ toward $-\infty$. Away from that cut, the cluster of poles acts as a finite rational substitute for the branch obstruction, so the approximants can converge well on compact sets that stay a positive distance from the cut. [/example] ## Convergence Near Meromorphic Singularities When does Padé approximation recover the singularities of a function rather than producing misleading poles? The most stable case is meromorphic continuation with only finitely many poles in a disk. In that setting the denominator degree can match the number of poles, and convergence becomes a precise theorem. [quotetheorem:6900] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. This theorem explains why fixed denominator degree is appropriate when the function has a fixed finite number of poles before the next analytic obstruction. Each hypothesis is doing work. Meromorphicity rules out branch points such as $\sqrt{1+z}$, where a fixed finite denominator cannot encode the cut and pole clusters must grow with the degree. The requirement of exactly $n$ poles is also essential: if the denominator degree is too small, some pole cannot be represented; if it is too large, extra denominator factors can drift or cancel and the limiting denominator need not be determined by the singularities alone. Boundary singularities at $|z|=R$ are allowed because the conclusion is only on compact subsets strictly inside the disk, but $R$ must be chosen before any additional interior singularity is crossed. Finally, convergence can only be uniform on compact subsets avoiding the poles, since $f$ is unbounded in every neighbourhood of a pole while each approximant is finite away from its own denominator zeros. [example: Recovering a Pair of Complex Poles] Let \begin{align*} f(z)=\frac{1}{1+z^2} \end{align*} near $0$. The poles are the zeros of the denominator: \begin{align*} 1+z^2=0. \end{align*} Thus \begin{align*} z^2=-1, \end{align*} so \begin{align*} z=\pm i. \end{align*} We show that the type $(m,2)$ Padé approximant is exactly $f$ for every $m\ge 0$. Take \begin{align*} q(z)=1+z^2,\qquad p(z)=1. \end{align*} Then $\deg p=0\le m$, $\deg q=2$, and $q(0)=1$. Moreover \begin{align*} q(z)f(z)-p(z)=(1+z^2)\frac{1}{1+z^2}-1. \end{align*} The first term equals $1$ wherever the germ is defined, hence \begin{align*} q(z)f(z)-p(z)=1-1=0. \end{align*} Therefore \begin{align*} q(z)f(z)-p(z)=O(z^{m+3}), \end{align*} which is exactly the type $(m,2)$ Padé condition. Conversely, suppose a normalised type $(m,2)$ Padé pair satisfies \begin{align*} q(z)f(z)-p(z)=O(z^{m+3}). \end{align*} Multiplying by $1+z^2$, which is nonzero at $0$, gives \begin{align*} q(z)-p(z)(1+z^2)=O(z^{m+3}). \end{align*} The left side is a polynomial of degree at most $m+2$. Since it has a zero of order at least $m+3$ at $0$, it must be the zero polynomial. Hence \begin{align*} q(z)=p(z)(1+z^2). \end{align*} Because $\deg q\le 2$, the polynomial $p$ must be constant. Since $q(0)=1$, that constant is $1$, so \begin{align*} p(z)=1,\qquad q(z)=1+z^2. \end{align*} Thus the denominator of degree $2$ recovers exactly the two complex poles $i$ and $-i$, and the Padé approximant equals $f$ itself. [/example] ## Best Rational Approximation in the Uniform Norm Taylor matching is local, while minimax approximation is global. The next question is whether the Chebyshev alternation principle survives when the approximants are rational rather than polynomial. It does survive under hypotheses that rule out denominator zeros and defects. [definition: Best Uniform Rational Approximation] Let $f\in C[a,b]$ and let $\mathcal R_{m,n}$ be the set of real maps $r:[a,b]\to\mathbb R$ represented by \begin{align*} r(x)=\frac{p(x)}{q(x)}, \end{align*} where $p,q\in\mathbb R[x]$, $\deg p\le m$, $\deg q\le n$, and $q(x)\neq 0$ for all $x\in [a,b]$. A best uniform rational approximation to $f$ from $\mathcal R_{m,n}$ is an element $r^*\in\mathcal R_{m,n}$ such that \begin{align*} \|f-r^*\|_\infty = \inf_{r\in\mathcal R_{m,n}}\|f-r\|_\infty. \end{align*} [/definition] Existence is more delicate than in finite-dimensional linear subspaces because $\mathcal R_{m,n}$ is not closed in a naive coefficient parametrisation: poles may approach the interval, or numerator and denominator factors may cancel. Once a normal best approximant exists, the central question becomes how to recognise it from the shape of its error curve. [quotetheorem:6901] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. The count $m+n+2$ mirrors the polynomial result with degree $m+n$, but the analogy holds only at nondegenerate normal points. If $p^*$ and $q^*$ have a common factor, the effective dimension drops and the expected number of alternating extrema is too large. If $q^*$ has a zero on $[a,b]$, the uniform norm is not finite; if it comes arbitrarily close to zero, small coefficient perturbations may move a pole across the interval and invalidate the linearised argument. The Haar-space assumption is the precise replacement for the informal phrase “no degeneracy”: without it, a nonzero linearised numerator could have too many zeros, so the sign-changing argument would no longer force optimality or uniqueness. A simple failure occurs if the candidate linearised space contains $x^2$ and constants on $[-1,1]$: the nonzero function $x^2$ has two zeros counted with multiplicity while the span has dimension only $2$, so the zero-counting rule behind alternation is lost. [example: A Rational Minimax Approximation to an Even Function] Let $f(x)=|x|$ on $[-1,1]$, and restrict to even rational functions of the form \begin{align*} r(x)=\frac{p(x^2)}{q(x^2)}. \end{align*} Put $t=x^2$. Since $x\in[-1,1]$ implies $t\in[0,1]$, and since for every $x$, \begin{align*} |x|=\sqrt{x^2}, \end{align*} the approximation error can be rewritten as \begin{align*} |\,|x|-r(x)\,|=\left|\sqrt{x^2}-\frac{p(x^2)}{q(x^2)}\right|. \end{align*} Thus, with $\rho(t)=p(t)/q(t)$, \begin{align*} \|\,|x|-r(x)\,\|_{[-1,1]}=\|\sqrt{t}-\rho(t)\|_{[0,1]}. \end{align*} The cusp of $|x|$ at $x=0$ has become the endpoint branch behaviour of $\sqrt{t}$ at $t=0$. Indeed, for $t>0$, \begin{align*} \frac{d}{dt}\sqrt{t}=\frac{1}{2\sqrt{t}}, \end{align*} and this derivative tends to $+\infty$ as $t\downarrow 0$. A rational denominator can place zeros near the negative real $t$-axis without putting poles on $[0,1]$. For example, if $q(t)$ has a zero at $t=-\alpha$ with $\alpha>0$, then the corresponding even denominator $q(x^2)$ vanishes when \begin{align*} x^2=-\alpha, \end{align*} so \begin{align*} x=\pm i\sqrt{\alpha}. \end{align*} These poles are off the interval $[-1,1]$ but can lie close to the cusp at $0$, allowing the rational approximant to represent the nearby singular structure more efficiently than a polynomial. For a normal best even rational approximant, the error therefore becomes an ordinary rational minimax error for $\sqrt{t}$ on $[0,1]$. The rational alternation criterion predicts points \begin{align*} 0\le t_0<t_1<\cdots<t_{m+n+1}\le 1 \end{align*} such that \begin{align*} \sqrt{t_j}-\rho(t_j)=\sigma(-1)^j\|\sqrt{t}-\rho(t)\|_{[0,1]}. \end{align*} Returning to $x$, the corresponding points are $x_j=\sqrt{t_j}$ and also $-x_j$, because $r$ and $|x|$ are even. The example shows why rational minimax approximation is effective for endpoint or interior singularities: the denominator supplies off-domain poles that encode the singular geometry, while the extremal error still displays the alternating pattern of a best uniform approximant. [/example] ## The Walsh Table Perspective How should the many rational degrees be organised? A single sequence, such as diagonal Padé approximants, hides the two-dimensional nature of numerator and denominator degree. Walsh's perspective is to arrange approximation errors in a table indexed by $(m,n)$ and study how accuracy changes across rows, columns, and diagonals. [definition: Walsh Table] Let $K\neq\varnothing$ be a compact subset of $\mathbb C$, and let $\mathcal R_{m,n}(K)$ denote the set of rational maps $r:K\to\mathbb C$ of type $(m,n)$ without poles on $K$. The Walsh table entry is the map \begin{align*} E_{m,n}(\cdot;K):C(K;\mathbb C)\to[0,\infty),\qquad E_{m,n}(f;K)=\inf_{r\in\mathcal R_{m,n}(K)}\|f-r\|_K, \end{align*} where $m,n\ge 0$. [/definition] The table is monotone in both indices: increasing $m$ or $n$ enlarges the approximating family. The pattern of decay across the table distinguishes smooth functions, meromorphic functions, and functions with branch-type singularities. [remark: Reading the Walsh Table] A rapidly decaying row with fixed $n$ suggests that $n$ poles account for the dominant singularities. A rapidly decaying diagonal suggests that a growing number of poles is needed, as for branch cuts or endpoint singularities. Flat regions of the table may indicate defects, saturation, or insufficient information about the singularity structure. [/remark] The Walsh table also gives a conceptual bridge between best rational approximation and Padé approximation. Padé methods fill a table using local Taylor data; minimax methods fill a table using global uniform error. [example: Boundary Layer Resolution by Rational Poles] Consider \begin{align*} f_\varepsilon:[0,1]\to\mathbb R,\qquad f_\varepsilon(x)=e^{-x/\varepsilon}, \end{align*} with $0<\varepsilon\ll 1$. Its boundary-layer scale is visible from the first derivative: \begin{align*} f_\varepsilon'(x)=-\frac{1}{\varepsilon}e^{-x/\varepsilon}. \end{align*} At the left endpoint this gives \begin{align*} f_\varepsilon(0)=1,\qquad f_\varepsilon'(0)=-\frac{1}{\varepsilon}. \end{align*} Thus the function changes by order one over an $x$-scale comparable to $\varepsilon$. A rational denominator can build this scale into the approximant. For an integer $N\ge 1$, define \begin{align*} r_N(x)=\left(1+\frac{x}{N\varepsilon}\right)^{-N}. \end{align*} Equivalently, \begin{align*} r_N(x)=\frac{(N\varepsilon)^N}{(x+N\varepsilon)^N}. \end{align*} Hence $r_N$ is a rational function of type $(0,N)$, and its only pole is at \begin{align*} x=-N\varepsilon. \end{align*} This pole is outside $[0,1]$, and its distance from the boundary point $0$ is $N\varepsilon$. The candidate $r_N$ matches the endpoint value and first boundary-layer slope. Indeed, \begin{align*} r_N(0)=\left(1+\frac{0}{N\varepsilon}\right)^{-N}=1. \end{align*} Using the chain rule, \begin{align*} r_N'(x)=-N\left(1+\frac{x}{N\varepsilon}\right)^{-N-1}\frac{1}{N\varepsilon}. \end{align*} Multiplying the constants gives \begin{align*} r_N'(x)=-\frac{1}{\varepsilon}\left(1+\frac{x}{N\varepsilon}\right)^{-N-1}. \end{align*} Therefore \begin{align*} r_N'(0)=-\frac{1}{\varepsilon}\left(1+\frac{0}{N\varepsilon}\right)^{-N-1}=-\frac{1}{\varepsilon}. \end{align*} At the right endpoint, \begin{align*} r_N(1)=\left(1+\frac{1}{N\varepsilon}\right)^{-N}=\left(\frac{N\varepsilon}{1+N\varepsilon}\right)^N, \end{align*} so the same denominator that matches the steep initial slope also forces small values away from the boundary when $\varepsilon$ is small. In the Walsh table, this explicit candidate shows that introducing denominator degree gives the bound \begin{align*} E_{0,N}(f_\varepsilon;[0,1])\le \|f_\varepsilon-r_N\|_{[0,1]}. \end{align*} The point is not that this particular $r_N$ is necessarily best, but that denominator degree supplies off-interval poles at the boundary-layer scale, whereas a polynomial has no finite pole with which to encode that scale directly. [/example] ## Practical Consequences and Limitations The main lesson of this chapter is that rational approximation trades linear structure for geometric flexibility. Denominators encode poles, clusters of poles approximate branch cuts, and minimax rational approximants inherit alternation only under normality conditions. [explanation: Choosing Between Polynomial, Padé, and Minimax Rational Methods] Polynomial approximation is usually preferred when the target is smooth on and near the approximation domain and a stable linear procedure is needed. Padé approximation is natural when Taylor coefficients or moments are the available data and local analytic continuation matters. Best rational approximation is appropriate when the error must be controlled uniformly on a set and the singularity structure is important enough to justify nonlinear optimisation. In numerical work, the denominator must be monitored: near-cancelling pole-zero pairs, poles close to the domain, and ill-conditioned coefficient representations can make an apparently accurate rational approximant unstable. [/explanation] These issues connect back to the spline and piecewise approximation methods of Chapter 7. Rational functions adapt by moving poles; splines adapt by placing knots and reducing smoothness across subinterval boundaries. Both strategies respond to the same obstruction: global polynomials are inefficient when a function has localised singular behaviour or rapid variation. Chapter 9 broadens this adaptive viewpoint to $L^p$ spaces and general nonlinear $n$-term approximation. Rational approximation broadened the course beyond polynomial and spline spaces by allowing poles to adapt to singular behaviour. Chapter 9 extends that adaptive viewpoint further, from uniform and Hilbert-space settings to $L^p$ and general nonlinear $n$-term approximation. # 9. Approximation in $L^p$ and Nonlinear Approximation After Chapter 8 introduced rational functions as nonlinear approximants with movable poles, this chapter moves from finite-dimensional uniform and Hilbert-space approximation to $L^p$ approximation and broader nonlinear methods. The central questions are no longer only whether a best approximant exists, but whether it is unique, how its error is detected, and how adaptive choices can outperform a fixed linear approximation scheme. The chapter begins with convexity in $L^p$, then studies thresholding and $n$-term approximation, and ends with approximation spaces that encode rates of nonlinear approximation. ## Best Approximation in $L^p$ Spaces The first problem is to understand how much of the Hilbert-space projection picture survives when the error is measured in $\|\cdot\|_{L^p}$ rather than $\|\cdot\|_{L^2}$. In $L^2$ the best approximant from a closed subspace is controlled by orthogonality; for $p \ne 2$ the geometry is governed instead by convexity of the norm. [definition: Best Approximation in $L^p$] Let $(\Omega, \mathcal F, \mu)$ be a measure space, let $1 \le p \le \infty$, let $V \subset L^p(\Omega)$, and let $f \in L^p(\Omega)$. An element $v^* \in V$ is a best approximation to $f$ from $V$ in $L^p$ if \begin{align*} \|f-v^*\|_{L^p} = \inf_{v \in V} \|f-v\|_{L^p}. \end{align*} [/definition] The definition separates existence from uniqueness. A closed finite-dimensional subspace often gives existence by compactness after reducing to a bounded set, but uniqueness requires a condition that excludes flat pieces of the unit sphere; this is the role of strict convexity. [definition: Strict Convexity of a Normed Space] A normed space $(X, \|\cdot\|_X)$ is strictly convex if, whenever $x,y \in X$ satisfy $x \ne y$ and $\|x\|_X=\|y\|_X=1$, then \begin{align*} \left\|\frac{x+y}{2}\right\|_X < 1. \end{align*} [/definition] Strict convexity turns convex minimisation into a uniqueness statement. The remaining issue is that a convex approximation class may contain two different minimisers with the same error unless the geometry of the ambient norm forbids flat error segments. For a linear subspace, the midpoint of two candidate approximants is still admissible; strict convexity then provides the mechanism that rules out two distinct best approximants once existence is already known. [quotetheorem:6902] [citeproof:6902] The theorem says that uniqueness follows from the ambient geometry once existence has been secured, but it does not prove that a best approximant exists. The linearity of $V$ is used because the midpoint of two candidate approximants must remain admissible; for a nonconvex approximation class, the midpoint argument can fail. Strict convexity is used only on the two error vectors, and it rules out a flat segment on the sphere of radius equal to the best error. Thus the theorem will be applied after an existence argument, usually finite-dimensional compactness or closed-subspace projection, has already produced at least one minimiser. To apply it in the spaces used throughout approximation theory, we need to know which $L^p$ spaces have this strict convexity property. [quotetheorem:6903] [citeproof:6903] Combining these two results gives uniqueness of best approximation in $L^p$ for $1<p<\infty$ whenever a best approximant exists. The restriction $1<p<\infty$ is essential: it is exactly the range in which the $L^p$ unit sphere has no line segments. The conclusion is an equality statement in $L^p$, so functions that agree a.e. are not distinguished; uniqueness is uniqueness of the equivalence class, not of pointwise representatives. Strict convexity still says nothing about existence, so closedness, compactness, reflexivity, or finite dimensionality must be supplied separately depending on the approximation set. The endpoints $p=1$ and $p=\infty$ behave differently, and concrete examples show why no endpoint version of the uniqueness theorem can hold. [example: Nonuniqueness in $L^1$] Let $\Omega=\{-1,1\}$ with counting measure, let $V$ be the subspace of constant functions, and define $f(-1)=0$ and $f(1)=1$. For the constant function with value $c\in\mathbb R$, the $L^1$ error is the sum over the two points: \begin{align*} \|f-c\|_{L^1}=|f(-1)-c|+|f(1)-c| \end{align*} \begin{align*} =|0-c|+|1-c|=|c|+|1-c|. \end{align*} If $0\le c\le 1$, then $|c|=c$ and $|1-c|=1-c$, so \begin{align*} \|f-c\|_{L^1}=c+(1-c)=1. \end{align*} If $c<0$, then $|c|=-c$ and $|1-c|=1-c$, hence \begin{align*} \|f-c\|_{L^1}=(-c)+(1-c)=1-2c>1. \end{align*} If $c>1$, then $|c|=c$ and $|1-c|=c-1$, hence \begin{align*} \|f-c\|_{L^1}=c+(c-1)=2c-1>1. \end{align*} Therefore the infimum of $\|f-c\|_{L^1}$ over all constant functions is $1$, and it is attained by every $c\in[0,1]$. Thus best approximants exist, but they are not unique in $L^1$. [/example] The same flatness appears in the uniform norm. In the sup norm, an entire face of admissible errors can have the same size. [example: Nonuniqueness in $L^\infty$] Let $\Omega=\{-1,1\}$ with counting measure, let $V=\{v_a:a\in\mathbb R\}$, where $v_a(-1)=a$ and $v_a(1)=0$, and define $f(-1)=0$, $f(1)=1$. For a fixed $a\in\mathbb R$, the error has values \begin{align*} (f-v_a)(-1)=0-a=-a \end{align*} and \begin{align*} (f-v_a)(1)=1-0=1. \end{align*} Since the $L^\infty$ norm on this two-point space is the maximum of the absolute values at the two points, \begin{align*} \|f-v_a\|_{L^\infty}=\max\{|-a|,|1|\}. \end{align*} Using $|-a|=|a|$ and $|1|=1$, this becomes \begin{align*} \|f-v_a\|_{L^\infty}=\max\{|a|,1\}. \end{align*} For every $a\in[-1,1]$, we have $|a|\le 1$, so \begin{align*} \max\{|a|,1\}=1. \end{align*} For every $a\notin[-1,1]$, we have $|a|>1$, so \begin{align*} \max\{|a|,1\}=|a|>1. \end{align*} Therefore the infimum of $\|f-v_a\|_{L^\infty}$ over $a\in\mathbb R$ is $1$, and every $v_a$ with $a\in[-1,1]$ attains it. Since these functions are distinct for distinct values of $a$, best approximants exist but are not unique in $L^\infty$. [/example] The endpoint examples explain why Hilbert-space structure is special rather than typical. For $p=2$ there is a sharper optimality condition because the norm comes from an inner product, and this condition will make the later $n$-term problem diagonal. [quotetheorem:6904] [citeproof:6904] The closedness of $V$ matters because the infimum distance to a nonclosed subspace may fail to be attained even in a Hilbert space. For example, in $H=L^2(0,1)$ let $V$ be the subspace of polynomials and let $f\in L^2(0,1)$ be a non-polynomial function. Since polynomials are dense in $L^2(0,1)$, the distance from $f$ to $V$ is $0$, but no polynomial equals $f$ as an $L^2$ equivalence class, so the infimum is not attained. The inner-product hypothesis is also essential: outside Hilbert spaces there is no single orthogonality relation that detects best approximation in this symmetric way, and the corresponding Banach-space conditions use norming functionals instead. The theorem is not yet a computational algorithm, since finding the projection may still require solving equations in $V$; its value here is structural. For orthonormal expansions, however, the projection becomes diagonal in the coefficients, which is exactly what makes thresholding possible in the next section. ## Thresholding and $n$-Term Approximation The next problem is different from projecting onto a fixed subspace. Instead of choosing coefficients inside a prescribed $n$-dimensional space, nonlinear approximation allows the approximating space itself to depend on the target function. [definition: Dictionary] Let $X$ be a normed space. A dictionary in $X$ is a subset $\mathcal D \subset X$ whose elements are called atoms and whose linear span is dense in the part of $X$ under consideration. [/definition] A dictionary supplies the building blocks, but an adaptive method must decide which atoms to use. The main complexity comes from the fact that the selected atoms may depend on $f$, so the error must minimise over both coefficients and supports. [definition: $n$-Term Approximation Error] Let $X$ be a complex normed space and let $\mathcal D \subset X$ be a dictionary. For $n \in \mathbb N$, the best $n$-term approximation error is the map \begin{align*} \sigma_n(\cdot)_{X,\mathcal D}:X \to [0,\infty] \end{align*} defined by \begin{align*} \sigma_n(f)_{X,\mathcal D} = \inf \left\{\left\|f-\sum_{j=1}^{n} c_j g_j\right\|_X : c_j \in \mathbb C,\ g_j \in \mathcal D\right\}. \end{align*} [/definition] The set of all $n$-term sums is not a linear subspace; it is a union of many $n$-dimensional spaces. The simplest case where this nonlinear search can be solved exactly is an orthonormal basis in a Hilbert space. [example: Fourier Thresholding in $L^2$] Let $(e_k)_{k \in \mathbb Z}$ be the trigonometric orthonormal basis for $L^2(\mathbb T)$, and write \begin{align*} f=\sum_{k\in\mathbb Z}\hat f(k)e_k. \end{align*} Fix a finite set $S\subset\mathbb Z$ with $|S|=n$ and approximate $f$ by $\sum_{k\in S}c_ke_k$. Since the basis is orthonormal, [Parseval's identity](/theorems/434) gives \begin{align*} \left\|f-\sum_{k\in S}c_ke_k\right\|_{L^2}^2=\sum_{k\in S}|\hat f(k)-c_k|^2+\sum_{k\notin S}|\hat f(k)|^2. \end{align*} For each $k\in S$ the term $|\hat f(k)-c_k|^2$ is nonnegative, and it equals $0$ exactly when $c_k=\hat f(k)$. Hence, for this fixed support $S$, the best choice of coefficients is $c_k=\hat f(k)$, and the resulting squared error is \begin{align*} \left\|f-\sum_{k\in S}\hat f(k)e_k\right\|_{L^2}^2=\sum_{k\notin S}|\hat f(k)|^2. \end{align*} Now choose $S=\Lambda_n$ to be a set of $n$ indices for which the retained magnitudes $|\hat f(k)|$ are the $n$ largest values, with any tie-breaking rule. Since \begin{align*} \sum_{k\notin S}|\hat f(k)|^2=\sum_{k\in\mathbb Z}|\hat f(k)|^2-\sum_{k\in S}|\hat f(k)|^2, \end{align*} minimising the discarded energy is the same as maximising the retained energy $\sum_{k\in S}|\hat f(k)|^2$. Replacing any selected index with a smaller coefficient magnitude by an unselected index with a larger coefficient magnitude increases the retained sum, so an optimal support is obtained by keeping the $n$ largest Fourier coefficients. Thus the best $n$-term Fourier approximant is \begin{align*} \sum_{k\in\Lambda_n}\hat f(k)e_k, \end{align*} and its squared error is \begin{align*} \sum_{k\notin\Lambda_n}|\hat f(k)|^2. \end{align*} In an orthonormal Fourier basis, thresholding is optimal because the $L^2$ error separates exactly into independent coefficient energies. [/example] The Fourier example suggests that support selection should be governed by coefficient size. This motivates the general orthonormal-basis theorem, where Parseval's identity turns the nonlinear support search into a tail-minimisation problem. [quotetheorem:6905] [citeproof:6905] Thresholding is optimal for orthonormal bases in $L^2$, but this optimality has narrow causes. Ties among equal coefficient magnitudes may give several best supports, and in abstract index sets the theorem is safest when formulated through the decreasing rearrangement and an infimum rather than through an unnamed "largest coefficient" rule. The diagonal computation also uses orthogonality at the decisive point: the squared error separates into independent coefficient contributions. For nonorthogonal bases or redundant dictionaries, changing one selected atom can change the best coefficients on all selected atoms, so coefficient size alone need not solve the support problem. This motivates greedy selection rules: when global search over all $n$-term supports is too expensive, the algorithm chooses one useful atom at a time. [definition: Weak Greedy Step] Let $X$ be a Banach space over $\mathbb F$, where $\mathbb F$ is either $\mathbb R$ or $\mathbb C$. Let $\mathcal D \subset X$ be a dictionary of nonzero atoms, let $r_{m-1}\in X$ be the current residual, and let $0<t\le 1$. If $r_{m-1}\ne 0$, choose a norming functional $F_{m-1}:X\to \mathbb F$ with $F_{m-1}\in X^*$ satisfying \begin{align*} \|F_{m-1}\|_{X^*}=1, \qquad F_{m-1}(r_{m-1})=\|r_{m-1}\|_X. \end{align*} A weak greedy step chooses an atom $g_m\in\mathcal D$ such that \begin{align*} |F_{m-1}(g_m)| \ge t\sup_{g\in\mathcal D}|F_{m-1}(g)|. \end{align*} [/definition] Different courses formulate the score using inner products in Hilbert spaces or norming functionals in Banach spaces. Once atoms have been selected, we need a name for the approximant produced from their span or from the algorithm's update rule. [definition: Greedy Projection Approximation] Let $X$ be a normed space and let $g_1,\dots,g_m\in X$ be selected atoms. Let $D_m\subset X$ be the set of elements for which a best approximant from $\operatorname{span}\{g_1,\dots,g_m\}$ has been chosen. A greedy projection selection is a map \begin{align*} G_m:D_m\to \operatorname{span}\{g_1,\dots,g_m\} \end{align*} such that, for each $f\in D_m$, the greedy projection approximant is the element \begin{align*} G_m(f)\in \operatorname{span}\{g_1,\dots,g_m\} \end{align*} satisfying \begin{align*} \|f-G_m(f)\|_X=\inf_{v\in \operatorname{span}\{g_1,\dots,g_m\}}\|f-v\|_X. \end{align*} [/definition] A greedy approximant is computationally accessible, while $\sigma_n(f)_{X,\mathcal D}$ is the ideal benchmark. To compare the two, the standard estimate is a Lebesgue-type inequality controlling greedy error by optimal $n$-term error, possibly after selecting a constant multiple of $n$ atoms. [quotetheorem:6906] Statement-level reference. The following discussion explains how this result is used in the notes and which hypotheses matter. This theorem is a precise version of a common principle, not a theorem about arbitrary dictionaries. Quasi-greediness prevents thresholding projections from being unstable, and democracy prevents two supports of the same size from having very different basis-vector norms; without either condition, thresholding can keep large coefficients while still producing a poor approximation in $X$. A model failure of democracy is a normalized conditional basis where the norm of $\sum_{k\in A}e_k$ can be much larger than the norm of $\sum_{k\in B}e_k$ for sets of the same size; then thresholding may choose a support whose basis-vector sum is expensive although another support with the same cardinality gives a much smaller residual. A model failure of quasi-greediness is a basis whose coordinate projections onto thresholding sets are not uniformly bounded, so a retained set of large coefficients can amplify the norm rather than reduce the residual in a controlled way. Redundant dictionaries, highly coherent systems, and general Banach-space weak greedy algorithms require different hypotheses and sometimes a larger number of selected atoms before a comparable inequality holds. The important point for approximation theory is the comparison with the unattainable benchmark $\sigma_n(f)_{X,\mathcal D}$ under stated structural assumptions. ## Adaptive Approximation and Smoothness The final problem is to understand which functions benefit most from nonlinear approximation. Smoothness still controls approximation rates, but the relevant smoothness scale must be sensitive to local features such as jumps, corners, and sparse coefficient expansions. [definition: Approximation Space] Let $X$ be a normed space, let $\mathcal D$ be a dictionary, and let $\sigma_n(f)_{X,\mathcal D}$ be the best $n$-term approximation error. For $\alpha>0$ and $0<q\le \infty$, the approximation space $A_q^\alpha(X,\mathcal D)$ consists of those $f \in X$ for which \begin{align*} |f|_{A_q^\alpha} = \left(\sum_{n=1}^{\infty} \left(n^\alpha \sigma_n(f)_{X,\mathcal D}\right)^q \frac{1}{n}\right)^{1/q} <\infty \end{align*} when $0<q<\infty$, with the usual supremum modification when $q=\infty$. The associated full size is \begin{align*} \|f\|_X+|f|_{A_q^\alpha}. \end{align*} [/definition] For $q\ge 1$ the displayed full size is a norm under the usual hypotheses on the approximation scheme; for $0<q<1$ it is in general only a quasi-norm. Thus the word "space" here includes quasi-Banach approximation scales, as is standard in nonlinear approximation theory. The factor $n^\alpha$ records the target rate $\sigma_n(f)_{X,\mathcal D} \lesssim n^{-\alpha}$, while the summability exponent $q$ records how regularly that rate is achieved across scales. The implicit constants in such estimates depend on the dictionary and the ambient norm, so approximation spaces are tied to the chosen atoms. [explanation: Besov-Type Smoothness Intuition] Besov spaces measure smoothness across scales rather than only by differentiability of a fixed integer order. In wavelet and spline dictionaries, this scale-by-scale viewpoint matches nonlinear approximation: a function is efficiently approximated when its fine-scale coefficients are sparse or rapidly decreasing. This is why Besov-type smoothness is natural for adaptive methods. A function with a jump discontinuity may have poor global Sobolev smoothness, but its singular behaviour is localised. A nonlinear method can spend many atoms near the jump and few atoms in smooth regions, while a fixed linear space must refine everywhere at the same resolution. [/explanation] The difference between linear and nonlinear approximation is visible in elementary piecewise polynomial methods. A uniform mesh treats every interval equally; an adaptive mesh follows the local difficulty of the function. [example: Piecewise Polynomial Approximation of a Function with a Jump] Let $f=\mathbb{1}_{[1/2,1]}$ on $[0,1]$, and first use a uniform partition into $N$ intervals of length $h=1/N$ with $N$ odd, so that $1/2$ lies inside exactly one interval $I$. On every interval not containing $1/2$, the function $f$ is already constant, so the only possible error comes from the crossing interval. Since $N$ is odd, the point $1/2$ splits $I$ into two subintervals of equal length $h/2$. If we approximate $f$ on $I$ by a constant $c\in\mathbb R$, then for $1\le p<\infty$ the $p$th power of the local error is \begin{align*} \int_I |f(x)-c|^p\,dx=\int_{I\cap[0,1/2)} |0-c|^p\,dx+\int_{I\cap[1/2,1]} |1-c|^p\,dx. \end{align*} Because both pieces have length $h/2$, this becomes \begin{align*} \int_I |f(x)-c|^p\,dx=\frac{h}{2}|c|^p+\frac{h}{2}|1-c|^p. \end{align*} By convexity of $t\mapsto |t|^p$, \begin{align*} \frac{|c|^p+|1-c|^p}{2}\ge \left|\frac{c+(1-c)}{2}\right|^p=2^{-p}. \end{align*} Therefore \begin{align*} \int_I |f(x)-c|^p\,dx\ge h2^{-p}. \end{align*} Equality is attained by $c=1/2$, since \begin{align*} \frac{h}{2}\left|\frac12\right|^p+\frac{h}{2}\left|1-\frac12\right|^p=h2^{-p}. \end{align*} Thus the best uniform-grid piecewise constant error satisfies \begin{align*} \|f-v\|_{L^p([0,1])}=h^{1/p}2^{-1} \end{align*} for the optimal choice on the crossing cell and exact choices on all other cells. An adaptive partition with a breakpoint at $1/2$ has the two cells $[0,1/2)$ and $[1/2,1]$. Choosing the constants $0$ and $1$ on these two cells gives \begin{align*} f(x)-v(x)=0 \end{align*} for every $x$ except possibly at the single point $1/2$, which has Lebesgue measure zero. Hence \begin{align*} \|f-v\|_{L^p([0,1])}=0. \end{align*} The adaptive method represents the jump exactly with two pieces, while the uniform method pays an error of size proportional to $h^{1/p}$ whenever a cell straddles the jump. [/example] The same lesson applies to smooth functions with isolated layers or corners. The approximation rate reflects not only how smooth the function is globally, but also whether the approximation scheme can detect where resolution is needed. [example: Comparing Linear and Adaptive Approximants] Let $J=[x_0-\delta,x_0+\delta]\cap[0,1]$ be a narrow transition layer, and suppose $f$ is Lipschitz with constant $L_1$ on $J$ and with the smaller constant $L_0$ on $[0,1]\setminus J$, where $L_1\gg L_0$. For a piecewise constant spline on an interval $K$ of length $|K|$, choose the midpoint value $c_K=f(m_K)$. If $f$ has Lipschitz constant $L_K$ on $K$, then for every $x\in K$, \begin{align*} |f(x)-c_K|=|f(x)-f(m_K)|\le L_K|x-m_K|\le L_K|K|/2. \end{align*} Thus the local $L^\infty$ error is bounded by $L_K|K|/2$. A linear scheme based on a uniform partition into $n$ intervals has mesh size $h=1/n$. If one of these intervals intersects the transition layer, the local bound there is \begin{align*} \|f-c_K\|_{L^\infty(K)}\le L_1h/2=L_1/(2n). \end{align*} The method has spent the same interval length $h$ in the slowly varying region, where the corresponding bound is only \begin{align*} \|f-c_K\|_{L^\infty(K)}\le L_0h/2=L_0/(2n). \end{align*} Since $L_1\gg L_0$, the largest uniform-grid error is controlled by the transition layer, not by the smooth region. An adaptive $n$-term spline can allocate $m$ small intervals to $J$ and $n-m$ larger intervals to $[0,1]\setminus J$. If the intervals in $J$ have length approximately $2\delta/m$, then the transition-layer bound becomes \begin{align*} \|f-c_K\|_{L^\infty(K)}\le L_1(2\delta/m)/2=L_1\delta/m. \end{align*} Outside $J$, intervals of length approximately $(1-2\delta)/(n-m)$ give \begin{align*} \|f-c_K\|_{L^\infty(K)}\le L_0(1-2\delta)/(2(n-m)). \end{align*} The adaptive method can therefore reduce the dominant transition-layer term by increasing $m$ near $x_0$, while the linear uniform method cannot make the layer mesh smaller without refining the whole interval. [/example] These examples explain why nonlinear approximation is tied to sparsity. This motivates a quantitative theorem: sorted coefficient decay in an orthonormal basis gives an explicit algebraic rate for best $n$-term approximation. [quotetheorem:6907] [citeproof:6907] The theorem translates coefficient sparsity into approximation rate, and the condition $s>1/2$ is exactly what makes the squared coefficient tail summable. At the endpoint $s=1/2$, the comparison becomes the harmonic tail and no algebraic decay follows from this estimate. The statement is only a sufficient condition: observing an $n$-term rate does not by itself force the pointwise rearrangement bound $a_k^*\le Ck^{-s}$ without additional structure. In wavelet-based theory, Besov smoothness often gives exactly this kind of control over rearranged coefficients, which is why Besov and approximation spaces are paired in the nonlinear theory. [remark: Linear Versus Nonlinear Information] A linear approximation method fixes in advance the $n$ degrees of freedom used at level $n$. A nonlinear method lets the selected atoms, intervals, or coefficients depend on $f$. The price is algorithmic complexity and stability analysis; the reward is better performance on functions whose difficulty is concentrated in a small part of the domain or a sparse subset of a dictionary. [/remark] The chapter therefore extends the course's earlier themes in two directions. First, the geometry of $L^p$ explains when best approximants are unique. Second, nonlinear approximation replaces fixed subspaces by adaptive choices, making sparsity and scale-sensitive smoothness central to approximation rates. The final chapter applies these approximation estimates to numerical algorithms, where the same errors must be paired with stability and conditioning bounds. Chapter 9 showed that nonlinear approximation is driven by sparsity, adaptive choice of terms, and scale-sensitive smoothness. The final chapter takes those estimates into numerical analysis, where approximation error must be balanced with stability, conditioning, and algorithmic cost. # 10. Applications to Numerical Analysis Numerical analysis is where the abstract estimates of approximation theory become design rules for algorithms. Earlier chapters measured the distance from a function to polynomial, spline, trigonometric, rational, and nonlinear approximation families; here the same estimates control quadrature formulas, interpolation routines, finite element spaces, and spectral discretisations. The chapter assumes the preceding material on polynomial interpolation, best approximation, norms on function spaces, and basic Hilbert-space projection ideas. The central question is not only how small the approximation error is in exact arithmetic, but also whether the representation used by the algorithm is stable enough for that error estimate to be visible in computation. ## Approximation as a Numerical Building Block A numerical method often begins by replacing an infinite-dimensional object by finitely many degrees of freedom. The approximation space may be a polynomial space, a spline space, or a space of piecewise polynomials, and the choice determines both the order of accuracy and the conditioning of the computation. [definition: Polynomial Interpolant] Let $x_0,\dots,x_n$ be distinct points in an interval $[a,b]$, and let $f:[a,b]\to\mathbb R$ be a function. The polynomial interpolant of $f$ at the nodes $x_0,\dots,x_n$ is the unique polynomial $p_n\in\mathcal P_n$ satisfying \begin{align*} p_n(x_i)=f(x_i),\qquad i=0,\dots,n. \end{align*} [/definition] The interpolant turns sampled data into a function, so its error is the first place where smoothness, node placement, and algebraic representation meet. A monomial coefficient depends on the origin and is badly matched to the incremental process of adding one more interpolation node; a Lagrange coefficient records a value at one node but does not isolate the contribution of the newest node. To describe interpolation error and Newton interpolation without committing to either coordinate system, we need a coefficient-like quantity intrinsic to the ordered set of sample nodes. [definition: Divided Difference] Let $x_0,\dots,x_n$ be distinct nodes, and let $X_n=\{x_0,\dots,x_n\}$. The order-$n$ divided difference at these ordered nodes is the functional \begin{align*} D[x_0,\dots,x_n]:\mathbb R^{X_n}\to\mathbb R, \end{align*} whose value on a function $f:X_n\to\mathbb R$ is denoted $f[x_0,\dots,x_n]$ and is defined recursively by $f[x_i]=f(x_i)$ and, for $n\ge 1$, \begin{align*} f[x_0,\dots,x_n]=\frac{f[x_1,\dots,x_n]-f[x_0,\dots,x_{n-1}]}{x_n-x_0}. \end{align*} [/definition] Divided differences are useful because they measure the coefficient of the newest Newton basis factor. To estimate interpolation error, however, a coefficient description is not enough: one must relate the error at a point to smoothness of the original function and to the geometry of the chosen nodes. The obstruction is that poor node placement can make the nodal product large, while insufficient differentiability prevents the usual derivative remainder from even being defined. [quotetheorem:475] [citeproof:475] The theorem separates analytic smoothness, represented by $f^{(n+1)}$, from geometry of the nodes, represented by the product. The $C^{n+1}$ hypothesis is what makes the repeated Rolle argument available and supplies the derivative appearing in the remainder; for instance, with $n=0$ and $f(x)=|x|$ on $[-1,1]$, interpolation at $x_0=0$ is defined, but there is no derivative at the node from which to form the stated first-order remainder formula on the whole interval. Distinct nodes are needed both for uniqueness of the interpolating polynomial and for the divided differences to have nonzero denominators: if $x_0=x_1$ and only function values are prescribed, then the two interpolation conditions coincide and do not determine a unique affine polynomial. The interval assumption is also structural, since [Rolle's theorem](/theorems/185) uses the connected segment between successive zeros; on a disconnected set such as $[-1,0]\cup[1,2]$, zeros in different components do not force an intervening derivative zero inside the domain. The estimate is local in the evaluation point $x$, not a complete convergence theorem. It shows that good interpolation requires control of both the derivative term and the nodal product, but it does not say that $p_n\to f$ as $n\to\infty$ for an arbitrary sequence of nodes. It also says nothing about stability of computing $p_n$ from data; that separate issue is measured later by Lebesgue constants and basis conditioning. This separation is the basic template for the estimates used in quadrature and finite element interpolation. [example: Quadrature from Polynomial Interpolation] Let $x_0,\dots,x_n\in[a,b]$ be distinct interpolation nodes, and let $\ell_i$ be the Lagrange basis polynomial satisfying $\ell_i(x_j)=\delta_{ij}$. Define \begin{align*} w_i=\int_a^b \ell_i(x)\,dx. \end{align*} For any $p\in\mathcal P_n$, the Lagrange interpolation formula gives \begin{align*} p(x)=\sum_{i=0}^{n}p(x_i)\ell_i(x). \end{align*} Integrating this identity term by term, \begin{align*} \int_a^b p(x)\,dx=\int_a^b\sum_{i=0}^{n}p(x_i)\ell_i(x)\,dx=\sum_{i=0}^{n}p(x_i)\int_a^b\ell_i(x)\,dx=\sum_{i=0}^{n}w_i p(x_i)=Q_n[p]. \end{align*} Thus the quadrature rule is exact on $\mathcal P_n$. Now let $f\in C^{n+1}([a,b])$, and let $p_n$ be its interpolating polynomial at the nodes. Since $Q_n$ is exact on $p_n$, \begin{align*} \int_a^b f(x)\,dx-Q_n[f]=\int_a^b f(x)\,dx-Q_n[p_n]+\sum_{i=0}^{n}w_i(p_n(x_i)-f(x_i)). \end{align*} The last sum is zero because $p_n(x_i)=f(x_i)$ for every $i$, and exactness gives $Q_n[p_n]=\int_a^b p_n(x)\,dx$. Hence \begin{align*} \int_a^b f(x)\,dx-Q_n[f]=\int_a^b(f(x)-p_n(x))\,dx. \end{align*} By the *Interpolation Error Formula*, for each $x\in[a,b]$ there is $\xi_x\in[a,b]$ such that \begin{align*} f(x)-p_n(x)=\frac{f^{(n+1)}(\xi_x)}{(n+1)!}\prod_{i=0}^{n}(x-x_i). \end{align*} Taking absolute values and using $|f^{(n+1)}(\xi_x)|\le \|f^{(n+1)}\|_\infty$ gives \begin{align*} \left|\int_a^b f(x)\,dx-Q_n[f]\right|\le \frac{\|f^{(n+1)}\|_\infty}{(n+1)!}\int_a^b\left|\prod_{i=0}^{n}(x-x_i)\right|\,dx. \end{align*} The error therefore separates into a smoothness factor and a node-geometry factor: high-order quadrature is accurate when $f^{(n+1)}$ is small and the nodal product has small integral size. [/example] Polynomial interpolation gives accurate quadrature rules, but the same global dependence is a liability for composite meshes: changing one value usually changes the interpolant everywhere. To build local approximants for refinement and finite element assembly, we need to define spaces made from polynomial pieces joined with prescribed smoothness at the knots. [definition: Spline Space on an Interval Mesh] Let $a=t_0<t_1<\dots<t_m=b$ be a partition of $[a,b]$, and let $r,k\in\mathbb N$ with $r<k$. The spline space $S_k^r(t_0,\dots,t_m)$ consists of all functions $s\in C^r([a,b])$ such that $s|_{[t_{j-1},t_j]}\in\mathcal P_k$ for $j=1,\dots,m$. [/definition] Continuity across the knots makes splines smoother than unrelated piecewise polynomials while preserving locality. The next example records the standard form of a composite interpolation estimate. [example: Composite Cubic Spline Interpolation] Let $f\in C^4([a,b])$, let $a=t_0<\dots<t_m=b$ be a uniform mesh with spacing $h$, and let $s$ be the natural cubic spline interpolant of $f$ at the knots. The estimate \begin{align*} \|f-s\|_\infty \le C h^4\|f^{(4)}\|_\infty \end{align*} is the standard fourth-order bound under the boundary compatibility conditions $f''(a)=f''(b)=0$, because then the natural boundary conditions imposed on $s$ match those of $f$. To see why the power $h^4$ appears, fix one element $K_j=[t_{j-1},t_j]$ and let $q_j$ be the cubic polynomial interpolating $f$ at the four local points $t_{j-1},t_{j-1}+h/3,t_{j-1}+2h/3,t_j$. By the cubic interpolation remainder formula, for each $x\in K_j$ there is $\xi_x\in K_j$ such that \begin{align*} f(x)-q_j(x)=\frac{f^{(4)}(\xi_x)}{4!}(x-t_{j-1})(x-t_{j-1}-h/3)(x-t_{j-1}-2h/3)(x-t_j). \end{align*} Since each factor has absolute value at most $h$ on $K_j$, \begin{align*} |f(x)-q_j(x)|\le \frac{\|f^{(4)}\|_{L^\infty(K_j)}}{24}h^4. \end{align*} Taking the maximum over $x\in K_j$ gives \begin{align*} \|f-q_j\|_{L^\infty(K_j)}\le \frac{h^4}{24}\|f^{(4)}\|_{L^\infty(K_j)}. \end{align*} The natural spline $s$ is not obtained by choosing these independent local cubics, because its first and second derivatives must match across the knots and its second derivative must vanish at $a$ and $b$. The stability part of the spline interpolation theorem says that, for the natural spline operator $S_h$, there is a constant $C_{\mathrm{stab}}$ independent of $h$ and $f$ such that the globally coupled spline error is bounded by the best local cubic error: \begin{align*} \|f-S_hf\|_\infty\le C_{\mathrm{stab}}\max_{1\le j\le m}\inf_{q\in\mathcal P_3}\|f-q\|_{L^\infty(K_j)}. \end{align*} Using the particular choice $q=q_j$ in each infimum, \begin{align*} \max_{1\le j\le m}\inf_{q\in\mathcal P_3}\|f-q\|_{L^\infty(K_j)} \le \max_{1\le j\le m}\frac{h^4}{24}\|f^{(4)}\|_{L^\infty(K_j)} \le \frac{h^4}{24}\|f^{(4)}\|_\infty. \end{align*} Therefore \begin{align*} \|f-s\|_\infty=\|f-S_hf\|_\infty\le \frac{C_{\mathrm{stab}}}{24}h^4\|f^{(4)}\|_\infty. \end{align*} Writing $C=C_{\mathrm{stab}}/24$ gives the stated estimate, and the constant is independent of $h$ because the mesh is uniform and the natural spline projection is uniformly stable on this family of meshes. [/example] Finite element methods use the same approximation principle but organise it around basis functions attached to mesh nodes. Before solving a differential equation numerically, one needs a way to transfer a smooth function into the discrete trial space with a controlled error. [definition: Finite Element Interpolant on an Interval] Let $a=t_0<t_1<\dots<t_m=b$ be a mesh and let $V_h$ be the space of continuous functions $v:[a,b]\to\mathbb R$ whose restriction to each interval $[t_{j-1},t_j]$ is affine. The finite element interpolation operator is the map \begin{align*} I_h:C([a,b])\to V_h \end{align*} defined by \begin{align*} (I_h f)(t_j)=f(t_j),\qquad j=0,\dots,m. \end{align*} [/definition] This interpolant is local and basis-independent: on each element it is the affine interpolant at the two endpoints. Its approximation order is the model estimate behind first-order finite elements. [example: Linear Finite Element Interpolation] Assume $f\in C^2([a,b])$, and let $h=\max_j(t_j-t_{j-1})$. Fix an element $K=[t_{j-1},t_j]$ and write $h_K=t_j-t_{j-1}$. On this element, $I_hf$ is the affine interpolant of $f$ at the two endpoints, so the *Interpolation Error Formula* with nodes $t_{j-1}$ and $t_j$ gives, for each $x\in K$, a point $\xi_x\in K$ such that \begin{align*} f(x)-I_hf(x)=\frac{f''(\xi_x)}{2}(x-t_{j-1})(x-t_j). \end{align*} Set $y=(x-t_{j-1})/h_K$. Then $0\le y\le 1$, $x-t_{j-1}=h_Ky$, and $x-t_j=h_K(y-1)$, hence \begin{align*} |(x-t_{j-1})(x-t_j)|=h_K^2y(1-y). \end{align*} Since $y(1-y)=1/4-(y-1/2)^2\le 1/4$, we obtain \begin{align*} |f(x)-I_hf(x)|\le \frac{1}{2}\|f''\|_{L^\infty(K)}\frac{h_K^2}{4}=\frac{h_K^2}{8}\|f''\|_{L^\infty(K)}. \end{align*} Taking the maximum first over $x\in K$ and then over all elements gives \begin{align*} \|f-I_hf\|_\infty\le \frac{h^2}{8}\|f''\|_\infty. \end{align*} For the derivative estimate, the derivative of $I_hf$ on $K$ is the constant slope \begin{align*} (I_hf)'(x)=\frac{f(t_j)-f(t_{j-1})}{h_K}. \end{align*} By the *Fundamental Theorem of Calculus*, \begin{align*} \frac{f(t_j)-f(t_{j-1})}{h_K}=\frac{1}{h_K}\int_{t_{j-1}}^{t_j}f'(t)\,dt. \end{align*} Therefore, for $x\in K$, \begin{align*} (f-I_hf)'(x)=\frac{1}{h_K}\int_{t_{j-1}}^{t_j}(f'(x)-f'(t))\,dt. \end{align*} For each $t\in K$, the [mean value theorem](/theorems/186) applied to $f'$ between $x$ and $t$ gives \begin{align*} |f'(x)-f'(t)|\le \|f''\|_{L^\infty(K)}|x-t|\le h_K\|f''\|_{L^\infty(K)}. \end{align*} Thus \begin{align*} |(f-I_hf)'(x)|\le \frac{1}{h_K}\int_{t_{j-1}}^{t_j}h_K\|f''\|_{L^\infty(K)}\,dt=h_K\|f''\|_{L^\infty(K)}. \end{align*} Taking the maximum over $x\in K$ yields \begin{align*} \|(f-I_hf)'\|_{L^\infty(K)}\le h_K\|f''\|_{L^\infty(K)}. \end{align*} The constants in both estimates are independent of the mesh shape in one dimension; the first estimate has order $h^2$ in the function value, while the derivative loses one power of the element length and has order $h$. [/example] ## Spectral and Pseudospectral Discretisation What changes when the approximation space uses high-degree global polynomials instead of local low-degree elements? Spectral methods exploit high smoothness by expanding the unknown in a global basis, while pseudospectral methods enforce equations at selected collocation points. The reward is very fast convergence for analytic functions; the cost is stronger sensitivity to node placement, boundary treatment, and conditioning. Global polynomial and trigonometric expansions can represent smooth functions with far fewer coefficients than a low-order local mesh, but this advantage disappears if the method is defined only by a convergence slogan. We therefore separate the construction of a spectral approximation from the later question of spectral accuracy: first specify the global trial space and the equations that determine the coefficients, then estimate the rate using regularity and stability. [definition: Spectral Approximation] Let $X$ be a normed function space and let $V_N\subset X$ be a finite-dimensional space spanned by global high-order basis functions, such as trigonometric functions, orthogonal polynomials, or global Lagrange polynomials at selected nodes. A spectral approximation of $u\in X$ is an element $u_N\in V_N$ obtained by projection, interpolation, or collocation using that global basis. [/definition] The phrase spectral accuracy refers to a convergence property of this construction, not to the definition itself: for a fixed basis and regularity class, the error may decay faster than any algebraic power of $N^{-1}$, and for analytic data it is often exponential. For interval problems, a numerical method also needs nodes that support endpoint boundary conditions without causing the interpolation operator to grow too fast. [definition: Chebyshev Collocation Nodes] For $N\in\mathbb N$, the Chebyshev-Lobatto nodes on $[-1,1]$ are \begin{align*} x_j=\cos\left(\frac{j\pi}{N}\right),\qquad j=0,\dots,N. \end{align*} [/definition] These nodes include the endpoints, so they are convenient for boundary value problems. The collocation method turns a differential equation into algebraic equations by demanding that the residual vanish at the nodes. [example: Chebyshev Collocation for a Boundary Value Problem] Consider \begin{align*} -u''(x)=g(x),\qquad -1<x<1,\qquad u(-1)=u(1)=0, \end{align*} with smooth $g$, and let $x_j=\cos(j\pi/N)$ for $j=0,\dots,N$. Let $L_0,\dots,L_N$ be the Lagrange basis polynomials for these nodes, so $L_k(x_j)=\delta_{jk}$. Since the boundary nodes are $x_0=1$ and $x_N=-1$, the boundary conditions force the endpoint coefficients to be zero, and the collocation unknown can be written as \begin{align*} u_N(x)=\sum_{k=1}^{N-1}U_kL_k(x). \end{align*} Differentiating the expansion twice gives \begin{align*} u_N''(x)=\sum_{k=1}^{N-1}U_kL_k''(x). \end{align*} Enforcing the differential equation at the interior nodes $x_i$, $i=1,\dots,N-1$, gives \begin{align*} -\sum_{k=1}^{N-1}U_kL_k''(x_i)=g(x_i),\qquad i=1,\dots,N-1. \end{align*} Thus the coefficient vector $U=(U_1,\dots,U_{N-1})^\top$ satisfies \begin{align*} A U = G, \end{align*} where \begin{align*} A_{ik}=-L_k''(x_i),\qquad G_i=g(x_i),\qquad 1\le i,k\le N-1. \end{align*} Equivalently, if $D^{(2)}$ denotes the full Chebyshev-Lobatto second-derivative differentiation matrix with entries $D^{(2)}_{ik}=L_k''(x_i)$, then $A$ is the negative of the interior-interior block of $D^{(2)}$. The collocation polynomial therefore satisfies the boundary conditions exactly and makes the residual $-u_N''-g$ vanish at each interior node. For analytic $g$ and compatible homogeneous boundary data, standard Chebyshev spectral convergence estimates imply faster-than-algebraic decay of the approximation error, while the entries of the second-derivative differentiation matrix grow with $N$ because differentiating high-degree global polynomials amplifies their nodal data. This is the tradeoff: the approximation space is very accurate for analytic solutions, but the direct collocation matrix becomes increasingly ill-conditioned as the polynomial degree grows. [/example] The preceding example shows the main tradeoff: spectral discretisations can approximate with very few degrees of freedom, but their matrices may be poorly conditioned in the most direct bases. For boundary value problems, a second obstruction appears if we choose coefficients by an arbitrary finite set of equations: the resulting discrete function may solve a linear system while failing to inherit the variational stability of the continuous problem. Projection formulations remove this ambiguity by asking the residual to vanish against every [test function](/page/Test%20Function) in the same trial space, leading to quasioptimality statements. [definition: Galerkin Approximation] Let $V$ be a Hilbert space, let $a:V\times V\to\mathbb R$ be a [bilinear form](/page/Bilinear%20Form), and let $F\in V^*$. Given a finite-dimensional subspace $V_h\subset V$, the Galerkin approximation is the element $u_h\in V_h$ satisfying \begin{align*} a(u_h,v_h)=F(v_h)\qquad\text{for all }v_h\in V_h. \end{align*} [/definition] Galerkin methods are approximation methods constrained by a variational equation. The central estimate says that the discrete solution is as good as the best available approximation in the trial space, up to constants determined by the continuous problem. [quotetheorem:6908] [citeproof:6908] The theorem turns approximation estimates into numerical error estimates, but its constants reveal what the argument needs. Coercivity provides control of the error norm by the energy expression $a(e,e)$; without it, the discrete equations may have nonunique or unstable solutions. A concrete failure is the zero form $a(u,v)=0$ on $V=\mathbb R$: every $u_h\in V_h$ satisfies the homogeneous discrete equation for $F=0$, so no estimate can control the error by the best approximation term. Continuity is the complementary upper bound that lets the comparison error $u-w_h$ enter the estimate; if $a$ is an unbounded bilinear form on an infinite-dimensional Hilbert space, a sequence with $\|v_j\|_V=1$ can make $|a(e,v_j)|$ arbitrarily large, so the comparison step in the proof has no uniform constant. For the Poisson equation in $H^1_0(\Omega)$ these hypotheses come from Poincare's inequality and bounded coefficients, so the theorem directly converts interpolation estimates into finite element error bounds. The result is also deliberately limited. It assumes a conforming space $V_h\subset V$, so nonconforming methods require extra consistency terms rather than this exact orthogonality argument. As a model failure, take $V=H^1_0(0,1)$ with $a(u,v)=\int_0^1 u'v'\,dx$ and let a trial space contain a discontinuous piecewise affine function with a jump at an interior node; that function is not in $V$, its weak derivative includes a singular contribution, and the identity $a(u-u_h,v_h)=0$ is not even a valid statement in the original Hilbert space. Indefinite problems, such as Helmholtz-type variational formulations, usually replace coercivity with an inf-sup condition. The estimate is an error estimate in the Hilbert-space norm, not a conditioning estimate for the stiffness matrix; a method can be quasioptimal and still require preconditioning for efficient numerical solution. In finite element language, the remaining task is to estimate the best approximation error by constructing an interpolant or local polynomial surrogate. ## Local Approximation and the Bramble-Hilbert Principle How can estimates on one reference element produce estimates on an arbitrary mesh element? Finite element analysis relies on local polynomial approximation, scaling, and the fact that derivatives of order $m$ control the part of a function not already represented by polynomials of lower degree. [definition: Sobolev Seminorm] Let $U\subset\mathbb R^n$ be open, let $m\in\mathbb N$, and let $1\le p\le\infty$. The order-$m$ Sobolev seminorm is the functional \begin{align*} |\cdot|_{W^{m,p}(U)}:W^{m,p}(U)\to[0,\infty) \end{align*} defined for $u\in W^{m,p}(U)$ by \begin{align*} |u|_{W^{m,p}(U)}=\left(\sum_{|\alpha|=m}\|D^\alpha u\|_{L^p(U)}^p\right)^{1/p} \end{align*} when $p<\infty$, with the usual maximum over $|\alpha|=m$ when $p=\infty$. [/definition] A seminorm ignores lower-order polynomial components, so it is adapted to an approximation problem where polynomials of degree at most $m-1$ should have zero error. The local finite element estimate therefore needs a theorem saying that this seminorm controls the distance to the polynomial space, with constants depending only on the shape of the reference domain. [quotetheorem:6909] [citeproof:6909] The lemma is the bridge between smoothness and mesh-size powers. The geometric hypotheses are not cosmetic: star-shapedness with respect to a ball prevents thin necks and cusps from making the constant depend on hidden geometric degeneracy, while the Lipschitz boundary assumption supports the extension and compactness tools used in the proof. A model degeneration is the family of rectangles $U_\varepsilon=(0,1)\times(0,\varepsilon)$ after rescaling only by diameter: functions that vary across the thin direction have derivative norms and averaged polynomial corrections related by constants that deteriorate as $\varepsilon\to0$. In finite element language the same failure appears for triangular elements with one angle tending to $0$: affine interpolation may have a fixed formal degree, but gradients of the nodal basis functions grow like the reciprocal of the small altitude, so uniform constants in $W^{1,p}$ estimates are lost. The regularity assumptions on $u$ and the polynomial degree are also necessary in a direct way. If $u\notin W^{m,p}(U)$, then the right-hand side seminorm is not finite and cannot control lower-order errors; for example, a function with a jump discontinuity on an interval has no finite $W^{1,p}$ seminorm for $p\ge1$ but may have a positive $L^p$ distance from every constant. If the approximating space were reduced below $\mathcal P_{m-1}$, even polynomials of degree $m-1$ would no longer be reproduced, so the right-hand side could vanish while the left-hand side remained positive. This is why finite element convergence theorems assume shape-regular meshes rather than merely small diameters. On an element $K$ of diameter $h_K$, scaling typically converts the reference estimate into factors of $h_K^{m-k}$ in the $W^{k,p}$ seminorm or norm, with constants controlled by the reference shape. This is the analytic origin of the rule that mesh refinement improves accuracy only when element shapes remain uniformly regular. [example: Elementwise Polynomial Approximation] Let $K=[x_j,x_{j+1}]$ and set $h_K=x_{j+1}-x_j$. To see the scaling in the local affine approximation estimate, map the reference interval $\widehat K=[0,1]$ to $K$ by \begin{align*} F(\widehat x)=x_j+h_K\widehat x. \end{align*} For $\widehat u(\widehat x)=u(F(\widehat x))$, the chain rule gives \begin{align*} \widehat u'(\widehat x)=h_Ku'(F(\widehat x)). \end{align*} Applying the chain rule once more gives \begin{align*} \widehat u''(\widehat x)=h_K^2u''(F(\widehat x)). \end{align*} Hence \begin{align*} |\widehat u|_{W^{2,\infty}(\widehat K)}=h_K^2|u|_{W^{2,\infty}(K)}. \end{align*} By the *[Bramble-Hilbert Lemma](/theorems/6909)* on the fixed reference interval, there is an affine polynomial $\widehat q\in\mathcal P_1$ such that \begin{align*} \|\widehat u-\widehat q\|_{L^\infty(\widehat K)}\le C_{\widehat K}|\widehat u|_{W^{2,\infty}(\widehat K)}. \end{align*} Define $q$ on $K$ by \begin{align*} q(x)=\widehat q\left(\frac{x-x_j}{h_K}\right). \end{align*} Since $\widehat q$ is affine, $q\in\mathcal P_1$, and the change of variables gives \begin{align*} \|u-q\|_{L^\infty(K)}=\|\widehat u-\widehat q\|_{L^\infty(\widehat K)}. \end{align*} Combining the last two displays, \begin{align*} \|u-q\|_{L^\infty(K)}\le C_{\widehat K}h_K^2|u|_{W^{2,\infty}(K)}. \end{align*} Taking the infimum over all affine $q$ therefore gives \begin{align*} \inf_{q\in\mathcal P_1}\|u-q\|_{L^\infty(K)}\le C_{\widehat K}h_K^2|u|_{W^{2,\infty}(K)}. \end{align*} For the finite element interpolant itself, the one-dimensional interpolation remainder with nodes $x_j$ and $x_{j+1}$ gives, for each $x\in K$, a point $\xi_x\in K$ such that \begin{align*} u(x)-I_hu(x)=\frac{u''(\xi_x)}{2}(x-x_j)(x-x_{j+1}). \end{align*} Writing $y=(x-x_j)/h_K$, we have $0\le y\le 1$ and \begin{align*} |(x-x_j)(x-x_{j+1})|=h_K^2y(1-y). \end{align*} Since \begin{align*} y(1-y)=\frac14-\left(y-\frac12\right)^2\le \frac14, \end{align*} we get \begin{align*} \|u-I_hu\|_{L^\infty(K)}\le \frac{h_K^2}{8}|u|_{W^{2,\infty}(K)}. \end{align*} Thus both the abstract reference-element argument and the explicit nodal interpolant show the same second-order dependence on the element length. [/example] ## Error, Conditioning, and Stable Bases A small approximation error in exact arithmetic does not guarantee an accurate computed answer. Numerical algorithms also amplify data perturbations, roundoff, and coefficient errors, so approximation theory must be paired with conditioning estimates. [definition: Lebesgue Constant] Let $x_0,\dots,x_n$ be distinct interpolation nodes in $[a,b]$, and let $I_n:C([a,b])\to\mathcal P_n$ be the interpolation operator. The Lebesgue constant of these nodes is \begin{align*} \Lambda_n=\|I_n\|_{C([a,b])\to C([a,b])}=\sup_{x\in[a,b]}\sum_{i=0}^{n}|\ell_i(x)|, \end{align*} where $\ell_i$ are the Lagrange basis polynomials. [/definition] The Lebesgue constant measures the worst amplification of data error by interpolation. To compare interpolation with the best approximation available in the same polynomial space, we need an inequality that combines operator norm stability with the intrinsic best-error term. [quotetheorem:6889] [citeproof:6889] Thus interpolation is near-best only when the Lebesgue constant is not too large. The proof used two structural facts: $I_n p=p$ for every $p\in\mathcal P_n$, so interpolation is a projection onto the polynomial space, and $\Lambda_n$ is exactly the operator norm that measures how much the projection can amplify a continuous perturbation. If an operator $T:C([a,b])\to\mathcal P_n$ is not a projection, the argument fails even for data already in the target space: for example, the zero operator has $\|T\|=0$ but sends a nonzero polynomial $p\in\mathcal P_n$ to $0$, so $\|p-Tp\|_\infty>0$ while the best approximation error is $0$. Large Lebesgue constants give the complementary failure mode: with equispaced nodes on $[-1,1]$, the constants grow exponentially, so small perturbations in sampled data can be magnified into large changes in the interpolant even when the best polynomial approximation error is small. The inequality is therefore a stability-plus-approximation statement rather than a pure approximation theorem. The bound is often sharp enough to explain qualitative behaviour, but it is not a full floating-point error analysis and it may overestimate the actual error for a particular function because it uses the worst-case operator norm. It also applies only to the interpolation operator for the chosen nodes; changing to least squares, projection, or barycentric evaluation changes the relevant stability mechanism. This explains why Chebyshev-type nodes are preferred over equispaced nodes for high-degree interpolation. [example: Equispaced Nodes and Chebyshev Nodes] Let $I_n^{\mathrm{eq}}$ denote interpolation at the equispaced nodes $x_i=-1+2i/n$, and let $I_n^{\mathrm{ch}}$ denote interpolation at Chebyshev-type nodes. Their Lebesgue constants satisfy the classical growth estimates \begin{align*} \Lambda_n^{\mathrm{eq}}\sim \frac{2^{n+1}}{e\,n\log n} \end{align*} for equispaced nodes, while for Chebyshev-Lobatto nodes, \begin{align*} \Lambda_n^{\mathrm{ch}}=\frac{2}{\pi}\log n+O(1). \end{align*} Thus the equispaced interpolation operator can amplify data errors by an exponentially large factor, whereas the Chebyshev interpolation operator amplifies them only logarithmically. To see what this means for sampled data, suppose the exact data values are $f(x_i)$ and the perturbed values are $f(x_i)+\delta_i$, with $|\delta_i|\le \varepsilon$. If $\ell_i$ are the Lagrange basis polynomials for the chosen nodes, then the difference between the perturbed and unperturbed interpolants is \begin{align*} \sum_{i=0}^{n}(f(x_i)+\delta_i)\ell_i(x)-\sum_{i=0}^{n}f(x_i)\ell_i(x)=\sum_{i=0}^{n}\delta_i\ell_i(x). \end{align*} For each $x\in[-1,1]$, \begin{align*} \left|\sum_{i=0}^{n}\delta_i\ell_i(x)\right|\le \sum_{i=0}^{n}|\delta_i|\,|\ell_i(x)|\le \varepsilon\sum_{i=0}^{n}|\ell_i(x)|. \end{align*} Taking the supremum over $x$ gives \begin{align*} \|\widetilde I_nf-I_nf\|_\infty\le \varepsilon\Lambda_n. \end{align*} For equispaced nodes this upper bound is of exponential size in $n$, while for Chebyshev-type nodes it is of size $\varepsilon\log n$ up to constants. The same stability factor enters the approximation error. If $p\in\mathcal P_n$, then $I_np=p$, so \begin{align*} f-I_nf=(f-p)-I_n(f-p). \end{align*} Taking sup norms and using $\|I_n\|=\Lambda_n$ gives \begin{align*} \|f-I_nf\|_\infty\le \|f-p\|_\infty+\Lambda_n\|f-p\|_\infty=(1+\Lambda_n)\|f-p\|_\infty. \end{align*} After taking the infimum over $p\in\mathcal P_n$, the best polynomial approximation error is multiplied by $1+\Lambda_n$. Therefore a smooth function with endpoint layers or nearby complex singularities may still interpolate poorly on equispaced nodes, because the exponentially large stability factor can dominate the analytic approximation error. Chebyshev nodes cluster near $-1$ and $1$, which keeps the interpolation operator much smaller and reduces the endpoint oscillations associated with large Lagrange basis functions. [/example] The node set is only one source of numerical sensitivity. Even after choosing a good approximation space and good nodes, the coordinates used to store and evaluate the approximation can turn a moderate function error into a large coefficient error. The monomial basis gives the standard obstruction: two high-degree polynomials can have very different coefficient vectors while remaining close as functions on a short interval, making Vandermonde systems sensitive to roundoff. We therefore need a vocabulary for basis-level stability that records the conditioning of the coefficient-to-function map. [definition: Stable Basis] Let $V_N\subset X$ be a finite-dimensional normed function space with basis $\phi_0,\dots,\phi_N$. The coefficient-to-function map associated to this basis is \begin{align*} T_N:\mathbb R^{N+1}\to V_N,\qquad T_N(c)=\sum_{j=0}^{N}c_j\phi_j. \end{align*} Given a norm $|\cdot|_C$ on $\mathbb R^{N+1}$, the basis is stable for the norm $\|\cdot\|_X$ if there exist constants $A_N,B_N>0$ such that \begin{align*} A_N |c|_C\le \|T_Nc\|_X\le B_N |c|_C\qquad\text{for all }c\in\mathbb R^{N+1}, \end{align*} and the condition ratio $B_N/A_N$ remains bounded or grows at an explicitly controlled rate as $N$ increases. [/definition] Stability is not a property of the span alone; it is a property of the coordinates used by the algorithm. Orthogonal polynomials, barycentric interpolation formulas, and B-spline bases are standard ways to keep coordinate maps well conditioned. [remark: Stable Coordinates in Approximation Algorithms] The monomial basis $1,x,x^2,\dots,x^N$ is poorly conditioned on many intervals for large $N$, especially when interpolation or least-squares matrices are formed directly. Chebyshev and Legendre bases improve conditioning by using near-orthogonality, barycentric formulas evaluate interpolants without explicitly solving a Vandermonde system, and B-splines preserve locality because each basis function has small support. These coordinate choices do not change the best approximation error, but they strongly affect the error observed in floating-point arithmetic. [/remark] The practical lesson of this chapter is that approximation theory supplies two estimates, not one. The analytic estimate measures how well the chosen space can approximate the target, while the conditioning estimate measures whether the numerical representation preserves that accuracy. ## Beyond and Connections Approximation theory sits at the boundary between analysis, numerical computation, and geometry. The compactness principles behind best approximation connect naturally to [Hilbert Space](/page/Hilbert%20Space), [Banach Space](/page/Banach%20Space), [Convex Geometry](/page/Convex%20Geometry), and [Weak Convergence](/page/Weak%20Convergence): projection theorems, separating hyperplanes, and weak compactness explain why minimizers exist and why error functionals certify optimality. The Hilbert-space part of the subject points toward [Fourier Series](/page/Fourier%20Series) and spectral methods, where orthogonal expansions turn approximation into coefficient estimates and where smoothness becomes visible through decay rates. The polynomial chapters connect the same ideas to [Numerical Analysis](/page/Numerical%20Analysis). Chebyshev expansions, Bernstein ellipses, and interpolation stability are the analytic background for quadrature, collocation, and spectral discretizations. Rational approximation leads toward [Continued Fractions](/page/Continued%20Fractions), model reduction, and Krylov methods, while spline and finite element estimates connect local approximation to meshes, [Sobolev Space](/page/Sobolev%20Space), and practical error bounds for differential equations. Several Androma notes continue these threads from different viewpoints. The local theory of norms and minimizers is closest to [$L^p$ Spaces](/page/%24L%5Ep%24%20Spaces), [Normed Vector Space](/page/Normed%20Vector%20Space), and [Linear Operators on Banach Spaces](/page/Linear%20Operators%20on%20Banach%20Spaces). The expansion viewpoint is closest to [Orthogonal System](/page/Orthogonal%20System), [Orthonormal Basis](/page/Orthonormal%20Basis), and [Fourier Series (Trigonometric)](/page/Fourier%20Series%20%28Trigonometric%29). The smoothness-scale results point toward [Sobolev Space](/page/Sobolev%20Space), [Weak Derivative](/page/Weak%20Derivative), and [Weierstrass Approximation Theorem](/theorems/480), where direct and inverse theorems become a language for measuring regularity rather than only proving convergence. ## References - Androma, [Hilbert Space](/page/Hilbert%20Space). - Androma, [Banach Space](/page/Banach%20Space). - Androma, [Weak Convergence](/page/Weak%20Convergence). - Androma, [Fourier Series](/page/Fourier%20Series). - Androma, [Numerical Analysis](/page/Numerical%20Analysis). - Androma, [Sobolev Space](/page/Sobolev%20Space). - Androma, [$L^p$ Spaces](/page/%24L%5Ep%24%20Spaces). - Androma, [Weierstrass Approximation Theorem](/theorems/480).

Created by admin on 6/12/2026 | Last updated on 6/12/2026

What brings you to Androma?

Start with a route through the knowledge graph.

Approximation Theory

Sign in to Androma

Check your inbox

One last step

Approximation Theory

Prerequisites (0/9 completed)

Prerequisites Graph

Rate this page