# Introduction
Analysis, at its core, is the study of limiting processes. Every major construction in the subject — derivatives, integrals, [power series](/page/Power%20Series), Fourier expansions — is defined as a limit of some kind. A derivative is a limit of difference quotients; a [Riemann integral](/page/Riemann%20Integral) is a limit of sums; a power series is a limit of partial sums of [functions](/page/Function). The central difficulty of analysis is that these limiting processes do not always behave as one might hope: the limit of a sequence of continuous functions need not be continuous, the integral of a limit need not equal the limit of the integrals, and the derivative of a limit need not equal the limit of the derivatives.
This tension — between the desire to pass [limits](/page/Limit) through operations and the fact that doing so is not always valid — drives the entire theory developed in these notes. The resolution will come in stages: first, by identifying *modes of convergence* strong enough to guarantee that limits commute with the operations we care about; later, by developing abstract frameworks ([metric spaces](/page/Metric%20Space), [topological](/page/Topology) spaces) in which the relevant notions of convergence, [continuity](/page/Continuity), and compactness can be formulated cleanly and applied to function spaces themselves.
## The Fundamental Problem
To see why care is needed, consider the most natural notion of convergence for a sequence of functions.
[definition: Pointwise Convergence]
Let $E$ be a set and let $f_n: E \to \mathbb{R}$ be a sequence of functions. We say $(f_n)$ **converges pointwise** to $f: E \to \mathbb{R}$ if for every $x \in E$,
\begin{align*}
\lim_{n \to \infty} f_n(x) = f(x).
\end{align*}
That is, for each fixed $x \in E$ and each $\varepsilon > 0$, there exists $N = N(x, \varepsilon) \in \mathbb{N}$ such that $|f_n(x) - f(x)| < \varepsilon$ for all $n \ge N$.
[/definition]
The notation $N = N(x, \varepsilon)$ is deliberate — it signals that the rate of convergence may vary wildly from point to point. This dependence on $x$ is the source of every pathology that follows.
[example: Continuity Destroyed by Pointwise Limits]
Define $f_n: [0, 1] \to \mathbb{R}$ by $f_n(x) = x^n$. Each $f_n$ is continuous (indeed, a polynomial), and the sequence converges pointwise to
\begin{align*}
f(x) = \begin{cases} 0 & \text{if } 0 \le x < 1, \\ 1 & \text{if } x = 1. \end{cases}
\end{align*}
The limit $f$ is discontinuous at $x = 1$. The mechanism is clear: at any fixed $x < 1$, the convergence $x^n \to 0$ is exponentially fast. At $x = 1/2$, for instance, $f_n(1/2) = 2^{-n} < \varepsilon$ once $n > \log_2(1/\varepsilon)$. But as $x$ approaches $1$, the convergence slows without bound. At $x = 1 - 1/n$, we have $f_n(x) = (1 - 1/n)^n \to 1/e \approx 0.368$, so $f_n(x)$ is still far from the limit $0$ when $n$ is of order $1/(1-x)$. No single $N$ controls the error uniformly across the interval: the convergence is fast on $[0, 1/2]$ and arbitrarily slow as $x \to 1^-$, while at $x = 1$ itself the sequence is constant and converges trivially to $1$.
[/example]
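To make the dependence $N = N(x, \varepsilon)$ concrete, the following is a minimal Python sketch that computes, for $f_n(x) = x^n$, the smallest index at which $x^n < \varepsilon$. The helper name `threshold` and the sample points are illustrative choices of ours, not part of the formal development.

```python
import math


def threshold(x: float, eps: float) -> int:
    """Smallest n >= 1 with x**n < eps, assuming 0 < x < 1 and 0 < eps < 1."""
    # Closed-form guess from x**n < eps  <=>  n > log(eps) / log(x).
    n = max(1, math.ceil(math.log(eps) / math.log(x)))
    # Correct for floating-point rounding in the logarithm quotient.
    while x ** n >= eps:
        n += 1
    while n > 1 and x ** (n - 1) < eps:
        n -= 1
    return n


eps = 0.01
for x in (0.5, 0.9, 0.99, 0.999):
    print(f"x = {x}: N(x, eps) = {threshold(x, eps)}")
```

The thresholds grow without bound as the sample point moves toward $1$, which is the numerical face of the statement that no single $N$ serves the whole interval.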
This example is not an isolated curiosity. The failure of pointwise convergence to preserve continuity is the rule rather than the exception. Far worse pathologies exist: pointwise limits of continuous functions can be discontinuous at every point of a dense set (though not at *every* point — the set of continuity points is always a dense $G_\delta$ set, by the [Baire Category Theorem](/theorems/630), a result we will encounter when we study completeness in metric spaces).
The situation for integrals is, if anything, more dramatic.
[example: Integration Destroyed by Pointwise Limits]
Define $f_n: [0, 1] \to \mathbb{R}$ by
\begin{align*}
f_n(x) = \begin{cases} n^2 x & \text{if } 0 \le x \le 1/n, \\ n^2(2/n - x) & \text{if } 1/n < x \le 2/n, \\ 0 & \text{if } x > 2/n. \end{cases}
\end{align*}
Each $f_n$ is a triangular "spike" of height $n$ and base width $2/n$. For $n \ge 2$ the whole spike lies inside $[0, 1]$, so the area under it is $\int_0^1 f_n = \frac{1}{2} \cdot \frac{2}{n} \cdot n = 1$. But $f_n(0) = 0$ for every $n$, and $f_n(x) \to 0$ for every $x > 0$ (since $f_n(x) = 0$ once $n > 2/x$), so the pointwise limit is $f \equiv 0$. The limit of the integrals is $1$, but the integral of the limit is $0$:
\begin{align*}
\lim_{n \to \infty} \int_0^1 f_n \, dx = 1 \neq 0 = \int_0^1 \lim_{n \to \infty} f_n(x) \, dx.
\end{align*}
The mass concentrates at the origin and escapes to infinity in height, invisible to the pointwise limit. This is not a deficiency of the Riemann integral — the same failure occurs for Lebesgue integrals. The problem lies entirely with pointwise convergence.
[/example]
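The spike computation can be checked numerically. The sketch below is a rough illustration only: `spike` and `trapezoid` are hypothetical helper names, and the number of subintervals is chosen as a multiple of $n$ so that the kinks of the piecewise-linear spike fall on grid points.

```python
def spike(n: int, x: float) -> float:
    """The triangular spike f_n on [0, 1]: height n at x = 1/n, zero beyond 2/n."""
    if 0 <= x <= 1 / n:
        return n * n * x
    if 1 / n < x <= 2 / n:
        return n * n * (2 / n - x)
    return 0.0


def trapezoid(f, a: float, b: float, steps: int) -> float:
    """Composite trapezoid rule on [a, b] with `steps` equal subintervals."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


x_fixed = 0.1
for n in (2, 5, 10, 50):
    integral = trapezoid(lambda x: spike(n, x), 0.0, 1.0, 100 * n)
    print(f"n = {n:>2}: integral ~ {integral:.6f}, f_n({x_fixed}) = {spike(n, x_fixed):.3f}")
```

The integrals stay at $1$ while the value at any fixed sample point eventually drops to $0$ and stays there.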
## The Resolution: Uniform Convergence
Both examples above fail for the same structural reason: the convergence $f_n(x) \to f(x)$ is fast at some points and arbitrarily slow at others. The natural fix is to demand that the convergence happen at the same rate everywhere — that is, to require the existence of a single $N$ that works simultaneously for all $x$. This is the idea of **uniform convergence**, which we develop in full in the next chapter. The key results are:
- The **Uniform Limit Theorem**: if $f_n \to f$ uniformly and each $f_n$ is continuous, then so is $f$.
- **Interchange of limits and integrals**: if $f_n \to f$ uniformly on $[a, b]$ and each $f_n$ is Riemann integrable, then $\int f_n \to \int f$.
- **Interchange of limits and derivatives**: this requires not uniform convergence of $f_n$ but uniform convergence of $f_n'$ — a subtle but important distinction, since [differentiation](/page/Derivative) amplifies oscillations.
Uniform convergence is the right tool for many problems, particularly the convergence of power series within their radius of convergence. However, it is far from the end of the story.
## Beyond Uniform Convergence
Uniform convergence is a strong condition, and in many important situations it is too strong: limits and integrals can often be interchanged even when the convergence is far from uniform. What the spike example shows is that pointwise convergence by itself gives no control over $L^p$ norms, so some additional hypothesis is always needed. This leads, in more advanced courses, to the Lebesgue theory of integration and the notion of convergence in $L^p$ spaces, where the [Dominated Convergence Theorem](/theorems/4) and [Monotone Convergence Theorem](/theorems/509) provide the correct interchange results.
Even within the scope of these notes, we will need to go beyond concrete function spaces on $\mathbb{R}$. Many of the results about uniform convergence are most naturally stated in terms of the **uniform norm** $\|f\|_\infty = \sup_x |f(x)|$, which makes the space of bounded functions into a metric space. The Cauchy criterion for uniform convergence is precisely the assertion that this metric space is *complete*. To make such statements precise, and to handle convergence in more general settings (function spaces, spaces of [sequences](/page/Sequence), spaces of measures), we need the abstract theory of **metric spaces** and **topological spaces**.
## Overview of These Notes
The notes are organised around two interlocking themes: the theory of convergence for sequences and [series](/page/Series) of functions, and the abstract framework of metric and topological spaces that underpins it.
**Chapters 2–4** develop the theory of uniform convergence and its applications. We define uniform convergence precisely, prove the key interchange theorems, study series of functions (including the Weierstrass $M$-test and power series), and examine the interaction between uniform continuity and integration.
**Chapters 5–8** build the abstract framework. We introduce metric spaces and their basic properties (open and [closed sets](/page/Closed%20Set), convergence, continuity), then generalise to topological spaces. The two most important structural properties — connectedness and compactness — each receive their own chapter. Compactness, in particular, plays a central role: it is the property that allows us to extract convergent subsequences, and it underpins results from the Extreme Value Theorem to the Arzelà-Ascoli theorem.
**Chapters 9–11** turn to differentiation in several variables. We define the derivative of a map $f: \mathbb{R}^n \to \mathbb{R}^m$ as a linear map (the best linear approximation), study partial derivatives and their relationship to total differentiability, and develop the theory of second-order derivatives including the symmetry of mixed partials.
Throughout, the emphasis is on understanding *why* definitions take the form they do and *when* the key theorems apply, which means understanding the counterexamples that show what goes wrong when hypotheses are dropped.

---

The previous chapter identified the fundamental problem: pointwise convergence of functions does not preserve continuity, does not commute with integration, and does not commute with differentiation. The culprit in every case was that the rate of convergence $f_n(x) \to f(x)$ depended on $x$: convergence could be fast at some points and arbitrarily slow at others. Uniform convergence eliminates this by demanding a single rate that works across the entire domain.
This chapter develops the theory of uniform convergence from its definition through to its three major applications: preservation of continuity, interchange of limits and integrals, and — with a crucial modification — interchange of limits and derivatives. The chapter also introduces practical techniques for proving and disproving uniform convergence, since the definition itself (involving a supremum over the domain) can be difficult to verify directly.
[motivation]
### The Quantifier Swap
The difference between pointwise and uniform convergence is a single swap in the order of quantifiers. In pointwise convergence, the threshold $N$ may depend on both $\varepsilon$ and $x$:
\begin{align*}
\forall x \in E, \; \forall \varepsilon > 0, \; \exists N = N(x, \varepsilon): \quad n \geq N \implies |f_n(x) - f(x)| < \varepsilon.
\end{align*}
In uniform convergence, a single $N$ must work for all $x$ simultaneously:
\begin{align*}
\forall \varepsilon > 0, \; \exists N = N(\varepsilon): \quad n \geq N \implies \sup_{x \in E} |f_n(x) - f(x)| < \varepsilon.
\end{align*}
The second formulation makes the connection to the sup norm explicit: uniform convergence is convergence in the metric $d(f, g) = \|f - g\|_\infty$. This reformulation is not merely notational — it reveals that uniform convergence is the natural mode of convergence in the [Banach space](/page/Banach%20Space) of bounded functions, and that the Cauchy criterion for uniform convergence is simply the completeness of this space.
### Why the Three Theorems Differ
The three interchange theorems — for continuity, integration, and differentiation — have fundamentally different characters, and understanding *why* is as important as knowing the statements.
For **continuity**, the mechanism is the $\varepsilon/3$ argument: uniform convergence lets us "freeze" the approximation at a single index $N$ that works for all $x$, then exploit the continuity of $f_N$ at the point of interest. The argument would collapse under pointwise convergence, because different points would require different $N$.
For **integration**, the estimate $|\int (f_n - f)| \leq (b - a)\|f_n - f\|_\infty$ shows that the sup norm controls the $L^1$ norm directly. The spike example from §1 — where $\|f_n\|_\infty = n \to \infty$ but $f_n \to 0$ pointwise — shows exactly where this control fails without uniformity.
For **differentiation**, the situation is qualitatively different. Uniform convergence of the functions $f_n$ gives *no* control over the derivatives $f_n'$, because differentiation amplifies oscillations. The function $\sin(nx)/\sqrt{n}$ has amplitude $1/\sqrt{n}$ but derivative amplitude $\sqrt{n}$. The correct theorem requires uniform convergence of the *derivatives* $f_n'$, plus convergence of the functions at a single point — and its proof reduces the derivative problem to the integral problem via the [Fundamental Theorem of Calculus](/theorems/632).
### The Cauchy Criterion: Convergence Without Knowing the Limit
A recurring practical difficulty is that verifying $\|f_n - f\|_\infty \to 0$ requires knowing the limit $f$, which may be hard to identify. The Cauchy criterion resolves this: a sequence converges uniformly if and only if it is uniformly Cauchy, a condition that refers only to the $f_n$ themselves. The proof constructs $f$ pointwise using completeness of $\mathbb{R}$, then promotes pointwise convergence to uniform convergence by passing $m \to \infty$ in the Cauchy condition — a technique that recurs throughout analysis (in the Riesz–Fischer theorem for $L^p$ spaces, in the proof that $C[a,b]$ is complete, and in the Arzelà–Ascoli theorem).
[/motivation]
## Definition and First Properties
[definition: Uniform Convergence]
A sequence of functions $f_n: E \to \mathbb{R}$ **converges uniformly** to $f: E \to \mathbb{R}$ if for every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ such that
\begin{align*}
|f_n(x) - f(x)| < \varepsilon \quad \text{for all } n \ge N \text{ and all } x \in E.
\end{align*}
[/definition]
Uniform convergence $f_n \to f$ is precisely convergence in the **uniform norm** (or sup norm):
[definition: Uniform Norm]
For a bounded function $g: E \to \mathbb{R}$, the **uniform norm** is
\begin{align*}
\|g\|_\infty := \sup_{x \in E} |g(x)|.
\end{align*}
[/definition]
Thus $f_n \to f$ uniformly if and only if $\|f_n - f\|_\infty \to 0$. The uniform norm makes the space of bounded functions on $E$ into a metric space with $d(f, g) = \|f - g\|_\infty$, and the completeness of this metric space is equivalent to the Cauchy criterion.
### The Cauchy Criterion
[definition: Uniformly Cauchy Sequence]
A sequence $(f_n)$ of functions $f_n: E \to \mathbb{R}$ is **uniformly Cauchy** if for every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ such that
\begin{align*}
|f_n(x) - f_m(x)| < \varepsilon \quad \text{for all } m, n \ge N \text{ and all } x \in E.
\end{align*}
[/definition]
[quotetheorem:257]
[citeproof:257]
The forward direction routes $f_n - f_m$ through the limit $f$ via the triangle inequality, using the single $N$ from uniform convergence for both terms. The reverse direction is the more substantial part: it first constructs $f(x) = \lim_n f_n(x)$ pointwise (using completeness of $\mathbb{R}$), then promotes this to uniform convergence by fixing $n \geq N$ in the Cauchy condition $|f_n(x) - f_m(x)| < \varepsilon$ and passing $m \to \infty$. The key observation is that this inequality, which holds for all $x$ simultaneously, survives the limit $m \to \infty$ in the non-strict form $|f_n(x) - f(x)| \leq \varepsilon$, because the absolute value is continuous and limits preserve non-strict inequalities. This "fix one index, send the other to infinity" technique is the standard method for promoting Cauchy conditions to convergence throughout analysis.
## Preservation of Continuity
The first major payoff of uniform convergence is that it preserves continuity — the property that pointwise convergence destroys.
[quotetheorem:258]
[citeproof:258]
The proof is the paradigmatic $\varepsilon/3$ argument. To show $f$ is continuous at $x_0$, one writes
\begin{align*}
|f(x) - f(x_0)| \leq \underbrace{|f(x) - f_N(x)|}_{< \varepsilon/3} + \underbrace{|f_N(x) - f_N(x_0)|}_{< \varepsilon/3} + \underbrace{|f_N(x_0) - f(x_0)|}_{< \varepsilon/3}.
\end{align*}
The first and third terms are controlled by uniform convergence (choosing $N$ large enough, independently of $x$). The middle term is controlled by continuity of $f_N$ at $x_0$. The argument would collapse under pointwise convergence: different points would require different $N$, and the first term $|f(x) - f_N(x)|$ could not be bounded uniformly in $x$.
This $\varepsilon/3$ structure — splitting an error into "approximation error" and "regularity of the approximant" — pervades analysis. It appears in the Arzelà–Ascoli theorem, in [mollification](/page/Standard%20Mollifier) arguments, in the proof that $C^\infty$ functions are dense in $L^p$, and in every approximation theorem that transfers regularity from approximants to limits.
[example: Uniform Convergence on Compact Subsets vs. the Whole Domain]
Consider $f_n(x) = x/n$ on $\mathbb{R}$. Then $f_n(x) \to 0$ pointwise, but $\sup_{x \in \mathbb{R}} |x/n| = +\infty$, so the convergence is not uniform on $\mathbb{R}$.
However, on any bounded interval $[-a, a]$, we have $\|f_n\|_\infty = a/n \to 0$, so the convergence *is* uniform on $[-a, a]$. The limit $f = 0$ is continuous, as the [Uniform Limit Theorem](/theorems/258) applied on each $[-a, a]$ requires, although the theorem gives no new information here, since both the $f_n$ and $f$ are visibly continuous.
This illustrates a recurring theme: uniform convergence on the whole domain is often too much to ask, but uniform convergence on compact subsets frequently holds and suffices for applications. The distinction becomes important for power series, which converge uniformly on compact subsets of their disc of convergence but typically not on the full disc.
[/example]
### The Converse Fails
The [Uniform Limit Theorem](/theorems/258) does not reverse: a pointwise limit of continuous functions can be continuous without the convergence being uniform. For instance, $f_n(x) = x^n$ on $[0, 1)$ converges pointwise to $f = 0$, which is continuous on $[0, 1)$. But $\sup_{x \in [0,1)} x^n = 1$ for every $n$ (the supremum is approached but not attained), so the convergence is not uniform. Continuity of the limit is *necessary* for uniform convergence (by the [Uniform Limit Theorem](/theorems/258) applied contrapositively) but not sufficient.
## Interchange of Limits and Integrals
[quotetheorem:259]
[citeproof:259]
The proof has two parts. For [integrability](/page/Integral), the idea is to transfer oscillation control from a known integrable function $f_N$ (close to $f$ in the sup norm) to $f$ itself: on each subinterval of a partition, the oscillation of $f$ differs from that of $f_N$ by at most $2\|f - f_N\|_\infty$, so the Darboux-sum estimate for $f_N$ can be perturbed to an estimate for $f$. For the interchange, the key estimate is
\begin{align*}
\left|\int_a^b f_n(x) \, dx - \int_a^b f(x) \, dx\right| \leq (b-a)\|f_n - f\|_\infty,
\end{align*}
which shows that the sup norm controls the $L^1$ norm directly. This is why the spike example from §1 fails: the spikes have $\|f_n\|_\infty = n \to \infty$, so there is no uniform convergence, and the integral $\int |f_n| = 1$ does not tend to zero despite $f_n \to 0$ pointwise.
[example: Term-by-Term Integration of a Series]
Consider the series $\sum_{n=1}^\infty \frac{x^n}{n^2}$ on $[-1, 1]$. Each partial sum $S_N(x) = \sum_{n=1}^N x^n/n^2$ is a polynomial, hence continuous and Riemann integrable. Since $|x^n/n^2| \leq 1/n^2$ for $|x| \leq 1$, and $\sum 1/n^2 < \infty$, the Weierstrass $M$-test (proved in §3) gives uniform convergence on $[-1, 1]$. The [interchange theorem](/theorems/259) then justifies:
\begin{align*}
\int_0^1 \sum_{n=1}^\infty \frac{x^n}{n^2} \, dx = \sum_{n=1}^\infty \int_0^1 \frac{x^n}{n^2} \, dx = \sum_{n=1}^\infty \frac{1}{n^2(n+1)}.
\end{align*}
The partial fraction decomposition $\frac{1}{n^2(n+1)} = \frac{1}{n^2} - \left(\frac{1}{n} - \frac{1}{n+1}\right)$ splits the sum into $\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}$ and a telescoping series with sum $1$, giving the exact value $\sum_{n=1}^\infty \frac{1}{n^2(n+1)} = \frac{\pi^2}{6} - 1 \approx 0.6449$. Without the uniform convergence guarantee, this interchange would require separate justification (e.g., via the Monotone Convergence Theorem from Lebesgue theory).
[/example]
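As a quick sanity check on the closed form, one can sum the series numerically. This is only a sketch: the function name `partial_sum` and the cut-offs are arbitrary choices of ours.

```python
import math


def partial_sum(terms: int) -> float:
    """Sum of 1 / (n^2 (n + 1)) for n = 1, ..., terms."""
    return sum(1.0 / (n * n * (n + 1)) for n in range(1, terms + 1))


closed_form = math.pi ** 2 / 6 - 1
for terms in (10, 100, 10_000):
    s = partial_sum(terms)
    print(f"{terms:>6} terms: {s:.10f}   gap to pi^2/6 - 1: {closed_form - s:.2e}")
```

The gap shrinks roughly like $1/(2N^2)$, consistent with the size of the tail $\sum_{n > N} 1/(n^2(n+1))$.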
## The Failure for Derivatives
One might hope that uniform convergence also permits interchange of limits and derivatives. It does not — and the failure is not a minor technicality but a fundamental feature of differentiation.
[example: Derivatives Can Diverge Under Uniform Convergence]
Define $f_n: \mathbb{R} \to \mathbb{R}$ by $f_n(x) = \frac{\sin(nx)}{\sqrt{n}}$. Then $|f_n(x)| \leq 1/\sqrt{n}$ for all $x$, so $f_n \to 0$ uniformly on $\mathbb{R}$. The limit function $f = 0$ is differentiable with $f' = 0$ everywhere.
But $f_n'(x) = \sqrt{n} \cos(nx)$, and $|f_n'(0)| = \sqrt{n} \to \infty$. The derivatives not only fail to converge to $f' = 0$ — they diverge. The mechanism is clear: the function $\sin(nx)/\sqrt{n}$ has amplitude $1/\sqrt{n}$ (small) but frequency $n$ (large), and differentiating trades amplitude for frequency — bringing down a factor of $n$ that overwhelms the $1/\sqrt{n}$ decay.
More precisely, the derivative at any point $x_0$ satisfies $|f_n'(x_0)| = \sqrt{n}|\cos(nx_0)|$. For $x_0 = 0$, this is $\sqrt{n} \to \infty$. For $x_0 = \pi/(2n)$, we get $|f_n'(\pi/(2n))| = \sqrt{n}|\cos(\pi/2)| = 0$. The derivatives oscillate wildly in both $n$ and $x$, and no subsequence converges uniformly.
[/example]
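The contrast between the shrinking functions and the growing derivatives is easy to see numerically. The following sketch uses a finite grid on $[0, 2\pi]$ as a stand-in for the supremum over $\mathbb{R}$ (an assumption justified here by periodicity and by the bound $|\sin| \le 1$); the names `f`, `f_prime` and `grid` are illustrative.

```python
import math


def f(n: int, x: float) -> float:
    """f_n(x) = sin(n x) / sqrt(n)."""
    return math.sin(n * x) / math.sqrt(n)


def f_prime(n: int, x: float) -> float:
    """f_n'(x) = sqrt(n) cos(n x)."""
    return math.sqrt(n) * math.cos(n * x)


# Finite grid on one period; a grid maximum can only underestimate the true sup.
grid = [2 * math.pi * k / 2000 for k in range(2001)]

for n in (1, 4, 100, 10_000):
    grid_sup = max(abs(f(n, x)) for x in grid)
    print(f"n = {n:>5}: max |f_n| on grid = {grid_sup:.4f}, |f_n'(0)| = {abs(f_prime(n, 0.0)):.1f}")
```

The first column is bounded by $1/\sqrt{n}$ and tends to $0$, while the second equals $\sqrt{n}$ and diverges.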
[example: Smooth Functions Converging to a Non-Differentiable Limit]
Define $f_n: [-1, 1] \to \mathbb{R}$ by $f_n(x) = (x^2 + 1/n)^{1/2}$. Each $f_n$ is $C^\infty$ on $[-1, 1]$, and $f_n \to f$ uniformly where $f(x) = |x|$. Indeed:
\begin{align*}
|f_n(x) - |x|| = \sqrt{x^2 + 1/n} - |x| = \frac{1/n}{\sqrt{x^2 + 1/n} + |x|} \leq \frac{1/n}{1/\sqrt{n}} = \frac{1}{\sqrt{n}} \to 0,
\end{align*}
where the inequality uses $\sqrt{x^2 + 1/n} \geq \sqrt{1/n} = 1/\sqrt{n}$. So $\|f_n - f\|_\infty \leq 1/\sqrt{n}$ and the convergence is uniform. But $f(x) = |x|$ is not differentiable at $x = 0$.
The derivatives $f_n'(x) = x/\sqrt{x^2 + 1/n}$ converge pointwise to $\operatorname{sgn}(x)$ for $x \neq 0$, but the convergence is not uniform near $x = 0$: $f_n'$ transitions from $\approx -1$ to $\approx +1$ over an interval of width $\sim 1/\sqrt{n}$, so the "slope" of the transition region is of order $\sqrt{n}$. The derivatives converge *pointwise* but not *uniformly*, and the limit $\operatorname{sgn}$ is discontinuous — so the [Uniform Limit Theorem](/theorems/258) correctly predicts that the convergence of the derivatives cannot be uniform.
[/example]
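Both halves of this example can be checked with a short computation. The sketch below is illustrative only: it estimates $\|f_n - f\|_\infty$ on a grid containing $x = 0$, and it evaluates $f_n'$ at the witness points $x_n = 1/\sqrt{n}$, a choice of ours for which $f_n'(x_n) = 1/\sqrt{2}$ exactly.

```python
import math


def f(n: int, x: float) -> float:
    """f_n(x) = sqrt(x^2 + 1/n)."""
    return math.sqrt(x * x + 1.0 / n)


def f_prime(n: int, x: float) -> float:
    """f_n'(x) = x / sqrt(x^2 + 1/n)."""
    return x / math.sqrt(x * x + 1.0 / n)


grid = [-1 + 2 * k / 2000 for k in range(2001)]  # contains x = 0 at k = 1000

for n in (1, 100, 10_000):
    sup_err = max(f(n, x) - abs(x) for x in grid)
    x_n = 1 / math.sqrt(n)            # witness point, drifting toward 0
    gap = 1 - f_prime(n, x_n)         # distance from sgn(x_n) = 1
    print(f"n = {n:>5}: sup error ~ {sup_err:.4f} (1/sqrt(n) = {x_n:.4f}), "
          f"derivative gap at x_n = {gap:.4f}")
```

The sup error matches $1/\sqrt{n}$, attained at $x = 0$, while the derivative gap stays at $1 - 1/\sqrt{2} \approx 0.293$ for every $n$, so the derivatives cannot converge uniformly to $\operatorname{sgn}$ on $[-1, 1] \setminus \{0\}$.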
## The Correct Theorem for Derivatives
To interchange limits and derivatives, we need uniform convergence of the *derivatives*, plus convergence of the functions at a single point. The intuition is that a function is determined (up to a constant) by its derivative, so controlling the derivatives uniformly and pinning down one value suffices.
[quotetheorem:260]
[citeproof:260]
The proof strategy is elegant: rather than arguing directly about difference quotients, it reduces the derivative problem to the integral problem via the Fundamental Theorem of Calculus. Each $f_n$ is represented as
\begin{align*}
f_n(x) = f_n(c) + \int_c^x f_n'(t) \, dt.
\end{align*}
The integral term converges uniformly because $f_n' \to g$ uniformly and $|\int_c^x (f_n' - g)| \leq (b-a)\|f_n' - g\|_\infty$. The constant term $f_n(c) \to L$ by hypothesis. Adding them, $f_n \to f$ uniformly where $f(x) = L + \int_c^x g(t) \, dt$. Since $g$ is continuous (by the [Uniform Limit Theorem](/theorems/258) applied to the derivatives), the Fundamental Theorem gives $f' = g$.
Both hypotheses are necessary. The convergence at a single point pins down the limiting constant — without it, $f_n$ could drift by different amounts. The uniform convergence of $f_n'$ provides the real engine: it controls the integral terms, and would fail with merely pointwise convergence of $f_n'$.
[example: Differentiability of Power Series]
Let $f(x) = \sum_{n=0}^\infty a_n x^n$ be a power series with radius of convergence $R > 0$. On any interval $[-r, r]$ with $0 < r < R$, the derived series $\sum n a_n x^{n-1}$ converges uniformly. To see this, note that for $|x| \leq r$:
\begin{align*}
|n a_n x^{n-1}| \leq n |a_n| r^{n-1}.
\end{align*}
Pick $\rho$ with $r < \rho < R$. Since $\sum |a_n| \rho^n$ converges (as $\rho$ is inside the radius), its terms are bounded: there is a constant $C$ with $|a_n| \rho^n \leq C$ for all $n$, that is, $|a_n| \leq C/\rho^n$. Hence $n|a_n|r^{n-1} \leq n C \rho^{-n} r^{n-1} = (C/\rho)\, n (r/\rho)^{n-1}$. Since $r/\rho < 1$, the series $\sum n(r/\rho)^{n-1}$ converges (by the ratio test), so the Weierstrass $M$-test gives uniform convergence of the derived series on $[-r, r]$.
The partial sums $S_N(x) = \sum_{n=0}^N a_n x^n$ are polynomials, hence $C^\infty$, and $S_N(0) = a_0$ converges. The [interchange theorem for derivatives](/theorems/260) gives:
\begin{align*}
f'(x) = \sum_{n=1}^\infty n a_n x^{n-1} \quad \text{for all } |x| < R.
\end{align*}
By induction, $f$ is infinitely differentiable and all derivatives can be computed term by term. Evaluating at $x = 0$ gives $f^{(k)}(0) = k! \, a_k$, so the coefficients satisfy the Taylor formula $a_n = f^{(n)}(0)/n!$ — a rigidity result showing that power series coefficients are uniquely determined by the function's local behaviour near the centre.
[/example]
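For a concrete instance, take $a_n = 1$, so that $f(x) = 1/(1-x)$ with $R = 1$ and the derived series should sum to $1/(1-x)^2$. The sketch below estimates the sup-norm error of the derived partial sums on $[-r, r]$ with $r = 1/2$; the choice of series, the grid, and the names `derived_partial_sum` and `sup_error` are assumptions made for illustration.

```python
def derived_partial_sum(x: float, terms: int) -> float:
    """Sum of n * x**(n - 1) for n = 1..terms: the term-by-term derivative of sum x^n."""
    return sum(n * x ** (n - 1) for n in range(1, terms + 1))


def sup_error(terms: int, r: float = 0.5, points: int = 200) -> float:
    """Grid estimate of sup over [-r, r] of |derived partial sum - 1/(1 - x)^2|."""
    grid = [-r + 2 * r * k / points for k in range(points + 1)]
    return max(abs(derived_partial_sum(x, terms) - 1.0 / (1.0 - x) ** 2) for x in grid)


for terms in (5, 20, 60):
    print(f"{terms:>2} terms: sup error on [-0.5, 0.5] ~ {sup_error(terms):.3e}")
```

The error decays geometrically in the number of terms, as the bound $(C/\rho)\, n (r/\rho)^{n-1}$ suggests it should.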
## Techniques for Proving Uniform Convergence
### Direct Estimation
The most elementary method is to bound $\|f_n - f\|_\infty$ directly. This requires knowing the limit function $f$.
[example: Direct Estimation for a Geometric Series]
Let $f_n(x) = \sum_{k=0}^n x^k = \frac{1 - x^{n+1}}{1 - x}$ on $[0, a]$ where $0 < a < 1$. The limit is $f(x) = 1/(1-x)$, and:
\begin{align*}
|f_n(x) - f(x)| = \frac{x^{n+1}}{1-x} \leq \frac{a^{n+1}}{1-a} \to 0.
\end{align*}
The bound is independent of $x$, so $f_n \to f$ uniformly on $[0, a]$. On $[0, 1)$ itself, $\sup_{x \in [0,1)} |f_n(x) - f(x)| = +\infty$ (the supremum diverges as $x \to 1^-$), so the convergence is not uniform on $[0, 1)$.
[/example]
### The Sup-Norm Test via Calculus
When $f_n - f$ is a differentiable function of $x$, we can find $\|f_n - f\|_\infty$ by locating the maximum using calculus.
[example: Sup-Norm via Critical Points]
Let $f_n: [0, 1] \to \mathbb{R}$ be defined by $f_n(x) = nx e^{-nx}$. Then $f_n(x) \to 0$ for each fixed $x > 0$ (since $t e^{-t} \to 0$ as $t = nx \to \infty$), and $f_n(0) = 0$, so the pointwise limit is $f = 0$. Is the convergence uniform?
Differentiate: $f_n'(x) = n e^{-nx}(1 - nx) = 0$ when $x = 1/n$. The second derivative $f_n''(1/n) = -n^2 e^{-1} < 0$ confirms this is a maximum. Evaluating:
\begin{align*}
f_n(1/n) = n \cdot \frac{1}{n} \cdot e^{-1} = e^{-1} \approx 0.368.
\end{align*}
Since $f_n(0) = 0$ and $f_n(1) = ne^{-n} \to 0$, the maximum of $f_n$ on $[0, 1]$ is $e^{-1}$ for every $n$. Therefore:
\begin{align*}
\|f_n - f\|_\infty = \|f_n\|_\infty = e^{-1} \not\to 0.
\end{align*}
The convergence is **not** uniform. The functions form a "sliding bump" — the peak has constant height $1/e$ but slides toward the origin as $n \to \infty$. At any fixed $x > 0$, the bump eventually passes, so pointwise convergence holds; but the maximum never decreases, so uniform convergence fails.
[/example]
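The sliding bump is simple to tabulate. In this sketch (with `bump` an illustrative name and $x = 0.5$ an arbitrary fixed sample point) the peak value is compared with the value at a fixed point.

```python
import math


def bump(n: int, x: float) -> float:
    """f_n(x) = n x exp(-n x)."""
    return n * x * math.exp(-n * x)


for n in (1, 10, 100, 1000):
    peak = bump(n, 1.0 / n)      # value at the critical point x = 1/n
    fixed = bump(n, 0.5)         # value at a fixed point of (0, 1]
    print(f"n = {n:>4}: f_n(1/n) = {peak:.6f}, f_n(0.5) = {fixed:.3e}")
```

The peak column stays at $e^{-1} \approx 0.368$, while the fixed-point column collapses to $0$.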
### Negating Uniform Convergence
To prove that convergence is *not* uniform, it suffices to exhibit a sequence $x_n \in E$ (possibly depending on $n$) such that $|f_n(x_n) - f(x_n)| \not\to 0$. This works because $\|f_n - f\|_\infty \geq |f_n(x_n) - f(x_n)|$ for any choice of $x_n$.
[example: The Standard Counterexample Revisited]
For $f_n(x) = x^n$ on $[0, 1)$ with limit $f = 0$: choose $x_n = (1/2)^{1/n}$, so that $f_n(x_n) = ((1/2)^{1/n})^n = 1/2$. Then $|f_n(x_n) - f(x_n)| = 1/2$ for all $n$, confirming non-uniform convergence. Note that $x_n \to 1$ as $n \to \infty$ — the "witness" for non-uniformity migrates toward the boundary, where the convergence is slowest.
[/example]
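The witness sequence can be tabulated directly. A minimal sketch, with `witness` a hypothetical helper name:

```python
def witness(n: int) -> float:
    """x_n = (1/2)**(1/n), chosen so that x_n**n = 1/2."""
    return 0.5 ** (1.0 / n)


for n in (1, 10, 100, 1000):
    x_n = witness(n)
    print(f"n = {n:>4}: x_n = {x_n:.6f}, f_n(x_n) = {x_n ** n:.6f}")
```

The points $x_n$ stay inside $[0, 1)$ and creep toward $1$, while $f_n(x_n)$ remains at $1/2$ up to rounding.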
## Uniform Convergence on Compact Subsets
In practice, one frequently encounters sequences that converge uniformly on every compact subset of the domain but not on the whole domain. This is common enough to deserve a name.
[definition: Locally Uniform Convergence]
A sequence $f_n: U \to \mathbb{R}$ (where $U \subseteq \mathbb{R}$ is open) **converges locally uniformly** (or **uniformly on compact subsets**) to $f$ if for every compact $K \subset U$, $f_n \to f$ uniformly on $K$.
[/definition]
Locally uniform convergence still preserves continuity (since continuity is a local property — apply the [Uniform Limit Theorem](/theorems/258) on each compact neighbourhood) and permits interchange of limits with integrals over compact intervals. It is the natural mode of convergence for power series: if $\sum a_n z^n$ has radius of convergence $R$, the partial sums converge uniformly on $\{|z| \leq r\}$ for every $r < R$, but not on $\{|z| < R\}$ in general.
[example: Local but Not Global Uniform Convergence]
The sequence $f_n(x) = e^{-x/n}$ on $[0, \infty)$ converges pointwise to $f \equiv 1$. On any compact interval $[0, a]$:
\begin{align*}
\|f_n - 1\|_{\infty, [0,a]} = \sup_{x \in [0,a]} |e^{-x/n} - 1| = 1 - e^{-a/n}.
\end{align*}
Since $e^{-a/n} = 1 - a/n + O(1/n^2)$, this supremum is $a/n + O(1/n^2) \to 0$, so the convergence is uniform on $[0, a]$.
However, on $[0, \infty)$:
\begin{align*}
\|f_n - 1\|_{\infty, [0,\infty)} = \sup_{x \geq 0} (1 - e^{-x/n}) = 1
\end{align*}
for every $n$ (the supremum is approached as $x \to \infty$ but never attained). The convergence is locally uniform but not globally uniform.
[/example]
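Numerically, the two behaviours look as follows. This is a sketch under our own choices: $a = 5$ for the compact interval and the moving points $x_n = 20n$ as witnesses on $[0, \infty)$; `gap` is an illustrative name, and `math.expm1` is used only to avoid cancellation when $x/n$ is small.

```python
import math


def gap(n: int, x: float) -> float:
    """|exp(-x/n) - 1| = 1 - exp(-x/n) for x >= 0."""
    return -math.expm1(-x / n)


a = 5.0
for n in (1, 10, 100, 1000):
    local_sup = gap(n, a)            # sup over [0, a], attained at x = a
    far_value = gap(n, 20.0 * n)     # value at the moving point x_n = 20 n
    print(f"n = {n:>4}: sup on [0, {a}] = {local_sup:.5f}, gap at x = 20n: {far_value:.9f}")
```

The first column tends to $0$ like $a/n$; the second is $1 - e^{-20}$ for every $n$, so the supremum over $[0, \infty)$ cannot tend to $0$.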
## Worked Problem
[problem]
Define $f_n: [0, 1] \to \mathbb{R}$ by $f_n(x) = \frac{x^n(1-x)}{1 + x^n}$. Show that $(f_n)$ converges uniformly on $[0, 1]$, identify the limit, and verify that the [Uniform Limit Theorem](/theorems/258) correctly predicts the continuity of the limit.
[/problem]
[solution]
**Step 1: Identify the pointwise limit.** For $x \in [0, 1)$, $x^n \to 0$, so
\begin{align*}
f_n(x) = \frac{x^n(1-x)}{1 + x^n} \to \frac{0 \cdot (1-x)}{1 + 0} = 0.
\end{align*}
At $x = 1$: $f_n(1) = \frac{1 \cdot 0}{1 + 1} = 0$ for all $n$. Therefore the pointwise limit is $f \equiv 0$ on $[0, 1]$.
**Step 2: Bound $\|f_n\|_\infty$ by finding the maximum.** Since $f_n(0) = 0$, $f_n(1) = 0$, and $f_n(x) \geq 0$ on $[0, 1]$, the maximum occurs at an interior critical point. Differentiating via the quotient rule:
\begin{align*}
f_n'(x) = \frac{(nx^{n-1}(1-x) - x^n)(1 + x^n) - x^n(1-x) \cdot nx^{n-1}}{(1 + x^n)^2}.
\end{align*}
Rather than solve this exactly, we use a simpler bound. Since $0 \leq 1 - x \leq 1$ and $1 + x^n \geq 1$ on $[0, 1]$:
\begin{align*}
0 \leq f_n(x) = \frac{x^n(1-x)}{1 + x^n} \leq x^n(1-x).
\end{align*}
The function $g(x) = x^n(1-x)$ on $[0, 1]$ has $g'(x) = x^{n-1}(n - (n+1)x) = 0$ at $x_0 = n/(n+1)$, giving
\begin{align*}
g(x_0) = \left(\frac{n}{n+1}\right)^n \cdot \frac{1}{n+1}.
\end{align*}
Using $(1 - 1/(n+1))^n \leq e^{-n/(n+1)} < 1$, we get
\begin{align*}
\|f_n\|_\infty \leq g(x_0) \leq \frac{1}{n+1}.
\end{align*}
**Step 3: Conclude uniform convergence.** Since $\|f_n - f\|_\infty = \|f_n\|_\infty \leq \frac{1}{n+1} \to 0$, the convergence $f_n \to 0$ is uniform on $[0, 1]$.
**Step 4: Verify the prediction of the Uniform Limit Theorem.** Each $f_n$ is continuous on $[0, 1]$ (as a ratio of continuous functions with non-vanishing denominator $1 + x^n \geq 1$). The [Uniform Limit Theorem](/theorems/258) predicts that the uniform limit $f \equiv 0$ is continuous — which it trivially is.
The interest of this example lies in the contrast with the sequence $x \mapsto x^n$, which also converges pointwise to $0$ on $[0, 1)$ but *not* uniformly (since $\sup_{x \in [0,1)} x^n = 1$ for every $n$). The factor $(1-x)$ in the numerator of $f_n$ suppresses the values near $x = 1$, where $x^n$ is large, and the denominator $1 + x^n \geq 1$ can only reduce them further, producing a sequence that converges uniformly despite the presence of a "dangerous" $x^n$ factor.
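As a numerical sanity check on the bound $\|f_n\|_\infty \leq 1/(n+1)$ from Step 2, one can compare it with a grid maximum. This is a sketch only: the grid resolution and the name `grid_max` are our choices, and a grid maximum can only underestimate the true supremum.

```python
def f(n: int, x: float) -> float:
    """f_n(x) = x^n (1 - x) / (1 + x^n) on [0, 1]."""
    return x ** n * (1 - x) / (1 + x ** n)


grid = [k / 10_000 for k in range(10_001)]


def grid_max(n: int) -> float:
    """Largest value of f_n over the grid."""
    return max(f(n, x) for x in grid)


for n in (1, 5, 50, 500):
    print(f"n = {n:>3}: grid max = {grid_max(n):.6f}, bound 1/(n+1) = {1 / (n + 1):.6f}")
```

The grid maxima sit comfortably below the bound and tend to $0$ with it.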
[/solution]

---

Having established the notion of uniform convergence for sequences of functions, we now turn to the natural extension: series of functions. Just as numerical series are built from sequences via partial sums, a series of functions $\sum g_n$ is understood through its sequence of partial sums $f_n = \sum_{j=1}^n g_j$. The central question becomes: under what conditions does such a series inherit desirable analytic properties (continuity, integrability, differentiability) from its terms?
Uniform convergence of the partial sums is a powerful sufficient condition, but it is often difficult to verify directly from the definition, because doing so requires knowing the limit function. This motivates the search for practical criteria that can be checked using only the terms $g_n$ themselves, without reference to the sum. The Weierstrass $M$-test is the most important such criterion, and its scope and limitations shape the entire theory.
[motivation]
### From Sequences to Series
The passage from sequences to series of functions may seem routine — after all, $\sum g_n$ is just the sequence of partial sums $S_N = \sum_{j=1}^N g_j$, and every result about uniform convergence of sequences applies. But the series setting introduces a new structural feature: the terms $g_n$ are given individually, whereas the partial sums $S_N$ are cumulative. This distinction is critical because in practice one controls the terms (via bounds $|g_n(x)| \leq M_n$), not the partial sums directly.
### The Need for Practical Criteria
Consider the problem of showing that $\sum_{n=1}^\infty \frac{\sin(nx)}{n^2}$ converges uniformly on $\mathbb{R}$. The partial sums $S_N(x) = \sum_{n=1}^N \frac{\sin(nx)}{n^2}$ have no closed form, so computing $\sup_x |S_N(x) - f(x)|$ directly is hopeless — we do not even know $f$ explicitly. What we *can* do is bound each term: $|\sin(nx)/n^2| \leq 1/n^2$, and $\sum 1/n^2 < \infty$. The Weierstrass $M$-test converts this term-by-term bound into a conclusion about the partial sums. This reduction — from a function-theoretic question to a numerical one — is the key idea of the section.
### Absolute vs. Conditional Uniform Convergence
A subtler issue emerges when we ask whether the convergence is absolute. The $M$-test produces absolute uniform convergence (since it bounds $|g_n|$), but many important series converge uniformly without converging absolutely. The alternating series $\sum (-1)^n x^n / n$ on $[0,1]$ is a case in point: it converges uniformly (by Dirichlet's test), but the series of absolute values $\sum x^n/n$ diverges at $x = 1$. Understanding the logical relationships between pointwise, uniform, absolute, and absolute uniform convergence, and seeing exactly which implications hold and which fail, is essential for choosing the right tool in applications.
[/motivation]
## Core Definitions
To study infinite sums of functions rigorously, we need to distinguish several modes of convergence. The definitions below make precise the relationships that the motivation above outlined informally.
[definition: Convergence of a Series of Functions]
Let $E$ be a set and $g_n: E \to \mathbb{R}$ a sequence of functions. The series $\sum_{n=1}^\infty g_n$ **converges at a point** $x \in E$ if the sequence of partial sums $S_N(x) = \sum_{j=1}^N g_j(x)$ converges as a numerical sequence. The series **converges uniformly on $E$** if the sequence $(S_N)$ converges uniformly on $E$.
[/definition]
This definition reduces the study of series to the study of sequences of partial sums, so every result from the theory of uniform convergence — the Cauchy criterion, the preservation of continuity, the interchange of limit and integral — applies immediately. The distinction in practice, however, is significant: for sequences one typically controls $\|f_n - f\|_\infty$ directly, whereas for series one usually bounds the individual terms $\|g_n\|_\infty$ and argues via summation.
[definition: Absolute Convergence of a Series of Functions]
The series $\sum_{n=1}^\infty g_n$ **converges absolutely at** $x \in E$ if the numerical series $\sum_{n=1}^\infty |g_n(x)|$ converges.
[/definition]
Absolute convergence at a point implies ordinary convergence at that point, and this is a consequence of the completeness of $\mathbb{R}$: the triangle inequality gives $|S_N(x) - S_M(x)| \leq \sum_{j=M+1}^N |g_j(x)|$ for $N > M$, so absolute convergence of the terms implies that the partial sums $S_N(x)$ are Cauchy and therefore converge. However, absolute convergence is a pointwise property: it says nothing about whether the rate of convergence is uniform across $E$.
[definition: Absolute Uniform Convergence]
The series $\sum_{n=1}^\infty g_n$ **converges absolutely uniformly on $E$** if the series of absolute values $\sum_{n=1}^\infty |g_n|$ converges uniformly on $E$.
[/definition]
This is the strongest mode of convergence considered here. It demands simultaneous control over the magnitude of the terms and the uniformity of the convergence — precisely the combination that the Weierstrass $M$-test is designed to establish.
## Relationships Between Convergence Types
### Absolute Convergence and Pointwise Convergence
A fundamental fact from real analysis carries over to the function setting without modification. The result below connects absolute convergence to the Cauchy criterion and completeness of $\mathbb{R}$, and serves as the starting point for the more refined convergence tests that follow.
[quotetheorem:275]
The proof is a direct application of the triangle inequality: if the partial sums of $\sum |g_n(x)|$ are Cauchy, then the partial sums of $\sum g_n(x)$ are Cauchy as well, and completeness of $\mathbb{R}$ delivers convergence. The converse fails — and understanding this failure is essential, because the Weierstrass $M$-test works by establishing absolute convergence, and series that converge only conditionally require entirely different techniques.
[citeproof:275]
[example: Conditional Convergence of the Alternating Harmonic Series]
The alternating harmonic series $\sum_{n=1}^\infty (-1)^{n-1} / n$ converges to $\ln 2$ by the [alternating series test](/theorems/177), but $\sum_{n=1}^\infty 1/n$ diverges. This series converges conditionally, not absolutely. The cancellation between positive and negative terms is essential — and this cancellation is exactly what the Weierstrass $M$-test cannot detect, since it bounds $|g_n|$ and discards all sign information.
[/example]
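The contrast in this example can be seen numerically. The following is a minimal sketch (the helper names are hypothetical) that computes partial sums of the alternating series and of the harmonic series.

```python
import math


def alternating_partial(N: int) -> float:
    """Partial sum of sum_{n=1}^N (-1)**(n-1) / n."""
    return sum((-1) ** (n - 1) / n for n in range(1, N + 1))


def harmonic_partial(N: int) -> float:
    """Partial sum of sum_{n=1}^N 1 / n."""
    return sum(1 / n for n in range(1, N + 1))


for N in (10, 1000, 100000):
    # The first column of values settles near ln 2; the second keeps growing like ln N.
    print(N, alternating_partial(N), harmonic_partial(N), math.log(2))
```

The signed sums stabilise only because consecutive terms nearly cancel; once the signs are removed the same terms accumulate without bound.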
### Independence of Uniform and Absolute Convergence
The next result is conceptually important because it shows that uniformity and absoluteness are genuinely orthogonal properties — knowing one tells you nothing about the other. This independence has practical consequences: a series may converge uniformly through delicate cancellations (as in Dirichlet's test) without converging absolutely, or it may converge absolutely at every point but at rates that deteriorate near a [boundary](/page/Boundary), preventing uniform convergence.
[quotetheorem:276]
The proof is by construction of explicit counterexamples in both directions. For the first direction, the geometric series $\sum x^n$ on $(-1,1)$ converges absolutely at each point (since $\sum |x|^n < \infty$), but the remainder after the $N$-th partial sum, $x^{N+1}/(1-x)$, is unbounded on $(-1,1)$ because it tends to $+\infty$ as $x \to 1^-$, so the convergence is not uniform. For the second direction, the alternating harmonic series $\sum (-1)^{n-1}/n$, viewed as a series of constant functions, converges uniformly (trivially), but the harmonic series diverges. The key insight is that absoluteness controls the *mechanism* of convergence (no cancellation needed) while uniformity controls the *rate* across the domain, and these are independent features.
[citeproof:276]
[example: Absolute but Not Uniform]
The geometric series $\sum_{n=0}^\infty x^n$ converges absolutely for each $x \in (-1,1)$, since $\sum |x|^n = 1/(1-|x|) < \infty$. However, the convergence is not uniform on $(-1,1)$: the remainder after the terms of degree below $m$ is $\sum_{n \geq m} x^n = x^m/(1-x)$, and $\sup_{x \in (-1,1)} |x^m/(1-x)| = +\infty$ for every $m$, because $x^m/(1-x) \to +\infty$ as $x \to 1^-$. The rate of convergence degrades near the boundary, and no single $N$ controls the error across the entire interval.
[/example]
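To make the failure of uniformity concrete, here is a small sketch (helper names are illustrative) that evaluates the geometric remainder $|x|^{N+1}/|1-x|$ at points approaching $1$ for a fixed $N$.

```python
def remainder(x: float, N: int) -> float:
    """|sum_{n > N} x**n| = |x|**(N + 1) / |1 - x| for |x| < 1."""
    return abs(x) ** (N + 1) / abs(1 - x)


def sup_on_grid(N: int, closeness: int) -> float:
    """Largest remainder over the sample points x = 1 - 10**(-k), k = 1..closeness."""
    return max(remainder(1 - 10.0 ** (-k), N) for k in range(1, closeness + 1))


N = 50
for closeness in (1, 2, 3, 6):
    # For fixed N the sampled supremum keeps growing as the points approach 1.
    print(closeness, sup_on_grid(N, closeness))
print(remainder(0.5, N))  # tiny: convergence is fast away from the boundary
```

Increasing $N$ only delays the growth: for any fixed $N$, sampling closer to $1$ eventually produces arbitrarily large remainders.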
[example: Uniform but Not Absolute]
The constant series $\sum_{n=1}^\infty (-1)^{n-1}/n$ (viewed as a series of constant functions on any set $E$) converges uniformly — trivially, since the partial sums are independent of $x$. But the series of absolute values $\sum 1/n$ diverges. Uniform convergence is achieved through cancellation, not through smallness of the terms.
[/example]
### Absolute Uniform Convergence Implies Uniform Convergence
Absolute uniform convergence, the strongest mode considered here, implies plain uniform convergence, through the same triangle inequality mechanism that underpins all comparison arguments.
[quotetheorem:277]
The proof is immediate: the triangle inequality bounds $|\sum_{M+1}^N g_j(x)| \leq \sum_{M+1}^N |g_j(x)|$, and taking the supremum over $x \in E$ on both sides shows that if the partial sums of $\sum |g_n|$ are uniformly Cauchy, then so are the partial sums of $\sum g_n$. The General Principle of Uniform Convergence then delivers uniform convergence. The converse is false, as the alternating harmonic series demonstrates: uniform convergence can rely on cancellation that absolute uniform convergence destroys.
[citeproof:277]
### The Full Independence Picture
Even the conjunction of uniform convergence and pointwise absolute convergence does not guarantee absolute uniform convergence. This is a subtle point that merits a dedicated result and example, because it shows that no implication arrow connects the "uniform + absolute" corner to "absolute uniform" without additional hypotheses.
[quotetheorem:278]
The mechanism behind this failure is always the same: pointwise absolute convergence holds at each fixed $x$, but the rate of convergence of $\sum |g_n(x)|$ deteriorates as $x$ approaches a boundary point, and no uniform bound on the tails of $\sum |g_n|$ is possible. The proof constructs the explicit counterexample $\sum (-1)^n x^n/n$ on $[0,1)$, where Abel's test gives uniform convergence and the comparison test gives pointwise absolute convergence, but the tails of $\sum x^n/n$ blow up as $x \to 1^-$.
[citeproof:278]
[example: Uniform and Absolute but Not Absolutely Uniform]
Consider $\sum_{n=1}^\infty \frac{(-1)^n}{n} x^n$ on $[0,1)$. For each fixed $x \in [0,1)$, the series converges absolutely by comparison with the geometric series $\sum x^n$. It converges uniformly on $[0,1)$ by Abel's test: the numerical series $\sum (-1)^n/n$ converges, and for each $x$ the factors $x^n$ decrease monotonically in $n$ and are uniformly bounded by $1$. However, the series of absolute values $\sum_{n=1}^\infty \frac{x^n}{n}$ does not converge uniformly on $[0,1)$: its partial sums are not uniformly Cauchy, because for any $M$ the block $\sum_{j=M+1}^{N} x^j/j$ can be made arbitrarily large by choosing $N$ large and $x$ close to $1$. Indeed, for every $\varepsilon \in (0,1)$,
\begin{align*}
\sup_{x \in [0,1)} \sum_{j=M+1}^{N} \frac{x^j}{j} \geq \sum_{j=M+1}^{N} \frac{(1 - \varepsilon)^j}{j}
\end{align*}
and the right-hand side approaches $\sum_{j=M+1}^{N} 1/j$ as $\varepsilon \to 0$, a block of the harmonic series that grows without bound as $N \to \infty$.
[/example]
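The two behaviours in this example can be compared side by side. The sketch below (function names are hypothetical) evaluates one block of the series of absolute values and the same block of the signed series at a point close to $1$.

```python
def abs_block(x: float, M: int, N: int) -> float:
    """Block sum_{j=M+1}^N x**j / j of the series of absolute values."""
    return sum(x**j / j for j in range(M + 1, N + 1))


def alt_block(x: float, M: int, N: int) -> float:
    """Absolute value of the block sum_{j=M+1}^N (-1)**j x**j / j of the signed series."""
    return abs(sum((-1) ** j * x**j / j for j in range(M + 1, N + 1)))


M, N = 100, 20000
for x in (0.9, 0.999, 0.999999):
    # The unsigned block grows as x -> 1; the signed block stays below 1 / (M + 1).
    print(x, abs_block(x, M, N), alt_block(x, M, N), 1 / (M + 1))
```

The signed block is controlled by its first term, as for any alternating sum with decreasing terms, and that control is independent of $x$; the unsigned block has no such uniform control.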
## The Weierstrass $M$-Test
The preceding discussion shows that absolute uniform convergence is the strongest and most useful mode of convergence for series of functions, but the examples also reveal that verifying it from the definition requires controlling the tails $\sup_x \sum_{n > N} |g_n(x)|$, which is typically as hard as finding the sum. The Weierstrass $M$-test resolves this by replacing the $x$-dependent quantity $|g_n(x)|$ with a constant bound $M_n$, reducing the entire problem to the convergence of a numerical series.
[quotetheorem:272]
The power of the $M$-test lies in its reduction: a function-theoretic question (does $\sum g_n$ converge uniformly on $E$?) becomes a numerical one (does $\sum M_n$ converge?), and the latter can be attacked with the full arsenal of real-analysis convergence tests — comparison, ratio, root, integral, and condensation. The proof is a clean application of the General Principle of Uniform Convergence: the partial sums $S_N = \sum_{j=1}^N g_j$ satisfy $\sup_{x \in E} |S_N(x) - S_M(x)| \leq \sum_{n=M+1}^N M_n$, so they are uniformly Cauchy whenever the tail of $\sum M_n$ is small, and the general principle converts this into uniform convergence.
[citeproof:272]
The test is sufficient but not necessary, and understanding its limitations is as important as understanding its power. Because the $M$-test bounds $|g_n(x)|$ by a constant $M_n$, it discards all information about cancellation between terms and all dependence of $|g_n(x)|$ on $x$. This means it can only detect **absolute** uniform convergence. Series that converge uniformly through cancellation — such as $\sum (-1)^n x^n / n$ on $[0,1]$ — lie beyond its reach, and require tools like Dirichlet's test or Abel's test that exploit monotonicity and bounded partial sums.
[example: The M-Test Applied to a Fourier-Type Series]
The series $\sum_{n=1}^\infty \frac{\sin(nx)}{n^2}$ converges absolutely uniformly on all of $\mathbb{R}$. Each term satisfies $|\sin(nx)/n^2| \leq 1/n^2$ for every $x$, and $\sum 1/n^2 = \pi^2/6 < \infty$. The [Weierstrass $M$-test](/theorems/272) with $M_n = 1/n^2$ gives absolute uniform convergence. Since each partial sum is continuous (as a finite sum of continuous functions), the uniform limit theorem guarantees that the sum $f(x) = \sum \sin(nx)/n^2$ is continuous on $\mathbb{R}$.
[/example]
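The uniform Cauchy estimate behind the $M$-test can be checked numerically for this series. The following sketch (helper names are illustrative, and the supremum is only estimated on a finite grid over one period) compares $\sup_x |S_N(x) - S_M(x)|$ with the numerical tail $\sum_{n=M+1}^N 1/n^2$.

```python
import math


def m_tail(M: int, N: int) -> float:
    """Numerical block sum_{n=M+1}^N M_n with M_n = 1 / n**2."""
    return sum(1 / n**2 for n in range(M + 1, N + 1))


def sup_gap(M: int, N: int, points: int = 400) -> float:
    """Grid estimate of sup_x |S_N(x) - S_M(x)| over one period [0, 2*pi]."""
    grid = (2 * math.pi * k / points for k in range(points + 1))
    return max(
        abs(sum(math.sin(n * x) / n**2 for n in range(M + 1, N + 1)))
        for x in grid
    )


for M, N in ((5, 50), (20, 200), (80, 800)):
    # The function-level gap never exceeds the purely numerical tail.
    print(M, N, sup_gap(M, N), m_tail(M, N))
```

The right-hand column involves no $x$ at all, which is exactly the reduction the $M$-test provides.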
[example: Failure of the M-Test for Conditionally Convergent Series]
The alternating series $\sum_{n=1}^\infty (-1)^{n-1}/n$ converges (to $\ln 2$). Viewed as a constant-function series on any set $E$, it converges uniformly (trivially, since the partial sums are independent of $x$). However, $\sup_{x \in E} |(-1)^{n-1}/n| = 1/n$, and $\sum 1/n = \infty$, so the $M$-test does not apply. The convergence is achieved through cancellation between the positive and negative terms — precisely the mechanism that the $M$-test, by taking absolute values, destroys.
[/example]
## Power Series and Local Uniform Convergence
Power series provide the most complete and satisfying application of the theory developed so far. A power series $\sum c_n(x - a)^n$ is simultaneously a series of functions (to which the $M$-test applies on compact sub-intervals) and an algebraic object (whose coefficients encode all the analytic information about the sum). The interplay between these two perspectives is what makes the theory so powerful — and the $M$-test is the bridge between them.
The starting point is a fundamental trichotomy: a power series either converges everywhere, converges only at its center, or converges on an interval whose half-length $R$ is determined entirely by the asymptotic growth rate of the coefficients.
[quotetheorem:273]
The structure of this result repays careful study. Part (3), absolute uniform convergence on strict sub-intervals, follows from the [Weierstrass $M$-test](/theorems/272): on $[a-r, a+r]$ with $r < R$, each term satisfies $|c_n(x-a)^n| \leq |c_n| r^n$, and the numerical series $\sum |c_n| r^n$ converges because $r$ lies strictly inside the radius of convergence. The restriction to **strict** sub-intervals ($r < R$) is genuine and reflects real phenomena, not a deficiency of the proof. The geometric series $\sum x^n$ has $R = 1$ and converges absolutely uniformly on $[-r, r]$ for any $r < 1$, but it diverges at $x = 1$ and does not converge uniformly on all of $(-1, 1)$. More generally, behavior at the endpoints $x = a \pm R$ requires case-by-case analysis: $\sum x^n/n^2$ converges at both endpoints, $\sum x^n/n$ converges at $x = -1$ but not at $x = 1$, and $\sum x^n$ diverges at both.
[citeproof:273]
The Cauchy–Hadamard formula $1/R = \limsup |c_n|^{1/n}$ is more than a computational tool — it reveals that the radius of convergence depends only on the **asymptotic growth rate** of the coefficients, not on their individual values or signs. Replacing every $c_n$ by $|c_n|$ does not change $R$, which explains why a power series has the same radius for absolute and conditional convergence. This is a phenomenon special to power series, with no analogue for general function series.
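As a rough illustration of the formula, the sketch below evaluates $|c_n|^{1/n}$ at a single large $n$ for three coefficient sequences. This replaces the $\limsup$ by one sample value, which is only a reasonable heuristic when $|c_n|^{1/n}$ actually converges, as it does for the examples chosen here; the helper name and the choice $n = 10^6$ are assumptions made for illustration.

```python
import math


def root_estimate(log_abs_coeff, n: int) -> float:
    """|c_n|**(1/n), computed as exp(log|c_n| / n) to avoid overflow and underflow."""
    return math.exp(log_abs_coeff(n) / n)


# Each entry maps a description to a function n -> log|c_n|.
examples = {
    "c_n = 1/n   (1/R = 1, so R = 1)": lambda n: -math.log(n),
    "c_n = 2**n  (1/R = 2, so R = 1/2)": lambda n: n * math.log(2),
    "c_n = 1/n!  (1/R = 0, so R is infinite)": lambda n: -math.lgamma(n + 1),
}

for label, log_c in examples.items():
    print(label, root_estimate(log_c, 10**6))
```

Working with $\log |c_n|$ rather than $c_n$ itself keeps the factorial and exponential examples within floating-point range.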
### Continuity of Power Series
An immediate consequence of the [Radius of Convergence theorem](/theorems/273) and the uniform limit theorem is the continuity of the sum function. The argument is local: on any compact sub-interval of $(a - R, a + R)$, the partial sums converge uniformly (by part (3) of the [Radius of Convergence theorem](/theorems/273)), and each partial sum is a polynomial — hence continuous. The uniform limit theorem then delivers continuity of the sum at every point of the interval.
[quotetheorem:279]
The proof illustrates a pattern that recurs throughout this section: to establish a property of $f$ at a point $x_0 \in (a - R, a + R)$, choose a compact sub-interval $[a - r, a + r]$ containing $x_0$ with $r < R$, apply the [Weierstrass $M$-test](/theorems/272) to obtain uniform convergence there, and invoke the appropriate preservation theorem. The openness of $(a - R, a + R)$ ensures that such an $r$ always exists.
[citeproof:279]
### Termwise Differentiation
The most striking property of power series is that they can be differentiated term by term, with no loss of radius of convergence. This is remarkable because, as seen in the previous section, differentiation does *not* commute with uniform limits in general — an additional hypothesis (uniform convergence of the derived series) is needed. For power series, this hypothesis is automatically satisfied, thanks to the stability of the radius of convergence under differentiation.
[quotetheorem:274]
The key insight in the proof is the observation that $n^{1/n} \to 1$ as $n \to \infty$, which ensures that differentiating a power series — replacing $c_n$ by $nc_n$ — does not change the asymptotic growth rate of the coefficients. The Cauchy–Hadamard formula then gives $R' = R$ for the derived series. With the same radius established, the termwise differentiability theorem for general series applies on each compact sub-interval $[a - r, a + r]$ with $r < R$, where the derived series converges uniformly by the [Weierstrass $M$-test](/theorems/272).
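For reference, the computation behind this observation can be written out as follows. This is a brief sketch using the usual conventions $1/R = 0$ when $R = \infty$ and $1/R = \infty$ when $R = 0$, together with the standard fact that multiplying a non-negative sequence by a sequence converging to $1$ does not change its $\limsup$:
\begin{align*}
\ln\left(n^{1/n}\right) = \frac{\ln n}{n} \to 0
\quad &\Longrightarrow \quad n^{1/n} \to 1, \\
\limsup_{n \to \infty} |n c_n|^{1/n}
= \lim_{n \to \infty} n^{1/n} \cdot \limsup_{n \to \infty} |c_n|^{1/n}
&= \frac{1}{R}.
\end{align*}
The derived series $\sum n c_n (x-a)^{n-1}$ differs from $\sum n c_n (x-a)^n$ only by a factor of $(x - a)$, which does not affect the set of $x \neq a$ at which it converges, so the two have the same radius.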
[citeproof:274]
By induction, power series are **infinitely differentiable** on their interval of convergence: $f'$ is given by a power series with the same radius, so $f'$ is itself differentiable with $f''(x) = \sum n(n-1)c_n(x-a)^{n-2}$, and so on. Evaluating the $k$-th derived series at $x = a$ yields $f^{(k)}(a) = k! \, c_k$, so the coefficients satisfy the Taylor formula
\begin{align*}
c_n = \frac{f^{(n)}(a)}{n!}.
\end{align*}
This is a rigidity result of the strongest kind: the coefficients are uniquely determined by the values of $f$ in any neighborhood of the center, no matter how small. It also shows that a function can have at most one power series expansion about a given point — a fact with no analogue for [Fourier series](/page/Fourier%20Series), where uniqueness questions are far more delicate.
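The relation $f^{(k)}(a) = k!\, c_k$ can be checked mechanically on a truncated series, that is, on a polynomial in $(x - a)$. The sketch below is a minimal illustration with hypothetical helper names; it stores the coefficients $c_0, c_1, \ldots$ in a list and differentiates term by term.

```python
from math import factorial


def differentiate(coeffs):
    """Coefficients of the termwise derivative of sum_n c_n (x - a)**n."""
    return [n * c for n, c in enumerate(coeffs)][1:]


def kth_derivative_at_center(coeffs, k: int):
    """Differentiate k times, then evaluate at x = a, where only the constant term survives."""
    for _ in range(k):
        coeffs = differentiate(coeffs)
    return coeffs[0] if coeffs else 0


coeffs = [3, -1, 4, 1, -5, 9]  # arbitrary sample coefficients c_0, ..., c_5
for k in range(len(coeffs)):
    # Each row shows f^(k)(a) next to k! * c_k; the two columns agree.
    print(k, kth_derivative_at_center(coeffs, k), factorial(k) * coeffs[k])
```

Reading the relation backwards recovers each $c_k$ from $f^{(k)}(a)$, which is the uniqueness statement above in computational form.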
### Local Uniform Convergence
The convergence behavior of power series is naturally described using a notion that interpolates between pointwise and uniform convergence.
[definition: Local Uniform Convergence]
Let $U \subseteq \mathbb{R}$ be open and $(f_n)$ a sequence of functions on $U$. The sequence $(f_n)$ **converges locally uniformly on $U$** if for every $x \in U$, there exists $\delta > 0$ such that $(f_n)$ converges uniformly on $(x - \delta, x + \delta) \subseteq U$.
[/definition]
Power series converge locally uniformly — in fact, locally absolutely uniformly — on their open interval of convergence $(a - R, a + R)$. At any point $x_0$ in this interval, we can find $r$ with $|x_0 - a| < r < R$ and the [Weierstrass $M$-test](/theorems/272) gives absolute uniform convergence on $[a - r, a + r]$, which contains a neighborhood of $x_0$. This local uniform control is precisely what makes the preservation theorems (continuity, integration, differentiation) applicable at every point of the interval, even though global uniform convergence on all of $(a - R, a + R)$ may fail.
## Worked Example
[problem]
Show that the series $f(x) = \sum_{n=1}^\infty \frac{\sin(nx)}{n^2}$ defines a continuous function on $\mathbb{R}$ that is differentiable on $(0, 2\pi)$ with $f'(x) = \sum_{n=1}^\infty \frac{\cos(nx)}{n}$, and explain why termwise differentiation fails at $x = 0$.
[/problem]
[solution]
**Step 1: Absolute uniform convergence via the Weierstrass $M$-test.**
Each term satisfies $\left|\frac{\sin(nx)}{n^2}\right| \leq \frac{1}{n^2}$ for all $x \in \mathbb{R}$, and $\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$. The [Weierstrass $M$-test](/theorems/272) with $M_n = 1/n^2$ gives absolute uniform convergence on all of $\mathbb{R}$.
**Step 2: Continuity from the uniform limit theorem.**
Each partial sum $S_N(x) = \sum_{n=1}^N \frac{\sin(nx)}{n^2}$ is continuous as a finite sum of continuous functions. Since $S_N \to f$ uniformly on $\mathbb{R}$, the uniform limit of continuous functions is continuous, so $f$ is continuous on $\mathbb{R}$.
**Step 3: Uniform convergence of the derived series on compact sub-intervals of $(0, 2\pi)$ via Dirichlet's test.**
The derived series is $\sum_{n=1}^\infty \frac{\cos(nx)}{n}$. Fix $\delta \in (0, \pi)$ and work on $[\delta, 2\pi - \delta]$. To apply Dirichlet's test for uniform convergence, we verify two conditions. First, the partial sums of $\cos(nx)$ are uniformly bounded: the closed-form identity
\begin{align*}
\sum_{n=1}^N \cos(nx) = \frac{\sin((N + 1/2)x) - \sin(x/2)}{2\sin(x/2)}
\end{align*}
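gives the bound $\left|\sum_{n=1}^N \cos(nx)\right| \leq 1/|\sin(x/2)| \leq 1/\sin(\delta/2)$ on $[\delta, 2\pi - \delta]$, since the numerator is at most $2$ in absolute value and $\sin(x/2) \geq \sin(\delta/2) > 0$ for $x/2 \in [\delta/2, \pi - \delta/2]$. This bound is a constant independent of both $N$ and $x$. Second, the coefficients $1/n$ decrease monotonically to $0$ and do not depend on $x$. Both conditions hold, so $\sum \frac{\cos(nx)}{n}$ converges uniformly on $[\delta, 2\pi - \delta]$.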
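Both the closed-form identity and the resulting uniform bound are easy to test numerically. The following sketch uses hypothetical helper names, and the values of $\delta$, the grid size and the maximal $N$ are arbitrary choices.

```python
import math


def cos_sum(x: float, N: int) -> float:
    """Direct evaluation of sum_{n=1}^N cos(n x)."""
    return sum(math.cos(n * x) for n in range(1, N + 1))


def closed_form(x: float, N: int) -> float:
    """(sin((N + 1/2) x) - sin(x / 2)) / (2 sin(x / 2)), valid when sin(x / 2) != 0."""
    return (math.sin((N + 0.5) * x) - math.sin(x / 2)) / (2 * math.sin(x / 2))


def worst_partial_sum(delta: float, N_max: int = 200, points: int = 200) -> float:
    """Largest |sum_{n <= N} cos(n x)| over a grid of [delta, 2*pi - delta] and all N <= N_max."""
    worst = 0.0
    for k in range(points + 1):
        x = delta + (2 * math.pi - 2 * delta) * k / points
        running = 0.0
        for n in range(1, N_max + 1):
            running += math.cos(n * x)
            worst = max(worst, abs(running))
    return worst


print(cos_sum(1.3, 40), closed_form(1.3, 40))          # the two values agree
print(worst_partial_sum(0.1), 1 / math.sin(0.1 / 2))   # observed maximum vs. bound
```

As $\delta$ shrinks the bound $1/\sin(\delta/2)$ grows without limit, which is why the argument works on $[\delta, 2\pi - \delta]$ but not on all of $(0, 2\pi)$ at once.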
**Step 4: Application of the termwise differentiability theorem.**
The termwise differentiability theorem for series requires: (i) convergence of the original series at some point in $[\delta, 2\pi - \delta]$, which holds everywhere by Step 1; and (ii) uniform convergence of the derived series on $[\delta, 2\pi - \delta]$, established in Step 3. Each term $\sin(nx)/n^2$ is continuously differentiable. The theorem therefore gives $f'(x) = \sum_{n=1}^\infty \frac{\cos(nx)}{n}$ on $(\delta, 2\pi - \delta)$. Since $\delta > 0$ was arbitrary, this holds on all of $(0, 2\pi)$.
**Step 5: Failure of termwise differentiation at $x = 0$.**
At $x = 0$, the derived series evaluates to $\sum_{n=1}^\infty \frac{\cos(0)}{n} = \sum_{n=1}^\infty \frac{1}{n}$, which is the harmonic series and diverges. The derived series does not converge at $x = 0$, so the termwise differentiability theorem does not apply. A more careful analysis, using the closed form $\sum_{n=1}^\infty \frac{\cos(nx)}{n} = -\ln\left(2\sin(x/2)\right)$ valid on $(0, 2\pi)$, shows that $f'(x) \to +\infty$ as $x \to 0^+$: the graph of $f$ has a vertical tangent at the origin. This boundary degeneration, in which a [uniformly convergent](/page/Uniform%20Convergence) series has a derived series that diverges at a boundary point, is typical of Fourier analysis and illustrates why the [Weierstrass $M$-test](/theorems/272) alone cannot control differentiation: it guarantees convergence of $\sum \sin(nx)/n^2$ but says nothing about $\sum \cos(nx)/n$.
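The difference between an interior point and the boundary point can be observed directly. The sketch below evaluates partial sums of the derived series at $x = 1$ and at $x = 0$. For comparison it uses the standard closed form $-\ln(2\sin(x/2))$ for $\sum \cos(nx)/n$ on $(0, 2\pi)$, which is a known identity brought in here for illustration rather than something derived in this solution; the helper name is hypothetical.

```python
import math


def derived_partial(x: float, N: int) -> float:
    """Partial sum of the derived series: sum_{n=1}^N cos(n x) / n."""
    return sum(math.cos(n * x) / n for n in range(1, N + 1))


for N in (100, 10000, 100000):
    interior = derived_partial(1.0, N)   # settles down as N grows
    boundary = derived_partial(0.0, N)   # harmonic numbers, growing like ln N
    print(N, interior, boundary)

print(-math.log(2 * math.sin(0.5)))      # closed-form value at x = 1
```

Convergence at interior points is slow (the error decays roughly like $1/N$), reflecting that it comes from cancellation rather than from small terms.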
[/solution]
## References
1. Sheratt, N., *Cambridge Part IB — Analysis and Topology*, Lecture Notes.

---

Having established the behavior of sequences and series of functions under uniform convergence, we now confront a foundational question in a new direction: when does a function admit a well-defined Riemann integral? The answer hinges on controlling the oscillation of the function (the difference between its [supremum and infimum](/page/Supremum%20and%20Infimum) on small subintervals), and this control is precisely what uniform continuity provides.
The trajectory of this section is shaped by a single chain of implications. Continuity on a closed, bounded interval forces uniform continuity (the Heine-Cantor theorem), which in turn forces Riemann integrability (via the Riemann criterion). Uniform convergence then allows these properties to be transferred to limits, connecting the theory of convergence from the previous sections to the theory of integration. The final result — that continuous compositions of integrable functions are integrable — shows how the Riemann criterion serves as a flexible tool far beyond its initial setting.
[motivation]
### The Oscillation Problem
The Riemann integral is constructed from approximations: upper sums overestimate the integral, lower sums underestimate it, and integrability means these two estimates can be brought arbitrarily close together. The gap $U(P, f) - L(P, f)$ measures the total oscillation of $f$ weighted by the lengths of the partition subintervals. Making this gap small requires that $f$ does not oscillate too wildly on most of the interval — but how do we guarantee this?
For a continuous function on a compact interval, the natural idea is to take a very fine partition, so that $f$ varies little on each subinterval. But this argument has a hidden dependency: the amount by which $f$ varies on an interval of length $\delta$ depends on *where* the interval sits. Near a point where $f$ changes rapidly, a smaller $\delta$ is needed than near a point where $f$ is nearly constant. To make the argument work, we need a single $\delta$ that controls the oscillation everywhere simultaneously — and this is exactly what uniform continuity provides.
### Why Pointwise Continuity Is Insufficient
A continuous function on an open interval need not be integrable in the Riemann sense, because it need not be bounded (consider $f(x) = 1/x$ on $(0,1)$). In the other direction, continuity everywhere is not necessary: on a closed interval, a bounded function that is continuous except at one point, such as the indicator $\mathbb{1}_{\{0\}}$ on $[0,1]$, is still integrable. The precise condition for integrability is captured by the Riemann criterion: it is the ability to make the total weighted oscillation small, not the continuity of the function per se, that matters. Uniform continuity is the most natural sufficient condition for achieving this.
### From Integration to Limits
A second motivation comes from the interaction between convergence and integration. If $f_n \to f$ pointwise and each $f_n$ is integrable, is the limit $f$ integrable? And does $\int f_n \to \int f$? The answer is no in general: pointwise limits can destroy integrability or change the value of the integral (the bounded and dominated convergence theorems of measure theory supply extra hypotheses under which the interchange is restored). However, uniform convergence is strong enough to preserve both integrability and the value of the integral. Understanding why requires the same oscillation-control ideas that underpin the Riemann criterion, applied now to the difference $f - f_n$ rather than to $f$ itself.
[/motivation]
## Core Definitions
The central concept of this section is the strengthening of continuity that demands a uniform modulus across the entire domain.
[definition: Uniform Continuity]
Let $E \subseteq \mathbb{R}$ and $f: E \to \mathbb{R}$. The function $f$ is **uniformly continuous** on $E$ if for every $\varepsilon > 0$, there exists $\delta > 0$ such that for all $x, y \in E$:
\begin{align*}
|x - y| < \delta \implies |f(x) - f(y)| < \varepsilon.
\end{align*}
[/definition]
The logical structure here is worth parsing carefully. In ordinary (pointwise) continuity, the $\delta$ may depend on both $\varepsilon$ and the point $x$: for each $x$, we find a $\delta(x, \varepsilon)$. In uniform continuity, the quantifiers are reordered so that $\delta$ is chosen before the points: a single $\delta(\varepsilon)$ must work for all pairs of points simultaneously. This strengthening is genuine: there exist functions that are continuous on their domain but not uniformly continuous there (examples on open and on unbounded intervals appear below), and understanding the failure modes is essential for appreciating why compactness plays a role.
We now develop the machinery of Riemann integration. A natural first attempt might use Riemann sums with chosen sample points: pick a point $t_j$ in each subinterval $[a_j, a_{j+1}]$ and form $\sum f(t_j)(a_{j+1} - a_j)$. But this approach requires an arbitrary choice of sample points, and different choices yield different sums. The Darboux formulation eliminates this ambiguity by taking suprema and infima over each subinterval, producing canonical upper and lower sums with no choices involved. This makes the oscillation $\sup f - \inf f$ on each piece explicit, and reduces integrability to a single clean condition: the gap between upper and lower sums can be made arbitrarily small.
[definition: Partition]
Let $[a, b] \subseteq \mathbb{R}$ be a closed, bounded interval. A **partition** $P$ of $[a, b]$ is a finite ordered set $P = \{a_0, a_1, \ldots, a_n\}$ with $a = a_0 < a_1 < \cdots < a_n = b$. The **mesh** of $P$ is $\|P\| = \max_{0 \leq j \leq n-1} (a_{j+1} - a_j)$.
[/definition]
[definition: Upper and Lower Sums]
Given a bounded function $f: [a, b] \to \mathbb{R}$ and a partition $P = \{a_0, \ldots, a_n\}$, the **upper sum** and **lower sum** of $f$ with respect to $P$ are:
\begin{align*}
U(P, f) &= \sum_{j=0}^{n-1} (a_{j+1} - a_j) \sup_{x \in [a_j, a_{j+1}]} f(x), \\
L(P, f) &= \sum_{j=0}^{n-1} (a_{j+1} - a_j) \inf_{x \in [a_j, a_{j+1}]} f(x).
\end{align*}
[/definition]
[definition: Riemann Integrability]
A bounded function $f: [a, b] \to \mathbb{R}$ is **Riemann integrable** if the upper integral $I^*(f) = \inf_P U(P, f)$ and the lower integral $I_*(f) = \sup_P L(P, f)$ coincide, where the infimum and supremum range over all partitions $P$ of $[a, b]$. In this case, the common value is denoted $\int_a^b f(x) \, dx$.
[/definition]
The definitions above make integrability a property of the function's oscillation behavior. The difference $U(P, f) - L(P, f) = \sum_j (a_{j+1} - a_j) \omega_j(f)$, where $\omega_j(f) = \sup_{[a_j, a_{j+1}]} f - \inf_{[a_j, a_{j+1}]} f$ is the oscillation on the $j$-th subinterval, measures how well $f$ can be approximated by step functions from above and below. Integrability asks that this total weighted oscillation can be made arbitrarily small.
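As a concrete instance, the sketch below computes upper and lower sums for an increasing function on a uniform partition. It relies on the assumption that $f$ is increasing, so that the supremum and infimum on each subinterval are attained at the right and left endpoints; the function name and the test function $f(x) = x^2$ on $[0, 1]$ are illustrative choices.

```python
def darboux_sums(f, a: float, b: float, n: int):
    """Upper and lower sums of an INCREASING function f on [a, b], uniform partition with n pieces.

    For increasing f, sup and inf on [a_j, a_{j+1}] are f(a_{j+1}) and f(a_j) respectively.
    """
    width = (b - a) / n
    pts = [a + j * width for j in range(n + 1)]
    upper = sum(width * f(pts[j + 1]) for j in range(n))
    lower = sum(width * f(pts[j]) for j in range(n))
    return upper, lower


for n in (10, 100, 1000):
    U, L = darboux_sums(lambda x: x * x, 0.0, 1.0, n)
    # The gap U - L telescopes to (f(1) - f(0)) / n = 1 / n for this example.
    print(n, L, U, U - L)
```

For a function that is not monotone the endpoint shortcut is no longer valid, and the suprema and infima on each piece would have to be found by other means.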
## Uniform Continuity on Compact Intervals
### The Heine-Cantor Theorem
The first main result establishes that compactness of the domain upgrades pointwise continuity to uniform continuity. This is the critical link that makes the theory of Riemann integration on compact intervals work cleanly.
[quotetheorem:280]
The proof is by contradiction and uses the Bolzano-Weierstrass theorem as its essential ingredient. If uniform continuity fails, one can find two sequences whose entries get arbitrarily close yet whose function values stay at least $\varepsilon_0$ apart. On a compact interval, Bolzano-Weierstrass extracts a convergent subsequence, and the two sequences converge to the same limit — contradicting continuity at that limit point. The argument fails on non-compact domains precisely because the extracted limit point may not belong to the domain, or because no convergent subsequence exists at all.
[citeproof:280]
The compactness hypothesis is essential: neither closedness nor boundedness of the interval can be dropped. On open intervals, functions may oscillate with increasing frequency near a missing boundary point; on unbounded intervals, they may grow ever steeper or oscillate faster and faster with fixed amplitude. Both failure modes prevent a single $\delta$ from controlling all pairs.
[example: Failure of Uniform Continuity on Open Intervals]
The function $f: (0, 1] \to \mathbb{R}$ defined by $f(x) = \sin(1/x)$ is continuous but not uniformly continuous. To see this, consider the sequences $x_n = 1/(2n\pi)$ and $y_n = 1/(2n\pi + \pi/2)$. Then $|x_n - y_n| \to 0$ as $n \to \infty$, since both sequences converge to $0$. However, $f(x_n) = \sin(2n\pi) = 0$ and $f(y_n) = \sin(2n\pi + \pi/2) = 1$, so $|f(x_n) - f(y_n)| = 1$ for all $n$. No single $\delta$ can make $|f(x) - f(y)| < 1/2$ whenever $|x - y| < \delta$, because arbitrarily close pairs with function values $1$ apart exist. The failure occurs because the oscillation frequency $1/x$ diverges as $x \to 0^+$, and the interval $(0, 1]$ is not compact (the problematic limit point $0$ is missing from the domain).
[/example]
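The witness sequences in this example can be tabulated directly. This is a minimal sketch; the helper name is hypothetical.

```python
import math


def witness_pair(n: int):
    """The points x_n = 1 / (2 n pi) and y_n = 1 / (2 n pi + pi / 2) from the example."""
    x = 1 / (2 * n * math.pi)
    y = 1 / (2 * n * math.pi + math.pi / 2)
    return x, y


for n in (1, 10, 100, 1000):
    x, y = witness_pair(n)
    # The distance |x - y| shrinks to 0 while |f(x) - f(y)| stays essentially equal to 1.
    print(n, abs(x - y), abs(math.sin(1 / x) - math.sin(1 / y)))
```

Whatever $\delta$ is proposed, some row of this table has its first value below $\delta$ and its second value above $1/2$.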
[example: Failure of Uniform Continuity on Unbounded Domains]
The function $f: [0, \infty) \to \mathbb{R}$ defined by $f(x) = x^2$ is continuous but not uniformly continuous. Fix any $\delta > 0$. Choose $x = n$ and $y = n + \delta/2$ for $n \in \mathbb{N}$. Then $|x - y| = \delta/2 < \delta$, but:
\begin{align*}
|f(x) - f(y)| = |n^2 - (n + \delta/2)^2| = n\delta + \delta^2/4 \geq n\delta.
\end{align*}
As $n \to \infty$, this exceeds any fixed $\varepsilon$. The failure is caused by the linear growth of the derivative $f'(x) = 2x$: the function becomes increasingly steep, so a fixed step in the domain produces an arbitrarily large step in the range.
[/example]
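The argument is constructive: given $\delta$ and $\varepsilon$, it tells us how far out to go. The sketch below (with a hypothetical helper name) picks an integer $n$ with $n\delta \geq \varepsilon$ and checks the resulting pair.

```python
import math


def breaking_n(delta: float, eps: float) -> int:
    """An integer n >= 1 with n * delta >= eps, so |f(n) - f(n + delta/2)| >= eps for f(x) = x**2."""
    return max(1, math.ceil(eps / delta))


for delta, eps in ((0.1, 1.0), (1e-3, 5.0), (1e-6, 100.0)):
    n = breaking_n(delta, eps)
    x, y = float(n), n + delta / 2
    # The points are within delta of each other, yet their squares differ by at least eps.
    print(delta, eps, n, abs(x - y), abs(x * x - y * y))
```

The required $n$ scales like $\varepsilon/\delta$, so shrinking $\delta$ never rescues uniform continuity; it only pushes the offending pair further out.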
### Proving Uniform Continuity on Unbounded Domains
Although the Heine-Cantor theorem requires compactness, many functions on unbounded domains are uniformly continuous. The standard technique is **domain splitting**: decompose the domain into a compact part (where Heine-Cantor applies) and a "tail" where the function's behavior is controlled by decay or boundedness conditions. The two parts are then joined using the triangle inequality.
[example: Domain Splitting for Uniform Continuity]
The function $f: [0, \infty) \to \mathbb{R}$ defined by $f(x) = \sin(x^2)/(1 + x)$ is uniformly continuous. Fix $\varepsilon > 0$. Since $|f(x)| \leq 1/(1 + x) \to 0$ as $x \to \infty$, choose $R > 0$ so that $|f(x)| < \varepsilon/4$ for all $x \geq R$.
On the compact interval $[0, R + 1]$, the function $f$ is continuous, so the [Heine-Cantor theorem](/theorems/280) provides $\delta_1 > 0$ such that $|x - y| < \delta_1$ implies $|f(x) - f(y)| < \varepsilon$ for $x, y \in [0, R + 1]$.
Set $\delta = \min(\delta_1, 1)$. For any $x, y \in [0, \infty)$ with $|x - y| < \delta$:
**Case 1:** Both $x, y \in [0, R + 1]$. Then $|f(x) - f(y)| < \varepsilon$ by the choice of $\delta_1$.
**Case 2:** Both $x, y \geq R$. Then $|f(x) - f(y)| \leq |f(x)| + |f(y)| < \varepsilon/4 + \varepsilon/4 < \varepsilon$.
**Case 3:** Neither of the above, so one point is less than $R$ and the other exceeds $R + 1$. Then $|x - y| > 1 \geq \delta$, contradicting $|x - y| < \delta$, so this case cannot occur. The overlap $[R, R + 1]$ between the two regions, together with $\delta \leq 1$, is exactly what guarantees that every admissible pair falls under Case 1 or Case 2.
Thus $f$ is uniformly continuous on $[0, \infty)$.
[/example]
The domain splitting technique generalises: any continuous function that is "eventually well-behaved" (e.g., eventually Lipschitz, eventually constant, or tending to a limit) on an unbounded domain is uniformly continuous, because the compact core is handled by Heine-Cantor and the tail is handled by the specific decay/regularity condition.
## The Riemann Criterion and Integrability
### The Riemann Criterion
The Riemann criterion is the single most useful characterisation of integrability, because it reduces the question to finding a partition that makes the total weighted oscillation small — without needing to compute the integral itself.
[quotetheorem:281]
The criterion is an equivalence, and both directions are instructive. The forward direction is a straightforward consequence of the definitions: if the upper and lower integrals agree, partitions can approximate each from the correct side, and their common refinement witnesses the small gap. The reverse direction uses the squeeze principle: the gap $I^*(f) - I_*(f)$ is a fixed non-negative number trapped below every $\varepsilon > 0$, forcing it to be zero. The proof illustrates a recurring theme in real analysis — characterising an existential property (the existence of the integral) through a universal quantitative condition (for every $\varepsilon$, a partition exists).
[citeproof:281]
The Riemann criterion does *not* require the function to be continuous. It applies to any bounded function and is often used to prove integrability of functions with countably many discontinuities, such as the Thomae function. The criterion also interacts cleanly with the notion of refinement: if $P \subseteq Q$ (meaning $Q$ contains all points of $P$ and possibly more), then $U(Q, f) \leq U(P, f)$ and $L(Q, f) \geq L(P, f)$, so refinement can only improve the approximation.
### Integrability of Continuous Functions
With the Heine-Cantor theorem and the Riemann criterion in hand, the integrability of continuous functions follows by a clean two-step argument.
[quotetheorem:282]
The proof chains together the results established above: uniform continuity (from [Heine-Cantor](/theorems/280)) provides a $\delta$ that makes the oscillation on each subinterval less than $\varepsilon/(b-a)$, and any partition with mesh less than $\delta$ then satisfies the [Riemann criterion](/theorems/281) because the total weighted oscillation telescopes to at most $\varepsilon$. The factor $\varepsilon/(b-a)$ in the uniform continuity application is a calibration trick that appears repeatedly in integration theory — one chooses the tolerance for the pointwise condition so that it sums correctly over the whole interval.
[citeproof:282]
This result is sharp in the following sense: every continuous function on a compact interval is integrable, but continuity on a non-compact domain does not guarantee integrability (an unbounded function is not even in the domain of the Riemann integral). The result also has a converse of sorts: every Riemann integrable function is "nearly continuous" in the sense of the Lebesgue criterion — a bounded function is Riemann integrable if and only if its set of discontinuities has Lebesgue measure zero.
[example: Non-Integrability of the Dirichlet Function]
The function $f: [0, 1] \to \mathbb{R}$ defined by $f(x) = \mathbb{1}_{\mathbb{Q} \cap [0,1]}(x)$ (the Dirichlet function, equal to $1$ on rationals and $0$ on irrationals) is bounded but not Riemann integrable. On every subinterval $[a_j, a_{j+1}]$ of positive length, $\sup f = 1$ and $\inf f = 0$, so $\omega_j(f) = 1$ for all $j$. Therefore $U(P, f) - L(P, f) = \sum_j (a_{j+1} - a_j) \cdot 1 = b - a = 1$ for every partition $P$, and the [Riemann criterion](/theorems/281) fails. The function is discontinuous everywhere, so its set of discontinuities has full measure — illustrating the necessity of "near-continuity" for integrability.
[/example]
[example: Integrability Despite Discontinuity]
The function $f: [0, 1] \to \mathbb{R}$ defined by $f(x) = 1$ if $x = 1/2$ and $f(x) = 0$ otherwise is Riemann integrable with $\int_0^1 f \, dx = 0$. Fix $\varepsilon > 0$ and choose a partition $P$ that isolates $x = 1/2$ in a subinterval of length less than $\varepsilon$. On that subinterval, $\omega_j(f) = 1$, contributing at most $\varepsilon$ to the total. On all other subintervals, $\omega_j(f) = 0$. Thus $U(P, f) - L(P, f) < \varepsilon$. The single discontinuity has measure zero and does not obstruct integrability.
[/example]
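The gap $U(P, f) - L(P, f)$ for this single-point function can be computed exactly with rational arithmetic. The following Python sketch assumes one particular isolating partition (the points $1/2 \pm \varepsilon/4$ are an illustrative choice), and the helper names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def upper_minus_lower(partition):
    """U(P, f) - L(P, f) for f = 1 at x = 1/2 and f = 0 elsewhere.

    On a subinterval [a, b] of positive length, inf f = 0, and sup f = 1
    exactly when 1/2 lies in [a, b], so the oscillation is 0 or 1.
    """
    total = Fraction(0)
    for a, b in zip(partition, partition[1:]):
        oscillation = 1 if a <= HALF <= b else 0
        total += oscillation * (b - a)
    return total

def isolating_partition(eps):
    """Partition 0 < 1/2 - eps/4 < 1/2 + eps/4 < 1 (assumes 0 < eps < 2)."""
    return [Fraction(0), HALF - eps / 4, HALF + eps / 4, Fraction(1)]

for eps in (Fraction(1, 10), Fraction(1, 1000)):
    gap = upper_minus_lower(isolating_partition(eps))
    print(f"eps={eps}  U-L={gap}  below eps: {gap < eps}")
```

With this partition the gap equals $\varepsilon/2$, comfortably below $\varepsilon$. A uniform partition also works, but if $1/2$ is a partition point, two subintervals contain it and both contribute.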
## Uniform Convergence and Integration
### Preserving Integrability Under Limits
The interaction between convergence and integration is one of the central themes of analysis. Pointwise convergence of integrable functions does not preserve integrability — or even the value of the integral — as the following example illustrates.
[example: Failure of Pointwise Convergence to Preserve Integrals]
Define $f_n: [0, 1] \to \mathbb{R}$ by $f_n(x) = n^2 x(1 - x^2)^n$. Each $f_n$ is continuous and hence Riemann integrable. A direct computation using the substitution $u = 1 - x^2$ gives:
\begin{align*}
\int_0^1 n^2 x(1 - x^2)^n \, dx = \frac{n^2}{2} \int_0^1 u^n \, du = \frac{n^2}{2(n+1)}.
\end{align*}
As $n \to \infty$, this diverges to $+\infty$. However, for each fixed $x \in (0, 1]$ we have $0 \leq 1 - x^2 < 1$, so $(1 - x^2)^n \to 0$ geometrically, which beats the polynomial factor $n^2$ and gives $f_n(x) \to 0$. At $x = 0$, $f_n(0) = 0$ for all $n$. The pointwise limit is therefore $f \equiv 0$, which is integrable with $\int_0^1 f \, dx = 0$, but $\int_0^1 f_n \, dx \to \infty \neq 0$. The convergence is not uniform: solving $f_n'(x) = 0$ gives the maximiser $x = 1/\sqrt{2n + 1}$, where $f_n$ has height of order $n^{3/2}$, so $\|f_n\|_\infty \to \infty$.
[/example]
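The closed form for the integral, and the growth of the supremum, can be checked numerically. The sketch below uses a midpoint rule and a uniform grid; both discretisations (and the step counts) are illustrative assumptions, so the printed values are approximations only.

```python
def f(n: int, x: float) -> float:
    """f_n(x) = n^2 x (1 - x^2)^n on [0, 1]."""
    return n * n * x * (1.0 - x * x) ** n

def exact_integral(n: int) -> float:
    """Closed form n^2 / (2(n + 1)) obtained from the substitution u = 1 - x^2."""
    return n * n / (2.0 * (n + 1))

def midpoint_integral(n: int, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f_n over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f(n, (k + 0.5) * h) for k in range(steps))

def sup_on_grid(n: int, steps: int = 100_000) -> float:
    """Largest value of f_n on a uniform grid (a lower bound for the sup norm)."""
    return max(f(n, k / steps) for k in range(steps + 1))

for n in (1, 10, 100):
    print(f"n={n:4d}  exact={exact_integral(n):.6f}  "
          f"midpoint={midpoint_integral(n):.6f}  grid sup={sup_on_grid(n):.4f}")
```

Both the integrals and the grid maxima grow with $n$, even though $f_n(x) \to 0$ at every fixed $x$.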
Uniform convergence eliminates pathologies of this kind by ensuring that the approximation $|f_n(x) - f(x)|$ is small for *all* $x$ simultaneously, not just at each fixed $x$ individually.
[quotetheorem:283]
The proof has two parts. For integrability, the idea is to transfer the oscillation control from a known integrable function $f_N$ (close to $f$ in the supremum norm) to $f$ itself: on each partition subinterval, the oscillation of $f$ differs from that of $f_N$ by at most $2\|f - f_N\|_\infty$, and the [Riemann criterion](/theorems/281) for $f_N$ provides a partition making its oscillation small. For the interchange of limits and integrals, the estimate is simpler: $|\int f_n - \int f| \leq (b-a)\|f_n - f\|_\infty \to 0$. The key structural insight is that uniform convergence makes the sup-norm $\|f_n - f\|_\infty$ the right quantity to track, and this norm controls both the oscillation transfer and the integral difference.
[citeproof:283]
The boundedness hypothesis on the $f_n$ is necessary: without it, the limit $f$ may be unbounded and fall outside the domain of the Riemann integral entirely. Conversely, if each $f_n$ is bounded and $f_n \to f$ uniformly, then boundedness of $f$ is automatic, and the sequence is even uniformly bounded (i.e., $\sup_n \|f_n\|_\infty < \infty$). Indeed, choose $N$ with $\|f_n - f\|_\infty < 1$ for all $n \geq N$. Then $\|f\|_\infty \leq \|f_N\|_\infty + 1$ and $\|f_n\|_\infty \leq \|f_N\|_\infty + 2$ for $n \geq N$, while the finitely many earlier terms are each bounded. Compactness of the domain plays no role in this argument.
### Pointwise vs. Uniform Boundedness
To state convergence theorems precisely, we need to distinguish between pointwise and uniform bounds on sequences of functions.
[definition: Pointwise Bounded Sequence]
A sequence of functions $(f_n)$ on a set $E \subseteq \mathbb{R}$ is **pointwise bounded** if for each $x \in E$, there exists $M_x > 0$ such that $|f_n(x)| \leq M_x$ for all $n \in \mathbb{N}$.
[/definition]
[definition: Uniformly Bounded Sequence]
A sequence of functions $(f_n)$ on a set $E \subseteq \mathbb{R}$ is **uniformly bounded** if there exists $M > 0$ such that $|f_n(x)| \leq M$ for all $n \in \mathbb{N}$ and all $x \in E$.
[/definition]
Uniform boundedness implies pointwise boundedness (take $M_x = M$ for every $x$), but the converse fails. A pointwise bounded sequence may have bounds $M_x$ that grow without limit, preventing any single $M$ from controlling all $f_n$ at all points. This distinction becomes critical in the Arzelà-Ascoli theorem, where uniform boundedness (together with equicontinuity) guarantees the existence of uniformly convergent subsequences.
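A concrete instance of the gap between the two notions, chosen here for illustration and not taken from the text above, is $f_n(x) = \min(n, 1/x)$ on $E = (0, 1]$: at each fixed $x$ the values are bounded by $M_x = 1/x$, yet $\sup_x f_n(x) = n$ is unbounded in $n$. A short Python sketch of this hypothetical example:

```python
def f(n: int, x: float) -> float:
    """Hypothetical example: f_n(x) = min(n, 1/x) on (0, 1]."""
    return min(float(n), 1.0 / x)

# Pointwise bound: at a fixed x, the values never exceed M_x = 1/x.
x = 0.25
print([f(n, x) for n in (1, 2, 4, 8, 16)])

# No uniform bound: evaluating f_n at x = 1/(2n) already gives the value n.
print([f(n, 1.0 / (2 * n)) for n in (1, 10, 100)])
```

The first list stabilises at $1/x = 4$, while the second grows without bound, so no single $M$ works for all $n$ and all $x$.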
## Composition and Integrability
### The Good-Bad Subinterval Technique
The final main result extends integrability to compositions, and its proof introduces an important technique: decomposing the partition into "good" subintervals (where the inner function oscillates little, so the outer function's uniform continuity controls the composition) and "bad" subintervals (where the inner function oscillates a lot, but the [Riemann criterion](/theorems/281) ensures these have small total length). Variants of this good-bad decomposition recur throughout analysis, for instance in Vitali covering arguments, Calderón-Zygmund decompositions, and level-set methods in PDE.
[quotetheorem:284]
The key hypothesis is that $g$ is continuous on the *closed interval* $[A, B]$ containing the range of $f$. This ensures uniform continuity of $g$ (by [Heine-Cantor](/theorems/280)), which is used to control the oscillation of $g \circ f$ on the "good" subintervals. On the "bad" subintervals, the crude bound $\omega_j(g \circ f) \leq 2\|g\|_\infty$ is all that is available, but the [Riemann criterion](/theorems/281) for $f$ guarantees that these subintervals have small total length. The proof combines these two estimates — a precise bound on good subintervals and a crude-but-controlled bound on bad ones — into a single $\varepsilon$ bound on $U(P, g \circ f) - L(P, g \circ f)$.
[citeproof:284]
The result has several important consequences. Taking $g(t) = t^2$ shows that $f^2$ is integrable whenever $f$ is; taking $g(t) = |t|$ shows $|f|$ is integrable. More generally, any continuous transformation of an integrable function produces an integrable function, which is essential for change-of-variable arguments and for the theory of $L^p$ spaces built from the Riemann integral.
[example: Products of Integrable Functions]
If $f, g: [a, b] \to \mathbb{R}$ are both bounded and Riemann integrable, then $fg$ is Riemann integrable. This follows from the [Integrability of Continuous Composition](/theorems/284) and the algebraic identity:
\begin{align*}
fg = \frac{1}{4}\bigl((f + g)^2 - (f - g)^2\bigr).
\end{align*}
The functions $f + g$ and $f - g$ are integrable (sums of integrable functions are integrable), and squaring is a continuous operation, so $(f + g)^2$ and $(f - g)^2$ are integrable by the [Integrability of Continuous Composition](/theorems/284) with outer function $\varphi(t) = t^2$ (written $\varphi$ to avoid a clash with the function $g$ of this example). A linear combination of integrable functions is integrable, so $fg$ is integrable. This roundabout proof avoids the need to estimate the oscillation of a product directly, a task that would require careful handling of the interaction between the oscillations of $f$ and $g$.
[/example]
## Worked Example
[problem]
Let $f_n: [0, 1] \to \mathbb{R}$ be defined by $f_n(x) = \frac{nx}{1 + n^2 x^2}$. Show that $(f_n)$ converges pointwise but not uniformly on $[0, 1]$, that each $f_n$ is Riemann integrable, and compute $\lim_{n \to \infty} \int_0^1 f_n(x) \, dx$. Determine whether this limit equals $\int_0^1 \lim_{n \to \infty} f_n(x) \, dx$, and identify which hypothesis of the [Uniform Limit Preserves Integrability theorem](/theorems/283) fails.
[/problem]
[solution]
**Step 1: Pointwise limit.**
Fix $x \in (0, 1]$. Then:
\begin{align*}
f_n(x) = \frac{nx}{1 + n^2 x^2} = \frac{1}{1/(nx) + nx}.
\end{align*}
As $n \to \infty$, $nx \to \infty$ and $1/(nx) \to 0$, so the denominator tends to infinity and $f_n(x) \to 0$. At $x = 0$, $f_n(0) = 0$ for all $n$. The pointwise limit is $f(x) = 0$ for all $x \in [0, 1]$.
**Step 2: Non-uniformity of the convergence.**
To find $\|f_n\|_\infty$, compute the critical point. Differentiating: $f_n'(x) = n(1 + n^2 x^2)^{-2}(1 - n^2 x^2)$, which vanishes at $x = 1/n$. Evaluating:
\begin{align*}
f_n(1/n) = \frac{n \cdot 1/n}{1 + n^2 \cdot 1/n^2} = \frac{1}{2}.
\end{align*}
Therefore $\|f_n - f\|_\infty = \|f_n\|_\infty = 1/2$ for all $n$. The convergence is not uniform because the supremum norm does not tend to zero — the "bump" at $x = 1/n$ has fixed height $1/2$, migrating toward the origin but never shrinking.
**Step 3: Integrability and computation of the integral.**
Each $f_n$ is continuous on $[0, 1]$ and hence Riemann integrable by the [Integrability of Continuous Functions](/theorems/282). Using the substitution $u = 1 + n^2 x^2$, so $du = 2n^2 x \, dx$:
\begin{align*}
\int_0^1 \frac{nx}{1 + n^2 x^2} \, dx = \frac{1}{2n} \int_1^{1 + n^2} \frac{du}{u} = \frac{1}{2n} \ln(1 + n^2).
\end{align*}
As $n \to \infty$, $\ln(1 + n^2) \sim 2\ln n$, so:
\begin{align*}
\int_0^1 f_n(x) \, dx = \frac{\ln(1 + n^2)}{2n} \sim \frac{\ln n}{n} \to 0.
\end{align*}
**Step 4: Comparison of the two limits.**
We have $\lim_{n \to \infty} \int_0^1 f_n \, dx = 0$ and $\int_0^1 \lim_{n \to \infty} f_n(x) \, dx = \int_0^1 0 \, dx = 0$. In this particular example, the two limits happen to agree — the interchange of limit and integral gives the correct answer despite the convergence being non-uniform.
**Step 5: Identification of the failing hypothesis.**
The hypothesis that fails in the [Uniform Limit Preserves Integrability theorem](/theorems/283) is **uniform convergence**: as computed in Step 2, $\|f_n - f\|_\infty = 1/2 \not\to 0$. The theorem guarantees that the interchange works whenever convergence is uniform, but the interchange can also hold for other reasons (as in this example, where the integrals converge by an explicit computation). The value of the theorem is that it provides a *sufficient* condition that can be checked without computing the integrals — when direct computation is infeasible, uniform convergence is the standard tool for justifying the interchange. The fact that the interchange holds here despite non-uniform convergence reflects the special structure of the $f_n$ (the mass concentrates at a single point and the "bump" has total area tending to zero); this phenomenon is studied systematically by the bounded and dominated convergence theorems in measure theory.
[/solution]
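The closed forms obtained in the solution can be cross-checked numerically. The following Python sketch is only a sanity check under stated assumptions: the grid size and the midpoint rule are illustrative choices, and the function names are hypothetical.

```python
import math

def f(n: int, x: float) -> float:
    """f_n(x) = n x / (1 + n^2 x^2) on [0, 1]."""
    return n * x / (1.0 + (n * x) ** 2)

def sup_on_grid(n: int, steps: int = 100_000) -> float:
    """Largest value of f_n on a uniform grid that contains x = 1/n for the n used below."""
    return max(f(n, k / steps) for k in range(steps + 1))

def integral_closed_form(n: int) -> float:
    """ln(1 + n^2) / (2n), from the substitution u = 1 + n^2 x^2."""
    return math.log(1.0 + n * n) / (2.0 * n)

def integral_midpoint(n: int, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f_n over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f(n, (k + 0.5) * h) for k in range(steps))

for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  grid sup={sup_on_grid(n):.6f}  "
          f"closed form={integral_closed_form(n):.6f}  midpoint={integral_midpoint(n):.6f}")
```

The grid supremum stays at $1/2$ while the integrals decay like $\ln n / n$, matching Steps 2 and 3.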
## References
1. Sheratt, N., *Cambridge Part IB — Analysis and Topology*, Lecture Notes.

---

Having developed the theory of uniform convergence, series of functions, and Riemann integration on the real line, we now step back and ask a structural question: *what minimal framework is needed for these concepts to make sense?* The definitions of convergence, continuity, and completeness all rest on a single ingredient, a notion of distance, and the algebraic structure of $\mathbb{R}^n$ plays no essential role. By axiomatising the properties of distance that our proofs actually use, we arrive at the concept of a metric space: a set equipped with a distance function satisfying three natural axioms. This abstraction is not merely an exercise in generality; it reveals which analytical properties depend solely on distance and which require additional structure (linearity, compactness, local geometry), and it provides a unified language for function spaces, sequence spaces, and spaces of geometric objects that arise naturally throughout mathematics.
[motivation]
### What Do Our Proofs Actually Use?
Looking back at the results of the previous sections, a pattern emerges. The proof that a uniformly convergent sequence of continuous functions has a continuous limit uses the triangle inequality for the metric on $\mathbb{R}$ and the $\varepsilon$-$\delta$ definition of continuity — but nothing about addition or multiplication of real numbers. The Heine-Cantor theorem uses the Bolzano-Weierstrass property of compact intervals, which is ultimately a statement about sequences and convergence, not about the algebraic operations. The Banach fixed point argument (previewed below) uses only the triangle inequality, the contraction estimate, and the completeness of the space.
This suggests that the "right" setting for these results is not $\mathbb{R}^n$ with its full algebraic and order structure, but rather any set where distances between points are well-defined. The metric space axioms capture exactly the properties needed: positivity (distinct points are separated), symmetry (distance is undirected), and the triangle inequality (indirect paths are never shorter than direct ones).
### Why Not Just Normed Spaces?
[Normed vector spaces](/page/Normed%20Vector%20Space) — where $d(x, y) = \|x - y\|$ — are an important special case, but they require a linear structure that many natural spaces lack. The space of all continuous curves in the plane, equipped with the Hausdorff distance, is a metric space but not a vector space. The set of all compact subsets of $\mathbb{R}^n$, the space of isometry classes of Riemannian manifolds, and even finite [sets](/page/Set) with combinatorial distances all carry natural metrics with no underlying linear structure. Metric spaces are the correct generality for studying convergence, continuity, and completeness.
### The Role of Completeness
A subtler motivation concerns the passage from sequences to limits. In $\mathbb{R}$, the least upper bound axiom guarantees that every Cauchy sequence converges — but in an arbitrary metric space, this may fail. The rational numbers $\mathbb{Q}$ with the standard metric form an incomplete metric space: the sequence of decimal approximations to $\sqrt{2}$ is Cauchy but has no limit in $\mathbb{Q}$. Completeness becomes a property that must be *verified*, not assumed, and the question of which spaces are complete — and how to recognise completeness in practice — becomes a central concern. The relationship between completeness and closedness, and the completeness of function spaces under the uniform metric, are the analytical payoffs of this section.
[/motivation]
## Core Definitions
### Metric Spaces and Convergence
The axioms of a metric space distill the essential properties of distance.
[definition: Metric Space]
Let $X$ be a set. A **metric** on $X$ is a function $d: X \times X \to \mathbb{R}$ satisfying, for all $x, y, z \in X$:
1. **Positivity:** $d(x, y) \geq 0$, with $d(x, y) = 0$ if and only if $x = y$.
2. **Symmetry:** $d(x, y) = d(y, x)$.
3. **Triangle inequality:** $d(x, y) \leq d(x, z) + d(z, y)$.
The pair $(X, d)$ is called a **metric space**.
[/definition]
The positivity axiom ensures that distinct points are metrically separated — this is what makes limits unique. The triangle inequality is the workhorse: it underpins every approximation argument, every continuity proof, and every compactness argument in metric space theory. Symmetry is the least analytically significant axiom, and some authors study "quasi-metric spaces" where it is dropped, but we will not pursue this here.
[definition: Convergence in a Metric Space]
Let $(X, d)$ be a metric space. A sequence $(x_n)$ in $X$ **converges** to $x \in X$ if for every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ such that $d(x_n, x) < \varepsilon$ for all $n \geq N$. We write $x_n \to x$ or $\lim_{n \to \infty} x_n = x$.
[/definition]
Limits in a metric space are unique: if $x_n \to x$ and $x_n \to y$, the triangle inequality gives $d(x, y) \leq d(x, x_n) + d(x_n, y) < 2\varepsilon$ for all $\varepsilon > 0$, hence $d(x, y) = 0$ and $x = y$. This simple argument — which uses all three metric axioms — is the reason the axioms are chosen as they are.
### Standard Examples
The same underlying set can carry many different metrics, and the choice of metric determines which sequences converge, which functions are continuous, and which subsets are open. The examples below illustrate this diversity.
[definition: Euclidean Metric]
For $x, y \in \mathbb{R}^n$, the **Euclidean metric** is $d_2(x, y) = \left(\sum_{k=1}^n |x_k - y_k|^2\right)^{1/2}$.
[/definition]
[definition: $\ell_p$ Metrics]
For $x, y \in \mathbb{R}^n$ and $p \in [1, \infty)$, the **$\ell_p$ metric** is $d_p(x, y) = \left(\sum_{k=1}^n |x_k - y_k|^p\right)^{1/p}$. The **$\ell_\infty$ metric** is $d_\infty(x, y) = \max_{1 \leq k \leq n} |x_k - y_k|$.
[/definition]
The fact that $d_p$ satisfies the triangle inequality for $p \geq 1$ is the Minkowski inequality, a nontrivial result whose proof uses Hölder's inequality. For $0 < p < 1$ and $n \geq 2$, the triangle inequality fails: the unit "ball" in $\ell_p^n$ is not convex, and $d_p$ is not a metric.
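The failure of the triangle inequality below $p = 1$ can be seen on three points of $\mathbb{R}^2$. The Python sketch below uses the points $(0,0)$, $(1,0)$, $(1,1)$ and the exponent $p = 1/2$; these specific choices are illustrative assumptions rather than part of the text.

```python
def d_p(x, y, p: float) -> float:
    """The l_p expression (sum |x_k - y_k|^p)^(1/p); a metric only for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, z, y = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
for p in (0.5, 1.0, 2.0):
    direct = d_p(x, y, p)
    detour = d_p(x, z, p) + d_p(z, y, p)
    print(f"p={p}  d(x,y)={direct:.4f}  d(x,z)+d(z,y)={detour:.4f}  "
          f"triangle inequality holds: {direct <= detour}")
```

For $p = 1/2$ the direct "distance" is $4$ while the detour through $z$ costs only $2$, so the triangle inequality is violated.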
[definition: Uniform Metric on Bounded Functions]
Let $S$ be a set and $(Y, e)$ a metric space. For bounded functions $f, g: S \to Y$, the **uniform metric** (or **sup metric**) is $\mathcal{D}(f, g) = \sup_{s \in S} e(f(s), g(s))$.
[/definition]
Convergence in the uniform metric is precisely uniform convergence of functions: $\mathcal{D}(f_n, f) \to 0$ if and only if $\sup_s e(f_n(s), f(s)) \to 0$. This connects the abstract metric space framework directly to the theory of uniform convergence from the earlier sections.
[definition: $L_p$ Metrics on $C[a, b]$]
For continuous functions $f, g: [a, b] \to \mathbb{R}$ and $p \in [1, \infty)$, the **$L_p$ metric** is $d_p(f, g) = \left(\int_a^b |f(x) - g(x)|^p \, dx\right)^{1/p}$.
[/definition]
[definition: Discrete Metric]
On any set $X$, the **discrete metric** is $d(x, y) = 0$ if $x = y$ and $d(x, y) = 1$ if $x \neq y$.
[/definition]
The discrete metric makes every subset open, and a sequence converges in it if and only if it is eventually constant. It induces the finest possible topology on $X$ (every subset is open) and serves as a useful test case and counterexample generator.
[definition: Metric Subspace]
Given a metric space $(X, d)$ and a subset $Y \subseteq X$, the **metric subspace** is $(Y, d|_{Y \times Y})$, where $d|_{Y \times Y}$ denotes the restriction of $d$ to $Y \times Y$.
[/definition]
[example: Convergence Depends on the Metric]
Consider the sequence $f_n(x) = x^n$ in $C[0, 1]$. In the uniform metric, $\mathcal{D}(f_n, 0) = \|f_n\|_\infty = \sup_{x \in [0,1]} |x^n| = 1$ for all $n$, so $f_n$ does not converge to $0$; in fact it has no uniform limit in $C[0, 1]$ at all, because a uniform limit would have to agree with the pointwise limit, which is discontinuous at $x = 1$. In the $L_1$ metric, $d_1(f_n, 0) = \int_0^1 x^n \, dx = 1/(n+1) \to 0$, so $f_n \to 0$. The same sequence converges in one metric and diverges in another: the metric, not the set, determines convergence.
[/example]
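The two distances in this example are easy to tabulate. The following Python sketch approximates the supremum on a grid containing $x = 1$ and the integral by a midpoint rule; the grid sizes are arbitrary illustrative choices.

```python
def sup_distance_to_zero(n: int, grid: int = 1001) -> float:
    """max of x^n over a uniform grid on [0, 1] that includes the endpoint x = 1."""
    return max((k / (grid - 1)) ** n for k in range(grid))

def l1_distance_to_zero(n: int, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of x^n over [0, 1]."""
    h = 1.0 / steps
    return h * sum(((k + 0.5) * h) ** n for k in range(steps))

for n in (1, 10, 100):
    print(f"n={n:4d}  sup distance={sup_distance_to_zero(n):.4f}  "
          f"L1 distance={l1_distance_to_zero(n):.6f}  1/(n+1)={1.0 / (n + 1):.6f}")
```

The sup distance is pinned at $1$ by the endpoint $x = 1$, while the $L_1$ distance tracks $1/(n+1)$ and tends to zero.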
## Product Metrics
Given two metric spaces, there are many natural ways to define a metric on their Cartesian product. The standard choices mirror the $\ell_p$ norms.
[definition: Product Metric]
Let $(X, d)$ and $(Y, d')$ be metric spaces. For $p \in [1, \infty)$, the **$\ell_p$-product metric** on $X \times Y$ is:
\begin{align*}
d_p\bigl((x_1, y_1), (x_2, y_2)\bigr) = \bigl(d(x_1, x_2)^p + d'(y_1, y_2)^p\bigr)^{1/p}.
\end{align*}
The **$\ell_\infty$-product metric** is $d_\infty\bigl((x_1, y_1), (x_2, y_2)\bigr) = \max\{d(x_1, x_2),\, d'(y_1, y_2)\}$.
[/definition]
A fundamental property of all these product metrics is that they induce componentwise convergence:
[theorem: Convergence in Product Spaces]
Let $(X, d)$ and $(Y, d')$ be metric spaces. A sequence $(x_n, y_n)$ converges to $(x, y)$ in the $\ell_p$-product $(X \times Y, d_p)$ for any $p \in [1, \infty]$ if and only if $x_n \to x$ in $X$ and $y_n \to y$ in $Y$.
[/theorem]
The proof is immediate from the inequalities $\max\{d(x_n, x), d'(y_n, y)\} \leq d_p((x_n, y_n), (x, y)) \leq d(x_n, x) + d'(y_n, y)$, which hold for all $p \in [1, \infty]$. The left inequality shows that product convergence implies componentwise convergence; the right inequality shows the converse. An important consequence is that all $\ell_p$-product metrics are topologically equivalent — they define the same open sets, the same convergent sequences, and the same continuous functions.
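The sandwich inequality can be checked numerically on a simple case. The sketch below takes $X = Y = \mathbb{R}$ with the usual distance and the sequence $(1/n,\, 2 + 1/n^2) \to (0, 2)$; both are illustrative assumptions, and a small tolerance absorbs floating-point rounding.

```python
import math

def product_metric(dx: float, dy: float, p: float) -> float:
    """l_p combination of the component distances dx = d(x1, x2) and dy = d'(y1, y2)."""
    if math.isinf(p):
        return max(dx, dy)
    return (dx ** p + dy ** p) ** (1.0 / p)

# Component distances of (1/n, 2 + 1/n^2) from the limit (0, 2).
for n in (1, 10, 100):
    dx, dy = 1.0 / n, 1.0 / n ** 2
    for p in (1, 2, 3, math.inf):
        d = product_metric(dx, dy, p)
        # max{dx, dy} <= d_p <= dx + dy, up to rounding error.
        assert max(dx, dy) <= d * (1 + 1e-12)
        assert d <= (dx + dy) * (1 + 1e-12)
    print(n, [round(product_metric(dx, dy, p), 6) for p in (1, 2, 3, math.inf)])
```

All four product distances shrink together as $n$ grows, as the sandwich predicts.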
## Continuity in Metric Spaces
### The $\varepsilon$-$\delta$ Definition and Its Sequential Equivalent
The definitions of continuity and uniform continuity generalise verbatim from the real line to arbitrary metric spaces, with absolute values replaced by the metric.
[definition: Continuity Between Metric Spaces]
Let $(X, d)$ and $(Y, d')$ be metric spaces. A function $f: X \to Y$ is **continuous at $a \in X$** if for every $\varepsilon > 0$, there exists $\delta > 0$ such that $d(x, a) < \delta$ implies $d'(f(x), f(a)) < \varepsilon$. The function $f$ is **continuous** if it is continuous at every point of $X$.
[/definition]
In practice, the $\varepsilon$-$\delta$ definition is often difficult to verify directly, and the sequential characterisation provides a more flexible alternative.
[quotetheorem:285]
The forward direction is routine — the $\varepsilon$-$\delta$ condition provides the bound on $d'(f(x_n), f(a))$ once $n$ is large enough. The reverse direction is a contrapositive argument: if the $\varepsilon$-$\delta$ condition fails, the negation produces a sequence $x_n \to a$ with $d'(f(x_n), f(a)) \geq \varepsilon_0$, contradicting the sequential condition. This "sequences witness discontinuity" principle is one of the most useful tools in metric space theory — to prove a function is discontinuous, one constructs a single convergent sequence whose image does not converge.
[citeproof:285]
The sequential characterisation also shows that compositions of continuous functions are continuous: if $x_n \to a$, then $f(x_n) \to f(a)$ by continuity of $f$, and then $g(f(x_n)) \to g(f(a))$ by continuity of $g$.
[definition: Uniform Continuity Between Metric Spaces]
A function $f: (X, d) \to (Y, d')$ is **uniformly continuous** if for every $\varepsilon > 0$, there exists $\delta > 0$ such that $d(x, y) < \delta$ implies $d'(f(x), f(y)) < \varepsilon$ for all $x, y \in X$.
[/definition]
The [Heine-Cantor theorem](/theorems/280) from the previous section generalises: continuous functions on compact metric spaces are uniformly continuous. The proof is identical — if uniform continuity fails, extract two sequences with $d(x_n, y_n) \to 0$ and $d'(f(x_n), f(y_n)) \geq \varepsilon_0$; compactness provides convergent subsequences converging to a common limit, contradicting continuity at that limit.
[example: Continuity Depends on the Metric]
The identity function $\mathrm{id}: (C[0,1], d_\infty) \to (C[0,1], d_1)$ is continuous, because $d_1(f, g) = \int_0^1 |f(x) - g(x)| \, dx \leq d_\infty(f, g)$ on an interval of length $1$, so uniform convergence implies $L_1$ convergence. However, the identity function in the reverse direction, $\mathrm{id}: (C[0,1], d_1) \to (C[0,1], d_\infty)$, is *not* continuous: the sequence $f_n(x) = x^n$ satisfies $d_1(f_n, 0) \to 0$ but $d_\infty(f_n, 0) = 1 \not\to 0$, so $f_n \to 0$ in $d_1$ but $\mathrm{id}(f_n) = f_n \not\to 0$ in $d_\infty$, contradicting the [sequential characterisation](/theorems/285).
[/example]
## Topology of Metric Spaces
### Open and Closed Sets
[motivation]
### Why [Open Sets](/page/Open%20Set)?
Looking back at the proofs in this section and the previous ones, a pattern becomes visible. The sequential characterisation of continuity, the closedness-completeness theorem, the proof that uniform limits of continuous functions are continuous — in each case, the metric $d$ appears only through the open balls $B_r(x)$, and the actual argument is about which points are "nearby" which others. The specific numerical values $d(x,y)$ are used only to produce balls, and then the reasoning proceeds by containment: "there exists a ball around $x$ contained in $U$," "every ball around $x$ meets $A$."
This suggests that the metric carries redundant information. What our proofs really use is *which sets contain a ball around each of their points* — the collection of open sets. Two metrics that produce the same open sets (i.e., are topologically equivalent) yield identical theories of convergence, continuity, and closedness, even if they assign different numerical distances. The $\ell_1$, $\ell_2$, and $\ell_\infty$ metrics on $\mathbb{R}^n$ illustrate this: they give different distances between the same pairs of points, but they produce the same open sets, the same convergent sequences, and the same continuous functions.
Isolating the open-set structure from the metric that generates it leads to the fully abstract treatment of topology in §6, where the axioms for open sets are taken as primitive and no metric is assumed at all. For now, we develop the theory within metric spaces, where every topological statement has a concrete $\varepsilon$-ball interpretation.
[/motivation]
[definition: Open Ball]
Let $(X, d)$ be a metric space, $x \in X$, and $r > 0$. The **open ball** of radius $r$ centred at $x$ is $B_r(x) = \{y \in X : d(y, x) < r\}$.
[/definition]
[definition: Open Set]
A subset $U \subseteq X$ is **open** if for every $x \in U$, there exists $r > 0$ such that $B_r(x) \subseteq U$.
[/definition]
Open balls are themselves open sets: if $y \in B_r(x)$, set $\rho = r - d(y, x) > 0$; then for any $z \in B_\rho(y)$, the triangle inequality gives $d(z, x) \leq d(z, y) + d(y, x) < \rho + d(y, x) = r$, so $z \in B_r(x)$. This simple argument is the prototype for all "triangle inequality plus margin" arguments in metric topology.
[definition: Closed Set]
A subset $A \subseteq X$ is **closed** if for every sequence $(x_n)$ in $A$ with $x_n \to x$ in $X$, the limit $x$ belongs to $A$.
[/definition]
This sequential definition of closedness is equivalent to the topological one (complement is open), and the equivalence is a fundamental structural result:
[quotetheorem:286]
Both directions are proved by contradiction. If $A$ is closed but $A^c$ is not open, there is a point $x \in A^c$ with no open ball contained in $A^c$; the balls $B_{1/n}(x)$ must each contain a point of $A$, producing a sequence in $A$ converging to $x \notin A$ — contradicting closedness. Conversely, if $A^c$ is open but $A$ is not closed, some sequence $(x_n) \subseteq A$ converges to $x \notin A$; since $x \in A^c$ and $A^c$ is open, some ball around $x$ lies in $A^c$, but the sequence eventually enters that ball — a contradiction.
[citeproof:286]
The duality between open and closed sets has immediate consequences: the empty set and $X$ are both open and closed; finite intersections of open sets are open; arbitrary unions of open sets are open; and the analogous statements for closed sets hold with intersections and unions swapped.
[example: Subspace Convergence and Closedness]
Let $X = \mathbb{R}$ with the standard metric and $N = (0, \infty) \subseteq X$. The sequence $x_n = 1/n$ lies in $N$ and converges to $0$ in $X$, but $0 \notin N$. Therefore $N$ is not closed in $X$. Equivalently, $N^c = (-\infty, 0]$ is not open (no ball around $0$ lies entirely in $(-\infty, 0]$). However, the closed interval $[0, \infty)$ *is* closed: any convergent sequence of non-negative reals has a non-negative limit.
[/example]
## Completeness
### [Cauchy Sequences](/page/Cauchy%20Sequence) and Complete Spaces
Completeness is the property that ensures limits of "would-be convergent" sequences actually exist in the space.
[definition: Cauchy Sequence]
A sequence $(x_n)$ in a metric space $(X, d)$ is **Cauchy** if for every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ such that $d(x_m, x_n) < \varepsilon$ for all $m, n \geq N$.
[/definition]
[definition: Complete Metric Space]
A metric space $(X, d)$ is **complete** if every Cauchy sequence in $X$ converges to a point in $X$.
[/definition]
Every convergent sequence is Cauchy (the triangle inequality gives $d(x_m, x_n) \leq d(x_m, x) + d(x, x_n) < 2\varepsilon$ for large $m, n$), but the converse fails in general. The rational numbers $\mathbb{Q}$ with the standard metric are incomplete: the sequence $x_1 = 1, \, x_2 = 1.4, \, x_3 = 1.41, \, \ldots$ of truncated decimal expansions of $\sqrt{2}$ is Cauchy but has no limit in $\mathbb{Q}$.
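The Cauchy property of the truncated decimals can be checked with exact rational arithmetic. The following is a minimal sketch rather than part of the original notes: the helper name `sqrt2_truncation` is hypothetical, and the index `k` counts decimal places, so `k = 0` corresponds to $x_1 = 1$ above.

```python
from fractions import Fraction
from math import isqrt

def sqrt2_truncation(k: int) -> Fraction:
    """Decimal expansion of sqrt(2) truncated to k places, as an exact rational."""
    scale = 10 ** k
    # isqrt(2 * scale^2) is the integer part of sqrt(2) * scale.
    return Fraction(isqrt(2 * scale * scale), scale)

terms = [sqrt2_truncation(k) for k in range(8)]

# Cauchy estimate: for m >= n the truncations differ by less than 10^(-n).
for n in range(8):
    for m in range(n, 8):
        assert abs(terms[m] - terms[n]) < Fraction(1, 10 ** n)

# Every term is rational, so no term squares to exactly 2;
# the gaps 2 - x^2 stay positive while shrinking.
gaps = [2 - t * t for t in terms]
```

Every term lives in $\mathbb{Q}$ and the gaps `2 - t*t` shrink towards zero without ever reaching it, which is the computational shadow of the statement that the only candidate limit, $\sqrt{2}$, is missing from $\mathbb{Q}$.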
A key property of Cauchy sequences is that convergence of a subsequence implies convergence of the full sequence: if $(x_n)$ is Cauchy and $x_{n_k} \to x$, then for any $\varepsilon > 0$, choose $K$ with $d(x_{n_k}, x) < \varepsilon/2$ for $k \geq K$, and $N$ with $d(x_m, x_n) < \varepsilon/2$ for $m, n \geq N$. For $n \geq \max(N, n_K)$, pick $k \geq K$ with $n_k \geq N$, giving $d(x_n, x) \leq d(x_n, x_{n_k}) + d(x_{n_k}, x) < \varepsilon$. This "subsequence extraction" principle is essential in compactness arguments.
### Completeness and Closedness
The relationship between completeness of a subspace and closedness in the ambient space is clean and bidirectional:
[quotetheorem:287]
The first direction (complete implies closed) uses the fact that a convergent sequence is Cauchy: if a sequence in $N$ converges in $X$, it is Cauchy in $N$, hence converges in $N$ (by completeness), and [uniqueness of limits](/theorems/625) forces the two limits to agree. The second direction (closed in a complete space implies complete) is even simpler: a Cauchy sequence in $N$ is Cauchy in $X$, hence converges in $X$, and closedness pulls the limit back into $N$. This characterisation is the standard tool for proving completeness of concrete spaces: to show $N$ is complete, embed it in a known complete space $X$ and verify that $N$ is closed.
[citeproof:287]
[example: Completeness of Closed Subspaces]
The space $\mathbb{R}$ is complete (by the least upper bound axiom), so every closed subset of $\mathbb{R}$ is complete. In particular, $[a, b]$ is complete for any $a \leq b$. The open interval $(0, 1)$ is *not* complete: the sequence $x_n = 1/(n+1)$ lies in $(0, 1)$ and is Cauchy, but it converges in $\mathbb{R}$ to $0 \notin (0, 1)$, so it has no limit in $(0, 1)$. Equivalently, $(0, 1)$ is not closed in $\mathbb{R}$.
[/example]
### Completeness of Function Spaces
The most important completeness result for applications is that bounded function spaces inherit completeness from their target:
[quotetheorem:288]
The proof follows the standard three-step pattern for function-space completeness results: (1) construct the candidate limit pointwise, using the completeness of the target space $Y$; (2) verify that the pointwise limit is bounded, by comparing it to a nearby element of the Cauchy sequence; (3) prove that the convergence is uniform, by passing pointwise estimates from the Cauchy condition to the limit. The critical observation in step (3) is that, for fixed $s$ and $n$, the inequality $e(f_m(s), f_n(s)) \leq \mathcal{D}(f_m, f_n) < \varepsilon$ survives the passage $m \to \infty$ as the non-strict bound $e(f(s), f_n(s)) \leq \varepsilon$, where $f$ denotes the pointwise limit, because the metric $e$ is continuous. The bound does not depend on $s$, which is exactly uniform convergence.
[citeproof:288]
This result has an important corollary: the space $C_b(X, Y)$ of bounded continuous functions from a metric space $X$ to a complete metric space $Y$ is complete under the uniform metric. The proof combines the [Completeness of Bounded Function Spaces](/theorems/288) with the [Completeness and Closedness of Subspaces](/theorems/287): $C_b(X, Y)$ is a subspace of $\ell_\infty(X, Y)$, and it is closed because the uniform limit of continuous functions is continuous (by the uniform limit theorem). Since $\ell_\infty(X, Y)$ is complete and $C_b(X, Y)$ is closed in it, $C_b(X, Y)$ is complete.
[example: Completeness of $C[a, b]$ Under the Uniform Metric]
The space $C[a, b]$ of continuous functions on $[a, b]$ is complete under the uniform metric $\mathcal{D}(f, g) = \sup_{x \in [a,b]} |f(x) - g(x)|$. Every continuous function on a compact interval is bounded (by the Extreme Value Theorem), so $C[a, b] = C_b([a, b], \mathbb{R})$. Since $\mathbb{R}$ is complete, $\ell_\infty([a,b], \mathbb{R})$ is complete by the [Completeness of Bounded Function Spaces](/theorems/288), and $C[a, b]$ is closed in $\ell_\infty([a,b], \mathbb{R})$ by the uniform limit theorem. Therefore $C[a, b]$ is complete by the [Completeness and Closedness of Subspaces](/theorems/287).
[/example]
[example: Incompleteness of $C[0, 1]$ Under the $L_1$ Metric]
The space $C[0, 1]$ is *not* complete under the $L_1$ metric $d_1(f, g) = \int_0^1 |f(x) - g(x)| \, dx$. Define the sequence of continuous functions:
\begin{align*}
f_n(x) = \begin{cases} 0 & \text{if } x \leq 1/2, \\ n(x - 1/2) & \text{if } 1/2 < x \leq 1/2 + 1/n, \\ 1 & \text{if } x > 1/2 + 1/n. \end{cases}
\end{align*}
Each $f_n$ is continuous. For $m > n$, $d_1(f_m, f_n) \leq 1/n \to 0$, so the sequence is Cauchy. The pointwise limit is the Heaviside step function $\mathbb{1}_{(1/2, 1]}$, which is discontinuous and therefore not in $C[0, 1]$. No continuous function can be the $L_1$-limit either: any $L_1$-limit must agree with the pointwise limit almost everywhere, and no continuous function equals $\mathbb{1}_{(1/2, 1]}$ almost everywhere. This illustrates that changing the metric can destroy completeness, and explains why the uniform metric — not the $L_1$ metric — is the natural choice for $C[a, b]$.
[/example]
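A numerical sanity check of this example is sketched below. It is illustrative only: the names `ramp`, `d1`, and `d_sup` are our own, the integral is approximated by a midpoint rule, and the supremum is approximated on a grid. For $n \geq 2$ the exact values are $d_1(f_{2n}, f_n) = 1/(4n)$ and $\sup_x |f_{2n}(x) - f_n(x)| = 1/2$.

```python
def ramp(n):
    """Return f_n from the example: 0 up to 1/2, then slope n, capped at 1."""
    return lambda x: 0.0 if x <= 0.5 else min(n * (x - 0.5), 1.0)

def d1(f, g, steps=20_000):
    """Midpoint-rule approximation of the L1 distance on [0, 1]."""
    h = 1.0 / steps
    return h * sum(abs(f((i + 0.5) * h) - g((i + 0.5) * h)) for i in range(steps))

def d_sup(f, g, steps=20_000):
    """Largest pointwise gap on a uniform grid (a lower bound for the sup distance)."""
    h = 1.0 / steps
    return max(abs(f(i * h) - g(i * h)) for i in range(steps + 1))

for n in (5, 10, 20, 40):
    f_n, f_2n = ramp(n), ramp(2 * n)
    # The L1 distance shrinks like 1/(4n); the uniform distance stays near 1/2.
    print(n, round(d1(f_2n, f_n), 6), round(d_sup(f_2n, f_n), 6))
```

The sequence is therefore Cauchy for $d_1$ but not for the uniform metric, consistent with $C[0, 1]$ being complete under the latter.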
## Equivalence of Metrics
Different metrics on the same set may or may not define the same topological structure. Understanding when two metrics are "essentially the same" is important because it tells us which properties (convergence, continuity, openness) depend on the specific choice of metric and which are intrinsic.
[definition: Topologically Equivalent Metrics]
Two metrics $d$ and $d'$ on a set $X$ are **topologically equivalent** if they define the same open sets: $U$ is open in $(X, d)$ if and only if $U$ is open in $(X, d')$.
[/definition]
[definition: Lipschitz Equivalent Metrics]
Two metrics $d$ and $d'$ on $X$ are **Lipschitz equivalent** if there exist constants $a, b > 0$ such that $a \cdot d(x, y) \leq d'(x, y) \leq b \cdot d(x, y)$ for all $x, y \in X$.
[/definition]
Lipschitz equivalence is a strong condition: it implies that the identity maps $(X, d) \to (X, d')$ and $(X, d') \to (X, d)$ are both Lipschitz continuous (hence uniformly continuous, hence continuous), so in particular, the metrics are topologically equivalent. The converse fails: two metrics can be topologically equivalent without being Lipschitz equivalent.
All $\ell_p$ metrics on $\mathbb{R}^n$ are Lipschitz equivalent. The key inequalities are:
\begin{align*}
d_\infty(x, y) \leq d_p(x, y) \leq n^{1/p} \, d_\infty(x, y),
\end{align*}
which hold for all $p \in [1, \infty)$. In particular, convergence, continuity, and openness in $\mathbb{R}^n$ are the same regardless of which $\ell_p$ metric is used. This is a finite-dimensional phenomenon: in infinite-dimensional spaces, different $\ell_p$ norms are generally not equivalent.
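The sandwich inequality can be spot-checked on random vectors. This is a sketch under our own naming (`d_p`, `d_inf`, `check_sandwich` are hypothetical helpers), and a random test is evidence rather than proof.

```python
import random

def d_p(x, y, p):
    """The l_p distance between two vectors of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def d_inf(x, y):
    """The l_infinity (maximum) distance."""
    return max(abs(a - b) for a, b in zip(x, y))

def check_sandwich(n, p, trials=1000, seed=0):
    """Test d_inf <= d_p <= n^(1/p) * d_inf on random pairs in [-1, 1]^n."""
    rng = random.Random(seed)
    tol = 1e-9  # allowance for floating-point rounding
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        y = [rng.uniform(-1, 1) for _ in range(n)]
        lower, middle = d_inf(x, y), d_p(x, y, p)
        upper = n ** (1.0 / p) * lower
        if not (lower <= middle + tol and middle <= upper + tol):
            return False
    return True

# Both constants are attained: a difference of (1, 0, ..., 0) gives equality
# on the left, and a difference of (1, ..., 1) gives equality on the right.
n, p = 4, 2
left_extremal = d_p([1, 0, 0, 0], [0] * n, p)    # equals d_inf = 1
right_extremal = d_p([1] * n, [0] * n, p)        # equals n^(1/p) * d_inf = 2
```

Since both constants are attained, neither $1$ nor $n^{1/p}$ can be improved. The upper constant grows with $n$, which hints at why the equivalence breaks down in infinite dimensions.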
[example: Non-Equivalent Metrics on $C[0, 1]$]
The uniform metric $d_\infty$ and the $L_1$ metric $d_1$ on $C[0, 1]$ are not topologically equivalent. The sequence $f_n(x) = x^n$ satisfies $d_1(f_n, 0) = 1/(n + 1) \to 0$ but $d_\infty(f_n, 0) = 1$ for every $n$, so $(f_n)$ converges in $d_1$ (to $0$) but not in $d_\infty$. Since the two metrics have different convergent sequences, they define different topologies. On $[0, 1]$ the inequality $d_1(f, g) \leq d_\infty(f, g)$ holds (on a general interval $[a, b]$ the constant is $b - a$), which shows that the $d_\infty$-topology is *finer* (has more open sets) than the $d_1$-topology: every $d_1$-open set is $d_\infty$-open, but not conversely.
[/example]
## The [Contraction Mapping Theorem](/theorems/71)
The section culminates with the Banach Fixed Point Theorem — one of the most widely applied results in all of analysis. Its power comes from the combination of *existence*, *uniqueness*, and a *constructive approximation scheme* in a single result.
[definition: Contraction Mapping]
Let $(X, d)$ be a metric space. A function $f: X \to X$ is a **contraction** (or **contraction mapping**) if there exists $\lambda \in [0, 1)$ such that $d(f(x), f(y)) \leq \lambda \, d(x, y)$ for all $x, y \in X$. The constant $\lambda$ is called the **contraction constant**.
[/definition]
Every contraction is Lipschitz continuous with Lipschitz constant $\lambda < 1$, and in particular is uniformly continuous. The condition $\lambda < 1$ is essential — a Lipschitz map with constant $\lambda = 1$ (i.e., a non-expansive map) need not have any fixed points, as the translation $f(x) = x + 1$ on $\mathbb{R}$ illustrates.
[quotetheorem:289]
The proof is entirely constructive: starting from any point $x_0 \in X$, the iterates $x_n = f^n(x_0)$ form a Cauchy sequence because the contraction inequality makes successive distances decay geometrically — $d(x_{n+1}, x_n) \leq \lambda^n d(x_1, x_0)$. Completeness provides a limit $z$, continuity of $f$ (which follows from the Lipschitz condition) shows $f(z) = z$, and the contraction inequality applied to two hypothetical fixed points gives uniqueness. The geometric decay rate also provides an explicit error estimate: $d(x_n, z) \leq \lambda^n d(x_1, x_0)/(1 - \lambda)$.
[citeproof:289]
Both hypotheses, completeness and the existence of a uniform constant $\lambda < 1$, are necessary. On the incomplete space $(0, 1)$ with the standard metric, the map $f(x) = x/2$ is a contraction but has no fixed point in $(0, 1)$ (the only candidate, $0$, lies outside the space). On the complete space $[1, \infty)$, the map $f(x) = x + 1/x$ satisfies $|f(x) - f(y)| < |x - y|$ for all $x \neq y$, since $0 \leq f'(x) = 1 - 1/x^2 < 1$, but it has no fixed point because $f(x) - x = 1/x > 0$. No single $\lambda < 1$ works here, since $f'(x) \to 1$ as $x \to \infty$, so $f$ is not a contraction.
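To see the iteration scheme and the a priori error estimate in action, here is a small sketch for one concrete contraction chosen by us for illustration: $f = \cos$ on the complete space $[0, 1]$. It maps $[0, 1]$ into itself and has $|f'(x)| = |\sin x| \leq \sin 1 < 1$ there, so $\lambda = \sin 1$ is an admissible contraction constant.

```python
import math

def iterate(f, x0, steps):
    """Return [x0, f(x0), f(f(x0)), ...] with `steps` applications of f."""
    xs = [x0]
    for _ in range(steps):
        xs.append(f(xs[-1]))
    return xs

lam = math.sin(1.0)                 # contraction constant for cos on [0, 1]
xs = iterate(math.cos, 0.0, 200)
z = xs[-1]                          # stands in for the fixed point, to machine precision

# A priori estimate: d(x_n, z) <= lam^n * d(x_1, x_0) / (1 - lam).
d10 = abs(xs[1] - xs[0])
for n in range(60):
    bound = lam ** n * d10 / (1 - lam)
    assert abs(xs[n] - z) <= bound + 1e-12   # small slack for rounding
```

The bound is usually pessimistic: near the fixed point the iterates contract at the local rate $|\sin z|$, which is smaller than the global constant $\sin 1$.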
[example: Solving an Integral Equation via Contraction]
Consider the integral equation $f(t) = y_0 + \int_0^t g(s, f(s)) \, ds$ on $[0, T]$, where $g: [0, T] \times \mathbb{R} \to \mathbb{R}$ is continuous and Lipschitz in the second variable with constant $L > 0$. Define the operator $\Phi: C[0, T] \to C[0, T]$ by:
\begin{align*}
(\Phi h)(t) = y_0 + \int_0^t g(s, h(s)) \, ds.
\end{align*}
Then $\Phi$ maps $C[0, T]$ to itself (by continuity of $g$ and the Fundamental Theorem of Calculus), and:
\begin{align*}
|(\Phi h_1)(t) - (\Phi h_2)(t)| &\leq \int_0^t |g(s, h_1(s)) - g(s, h_2(s))| \, ds \\
&\leq L \int_0^t |h_1(s) - h_2(s)| \, ds \leq LT \cdot d_\infty(h_1, h_2).
\end{align*}
Therefore $d_\infty(\Phi h_1, \Phi h_2) \leq LT \cdot d_\infty(h_1, h_2)$. If $T < 1/L$, then $\Phi$ is a contraction on the complete metric space $(C[0, T], d_\infty)$, and the [Banach Fixed Point Theorem](/theorems/289) guarantees a unique solution. This is the core of the Picard-Lindelöf existence and uniqueness theorem for ordinary differential equations.
[/example]
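The operator $\Phi$ can be iterated exactly in a simple special case. The sketch below assumes $g(s, y) = y$ and $y_0 = 1$ (our choice, giving the equation $f' = f$, $f(0) = 1$, with $L = 1$) and takes $T = 1/2$ so that $LT < 1$. The function names are hypothetical. Polynomials are stored as coefficient lists, lowest degree first.

```python
import math
from fractions import Fraction

def picard_step(coeffs, y0=Fraction(1)):
    """Apply (Phi h)(t) = y0 + int_0^t h(s) ds to the polynomial h.

    This is Phi for the illustrative choice g(s, y) = y."""
    return [y0] + [c / (k + 1) for k, c in enumerate(coeffs)]

def evaluate(coeffs, t):
    """Evaluate a polynomial given by its coefficient list at t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))

T = Fraction(1, 2)               # L = 1 here, so L * T = 1/2 < 1
iterates = [[Fraction(1)]]       # start from the constant function h_0 = 1
for _ in range(10):
    iterates.append(picard_step(iterates[-1]))

# With this starting point, consecutive iterates differ by one monomial
# c * t^k with c > 0, so their sup distance on [0, T] is attained at t = T.
gaps = [evaluate(b, T) - evaluate(a, T) for a, b in zip(iterates, iterates[1:])]

# Contraction estimate: each gap is at most L * T times the previous one.
assert all(g2 <= T * g1 for g1, g2 in zip(gaps, gaps[1:]))
```

The iterates are the Taylor partial sums of $e^t$, and the fixed point of $\Phi$ is the solution $f(t) = e^t$ of the initial value problem.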
## Worked Example
[problem]
Let $X = \{f \in C[0, 1] : f(0) = 0, \, \|f\|_\infty \leq 1\}$ with the uniform metric. Show that $X$ is a complete metric space, and determine whether the operator $T: X \to C[0, 1]$ defined by $(Tf)(x) = \frac{1}{2}\int_0^x f(t)^2 \, dt$ maps $X$ into $X$ and is a contraction.
[/problem]
[solution]
**Step 1: $X$ is closed in $C[0, 1]$.**
Let $(f_n)$ be a sequence in $X$ with $f_n \to f$ uniformly. Then $f$ is continuous (uniform limit of continuous functions), $f(0) = \lim f_n(0) = 0$, and $\|f\|_\infty = \lim \|f_n\|_\infty \leq 1$. Therefore $f \in X$, and $X$ is closed in $C[0, 1]$.
**Step 2: Completeness of $X$.**
Since $C[0, 1]$ is complete under the uniform metric (as established above using the [Completeness of Bounded Function Spaces](/theorems/288)), and $X$ is a closed subspace, the [Completeness and Closedness of Subspaces theorem](/theorems/287) gives that $X$ is complete.
**Step 3: $T$ maps $X$ into $X$.**
For $f \in X$, the function $Tf$ is continuous (as an integral of a continuous function). At $x = 0$: $(Tf)(0) = \frac{1}{2}\int_0^0 f(t)^2 \, dt = 0$. For the sup bound:
\begin{align*}
|(Tf)(x)| = \frac{1}{2}\left|\int_0^x f(t)^2 \, dt\right| \leq \frac{1}{2}\int_0^x |f(t)|^2 \, dt \leq \frac{1}{2}\|f\|_\infty^2 \cdot x \leq \frac{1}{2} \cdot 1 \cdot 1 = \frac{1}{2} \leq 1.
\end{align*}
Therefore $Tf \in X$.
**Step 4: Contraction estimate.**
For $f, g \in X$:
\begin{align*}
|(Tf)(x) - (Tg)(x)| &= \frac{1}{2}\left|\int_0^x \bigl(f(t)^2 - g(t)^2\bigr) \, dt\right| \\
&\leq \frac{1}{2}\int_0^x |f(t) + g(t)| \cdot |f(t) - g(t)| \, dt \\
&\leq \frac{1}{2}\bigl(\|f\|_\infty + \|g\|_\infty\bigr) \|f - g\|_\infty \cdot x.
\end{align*}
Since $\|f\|_\infty, \|g\|_\infty \leq 1$ and $x \leq 1$:
\begin{align*}
\|Tf - Tg\|_\infty \leq \frac{1}{2}(1 + 1) \cdot 1 \cdot \|f - g\|_\infty = \|f - g\|_\infty.
\end{align*}
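This gives the Lipschitz bound $\lambda = 1$, which by itself neither proves nor rules out that $T$ is a contraction. The bound cannot be improved on $X$: for small $\delta, \varepsilon > 0$, take $f(t) = \min(t/\delta, 1)$ and $g = (1 - \varepsilon) f$. Both lie in $X$, with $\|f - g\|_\infty = \varepsilon$, and $(Tf)(1) - (Tg)(1) = \frac{1}{2}(2\varepsilon - \varepsilon^2)(1 - 2\delta/3)$. The ratio $\|Tf - Tg\|_\infty / \|f - g\|_\infty$ is therefore at least $(1 - \varepsilon/2)(1 - 2\delta/3)$, which tends to $1$ as $\delta, \varepsilon \to 0$, so no $\lambda < 1$ works. However, if we restrict to the interval $[0, \alpha]$ with $\alpha < 1$ and replace $X$ by $X_\alpha = \{f \in C[0, \alpha] : f(0) = 0, \, \|f\|_\infty \leq 1\}$, the same estimate gives $\lambda = \alpha < 1$, and the [Banach Fixed Point Theorem](/theorems/289) applies on $X_\alpha$, which is complete by the argument of Steps 1 and 2.
**Step 5: Conclusion.**
The operator $T$ maps $X$ into $X$ and is Lipschitz with constant $1$ on $[0, 1]$, so it is non-expansive, but it is not a contraction because the constant $1$ is sharp. On a shorter interval $[0, \alpha]$ with $\alpha < 1$, $T$ becomes a contraction with constant $\alpha$ on $X_\alpha$, and the Banach theorem guarantees a unique fixed point $f$ satisfying $f(x) = \frac{1}{2}\int_0^x f(t)^2 \, dt$. This fixed point satisfies the ODE $f'(x) = \frac{1}{2}f(x)^2$ with $f(0) = 0$. Since the zero function is a fixed point, uniqueness forces $f \equiv 0$.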
[/solution]
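One way to probe whether the Lipschitz constant $1$ from Step 4 can be improved is to evaluate the ratio $\|Tf - Tg\|_\infty / \|f - g\|_\infty$ on specific functions. The sketch below uses ramp functions $f(t) = \min(t/\delta, 1)$ and $g = (1 - \varepsilon) f$, which are our own illustrative choice and not part of the original problem. The integral is approximated by the trapezoid rule, and a direct calculation for this pair gives the ratio $(1 - \varepsilon/2)(1 - 2\delta/3)$.

```python
def apply_T(f, steps=20_000):
    """Grid values of (Tf)(x) = 1/2 * int_0^x f(t)^2 dt on [0, 1] (trapezoid rule)."""
    h = 1.0 / steps
    values, prev = [0.0], f(0.0) ** 2
    for i in range(1, steps + 1):
        cur = f(i * h) ** 2
        # Trapezoid increment h * (prev + cur) / 2, times the factor 1/2 in T.
        values.append(values[-1] + 0.25 * h * (prev + cur))
        prev = cur
    return values

def lipschitz_ratio(eps, delta, steps=20_000):
    """Approximate ||Tf - Tg|| / ||f - g|| for f = ramp and g = (1 - eps) * ramp."""
    f = lambda t: min(t / delta, 1.0)
    g = lambda t: (1.0 - eps) * f(t)
    Tf, Tg = apply_T(f, steps), apply_T(g, steps)
    num = max(abs(a - b) for a, b in zip(Tf, Tg))
    den = max(abs(f(i / steps) - g(i / steps)) for i in range(steps + 1))
    return num / den

for eps, delta in [(0.1, 0.1), (0.01, 0.01), (0.001, 0.001)]:
    print(eps, delta, lipschitz_ratio(eps, delta))
```

The printed ratios stay below $1$ but creep towards it as $\varepsilon$ and $\delta$ shrink, which is consistent with $T$ being non-expansive on $X$ without admitting any contraction constant $\lambda < 1$.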
## References
1. Sheratt, N., *Cambridge Part IB — Analysis and Topology*, Lecture Notes.

---

Having explored metric spaces in the previous section (their convergence theory, completeness, and the Banach Fixed Point Theorem), we now confront a structural limitation. Many arguments in analysis depend not on specific distances between points, but on qualitative features: which sets are open, which sequences converge, which functions are continuous. The $\varepsilon$-$\delta$ machinery of metric spaces is powerful, but it carries more structure than these arguments require. Moreover, important constructions (quotient spaces, function spaces with pointwise convergence, spaces of [distributions](/page/Distribution)) resist metric description entirely.
Topology isolates the essential ingredient: the collection of open sets. By axiomatising the closure properties that open sets satisfy in metric spaces — closure under arbitrary unions and finite intersections — one obtains a framework that captures "closeness" and "continuity" without reference to a distance function. The payoff is twofold: familiar metric-space results (uniqueness of limits, continuity via preimages, properties of closed sets) are revealed as consequences of a few structural axioms, and genuinely new constructions (quotient spaces, non-metrisable topologies) become accessible.
[motivation]
### Why Abandon Metrics?
The definition of an open set in a metric space $(X, d)$ — a set $U$ such that every point has an open ball around it contained in $U$ — depends on the metric only through the *collection of open sets it generates*. Two different metrics on $X$ can produce the same open sets (as we saw with equivalent metrics on $\mathbb{R}^n$), and when they do, every topological property — convergence, continuity, compactness — is identical. This suggests that the metric is carrying redundant information: what matters is the topology it induces.
### What the Axioms Capture
In metric spaces, the empty set and the whole space are open; any union of open sets is open; any finite intersection of open sets is open. These three properties are the *only* features of open sets used in most convergence and continuity arguments. The definition of a topology axiomatises precisely these closure properties, discarding everything else. The restriction to *finite* intersections is essential: in $\mathbb{R}$, each interval $(-1/n, 1/n)$ is open, but $\bigcap_{n=1}^\infty (-1/n, 1/n) = \{0\}$ is not. Requiring closure under arbitrary intersections would force $\{0\}$, and likewise every singleton, to be open in $\mathbb{R}$, collapsing the standard topology to the discrete one.
### What We Gain and What We Lose
The abstraction buys generality: quotient spaces (obtained by "gluing" points together), function spaces with pointwise convergence, and the Zariski topology in algebraic geometry are all topological spaces that need not be metrisable. But we lose something too. In metric spaces, sequences suffice to characterise all topological notions — a set is closed if and only if it contains the limits of all its convergent sequences. In general topological spaces, this fails: sequences are too coarse a tool, and one needs the more general notion of nets or filters. The Hausdorff separation axiom partially compensates, ensuring at least that limits of sequences are unique, but the full equivalence between sequential and topological descriptions is a special feature of metric (and more generally, first-countable) spaces.
[/motivation]
## Open Sets and the Definition of a Topology
What should the axioms of a "topology" capture? The answer comes from examining which properties of open sets in metric spaces are actually used in proofs. Once continuity is phrased as "preimages of open sets are open," for example, the continuity of a composition follows without any mention of distances. Likewise, the closure properties of closed sets follow from those of open sets purely by taking complements. Much of the theory of convergence and continuity in metric spaces rests on the *algebra of open sets*, not on the metric itself.
In a metric space, an open set is one where every point has "room to move" — a ball of positive radius around it still contained in the set. The intuition is that of *interior safety*: if you are in an open set, small perturbations keep you inside. This is why open sets are the right primitive for studying continuity (small perturbations of the input produce small perturbations of the output) and convergence (eventually, the sequence is "safely inside" every neighbourhood of the limit). The axioms below distill the algebraic properties that make this intuition work.
[definition: Topology]
Let $X$ be a set. A **topology** on $X$ is a collection $\tau \subseteq \mathcal{P}(X)$ satisfying:
1. $\varnothing \in \tau$ and $X \in \tau$.
2. If $\{U_i\}_{i \in I}$ is any collection of sets in $\tau$ (with $I$ an arbitrary index set), then $\bigcup_{i \in I} U_i \in \tau$.
3. If $U, V \in \tau$, then $U \cap V \in \tau$.
The pair $(X, \tau)$ is called a **topological space**, and the members of $\tau$ are called **open sets**.
[/definition]
Axiom (1) says that the trivial cases behave correctly: the empty set is vacuously open, and the whole space offers no boundary to violate. Axiom (2) says that combining regions of safety produces a larger region of safety — if you are safe in $U_i$, you are safe in the union. Axiom (3) says that the overlap of two safe regions is still safe, but this is restricted to *finite* intersections: an infinite intersection can shrink to a single point (as with $\bigcap_{n=1}^\infty (-1/n, 1/n) = \{0\}$ in $\mathbb{R}$), destroying the "room to move" that characterises openness. Axiom (3) extends by induction to any finite intersection: if $U_1, \ldots, U_n \in \tau$, then $U_1 \cap \cdots \cap U_n \in \tau$. The base case $n = 0$ gives $X$ (the empty intersection), consistent with axiom (1).
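On a finite set the axioms can be checked mechanically. The following sketch uses a hypothetical helper `is_topology`. It relies on the fact that for a finite collection, closure under pairwise unions and intersections already gives closure under all unions and finite intersections (by induction, with the empty union and empty intersection covered by axiom (1)).

```python
from itertools import combinations

def is_topology(X, tau):
    """Check the three topology axioms for a collection tau of subsets of a finite set X."""
    X = frozenset(X)
    tau = {frozenset(U) for U in tau}
    if any(not U <= X for U in tau):
        return False                       # every member must be a subset of X
    if frozenset() not in tau or X not in tau:
        return False                       # axiom (1)
    for U, V in combinations(tau, 2):
        if U | V not in tau:               # axiom (2), pairwise form
            return False
        if U & V not in tau:               # axiom (3)
            return False
    return True

X = {1, 2, 3}
nested = [set(), {1}, {1, 2}, {1, 2, 3}]       # a chain of sets: a topology
broken = [set(), {1}, {2}, {1, 2, 3}]          # {1} | {2} = {1, 2} is missing
print(is_topology(X, nested), is_topology(X, broken))
```

The second collection fails only axiom (2), which shows that the axioms are genuinely independent constraints rather than automatic consequences of containing $\varnothing$ and $X$.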
### Extreme Topologies and the Comparison Lattice
To build intuition for the axioms, it helps to examine the extreme cases — topologies with as few or as many open sets as possible — and to understand what happens as we move between them.
[definition: Indiscrete Topology]
The **indiscrete topology** (or **trivial topology**) on a set $X$ is $\tau = \{\varnothing, X\}$.
[/definition]
[definition: Discrete Topology]
The **discrete topology** on a set $X$ is $\tau = \mathcal{P}(X)$, the power set of $X$.
[/definition]
The indiscrete topology has the fewest open sets the axioms permit: no set other than $\varnothing$ and $X$ is declared open. This means the topology cannot distinguish between any two points: there is no open set containing one but not the other. As we will see, this makes convergence trivial (every sequence converges to every point) and continuity into the space automatic (every function *into* an indiscrete space is continuous). At the other extreme, the discrete topology declares every subset open. Here the topology distinguishes points maximally, since every singleton $\{x\}$ is open. This makes convergence very restrictive (only eventually constant sequences converge) and continuity out of the space automatic (every function *from* a discrete space is continuous).
Every topology on $X$ lies between these extremes, and comparing topologies reveals a trade-off between the ease of convergence and the strength of separation:
[definition: Coarser and Finer]
Let $\tau_1$ and $\tau_2$ be topologies on a set $X$. We say $\tau_1$ is **coarser** than $\tau_2$ (equivalently, $\tau_2$ is **finer** than $\tau_1$) if $\tau_1 \subseteq \tau_2$.
[/definition]
A finer topology has more open sets, hence draws finer distinctions between points. This creates a fundamental tension: a finer topology on $X$ makes convergence harder (more open sets for a sequence to eventually enter), continuity of maps *into* $X$ harder to achieve (more preimages to verify as open), and separation stronger (more open sets available to separate points). Conversely, a coarser topology makes convergence easier but separation weaker. The indiscrete topology is the coarsest topology on any set; the discrete topology is the finest.
### A Non-Trivial Example: The Cofinite Topology
Between the two extremes lies a rich variety of topologies. The cofinite topology provides a first non-trivial example that illustrates how a topology can be "interesting" — neither so coarse as to be trivial nor so fine as to be discrete.
[definition: Cofinite Topology]
The **cofinite topology** on a set $X$ is $\tau = \{U \subseteq X : X \setminus U \text{ is finite}\} \cup \{\varnothing\}$.
[/definition]
The idea is that a non-empty set is open precisely when its complement is "small" (finite). This is a genuinely different notion of "openness" from the metric one: in $\mathbb{R}$ with the cofinite topology, the set $\mathbb{R} \setminus \{0, 1, 2\}$ is open (its complement is finite), but a bounded open interval $(a, b)$ is not open, because its complement is infinite.
[example: Verifying the Cofinite Topology]
We verify that the cofinite topology on an infinite set $X$ satisfies the axioms. Axiom (1): $X \setminus X = \varnothing$ is finite, so $X \in \tau$, and $\varnothing \in \tau$ by definition. Axiom (2): empty members do not affect a union, so discard them; if none remain, the union is $\varnothing \in \tau$. Otherwise each remaining $U_i$ has finite complement, and $X \setminus \bigcup_i U_i = \bigcap_i (X \setminus U_i) \subseteq X \setminus U_j$ for any $j$, which is finite; so $\bigcup_i U_i \in \tau$. Axiom (3): if $U$ or $V$ is empty, then $U \cap V = \varnothing \in \tau$; otherwise $X \setminus (U \cap V) = (X \setminus U) \cup (X \setminus V)$ is a union of two finite sets, hence finite, so $U \cap V \in \tau$. This example is important because it produces a topology strictly between the indiscrete and discrete topologies on any infinite set.
[/example]
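The verification above suggests a convenient way to compute with cofinite open sets even when $X$ is infinite: store a non-empty open set by its finite complement. The sketch below is our own illustration (the sentinel `EMPTY` and the function names are hypothetical), with $X = \mathbb{Z}$ in mind. De Morgan's laws turn unions of open sets into intersections of complements, and intersections into unions.

```python
EMPTY = None  # sentinel for the empty set, the only open set with infinite complement

def union(opens):
    """Union of finitely many open sets, each given as EMPTY or a finite complement."""
    complements = [c for c in opens if c is not EMPTY]
    if not complements:
        return EMPTY                           # a union of empty sets is empty
    # X \ (union of U_i) = intersection of the complements X \ U_i.
    return frozenset.intersection(*complements)

def intersection(u, v):
    """Intersection of two open sets in the same representation."""
    if u is EMPTY or v is EMPTY:
        return EMPTY
    # X \ (U & V) = (X \ U) | (X \ V), again a finite set.
    return u | v

U = frozenset({0, 1})   # represents Z \ {0, 1}
V = frozenset({1, 2})   # represents Z \ {1, 2}
print(union([U, V]))        # complement {1}, i.e. the open set Z \ {1}
print(intersection(U, V))   # complement {0, 1, 2}
```

Note that `intersection` never returns `EMPTY` for two non-empty inputs. On an infinite set, two non-empty cofinite open sets always meet.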
## Metrisability and Separation
Not every topology comes from a metric, and understanding *when* a topology is metrisable is a central problem in topology. The stakes are practical: if a topological space is metrisable, we recover the full power of $\varepsilon$-$\delta$ arguments, sequential characterisations of closedness and compactness, and completeness. If it is not, we must work with the weaker (but more general) tools of open-set arguments.
The simplest obstruction to metrisability is the failure of *separation*. In a metric space, distinct points are always a positive distance apart, which lets us build disjoint open neighbourhoods around them. A topology that cannot separate points in this way can never come from a metric. This motivates the Hausdorff axiom — the minimal separation condition that makes analysis workable.
[definition: Metrisable]
A topological space $(X, \tau)$ is **metrisable** if there exists a metric $d$ on $X$ such that $\tau$ equals the metric topology induced by $d$ — that is, $U \in \tau$ if and only if for every $x \in U$ there exists $r > 0$ with $B_r(x) \subseteq U$.
[/definition]
The metric need not be unique: on $\mathbb{R}^n$, the Euclidean, $\ell_1$, and $\ell_\infty$ metrics all induce the same topology. The question is whether *some* compatible metric exists at all.
[definition: Hausdorff]
A topological space $(X, \tau)$ is **Hausdorff** (or $T_2$) if for any distinct points $x, y \in X$, there exist open sets $U, V \in \tau$ with $x \in U$, $y \in V$, and $U \cap V = \varnothing$.
[/definition]
The Hausdorff condition formalises the idea that the topology has "enough" open sets to tell points apart. Among the standard separation axioms ($T_0$, $T_1$, $T_2$, and so on), it is the weakest that guarantees uniqueness of limits (as we will prove below), which is why it appears as a standing hypothesis throughout analysis. Stronger separation axioms exist (regularity, normality, complete regularity), but the Hausdorff condition is the most commonly invoked.
[quotetheorem:290]
The [Hausdorff property of metric spaces](/theorems/290) establishes that every metric space is automatically Hausdorff. The key input is the positivity axiom: distinct points $x \neq y$ have $d(x,y) > 0$, providing a positive separation distance. The proof sets $r = d(x, y)$, takes open balls of radius $r/2$ around each point, and uses the triangle inequality to show these balls are disjoint. This is the same half-radius trick that appears throughout metric space theory.
[/quotetheorem]
[citeproof:290]
The argument proceeds by contradiction: if a point $z$ belonged to both $B_{r/2}(x)$ and $B_{r/2}(y)$, the triangle inequality would give $d(x,y) \leq d(x,z) + d(z,y) < r/2 + r/2 = r = d(x,y)$, a strict self-inequality. This is the same proof pattern used for uniqueness of limits in metric spaces — the triangle inequality converts "close to $x$ and close to $y$" into an upper bound on $d(x,y)$ that contradicts positivity.
[/citeproof]
The contrapositive is immediate and provides the simplest test for non-metrisability: **if $(X, \tau)$ is not Hausdorff, then it is not metrisable.** The examples below apply this test to the indiscrete and cofinite topologies and contrast them with the discrete topology, which is metrisable.
[example: The Indiscrete Topology is Not Metrisable]
Let $X$ be a set with $|X| \geq 2$ and $\tau = \{\varnothing, X\}$ the indiscrete topology. For any distinct $x, y \in X$, the only non-empty open set is $X$ itself. Any open neighbourhoods $U \ni x$ and $V \ni y$ must both equal $X$, so $U \cap V = X \neq \varnothing$. The space is not Hausdorff, hence not metrisable. This illustrates the extreme case: too few open sets means no separation at all.
[/example]
[example: The Discrete Topology is Metrisable]
The discrete topology on any set $X$ is induced by the discrete metric $d(x,y) = 1$ for $x \neq y$, $d(x,x) = 0$. Every singleton $\{x\} = B_{1/2}(x)$ is open, so every subset is a union of singletons and hence open. This is the opposite extreme: so many open sets that every subset is open.
[/example]
[example: The Cofinite Topology on an Infinite Set is Not Hausdorff]
Let $X$ be an infinite set with the cofinite topology. For distinct $x, y \in X$, any non-empty open sets $U \ni x$ and $V \ni y$ have finite complements. Then $X \setminus (U \cap V) = (X \setminus U) \cup (X \setminus V)$ is a union of two finite sets, hence finite. Since $X$ is infinite, $U \cap V$ is infinite and in particular non-empty. No pair of non-empty open sets is disjoint, so the space is not Hausdorff and therefore not metrisable.
[/example]
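For finite spaces the Hausdorff condition can be tested by brute force. The sketch below uses hypothetical helpers `powerset` and `is_hausdorff`. It cannot handle the cofinite topology on an infinite set, but it reproduces the indiscrete and discrete cases and adds the two-point Sierpinski space as a further non-Hausdorff example of our own choosing.

```python
from itertools import combinations

def powerset(X):
    """All subsets of a finite set X, as frozensets (the discrete topology)."""
    X = list(X)
    return [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]

def is_hausdorff(X, tau):
    """Check that every pair of distinct points has disjoint open neighbourhoods."""
    tau = [frozenset(U) for U in tau]
    for x, y in combinations(list(X), 2):
        separated = any(
            x in U and y in V and not (U & V)
            for U in tau
            for V in tau
        )
        if not separated:
            return False
    return True

indiscrete = [set(), {0, 1}]
sierpinski = [set(), {0}, {0, 1}]      # only the point 0 has a small neighbourhood
print(is_hausdorff([0, 1], indiscrete))              # no separation possible
print(is_hausdorff([0, 1], sierpinski))              # 1 cannot be isolated from 0
print(is_hausdorff([0, 1, 2], powerset([0, 1, 2])))  # singletons separate everything
```

By the contrapositive stated above, the first two spaces are not metrisable.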
### Identifying Non-Hausdorff Quotients
A common source of non-Hausdorff spaces in practice is quotient constructions with poorly behaved equivalence relations. If $X$ is Hausdorff and $\sim$ is an equivalence relation on $X$, the quotient $X/{\sim}$ need not be Hausdorff. The failure occurs precisely when two distinct equivalence classes cannot be separated by disjoint saturated open sets, that is, by disjoint open sets that are unions of equivalence classes. For instance, the "line with two origins" (two copies of $\mathbb{R}$ glued along $\mathbb{R} \setminus \{0\}$) is not Hausdorff: the two origins cannot be separated because every open set around one origin contains points arbitrarily close to the other.
## Closed Sets, Interiors, and Closures
Open sets are the primitive notion in topology, but many natural mathematical objects are described more naturally by *closed* conditions. The zero set $\{x : f(x) = 0\}$ of a continuous function is closed (it is the preimage of the closed set $\{0\}$). The set of limit points of a convergent sequence is closed. The set of accumulation points of a bounded sequence in $\mathbb{R}$ is closed. In each case, "closedness" captures the idea of a set that *contains its own boundary* — no sequence of points in the set can escape by converging to a point outside it.
The formal definition is elegantly simple: a set is closed when its complement is open. This duality between open and closed sets means that every statement about open sets has a "mirror image" for closed sets, obtained by taking complements and applying [De Morgan's laws](/theorems/622). In practice, one uses whichever formulation is more convenient — and having both available is essential.
[definition: Closed Set]
A subset $A \subseteq X$ is **closed** if its complement $X \setminus A$ is open.
[/definition]
An important subtlety: "closed" is not the negation of "open." A set can be both open and closed (such as $\varnothing$ and $X$ in any topology, or any subset in the discrete topology), or neither open nor closed (such as $[0, 1)$ in $\mathbb{R}$ with the standard topology). The terminology is potentially misleading but firmly established.
[quotetheorem:306]
The [properties of closed sets](/theorems/306) are the exact duals of the open-set axioms, obtained by complementation via De Morgan's laws. Where open sets are closed under arbitrary unions, closed sets are closed under arbitrary intersections; where open sets are closed under finite intersections, closed sets are closed under finite unions. This duality is purely formal — every statement about closed sets can be translated mechanically into one about open sets and vice versa — but it is convenient to have both formulations available.
[/quotetheorem]
[citeproof:306]
Each property is proved by taking complements and applying the corresponding open-set axiom. For arbitrary intersections: $X \setminus \bigcap_i F_i = \bigcup_i (X \setminus F_i)$ by De Morgan, and the right-hand side is a union of open sets, hence open. For finite unions: $X \setminus (F_1 \cup \cdots \cup F_n) = (X \setminus F_1) \cap \cdots \cap (X \setminus F_n)$, a finite intersection of open sets, hence open. The restriction to *finite* unions of closed sets is essential: in $\mathbb{R}$, each singleton $\{1/n\}$ is closed, but the infinite union $\bigcup_{n=1}^\infty \{1/n\} = \{1, 1/2, 1/3, \ldots\}$ is not closed — it omits its limit point $0$.
[/citeproof]
### Measuring How Far a Set Is from Being Open or Closed
Given an arbitrary subset $A$ of a topological space, one often needs to find the "best open approximation from inside" (the largest open set contained in $A$) or the "best closed approximation from outside" (the smallest closed set containing $A$). These are the interior and closure operations, and they are dual to each other in the same sense that open and closed sets are dual.
The interior strips away the boundary points where $A$ fails to be open; the closure adds in the boundary points that $A$ is missing. Together, they measure how far $A$ is from being open or closed: $A$ is open if and only if $A = A^\circ$, and $A$ is closed if and only if $A = \overline{A}$.
[definition: Interior]
The **interior** of $A \subseteq X$, denoted $A^\circ$, is the union of all open subsets of $A$. Equivalently, $A^\circ$ is the largest open set contained in $A$.
[/definition]
[definition: Closure]
The **closure** of $A \subseteq X$, denoted $\overline{A}$, is the intersection of all closed sets containing $A$. Equivalently, $\overline{A}$ is the smallest closed set containing $A$.
[/definition]
The interior and closure are related by complementation: $(X \setminus A)^\circ = X \setminus \overline{A}$ and $\overline{X \setminus A} = X \setminus A^\circ$. A point $x$ belongs to $\overline{A}$ if and only if every open neighbourhood of $x$ meets $A$ — this is the topological replacement for the metric characterisation "every open ball around $x$ contains a point of $A$."
[example: Interiors and Closures in $\mathbb{R}$]
In $\mathbb{R}$ with the standard topology: for $A = [0, 1) \cup \{2\}$, the interior is $A^\circ = (0, 1)$ (the isolated point $2$ has no open interval around it contained in $A$) and the closure is $\overline{A} = [0, 1] \cup \{2\}$ (the point $1$ is a limit point of $[0,1)$, but $2$ is already closed as a singleton). For $\mathbb{Q}$, we have $\mathbb{Q}^\circ = \varnothing$ (no interval consists entirely of rationals) and $\overline{\mathbb{Q}} = \mathbb{R}$ (every real number is a limit of rationals). For $\mathbb{Z}$, we have $\mathbb{Z}^\circ = \varnothing$ and $\overline{\mathbb{Z}} = \mathbb{Z}$ (the integers form a closed, discrete subset).
[/example]
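On a finite topological space the interior and closure can be computed directly from their definitions, which gives a quick sanity check of the complementation identity $\overline{X \setminus A} = X \setminus A^\circ$. The following is a minimal sketch; the three-point space and the helper names `interior` and `closure` are illustrative assumptions, not part of the notes.

```python
from itertools import chain

def interior(A, topology):
    """Union of all open sets contained in A."""
    A = frozenset(A)
    return frozenset(chain.from_iterable(U for U in topology if U <= A))

def closure(A, X, topology):
    """Intersection of all closed sets containing A."""
    A, X = frozenset(A), frozenset(X)
    result = X
    for U in topology:
        F = X - U          # closed sets are complements of open sets
        if A <= F:
            result &= F
    return result

# Hypothetical example: X = {1, 2, 3} with a chain of open sets.
X = frozenset({1, 2, 3})
topology = [frozenset(), frozenset({1}), frozenset({1, 2}), X]

A = frozenset({1, 3})
int_A = interior(A, topology)    # largest open set inside A
cl_A = closure(A, X, topology)   # smallest closed set containing A

# Duality from the text: closure of the complement = complement of the interior.
duality_holds = closure(X - A, X, topology) == X - interior(A, topology)
```

Here `A` is neither open nor closed in the chosen topology, so both operations change the set.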
### Dense Subsets and [Separability](/page/Separable)
The closure operation leads naturally to the notion of density: a subset $A$ is dense in $X$ if its closure is all of $X$, meaning that $A$ "reaches" every point of $X$ through limits. Dense subsets are important because they often allow us to reduce problems about an uncountable space to problems about a countable (and therefore more manageable) subset. For instance, the fact that $\mathbb{Q}$ is dense in $\mathbb{R}$ means that a continuous function on $\mathbb{R}$ is completely determined by its values on $\mathbb{Q}$ — if two continuous functions agree on the rationals, they agree everywhere.
[definition: Dense]
A subset $A \subseteq X$ is **dense** if $\overline{A} = X$. Equivalently, $A$ is dense if and only if every non-empty open set contains a point of $A$.
[/definition]
[definition: Separable]
A topological space $X$ is **separable** if it contains a countable dense subset.
[/definition]
Separability is a countability condition that plays a significant role in functional analysis, where it often determines whether a space is "small enough" to be handled by sequential arguments. The space $\mathbb{R}^n$ is separable (with $\mathbb{Q}^n$ as a countable dense subset), and $C[a,b]$ with the supremum metric is separable (by the [Weierstrass approximation theorem](/theorems/480), polynomials with rational coefficients are dense). An uncountable set with the discrete topology is not separable: the closure of any subset is itself, so a dense subset must be the whole space, which is uncountable.
## Subspace Topology
Given a topological space $X$ and a subset $A \subseteq X$, we want $A$ to inherit a topology from $X$ in a way that makes the inclusion $A \hookrightarrow X$ continuous and preserves as much topological structure as possible. The question is: which subsets of $A$ should be declared open?
The answer is forced by the continuity requirement. If the inclusion $\iota: A \hookrightarrow X$ is to be continuous, then for every open $V \subseteq X$, the preimage $\iota^{-1}(V) = A \cap V$ must be open in $A$. Conversely, the topology consisting of exactly these sets $\{A \cap V : V \in \tau\}$ is the coarsest topology making $\iota$ continuous — any coarser topology would fail the preimage condition for some open $V$. This construction appears constantly: when we restrict attention to a closed interval $[a,b] \subseteq \mathbb{R}$, or to the unit sphere $S^n \subseteq \mathbb{R}^{n+1}$, we are implicitly using the subspace topology.
[definition: Subspace Topology]
Let $(X, \tau)$ be a topological space and $A \subseteq X$. The **subspace topology** on $A$ is $\tau_A = \{A \cap U : U \in \tau\}$.
[/definition]
One must be careful about the distinction between "open in $A$" and "open in $X$": a set can be open in the subspace without being open in the ambient space. For instance, $[0, 1)$ is open in the subspace $[0, 2]$ of $\mathbb{R}$ (since $[0, 1) = [0, 2] \cap (-1, 1)$), but $[0, 1)$ is not open in $\mathbb{R}$. This subtlety — that openness depends on which space you consider yourself to be living in — is a frequent source of errors and requires constant attention.
[example: Closed Subsets of Subspaces]
A subset $B \subseteq A$ is closed in the subspace topology if and only if $B = A \cap F$ for some closed set $F$ in $X$. For instance, in the subspace $A = (0, 2)$ of $\mathbb{R}$, the set $(0, 1]$ is closed in $A$ because $(0, 1] = (0, 2) \cap (-\infty, 1]$ and $(-\infty, 1]$ is closed in $\mathbb{R}$. However, $(0, 1]$ is not closed in $\mathbb{R}$ itself (the point $0$ is a limit point not in the set).
[/example]
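The same phenomenon, a set that is open in the subspace but not in the ambient space, already shows up in finite spaces. The sketch below computes $\tau_A = \{A \cap U : U \in \tau\}$ for a hypothetical three-point space; the space and the function name `subspace_topology` are assumptions made for illustration.

```python
def subspace_topology(A, topology):
    """Return {A ∩ U : U open in X} as a set of frozensets."""
    A = frozenset(A)
    return {A & U for U in topology}

# Hypothetical ambient space: X = {1, 2, 3} with a chain of open sets.
X = frozenset({1, 2, 3})
topology = [frozenset(), frozenset({1}), frozenset({1, 2}), X]

A = frozenset({2, 3})
tau_A = subspace_topology(A, topology)

# {2} = A ∩ {1, 2} is open in A, although {2} is not open in X.
open_in_A = frozenset({2}) in tau_A
open_in_X = frozenset({2}) in set(topology)
```

The set `{2}` plays the role of $[0, 1)$ inside $[0, 2]$: it is cut out of $A$ by an open set of $X$ without being open in $X$ itself.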
## Convergence and Continuity
In metric spaces, we defined convergence using the condition $d(x_n, x) < \varepsilon$ and continuity using the condition $d(f(x), f(a)) < \varepsilon$ whenever $d(x, a) < \delta$. But in our proofs, these conditions were almost always used in the form "the sequence eventually enters every open ball around $x$" or "the preimage of every open set is open." These reformulations refer only to open sets, not to the metric, so they generalise immediately to arbitrary topological spaces.
The resulting definitions agree with the metric-space definitions whenever a metric is present, but they apply much more broadly — including to spaces like function spaces with the weak topology or quotient spaces where no natural metric exists. The price of this generality is that sequences become a weaker tool: in metric spaces, sequences suffice to characterise closedness, continuity, and compactness, but in general topological spaces this fails. For this course, the key compensation is that the Hausdorff axiom rescues uniqueness of sequential limits, which is why it appears as a standing hypothesis throughout analysis.
[definition: Convergence]
A sequence $(x_n)$ in a topological space $(X, \tau)$ **converges** to $x \in X$ if for every open set $U$ with $x \in U$, there exists $N \in \mathbb{N}$ such that $x_n \in U$ for all $n \geq N$.
[/definition]
The definition captures the same intuition as the metric version, namely that the sequence eventually gets and stays close to $x$, but "close" is now measured by open neighbourhoods rather than $\varepsilon$-balls. In a coarser topology (fewer open sets), there are fewer conditions to check, so convergence is easier. In a finer topology (more open sets), convergence is harder. This is why every sequence converges to every point in the indiscrete topology (the only non-empty open set is the whole space, which contains every term) and only eventually constant sequences converge in the discrete topology (the singleton $\{x\}$ is an open neighbourhood of $x$ that the sequence must eventually enter).
[quotetheorem:291]
The [uniqueness of limits in Hausdorff spaces](/theorems/291) is the reason the Hausdorff axiom is the standard separation condition in analysis. Without it, a single sequence can converge to every point simultaneously: in the indiscrete topology on $\mathbb{R}$, the only open set containing any point is $\mathbb{R}$ itself, so every sequence converges to every point. The Hausdorff condition is precisely what prevents this pathology.
[/quotetheorem]
[citeproof:291]
The proof is a clean contradiction argument. Given $x_n \to x$ and $x_n \to y$ with $x \neq y$, the Hausdorff property produces disjoint open sets $U \ni x$ and $V \ni y$. Convergence forces the tail of $(x_n)$ into $U$ (beyond some $N_1$) and into $V$ (beyond some $N_2$). Past $\max(N_1, N_2)$, every $x_n$ would lie in $U \cap V = \varnothing$ — an impossibility. This is the topological version of the metric-space argument, with "disjoint open sets" replacing "disjoint open balls."
[/citeproof]
[example: Non-Unique Limits in the Indiscrete Topology]
In $(\mathbb{R}, \{\varnothing, \mathbb{R}\})$, the constant sequence $x_n = 0$ converges to every $x \in \mathbb{R}$: the only open set containing $x$ is $\mathbb{R}$, and $x_n \in \mathbb{R}$ for all $n$. This space is not Hausdorff, which is why uniqueness fails.
[/example]
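For a constant sequence $x_n = c$, the definition of convergence reduces to a finite check: $x$ is a limit exactly when every open set containing $x$ also contains $c$. The sketch below applies this to a hypothetical three-point set with the indiscrete and discrete topologies; the function name is an assumption made for illustration.

```python
def limits_of_constant_sequence(c, X, topology):
    """Points x such that every open set containing x also contains c.

    For the constant sequence x_n = c these are exactly its limits.
    """
    return frozenset(
        x for x in X
        if all(c in U for U in topology if x in U)
    )

X = frozenset({0, 1, 2})
indiscrete = [frozenset(), X]
discrete = [frozenset(s) for s in
            [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]]

limits_indiscrete = limits_of_constant_sequence(0, X, indiscrete)
limits_discrete = limits_of_constant_sequence(0, X, discrete)
```

In the indiscrete case every point is a limit, while in the discrete (Hausdorff) case the limit is unique.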
### Continuity as a Preimage Condition
In metric spaces, we proved that a function is continuous if and only if the preimage of every open set is open. In the topological setting, this preimage characterisation *becomes the definition*. The reason is that it captures exactly the right relationship: a continuous function is one that "respects the topological structure" — if $V$ is a region of safety in $Y$, then $f^{-1}(V)$ is a region of safety in $X$. Points that are topologically close in $X$ (in the sense of sharing open neighbourhoods) are mapped to points that are topologically close in $Y$.
[definition: Continuity]
A function $f: (X, \tau_X) \to (Y, \tau_Y)$ is **continuous** if $f^{-1}(V) \in \tau_X$ for every $V \in \tau_Y$.
[/definition]
[theorem: Continuity Preserves Convergence]
If $f: (X, \tau_X) \to (Y, \tau_Y)$ is continuous and $x_n \to a$ in $X$, then $f(x_n) \to f(a)$ in $Y$.
[/theorem]
The proof is a one-line unwinding: if $V$ is an open neighbourhood of $f(a)$, then $f^{-1}(V)$ is an open neighbourhood of $a$ (by continuity), so $x_n \in f^{-1}(V)$ for all sufficiently large $n$, giving $f(x_n) \in V$.
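The preimage condition in the definition of continuity is a finite check when the spaces are finite. The following sketch tests it on a hypothetical two-point space with open sets $\varnothing$, $\{1\}$, $\{0, 1\}$; functions are represented as dictionaries, and the names `preimage` and `is_continuous` are assumptions made for illustration.

```python
def preimage(f, V, X):
    """f^{-1}(V) for a function given as a dict on X."""
    return frozenset(x for x in X if f[x] in V)

def is_continuous(f, X, tau_X, tau_Y):
    """Check that the preimage of every open set of Y is open in X."""
    open_X = set(tau_X)
    return all(preimage(f, V, X) in open_X for V in tau_Y)

# Hypothetical two-point space in which {1} is open but {0} is not.
X = frozenset({0, 1})
tau = [frozenset(), frozenset({1}), X]

identity = {0: 0, 1: 1}
swap = {0: 1, 1: 0}

identity_ok = is_continuous(identity, X, tau, tau)
swap_ok = is_continuous(swap, X, tau, tau)   # preimage of {1} is {0}, not open
```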
### Sequential Continuity Versus Continuity
In metric spaces, the converse also holds: if $f$ preserves convergent sequences, then $f$ is continuous. In general topological spaces, this fails — sequences are too coarse to detect all topological information. The following example demonstrates that "sequentially continuous" and "continuous" are genuinely different notions in non-metrisable spaces, and is one of the key reasons general topology requires open-set arguments rather than sequential ones.
[example: Sequential Continuity Does Not Imply Continuity]
Let $X = \mathbb{R}$ with the cocountable topology (a set is open if it is empty or has countable complement), and let $Y = \mathbb{R}$ with the standard topology. Consider the identity map $f: X \to Y$. Every convergent sequence in $X$ is eventually constant: if $x_n \to a$ and infinitely many $x_n \neq a$, then $X \setminus \{x_n : x_n \neq a\}$ would be a cocountable open neighbourhood of $a$ missing infinitely many terms of the sequence, contradicting convergence. Since eventually constant sequences are automatically preserved, $f$ is sequentially continuous. But $f$ is not continuous: $(-1, 1)$ is open in $Y$, while $f^{-1}((-1, 1)) = (-1, 1)$ is not open in $X$ (its complement $(-\infty, -1] \cup [1, \infty)$ is uncountable).
[/example]
## Homeomorphisms and Topological Invariants
A central goal of topology is to classify spaces "up to topological equivalence" — to determine when two spaces, possibly described very differently, are really the "same" from the topological point of view. The circle $S^1$ can be embedded in $\mathbb{R}^2$ in many ways (as an ellipse, as the boundary of a triangle, as an arbitrarily irregular closed curve), but topologically these are all the same space. On the other hand, a circle and a line segment are topologically different — cutting a single point from a circle leaves a connected space, while cutting an interior point from a line segment disconnects it.
Making this intuition precise requires the notion of homeomorphism: a bijection that preserves all topological structure in both directions. The programme of distinguishing non-homeomorphic spaces then relies on finding *topological invariants* — properties that homeomorphisms must preserve.
[definition: Homeomorphism]
A function $f: (X, \tau_X) \to (Y, \tau_Y)$ is a **homeomorphism** if $f$ is a bijection and both $f$ and $f^{-1}$ are continuous. If such a function exists, $X$ and $Y$ are **homeomorphic**.
[/definition]
A homeomorphism is a relabelling of points that preserves all topological structure. The requirement that $f^{-1}$ also be continuous is essential: without it, a continuous bijection could "compress" topological information (for example, the identity from the discrete topology to the standard topology on $\mathbb{R}$ is a continuous bijection but not a homeomorphism). Properties preserved by homeomorphisms are called **topological invariants**: they depend only on the topology, not on any particular representation.
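A finite analogue of the discrete-to-standard identity map can be checked mechanically. In the sketch below, the identity from a discrete two-point space to the same set with a coarser topology is a continuous bijection whose inverse is not continuous. The spaces and helper names are hypothetical, chosen only for illustration.

```python
def preimage(f, V, X):
    return frozenset(x for x in X if f[x] in V)

def is_continuous(f, X, tau_X, tau_Y):
    open_X = set(tau_X)
    return all(preimage(f, V, X) in open_X for V in tau_Y)

def is_homeomorphism(f, X, tau_X, Y, tau_Y):
    """Bijection with f and its inverse both continuous (f is a dict on X)."""
    image = set(f.values())
    if image != set(Y) or len(image) != len(X):
        return False                      # not a bijection
    inverse = {y: x for x, y in f.items()}
    return (is_continuous(f, X, tau_X, tau_Y)
            and is_continuous(inverse, Y, tau_Y, tau_X))

X = frozenset({0, 1})
discrete = [frozenset(), frozenset({0}), frozenset({1}), X]
coarser = [frozenset(), frozenset({1}), X]
identity = {0: 0, 1: 1}

continuous_forward = is_continuous(identity, X, discrete, coarser)
homeo_discrete_to_coarser = is_homeomorphism(identity, X, discrete, X, coarser)
homeo_coarser_to_itself = is_homeomorphism(identity, X, coarser, X, coarser)
```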
[theorem: Topological Invariants]
The following are topological invariants: Hausdorffness, metrisability, separability, compactness, and connectedness. That is, if $X$ and $Y$ are homeomorphic and $X$ has any of these properties, then so does $Y$.
[/theorem]
Each of these properties is defined entirely in terms of open sets, so preservation under homeomorphisms is immediate from the definition (a homeomorphism gives a bijection between the open-set families of $X$ and $Y$). The importance of topological invariants is strategic: to prove two spaces are *not* homeomorphic, it suffices to find a topological invariant that one possesses and the other does not.
A natural question is whether every property we have studied is topological. Completeness — the property that every Cauchy sequence converges — is defined in terms of the metric, not the topology, so there is reason to suspect it might not be preserved by homeomorphisms.
[quotetheorem:293]
[Completeness is not a topological invariant](/theorems/293): there exist homeomorphic metric spaces with different completeness properties. This is a fundamental distinction between metric and topological properties. The Cauchy condition depends on the specific distances $d(x_m, x_n)$, not just on which sets are open. A homeomorphism preserves open sets but can wildly distort distances, turning Cauchy sequences into non-Cauchy ones and vice versa.
[/quotetheorem]
[citeproof:293]
The standard witness is the homeomorphism between $\mathbb{R}$ (complete) and $(0, 1)$ (incomplete). The map $f(x) = \tan(\pi x - \pi/2)$ from $(0,1)$ to $\mathbb{R}$ is a continuous bijection with continuous inverse $f^{-1}(y) = 1/2 + \arctan(y)/\pi$. Yet $(0,1)$ is not complete: the sequence $x_n = 1/(n+1)$ is Cauchy in $(0,1)$ but has no limit there, since it converges in $\mathbb{R}$ to $0 \notin (0,1)$. The homeomorphism "stretches" the left endpoint of $(0,1)$ to $-\infty$, so the image sequence $f(x_n)$ is unbounded and hence not Cauchy in $\mathbb{R}$. This shows that the Cauchy property is a metric phenomenon, invisible to the topology.
[/citeproof]
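A short numerical experiment illustrates the distortion of distances. The sketch below evaluates $f(x) = \tan(\pi x - \pi/2)$ along the sequence $1/(n+1)$, which lies in $(0,1)$ and tends to $0$; the choice of sequence and the number of terms are assumptions made for illustration, and floating-point values are only approximations.

```python
import math

def f(x):
    """The homeomorphism (0, 1) -> R from the text."""
    return math.tan(math.pi * x - math.pi / 2)

def f_inv(y):
    """Its inverse R -> (0, 1)."""
    return 0.5 + math.atan(y) / math.pi

xs = [1 / (n + 1) for n in range(1, 8)]   # 1/2, 1/3, ..., 1/8
ys = [f(x) for x in xs]

# Consecutive gaps: they shrink in (0, 1) but do not shrink to zero in R.
gaps_x = [abs(a - b) for a, b in zip(xs, xs[1:])]
gaps_y = [abs(a - b) for a, b in zip(ys, ys[1:])]
```

Since $f(1/(n+1)) = -\cot(\pi/(n+1))$ behaves like $-(n+1)/\pi$ for large $n$, the image gaps stay near $1/\pi$ instead of tending to zero, so the image sequence is not Cauchy.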
[example: Homeomorphic Spaces with Different Metrics]
The spaces $[0, 1]$ and $[0, 2]$ are homeomorphic via $f(x) = 2x$, which is a continuous bijection with continuous inverse $f^{-1}(y) = y/2$. Both are compact and complete. The space $(0, 1)$ is homeomorphic to $\mathbb{R}$ (as above) — both are connected, Hausdorff, and separable. But $(0,1)$ is bounded in $\mathbb{R}$ while $\mathbb{R}$ is not. Boundedness, like completeness, depends on the metric and is therefore not a topological invariant.
[/example]
## Product Topology
Many natural spaces in mathematics are products: the plane $\mathbb{R}^2 = \mathbb{R} \times \mathbb{R}$, configuration spaces in mechanics, spaces of paths in a topological space. Given topological spaces $X$ and $Y$, we need a topology on $X \times Y$ that captures the idea of "independent variation in each coordinate."
Which topology should we choose? The product comes with two natural projection maps $\pi_X: X \times Y \to X$ and $\pi_Y: X \times Y \to Y$, and at minimum we want both to be continuous. But many topologies achieve this — the discrete topology on $X \times Y$ makes every function out of $X \times Y$ continuous, for instance. The right choice is the *coarsest* topology making $\pi_X$ and $\pi_Y$ continuous: the one with the fewest open sets that still ensures the projections are continuous. This is the standard category-theoretic move — universal constructions are characterised by being minimal (or maximal) with respect to some property — and it leads to a clean characterisation: a function *into* a product is continuous if and only if each of its coordinate functions is continuous.
[definition: Product Topology]
Let $(X, \tau_X)$ and $(Y, \tau_Y)$ be topological spaces. The **product topology** on $X \times Y$ is the topology generated by the basis $\{U \times V : U \in \tau_X, V \in \tau_Y\}$. That is, a set $W \subseteq X \times Y$ is open if and only if it is a union of sets of the form $U \times V$ with $U$ open in $X$ and $V$ open in $Y$.
[/definition]
The basis elements $U \times V$ are called **open rectangles**. Not every open set in the product topology is itself a rectangle — for instance, the open unit disc in $\mathbb{R}^2$ is open but is not of the form $U \times V$. It is, however, a union of open rectangles.
[quotetheorem:292]
The [universal property of the product topology](/theorems/292) provides the practical tool for working with products: to verify that a function $g: Z \to X \times Y$ is continuous, it suffices to check that each component $\pi_X \circ g$ and $\pi_Y \circ g$ is continuous. This reduces a problem about a higher-dimensional space to two problems in lower-dimensional spaces, and is the standard method for proving continuity of maps into products.
[/quotetheorem]
[citeproof:292]
The proof first verifies that the projections are continuous (the preimage $\pi_X^{-1}(U) = U \times Y$ is an open rectangle, hence open). The forward direction is then trivial: compositions of continuous functions are continuous. The key content is the reverse direction. Given that $\pi_X \circ g$ and $\pi_Y \circ g$ are continuous, one checks the preimage condition on basic open sets: $g^{-1}(U \times V) = (\pi_X \circ g)^{-1}(U) \cap (\pi_Y \circ g)^{-1}(V)$, which is a finite intersection of open sets and therefore open.
[/citeproof]
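For finite spaces the product topology can be generated by brute force as all unions of open rectangles, which makes it easy to exhibit an open set that is not itself a rectangle. The sketch below uses a hypothetical two-point space with open sets $\varnothing$, $\{1\}$, $\{0,1\}$ for both factors; the function name `product_topology` is an assumption.

```python
from itertools import combinations, product

def product_topology(tau_X, tau_Y):
    """Return (open rectangles, all unions of open rectangles); finite spaces only."""
    rectangles = {frozenset(product(U, V)) for U in tau_X for V in tau_Y}
    rects = list(rectangles)
    opens = set()
    for r in range(len(rects) + 1):
        for combo in combinations(rects, r):
            opens.add(frozenset().union(*combo))
    return rectangles, opens

# Hypothetical factor space: {1} is open, {0} is not.
X = frozenset({0, 1})
tau = [frozenset(), frozenset({1}), X]

rectangles, opens = product_topology(tau, tau)

# Union of the rectangles {1} x X and X x {1}: open, but not a rectangle.
L_shape = frozenset({(1, 0), (1, 1), (0, 1)})
```

The set `L_shape` is the finite counterpart of the open disc in $\mathbb{R}^2$: open in the product, yet not of the form $U \times V$.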
[example: Continuity of Addition via the Product Topology]
Consider the addition map $\alpha: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ defined by $\alpha(x, y) = x + y$, where $\mathbb{R}^2 = \mathbb{R} \times \mathbb{R}$ carries the product topology (which coincides with the standard topology on $\mathbb{R}^2$). To show $\alpha$ is continuous, we check the preimage condition: if $W \subseteq \mathbb{R}$ is open and $(a, b) \in \alpha^{-1}(W)$, then $a + b \in W$, so there exists $\varepsilon > 0$ with $(a + b - \varepsilon, a + b + \varepsilon) \subseteq W$. Then $(a - \varepsilon/2, a + \varepsilon/2) \times (b - \varepsilon/2, b + \varepsilon/2)$ is an open rectangle around $(a,b)$ contained in $\alpha^{-1}(W)$, since $|x - a| < \varepsilon/2$ and $|y - b| < \varepsilon/2$ give $|(x + y) - (a + b)| < \varepsilon$. Therefore $\alpha^{-1}(W)$ is open.
[/example]
## Quotient Topology
Products build new spaces by combining independent factors. Quotient spaces go in the opposite direction: they build new spaces by *identifying* points — declaring that certain distinct points should be treated as the same. This construction is ubiquitous in geometry and topology: the circle $S^1$ arises by identifying the endpoints of the interval $[0, 1]$; the torus $T^2$ arises by identifying opposite edges of a square; the real projective plane $\mathbb{R}P^2$ arises by identifying antipodal points of $S^2$; the Möbius strip arises by identifying one pair of opposite edges with a twist.
The topological question is: what topology should the quotient space carry? The answer is dual to the product case. For products, we chose the *coarsest* topology making the projections continuous. For quotients, we choose the *finest* topology making the quotient map continuous. This means a set in the quotient is open precisely when its preimage — the union of all the equivalence classes it represents — is open in the original space. The quotient topology is "as fine as possible" subject to the constraint that collapsing equivalence classes does not create discontinuities.
[definition: Quotient Topology]
Let $(X, \tau)$ be a topological space, let $\sim$ be an [equivalence relation](/page/Equivalence%20Relation) on $X$, and let $q: X \to X/{\sim}$ be the quotient map sending each $x$ to its equivalence class $[x]$. The **quotient topology** on $X/{\sim}$ is $\tau_{X/\sim} = \{V \subseteq X/{\sim} : q^{-1}(V) \in \tau\}$.
[/definition]
[example: The Circle as a Quotient of $\mathbb{R}$]
Define $\sim$ on $\mathbb{R}$ by $s \sim t$ if and only if $s - t \in \mathbb{Z}$. The quotient $\mathbb{R}/{\sim}$ consists of equivalence classes $[t] = \{t + n : n \in \mathbb{Z}\}$. The map $f: \mathbb{R} \to S^1$ defined by $f(t) = (\cos 2\pi t, \sin 2\pi t)$ is continuous, surjective, and respects $\sim$ (if $s \sim t$ then $f(s) = f(t)$). It induces a continuous bijection $\tilde{f}: \mathbb{R}/{\sim} \to S^1$. Moreover, $f$ maps open intervals to open arcs, so $\tilde{f}$ is open and hence a homeomorphism. This identifies $\mathbb{R}/\mathbb{Z}$ with the unit circle $S^1$.
[/example]
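The defining condition of the quotient topology, that $V$ is open exactly when $q^{-1}(V)$ is open, can be evaluated exhaustively on a finite space. In the sketch below a point of the quotient is represented by its equivalence class, so $q^{-1}(V)$ is the union of the classes in $V$. The space, the two partitions, and the function name are hypothetical.

```python
from itertools import combinations

def quotient_topology(topology, classes):
    """Sets V of equivalence classes whose union q^{-1}(V) is open in X."""
    open_X = set(topology)
    classes = [frozenset(c) for c in classes]
    opens = set()
    for r in range(len(classes) + 1):
        for V in combinations(classes, r):
            pre = frozenset().union(*V)    # q^{-1}(V)
            if pre in open_X:
                opens.add(frozenset(V))
    return opens

# Hypothetical space: X = {0, 1, 2} with a chain of open sets.
X = frozenset({0, 1, 2})
topology = [frozenset(), frozenset({0}), frozenset({0, 1}), X]

# Identify 0 ~ 2: neither class has an open preimage on its own.
glue_ends = quotient_topology(topology, [{0, 2}, {1}])

# Identify 0 ~ 1: the class {0, 1} is open, the class {2} is not.
glue_first_two = quotient_topology(topology, [{0, 1}, {2}])
```

The first identification yields the indiscrete topology on two points, a small instance of a quotient losing separation that the original space had.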
### When Quotients Fail to be Hausdorff
Quotient constructions can easily destroy the Hausdorff property, and understanding when this happens is important for avoiding pathological spaces. The issue is that collapsing points together can make it impossible to separate the resulting equivalence classes with open sets.
On $\mathbb{R}$, define $x \sim y$ if $x = y$ or if both $x, y \in \mathbb{Z}$. The quotient $\mathbb{R}/{\sim}$ collapses all integers to a single point $[0]$, while non-integers remain as singleton classes. This quotient is Hausdorff: two non-integer classes can be separated by small open intervals that avoid $\mathbb{Z}$, and the collapsed integer class can be separated from a non-integer $x$ because $\mathbb{Z}$ is closed. Indeed, with $\delta = \min_{n \in \mathbb{Z}} |x - n| > 0$, the sets $(x - \delta/2, x + \delta/2)$ and $\bigcup_{n \in \mathbb{Z}} (n - \delta/2, n + \delta/2)$ are disjoint open sets, each a union of equivalence classes. However, if we instead collapse $\mathbb{Q}$ to a single point (defining $x \sim y$ if $x = y$ or if both $x, y \in \mathbb{Q}$), the quotient is *not* Hausdorff: every non-empty open set in the quotient has an open preimage in $\mathbb{R}$, which contains a rational and hence all of $\mathbb{Q}$, so every open neighbourhood of an irrational point contains the point $[\mathbb{Q}]$.
The difference lies in how the collapsed set sits inside $\mathbb{R}$. Collapsing a closed set ($\mathbb{Z}$) preserves separation here; collapsing a dense set ($\mathbb{Q}$) destroys it. A general necessary condition is available: if the quotient $X/{\sim}$ is Hausdorff, then the equivalence relation $\sim$, viewed as a subset of $X \times X$, is closed in the product topology. The converse holds when the quotient map $q$ is open, but it can fail in general.
[problem]
Let $X = \mathbb{R}$ with the standard topology and define a topology $\tau_L$ on $\mathbb{R}$ by declaring a set $U$ to be open if for every $x \in U$ there exists $\varepsilon > 0$ such that $[x, x + \varepsilon) \subseteq U$. This is the **lower limit topology** (or **Sorgenfrey line**).
(a) Verify that $\tau_L$ is a topology on $\mathbb{R}$.
(b) Show that $\tau_L$ is strictly finer than the standard topology.
(c) Show that the Sorgenfrey line is Hausdorff.
(d) Show that the Sorgenfrey line is not metrisable by proving that the product $\mathbb{R}_L \times \mathbb{R}_L$ (the **Sorgenfrey plane**) is separable but contains a subspace that is not separable, which cannot happen in a metrisable space.
(Hint for (d): consider the anti-diagonal $D = \{(x, -x) : x \in \mathbb{R}\}$ and show it is discrete in the subspace topology from $\mathbb{R}_L \times \mathbb{R}_L$.)
[/problem]
[solution]
**Step 1 (Topology axioms).** We verify the three axioms. $\varnothing$ vacuously satisfies the condition, and $\mathbb{R}$ satisfies it because $[x, x+1) \subseteq \mathbb{R}$ for all $x$. For arbitrary unions: if each $U_i$ satisfies the condition and $x \in \bigcup_i U_i$, then $x \in U_j$ for some $j$, so $[x, x + \varepsilon) \subseteq U_j \subseteq \bigcup_i U_i$. For finite intersections: if $x \in U \cap V$ with $[x, x + \varepsilon_1) \subseteq U$ and $[x, x + \varepsilon_2) \subseteq V$, then $[x, x + \min(\varepsilon_1, \varepsilon_2)) \subseteq U \cap V$.
**Step 2 (Strictly finer).** Every standard-open set is $\tau_L$-open: if $U$ is standard-open and $x \in U$, there exists $\varepsilon > 0$ with $(x - \varepsilon, x + \varepsilon) \subseteq U$, so $[x, x + \varepsilon) \subseteq U$. Thus $\tau_{\text{std}} \subseteq \tau_L$. To see the inclusion is strict, $[0, 1)$ is $\tau_L$-open (for $x \in [0,1)$ take $\varepsilon = 1 - x$) but not standard-open (no open interval around $0$ is contained in $[0,1)$).
**Step 3 (Hausdorff).** Since $\tau_L$ is finer than the standard topology, which is Hausdorff (it is metrisable), $\tau_L$ is also Hausdorff: any disjoint standard-open sets witnessing separation are also $\tau_L$-open.
**Step 4 (Non-metrisability via the Sorgenfrey plane).** The Sorgenfrey line $\mathbb{R}_L$ is separable, with $\mathbb{Q}$ as a countable dense subset. Every non-empty $\tau_L$-open set contains a set of the form $[x, x + \varepsilon)$, and since the rationals are dense in $\mathbb{R}$ with the standard topology, the interval $(x, x + \varepsilon) \subseteq [x, x + \varepsilon)$ contains a rational.
Now consider the anti-diagonal $D = \{(x, -x) : x \in \mathbb{R}\}$ in the Sorgenfrey plane $\mathbb{R}_L \times \mathbb{R}_L$. For any $(x, -x) \in D$, the basic open set $[x, x + \varepsilon) \times [-x, -x + \varepsilon)$ meets $D$ only in the point $(x, -x)$: if $(y, -y) \in [x, x + \varepsilon) \times [-x, -x + \varepsilon)$, then $y \geq x$ and $-y \geq -x$, that is, $y \geq x$ and $y \leq x$, hence $y = x$. Therefore every singleton of $D$ is open in the subspace topology, so $D$ is discrete.
An uncountable discrete space is not separable: every subset of a discrete space is closed, so the only dense subset of $D$ is $D$ itself, which is uncountable. Note that this does not make the Sorgenfrey plane itself non-separable. The countable set $\mathbb{Q} \times \mathbb{Q}$ is dense in it, because every basic open set $[a, a + \varepsilon) \times [b, b + \varepsilon)$ contains a point with rational coordinates. Separability simply fails to pass to the subspace $D$.
Finally, if $\mathbb{R}_L$ were metrisable (say by some metric $d$), then $\mathbb{R}_L \times \mathbb{R}_L$ would also be metrisable (via $d_\infty((x_1,y_1),(x_2,y_2)) = \max(d(x_1,x_2), d(y_1,y_2))$). A separable metrisable space has the property that all its subspaces are separable. Since $\mathbb{R}_L$ is separable, $\mathbb{R}_L \times \mathbb{R}_L$ would be separable and metrisable, so all subspaces (including $D$) would be separable — contradicting the discreteness of the uncountable $D$. Therefore $\mathbb{R}_L$ is not metrisable.
[/solution]

---

Having developed the language of topological spaces (open and closed sets, continuity via preimages, subspace and product topologies), we now turn to a fundamental structural property: **connectedness**. This notion captures the intuitive idea that a space is "all in one piece," that it cannot be partitioned into two disjoint non-empty open subsets without tearing.
In real analysis, the [Intermediate Value Theorem](/theorems/180) asserts that a continuous function on a closed interval cannot "jump" over values. This is a manifestation of a deeper topological phenomenon: the interval $[a,b]$ is connected, and the continuous image of a connected space must also be connected. But what does it mean for a general topological space — one that need not be a subset of $\mathbb{R}$, or even metrisable — to be connected? And how does this property interact with the constructions we have built so far: continuity, closures, subspaces, products, quotients?
The goal of this section is to develop connectedness as a purely topological notion, recover the classical results of real analysis as special cases, and introduce the stronger notion of path-connectedness. The section closes with connected components, which provide a canonical decomposition of any topological space into its maximal connected pieces, and with an application showing how connectedness distinguishes $\mathbb{R}$ from $\mathbb{R}^n$ for $n \geq 2$.
[motivation]
### Why Connectedness Matters
Many existence results in analysis ultimately rely on the fact that certain spaces are "in one piece." The Intermediate Value Theorem guarantees that a continuous function $f: [a,b] \to \mathbb{R}$ with $f(a) < 0 < f(b)$ must have a zero, but the proof does not use any quantitative property of the real line. It uses only that $[a,b]$ cannot be split into two disjoint non-empty open parts, and that continuous functions respect this property. This is the topological content of the theorem, and it holds in far greater generality than the classical statement suggests.
### The Topological Formulation
In a metric space, one might try to define connectedness in terms of distances: a space is connected if every two points can be joined by a "path" of nearby points. But this conflates two distinct ideas: the impossibility of splitting the space into disjoint non-empty open pieces (connectedness) and the existence of continuous paths between points (path-connectedness). These agree for many familiar spaces but diverge in general, and keeping them separate is essential for the theory.
The correct definition is purely topological: a space is disconnected if it can be partitioned into two disjoint non-empty open sets, and connected otherwise. This captures exactly the property needed for the Intermediate Value Theorem and its generalisations, without assuming anything about paths, metrics, or the structure of the real line.
[/motivation]
## Definitions and First Examples
[definition: Disconnected Space]
A topological space $X$ is **disconnected** if there exist non-empty open subsets $U, V \subseteq X$ such that $U \cap V = \varnothing$ and $U \cup V = X$. In this case, the pair $(U, V)$ is called a **disconnection** of $X$.
[/definition]
[definition: Connected Space]
A topological space $X$ is **connected** if it is not disconnected. That is, $X$ is connected if and only if the only subsets of $X$ that are both open and closed are $\varnothing$ and $X$ itself.
[/definition]
The equivalence in the second definition is worth explaining. If $(U, V)$ is a disconnection, then $U$ is open and $V = X \setminus U$ is also open, so $U$ is simultaneously open and closed ("clopen"), and $U \neq \varnothing$, $U \neq X$. Conversely, if $U$ is a non-trivial clopen set, then $(U, X \setminus U)$ is a disconnection. Thus connectedness is equivalent to the statement: the only clopen subsets are the trivial ones.
The definition applies to any topological space, not just metric spaces. The empty space is vacuously connected (there are no non-empty open sets to form a disconnection), though some authors exclude it by convention. We follow the standard definition throughout.
[example: Connected and Disconnected Subsets of $\mathbb{R}$]
The interval $(0, 1)$ is connected (as we will prove shortly — it is an interval, and intervals are exactly the connected subsets of $\mathbb{R}$). In contrast, the set $X = (0, 1) \cup (2, 3)$ is disconnected: the sets $U = (0, 1)$ and $V = (2, 3)$ are both open in the subspace topology on $X$, disjoint, non-empty, and their union is $X$. The disconnection arises because the gap between $1$ and $2$ separates the two intervals.
Less obviously, the set $Y = [0, 1) \cup (1, 2]$ is also disconnected. In the subspace topology inherited from $\mathbb{R}$, both $[0, 1)$ and $(1, 2]$ are open in $Y$: we have $[0, 1) = Y \cap (-1, 1)$ and $(1, 2] = Y \cap (1, 3)$, where $(-1, 1)$ and $(1, 3)$ are open in $\mathbb{R}$. The single missing point $1$ creates a gap, even though the two "halves" are adjacent.
[/example]
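On a finite space the clopen characterisation of connectedness can be tested by listing every open set whose complement is also open. The sketch below does this for two hypothetical topologies on a three-point set; the names `clopen_sets` and `is_connected` are assumptions made for illustration.

```python
def clopen_sets(X, topology):
    """Open sets whose complement is also open."""
    open_sets = set(topology)
    return {U for U in open_sets if (X - U) in open_sets}

def is_connected(X, topology):
    """Connected iff the only clopen sets are the empty set and X."""
    return clopen_sets(X, topology) <= {frozenset(), X}

X = frozenset({0, 1, 2})

# A chain of open sets: no proper non-empty clopen set exists.
chain_topology = [frozenset(), frozenset({0}), frozenset({0, 1}), X]

# {0} and {1, 2} are complementary open sets, so they form a disconnection.
split_topology = [frozenset(), frozenset({0}), frozenset({1, 2}), X]

chain_connected = is_connected(X, chain_topology)
split_connected = is_connected(X, split_topology)
```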
## The Main Characterisation Theorem
The central result of connectedness theory provides three equivalent formulations, each useful in different contexts. The first is the definition. The second links connectedness directly to the Intermediate Value Theorem. The third provides the most practical tool for proofs.
[quotetheorem:294]
The equivalence $(1) \Leftrightarrow (2)$ shows that connectedness is precisely the property that ensures continuous real-valued functions satisfy the intermediate value property. In this light, the classical Intermediate Value Theorem for functions on $[a,b]$ is not a theorem about the real line per se, but a consequence of the topological property of intervals — their connectedness — combined with the fact that continuous images of connected spaces are connected.
The equivalence $(1) \Leftrightarrow (3)$ is the most frequently used in practice, because verifying that every continuous function to a discrete space is constant is often simpler than working directly with open partitions. Suppose $f: X \to \mathbb{Z}$ is continuous and $X$ is connected. Since $\mathbb{Z}$ carries the discrete topology, for each $n \in \mathbb{Z}$ the singleton $\{n\}$ is both open and closed. Therefore $f^{-1}(\{n\})$ is both open and closed in $X$. If $f$ is not constant, some $f^{-1}(\{n\})$ is non-empty and proper, giving a non-trivial clopen subset of $X$ — contradicting connectedness. This "labelling" argument reduces connectedness to the impossibility of assigning consistent discrete labels to the points of a connected space.
[citeproof:294]
The proof is a clean cycle of implications. The step $(1) \Rightarrow (2)$ is the topological core of the Intermediate Value Theorem: if $f(X)$ missed some value $c$ between two achieved values, the preimages $f^{-1}((-\infty, c))$ and $f^{-1}((c, \infty))$ would form a disconnection of $X$. The step $(2) \Rightarrow (3)$ uses the observation that the only intervals contained in $\mathbb{Z}$ are singletons — a discrete-valued function with interval image must be constant. The step $(3) \Rightarrow (1)$ is the contrapositive construction: a disconnection $(U, V)$ directly manufactures a non-constant continuous function to $\{0, 1\} \subseteq \mathbb{Z}$.
[example: Detecting Disconnection via Integer-Valued Functions]
Consider $X = [0, 1) \cup (1, 2]$ with the subspace topology. Define $f: X \to \mathbb{Z}$ by
\begin{align*}
f(x) = \begin{cases}
0 & \text{if } x \in [0,1), \\
1 & \text{if } x \in (1,2].
\end{cases}
\end{align*}
This function is continuous: the preimage of any open subset of $\mathbb{Z}$ is either $\varnothing$, $[0,1)$, $(1,2]$, or $X$, all of which are open in the subspace topology (as verified above). Since $f$ is continuous and non-constant, the equivalence $(1) \Leftrightarrow (3)$ immediately implies $X$ is disconnected. Conversely, if $X$ were connected, every such function would be forced to be constant — the topology would not permit a consistent assignment of two different integer labels.
[/example]
## Connected Subsets of $\mathbb{R}$
The abstract definition of connectedness is vindicated by the following classical result, which identifies the connected subsets of the real line with the familiar notion of an interval.
[quotetheorem:295]
This result justifies the intuition that "unbroken" subsets of $\mathbb{R}$ are precisely the intervals. The converse direction — showing that intervals are connected — is where the completeness of $\mathbb{R}$ (specifically, the least upper bound property) enters, and it deserves careful attention.
[citeproof:295]
The "connected implies interval" direction is a simple contrapositive: a gap in $X$ (a point $c \notin X$ between two points of $X$) immediately manufactures a disconnection via $X \cap (-\infty, c)$ and $X \cap (c, \infty)$. The "interval implies connected" direction is more substantial and relies on the supremum. The argument takes $c = \sup(U \cap [a,b])$ and shows that $c$ can belong to neither $U$ nor $V$ without contradiction — the openness of each set provides room to push past $c$, violating either the upper bound property or the supremum. This is the same completeness argument that underpins the classical Intermediate Value Theorem, now appearing in its natural topological generality.
The characterisation explains why connectedness is so useful in analysis: it provides a single abstract property that captures the "no gaps" condition of intervals, and this property is preserved under continuous maps.
## Preservation Under Continuous Maps
Connectedness, like compactness and the Hausdorff property, is a topological invariant: it is preserved by homeomorphisms. But it enjoys a stronger property: the image of a connected space under *any* continuous map is connected, whether or not the map is a homeomorphism. This makes it far more useful than invariance alone would suggest.
[quotetheorem:296]
[citeproof:296]
The proof is a clean contradiction via preimages. If $f(X)$ admitted a disconnection $(A, B)$, the preimages $f^{-1}(A)$ and $f^{-1}(B)$ would be open (by continuity), non-empty (since $A, B \subseteq f(X)$), disjoint, and cover $X$ — a disconnection of $X$. The argument uses only the preimage characterisation of continuity, so it works in complete generality without any metric structure.
The result has several immediate consequences. First, it gives an alternative proof of the Intermediate Value Theorem: if $f: [a,b] \to \mathbb{R}$ is continuous, then $f([a,b])$ is connected (as the continuous image of a connected set), hence an interval by the [characterisation of connected subsets of $\mathbb{R}$](/theorems/295), so $f$ takes all values between $f(a)$ and $f(b)$. Second, it implies that quotients of connected spaces are connected, since quotient maps are continuous and surjective. Third, it provides the standard method for proving connectedness by exhibiting a continuous surjection from a known connected space.
[example: $S^1$ is Connected]
The unit circle $S^1 = \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 = 1\}$ is connected. To see this, consider the continuous surjection $f: [0, 2\pi] \to S^1$ defined by $f(t) = (\cos t, \sin t)$. The interval $[0, 2\pi]$ is connected (it is an interval in $\mathbb{R}$), so $f([0, 2\pi]) = S^1$ is connected by the theorem above. This argument generalises: any space that is the continuous image of an interval — any "continuous curve" — is connected.
[/example]
## Closure, Unions, and Products
Connectedness interacts well with the standard topological operations. The results in this subsection show that connectedness is "robust" — it is not easily destroyed by closures, overlapping unions, or products.
### Closure Preserves Connectedness
[quotetheorem:297]
[citeproof:297]
The proof is a "density" argument. If $Z$ (with $Y \subseteq Z \subseteq \overline{Y}$) were disconnected by $(U, V)$, then the restrictions $(Y \cap U, Y \cap V)$ would partition $Y$. Since $Y$ is connected, one of these — say $Y \cap V$ — is empty, forcing $Y \subseteq U$. But then any point $z \in V$ lies in $\overline{Y}$ (since $Z \subseteq \overline{Y}$), so every open neighbourhood of $z$ meets $Y$. In particular, the open set underlying $V$ must meet $Y$, contradicting $Y \cap V = \varnothing$. The essential point is that the density of $Y$ in $\overline{Y}$ prevents any disconnection of an intermediate set from "avoiding" $Y$ on one side.
An important special case: if $A$ is a connected subset of $\mathbb{R}^n$ and $B$ is obtained from $A$ by adding some (or all) of its limit points, then $B$ is connected. This is why the closure of a connected open set in $\mathbb{R}^n$ is connected — a fact used frequently in PDE theory, where one works with open domains and their closures.
### Unions of Overlapping Connected Sets
[quotetheorem:298]
[citeproof:298]
The proof leverages the characterisation via $\mathbb{Z}$-valued functions. A continuous $f: S \to \mathbb{Z}$ must be constant on each connected member $A \in \mathcal{A}$. Since every two members intersect, they all share the same constant value, so $f$ is constant on the entire union. This argument is notably cleaner than working directly with disconnections — the "labelling" characterisation absorbs the combinatorics of overlapping sets into a single step.
The pairwise intersection condition is essential: two disjoint connected sets can have a disconnected union (e.g. $(0,1) \cup (2,3) \subseteq \mathbb{R}$). In fact, the condition can be weakened: it suffices that the collection is "chain-linked," meaning any two members can be connected by a finite chain of pairwise-intersecting members. This stronger version follows by applying the pairwise result iteratively.
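As a hedged illustration of the chain-linked condition (the family $A_n$ below is a hypothetical choice, not taken from the theorem), consider the unit intervals $A_n = [n, n+1]$ for $n \in \mathbb{Z}$. They do not pairwise intersect, yet any two are joined by a finite chain:
\begin{align*}
A_n \cap A_{n+1} &= \{n+1\} \neq \varnothing, \\
A_m \cap A_n &= \varnothing \quad \text{whenever } |m - n| \geq 2, \\
\bigcup_{n \in \mathbb{Z}} A_n &= \mathbb{R}.
\end{align*}
For $m < n$ the chain $A_m, A_{m+1}, \ldots, A_n$ links $A_m$ to $A_n$, with consecutive members sharing an endpoint, so the weakened union result recovers the connectedness of $\mathbb{R}$ from that of closed bounded intervals.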
### Products of Connected Spaces
[quotetheorem:299]
[citeproof:299]
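The proof builds the product from "crosses." Fix a base point $(a, b) \in X \times Y$. For each $y \in Y$, the set $S_y = (X \times \{b\}) \cup (\{a\} \times Y) \cup (X \times \{y\})$ is a union of three connected sets in which each horizontal slice meets the vertical slice $\{a\} \times Y$: the slice $X \times \{b\}$ at $(a, b)$ and the slice $X \times \{y\}$ at $(a, y)$. The two horizontal slices are disjoint when $y \neq b$, so the three sets do not all pairwise intersect. Instead, apply the union theorem twice: first to $(X \times \{b\}) \cup (\{a\} \times Y)$, then to the union of that connected set with $X \times \{y\}$. This shows that $S_y$ is connected. The full product $X \times Y = \bigcup_{y \in Y} S_y$ is then a union of connected sets all passing through the common point $(a, b)$, so the union theorem applies again. The staged argument avoids working with open sets in the product topology directly; instead, the proof reduces entirely to the union result.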
An immediate corollary is that $\mathbb{R}^n$ is connected for every $n \geq 1$, since $\mathbb{R}$ is connected and the product of connected spaces is connected. By induction, finite products of connected spaces are connected; the result extends to arbitrary products by a more delicate argument (not required here).
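A minimal sketch of the cross construction in the simplest case, assuming $X = Y = \mathbb{R}$ and the base point $(a, b) = (0, 0)$ (both choices are made here only for illustration):
\begin{align*}
S_y &= (\mathbb{R} \times \{0\}) \cup (\{0\} \times \mathbb{R}) \cup (\mathbb{R} \times \{y\}), \\
(\mathbb{R} \times \{0\}) \cap (\{0\} \times \mathbb{R}) &= \{(0, 0)\}, \\
(\mathbb{R} \times \{y\}) \cap (\{0\} \times \mathbb{R}) &= \{(0, y)\}.
\end{align*}
Each $S_y$ consists of the two coordinate axes together with the horizontal line at height $y$. Every $S_y$ contains $(0, 0)$, and every point $(x, y) \in \mathbb{R}^2$ lies in $S_y$, so $\mathbb{R}^2 = \bigcup_{y \in \mathbb{R}} S_y$ is a union of connected sets through a common point.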
## Path-Connectedness
Connectedness captures the idea that a space cannot be split into open pieces, but it does not address a more geometric question: can any two points be joined by a continuous path? The two notions are related but distinct, and understanding their relationship is essential for applications.
[definition: Path]
Let $X$ be a topological space and let $x, y \in X$. A **path** from $x$ to $y$ in $X$ is a continuous function $\gamma: [0,1] \to X$ with $\gamma(0) = x$ and $\gamma(1) = y$.
[/definition]
[definition: Path-Connected Space]
A topological space $X$ is **path-connected** if for every pair of points $x, y \in X$, there exists a path from $x$ to $y$.
[/definition]
Path-connectedness is a stronger condition than connectedness, and the following theorem makes this precise.
[quotetheorem:300]
[citeproof:300]
The forward direction uses the $\mathbb{Z}$-valued characterisation: if $f: X \to \mathbb{Z}$ is continuous and $\gamma$ is a path from $x_0$ to any $x$, the composition $f \circ \gamma: [0,1] \to \mathbb{Z}$ is continuous on a connected domain, hence constant — forcing $f(x) = f(x_0)$. Since $x$ was arbitrary, $f$ is constant, so $X$ is connected. The counterexample (the topologist's sine curve) shows the converse fails: a space can be "in one piece" in the open-set sense while being impossible to traverse by a continuous path.
[example: The Topologist's Sine Curve]
Define the subset of $\mathbb{R}^2$:
\begin{align*}
X = \left\{ \left(t, \sin \frac{1}{t}\right) : t \in (0, 1] \right\} \cup \left(\{0\} \times [-1, 1]\right).
\end{align*}
The set $X$ consists of the graph of $t \mapsto \sin(1/t)$ for $t > 0$, together with the vertical segment $\{0\} \times [-1, 1]$ that the graph "accumulates onto" as $t \to 0^+$.
**$X$ is connected.** The graph $\Gamma = \{(t, \sin(1/t)) : t \in (0, 1]\}$ is the continuous image of the connected interval $(0, 1]$, hence connected. The segment $S = \{0\} \times [-1, 1]$ lies in $\overline{\Gamma}$ (every point $(0, y)$ with $|y| \leq 1$ is a limit of points on the graph — the oscillations of $\sin(1/t)$ sweep through all values in $[-1, 1]$ infinitely often as $t \to 0^+$). Therefore $X \subseteq \overline{\Gamma}$, and since $\Gamma \subseteq X$, we have $\Gamma \subseteq X \subseteq \overline{\Gamma}$. By the [closure preservation theorem](/theorems/297), $X$ is connected.
**$X$ is not path-connected.** Suppose for contradiction that $\gamma: [0, 1] \to X$ is a path with $\gamma(0) = (0, 0)$ and $\gamma(1) = (1, \sin 1)$. Write $\gamma(s) = (\gamma_1(s), \gamma_2(s))$. Let $s_0 = \sup\{s \in [0, 1] : \gamma_1(s) = 0\}$. By continuity of $\gamma_1$ and the fact that $\{0\}$ is closed, $\gamma_1(s_0) = 0$, and $s_0 < 1$ because $\gamma_1(1) = 1$. For $s > s_0$, we have $\gamma_1(s) > 0$ (by definition of $s_0$), so $\gamma_2(s) = \sin(1/\gamma_1(s))$. As $s \to s_0^+$, $\gamma_1(s) \to 0^+$, and by the Intermediate Value Theorem $\gamma_1$ takes every value in $(0, \gamma_1(s)]$ on $(s_0, s]$. Hence every interval $(s_0, s_0 + \delta)$ contains points where $\gamma_2 = 1$ and points where $\gamma_2 = -1$, so $\gamma_2(s)$ has no limit as $s \to s_0^+$. But $\gamma_2$ is continuous, so $\gamma_2(s) \to \gamma_2(s_0)$, a contradiction.
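The two facts about $\sin(1/t)$ used in this example can be checked with explicit sequences. The sequences below are one convenient choice, offered as a sketch and not necessarily the argument of the cited proof. For a target height $y \in [-1, 1]$ and $n \geq 1$, set
\begin{align*}
t_n(y) &= \frac{1}{\arcsin y + 2\pi n}, \\
\sin\frac{1}{t_n(y)} &= \sin(\arcsin y + 2\pi n) = y, \\
0 < t_n(y) &\leq \frac{1}{2\pi n - \pi/2} \to 0 \quad \text{as } n \to \infty.
\end{align*}
Since $2\pi n - \pi/2 > 1$ for $n \geq 1$, each $t_n(y)$ lies in $(0, 1]$, so $(t_n(y), y) \in \Gamma$ and these points converge to $(0, y)$. This gives $S \subseteq \overline{\Gamma}$. Taking $y = 1$ and $y = -1$ produces points arbitrarily close to $t = 0$ at which $\sin(1/t)$ equals $1$ and $-1$ respectively, which is the oscillation that rules out a limit as $t \to 0^+$.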
[/example]
### When Connectedness and Path-Connectedness Agree
Despite the failure of the converse in general, the two notions coincide for the most important class of spaces in analysis: open subsets of Euclidean space.
[quotetheorem:301]
[citeproof:301]
The proof is a paradigmatic "open-and-closed" argument — one of the most elegant proof patterns in topology. The set $P$ of points path-connected to a fixed $x_0$ is open because $U$ is open: any point $y \in P$ has a ball $B_r(y) \subseteq U$, and every point of $B_r(y)$ can be joined to $y$ by a line segment and thence to $x_0$ by concatenation. The complement $U \setminus P$ is open by the same reasoning: if $z \in U \setminus P$ and $y \in B_r(z)$ belonged to $P$, then $z$ could reach $x_0$ via $y$, contradicting $z \notin P$. Since $U$ is connected and $P$ is non-empty and clopen, $P = U$.
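The two path constructions used in this argument can be written out explicitly. The symbols $\sigma$ and $\gamma * \sigma$ are notation introduced here for illustration; the cited proof may organise the step differently. If $\gamma$ is a path in $U$ from $x_0$ to $y$ and $y' \in B_r(y) \subseteq U$, then
\begin{align*}
\sigma(t) &= (1 - t)\, y + t\, y', \qquad \|\sigma(t) - y\| = t\, \|y' - y\| < r \quad \text{for } t \in [0, 1], \\
(\gamma * \sigma)(s) &= \begin{cases}
\gamma(2s) & \text{if } 0 \leq s \leq 1/2, \\
\sigma(2s - 1) & \text{if } 1/2 \leq s \leq 1.
\end{cases}
\end{align*}
The segment $\sigma$ stays inside the ball, hence inside $U$. The two branches agree at $s = 1/2$ (both equal $y$), so $\gamma * \sigma$ is a continuous path in $U$ from $x_0$ to $y'$.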
The openness hypothesis is essential. The topologist's sine curve is a connected subset of $\mathbb{R}^2$ that is not open (no open ball centred at a point of $X$ is contained in $X$) and not path-connected. For open subsets, the availability of open balls provides the local path-connectedness that makes the global argument work.
## Connected Components
Even when a space is disconnected, it can be decomposed canonically into maximal connected pieces. This decomposition is the analogue of writing a set of integers as a disjoint union of consecutive blocks, or decomposing a graph into its connected components.
[definition: Connected Component]
For a topological space $X$ and a point $x \in X$, the **connected component** of $x$, denoted $C_x$, is the union of all connected subsets of $X$ that contain $x$.
[/definition]
The definition makes sense because the union of connected sets with a common point is connected (by the [union theorem](/theorems/298)). Thus $C_x$ is itself connected — it is the largest connected subset of $X$ containing $x$.
[quotetheorem:302]
[citeproof:302]
The proof assembles the five properties from earlier results. Non-emptiness is trivial ($\{x\} \subseteq C_x$). Connectedness follows from the union theorem applied to the collection of all connected subsets containing $x$. Maximality is immediate from the definition. The partition property reduces to showing that overlapping components coincide: if $C_x \cap C_y \neq \varnothing$, then $C_x \cup C_y$ is connected (union of two overlapping connected sets), so by maximality of each, $C_x \cup C_y \subseteq C_x$ and $C_x \cup C_y \subseteq C_y$, which forces $C_x = C_y$. The closedness of $C_x$ is perhaps the most interesting step: it invokes the [closure theorem](/theorems/297). The closure $\overline{C_x}$ is connected and contains $x$, so by maximality $\overline{C_x} \subseteq C_x$, giving $C_x = \overline{C_x}$.
An important subtlety: connected components need not be open. In $\mathbb{Q}$ with the subspace topology from $\mathbb{R}$, every connected component is a singleton $\{q\}$ (since $\mathbb{Q}$ contains no non-degenerate intervals), but singletons are not open in $\mathbb{Q}$. Components are open in two common situations. If a space has only finitely many connected components, each component is the complement of a finite union of closed sets and is therefore open. If a space is **locally connected**, meaning every point has a neighbourhood base of connected open sets, then every component is open as well. Open subsets of $\mathbb{R}^n$ are locally connected, but the general case is more subtle.
[example: Connected Components of $\mathbb{Q}$]
Consider $\mathbb{Q}$ with the subspace topology inherited from $\mathbb{R}$. For any two distinct rationals $p < q$, the irrational number $\alpha = p + (q - p)/\sqrt{2}$ satisfies $p < \alpha < q$ and $\alpha \notin \mathbb{Q}$. Then $\mathbb{Q} \cap (-\infty, \alpha)$ and $\mathbb{Q} \cap (\alpha, \infty)$ are both open in $\mathbb{Q}$, disjoint, and their union contains both $p$ and $q$. No subset of $\mathbb{Q}$ containing both $p$ and $q$ can be connected, since it can always be split by an irrational between them. Therefore the connected component of each $q \in \mathbb{Q}$ is the singleton $\{q\}$, and $\mathbb{Q}$ has countably many connected components (one per rational). Such a space is called **totally disconnected**.
[/example]
## Application: Distinguishing $\mathbb{R}$ from $\mathbb{R}^n$
Connectedness provides a clean method for distinguishing spaces that might otherwise seem difficult to tell apart. The following result demonstrates the power of topological invariants: it shows that $\mathbb{R}$ and $\mathbb{R}^n$ are not homeomorphic for $n \geq 2$, a fact that is intuitively obvious but requires a rigorous topological argument.
[theorem: $\mathbb{R}$ is Not Homeomorphic to $\mathbb{R}^n$ for $n \geq 2$]
For $n \geq 2$, $\mathbb{R}$ and $\mathbb{R}^n$ are not homeomorphic.
[/theorem]
The proof uses a "point-removal" argument. Remove a single point from each space and examine the topological consequence:
- Removing a point from $\mathbb{R}$: the space $\mathbb{R} \setminus \{0\} = (-\infty, 0) \cup (0, \infty)$ is disconnected (it has two connected components, both open intervals).
- Removing a point from $\mathbb{R}^n$ for $n \geq 2$: the space $\mathbb{R}^n \setminus \{0\}$ is path-connected. Indeed, for any two points $x, y \in \mathbb{R}^n \setminus \{0\}$, either the line segment from $x$ to $y$ avoids the origin (and provides a path), or one can detour via any third point not on the line through $x$, $y$, and $0$ — in dimension $n \geq 2$, such a point always exists. By the [theorem that path-connected implies connected](/theorems/300), $\mathbb{R}^n \setminus \{0\}$ is connected.
Now suppose for contradiction that $\phi: \mathbb{R} \to \mathbb{R}^n$ is a homeomorphism. Let $p = \phi(0)$. Then $\phi$ restricts to a homeomorphism $\mathbb{R} \setminus \{0\} \to \mathbb{R}^n \setminus \{p\}$. Since connectedness is a topological invariant and homeomorphisms preserve it, $\mathbb{R} \setminus \{0\}$ would have to be connected. But it is disconnected — contradiction.
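A concrete instance of the detour in the plane, with the points $x$, $y$, $z$ chosen here purely for illustration: take $x = (-1, 0)$ and $y = (1, 0)$, whose connecting segment passes through the origin, and detour via $z = (0, 1)$. Joining the segment from $x$ to $z$ with the segment from $z$ to $y$ gives
\begin{align*}
\gamma(s) = \begin{cases}
(2s - 1,\ 2s) & \text{if } 0 \leq s \leq 1/2, \\
(2s - 1,\ 2 - 2s) & \text{if } 1/2 \leq s \leq 1.
\end{cases}
\end{align*}
Both branches equal $(0, 1)$ at $s = 1/2$, so $\gamma$ is continuous, with $\gamma(0) = x$ and $\gamma(1) = y$. The second coordinate is strictly positive for $0 < s < 1$, and at $s = 0$ and $s = 1$ the first coordinate is $-1$ and $1$, so $\gamma$ never passes through the origin.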
This argument generalises. A homeomorphism $\phi: X \to Y$ restricts to a homeomorphism $X \setminus A \to Y \setminus \phi(A)$ for every subset $A \subseteq X$, so $X$ and $Y$ cannot be homeomorphic if removing some point from $X$ disconnects it while removing any single point from $Y$ leaves a connected space. Arguments of this kind, built on point-set invariants such as connectedness, form the beginning of algebraic topology, where more sophisticated invariants (fundamental [groups](/page/Group), homology) are used to make finer distinctions.
[problem]
Let $f: S^1 \to \mathbb{R}$ be a continuous function on the unit circle. Prove that there exist antipodal points at which $f$ takes the same value — that is, there exists $x \in S^1$ such that $f(x) = f(-x)$.
(This is the one-dimensional **Borsuk-Ulam theorem**.)
[/problem]
[solution]
**Step 1: Reduce to a function on $[0, \pi]$.** Parametrise $S^1$ by the angle $\theta \in [0, 2\pi)$, so that a point on $S^1$ corresponds to $(\cos\theta, \sin\theta)$ and its antipodal point corresponds to $(\cos(\theta + \pi), \sin(\theta + \pi))$. Define the auxiliary function $g: [0, \pi] \to \mathbb{R}$ by
\begin{align*}
g(\theta) = f(\cos\theta, \sin\theta) - f(\cos(\theta + \pi), \sin(\theta + \pi)).
\end{align*}
The function $g$ is continuous (as the composition and difference of continuous functions) and measures the "imbalance" between $f$ at a point and at its antipode.
**Step 2: Evaluate at the endpoints.** At $\theta = 0$:
\begin{align*}
g(0) = f(1, 0) - f(-1, 0).
\end{align*}
At $\theta = \pi$:
\begin{align*}
g(\pi) = f(-1, 0) - f(1, 0) = -g(0).
\end{align*}
Therefore $g(0)$ and $g(\pi)$ have opposite signs (unless $g(0) = 0$, in which case we are done immediately).
**Step 3: Apply the Intermediate Value Theorem.** If $g(0) > 0$, then $g(\pi) = -g(0) < 0$. Since $g$ is continuous on the interval $[0, \pi]$ and takes a positive value at one endpoint and a negative value at the other, the [Intermediate Value Theorem](/theorems/180) guarantees the existence of $\theta_0 \in (0, \pi)$ with $g(\theta_0) = 0$. If $g(0) < 0$, the same argument applies with the signs reversed.
**Step 4: Conclude.** The equation $g(\theta_0) = 0$ means $f(\cos\theta_0, \sin\theta_0) = f(\cos(\theta_0 + \pi), \sin(\theta_0 + \pi))$. Setting $x = (\cos\theta_0, \sin\theta_0) \in S^1$, we have $-x = (\cos(\theta_0 + \pi), \sin(\theta_0 + \pi))$, so $f(x) = f(-x)$.
The connectedness of the interval $[0, \pi]$ (via the Intermediate Value Theorem) is the essential ingredient. The result extends to higher dimensions — the Borsuk-Ulam theorem in dimension $n$ states that every continuous function $f: S^n \to \mathbb{R}^n$ must identify a pair of antipodal points — but the proof requires more sophisticated tools (the fundamental group or homology theory).
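As a quick check of the construction on a specific function, assume $f(x, y) = x + y$ (a hypothetical choice made only for illustration). Using $\cos(\theta + \pi) = -\cos\theta$ and $\sin(\theta + \pi) = -\sin\theta$:
\begin{align*}
g(\theta) &= (\cos\theta + \sin\theta) - (-\cos\theta - \sin\theta) = 2(\cos\theta + \sin\theta), \\
g(0) &= 2, \qquad g(\pi) = -2, \\
g(\theta_0) &= 0 \iff \tan\theta_0 = -1 \iff \theta_0 = \tfrac{3\pi}{4} \quad \text{for } \theta_0 \in (0, \pi).
\end{align*}
The corresponding point is $x = (-\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2})$, and indeed $f(x) = 0 = f(-x)$.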
[/solution]

---

Having developed connectedness, the property that a space is "in one piece," we now turn to the second great structural property of topological spaces: **compactness**. Where connectedness prevents a space from being split into disjoint open pieces, compactness prevents a space from being "infinitely spread out." Compactness is, in many ways, the topological substitute for finiteness: a compact space may have infinitely many points, but it behaves, in a precise sense, as if any open-set question about it can be answered using only finitely many open sets.
The significance of compactness in analysis can hardly be overstated. It is the property that guarantees continuous functions attain their maximum and minimum values (the Extreme Value Theorem). It is the property that guarantees bounded sequences have convergent subsequences (the Bolzano–Weierstrass theorem). It is the property that makes the direct method of the [calculus of variations](/page/Calculus%20of%20Variations) work — extracting convergent minimising sequences — and that underpins the Rellich–Kondrachov compactness theorem in [Sobolev space](/page/Sobolev%20Space) theory. Every time an analyst says "by compactness, we may pass to a subsequence," the argument rests on the ideas developed here.
[motivation]
### Why Open Covers?
The definition of compactness — every open cover admits a finite subcover — can seem unmotivated on first encounter. Why should covering a space by open sets, and extracting finite subcovers, capture anything as concrete as "bounded sequences have convergent subsequences"?
The intuition is this. An open cover of a space $X$ represents a collection of "local observations" — each open set $U_i$ provides information about the points it contains. Compactness says that no matter how we choose these local observations, finitely many of them already capture everything. There is no way to arrange infinitely many open sets so that each one contributes genuinely new information that the others miss. In a non-compact space like $\mathbb{R}$, the cover $\{(-n, n) : n \in \mathbb{N}\}$ demonstrates the failure: each interval adds new points near the boundary, and no finite subcollection reaches all of $\mathbb{R}$.
### From Finite to Compact
In a finite topological space, every open cover is automatically finite, so every finite space is compact. Compactness extends this finiteness to infinite spaces by requiring only that open covers can be *reduced* to finite ones, not that they start finite. The closed interval $[0, 1]$ is compact despite being uncountable: any open cover, no matter how inefficiently chosen, can be trimmed to finitely many sets. The open interval $(0, 1)$ is not compact: the cover $\{(1/n, 1) : n \geq 2\}$ has no finite subcover, because any finite subcollection has union $(1/N, 1)$, where $N$ is the largest index used, and this leaves the points of $(0, 1/N]$ uncovered.
### What Compactness Buys
The payoff of compactness is the ability to pass from "local" to "global." If a property holds on each set of an open cover (e.g., a function is bounded on each set), compactness lets us conclude the property holds globally (the function is bounded on the whole space), because finitely many sets suffice. This local-to-global principle is the unifying theme behind the Extreme Value Theorem, the Heine–Borel theorem, and the equivalence of open-cover and sequential compactness in metric spaces.
[/motivation]
## Core Definitions
[definition: Open Cover]
Let $X$ be a topological space. An **open cover** of $X$ is a collection $\mathcal{U}$ of open subsets of $X$ such that $\bigcup_{U \in \mathcal{U}} U = X$.
[/definition]
[definition: Finite Subcover]
A **subcover** of an open cover $\mathcal{U}$ is a subcollection $\mathcal{V} \subseteq \mathcal{U}$ that is itself an open cover of $X$. If $\mathcal{V}$ is finite, it is a **finite subcover**.
[/definition]
[definition: Compact Space]
A topological space $X$ is **compact** if every open cover of $X$ admits a finite subcover.
[/definition]
The definition makes no reference to metrics, sequences, or boundedness — compactness is a purely topological property. It does not require $X$ to be Hausdorff, metrisable, or even $T_1$. However, the interplay between compactness and separation axioms is crucial: in Hausdorff spaces, compact subsets are automatically closed, while in non-Hausdorff spaces this can fail.
A subset $K$ of a topological space $X$ is called compact if $K$ is compact in the subspace topology inherited from $X$. Equivalently, $K$ is compact if every collection of open sets in $X$ whose union contains $K$ has a finite subcollection whose union still contains $K$.
[example: $\mathbb{R}$ is Not Compact]
The collection $\{(-n, n) : n \in \mathbb{N}\}$ is an open cover of $\mathbb{R}$. For any finite subcollection $\{(-n_1, n_1), \ldots, (-n_k, n_k)\}$, the union is $(-N, N)$ where $N = \max(n_1, \ldots, n_k)$, which misses all points with $|x| \geq N$. Therefore $\mathbb{R}$ is not compact. The same argument shows $\mathbb{R}^n$ is not compact for any $n \geq 1$.
[/example]
[example: Finite Spaces are Compact]
Any finite topological space is compact: every open cover is a finite collection to begin with (there are only finitely many subsets), so it is its own finite subcover.
[/example]
[example: $(0, 1)$ is Not Compact]
The open interval $(0, 1)$ with the subspace topology from $\mathbb{R}$ is not compact. The collection $\{(1/n, 1) : n \geq 2\}$ is an open cover (for any $x \in (0,1)$, choose $n > 1/x$ to get $x \in (1/n, 1)$). But any finite subcollection $\{(1/n_1, 1), \ldots, (1/n_k, 1)\}$ has union $(1/N, 1)$ where $N = \max(n_1, \ldots, n_k)$, which misses points in $(0, 1/N]$. The "leak" at the boundary point $0$ — which is not in $(0,1)$ — is what prevents compactness.
[/example]
## The Extreme Value Theorem
The first major payoff of compactness is the generalisation of the familiar calculus result that a continuous function on a closed bounded interval attains its maximum and minimum. The topological version identifies compactness as the essential hypothesis.
[quotetheorem:304]
[citeproof:304]
The proof has two stages, both exploiting the local-to-global principle. For boundedness, the cover $\{f^{-1}((-n,n))\}$ expresses the (local) fact that each point maps to a finite value; compactness collapses this to a single global bound $N$. For attainment, the proof is a contradiction argument using nested open sets: if the supremum $M$ were not attained, the sets $U_n = f^{-1}((-\infty, M - 1/n))$ would form an open cover, and compactness would force $f(x) < M - 1/N$ for all $x$ — contradicting $M = \sup f(X)$. The nestedness of the $U_n$ is crucial: it means the finite subcover collapses to a single set, giving the sharpest possible bound.
The theorem fails without compactness: on the non-compact space $(0, 1)$, the continuous function $f(x) = 1/x$ is unbounded, and $g(x) = x$ has supremum $1$ which is not attained. Both failures trace back to the "leak" at the missing endpoints.
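To see exactly where the attainment argument breaks down, it can help to write out the sets $U_n$ from the proof sketch for $g(x) = x$ on $(0, 1)$, where $M = \sup g = 1$ is not attained. This is an illustrative computation under those assumptions, with $n \geq 2$ so that the sets are non-empty:
\begin{align*}
U_n &= g^{-1}\big((-\infty, 1 - \tfrac{1}{n})\big) = (0, 1 - \tfrac{1}{n}), \\
\bigcup_{n \geq 2} U_n &= (0, 1), \\
U_{n_1} \cup \cdots \cup U_{n_k} &= (0, 1 - \tfrac{1}{N}) \subsetneq (0, 1), \qquad N = \max(n_1, \ldots, n_k).
\end{align*}
The $U_n$ do form an open cover, as in the proof, but no finite subfamily covers $(0, 1)$. On a compact space a finite subcover would exist and would yield the contradiction $g < M - 1/N$ everywhere; here that step is unavailable.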
## Continuous Images
Compactness is preserved by continuous maps — a fundamental fact that parallels the analogous result for connectedness.
[quotetheorem:305]
[citeproof:305]
The proof is a direct "pullback-and-push-forward" argument: an open cover of $f(X)$ is pulled back via $f^{-1}$ to an open cover of $X$, compactness of $X$ extracts a finite subcover, and the corresponding sets in $Y$ form a finite subcover of $f(X)$. The argument uses only the preimage characterisation of continuity and the definition of compactness — no metric or Hausdorff hypothesis is needed.
This result has three immediate and important consequences. First, combined with the Heine–Borel theorem, it gives a second proof of the Extreme Value Theorem: *a continuous real-valued function on a non-empty compact space is bounded and attains its bounds*, since $f(X)$ is a compact subset of $\mathbb{R}$, hence closed and bounded, and a non-empty closed bounded subset of $\mathbb{R}$ contains its supremum and infimum. Second, quotients of compact spaces are compact, since quotient maps are continuous and surjective. Third, compactness is a topological invariant: if $X$ and $Y$ are homeomorphic and $X$ is compact, then $Y$ is compact.
[example: Compactness of $S^1$]
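The unit circle $S^1$ is compact. The continuous surjection $f: [0, 2\pi] \to S^1$ given by $f(t) = (\cos t, \sin t)$ maps the compact space $[0, 2\pi]$ (a closed bounded interval in $\mathbb{R}$) onto $S^1$. By the theorem, $f([0, 2\pi]) = S^1$ is compact. Similarly, the $n$-sphere $S^n$ is compact, as the continuous image of the boundary of the cube $[-1, 1]^{n+1}$ under the normalisation map $x \mapsto x/\|x\|$. The boundary is used instead of the whole cube because normalisation is undefined at the origin; the boundary is compact as a closed subset of a compact cube, by results established below.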
[/example]
## Subspaces and Separation
The interaction between compactness, closedness, and the Hausdorff property is one of the most important relationships in topology. The following theorem captures this interaction precisely.
[quotetheorem:307]
[citeproof:307]
Part (1) uses a simple but elegant trick: augment the open cover of $Y$ with the single set $X \setminus Y$ (which is open since $Y$ is closed) to obtain an open cover of $X$. Compactness of $X$ extracts a finite subcover, and discarding $X \setminus Y$ from this subcover gives a finite subcover of $Y$. The closedness of $Y$ is essential — it is what makes $X \setminus Y$ open and hence a valid augmenting set.
Part (2) is more substantial and uses both compactness and the Hausdorff property in a coordinated way. To show $X \setminus Y$ is open, one takes a point $x \notin Y$ and uses the Hausdorff axiom to separate $x$ from each point $y \in Y$ by disjoint open sets. This produces an open cover of $Y$ (by the sets surrounding the points of $Y$), and compactness of $Y$ reduces it to a finite subcover. The finite intersection of the corresponding open sets around $x$ then provides a single open neighbourhood of $x$ disjoint from $Y$.
The Hausdorff condition in part (2) cannot be dropped: in the cofinite topology on an infinite set, every subset is compact (any open cover reduces to finitely many sets because complements are finite), but not every subset is closed.
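The compactness claim for the cofinite topology can be spelled out as follows. This is a sketch; the names $A$, $\mathcal{U}$, $U_0$, and $a_i$ are introduced here for illustration. Let $X$ be infinite with the cofinite topology, let $A \subseteq X$ be non-empty, and let $\mathcal{U}$ be a collection of open sets covering $A$. Pick any non-empty $U_0 \in \mathcal{U}$. Then
\begin{align*}
A \setminus U_0 &\subseteq X \setminus U_0 = \{a_1, \ldots, a_k\} \quad \text{(finite, since } U_0 \text{ is open and non-empty)}, \\
a_i &\in U_i \text{ for some } U_i \in \mathcal{U} \quad \text{whenever } a_i \in A, \\
A &\subseteq U_0 \cup U_1 \cup \cdots \cup U_k.
\end{align*}
So $A$ is compact. On the other hand, the closed sets in this topology are exactly the finite sets and $X$ itself, so any infinite proper subset of $X$ is compact but not closed.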
[example: Compact Subsets of $\mathbb{R}$ that are Closed]
The Cantor set $C \subseteq [0,1]$ is compact: it is a closed subset of the compact space $[0,1]$, hence compact by part (1). It is also closed in $\mathbb{R}$, consistent with part (2) (since $\mathbb{R}$ is Hausdorff). On the other hand, the set $\{1/n : n \in \mathbb{N}\}$ is bounded but not closed (the limit point $0$ is missing), hence not compact. Adding the limit point gives the compact set $\{0\} \cup \{1/n : n \in \mathbb{N}\}$.
[/example]
## Products
Compactness is preserved under finite products, a non-trivial result that requires careful coordination of the two factors.
[quotetheorem:308]
[citeproof:308]
The proof proceeds by a two-stage covering argument sometimes called the **tube lemma** approach. The key geometric idea is that for each point $x \in X$, the "vertical slice" $\{x\} \times Y$ is compact (homeomorphic to $Y$), so any open cover of $X \times Y$ can be reduced to finitely many sets covering this slice. The finite intersection of the corresponding first-coordinate projections gives an open "tube" $U_x \times Y$ around the slice, covered by finitely many sets from the original cover. The second stage uses compactness of $X$ to extract finitely many such tubes, and assembles the finite subcovers.
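The tube construction can be sketched in symbols, under the simplifying assumption that the cover consists of basic open boxes $U \times V$ (the general case reduces to this one, and the cited proof may be organised differently). Fix $x \in X$ and suppose the slice $\{x\} \times Y$ is covered by boxes $U_1 \times V_1, \ldots, U_m \times V_m$, each containing a point of the slice, so that $x \in U_i$ for every $i$. Then
\begin{align*}
U_x &= U_1 \cap \cdots \cap U_m \quad \text{is open and contains } x, \\
V_1 \cup \cdots \cup V_m &= Y, \\
U_x \times Y &\subseteq \bigcup_{i=1}^{m} (U_i \times V_i).
\end{align*}
For the last inclusion, take $(x', y)$ with $x' \in U_x$. Some $V_i$ contains $y$, and $x' \in U_x \subseteq U_i$, so $(x', y) \in U_i \times V_i$. Finiteness of $m$ is what keeps $U_x$ open, and this is the point at which compactness of $Y$ is used.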
This result is the finite case of the full Tychonoff theorem, which asserts that an arbitrary product of compact spaces is compact (requiring the Axiom of Choice in its proof). The finite case, proved here, suffices for all applications in this course and requires no choice principles.
By induction, any finite product $X_1 \times \cdots \times X_n$ of compact spaces is compact. This is crucial for the Heine–Borel theorem.
## The Heine–Borel Theorem
The abstract open-cover definition of compactness is validated by the following concrete characterisation in Euclidean space, which connects it to the familiar notions of closedness and boundedness.
[quotetheorem:309]
[citeproof:309]
The forward direction uses two earlier results: compact subsets of $\mathbb{R}^n$ are closed (by the [compact-subspaces-in-Hausdorff-spaces theorem](/theorems/307), since $\mathbb{R}^n$ is Hausdorff) and bounded (by the open cover $\{B_m(0)\}$, whose finite subcover gives a uniform bound). The reverse direction is the substantial part. It proceeds in three steps. First, prove $[a,b]$ is compact by the **bisection argument**: if an open cover has no finite subcover, repeatedly bisecting produces a nested sequence of intervals $[a_n, b_n]$, none of which has a finite subcover, with $b_n - a_n \to 0$. Their intersection is a single point $c$, which lies in some open set $U$ of the cover. Since $U$ is open, it contains $[a_n, b_n]$ for all sufficiently large $n$, contradicting the choice of those intervals. Second, apply the finite case of [Tychonoff's theorem](/theorems/308) to conclude that $[-M, M]^n$ is compact, where $M$ is any bound with $K \subseteq [-M, M]^n$. Third, use the fact that $K$ is a closed subset of this compact cube.
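The quantitative heart of the bisection argument fits in a few lines. The symbols $c$, $U$, and $\varepsilon$ are introduced here for illustration, and the cover is assumed to consist of open subsets of $\mathbb{R}$:
\begin{align*}
b_n - a_n &= \frac{b - a}{2^n} \to 0, \qquad \bigcap_{n \geq 0} [a_n, b_n] = \{c\}, \\
c \in U \text{ open} &\implies (c - \varepsilon, c + \varepsilon) \subseteq U \text{ for some } \varepsilon > 0, \\
\frac{b - a}{2^n} < \varepsilon &\implies [a_n, b_n] \subseteq (c - \varepsilon, c + \varepsilon) \subseteq U.
\end{align*}
The last implication holds because $c \in [a_n, b_n]$ and the interval has length less than $\varepsilon$. For such $n$ the interval $[a_n, b_n]$ is covered by the single set $U$, contradicting the fact that it was chosen to admit no finite subcover.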
The Heine–Borel theorem is specific to $\mathbb{R}^n$: it fails in general metric spaces. The closed unit ball in an infinite-dimensional Banach space is closed and bounded but not compact — a failure that motivates the theory of weak compactness in functional analysis.
[example: The Heine–Borel Theorem in Action]
The set $K = \{(x,y) \in \mathbb{R}^2 : x^2 + y^2 \leq 1\}$ (the closed unit disc) is compact: it is closed (as the preimage of $(-\infty, 1]$ under the continuous function $(x,y) \mapsto x^2 + y^2$) and bounded ($K \subseteq [-1, 1]^2$). By contrast, the open disc $D = \{(x,y) : x^2 + y^2 < 1\}$ is bounded but not closed (the boundary circle is missing), hence not compact.
[/example]
## Sequential Compactness in Metric Spaces
In metric spaces, the open-cover definition of compactness can be reformulated in terms of sequences — a characterisation that is often more natural in analysis. The bridge between the two perspectives is the notion of total boundedness.
[definition: Sequentially Compact]
A topological space $X$ is **sequentially compact** if every sequence in $X$ has a convergent subsequence.
[/definition]
[motivation]
### The Two Obstructions to Sequential Compactness
A sequence in a metric space can fail to have a convergent subsequence in exactly two ways, and total boundedness is the condition that eliminates the first.
**Obstruction 1: Spreading out.** A sequence might have all its terms far apart. For instance, the sequence $e_1, e_2, e_3, \ldots$ of standard basis vectors in $\ell^2$ satisfies $\|e_m - e_n\| = \sqrt{2}$ for $m \neq n$, so no subsequence is Cauchy, let alone convergent. The sequence stays on the unit sphere, so it does not literally escape to infinity; the problem is that for small $\varepsilon$ no finite collection of $\varepsilon$-balls can contain all of its terms. Total boundedness rules this out: if the space can be covered by finitely many $\varepsilon$-balls for every $\varepsilon > 0$, the pigeonhole principle forces infinitely many terms into a single ball, and a diagonal argument (applying pigeonhole at scales $\varepsilon = 1, 1/2, 1/3, \ldots$) extracts a Cauchy subsequence.
**Obstruction 2: Missing limits.** Even when a Cauchy subsequence exists, the space might have a "hole" where the limit should be. The sequence $x_n = 1/(n+1)$ in $(0, 1)$ is Cauchy, but its limit $0$ lies outside the space. Completeness rules this out: it guarantees that every Cauchy sequence actually converges within the space.
The equivalence theorem below says that these are the *only* two obstructions. A metric space is (sequentially) compact if and only if it is both complete (no missing limits) and totally bounded (no spreading out). This decomposition is not just a theoretical curiosity — it tells you exactly what to check, and exactly what can go wrong, whenever you need to extract a convergent subsequence.
[/motivation]
[definition: Totally Bounded]
A metric space $(M, d)$ is **totally bounded** if for every $\varepsilon > 0$, there exists a finite set $F \subseteq M$ such that $M = \bigcup_{y \in F} B_\varepsilon(y)$. The set $F$ is called a **finite $\varepsilon$-net**.
[/definition]
Total boundedness is strictly stronger than boundedness. A bounded set fits inside a single large ball; a totally bounded set can be covered by finitely many balls of *any* prescribed radius. The distinction matters in infinite dimensions: the closed unit ball in $\ell^2$ is bounded but not totally bounded. The standard basis vectors $e_n$ satisfy $\|e_n - e_m\| = \sqrt{2}$ for $n \neq m$, so a ball of radius $\varepsilon \leq \sqrt{2}/2$ contains at most one of them, and no finite $\varepsilon$-net exists for such $\varepsilon$.
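The separation estimate can be written out in one line. This is only a sketch; the centre $y$ and the radius bound $\varepsilon \leq \sqrt{2}/2$ are notation introduced here for illustration. If two distinct basis vectors $e_m, e_n$ lay in a single ball $B_\varepsilon(y)$, the triangle inequality would give
\begin{align*}
\sqrt{2} = \|e_m - e_n\| \leq \|e_m - y\| + \|y - e_n\| < 2\varepsilon \leq \sqrt{2},
\end{align*}
which is impossible. So each such ball contains at most one $e_n$, and finitely many of them cannot cover the infinitely many basis vectors lying in the unit ball.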
[example: Totally Bounded Subsets of $\mathbb{R}^n$]
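A subset $K \subseteq \mathbb{R}^n$ is totally bounded if and only if it is bounded. The forward direction is immediate (a finite $1$-net gives a uniform bound). For the reverse, if $K \subseteq [-M, M]^n$, divide each side into finitely many intervals of length at most $\varepsilon/(2\sqrt{n})$, which cuts the cube into finitely many cells of diameter at most $\varepsilon/2$. From each cell that meets $K$, choose one point of $K$; these finitely many points form an $\varepsilon$-net for $K$, since every point of $K$ shares a cell with one of them. In infinite-dimensional spaces, this construction fails because no finite grid can cover an infinite-dimensional cube.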
[/example]
The following theorem unifies the three perspectives on compactness in metric spaces:
[quotetheorem:316]
[citeproof:316]
The cycle of implications reveals three different mechanisms. The step (1) $\Rightarrow$ (2) shows that compactness prevents sequences from "spreading out" — if a sequence had no limit point, its terms could be surrounded by open balls containing only finitely many terms each, and compactness would force a finite subcover containing only finitely many terms total. The step (2) $\Rightarrow$ (3) derives completeness (a Cauchy sequence with a convergent subsequence converges) and total boundedness (if no finite $\varepsilon$-net exists, a greedy construction produces a sequence with all pairwise distances at least $\varepsilon$, which has no convergent subsequence). The step (3) $\Rightarrow$ (1) is the deepest: it first establishes sequential compactness via a diagonal argument (total boundedness provides successive refinements of $1/k$-nets, and completeness supplies the limit), then invokes the **Lebesgue number lemma** — every open cover of a sequentially compact metric space has a uniform $\delta$ such that every set of diameter less than $\delta$ fits inside some member of the cover — and finally uses total boundedness to assemble a finite subcover from a $\delta/2$-net.
[example: The Equivalence Applied to $[0, 1]$]
The interval $[0, 1]$ is compact (Heine–Borel). By the equivalence: it is sequentially compact (the [Bolzano–Weierstrass theorem](/theorems/171) for bounded sequences in $\mathbb{R}$, together with the closedness of $[0, 1]$, which keeps the limit inside the interval), complete (as a closed subset of the complete space $\mathbb{R}$), and totally bounded (for any $\varepsilon > 0$, the finitely many multiples $0, \varepsilon, 2\varepsilon, \ldots$ of $\varepsilon$ that lie in $[0, 1]$ form a finite $\varepsilon$-net). All three characterisations confirm the same conclusion by different routes.
[/example]
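As a quick numerical instance of the net for $[0, 1]$ (the value $\varepsilon = 0.3$ is chosen here purely for illustration), the multiples of $\varepsilon$ in $[0, 1]$ are $0, 0.3, 0.6, 0.9$, and the corresponding open balls are intervals of radius $0.3$:
\begin{align*}
[0, 1] \subseteq (-0.3, 0.3) \cup (0, 0.6) \cup (0.3, 0.9) \cup (0.6, 1.2).
\end{align*}
In general this net has $\lfloor 1/\varepsilon \rfloor + 1$ points, which grows as $\varepsilon \to 0$ but is finite for each fixed $\varepsilon$.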
## The Closed Map Lemma and Topological Inverse Function Theorem
Compactness and the Hausdorff property together give continuous maps a remarkable rigidity: continuous bijections are automatically homeomorphisms. The key intermediate result is the Closed Map Lemma.
[quotetheorem:317]
[citeproof:317]
The proof is a clean three-step chain: closed $\xrightarrow{\text{compact ambient}}$ compact $\xrightarrow{\text{continuous}}$ compact $\xrightarrow{\text{Hausdorff target}}$ closed. Each step applies one earlier theorem — the [compact-subspaces theorem](/theorems/307) part (1), the [continuous image theorem](/theorems/305), and the [compact-subspaces theorem](/theorems/307) part (2), respectively. The elegance lies in the composition: three separate results, each simple in isolation, combine to produce a powerful structural conclusion.
The Closed Map Lemma immediately yields the topological inverse function theorem:
[quotetheorem:318]
[citeproof:318]
The proof reduces the problem of continuity of $f^{-1}$ to the question of whether $f$ is a closed map. A bijective closed map has a continuous inverse (preimages of closed sets under $f^{-1}$ are images of closed sets under $f$, which are closed by the Closed Map Lemma). The compactness of $X$ and the Hausdorff property of $Y$ do all the work — no explicit $\varepsilon$-$\delta$ argument is needed.
This result is used constantly to identify quotient spaces. Whenever one constructs a continuous bijection from a quotient of a compact space to a Hausdorff target, the theorem guarantees it is a homeomorphism — without needing to verify continuity of the inverse directly.
[example: $\mathbb{R}/\mathbb{Z} \cong S^1$]
The map $f: [0, 1] \to S^1$ defined by $f(t) = (\cos 2\pi t, \sin 2\pi t)$ is continuous and surjective, satisfies $f(0) = f(1)$, and is injective on $[0, 1)$. It therefore induces a continuous bijection $\tilde{f}: [0, 1]/{\sim} \to S^1$, where $0 \sim 1$. The domain $[0, 1]/{\sim}$ is compact (as a quotient of the compact space $[0, 1]$), and the target $S^1$ is Hausdorff (as a subspace of $\mathbb{R}^2$). By the [Topological Inverse Function Theorem](/theorems/318), $\tilde{f}$ is a homeomorphism.
[/example]
## Compactness as a Topological Invariant
Like connectedness, compactness can be used to distinguish non-homeomorphic spaces. Since homeomorphisms preserve compactness, a compact space cannot be homeomorphic to a non-compact one. This provides a simple but effective tool: $[0, 1]$ is not homeomorphic to $\mathbb{R}$ (the first is compact, the second is not), and $(0, 1)$ is not homeomorphic to $[0, 1]$ (again, compactness distinguishes them — even though both are connected subsets of $\mathbb{R}$).
Compactness also interacts with other topological invariants to provide finer distinctions. A compact Hausdorff space is **normal** (disjoint closed sets can be separated by disjoint open sets), which gives access to Urysohn's lemma and the Tietze extension theorem. These results, together with the Heine–Borel characterisation, form the analytical backbone of much of classical and modern analysis.
[problem]
Let $f: X \to Y$ be a continuous injection from a compact space $X$ to a Hausdorff space $Y$. Prove that $f$ is an embedding: $f$ is a homeomorphism from $X$ onto $f(X)$ (with the subspace topology).
[/problem]
[solution]
**Step 1: Establish that $f(X)$ is Hausdorff.** Since $Y$ is Hausdorff and $f(X) \subseteq Y$, the subspace $f(X)$ inherits the Hausdorff property: for distinct points $p, q \in f(X)$, the disjoint open sets in $Y$ separating them restrict to disjoint open sets in $f(X)$.
**Step 2: Show $f: X \to f(X)$ is a continuous bijection.** By hypothesis, $f$ is continuous and injective. Since the codomain is restricted to the image $f(X)$, the map $f: X \to f(X)$ is also surjective, hence a bijection. (Continuity into the subspace follows from the definition of the subspace topology: every open subset of $f(X)$ has the form $V \cap f(X)$ with $V$ open in $Y$, and its preimage under $f$ is $f^{-1}(V)$, which is open in $X$.)
**Step 3: Apply the Topological Inverse Function Theorem.** The domain $X$ is compact and the codomain $f(X)$ is Hausdorff (by Step 1). By the [Topological Inverse Function Theorem](/theorems/318), the continuous bijection $f: X \to f(X)$ is a homeomorphism.
**Step 4: Conclude.** Therefore $f$ is a homeomorphism onto its image, which is exactly the statement that $f$ is an embedding.
This result is frequently used in geometry: to show a map is an embedding, one verifies injectivity, continuity, and compactness of the domain, rather than checking that $f^{-1}$ is continuous directly. For instance, an injective immersion of a compact manifold into another manifold is automatically an embedding.
[/solution]

---

Having explored the topological foundations of analysis (compactness, connectedness, and continuity in general topological and metric spaces), we now turn to a central question of calculus: how does one differentiate a function of several variables? In single-variable calculus, differentiability means that a function $f: \mathbb{R} \to \mathbb{R}$ can be well-approximated near a point $a$ by a linear function, $f(a + h) \approx f(a) + f'(a)h$, with an error that vanishes faster than $|h|$. The derivative $f'(a)$ is a single number: the slope of the tangent line.
In higher dimensions, the situation is richer. A function $f: \mathbb{R}^m \to \mathbb{R}^n$ maps vectors to vectors, and the best linear approximation at a point is no longer a scalar multiple but a linear map $\tau: \mathbb{R}^m \to \mathbb{R}^n$ — an $n \times m$ matrix. The derivative of $f$ at a point $a$ is this linear map, and the theory of multivariable differentiation is, at its core, the theory of how nonlinear maps are approximated by linear ones.
The development here relies on the metric structure of $\mathbb{R}^m$ and $\mathbb{R}^n$ (from Section 5), the compactness of the unit sphere (Section 8, which ensures that [linear maps](/page/Linear%20Map) on finite-dimensional spaces are bounded), and the openness of domains (Section 6, which provides room for the limit $h \to \mathbf{0}$ in all directions). The chain rule — the deepest result of this section — combines all of these ingredients.
[motivation]
### Why Linear Maps?
In one variable, the derivative $f'(a)$ is a number that scales the increment: $f(a+h) \approx f(a) + f'(a) \cdot h$. The map $h \mapsto f'(a) \cdot h$ is linear from $\mathbb{R}$ to $\mathbb{R}$. In higher dimensions, the analogue of "scaling" is a linear transformation. If $f: \mathbb{R}^m \to \mathbb{R}^n$, the best linear approximation to the increment $f(a + h) - f(a)$ is a linear map $\tau: \mathbb{R}^m \to \mathbb{R}^n$, not a number. This is because the increment $h$ can point in $m$ independent directions, and the response in $\mathbb{R}^n$ has $n$ components — capturing all of this information requires an $n \times m$ matrix of partial derivatives.
### Why Not Partial Derivatives?
One might attempt to define differentiability by requiring that all partial derivatives $\partial f_j / \partial x_i$ exist. But this approach is insufficient: a function can have all partial derivatives at a point without being continuous there, let alone differentiable. The existence of partial derivatives controls the behaviour of $f$ along the coordinate axes, but says nothing about the behaviour along other directions. True differentiability — approximation by a single linear map with an error that is $o(|h|)$ in *all* directions simultaneously — is a strictly stronger condition, and it is the one that supports a clean chain rule and the full machinery of calculus.
### The Role of Norms
To make the approximation $f(a + h) \approx f(a) + \tau(h)$ precise, we need to measure the size of the error $f(a + h) - f(a) - \tau(h)$ relative to the size of $h$. This requires norms on both $\mathbb{R}^m$ (for $h$) and $\mathbb{R}^n$ (for the error), and a norm on the space of linear maps $\mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ (for convergence of derivatives). The Euclidean norm on vectors and the Frobenius norm on matrices provide a consistent framework in which all the estimates work cleanly.
[/motivation]
## Norms on Linear Maps
Before defining differentiability, we need to equip the space of linear maps $\mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ with a norm that is compatible with the Euclidean norms on domain and codomain. Two natural choices present themselves.
[definition: Operator Norm]
Let $\tau \in \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$. The **operator norm** of $\tau$ is
\begin{align*}
\|\tau\|_{\mathrm{op}} = \sup_{|x| = 1} |\tau(x)|.
\end{align*}
[/definition]
The operator norm measures the maximum stretching factor of $\tau$ on unit vectors. In finite dimensions, the supremum is attained (the unit sphere $S^{m-1} \subseteq \mathbb{R}^m$ is compact, and $x \mapsto |\tau(x)|$ is continuous, so the [Extreme Value Theorem](/theorems/304) applies). However, for computational purposes we primarily use the Euclidean (Frobenius) norm.
[definition: Euclidean Norm on Linear Maps]
Let $\tau \in \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ with matrix representation $T \in M_{n \times m}(\mathbb{R})$ with respect to the standard bases. The **Euclidean norm** (or **Frobenius norm**) of $\tau$ is
\begin{align*}
\|\tau\| = \|T\| = \left(\sum_{i=1}^m \sum_{j=1}^n T_{ji}^2\right)^{1/2} = \left(\sum_{i=1}^m |\tau(e_i)|^2\right)^{1/2},
\end{align*}
where $\{e_1, \ldots, e_m\}$ is the standard basis of $\mathbb{R}^m$.
[/definition]
This norm arises from identifying $\mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ with $\mathbb{R}^{nm}$ (listing all matrix entries as a single vector) and taking the standard Euclidean norm. The two norms are related by $\|\tau\|_{\mathrm{op}} \leq \|\tau\| \leq \sqrt{m}\|\tau\|_{\mathrm{op}}$, so they induce the same topology and the same notion of convergence. We use $\|\tau\|$ (without subscript) for the Euclidean norm throughout.
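Two simple maps show that both inequalities can be attained. The examples are illustrative choices made here: the identity $\mathrm{id}$ on $\mathbb{R}^m$ and the coordinate projection $\pi(x) = x_1 e_1$ on $\mathbb{R}^m$. Using the formula $\|\tau\|^2 = \sum_i |\tau(e_i)|^2$:
\begin{align*}
\|\mathrm{id}\|_{\mathrm{op}} &= \sup_{|x| = 1} |x| = 1, & \|\mathrm{id}\| &= \Big(\sum_{i=1}^m |e_i|^2\Big)^{1/2} = \sqrt{m}, \\
\|\pi\|_{\mathrm{op}} &= \sup_{|x| = 1} |x_1| = 1, & \|\pi\| &= \big(|e_1|^2 + 0 + \cdots + 0\big)^{1/2} = 1.
\end{align*}
So the identity gives equality in $\|\tau\| \leq \sqrt{m}\|\tau\|_{\mathrm{op}}$, while the projection gives equality in $\|\tau\|_{\mathrm{op}} \leq \|\tau\|$.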
The following result establishes that linear maps are automatically well-behaved — a fact that underpins the entire theory of differentiation.
[quotetheorem:321]
[citeproof:321]
The Cauchy–Schwarz inequality is the key tool in Part 1: expanding $\tau(x) = \sum x_i \tau(e_i)$, the triangle inequality gives $|\tau(x)| \leq \sum |x_i| |\tau(e_i)|$, and Cauchy–Schwarz applied to the sequences $(|x_i|)$ and $(|\tau(e_i)|)$ bounds this sum by $|x| \cdot \|\tau\|$. Part 2 (submultiplicativity) follows by applying Part 1 to $\sigma$ at each vector $\tau(e_i)$ and summing the squares: $\|\sigma \circ \tau\|^2 = \sum_i |\sigma(\tau(e_i))|^2 \leq \|\sigma\|^2 \sum_i |\tau(e_i)|^2 = \|\sigma\|^2 \|\tau\|^2$. Part 3 is then immediate: linearity turns the bound of Part 1 into a Lipschitz estimate $|\tau(x) - \tau(y)| = |\tau(x - y)| \leq \|\tau\| \cdot |x - y|$. The submultiplicativity $\|\sigma \circ \tau\| \leq \|\sigma\| \cdot \|\tau\|$ is essential for the chain rule proof, where it controls the composition of error terms.
## The Definition of Differentiability
[definition: Differentiable Map]
Let $U \subseteq \mathbb{R}^m$ be open, $a \in U$, and $f: U \to \mathbb{R}^n$. The map $f$ is **differentiable at $a$** if there exists a linear map $\tau \in \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ and a function $\varepsilon: U - a \to \mathbb{R}^n$ such that:
1. $\varepsilon(\mathbf{0}) = \mathbf{0}$ and $\varepsilon$ is continuous at $\mathbf{0}$,
2. For all $h$ with $a + h \in U$:
\begin{align*}
f(a + h) = f(a) + \tau(h) + |h|\varepsilon(h).
\end{align*}
[/definition]
The equation says that $f(a + h)$ consists of three parts: the base value $f(a)$, the linear approximation $\tau(h)$, and an error $|h|\varepsilon(h)$ that is $o(|h|)$ because $\varepsilon(h) \to \mathbf{0}$. The linear map $\tau$ captures everything about $f$ near $a$ to first order. The openness of $U$ ensures that $a + h \in U$ for all sufficiently small $h$ in *every* direction, which is essential for the derivative to encode directional information in all $m$ dimensions.
The definition is equivalent to a more familiar limit condition:
[quotetheorem:319]
[citeproof:319]
The equivalence is a direct translation: the error function $\varepsilon(h)$ in the definition *is* the quotient $(f(a + h) - f(a) - \tau(h))/|h|$ for $h \neq \mathbf{0}$, extended by the value $\mathbf{0}$ at $h = \mathbf{0}$. The continuity condition $\varepsilon(h) \to \mathbf{0}$ as $h \to \mathbf{0}$ is the same as the limit of this quotient being $\mathbf{0}$. The two formulations are used interchangeably: the error-function form is better for algebraic manipulations (as in the chain rule proof), while the limit form is more concise for statements.
### Uniqueness of the Derivative
A priori, the definition says "there exists a linear map $\tau$" — but for the theory to work, we need this map to be unique. Otherwise, the notation $Df_a$ would be ambiguous.
[quotetheorem:320]
[citeproof:320]
The proof exploits linearity in a crucial way. If two linear maps $\tau_1$ and $\tau_2$ both satisfy the differentiability condition, with error functions $\varepsilon_1$ and $\varepsilon_2$, their difference satisfies $(\tau_1 - \tau_2)(h) = |h|(\varepsilon_2(h) - \varepsilon_1(h))$. Evaluating along $h = tv$ for a unit vector $v$ and small $t > 0$, and using linearity to cancel $t$, the left side becomes $(\tau_1 - \tau_2)(v)$, which is independent of $t$, while the right side vanishes as $t \to 0$. The openness of $U$ is essential: it guarantees that $a + tv \in U$ for small $t > 0$ in *every* direction $v$, allowing the argument to recover the value of $\tau_1 - \tau_2$ on every unit vector.
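Written as a single display, the cancellation looks as follows. This is a sketch of the step described above, with $v$ a unit vector and $t > 0$ small enough that $a + tv \in U$, so that $|tv| = t$:
\begin{align*}
t\,(\tau_1 - \tau_2)(v) = (\tau_1 - \tau_2)(tv) = t\,\big(\varepsilon_2(tv) - \varepsilon_1(tv)\big)
\quad \Longrightarrow \quad
(\tau_1 - \tau_2)(v) = \varepsilon_2(tv) - \varepsilon_1(tv) \xrightarrow[t \to 0^+]{} \mathbf{0}.
\end{align*}
The left side does not depend on $t$, so it must equal $\mathbf{0}$ for every unit vector $v$, and linearity then gives $\tau_1 = \tau_2$.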
We write $Df_a$ for the unique derivative. If $f$ is differentiable at every point of $U$, we write $Df: U \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ for the derivative map.
## Basic Examples
[example: Constant and Linear Maps]
If $f(x) = b$ is constant, then $f(a + h) - f(a) = \mathbf{0} = 0(h) + |h| \cdot \mathbf{0}$, so $Df_a = 0$ (the zero linear map) for all $a$.
If $f(x) = \tau(x)$ for a fixed $\tau \in \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$, then $f(a + h) - f(a) = \tau(h)$, so the error term is zero and $Df_a = \tau$ for all $a$. A linear map is its own best linear approximation — its derivative is itself, everywhere.
[/example]
[example: The Squared Norm]
The function $f: \mathbb{R}^m \to \mathbb{R}$ defined by $f(x) = |x|^2 = \sum_{i=1}^m x_i^2$ is differentiable everywhere. At $a \in \mathbb{R}^m$:
\begin{align*}
f(a + h) = |a + h|^2 = |a|^2 + 2\langle a, h\rangle + |h|^2 = f(a) + 2\langle a, h\rangle + |h|^2.
\end{align*}
The term $2\langle a, h\rangle$ is linear in $h$ (it defines an element of $\mathcal{L}(\mathbb{R}^m, \mathbb{R})$), and the remainder $|h|^2 = |h| \cdot |h|$ has $\varepsilon(h) = |h| \to 0$. Therefore $Df_a(h) = 2\langle a, h\rangle$. In coordinates, the matrix of $Df_a$ is the row vector $2a^T = (2a_1, \ldots, 2a_m)$.
[/example]
[example: Differentiability of Bilinear Maps]
Let $B: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}^p$ be bilinear (linear in each variable separately). At $(a, b) \in \mathbb{R}^m \times \mathbb{R}^n$:
\begin{align*}
B(a + h, b + k) = B(a, b) + B(a, k) + B(h, b) + B(h, k).
\end{align*}
The term $B(a, k) + B(h, b)$ is linear in $(h, k)$. The remainder $B(h, k)$ satisfies $|B(h, k)| \leq C|h||k| \leq C|(h, k)|^2$ for a constant $C$ depending only on $B$ (expand $h$ and $k$ in the standard bases and apply Cauchy–Schwarz; the second inequality holds because $|h|, |k| \leq |(h, k)|$), so it is $o(|(h, k)|)$. Therefore $DB_{(a, b)}(h, k) = B(a, k) + B(h, b)$.
This example is fundamental: matrix multiplication, inner products, and cross products are all bilinear, so their derivatives are immediately computed by this formula.
[/example]
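As a consistency check, the bilinear formula can be applied to the inner product $B(x, y) = \langle x, y \rangle$ on $\mathbb{R}^m \times \mathbb{R}^m$ (an illustrative choice, not part of the original example). The formula gives $DB_{(a, b)}(h, k) = \langle a, k\rangle + \langle h, b\rangle$, and on the diagonal $b = a$, $k = h$ the four-term expansion reads
\begin{align*}
|a + h|^2 = B(a + h, a + h) = B(a, a) + B(a, h) + B(h, a) + B(h, h) = |a|^2 + 2\langle a, h\rangle + |h|^2,
\end{align*}
which agrees with the squared norm example: the linear part is $2\langle a, h\rangle$ and the remainder is $B(h, h) = |h|^2$.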
## [Differentiability Implies Continuity](/theorems/184)
[quotetheorem:322]
[citeproof:322]
The proof uses the triangle inequality and the [Lipschitz property of linear maps](/theorems/321): $|f(a + h) - f(a)| \leq |\tau(h)| + |h||\varepsilon(h)| \leq |h|(\|\tau\| + |\varepsilon(h)|)$. As $h \to \mathbf{0}$, the factor in parentheses tends to the finite value $\|\tau\|$, and $|h| \to 0$, so the product tends to $0$. The converse fails: $f(x) = |x|$ is continuous at $0$ but not differentiable (the "corner" at $0$ prevents any single linear approximation from working in both directions).
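The failure for $f(x) = |x|$ at $0$ can be made explicit. Suppose, as a hypothetical candidate, that $\tau(h) = ch$ for some constant $c \in \mathbb{R}$ (every linear map $\mathbb{R} \to \mathbb{R}$ has this form). The error quotient is then
\begin{align*}
\frac{f(0 + h) - f(0) - ch}{|h|} = \frac{|h| - ch}{|h|} =
\begin{cases}
1 - c & h > 0, \\
1 + c & h < 0.
\end{cases}
\end{align*}
For this to tend to $0$ as $h \to 0$ we would need both $c = 1$ and $c = -1$, so no linear map satisfies the definition.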
## The Chain Rule
The chain rule is the most important computational tool in differentiation, and its proof in several variables is substantially more delicate than in one variable. The key difficulty is controlling the error term $|k|\varepsilon_2(k)$ from $g$, where $k = f(a + h) - f(a)$ depends on $h$ in a nonlinear way.
[quotetheorem:323]
[citeproof:323]
The proof substitutes the error-function expansions of $f$ and $g$ and collects terms. Write $\alpha = Df_a$ and $\beta = Dg_b$, where $b = f(a)$, and let $\varepsilon_1$ and $\varepsilon_2$ be the error functions of $f$ at $a$ and of $g$ at $b$. The linear part of the composite is $\beta \circ \alpha = Dg_b \circ Df_a$: the composition of linear maps, not the product of scalars. The error analysis requires two estimates. First, the term $\beta(|h|\varepsilon_1(h))$ is bounded by $\|\beta\| \cdot |h| \cdot |\varepsilon_1(h)| = |h| \cdot o(1)$ using the Lipschitz property. Second, the term $|k|\varepsilon_2(k)$ requires showing $|k| = O(|h|)$, which follows from $|k| \leq \|\alpha\||h| + |h||\varepsilon_1(h)|$, and $\varepsilon_2(k) \to \mathbf{0}$, which follows from the continuity of $f$ at $a$ (implied by differentiability) ensuring $k \to \mathbf{0}$ as $h \to \mathbf{0}$.
In matrix terms, the chain rule says that the Jacobian matrix of $g \circ f$ at $a$ is the matrix product of the Jacobian of $g$ at $f(a)$ and the Jacobian of $f$ at $a$. The one-variable chain rule $(g \circ f)'(a) = g'(f(a)) \cdot f'(a)$ is the special case $m = n = p = 1$, where composition of linear maps reduces to multiplication of scalars.
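A small worked instance may help. The functions are chosen here for illustration: $f: \mathbb{R} \to \mathbb{R}^2$, $f(t) = (\cos t, \sin t)$, and $g: \mathbb{R}^2 \to \mathbb{R}$, $g(x) = |x|^2$. We take for granted that $Df_a$ has matrix $(-\sin a, \cos a)^T$ (differentiating each component in the usual one-variable way), and use $Dg_b(k) = 2\langle b, k\rangle$ from the squared norm example. The matrix product is
\begin{align*}
Dg_{f(a)} \, Df_a = \begin{pmatrix} 2\cos a & 2\sin a \end{pmatrix} \begin{pmatrix} -\sin a \\ \cos a \end{pmatrix} = -2\cos a \sin a + 2 \sin a \cos a = 0.
\end{align*}
This is consistent with the chain rule, since $(g \circ f)(t) = \cos^2 t + \sin^2 t = 1$ is constant and therefore has zero derivative.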
## Componentwise Differentiability
A vector-valued function $f: U \to \mathbb{R}^n$ can be written in terms of its components $f_1, \ldots, f_n: U \to \mathbb{R}$. The following result shows that differentiability of $f$ can be checked one component at a time.
[quotetheorem:324]
[citeproof:324]
The forward direction uses the chain rule: each component $f_j = \pi_j \circ f$ is the composition of the differentiable map $f$ with the linear (hence differentiable) projection $\pi_j$. The reverse direction assembles the $n$ scalar error functions $\varepsilon_j(h)$ into a single vector error $\varepsilon(h) = (\varepsilon_1(h), \ldots, \varepsilon_n(h))$, which tends to $\mathbf{0}$ because each component does — the Euclidean norm satisfies $|\varepsilon|^2 = \sum \varepsilon_j^2$.
This result is essential for computations: to differentiate a map $f: \mathbb{R}^m \to \mathbb{R}^n$, it suffices to differentiate each of its $n$ component functions separately. The derivative $Df_a$ then has the matrix representation whose $j$-th row is $D(f_j)_a$ — the gradient of the $j$-th component. This matrix is the **Jacobian matrix** of $f$ at $a$.
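For a concrete illustration (the map is a hypothetical example introduced here), take $f: \mathbb{R}^2 \to \mathbb{R}^2$ with $f(x, y) = (xy, \, x + y)$ and $a = (a_1, a_2)$. The first component is scalar multiplication, which is bilinear, so the bilinear map example gives $D(f_1)_a(h) = a_1 h_2 + h_1 a_2$. The second component is linear, so $D(f_2)_a(h) = h_1 + h_2$. Stacking the two rows:
\begin{align*}
Df_a = \begin{pmatrix} a_2 & a_1 \\ 1 & 1 \end{pmatrix}, \qquad
Df_a(h) = \begin{pmatrix} a_2 h_1 + a_1 h_2 \\ h_1 + h_2 \end{pmatrix}.
\end{align*}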
[example: Componentwise Verification]
The function $f: \mathbb{R} \to \mathbb{R}^2$ defined by
\begin{align*}
f(t) = \begin{cases}
\left(t^2 \sin \frac{1}{t}, \, t^2 \cos \frac{1}{t}\right) & t \neq 0, \\
(0, 0) & t = 0,
\end{cases}
\end{align*}
is differentiable at $t = 0$. Each component $f_j(t)$ satisfies $|f_j(t)| \leq t^2$, so $|f_j(t) - f_j(0) - 0 \cdot t|/|t| = |f_j(t)|/|t| \leq |t| \to 0$. Therefore each $f_j$ is differentiable at $0$ with $f_j'(0) = 0$, and by the componentwise theorem, $f'(0) = 0$.
[/example]
## Algebraic Rules
[quotetheorem:325]
[citeproof:325]
Part 1 (linearity) is immediate: the error functions simply add. Part 2 (product rule) requires expanding the product $\phi(a + h)f(a + h)$ using both error-function expansions and identifying the terms that are linear in $h$ (which give the derivative) from the higher-order terms (which go into the error). The cross term $D\phi_a(h) \cdot Df_a(h)$ is $O(|h|^2)$ by the Lipschitz estimates, hence $o(|h|)$. The product rule for scalar-vector multiplication has the same form as in one variable: $(uv)' = u'v + uv'$.
The linearity and product rules, together with the chain rule, provide the complete algebraic toolkit for computing derivatives of functions built from elementary operations. Any function expressed as a composition of sums, products, and known differentiable functions can be differentiated systematically using these rules.
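As a sketch of the toolkit in use, consider $\phi(x) = |x|^2$ and $f(x) = x$ on $\mathbb{R}^m$ (illustrative choices). We assume the product rule takes the form $D(\phi f)_a(h) = D\phi_a(h)\, f(a) + \phi(a)\, Df_a(h)$, matching the one-variable pattern $(uv)' = u'v + uv'$; the precise statement is the one in the theorem quoted above. With $D\phi_a(h) = 2\langle a, h\rangle$ and $Df_a(h) = h$:
\begin{align*}
D\big(|x|^2 x\big)_a(h) = 2\langle a, h\rangle\, a + |a|^2\, h.
\end{align*}
For $m = 1$ this reads $2a \cdot h \cdot a + a^2 h = 3a^2 h$, recovering $(x^3)' = 3x^2$.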
[problem]
Let $A \in M_{n \times n}(\mathbb{R})$ be a fixed square matrix. Define $f: M_{n \times n}(\mathbb{R}) \to M_{n \times n}(\mathbb{R})$ by $f(X) = X^2 = X \cdot X$ (matrix multiplication). Compute $Df_A$.
[/problem]
[solution]
**Step 1: Set up the increment.** We compute $f(A + H) - f(A)$ for $H \in M_{n \times n}(\mathbb{R})$:
\begin{align*}
f(A + H) = (A + H)^2 = A^2 + AH + HA + H^2.
\end{align*}
Therefore:
\begin{align*}
f(A + H) - f(A) = AH + HA + H^2.
\end{align*}
**Step 2: Identify the linear and error terms.** The map $H \mapsto AH + HA$ is linear in $H$ (it is the sum of left-multiplication by $A$ and right-multiplication by $A$, both of which are linear operations on matrices). The term $H^2$ is the remainder.
**Step 3: Show the remainder is $o(\|H\|)$.** Using the [submultiplicativity of the Euclidean norm](/theorems/321):
\begin{align*}
\|H^2\| \leq \|H\| \cdot \|H\| = \|H\|^2.
\end{align*}
Therefore $\|H^2\|/\|H\| = \|H\| \to 0$ as $H \to 0$.
**Step 4: Conclude.** By the limit characterisation of differentiability, $f$ is differentiable at $A$ with derivative:
\begin{align*}
Df_A(H) = AH + HA.
\end{align*}
This is a linear map from $M_{n \times n}(\mathbb{R})$ to itself. When $n = 1$, this reduces to $Df_a(h) = 2ah$, recovering the familiar one-variable formula $(x^2)' = 2x$. The appearance of *two* terms ($AH$ and $HA$) rather than one reflects the non-commutativity of matrix multiplication: $A$ can act on $H$ from the left or the right, and both contributions survive. This is an instance of the bilinear map formula applied to matrix multiplication $B(X, Y) = XY$, giving $DB_{(A, A)}(H, K) = AK + HA$, evaluated at $H = K$.
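As an illustrative check (the matrices below are chosen here and are not part of the original problem), take $n = 2$ with
\begin{align*}
A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad
H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}, \quad
AH = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad
HA = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \quad
Df_A(H) = AH + HA = I.
\end{align*}
Here $A^2 = H^2 = 0$, so $f(A + tH) - f(A) = t(AH + HA) = tI$ exactly, in agreement with the formula. The naive guess $2AH$ would give $\begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}$ instead, which is wrong because $A$ and $H$ do not commute.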
[/solution]

---

In Section 9, we defined the total derivative of a map $f: \mathbb{R}^m \to \mathbb{R}^n$ at a point $a$ as a linear map $Df_a \in \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ that approximates $f$ to first order. The definition is clean and conceptually powerful, but it raises an immediate practical question: *how does one compute $Df_a$?* The total derivative is an $n \times m$ matrix, but the definition only requires that the error
\begin{align*}
f(a + h) - f(a) - Df_a(h) \quad \text{is } o(|h|),
\end{align*}
and this requirement provides no algorithm for finding it.
The answer comes from partial derivatives. By varying one coordinate at a time, holding the others fixed, we reduce the multivariable problem to a collection of single-variable problems. The partial derivatives $\partial f_j / \partial x_i$ — ordinary one-dimensional derivatives along coordinate axes — form the entries of a matrix, the **Jacobian**, that represents the total derivative. The central question of this section is the precise relationship between these two notions: *when do partial derivatives determine the total derivative, and when do they fall short?*
The section then develops the consequences of this relationship: the Mean Value Inequality (the multivariable replacement for the [Mean Value Theorem](/theorems/186)), the fundamental result that zero derivative on a connected domain forces constancy (connecting differential calculus to the topology of Section 7), and the [Inverse Function Theorem](/page/Inverse%20Function%20Theorem) — the deepest result in multivariable differential calculus — which shows that local invertibility of a $C^1$ map is controlled entirely by the invertibility of its derivative at a point.
[motivation]
### Partial Derivatives: The Coordinate-by-Coordinate Approach
Given a function $f: \mathbb{R}^2 \to \mathbb{R}$, the most natural way to study its behaviour near a point $(a, b)$ is to hold one variable fixed and differentiate in the other. The partial derivative $\partial f / \partial x$ at $(a, b)$ is the ordinary derivative of $t \mapsto f(t, b)$ at $t = a$ — a one-dimensional computation. Similarly for $\partial f / \partial y$. This gives two numbers that describe the rate of change of $f$ along the $x$- and $y$-axes.
But why should rates of change along coordinate axes tell us about rates of change in arbitrary directions? They can fail to do so: a function can have well-defined partial derivatives at a point without even being continuous there, let alone differentiable. Consider the function
\begin{align*}
f(x, y) = \begin{cases} \dfrac{xy}{x^2 + y^2} & (x, y) \neq (0, 0), \\ 0 & (x, y) = (0, 0). \end{cases}
\end{align*}
On the $x$-axis, $f(t, 0) = 0$ for all $t$, so $D_1 f(0, 0) = 0$. On the $y$-axis, $f(0, t) = 0$ for all $t$, so $D_2 f(0, 0) = 0$. Both partial derivatives exist and equal zero. Yet $f$ is not continuous at the origin: along the line $y = x$,
\begin{align*}
f(t, t) = \frac{t^2}{2t^2} = \frac{1}{2} \quad \text{for all } t \neq 0,
\end{align*}
so $f(t, t) \to 1/2 \neq 0 = f(0,0)$. The partial derivatives see only the coordinate axes, where $f$ happens to vanish, and miss the non-zero behaviour between them.
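The same computation along an arbitrary line $y = mx$ (the slope $m$ is a parameter introduced here for illustration) shows how badly the limit depends on direction:
\begin{align*}
f(t, mt) = \frac{m t^2}{t^2 + m^2 t^2} = \frac{m}{1 + m^2} \quad \text{for all } t \neq 0.
\end{align*}
Each line through the origin carries its own constant value, ranging over $[-1/2, 1/2]$ as $m$ varies, so no single value at the origin can make $f$ continuous there.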
### Why Continuity of Partial Derivatives Matters
The gap between "partial derivatives exist" and "total derivative exists" is bridged by a regularity condition: *continuity* of the partial derivatives. If all partial derivatives exist near a point and are continuous at that point, then the function is differentiable there. The intuition is that continuity prevents the partial derivatives from behaving erratically between coordinate axes — it forces the coordinate-by-coordinate approximation to be consistent across all directions simultaneously.
### From Local to Global: The Role of Connectedness
Once we know a function is differentiable, the derivative gives local information — the best linear approximation at each point. But analysis often demands global conclusions: is a function constant? Is it injective? The bridge from local to global is provided by the topology of the domain, specifically its connectedness. If $Df_a = 0$ for every $a$ in a connected open set $U$, the Mean Value Inequality forces $f$ to be locally constant, and connectedness (the impossibility of non-trivial open-closed partitions) promotes this to global constancy. On a disconnected domain, the function could take different constant values on different components — connectedness is not a convenience but a necessity.
[/motivation]
## Directional and Partial Derivatives
[definition: Directional Derivative]
Let $U \subseteq \mathbb{R}^m$ be open, $f: U \to \mathbb{R}^n$, $a \in U$, and $u \in \mathbb{R}^m$ nonzero. If the limit
\begin{align*}
D_{u}f(a) = \lim_{t \to 0} \frac{f(a + tu) - f(a)}{t}
\end{align*}
exists, it is called the **directional derivative** of $f$ at $a$ in the direction $u$.
[/definition]
The directional derivative restricts $f$ to the line $t \mapsto a + tu$ and computes the ordinary one-dimensional derivative at $t = 0$. It measures the rate of change of $f$ along a single direction. The directional derivative can exist in every direction without $f$ being differentiable — or even continuous — because knowing what happens along each line through $a$ individually says nothing about what happens in between.
[definition: Partial Derivative]
Let $U \subseteq \mathbb{R}^m$ be open, $f: U \to \mathbb{R}^n$, and $a \in U$. The **$i$-th partial derivative** of $f$ at $a$ is the directional derivative in the direction of the $i$-th standard basis vector:
\begin{align*}
D_if(a) = D_{e_i}f(a) = \lim_{t \to 0} \frac{f(a + te_i) - f(a)}{t},
\end{align*}
also denoted $\frac{\partial f}{\partial x_i}(a)$.
[/definition]
Partial derivatives are the special case of directional derivatives along coordinate axes. They are the easiest directional derivatives to compute — one simply differentiates with respect to a single variable while treating all other variables as constants.
The following theorem establishes that when the total derivative exists, it determines all directional and partial derivatives.
[quotetheorem:326]
[citeproof:326]
The proof is a direct substitution: setting $h = tu$ in the differentiability equation and using linearity of $Df_a$ to pull $t$ out, the quotient
\begin{align*}
\frac{f(a + tu) - f(a)}{t}
\end{align*}
becomes $Df_a(u)$ plus an error that vanishes with $t$. Part 3 then follows from linearity: any $h = \sum h_i e_i$ gives
\begin{align*}
Df_a(h) = \sum_{i=1}^m h_i Df_a(e_i) = \sum_{i=1}^m h_i D_if(a).
\end{align*}
The converse of this theorem is *false*: all directional derivatives can exist at a point without the total derivative existing. The crucial failure mode is that directional derivatives are "one direction at a time" — they do not guarantee that the map $u \mapsto D_{u}f(a)$ is linear in $u$, or that the approximation is uniform in direction. Total differentiability requires a single linear map that works in all directions simultaneously.
[example: All Directional Derivatives Exist but No Total Derivative]
Define $f: \mathbb{R}^2 \to \mathbb{R}$ by
\begin{align*}
f(x, y) = \begin{cases} \dfrac{x^2 y}{x^4 + y^2} & (x, y) \neq (0, 0), \\ 0 & (x, y) = (0, 0). \end{cases}
\end{align*}
We show that every directional derivative exists at the origin, yet $f$ is not even continuous there.
**Directional derivatives.** Let $u = (a, b)$ with $(a, b) \neq (0, 0)$. If $b \neq 0$:
\begin{align*}
\frac{f(ta, tb)}{t} = \frac{1}{t} \cdot \frac{t^2 a^2 \cdot tb}{t^4 a^4 + t^2 b^2} = \frac{t^2 a^2 b}{t^4 a^4 + t^2 b^2} = \frac{a^2 b}{t^2 a^4 + b^2}.
\end{align*}
As $t \to 0$, the denominator tends to $b^2 \neq 0$, so
\begin{align*}
D_{(a,b)}f(0,0) = \frac{a^2}{b}.
\end{align*}
If $b = 0$: $f(ta, 0) = 0$ for all $t$, so $D_{(a,0)}f(0,0) = 0$. Taking $u = (1, 0)$ in this case and $u = (0, 1)$ in the previous one (where $a^2/b = 0$) shows that both partial derivatives exist and vanish:
\begin{align*}
D_1 f(0,0) = D_{(1,0)}f(0,0) = 0, \qquad D_2 f(0,0) = D_{(0,1)}f(0,0) = 0.
\end{align*}
**Not differentiable.** If $f$ were differentiable at the origin, the total derivative would be the zero map (since both partial derivatives vanish), and we would need $f(h)/\|h\| \to 0$. But along the parabola $y = x^2$:
\begin{align*}
f(t, t^2) = \frac{t^2 \cdot t^2}{t^4 + t^4} = \frac{t^4}{2t^4} = \frac{1}{2} \quad \text{for all } t \neq 0,
\end{align*}
so $f$ is not even continuous at the origin, let alone differentiable.
**Why directional derivatives miss the failure.** Every line $t \mapsto (ta, tb)$ through the origin approaches $(0,0)$ along a fixed ray, where $f$ is controlled. The parabola $y = x^2$ curves between rays — it is a genuinely two-dimensional approach — and $f$ detects this curvature. Moreover, the map $u \mapsto D_{u}f(0,0)$ is *not* linear:
\begin{align*}
D_{(1,1)}f(0,0) = \frac{1^2}{1} = 1, \qquad \text{but} \quad D_{(1,0)}f(0,0) + D_{(0,1)}f(0,0) = 0 + 0 = 0.
\end{align*}
The non-linearity in $u$ is the obstruction to the existence of a total derivative: if a total derivative existed, Part 2 of the theorem would force $D_{u}f(0,0) = Df_{(0,0)}(u)$ to be linear in $u$.
[/example]
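The behaviour in this example can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper name `directional_quotient` and the sample directions and step sizes are illustrative choices.

```python
def f(x, y):
    """The example function x^2 y / (x^4 + y^2), extended by 0 at the origin."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * x * y / (x ** 4 + y * y)


def directional_quotient(a, b, t):
    """Difference quotient (f(ta, tb) - f(0, 0)) / t at the origin."""
    return f(t * a, t * b) / t


# Along straight lines the quotient settles near a^2 / b (or 0 when b = 0).
for a, b in [(1.0, 1.0), (2.0, -1.0), (1.0, 0.0)]:
    quotients = [directional_quotient(a, b, t) for t in (1e-2, 1e-4, 1e-6)]
    print((a, b), quotients)

# Along the parabola y = x^2 the function values stay near 1/2, not 0.
parabola_values = [f(t, t * t) for t in (1e-1, 1e-2, 1e-3)]
print(parabola_values)
```

The line quotients stabilise as $t$ shrinks, while the values along the parabola do not approach $f(0,0) = 0$, which is the discontinuity established above.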
## The Jacobian Matrix
When the total derivative exists, Part 3 of the theorem above gives a concrete recipe for computing its matrix representation.
[definition: Jacobian Matrix]
Let $U \subseteq \mathbb{R}^m$ be open, $f: U \to \mathbb{R}^n$ differentiable at $a \in U$, with components $f_1, \ldots, f_n$. The **Jacobian matrix** of $f$ at $a$ is the $n \times m$ matrix $J_{f}(a)$ representing $Df_a$ with respect to the standard bases:
\begin{align*}
(J_{f}(a))_{ji} = D_i f_j(a) = \frac{\partial f_j}{\partial x_i}(a).
\end{align*}
The $j$-th row is the gradient $\nabla f_j(a)$; the $i$-th column is the partial derivative $D_if(a)$.
[/definition]
The Jacobian translates the [chain rule](/theorems/323) into matrix multiplication: if $f$ is differentiable at $a$ and $g$ is differentiable at $f(a)$, then
\begin{align*}
J_{g \circ f}(a) = J_{g}(f(a)) \cdot J_{f}(a).
\end{align*}
This is the matrix form of the identity
\begin{align*}
D(g \circ f)_a = Dg_{f(a)} \circ Df_a.
\end{align*}
[example: Jacobian of Polar Coordinates and the Chain Rule]
The polar coordinate map $f: (0, \infty) \times (0, 2\pi) \to \mathbb{R}^2$ is defined by
\begin{align*}
f(r, \theta) = (r\cos\theta, \, r\sin\theta).
\end{align*}
Its components are $f_1(r, \theta) = r\cos\theta$ and $f_2(r, \theta) = r\sin\theta$. The four partial derivatives are:
\begin{align*}
D_1 f_1 = \frac{\partial}{\partial r}(r\cos\theta) = \cos\theta, \qquad D_2 f_1 = \frac{\partial}{\partial \theta}(r\cos\theta) = -r\sin\theta, \\
D_1 f_2 = \frac{\partial}{\partial r}(r\sin\theta) = \sin\theta, \qquad D_2 f_2 = \frac{\partial}{\partial \theta}(r\sin\theta) = r\cos\theta.
\end{align*}
The Jacobian matrix is therefore
\begin{align*}
J_{f}(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}.
\end{align*}
Its determinant is
\begin{align*}
\det J_{f} = r\cos^2\theta + r\sin^2\theta = r > 0
\end{align*}
on the domain $(0, \infty) \times (0, 2\pi)$. Since the determinant is non-zero everywhere, $Df_{(r, \theta)}$ is invertible at every point. By the [Inverse Function Theorem](/theorems/51), $f$ is a local diffeomorphism.
Now consider the chain rule in action. If $g: \mathbb{R}^2 \to \mathbb{R}$ is differentiable and we define $h(r, \theta) = g(r\cos\theta, r\sin\theta)$, then
\begin{align*}
J_{h}(r, \theta) = J_{g}(f(r, \theta)) \cdot J_{f}(r, \theta) = \begin{pmatrix} \frac{\partial g}{\partial x} & \frac{\partial g}{\partial y} \end{pmatrix} \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}.
\end{align*}
Multiplying out and reading off entries:
\begin{align*}
\frac{\partial h}{\partial r} &= \frac{\partial g}{\partial x}\cos\theta + \frac{\partial g}{\partial y}\sin\theta, \\[4pt]
\frac{\partial h}{\partial \theta} &= -\frac{\partial g}{\partial x} \, r\sin\theta + \frac{\partial g}{\partial y} \, r\cos\theta,
\end{align*}
recovering the standard change-of-variable formulas for partial derivatives in polar coordinates. For the concrete case $g(x, y) = x^2 + y^2$ (so $h(r, \theta) = r^2$), this gives $\partial h/\partial r = 2x\cos\theta + 2y\sin\theta = 2r$ and $\partial h/\partial \theta = -2xr\sin\theta + 2yr\cos\theta = 0$, as expected.
[/example]
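As a sanity check on this example, the Jacobian, its determinant, and the chain-rule product can be approximated by central differences. This is a rough sketch under stated assumptions: the helpers `numerical_jacobian` and `matmul`, the step size, and the test point $(r, \theta) = (2, 0.7)$ are illustrative choices, not part of the notes.

```python
import math


def polar(r, theta):
    """The polar coordinate map f(r, theta) = (r cos theta, r sin theta)."""
    return (r * math.cos(theta), r * math.sin(theta))


def g(x, y):
    """The concrete scalar function g(x, y) = x^2 + y^2, returned as a 1-tuple."""
    return (x * x + y * y,)


def numerical_jacobian(func, point, step=1e-6):
    """Central-difference Jacobian; entry [j][i] approximates D_i f_j."""
    columns = []
    for i in range(len(point)):
        plus, minus = list(point), list(point)
        plus[i] += step
        minus[i] -= step
        f_plus, f_minus = func(*plus), func(*minus)
        columns.append([(p - m) / (2 * step) for p, m in zip(f_plus, f_minus)])
    # columns[i][j] approximates D_i f_j; transpose so rows are indexed by j.
    return [list(row) for row in zip(*columns)]


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


r, theta = 2.0, 0.7
J_f = numerical_jacobian(polar, (r, theta))
det_J_f = J_f[0][0] * J_f[1][1] - J_f[0][1] * J_f[1][0]

# Chain rule: J_h(a) should match J_g(f(a)) * J_f(a) for h = g o f.
J_g = numerical_jacobian(g, polar(r, theta))
J_h = numerical_jacobian(lambda rr, tt: g(*polar(rr, tt)), (r, theta))
product = matmul(J_g, J_f)

print("det J_f:", det_J_f)
print("J_h:", J_h)
print("J_g J_f:", product)
```

Up to finite-difference error, the determinant should be close to $r$ and both row vectors close to $(2r, 0)$, in line with the formulas derived in the example.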
## Continuous Partials Imply Differentiability
The example of $f(x,y) = x^2 y/(x^4 + y^2)$ shows that the existence of partial derivatives (or even all directional derivatives) does not imply differentiability. The following theorem provides the practical resolution: add a regularity condition — continuity of the partial derivatives.
[quotetheorem:327]
[citeproof:327]
The proof strategy is to decompose the multivariable increment along coordinate axes and apply the one-dimensional Mean Value Theorem to each piece. Writing $f(a + h) - f(a)$ as a telescoping sum (first change $x_1$, then $x_2$, and so on), each step involves differentiating in a single variable. The MVT produces intermediate points $\xi_i$ at which the partial derivatives are evaluated. The error term then involves differences
\begin{align*}
D_i f(\xi_i, \ldots) - D_i f(a),
\end{align*}
which tend to zero by the *continuity* hypothesis. This is where the continuity of partial derivatives is essential — mere existence at $a$ would not control the values at the nearby intermediate points $\xi_i$.
The theorem is the workhorse for verifying differentiability in practice. If $f$ is built from polynomials, trigonometric functions, exponentials, and other elementary functions by addition, multiplication, and composition, then its partial derivatives are continuous wherever they are defined. The theorem then guarantees differentiability on that domain, without checking the $o(\|h\|)$ condition directly.
[example: Using Continuous Partials to Verify Differentiability]
Define $f: \mathbb{R}^3 \to \mathbb{R}^2$ by
\begin{align*}
f(x, y, z) = \begin{pmatrix} 3x^2 + 4\sin y + e^{6z} \\ xyz \, e^{14x} \end{pmatrix}.
\end{align*}
We compute all six partial derivatives. For the first component $f_1(x, y, z) = 3x^2 + 4\sin y + e^{6z}$:
\begin{align*}
D_1 f_1 = 6x, \qquad D_2 f_1 = 4\cos y, \qquad D_3 f_1 = 6e^{6z}.
\end{align*}
For the second component $f_2(x, y, z) = xyz \, e^{14x}$, the product rule gives:
\begin{align*}
D_1 f_2 &= yz \, e^{14x} + xyz \cdot 14 e^{14x} = yz(1 + 14x)e^{14x}, \\
D_2 f_2 &= xz \, e^{14x}, \\
D_3 f_2 &= xy \, e^{14x}.
\end{align*}
Each of these six functions is a composition of polynomials, trigonometric functions, and exponentials — all continuous on $\mathbb{R}^3$. By the theorem, $f$ is differentiable on all of $\mathbb{R}^3$, with Jacobian
\begin{align*}
J_{f}(x, y, z) = \begin{pmatrix} 6x & 4\cos y & 6e^{6z} \\ yz(1 + 14x)e^{14x} & xz \, e^{14x} & xy \, e^{14x} \end{pmatrix}.
\end{align*}
At the specific point $(1, \pi/2, 0)$, the Jacobian evaluates to
\begin{align*}
J_{f}\!\left(1, \frac{\pi}{2}, 0\right) = \begin{pmatrix} 6 & 0 & 6 \\ \frac{\pi}{2} \cdot 0 \cdot 15 \cdot e^{14} & 1 \cdot 0 \cdot e^{14} & 1 \cdot \frac{\pi}{2} \cdot e^{14} \end{pmatrix} = \begin{pmatrix} 6 & 0 & 6 \\ 0 & 0 & \frac{\pi}{2} e^{14} \end{pmatrix}.
\end{align*}
No $o(\|h\|)$ estimate was needed — the continuous-partials criterion handles everything.
[/example]
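The general Jacobian formula from this example can be compared against a finite-difference approximation. The sketch below is illustrative only: the test point $(0.1, 0.2, 0.3)$, the step size, and the helper names are assumptions made for the check, not part of the notes.

```python
import math


def f(x, y, z):
    """The map from the example, returned as a pair of components."""
    return (3 * x * x + 4 * math.sin(y) + math.exp(6 * z),
            x * y * z * math.exp(14 * x))


def jacobian_formula(x, y, z):
    """The Jacobian computed by hand in the example."""
    e = math.exp(14 * x)
    return [[6 * x, 4 * math.cos(y), 6 * math.exp(6 * z)],
            [y * z * (1 + 14 * x) * e, x * z * e, x * y * e]]


def numerical_jacobian(func, point, step=1e-6):
    """Central-difference Jacobian; entry [j][i] approximates D_i f_j."""
    columns = []
    for i in range(len(point)):
        plus, minus = list(point), list(point)
        plus[i] += step
        minus[i] -= step
        f_plus, f_minus = func(*plus), func(*minus)
        columns.append([(p - m) / (2 * step) for p, m in zip(f_plus, f_minus)])
    return [list(row) for row in zip(*columns)]


point = (0.1, 0.2, 0.3)  # arbitrary test point chosen for illustration
analytic = jacobian_formula(*point)
numeric = numerical_jacobian(f, point)

# Largest entrywise disagreement between the formula and the approximation.
max_gap = max(abs(a - n)
              for row_a, row_n in zip(analytic, numeric)
              for a, n in zip(row_a, row_n))
print("largest entrywise gap:", max_gap)
```

A small gap is consistent with the hand computation. It does not replace the theorem, which is what actually guarantees differentiability.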
[definition: $C^1$ Function]
Let $U \subseteq \mathbb{R}^m$ be open. A function $f: U \to \mathbb{R}^n$ is **$C^1$** (or **continuously differentiable**) if $f$ is differentiable on $U$ and the derivative map $Df: U \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ is continuous.
[/definition]
Equivalently, $f$ is $C^1$ if and only if all partial derivatives $D_j f_i$ exist and are continuous on $U$. The forward direction follows because each partial derivative $D_j f_i$ is the $(i, j)$-entry of the Jacobian, which depends continuously on $x$ if $Df$ is continuous. The reverse is the content of the previous theorem applied at every point, together with the fact that a matrix-valued map is continuous when each of its entries is.
[example: Differentiable but Not $C^1$]
Define $f: \mathbb{R} \to \mathbb{R}$ by
\begin{align*}
f(x) = \begin{cases} x^2 \sin\dfrac{1}{x} & x \neq 0, \\ 0 & x = 0. \end{cases}
\end{align*}
At $x \neq 0$, the function is a product and composition of smooth functions, so standard differentiation gives
\begin{align*}
f'(x) = 2x\sin\frac{1}{x} + x^2 \cdot \cos\frac{1}{x} \cdot \left(-\frac{1}{x^2}\right) = 2x\sin\frac{1}{x} - \cos\frac{1}{x}.
\end{align*}
At $x = 0$, we use the definition directly:
\begin{align*}
f'(0) = \lim_{h \to 0} \frac{f(h) - f(0)}{h} = \lim_{h \to 0} \frac{h^2 \sin(1/h)}{h} = \lim_{h \to 0} h\sin\frac{1}{h}.
\end{align*}
Since $|h \sin(1/h)| \leq |h| \to 0$, the squeeze theorem gives $f'(0) = 0$. So $f$ is differentiable everywhere.
However, $f'$ is not continuous at $0$. As $x \to 0$, the term $2x\sin(1/x) \to 0$ (by the same squeeze argument), but $\cos(1/x)$ oscillates without limit. Along the sequence $x_n = 1/(2\pi n)$:
\begin{align*}
f'(x_n) = 2x_n \sin(2\pi n) - \cos(2\pi n) = 0 - 1 = -1.
\end{align*}
Along $x_n = 1/((2n+1)\pi)$:
\begin{align*}
f'(x_n) = 2x_n \sin((2n+1)\pi) - \cos((2n+1)\pi) = 0 - (-1) = 1.
\end{align*}
Since $f'(x)$ admits subsequential limits $-1$ and $+1$ as $x \to 0$, while $f'(0) = 0$, the derivative $f'$ is discontinuous at $0$. The function is differentiable but not $C^1$.
[/example]
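The two subsequential limits can be observed numerically. This is a small sketch using the derivative formula from the example; the function name `f_prime` and the sampled indices are illustrative choices.

```python
import math


def f_prime(x):
    """Derivative of x^2 sin(1/x), with the value 0 at x = 0 from the example."""
    if x == 0.0:
        return 0.0
    return 2 * x * math.sin(1 / x) - math.cos(1 / x)


rows = []
for n in (1, 10, 100, 1000):
    x_even = 1 / (2 * math.pi * n)        # here cos(1/x) is close to +1
    x_odd = 1 / ((2 * n + 1) * math.pi)   # here cos(1/x) is close to -1
    rows.append((n, f_prime(x_even), f_prime(x_odd)))
    print(rows[-1])
```

Both sequences of inputs tend to $0$, yet the derivative values stay near $-1$ and $+1$ respectively, so $f'$ has no limit at $0$.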
## The Mean Value Inequality
The classical Mean Value Theorem — that there exists $c$ between $a$ and $b$ with $f(b) - f(a) = f'(c)(b - a)$ — does not generalise to vector-valued functions.
[example: Failure of the Exact Mean Value Theorem for Vector-Valued Maps]
Consider $f: [0, 2\pi] \to \mathbb{R}^2$ defined by $f(t) = (\cos t, \sin t)$. The increment over the full interval is
\begin{align*}
f(2\pi) - f(0) = (1, 0) - (1, 0) = \mathbf{0}.
\end{align*}
If an exact MVT held, there would exist $c \in (0, 2\pi)$ with
\begin{align*}
f(2\pi) - f(0) = f'(c) \cdot (2\pi - 0),
\end{align*}
i.e., $\mathbf{0} = 2\pi(-\sin c, \cos c)$. This requires both $\sin c = 0$ and $\cos c = 0$ simultaneously, which is impossible since $\sin^2 c + \cos^2 c = 1$. The exact MVT fails because the two components of $f'$ vanish at different points: $\sin c = 0$ when $c \in \{0, \pi, 2\pi\}$ and $\cos c = 0$ when $c \in \{\pi/2, 3\pi/2\}$. No single $c$ works for both components.
The underlying issue is geometric: $f$ traces the unit circle, returning to its starting point. The total displacement is zero despite the derivative never vanishing — the velocity $(-\sin t, \cos t)$ has unit norm everywhere but keeps changing direction. An inequality, rather than an equality, is the correct generalisation.
[/example]
The following theorem states this inequality precisely.
[quotetheorem:328]
[citeproof:328]
The key idea is the **reduction to the scalar case**: instead of trying to find an exact intermediate point (which fails for vector-valued functions), project the increment $f(b) - f(a)$ onto a fixed unit vector $\xi$ via the inner product. The scalar function
\begin{align*}
g(t) = \langle \xi, f(\gamma(t))\rangle
\end{align*}
does satisfy the one-dimensional MVT, giving
\begin{align*}
|g(1) - g(0)| = |g'(c)| \leq M\|b - a\|.
\end{align*}
Choosing $\xi$ optimally — in the direction of $f(b) - f(a)$ — recovers the full vector inequality.
The hypothesis that the line segment $[a, b]$ lies in $U$ is essential. On non-convex domains, the straight-line path from $a$ to $b$ might leave $U$, and the inequality can then fail. One can replace the line segment with any $C^1$ path in $U$ connecting $a$ to $b$, but the length of the path must then appear on the right-hand side.
[example: The Mean Value Inequality in a Concrete Computation]
Define $f: \mathbb{R}^2 \to \mathbb{R}^2$ by $f(x, y) = (e^x \cos y, \, e^x \sin y)$. Its Jacobian is
\begin{align*}
J_{f}(x, y) = \begin{pmatrix} e^x \cos y & -e^x \sin y \\ e^x \sin y & e^x \cos y \end{pmatrix}.
\end{align*}
We compute the Frobenius norm. The sum of squares of all four entries is
\begin{align*}
e^{2x}\cos^2 y + e^{2x}\sin^2 y + e^{2x}\sin^2 y + e^{2x}\cos^2 y = 2e^{2x},
\end{align*}
so
\begin{align*}
\|f'(x, y)\| = \sqrt{2e^{2x}} = e^x\sqrt{2}.
\end{align*}
Now apply the Mean Value Inequality on the segment from $a = (0, 0)$ to $b = (1, 1)$, parametrised by $\gamma(t) = (t, t)$ for $t \in [0, 1]$. The derivative bound along this segment is
\begin{align*}
\|f'(\gamma(t))\| = e^t\sqrt{2} \leq e^1 \cdot \sqrt{2} = e\sqrt{2} \quad \text{for all } t \in [0, 1].
\end{align*}
The segment has length $|b - a| = |(1, 1)| = \sqrt{2}$. The Mean Value Inequality gives
\begin{align*}
|f(1, 1) - f(0, 0)| \leq e\sqrt{2} \cdot \sqrt{2} = 2e \approx 5.44.
\end{align*}
Let us verify this against the actual value:
\begin{align*}
f(1, 1) &= (e\cos 1, \, e\sin 1) \approx (1.469, \, 2.287), \\
f(0, 0) &= (1, \, 0).
\end{align*}
The squared norm of the difference is
\begin{align*}
(e\cos 1 - 1)^2 + (e\sin 1)^2 &= e^2\cos^2 1 - 2e\cos 1 + 1 + e^2\sin^2 1 \\
&= e^2 - 2e\cos 1 + 1 \\
&\approx 7.389 - 2(2.718)(0.5403) + 1 \approx 5.452,
\end{align*}
giving
\begin{align*}
|f(1, 1) - f(0, 0)| \approx \sqrt{5.452} \approx 2.33,
\end{align*}
comfortably below the bound $2e \approx 5.44$. The inequality is not tight here because we used the maximum of $\|f'\|$ over the segment rather than an average.
[/example]
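The arithmetic in this example can be reproduced with a few lines of code. The sketch below assumes the Frobenius-norm formula $\sqrt{2}\,e^x$ derived above; the sampling of the segment at 1001 points is an illustrative choice.

```python
import math


def f(x, y):
    """The map (e^x cos y, e^x sin y) from the example."""
    return (math.exp(x) * math.cos(y), math.exp(x) * math.sin(y))


def frobenius_norm_of_derivative(x, y):
    """Frobenius norm of the Jacobian, sqrt(2 e^{2x}) = sqrt(2) e^x."""
    return math.sqrt(2.0) * math.exp(x)


a, b = (0.0, 0.0), (1.0, 1.0)

# Sample the segment gamma(t) = a + t (b - a) and take the largest norm.
samples = [k / 1000 for k in range(1001)]
M = max(frobenius_norm_of_derivative(a[0] + t * (b[0] - a[0]),
                                     a[1] + t * (b[1] - a[1]))
        for t in samples)

length = math.hypot(b[0] - a[0], b[1] - a[1])
bound = M * length

fa, fb = f(*a), f(*b)
actual = math.hypot(fb[0] - fa[0], fb[1] - fa[1])

print("bound :", bound)
print("actual:", actual)
```

The printed bound should agree with $2e$ and the actual increment should sit well below it, matching the hand computation.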
## Zero Derivative and Connectedness
The Mean Value Inequality immediately implies that a function with zero derivative is locally constant. To promote this to a global conclusion, we need a topological hypothesis on the domain.
[quotetheorem:329]
[citeproof:329]
The proof is an elegant application of the "open-and-closed" characterisation of connectedness from Section 7. The set
\begin{align*}
S = \{x \in U : f(x) = f(a)\}
\end{align*}
is shown to be open (by the [Mean Value Inequality](/theorems/328): if $x_0 \in S$ and $y \in B_\delta(x_0) \subseteq U$, the derivative bound $M = 0$ gives
\begin{align*}
|f(y) - f(x_0)| \leq 0 \cdot |y - x_0| = 0,
\end{align*}
so $f(y) = f(x_0) = f(a)$ and $y \in S$) and closed in $U$ (by continuity, which follows from [differentiability](/theorems/322)). Since $U$ is connected and $S \neq \varnothing$, the only possibility is $S = U$.
[example: Connectedness is Necessary — A Careful Counterexample]
Define $U = B_1(-2, 0) \cup B_1(2, 0) \subseteq \mathbb{R}^2$, the union of two disjoint open discs of radius $1$ centred at $(-2, 0)$ and $(2, 0)$. The set $U$ is open but disconnected — the two discs are separated by a gap.
Define $f: U \to \mathbb{R}^2$ by
\begin{align*}
f(x, y) = \begin{cases} (1, \, 0) & \text{if } (x,y) \in B_1(-2, 0), \\ (0, \, 1) & \text{if } (x,y) \in B_1(2, 0). \end{cases}
\end{align*}
On each disc, $f$ is constant, so all partial derivatives vanish:
\begin{align*}
D_1 f(x, y) = (0, 0), \qquad D_2 f(x, y) = (0, 0) \qquad \text{for all } (x, y) \in U.
\end{align*}
In particular, $Df_a = 0$ for all $a \in U$. Yet $f$ is not constant on $U$: it takes the value $(1, 0)$ on the left disc and $(0, 1)$ on the right disc.
The failure traces directly to the topology of $U$. The set $S = \{x \in U : f(x) = (1, 0)\} = B_1(-2, 0)$ is both open and closed in $U$, but $S \neq U$ because $U$ has two connected components. The open-and-closed argument in the proof of the constancy theorem produces $S = U$ only when $U$ is connected — when no non-trivial clopen subset exists. On a disconnected domain, each connected component is independently free to carry a different constant value.
[/example]
The result has a useful generalisation: if $Df_a = Dg_a$ for all $a \in U$ (with $U$ connected), then applying the theorem to $f - g$ shows that $f - g$ is constant on $U$, so $f$ and $g$ differ by a constant vector. This is the multivariable analogue of the fact that two antiderivatives of the same function differ by a constant.
## The Inverse Function Theorem
All the results so far — directional and partial derivatives, the Jacobian, continuous-partials-implies-differentiability, the Mean Value Inequality — converge in the Inverse Function Theorem, which answers the question: *when is a differentiable map locally invertible?*
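The answer is strikingly simple: a $C^1$ map $f: U \to \mathbb{R}^n$ on an open set $U \subseteq \mathbb{R}^n$ is locally invertible near $a$ with a differentiable inverse if and only if the linear approximation $Df_a$ is invertible. (The qualifier matters: $x \mapsto x^3$ is invertible near $0$ although its derivative vanishes there, but its inverse is not differentiable at $0$.) The linear algebra condition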
\begin{align*}
\det J_{f}(a) \neq 0
\end{align*}
controls the nonlinear geometry of $f$ near $a$.
[quotetheorem:51]
[citeproof:51]
The proof is a tour de force that combines several earlier results. The strategy is to reduce to the case $Df_a = \mathrm{Id}$ (by precomposing with the inverse of the derivative), write $f(x) = x + \varphi(x)$ where $D\varphi_a = 0$, and use the continuity of $D\varphi$ (from the $C^1$ hypothesis) to ensure $\varphi$ is a contraction near $a$. The Contraction Mapping Theorem — applied on the complete metric space $\overline{B_r(a)}$ — then provides, for each target $y$ near $f(a)$, a unique preimage $x$ via the iteration $x_{k+1} = y - \varphi(x_k)$. The contraction estimate also yields Lipschitz continuity of the inverse. Finally, the derivative formula
\begin{align*}
Dg_y = [Df_{g(y)}]^{-1}
\end{align*}
follows from the chain rule applied to $f \circ g = \mathrm{Id}$, and the continuity of $Dg$ follows from the continuity of matrix inversion on $\mathrm{GL}_n(\mathbb{R})$.
The $C^1$ hypothesis cannot be weakened to mere differentiability: the theorem needs the derivative to be continuous near $a$, not just at $a$, because the contraction estimate on $\varphi$ uses the [Mean Value Inequality](/theorems/328), which requires derivative bounds on an entire ball.
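The fixed-point iteration $x_{k+1} = y - \varphi(x_k)$ from the proof sketch can be tried on a toy map. The following is a minimal sketch, not the construction used in the proof: the perturbation $\varphi(x, y) = (y^2/2, \, x^2/2)$, the target point, and the iteration count are hypothetical choices, picked so that $D\varphi_0 = 0$ and $\varphi$ is a contraction near the origin.

```python
import math


def phi(x, y):
    """Hypothetical perturbation with zero derivative at the origin."""
    return (0.5 * y * y, 0.5 * x * x)


def f(x, y):
    """The map f = Id + phi, whose derivative at the origin is the identity."""
    px, py = phi(x, y)
    return (x + px, y + py)


def local_inverse(target, iterations=60):
    """Solve f(x) = target near the origin by iterating x <- target - phi(x)."""
    x, y = 0.0, 0.0
    for _ in range(iterations):
        px, py = phi(x, y)
        x, y = target[0] - px, target[1] - py
    return (x, y)


target = (0.1, -0.05)
preimage = local_inverse(target)
image = f(*preimage)
residual = math.hypot(image[0] - target[0], image[1] - target[1])

print("preimage:", preimage)
print("residual:", residual)
```

Because the iterates stay in a small ball where $\|D\varphi\|$ is well below $1$, the residual shrinks geometrically. For targets far from $f(a)$ the iteration need not converge, which reflects the local nature of the theorem.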
[example: Local but Not Global Injectivity]
Define $f: \mathbb{R}^2 \to \mathbb{R}^2$ by
\begin{align*}
f(x, y) = (e^x \cos y, \, e^x \sin y).
\end{align*}
This is the complex exponential $z \mapsto e^z$ written in real coordinates. Its Jacobian (computed in the earlier example) has determinant $e^{2x} > 0$ everywhere, so $Df_{(x, y)}$ is invertible at every point. By the Inverse Function Theorem, $f$ is a local diffeomorphism.
However, $f$ is not globally injective. For any $(x, y)$:
\begin{align*}
f(x, y + 2\pi) &= (e^x \cos(y + 2\pi), \, e^x \sin(y + 2\pi)) \\
&= (e^x \cos y, \, e^x \sin y) = f(x, y).
\end{align*}
The map is $2\pi$-periodic in $y$, so distinct points $(x, y)$ and $(x, y + 2\pi)$ have the same image. Geometrically, $f$ maps the entire plane onto the punctured plane $\mathbb{R}^2 \setminus \{(0,0)\}$ (since $e^x > 0$ for all $x$, the image never reaches the origin), and every point in this punctured plane has infinitely many preimages — one in each horizontal strip of height $2\pi$.
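This illustrates the fundamental limitation of the Inverse Function Theorem: it provides *local* invertibility near each point, but says nothing about *global* invertibility. Global injectivity requires additional hypotheses. For instance, a local diffeomorphism from a connected domain that is proper (preimages of compact sets are compact) and maps into a simply connected target such as $\mathbb{R}^n$ is globally injective. Simple connectedness of the domain alone is not enough, as this example shows: $\mathbb{R}^2$ is simply connected, yet $f$ is not injective.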
[/example]
[example: What Happens When the Derivative Is Singular]
Define $f: \mathbb{R}^2 \to \mathbb{R}^2$ by
\begin{align*}
f(x, y) = (x^2 - y^2, \, 2xy).
\end{align*}
This is $z \mapsto z^2$ in real coordinates, where $z = x + iy$. The Jacobian is
\begin{align*}
J_{f}(x, y) = \begin{pmatrix} 2x & -2y \\ 2y & 2x \end{pmatrix},
\end{align*}
with determinant
\begin{align*}
\det J_{f}(x, y) = 4x^2 + 4y^2 = 4(x^2 + y^2).
\end{align*}
This vanishes *only* at the origin $(0, 0)$. At every other point, the Inverse Function Theorem guarantees local invertibility. At the origin, the theorem does not apply — and indeed, $f$ fails to be locally injective there. Every neighbourhood of $(0, 0)$ contains points $p$ and $-p$ (for $p \neq \mathbf{0}$), and
\begin{align*}
f(-x, -y) = ((-x)^2 - (-y)^2, \, 2(-x)(-y)) = (x^2 - y^2, \, 2xy) = f(x, y),
\end{align*}
so $f(p) = f(-p)$ for all $p$. The map is always at least $2$-to-$1$ near the origin (in fact, exactly $2$-to-$1$ away from the origin, since $w = z^2$ has two square roots for $w \neq 0$). The vanishing of $\det J_{f}$ at the origin reflects this folding: the linear approximation $Df_{(0, 0)} = 0$ maps everything to zero, losing all directional information.
[/example]
### The Derivative Formula for the Inverse
The formula
\begin{align*}
Dg_y = [Df_{g(y)}]^{-1}
\end{align*}
is more than a theoretical conclusion — it provides a practical computation method. If $f$ maps $(r, \theta)$ to $(x, y)$ via polar coordinates and we want the Jacobian of the local inverse (mapping $(x, y)$ back to $(r, \theta)$), we invert the $2 \times 2$ Jacobian of $f$. Since
\begin{align*}
J_{f}(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}
\end{align*}
and $\det J_{f} = r$, the inverse is
\begin{align*}
J_{f}^{-1} = \frac{1}{r}\begin{pmatrix} r\cos\theta & r\sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta/r & \cos\theta/r \end{pmatrix}.
\end{align*}
Reading off the entries gives the formulas for the partial derivatives of the inverse:
\begin{align*}
\frac{\partial r}{\partial x} = \cos\theta, \qquad \frac{\partial r}{\partial y} = \sin\theta, \qquad \frac{\partial \theta}{\partial x} = -\frac{\sin\theta}{r}, \qquad \frac{\partial \theta}{\partial y} = \frac{\cos\theta}{r}.
\end{align*}
These can be verified directly using $r = \sqrt{x^2 + y^2}$ and, where $x \neq 0$, $\theta = \arctan(y/x)$ up to a locally constant multiple of $\pi$ (which does not affect derivatives):
\begin{align*}
\frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2 + y^2}} = \frac{r\cos\theta}{r} = \cos\theta, \qquad \frac{\partial \theta}{\partial x} = \frac{-y/x^2}{1 + y^2/x^2} = \frac{-y}{x^2 + y^2} = \frac{-r\sin\theta}{r^2} = -\frac{\sin\theta}{r},
\end{align*}
confirming the formula.
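All four entries can also be checked numerically against an explicit local inverse. This is a sketch under stated assumptions: `math.atan2` is used as one choice of local inverse for $\theta$ (it agrees with $\theta$ for $\theta \in (0, \pi)$), and the test point and step size are illustrative.

```python
import math


def inverse_polar(x, y):
    """One local inverse of the polar map: (x, y) -> (r, theta) via atan2."""
    return (math.hypot(x, y), math.atan2(y, x))


def predicted_jacobian(r, theta):
    """The matrix obtained above by inverting the Jacobian of the polar map."""
    return [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta) / r, math.cos(theta) / r]]


def partials(func, x, y, step=1e-6):
    """Central-difference Jacobian of a map R^2 -> R^2 at (x, y)."""
    dx = [(p - m) / (2 * step)
          for p, m in zip(func(x + step, y), func(x - step, y))]
    dy = [(p - m) / (2 * step)
          for p, m in zip(func(x, y + step), func(x, y - step))]
    return [[dx[0], dy[0]], [dx[1], dy[1]]]


r, theta = 2.0, 0.7
x, y = r * math.cos(theta), r * math.sin(theta)

numeric = partials(inverse_polar, x, y)
predicted = predicted_jacobian(r, theta)
max_gap = max(abs(n - p)
              for row_n, row_p in zip(numeric, predicted)
              for n, p in zip(row_n, row_p))
print("largest entrywise gap:", max_gap)
```

The point of the formula is that the right-hand side needs only the Jacobian of $f$, so the comparison with an explicit inverse is a check rather than a requirement.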
[problem]
Let $F: \mathbb{R}^n \to \mathbb{R}$ be differentiable, and suppose
\begin{align*}
D_1 F(x) = D_2 F(x) = \cdots = D_n F(x)
\end{align*}
for all $x \in \mathbb{R}^n$. Prove that there exists a differentiable function $h: \mathbb{R} \to \mathbb{R}$ such that
\begin{align*}
F(x_1, \ldots, x_n) = h(x_1 + \cdots + x_n).
\end{align*}
[/problem]
[solution]
**Step 1: Construct a change of variables.** Define $\vartheta: \mathbb{R}^n \to \mathbb{R}^n$ by
\begin{align*}
\vartheta(y_1, \ldots, y_n) = (y_1, \, y_2, \, \ldots, \, y_{n-1}, \, y_n - y_1 - \cdots - y_{n-1}).
\end{align*}
This is a linear map with matrix
\begin{align*}
J_\vartheta = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & 1 \end{pmatrix}.
\end{align*}
The matrix is lower triangular with ones on the diagonal, so its determinant equals $1$ (equivalently, $\vartheta$ is a shear transformation). Hence $\vartheta$ is an invertible linear map and in particular a diffeomorphism $\mathbb{R}^n \to \mathbb{R}^n$.
**Step 2: Show $G = F \circ \vartheta$ is independent of $y_1, \ldots, y_{n-1}$.** By the [chain rule](/theorems/323), for $1 \leq k \leq n - 1$:
\begin{align*}
D_k G(y) = F'(\vartheta(y))(D_k \vartheta(y)).
\end{align*}
The $k$-th column of $J_\vartheta$ is $e_k - e_n$, so $D_k \vartheta = e_k - e_n$ (constant, since $\vartheta$ is linear). Therefore
\begin{align*}
D_k G(y) = F'(\vartheta(y))(e_k - e_n) = D_k F(\vartheta(y)) - D_n F(\vartheta(y)).
\end{align*}
By hypothesis, all partial derivatives of $F$ are equal at every point, so $D_k G(y) = 0$ for all $y \in \mathbb{R}^n$ and all $k = 1, \ldots, n - 1$.
**Step 3: Apply the constancy theorem.** For each fixed $y_n \in \mathbb{R}$, the map
\begin{align*}
(y_1, \ldots, y_{n-1}) \mapsto G(y_1, \ldots, y_{n-1}, y_n)
\end{align*}
is differentiable with zero derivative on the connected open set $\mathbb{R}^{n-1}$. By the [zero-derivative-implies-constancy theorem](/theorems/329), this map is constant: $G(y_1, \ldots, y_n)$ depends only on $y_n$. Define $h: \mathbb{R} \to \mathbb{R}$ by $h(z) = G(0, \ldots, 0, z)$.
**Step 4: Translate back.** The inverse of $\vartheta$ maps $(y_1, \ldots, y_n)$ to $(y_1, \ldots, y_{n-1}, y_n + y_1 + \cdots + y_{n-1})$. Given $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, set $y = \vartheta^{-1}(x)$, so $y_k = x_k$ for $k < n$ and $y_n = x_1 + \cdots + x_n$. Then:
\begin{align*}
F(x) = F(\vartheta(y)) = G(y) = h(y_n) = h(x_1 + \cdots + x_n).
\end{align*}
Differentiability of $h$ follows from that of $G$: $h(z) = G(0, \ldots, 0, z)$ is the composition of $G$ with the linear inclusion $z \mapsto (0, \ldots, 0, z)$, both of which are differentiable.
[/solution]

---

Having developed the theory of first-order differentiation for maps between Euclidean spaces (the total derivative as a linear map, partial derivatives as its matrix entries, the chain rule as composition of linear maps), we now ask: what happens when we differentiate *again*? In single-variable calculus, the second derivative $f''(a)$ is a number that captures curvature: positive second derivative means the graph is concave up, negative means concave down. In higher dimensions, the situation is richer. The first derivative $Df_a$ is a linear map $\mathbb{R}^m \to \mathbb{R}^n$; its derivative, the second derivative $f''(a)$, is a *bilinear* map $\mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}^n$, encoding how the linear approximation itself changes from point to point.
The second derivative is the natural home of the Hessian matrix, the foundation for Taylor's theorem in several variables, and the tool that classifies critical points as local maxima, minima, or saddle points. Its most striking property — the symmetry of mixed partial derivatives $D_i D_j f = D_j D_i f$ — is not automatic but requires a regularity hypothesis (continuity of the second derivative), and its failure without this hypothesis is a genuine pathology worth understanding.
This section builds on the total derivative and chain rule from Section 9, the partial derivatives and Jacobian from Section 10, and the continuous-partials-implies-differentiability theorem that makes the $C^1$ condition practical. The key new idea is that differentiating the derivative map $Df: U \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ requires treating $\mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ as a normed vector space in its own right — which we can do, because it is isomorphic to $\mathbb{R}^{nm}$ with the Frobenius norm.
[motivation]
### Why Bilinear Maps?
In one variable, the second derivative $f''(a)$ is a number. We can think of it as a bilinear form on $\mathbb{R} \times \mathbb{R}$: the map $(h, k) \mapsto f''(a) \cdot h \cdot k$. This bilinear form happens to be determined by a single number because $\dim \mathbb{R} = 1$. In $m$ dimensions, the second derivative at $a$ must capture how $Df_{a + h}(k)$ changes as $h$ varies, for each fixed $k$. This change is linear in $h$ (to first order) and also linear in $k$ (because $Df_{a + h}$ is a linear map). The object that is linear in two separate arguments is a bilinear map, and indeed $f''(a)$ lives in the space of bilinear maps $\mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}^n$.
### The Nested Operator Perspective
Formally, the derivative of the map $Df: U \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ at $a$ is a linear map from $\mathbb{R}^m$ to $\mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$:
\begin{align*}
f''(a) \in \mathcal{L}(\mathbb{R}^m, \, \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)).
\end{align*}
Given $h \in \mathbb{R}^m$, the object $f''(a)(h)$ is itself a linear map $\mathbb{R}^m \to \mathbb{R}^n$; evaluating it at $k$ gives a vector $f''(a)(h)(k) \in \mathbb{R}^n$. The canonical isomorphism
\begin{align*}
\mathcal{L}(\mathbb{R}^m, \, \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)) \cong \mathrm{Bil}(\mathbb{R}^m \times \mathbb{R}^m, \, \mathbb{R}^n)
\end{align*}
identifies this nested structure with a bilinear map by defining $T(h, k) = [f''(a)(h)](k)$. We freely use the bilinear notation $f''(a)(h, k)$ throughout.
### Why Symmetry is Not Free
If $f: \mathbb{R}^2 \to \mathbb{R}$ is twice differentiable, one might expect $D_1 D_2 f = D_2 D_1 f$ — that the order of differentiation does not matter. In one variable this is vacuous (there is only one direction), but in several variables it is a non-trivial claim. The second difference
\begin{align*}
\Delta(s, t) = f(a + se_1 + te_2) - f(a + se_1) - f(a + te_2) + f(a)
\end{align*}
can be computed as "$D_1$ then $D_2$" or "$D_2$ then $D_1$," and the two routes give equal results — but only because the Mean Value Theorem introduces intermediate points that must converge to the same limit. This convergence requires *continuity* of the second derivative, which is the content of Schwarz's theorem. Without it, pathological functions exist where the mixed partials disagree.
[/motivation]
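To make the possibility of unequal mixed partials concrete, one can experiment with the standard textbook function $f(x, y) = xy(x^2 - y^2)/(x^2 + y^2)$ with $f(0,0) = 0$. This function is not taken from these notes; it is a well-known example brought in here purely for illustration, and the nested central differences below (with a much smaller inner step than outer step) are only a rough numerical sketch.

```python
def f(x, y):
    """Standard textbook example: xy(x^2 - y^2)/(x^2 + y^2), with f(0, 0) = 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x * x - y * y) / (x * x + y * y)


def d1(x, y, h=1e-7):
    """Central-difference approximation of D_1 f at (x, y)."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def d2(x, y, h=1e-7):
    """Central-difference approximation of D_2 f at (x, y)."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)


k = 1e-3  # outer step, deliberately much larger than the inner step h

# D_2 (D_1 f)(0, 0): differentiate D_1 f along the y-axis.
d2_d1 = (d1(0.0, k) - d1(0.0, -k)) / (2 * k)

# D_1 (D_2 f)(0, 0): differentiate D_2 f along the x-axis.
d1_d2 = (d2(k, 0.0) - d2(-k, 0.0)) / (2 * k)

print("D_2 D_1 f(0,0) is approximately", d2_d1)
print("D_1 D_2 f(0,0) is approximately", d1_d2)
```

The two approximations land near $-1$ and $+1$ respectively, consistent with the exact values $D_1 f(0, y) = -y$ and $D_2 f(x, 0) = x$ for this function. The order of the two limits matters here because the second-order partials are not continuous at the origin.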
## Twice Differentiable Maps
[definition: Twice Differentiable at a Point]
Let $U \subseteq \mathbb{R}^m$ be open, $f: U \to \mathbb{R}^n$, and $a \in U$. Suppose there exists an open neighbourhood $V \subseteq U$ of $a$ such that $f$ is differentiable at every point of $V$. Then $f$ is **twice differentiable at $a$** if the derivative map $Df: V \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ is differentiable at $a$.
[/definition]
The definition asks for two things: first, $f$ must be differentiable on a whole neighbourhood of $a$ (so that $Df$ is defined as a function near $a$), and second, this function $Df$ must itself be differentiable at $a$. The first condition is not implied by differentiability at $a$ alone: it requires $f$ to be differentiable at nearby points as well, providing the "raw material" for differentiating $Df$.
[definition: Second Derivative]
If $f$ is twice differentiable at $a$, the **second derivative of $f$ at $a$**, denoted $f''(a)$, is the derivative of $Df$ at $a$:
\begin{align*}
f''(a) \in \mathcal{L}(\mathbb{R}^m, \, \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)) \cong \mathrm{Bil}(\mathbb{R}^m \times \mathbb{R}^m, \, \mathbb{R}^n).
\end{align*}
Under the bilinear identification, $f''(a)(h, k) = [f''(a)(h)](k)$.
[/definition]
The following theorem makes the bilinear interpretation precise and provides a practical criterion for twice differentiability.
[quotetheorem:330]
[citeproof:330]
The forward direction is a direct unwinding of the definition: differentiability of $Df$ at $a$ gives an expansion
\begin{align*}
Df_{a + h} = Df_a + f''(a)(h) + |h|E(h),
\end{align*}
where $E(h)$ is an operator-valued error with $\|E(h)\| \to 0$. Evaluating at $k$ produces the bilinear term $[f''(a)(h)](k)$. The reverse direction is the more interesting part: given a pointwise-in-$k$ expansion, one must promote the convergence $\varepsilon(h, k) \to \mathbf{0}$ (for each $k$) to operator-norm convergence $\|E(h)\| \to 0$. In finite dimensions, evaluating on the $m$ basis vectors $e_1, \ldots, e_m$ suffices: $\|E(h)\|^2 \le \sum_j |E(h)(e_j)|^2 \to 0$ (with equality when $\|\cdot\|$ is the Frobenius norm, and inequality for the operator norm). This step would fail in infinite dimensions, where the norm cannot be controlled by finitely many evaluations.
[example: Second Derivative of a Linear Map]
Let $f(x) = Ax$ for a fixed matrix $A \in M_{n \times m}(\mathbb{R})$. Then $Df_a = A$ for all $a$ — the derivative is a constant function. The derivative map
\begin{align*}
Df: \mathbb{R}^m \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n), \qquad a \mapsto A,
\end{align*}
is constant, so its derivative is zero: $f''(a) = 0$ for all $a$. Equivalently, $f''(a)(h, k) = \mathbf{0}$ for all $h, k$. This reflects the fact that linear maps have no curvature — their graph is "flat" in every direction.
[/example]
[example: Second Derivative of the Squared Norm]
Define $f: \mathbb{R}^m \to \mathbb{R}$ by $f(x) = |x|^2$. From Section 9, the first derivative is
\begin{align*}
Df_a(h) = 2\langle a, h\rangle.
\end{align*}
The derivative map is $Df: \mathbb{R}^m \to \mathcal{L}(\mathbb{R}^m, \mathbb{R})$, where $Df_a$ is the linear functional $h \mapsto 2\langle a, h\rangle$. To differentiate $Df$, we compute:
\begin{align*}
Df_{a + h}(k) - Df_a(k) = 2\langle a + h, k\rangle - 2\langle a, k\rangle = 2\langle h, k\rangle.
\end{align*}
There is no error term — the expansion is exact. Therefore $f$ is twice differentiable everywhere, with
\begin{align*}
f''(a)(h, k) = 2\langle h, k\rangle
\end{align*}
for all $a$. The second derivative is the constant bilinear form $2\langle \cdot, \cdot \rangle$, independent of $a$. When $m = 1$, this gives $f''(a) = 2$ — the familiar $(x^2)'' = 2$.
The Hessian matrix is $H_{ij} = f''(a)(e_i, e_j) = 2\langle e_i, e_j\rangle = 2\delta_{ij}$, so $H = 2I_m$. This is positive definite, confirming that $|x|^2$ is strictly convex, as expected for a function whose graph is a paraboloid.
[/example]
[example: Second Derivative of a Bilinear Map]
Let $B: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}^p$ be a fixed bilinear map. From Section 9, the first derivative at $(a, b)$ is
\begin{align*}
DB_{(a, b)}(h, k) = B(a, k) + B(h, b).
\end{align*}
To find $B''$, we compute the change in $DB$:
\begin{align*}
DB_{((a, b) + (h, k))}(u, v) &= B(a + h, v) + B(u, b + k) \\
&= B(a, v) + B(h, v) + B(u, b) + B(u, k).
\end{align*}
Subtracting $DB_{(a, b)}(u, v) = B(a, v) + B(u, b)$:
\begin{align*}
DB_{((a, b) + (h, k))}(u, v) - DB_{(a, b)}(u, v) = B(h, v) + B(u, k).
\end{align*}
This is exactly linear in $(h, k)$ with no error term. Therefore $B''(a, b)$ is the constant bilinear map
\begin{align*}
B''(a, b)((h, k), (u, v)) = B(h, v) + B(u, k),
\end{align*}
independent of $(a, b)$. Since $B''$ is constant, all third and higher derivatives vanish — bilinear maps are the "quadratics" of the multivariable world.
For the concrete case of matrix multiplication $B(X, Y) = XY$ on $M_n(\mathbb{R})$, this gives
\begin{align*}
B''(A, A)((H, K), (U, V)) = HV + UK,
\end{align*}
independent of the base point. Taking $(H, K) = (U, V) = (H, H)$ gives $2H^2$, which is twice the second-order term in the expansion $(A + H)^2 = A^2 + AH + HA + H^2$.
[/example]
## Second Derivatives and the Hessian
The bilinear map $f''(a)$ is an abstract object. To make it computable, we express it in terms of the iterated partial derivatives that are familiar from multivariable calculus.
[definition: Iterated Directional Derivative]
Let $f$ be differentiable on a neighbourhood of $a$, and let $u, v \in \mathbb{R}^m$. The **iterated directional derivative** of $f$ at $a$ is
\begin{align*}
D_{u} D_{v} f(a) = D_{u}(D_{v}f)(a),
\end{align*}
provided $D_{v}f$ is defined on a neighbourhood of $a$ and the outer directional derivative exists.
[/definition]
[definition: Hessian Matrix]
Let $f: U \to \mathbb{R}$ be twice differentiable at $a \in U \subseteq \mathbb{R}^m$. The **Hessian matrix** of $f$ at $a$ is the $m \times m$ matrix $H_f(a)$ with entries
\begin{align*}
(H_f(a))_{ij} = D_i D_j f(a) = \frac{\partial^2 f}{\partial x_i \, \partial x_j}(a).
\end{align*}
[/definition]
The following theorem shows the Hessian is not an ad hoc construction but the matrix representation of the intrinsic second derivative.
[quotetheorem:331]
[citeproof:331]
The proof exploits the [characterisation theorem](/theorems/330): the expansion $Df_{a + h}(k) = Df_a(k) + f''(a)(h, k) + o(|h|)$ is specialised to $h = te_i$ and $k = e_j$. The left side becomes $D_j f(a + te_i)$ and the first term on the right becomes $D_j f(a)$. Subtracting $D_j f(a)$, dividing by $t$, and taking $t \to 0$ recovers the iterated partial derivative $D_i D_j f(a)$ on the left. On the right, bilinearity gives $f''(a)(te_i, e_j)/t = f''(a)(e_i, e_j)$, and the error $o(|t|)/t$ vanishes.
The theorem says: to compute $f''(a)(h, k)$ for arbitrary $h$ and $k$, expand both in the standard basis and use bilinearity:
\begin{align*}
f''(a)(h, k) = \sum_{i=1}^m \sum_{j=1}^m h_i k_j \, D_i D_j f(a).
\end{align*}
For scalar-valued $f$, this is $h^T H_f(a) \, k$ — the Hessian matrix acts on the two vectors by the standard quadratic/bilinear form formula.
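As a small numerical illustration of the double-sum formula, the following Python sketch evaluates $h^T H k$ from a Hessian stored as a nested list. The helper name `bilinear_form` and the test vectors are assumptions made for this illustration. The matrix $2I_3$ is the Hessian of the squared norm computed in the earlier example, so the result should agree with $2\langle h, k\rangle$.

```python
def bilinear_form(H, h, k):
    """Evaluate sum_{i,j} h_i k_j H[i][j], i.e. h^T H k, for a square matrix H."""
    m = len(H)
    return sum(h[i] * k[j] * H[i][j] for i in range(m) for j in range(m))

# Hessian of the squared norm on R^3: H = 2 I_3.
H = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
h = [0.5, 1.0, -1.0]
k = [2.0, 0.0, 3.0]

value = bilinear_form(H, h, k)
inner = sum(hi * ki for hi, ki in zip(h, k))  # the inner product <h, k>
print(value, 2.0 * inner)  # the two numbers should agree
```

Because this $H$ is symmetric, swapping `h` and `k` leaves the value unchanged.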
[example: Computing the Hessian of $f(x, y) = x^3 y + \cos(xy)$]
The first partial derivatives are
\begin{align*}
D_1 f(x, y) = 3x^2 y - y\sin(xy), \qquad D_2 f(x, y) = x^3 - x\sin(xy).
\end{align*}
The four second partial derivatives are:
\begin{align*}
D_1 D_1 f &= \frac{\partial}{\partial x}(3x^2 y - y\sin(xy)) = 6xy - y^2 \cos(xy), \\
D_2 D_1 f &= \frac{\partial}{\partial y}(3x^2 y - y\sin(xy)) = 3x^2 - \sin(xy) - xy\cos(xy), \\
D_1 D_2 f &= \frac{\partial}{\partial x}(x^3 - x\sin(xy)) = 3x^2 - \sin(xy) - xy\cos(xy), \\
D_2 D_2 f &= \frac{\partial}{\partial y}(x^3 - x\sin(xy)) = -x^2\cos(xy).
\end{align*}
The Hessian matrix is
\begin{align*}
H_f(x, y) = \begin{pmatrix} 6xy - y^2\cos(xy) & 3x^2 - \sin(xy) - xy\cos(xy) \\ 3x^2 - \sin(xy) - xy\cos(xy) & -x^2\cos(xy) \end{pmatrix}.
\end{align*}
Notice that $D_1 D_2 f = D_2 D_1 f$ here — the Hessian is symmetric. This is because all partial derivatives of $f$ are continuous (they are compositions of polynomials and trigonometric functions), so $f$ is $C^2$ and the symmetry theorem applies. At the origin $(0, 0)$:
\begin{align*}
H_f(0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},
\end{align*}
so the second derivative $f''(0, 0) = 0$. Since $D_1 f(0, 0) = D_2 f(0, 0) = 0$, the origin is a degenerate critical point, where second-order information alone cannot determine the nature of the critical point.
At $(1, 0)$:
\begin{align*}
H_f(1, 0) = \begin{pmatrix} 0 & 3 \\ 3 & -1 \end{pmatrix}.
\end{align*}
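This matrix has eigenvalues $\lambda = \frac{-1 \pm \sqrt{1 + 36}}{2} = \frac{-1 \pm \sqrt{37}}{2}$, one positive and one negative, so the Hessian is indefinite at $(1, 0)$. Note that $(1, 0)$ is not a critical point, since $D_2 f(1, 0) = 1 \neq 0$. The indefiniteness therefore describes the second-order behaviour of $f$ near $(1, 0)$ rather than a saddle point in the sense of the second-derivative test.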
[/example]
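The hand computation above can be sanity-checked numerically. The sketch below approximates the Hessian of $f(x, y) = x^3 y + \cos(xy)$ at $(1, 0)$ by central differences and computes the eigenvalues of the exact matrix. The function names and the step size `e = 1e-4` are assumptions chosen for this illustration. Note also that the central-difference formula for the mixed partial is symmetric by construction, so it cannot detect a failure of symmetry.

```python
import math

def f(x, y):
    return x**3 * y + math.cos(x * y)

def hessian_fd(func, x, y, e=1e-4):
    """Approximate the 2x2 Hessian of func at (x, y) by central differences."""
    fxx = (func(x + e, y) - 2.0 * func(x, y) + func(x - e, y)) / e**2
    fyy = (func(x, y + e) - 2.0 * func(x, y) + func(x, y - e)) / e**2
    fxy = (func(x + e, y + e) - func(x + e, y - e)
           - func(x - e, y + e) + func(x - e, y - e)) / (4.0 * e**2)
    return [[fxx, fxy], [fxy, fyy]]

def eigenvalues_sym2(H):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    root = math.sqrt((a - d) ** 2 + 4.0 * b * b)
    return ((a + d - root) / 2.0, (a + d + root) / 2.0)

H_num = hessian_fd(f, 1.0, 0.0)
H_exact = [[0.0, 3.0], [3.0, -1.0]]  # from the hand computation
lam_lo, lam_hi = eigenvalues_sym2(H_exact)
print(H_num)           # should be close to H_exact
print(lam_lo, lam_hi)  # one negative, one positive
```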
## Symmetry of Mixed Partial Derivatives
The Hessian in the example above was symmetric. Is this always the case? The answer is *no* in general, but *yes* under a mild regularity condition.
[quotetheorem:332]
[citeproof:332]
The proof is the classic "second-difference" argument. The quantity
\begin{align*}
\Delta(s, t) = f(a + se_i + te_j) - f(a + se_i) - f(a + te_j) + f(a)
\end{align*}
measures the "interaction" between the $e_i$ and $e_j$ directions: it is the increment of $f$ over a rectangle, and it is *symmetric* in the sense that swapping the roles of the pairs $(i, s)$ and $(j, t)$ does not change $\Delta$. Applying the one-dimensional MVT first in the $i$-direction and then in the $j$-direction gives $\Delta(s, t) = st \, D_j D_i f(c_1)$ for some intermediate point $c_1$ in the rectangle; applying it in the reverse order gives $\Delta(s, t) = st \, D_i D_j f(c_2)$ for some $c_2$. Since $\Delta$ is the same in both cases, $D_j D_i f(c_1) = D_i D_j f(c_2)$ whenever $s, t \neq 0$. Letting $s, t \to 0$ forces $c_1, c_2 \to a$, and continuity of the second partial derivatives at $a$ gives equality at $a$.
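To see the second-difference quotient at work, the sketch below computes $\Delta(s, t)/(st)$ for the smooth function $f(x, y) = x^3 y + \cos(xy)$ from the Hessian example and compares it with the hand-computed mixed partial. The base point $a = (1, 0.5)$, the step sizes, and all function names are assumptions chosen for illustration.

```python
import math

def f(x, y):
    return x**3 * y + math.cos(x * y)

def second_difference(func, a, s, t):
    """Delta(s, t) = f(a + s e_1 + t e_2) - f(a + s e_1) - f(a + t e_2) + f(a)."""
    x, y = a
    return func(x + s, y + t) - func(x + s, y) - func(x, y + t) + func(x, y)

def mixed_partial_exact(x, y):
    """D_1 D_2 f = D_2 D_1 f for this f, computed by hand."""
    return 3.0 * x**2 - math.sin(x * y) - x * y * math.cos(x * y)

a = (1.0, 0.5)
exact = mixed_partial_exact(*a)
quotients = []
for step in (1e-1, 1e-2, 1e-3):
    quotients.append(second_difference(f, a, step, step) / (step * step))
    print(step, quotients[-1], exact)  # quotient approaches the mixed partial
```

The quotient makes no reference to an order of differentiation, which is exactly why both iterated partials must agree with its limit when they are continuous.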
The continuity hypothesis is essential. The following example shows what can go wrong without it.
[example: Failure of Symmetry Without Continuity]
Define $f: \mathbb{R}^2 \to \mathbb{R}$ by
\begin{align*}
f(x, y) = \begin{cases} \dfrac{xy(x^2 - y^2)}{x^2 + y^2} & (x, y) \neq (0, 0), \\ 0 & (x, y) = (0, 0). \end{cases}
\end{align*}
We compute the mixed partial derivatives at the origin.
**First partial derivatives everywhere.** On the axes: $f(x, 0) = 0$ and $f(0, y) = 0$ for all $x, y$, so $D_1 f(0, 0) = 0$ and $D_2 f(0, 0) = 0$. For $(x, y) \neq (0, 0)$, the quotient rule gives
\begin{align*}
D_1 f(x, y) = \frac{y(x^2 + y^2)(3x^2 - y^2) - xy(x^2 - y^2) \cdot 2x}{(x^2 + y^2)^2} = \frac{y(x^4 + 4x^2 y^2 - y^4)}{(x^2 + y^2)^2}.
\end{align*}
In particular, $D_1 f(0, y) = \frac{y(0 + 0 - y^4)}{y^4} = -y$ for $y \neq 0$, and $D_1 f(0, 0) = 0$. Therefore
\begin{align*}
D_2 D_1 f(0, 0) = \lim_{t \to 0} \frac{D_1 f(0, t) - D_1 f(0, 0)}{t} = \lim_{t \to 0} \frac{-t - 0}{t} = -1.
\end{align*}
By a symmetric computation (or by noting $f(x, y) = -f(y, x)$), $D_2 f(x, 0) = x$ for $x \neq 0$, and
\begin{align*}
D_1 D_2 f(0, 0) = \lim_{t \to 0} \frac{D_2 f(t, 0) - D_2 f(0, 0)}{t} = \lim_{t \to 0} \frac{t - 0}{t} = 1.
\end{align*}
Therefore
\begin{align*}
D_1 D_2 f(0, 0) = 1 \neq -1 = D_2 D_1 f(0, 0).
\end{align*}
The mixed partial derivatives are unequal. The underlying cause is that the second partial derivatives, while they exist at the origin, are not continuous there — the function $D_1 D_2 f(x, y)$ does not have a limit as $(x, y) \to (0, 0)$, so the continuity hypothesis of the symmetry theorem fails.
[/example]
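The asymmetry can also be observed numerically. The sketch below approximates the first partials by central differences with a small inner step `e`, then forms the outer difference quotients at the origin with a larger step `t`, mirroring the limits computed above. The step sizes and the helper names `d1` and `d2` are assumptions made for this illustration; the inner step must be much smaller than the outer one for the approximation to be meaningful.

```python
def f(x, y):
    """The counterexample function, extended by 0 at the origin."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x * x - y * y) / (x * x + y * y)

def d1(x, y, e=1e-7):
    """Central-difference approximation to D_1 f(x, y)."""
    return (f(x + e, y) - f(x - e, y)) / (2.0 * e)

def d2(x, y, e=1e-7):
    """Central-difference approximation to D_2 f(x, y)."""
    return (f(x, y + e) - f(x, y - e)) / (2.0 * e)

t = 1e-3
d2d1 = (d1(0.0, t) - d1(0.0, 0.0)) / t  # approximates D_2 D_1 f(0, 0)
d1d2 = (d2(t, 0.0) - d2(0.0, 0.0)) / t  # approximates D_1 D_2 f(0, 0)
print(d2d1, d1d2)  # close to -1 and +1 respectively
```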
## The $C^2$ Condition
In practice, the continuity hypothesis of the symmetry theorem is most often verified via the following condition.
[definition: $C^2$ Function]
Let $U \subseteq \mathbb{R}^m$ be open. A function $f: U \to \mathbb{R}^n$ is **$C^2$** (or **twice continuously differentiable**) if $f$ is twice differentiable on $U$ and the second derivative map $f'': U \to \mathrm{Bil}(\mathbb{R}^m \times \mathbb{R}^m, \mathbb{R}^n)$ is continuous on $U$.
[/definition]
Equivalently, $f$ is $C^2$ if and only if all second partial derivatives $D_i D_j f_k$ exist and are continuous on $U$. This is the practical criterion: for functions built from elementary operations, continuous second partials are immediate to verify. For a $C^2$ function, the symmetry theorem holds at every point of $U$, so the Hessian is everywhere symmetric and the order of differentiation never matters.
[example: The Matrix Squaring Map is $C^2$]
Define $f: M_n(\mathbb{R}) \to M_n(\mathbb{R})$ by $f(A) = A^2$. From the worked problem of Section 9, the first derivative is
\begin{align*}
Df_A(H) = AH + HA.
\end{align*}
The derivative map $Df: A \mapsto [H \mapsto AH + HA]$ is linear in $A$ (left-multiplication and right-multiplication by $A$ are each linear in $A$). To find $f''$, compute:
\begin{align*}
Df_{A + K}(H) - Df_A(H) &= (A + K)H + H(A + K) - AH - HA \\
&= KH + HK.
\end{align*}
This is exactly linear in $K$ with no error term. Therefore $f''(A)$ is the constant bilinear map
\begin{align*}
f''(A)(H, K) = KH + HK,
\end{align*}
independent of $A$. Since $f''$ is constant, it is continuous, so $f$ is $C^2$.
**Symmetry check.** Swapping $H$ and $K$:
\begin{align*}
f''(A)(K, H) = HK + KH = KH + HK = f''(A)(H, K).
\end{align*}
The second derivative is symmetric, consistent with the symmetry theorem. When $n = 1$ (scalar multiplication), this gives $f''(a)(h, k) = 2hk$, recovering $(x^2)'' = 2$.
All higher derivatives vanish: $f'''(A) = 0$ (the second derivative is constant, so its derivative is zero). This parallels the one-variable fact $(x^2)''' = 0$: polynomials of degree $d$ have zero $(d+1)$-th derivative.
[/example]
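Since the expansion of $(A + H)^2$ is exact, the formulas in this example can be checked with integer arithmetic. The sketch below uses plain nested lists for $2 \times 2$ matrices; the helper names and the particular matrices `A`, `H`, `K` are assumptions chosen for illustration.

```python
def matmul(X, Y):
    """Product of two square matrices stored as nested lists."""
    n = len(X)
    return [[sum(X[i][l] * Y[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def first_derivative(A, H):
    """Df_A(H) = AH + HA."""
    return add(matmul(A, H), matmul(H, A))

def second_derivative(H, K):
    """f''(A)(H, K) = KH + HK, independent of A."""
    return add(matmul(K, H), matmul(H, K))

A = [[1, 2], [3, 4]]
H = [[0, 1], [1, 1]]
K = [[2, 0], [1, -1]]

# Remainder after removing the zeroth- and first-order terms of (A + H)^2.
A_plus_H = add(A, H)
remainder = sub(sub(matmul(A_plus_H, A_plus_H), matmul(A, A)),
                first_derivative(A, H))
twice_remainder = [[2 * x for x in row] for row in remainder]

print(twice_remainder == second_derivative(H, H))          # remainder is (1/2) f''(A)(H, H)
print(second_derivative(H, K) == second_derivative(K, H))  # f''(A) is symmetric
print(matmul(H, K) == matmul(K, H))                        # although H and K need not commute
```

The last comparison is the point of the symmetry check: $f''(A)$ is symmetric in $(H, K)$ even though the individual products $HK$ and $KH$ differ for these matrices.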
[problem]
Let $f: \mathbb{R}^m \to \mathbb{R}$ be $C^2$ and suppose $a$ is a critical point of $f$ (meaning $Df_a = 0$). Show that for all $h \in \mathbb{R}^m$,
\begin{align*}
f(a + h) = f(a) + \frac{1}{2} f''(a)(h, h) + o(|h|^2).
\end{align*}
Use this to conclude that if the Hessian $H_f(a)$ is positive definite, then $a$ is a strict local minimum.
[/problem]
[solution]
**Step 1: Write the second-order Taylor expansion.** Since $f$ is $C^2$, it is twice differentiable at $a$, so by the [characterisation theorem](/theorems/330) applied to $Df$:
\begin{align*}
Df_{a + h}(k) = Df_a(k) + f''(a)(h, k) + |h|\varepsilon(h, k),
\end{align*}
where $\varepsilon(h, k) \to 0$ as $h \to \mathbf{0}$. Now consider the path $\gamma(t) = a + th$ for $t \in [0, 1]$ and define $g(t) = f(a + th)$. By the [chain rule](/theorems/323), $g'(t) = Df_{a + th}(h)$.
**Step 2: Integrate.** By the fundamental theorem of calculus:
\begin{align*}
f(a + h) - f(a) = g(1) - g(0) = \int_0^1 g'(t) \, dt = \int_0^1 Df_{a + th}(h) \, dt.
\end{align*}
Using the expansion with $a + th$ in place of $a + h$ (i.e., replace $h$ by $th$):
\begin{align*}
Df_{a + th}(h) = \underbrace{Df_a(h)}_{= \, 0} + t \, f''(a)(h, h) + t|h|\varepsilon(th, h),
\end{align*}
where $Df_a(h) = 0$ because $a$ is a critical point. Integrating:
\begin{align*}
f(a + h) - f(a) &= \int_0^1 \left[t \, f''(a)(h, h) + t|h|\varepsilon(th, h)\right] dt \\
&= \frac{1}{2} f''(a)(h, h) + |h| \int_0^1 t \, \varepsilon(th, h) \, dt.
\end{align*}
Every other term in the expansion of Step 1 is linear in $k$, so $\varepsilon(h, k)$ is linear in $k$ as well: $\varepsilon(h, k) = E(h)(k)$ for a linear functional $E(h)$, and twice differentiability at $a$ says precisely that $\|E(h)\| \to 0$ as $h \to \mathbf{0}$. Hence $|\varepsilon(th, h)| \leq \|E(th)\| \, |h|$, which supplies the second factor of $|h|$. More precisely, given $\eta > 0$, choose $\delta > 0$ such that $\|E(u)\| \leq \eta$ whenever $|u| < \delta$. For $|h| < \delta$ and $t \in [0, 1]$ we have $|th| \leq |h| < \delta$, so $|\varepsilon(th, h)| \leq \eta |h|$, and the last term is bounded in absolute value by $|h| \int_0^1 t \, \eta |h| \, dt = \frac{\eta}{2}|h|^2$. Since $\eta$ was arbitrary, the last term is $o(|h|^2)$. This confirms:
\begin{align*}
f(a + h) = f(a) + \frac{1}{2} f''(a)(h, h) + o(|h|^2).
\end{align*}
**Step 3: Deduce the local minimum.** Suppose $H_f(a)$ is positive definite. Since $f$ is $C^2$, $H_f(a)$ is symmetric, so by the spectral theorem it has real eigenvalues $\lambda_1 \leq \cdots \leq \lambda_m$ and an orthonormal basis of eigenvectors; positive definiteness means all of these eigenvalues are positive. The minimum eigenvalue $\lambda_1 > 0$ satisfies
\begin{align*}
f''(a)(h, h) = h^T H_f(a) \, h \geq \lambda_1 |h|^2
\end{align*}
for all $h$. Therefore, for $h \neq \mathbf{0}$ sufficiently small:
\begin{align*}
f(a + h) - f(a) = \frac{1}{2}h^T H_f(a)\,h + o(|h|^2) \geq \frac{\lambda_1}{2}|h|^2 + o(|h|^2) = |h|^2\left(\frac{\lambda_1}{2} + o(1)\right).
\end{align*}
For $|h|$ small enough, the $o(1)$ term has absolute value at most $\lambda_1/4$, so the term in parentheses is at least $\lambda_1/4 > 0$, and $f(a + h) > f(a)$. This shows $a$ is a strict local minimum.
Similarly, if $H_f(a)$ is negative definite, $a$ is a strict local maximum. If $H_f(a)$ has eigenvalues of both signs (indefinite), the quadratic form $\frac{1}{2}h^T H_f(a)\,h$ takes both positive and negative values, and $a$ is a **saddle point** — neither a local minimum nor a local maximum.
[/solution]
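The conclusions of the solution can be packaged as a small classifier for the two-variable case. The sketch below assumes the input is the symmetric $2 \times 2$ Hessian at a point already known to be critical. The function name, the returned labels, and the tolerance `tol` used to decide when an eigenvalue counts as zero are assumptions for illustration, not part of the theory.

```python
import math

def classify_critical_point(H, tol=1e-12):
    """Second-derivative test for a symmetric 2x2 Hessian H at a critical point."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    root = math.sqrt((a - d) ** 2 + 4.0 * b * b)
    lam_min = (a + d - root) / 2.0
    lam_max = (a + d + root) / 2.0
    if lam_min > tol:
        return "strict local minimum"   # positive definite
    if lam_max < -tol:
        return "strict local maximum"   # negative definite
    if lam_min < -tol and lam_max > tol:
        return "saddle point"           # indefinite
    return "inconclusive"               # some eigenvalue is numerically zero

print(classify_critical_point([[2.0, 0.0], [0.0, 2.0]]))  # Hessian of x^2 + y^2 at 0
print(classify_critical_point([[0.0, 1.0], [1.0, 0.0]]))  # Hessian of xy at 0
print(classify_critical_point([[2.0, 0.0], [0.0, 0.0]]))  # degenerate: test says nothing
```

The degenerate case is genuinely undecidable from the Hessian alone: $x^2 + y^4$ and $x^2 - y^4$ share the last Hessian at the origin, yet the first has a minimum there and the second does not.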
Cambridge IB Analysis and Topology
Created by admin on 2/26/2026 | Last updated on 3/2/2026