This course builds on the measure-theoretic foundations of Geometric Measure Theory I to develop the machinery of differentiation for non-smooth functions. The central focus is on Lipschitz maps, functions with bounded distributional derivatives, and convex functions: exotic in classical analysis but ubiquitous in geometric variational problems and optimal transport. Rather than assume differentiability, we will establish when and how non-smooth maps admit meaningful notions of derivative, culminating in two transformative theorems: the area formula and the coarea formula. The area formula extends the classical change-of-variables rule, and the coarea formula extends Fubini's theorem, from smooth maps to all Lipschitz maps. This is what lets geometric measure theory handle sets and functions far rougher than classical calculus allows.
The narrative arc begins with Rademacher's theorem, which asserts that every Lipschitz function is differentiable almost everywhere, so that even the roughest Lipschitz maps become objects of differential calculus. We then develop the Jacobian for such maps, culminating in the area formula: a measure-theoretic identity equating the integral of the Jacobian over a set with the integral, over the image, of the number of preimages. Inverting this perspective yields the coarea formula, which slices a domain along level sets of a Lipschitz function and integrates the resulting $(n-1)$-dimensional areas. Beyond these formulas, we investigate finer differentiability results for Sobolev functions, prove Alexandrov's theorem on the twice-differentiability of convex functions, and prove Whitney's extension theorem, the technical backbone that underpins $C^1$ approximation of rough functions.
Each chapter deepens the toolkit: approximate differentiability prepares the ground for Sobolev regularity; linear maps and Jacobians establish a finite-dimensional calculus; the area and coarea formulas unify geometric and analytic perspectives; Alexandrov and Whitney theory extend differentiability to broader classes. The course culminates in worked examples demonstrating how these abstract results illuminate concrete geometry—from minimal surfaces to regularity of solutions to elliptic PDEs. Throughout, the emphasis is on transforming regularity assumptions into measure-theoretic identities, enabling us to compute and reason about singular sets that classical analysis cannot reach.
# 1. Lipschitz Functions and Rademacher's Theorem
This chapter lays the analytic foundation for everything that follows in the course. The central object is the class of Lipschitz functions, which are maps controlled by a linear modulus of continuity. The central theorem — Rademacher's theorem — asserts that any locally Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at Lebesgue-almost every point. This is a remarkable fact: differentiability is a delicate analytic condition, while the Lipschitz condition is purely metric, yet the metric condition forces almost-everywhere differentiability. Every result in this course — the area formula, the coarea formula, and the finer differentiability theorems — rests on this foundation.
## Lipschitz Functions: Definitions and Basic Properties
[motivation]
### Why Lipschitz?
The classical change-of-variables formula for $C^1$ maps requires genuine everywhere differentiability. Yet many natural maps arising in geometry and analysis — distance functions, convex functions, solutions to certain PDEs — are not $C^1$. They may fail to be differentiable at corners, kinks, or on measure-zero sets. The Lipschitz condition is the minimal quantitative regularity needed to control how a map distorts measure. A function with Lipschitz constant $L$ expands distances by at most $L$, which bounds by how much it can stretch $k$-dimensional volume. This makes Lipschitz maps the natural domain for GMT, and Rademacher's theorem is the theorem that upgrades "metric control" to "almost-everywhere differential calculus."
### The Lipschitz Condition Versus Continuity
A function can be uniformly continuous without being Lipschitz — consider $f(x) = \sqrt{x}$ on $[0,1]$, which has modulus of continuity $\omega(\delta) = \sqrt{\delta}$ but is not Lipschitz because $|f(x) - f(0)|/|x - 0| = x^{-1/2} \to \infty$ as $x \to 0^+$. The Lipschitz condition imposes a linear modulus and is therefore strictly between uniform continuity and $C^1$. For GMT the linear modulus is essential: it is precisely the hypothesis under which areas and coareas can be computed by integrating the Jacobian.
[/motivation]
The core definition precisely quantifies what it means to control a function's rate of oscillation.
[definition: Lipschitz Function]
Let $E \subseteq \mathbb{R}^n$ and $m \ge 1$. A function $f : E \to \mathbb{R}^m$ is **Lipschitz** (or **Lipschitz continuous**) if there exists a constant $L \ge 0$ such that
\begin{align*}
|f(x) - f(y)| \le L|x - y| \quad \text{for all } x, y \in E.
\end{align*}
The **Lipschitz constant** of $f$, denoted $\operatorname{Lip}(f)$, is the smallest such $L$, or equivalently the supremum of the difference quotients:
\begin{align*}
\operatorname{Lip}(f) := \sup_{\substack{x, y \in E \\ x \neq y}} \frac{|f(x) - f(y)|}{|x - y|}.
\end{align*}
We say $f$ is **locally Lipschitz** if every point $x \in E$ has a neighbourhood $U$ such that $f|_{U \cap E}$ is Lipschitz.
[/definition]
When $E$ is open and $f$ is locally Lipschitz, the Lipschitz constant may vary from region to region. On each compact subset $K \subseteq E$, however, a locally Lipschitz function is Lipschitz, with a constant depending on $K$.
[example: Standard Examples of Lipschitz Functions]
**Distance function.** Fix a non-empty set $A \subseteq \mathbb{R}^n$ and define $d_A(x) = \operatorname{dist}(x, A) = \inf_{a \in A} |x - a|$. For any $x, y \in \mathbb{R}^n$ and any $a \in A$, the triangle inequality gives $d_A(x) \le |x - a| \le |x - y| + |y - a|$, so taking the infimum over $a$ yields $d_A(x) \le |x - y| + d_A(y)$. Exchanging $x$ and $y$ gives $|d_A(x) - d_A(y)| \le |x - y|$, so $d_A$ is $1$-Lipschitz, i.e., $\operatorname{Lip}(d_A) \le 1$. Equality holds unless $A$ is dense in $\mathbb{R}^n$, in which case $d_A \equiv 0$.
**Linear maps.** Any linear map $T : \mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz with $\operatorname{Lip}(T) = \|T\|_{\mathcal{L}(\mathbb{R}^n, \mathbb{R}^m)}$, the operator norm. This follows immediately from $|T(x) - T(y)| = |T(x-y)| \le \|T\| |x - y|$.
**Absolute value.** The function $f(x) = |x|$ on $\mathbb{R}$ satisfies $||x| - |y|| \le |x - y|$ by the reverse triangle inequality, so $\operatorname{Lip}(f) = 1$. Note that $f$ is not differentiable at $x = 0$.
**Smooth functions on compact sets.** If $f \in C^1(\overline{U})$ for a bounded convex open set $U$, then $\operatorname{Lip}(f) \le \sup_{x \in \overline{U}} |\nabla f(x)|$ by the mean value theorem applied along the segment from $x$ to $y$; convexity guarantees that this segment stays in $\overline{U}$, and without it the bound can fail. This shows $C^1(\overline{U}) \subseteq \operatorname{Lip}(\overline{U})$, but the inclusion is strict: the function $f(x) = |x|$ on $\overline{B}(0,1) \subset \mathbb{R}$ is Lipschitz but is not differentiable at $x = 0$, hence not $C^1$.
[/example]
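As a rough numerical companion to these examples, the following sketch estimates the supremum of difference quotients over a finite sample. The helper names `lipschitz_lower_bound` and `dist_to_set` and the sample grids are illustrative assumptions, not part of the course material. A finite sample can only bound $\operatorname{Lip}(f)$ from below, so this is a sanity check rather than a proof.

```python
import math
from itertools import combinations

def lipschitz_lower_bound(f, points):
    """Largest difference quotient |f(x)-f(y)|/|x-y| over all sampled pairs.

    A finite sample can only bound Lip(f) from below.
    """
    best = 0.0
    for x, y in combinations(points, 2):
        if x != y:
            best = max(best, abs(f(x) - f(y)) / abs(x - y))
    return best

def dist_to_set(A):
    """Distance function d_A on the real line for a finite non-empty set A."""
    return lambda x: min(abs(x - a) for a in A)

grid = [k / 100 for k in range(-200, 201)]      # sample of [-2, 2]
d_A = dist_to_set([-1.0, 0.5])

bound_abs = lipschitz_lower_bound(abs, grid)    # f(x) = |x|
bound_dist = lipschitz_lower_bound(d_A, grid)   # f = d_A

# sqrt on [0, 1]: the quotients near 0 keep growing as the grid is refined,
# which is the numerical shadow of sqrt not being Lipschitz.
bound_sqrt_coarse = lipschitz_lower_bound(math.sqrt, [k / 10 for k in range(11)])
bound_sqrt_fine = lipschitz_lower_bound(math.sqrt, [k / 500 for k in range(501)])
```

For $|x|$ and $d_A$ the estimate stays at $1$ however fine the grid, whereas for $\sqrt{x}$ it grows without bound under refinement.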
Having established the central examples, we record the closure properties that make the class of Lipschitz functions easy to work with in practice.
[remark: Closure Properties of Lipschitz Functions]
The class of Lipschitz functions on $E \subseteq \mathbb{R}^n$ is closed under the following operations, with explicit Lipschitz constants:
- **Sums:** $\operatorname{Lip}(f + g) \le \operatorname{Lip}(f) + \operatorname{Lip}(g)$.
- **Products:** If $f, g$ are Lipschitz and bounded, then $fg$ is Lipschitz with $\operatorname{Lip}(fg) \le \|f\|_\infty \operatorname{Lip}(g) + \|g\|_\infty \operatorname{Lip}(f)$.
- **Compositions:** $\operatorname{Lip}(f \circ g) \le \operatorname{Lip}(f) \cdot \operatorname{Lip}(g)$.
- **Pointwise max and min:** $\operatorname{Lip}(\max(f,g)) \le \max(\operatorname{Lip}(f), \operatorname{Lip}(g))$, and similarly for $\min$.
- **Uniform limits:** If $f_k \to f$ uniformly and each $f_k$ is $L$-Lipschitz, then $f$ is $L$-Lipschitz.
[/remark]
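As an illustration of how these bounds are obtained, the product estimate for real-valued $f$ and $g$ follows from a one-line add-and-subtract argument. For $x, y \in E$,
\begin{align*}
|f(x)g(x) - f(y)g(y)| &\le |f(x)|\,|g(x) - g(y)| + |g(y)|\,|f(x) - f(y)| \\
&\le \left( \|f\|_\infty \operatorname{Lip}(g) + \|g\|_\infty \operatorname{Lip}(f) \right) |x - y|.
\end{align*}
The boundedness assumption is used exactly once per term, to control the factor that is not being differenced.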
The closure under uniform limits is particularly useful: it allows Lipschitz functions to be constructed as uniform limits of smooth approximations without increasing the Lipschitz constant. This connects to mollification arguments that appear throughout the proofs in later chapters.
### Absolute Continuity on Lines
One of the most important structural properties of Lipschitz functions is their behaviour along lines. For a function $f : \mathbb{R}^n \to \mathbb{R}$ that is $L$-Lipschitz, the restriction to any line $\ell$ is $L$-Lipschitz as a function of the arc-length parameter. A one-variable $L$-Lipschitz function on $\mathbb{R}$ is absolutely continuous, and an absolutely continuous function is differentiable almost everywhere (with respect to one-dimensional Lebesgue measure $\mathcal{L}^1$). This one-dimensional observation is the seed from which Rademacher's theorem grows.
[definition: Absolutely Continuous on Lines]
A function $f : U \to \mathbb{R}$ defined on an open set $U \subseteq \mathbb{R}^n$ is **absolutely continuous on lines (ACL)** if for $\mathcal{L}^{n-1}$-almost every line $\ell$ parallel to each coordinate axis, the restriction $f|_\ell$ is absolutely continuous on each compact subinterval of $\ell \cap U$.
[/definition]
Every locally Lipschitz function is ACL, because the Lipschitz condition on the ambient space restricts to a Lipschitz (hence absolutely continuous) condition on each line. The ACL property is important for Sobolev theory as well: by Nikodym's theorem, a function belongs to $W^{1,p}_{\mathrm{loc}}(U)$ for $p \ge 1$ exactly when it has an ACL representative whose partial derivatives lie in $L^p_{\mathrm{loc}}(U)$. For our purposes, what matters is that ACL implies the existence of partial derivatives almost everywhere, which Rademacher's theorem upgrades to full differentiability almost everywhere for Lipschitz functions.
## Extension Theorems: Kirszbraun and McShane
A priori, a Lipschitz function is defined only on a subset $E$ of $\mathbb{R}^n$. When the proof of Rademacher's theorem reduces to the scalar case, one needs to work with a function on all of $\mathbb{R}^n$, not merely on $E$. The naive attempt — extend $f$ by zero outside $E$ — immediately fails: if $E = [1, 2]$ and $f \equiv 1$, then the zero extension has a jump discontinuity at the boundary and is nowhere near Lipschitz across it. Can one always extend while preserving the Lipschitz constant? For real-valued functions the answer is yes, and the construction is explicit.
[quotetheorem:3067]
The two formulas produce the smallest ($\tilde{f}$) and largest ($\hat{f}$) possible $L$-Lipschitz extensions: any other $L$-Lipschitz extension $g$ satisfies $\tilde{f} \le g \le \hat{f}$ pointwise on $\mathbb{R}^n$. When $\tilde{f} \ne \hat{f}$ there are infinitely many extensions, and no canonical one. The real-valued hypothesis matters: for $\mathbb{R}^m$-valued $f$ with $m \ge 2$, applying McShane componentwise gives an extension $F = (F_1, \dots, F_m)$ whose components are each $L$-Lipschitz, and then $|F(x) - F(y)|^2 = \sum_{i=1}^m |F_i(x) - F_i(y)|^2 \le m L^2 |x - y|^2$. The guaranteed constant is therefore only $\sqrt{m} \cdot L$, not $L$, because the componentwise envelopes are built independently of one another. The componentwise strategy is thus insufficient if one wants to preserve the Lipschitz constant of a vector-valued map.
McShane's theorem applies only to real-valued functions. For $\mathbb{R}^m$-valued functions with $m \ge 2$, a more subtle argument is required.
[quotetheorem:3068]
The proof relies on Helly's theorem (a combinatorial intersection property for convex sets) applied to closed balls of radius $L \cdot r$ in $\mathbb{R}^m$ and is significantly more involved than McShane's argument. Evans–Gariepy §3.1 proves the componentwise version with the factor $\sqrt{m}$ and refers to Federer §2.10.43 for Kirszbraun's theorem; the course does not reproduce the proof. What the theorem does not give is worth noting: the extension $F$ is not unique, there is no canonical choice analogous to the inf/sup envelope, and no continuity of $F$ in $f$ in any natural topology. More importantly, Kirszbraun's theorem is specific to Hilbert spaces and fails for general Banach spaces. A standard counterexample takes $(\mathbb{R}^2, \|\cdot\|_\infty)$ as the domain and the Euclidean plane as the target. The points $p_1 = (1,1)$, $p_2 = (1,-1)$, $p_3 = (-1,1)$ are pairwise at $\ell^\infty$-distance $2$, so the map sending them to the vertices of a Euclidean equilateral triangle of side $2$ is $1$-Lipschitz. The origin is at $\ell^\infty$-distance $1$ from each $p_i$, so a $1$-Lipschitz extension would have to send it to a point within Euclidean distance $1$ of all three vertices. No such point exists, because the circumradius of the triangle is $2/\sqrt{3} > 1$. The Euclidean structure enters through the following intersection property: if closed balls $\overline{B}(x_i, r_i)$ in $\mathbb{R}^n$ have a common point and the points $y_i \in \mathbb{R}^m$ satisfy $|y_i - y_j| \le |x_i - x_j|$ for all $i, j$, then the balls $\overline{B}(y_i, r_i)$ also have a common point. This is what makes Kirszbraun work.
The importance of both theorems is that they allow us to assume, without loss of generality, that any Lipschitz function under consideration is defined on all of $\mathbb{R}^n$. We will invoke this reduction silently in the proof of Rademacher's theorem.
## Rademacher's Theorem
[motivation]
### The Gap Between Metric and Analytic Regularity
A Lipschitz function need not be differentiable everywhere. The function $f(x) = |x|$ on $\mathbb{R}$ is $1$-Lipschitz but fails to be differentiable at $x = 0$. More dramatically, the distance function $d_C$ to the middle-thirds Cantor set $C \subset [0,1]$ is $1$-Lipschitz and fails to be differentiable at every point of $C$, an uncountable set of measure zero. Indeed, $d_C$ attains its minimum value $0$ on $C$, so any derivative at $x \in C$ would have to vanish, yet at every scale $3^{-k}$ there is a removed interval within distance $3^{-k}$ of $x$ on which $d_C$ reaches a value comparable to $3^{-k}$. The Cantor function is a useful contrast: it is monotone, continuous, and differentiable almost everywhere with derivative zero, but it is not Lipschitz (it is only Hölder continuous with exponent $\log 2 / \log 3$), so it lies outside the scope of this chapter. In both Lipschitz examples the non-differentiability set has measure zero, and this is no accident.
Rademacher's theorem makes this precise: the non-differentiability set of any locally Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}^m$ has Lebesgue measure zero. The conclusion cannot be strengthened to differentiability everywhere, as $|x|$ already shows. The theorem is remarkable because it derives an analytic property (differentiability, which involves limits in all directions simultaneously) from a metric property (uniform control on pairwise distances).
### Why This Theorem Runs the Whole Course
The area formula and coarea formula both require integrating the Jacobian $J f$ of a Lipschitz map $f$. Without Rademacher's theorem, the Jacobian would not even be defined almost everywhere, and such integrals would be meaningless. Rademacher's theorem is what gives Lipschitz maps a well-defined derivative almost everywhere, making the rest of the course possible.
[/motivation]
[quotetheorem:3069]
[citeproof:3069]
Rademacher's theorem tells us that a locally Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}^m$ has a well-defined total derivative $Df_x : \mathbb{R}^n \to \mathbb{R}^m$ for a.e. $x$, and therefore a well-defined Jacobian matrix $Jf_x \in \mathbb{R}^{m \times n}$ for a.e. $x$. The determinant-like quantity derived from $Jf_x$ — the Jacobian $J$ — is what appears in the area and coarea formulas. This connection will be made precise in Chapter 2 when we study linear maps and Jacobians in detail.
[example: Sharpness of the Lipschitz Hypothesis in Rademacher's Theorem]
Rademacher's theorem requires the Lipschitz condition; Hölder regularity of any exponent $\alpha < 1$ is not enough. Consider first $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) = \sqrt{|x|}$. This function satisfies $|f(x) - f(y)| \le |x - y|^{1/2}$ (it is $\frac{1}{2}$-Hölder), but is not Lipschitz near $x = 0$ since $|f(x) - f(0)|/|x - 0| = |x|^{-1/2} \to \infty$ as $x \to 0$. At $x = 0$, the difference quotient $(f(h) - f(0))/h = \operatorname{sgn}(h)\,|h|^{-1/2}$ is unbounded, so $f$ is not differentiable at $0$. Since $\{0\}$ is a null set, this alone does not contradict an almost-everywhere statement. One can try to spread the singularity over a dense set by forming $g(x) = \sum_{k} 2^{-k} \sqrt{|x - r_k|}$, where $(r_k)$ enumerates a dense set; this $g$ is still $\frac{1}{2}$-Hölder, but deciding where it fails to be differentiable requires a separate argument. A decisive example is the Weierstrass function $W(x) = \sum_{k \ge 0} b^{-k\alpha} \cos(b^k x)$ with $0 < \alpha < 1$ and an integer $b \ge 2$: it is $\alpha$-Hölder and nowhere differentiable, so the conclusion of Rademacher's theorem fails completely below the Lipschitz threshold. In contrast, the $1$-Lipschitz function $h(x) = |x|$ fails to be differentiable only on the null set $\{0\}$, consistent with the theorem. The general principle is: finiteness of $\operatorname{lip}(f)$ on a set is the precise pointwise condition under which differentiability at almost every point of that set can be concluded (this is Stepanov's theorem below), while $\operatorname{lip}(f)(x) = \infty$, as at $x = 0$ for $\sqrt{|x|}$, rules out differentiability at $x$.
[/example]
The bound $|\nabla f(x)| \le \operatorname{Lip}(f)$, valid at every point of differentiability and hence almost everywhere, is the key tool for controlling the Jacobian in the area formula: since $|Df_x v| \le \operatorname{Lip}(f)\,|v|$ for every $v$, the Jacobian is bounded by a power of $\operatorname{Lip}(f)$ that depends only on the dimension.
## Stepanov's Theorem
The hypothesis of Rademacher's theorem is uniform: $f$ must be Lipschitz on a whole neighbourhood of each point. But consider the function $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) = x^2 \sin(1/x^2)$ for $x \ne 0$ and $f(0) = 0$. This function is differentiable everywhere (with $f'(0) = 0$, because $|f(x)| \le x^2$), but it is not Lipschitz on any neighbourhood of $0$ since $f'(x) = 2x\sin(1/x^2) - (2/x)\cos(1/x^2)$ is unbounded near $0$. Rademacher's theorem cannot be applied on any neighbourhood of $0$. Yet $f$ is differentiable everywhere, so in particular differentiable almost everywhere. Can one identify, directly from $f$, the set of points where differentiability can be expected, without knowing in advance that $f$ is differentiable? The answer is the Stepanov set.
[definition: Pointwise Lipschitz Constant]
Let $f : \Omega \to \mathbb{R}$ be defined on an open set $\Omega \subseteq \mathbb{R}^n$. The **pointwise Lipschitz constant** of $f$ at $x \in \Omega$ is
\begin{align*}
\operatorname{lip}(f)(x) := \limsup_{y \to x} \frac{|f(y) - f(x)|}{|y - x|}.
\end{align*}
The **Stepanov set** of $f$ is
\begin{align*}
S_f := \{x \in \Omega : \operatorname{lip}(f)(x) < \infty\}.
\end{align*}
[/definition]
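Note the distinction between $\operatorname{Lip}(f)$, the global Lipschitz constant (a single number), and $\operatorname{lip}(f)(x)$, the pointwise Lipschitz constant (a function of $x$). On a convex open set $\Omega$, a function is globally Lipschitz iff $\sup_x \operatorname{lip}(f)(x) < \infty$, and in that case $\operatorname{Lip}(f) = \sup_x \operatorname{lip}(f)(x)$; convexity is needed so that the pointwise bounds can be transferred along segments joining any two points. In general, a function can have $\operatorname{lip}(f)(x) < \infty$ at some points and $= \infty$ at others.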
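The pointwise constant can be probed numerically by taking the largest difference quotient over shrinking neighbourhoods. The sketch below is a heuristic proxy for the $\limsup$ in one dimension, with a hypothetical helper `lip_estimate` and arbitrarily chosen radii; it cannot certify finiteness or infiniteness of $\operatorname{lip}(f)(x)$.

```python
import math

def lip_estimate(f, x, radius, samples=1000):
    """Largest difference quotient of f at x over sample points with 0 < |y - x| <= radius.

    Shrinking `radius` imitates the limsup in the definition of lip(f)(x);
    this is only a numerical proxy, not a proof.
    """
    best = 0.0
    for k in range(1, samples + 1):
        h = radius * k / samples
        for y in (x - h, x + h):
            best = max(best, abs(f(y) - f(x)) / abs(y - x))
    return best

def root(x):
    return math.sqrt(abs(x))

# Estimates at the origin for |x| (stays bounded) and sqrt|x| (blows up).
estimates = {
    radius: (lip_estimate(abs, 0.0, radius), lip_estimate(root, 0.0, radius))
    for radius in (1e-1, 1e-3, 1e-5)
}

# Away from its kink, sqrt|x| is smooth and the estimate settles near |f'(1)| = 1/2.
estimate_at_one = lip_estimate(root, 1.0, 1e-4)
```

The estimates for $|x|$ at the origin stay equal to $1$, while those for $\sqrt{|x|}$ grow as the radius shrinks, in line with the origin lying in the Stepanov set of the former but not of the latter.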
[quotetheorem:3070]
[citeproof:3070]
Stepanov's theorem is the sharpest form of Rademacher's theorem in the following sense: if $f$ is differentiable at $x$, then $\operatorname{lip}(f)(x) \le |Df_x|_{\mathcal{L}(\mathbb{R}^n, \mathbb{R})} < \infty$, so the differentiability set of $f$ is always contained in $S_f$. Stepanov's theorem then guarantees differentiability at a.e. point of $S_f$, making $S_f$ the largest possible set on which an a.e.-differentiability conclusion can hold. The Borel measurability hypothesis on $f$ is a convenience: it ensures that $S_f$ (the preimage of $[0,\infty)$ under $\operatorname{lip}(f)$) is a measurable set, so that the covering and Fubini arguments in the proof apply directly. It is not what drives the conclusion. Finiteness of $\operatorname{lip}(f)(x)$ already forces $f$ to be continuous at $x$, so a function with $S_f = \mathbb{R}^n$ is continuous, hence Borel, and the theorem applies to it. In particular, no Vitali-type construction can produce a function with $S_f = \mathbb{R}^n$ that is nowhere differentiable.
## Almost-Everywhere Differentiability as the Natural Notion
The results of this chapter invite a philosophical reflection on why almost-everywhere differentiability is the correct notion for GMT, rather than demanding differentiability everywhere or accepting only continuity.
[explanation: Why Almost-Everywhere Differentiability Is the Right Notion]
### Everywhere Differentiability Is Too Strong
Requiring differentiability at every point excludes natural and important functions. The distance function $d_A$ is not differentiable on the cut locus of $A$ (the set of points equidistant from multiple points of $A$), yet it is $1$-Lipschitz and its gradient has unit length at almost every point outside $\overline{A}$. Convex functions, which appear in Chapter 7, are differentiable almost everywhere, indeed on a dense $G_\delta$ set, but may fail to be differentiable on a countable dense set. Sobolev functions in $W^{1,p}$ with $p \le n$ (and $n \ge 2$) may fail to be continuous, let alone differentiable, at every point, yet they are approximately differentiable almost everywhere; for $p > n$ they are differentiable almost everywhere in the classical sense (as we will see in Chapters 5–6). Insisting on everywhere differentiability would exclude all of these classes.
### Measure-Zero Exceptional Sets Are Invisible to Integration
The area formula and coarea formula compute integrals. Integrals do not see sets of measure zero: changing an integrand on an $\mathcal{L}^n$-null set leaves $\int g \, d\mathcal{L}^n$ unchanged, and a Lipschitz map sends $\mathcal{L}^n$-null sets to $\mathcal{H}^n$-null sets, so the exceptional set contributes nothing to the image-side integral $\int_{\mathbb{R}^m} (\text{fiber count}) \, d\mathcal{H}^n$ either. Therefore, differentiability on a full-measure set is exactly what is needed to define the integrand (the Jacobian), and differentiability outside that set is irrelevant to the formula. This is the same philosophy that underlies Sobolev theory: the function $u \in W^{1,p}(U)$ has a gradient $\nabla u$ defined almost everywhere, and that is sufficient to write down the PDE it satisfies weakly.
### Lebesgue Points and the Differentiation Theorem
The Lebesgue differentiation theorem (from GMT I) asserts that for any $f \in L^1_{\mathrm{loc}}(\mathbb{R}^n)$, almost every $x$ is a Lebesgue point:
\begin{align*}
\lim_{r \to 0} \frac{1}{\mathcal{L}^n(B(x,r))} \int_{B(x,r)} |f(y) - f(x)|\, d\mathcal{L}^n(y) = 0.
\end{align*}
This theorem already embodies the philosophy that measure-zero exceptional sets are the natural remainder in analysis on $\mathbb{R}^n$. Rademacher's theorem is the stronger, derivative-level version of the same principle: Lipschitz functions are "first-order Lebesgue regular" at almost every point.
[/explanation]
These considerations explain why Rademacher's theorem is not merely a technical lemma but a structural fact about the interaction between measure theory and analysis. It is the theorem that places Lipschitz functions on the same footing as $C^1$ functions for the purposes of integration theory, and it is the gateway to everything in this course. The next chapter makes the differential structure of Lipschitz maps quantitatively precise by studying how linear maps distort volume, which is the linear algebra behind the Jacobian factors in the area and coarea formulas.
# 2. Linear Maps and Jacobians
The area and coarea formulas, which lie at the heart of this course, require a precise notion of how a linear map distorts volume. For square matrices, $|\det L|$ plays that role; but the maps we care about — parametrizations of surfaces, projections onto lower-dimensional slices — are almost never square. This chapter develops the correct replacement: the $m$-dimensional Jacobian $J_m L$, defined for any linear map $L: \mathbb{R}^m \to \mathbb{R}^n$ regardless of the relationship between $m$ and $n$. The construction passes through the singular value decomposition, which gives both the definition and its geometric meaning in a single stroke.
## When the Determinant Fails
Suppose you want to measure how a linear map $L: \mathbb{R}^m \to \mathbb{R}^n$ distorts $m$-dimensional volume. When $m = n$, the answer is $|\det L|$: the image of a unit cube in $\mathbb{R}^m$ has Lebesgue measure $|\det L|$ in $\mathbb{R}^n = \mathbb{R}^m$. But suppose $m = 1$ and $n = 3$: $L$ maps a line into space. The image of the unit interval $[0,1]$ under $L$ is a line segment of length $|L(e_1)|$, which is a perfectly good notion of 1-dimensional volume distortion, but $\det L$ is not even defined for a $3 \times 1$ matrix.
More generally, for a map $L: \mathbb{R}^m \to \mathbb{R}^n$ with $m < n$ (the case relevant to the area formula), the Jacobian matrix $JL \in \mathbb{R}^{n \times m}$ is tall and rectangular. Its determinant is undefined. One might try to use the determinant of the square $m \times m$ matrix $JL^\top JL$, and indeed this is the right idea; but to understand why, and to connect it to geometry, one must first understand what the matrix $JL^\top JL$ encodes.
[definition: Gram Matrix of a Linear Map]
Let $L: \mathbb{R}^m \to \mathbb{R}^n$ be a linear map with Jacobian matrix $JL \in \mathbb{R}^{n \times m}$. The **Gram matrix** of $L$ is the $m \times m$ symmetric matrix
\begin{align*}
G_L := JL^\top JL.
\end{align*}
Since $(G_L v) \cdot v = |JL v|^2 \ge 0$ for all $v \in \mathbb{R}^m$, the Gram matrix is positive semi-definite. It is positive definite if and only if $L$ is injective.
[/definition]
The Gram matrix appears naturally when computing inner products on the image. If $v, w \in \mathbb{R}^m$, then $L(v) \cdot L(w) = (JL v) \cdot (JL w) = v^\top G_L w$. The matrix $G_L$ thus encodes the geometry that $L$ induces on $\mathbb{R}^m$ when you pull back the Euclidean structure from $\mathbb{R}^n$. Its determinant measures the $m$-dimensional volume distortion, as we now make precise.
[definition: $m$-Dimensional Jacobian of a Linear Map]
Let $L: \mathbb{R}^m \to \mathbb{R}^n$ be a linear map. The **$m$-dimensional Jacobian** of $L$ is
\begin{align*}
J_m L := \sqrt{\det(JL^\top JL)}.
\end{align*}
When $m > n$ (the wide, "coarea" case), one instead uses the $n \times n$ matrix $JL \cdot JL^\top$, and the **$n$-dimensional Jacobian** is $J_n L := \sqrt{\det(JL \cdot JL^\top)}$.
[/definition]
The two definitions are consistent when $m = n$: both give $|\det JL|$. The geometric content of $J_m L$ — that it measures $\mathcal{H}^m$-volume distortion — will become transparent once we have the singular value decomposition in hand.
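The definition can be evaluated directly for small matrices. The sketch below is a minimal pure-Python illustration, storing $JL$ as a list of rows; the helper names `transpose`, `matmul`, `det`, and `jacobian` are assumptions made for this example, and a numerical linear algebra library would normally be used instead.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, sign, result = len(M), 1.0, 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(M[r][i]))
        if M[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            sign = -sign
        result *= M[i][i]
        for r in range(i + 1, n):
            factor = M[r][i] / M[i][i]
            for c in range(i, n):
                M[r][c] -= factor * M[i][c]
    return sign * result

def jacobian(JL):
    """J_m L for a tall or square n x m matrix (m <= n); J_n L for a wide one (m > n)."""
    n, m = len(JL), len(JL[0])
    gram = matmul(transpose(JL), JL) if m <= n else matmul(JL, transpose(JL))
    return math.sqrt(max(det(gram), 0.0))  # clamp tiny negative round-off

tall = jacobian([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # R^2 -> R^3, Gram = [[2, 1], [1, 2]]
square = jacobian([[0.0, -2.0], [3.0, 0.0]])             # equals |det JL|
wide = jacobian([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])      # coordinate projection R^3 -> R^2
```

The square case reproduces $|\det JL| = 6$, and the wide case shows a non-injective map with non-zero $n$-dimensional Jacobian.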
## Singular Values and the SVD
The singular value decomposition is the tool that simultaneously diagonalizes a non-square matrix and reads off the geometry of the linear map. The existence of the SVD is a theorem, but its proof is constructive via the spectral theorem for symmetric matrices.
[quotetheorem:3071]
The proof rests on the spectral theorem. The Gram matrix $G_L = JL^\top JL$ is symmetric and positive semi-definite, so it is diagonalized by an orthonormal basis $\{q_1, \dots, q_m\}$ of $\mathbb{R}^m$ with corresponding non-negative eigenvalues $\lambda_1 \ge \cdots \ge \lambda_m \ge 0$. Setting $\sigma_i = \sqrt{\lambda_i}$ and extending $\{JL q_i / \sigma_i\}$ (for $\sigma_i > 0$) to an orthonormal basis of $\mathbb{R}^n$ gives the matrices $Q$ and $P$.
The singular values of $L$ are related to eigenvalues: $\sigma_i^2$ is the $i$-th eigenvalue of $G_L = JL^\top JL$. The $n \times n$ matrix $JL \cdot JL^\top$ has the same nonzero eigenvalues (since nonzero eigenvalues of $AB$ and $BA$ coincide), padded with $n - m$ additional zeros when $n > m$. In particular, singular values are invariant under orthogonal transformations of the domain or codomain.
[explanation: Geometric Meaning of Singular Values]
The SVD $JL = P\Sigma Q^\top$ decomposes $L$ into three steps:
1. Rotate the domain $\mathbb{R}^m$ by $Q^\top$ to align with the "principal axes" of $L$.
2. Scale axis $i$ by $\sigma_i$ (and embed into $\mathbb{R}^n$ by padding with zeros).
3. Rotate the codomain $\mathbb{R}^n$ by $P$.
The unit ball $B \subseteq \mathbb{R}^m$ maps under $L$ to the ellipsoid $L(B) \subseteq \mathbb{R}^n$ with semi-axes $\sigma_1 \ge \cdots \ge \sigma_m \ge 0$ along the columns of $P$. The $\mathcal{H}^m$-measure of this ellipsoid is $\omega_m \cdot \sigma_1 \cdots \sigma_m$, where $\omega_m$ is the volume of the unit $m$-ball. This is the geometric definition of the $m$-dimensional Jacobian.
[/explanation]
With the SVD in hand, the Jacobian formula becomes a one-line computation.
[quotetheorem:3072]
[citeproof:3072]
This formula makes clear that $J_m L$ depends only on the singular values, not on the choice of $P$ or $Q$ in the SVD. The Jacobian vanishes precisely when at least one singular value is zero, which happens exactly when $L$ has a nonzero kernel.
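For a map with two-dimensional domain the singular values can be written in closed form, which gives a quick check that their product agrees with $\sqrt{\det(JL^\top JL)}$. The sketch below assumes $JL$ is an $n \times 2$ matrix stored as a list of rows; the function name `singular_values_two_columns` is hypothetical.

```python
import math

def singular_values_two_columns(JL):
    """Singular values of an n x 2 matrix via the eigenvalues of its 2 x 2 Gram matrix.

    Returns (sigma_1, sigma_2, sqrt(det G)) with sigma_1 >= sigma_2 >= 0.
    """
    g11 = sum(row[0] * row[0] for row in JL)
    g12 = sum(row[0] * row[1] for row in JL)
    g22 = sum(row[1] * row[1] for row in JL)
    trace, determinant = g11 + g22, g11 * g22 - g12 * g12
    # Eigenvalues of a symmetric 2 x 2 matrix from its trace and determinant.
    disc = math.sqrt(max(trace * trace - 4.0 * determinant, 0.0))
    lam1 = (trace + disc) / 2.0
    lam2 = max((trace - disc) / 2.0, 0.0)
    return math.sqrt(lam1), math.sqrt(lam2), math.sqrt(max(determinant, 0.0))

# Injective example: Gram matrix [[2, 1], [1, 2]] with eigenvalues 3 and 1.
s1, s2, gram_jacobian = singular_values_two_columns([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Rank-one example: the second column is twice the first, so the kernel is non-trivial.
t1, t2, degenerate_jacobian = singular_values_two_columns([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

In the rank-one example the smaller singular value is zero and the Jacobian vanishes, matching the kernel criterion stated above.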
**Hypothesis necessity.** The formula $J_m L = \prod \sigma_i$ uses the assumption $m \le n$ only implicitly: when $m > n$, every linear map $L: \mathbb{R}^m \to \mathbb{R}^n$ has $\dim \ker L \ge m - n > 0$ and hence $J_m L = 0$ regardless of $L$. This is geometrically correct — you cannot embed $m$-dimensional volume into an $n$-dimensional space when $m > n$ — but it means the area formula becomes degenerate (the formula gives zero for every $L$) in that regime. The coarea formula handles the complementary case $m \ge n$ using the $n$-dimensional Jacobian $J_n L = \sqrt{\det(JL \cdot JL^\top)}$, which can be nonzero even when $L$ is not injective.
## Special Cases: Recovering Classical Formulas
Three special cases of $J_m L$ connect the general definition to familiar objects, and each illuminates a different aspect of the formula.
[example: Square Maps and the Classical Determinant]
When $m = n$, the Gram matrix is $JL^\top JL$ with $JL$ a square matrix, so multiplicativity of the determinant gives $\det(JL^\top JL) = \det(JL^\top) \det(JL) = (\det JL)^2$. Thus
\begin{align*}
J_n L = \sqrt{(\det JL)^2} = |\det JL|.
\end{align*}
The absolute value is essential: the classical determinant is signed (it records orientation), while $J_m L$ is always non-negative. For the area formula, orientation plays no role — we count volumes, not oriented volumes. A rotation has $J_n L = 1$ (it preserves $n$-dimensional volume), and a reflection also has $J_n L = 1$ even though $\det JL = -1$.
[/example]
The square case shows that $J_m L$ is the correct unsigned analogue of the determinant. The next case probes the boundary of the theory.
[example: Curves in $\mathbb{R}^n$ and Arc Length]
When $m = 1$, the map $L: \mathbb{R}^1 \to \mathbb{R}^n$ is determined by the single vector $v = L(e_1) \in \mathbb{R}^n$, and $JL = v$ (as a column vector). The Gram matrix is the $1 \times 1$ matrix $JL^\top JL = v^\top v = |v|^2$, so
\begin{align*}
J_1 L = |v| = |L(e_1)|.
\end{align*}
This is the length of the image of the unit vector, which is exactly the arc-length element of a parametric curve. If $\gamma: [0,1] \to \mathbb{R}^n$ is $C^1$, the arc length is $\int_0^1 |\dot\gamma(t)|\, d\mathcal{L}^1(t) = \int_0^1 J_1 D\gamma_t\, d\mathcal{L}^1(t)$, showing that the area formula for $m = 1$ reduces to the classical arc-length formula.
[/example]
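The arc-length identity is easy to test numerically. The sketch below approximates $\int J_1 D\gamma_t \, dt$ with a midpoint rule for two curves whose lengths are known in closed form; the function names and the choice of curves (a helix and a parabola) are illustrative assumptions, and the velocities are supplied by hand.

```python
import math

def arc_length(velocity, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of J_1 = |gamma'(t)| over [a, b]."""
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        t = a + (k + 0.5) * h
        total += math.sqrt(sum(c * c for c in velocity(t)))
    return total * h

# Helix gamma(t) = (cos 2*pi*t, sin 2*pi*t, t) on [0, 1]; constant speed sqrt(4*pi^2 + 1).
def helix_velocity(t):
    w = 2.0 * math.pi
    return (-w * math.sin(w * t), w * math.cos(w * t), 1.0)

# Parabola gamma(t) = (t, t^2) on [0, 1]; speed sqrt(1 + 4 t^2).
def parabola_velocity(t):
    return (1.0, 2.0 * t)

helix_length = arc_length(helix_velocity, 0.0, 1.0)
helix_exact = math.sqrt(4.0 * math.pi ** 2 + 1.0)

parabola_length = arc_length(parabola_velocity, 0.0, 1.0)
parabola_exact = math.sqrt(5.0) / 2.0 + math.asinh(2.0) / 4.0
```

Both numerical values agree with the closed-form lengths up to quadrature error.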
This example probes the formula at its simplest structural case: a one-dimensional domain, where the Gram matrix is a scalar, no diagonalization is needed, and the geometric content (length of the image vector) is immediately apparent. The next case is the one that appears directly in the theory of hypersurface parametrizations.
[example: Hypersurfaces and the Surface Area Element]
When $m = n - 1$, the map $L: \mathbb{R}^{n-1} \to \mathbb{R}^n$ arises as the derivative at a point of a parametrization $r$ of an $(n-1)$-dimensional surface. The Gram matrix $JL^\top JL \in \mathbb{R}^{(n-1) \times (n-1)}$ is the classical **first fundamental form** of differential geometry. Writing $L(e_i) = \partial_i r$ for the coordinate tangent vectors of the parametrization $r$, the entries are $(G_L)_{ij} = \partial_i r \cdot \partial_j r$, and
\begin{align*}
J_{n-1} L = \sqrt{\det G_L}.
\end{align*}
In the classical $n = 3$ case, this is $|\partial_1 r \times \partial_2 r|$, the area element familiar from multivariable calculus. The area formula for $m = n-1$ will therefore reproduce the standard surface area integral $\int \sqrt{\det G}\, d\mathcal{L}^{n-1}$ as a special case.
[/example]
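As a worked instance, take the graph parametrization $r(u,v) = (u, v, h(u,v))$ of a $C^1$ function $h$, chosen here purely for illustration. Then $\partial_1 r = (1, 0, h_u)$ and $\partial_2 r = (0, 1, h_v)$, so
\begin{align*}
G_L = \begin{pmatrix} 1 + h_u^2 & h_u h_v \\ h_u h_v & 1 + h_v^2 \end{pmatrix},
\qquad
\det G_L = (1 + h_u^2)(1 + h_v^2) - h_u^2 h_v^2 = 1 + h_u^2 + h_v^2,
\end{align*}
and therefore $J_2 L = \sqrt{1 + |\nabla h|^2}$, the familiar area element of a graph.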
This structural example shows that $J_{n-1}$ is not a new object but rather a coordinate-free encoding of the classical surface area element. Recognizing this connection motivates studying $J_m$ for general $m$: it is the canonical generalization of the area element to $m$-dimensional surfaces in $\mathbb{R}^n$.
## The Cauchy-Binet Formula: Computing the Jacobian in Coordinates
The definition $J_m L = \sqrt{\det(JL^\top JL)}$ requires computing a determinant of an $m \times m$ matrix formed from an $n \times m$ matrix. The Cauchy-Binet formula gives an alternative expression that is often more useful for explicit computations.
[quotetheorem:3073]
The proof of Cauchy-Binet follows from expanding the determinant of $A^\top A$ using the Leibniz formula and regrouping terms; see Evans-Gariepy for the full computation. The geometric meaning is illuminating: each term $(\det JL_S)^2$ is the squared $m$-dimensional volume of the projection of the parallelepiped $L([0,1]^m)$ onto the coordinate subspace $\mathbb{R}^S \subset \mathbb{R}^n$. The total Jacobian is the square root of the sum of squared projected volumes: a Pythagorean theorem for $m$-dimensional volumes.
[example: Jacobian of a $2 \times 3$ Map via Cauchy-Binet]
Let $L: \mathbb{R}^2 \to \mathbb{R}^3$ have Jacobian matrix
\begin{align*}
JL = \begin{pmatrix} a & b \\ c & d \\ e & f \end{pmatrix}.
\end{align*}
The three 2-element subsets of $\{1,2,3\}$ are $\{1,2\}$, $\{1,3\}$, $\{2,3\}$, giving submatrices
\begin{align*}
JL_{\{1,2\}} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad
JL_{\{1,3\}} = \begin{pmatrix} a & b \\ e & f \end{pmatrix}, \quad
JL_{\{2,3\}} = \begin{pmatrix} c & d \\ e & f \end{pmatrix}.
\end{align*}
Cauchy-Binet gives
\begin{align*}
J_2 L = \sqrt{(ad - bc)^2 + (af - be)^2 + (cf - de)^2}.
\end{align*}
This is exactly the magnitude of the cross product $L(e_1) \times L(e_2)$, which is the classical area element for a surface in $\mathbb{R}^3$. The three terms $(ad-bc)^2$, $(af-be)^2$, $(cf-de)^2$ are the squares of the areas of the three coordinate-plane projections of the parallelogram spanned by $L(e_1)$ and $L(e_2)$ — a vivid confirmation that Cauchy-Binet encodes the Pythagorean formula for projected areas.
[/example]
This example is structural in the following sense: it identifies the three 2-element subsets of $\{1,2,3\}$ with the three coordinate planes, and shows that the Jacobian collapses to a formula one can verify independently via the cross product. This gives both a computational check and a geometric calibration of the Cauchy-Binet sum.
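The three descriptions of $J_2 L$ used above (Gram determinant, sum of squared minors, cross product) can be compared numerically. The following is a minimal sketch in plain Python; the matrix `A` and the helper names are illustrative choices, not taken from the text.

```python
import math
from itertools import combinations

def gram_jacobian(A):
    """J_2 = sqrt(det(A^T A)) for an n x 2 matrix A given as a list of rows."""
    g11 = sum(r[0] * r[0] for r in A)
    g12 = sum(r[0] * r[1] for r in A)
    g22 = sum(r[1] * r[1] for r in A)
    return math.sqrt(g11 * g22 - g12 * g12)

def cauchy_binet_jacobian(A):
    """J_2 as the square root of the sum of squared 2 x 2 minors of A."""
    total = 0.0
    for i, j in combinations(range(len(A)), 2):
        minor = A[i][0] * A[j][1] - A[i][1] * A[j][0]
        total += minor * minor
    return math.sqrt(total)

def cross_norm(u, v):
    """Euclidean norm of the cross product of two vectors in R^3."""
    c = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    return math.sqrt(sum(x * x for x in c))

# Illustrative matrix (an assumption); its columns play the role of L(e_1), L(e_2).
A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
col1 = [r[0] for r in A]
col2 = [r[1] for r in A]

via_gram = gram_jacobian(A)
via_minors = cauchy_binet_jacobian(A)
via_cross = cross_norm(col1, col2)
print(via_gram, via_minors, via_cross)  # the three values should agree up to rounding
```

For this particular matrix the Gram determinant is $10 \cdot 6 - (-1)^2 = 59$ and the squared minors are $1 + 49 + 9 = 59$, so all three routes give $\sqrt{59}$.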
## The Jacobian of a Nonlinear Map
Having defined $J_m L$ for linear maps, we extend it to nonlinear maps by linearizing. For the area formula, the relevant setting is $f: \mathbb{R}^m \to \mathbb{R}^n$ Lipschitz continuous with $m \le n$. By Rademacher's theorem (Chapter 1), such $f$ is differentiable $\mathcal{L}^m$-a.e., so the total derivative $Df_x: \mathbb{R}^m \to \mathbb{R}^n$ exists for a.e. $x$.
[definition: $m$-Dimensional Jacobian of a Lipschitz Map]
Let $f: \mathbb{R}^m \to \mathbb{R}^n$ be Lipschitz with $m \le n$. The **$m$-dimensional Jacobian** of $f$ at $x$ is
\begin{align*}
J_m f(x) := J_m (Df_x) = \sqrt{\det(Jf_x^\top Jf_x)},
\end{align*}
defined $\mathcal{L}^m$-a.e. by Rademacher's theorem. Here $Jf_x \in \mathbb{R}^{n \times m}$ is the Jacobian matrix of $f$ at $x$.
[/definition]
This definition makes $x \mapsto J_m f(x)$ a measurable function of $x$ (since $x \mapsto Jf_x$ is measurable and the map $A \mapsto \sqrt{\det(A^\top A)}$ is continuous). The Lipschitz bound on $f$ gives a global $L^\infty$ bound: if $f$ has Lipschitz constant $\text{Lip}(f)$, then $|Jf_x v| \le \text{Lip}(f)|v|$ for a.e. $x$ and all $v$, which implies $J_m f(x) \le \text{Lip}(f)^m$ a.e.
[remark: Why Lipschitz Is the Right Regularity]
The definition only requires $f$ to be differentiable a.e., not everywhere. Asking for $C^1$ would be too strong for geometric applications — many parametrizations and level sets of Sobolev functions are only Lipschitz. The key is that Rademacher's theorem guarantees almost everywhere differentiability for free, so $J_m f$ is defined a.e. without any additional assumption beyond Lipschitz.
[/remark]
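The a.e. nature of the definition is not merely a technicality: the area formula will integrate $J_m f$ over the domain, and sets of measure zero do not affect the integral. Since $J_m f$ is measurable, non-negative, and bounded by $\text{Lip}(f)^m$, its integral over any measurable set is well defined, and finite whenever the set has finite $\mathcal{L}^m$-measure. This is what Chapter 3 needs.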
## The Coarea Jacobian
For the coarea formula, the relevant setting reverses the dimensional inequality: now $f: \mathbb{R}^n \to \mathbb{R}^k$ with $n \ge k$. The map is "wider than it is tall," and we want to measure the $k$-dimensional volume distortion of the projection $Df_x$.
[definition: Coarea Jacobian]
Let $f: \mathbb{R}^n \to \mathbb{R}^k$ be Lipschitz with $n \ge k$. The **coarea Jacobian** (or **$k$-dimensional Jacobian**) of $f$ at $x$ is
\begin{align*}
J_k f(x) := \sqrt{\det(Jf_x \cdot Jf_x^\top)},
\end{align*}
defined $\mathcal{L}^n$-a.e. Here $Jf_x \in \mathbb{R}^{k \times n}$ is the Jacobian matrix, and $Jf_x \cdot Jf_x^\top \in \mathbb{R}^{k \times k}$ is a square $k \times k$ matrix.
[/definition]
The distinction between $JL^\top JL$ (used for the area Jacobian) and $JL \cdot JL^\top$ (used for the coarea Jacobian) is dictated by which matrix is square. When $m \le n$, $JL \in \mathbb{R}^{n \times m}$ so $JL^\top JL \in \mathbb{R}^{m \times m}$ is the smaller square matrix; when $n \ge k$, $Jf \in \mathbb{R}^{k \times n}$ so $Jf \cdot Jf^\top \in \mathbb{R}^{k \times k}$ is the smaller square matrix. In both cases, $J$ is defined as the square root of the determinant of the smaller Gram matrix.
[remark: Relationship Between the Two Jacobians]
The non-zero singular values of $JL$ and $JL^\top$ are the same, so $J_m(L) = \sqrt{\det(JL^\top JL)}$ and $\sqrt{\det(JL \cdot JL^\top)}$ agree when both are genuinely square (i.e., $m = n$). More precisely, the nonzero eigenvalues of $AB$ and $BA$ always coincide for any matrices $A, B$. This means the two definitions are consistent in the square case $m = n = k$: both give $|\det JL|$.
[/remark]
The coarea Jacobian $J_k f(x)$ measures $k$-dimensional volume distortion transverse to the kernel of $Df_x: \mathbb{R}^n \to \mathbb{R}^k$. When $Df_x$ is surjective, $J_k f(x)$ is the $k$-dimensional volume of the image of a unit $k$-cube lying in the $k$-dimensional subspace $(\ker Df_x)^\perp$; when $Df_x$ is not surjective, $J_k f(x) = 0$. Geometrically, $Df_x$ is a projection (in general, oblique) and $J_k f(x)$ is the compression factor. The coarea formula integrates this factor to relate integrals over $\mathbb{R}^n$ to integrals over the level sets $f^{-1}(y) \subset \mathbb{R}^n$.
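By definition, the coarea Jacobian of a $k \times n$ matrix $A$ is the area Jacobian of its transpose $A^\top$. The sketch below checks this for $k = 2$, $n = 3$ and confirms that a rank-deficient matrix has coarea Jacobian zero. The two matrices and the function names are illustrative assumptions.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def coarea_jacobian_2(A):
    """sqrt(det(A A^T)) for a 2 x n matrix A (clamped at 0 against rounding)."""
    return math.sqrt(max(det2(matmul(A, transpose(A))), 0.0))

def area_jacobian_2(B):
    """sqrt(det(B^T B)) for an n x 2 matrix B (clamped at 0 against rounding)."""
    return math.sqrt(max(det2(matmul(transpose(B), B)), 0.0))

# Illustrative 2 x 3 matrices (assumptions): one surjective, one of rank one.
surjective = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
rank_one = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]

print(coarea_jacobian_2(surjective), area_jacobian_2(transpose(surjective)))
print(coarea_jacobian_2(rank_one))
```

For the surjective matrix, $A A^\top = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix}$ has determinant $6$, so both functions return $\sqrt{6}$.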
[example: The Coarea Jacobian for a Scalar Function]
Let $f: \mathbb{R}^n \to \mathbb{R}$ (so $k = 1$). The Jacobian matrix $Jf_x = \nabla f(x)^\top$ is a $1 \times n$ row vector. Then
\begin{align*}
Jf_x \cdot Jf_x^\top = |\nabla f(x)|^2 \in \mathbb{R}^{1 \times 1},
\end{align*}
so $J_1 f(x) = |\nabla f(x)|$. The coarea formula for a scalar Lipschitz function $f: \mathbb{R}^n \to \mathbb{R}$ will therefore read
\begin{align*}
\int_{\mathbb{R}^n} g(x) |\nabla f(x)|\, d\mathcal{L}^n(x) = \int_{\mathbb{R}} \int_{f^{-1}(t)} g(x)\, d\mathcal{H}^{n-1}(x)\, d\mathcal{L}^1(t),
\end{align*}
which is the classical co-area formula from geometric analysis, typically derived for $C^1$ functions by the implicit function theorem. The GMT approach extends this to all Lipschitz $f$ using Rademacher's theorem and Hausdorff measure, and uses the coarea Jacobian $J_k f$ in place of $|\nabla f|$ when the target is $\mathbb{R}^k$ with $k > 1$.
[/example]
This example probes the formula at $k = 1$, the case where the level sets are $(n-1)$-dimensional and the formula is most frequently applied (e.g., to compute surface areas from scalar functions). Verifying that $J_1 f = |\nabla f|$ in this case also checks that the definition of the coarea Jacobian is not only abstractly consistent with singular values but produces the right integrand for a formula we already know should hold.
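As a numerical sanity check of the scalar formula, the following sketch compares both sides for the illustrative choice $f(x,y) = x^2 + y^2$ and $g = \mathbf{1}_{B(0,1)}$ in $\mathbb{R}^2$. Neither choice appears in the text, and $f$ is Lipschitz only on the bounded region involved. Here $|\nabla f(x,y)| = 2\sqrt{x^2 + y^2}$ and, for $0 < t < 1$, the level set $f^{-1}(t)$ is a circle of radius $\sqrt{t}$ with $\mathcal{H}^1$-measure $2\pi\sqrt{t}$, so both sides should be close to $4\pi/3$. The grid sizes are arbitrary.

```python
import math

def left_side(n=400):
    """Midpoint rule for the integral of g * |grad f| over [-1, 1]^2,
    with f(x, y) = x^2 + y^2 and g the indicator of the open unit disc."""
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        x = -1.0 + (i + 0.5) * h
        for j in range(n):
            y = -1.0 + (j + 0.5) * h
            r2 = x * x + y * y
            if r2 < 1.0:                       # g = 1 inside the disc
                total += 2.0 * math.sqrt(r2) * h * h   # |grad f| = 2 r
    return total

def right_side(n=20000):
    """Midpoint rule in t for the integral over (0, 1) of H^1(f^{-1}(t)),
    where the level set is a circle of radius sqrt(t)."""
    dt = 1.0 / n
    return sum(2.0 * math.pi * math.sqrt((k + 0.5) * dt) * dt for k in range(n))

exact = 4.0 * math.pi / 3.0
print(left_side(), right_side(), exact)
```

The Cartesian grid on the left deliberately avoids polar coordinates, so the agreement is not built into the discretization.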
## The Polar Decomposition and a Structural Perspective
The SVD can be refined to a **polar decomposition** that further clarifies the geometry of $L$. Every linear map $L: \mathbb{R}^m \to \mathbb{R}^n$ with $m \le n$ admits a decomposition $JL = P S$ where $P: \mathbb{R}^m \to \mathbb{R}^n$ is an **isometric embedding** (i.e., $P^\top P = I_m$, so $P$ preserves inner products) and $S: \mathbb{R}^m \to \mathbb{R}^m$ is symmetric positive semi-definite. The matrix $S$ is uniquely determined by $S = \sqrt{JL^\top JL}$ (the symmetric positive semi-definite square root of the Gram matrix), and $J_m L = \det S$.
[quotetheorem:3074]
The proof uses the SVD. Write $JL = P_0 \Sigma Q^\top$ with $P_0 \in O(n)$, $Q \in O(m)$, and $\Sigma \in \mathbb{R}^{n \times m}$ carrying $\sigma_1, \dots, \sigma_m$ on its diagonal. Some care is needed because $P$ must be $n \times m$, not $n \times n$: set $P = P_0 \begin{pmatrix} I_m \\ 0 \end{pmatrix} Q^\top$ and $S = Q \operatorname{diag}(\sigma_1,\dots,\sigma_m) Q^\top$. Then $P^\top P = I_m$ and $PS = P_0 \Sigma Q^\top = JL$. The uniqueness of $S$ follows because $S^2 = JL^\top JL$, and a positive semi-definite square root of a positive semi-definite matrix is unique.
The polar decomposition splits the action of $L$ into a "pure deformation" $S$ (which may shrink or expand axes but preserves orientation of the domain) and a "pure embedding" $P$ (which places the deformed shape isometrically into $\mathbb{R}^n$). The Jacobian $J_m L = \det S$ belongs entirely to the deformation part. This perspective is useful when proving the area formula, where one often needs to separate the combinatorial counting of overlaps (which depends on $P$) from the volume distortion (which depends on $S$).
**Hypothesis necessity.** The polar decomposition requires $m \le n$ in the form stated. When $m > n$, no isometric embedding $\mathbb{R}^m \hookrightarrow \mathbb{R}^n$ exists (a linear isometric embedding is injective, so it cannot lower dimension), and the decomposition takes the different form $JL = S' P'^\top$, where $S' = \sqrt{JL \cdot JL^\top}$ is symmetric positive semi-definite on $\mathbb{R}^n$ and $P': \mathbb{R}^n \to \mathbb{R}^m$ is an isometric embedding; equivalently, one factors $JL^\top = P'S'$ and transposes. The uniqueness of $S$ also requires positive semi-definiteness: flipping the sign of $S$ on any eigenspace with nonzero eigenvalue gives another symmetric matrix with the same square, and if symmetry is dropped as well there are further solutions (for instance, when $m \ge 2$ and $L = 0$, every nilpotent $N$ with $N^2 = 0$).
The following remark summarizes the relationships between the two Jacobians across the main cases that arise in subsequent chapters.
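For $m = 2$ the factors of the polar decomposition can be written down explicitly, which makes the identities $P^\top P = I_2$ and $J_2 L = \det S$ easy to check by machine. The sketch below uses the closed form $\sqrt{G} = (G + \sqrt{\det G}\, I)/\sqrt{\operatorname{tr} G + 2\sqrt{\det G}}$ for a $2 \times 2$ symmetric positive definite matrix $G$ (a consequence of the Cayley-Hamilton theorem). This shortcut is an assumption of the sketch: it is valid only for $m = 2$ and full rank and is not the SVD route described above. The matrix `A` is illustrative.

```python
import math

def polar_3x2(A):
    """Return (P, S) with A = P S for a rank-2 matrix A with two columns."""
    # Gram matrix G = A^T A, symmetric positive definite when rank A = 2.
    g11 = sum(r[0] * r[0] for r in A)
    g12 = sum(r[0] * r[1] for r in A)
    g22 = sum(r[1] * r[1] for r in A)
    s = math.sqrt(g11 * g22 - g12 * g12)        # sqrt(det G), equal to J_2
    t = math.sqrt(g11 + g22 + 2.0 * s)
    S = [[(g11 + s) / t, g12 / t],
         [g12 / t, (g22 + s) / t]]              # S = sqrt(G), 2 x 2 closed form
    det_s = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    S_inv = [[S[1][1] / det_s, -S[0][1] / det_s],
             [-S[1][0] / det_s, S[0][0] / det_s]]
    # P = A S^{-1}, computed row by row.
    P = [[r[0] * S_inv[0][0] + r[1] * S_inv[1][0],
          r[0] * S_inv[0][1] + r[1] * S_inv[1][1]] for r in A]
    return P, S

# Illustrative 3 x 2 matrix of rank 2 (an assumption).
A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
P, S = polar_3x2(A)

PtP = [[sum(P[k][i] * P[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
PS = [[sum(P[k][l] * S[l][j] for l in range(2)) for j in range(2)] for k in range(3)]
det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
print(PtP, det_S)
```

Here `PtP` should be close to the identity, `PS` should reproduce `A`, and `det_S` should match $\sqrt{\det(A^\top A)} = \sqrt{59}$.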
[remark: Summary of Jacobian Conventions]
The two Jacobians $J_m$ and $J_k$ cover complementary regimes:
- **Area formula** ($m \le n$, $f: \mathbb{R}^m \to \mathbb{R}^n$): use $J_m f(x) = \sqrt{\det(Jf_x^\top Jf_x)}$.
- **Coarea formula** ($n \ge k$, $f: \mathbb{R}^n \to \mathbb{R}^k$): use $J_k f(x) = \sqrt{\det(Jf_x \cdot Jf_x^\top)}$.
- **Square case** ($m = n$): both give $|\det Jf_x|$.
In the literature (including Evans-Gariepy), the symbol $Jf$ or $J_f$ is often used without a dimension subscript when the dimension of the image is clear from context. Throughout these notes, we always write the subscript to disambiguate.
[/remark]
With the Jacobian defined and its geometric meaning understood through the SVD, singular values, and special cases, the stage is set for Chapter 3. For measurable $A \subseteq \mathbb{R}^m$, the area formula will express $\int_A J_m f(x)\, d\mathcal{L}^m(x)$ as the integral over $\mathbb{R}^n$, with respect to $\mathcal{H}^m$, of the counting function $y \mapsto \#(f^{-1}(y) \cap A)$, extending the classical change of variables formula to all Lipschitz maps and all $m \le n$.
Having established the algebraic machinery of linear maps and Jacobians, Chapter 3 deploys this framework to prove the area formula: a far-reaching generalization of the classical change of variables that accommodates all Lipschitz maps and dimensions $m \le n$, with Jacobian magnitudes encoding how the map stretches measure.
# 3. The Area Formula
The area formula is the first of the two central integral-geometric identities of this course. Where the classical change of variables theorem requires a $C^1$ diffeomorphism between open sets of the same dimension, the area formula works for any Lipschitz map $f: \mathbb{R}^m \to \mathbb{R}^n$ with $m \le n$, and it keeps a careful count of how many source points map to each target point. The key object mediating this count is the $m$-dimensional Jacobian $J_m f$, which was developed in Chapter 2 using the singular value decomposition. This chapter states and proves the area formula, then draws out its most important consequences: the change of variables formula for injective Lipschitz maps, the classical surface area integral, and the pushforward measure identity.
## Why the Classical Change of Variables Is Not Enough
The familiar change of variables formula from multivariable calculus reads: if $\Phi: U \to V$ is a $C^1$ diffeomorphism between open sets in $\mathbb{R}^n$, then for any integrable $g: V \to \mathbb{R}$,
\begin{align*}
\int_V g(y)\, d\mathcal{L}^n(y) = \int_U g(\Phi(x))\, |\det J\Phi_x|\, d\mathcal{L}^n(x).
\end{align*}
This formula breaks down in two directions at once when we try to use it in geometric measure theory. First, the maps we encounter — parametrizations of surfaces, traces of Sobolev functions, Lipschitz images of cubes — are typically not diffeomorphisms: they may fail to be differentiable everywhere, and they need not be injective. Second, when $m < n$ the formula makes no sense as written, because there is no square Jacobian determinant and the target has higher dimension than the source.
The resolution requires two ingredients: the $m$-dimensional Jacobian $J_m f$ from Chapter 2, which measures the infinitesimal $m$-volume distortion of a linear map $\mathbb{R}^m \to \mathbb{R}^n$; and the multiplicity function, which keeps track of how many preimages each target point has. Together these allow one to integrate over the image using Hausdorff measure $\mathcal{H}^m$ rather than Lebesgue measure, which is the only sensible notion of $m$-dimensional volume in $\mathbb{R}^n$ when $m < n$.
## The $m$-Dimensional Jacobian and Multiplicity
Recall from Chapter 2 that for a linear map $L: \mathbb{R}^m \to \mathbb{R}^n$ with $m \le n$, the $m$-dimensional Jacobian is
\begin{align*}
J_m L := \sqrt{\det(L^\top L)},
\end{align*}
which equals the product of the singular values $\sigma_1, \dots, \sigma_m$ of $L$. When $L$ is the identity embedding $\mathbb{R}^m \hookrightarrow \mathbb{R}^n$, the Jacobian is $1$; when $L$ stretches the $m$-volume by a factor, $J_m L$ records exactly that factor. For a Lipschitz map $f: \mathbb{R}^m \to \mathbb{R}^n$, Rademacher's theorem (Chapter 1) guarantees that the total derivative $Df_x: \mathbb{R}^m \to \mathbb{R}^n$ exists for $\mathcal{L}^m$-almost every $x$, so we define
\begin{align*}
J_m f(x) := J_m(Df_x) = \sqrt{\det(Df_x^\top Df_x)}
\end{align*}
for $\mathcal{L}^m$-a.e. $x \in \mathbb{R}^m$. This function is measurable and bounded (by $(\operatorname{Lip}(f))^m$).
The second ingredient is the multiplicity function, which counts fibers.
[definition: Multiplicity Function]
Let $f: \mathbb{R}^m \to \mathbb{R}^n$ be measurable and let $A \subseteq \mathbb{R}^m$ be measurable. The **multiplicity function** of $f$ on $A$ is
\begin{align*}
N(f, A, y) := \mathcal{H}^0(f^{-1}(y) \cap A), \quad y \in \mathbb{R}^n,
\end{align*}
where $\mathcal{H}^0$ is the counting measure. Thus $N(f, A, y)$ is the number of points in $A$ that $f$ maps to $y$, which may be $0$, a finite positive integer, or $+\infty$.
[/definition]
The multiplicity function is the correct weight to place on each target point: if $y \in f(A)$ has three preimages in $A$, then $y$ contributes three times to the integral on the left of the area formula, and $N(f,A,y) = 3$ on the right compensates for this overcounting. Understanding $N(f,A,\cdot)$ requires knowing not just the image of $f$ but its full fiber structure. For an injective map, $N(f,A,y) \in \{0,1\}$ everywhere, and the formula reduces to a clean identity; for a map that wraps around, the multiplicities can be large.
## The Area Formula
The area formula now takes a precise and general form.
[quotetheorem:3075]
The left side integrates the infinitesimal volume distortion over the source set $A$ with respect to Lebesgue measure $\mathcal{L}^m$. The right side integrates the fiber count over the target $\mathbb{R}^n$ with respect to $m$-dimensional Hausdorff measure $\mathcal{H}^m$. The formula equates these two very different-looking integrals. A few immediate observations are in order before we turn to the proof.
When $m = n$ and $f$ is a $C^1$ diffeomorphism, $J_m f = |\det Jf|$, the fiber count $N(f, A, y)$ is $1$ for $y \in f(A)$ and $0$ outside, and $\mathcal{H}^n = \mathcal{L}^n$, so the formula recovers the classical change of variables. When $m < n$, the right side integrates over the $m$-dimensional image $f(A)$ (counted with multiplicity), which has $\mathcal{L}^n$-measure zero, so Lebesgue measure cannot see it and only $\mathcal{H}^m$ gives a meaningful answer.
The hypothesis $J_m f = 0$ on a set $A$ has an interesting consequence: the right side forces $\mathcal{H}^m(f(A)) = 0$, meaning $f$ maps $A$ to an $\mathcal{H}^m$-null set. This is the area formula's version of Sard's theorem: where the derivative is rank-deficient ($J_m f = 0$ means $Df_x$ drops rank, i.e., the image is compressed), the image contributes nothing to the $\mathcal{H}^m$ integral. The hypothesis that $f$ is Lipschitz — rather than merely measurable or $L^1$ — is essential: a measurable map can catastrophically collapse or expand sets in ways that the Jacobian cannot detect, and without Rademacher's theorem, $J_m f$ may not even be defined.
[explanation: Why Lipschitz Is the Natural Hypothesis]
The area formula requires two structural properties of $f$ that are both guaranteed by the Lipschitz condition. First, $f$ must be differentiable $\mathcal{L}^m$-almost everywhere so that $J_m f(x)$ is defined; Rademacher's theorem provides this. Second, the right side is governed by $\mathcal{H}^m$-measures of images, which requires that $f$ does not inflate $\mathcal{L}^m$-null sets into sets of positive $\mathcal{H}^m$-measure. The Lipschitz bound $\mathcal{H}^m(f(E)) \le (\operatorname{Lip}(f))^m \mathcal{L}^m(E)$ (proved in Chapter 1) gives the needed control: null sets map to null sets, so the a.e. definition of $J_m f$ loses nothing.
To see that mere continuity is not enough, consider the Peano-type space-filling curves $f: [0,1] \to [0,1]^2$, which are continuous and surjective onto a set of positive $\mathcal{L}^2$-measure. The classical constructions (Peano's and Hilbert's) are nowhere differentiable, so $J_1 f = |f'|$ is not defined on any set of positive measure and the left side of the area formula has no meaning. The right side, $\int_{\mathbb{R}^2} N(f,[0,1],y)\, d\mathcal{H}^1(y)$, is at least $\mathcal{H}^1([0,1]^2) = +\infty$. Differentiability almost everywhere does not rescue the formula either: the Cantor staircase $c: [0,1] \to [0,1]$ is continuous with $c' = 0$ $\mathcal{L}^1$-a.e., so the left side is $0$, while the right side is at least $\mathcal{H}^1(c([0,1])) = 1$. Without Lipschitz control, there is no mechanism to connect the two integrals.
[/explanation]
## Proof: Three-Step Reduction
The proof of the area formula proceeds by successive approximation: first establish the identity for linear maps using pure linear algebra, then extend to $C^1$ maps by local linearization, and finally pass to Lipschitz maps using Rademacher's theorem and a Whitney-type approximation.
### Linear Maps
The base case is a linear map $L: \mathbb{R}^m \to \mathbb{R}^n$. The key geometric fact is that if $L$ is injective, then for every measurable $A \subseteq \mathbb{R}^m$,
\begin{align*}
\mathcal{H}^m(L(A)) = J_m L \cdot \mathcal{L}^m(A).
\end{align*}
This is proved using the singular value decomposition. Write $L = U \Sigma V^\top$ where $U \in O(n)$, $V \in O(m)$, and $\Sigma$ is the $n \times m$ matrix with diagonal entries $\sigma_1, \dots, \sigma_m \ge 0$. Since Hausdorff measure is rotation-invariant, $\mathcal{H}^m(L(A)) = \mathcal{H}^m(U \Sigma V^\top(A)) = \mathcal{H}^m(\Sigma V^\top(A))$. The map $V^\top$ is an isometry of $\mathbb{R}^m$, so $\mathcal{L}^m(V^\top(A)) = \mathcal{L}^m(A)$. Now $\Sigma: \mathbb{R}^m \to \mathbb{R}^n$ embeds $\mathbb{R}^m$ into the first $m$ coordinates of $\mathbb{R}^n$ and scales the $i$-th axis by $\sigma_i$. Identifying the image of $\Sigma$ with $\mathbb{R}^m$ via the natural projection, we have $\mathcal{H}^m(\Sigma(B)) = \sigma_1 \cdots \sigma_m \cdot \mathcal{L}^m(B)$ for any measurable $B \subseteq \mathbb{R}^m$, since Hausdorff $m$-measure on a flat $m$-dimensional subspace of $\mathbb{R}^n$ agrees with Lebesgue measure. Combining:
\begin{align*}
\mathcal{H}^m(L(A)) = \sigma_1 \cdots \sigma_m \cdot \mathcal{L}^m(A) = J_m L \cdot \mathcal{L}^m(A).
\end{align*}
If $L$ is injective, then $N(L, A, y) = \mathbf{1}_{L(A)}(y)$ and the area formula reads
\begin{align*}
\int_A J_m L\, d\mathcal{L}^m = J_m L \cdot \mathcal{L}^m(A) = \mathcal{H}^m(L(A)) = \int_{\mathbb{R}^n} N(L, A, y)\, d\mathcal{H}^m(y),
\end{align*}
which holds. If $L$ is not injective, then $J_m L = 0$ (because $\det(L^\top L) = 0$ when $L$ has rank less than $m$), and $L(A)$ lies in the linear subspace $L(\mathbb{R}^m)$, whose dimension is at most $m - 1$ and which therefore has $\mathcal{H}^m$-measure zero. The multiplicity $N(L, A, \cdot)$ vanishes outside this null set, so both sides vanish and the formula reduces to $0 = 0$.
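A small worked instance of the linear case may help fix ideas. The matrices below are illustrative choices, not taken from the text. Take $m = 2$, $n = 3$, $A = [0,1]^2$, and let $L$ have columns $(1,1,0)^\top$ and $(0,1,1)^\top$. Then
\begin{align*}
L^\top L = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \det(L^\top L) = 3, \qquad J_2 L = \sqrt{3}.
\end{align*}
The eigenvalues of $L^\top L$ are $3$ and $1$, so the singular values are $\sigma_1 = \sqrt{3}$ and $\sigma_2 = 1$, with product $\sqrt{3}$. Independently, $L(A)$ is the parallelogram spanned by the two columns, whose area is $|(1,1,0) \times (0,1,1)| = |(1,-1,1)| = \sqrt{3}$. Hence
\begin{align*}
\mathcal{H}^2(L(A)) = \sqrt{3} = J_2 L \cdot \mathcal{L}^2(A).
\end{align*}
For a non-injective example, replace the second column by $(2,2,0)^\top$. Then $L^\top L = \begin{pmatrix} 2 & 4 \\ 4 & 8 \end{pmatrix}$ has determinant $0$, and $L(A)$ is a line segment, which is $\mathcal{H}^2$-null.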
### $C^1$ Maps via Linearization
For a $C^1$ map $f: \mathbb{R}^m \to \mathbb{R}^n$, the proof uses local linearization to transfer the linear case to small balls.
Fix a measurable set $A \subseteq \mathbb{R}^m$ and $\varepsilon > 0$. For each $x_0 \in A$, write $f(x) = f(x_0) + Df_{x_0}(x - x_0) + R(x, x_0)$ where the remainder satisfies $|R(x, x_0)| \le \omega(|x - x_0|)|x - x_0|$ for a modulus of continuity $\omega$ with $\omega(\delta) \to 0$ as $\delta \to 0$ (this uses that $Df$ is continuous, hence uniformly continuous on compact sets).
Cover $A$ by balls $B(x_i, r_i)$ with $r_i < \delta$ chosen small enough that $|Df_{x} - Df_{x_0}| < \varepsilon$ for $x, x_0 \in B(x_i, r_i)$. On each such ball, the linear approximation $L_i := Df_{x_i}$ satisfies $|f(x) - f(x_0) - L_i(x - x_0)| \le \varepsilon|x - x_0|$. This means the image $f(B_i)$ is trapped between the images of two "fattened" linear maps: $\mathcal{H}^m(f(B_i)) \le (J_m f(x_i) + C\varepsilon)\, \mathcal{L}^m(B_i)$, with a constant $C$ depending only on $m$ and on a bound for $|Df|$ on $B_i$, and similarly from below.
Taking the balls from a disjoint Vitali subfamily that covers $\mathcal{L}^m$-almost all of $A$, summing, and letting $\varepsilon \to 0$, one obtains
\begin{align*}
\int_A J_m f\, d\mathcal{L}^m = \int_{\mathbb{R}^n} N(f, A, y)\, d\mathcal{H}^m(y).
\end{align*}
The lower bound requires a separate injectivity argument on the set where $J_m f > 0$: on small enough balls around such a point $x_0$, $f$ is close to the injective linear map $Df_{x_0}$ and hence itself injective for $\delta$ small, so that $N(f, B_i, y) = 1$ for every $y \in f(B_i)$. On the set where $J_m f = 0$ no lower bound is needed, since the upper bound already shows that the image is $\mathcal{H}^m$-null. The full argument uses the Vitali covering theorem and monotone convergence to pass from finite covers to the integral identity.
The $C^1$ case is the heart of the proof. The Lipschitz case follows by approximation.
### Lipschitz Maps via Rademacher
The extension from $C^1$ to Lipschitz uses Rademacher's theorem, which guarantees that $f$ is differentiable $\mathcal{L}^m$-almost everywhere.
By Rademacher's theorem, the set $E := \{x \in \mathbb{R}^m : Df_x \text{ does not exist}\}$ has $\mathcal{L}^m(E) = 0$. On $A \setminus E$, define $J_m f$ via the derivative; set $J_m f = 0$ on $E$. Since $E$ is $\mathcal{L}^m$-null, the left side of the area formula is unchanged by this convention.
The key step is a Whitney-type $C^1$ approximation: for each $\delta > 0$, there exists a $C^1$ map $g_\delta: \mathbb{R}^m \to \mathbb{R}^n$ such that $f = g_\delta$ and $Df = Dg_\delta$ on a closed set $F_\delta$ with $\mathcal{L}^m(\mathbb{R}^m \setminus F_\delta) < \delta$. This is a Lusin-type result for Lipschitz functions, exploiting that $Df$ is approximately continuous at $\mathcal{L}^m$-almost every point.
Apply the area formula for $C^1$ maps to $g_\delta$ restricted to $F_\delta$. Since $f = g_\delta$ and $Df = Dg_\delta$ on $F_\delta$, the area formula holds for $f$ on $F_\delta$. To handle $A \setminus F_\delta$: use the Lipschitz bound $\mathcal{H}^m(f(A \setminus F_\delta)) \le (\operatorname{Lip}(f))^m \mathcal{L}^m(A \setminus F_\delta) < (\operatorname{Lip}(f))^m \delta$. This controls the error in the right side. Letting $\delta \to 0$ and applying monotone convergence gives the area formula for $f$ on all of $A$.
This completes the proof of the area formula. The Whitney approximation step used here will be developed more carefully in Chapter 8; for now we use it as a black box. The argument shows why Rademacher's theorem is so fundamental to this theory: it converts the a.e.-differentiability of Lipschitz functions into a statement that can be bootstrapped from the $C^1$ case.
## The Change of Variables Formula
When $f$ is injective on $A$, the multiplicity function satisfies $N(f, A, y) \in \{0, 1\}$ everywhere, and the area formula specializes to a formula that allows integration over the image.
[quotetheorem:3076]
[citeproof:3076]
This theorem is the form of the area formula most directly useful in applications. It says: to integrate a function $g$ over the image $f(A)$ with respect to $\mathcal{H}^m$, pull back to $A$ and weight by the Jacobian. The Jacobian plays exactly the role of the absolute Jacobian determinant in the classical substitution formula. The injectivity hypothesis is not merely technical: if $f$ wraps around and maps two disjoint subsets of $A$ to the same subset of $f(A)$, then the right side counts $y$ once but the left side counts the preimages separately, so the formula fails without the multiplicity factor $N$.
[example: Möbius-Type Wrapping]
To see what goes wrong without injectivity, take $m = 1$, $n = 2$ and let $f: [0, 2\pi] \to \mathbb{R}^2$ be the parametrization of the unit circle by $f(\theta) = (\cos\theta, \sin\theta)$. Here $f$ is not injective (it identifies $0$ and $2\pi$), but this example is degenerate; more instructively, take $f: [0, 4\pi] \to \mathbb{R}^2$ given by the same formula, wrapping around twice. Then $J_1 f = |f'(\theta)| = 1$ everywhere, so
\begin{align*}
\int_0^{4\pi} J_1 f\, d\mathcal{L}^1 = 4\pi.
\end{align*}
But $f([0, 4\pi]) = \mathbb{S}^1$ has $\mathcal{H}^1$-measure $2\pi$, not $4\pi$. The discrepancy is explained by the multiplicity: $N(f, [0,4\pi], y) = 2$ for every $y \in \mathbb{S}^1$ except $y = (1,0)$, which is hit three times (at $\theta = 0, 2\pi, 4\pi$). A single point is $\mathcal{H}^1$-null, so $N = 2$ holds $\mathcal{H}^1$-a.e. The area formula corrects for this: $\int_{\mathbb{S}^1} N(f,[0,4\pi], y)\, d\mathcal{H}^1(y) = 2 \cdot 2\pi = 4\pi$. The injective change of variables formula would give $\int_{\mathbb{S}^1} g\, d\mathcal{H}^1$, which is only correct if one first restricts to a fundamental domain like $[0, 2\pi)$ where $f$ is injective.
[/example]
This example makes the role of the multiplicity function geometric and palpable. The area formula is not a formula about the image as a set, but about the integral-geometric structure of the map — how many times the source wraps over each target point.
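The bookkeeping in the double-wrapping example can be reproduced numerically. The sketch below approximates $\int_0^{4\pi} J_1 f\, d\mathcal{L}^1$ by finite differences and counts preimages on the half-open interval $[0, 4\pi)$, which avoids the exceptional endpoint. The function names, step sizes, and the half-open convention are assumptions of the sketch.

```python
import math

def f(theta):
    return (math.cos(theta), math.sin(theta))

def speed(theta, h=1e-6):
    """Central-difference approximation of J_1 f = |f'(theta)|."""
    x1, y1 = f(theta + h)
    x0, y0 = f(theta - h)
    return math.hypot(x1 - x0, y1 - y0) / (2.0 * h)

def integral_of_jacobian(a, b, n=4000):
    """Midpoint rule for the integral of J_1 f over [a, b]."""
    dt = (b - a) / n
    return sum(speed(a + (k + 0.5) * dt) * dt for k in range(n))

def multiplicity(angle, a, b):
    """Number of theta in [a, b) with f(theta) = f(angle)."""
    two_pi = 2.0 * math.pi
    base = angle % two_pi
    j_min = math.ceil((a - base) / two_pi)       # smallest j with base + 2*pi*j >= a
    j_max = math.ceil((b - base) / two_pi) - 1   # largest j with base + 2*pi*j < b
    return max(0, j_max - j_min + 1)

left = integral_of_jacobian(0.0, 4.0 * math.pi)
count = multiplicity(1.0, 0.0, 4.0 * math.pi)    # a generic point of the circle
right = count * 2.0 * math.pi                    # N times H^1 of the unit circle
print(left, count, right)
```

On the single period $[0, 2\pi)$ the same count is $1$, which is the injective situation covered by the change of variables formula.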
## Surface Area of a Parametrized Surface
One of the most important applications of the area formula is to recover and generalize the classical surface area integral from multivariable calculus.
[example: Surface Area via the Area Formula]
Let $\Omega \subseteq \mathbb{R}^2$ be a bounded open set and $f: \Omega \to \mathbb{R}^3$ a Lipschitz map (for instance, a $C^1$ parametrization of a smooth surface). Suppose $f$ is injective on $\Omega$. We compute the area of the surface $S = f(\Omega)$, measured by $\mathcal{H}^2$.
By the change of variables formula with $g \equiv 1$:
\begin{align*}
\mathcal{H}^2(S) = \int_\Omega J_2 f(x)\, d\mathcal{L}^2(x).
\end{align*}
We now compute $J_2 f$ explicitly. Write $f = (f_1, f_2, f_3)$ with partial derivatives $f_u = \partial_1 f$ and $f_v = \partial_2 f$ (column vectors in $\mathbb{R}^3$). The Jacobian matrix is $Jf \in \mathbb{R}^{3 \times 2}$ with columns $f_u$ and $f_v$. The matrix $(Jf)^\top (Jf) \in \mathbb{R}^{2 \times 2}$ has entries:
\begin{align*}
(Jf)^\top (Jf) = \begin{pmatrix} |f_u|^2 & f_u \cdot f_v \\ f_u \cdot f_v & |f_v|^2 \end{pmatrix}.
\end{align*}
Its determinant is $|f_u|^2 |f_v|^2 - (f_u \cdot f_v)^2$. Therefore:
\begin{align*}
J_2 f = \sqrt{|f_u|^2 |f_v|^2 - (f_u \cdot f_v)^2} = |f_u \times f_v|,
\end{align*}
where the last equality uses the identity $|a \times b|^2 = |a|^2|b|^2 - (a \cdot b)^2$ for vectors in $\mathbb{R}^3$. This recovers the classical surface area formula:
\begin{align*}
\mathcal{H}^2(S) = \int_\Omega |f_u \times f_v|\, du\, dv.
\end{align*}
The expression $|f_u \times f_v|$ is the area of the parallelogram spanned by the tangent vectors $f_u$ and $f_v$ at each point. The area formula thus gives a precise geometric interpretation of the classical formula: it integrates the infinitesimal area of the tangent parallelogram over the parameter domain. When the surface is a graph $f(u,v) = (u, v, h(u,v))$, we get $f_u = (1,0,h_u)$ and $f_v = (0,1,h_v)$, so $|f_u \times f_v| = \sqrt{1 + h_u^2 + h_v^2}$, which matches the classical formula for the area of a graph.
[/example]
This connection demonstrates that the area formula is not an abstract generalization but a precise extension of familiar undergraduate calculus, with the $m$-dimensional Jacobian playing exactly the role of the area element.
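The recipe $\mathcal{H}^2(S) = \int_\Omega J_2 f\, d\mathcal{L}^2$ is easy to try on a surface whose area is known. The sketch below uses the standard spherical parametrization of the upper unit hemisphere (an illustrative choice, not from the text), computes $J_2 f$ from finite-difference tangent vectors via the Gram determinant, and integrates with the midpoint rule; the result should be close to $2\pi$. Step sizes and names are assumptions.

```python
import math

def hemisphere(u, v):
    """Spherical parametrization: u is the polar angle, v the azimuth."""
    return (math.sin(u) * math.cos(v), math.sin(u) * math.sin(v), math.cos(u))

def jacobian_2(f, u, v, h=1e-5):
    """J_2 f = sqrt(|f_u|^2 |f_v|^2 - (f_u . f_v)^2) with central differences."""
    fu = [(p - q) / (2.0 * h) for p, q in zip(f(u + h, v), f(u - h, v))]
    fv = [(p - q) / (2.0 * h) for p, q in zip(f(u, v + h), f(u, v - h))]
    E = sum(a * a for a in fu)
    F = sum(a * b for a, b in zip(fu, fv))
    G = sum(b * b for b in fv)
    return math.sqrt(max(E * G - F * F, 0.0))

def surface_area(f, u0, u1, v0, v1, n=200):
    """Midpoint rule for the integral of J_2 f over [u0, u1] x [v0, v1]."""
    du = (u1 - u0) / n
    dv = (v1 - v0) / n
    total = 0.0
    for i in range(n):
        u = u0 + (i + 0.5) * du
        for j in range(n):
            v = v0 + (j + 0.5) * dv
            total += jacobian_2(f, u, v) * du * dv
    return total

area = surface_area(hemisphere, 0.0, math.pi / 2.0, 0.0, 2.0 * math.pi)
print(area, 2.0 * math.pi)
```

The parametrization is injective on the open rectangle $(0, \pi/2) \times (0, 2\pi)$, and the omitted boundary maps to an $\mathcal{H}^2$-null set, so the injectivity hypothesis is met.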
<!-- illustration-needed: show a parametrized surface patch f(u,v) in R^3 with the tangent vectors f_u and f_v at a point, the parallelogram they span, and the cross product f_u × f_v normal to the surface; label the area of the parallelogram as J_2 f times the area element du dv -->
## The Image Measure Formula
The area formula has a clean reformulation in terms of pushforward measures that is useful in more abstract settings.
[definition: Pushforward Measure]
Let $f: \mathbb{R}^m \to \mathbb{R}^n$ be measurable and let $\mu$ be a Borel measure on $\mathbb{R}^m$. The **pushforward** of $\mu$ by $f$ is the Borel measure $f_\# \mu$ on $\mathbb{R}^n$ defined by
\begin{align*}
(f_\# \mu)(B) := \mu(f^{-1}(B))
\end{align*}
for every Borel set $B \subseteq \mathbb{R}^n$. Integration against $f_\# \mu$ satisfies the identity $\int_{\mathbb{R}^n} g\, d(f_\# \mu) = \int_{\mathbb{R}^m} g \circ f\, d\mu$ for all non-negative measurable $g$.
[/definition]
The pushforward is the language in which one describes how a measure on the source is transported to a measure on the target. When $f$ is Lipschitz and injective on $A$, the area formula states that the pushforward of the weighted measure $J_m f\, \mathcal{L}^m \lfloor A$ equals the restriction of $\mathcal{H}^m$ to $f(A)$.
[quotetheorem:3077]
[citeproof:3077]
The image measure formula captures the geometric content of the area formula in a single equation: the Jacobian weight, transported to the image as $J_m f \circ (f|_A)^{-1}$, is the Radon-Nikodym derivative of $\mathcal{H}^m \lfloor f(A)$ with respect to $f_\#(\mathcal{L}^m \lfloor A)$. This perspective is essential in GMT applications where one wants to work with the image as a geometric object (a rectifiable set carrying $\mathcal{H}^m$) rather than as the range of a parametrization.
The image measure formula also shows that $\mathcal{H}^m \lfloor f(A)$ is always absolutely continuous with respect to $f_\#(\mathcal{L}^m \lfloor A)$, and that the two measures are mutually absolutely continuous precisely when $J_m f > 0$ $\mathcal{L}^m$-a.e. on $A$. When $J_m f = 0$ on a subset $E \subseteq A$, the image $f(E)$ has $\mathcal{H}^m$-measure zero even though $f_\#(\mathcal{L}^m \lfloor A)$ may charge it, consistent with the Sard-type statement noted earlier.
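A one-dimensional instance makes the pushforward identity concrete. The choice below is illustrative and not taken from the text. Take $m = n = 1$, $A = [0,1]$, and $f(t) = t^2$, which is Lipschitz and injective on $A$ with $J_1 f(t) = 2t$. For a Borel set $B \subseteq \mathbb{R}$, the substitution $s = t^2$ gives
\begin{align*}
f_\#\big(J_1 f\, \mathcal{L}^1 \lfloor A\big)(B) = \int_{A \cap f^{-1}(B)} 2t\, d\mathcal{L}^1(t) = \int_{B \cap [0,1]} d\mathcal{L}^1(s) = \big(\mathcal{H}^1 \lfloor f(A)\big)(B).
\end{align*}
By contrast, the unweighted pushforward is
\begin{align*}
f_\#\big(\mathcal{L}^1 \lfloor A\big)(B) = \int_{B \cap [0,1]} \frac{1}{2\sqrt{s}}\, d\mathcal{L}^1(s),
\end{align*}
so the density of $\mathcal{H}^1 \lfloor f(A)$ with respect to $f_\#(\mathcal{L}^1 \lfloor A)$ is $2\sqrt{s} = J_1 f(f^{-1}(s))$.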
## Necessity of the Hypotheses
The area formula holds under Lipschitz regularity of $f$, measurability of $A$, and the dimensional constraint $m \le n$. Each of these plays a genuine role, and relaxing any one of them produces counterexamples.
The constraint $m \le n$ is not a technical artifact. When $m > n$, the $m$-dimensional Jacobian $J_m f$ would involve the determinant of an $m \times m$ matrix constructed from an $n \times m$ Jacobian, which has rank at most $n < m$ and hence determinant zero identically. The area formula with $m > n$ would degenerate to $0 = \int N\, d\mathcal{H}^m$, but now $\mathcal{H}^m$ is not the natural measure on $\mathbb{R}^n$ (which has dimension $n < m$); the correct formulation for $m > n$ is the coarea formula, which uses $\mathcal{H}^{m-n}$ as the fiber measure and $\mathcal{L}^n$ on the target. Chapter 4 develops this.
Measurability of $A$ is needed for the left side to be well-defined: if $A$ is not measurable, $\int_A J_m f\, d\mathcal{L}^m$ has no meaning. On the right, $f^{-1}(y) \cap A$ could be non-measurable for Hausdorff-a.e. $y$, making $N(f, A, y)$ undefined as an $\mathcal{H}^m$-measurable function.
Lipschitz regularity cannot be weakened to mere differentiability a.e. without additional structural assumptions. A map can be differentiable everywhere on $\mathbb{R}^m$ with $Df_x = 0$ for $\mathcal{L}^m$-a.e. $x$ yet still be surjective — these are the "nowhere monotone" or Cantor-staircase-type pathologies in higher dimensions. For such a map, the left side of the area formula would be zero, but the right side could be positive if $N(f, A, \cdot)$ is large on a set of positive $\mathcal{H}^m$-measure. The Lipschitz condition provides the uniform control needed to prevent this.
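The Cantor staircase is the standard one-dimensional illustration of this failure: it is continuous, locally constant on the removed middle thirds (so its derivative vanishes there), and yet it climbs from $0$ to $1$. The sketch below evaluates it through the ternary expansion of $x$. The function names and the truncation depth are assumptions of the sketch, and floating-point rounding makes values at triadic endpoints approximate.

```python
def cantor(x, depth=40):
    """Approximate the Cantor staircase c(x) for x in [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3.0
        digit = int(x)            # next ternary digit of x
        x -= digit
        if digit == 1:            # x lies in a removed middle third: c is constant there
            return value + scale
        value += scale * (digit // 2)   # ternary digit 2 becomes binary digit 1
        scale *= 0.5
    return value

def flat_length(k):
    """Total length of the middle thirds removed up to stage k,
    on each of which c is constant and c' = 0."""
    return 1.0 - (2.0 / 3.0) ** k

print(cantor(0.4), cantor(0.6))   # both points lie in the first removed interval
print(flat_length(50))
```

With $m = n = 1$ and $A = [0,1]$, the left side of the area formula would be $\int_0^1 |c'|\, d\mathcal{L}^1 = 0$, while the right side is at least $\mathcal{H}^1(c([0,1])) = 1$.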
While the area formula handles maps from lower to higher dimension, many integral transformations reverse this asymmetry: projections, level-set decompositions, and radial decompositions all require an analogue for $m > n$. The coarea formula provides it by slicing the domain along level sets and integrating their $(m-n)$-dimensional measures.
# 4. The Coarea Formula
The area formula, developed in the preceding chapter, handles maps $f : \mathbb{R}^m \to \mathbb{R}^n$ where the source dimension does not exceed the target dimension ($m \le n$). But many natural operations in analysis (projections, level-set decompositions, the radial decomposition of integrals) go in the opposite direction. The coarea formula addresses this complementary regime. From here on the source is $\mathbb{R}^n$ and the target is $\mathbb{R}^k$ with $n \ge k$: given a Lipschitz map $f : \mathbb{R}^n \to \mathbb{R}^k$, the formula decomposes a Lebesgue integral over $\mathbb{R}^n$ as a Lebesgue integral over $\mathbb{R}^k$ of Hausdorff integrals over the level sets $f^{-1}(t)$. The chapter proves the formula, establishes its change-of-variables form, and works through the two most important applications: polar coordinate integration and level-set integrals for scalar Lipschitz functions.
## The Obstacle: Integrating Over Level Sets
The most natural way to integrate a function over $\mathbb{R}^n$ is to use Fubini's theorem: fix $k$ coordinates, integrate the remaining $n-k$, then integrate the result over $\mathbb{R}^k$. But Fubini is tied to the projection $\mathbb{R}^n \to \mathbb{R}^k$ and exploits the product structure $\mathbb{R}^n = \mathbb{R}^k \times \mathbb{R}^{n-k}$ in an essential way. If the "slicing" is done by a nonlinear Lipschitz map $f$, the level sets $f^{-1}(t)$ are curved, generically $(n-k)$-dimensional sets, and Fubini's theorem gives no direct handle on how to integrate over them. The question is: what replaces Fubini when the slices are curved, and what is the correct geometric weight that accounts for how the map $f$ distorts the slicing?
The answer involves the $k$-dimensional Jacobian $J_k f$ — the same geometric quantity appearing in the area formula, now measuring the "transversal stretching" of $f$ as it maps $\mathbb{R}^n$ onto $\mathbb{R}^k$. The coarea formula says that $J_k f$ is precisely the weight that converts a flat Lebesgue integral into a curved Fubini-type decomposition.
## The Coarea Jacobian
Before stating the theorem, we recall what $J_k f$ measures. At every point where $f$ is differentiable, which is $\mathcal{L}^n$-almost everywhere by Rademacher's theorem, the total derivative $Df_x : \mathbb{R}^n \to \mathbb{R}^k$ is a linear map between spaces of possibly different dimensions, with $n \ge k$. Its Jacobian matrix $Jf_x \in \mathbb{R}^{k \times n}$ has at least as many columns as rows.
[definition: Coarea Factor]
Let $f : \mathbb{R}^n \to \mathbb{R}^k$ be Lipschitz, $n \ge k$, and let $x$ be a point where $f$ is differentiable. The **$k$-dimensional Jacobian** (or coarea factor) of $f$ at $x$ is
\begin{align*}
J_k f(x) := \sqrt{\det(Jf_x \cdot (Jf_x)^\top)},
\end{align*}
where $Jf_x \in \mathbb{R}^{k \times n}$ is the Jacobian matrix of $f$ at $x$.
[/definition]
The product $Jf_x \cdot (Jf_x)^\top$ is a $k \times k$ matrix whose determinant equals the sum of the squares of all $k \times k$ minors of $Jf_x$ (by the Cauchy-Binet formula). Geometrically, $J_k f(x)$ is the factor by which $Df_x$ scales $k$-dimensional volume on the orthogonal complement of its kernel: the image of the unit ball of $\mathbb{R}^n$ under $Df_x$ is an ellipsoid in $\mathbb{R}^k$ of volume $\alpha_k J_k f(x)$, where $\alpha_k = \mathcal{L}^k(B(0,1))$. In particular, $J_k f(x)$ is zero when $Df_x$ is not surjective (the level set through $x$ may then be degenerate) and positive when $Df_x$ has full rank $k$.
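As a quick check of the Cauchy-Binet identity, consider the linear map $f(x_1, x_2, x_3) = (x_1 + x_3, \, x_2 + x_3)$ from $\mathbb{R}^3$ to $\mathbb{R}^2$; this map is chosen here purely for illustration and is not used elsewhere. Its coarea factor can be computed in both ways:
\begin{align*}
Jf &= \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \qquad Jf \cdot (Jf)^\top = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \det\bigl(Jf \cdot (Jf)^\top\bigr) = 3, \\
\sum_{2 \times 2 \text{ minors}} &= \det\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}^2 + \det\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}^2 + \det\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^2 = 1 + 1 + 1 = 3.
\end{align*}
Both computations give $J_2 f = \sqrt{3}$ at every point.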
For a linear map $L : \mathbb{R}^n \to \mathbb{R}^k$, the coarea factor $J_k L$ is constant. This special case is the starting point for the proof.
## Statement and Proof of the Coarea Formula
The Fubini theorem decomposes $\mathcal{L}^n$-integrals over $\mathbb{R}^n$ via the projection map, which has coarea factor identically $1$. When the "projection" is replaced by a general Lipschitz map $f : \mathbb{R}^n \to \mathbb{R}^k$, the formula must account for the varying stretching of the level sets.
[quotetheorem:3078]
The integrand on the right, $t \mapsto \mathcal{H}^{n-k}(f^{-1}(t) \cap A)$, is a nonnegative $\mathcal{L}^k$-measurable function of $t$, so the outer integral is meaningful. Each level set $f^{-1}(t)$ is a closed subset of $\mathbb{R}^n$ (since $f$ is continuous), and the $(n-k)$-dimensional Hausdorff measure correctly captures its intrinsic size when the set is an $(n-k)$-dimensional surface.
The hypotheses are sharp in the following sense. Without the Lipschitz assumption, the coarea factor $J_k f$ cannot be defined $\mathcal{L}^n$-a.e. in any useful way. The condition $n \ge k$ is not merely a convention: when $n < k$, a generic level set of $f$ is empty (a map from a lower-dimensional space to a higher-dimensional one has no regular level sets), and the appropriate result is the area formula instead. Dropping the measurability of $A$ would cause the right-hand side to be ill-defined, since $f^{-1}(t) \cap A$ might not be $\mathcal{H}^{n-k}$-measurable for generic $t$.
The proof follows the same three-layer strategy used for the area formula: establish the result first for linear maps, then for $C^1$ maps by linearisation, and finally for Lipschitz maps via Rademacher's theorem.
**Linear case.** Suppose $L : \mathbb{R}^n \to \mathbb{R}^k$ is a linear surjection. (If $L$ is not surjective, then $J_k L = 0$ and $L^{-1}(t)$ is empty for $\mathcal{L}^k$-a.e. $t$, so both sides vanish.) First take $L = P$, the coordinate projection $P(x_1, \ldots, x_n) = (x_1, \ldots, x_k)$. The level set $P^{-1}(t)$ is the affine subspace $\{t\} \times \mathbb{R}^{n-k}$, so $\mathcal{H}^{n-k}(P^{-1}(t) \cap A) = \mathcal{L}^{n-k}(A_t)$, where $A_t$ is the slice of $A$ at height $t$. Fubini's theorem gives $\mathcal{L}^n(A) = \int_{\mathbb{R}^k} \mathcal{L}^{n-k}(A_t) \, d\mathcal{L}^k(t)$, which is the formula since $J_k P = 1$. For a general linear surjection, the polar decomposition writes $L = S \circ P \circ Q$ with $Q : \mathbb{R}^n \to \mathbb{R}^n$ orthogonal and $S : \mathbb{R}^k \to \mathbb{R}^k$ symmetric and positive definite, and then $J_k L = \det S$. Since $Q$ preserves both $\mathcal{L}^n$ and $\mathcal{H}^{n-k}$, the projection case applies to $P \circ Q$; the substitution $t = S s$ in the outer integral then contributes the factor $\det S = J_k L$, which gives $J_k L \cdot \mathcal{L}^n(A) = \int_{\mathbb{R}^k} \mathcal{H}^{n-k}(L^{-1}(t) \cap A) \, d\mathcal{L}^k(t)$.
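To see the linear case in a situation where the coarea factor differs from $1$, take the illustrative map $L(x_1, x_2) = x_1 + x_2$ on $\mathbb{R}^2$ (so $n = 2$, $k = 1$) and $A = [0,1]^2$; this choice is an assumption made for the example only. Here $J_1 L = |\nabla L| = \sqrt{2}$, and $L^{-1}(t) \cap A$ is a segment of the line $x_1 + x_2 = t$:
\begin{align*}
\mathcal{H}^1(L^{-1}(t) \cap A) &= \begin{cases} \sqrt{2}\, t, & 0 \le t \le 1, \\ \sqrt{2}\,(2 - t), & 1 \le t \le 2, \\ 0, & \text{otherwise}, \end{cases} \\
\int_{\mathbb{R}} \mathcal{H}^1(L^{-1}(t) \cap A) \, dt &= \sqrt{2} \left( \int_0^1 t \, dt + \int_1^2 (2 - t) \, dt \right) = \sqrt{2} = J_1 L \cdot \mathcal{L}^2(A).
\end{align*}
The factor $\sqrt{2}$ on the volume side compensates for the fact that the level lines at heights $t$ and $t + dt$ are only $dt/\sqrt{2}$ apart.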
**$C^1$ case.** For $f \in C^1$, cover $A$ by small balls $B(x_j, r_j)$ on which $Df$ is nearly constant. On each ball, the map $f$ is well-approximated by its linearisation $Df_{x_j}$. Apply the linear case to $Df_{x_j}$ on each ball, then sum and take the limit as the ball radii go to zero. The key step is controlling the error between $f$ and its linearisation on each ball, which is $o(r_j)$ because $Df$ is continuous, and which contributes only $o(1)$ to both sides of the formula as $r_j \to 0$. The passage to the limit uses the dominated convergence theorem and the fact that $f \in C^1$ implies $J_k f$ is continuous.
**Lipschitz case.** Apply Rademacher's theorem to get differentiability $\mathcal{L}^n$-a.e., so that $J_k f$ is defined almost everywhere and bounded by $(\operatorname{Lip} f)^k$. Standard mollification gives $C^1$ maps $f_\varepsilon \to f$ locally uniformly with $Df_\varepsilon \to Df$ pointwise $\mathcal{L}^n$-a.e., which settles the left-hand side by dominated convergence. The right-hand side is more delicate, because the Hausdorff measure of level sets is not continuous under uniform convergence of maps. The standard remedy has two ingredients: the a priori bound $\int_{\mathbb{R}^k} \mathcal{H}^{n-k}(f^{-1}(t) \cap A) \, d\mathcal{L}^k(t) \le C(n,k) (\operatorname{Lip} f)^k \mathcal{L}^n(A)$, which shows that sets of small Lebesgue measure contribute little to either side, and the fact that $f$ coincides with a $C^1$ map outside a set of arbitrarily small measure. The $C^1$ case handles the set of coincidence, and the a priori bound disposes of the remainder.
The structure of the proof reflects a broader pattern in geometric measure theory: the Lipschitz case is always reduced to the $C^1$ case via Rademacher and approximation, and the $C^1$ case is reduced to the linear case by Taylor expansion. The linear case, in turn, is a direct calculation whose answer is encoded by the Jacobian.
## The Coarea Formula Generalises Fubini
The most illuminating sanity check for any new integration formula is to verify that it reduces to a known theorem in the simplest case. Here, the canonical example is the coordinate projection.
[example: Coarea Recovers Fubini]
Take $n \ge k$ and let $f : \mathbb{R}^n \to \mathbb{R}^k$ be the coordinate projection
\begin{align*}
f(x_1, \ldots, x_n) = (x_1, \ldots, x_k).
\end{align*}
We compute the coarea factor. The Jacobian matrix is $Jf_x = [I_k \mid 0_{k \times (n-k)}] \in \mathbb{R}^{k \times n}$, so
\begin{align*}
Jf_x \cdot (Jf_x)^\top = I_k,
\end{align*}
giving $J_k f(x) = \sqrt{\det(I_k)} = 1$ everywhere. The level set $f^{-1}(t)$ is the affine flat $\{t_1\} \times \cdots \times \{t_k\} \times \mathbb{R}^{n-k}$, which is a translate of $\mathbb{R}^{n-k}$. Its $(n-k)$-dimensional Hausdorff measure restricted to a measurable set $A$ is $\mathcal{H}^{n-k}(f^{-1}(t) \cap A) = \mathcal{L}^{n-k}(A_t)$, where $A_t = \{y \in \mathbb{R}^{n-k} : (t, y) \in A\}$ is the slice of $A$ at height $t \in \mathbb{R}^k$. The coarea formula therefore reads
\begin{align*}
\mathcal{L}^n(A) = \int_{\mathbb{R}^k} \mathcal{L}^{n-k}(A_t) \, d\mathcal{L}^k(t),
\end{align*}
which is precisely Fubini's theorem for the decomposition $\mathbb{R}^n = \mathbb{R}^k \times \mathbb{R}^{n-k}$.
[/example]
This example shows that the coarea formula is a genuinely curved version of Fubini: it replaces the flat affine slices $\{t\} \times \mathbb{R}^{n-k}$ with the curved level sets $f^{-1}(t)$, and it replaces the trivial Jacobian factor $1$ with $J_k f$ to account for how $f$ tilts and warps those slices. Every application of the coarea formula can be understood as an instance of "Fubini with curved slices," a viewpoint that makes the right-hand side intuitively natural.
## The Change-of-Variables Form
The coarea formula as stated integrates only the constant function $1$ over $f^{-1}(t) \cap A$. For applications, one needs to integrate an arbitrary nonnegative measurable function $g$ over the same level sets. This is the change-of-variables form of the coarea formula, which extends the basic result in exactly the same way that the substitution rule $\int_a^b g(f(x)) f'(x) \, dx = \int_{f(a)}^{f(b)} g(t) \, dt$ extends the change-of-variables formula for real functions.
[quotetheorem:3079]
The proof of this generalisation from the basic coarea formula follows the same approximation argument used to pass from the basic area formula to its change-of-variables form: first verify it for indicator functions $g = \mathbf{1}_B$ of measurable sets $B$, where the result is exactly the basic coarea formula applied to $A \cap B$; then extend to nonnegative simple functions by linearity, and finally to nonnegative measurable $g$ by the monotone convergence theorem.
This form of the formula is the one that appears most often in applications. The hypothesis that $g$ is nonnegative can be relaxed to signed measurable $g$ with $g \, J_k f \in L^1(A)$, in which case the formula follows by applying the nonnegative case to $g^+$ and $g^-$ separately and subtracting. The measurability conditions are necessary: without them, the inner integral $\int_{f^{-1}(t) \cap A} g \, d\mathcal{H}^{n-k}$ might not be a well-defined measurable function of $t$.
## Application: Polar Coordinates
The coarea formula with $k = 1$ and $f(x) = |x|$ yields the classical polar coordinate decomposition of the Lebesgue integral in $\mathbb{R}^n$. This application is fundamental in analysis — it converts integrals over $\mathbb{R}^n$ into one-dimensional integrals over radii, weighted by the surface measure of spheres.
[example: Polar Coordinate Integration]
Let $n \ge 1$ and define $f : \mathbb{R}^n \to [0, \infty)$ by $f(x) = |x|$. We apply the coarea formula with $k = 1$ and $A = B(0, R)$ for $R > 0$.
First, compute the coarea factor. At a point $x \ne 0$, the gradient is
\begin{align*}
\nabla f(x) = \frac{x}{|x|},
\end{align*}
which is a unit vector. The Jacobian matrix is $Jf_x = (\nabla f(x))^\top \in \mathbb{R}^{1 \times n}$, so
\begin{align*}
J_1 f(x) = \sqrt{Jf_x \cdot (Jf_x)^\top} = \sqrt{|\nabla f(x)|^2} = |\nabla f(x)| = 1.
\end{align*}
At $x = 0$, the function $f$ is not differentiable; this single point has $\mathcal{L}^n$-measure zero and does not affect the formula.
The level set $f^{-1}(t) = \{x \in \mathbb{R}^n : |x| = t\} = \partial B(0, t)$ is the sphere of radius $t$. Applying the coarea formula with $g : \mathbb{R}^n \to [0, \infty]$ measurable,
\begin{align*}
\int_{B(0,R)} g(x) \, d\mathcal{L}^n(x) = \int_0^R \left( \int_{\partial B(0,t)} g \, d\mathcal{H}^{n-1} \right) d\mathcal{L}^1(t).
\end{align*}
This is the polar coordinate formula: an integral over a ball equals a one-dimensional integral of surface integrals over spheres. Taking $g \equiv 1$ gives $\mathcal{L}^n(B(0,R)) = \int_0^R \mathcal{H}^{n-1}(\partial B(0,t)) \, dt = \int_0^R n \alpha_n t^{n-1} \, dt = \alpha_n R^n$, recovering the standard formula for the volume of a ball in terms of the surface area of spheres.
[/example]
The polar coordinate example illustrates the convenience of the coarea formula: the decomposition for the nonlinear function $|x|$ follows directly, without constructing an explicit parametrisation of the spheres and computing its Jacobian. The fact that $J_1 f = |\nabla f| = 1$ for $f(x) = |x|$ is not accidental: the distance function to the origin satisfies the eikonal equation $|\nabla |x|| = 1$ away from the origin.
## Application: Integration Over Level Sets of Scalar Functions
When $k = 1$ and $f : \mathbb{R}^n \to \mathbb{R}$ is a Lipschitz function, the coarea formula takes a particularly useful form that expresses integrals over subsets of $\mathbb{R}^n$ as weighted integrals over the level sets $\{f = t\}$. This is the form encountered most often in PDE theory, where one needs to integrate a function over a level set defined by a constraint.
[quotetheorem:3080]
[citeproof:3080]
The condition $|\nabla f| > 0$ almost everywhere on each level set (which holds for instance when $f$ is $C^1$ and $t$ is a regular value) ensures that the weight $1/|\nabla f|$ is finite $\mathcal{H}^{n-1}$-a.e. on $\{f = t\}$. The factor $1/|\nabla f|$ accounts for the spacing between neighbouring level sets: where $|\nabla f|$ is large, $f$ changes rapidly, so the slab between $\{f = t\}$ and $\{f = t + dt\}$ has thickness approximately $dt/|\nabla f|$ and occupies less volume, and the weight $1/|\nabla f|$ records exactly this.
[example: Integral of a Radial Function]
Take $f(x) = |x|$ and $g(x) = h(|x|)$ for some Borel function $h : [0,\infty) \to [0,\infty)$. Since $|\nabla f(x)| = 1$ away from the origin, the level-set integration formula gives
\begin{align*}
\int_{\mathbb{R}^n} h(|x|) \, d\mathcal{L}^n(x) = \int_0^\infty \left( \int_{\partial B(0,t)} \frac{h(t)}{1} \, d\mathcal{H}^{n-1} \right) dt = \int_0^\infty h(t) \cdot \mathcal{H}^{n-1}(\partial B(0,t)) \, dt.
\end{align*}
Using $\mathcal{H}^{n-1}(\partial B(0,t)) = n \alpha_n t^{n-1}$, this becomes
\begin{align*}
\int_{\mathbb{R}^n} h(|x|) \, d\mathcal{L}^n(x) = n\alpha_n \int_0^\infty h(t) \, t^{n-1} \, dt.
\end{align*}
This is the standard formula for integrating radial functions in polar coordinates, derived here purely from the coarea formula. For concreteness, taking $h(t) = e^{-t^2}$ gives $\int_{\mathbb{R}^n} e^{-|x|^2} \, d\mathcal{L}^n(x) = n\alpha_n \int_0^\infty e^{-t^2} t^{n-1} \, dt = n \alpha_n \cdot \frac{1}{2}\Gamma(n/2) = \pi^{n/2}$, recovering the standard Gaussian integral.
[/example]
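A case where the weight $1/|\nabla f|$ is not identically $1$ may help calibrate the formula. The following sketch assumes the level-set formula in the form $\int_A g \, d\mathcal{L}^n = \int_{\mathbb{R}} \int_{\{f = t\} \cap A} g/|\nabla f| \, d\mathcal{H}^{n-1} \, dt$ and takes the illustrative choice $f(x) = |x|^2$ (Lipschitz on bounded sets), $g \equiv 1$, $A = B(0,R)$. Then $|\nabla f(x)| = 2|x| = 2\sqrt{t}$ on $\{f = t\} = \partial B(0, \sqrt{t})$, and
\begin{align*}
\int_0^{R^2} \frac{\mathcal{H}^{n-1}(\partial B(0,\sqrt{t}))}{2\sqrt{t}} \, dt = \int_0^{R^2} \frac{n \alpha_n t^{(n-1)/2}}{2 t^{1/2}} \, dt = \frac{n \alpha_n}{2} \int_0^{R^2} t^{n/2 - 1} \, dt = \frac{n\alpha_n}{2} \cdot \frac{2}{n} R^n = \alpha_n R^n,
\end{align*}
which agrees with $\mathcal{L}^n(B(0,R))$. Without the weight, the same integral would give $\tfrac{2n}{n+1} \alpha_n R^{n+1}$, which has the wrong scaling in $R$.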
The level-set integration formula is one of the most useful practical consequences of the coarea formula. It appears in the coarea formula for Sobolev functions, in the Federer-Fleming deformation theorem, in estimates for perimeters of superlevel sets, and, as the preview at the end of this chapter indicates, in the coarea formula for the total variation of BV functions.
## The Necessity of the Hypotheses
The coarea formula requires $f$ to be Lipschitz; mere continuity, even combined with differentiability almost everywhere, does not suffice. Let $c : [0,1] \to [0,1]$ be the Cantor staircase function, which is continuous, nondecreasing and surjective, with $c'(x) = 0$ for $\mathcal{L}^1$-a.e. $x$, and set $f(x_1, \ldots, x_n) = c(x_1)$ on $A = [0,1]^n$, with $k = 1$. Then $J_1 f = |\nabla f| = 0$ almost everywhere, so the left side $\int_A J_1 f \, d\mathcal{L}^n$ vanishes. On the other hand, every level set $f^{-1}(t) \cap A = c^{-1}(t) \times [0,1]^{n-1}$ with $t \in [0,1]$ contains a copy of the unit cube $[0,1]^{n-1}$, so $\mathcal{H}^{n-1}(f^{-1}(t) \cap A) \ge 1$ and the right side is at least $1$. The Lipschitz condition is therefore not just a technical convenience: it guarantees that $\mathcal{L}^n$-null sets, where Rademacher's theorem gives no information, contribute nothing to the right-hand side.
The condition $n \ge k$ is equally necessary for the statement to carry content. If $n < k$, the measure $\mathcal{H}^{n-k}$ has no meaning, since the exponent is negative. Moreover, a Lipschitz map $f : \mathbb{R}^n \to \mathbb{R}^k$ satisfies $\mathcal{H}^k(f(E)) \le (\operatorname{Lip} f)^k \mathcal{H}^k(E)$ and $\mathcal{H}^k(\mathbb{R}^n) = 0$, so the image $f(\mathbb{R}^n)$ has $\mathcal{L}^k$-measure zero and $f^{-1}(t)$ is empty for $\mathcal{L}^k$-almost every $t$. On the left, $Jf_x \cdot (Jf_x)^\top$ is a $k \times k$ matrix of rank at most $n < k$, so $J_k f$ vanishes identically. Any reading of the formula in this regime therefore reduces to $0 = 0$. The informative result when $n < k$ is the area formula, not the coarea formula.
## Preview: The BV Coarea Formula
The coarea formula has a far-reaching extension to functions of bounded variation. A function $u \in BV(\Omega)$ need not be Lipschitz, yet it admits a decomposition analogous to the coarea formula at the level of the total variation measure.
For $t \in \mathbb{R}$, the superlevel set $\{u > t\} = \{x \in \Omega : u(x) > t\}$ is a set of finite perimeter in $\Omega$ for $\mathcal{L}^1$-almost every $t$, with reduced boundary $\partial^* \{u > t\}$ carrying the natural $(n-1)$-dimensional surface measure. The BV coarea formula states:
\begin{align*}
|Du|(\Omega) = \int_{-\infty}^{\infty} \mathcal{H}^{n-1}(\partial^* \{u > t\} \cap \Omega) \, dt.
\end{align*}
This identity says that the total variation of $u$ — the measure that replaces $|\nabla u| \, d\mathcal{L}^n$ for non-smooth functions — equals the integral over all levels $t$ of the perimeter of the superlevel set $\{u > t\}$. It is the bridge between the theory of BV functions and the theory of sets of finite perimeter, and it is central to the structure theory of BV functions developed in GMT III.
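A simple consistency check, under the assumption that $E \subseteq \Omega$ is a set of finite perimeter in $\Omega$: take $u = \mathbb{1}_E$. The superlevel set $\{u > t\}$ equals $\Omega$ for $t < 0$, equals $E$ for $0 \le t < 1$, and is empty for $t \ge 1$, so only levels in $[0,1)$ contribute:
\begin{align*}
\int_{-\infty}^{\infty} \mathcal{H}^{n-1}(\partial^* \{u > t\} \cap \Omega) \, dt = \int_0^1 \mathcal{H}^{n-1}(\partial^* E \cap \Omega) \, dt = \mathcal{H}^{n-1}(\partial^* E \cap \Omega) = |D\mathbb{1}_E|(\Omega).
\end{align*}
The last equality is De Giorgi's identification of the perimeter measure with $\mathcal{H}^{n-1}$ restricted to the reduced boundary, which is quoted here rather than proved.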
The BV coarea formula does not follow directly from the Lipschitz coarea formula — it requires a separate argument using the distributional gradient and the structure of Radon measures. The full proof, together with the co-area formula for sets of finite perimeter and the fine structure of the reduced boundary, is the starting point of GMT III. What the present chapter establishes is the Lipschitz prototype: the identity $\int_A J_k f \, d\mathcal{L}^n = \int_{\mathbb{R}^k} \mathcal{H}^{n-k}(f^{-1}(t) \cap A) \, d\mathcal{L}^k(t)$ is the template from which the BV version is modelled, and the strategy of "slicing by level sets and integrating the resulting measure" is the same in both settings.
The coarea formula extends the level-set slicing paradigm far beyond Lipschitz maps: when passing to BV functions, the same template of integrating measures over fiber level sets survives, though the Jacobian must be replaced by distributional derivatives, indicating that the deeper principle underlying both formulas transcends classical differentiability.
# 5. Approximate Differentiability
Chapters 1 through 4 showed that every Lipschitz map is classically differentiable almost everywhere (Rademacher's theorem) and that this differentiability, encoded through the Jacobian, governs the area and coarea formulas. The natural next question is: how much of this theory survives when the map is merely in a Sobolev class $W^{1,p}(\Omega)$ rather than Lipschitz? For $p > n$, Morrey's inequality gives Hölder continuity, and classical differentiability a.e. still follows. But for $1 \le p \le n$ in dimension $n \ge 2$, the regime most relevant to PDE and geometric analysis, a Sobolev function need not have a representative that is continuous at any point, and Rademacher's argument breaks down entirely. The purpose of this chapter is to identify the correct replacement: **approximate differentiability**, a pointwise notion defined through density rather than limits, which holds almost everywhere for all Sobolev functions and for BV functions as well.
## Approximate Limits and Approximate Continuity
The obstacle to classical analysis of $L^p$ functions is that pointwise values are undefined: two functions equal almost everywhere are identified in $L^p$, so asking "what is $u(x)$?" is not well-posed. Even for a fixed representative of the equivalence class, the value at a single point carries no intrinsic meaning because changing it on a null set does not alter the function in $L^p$. Classical differentiability requires that the difference quotient converges along every sequence approaching $x$, but sequences approaching $x$ from different directions may encounter arbitrarily bad sets of measure zero. The remedy is to measure convergence not along every sequence, but only outside an exceptional set that has density zero at $x$.
[definition: Approximate Limit]
Let $u: \Omega \to \mathbb{R}$ be $\mathcal{L}^n$-measurable and let $x \in \Omega$. A value $\ell \in \mathbb{R}$ is the **approximate limit** of $u$ at $x$, written
\begin{align*}
\operatorname{ap-lim}_{y \to x} u(y) = \ell,
\end{align*}
if for every $\varepsilon > 0$,
\begin{align*}
\lim_{r \to 0} \frac{\mathcal{L}^n\bigl(\{y \in B(x,r) : |u(y) - \ell| > \varepsilon\}\bigr)}{\mathcal{L}^n(B(x,r))} = 0.
\end{align*}
[/definition]
The approximate limit at $x$, when it exists, is unique: if both $\ell$ and $\ell'$ qualify with $\ell \ne \ell'$, then for any $\varepsilon < |\ell - \ell'|/2$ the sets $\{|u - \ell| \le \varepsilon\}$ and $\{|u - \ell'| \le \varepsilon\}$ are disjoint, yet each has density $1$ at $x$. For small $r$ each would then occupy more than half of $B(x,r)$, which is impossible for disjoint sets. Moreover, the definition involves only the Lebesgue measure of sets determined by $u$, so the approximate limit is unchanged when $u$ is modified on a null set; it is a well-defined property of the equivalence class.
[definition: Approximate Continuity]
A measurable function $u: \Omega \to \mathbb{R}$ is **approximately continuous** at $x \in \Omega$ if $u$ has an approximate limit at $x$ and
\begin{align*}
\operatorname{ap-lim}_{y \to x} u(y) = u(x).
\end{align*}
[/definition]
The connection to Lebesgue theory is immediate and important. Recall from [GMT I](/page/Geometric%20Measure%20Theory%20I%3A%20Measures%20and%20Hausdorff%20Dimension) that the Lebesgue differentiation theorem guarantees that for any $u \in L^1_{\mathrm{loc}}(\Omega)$ and for $\mathcal{L}^n$-a.e. $x$,
\begin{align*}
\lim_{r \to 0} \frac{1}{\mathcal{L}^n(B(x,r))} \int_{B(x,r)} |u(y) - u(x)|\, d\mathcal{L}^n(y) = 0.
\end{align*}
Every such Lebesgue point $x$ is a point of approximate continuity of $u$ with approximate limit $u(x)$: if $\varepsilon > 0$, then
\begin{align*}
\frac{\mathcal{L}^n(\{|u - u(x)| > \varepsilon\} \cap B(x,r))}{\mathcal{L}^n(B(x,r))} \le \frac{1}{\varepsilon} \cdot \frac{1}{\mathcal{L}^n(B(x,r))} \int_{B(x,r)} |u(y) - u(x)|\, d\mathcal{L}^n(y) \to 0
\end{align*}
by Markov's inequality and the Lebesgue differentiation theorem. Thus approximate continuity holds at almost every point for every locally integrable function, and the Lebesgue representative — defined by the differentiation limit where it exists and extended arbitrarily elsewhere — is the canonical representative to work with.
[remark: Approximate Continuity vs. Classical Continuity]
Approximate continuity is strictly weaker than classical continuity. The indicator function $\mathbb{1}_{\mathbb{Q}}$ of the rationals in $\mathbb{R}$ has approximate limit $0$ at every point of $\mathbb{R}$, because $\mathcal{L}^1(\mathbb{Q}) = 0$, and is therefore approximately continuous at every irrational point; yet it is continuous at no point in the classical sense, since every interval contains both rationals and irrationals. In the other direction, a classically continuous function is approximately continuous at every point: given $\varepsilon > 0$, the set $\{y \in B(x,r) : |u(y) - u(x)| > \varepsilon\}$ is empty once $r$ is small enough.
[/remark]
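For an illustration in which the exceptional set has positive measure, consider the following one-dimensional sketch; the set $E$ is a hypothetical choice made for this example. Let $E = \bigcup_{j \ge 1} [2^{-j}, 2^{-j} + 4^{-j}] \subseteq \mathbb{R}$ and $u = \mathbb{1}_E$. Since $u(2^{-j}) = 1$ with $2^{-j} \to 0$, while $u = 0$ on $(-\infty, 0]$, the classical limit of $u$ at $0$ does not exist. For $2^{-(i+1)} \le r < 2^{-i}$, only the intervals with $j \ge i + 1$ meet $B(0,r) = (-r, r)$, so for every $\varepsilon > 0$
\begin{align*}
\frac{\mathcal{L}^1(\{y \in B(0,r) : |u(y)| > \varepsilon\})}{\mathcal{L}^1(B(0,r))} \le \frac{\sum_{j \ge i+1} 4^{-j}}{2 \cdot 2^{-(i+1)}} = \frac{\tfrac{1}{3}\, 4^{-i}}{2^{-i}} = \tfrac{1}{3}\, 2^{-i} \longrightarrow 0 \quad \text{as } r \to 0.
\end{align*}
Hence $\operatorname{ap-lim}_{y \to 0} u(y) = 0$, even though $u$ equals $1$ on a set of positive measure in every neighbourhood of $0$.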
## Approximate Differentiability: Definition and First Examples
Classical differentiability at $x$ requires that the ratio $(u(y) - u(x) - L(y-x))/|y-x|$ tends to zero as $y \to x$ along every sequence, where $L: \mathbb{R}^n \to \mathbb{R}$ is a fixed linear map. For Sobolev functions with $p \le n$, the representative $u$ may oscillate wildly near every point, so this ratio can fail to converge along sequences that approach $x$ through the exceptional set where $u$ misbehaves. The approximate notion replaces the requirement that the ratio tends to zero along every approach with the requirement that the set of $y$ where the ratio is large has density zero at $x$.
[definition: Approximate Differentiability]
Let $u: \Omega \to \mathbb{R}$ be measurable and let $x \in \Omega$. We say $u$ is **approximately differentiable** at $x$ if there exists a linear map $L: \mathbb{R}^n \to \mathbb{R}$ such that
\begin{align*}
\operatorname{ap-lim}_{y \to x} \frac{u(y) - u(x) - L(y-x)}{|y-x|} = 0,
\end{align*}
that is, for every $\varepsilon > 0$,
\begin{align*}
\lim_{r \to 0} \frac{\mathcal{L}^n\!\left(\left\{y \in B(x,r) : \frac{|u(y) - u(x) - L(y-x)|}{|y-x|} > \varepsilon\right\}\right)}{\mathcal{L}^n(B(x,r))} = 0.
\end{align*}
When such $L$ exists, it is unique; we call it the **approximate derivative** of $u$ at $x$ and write $\nabla_{\mathrm{ap}} u(x)$ for its representing gradient vector.
[/definition]
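Uniqueness of the approximate derivative follows from a density argument similar to the one used for approximate limits. If two linear maps $L$ and $L'$ both satisfy the condition, the triangle inequality gives $\operatorname{ap-lim}_{y \to x}(L-L')(y-x)/|y-x| = 0$. For each $\varepsilon > 0$, the set $\{y : |(L-L')(y-x)| > \varepsilon |y-x|\}$ is a cone with vertex $x$, so its density in $B(x,r)$ does not depend on $r$; the density can tend to zero only if the cone is $\mathcal{L}^n$-null. Being open, the cone must then be empty, so $|(L-L')(v)| \le \varepsilon |v|$ for every $v$ and every $\varepsilon > 0$, hence $L = L'$.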
The structural importance of this definition is that approximate differentiability is insensitive to null sets: once the value $u(x)$ is fixed (for instance as the approximate limit of $u$ at $x$), changing $u$ on a null set away from $x$ does not alter the density of any of the sets in the definition, so the approximate derivative is intrinsic to the $L^1_{\mathrm{loc}}$ equivalence class.
[example: Indicator of a Null Set]
Let $E \subseteq \mathbb{R}^n$ be a dense Borel set with $\mathcal{L}^n(E) = 0$, for instance $E = \mathbb{Q}^n$, and set $u = \mathbb{1}_E$. Fix $x \notin E$, so that $u(x) = 0$, and take $L = 0$. For any $\varepsilon > 0$ and any $r > 0$, the difference quotient $|u(y) - u(x)|/|y - x|$ vanishes for $y \notin E$, so
\begin{align*}
\frac{\mathcal{L}^n(\{y \in B(x,r) : |u(y) - u(x)| > \varepsilon |y - x|\})}{\mathcal{L}^n(B(x,r))} \le \frac{\mathcal{L}^n(E \cap B(x,r))}{\mathcal{L}^n(B(x,r))} = 0.
\end{align*}
Thus $u$ is approximately differentiable at every $x \notin E$, hence almost everywhere, with approximate derivative $0$. Yet $u$ is nowhere classically differentiable: at any $x \notin E$, points $y \in E$ arbitrarily close to $x$ give the ratio $1/|y - x|$, which diverges; at any $x \in E$, the function is not even continuous. (At points $x \in E$ the value $u(x) = 1$ differs from the approximate limit $0$, so $u$ is not approximately differentiable there either; these points form a null set.) This demonstrates concretely that approximate differentiability is strictly weaker than classical differentiability.
[/example]
<!-- illustration-needed: density picture for approximate differentiability — show a ball B(x,r) with a shaded region where the difference quotient exceeds epsilon, and the shaded fraction shrinking to zero as r → 0, while classical differentiability requires the shaded region to be exactly empty -->
The indicator example establishes that approximate differentiability is strictly weaker than the classical notion, but it does so in a degenerate way: the function is zero almost everywhere, and the approximate derivative inherits this triviality. One might worry that the gap between the two notions is exhausted by such measure-zero modifications, in which case the concept would offer little beyond a bookkeeping device for null-set ambiguity. The next example dispels this concern by exhibiting a genuinely nontrivial Sobolev function, with singularities on a dense set, for which no choice of representative is classically differentiable anywhere.
[example: A Sobolev Function that is Not Classically Differentiable]
Let $n \ge 2$, $1 \le p < n$, and fix $0 < \beta < (n-p)/p$. The function $v(x) = |x|^{-\beta}$ belongs to $W^{1,p}(B(0,1))$, since $|\nabla v(x)| = \beta |x|^{-\beta - 1}$ and $(\beta + 1)p < n$. Enumerate a countable dense subset $\{q_j\}_{j \ge 1}$ of $B(0,1)$ and set
\begin{align*}
u(x) = \sum_{j=1}^{\infty} 2^{-j} |x - q_j|^{-\beta}.
\end{align*}
The series converges in $W^{1,p}(B(0,1))$, so $u \in W^{1,p}(B(0,1))$, yet $u$ is unbounded on every open subset of $B(0,1)$. No representative of $u$ is continuous at any point, so no representative is classically differentiable anywhere. The theorem of the next section nevertheless shows that $u$ is approximately differentiable at $\mathcal{L}^n$-a.e. point. Note that an isolated power singularity does not separate the two notions at the singular point itself: for $w(x) = |x|^{\alpha}$ with $0 < \alpha < 1$ and any linear map $L$, the quotient $|w(y) - w(0) - L(y)|/|y| \ge |y|^{\alpha - 1} - \|L\|$ exceeds a given $\varepsilon > 0$ on the whole punctured ball $\{0 < |y| < (\varepsilon + \|L\|)^{-1/(1-\alpha)}\}$, so the bad set has density $1$ at the origin and $w$ is not approximately differentiable there.
[/example]
These examples suggest that approximate differentiability strips away the classical requirement of uniformity in direction, retaining only the density-theoretic content. The next two sections make precise in what sense Sobolev functions satisfy this notion.
## $L^p$-Differentiability and the Sobolev Case
The obstacle to proving approximate differentiability for $W^{1,p}$ functions directly from the definition is that the Sobolev gradient $\nabla u$ exists only in the $L^p$ sense — it is defined by integration against test functions and has no pointwise meaning a priori. The strategy is to establish a stronger quantitative form, called $L^p$-differentiability, from which approximate differentiability follows by Markov's inequality.
[definition: $L^p$-Differentiability]
Let $1 \le p < \infty$, let $u \in L^p_{\mathrm{loc}}(\Omega)$, and let $x \in \Omega$. We say $u$ is **$L^p$-differentiable** at $x$ with derivative $L: \mathbb{R}^n \to \mathbb{R}$ if
\begin{align*}
\lim_{r \to 0} \left(\frac{1}{\mathcal{L}^n(B(x,r))} \int_{B(x,r)} \frac{|u(y) - u(x) - L(y-x)|^p}{|y-x|^p}\, d\mathcal{L}^n(y)\right)^{1/p} = 0.
\end{align*}
[/definition]
This is a genuine strengthening of approximate differentiability: $L^p$-differentiability at $x$ implies approximate differentiability at $x$ with the same linear map $L$. To see this, apply Markov's inequality to the integrand: for any $\varepsilon > 0$,
\begin{align*}
\frac{\mathcal{L}^n\!\left(\left\{\frac{|u(y) - u(x) - L(y-x)|}{|y-x|} > \varepsilon\right\} \cap B(x,r)\right)}{\mathcal{L}^n(B(x,r))} \le \frac{1}{\varepsilon^p} \cdot \frac{1}{\mathcal{L}^n(B(x,r))} \int_{B(x,r)} \frac{|u(y) - u(x) - L(y-x)|^p}{|y-x|^p}\, d\mathcal{L}^n(y),
\end{align*}
which tends to zero by the $L^p$-differentiability hypothesis. This implication is strict: a function can be approximately differentiable without being $L^p$-differentiable for any $p \ge 1$.
The central theorem of this chapter establishes $L^{p^*}$-differentiability almost everywhere for Sobolev functions in $W^{1,p}$ with $1 \le p < n$, where $p^* = np/(n-p)$ is the Sobolev conjugate exponent.
[quotetheorem:3081]
[citeproof:3081]
The hypothesis $p < n$ is essential and cannot be weakened to $p = n$ at this level of generality. When $p = n$ there is no Sobolev conjugate exponent in the usual sense (formally $p^* = \infty$), and the Sobolev-Poincaré inequality takes a different form involving exponential integrability (the Trudinger embedding) rather than a power. The statement of the theorem with $p^* = np/(n-p)$ is therefore genuinely restricted to $p < n$.
[remark: What Happens at $p = n$]
For $u \in W^{1,n}(\Omega)$ with $n \ge 2$, the Sobolev embedding gives $u \in L^q_{\mathrm{loc}}(\Omega)$ for all $q < \infty$ but not for $q = \infty$: functions like $u(x) = \log|\log|x||$ near the origin belong to $W^{1,n}$ but are unbounded. The proof above breaks down because the formal exponent $p^* = \infty$ would demand a uniform (sup-norm) bound on the difference quotient, which the $L^n$ integrability of $\nabla u$ does not supply. Nevertheless, $W^{1,n}$ functions are still approximately differentiable a.e.: since $W^{1,n}_{\mathrm{loc}}(\Omega) \subseteq W^{1,p}_{\mathrm{loc}}(\Omega)$ for every $p < n$, the theorem applies with any such $p$ and yields $L^q$-differentiability a.e. for every finite $q$.
[/remark]
The role of the precise representative deserves comment. For $p > n$, Morrey's inequality shows every $W^{1,p}$ function has a Hölder-continuous representative, so $u(x)$ is unambiguous. For $p \le n$, two representatives of the same $W^{1,p}$ equivalence class may differ on an $\mathcal{L}^n$-null set, which can nonetheless be dense, so the value $u(x)$ at an individual point is not determined by the class, and the $L^{p^*}$-differentiability statement genuinely depends on choosing the Lebesgue (precise) representative. The theorem says: after fixing this canonical choice, $L^{p^*}$-differentiability holds almost everywhere.
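For orientation, one common convention for the precise representative is sketched below. It is an assumption that this matches the course's earlier definition; conventions differ on the exceptional set where the limit fails to exist.
\begin{align*}
u^*(x) :=
\begin{cases}
\displaystyle \lim_{r \to 0} \frac{1}{\mathcal{L}^n(B(x,r))} \int_{B(x,r)} u \, d\mathcal{L}^n & \text{if the limit exists}, \\[2ex]
0 & \text{otherwise}.
\end{cases}
\end{align*}
By the Lebesgue differentiation theorem, $u^* = u$ holds $\mathcal{L}^n$-a.e., so $u^*$ is a representative of the same class, and it depends only on the class and not on the representative used to compute the averages.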
[example: Sharpness of the Exponent $p^*$]
Consider $u(x) = |x|^{\alpha}$ on $B(0,1) \subseteq \mathbb{R}^n$ with $1 \le p < n$. Since $|\nabla u(x)| = |\alpha|\,|x|^{\alpha - 1}$, integrating in polar coordinates shows $u \in W^{1,p}(B(0,1))$ precisely when $(\alpha - 1)p > -n$, i.e., when $\alpha > 1 - n/p$. Likewise $u \in L^{q}$ near the origin precisely when $\alpha q > -n$, i.e., when $\alpha > -n/q$. For $q = p^* = np/(n-p)$ the two thresholds coincide, because $-n/p^* = -(n-p)/p = 1 - n/p$: every member of the family that lies in $W^{1,p}$ also lies in $L^{p^*}$, with nothing to spare. For any $q > p^*$ we have $-n/q > 1 - n/p$, so choosing $\alpha \in (1 - n/p,\, -n/q]$ gives $u \in W^{1,p}$ with $u \notin L^q(B(0,1))$, and the $L^q$ average in the definition is already infinite at the origin. Improving $p^*$ to any larger exponent therefore fails for this family, which is the sense in which the exponent in the theorem is sharp.
[/example]
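A concrete member of this family may help fix ideas. The numbers below are an illustrative choice, with $\alpha$ taken strictly above the threshold $1 - n/p$: let $n = 3$, $p = 2$ (so $p^* = 6$ and $1 - n/p = -\tfrac12$), and $u(x) = |x|^{-1/4}$ on $B(0,1)$. Using $|\nabla u(x)| = \tfrac14 |x|^{-5/4}$ and polar coordinates,
\begin{align*}
\int_{B(0,1)} |\nabla u|^2 \, d\mathcal{L}^3 &= \frac{4\pi}{16} \int_0^1 r^{-5/2}\, r^2 \, dr = \frac{\pi}{4} \int_0^1 r^{-1/2}\, dr = \frac{\pi}{2}, \\
\int_{B(0,1)} |u|^{6} \, d\mathcal{L}^3 &= 4\pi \int_0^1 r^{-3/2}\, r^2 \, dr = 4\pi \int_0^1 r^{1/2}\, dr = \frac{8\pi}{3}, \\
\int_{B(0,1)} |u|^{12} \, d\mathcal{L}^3 &= 4\pi \int_0^1 r^{-3}\, r^2 \, dr = 4\pi \int_0^1 \frac{dr}{r} = \infty.
\end{align*}
So this $u$ lies in $W^{1,2}(B(0,1))$ and in $L^{6}$, but not in $L^{12}$; any exponent $q > 6$ can be defeated in the same way by taking $\alpha \in (-\tfrac12, -3/q]$.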
## $L^{1^*}$-Differentiability for BV Functions
The theorem above applies to Sobolev functions with $1 \le p < n$. Functions of bounded variation present a parallel but distinct situation: they arise naturally in problems where the gradient is a measure rather than an $L^1$ function. The analogue of the Sobolev gradient for a $BV$ function $u$ is the absolutely continuous part $D^a u$ of the distributional derivative $Du = D^a u + D^s u$, where $D^s u$ is the singular part. Since $D^a u \ll \mathcal{L}^n$, we can write $D^a u = \nabla u \cdot \mathcal{L}^n$ for an $\mathcal{L}^n$-a.e. defined vector field $\nabla u \in L^1(\Omega; \mathbb{R}^n)$.
[quotetheorem:3082]
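The proof follows the same Poincaré-Sobolev strategy as the $W^{1,p}$ case, but the critical input is the BV version of the Sobolev-Poincaré inequality. Writing $1^* = n/(n-1)$ and $(v)_{x,r}$ for the average of $v$ over $B(x,r)$, for $v \in BV(B(x,r))$ we have
\begin{align*}
\left(\int_{B(x,r)} |v - (v)_{x,r}|^{1^*}\, d\mathcal{L}^n\right)^{1/1^*} \lesssim |Dv|(B(x,r)),
\end{align*}
with a constant independent of $x$ and $r$. (The mean must be subtracted: without it the inequality already fails for nonzero constants.)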
Applying this to $v_x(y) = u(y) - u(x) - \nabla u(x) \cdot (y-x)$ and using the decomposition $Dv_x = D^a u - \nabla u(x) \mathcal{L}^n + D^s u$ on $B(x,r)$, one must show that both the $L^1$ part and the singular part of $|Dv_x|(B(x,r))$ are $o(r^n)$. The singular part shrinks because $|D^s u|(B(x,r)) = o(r^n)$ for $\mathcal{L}^n$-a.e. $x$ (the singular part is singular with respect to $\mathcal{L}^n$, so its density is zero $\mathcal{L}^n$-a.e.). The absolutely continuous part reduces to the same Lebesgue differentiation argument as before. The full BV theory is developed in the companion course GMT III.
This result underscores a key conceptual point: the singular part of $Du$ contributes nothing to the pointwise approximate derivative. The approximate derivative "sees" only the absolutely continuous part of $Du$, confirming that $\nabla_{\mathrm{ap}} u = \nabla u$ (the Radon-Nikodym density of $D^a u$) holds $\mathcal{L}^n$-almost everywhere.
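The simplest illustration of this point is the indicator of a half-plane. This is an illustrative example chosen here, not taken from the cited theorem. In $\mathbb{R}^2$, where $1^* = 2$, let $u = \chi_{\{x_1 > 0\}}$. Then
\begin{align*}
Du = e_1\, \mathcal{H}^1|_{\{x_1 = 0\}}, \qquad D^a u = 0, \qquad D^s u = Du, \qquad \nabla u = 0 \quad \mathcal{L}^2\text{-a.e.}
\end{align*}
At any $x$ with $x_1 \neq 0$ and any radius $r < |x_1|$, the ball $B(x,r)$ misses the line $\{x_1 = 0\}$, so
\begin{align*}
|D^s u|(B(x,r)) = 0 \qquad \text{and} \qquad u(y) - u(x) - \nabla u(x) \cdot (y - x) = 0 \quad \text{for all } y \in B(x,r).
\end{align*}
Hence $u$ is $L^{2}$-differentiable at every such $x$ with derivative $0 = \nabla u(x)$, even though $Du$ is a nonzero measure. On the line itself $|Du|(B(x,r)) = 2r$ is not $o(r^2)$ and $u$ has different approximate limits from the two sides, but the line is $\mathcal{L}^2$-null.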
## Approximate Differentiability Follows from $L^p$-Differentiability
The logical implication from $L^p$-differentiability to approximate differentiability is via Markov's inequality, as sketched earlier. It is worth recording this as an explicit proposition because it clarifies the hierarchy of differentiability notions.
[quotetheorem:3083]
[citeproof:3083]
The converse fails. Approximate differentiability at $x$ ignores sets of density zero at $x$, whereas the $L^p$ average does not, so a function can be approximately differentiable at a point without being $L^p$-differentiable there for any $p$; indeed, there exist functions that are approximately differentiable almost everywhere but belong to no $L^p$ class in any neighborhood. (Approximate differentiability a.e. does force $\mathcal{L}^n$-measurability, but it imposes no integrability.) The value of the theorem for Sobolev functions is therefore not merely that they are approximately differentiable (which would follow from weaker arguments), but that they are $L^{p^*}$-differentiable, a quantitatively stronger statement with better stability properties.
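A one-dimensional sketch of such a failure at a single point follows; the construction is an illustrative choice, not taken from the cited proof. On $\mathbb{R}$, let $A_k = (2^{-k},\, 2^{-k} + 4^{-k})$ for $k \ge 1$, let $A = \bigcup_k A_k$, and let $u = \chi_A$, so that $u(0) = 0$. For $2^{-k} \le r < 2^{1-k}$,
\begin{align*}
\frac{\mathcal{L}^1(A \cap B(0,r))}{\mathcal{L}^1(B(0,r))} \le \frac{\sum_{j \ge k} 4^{-j}}{2 \cdot 2^{-k}} = \frac{\tfrac43\, 4^{-k}}{2^{1-k}} = \tfrac23\, 2^{-k} \longrightarrow 0,
\end{align*}
so $A$ has density zero at $0$. Since $\{y : |u(y)|/|y| > \varepsilon\} \subseteq A$ for every $\varepsilon > 0$, $u$ is approximately differentiable at $0$ with derivative $L = 0$. On the other hand, taking $r_k = 2^{1-k}$ and using $A_k \subseteq B(0, r_k)$ with $|y| \le r_k$ on $A_k$,
\begin{align*}
\frac{1}{2 r_k} \int_{B(0,r_k)} \frac{|u(y)|^p}{|y|^p}\, dy \;\ge\; \frac{\mathcal{L}^1(A_k)}{2\, r_k^{1+p}} = \frac{4^{-k}}{2 \cdot 2^{(1-k)(1+p)}} = 2^{\,k(p-1) - 2 - p} \;\ge\; 2^{-2-p}
\end{align*}
for every $p \ge 1$. The averages do not tend to zero, so $u$ is not $L^p$-differentiable at $0$ for any $p \in [1,\infty)$.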
## Why Approximate Differentiability is the Right Notion
The practical significance of approximate differentiability emerges most directly when one tries to apply tools built for Lipschitz maps to Sobolev functions: the area and coarea formulas of Chapters 3 and 4 (in particular the Whitney-type approximation used in Chapter 3's proof of the area formula for Lipschitz maps), or the chain rules and change-of-variables arguments of PDE. Classical differentiability almost everywhere is available only in the range $p > n$. Yet many results that appear to require pointwise derivatives can be reformulated using approximate derivatives.
The key fact making this work is that $\nabla_{\mathrm{ap}} u(x) = \nabla u(x)$ $\mathcal{L}^n$-a.e., where $\nabla u$ is the Sobolev weak gradient. This identification holds because: (a) the approximate derivative at $x$, if it exists, is unique; (b) the weak gradient $\nabla u$ is an $\mathcal{L}^n$-a.e. defined vector whose distributional action on test functions is the Sobolev derivative; and (c) the $L^{p^*}$-differentiability theorem shows the Sobolev gradient qualifies as the approximate derivative almost everywhere. In short, the Sobolev gradient gains pointwise meaning almost everywhere through approximate differentiability.
This has direct consequences in PDE. When studying fine properties of solutions (such as the Morse-Sard theorem for Sobolev maps, the coarea formula with $W^{1,p}$ weights, or the blow-up analysis of solutions at singular points), one needs a pointwise notion of the gradient that exists almost everywhere and behaves correctly under density arguments. The approximate gradient fills this role precisely. It is the natural bridge between the functional-analytic world of Sobolev spaces and the pointwise geometric world of measure-theoretic analysis.
The chapter that follows (Chapter 6) takes a complementary perspective: when $p > n$, Morrey's inequality promotes the Sobolev regularity to classical differentiability almost everywhere, a stronger conclusion still. That chain — from the weak definition of $W^{1,p}$ through approximate differentiability for all $p$ and onward to classical differentiability for $p > n$ — constitutes the complete differentiability theory for Sobolev functions developed in this course. Chapter 8 later studies Whitney's extension theorem, which in a sense reverses the problem: given approximate differentiability data on a set, when can it be extended to a globally smooth function?
In summary, Rademacher's theorem guarantees a.e. differentiability for Lipschitz maps, and the area and coarea formulas govern how such maps transform measure. For maps that lack classical derivatives, approximate differentiability is the appropriate weakening: it preserves enough structure to recover the integral-geometric identities in the Sobolev regime.
# 6. Differentiability for Sobolev Functions with $p>n$
The preceding chapters developed differentiation theory for Lipschitz functions and established the area and coarea formulas, all resting on the bedrock of Rademacher's theorem. In dimension $n \ge 2$, Sobolev functions with $p \leq n$ need not possess classical derivatives at even a single point: superposing suitably cut-off translates of a singular profile, such as $\log\log(1/|x|)$ for $p = n$ or a negative power of $|x|$ for $p < n$, along a countable dense set produces a $W^{1,p}$ function that is unbounded on every open set. (Dimension one is special: $W^{1,1}$ functions on an interval are absolutely continuous, hence free of jump discontinuities and differentiable a.e.) The condition $p > n$ changes the picture entirely: Morrey's inequality forces every $W^{1,p}$ function to be Hölder continuous, and from Hölder continuity together with the integrability of the gradient one can extract classical differentiability almost everywhere by reducing to the Lipschitz setting already handled by Rademacher. This chapter works through that reduction carefully, connecting the Sobolev embedding theory to the differentiation theory developed in the previous chapters.
## Morrey's Inequality and Hölder Continuity
The central obstacle to differentiability for Sobolev functions with $p \leq n$ is that such functions need not be continuous — they are defined only as equivalence classes of measurable functions, and no representative need be better than locally $p$-integrable. When $p > n$, an integral estimate traps the oscillation of $u$ over balls and forces a modulus of continuity.
[definition: Hölder Continuous Representative]
Let $U \subseteq \mathbb{R}^n$ be open and $0 < \gamma \leq 1$. A function $u: U \to \mathbb{R}$ belongs to $C^{0,\gamma}(\bar{U})$ if $u$ is bounded and
\begin{align*}
[u]_{C^{0,\gamma}(\bar{U})} := \sup_{\substack{x, y \in U \\ x \neq y}} \frac{|u(x) - u(y)|}{|x - y|^\gamma} < \infty.
\end{align*}
The Hölder norm is $\|u\|_{C^{0,\gamma}(\bar{U})} := \|u\|_{L^\infty(U)} + [u]_{C^{0,\gamma}(\bar{U})}$.
[/definition]
Morrey's inequality asserts that every $W^{1,p}(\mathbb{R}^n)$ function with $p > n$ admits a Hölder continuous representative, and quantifies the exponent of Hölder continuity in terms of $p$ and $n$.
[quotetheorem:62]
[citeproof:62]
<!-- illustration-needed: the Morrey averaging argument — show a ball B(x,r) with two points x and z inside, the line integral path from w to z, and the integral averaging over w in B that leads to the Riesz potential bound -->
The hypothesis $p > n$ is sharp in two distinct ways. First, the exponent $\gamma = 1 - n/p$ degenerates to $0$ as $p \to n$, so no fixed Hölder continuity can be extracted at the threshold. The standard counterexample in dimension $n = 2$ at $p = n = 2$ is $u(x) = \log \log(1 + 1/|x|)$ near the origin: this function belongs to $W^{1,2}(B(0,1))$ (the analogous $L^2$ computation for the variant $\log\log(1/|x|)$ is carried out below in the explanation of why $p \leq n$ fails), yet it is unbounded near the origin and therefore has no continuous, let alone Hölder continuous, representative. Second, the condition $p > n$ cannot be replaced by any weaker integrability hypothesis on $\nabla u$ alone: for $p < n$, one cannot even guarantee local boundedness of $u$, and the best Sobolev embedding gives $u \in L^{p^*}$ with $p^* = np/(n-p)$ but not $u \in L^\infty$.
What the theorem does NOT say: it does not assert that the Hölder exponent $\gamma = 1 - n/p$ is attained by $u$, only that it is a guaranteed lower bound on the Hölder regularity. A specific function in $W^{1,p}$ may be smoother — for instance, a function with $|\nabla u| \leq C$ is Lipschitz, which is strictly better than the $\gamma$-Hölder conclusion for any $\gamma < 1$. The theorem gives the worst-case exponent over all $W^{1,p}$ functions.
When $p = \infty$, meaning $u \in W^{1,\infty}(U)$, the function is locally Lipschitz on $U$ (and Lipschitz if $U$ is convex), so Rademacher's theorem applies directly. Morrey's inequality for finite $p > n$ is the bridge between the Lipschitz theory and the general Sobolev setting.
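To see how the guaranteed exponent interpolates toward the Lipschitz case, here are a few sample values of $\gamma = 1 - n/p$ in dimension $n = 2$ (an illustrative tabulation only):
\begin{align*}
p = 3:\ \gamma = \tfrac13, \qquad p = 4:\ \gamma = \tfrac12, \qquad p = 8:\ \gamma = \tfrac34, \qquad p \to \infty:\ \gamma \to 1.
\end{align*}
At the other end, $\gamma \to 0$ as $p \downarrow 2$, matching the loss of continuity at the threshold $p = n$.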
[remark: Local Version]
For $u \in W^{1,p}_{\mathrm{loc}}(\Omega)$ with $\Omega \subset \mathbb{R}^n$ open and $p > n$, the same conclusion holds locally: $u$ has a representative in $C^{0,\gamma}_{\mathrm{loc}}(\Omega)$ with $\gamma = 1 - n/p$. This follows by applying Morrey's inequality on compact subdomains $\Omega' \subset\subset \Omega$ after extending by a cutoff.
[/remark]
The local version is what the course uses in the differentiability theorem below. The global $W^{1,p}(\mathbb{R}^n)$ hypothesis in Morrey's inequality is overly restrictive for many applications: the conclusion "classically differentiable $\mathcal{L}^n$-a.e." is a local statement, and the local version frees us to work on arbitrary open domains without worrying about behaviour at infinity or boundary extension.
## Classical Differentiability Almost Everywhere
Hölder continuity alone does not guarantee classical differentiability. The Cantor–Lebesgue function $c$ is a sharp illustration: it is $\alpha$-Hölder continuous exactly for $\alpha \le \log 2/\log 3$ (the identity $c(3^{-k}) = 2^{-k}$ rules out any larger exponent), yet its classical derivative is zero wherever it exists and it fails to be differentiable on the Cantor set, a set of $\mathcal{L}^1$-measure zero but positive Hausdorff dimension. In particular $c$ is not the integral of its derivative and lies in no $W^{1,p}$. Hölder continuous functions can behave even worse: Weierstrass-type functions are Hölder continuous and nowhere differentiable. The key additional ingredient for $W^{1,p}$ functions is that their weak gradients are actual $L^p$ functions, and this allows one to pass from Hölder continuity to Lipschitz continuity on most of the domain via a maximal-function argument, after which Rademacher does the work.
[quotetheorem:3084]
[citeproof:3084]
This is a structural result worth unpacking. The theorem says that even though $u \in W^{1,p}_{\mathrm{loc}}(\Omega)$ is defined only up to sets of measure zero, its unique Hölder-continuous representative is classically differentiable at almost every point. The classical derivative at those points coincides with the weak gradient — there is no ambiguity between the two notions. This reconciles the distributional definition of $\nabla u$ with the pointwise classical definition wherever the latter exists.
[explanation: Why $p \leq n$ Fails]
The theorem is sharp in the following sense. When $p = n$, Morrey's inequality fails and Hölder continuity is no longer guaranteed. Consider $u: B(0,1/2) \subseteq \mathbb{R}^2 \to \mathbb{R}$ defined by $u(x) = \log\log(1/|x|)$. Direct computation shows $|\nabla u(x)| = 1/(|x| \log(1/|x|))$. Integrating in polar coordinates and substituting $t = \log(1/r)$, so that $dt = -dr/r$ and $r \in (0, 1/2)$ corresponds to $t \in (\log 2, \infty)$:
\begin{align*}
\int_{B(0,1/2)} |\nabla u|^2\, d\mathcal{L}^2 = 2\pi \int_0^{1/2} \frac{r\, dr}{r^2 \log^2(1/r)} = 2\pi \int_{\log 2}^\infty \frac{dt}{t^2} = \frac{2\pi}{\log 2} < \infty,
\end{align*}
so $u \in W^{1,2}(B(0,1/2))$. Yet $u(x) \to +\infty$ as $x \to 0$, so $u$ is not even bounded, let alone Hölder continuous or classically differentiable at the origin. Since $\{0\}$ has $\mathcal{L}^2$-measure zero, this single bad point does not violate "a.e. differentiability" — but one can modify the example to produce a Cantor-like set of non-differentiability points while remaining in $W^{1,n}$. For $p < n$, the situation is worse: $W^{1,p}$ contains functions that are unbounded on every open set, and classical differentiability can fail everywhere.
[/explanation]
The hypothesis $p > n$ therefore does real work in two places in the proof: it guarantees Hölder continuity via Morrey (Step 1), and it ensures the Hardy–Littlewood maximal function argument in Step 2 produces Lipschitz approximants with $\mathcal{L}^n$-negligible exceptional sets. The $L^p$ bound on the maximal function requires $p > 1$, but the stronger condition $p > n$ is what forces the exceptional set $\Omega \setminus E_\lambda$ to shrink to measure zero as $\lambda \to \infty$ at a rate controlled by $\lambda^{-p}\|\nabla u\|_{L^p}^p$. The borderline case $p = n = 1$ is special: a $W^{1,1}$ function of one variable is absolutely continuous, and more generally a $BV$ function of one variable is a difference of two monotone functions, so both are classically differentiable a.e. by Lebesgue's theorem. The failure of a.e. differentiability at $p = n$ is thus a phenomenon of dimension $n \ge 2$.
[example: Explicit Verification for a Power-Type Function]
Let $n = 2$, $p = 3 > n = 2$, and define $u: B(0,1) \to \mathbb{R}$ by $u(x) = |x|^\alpha$ for some $\alpha > 0$. The weak gradient is $\nabla u(x) = \alpha |x|^{\alpha-2} x$ for $x \neq 0$. We have $|\nabla u(x)| = \alpha |x|^{\alpha - 1}$, so
\begin{align*}
\int_{B(0,1)} |\nabla u|^3\, d\mathcal{L}^2 = \alpha^3 \cdot 2\pi \int_0^1 r^{3(\alpha-1)} r\, dr = 2\pi\alpha^3 \int_0^1 r^{3\alpha - 2}\, dr.
\end{align*}
This integral converges if and only if $3\alpha - 2 > -1$, i.e. $\alpha > 1/3$. So for $\alpha > 1/3$, $u \in W^{1,3}(B(0,1))$.
Morrey's inequality predicts Hölder continuity with exponent $\gamma = 1 - 2/3 = 1/3$. To verify directly that $u(x) = |x|^\alpha$ is $\alpha$-Hölder continuous on $B(0,1)$ for $0 < \alpha \leq 1$, we establish the elementary inequality $|a^\alpha - b^\alpha| \leq |a - b|^\alpha$ for all $a, b \geq 0$.
The key step is the subadditivity $(s + t)^\alpha \leq s^\alpha + t^\alpha$ for $s, t \geq 0$ and $0 < \alpha \leq 1$. Since $0 < \alpha \leq 1$, the function $r \mapsto r^{\alpha - 1}$ is non-increasing on $(0, \infty)$, so $(s + t)^{\alpha - 1} \leq s^{\alpha - 1}$ and $(s + t)^{\alpha - 1} \leq t^{\alpha - 1}$. Hence
\begin{align*}
(s + t)^\alpha = s\,(s + t)^{\alpha - 1} + t\,(s + t)^{\alpha - 1} \leq s \cdot s^{\alpha - 1} + t \cdot t^{\alpha - 1} = s^\alpha + t^\alpha.
\end{align*}
Now assume WLOG $a \geq b \geq 0$ and apply this with $s = b$ and $t = a - b$:
\begin{align*}
a^\alpha = (b + (a - b))^\alpha \leq b^\alpha + (a - b)^\alpha,
\end{align*}
so $a^\alpha - b^\alpha \leq (a - b)^\alpha$. Setting $a = |x|$, $b = |y|$ and using the reverse triangle inequality $\bigl||x| - |y|\bigr| \leq |x - y|$:
\begin{align*}
|u(x) - u(y)| = \bigl||x|^\alpha - |y|^\alpha\bigr| \leq \bigl||x| - |y|\bigr|^\alpha \leq |x - y|^\alpha.
\end{align*}
So for $\alpha \leq 1$, $u$ is $\alpha$-Hölder continuous with Hölder constant $1$. The Morrey exponent $\gamma = 1/3$ is the worst-case guaranteed exponent for any $W^{1,3}(B(0,1))$ function, not the exponent of this particular $u$ (which is $\alpha$-Hölder, and $\alpha > 1/3$ by the integrability condition).
For classical differentiability: $u(x) = |x|^\alpha$ is smooth on $B(0,1) \setminus \{0\}$, with classical gradient $\alpha|x|^{\alpha-2}x$ there. At $x_0 = 0$, any derivative would have to vanish because $u$ is even, so we test whether $u(rh) - u(0) = r^\alpha|h|^\alpha = r^\alpha$ is $o(r)$ as $r \to 0$ for unit vectors $h$: this holds iff $\alpha > 1$. So $u$ is classically differentiable at $0$ when $\alpha > 1$, and not classically differentiable at $0$ when $1/3 < \alpha \leq 1$. The theorem guarantees differentiability $\mathcal{L}^2$-a.e., and indeed the single bad point $\{0\}$ (present only when $\alpha \leq 1$) is an $\mathcal{L}^2$-null set, exactly as the theorem allows. For $\alpha > 1$, $u$ is classically differentiable everywhere in $B(0,1)$, which is stronger than the a.e. conclusion.
[/example]
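As a numerical instance of the example above (the value $\alpha = \tfrac12$ is an illustrative choice), take $u(x) = |x|^{1/2}$ on $B(0,1) \subseteq \mathbb{R}^2$ with $p = 3$. The general formulas specialise to
\begin{align*}
\int_{B(0,1)} |\nabla u|^3\, d\mathcal{L}^2 &= 2\pi \left(\tfrac12\right)^{3} \int_0^1 r^{-1/2}\, dr = \frac{2\pi}{8} \cdot 2 = \frac{\pi}{2}, \\
\frac{|u(x) - u(0)|}{|x|^{\beta}} &= |x|^{\frac12 - \beta} \quad \text{is bounded near } 0 \text{ iff } \beta \le \tfrac12, \\
\frac{u(rh) - u(0)}{r} &= r^{-1/2} \longrightarrow \infty \quad \text{as } r \to 0, \ |h| = 1.
\end{align*}
So this $u$ lies in $W^{1,3}(B(0,1))$, its Hölder exponent is exactly $\tfrac12$ (better than the guaranteed $\gamma = \tfrac13$), and it fails to be differentiable only at the origin.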
The example shows that the qualifier "a.e." in the differentiability theorem cannot be dropped: there are $W^{1,p}$ functions ($p > n$) that fail to be differentiable at individual points, but these exceptional points form an $\mathcal{L}^n$-null set. The theorem also does not assert that the null set has any particular Hausdorff-dimensional structure; that finer question is addressed in the next section.
## Capacity and the Dimension of Singular Sets
The set where classical differentiability fails for a $W^{1,p}$ function with $p > n$ is not just $\mathcal{L}^n$-null: $\mathcal{L}^n$-nullness is a coarse measure of smallness, and one can ask whether the exceptional set is small in a much stronger sense. The answer is given by Sobolev capacity, which captures a notion of "negligibility" that is calibrated to the function space $W^{1,p}$ itself. Understanding the size of singular sets via capacity underpins the finer approximation results in Chapter 8.
[definition: Sobolev $p$-Capacity]
Let $1 \leq p < \infty$ and $E \subseteq \mathbb{R}^n$. The Sobolev $p$-capacity of $E$ is
\begin{align*}
\mathrm{Cap}_p(E) := \inf\bigl\{\|u\|_{W^{1,p}(\mathbb{R}^n)}^p : u \in W^{1,p}(\mathbb{R}^n),\, u \geq 1 \text{ on a neighbourhood of } E\bigr\}.
\end{align*}
A set has $p$-capacity zero if it can be covered by open sets of arbitrarily small $p$-capacity.
[/definition]
Sets of $p$-capacity zero are the "thin" exceptional sets for $W^{1,p}$ theory. They refine the notion of an $\mathcal{L}^n$-null set: every set of zero $p$-capacity is $\mathcal{L}^n$-null, but the converse fails, and how small such sets must be depends on $p$. For $p > n$, a single point has positive $p$-capacity (since Morrey's inequality allows point evaluation), while for $p \leq n$ in dimension $n \ge 2$ points have zero $p$-capacity.
The relationship between capacity and Hausdorff dimension is the key geometric fact. The substantive theorem concerns the regime $1 \leq p \leq n$, where the capacity-Hausdorff dimension inequality has genuine content.
[quotetheorem:3085]
The theorem says that a set of zero $p$-capacity cannot carry any Hausdorff measure in dimension greater than $n - p$. In other words, its Hausdorff dimension is at most $n - p$. This is a genuine constraint: for $p < n$, the threshold $n - p$ is a positive real number, and the theorem asserts that zero-capacity sets are "dimensionally small" relative to the ambient space $\mathbb{R}^n$.
To understand why each hypothesis is needed, consider what fails at the boundary. The hypothesis $\mathrm{Cap}_p(E) = 0$ is essential: a single point has positive $p$-capacity for $p > n$, and for $p \leq n$ a single point has zero capacity, but a line segment in $\mathbb{R}^n$ (which has $\mathcal{H}^1(E) > 0$) does not in general have zero $p$-capacity for all $p$. A concrete illustration: in $\mathbb{R}^2$ with $p = 1$, the unit interval $E = [0,1] \times \{0\}$ has $\mathrm{Cap}_1(E) > 0$ (a function equalling $1$ on a neighbourhood of $E$ must have $\|\nabla u\|_{L^1}$ bounded below), so it is not an example of a zero-capacity set; here $\mathcal{H}^1(E) = 1 > 0$ at the borderline dimension $s = n - p = 2 - 1 = 1$, which the theorem (stated for $s > n - p$) does not address. The threshold $s > n - p$ is also sharp when $1 < p < n$: there exist sets of zero $p$-capacity with $\mathcal{H}^{n-p}(E) > 0$ (the capacity condition controls only dimensions strictly above $n - p$).
What the theorem does NOT say: it does not give a quantitative estimate for the capacity of a specific set such as a ball or a lower-dimensional surface. The capacity of $\overline{B}(x_0, r)$ behaves like $r^{n-p}$ for $1 \leq p < n$, but this quantitative statement requires a separate computation and is not a consequence of the dimension bound above.
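The upper half of that separate computation is short enough to sketch. It assumes the norm convention $\|u\|_{W^{1,p}}^p = \int |u|^p + \int |\nabla u|^p$ and writes $\omega_n = \mathcal{L}^n(B(0,1))$; both are notational assumptions made here. Take the Lipschitz test function
\begin{align*}
\varphi_r(x) := \min\bigl\{1,\ (3 - |x|/r)_+\bigr\}, \qquad \varphi_r = 1 \text{ on } B(0,2r), \qquad \varphi_r = 0 \text{ outside } B(0,3r), \qquad |\nabla \varphi_r| = \tfrac1r \text{ on } B(0,3r) \setminus B(0,2r).
\end{align*}
Since $B(0,2r)$ is a neighbourhood of $\overline{B}(0,r)$, the function $\varphi_r$ is admissible, and
\begin{align*}
\mathrm{Cap}_p\bigl(\overline{B}(0,r)\bigr) \le \int |\varphi_r|^p + \int |\nabla \varphi_r|^p \le \omega_n 3^n r^n + \omega_n (3^n - 2^n)\, r^{n-p} \le 2\, \omega_n 3^n\, r^{n-p} \qquad (0 < r \le 1).
\end{align*}
For $p < n$ the right-hand side tends to $0$ as $r \to 0$, which shows in particular that single points have zero $p$-capacity in that range. The matching lower bound needs the Sobolev inequality and is not attempted here, and at $p = n$ this particular test function gives no decay, so a different (logarithmic) choice is required.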
[remark: The $p > n$ Case is Degenerate]
For $p > n$, every nonempty set $E \subseteq \mathbb{R}^n$ has positive $p$-capacity. To see this: since Morrey's inequality gives a continuous embedding $W^{1,p}(\mathbb{R}^n) \hookrightarrow C^{0,\gamma}(\mathbb{R}^n)$, point evaluation $u \mapsto u(x_0)$ is a bounded linear functional on $W^{1,p}(\mathbb{R}^n)$. If $E$ were nonempty and had $\mathrm{Cap}_p(E) = 0$, one could find functions $u_k \in W^{1,p}(\mathbb{R}^n)$ with $u_k \geq 1$ on a neighbourhood of $E$ and $\|u_k\|_{W^{1,p}} \to 0$. But then $\|u_k\|_{C^{0,\gamma}} \to 0$ by Morrey, so $u_k \to 0$ uniformly, contradicting $u_k \geq 1$ on a neighbourhood of any point of $E$. Thus, for $p > n$, $\mathrm{Cap}_p(E) = 0$ implies $E = \varnothing$: there are no nonempty sets of zero $p$-capacity, and the theorem above reduces to the vacuous statement that $\mathcal{H}^s(\varnothing) = 0$ for all $s > n - p$ (which holds regardless of the sign of $n - p$). The interesting case, where the capacity-Hausdorff theorem has real content, is $1 \leq p \leq n$.
[/remark]
This stands in sharp contrast to the case $p \leq n$. For $p \leq n$, individual points have zero $p$-capacity and can be freely added to exceptional sets. The jump from $p \leq n$ to $p > n$ in capacity theory mirrors the jump in the embedding theory: below $n$, Sobolev functions are not even locally bounded; at $n$, they are in $L^q$ for all finite $q$ but not in $L^\infty$; above $n$, they are Hölder continuous and pointwise well-defined everywhere.
[remark: Forward Reference to Whitney Approximation]
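In Chapter 8, the capacity theory for general $p$ (including $p \leq n$) will be used to prove that every $W^{1,p}$ function agrees with a $C^1$ function outside a set of small $p$-capacity, a result enabled by the Whitney extension theorem and Lusin's theorem working in concert. For $p > n$ the capacity formulation carries no information, because every nonempty set has $p$-capacity bounded below by a positive constant; the natural statement in that range measures the exceptional set with Lebesgue measure instead: for every $\varepsilon > 0$, $u$ agrees with a $C^1$ function outside a set of $\mathcal{L}^n$-measure less than $\varepsilon$. This is consistent with the a.e. differentiability established in this chapter, and it does not assert that $u$ itself is $C^1$ (the function $|x|^\alpha$ with $\alpha \le 1$ from the example above is not).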
[/remark]
The chain of implications for $p > n$ can now be summarised. Morrey's inequality converts $W^{1,p}$ membership into Hölder continuity of the precise representative. Hölder continuity, combined with the $L^p$ integrability of the weak gradient, allows the Lipschitz truncation via the Hardy–Littlewood maximal function to produce Lipschitz approximants that cover $\mathcal{L}^n$-almost all of the domain. Rademacher's theorem applied to each Lipschitz piece yields classical differentiability $\mathcal{L}^n$-a.e. Finally, the capacity remark shows that for $p > n$ capacity cannot refine this picture: every nonempty set has positive $p$-capacity, so the only capacity-zero set is the empty set, and the exceptional set of non-differentiability is $\mathcal{L}^n$-null without being capacity-negligible in general. Each step uses the condition $p > n$ in an essential way, and the chapter's results fail, in progressively severe ways, as $p$ decreases toward $n$ and then below it.
The exponent $p = n$ is thus the critical threshold: for $p > n$, the integrability of the gradient and dimension-counting together force classical a.e. differentiability, so that the Lipschitz theory from Chapters 1–4, the approximate theory from Chapter 5, and the results of this chapter together furnish a complete differentiability picture at all regularity scales.
# 7. Convex Functions and Alexandrov's Theorem
Chapters 1 through 6 built the machinery needed to differentiate Lipschitz maps and integrate over their images: Rademacher's theorem grants a.e. first-order differentiability, the area and coarea formulas govern how Lebesgue measure transforms under Lipschitz maps, and the approximate differentiability framework extends these ideas to Sobolev functions. Chapter 7 turns to a class that is simultaneously more restricted and more powerful: convex functions. The restriction is that the graph bends only one way; the payoff is that convex functions are twice differentiable almost everywhere — a conclusion far beyond what Rademacher alone can deliver. The central result is Alexandrov's theorem, which identifies a pointwise second-order Taylor expansion valid at $\mathcal{L}^n$-almost every point. The proof runs through the theory of subdifferentials, which are the natural substitute for gradients at non-smooth points, and culminates in an application to the Monge–Ampère measure, connecting this chapter to the modern theory of optimal transport.
## Local Lipschitz Continuity of Convex Functions
Every convex function on $\mathbb{R}^n$ is automatically Lipschitz on any compact subset of the interior of its domain, yet this regularity is genuinely interior: it can fail at boundary points, and no assumption on the modulus of convexity is needed. The proof must explain where the Lipschitz constant comes from, since convexity by itself is a purely algebraic condition involving no metric information.
[definition: Convex Function]
A function $f: U \to \mathbb{R}$ defined on a convex open set $U \subseteq \mathbb{R}^n$ is **convex** if for every $x, y \in U$ and every $\lambda \in [0,1]$,
\begin{align*}
f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y).
\end{align*}
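Equivalently, for each $x \in U$ there is an affine function $\ell$ with $\ell(x) = f(x)$ and $\ell \le f$ on $U$; that is, the graph of $f$ admits a non-vertical supporting hyperplane at every point $(x, f(x))$ in $\mathbb{R}^n \times \mathbb{R}$ and lies above it.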
[/definition]
Convexity imposes a global constraint on how $f$ behaves between any two points. The definition says nothing about smoothness, yet the geometric structure enforces enough rigidity that local boundedness — and from it, local Lipschitz continuity — follow without any further assumption.
[quotetheorem:3086]
[citeproof:3086]
The Lipschitz constant $L = 2M/r$ makes the dependence on the geometry explicit: functions that are large near $K$ (large $M$) or are evaluated close to the boundary of $U$ (small $r$) require larger Lipschitz constants. Some blow-up near the boundary is unavoidable: the convex function $f(x) = -\sqrt{1 - x^2}$ on $U = (-1,1)$ is locally Lipschitz on every compact $K \subseteq (-1,1)$, but its Lipschitz constant blows up as $K$ approaches $\pm 1$.
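The blow-up for $f(x) = -\sqrt{1 - x^2}$ can be computed explicitly. In the sketch below, the intervals $K_a = [-a, a]$ with $0 < a < 1$ are an illustrative choice of compact sets.
\begin{align*}
f'(x) = \frac{x}{\sqrt{1 - x^2}}, \qquad f''(x) = \frac{1}{(1 - x^2)^{3/2}} > 0, \qquad \operatorname{Lip}(f; K_a) = \sup_{|x| \le a} |f'(x)| = \frac{a}{\sqrt{1 - a^2}}.
\end{align*}
Writing $a = 1 - r$, so that $r$ is the distance from $K_a$ to the boundary of $U$, gives $1 - a^2 = r(2 - r)$ and hence
\begin{align*}
\operatorname{Lip}(f; K_a) = \frac{1 - r}{\sqrt{r(2 - r)}} \sim \frac{1}{\sqrt{2r}} \qquad \text{as } r \to 0.
\end{align*}
The constant does blow up, although at the rate $r^{-1/2}$ rather than $r^{-1}$: the example shows that dependence on $r$ cannot be removed, not that the rate $1/r$ is attained by this particular function.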
The theorem does not hold at boundary points, and the hypothesis that $K$ is compactly contained in $U$ cannot be weakened. The convex function $f(x) = -\log(1 - x^2)$ on $U = (-1,1)$ has $f'(x) = 2x/(1-x^2) \to \infty$ as $x \to 1^-$, so $f$ is not Lipschitz on $[-1+\varepsilon, 1)$ for any $\varepsilon > 0$. Moreover, local Lipschitz continuity requires the function to be finite-valued: a convex function that takes the value $+\infty$ at a boundary point of its domain is not Lipschitz in any neighborhood of that point, even though convexity itself permits $+\infty$ values. For functions defined on all of $\mathbb{R}^n$, global Lipschitz continuity would additionally require the function to have bounded subgradients everywhere — a strictly stronger condition.
[remark: Rademacher Applies]
Since every convex function on an open convex domain $U \subseteq \mathbb{R}^n$ is locally Lipschitz, Rademacher's theorem (Chapter 1) applies immediately: $f$ is differentiable $\mathcal{L}^n$-almost everywhere on $U$, and the gradient $\nabla f$ exists a.e. Alexandrov's theorem will sharpen this to a.e. second-order differentiability.
[/remark]
Rademacher's theorem is the starting point, but it is not the end of the story. Knowing $\nabla f$ exists a.e. tells us nothing about the second-order behavior of $f$, and for a general Lipschitz function the gradient can be as irregular as an arbitrary bounded measurable function. The additional structure of convexity — specifically the monotonicity of the subdifferential — will impose enough regularity on $\nabla f$ to guarantee that a Hessian exists almost everywhere. The key intermediate step is understanding what replaces the gradient at the points where $f$ is not differentiable.
## The Subdifferential
At a point where $f$ is not differentiable, there is no single gradient, but convexity ensures the existence of at least one supporting hyperplane. The collection of all such hyperplane slopes is the subdifferential — a set-valued map that tracks all directional information about $f$ at a point and replaces the gradient where it fails to exist.
[definition: Subdifferential]
Let $f: U \to \mathbb{R}$ be convex on the open convex set $U \subseteq \mathbb{R}^n$. For $x \in U$, the **subdifferential** of $f$ at $x$ is
\begin{align*}
\partial f(x) := \{p \in \mathbb{R}^n : f(y) \ge f(x) + p \cdot (y - x) \text{ for all } y \in U\}.
\end{align*}
Each $p \in \partial f(x)$ is called a **subgradient** of $f$ at $x$.
[/definition]
A subgradient $p$ at $x$ is precisely the slope vector of an affine function $\ell(y) = f(x) + p \cdot (y-x)$ that touches the graph of $f$ at $x$ from below. Convexity guarantees this is possible: the epigraph $\{(y, t) \in U \times \mathbb{R} : t \ge f(y)\}$ is a convex set in $\mathbb{R}^{n+1}$, and the supporting hyperplane theorem ensures at least one supporting hyperplane exists at each boundary point $(x, f(x))$; since $x$ is an interior point of $U$, such a hyperplane is non-vertical and is therefore the graph of an affine function. The subdifferential is not just a theoretical device: it extends the differential calculus to non-smooth convex functions and captures the full first-order information about $f$ at every point of its domain, including those where the classical gradient does not exist. The essential algebraic property of the subdifferential is its monotonicity.
[quotetheorem:3087]
[citeproof:3087]
The monotonicity property $(p - q) \cdot (x - y) \ge 0$ is the key structural fact that drives Alexandrov's theorem. It says the subdifferential map $x \mapsto \partial f(x)$ is a monotone multifunction — a condition that controls how rapidly subgradients can change and will allow us to differentiate $\nabla f$ in the next section.
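The inequality can be read off directly from the definition of the subdifferential. The following is a sketch, presumably the same computation as in the cited proof, which is not reproduced here. Let $p \in \partial f(x)$ and $q \in \partial f(y)$. Testing the subgradient inequality at $x$ against the point $y$, and the one at $y$ against the point $x$, gives
\begin{align*}
f(y) &\ge f(x) + p \cdot (y - x), \\
f(x) &\ge f(y) + q \cdot (x - y).
\end{align*}
Adding the two lines and cancelling $f(x) + f(y)$ from both sides yields
\begin{align*}
0 \ge p \cdot (y - x) + q \cdot (x - y) = -(p - q) \cdot (x - y), \qquad \text{i.e.} \qquad (p - q) \cdot (x - y) \ge 0.
\end{align*}
Only the defining inequality of $\partial f$ is used; neither differentiability nor any bound on $f$ enters.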
The theorem does not say that $\partial f(x)$ is a singleton for all $x$, nor that $\nabla f$ is continuous wherever it exists. Without the convexity hypothesis, none of these properties hold: a general Lipschitz function need not have any supporting hyperplane at a point of non-differentiability (consider $f(x) = -|x|$ at the origin), and the subdifferential as defined would be empty. Monotonicity is entirely a consequence of convexity, not of Lipschitz continuity alone: a Lipschitz function whose gradient lacks monotonicity (e.g., $f(x) = x^2\sin(1/x)$ near $x = 0$, with $f(0) = 0$) has no usable subdifferential theory. The compactness of $\partial f(x)$ also fails for convex functions defined on non-open domains or taking $+\infty$ values: at a boundary point of the effective domain, the subdifferential may be unbounded or empty.
[example: Subdifferential of the Absolute Value]
Let $f: \mathbb{R} \to \mathbb{R}$, $f(x) = |x|$. At $x_0 \ne 0$, $f$ is differentiable and $\partial f(x_0) = \{\operatorname{sgn}(x_0)\}$. At $x_0 = 0$, $f$ is not differentiable. A subgradient $p$ must satisfy $|y| \ge p \cdot y$ for all $y \in \mathbb{R}$. Taking $y = 1$ gives $p \le 1$ and taking $y = -1$ gives $-p \le 1$, i.e. $p \ge -1$. Conversely, any $p \in [-1, 1]$ satisfies $p \cdot y \le |y|$ for all $y$ by the bound $|p \cdot y| \le |p| \cdot |y| \le |y|$. Therefore $\partial f(0) = [-1, 1]$, which is a compact convex set containing more than one point, confirming non-differentiability.
To verify monotonicity explicitly: for $x > 0$ and $y < 0$, take $p = 1 \in \partial f(x)$ and $q = -1 \in \partial f(y)$. Then $(p - q)(x - y) = 2(x - y) > 0$ since $x > y$. For $x > 0$ and $y = 0$, take $p = 1 \in \partial f(x)$ and any $q \in \partial f(0) = [-1, 1]$: $(1 - q)(x - 0) = (1 - q)x \ge 0$ since $q \le 1$. The geometry is transparent: subgradients increase (weakly) as we move right, because the slope of the graph jumps from $-1$ to $+1$ across the kink.
[/example]
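The membership test in this example can be checked numerically. The following is a minimal sketch, assuming a finite grid of sample points and a small floating-point tolerance (both illustrative choices, not part of the text); the helper name `is_subgradient_abs` is hypothetical.

```python
def is_subgradient_abs(p, x0, samples):
    """Return True if |y| >= |x0| + p*(y - x0) holds at every sample y."""
    tol = 1e-12  # illustrative floating-point tolerance
    return all(abs(y) >= abs(x0) + p * (y - x0) - tol for y in samples)

# Sample points y in [-3, 3], covering both sides of the kink at 0.
samples = [k / 10.0 for k in range(-30, 31)]

# At x0 = 0, slopes in [-1, 1] pass the test; slopes outside fail.
inside = [is_subgradient_abs(p, 0.0, samples) for p in (-1.0, -0.3, 0.0, 0.7, 1.0)]
outside = [is_subgradient_abs(p, 0.0, samples) for p in (-1.2, 1.1)]

# Monotonicity (p - q)(x - y) >= 0 for x = 2, y = 0, p = 1 and sampled q in [-1, 1].
monotone = all((1.0 - q) * (2.0 - 0.0) >= 0 for q in (-1.0, -0.5, 0.0, 0.5, 1.0))
print(inside, outside, monotone)
```

A finite grid can only refute membership, never prove it, so the passing cases illustrate the analytic argument rather than replace it.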
This one-dimensional picture generalizes cleanly. For a convex function $f: \mathbb{R}^n \to \mathbb{R}$, the subdifferential at a point of non-differentiability is a convex body of dimension reflecting the "kink" in the graph. At points of differentiability, it collapses to a single point. Alexandrov's theorem will show that this collapse occurs almost everywhere. The size of the subdifferential at a given point is thus a measure of local non-smoothness: a large subdifferential signals a sharp corner, while a singleton subdifferential certifies genuine differentiability.
[example: Subdifferential of a Max Function]
Let $f(x) = \max(x_1, x_2)$ for $x = (x_1, x_2) \in \mathbb{R}^2$. This function is convex as the maximum of two affine functions. On the open half-plane $\{x_1 > x_2\}$, $f = x_1$ is smooth and $\partial f(x) = \{(1, 0)\}$. On $\{x_2 > x_1\}$, $\partial f(x) = \{(0, 1)\}$. On the diagonal $\{x_1 = x_2\}$, a subgradient $p = (p_1, p_2)$ must satisfy $\max(y_1, y_2) \ge x_1 + p_1(y_1 - x_1) + p_2(y_2 - x_2)$ for all $y$.
Write $s$ for the common value $x_1 = x_2$ and let $t > 0$. Taking $y = (s + t, s)$ gives $s + t \ge s + p_1 t$, so $p_1 \le 1$. Taking $y = (s - t, s)$ gives $s \ge s - p_1 t$, i.e. $0 \ge -p_1 t$, so $p_1 \ge 0$. By symmetry $0 \le p_2 \le 1$. Taking $y = (s + \tau, s + \tau)$ gives $s + \tau \ge s + (p_1 + p_2)\tau$, and since this must hold for both $\tau > 0$ and $\tau < 0$ we get $p_1 + p_2 = 1$. Conversely, if $p_1, p_2 \ge 0$ and $p_1 + p_2 = 1$, then $s + p_1(y_1 - s) + p_2(y_2 - s) = p_1 y_1 + p_2 y_2 \le \max(y_1, y_2)$, so every such $p$ is a subgradient. Combining: $\partial f(x) = \{(p_1, p_2): p_1 \ge 0, p_2 \ge 0, p_1 + p_2 = 1\}$ on the diagonal, which is the line segment $\operatorname{conv}\{(1,0), (0,1)\}$. The subdifferential collapses from a segment to a point as we move off the diagonal, reflecting exactly where the two pieces of $f$ compete.
[/example]
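The description of $\partial f$ on the diagonal can be sanity-checked on a grid. This is a minimal sketch, assuming a finite sample grid, the diagonal point $(0.5, 0.5)$, and a small tolerance; these choices and the helper name `is_subgradient` are illustrative and not part of the text.

```python
def f(y):
    """f(y) = max(y1, y2) on R^2."""
    return max(y[0], y[1])

def is_subgradient(p, x, samples, tol=1e-12):
    """Check f(y) >= f(x) + p.(y - x) at every sample point y."""
    fx = f(x)
    return all(
        f(y) >= fx + p[0] * (y[0] - x[0]) + p[1] * (y[1] - x[1]) - tol
        for y in samples
    )

grid = [k / 4.0 for k in range(-8, 9)]
samples = [(a, b) for a in grid for b in grid]
x = (0.5, 0.5)  # a point on the diagonal x1 = x2

# Points (t, 1 - t) of the segment conv{(1,0), (0,1)} pass the test.
on_segment = [is_subgradient((t, 1.0 - t), x, samples) for t in (0.0, 0.25, 0.5, 1.0)]

# Slopes with p1 + p2 != 1, or with a negative component, are rejected.
off_segment = [is_subgradient(p, x, samples) for p in ((0.6, 0.6), (0.3, 0.3), (1.2, -0.2))]
print(on_segment, off_segment)
```

The three rejected slopes violate, in turn, $p_1 + p_2 \le 1$, $p_1 + p_2 \ge 1$, and $p_2 \ge 0$, matching the three constraints derived in the example.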
The calculation above exemplifies a general principle: for $f = \max(f_1, \ldots, f_k)$ with each $f_i$ smooth and convex, the subdifferential at a point $x$ is the convex hull of the gradients $\nabla f_i(x)$ over the active indices $\{i: f_i(x) = f(x)\}$. Away from the set where the maximum is non-unique, $f$ is smooth and the subdifferential is a singleton, consistent with property (iii) of the theorem. The non-smooth locus has $\mathcal{L}^n$-measure zero. This max-function example also shows that the dimension of $\partial f(x)$ can jump discontinuously: it is one on the diagonal and drops to zero as soon as $x$ leaves it, so the subdifferential is not continuous as a set-valued map in the Hausdorff topology.
## Alexandrov's Theorem
The local Lipschitz continuity of convex functions means that Rademacher's theorem applies, giving $\mathcal{L}^n$-a.e. first-order differentiability. The second derivative, however, requires the geometry of the subdifferential in an essential way. The challenge is that $\nabla f$ itself is only defined almost everywhere, and the set where it fails to exist can be dense, so the classical definition of the Hessian as the derivative of the gradient needs a distributional or measure-theoretic substitute.
[quotetheorem:3088]
[citeproof:3088]
The conclusion of Alexandrov's theorem is a genuine pointwise second-order expansion, not merely an $L^2$ or distributional statement. This is substantially stronger than what Sobolev regularity alone provides: a $W^{2,1}$ function has a distributional Hessian in $L^1$, but pointwise second-order differentiability requires the Lebesgue differentiation step applied to the gradient, which here rests on the $BV$ structure of $\nabla f$ supplied by convexity.
[explanation: What Alexandrov's Theorem Does Not Say]
Alexandrov's theorem guarantees second-order differentiability at $\mathcal{L}^n$-a.e. point. Several natural stronger conclusions fail, and understanding these failures clarifies why the a.e. result is optimal.
**The exceptional set can be dense.** A convex function can fail to be twice differentiable on a set that, while of measure zero, is dense in $U$. A simple example on $U = (0,1)$ is $f(x) = \sum_{k \ge 1} 2^{-k} |x - q_k|$, where $(q_k)$ enumerates $\mathbb{Q} \cap (0,1)$: the series converges uniformly, $f$ is convex, and $f$ is not even once differentiable at any $q_k$. The exceptional behaviour can also be hidden inside a $C^1$ function. Take $g: [0,1] \to \mathbb{R}$ to be the Cantor staircase (nondecreasing, continuous, $g' = 0$ a.e.) and set $f(x) = \int_0^x g(t)\, dt$. Then $f$ is convex (since $f' = g$ is nondecreasing), $f'(x) = g(x)$ exists everywhere, and $f''(x) = g'(x) = 0$ at $\mathcal{L}^1$-a.e. $x$. However, the distributional second derivative of $f$ is the Cantor measure $dg$, which is singular and carried by the Cantor set, and this growth of $g$ is not captured by the a.e. value $g' = 0$ that Alexandrov's theorem supplies as the quadratic Taylor coefficient. More dramatically, one can construct a convex $f: \mathbb{R}^2 \to \mathbb{R}$ whose Hessian $D^2f$ is a matrix-valued measure with a singular part supported on a dense set.
**Continuity of $D^2f$ is not guaranteed.** Alexandrov's theorem produces a measurable matrix-valued function $x_0 \mapsto D^2f(x_0)$, defined a.e. This function need not be continuous: for $f(x) = |x|$ on $\mathbb{R}^n$ with $n \ge 2$, the Hessian $D^2f(x_0) = \frac{1}{|x_0|}\left(I - \frac{x_0 \otimes x_0}{|x_0|^2}\right)$ at $x_0 \ne 0$ depends on the direction $x_0/|x_0|$ and blows up like $1/|x_0|$ as $x_0 \to 0$, and it is undefined at the origin. The theorem says only that $D^2f$ exists a.e., with no control on its regularity.
**The theorem does not apply to all $C^1$ functions.** Convexity is essential. There exist $C^1$ functions on $\mathbb{R}$ that are nowhere twice differentiable (e.g., certain Weierstrass-type functions with controlled growth). Without the monotonicity of the subdifferential — which is a consequence of convexity — the gradient can oscillate wildly in a $C^1$ function, and no Hessian need exist anywhere. The hypothesis of convexity is not a mere technical convenience; it is the source of the $BV$ regularity of $\nabla f$ that makes the whole argument work.
**It does not give $C^2$ approximation.** A convex function can be approximated locally uniformly by smooth convex functions (via mollification, which preserves convexity), but the Hessians of the approximations converge to the Hessian measure of $f$ only weakly as measures. When that measure has a singular part, the Hessians cannot converge in $L^1_{\mathrm{loc}}$ to the pointwise Alexandrov Hessian, so the approximation is not one in $C^2$ or in $W^{2,1}$.
[/explanation]
## Monge–Ampère Measures
The Monge–Ampère measure is the natural measure-theoretic object associated to a convex function. Its definition does not require any differentiability of $f$ — it is defined via the subdifferential — yet Alexandrov's theorem reveals that it has a clean density formula wherever the classical Hessian exists.
[definition: Normal Map and Monge–Ampère Measure]
Let $f: U \to \mathbb{R}$ be convex on the open convex set $U \subseteq \mathbb{R}^n$. The **normal map** (or **subdifferential map**) of $f$ is the set-valued function $\partial f: U \to 2^{\mathbb{R}^n}$ sending each point $x \in U$ to its subdifferential $\partial f(x)$.
For any Borel set $E \subseteq U$, define
\begin{align*}
\partial f(E) := \bigcup_{x \in E} \partial f(x) \subseteq \mathbb{R}^n.
\end{align*}
The **Monge–Ampère measure** of $f$ is the Borel measure $\mu_f$ on $U$ defined by
\begin{align*}
\mu_f(E) := \mathcal{L}^n(\partial f(E))
\end{align*}
for every Borel set $E \subset U$.
[/definition]
One must verify that $\mu_f$ is indeed a well-defined Borel measure. The key is that $\partial f(E)$ is Lebesgue measurable whenever $E$ is Borel (using the closed graph property of $\partial f$, which makes $\partial f(C)$ compact for compact $C \subset U$) and that $\mu_f$ is countably additive. The countable additivity uses the fact that the image sets $\partial f(E_k)$ for disjoint Borel sets $E_k$ overlap only on a set of $\mathcal{L}^n$-measure zero. Indeed, the overlap lies in the set of slopes $p$ that are subgradients of $f$ at two or more distinct points, and this set is contained in the set where the Legendre transform $f^*$ fails to be differentiable, which is $\mathcal{L}^n$-null because $f^*$ is convex.
[quotetheorem:3089]
[citeproof:3089]
The formula $d\mu_f = \det D^2 f \, d\mathcal{L}^n$ makes the Monge–Ampère measure concrete: it measures how much the normal map $\partial f$ stretches volume. A flat piece of the graph (where $D^2 f = 0$) contributes zero mass; a piece with large curvature (large $\det D^2 f$) contributes mass proportional to the Hessian determinant.
The density formula $d\mu_f = \det D^2 f \, d\mathcal{L}^n$ is not valid without each hypothesis in the theorem. Without convexity, $\mu_f$ is not well-defined as a Borel measure in the first place: the set $\partial f(E)$ need not be measurable, and countable additivity can fail because the images of disjoint sets can overlap in a non-negligible way. Without the hypothesis that $f$ is twice differentiable at $x_0$ (that is, without Alexandrov's theorem supplying the full measure set $\Omega_2$), the pointwise density $\det D^2 f(x_0)$ does not exist, and one cannot write $\mu_f = \det D^2 f \cdot \mathcal{L}^n$ even in the weak sense. The formula also says nothing about the singular part of $\mu_f$: if $\mu_f$ has a singular component (as in the case $f(x) = |x|$, where $\mu_f = \omega_n \delta_0$ with $\omega_n = \mathcal{L}^n(B(0,1))$), the formula accounts only for the absolutely continuous part, and $\mu_f(E)$ may strictly exceed $\int_E \det D^2 f \, d\mathcal{L}^n$ on any Borel set $E$ that charges the singular part.
[example: Monge–Ampère Measure of a Quadratic]
Let $f(x) = \frac{1}{2}x^\top A x$ for a fixed symmetric positive definite matrix $A \in \mathbb{R}^{n \times n}$. Then $f$ is strictly convex, $\nabla f(x) = Ax$, and $D^2 f(x) = A$ everywhere. The normal map $\partial f(x) = \{Ax\}$ is single-valued everywhere. For a Borel set $E \subseteq \mathbb{R}^n$,
\begin{align*}
\partial f(E) = A(E) = \{Ax : x \in E\}.
\end{align*}
The Lebesgue measure of $A(E)$ is $\mathcal{L}^n(A(E)) = |\det A| \cdot \mathcal{L}^n(E) = \det A \cdot \mathcal{L}^n(E)$ (using $\det A > 0$). Therefore $\mu_f(E) = \det A \cdot \mathcal{L}^n(E)$, which is exactly $\int_E \det D^2 f(x)\, d\mathcal{L}^n(x) = \det A \cdot \mathcal{L}^n(E)$, confirming the density formula.
This example also illustrates why $\det D^2 f$ is the right quantity: it is the Jacobian determinant of the gradient map $\nabla f(x) = Ax$, and $\mu_f$ records exactly the volume distortion of this map.
[/example]
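The identity $\mu_f(E) = \det A \cdot \mathcal{L}^n(E)$ can be checked by hand in the plane. The following is a minimal sketch for $n = 2$, assuming the particular matrix $A$ below and taking $E$ to be the unit square (both hypothetical choices); the image $A(E)$ is a parallelogram whose area is computed with the shoelace formula.

```python
def shoelace_area(vertices):
    """Area of a simple polygon given by its vertices in order."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0

# Hypothetical symmetric positive definite matrix with det A = 5.
A = ((2.0, 1.0), (1.0, 3.0))

def grad_f(x):
    """Gradient of f(x) = x^T A x / 2, i.e. the linear map x -> A x."""
    return (A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1])

# E = unit square, area 1; its image under grad f is a parallelogram.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
image = [grad_f(v) for v in square]

det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
mu_f_E = shoelace_area(image)  # Lebesgue measure of the image of E
print(mu_f_E, det_A)
```

Because the gradient map is linear here, mapping the four corners is enough to describe the image exactly; for a general convex $f$ the image of a square would not be a polygon.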
The density formula breaks down at points where $f$ is not twice differentiable. For a convex function with a singular part in $D^2 f$ (a matrix-valued measure with a nonzero singular component), the Monge–Ampère measure $\mu_f$ may have a singular part supported on the exceptional set. The formula $d\mu_f = \det D^2 f \, d\mathcal{L}^n$ accounts only for the absolutely continuous part of $\mu_f$; the full measure $\mu_f$ may be strictly larger.
[example: Singular Monge–Ampère Measure]
Let $f: \mathbb{R} \to \mathbb{R}$, $f(x) = |x|$. Then $\nabla f(x) = \operatorname{sgn}(x)$ for $x \ne 0$, so $\partial f(0) = [-1, 1]$ and $\partial f(x) = \{\operatorname{sgn}(x)\}$ for $x \ne 0$. For the set $E = \{0\}$: $\partial f(\{0\}) = [-1, 1]$, so $\mu_f(\{0\}) = \mathcal{L}^1([-1,1]) = 2$. But $\mathcal{L}^1(\{0\}) = 0$, so $\mu_f$ has an atom at $0$ of mass $2$. The distributional second derivative of $|x|$ is $2\delta_0$ (twice the Dirac mass at the origin), confirming that the Monge–Ampère measure in dimension $n = 1$ is indeed $D^2 f$ in the distributional sense, including its singular part. On $\mathbb{R} \setminus \{0\}$, $D^2 f = 0$ a.e., so the absolutely continuous part of $\mu_f$ vanishes, while the singular part contributes the atom at $0$.
In higher dimensions, $f(x) = |x|$ on $\mathbb{R}^n$ gives $\partial f(0) = \overline{B}(0, 1)$ (the closed unit ball), so $\mu_f(\{0\}) = \mathcal{L}^n(\overline{B}(0,1)) = \omega_n$, again a positive atom at the kink. The density formula $\det D^2 f$ captures none of this atom, since $D^2 f(x) = \frac{1}{|x|}(I - \frac{x \otimes x}{|x|^2})$ for $x \ne 0$ has $\det D^2 f = 0$ when $n \ge 2$ (the matrix has a zero eigenvalue in the radial direction). Note that for $n \ge 2$ the distributional Hessian of $|x|$ is itself absolutely continuous, because $1/|x|$ is locally integrable; the atom of $\mu_f$ at $0$ is therefore not a singular part of $D^2 f$ but a feature of the normal map, which sends the single point $0$ to the whole unit ball.
[/example]
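One way to see the atom at the kink emerge is to smooth $|x|$ and watch the mass concentrate. The sketch below works in dimension one and assumes the smoothing $f_\varepsilon(x) = \sqrt{x^2 + \varepsilon^2}$, which is a choice made here for illustration and not taken from the text. For a smooth strictly convex function on $\mathbb{R}$ the Monge–Ampère mass of $(-a, a)$ is the length of the gradient image, $f_\varepsilon'(a) - f_\varepsilon'(-a)$.

```python
import math

def mu_interval(eps, a):
    """Monge-Ampere mass of (-a, a) for f_eps(x) = sqrt(x^2 + eps^2) in 1D,
    computed as the length f_eps'(a) - f_eps'(-a) of the gradient image."""
    def derivative(x):
        return x / math.sqrt(x * x + eps * eps)
    return derivative(a) - derivative(-a)

# Fixed small interval (-0.01, 0.01); smoothing parameter eps shrinking to 0.
masses = [mu_interval(eps, 0.01) for eps in (1e-1, 1e-2, 1e-3, 1e-5)]
print(masses)
```

As $\varepsilon \to 0$ the mass of the fixed small interval increases toward $2$, the mass of the atom $\mu_f(\{0\})$ computed in the example.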
The connection to optimal transport is immediate: the Monge–Ampère equation $\det D^2 f = g$ for a given function $g \ge 0$ asks precisely that $\mu_f = g \cdot \mathcal{L}^n$, i.e., that $\mathcal{L}^n(\partial f(E)) = \int_E g \, d\mathcal{L}^n$ for every Borel set $E$; equivalently, the gradient map pushes the measure $g \cdot \mathcal{L}^n$ forward to Lebesgue measure restricted to the image $\partial f(U)$. Brenier's theorem in optimal transport identifies the optimal map for the quadratic cost as the gradient of a convex potential $f$, obtained from the Kantorovich dual problem, and Alexandrov's theorem, establishing the existence of $D^2 f$ a.e., is one of the analytical inputs to the regularity theory of these optimal transport maps.
The differentiability program—whether classical, approximate, or Sobolev—culminates with convex functions, a distinguished class where second derivatives exist a.e. by Alexandrov's theorem and connect naturally to optimal transport, a gateway application where the fine measure-theoretic analysis of this course meets the deterministic geometry of convex bodies.
# 8. Whitney's Extension Theorem and $C^1$ Approximation
This chapter is part of a course on Geometric Measure Theory II, and it builds directly on the differentiability theory developed in earlier chapters. The reader is assumed to be comfortable with basic measure theory, with the elementary properties of Lipschitz functions established in Chapter 1, with the Sobolev space machinery introduced in Chapter 5, and with the convex analysis tools assembled in Chapter 7. Some familiarity with Radon measures and weak-* convergence, in the form developed in GMT I, will also be useful when the BV approximation result is reached at the end of the chapter. The preceding chapters established a hierarchy of differentiability results: Rademacher's theorem guarantees that Lipschitz functions are differentiable almost everywhere, the Sobolev embedding gives classical differentiability for $W^{1,p}$ with $p > n$, and Alexandrov's theorem extends second-order differentiability to convex functions. What remains is to ask not just where these functions are differentiable, but whether each such function admits a $C^1$ representative that agrees with it, both in value and in gradient, outside a set of arbitrarily small Lebesgue measure. This chapter answers that question through two interlocking tools: Whitney's extension theorem, which promotes data on a closed set to a global $C^1$ function, and the $C^1$ approximation theorems for Lipschitz, Sobolev, and BV functions, which show that all three classes satisfy a strong Lusin-type property. These results form the technical backbone of many "reduce to the smooth case" arguments throughout geometric measure theory.
## The Whitney $C^1$ Condition
The obstruction to extending a function $f: K \to \mathbb{R}$ from a closed set $K \subseteq \mathbb{R}^n$ to a $C^1$ function on all of $\mathbb{R}^n$ is not merely that $f$ must be continuous — one must also prescribe what the gradient should be, and ensure that the pair $(f, d)$ behaves, along $K$, as if $f$ were already differentiable with gradient $d$. Making this precise requires a notion of "approximate first-order Taylor remainder" that is intrinsic to $K$.
[definition: Whitney $C^1$ Condition]
Let $K \subseteq \mathbb{R}^n$ be closed, and let $f : K \to \mathbb{R}$ and $d : K \to \mathbb{R}^n$ be continuous (where $d$ is thought of as a prescribed gradient). For $a, b \in K$ with $a \neq b$, define the first-order remainder
\begin{align*}
R(a, b) := \frac{f(b) - f(a) - d(a) \cdot (b - a)}{|b - a|}.
\end{align*}
The pair $(f, d)$ satisfies the **Whitney $C^1$ condition** on $K$ if
\begin{align*}
R(a, b) \to 0 \quad \text{uniformly as } |a - b| \to 0, \; a, b \in K.
\end{align*}
That is, for every $\varepsilon > 0$ there exists $\delta > 0$ such that $|R(a,b)| < \varepsilon$ whenever $a, b \in K$ and $0 < |a - b| < \delta$.
[/definition]
The Whitney condition should be understood as a joint condition on $f$ and $d$ together, not on $f$ alone. If $f$ were already $C^1$ on a neighbourhood of $K$, then $d = \nabla f|_K$ would satisfy this condition with $R(a,b) \to 0$ as a consequence of the definition of differentiability. The theorem reverses this implication: the condition on $K$ is sufficient to build a $C^1$ extension to all of $\mathbb{R}^n$.
Before turning to the extension theorem itself, it is worth pausing to test the definition against simple sets. The statement quantifies over all pairs $a, b \in K$ approaching one another, so its content depends on the topological texture of $K$: a uniformly discrete set, one with $|a-b|$ bounded below over distinct points, imposes no constraint at all, since the condition is then vacuous for small $\delta$; a set of positive measure, by contrast, forces the condition to interact with every direction in which approach is possible. The next example sits at the easy end of this spectrum and shows that even on a set with an accumulation point the condition can be satisfied trivially.
[example: Whitney Condition on a Convergent Sequence]
Let $K = \{1/n : n \in \mathbb{N}\} \cup \{0\}$ (a convergent sequence with its limit), and let $f \equiv 0$ and $d \equiv 0$ on $K$. Then $R(a,b) = 0$ for all distinct $a, b \in K$, and the Whitney condition is satisfied trivially. The extension promised by Whitney's theorem is simply any $C^1$ function that vanishes on $K$ together with its derivative, for instance the zero function itself. This example shows that the condition can hold on sets with accumulation points, but only in the easiest case: when $f \equiv 0$ and $d \equiv 0$ the condition is automatic. The difficulty arises as soon as $f$ or $d$ is not identically zero, because the two must then be compatible at every scale.
[/example]
The uniformity in the Whitney condition is essential. A pair $(f, d)$ might satisfy $R(a, b) \to 0$ for each fixed $a$ as $b \to a$ (pointwise differentiability on $K$) without satisfying the uniform version; the uniform condition is strictly stronger and is what guarantees that the extension can be done in $C^1$ rather than merely $C^0$.
To see why pointwise convergence of the remainder is too weak, it helps to imagine $K$ accumulating onto a single point $0$ along a sequence on which $f$ oscillates. Pointwise convergence at each fixed $a$ uses only pairs $(a, b)$ with $b$ in a small neighbourhood that may be tailored to $a$, and so cannot detect oscillation between sequence points themselves. The uniform condition, by contrast, must also handle pairs $(a_m, b_m)$ where both $a_m, b_m \to 0$ simultaneously; this is precisely the regime in which oscillation of $f$ over short scales becomes visible. The next example exhibits a one-dimensional sequence in which the pointwise condition holds yet the uniform condition fails, with $|R(a_m, b_m)|$ blowing up explicitly.
[example: Failure of Pointwise vs. Uniform Whitney Condition]
Let $K = \{0\} \cup \{1/k : k \in \mathbb{N}\}$ and define $f(1/k) = (-1)^k / k$ for $k \geq 1$ and $f(0) = 0$, with $d \equiv 0$. We show the uniform Whitney condition fails near $0$. Take $a_m = 1/(2m)$ and $b_m = 1/(2m+1)$. Then
\begin{align*}
|a_m - b_m| = \frac{1}{2m} - \frac{1}{2m+1} = \frac{1}{2m(2m+1)},
\end{align*}
and since $d \equiv 0$,
\begin{align*}
f(a_m) - f(b_m) = \frac{1}{2m} - \frac{-1}{2m+1} = \frac{1}{2m} + \frac{1}{2m+1} = \frac{4m+1}{2m(2m+1)}.
\end{align*}
Thus
\begin{align*}
|R(a_m, b_m)| = \frac{|f(b_m) - f(a_m) - d(a_m)(b_m - a_m)|}{|b_m - a_m|} = \frac{(4m+1)/(2m(2m+1))}{1/(2m(2m+1))} = 4m + 1,
\end{align*}
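which diverges to $+\infty$ as $m \to \infty$, even though $|a_m - b_m| \to 0$. The pair $(f, 0)$ does not satisfy the Whitney condition uniformly near $0$, and no $C^1$ extension of $f$ with gradient $0$ on $K$ exists. In this particular example the pointwise condition at $a = 0$ fails as well, since $R(0, 1/k) = (-1)^k$. Replacing $f(1/k)$ by $(-1)^k k^{-3/2}$ gives $R(0, 1/k) = (-1)^k k^{-1/2} \to 0$, so the pointwise condition then holds at every point of $K$ (it is vacuous at the isolated points $1/k$), while the same pairs give $|R(a_m, b_m)| \ge 4m/\sqrt{2m+1} \to \infty$: the pointwise condition holds and the uniform one fails.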
[/example]
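The closed form $|R(a_m, b_m)| = 4m + 1$ can be confirmed in exact rational arithmetic. This is a minimal sketch; the helper names `f` and `remainder` are hypothetical, and points of $K$ are indexed by the integer $k$ with $a = 1/k$.

```python
from fractions import Fraction

def f(k):
    """f(1/k) = (-1)^k / k on the sequence points of K."""
    return Fraction((-1) ** k, k)

def remainder(ka, kb):
    """R(a, b) with a = 1/ka, b = 1/kb and prescribed gradient d = 0."""
    a, b = Fraction(1, ka), Fraction(1, kb)
    return (f(kb) - f(ka)) / abs(b - a)

# Pairs a_m = 1/(2m), b_m = 1/(2m+1) approaching 0 together.
values = [abs(remainder(2 * m, 2 * m + 1)) for m in range(1, 6)]
print(values)
```

Using `Fraction` avoids any rounding, so the comparison with $4m + 1$ is exact rather than approximate.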
## Whitney's Extension Theorem
The core difficulty in proving Whitney's extension theorem is not extending $f$ continuously (a continuous extension of $f$ from the closed set $K$ to $\mathbb{R}^n$ always exists by the Tietze extension theorem) but ensuring that the extended function is $C^1$ across the boundary $\partial K$ (where $\mathbb{R}^n \setminus K$ meets $K$), with gradient equal to the prescribed $d$ on $K$. The mechanism that makes this work is the Whitney decomposition of the open set $\mathbb{R}^n \setminus K$ into dyadic cubes, together with a partition of unity adapted to this decomposition.
Without some structural control on how the cubes in $\mathbb{R}^n \setminus K$ approach $K$, the partition-of-unity argument breaks down: cubes at very different scales could interfere, preventing the patched-together function from having a well-defined gradient at boundary points of $K$. The Whitney decomposition resolves this by requiring each cube to be comparable in size to its distance from $K$, which means that cubes touching one another have comparable side lengths and that the cubes shrink at a controlled rate as they approach any boundary point of $K$.
[definition: Whitney Decomposition]
Let $K \subseteq \mathbb{R}^n$ be closed and nonempty. A **Whitney decomposition** of $\mathbb{R}^n \setminus K$ is a countable collection of closed dyadic cubes $\{Q_j\}_{j \in \mathbb{N}}$ with sides parallel to the coordinate axes satisfying:
1. $\mathbb{R}^n \setminus K = \bigcup_j Q_j$ (the cubes cover the complement).
2. The interiors $Q_j^\circ$ are pairwise disjoint.
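3. For each cube $Q_j$ with side length $\ell(Q_j)$ and diameter $\operatorname{diam}(Q_j) = \sqrt{n}\,\ell(Q_j)$:
\begin{align*}
\operatorname{diam}(Q_j) \leq \operatorname{dist}(Q_j, K) \leq 4\operatorname{diam}(Q_j).
\end{align*}
That is, each cube is comparable in size to its distance from $K$, with constants depending only on $n$ when size is measured by side length.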
[/definition]
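A one-dimensional instance makes the definition concrete. The sketch below assumes the closed set $K = \{0, 1\}$ and decomposes $(0,1)$ by the usual stopping rule: subdivide a dyadic interval until its length is at most its distance to $K$. In dimension one the side length and the diameter coincide. The function name `whitney_intervals` and the truncation depth are hypothetical choices; a true Whitney decomposition has infinitely many intervals accumulating at $K$.

```python
from fractions import Fraction

def whitney_intervals(max_depth):
    """Dyadic subintervals of [0, 1] accepted by the stopping rule
    length <= dist(interval, K), for the assumed closed set K = {0, 1}."""
    accepted = []
    stack = [(Fraction(0), Fraction(1), 0)]
    while stack:
        lo, hi, depth = stack.pop()
        length = hi - lo
        dist = min(lo, 1 - hi)  # distance from [lo, hi] to K = {0, 1}
        if length <= dist:
            accepted.append((lo, hi))
        elif depth < max_depth:
            mid = (lo + hi) / 2
            stack.append((lo, mid, depth + 1))
            stack.append((mid, hi, depth + 1))
        # Intervals still touching K at max_depth are dropped (truncation).
    return sorted(accepted)

cubes = whitney_intervals(8)

# Comparability: length <= dist <= 4 * length for every accepted interval.
comparable = all(hi - lo <= min(lo, 1 - hi) <= 4 * (hi - lo) for lo, hi in cubes)
covered = sum(hi - lo for lo, hi in cubes)
print(len(cubes), comparable, covered)
```

The upper bound holds because an accepted interval has a rejected parent: the parent's length exceeds its distance to $K$, and passing to the child changes the distance by at most one parent length.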
<!-- illustration-needed: cubes Q_j accumulating onto K with side lengths comparable to dist(Q_j, K) — show a compact set K (e.g. a line segment or Cantor-like set) with large dyadic cubes far away and progressively smaller cubes crowding toward the boundary of K, each labeled with its side length and distance to K satisfying the comparability inequality -->
The comparability condition is the key geometric feature: cubes far from $K$ are large, while cubes near $K$ are small and closely spaced. This ensures that any function defined on $K$ can be interpolated across $\mathbb{R}^n \setminus K$ in a way that matches $K$-values at all scales.
[quotetheorem:3090]
[citeproof:3090]
The Whitney Extension Theorem does not say that the extension is unique. Infinitely many $C^1$ functions can agree with $(f, d)$ on $K$; the theorem merely guarantees existence of at least one. The extension depends on the choice of Whitney decomposition and partition of unity, and different choices yield different $\tilde{f}$'s, all valid. This non-uniqueness is not a defect but a feature: it gives flexibility when one needs the extension to satisfy additional conditions (e.g., bounded support) by choosing the partition of unity appropriately.
The theorem also does not say that the extension is $C^2$. If the pair $(f, d)$ satisfies a Whitney $C^2$ condition (involving second-order remainders and prescribed second-order data $H$), the same partition-of-unity construction with second-order Taylor polynomials $P_j(x) = f(x_j) + d(x_j) \cdot (x - x_j) + \tfrac{1}{2}(x - x_j)^\top H(x_j)(x - x_j)$ in place of the affine $P_j$ produces a $C^2$ extension. But the $C^1$ result requires only first-order compatibility, and attempting to obtain higher regularity from the same data is not possible in general. To see why, take $K = [-1, 1] \subseteq \mathbb{R}$, $f(x) = |x|^{3/2}$ and $d(x) = f'(x) = \tfrac{3}{2}\operatorname{sgn}(x)|x|^{1/2}$: the data $(f, d)$ satisfy the $C^1$ Whitney condition on $K$ because $f$ is $C^1$ on a neighbourhood of $K$, but no $C^2$ extension exists, since any extension coincides with $f$ on $(-1, 1)$, where $f''(x) = \tfrac{3}{4}|x|^{-1/2}$ is unbounded near $0$. Higher regularity requires a separate Whitney $C^k$ condition on the higher-order remainders, which is a strictly stronger hypothesis.
[remark: Necessity of the Whitney Condition]
The Whitney condition is not merely sufficient — it is also necessary. If $\tilde{f} \in C^1(\mathbb{R}^n)$ and $(f, d) = (\tilde{f}|_K, \nabla \tilde{f}|_K)$, then for $a, b \in K$:
\begin{align*}
|R(a,b)| = \frac{|\tilde{f}(b) - \tilde{f}(a) - \nabla \tilde{f}(a) \cdot (b-a)|}{|b-a|} \to 0
\end{align*}
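uniformly for $a, b$ in any compact subset of $K$ as $|a-b| \to 0$, because $\nabla \tilde{f}$ is uniformly continuous on compact sets and, by the mean value theorem, $|R(a,b)| \le \sup_{z \in [a,b]} |\nabla \tilde{f}(z) - \nabla \tilde{f}(a)|$. Thus, for compact $K$ (or with the uniformity in the definition required only on compact subsets of $K$), the Whitney condition exactly characterises the data $(f, d)$ that arise as restrictions of $C^1$ functions.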
[/remark]
## $C^1$ Approximation of Lipschitz Functions
The Whitney extension theorem on its own does not immediately give approximation results: it extends data from a closed set but does not say how to approximate a function on all of $\mathbb{R}^n$. The bridge is Lusin's theorem applied to the Rademacher gradient. Recall from Chapter 1 that a Lipschitz function $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable $\mathcal{L}^n$-almost everywhere by Rademacher's theorem, so $\nabla f$ is defined $\mathcal{L}^n$-a.e. The difficulty is that $\nabla f$ need not be continuous anywhere: Lipschitz functions can oscillate their gradient arbitrarily as long as the overall difference quotients remain bounded. Approximating $f$ by a $C^1$ function thus requires first taming this oscillation, which is precisely what Lusin's theorem for measurable functions accomplishes.
The pattern that emerges from this discussion is a two-step recipe. First, one extracts a closed set $K$ of large measure on which $f$ is differentiable, the gradient (defined only almost everywhere a priori) is continuous by Lusin's theorem, and the first-order remainder tends to zero uniformly by Egorov's theorem applied to the difference quotients; second, one feeds the pair $(f|_K, \nabla f|_K)$ into the Whitney extension machinery to manufacture a global $C^1$ function. The interplay between these two steps is what gives the result its strength: neither Lusin's theorem alone nor Whitney's theorem alone produces approximation, but together they convert a measurable gradient into a $C^1$ function that agrees with $f$, together with its gradient, outside a set of small measure. The next theorem packages this strategy and states the resulting approximation precisely.
[quotetheorem:3091]
[citeproof:3091]
The Lipschitz approximation theorem makes a joint assertion about $f$ and $\nabla f$ simultaneously. This is stronger than asking for $g$ close to $f$ in $L^\infty$ (which is a standard mollification result) precisely because it controls the gradient as well. The measure of disagreement involves both the set $\{f \neq g\}$ and the set $\{\nabla f \neq \nabla g\}$; these two sets can differ, but both are made small.
What the theorem does not say: it does not claim $g$ approximates $f$ in $L^\infty$ on all of $\mathbb{R}^n$, nor in $W^{1,\infty}$. The set where $f$ and $g$ differ has small measure but can still contain points where $|f - g|$ is not small; since $f = g$ on the closed set $K$ where they agree, one only has the bound $|f(x) - g(x)| \le (\operatorname{Lip} f + \operatorname{Lip} g)\operatorname{dist}(x, K)$ when $g$ is Lipschitz. Achieving $\|f - g\|_{L^\infty} < \varepsilon$ simultaneously requires a more refined construction that also controls $\operatorname{dist}(x, K)$ on the exceptional set.
[example: Cantor-Type Lipschitz Function]
Let $K \subseteq [0,1]$ be a fat Cantor set, that is, a nowhere-dense closed set with $\mathcal{L}^1(K) > 0$ (constructed by removing open intervals of total length less than $1$). Define $f : [0,1] \to \mathbb{R}$ by $f(x) = \operatorname{dist}(x, K)$. This function is $1$-Lipschitz, since $x \mapsto \operatorname{dist}(x, K)$ has Lipschitz constant $1$ by the triangle inequality for distance functions. Its derivative is
\begin{align*}
f'(x) = \begin{cases} 0 & \text{for } \mathcal{L}^1\text{-a.e. } x \in K, \\ \pm 1 & x \in [0,1] \setminus K \text{ (away from gap midpoints, with sign depending on which endpoint of the gap is nearer)}, \end{cases}
\end{align*}
with $f'(x) = +1$ on the left half of each gap (points just to the right of $K$, where the distance to $K$ increases) and $f'(x) = -1$ on the right half. Since $K$ is nowhere dense, every point of $K$ is a limit of gap points where $|f'| = 1$, while $f' = 0$ at a.e. point of $K$ and $\mathcal{L}^1(K) > 0$. Consequently no representative of $f'$ is continuous, so $f$ is not $C^1$ and cannot be approximated by $C^1$ functions in $W^{1,\infty}$ (uniform approximation of $f$ alone, without its derivative, is always possible by mollification). The $C^1$ approximation theorem nonetheless produces $g \in C^1([0,1])$ with $g = f$ and $g' = f'$ outside a set of measure less than $\varepsilon$: it does this by first finding, via Lusin's theorem, a closed set $E \subseteq [0,1]$ of large measure on which $f'$ is continuous (possible since $f'$ is bounded and measurable), then using Whitney's theorem on $E$.
[/example]
## $C^1$ Approximation of Sobolev Functions
Sobolev functions $u \in W^{1,p}(\mathbb{R}^n)$ are defined by their integrability class rather than by pointwise bounds on difference quotients, so the Lusin-Whitney argument must be adapted. The role of Rademacher's theorem is replaced by the Lebesgue differentiation theorem and the characterisation of Sobolev functions via approximate differentiability. The resulting approximation is measured not in measure of a set of disagreement, but in Sobolev norm.
The shift of viewpoint is significant. For Lipschitz functions the gradient is bounded, so Lusin's theorem applied to it produces a closed set $K$ on which the gradient is uniformly continuous, and one can read off a uniform $L^\infty$ tail bound on $\mathbb{R}^n \setminus K$ for free. For Sobolev functions the weak gradient $Du$ lies only in $L^p$, and there is no uniform pointwise bound at all; the approximation must therefore measure the discrepancy in the $L^p$ norm of both $u - g$ and $Du - \nabla g$. The absolute continuity of the $L^p$ norm — the fact that $\int_E |Du|^p \to 0$ as $\mathcal{L}^n(E) \to 0$ — is the substitute for the boundedness used in the Lipschitz case, and it lets one trade smallness of the exceptional set for smallness in the Sobolev norm.
[quotetheorem:3092]
[citeproof:3092]
This result is sometimes called the Sobolev-Lusin theorem by analogy with the classical Lusin theorem, which says that a measurable function agrees with a continuous function outside a set of small measure. The Sobolev-Lusin theorem upgrades "measurable" to "$W^{1,p}$" and "continuous" to "$C^1$". The $W^{1,p}$ norm control on the approximation error is what makes this result useful for passing from smooth test functions to Sobolev functions in variational arguments.
The theorem controls the size of $g$ only through the difference $u - g$: the triangle inequality gives $\|g\|_{W^{1,p}} \le \|u\|_{W^{1,p}} + \varepsilon$, and nothing better can be expected in general; in particular there is no bound on $\|g\|_{C^1}$ in terms of $\|u\|_{W^{1,p}}$. The theorem also does not assert that $g = u$ at every point of $K$; the identity holds $\mathcal{L}^n$-a.e. on $K$, which is all that is meaningful since $u$ is defined only up to sets of measure zero.
[example: Sobolev Function with Singular Gradient]
Let $n \geq 2$, fix $p \in [1, n)$, and choose $\alpha = 1 - n/p + \varepsilon$ with $0 < \varepsilon < n/p$ and $\alpha \neq 0$, so that $1 - n/p < \alpha < 1$ (for $\varepsilon > n/p - 1$ one has $\alpha \in (0,1)$ and $u$ is bounded; for smaller $\varepsilon$ the function $u$ itself is unbounded at the origin). Define $u(x) = |x|^\alpha$ on the unit ball $B(0,1) \subseteq \mathbb{R}^n$. The weak gradient is $\nabla u(x) = \alpha |x|^{\alpha - 1} \hat{x}$ for $x \neq 0$, where $\hat{x} = x/|x|$. Since $\alpha - 1 = -n/p + \varepsilon < 0$, the gradient blows up at the origin. To check that $u \in W^{1,p}(B(0,1))$, switch to polar coordinates, writing $n\omega_n = \mathcal{H}^{n-1}(\partial B(0,1))$ for the surface area of the unit sphere:
\begin{align*}
\|\nabla u\|_{L^p}^p = |\alpha|^p \int_{B(0,1)} |x|^{(\alpha-1)p}\, d\mathcal{L}^n = |\alpha|^p\, n\omega_n \int_0^1 r^{(\alpha-1)p + n - 1}\, dr,
\end{align*}
and the exponent is $(\alpha - 1)p + n - 1 = (-n/p + \varepsilon)p + n - 1 = \varepsilon p - 1 > -1$ by the choice of $\alpha$, so the integral converges; the same computation with exponent $\alpha p + n - 1 > \varepsilon p - 1$ shows $u \in L^p$, hence $u \in W^{1,p}$. At the origin, however, $|\nabla u(x)| \to +\infty$ as $x \to 0$, and there is no $C^1$ function agreeing with $u$ in a neighbourhood of $0$. The $C^1$ approximation theorem produces $g \in C^1(B(0,1))$ with $\|u - g\|_{W^{1,p}} < \eta$ for any prescribed $\eta > 0$ by working away from the singularity: Lusin's theorem finds a large closed set $K \subseteq B(0,1)$ excluding a small neighbourhood of $0$ on which $\nabla u$ is continuous, and Whitney's theorem extends from $K$.
[/example]
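The integrability threshold in the example above can be checked with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, it treats just the radial integral $\int_0^1 r^{(\alpha-1)p+n-1}\,dr$, and the midpoint rule is used only in a case where the integrand is bounded.

```python
import math

def radial_exponent(n: int, p: float, alpha: float) -> float:
    """Exponent of r in the polar-coordinate integrand for |grad u|^p, u = |x|^alpha."""
    return (alpha - 1.0) * p + n - 1.0

def radial_integral(n: int, p: float, alpha: float) -> float:
    """Closed form of int_0^1 r^e dr; the integral diverges when e <= -1."""
    e = radial_exponent(n, p, alpha)
    return 1.0 / (e + 1.0) if e > -1.0 else math.inf

def midpoint_integral(e: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of int_0^1 r^e dr (intended for e >= 0)."""
    h = 1.0 / steps
    return h * sum(((k + 0.5) * h) ** e for k in range(steps))

# Sample parameters (an assumption for illustration): n = 3, p = 2, alpha = 1/2.
n, p, alpha = 3, 2.0, 0.5
eps = alpha - 1.0 + n / p          # the epsilon of the example, here equal to 1
print("exponent eps*p - 1 =", eps * p - 1.0)
print("closed form        =", radial_integral(n, p, alpha))
print("midpoint rule      =", midpoint_integral(radial_exponent(n, p, alpha)))
# At the borderline alpha = 1 - n/p the radial integral diverges.
print("borderline case    =", radial_integral(n, p, 1.0 - n / p))
```

The closed form $1/(\varepsilon p)$ makes the role of $\varepsilon > 0$ explicit: the integral is finite exactly when $\alpha > 1 - n/p$.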
## $C^1$ Approximation of BV Functions
Functions of bounded variation are one step further from smooth: they need not lie in any $W^{1,p}$ space, and their distributional derivative $Du$ is a vector-valued Radon measure rather than an $L^p$ function. The approximation theorem for BV functions therefore measures the quality of approximation using the total variation of the difference of the derivative measures, rather than any $L^p$ norm.
[quotetheorem:3093]
[citeproof:3093]
The measure $Du$ for a BV function is a Radon measure and can have a singular part (concentrated on sets of $\mathcal{L}^n$-measure zero, such as jump sets). The approximation $g$ is $C^1$, so its derivative measure is $\nabla g \cdot \mathcal{L}^n$ (absolutely continuous). The approximation theorem says the total variation of the difference is small, which forces the singular part of $Du$ to be small in total variation as well. The full theory of BV functions and their fine structure belongs to GMT III; the approximation result stated here is a forward reference to that theory.
This approximation is weaker than one might hope: the theorem does not give $\|u - g\|_{L^\infty} < \varepsilon$ nor $\|u - g\|_{L^1} < \varepsilon$ in general. The $BV$ topology is coarser than $L^\infty$ but finer than $L^1$, and the correct sense of approximation is in the "strict" or "area-strict" topology of BV, which involves both $L^1$ convergence and total variation convergence. The statement above captures the essential content: $g$ and $u$ agree outside a small set, and the derivative measures nearly coincide.
One might ask whether the $L^1$ distance $\|u - g\|_{L^1}$ can be made small simultaneously. In general this requires additional control: if $u$ has a large jump discontinuity on a hypersurface $\Sigma$, the singular part of $Du$ concentrated on $\Sigma$ has total variation equal to $\int_\Sigma |u^+ - u^-|\, d\mathcal{H}^{n-1}$, and making this small in total variation forces the jump $|u^+ - u^-|$ to be small $\mathcal{H}^{n-1}$-a.e. on $\Sigma$, which is a substantive geometric constraint on $u$ that the approximation theorem alone cannot impose.
## The Lusin Property as a Unifying Theme
The three approximation theorems — for Lipschitz, Sobolev, and BV functions — share a common structure that merits recognition. Each says that functions from a certain regularity class agree with $C^1$ functions outside sets of arbitrarily small measure. This is precisely the spirit of Lusin's theorem from classical measure theory, which asserts that a measurable function $f: \mathbb{R}^n \to \mathbb{R}$ agrees with a continuous function outside a set of small measure (with no regularity assumption on $f$ at all). In GMT I, Lusin's theorem was stated for measurable functions; Chapter 2 of that course established it as a fundamental tool. The results of this chapter show that Lipschitz, Sobolev, and BV functions satisfy a strictly stronger version: not just $C^0$ approximation, but $C^1$ approximation.
This upgrade from $C^0$ to $C^1$ in the Lusin property reflects the fact that all three function classes carry gradient information in some form: Lipschitz functions have a bounded a.e.-gradient by Rademacher, Sobolev functions have an $L^p$ weak gradient, and BV functions have a gradient that is a Radon measure. The Whitney extension theorem is the tool that converts the Lusin output (gradient continuous outside a small set) into a global $C^1$ function. The chain of reasoning is the same in each case:
1. Identify the appropriate notion of gradient (Rademacher gradient, weak gradient, or BV derivative).
2. Apply Lusin's theorem (for the appropriate function class) to find a closed set $K$ of large measure on which the gradient is continuous.
3. Verify the Whitney $C^1$ condition on $K$.
4. Apply Whitney's extension theorem to produce the desired $g \in C^1(\mathbb{R}^n)$.
The differences between the three cases lie entirely in step 2 and the metric used to measure approximation quality in the final result.
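Step 3 of the chain above can be made concrete on sampled data. The sketch below assumes the usual formulation of the Whitney $C^1$ condition on a compact set $K \subseteq \mathbb{R}$: with candidate derivative $d$, the quotient $|f(y) - f(x) - d(x)(y - x)| / |y - x|$ must tend to $0$ uniformly as $|x - y| \to 0$. The function `whitney_modulus` and the two test jets are hypothetical illustrations, and a finite sample can only suggest, not prove, that the condition holds.

```python
def whitney_modulus(points, f, d, delta):
    """Largest remainder quotient over sample pairs with 0 < |x - y| <= delta."""
    worst = 0.0
    for x in points:
        for y in points:
            gap = abs(y - x)
            if 0.0 < gap <= delta:
                remainder = abs(f(y) - f(x) - d(x) * (y - x))
                worst = max(worst, remainder / gap)
    return worst

# Sample of K = [-1, 1]; the grid contains the point 0 exactly.
points = [k / 200 for k in range(-200, 201)]

# A compatible jet: f(x) = x|x| with derivative d(x) = 2|x|.
def smooth_f(x):
    return x * abs(x)

def smooth_d(x):
    return 2.0 * abs(x)

# An incompatible jet: f(x) = |x| with d(x) = sign(x), d(0) = 0.
def kink_f(x):
    return abs(x)

def kink_d(x):
    return (x > 0) - (x < 0)

for delta in (0.1, 0.05, 0.01):
    print(delta,
          whitney_modulus(points, smooth_f, smooth_d, delta),
          whitney_modulus(points, kink_f, kink_d, delta))
```

For the first jet the printed modulus shrinks with $\delta$; for the second it stays bounded away from zero because of the pairs that include $x = 0$, so no $C^1$ extension can realize that jet.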
[remark: Lusin's Theorem in GMT I]
The classical Lusin theorem (see [Geometric Measure Theory I: Measures and Hausdorff Dimension](/page/Geometric%20Measure%20Theory%20I%3A%20Measures%20and%20Hausdorff%20Dimension), Chapter 2) states: if $f: \mathbb{R}^n \to \mathbb{R}$ is measurable and finite a.e., then for each $\varepsilon > 0$ there exists a closed set $F$ with $\mathcal{L}^n(\mathbb{R}^n \setminus F) < \varepsilon$ such that $f|_F$ is continuous. The $C^1$ approximation theorems of this chapter are the Sobolev and Lipschitz analogues: the function class determines how much gradient regularity Lusin's theorem yields on a closed set of large measure (continuity of $\nabla f$ for Lipschitz $f$, of the weak gradient $Du$ for Sobolev $u$, of $\nabla u_\delta$ for mollified BV $u_\delta$), and Whitney converts that regularity into a global smooth extension.
[/remark]
The Lusin property also bears directly on the area and coarea formulas established in Chapters 3 and 4. Those formulas were proved first for $C^1$ maps and then extended to Lipschitz maps using Rademacher's theorem. The $C^1$ approximation theorem provides an alternative route: apply the theorem componentwise to a Lipschitz map $f$ to produce a $C^1$ map $g$ with $\mathcal{L}^n(\{f \neq g\} \cup \{\nabla f \neq \nabla g\}) < \varepsilon$, then bound the difference of the area or coarea integrals by $\|Jf - Jg\|_{L^\infty} \cdot \varepsilon$, and let $\varepsilon \to 0$. The norm $\|Jf - Jg\|_{L^\infty}$ is finite because $Jf$ is bounded in terms of $\operatorname{Lip}(f)$ and, as the Whitney construction permits, $g$ can be chosen with $\operatorname{Lip}(g) \le C \operatorname{Lip}(f)$. This approach shows that the Lusin-Whitney machinery is not merely a technical appendage to GMT II, but is in direct dialogue with its central theorems.
[example: Area Formula via $C^1$ Approximation]
Let $f: \mathbb{R}^m \to \mathbb{R}^n$ be Lipschitz with $m \leq n$, and let $A \subseteq \mathbb{R}^m$ be $\mathcal{L}^m$-measurable. For each $\varepsilon > 0$, apply the $C^1$ approximation theorem to each component $f_i$ to obtain $g \in C^1(\mathbb{R}^m; \mathbb{R}^n)$ with $\mathcal{L}^m(E_\varepsilon) < \varepsilon$, where $E_\varepsilon = \{f \neq g\} \cup \{\nabla f \neq \nabla g\}$. The area formula
\begin{align*}
\int_{A} J g \, d\mathcal{L}^m = \int_{\mathbb{R}^n} \mathcal{H}^0(g^{-1}(y) \cap A) \, d\mathcal{H}^m(y)
\end{align*}
holds for $g$ and every measurable $A$ by the $C^1$ area formula (Chapter 3). Applied to $A \setminus E_\varepsilon$, where $f = g$ and $Jf = Jg$, it yields the area formula for $f$ on $A \setminus E_\varepsilon$. Since $Jf$ is bounded ($f$ is Lipschitz) and $\mathcal{L}^m(E_\varepsilon) < \varepsilon$, the contribution of $A \cap E_\varepsilon$ to either side is at most $C\varepsilon$ (where $C$ depends on $\operatorname{Lip}(f)$ and $m$). Passing $\varepsilon \to 0$ recovers the area formula for $f$ on $A$. This argument is an alternative to the direct Rademacher-based proof given in Chapter 3, and it makes the role of $C^1$ approximation in the area formula explicit.
[/example]
Together, the results of this chapter complete the technical toolkit of GMT II. Whitney's extension theorem and the three approximation results constitute the precise sense in which Lipschitz, Sobolev, and BV functions are "almost $C^1$": they agree with smooth functions outside sets whose measure can be made arbitrarily small, and the approximating smooth functions can be chosen to respect not only the function values but also the gradient data. This "Lusin property at the $C^1$ level" is both the culmination of the differentiability theory developed across Chapters 5–7 and the foundation for the fine structure results that are taken up in GMT III.
The extension problem reverses the differentiability question: given approximate (or classical) derivative information on an arbitrary closed set, subject to Whitney's compatibility condition, Whitney's theorem guarantees a global $C^1$ function that realizes this data. This extension property at the $C^1$ level ties together the theoretical machinery of the preceding chapters.
# 9. Examples and Worked Problems
This chapter is a consolidation chapter. Throughout Chapters 1–8 we built the theoretical apparatus of GMT II: Rademacher's theorem (Ch. 1) tells us that every Lipschitz map is differentiable almost everywhere; the area formula (Ch. 3) and coarea formula (Ch. 4) convert Lebesgue integrals over the source into Hausdorff integrals over images and level sets, respectively; Alexandrov's theorem (Ch. 7) extends second-order differentiability to all convex functions up to a measure-zero exceptional set; and Whitney's extension theorem (Ch. 8) shows how a jet defined on a closed set can be smoothly extended to all of $\mathbb{R}^n$. The results are powerful precisely because they apply to functions that need not be $C^1$ in the classical sense, but their content is easiest to see when they are applied to specific functions where one can check every estimate by hand.
The purpose of this chapter is to make those formulas tangible. We work through six families of explicit computations. Each example is chosen not merely to illustrate a formula but to expose a structural feature: where the Jacobian determinant concentrates, how the coarea slice degenerates at critical values, what Alexandrov's second derivative looks like for a function that is genuinely not smooth at a set of measure zero, and what the Whitney extension procedure actually produces on a concrete closed set. No result here is new; every computation appeals directly to theorems established in earlier chapters.
## Surface Area of a Parametrized Surface via the Area Formula
The area formula in its general form states that for $f: \mathbb{R}^m \to \mathbb{R}^n$ Lipschitz with $m \le n$,
\begin{align*}
\int_A J_m f(x)\, d\mathcal{L}^m(x) = \int_{\mathbb{R}^n} \#(f^{-1}(y) \cap A)\, d\mathcal{H}^m(y),
\end{align*}
where $J_m f(x) = \sqrt{\det(Df_x^\top Df_x)}$ is the $m$-dimensional Jacobian. The formula says that the integral of the Jacobian over the parameter domain equals the $\mathcal{H}^m$-measure of the image, counted with multiplicity. For injective parametrizations, the multiplicity is identically $1$, and the formula reduces to a change-of-variables identity for surface area. The first worked example shows this for the unit sphere.
[example: Area of the Unit Sphere via the Area Formula]
We parametrize the unit sphere $S^2 = \{y \in \mathbb{R}^3 : |y| = 1\}$ by
\begin{align*}
f: (0, \pi) \times (0, 2\pi) \to \mathbb{R}^3, \qquad f(\theta, \varphi) = (\sin\theta \cos\varphi,\, \sin\theta \sin\varphi,\, \cos\theta).
\end{align*}
The domain $A = (0, \pi) \times (0, 2\pi)$ is an open rectangle in $\mathbb{R}^2$, and $f$ is $C^\infty$ on $A$. The map $f$ is injective on $A$. Injectivity fails only on the boundary of the rectangle, where $\theta \in \{0, \pi\}$ collapses to a pole and $\varphi = 0$, $\varphi = 2\pi$ parametrize the same meridian. Since $\partial A$ has $\mathcal{L}^2$-measure zero, nothing is lost by working on the open rectangle.
**Step 1: Compute the partial derivatives.** Differentiating coordinate by coordinate,
\begin{align*}
\frac{\partial f}{\partial \theta} &= (\cos\theta \cos\varphi,\, \cos\theta \sin\varphi,\, -\sin\theta), \\
\frac{\partial f}{\partial \varphi} &= (-\sin\theta \sin\varphi,\, \sin\theta \cos\varphi,\, 0).
\end{align*}
**Step 2: Compute the Gram matrix $Df^\top Df$.** The derivative $Df_{(\theta,\varphi)} \in \mathbb{R}^{3 \times 2}$ has $\partial_\theta f$ and $\partial_\varphi f$ as its columns. The Gram matrix $G = Df^\top Df \in \mathbb{R}^{2 \times 2}$ has entries:
\begin{align*}
G_{11} &= \left|\frac{\partial f}{\partial \theta}\right|^2 = \cos^2\theta\cos^2\varphi + \cos^2\theta\sin^2\varphi + \sin^2\theta = \cos^2\theta + \sin^2\theta = 1, \\
G_{22} &= \left|\frac{\partial f}{\partial \varphi}\right|^2 = \sin^2\theta\sin^2\varphi + \sin^2\theta\cos^2\varphi + 0 = \sin^2\theta, \\
G_{12} &= \frac{\partial f}{\partial \theta} \cdot \frac{\partial f}{\partial \varphi} = -\cos\theta\cos\varphi\sin\theta\sin\varphi + \cos\theta\sin\varphi\sin\theta\cos\varphi + 0 = 0.
\end{align*}
So $G = \operatorname{diag}(1, \sin^2\theta)$, hence $\det G = \sin^2\theta$.
**Step 3: Compute the 2-dimensional Jacobian.** By definition, $J_2 f(\theta, \varphi) = \sqrt{\det G} = |\sin\theta|$. Since $\theta \in (0, \pi)$, we have $\sin\theta > 0$, so $J_2 f = \sin\theta$.
**Step 4: Apply the area formula.** Since $f$ is injective on $A$, the multiplicity function $\#(f^{-1}(y) \cap A)$ equals $1$ for every $y \in f(A)$ and $0$ otherwise. The image $f(A)$ is $S^2$ with the closed half-meridian $\{(\sin\theta, 0, \cos\theta) : 0 \le \theta \le \pi\}$ removed; this arc, which contains both poles, has $\mathcal{H}^2$-measure zero. The area formula gives
\begin{align*}
\mathcal{H}^2(S^2) &= \int_{(0,\pi)\times(0,2\pi)} J_2 f(\theta, \varphi)\, d\mathcal{L}^2(\theta, \varphi) = \int_0^{2\pi} \int_0^\pi \sin\theta\, d\theta\, d\varphi.
\end{align*}
Evaluating: $\int_0^\pi \sin\theta\, d\theta = [-\cos\theta]_0^\pi = -\cos\pi + \cos 0 = 1 + 1 = 2$, and then $\int_0^{2\pi} 2\, d\varphi = 4\pi$. Therefore $\mathcal{H}^2(S^2) = 4\pi$.
**Step 5: Generalization to a graph.** Suppose instead $f(u,v) = (u, v, h(u,v))$ for $h \in C^1(U)$ on an open set $U \subseteq \mathbb{R}^2$. Then $\partial_u f = (1, 0, \partial_u h)$ and $\partial_v f = (0, 1, \partial_v h)$, so
\begin{align*}
G = \begin{pmatrix} 1 + (\partial_u h)^2 & \partial_u h \, \partial_v h \\ \partial_u h \, \partial_v h & 1 + (\partial_v h)^2 \end{pmatrix},
\end{align*}
giving $\det G = 1 + |\nabla h|^2$, hence $J_2 f = \sqrt{1 + |\nabla h|^2}$. The area formula then recovers the classical surface area formula $\mathcal{H}^2(\Gamma_h) = \int_U \sqrt{1 + |\nabla h|^2}\, d\mathcal{L}^2$ for the graph $\Gamma_h = \{(u, v, h(u,v))\}$.
[/example]
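The sphere computation above can be reproduced numerically without using the closed form $J_2 f = \sin\theta$. The following Python sketch (function names are hypothetical; step sizes and grid resolution are arbitrary choices) builds the Gram matrix from central finite differences and integrates the resulting Jacobian with a midpoint rule over the parameter rectangle.

```python
import math

def f(theta, phi):
    """Spherical parametrization of the unit sphere."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def jacobian_2d(theta, phi, h=1e-6):
    """sqrt(det(Df^T Df)) with partial derivatives from central differences."""
    d_theta = [(a - b) / (2 * h) for a, b in zip(f(theta + h, phi), f(theta - h, phi))]
    d_phi = [(a - b) / (2 * h) for a, b in zip(f(theta, phi + h), f(theta, phi - h))]
    g11 = sum(a * a for a in d_theta)
    g22 = sum(a * a for a in d_phi)
    g12 = sum(a * b for a, b in zip(d_theta, d_phi))
    return math.sqrt(max(g11 * g22 - g12 * g12, 0.0))

def sphere_area(n_theta=200, n_phi=200):
    """Midpoint-rule integral of the Jacobian over (0, pi) x (0, 2*pi)."""
    h_theta = math.pi / n_theta
    h_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * h_theta
        for j in range(n_phi):
            phi = (j + 0.5) * h_phi
            total += jacobian_2d(theta, phi)
    return total * h_theta * h_phi

print("numerical Jacobian at theta = 1:", jacobian_2d(1.0, 0.3), "vs sin(1) =", math.sin(1.0))
print("numerical area:", sphere_area(), "vs 4*pi =", 4.0 * math.pi)
```

Because the midpoints never touch $\theta = 0$ or $\theta = \pi$, the quadrature only ever evaluates $f$ inside the open rectangle $A$, mirroring the hypotheses of the example.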
The sphere computation illustrates the most favorable situation: an injective parametrization with a smooth Jacobian. What happens when the parametrization fails to be injective on a large set, or when the Jacobian vanishes? The area formula still applies, but the multiplicity function takes values greater than one. This issue does not arise for the sphere, but it is central to understanding what the formula actually counts — the $\mathcal{H}^m$-measure of the image is the integral of the Jacobian only when the map is $\mathcal{L}^m$-a.e. injective. For non-injective maps, one must either correct for multiplicity or restrict to a fundamental domain.
The next two sections turn from maps $\mathbb{R}^m \to \mathbb{R}^n$ with $m \le n$ (area formula territory) to maps $\mathbb{R}^n \to \mathbb{R}$ with scalar range (coarea formula territory). The coarea formula applies whenever the source dimension is at least the target dimension, and its role in polar coordinate decompositions is one of the cleanest illustrations of its power.
## Polar Coordinates via the Coarea Formula
The coarea formula for a Lipschitz function $f: \mathbb{R}^n \to \mathbb{R}^k$ with $n \ge k$ states that for any measurable $g \ge 0$,
\begin{align*}
\int_{\mathbb{R}^n} g(x)\, J_k f(x)\, d\mathcal{L}^n(x) = \int_{\mathbb{R}^k} \int_{f^{-1}(y)} g(x)\, d\mathcal{H}^{n-k}(x)\, d\mathcal{L}^k(y),
\end{align*}
where $J_k f(x) = \sqrt{\det(Df_x Df_x^\top)}$ is the $k$-dimensional Jacobian. When $k = 1$, this simplifies considerably: the $1$-dimensional Jacobian is $|\nabla f|$ (the norm of the gradient), and the formula becomes the coarea formula
\begin{align*}
\int_{\mathbb{R}^n} g(x)\, |\nabla f(x)|\, d\mathcal{L}^n(x) = \int_{-\infty}^\infty \int_{\{f = t\}} g(x)\, d\mathcal{H}^{n-1}(x)\, dt.
\end{align*}
Applying this to the radial function $f(x) = |x|$ recovers the polar coordinate decomposition formula with an explicit verification of every hypothesis.
[example: Polar Coordinate Decomposition from the Coarea Formula]
Let $f: \mathbb{R}^n \to \mathbb{R}$ be defined by $f(x) = |x| = \sqrt{x_1^2 + \cdots + x_n^2}$. We apply the coarea formula with $k = 1$.
**Checking the hypotheses.** The function $f$ is Lipschitz with constant $1$: $||x| - |y|| \le |x - y|$ by the triangle inequality. By Rademacher's theorem (Ch. 1), $f$ is differentiable $\mathcal{L}^n$-almost everywhere. The only point of non-differentiability is $x = 0$, and $\{0\}$ is an $\mathcal{L}^n$-null set, so this is consistent with Rademacher.
**Computing the gradient.** For $x \ne 0$, the chain rule gives $\nabla f(x) = x / |x|$. Therefore $|\nabla f(x)| = |x/|x|| = 1$ for all $x \ne 0$.
**Identifying the level sets.** For $t > 0$, the level set $f^{-1}(t) = \{x \in \mathbb{R}^n : |x| = t\} = \partial B(0, t)$, the sphere of radius $t$. For $t < 0$, the level set is empty, and for $t = 0$ it is the single point $\{0\}$; neither affects the $t$-integral. The $\mathcal{H}^{n-1}$-measure of $\partial B(0, t)$ is $\mathcal{H}^{n-1}(S^{n-1}) \cdot t^{n-1}$, where $S^{n-1}$ denotes the unit sphere and $\mathcal{H}^{n-1}(S^{n-1}) = n \omega_n$ with $\omega_n$ the volume of the unit ball $B(0, 1)$.
**Applying the coarea formula.** For any measurable $g \ge 0$,
\begin{align*}
\int_{\mathbb{R}^n} g(x)\, |\nabla f(x)|\, d\mathcal{L}^n(x) &= \int_{\mathbb{R}^n} g(x) \cdot 1\, d\mathcal{L}^n(x) \quad \text{(since } |\nabla f| = 1 \text{ a.e.)}\\
&= \int_0^\infty \int_{\partial B(0, t)} g(x)\, d\mathcal{H}^{n-1}(x)\, dt.
\end{align*}
Since $|\nabla f| = 1$ a.e., the weight on the left side drops out, and we obtain the polar coordinate formula:
\begin{align*}
\int_{\mathbb{R}^n} g(x)\, d\mathcal{L}^n(x) = \int_0^\infty \int_{\partial B(0, t)} g(x)\, d\mathcal{H}^{n-1}(x)\, dt.
\end{align*}
**Specializing to $g = \mathbb{1}_{B(0, R)}$.** Taking $g = \mathbb{1}_{B(0, R)}$, the left side is $\mathcal{L}^n(B(0, R)) = \omega_n R^n$. The right side becomes
\begin{align*}
\int_0^R \mathcal{H}^{n-1}(\partial B(0, t))\, dt = \int_0^R n \omega_n t^{n-1}\, dt = n\omega_n \cdot \frac{R^n}{n} = \omega_n R^n.
\end{align*}
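The two sides agree, confirming the formula. Read in the other direction, the computation recovers the standard surface area formula for spheres: the scaling identity $\mathcal{H}^{n-1}(\partial B(0,t)) = t^{n-1}\, \mathcal{H}^{n-1}(S^{n-1})$ together with $\omega_n R^n = \int_0^R \mathcal{H}^{n-1}(\partial B(0,t))\, dt$ for all $R > 0$ forces $\mathcal{H}^{n-1}(S^{n-1}) = n\omega_n$.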
[/example]
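The slice identity for balls is easy to check numerically in several dimensions. The Python sketch below (function names are hypothetical) uses the standard formula $\omega_n = \pi^{n/2} / \Gamma(n/2 + 1)$ for the volume of the unit ball and integrates the sphere areas $n\omega_n t^{n-1}$ with a midpoint rule.

```python
import math

def unit_ball_volume(n: int) -> float:
    """omega_n = pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def ball_volume_by_slices(n: int, radius: float, steps: int = 10_000) -> float:
    """Midpoint-rule value of int_0^R n * omega_n * t^(n-1) dt."""
    omega = unit_ball_volume(n)
    h = radius / steps
    return h * sum(n * omega * ((k + 0.5) * h) ** (n - 1) for k in range(steps))

for n in range(1, 6):
    direct = unit_ball_volume(n) * 2.0 ** n
    sliced = ball_volume_by_slices(n, 2.0)
    print(n, direct, sliced)
```

The two columns agree up to quadrature error, which is the content of the specialization to $g = \mathbb{1}_{B(0,R)}$ with $R = 2$.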
The polar coordinate example succeeds because $|\nabla f| = 1$ almost everywhere, so the coarea formula reduces to a pure slice decomposition with no Jacobian weight. This is the special case where the function is a "distance function to a point." The next example probes what happens when we replace the reference point by a more complicated set $K$, and whether the same Eikonal equation $|\nabla d| = 1$ continues to hold.
The general distance function $d_K(x) = \operatorname{dist}(x, K)$ for a closed set $K \subseteq \mathbb{R}^n$ is $1$-Lipschitz everywhere (since $|d_K(x) - d_K(y)| \le |x - y|$), but differentiability is more subtle. The key structural fact, verified in the next example, is that $|\nabla d_K| = 1$ holds $\mathcal{L}^n$-almost everywhere outside $K$. The gradient of any $1$-Lipschitz function has norm at most $1$ wherever it exists; for $d_K$ the norm is exactly $1$ at every point of differentiability outside $K$. Points with more than one nearest point in $K$ (the "skeleton") are points of non-differentiability, so they form a set of measure zero.
## The Coarea Formula for a Distance Function
The distance function to a closed set is one of the most natural examples of a Lipschitz function in analysis. Because $|\nabla d_K| = 1$ holds almost everywhere off $K$, the coarea formula gives a decomposition of the complement of $K$ into the level sets $\{d_K = t\}$, $t > 0$. Each level set contains the boundary $\partial K_t$ of the closed $t$-neighborhood $K_t = \{x : d_K(x) \le t\}$ of $K$.
[example: Coarea Formula for the Distance Function to a Closed Set]
Let $K \subseteq \mathbb{R}^n$ be a nonempty closed set, and let $d_K: \mathbb{R}^n \to [0, \infty)$ be defined by $d_K(x) = \operatorname{dist}(x, K) = \inf_{y \in K} |x - y|$. We show that $|\nabla d_K(x)| = 1$ for $\mathcal{L}^n$-a.e. $x \notin K$, and then apply the coarea formula.
**Step 1: $d_K$ is $1$-Lipschitz.** For any $x, y \in \mathbb{R}^n$ and any $k \in K$, $d_K(x) \le |x - k| \le |x - y| + |y - k|$. Taking the infimum over $k \in K$ gives $d_K(x) \le |x - y| + d_K(y)$. By symmetry, $d_K(y) \le |x - y| + d_K(x)$, so $|d_K(x) - d_K(y)| \le |x - y|$. Hence $d_K$ is $1$-Lipschitz.
**Step 2: Gradient has norm at most $1$ wherever it exists.** By Rademacher's theorem, $d_K$ is differentiable $\mathcal{L}^n$-a.e. At any point of differentiability $x$, the gradient satisfies $|\nabla d_K(x)| \le \operatorname{Lip}(d_K) = 1$, because the Lipschitz constant is an upper bound for the magnitude of the derivative.
**Step 3: Gradient has norm $1$ a.e. off $K$.** Let $x \notin K$ be a point of differentiability, let $y \in K$ be a nearest point to $x$ (one exists because $K$ is closed), and set $\nu = (x - y)/|x - y|$. For $0 \le s \le d_K(x)$, the point $x - s\nu$ lies on the segment from $x$ to $y$, so $d_K(x - s\nu) \le |x - s\nu - y| = d_K(x) - s$, while the $1$-Lipschitz bound gives $d_K(x - s\nu) \ge d_K(x) - s$. Hence $d_K(x - s\nu) = d_K(x) - s$, and differentiating at $s = 0$ yields $\nabla d_K(x) \cdot \nu = 1$. Combined with $|\nabla d_K(x)| \le 1$, this forces $\nabla d_K(x) = \nu$, so $|\nabla d_K(x)| = 1$. In particular the nearest point $y = x - d_K(x)\, \nabla d_K(x)$ is unique at every point of differentiability, so the skeleton of $K$ (the set where the nearest point is not unique) is contained in the non-differentiability set and has $\mathcal{L}^n$-measure zero by Rademacher's theorem.
**Step 4: Apply the coarea formula.** For any measurable $g \ge 0$ and any $R > 0$, applying the coarea formula to $f = d_K$ on $\{d_K < R\} \setminus K$:
\begin{align*}
\int_{\{d_K < R\} \setminus K} g(x)\, d\mathcal{L}^n(x) &= \int_{\{d_K < R\} \setminus K} g(x) \cdot |\nabla d_K(x)|\, d\mathcal{L}^n(x) \quad \text{(since }|\nabla d_K| = 1 \text{ a.e.)}\\
&= \int_0^R \int_{\{d_K = t\}} g(x)\, d\mathcal{H}^{n-1}(x)\, dt.
\end{align*}
Taking $g \equiv 1$, we obtain the volume formula
\begin{align*}
\mathcal{L}^n(\{0 < d_K < R\}) = \int_0^R \mathcal{H}^{n-1}(\{d_K = t\})\, dt.
\end{align*}
This expresses the volume of the open $R$-neighborhood of $K$, with $K$ itself removed (note that $K$ may have positive measure), as the integral over $t \in (0, R)$ of the $\mathcal{H}^{n-1}$-measures of the level sets $\{d_K = t\}$.
**Concrete case: $K = \{0\}$.** The distance function is $d_K(x) = |x|$, and $\{d_K = t\} = \partial B(0,t)$. The formula recovers $\mathcal{L}^n(B(0,R)) = \int_0^R \mathcal{H}^{n-1}(\partial B(0,t))\, dt = \int_0^R n\omega_n t^{n-1}\, dt = \omega_n R^n$, consistent with the previous example.
**Concrete case: $K = \overline{B}(0,1)$.** Here $d_K(x) = \max(|x| - 1, 0)$ and $\{d_K = t\} = \partial B(0, 1+t)$ for $t > 0$. The formula gives $\mathcal{L}^n(B(0,1+R)) - \mathcal{L}^n(B(0,1)) = \int_0^R n\omega_n(1+t)^{n-1}\, dt$, which integrates to $\omega_n[(1+R)^n - 1]$. This matches $\mathcal{L}^n(B(0,1+R)) - \mathcal{L}^n(B(0,1)) = \omega_n[(1+R)^n - 1]$ by direct computation, confirming the formula.
[/example]
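A case that is not radially symmetric makes the volume formula less tautological. The Python sketch below is an illustration under stated assumptions: $K$ is taken to be the segment $[0, L] \times \{0\}$ in $\mathbb{R}^2$ (a choice made here, not in the text), whose level sets $\{d_K = t\}$ are stadium curves of length $2L + 2\pi t$. The slice integral is compared with a direct grid estimate of $\mathcal{L}^2(\{0 < d_K < R\})$; all function names are hypothetical.

```python
import math

def dist_to_segment(x, y, length=1.0):
    """Distance from (x, y) to the segment [0, length] x {0}."""
    nearest_x = min(max(x, 0.0), length)
    return math.hypot(x - nearest_x, y)

def area_by_grid(R, length=1.0, cells_per_unit=200):
    """Estimate L^2({0 < d_K < R}) by counting midpoints of small grid cells."""
    h = 1.0 / cells_per_unit
    nx = round((length + 2 * R) * cells_per_unit)
    ny = round(2 * R * cells_per_unit)
    count = 0
    for i in range(nx):
        x = -R + (i + 0.5) * h
        for j in range(ny):
            y = -R + (j + 0.5) * h
            if 0.0 < dist_to_segment(x, y, length) < R:
                count += 1
    return count * h * h

def area_by_slices(R, length=1.0, steps=1000):
    """Midpoint-rule value of int_0^R H^1({d_K = t}) dt with H^1 = 2L + 2*pi*t."""
    h = R / steps
    return h * sum(2.0 * length + 2.0 * math.pi * (k + 0.5) * h for k in range(steps))

R = 0.5
print("grid estimate :", area_by_grid(R))
print("slice integral:", area_by_slices(R))
print("closed form   :", 2.0 * 1.0 * R + math.pi * R * R)
```

Here $K$ is $\mathcal{L}^2$-null, so removing it does not change the area; for a set such as $\overline{B}(0,1)$ the exclusion of $K$ matters.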
The distance function examples illustrate that the coarea formula is most transparent when the Jacobian equals $1$ almost everywhere off the source set. In such cases, the formula is purely a statement about slice measures, and one can verify it directly by computing both sides. The more interesting situation arises when the gradient vanishes on a set of positive measure, for instance on a region where the function is constant. The formula still holds: such a set contributes nothing to the left side, and correspondingly, for $\mathcal{L}^1$-a.e. $t$ the critical points lying on the level set $\{f = t\}$ form an $\mathcal{H}^{n-1}$-null set.
We now turn to Rademacher's theorem in a setting where the exceptional non-differentiability set is geometrically meaningful, and we use it to illustrate exactly what the theorem guarantees and what it does not.
## Rademacher's Theorem for the Absolute Value and the Cantor Function
Rademacher's theorem (Ch. 1) asserts that every Lipschitz function $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable $\mathcal{L}^n$-almost everywhere. The theorem makes no claim about differentiability at specific points — only that the exceptional set has measure zero. The following examples probe the boundary of this assertion by working with functions whose singular behavior is concentrated on explicit geometric sets.
[example: Rademacher's Theorem for $f(x) = |x|$ on $\mathbb{R}^n$]
Define $f: \mathbb{R}^n \to \mathbb{R}$ by $f(x) = |x|$. This function is $1$-Lipschitz (as shown in the distance function example). We verify that Rademacher's conclusion holds and identify the precise exceptional set.
**Differentiability off the origin.** For $x \ne 0$, the function $f(x) = (x_1^2 + \cdots + x_n^2)^{1/2}$ is a composition of $C^\infty$ functions on $\mathbb{R}^n \setminus \{0\}$. By the chain rule, it is differentiable at every $x \ne 0$, with $\nabla f(x) = x/|x|$. This is a unit vector pointing radially outward.
**Non-differentiability at the origin.** At $x = 0$, suppose $f$ were differentiable with derivative $L: \mathbb{R}^n \to \mathbb{R}$. Then $f(h) = L(h) + o(|h|)$ as $h \to 0$, i.e., $|h| = L(h) + o(|h|)$. Dividing by $|h|$: $1 = L(h/|h|) + o(1)$ as $h \to 0$. Setting $h = te_i$ and $h = -te_i$ for $t > 0$ and any basis vector $e_i$, we get $L(e_i) = 1$ and $L(-e_i) = 1$ in the limit. But $L$ is linear, so $L(-e_i) = -L(e_i) = -1 \ne 1$. This contradiction shows $f$ is not differentiable at the origin.
**Consistency with Rademacher.** The exceptional set $\{0\}$ is a single point, which has $\mathcal{L}^n$-measure zero. Rademacher's theorem guarantees differentiability outside a set of measure zero, and $\{0\}$ is such a set.
**Gradient norm.** For all $x \ne 0$, $|\nabla f(x)| = |x/|x|| = 1$. This recovers the Eikonal equation $|\nabla d| = 1$ a.e. from the previous section.
[/example]
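The behaviour of $f(x) = |x|$ at and away from the origin can be observed with finite differences. The Python sketch below is illustrative only (function names and the sample point are arbitrary choices): the one-sided difference quotients at $0$ along $e_1$ and $-e_1$ are both $1$, which no linear map can reproduce, while the finite-difference gradient at a point $x \ne 0$ has norm $1$.

```python
import math

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in x))

def quotient_at_origin(direction, t):
    """(f(t * direction) - f(0)) / t for f = norm and t > 0."""
    return (norm([t * c for c in direction]) - 0.0) / t

def gradient_norm(x, h=1e-6):
    """Norm of the central-difference gradient of f = norm at x."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((norm(forward) - norm(backward)) / (2 * h))
    return norm(grad)

e1 = [1.0, 0.0, 0.0]
minus_e1 = [-1.0, 0.0, 0.0]
print("quotient along  e1:", quotient_at_origin(e1, 1e-3))
print("quotient along -e1:", quotient_at_origin(minus_e1, 1e-3))
print("gradient norm at (0.3, -0.4, 1.2):", gradient_norm([0.3, -0.4, 1.2]))
```

A linear map $L$ would have to satisfy $L(-e_1) = -L(e_1)$, so equal quotients along $\pm e_1$ with nonzero value are incompatible with differentiability at $0$.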
The absolute value is perhaps the simplest Lipschitz function with a point of non-differentiability. To see what the Lipschitz hypothesis contributes beyond almost-everywhere differentiability, we contrast it with the Cantor function: a continuous, monotone function on $[0,1]$ that is differentiable $\mathcal{L}^1$-a.e. with zero derivative, yet is not Lipschitz and violates the fundamental theorem of calculus.
[example: The Cantor Function: Differentiable Almost Everywhere but Not Lipschitz]
Let $C \subseteq [0,1]$ denote the standard middle-thirds Cantor set. Recall that $C$ is closed, has $\mathcal{L}^1(C) = 0$, and is perfect (every point of $C$ is a limit point of $C$). The complement $[0,1] \setminus C$ is a countable disjoint union of open intervals (the "gaps"), whose total length equals $1$.
We examine the Cantor function $g: [0,1] \to \mathbb{R}$, which is continuous and nondecreasing with $g(0) = 0$ and $g(1) = 1$, and which satisfies $g'(x) = 0$ at every $x \in [0,1] \setminus C$ (the gaps).
**Construction.** Define $g$ to be the Cantor function (devil's staircase): $g$ is constant on each gap of $C$, increasing from $0$ to $1$, and extends to all of $[0,1]$ by continuity. Concretely: on the middle third gap $(1/3, 2/3)$, set $g \equiv 1/2$; on $(1/9, 2/9)$, set $g \equiv 1/4$; on $(7/9, 8/9)$, set $g \equiv 3/4$; and so on by the standard construction.
**The function is not Lipschitz.** By construction $g(3^{-k}) = 2^{-k}$ for every $k \ge 1$, so the difference quotient $(g(3^{-k}) - g(0))/3^{-k} = (3/2)^k$ is unbounded as $k \to \infty$. No Lipschitz bound holds near $0$; the function $g$ is only Hölder continuous, with exponent $\log 2 / \log 3$. Rademacher's theorem therefore does not apply to $g$.
**Differentiability on gaps.** On each open gap interval $(a, b) \subseteq [0,1] \setminus C$, the function $g$ is identically constant, so $g'(x) = 0$ for all $x \in (a, b)$.
**Almost-everywhere differentiability without Rademacher.** The gaps have total measure $\mathcal{L}^1([0,1] \setminus C) = 1$, so $g$ is differentiable $\mathcal{L}^1$-a.e. with $g' = 0$ a.e. This can be read off directly from the construction, and it is also an instance of Lebesgue's theorem on the a.e. differentiability of monotone functions. The exceptional set is contained in $C$, and in fact the Cantor function is not differentiable at any point of $C$.
**Failure of the fundamental theorem of calculus.** The function $g$ increases from $0$ to $1$ while having zero derivative on the complement of a measure-zero set: $\int_0^1 g'(x)\, d\mathcal{L}^1(x) = 0 \ne 1 = g(1) - g(0)$. Hence $g$ is not absolutely continuous: its distributional derivative is a measure concentrated on the Cantor set, a set of $\mathcal{L}^1$-measure zero on which the classical derivative does not exist.
**Contrast with the Lipschitz case.** Every Lipschitz function $h$ on $[0,1]$ is absolutely continuous, so $h(1) - h(0) = \int_0^1 h'(x)\, d\mathcal{L}^1(x)$. In particular, a Lipschitz function with $h' = 0$ a.e. is constant. The behaviour of the Cantor function is exactly what the Lipschitz hypothesis excludes.
[/example]
The Cantor function marks the boundary of Rademacher's theorem (Chapter 1): almost-everywhere differentiability alone does not prevent the derivative from missing all of a function's variation, whereas for Lipschitz functions absolute continuity rules this out. This is one reason the area and coarea formulas are stated for Lipschitz maps. For the Cantor function the Jacobian $|g'|$ vanishes a.e., so $\int_0^1 |g'|\, d\mathcal{L}^1 = 0$, while the image $g([0,1]) = [0,1]$ has $\mathcal{H}^1$-measure $1$, so the area formula fails for $g$. For a Lipschitz map this cannot happen, because the Jacobian integral controls the measure of the image.
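The growth of the Cantor function's difference quotients at $0$ can be computed exactly. The Python sketch below is illustrative: `cantor` is a hypothetical helper that evaluates the Cantor function on rational inputs by reading ternary digits, using exact `Fraction` arithmetic so that no rounding enters. It confirms that $g(3^{-k}) = 2^{-k}$, so the quotient $g(3^{-k})/3^{-k}$ equals $(3/2)^k$ and is unbounded.

```python
from fractions import Fraction

def cantor(x: Fraction, depth: int = 60) -> Fraction:
    """Cantor function at a rational x in [0, 1], using up to `depth` ternary digits."""
    if x <= 0:
        return Fraction(0)
    if x >= 1:
        return Fraction(1)
    value, scale = Fraction(0), Fraction(1, 2)
    for _ in range(depth):
        x *= 3
        digit = x.numerator // x.denominator   # next ternary digit: 0, 1 or 2
        x -= digit
        if digit == 1:
            # x lies in the closure of a removed middle third: g is constant there.
            return value + scale
        if digit == 2:
            value += scale
        scale /= 2
    return value

for k in range(1, 9):
    x = Fraction(1, 3 ** k)
    quotient = cantor(x) / x
    print(k, cantor(x), float(quotient))
```

For inputs whose ternary expansion does not terminate or hit a digit $1$ within `depth` digits, the returned value is a truncation of the true value, which is an accepted limitation of this sketch.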
## Alexandrov's Theorem for Specific Convex Functions
Alexandrov's theorem (Ch. 7) asserts that every convex function $f: U \to \mathbb{R}$ on an open convex set $U \subseteq \mathbb{R}^n$ is twice differentiable $\mathcal{L}^n$-almost everywhere, in the sense that there exists a symmetric matrix $D^2 f(x)$ such that
\begin{align*}
f(x + h) = f(x) + \nabla f(x) \cdot h + \frac{1}{2} h^\top D^2 f(x)\, h + o(|h|^2) \quad \text{as } h \to 0.
\end{align*}
The result is deep: it applies to all convex functions, including those that are not $C^1$ (the gradient $\nabla f$ exists a.e. by Rademacher, but may be discontinuous). The following examples compute $D^2 f$ explicitly for two natural convex functions, one smooth and one non-smooth.
[example: Alexandrov's Theorem for $f(x) = |x|^2$]
Let $f: \mathbb{R}^n \to \mathbb{R}$ be defined by $f(x) = |x|^2 = x_1^2 + \cdots + x_n^2$. This is a polynomial and hence $C^\infty$, so the classical Hessian exists everywhere.
**The gradient.** $\nabla f(x) = (2x_1, \ldots, 2x_n) = 2x$.
**The Hessian matrix.** By direct differentiation, $\partial_{x_i} \partial_{x_j} f = 2\delta_{ij}$, so $D^2 f(x) = 2I_n$ for all $x \in \mathbb{R}^n$, where $I_n$ is the $n \times n$ identity matrix.
**Verification of the Alexandrov expansion.** For any $x, h \in \mathbb{R}^n$:
\begin{align*}
f(x + h) &= |x + h|^2 = |x|^2 + 2x \cdot h + |h|^2 = f(x) + \nabla f(x) \cdot h + h^\top I_n h,
\end{align*}
since $h^\top (2I_n) h / 2 = |h|^2$. The quadratic term $|h|^2$ is exactly $\frac{1}{2} h^\top D^2 f(x)\, h$, so the expansion holds with remainder identically zero. Alexandrov's theorem is confirmed and the Hessian is the constant matrix $2I_n$.
**Convexity check.** The function $f$ is convex because its Hessian $2I_n$ is positive semi-definite (in fact positive definite). This is consistent with the statement of Alexandrov's theorem, which requires convexity.
[/example]
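The exactness of the expansion for $f(x) = |x|^2$ can be confirmed in floating point. In this small Python sketch (function names and sample vectors are arbitrary choices), the remainder $f(x+h) - f(x) - \nabla f(x) \cdot h - \tfrac{1}{2} h^\top (2I_n) h$ is evaluated directly and vanishes up to rounding.

```python
def sq_norm(x):
    """f(x) = |x|^2."""
    return sum(c * c for c in x)

def taylor_remainder(x, h):
    """f(x + h) - f(x) - grad f(x) . h - (1/2) h^T (2 I) h for f = |x|^2."""
    shifted = [a + b for a, b in zip(x, h)]
    gradient_term = sum(2.0 * a * b for a, b in zip(x, h))   # grad f(x) = 2x
    hessian_term = sq_norm(h)                                # (1/2) h^T (2I) h = |h|^2
    return sq_norm(shifted) - sq_norm(x) - gradient_term - hessian_term

print(taylor_remainder([1.0, -2.0, 0.5], [0.1, 0.2, -0.3]))
```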
The function $|x|^2$ is globally $C^\infty$, so Alexandrov's theorem adds nothing new here — the second derivative exists classically everywhere. The interesting case is a convex function that fails to be $C^1$ on a set of positive $\mathcal{H}^{n-1}$-measure but still has a well-defined Alexandrov second derivative almost everywhere.
[example: Alexandrov's Theorem for $f(x) = \max(x_1, \ldots, x_n)$ on $\mathbb{R}^n$]
Define $f: \mathbb{R}^n \to \mathbb{R}$ by $f(x) = \max(x_1, \ldots, x_n)$. This is a convex function (maximum of linear functions is convex) and is globally $1$-Lipschitz (since $|\max_i x_i - \max_j y_j| \le \max_i |x_i - y_i| \le |x - y|$).
**Structure of the non-smooth set.** The function $f$ is not differentiable at points where the maximum is achieved by more than one coordinate. The non-differentiability set is
\begin{align*}
\Sigma = \{x \in \mathbb{R}^n : x_i = x_j = \max_k x_k \text{ for some } i \ne j\}.
\end{align*}
This set is a finite union of polyhedral faces of the form $\{x_i = x_j\} \cap \{x_i \ge x_k\, \forall\, k\}$ with $i \ne j$, each contained in a hyperplane $\{x_i = x_j\}$. Each such face has dimension at most $n - 1$, hence $\mathcal{L}^n(\Sigma) = 0$.
**The gradient on the smooth region.** On the open region $U_i = \{x \in \mathbb{R}^n : x_i > x_j\, \forall\, j \ne i\}$ (where the $i$-th coordinate is the unique maximum), $f(x) = x_i$ is linear, so $\nabla f(x) = e_i$ (the $i$-th standard basis vector) for all $x \in U_i$.
**The Hessian on the smooth region.** On each $U_i$, $f(x) = x_i$ is linear, so all second partial derivatives vanish: $\partial_{x_j}\partial_{x_k} f(x) = 0$ for all $j, k$. Therefore $D^2 f(x) = 0$ for all $x \in U_i$.
**Alexandrov's conclusion.** By Alexandrov's theorem, $D^2 f(x)$ exists for $\mathcal{L}^n$-a.e. $x \in \mathbb{R}^n$. Since $\mathbb{R}^n = \bigcup_{i=1}^n U_i \cup \Sigma$ and $\mathcal{L}^n(\Sigma) = 0$, we conclude that $D^2 f(x) = 0$ for $\mathcal{L}^n$-a.e. $x$.
**What Alexandrov provides beyond classical theory.** Classically, $f$ is not differentiable on $\Sigma$, so the classical Hessian is undefined on a set of positive $\mathcal{H}^{n-1}$-measure (the faces of $\Sigma$ are $(n-1)$-dimensional). Since $\mathcal{L}^n(\Sigma) = 0$, the Hessian nevertheless exists $\mathcal{L}^n$-a.e. and equals the zero matrix there. For this piecewise linear $f$ the conclusion can be read off directly from the formula, so the example is a consistency check rather than an application. The full content of Alexandrov's theorem is needed for general convex functions: there it guarantees a second-order Taylor expansion at a.e. point without any $C^2$ assumption and without an explicit description of the singular set.
[/example]
Alexandrov's theorem gives a second-order Taylor expansion that exists almost everywhere, but the exceptional set — where $D^2 f$ is not defined — can still be geometrically interesting. The theorem says nothing about the behavior of $f$ on $\Sigma$ itself, and in fact $f$ can exhibit rich subdifferential behavior there (the subdifferential $\partial f(x)$ at a non-smooth point is a convex set, not a singleton). This distinction between the classical second derivative and the Alexandrov second derivative is what makes the theorem valuable in applications to optimal transport and PDE, where one must work with non-smooth convex potentials.
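The contrast between generic points and points of $\Sigma$ can be probed numerically. The following is a minimal sketch, not part of the original notes: the helper names `active_indices` and `second_difference` are hypothetical, and the symmetric second difference $f(x+h) + f(x-h) - 2f(x)$ is used as a crude stand-in for $h^\top D^2 f(x) h$. It relies on the standard fact that $\partial f(x)$ is the convex hull of the $e_i$ with $x_i = \max_k x_k$, so more than one active index means a non-singleton subdifferential.

```python
import random

def f(x):
    """Convex, 1-Lipschitz function f(x) = max(x_1, ..., x_n)."""
    return max(x)

def active_indices(x, tol=1e-12):
    """Indices i with x_i = max_k x_k (up to tol); more than one means x lies in Sigma."""
    m = max(x)
    return [i for i, xi in enumerate(x) if m - xi <= tol]

def second_difference(x, h):
    """Symmetric second difference f(x + h) + f(x - h) - 2 f(x)."""
    xp = [a + b for a, b in zip(x, h)]
    xm = [a - b for a, b in zip(x, h)]
    return f(xp) + f(xm) - 2.0 * f(x)

random.seed(0)
n = 4

# Generic point: almost surely a unique maximal coordinate, so x lies in some U_i.
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
ordered = sorted(x)
gap = ordered[-1] - ordered[-2]  # margin by which the maximum is unique
# Perturbations smaller than gap / 2 in each coordinate keep the same coordinate maximal.
h = [0.25 * gap * random.uniform(-1.0, 1.0) for _ in range(n)]
generic_sd = second_difference(x, h)

# Point on Sigma: the first two coordinates tie for the maximum.
s = [1.0, 1.0, 0.0, -1.0]
t = 1e-3
kink_sd = second_difference(s, [t, -t, 0.0, 0.0])

print("active indices at generic point:", active_indices(x))
print("second difference at generic point:", generic_sd)
print("active indices on Sigma:", active_indices(s))
print("second difference on Sigma:", kink_sd)
```

At the generic point the second difference vanishes up to rounding, matching $D^2 f = 0$ on $U_i$. On $\Sigma$ it is of order $|h|$ rather than $|h|^2$, which is the signature of a kink rather than of curvature.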
## Whitney Extension for a Cantor-Type Closed Set
Whitney's extension theorem (Ch. 8) addresses the following question: given a closed set $F \subseteq \mathbb{R}^n$ and a family of polynomials $(P_x)_{x \in F}$ (a "jet") satisfying a compatibility condition, does there exist a $C^k$ function $f: \mathbb{R}^n \to \mathbb{R}$ such that the Taylor expansion of $f$ at each point $x \in F$ agrees with $P_x$?
The compatibility condition (Whitney's $C^k$ condition) requires that the error in approximating $P_y$ by $P_x$ at scale $|x - y|$ is $o(|x - y|^k)$, uniformly on compact subsets of $F$. When $k = 1$, the condition simplifies: one needs a function $f_0: F \to \mathbb{R}$ and a continuous map $g: F \to \mathbb{R}^n$ (playing the role of $\nabla f|_F$) such that, for every compact set $K \subseteq F$,
\begin{align*}
\sup\left\{ \frac{|f_0(y) - f_0(x) - g(x) \cdot (y - x)|}{|y - x|} : x, y \in K,\ 0 < |x - y| \le \delta \right\} \to 0 \quad \text{as } \delta \to 0.
\end{align*}
The Whitney extension theorem then produces a $C^1$ function $f: \mathbb{R}^n \to \mathbb{R}$ extending $f_0$ with $\nabla f|_F = g$.
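To get a feel for the $C^1$ condition, one can evaluate the quotient on a finite sample of a closed set. The sketch below is illustrative only: the function name `whitney_c1_modulus` is hypothetical, the setting is one-dimensional, and the closed set $F = \{0\} \cup \{1/k : k \ge 1\}$ is an assumption chosen for the demonstration. A finite sample can suggest, but never prove, that the quotient tends to zero.

```python
def whitney_c1_modulus(points, f0, g, delta):
    """Largest value of |f0(y) - f0(x) - g(x) (y - x)| / |y - x| over ordered
    pairs x != y in `points` with |x - y| <= delta (one-dimensional case)."""
    worst = 0.0
    for x in points:
        for y in points:
            d = abs(y - x)
            if 0.0 < d <= delta:
                worst = max(worst, abs(f0(y) - f0(x) - g(x) * (y - x)) / d)
    return worst

# Finite sample of the closed set F = {0} union {1/k : k >= 1} (assumed for illustration).
F_sample = [0.0] + [1.0 / k for k in range(1, 201)]

# Compatible jet: f0(x) = x^2 with g(x) = 2x. The quotient equals |y - x|.
good = [whitney_c1_modulus(F_sample, lambda x: x * x, lambda x: 2.0 * x, d)
        for d in (0.1, 0.01, 0.001)]

# Incompatible jet: f0(x) = x with g = 0. The quotient equals 1 for every pair.
bad = [whitney_c1_modulus(F_sample, lambda x: x, lambda x: 0.0, d)
       for d in (0.1, 0.01, 0.001)]

print("compatible jet:  ", good)
print("incompatible jet:", bad)
```

For the compatible jet the reported values shrink with $\delta$. For the incompatible jet they stay at $1$, so no $C^1$ function can have these values and this derivative on $F$.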
[example: Whitney $C^1$ Extension for the Cantor Set with Zero Jet]
Let $C \subseteq [0,1]$ be the standard middle-thirds Cantor set, viewed as a closed subset of $\mathbb{R}$. We take the zero jet: $f_0 \equiv 0$ on $C$ and $g \equiv 0$ on $C$ (corresponding to $\nabla f|_C = 0$). We verify the Whitney $C^1$ condition and construct an explicit $C^1$ extension.
**Verifying the Whitney $C^1$ condition.** We must check:
\begin{align*}
\frac{|f_0(y) - f_0(x) - g(x)(y - x)|}{|y - x|} = \frac{|0 - 0 - 0 \cdot (y - x)|}{|y - x|} = 0
\end{align*}
for all $x, y \in C$ with $x \ne y$. Since $f_0 \equiv 0$ and $g \equiv 0$, the numerator is identically zero, and the condition holds with the ratio exactly zero (not merely approaching zero). The zero jet satisfies the Whitney condition on any closed set, regardless of its geometry.
**The zero extension.** The most naive extension is $f \equiv 0$ on all of $\mathbb{R}$. This is $C^\infty$, satisfies $f|_C = 0$ and $f'|_C = 0$, and is a valid Whitney extension. However, it does not illustrate the constructive content of the theorem.
**A non-zero $C^1$ extension.** To see that the Whitney extension theorem produces non-zero functions, we construct a $C^1$ extension $f: \mathbb{R} \to \mathbb{R}$ that satisfies $f|_C = 0$, $f'|_C = 0$, and $f > 0$ on the gaps of $C$.
Enumerate the bounded gaps of $C$ as the countable collection of open intervals $\{(a_k, b_k)\}_{k=1}^\infty$ in $[0,1]$, where $\sum_k (b_k - a_k) = 1$. On each gap $(a_k, b_k)$, define
\begin{align*}
f_k(x) = \frac{(x - a_k)^2 (b_k - x)^2}{(b_k - a_k)^2}.
\end{align*}
This function satisfies $f_k(a_k) = f_k(b_k) = 0$, $f_k'(a_k) = f_k'(b_k) = 0$, and $f_k(x) > 0$ for $x \in (a_k, b_k)$. The normalization matters: dividing by $(b_k - a_k)^4$ instead would give every bump the same height $1/16$, and the glued function would fail to be continuous at Cantor points where gaps accumulate, such as $c = 0$.
Set $f(x) = f_k(x)$ for $x \in (a_k, b_k)$, and $f(x) = 0$ for $x \in C$ and for $x \notin [0,1]$. The function $f$ is $C^\infty$ on each open gap. Matching values and derivatives at the endpoints of a single gap is not enough to conclude that $f$ is $C^1$, because most points of $C$ are limits of infinitely many gaps; the estimate below handles all points of $C$ at once.
**Why this is a valid $C^1$ extension.** Fix $x \in (a_j, b_j)$ and write $d(x) = \min(x - a_j,\, b_j - x)$ for the distance from $x$ to the nearer endpoint. Since $\max(x - a_j,\, b_j - x) \le b_j - a_j$ and $|a_j + b_j - 2x| \le b_j - a_j$, we get
\begin{align*}
f_j(x) \le d(x)^2, \qquad |f_j'(x)| = \frac{2 (x - a_j)(b_j - x)\, |a_j + b_j - 2x|}{(b_j - a_j)^2} \le 2\, d(x).
\end{align*}
Now let $c \in C$. Because $c \notin (a_j, b_j)$, the point $c$ lies on the far side of one of the endpoints, so $d(x) \le |x - c|$. Hence $0 \le f(x) \le |x - c|^2$ for every $x \in \mathbb{R}$ (trivially so where $f(x) = 0$), and $|f(x) - f(c)| / |x - c| \le |x - c| \to 0$ as $x \to c$. So $f'(c) = 0$. Likewise $|f'(x)| \le 2|x - c|$ for $x$ in a gap, while $f'(x) = 0$ for $x \in C$ and for $x \notin [0,1]$, so $f'(x) \to 0 = f'(c)$ as $x \to c$. Therefore $f \in C^1(\mathbb{R})$ with $f|_C = 0$ and $f'|_C = 0$.
**Consistency with Whitney.** The extension $f$ satisfies the jet condition: $f|_C = 0 = f_0$ and $f'|_C = 0 = g$. The Whitney extension theorem guarantees existence of such an extension, and the explicit construction above realizes it. The key geometric point is that the extension can be non-zero on the gaps while vanishing to second order at every Cantor point.
[/example]
The Whitney extension example clarifies a potential confusion: the theorem does not force the extension of a zero jet to vanish identically. The extension is zero on $C$ and has zero derivative there, but it can take nonzero values off $C$. This flexibility is what makes the Whitney theorem useful in approximation theory: one can prescribe compatible data on a closed set and then fill in the rest of $\mathbb{R}^n$ with a $C^1$ function.
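How the bumps on the gaps are normalized decides whether the glued function is differentiable at Cantor points where gaps accumulate. The sketch below is a numerical experiment under stated assumptions: the exponent parameter `p` in the denominator $(b - a)^p$ and the helper names `bump` and `quotient_at_zero` are hypothetical, and only the gaps $(3^{-k}, 2 \cdot 3^{-k})$ accumulating at $c = 0$ are sampled. It evaluates the difference quotient $f(m)/m$ at the midpoint $m$ of each such gap for $p = 4$ and $p = 2$.

```python
def bump(x, a, b, p):
    """Gap bump (x - a)^2 (b - x)^2 / (b - a)^p on the interval (a, b)."""
    return (x - a) ** 2 * (b - x) ** 2 / (b - a) ** p

def quotient_at_zero(gap, p):
    """Difference quotient (f(m) - f(0)) / (m - 0) at the Cantor point c = 0,
    where m is the midpoint of the gap and f(0) = 0."""
    a, b = gap
    m = 0.5 * (a + b)
    return bump(m, a, b, p) / m

# Middle-thirds gaps (3^-k, 2 * 3^-k), k = 1..10, which accumulate at c = 0.
gaps_near_zero = [(3.0 ** -k, 2.0 * 3.0 ** -k) for k in range(1, 11)]

heights_p4 = [bump(0.5 * (a + b), a, b, 4) for a, b in gaps_near_zero]
q4 = [quotient_at_zero(g, 4) for g in gaps_near_zero]
q2 = [quotient_at_zero(g, 2) for g in gaps_near_zero]

for k, (h4, v4, v2) in enumerate(zip(heights_p4, q4, q2), start=1):
    print(f"k={k:2d}  height(p=4)={h4:.4f}  quotient(p=4)={v4:10.3f}  quotient(p=2)={v2:.3e}")
```

With $p = 4$ every bump has the same midpoint height, so the values do not tend to $0$ along the gaps and the difference quotient at $0$ grows without bound. With $p = 2$ the quotient decays geometrically, which is the behavior a $C^1$ extension with $f'(0) = 0$ requires.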
## Combining Area and Coarea: The Co-Area Formula for a Map $\mathbb{R}^2 \to \mathbb{R}$
The final worked example illustrates the interplay between the area and coarea formulas by computing both sides of the coarea identity explicitly for a function whose level sets are curves in the plane. The computation reveals how the coarea formula distributes the Lebesgue measure of the domain across the $\mathcal{H}^1$-measures of the level curves.
[example: Coarea Formula for a Quadratic Map $\mathbb{R}^2 \to \mathbb{R}$]
Define $f: \mathbb{R}^2 \to \mathbb{R}$ by $f(x_1, x_2) = x_1^2 + x_2^2 = |x|^2$. This function is not Lipschitz on all of $\mathbb{R}^2$ (since $|\nabla f| = 2|x|$ is unbounded), but it is Lipschitz on any bounded domain. We work on the annular region $A = \{x \in \mathbb{R}^2 : 1 \le |x| \le 2\}$ and apply the coarea formula there.
**Computing the gradient.** $\nabla f(x) = (2x_1, 2x_2) = 2x$, so $|\nabla f(x)| = 2|x|$.
**Identifying the level sets.** For $t > 0$, $f^{-1}(t) = \{x : |x|^2 = t\} = \partial B(0, \sqrt{t})$, a circle of radius $\sqrt{t}$. The range of $f$ on $A$ is $[1, 4]$ (since $|x|^2 \in [1, 4]$ for $x \in A$).
**Computing $\mathcal{H}^1$ of the level sets.** For $t \in [1, 4]$, the level set $f^{-1}(t) \cap A = \partial B(0, \sqrt{t})$ is a full circle (since $1 \le \sqrt{t} \le 2$ iff $1 \le t \le 4$). Its $\mathcal{H}^1$-measure is the circumference:
\begin{align*}
\mathcal{H}^1(f^{-1}(t) \cap A) = 2\pi \sqrt{t}.
\end{align*}
**Applying the coarea formula.** The coarea formula with $g \equiv 1$ on $A$ gives
\begin{align*}
\int_A |\nabla f(x)|\, d\mathcal{L}^2(x) = \int_{-\infty}^\infty \mathcal{H}^1(f^{-1}(t) \cap A)\, dt = \int_1^4 2\pi\sqrt{t}\, dt.
\end{align*}
**Computing the right side.** $\int_1^4 2\pi \sqrt{t}\, dt = 2\pi \cdot \left[\frac{2}{3}t^{3/2}\right]_1^4 = 2\pi \cdot \frac{2}{3}(8 - 1) = 2\pi \cdot \frac{14}{3} = \frac{28\pi}{3}$.
**Computing the left side.** Switching to polar coordinates in the classical sense, $x_1 = r\cos\theta$, $x_2 = r\sin\theta$ with $r \in [1, 2]$, $\theta \in [0, 2\pi)$:
\begin{align*}
\int_A |\nabla f(x)|\, d\mathcal{L}^2(x) = \int_A 2|x|\, d\mathcal{L}^2(x) = \int_0^{2\pi} \int_1^2 2r \cdot r\, dr\, d\theta = 2\pi \int_1^2 2r^2\, dr.
\end{align*}
Computing: $\int_1^2 2r^2\, dr = 2 \cdot [r^3/3]_1^2 = 2 \cdot (8/3 - 1/3) = 2 \cdot 7/3 = 14/3$. So the left side equals $2\pi \cdot 14/3 = 28\pi/3$.
**Conclusion.** Both sides equal $28\pi/3$, confirming the coarea formula for $f(x) = |x|^2$ on $A$. The verification is completely explicit: every step uses only integration in polar coordinates and the circumference formula, with no appeal to limiting arguments.
**Reformulation without the $|\nabla f|$ weight.** Applying the coarea formula with the integrand $g = 1/|\nabla f|$, which is bounded on $A$ because $|\nabla f| = 2|x| \ge 2$ there, gives
\begin{align*}
\int_A 1\, d\mathcal{L}^2(x) &= \int_1^4 \int_{f^{-1}(t) \cap A} \frac{1}{|\nabla f(x)|}\, d\mathcal{H}^1(x)\, dt.
\end{align*}
On the level set $f^{-1}(t) = \partial B(0, \sqrt{t})$, we have $|x| = \sqrt{t}$, so $1/|\nabla f(x)| = 1/(2\sqrt{t})$ identically on each level set. Therefore:
\begin{align*}
\mathcal{L}^2(A) = \int_1^4 \frac{1}{2\sqrt{t}} \cdot \mathcal{H}^1(\partial B(0, \sqrt{t}))\, dt = \int_1^4 \frac{1}{2\sqrt{t}} \cdot 2\pi\sqrt{t}\, dt = \int_1^4 \pi\, dt = 3\pi.
\end{align*}
Direct computation: $\mathcal{L}^2(A) = \pi(2^2 - 1^2) = 3\pi$. The coarea formula correctly recovers the area of the annulus via integration over level-set circles, with the factor $1/(2\sqrt{t})$ correcting for the speed at which level sets sweep through the domain.
[/example]
This final example unifies several threads of the course. The coarea formula decomposes the Lebesgue measure of the domain into a family of $\mathcal{H}^1$ measures on the level curves, weighted by the reciprocal of the gradient magnitude. The gradient magnitude $|\nabla f| = 2|x|$ measures how fast $f$ is changing at $x$; a large gradient means the level sets are closely spaced, and dividing by $|\nabla f|$ corrects for this compression. When the gradient is constant (as in the distance function examples), the correction factor is constant and the formula reduces to a pure slice decomposition. When the gradient varies (as here, where it grows like $|x|$), the correction varies with position but can be computed explicitly by evaluating $|\nabla f|$ on each level set.
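The two sides of the coarea identity for $f(x) = |x|^2$ on the annulus can also be checked by quadrature. The following is a minimal sketch under stated assumptions: the helper names `midpoint_rule`, `coarea_sides`, and `area_via_level_sets` are hypothetical, the left side is evaluated in polar coordinates as in the worked computation, and the radii are parameters so that other annuli can be tried.

```python
import math

def midpoint_rule(func, lo, hi, n=20000):
    """Composite midpoint rule for the integral of func over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(func(lo + (i + 0.5) * step) for i in range(n))

def coarea_sides(r_in=1.0, r_out=2.0):
    """Both sides of the coarea identity for f(x) = |x|^2 on {r_in <= |x| <= r_out}."""
    # Left side in polar coordinates: |grad f| = 2r, area element r dr dtheta.
    left = 2.0 * math.pi * midpoint_rule(lambda r: 2.0 * r * r, r_in, r_out)
    # Right side: the level set {f = t} is a circle of radius sqrt(t), of length 2 pi sqrt(t).
    right = midpoint_rule(lambda t: 2.0 * math.pi * math.sqrt(t), r_in ** 2, r_out ** 2)
    return left, right

def area_via_level_sets(r_in=1.0, r_out=2.0):
    """Area of the annulus from the weighted form: (1 / (2 sqrt t)) * H^1(level set)."""
    return midpoint_rule(
        lambda t: (2.0 * math.pi * math.sqrt(t)) / (2.0 * math.sqrt(t)),
        r_in ** 2,
        r_out ** 2,
    )

left, right = coarea_sides()
area = area_via_level_sets()
print("left side :", left)
print("right side:", right)
print("28 pi / 3 :", 28.0 * math.pi / 3.0)
print("area via level sets:", area, " exact 3 pi:", 3.0 * math.pi)
```

Both quadratures should agree with $28\pi/3$ up to discretization error, and the weighted form should return the annulus area $3\pi$. Calling `coarea_sides(a, b)` with other radii $0 < a < b$ gives the same agreement, since both sides then equal $\tfrac{4\pi}{3}(b^3 - a^3)$.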
Together, the worked examples in this chapter demonstrate the following principle: the area and coarea formulas are not merely formal identities. They are computational tools that, when applied to explicit functions, reproduce classical integral formulas of analysis (polar coordinates, surface area of spheres, volume of annuli, tube formulas for neighborhoods of closed sets) as special cases. Rademacher's theorem provides the almost-everywhere differentiability that the formulas require; Alexandrov's theorem provides second-order differentiability almost everywhere for convex functions; and Whitney's extension theorem allows one to prescribe jet data on a closed set, subject to the compatibility condition, and extend it to a $C^k$ function. The abstract machinery of GMT II is, in the end, a machine for computing.
## References
- L. C. Evans and R. F. Gariepy, *Measure Theory and Fine Properties of Functions* (Revised Edition), Chapters 3 and 6.
Contents
- 1. Lipschitz Functions and Rademacher's Theorem
- Lipschitz Functions: Definitions and Basic Properties
- Why Lipschitz?
- The Lipschitz Condition Versus Continuity
- Absolute Continuity on Lines
- Extension Theorems: Kirszbraun and McShane
- Rademacher's Theorem
- The Gap Between Metric and Analytic Regularity
- Why This Theorem Runs the Whole Course
- Stepanov's Theorem
- Almost-Everywhere Differentiability as the Natural Notion
- Everywhere Differentiability Is Too Strong
- Measure-Zero Exceptional Sets Are Invisible to Integration
- Lebesgue Points and the Differentiation Theorem
- 2. Linear Maps and Jacobians
- When the Determinant Fails
- Singular Values and the SVD
- Special Cases: Recovering Classical Formulas
- The Cauchy-Binet Formula: Computing the Jacobian in Coordinates
- The Jacobian of a Nonlinear Map
- The Coarea Jacobian
- The Polar Decomposition and a Structural Perspective
- 3. The Area Formula
- Why the Classical Change of Variables Is Not Enough
- The $m$-Dimensional Jacobian and Multiplicity
- The Area Formula
- Proof: Three-Step Reduction
- Linear Maps
- $C^1$ Maps via Linearization
- Lipschitz Maps via Rademacher
- The Change of Variables Formula
- Surface Area of a Parametrized Surface
- The Image Measure Formula
- Necessity of the Hypotheses
- 4. The Coarea Formula
- The Obstacle: Integrating Over Level Sets
- The Coarea Jacobian
- Statement and Proof of the Coarea Formula
- The Coarea Formula Generalises Fubini
- The Change-of-Variables Form
- Application: Polar Coordinates
- Application: Integration Over Level Sets of Scalar Functions
- The Necessity of the Hypotheses
- Preview: The BV Coarea Formula
- 5. Approximate Differentiability
- Approximate Limits and Approximate Continuity
- Approximate Differentiability: Definition and First Examples
- $L^p$-Differentiability and the Sobolev Case
- $L^{1^*}$-Differentiability for BV Functions
- Approximate Differentiability Follows from $L^p$-Differentiability
- Why Approximate Differentiability is the Right Notion
- 6. Differentiability for Sobolev Functions with $p>n$
- Morrey's Inequality and Hölder Continuity
- Classical Differentiability Almost Everywhere
- Capacity and the Dimension of Singular Sets
- 7. Convex Functions and Alexandrov's Theorem
- Local Lipschitz Continuity of Convex Functions
- The Subdifferential
- Alexandrov's Theorem
- Monge–Ampère Measures
- 8. Whitney's Extension Theorem and $C^1$ Approximation
- The Whitney $C^1$ Condition
- Whitney's Extension Theorem
- $C^1$ Approximation of Lipschitz Functions
- $C^1$ Approximation of Sobolev Functions
- $C^1$ Approximation of BV Functions
- The Lusin Property as a Unifying Theme
- 9. Examples and Worked Problems
- Surface Area of a Parametrized Surface via the Area Formula
- Polar Coordinates via the Coarea Formula
- The Coarea Formula for a Distance Function
- Rademacher's Theorem for the Absolute Value and a Fat Cantor Function
- Alexandrov's Theorem for Specific Convex Functions
- Whitney Extension for a Cantor-Type Closed Set
- Combining Area and Coarea: The Co-Area Formula for a Map $\mathbb{R}^2 \to \mathbb{R}$
- References
Geometric Measure Theory II: Area and Coarea Formulas
Created by Unknown on 5/2/2026 | Last updated on 5/2/2026