[proofplan]
The proof has four major stages. First, derive that the distributional Hessian of $f$ is a matrix of locally finite signed Radon measures, by combining the local Lipschitz bound for $\nabla f$ with the monotonicity of the subgradient (which forces $\nabla f \in BV_{\mathrm{loc}}$). Second, on the full-measure set of Lebesgue points $x_0$ of $\nabla f$ at which the absolutely continuous part of $D^2 f$ has a Lebesgue density $A(x_0) \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, deduce the $L^1$ Lebesgue-density estimate for the gradient remainder near $x_0$. Third, invoke the classical Alexandrov rigidity lemma (Federer 3.2.22 / Evans–Gariepy §6.4) as a black box: it states that the $L^1$-mean estimate on the gradient remainder, together with convexity of $f$, forces the pointwise quadratic Taylor expansion. Fourth, observe that the limit Hessian is positive semidefinite by the symmetric-difference convexity trick. The technical inputs are stated as `[claim]` blocks: bounded variation of $\nabla f$ from monotonicity, Lebesgue–Besicovitch differentiation for matrix-valued Radon measures, and the Alexandrov rigidity lemma upgrading $L^1$-control to pointwise control.
[/proofplan]
[step:Prepare the setting and the local Lipschitz bound for $\nabla f$]
Throughout this proof, $U \subseteq \mathbb{R}^n$ is open and convex and $f: U \to \mathbb{R}$ is convex. By the theorem that convex functions are locally Lipschitz, $f$ is locally Lipschitz on $U$, hence by Rademacher's theorem, $f$ is (classically) differentiable at $\mathcal{L}^n$-a.e. point of $U$, with weak gradient
\begin{align*}
\nabla f: U &\to \mathbb{R}^n
\end{align*}
defined $\mathcal{L}^n$-a.e. and belonging to $L^\infty_{\mathrm{loc}}(U; \mathbb{R}^n)$. By the standard properties of the subdifferential, at every point of differentiability $\partial f(x) = \{\nabla f(x)\}$.
Fix any open ball $B \Subset U$ (i.e., $\overline{B} \subseteq U$ compact). By local Lipschitzness there is a constant $L_B < \infty$ with $|f(x) - f(y)| \le L_B |x - y|$ for $x, y \in \overline{B}$, and $|\nabla f(x)| \le L_B$ for $\mathcal{L}^n$-a.e. $x \in \overline{B}$.
[/step]
[step:Show $\nabla f$ is locally $BV$ via monotonicity]
[claim:For every open ball $B \Subset U$, $\nabla f \in BV(B; \mathbb{R}^n)$, and the distributional Jacobian $D(\nabla f)$ is an $n \times n$ matrix of finite signed Radon measures on $B$, symmetric in the sense that $\partial_{x_i}(\partial_{x_j} f) = \partial_{x_j}(\partial_{x_i} f)$ as distributions.]
[/claim]
*Proof of claim.* Fix an open ball $B \Subset U$. Choose a slightly enlarged ball $B' \Subset U$ with $\overline{B} \Subset B'$, and set $\delta := \operatorname{dist}(\overline{B}, \partial B') > 0$.
For each unit vector $v \in S^{n-1}$ and each $h \in (0, \delta)$, define the directional difference quotient
\begin{align*}
\Delta_v^h \nabla f: B &\to \mathbb{R}^n, \\
x &\mapsto \frac{\nabla f(x + h v) - \nabla f(x)}{h}.
\end{align*}
This is well-defined for a.e. $x \in B$ since $x + hv \in B'$.
By the monotonicity of the subdifferential, applied with $p = \nabla f(x + hv) \in \partial f(x + hv)$ and $q = \nabla f(x) \in \partial f(x)$ (at points of differentiability),
\begin{align*}
(\nabla f(x + hv) - \nabla f(x)) \cdot ((x + hv) - x) \ge 0,
\end{align*}
i.e., $h \, v \cdot \Delta_v^h \nabla f(x) \ge 0$. Dividing by $h^2 > 0$,
\begin{align*}
v \cdot \Delta_v^h \nabla f(x) \ge 0 \qquad \text{for a.e. } x \in B.
\end{align*}
In particular, taking $v = e_k$ shows that the scalar $\partial_{x_k} f$ is monotone-nondecreasing in the $x_k$-direction along $B$ (in the difference-quotient sense). Taking $v = (e_i + e_j)/\sqrt 2$ and expanding $v \cdot \nabla f = (\partial_{x_i} f + \partial_{x_j} f)/\sqrt 2$ shows that the scalar $\partial_{x_i} f + \partial_{x_j} f$ is monotone-nondecreasing in the direction $v$. Taking $v = (e_i - e_j)/\sqrt 2$ shows analogously that $\partial_{x_i} f - \partial_{x_j} f$ is monotone-nondecreasing in $v' = (e_i - e_j)/\sqrt 2$.
Integrating these difference-quotient inequalities against test functions $\varphi \in C_c^\infty(B)$ with $\varphi \ge 0$, and passing to the distributional limit as $h \to 0^+$, yields nonnegative distributions:
\begin{align*}
T_k &:= \partial_{x_k}(\partial_{x_k} f) \ge 0, \\
T_{ij}^+ &:= \partial_v(\partial_{v} f) = \tfrac{1}{2}(\partial_{x_i} + \partial_{x_j})(\partial_{x_i} f + \partial_{x_j} f) = \tfrac{1}{2}(\partial_{ii} f + 2 \partial_{ij} f + \partial_{jj} f) \ge 0, \\
T_{ij}^- &:= \partial_{v'}(\partial_{v'} f) = \tfrac{1}{2}(\partial_{x_i} - \partial_{x_j})(\partial_{x_i} f - \partial_{x_j} f) = \tfrac{1}{2}(\partial_{ii} f - 2 \partial_{ij} f + \partial_{jj} f) \ge 0.
\end{align*}
By Schwartz's theorem on positive distributions (every nonnegative distribution is a Radon measure; see Schwartz, *Théorie des distributions*, Ch. I §4), each of $T_k$, $T_{ij}^+$, $T_{ij}^-$ is a nonnegative Radon measure on $B$.
We now extract the off-diagonal entries by polarization:
\begin{align*}
\partial_{ij} f = \tfrac{1}{2}\bigl(T_{ij}^+ - T_{ij}^-\bigr),
\end{align*}
exhibiting $\partial_{ij} f$ as a difference of two nonnegative Radon measures, hence a finite signed Radon measure on $B$. The diagonal entries $\partial_{ii} f = T_i$ are themselves nonnegative measures.
Since each component $\partial_{x_i} f \in L^\infty(B)$ has each distributional partial derivative $\partial_{x_j}(\partial_{x_i} f)$ a finite signed Radon measure on $B$, we conclude $\partial_{x_i} f \in BV(B)$ by the definition of $BV$, and hence $\nabla f \in BV(B; \mathbb{R}^n)$.
The symmetry $\partial_{x_i}(\partial_{x_j} f) = \partial_{x_j}(\partial_{x_i} f)$ holds as distributions because mixed partials commute when applied to test functions (using $f \in L^1_{\mathrm{loc}}(B)$, which follows from $f$ being locally Lipschitz hence locally bounded):
\begin{align*}
\partial_{x_j}(\partial_{x_i} f)(\varphi) = -\int_B \partial_{x_i} f \, \partial_{x_j}\varphi = \int_B f \, \partial_{x_i}\partial_{x_j}\varphi = -\int_B \partial_{x_j} f\, \partial_{x_i}\varphi = \partial_{x_i}(\partial_{x_j} f)(\varphi).
\end{align*}
$\square$
By the claim, $\nabla f \in BV(B; \mathbb{R}^n)$ and $D(\nabla f) = (\partial_{x_j}(\partial_{x_i} f))_{i,j} =: D^2 f$ is an $n \times n$ matrix of finite signed Radon measures on $B$.
By Lebesgue's decomposition for the matrix of signed measures $D^2 f$, write
\begin{align*}
D^2 f = D^2 f_{\mathrm{ac}} + D^2 f_{\mathrm{s}},
\end{align*}
where $D^2 f_{\mathrm{ac}} \ll \mathcal{L}^n$ and $D^2 f_{\mathrm{s}} \perp \mathcal{L}^n$ on $B$. The Radon–Nikodym theorem provides a measurable matrix-valued density $A_B \in L^1(B; \mathbb{R}^{n \times n}_{\mathrm{sym}})$ with $D^2 f_{\mathrm{ac}} = A_B \, \mathcal{L}^n$. The symmetry of $D^2 f$ as a distribution descends to symmetry of $A_B(x)$ for $\mathcal{L}^n$-a.e. $x$. The densities $A_B$ on different balls agree on overlaps (the absolutely continuous part of $D^2 f$ on $B$ is the restriction to $B$ of the absolutely continuous part of $D^2 f$ on any larger ball $B'' \supseteq B$), so they patch into a single density $A \in L^1_{\mathrm{loc}}(U; \mathbb{R}^{n \times n}_{\mathrm{sym}})$.
[/step]
[step:Identify the full-measure set of good points $x_0$]
Define the **set of good points** $G \subseteq U$ by
\begin{align*}
G := \{x_0 \in U: x_0 \text{ satisfies all of (G1)-(G4) below}\},
\end{align*}
where the conditions are:
- (G1) $f$ is classically differentiable at $x_0$;
- (G2) $x_0$ is a Lebesgue point of $\nabla f$ in the $L^1$ sense;
- (G3) $x_0$ is a Lebesgue point of the global density $A \in L^1_{\mathrm{loc}}(U; \mathbb{R}^{n \times n}_{\mathrm{sym}})$, with Lebesgue value $A(x_0)$;
- (G4) the singular part $D^2 f_{\mathrm{s}}$ has zero Lebesgue density at $x_0$ (a property holding $\mathcal{L}^n$-a.e. by the Lebesgue–Besicovitch differentiation theorem for Radon measures).
We claim $\mathcal{L}^n(U \setminus G) = 0$. Indeed, (G1) holds on a full-measure set by Rademacher's theorem; (G2) holds on a full-measure set by the Lebesgue differentiation theorem applied to $\nabla f \in L^1_{\mathrm{loc}}(U)$; (G3) holds on a full-measure set by the Lebesgue differentiation theorem applied to $A \in L^1_{\mathrm{loc}}(U)$; (G4) holds by the Lebesgue–Besicovitch differentiation theorem for Radon measures (Evans–Gariepy, *Measure Theory and Fine Properties of Functions*, §1.6.2) applied to $D^2 f_{\mathrm{s}}$. Each is a countable intersection of full-measure sets, hence $G$ has full measure.
For the rest of the proof we fix $x_0 \in G$, set $A(x_0)$ as the symmetric matrix in $\mathbb{R}^{n \times n}$ from (G3), and prove the Taylor expansion at $x_0$.
[/step]
[step:State the gradient density estimate at $x_0$]
[claim:$L^1$-gradient density estimate at $x_0 \in G$. With $x_0 \in G$ and $A(x_0)$ as in Step 3, defining the **gradient deviation**
\begin{align*}
e: U &\to \mathbb{R}^n, \\
y &\mapsto \nabla f(y) - \nabla f(x_0) - A(x_0)(y - x_0),
\end{align*}
we have
\begin{align*}
\fint_{B(x_0, r)} |e(y)| \, d\mathcal{L}^n(y) = o(r) \qquad \text{as } r \to 0^+,
\end{align*}
i.e., the average gradient deviation on $B(x_0, r)$ from the affine model $y \mapsto \nabla f(x_0) + A(x_0)(y - x_0)$ is $o(r)$ as $r \to 0$.]
[/claim]
*Proof of claim.* This is the assertion that $\nabla f \in BV_{\mathrm{loc}}(U; \mathbb{R}^n)$ has $A(x_0)$ as its $L^1$-approximate derivative at $x_0$. Apply the BV $L^{1^*}$-differentiability theorem (Evans–Gariepy, *Measure Theory and Fine Properties of Functions*, §6.1.2, Theorem 4) componentwise to each scalar $\partial_{x_i} f \in BV_{\mathrm{loc}}(U)$ — its hypotheses are met because Step 2 established $\nabla f \in BV_{\mathrm{loc}}(U; \mathbb{R}^n)$, and the absolutely continuous part of $D(\partial_{x_i} f)$ has $\mathcal{L}^n$-density given by the $i$-th row of $A$ (Step 2), with the singular part vanishing in density at $x_0$ by (G4). Here $1^* = n/(n-1)$ is the Sobolev conjugate of $1$. The theorem yields, at $\mathcal{L}^n$-a.e. $x_0$ (in particular at every $x_0$ satisfying (G1)–(G4)),
\begin{align*}
\left(\fint_{B(x_0, r)} |e(y)|^{1^*} \, d\mathcal{L}^n(y)\right)^{1/1^*} = o(r) \qquad \text{as } r \to 0^+.
\end{align*}
By Jensen's inequality (since $1 < 1^*$, the function $s \mapsto s^{1^*}$ is convex on $[0, \infty)$),
\begin{align*}
\fint_{B(x_0, r)} |e| \, d\mathcal{L}^n \le \left(\fint_{B(x_0, r)} |e|^{1^*}\, d\mathcal{L}^n\right)^{1/1^*},
\end{align*}
so the $L^1$-average bound $\fint_{B(x_0,r)}|e| = o(r)$ follows from the $L^{1^*}$ one. $\square$
[/step]
[step:Pointwise Taylor expansion via the Alexandrov rigidity lemma]
We now upgrade the $L^1$-mean control of the gradient deviation $e$ established in Step 4 to a pointwise quadratic Taylor expansion of $f$ at $x_0$. This upgrade is the heart of Alexandrov's theorem and is a standard, named result in geometric measure theory; we invoke it as a black box.
[claim:Alexandrov rigidity lemma. Let $V \subseteq \mathbb{R}^n$ be open and convex, let $g: V \to \mathbb{R}$ be convex, and let $x_0 \in V$ be a point at which $g$ is classically differentiable. Suppose there exists a symmetric matrix $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ such that the gradient remainder satisfies the $L^1$-mean estimate
\begin{align*}
\fint_{B(x_0, r)}\bigl|\nabla g(y) - \nabla g(x_0) - A(y - x_0)\bigr|\, d\mathcal{L}^n(y) = o(r) \qquad \text{as } r \to 0^+.
\end{align*}
Then $g$ admits the pointwise quadratic Taylor expansion
\begin{align*}
g(y) = g(x_0) + \nabla g(x_0)\cdot(y - x_0) + \tfrac{1}{2}(y - x_0)^\top A\,(y - x_0) + o(|y - x_0|^2) \qquad \text{as } y \to x_0.
\end{align*}]
[/claim]
*Reference for the proof of the claim.* This is the standard rigidity lemma underlying Alexandrov's theorem; see Federer, *Geometric Measure Theory* (Springer, 1969), Theorem 3.2.22 (Aleksandrov's theorem), and Evans–Gariepy, *Measure Theory and Fine Properties of Functions* (CRC Press, 2nd ed. 2015), §6.4 ("Aleksandrov's Theorem"). Both references prove this exact upgrade — that an $L^1$-mean estimate on the gradient remainder of a convex function, at a point of classical differentiability, implies the corresponding pointwise quadratic Taylor expansion — using maximal-function arguments combined with the radial-monotonicity properties of one-dimensional convex functions. The convexity of $g$ is essential; without it, the $L^1$-to-pointwise upgrade fails for $W^{1,1}$ functions in general.
*Application to $f$ at $x_0 \in G$.* The hypotheses of the rigidity lemma hold with $V = U$, $g = f$, the point $x_0 \in G$, and the symmetric matrix $A = A(x_0) \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ from Step 3:
- $U \subseteq \mathbb{R}^n$ is open and convex (global hypothesis).
- $f: U \to \mathbb{R}$ is convex (global hypothesis).
- $f$ is classically differentiable at $x_0$ by (G1).
- The $L^1$-mean estimate $\fint_{B(x_0, r)} |\nabla f(y) - \nabla f(x_0) - A(x_0)(y - x_0)|\, d\mathcal{L}^n(y) = o(r)$ as $r \to 0^+$ holds by Step 4.
Therefore the lemma yields the pointwise quadratic Taylor expansion at $x_0$:
\begin{align*}
f(y) = f(x_0) + \nabla f(x_0)\cdot(y - x_0) + \tfrac{1}{2}(y - x_0)^\top A(x_0) (y - x_0) + o(|y - x_0|^2) \qquad \text{as } y \to x_0.
\end{align*}
[/step]
[step:Verify the Hessian is positive semidefinite]
Set $A(x_0) := D^2 f(x_0)$, the symmetric matrix from Step 3. Fix $h \in \mathbb{R}^n$ with $|h| = 1$. By convexity of $f$, the one-dimensional function
\begin{align*}
\psi: [-\rho_0, \rho_0] &\to \mathbb{R}, \\
t &\mapsto f(x_0 + t h)
\end{align*}
is convex (where $\rho_0 > 0$ is small enough that $x_0 \pm \rho_0 h \in U$, which exists because $U$ is open). By the Taylor expansion (Step 5) applied along $\pm t h$:
\begin{align*}
\psi(t) + \psi(-t) &= f(x_0 + th) + f(x_0 - th) \\
&= 2 f(x_0) + (\nabla f(x_0) \cdot h - \nabla f(x_0) \cdot h) t + t^2 h^\top A(x_0) h + o(t^2) \\
&= 2 f(x_0) + t^2 h^\top A(x_0) h + o(t^2) \quad \text{as } t \to 0.
\end{align*}
By convexity of $\psi$ at $0$ with the symmetric pair $(t, -t)$,
\begin{align*}
\frac{\psi(t) + \psi(-t)}{2} \ge \psi(0) = f(x_0).
\end{align*}
Combining,
\begin{align*}
2 f(x_0) + t^2 h^\top A(x_0) h + o(t^2) = \psi(t) + \psi(-t) \ge 2 f(x_0),
\end{align*}
hence
\begin{align*}
t^2 h^\top A(x_0) h + o(t^2) \ge 0.
\end{align*}
Dividing by $t^2 > 0$ and letting $t \to 0$ gives $h^\top A(x_0) h \ge 0$. Since $h$ was an arbitrary unit vector, $A(x_0) \ge 0$ (positive semidefinite).
This completes the proof: at every point $x_0 \in G$ (a set of full $\mathcal{L}^n$-measure in $U$), $f$ admits the second-order Taylor expansion
\begin{align*}
f(x) = f(x_0) + \nabla f(x_0)\cdot(x - x_0) + \tfrac{1}{2}(x - x_0)^\top A(x_0)(x - x_0) + o(|x - x_0|^2)
\end{align*}
as $x \to x_0$, with $A(x_0) = D^2 f(x_0)$ symmetric and positive semidefinite.
[/step]