Convexity begins with a promise about interpolation: if two states are allowed, then every mixture between them is allowed, and the value of a function at a mixture should not exceed the same mixture of the endpoint values. This promise is powerful because it turns a global problem into a problem controlled by line segments. In optimization, it is the difference between a landscape where a local minimum may be a trap and a landscape where every local minimum is already global.
The first failure to keep in mind is the function $f: \mathbb{R} \to \mathbb{R}$ given by $f(x) = x^4 - x^2$. Its graph has two valleys and a hill between them. A descent method started near one valley has no reason to find the other, and a tangent line near the hill gives misleading information about the global shape. Convex functions are designed to exclude exactly this kind of hidden non-global behaviour.
[example: A Non-Convex Double Well]
Let $f:\mathbb{R}\to\mathbb{R}$ be defined by $f(x)=x^4-x^2$. First note that the symmetric pair $-1,1$ does not reveal non-convexity, because
\begin{align*}
f(-1)=(-1)^4-(-1)^2=1-1=0
\end{align*}
and
\begin{align*}
f(1)=1^4-1^2=1-1=0.
\end{align*}
Their midpoint is $0$, and
\begin{align*}
f(0)=0^4-0^2=0.
\end{align*}
Thus
\begin{align*}
f(0)=0=\frac{0+0}{2}=\frac{f(-1)+f(1)}{2},
\end{align*}
so this pair satisfies the midpoint equality.
Instead take $x=-1/\sqrt{2}$ and $y=1/\sqrt{2}$. Then
\begin{align*}
x^2=\left(-\frac{1}{\sqrt{2}}\right)^2=\frac{1}{2}
\end{align*}
and
\begin{align*}
x^4=(x^2)^2=\left(\frac{1}{2}\right)^2=\frac{1}{4},
\end{align*}
so
\begin{align*}
f(x)=x^4-x^2=\frac{1}{4}-\frac{1}{2}=-\frac{1}{4}.
\end{align*}
The same calculation gives
\begin{align*}
y^2=\left(\frac{1}{\sqrt{2}}\right)^2=\frac{1}{2}
\end{align*}
and
\begin{align*}
y^4=\frac{1}{4},
\end{align*}
hence
\begin{align*}
f(y)=\frac{1}{4}-\frac{1}{2}=-\frac{1}{4}.
\end{align*}
Their midpoint is
\begin{align*}
\frac{x+y}{2}=\frac{-1/\sqrt{2}+1/\sqrt{2}}{2}=0,
\end{align*}
so
\begin{align*}
f\left(\frac{x+y}{2}\right)=f(0)=0.
\end{align*}
The average endpoint value is
\begin{align*}
\frac{f(x)+f(y)}{2}=\frac{-1/4-1/4}{2}=\frac{-1/2}{2}=-\frac{1}{4}.
\end{align*}
Therefore
\begin{align*}
f\left(\frac{x+y}{2}\right)=0>-\frac{1}{4}=\frac{f(x)+f(y)}{2}.
\end{align*}
At $t=1/2$, the convexity inequality would require the midpoint value to be at most the average endpoint value, so this pair proves that $f(x)=x^4-x^2$ is not convex on $\mathbb{R}$ even though it is a polynomial.
[/example]
The point of convexity is not smoothness. It is a compatibility condition between the geometry of the domain and the arithmetic of function values. A convex function may have corners, may fail to be differentiable, and may take the value $+\infty$ in the extended-valued formalism. What it gives in return is a rigid global geometry: epigraphs become convex sets, subgradients replace missing derivatives, and minimization becomes far more stable.
## Definition
The function definition is the centre of the subject. It asks for one inequality to hold along every line segment in the domain: the value at an interpolated point must sit below the interpolated endpoint values.
[definition: Convex Function]
Let $V$ be a real [vector space](/page/Vector%20Space), let $C \subset V$ be a set such that $(1-t)x+ty\in C$ for every $x,y\in C$ and every $t\in[0,1]$, and let $f: C \to \mathbb{R}$ be a function. The function $f$ is convex on $C$ if for every $x,y \in C$ and every $t \in [0,1]$,
\begin{align*}
f((1-t)x+ty) \le (1-t)f(x)+t f(y).
\end{align*}
[/definition]
This definition is deliberately global: it quantifies over every segment in the domain. It also makes no reference to derivatives, which is why it applies equally well to smooth functions, piecewise-linear functions, norms, and many functions with corners. The definition above has already built the segment condition into its hypotheses; the next section separates that domain condition from the function inequality and then sharpens the function side in two standard directions.
## Domains, Strictness, and Infinite Values
### Convex Domains
Before the inequality for function values can even be tested, the interpolated point must remain admissible. If a feasible region has a gap, the segment between two feasible points may leave the region, and the chord comparison has no value there. Convex sets isolate the domains where line-segment arguments can run without leaving the problem.
[definition: Convex Set]
Let $V$ be a real vector space. A set $C \subset V$ is convex if for every $x,y \in C$ and every $t \in [0,1]$, the point $(1-t)x + ty$ belongs to $C$.
[/definition]
Once the domain has this segment geometry, the function inequality can be applied repeatedly without leaving the problem. Some optimization problems require more than the absence of hidden bumps: they need to rule out flat valleys where different points have the same best value.
### Strict Convexity
If an optimization problem has a whole flat valley of best points, convexity still gives global control but it does not select a unique answer. Strict convexity is the condition that every genuine mixture improves on the straight-line interpolation of endpoint values. It is the right strengthening when the geometry should rule out ties between distinct feasible points.
[definition: Strictly Convex Function]
Let $V$ be a real vector space, let $C \subset V$ be convex, and let $f: C \to \mathbb{R}$ be a function. The function $f$ is strictly convex on $C$ if for every distinct $x,y \in C$ and every $t \in (0,1)$,
\begin{align*}
f((1-t)x+ty) < (1-t)f(x)+t f(y).
\end{align*}
[/definition]
Strict convexity is stronger than convexity, but it is not the same as having positive curvature everywhere. The function $x \mapsto x^4$ is strictly convex on $\mathbb{R}$ even though its second derivative vanishes at $0$. Another limitation of the finite-valued definition is that it keeps the feasible set outside the function, while constrained optimization often works better when the objective and the constraint are part of one object.
### Extended-Valued Functions
Many convex problems are constrained: some points are feasible and others should never be chosen. Keeping the feasible set separate from the objective works, but it hides the fact that constraints and costs often obey the same convex calculus. Extended-valued functions encode forbidden points by assigning them the value $+\infty$, turning feasibility into an infinite penalty.
[definition: Extended-Valued Convex Function]
Let $V$ be a real vector space. A function $f: V \to \mathbb{R} \cup \{+\infty\}$ is extended-valued convex if for every $x,y \in V$ and every $t \in [0,1]$,
\begin{align*}
f((1-t)x+ty) \le (1-t)f(x)+t f(y),
\end{align*}
where arithmetic in $\mathbb{R}\cup\{+\infty\}$ is interpreted by $a+(+\infty)=+\infty$ for $a\in\mathbb{R}\cup\{+\infty\}$, $s(+\infty)=+\infty$ for $s>0$, and $0\cdot(+\infty)=0$.
[/definition]
The finite-valued definition is recovered on the effective domain $\{x\in V:f(x)<+\infty\}$. This set is convex whenever $f$ is extended-valued convex, so the restriction of $f$ to its effective domain is an ordinary finite-valued convex function on a convex domain. Extended-valued functions become especially natural once we pass from graphs to epigraphs, because the set where the function is allowed to be finite becomes part of the geometry above the graph.
## Epigraphs and Geometry
### Regions Above Graphs
A convex function is often easier to understand by looking not at the graph itself but at the region above the graph. The graph of a convex function may be curved or have corners, while the region above it has a linear geometry: it is closed under taking line segments.
[definition: Epigraph]
Let $V$ be a real vector space and let $f: C \to \mathbb{R}$ be a function on a set $C \subset V$. The epigraph of $f$ is the subset of $V \times \mathbb{R}$ defined by
\begin{align*}
\operatorname{epi}(f) := \{(x,r) \in C \times \mathbb{R} : f(x) \le r\}.
\end{align*}
[/definition]
The epigraph records every point sitting on or above the graph. If this region is convex, then taking an average of two points above the graph leaves us above the graph. The next result matters because it turns the analytic chord inequality into a statement about a set, so every tool for convex sets becomes available for studying functions.
[quotetheorem:6668]
The epigraph characterisation explains why convex functions belong simultaneously to analysis and geometry. Inequalities about function values become inclusion statements about convex sets, and operations on functions often become operations on epigraphs.
[example: Epigraph of the Absolute Value]
Let $f:\mathbb{R}\to\mathbb{R}$ be $f(x)=|x|$. By the definition of epigraph,
\begin{align*}
\operatorname{epi}(f)=\{(x,r)\in\mathbb{R}^2: f(x)\le r\}=\{(x,r)\in\mathbb{R}^2: |x|\le r\}.
\end{align*}
The inequality $|x|\le r$ is equivalent to $x\le r$ and $-x\le r$, so the boundary consists of the ray $r=x$ for $x\ge 0$ and the ray $r=-x$ for $x\le 0$.
We verify directly that this epigraph is convex. Let $(x_1,r_1),(x_2,r_2)\in\operatorname{epi}(f)$ and let $t\in[0,1]$. From membership in the epigraph,
\begin{align*}
|x_1|\le r_1
\end{align*}
and
\begin{align*}
|x_2|\le r_2.
\end{align*}
Since $1-t\ge 0$ and $t\ge 0$, multiplying these inequalities gives
\begin{align*}
(1-t)|x_1|\le (1-t)r_1
\end{align*}
and
\begin{align*}
t|x_2|\le tr_2.
\end{align*}
Adding them yields
\begin{align*}
(1-t)|x_1|+t|x_2|\le (1-t)r_1+tr_2.
\end{align*}
By the triangle inequality and the homogeneity of absolute value,
\begin{align*}
|(1-t)x_1+tx_2|\le |(1-t)x_1|+|tx_2|=(1-t)|x_1|+t|x_2|.
\end{align*}
Combining the two inequalities gives
\begin{align*}
|(1-t)x_1+tx_2|\le (1-t)r_1+tr_2.
\end{align*}
Therefore
\begin{align*}
((1-t)x_1+tx_2,(1-t)r_1+tr_2)\in\operatorname{epi}(f).
\end{align*}
Thus the region above the graph of $|x|$ is closed under line segments, so the convexity of $|\cdot|$ appears geometrically as the convexity of this cone.
[/example]
### Infinite Penalties
The same idea lets constraints become functions. A set can be represented by a function that is zero on the set and infinite outside it. This construction is worth isolating because it changes a constrained problem into an unconstrained extended-valued problem without changing which points are admissible.
[definition: Indicator Function of a Convex Set]
Let $V$ be a real vector space and let $C \subset V$. The indicator function of $C$ is the extended-valued function $\iota_C: V \to \mathbb{R} \cup \{+\infty\}$ defined by
\begin{align*}
\iota_C(x) := 0 \quad \text{for } x\in C.
\end{align*}
\begin{align*}
\iota_C(x) := +\infty \quad \text{for } x\notin C.
\end{align*}
[/definition]
This notation separates the optimization indicator from the ordinary set indicator $\mathbb{1}_C$, whose values are $0$ and $1$. The infinite penalty version is designed so that minimizing $f+\iota_C$ is the same as minimizing $f$ subject to $x\in C$. The next result identifies exactly when this penalty is compatible with convex analysis rather than merely with set membership.
[quotetheorem:7939]
This result is the first hint that convex functions form a calculus of constraints. Instead of keeping feasible sets and objective functions in separate languages, convex analysis folds them into a single extended-valued framework.
## Jensen Inequalities and Averages
### Finite Averages
The defining inequality uses two points, but most applications involve averages of many points: barycentres of data, expectations of random variables, convex combinations in numerical schemes, and mixtures of feasible designs. To move from two endpoints to many endpoints, we first need the algebraic language of weighted averages whose coefficients are nonnegative and sum to one.
[definition: Convex Combination]
Let $V$ be a real vector space. A convex combination of points $x_1,\ldots,x_m \in V$ is a point of the form
\begin{align*}
\sum_{i=1}^m \lambda_i x_i,
\end{align*}
where $m\in \mathbb{N}$, $\lambda_i \ge 0$ for each $i$, and
\begin{align*}
\sum_{i=1}^m \lambda_i = 1.
\end{align*}
[/definition]
Convex combinations are repeated line-segment averages. If a function behaves well on every segment, then it should behave well under every finite averaging operation. The following theorem is the formal version of that principle, and it is the form of convexity most often used in computations.
[quotetheorem:7940]
Finite Jensen is the basic algebra of convexity. It says that averaging before applying $f$ never gives a larger value than applying $f$ first and then averaging.
[example: Variance as a Convexity Gap]
Let $x_1,\ldots,x_m \in \mathbb{R}$ and let $\lambda_1,\ldots,\lambda_m \ge 0$ satisfy $\sum_{i=1}^m \lambda_i=1$. The function $f(x)=x^2$ is convex, because for $a,b\in\mathbb{R}$ and $t\in[0,1]$,
\begin{align*}
(1-t)a^2+tb^2-((1-t)a+tb)^2=t(1-t)(a-b)^2\ge 0.
\end{align*}
Applying *[Finite Jensen Inequality](/theorems/7940)* to $f(x)=x^2$ gives
\begin{align*}
\left(\sum_{i=1}^m \lambda_i x_i\right)^2 \le \sum_{i=1}^m \lambda_i x_i^2.
\end{align*}
Let
\begin{align*}
\bar{x}=\sum_{i=1}^m \lambda_i x_i.
\end{align*}
Using $(x_i-\bar{x})^2=x_i^2-2\bar{x}x_i+\bar{x}^2$, we get
\begin{align*}
\sum_{i=1}^m \lambda_i (x_i-\bar{x})^2=\sum_{i=1}^m \lambda_i x_i^2-2\bar{x}\sum_{i=1}^m \lambda_i x_i+\bar{x}^2\sum_{i=1}^m \lambda_i.
\end{align*}
Since $\sum_{i=1}^m \lambda_i x_i=\bar{x}$ and $\sum_{i=1}^m \lambda_i=1$, this becomes
\begin{align*}
\sum_{i=1}^m \lambda_i (x_i-\bar{x})^2=\sum_{i=1}^m \lambda_i x_i^2-2\bar{x}^2+\bar{x}^2.
\end{align*}
Therefore
\begin{align*}
\sum_{i=1}^m \lambda_i (x_i-\bar{x})^2=\sum_{i=1}^m \lambda_i x_i^2-\bar{x}^2.
\end{align*}
Substituting $\bar{x}=\sum_{i=1}^m \lambda_i x_i$ gives
\begin{align*}
\sum_{i=1}^m \lambda_i x_i^2-\left(\sum_{i=1}^m \lambda_i x_i\right)^2=\sum_{i=1}^m \lambda_i (x_i-\bar{x})^2.
\end{align*}
Thus the Jensen gap for $x\mapsto x^2$ is exactly the weighted variance around the weighted mean, so the gap vanishes precisely when all points with positive weight coincide with $\bar{x}$.
[/example]
### Expectations
When the finite average is replaced by an integral or expectation, the same principle becomes one of the central inequalities in analysis and probability. The finite result already contains the geometry, but random variables require measurability and integrability so that both sides of the inequality are defined. This is the form that connects convexity to moment bounds, entropy estimates, and risk.
[quotetheorem:9]
[Jensen's inequality](/theorems/9) is why convex functions govern risk, entropy, moment estimates, and many inequalities in analysis. The same shape also appears in deterministic integration with respect to a probability measure.
## Slopes, Derivatives, and Subgradients
### Secant Slopes
Convexity can be read from secant slopes on an interval. For a differentiable function, the derivative should increase. For a non-differentiable function, a whole interval of supporting slopes may appear at a corner. The starting point is the slope of a chord, because convexity controls chords before it controls tangents.
[definition: Secant Slope]
Let $I \subset \mathbb{R}$ be an interval and let $f: I \to \mathbb{R}$ be a function. The secant-slope map of $f$ is the function $S_f:\{(x,y)\in I^2:x\ne y\}\to \mathbb{R}$ defined by
\begin{align*}
S_f(x,y) := \frac{f(y)-f(x)}{y-x}.
\end{align*}
For distinct $x,y\in I$, the number $S_f(x,y)$ is the secant slope of $f$ from $x$ to $y$.
[/definition]
The secant slope measures the average rate of increase along a chord. Convexity says more than the graph lying below chords: it forces the slopes of these chords to appear in a consistent order. That order is the one-dimensional mechanism behind tangent lines, derivative monotonicity, and many estimates for convex functions.
[quotetheorem:7941]
This theorem is the bridge from geometric convexity to differential tests. It says that convexity imposes order on every family of chords. Once the function is differentiable, the chord slopes can be squeezed toward derivative values, turning the geometric condition into a familiar monotonicity condition.
[quotetheorem:6666]
The derivative criterion works only when the derivative exists. Convex functions need not be differentiable, so we need a replacement for the tangent line that can survive corners.
### Supporting Slopes
At a corner, asking for a single derivative asks too much. The graph may have many affine lines below it that touch at the same point, and each of those lines carries first-order information. This motivates the set-valued replacement for the gradient.
[definition: Subgradient]
Let $C \subset \mathbb{R}^n$ be convex, let $f: C \to \mathbb{R}$ be convex, and let $x_0 \in C$. A vector $p \in \mathbb{R}^n$ is a subgradient of $f$ at $x_0$ if for every $x \in C$,
\begin{align*}
f(x) \ge f(x_0) + p \cdot (x-x_0).
\end{align*}
[/definition]
A subgradient is the slope of an affine function lying below $f$ and touching it at $x_0$. At a smooth point there is only one supporting slope. At a corner there may be many, so the full object is not a vector but a set of allowable supporting vectors.
[definition: Subdifferential]
Let $C \subset \mathbb{R}^n$ be convex, let $f: C \to \mathbb{R}$ be convex, and let $x_0 \in C$. The subdifferential of $f$ at $x_0$ is the set
\begin{align*}
\partial f(x_0) := \{p \in \mathbb{R}^n : p \text{ is a subgradient of } f \text{ at } x_0\}.
\end{align*}
[/definition]
The notation $\partial f(x_0)$ uses the boundary-like symbol because it collects all supporting hyperplanes at a point. It is a set-valued replacement for the gradient.
[example: Subgradients of the Absolute Value]
Let $f:\mathbb{R}\to\mathbb{R}$ be $f(x)=|x|$. For a point $x_0$, a number $p$ belongs to $\partial f(x_0)$ exactly when
\begin{align*}
|x|\ge |x_0|+p(x-x_0) \quad \text{for every } x\in\mathbb{R}.
\end{align*}
Suppose first that $x_0>0$. If $p=1$, then
\begin{align*}
|x_0|+1(x-x_0)=x_0+x-x_0=x.
\end{align*}
Since $|x|\ge x$ for every real $x$, this proves $1\in\partial f(x_0)$. Conversely, let $p\in\partial f(x_0)$. For any $x>x_0$, we have $x>0$, so $|x|=x$ and $|x_0|=x_0$. The subgradient inequality gives
\begin{align*}
x\ge x_0+p(x-x_0).
\end{align*}
Subtracting $x_0$ gives
\begin{align*}
x-x_0\ge p(x-x_0).
\end{align*}
Because $x-x_0>0$, division gives $p\le 1$. For any $x$ with $0<x<x_0$, the same substitution gives
\begin{align*}
x-x_0\ge p(x-x_0).
\end{align*}
Now $x-x_0<0$, so division reverses the inequality and gives $p\ge 1$. Hence $p=1$, and therefore
\begin{align*}
\partial f(x_0)=\{1\} \quad \text{for } x_0>0.
\end{align*}
Now suppose that $x_0<0$. If $p=-1$, then
\begin{align*}
|x_0|-1(x-x_0)=-x_0-x+x_0=-x.
\end{align*}
Since $|x|\ge -x$ for every real $x$, this proves $-1\in\partial f(x_0)$. Conversely, let $p\in\partial f(x_0)$. For any $x<x_0$, we have $x<0$, so $|x|=-x$ and $|x_0|=-x_0$. The subgradient inequality gives
\begin{align*}
-x\ge -x_0+p(x-x_0).
\end{align*}
Subtracting $-x_0$ gives
\begin{align*}
-(x-x_0)\ge p(x-x_0).
\end{align*}
Because $x-x_0<0$, division reverses the inequality and gives $p\ge -1$. For any $x$ with $x_0<x<0$, the same inequality holds with $x-x_0>0$, so division gives $p\le -1$. Hence $p=-1$, and therefore
\begin{align*}
\partial f(x_0)=\{-1\} \quad \text{for } x_0<0.
\end{align*}
At $x_0=0$, the subgradient inequality becomes
\begin{align*}
|x|\ge px \quad \text{for every } x\in\mathbb{R}.
\end{align*}
If $p\in[-1,1]$ and $x\ge 0$, then $px\le x=|x|$ because $p\le 1$. If $p\in[-1,1]$ and $x<0$, then multiplying $p\ge -1$ by the negative number $x$ gives $px\le -x=|x|$. Thus every $p\in[-1,1]$ is a subgradient at $0$.
Conversely, if $p\in\partial f(0)$, then substituting $x=1$ gives
\begin{align*}
1=|1|\ge p.
\end{align*}
Thus $p\le 1$. Substituting $x=-1$ gives
\begin{align*}
1=|-1|\ge -p.
\end{align*}
Thus $p\ge -1$. Therefore
\begin{align*}
\partial f(0)=[-1,1].
\end{align*}
The corner at the origin is represented by a whole interval of supporting slopes, while away from the corner the absolute value has only the slope of its linear branch.
[/example]
Subgradients give the correct first-order condition for minimization. In a differentiable problem, the gradient vanishes at an unconstrained minimizer. In a convex non-smooth problem, the corresponding condition is that zero belongs to the subdifferential.
[quotetheorem:7942]
This condition is one reason convex non-smooth problems remain tractable. Corners do not destroy first-order methods; they replace equations by inclusions.
## Second-Order Criteria
### Hessians
When a convex function is twice differentiable, convexity becomes a matrix condition. The graph bends upward in every direction exactly when the Hessian has positive semidefinite, meaning that $v\cdot H_f(x)v\ge 0$ for every direction $v$, so there is no negative quadratic direction. To state this test, we isolate the matrix that records all second partial derivatives.
[definition: Hessian Matrix]
Let $U \subset \mathbb{R}^n$ be open and let $f: U \to \mathbb{R}$ be a function with $f \in C^2(U)$. For each $x\in U$, the Hessian matrix of $f$ at $x$ is
\begin{align*}
Hf_x := \left(\frac{\partial^2 f}{\partial x_i\partial x_j}(x)\right)_{1\le i,j\le n}.
\end{align*}
The Hessian map is the function $Hf: U \to \mathbb{R}^{n\times n}$ sending $x$ to $Hf_x$.
[/definition]
The Hessian measures second-order variation. Testing it against a direction $v\in\mathbb{R}^n$ gives the second derivative of $f$ along the line $x+tv$. A symmetric matrix $M$ is positive semidefinite when
\begin{align*}
v\cdot Mv\ge 0
\end{align*}
for every direction $v$. This directional reduction is what lets a multivariable convexity question be checked by a quadratic form at every point.
[quotetheorem:2545]
The semidefinite condition permits flat directions. Many optimization arguments need a quantitative lower curvature bound, so the next notion strengthens ordinary convexity by subtracting a fixed quadratic penalty from every chord inequality.
### Quantitative Curvature
A nonnegative Hessian prevents downward bending, but it may still allow long flat valleys. To obtain stability estimates, convergence rates, and uniqueness from curvature alone, the graph must bend upward by at least a fixed quadratic amount. This stronger requirement is measured by a parameter $\mu>0$, which records how much quadratic curvature is present in every direction.
[definition: Strongly Convex Function]
Let $C \subset \mathbb{R}^n$ be convex and let $\mu>0$. A function $f: C \to \mathbb{R}$ is $\mu$-strongly convex on $C$ if for every $x,y\in C$ and every $t\in[0,1]$,
\begin{align*}
f((1-t)x+ty) \le (1-t)f(x)+t f(y)-\frac{\mu}{2}t(1-t)|x-y|^2.
\end{align*}
[/definition]
Strong convexity says that the graph lies below each chord by a definite quadratic amount. The parameter $\mu$ measures the amount of curvature that remains in every direction. For twice differentiable functions, this lower curvature bound is naturally expressed as a uniform lower bound on the Hessian quadratic form. If $f\in C^2(U)$ and
\begin{align*}
v\cdot Hf_xv\ge \mu |v|^2
\end{align*}
for every $x\in U$ and every direction $v$, then the function $x\mapsto f(x)-\frac{\mu}{2}|x|^2$ has nonnegative Hessian quadratic forms. The Hessian criterion above makes this difference convex, which is exactly the Hessian lower-bound route to $\mu$-strong convexity.
Strong convexity is stronger than uniqueness of a minimizer, but it gives a clean quantitative route to uniqueness. If two different points both minimized a strongly convex function, the midpoint would be forced strictly below the common minimum value.
[quotetheorem:7943]
The hypothesis that a minimizer exists is separate. Strong convexity prevents two different minimizers, but existence may require compactness, coercivity, lower semicontinuity, or finite-dimensional closedness assumptions.
[example: A Strongly Convex Quadratic]
Let $A\in\mathbb{R}^{n\times n}$ be symmetric and suppose there is $\mu>0$ such that $v\cdot A v\ge \mu |v|^2$ for every $v\in\mathbb{R}^n$. Let $b\in\mathbb{R}^n$ and define
\begin{align*}
f(x)=\frac{1}{2}x\cdot A x-b\cdot x.
\end{align*}
Writing $A=(a_{ij})$, symmetry gives $a_{ij}=a_{ji}$, and
\begin{align*}
x\cdot A x=\sum_{i=1}^n\sum_{j=1}^n a_{ij}x_i x_j.
\end{align*}
For the $k$th [partial derivative](/page/Partial%20Derivative),
\begin{align*}
\frac{\partial}{\partial x_k}\left(\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n a_{ij}x_i x_j\right)=\frac{1}{2}\sum_{j=1}^n a_{kj}x_j+\frac{1}{2}\sum_{i=1}^n a_{ik}x_i.
\end{align*}
Since $a_{ik}=a_{ki}$, the second sum equals $\sum_{i=1}^n a_{ki}x_i$, so
\begin{align*}
\frac{\partial}{\partial x_k}\left(\frac{1}{2}x\cdot A x\right)=\sum_{j=1}^n a_{kj}x_j=(Ax)_k.
\end{align*}
Also,
\begin{align*}
\frac{\partial}{\partial x_k}(b\cdot x)=\frac{\partial}{\partial x_k}\left(\sum_{j=1}^n b_jx_j\right)=b_k.
\end{align*}
Therefore
\begin{align*}
\nabla f(x)=Ax-b.
\end{align*}
Differentiating once more gives
\begin{align*}
\frac{\partial^2 f}{\partial x_\ell\partial x_k}(x)=\frac{\partial}{\partial x_\ell}\left((Ax)_k-b_k\right)=\frac{\partial}{\partial x_\ell}\left(\sum_{j=1}^n a_{kj}x_j-b_k\right)=a_{k\ell}.
\end{align*}
Thus $Hf_x=A$ for every $x\in\mathbb{R}^n$. The assumed inequality
\begin{align*}
v\cdot Hf_x v=v\cdot A v\ge \mu |v|^2
\end{align*}
holds for every $x$ and $v$, so the Hessian lower-bound observation shows that $f$ is $\mu$-strongly convex on $\mathbb{R}^n$.
The first-order equation $\nabla f(x)=0$ is therefore
\begin{align*}
Ax-b=0.
\end{align*}
Equivalently,
\begin{align*}
Ax=b.
\end{align*}
If $x_\ast$ satisfies $Ax_\ast=b$, then for any $x\in\mathbb{R}^n$,
\begin{align*}
f(x)-f(x_\ast)=\frac{1}{2}x\cdot Ax-b\cdot x-\frac{1}{2}x_\ast\cdot Ax_\ast+b\cdot x_\ast.
\end{align*}
Substituting $b=Ax_\ast$ gives
\begin{align*}
f(x)-f(x_\ast)=\frac{1}{2}x\cdot Ax-(Ax_\ast)\cdot x-\frac{1}{2}x_\ast\cdot Ax_\ast+(Ax_\ast)\cdot x_\ast.
\end{align*}
Because $A$ is symmetric, $(Ax_\ast)\cdot x=x_\ast\cdot Ax$, and $(Ax_\ast)\cdot x_\ast=x_\ast\cdot Ax_\ast$. Hence
\begin{align*}
f(x)-f(x_\ast)=\frac{1}{2}x\cdot Ax-x_\ast\cdot Ax+\frac{1}{2}x_\ast\cdot Ax_\ast.
\end{align*}
Expanding the quadratic form of $x-x_\ast$ gives
\begin{align*}
(x-x_\ast)\cdot A(x-x_\ast)=x\cdot Ax-x\cdot Ax_\ast-x_\ast\cdot Ax+x_\ast\cdot Ax_\ast.
\end{align*}
Again using symmetry, $x\cdot Ax_\ast=x_\ast\cdot Ax$, so
\begin{align*}
\frac{1}{2}(x-x_\ast)\cdot A(x-x_\ast)=\frac{1}{2}x\cdot Ax-x_\ast\cdot Ax+\frac{1}{2}x_\ast\cdot Ax_\ast.
\end{align*}
Therefore
\begin{align*}
f(x)-f(x_\ast)=\frac{1}{2}(x-x_\ast)\cdot A(x-x_\ast)\ge \frac{\mu}{2}|x-x_\ast|^2\ge 0.
\end{align*}
Thus any solution of $Ax=b$ is a global minimizer. Since strong convexity gives at most one global minimizer, that solution is the unique global minimizer.
[/example]
## Operations Preserving Convexity
### Sums and Suprema
Convex functions are useful partly because they form a stable class. We can build complicated convex objectives from simple ones without checking the defining inequality from scratch each time. The most basic closure rule says that nonnegative weighted sums preserve the chord inequality.
[quotetheorem:7944]
Another basic construction takes the largest value among many convex functions. This operation is needed for maximum losses, support functions, and worst-case objectives, where the active constraint or active affine piece may change from point to point. The important question is whether taking a pointwise supremum can introduce non-convex bends.
[quotetheorem:7945]
Taking a supremum is how support functions, pointwise maximum losses, and dual norms arise. The price is that the resulting function may be non-smooth even when every input function is smooth.
[example: Maximum of Affine Functions]
Let $a_1,\ldots,a_m\in\mathbb{R}^n$ and $b_1,\ldots,b_m\in\mathbb{R}$. For each $i$, write
\begin{align*}
\phi_i(x)=a_i\cdot x+b_i.
\end{align*}
Then
\begin{align*}
f(x)=\max_{1\le i\le m}\phi_i(x).
\end{align*}
We first check that each $\phi_i$ is convex. For $x,y\in\mathbb{R}^n$ and $t\in[0,1]$,
\begin{align*}
\phi_i((1-t)x+ty)=a_i\cdot((1-t)x+ty)+b_i.
\end{align*}
By bilinearity of the dot product,
\begin{align*}
a_i\cdot((1-t)x+ty)=(1-t)a_i\cdot x+t a_i\cdot y.
\end{align*}
Since $(1-t)+t=1$,
\begin{align*}
b_i=(1-t)b_i+tb_i.
\end{align*}
Therefore
\begin{align*}
\phi_i((1-t)x+ty)=(1-t)(a_i\cdot x+b_i)+t(a_i\cdot y+b_i).
\end{align*}
That is,
\begin{align*}
\phi_i((1-t)x+ty)=(1-t)\phi_i(x)+t\phi_i(y),
\end{align*}
so each affine function $\phi_i$ satisfies the convexity inequality with equality.
Because a finite maximum is a finite supremum, *Supremum Preserves Convexity* applies to the convex functions $\phi_1,\ldots,\phi_m$ and gives that $f$ is convex on $\mathbb{R}^n$. On any region where one index $k$ satisfies $\phi_k(x)\ge \phi_i(x)$ for every $i$, the maximum is exactly $f(x)=\phi_k(x)=a_k\cdot x+b_k$. Thus $f$ is built from affine pieces, and its corners occur along the tie sets where two or more affine pieces attain the same maximum value.
[/example]
### Composition
Composition is more delicate than addition or supremum. A convex function followed by another convex function need not remain convex, because the outer function may reverse the inequality created by the inner one. The useful rule adds monotonicity so that the inequality direction is preserved.
[quotetheorem:7946]
The monotonicity condition is not decorative. Without it, the outer function may reverse the inequality supplied by the inner convex function.
[example: Composition Can Fail Without Monotonicity]
Let $f:\mathbb{R}\to\mathbb{R}$ be $f(x)=x^2$ and let $g:\mathbb{R}\to\mathbb{R}$ be $g(y)=-y$. For any $u,v\in\mathbb{R}$ and $t\in[0,1]$,
\begin{align*}
g((1-t)u+tv)=-((1-t)u+tv)=-(1-t)u-tv=(1-t)(-u)+t(-v)=(1-t)g(u)+tg(v).
\end{align*}
Thus $g$ satisfies the convexity inequality with equality, so $g$ is convex. It is also decreasing: if $u<v$, then $-u>-v$, hence $g(u)>g(v)$.
The composition is
\begin{align*}
(g\circ f)(x)=g(f(x))=g(x^2)=-x^2.
\end{align*}
To test convexity at the midpoint of $-1$ and $1$, first compute
\begin{align*}
\frac{-1+1}{2}=0.
\end{align*}
The midpoint value is
\begin{align*}
(g\circ f)(0)=-0^2=0.
\end{align*}
The endpoint values are
\begin{align*}
(g\circ f)(-1)=-(-1)^2=-1
\end{align*}
and
\begin{align*}
(g\circ f)(1)=-1^2=-1.
\end{align*}
Their average is
\begin{align*}
\frac{(g\circ f)(-1)+(g\circ f)(1)}{2}=\frac{-1+(-1)}{2}=\frac{-2}{2}=-1.
\end{align*}
Therefore
\begin{align*}
(g\circ f)(0)=0>-1=\frac{(g\circ f)(-1)+(g\circ f)(1)}{2}.
\end{align*}
At $t=1/2$, convexity would require the midpoint value to be at most the average endpoint value, so $g\circ f$ is not convex on $\mathbb{R}$. This shows that convexity of the outer function alone does not preserve convexity under composition; the missing monotonicity condition matters.
[/example]
## Minima and Optimization
### Local and Global Minima
Convexity is central in optimization because it turns local information into global information. A function with two separated valleys can have local minimizers that are not global; a convex function cannot hide a lower valley behind a hill. To state that contrast, we first need the metric language for saying that a competitor is nearby.
[definition: Open Ball]
Let $x_0\in\mathbb{R}^n$ and let $r>0$. The open ball with centre $x_0$ and radius $r$ is
\begin{align*}
B(x_0,r) := \{x\in\mathbb{R}^n: |x-x_0|<r\}.
\end{align*}
[/definition]
An open ball is the neighbourhood in which local optimality is tested. In a non-convex landscape, a point can be unbeatable among all sufficiently nearby feasible points even though a much lower value occurs elsewhere.
The formal notion we need must therefore include both the feasible set and the radius of comparison. It should say that once a sufficiently small ball around the candidate point is chosen, no feasible point inside that ball has a smaller value. This is the role of local minimality.
[definition: Local Minimizer]
Let $C \subset \mathbb{R}^n$ and let $f:C\to\mathbb{R}$. A point $x_0\in C$ is a local minimizer of $f$ on $C$ if there exists $r>0$ such that
\begin{align*}
f(x_0)\le f(x)
\end{align*}
for every $x\in C\cap B(x_0,r)$.
[/definition]
Local minimality only tests nearby competitors. Convexity lets us compare a nearby point on the segment from $x_0$ to any distant competitor, but the conclusion we want is stronger than local minimality. We therefore need the global version, which tests every feasible competitor at once.
[definition: Global Minimizer]
Let $C \subset \mathbb{R}^n$ and let $f:C\to\mathbb{R}$. A point $x_0\in C$ is a global minimizer of $f$ on $C$ if
\begin{align*}
f(x_0)\le f(x)
\end{align*}
for every $x\in C$.
[/definition]
The central optimization question is whether checking nearby competitors can ever be enough to guarantee comparison with all competitors. For convex functions the answer is yes: a hypothetical distant point with lower value would create, along the segment from the local minimizer to that point, nearby points with lower value as well.
[quotetheorem:1975]
This theorem is the reason convex optimization is qualitatively different from general nonlinear optimization. Search algorithms still have numerical difficulties, but they are not fighting false local minima.
### Uniqueness and Existence
Globality is not the same as uniqueness. A convex function may have a whole flat set of minimizers, so a stronger curvature or strictness assumption is needed when a single optimizer is desired.
[quotetheorem:7947]
Strict convexity rules out multiple minimizers, while ordinary convexity permits flat faces of minimizers. This distinction matters in linear programming, where whole faces of a polytope may minimize the objective.
[example: A Convex Function with Many Minimizers]
Let $f:\mathbb{R}\to\mathbb{R}$ be defined by
\begin{align*}
f(x)=\max\{|x|-1,0\}.
\end{align*}
For every real $x$, $|x|=\max\{x,-x\}$, so subtracting $1$ from both entries gives
\begin{align*}
|x|-1=\max\{x-1,-x-1\}.
\end{align*}
Therefore
\begin{align*}
f(x)=\max\{x-1,-x-1,0\}.
\end{align*}
Write $\phi_1(x)=x-1$, $\phi_2(x)=-x-1$, and $\phi_3(x)=0$. For $a,b\in\mathbb{R}$ and $t\in[0,1]$,
\begin{align*}
\phi_1((1-t)a+tb)=(1-t)a+tb-1.
\end{align*}
Since $(1-t)+t=1$,
\begin{align*}
(1-t)\phi_1(a)+t\phi_1(b)=(1-t)(a-1)+t(b-1)=(1-t)a+tb-1.
\end{align*}
Thus $\phi_1((1-t)a+tb)=(1-t)\phi_1(a)+t\phi_1(b)$. Similarly,
\begin{align*}
\phi_2((1-t)a+tb)=-((1-t)a+tb)-1=-(1-t)a-tb-1.
\end{align*}
Also,
\begin{align*}
(1-t)\phi_2(a)+t\phi_2(b)=(1-t)(-a-1)+t(-b-1)=-(1-t)a-tb-1.
\end{align*}
Thus $\phi_2$ is convex, and $\phi_3$ is convex because
\begin{align*}
\phi_3((1-t)a+tb)=0=(1-t)0+t0=(1-t)\phi_3(a)+t\phi_3(b).
\end{align*}
Since $f=\max\{\phi_1,\phi_2,\phi_3\}$, the finite case of *Supremum Preserves Convexity* shows that $f$ is convex.
For every $x\in\mathbb{R}$, the maximum defining $f(x)$ includes $0$, so
\begin{align*}
f(x)=\max\{|x|-1,0\}\ge 0.
\end{align*}
If $x\in[-1,1]$, then $|x|\le 1$, hence
\begin{align*}
|x|-1\le 0.
\end{align*}
Therefore
\begin{align*}
f(x)=\max\{|x|-1,0\}=0.
\end{align*}
So the minimum value is $0$, and every $x\in[-1,1]$ is a global minimizer. This convex function has an entire interval of minimizers, so convexity alone does not force uniqueness.
[/example]
Existence is a different issue from uniqueness or globality. Even a convex function may fail to attain its infimum if the domain is open or if the function escapes to its limiting value only at infinity.
[example: Convex Infimum Not Attained]
Let $f:(0,\infty)\to\mathbb{R}$ be $f(x)=x$. The domain is convex: if $x,y\in(0,\infty)$ and $t\in[0,1]$, then $1-t\ge 0$, $t\ge 0$, and
\begin{align*}
(1-t)x+ty>0
\end{align*}
because $x>0$, $y>0$, and $(1-t)+t=1$. Hence $(1-t)x+ty\in(0,\infty)$.
The function is convex on this domain, since for $x,y\in(0,\infty)$ and $t\in[0,1]$,
\begin{align*}
f((1-t)x+ty)=(1-t)x+ty
\end{align*}
and
\begin{align*}
(1-t)f(x)+tf(y)=(1-t)x+ty.
\end{align*}
Therefore
\begin{align*}
f((1-t)x+ty)=(1-t)f(x)+tf(y),
\end{align*}
so the convexity inequality holds with equality.
For every $x>0$,
\begin{align*}
f(x)=x>0,
\end{align*}
so $0$ is a lower bound for the set of values $\{f(x):x>0\}$. If $\varepsilon>0$, then $\varepsilon/2>0$ and
\begin{align*}
f(\varepsilon/2)=\varepsilon/2<\varepsilon.
\end{align*}
Thus no positive number can be a lower bound, and hence
\begin{align*}
\inf_{x>0} f(x)=0.
\end{align*}
However, no point of the domain attains this value: if $x\in(0,\infty)$, then $f(x)=x>0$. Equivalently, for any candidate $x_0>0$,
\begin{align*}
f(x_0/2)=x_0/2<x_0=f(x_0),
\end{align*}
so $x_0$ is not a global minimizer. This convex function has an infimum but no minimizer, so convexity alone does not supply the compactness or closedness needed for existence.
[/example]
## Beyond and Connected Topics
Convex functions are the entry point to [Cambridge IB Optimisation](/page/Cambridge%20IB%20Optimisation), where the abstract geometry becomes algorithms, duality, Lagrange multipliers, and constrained minimization. The subgradient condition on this page is the first-order shadow of the Karush-Kuhn-Tucker conditions.
In [Cambridge II Analysis of Functions](/page/Cambridge%20II%20Analysis%20of%20Functions), convexity connects to differentiability, absolute continuity, weak derivatives, and compactness. Convex integrands also appear in lower semicontinuity theorems in the [calculus of variations](/page/Calculus%20of%20Variations), where convexity in the gradient variable is often the structural assumption that makes minimizers exist.
Convex functions appear in [Cambridge II Mathematics of Machine Learning](/page/Cambridge%20II%20Mathematics%20of%20Machine%20Learning) through empirical risk minimization, regularization, margin losses, and gradient-based algorithms. Strong convexity supplies quantitative convergence rates and uniqueness of learned parameters in many idealized models.
In geometric measure theory, convexity interacts with [functions of bounded variation](/page/Functions%20of%20Bounded%20Variation) and variational relaxation. The notes [Geometric Measure Theory III: BV Functions and Sets of Finite Perimeter](/page/Geometric%20Measure%20Theory%20III%3A%20BV%20Functions%20and%20Sets%20of%20Finite%20Perimeter) develop settings where non-smooth functions and weak compactness become unavoidable, and convexity is one of the mechanisms that keeps variational problems well behaved.
A natural next direction is convex duality. The Legendre-Fenchel transform converts a convex function into a dual convex function on the dual vector space, replacing a function by its supporting affine functions. This viewpoint explains dual norms, conjugate losses, and many optimization duality theorems.
Another direction is [convex geometry](/page/Convex%20Geometry). Support functions, separation theorems, polar bodies, and exposed faces translate between convex sets and convex functions. The epigraph construction on this page is the first version of that dictionary.
## References
Androma, [Cambridge IB Optimisation](/page/Cambridge%20IB%20Optimisation).
Androma, [Cambridge II Analysis of Functions](/page/Cambridge%20II%20Analysis%20of%20Functions).
Androma, [Cambridge II Mathematics of Machine Learning](/page/Cambridge%20II%20Mathematics%20of%20Machine%20Learning).
Androma, [Geometric Measure Theory III: BV Functions and Sets of Finite Perimeter](/page/Geometric%20Measure%20Theory%20III%3A%20BV%20Functions%20and%20Sets%20of%20Finite%20Perimeter).
Rockafellar, *Convex Analysis* (1970).
Boyd and Vandenberghe, *Convex Optimization* (2004).
Hiriart-Urruty and Lemarechal, *Fundamentals of Convex Analysis* (2001).