This course develops the mathematical foundations of convex optimisation, with an emphasis on the theory that makes convex models both tractable and broadly useful in applied problems. It explains why convexity matters, how it shapes the geometry of feasible sets and objective functions, and how it leads to powerful existence, optimality, and duality results. The aim is to give a coherent theory for understanding when optimisation problems are well posed, how solutions can be characterised, and how structure can be exploited in modelling and analysis.
The chapters build from basic geometry to advanced duality and specialised problem classes. The opening chapters introduce convex sets, separation, convex functions, and subdifferentials, then use these tools to study existence of solutions and the formulation of constrained optimisation problems. From there, the course develops Lagrangian duality, KKT conditions, sensitivity analysis, Fenchel duality, and infimal convolution, showing how primal problems can be analysed through dual viewpoints. Later chapters specialise these ideas to linear, polyhedral, conic, second-order cone, and semidefinite programming, and then move to barrier methods, interior-point ideas, and central paths.
The final chapter connects the theory to applications and modelling principles, showing how real problems can be expressed in convex form and solved efficiently. Across the course, the recurring themes are geometry, duality, optimality conditions, and problem structure, with each chapter adding a layer that supports the next.
# Introduction
This opening chapter fixes the scope and language of the course before the technical development begins. Convex optimisation sits at the meeting point of geometry, analysis, and computation: the geometric side explains why convex feasible regions can be separated and supported by hyperplanes, the analytic side describes functions through epigraphs and subgradients, and the optimisation side turns these structures into optimality and duality statements. The course is finite-dimensional throughout, but it uses enough topology and functional language to make Chapter 10's barrier and central-path material mathematically natural.
The main question is: what special structure makes a minimisation problem tractable at the level of theory? Convexity replaces arbitrary local behaviour by global inequalities. It also gives a disciplined way to pass between primal problems, dual certificates, and conic formulations.
## The Shape of a Convex Optimisation Problem
Before defining the objects in detail, we need a template for the questions the course will answer. A typical optimisation problem asks for the smallest value of a function over a constrained set, but without convexity there is little reason for a local certificate to determine the global answer. The convex setting is designed so that first-order inequalities, separating hyperplanes, and dual variables can certify optimality.
[definition: Optimisation Problem]
An optimisation problem consists of a set $X$, a function $f:X\to \mathbb R\cup\{+\infty\}$, and the task of computing or characterising
\begin{align*}
\inf_{x\in X} f(x).
\end{align*}
[/definition]
The elements of $X$ are called feasible points: they are exactly the points allowed by the constraints of the problem. This terminology becomes important once constraints are written separately from the objective.
The extended value $+\infty$ is a useful bookkeeping device: constraints can be folded into the objective by assigning infinite cost outside the feasible region. To see why the course needs more than this bookkeeping, we first look at a familiar problem where the feasible set is all of Euclidean space and the geometry is hidden inside a quadratic objective.
[example: Least Squares As Optimisation]
Let $A\in\mathbb R^{m\times n}$ and $b\in\mathbb R^m$. The least-squares problem is the optimisation problem
\begin{align*}
\inf_{x\in\mathbb R^n} |Ax-b|^2.
\end{align*}
Its objective $f(x)=|Ax-b|^2$ is convex because, for $u=Ax-b$, $v=Ay-b$, and $t\in[0,1]$, we have $A(tx+(1-t)y)-b=tu+(1-t)v$. Expanding the squared norm gives
\begin{align*}
|tu+(1-t)v|^2=t^2|u|^2+2t(1-t)u^\top v+(1-t)^2|v|^2.
\end{align*}
Also,
\begin{align*}
|u-v|^2=|u|^2-2u^\top v+|v|^2,
\end{align*}
so
\begin{align*}
|tu+(1-t)v|^2=t|u|^2+(1-t)|v|^2-t(1-t)|u-v|^2.
\end{align*}
Since $t(1-t)|u-v|^2\ge 0$, it follows that
\begin{align*}
f(tx+(1-t)y)\le tf(x)+(1-t)f(y).
\end{align*}
The normal equations come from expanding $f$ at $x$ in a direction $h$. We have
\begin{align*}
f(x+h)=|(Ax-b)+Ah|^2.
\end{align*}
Expanding the square gives
\begin{align*}
f(x+h)=f(x)+2(Ax-b)^\top Ah+|Ah|^2.
\end{align*}
Since $(Ax-b)^\top Ah=h^\top A^\top(Ax-b)$, the linear term in $h$ is $2h^\top A^\top(Ax-b)$, and therefore
\begin{align*}
\nabla f(x)=2A^\top(Ax-b).
\end{align*}
Thus a stationary point satisfies
\begin{align*}
A^\top(Ax-b)=0,
\end{align*}
which is equivalently
\begin{align*}
A^\top A x=A^\top b.
\end{align*}
If $x^\star$ satisfies the normal equations, then $A^\top(Ax^\star-b)=0$. For every $h\in\mathbb R^n$,
\begin{align*}
f(x^\star+h)=f(x^\star)+2h^\top A^\top(Ax^\star-b)+|Ah|^2.
\end{align*}
The middle term is $0$, so
\begin{align*}
f(x^\star+h)=f(x^\star)+|Ah|^2.
\end{align*}
Since $|Ah|^2\ge 0$, every solution of $A^\top A x=A^\top b$ is a global minimiser.
By contrast, for the smooth nonconvex function $g(x)=x^2+\sin(5x)$, differentiation gives
\begin{align*}
g'(x)=2x+5\cos(5x).
\end{align*}
The equation $2x+5\cos(5x)=0$ only says that the first-order term vanishes at that point. It does not produce an identity of the form $g(x+h)=g(x)+\text{nonnegative term}$ for all $h$, so stationarity alone does not certify global minimality.
[/example]
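The identity $f(x^\star+h)=f(x^\star)+|Ah|^2$ can be checked numerically on a small instance. The following is a minimal sketch in plain Python, assuming a hypothetical $3\times 2$ matrix `A` and vector `b` chosen only for illustration; the helper names `matvec`, `solve_normal_equations`, and `gap` are not part of the course notation.

```python
# Hypothetical data: a 3x2 matrix A with independent columns and a vector b.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def f(x):
    """Least-squares objective |Ax - b|^2."""
    return sum((r_i - b_i) ** 2 for r_i, b_i in zip(matvec(A, x), b))

def solve_normal_equations():
    """Solve A^T A x = A^T b for a two-column A by Cramer's rule."""
    g11 = sum(row[0] * row[0] for row in A)
    g12 = sum(row[0] * row[1] for row in A)
    g22 = sum(row[1] * row[1] for row in A)
    r1 = sum(row[0] * b_i for row, b_i in zip(A, b))
    r2 = sum(row[1] * b_i for row, b_i in zip(A, b))
    det = g11 * g22 - g12 * g12  # nonzero here: the columns of A are independent
    return [(r1 * g22 - g12 * r2) / det, (g11 * r2 - g12 * r1) / det]

x_star = solve_normal_equations()

def gap(h):
    """f(x* + h) - f(x*) - |Ah|^2, which the identity predicts to be zero."""
    shifted = [x_i + h_i for x_i, h_i in zip(x_star, h)]
    Ah = matvec(A, h)
    return f(shifted) - f(x_star) - sum(v * v for v in Ah)
```

For any direction `h`, the value `gap(h)` should vanish up to rounding error, which is the numerical counterpart of the statement that every solution of the normal equations is a global minimiser.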
This example is smooth and unconstrained, so it hides much of the course. The general class studied here must also include nonsmooth objectives and feasible regions cut out by inequalities, so we now isolate the two convexity assumptions that make global certificates possible.
[definition: Convex Optimisation Problem]
A convex optimisation problem is an optimisation problem of the form
\begin{align*}
\inf_{x\in C} f(x),
\end{align*}
where $C\subset \mathbb R^n$ is a convex set and $f:C\to \mathbb R\cup\{+\infty\}$ is a convex function.
[/definition]
The definition isolates two sources of structure: feasible points can be averaged without leaving the feasible region, and objective values are controlled along line segments. The next basic model keeps the objective linear but puts the structure into a polyhedral feasible set.
[example: Linear Programming Template]
Let $A\in \mathbb R^{m\times n}$, $b\in \mathbb R^m$, and $c\in \mathbb R^n$. The linear programme
\begin{align*}
\inf\{c^\top x: Ax=b,\ x_i\ge 0\text{ for }1\le i\le n\}
\end{align*}
has feasible set
\begin{align*}
C=\{x\in\mathbb R^n:Ax=b,\ x_i\ge 0\text{ for }1\le i\le n\}.
\end{align*}
To see the convex structure explicitly, take $x,y\in C$ and $t\in[0,1]$. Since $Ax=b$ and $Ay=b$, linearity of matrix multiplication gives
\begin{align*}
A(tx+(1-t)y)=tAx+(1-t)Ay.
\end{align*}
Substituting $Ax=b$ and $Ay=b$ gives
\begin{align*}
tAx+(1-t)Ay=tb+(1-t)b.
\end{align*}
Factoring $b$ gives
\begin{align*}
tb+(1-t)b=(t+1-t)b=b.
\end{align*}
Hence
\begin{align*}
A(tx+(1-t)y)=b.
\end{align*}
For each coordinate $i$,
\begin{align*}
(tx+(1-t)y)_i=tx_i+(1-t)y_i.
\end{align*}
Because $t\ge 0$, $1-t\ge 0$, $x_i\ge 0$, and $y_i\ge 0$, we have
\begin{align*}
tx_i+(1-t)y_i\ge 0.
\end{align*}
Thus $tx+(1-t)y\in C$, so the feasible set is convex.
The objective is linear, and along the same segment it satisfies
\begin{align*}
c^\top(tx+(1-t)y)=t c^\top x+(1-t)c^\top y.
\end{align*}
In particular, the convexity inequality holds (in fact with equality):
\begin{align*}
c^\top(tx+(1-t)y)\le t c^\top x+(1-t)c^\top y.
\end{align*}
This is the basic linear programming template: the constraints give a polyhedral convex feasible region, while the objective has no curvature of its own. Later duality turns the equations $Ax=b$ and inequalities $x_i\ge 0$ into multipliers that can certify the value of an optimal solution when such a solution exists.
[/example]
Linear programming is the model case for duality, but the course is not restricted to polyhedra. Cones, epigraphs, and closure operations let many apparently different problems be treated by the same geometric machinery.
## Geometry as the Source of Certificates
The first technical block of the course asks how a convex set can be recognised from outside. If a point is not in a closed convex set, the theory should produce a linear inequality that every point of the set satisfies but the outside point violates. Such inequalities are the geometric ancestors of Lagrange multipliers and dual certificates.
[definition: Convex Set]
A set $C\subset \mathbb R^n$ is convex if for every $x,y\in C$ and every $t\in[0,1]$,
\begin{align*}
tx+(1-t)y\in C.
\end{align*}
[/definition]
Convexity says that line segments are internal to the set, which gives a way to compare an excluded point with every point of the set at once. The first major payoff is separation: outside points can be detected by a single linear functional, giving the simplest form of an optimisation certificate. This principle is the prototype for many later statements: infeasibility or non-optimality becomes visible through a linear inequality. Closedness prevents the outside point from merely lying on a missing boundary point, and convexity prevents different parts of the set from requiring incompatible separating directions.
The next example shows that for polyhedra the separating inequality may already be one of the constraints.
[example: Separating A Point From A Halfspace Intersection]
Consider the polyhedron
\begin{align*}
C=\{x\in\mathbb R^2:x_1\ge 0,\ x_2\ge 0,\ x_1+x_2\le 1\}
\end{align*}
and the point $x_0=(1,1)$. We show that the existing constraint $x_1+x_2\le 1$ separates $x_0$ from $C$.
Define $a=(1,1)\in\mathbb R^2$ and $\alpha=1$. If $x=(x_1,x_2)\in C$, then the defining inequalities for $C$ include
\begin{align*}
x_1+x_2\le 1.
\end{align*}
Since
\begin{align*}
a^\top x=(1,1)^\top(x_1,x_2)=1\cdot x_1+1\cdot x_2=x_1+x_2,
\end{align*}
every $x\in C$ satisfies
\begin{align*}
a^\top x\le \alpha.
\end{align*}
For the outside point,
\begin{align*}
a^\top x_0=(1,1)^\top(1,1)=1\cdot 1+1\cdot 1=2.
\end{align*}
Thus
\begin{align*}
a^\top x\le 1<2=a^\top x_0
\end{align*}
for every $x\in C$. The separating hyperplane is the line $x_1+x_2=1$, so in this polyhedral case one of the original constraints is already the separating certificate.
[/example]
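The separating inequality of the example above can be probed on sampled points. This is an illustrative sketch only: sampling cannot prove the inequality, but it shows the roles of $a$, $\alpha$, and $x_0$. The function names `in_C` and `dot` and the sampling scheme are assumptions made for the illustration.

```python
import random

a = (1.0, 1.0)    # normal vector of the separating line x1 + x2 = 1
alpha = 1.0       # every x in C satisfies a^T x <= alpha
x0 = (1.0, 1.0)   # the outside point

def in_C(x):
    """Membership in C = {x : x1 >= 0, x2 >= 0, x1 + x2 <= 1}."""
    return x[0] >= 0 and x[1] >= 0 and x[0] + x[1] <= 1

def dot(u, v):
    """Euclidean inner product in the plane."""
    return u[0] * v[0] + u[1] * v[1]

# Sample the unit square and keep the points that land in C.
random.seed(0)
samples = [(random.random(), random.random()) for _ in range(1000)]
points_in_C = [x for x in samples if in_C(x)]

# Largest value of a^T x seen on C; it never exceeds alpha.
max_on_C = max(dot(a, x) for x in points_in_C)
```

Here `max_on_C` stays at or below `alpha`, while `dot(a, x0)` equals $2$, so the same linear functional takes a strictly larger value at the outside point than anywhere on the sampled part of $C$.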
Separation becomes more delicate when sets are not full-dimensional. For that reason the course uses affine hulls and relative interiors, rather than relying only on ordinary interiors in $\mathbb R^n$.
## Functions Through Epigraphs and Subgradients
The second technical block asks how convexity of sets becomes convexity of functions. A function is convex when the region above its graph is convex, and this geometric viewpoint is better suited to nonsmooth objectives than a derivative-based definition.
[definition: Epigraph]
Let $C\subset \mathbb R^n$ and let $f:C\to \mathbb R\cup\{+\infty\}$. The epigraph of $f$ is
\begin{align*}
\operatorname{epi}(f)=\{(x,r)\in C\times\mathbb R: f(x)\le r\}.
\end{align*}
[/definition]
The epigraph turns a question about function values into a question about a subset of a higher-dimensional Euclidean space. For this translation to be useful, the function must obey the same chord-stability that convex sets obey: averaging two inputs should not force the graph above the averaged height. The formal condition below records precisely this no-bending-above-chords requirement, which is what later lets separation produce affine lower supports.
[definition: Convex Function]
Let $C\subset \mathbb R^n$ be convex. A function $f:C\to \mathbb R\cup\{+\infty\}$ is convex if for all $x,y\in C$ and all $t\in[0,1]$,
\begin{align*}
f(tx+(1-t)y)\le t f(x)+(1-t)f(y).
\end{align*}
[/definition]
This inequality says that secant lines lie above the graph, but it is often too global to check directly. In a differentiable problem one would rather use the tangent plane at a point, and the obstruction is that a tangent plane for a general differentiable function may only describe local behaviour. Convexity removes this obstruction: a valid tangent inequality must hold against every other point of the domain, turning a local derivative into a global affine lower bound.
[quotetheorem:6666]
[citeproof:6666]
This theorem explains why gradients can certify global optimality in the convex differentiable case, but each hypothesis controls a specific obstruction. The convexity of $U$ ensures that the line segment joining $x$ to $y$ remains inside the domain, so the one-dimensional restriction used in the proof is legitimate. Openness lets the derivative at $x$ see small perturbations in every direction; at a boundary point of a closed interval, the same formula may require one-sided information instead. Differentiability is also essential for this gradient formulation: $f(x)=|x|$ is convex on $\mathbb R$, but at $0$ there is no single gradient giving the supporting inequality. The nonsmooth analogue replaces $\nabla f(x)$ by a set of possible supporting slopes, and the absolute value gives the smallest useful example.
[example: Absolute Value Subgradients]
For $f:\mathbb R\to\mathbb R$ given by $f(x)=|x|$, we compute the slopes $s\in\mathbb R$ for which the affine function $y\mapsto sy$ supports $f$ at $0$, meaning
\begin{align*}
|y|\ge sy
\end{align*}
for every $y\in\mathbb R$.
First suppose $s\in[-1,1]$. If $y\ge 0$, then $|y|=y$, and $s\le 1$ gives
\begin{align*}
sy\le 1\cdot y=y=|y|.
\end{align*}
If $y<0$, then $|y|=-y$. Since $s\ge -1$ and $y<0$, multiplying the inequality $s\ge -1$ by $y$ reverses the inequality, so
\begin{align*}
sy\le (-1)y=-y=|y|.
\end{align*}
Thus every $s\in[-1,1]$ satisfies $|y|\ge sy$ for all $y\in\mathbb R$.
Conversely, suppose $s$ has the supporting property $|y|\ge sy$ for every $y$. Taking $y=1$ gives
\begin{align*}
1=|1|\ge s\cdot 1=s,
\end{align*}
so $s\le 1$. Taking $y=-1$ gives
\begin{align*}
1=|-1|\ge s(-1)=-s.
\end{align*}
Multiplying $1\ge -s$ by $-1$ reverses the inequality and gives
\begin{align*}
-1\le s.
\end{align*}
Hence $s\in[-1,1]$. The supporting slopes at the corner $0$ are exactly the interval $[-1,1]$, which is why a nonsmooth convex function has a set of first-order slopes rather than a single derivative.
[/example]
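The characterisation of the supporting slopes at $0$ can be tested on a finite grid. A grid check can only refute a candidate slope, never prove that one is valid, so the sketch below is an illustration of the statement and not a replacement for the argument above. The name `supports_abs_at_zero` and the grid are assumptions.

```python
def supports_abs_at_zero(s, grid):
    """Return True if |y| >= s * y holds at every grid point y."""
    return all(abs(y) >= s * y for y in grid)

# Grid of test points in [-5, 5], including y = 1 and y = -1.
grid = [k / 10 for k in range(-50, 51)]

# Slopes inside [-1, 1] pass; slopes outside fail at y = 1 or y = -1.
inside = [supports_abs_at_zero(s, grid) for s in (-1.0, -0.5, 0.0, 0.5, 1.0)]
outside = [supports_abs_at_zero(s, grid) for s in (-1.1, 1.1)]
```

The list `inside` contains only `True` and the list `outside` only `False`, matching the interval $[-1,1]$ found in the example.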
## Optimality, Duality, and Cones
The central optimisation question is not only whether a minimiser exists, but how to prove that a candidate is optimal. Convexity permits certificates: a point is optimal when zero belongs to a suitable first-order object, and a lower bound is sharp when a dual problem attains the same value.
[definition: Primal Value]
For a convex optimisation problem $\inf_{x\in C} f(x)$, the primal value is
\begin{align*}
p^\star=\inf_{x\in C} f(x).
\end{align*}
[/definition]
The value $p^\star$ records the best achievable objective level, whether or not a minimiser exists. To prove statements about this value without already knowing a minimiser, the course needs a language for lower bounds that are justified uniformly over all feasible points.
[definition: Dual Certificate]
A dual certificate for a lower bound $\beta\in\mathbb R$ on a minimisation problem is data whose associated inequalities imply
\begin{align*}
\beta\le f(x)
\end{align*}
for every feasible point $x$.
[/definition]
This definition is intentionally broad at the introductory stage: later chapters make the data explicit as Lagrange multipliers, polar cone elements, Fenchel conjugate inequalities, or conic dual variables. The point of naming the idea now is to separate two tasks that are often confused: finding a candidate minimiser and proving that no feasible point can do better. Once a certificate has produced a valid lower bound, the comparison with the primal value is immediate: if $\beta\le f(x)$ for every feasible $x$, then $\beta$ is a lower bound for the set of feasible values, and since $p^\star$ is the greatest such lower bound, $\beta\le p^\star$. This elementary inequality is the weak-duality principle in its most general form.
[Weak duality](/theorems/2549) is only the beginning. Its strength is that it has almost no hypotheses: the certificate only needs to prove a genuine lower bound for every feasible point. Its limitation is equally important: the lower bound need not be sharp, an optimal certificate need not exist, and the primal infimum need not be attained. Strict gaps can occur when the chosen certificate system is too small, and in nonconvex problems even natural Lagrange-type bounds may sit strictly below the true optimum. Chapters 4, 6, and 8 identify the convexity, closedness, and constraint qualifications under which the best certificate reaches the primal value, and the next example shows the multiplier calculation in a linear programme.
[example: A Linear Programming Certificate]
Consider the linear programme
\begin{align*}
\inf\{c^\top x:Ax=b,\ x_i\ge 0\text{ for }1\le i\le n\}.
\end{align*}
Suppose $x\in\mathbb R^n$ is feasible, so $Ax=b$ and $x_i\ge 0$ for every $i$, and suppose $y\in\mathbb R^m$ satisfies $A^\top y\le c$ coordinatewise. Coordinatewise, this means
\begin{align*}
(A^\top y)_i\le c_i
\end{align*}
for each $i$. Since $x_i\ge 0$, multiplying by $x_i$ preserves the inequality:
\begin{align*}
(A^\top y)_i x_i\le c_i x_i.
\end{align*}
Summing over $i=1,\dots,n$ gives
\begin{align*}
\sum_{i=1}^n (A^\top y)_i x_i\le \sum_{i=1}^n c_i x_i.
\end{align*}
The left-hand side is $(A^\top y)^\top x$, and the right-hand side is $c^\top x$, so
\begin{align*}
(A^\top y)^\top x\le c^\top x.
\end{align*}
Using $(A^\top y)^\top x=y^\top A x$ gives
\begin{align*}
y^\top A x\le c^\top x.
\end{align*}
Because $x$ is feasible, $Ax=b$, hence
\begin{align*}
y^\top A x=y^\top b.
\end{align*}
Therefore
\begin{align*}
y^\top b\le c^\top x
\end{align*}
for every feasible $x$. Thus $y$ certifies the lower bound $y^\top b$ for the primal linear programme.
[/example]
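The chain of inequalities $y^\top b\le c^\top x$ can be traced on a concrete instance. The sketch below assumes a hypothetical linear programme with one equality constraint $x_1+x_2=1$ and cost $c=(1,2)$; the data and the helper names are chosen for illustration only.

```python
# Hypothetical data: minimise x1 + 2*x2 subject to x1 + x2 = 1, x >= 0.
A = [[1.0, 1.0]]
b = [1.0]
c = [1.0, 2.0]

def is_primal_feasible(x, tol=1e-12):
    """Check Ax = b (up to tol) and x_i >= 0."""
    Ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    equalities = all(abs(v - b_k) <= tol for v, b_k in zip(Ax, b))
    return equalities and all(x_j >= 0 for x_j in x)

def is_dual_feasible(y, tol=1e-12):
    """Check A^T y <= c coordinatewise."""
    ATy = [sum(A[k][i] * y[k] for k in range(len(A))) for i in range(len(c))]
    return all(v <= c_i + tol for v, c_i in zip(ATy, c))

def primal_objective(x):
    return sum(c_i * x_i for c_i, x_i in zip(c, x))

def dual_bound(y):
    return sum(y_k * b_k for y_k, b_k in zip(y, b))

y = [1.0]  # satisfies A^T y = (1, 1) <= (1, 2)
feasible_points = [[t / 10, 1 - t / 10] for t in range(11)]
```

Every point in `feasible_points` has objective value at least `dual_bound(y)`, and the point `[1.0, 0.0]` attains it, so in this small instance the certificate is sharp. A vector such as `[1.5]` fails `is_dual_feasible` and therefore certifies nothing.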
The preceding certificate used the nonnegative orthant to express $x_i\ge 0$, so its multiplier condition is tied to coordinatewise positivity. If the feasible directions were not stable under nonnegative scaling, a multiplier inequality verified at one scale would not automatically remain valid at another scale. The next definition abstracts the geometric feature that mattered: the region defining the inequality should be closed under convex combinations and under nonnegative scaling, which is exactly the structure needed for conic duality.
[definition: Convex Cone]
A set $K\subset \mathbb R^n$ is a convex cone if $K$ is convex and $tx\in K$ for every $x\in K$ and every $t\ge 0$.
[/definition]
Cones encode inequalities in a coordinate-free way. Nonnegative orthants, positive semidefinite matrices, and second-order cones will provide the main examples for conic formulations.
[example: Second-Order Cone]
The second-order cone in $\mathbb R^{n+1}$ is
\begin{align*}
K=\{(t,x)\in\mathbb R\times\mathbb R^n: |x|\le t\}.
\end{align*}
We verify explicitly that $K$ is a convex cone. Let $(t,x)\in K$ and let $\lambda\ge 0$. Since $(t,x)\in K$, we have $|x|\le t$. Multiplying by $\lambda\ge 0$ gives
\begin{align*}
\lambda |x|\le \lambda t.
\end{align*}
By absolute homogeneity of the Euclidean norm,
\begin{align*}
|\lambda x|=\lambda |x|.
\end{align*}
Therefore
\begin{align*}
|\lambda x|\le \lambda t.
\end{align*}
This means
\begin{align*}
\lambda(t,x)=(\lambda t,\lambda x)\in K.
\end{align*}
So $K$ is closed under nonnegative scaling.
Now take $(t,x),(s,y)\in K$ and $\theta\in[0,1]$. The defining inequalities are
\begin{align*}
|x|\le t
\end{align*}
and
\begin{align*}
|y|\le s.
\end{align*}
By the triangle inequality,
\begin{align*}
|\theta x+(1-\theta)y|\le |\theta x|+|(1-\theta)y|.
\end{align*}
By absolute homogeneity of the Euclidean norm, and because $\theta\ge 0$ and $1-\theta\ge 0$,
\begin{align*}
|\theta x|+|(1-\theta)y|=\theta |x|+(1-\theta)|y|.
\end{align*}
Using $|x|\le t$ and $|y|\le s$ gives
\begin{align*}
\theta |x|+(1-\theta)|y|\le \theta t+(1-\theta)s.
\end{align*}
Combining the inequalities,
\begin{align*}
|\theta x+(1-\theta)y|\le \theta t+(1-\theta)s.
\end{align*}
Thus
\begin{align*}
\theta(t,x)+(1-\theta)(s,y)=(\theta t+(1-\theta)s,\theta x+(1-\theta)y)\in K.
\end{align*}
So $K$ is convex, and together with closure under nonnegative scaling this proves that $K$ is a convex cone.
A constraint such as
\begin{align*}
|Bz+d|\le a^\top z+b
\end{align*}
is equivalent to the conic membership condition
\begin{align*}
(a^\top z+b,Bz+d)\in K.
\end{align*}
Indeed, membership in $K$ says that the norm of the vector component is bounded above by the scalar component, so
\begin{align*}
(a^\top z+b,Bz+d)\in K \Longleftrightarrow |Bz+d|\le a^\top z+b.
\end{align*}
The cone turns this norm inequality into a single convex conic constraint, which is why second-order cone constraints can represent many quadratic-looking feasible regions.
[/example]
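The membership test for the second-order cone and the rewriting of a norm constraint as conic membership can be sketched in a few lines. The function names, the tolerance `tol`, and the sample data below are assumptions made for illustration.

```python
import math

def in_soc(t, x, tol=1e-12):
    """Membership in K = {(t, x) : |x| <= t}, with a small rounding tolerance."""
    return math.sqrt(sum(v * v for v in x)) <= t + tol

def scale(lam, t, x):
    """Nonnegative scaling lam * (t, x)."""
    return lam * t, [lam * v for v in x]

def combine(theta, p, q):
    """Convex combination theta * p + (1 - theta) * q of two points (t, x)."""
    (t, x), (s, y) = p, q
    return (theta * t + (1 - theta) * s,
            [theta * xi + (1 - theta) * yi for xi, yi in zip(x, y)])

def norm_constraint_holds(B, d, a, b, z):
    """Check |Bz + d| <= a^T z + b via membership of (a^T z + b, Bz + d) in K."""
    vec = [sum(B_ij * z_j for B_ij, z_j in zip(row, z)) + d_i
           for row, d_i in zip(B, d)]
    scalar = sum(a_j * z_j for a_j, z_j in zip(a, z)) + b
    return in_soc(scalar, vec)

# Two points of K in R x R^2.
p = (5.0, [3.0, 4.0])
q = (1.0, [0.0, 1.0])

# Hypothetical constraint data describing the unit disc |z| <= 1.
B = [[1.0, 0.0], [0.0, 1.0]]
d = [0.0, 0.0]
a = [0.0, 0.0]
b0 = 1.0
```

With this data, `scale(2.0, *p)` and `combine(0.5, p, q)` remain in the cone, while `norm_constraint_holds(B, d, a, b0, z)` accepts `z = [0.6, 0.8]` and rejects `z = [1.0, 1.0]`.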
## Existence and the Role of Topology
A course on theory must also ask when an infimum is actually attained. Convexity controls shape, but existence usually comes from closedness, compactness, coercivity, or a replacement for compactness on level sets.
[definition: Minimiser]
Let $X$ be a set and let $f:X\to\mathbb R\cup\{+\infty\}$. A point $x^\star\in X$ is a minimiser of $f$ over $X$ if
\begin{align*}
f(x^\star)=\inf_{x\in X}f(x).
\end{align*}
[/definition]
The distinction between the value and a minimiser matters throughout the course. To know when a minimiser exists, we need a compactness argument that turns an approximating sequence for the infimum into an actual point of the feasible set.
[quotetheorem:304]
[citeproof:304]
This theorem supplies the basic compact case, and each assumption prevents a different failure. Nonemptiness is needed because an empty set contains no point at which the infimum could be attained. Compactness combines closedness and boundedness: missing boundary points can destroy attainment, while unbounded sets allow minimising sequences to escape to infinity. Continuity prevents downward jumps at limit points; without it, a convergent minimising sequence may approach a point whose function value is larger than the limiting infimum. The theorem also does not assert uniqueness, and it does not say how to find the minimiser. In optimisation, feasible regions are often unbounded, so the next example shows why convexity alone cannot replace the topological hypotheses.
[example: Infimum Without Attainment]
For $f:(0,\infty)\to\mathbb R$ defined by $f(x)=x$, we show that $f$ is convex, that $\inf_{x>0}f(x)=0$, and that no point of $(0,\infty)$ attains this infimum.
First, the domain $(0,\infty)$ is convex: if $x,y>0$ and $t\in[0,1]$, then
\begin{align*}
tx+(1-t)y>0
\end{align*}
because $tx\ge 0$, $(1-t)y\ge 0$, and at least one of $t$ and $1-t$ is nonzero. For the function value along the same segment,
\begin{align*}
f(tx+(1-t)y)=tx+(1-t)y.
\end{align*}
Since $f(x)=x$ and $f(y)=y$,
\begin{align*}
tx+(1-t)y=tf(x)+(1-t)f(y).
\end{align*}
Thus
\begin{align*}
f(tx+(1-t)y)=tf(x)+(1-t)f(y)\le tf(x)+(1-t)f(y),
\end{align*}
so $f$ is convex.
For every $x\in(0,\infty)$, we have $x>0$, hence
\begin{align*}
0<f(x).
\end{align*}
Therefore $0$ is a lower bound for the set of values $\{f(x):x>0\}$. To see that no larger number is a lower bound, let $\varepsilon>0$ and choose
\begin{align*}
x=\frac{\varepsilon}{2}.
\end{align*}
Then $x>0$ and
\begin{align*}
f(x)=f\left(\frac{\varepsilon}{2}\right)=\frac{\varepsilon}{2}<\varepsilon.
\end{align*}
So every positive candidate lower bound $\varepsilon$ is too large, and hence
\begin{align*}
\inf_{x\in(0,\infty)}f(x)=0.
\end{align*}
Finally, if a point $x^\star\in(0,\infty)$ attained the infimum, then
\begin{align*}
f(x^\star)=0.
\end{align*}
But $f(x^\star)=x^\star$, so this would give
\begin{align*}
x^\star=0,
\end{align*}
contradicting $x^\star\in(0,\infty)$. Thus the infimum is not attained. The failure is not caused by nonconvexity: it occurs because the endpoint $0$, approached by the values $f(x)$ as $x\downarrow 0$, is missing from the domain.
[/example]
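A minimising sequence for this example can be written down explicitly. The short sketch below uses the hypothetical choice $x_k=2^{-k}$; any positive sequence tending to $0$ would serve equally well.

```python
def f(x):
    """f(x) = x on the open half-line (0, infinity)."""
    if x <= 0:
        raise ValueError("x must lie in (0, infinity)")
    return x

# Values along the minimising sequence x_k = 2^(-k), k = 1, ..., 59.
values = [f(2.0 ** -k) for k in range(1, 60)]
```

The entries of `values` are strictly decreasing and strictly positive: they approach the infimum $0$, but no term equals it, and calling `f(0.0)` raises an error because $0$ is outside the domain.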
## How The Course Proceeds
The remaining chapters build the theory in an order that mirrors the logical dependencies. We first study convex sets and separation because they generate the hyperplanes behind most certificates. We then study convex functions, subdifferentials, and conjugates because objective functions are best understood through their supporting affine minorants.
After that, the course turns to optimality conditions and duality. Fermat's rule, normal cones, Fenchel duality, and Lagrange duality explain when first-order conditions are sufficient and when dual lower bounds are exact. The final part reformulates optimisation problems over cones and introduces the theoretical role of interiors, barriers, and central paths as preparation for algorithmic convex optimisation.
[remark: Theory Before Algorithms]
This course treats algorithms as motivation and context, not as the main object of study. Numerical methods depend on the theory developed here: separation identifies certificates, duality explains stopping criteria, and barrier geometry explains why interior-point methods follow central paths.
[/remark]
The result is a toolkit for recognising convex structure, proving optimality, and translating between geometric, analytic, and conic languages. Later algorithmic courses can then focus on rates and implementation because the foundational certificates have already been established.
# 1. Convex Sets and Separation
This chapter develops the geometric language used throughout convex optimisation. The central question is how convex sets sit inside finite-dimensional Euclidean space: what is their smallest affine ambient space, what directions do they contain, and when can they be separated by linear inequalities? The assumed background is linear algebra in $\mathbb R^n$, basic topology of Euclidean space, and the compactness of closed bounded subsets of $\mathbb R^n$. These ideas become the foundation for duality, optimality conditions, and conic reformulations later in the course.
## Convexity, Affine Structure, and Relative Interior
The first difficulty in convex optimisation is that feasible sets often live in lower-dimensional affine spaces. A simplex in $\mathbb R^3$, for instance, may be a triangle in a plane, so ordinary interior in $\mathbb R^3$ misses the geometry relevant to optimisation over that triangle. We therefore start by distinguishing convex combinations from affine combinations.
[definition: Convex Set]
A set $C \subseteq \mathbb R^n$ is convex if for every $x,y \in C$ and every $\lambda \in [0,1]$, the point $\lambda x + (1-\lambda)y$ belongs to $C$.
[/definition]
Convexity says that all line segments between feasible points remain feasible. This condition is stable under intersections, which is why systems of linear inequalities and many analytic constraints naturally produce convex feasible regions. The first examples should show how this line-segment property is checked directly from algebraic descriptions.
[example: Intersections of Halfspaces]
Let $a_i \in \mathbb R^n$ and $b_i \in \mathbb R$ for $i=1,\dots,m$, and set
\begin{align*}
P = \{x \in \mathbb R^n : a_i \cdot x \le b_i \text{ for } i=1,\dots,m\}.
\end{align*}
We show that $P$ is convex by checking the line-segment condition. Choose $x,y\in P$ and $\lambda\in[0,1]$. Since $x\in P$ and $y\in P$, for each $i=1,\dots,m$ we have $a_i\cdot x\le b_i$ and $a_i\cdot y\le b_i$. Using linearity of the dot product and the inequalities $\lambda\ge 0$ and $1-\lambda\ge 0$,
\begin{align*}
a_i\cdot(\lambda x+(1-\lambda)y) = \lambda(a_i\cdot x)+(1-\lambda)(a_i\cdot y).
\end{align*}
Therefore
\begin{align*}
\lambda(a_i\cdot x)+(1-\lambda)(a_i\cdot y) \le \lambda b_i+(1-\lambda)b_i.
\end{align*}
Finally,
\begin{align*}
\lambda b_i+(1-\lambda)b_i = (\lambda+1-\lambda)b_i = b_i.
\end{align*}
Thus $a_i\cdot(\lambda x+(1-\lambda)y)\le b_i$ for every $i=1,\dots,m$, so $\lambda x+(1-\lambda)y\in P$. Hence $P$ is convex, which is why finite systems of linear inequalities form the basic polyhedral feasible regions of linear optimisation.
[/example]
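The line-segment condition for a polyhedron can be exercised numerically. The sketch below assumes a hypothetical polyhedron, the unit square in $\mathbb R^2$ written as four inequalities $a_i\cdot x\le b_i$, and checks sampled points along random segments; it illustrates the argument and does not replace it.

```python
import random

def in_polyhedron(x, rows, rhs, tol=1e-12):
    """Check a_i . x <= b_i for every row a_i, up to a rounding tolerance."""
    return all(sum(a_j * x_j for a_j, x_j in zip(a, x)) <= b_i + tol
               for a, b_i in zip(rows, rhs))

# Hypothetical data: the unit square 0 <= x1 <= 1, 0 <= x2 <= 1.
rows = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
rhs = [1.0, 0.0, 1.0, 0.0]

def random_point():
    """A random point of the unit square."""
    return (random.random(), random.random())

def segment_stays_inside(x, y, steps=20):
    """Test lam * x + (1 - lam) * y for lam = 0, 1/steps, ..., 1."""
    for k in range(steps + 1):
        lam = k / steps
        z = tuple(lam * xi + (1 - lam) * yi for xi, yi in zip(x, y))
        if not in_polyhedron(z, rows, rhs):
            return False
    return True

random.seed(1)
all_segments_ok = all(segment_stays_inside(random_point(), random_point())
                      for _ in range(200))
```

The flag `all_segments_ok` comes out `True`, as the convexity argument predicts for every choice of the data `rows` and `rhs`.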
The preceding example shows why inequalities give convexity. Equalities, by contrast, restrict the affine dimension, and a feasible region may sit inside a line or plane rather than fill the surrounding Euclidean space. To discuss dimension and interior without losing such sets, we need the smallest affine ambient space containing the set.
[definition: Affine Hull]
The affine hull of a set $S \subseteq \mathbb R^n$ is
\begin{align*}
\operatorname{aff}(S)=\left\{\sum_{i=1}^m \lambda_i x_i : m\in\mathbb N,\ x_i\in S,\ \lambda_i\in\mathbb R,\ \sum_{i=1}^m\lambda_i=1\right\}.
\end{align*}
[/definition]
The affine hull is the smallest affine subspace containing the set, so it separates genuine boundary phenomena from artefacts of embedding the set in too large a space. This distinction matters for triangles in planes, feasible sets with equality constraints, and cones with positive-dimensional lineality. The next notion defines interior inside this affine hull rather than inside all of $\mathbb R^n$.
[definition: Relative Interior]
Let $C \subseteq \mathbb R^n$ be convex. The relative interior of $C$ is the interior of $C$ in the [subspace topology](/page/Subspace%20Topology) on $\operatorname{aff}(C)$, denoted $\operatorname{relint}(C)$.
[/definition]
Relative interior is the interior seen by an optimiser constrained to the affine hull of the feasible set. It is the correct notion for constraint qualifications and supporting hyperplanes, because it ignores directions that were never feasible in the first place. The following example illustrates why this replacement is not cosmetic.
[example: A Triangle in a Plane]
Let $p_1,p_2,p_3\in\mathbb R^3$ be three non-collinear points, the vertices of a triangle, and set $u=p_2-p_1$ and $v=p_3-p_1$. Since the vertices are non-collinear, $u$ and $v$ are linearly independent, and the affine plane they determine is
\begin{align*}
A=\{p_1+\alpha u+\beta v:\alpha,\beta\in\mathbb R\}.
\end{align*}
The filled triangle is
\begin{align*}
C=\{p_1+\alpha u+\beta v:\alpha\ge 0,\ \beta\ge 0,\ \alpha+\beta\le 1\}.
\end{align*}
The ordinary interior of $C$ in $\mathbb R^3$ is empty. Indeed, choose $z\in C$ and $\varepsilon>0$. Since $A$ is a plane in $\mathbb R^3$, there is a nonzero vector $w$ orthogonal to both $u$ and $v$. Then
\begin{align*}
\left|z+\frac{\varepsilon}{2}\frac{w}{|w|}-z\right|
= \left|\frac{\varepsilon}{2}\frac{w}{|w|}\right|
= \frac{\varepsilon}{2}<\varepsilon,
\end{align*}
but $z+\frac{\varepsilon}{2}\frac{w}{|w|}\notin A$, hence it is not in $C$. Thus no open ball in $\mathbb R^3$ around $z$ is contained in $C$.
Inside the affine plane $A$, however, the relative interior is
\begin{align*}
\operatorname{relint}(C)
=
\{p_1+\alpha u+\beta v:\alpha>0,\ \beta>0,\ \alpha+\beta<1\}.
\end{align*}
If these three strict inequalities hold and
\begin{align*}
\delta=\frac12\min\{\alpha,\beta,1-\alpha-\beta\}>0,
\end{align*}
then every point $p_1+(\alpha+r)u+(\beta+s)v$ with $|r|+|s|<\delta$ still satisfies
\begin{align*}
\alpha+r>0,\qquad \beta+s>0,\qquad (\alpha+r)+(\beta+s)<1,
\end{align*}
so a small neighbourhood in $A$ lies in $C$. Conversely, if $\alpha=0$, $\beta=0$, or $\alpha+\beta=1$, an arbitrarily small move inside $A$ across the corresponding edge violates one of the inequalities defining $C$. Thus the relative interior is exactly the open triangular region in the plane, while the points on the three edges form the relative boundary.
[/example]
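The difference between ordinary and relative interior can be seen by computing the coordinates $(\alpha,\beta)$ together with the component along the normal $w$. The sketch below assumes a hypothetical triangle in the plane $x_3=1$; the helper names and the tolerance are illustrative choices.

```python
def sub(p, q):
    return tuple(pi - qi for pi, qi in zip(p, q))

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

# Hypothetical vertices of a triangle lying in the plane x3 = 1.
p1, p2, p3 = (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)
u, v = sub(p2, p1), sub(p3, p1)
w = cross(u, v)  # nonzero vector orthogonal to both u and v

def plane_coordinates(z):
    """Return (alpha, beta, gamma) with z = p1 + alpha*u + beta*v + gamma*w."""
    d = sub(z, p1)
    guu, guv, gvv = dot(u, u), dot(u, v), dot(v, v)
    det = guu * gvv - guv * guv  # nonzero: u and v are linearly independent
    du, dv = dot(d, u), dot(d, v)
    alpha = (du * gvv - guv * dv) / det
    beta = (guu * dv - guv * du) / det
    gamma = dot(d, w) / dot(w, w)  # zero exactly when z lies in the plane A
    return alpha, beta, gamma

def in_triangle(z, tol=1e-12):
    alpha, beta, gamma = plane_coordinates(z)
    return (abs(gamma) <= tol and alpha >= -tol and beta >= -tol
            and alpha + beta <= 1 + tol)

def in_relative_interior(z, tol=1e-12):
    alpha, beta, gamma = plane_coordinates(z)
    return (abs(gamma) <= tol and alpha > tol and beta > tol
            and alpha + beta < 1 - tol)

centroid = (1 / 3, 1 / 3, 1.0)
lifted = (1 / 3, 1 / 3, 1.0 + 1e-6)   # small step along the normal w
edge_midpoint = (0.5, 0.0, 1.0)       # beta = 0: on an edge
```

The centroid lies in the relative interior, yet the point `lifted`, obtained by an arbitrarily small step along $w$, already leaves the triangle; the edge midpoint belongs to the triangle but not to its relative interior.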
The triangle example is bounded and affine, but optimisation also has homogeneous feasible regions where scaling a feasible point remains feasible. Such sets are not best understood by vertices and edges alone; their key objects are directions and dual inequalities. This motivates the conic language used later for second-order, semidefinite, and nonnegative constraints.
[definition: Convex Cone]
A set $K \subseteq \mathbb R^n$ is a convex cone if $K$ is convex and $\alpha x \in K$ for every $x \in K$ and every $\alpha \ge 0$.
[/definition]
Cones encode homogeneous inequalities and are the natural language for conic optimisation. Once a cone is fixed, linear inequalities that are valid on the whole cone form another cone in the [dual space](/page/Dual%20Space), identified with $\mathbb R^n$ using the Euclidean [inner product](/page/Inner%20Product). We therefore need the polar cone to record all such inequalities at once.
[definition: Polar Cone]
For a cone $K \subseteq \mathbb R^n$, the polar cone is
\begin{align*}
K^\circ = \{y \in \mathbb R^n : y \cdot x \le 0 \text{ for all } x \in K\}.
\end{align*}
[/definition]
The sign convention here is the polar convention. Some optimisation texts use the dual cone $K^* = \{y : y\cdot x \ge 0 \text{ for all } x\in K\}$, so $K^*=-K^\circ$. A standard example is the second-order cone, whose polar is computable from Cauchy-Schwarz geometry.
[example: Polar Cone of the Second-Order Cone]
Let
\begin{align*}
Q=\{(t,x)\in\mathbb R\times\mathbb R^{n-1}: |x|\le t\}.
\end{align*}
We compute its polar from the definition
\begin{align*}
Q^\circ=\{(s,y)\in\mathbb R\times\mathbb R^{n-1}: st+y\cdot x\le 0 \text{ for every }(t,x)\in Q\}.
\end{align*}
First suppose $(s,y)\in Q^\circ$. Since $(1,0)\in Q$, the defining inequality gives
\begin{align*}
s\cdot 1+y\cdot 0=s\le 0.
\end{align*}
If $y\ne 0$, then $(1,y/|y|)\in Q$ because $|y/|y||=1$, so
\begin{align*}
0\ge s\cdot 1+y\cdot \frac{y}{|y|}=s+\frac{y\cdot y}{|y|}=s+\frac{|y|^2}{|y|}=s+|y|.
\end{align*}
If $y=0$, the inequality $s\le 0$ is exactly $s+|y|\le 0$. Hence every element of $Q^\circ$ satisfies
\begin{align*}
|y|\le -s.
\end{align*}
Conversely, suppose $|y|\le -s$. Then $s+|y|\le 0$. For any $(t,x)\in Q$, the inequality $|x|\le t$ implies $t\ge 0$. By the Euclidean [Cauchy-Schwarz inequality](/theorems/432), $y\cdot x\le |y|\,|x|$, so
\begin{align*}
st+y\cdot x\le st+|y|\,|x|.
\end{align*}
Since $|x|\le t$ and $|y|\ge 0$,
\begin{align*}
st+|y|\,|x|\le st+|y|t=(s+|y|)t.
\end{align*}
Finally, $(s+|y|)t\le 0$ because $s+|y|\le 0$ and $t\ge 0$. Therefore $st+y\cdot x\le 0$ for every $(t,x)\in Q$, so $(s,y)\in Q^\circ$.
We have shown
\begin{align*}
Q^\circ=\{(s,y): |y|\le -s\}.
\end{align*}
Also,
\begin{align*}
-Q=\{(-t,-x): |x|\le t\}.
\end{align*}
Writing $s=-t$ and $y=-x$, the condition $|x|\le t$ becomes $|-y|\le -s$, which is the same as $|y|\le -s$. Thus
\begin{align*}
-Q=\{(s,y): |y|\le -s\}.
\end{align*}
Therefore $Q^\circ=-Q$, so the second-order cone is self-dual up to the sign convention for polar cones.
[/example]
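The two halves of the argument above can be mirrored in a short numerical sketch. It assumes the closed form $Q^\circ=\{(s,y):|y|\le -s\}$ just derived; the function names and the sample vectors are illustrative choices, not part of the source. For a pair $(s,y)$ satisfying $|y|\le -s$ it samples points of $Q$ and checks the sign of the pairing, and for a pair violating it the function returns one of the two test points used in the proof, $(1,0)$ or $(1,y/|y|)$.

```python
import math
import random

def norm(x):
    return math.sqrt(sum(c * c for c in x))

def pairing(s, y, t, x):
    """The inner product of (s, y) with (t, x)."""
    return s * t + sum(a * b for a, b in zip(y, x))

def in_soc(t, x, tol=1e-9):
    """Membership in Q = {(t, x): |x| <= t}."""
    return norm(x) <= t + tol

def sample_soc(dim, rng):
    """Draw a point of Q: pick x freely, then any t >= |x|."""
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return norm(x) + rng.uniform(0.0, 1.0), x

def violating_witness(s, y):
    """Return (t, x) in Q with s*t + y.x > 0, or None when |y| <= -s."""
    ny = norm(y)
    if s + ny <= 0:
        return None
    if ny == 0:
        return 1.0, [0.0] * len(y)          # the test point (1, 0)
    return 1.0, [c / ny for c in y]         # the test point (1, y/|y|)

rng = random.Random(0)
inside = (-2.0, [0.6, 0.8])     # |y| is about 1, well below -s = 2
outside = (-0.5, [0.6, 0.8])    # |y| is about 1, above -s = 0.5

worst = max(pairing(*inside, *sample_soc(2, rng)) for _ in range(1000))
print(worst <= 1e-9, violating_witness(*inside))
t, x = violating_witness(*outside)
print(in_soc(t, x), pairing(*outside, t, x) > 0)
```

Random sampling can only fail to find a violation; it does not prove membership in $Q^\circ$. The proof in the example is what establishes the inequality for every point of $Q$.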
Polars describe valid inequalities for homogeneous sets. For a general unbounded convex feasible set, the comparable question is which directions allow feasible motion forever. These directions control existence of minimisers and the shape of dual problems, so we isolate them as another cone.
[definition: Recession Cone]
Let $C \subseteq \mathbb R^n$ be a nonempty convex set. The recession cone of $C$ is
\begin{align*}
\operatorname{rec}(C)=\{d\in\mathbb R^n : x+td\in C \text{ for all } x\in C \text{ and all } t\ge 0\}.
\end{align*}
[/definition]
The recession cone records directions that preserve feasibility indefinitely. For a polyhedron $\{x:Ax\le b\}$ it is obtained by replacing the right-hand side $b$ with $0$, which explains why recession directions are often easy to compute even when the set itself is complicated.
[example: Recession Cone of a Polyhedron]
Let $P=\{x\in\mathbb R^n: Ax\le b\}$ be nonempty, where $A\in\mathbb R^{m\times n}$ and $b\in\mathbb R^m$. Write the rows of $A$ as $a_1,\dots,a_m$, so $Ax\le b$ means $a_i\cdot x\le b_i$ for every $i$.
First suppose $d\in\operatorname{rec}(P)$. Choose any $x\in P$. By the definition of recession cone, $x+td\in P$ for every $t\ge 0$, so for each $i=1,\dots,m$,
\begin{align*}
a_i\cdot(x+td)\le b_i.
\end{align*}
Using linearity of the dot product,
\begin{align*}
a_i\cdot x+t(a_i\cdot d)\le b_i
\end{align*}
for every $t\ge 0$. If $a_i\cdot d>0$, then choosing
\begin{align*}
t>\frac{b_i-a_i\cdot x}{a_i\cdot d}
\end{align*}
would give
\begin{align*}
t(a_i\cdot d)>b_i-a_i\cdot x,
\end{align*}
hence
\begin{align*}
a_i\cdot x+t(a_i\cdot d)>b_i,
\end{align*}
contradicting $x+td\in P$. Therefore $a_i\cdot d\le 0$ for every $i$, which is exactly $Ad\le 0$.
Conversely, suppose $Ad\le 0$. Let $x\in P$ and $t\ge 0$. Then
\begin{align*}
A(x+td)=Ax+tAd.
\end{align*}
Since $Ax\le b$, $Ad\le 0$, and $t\ge 0$, each component satisfies
\begin{align*}
(Ax)_i+t(Ad)_i\le (Ax)_i\le b_i.
\end{align*}
Thus $A(x+td)\le b$, so $x+td\in P$ for every $x\in P$ and every $t\ge 0$. Hence $d\in\operatorname{rec}(P)$.
Therefore
\begin{align*}
\operatorname{rec}(P)=\{d\in\mathbb R^n:Ad\le 0\}.
\end{align*}
The recession cone is obtained by keeping the left-hand side matrix $A$ and replacing the finite bounds $b$ by the homogeneous bound $0$.
[/example]
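The characterisation $\operatorname{rec}(P)=\{d:Ad\le 0\}$ turns the recession test into a finite sign check. The sketch below assumes a small polyhedron in $\mathbb R^2$ chosen for illustration; `in_recession_cone` and `escape_time` are hypothetical helper names. The second function reproduces the threshold $t>(b_i-a_i\cdot x)/(a_i\cdot d)$ from the first half of the argument.

```python
def matvec(A, x):
    return [sum(a * c for a, c in zip(row, x)) for row in A]

def in_polyhedron(A, b, x, tol=1e-9):
    return all(v <= bi + tol for v, bi in zip(matvec(A, x), b))

def in_recession_cone(A, d, tol=1e-9):
    """Test A d <= 0 componentwise."""
    return all(v <= tol for v in matvec(A, d))

def escape_time(A, b, x, d):
    """Smallest threshold t such that x + s*d leaves P for every s > t, or None if it never leaves.

    Assumes x is in P, so every slack b_i - a_i.x is nonnegative.
    """
    thresholds = []
    for row, bi in zip(A, b):
        slope = sum(a * c for a, c in zip(row, d))
        if slope > 0:
            slack = bi - sum(a * c for a, c in zip(row, x))
            thresholds.append(slack / slope)
    return min(thresholds) if thresholds else None

# Illustrative polyhedron: x1 >= 0, x2 >= 0, x1 - x2 <= 1.
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, -1.0]]
b = [0.0, 0.0, 1.0]
x = [0.0, 0.0]

print(in_recession_cone(A, [1.0, 1.0]), escape_time(A, b, x, [1.0, 1.0]))
print(in_recession_cone(A, [1.0, 0.0]), escape_time(A, b, x, [1.0, 0.0]))
```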
## Separating and Supporting Hyperplanes
The guiding question in this section is how to certify that a point does not belong to a convex set. In finite dimensions, the certificate is a linear inequality: a hyperplane places the point on one side and the convex set on the other.
[definition: Hyperplane]
A hyperplane in $\mathbb R^n$ is a set of the form
\begin{align*}
H = \{x\in\mathbb R^n : a\cdot x = \alpha\},
\end{align*}
where $a\in\mathbb R^n\setminus\{0\}$ and $\alpha\in\mathbb R$.
[/definition]
A hyperplane determines two closed halfspaces and hence a linear test for feasibility. The definition alone does not explain why such tests exist for arbitrary closed convex sets, so the separation principle supplies the fundamental certificate: closed convex sets can be recognised externally by linear inequalities. Each hypothesis is doing real work. Closedness rules out missing-boundary failures, and convexity rules out sets that surround an excluded point in incompatible directions. Separation in this point-versus-set form does not say that arbitrary disjoint convex sets are strictly separable, nor that boundary points of a set can be separated from the set by a positive gap.
For polyhedra, this certificate can also be read as a nonnegative linear combination of finitely many active inequalities. The next example connects the abstract theorem to the finite inequality systems used in optimisation models.
[example: Separating a Point from a Polyhedron]
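Let $P=\{x\in\mathbb R^n:Ax\le b\}$ be nonempty, and let $x_0\notin P$. The set $P$ is closed because it is a finite intersection of closed halfspaces. Choose $q\in P$. Any minimiser of $|x_0-x|$ over $P$ must lie in the closed ball $\{x:|x_0-x|\le |x_0-q|\}$, and the intersection of $P$ with this ball is nonempty and compact, so the continuous function $x\mapsto|x_0-x|$ attains its minimum over $P$ at some point $p\in P$. Set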
\begin{align*}
a=x_0-p.
\end{align*}
Because $x_0\notin P$ and $p\in P$, we have $a\ne 0$.
For any $x\in P$ and $t\in[0,1]$, convexity of $P$ gives $p+t(x-p)\in P$. Hence the function
\begin{align*}
\phi(t)=|x_0-p-t(x-p)|^2
\end{align*}
satisfies $\phi(t)\ge \phi(0)$ for $0\le t\le 1$. Expanding the square gives
\begin{align*}
\phi(t)=|x_0-p|^2-2t(x_0-p)\cdot(x-p)+t^2|x-p|^2.
\end{align*}
Since $a=x_0-p$ and $\phi(0)=|x_0-p|^2$, the inequality $\phi(t)\ge\phi(0)$ becomes
\begin{align*}
-2t\,a\cdot(x-p)+t^2|x-p|^2\ge 0.
\end{align*}
For $t>0$, division by $t$ gives
\begin{align*}
-2a\cdot(x-p)+t|x-p|^2\ge 0.
\end{align*}
Letting $t\downarrow 0$ yields
\begin{align*}
a\cdot(x-p)\le 0.
\end{align*}
Therefore $a\cdot x\le a\cdot p$ for every $x\in P$. Also,
\begin{align*}
a\cdot x_0-a\cdot p=a\cdot(x_0-p).
\end{align*}
Since $a=x_0-p$,
\begin{align*}
a\cdot(x_0-p)=|x_0-p|^2>0.
\end{align*}
Thus $a\cdot x\le a\cdot p<a\cdot x_0$ separates $P$ from $x_0$.
Now write the rows of $A$ as $a_1,\dots,a_m$, and let
\begin{align*}
I=\{i:a_i\cdot p=b_i\}
\end{align*}
be the set of constraints active at $p$. The normal $a=x_0-p$ belongs to the cone generated by the active row normals:
\begin{align*}
a=\sum_{i\in I}\mu_i a_i
\end{align*}
for some coefficients $\mu_i\ge 0$. To justify this, suppose no such coefficients existed. The finite-dimensional separating form of *[Farkas Lemma](/theorems/6685)* would give a direction $d\in\mathbb R^n$ such that $a\cdot d>0$ and $a_i\cdot d\le 0$ for every $i\in I$.
For $i\in I$, we have
\begin{align*}
a_i\cdot(p+\varepsilon d)=a_i\cdot p+\varepsilon a_i\cdot d.
\end{align*}
Since $a_i\cdot p=b_i$ and $a_i\cdot d\le 0$, this gives $a_i\cdot(p+\varepsilon d)\le b_i$ for every $\varepsilon>0$. For $i\notin I$, the inequality at $p$ is strict, so $b_i-a_i\cdot p>0$. Choosing $\varepsilon>0$ small enough to satisfy all finitely many inactive inequalities gives
\begin{align*}
a_i\cdot(p+\varepsilon d)\le b_i
\end{align*}
for every $i\notin I$ as well. Hence $p+\varepsilon d\in P$ for such $\varepsilon>0$.
Applying the already proved separating inequality to $x=p+\varepsilon d$ gives
\begin{align*}
a\cdot((p+\varepsilon d)-p)\le 0.
\end{align*}
Thus
\begin{align*}
\varepsilon\,a\cdot d\le 0.
\end{align*}
Because $\varepsilon>0$, this contradicts $a\cdot d>0$. Therefore the multipliers $\mu_i\ge 0$ exist.
With these multipliers, the separating inequality is visibly assembled from the original constraints. For every $x\in P$ and every $i\in I$, $a_i\cdot x\le b_i$, so multiplying by $\mu_i\ge 0$ gives
\begin{align*}
\mu_i(a_i\cdot x)\le \mu_i b_i.
\end{align*}
Summing over $i\in I$ gives
\begin{align*}
\sum_{i\in I}\mu_i(a_i\cdot x)\le \sum_{i\in I}\mu_i b_i.
\end{align*}
Because $a=\sum_{i\in I}\mu_i a_i$, the left-hand side is $a\cdot x$. Because each active constraint satisfies $b_i=a_i\cdot p$, the right-hand side is
\begin{align*}
\sum_{i\in I}\mu_i(a_i\cdot p)=a\cdot p.
\end{align*}
Hence $a\cdot x\le a\cdot p$ for all $x\in P$. The abstract separating hyperplane is therefore a nonnegative linear combination of the original inequalities active at the nearest feasible point $p$.
[/example]
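For a general polyhedron the nearest point $p$ has to be computed by a quadratic programme, but for a box it is given by coordinatewise clipping, which makes the whole construction easy to trace. The sketch below is restricted to that special case as a simplifying assumption; the box, the point $x_0$, and the function names are illustrative. The box $\{x:\mathrm{lo}\le x\le \mathrm{hi}\}$ is written with rows $e_i\cdot x\le \mathrm{hi}_i$ and $-e_i\cdot x\le -\mathrm{lo}_i$, and the multipliers are read off from the signs of $a=x_0-p$.

```python
from itertools import product

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_onto_box(x0, lo, hi):
    """Nearest point of {x: lo <= x <= hi} to x0 (coordinatewise clipping)."""
    return [min(max(c, l), h) for c, l, h in zip(x0, lo, hi)]

def separating_certificate(x0, lo, hi):
    p = project_onto_box(x0, lo, hi)
    a = [c - q for c, q in zip(x0, p)]        # normal a = x0 - p
    mu_upper = [max(ai, 0.0) for ai in a]     # multipliers on the rows  x_i <= hi_i
    mu_lower = [max(-ai, 0.0) for ai in a]    # multipliers on the rows -x_i <= -lo_i
    return p, a, mu_upper, mu_lower

lo, hi = [0.0, 0.0], [1.0, 1.0]
x0 = [2.0, -1.0]
p, a, mu_upper, mu_lower = separating_certificate(x0, lo, hi)

# A linear functional on a box is maximised at a vertex, so the vertices suffice here.
vertices = [list(v) for v in product(*zip(lo, hi))]
largest_on_box = max(dot(a, v) for v in vertices)
combined_rhs = dot(mu_upper, hi) - dot(mu_lower, lo)   # sum of mu_i * b_i over the rows

print(p, a)
print(largest_on_box <= dot(a, p) < dot(a, x0))
print(combined_rhs == dot(a, p))
```

A positive multiplier can only occur on a row that is active at $p$: `mu_upper[i] > 0` forces `p[i] == hi[i]`, and `mu_lower[i] > 0` forces `p[i] == lo[i]`.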
Strict separation handles points outside closed convex sets. Boundary points require a different certificate, because the best possible separating inequality may have equality at the boundary point. This motivates the notion of a hyperplane that supports the set rather than separates it with a gap.
[definition: Supporting Hyperplane]
Let $C\subseteq\mathbb R^n$ be convex. A hyperplane $H=\{x:a\cdot x=\alpha\}$ supports $C$ at $x_0\in C$ if $x_0\in H$ and either $a\cdot x\le\alpha$ for all $x\in C$ or $a\cdot x\ge\alpha$ for all $x\in C$.
[/definition]
Supporting hyperplanes identify boundary points by linear objectives that attain their optimum there. This is the geometric origin of normal cones and first-order optimality conditions.
To prove that such linear certificates exist, one first solves the easier problem of separating a point that is genuinely outside a closed convex set. Moving an exterior point toward the boundary then explains how a separating hyperplane degenerates into a supporting one. The requirement $x_0\in C$ in the definition is not cosmetic. If $C=B(0,1)$ is the open unit ball and $x_0$ lies on the unit sphere, then $x_0\in\partial C$ but $x_0\notin C$, so under the definition above the tangent hyperplane at $x_0$ supports $\overline C$ rather than $C$. Lower-dimensional cases, such as a line segment in $\mathbb R^2$, must be treated inside $\operatorname{aff}(C)$ or with relative interior. Supporting hyperplanes also need not be unique: at a corner of a square there are many supporting lines, while at a smooth boundary point there may be only one.
For two convex sets that merely touch or have empty ordinary interior, point-set separation is no longer enough. Strict separation with a positive gap asks for a buffer between the sets, while proper separation only asks for a nonconstant linear inequality that orders the two sets. Relative interior is the right hypothesis because it detects whether the two sets overlap in their shared affine geometry rather than in the surrounding Euclidean space.
[quotetheorem:6667]
[citeproof:6667]
Proper separation is the version that survives when two convex sets touch along a boundary. Nonemptiness is needed so that the ordering compares actual feasible points: if $C=\varnothing$, every inequality $a\cdot x\le a\cdot y$ over $x\in C$ is vacuous and says nothing about a certificate. Convexity is what makes the difference set $C-D$ convex; without it, two interlaced nonconvex sets may require different inequalities in different directions and no single supporting inequality captures the geometry. The disjoint-relative-interior hypothesis is sharp: if $C=D$ is a line segment, then their relative interiors coincide and any inequality separating $C$ from $D$ must be equality everywhere, so the separating functional is constant on $C\cup D$ and the separation is not proper. Properness rules out this degenerate equality case; for example, if $C=\{(0,t):0\le t\le 1\}$ and $D=\{(0,t):2\le t\le 3\}$, the functional $a=(1,0)$ gives $a\cdot x=a\cdot y=0$ for all points and therefore separates only in a useless equality sense, while $a=(0,1)$ gives a proper ordering. Proper separation differs from strict separation because it permits equality at contact points and does not promise a positive gap; if $C=\{(x_1,x_2):x_1^2+x_2^2\le 1\}$ and $D=\{(1,0)\}$, the inequality $x_1\le 1$ has equality at the common point but is still proper because $x_1$ is not constant on $C\cup D$. In the chapters on duality, the same theorem turns infeasibility or failure of constraint qualifications into a nonzero multiplier: the separating functional becomes the dual certificate, and the use of relative interior explains why Slater-type hypotheses are formulated with relative interiors rather than ordinary interiors.
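The three examples in the preceding paragraph can be classified mechanically. The sketch below is an illustration under a simplifying assumption: each set is replaced by finitely many of its points that carry the extreme values of the functional (segment endpoints, and boundary samples of the disc that include the contact point). The function name `separation_type` and its labels are hypothetical.

```python
import math

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def separation_type(a, C, D, tol=1e-12):
    """Classify how x -> a.x orders the finite point sets C (lower side) and D (upper side)."""
    c_vals = [dot(a, x) for x in C]
    d_vals = [dot(a, y) for y in D]
    if max(c_vals) > min(d_vals) + tol:
        return "not separating"
    values = c_vals + d_vals
    if max(values) - min(values) <= tol:
        return "improper"      # the functional is constant on both sets
    if max(c_vals) < min(d_vals) - tol:
        return "strict"        # positive gap, so in particular proper
    return "proper"            # equality occurs somewhere, but the functional is not constant

# Two segments on the vertical axis, represented by their endpoints.
seg_C = [(0.0, 0.0), (0.0, 1.0)]
seg_D = [(0.0, 2.0), (0.0, 3.0)]
# The unit disc, represented by boundary samples; k = 0 gives the contact point (1, 0).
disc = [(math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360)) for k in range(360)]
point = [(1.0, 0.0)]

print(separation_type((1.0, 0.0), seg_C, seg_D))
print(separation_type((0.0, 1.0), seg_C, seg_D))
print(separation_type((1.0, 0.0), disc, point))
```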
## Faces, Extreme Points, and Representation
The last question of the chapter is how a convex set is assembled from lower-dimensional pieces. Optimisation naturally exposes these pieces because a linear objective reaches its maximum on a face, and in compact cases the whole set is generated by its extreme points.
[definition: Face]
Let $C\subseteq\mathbb R^n$ be convex. A convex subset $F\subseteq C$ is a face of $C$ if whenever $x,y\in C$, $\lambda\in(0,1)$, and $\lambda x+(1-\lambda)y\in F$, then $x,y\in F$.
[/definition]
A face is closed under taking apart convex combinations: if a point of $F$ lies strictly between two points of $C$, then both of those points already belong to $F$. Faces are therefore the regions that convex combinations cannot enter from outside. Since linear objectives select maximisers by supporting hyperplanes, we next single out the faces that arise in exactly this way.
[definition: Exposed Face]
Let $C\subseteq\mathbb R^n$ be convex. A subset $F\subseteq C$ is an exposed face of $C$ if there exists $a\in\mathbb R^n$ such that $\sup_{y\in C}a\cdot y$ is finite, is attained on $C$, and
\begin{align*}
F=\{x\in C:a\cdot x=\sup_{y\in C}a\cdot y\}.
\end{align*}
[/definition]
Every exposed face is a face, but the converse can fail, for instance at a point where a flat piece of the boundary runs into a curved piece. The distinction matters because optimisation with linear objectives directly sees exposed faces. A low-dimensional example clarifies why geometric faces and optimisation-exposed faces are not always identical.
[example: Exposed and Non-Exposed Faces]
Let
\begin{align*}
C=\{(x,y)\in\mathbb R^2:-1\le x\le 1,\ \max\{0,x\}^2\le y\le 1\}.
\end{align*}
The function $x\mapsto \max\{0,x\}^2$ is convex, so $C$ is the intersection of its epigraph with the strip $-1\le x\le 1$ and the halfspace $y\le 1$; hence $C$ is convex. Its lower boundary contains the horizontal segment
\begin{align*}
[-1,0]\times\{0\}
\end{align*}
and, for $0\le x\le 1$, the curved arc
\begin{align*}
\{(x,x^2):0\le x\le 1\}.
\end{align*}
These two boundary pieces meet at
\begin{align*}
p=(0,0).
\end{align*}
The point $p$ is an extreme point of $C$. Indeed, suppose
\begin{align*}
p=\lambda (x_1,y_1)+(1-\lambda)(x_2,y_2)
\end{align*}
with $(x_1,y_1),(x_2,y_2)\in C$ and $\lambda\in(0,1)$. Since every point of $C$ satisfies $y\ge 0$, the second coordinate gives
\begin{align*}
0=\lambda y_1+(1-\lambda)y_2.
\end{align*}
Because $\lambda>0$, $1-\lambda>0$, $y_1\ge 0$, and $y_2\ge 0$, this forces
\begin{align*}
y_1=0,\qquad y_2=0.
\end{align*}
For a point $(x,y)\in C$ with $y=0$, the inequality $\max\{0,x\}^2\le 0$ implies $\max\{0,x\}=0$, hence $x\le 0$. Therefore $x_1\le 0$ and $x_2\le 0$. The first coordinate of the convex-combination identity is
\begin{align*}
0=\lambda x_1+(1-\lambda)x_2.
\end{align*}
With $x_1\le 0$, $x_2\le 0$, and positive coefficients, this gives $x_1=x_2=0$. Hence both points equal $p$, so $p$ is extreme.
However, $p$ is not an exposed point. Let $a=(\alpha,\beta)$ be any vector such that $p$ maximises the linear functional $(x,y)\mapsto \alpha x+\beta y$ over $C$. Since $p=(0,0)$, this requires
\begin{align*}
\alpha x+\beta y\le 0
\end{align*}
for every $(x,y)\in C$. Applying this to $(0,1)\in C$ gives $\beta\le 0$. Applying it to $(-1,0)\in C$ gives $-\alpha\le 0$, so $\alpha\ge 0$. Applying it to $(x,x^2)\in C$ with $0<x\le 1$ gives
\begin{align*}
\alpha x+\beta x^2\le 0.
\end{align*}
Dividing by $x>0$ gives
\begin{align*}
\alpha+\beta x\le 0
\end{align*}
for every $0<x\le 1$. Letting $x\downarrow 0$ gives $\alpha\le 0$. Thus $\alpha=0$. The functional is then $\beta y$. If $\beta<0$, every point of the horizontal segment $[-1,0]\times\{0\}$ has value $0$, so $p$ is not the unique maximiser. If $\beta=0$, the functional is constant on all of $C$. Hence no linear functional exposes $p$ alone: $p$ is an extreme point, but not an exposed face by itself.
[/example]
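The case analysis in the example can be turned into an explicit witness: for every nonzero functional $(\alpha,\beta)$ there is a point of $C$ other than $p$ whose value is at least the value $0$ attained at $p$. The sketch below is one way to organise this; the function name `competitor` and the choice $x=\min\{1,\alpha/(-2\beta)\}$ on the arc are illustrative, not taken from the text.

```python
import random

def in_C(x, y, tol=1e-12):
    """Membership in C = {-1 <= x <= 1, max(0, x)^2 <= y <= 1}."""
    return -1 - tol <= x <= 1 + tol and max(0.0, x) ** 2 - tol <= y <= 1 + tol

def competitor(alpha, beta):
    """Return a point of C other than p = (0, 0) with alpha*x + beta*y >= 0."""
    if alpha == 0 and beta == 0:
        raise ValueError("the zero functional is constant on C")
    if beta > 0:
        return (0.0, 1.0)        # value beta > 0
    if alpha < 0:
        return (-1.0, 0.0)       # value -alpha > 0
    if alpha == 0:
        return (-1.0, 0.0)       # value 0: a tie with p along the flat segment
    if beta == 0:
        return (1.0, 1.0)        # value alpha > 0
    # alpha > 0 and beta < 0: move a short way along the arc y = x^2.
    x = min(1.0, alpha / (-2.0 * beta))
    return (x, x * x)            # value x*(alpha + beta*x) >= x*alpha/2 > 0

rng = random.Random(1)
for _ in range(5):
    alpha, beta = rng.uniform(-1, 1), rng.uniform(-1, 1)
    x, y = competitor(alpha, beta)
    print(round(alpha, 3), round(beta, 3), (x, y), alpha * x + beta * y >= 0)
```

Whenever the returned value is strictly positive, $p$ is not a maximiser at all; in the remaining case $\alpha=0$, $\beta<0$, the point $p$ is a maximiser but shares the maximum with the whole flat segment.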
The example shows that some boundary points are geometrically indecomposable even when no linear functional exposes them alone. To name these smallest pieces of a convex set, we need the concept that generalises vertices of a polytope. These points are the atoms in finite-dimensional representation theorems.
[definition: Extreme Point]
Let $C\subseteq\mathbb R^n$ be convex. A point $x\in C$ is an extreme point of $C$ if whenever $x=\lambda y+(1-\lambda)z$ with $y,z\in C$ and $\lambda\in(0,1)$, then $y=z=x$.
[/definition]
Extreme points generalise vertices of polytopes and describe points that cannot be split by genuine convex averaging. The next issue is quantitative: if a point lies in a convex hull in $\mathbb R^n$, how many original points are needed to represent it? The answer is bounded by the ambient dimension.
[quotetheorem:4083]
[citeproof:4083]
[Carathéodory's theorem](/theorems/2954) is a dimension bound rather than a compactness statement. The number $n+1$ is sharp: if $S$ consists of the $n+1$ vertices of a nondegenerate simplex in $\mathbb R^n$, then a point in the interior of that simplex cannot be written as a convex combination of only $n$ of the vertices, because any $n$ vertices span a facet lying in a supporting hyperplane of the simplex, and interior points lie strictly on one side of that hyperplane. The theorem says nothing about uniqueness of the representation; even in a square, the centre has many convex-combination descriptions. It is also specifically finite-dimensional: in infinite-dimensional spaces, convex hull representations need not admit a uniformly bounded number of generating points. The planar case gives the simplest picture of the bound.
[example: Points in a Triangle]
Let $S=\{p_1,p_2,p_3\}\subseteq\mathbb R^2$ be the three non-collinear vertices of a triangle. The filled triangle is
\begin{align*}
T=\{\lambda_1p_1+\lambda_2p_2+\lambda_3p_3:\lambda_1,\lambda_2,\lambda_3\ge 0,\ \lambda_1+\lambda_2+\lambda_3=1\}.
\end{align*}
Thus every point of $T$ is visibly a convex combination of the three vertices.
If $x$ lies on the edge joining $p_1$ and $p_2$, then for some $\theta\in[0,1]$,
\begin{align*}
x=(1-\theta)p_1+\theta p_2.
\end{align*}
Writing this in the three-vertex form gives
\begin{align*}
x=(1-\theta)p_1+\theta p_2+0p_3,
\end{align*}
so the boundary point uses only the two vertices of its edge. The same argument applies to the other two edges.
More generally, the preceding Carathéodory theorem says that if $S\subseteq\mathbb R^2$ is any set and $x\in\operatorname{conv}(S)$, then
\begin{align*}
x=\lambda_0x_0+\lambda_1x_1+\lambda_2x_2
\end{align*}
for some $x_0,x_1,x_2\in S$ and coefficients $\lambda_0,\lambda_1,\lambda_2\ge 0$ satisfying
\begin{align*}
\lambda_0+\lambda_1+\lambda_2=1.
\end{align*}
The triangle is the sharp planar model for this bound: three vertices suffice for every point in the filled region, while points on an edge need only the two vertices spanning that edge.
[/example]
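The coefficients $\lambda_1,\lambda_2,\lambda_3$ in the triangle can be computed explicitly by solving a $2\times 2$ linear system. The sketch below uses Cramer's rule; the function name `barycentric` and the sample points are illustrative assumptions. It also revisits the centre of a square, which lies in two different triangles spanned by vertices of the square and therefore has more than one representation.

```python
def barycentric(p1, p2, p3, x):
    """Solve x = l1*p1 + l2*p2 + l3*p3 with l1 + l2 + l3 = 1 (Cramer's rule)."""
    ux, uy = p2[0] - p1[0], p2[1] - p1[1]
    vx, vy = p3[0] - p1[0], p3[1] - p1[1]
    dx, dy = x[0] - p1[0], x[1] - p1[1]
    det = ux * vy - vx * uy
    if det == 0:
        raise ValueError("the three vertices are collinear")
    l2 = (dx * vy - vx * dy) / det
    l3 = (ux * dy - dx * uy) / det
    return 1 - l2 - l3, l2, l3

def vertices_used(coeffs, tol=1e-12):
    """Number of vertices that carry a nonzero coefficient."""
    return sum(1 for c in coeffs if abs(c) > tol)

p1, p2, p3 = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
interior_point = (1.0, 1.0)
edge_point = (2.0, 0.0)          # midpoint of the edge joining p1 and p2

print(barycentric(p1, p2, p3, interior_point), vertices_used(barycentric(p1, p2, p3, interior_point)))
print(barycentric(p1, p2, p3, edge_point), vertices_used(barycentric(p1, p2, p3, edge_point)))

# Centre of the unit square, expressed in two different triangles of square vertices.
centre = (0.5, 0.5)
print(barycentric((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), centre))
print(barycentric((1.0, 0.0), (1.0, 1.0), (0.0, 1.0), centre))
```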
[Carathéodory's theorem](/theorems/4079) represents points using members of an arbitrary generating set. For compact convex sets, the natural generating set is intrinsic: the extreme points of the set itself. This leads to the finite-dimensional Krein-Milman theorem, which is the structural result behind vertex optimality for polytopes.
[quotetheorem:4093]
[citeproof:4093]
This theorem justifies the common principle that compact convex feasible regions are controlled by their extreme points, and its hypotheses mark real boundaries of the statement. Compactness cannot be dropped: the open interval $(0,1)$ is convex but has no extreme points, and noncompact closed convex sets such as the half-line $[0,\infty)$ require recession directions in addition to extreme points. The point $0$ is extreme in the half-line, but convex combinations of extreme points alone cannot generate the unbounded ray; the missing data is the direction $d=1$ in the recession cone. Closedness is part of compactness in finite dimensions; without it, boundary points needed for the induction may be missing. The theorem is also an existence and representation result, not a uniqueness theorem: a point in a square may be represented using extreme points in several different ways. Linear programming is the polyhedral special case: when an optimum exists over a polytope, some optimum occurs at a vertex. The final example states this optimisation consequence in the language developed above.
[example: Linear Objective on a Compact Convex Set]
[claim]If $C\subseteq\mathbb R^n$ is nonempty, compact, and convex, then every linear functional $x\mapsto a\cdot x$ has a maximiser that is an extreme point of $C$.[/claim]
[proof]Set
\begin{align*}
f(x)=a\cdot x.
\end{align*}
The map $f$ is continuous because it is linear, so the *Extreme Value Theorem* gives a point $p\in C$ such that
\begin{align*}
f(p)=\max_{x\in C}f(x).
\end{align*}
Write
\begin{align*}
M=\max_{x\in C}a\cdot x.
\end{align*}
Define the maximiser set
\begin{align*}
F=\{x\in C:a\cdot x=M\}.
\end{align*}
Since $p\in F$, the set $F$ is nonempty. Also,
\begin{align*}
F=C\cap\{x\in\mathbb R^n:a\cdot x=M\}.
\end{align*}
The hyperplane $\{x:a\cdot x=M\}$ is closed, so $F$ is a closed subset of the compact set $C$; hence $F$ is compact.
We check that $F$ is convex. Let $u,v\in F$ and $\lambda\in[0,1]$. Since $C$ is convex,
\begin{align*}
\lambda u+(1-\lambda)v\in C.
\end{align*}
By linearity,
\begin{align*}
a\cdot(\lambda u+(1-\lambda)v)=\lambda(a\cdot u)+(1-\lambda)(a\cdot v).
\end{align*}
Since $u,v\in F$, we have $a\cdot u=M$ and $a\cdot v=M$, so
\begin{align*}
\lambda(a\cdot u)+(1-\lambda)(a\cdot v)=\lambda M+(1-\lambda)M.
\end{align*}
Finally,
\begin{align*}
\lambda M+(1-\lambda)M=(\lambda+1-\lambda)M=M.
\end{align*}
Thus $\lambda u+(1-\lambda)v\in F$, so $F$ is convex. By the definition of exposed face, $F$ is the exposed face of $C$ exposed by $x\mapsto a\cdot x$.
Now apply the finite-dimensional extreme-point representation theorem just quoted to the nonempty compact convex set $F$. It gives
\begin{align*}
F=\operatorname{conv}(\operatorname{ext}(F)).
\end{align*}
Since $F$ is nonempty, $\operatorname{ext}(F)$ is nonempty; choose $q\in\operatorname{ext}(F)$. Then $q\in F$, so
\begin{align*}
a\cdot q=M.
\end{align*}
It remains to show that $q$ is extreme in $C$, not only in $F$. Suppose
\begin{align*}
q=\lambda x+(1-\lambda)y
\end{align*}
with $x,y\in C$ and $\lambda\in(0,1)$. Since $M$ is the maximum of $a\cdot z$ over $z\in C$,
\begin{align*}
a\cdot x\le M
\end{align*}
and
\begin{align*}
a\cdot y\le M.
\end{align*}
Since $q\in F$,
\begin{align*}
M=a\cdot q.
\end{align*}
Also,
\begin{align*}
a\cdot q=a\cdot(\lambda x+(1-\lambda)y).
\end{align*}
By linearity again,
\begin{align*}
a\cdot(\lambda x+(1-\lambda)y)=\lambda(a\cdot x)+(1-\lambda)(a\cdot y).
\end{align*}
Therefore
\begin{align*}
M=\lambda(a\cdot x)+(1-\lambda)(a\cdot y).
\end{align*}
If $a\cdot x<M$, then $\lambda>0$ gives
\begin{align*}
\lambda(a\cdot x)<\lambda M.
\end{align*}
Moreover, $a\cdot y\le M$ and $1-\lambda>0$ give
\begin{align*}
(1-\lambda)(a\cdot y)\le (1-\lambda)M.
\end{align*}
Adding the last two inequalities yields
\begin{align*}
\lambda(a\cdot x)+(1-\lambda)(a\cdot y)<\lambda M+(1-\lambda)M.
\end{align*}
Since
\begin{align*}
\lambda M+(1-\lambda)M=M,
\end{align*}
this contradicts
\begin{align*}
M=\lambda(a\cdot x)+(1-\lambda)(a\cdot y).
\end{align*}
Hence $a\cdot x=M$. The same argument with $x$ and $y$ exchanged gives
\begin{align*}
a\cdot y=M.
\end{align*}
Thus $x,y\in F$. Since $q$ is extreme in $F$ and
\begin{align*}
q=\lambda x+(1-\lambda)y,
\end{align*}
we must have
\begin{align*}
x=y=q.
\end{align*}
Therefore $q$ is an extreme point of $C$. Since $q\in F$, it maximises $a\cdot x$ over $C$.[/proof]
Thus compact convex feasible regions always have extreme-point solutions for linear maximisation problems.
[/example]
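When the compact convex set is the convex hull of finitely many points, the claim reduces to a finite search: a linear functional on the hull is maximised at one of the generating points. The sketch below illustrates this on an assumed pentagon; the data and the helper names are illustrative. Random convex combinations are used only as a sanity check, not as a proof.

```python
import random

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def best_generator(a, points):
    """A point of the finite set maximising a.x; it also maximises a.x over the convex hull."""
    return max(points, key=lambda p: dot(a, p))

def random_hull_point(points, rng):
    """A random convex combination of the given points."""
    weights = [rng.random() for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return tuple(sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(dim))

pentagon = [(0.0, 0.0), (2.0, 0.0), (3.0, 1.5), (1.0, 3.0), (-1.0, 1.0)]
a = (1.0, 2.0)
q = best_generator(a, pentagon)
M = dot(a, q)

rng = random.Random(0)
largest_sampled = max(dot(a, random_hull_point(pentagon, rng)) for _ in range(1000))
print(q, M, largest_sampled <= M + 1e-9)
```

In this instance the maximiser among the generators is unique, so it is an exposed point of the hull and hence an extreme point. When several generators tie, the maximiser set $F$ is a larger face and the argument in the proof is what selects an extreme point inside it.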
# 2. Convex Functions and Subdifferentials
Convex optimisation turns geometric convexity into inequalities for functions. The chapter assumes Chapter 1's material on convex sets, affine hyperplanes, relative interiors, and separation theorems in finite-dimensional Euclidean space. After convex sets and separation, the next question is how to recognise convexity analytically and how to express optimality when a function is not differentiable. This chapter develops the epigraph viewpoint, subgradients, normal cones, and conjugacy, which are the objects that later make duality and KKT conditions systematic.
## Convex Functions Through Epigraphs
The first problem is to decide which extended-valued functions behave like convex objects. In optimisation it is useful to allow the value $+\infty$ so that constraints can be absorbed into the objective, but this only works if convexity is formulated in a way that still sees the feasible domain.
[definition: Extended-Valued Convex Function]
A function $f: \mathbb R^n \to (-\infty,+\infty]$ is convex if
\begin{align*}
f(tx+(1-t)y) \le t f(x) + (1-t)f(y)
\end{align*}
for all $x,y \in \mathbb R^n$ and all $t \in [0,1]$, with the usual extended-real arithmetic. Its effective domain is
\begin{align*}
\operatorname{dom} f := \{x \in \mathbb R^n : f(x) < +\infty\}.
\end{align*}
[/definition]
The effective domain is the part of space on which the optimisation problem has finite cost. To exclude degenerate objectives before discussing minimisation, we isolate the functions that have at least one feasible point and never take the value $-\infty$.
[definition: Proper Function]
A function $f: \mathbb R^n \to (-\infty,+\infty]$ is proper if $\operatorname{dom} f \neq \varnothing$ and $f(x)>-\infty$ for every $x \in \mathbb R^n$.
[/definition]
Properness rules out functions that cannot represent a meaningful minimisation problem. The next task is to connect function convexity to the separation theory of convex sets, and the right geometric object is the region lying above the graph.
[definition: Epigraph]
The epigraph of $f: \mathbb R^n \to (-\infty,+\infty]$ is
\begin{align*}
\operatorname{epi} f := \{(x,r) \in \mathbb R^n \times \mathbb R : f(x) \le r\}.
\end{align*}
[/definition]
The epigraph records all admissible heights above each point in the domain. To use separation or closed-set arguments, however, we need to know when this lifted object is actually a convex set. The obstruction is that averaging two points above the graph could land below the graph at the averaged input; the equivalence below says that this failure is exactly the failure of the function's convexity inequality.
[quotetheorem:6668]
[citeproof:6668]
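The epigraph characterisation suggests a simple necessary test: average two points on the graph and ask whether the average still lies in the epigraph. The sketch below applies this midpoint test on a small grid to $x^2$ and to $-x^2$; the function names and the grid are illustrative assumptions, and passing the test on a finite grid does not by itself prove convexity.

```python
def midpoint_in_epigraph(f, x, y):
    """Average the graph points (x, f(x)) and (y, f(y)) and test membership in epi f."""
    xm = 0.5 * (x + y)
    rm = 0.5 * (f(x) + f(y))
    return f(xm) <= rm

def passes_on_grid(f, grid):
    return all(midpoint_in_epigraph(f, x, y) for x in grid for y in grid)

def square(x):
    return x * x

def neg_square(x):
    return -(x * x)

grid = [k / 4 for k in range(-8, 9)]     # the points -2, -1.75, ..., 2

print(passes_on_grid(square, grid))
print(passes_on_grid(neg_square, grid))
# The midpoint of (-1, -1) and (1, -1) is (0, -1), which lies below the graph value 0.
print(midpoint_in_epigraph(neg_square, -1.0, 1.0))
```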
The epigraph characterisation allows convex functions to inherit geometric closedness questions from sets. The hypothesis that the whole epigraph is convex is essential: a function may have a convex effective domain but still fail convexity if its graph bends above its chords. For instance, $f(x)=-x^2$ on $\mathbb R$ has a convex domain, but the midpoint of two epigraph points can lie below the graph, so separation arguments for convex sets cannot be applied to its epigraph. The theorem does not say that convexity of the domain is enough, nor does it give any closedness or existence information. The next definition is needed to rule out upward spikes in the value at a [limit point](/page/Limit%20Point), because such spikes leave holes in the epigraph and can destroy stability of minimisers.
[definition: Lower Semicontinuous Function]
A function $f: \mathbb R^n \to (-\infty,+\infty]$ is lower semicontinuous at $x_0 \in \mathbb R^n$ if for every sequence $(x_k)$ with $x_k \to x_0$,
\begin{align*}
f(x_0) \le \liminf_{k\to\infty} f(x_k).
\end{align*}
It is lower semicontinuous if it is lower semicontinuous at every $x_0 \in \mathbb R^n$.
[/definition]
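The inequality in the definition can be probed along a single sequence. The sketch below is a rough numerical illustration, assuming the particular sequence $x_k=x_0+2^{-k}$ and approximating the $\liminf$ by the minimum over a finite tail; the functions `spike` and `dip` are hypothetical test cases. A failure along one sequence disproves lower semicontinuity at $x_0$, whereas a pass along one sequence proves nothing.

```python
def spike(x):
    return 1.0 if x == 0 else 0.0      # isolated high value at 0

def dip(x):
    return 0.0 if x == 0 else 1.0      # downward jump at 0

def tail_liminf(f, x0, terms=60):
    """Approximate liminf f(x_k) along x_k = x0 + 2**-k by the minimum over the last half of the terms."""
    values = [f(x0 + 2.0 ** -k) for k in range(1, terms + 1)]
    return min(values[terms // 2:])

def lsc_along_sequence(f, x0):
    """Check f(x0) <= (approximate) liminf f(x_k) for the one sequence above."""
    return f(x0) <= tail_liminf(f, x0)

print(lsc_along_sequence(spike, 0.0))
print(lsc_along_sequence(dip, 0.0))
```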
Lower semicontinuity is the analytic condition that prevents the function value at a limit point from sitting above the limiting lower values nearby. It permits downward jumps, but it forbids an isolated high value such as $f(0)=1$ and $f(x)=0$ for $x\neq 0$. The following definition packages it with convexity, since later existence and duality arguments need both hypotheses together.
[definition: Closed Convex Function]
A function $f: \mathbb R^n \to (-\infty,+\infty]$ is closed convex if it is convex and lower semicontinuous.
[/definition]
The word closed is justified by the epigraph, but this has to be checked rather than assumed. A sequence of points can remain above the graph while its limit falls into a gap created by an upward jump of the function value. The equivalence below identifies lower semicontinuity as exactly the condition that closes those epigraph gaps and makes limiting epigraph arguments legitimate.
[quotetheorem:6669]
[citeproof:6669]
Closedness of the epigraph is the set-theoretic form of lower semicontinuity, so limit points of feasible heights stay feasible. Without it, the infimum can be approached by nearby points without being attained at the limit; for example, the function equal to $|x|$ for $x\ne 0$ and to $1$ at $x=0$ is not lower semicontinuous at $0$, and its infimum $0$ is approached as $x\to 0$ but never attained. Convexity is not part of this equivalence, but convexity is what makes the closed epigraph useful for separation and conjugacy later. Closedness handles limiting points, but it does not stop a minimising sequence from escaping to infinity. The next definition is needed to express the growth condition that makes low sublevel sets bounded.
[definition: Coercive Function]
A function $f: \mathbb R^n \to (-\infty,+\infty]$ is coercive if
\begin{align*}
f(x) \to +\infty \quad \text{as } |x| \to \infty.
\end{align*}
[/definition]
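One way to see what the definition controls is to look at sublevel sets on larger and larger windows. The sketch below compares $x^2$, which is coercive, with $e^x$, which is not; the function name `sublevel_extent`, the grid resolution, and the level are illustrative assumptions. For a coercive function the reported extent stabilises once the window is large enough, while for $e^x$ it keeps growing with the window.

```python
import math

def sublevel_extent(f, level, radius, steps=2000):
    """Largest |x| on a uniform grid of [-radius, radius] with f(x) <= level (None if there is none)."""
    points = [-radius + 2 * radius * i / steps for i in range(steps + 1)]
    hits = [abs(x) for x in points if f(x) <= level]
    return max(hits) if hits else None

def square(x):
    return x * x

for radius in (10.0, 100.0):
    print(radius, sublevel_extent(square, 1.0, radius), sublevel_extent(math.exp, 1.0, radius))

# A minimising sequence for exp that escapes to infinity: values tend to 0 but 0 is never attained.
print([math.exp(-k) for k in (1, 10, 100)])
```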
Coercivity turns minimisation on a noncompact space into minimisation on a compact sublevel set. Existence can fail in two different ways: a minimising sequence may run off to infinity, or it may converge to a point where the limiting value jumps upward. The result below combines coercivity, lower semicontinuity, and properness to rule out these failures in finite dimensions.
[quotetheorem:6670]
[citeproof:6670]
The theorem explains why closedness and growth are natural assumptions rather than technical decoration. Lower semicontinuity prevents loss of the minimiser at a limit point, while coercivity prevents minimising sequences from escaping to infinity, as happens for $f(x)=e^x$ on $\mathbb R$. Properness also matters: an everywhere $+\infty$ function has no finite feasible value, and a function allowed to take $-\infty$ is not a genuine finite minimisation model. A standard model is a constrained quadratic objective, where the constraint is encoded through an extended indicator and coercivity comes from positive definiteness.
[example: Quadratic Plus Indicator]
Let $Q \in \mathbb R^{n\times n}$ be symmetric positive semidefinite, let $b\in\mathbb R^n$, and let $C\subset\mathbb R^n$ be nonempty, closed, and convex. Define
\begin{align*}
f(x)=\frac12 x^\top Qx+b^\top x+\mathbb{1}^{\infty}_C(x).
\end{align*}
Its effective domain is exactly $C$, because $\mathbb{1}^{\infty}_C(x)=0$ for $x\in C$ and $\mathbb{1}^{\infty}_C(x)=+\infty$ for $x\notin C$.
Write $q(x)=\frac12 x^\top Qx+b^\top x$. For $x,y\in\mathbb R^n$ and $t\in[0,1]$,
\begin{align*}
q(tx+(1-t)y)=\frac12 (tx+(1-t)y)^\top Q(tx+(1-t)y)+b^\top(tx+(1-t)y).
\end{align*}
Expanding the quadratic term and using symmetry of $Q$, so that $x^\top Qy=y^\top Qx$, gives
\begin{align*}
q(tx+(1-t)y)=\frac12 t^2x^\top Qx+t(1-t)x^\top Qy+\frac12(1-t)^2y^\top Qy+t b^\top x+(1-t)b^\top y.
\end{align*}
Also,
\begin{align*}
tq(x)+(1-t)q(y)=\frac12 t x^\top Qx+\frac12(1-t)y^\top Qy+t b^\top x+(1-t)b^\top y.
\end{align*}
Subtracting the expanded expression for $q(tx+(1-t)y)$ gives
\begin{align*}
tq(x)+(1-t)q(y)-q(tx+(1-t)y)=\frac12 t(1-t)x^\top Qx-t(1-t)x^\top Qy+\frac12t(1-t)y^\top Qy.
\end{align*}
The right-hand side factors as
\begin{align*}
\frac12 t(1-t)x^\top Qx-t(1-t)x^\top Qy+\frac12t(1-t)y^\top Qy=\frac12 t(1-t)(x-y)^\top Q(x-y).
\end{align*}
Since $Q$ is positive semidefinite, $(x-y)^\top Q(x-y)\ge 0$, and therefore
\begin{align*}
q(tx+(1-t)y)\le tq(x)+(1-t)q(y).
\end{align*}
Thus $q$ is convex. If $x,y\in C$, then $tx+(1-t)y\in C$ by convexity of $C$, and if either point is outside $C$, the corresponding extended-indicator value is $+\infty$; hence $\mathbb{1}^{\infty}_C$ is convex. Therefore $f=q+\mathbb{1}^{\infty}_C$ is convex.
The quadratic function $q$ is continuous. The indicator term is lower semicontinuous because $C$ is closed: if $x_k\to x$ and $\liminf_k \mathbb{1}^{\infty}_C(x_k)<+\infty$, then some subsequence lies in $C$, so closedness gives $x\in C$ and $\mathbb{1}^{\infty}_C(x)=0$. Hence $f$ is closed convex.
Now assume $Q$ is positive definite. Let $\lambda_{\min}>0$ be the smallest eigenvalue of $Q$. Then
\begin{align*}
x^\top Qx\ge \lambda_{\min}|x|^2
\end{align*}
for every $x\in\mathbb R^n$, and Cauchy-Schwarz gives
\begin{align*}
b^\top x\ge -|b|\,|x|.
\end{align*}
For $x\in C$,
\begin{align*}
f(x)=\frac12 x^\top Qx+b^\top x\ge \frac12\lambda_{\min}|x|^2-|b|\,|x|.
\end{align*}
Writing $r=|x|$, the lower bound is
\begin{align*}
\frac12\lambda_{\min}r^2-|b|r=r\left(\frac12\lambda_{\min}r-|b|\right).
\end{align*}
As $r\to+\infty$, the factor $\frac12\lambda_{\min}r-|b|$ tends to $+\infty$, so the product tends to $+\infty$. For $x\notin C$, $f(x)=+\infty$, so $f$ is coercive on $\mathbb R^n$. Since $C$ is nonempty, $f$ is proper, and the *Weierstrass Theorem for Coercive Closed Functions* gives a minimiser of $f$. This minimisation is exactly the constrained quadratic problem $\min_{x\in C}\{\frac12x^\top Qx+b^\top x\}$ written as an unconstrained extended-valued problem.
[/example]
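The example can be explored numerically. The following sketch is illustrative only: the matrix `Q`, the vector `b`, the box $C=[0,1]^2$, and all helper names are assumptions chosen for the demonstration, and projected gradient descent is a swapped-in solution method that the text itself does not discuss.

```python
import math

# Hypothetical data: Q is symmetric positive definite with eigenvalues 1 and 3.
Q = [[2.0, 1.0], [1.0, 2.0]]
b = [-4.0, 1.0]
LO, HI = 0.0, 1.0        # C = [0, 1]^2, a nonempty closed convex set
LAMBDA_MIN = 1.0         # smallest eigenvalue of Q
LIPSCHITZ = 3.0          # largest eigenvalue of Q, used as the step-size scale

def in_C(x):
    return all(LO <= xi <= HI for xi in x)

def q(x):
    """Quadratic part q(x) = 1/2 x^T Q x + b^T x."""
    Qx = [sum(Q[i][j] * x[j] for j in range(2)) for i in range(2)]
    return 0.5 * sum(x[i] * Qx[i] for i in range(2)) + sum(b[i] * x[i] for i in range(2))

def f(x):
    """Extended-valued objective: q plus the extended indicator of C."""
    return q(x) if in_C(x) else math.inf

def coercive_lower_bound(x):
    """The bound 1/2 lambda_min r^2 - |b| r with r = |x|."""
    r = math.hypot(*x)
    return 0.5 * LAMBDA_MIN * r * r - math.hypot(*b) * r

def projected_gradient(x, steps=50):
    """Gradient step on q followed by projection (clipping) onto the box C."""
    for _ in range(steps):
        grad = [sum(Q[i][j] * x[j] for j in range(2)) + b[i] for i in range(2)]
        x = [min(HI, max(LO, x[i] - grad[i] / LIPSCHITZ)) for i in range(2)]
    return x

x_star = projected_gradient([0.5, 0.5])
```

With this data the iteration settles at the corner $(1,0)$, where the unconstrained minimiser $(3,-2)$ has been pushed back into the box, and `f` returns `math.inf` at every infeasible point.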
## Jensen Convexity and First-Order Information
The next question is how to test convexity without checking every chord directly. Differentiability turns convexity into a first-order supporting inequality, while nondifferentiability replaces gradients by subgradients.
[quotetheorem:6671]
[citeproof:6671]
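A quick numerical check of Jensen's inequality can make the role of its hypotheses concrete. The sketch below is illustrative: the helper name `jensen_gap` and the sample points and weights are assumptions, not part of the course text.

```python
import math

def jensen_gap(f, points, weights):
    """Return sum_i w_i f(x_i) - f(sum_i w_i x_i) for scalar points.

    For convex f and weights that are nonnegative and sum to 1,
    Jensen's inequality says this gap is nonnegative.
    """
    assert all(w >= 0 for w in weights), "weights must be nonnegative"
    assert abs(sum(weights) - 1.0) < 1e-12, "weights must sum to 1"
    mean = sum(w * x for w, x in zip(weights, points))
    return sum(w * f(x) for w, x in zip(weights, points)) - f(mean)

points = [-1.0, 0.5, 2.0]
weights = [0.2, 0.5, 0.3]

# Convex function: the gap is nonnegative.
gap_convex = jensen_gap(math.exp, points, weights)

# Concave function -x^2 at the midpoint of -1 and 1: the gap is negative.
gap_concave = jensen_gap(lambda x: -x * x, [-1.0, 1.0], [0.5, 0.5])
```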
[Jensen's inequality](/theorems/9) explains why averages are central in convex optimisation. Convexity is necessary: for $f(x)=-x^2$ on $\mathbb R$, the midpoint inequality fails because $f(0)=0$ while $\frac12 f(-1)+\frac12 f(1)=-1$. The weight assumptions are also necessary; negative weights or weights not summing to $1$ describe affine combinations rather than convex averages, and convexity gives no such bound. The result is finite and one-way: it controls values at convex combinations, but it does not measure how fast $f$ rises when we move away from a point. Optimality conditions need exactly that infinitesimal slope information, especially at boundary points of the domain where only one-sided movements may remain feasible. The following definition isolates the limiting slope along a ray leaving a point.
[definition: Directional Derivative]
Let $f: \mathbb R^n \to (-\infty,+\infty]$ and let $x \in \operatorname{dom} f$. The one-sided directional derivative of $f$ at $x$ is the map $f'(x;\cdot):\mathbb R^n\to[-\infty,+\infty]$ defined by
\begin{align*}
f'(x;d) := \lim_{t\downarrow 0} \frac{f(x+td)-f(x)}{t}
\end{align*}
for each $d\in\mathbb R^n$, when these limits exist in $[-\infty,+\infty]$.
[/definition]
For convex functions, the difference quotients in this definition are monotone in the step size, so the one-sided directional derivative exists as an extended real number. The next definition is needed because a nonsmooth convex function may have many affine lower supports at the same point.
[definition: Subgradient]
Let $f: \mathbb R^n \to (-\infty,+\infty]$ be convex and let $x \in \operatorname{dom} f$. A vector $g \in \mathbb R^n$ is a subgradient of $f$ at $x$ if
\begin{align*}
f(y) \ge f(x) + g \cdot (y-x) \quad \text{for all } y \in \mathbb R^n.
\end{align*}
The subdifferential of $f$ at $x$ is
\begin{align*}
\partial f(x) := \{g \in \mathbb R^n : g \text{ is a subgradient of } f \text{ at } x\}.
\end{align*}
[/definition]
A subgradient is the slope of a global affine lower support to the function.
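The subgradient inequality can be tested directly on sample points. The sketch below uses $f(x)=|x|$ on $\mathbb R$; the helper `is_subgradient_on_samples` is a hypothetical name, and a finite sample can only refute a candidate slope, not prove that it is a subgradient.

```python
def is_subgradient_on_samples(f, x, g, samples, tol=1e-12):
    """Check f(y) >= f(x) + g*(y - x) on finitely many sample points y."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in samples)

samples = [k / 10.0 for k in range(-50, 51)]   # grid on [-5, 5]

# At x = 0 every tested slope in [-1, 1] passes; a slope outside fails.
ok_slopes = [g for g in (-1.0, -0.3, 0.0, 0.7, 1.0)
             if is_subgradient_on_samples(abs, 0.0, g, samples)]
bad_slope = is_subgradient_on_samples(abs, 0.0, 1.5, samples)

# At x = 2 the function is differentiable and only the slope 1 survives.
slope_at_2 = [g for g in (0.5, 1.0, 1.5)
              if is_subgradient_on_samples(abs, 2.0, g, samples)]
```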
In the smooth case there is already a distinguished candidate slope, namely the gradient, but it is not automatic that this local derivative gives a lower bound valid at every other point. The issue is whether derivative information at one point controls the whole graph. The next result supplies the missing bridge: for differentiable convex functions, the gradient is not merely an infinitesimal slope but the unique affine support slope that gives the global first-order inequality.
[quotetheorem:6666]
[citeproof:6666]
The first-order inequality says more than local stationarity: for convex functions, a tangent hyperplane at one point gives a global lower bound. Differentiability is used to identify the supporting slope with the gradient; without differentiability, the absolute value function at the origin has many supporting slopes and no single gradient. Convexity is also essential: $f(x)=-x^2$ is differentiable, but the displayed inequality fails at $x=0$ and $y=1$. The openness of $U$ ensures that directional derivatives along small line segments are available at the base point; on a closed interval, endpoint behaviour needs a separate one-sided formulation. The theorem does not assert that every stationary point is optimal without convexity, and it does not cover nonsmooth functions. The following theorem is needed as the nonsmooth analogue of the familiar condition $\nabla f(x_*)=0$.
[quotetheorem:6673]
[citeproof:6673]
The Fermat rule converts a global optimisation statement into the algebraic inclusion $0\in\partial f(x_*)$. Its force comes from convexity: in a nonconvex problem, vanishing first-order information can describe a maximum or a saddle, while the subgradient inequality used here is global. Properness ensures that $x_*$ is a point with finite value, so the comparison $f(y)\ge f(x_*)$ is meaningful; if $f\equiv+\infty$, there is no point in $\operatorname{dom} f$, and if a function is allowed to take $-\infty$, finite minimisation no longer has the same meaning. The rule does not guarantee uniqueness of the minimiser, nor does it compute $\partial f(x_*)$ for us. It is most useful when the subdifferential can be computed explicitly. Norms are the basic nonsmooth examples; their subgradients are described by dual norms and will reappear as regularisation certificates in Chapters 6 and 11.
[example: Subgradients of Norms]
Let $\|\cdot\|$ be a norm on $\mathbb R^n$, and let its dual norm $\|\cdot\|_*$ be defined by
\begin{align*}
\|g\|_*=\sup_{\|u\|\le 1} g\cdot u.
\end{align*}
We first compute the subdifferential at a nonzero point $x$. Suppose $g\in\partial\|x\|$. The subgradient inequality says
\begin{align*}
\|y\|\ge \|x\|+g\cdot(y-x)
\end{align*}
for every $y\in\mathbb R^n$. Taking $y=0$ gives
\begin{align*}
0\ge \|x\|-g\cdot x,
\end{align*}
so $g\cdot x\ge \|x\|$. Taking $y=2x$ gives
\begin{align*}
2\|x\|\ge \|x\|+g\cdot x,
\end{align*}
so $g\cdot x\le \|x\|$. Hence
\begin{align*}
g\cdot x=\|x\|.
\end{align*}
Substituting this equality back into the subgradient inequality gives
\begin{align*}
\|y\|\ge g\cdot y
\end{align*}
for every $y$. Therefore, for every $u$ with $\|u\|\le 1$,
\begin{align*}
g\cdot u\le 1,
\end{align*}
so $\|g\|_*\le 1$. Since $x\neq 0$, the vector $u=x/\|x\|$ satisfies $\|u\|=1$, and
\begin{align*}
g\cdot u=g\cdot \frac{x}{\|x\|}=\frac{g\cdot x}{\|x\|}=1.
\end{align*}
Thus $\|g\|_*\ge 1$, and therefore $\|g\|_*=1$.
Conversely, suppose $\|g\|_*=1$ and $g\cdot x=\|x\|$. By the definition of the dual norm, every $y\neq 0$ satisfies
\begin{align*}
g\cdot \frac{y}{\|y\|}\le \|g\|_*=1,
\end{align*}
so $g\cdot y\le \|y\|$; the same inequality is also true for $y=0$. Hence, for every $y$,
\begin{align*}
\|y\|\ge g\cdot y=\|x\|+g\cdot(y-x),
\end{align*}
which is exactly the subgradient inequality. Therefore
\begin{align*}
\partial \|x\| = \{g \in \mathbb R^n : \|g\|_* = 1,\ g\cdot x=\|x\|\}
\end{align*}
for $x\neq 0$.
At the origin, $g\in\partial\|0\|$ iff
\begin{align*}
\|y\|\ge \|0\|+g\cdot(y-0)=g\cdot y
\end{align*}
for every $y\in\mathbb R^n$. This condition is equivalent to
\begin{align*}
g\cdot u\le 1 \quad \text{for every } u \text{ with } \|u\|\le 1,
\end{align*}
which is exactly $\|g\|_*\le 1$. Hence
\begin{align*}
\partial \|0\| = \{g \in \mathbb R^n : \|g\|_* \le 1\}.
\end{align*}
Thus the nonsmoothness of a norm at the origin is encoded by the whole dual unit ball, while at a nonzero point only the dual unit vectors supporting $x$ remain.
[/example]
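As a concrete instance, take the $\ell_1$ norm, whose dual norm is the $\ell_\infty$ norm. The following sketch is illustrative only; the function names and the test vector are assumptions. It builds one subgradient from coordinate signs and checks the two conditions $\|g\|_*=1$ and $g\cdot x=\|x\|$ from the example, then samples the subgradient inequality.

```python
import random

def norm1(v):
    return sum(abs(vi) for vi in v)

def norm_inf(v):
    """Dual norm of the l1 norm."""
    return max(abs(vi) for vi in v)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sign_subgradient(x):
    """One element of the subdifferential of the l1 norm at a nonzero x.

    Coordinates with x_i = 0 may take any value in [-1, 1]; 0 is chosen here.
    """
    return [(vi > 0) - (vi < 0) for vi in x]

x = [3.0, 0.0, -2.0]
g = sign_subgradient(x)          # coordinate signs of x

# Sample the subgradient inequality ||y|| >= ||x|| + g . (y - x).
random.seed(0)
trial_points = [[random.uniform(-5, 5) for _ in x] for _ in range(1000)]
inequality_holds = all(
    norm1(y) >= norm1(x) + dot(g, [yi - xi for yi, xi in zip(y, x)]) - 1e-9
    for y in trial_points
)
```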
## Normal Cones and Constrained Optimality
The next problem is to express constraints without writing separate feasible directions by hand. The normal cone packages the first-order obstruction created by a convex set, and it is the subdifferential of the set's extended indicator function.
[definition: Extended Indicator Function]
Let $C \subset \mathbb R^n$. The extended indicator function of $C$ is the function $\mathbb{1}^{\infty}_C:\mathbb R^n\to(-\infty,+\infty]$ defined by
\begin{align*}
\mathbb{1}^{\infty}_C(x)=0 \quad \text{for } x\in C,
\end{align*}
and $\mathbb{1}^{\infty}_C(x)=+\infty$ for $x\notin C$.
[/definition]
The extended indicator function turns a constrained problem $\min_{x\in C} f(x)$ into the unconstrained extended-valued problem $\min_x f(x)+\mathbb{1}^{\infty}_C(x)$. This is different from the ordinary set indicator $\mathbb{1}_C$ with values $0$ and $1$: the value $+\infty$ removes infeasible points from minimisation rather than merely penalising them by one unit. The next definition is needed to express first-order behaviour of this extended indicator through vectors that point outward from the feasible set.
[definition: Normal Cone]
Let $C \subset \mathbb R^n$ be convex and let $x \in C$. The normal cone of $C$ at $x$ is
\begin{align*}
N_C(x) := \{v \in \mathbb R^n : v\cdot (y-x) \le 0 \text{ for all } y\in C\}.
\end{align*}
The normal-cone assignment is the set-valued map $N_C:C\rightrightarrows\mathbb R^n$ sending $x$ to $N_C(x)$.
[/definition]
The normal cone records the outward directions available at a feasible point.
To use this geometry in first-order optimisation, it must be connected to the subdifferential calculus already developed for extended-valued functions. The obstruction is that a constraint contributes no ordinary slope inside the feasible set, but it should contribute outward directions at the boundary that prevent feasible descent. The following identification is the mechanism that turns those outward directions into subgradients of the extended indicator, putting constrained geometry into the same algebra as unconstrained nonsmooth optimisation.
[quotetheorem:6674]
[citeproof:6674]
The identification turns constrained optimisation into an application of the Fermat rule to $f+\mathbb{1}^{\infty}_C$. Closedness is not used in the algebra of the displayed equality at a point of $C$, but it is the standing condition that makes the extended indicator lower semicontinuous; for example, if $C=(0,1)\subset\mathbb R$, then $\mathbb{1}^{\infty}_C$ is not lower semicontinuous at $0$. Convexity is the condition that makes the indicator convex; for $C=\{-1,1\}$, the indicator is not convex because the midpoint $0$ is infeasible. The equality itself does not assert that every set has a useful global first-order calculus, and outside convex closed sets the normal object can fail to certify global optimality. The following theorem is needed because it says that the negative objective slope must be balanced by an outward normal to the feasible set.
[quotetheorem:6675]
[citeproof:6675]
For constrained convex problems, the normal cone condition is both necessary and sufficient because the objective has global affine lower supports. Convexity of the objective is necessary for sufficiency: on $C=[-1,1]$, the differentiable function $f(x)=-x^2$ satisfies the normal cone condition at $x_*=0$, but $0$ is not a minimiser. Convexity of the feasible set is also necessary for this global comparison; for disconnected feasible sets, a normal computed at one component need not compare the objective with points in another component. Differentiability is what lets the slope be written as $\nabla f(x_*)$; nonsmooth objectives require the subdifferential form $0\in\partial(f+\mathbb{1}^{\infty}_C)(x_*)$. The theorem does not assert uniqueness, and it does not provide constraint multipliers until the normal cone is computed. For polyhedral feasible sets, the normal cone has a concrete multiplier description. This is the local form of the Lagrange multiplier picture that later becomes KKT theory.
[example: Normal Cone of a Polyhedron]
[claim]For
\begin{align*}
P=\{x\in\mathbb R^n : a_i\cdot x \le b_i,\ i=1,\dots,m\},
\end{align*}
and $x\in P$, with active set $I(x)=\{i:a_i\cdot x=b_i\}$, one has
\begin{align*}
N_P(x)=\left\{\sum_{i\in I(x)}\lambda_i a_i : \lambda_i\ge 0\right\}.
\end{align*}
[/claim]
[proof]Let
\begin{align*}
K:=\left\{\sum_{i\in I(x)}\lambda_i a_i : \lambda_i\ge 0\right\}.
\end{align*}
First take $v\in K$, so
\begin{align*}
v=\sum_{i\in I(x)}\lambda_i a_i
\end{align*}
with $\lambda_i\ge0$. For any $y\in P$ and any active index $i\in I(x)$,
\begin{align*}
a_i\cdot(y-x)=a_i\cdot y-a_i\cdot x\le b_i-b_i=0.
\end{align*}
Therefore
\begin{align*}
v\cdot(y-x)=\left(\sum_{i\in I(x)}\lambda_i a_i\right)\cdot(y-x).
\end{align*}
By distributivity of the dot product,
\begin{align*}
\left(\sum_{i\in I(x)}\lambda_i a_i\right)\cdot(y-x)=\sum_{i\in I(x)}\lambda_i\, a_i\cdot(y-x).
\end{align*}
Each term satisfies $\lambda_i\,a_i\cdot(y-x)\le0$, so
\begin{align*}
v\cdot(y-x)\le0.
\end{align*}
Thus $v\in N_P(x)$, and $K\subseteq N_P(x)$.
Conversely, take $v\in N_P(x)$ and suppose $v\notin K$. The cone $K$ is closed and convex, so finite-dimensional separation gives a vector $d\in\mathbb R^n$ such that $d\cdot k\le0$ for every $k\in K$, while
\begin{align*}
d\cdot v>0.
\end{align*}
For each active index $i\in I(x)$, the vector $a_i$ belongs to $K$, hence
\begin{align*}
a_i\cdot d=d\cdot a_i\le0.
\end{align*}
For an inactive index $i\notin I(x)$, set
\begin{align*}
s_i:=b_i-a_i\cdot x.
\end{align*}
Then $s_i>0$. Choose $\varepsilon>0$ so small that $\varepsilon a_i\cdot d\le s_i$ for every inactive $i$ with $a_i\cdot d>0$; inactive indices with $a_i\cdot d\le0$ impose no restriction. For active $i$,
\begin{align*}
a_i\cdot(x+\varepsilon d)=a_i\cdot x+\varepsilon a_i\cdot d\le b_i.
\end{align*}
For inactive $i$ with $a_i\cdot d>0$,
\begin{align*}
a_i\cdot(x+\varepsilon d)=a_i\cdot x+\varepsilon a_i\cdot d\le a_i\cdot x+s_i=b_i.
\end{align*}
For inactive $i$ with $a_i\cdot d\le0$,
\begin{align*}
a_i\cdot(x+\varepsilon d)=a_i\cdot x+\varepsilon a_i\cdot d\le a_i\cdot x<b_i.
\end{align*}
Hence $x+\varepsilon d\in P$. Since $v\in N_P(x)$,
\begin{align*}
0\ge v\cdot((x+\varepsilon d)-x)=\varepsilon\, v\cdot d.
\end{align*}
Because $\varepsilon>0$, this implies $v\cdot d\le0$, contradicting $d\cdot v>0$. Therefore $v\in K$, so $N_P(x)\subseteq K$.[/proof]
Thus the normal cone of a polyhedron is generated exactly by the outward normals of the inequalities that are tight at the point.
[/example]
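The multiplier description can be checked on a small polyhedron. The sketch below is a hedged illustration: the unit square, the point `x`, and the multipliers are assumptions, and sampling feasible points only tests the normal-cone inequality rather than proving it.

```python
import random

# Unit square [0, 1]^2 written as {x : a_i . x <= b_i} with four inequalities.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
B = [1.0, 0.0, 1.0, 0.0]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def active_set(x, tol=1e-12):
    """Indices i with a_i . x = b_i (up to a tolerance)."""
    return [i for i, (a, b) in enumerate(zip(A, B)) if abs(dot(a, x) - b) <= tol]

def cone_element(x, multipliers):
    """Return sum over active i of lambda_i a_i for nonnegative multipliers."""
    idx = active_set(x)
    assert len(multipliers) == len(idx) and all(m >= 0 for m in multipliers)
    return [sum(m * A[i][k] for m, i in zip(multipliers, idx)) for k in range(2)]

x = [1.0, 0.0]                     # corner: constraints 0 and 3 are active
v = cone_element(x, [2.0, 0.5])    # 2*a_0 + 0.5*a_3

# Sample the defining inequality v . (y - x) <= 0 over feasible points y.
random.seed(1)
feasible = [[random.random(), random.random()] for _ in range(1000)]
is_normal = all(dot(v, [y[0] - x[0], y[1] - x[1]]) <= 1e-12 for y in feasible)

# A vector outside the cone, here (-1, 0), violates the inequality at y = (0, 0).
w = [-1.0, 0.0]
violated = dot(w, [0.0 - x[0], 0.0 - x[1]]) > 0
```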
## Conjugates and Biconjugates
The next question is how to encode all affine lower bounds of a convex function at once. The convex conjugate records the best intercept associated with each slope, and this transforms primal minimisation questions into dual ones.
[definition: Convex Conjugate]
Let $f: \mathbb R^n \to (-\infty,+\infty]$ be proper. The convex conjugate of $f$ is the function $f^*:\mathbb R^n\to(-\infty,+\infty]$ defined by
\begin{align*}
f^*(y):=\sup_{x\in\mathbb R^n}\{y\cdot x-f(x)\}.
\end{align*}
[/definition]
The conjugate is finite at slopes whose affine functions can be placed below $f$ with finite intercept. Rearranging its defining supremum gives the basic inequality behind weak duality and the link between conjugacy and subgradients.
[quotetheorem:6676]
[citeproof:6676]
The equality condition is the practical content of Fenchel-Young: the pair $(x,y)$ is matched precisely when $y$ supports $f$ at $x$. Properness prevents the statement from being vacuous; without it, $f\equiv+\infty$ gives no finite primal point, while allowing $-\infty$ would make the supremum defining the conjugate degenerate. The inequality itself may read $+\infty\ge x\cdot y$ when either $f(x)$ or $f^*(y)$ is infinite, and such pairs carry no finite optimality information. The theorem does not assert that equality is attained for some $y$ at every $x$; boundary points of domains may have empty subdifferential. In duality arguments, useful certificates are therefore finite primal-dual pairs satisfying equality. The most important conjugate calculations come from solving the maximisation problem in the definition. A positive definite quadratic gives a finite dual quadratic after completing the square.
[example: Conjugate of a Quadratic]
Let $Q\in\mathbb R^{n\times n}$ be symmetric positive definite and define
\begin{align*}
f(x)=\frac12 x^\top Qx+b^\top x+c.
\end{align*}
We compute $f^*(y)$ by fixing $y\in\mathbb R^n$ and maximising $y\cdot x-f(x)$ over $x$. Since $Q$ is positive definite, it is invertible; set
\begin{align*}
x_y:=Q^{-1}(y-b).
\end{align*}
Then $y-b=Qx_y$. For every $x\in\mathbb R^n$,
\begin{align*}
y\cdot x-f(x)=y^\top x-\frac12 x^\top Qx-b^\top x-c.
\end{align*}
Combining the two linear terms gives
\begin{align*}
y^\top x-b^\top x=(y-b)^\top x.
\end{align*}
Using $y-b=Qx_y$ and the symmetry of $Q$,
\begin{align*}
(y-b)^\top x=(Qx_y)^\top x=x_y^\top Qx.
\end{align*}
Hence
\begin{align*}
y\cdot x-f(x)=x_y^\top Qx-\frac12 x^\top Qx-c.
\end{align*}
Now expand the quadratic term centred at $x_y$:
\begin{align*}
(x-x_y)^\top Q(x-x_y)=x^\top Qx-x^\top Qx_y-x_y^\top Qx+x_y^\top Qx_y.
\end{align*}
Because $Q$ is symmetric, $x^\top Qx_y=x_y^\top Qx$, so
\begin{align*}
(x-x_y)^\top Q(x-x_y)=x^\top Qx-2x_y^\top Qx+x_y^\top Qx_y.
\end{align*}
Rearranging gives
\begin{align*}
x_y^\top Qx-\frac12 x^\top Qx=\frac12 x_y^\top Qx_y-\frac12 (x-x_y)^\top Q(x-x_y).
\end{align*}
Therefore
\begin{align*}
y\cdot x-f(x)=\frac12 x_y^\top Qx_y-\frac12 (x-x_y)^\top Q(x-x_y)-c.
\end{align*}
Positive definiteness gives
\begin{align*}
(x-x_y)^\top Q(x-x_y)\ge 0,
\end{align*}
with equality exactly at $x=x_y$. Thus the supremum is attained at $x_y=Q^{-1}(y-b)$, and
\begin{align*}
f^*(y)=\frac12 x_y^\top Qx_y-c.
\end{align*}
Substituting $x_y=Q^{-1}(y-b)$ gives
\begin{align*}
x_y^\top Qx_y=\bigl(Q^{-1}(y-b)\bigr)^\top Q\bigl(Q^{-1}(y-b)\bigr).
\end{align*}
Since $Q^{-1}$ is symmetric and $Q^{-1}QQ^{-1}=Q^{-1}$,
\begin{align*}
\bigl(Q^{-1}(y-b)\bigr)^\top Q\bigl(Q^{-1}(y-b)\bigr)=(y-b)^\top Q^{-1}(y-b).
\end{align*}
Consequently
\begin{align*}
f^*(y)=\frac12 (y-b)^\top Q^{-1}(y-b)-c.
\end{align*}
Thus a positive definite quadratic conjugates to another quadratic, with the linear term absorbed into the shift $y-b$.
[/example]
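The closed form can be sanity-checked numerically. The sketch below is illustrative: the $2\times2$ matrix, its hand-computed inverse, and the helper names are assumptions, and random sampling only supports the formula rather than proving it.

```python
import random

# Hypothetical data: Q symmetric positive definite, inverse computed by hand.
Q = [[2.0, 1.0], [1.0, 2.0]]
Q_INV = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]
b = [1.0, -1.0]
c = 0.5

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def f(x):
    return 0.5 * dot(x, matvec(Q, x)) + dot(b, x) + c

def conjugate_formula(y):
    """f*(y) = 1/2 (y - b)^T Q^{-1} (y - b) - c."""
    d = [yi - bi for yi, bi in zip(y, b)]
    return 0.5 * dot(d, matvec(Q_INV, d)) - c

def maximiser(y):
    """x_y = Q^{-1}(y - b), the point where the supremum is attained."""
    return matvec(Q_INV, [yi - bi for yi, bi in zip(y, b)])

y = [3.0, 0.0]
x_y = maximiser(y)
value_at_maximiser = dot(y, x_y) - f(x_y)

# No sampled x should give a larger value of y . x - f(x) than the formula.
random.seed(2)
never_exceeded = all(
    dot(y, x) - f(x) <= conjugate_formula(y) + 1e-9
    for x in ([random.uniform(-10, 10), random.uniform(-10, 10)] for _ in range(1000))
)
```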
Quadratics show how smooth costs transform under conjugacy. Extended indicators show how feasible sets transform: the dual object is the support function, which records linear optimisation over the set.
[example: Conjugate of an Indicator]
Let $C\subset\mathbb R^n$ be nonempty, and fix $y\in\mathbb R^n$. By the definition of the convex conjugate,
\begin{align*}
(\mathbb{1}^{\infty}_C)^*(y)=\sup_{x\in\mathbb R^n}\{y\cdot x-\mathbb{1}^{\infty}_C(x)\}.
\end{align*}
For $x\in C$, the extended indicator satisfies $\mathbb{1}^{\infty}_C(x)=0$, so
\begin{align*}
y\cdot x-\mathbb{1}^{\infty}_C(x)=y\cdot x.
\end{align*}
For $x\notin C$, the extended indicator satisfies $\mathbb{1}^{\infty}_C(x)=+\infty$, and the extended-real convention gives
\begin{align*}
y\cdot x-\mathbb{1}^{\infty}_C(x)=y\cdot x-(+\infty)=-\infty.
\end{align*}
Hence points outside $C$ contribute only the value $-\infty$ to the supremum, while points inside $C$ contribute exactly the linear values $y\cdot x$. Since $C$ is nonempty, the supremum over $\mathbb R^n$ is therefore
\begin{align*}
(\mathbb{1}^{\infty}_C)^*(y)=\sup_{x\in C} y\cdot x.
\end{align*}
Thus the conjugate of the extended indicator is the support function $\sigma_C(y)$, so passing to the conjugate turns the constraint $x\in C$ into the linear-optimisation term $\sup_{x\in C} y\cdot x$.
[/example]
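For a finite point set the supremum is a maximum, so the support function can be evaluated directly. The sketch below is illustrative, with hypothetical helper names; it also compares a two-point set with the interval it spans at a few sampled slopes.

```python
def support_finite(points, y):
    """sigma_C(y) = max over x in C of y . x, for a finite set C in R^n."""
    return max(sum(yi * xi for yi, xi in zip(y, x)) for x in points)

def support_interval(lo, hi, y):
    """Support function of [lo, hi] in R: a linear function peaks at an endpoint."""
    return max(y * lo, y * hi)

two_points = [(-1.0,), (1.0,)]
slopes = [-2.0, -0.5, 0.0, 0.5, 2.0]

finite_values = [support_finite(two_points, (y,)) for y in slopes]
interval_values = [support_interval(-1.0, 1.0, y) for y in slopes]

# Vertices of the unit square: linear optimisation in the direction (1, 2).
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
sigma_square = support_finite(square, (1.0, 2.0))
```

At these slopes both sets give the value $|y|$, which reflects the fact that a support function cannot distinguish a set from its closed convex hull.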
The indicator example shows that conjugacy may replace a constrained object by a support function, so information about the original problem is now stored indirectly in affine lower bounds. Applying conjugacy once does not by itself say which features of $f$ can be recovered: spikes, nonconvex dips, and nonclosed behaviour may disappear when only global affine minorants are recorded. The biconjugate applies the same operation again to assemble all affine information back into a function. Since every finite conjugate value encodes an affine minorant of $f$, the biconjugate can never exceed $f$; the central theorem says that closed proper convex functions are exactly the functions recovered from this affine-support data.
[definition: Biconjugate]
Let $f: \mathbb R^n \to (-\infty,+\infty]$ be proper. The biconjugate of $f$ is the function $f^{**}:\mathbb R^n\to[-\infty,+\infty]$ defined by
\begin{align*}
f^{**}(x):=\sup_{y\in\mathbb R^n}\{x\cdot y-f^*(y)\}.
\end{align*}
The convention is that $x\cdot y-(+\infty)=-\infty$ and the supremum of a set consisting only of $-\infty$ values is $-\infty$.
[/definition]
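On a grid, both suprema in the definition become maxima, which gives a rough way to see what the biconjugate does. The sketch below is only an approximation under stated assumptions: the grids, the truncation of slopes to $[-4,4]$, and the test function $f(x)=-x^2$ on $[-1,1]$ (with $+\infty$ outside) are choices made for illustration.

```python
import math

XS = [k / 10.0 for k in range(-20, 21)]   # primal grid on [-2, 2]
YS = [k / 10.0 for k in range(-40, 41)]   # slope grid on [-4, 4]

def f(x):
    """Nonconvex test function: -x^2 on [-1, 1], +infinity outside."""
    return -x * x if abs(x) <= 1.0 else math.inf

def conjugate_on_grid(func, grid):
    """Return s -> max over grid points p of (p*s - func(p))."""
    return lambda s: max(p * s - func(p) for p in grid)

f_star = conjugate_on_grid(f, XS)            # grid approximation of f*
f_star_star = conjugate_on_grid(f_star, YS)  # grid approximation of f**

biconj_at_0 = f_star_star(0.0)
biconj_at_half = f_star_star(0.5)
```

For this function the grid conjugate is $|y|+1$, and the grid biconjugate equals $-1$ at both sampled points, below $f(0)=0$. Because the slope grid is truncated, values computed outside $[-1,1]$ stay finite and should not be trusted.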
The biconjugate is the supremum of all affine lower bounds generated by conjugacy. This construction can only recover information visible from below by affine functions, so upward spikes, nonconvex dents, and nonclosed behaviour are potential losses. The theorem below identifies the exact class for which no information is lost: proper closed convex functions are precisely the functions determined by their affine supports.
[quotetheorem:6677]
[citeproof:6677]
Fenchel-Moreau shows that conjugacy is not merely a calculation trick; it reconstructs the function from its affine supports. Lower semicontinuity cannot be dropped: the convex function with $f(x)=+\infty$ for $x<0$, $f(0)=1$, and $f(x)=0$ for $x>0$ has an upward jump at $0$, and its biconjugate removes that jump by taking the lower semicontinuous convex envelope, which has value $0$ at $0$. Convexity cannot be dropped either, because the biconjugate is always convex; for $f(x)=-x^2$ on a bounded interval and $+\infty$ outside, the biconjugate replaces the concave graph by its closed convex envelope from below. Properness rules out degenerate functions that do not represent finite optimisation problems. The theorem does not say that every affine minorant is attained as a supporting hyperplane at a point; it says their supremum recovers exactly the closed convex function. A nonlinear example is log-sum-exp, whose conjugate has the probability simplex as its effective domain and is the entropy expression that appears in probabilistic optimisation.
[example: Conjugate of Log Sum Exp]
Let
\begin{align*}
f(x)=\log\left(\sum_{i=1}^n e^{x_i}\right), \qquad x\in\mathbb R^n.
\end{align*}
We compute
\begin{align*}
f^*(y)=\sup_{x\in\mathbb R^n}\left\{y\cdot x-\log\left(\sum_{i=1}^n e^{x_i}\right)\right\}.
\end{align*}
First suppose $\sum_{i=1}^n y_i\neq 1$. Set $\mathbf 1=(1,\dots,1)$. For any $x\in\mathbb R^n$ and $t\in\mathbb R$,
\begin{align*}
y\cdot(x+t\mathbf 1)-f(x+t\mathbf 1)=y\cdot x+t\sum_{i=1}^n y_i-\log\left(\sum_{i=1}^n e^{x_i+t}\right).
\end{align*}
Since $\sum_{i=1}^n e^{x_i+t}=e^t\sum_{i=1}^n e^{x_i}$, this becomes
\begin{align*}
y\cdot(x+t\mathbf 1)-f(x+t\mathbf 1)=y\cdot x-f(x)+t\left(\sum_{i=1}^n y_i-1\right).
\end{align*}
If $\sum_i y_i>1$, letting $t\to+\infty$ makes the right-hand side tend to $+\infty$; if $\sum_i y_i<1$, letting $t\to-\infty$ does the same. Hence $f^*(y)=+\infty$ unless $\sum_i y_i=1$.
Now suppose $\sum_i y_i=1$ but $y_k<0$ for some $k$. For $s>0$, choose $x_k=-s$ and $x_i=0$ for $i\neq k$. Then
\begin{align*}
y\cdot x-f(x)=-s y_k-\log\left(e^{-s}+n-1\right).
\end{align*}
Here $-y_k>0$, so the term $-s y_k$ tends to $+\infty$ as $s\to+\infty$, while $\log(e^{-s}+n-1)$ is bounded above by $\log n$. Thus $f^*(y)=+\infty$ unless $y_i\ge0$ for every $i$.
It remains to compute the value on
\begin{align*}
\Delta_n=\left\{y\in\mathbb R^n:y_i\ge0,\ \sum_{i=1}^n y_i=1\right\}.
\end{align*}
Fix $y\in\Delta_n$. For each $x$, write
\begin{align*}
S=\sum_{j=1}^n e^{x_j}, \qquad p_i=\frac{e^{x_i}}{S}.
\end{align*}
Then $p_i>0$, $\sum_i p_i=1$, and $x_i=\log p_i+\log S$. Since $\sum_i y_i=1$,
\begin{align*}
y\cdot x-f(x)=\sum_{i=1}^n y_i(\log p_i+\log S)-\log S=\sum_{i=1}^n y_i\log p_i.
\end{align*}
Let $I=\{i:y_i>0\}$. Terms with $y_i=0$ contribute $0$, so
\begin{align*}
\sum_{i=1}^n y_i\log p_i=\sum_{i\in I} y_i\log p_i.
\end{align*}
For $i\in I$,
\begin{align*}
y_i\log p_i=y_i\log y_i+y_i\log\left(\frac{p_i}{y_i}\right).
\end{align*}
Using $\log u\le u-1$ for $u>0$ gives
\begin{align*}
y_i\log\left(\frac{p_i}{y_i}\right)\le y_i\left(\frac{p_i}{y_i}-1\right)=p_i-y_i.
\end{align*}
Summing over $I$,
\begin{align*}
\sum_{i=1}^n y_i\log p_i\le \sum_{i\in I} y_i\log y_i+\sum_{i\in I}p_i-\sum_{i\in I}y_i.
\end{align*}
Because $\sum_{i\in I}p_i\le1$ and $\sum_{i\in I}y_i=1$, the last two terms sum to a nonpositive number. Therefore
\begin{align*}
y\cdot x-f(x)\le \sum_{i\in I} y_i\log y_i=\sum_{i=1}^n y_i\log y_i,
\end{align*}
with the convention $0\log0=0$.
If every $y_i>0$, choose $x_i=\log y_i$. Then $\sum_i e^{x_i}=\sum_i y_i=1$, so $p_i=y_i$, and the upper bound is attained. If some $y_i=0$, choose positive vectors $p^{(\varepsilon)}\in\Delta_n$ with $p^{(\varepsilon)}_i\to y_i$ for every $i$; for example, put a small positive mass on the zero coordinates and rescale the positive coordinates of $y$ by the remaining total mass. Taking $x_i^{(\varepsilon)}=\log p_i^{(\varepsilon)}$ gives
\begin{align*}
y\cdot x^{(\varepsilon)}-f(x^{(\varepsilon)})=\sum_{i=1}^n y_i\log p_i^{(\varepsilon)}.
\end{align*}
For $i$ with $y_i>0$, $p_i^{(\varepsilon)}\to y_i$ implies $y_i\log p_i^{(\varepsilon)}\to y_i\log y_i$; for $i$ with $y_i=0$, the term is always $0$. Hence the supremum equals the same upper bound.
Consequently, $f^*(y)=\sum_{i=1}^n y_i\log y_i$ for $y\in\Delta_n$, and $f^*(y)=+\infty$ for $y\notin\Delta_n$. Thus log-sum-exp conjugates to negative entropy on the probability simplex.
[/example]
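The formula can be checked numerically at an interior point of the simplex. The sketch below is illustrative: the helper names and the vector `y` are assumptions, and the random sampling only supports the upper bound rather than proving it.

```python
import math
import random

def log_sum_exp(x):
    """Numerically stable log(sum_i exp(x_i))."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))

def neg_entropy(y):
    """sum_i y_i log y_i with the convention 0 log 0 = 0."""
    return sum(yi * math.log(yi) for yi in y if yi > 0)

def dual_objective(x, y):
    """The quantity y . x - f(x) whose supremum over x defines f*(y)."""
    return sum(yi * xi for yi, xi in zip(y, x)) - log_sum_exp(x)

y = [0.5, 0.3, 0.2]                   # interior point of the simplex
x_opt = [math.log(yi) for yi in y]    # the maximiser x_i = log y_i
attained = dual_objective(x_opt, y)

# No sampled x should exceed the negative-entropy bound.
random.seed(3)
bounded = all(
    dual_objective([random.uniform(-5, 5) for _ in y], y) <= neg_entropy(y) + 1e-9
    for _ in range(1000)
)
```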
## Support and Gauge Functions
The final problem in this chapter is to recognise standard convex functions that are built directly from sets. Support functions, extended indicators, and gauges translate geometry into convex analysis and reappear throughout Lagrange duality and conic optimisation.
[definition: Support Function]
Let $C\subset\mathbb R^n$ be nonempty. The support function of $C$ is the map $\sigma_C:\mathbb R^n\to(-\infty,+\infty]$ defined by $y\mapsto\sigma_C(y)$, where
\begin{align*}
\sigma_C(y):=\sup_{x\in C} y\cdot x, \qquad y\in\mathbb R^n.
\end{align*}
[/definition]
Support functions are always convex because they are pointwise suprema of linear functions. A subgradient of $\sigma_C$ in a direction $y$ should record which points of $C$ actually support the set in that direction, but this requires the supremum to be attained and the exposed face to be well behaved. The result below makes that intuition precise under compact convex hypotheses.
[quotetheorem:6678]
[citeproof:6678]
The theorem identifies a subgradient with an exposed support point or, when the maximiser is not unique, with the whole exposed face. Compactness is the attainment hypothesis: for $C=(0,1)\subset\mathbb R$, $\sigma_C(1)=1$ but no point of $C$ attains the supremum, so the displayed argmax would be empty, while $\partial\sigma_C(1)=\{1\}$ is nonempty and its single element lies outside $C$. Boundedness is also relevant, since for $C=[0,\infty)$ the support function is infinite in positive directions. Convexity ensures that the exposed face is the right first-order object; for $C=\{-1,1\}$, the support function is the same as for $[-1,1]$, so it sees the closed convex hull rather than the original two-point geometry. The theorem does not classify all subgradients of support functions for arbitrary noncompact sets. Support functions measure a set by probing it with linear functionals. The next definition is the complementary construction: it measures a point by asking how far the set must be dilated from the origin to contain it.
[definition: Gauge Function]
Let $C\subset\mathbb R^n$ be a convex set with $0\in C$. The gauge of $C$ is the map $\gamma_C:\mathbb R^n\to[0,+\infty]$ defined by $x\mapsto\gamma_C(x)$, where
\begin{align*}
\gamma_C(x):=\inf\{t>0:x\in tC\}, \qquad x\in\mathbb R^n.
\end{align*}
[/definition]
The gauge measures how much the set $C$ must be dilated to contain a point. When $C$ is balanced, absorbing, and bounded, it is a norm; without symmetry, it still gives a useful convex measure of size.
[example: Gauge of the Euclidean Ball]
Let $C=\overline{B}(0,1)\subset\mathbb R^n$. For $t>0$, the dilation $tC$ is
\begin{align*}
tC=\{tu:u\in C\}=\{tu:u\in\mathbb R^n,\ |u|\le 1\}.
\end{align*}
If $x\in tC$, then $x=tu$ for some $u$ with $|u|\le1$, and therefore
\begin{align*}
|x|=|tu|=t|u|\le t.
\end{align*}
Conversely, if $|x|\le t$, then $u=x/t$ is well-defined because $t>0$, and
\begin{align*}
|u|=\left|\frac{x}{t}\right|=\frac{|x|}{t}\le1.
\end{align*}
Thus $u\in C$ and $x=tu\in tC$. Hence
\begin{align*}
\{t>0:x\in tC\}=\{t>0:|x|\le t\}.
\end{align*}
By the definition of the gauge,
\begin{align*}
\gamma_C(x)=\inf\{t>0:x\in tC\}=\inf\{t>0:|x|\le t\}=|x|.
\end{align*}
So the gauge of the closed Euclidean unit ball is exactly the Euclidean norm.
For comparison, if a polytope containing $0$ is written in normalised facet form
\begin{align*}
P=\{z\in\mathbb R^n:a_j\cdot z\le 1,\ j=1,\dots,m\},
\end{align*}
then $x\in tP$ means $x=tz$ for some $z\in P$. Since $t>0$, this is equivalent to $z=x/t$ and therefore to
\begin{align*}
a_j\cdot\left(\frac{x}{t}\right)\le1 \quad \text{for every } j=1,\dots,m.
\end{align*}
Multiplying by $t>0$ gives the equivalent inequalities
\begin{align*}
a_j\cdot x\le t \quad \text{for every } j=1,\dots,m.
\end{align*}
Therefore
\begin{align*}
\gamma_P(x)=\inf\{t>0:a_j\cdot x\le t \text{ for every } j=1,\dots,m\}.
\end{align*}
The least such positive upper bound is
\begin{align*}
\gamma_P(x)=\max\{0,a_1\cdot x,\dots,a_m\cdot x\}.
\end{align*}
Thus a polytope produces a maximum of finitely many affine-linear functions, and its linear pieces are determined by which facet inequalities attain the maximum.
[/example]
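The polytope formula can be compared with the infimum in the gauge definition. The sketch below is illustrative: the square $[-1,1]^2$, the helper names, and the bisection search are assumptions. The bisection relies on the fact that, for a closed convex set containing the origin, the admissible dilation factors form an interval unbounded above, and it presumes the starting upper bound `hi` is already admissible.

```python
def gauge_polytope(A, x):
    """Gauge of P = {z : a_j . z <= 1 for all j}: max{0, a_1 . x, ..., a_m . x}."""
    return max([0.0] + [sum(aj * xj for aj, xj in zip(a, x)) for a in A])

def gauge_by_bisection(contains, x, hi=1e6, iters=200):
    """Approximate inf{t > 0 : x/t in C} using a membership test for C."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid > 0 and contains([xi / mid for xi in x]):
            hi = mid      # x lies in mid*C, so the gauge is at most mid
        else:
            lo = mid      # x lies outside mid*C, so the gauge exceeds mid
    return hi

# The square [-1, 1]^2 in facet form a_j . z <= 1; its gauge is the max norm.
SQUARE = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

def in_square(z):
    return all(sum(aj * zj for aj, zj in zip(a, z)) <= 1.0 for a in SQUARE)

x = [0.5, -3.0]
g_formula = gauge_polytope(SQUARE, x)
g_bisect = gauge_by_bisection(in_square, x)
```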
The examples in this chapter show the same pattern from several angles: a convex set produces functions, and those functions recover geometric information about the set. This dictionary is the language used in the next stage of the course, where primal and dual optimisation problems are related by conjugacy.
[remark: Dictionary Between Sets and Functions]
The extended indicator $\mathbb{1}^{\infty}_C$ enforces membership in $C$, the support function $\sigma_C$ maximises linear functionals over $C$, and the gauge $\gamma_C$ measures dilation from the origin. These three constructions are the main bridge from [convex geometry](/page/Convex%20Geometry) to convex optimisation: constraints, dual objectives, and conic formulations can all be written in this language.
[/remark]
# 3. Convex Optimisation Problems and Existence
Convex optimisation becomes a problem about existence as soon as we move from local inequalities to the question of whether an infimum is realised. Chapter 2 gave the language of closed convex functions, subgradients, normal cones, and conjugates. This chapter packages those objects into optimisation problems, separates feasible geometry from objective geometry, and records the main compactness and recession arguments that decide whether a solution exists.
The guiding distinction is between the value of a programme and its solution set. A finite value may fail to be attained, while an attained value may be unstable under perturbations unless the feasible set and objective have enough closedness or growth. We will keep track of infeasibility, unboundedness, attainment, and stability as separate phenomena.
## Standard Convex Programmes and Their Objects
What data are needed to specify a convex optimisation problem, and which parts of the data control feasibility rather than optimality? The first step is to give a common format that includes constrained least squares, entropy minimisation, and conic programmes without hiding domain restrictions.
[definition: Convex Programme]
A finite-dimensional convex programme is an optimisation problem of the form
\begin{align*}
\inf_{x \in C} f(x),
\end{align*}
where $C \subset \mathbb R^n$ is a convex set and $f: \mathbb R^n \to (-\infty, +\infty]$ is a proper convex function.
[/definition]
This format deliberately allows the objective to take the value $+\infty$, so explicit constraints are not the whole story. To decide whether the programme is meaningful, we next isolate the points that satisfy both the visible constraint $C$ and the domain restrictions carried by $f$.
[definition: Feasible Set]
For a convex programme $\inf_{x \in C} f(x)$, the feasible set is
\begin{align*}
F := C \cap \operatorname{dom} f.
\end{align*}
[/definition]
The programme is feasible if $F \neq \varnothing$. Feasibility answers only whether there is at least one admissible point. The next distinction records the best value allowed by those points and the subset, possibly empty, where that value is attained.
[definition: Optimal Value and Solution Set]
The optimal value of $\inf_{x \in C} f(x)$ is
\begin{align*}
p^* := \inf_{x \in C} f(x) = \inf_{x \in F} f(x),
\end{align*}
with the convention $p^* = +\infty$ when $F = \varnothing$. The solution set is
\begin{align*}
S := \{x \in F : f(x) = p^*\}.
\end{align*}
[/definition]
The value $p^*$ is extended-real, but the solution set is an ordinary subset of $\mathbb R^n$. If $p^*=-\infty$, then $S=\varnothing$, since no feasible point has value $-\infty$ for a proper convex objective. The first structural question is then whether convexity of the data leaves any trace on the set of minimisers.
[quotetheorem:6679]
[citeproof:6679]
This theorem says that once minimisers exist, there is no disconnected collection of isolated optimal designs in a convex programme. It does not say that minimisers exist, nor that they are unique: strict convexity of the objective on the feasible set is a separate hypothesis for uniqueness. The convexity assumptions are doing real work. If $C=\{-1,1\}\subset\mathbb R$ and $f(x)=0$, the solution set $\{-1,1\}$ is non-convex because the feasible set is non-convex; if $C=\mathbb R$ and $f(x)=(x^2-1)^2$, the set of global minimisers $\{-1,1\}$ is again non-convex, this time because the objective is not convex.
[example: Least Squares with Convex Constraints]
Let $A \in \mathbb R^{m \times n}$, $b \in \mathbb R^m$, and let $C \subset \mathbb R^n$ be nonempty, closed, and convex. For
\begin{align*}
g(x):=\frac12 |Ax-b|^2,
\end{align*}
the feasible set is $C$, because $g(x)$ is finite for every $x\in\mathbb R^n$.
The objective is convex. Indeed, for $x_0,x_1\in\mathbb R^n$, $t\in[0,1]$, and $u=Ax_0-b$, $v=Ax_1-b$, we have
\begin{align*}
A((1-t)x_0+tx_1)-b=(1-t)(Ax_0-b)+t(Ax_1-b)=(1-t)u+tv.
\end{align*}
Expanding the squared norm gives
\begin{align*}
|(1-t)u+tv|^2=(1-t)^2|u|^2+2t(1-t)\langle u,v\rangle+t^2|v|^2.
\end{align*}
Using
\begin{align*}
|u-v|^2=|u|^2-2\langle u,v\rangle+|v|^2,
\end{align*}
this becomes
\begin{align*}
|(1-t)u+tv|^2=(1-t)|u|^2+t|v|^2-t(1-t)|u-v|^2.
\end{align*}
Therefore
\begin{align*}
g((1-t)x_0+tx_1)=\frac12\bigl((1-t)|u|^2+t|v|^2-t(1-t)|u-v|^2\bigr).
\end{align*}
Since $t(1-t)|u-v|^2\ge0$, it follows that
\begin{align*}
g((1-t)x_0+tx_1)\le (1-t)g(x_0)+tg(x_1).
\end{align*}
The function $g$ is continuous because it is the composition of the affine map $x\mapsto Ax-b$ with the continuous map $y\mapsto |y|^2/2$.
If $C$ is compact, continuity of $g$ implies that $g$ attains its minimum on $C$, so the constrained least-squares problem has a minimiser. If $C$ is an affine subspace, write $C=x_0+L$ with $L$ a linear subspace. Along a direction $d\in L$,
\begin{align*}
g(x+td)=\frac12 |Ax-b+tAd|^2.
\end{align*}
Expanding the square gives
\begin{align*}
g(x+td)=\frac12|Ax-b|^2+t\langle Ax-b,Ad\rangle+\frac12t^2|Ad|^2.
\end{align*}
Thus directions in $L\cap\ker A$ leave the residual unchanged, while every direction with $Ad\ne0$ has quadratic growth along that ray.
In particular, if $L\cap\ker A=\{0\}$, then $d\mapsto |Ad|$ has a positive minimum $\gamma>0$ on the compact set $\{d\in L:|d|=1\}$. For $x=x_0+\ell$ with $\ell\in L$ and $\ell\ne0$,
\begin{align*}
|A\ell|\ge \gamma|\ell|.
\end{align*}
The [reverse triangle inequality](/theorems/2300) gives
\begin{align*}
|Ax-b|=|A\ell+(Ax_0-b)|\ge |A\ell|-|Ax_0-b|.
\end{align*}
Hence
\begin{align*}
|Ax-b|\ge \gamma|\ell|-|Ax_0-b|.
\end{align*}
So $|Ax-b|\to\infty$ whenever $x\in x_0+L$ and $|x|\to\infty$, which means the residual norm is level-bounded on the affine subspace. This shows that compactness of $C$ is a sufficient existence mechanism, while on affine feasible sets the level-boundedness check reduces to whether $A$ kills a nonzero feasible direction.
[/example]
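The algebraic identity behind the convexity of $g$ can be sanity-checked numerically. The sketch below samples random vectors $u,v$ and weights $t$ and compares the convexity gap $(1-t)|u|^2+t|v|^2-|(1-t)u+tv|^2$ with $t(1-t)|u-v|^2$. The helper names, the dimension, and the sampling ranges are arbitrary illustrative choices.

```python
import random

def sq_norm(v):
    return sum(vi * vi for vi in v)

def combo(t, u, v):
    """Return (1 - t) u + t v componentwise."""
    return [(1 - t) * ui + t * vi for ui, vi in zip(u, v)]

def convexity_gap(t, u, v):
    """(1-t)|u|^2 + t|v|^2 - |(1-t)u + tv|^2."""
    return (1 - t) * sq_norm(u) + t * sq_norm(v) - sq_norm(combo(t, u, v))

random.seed(0)  # fixed seed so the check is reproducible
max_error = 0.0
for _ in range(1000):
    u = [random.uniform(-5, 5) for _ in range(4)]
    v = [random.uniform(-5, 5) for _ in range(4)]
    t = random.random()
    predicted = t * (1 - t) * sq_norm([ui - vi for ui, vi in zip(u, v)])
    max_error = max(max_error, abs(convexity_gap(t, u, v) - predicted))
print(max_error)  # largest discrepancy over the sample, expected at rounding level
```

A check of this kind does not replace the proof, but it catches sign and coefficient slips in the expansion quickly.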
The least-squares example shows that the same formal definitions cover both compact and unbounded feasible regions. Before proving general existence theorems, it is useful to separate the possible pathologies into examples that differ only in attainment behaviour.
[example: Attained and Non-Attained Programmes]
For $\inf_{x\in\mathbb R} e^x$, the feasible set is all of $\mathbb R$, since $e^x$ is finite for every real $x$. The function is convex because
\begin{align*}
\frac{d^2}{dx^2}e^x=e^x>0
\end{align*}
for every $x\in\mathbb R$. Since $e^x>0$ for every $x$, every feasible value is positive, so
\begin{align*}
\inf_{x\in\mathbb R} e^x \ge 0.
\end{align*}
On the other hand, along the sequence $x_k=-k$,
\begin{align*}
e^{x_k}=e^{-k}\to0.
\end{align*}
Thus the optimal value is $0$. No minimiser exists, because attaining the value would require a real $x$ with $e^x=0$, but $e^x>0$ for every real $x$.
For $\inf_{x\in[0,\infty)} x$, every feasible point satisfies $x\ge0$, so
\begin{align*}
\inf_{x\in[0,\infty)}x\ge0.
\end{align*}
The feasible point $x=0$ has value $0$, hence the optimal value is $0$. The solution set is
\begin{align*}
\{x\in[0,\infty):x=0\}=\{0\}.
\end{align*}
For $\inf_{x\in\mathbb R} x$, the problem is feasible because every real number is allowed. For each $M\in\mathbb R$, choosing $x=M-1$ gives
\begin{align*}
x=M-1<M,
\end{align*}
so the objective values are not bounded below by any real number. Therefore
\begin{align*}
\inf_{x\in\mathbb R}x=-\infty.
\end{align*}
These examples separate finite non-attainment, finite attainment, and unboundedness below.
[/example]
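The three behaviours can also be seen by evaluating the objectives along the sequences used above. This is only a small Python illustration; the sample points and variable names are assumptions made for the sketch.

```python
import math

def values_along(f, xs):
    """Objective values along a candidate minimising sequence."""
    return [f(x) for x in xs]

# inf_x e^x: the values along x_k = -k decrease towards 0 but stay positive.
exp_values = values_along(math.exp, [-k for k in range(1, 31)])

# inf_{x >= 0} x: the lower bound 0 is attained at the feasible point x = 0.
attained_value = min(values_along(lambda x: x, [0.0, 0.5, 1.0, 2.0]))

# inf_x x: for each candidate lower bound M, the point x = M - 1 lies below it.
candidate_bounds = [0.0, -10.0, -1e6]
witnesses = [M - 1 for M in candidate_bounds]

print(exp_values[-1], attained_value, witnesses)
```

Finite sampling cannot prove non-attainment; it only displays the sequence $e^{-k}$ shrinking while remaining strictly positive, which is the content of the argument above.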
## Existence from Compactness, Coercivity, Level-Boundedness, and Recession
When does the infimum become a minimum? Convexity gives good geometry for minimisers, but existence still comes from topological compactness or from growth conditions that force minimising sequences to stay in a bounded region.
[definition: Level Set of a Programme]
For $\alpha \in \mathbb R$, the $\alpha$-sublevel set of the programme $\inf_{x \in C} f(x)$ is
\begin{align*}
L_\alpha := \{x \in C : f(x) \le \alpha\}.
\end{align*}
[/definition]
Sublevel sets turn existence into a compactness question. When $p^*<+\infty$, a minimising sequence eventually lies in every sublevel set $L_\alpha$ with $\alpha>p^*$, so bounded closed sublevel sets provide convergent subsequences. This leads to the finite-dimensional Weierstrass argument in the form needed for convex programmes.
[quotetheorem:6680]
[citeproof:6680]
This is the basic finite-dimensional existence argument. Each hypothesis prevents a different failure mode. If compactness is replaced by boundedness without closedness, the problem $\inf_{x\in(0,1)} x$ has no minimiser. If lower semicontinuity is removed, the function on $[0,1]$ defined by $f(0)=1$ and $f(x)=x$ for $x\in(0,1]$ has infimum $0$ but no point attaining it; the only possible limiting point of a minimising sequence is $0$, where the objective jumps upward. If no compact sublevel set is available, $\inf_{x\in\mathbb R} e^x$ has finite value $0$ but all minimising sequences escape to $-\infty$.
Compact feasible sets are therefore only one route to existence. Many useful feasible sets are unbounded, so we need a condition saying that the relevant sublevel sets are compact even when $C$ itself is not.
[definition: Level-Bounded Programme]
A programme $\inf_{x \in C} f(x)$ is level-bounded if, for every $\alpha \in \mathbb R$, the set
\begin{align*}
\{x \in C : f(x) \le \alpha\}
\end{align*}
is bounded.
[/definition]
Level-boundedness supplies the missing boundedness part of compactness, while lower semicontinuity and closed constraints supply closedness. The remaining question is whether these separate compactness ingredients are enough to stop every minimising sequence from either escaping to infinity or converging to an infeasible or non-attaining limit. The result below packages exactly those hypotheses into a reusable existence theorem and records compactness of the full minimiser set.
[quotetheorem:6681]
[citeproof:6681]
The theorem is not a converse to existence: a solution may exist even when the full programme is not level-bounded, as in unconstrained least squares with a nonzero nullspace, where the minimiser set is an unbounded affine subspace. Each hypothesis nevertheless plays a definite role in the compactness proof. Without closedness of the feasible set, $\inf_{x\in(0,1)}x$ fails to attain its value; without lower semicontinuity, the objective can jump upward at the limit of a minimising sequence, which destroys attainment even on a compact set; without boundedness below, $\inf_{x\in\mathbb R}x$ has no finite optimal value; without level-boundedness, $\inf_{x\in\mathbb R}e^x$ has finite infimum but no minimiser.
Level-boundedness is often checked through a simpler growth test. When there are no important directions in which the variable can escape while keeping the objective small, the objective behaves like a confining potential; this motivates coercivity.
[definition: Coercive Function]
A function $f: \mathbb R^n \to (-\infty,+\infty]$ is coercive if
\begin{align*}
f(x_k) \to +\infty
\end{align*}
for every sequence $(x_k)$ with $|x_k| \to \infty$.
[/definition]
A coercive objective remains level-bounded after restriction to any closed feasible set. For constrained problems this may be too strong, because only directions belonging to the feasible set matter. Recession analysis refines coercivity by looking only at directions in which the feasible set can escape to infinity.
[definition: Recession Cone of a Convex Set]
For a nonempty closed convex set $C \subset \mathbb R^n$, the recession cone is
\begin{align*}
\operatorname{rec}(C) := \{d \in \mathbb R^n : x+td \in C \text{ for all } x \in C \text{ and all } t \ge 0\}.
\end{align*}
[/definition]
The recession cone records feasible directions that can be followed forever. To decide boundedness below, we must compare those feasible escape directions with the asymptotic slopes of the objective along rays.
[definition: Recession Function]
For a proper closed convex function $f: \mathbb R^n \to (-\infty,+\infty]$, its recession function is the map
\begin{align*}
f^\infty: \mathbb R^n \to (-\infty,+\infty].
\end{align*}
For every direction $d\in\mathbb R^n$ for which there exists $x\in\operatorname{dom}f$ with $x+td\in\operatorname{dom}f$ for all sufficiently large $t$, it is given by
\begin{align*}
f^\infty(d) := \lim_{t \to \infty} \frac{f(x+td)-f(x)}{t}.
\end{align*}
For closed convex $f$, this limit is independent of the admissible base point $x$. If no such admissible ray exists in direction $d$, then $f^\infty(d):=+\infty$.
[/definition]
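For one-dimensional functions the limit in this definition can be probed directly by evaluating the difference quotient at increasing values of $t$. The sketch below does this for two test functions chosen here for illustration: $\sqrt{1+x^2}$, whose recession function is $|d|$, and $e^x$ in the direction $d=-1$, where the slope tends to $0$. The function names are assumptions of the sketch.

```python
import math

def slope_quotient(f, x, d, t):
    """Difference quotient (f(x + t d) - f(x)) / t for scalar x and d."""
    return (f(x + t * d) - f(x)) / t

def smooth_abs(x):
    # Hypothetical test function sqrt(1 + x^2); its recession function is |d|.
    return math.sqrt(1.0 + x * x)

ts = (1.0, 10.0, 100.0, 1e3, 1e6)
smooth_slopes = [slope_quotient(smooth_abs, 0.0, 1.0, t) for t in ts]
exp_slopes = [slope_quotient(math.exp, 0.0, -1.0, t) for t in ts]

for t, a, b in zip(ts, smooth_slopes, exp_slopes):
    print(t, a, b)
```

For a convex function the quotient is nondecreasing in $t$, so the printed columns increase towards their limits; a finite $t$ therefore always underestimates the recession value.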
Thus $f^\infty$ measures the asymptotic slope of $f$ in direction $d$, while directions that leave the domain permanently are treated as forbidden by assigning value $+\infty$. The decisive question is whether any nonzero feasible recession direction has nonpositive asymptotic slope.
[quotetheorem:6682]
[citeproof:6682]
The criterion converts an existence question into an asymptotic calculation, but it relies on closed convex geometry. If $C=(0,1)\subset\mathbb R$ and $f(x)=x$, then $\operatorname{rec}(C)=\{0\}$ but the infimum is not attained because the feasible set is not closed. If the objective is not closed, compactness of level sets can be lost through boundary jumps even when the algebraic recession calculation looks harmless. Convexity is also essential: for non-convex feasible sets, normalised escaping sequences need not converge to a recession direction that describes every far-out feasible ray.
The recession criterion is especially useful for quadratic objectives, because recession directions can be read from the nullspace of the quadratic part.
[example: Constrained Least Squares and Recession]
For
\begin{align*}
\inf_{x\in C}\frac12|Ax-b|^2,
\end{align*}
the objective is finite on all of $\mathbb R^n$, so the feasible set is exactly $C$. Fix $x\in C$ and $d\in\operatorname{rec}(C)$. Then $x+td\in C$ for every $t\ge0$, and along this ray
\begin{align*}
\frac12|A(x+td)-b|^2=\frac12|Ax-b+tAd|^2.
\end{align*}
Expanding the square gives
\begin{align*}
|Ax-b+tAd|^2=|Ax-b|^2+2t\langle Ax-b,Ad\rangle+t^2|Ad|^2.
\end{align*}
Therefore, for $t>0$,
\begin{align*}
\frac{\frac12|A(x+td)-b|^2-\frac12|Ax-b|^2}{t}=\langle Ax-b,Ad\rangle+\frac12t|Ad|^2.
\end{align*}
If $Ad=0$, this quotient is $0$ for every $t>0$, so the recession value in direction $d$ is $0$. If $Ad\ne0$, then $|Ad|^2>0$, and
\begin{align*}
\langle Ax-b,Ad\rangle+\frac12t|Ad|^2\to+\infty.
\end{align*}
Thus the recession directions with nonpositive asymptotic slope are exactly
\begin{align*}
\{d\in\operatorname{rec}(C):Ad=0\}=\operatorname{rec}(C)\cap\ker A.
\end{align*}
By the preceding recession criterion, the programme is bounded below and level-bounded exactly when
\begin{align*}
\operatorname{rec}(C)\cap\ker A=\{0\}.
\end{align*}
Here boundedness below is automatic because
\begin{align*}
\frac12|Ax-b|^2\ge0
\end{align*}
for every $x$, so the displayed condition is precisely the level-boundedness test.
If $C=\mathbb R^n$, then $\operatorname{rec}(C)=\mathbb R^n$, and the condition becomes
\begin{align*}
\ker A=\{0\}.
\end{align*}
This is exactly full column rank of $A$. Even when $\ker A\ne\{0\}$, minimisers still exist. Writing $y=Ax$, the unconstrained problem is
\begin{align*}
\inf_{y\in\operatorname{Range}(A)}\frac12|y-b|^2.
\end{align*}
The subspace $\operatorname{Range}(A)\subset\mathbb R^m$ is closed because it is finite-dimensional. Hence $b$ has an [orthogonal decomposition](/theorems/436)
\begin{align*}
b=y^*+z
\end{align*}
with $y^*\in\operatorname{Range}(A)$ and $z\perp\operatorname{Range}(A)$. For any $y\in\operatorname{Range}(A)$,
\begin{align*}
|y-b|^2=|y-y^*-z|^2.
\end{align*}
Since $y-y^*\in\operatorname{Range}(A)$ and $z\perp\operatorname{Range}(A)$,
\begin{align*}
|y-y^*-z|^2=|y-y^*|^2+|z|^2.
\end{align*}
Therefore
\begin{align*}
|y-b|^2\ge |z|^2=|y^*-b|^2,
\end{align*}
so $y^*$ is the closest point in $\operatorname{Range}(A)$ to $b$. Since $y^*\in\operatorname{Range}(A)$, there is $x^*$ with $Ax^*=y^*$, and $x^*$ attains the least-squares minimum. Thus full column rank is the condition for level-boundedness, not for existence of an unconstrained least-squares minimiser.
[/example]
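A small rank-deficient instance makes the distinction between level-boundedness and existence concrete. In the Python sketch below the matrix, the right-hand side, and all names are hypothetical choices: $A$ has rank one, so $\ker A\neq\{0\}$, yet the minimum is attained along a whole line.

```python
def half_sq_residual(A, b, x):
    """Objective (1/2)|Ax - b|^2 for a matrix given as a list of rows."""
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return 0.5 * sum(ri * ri for ri in r)

# Hypothetical rank-one data: Range(A) = span{(1,1)} and ker A = span{(1,-1)}.
A = [[1.0, 1.0], [1.0, 1.0]]
b = [1.0, 3.0]

y_star = [2.0, 2.0]       # orthogonal projection of b onto Range(A)
z = [-1.0, 1.0]           # z = b - y_star, orthogonal to (1,1)
x_star = [2.0, 0.0]       # one solution of A x = y_star
kernel_dir = [1.0, -1.0]

best = half_sq_residual(A, b, x_star)   # equals |z|^2 / 2

# Moving along ker A keeps the value fixed, so the minimiser set is unbounded.
far_points = [[x_star[0] + t * kernel_dir[0], x_star[1] + t * kernel_dir[1]]
              for t in (1.0, 1e3, 1e6)]
far_values = [half_sq_residual(A, b, p) for p in far_points]
print(best, far_values)
```

The sublevel set at the optimal value already contains points of arbitrarily large norm, so the programme is not level-bounded, while the projection argument still produces a minimiser.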
## Perturbations and Stability of Optimal Values
Optimisation problems rarely appear in isolation: data, right-hand sides, regularisation parameters, and constraints move. The perturbation viewpoint studies how the value changes when the problem is embedded in a family.
[definition: Perturbation Function]
A perturbation function for a convex programme is a proper convex function $\Phi: \mathbb R^n \times \mathbb R^m \to (-\infty,+\infty]$ such that the unperturbed programme has value
\begin{align*}
p(0)=\inf_{x \in \mathbb R^n}\Phi(x,0).
\end{align*}
The associated value function is the map $p:\mathbb R^m\to[-\infty,+\infty]$ defined by
\begin{align*}
p(u):=\inf_{x \in \mathbb R^n}\Phi(x,u), \qquad u \in \mathbb R^m.
\end{align*}
[/definition]
The variable $u$ is a bookkeeping device for changed constraints or changed data.
The value function is obtained by eliminating $x$, so convexity could in principle be lost when different minimisers are chosen for different parameter values. The geometric issue is whether projecting the epigraph along the eliminated variable preserves convexity. The partial-infimum theorem is needed precisely here: it gives the condition under which minimising out the decision variable leaves a convex value function of the perturbation parameter.
[quotetheorem:2552]
[citeproof:2552]
This result is the entry point to duality: supporting hyperplanes to the epigraph of $p$ become certificates for lower bounds, linking the perturbation picture back to conjugates and support functions from the previous chapter. Joint convexity of $\Phi$ is the load-bearing hypothesis. If $\Phi(x,u)=(u^2-1)^2$, independent of $x$, then $p(u)=(u^2-1)^2$, so $p(0)=1$ while $p(-1)=p(1)=0$; the midpoint inequality fails.
The whole-space partial infimum is also part of the projection argument. The epigraph of $p$ is the projection of the convex set $\{(x,u,r):\Phi(x,u)\le r\}$ onto the $(u,r)$ variables, and projections preserve convexity. If the eliminated variable is secretly restricted to a non-convex set, this geometric proof no longer applies: minimising the convex function $(x-u)^2$ over $x\in\{-1,1\}$ gives $p(u)=\min\{(u-1)^2,(u+1)^2\}$, which is not convex. The extended-real convention is equally important. For the convex indicator model $\Phi(x,u)=\iota_{\{x=u,\ x\ge0\}}(x,u)$, the value function is $p(u)=0$ for $u\ge0$ and $p(u)=+\infty$ for $u<0$; its epigraph is convex, but this statement would be lost if infeasible parameters were forced into an ordinary real-valued function. The theorem says only that $p$ has convex epigraph as an extended-real value function; it does not imply continuity at $0$, finite values near $0$, boundedness below, or attainment of the infimum defining $p(u)$.
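The contrast between minimising over all of $\mathbb R$ and over a non-convex set can be checked with a midpoint test. The Python sketch below is illustrative only: the jointly convex model $(x-u)^2+x^2$, the search bracket, and the helper names are assumptions, and the inner minimisation uses a ternary search that is valid because $x\mapsto\Phi(x,u)$ is convex.

```python
def minimise_out(phi, u, lo=-10.0, hi=10.0, iters=200):
    """Approximate inf over x in [lo, hi] of phi(x, u) by ternary search.

    Valid when x -> phi(x, u) is convex; the bracket [lo, hi] is an assumption.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1, u) <= phi(m2, u):
            hi = m2
        else:
            lo = m1
    return phi(0.5 * (lo + hi), u)

def phi_convex(x, u):
    # Jointly convex in (x, u); minimising out x gives p(u) = u^2 / 2.
    return (x - u) ** 2 + x ** 2

def p_convex(u):
    return minimise_out(phi_convex, u)

def p_two_points(u):
    # The term (x - u)^2 from the text, with x restricted to {-1, 1}.
    return min((u - 1.0) ** 2, (u + 1.0) ** 2)

def midpoint_gap(p, u0, u1):
    """p((u0+u1)/2) - (p(u0) + p(u1))/2; a positive value contradicts convexity."""
    return p(0.5 * (u0 + u1)) - 0.5 * (p(u0) + p(u1))

print(midpoint_gap(p_convex, -1.0, 1.0))
print(midpoint_gap(p_two_points, -1.0, 1.0))
```

A positive gap for the two-point restriction is exactly the failure described above, while the whole-space partial infimum of the jointly convex model passes the same test.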
[definition: Stable Optimal Value]
The value function $p: \mathbb R^m \to [-\infty,+\infty]$ is stable at $0$ if $p(0)$ is finite and $p$ is continuous at $0$ relative to $\operatorname{dom} p$.
[/definition]
Stability is stronger than existence at the unperturbed parameter, because nearby perturbations may change feasibility or send minimising sequences to infinity. The theorem below gives a usable compactness test for stability: keep all nearby low-value points inside one compact set, then closedness controls lower limits of values. A separate recovery condition supplies the matching upper limit.
[quotetheorem:6683]
[citeproof:6683]
The hypotheses are designed to make nearby values finite from below and to make near-minimisers of nearby problems stay in a common compact set. Their roles are distinct. Properness excludes degenerate data such as $\Phi\equiv+\infty$, where $p(0)$ is not finite and stability at a finite optimum is not a meaningful conclusion. The local lower bound excludes nearby perturbations with $p(u)=-\infty$; without it, a compactness condition about existing near-minimisers would be vacuous along such parameters and could not prove lower semicontinuity. Closedness prevents downward jumps at the limiting parameter: the convex function $g:\mathbb R\to(-\infty,+\infty]$ given by $g(0)=0$, $g(u)=-1$ for $u>0$, and $g(u)=+\infty$ for $u<0$ is not closed, and with $\Phi(x,u)=g(u)$ we get $p(0)=0$ but $p(u)=-1$ for $u>0$, so lower semicontinuity at $0$ fails. Convexity is the structural assumption that keeps the perturbation problem inside convex optimisation. The compact-subsequence proof of lower semicontinuity mainly uses properness, closedness, local lower boundedness, and uniform compactness, but without convexity the value function need not inherit the convex geometry needed for duality; for example, $\Phi(x,u)=(u^2-1)^2$, independent of $x$, gives the non-convex value function $p(u)=(u^2-1)^2$.
The uniform compactness hypothesis is stronger than pointwise existence of minimisers: the minimisers must not drift away as $u\to0$. The model
\begin{align*}
\Phi(x,u)=e^x+ux
\end{align*}
has $p(0)=0$ but admits minimising sequences with $x\to-\infty$ at $u=0$, and for positive perturbations the value is unbounded below. This model is convex in $x$ for each fixed $u$ but, because of the bilinear term $ux$, not jointly convex in $(x,u)$; it is used here only to illustrate the lower-bound and compactness hypotheses. The local lower-bound hypothesis excludes the positive-perturbation values, while the compactness hypothesis excludes the escape of near-minimisers in the finite-value regime. The recovery-map assumption is independent: lower semicontinuity prevents downward jumps, but it does not prevent upward jumps caused by disappearing feasible points. A simple convex model is $\Phi(x,u)=\iota_{\{x=0\}}(x)+\iota_{\{0\}}(u)$, for which $p(0)=0$ and $p(u)=+\infty$ for $u\neq0$; there is no nearby feasible recovery path, so upper semicontinuity fails relative to any ambient neighbourhood containing points outside $\operatorname{dom}p$.
[example: Infeasible, Unbounded, and Unstable Perturbations]
Consider
\begin{align*}
\Phi(x,u)=x+\iota_{[u,\infty)}(x),
\end{align*}
so the constraint is $x\ge u$ and the value function is
\begin{align*}
p(u)=\inf_{x\in\mathbb R}\Phi(x,u)=\inf_{x\ge u}x.
\end{align*}
Every feasible $x$ satisfies $x\ge u$, hence $p(u)\ge u$. The point $x=u$ is feasible and has value $u$, so
\begin{align*}
p(u)=u.
\end{align*}
In particular $p(0)=0$, and if $u_k\to0$, then
\begin{align*}
p(u_k)=u_k\to0=p(0),
\end{align*}
so this perturbation has stable value at $0$.
Now consider
\begin{align*}
\Phi(x,u)=e^x+\iota_{(-\infty,u]}(x),
\end{align*}
so feasible points satisfy $x\le u$ and
\begin{align*}
p(u)=\inf_{x\le u}e^x.
\end{align*}
Since $e^x>0$ for every $x\in\mathbb R$, every feasible value is positive and therefore
\begin{align*}
p(u)\ge0.
\end{align*}
For each integer $k\ge1$, the point $x_k=u-k$ is feasible because $u-k\le u$, and
\begin{align*}
e^{x_k}=e^{u-k}=e^u e^{-k}\to0.
\end{align*}
Thus
\begin{align*}
p(u)=0
\end{align*}
for every finite $u$. No minimiser exists, because attaining the value would require a feasible real number $x$ with $e^x=0$, while $e^x>0$ for all real $x$.
Finally encode the constraint $x^2\le u$ with objective $x^2$ by
\begin{align*}
\Phi(x,u)=x^2+\iota_{\{x:x^2\le u\}}(x).
\end{align*}
If $u\ge0$, then $x=0$ is feasible because $0^2=0\le u$, and every feasible point satisfies $x^2\ge0$. Hence
\begin{align*}
p(u)=\inf_{\{x:x^2\le u\}}x^2=0.
\end{align*}
If $u<0$, then no real $x$ satisfies $x^2\le u$, since $x^2\ge0>u$ for every $x\in\mathbb R$. Therefore the feasible set is empty and
\begin{align*}
p(u)=+\infty.
\end{align*}
These three models separate stable finite values, finite values without attainment, and perturbations whose effective domain records infeasibility.
[/example]
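The three value functions can be approximated by brute force once the indicator terms are represented by `math.inf`. The following Python sketch is an illustration under stated assumptions: the grid on $[-50,50]$, the function names, and the sample parameters are all hypothetical choices.

```python
import math

# Hypothetical grid on [-50, 50] with spacing 0.01.
GRID = [k / 100.0 for k in range(-5000, 5001)]

def value(phi, u):
    """Grid approximation of p(u) = inf_x phi(x, u).

    Infeasible points carry the value +inf, so an empty feasible set yields +inf.
    """
    return min(phi(x, u) for x in GRID)

def phi_linear(x, u):      # x + indicator of [u, inf)
    return x if x >= u else math.inf

def phi_exp(x, u):         # e^x + indicator of (-inf, u]
    return math.exp(x) if x <= u else math.inf

def phi_square(x, u):      # x^2 + indicator of {x : x^2 <= u}
    return x * x if x * x <= u else math.inf

print(value(phi_linear, 0.25))
print(value(phi_exp, 0.0))
print(value(phi_square, 0.5), value(phi_square, -0.5))
```

On a bounded grid the second model can only report a small positive number rather than $0$, which is the numerical shadow of non-attainment; the third model returns `inf` for negative parameters, matching the extended-real convention for infeasible perturbations.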
The same stability questions reappear outside finite-dimensional modelling examples. In Tikhonov-regularised inverse problems, compact near-minimiser conditions are replaced by compact embeddings or weak compactness; in Chebyshev approximation, level-boundedness controls whether a best uniform approximant exists; in entropy and transport models, extended-real indicator terms encode marginal or positivity constraints. The finite-dimensional theorem is therefore a prototype: the proof isolates exactly which part is compactness, which part is closedness, and which part is convex geometry.
The chapter closes with an example where the feasible set is compact but the objective has a domain boundary. This is the common pattern for entropy models: compactness supplies existence, and lower semicontinuity handles the boundary.
[example: Entropy Minimisation over the Simplex]
Let
\begin{align*}
\Delta_n := \left\{x\in\mathbb R^n : x_i\ge0,\ \sum_{i=1}^n x_i=1\right\},
\end{align*}
and define
\begin{align*}
F(x):=\sum_{i=1}^n \bigl(x_i\log x_i+c_i x_i\bigr),
\end{align*}
with $0\log0=0$. If $x,y\in\Delta_n$ and $t\in[0,1]$, then $(1-t)x_i+ty_i\ge0$ for every $i$, and
\begin{align*}
\sum_{i=1}^n\bigl((1-t)x_i+ty_i\bigr)=(1-t)\sum_{i=1}^n x_i+t\sum_{i=1}^n y_i=(1-t)+t=1.
\end{align*}
Thus $\Delta_n$ is convex. It is closed because it is defined by finitely many non-strict linear inequalities and one linear equality, each of which defines a closed set, and it is bounded because $0\le x_i\le1$ for every $x\in\Delta_n$. Hence $\Delta_n$ is compact in finite dimension.
The scalar function $s\mapsto s\log s$ is continuous on $(0,\infty)$, and
\begin{align*}
\lim_{s\downarrow0}s\log s=\lim_{r\to\infty}\frac{-r}{e^r}=0,
\end{align*}
where $r=-\log s$. Therefore the convention $0\log0=0$ makes $s\mapsto s\log s$ continuous on $[0,\infty)$, so $F$ is continuous on $\Delta_n$. By the *Weierstrass Existence Theorem for Convex Programmes*, $F$ attains its minimum on $\Delta_n$.
Let $x^*\in\Delta_n$ be a minimiser. No coordinate of $x^*$ can be zero. Suppose $x_i^*=0$. Since $\sum_k x_k^*=1$, there is some $j$ with $x_j^*>0$. For $0<t<x_j^*$, set
\begin{align*}
x(t):=x^*+t e_i-t e_j.
\end{align*}
Then $x(t)\in\Delta_n$, and the change in objective is
\begin{align*}
F(x(t))-F(x^*)=t\log t+(x_j^*-t)\log(x_j^*-t)-x_j^*\log x_j^*+c_i t-c_j t.
\end{align*}
Dividing by $t>0$ gives
\begin{align*}
\frac{F(x(t))-F(x^*)}{t}=\log t+\frac{(x_j^*-t)\log(x_j^*-t)-x_j^*\log x_j^*}{t}+c_i-c_j.
\end{align*}
By the [mean value theorem](/theorems/186) applied to $s\mapsto s\log s$, there is $\xi_t\in(x_j^*-t,x_j^*)$ such that
\begin{align*}
\frac{(x_j^*-t)\log(x_j^*-t)-x_j^*\log x_j^*}{t}=-(\log \xi_t+1).
\end{align*}
As $t\downarrow0$, the numbers $\xi_t$ stay near $x_j^*>0$, so $-(\log \xi_t+1)+c_i-c_j$ stays bounded, while $\log t\to-\infty$. Hence $F(x(t))-F(x^*)<0$ for all sufficiently small $t>0$, contradicting minimality. Therefore $x_i^*>0$ for every $i$.
Now take any $i,j$. For sufficiently small $|t|$, the point
\begin{align*}
x^*+t e_i-t e_j
\end{align*}
lies in $\Delta_n$. The one-variable function
\begin{align*}
\varphi(t):=F(x^*+t e_i-t e_j)
\end{align*}
has a minimum at $t=0$, so $\varphi'(0)=0$. Since $\frac{d}{ds}(s\log s)=\log s+1$ for $s>0$, differentiating gives
\begin{align*}
0=(\log x_i^*+1+c_i)-(\log x_j^*+1+c_j)=\log x_i^*+c_i-\log x_j^*-c_j.
\end{align*}
Thus
\begin{align*}
\log x_i^*+c_i=\log x_j^*+c_j
\end{align*}
for all $i,j$. Hence there is a constant $\lambda$ such that
\begin{align*}
\log x_i^*+c_i=\lambda
\end{align*}
for every $i$, and therefore
\begin{align*}
x_i^*=e^{\lambda-c_i}=e^\lambda e^{-c_i}.
\end{align*}
Using $\sum_i x_i^*=1$,
\begin{align*}
1=\sum_{i=1}^n x_i^*=e^\lambda\sum_{i=1}^n e^{-c_i}.
\end{align*}
Therefore
\begin{align*}
e^\lambda=\frac{1}{\sum_{j=1}^n e^{-c_j}}.
\end{align*}
Substituting this value of $e^\lambda$ gives
\begin{align*}
x_i^*=\frac{e^{-c_i}}{\sum_{j=1}^n e^{-c_j}},\qquad i=1,\dots,n.
\end{align*}
Compactness supplies existence, while the entropy term forces the minimiser into the relative interior of the simplex and turns the optimal weights into the softmax of $-c$.
[/example]
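The closed form can be checked numerically. The Python sketch below, written for a hypothetical cost vector $c=(0,1,2)$ with illustrative function names, computes the softmax of $-c$, verifies that $\log x_i^*+c_i$ is the same for every $i$, and compares the objective value with a few other points of the simplex.

```python
import math

def entropy_objective(x, c):
    """F(x) = sum_i (x_i log x_i + c_i x_i), with the convention 0 log 0 = 0."""
    return sum((xi * math.log(xi) if xi > 0.0 else 0.0) + ci * xi
               for xi, ci in zip(x, c))

def softmax_of_negative(c):
    """Weights e^{-c_i} / sum_j e^{-c_j}.

    Shifting by min(c) avoids overflow and cancels in the ratio.
    """
    m = min(c)
    w = [math.exp(-(ci - m)) for ci in c]
    total = sum(w)
    return [wi / total for wi in w]

c = [0.0, 1.0, 2.0]                       # hypothetical cost vector
x_star = softmax_of_negative(c)

# The optimality condition says log x_i + c_i equals one constant lambda.
levels = [math.log(xi) + ci for xi, ci in zip(x_star, c)]

# A few other feasible points, including a vertex where 0 log 0 = 0 is used.
competitors = [[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0], [0.6, 0.3, 0.1]]

print(x_star, levels)
print(entropy_objective(x_star, c), [entropy_objective(x, c) for x in competitors])
```

Since $\log x_i^*+c_i=\lambda$ for every $i$ and the weights sum to one, the optimal value is $F(x^*)=\lambda=-\log\sum_j e^{-c_j}$, which the sketch reproduces up to rounding.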
# 4. Lagrangian Duality
Lagrangian duality is the bridge between constrained convex optimisation and the geometry of separating hyperplanes. Chapters 1 and 2 developed the prerequisites used here: convex sets and relative interiors, convex functions and subgradients, separating hyperplanes, and first-order optimality for convex functions on convex domains. This chapter organises those tools around constraints. The main question is how much information about a constrained minimum is contained in the family of unconstrained penalised problems obtained by attaching multipliers to the constraints.
## Constrained Problems and the Lagrangian
The starting point is a constrained minimisation problem in which not every point of the ambient space is admissible. The first problem is to encode the feasible set in a way that allows convex analysis to act on it, while keeping track of which constraints create prices or shadow costs at the optimum.
[definition: Convex Optimisation Problem with Functional Constraints]
Let $X \subset \mathbb R^n$ be a convex set, let $f_0:X\to \mathbb R$ be convex, let $f_i:X\to \mathbb R$ be convex for $i=1,\dots,m$, and let $h_j:X\to \mathbb R$ be affine for $j=1,\dots,p$. The constrained convex optimisation problem is
\begin{align*} \inf_{x\in X} f_0(x) \quad \text{subject to} \quad f_i(x)\le 0\ (1\le i\le m), \qquad h_j(x)=0\ (1\le j\le p). \end{align*}
Its feasible set is
\begin{align*} \mathcal F=\{x\in X: f_i(x)\le 0\text{ for all }i,\ h_j(x)=0\text{ for all }j\}. \end{align*}
The optimal value is denoted $p^*=\inf_{x\in\mathcal F} f_0(x)$, with the convention $p^*=+\infty$ when $\mathcal F=\varnothing$.
[/definition]
The inequality functions are required to be convex so that the sublevel sets $\{x\in X:f_i(x)\le0\}$ are convex; the equality functions must be affine, since the zero set of a nonlinear function is usually not convex. Together these requirements make $\mathcal F$ convex. The convention of minimisation is used throughout, so dual lower bounds will approach $p^*$ from below.
[example: Constrained Least Squares]
Let $A\in\mathbb R^{r\times n}$, $b\in\mathbb R^r$, $C\in\mathbb R^{m\times n}$, and $d\in\mathbb R^m$. The problem
\begin{align*} \inf_{x\in\mathbb R^n} \frac12 |Ax-b|^2 \quad \text{subject to}\quad Cx\le d \end{align*}
has domain $X=\mathbb R^n$, objective $f_0(x)=\frac12 |Ax-b|^2$, and inequality functions $f_i(x)=C_i x-d_i$ for $1\le i\le m$, where $C_i$ is the $i$-th row of $C$. The vector inequality $Cx\le d$ means $C_i x\le d_i$ for every $i$, which is the same as $C_i x-d_i\le 0$, so it is exactly the collection of constraints $f_i(x)\le 0$.
Each $f_i$ is affine, hence convex. To verify convexity of $f_0$, let $x,y\in\mathbb R^n$ and $\theta\in[0,1]$, and put $u=Ax-b$ and $v=Ay-b$. Then
\begin{align*} A(\theta x+(1-\theta)y)-b=\theta(Ax-b)+(1-\theta)(Ay-b)=\theta u+(1-\theta)v. \end{align*}
Expanding the squared norm gives
\begin{align*} |\theta u+(1-\theta)v|^2=\theta^2|u|^2+2\theta(1-\theta)u\cdot v+(1-\theta)^2|v|^2. \end{align*}
Therefore
\begin{align*} \theta |u|^2+(1-\theta)|v|^2-|\theta u+(1-\theta)v|^2=\theta(1-\theta)|u-v|^2\ge 0. \end{align*}
Multiplying by $\frac12$ gives
\begin{align*} f_0(\theta x+(1-\theta)y)\le \theta f_0(x)+(1-\theta)f_0(y). \end{align*}
Thus the problem is a convex optimisation problem with functional inequality constraints. It models least-squares fitting with linear resource or safety constraints; the corresponding dual multipliers will be nonnegative prices attached to the violations $C_i x-d_i$.
[/example]
The least-squares example shows the practical difficulty: the constraint $Cx\le d$ is not part of the quadratic objective, but solving the problem requires the objective and constraints to interact. The next construction creates that interaction by assigning a multiplier to each constraint and folding all constraint functions into a single unconstrained expression.
[definition: Lagrangian]
For the constrained problem above, the Lagrangian is the function
\begin{align*} L:X\times \mathbb R^m_+\times \mathbb R^p\to \mathbb R, \qquad L(x,\lambda,\nu)=f_0(x)+\sum_{i=1}^m \lambda_i f_i(x)+\sum_{j=1}^p \nu_j h_j(x). \end{align*}
The vector $\lambda\in\mathbb R^m_+$ is the vector of inequality multipliers, and $\nu\in\mathbb R^p$ is the vector of equality multipliers.
[/definition]
For fixed multipliers, $x\mapsto L(x,\lambda,\nu)$ is a convex function on $X$. The Lagrangian therefore replaces the constrained problem by a family of unconstrained convex minimisation problems indexed by multiplier vectors.
[definition: Dual Function]
The dual function associated to the constrained problem is the extended-real-valued function
\begin{align*} g:\mathbb R^m_+\times\mathbb R^p\to \mathbb R\cup\{-\infty\}, \qquad g(\lambda,\nu)=\inf_{x\in X} L(x,\lambda,\nu). \end{align*}
[/definition]
The value $g(\lambda,\nu)$ may be $-\infty$ if the penalised objective is unbounded below, so the dual problem is understood as an extended-real concave maximisation problem. The dual function is valuable because it is always concave in the multipliers, even when the original problem is not convex. In this course the primal problem is convex, so the concavity of $g$ gives a genuine convex optimisation problem after changing minimisation to maximisation.
[definition: Lagrange Dual Problem]
The Lagrange dual problem is
\begin{align*} \sup_{\lambda\in\mathbb R^m_+,\ \nu\in\mathbb R^p} g(\lambda,\nu). \end{align*}
Its optimal value is denoted $d^*$.
[/definition]
Every dual feasible multiplier gives a certified lower bound on the primal value. Indeed, if $x$ is primal feasible and $\lambda\ge 0$, then $f_i(x)\le 0$ and $h_j(x)=0$, so
\begin{align*}
L(x,\lambda,\nu)=f_0(x)+\sum_{i=1}^m\lambda_i f_i(x)+\sum_{j=1}^p\nu_j h_j(x)\le f_0(x).
\end{align*}
Since $g(\lambda,\nu)=\inf_{x'\in X}L(x',\lambda,\nu)\le L(x,\lambda,\nu)$, this gives $g(\lambda,\nu)\le f_0(x)$ for every feasible $x$, and taking the infimum over feasible $x$ yields $g(\lambda,\nu)\le p^*$. Taking the supremum over dual feasible multipliers gives $d^*\le p^*$.
This weak-duality inequality introduces the duality gap $p^*-d^*$, but it does not say that the gap is finite, zero, or attained by a multiplier. Its strength is precisely that it requires almost no hypotheses: the problem need not be convex, the feasible set need not be closed, and the primal or dual optimum need not exist. The only structural condition used is the sign restriction $\lambda\ge 0$ for inequality multipliers. For the problem $\inf x$ subject to $x\le 0$, a negative multiplier would reward rather than penalise violation of the inequality and would no longer give a lower bound on feasible objective values.
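The role of the sign restriction can be checked numerically on a one-variable toy problem. The sketch below assumes the illustrative programme $\inf x^2$ subject to $x-1\le 0$, which is not taken from the text; it has $p^*=0$, and its Lagrangian $x^2+\lambda(x-1)$ is minimised at $x=-\lambda/2$. The function names are hypothetical.

```python
# Illustrative toy problem (an assumption, not from the text):
#   minimise x**2  subject to  x - 1 <= 0,  so p* = 0 at x = 0.

def lagrangian(x, lam):
    return x**2 + lam * (x - 1.0)

def dual_function(lam):
    # The Lagrangian is a convex quadratic in x, minimised at x = -lam / 2.
    return lagrangian(-lam / 2.0, lam)

p_star = 0.0

# Nonnegative multipliers: every value g(lam) is a lower bound on p*.
grid = [0.1 * k for k in range(51)]          # lam in [0, 5]
lower_bounds = [dual_function(lam) for lam in grid]
d_star_on_grid = max(lower_bounds)

# A negative multiplier rewards slack in the constraint and breaks the bound.
g_negative = dual_function(-2.0)

print("best lower bound on the grid:", d_star_on_grid)
print("g(-2) =", g_negative, "which exceeds p* =", p_star)
```

On this instance the nonnegative multipliers never overshoot $p^*$, while the single negative multiplier does, which is the failure mode described above.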
A concrete limitation is nonattainment of the best lower bound. Consider the unconstrained convex problem $\inf_{x\in\mathbb R} e^{-x}$, written with no constraints. Its primal value is $p^*=0$, and the dual problem has the same value because there are no multipliers; nevertheless the infimum is not attained by any primal point. Weak duality has certified the value but has not produced an optimiser. In other examples the gap can be positive; the role of Slater's condition below is to identify a common finite-dimensional convex setting where the best certificate is exact.
[example: Dual of Equality-Constrained Least Squares]
Consider
\begin{align*} \inf_{x\in\mathbb R^n} \frac12 |Ax-b|^2 \quad \text{subject to}\quad Cx=d, \end{align*}
where $A\in\mathbb R^{r\times n}$, $b\in\mathbb R^r$, $C\in\mathbb R^{p\times n}$, and $d\in\mathbb R^p$. The equality multiplier $\nu\in\mathbb R^p$ is unrestricted, so the Lagrangian is
\begin{align*} L(x,\nu)=\frac12 |Ax-b|^2+\nu\cdot(Cx-d). \end{align*}
Write dot products as matrix products. First,
\begin{align*} (Ax-b)^\top(Ax-b)=x^\top A^\top Ax-x^\top A^\top b-b^\top Ax+b^\top b. \end{align*}
Since $b^\top Ax$ is a scalar and equals its transpose $x^\top A^\top b$, this becomes
\begin{align*} (Ax-b)^\top(Ax-b)=x^\top A^\top Ax-2x^\top A^\top b+|b|^2. \end{align*}
Also,
\begin{align*} \nu\cdot(Cx-d)=\nu^\top Cx-\nu^\top d=x^\top C^\top\nu-\nu\cdot d. \end{align*}
Therefore
\begin{align*} L(x,\nu)=\frac12 x^\top A^\top Ax-x^\top A^\top b+x^\top C^\top\nu+\frac12 |b|^2-\nu\cdot d. \end{align*}
Combining the two linear terms in $x$ gives
\begin{align*} L(x,\nu)=\frac12 x^\top A^\top Ax-x^\top(A^\top b-C^\top\nu)+\frac12 |b|^2-\nu\cdot d. \end{align*}
Assume $A^\top A$ is invertible. Set
\begin{align*} Q=A^\top A,\qquad q=A^\top b-C^\top\nu. \end{align*}
Then $Q$ is symmetric positive definite, because for every nonzero $z\in\mathbb R^n$,
\begin{align*} z^\top Qz=z^\top A^\top Az=|Az|^2>0, \end{align*}
where the strict inequality follows from invertibility of $A^\top A$. The $x$-dependent part of the Lagrangian is
\begin{align*} \frac12 x^\top Qx-x^\top q. \end{align*}
Completing the square gives
\begin{align*} \frac12\bigl(x-Q^{-1}q\bigr)^\top Q\bigl(x-Q^{-1}q\bigr)-\frac12 q^\top Q^{-1}q=\frac12 x^\top Qx-x^\top q. \end{align*}
Indeed, expanding the left side produces
\begin{align*} \frac12 x^\top Qx-\frac12 x^\top q-\frac12 q^\top x+\frac12 q^\top Q^{-1}q-\frac12 q^\top Q^{-1}q. \end{align*}
Since $q^\top x=x^\top q$, this reduces to
\begin{align*} \frac12 x^\top Qx-x^\top q. \end{align*}
Thus
\begin{align*} L(x,\nu)=\frac12\bigl(x-Q^{-1}q\bigr)^\top Q\bigl(x-Q^{-1}q\bigr)+\frac12 |b|^2-\nu\cdot d-\frac12 q^\top Q^{-1}q. \end{align*}
The square term is nonnegative and is zero exactly when
\begin{align*} x=Q^{-1}q=(A^\top A)^{-1}(A^\top b-C^\top\nu). \end{align*}
Therefore the dual function is
\begin{align*} g(\nu)=\frac12 |b|^2-\nu\cdot d-\frac12 q^\top Q^{-1}q. \end{align*}
Substituting back $Q=A^\top A$ and $q=A^\top b-C^\top\nu$ gives
\begin{align*} g(\nu)=\frac12 |b|^2-\nu\cdot d-\frac12 (A^\top b-C^\top\nu)^\top (A^\top A)^{-1}(A^\top b-C^\top\nu). \end{align*}
The final quadratic form is nonnegative because $(A^\top A)^{-1}$ is positive definite, and it appears with a minus sign. Hence $g$ is a concave quadratic function of the equality multiplier $\nu$, so the equality-constrained least-squares problem becomes an unconstrained maximisation problem over $\nu$, whose maximiser records the shadow prices of imposing $Cx=d$.
[/example]
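The closed form for $g(\nu)$ can be sanity-checked on a small instance. The following is a minimal sketch in plain Python, assuming hypothetical data with $r=3$, $n=2$, and a single equality constraint, so that $\nu$ is a scalar and $C$ has one row `c`. The helper names `dot` and `solve` are illustrative, and the formula for `nu_star` comes from setting $g'(\nu)=0$ in the scalar case.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

# Hypothetical data: A is 3 x 2 with A^T A invertible, one constraint c . x = d.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 0.0]
c = [1.0, 1.0]
d = 2.0

n = len(c)
Q = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]  # A^T A
Atb = [sum(row[i] * bk for row, bk in zip(A, b)) for i in range(n)]            # A^T b

def lagrangian_minimiser(nu):
    q = [Atb[i] - c[i] * nu for i in range(n)]   # q = A^T b - C^T nu
    return solve(Q, q), q                        # x = Q^{-1} q

def dual_function(nu):
    x, q = lagrangian_minimiser(nu)
    return 0.5 * dot(b, b) - nu * d - 0.5 * dot(q, x)

# Scalar case: g'(nu) = -d + c . Q^{-1}(A^T b - c nu) = 0.
nu_star = (dot(c, solve(Q, Atb)) - d) / dot(c, solve(Q, c))
x_star, _ = lagrangian_minimiser(nu_star)

residual = [dot(row, x_star) - bk for row, bk in zip(A, b)]
primal_value = 0.5 * dot(residual, residual)
dual_value = dual_function(nu_star)
print("primal:", primal_value, "dual:", dual_value, "nu*:", nu_star)
```

On this instance the Lagrangian minimiser at `nu_star` satisfies the constraint, and the primal and dual values agree up to rounding, as the concave-quadratic form of $g$ suggests.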
## Constraint Qualifications and Strong Duality
Weak duality is formal and requires no regularity, but equality of the primal and dual values is a geometric statement. The question is whether the feasible region has enough interior room for a separating hyperplane argument to certify the optimal value without leaving a gap.
[definition: Slater Point]
For the convex problem above, a Slater point is a point $\bar{x}\in \operatorname{ri}(X)$ such that
\begin{align*} f_i(\bar{x})<0 \quad (1\le i\le m), \qquad h_j(\bar{x})=0 \quad (1\le j\le p). \end{align*}
[/definition]
A Slater point rules out the pathology in which the feasible set is convex but sits entirely on the boundary of an inequality constraint for accidental reasons. To state the benefit of such a constraint qualification, we need a name for the desired equality between the primal value and the best Lagrangian lower bound.
[definition: Strong Duality]
A constrained optimisation problem satisfies strong duality when
\begin{align*} d^*=p^*. \end{align*}
[/definition]
Strong duality should be read as an exact certificate theorem. When the dual supremum is attained by $(\lambda^*,\nu^*)$, the pair is called an optimal multiplier vector. The natural question is which regularity hypothesis turns weak dual lower bounds into an exact certificate; Slater's condition is the main answer for finite-dimensional convex problems. In that setting, Slater feasibility together with finiteness of $p^*$ gives $d^*=p^*$, and the closedness hypothesis used in this chapter additionally guarantees that the dual supremum is attained.
[remark: Slater Strong Duality Principle]
For a finite-dimensional convex programme with finite primal value, affine equality constraints, convex inequality constraints, and a Slater point satisfying every inequality strictly, the Lagrange dual has no duality gap: $d^*=p^*$. Under the local closedness hypothesis used in this chapter, an optimal multiplier is also attained.
[/remark]
The same separation mechanism as in the first chapter is applied to the image of the feasible system under the map recording constraint violations, equality residuals, and objective value. The finite-dimensional and finite-valued hypotheses keep this perturbation set inside an ordinary Euclidean space and avoid domain singularities where separation would have to be formulated with extra closure or lower-semicontinuity assumptions. The assumption that $p^*$ is finite matters as well: if the primal value is $-\infty$, there is no finite supporting level to separate, while if $p^*=+\infty$ the equality $d^*=p^*$ is no longer an exact finite certificate of an attained optimisation value. Slater's condition is used exactly where abnormal separating hyperplanes must be excluded, while the local closedness condition controls attainment of the multiplier rather than equality of values. Without that closedness, a sequence of supporting hyperplanes may certify values tending to $p^*$ while no finite multiplier reaches the supremum.
[example: Failure of Multiplier Attainment Without Strict Feasibility]
A standard convex failure without strict feasibility is
\begin{align*} \inf_{x\in\mathbb R} x \quad \text{subject to}\quad x^2\le 0. \end{align*}
The only feasible point is $x=0$, so $p^*=0$. The Lagrangian is $L(x,\lambda)=x+\lambda x^2$ with $\lambda\ge 0$, and
\begin{align*} \inf_{x\in\mathbb R} (x+\lambda x^2)= -\frac{1}{4\lambda} \end{align*}
for $\lambda>0$, while the infimum is $-\infty$ for $\lambda=0$. Hence $d^*=0$ but the dual supremum is not attained. This shows that value equality alone is weaker than the existence of optimal multipliers, and it also shows why the closedness hypothesis in the Slater principle is stated separately.
[/example]
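The nonattainment in this example is easy to see numerically. The sketch below evaluates the dual function along a growing sequence of multipliers, using the minimiser $x=-1/(2\lambda)$ of $x+\lambda x^2$ for $\lambda>0$; the function names are hypothetical.

```python
def lagrangian(x, lam):
    return x + lam * x**2

def dual_function(lam):
    # For lam = 0 the Lagrangian is the linear function x, unbounded below;
    # negative lam is not dual feasible and also gives -inf.
    if lam <= 0.0:
        return float("-inf")
    # For lam > 0 the minimiser of x + lam * x**2 is x = -1 / (2 * lam).
    return lagrangian(-1.0 / (2.0 * lam), lam)

multipliers = [10.0**k for k in range(0, 7)]
values = [dual_function(lam) for lam in multipliers]

for lam, g in zip(multipliers, values):
    print(f"lambda = {lam:>9.0f}   g(lambda) = {g:.3e}")
```

Every printed value is strictly negative and the values increase towards $0$, so the supremum $d^*=0$ is approached only as $\lambda\to\infty$.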
The Slater principle also does not assert primal attainment. For example, the unconstrained problem $\inf_{x\in\mathbb R} e^{-x}$ has finite value $0$ but no minimiser. Constraint qualifications and closedness hypotheses specify which boundary phenomena are being ruled out; primal compactness or coercivity is a separate issue. This prepares the normal-cone interpretation of multipliers, where active constraints generate the supporting hyperplane at the optimum.
[remark: Constraint Qualification]
Different courses and applications use different constraint qualifications. Linear constraints often need no separate strict feasibility assumption, while nonlinear convex inequalities typically require Slater's condition or a related relative-interior condition. The common role of these hypotheses is to make the normal cone to the feasible set equal to the cone generated by the active constraint normals.
[/remark]
The next example shows how Slater's condition appears in a statistical optimisation model. The multipliers have an economic interpretation as marginal prices for tightening risk or budget constraints.
[example: Markowitz Portfolio with Convex Risk Constraints]
Let $e=(1,\dots,1)\in\mathbb R^n$, let $x\in\mathbb R^n$ represent portfolio weights, let $\mu\in\mathbb R^n$ be expected returns, and let $\Sigma\in\mathbb R^{n\times n}$ be symmetric positive semidefinite. The risk-constrained Markowitz problem is
\begin{align*} \inf_{x\in\mathbb R^n} -\mu\cdot x \quad \text{subject to}\quad e\cdot x=1,\qquad x\ge 0,\qquad x^\top\Sigma x\le \rho. \end{align*}
To put it in functional-constraint form, write no-short-selling as the inequalities $-x_i\le 0$ for $1\le i\le n$, and write the risk constraint as $x^\top\Sigma x-\rho\le 0$. The objective $x\mapsto -\mu\cdot x$ is affine, hence convex, and each no-short-selling constraint function $x\mapsto -x_i$ is affine, hence convex.
It remains to verify the convexity of the risk function $r(x)=x^\top\Sigma x-\rho$. Let $x,y\in\mathbb R^n$ and $\theta\in[0,1]$. Expanding the quadratic form gives
\begin{align*} (\theta x+(1-\theta)y)^\top\Sigma(\theta x+(1-\theta)y)=\theta^2 x^\top\Sigma x+\theta(1-\theta)x^\top\Sigma y+\theta(1-\theta)y^\top\Sigma x+(1-\theta)^2y^\top\Sigma y. \end{align*}
Since $\Sigma$ is symmetric, $y^\top\Sigma x=(y^\top\Sigma x)^\top=x^\top\Sigma y$, so
\begin{align*} (\theta x+(1-\theta)y)^\top\Sigma(\theta x+(1-\theta)y)=\theta^2 x^\top\Sigma x+2\theta(1-\theta)x^\top\Sigma y+(1-\theta)^2y^\top\Sigma y. \end{align*}
Now compare this with the convexity upper bound:
\begin{align*} \theta x^\top\Sigma x+(1-\theta)y^\top\Sigma y-(\theta x+(1-\theta)y)^\top\Sigma(\theta x+(1-\theta)y)=\theta(1-\theta)(x-y)^\top\Sigma(x-y). \end{align*}
Because $\Sigma$ is positive semidefinite, $(x-y)^\top\Sigma(x-y)\ge 0$. Hence
\begin{align*} (\theta x+(1-\theta)y)^\top\Sigma(\theta x+(1-\theta)y)\le \theta x^\top\Sigma x+(1-\theta)y^\top\Sigma y. \end{align*}
Subtracting the same constant $\rho$ from both sides gives
\begin{align*} r(\theta x+(1-\theta)y)\le \theta r(x)+(1-\theta)r(y), \end{align*}
so the risk constraint is convex.
If there is a portfolio $\bar{x}$ such that $\bar{x}_i>0$ for every $i$, $e\cdot\bar{x}=1$, and $\bar{x}^\top\Sigma\bar{x}<\rho$, then the inequality constraints are strict:
\begin{align*} -\bar{x}_i<0\quad (1\le i\le n),\qquad \bar{x}^\top\Sigma\bar{x}-\rho<0. \end{align*}
The equality constraint $e\cdot\bar{x}=1$ is also satisfied, so $\bar{x}$ is a Slater point. By the Slater strong-duality principle, when the primal value is finite and the stated closedness hypothesis holds, the primal optimum equals the best Lagrangian lower bound and the dual optimum is attained. In portfolio language, the best expected-return tradeoff is certified by nonnegative multipliers for the risk and no-short-selling constraints, together with an unrestricted multiplier for the budget equation.
[/example]
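Checking whether a candidate portfolio is a Slater point amounts to testing the three conditions used above. The following is a minimal sketch, assuming a hypothetical two-asset covariance matrix and risk budget; the function names and the tolerance on the budget equation are illustrative choices rather than part of the example.

```python
def quadratic_form(Sigma, x):
    n = len(x)
    return sum(x[i] * Sigma[i][j] * x[j] for i in range(n) for j in range(n))

def is_slater_point(x, Sigma, rho, tol=1e-12):
    budget_ok = abs(sum(x) - 1.0) <= tol             # e . x = 1, up to rounding
    strictly_long = all(xi > 0.0 for xi in x)        # -x_i < 0 for every i
    strictly_safe = quadratic_form(Sigma, x) < rho   # x^T Sigma x - rho < 0
    return budget_ok and strictly_long and strictly_safe

# Hypothetical data: two uncorrelated assets and a risk budget rho.
Sigma = [[0.04, 0.00],
         [0.00, 0.09]]
rho = 0.05

equal_weights = [0.5, 0.5]   # strictly positive, risk 0.0325 < rho
single_asset = [1.0, 0.0]    # feasible, but x_2 = 0 is not strict

print(is_slater_point(equal_weights, Sigma, rho))
print(is_slater_point(single_asset, Sigma, rho))
```

The second portfolio is feasible but lies on the boundary of a no-short-selling constraint, so it does not witness Slater's condition even though the first one does.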
Strong duality also gives infeasibility certificates. The cleanest finite-dimensional version is Farkas' lemma, which can be read as a duality theorem for linear inequalities.
[quotetheorem:6685]
[citeproof:6685]
This lemma is the prototype for dual certificates: either a primal object exists, or a multiplier separates the requested right-hand side from every attainable one. In Lagrangian language, $y$ is a multiplier for the equation $Ax=b$ whose sign condition on $A^\top y$ makes every nonnegative candidate $x$ contribute a nonnegative amount, while $b\cdot y<0$ contradicts exact feasibility. Thus Farkas' lemma is the linear feasibility version of weak duality plus a converse separation theorem: the absence of a primal feasible point is witnessed by a dual certificate. The closedness of the finitely generated cone is doing real work. For a nonclosed cone such as
\begin{align*} K=\{(u,v)\in\mathbb R^2:v>0,\ u\ge 0\}\cup\{(0,0)\}, \end{align*}
the point $b=(1,0)$ lies in $\overline K\setminus K$. No vector can strictly separate $b$ from $K$, since every strict separator would also separate $b$ from the closure. Thus a Farkas-style strict alternative fails when the attainable cone is not closed. The strict inequality $b\cdot y<0$ is what makes the finite-dimensional linear alternatives exclusive and computationally useful. In optimisation language, the infeasibility certificate is a dual feasible point with a strict sign contradiction. The saddle-point section now turns this certificate picture from feasibility to optimality: at an optimum, the separating multiplier must support the achievable-value set while also making the primal variable minimise the Lagrangian, and the resulting equality is recorded as [complementary slackness](/theorems/2559).
## Saddle Points and Multiplier Geometry
Once strong duality holds, dual optimality can be expressed without mentioning two separate optimisation problems. The guiding question is whether there is a single point of the Lagrangian surface that is simultaneously minimal in the primal variable and maximal in the multiplier variables.
[definition: Lagrangian Saddle Point]
A triple $(x^*,\lambda^*,\nu^*)\in X\times\mathbb R^m_+\times\mathbb R^p$ is a saddle point of the Lagrangian if
\begin{align*} L(x^*,\lambda,\nu)\le L(x^*,\lambda^*,\nu^*)\le L(x,\lambda^*,\nu^*) \end{align*}
for all $x\in X$, $\lambda\in\mathbb R^m_+$, and $\nu\in\mathbb R^p$.
[/definition]
The left inequality forces feasibility and complementary slackness, because otherwise varying a multiplier would change the Lagrangian in an unbounded direction. The right inequality says that the primal point minimises the Lagrangian after the optimal multipliers have been fixed. The next theorem makes this two-sided variational picture equivalent to strong duality plus attainment.
[quotetheorem:6686]
[citeproof:6686]
The saddle point theorem is the conceptual source of the Karush-Kuhn-Tucker conditions. Its hypotheses include both zero gap and attainment. Nonattainment of the primal infimum already rules out a saddle point: for the unconstrained problem $\inf_{x\in\mathbb R} e^{-x}$, a saddle point would require a minimiser of $e^{-x}$, but none exists. Dual nonattainment also rules out a saddle point; in
\begin{align*} \inf_{x\in\mathbb R} x \quad \text{subject to}\quad x^2\le 0, \end{align*}
the primal value and dual value are both $0$, yet no finite multiplier attains the dual supremum. A positive duality gap prevents a saddle point for a different reason: the middle value would have to equal both $p^*$ and $d^*$. A convex example with an infinite gap is
\begin{align*} \inf_{(x,y)\in\mathbb R\times(0,\infty)} x \quad \text{subject to}\quad \frac{x^2}{y}\le 0. \end{align*}
The feasible set has $x=0$ and $y>0$, so $p^*=0$. For every multiplier $\lambda\ge 0$, the Lagrangian $x+\lambda x^2/y$ is unbounded below on the domain: along $y=x^2$ with $x\to-\infty$ it equals $x+\lambda\to-\infty$. So $d^*=-\infty$, and no saddle point can exist. Under differentiability and attainment, the condition that $x^*$ minimises $L(\cdot,\lambda^*,\nu^*)$ becomes a stationarity equation, while the left saddle inequality becomes feasibility and complementary slackness.
[quotetheorem:6687]
[citeproof:6687]
Geometrically, the multiplier vector expresses the negative objective gradient as a conic combination of active constraint gradients, plus the span of equality gradients and, when needed, a normal vector to the ambient convex domain $X$. Slater's condition is what makes these multipliers necessary rather than merely sufficient. The boundary example
\begin{align*} \inf_{x\in\mathbb R} x \quad \text{subject to}\quad x^2\le 0 \end{align*}
has optimum $x^*=0$, but stationarity would require $1+2\lambda x^*=0$, which is impossible for finite $\lambda$. The active constraint has zero gradient at the only feasible point, so it cannot generate the normal direction needed to balance the objective.
Differentiability is also essential for this gradient form. For the convex problem $\inf_{x\in\mathbb R} |x|$, the optimum is $0$, but there is no gradient $\nabla f_0(0)$; the correct optimality condition is $0\in\partial |0|=[-1,1]$. Convexity is the final ingredient: for the nonconvex unconstrained problem $\inf_{x\in\mathbb R}(x^4-x^2)$, the point $x=0$ satisfies the first-order stationarity equation but is not a global minimiser. These examples separate the roles of constraint qualification, differentiability, and convexity before the notes move from KKT equations to applications.
[example: Max-Flow Min-Cut as Convex Duality]
Let $G=(V,E)$ be a finite directed network with source $s$, sink $t$, and capacities $c_{uv}\ge 0$ on edges $(u,v)$. Define the incidence operator by
\begin{align*} (Bf)_v=\sum_{(v,w)\in E}f_{vw}-\sum_{(u,v)\in E}f_{uv}. \end{align*}
A flow of value $F$ satisfies
\begin{align*} Bf=F(1_s-1_t),\qquad 0\le f_{uv}\le c_{uv}. \end{align*}
The Lagrange multipliers for the conservation equations are vertex potentials $\pi_v$, and the capacity multipliers lead to the dual cut relaxation
\begin{align*} \inf_{\pi,z}\sum_{(u,v)\in E}c_{uv}z_{uv}\quad\text{subject to}\quad z_{uv}\ge \pi_u-\pi_v,\quad z_{uv}\ge 0,\quad \pi_s-\pi_t=1. \end{align*}
For any feasible flow $(F,f)$ and any feasible dual pair $(\pi,z)$, the constraint $\pi_s-\pi_t=1$ gives
\begin{align*} F=F(\pi_s-\pi_t). \end{align*}
Using $Bf=F(1_s-1_t)$, this becomes
\begin{align*} F=\pi\cdot F(1_s-1_t)=\pi\cdot Bf. \end{align*}
Expanding the incidence product edge by edge gives
\begin{align*} \pi\cdot Bf=\sum_{(u,v)\in E}(\pi_u-\pi_v)f_{uv}. \end{align*}
Since $z_{uv}\ge \pi_u-\pi_v$ and $f_{uv}\ge 0$,
\begin{align*} \sum_{(u,v)\in E}(\pi_u-\pi_v)f_{uv}\le \sum_{(u,v)\in E}z_{uv}f_{uv}. \end{align*}
Since $z_{uv}\ge 0$ and $f_{uv}\le c_{uv}$,
\begin{align*} \sum_{(u,v)\in E}z_{uv}f_{uv}\le \sum_{(u,v)\in E}z_{uv}c_{uv}. \end{align*}
Thus every feasible dual pair gives an upper bound on every feasible flow value:
\begin{align*} F\le \sum_{(u,v)\in E}c_{uv}z_{uv}. \end{align*}
For fixed $\pi$, each variable $z_{uv}$ appears only in the term $c_{uv}z_{uv}$ and must satisfy $z_{uv}\ge 0$ and $z_{uv}\ge \pi_u-\pi_v$. Because $c_{uv}\ge 0$, the smallest admissible choice is
\begin{align*} z_{uv}=\max\{\pi_u-\pi_v,0\}. \end{align*}
Eliminating $z$ gives the potential problem
\begin{align*} \inf_{\pi}\sum_{(u,v)\in E} c_{uv}\max\{\pi_u-\pi_v,0\}\quad\text{subject to}\quad \pi_s-\pi_t=1. \end{align*}
Now compare this relaxation with cuts. If $S\subset V$ is an $s$-$t$ cut, meaning $s\in S$ and $t\notin S$, set $\pi_v=1$ for $v\in S$ and $\pi_v=0$ for $v\notin S$. Then $\pi_s-\pi_t=1$. For an edge $(u,v)$, the value $\max\{\pi_u-\pi_v,0\}$ is $1$ exactly when $u\in S$ and $v\notin S$, and is $0$ otherwise. Therefore the dual objective becomes
\begin{align*} \sum_{(u,v)\in E} c_{uv}\max\{\pi_u-\pi_v,0\}=\sum_{\substack{(u,v)\in E: u\in S,\ v\notin S}}c_{uv}, \end{align*}
which is exactly the capacity of the cut $S$.
Conversely, take any feasible potential $\pi$ with $\pi_s-\pi_t=1$. Replacing $\pi_v$ by
\begin{align*} \tilde\pi_v=\min\{\max\{\pi_v-\pi_t,0\},1\} \end{align*}
keeps $\tilde\pi_s=1$ and $\tilde\pi_t=0$. The clipping map $a\mapsto \min\{\max\{a,0\},1\}$ is nondecreasing and $1$-Lipschitz, so it does not increase positive drops:
\begin{align*} \max\{\tilde\pi_u-\tilde\pi_v,0\}\le \max\{\pi_u-\pi_v,0\}. \end{align*}
Thus it is enough to consider potentials with values in $[0,1]$, with $\pi_s=1$ and $\pi_t=0$.
For $\theta\in(0,1)$, define
\begin{align*} S_\theta=\{v\in V:\pi_v\ge \theta\}. \end{align*}
Then $s\in S_\theta$ and $t\notin S_\theta$, so $S_\theta$ is an $s$-$t$ cut. For $a,b\in[0,1]$,
\begin{align*} \max\{a-b,0\}=\int_0^1 \mathbf 1_{\{b<\theta\le a\}}\,d\theta, \end{align*}
because the set $\{\theta:b<\theta\le a\}$ has length $a-b$ when $a>b$ and is empty when $a\le b$. Applying this with $a=\pi_u$ and $b=\pi_v$ gives
\begin{align*} c_{uv}\max\{\pi_u-\pi_v,0\}=c_{uv}\int_0^1 \mathbf 1_{\{\pi_v<\theta\le \pi_u\}}\,d\theta. \end{align*}
Summing over edges and interchanging the finite sum with the integral yields
\begin{align*} \sum_{(u,v)\in E}c_{uv}\max\{\pi_u-\pi_v,0\}=\int_0^1\sum_{(u,v)\in E}c_{uv}\mathbf 1_{\{u\in S_\theta,\ v\notin S_\theta\}}\,d\theta. \end{align*}
The inner sum is the capacity of $S_\theta$, so
\begin{align*} \sum_{(u,v)\in E}c_{uv}\max\{\pi_u-\pi_v,0\}=\int_0^1 \operatorname{cap}(S_\theta)\,d\theta. \end{align*}
Since this is the average of the cut capacities over an interval of length $1$, at least one threshold cut satisfies
\begin{align*} \operatorname{cap}(S_\theta)\le \sum_{(u,v)\in E}c_{uv}\max\{\pi_u-\pi_v,0\}. \end{align*}
Thus the relaxed potential optimum equals the minimum cut capacity. The primal flow problem and this dual relaxation form a linear-programming dual pair, so strong duality identifies their optimal values. Hence the maximum flow value equals the minimum cut capacity. The integrality of the cut solution comes from thresholding the dual potentials, not from adding integrality constraints by hand.
[/example]
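The thresholding argument can be traced on a small network. The sketch below assumes a hypothetical four-vertex graph and a hypothetical feasible potential already clipped to $[0,1]$; it computes the relaxed objective, lists the distinct threshold cuts $S_\theta$ together with the length of the $\theta$-interval on which each occurs, and compares the weighted average of their capacities with the relaxed objective.

```python
# Hypothetical network: edge (u, v) -> capacity c_uv.
capacity = {("s", "a"): 3.0, ("s", "b"): 2.0, ("a", "t"): 1.0,
            ("b", "t"): 3.0, ("a", "b"): 1.0}

# Hypothetical feasible potential with pi_s = 1, pi_t = 0 and values in [0, 1].
pi = {"s": 1.0, "a": 0.6, "b": 0.3, "t": 0.0}

def relaxed_objective(capacity, pi):
    return sum(c * max(pi[u] - pi[v], 0.0) for (u, v), c in capacity.items())

def cut_capacity(capacity, S):
    return sum(c for (u, v), c in capacity.items() if u in S and v not in S)

def threshold_cuts(capacity, pi):
    """Return (interval length, S_theta, cap(S_theta)) for each distinct threshold cut."""
    levels = sorted({p for p in pi.values() if p > 0.0})
    pieces, lower = [], 0.0
    for level in levels:
        # S_theta is the same set for every theta in (lower, level].
        S = {v for v, p in pi.items() if p >= level}
        pieces.append((level - lower, S, cut_capacity(capacity, S)))
        lower = level
    return pieces

pieces = threshold_cuts(capacity, pi)
relaxed = relaxed_objective(capacity, pi)
average = sum(length * cap for length, _, cap in pieces)   # integral of cap(S_theta)
best_length, best_cut, best_cap = min(pieces, key=lambda piece: piece[2])

print("relaxed objective:", relaxed)
print("average threshold-cut capacity:", average)
print("best threshold cut:", sorted(best_cut), "capacity:", best_cap)
```

Because the average of the cut capacities equals the relaxed objective, the cheapest threshold cut cannot cost more than the potential it was rounded from.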
The geometric interpretation is that multipliers are not artificial penalty parameters. They are coordinates of a supporting hyperplane to the set of achievable objective and constraint values. Strong duality says the supporting hyperplane can be chosen so that its height is exactly the primal optimum, and the saddle point theorem says that this supporting hyperplane is visible directly through the Lagrangian.
# KKT Conditions and Sensitivity
This chapter turns duality into first-order conditions that can be checked and interpreted. Earlier chapters introduced convex functions, subgradients, and Lagrange duality; here the question is how an optimal primal point, an optimal dual multiplier, and the active constraints fit together. The Karush-Kuhn-Tucker conditions give this fit as a finite list of equations and inequalities, while sensitivity theory explains why the multipliers deserve the name shadow prices.
## Optimality Equations for Inequality and Equality Constraints
How can we recognise an optimum without comparing it against every feasible point? For a convex programme, the Lagrangian packages the objective and constraints into one function, and optimality becomes the statement that the primal variable minimises this package while the multipliers obey sign and orthogonality rules.
Consider the finite-dimensional convex programme
\begin{align*}
\inf_{x \in \mathbb R^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0 \ (1 \le i \le m), \qquad Ax=b,
\end{align*}
where $f_0,f_1,\dots,f_m: \mathbb R^n \to \mathbb R$ are convex and $A \in \mathbb R^{p \times n}$, $b \in \mathbb R^p$. To match the Lagrange-duality notation of Chapter 4, write the Lagrange multipliers as $\lambda \in \mathbb R^m$ for the inequalities and $\nu \in \mathbb R^p$ for the equalities.
[definition: Lagrangian]
For the above problem, the Lagrangian is the function $L: \mathbb R^n \times \mathbb R^m \times \mathbb R^p \to \mathbb R$ defined by
\begin{align*}
L(x,\lambda,\nu)= f_0(x)+\sum_{i=1}^m \lambda_i f_i(x)+\nu^\top(Ax-b).
\end{align*}
[/definition]
The Lagrangian gives a lower-bound mechanism once the inequality multipliers are nonnegative. Which additional relations must hold when this lower bound is attained by a feasible point? We need feasibility of $x$, feasibility of the multipliers, vanishing contribution from inactive inequalities, and minimisation of the Lagrangian in the primal variable; the KKT conditions collect these requirements in one system.
[definition: KKT Conditions]
For the previously displayed convex programme, a triple $(x^*,\lambda^*,\nu^*) \in \mathbb R^n \times \mathbb R^m \times \mathbb R^p$ satisfies the KKT conditions if primal feasibility holds: $f_i(x^*)\le 0$ for $1\le i\le m$ and $Ax^*=b$; dual feasibility holds: $\lambda_i^*\ge 0$ for $1\le i\le m$; complementary slackness holds: $\lambda_i^*f_i(x^*)=0$ for $1\le i\le m$; and stationarity holds:
\begin{align*}
0 \in \partial f_0(x^*)+\sum_{i=1}^m \lambda_i^* \partial f_i(x^*)+A^\top \nu^*.
\end{align*}
[/definition]
The four parts are called primal feasibility, dual feasibility, complementary slackness, and stationarity. When all functions are differentiable at $x^*$, stationarity becomes
\begin{align*}
\nabla f_0(x^*)+\sum_{i=1}^m \lambda_i^* \nabla f_i(x^*)+A^\top \nu^*=0.
\end{align*}
The least transparent of the four parts is complementary slackness: it says that an inactive constraint cannot carry a positive price. The next result explains why this condition is forced whenever primal and dual optimal values meet.
[quotetheorem:6688]
[citeproof:6688]
Complementary slackness is the algebraic form of the active-set principle. The zero-gap hypothesis is essential: for the infeasible pair of lower and upper bounds $x\le 0$ and $x\ge 1$, nonnegative multipliers can be written down, but there is no primal-dual optimum and hence no meaningful slackness relation. Even when the problem is feasible, the theorem does not say that every active constraint has a positive multiplier; an active constraint may have zero price if it is redundant or does not affect the optimum. This distinction between active constraints and positively priced constraints is what makes the next examples useful: they show how KKT equations recover the active set only after stationarity and feasibility are used together.
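The distinction between active and positively priced constraints can be checked on one-variable problems. The following is a minimal sketch, assuming two hypothetical toy programmes that share the constraint $x-1\le 0$; the function `kkt_report` and the tolerance are illustrative, and the differentiable form of stationarity is used.

```python
def kkt_report(x, lam, grad_f0, f1, grad_f1, tol=1e-9):
    """Check the four KKT conditions for a single constraint f1(x) <= 0 in one variable."""
    return {
        "primal_feasible": f1(x) <= tol,
        "dual_feasible": lam >= -tol,
        "complementary": abs(lam * f1(x)) <= tol,
        "stationary": abs(grad_f0(x) + lam * grad_f1(x)) <= tol,
    }

# Shared constraint x - 1 <= 0.
def f1(x):
    return x - 1.0

def grad_f1(x):
    return 1.0

# Problem A: minimise (x - 2)**2. Optimum x* = 1, constraint active with price 2.
report_a = kkt_report(1.0, 2.0, lambda x: 2.0 * (x - 2.0), f1, grad_f1)

# Problem B: minimise (x - 1)**2. Optimum x* = 1, constraint active but price 0.
report_b = kkt_report(1.0, 0.0, lambda x: 2.0 * (x - 1.0), f1, grad_f1)

# Wrong guess for problem A: a positive price on an inactive constraint.
report_bad = kkt_report(0.5, 2.0, lambda x: 2.0 * (x - 2.0), f1, grad_f1)

for name, report in [("A", report_a), ("B", report_b), ("bad", report_bad)]:
    print(name, report)
```

Problem B is the case described above: the constraint is active at the optimum, yet its multiplier is zero because the unconstrained minimiser already sits on the boundary.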
[example: Projection Onto The Simplex]
Let $y\in\mathbb R^n$. To project $y$ onto
\begin{align*}
\Delta_n=\{x\in\mathbb R^n:x_i\ge 0,\ \sum_{i=1}^n x_i=1\},
\end{align*}
we minimise
\begin{align*}
\frac12 |x-y|^2=\frac12\sum_{i=1}^n (x_i-y_i)^2
\end{align*}
subject to $-x_i\le 0$ for $1\le i\le n$ and $\sum_{i=1}^n x_i=1$. With multipliers $\nu_i\ge 0$ for the inequalities $-x_i\le 0$ and $\rho\in\mathbb R$ for the equality constraint, the Lagrangian is
\begin{align*}
L(x,\nu,\rho)=\frac12\sum_{i=1}^n (x_i-y_i)^2+\sum_{i=1}^n \nu_i(-x_i)+\rho\left(\sum_{i=1}^n x_i-1\right).
\end{align*}
Stationarity in the $i$th coordinate gives
\begin{align*}
0=\frac{\partial L}{\partial x_i}(x,\nu,\rho)=(x_i-y_i)-\nu_i+\rho,
\end{align*}
so
\begin{align*}
x_i=y_i+\nu_i-\rho.
\end{align*}
The remaining KKT conditions are
\begin{align*}
x_i\ge 0,\qquad \nu_i\ge 0,\qquad \nu_i(-x_i)=0,\qquad \sum_{i=1}^n x_i=1.
\end{align*}
Since $\nu_i(-x_i)=0$ is equivalent to $\nu_i x_i=0$, each coordinate has two possible cases. If $x_i>0$, then complementary slackness gives $\nu_i=0$, and stationarity gives
\begin{align*}
x_i=y_i-\rho>0.
\end{align*}
If $x_i=0$, then stationarity gives
\begin{align*}
0=y_i+\nu_i-\rho,
\end{align*}
hence
\begin{align*}
\nu_i=\rho-y_i.
\end{align*}
Dual feasibility $\nu_i\ge 0$ then gives $y_i-\rho\le 0$. Therefore every coordinate satisfies
\begin{align*}
x_i=\max\{y_i-\rho,0\}=(y_i-\rho)_+.
\end{align*}
Finally, the equality constraint determines $\rho$ by
\begin{align*}
\sum_{i=1}^n (y_i-\rho)_+=1.
\end{align*}
Thus projection onto the simplex is obtained by lowering all coordinates of $y$ by the same threshold $\rho$ and clipping the negative results to zero.
[/example]
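The threshold equation $\sum_i (y_i-\rho)_+=1$ can be solved exactly by sorting. The sketch below uses the standard sort-based search for $\rho$, which is one common way to solve this equation and is not stated in the example itself; the function name is hypothetical.

```python
def project_onto_simplex(y):
    """Euclidean projection of y onto the probability simplex, x_i = (y_i - rho)_+."""
    u = sorted(y, reverse=True)
    cumulative = 0.0
    rho = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        # Threshold that would make the j largest coordinates sum to one.
        candidate = (cumulative - 1.0) / j
        # Keep the candidate while the j-th largest coordinate stays positive.
        if u_j - candidate > 0.0:
            rho = candidate
    x = [max(y_i - rho, 0.0) for y_i in y]
    return x, rho

x, rho = project_onto_simplex([0.5, 1.2, -0.3])
print("rho =", rho)
print("x   =", x)

# A point already in the simplex should be returned unchanged (up to rounding).
x_same, rho_same = project_onto_simplex([0.2, 0.3, 0.5])
print("x_same =", x_same)
```

The coordinates that end up at zero are exactly those with $y_i\le\rho$, matching the case analysis from complementary slackness.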
This example shows the computational meaning of KKT systems: the multipliers encode which inequalities are active, and stationarity gives the formula once that active set is known. The next issue is whether the KKT equations are merely necessary for optima or also sufficient.
## Convex Sufficiency and Constraint Qualification
When do the KKT conditions certify global optimality? Convexity is the decisive assumption for sufficiency: a stationary point of the Lagrangian is a global minimiser of the Lagrangian, and feasibility plus complementary slackness transfers that global comparison back to the original objective.
[quotetheorem:2547]
[citeproof:2547]
Each KKT hypothesis has a separate role in the sufficiency proof. Stationarity and convexity say that $x^*$ minimises $L(\cdot,\lambda^*,\nu^*)$; dual feasibility keeps the Lagrangian below the objective on the feasible set; primal feasibility permits comparison with the original problem; and complementary slackness removes the constraint terms at $x^*$ so the Lagrangian value equals $f_0(x^*)$. Without convexity, KKT points can be stationary but nonoptimal: for the unconstrained problem of minimising $x^4-x^2$ over $\mathbb R$, the point $x=0$ satisfies stationarity, but the global minima occur at $x=\pm 1/\sqrt{2}$. The sufficiency theorem also does not assert uniqueness of the primal optimum or uniqueness of multipliers; flat objectives and redundant constraints can leave many certificates for the same value. Necessity is subtler: an optimum may fail to have supporting multipliers if the feasible set has a degenerate boundary, so constraint qualifications rule out that degeneracy by ensuring that the first-order description of feasible directions is rich enough.
[definition: Slater Condition]
For a convex programme with inequality constraints $f_i(x)\le 0$ and affine equality constraints $Ax=b$, Slater's condition holds if there exists $\bar{x}\in \mathbb R^n$ such that
\begin{align*}
f_i(\bar{x})<0 \quad (1\le i\le m), \qquad A\bar{x}=b.
\end{align*}
[/definition]
Slater's condition asks for a strictly feasible point relative to the affine equality set. In the finite-dimensional, finite-valued convex setting used in this chapter, it is the standard constraint qualification because it gives strong duality and existence of optimal multipliers.
[quotetheorem:6689]
[citeproof:6689]
The theorem is the main bridge between geometry and computation in the course. Slater's condition is not a cosmetic assumption: for example, minimizing $x$ subject to $x^2\le 0$ has the unique optimum $x^*=0$, but stationarity would require $1+2\nu x^*=0$, which is impossible for every $\nu\ge 0$. Thus the KKT conditions need not be necessary when the feasible set has no strictly feasible point. The theorem also does not guarantee unique multipliers; redundant inequalities may produce several multiplier vectors for the same primal solution. The next applied example should therefore be read as a well-posed case where the KKT equations do more than certify optimality: they also expose the structure of the optimizer.
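As a brief aside on the counterexample above, a direct computation shows where the multiplier goes missing. This is a sketch that uses only the Lagrangian $L(x,\nu)=x+\nu x^2$ of that problem; the symbol $g$ for its dual function is introduced here for illustration. For $\nu>0$ the Lagrangian is minimized at $x=-1/(2\nu)$, so
\begin{align*}
g(\nu)=\inf_{x\in\mathbb R}\{x+\nu x^2\}=-\frac{1}{4\nu} \quad (\nu>0), \qquad g(0)=-\infty.
\end{align*}
Hence $\sup_{\nu\ge 0} g(\nu)=0$ equals the primal value, but the supremum is only approached as $\nu\to\infty$ and is attained by no finite multiplier.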
[example: Soft Margin Support Vector Machine Primal Dual Pair]
Let $N\in\mathbb N$, $C>0$, and labelled data $(a_i,b_i)\in \mathbb R^d\times\{-1,1\}$ be given for $1\le i\le N$. The primal soft-margin support vector machine is
\begin{align*}
\min_{w\in\mathbb R^d,\,\beta\in\mathbb R,\,\xi\in\mathbb R^N}\frac12 |w|^2+C\sum_{i=1}^N \xi_i
\end{align*}
subject to
\begin{align*}
1-\xi_i-b_i(w\cdot a_i+\beta)\le 0 \quad \text{and} \quad -\xi_i\le 0 \qquad (1\le i\le N).
\end{align*}
Assign multipliers $\alpha_i\ge 0$ to the margin constraints and $\mu_i\ge 0$ to the nonnegativity constraints. The Lagrangian is
\begin{align*}
L(w,\beta,\xi,\alpha,\mu)=\frac12 |w|^2+C\sum_{i=1}^N \xi_i+\sum_{i=1}^N \alpha_i\bigl(1-\xi_i-b_i(w\cdot a_i+\beta)\bigr)+\sum_{i=1}^N \mu_i(-\xi_i).
\end{align*}
Expanding the terms involving $w$, $\beta$, and $\xi$ gives
\begin{align*}
L(w,\beta,\xi,\alpha,\mu)=\frac12 |w|^2-\left(\sum_{i=1}^N \alpha_i b_i a_i\right)\cdot w-\beta\sum_{i=1}^N \alpha_i b_i+\sum_{i=1}^N \alpha_i+\sum_{i=1}^N(C-\alpha_i-\mu_i)\xi_i.
\end{align*}
Stationarity in $w$ gives
\begin{align*}
0=\nabla_w L=w-\sum_{i=1}^N \alpha_i b_i a_i,
\end{align*}
so
\begin{align*}
w=\sum_{i=1}^N \alpha_i b_i a_i.
\end{align*}
Stationarity in $\beta$ gives
\begin{align*}
0=\frac{\partial L}{\partial \beta}=-\sum_{i=1}^N \alpha_i b_i,
\end{align*}
hence
\begin{align*}
\sum_{i=1}^N \alpha_i b_i=0.
\end{align*}
For each slack variable $\xi_i$, stationarity gives
\begin{align*}
0=\frac{\partial L}{\partial \xi_i}=C-\alpha_i-\mu_i.
\end{align*}
Thus
\begin{align*}
\mu_i=C-\alpha_i \qquad (1\le i\le N).
\end{align*}
Since $\alpha_i\ge 0$ and $\mu_i\ge 0$, this is equivalent to
\begin{align*}
0\le \alpha_i\le C \qquad (1\le i\le N).
\end{align*}
Under the stationarity constraints, the $\beta$ term vanishes because $\sum_i\alpha_i b_i=0$, and the $\xi_i$ terms vanish because $C-\alpha_i-\mu_i=0$. Therefore the dual function is obtained by minimizing
\begin{align*}
\frac12 |w|^2-\left(\sum_{i=1}^N \alpha_i b_i a_i\right)\cdot w+\sum_{i=1}^N \alpha_i
\end{align*}
over $w$. At the minimizing $w=\sum_i\alpha_i b_i a_i$, the dot product term is
\begin{align*}
\left(\sum_{i=1}^N \alpha_i b_i a_i\right)\cdot w=w\cdot w=|w|^2.
\end{align*}
Hence
\begin{align*}
g(\alpha,\mu)=\sum_{i=1}^N \alpha_i-\frac12 |w|^2.
\end{align*}
Using $w=\sum_i\alpha_i b_i a_i$, we compute
\begin{align*}
|w|^2=\left(\sum_{i=1}^N \alpha_i b_i a_i\right)\cdot\left(\sum_{j=1}^N \alpha_j b_j a_j\right).
\end{align*}
Expanding the dot product gives
\begin{align*}
|w|^2=\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_j b_i b_j(a_i\cdot a_j).
\end{align*}
Therefore the dual problem is
\begin{align*}
\max_{\alpha\in\mathbb R^N}\sum_{i=1}^N \alpha_i-\frac12\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_j b_i b_j(a_i\cdot a_j)
\end{align*}
subject to
\begin{align*}
0\le \alpha_i\le C \qquad (1\le i\le N), \qquad \sum_{i=1}^N \alpha_i b_i=0.
\end{align*}
Complementary slackness also gives
\begin{align*}
\alpha_i\bigl(1-\xi_i-b_i(w\cdot a_i+\beta)\bigr)=0 \qquad (1\le i\le N)
\end{align*}
and
\begin{align*}
\mu_i\xi_i=0 \qquad (1\le i\le N).
\end{align*}
Thus any data point with $\alpha_i>0$ has its margin constraint active, and the normal vector of the separating hyperplane is built only from the positively weighted training vectors through $w=\sum_i \alpha_i b_i a_i$.
[/example]
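The KKT system derived in this example can be checked by hand on a very small instance. The following sketch uses a hypothetical two-point data set in one dimension ($a_1=1$, $b_1=1$, $a_2=-1$, $b_2=-1$) and a candidate primal-dual point worked out from the stationarity equations; the data, the value of $C$, and all variable names are assumptions made for illustration.

```python
# Hypothetical two-point data set in one dimension (d = 1), chosen for illustration.
a = [1.0, -1.0]      # feature values a_i
b = [1.0, -1.0]      # labels b_i
C = 1.0              # any C >= 1/2 admits the multipliers used below

# Candidate primal-dual point, found by hand from the KKT system.
w, beta = 1.0, 0.0
xi = [0.0, 0.0]
alpha = [0.5, 0.5]
mu = [C - al for al in alpha]          # stationarity in xi: mu_i = C - alpha_i


def margin_residual(i):
    """Value of the margin constraint 1 - xi_i - b_i (w a_i + beta), required <= 0."""
    return 1.0 - xi[i] - b[i] * (w * a[i] + beta)


n = len(a)
kkt = {
    "stationarity_w": abs(w - sum(alpha[i] * b[i] * a[i] for i in range(n))) < 1e-12,
    "stationarity_beta": abs(sum(alpha[i] * b[i] for i in range(n))) < 1e-12,
    "primal_feasible": all(margin_residual(i) <= 1e-12 and xi[i] >= 0 for i in range(n)),
    "dual_feasible": all(0.0 <= alpha[i] <= C and mu[i] >= 0 for i in range(n)),
    "comp_slack": all(abs(alpha[i] * margin_residual(i)) < 1e-12
                      and abs(mu[i] * xi[i]) < 1e-12 for i in range(n)),
}

# Primal objective and the dual objective from the derivation above.
primal_value = 0.5 * w * w + C * sum(xi)
dual_value = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * b[i] * b[j] * a[i] * a[j]
    for i in range(n) for j in range(n)
)
print(kkt, primal_value, dual_value)
```

All five KKT checks hold for this candidate and the two objective values coincide, which is the zero-gap certificate the sufficiency theorem asks for. Both points are support vectors here, since both multipliers are positive.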
This derivation is representative of applied convex optimisation: KKT conditions identify the dual variables, eliminate primal variables, and expose sparsity through complementary slackness. If $\alpha_i>0$, then the corresponding margin constraint binds, so these positively weighted data points are support vectors and enter the classifier through $w=\sum_i\alpha_i b_i a_i$. The converse can fail in degenerate cases: a margin constraint may be active while its multiplier is zero, for instance when other active constraints already determine the same separating hyperplane.
## Shadow Prices and Perturbed Problems
What do optimal multipliers measure? To answer this, we let the right-hand sides of the constraints move and study the resulting optimal value. The multiplier attached to a constraint then records the first-order change in value caused by tightening or relaxing that constraint.
For $u\in\mathbb R^m$ and $v\in\mathbb R^p$, define the perturbed value function $p:\mathbb R^m\times\mathbb R^p\to \mathbb R\cup\{+\infty,-\infty\}$ by
\begin{align*}
p(u,v)=\inf_x \{f_0(x): f_i(x)\le u_i \ (1\le i\le m),\ Ax-b=v\}.
\end{align*}
The convention is that $p(u,v)=+\infty$ when the perturbed problem is infeasible and $p(u,v)=-\infty$ when it is feasible but unbounded below. The original value is $p(0,0)$. A negative $u_i$ tightens the $i$th inequality, while a positive $u_i$ relaxes it.
[quotetheorem:6690]
[citeproof:6690]
Thus $-\nu_i^*$ is a subgradient component of the value function with respect to the right-hand side $u_i$, while $-\rho_j^*$ is the corresponding component for the equality perturbation $v_j$. The hypotheses in the theorem are doing separate jobs. Primal-dual optimality is needed to anchor the lower support at the actual value $p(0,0)$: for the unconstrained problem $\min_x x^2$, the constant lower bound $-1$ is valid but does not describe sensitivity of the optimum value $0$. Finiteness of $p(u,v)$ rules out perturbations whose feasible set is empty or whose objective value is $-\infty$; for example, the constraint $x\le u$ with objective $-x$ has finite value $-u$, while deleting the effective upper bound would make the value $-\infty$ and remove any finite shadow-price interpretation. Dual feasibility, namely $\nu^*\ge 0$, is what permits the monotonicity step from $f_i(x)\le u_i$ to $(\nu^*)^\top f(x)\le(\nu^*)^\top u$; with the feasible problem $\min 0$ subject to $x\le 0$, choosing the invalid multiplier $\nu=-1$ would predict $p(u)\ge u$, which fails for every $u>0$ because $p(u)=0$.
The inequality is only a supporting lower bound unless the value function is differentiable at the perturbation being studied. Equality constraints differ from inequalities because their multipliers have no sign restriction: moving $v_j$ in one direction may raise the value while moving it in the opposite direction may lower it. A simple failure of exact first-order prediction occurs for $p(u)=\max\{0,-u\}$, the value obtained by minimizing $x$ subject to $-x\le u$ and $x\ge 0$; at $u=0$ the supporting slopes form the interval $[-1,0]$, so no single multiplier gives a two-sided derivative. The next example is a smooth regime where the supporting-price picture becomes a usable allocation rule.
[example: Water Filling Allocation]
Let $a_i>0$ for $1\le i\le n$ and let $P\ge 0$. Maximizing $\sum_i \log(a_i+x_i)$ over $x_i\ge 0$ and $\sum_i x_i\le P$ is equivalent to minimizing
\begin{align*}
-\sum_i \log(a_i+x_i)
\end{align*}
with inequality constraints $-x_i\le 0$ and $\sum_i x_i-P\le 0$. With multipliers $\mu_i\ge 0$ for $-x_i\le 0$ and $\lambda\ge 0$ for $\sum_i x_i-P\le 0$, the Lagrangian is
\begin{align*}
L(x,\mu,\lambda)
=
-\sum_i \log(a_i+x_i)
+\sum_i \mu_i(-x_i)
+\lambda\left(\sum_i x_i-P\right).
\end{align*}
Stationarity in the $i$th coordinate gives
\begin{align*}
0
=
\frac{\partial L}{\partial x_i}
=
-\frac{1}{a_i+x_i}-\mu_i+\lambda,
\end{align*}
so
\begin{align*}
\lambda=\frac{1}{a_i+x_i}+\mu_i.
\end{align*}
The complementary slackness equations are
\begin{align*}
\mu_i x_i=0 \qquad \text{and} \qquad \lambda\left(\sum_i x_i-P\right)=0.
\end{align*}
For each coordinate there are two cases. If $x_i>0$, then $\mu_i x_i=0$ implies $\mu_i=0$, and stationarity becomes
\begin{align*}
\lambda=\frac{1}{a_i+x_i}.
\end{align*}
Since $a_i+x_i>0$, this gives $\lambda>0$ and
\begin{align*}
x_i=\frac{1}{\lambda}-a_i.
\end{align*}
The condition $x_i>0$ is exactly $\frac{1}{\lambda}-a_i>0$. If $x_i=0$, then stationarity gives
\begin{align*}
0=-\frac{1}{a_i}-\mu_i+\lambda,
\end{align*}
hence
\begin{align*}
\mu_i=\lambda-\frac{1}{a_i}.
\end{align*}
Dual feasibility $\mu_i\ge 0$ gives $\lambda\ge \frac{1}{a_i}$, which is equivalent to $\frac{1}{\lambda}-a_i\le 0$ when $\lambda>0$. Therefore every coordinate satisfies
\begin{align*}
x_i=\left(\frac{1}{\lambda}-a_i\right)_+.
\end{align*}
When $P>0$, the budget constraint binds. Indeed, if $\sum_i x_i<P$, then for any coordinate $k$ and sufficiently small $\varepsilon>0$, the point $x+\varepsilon e_k$ remains feasible and changes the minimization objective by
\begin{align*}
-\log(a_k+x_k+\varepsilon)+\log(a_k+x_k)<0,
\end{align*}
because $\log$ is strictly increasing. Thus an optimum cannot have unused budget, so
\begin{align*}
\sum_i \left(\frac{1}{\lambda}-a_i\right)_+=P.
\end{align*}
For $P=0$, the formula gives $x_i=0$ for all $i$ by choosing any $\lambda\ge \max_i 1/a_i$.
With the minimization convention, the perturbation is the right-hand side $P$ in the constraint $\sum_i x_i-P\le 0$, so the multiplier sensitivity formula from this section gives
\begin{align*}
p'(P)=-\lambda
\end{align*}
at differentiability points. Returning to the original maximization problem reverses the sign of the value, so the marginal gain from increasing the total power budget is $\lambda$.
[/example]
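The budget equation $\sum_i (1/\lambda-a_i)_+=P$ has no closed form in general, but its left side is nondecreasing in the water level $t=1/\lambda$, so bisection applies. The following is a minimal sketch under the assumption $P>0$; the function name `water_fill`, the tolerance, and the sample data are illustrative choices rather than part of the example.

```python
def water_fill(a, P, tol=1e-12):
    """Return (x, lam) with x_i = (1/lam - a_i)_+ and sum(x) = P, for a_i > 0, P > 0."""
    if P <= 0:
        raise ValueError("this sketch assumes P > 0 so that the budget binds")
    # The water level t = 1/lam lies between min(a) and max(a) + P:
    # at t = min(a) nothing is allocated, at t = max(a) + P at least P is allocated.
    lo, hi = min(a), max(a) + P
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        used = sum(max(t - ai, 0.0) for ai in a)
        if used > P:
            hi = t      # level too high: allocation exceeds the budget
        else:
            lo = t      # level too low (or exact): raise it
    t = 0.5 * (lo + hi)
    x = [max(t - ai, 0.0) for ai in a]
    return x, 1.0 / t


x, lam = water_fill([1.0, 2.0, 5.0], P=3.0)
print(x, lam)
```

For this sample the water level is $3$, so the allocation is approximately $(2,1,0)$ and the price is approximately $1/3$: the channel with $a_i=5$ lies above the water level and receives nothing.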
The water-filling formula makes the economic interpretation concrete. Channels with large $a_i$ may receive no power because their nonnegativity constraints bind, while active channels equalize the marginal utility $1/(a_i+x_i)$ at the common price $\lambda$.
## Envelope Formulas and Danskin's Theorem
When the value function is defined as an infimum or supremum over decisions, differentiating it directly is delicate because the optimizer may change with the parameter. Envelope theorems state that, under convexity and compactness hypotheses, the derivative of the value can be computed by differentiating the objective at the active optimizer and ignoring the derivative of the optimizer itself.
[quotetheorem:6691]
[citeproof:6691]
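The directional-derivative conclusion can be checked numerically on a small case. The sketch below assumes the formula takes the form $F'(y;h)=\max\{\partial_y\varphi(x,y)\,h : x \text{ active at } y\}$ for $F(y)=\sup_{x\in X}\varphi(x,y)$; the finite set $X=\{-1,1\}$, the objective $\varphi(x,y)=xy$, and all function names are illustrative choices, not notation from the theorem.

```python
X = [-1.0, 1.0]                      # compact (finite) decision set


def phi(x, y):
    return x * y                     # objective, linear in the parameter y


def F(y):
    return max(phi(x, y) for x in X)  # value function; here F(y) = |y|


def danskin_directional(y, h, tol=1e-12):
    """Max over active maximizers x of (d phi / d y)(x, y) * h; here d phi / d y = x."""
    value = F(y)
    active = [x for x in X if value - phi(x, y) <= tol]
    return max(x * h for x in active)


def one_sided_difference(y, h, t=1e-6):
    """Forward difference quotient approximating the directional derivative."""
    return (F(y + t * h) - F(y)) / t


for y, h in [(0.0, 1.0), (0.0, -1.0), (2.0, 1.0), (-2.0, 1.0)]:
    print(y, h, danskin_directional(y, h), one_sided_difference(y, h))
```

At $y=0$ both decisions are active and the directional derivative is $|h|$, which is not linear in $h$; away from $0$ the active set is a single point and the formula reduces to an ordinary derivative.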
Danskin's theorem explains why KKT multipliers appear in sensitivity formulas: the dual function is an infimum over $x$, and perturbations enter the Lagrangian linearly. The compactness assumption prevents maximizing sequences from escaping; for instance, $\sup_{x\in\mathbb R} (yx-x^2)$ has a well-behaved maximizer, while $\sup_{x\in\mathbb R} yx$ is infinite for $y\ne 0$ and has no local envelope formula. Continuity is what lets active maximizers at nearby parameters converge to active maximizers at the base point, and convexity is what turns active gradients into supporting slopes. Under the hypotheses stated here the directional derivative formula is the main conclusion; equality between the whole subdifferential and the convex hull may require stronger local regularity assumptions that rule out additional limiting supporting slopes. Uniqueness collapses the displayed directional derivative to a linear function of $h$; without uniqueness, such as $F(y)=|y|=\sup_{x\in\{-1,1\}}xy$, the value has several supporting slopes at $y=0$. To turn the shadow-price inequality into an equality formula, we need differentiability so that the multiplier subgradient set collapses to a single vector.
[quotetheorem:6692]
[citeproof:6692]
The formula should be read with the sign convention in mind. Increasing the right-hand side of an inequality relaxes a minimization problem, so the derivative is $-\nu_i^*$ rather than $\nu_i^*$. Each hypothesis has a specific limitation. If the KKT theorem does not apply, optimal multipliers may fail to exist, as in minimizing $x$ subject to $x^2\le 0$ at $x=0$, and then there is no multiplier vector from which to read a derivative. If the perturbed value is not finite in a neighbourhood, the ordinary gradient formula has no meaning there: for $\min 0$ subject to $x^2\le u$, the value is finite for $u\ge 0$ and infeasible for $u<0$. If convexity of $p$ is unavailable, a supporting affine lower bound need not characterize a subgradient of the value function; the scalar function $p(u)=-u^2$ has a tangent line at $0$, but that tangent is not a global convex support.
Differentiability is essential: when $p$ has a corner, the multipliers describe a subdifferential rather than one gradient. For instance, the one-dimensional value $p(u)=\max\{0,-u\}$ has left and right derivatives that differ at $u=0$. Uniqueness of multipliers is a sufficient way to avoid ambiguity, but it is stronger than the displayed conclusion once differentiability is already assumed; if several optimal multipliers exist, differentiability forces all their vectors $(-\nu,-\rho)$ to coincide, while nondifferentiability allows a whole interval or polytope of shadow prices. The following one-resource example is the scalar version of this rule.
[example: Sensitivity Of A Single Resource Constraint]
Consider the value function at a fixed resource level $P$:
\begin{align*}
p(P)=\min_{x\in\mathbb R^n}\{f(x):g(x)\le P\}.
\end{align*}
Equivalently, write the single inequality as $g(x)-P\le 0$. Suppose $x^*$ is optimal at this value of $P$ and the corresponding KKT multiplier is the unique number $\lambda^*\ge 0$. The Lagrangian is
\begin{align*}
L(x,\lambda)=f(x)+\lambda(g(x)-P).
\end{align*}
The complementary slackness equation is
\begin{align*}
\lambda^*(g(x^*)-P)=0.
\end{align*}
To compare nearby resource levels, define
\begin{align*}
q(u)=p(P+u)=\min_x\{f(x):g(x)\le P+u\}.
\end{align*}
Since $g(x)\le P+u$ is the same as
\begin{align*}
g(x)-P\le u,
\end{align*}
the perturbation $u$ is exactly the right-hand-side perturbation for the constraint $g(x)-P\le 0$. By the differentiable perturbation formula just stated, at differentiability points,
\begin{align*}
q'(0)=-\lambda^*.
\end{align*}
Because $q(u)=p(P+u)$, differentiating at $u=0$ gives
\begin{align*}
q'(0)=p'(P).
\end{align*}
Therefore
\begin{align*}
p'(P)=-\lambda^*.
\end{align*}
Thus increasing the resource limit by a small amount lowers the optimal minimization value at first-order rate $\lambda^*$. If the constraint is slack, so $g(x^*)<P$, then complementary slackness forces $\lambda^*=0$, and hence
\begin{align*}
p'(P)=0.
\end{align*}
[/example]
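The rule $p'(P)=-\lambda^*$ can be tested on an instance where everything is available in closed form. The sketch below takes the hypothetical choices $f(x)=(x-2)^2$ and $g(x)=x$ in one variable, so the optimizer is $x^*=\min\{P,2\}$; the function names and step size are assumptions for illustration.

```python
def p(P):
    """Optimal value of min (x - 2)^2 subject to x <= P (closed form)."""
    x_star = min(P, 2.0)             # projection of the unconstrained minimizer onto x <= P
    return (x_star - 2.0) ** 2


def multiplier(P):
    """KKT multiplier from stationarity 2 (x* - 2) + lam = 0."""
    x_star = min(P, 2.0)
    return -2.0 * (x_star - 2.0)


def central_difference(P, h=1e-6):
    """Symmetric difference quotient approximating p'(P)."""
    return (p(P + h) - p(P - h)) / (2.0 * h)


for P in (1.0, 3.0):
    print(P, multiplier(P), central_difference(P))
```

At $P=1$ the constraint binds, the multiplier is $2$, and the difference quotient is approximately $-2$. At $P=3$ the constraint is slack, so both the multiplier and the derivative vanish.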
KKT theory therefore has two roles in the course. It is an optimality certificate for convex programmes, and it is also a calculus for how optimal values respond to changes in data. Later algorithmic chapters use the same equations as stopping criteria, active-set descriptions, and interpretations of dual iterates.
# 6. Fenchel Duality and Infimal Convolution
This chapter turns Chapter 1's separation theory and Chapter 2's conjugacy theory into a systematic duality calculus. It assumes the preceding material on convex sets, relative interiors, separating hyperplanes, convex conjugates, subgradients, and closed proper convex functions. The guiding question is: when can a constrained or composite convex problem be replaced by an equivalent maximisation problem whose variables certify optimality? Fenchel duality gives the template, while subdifferential calculus explains how to read optimality conditions from the same geometry. The final section connects this theory to proximal maps, polar gauges, and dual certificates used in statistical and imaging examples.
## Fenchel Inequality and Fenchel Dual Problems
The first problem is to convert a convex minimisation problem into a dual maximisation problem without guessing Lagrange multipliers by hand. Conjugacy supplies the basic inequality, and equality in that inequality is exactly the subgradient relation introduced earlier in the course.
[quotetheorem:6676]
[citeproof:6676]
Fenchel-Young itself needs only properness for the inequality and the equality-subgradient equivalence. The theorem does not assert that a subgradient exists at every $x$; it only identifies what happens when the Fenchel inequality has equality. Boundary points of domains may still have empty subdifferential, and pairs for which either side is infinite carry no finite optimality information. This result turns every available subgradient into a certificate of equality between a primal expression and a conjugate expression. Closedness enters later, when the biconjugate is used to recover the original function rather than its closed convex envelope.
To use that certificate for optimisation, we need a standard primal-dual template rather than a separate derivation for each example. The composite form below isolates two roles: $f$ controls the native variable $x$, while $g$ controls the transformed quantity $Ax$. This separation is what lets two Fenchel inequalities be added so that the coupling terms cancel.
[definition: Fenchel Primal Problem]
Let $f: \mathbb R^n\to (-\infty,\infty]$ and $g: \mathbb R^m\to (-\infty,\infty]$ be proper convex functions, and let $A:\mathbb R^n\to \mathbb R^m$ be linear. The Fenchel primal problem is
\begin{align*}
\inf_{x\in \mathbb R^n}\{f(x)+g(Ax)\}.
\end{align*}
[/definition]
The primal problem separates the part of the objective native to $x$ from the part that sees $x$ only through $Ax$. The companion problem introduces a vector $y$ in $\mathbb R^m$, the codomain paired with the constraint image through the Euclidean inner product. The sign convention below is chosen so that optimality becomes a pair of subgradient inclusions.
[definition: Fenchel Dual Problem]
For the Fenchel primal problem determined by $f$, $g$, and $A$, the Fenchel dual problem is
\begin{align*}
\sup_{y\in \mathbb R^m}\{-f^*(-A^\top y)-g^*(y)\}.
\end{align*}
[/definition]
The dual problem should never overestimate the primal minimum. If $x$ is primal feasible and $y$ is any dual variable, Fenchel's inequality gives
\begin{align*}
f(x)+f^*(-A^\top y)\ge -y\cdot Ax,
\qquad
g(Ax)+g^*(y)\ge y\cdot Ax.
\end{align*}
Adding the two inequalities cancels the coupling terms and yields
\begin{align*}
f(x)+g(Ax)\ge -f^*(-A^\top y)-g^*(y).
\end{align*}
Thus every dual objective value is a lower bound on every primal objective value.
Weak duality has no qualification hypothesis because it is only an inequality obtained by adding two pointwise inequalities. That strength is also its limitation: it may give a strict lower bound even when both primal and dual problems are feasible. A standard pathology is a convex problem whose feasible set touches the boundary of another convex domain without meeting its relative interior; separation can then produce a limiting certificate but no attaining dual multiplier. Thus weak duality gives lower bounds, but optimisation needs a way to recognise when the bound has no slack.
The equality case asks for simultaneous equality in both Fenchel inequalities, not merely equality after their sum is formed. This matters because each equality has its own subgradient interpretation: one lives at $x^*$ for $f$, and the other lives at $Ax^*$ for $g$. The natural next result packages these two equality conditions into a primal-dual optimality system.
[quotetheorem:6693]
[citeproof:6693]
The closedness assumptions keep the subdifferentials tied to the actual objectives rather than to lower-semicontinuous envelopes. The theorem does not prove that optimal pairs exist, nor does it rule out a positive duality gap; it only says that, if a zero-gap pair is present, the pair is recognised exactly by the two inclusions. When the regularity hypotheses fail, the inclusions may have no solution even though the primal infimum is finite, which is why dual attainment cannot be treated as automatic. The optimality system is useful only when a zero duality gap and dual certificates exist.
Weak duality and the optimality conditions are conditional statements: they explain certificates once the right certificate exists. The next theorem supplies the finite-dimensional regularity condition that makes existence of a non-abnormal supporting hyperplane part of the theory rather than an extra assumption. It is the point where relative interiors enter duality as a practical replacement for differentiability or strict feasibility.
[quotetheorem:6694]
[citeproof:6694]
The relative-interior condition is the finite-dimensional substitute for a constraint qualification. It fails, for instance, when $A(\operatorname{dom} f)$ only touches $\operatorname{dom} g$ at a boundary point; then feasible points may exist, but the separation argument can support the value function only by a vertical hyperplane, producing a gap or a missing multiplier. The theorem also does not say that the primal infimum is attained, so compactness or coercivity must still be checked separately when an actual minimiser is needed. In applications, the condition is usually checked by finding a feasible point lying in the relative interiors of the two domains, as in regularised least squares.
[example: Lasso Dual]
Let $X\in\mathbb R^{m\times n}$, $b\in\mathbb R^m$, and $\lambda>0$. Write the lasso objective in Fenchel form with $f(w)=\lambda |w|_1$, $g(z)=\frac12|z-b|^2$, and $A=X$.
For $s\in\mathbb R^n$, the conjugate of $f$ is
\begin{align*}
f^*(s)=\sup_{w\in\mathbb R^n}\{s\cdot w-\lambda |w|_1\}=\sup_{w\in\mathbb R^n}\sum_{j=1}^n(s_jw_j-\lambda |w_j|).
\end{align*}
If $|s|_\infty\le\lambda$, then $s_jw_j\le |s_j||w_j|\le\lambda |w_j|$ for every coordinate $j$, so $s\cdot w-\lambda |w|_1\le0$ for every $w$, with equality at $w=0$. If $|s_k|>\lambda$ for some $k$, take $w_k=t\operatorname{sgn}(s_k)$ and $w_j=0$ for $j\ne k$. Then
\begin{align*}
s\cdot w-\lambda |w|_1=t(|s_k|-\lambda),
\end{align*}
which tends to $+\infty$ as $t\to+\infty$. Hence
\begin{align*}
f^*(s)=\delta_{\{|s|_\infty\le\lambda\}}(s).
\end{align*}
For $y\in\mathbb R^m$, put $r=z-b$. Then
\begin{align*}
g^*(y)=\sup_{z\in\mathbb R^m}\left\{y\cdot z-\frac12|z-b|^2\right\}=b\cdot y+\sup_{r\in\mathbb R^m}\left\{y\cdot r-\frac12|r|^2\right\}.
\end{align*}
Completing the square gives
\begin{align*}
y\cdot r-\frac12|r|^2=\frac12|y|^2-\frac12|r-y|^2.
\end{align*}
The last term is maximised when $r=y$, so
\begin{align*}
g^*(y)=b\cdot y+\frac12|y|^2.
\end{align*}
Substituting these conjugates into the Fenchel dual formula gives
\begin{align*}
\sup_{y\in\mathbb R^m}\{-f^*(-X^\top y)-g^*(y)\}=\sup_{y\in\mathbb R^m}\left\{-\delta_{\{|s|_\infty\le\lambda\}}(-X^\top y)-b\cdot y-\frac12|y|^2\right\}.
\end{align*}
The indicator term is finite exactly when $|-X^\top y|_\infty\le\lambda$, equivalently $|X^\top y|_\infty\le\lambda$. Therefore the dual problem is
\begin{align*}
\sup_{y\in\mathbb R^m}\left\{-b\cdot y-\frac12|y|^2: |X^\top y|_\infty\le \lambda\right\}.
\end{align*}
If $w^*$ and $y^*$ satisfy the Fenchel optimality inclusions, then
\begin{align*}
y^*\in\partial g(Xw^*)=\{Xw^*-b\}.
\end{align*}
Thus $y^*=Xw^*-b$. The other inclusion is
\begin{align*}
-X^\top y^*\in\partial(\lambda |\,\cdot\,|_1)(w^*).
\end{align*}
Coordinatewise, this says
\begin{align*}
-(X^\top y^*)_j\in\lambda\,\partial |\,\cdot\,|(w_j^*).
\end{align*}
Since $\partial |\,\cdot\,|(t)=\{\operatorname{sgn}(t)\}$ for $t\ne0$, every active coordinate satisfies, with $X_j$ denoting the $j$th column of $X$,
\begin{align*}
X_j\cdot y^*=(X^\top y^*)_j=-\lambda\operatorname{sgn}(w_j^*).
\end{align*}
When $w_j^*=0$, the same inclusion gives $-(X^\top y^*)_j\in[-\lambda,\lambda]$, hence $|X_j\cdot y^*|\le\lambda$. Thus the residual $y^*=Xw^*-b$ is a dual certificate: it is feasible for the dual constraint, and every active coordinate saturates the $\ell_\infty$ bound.
[/example]
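The primal-dual relations of this example can be verified numerically when the design is the identity, because the lasso then separates by coordinate and is solved by soft-thresholding. The sketch below assumes $X=I$ with hypothetical data $b=(3,\,0.5,\,-2)$ and $\lambda=1$; the helper name `soft_threshold` is illustrative.

```python
def soft_threshold(t, lam):
    """Minimizer of 0.5 * (w - t)**2 + lam * |w| over real w."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0


lam = 1.0
b = [3.0, 0.5, -2.0]

# With X = I the lasso separates by coordinate, so w* is soft-thresholding of b.
w = [soft_threshold(bj, lam) for bj in b]
y = [wj - bj for wj, bj in zip(w, b)]          # residual y* = X w* - b

# Primal objective 0.5 |Xw - b|^2 + lam |w|_1 and dual objective -b.y - 0.5 |y|^2.
primal = 0.5 * sum(yj * yj for yj in y) + lam * sum(abs(wj) for wj in w)
dual = -sum(bj * yj for bj, yj in zip(b, y)) - 0.5 * sum(yj * yj for yj in y)
dual_feasible = max(abs(yj) for yj in y) <= lam + 1e-12   # |X^T y|_inf <= lam with X = I

print(w, y, primal, dual, dual_feasible)
```

The residual is dual feasible and the two objective values agree, so there is no duality gap. The first and third coordinates are active and satisfy $(X^\top y^*)_j=-\lambda\operatorname{sgn}(w_j^*)$, while the second coordinate is zero with $|(X^\top y^*)_j|<\lambda$.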
## Subdifferential Calculus for Composite Convex Functions
The second problem is to turn optimality conditions into usable formulas. Fermat's rule says $0\in\partial F(x^*)$, but for $F=f+g\circ A$ this is useful only if the subdifferential of the composite objective can be expressed in terms of the parts.
[quotetheorem:6695]
[citeproof:6695]
The relative-interior hypothesis prevents boundary contact from creating extra normals that cannot be split between the two summands. Without such a condition, the inclusion $\partial f(x)+\partial g(x)\subset\partial(f+g)(x)$ still holds, but the reverse inclusion can fail because a supporting hyperplane to the sum may arise as a limiting support to the two epigraphs together. The rule therefore does not license arbitrary decomposition of subgradients; it licenses decomposition only under a qualification condition. The sum rule explains how independent convex penalties share a subgradient balance.
Many optimisation models also contain a measurement or constraint map, and then the subgradient must be pulled back from the measurement space to the decision space. The inclusion from right to left follows by composing a supporting inequality with $A$, but the reverse direction asks whether every support seen on the image of $A$ extends to the whole ambient space. The next rule states the relative-interior condition that guarantees this extension.
[quotetheorem:6696]
[citeproof:6696]
The qualification condition says that the image subspace must pass through the relative interior of the effective domain of $g$. If $A(\mathbb R^n)$ lies only along a boundary face of $\operatorname{dom}g$, a support to $g\circ A$ may be visible on the subspace but fail to extend to a genuine subgradient of $g$ at $Ax$. The theorem also does not describe nonlinear precomposition; differentiable nonlinear maps require a separate chain rule with additional hypotheses. Together, the sum and chain rules translate the abstract Fermat condition into the same inclusions that arose from Fenchel equality.
For the composite objective, the point is not only that $0\in\partial(f+g\circ A)(x^*)$, but that this zero subgradient can be split into a part coming from $f$ and a pulled-back part coming from $g$. The next theorem records the resulting multiplier condition with the qualification hypothesis stated explicitly.
[quotetheorem:6697]
[citeproof:6697]
The composite condition identifies the multiplier $y^*$ as the object that balances the subgradient of $f$ against the pullback of the subgradient of $g$. This is stronger than the bare Fermat condition because it separates the two sources of first-order information: one term belongs to the original objective, while the other lives in the measurement space and is transported back by $A^\top$. The relative-interior hypothesis is the reason this split is legitimate; without it, the zero subgradient of the composite objective may exist without a multiplier that extends to a true subgradient of $g$ at $Ax^*$.
The theorem also clarifies what the multiplier does not provide. It does not guarantee uniqueness of $x^*$ or $y^*$, and it does not compute the subdifferentials involved; those must come from separate calculus rules or from the geometry of the functions. In constrained models, $g$ is often an indicator of a set $C$, so the multiplier condition becomes a statement about normals to an inverse-image feasible set. This raises the next geometric question: which normals to $A^{-1}C$ actually come from normals to $C$?
[quotetheorem:6698]
[citeproof:6698]
The formula says that all normals to the feasible set $A^{-1}C$ are pullbacks of normals to $C$. The qualification condition rules out the case where the inverse image only sees $C$ along a boundary face; in such a case the preimage can acquire normals coming from the restricted geometry of $A(\mathbb R^n)$ rather than from genuine normals to $C$ in $\mathbb R^m$. This is the same extension issue as in the chain rule, now expressed for indicator functions. The result is the geometric content behind equality-constrained multipliers, inequality-constrained multipliers, and dual certificates for norm minimisation.
[example: Norm Minimisation and Dual Certificates]
Let $A:\mathbb R^n\to\mathbb R^m$ be linear and let $b\in\operatorname{Range}(A)$, so the feasible set $\{x:Ax=b\}$ is nonempty. We rewrite the constrained problem as
\begin{align*}
\inf\{|x|_1:Ax=b\}=\inf_{x\in\mathbb R^n}\{|x|_1+\delta_{\{b\}}(Ax)\}.
\end{align*}
Thus $f(x)=|x|_1$ and $g(z)=\delta_{\{b\}}(z)$.
For $s\in\mathbb R^n$, the conjugate of $f$ is
\begin{align*}
f^*(s)=\sup_{x\in\mathbb R^n}\{s\cdot x-|x|_1\}=\sup_{x\in\mathbb R^n}\sum_{j=1}^n(s_jx_j-|x_j|).
\end{align*}
If $|s|_\infty\le1$, then $s_jx_j\le |s_j||x_j|\le |x_j|$ for every $j$, so
\begin{align*}
s\cdot x-|x|_1=\sum_{j=1}^n(s_jx_j-|x_j|)\le0.
\end{align*}
Equality is attained at $x=0$, hence $f^*(s)=0$ when $|s|_\infty\le1$. If $|s_k|>1$ for some coordinate $k$, choose $x_k=t\operatorname{sgn}(s_k)$ and $x_j=0$ for $j\ne k$. Then
\begin{align*}
s\cdot x-|x|_1=t(|s_k|-1).
\end{align*}
This tends to $+\infty$ as $t\to+\infty$, so
\begin{align*}
f^*(s)=\delta_{\{|s|_\infty\le1\}}(s).
\end{align*}
For $y\in\mathbb R^m$, the conjugate of $g$ is
\begin{align*}
g^*(y)=\sup_{z\in\mathbb R^m}\{y\cdot z-\delta_{\{b\}}(z)\}=y\cdot b,
\end{align*}
because $z=b$ gives $y\cdot b$ and every $z\ne b$ gives $-\infty$.
Using the Fenchel dual convention from this chapter,
\begin{align*}
\sup_{\eta\in\mathbb R^m}\{-f^*(-A^\top\eta)-g^*(\eta)\}=\sup_{\eta\in\mathbb R^m}\{-\delta_{\{|s|_\infty\le1\}}(-A^\top\eta)-b\cdot\eta\}.
\end{align*}
The indicator term is finite exactly when $|-A^\top\eta|_\infty\le1$, equivalently $|A^\top\eta|_\infty\le1$. Substituting $y=-\eta$ gives the equivalent dual problem
\begin{align*}
\sup_{y\in\mathbb R^m}\{b\cdot y: |A^\top y|_\infty\le1\}.
\end{align*}
Now let $x$ be primal feasible and let $y$ be dual feasible. Since $Ax=b$ and $|A^\top y|_\infty\le1$,
\begin{align*}
b\cdot y=(Ax)\cdot y=x\cdot A^\top y=\sum_{j=1}^n x_j(A^\top y)_j\le\sum_{j=1}^n |x_j|\, |(A^\top y)_j|\le\sum_{j=1}^n |x_j|=|x|_1.
\end{align*}
Thus every dual feasible $y$ gives a lower bound on every feasible primal value.
Suppose $x^*$ is feasible and $A^\top y\in\partial |\,\cdot\,|_1(x^*)$. Coordinatewise,
\begin{align*}
(A^\top y)_j\in\partial |\,\cdot\,|(x_j^*).
\end{align*}
For the absolute value, $\partial |\,\cdot\,|(t)=\{\operatorname{sgn}(t)\}$ when $t\ne0$, and $\partial |\,\cdot\,|(0)=[-1,1]$. Hence $|A^\top y|_\infty\le1$, so $y$ is dual feasible. Also,
\begin{align*}
b\cdot y=(Ax^*)\cdot y=x^*\cdot A^\top y=\sum_{j:x_j^*\ne0}x_j^*\operatorname{sgn}(x_j^*)+\sum_{j:x_j^*=0}0\cdot(A^\top y)_j=\sum_{j:x_j^*\ne0}|x_j^*|=|x^*|_1.
\end{align*}
The dual lower bound equals the feasible primal value, so $x^*$ is optimal and $y$ is a dual certificate. On active coordinates, $(A^\top y)_j=\operatorname{sgn}(x_j^*)$, so $|(A^\top y)_j|=1$; on inactive coordinates, $(A^\top y)_j\in[-1,1]$, and strict inequality on inactive coordinates is the additional separation condition used in uniqueness arguments.
[/example]
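The certificate test at the end of this example is mechanical enough to script. The following sketch checks it on a hypothetical instance with one equation in two unknowns, $x_1+2x_2=2$, using the candidate minimiser $x^*=(0,1)$ and the candidate certificate $y=1/2$; the data and the helper `sign` are assumptions for illustration.

```python
A = [[1.0, 2.0]]          # one equation in two unknowns
b = [2.0]
x_star = [0.0, 1.0]       # candidate minimiser of |x|_1 subject to Ax = b
y = [0.5]                 # candidate dual certificate


def sign(t):
    return (t > 0) - (t < 0)


m, n = len(A), len(A[0])
Ax = [sum(A[i][j] * x_star[j] for j in range(n)) for i in range(m)]
ATy = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]

feasible = all(abs(Ax[i] - b[i]) < 1e-12 for i in range(m))
# Subgradient inclusion: A^T y lies in the subdifferential of the l1 norm at x_star.
in_subdifferential = all(
    abs(ATy[j] - sign(x_star[j])) < 1e-12 if x_star[j] != 0 else abs(ATy[j]) <= 1.0
    for j in range(n)
)
primal_value = sum(abs(xj) for xj in x_star)
dual_value = sum(b[i] * y[i] for i in range(m))
print(feasible, in_subdifferential, primal_value, dual_value)
```

Both checks pass and the primal and dual values are equal, so $x^*$ is optimal for this instance. The inactive coordinate has $|(A^\top y)_1|=1/2<1$, which is the strict inequality mentioned above in connection with uniqueness.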
## Moreau Decomposition, Proximal Geometry, and Polar Gauges
The third problem is to understand duality as geometry rather than only as a value identity. Proximal maps split a point into primal and dual components, while gauges and polars turn norm constraints into support-function constraints.
[definition: Proximal Map]
Let $f:\mathbb R^n\to(-\infty,\infty]$ be proper closed convex and let $\lambda>0$. The proximal map of $\lambda f$ is the map $\operatorname{prox}_{\lambda f}:\mathbb R^n\to\mathbb R^n$ defined by
\begin{align*}
\operatorname{prox}_{\lambda f}(v)=\operatorname*{argmin}_{x\in\mathbb R^n}\left\{f(x)+\frac{1}{2\lambda}|x-v|^2\right\}.
\end{align*}
[/definition]
The quadratic term makes the proximal objective strongly convex, so the minimiser is unique whenever $f$ is proper closed convex. The optimality condition says that the residual $v-\operatorname{prox}_{\lambda f}(v)$ is a scaled subgradient, and this leads to the next theorem relating the residual to the conjugate function.
[quotetheorem:6699]
[citeproof:6699]
Closedness and convexity ensure that the proximal minimiser exists uniquely and that the subgradient relation can be inverted through conjugacy. If $f$ is not closed, the proximal problem may behave as though it had replaced $f$ by its lower-semicontinuous envelope; if $f$ is not convex, the minimiser need not be unique and the displayed expression may not define a map. The theorem is an identity about the two proximal points, not a claim that the two summands are orthogonal in general. Moreau decomposition says that proximal regularisation decomposes a vector into a part governed by $f$ and a part governed by the conjugate $f^*$.
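A small numerical check makes the decomposition concrete. The sketch below assumes the theorem is the standard identity $v=\operatorname{prox}_{\lambda f}(v)+\lambda\operatorname{prox}_{f^*/\lambda}(v/\lambda)$ and specialises it to $f=|\cdot|_1$, whose conjugate is the indicator of the unit $\ell_\infty$ ball; the function names are hypothetical.

```python
from fractions import Fraction

def prox_l1(v, lam):
    """Soft thresholding: the proximal map of lam * |.|_1."""
    return [max(abs(t) - lam, 0) * ((t > 0) - (t < 0)) for t in v]

def project_linf_ball(v):
    """Projection onto {|z|_inf <= 1}; this is the proximal map of the
    indicator of the ball, which is the conjugate of |.|_1."""
    return [min(max(t, -1), 1) for t in v]

def moreau_parts(v, lam):
    """Return the primal part prox_{lam f}(v) and the dual part
    lam * prox_{f*/lam}(v / lam) for f = |.|_1."""
    primal = prox_l1(v, lam)
    dual = [lam * t for t in project_linf_ball([s / lam for s in v])]
    return primal, dual

v = [Fraction(3), Fraction(-1, 2), Fraction(2)]
primal, dual = moreau_parts(v, Fraction(2))
print(primal, dual)
print([a + c for a, c in zip(primal, dual)] == v)   # the two parts add back to v
print(sum(a * c for a, c in zip(primal, dual)))     # inner product of the parts
```

The inner product printed at the end is nonzero for this choice of $v$, which illustrates that the two summands need not be orthogonal.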
For cones, the relevant conjugate object is not a bounded polar set but a cone of directions that support the original cone at the origin. This object records which linear functionals are nonpositive on every feasible direction, so it is exactly the normal cone to $K$ at $0$ when $K$ is closed and convex. The next definition prepares the projection version of Moreau decomposition.
[definition: Polar Cone]
Let $K\subset\mathbb R^n$ be a convex cone. The polar cone of $K$ is
\begin{align*}
K^\circ=\{y\in\mathbb R^n: y\cdot x\le0\text{ for all }x\in K\}.
\end{align*}
[/definition]
The polar cone is the dual obstruction to belonging to $K$: it consists of all directions that have nonpositive inner product with every point of the cone, that is, that make an angle of at least $90^\circ$ with every nonzero point of $K$. This prepares the following cone form of Moreau decomposition, where the proximal maps become Euclidean projections.
[quotetheorem:6700]
[citeproof:6700]
The cone assumptions are essential: closedness gives existence of projections, convexity gives uniqueness, and the cone property turns the support function into the indicator of the polar cone. For a general closed convex set $C$, projection onto $C$ and projection onto $C^\circ$ do not usually add back to $v$, because polarity no longer records a complementary direction cone at every scale. Thus the theorem is not a general projection formula for arbitrary convex sets. For cones, polarity turns projections into orthogonal decompositions.
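The contrast between cones and general convex sets can be seen in the simplest cases. The sketch below uses the nonnegative orthant, whose polar cone is the nonpositive orthant, and the interval $C=[-1,1]\subset\mathbb R$, whose polar set is again $[-1,1]$; the function names are hypothetical.

```python
def project_orthant(v):
    """Euclidean projection onto K = {x : x >= 0}."""
    return [max(t, 0) for t in v]

def project_polar_orthant(v):
    """Euclidean projection onto the polar cone K° = {y : y <= 0}."""
    return [min(t, 0) for t in v]

def project_interval(t, lo=-1, hi=1):
    """Euclidean projection of a scalar onto [lo, hi]."""
    return min(max(t, lo), hi)

v = [3, -2, 0, 5]
pk, pk_polar = project_orthant(v), project_polar_orthant(v)
print([a + c for a, c in zip(pk, pk_polar)] == v)   # the projections add back to v
print(sum(a * c for a, c in zip(pk, pk_polar)))     # and are orthogonal

# For C = [-1, 1] the polar set is also [-1, 1], and the two projections
# of t = 3 add up to 2 rather than 3.
t = 3
print(project_interval(t) + project_interval(t))
```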
For norms and seminorms, polarity has to measure scale rather than only direction. A convex set containing the origin can serve as a unit ball even when it is not symmetric, but then the associated size function need not be a norm. The gauge formalises this size function by asking how much the set must be dilated to contain a given point.
[definition: Gauge]
Let $C\subset\mathbb R^n$ be a closed convex set with $0\in C$. The gauge of $C$ is the map $\gamma_C:\mathbb R^n\to[0,\infty]$ defined by
\begin{align*}
\gamma_C(x)=\inf\{t>0: x\in tC\}.
\end{align*}
[/definition]
A gauge measures the scale needed for $C$ to absorb $x$. The dual scale asks which linear functionals are bounded by $1$ on $C$, and this motivates the polar set below.
[definition: Polar Set]
Let $C\subset\mathbb R^n$ be a closed convex set with $0\in C$. The polar set of $C$ is
\begin{align*}
C^\circ=\{y\in\mathbb R^n: y\cdot x\le1\text{ for all }x\in C\}.
\end{align*}
[/definition]
The polar set records exactly which linear functionals stay uniformly bounded on $C$. The remaining problem is to express the gauge, which is defined by rescaling points into $C$, in the dual language of linear functionals. Without such a formula, gauge regularisers remain primal objects and do not produce explicit dual constraints. The conjugacy relation below supplies that translation by showing how the polar set becomes the support-function side of the gauge.
[quotetheorem:6721]
[citeproof:6721]
Absorbingness ensures that the gauge is finite everywhere; without it, $\gamma_C$ may take the value $+\infty$ outside the cone generated by $C$. Closed convexity ensures that the gauge is recovered from its biconjugate rather than from a closed convex relaxation, so deleting closedness changes the object represented by the support function. The theorem does not require $C$ to be symmetric, and therefore it describes gauges more general than norms. The polar formula turns many denoising and inverse problems into constrained dual problems.
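A one-dimensional asymmetric set shows the formula at work. The sketch below assumes the conjugacy relation is used in the standard form $\gamma_C=\sigma_{C^\circ}$, the support function of the polar set, and takes $C=[-1,2]$, for which $C^\circ=[-1,\tfrac12]$; the helper names are hypothetical.

```python
from fractions import Fraction

def gauge_interval(x, lo, hi):
    """Gauge of C = [lo, hi] with lo < 0 < hi: the infimum of t > 0 with x in t*C."""
    if x > 0:
        return Fraction(x) / hi
    if x < 0:
        return Fraction(x) / lo
    return Fraction(0)

def polar_interval(lo, hi):
    """Polar set of [lo, hi] with lo < 0 < hi: y*x <= 1 on C means 1/lo <= y <= 1/hi."""
    return Fraction(1) / lo, Fraction(1) / hi

def support_interval(x, a, b):
    """Support function of [a, b] at x: the maximum of y*x over y in [a, b]."""
    return max(a * x, b * x)

lo, hi = -1, 2
a, b = polar_interval(lo, hi)
for x in [-3, -1, 0, 1, 4]:
    print(x, gauge_interval(x, lo, hi), support_interval(x, a, b))
```

The two printed columns agree even though $C$ is not symmetric, so the gauge here is not a norm: $\gamma_C(-1)=1$ while $\gamma_C(1)=\tfrac12$.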
Total variation denoising is a model example because the regulariser is a gauge of derivatives and its dual constraint is a pointwise bound on a vector field.
[example: Total Variation Denoising Dual]
In a finite-dimensional discretisation, let $D:\mathbb R^n\to\mathbb R^m$ be a discrete gradient, define $\operatorname{TV}(u)=|Du|_1$, and fix $b\in\mathbb R^n$ and $\lambda>0$. The denoising problem is the Fenchel primal problem
\begin{align*}
\inf_{u\in\mathbb R^n}\left\{\frac12|u-b|^2+\lambda |Du|_1\right\}
=\inf_{u\in\mathbb R^n}\{f(u)+g(Du)\},
\end{align*}
where $f(u)=\frac12|u-b|^2$ and $g(z)=\lambda |z|_1$.
We first compute the two conjugates. For $s\in\mathbb R^n$, put $r=u-b$, so $u=r+b$. Then
\begin{align*}
f^*(s)=\sup_{u\in\mathbb R^n}\left\{s\cdot u-\frac12|u-b|^2\right\}
=\sup_{r\in\mathbb R^n}\left\{s\cdot(r+b)-\frac12|r|^2\right\}.
\end{align*}
Thus
\begin{align*}
f^*(s)=b\cdot s+\sup_{r\in\mathbb R^n}\left\{s\cdot r-\frac12|r|^2\right\}.
\end{align*}
Completing the square gives
\begin{align*}
s\cdot r-\frac12|r|^2=\frac12|s|^2-\frac12|r-s|^2.
\end{align*}
The term $-\frac12|r-s|^2$ is at most $0$ and equals $0$ when $r=s$, so
\begin{align*}
f^*(s)=b\cdot s+\frac12|s|^2.
\end{align*}
For $p\in\mathbb R^m$,
\begin{align*}
g^*(p)=\sup_{z\in\mathbb R^m}\{p\cdot z-\lambda |z|_1\}
=\sup_{z\in\mathbb R^m}\sum_{j=1}^m(p_jz_j-\lambda |z_j|).
\end{align*}
If $|p|_\infty\le\lambda$, then $|p_j|\le\lambda$ for every $j$, and hence
\begin{align*}
p_jz_j-\lambda |z_j|\le |p_j|\,|z_j|-\lambda |z_j|\le0.
\end{align*}
Therefore $p\cdot z-\lambda |z|_1\le0$ for every $z$, with equality at $z=0$. If $|p_k|>\lambda$ for some $k$, choose $z_k=t\operatorname{sgn}(p_k)$ and $z_j=0$ for $j\ne k$. Then
\begin{align*}
p\cdot z-\lambda |z|_1=t(|p_k|-\lambda),
\end{align*}
which tends to $+\infty$ as $t\to+\infty$. Hence
\begin{align*}
g^*(p)=\delta_{\{|p|_\infty\le\lambda\}}(p).
\end{align*}
Using the Fenchel dual convention for $f(u)+g(Du)$, the dual objective is
\begin{align*}
-f^*(-D^\top p)-g^*(p)
=-\left(b\cdot(-D^\top p)+\frac12|-D^\top p|^2\right)-\delta_{\{|p|_\infty\le\lambda\}}(p).
\end{align*}
Since $|-D^\top p|^2=|D^\top p|^2$, this becomes
\begin{align*}
-f^*(-D^\top p)-g^*(p)=b\cdot D^\top p-\frac12|D^\top p|^2-\delta_{\{|p|_\infty\le\lambda\}}(p).
\end{align*}
The indicator term is finite exactly when $|p|_\infty\le\lambda$, so the dual problem is
\begin{align*}
\sup_{p\in\mathbb R^m}\left\{-\frac12|D^\top p|^2+b\cdot D^\top p: |p|_\infty\le\lambda\right\}.
\end{align*}
At a primal-dual optimal pair $(u^*,p^*)$, the Fenchel optimality inclusions are $-D^\top p^*\in\partial f(u^*)$ and $p^*\in\partial g(Du^*)$. Since $f(u)=\frac12|u-b|^2$ is differentiable with $\nabla f(u)=u-b$, the first inclusion gives
\begin{align*}
-D^\top p^*=u^*-b.
\end{align*}
Equivalently,
\begin{align*}
u^*=b-D^\top p^*.
\end{align*}
The second inclusion says
\begin{align*}
p^*\in\partial(\lambda |\,\cdot\,|_1)(Du^*).
\end{align*}
Coordinatewise, this means
\begin{align*}
p_j^*\in\lambda\,\partial |\,\cdot\,|((Du^*)_j).
\end{align*}
Since $\partial |\,\cdot\,|(t)=\{\operatorname{sgn}(t)\}$ for $t\ne0$ and $\partial |\,\cdot\,|(0)=[-1,1]$, if $(Du^*)_j\ne0$ then
\begin{align*}
p_j^*=\lambda\operatorname{sgn}((Du^*)_j).
\end{align*}
If $(Du^*)_j=0$, then
\begin{align*}
p_j^*\in[-\lambda,\lambda].
\end{align*}
Thus $p^*$ lies in the scaled dual ball $|p^*|_\infty\le\lambda$, and every nonzero discrete-gradient coordinate of $u^*$ saturates the dual constraint with $|p_j^*|=\lambda$.
[/example]
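The dual problem is a box-constrained concave quadratic, so it can be approached by projected gradient ascent. The following is a minimal sketch rather than a tuned solver. It assumes that $D$ is the one-dimensional forward difference $(Du)_i=u_{i+1}-u_i$ and uses the step size $\tau=1/4$, which is admissible because $|D|^2\le4$ for this operator; the function names are hypothetical.

```python
def D(u):
    """Forward differences: (Du)_i = u_{i+1} - u_i."""
    return [u[i + 1] - u[i] for i in range(len(u) - 1)]

def Dt(p):
    """Adjoint of D: (D^T p)_i = p_{i-1} - p_i, with out-of-range entries read as 0."""
    n = len(p) + 1
    return [(p[i - 1] if i > 0 else 0.0) - (p[i] if i < n - 1 else 0.0)
            for i in range(n)]

def tv_denoise_dual(b, lam, steps=500, tau=0.25):
    """Projected gradient ascent on the dual; returns (u, p) with u = b - D^T p."""
    p = [0.0] * (len(b) - 1)
    for _ in range(steps):
        u = [bi - qi for bi, qi in zip(b, Dt(p))]
        grad = D(u)  # gradient of the dual objective is D(b - D^T p)
        # Ascent step followed by projection onto the box |p|_inf <= lam.
        p = [min(max(pj + tau * gj, -lam), lam) for pj, gj in zip(p, grad)]
    u = [bi - qi for bi, qi in zip(b, Dt(p))]
    return u, p

def primal_value(u, b, lam):
    return (0.5 * sum((ui - bi) ** 2 for ui, bi in zip(u, b))
            + lam * sum(abs(z) for z in D(u)))

def dual_value(p, b):
    q = Dt(p)
    return sum(bi * qi for bi, qi in zip(b, q)) - 0.5 * sum(qi ** 2 for qi in q)

b, lam = [0.0, 4.0], 1.0
u, p = tv_denoise_dual(b, lam)
print(u, p, primal_value(u, b, lam), dual_value(p, b))
```

For this two-point signal the dual reduces to maximising $4p-p^2$ over $|p|\le1$, so $p^*=1$ and $u^*=b-D^\top p^*=(1,3)$; the single nonzero gradient coordinate saturates the dual constraint, as the example predicts.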
The total variation example shows the full pattern of the chapter: conjugates produce the dual problem, calculus produces the optimality equations, and polar geometry makes the constraints readable. This motivates the closing remark, which records the common certificate viewpoint before the course turns to conic formulations.
[remark: Duality as Certificate Geometry]
Fenchel duality, subdifferential calculus, and Moreau decomposition all express the same geometric principle: a convex optimum is witnessed by a supporting functional. In Fenchel duality the functional is a dual variable, in subdifferential calculus it is a subgradient balance, and in Moreau decomposition it is the residual part of a proximal projection. Chapter 8 uses this certificate viewpoint to study conic formulations, after Chapter 7 first spells it out in the polyhedral setting of linear programming.
[/remark]
# 7. Linear and Polyhedral Programming
This chapter specialises the geometric and duality tools of the course to linear objectives over polyhedra. Linear programming is the finite-dimensional setting where separation, faces, normal cones, and certificates become completely explicit. We assume Chapters 1 and 4 on convex sets, faces, separating hyperplanes, affine hulls, cones, and Lagrange duality, together with the linear algebra of rank, bases, determinants, and systems of linear equations. The main questions are: where do optima occur, how are primal and dual problems paired, and when do linear descriptions automatically force integer solutions?
The chapter also prepares the transition from abstract convex optimisation to conic programming. Polyhedra are the first class where geometry, algebra, and algorithms meet: vertices correspond to bases, separating hyperplanes become infeasibility certificates, and integrality becomes a statement about the shape of every face.
## Polyhedra, Standard Form, and Vertex Optimality
The first problem is to understand why a linear optimisation problem over a polyhedron can be reduced to finitely many algebraic candidates, even though the feasible region may contain infinitely many points. The answer is that linear objectives expose faces, and in a pointed polyhedron the smallest exposed faces are vertices.
[definition: Polyhedron]
A polyhedron in $\mathbb R^n$ is a set of the form
\begin{align*}
P = \{x \in \mathbb R^n : Ax \le b\},
\end{align*}
where $A \in \mathbb R^{m \times n}$ and $b \in \mathbb R^m$.
[/definition]
The same set may have many descriptions. Equalities, nonnegativity constraints, and two-sided bounds can all be encoded by linear inequalities, so the definition is geometric rather than tied to a particular presentation.
[example: Redundant Halfspaces and Hidden Dimension]
Consider the square
\begin{align*}
P=[0,1]^2=\{(x_1,x_2)\in\mathbb R^2:0\le x_1\le1,\ 0\le x_2\le1\}.
\end{align*}
Writing all inequalities with $\le$ signs gives
\begin{align*}
-x_1\le0,\qquad x_1\le1,\qquad -x_2\le0,\qquad x_2\le1.
\end{align*}
If $x\in P$, then $x_1\le1$ and $x_2\le1$, so
\begin{align*}
x_1+x_2\le1+1=2\le3.
\end{align*}
Thus adding the extra inequality $x_1+x_2\le3$ does not remove any point of $P$. Conversely, any point satisfying all five inequalities satisfies the original four in particular and therefore lies in $[0,1]^2$, so the five-inequality description defines the same set. The matrix of constraint normals has changed, but the geometric polyhedron has not.
The active constraints show different local dimensions. At $(0,0)$ the inequalities $-x_1\le0$ and $-x_2\le0$ are active because
\begin{align*}
-x_1=0,\qquad -x_2=0,
\end{align*}
and their normals $(-1,0)$ and $(0,-1)$ are linearly independent. Solving the active equalities $x_1=0$ and $x_2=0$ gives the single point $(0,0)$, so these two active constraints pin down a vertex. On the edge $x_1=0$ with $0<x_2<1$, only $-x_1\le0$ is active among the four square constraints, since
\begin{align*}
-x_1=0,\qquad x_1=0<1,\qquad -x_2<0,\qquad x_2<1.
\end{align*}
The feasible points with $x_1=0$ form the set
\begin{align*}
\{(0,t):0\le t\le1\},
\end{align*}
a whole line segment, so one active independent constraint does not determine a vertex in $\mathbb R^2$.
[/example]
This example shows that halfspace descriptions are flexible but can hide the algebraic structure relevant to optimisation. If unrestricted variables and inequalities are left in arbitrary form, it is harder to tell which constraints actively determine a candidate optimum; redundant rows can also obscure the finite list of candidates. This motivates the standard form in which all geometric constraints are encoded by equality constraints and nonnegative variables.
[definition: Standard Form Linear Program]
A standard form linear program is an optimisation problem with objective $\min c^\top x$ subject to $Ax=b$ and $x\ge0$, where $A \in \mathbb R^{m \times n}$, $b \in \mathbb R^m$, and $c \in \mathbb R^n$.
[/definition]
Inequalities are converted to equalities by slack variables, while unrestricted variables are written as differences of nonnegative variables. This normal form makes the active nonnegativity constraints visible, and that motivates the algebraic notion of a basis.
[definition: Feasible Basis]
Assume $A \in \mathbb R^{m \times n}$ has rank $m$. A basis is a subset $B \subset \{1,\dots,n\}$ with $|B|=m$ such that the submatrix $A_B$ is invertible. The associated basic solution is $x_B = A_B^{-1}b$ and $x_j=0$ for $j \notin B$. The basis is feasible if $x_B \ge 0$.
[/definition]
A feasible basis selects enough active nonnegativity constraints to pin down a candidate point. The possible mismatch is that this is an algebraic construction, while vertices are defined geometrically by the absence of nontrivial line segments through the point. To justify simplex-style reasoning, one has to know that these two notions pick out the same candidates rather than two different finite sets. The theorem below supplies that equivalence under the full row-rank hypothesis.
[quotetheorem:6701]
[citeproof:6701]
This theorem is the bridge between convex geometry and the simplex viewpoint: vertices are solutions obtained by inverting square submatrices. The full row-rank hypothesis is not cosmetic; without it, the phrase "basis of $m$ columns" demands more linearly independent columns than the rank of $A$ allows, even though the feasible set still has vertices after redundant equations are removed. The theorem also does not assert that the representing basis is unique, because a vertex may have more active nonnegativity constraints than needed to determine it. That distinction is exactly what degeneracy measures, and it becomes important later when reduced costs are used to certify optimality.
[example: Redundant Equality Obscuring a Vertex]
Let $A$ have columns $A_1=(1,2)^\top$ and $A_2=(1,2)^\top$, let $b=(1,2)^\top$, and set
\begin{align*}
P=\{x\in\mathbb R^2:Ax=b,\ x\ge0\}.
\end{align*}
The equation $Ax=b$ says first that
\begin{align*}
x_1+x_2=1,
\end{align*}
and second that
\begin{align*}
2x_1+2x_2=2.
\end{align*}
The second equation is exactly twice the first, because
\begin{align*}
2x_1+2x_2=2(x_1+x_2).
\end{align*}
Thus the equality system imposes only $x_1+x_2=1$. Together with $x_1\ge0$ and $x_2\ge0$, this gives $x_2=1-x_1$ and
\begin{align*}
x_1\ge0.
\end{align*}
Also,
\begin{align*}
1-x_1\ge0,
\end{align*}
so $0\le x_1\le1$. Writing $t=x_1$, every feasible point has the form
\begin{align*}
x=(t,1-t)\qquad\text{with }0\le t\le1.
\end{align*}
Therefore
\begin{align*}
P=\{(t,1-t):0\le t\le1\}.
\end{align*}
The endpoints $(1,0)$ and $(0,1)$ are vertices of this segment. If $0<t<1$, then
\begin{align*}
(t,1-t)=t(1,0)+(1-t)(0,1),
\end{align*}
where $t>0$, $1-t>0$, and $t+(1-t)=1$. Hence every interior point is a nontrivial convex combination of the two endpoints, so the only vertices of $P$ are $(1,0)$ and $(0,1)$.
The obstruction to the basis language is algebraic rather than geometric. Since $A_1=A_2=(1,2)^\top$, the two columns are linearly dependent:
\begin{align*}
A_1-A_2=(0,0)^\top.
\end{align*}
Equivalently, the determinant of the full $2\times2$ column matrix is
\begin{align*}
1\cdot2-1\cdot2=0.
\end{align*}
Thus $A$ has rank $1<2$, and no pair of columns can form an invertible $2\times2$ basis matrix. The vertices are genuine geometric endpoints of the feasible segment, but the redundant equality row makes the full-row-rank basis language inapplicable until the equality system is reduced.
[/example]
Removing redundant equations repairs the mismatch between the number of equality rows and the rank of the system, so the basis language can now be applied honestly. It does not, however, make the representation of a vertex unique. At a vertex more nonnegativity constraints may be active than are needed to determine the point, and different choices of the remaining basic columns can lead to the same feasible vector. This is the algebraic source of degeneracy.
[example: Degenerate Basic Feasible Solution]
Consider
\begin{align*}
P=\{x\in\mathbb R^3:x_1+x_2=1,\ x_1+x_3=1,\ x\ge0\}.
\end{align*}
Writing the equality constraints as $Ax=b$, the columns of $A$ are
\begin{align*}
A_1=(1,1)^\top,\quad A_2=(1,0)^\top,\quad A_3=(0,1)^\top,
\end{align*}
and $b=(1,1)^\top$. The point $x=(1,0,0)$ is feasible because
\begin{align*}
x_1+x_2=1+0=1,
\end{align*}
\begin{align*}
x_1+x_3=1+0=1,
\end{align*}
and $1\ge0$, $0\ge0$, $0\ge0$.
For $B=\{1,2\}$, the basis matrix has columns $A_1=(1,1)^\top$ and $A_2=(1,0)^\top$. Its determinant is
\begin{align*}
1\cdot0-1\cdot1=-1\ne0,
\end{align*}
so these two columns form a basis. The associated basic solution sets the nonbasic variable $x_3=0$ and solves
\begin{align*}
x_1+x_2=1,\qquad x_1=1.
\end{align*}
Substituting $x_1=1$ into the first equation gives
\begin{align*}
x_2=1-x_1=1-1=0.
\end{align*}
Thus the basis $B=\{1,2\}$ gives $x=(1,0,0)$.
For $B=\{1,3\}$, the basis matrix has columns $A_1=(1,1)^\top$ and $A_3=(0,1)^\top$. Its determinant is
\begin{align*}
1\cdot1-0\cdot1=1\ne0,
\end{align*}
so these two columns also form a basis. The associated basic solution sets the nonbasic variable $x_2=0$ and solves
\begin{align*}
x_1=1,\qquad x_1+x_3=1.
\end{align*}
Substituting $x_1=1$ into the second equation gives
\begin{align*}
x_3=1-x_1=1-1=0.
\end{align*}
Thus the basis $B=\{1,3\}$ also gives $x=(1,0,0)$.
The same feasible point is therefore represented by two different feasible bases, and in each representation one basic variable is zero. That vanished basic variable is the algebraic sign of degeneracy.
[/example]
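The basis computations in the last two examples can be reproduced by brute force. The following is a minimal sketch using exact rational arithmetic; the names `solve_square` and `basic_feasible_solutions` are hypothetical, and enumerating all column subsets is practical only for very small instances.

```python
from fractions import Fraction
from itertools import combinations

def solve_square(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination; return None if M is singular."""
    k = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * c for a, c in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def basic_feasible_solutions(A, b):
    """Map each basic feasible solution to the list of (1-based) bases producing it."""
    m, n = len(A), len(A[0])
    found = {}
    for B in combinations(range(n), m):
        sub = [[A[i][j] for j in B] for i in range(m)]
        xB = solve_square(sub, b)
        if xB is None or any(v < 0 for v in xB):
            continue  # not a basis, or basic solution infeasible
        x = [Fraction(0)] * n
        for j, v in zip(B, xB):
            x[j] = v
        found.setdefault(tuple(x), []).append(tuple(j + 1 for j in B))
    return found

# Degenerate example: x_1 + x_2 = 1, x_1 + x_3 = 1, x >= 0.
print(basic_feasible_solutions([[1, 1, 0], [1, 0, 1]], [1, 1]))
# Redundant-equality example: the only 2x2 column matrix is singular.
print(basic_feasible_solutions([[1, 1], [2, 2]], [1, 2]))
```

The first call lists the point $(1,0,0)$ under both bases $\{1,2\}$ and $\{1,3\}$, while the second returns an empty dictionary even though the feasible segment has two vertices.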
The example shows that bases may be nonunique, but the finite list of basic feasible solutions still captures the candidate vertices. This motivates the main vertex optimality theorem: if a linear objective attains its best value, then at least one algebraic candidate attains it.
[quotetheorem:6702]
[citeproof:6702]
The pointedness hypothesis is the geometric condition that $P$ contains no line. Without it, a finite attained optimum need not occur at a vertex: for example, $P=\{(t,0):t\in\mathbb R\}$ and $c=(0,1)$ have every feasible point optimal but no vertex. The theorem also does not say that all optima are vertices; if the objective is constant along an optimal face, so that its level sets are parallel to that face, then every point of the face is optimal, but at least one of its vertices carries the same objective value. This is the reason simplex-type methods can search among bases without losing the optimum, while still allowing the optimal set itself to be higher-dimensional.
[example: Transportation Problem]
Let there be $p$ supply nodes and $q$ demand nodes, with $s_i\ge0$, $d_j\ge0$, and
\begin{align*}
\sum_{i=1}^p s_i=\sum_{j=1}^q d_j.
\end{align*}
The transportation polytope is
\begin{align*}
T=\left\{X=(x_{ij})\in\mathbb R^{p\times q}: \sum_{j=1}^q x_{ij}=s_i,\ \sum_{i=1}^p x_{ij}=d_j,\ x_{ij}\ge0\right\}.
\end{align*}
If we list the entries as a vector
\begin{align*}
x=(x_{11},x_{12},\dots,x_{1q},x_{21},\dots,x_{pq})\in\mathbb R^{pq},
\end{align*}
then the row-sum and column-sum equations are linear equations in the coordinates of $x$, while the inequalities $x_{ij}\ge0$ are exactly the standard nonnegativity constraints. Thus minimising
\begin{align*}
\sum_{i=1}^p\sum_{j=1}^q c_{ij}x_{ij}
\end{align*}
is a standard form linear program with one variable for each route $(i,j)$.
The feasible set is bounded because, for each $i,j$,
\begin{align*}
0\le x_{ij}\le \sum_{k=1}^q x_{ik}=s_i.
\end{align*}
Hence, when $T$ is nonempty, every linear objective attains its minimum on $T$. By *[Fundamental Theorem of Linear Programming](/theorems/6702)*, at least one optimal transportation plan is a vertex. At such a vertex, many nonnegativity constraints $x_{ij}=0$ are active: the $p+q$ equality constraints have rank $p+q-1$, so at most $p+q-1$ of the $pq$ routes carry positive shipment while the row and column equations still force the prescribed supplies and demands.
[/example]
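A sparse feasible plan of the kind found at a vertex can be built greedily. The sketch below uses the classical northwest corner rule, which is not part of the example above and is included only as an illustration; it ignores the costs $c_{ij}$, so it produces a feasible plan with few positive routes and not necessarily an optimal one. The function name and the data are hypothetical.

```python
def northwest_corner(supply, demand):
    """Fill the plan from the top-left cell, exhausting one row or column at a time."""
    s, d = list(supply), list(demand)
    if sum(s) != sum(d):
        raise ValueError("total supply must equal total demand")
    p, q = len(s), len(d)
    plan = [[0] * q for _ in range(p)]
    i = j = 0
    while i < p and j < q:
        t = min(s[i], d[j])      # ship as much as row i and column j allow
        plan[i][j] = t
        s[i] -= t
        d[j] -= t
        if s[i] == 0:
            i += 1               # supply i exhausted: move down
        else:
            j += 1               # demand j satisfied: move right
    return plan

supply, demand = [3, 2], [1, 2, 2]
plan = northwest_corner(supply, demand)
print(plan)
print([sum(row) for row in plan], [sum(col) for col in zip(*plan)])
print(sum(1 for row in plan for v in row if v > 0))  # number of routes in use
```

Each step moves one row down or one column right, so at most $p+q-1$ cells are visited and the number of positive routes respects the bound for vertices.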
## Linear Programming Duality and Certificates
The second problem is to decide when a proposed optimum, or a proposed claim of infeasibility, can be checked without solving the original optimisation problem again. Duality provides such certificates: multipliers combine constraints into global lower bounds, and separating hyperplanes certify that a right-hand side is outside a feasible cone.
For the standard form minimisation problem, equality constraints receive unrestricted multipliers. The dual inequalities state exactly that the multiplier-generated linear functional lies below the primal cost on every nonnegative direction.
[definition: Dual of a Standard Form Linear Program]
The dual of $\min\{c^\top x : Ax=b,\ x\ge0\}$ is $\max\{b^\top y : A^\top y\le c\}$, where $y \in \mathbb R^m$ is unrestricted in sign.
[/definition]
The sign pattern follows from the need for certified lower bounds. Any feasible dual multiplier produces a lower bound for every feasible primal point, and this motivates the first duality inequality.
[quotetheorem:2557]
[citeproof:2557]
Weak duality turns any feasible dual point into a certificate of a lower bound, and the nonnegativity hypothesis is essential to that certificate. If $x$ were allowed to have negative components, the componentwise inequality $A^\top y\le c$ could reverse after multiplication, so the proof would no longer produce a lower bound. For a concrete failure, take $A=[1]$, $b=-1$, $c=1$, $x=-1$, and $y=0$. Then $Ax=b$ and $A^\top y\le c$, but the weak-duality conclusion would read $0\le -1$, which is false. Weak duality is also only one-sided: it can certify that a candidate is no better than a bound, but it does not by itself produce a dual point attaining the primal value. The natural next question is whether the best certificate always reaches the true primal optimum when that optimum is finite and attained, which motivates strong duality.
[quotetheorem:6703]
[citeproof:6703]
Strong duality upgrades lower bounds into exact certificates, but the finiteness hypotheses are doing real work. If the primal is unbounded below, weak duality forces the dual to be infeasible rather than optimal; if the primal feasible set is empty, there is no primal value for a dual certificate to attain. In finite-dimensional linear programming the feasible region is a closed polyhedron, and a linear objective that is bounded below on a nonempty polyhedron attains its minimum, so the usual unattained-infimum pathology belongs to more general convex optimisation rather than to closed polyhedral LPs in this form. The result also does not claim that every feasible dual point is optimal, only that at least one multiplier reaches the primal value. Once such a multiplier exists, the remaining gap expression is a sum of nonnegative products, so optimality should force primal activity and dual slack to be paired; this motivates complementary slackness.
[quotetheorem:2559]
[citeproof:2559]
Complementary slackness is a statement about a primal-dual feasible pair, not a replacement for feasibility. Without primal feasibility or dual feasibility, the summands need not be nonnegative and a zero-looking product condition can be meaningless. For instance, take the primal problem $\min\{x:x=1,\ x\ge0\}$ and its dual $\max\{y:y\le1\}$. The pair $x=0$, $y=1$ satisfies $x(1-y)=0$, but $x$ is not primal feasible and therefore is not an optimal primal solution. The theorem also does not require exactly one factor of each product to vanish: a variable and its reduced cost may both be zero, and the theorem says only that at least one member of each paired product vanishes. In network problems this becomes a structural rule: flow travels only along tight arcs, unmatched constraints may carry slack, and potentials identify the active part of the optimum.
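The three certificate checks (primal feasibility, dual feasibility, and vanishing products) can be bundled into one routine. This is a minimal sketch for the standard form pair $\min\{c^\top x:Ax=b,\ x\ge0\}$ and $\max\{b^\top y:A^\top y\le c\}$; the function name `check_pair` and the two-variable instance are hypothetical.

```python
def check_pair(A, b, c, x, y):
    """Report feasibility, duality gap, and slackness products for a candidate pair."""
    m, n = len(A), len(A[0])
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    # Reduced costs c_j - (A^T y)_j; dual feasibility means they are all >= 0.
    reduced = [c[j] - sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    products = [x[j] * reduced[j] for j in range(n)]
    gap = (sum(cj * xj for cj, xj in zip(c, x))
           - sum(bi * yi for bi, yi in zip(b, y)))
    return {
        "primal_feasible": Ax == list(b) and all(v >= 0 for v in x),
        "dual_feasible": all(r >= 0 for r in reduced),
        "gap": gap,
        "products_vanish": all(v == 0 for v in products),
    }

# Hypothetical instance: minimise x_1 + 2 x_2 subject to x_1 + x_2 = 1, x >= 0.
print(check_pair([[1, 1]], [1], [1, 2], [1, 0], [1]))
# The cautionary pair from the text: the product vanishes, but x = 0 is infeasible.
print(check_pair([[1]], [1], [1], [0], [1]))
```

When $Ax=b$ holds, the gap equals the sum of the products $x_j\,(c_j-(A^\top y)_j)$, which is why a feasible pair has zero gap exactly when every product vanishes.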
[example: Shortest Path as Linear Programming Duality]
Let $G=(V,E)$ be a directed graph with edge length $c_{uv}$ on each edge $(u,v)$, and suppose a finite shortest $s$-$t$ path exists. Use the node-arc incidence matrix $B$ whose column for $(u,v)$ has entry $1$ in row $u$, entry $-1$ in row $v$, and entry $0$ in every other row. A unit $s$-$t$ flow is a vector $x\in\mathbb R^E$ satisfying $x\ge0$ and $Bx=b$, where
\begin{align*}
b_s=1,\qquad b_t=-1,\qquad b_w=0\text{ for }w\ne s,t.
\end{align*}
Thus the shortest-path relaxation is
\begin{align*}
\min \sum_{(u,v)\in E} c_{uv}x_{uv}\quad\text{subject to }Bx=b,\ x\ge0.
\end{align*}
The dual has one unrestricted multiplier $p_w$ for each node $w$:
\begin{align*}
\max b^\top p\quad\text{subject to }B^\top p\le c.
\end{align*}
For an edge $(u,v)$, the corresponding component of $B^\top p$ is
\begin{align*}
(B^\top p)_{uv}=1\cdot p_u+(-1)\cdot p_v=p_u-p_v.
\end{align*}
Hence the dual inequalities are
\begin{align*}
p_u-p_v\le c_{uv}\qquad ((u,v)\in E).
\end{align*}
The dual objective is
\begin{align*}
b^\top p=\sum_{w\in V}b_wp_w=1\cdot p_s+(-1)\cdot p_t+\sum_{w\ne s,t}0\cdot p_w=p_s-p_t.
\end{align*}
So the dual maximises the feasible potential difference $p_s-p_t$.
For any directed path
\begin{align*}
P:s=v_0\to v_1\to\cdots\to v_k=t,
\end{align*}
the path flow defined by $x_{v_{r-1}v_r}=1$ on the path edges and $x_e=0$ on all other edges satisfies $Bx=b$: the source has one more chosen outgoing edge than incoming edge, the sink has one more chosen incoming edge than outgoing edge, and every internal path vertex has one chosen incoming edge and one chosen outgoing edge. Its objective value is
\begin{align*}
\sum_{e\in E}c_ex_e=\sum_{r=1}^k c_{v_{r-1}v_r}.
\end{align*}
If $p$ is dual feasible, then along this same path,
\begin{align*}
p_s-p_t=(p_{v_0}-p_{v_1})+(p_{v_1}-p_{v_2})+\cdots+(p_{v_{k-1}}-p_{v_k}).
\end{align*}
Using $p_{v_{r-1}}-p_{v_r}\le c_{v_{r-1}v_r}$ for each path edge gives
\begin{align*}
p_s-p_t\le c_{v_0v_1}+c_{v_1v_2}+\cdots+c_{v_{k-1}v_k}.
\end{align*}
Thus every feasible potential difference is at most the length of every $s$-$t$ path.
If $d(v)$ denotes the shortest distance from $s$ to $v$, and every vertex is reachable from $s$ so that each $d(v)$ is finite, set
\begin{align*}
p_v=d(t)-d(v).
\end{align*}
For every edge $(u,v)$, the shortest-distance inequality is
\begin{align*}
d(v)\le d(u)+c_{uv}.
\end{align*}
Rearranging gives
\begin{align*}
d(v)-d(u)\le c_{uv}.
\end{align*}
Therefore
\begin{align*}
p_u-p_v=(d(t)-d(u))-(d(t)-d(v))=d(v)-d(u)\le c_{uv}.
\end{align*}
So $p$ is dual feasible. Its objective value is
\begin{align*}
p_s-p_t=(d(t)-d(s))-(d(t)-d(t))=d(t)-d(s).
\end{align*}
Since $d(s)=0$, this becomes
\begin{align*}
p_s-p_t=d(t).
\end{align*}
Therefore the maximum feasible potential difference equals the shortest-path distance.
Finally, if an optimal flow uses an edge $(u,v)$ with $x_{uv}>0$, *Complementary Slackness* gives
\begin{align*}
x_{uv}\bigl(c_{uv}-(p_u-p_v)\bigr)=0.
\end{align*}
Because $x_{uv}>0$, the factor $c_{uv}-(p_u-p_v)$ must be zero, so
\begin{align*}
p_u-p_v=c_{uv}.
\end{align*}
Positive flow can therefore travel only on tight edges, meaning edges whose length exactly equals the potential drop.
[/example]
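The potentials in this example can be computed on a small graph. The sketch below obtains the distances $d(v)$ by Bellman-Ford relaxation, a standard choice that the example itself does not prescribe, and assumes there are no negative cycles and that every vertex is reachable from $s$. The graph and the function names are hypothetical.

```python
def shortest_distances(nodes, edges, source):
    """Bellman-Ford relaxation; edges maps (u, v) to the length c_uv."""
    dist = {v: float("inf") for v in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):
        for (u, v), c in edges.items():
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
    return dist

def potentials_and_tight_edges(nodes, edges, s, t):
    d = shortest_distances(nodes, edges, s)
    p = {v: d[t] - d[v] for v in nodes}               # p_v = d(t) - d(v)
    feasible = all(p[u] - p[v] <= c for (u, v), c in edges.items())
    tight = sorted(e for e, c in edges.items() if p[e[0]] - p[e[1]] == c)
    return p, feasible, tight

nodes = ["s", "a", "b", "t"]
edges = {("s", "a"): 1, ("s", "b"): 4, ("a", "b"): 2, ("a", "t"): 6, ("b", "t"): 3}
p, feasible, tight = potentials_and_tight_edges(nodes, edges, "s", "t")
print(p, feasible, p["s"] - p["t"])
print(tight)
```

Here the shortest path is $s\to a\to b\to t$ with length $6$, the potential difference $p_s-p_t$ equals $6$, and the tight edges are exactly the edges of that path.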
The shortest path example displays optimality certificates, but linear systems also need certificates of nonexistence. A bare assertion that $Ax=b$, $x\ge0$ has no solution is not checkable from the original variables; for instance, infeasibility may only become visible after taking a signed linear combination of several equations. This motivates Farkas' alternative, the algebraic form of separating a point from a finitely generated cone.
[quotetheorem:6685]
[citeproof:6685]
Farkas' alternative is an exclusive alternative: the sign condition $b^\top y<0$ is what turns the certificate into a contradiction rather than merely another inequality. If the cone $\{Ax:x\ge0\}$ is replaced by a nonclosed cone in an infinite-dimensional setting, the separation conclusion can fail in this exact form, which is why closedness is part of the geometric proof. The theorem does not produce a feasible solution and a certificate simultaneously; it says that exactly one of the two systems is solvable. This template reappears throughout convex optimisation: infeasibility is proved by displaying a weighted combination of constraints that contradicts the right-hand side.
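Checking a proposed certificate requires only two inner-product tests. The sketch below assumes the alternative is stated in the common form "either $Ax=b$, $x\ge0$ is solvable, or some $y$ satisfies $A^\top y\ge0$ and $b^\top y<0$"; the function name and the $2\times2$ instance are hypothetical.

```python
def is_farkas_certificate(A, b, y):
    """Return True when A^T y >= 0 componentwise and b.y < 0."""
    m, n = len(A), len(A[0])
    ATy = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    by = sum(bi * yi for bi, yi in zip(b, y))
    return all(g >= 0 for g in ATy) and by < 0

# Hypothetical system: x_1 + 2 x_2 = 1, x_1 + x_2 = 2, x >= 0.
A, b = [[1, 2], [1, 1]], [1, 2]
# Subtracting the second equation from the first gives x_2 = -1,
# which is the signed combination encoded by y = (1, -1).
print(is_farkas_certificate(A, b, [1, -1]))
print(is_farkas_certificate(A, b, [1, 1]))
```

If some $x\ge0$ satisfied $Ax=b$, then $b^\top y=x^\top A^\top y\ge0$ would contradict $b^\top y<0$, so a vector passing this test proves infeasibility without any search over $x$.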
[remark: Hoffman Bound]
For a fixed matrix $A$, there is a constant $H_A>0$ such that the distance from a point $x\in\mathbb R^n$ to the polyhedron $P=\{z:Az\le b\}$ is bounded above by
\begin{align*}
\operatorname{dist}(x,P) \le H_A |(Ax-b)^+|
\end{align*}
whenever $P$ is nonempty. The constant depends on $A$ but not on $b$ or $x$.
[/remark]
The Hoffman bound is not used here as a main proof tool, but it is conceptually important. It says that violation of linear inequalities controls distance to feasibility uniformly over right-hand sides with the same normal matrix.
## Totally Unimodular Matrices and Integrality
The final problem in this chapter is to explain why some linear programming relaxations solve integer optimisation problems without adding any nonlinear or combinatorial constraints. The geometric answer is that every vertex of the feasible polyhedron is integral, so vertex optimality automatically returns an integral optimiser whenever the data are integral.
Integral data alone do not force integral vertices. For example, the one-dimensional system $2x\le1$ has integral matrix and integral right-hand side, but the vertex $x=1/2$ is fractional. The missing condition is not integrality of entries but unimodularity of every possible basis matrix.
[definition: Totally Unimodular Matrix]
A matrix $A \in \mathbb R^{m \times n}$ is totally unimodular if every square submatrix of $A$ has determinant in $\{-1,0,1\}$.
[/definition]
The definition is designed to control inverses of basis matrices. If a nonsingular basis matrix has determinant $\pm1$, then [Cramer's rule](/theorems/3305) preserves integrality of basic solutions for integral right-hand sides.
[quotetheorem:5824]
[citeproof:5824]
This theorem is a convex-geometric integrality result, not an algorithmic accident. The integrality of the right-hand side is essential: even with $H=[1]$, the constraint $x\le1/2$ has a fractional vertex. The word "full" in the statement is also essential. Nonnegativity rows and bound rows can be the constraints that determine a vertex, so their determinant structure must be part of the matrix to which total unimodularity is applied. Total unimodularity is a sufficient condition rather than a necessary description of every integral polyhedron; some integral polyhedra arise from matrices that are not totally unimodular in the chosen formulation. Once the complete constraint matrix has the right determinant structure, the linear relaxation has integral extreme points for every integral right-hand side, so an integer optimisation problem can sometimes be solved by ordinary linear programming.
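For small matrices the definition can be tested directly by enumerating every square submatrix. The following brute-force sketch takes exponential time and is meant only as an illustration of the definition; the function names and the two test graphs are hypothetical.

```python
from itertools import combinations

def det(M):
    """Integer determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, a in enumerate(M[0]):
        if a == 0:
            continue
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def is_totally_unimodular(A):
    """Check that every square submatrix has determinant in {-1, 0, 1}."""
    m, n = len(A), len(A[0])
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                sub = [[A[i][j] for j in cols] for i in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True

# Vertex-edge incidence matrix of the complete bipartite graph K_{2,2}.
bipartite = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
# Vertex-edge incidence matrix of a triangle, which is not bipartite.
triangle = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
# Full inequality matrix: the bipartite incidence rows stacked above -I.
stacked = bipartite + [[-1 if i == j else 0 for j in range(4)] for i in range(4)]
print(is_totally_unimodular(bipartite), is_totally_unimodular(triangle))
print(is_totally_unimodular(stacked))
```

The triangle fails because its full $3\times3$ incidence matrix has determinant $-2$, which matches the role that bipartiteness plays in matching relaxations.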
[example: Bipartite Matching Relaxation]
Let $G=(U\cup V,E)$ be a bipartite graph, and let $M\in\mathbb R^{(U\cup V)\times E}$ be its vertex-edge incidence matrix: $M_{we}=1$ when the edge $e$ is incident to the vertex $w$, and $M_{we}=0$ otherwise. The linear programming relaxation of the matching problem is
\begin{align*}
P=\{x\in\mathbb R^E:Mx\le \mathbf 1,\ x\ge0\}.
\end{align*}
For an edge $e=\{u,v\}$, nonnegativity gives
\begin{align*}
x_e\le \sum_{f\ni u}x_f.
\end{align*}
The constraint at $u$ gives
\begin{align*}
\sum_{f\ni u}x_f\le1.
\end{align*}
Therefore $x_e\le1$, and together with $x_e\ge0$ this shows
\begin{align*}
0\le x_e\le1\qquad(e\in E).
\end{align*}
Thus $P$ is bounded.
The full inequality matrix for $P$ is the matrix $H$ obtained by stacking the incidence matrix $M$ above the matrix $-I$, with right-hand side $(\mathbf 1,0)$. The incidence matrix $M$ of a bipartite graph is totally unimodular. Appending the rows of $-I$ preserves total unimodularity: in any square submatrix of $H$, a selected row from $-I$ has either no nonzero entry among the selected columns, in which case the determinant is $0$, or exactly one nonzero entry, in which case [cofactor expansion](/theorems/398) along that row gives $\pm$ the determinant of a smaller selected submatrix. Repeating this removes all selected rows coming from $-I$. The remaining determinant is either $0$ or a square subdeterminant of $M$, hence belongs to $\{-1,0,1\}$. So $H$ is totally unimodular.
Since $(\mathbf 1,0)$ is integral, *Total Unimodularity Gives Integral Vertices* implies that every vertex of $P$ is integral. Combining integrality with $0\le x_e\le1$ gives
\begin{align*}
x_e\in\{0,1\}\qquad(e\in E)
\end{align*}
at every vertex. If $x$ is such a vertex and
\begin{align*}
F=\{e\in E:x_e=1\},
\end{align*}
then the constraint
\begin{align*}
\sum_{e\ni w}x_e\le1
\end{align*}
says that at most one edge of $F$ is incident to each vertex $w$. Hence $F$ is a matching. Conversely, the incidence vector of any matching satisfies $x_e\ge0$ and all constraints $\sum_{e\ni w}x_e\le1$, so it is feasible for $P$.
Thus every vertex of the relaxation is the incidence vector of a matching, and every matching incidence vector is feasible. Since a linear objective over the nonempty bounded polytope $P$ attains an optimum at a vertex by *Fundamental Theorem of Linear Programming*, maximising any linear edge weight over the relaxation gives the same value as the maximum weight matching problem.
[/example]
The matching example raises the structural question of where total unimodularity comes from in practice. This motivates the network incidence theorem, which supplies the determinant structure behind transportation, flow, and many bipartite models.
[quotetheorem:5825]
[citeproof:5825]
The directed incidence convention matters because each arc contributes one $+1$ and one $-1$, making the column-sum argument possible. A general $0$-$1$ incidence matrix need not be totally unimodular; odd cycles in non-bipartite matching are the standard obstruction and lead to fractional relaxation vertices. The theorem therefore identifies the algebraic reason that flows, cuts, and transportation problems behave better than arbitrary integer programs. Combining incidence total unimodularity with linear programming duality explains why several discrete optimisation problems admit exact linear formulations: the integrality lies in the polyhedron, while the dual describes potentials, cuts, prices, or constraints depending on the model.
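The odd-cycle obstruction can be observed on toy instances. The sketch below assumes the matching relaxation $\{x:Mx\le\mathbf 1,\ x\ge0\}$ from the bipartite example and enumerates its vertices by solving every square subsystem of active constraints exactly. The helper names `solve` and `matching_polytope_vertices` are hypothetical, and the enumeration is exponential, so it is an illustration rather than an algorithm.

```python
from fractions import Fraction
from itertools import combinations

def solve(a, b):
    """Solve the square system a x = b exactly; return None when a is singular."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        lead = m[col][col]
        m[col] = [v / lead for v in m[col]]
        for r in range(n):
            factor = m[r][col]
            if r != col and factor != 0:
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]

def matching_polytope_vertices(num_vertices, edges):
    """Vertices of {x : sum of x_e over edges at w <= 1, x >= 0}, found by brute force."""
    n = len(edges)
    # Rows of H: incidence rows first, then the rows of -I; right-hand side (1, 0).
    h = [[1 if w in e else 0 for e in edges] for w in range(num_vertices)]
    h += [[-1 if i == j else 0 for j in range(n)] for i in range(n)]
    rhs = [1] * num_vertices + [0] * n
    found = set()
    for active in combinations(range(len(h)), n):
        x = solve([h[i] for i in active], [rhs[i] for i in active])
        if x is None:
            continue
        if all(sum(hij * xj for hij, xj in zip(row, x)) <= r for row, r in zip(h, rhs)):
            found.add(tuple(x))
    return found

# Bipartite 4-cycle with classes {0, 1} and {2, 3}.
square = matching_polytope_vertices(4, [(0, 2), (0, 3), (1, 2), (1, 3)])
# Triangle: the smallest odd cycle.
triangle = matching_polytope_vertices(3, [(0, 1), (1, 2), (0, 2)])

print(sorted(square))
print(sorted(triangle))
```

For the 4-cycle every vertex found is a $0$-$1$ vector, one per matching. For the triangle the enumeration also returns the point $(1/2,1/2,1/2)$, which is the fractional vertex produced by the three tight degree constraints.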
[example: Integral Transportation Plans]
Let $D$ be the node-arc incidence matrix of the complete bipartite directed graph with supply nodes $u_1,\dots,u_p$, demand nodes $v_1,\dots,v_q$, and one arc $u_i\to v_j$ for each route $(i,j)$. With the convention that an arc column has $+1$ at its tail and $-1$ at its head, the column for $u_i\to v_j$ has
\begin{align*}
D_{u_i,(i,j)}=1
\end{align*}
and
\begin{align*}
D_{v_j,(i,j)}=-1,
\end{align*}
with all other entries equal to $0$. If $\beta_{u_i}=s_i$ and $\beta_{v_j}=-d_j$, then $Dx=\beta$ gives, at each supply node $u_i$,
\begin{align*}
\sum_{j=1}^q x_{ij}=s_i,
\end{align*}
and, at each demand node $v_j$,
\begin{align*}
-\sum_{i=1}^p x_{ij}=-d_j.
\end{align*}
Multiplying the last equality by $-1$ gives
\begin{align*}
\sum_{i=1}^p x_{ij}=d_j,
\end{align*}
so $Dx=\beta$ together with $x\ge0$ is exactly the transportation system.
By *Network Incidence Matrices Are Totally Unimodular*, the matrix $D$ is totally unimodular. The equality and nonnegativity constraints can be written as
\begin{align*}
Dx\le\beta,
\end{align*}
\begin{align*}
-Dx\le-\beta,
\end{align*}
and
\begin{align*}
-x\le0.
\end{align*}
Let $H$ be the full constraint matrix obtained by stacking the rows of $D$, then the rows of $-D$, and then the rows of $-I$. To check total unimodularity of $H$, take any square submatrix. If it uses a selected row from $-I$ with exactly one nonzero selected entry, cofactor expansion along that row reduces the determinant to the determinant of a smaller selected submatrix, up to sign. If that row has no nonzero selected entry, the determinant is $0$. Repeating this removes all selected rows coming from $-I$. For the remaining rows, every row is either a row of $D$ or the negative of a row of $D$. Multiplying the rows chosen from $-D$ by $-1$ changes the determinant only by a sign. If two resulting rows are equal, the determinant is $0$; otherwise the resulting matrix is a square submatrix of $D$, whose determinant belongs to $\{-1,0,1\}$. Hence $H$ is totally unimodular.
If all $s_i$ and $d_j$ are integers, then $\beta$ is integral, and so the full right-hand side formed from $\beta$, $-\beta$, and $0$ is integral. By *Total Unimodularity Gives Integral Vertices*, every vertex of the transportation polytope is integral. When the supplies and demands are nonnegative and
\begin{align*}
\sum_{i=1}^p s_i=\sum_{j=1}^q d_j,
\end{align*}
the polytope is nonempty. If the common total is $S>0$, define
\begin{align*}
x_{ij}=\frac{s_i d_j}{S}.
\end{align*}
Then
\begin{align*}
\sum_{j=1}^q x_{ij}=\sum_{j=1}^q \frac{s_i d_j}{S}=\frac{s_i}{S}\sum_{j=1}^q d_j=\frac{s_i}{S}S=s_i,
\end{align*}
and
\begin{align*}
\sum_{i=1}^p x_{ij}=\sum_{i=1}^p \frac{s_i d_j}{S}=\frac{d_j}{S}\sum_{i=1}^p s_i=\frac{d_j}{S}S=d_j.
\end{align*}
If $S=0$, then all supplies and demands are $0$ because they are nonnegative, and the zero array is feasible.
The feasible set is bounded because every feasible point satisfies
\begin{align*}
0\le x_{ij}\le \sum_{k=1}^q x_{ik}=s_i.
\end{align*}
Therefore any linear transportation cost attains an optimum on the feasible set, and by *Fundamental Theorem of Linear Programming* some optimal solution is a vertex. Since every vertex is integral, the linear programming relaxation has an integral cost-minimising transportation plan.
[/example]
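An integral feasible plan can also be built greedily, in contrast to the proportional plan $x_{ij}=s_id_j/S$, which is usually fractional. The sketch below uses the northwest-corner rule, a standard textbook construction that is not part of the argument above. It assumes nonnegative integer supplies and demands with equal totals, the function names are illustrative, and the plan it returns is feasible and integral but not necessarily cost-minimising.

```python
from fractions import Fraction

def northwest_corner(supply, demand):
    """Greedy integral transportation plan for balanced nonnegative integer data."""
    if sum(supply) != sum(demand):
        raise ValueError("total supply must equal total demand")
    s, d = list(supply), list(demand)
    plan = [[0] * len(d) for _ in s]
    i = j = 0
    while i < len(s) and j < len(d):
        shipped = min(s[i], d[j])      # ship as much as possible on route (i, j)
        plan[i][j] = shipped
        s[i] -= shipped
        d[j] -= shipped
        if s[i] == 0 and i < len(s) - 1:
            i += 1                     # supply node exhausted: move to the next row
        else:
            j += 1                     # demand node satisfied: move to the next column
    return plan

def proportional_plan(supply, demand):
    """The feasible plan x_ij = s_i d_j / S from the example, assuming S > 0."""
    total = sum(supply)
    return [[Fraction(si * dj, total) for dj in demand] for si in supply]

plan = northwest_corner([3, 2], [1, 2, 2])
print(plan)
print(proportional_plan([3, 2], [1, 2, 2]))
```

With supplies $(3,2)$ and demands $(1,2,2)$ the greedy plan has integer entries with the required row and column sums, while the proportional plan has entries such as $3/5$.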
# 8. Conic Programming Framework
Conic programming is the common language behind linear, second-order cone, and semidefinite optimisation. The point of this chapter is to separate what belongs to convex geometry from what belongs to a particular coordinate representation. Once inequalities are interpreted as membership in an ordered cone, the primal-dual pattern from linear programming becomes a theorem about dual cones, adjoint maps, and regularity.
The chapter assumes Chapter 7's linear programming duality, Chapter 1's separating hyperplane theorem for finite-dimensional convex sets, basic adjoints of linear maps between inner product spaces, and the elementary geometry of positive semidefinite matrices. The main new phenomenon is that conic duality is more fragile than linear programming duality. Feasible conic problems can have positive duality gaps, optimal values need not be attained, and images of closed cones under linear maps need not be closed. These pathologies are not numerical accidents; they are geometric failures of regularity, refining the Slater and relative-interior qualifications from Chapters 4 and 6.
## Ordering Vector Spaces by Cones
How can an inequality such as $Ax \geq b$ be written when the quantity being constrained is a matrix or a vector with a quadratic norm bound? The unifying idea is to choose a cone of nonnegative elements and define order by translation of that cone.
[definition: Cone-Induced Order]
Let $V$ be a finite-dimensional real [vector space](/page/Vector%20Space) and let $K \subset V$ be a cone. The relation $\preceq_K$ on $V$ is defined by
\begin{align*}
u \preceq_K v \quad \Longleftrightarrow \quad v-u \in K.
\end{align*}
[/definition]
This gives a candidate inequality, but an arbitrary cone can identify opposite directions or have no interior room for perturbation. If both $v-u$ and $u-v$ lie in the cone, the relation cannot distinguish $u$ from $v$; if the cone is thin or nonclosed, separation and perturbation arguments can break down. The following definition collects the precise hypotheses that prevent these pathologies and make the induced order behave like a usable finite-dimensional inequality.
[definition: Proper Cone]
A cone $K \subset V$ is proper if it is closed, convex, pointed, and has nonempty interior:
\begin{align*}
K \text{ is closed}, \quad K+K \subset K, \quad \lambda K \subset K \text{ for } \lambda \geq 0, \quad K \cap (-K)=\{0\}, \quad \operatorname{int} K \neq \varnothing.
\end{align*}
[/definition]
Pointedness gives antisymmetry of $\preceq_K$, while nonempty interior gives a strict order $u \prec_K v$ by asking that $v-u \in \operatorname{int} K$. The closedness and convexity assumptions are the ones needed for separation and duality.
[example: Standard Ordering Cones]
For $K=\mathbb R^m_+$, the cone-induced order is exactly componentwise comparison. If $u,v\in\mathbb R^m$, then $u\preceq_{\mathbb R^m_+} v$ means $v-u\in\mathbb R^m_+$ by the definition of cone-induced order. By the definition of $\mathbb R^m_+$, this is equivalent to $(v-u)_i\geq 0$ for every $i$, hence to $v_i-u_i\geq 0$ for every $i$, and therefore to
\begin{align*}
u_i\leq v_i \text{ for every } i.
\end{align*}
For the second-order cone
\begin{align*}
Q_{m+1}=\{(t,z)\in \mathbb R\times \mathbb R^m : t\geq |z|\},
\end{align*}
take two points $(t,z),(s,w)\in\mathbb R\times\mathbb R^m$. The relation $(t,z)\preceq_{Q_{m+1}}(s,w)$ means $(s,w)-(t,z)\in Q_{m+1}$. Since
\begin{align*}
(s,w)-(t,z)=(s-t,w-z),
\end{align*}
the definition of $Q_{m+1}$ gives
\begin{align*}
(t,z)\preceq_{Q_{m+1}}(s,w)\Longleftrightarrow s-t\geq |w-z|.
\end{align*}
Thus the scalar difference $s-t$ must dominate the Euclidean distance between the vector parts.
Let $S^n$ be the real vector space of $n\times n$ symmetric matrices, and let
\begin{align*}
S^n_+=\{X\in S^n : X \text{ is positive semidefinite}\}.
\end{align*}
For $U,V\in S^n$, the relation $U\preceq_{S^n_+}V$ means $V-U\in S^n_+$. By the definition of positive semidefiniteness, this is equivalent to
\begin{align*}
x^\top(V-U)x\geq 0 \text{ for every } x\in\mathbb R^n.
\end{align*}
Expanding the quadratic form gives
\begin{align*}
x^\top(V-U)x=x^\top Vx-x^\top Ux,
\end{align*}
so the same condition is
\begin{align*}
x^\top Ux\leq x^\top Vx \text{ for every } x\in\mathbb R^n.
\end{align*}
This is the Loewner order on symmetric matrices. The same formal relation $u\preceq_K v$ therefore becomes coordinatewise comparison, norm domination, or matrix positivity depending on the chosen cone $K$.
[/example]
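The three orders can be compared side by side in a few lines of Python. This is a minimal sketch with illustrative function names. The Loewner test is restricted to symmetric $2\times2$ matrices, where positive semidefiniteness is equivalent to nonnegative diagonal entries and nonnegative determinant.

```python
import math

def leq_orthant(u, v):
    """u <= v in the order induced by the nonnegative orthant."""
    return all(vi - ui >= 0 for ui, vi in zip(u, v))

def leq_soc(p, q):
    """(t, z) <= (s, w) in the second-order cone order: s - t >= |w - z|."""
    (t, z), (s, w) = p, q
    return s - t >= math.sqrt(sum((wi - zi) ** 2 for wi, zi in zip(w, z)))

def leq_loewner_2x2(u, v):
    """U <= V in the Loewner order for symmetric 2x2 matrices given as nested lists."""
    d = [[v[i][j] - u[i][j] for j in range(2)] for i in range(2)]
    return d[0][0] >= 0 and d[1][1] >= 0 and d[0][0] * d[1][1] - d[0][1] * d[1][0] >= 0

zero = [[0, 0], [0, 0]]
entrywise_positive = [[1, 2], [2, 1]]   # entries are nonnegative, determinant is -3

print(leq_orthant((0, 0), (1, 2)))
print(leq_soc((0, (0, 0)), (5, (3, 4))))
print(leq_loewner_2x2(zero, entrywise_positive))
```

The last call illustrates that the Loewner order is not entrywise comparison: the matrix with rows $(1,2)$ and $(2,1)$ dominates the zero matrix entry by entry, yet it is not positive semidefinite.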
The examples show that choosing the cone selects the type of inequality. Duality requires the corresponding notion of a nonnegative linear functional, which is encoded by the dual cone.
[definition: Dual Cone]
Let $V$ be a finite-dimensional real [inner product space](/page/Inner%20Product%20Space) and let $K \subset V$ be a cone. The dual cone is
\begin{align*}
K^* = \{s \in V : (s,x) \geq 0 \text{ for all } x \in K\}.
\end{align*}
[/definition]
The dual cone consists of all linear certificates that cannot be negative on feasible slack. To know whether such certificates lose information, we need the result saying that a closed convex cone is recovered from all of its nonnegative linear functionals.
[quotetheorem:4108]
[citeproof:4108]
This result is the geometric reason that conic dual variables give complete separating certificates: membership in a closed convex cone can be tested by all nonnegative linear functionals. Both hypotheses matter. If $K=\{0\}\cup\{(x,y)\in\mathbb R^2:y>0\}$, then $K$ is a nonclosed convex cone and $K^{**}=\{(x,y):y\geq 0\}$ is strictly larger than $K$. If $K$ is the union of the two nonnegative coordinate axes in $\mathbb R^2$, then $K$ is closed but not convex, and the double dual recovers the first quadrant rather than the original union. The theorem does not say that every cone is self-dual or that every boundary face is exposed; it only says that closed convex cones are exactly recovered by their dual certificates, which is the input needed for the conic dual construction below.
[example: Self-Dual Cones]
For $\mathbb R^m_+$, the dual cone consists of all $s\in\mathbb R^m$ such that $s^\top x\geq 0$ for every $x\in\mathbb R^m_+$. If $s,x\in\mathbb R^m_+$, then each product $s_i x_i$ is nonnegative, so
\begin{align*}
s^\top x=\sum_{i=1}^m s_i x_i\geq 0.
\end{align*}
Thus $\mathbb R^m_+\subset(\mathbb R^m_+)^*$. Conversely, if $s\in(\mathbb R^m_+)^*$, then the coordinate vector $e_i$ belongs to $\mathbb R^m_+$, and testing against it gives
\begin{align*}
0\leq s^\top e_i=s_i.
\end{align*}
This holds for every $i$, so $s\in\mathbb R^m_+$. Hence $(\mathbb R^m_+)^*=\mathbb R^m_+$.
For the second-order cone $Q_{m+1}=\{(t,z):t\geq |z|\}$, use the standard inner product $(a,p)\cdot(t,z)=at+p^\top z$. If $(a,p),(t,z)\in Q_{m+1}$, then $a\geq |p|$, $t\geq |z|$, and $a,t\geq0$. Since $p^\top z\geq -|p|\,|z|$, we get
\begin{align*}
at+p^\top z\geq at-|p|\,|z|.
\end{align*}
The inequalities $|p|\leq a$ and $|z|\leq t$ give $|p|\,|z|\leq at$, so
\begin{align*}
at-|p|\,|z|\geq at-at=0.
\end{align*}
Therefore $(a,p)\cdot(t,z)\geq0$ for all $(t,z)\in Q_{m+1}$, and $Q_{m+1}\subset Q_{m+1}^*$. Conversely, suppose $(a,p)\in Q_{m+1}^*$. Testing against $(1,0)\in Q_{m+1}$ gives
\begin{align*}
0\leq (a,p)\cdot(1,0)=a.
\end{align*}
If $p=0$, then $a\geq0=|p|$. If $p\neq0$, then $(|p|,-p)\in Q_{m+1}$, so dual feasibility gives
\begin{align*}
0\leq (a,p)\cdot(|p|,-p)=a|p|-|p|^2=|p|(a-|p|).
\end{align*}
Since $|p|>0$, this implies $a\geq |p|$. Thus $(a,p)\in Q_{m+1}$, and $Q_{m+1}^*=Q_{m+1}$.
For $S^n_+$ with inner product $(S,X)=\operatorname{tr}(SX)$, let $S,X\in S^n_+$. The matrix $X^{1/2}SX^{1/2}$ is positive semidefinite, so its eigenvalues are nonnegative and its trace is nonnegative. By cyclic invariance of trace,
\begin{align*}
\operatorname{tr}(SX)=\operatorname{tr}(SX^{1/2}X^{1/2})=\operatorname{tr}(X^{1/2}SX^{1/2})\geq0.
\end{align*}
Thus $S^n_+\subset(S^n_+)^*$. Conversely, if $S\in(S^n_+)^*$, then for every $x\in\mathbb R^n$ the rank-one matrix $xx^\top$ is positive semidefinite. Hence
\begin{align*}
0\leq \operatorname{tr}(Sxx^\top)=\operatorname{tr}(x^\top Sx)=x^\top Sx.
\end{align*}
Therefore $x^\top Sx\geq0$ for every $x\in\mathbb R^n$, which is exactly $S\in S^n_+$. Hence $(S^n_+)^*=S^n_+$. These computations show that the standard nonnegative cone, the second-order cone, and the positive semidefinite cone are self-dual under their standard inner products.
[/example]
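The two halves of the second-order cone argument can be mirrored numerically. The sketch below samples pairs of cone points and records the smallest pairing, and it builds the witness $(|p|,-p)$ used in the converse direction. The function names, the sampling scheme, and the tolerance are assumptions made for illustration, and a random test only supports the inequality rather than proving it.

```python
import math
import random

def norm(z):
    return math.sqrt(sum(zi * zi for zi in z))

def pairing(a, p, t, z):
    """Standard inner product (a, p) . (t, z) = a t + p^T z."""
    return a * t + sum(pi * zi for pi, zi in zip(p, z))

def sample_soc_point(m, rng):
    """Random point (t, z) of the cone, built so that t >= |z|."""
    z = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    return norm(z) + rng.uniform(0.0, 1.0), z

def min_pairing(samples, m, seed=0):
    """Smallest pairing observed between independently sampled cone points."""
    rng = random.Random(seed)
    return min(
        pairing(*sample_soc_point(m, rng), *sample_soc_point(m, rng))
        for _ in range(samples)
    )

def witness_outside_dual(a, p):
    """For (a, p) with a < |p|, return a cone point whose pairing with (a, p) is negative."""
    if a < 0:
        return 1.0, [0.0] * len(p)        # pairing equals a < 0
    return norm(p), [-pi for pi in p]     # pairing equals |p| (a - |p|) < 0

t, z = witness_outside_dual(1.0, [3.0, 4.0])
print(min_pairing(1000, 3))
print(pairing(1.0, [3.0, 4.0], t, z))
```

For the point $(1,(3,4))$, which lies outside the cone because $1<5$, the witness is $(5,(-3,-4))$ and the pairing is $5-25=-20$.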
## Conic Standard Form
What is the smallest template that contains linear programs, norm-constrained robust programs, and semidefinite programs? The answer is an affine equation together with membership in a closed convex cone, which is proper in all of the standard examples.
[definition: Conic Primal Standard Form]
Let $E$ and $F$ be finite-dimensional real inner product spaces, let $A:E\to F$ be linear, let $b\in F$, let $c\in E$, and let $K\subset E$ be a closed convex cone. The primal conic program is the problem of minimising $(c,x)$ over all $x\in E$ such that $Ax=b$ and $x\in K$.
[/definition]
The equality constraint carries the affine modelling, while the cone contains the inequality structure. To price the equality constraint without destroying the cone constraint, the dual problem must leave behind a cost vector that is nonnegative on $K$.
[definition: Conic Dual Standard Form]
For the primal conic program above, let $A^*:F\to E$ be the adjoint of $A$. The dual conic program is the problem of maximising $(b,y)$ over all $y\in F$ such that $c-A^*y\in K^*$.
[/definition]
The dual slack $s=c-A^*y$ is the part of the objective vector left over after pricing the equality constraints. The condition $s\in K^*$ is exactly the requirement that this leftover cost is nonnegative on all cone-feasible directions.
[example: Linear Programming as Conic Programming]
Take $E=\mathbb R^n$, $F=\mathbb R^m$, and $K=\mathbb R^n_+$. With the standard inner products, the conic primal standard form is
\begin{align*}
\inf\{(c,x):Ax=b,\ x\in \mathbb R^n_+\}.
\end{align*}
For vectors in Euclidean space, $(c,x)=c^\top x$, and $x\in\mathbb R^n_+$ means $x_i\geq0$ for every $i$. Thus $x\in\mathbb R^n_+$ is exactly the componentwise condition $x\geq0$, so the primal is
\begin{align*}
\inf\{c^\top x:Ax=b,\ x\geq0\}.
\end{align*}
The adjoint of the matrix map $x\mapsto Ax$ is $y\mapsto A^\top y$. Indeed, expanding the Euclidean inner product gives
\begin{align*}
(Ax)^\top y=\sum_{i=1}^m (Ax)_i y_i.
\end{align*}
Since $(Ax)_i=\sum_{j=1}^n A_{ij}x_j$, this becomes
\begin{align*}
(Ax)^\top y=\sum_{i=1}^m\sum_{j=1}^n A_{ij}x_jy_i.
\end{align*}
Reordering the finite sum gives
\begin{align*}
(Ax)^\top y=\sum_{j=1}^n x_j\left(\sum_{i=1}^m A_{ij}y_i\right).
\end{align*}
The $j$th coordinate of $A^\top y$ is $\sum_{i=1}^m A_{ij}y_i$, so
\begin{align*}
(Ax)^\top y=x^\top A^\top y.
\end{align*}
Therefore $A^*y=A^\top y$.
The conic dual standard form is
\begin{align*}
\sup\{(b,y):c-A^*y\in(\mathbb R^n_+)^*\}.
\end{align*}
Substituting $(b,y)=b^\top y$ and $A^*y=A^\top y$ gives
\begin{align*}
\sup\{b^\top y:c-A^\top y\in(\mathbb R^n_+)^*\}.
\end{align*}
The nonnegative orthant is self-dual under the standard inner product, so $(\mathbb R^n_+)^*=\mathbb R^n_+$. Hence the dual feasibility condition is
\begin{align*}
c-A^\top y\in\mathbb R^n_+.
\end{align*}
Equivalently, each coordinate satisfies $(c-A^\top y)_j\geq0$, which is the vector inequality
\begin{align*}
c-A^\top y\geq0.
\end{align*}
Thus the dual is
\begin{align*}
\sup\{b^\top y:c-A^\top y\geq0\}.
\end{align*}
Equality-form linear programming is therefore the conic standard-form pair obtained by choosing the cone of componentwise nonnegative vectors.
[/example]
Linear programming is the polyhedral case. A different modelling gain appears when a semi-infinite family of linear inequalities is reduced to one second-order cone constraint.
[example: Robust Linear Constraint as an SOCP Constraint]
Suppose $\rho\geq 0$ and $a$ ranges over the Euclidean ball $\{\bar a+u: |u|\leq \rho\}$. For a fixed decision vector $x\in\mathbb R^n$, the robust constraint is
\begin{align*}
(\bar a+u)^\top x\leq \beta \quad \text{for every } u\in\mathbb R^n \text{ with } |u|\leq \rho.
\end{align*}
Expanding the left side gives
\begin{align*}
(\bar a+u)^\top x=\bar a^\top x+u^\top x,
\end{align*}
so the worst possible left side is
\begin{align*}
\bar a^\top x+\sup_{|u|\leq \rho} u^\top x.
\end{align*}
For every $u$ with $|u|\leq \rho$, the *Cauchy-Schwarz inequality* gives
\begin{align*}
u^\top x\leq |u|\,|x|\leq \rho |x|.
\end{align*}
If $x\neq 0$, take
\begin{align*}
u=\rho\frac{x}{|x|}.
\end{align*}
Then
\begin{align*}
|u|=\left|\rho\frac{x}{|x|}\right|=\rho
\end{align*}
and
\begin{align*}
u^\top x=\left(\rho\frac{x}{|x|}\right)^\top x
=\rho\frac{x^\top x}{|x|}
=\rho\frac{|x|^2}{|x|}
=\rho |x|.
\end{align*}
If $x=0$, then $u^\top x=0=\rho |x|$ for every feasible $u$. Hence, in all cases,
\begin{align*}
\sup_{|u|\leq \rho} u^\top x=\rho |x|.
\end{align*}
Therefore the robust constraint is equivalent to
\begin{align*}
\bar a^\top x+\rho |x|\leq \beta.
\end{align*}
Rearranging gives
\begin{align*}
\beta-\bar a^\top x\geq \rho |x|.
\end{align*}
Since $|\rho x|=\rho |x|$ for $\rho\geq0$, this is exactly
\begin{align*}
\beta-\bar a^\top x\geq |\rho x|.
\end{align*}
By the definition
\begin{align*}
Q_{n+1}=\{(t,z)\in\mathbb R\times\mathbb R^n:t\geq |z|\},
\end{align*}
the last inequality is equivalent to
\begin{align*}
(\beta-\bar a^\top x,\rho x)\in Q_{n+1}.
\end{align*}
Thus one second-order cone constraint represents the whole family of linear inequalities indexed by the uncertainty ball.
[/example]
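The reduction can be checked on numbers. The sketch below evaluates the closed-form worst case $\bar a^\top x+\rho|x|$, constructs the maximising perturbation $u=\rho x/|x|$, and tests the second-order cone form of the constraint. All function names and the sample data are illustrative assumptions.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def worst_case_lhs(a_bar, rho, x):
    """Largest value of (a_bar + u)^T x over |u| <= rho, from the closed form."""
    return dot(a_bar, x) + rho * norm(x)

def worst_perturbation(rho, x):
    """A maximising u; when x = 0 every feasible u is optimal, so return 0."""
    n = norm(x)
    return [0.0] * len(x) if n == 0 else [rho * xi / n for xi in x]

def robust_feasible(a_bar, rho, beta, x):
    """Second-order cone form: (beta - a_bar^T x, rho x) lies in Q_{n+1}."""
    return beta - dot(a_bar, x) >= norm([rho * xi for xi in x])

a_bar, rho, x = [1.0, 2.0], 0.5, [3.0, 4.0]
u = worst_perturbation(rho, x)
attained = dot([ai + ui for ai, ui in zip(a_bar, u)], x)

print(worst_case_lhs(a_bar, rho, x), attained)
print(robust_feasible(a_bar, rho, 14.0, x), robust_feasible(a_bar, rho, 13.0, x))
```

Here $\bar a^\top x=11$ and $\rho|x|=2.5$, so the worst-case left side is $13.5$: the constraint holds for $\beta=14$ and fails for $\beta=13$.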
Semidefinite programming enlarges the same template from vector inequalities to matrix inequalities. This is the setting where nonpolyhedral geometry first produces serious duality pathologies.
[example: Semidefinite Feasibility]
Let $C,A_1,\dots,A_m\in S^n$ and let $x=(x_1,\dots,x_m)\in\mathbb R^m$. The semidefinite feasibility constraint is
\begin{align*}
C+\sum_{i=1}^m x_iA_i\in S^n_+.
\end{align*}
Define the affine map $L:\mathbb R^m\to S^n$ by
\begin{align*}
Lx=C+\sum_{i=1}^m x_iA_i.
\end{align*}
Then the constraint is exactly
\begin{align*}
Lx\in S^n_+.
\end{align*}
Thus feasibility is the question whether there exists $x\in\mathbb R^m$ such that $Lx$ lies in the positive semidefinite cone, or equivalently whether
\begin{align*}
\{C+\sum_{i=1}^m x_iA_i:x\in\mathbb R^m\}\cap S^n_+\neq\varnothing.
\end{align*}
The set
\begin{align*}
\{C+\sum_{i=1}^m x_iA_i:x\in\mathbb R^m\}
\end{align*}
is an affine subspace of $S^n$: it is the translate of the linear subspace
\begin{align*}
\operatorname{span}\{A_1,\dots,A_m\}
=
\left\{\sum_{i=1}^m x_iA_i:x\in\mathbb R^m\right\}
\end{align*}
by the matrix $C$. Therefore the semidefinite feasibility problem asks whether this affine subspace intersects the cone $S^n_+$. When it does not, an infeasibility certificate is a symmetric matrix $Y$ whose trace pairing separates them: one seeks inequalities of the form
\begin{align*}
\operatorname{tr}(YZ)\geq 0 \quad \text{for every } Z\in S^n_+
\end{align*}
but
\begin{align*}
\operatorname{tr}\!\left(Y\left(C+\sum_{i=1}^m x_iA_i\right)\right)<0
\quad \text{for every } x\in\mathbb R^m.
\end{align*}
Expanding the affine expression gives
\begin{align*}
\operatorname{tr}\!\left(Y\left(C+\sum_{i=1}^m x_iA_i\right)\right)
=
\operatorname{tr}(YC)+\sum_{i=1}^m x_i\operatorname{tr}(YA_i),
\end{align*}
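so the strict inequality holds for every $x$ exactly when $\operatorname{tr}(YA_i)=0$ for each $i$ and $\operatorname{tr}(YC)<0$. Such a certificate detects incompatibility between the affine equations and positive semidefiniteness. Thus semidefinite feasibility is the geometric problem of intersecting an affine matrix subspace with the positive semidefinite cone.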
[/example]
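A toy instance makes the certificate concrete. In the sketch below the data $C$, $A_1$, and $Y$ are hypothetical and chosen by hand: the family $C+x_1A_1$ always has $(1,1)$ entry $-1$, so it never meets the cone, and $Y=E_{11}$ exposes this. The check uses the sufficient conditions $\operatorname{tr}(YA_i)=0$ and $\operatorname{tr}(YC)<0$, and positive semidefiniteness of $Y$ is assumed rather than tested.

```python
def trace_pair(y, z):
    """Trace inner product tr(YZ) for square matrices given as nested lists."""
    n = len(y)
    return sum(y[i][j] * z[j][i] for i in range(n) for j in range(n))

def is_infeasibility_certificate(y, c, mats):
    """Check tr(Y A_i) = 0 for all i and tr(Y C) < 0; Y is assumed positive semidefinite."""
    return all(trace_pair(y, a) == 0 for a in mats) and trace_pair(y, c) < 0

def pairing_along_family(y, c, mats, x):
    """tr(Y (C + sum_i x_i A_i)) expanded as tr(YC) + sum_i x_i tr(Y A_i)."""
    return trace_pair(y, c) + sum(xi * trace_pair(y, a) for xi, a in zip(x, mats))

C = [[-1, 0], [0, 0]]
A1 = [[0, 1], [1, 0]]
Y = [[1, 0], [0, 0]]   # tr(YZ) = Z_11, which is nonnegative on the cone

print(is_infeasibility_certificate(Y, C, [A1]))
print([pairing_along_family(Y, C, [A1], [x1]) for x1 in (-5, 0, 7)])
```

The pairing equals $-1$ for every choice of $x_1$, while it is nonnegative on every positive semidefinite matrix, so no member of the family can be positive semidefinite.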
## Weak Duality and Complementarity
Why does every dual feasible point give a lower bound on the objective value of every primal feasible point? The answer is that the dual slack pairs nonnegatively with every primal cone point. If $x\in K$ satisfies $Ax=b$ and $y$ is dual feasible with slack $s=c-A^*y\in K^*$, then
\begin{align*}
c\cdot x-b\cdot y
= (A^*y+s)\cdot x-y\cdot Ax
= s\cdot x
\ge 0.
\end{align*}
Thus $b\cdot y\le c\cdot x$ for every primal-dual feasible pair. This is conic weak duality in the notation of the present chapter, and it gives the certificate interpretation of dual feasibility: a dual feasible $y$ proves that no primal feasible point has value below $b\cdot y$. The feasibility hypotheses are essential, because the identity only becomes a nonnegative gap when $Ax=b$, $x\in K$, and $s\in K^*$ all hold. Weak duality also does not imply that either problem has an optimiser or that the best lower bound reaches the primal value; those are strong-duality questions. When equality holds for a feasible pair, the primal variable and dual slack must be orthogonal in the cone-dual-cone pairing.
[quotetheorem:6704]
[citeproof:6704]
Complementarity is an optimality condition, not a regularity condition. It characterises optimal pairs once they exist, but it does not guarantee existence of such a pair. The hypotheses include both feasibility and equality of attained objective values; replacing an optimiser by an infimising sequence gives no vector or matrix with which to impose the orthogonality condition. For example, in nonattained semidefinite formulations where feasible matrices approach a boundary value only by sending another entry to infinity, the limiting value can exist without any primal matrix at the limit, so complementary slackness has no primal-dual pair to test. Thus complementarity belongs after existence and no-gap results, not before them.
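The gap identity and the orthogonality condition can both be seen on a one-constraint linear program, which is the conic pair with $K=\mathbb R^3_+$. The data below are hypothetical, the function name is illustrative, and exact rational arithmetic is used so that equalities can be compared without rounding.

```python
from fractions import Fraction as F

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gap_and_slack_pairing(a, b, c, x, y):
    """Return (c.x - b.y, s.x) with s = c - A^T y; the caller is assumed to supply A x = b."""
    s = [cj - sum(a[i][j] * y[i] for i in range(len(y))) for j, cj in enumerate(c)]
    return dot(c, x) - dot(b, y), dot(s, x)

A = [[1, 1, 1]]          # single equality x_1 + x_2 + x_3 = 1
b = [1]
c = [1, 2, 3]

# A feasible but non-optimal pair: the gap is positive and equals s.x.
print(gap_and_slack_pairing(A, b, c, [F(1, 2), F(1, 4), F(1, 4)], [0]))

# The pair x = (1, 0, 0), y = 1 has slack s = (0, 1, 2), which is orthogonal to x.
print(gap_and_slack_pairing(A, b, c, [1, 0, 0], [1]))
```

In the first pair the gap and the slack pairing are both $7/4$. In the second pair both are $0$, and the support of $x$ is disjoint from the support of $s$, which is the linear programming form of complementary slackness.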
[example: SDP Complementarity]
For semidefinite programming, primal and dual cone variables are matrices $X,S\in S^n_+$. Complementarity says
\begin{align*}
\operatorname{tr}(SX)=0.
\end{align*}
Because $S$ is positive semidefinite, it has a positive semidefinite square root $S^{1/2}$ with $S=S^{1/2}S^{1/2}$. By cyclic invariance of trace,
\begin{align*}
\operatorname{tr}(SX)=\operatorname{tr}(S^{1/2}XS^{1/2}).
\end{align*}
For every $v\in\mathbb R^n$,
\begin{align*}
v^\top(S^{1/2}XS^{1/2})v=(S^{1/2}v)^\top X(S^{1/2}v)\geq 0,
\end{align*}
since $X\in S^n_+$. Hence $S^{1/2}XS^{1/2}\in S^n_+$, and its trace is zero:
\begin{align*}
\operatorname{tr}(S^{1/2}XS^{1/2})=\operatorname{tr}(SX)=0.
\end{align*}
A positive semidefinite matrix has nonnegative eigenvalues, and its trace is the sum of those eigenvalues, so all eigenvalues of $S^{1/2}XS^{1/2}$ are zero. Therefore
\begin{align*}
S^{1/2}XS^{1/2}=0.
\end{align*}
Now fix $v\in\mathbb R^n$. Since $S^{1/2}XS^{1/2}=0$,
\begin{align*}
0=v^\top(S^{1/2}XS^{1/2})v.
\end{align*}
The right-hand side expands as
\begin{align*}
v^\top(S^{1/2}XS^{1/2})v=(S^{1/2}v)^\top X(S^{1/2}v).
\end{align*}
Substituting $X=X^{1/2}X^{1/2}$ gives
\begin{align*}
(S^{1/2}v)^\top X(S^{1/2}v)=(X^{1/2}S^{1/2}v)^\top(X^{1/2}S^{1/2}v).
\end{align*}
Thus
\begin{align*}
|X^{1/2}S^{1/2}v|^2=0.
\end{align*}
Hence $X^{1/2}S^{1/2}v=0$ for every $v$, so
\begin{align*}
X^{1/2}S^{1/2}=0.
\end{align*}
Taking transposes gives
\begin{align*}
S^{1/2}X^{1/2}=0.
\end{align*}
Therefore
\begin{align*}
SX=S^{1/2}(S^{1/2}X^{1/2})X^{1/2}=0.
\end{align*}
Since $S$ and $X$ are symmetric, this also gives
\begin{align*}
XS=(SX)^\top=0.
\end{align*}
If $u=Xa$ lies in the range of $X$ and $w=Sb$ lies in the range of $S$, then
\begin{align*}
u^\top w=(Xa)^\top(Sb)=a^\top XSb=0.
\end{align*}
Complementarity therefore means that the primal and dual semidefinite variables have orthogonal range directions.
[/example]
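A rank-one instance shows the conclusion in coordinates. In the sketch below $X=aa^\top$ and $S=bb^\top$ for hand-picked vectors with $a^\top b=0$, so the ranges are the lines spanned by $a$ and $b$. The helper names and the data are illustrative assumptions.

```python
def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def matmul(p, q):
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

a = [1, 2, 2]
b = [2, -1, 0]           # orthogonal to a: 2 - 2 + 0 = 0
X = outer(a, a)          # positive semidefinite with range spanned by a
S = outer(b, b)          # positive semidefinite with range spanned by b

SX = matmul(S, X)
print(trace(SX))
print(SX)

# A non-complementary pair for comparison: tr(S2 X) = (a . e_1)^2.
S2 = outer([1, 0, 0], [1, 0, 0])
print(trace(matmul(S2, X)))
```

For the orthogonal pair the trace vanishes and so does the full product $SX$, as the example derives. For $S_2=e_1e_1^\top$ the trace is $X_{11}=1$, so that pair is not complementary.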
## Strong Duality and Slater Regularity
What additional hypothesis turns weak duality into equality of optimal values and dual attainment? In conic programming the standard answer is strict feasibility, also called Slater regularity.
[definition: Slater Point]
A primal Slater point for the conic program is a point $x_0\in E$ such that $Ax_0=b$ and $x_0\in \operatorname{int}K$. A dual Slater point is a point $y_0\in F$ such that $c-A^*y_0\in \operatorname{int}K^*$.
[/definition]
Strict feasibility prevents the affine constraint set from merely grazing a lower-dimensional face of the cone. The central question is whether this interior room is enough to upgrade weak duality from a bound into equality with an attained optimiser on the other side.
[quotetheorem:6705]
[citeproof:6705]
The cited result relies on a separation argument with a nondegeneracy step. The point of Slater's condition is not differentiability or curvature; it is the exclusion of abnormal separating hyperplanes. Strict feasibility is needed because a feasible affine space may lie entirely in a proper face of the cone, in which case the separator can be supported by that face instead of producing an attained dual optimiser. The semidefinite boundary pattern where $X_{11}\downarrow0$ and $X_{12}=1$ forces $X_{22}\to\infty$ is a concrete warning: a finite limiting value can be approached while no feasible matrix attains the limiting boundary point. Finiteness is also essential; if the primal value is $-\infty$, there is no finite number for a dual optimum to equal, and if the relevant dual side is unbounded the statement cannot produce an attained finite optimiser.
[example: Slater Condition in Linear and Semidefinite Programming]
For the linear program
\begin{align*}
Ax=b,\qquad x\geq 0,
\end{align*}
the cone is $K=\mathbb R^n_+$. A primal Slater point is therefore a vector $x_0\in\mathbb R^n$ such that
\begin{align*}
Ax_0=b
\end{align*}
and
\begin{align*}
x_0\in\operatorname{int}\mathbb R^n_+.
\end{align*}
Since
\begin{align*}
\operatorname{int}\mathbb R^n_+=\{x\in\mathbb R^n:x_i>0\text{ for every }i\},
\end{align*}
this means exactly that
\begin{align*}
Ax_0=b,\qquad (x_0)_i>0\text{ for every }i.
\end{align*}
For a semidefinite program with matrix variable $X\in S^n_+$ and affine equations such as
\begin{align*}
\operatorname{tr}(A_iX)=b_i,\qquad i=1,\dots,m,
\end{align*}
a primal Slater point is a matrix $X_0\in S^n$ satisfying
\begin{align*}
\operatorname{tr}(A_iX_0)=b_i,\qquad i=1,\dots,m,
\end{align*}
and
\begin{align*}
X_0\in\operatorname{int}S^n_+.
\end{align*}
The interior of the positive semidefinite cone consists of the positive definite matrices, so this condition is
\begin{align*}
X_0\succ 0,
\end{align*}
equivalently
\begin{align*}
v^\top X_0v>0\quad\text{for every nonzero }v\in\mathbb R^n.
\end{align*}
Thus, in the linear case Slater feasibility means all scalar nonnegativity constraints are strict, while in the semidefinite case it means the feasible matrix is positive definite rather than merely positive semidefinite. In both cases the feasible affine set meets the interior of the cone, not just its boundary.
[/example]
## Pathologies Without Regularity
What fails when the feasible set lies entirely on the boundary of a nonpolyhedral cone? The separation theorems still apply, but the projected cone images used by duality may not be closed, and support values may fail to be attained.
[explanation: Nonclosed Images of Closed Cones]
Let $K\subset E$ be a closed convex cone and let $L:E\to F$ be linear. The image $L(K)$ is convex, but it need not be closed when $K$ is nonpolyhedral. In conic duality, feasible right-hand sides and attainable objective bounds are often projections of intersections with $K$. If such a projection is not closed, a limiting certificate may exist at the boundary without being represented by an actual dual feasible point.
[/explanation]
This geometric failure has concrete optimisation consequences. The first is nonattainment: the value is approached along a sequence, but no feasible point realises it.
[example: Nonattainment in a Semidefinite Program]
Consider the semidefinite optimisation problem
\begin{align*}
\inf\{X_{11}:X\in S^2_+,\ X_{12}=1\}.
\end{align*}
For each $\varepsilon>0$, define $X_\varepsilon\in S^2$ by
\begin{align*}
(X_\varepsilon)_{11}=\varepsilon,\qquad (X_\varepsilon)_{12}=(X_\varepsilon)_{21}=1,\qquad (X_\varepsilon)_{22}=\varepsilon^{-1}.
\end{align*}
For any vector $(u,v)\in\mathbb R^2$, the associated quadratic form is
\begin{align*}
\varepsilon u^2+2uv+\varepsilon^{-1}v^2=\left(\sqrt{\varepsilon}\,u+\frac{v}{\sqrt{\varepsilon}}\right)^2\geq 0.
\end{align*}
Therefore $X_\varepsilon\in S^2_+$, and since $(X_\varepsilon)_{12}=1$, the matrix $X_\varepsilon$ is feasible. Its objective value is
\begin{align*}
(X_\varepsilon)_{11}=\varepsilon,
\end{align*}
so feasible values approach $0$ as $\varepsilon\downarrow0$.
Every positive semidefinite matrix has nonnegative diagonal entries, because $e_1^\top Xe_1=X_{11}\geq0$. Hence every feasible $X$ satisfies
\begin{align*}
X_{11}\geq0,
\end{align*}
and the infimum is exactly $0$. The value is not attained. If a feasible matrix $X$ had $X_{11}=0$ and $X_{12}=1$, then for the vector $(t,1)$ positive semidefiniteness would require
\begin{align*}
0\leq t^2X_{11}+2tX_{12}+X_{22}=2t+X_{22}
\end{align*}
for every real $t$. Choosing any $t<-X_{22}/2$ makes $2t+X_{22}<0$, a contradiction.
Equivalently, a symmetric $2\times2$ matrix is positive semidefinite exactly when its diagonal entries and its determinant are nonnegative. For a feasible matrix the determinant condition gives
\begin{align*}
X_{12}^2\leq X_{11}X_{22}.
\end{align*}
Since $X_{12}=1$, this becomes
\begin{align*}
1\leq X_{11}X_{22}.
\end{align*}
Thus along any feasible sequence with $X_{11}\downarrow0$, one must have $X_{22}\geq 1/X_{11}\to\infty$. The limiting value $0$ is approached only by matrices escaping to infinity, not by any finite feasible matrix.
[/example]
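The escaping sequence can be tabulated exactly. The sketch below builds $X_\varepsilon$ with rational entries and tests positive semidefiniteness through the diagonal and determinant conditions for symmetric $2\times2$ matrices. The function names are illustrative.

```python
from fractions import Fraction

def x_eps(eps):
    """The feasible matrix with entries eps, 1, 1, 1/eps from the example."""
    eps = Fraction(eps)
    return [[eps, Fraction(1)], [Fraction(1), 1 / eps]]

def is_psd_2x2(x):
    """Symmetric 2x2 test: nonnegative diagonal entries and nonnegative determinant."""
    return x[0][0] >= 0 and x[1][1] >= 0 and x[0][0] * x[1][1] - x[0][1] * x[1][0] >= 0

for k in range(1, 5):
    x = x_eps(Fraction(1, 10 ** k))
    # Objective X_11 shrinks to 0 while X_22 grows without bound.
    print(x[0][0], x[1][1], is_psd_2x2(x))

# A candidate limit with X_11 = 0 and X_12 = 1 fails for every finite X_22.
print(any(is_psd_2x2([[0, 1], [1, m]]) for m in (1, 10, 10 ** 6)))
```

Each $X_\varepsilon$ has determinant exactly $0$, so it sits on the boundary of the cone, and any matrix with $X_{11}=0$ and $X_{12}=1$ has determinant $-1$ regardless of $X_{22}$.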
A second pathology is a positive duality gap. The primal and dual are both feasible, but their optimal values differ because the separating functional that would close the gap is lost at a nonclosed boundary.
[example: Semidefinite Duality Gap]
Consider the primal semidefinite program
\begin{align*}
p^*=\inf\{X_{11}+X_{22}:X\in S^3_+,\ X_{11}=0,\ 2X_{13}+X_{22}=1\}.
\end{align*}
It is feasible, because the diagonal matrix with diagonal entries $0,1,0$ is positive semidefinite and satisfies $X_{11}=0$ and $2X_{13}+X_{22}=1$.
Let $X\in S^3_+$ be feasible. Since $X_{11}=0$, positive semidefiniteness applied to the vector $(t,0,1)$ gives
\begin{align*}
0\leq (t,0,1)^\top X(t,0,1)=t^2X_{11}+2tX_{13}+X_{33}=2tX_{13}+X_{33}
\end{align*}
for every $t\in\mathbb R$. If $X_{13}>0$, then choosing $t<-X_{33}/(2X_{13})$ makes $2tX_{13}+X_{33}<0$. If $X_{13}<0$, then choosing $t>-X_{33}/(2X_{13})$ makes $2tX_{13}+X_{33}<0$. Hence $X_{13}=0$. The constraint $2X_{13}+X_{22}=1$ then gives
\begin{align*}
X_{22}=1.
\end{align*}
Therefore every primal feasible matrix has objective value
\begin{align*}
X_{11}+X_{22}=0+1=1,
\end{align*}
so $p^*=1$.
To compute the dual, let $E_{ij}$ denote the matrix with a $1$ in entry $(i,j)$ and zeros elsewhere, and set
\begin{align*}
C=E_{11}+E_{22}.
\end{align*}
Set also
\begin{align*}
A_1=E_{11}.
\end{align*}
Finally set
\begin{align*}
A_2=E_{13}+E_{22}+E_{31}.
\end{align*}
Then the trace pairing gives
\begin{align*}
\operatorname{tr}(CX)=X_{11}+X_{22}.
\end{align*}
It also gives
\begin{align*}
\operatorname{tr}(A_1X)=X_{11}.
\end{align*}
For $A_2$, the three nonzero terms give
\begin{align*}
\operatorname{tr}(A_2X)=X_{31}+X_{22}+X_{13}=2X_{13}+X_{22},
\end{align*}
because $X$ is symmetric. Thus the dual is
\begin{align*}
d^*=\sup\{y_2:C-y_1A_1-y_2A_2\in S^3_+\}.
\end{align*}
Write the dual slack as $S=C-y_1A_1-y_2A_2$. Its relevant entries are
\begin{align*}
S_{11}=1-y_1,\qquad S_{22}=1-y_2,\qquad S_{13}=S_{31}=-y_2,\qquad S_{33}=0,
\end{align*}
with all other entries equal to $0$. If $S\in S^3_+$, then applying positive semidefiniteness to $(t,0,1)$ gives
\begin{align*}
0\leq (t,0,1)^\top S(t,0,1)=(1-y_1)t^2-2y_2t
\end{align*}
for every $t\in\mathbb R$. If $y_2\neq0$, choose $t$ with the same sign as $y_2$ and with $|t|$ sufficiently small. Then the linear term $-2y_2t$ is negative and has order $|t|$, while the quadratic term $(1-y_1)t^2$ has order $t^2$, so the whole expression is negative. Hence dual feasibility forces
\begin{align*}
y_2=0.
\end{align*}
Conversely, if $y_2=0$ and $y_1\leq1$, then $S$ is diagonal with diagonal entries $1-y_1,1,0$, so $S\in S^3_+$. Therefore every dual feasible point has objective value
\begin{align*}
y_2=0,
\end{align*}
and hence
\begin{align*}
d^*=0.
\end{align*}
Both primal and dual are feasible, but
\begin{align*}
p^*=1>0=d^*.
\end{align*}
Weak duality still holds for every feasible primal-dual pair; the missing ingredient is regularity, since the primal constraints force every feasible matrix onto the boundary face where $X_{11}=0$.
[/example]
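The two halves of the gap computation can be checked with a short script. This is a sketch under stated assumptions: the helper names `quad_form` and `dual_slack`, the choice $y_1=0$, and the test point $t=y_2/4$ are illustrative and not taken from the example.

```python
def quad_form(M, v):
    """Return v^T M v for a square matrix M given as nested lists."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

def dual_slack(y1, y2):
    """S = C - y1*A1 - y2*A2 with C = E11 + E22, A1 = E11, A2 = E13 + E22 + E31."""
    return [[1.0 - y1, 0.0, -y2],
            [0.0, 1.0 - y2, 0.0],
            [-y2, 0.0, 0.0]]

# Primal feasible point from the example: diag(0, 1, 0), objective X11 + X22.
X = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
primal_value = X[0][0] + X[1][1]

# For y2 != 0, the test vector (t, 0, 1) with a small t of the same sign as y2
# gives a negative quadratic form, so the slack is not positive semidefinite.
violations = []
for y2 in (0.5, -0.5, 1e-3):
    t = 0.25 * y2                      # illustrative choice with y1 = 0
    violations.append(quad_form(dual_slack(0.0, y2), [t, 0.0, 1.0]))

dual_value = 0.0                        # dual objective y2 at the feasible point y = (0, 0)
print(primal_value, dual_value, violations)
```

Every entry of `violations` is negative, matching the argument that dual feasibility forces $y_2=0$.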
The third pathology is infeasibility without a strong certificate. To organise these cases, it is useful to separate ordinary infeasibility certificates from the facial reduction process that exposes the minimal cone face containing the feasible set.
## Certificates and Alternative Theorems
How can infeasibility of a conic system be proved by a finite object? For linear inequalities, Farkas' lemma supplies a multiplier certificate. The conic version replaces nonnegative multipliers by elements of the dual cone.
[quotetheorem:6706]
[citeproof:6706]
The theorem is a conic analogue of Farkas' lemma: infeasibility is witnessed by a dual object separating the affine equations from the cone. The closedness condition is automatic for polyhedral cones, which is why linear programming has cleaner alternatives. For semidefinite cones it can fail, and then infeasibility may require a sequence of exposing certificates rather than a single strict separator.
This distinction matters algorithmically because a missing certificate does not always mean that the system is nearly feasible in the original cone. It can mean that the affine constraints live entirely inside a smaller face where the correct dual alternative has to be applied. The next definition names the face exposed by one such certificate, which is the basic step used to reduce the ambient cone without losing any feasible points.
[definition: Exposed Face by a Dual Certificate]
Let $K\subset E$ be a closed convex cone and let $s\in K^*$. The face exposed by $s$ is
\begin{align*}
K\cap s^\perp=\{x\in K:(s,x)=0\}.
\end{align*}
[/definition]
Exposed faces are where complementarity forces feasible solutions to live once a dual certificate is known. If the original cone is too large for the affine constraints, reducing to such a face can restore regularity.
[explanation: Facial Reduction Viewpoint]
Facial reduction starts with a conic system $Ax=b$, $x\in K$ whose feasible set has empty intersection with $\operatorname{int}K$. A certificate $s\in K^*$ satisfying $(s,x)=0$ for every feasible $x$ exposes a smaller face $K_1=K\cap s^\perp$ that still contains all feasible points. The process repeats until the feasible set meets the relative interior of the current face or infeasibility is certified. In this form, Slater regularity is recovered after replacing the ambient cone by the minimal face containing the feasible region.
[/explanation]
Facial reduction explains why conic duality pathologies are often boundary phenomena. The problem may be regular relative to a smaller face even when it is singular relative to the original cone.
[example: Boundary Feasibility in the Semidefinite Cone]
Suppose the affine constraints imply $Xv=0$ for every feasible $X\in S^n_+$, where $v\in\mathbb R^n$ is fixed and nonzero. Then every feasible matrix belongs to
\begin{align*}
F_v=\{X\in S^n_+ : Xv=0\}.
\end{align*}
We show that this face is exposed by the dual certificate $vv^\top$. For any $X\in S^n_+$,
\begin{align*}
\operatorname{tr}(vv^\top X)=\operatorname{tr}(v^\top Xv).
\end{align*}
Since $v^\top Xv$ is a $1$ by $1$ matrix, its trace is its only entry, so
\begin{align*}
\operatorname{tr}(v^\top Xv)=v^\top Xv.
\end{align*}
Positive semidefiniteness of $X$ gives
\begin{align*}
v^\top Xv\geq 0.
\end{align*}
Thus $vv^\top$ is nonnegative on $S^n_+$, so $vv^\top\in (S^n_+)^*$.
The face exposed by $vv^\top$ is
\begin{align*}
S^n_+\cap (vv^\top)^\perp=\{X\in S^n_+:\operatorname{tr}(vv^\top X)=0\}.
\end{align*}
Using the trace identity above, this is
\begin{align*}
\{X\in S^n_+:v^\top Xv=0\}.
\end{align*}
If $X\in S^n_+$ and $v^\top Xv=0$, write $X=X^{1/2}X^{1/2}$ with $X^{1/2}\in S^n_+$. Then
\begin{align*}
0=v^\top Xv=v^\top X^{1/2}X^{1/2}v.
\end{align*}
The right-hand side is
\begin{align*}
v^\top X^{1/2}X^{1/2}v=(X^{1/2}v)^\top(X^{1/2}v)=|X^{1/2}v|^2.
\end{align*}
Hence $X^{1/2}v=0$, and multiplying by $X^{1/2}$ gives
\begin{align*}
Xv=X^{1/2}(X^{1/2}v)=0.
\end{align*}
Conversely, if $Xv=0$, then
\begin{align*}
v^\top Xv=v^\top 0=0.
\end{align*}
Therefore
\begin{align*}
S^n_+\cap (vv^\top)^\perp=\{X\in S^n_+ : Xv=0\}=F_v.
\end{align*}
Thus the feasible set already lives in the exposed face $F_v$, and replacing $S^n_+$ by this smaller face removes cone directions that no feasible matrix can use.
[/example]
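The exposed face $F_v$ can be illustrated on a concrete vector. The sketch below assumes $v=(1,2,2)$ and uses the orthogonal projector onto $v^\perp$ as a sample element of the face; both choices, and the helper names, are hypothetical.

```python
def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def trace_pairing(S, X):
    """Trace inner product tr(SX) for symmetric nested-list matrices."""
    n = len(S)
    return sum(S[i][j] * X[i][j] for i in range(n) for j in range(n))

v = [1.0, 2.0, 2.0]                     # illustrative nonzero vector
n = len(v)
norm2 = sum(c * c for c in v)
certificate = [[v[i] * v[j] for j in range(n)] for i in range(n)]   # v v^T

# Orthogonal projector onto the complement of v: positive semidefinite and P v = 0.
P = [[(1.0 if i == j else 0.0) - v[i] * v[j] / norm2 for j in range(n)]
     for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

print(trace_pairing(certificate, P), matvec(P, v))   # both numerically zero
print(trace_pairing(certificate, identity))          # equals |v|^2, so I is outside the face
```

The projector pairs to zero with $vv^\top$ and annihilates $v$, while the identity matrix, which lies in the interior of the cone, pairs to $|v|^2>0$.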
## Regularity Conditions and the Conic Modelling Lesson
Which assumptions should be checked before applying a duality theorem? In conic programming, feasibility alone is not enough; the geometry of the feasible affine space relative to the cone matters.
[definition: Primal and Dual Values]
The primal value is
\begin{align*}
p^*=\inf\{(c,x): Ax=b,\ x\in K\}.
\end{align*}
The dual value is
\begin{align*}
d^*=\sup\{(b,y): c-A^*y\in K^*\}.
\end{align*}
[/definition]
Weak duality always gives $d^*\leq p^*$. Strong duality requires an additional condition such as Slater feasibility, polyhedrality in the relevant image, or successful facial reduction to a smaller regular cone.
[remark: Polyhedral Versus Semidefinite Behaviour]
For polyhedral cones, linear images are closed, alternative theorems have single certificates, and many linear programming duality statements hold under mild feasibility and boundedness assumptions. For semidefinite and second-order cones, strict feasibility is a genuine hypothesis rather than a cosmetic strengthening. The difference is caused by the curved boundary geometry of nonpolyhedral cones.
[/remark]
The modelling lesson is that conic form is not merely notation. It determines the cone, the dual cone, the available certificates, and the regularity assumptions needed for exact duality.
[example: Reading a Conic Model]
Given a proposed conic formulation, first rewrite it in primal standard form:
\begin{align*}
p^*=\inf\{(c,x):Ax=b,\ x\in K\}.
\end{align*}
Here $x\in E$, the equality residual $Ax-b$ lies in $F$, the cost vector $c$ lies in $E$, and $K\subset E$ is the cone that defines the inequality part of the model.
The dual cone is
\begin{align*}
K^*=\{s\in E:(s,z)\geq 0\text{ for every }z\in K\}.
\end{align*}
For a multiplier $y\in F$, compute the adjoint $A^*:F\to E$ from the identity
\begin{align*}
(Ax,y)=(x,A^*y)\text{ for every }x\in E.
\end{align*}
The dual slack is then
\begin{align*}
s=c-A^*y.
\end{align*}
Thus dual feasibility is the cone condition
\begin{align*}
c-A^*y\in K^*.
\end{align*}
This calculation already gives the weak-duality certificate. If $x$ is primal feasible and $y$ is dual feasible, then $Ax=b$ gives
\begin{align*}
(b,y)=(Ax,y).
\end{align*}
By the defining identity for the adjoint,
\begin{align*}
(Ax,y)=(x,A^*y).
\end{align*}
Substituting this into the objective difference gives
\begin{align*}
(c,x)-(b,y)=(c,x)-(x,A^*y).
\end{align*}
Since the inner product is symmetric,
\begin{align*}
(x,A^*y)=(A^*y,x).
\end{align*}
Therefore
\begin{align*}
(c,x)-(b,y)=(c-A^*y,x).
\end{align*}
Using $s=c-A^*y$, this becomes
\begin{align*}
(c,x)-(b,y)=(s,x).
\end{align*}
Because $s\in K^*$ and $x\in K$, the definition of the dual cone gives
\begin{align*}
(s,x)\geq 0.
\end{align*}
Hence
\begin{align*}
(b,y)\leq (c,x).
\end{align*}
So a dual feasible point always gives a valid lower bound on every primal feasible objective value. To conclude equality of optimal values or attainment, one must separately check a regularity condition, such as a primal Slater point $x_0$ with $Ax_0=b$ and $x_0\in\operatorname{int}K$, or a dual Slater point $y_0$ with $c-A^*y_0\in\operatorname{int}K^*$. This is the basic reading rule for conic models: the slack condition gives weak-duality certificates automatically, while strong duality requires additional geometry.
[/example]
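The reading rule can be traced on the simplest cone, $K=\mathbb R^3_+$, which is its own dual. The data `A`, `b`, `c`, the feasible point `x`, and the multiplier `y` below are made-up values chosen only to illustrate the identity $(c,x)-(b,y)=(s,x)$.

```python
def dot(u, v):
    """Standard inner product on R^n."""
    return sum(a * b for a, b in zip(u, v))

# Illustrative data: one equality constraint x1 + x2 + x3 = 1 on the cone R^3_+.
A = [[1.0, 1.0, 1.0]]
b = [1.0]
c = [1.0, 2.0, 3.0]

x = [0.2, 0.3, 0.5]          # primal feasible: Ax = b and x >= 0
y = [0.5]                    # candidate multiplier

# Adjoint: (A^* y)_j = sum_i A[i][j] * y[i], so that (Ax, y) = (x, A^* y).
A_star_y = [sum(A[i][j] * y[i] for i in range(len(y))) for j in range(len(c))]
s = [cj - aj for cj, aj in zip(c, A_star_y)]   # dual slack s = c - A^* y

gap = dot(c, x) - dot(b, y)
print(s, gap, dot(s, x))
```

The slack is componentwise nonnegative, so $y$ is dual feasible, and the objective gap coincides with the pairing $(s,x)$ as in the derivation.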
# 9. Second-Order Cone and Semidefinite Programming
Second-order cone programming and semidefinite programming extend linear programming by replacing coordinatewise nonnegativity with richer convex cones. The aim of this chapter is to recognise common nonlinear convex constraints as conic constraints, and to understand why matrix positivity gives a powerful language for relaxations. The chapter builds on Chapter 8's conic duality framework: separation remains the organising principle, while Slater-type interior conditions explain when the primal and dual optimal values coincide.
## Quadratic Constraints and Second-Order Cones
How can a nonlinear convex constraint such as
\begin{align*}
|Ax+b| \le c^\top x+d
\end{align*}
be treated with the same duality machinery as a linear programme? The answer is to regard it as membership of an affine image of $x$ in a fixed cone. The basic cone for Euclidean norm inequalities is the Lorentz cone.
[definition: Lorentz Cone]
For $n \ge 2$, the Lorentz cone in $\mathbb R^n$ is
\begin{align*}
\mathcal Q_n := \{(t,x) \in \mathbb R \times \mathbb R^{n-1} : |x| \le t\}.
\end{align*}
[/definition]
The scalar coordinate $t$ must be nonnegative for every point in $\mathcal Q_n$. Geometrically, $\mathcal Q_n$ is the epigraph of the Euclidean norm, so linear constraints into this cone encode norm bounds without leaving the conic framework.
[example: Norm Inequality as a Cone Constraint]
Let $A \in \mathbb R^{m \times n}$, $b \in \mathbb R^m$, $c \in \mathbb R^n$, and $d \in \mathbb R$. For a fixed $x \in \mathbb R^n$, put
\begin{align*}
t=c^\top x+d
\end{align*}
and
\begin{align*}
y=Ax+b \in \mathbb R^m.
\end{align*}
By the definition of the Lorentz cone,
\begin{align*}
(t,y)\in \mathcal Q_{m+1} \iff |y|\le t.
\end{align*}
Substituting $t=c^\top x+d$ and $y=Ax+b$ gives
\begin{align*}
(c^\top x+d,Ax+b)\in \mathcal Q_{m+1} \iff |Ax+b|\le c^\top x+d.
\end{align*}
Since $|Ax+b|\ge 0$, this condition also forces
\begin{align*}
c^\top x+d\ge 0.
\end{align*}
Thus the Euclidean norm bound is exactly an affine map of $x$ followed by membership in the fixed cone $\mathcal Q_{m+1}$.
[/example]
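The affine-map-then-cone reading translates directly into a membership test. In this sketch the function names `in_lorentz` and `cone_image`, the tolerance, and the sample data are assumptions made for illustration.

```python
import math

def in_lorentz(t, y, tol=1e-12):
    """Membership test for (t, y) in the Lorentz cone: |y| <= t."""
    return math.sqrt(sum(c * c for c in y)) <= t + tol

def cone_image(A, b, c, d, x):
    """Affine map x -> (c^T x + d, A x + b)."""
    t = sum(ci * xi for ci, xi in zip(c, x)) + d
    y = [sum(aij * xj for aij, xj in zip(row, x)) + bi for row, bi in zip(A, b)]
    return t, y

A = [[3.0, 0.0], [0.0, 4.0]]
b = [0.0, 0.0]
c = [0.0, 0.0]
d = 5.0

inside = in_lorentz(*cone_image(A, b, c, d, [1.0, 1.0]))    # |(3, 4)| = 5 <= 5
outside = in_lorentz(*cone_image(A, b, c, d, [2.0, 1.0]))   # |(6, 4)| > 5
print(inside, outside)
```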
The Lorentz cone handles a Euclidean norm bounded by one affine scalar. This motivates a second cone for constraints where a squared norm is bounded by a product of two nonnegative affine quantities.
[definition: Rotated Lorentz Cone]
For $n \ge 3$, the rotated Lorentz cone is
\begin{align*}
\mathcal Q_n^r := \{(u,v,x) \in \mathbb R \times \mathbb R \times \mathbb R^{n-2} : 2uv \ge |x|^2,\ u \ge 0,\ v \ge 0\}.
\end{align*}
[/definition]
The rotated cone is linearly isomorphic to the Lorentz cone, but its coordinates expose the convexity of quadratic-over-linear expressions. In applications it often appears when replacing $|x|^2 \le yz$ with $y,z\ge 0$ by the cone membership constraint $(y/2,z,x)\in\mathcal Q_n^r$.
[example: Quadratic Over Linear Epigraph]
The epigraph of $x \mapsto |x|^2/t$ on $t>0$ is encoded by the rotated Lorentz cone. Fix $s \in \mathbb R$, $t>0$, and $x \in \mathbb R^m$. Since $t$ is positive, multiplying by $t$ preserves the inequality:
\begin{align*}
\frac{|x|^2}{t} \le s \iff |x|^2 \le st.
\end{align*}
Also, $\frac{|x|^2}{t}\ge 0$, so the inequality $\frac{|x|^2}{t}\le s$ forces $s\ge 0$.
By the definition of the rotated Lorentz cone,
\begin{align*}
(s/2,t,x)\in \mathcal Q_{m+2}^r \iff 2(s/2)t\ge |x|^2 \text{ and } s/2\ge 0 \text{ and } t\ge 0.
\end{align*}
The first condition is exactly $st\ge |x|^2$, the second is exactly $s\ge 0$, and the third is automatic from the assumption $t>0$. Therefore
\begin{align*}
(s/2,t,x)\in \mathcal Q_{m+2}^r \iff st\ge |x|^2 \text{ and } s\ge 0.
\end{align*}
Since $t>0$, the inequality $st\ge |x|^2$ is equivalent to $s\ge |x|^2/t$, and this already implies $s\ge 0$. Hence
\begin{align*}
\frac{|x|^2}{t}\le s \iff (s/2,t,x)\in \mathcal Q_{m+2}^r.
\end{align*}
Thus the nonlinear epigraph constraint with division by a positive variable is exactly a rotated second-order cone membership constraint.
[/example]
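The equivalence between the epigraph inequality and rotated-cone membership can be spot-checked on random samples. The sketch below is illustrative only: the function names, the sampling ranges, and the seed are assumptions, and floating-point ties exactly on the boundary are not treated specially.

```python
import random

def in_rotated(u, v, x):
    """Membership test for (u, v, x) in the rotated Lorentz cone."""
    return u >= 0 and v >= 0 and 2 * u * v >= sum(c * c for c in x)

def epigraph(s, t, x):
    """Direct test of |x|^2 / t <= s for t > 0."""
    return sum(c * c for c in x) / t <= s

random.seed(0)
agree = True
for _ in range(1000):
    s = random.uniform(-1.0, 4.0)
    t = random.uniform(0.1, 3.0)          # t stays strictly positive
    x = [random.uniform(-2.0, 2.0) for _ in range(3)]
    agree = agree and (in_rotated(s / 2, t, x) == epigraph(s, t, x))
print(agree)
```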
These examples show two reusable conic templates. This motivates naming the optimisation class obtained by imposing finitely many affine constraints into Lorentz and rotated Lorentz cones.
[definition: Second-Order Cone Programme]
A second-order cone programme is an optimisation problem with decision variable $x \in \mathbb R^n$ that minimises a linear objective $c^\top x$, where $c \in \mathbb R^n$, subject to finitely many constraints of the form
\begin{align*}
A_i x+b_i \in \mathcal Q_{n_i}, \qquad E_jx+f_j \in \mathcal Q_{m_j}^r, \qquad Gx=h.
\end{align*}
Here $A_i: \mathbb R^n \to \mathbb R^{n_i}$ and $E_j: \mathbb R^n \to \mathbb R^{m_j}$ are linear maps, $b_i \in \mathbb R^{n_i}$ and $f_j \in \mathbb R^{m_j}$, and $G: \mathbb R^n \to \mathbb R^r$ is a [linear map](/page/Linear%20Map) with $h \in \mathbb R^r$.
[/definition]
The formulation matters because all constraints are affine-conic constraints. The separation and Lagrange multiplier theory from earlier chapters can therefore be applied after replacing $\mathbb R_+^m$ by a product of Lorentz and rotated Lorentz cones.
[example: Minimum Enclosing Ball]
Given points $a_1,\dots,a_N \in \mathbb R^d$, choose a centre $x \in \mathbb R^d$ and a radius $r \in \mathbb R$. For each $i$, the definition of the Lorentz cone gives
\begin{align*}
(r,a_i-x)\in \mathcal Q_{d+1}\iff |a_i-x|\le r.
\end{align*}
Therefore the containment constraints for all points,
\begin{align*}
|a_i-x|\le r \text{ for } i=1,\dots,N,
\end{align*}
are exactly equivalent to the cone constraints
\begin{align*}
(r,a_i-x)\in \mathcal Q_{d+1} \text{ for } i=1,\dots,N.
\end{align*}
Since $|a_i-x|\ge 0$, any feasible instance with $N\ge 1$ also forces $r\ge 0$.
Thus the minimum enclosing ball problem is the second-order cone programme that minimises $r$ over $x\in \mathbb R^d$ and $r\in \mathbb R$ subject to
\begin{align*}
(r,a_i-x)\in \mathcal Q_{d+1} \text{ for } i=1,\dots,N.
\end{align*}
The nonlinear requirement that every point lies in the Euclidean ball of radius $r$ around $x$ has been rewritten as affine membership in a fixed Lorentz cone.
[/example]
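The cone constraints of the enclosing-ball model can be evaluated for a candidate centre and radius. This sketch does not solve the programme; it only checks feasibility on three made-up points, and the helper name `ball_constraints_hold` is hypothetical.

```python
import math

def ball_constraints_hold(points, centre, r):
    """Check (r, a_i - x) in the Lorentz cone for every point a_i."""
    return all(math.dist(a, centre) <= r for a in points)

points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]   # illustrative data
centre = (1.0, 0.0)

feasible = ball_constraints_hold(points, centre, 1.0)    # every point is at distance exactly 1
too_small = ball_constraints_hold(points, centre, 0.9)   # radius below that distance

# By the triangle inequality, any feasible radius is at least half the
# largest pairwise distance between the points.
lower_bound = max(math.dist(p, q) for p in points for q in points) / 2
print(feasible, too_small, lower_bound)
```

For this data the feasible radius $1$ meets the lower bound, so the candidate ball is optimal for the sample instance.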
The SOCP formulation captures norm inequalities, but quadratic expressions also appear through block matrices. This motivates the Schur complement, which is the main algebraic bridge from quadratic inequalities to positive semidefinite constraints.
[quotetheorem:6707]
[citeproof:6707]
This lemma turns many convex quadratic inequalities into linear matrix inequalities, but it is not a general rule for arbitrary block pivots. The positive definiteness of the chosen pivot is the hypothesis that makes the inverse exist and makes completing the square reversible. If the $A$-pivot is singular, the displayed equivalence is not even defined; for example, take $A=0$, $C=1$, and $B=1$ in the scalar block case. Then the quadratic form is $2uv+v^2=v(2u+v)$, which takes negative values, for instance $-1$ at $(u,v)=(-1,1)$, so the block matrix is not positive semidefinite, and no Schur complement with $A^{-1}$ can be formed to detect this.
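In the scalar block case the lemma and its failure mode can be compared directly. The following sketch assumes $1\times1$ blocks, so the block matrix is $2\times2$; the function names, tolerance, and sample matrices are illustrative assumptions.

```python
def is_psd_2x2(a, b, c, tol=1e-12):
    """[[a, b], [b, c]] is PSD iff both diagonal entries and the determinant are nonnegative."""
    return a >= -tol and c >= -tol and a * c - b * b >= -tol

def schur_test(a, b, c, tol=1e-12):
    """Schur complement test with pivot a; only defined for a > 0."""
    if a <= 0:
        raise ValueError("pivot must be positive definite")
    return c - b * b / a >= -tol

# With a positive pivot the two tests agree on these illustrative samples.
samples = [(2.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.0, -3.0, 9.0), (4.0, 1.0, 0.0)]
agreement = [is_psd_2x2(*m) == schur_test(*m) for m in samples]

# Singular pivot: A = 0, B = 1, C = 1 gives the quadratic form 2uv + v^2.
u, v = -1.0, 1.0
value = 2 * u * v + v * v
print(agreement, value, is_psd_2x2(0.0, 1.0, 1.0))
```

The guard in `schur_test` mirrors the hypothesis of the lemma: with a zero pivot there is no complement to evaluate, and the direct test shows the matrix is not positive semidefinite.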
The same warning applies to the other pivot. If $C=0$, $A=1$, and $B=1$, then the quadratic form is $u^2+2uv$, which again takes negative values for suitable $u,v$, and $BC^{-1}B^\top$ has no meaning. The theorem also does not say that every positive semidefinite block matrix must have a positive definite diagonal block; rank-deficient positive semidefinite matrices exist, but their Schur complement criterion requires extra range conditions rather than the clean inverse formula above. The next theorem asks a sharper question: when does a quadratic implication admit a scalar multiplier certificate?
[quotetheorem:6708]
[citeproof:6708]
The S-Lemma explains why the trust-region relaxation is exact but also why the same conclusion is special. The strict feasibility hypothesis is not cosmetic. If $q_1(x)=x^2$ and $q_0(x)=x$, then $q_1(x)\le 0$ forces $x=0$, so the implication $q_1(x)\le 0 \implies q_0(x)\ge 0$ holds. However, for every $\lambda \ge 0$ the polynomial $q_0+\lambda q_1=x+\lambda x^2$ takes negative values near $x=0$ with $x<0$, so no multiplier certificate exists. The failure occurs because the feasible set has no point with $q_1<0$; it lies entirely on the boundary $q_1=0$. The result is also genuinely a one-constraint theorem; with two or more independent quadratic constraints, there are quadratic implications that hold on the feasible set but cannot be represented by nonnegative scalar multipliers alone. This is the obstruction behind the gap between general quadratically constrained quadratic programming and the exact trust-region case.
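The failure of the multiplier certificate for $q_1(x)=x^2$, $q_0(x)=x$ can be made explicit. In this sketch the witness $x=-1/(2(\lambda+1))$ is one convenient choice and the sampled values of $\lambda$ are arbitrary; neither comes from the text.

```python
def certificate_poly(lam, x):
    """q0(x) + lam * q1(x) with q0(x) = x and q1(x) = x^2."""
    return x + lam * x * x

witnesses = {}
for lam in (0.0, 1.0, 10.0, 1e3):
    x = -1.0 / (2.0 * (lam + 1.0))      # a negative point close to 0; illustrative choice
    witnesses[lam] = certificate_poly(lam, x)
print(witnesses)
```

At the chosen point the factor $1+\lambda x$ is at least $1/2$ while $x<0$, so $x+\lambda x^2=x(1+\lambda x)$ is negative for every sampled multiplier.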
[example: Trust-Region Subproblem]
Let $H \in \mathbb R^{n \times n}$ be symmetric, $g \in \mathbb R^n$, and $\Delta>0$. The trust-region problem is
\begin{align*}
\inf\{x^\top Hx+2g^\top x: |x|\le \Delta\}.
\end{align*}
Writing the constraint as $|x|^2-\Delta^2\le 0$, a multiplier $\lambda\ge 0$ gives
\begin{align*}
L(x,\lambda)=x^\top Hx+2g^\top x+\lambda(|x|^2-\Delta^2).
\end{align*}
Expanding the terms containing $\lambda$ gives
\begin{align*}
L(x,\lambda)=x^\top(H+\lambda I)x+2g^\top x-\lambda\Delta^2.
\end{align*}
For fixed $\lambda$, put $A_\lambda=H+\lambda I$. If there is a vector $v$ with $v^\top A_\lambda v<0$, then
\begin{align*}
L(tv,\lambda)=t^2v^\top A_\lambda v+2t g^\top v-\lambda\Delta^2.
\end{align*}
The quadratic term has negative coefficient, so $L(tv,\lambda)\to -\infty$ as $|t|\to\infty$. Thus finite dual values require $A_\lambda\succeq 0$. If $A_\lambda\succeq 0$ but $g\notin \operatorname{range}(A_\lambda)$, then, because $A_\lambda$ is symmetric, $\operatorname{range}(A_\lambda)^\perp=\ker(A_\lambda)$, so there is $v\in \ker(A_\lambda)$ with $g^\top v\ne 0$. For every $x$ and scalar $t$,
\begin{align*}
L(x+tv,\lambda)=L(x,\lambda)+2t g^\top v,
\end{align*}
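because $A_\lambda v=0$. Choosing the sign of $t$ so that $2t g^\top v\to -\infty$ shows that the Lagrangian is again unbounded below. Conversely, if $A_\lambda\succeq 0$ and $g\in\operatorname{range}(A_\lambda)$, then $L(\cdot,\lambda)$ is a convex quadratic with a stationary point, so it is bounded below. Hence the finite Lagrange dual values occur exactly when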
\begin{align*}
H+\lambda I\succeq 0
\end{align*}
and
\begin{align*}
g\in \operatorname{range}(H+\lambda I).
\end{align*}
Equivalently, a number $\tau$ is certified as a lower bound if there is $\lambda\ge 0$ such that
\begin{align*}
x^\top Hx+2g^\top x-\tau+\lambda(|x|^2-\Delta^2)\ge 0 \text{ for all }x\in \mathbb R^n.
\end{align*}
The polynomial on the left is
\begin{align*}
x^\top(H+\lambda I)x+2g^\top x-\tau-\lambda\Delta^2.
\end{align*}
This is the quadratic form of the block matrix
\begin{align*}
M_{\lambda,\tau}=\begin{pmatrix}H+\lambda I & g\cr g^\top & -\tau-\lambda\Delta^2\end{pmatrix}
\end{align*}
evaluated at $(x,1)$, since
\begin{align*}
(x,1)^\top M_{\lambda,\tau}(x,1)=x^\top(H+\lambda I)x+x^\top g+g^\top x-\tau-\lambda\Delta^2.
\end{align*}
Because $x^\top g=g^\top x$, this equals
\begin{align*}
x^\top(H+\lambda I)x+2g^\top x-\tau-\lambda\Delta^2.
\end{align*}
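Nonnegativity of this form at every $(x,1)$ extends by scaling to every $(x,s)$ with $s\neq 0$, and by continuity to $s=0$. Thus the lower-bound certificate is the semidefinite condition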
\begin{align*}
M_{\lambda,\tau}\succeq 0
\end{align*}
together with
\begin{align*}
\lambda\ge 0.
\end{align*}
When $H+\lambda I\succ 0$, the *Schur Complement Lemma* rewrites the block constraint as
\begin{align*}
-\tau-\lambda\Delta^2-g^\top(H+\lambda I)^{-1}g\ge 0.
\end{align*}
Equivalently,
\begin{align*}
\tau\le -\lambda\Delta^2-g^\top(H+\lambda I)^{-1}g.
\end{align*}
Finally, $x=0$ is a strict feasible point for the quadratic constraint because $|0|^2-\Delta^2=-\Delta^2<0$. The *S-Lemma* therefore applies to this single quadratic constraint and shows that the best lower bound obtained from these scalar multiplier certificates is exact for the trust-region problem.
[/example]
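Exactness of the multiplier bound can be observed on a one-dimensional nonconvex instance. The sketch below assumes $H=-1$, $g=1$, $\Delta=1$ and replaces both optimisations by grid searches; these data, the grid sizes, and the function names are illustrative assumptions.

```python
def objective(x, h=-1.0, g=1.0):
    """One-dimensional trust-region objective h*x^2 + 2*g*x."""
    return h * x * x + 2.0 * g * x

def dual_bound(lam, h=-1.0, g=1.0, delta=1.0):
    """Schur-complement bound -lam*delta^2 - g^2/(h + lam), valid when h + lam > 0."""
    return -lam * delta ** 2 - g * g / (h + lam)

steps = 2000
# Primal value by grid search over the interval |x| <= 1.
primal = min(objective(-1.0 + 2.0 * k / steps) for k in range(steps + 1))
# Best certified lower bound by grid search over lam in (1, 5], where h + lam > 0.
dual = max(dual_bound(1.0 + 4.0 * k / steps) for k in range(1, steps + 1))
print(primal, dual)
```

Although the objective is concave here, the two grid values agree up to rounding, consistent with the S-Lemma conclusion for a single quadratic constraint.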
## Positive Semidefinite Matrices and Spectrahedra
How do we express infinitely many quadratic inequalities with finitely many linear constraints? For a symmetric matrix $X$, the condition $v^\top Xv \ge 0$ for every vector $v$ is one matrix cone constraint. Semidefinite programming is the conic theory obtained from this positive semidefinite cone.
[definition: Positive Semidefinite Cone]
Let $\mathbb S^n$ be the vector space of real symmetric $n \times n$ matrices. The positive semidefinite cone is
\begin{align*}
\mathbb S_+^n := \{X \in \mathbb S^n : v^\top Xv \ge 0 \text{ for all } v \in \mathbb R^n\}.
\end{align*}
[/definition]
The notation $X \succeq 0$ means $X \in \mathbb S_+^n$, and $X \succeq Y$ means $X-Y \succeq 0$. To use this cone in duality, we need an inner product that identifies linear functionals on symmetric matrices.
[definition: Trace Inner Product]
The trace inner product is the map
\begin{align*}
\langle \cdot,\cdot \rangle : \mathbb S^n \times \mathbb S^n \to \mathbb R
\end{align*}
defined, for $X,Y \in \mathbb S^n$, by
\begin{align*}
\langle X,Y \rangle := \operatorname{tr}(XY)=\sum_{i,j=1}^n X_{ij}Y_{ij}.
\end{align*}
[/definition]
This inner product lets matrix inequalities fit into conic duality, but the dual cone still has to be identified. A priori, the linear functionals nonnegative on positive semidefinite matrices could form a larger or different cone than the positive semidefinite cone itself. The key obstruction is to detect negative curvature of a test matrix using only trace pairings with positive semidefinite matrices. The result below shows that rank-one positive semidefinite tests are enough, which is why SDP duals again use positive semidefinite variables.
[quotetheorem:6709]
[citeproof:6709]
Self-duality explains the algebra of SDP duals, but the conclusion depends on both the cone and the pairing. With the trace inner product on all real symmetric matrices, rank-one tests $X=vv^\top$ detect every negative quadratic direction of $Y$. If the ambient cone is changed, the same statement fails: in the cone of diagonal positive semidefinite matrices, the trace-dual cone only tests the diagonal entries, so a symmetric matrix with nonnegative diagonal and a large negative off-diagonal entry can pass all diagonal tests while failing to be positive semidefinite.
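The contrast between diagonal tests and rank-one tests can be seen on a $2\times2$ matrix. The matrix `Y` and the test vector $(1,1)$ below are illustrative choices, not data from the text.

```python
def quad_form(Y, v):
    """Return v^T Y v, which equals the trace pairing of Y with v v^T."""
    n = len(v)
    return sum(v[i] * Y[i][j] * v[j] for i in range(n) for j in range(n))

Y = [[1.0, -2.0], [-2.0, 1.0]]   # nonnegative diagonal, large negative off-diagonal entry

# Diagonal tests only see Y11 and Y22, so they cannot reject Y.
diagonal_tests = [quad_form(Y, [1.0, 0.0]), quad_form(Y, [0.0, 1.0])]

# The rank-one test X = v v^T with v = (1, 1) gives <X, Y> = v^T Y v < 0.
rank_one_test = quad_form(Y, [1.0, 1.0])
print(diagonal_tests, rank_one_test)
```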
The theorem also does not say that every matrix cone used in optimisation is self-dual, nor that self-duality preserves extra structure such as sparsity or entrywise signs. For instance, the cone of entrywise nonnegative symmetric matrices is self-dual under the trace inner product, but membership in it is unrelated to quadratic-form nonnegativity. The positive semidefinite cone is special because its order, its quadratic tests, and its trace-dual cone coincide. Before writing optimisation problems, we also need a geometric name for feasible sets cut out by affine slices of $\mathbb S_+^n$.
[definition: Spectrahedron]
A spectrahedron is a set of the form
\begin{align*}
\{x \in \mathbb R^m : F_0 + x_1F_1+\cdots+x_mF_m \succeq 0\},
\end{align*}
where $F_0,F_1,\dots,F_m \in \mathbb S^n$.
[/definition]
A spectrahedron is the semidefinite analogue of a polyhedron: linear inequalities are replaced by one linear matrix inequality. This motivates the standard optimisation problem obtained by minimising a linear functional over an affine slice of the positive semidefinite cone.
[definition: Semidefinite Programme]
A semidefinite programme in equality form minimises $\langle C,X\rangle$ over $X \in \mathbb S^n$ subject to
\begin{align*}
\langle A_i,X\rangle=b_i \quad \text{for } i=1,\dots,m, \qquad X \succeq 0,
\end{align*}
where $C,A_1,\dots,A_m \in \mathbb S^n$ and $b \in \mathbb R^m$.
[/definition]
The equality form is not restrictive for theory: inequalities, epigraph variables, and affine matrix inequalities can be transformed into this form using slack matrices and affine substitutions. The dual has the same structure because $\mathbb S_+^n$ is self-dual. If $X\succeq0$ is primal feasible and $y$ is dual feasible with slack
\begin{align*}
S=C-\sum_{i=1}^m y_iA_i\succeq0,
\end{align*}
then
\begin{align*}
\langle C,X\rangle-b^\top y
=\left\langle C-\sum_{i=1}^m y_iA_i,X\right\rangle
=\langle S,X\rangle
\ge 0.
\end{align*}
Therefore $b^\top y\le \langle C,X\rangle$ for every primal-dual feasible pair.
Weak duality gives lower bounds for minimisation problems. No regularity condition is needed for this inequality: it is only the nonnegativity of the positive-semidefinite pairing between $S$ and $X$, together with primal and dual feasibility. It therefore says less than strong duality. It does not assert that a feasible lower bound is sharp, that an optimal dual multiplier exists, or that the primal and dual optimal values coincide. A weak-duality certificate can be very conservative: if the primal optimum is finite but the only available dual feasible point has value far below it, the inequality remains true while giving little information. The next question is when every sharp lower bound is attained by a dual matrix certificate, which is the SDP version of strong conic duality.
[quotetheorem:6710]
[citeproof:6710]
Strong duality says that optimal values match under Slater's condition. The strict interior point cannot simply be replaced by boundary feasibility. A small example already shows the issue: minimise $2X_{12}$ subject to $X \in \mathbb S_+^2$, $X_{11}=0$, and $X_{22}=1$. Positive semidefiniteness forces $X_{12}=0$, so the primal value is $0$, but every feasible matrix lies on the boundary of $\mathbb S_+^2$. The dual maximises $y_2$ subject to the positive semidefiniteness of the symmetric matrix with diagonal entries $-y_1,-y_2$ and off-diagonal entries $1$. Equivalently, its scalar constraints are $y_1 \le 0$, $y_2 \le 0$, and $y_1y_2 \ge 1$. Its supremum is $0$, approached by $y_2<0$ with $y_1 \le 1/y_2$, but it is not attained because $y_2=0$ would force a positive semidefinite matrix to have a zero diagonal entry while retaining a nonzero off-diagonal entry in the same row and column. Thus boundary-only feasibility can preserve the value while destroying attainment. The finite-value assumption also matters: the primal problem $\inf\{-X_{11}: X \in \mathbb S_+^2,\ X_{22}=1\}$ has strictly feasible points such as $I_2$, but it is unbounded below as $X_{11}\to \infty$. To certify a particular primal-dual pair once strong duality is available, we need the equality case in weak duality, which is complementary slackness.
[quotetheorem:6711]
[citeproof:6711]
Complementary slackness is the matrix form of the familiar principle that an inactive inequality has zero multiplier, but the theorem has two separate layers. Feasibility plus $\langle S,X\rangle=0$ gives equality of the displayed primal and dual objective values; it becomes an optimality test only in a zero-gap setting, such as under the Slater strong-duality principle with attainment. Without such a setting, the condition is better read as a certificate that a particular feasible pair has closed the weak-duality gap, not as a replacement for existence or regularity hypotheses.
The product condition also depends on positive semidefiniteness. For indefinite symmetric matrices, trace orthogonality alone does not force product orthogonality; for example,
\begin{align*}
S=\begin{pmatrix}1&0\cr 0&-1\end{pmatrix}, \qquad X=I_2
\end{align*}
satisfy $\operatorname{tr}(SX)=0$ but $SX \neq 0$. In relaxation arguments, the positive semidefinite product form is often where rank information enters, because the range of $X$ must lie in the kernel of the dual slack matrix $S$.
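The difference between trace orthogonality and product orthogonality can be computed directly. The indefinite pair below is the one displayed above; the positive semidefinite pair $S=uu^\top$, $X=ww^\top$ with $u=(1,-1)$ and $w=(1,1)$ is an added illustrative assumption.

```python
def matmul(A, B):
    """Product of two square nested-list matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# Indefinite S: the trace pairing vanishes but the product does not.
S_indef = [[1.0, 0.0], [0.0, -1.0]]
X_id = [[1.0, 0.0], [0.0, 1.0]]
prod_indef = matmul(S_indef, X_id)

# Positive semidefinite pair: S = u u^T, X = w w^T with u orthogonal to w.
S_psd = [[1.0, -1.0], [-1.0, 1.0]]
X_psd = [[1.0, 1.0], [1.0, 1.0]]
prod_psd = matmul(S_psd, X_psd)

print(trace(prod_indef), prod_indef)
print(trace(prod_psd), prod_psd)
```

In the positive semidefinite pair the range of $X$ is spanned by $w$, which lies in the kernel of $S$, so the whole product vanishes and not only its trace.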
## Semidefinite Relaxations and Exactness Certificates
What should we do when an optimisation problem is nonconvex but its variables appear through quadratic products? The SDP strategy is to lift $x$ to the matrix $X=xx^\top$, keep the linear identities satisfied by $X$, and relax the nonconvex rank-one constraint. Exactness is then certified by dual solutions, complementary slackness, or algebraic certificates.
[definition: Semidefinite Relaxation by Lifting]
For a quadratic optimisation problem in $x \in \mathbb R^n$, the basic lifting map is
\begin{align*}
L: \mathbb R^n \to \mathbb S^n, \qquad L(x)=xx^\top.
\end{align*}
A semidefinite lifting uses variables $x \in \mathbb R^n$ and $X \in \mathbb S^n$ satisfying
\begin{align*}
X \succeq xx^\top.
\end{align*}
After homogenisation, the lifting map is
\begin{align*}
\widetilde L: \mathbb R^{n+1} \to \mathbb S^{n+1}, \qquad \widetilde L(z)=zz^\top,
\end{align*}
with $z=(1,x)$, and the relaxation imposes
\begin{align*}
Z \succeq 0, \qquad Z_{00}=1.
\end{align*}
[/definition]
The missing condition in the relaxation is $\operatorname{rank} Z=1$. Dropping rank is what makes the feasible region convex; the art is to recognise when the optimal relaxed matrix still has rank one, or when its value is nevertheless the desired bound.
[example: Max-Cut Relaxation]
Let $G=(V,E)$ be an undirected weighted graph, with each unordered edge $ij\in E$ counted once and weight $w_{ij}\ge 0$. A cut can be encoded by signs $s_i\in\{-1,1\}$: vertices with sign $1$ lie on one side and vertices with sign $-1$ lie on the other. If $s_i=s_j$, then $s_is_j=1$, so
\begin{align*}
\frac{1}{2}w_{ij}(1-s_is_j)=\frac{1}{2}w_{ij}(1-1)=0.
\end{align*}
If $s_i=-s_j$, then $s_is_j=-1$, so
\begin{align*}
\frac{1}{2}w_{ij}(1-s_is_j)=\frac{1}{2}w_{ij}(1-(-1))=w_{ij}.
\end{align*}
Thus
\begin{align*}
\frac{1}{2}\sum_{ij\in E} w_{ij}(1-s_is_j)
\end{align*}
is exactly the total weight of edges crossing the cut.
Introduce the rank-one matrix
\begin{align*}
X=ss^\top.
\end{align*}
Then each diagonal entry satisfies
\begin{align*}
X_{ii}=(ss^\top)_{ii}=s_i^2=1.
\end{align*}
Also $X\succeq 0$, because for every $v\in\mathbb R^{|V|}$,
\begin{align*}
v^\top Xv=v^\top ss^\top v=(s^\top v)^2\ge 0.
\end{align*}
For an edge $ij$, the corresponding matrix entry is
\begin{align*}
X_{ij}=(ss^\top)_{ij}=s_is_j.
\end{align*}
Therefore the cut objective can be written as
\begin{align*}
\frac{1}{2}\sum_{ij\in E}w_{ij}(1-X_{ij})
\end{align*}
whenever $X=ss^\top$ comes from a sign vector.
The semidefinite relaxation drops the nonconvex requirement that $X$ have the form $ss^\top$ with $s_i\in\{-1,1\}$, and keeps the convex consequences
\begin{align*}
X\succeq 0,\qquad X_{ii}=1 \text{ for all } i\in V.
\end{align*}
It is the SDP
\begin{align*}
\max \left\{\frac{1}{2}\sum_{ij\in E}w_{ij}(1-X_{ij}) : X\succeq 0,\ X_{ii}=1 \text{ for all } i\in V\right\}.
\end{align*}
Every cut sign vector $s$ gives a feasible SDP matrix $ss^\top$ with the same objective value as the cut. Hence the SDP feasible region contains all cut matrices, so its optimal value is an upper bound for the maximum cut value. The Goemans-Williamson rounding scheme then interprets a feasible positive semidefinite matrix as vector inner products and rounds those vectors by a random hyperplane to produce a cut with a controlled approximation guarantee.
[/example]
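The upper-bound argument can be illustrated on a small instance. The following sketch assumes a hypothetical test graph, the unit-weight 5-cycle, enumerates all sign vectors to check that $X=ss^\top$ is feasible with the cut's value, and then evaluates the relaxed objective at one feasible point that is not a cut matrix (a Gram matrix of unit vectors chosen here for illustration).

```python
import itertools
import math

# Hypothetical test graph: the 5-cycle with unit weights, each edge listed once.
n = 5
edges = {(i, (i + 1) % n): 1.0 for i in range(n)}


def cut_weight(s):
    """Total weight of the edges whose endpoints carry opposite signs."""
    return sum(w for (i, j), w in edges.items() if s[i] != s[j])


def sdp_objective(X):
    """Relaxed objective (1/2) * sum over edges of w_ij * (1 - X_ij)."""
    return 0.5 * sum(w * (1.0 - X[i][j]) for (i, j), w in edges.items())


# Every sign vector s gives X = s s^T with unit diagonal and the cut's value.
best_cut = 0.0
for s in itertools.product([-1, 1], repeat=n):
    X = [[s[i] * s[j] for j in range(n)] for i in range(n)]
    assert all(X[i][i] == 1 for i in range(n))
    assert abs(sdp_objective(X) - cut_weight(s)) < 1e-12
    best_cut = max(best_cut, cut_weight(s))

# A feasible point that is not a cut matrix: the Gram matrix of unit vectors
# at angles 4*pi*i/5. Any Gram matrix is positive semidefinite.
vecs = [(math.cos(4 * math.pi * i / n), math.sin(4 * math.pi * i / n)) for i in range(n)]
gram = [[u[0] * v[0] + u[1] * v[1] for v in vecs] for u in vecs]
relaxed_value = sdp_objective(gram)

print(best_cut, relaxed_value)
```

On this instance the relaxed value at the Gram matrix exceeds the best cut, so the SDP bound is strictly larger than the maximum cut for the 5-cycle. This is the situation in which a rounding scheme is needed to return to an actual cut.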
Relaxations are useful even when they are not exact, but exactness is a central theoretical question. A certificate usually lives in the dual: if a dual feasible point matches a candidate primal value, weak duality proves optimality.
[definition: Exact Semidefinite Relaxation]
A semidefinite relaxation of an optimisation problem is exact if its optimal value equals the optimal value of the original problem.
[/definition]
This definition concerns values, not necessarily optimisers. In some problems the relaxed optimiser can have high rank while the value is still sharp; in stronger cases, complementary slackness forces a rank-one optimiser and recovers an original solution.
[example: Lovász Theta Semidefinite Programme]
For a graph $G=(V,E)$ with $|V|=n$, let $J$ be the $n\times n$ all-ones matrix. The Lovász theta number in this convention is
\begin{align*}
\vartheta(G)=\max\{\langle J,X\rangle:\operatorname{tr}(X)=1,\ X_{ij}=0 \text{ for every } ij\in E,\ X\succeq 0\}.
\end{align*}
Since $\langle J,X\rangle=\sum_{i,j=1}^n J_{ij}X_{ij}$ and $J_{ij}=1$ for all $i,j$, the objective is
\begin{align*}
\langle J,X\rangle=\sum_{i,j=1}^n X_{ij}.
\end{align*}
We first see why this SDP bounds the independence number from above. Let $S\subseteq V$ be an independent set with $|S|=k$, and let $\mathbf 1_S\in\mathbb R^n$ be its indicator vector. Define
\begin{align*}
X=\frac{1}{k}\mathbf 1_S\mathbf 1_S^\top.
\end{align*}
For every $v\in\mathbb R^n$,
\begin{align*}
v^\top Xv
= v^\top\left(\frac{1}{k}\mathbf 1_S\mathbf 1_S^\top\right)v
= \frac{1}{k}(\mathbf 1_S^\top v)^2
\ge 0,
\end{align*}
so $X\succeq 0$. Its trace is
\begin{align*}
\operatorname{tr}(X)
=\sum_{i=1}^n X_{ii}
=\frac{1}{k}\sum_{i=1}^n (\mathbf 1_S)_i^2
=\frac{1}{k}\sum_{i\in S}1
=1.
\end{align*}
If $ij\in E$, then $i$ and $j$ cannot both lie in $S$, because $S$ is independent. Hence
\begin{align*}
X_{ij}=\frac{1}{k}(\mathbf 1_S)_i(\mathbf 1_S)_j=0.
\end{align*}
Thus $X$ is feasible for the SDP. Its objective value is
\begin{align*}
\langle J,X\rangle
=\sum_{i,j=1}^n X_{ij}
=\frac{1}{k}\sum_{i,j=1}^n(\mathbf 1_S)_i(\mathbf 1_S)_j
=\frac{1}{k}\left(\sum_{i=1}^n(\mathbf 1_S)_i\right)^2
=\frac{1}{k}k^2
=k.
\end{align*}
Every independent set of size $k$ therefore gives a feasible SDP point of value $k$, so
\begin{align*}
\alpha(G)\le \vartheta(G).
\end{align*}
Now let $\bar G$ have a proper coloring with $q$ color classes $C_1,\dots,C_q$. Each $C_\ell$ is a clique in $G$: if $i\ne j$ lie in the same $C_\ell$, then $ij$ is not an edge of $\bar G$, so $ij$ is an edge of $G$. Take any feasible $X$ in the theta SDP. Since $X\succeq 0$, choose vectors $p_1,\dots,p_n$ with $X_{ij}=p_i^\top p_j$, for instance by a spectral factorisation of $X$. Put
\begin{align*}
y_\ell=\sum_{i\in C_\ell}p_i.
\end{align*}
Because the color classes partition $V$,
\begin{align*}
\langle J,X\rangle
=\sum_{i,j=1}^n p_i^\top p_j
=\left(\sum_{i=1}^n p_i\right)^\top\left(\sum_{j=1}^n p_j\right)
=\left|\sum_{\ell=1}^q y_\ell\right|^2.
\end{align*}
The elementary identity
\begin{align*}
q\sum_{\ell=1}^q |y_\ell|^2-\left|\sum_{\ell=1}^q y_\ell\right|^2
=\sum_{1\le \ell<m\le q}|y_\ell-y_m|^2
\ge 0
\end{align*}
gives
\begin{align*}
\left|\sum_{\ell=1}^q y_\ell\right|^2\le q\sum_{\ell=1}^q |y_\ell|^2.
\end{align*}
For each color class $C_\ell$, feasibility gives $X_{ij}=0$ whenever $i\ne j$ lie in $C_\ell$, because then $ij\in E$. Hence
\begin{align*}
|y_\ell|^2
=\left(\sum_{i\in C_\ell}p_i\right)^\top\left(\sum_{j\in C_\ell}p_j\right)
=\sum_{i,j\in C_\ell}X_{ij}
=\sum_{i\in C_\ell}X_{ii}.
\end{align*}
Therefore
\begin{align*}
\langle J,X\rangle
\le q\sum_{\ell=1}^q\sum_{i\in C_\ell}X_{ii}
=q\sum_{i=1}^n X_{ii}
=q\,\operatorname{tr}(X)
=q.
\end{align*}
Since this holds for every feasible $X$ and every $q$-coloring of $\bar G$,
\begin{align*}
\vartheta(G)\le \chi(\bar G).
\end{align*}
Thus the SDP places $\vartheta(G)$ between $\alpha(G)$ and $\chi(\bar G)$, and the graph constraints have become the linear zero-pattern equations $X_{ij}=0$ on a positive semidefinite matrix.
[/example]
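Both bounds can be checked on a small graph. The sketch below assumes the 5-cycle as a hypothetical test instance. It verifies the feasible points $\frac1k\mathbf 1_S\mathbf 1_S^\top$ for every independent set, and then evaluates one further feasible point, a circulant matrix chosen here for illustration, whose semidefiniteness is checked through the standard eigenvalue formula for circulant matrices.

```python
import itertools
import math

n = 5
edges = {frozenset((i, (i + 1) % n)) for i in range(n)}  # the 5-cycle C5


def is_independent(S):
    return all(frozenset(pair) not in edges for pair in itertools.combinations(S, 2))


def indicator_point(S):
    """X = (1/k) 1_S 1_S^T for a vertex set S of size k."""
    k = len(S)
    return [[1.0 / k if (i in S and j in S) else 0.0 for j in range(n)] for i in range(n)]


def satisfies_linear_constraints(X):
    """tr(X) = 1 and X_ij = 0 on every edge; semidefiniteness is argued separately."""
    trace_ok = abs(sum(X[i][i] for i in range(n)) - 1.0) < 1e-12
    zeros_ok = all(abs(X[i][j]) < 1e-12 for i, j in (tuple(e) for e in edges))
    return trace_ok and zeros_ok


def value(X):
    return sum(sum(row) for row in X)


# Lower bound: each independent set of size k yields a feasible point of value k.
alpha = 0
for k in range(1, n + 1):
    for S in itertools.combinations(range(n), k):
        if is_independent(S):
            X = indicator_point(S)
            assert satisfies_linear_constraints(X) and abs(value(X) - k) < 1e-12
            alpha = max(alpha, k)

# A circulant feasible point: 1/5 on the diagonal, c/5 on non-adjacent pairs.
c = (math.sqrt(5) - 1) / 2
X_circ = [[0.2 if i == j else (c / 5 if (i - j) % n in (2, 3) else 0.0)
           for j in range(n)] for i in range(n)]
# Eigenvalues of a circulant matrix come from the discrete Fourier transform
# of its first row: here (1 + 2c*cos(4*pi*k/5)) / 5 for k = 0, ..., 4.
eigenvalues = [(1 + 2 * c * math.cos(4 * math.pi * k / n)) / 5 for k in range(n)]

# Upper bound: {0,1}, {2,3}, {4} are cliques of C5, so q = 3 colours the complement.
q = 3

print(alpha, value(X_circ), q)
```

The circulant point has value $\sqrt5$, which lies strictly between the largest independent-set value and the colouring bound on this instance. The code only certifies that $\vartheta$ is at least $\sqrt5$ here, since it exhibits one feasible point rather than solving the SDP.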
The preceding relaxations use matrix positivity to certify bounds for quadratic expressions. This motivates a polynomial version: certify global nonnegativity by expressing a polynomial as a sum of polynomial squares.
[definition: Sum of Squares Polynomial]
A polynomial $p \in \mathbb R[x_1,\dots,x_n]$ is a sum of squares if there exist polynomials $q_1,\dots,q_r \in \mathbb R[x_1,\dots,x_n]$ such that
\begin{align*}
p = \sum_{k=1}^r q_k^2.
\end{align*}
[/definition]
Every sum of squares polynomial is globally nonnegative, but not every globally nonnegative polynomial is a sum of squares. The semidefinite value of the concept comes from the following Gram matrix characterisation.
[quotetheorem:6712]
[citeproof:6712]
The equality $p(x)=z_d(x)^\top Qz_d(x)$ is linear in the entries of $Q$ after matching coefficients of monomials, so searching for a sum-of-squares certificate is an SDP feasibility problem. The degree bound is necessary because the monomial vector must be large enough to contain the polynomial factors $q_k$; choosing a smaller vector may falsely rule out an existing sum-of-squares representation. The theorem certifies sums of squares rather than all nonnegative polynomials, and this distinction is genuine: by Hilbert's classification, nonnegative polynomials that are not sums of squares exist as soon as there are two variables and degree at least six, or three variables and degree at least four.
[example: Certifying a Univariate Quartic]
Consider $p(x)=x^4+2x^2+1$ and $z_2(x)=(1,x,x^2)$. Choose
\begin{align*}
Q=\operatorname{diag}(1,2,1).
\end{align*}
For every $u=(u_0,u_1,u_2)^\top\in\mathbb R^3$,
\begin{align*}
u^\top Qu=u_0^2+2u_1^2+u_2^2\ge 0.
\end{align*}
Hence $Q\succeq 0$. Multiplying the Gram matrix by the monomial vector gives
\begin{align*}
Qz_2(x)=(1,2x,x^2)^\top.
\end{align*}
Therefore
\begin{align*}
z_2(x)^\top Qz_2(x)=(1,x,x^2)\cdot(1,2x,x^2)=1+2x^2+x^4=p(x).
\end{align*}
By the Gram matrix characterisation, this positive semidefinite Gram representation certifies the sum-of-squares decomposition
\begin{align*}
p(x)=1^2+(\sqrt{2}x)^2+(x^2)^2.
\end{align*}
The same polynomial also has the one-square representation
\begin{align*}
p(x)=(x^2+1)^2=x^4+2x^2+1.
\end{align*}
In Gram form, this is represented by
\begin{align*}
Q'=vv^\top\quad\text{where }v=(1,0,1)^\top.
\end{align*}
This matrix is positive semidefinite because, for every $u=(u_0,u_1,u_2)^\top$,
\begin{align*}
u^\top Q'u=((1,0,1)\cdot u)^2=(u_0+u_2)^2\ge 0.
\end{align*}
It gives the same polynomial since
\begin{align*}
z_2(x)^\top Q'z_2(x)=((1,0,1)\cdot z_2(x))^2=(1+x^2)^2.
\end{align*}
Thus one polynomial can have several positive semidefinite Gram matrices: the diagonal matrix records three visible squares, while the rank-one matrix records the single square $x^2+1$ by using off-diagonal entries between the constant and $x^2$ monomials.
[/example]
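Coefficient matching for the univariate case is short enough to write out. The helper name `gram_to_coefficients` and the matrix `N` below are illustrative choices, not notation from the text. The sketch checks that both Gram matrices of the example reproduce $p$, and that `N` contributes nothing to any coefficient.

```python
def gram_to_coefficients(Q):
    """Coefficients of z(x)^T Q z(x), lowest degree first, for z(x) = (1, x, ..., x^d)."""
    d = len(Q) - 1
    coeffs = [0.0] * (2 * d + 1)
    for i in range(d + 1):
        for j in range(d + 1):
            coeffs[i + j] += Q[i][j]  # Q_ij multiplies x^i * x^j
    return coeffs


p = [1.0, 0.0, 2.0, 0.0, 1.0]  # p(x) = 1 + 2x^2 + x^4

Q_diag = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
Q_rank_one = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 1.0]]  # v v^T, v = (1, 0, 1)

# N contributes nothing to any coefficient, so Q_diag + lam * N represents p
# for every lam; lam = 1 recovers Q_rank_one.
N = [[0.0, 0.0, 1.0], [0.0, -2.0, 0.0], [1.0, 0.0, 0.0]]

print(gram_to_coefficients(Q_diag))
print(gram_to_coefficients(Q_rank_one))
print(gram_to_coefficients(N))
```

The Gram matrices of $p$ therefore form the affine line $Q_{\text{diag}}+\lambda N$, and the sum-of-squares certificates are the points of that line that are positive semidefinite, which here means $-1\le\lambda\le 1$. An SDP solver searches exactly this kind of intersection of an affine set with the semidefinite cone.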
The chapter closes by tying the conic viewpoint together. SOCPs give a compact language for Euclidean norm and convex quadratic constraints; SDPs enlarge the cone to matrix positivity, which captures infinitely many quadratic inequalities at once. Semidefinite relaxations then use this extra expressive power to build computable bounds for problems that are not convex in their original variables.
# 10. Barriers, Interiors, and Central Paths
Chapters 4 through 9 developed optimality conditions, duality, and conic formulations; this chapter explains how those ideas become algorithmic when the feasible set has a usable interior. Inequality constraints create feasible regions with boundaries, and the boundary is where differentiability often fails or multipliers become singular. Logarithmic barriers replace the constrained problem by a family of smooth interior problems, whose minimizers trace the central path. The goals are to derive the barrier optimality equations, interpret them as perturbed primal-dual KKT conditions, and isolate the extra curvature hypotheses needed for interior-point complexity theory.
## From Inequality Constraints to Interior Penalisation
How can an optimisation problem enforce inequalities while still being treated by smooth first-order calculus? The basic answer is to make the boundary infinitely expensive. To do this without losing the original feasible geometry, we first isolate the part of the feasible region where all inequality margins are genuinely positive.
[definition: Feasible Interior]
Let $C \subset \mathbb R^n$ be a closed convex set with nonempty interior. The feasible interior of $C$ is $\operatorname{int} C$.
For an inequality representation
\begin{align*}
C = \{x \in \mathbb R^n : f_i(x) \le 0,\ i=1,\dots,m\},
\end{align*}
where each $f_i: \mathbb R^n \to \mathbb R$ is convex, the strict feasible region is
\begin{align*}
\{x \in \mathbb R^n : f_i(x) < 0,\ i=1,\dots,m\}.
\end{align*}
[/definition]
The feasible interior is the domain on which the method is allowed to move. The next object must therefore be finite exactly at strictly feasible points and must become large near the boundary, so that smooth minimisation will not step outside the constraint set.
[definition: Logarithmic Barrier for Inequalities]
Let $f_1,\dots,f_m: \mathbb R^n \to \mathbb R$ be convex functions and set
\begin{align*}
C = \{x \in \mathbb R^n : f_i(x) \le 0,\ i=1,\dots,m\}.
\end{align*}
The logarithmic barrier associated with these inequalities is the function
\begin{align*}
\Phi : \{x \in \mathbb R^n : f_i(x)<0\text{ for all }i\} \to \mathbb R, \qquad
\Phi(x) = -\sum_{i=1}^{m} \log(-f_i(x)).
\end{align*}
[/definition]
The logarithm turns the sign margin $-f_i(x)$ into a penalty that diverges as $f_i(x) \uparrow 0$. To use this penalty for optimisation, we combine it with the original objective and introduce a parameter that controls their relative weight.
[definition: Barrier Subproblem]
Let $c \in \mathbb R^n$ and let $\Phi:\operatorname{int} C \to \mathbb R$ be a convex barrier for a closed convex set $C \subset \mathbb R^n$. For $t>0$, the barrier subproblem is
\begin{align*}
\inf_{x \in \operatorname{int} C} \{t\, c \cdot x + \Phi(x)\}.
\end{align*}
[/definition]
The parameter $t$ determines the tradeoff. Small $t$ gives a point governed mostly by the geometry of the feasible region, while large $t$ pushes the minimizer toward the original linear objective.
[example: Log Barrier for Linear Programming]
Consider the linear programme
\begin{align*}
\min_{x \in \mathbb R^n} c\cdot x \quad \text{subject to} \quad Ax < b,
\end{align*}
where $A \in \mathbb R^{m\times n}$, $b \in \mathbb R^m$, and $a_i$ denotes the $i$th row of $A$. The strict inequalities are exactly the positive-slack conditions
\begin{align*}
b_i-a_i\cdot x>0 \text{ for } i=1,\dots,m,
\end{align*}
so the logarithmic barrier on the strict feasible region is
\begin{align*}
\Phi(x)=-\sum_{i=1}^{m}\log(b_i-a_i\cdot x).
\end{align*}
For a direction $h\in\mathbb R^n$, differentiate each summand using the chain rule. Since $D(b_i-a_i\cdot x)[h]=-a_i\cdot h$, we get
\begin{align*}
D\Phi_x(h)=-\sum_{i=1}^{m}\frac{1}{b_i-a_i\cdot x}\,(-a_i\cdot h).
\end{align*}
Equivalently,
\begin{align*}
D\Phi_x(h)=\sum_{i=1}^{m}\frac{a_i\cdot h}{b_i-a_i\cdot x}.
\end{align*}
Moving the dot product outside the sum gives
\begin{align*}
D\Phi_x(h)=\left(\sum_{i=1}^{m}\frac{a_i}{b_i-a_i\cdot x}\right)\cdot h.
\end{align*}
Since this identity holds for every direction $h$, the gradient is
\begin{align*}
\nabla \Phi(x)=\sum_{i=1}^{m}\frac{a_i}{b_i-a_i\cdot x}.
\end{align*}
The barrier subproblem with parameter $t>0$ has objective $x\mapsto t\,c\cdot x+\Phi(x)$. At an interior stationary point, its gradient vanishes, so
\begin{align*}
0=\nabla\bigl(t\,c\cdot x+\Phi(x)\bigr).
\end{align*}
By linearity of the gradient,
\begin{align*}
0=tc+\nabla\Phi(x).
\end{align*}
Substituting the computed gradient gives the barrier stationarity equation
\begin{align*}
tc+\sum_{i=1}^{m}\frac{a_i}{b_i-a_i\cdot x}=0.
\end{align*}
Thus each inequality contributes its row vector $a_i$ scaled by the reciprocal slack $(b_i-a_i\cdot x)^{-1}$, which is the algebraic source of the dual-variable interpretation later in the central path equations.
[/example]
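The gradient formula can be compared with a finite-difference approximation. The data `A`, `b` and the test point below are hypothetical (a triangle in the plane); the sketch is only a numerical sanity check of $\nabla\Phi(x)=\sum_i a_i/(b_i-a_i\cdot x)$, not a solver.

```python
import math

# Hypothetical data: the triangle x1 > 0, x2 > 0, x1 + x2 < 1 written as A x < b.
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
b = [0.0, 0.0, 1.0]


def slacks(x):
    """Slack b_i - a_i . x of each inequality."""
    return [bi - sum(aij * xj for aij, xj in zip(ai, x)) for ai, bi in zip(A, b)]


def barrier(x):
    return -sum(math.log(s) for s in slacks(x))


def barrier_gradient(x):
    """Sum of the rows a_i scaled by the reciprocal slacks."""
    s = slacks(x)
    return [sum(A[i][j] / s[i] for i in range(len(A))) for j in range(len(x))]


def finite_difference_gradient(x, eps=1e-6):
    """Central differences of the barrier, one coordinate at a time."""
    grad = []
    for j in range(len(x)):
        up = [xk + (eps if k == j else 0.0) for k, xk in enumerate(x)]
        down = [xk - (eps if k == j else 0.0) for k, xk in enumerate(x)]
        grad.append((barrier(up) - barrier(down)) / (2 * eps))
    return grad


x = [0.2, 0.3]
print(barrier_gradient(x))
print(finite_difference_gradient(x))
```

The two printed vectors agree to several digits at this interior point. Moving `x` closer to an edge of the triangle makes the corresponding reciprocal slack, and hence the gradient, grow without bound.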
The preceding example shows the central algebraic feature of logarithmic barriers: gradients contain inverse slacks. Before interpreting those slacks as dual variables, we need an analytic guarantee that the barrier problem really selects a single interior point.
[quotetheorem:6713]
[citeproof:6713]
The hypotheses in the existence theorem each remove a different failure mode. If sublevel sets are unbounded, the infimum may escape to infinity; for instance, on $C=[0,\infty)$ the function $-x-\log x$ is unbounded below on $\operatorname{int} C=(0,\infty)$ and has no minimizer. If boundary blow-up is absent, a minimizing sequence can converge to a boundary point where the barrier problem is no longer an interior problem. Strict convexity is also essential for selecting a single point on the path: a merely convex objective may have an entire face of minimizers.
Once a minimizer exists in the interior, no normal cone is present in the first-order condition. The constrained problem has been converted into a smooth equation on $\operatorname{int} C$, and this equation is the bridge from barriers to computable central-path formulas.
[quotetheorem:6714]
[citeproof:6714]
The interior hypothesis is not cosmetic: at a boundary minimizer, first-order optimality would involve a normal cone rather than a zero gradient. Differentiability is also essential, since nonsmooth barriers would lead to subgradient inclusions rather than Newton equations. A positive definite Hessian is stronger than needed for existence, but it is what prevents flat directions and makes the local Newton system nonsingular. Thus the theorem explains both the power and the limitation of the barrier transformation: it works only while the iterates stay in a smooth open domain.
## The Central Path and Primal-Dual Geometry
What does the barrier minimizer represent in the dual problem? For linear and conic optimisation, the inverse slack terms in the barrier gradient are not merely penalties; they encode dual variables. The central path is therefore a path of primal-dual pairs satisfying perturbed complementarity.
[definition: Central Path]
Let $C \subset \mathbb R^n$ be a closed convex set with nonempty interior, let $\Phi:\operatorname{int} C \to \mathbb R$ be a differentiable barrier, and let $c \in \mathbb R^n$. When the minimizer exists uniquely for each $t>0$, the central path associated with the linear objective $c\cdot x$ is the map
\begin{align*}
x:(0,\infty)\to \operatorname{int} C, \qquad
x(t) = \operatorname*{argmin}_{x\in \operatorname{int} C}\{t c\cdot x + \Phi(x)\}.
\end{align*}
[/definition]
The parameter may also be written as $\mu=1/t$. Large $t$ corresponds to a small complementarity parameter $\mu$, so the path approaches the optimal face when the original problem has a finite optimum.
[definition: Primal-Dual Linear Programme with Slacks]
Let $A \in \mathbb R^{p\times n}$, $b\in\mathbb R^p$, and $c\in\mathbb R^n$. The primal problem in equality form is
\begin{align*}
\min c\cdot x \quad \text{subject to} \quad Ax=b,\ x\in\mathbb R^n_+.
\end{align*}
Its dual problem is
\begin{align*}
\max b\cdot y \quad \text{subject to} \quad A^\top y+s=c,\ s\in\mathbb R^n_+.
\end{align*}
[/definition]
The primal variable $x$ and dual slack $s$ are both constrained to lie in the nonnegative orthant. Barrier methods keep these variables strictly positive, but optimality still has to balance primal feasibility, dual feasibility, and the way each coordinate pair approaches complementarity. The obstruction is that the ordinary KKT equation $x_i s_i=0$ lies on the boundary, where the logarithmic barrier is singular. The central-path equations replace exact complementarity by the positive relation $x_i s_i=\mu$, giving the precise system followed before the limit is taken.
[quotetheorem:6715]
[citeproof:6715]
These equations show why the path is geometric rather than only analytic. It lives inside the product of the primal feasible region and the dual feasible region, and the scalar $\mu$ measures the common complementarity gap per coordinate. Strict positivity is indispensable here: if some $x_i$ or $s_i$ is zero, the logarithmic barrier gradient or the equation $x_i s_i=\mu$ is not meaningful for $\mu>0$. Full row rank gives a clean multiplier representation; without it, $y$ may fail to be unique even when the primal point and slack are determined. The equations are therefore a characterization of positive barrier stationary points, not by themselves a proof that such a triple exists for every $\mu$.
[example: Two-Variable Central Path]
Consider
\begin{align*}
\min_{(x_1,x_2)\in\mathbb R^2} x_1+x_2 \quad \text{subject to} \quad x_1>0,\ x_2>0,\ x_1+x_2=1.
\end{align*}
On the affine constraint $x_1+x_2=1$, the linear objective has value $1$ at every feasible point, so the barrier subproblem is
\begin{align*}
\min_{x_1+x_2=1,\ x_1>0,\ x_2>0} t-\log x_1-\log x_2.
\end{align*}
Using $x_2=1-x_1$, the feasible set becomes $0<x_1<1$, and the one-variable objective is
\begin{align*}
g(x_1)=t-\log x_1-\log(1-x_1).
\end{align*}
Its derivative is
\begin{align*}
g'(x_1)=-\frac{1}{x_1}+\frac{1}{1-x_1}.
\end{align*}
The stationary equation $g'(x_1)=0$ is
\begin{align*}
-\frac{1}{x_1}+\frac{1}{1-x_1}=0.
\end{align*}
Equivalently,
\begin{align*}
\frac{1}{1-x_1}=\frac{1}{x_1}.
\end{align*}
Since $0<x_1<1$, both denominators are positive, so cross-multiplication gives
\begin{align*}
x_1=1-x_1.
\end{align*}
Thus
\begin{align*}
2x_1=1.
\end{align*}
Therefore
\begin{align*}
x_1=\frac12.
\end{align*}
Substituting into the affine constraint gives
\begin{align*}
x_2=1-x_1=1-\frac12=\frac12.
\end{align*}
To verify that this stationary point is the minimizer, compute the second derivative:
\begin{align*}
g''(x_1)=\frac{1}{x_1^2}+\frac{1}{(1-x_1)^2}.
\end{align*}
This is positive for every $0<x_1<1$, so $g$ is strictly convex on the feasible interval and the stationary point is the unique minimizer. Hence the central path is
\begin{align*}
x_1(t)=x_2(t)=\frac12
\end{align*}
for every $t>0$. The path is constant because the linear objective is constant on the affine slice, leaving the symmetric logarithmic barrier to select the midpoint.
[/example]
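The dual side of this example can be made explicit. The sketch below assumes the central-path system takes the form $Ax=b$, $A^\top y+s=c$, $x_is_i=\mu$ with $\mu=1/t$, which is how the surrounding text describes it; the function name `central_path_triple` is illustrative. Here $A=(1\ \ 1)$, $b=1$, and $c=(1,1)$.

```python
def central_path_triple(t):
    """Primal-dual point for min x1 + x2 s.t. x1 + x2 = 1, x > 0, with mu = 1/t."""
    mu = 1.0 / t
    x = (0.5, 0.5)                  # barrier minimizer computed in the example
    s = tuple(mu / xi for xi in x)  # perturbed complementarity x_i * s_i = mu
    y = 1.0 - s[0]                  # dual feasibility y + s_i = c_i = 1, since A = [1 1]
    return x, y, s


for t in (1.0, 10.0, 100.0):
    x, y, s = central_path_triple(t)
    gap = (x[0] + x[1]) - y         # c . x - b . y with b = 1
    print(t, y, s, gap)
```

Although the primal point is constant, the dual pair $(y,s)$ moves with $t$: the slack is $2\mu$ in each coordinate and the duality gap is $2\mu$, which shrinks to zero as $t$ grows.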
The previous example is deliberately symmetric. If the objective is tilted or if the constraint is written with inequality slacks, the same equations reveal how the path bends toward the optimizer while staying strictly feasible.
[example: Central Path for an Inequality Problem]
Consider
\begin{align*}
\min_{(x_1,x_2)\in\mathbb R^2} x_1 \text{ subject to } x_1>0,\ x_2>0,\ x_1+x_2<1.
\end{align*}
For $t>0$, write the slack of the third inequality as $s=1-x_1-x_2$. On the strict feasible triangle, the logarithmic barrier objective is
\begin{align*}
F_t(x_1,x_2)=t x_1-\log x_1-\log x_2-\log s.
\end{align*}
Since $\partial s/\partial x_1=-1$ and $\partial s/\partial x_2=-1$, differentiating term by term gives
\begin{align*}
\frac{\partial F_t}{\partial x_1}=t-\frac{1}{x_1}+\frac{1}{s}.
\end{align*}
\begin{align*}
\frac{\partial F_t}{\partial x_2}=-\frac{1}{x_2}+\frac{1}{s}.
\end{align*}
Thus an interior stationary point satisfies
\begin{align*}
t-\frac{1}{x_1}+\frac{1}{s}=0.
\end{align*}
\begin{align*}
-\frac{1}{x_2}+\frac{1}{s}=0.
\end{align*}
The second stationarity equation gives $1/x_2=1/s$. Because strict feasibility gives $x_2>0$ and $s>0$, multiplying by $x_2s$ gives
\begin{align*}
s=x_2.
\end{align*}
Substituting $s=1-x_1-x_2$ gives
\begin{align*}
1-x_1-x_2=x_2.
\end{align*}
Hence
\begin{align*}
x_2=\frac{1-x_1}{2}.
\end{align*}
Substituting $s=(1-x_1)/2$ into the first stationarity equation gives
\begin{align*}
t-\frac{1}{x_1}+\frac{2}{1-x_1}=0.
\end{align*}
Equivalently,
\begin{align*}
t=\frac{1}{x_1}-\frac{2}{1-x_1}.
\end{align*}
Combining the fractions gives
\begin{align*}
t=\frac{1-3x_1}{x_1(1-x_1)}.
\end{align*}
Multiplying by the positive denominator $x_1(1-x_1)$ gives
\begin{align*}
t x_1(1-x_1)=1-3x_1.
\end{align*}
Expanding and moving all terms to one side gives
\begin{align*}
t x_1^2-(t+3)x_1+1=0.
\end{align*}
The [quadratic formula](/theorems/1301) gives
\begin{align*}
x_1=\frac{t+3\pm\sqrt{t^2+2t+9}}{2t}.
\end{align*}
The root with the plus sign is not feasible: since $\sqrt{t^2+2t+9}>t+1$, its numerator exceeds $2t+4$, so it would give $x_1>1$ for every $t>0$. Therefore the feasible branch is
\begin{align*}
x_1(t)=\frac{t+3-\sqrt{t^2+2t+9}}{2t}.
\end{align*}
Rationalizing the numerator gives the equivalent positive form
\begin{align*}
x_1(t)=\frac{2}{t+3+\sqrt{t^2+2t+9}}.
\end{align*}
Then
\begin{align*}
x_2(t)=\frac{1-x_1(t)}{2}.
\end{align*}
For a direction $u=(u_1,u_2)$, the second directional derivative is
\begin{align*}
D^2F_t(x_1,x_2)[u,u]=\frac{u_1^2}{x_1^2}+\frac{u_2^2}{x_2^2}+\frac{(u_1+u_2)^2}{s^2}.
\end{align*}
Each denominator is positive on the strict feasible triangle, and for nonzero $u$ at least one of $u_1^2$ or $u_2^2$ is positive. Hence $D^2F_t(x_1,x_2)[u,u]>0$ for every nonzero $u$, so $F_t$ is strictly convex and the stationary point is the unique barrier minimizer.
Finally,
\begin{align*}
\lim_{t\to 0^+}x_1(t)=\frac{2}{3+\sqrt{9}}=\frac13.
\end{align*}
Also,
\begin{align*}
\lim_{t\to\infty}x_1(t)=\lim_{t\to\infty}\frac{2}{t+3+\sqrt{t^2+2t+9}}=0.
\end{align*}
The denominator $t+3+\sqrt{t^2+2t+9}$ increases with $t$, so $x_1(t)$ decreases toward $0$. Thus the central path moves toward the boundary face that minimizes $x_1$, while every finite parameter keeps $x_1>0$, $x_2>0$, and $1-x_1-x_2>0$.
[/example]
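The closed form can be checked against the stationarity equations directly. The following sketch evaluates $x_1(t)$ and $x_2(t)$ from the example at a few hypothetical parameter values and prints the gradient of $F_t$ there, which should vanish up to rounding.

```python
import math


def central_point(t):
    """Closed-form barrier minimizer (x1(t), x2(t)) from the example."""
    x1 = 2.0 / (t + 3.0 + math.sqrt(t * t + 2.0 * t + 9.0))
    x2 = (1.0 - x1) / 2.0
    return x1, x2


def barrier_gradient(t, x1, x2):
    """Gradient of F_t(x1, x2) = t*x1 - log x1 - log x2 - log(1 - x1 - x2)."""
    s = 1.0 - x1 - x2
    return (t - 1.0 / x1 + 1.0 / s, -1.0 / x2 + 1.0 / s)


for t in (0.01, 1.0, 100.0, 10000.0):
    x1, x2 = central_point(t)
    g1, g2 = barrier_gradient(t, x1, x2)
    print(t, x1, x2, g1, g2)
```

For small $t$ the point sits near $(\tfrac13,\tfrac13)$, where all three margins are equal, and for large $t$ the first coordinate shrinks toward $0$ while $x_2$ tends to $\tfrac12$, the midpoint of the optimal face.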
This computation also explains the term central: at each finite parameter the point balances all inequality margins instead of choosing a boundary face. The limiting point belongs to the optimal face, but the path itself remains in the interior.
## Barriers for Cones and Matrix Interiors
Why do conic problems use special barriers rather than arbitrary penalty functions? A cone has its own intrinsic geometry, and the interior of the cone determines the natural notion of positivity. We begin by separating the abstract barrier condition from any particular coordinate description.
[definition: Cone Barrier]
Let $K \subset \mathbb R^n$ be a closed convex cone with nonempty interior, and write $K^\circ=\operatorname{int}K$ for that interior. A cone barrier is a convex function $F:K^\circ\to\mathbb R$ such that $F(x_k)\to\infty$ whenever $x_k\in K^\circ$ converges to a point of $\partial K$.
[/definition]
The orthant barrier and the determinant barrier are the two basic models. The first gives linear programming; the second requires a cone whose points are matrices, so we now identify the relevant interior.
[example: Orthant Barrier]
For $K=\mathbb R^n_+$, the interior is
\begin{align*}
K^\circ=\{x\in\mathbb R^n:x_i>0\text{ for all }i\},
\end{align*}
and the standard logarithmic barrier is
\begin{align*}
F(x)=-\sum_{i=1}^{n}\log x_i.
\end{align*}
For a direction $h=(h_1,\dots,h_n)\in\mathbb R^n$, differentiating each coordinate term gives
\begin{align*}
DF_x(h)
&=-\sum_{i=1}^{n}\frac{h_i}{x_i}.
\end{align*}
Equivalently,
\begin{align*}
DF_x(h)=\left(-\frac{1}{x_1},\dots,-\frac{1}{x_n}\right)\cdot h,
\end{align*}
so
\begin{align*}
\nabla F(x)=\left(-\frac{1}{x_1},\dots,-\frac{1}{x_n}\right).
\end{align*}
Differentiating the first derivative once more in the same direction gives
\begin{align*}
D^2F_x(h,h)
&=\sum_{i=1}^{n}\frac{h_i^2}{x_i^2}.
\end{align*}
Since this quadratic form is
\begin{align*}
D^2F_x(h,h)=h^\top \operatorname{diag}(x_1^{-2},\dots,x_n^{-2})h,
\end{align*}
the Hessian is
\begin{align*}
\nabla^2F(x)=\operatorname{diag}(x_1^{-2},\dots,x_n^{-2}).
\end{align*}
Each $x_i>0$, so every diagonal entry $x_i^{-2}$ is positive. If one coordinate $x_i$ approaches $0$ from above, then $x_i^{-2}\to\infty$, and the curvature in every direction with $h_i\ne 0$ grows without bound. This is the coordinate form of boundary repulsion in linear programming.
[/example]
The positive semidefinite cone replaces coordinatewise positivity by spectral positivity. To formulate its barrier, we need the exact description of which symmetric matrices count as interior points.
[definition: Positive Semidefinite Cone]
Let $\mathbb S^n$ be the vector space of real symmetric $n\times n$ matrices. The positive semidefinite cone is
\begin{align*}
\mathbb S^n_+ = \{X\in\mathbb S^n : v^\top Xv\ge 0 \text{ for all } v\in\mathbb R^n\}.
\end{align*}
Its interior is
\begin{align*}
(\mathbb S^n_+)^\circ = \{X\in\mathbb S^n : v^\top Xv>0 \text{ for all } v\in\mathbb R^n\setminus\{0\}\}.
\end{align*}
[/definition]
For matrices in the interior, all eigenvalues are positive and the determinant is positive. The determinant barrier is the matrix analogue of the coordinate logarithmic barrier.
[example: Determinant Barrier for Semidefinite Programming]
For $K=\mathbb S^n_+$, define
\begin{align*}
F(X)=-\log\det X
\end{align*}
on $(\mathbb S^n_+)^\circ$, so $X$ is positive definite. Fix $H\in\mathbb S^n$ and set
\begin{align*}
M=X^{-1/2}HX^{-1/2}.
\end{align*}
For all sufficiently small $r$, the matrix $X+rH$ remains positive definite, and
\begin{align*}
X+rH=X^{1/2}(I+rM)X^{1/2}.
\end{align*}
Taking determinants and using multiplicativity gives
\begin{align*}
\det(X+rH)=\det(X^{1/2})\det(I+rM)\det(X^{1/2})=\det X\det(I+rM).
\end{align*}
Since $M$ is symmetric, choose an orthogonal matrix $Q$ and real eigenvalues $\lambda_1,\dots,\lambda_n$ such that
\begin{align*}
M=Q\operatorname{diag}(\lambda_1,\dots,\lambda_n)Q^\top.
\end{align*}
Then
\begin{align*}
\det(I+rM)=\det\bigl(Q(I+r\operatorname{diag}(\lambda_1,\dots,\lambda_n))Q^\top\bigr)=\prod_{j=1}^{n}(1+r\lambda_j).
\end{align*}
Thus the one-variable restriction $\varphi(r)=F(X+rH)$ satisfies
\begin{align*}
\varphi(r)=-\log\det X-\sum_{j=1}^{n}\log(1+r\lambda_j).
\end{align*}
Differentiating at $r=0$ gives
\begin{align*}
DF_X(H)=\varphi'(0)=-\sum_{j=1}^{n}\lambda_j=-\operatorname{tr}(M).
\end{align*}
Substituting the definition of $M$ gives
\begin{align*}
DF_X(H)=-\operatorname{tr}(X^{-1/2}HX^{-1/2}).
\end{align*}
By cyclic invariance of trace,
\begin{align*}
\operatorname{tr}(X^{-1/2}HX^{-1/2})=\operatorname{tr}(X^{-1}H).
\end{align*}
Therefore
\begin{align*}
DF_X(H)=-\operatorname{tr}(X^{-1}H).
\end{align*}
With respect to the trace inner product $\langle A,H\rangle=\operatorname{tr}(AH)$ on $\mathbb S^n$, this is
\begin{align*}
DF_X(H)=\operatorname{tr}((-X^{-1})H),
\end{align*}
so
\begin{align*}
\nabla F(X)=-X^{-1}.
\end{align*}
Differentiating the same scalar expression a second time gives
\begin{align*}
\nabla^2F(X)[H,H]=\varphi''(0)=\sum_{j=1}^{n}\lambda_j^2=\operatorname{tr}(M^2).
\end{align*}
Since
\begin{align*}
M^2=X^{-1/2}HX^{-1}HX^{-1/2},
\end{align*}
cyclic invariance of trace gives
\begin{align*}
\operatorname{tr}(M^2)=\operatorname{tr}(X^{-1}HX^{-1}H).
\end{align*}
Hence
\begin{align*}
\nabla^2F(X)[H,H]=\operatorname{tr}(X^{-1}HX^{-1}H).
\end{align*}
Thus the determinant barrier has gradient $-X^{-1}$ and Hessian quadratic form $\operatorname{tr}(X^{-1}HX^{-1}H)$. If $X_k$ approaches a singular positive semidefinite matrix from the interior, then its positive eigenvalues have product $\det X_k\to 0$, so $\log\det X_k\to -\infty$ and therefore $F(X_k)=-\log\det X_k\to\infty$.
[/example]
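The two derivative formulas can be tested numerically in the $2\times 2$ case, where determinants and inverses are explicit. The matrices `X` and `H` below are hypothetical test data, and the helper names are illustrative; the sketch compares $-\operatorname{tr}(X^{-1}H)$ and $\operatorname{tr}(X^{-1}HX^{-1}H)$ with central finite differences of $r\mapsto F(X+rH)$.

```python
import math


def det2(X):
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]


def inv2(X):
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]


def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def trace2(A):
    return A[0][0] + A[1][1]


def F(X):
    """Determinant barrier -log det X on positive definite 2x2 matrices."""
    return -math.log(det2(X))


def shifted(X, H, r):
    return [[X[i][j] + r * H[i][j] for j in range(2)] for i in range(2)]


X = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical positive definite point
H = [[0.3, -0.2], [-0.2, 0.7]]    # hypothetical symmetric direction

P = matmul2(inv2(X), H)
first_formula = -trace2(P)               # -tr(X^{-1} H)
second_formula = trace2(matmul2(P, P))   # tr(X^{-1} H X^{-1} H)

r = 1e-4
first_fd = (F(shifted(X, H, r)) - F(shifted(X, H, -r))) / (2 * r)
second_fd = (F(shifted(X, H, r)) - 2 * F(X) + F(shifted(X, H, -r))) / r**2

print(first_formula, first_fd)
print(second_formula, second_fd)
```

Each pair of printed numbers should agree to several digits. The second value is positive for any nonzero symmetric `H`, which reflects strict convexity of the determinant barrier along every line.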
These formulas are the source of the Newton systems used for semidefinite programming. More conceptually, they show that curvature is measured in the local metric induced by $X^{-1}$.
## Self-Concordance and Complexity Control
What extra regularity is needed to make Newton's method globally controllable along a barrier path? Convexity and blow-up give existence, but they do not control how fast curvature changes. Self-concordance is the condition that bounds third derivatives by the local quadratic norm generated by the Hessian.
[definition: Self-Concordant Function]
Let $U\subset\mathbb R^n$ be open and convex. A three-times continuously differentiable convex function $F:U\to\mathbb R$ is self-concordant if for every $x\in U$ and every $h\in\mathbb R^n$,
\begin{align*}
|D^3F_x[h,h,h]|\le 2\bigl(D^2F_x[h,h]\bigr)^{3/2}.
\end{align*}
[/definition]
The definition is affine invariant and is designed for Newton analysis. The Hessian gives a local norm, and the third-derivative bound says that this local norm cannot change too violently over one Newton step.
[definition: Self-Concordant Barrier Parameter]
Let $K\subset\mathbb R^n$ be a closed convex cone with nonempty interior. A self-concordant barrier $F:K^\circ\to\mathbb R$ has parameter $\nu>0$ if, for every $x\in K^\circ$,
\begin{align*}
|DF_x(h)|\le \sqrt{\nu}\bigl(D^2F_x[h,h]\bigr)^{1/2}\qquad \text{for all }h\in\mathbb R^n.
\end{align*}
[/definition]
The parameter $\nu$ is a measure of barrier complexity, but the definition by itself does not say whether useful cones admit barriers with finite parameter. This raises the existence question addressed by Nesterov-Todd theory: can the geometry of a cone supply the curvature control needed for polynomial-time interior-point theory? For the symmetric cones used in semidefinite and second-order cone programming, the structural theory supplies self-concordant barriers with finite parameter.
The existence of such barriers for symmetric cones belongs to the general theory of self-concordant barriers and Euclidean Jordan algebras, so the course records it as structural input rather than proving it. Properness, closedness, and nonempty interior are not decorative assumptions: without a pointed closed cone with nonempty interior, the barrier may fail to control escape directions or may not have a meaningful boundary to repel from. The same theory also explains why arbitrary blow-up barriers are insufficient for complexity theory. Polynomial-time Newton analysis needs quantitative control of curvature variation, and self-concordance supplies exactly that control.
[remark: Barrier Parameters in Basic Cones]
For the nonnegative orthant $\mathbb R^n_+$, the standard barrier $F(x)=-\sum_i\log x_i$ has parameter $n$. For the positive semidefinite cone $\mathbb S^n_+$, the determinant barrier $F(X)=-\log\det X$ has parameter $n$. These parameters match the rank of the corresponding symmetric cone, not the ambient dimension in the semidefinite case.
[/remark]
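For the orthant barrier both defining inequalities reduce to elementary sums, so they can be spot-checked on random data. The sketch below uses the directional derivatives of $F(x)=-\sum_i\log x_i$, namely $DF_x(h)=-\sum_i u_i$, $D^2F_x[h,h]=\sum_i u_i^2$, and $D^3F_x[h,h,h]=-2\sum_i u_i^3$ with $u_i=h_i/x_i$. The sampling ranges, the seed, and the function names are arbitrary illustrative choices, and a random test is evidence rather than a proof.

```python
import random


def directional_derivatives(x, h):
    """First three directional derivatives of F(x) = -sum(log x_i) along h."""
    u = [hi / xi for hi, xi in zip(h, x)]
    d1 = -sum(u)
    d2 = sum(ui ** 2 for ui in u)
    d3 = -2.0 * sum(ui ** 3 for ui in u)
    return d1, d2, d3


def check_orthant_barrier(n, trials=1000, seed=0):
    """Test both inequalities at random interior points and random directions."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(0.1, 5.0) for _ in range(n)]
        h = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        d1, d2, d3 = directional_derivatives(x, h)
        tol = 1e-12 * (1.0 + d2 ** 1.5)
        if abs(d3) > 2.0 * d2 ** 1.5 + tol:       # self-concordance inequality
            return False
        if abs(d1) > (n * d2) ** 0.5 + tol:       # barrier parameter nu = n
            return False
    return True


print(check_orthant_barrier(4))
```

The parameter bound is the Cauchy-Schwarz inequality applied to $u$ and the all-ones vector, and it holds with equality in the direction $h=x$, so no parameter smaller than $n$ works for this particular barrier.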
The remark explains why the numerical size of a conic problem is not the only measure of its interior-point complexity. The remaining question is variational rather than computational: after building a central path with a controlled barrier, does the path actually converge toward solutions of the original constrained problem?
[quotetheorem:6717]
[citeproof:6717]
The lower bound on $\Phi$ is the key technical condition in this convergence argument. Without it, the barrier term could become large and negative along the path, and the division by $t$ would not justify comparison with an optimal point. The theorem also proves only convergence of accumulation points; it does not assert that the whole path converges unless the optimal solution is unique or additional geometry selects a single limit. Barriers create smooth interior minimizers, KKT equations reinterpret those minimizers as primal-dual points, and self-concordance supplies the curvature control required for the algorithmic theory developed after the purely convex-analytic foundations.
# 11. Applications and Modelling Principles
This chapter closes the theory course by showing how the convex-analytic machinery from Chapters 1 through 10 becomes a modelling language. The point is not to collect unrelated applications, but to recognise a small number of reusable transformations: uncertainty becomes a support function, regularisation becomes a penalty or constraint with a dual certificate, and low-dimensional structure becomes a convex surrogate. These examples also prepare the transition from theory to algorithms, since conic form, KKT conditions, and duality gaps are the objects numerical solvers actually manipulate.
## Robust Linear Constraints as Conic Models
A deterministic model often contains coefficients that have been estimated from data or simplified from a physical system. The modelling question is how to impose a constraint for every plausible value of an uncertain coefficient without producing an intractable semi-infinite programme.
[definition: Robust Linear Constraint]
Let $x \in \mathbb R^n$ be the decision variable, let $b \in \mathbb R$, and let $\mathcal U \subset \mathbb R^n$ be an uncertainty set. The robust linear constraint associated to $\mathcal U$ is
\begin{align*}
a^\top x \le b \quad \text{for all } a \in \mathcal U.
\end{align*}
[/definition]
The definition packages infinitely many inequalities into one modelling requirement. Its tractability depends almost entirely on the geometry of $\mathcal U$, because the worst-case left-hand side is the support function of $\mathcal U$ evaluated at $x$. The next theorem answers the practical question raised by the definition: when the uncertainty is ellipsoidal, the infinitely many inequalities collapse to a single second-order cone constraint.
[quotetheorem:6718]
[citeproof:6718]
This result is the model example for robust optimisation: uncertainty has disappeared, but its support function remains as a conic constraint. The ellipsoidal hypothesis is essential for the Euclidean norm formula; for example, if $\mathcal U=\{\bar a+u: |u_i|\le \rho\}$, then the support function term is $\rho\|x\|_1$, not $\rho|x|$. The theorem also does not say that the chosen uncertainty set is statistically correct, nor that the resulting feasible point is optimal for any objective; it only converts one robust constraint into cone form. This distinction is why the next examples keep track of how the uncertainty set is chosen and how conservatism enters the model.
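The contrast between the two uncertainty shapes can be checked directly. The sketch below evaluates the support function of a box by enumerating its vertices (a linear function on a polytope attains its supremum at a vertex) and the support function of a Euclidean ball at its known maximiser. The function names and the test vector are illustrative assumptions.

```python
import itertools
import math

def support_box(x, rho):
    """sup of u.x over |u_i| <= rho, found by enumerating the box vertices."""
    best = -math.inf
    for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
        best = max(best, sum(rho * s * xi for s, xi in zip(signs, x)))
    return best

def support_ball(x, rho):
    """sup of u.x over |u| <= rho, attained at u = rho * x / |x| when x != 0."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm == 0.0:
        return 0.0
    u = [rho * xi / norm for xi in x]
    return sum(ui * xi for ui, xi in zip(u, x))

x = [3.0, -4.0, 0.0]
rho = 0.5
box_value = support_box(x, rho)      # compare with rho * ||x||_1
ball_value = support_ball(x, rho)    # compare with rho * |x|
l1_norm = sum(abs(xi) for xi in x)
l2_norm = math.sqrt(sum(xi * xi for xi in x))
print(box_value, rho * l1_norm, ball_value, rho * l2_norm)
```

The box value is never smaller than the ball value for the same $\rho$, which is one way conservatism enters through the choice of $\mathcal U$.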
[example: Robust Portfolio Allocation]
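Consider a portfolio vector $x \in \mathbb R^n$ with budget $\mathbf{1}^\top x=1$ and nonnegative holdings $x_i\ge 0$. The uncertain mean return is $\mu=\bar\mu+Pu$, where $\bar\mu\in\mathbb R^n$ is the nominal mean, $P\in\mathbb R^{n\times m}$ is a fixed matrix shaping the uncertainty, and $u\in\mathbb R^m$ satisfies $|u|\le \rho$ for a radius $\rho\ge 0$. The robust return requirement is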
\begin{align*}
(\bar\mu+Pu)^\top x \ge r \quad \text{for every } u\in\mathbb R^m \text{ with } |u|\le \rho .
\end{align*}
For fixed $x$, this condition is equivalent to requiring the smallest possible left-hand side over the uncertainty ball to be at least $r$. Expanding the inner product gives
\begin{align*}
(\bar\mu+Pu)^\top x=\bar\mu^\top x+(Pu)^\top x.
\end{align*}
Since $(Pu)^\top x=u^\top P^\top x$, the worst-case return is
\begin{align*}
\inf_{|u|\le \rho}(\bar\mu+Pu)^\top x=\bar\mu^\top x+\inf_{|u|\le \rho}u^\top P^\top x.
\end{align*}
The infimum term is the negative of the Euclidean-ball support function:
\begin{align*}
\inf_{|u|\le \rho}u^\top P^\top x=-\sup_{|u|\le \rho}u^\top(-P^\top x).
\end{align*}
By the Euclidean-ball support-function computation used in *Robust Linear Constraint SOCP Reformulation*,
\begin{align*}
\sup_{|u|\le \rho}u^\top(-P^\top x)=\rho|-P^\top x|.
\end{align*}
Because $|-P^\top x|=|P^\top x|$, we obtain
\begin{align*}
\inf_{|u|\le \rho}(\bar\mu+Pu)^\top x=\bar\mu^\top x-\rho|P^\top x|.
\end{align*}
Hence $\mu^\top x\ge r$ for all $|u|\le \rho$ is equivalent to
\begin{align*}
\bar\mu^\top x-\rho|P^\top x|\ge r.
\end{align*}
Introducing $t\in\mathbb R$ gives the second-order cone representation: impose $|P^\top x|\le t$ and impose the affine inequality $\bar\mu^\top x-\rho t\ge r$.
If the portfolio also has a risk constraint $x^\top\Sigma x\le \sigma^2$, where $\Sigma\succeq 0$ and $\sigma\ge 0$, choose $L$ with $\Sigma=L^\top L$. Then
\begin{align*}
x^\top\Sigma x=x^\top L^\top Lx.
\end{align*}
Since $x^\top L^\top=(Lx)^\top$, this becomes
\begin{align*}
x^\top L^\top Lx=(Lx)^\top(Lx)=|Lx|^2.
\end{align*}
Thus $x^\top\Sigma x\le \sigma^2$ is equivalent to $|Lx|^2\le \sigma^2$, and because $|Lx|\ge 0$ and $\sigma\ge 0$, this is equivalent to the cone constraint
\begin{align*}
|Lx|\le \sigma.
\end{align*}
The robust portfolio model is therefore an SOCP: the budget and nonnegativity constraints are affine, return uncertainty contributes the cone $|P^\top x|\le t$, and quadratic risk contributes the cone $|Lx|\le \sigma$.
[/example]
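The worst-case return formula from the example can be sanity-checked numerically. The sketch below uses small hypothetical data for $\bar\mu$, $P$, $x$, and $\rho$, evaluates the closed form $\bar\mu^\top x-\rho|P^\top x|$, confirms it is attained at $u^\star=-\rho P^\top x/|P^\top x|$, and verifies that random points of the uncertainty ball never produce a lower return.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def mat_t_vec(P, x):
    """P^T x for P stored as n rows of length m."""
    return [sum(P[i][j] * x[i] for i in range(len(x))) for j in range(len(P[0]))]

def worst_case_return(mu_bar, P, x, rho):
    """Closed form mu_bar^T x - rho * |P^T x| derived in the example."""
    return dot(mu_bar, x) - rho * norm(mat_t_vec(P, x))

def realised_return(mu_bar, P, x, u):
    """(mu_bar + P u)^T x, expanded as mu_bar^T x + u^T P^T x."""
    return dot(mu_bar, x) + dot(u, mat_t_vec(P, x))

# Hypothetical data: three assets, two uncertainty factors.
mu_bar = [0.08, 0.05, 0.03]
P = [[0.02, 0.00], [0.01, 0.01], [0.00, 0.005]]
x = [0.5, 0.3, 0.2]
rho = 1.5

closed_form = worst_case_return(mu_bar, P, x, rho)

# The minimising perturbation points against P^T x with full radius rho.
g = mat_t_vec(P, x)
u_star = [-rho * gi / norm(g) for gi in g]
attained = realised_return(mu_bar, P, x, u_star)

# Random points in the ball |u| <= rho should never beat the closed form.
random.seed(0)
sampled_min = math.inf
for _ in range(2000):
    d = [random.gauss(0.0, 1.0) for _ in g]
    r = rho * random.random()
    u = [r * di / norm(d) for di in d]
    sampled_min = min(sampled_min, realised_return(mu_bar, P, x, u))

print(closed_form, attained, sampled_min)
```

Sampling only illustrates the inequality; the equivalence itself rests on the support-function computation in the example.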
The portfolio example also shows the modelling tradeoff. Increasing $\rho$ makes the feasible set smaller and the decision more conservative; changing the shape of $\mathcal U$ changes which estimation errors are protected against.
## Statistical Estimation and Dual Certificates
Statistical estimation asks for a point that fits observed data while respecting a structural preference such as sparsity, smoothness, or small norm. Convex optimisation enters when the loss and the structural penalty are convex, so that optimality can be certified by a subgradient equation rather than by local search.
[definition: Regularised Convex Estimator]
Let $\ell: \mathbb R^n \to \mathbb R$ be a convex differentiable loss, let $R: \mathbb R^n \to (-\infty,\infty]$ be a proper closed convex regulariser, and let $\lambda > 0$. A regularised convex estimator is any solution of
\begin{align*}
\min_{x \in \mathbb R^n}\ \ell(x) + \lambda R(x).
\end{align*}
[/definition]
The definition covers ridge regression, lasso, logistic regression with penalties, constrained maximum likelihood after taking negative log-likelihood, and many inverse problems. To use such a model, we need a finite certificate that a proposed estimator is not merely locally stationary but globally optimal. The next theorem supplies that certificate by translating optimality into a subgradient membership condition.
[quotetheorem:6719]
[citeproof:6719]
This theorem is representer-style because the data enter the certificate through $\nabla \ell(\hat x)$, while the model structure is encoded in $\partial R(\hat x)$. Convexity is the hypothesis that turns the first-order condition into a global certificate: if $R=0$ and $\ell(x)=x^4-x^2$ on $\mathbb R$, then $\nabla\ell(0)=0$ but $0$ is not a global minimiser. The theorem also does not guarantee uniqueness, statistical consistency, or recovery of the true parameter; those require additional assumptions on the loss, the data, and the regulariser. For norms, the subdifferential is described by the dual norm and an exposed face, so certificates become geometric objects.
[example: Lasso Dual Certificate]
For least squares with an $\ell^1$ penalty, let $A \in \mathbb R^{m \times n}$, $y \in \mathbb R^m$, $\lambda>0$, and consider
\begin{align*}
\min_{x \in \mathbb R^n}\ \frac{1}{2}|Ax-y|^2 + \lambda \|x\|_1.
\end{align*}
Write $\ell(x)=\frac{1}{2}|Ax-y|^2$. Since
\begin{align*}
\ell(x)=\frac{1}{2}(Ax-y)^\top(Ax-y)=\frac{1}{2}\sum_{i=1}^m\left(\sum_{k=1}^n A_{ik}x_k-y_i\right)^2,
\end{align*}
the $j$th partial derivative is
\begin{align*}
\frac{\partial \ell}{\partial x_j}(x)=\sum_{i=1}^m\left(\sum_{k=1}^n A_{ik}x_k-y_i\right)A_{ij}.
\end{align*}
The $j$th component of $A^\top(Ax-y)$ is the same expression, so
\begin{align*}
\nabla \ell(x)=A^\top(Ax-y).
\end{align*}
By the *Representer-Style Optimality Certificate for Regularised Convex Estimation*, a vector $\hat x$ is optimal exactly when there exists $z\in \partial \|\hat x\|_1$ such that
\begin{align*}
A^\top(A\hat x-y)+\lambda z=0.
\end{align*}
It remains to spell out the subdifferential condition coordinate by coordinate. Since
\begin{align*}
\|x\|_1=\sum_{i=1}^n |x_i|,
\end{align*}
the one-dimensional absolute value has subdifferential $\partial |s|=\{1\}$ when $s>0$, $\partial |s|=\{-1\}$ when $s<0$, and $\partial |s|=[-1,1]$ when $s=0$. Therefore $z\in \partial\|\hat x\|_1$ exactly when $z_i=\operatorname{sgn}(\hat x_i)$ for every coordinate with $\hat x_i\ne 0$, and $|z_i|\le 1$ for every coordinate with $\hat x_i=0$.
Thus $\hat x$ is optimal exactly when one can choose $z\in\mathbb R^n$ satisfying
\begin{align*}
A^\top(A\hat x-y)+\lambda z=0,
\end{align*}
with $z_i=\operatorname{sgn}(\hat x_i)$ on the support of $\hat x$ and $|z_i|\le 1$ on the inactive coordinates. The active coordinates force the certificate onto the exposed face of the dual $\ell^\infty$ ball determined by the sign pattern of $\hat x$, while the inactive coordinates are certified by containment in the interval $[-1,1]$.
[/example]
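The certificate condition is easy to test on a case with a known solution. The sketch below assumes the special design $A=I$, for which the lasso solution is coordinatewise soft-thresholding of $y$, and then recovers $z=-A^\top(A\hat x-y)/\lambda$ and checks the sign and interval conditions. The helper names and data are illustrative assumptions.

```python
def soft_threshold(v, lam):
    """Minimiser of 0.5*(x - v)^2 + lam*|x| in one variable."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def lasso_certificate(A, y, x_hat, lam):
    """Solve A^T (A x_hat - y) + lam * z = 0 for z."""
    m, n = len(A), len(A[0])
    residual = [sum(A[i][k] * x_hat[k] for k in range(n)) - y[i] for i in range(m)]
    grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
    return [-gj / lam for gj in grad]

def certifies(z, x_hat, tol=1e-9):
    """Check z_i = sgn(x_i) on the support and |z_i| <= 1 off the support."""
    for zi, xi in zip(z, x_hat):
        if xi > 0 and abs(zi - 1.0) > tol:
            return False
        if xi < 0 and abs(zi + 1.0) > tol:
            return False
        if xi == 0 and abs(zi) > 1.0 + tol:
            return False
    return True

A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # assumption: identity design
y = [2.0, -0.3, -1.5]
lam = 0.5

x_hat = [soft_threshold(v, lam) for v in y]
z = lasso_certificate(A, y, x_hat, lam)
optimal_ok = certifies(z, x_hat)

# The unpenalised least-squares solution x = y gives z = 0, which fails the
# sign condition on its nonzero coordinates.
z_ls = lasso_certificate(A, y, y, lam)
least_squares_ok = certifies(z_ls, y)
print(x_hat, z, optimal_ok, least_squares_ok)
```

The same `certifies` check applies to any design matrix; only the way the candidate $\hat x$ is produced relies on $A=I$.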
Maximum likelihood fits the same pattern after replacing the likelihood by its negative logarithm. Constraints on covariance matrices, probabilities, or intensities are convex when the parameter space is a convex cone or an affine slice of a convex cone.
[example: Covariance Selection by Log-Determinant Optimisation]
Let $S \in \mathbb R^{n \times n}$ be a symmetric sample covariance matrix, let $X \succ 0$ represent a precision matrix, and consider
\begin{align*}
\min_{X \succ 0}\ F(X)=-\log\det X+\operatorname{tr}(SX)+\lambda\sum_{i\ne j}|X_{ij}|.
\end{align*}
We compute the optimality certificate for a candidate $\hat X\succ 0$. For a symmetric perturbation $H$, Jacobi's formula gives
\begin{align*}
\frac{d}{dt}\det(X+tH)\bigg|_{t=0}=\det(X)\operatorname{tr}(X^{-1}H).
\end{align*}
Therefore
\begin{align*}
\frac{d}{dt}\left[-\log\det(X+tH)\right]\bigg|_{t=0}=-\frac{1}{\det X}\det(X)\operatorname{tr}(X^{-1}H)=-\operatorname{tr}(X^{-1}H).
\end{align*}
The trace term satisfies
\begin{align*}
\operatorname{tr}(S(X+tH))=\operatorname{tr}(SX)+t\operatorname{tr}(SH),
\end{align*}
so
\begin{align*}
\frac{d}{dt}\operatorname{tr}(S(X+tH))\bigg|_{t=0}=\operatorname{tr}(SH).
\end{align*}
With the Frobenius pairing $\langle A,H\rangle=\operatorname{tr}(A^\top H)$, and using that $X^{-1}$ and $S$ are symmetric, these derivatives give
\begin{align*}
\nabla\left(-\log\det X+\operatorname{tr}(SX)\right)=-X^{-1}+S.
\end{align*}
Now write
\begin{align*}
R(X)=\sum_{i\ne j}|X_{ij}|.
\end{align*}
For the one-dimensional absolute value, $\partial |s|=\{1\}$ when $s>0$, $\partial |s|=\{-1\}$ when $s<0$, and $\partial |s|=[-1,1]$ when $s=0$. Since $R$ is a sum of entrywise absolute values over the off-diagonal entries only, $Z\in \partial R(\hat X)$ exactly when three conditions hold. The diagonal entries are unpenalised, so $Z_{ii}=0$ for every $i$. When $i\ne j$ and $\hat X_{ij}\ne 0$, one has $Z_{ij}=\operatorname{sgn}(\hat X_{ij})$. When $i\ne j$ and $\hat X_{ij}=0$, one has $|Z_{ij}|\le 1$.
By *Representer-Style Optimality Certificate for Regularised Convex Estimation*, $\hat X$ is optimal exactly when there is such a matrix $Z$ satisfying
\begin{align*}
-\hat X^{-1}+S+\lambda Z=0.
\end{align*}
The diagonal part of the certificate is unpenalised, while each off-diagonal entry either records the sign of a nonzero conditional dependence or certifies a zero entry by lying in the interval $[-1,1]$.
[/example]
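A simple consequence of the certificate is that a diagonal precision matrix is optimal once the penalty dominates the off-diagonal sample covariances. The sketch below tests the candidate $\hat X=\operatorname{diag}(1/S_{ii})$: its inverse is $\operatorname{diag}(S_{ii})$, so the certificate equation forces $Z_{ii}=0$ and $Z_{ij}=-S_{ij}/\lambda$ off the diagonal, and validity reduces to $|S_{ij}|\le\lambda$. The matrix $S$ and the function names are hypothetical, and the check covers diagonal candidates only.

```python
def certificate_for_diagonal(S, lam):
    """Solve -X^{-1} + S + lam*Z = 0 for Z when X = diag(1/S_ii).

    Here X^{-1} = diag(S_ii), so Z has zero diagonal and off-diagonal
    entries -S_ij / lam.
    """
    n = len(S)
    return [[0.0 if i == j else -S[i][j] / lam for j in range(n)] for i in range(n)]

def is_valid_certificate(Z, tol=1e-12):
    """Conditions for a candidate whose off-diagonal entries are all zero:
    zero diagonal and off-diagonal entries in [-1, 1]."""
    n = len(Z)
    for i in range(n):
        for j in range(n):
            if i == j and Z[i][j] != 0.0:
                return False
            if i != j and abs(Z[i][j]) > 1.0 + tol:
                return False
    return True

# Hypothetical symmetric sample covariance with positive diagonal.
S = [[2.0, 0.3, -0.1],
     [0.3, 1.0, 0.2],
     [-0.1, 0.2, 1.5]]

strong_penalty_ok = is_valid_certificate(certificate_for_diagonal(S, 0.4))
weak_penalty_ok = is_valid_certificate(certificate_for_diagonal(S, 0.2))
print(strong_penalty_ok, weak_penalty_ok)
```

With $\lambda=0.4$ every off-diagonal entry of $S$ is dominated, so the diagonal candidate is certified; with $\lambda=0.2$ the entry $S_{12}=0.3$ breaks the bound and the optimum must have a nonzero off-diagonal entry.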
Here the cone $X \succ 0$ is not cosmetic: it is the domain on which the log-determinant barrier is finite, and it enforces positive definiteness of the estimated precision matrix. The KKT equation is also the statistical score equation modified by a convex penalty.
## Control and Signal Recovery as Convex Modelling Templates
Many applied problems are not born convex; they become convex after choosing the right variables, replacing rank or sparsity by a convex surrogate, or lifting nonlinear expressions into cones. The guiding question is which physical or statistical structure should be preserved exactly and which nonconvex structure should be relaxed.
[definition: Convex Relaxation]
Let $S \subset \mathbb R^n$ be the feasible set of an optimisation problem with objective $f:S\to\mathbb R$, let $Y\subset \mathbb R^m$ be a convex decision space, and let $\Phi:S\to Y$ be a modelling map. A convex relaxation is a convex optimisation problem on a convex feasible set $C\subset Y$ such that $\Phi(S)\subset C$ and whose objective agrees with, bounds, or convexly extends $f$ on the embedded decisions $\Phi(S)$.
[/definition]
A relaxation is useful only when its solution can be interpreted. Sometimes the relaxation is exact under a certificate; sometimes it gives a bound or a stable approximation.
[example: Nuclear-Norm Matrix Completion]
Let $M\in\mathbb R^{p\times q}$ be an unknown matrix whose entries are observed on an index set $\Omega\subset\{1,\dots,p\}\times\{1,\dots,q\}$. Let $\mathcal P_\Omega:\mathbb R^{p\times q}\to\mathbb R^{p\times q}$ denote the observation operator defined entrywise by $(\mathcal P_\Omega X)_{ij}=X_{ij}$ when $(i,j)\in\Omega$ and $(\mathcal P_\Omega X)_{ij}=0$ when $(i,j)\notin\Omega$. The rank minimisation model requires $\mathcal P_\Omega X=\mathcal P_\Omega M$ and minimises $\operatorname{rank}(X)$. The convex relaxation replaces rank by the nuclear norm:
\begin{align*}
\min_{X\in\mathbb R^{p\times q}}\ \|X\|_* \quad \text{subject to} \quad \mathcal P_\Omega X=\mathcal P_\Omega M.
\end{align*}
If $X$ has singular values $\sigma_1(X),\dots,\sigma_s(X)$, where $s=\min\{p,q\}$, then
\begin{align*}
\operatorname{rank}(X)=|\{k:\sigma_k(X)>0\}|.
\end{align*}
The nuclear norm is
\begin{align*}
\|X\|_*=\sum_{k=1}^s \sigma_k(X).
\end{align*}
On the spectral-norm unit ball $\|X\|_2\le 1$, each singular value satisfies $0\le \sigma_k(X)\le 1$, so the nuclear norm replaces the count of nonzero singular values by the convex sum of their magnitudes. More precisely, by the standard *Convex Envelope of Rank on the Spectral Norm Ball* fact, $\|X\|_*$ is the convex envelope of $\operatorname{rank}(X)$ on $\{X:\|X\|_2\le 1\}$.
Now suppose the unknown matrix has compact [singular value decomposition](/theorems/3071)
\begin{align*}
M=U\Sigma V^\top,
\end{align*}
where $U^\top U=I$, $V^\top V=I$, and the columns of $U$ and $V$ span the left and right singular spaces of $M$. The equality constraint has normal directions supported on the observed entries. Indeed, for every perturbation $H$ and every matrix $Y$,
\begin{align*}
\langle Y,\mathcal P_\Omega H\rangle=\sum_{(i,j)\in\Omega}Y_{ij}H_{ij}.
\end{align*}
Since $(\mathcal P_\Omega Y)_{ij}=Y_{ij}$ on $\Omega$ and $(\mathcal P_\Omega Y)_{ij}=0$ off $\Omega$, the same sum is
\begin{align*}
\sum_{(i,j)\in\Omega}Y_{ij}H_{ij}=\langle \mathcal P_\Omega Y,H\rangle.
\end{align*}
Hence a dual certificate for optimality of $M$ in the relaxed problem is a matrix $Y$ supported on the observed entries, meaning $\mathcal P_\Omega Y=Y$, such that
\begin{align*}
Y\in \partial \|M\|_*.
\end{align*}
Using the standard *Subdifferential of the Nuclear Norm* formula, this subgradient condition means that $Y$ can be written as
\begin{align*}
Y=UV^\top+W.
\end{align*}
The correction term $W$ must be orthogonal to the singular spaces:
\begin{align*}
U^\top W=0.
\end{align*}
It must also satisfy
\begin{align*}
WV=0.
\end{align*}
Finally, its spectral norm is bounded by
\begin{align*}
\|W\|_2\le 1.
\end{align*}
Thus the certificate is supported only on observed entries, agrees with $UV^\top$ on the singular-space directions of $M$, and has spectral norm at most $1$ on the orthogonal directions. In this sense the matrix-completion relaxation is parallel to lasso: the $\ell^1$ norm promotes coordinate sparsity through an $\ell^\infty$ dual certificate, while the nuclear norm promotes low rank through a spectral-norm dual certificate.
[/example]
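A $2\times 2$ instance shows the certificate conditions at work without any linear-algebra library. The sketch below takes the hypothetical rank-one matrix $M=e_1e_1^\top$ with the entry $(2,2)$ unobserved, so $U=V=e_1$ and any admissible $W$ can be nonzero only in that unobserved entry; the support requirement then forces $W=0$ and $Y=UV^\top$. It checks the self-adjointness of $\mathcal P_\Omega$ and the subgradient inequality $\|X\|_*\ge\|M\|_*+\langle Y,X-M\rangle$ on random matrices, using the identity $(\sigma_1+\sigma_2)^2=\|X\|_F^2+2|\det X|$ valid for $2\times 2$ matrices. Indices in the code are zero-based.

```python
import math
import random

def nuclear_norm_2x2(X):
    """sigma_1 + sigma_2 via (sigma_1 + sigma_2)^2 = ||X||_F^2 + 2*|det X|."""
    fro2 = sum(v * v for row in X for v in row)
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return math.sqrt(fro2 + 2.0 * abs(det))

def project(X, omega):
    """Observation operator: keep entries indexed by omega, zero the rest."""
    return [[X[i][j] if (i, j) in omega else 0.0 for j in range(2)] for i in range(2)]

def inner(A, B):
    """Frobenius pairing."""
    return sum(A[i][j] * B[i][j] for i in range(2) for j in range(2))

def random_matrix():
    return [[random.uniform(-3.0, 3.0) for _ in range(2)] for _ in range(2)]

omega = {(0, 0), (0, 1), (1, 0)}    # entry (1, 1) is unobserved
M = [[1.0, 0.0], [0.0, 0.0]]        # rank one with U = V = e_1
Y = [[1.0, 0.0], [0.0, 0.0]]        # U V^T + W with W = 0

supported = project(Y, omega) == Y

random.seed(1)
adjoint_gap = 0.0
min_slack = math.inf
for _ in range(1000):
    G, H = random_matrix(), random_matrix()
    adjoint_gap = max(adjoint_gap,
                      abs(inner(G, project(H, omega)) - inner(project(G, omega), H)))
    diff = [[H[i][j] - M[i][j] for j in range(2)] for i in range(2)]
    slack = nuclear_norm_2x2(H) - (nuclear_norm_2x2(M) + inner(Y, diff))
    min_slack = min(min_slack, slack)

print(supported, adjoint_gap, min_slack)
```

Because $Y$ is supported on $\Omega$, the pairing $\langle Y,X-M\rangle$ vanishes for every feasible $X$, so the subgradient inequality gives $\|X\|_*\ge\|M\|_*$ on the feasible set.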
Replacing the $\ell^1$ norm by the nuclear norm and the dual $\ell^\infty$ ball by the spectral-norm ball is one instance of a broader modelling task: identify when a nonlinear or quadratic-looking constraint has hidden cone structure. The next definition names the SOCP representation used for many such constraints.
[definition: Second-Order Cone Representable Constraint]
Let the decision variable be $x\in\mathbb R^n$. A convex constraint on $x$ is second-order cone representable if it can be written using affine equalities and inequalities together with finitely many constraints of the form
\begin{align*}
|u(x)| \le v(x),
\end{align*}
where $u:\mathbb R^n\to\mathbb R^k$ and $v:\mathbb R^n\to\mathbb R$ are affine maps.
[/definition]
Conic representability matters in control and signal processing because many energy, gain, and uncertainty constraints are quadratic before reformulation. The difficulty is that a quadratic inequality is not syntactically a second-order cone constraint, even when its sublevel set is convex. To use SOCP duality and algorithms, the square term must be exposed as a norm bound and the affine upper bound must appear as the cone's scalar side without changing the feasible set. The result below gives this standard rotated-cone conversion for positive semidefinite quadratic forms.
[quotetheorem:6720]
[citeproof:6720]
This reformulation explains why quadratically constrained convex problems often sit next to SOCPs in applications. The positive semidefinite hypothesis is essential: an indefinite quadratic such as $x_1^2-x_2^2\le 1$ has nonconvex sublevel geometry and cannot be represented by second-order cones as a convex constraint. The theorem also does not say that every quadratic constraint is SOCP-representable, nor does it remove the requirement $h^\top x+r\ge 0$; it only treats convex upper bounds after a positive semidefinite factorisation. It also clarifies the role of modelling variables: auxiliary variables do not change the mathematical requirement, but they expose the cone structure needed by duality and algorithms.
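The nonconvexity of the indefinite example can be exhibited with two feasible points whose midpoint is infeasible. The points below are an illustrative choice.

```python
def indefinite_quadratic(x1, x2):
    """The indefinite form x1^2 - x2^2 from the text."""
    return x1 * x1 - x2 * x2

def feasible(point):
    """Membership in the sublevel set {x : x1^2 - x2^2 <= 1}."""
    return indefinite_quadratic(*point) <= 1.0

p = (2.0, 2.0)
q = (2.0, -2.0)
midpoint = tuple(0.5 * (pi + qi) for pi, qi in zip(p, q))

# p and q are feasible, but their midpoint (2, 0) has value 4 > 1.
print(feasible(p), feasible(q), feasible(midpoint))
```

Since every second-order cone representable set is convex, this single failure of midpoint convexity rules out any such representation of the set.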
[example: Energy-Constrained Control Input]
Let $u \in \mathbb R^m$ be a control input, with affine tracking constraint $Au=b$ and energy limit $u^\top Ru\le \gamma$, where $R\succeq 0$ and $\gamma\ge 0$. Choose $L$ with $R=L^\top L$. Then
\begin{align*}
u^\top Ru=u^\top L^\top Lu.
\end{align*}
Since $u^\top L^\top=(Lu)^\top$, this becomes
\begin{align*}
u^\top L^\top Lu=(Lu)^\top(Lu)=|Lu|^2.
\end{align*}
Thus the energy constraint is
\begin{align*}
|Lu|^2\le \gamma.
\end{align*}
When $\gamma$ is a fixed nonnegative parameter, both $|Lu|$ and $\sqrt{\gamma}$ are nonnegative, so this is equivalent to
\begin{align*}
|Lu|\le \sqrt{\gamma}.
\end{align*}
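If $\gamma$ is instead an optimisation variable, the constraint is kept conic by using the rotated second-order cone form from *Rotated Cone Reformulation of a Convex Quadratic Bound*. Set $a=\gamma/2$, $b=1$, and $w=Lu$, where $a$ and $b$ denote the two scalar coordinates of the rotated cone; this $b$ is unrelated to the right-hand side of the tracking constraint $Au=b$. The rotated cone inequality $2ab\ge |w|^2$ gives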
\begin{align*}
2\left(\frac{\gamma}{2}\right)(1)\ge |Lu|^2.
\end{align*}
The left-hand side is $\gamma$, so this is
\begin{align*}
\gamma\ge |Lu|^2.
\end{align*}
Using $|Lu|^2=u^\top Ru$, the rotated cone constraint is exactly
\begin{align*}
\gamma\ge u^\top Ru.
\end{align*}
Together with $a=\gamma/2\ge 0$, this is equivalent to the original energy bound. The resulting model has affine tracking constraints and a conic energy constraint, so the controller can optimise over the energy budget $\gamma$ while preserving convexity.
[/example]
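The rotated cone used in the example can be converted to an ordinary second-order cone constraint by the standard identity $(a+b)^2-(a-b)^2=4ab$: the conditions $2ab\ge|w|^2$, $a\ge 0$, $b\ge 0$ hold exactly when $|(\sqrt2\,w,\ a-b)|\le a+b$. The sketch below checks that both membership tests agree for the energy constraint with $a=\gamma/2$, $b=1$, $w=Lu$. The factor $L$, the input $u$, and the function names are hypothetical.

```python
import math

def mat_vec(L, u):
    return [sum(row[k] * u[k] for k in range(len(u))) for row in L]

def in_rotated_cone(a, b, w):
    """Rotated second-order cone: a >= 0, b >= 0, 2ab >= |w|^2."""
    return a >= 0.0 and b >= 0.0 and 2.0 * a * b >= sum(wi * wi for wi in w)

def in_standard_cone(a, b, w):
    """Equivalent standard cone form: |(sqrt(2) w, a - b)| <= a + b."""
    lhs = math.sqrt(2.0 * sum(wi * wi for wi in w) + (a - b) ** 2)
    return lhs <= a + b

# Hypothetical factor L of R = L^T L and a candidate input u.
L = [[1.0, 0.0], [1.0, 2.0]]
u = [1.0, 0.5]
w = mat_vec(L, u)                        # w = Lu, so |w|^2 = u^T R u
energy = sum(wi * wi for wi in w)

results = {}
for gamma in (6.0, 4.0):
    a, b = gamma / 2.0, 1.0              # cone scalars, not the b in Au = b
    results[gamma] = (in_rotated_cone(a, b, w),
                      in_standard_cone(a, b, w),
                      energy <= gamma)
print(energy, results)
```

The test values of $\gamma$ are chosen away from the boundary $\gamma=u^\top Ru$, where floating-point rounding could make the two tests disagree.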
## Modelling Principles and Certificates
The final modelling question is how to decide whether a proposed convex model is faithful enough to the original problem. The theory developed in the course gives three tests: verify convexity and closedness, identify a qualification condition for strong duality or KKT, and interpret the dual variables as certificates.
[quotetheorem:2547]
[citeproof:2547]
KKT sufficiency is the bridge between modelling and verification. The convexity assumption cannot be dropped: for minimising $-x^2$ over $[-1,1]$, the point $x^*=0$ satisfies stationarity with zero multiplier for the convex constraint $x^2-1\le 0$, but it is not a minimiser. The theorem also gives sufficiency only; it does not assert that multipliers exist for every optimum without a constraint qualification such as Slater's condition. In estimation it produces dual certificates, in robust optimisation it identifies worst-case perturbations, and in conic programming it explains the primal-dual residuals used to assess solutions.
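The role of convexity can be seen by running the same KKT check on two one-variable problems over $[-1,1]$, both written with the constraint $g(x)=x^2-1\le 0$. The sketch below uses the convex objective $(x-2)^2$, a hypothetical companion to the text's counterexample, and the nonconvex objective $-x^2$. In both cases the KKT residuals vanish at the candidate point, but a grid search confirms optimality only in the convex case.

```python
def kkt_residuals(grad_f, g, grad_g, x, lam):
    """Residuals for one inequality constraint g(x) <= 0 in one variable:
    stationarity, primal violation, dual violation, complementarity."""
    return (
        abs(grad_f(x) + lam * grad_g(x)),
        max(g(x), 0.0),
        max(-lam, 0.0),
        abs(lam * g(x)),
    )

def g(x):
    return x * x - 1.0

def grad_g(x):
    return 2.0 * x

grid = [-1.0 + 0.01 * k for k in range(201)]    # sample points of [-1, 1]

# Convex objective: candidate x = 1 with multiplier 1.
def f_convex(x):
    return (x - 2.0) ** 2

convex_res = kkt_residuals(lambda x: 2.0 * (x - 2.0), g, grad_g, 1.0, 1.0)
convex_best = min(f_convex(x) for x in grid)

# Nonconvex objective from the text: candidate x = 0 with multiplier 0.
def f_nonconvex(x):
    return -x * x

nonconvex_res = kkt_residuals(lambda x: -2.0 * x, g, grad_g, 0.0, 0.0)
nonconvex_best = min(f_nonconvex(x) for x in grid)

print(convex_res, f_convex(1.0), convex_best)
print(nonconvex_res, f_nonconvex(0.0), nonconvex_best)
```

Zero residuals therefore certify a global minimiser only under the convexity hypotheses of the theorem; for $-x^2$ the stationary point $0$ is in fact the maximiser on the interval.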
[remark: Practical Modelling Checklist]
A convex model should state the decision variables, domains, objective, constraints, and parameter meanings before any solver syntax is introduced. Each nonlinear expression should be justified by a convexity rule or a conic representation. If a model is a relaxation, the notes should record what is relaxed, what certificate would imply exactness, and what the dual variables mean in the original application.
[/remark]
The applications in this chapter share the same mathematical skeleton. A modelling choice introduces a convex set or convex function; separation and conjugacy produce a dual object; optimality conditions turn that dual object into a certificate. This is why convex optimisation is both a theory of global optima and a practical language for robust, statistical, and engineering models.
## Beyond and Connected Topics
Separation is the geometric thread behind the early theory: the [Hahn-Banach Separation Theorem](/theorems/974) supplies supporting functionals, while the [Supporting Hyperplane Theorem for Convex Functions](/theorems/2551) expresses the same idea through epigraphs. Compact convex geometry enters through [Minkowski's Theorem on Extreme Points](/theorems/4093), which explains why finite-dimensional convex sets can often be studied through exposed or extreme points.
Duality and certification run through the rest of the course. Slater conditions, Fenchel-Rockafellar duality, conic duality, and KKT sufficiency all turn primal optimality into a verifiable dual object. The modelling chapter then reuses the same pattern in lasso, matrix completion, robust optimisation, and conic reformulations: first expose convex structure, then identify the certificate that proves optimality or exactness.
Convex Optimisation I: Theory
Created by admin on 6/12/2026 | Last updated on 6/12/2026