Subdifferential Chain Rule for Composition with a Linear Map

Theorem

Let $g: \mathbb{R}^m \to (-\infty,\infty]$ be a proper convex function and let $A: \mathbb{R}^n \to \mathbb{R}^m$ be a [linear map](/page/Linear%20Map) such that \begin{align*} A(\mathbb{R}^n) \cap \operatorname{ri}(\operatorname{dom} g) \neq \varnothing. \end{align*} Then for every $x \in \mathbb{R}^n$ with $Ax \in \operatorname{dom} g$, \begin{align*} \partial(g \circ A)(x) = A^\top \partial g(Ax). \end{align*}



Proof

[proofplan] The inclusion $A^\top \partial g(Ax) \subset \partial(g \circ A)(x)$ follows directly by composing the subgradient inequality for $g$ with the [linear map](/page/Linear%20Map) $A$. For the reverse inclusion, a subgradient of $g \circ A$ first gives a supporting affine functional for $g$ restricted to the subspace $A(\mathbb{R}^n)$. The relative-interior hypothesis is exactly the finite-dimensional constraint qualification that permits this restricted support to be extended to a genuine global subgradient of $g$, with only a correction by a vector orthogonal to $A(\mathbb{R}^n)$. Applying $A^\top$ kills that orthogonal correction and recovers the original subgradient. [/proofplan] [step:Pull back subgradients of $g$ through the linear map $A$] Fix $x \in \mathbb{R}^n$ with $Ax \in \operatorname{dom} g$. Define the function $h: \mathbb{R}^n \to (-\infty,\infty]$ by \begin{align*} h(u) := g(Au) \quad \text{for } u \in \mathbb{R}^n. \end{align*} We first prove \begin{align*} A^\top \partial g(Ax) \subset \partial h(x). \end{align*} Let $y \in \partial g(Ax)$. By the definition of the convex subdifferential, for every $z \in \mathbb{R}^m$, \begin{align*} g(z) \geq g(Ax) + y \cdot (z - Ax). \end{align*} Substitute $z = Au$ for an arbitrary $u \in \mathbb{R}^n$. Since $A$ is linear, \begin{align*} h(u) = g(Au) \geq g(Ax) + y \cdot (Au - Ax). \end{align*} Using the defining property of the transpose $A^\top$, namely \begin{align*} y \cdot A(u - x) = A^\top y \cdot (u - x), \end{align*} we obtain \begin{align*} h(u) \geq h(x) + A^\top y \cdot (u - x) \end{align*} for every $u \in \mathbb{R}^n$. Hence $A^\top y \in \partial h(x)$, proving the first inclusion. [guided] Fix $x \in \mathbb{R}^n$ with $Ax \in \operatorname{dom} g$, and define the composed convex function $h: \mathbb{R}^n \to (-\infty,\infty]$ by \begin{align*} h(u) := g(Au) \quad \text{for } u \in \mathbb{R}^n. 
\end{align*} The goal of this step is to show that every subgradient of $g$ at $Ax$ pulls back to a subgradient of $h = g \circ A$ at $x$. Let $y \in \partial g(Ax)$. By definition of the convex subdifferential, the affine function \begin{align*} z \mapsto g(Ax) + y \cdot (z - Ax) \end{align*} supports $g$ from below at $Ax$. That means that for every $z \in \mathbb{R}^m$, \begin{align*} g(z) \geq g(Ax) + y \cdot (z - Ax). \end{align*} To convert this into a statement about $h$, evaluate the inequality only at points of the form $z = Au$, where $u \in \mathbb{R}^n$. This gives \begin{align*} h(u) = g(Au) \geq g(Ax) + y \cdot (Au - Ax). \end{align*} Since $A$ is linear, $Au - Ax = A(u - x)$. The transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^n$ is defined by the identity \begin{align*} y \cdot Av = A^\top y \cdot v \end{align*} for every $y \in \mathbb{R}^m$ and every $v \in \mathbb{R}^n$. Applying this identity with $v = u - x$, we obtain \begin{align*} y \cdot (Au - Ax) = A^\top y \cdot (u - x). \end{align*} Therefore, for every $u \in \mathbb{R}^n$, \begin{align*} h(u) \geq h(x) + A^\top y \cdot (u - x). \end{align*} This is exactly the definition of $A^\top y \in \partial h(x)$. Since $y \in \partial g(Ax)$ was arbitrary, we have proved \begin{align*} A^\top \partial g(Ax) \subset \partial(g \circ A)(x). \end{align*} [/guided] [/step] [step:Show that a subgradient of $g \circ A$ vanishes on $\ker A$] Define the kernel by \begin{align*} \ker A := \{v \in \mathbb{R}^n : Av = 0\}. \end{align*} For any linear subspace $M \subset \mathbb{R}^n$, define the orthogonal complement by \begin{align*} M^\perp := \{q \in \mathbb{R}^n : q \cdot r = 0 \text{ for every } r \in M\}. \end{align*} Define the transpose range by \begin{align*} \operatorname{Range}(A^\top) := \{A^\top y : y \in \mathbb{R}^m\}. \end{align*} Let $s \in \partial h(x)$. We prove that $s$ annihilates $\ker A$. Let $v \in \ker A$. 
For every $t \in \mathbb{R}$, linearity gives \begin{align*} A(x + tv) = Ax + tAv = Ax. \end{align*} Thus $h(x + tv) = h(x)$. The subgradient inequality for $s \in \partial h(x)$ gives \begin{align*} h(x + tv) \geq h(x) + s \cdot tv. \end{align*} Since $h(x + tv) = h(x)$, this becomes \begin{align*} 0 \geq t(s \cdot v) \end{align*} for every $t \in \mathbb{R}$. Taking $t = 1$ and $t = -1$ yields $s \cdot v = 0$. Hence \begin{align*} s \in (\ker A)^\perp. \end{align*} We justify the finite-dimensional identity $(\ker A)^\perp = \operatorname{Range}(A^\top)$. If $r=A^\top y$ with $y \in \mathbb{R}^m$ and $v \in \ker A$, then \begin{align*} r \cdot v = A^\top y \cdot v = y \cdot Av = 0, \end{align*} so $\operatorname{Range}(A^\top) \subset (\ker A)^\perp$. Conversely, by rank-nullity and equality of matrix rank under transpose, \begin{align*} \dim (\ker A)^\perp = n-\dim \ker A = \operatorname{rank} A = \operatorname{rank} A^\top = \dim \operatorname{Range}(A^\top). \end{align*} The inclusion and equality of dimensions give $(\ker A)^\perp = \operatorname{Range}(A^\top)$. Therefore there exists $y_0 \in \mathbb{R}^m$ such that \begin{align*} A^\top y_0 = s. \end{align*} [/step] [step:Convert the pulled-back support into support on $A(\mathbb{R}^n)$] Let \begin{align*} L := A(\mathbb{R}^n) \subset \mathbb{R}^m \end{align*} be the range subspace of $A$, and let $z_0 := Ax$. We show that $y_0$ supports $g$ at $z_0$ along $L$. For every $u \in \mathbb{R}^n$, the subgradient inequality for $s \in \partial h(x)$ gives \begin{align*} g(Au) = h(u) \geq h(x) + s \cdot (u - x). \end{align*} Using $h(x) = g(Ax)$ and $s = A^\top y_0$, we get \begin{align*} g(Au) \geq g(Ax) + A^\top y_0 \cdot (u - x). \end{align*} By the transpose identity, \begin{align*} A^\top y_0 \cdot (u - x) = y_0 \cdot (Au - Ax). \end{align*} Therefore \begin{align*} g(Au) \geq g(z_0) + y_0 \cdot (Au - z_0) \end{align*} for every $u \in \mathbb{R}^n$. 
Since every $z \in L$ has the form $z = Au$ for some $u \in \mathbb{R}^n$, this proves \begin{align*} g(z) \geq g(z_0) + y_0 \cdot (z - z_0) \end{align*} for every $z \in L$. [/step] [step:Extend the supporting functional from $A(\mathbb{R}^n)$ to all of $\mathbb{R}^m$] We use the following finite-dimensional extension principle, derived below from the subdifferential sum rule: if $g: \mathbb{R}^m \to (-\infty,\infty]$ is proper convex, $L \subset \mathbb{R}^m$ is a linear subspace, $L \cap \operatorname{ri}(\operatorname{dom} g) \neq \varnothing$, $z_0 \in L \cap \operatorname{dom} g$, and $y_0 \in \mathbb{R}^m$ satisfies \begin{align*} g(z) \geq g(z_0) + y_0 \cdot (z - z_0) \end{align*} for every $z \in L$, then there exists $w \in L^\perp$ such that \begin{align*} y_0 + w \in \partial g(z_0). \end{align*} We derive the extension principle from the finite-dimensional subdifferential sum rule. Define the indicator function $\iota_L: \mathbb{R}^m \to (-\infty,\infty]$ by setting $\iota_L(z)=0$ for $z \in L$ and $\iota_L(z)=\infty$ for $z \notin L$. Define the affine perturbation $\phi: \mathbb{R}^m \to (-\infty,\infty]$ by \begin{align*} \phi(z) := g(z)-y_0 \cdot (z-z_0). \end{align*} The restricted support inequality says exactly that $z_0$ minimizes $\phi+\iota_L$ over $\mathbb{R}^m$, hence $0 \in \partial(\phi+\iota_L)(z_0)$ by the definition of the convex subdifferential at a global minimizer. The relative interior qualification for the finite-dimensional subdifferential sum rule is \begin{align*} \operatorname{ri}(\operatorname{dom} \phi) \cap \operatorname{ri}(\operatorname{dom} \iota_L) = \operatorname{ri}(\operatorname{dom} g) \cap L \neq \varnothing, \end{align*} which holds by hypothesis. Applying the finite-dimensional subdifferential sum rule gives \begin{align*} 0 \in \partial \phi(z_0)+\partial \iota_L(z_0). 
\end{align*} Because $\phi$ is obtained from $g$ by subtracting the affine function $z \mapsto y_0 \cdot (z-z_0)$, whose gradient is $y_0$, we have $\partial \phi(z_0)=\partial g(z_0)-y_0$. Also, the definition of the subdifferential of $\iota_L$ gives $\partial \iota_L(z_0)=L^\perp$: since $z_0 \in L$ and $L$ is a linear subspace, $z-z_0$ ranges over all of $L$ as $z$ ranges over $L$, so the inequality $\iota_L(z) \geq q \cdot (z-z_0)$ for all $z$ forces $q$ to vanish on $L$, and every $q \in L^\perp$ satisfies it. Therefore there exist $p \in \partial g(z_0)$ and $q \in L^\perp$ such that $0=p-y_0+q$. Setting $w:=-q \in L^\perp$ gives $y_0+w=p \in \partial g(z_0)$. The hypotheses of this extension principle are satisfied here. The function $g$ is proper convex by hypothesis. The set $L = A(\mathbb{R}^n)$ is a linear subspace of $\mathbb{R}^m$. The qualification \begin{align*} L \cap \operatorname{ri}(\operatorname{dom} g) \neq \varnothing \end{align*} is exactly the assumed condition. Also $z_0 = Ax \in L \cap \operatorname{dom} g$. The previous step proved the required support inequality on $L$. Therefore there exists $w \in L^\perp$ such that \begin{align*} y := y_0 + w \in \partial g(z_0). \end{align*} [guided] At this point we know that $y_0$ supports $g$ correctly, but only after restricting $g$ to the subspace \begin{align*} L := A(\mathbb{R}^n). \end{align*} That is, with $z_0 := Ax$, we have \begin{align*} g(z) \geq g(z_0) + y_0 \cdot (z - z_0) \end{align*} for every $z \in L$. The missing point is that a subgradient of $g$ must give this inequality for every $z \in \mathbb{R}^m$, not just for $z \in L$. The relative interior hypothesis is precisely what allows this extension. We derive the needed extension from the finite-dimensional subdifferential sum rule. Define the indicator function $\iota_L: \mathbb{R}^m \to (-\infty,\infty]$ by $\iota_L(z)=0$ for $z \in L$ and $\iota_L(z)=\infty$ for $z \notin L$. Define $\phi: \mathbb{R}^m \to (-\infty,\infty]$ by \begin{align*} \phi(z) := g(z)-y_0 \cdot (z-z_0). 
\end{align*} The support inequality on $L$ says that $z_0$ is a global minimizer of $\phi+\iota_L$. Therefore $0 \in \partial(\phi+\iota_L)(z_0)$, because the zero vector gives the subgradient inequality at a global minimizer. The sum rule applies because $\operatorname{dom} \phi = \operatorname{dom} g$ and $\operatorname{ri}(L) = L$, so that \begin{align*} \operatorname{ri}(\operatorname{dom} \phi) \cap \operatorname{ri}(\operatorname{dom} \iota_L) = \operatorname{ri}(\operatorname{dom} g) \cap L \neq \varnothing. \end{align*} Thus \begin{align*} 0 \in \partial \phi(z_0)+\partial \iota_L(z_0). \end{align*} Here $\partial \phi(z_0)=\partial g(z_0)-y_0$, because $\phi$ differs from $g$ by the affine function $z \mapsto y_0 \cdot (z-z_0)$, whose gradient is $y_0$. Also $\partial \iota_L(z_0)=L^\perp$ by the definition of the subdifferential of an indicator of a linear subspace. Hence there are $p \in \partial g(z_0)$ and $q \in L^\perp$ with $0=p-y_0+q$. Setting $w:=-q$ gives $w \in L^\perp$ and $y_0+w=p \in \partial g(z_0)$. We now verify that the hypotheses used in this derivation hold in the present setting. The function $g$ is proper convex by assumption. The set \begin{align*} L = A(\mathbb{R}^n) \end{align*} is a linear subspace because it is the range of a linear map. The qualification condition is exactly the assumed condition \begin{align*} A(\mathbb{R}^n) \cap \operatorname{ri}(\operatorname{dom} g) \neq \varnothing. \end{align*} The point $z_0 = Ax$ lies in $L$ by definition and lies in $\operatorname{dom} g$ by the hypothesis on $x$. Finally, we reproduce the support inequality on $L$ from the subgradient inequality for $s \in \partial h(x)$. For every $u \in \mathbb{R}^n$, \begin{align*} g(Au) = h(u) \geq h(x) + s \cdot (u-x). \end{align*} Since $h(x)=g(Ax)$ and $s=A^\top y_0$, the transpose identity gives \begin{align*} g(Au) \geq g(Ax) + y_0 \cdot (Au-Ax). \end{align*} Now let $z \in L$. By the definition of $L=A(\mathbb{R}^n)$, there exists $u \in \mathbb{R}^n$ such that $z=Au$. Since $z_0=Ax$, the preceding inequality becomes \begin{align*} g(z) \geq g(z_0) + y_0 \cdot (z-z_0). 
\end{align*} This proves the required support inequality for every $z \in L$. The derivation above applies to this support inequality and gives a vector $w \in L^\perp$ such that \begin{align*} y := y_0 + w \in \partial g(z_0). \end{align*} The role of $w$ is only to adjust the supporting hyperplane in directions transverse to $L$; it does not change the pullback through $A$, because $A$ only sees directions inside $L$. [/guided] [/step] [step:Apply $A^\top$ to the extended subgradient and recover $s$] Since $w \in L^\perp$ and $L = A(\mathbb{R}^n)$, we have \begin{align*} w \cdot Av = 0 \end{align*} for every $v \in \mathbb{R}^n$. By the definition of the transpose, this means \begin{align*} A^\top w \cdot v = 0 \end{align*} for every $v \in \mathbb{R}^n$, hence $A^\top w = 0$. With $y = y_0 + w \in \partial g(z_0)$ and $z_0 = Ax$, we compute \begin{align*} A^\top y = A^\top(y_0 + w) = A^\top y_0 + A^\top w = s + 0 = s. \end{align*} Thus every $s \in \partial h(x)$ belongs to $A^\top \partial g(Ax)$, so \begin{align*} \partial h(x) \subset A^\top \partial g(Ax). \end{align*} Combining this with the first inclusion and recalling that $h = g \circ A$, we obtain \begin{align*} \partial(g \circ A)(x) = A^\top \partial g(Ax). \end{align*} This proves the theorem. [guided] We have constructed $y := y_0+w \in \partial g(z_0)$, where $z_0=Ax$, and we need to show that this subgradient pulls back to the original vector $s \in \partial h(x)$. The only possible obstruction is the correction vector $w$, so we check that $A^\top w=0$. Since $w \in L^\perp$ and $L=A(\mathbb{R}^n)$, the definition of orthogonal complement gives \begin{align*} w \cdot Av = 0 \end{align*} for every $v \in \mathbb{R}^n$. By the defining identity for the transpose $A^\top$, this is equivalent to \begin{align*} A^\top w \cdot v = 0 \end{align*} for every $v \in \mathbb{R}^n$. A vector in $\mathbb{R}^n$ whose dot product with every $v \in \mathbb{R}^n$ is zero must be the zero vector, so $A^\top w=0$. 
Now use $y=y_0+w$, $A^\top y_0=s$, and $A^\top w=0$: \begin{align*} A^\top y = A^\top(y_0+w)=A^\top y_0 + A^\top w=s. \end{align*} Thus the arbitrary vector $s \in \partial h(x)$ can be written as $s=A^\top y$ with $y \in \partial g(Ax)$. Hence \begin{align*} \partial h(x) \subset A^\top \partial g(Ax). \end{align*} The first step proved the reverse inclusion. Since $h=g\circ A$, the two inclusions give \begin{align*} \partial(g\circ A)(x)=A^\top\partial g(Ax). \end{align*} This completes the proof. [/guided] [/step]
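
As an informal numerical sanity check of the identity $\partial(g \circ A)(x) = A^\top \partial g(Ax)$, the following sketch uses an illustrative choice that does not appear in the proof above: $g(z) = |z_1| + |z_2|$ on $\mathbb{R}^2$ and $Au = (u, -u)$ on $\mathbb{R}$. Here $\operatorname{dom} g = \mathbb{R}^2$, so the relative-interior qualification holds trivially, $\partial g(0) = [-1,1]^2$, and $h(u) = g(Au) = 2|u|$ has $\partial h(0) = [-2,2]$. All function names are hypothetical, and a finite grid can only test the subgradient inequality at sampled points; it does not prove anything.

```python
# Illustrative assumption (not from the proof): g(z) = |z1| + |z2| on R^2
# and the linear map A u = (u, -u) from R to R^2.

def g(z):
    return abs(z[0]) + abs(z[1])

def A(u):
    return (u, -u)

def A_T(y):
    # Transpose of A: y . A(u) equals A_T(y) * u for every real u.
    return y[0] - y[1]

def h(u):
    # The composition h = g o A; for this example h(u) = 2 * |u|.
    return g(A(u))

def is_subgradient_of_h(s, x, test_points, tol=1e-12):
    # Tests h(u) >= h(x) + s * (u - x) at the sampled points only.
    return all(h(u) >= h(x) + s * (u - x) - tol for u in test_points)

grid = [k / 10 for k in range(-30, 31)]  # sample points u in [-3, 3]
box = [k / 4 for k in range(-4, 5)]      # samples of the interval [-1, 1]

# The subdifferential of g at 0 is the box [-1, 1]^2.
# Pull each sampled subgradient y back through the transpose.
pulled_back = sorted({A_T((y1, y2)) for y1 in box for y2 in box})

# Every pulled-back vector should pass the sampled subgradient test at x = 0.
all_pass = all(is_subgradient_of_h(s, 0.0, grid) for s in pulled_back)

# Slopes just outside [-2, 2] should fail the test.
outside_fails = (not is_subgradient_of_h(2.1, 0.0, grid)
                 and not is_subgradient_of_h(-2.1, 0.0, grid))
```

In this example the sampled values of $A^\top y$ fill out the interval $[-2,2]$ from endpoint to endpoint, which is consistent with both inclusions of the theorem at $x = 0$.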


