[proofplan]
The inclusion $A^\top \partial g(Ax) \subset \partial(g \circ A)(x)$ follows directly by composing the subgradient inequality for $g$ with the linear map $A$. For the reverse inclusion, a subgradient of $g \circ A$ first gives a supporting affine functional for $g$ restricted to the subspace $A(\mathbb{R}^n)$. The relative-interior hypothesis is exactly the finite-dimensional constraint qualification that permits this restricted support to be extended to a genuine global subgradient of $g$, with only a correction by a vector orthogonal to $A(\mathbb{R}^n)$. Applying $A^\top$ kills that orthogonal correction and recovers the original subgradient.
[/proofplan]
[step:Pull back subgradients of $g$ through the linear map $A$]
Fix $x \in \mathbb{R}^n$ with $Ax \in \operatorname{dom} g$. Define the function $h: \mathbb{R}^n \to (-\infty,\infty]$ by
\begin{align*}
h(u) := g(Au) \quad \text{for } u \in \mathbb{R}^n.
\end{align*}
We first prove
\begin{align*}
A^\top \partial g(Ax) \subset \partial h(x).
\end{align*}
Let $y \in \partial g(Ax)$. By the definition of the convex subdifferential, for every $z \in \mathbb{R}^m$,
\begin{align*}
g(z) \geq g(Ax) + y \cdot (z - Ax).
\end{align*}
Substitute $z = Au$ for an arbitrary $u \in \mathbb{R}^n$. Since $A$ is linear,
\begin{align*}
h(u) = g(Au) \geq g(Ax) + y \cdot (Au - Ax).
\end{align*}
Using the defining property of the transpose $A^\top$, namely
\begin{align*}
y \cdot A(u - x) = A^\top y \cdot (u - x),
\end{align*}
we obtain
\begin{align*}
h(u) \geq h(x) + A^\top y \cdot (u - x)
\end{align*}
for every $u \in \mathbb{R}^n$. Hence $A^\top y \in \partial h(x)$, proving the first inclusion.
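As a small illustration (the choices of $g$ and $A$ here are ours, made only for concreteness), take $n = 2$, $m = 1$, $g(z) = |z|$, and $A = \begin{pmatrix} 1 & 1 \end{pmatrix}$, so that $h(u) = |u_1 + u_2|$. At $x = 0$ we have $\partial g(0) = [-1,1]$ and $A^\top y = (y, y)$. For $|y| \leq 1$ the pulled-back inequality reads
\begin{align*}
h(u) = |u_1 + u_2| \geq y(u_1 + u_2) = A^\top y \cdot (u - 0),
\end{align*}
which holds for every $u \in \mathbb{R}^2$, so $(y, y) \in \partial h(0)$, as the inclusion predicts.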
[guided]
Fix $x \in \mathbb{R}^n$ with $Ax \in \operatorname{dom} g$, and define the composed convex function $h: \mathbb{R}^n \to (-\infty,\infty]$ by
\begin{align*}
h(u) := g(Au) \quad \text{for } u \in \mathbb{R}^n.
\end{align*}
The goal of this step is to show that every subgradient of $g$ at $Ax$ pulls back to a subgradient of $h = g \circ A$ at $x$.
Let $y \in \partial g(Ax)$. By definition of the convex subdifferential, the affine function
\begin{align*}
z \mapsto g(Ax) + y \cdot (z - Ax)
\end{align*}
supports $g$ from below at $Ax$. That means that for every $z \in \mathbb{R}^m$,
\begin{align*}
g(z) \geq g(Ax) + y \cdot (z - Ax).
\end{align*}
To convert this into a statement about $h$, evaluate the inequality only at points of the form $z = Au$, where $u \in \mathbb{R}^n$. This gives
\begin{align*}
h(u) = g(Au) \geq g(Ax) + y \cdot (Au - Ax).
\end{align*}
Since $A$ is linear, $Au - Ax = A(u - x)$. The transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^n$ is defined by the identity
\begin{align*}
y \cdot Av = A^\top y \cdot v
\end{align*}
for every $y \in \mathbb{R}^m$ and every $v \in \mathbb{R}^n$. Applying this identity with $v = u - x$, we obtain
\begin{align*}
y \cdot (Au - Ax) = A^\top y \cdot (u - x).
\end{align*}
Therefore, for every $u \in \mathbb{R}^n$,
\begin{align*}
h(u) \geq h(x) + A^\top y \cdot (u - x).
\end{align*}
This is exactly the definition of $A^\top y \in \partial h(x)$. Since $y \in \partial g(Ax)$ was arbitrary, we have proved
\begin{align*}
A^\top \partial g(Ax) \subset \partial(g \circ A)(x).
\end{align*}
[/guided]
[/step]
[step:Show that a subgradient of $g \circ A$ vanishes on $\ker A$]
Define the kernel by
\begin{align*}
\ker A := \{v \in \mathbb{R}^n : Av = 0\}.
\end{align*}
For any linear subspace $M \subset \mathbb{R}^n$, define the orthogonal complement by
\begin{align*}
M^\perp := \{q \in \mathbb{R}^n : q \cdot r = 0 \text{ for every } r \in M\}.
\end{align*}
Define the transpose range by
\begin{align*}
\operatorname{Range}(A^\top) := \{A^\top y : y \in \mathbb{R}^m\}.
\end{align*}
Let $s \in \partial h(x)$. We prove that $s$ annihilates $\ker A$.
Let $v \in \ker A$. For every $t \in \mathbb{R}$, linearity gives
\begin{align*}
A(x + tv) = Ax + tAv = Ax.
\end{align*}
Thus $h(x + tv) = h(x)$. The subgradient inequality for $s \in \partial h(x)$ gives
\begin{align*}
h(x + tv) \geq h(x) + s \cdot tv.
\end{align*}
Since $h(x + tv) = h(x)$, this becomes
\begin{align*}
0 \geq t(s \cdot v)
\end{align*}
for every $t \in \mathbb{R}$. Taking $t = 1$ and $t = -1$ yields $s \cdot v = 0$. Hence
\begin{align*}
s \in (\ker A)^\perp.
\end{align*}
We justify the finite-dimensional identity $(\ker A)^\perp = \operatorname{Range}(A^\top)$. If $r=A^\top y$ with $y \in \mathbb{R}^m$ and $v \in \ker A$, then
\begin{align*}
r \cdot v = A^\top y \cdot v = y \cdot Av = 0,
\end{align*}
so $\operatorname{Range}(A^\top) \subset (\ker A)^\perp$. Conversely, by rank-nullity and equality of matrix rank under transpose,
\begin{align*}
\dim (\ker A)^\perp = n-\dim \ker A = \operatorname{rank} A = \operatorname{rank} A^\top = \dim \operatorname{Range}(A^\top).
\end{align*}
The inclusion and equality of dimensions give $(\ker A)^\perp = \operatorname{Range}(A^\top)$. Therefore there exists $y_0 \in \mathbb{R}^m$ such that
\begin{align*}
A^\top y_0 = s.
\end{align*}
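For instance, with the illustrative matrix $A = \begin{pmatrix} 1 & 1 \end{pmatrix}$ (an assumed example with $n = 2$, $m = 1$, not part of the theorem), we have $A^\top y = (y, y)$ and
\begin{align*}
\ker A = \{(t, -t) : t \in \mathbb{R}\}, \qquad (\ker A)^\perp = \{(a, a) : a \in \mathbb{R}\} = \operatorname{Range}(A^\top).
\end{align*}
The dimension count is $n - \dim \ker A = 2 - 1 = 1 = \operatorname{rank} A$. In this case any subgradient $s$ of $g \circ A$ must have equal components, $s = (a, a)$, and $y_0 = a$ satisfies $A^\top y_0 = s$.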
[/step]
[step:Convert the pulled-back support into support on $A(\mathbb{R}^n)$]
Let
\begin{align*}
L := A(\mathbb{R}^n) \subset \mathbb{R}^m
\end{align*}
be the range subspace of $A$, and let $z_0 := Ax$. We show that $y_0$ supports $g$ at $z_0$ along $L$.
For every $u \in \mathbb{R}^n$, the subgradient inequality for $s \in \partial h(x)$ gives
\begin{align*}
g(Au) = h(u) \geq h(x) + s \cdot (u - x).
\end{align*}
Using $h(x) = g(Ax)$ and $s = A^\top y_0$, we get
\begin{align*}
g(Au) \geq g(Ax) + A^\top y_0 \cdot (u - x).
\end{align*}
By the transpose identity,
\begin{align*}
A^\top y_0 \cdot (u - x) = y_0 \cdot (Au - Ax).
\end{align*}
Therefore
\begin{align*}
g(Au) \geq g(z_0) + y_0 \cdot (Au - z_0)
\end{align*}
for every $u \in \mathbb{R}^n$.
Since every $z \in L$ has the form $z = Au$ for some $u \in \mathbb{R}^n$, this proves
\begin{align*}
g(z) \geq g(z_0) + y_0 \cdot (z - z_0)
\end{align*}
for every $z \in L$.
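A simple example, with $g$ and $A$ chosen by us purely for illustration, shows that this restricted inequality need not hold off $L$. Take $n = 1$, $m = 2$, $Au = (u, 0)$, and $g(z) = z_1^2 + z_2$, so that $h(u) = u^2$, $L = \mathbb{R} \times \{0\}$, and $A^\top y = y_1$. At $x = 0$ the only subgradient of $h$ is $s = 0$, and $y_0 = (0, 0)$ satisfies $A^\top y_0 = s$. On $L$ the support inequality holds:
\begin{align*}
g(z_1, 0) = z_1^2 \geq 0 = g(0) + y_0 \cdot (z - 0).
\end{align*}
At the point $z = (0, -1) \notin L$, however, $g(z) = -1 < 0$, so $y_0 \notin \partial g(0)$. Here $\partial g(0) = \{\nabla g(0)\} = \{(0, 1)\}$, which differs from $y_0$ by the vector $(0, 1) \in L^\perp$.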
[/step]
[step:Extend the supporting functional from $A(\mathbb{R}^n)$ to all of $\mathbb{R}^m$]
We use the following finite-dimensional extension principle, derived below from the subdifferential sum rule: if $g: \mathbb{R}^m \to (-\infty,\infty]$ is proper convex, $L \subset \mathbb{R}^m$ is a linear subspace, $L \cap \operatorname{ri}(\operatorname{dom} g) \neq \varnothing$, $z_0 \in L \cap \operatorname{dom} g$, and $y_0 \in \mathbb{R}^m$ satisfies
\begin{align*}
g(z) \geq g(z_0) + y_0 \cdot (z - z_0)
\end{align*}
for every $z \in L$, then there exists $w \in L^\perp$ such that
\begin{align*}
y_0 + w \in \partial g(z_0).
\end{align*}
To derive the extension principle, define the indicator function $\iota_L: \mathbb{R}^m \to (-\infty,\infty]$ by setting $\iota_L(z)=0$ for $z \in L$ and $\iota_L(z)=\infty$ for $z \notin L$. Define the function $\phi: \mathbb{R}^m \to (-\infty,\infty]$, obtained from $g$ by subtracting an affine functional, by
\begin{align*}
\phi(z) := g(z)-y_0 \cdot (z-z_0).
\end{align*}
The restricted support inequality says exactly that $z_0$ minimizes $\phi+\iota_L$ over $\mathbb{R}^m$, hence $0 \in \partial(\phi+\iota_L)(z_0)$ by the definition of the convex subdifferential at a global minimizer. Since $\operatorname{dom} \phi = \operatorname{dom} g$ and the subspace $L = \operatorname{dom} \iota_L$ is its own relative interior, the relative-interior qualification for the finite-dimensional subdifferential sum rule reads
\begin{align*}
\operatorname{ri}(\operatorname{dom} \phi) \cap \operatorname{ri}(\operatorname{dom} \iota_L) = \operatorname{ri}(\operatorname{dom} g) \cap L \neq \varnothing,
\end{align*}
which holds by hypothesis. Applying the finite-dimensional subdifferential sum rule gives
\begin{align*}
0 \in \partial \phi(z_0)+\partial \iota_L(z_0).
\end{align*}
Subtracting the affine functional $z \mapsto y_0 \cdot (z-z_0)$, whose gradient is $y_0$, shifts every subgradient by $-y_0$, so $\partial \phi(z_0)=\partial g(z_0)-y_0$. Also, the definition of the subdifferential of $\iota_L$ gives $\partial \iota_L(z_0)=L^\perp$: the inequality $\iota_L(z) \geq q \cdot (z-z_0)$ for all $z$ forces $q$ to vanish on $L$, and every $q \in L^\perp$ satisfies it. Therefore there exist $p \in \partial g(z_0)$ and $q \in L^\perp$ such that $0=p-y_0+q$. Setting $w:=-q \in L^\perp$ gives $y_0+w=p \in \partial g(z_0)$.
The hypotheses of this extension principle are satisfied here. The function $g$ is proper convex by hypothesis. The set $L = A(\mathbb{R}^n)$ is a linear subspace of $\mathbb{R}^m$. The qualification
\begin{align*}
L \cap \operatorname{ri}(\operatorname{dom} g) \neq \varnothing
\end{align*}
is exactly the assumed condition. Also $z_0 = Ax \in L \cap \operatorname{dom} g$. The previous step proved the required support inequality on $L$.
Therefore there exists $w \in L^\perp$ such that
\begin{align*}
y := y_0 + w \in \partial g(z_0).
\end{align*}
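The following sketch, an example of our own rather than part of the proof, suggests why the qualification cannot simply be dropped. Take $n = 1$, $m = 2$, $Au = (0, u)$, and $g = \iota_B$, the indicator function of the closed disk
\begin{align*}
B := \{z \in \mathbb{R}^2 : (z_1 - 1)^2 + z_2^2 \leq 1\}.
\end{align*}
Then $L = \{0\} \times \mathbb{R}$ meets $\operatorname{dom} g = B$ only at the boundary point $z_0 = (0,0)$, so $L \cap \operatorname{ri}(\operatorname{dom} g) = \varnothing$. The vector $y_0 = (0, 1)$ satisfies the support inequality on $L$, because $g(z) = \infty$ for every $z \in L \setminus \{z_0\}$. On the other hand,
\begin{align*}
\partial g(z_0) = \{(-\lambda, 0) : \lambda \geq 0\}, \qquad L^\perp = \mathbb{R} \times \{0\},
\end{align*}
so $y_0 + w = (w_1, 1)$ never lies in $\partial g(z_0)$ for $w \in L^\perp$. Correspondingly, $g \circ A = \iota_{\{0\}}$ has $\partial(g \circ A)(0) = \mathbb{R}$, while $A^\top \partial g(z_0) = \{0\}$, since $A^\top y = y_2$.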
[guided]
At this point we know that $y_0$ supports $g$ correctly, but only after restricting $g$ to the subspace
\begin{align*}
L := A(\mathbb{R}^n).
\end{align*}
That is, with $z_0 := Ax$, we have
\begin{align*}
g(z) \geq g(z_0) + y_0 \cdot (z - z_0)
\end{align*}
for every $z \in L$. The missing point is that a subgradient of $g$ must give this inequality for every $z \in \mathbb{R}^m$, not just for $z \in L$.
The relative interior hypothesis is precisely what allows this extension. We derive the needed extension from the finite-dimensional subdifferential sum rule. Define the indicator function $\iota_L: \mathbb{R}^m \to (-\infty,\infty]$ by $\iota_L(z)=0$ for $z \in L$ and $\iota_L(z)=\infty$ for $z \notin L$. Define $\phi: \mathbb{R}^m \to (-\infty,\infty]$ by
\begin{align*}
\phi(z) := g(z)-y_0 \cdot (z-z_0).
\end{align*}
The support inequality on $L$ says that $z_0$ is a global minimizer of $\phi+\iota_L$. Therefore $0 \in \partial(\phi+\iota_L)(z_0)$, because the zero vector gives the subgradient inequality at a global minimizer. The sum rule applies since
\begin{align*}
\operatorname{ri}(\operatorname{dom} \phi) \cap \operatorname{ri}(\operatorname{dom} \iota_L) = \operatorname{ri}(\operatorname{dom} g) \cap L \neq \varnothing.
\end{align*}
Thus
\begin{align*}
0 \in \partial \phi(z_0)+\partial \iota_L(z_0).
\end{align*}
Here $\partial \phi(z_0)=\partial g(z_0)-y_0$, because $\phi$ differs from $g$ by the affine functional $z \mapsto y_0 \cdot (z-z_0)$, whose gradient is $y_0$. Also $\partial \iota_L(z_0)=L^\perp$ by the definition of the subdifferential of an indicator of a linear subspace. Hence there are $p \in \partial g(z_0)$ and $q \in L^\perp$ with $0=p-y_0+q$. Setting $w:=-q$ gives $w \in L^\perp$ and $y_0+w=p \in \partial g(z_0)$.
We now verify that the hypotheses used in this derivation hold in the present setting. The function $g$ is proper convex by assumption. The set
\begin{align*}
L = A(\mathbb{R}^n)
\end{align*}
is a linear subspace because it is the range of a linear map. The qualification condition is exactly the assumed condition
\begin{align*}
A(\mathbb{R}^n) \cap \operatorname{ri}(\operatorname{dom} g) \neq \varnothing.
\end{align*}
The point $z_0 = Ax$ lies in $L$ by definition and lies in $\operatorname{dom} g$ by the hypothesis on $x$. Finally, we reproduce the support inequality on $L$ from the subgradient inequality for $s \in \partial h(x)$. For every $u \in \mathbb{R}^n$,
\begin{align*}
g(Au) = h(u) \geq h(x) + s \cdot (u-x).
\end{align*}
Since $h(x)=g(Ax)$ and $s=A^\top y_0$, the transpose identity gives
\begin{align*}
g(Au) \geq g(Ax) + y_0 \cdot (Au-Ax).
\end{align*}
Now let $z \in L$. By the definition of $L=A(\mathbb{R}^n)$, there exists $u \in \mathbb{R}^n$ such that $z=Au$. Since $z_0=Ax$, the preceding inequality becomes
\begin{align*}
g(z) \geq g(z_0) + y_0 \cdot (z-z_0).
\end{align*}
This proves the required support inequality for every $z \in L$.
The derivation above applies to this support inequality and gives a vector $w \in L^\perp$ such that
\begin{align*}
y := y_0 + w \in \partial g(z_0).
\end{align*}
The role of $w$ is only to adjust the supporting hyperplane in directions transverse to $L$; it does not change the pullback through $A$, because $A$ only sees directions inside $L$.
[/guided]
[/step]
[step:Apply $A^\top$ to the extended subgradient and recover $s$]
Since $w \in L^\perp$ and $L = A(\mathbb{R}^n)$, we have
\begin{align*}
w \cdot Av = 0
\end{align*}
for every $v \in \mathbb{R}^n$. By the definition of the transpose, this means
\begin{align*}
A^\top w \cdot v = 0
\end{align*}
for every $v \in \mathbb{R}^n$, hence $A^\top w = 0$.
With $y = y_0 + w \in \partial g(z_0)$ and $z_0 = Ax$, we compute
\begin{align*}
A^\top y = A^\top(y_0 + w) = A^\top y_0 + A^\top w = s + 0 = s.
\end{align*}
Thus every $s \in \partial h(x)$ belongs to $A^\top \partial g(Ax)$, so
\begin{align*}
\partial h(x) \subset A^\top \partial g(Ax).
\end{align*}
Combining this with the first inclusion and recalling that $h = g \circ A$, we obtain
\begin{align*}
\partial(g \circ A)(x) = A^\top \partial g(Ax).
\end{align*}
This proves the theorem.
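As a sanity check on the formula, consider an example of our own choosing: $n = 1$, $m = 2$, $Au = (u, -u)$, and $g(z) = |z_1| + |z_2|$. Since $\operatorname{dom} g = \mathbb{R}^2$, the qualification holds trivially. Here $(g \circ A)(u) = 2|u|$ and $A^\top y = y_1 - y_2$, so at $x = 0$,
\begin{align*}
\partial(g \circ A)(0) = [-2, 2], \qquad A^\top \partial g(0) = \{y_1 - y_2 : |y_1| \leq 1,\ |y_2| \leq 1\} = [-2, 2].
\end{align*}
The two sets agree. The representation $s = A^\top y$ is not unique in this example: $s = 0$ arises from every $y = (c, c)$ with $|c| \leq 1$, and these vectors differ by multiples of $(1, 1)$, which is orthogonal to $A(\mathbb{R}) = \{(u, -u) : u \in \mathbb{R}\}$.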
[guided]
We have constructed $y := y_0+w \in \partial g(z_0)$, where $z_0=Ax$, and we need to show that this subgradient pulls back to the original vector $s \in \partial h(x)$. The only possible obstruction is the correction vector $w$, so we check that $A^\top w=0$.
Since $w \in L^\perp$ and $L=A(\mathbb{R}^n)$, the definition of orthogonal complement gives
\begin{align*}
w \cdot Av = 0
\end{align*}
for every $v \in \mathbb{R}^n$. By the defining identity for the transpose $A^\top$, this is equivalent to
\begin{align*}
A^\top w \cdot v = 0
\end{align*}
for every $v \in \mathbb{R}^n$. A vector in $\mathbb{R}^n$ whose dot product with every $v \in \mathbb{R}^n$ is zero must be the zero vector, so $A^\top w=0$.
Now use $y=y_0+w$, $A^\top y_0=s$, and $A^\top w=0$:
\begin{align*}
A^\top y = A^\top(y_0+w)=A^\top y_0 + A^\top w=s.
\end{align*}
Thus the arbitrary vector $s \in \partial h(x)$ can be written as $s=A^\top y$ with $y \in \partial g(Ax)$. Hence
\begin{align*}
\partial h(x) \subset A^\top \partial g(Ax).
\end{align*}
The first step proved the reverse inclusion. Since $h=g\circ A$, the two inclusions give
\begin{align*}
\partial(g\circ A)(x)=A^\top\partial g(Ax).
\end{align*}
This completes the proof.
[/guided]
[/step]