[proofplan]
Differentiate the exponential-dispersion log-likelihood term by term with respect to each coordinate $\beta_j$. The derivative first produces the canonical-scale residual $Y_i-b'(\theta_i)=Y_i-\mu_i$, then the chain rule converts from $\theta_i$ to $\mu_i$ and from $\mu_i$ to $\eta_i$. The variance identity $\operatorname{Var}_\beta(Y_i)=a(\phi)b''(\theta_i(\beta))$ rewrites the canonical-parameter derivative as the standard GLM weight. In the canonical-link case, $\theta_i=\eta_i$, so $d\mu_i/d\eta_i=b''(\theta_i)$ and the variance factor cancels except for the common dispersion factor.
[/proofplan]
custom_env
admin
[step:Differentiate each likelihood contribution through the canonical parameter]Fix $\beta \in \mathcal B$ and $j \in \{1,\dots,p\}$. For each $i \in \{1,\dots,n\}$, define
\begin{align*}
\ell_i:\mathcal B &\to \mathbb{R} \\
\gamma &\mapsto \frac{Y_i\theta_i(\gamma)-b(\theta_i(\gamma))}{a(\phi)}+c(Y_i,\phi).
\end{align*}
Then $\ell=\sum_{i=1}^n \ell_i$. Since $\theta_i$ is differentiable and $b$ is differentiable on the relevant canonical-parameter values, the ordinary chain rule gives
\begin{align*}
\frac{\partial \ell_i}{\partial \beta_j}(\beta)
&=
\frac{1}{a(\phi)}
\left[
Y_i\frac{\partial \theta_i}{\partial \beta_j}(\beta)
-
b'(\theta_i(\beta))\frac{\partial \theta_i}{\partial \beta_j}(\beta)
\right] \\
&=
\frac{Y_i-\mu_i(\beta)}{a(\phi)}
\frac{\partial \theta_i}{\partial \beta_j}(\beta),
\end{align*}
because $\mu_i(\beta)=b'(\theta_i(\beta))$. Summing over $i$ yields
\begin{align*}
U_j(\beta)
=
\frac{\partial \ell}{\partial \beta_j}(\beta)
=
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{a(\phi)}
\frac{\partial \theta_i}{\partial \beta_j}(\beta).
\end{align*}[/step]
custom_env
admin
[guided]Fix a parameter value $\beta \in \mathcal B$ and a coordinate index $j \in \{1,\dots,p\}$. We isolate the $i$th contribution to the log-likelihood by defining the differentiable map
\begin{align*}
\ell_i:\mathcal B &\to \mathbb{R} \\
\gamma &\mapsto \frac{Y_i\theta_i(\gamma)-b(\theta_i(\gamma))}{a(\phi)}+c(Y_i,\phi).
\end{align*}
The total log-likelihood is the finite sum $\ell=\sum_{i=1}^n \ell_i$, so differentiating $\ell$ reduces to differentiating each $\ell_i$.
The only part of $\ell_i$ depending on $\beta$ is the canonical parameter $\theta_i(\beta)$. The term $c(Y_i,\phi)$ is constant as a function of $\beta$, and $a(\phi)$ is a fixed positive scalar because $\phi$ is fixed. Applying the chain rule to the composite function $b\circ \theta_i$ gives
\begin{align*}
\frac{\partial \ell_i}{\partial \beta_j}(\beta)
&=
\frac{1}{a(\phi)}
\left[
Y_i\frac{\partial \theta_i}{\partial \beta_j}(\beta)
-
b'(\theta_i(\beta))\frac{\partial \theta_i}{\partial \beta_j}(\beta)
\right] \\
&=
\frac{Y_i-b'(\theta_i(\beta))}{a(\phi)}
\frac{\partial \theta_i}{\partial \beta_j}(\beta).
\end{align*}
By the defining mean identity for an exponential-dispersion family, $\mu_i(\beta)=b'(\theta_i(\beta))$. Substituting this identity gives
\begin{align*}
\frac{\partial \ell_i}{\partial \beta_j}(\beta)
=
\frac{Y_i-\mu_i(\beta)}{a(\phi)}
\frac{\partial \theta_i}{\partial \beta_j}(\beta).
\end{align*}
Finally, since $\ell$ is a finite sum of differentiable functions, differentiation commutes with the finite sum:
\begin{align*}
U_j(\beta)
=
\frac{\partial \ell}{\partial \beta_j}(\beta)
=
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{a(\phi)}
\frac{\partial \theta_i}{\partial \beta_j}(\beta).
\end{align*}[/guided]
custom_env
admin
[step:Convert the canonical-parameter derivative to the mean-scale derivative]For each $i$, the identity $\mu_i(\beta)=b'(\theta_i(\beta))$ gives
\begin{align*}
\frac{\partial \mu_i}{\partial \theta_i}(\beta)
=
b''(\theta_i(\beta)).
\end{align*}
Since $\operatorname{Var}_\beta(Y_i)=a(\phi)b''(\theta_i(\beta))>0$, we have $b''(\theta_i(\beta))>0$ and therefore
\begin{align*}
\frac{\partial \theta_i}{\partial \mu_i}(\beta)
=
\frac{1}{b''(\theta_i(\beta))}
=
\frac{a(\phi)}{\operatorname{Var}_\beta(Y_i)}.
\end{align*}
The linear predictor map
\begin{align*}
\eta_i:\mathcal B &\to \mathbb{R} \\
\gamma &\mapsto x_i^\top \gamma
\end{align*}
satisfies
\begin{align*}
\frac{\partial \eta_i}{\partial \beta_j}(\beta)=x_{ij}.
\end{align*}
Applying the chain rule to the dependence $\theta_i=\theta_i(\mu_i(\eta_i(\beta)))$ gives
\begin{align*}
\frac{\partial \theta_i}{\partial \beta_j}(\beta)
=
\frac{\partial \theta_i}{\partial \mu_i}(\beta)
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
\frac{\partial \eta_i}{\partial \beta_j}(\beta)
=
\frac{a(\phi)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_{ij}.
\end{align*}
Substituting this expression into the formula for $U_j(\beta)$ gives
\begin{align*}
U_j(\beta)
&=
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{a(\phi)}
\frac{a(\phi)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_{ij} \\
&=
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_{ij}.
\end{align*}[/step]
custom_env
admin
[guided]The derivative found in the previous step is expressed in terms of $\partial \theta_i/\partial \beta_j$. The standard GLM score formula is expressed instead in terms of the mean derivative $\partial \mu_i/\partial \eta_i$ and the variance. We now convert between these scales.
For each $i$, the mean is defined by
\begin{align*}
\mu_i(\beta)=b'(\theta_i(\beta)).
\end{align*}
Differentiating this identity with respect to the canonical parameter gives
\begin{align*}
\frac{\partial \mu_i}{\partial \theta_i}(\beta)
=
b''(\theta_i(\beta)).
\end{align*}
The variance identity in the theorem statement is
\begin{align*}
\operatorname{Var}_\beta(Y_i)=a(\phi)b''(\theta_i(\beta)).
\end{align*}
Because $a(\phi)>0$ and $\operatorname{Var}_\beta(Y_i)>0$, this implies $b''(\theta_i(\beta))>0$. Hence the mean is locally differentiable with nonzero derivative as a function of the canonical parameter, and the reciprocal derivative is
\begin{align*}
\frac{\partial \theta_i}{\partial \mu_i}(\beta)
=
\frac{1}{b''(\theta_i(\beta))}
=
\frac{a(\phi)}{\operatorname{Var}_\beta(Y_i)}.
\end{align*}
Next, define the linear predictor map
\begin{align*}
\eta_i:\mathcal B &\to \mathbb{R} \\
\gamma &\mapsto x_i^\top \gamma.
\end{align*}
Writing $x_i=(x_{i1},\dots,x_{ip})$, we have
\begin{align*}
\eta_i(\gamma)=\sum_{k=1}^p x_{ik}\gamma_k,
\end{align*}
so differentiating with respect to $\gamma_j$ gives
\begin{align*}
\frac{\partial \eta_i}{\partial \beta_j}(\beta)=x_{ij}.
\end{align*}
Now apply the chain rule along the parameter path
\begin{align*}
\beta \mapsto \eta_i(\beta) \mapsto \mu_i(\beta) \mapsto \theta_i(\beta).
\end{align*}
This gives
\begin{align*}
\frac{\partial \theta_i}{\partial \beta_j}(\beta)
=
\frac{\partial \theta_i}{\partial \mu_i}(\beta)
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
\frac{\partial \eta_i}{\partial \beta_j}(\beta)
=
\frac{a(\phi)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_{ij}.
\end{align*}
Substituting this expression into the score component from the previous step cancels the factor $a(\phi)$:
\begin{align*}
U_j(\beta)
&=
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{a(\phi)}
\frac{a(\phi)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_{ij} \\
&=
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_{ij}.
\end{align*}[/guided]
custom_env
admin
[step:Assemble the component formulas into the vector score equation]
The preceding identity holds for every coordinate $j \in \{1,\dots,p\}$. Since $x_i=(x_{i1},\dots,x_{ip}) \in \mathbb{R}^p$, the coordinate identities are equivalent to the vector identity
\begin{align*}
U(\beta)
=
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_i
\quad \text{in } \mathbb{R}^p.
\end{align*}
Therefore $U(\beta)=0$ if and only if
\begin{align*}
\sum_{i=1}^n
\frac{Y_i-\mu_i(\beta)}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
x_i
=
0
\quad \text{in } \mathbb{R}^p.
\end{align*}
[/step]
custom_env
admin
[step:Cancel the variance factor for a canonical link]
Assume the link is canonical, so $\theta_i(\beta)=\eta_i(\beta)$ for every $i \in \{1,\dots,n\}$. Then
\begin{align*}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
=
\frac{\partial \mu_i}{\partial \theta_i}(\beta)
=
b''(\theta_i(\beta)).
\end{align*}
Using $\operatorname{Var}_\beta(Y_i)=a(\phi)b''(\theta_i(\beta))$, we obtain
\begin{align*}
\frac{1}{\operatorname{Var}_\beta(Y_i)}
\frac{\partial \mu_i}{\partial \eta_i}(\beta)
=
\frac{b''(\theta_i(\beta))}{a(\phi)b''(\theta_i(\beta))}
=
\frac{1}{a(\phi)}.
\end{align*}
Hence
\begin{align*}
U(\beta)
=
\frac{1}{a(\phi)}
\sum_{i=1}^n x_i\bigl(Y_i-\mu_i(\beta)\bigr).
\end{align*}
Because $a(\phi)>0$, multiplying the equation $U(\beta)=0$ by the nonzero scalar $a(\phi)$ gives the equivalent canonical-link score equations
\begin{align*}
\sum_{i=1}^n x_i\bigl(Y_i-\mu_i(\beta)\bigr)=0.
\end{align*}
This is the asserted reduction up to the common dispersion factor.
[/step]