Logistic Regression Log-Likelihood, Score, and Observed Hessian

Logistic Regression Log-Likelihood, Score, and Observed Hessian (Theorem # 4466)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We first write the Bernoulli likelihood using the logistic parametrisation \begin{align*} p_i(\beta)=\frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}. \end{align*} Taking logarithms converts the product likelihood into a sum and gives the stated expression for $\ell$. We then differentiate $\ell$ as a function of $\beta \in \mathbb{R}^d$: the first derivative gives the score, and differentiating the score once more gives the observed Hessian. [/proofplan] [step:Substitute the logistic probabilities into the Bernoulli likelihood] For each $i \in \{1,\dots,n\}$, define the linear predictor map \begin{align*} \eta_i:\mathbb{R}^d &\to \mathbb{R} \\ \beta &\mapsto x_i^\top \beta \end{align*} and the fitted probability map \begin{align*} p_i:\mathbb{R}^d &\to (0,1) \\ \beta &\mapsto \frac{e^{\eta_i(\beta)}}{1+e^{\eta_i(\beta)}}. \end{align*} Then \begin{align*} 1-p_i(\beta) &=1-\frac{e^{\eta_i(\beta)}}{1+e^{\eta_i(\beta)}}\\ &=\frac{1}{1+e^{\eta_i(\beta)}}. \end{align*} Since $Y_i \in \{0,1\}$ and $Y_1,\dots,Y_n$ are conditionally independent Bernoulli random variables with success probabilities $p_i(\beta)$, the likelihood function $L:\mathbb{R}^d \to (0,\infty)$ is \begin{align*} L(\beta) &=\prod_{i=1}^n p_i(\beta)^{Y_i}\bigl(1-p_i(\beta)\bigr)^{1-Y_i}. \end{align*} Substituting the two displayed formulas for $p_i(\beta)$ and $1-p_i(\beta)$ gives \begin{align*} L(\beta) &=\prod_{i=1}^n \left(\frac{e^{\eta_i(\beta)}}{1+e^{\eta_i(\beta)}}\right)^{Y_i} \left(\frac{1}{1+e^{\eta_i(\beta)}}\right)^{1-Y_i}\\ &=\prod_{i=1}^n \frac{e^{Y_i\eta_i(\beta)}}{1+e^{\eta_i(\beta)}}. \end{align*} [/step] [step:Take logarithms to obtain the log-likelihood] Define the log-likelihood function $\ell:\mathbb{R}^d \to \mathbb{R}$ by $\ell(\beta)=\log L(\beta)$. Since each factor in $L(\beta)$ is positive, the logarithm is well-defined. Using the product rule for logarithms and the identity $\log(e^a)=a$ for $a \in \mathbb{R}$, we obtain \begin{align*} \ell(\beta) &=\sum_{i=1}^n \log\left(\frac{e^{Y_i\eta_i(\beta)}}{1+e^{\eta_i(\beta)}}\right)\\ &=\sum_{i=1}^n \left\{Y_i\eta_i(\beta)-\log(1+e^{\eta_i(\beta)})\right\}\\ &=\sum_{i=1}^n \left\{Y_i x_i^\top\beta-\log(1+e^{x_i^\top\beta})\right\}. \end{align*} This is the stated expression for the log-likelihood. [/step] [step:Differentiate the log-likelihood to compute the score] Fix $\beta \in \mathbb{R}^d$ and $h \in \mathbb{R}^d$. The total derivative of $\eta_i$ at $\beta$ is the [linear map](/page/Linear%20Map) \begin{align*} D(\eta_i)_\beta:\mathbb{R}^d &\to \mathbb{R} \\ h &\mapsto x_i^\top h. \end{align*} By the chain rule, \begin{align*} D\left(\log(1+e^{\eta_i})\right)_\beta(h) &=\frac{e^{\eta_i(\beta)}}{1+e^{\eta_i(\beta)}}\,D(\eta_i)_\beta(h)\\ &=p_i(\beta)x_i^\top h. \end{align*} Therefore \begin{align*} D\ell_\beta(h) &=\sum_{i=1}^n\left\{Y_i x_i^\top h-p_i(\beta)x_i^\top h\right\}\\ &=\sum_{i=1}^n\bigl(Y_i-p_i(\beta)\bigr)x_i^\top h\\ &=\left(\sum_{i=1}^n x_i\bigl(Y_i-p_i(\beta)\bigr)\right)^\top h. \end{align*} By the defining relation between the gradient and the Euclidean inner product on $\mathbb{R}^d$, the score vector $U(\beta)=\nabla \ell(\beta)$ is \begin{align*} U(\beta)=\sum_{i=1}^n x_i\bigl(Y_i-p_i(\beta)\bigr). \end{align*} [/step] [step:Differentiate the score to compute the observed Hessian] Fix $\beta \in \mathbb{R}^d$ and $h \in \mathbb{R}^d$. First compute the derivative of $p_i$. Since \begin{align*} p_i(\beta)=\frac{e^{\eta_i(\beta)}}{1+e^{\eta_i(\beta)}}, \end{align*} the one-variable derivative of the map \begin{align*} q:\mathbb{R} &\to (0,1) \\ t &\mapsto \frac{e^t}{1+e^t} \end{align*} is \begin{align*} q'(t) &=\frac{e^t(1+e^t)-e^t e^t}{(1+e^t)^2}\\ &=\frac{e^t}{(1+e^t)^2}. \end{align*} Thus, by the chain rule, \begin{align*} D(p_i)_\beta(h) &=\frac{e^{\eta_i(\beta)}}{(1+e^{\eta_i(\beta)})^2}x_i^\top h\\ &=p_i(\beta)\bigl(1-p_i(\beta)\bigr)x_i^\top h. \end{align*} Using the score formula from the previous step and treating $Y_i$ and $x_i$ as fixed observed quantities, we obtain \begin{align*} D U_\beta(h) &=-\sum_{i=1}^n x_i\,D(p_i)_\beta(h)\\ &=-\sum_{i=1}^n p_i(\beta)\bigl(1-p_i(\beta)\bigr)x_i(x_i^\top h)\\ &=\left(-\sum_{i=1}^n p_i(\beta)\bigl(1-p_i(\beta)\bigr)x_i x_i^\top\right)h. \end{align*} Hence the Jacobian matrix of $U$ at $\beta$, equivalently the observed Hessian matrix of $\ell$ at $\beta$, is \begin{align*} H(\beta)=J U_\beta =-\sum_{i=1}^n p_i(\beta)\bigl(1-p_i(\beta)\bigr)x_i x_i^\top. \end{align*} This proves the stated formulas for the log-likelihood, score vector, and observed Hessian. [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Jacobian

Explore Further

Jacobian Theorem #34 Existence of Nonmeasurable Subsets of the Real Line Probability & Statistics Conditional Variance Formula for the Ordinary Least Squares Estimator Probability & Statistics Doob's Maximal Inequality Martingale Theory Conditioning and Independence Conditional Expectation Computational Formula for Variance Probability Theory Consistency of Ordinary Least Squares Under Random Design Probability & Statistics Affine Regression as Projection of the Conditional Mean Probability & Statistics $L^p$ Contraction Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.