Iteratively Reweighted Least Squares Normal Equation Update

Iteratively Reweighted Least Squares Normal Equation Update (Theorem # 4467)

Theorem



Proof

[proofplan] The Fisher scoring quadratic approximation is minimized by solving the weighted least-squares problem with working response $z^{(k)}$ and weight matrix $W^{(k)}$. We expand the weighted residual criterion as a quadratic function of $\beta$, compute its directional derivative, and identify the normal equations. The assumed invertibility of $X^\top W^{(k)}X$ gives a unique critical point, and nonnegativity of the quadratic form $u^\top X^\top W^{(k)}Xu$, combined with that invertibility, shows that this critical point is the unique global minimizer. [/proofplan]

[step:Expand the weighted residual criterion as a quadratic polynomial in $\beta$] For this fixed iteration $k$, write $z := z^{(k)} \in \mathbb{R}^n$ and $W := W^{(k)} \in \mathbb{R}^{n \times n}$. The matrix $W$ is symmetric because it is diagonal. Define \begin{align*} Q:\mathbb{R}^p &\to \mathbb{R} \\ \beta &\mapsto (z-X\beta)^\top W(z-X\beta). \end{align*} Expanding the product and using $W^\top=W$ gives \begin{align*} Q(\beta) &= z^\top Wz - z^\top WX\beta - \beta^\top X^\top Wz + \beta^\top X^\top WX\beta \\ &= z^\top Wz - 2\beta^\top X^\top Wz + \beta^\top X^\top WX\beta. \end{align*} Here $z^\top WX\beta$ and $\beta^\top X^\top Wz$ are equal because both are real $1 \times 1$ matrices and each is the transpose of the other.

[guided] We first remove the residual notation so that the objective is visibly a quadratic polynomial in the unknown vector $\beta$. For this fixed iteration $k$, set $z := z^{(k)}$ and $W := W^{(k)}$. The weighted residual objective is the map \begin{align*} Q:\mathbb{R}^p &\to \mathbb{R} \\ \beta &\mapsto (z-X\beta)^\top W(z-X\beta). \end{align*} Because $W$ is diagonal, it is symmetric: $W^\top=W$. Expanding the product gives \begin{align*} Q(\beta) &= (z^\top-\beta^\top X^\top)W(z-X\beta) \\ &= z^\top Wz - z^\top WX\beta - \beta^\top X^\top Wz + \beta^\top X^\top WX\beta. \end{align*} The two mixed terms are equal as real scalars. Indeed, a real scalar equals its own transpose, and \begin{align*} (z^\top WX\beta)^\top=\beta^\top X^\top W^\top z=\beta^\top X^\top Wz. \end{align*} Therefore \begin{align*} Q(\beta)=z^\top Wz - 2\beta^\top X^\top Wz + \beta^\top X^\top WX\beta. \end{align*} This expansion isolates the constant term, the linear term, and the quadratic curvature matrix $X^\top WX$. [/guided] [/step]

[step:Differentiate the quadratic objective and obtain the weighted normal equations] Let $h \in \mathbb{R}^p$ be an arbitrary direction. For each $t \in \mathbb{R}$, \begin{align*} Q(\beta+th) &= z^\top Wz - 2(\beta+th)^\top X^\top Wz +(\beta+th)^\top X^\top WX(\beta+th). \end{align*} Subtracting $Q(\beta)$, dividing by $t \neq 0$, and letting $t \to 0$ gives, by the symmetry of $X^\top WX$, the directional derivative \begin{align*} \frac{d}{dt}\Big|_{t=0} Q(\beta+th) = 2h^\top X^\top WX\beta - 2h^\top X^\top Wz. \end{align*} Thus the gradient is \begin{align*} \nabla Q(\beta)=2X^\top WX\beta-2X^\top Wz. \end{align*} A critical point therefore satisfies the weighted normal equations \begin{align*} X^\top WX\beta=X^\top Wz. \end{align*}

[guided] To find the minimizer of the quadratic objective, we compute its first variation in an arbitrary direction. Let $h \in \mathbb{R}^p$ be fixed. For $t \in \mathbb{R}$, substitute $\beta+th$ into the expanded expression for $Q$: \begin{align*} Q(\beta+th) &= z^\top Wz - 2(\beta+th)^\top X^\top Wz +(\beta+th)^\top X^\top WX(\beta+th). \end{align*} Now expand only the terms depending on $t$, using the symmetry of $X^\top WX$ to combine $\beta^\top X^\top WXh$ with $h^\top X^\top WX\beta$: \begin{align*} Q(\beta+th) &= Q(\beta) -2t h^\top X^\top Wz +2t h^\top X^\top WX\beta +t^2 h^\top X^\top WXh. \end{align*} The coefficient of $t$ is the directional derivative at $\beta$ in the direction $h$, so \begin{align*} \frac{d}{dt}\Big|_{t=0} Q(\beta+th) =2h^\top X^\top WX\beta-2h^\top X^\top Wz. \end{align*} Since this identity holds for every direction $h \in \mathbb{R}^p$, the gradient is \begin{align*} \nabla Q(\beta)=2X^\top WX\beta-2X^\top Wz. \end{align*} At any minimizer of a differentiable function on $\mathbb{R}^p$, the gradient must vanish. Hence every minimizer must satisfy \begin{align*} X^\top WX\beta=X^\top Wz. \end{align*} These are the weighted normal equations. [/guided] [/step]

[step:Solve the normal equations using the assumed invertibility] By hypothesis, the matrix $X^\top WX \in \mathbb{R}^{p \times p}$ is invertible. Therefore the weighted normal equations have the unique solution \begin{align*} \beta_*=(X^\top WX)^{-1}X^\top Wz. \end{align*} Returning to the iteration notation $z=z^{(k)}$ and $W=W^{(k)}$, this is \begin{align*} \beta_*=\left(X^\top W^{(k)}X\right)^{-1}X^\top W^{(k)}z^{(k)}. \end{align*} [/step]

[step:Verify that the critical point is the unique global minimizer] Let $\beta \in \mathbb{R}^p$ be arbitrary and set $u:=\beta-\beta_* \in \mathbb{R}^p$. Since $\beta_*$ satisfies $X^\top WX\beta_*=X^\top Wz$, the linear term in the expansion around $\beta_*$ vanishes: \begin{align*} Q(\beta) &=Q(\beta_*+u) \\ &=Q(\beta_*)+2u^\top\left(X^\top WX\beta_*-X^\top Wz\right)+u^\top X^\top WXu \\ &=Q(\beta_*)+u^\top X^\top WXu. \end{align*} Because $W=\operatorname{diag}(w_1,\dots,w_n)$ with each $w_i>0$, we have \begin{align*} u^\top X^\top WXu=(Xu)^\top W(Xu)=\sum_{i=1}^{n} w_i ((Xu)_i)^2 \ge 0. \end{align*} Thus $Q(\beta)\ge Q(\beta_*)$ for every $\beta \in \mathbb{R}^p$. If equality holds, then $\sum_{i=1}^{n} w_i ((Xu)_i)^2=0$. Since each $w_i>0$, every $(Xu)_i$ is zero, so $Xu=0$ and hence $X^\top WXu=0$. Because $X^\top WX$ is invertible, this forces $u=0$. Hence $\beta=\beta_*$. Therefore $\beta_*$ is the unique global minimizer of $Q$. Substituting back $W=W^{(k)}$ and $z=z^{(k)}$ gives the Fisher scoring, equivalently IRLS, update \begin{align*} \beta^{(k+1)}=\left(X^\top W^{(k)}X\right)^{-1}X^\top W^{(k)}z^{(k)}. \end{align*} This proves the statement. [/step]
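The argument can be checked numerically for a single iteration. The following is a minimal sketch, not a reference implementation: the data `X`, `w`, `z`, the perturbation `u`, and all function names are made-up illustrations, and the solver is a plain Gaussian elimination written with the standard library only so that the example has no dependencies. It forms $X^\top WX$ and $X^\top Wz$, solves the weighted normal equations, and then evaluates the gradient at $\beta_*$ and the identity $Q(\beta_*+u)=Q(\beta_*)+u^\top X^\top WXu$ from the last step.

```python
def normal_system(X, w, z):
    """Return A = X^T W X and b = X^T W z for W = diag(w)."""
    n, p = len(X), len(X[0])
    A = [[sum(w[i] * X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         for r in range(p)]
    b = [sum(w[i] * X[i][r] * z[i] for i in range(n)) for r in range(p)]
    return A, b


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T W X is numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * p
    for r in reversed(range(p)):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, p))
        x[r] = (M[r][p] - tail) / M[r][r]
    return x


def Q(X, w, z, beta):
    """Weighted residual criterion (z - X beta)^T W (z - X beta)."""
    total = 0.0
    for row, wi, zi in zip(X, w, z):
        resid = zi - sum(xj * bj for xj, bj in zip(row, beta))
        total += wi * resid * resid
    return total


# Hypothetical data for one iteration: n = 4 observations, p = 2 coefficients.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
w = [1.0, 2.0, 0.5, 1.5]   # strictly positive weights, as the proof assumes
z = [0.1, 0.9, 2.3, 2.8]   # working response

A, b = normal_system(X, w, z)
beta_star = solve(A, b)    # solves X^T W X beta = X^T W z

# Gradient 2 (A beta_star - b); it should vanish up to rounding.
gradient = [2.0 * (sum(A[r][c] * beta_star[c] for c in range(2)) - b[r])
            for r in range(2)]

# Compare Q(beta_star + u) - Q(beta_star) with u^T A u for an arbitrary u.
u = [0.3, -0.2]
shifted = [bs + ui for bs, ui in zip(beta_star, u)]
gap = Q(X, w, z, shifted) - Q(X, w, z, beta_star)
quad = sum(u[r] * A[r][c] * u[c] for r in range(2) for c in range(2))

print("beta_star:", beta_star)
print("gradient at beta_star:", gradient)
print("Q gap:", gap, "u^T A u:", quad)
```

In this sketch the gap and the quadratic form agree up to floating-point rounding and are strictly positive for the nonzero `u`, which mirrors the uniqueness argument. For real problems a linear algebra library with a Cholesky or QR based solve would normally replace the hand-written elimination.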
