Closed Form Formula for the Ridge Regression Estimator (Theorem # 4461)

Proof

[proofplan]
We expand the ridge objective as a quadratic polynomial in $\beta$ and compute its first variation in an arbitrary direction $h \in \mathbb{R}^p$. The stationarity condition is the linear system $(X^\top X + \rho I_p)\beta = X^\top y$. The regularization term makes the coefficient matrix positive definite, hence invertible, so this system has exactly one solution. Finally, the same positive definiteness gives strict convexity, which turns the stationary point into the unique global minimizer.
[/proofplan]

[step:Compute the stationarity equation for the ridge objective]
Define the symmetric matrix $A \in \mathbb{R}^{p \times p}$ and the vector $b \in \mathbb{R}^p$ by
\begin{align*}
A &:= X^\top X + \rho I_p, & b &:= X^\top y.
\end{align*}
For $\beta \in \mathbb{R}^p$, expanding the Euclidean norms gives
\begin{align*}
J_\rho(\beta) &= (y - X\beta)^\top (y - X\beta) + \rho \beta^\top \beta \\
&= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta + \rho \beta^\top \beta \\
&= y^\top y - 2\beta^\top b + \beta^\top A\beta.
\end{align*}
Let $h \in \mathbb{R}^p$ be an arbitrary direction and define the one-variable map
\begin{align*}
\varphi_{\beta,h}: \mathbb{R} &\to \mathbb{R} \\
t &\mapsto J_\rho(\beta + th).
\end{align*}
Since $A^\top = A$, direct expansion gives
\begin{align*}
\varphi_{\beta,h}(t) &= y^\top y - 2(\beta + th)^\top b + (\beta + th)^\top A(\beta + th) \\
&= J_\rho(\beta) + 2t\,h^\top(A\beta - b) + t^2 h^\top A h.
\end{align*}
Therefore
\begin{align*}
\varphi_{\beta,h}'(0) = 2h^\top(A\beta - b).
\end{align*}
Thus a point $\beta \in \mathbb{R}^p$ is stationary for $J_\rho$ exactly when
\begin{align*}
h^\top(A\beta - b) = 0
\end{align*}
for every $h \in \mathbb{R}^p$. Taking $h = A\beta - b$ shows that this is equivalent to
\begin{align*}
A\beta = b.
\end{align*}

[guided]
The objective is quadratic, so the natural way to find the minimizer is to compute its first variation. We first package the coefficients in notation that will also be used later. Define
\begin{align*}
A &:= X^\top X + \rho I_p \in \mathbb{R}^{p \times p}, & b &:= X^\top y \in \mathbb{R}^p.
\end{align*}
The matrix $A$ is symmetric because $(X^\top X)^\top = X^\top X$ and $I_p^\top = I_p$. Now expand the objective for an arbitrary $\beta \in \mathbb{R}^p$:
\begin{align*}
J_\rho(\beta) &= |y - X\beta|^2 + \rho |\beta|^2 \\
&= (y - X\beta)^\top (y - X\beta) + \rho \beta^\top \beta \\
&= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta + \rho \beta^\top \beta \\
&= y^\top y - 2\beta^\top b + \beta^\top A\beta.
\end{align*}
To test stationarity at $\beta$, choose an arbitrary direction $h \in \mathbb{R}^p$ and restrict $J_\rho$ to the affine line through $\beta$ in direction $h$. That is, define
\begin{align*}
\varphi_{\beta,h}: \mathbb{R} &\to \mathbb{R} \\
t &\mapsto J_\rho(\beta + th).
\end{align*}
Substituting $\beta + th$ into the expanded formula gives
\begin{align*}
\varphi_{\beta,h}(t) &= y^\top y - 2(\beta + th)^\top b + (\beta + th)^\top A(\beta + th) \\
&= y^\top y - 2\beta^\top b - 2t h^\top b + \beta^\top A\beta + t h^\top A\beta + t\beta^\top A h + t^2 h^\top A h.
\end{align*}
Since $A$ is symmetric, $\beta^\top A h = h^\top A\beta$, so
\begin{align*}
\varphi_{\beta,h}(t) &= J_\rho(\beta) + 2t\,h^\top(A\beta - b) + t^2 h^\top A h.
\end{align*}
Differentiating this polynomial in $t$ at $t=0$ gives
\begin{align*}
\varphi_{\beta,h}'(0) = 2h^\top(A\beta - b).
\end{align*}
A stationary point must have zero directional derivative in every direction $h \in \mathbb{R}^p$, so it must satisfy
\begin{align*}
h^\top(A\beta - b) = 0
\end{align*}
for every $h \in \mathbb{R}^p$. This condition is equivalent to $A\beta - b = 0$: taking $h = A\beta - b$ gives
\begin{align*}
|A\beta - b|^2 = 0.
\end{align*}
Hence the stationarity equation is
\begin{align*}
A\beta = b,
\end{align*}
or, in the original notation,
\begin{align*}
(X^\top X + \rho I_p)\beta = X^\top y.
\end{align*}
[/guided]
[/step]

[step:Show the regularized normal matrix is positive definite]
For every $v \in \mathbb{R}^p$,
\begin{align*}
v^\top A v &= v^\top X^\top Xv + \rho v^\top v \\
&= |Xv|^2 + \rho |v|^2.
\end{align*}
If $v \ne 0$, then $|v|^2 > 0$, and since $\rho > 0$ and $|Xv|^2 \ge 0$,
\begin{align*}
v^\top A v = |Xv|^2 + \rho |v|^2 > 0.
\end{align*}
Thus $A$ is positive definite.

[guided]
The role of the ridge parameter $\rho > 0$ is to make the normal matrix invertible even when $X^\top X$ itself may be singular. Let $v \in \mathbb{R}^p$ be arbitrary. Using the definition of $A$, we compute
\begin{align*}
v^\top A v &= v^\top (X^\top X + \rho I_p)v \\
&= v^\top X^\top Xv + \rho v^\top I_p v \\
&= (Xv)^\top(Xv) + \rho v^\top v \\
&= |Xv|^2 + \rho |v|^2.
\end{align*}
Both terms on the right are non-negative. If $v \ne 0$, then $|v|^2 > 0$; because $\rho > 0$, the second term satisfies $\rho |v|^2 > 0$. Therefore
\begin{align*}
v^\top A v > 0
\end{align*}
for every nonzero $v \in \mathbb{R}^p$. This is exactly positive definiteness of $A$.
[/guided]
[/step]

[step:Invert the stationarity equation]
Since $A$ is positive definite, $\ker A = \{0\}$. Indeed, if $Av = 0$ for some $v \in \mathbb{R}^p$, then
\begin{align*}
0 = v^\top Av = |Xv|^2 + \rho |v|^2,
\end{align*}
so $\rho |v|^2 = 0$, and hence $v=0$. Thus the columns of the square matrix $A \in \mathbb{R}^{p \times p}$ are linearly independent, so they form a basis of $\mathbb{R}^p$. Therefore $A$ is invertible, and the stationarity equation has the unique solution
\begin{align*}
\beta_\rho = A^{-1}b = (X^\top X + \rho I_p)^{-1}X^\top y.
\end{align*}
[/step]

[step:Use strict convexity to identify the unique minimizer]
Let $\beta_\rho := A^{-1}b \in \mathbb{R}^p$. For any $\beta \in \mathbb{R}^p$, write $h := \beta - \beta_\rho \in \mathbb{R}^p$. Since $A\beta_\rho = b$, the quadratic expansion gives
\begin{align*}
J_\rho(\beta_\rho + h) &= y^\top y - 2(\beta_\rho + h)^\top b + (\beta_\rho + h)^\top A(\beta_\rho + h) \\
&= J_\rho(\beta_\rho) + 2h^\top(A\beta_\rho - b) + h^\top A h \\
&= J_\rho(\beta_\rho) + h^\top A h.
\end{align*}
By positive definiteness, $h^\top A h \ge 0$, with equality only when $h=0$. Hence
\begin{align*}
J_\rho(\beta) \ge J_\rho(\beta_\rho),
\end{align*}
and equality holds exactly when $\beta = \beta_\rho$. Therefore $J_\rho$ has the unique minimizer
\begin{align*}
\hat{\beta}^{\mathrm{ridge}}(\rho) = \beta_\rho = (X^\top X + \rho I_p)^{-1}X^\top y.
\end{align*}
[/step]
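The argument above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original proof. The function names (`ridge_system`, `cholesky_solve`, `ridge_objective`, `ridge_estimator`) and the synthetic data are assumptions made for illustration. Solving via a Cholesky factorization is one possible choice that relies on the positive definiteness of $A$ shown above; the proof itself only uses invertibility. The design matrix is built with a duplicated column, so $X^\top X$ is singular while $A = X^\top X + \rho I_p$ is not.

```python
import math
import random


def ridge_system(X, y, rho):
    """Return A = X^T X + rho*I_p and b = X^T y (X is a list of n rows of length p)."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) + (rho if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    return A, b


def cholesky_solve(A, b):
    """Solve A x = b for a symmetric positive definite A via A = L L^T."""
    p = len(A)
    L = [[0.0] * p for _ in range(p)]
    for j in range(p):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)  # d > 0 holds because A is positive definite
        for i in range(j + 1, p):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    # Forward substitution: L z = b
    z = [0.0] * p
    for i in range(p):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = z
    x = [0.0] * p
    for i in reversed(range(p)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, p))) / L[i][i]
    return x


def ridge_objective(X, y, beta, rho):
    """J_rho(beta) = |y - X beta|^2 + rho |beta|^2."""
    res = [yi - sum(xij * bj for xij, bj in zip(row, beta)) for row, yi in zip(X, y)]
    return sum(r * r for r in res) + rho * sum(bj * bj for bj in beta)


def ridge_estimator(X, y, rho):
    """Solve the stationarity equation (X^T X + rho I_p) beta = X^T y."""
    if rho <= 0:
        raise ValueError("rho must be positive for A to be positive definite")
    A, b = ridge_system(X, y, rho)
    return cholesky_solve(A, b)


# Synthetic data (an assumption for illustration only).
random.seed(0)
n, p, rho = 12, 3, 0.5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
for row in X:
    row[2] = row[1]  # duplicated column, so X^T X alone is singular
y = [random.gauss(0, 1) for _ in range(n)]

beta_hat = ridge_estimator(X, y, rho)
A, b = ridge_system(X, y, rho)

# Check the identity J(beta_hat + h) - J(beta_hat) = h^T A h for a random h.
h = [random.gauss(0, 1) for _ in range(p)]
shifted = [bj + hj for bj, hj in zip(beta_hat, h)]
gap = ridge_objective(X, y, shifted, rho) - ridge_objective(X, y, beta_hat, rho)
quad = sum(h[j] * A[j][k] * h[k] for j in range(p) for k in range(p))
print(gap, quad)
```

The two printed numbers should agree up to floating-point rounding and be strictly positive, matching the identity $J_\rho(\beta_\rho + h) = J_\rho(\beta_\rho) + h^\top A h$ from the final step. In practice one would typically solve the linear system with a numerical library rather than a hand-written factorization.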
