MLE Equals Least Squares Under Normality — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The multivariate normal density of $Y \sim N_n(X\beta, \sigma^2 I_n)$ gives a log-likelihood in $(\beta, \sigma^2)$ whose only $\beta$-dependence is through the quadratic form $S(\beta) = \|Y - X\beta\|^2$, multiplied by a negative factor. Maximising in $\beta$ therefore reduces to **minimising** $S(\beta)$, which by the [normal equations](/theorems/501) is solved by the OLS estimator $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. Substituting and then optimising the profile log-likelihood in $\sigma^2$ via a first-order condition yields $\hat\sigma^2 = S(\hat{\beta})/n = \mathrm{RSS}/n$. [/proofplan] [step:Write the log-likelihood of the normal linear model] Under the assumption $Y \sim N_n(X\beta, \sigma^2 I_n)$ with $X \in \mathbb{R}^{n \times p}$ of full column rank $p < n$, the joint density of $Y$ at a realisation $y \in \mathbb{R}^n$ is \begin{align*} f_{\beta, \sigma^2}(y) = (2\pi \sigma^2)^{-n/2}\, \exp\!\left( -\frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta) \right). \end{align*} The log-likelihood as a function of the parameters, with $Y$ fixed, is the map \begin{align*} \ell : \mathbb{R}^p \times (0, \infty) &\to \mathbb{R} \\ (\beta, \sigma^2) &\mapsto -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2} S(\beta), \end{align*} where \begin{align*} S(\beta) := (Y - X\beta)^\top (Y - X\beta) = \|Y - X\beta\|_2^2 \end{align*} is the residual sum of squares at parameter value $\beta$. [/step] [step:Maximising $\ell$ in $\beta$ reduces to minimising $S(\beta)$] Fix $\sigma^2 > 0$. Among the three terms of $\ell(\beta, \sigma^2)$, only the last depends on $\beta$, and it enters through the factor $-1/(2\sigma^2) < 0$ multiplying $S(\beta)$. Therefore, for every fixed $\sigma^2 > 0$, \begin{align*} \arg\max_{\beta \in \mathbb{R}^p} \ell(\beta, \sigma^2) = \arg\max_{\beta \in \mathbb{R}^p} \left( -\frac{1}{2\sigma^2} S(\beta) \right) = \arg\min_{\beta \in \mathbb{R}^p} S(\beta). \end{align*} The right-hand side is **exactly** the least-squares criterion. By the [normal equations](/theorems/501), the unique minimiser (since $X$ has full column rank, $X^\top X$ is positive definite) is \begin{align*} \hat{\beta} = (X^\top X)^{-1} X^\top Y. \end{align*} This is the MLE of $\beta$, and it coincides with the OLS estimator. Crucially, $\hat{\beta}$ does not depend on $\sigma^2$. [guided] The cleanest way to see this decoupling is to observe that $\ell$ separates as \begin{align*} \ell(\beta, \sigma^2) = \underbrace{-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2}_{\text{depends only on } \sigma^2} \;-\; \frac{1}{2\sigma^2} S(\beta). \end{align*} Only the second piece involves $\beta$. For any fixed positive $\sigma^2$, the factor $-1/(2\sigma^2)$ is a strictly negative constant, so maximising \begin{align*} \beta \mapsto -\frac{1}{2\sigma^2}\, S(\beta) \end{align*} is the same as minimising $\beta \mapsto S(\beta)$. The minimiser therefore does not depend on the value of $\sigma^2$, and we may optimise over $\beta$ first. To be precise about the minimisation: the quadratic $S$ expands as \begin{align*} S(\beta) = Y^\topY - 2\beta^\top X^\top Y + \beta^\top X^\top X \beta. \end{align*} Taking the gradient and setting it to zero, \begin{align*} \nabla_{\beta} S(\beta) = -2 X^\top Y + 2 X^\top X \beta = \mathbf{0} \iff X^\top X \beta = X^\top Y, \end{align*} which are the [normal equations](/theorems/501). Since $X$ has full column rank by hypothesis, $X^\top X$ is invertible, giving the unique critical point $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. The Hessian $\nabla^2 S = 2 X^\top X$ is positive definite, so this critical point is the global minimiser of $S$. Under the normal model, this OLS estimator therefore simultaneously serves as the MLE of $\beta$ for every fixed $\sigma^2$. Remark: this equivalence is specific to the Gaussian noise assumption. For non-Gaussian noise (e.g. Laplace), the MLE corresponds to a different loss (in the Laplace case, minimising $\|Y - X\beta\|_1$, the least absolute deviations criterion), and no longer agrees with OLS. [/guided] [/step] [step:Form the profile log-likelihood and differentiate in $\sigma^2$] Substituting $\beta = \hat{\beta}$ into $\ell$, and writing $\mathrm{RSS} := S(\hat{\beta})$, we obtain the profile log-likelihood in $\sigma^2$: \begin{align*} \ell_{\mathrm{prof}} : (0, \infty) &\to \mathbb{R} \\ \sigma^2 &\mapsto -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{\mathrm{RSS}}{2\sigma^2}. \end{align*} Differentiating with respect to $\sigma^2$ (treating $\sigma^2$ as a single real parameter), \begin{align*} \frac{\partial \ell_{\mathrm{prof}}}{\partial \sigma^2}(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{\mathrm{RSS}}{2(\sigma^2)^2}. \end{align*} Setting this equal to zero and multiplying through by $2(\sigma^2)^2 > 0$, \begin{align*} -n\sigma^2 + \mathrm{RSS} = 0 \iff \sigma^2 = \frac{\mathrm{RSS}}{n}. \end{align*} Denote this critical point by $\hat\sigma^2 := \mathrm{RSS}/n$. [/step] [step:Verify the critical point is a maximum] We check the second-order condition at $\hat\sigma^2$. Differentiating $\partial_{\sigma^2} \ell_{\mathrm{prof}}$ again, \begin{align*} \frac{\partial^2 \ell_{\mathrm{prof}}}{\partial (\sigma^2)^2}(\sigma^2) = \frac{n}{2(\sigma^2)^2} - \frac{\mathrm{RSS}}{(\sigma^2)^3}. \end{align*} At $\sigma^2 = \hat\sigma^2 = \mathrm{RSS}/n$, substituting $\mathrm{RSS} = n\hat\sigma^2$, \begin{align*} \frac{\partial^2 \ell_{\mathrm{prof}}}{\partial (\sigma^2)^2}(\hat\sigma^2) = \frac{n}{2\hat\sigma^4} - \frac{n\hat\sigma^2}{\hat\sigma^6} = \frac{n}{2\hat\sigma^4} - \frac{n}{\hat\sigma^4} = -\frac{n}{2\hat\sigma^4} < 0, \end{align*} provided $\mathrm{RSS} > 0$, which holds almost surely under the continuous normal model. (If $\mathrm{RSS} = 0$ the MLE does not exist as a finite positive number; we exclude this probability-zero event.) The negativity of the second derivative shows that $\hat\sigma^2$ is a local maximum of $\ell_{\mathrm{prof}}$. Since $\ell_{\mathrm{prof}}(\sigma^2) \to -\infty$ as $\sigma^2 \to 0^+$ (the $-\mathrm{RSS}/(2\sigma^2)$ term dominates) and as $\sigma^2 \to \infty$ (the $-(n/2)\log\sigma^2$ term dominates), and $\ell_{\mathrm{prof}}$ is continuous with a unique interior critical point, this local maximum is in fact global. Combining, the MLE under the normal linear model is \begin{align*} (\hat{\beta}_{\mathrm{MLE}}, \hat\sigma^2_{\mathrm{MLE}}) = \left( (X^\top X)^{-1} X^\top Y,\; \frac{\mathrm{RSS}}{n} \right), \end{align*} and $\hat{\beta}_{\mathrm{MLE}}$ coincides with the OLS estimator, as claimed. [guided] Let us make the boundary behaviour explicit, since the profile log-likelihood is defined on the open half-line $(0, \infty)$ and has no boundary where standard compactness arguments could apply. We are trying to show that the interior critical point $\hat\sigma^2 = \mathrm{RSS}/n$ is a **global** maximum, not merely local. Examine the three terms of $\ell_{\mathrm{prof}}(\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{\mathrm{RSS}}{2\sigma^2}$. * **As $\sigma^2 \to 0^+$:** The term $-\mathrm{RSS}/(2\sigma^2) \to -\infty$, and $-\tfrac{n}{2}\log\sigma^2 \to +\infty$, but the $1/\sigma^2$ blow-up is polynomial while the $-\log\sigma^2$ blow-up is logarithmic, so $-\mathrm{RSS}/(2\sigma^2)$ wins. Hence $\ell_{\mathrm{prof}}(\sigma^2) \to -\infty$. * **As $\sigma^2 \to \infty$:** The term $-\frac{n}{2}\log\sigma^2 \to -\infty$, and $-\mathrm{RSS}/(2\sigma^2) \to 0$, so $\ell_{\mathrm{prof}}(\sigma^2) \to -\infty$. So $\ell_{\mathrm{prof}}$ goes to $-\infty$ at both ends of $(0, \infty)$ and is continuous. By the extreme value theorem applied to any closed interval $[\varepsilon, M] \subset (0, \infty)$ that contains $\hat\sigma^2$ and takes values on $[\varepsilon, M]^c$ below $\ell_{\mathrm{prof}}(\hat\sigma^2)$, the maximum of $\ell_{\mathrm{prof}}$ on $(0, \infty)$ is attained at an interior critical point. Since $\hat\sigma^2$ is the unique such critical point and $\ell_{\mathrm{prof}}''(\hat\sigma^2) < 0$, it is the global maximum. This completes the MLE computation. Observe that $\hat\sigma^2_{\mathrm{MLE}} = \mathrm{RSS}/n$ is **not** the same as the unbiased estimator $\tilde\sigma^2 = \mathrm{RSS}/(n-p)$ used in inference: the MLE divides by $n$, whereas the unbiased estimator adjusts for degrees of freedom consumed in estimating $\beta$. Both converge at the usual $1/\sqrt{n}$ rate as $n \to \infty$, but only the $(n-p)$ version is an unbiased estimator of $\sigma^2$. [/guided] [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

MLE Equals Least Squares Under Normality (Theorem # 1440)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

MLE Equals Least Squares Under Normality (Theorem # 1440)

Discussion

Proof

Explore Further