[proofplan]
The multivariate normal density of $Y \sim N_n(X\beta, \sigma^2 I_n)$ gives a log-likelihood in $(\beta, \sigma^2)$ whose only $\beta$-dependence is through the quadratic form $S(\beta) = \|Y - X\beta\|^2$, multiplied by a negative factor. Maximising in $\beta$ therefore reduces to **minimising** $S(\beta)$, which by the [normal equations](/theorems/501) is solved by the OLS estimator $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. Substituting and then optimising the profile log-likelihood in $\sigma^2$ via a first-order condition yields $\hat\sigma^2 = S(\hat{\beta})/n = \mathrm{RSS}/n$.
[/proofplan]
[step:Write the log-likelihood of the normal linear model]
Under the assumption $Y \sim N_n(X\beta, \sigma^2 I_n)$ with $X \in \mathbb{R}^{n \times p}$ of full column rank $p < n$, the joint density of $Y$ at a realisation $y \in \mathbb{R}^n$ is
\begin{align*}
f_{\beta, \sigma^2}(y) = (2\pi \sigma^2)^{-n/2}\, \exp\!\left( -\frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta) \right).
\end{align*}
The log-likelihood as a function of the parameters, with $Y$ fixed, is the map
\begin{align*}
\ell : \mathbb{R}^p \times (0, \infty) &\to \mathbb{R} \\
(\beta, \sigma^2) &\mapsto -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2} S(\beta),
\end{align*}
where
\begin{align*}
S(\beta) := (Y - X\beta)^\top (Y - X\beta) = \|Y - X\beta\|_2^2
\end{align*}
is the residual sum of squares at parameter value $\beta$.
[/step]
[step:Maximising $\ell$ in $\beta$ reduces to minimising $S(\beta)$]
Fix $\sigma^2 > 0$. Among the three terms of $\ell(\beta, \sigma^2)$, only the last depends on $\beta$, and it enters through the factor $-1/(2\sigma^2) < 0$ multiplying $S(\beta)$. Therefore, for every fixed $\sigma^2 > 0$,
\begin{align*}
\arg\max_{\beta \in \mathbb{R}^p} \ell(\beta, \sigma^2) = \arg\max_{\beta \in \mathbb{R}^p} \left( -\frac{1}{2\sigma^2} S(\beta) \right) = \arg\min_{\beta \in \mathbb{R}^p} S(\beta).
\end{align*}
The right-hand side is **exactly** the least-squares criterion. By the [normal equations](/theorems/501), the unique minimiser (since $X$ has full column rank, $X^\top X$ is positive definite) is
\begin{align*}
\hat{\beta} = (X^\top X)^{-1} X^\top Y.
\end{align*}
This is the MLE of $\beta$, and it coincides with the OLS estimator. Crucially, $\hat{\beta}$ does not depend on $\sigma^2$.
[guided]
The cleanest way to see this decoupling is to observe that $\ell$ separates as
\begin{align*}
\ell(\beta, \sigma^2) = \underbrace{-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2}_{\text{depends only on } \sigma^2} \;-\; \frac{1}{2\sigma^2} S(\beta).
\end{align*}
Only the second piece involves $\beta$. For any fixed positive $\sigma^2$, the factor $-1/(2\sigma^2)$ is a strictly negative constant, so maximising
\begin{align*}
\beta \mapsto -\frac{1}{2\sigma^2}\, S(\beta)
\end{align*}
is the same as minimising $\beta \mapsto S(\beta)$. The minimiser therefore does not depend on the value of $\sigma^2$, and we may optimise over $\beta$ first.
To be precise about the minimisation: the quadratic $S$ expands as
\begin{align*}
S(\beta) = Y^\topY - 2\beta^\top X^\top Y + \beta^\top X^\top X \beta.
\end{align*}
Taking the gradient and setting it to zero,
\begin{align*}
\nabla_{\beta} S(\beta) = -2 X^\top Y + 2 X^\top X \beta = \mathbf{0} \iff X^\top X \beta = X^\top Y,
\end{align*}
which are the [normal equations](/theorems/501). Since $X$ has full column rank by hypothesis, $X^\top X$ is invertible, giving the unique critical point $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. The Hessian $\nabla^2 S = 2 X^\top X$ is positive definite, so this critical point is the global minimiser of $S$. Under the normal model, this OLS estimator therefore simultaneously serves as the MLE of $\beta$ for every fixed $\sigma^2$.
Remark: this equivalence is specific to the Gaussian noise assumption. For non-Gaussian noise (e.g. Laplace), the MLE corresponds to a different loss (in the Laplace case, minimising $\|Y - X\beta\|_1$, the least absolute deviations criterion), and no longer agrees with OLS.
[/guided]
[/step]
[step:Form the profile log-likelihood and differentiate in $\sigma^2$]
Substituting $\beta = \hat{\beta}$ into $\ell$, and writing $\mathrm{RSS} := S(\hat{\beta})$, we obtain the profile log-likelihood in $\sigma^2$:
\begin{align*}
\ell_{\mathrm{prof}} : (0, \infty) &\to \mathbb{R} \\
\sigma^2 &\mapsto -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{\mathrm{RSS}}{2\sigma^2}.
\end{align*}
Differentiating with respect to $\sigma^2$ (treating $\sigma^2$ as a single real parameter),
\begin{align*}
\frac{\partial \ell_{\mathrm{prof}}}{\partial \sigma^2}(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{\mathrm{RSS}}{2(\sigma^2)^2}.
\end{align*}
Setting this equal to zero and multiplying through by $2(\sigma^2)^2 > 0$,
\begin{align*}
-n\sigma^2 + \mathrm{RSS} = 0 \iff \sigma^2 = \frac{\mathrm{RSS}}{n}.
\end{align*}
Denote this critical point by $\hat\sigma^2 := \mathrm{RSS}/n$.
[/step]
[step:Verify the critical point is a maximum]
We check the second-order condition at $\hat\sigma^2$. Differentiating $\partial_{\sigma^2} \ell_{\mathrm{prof}}$ again,
\begin{align*}
\frac{\partial^2 \ell_{\mathrm{prof}}}{\partial (\sigma^2)^2}(\sigma^2) = \frac{n}{2(\sigma^2)^2} - \frac{\mathrm{RSS}}{(\sigma^2)^3}.
\end{align*}
At $\sigma^2 = \hat\sigma^2 = \mathrm{RSS}/n$, substituting $\mathrm{RSS} = n\hat\sigma^2$,
\begin{align*}
\frac{\partial^2 \ell_{\mathrm{prof}}}{\partial (\sigma^2)^2}(\hat\sigma^2) = \frac{n}{2\hat\sigma^4} - \frac{n\hat\sigma^2}{\hat\sigma^6} = \frac{n}{2\hat\sigma^4} - \frac{n}{\hat\sigma^4} = -\frac{n}{2\hat\sigma^4} < 0,
\end{align*}
provided $\mathrm{RSS} > 0$, which holds almost surely under the continuous normal model. (If $\mathrm{RSS} = 0$ the MLE does not exist as a finite positive number; we exclude this probability-zero event.) The negativity of the second derivative shows that $\hat\sigma^2$ is a local maximum of $\ell_{\mathrm{prof}}$. Since $\ell_{\mathrm{prof}}(\sigma^2) \to -\infty$ as $\sigma^2 \to 0^+$ (the $-\mathrm{RSS}/(2\sigma^2)$ term dominates) and as $\sigma^2 \to \infty$ (the $-(n/2)\log\sigma^2$ term dominates), and $\ell_{\mathrm{prof}}$ is continuous with a unique interior critical point, this local maximum is in fact global.
Combining, the MLE under the normal linear model is
\begin{align*}
(\hat{\beta}_{\mathrm{MLE}}, \hat\sigma^2_{\mathrm{MLE}}) = \left( (X^\top X)^{-1} X^\top Y,\; \frac{\mathrm{RSS}}{n} \right),
\end{align*}
and $\hat{\beta}_{\mathrm{MLE}}$ coincides with the OLS estimator, as claimed.
[guided]
Let us make the boundary behaviour explicit, since the profile log-likelihood is defined on the open half-line $(0, \infty)$ and has no boundary where standard compactness arguments could apply. We are trying to show that the interior critical point $\hat\sigma^2 = \mathrm{RSS}/n$ is a **global** maximum, not merely local.
Examine the three terms of $\ell_{\mathrm{prof}}(\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{\mathrm{RSS}}{2\sigma^2}$.
* **As $\sigma^2 \to 0^+$:** The term $-\mathrm{RSS}/(2\sigma^2) \to -\infty$, and $-\tfrac{n}{2}\log\sigma^2 \to +\infty$, but the $1/\sigma^2$ blow-up is polynomial while the $-\log\sigma^2$ blow-up is logarithmic, so $-\mathrm{RSS}/(2\sigma^2)$ wins. Hence $\ell_{\mathrm{prof}}(\sigma^2) \to -\infty$.
* **As $\sigma^2 \to \infty$:** The term $-\frac{n}{2}\log\sigma^2 \to -\infty$, and $-\mathrm{RSS}/(2\sigma^2) \to 0$, so $\ell_{\mathrm{prof}}(\sigma^2) \to -\infty$.
So $\ell_{\mathrm{prof}}$ goes to $-\infty$ at both ends of $(0, \infty)$ and is continuous. By the extreme value theorem applied to any closed interval $[\varepsilon, M] \subset (0, \infty)$ that contains $\hat\sigma^2$ and takes values on $[\varepsilon, M]^c$ below $\ell_{\mathrm{prof}}(\hat\sigma^2)$, the maximum of $\ell_{\mathrm{prof}}$ on $(0, \infty)$ is attained at an interior critical point. Since $\hat\sigma^2$ is the unique such critical point and $\ell_{\mathrm{prof}}''(\hat\sigma^2) < 0$, it is the global maximum.
This completes the MLE computation. Observe that $\hat\sigma^2_{\mathrm{MLE}} = \mathrm{RSS}/n$ is **not** the same as the unbiased estimator $\tilde\sigma^2 = \mathrm{RSS}/(n-p)$ used in inference: the MLE divides by $n$, whereas the unbiased estimator adjusts for degrees of freedom consumed in estimating $\beta$. Both converge at the usual $1/\sqrt{n}$ rate as $n \to \infty$, but only the $(n-p)$ version is an unbiased estimator of $\sigma^2$.
[/guided]
[/step]