[proofplan]
We prove the normal equations directly from the minimizing property of $\hat{\beta}$. For an arbitrary direction $a \in \mathbb{R}^p$, we restrict the least squares objective to the affine line $\hat{\beta} + ta$ and use the fact that $t=0$ is a minimum. Differentiating this one-variable quadratic gives $(Xa) \cdot \hat{\varepsilon}=0$ for every $a$, which is exactly $X^\top \hat{\varepsilon}=0$ and hence orthogonality to the column space of $X$.
[/proofplan]
custom_env
admin
[step:Restrict the least squares objective to an arbitrary line through $\hat{\beta}$]Define the least squares objective
\begin{align*}
Q: \mathbb{R}^p &\to \mathbb{R} \\
\beta &\mapsto |y - X\beta|^2.
\end{align*}
Fix an arbitrary vector $a \in \mathbb{R}^p$, and define the one-variable function
\begin{align*}
\varphi_a: \mathbb{R} &\to \mathbb{R} \\
t &\mapsto Q(\hat{\beta} + ta).
\end{align*}
Since $\hat{\beta}$ minimizes $Q$ over $\mathbb{R}^p$, the point $t=0$ minimizes $\varphi_a$ over $\mathbb{R}$.
Using $\hat{\varepsilon} = y - X\hat{\beta}$, we compute
\begin{align*}
\varphi_a(t)
&= |y - X(\hat{\beta} + ta)|^2 \\
&= |y - X\hat{\beta} - tXa|^2 \\
&= |\hat{\varepsilon} - tXa|^2 \\
&= |\hat{\varepsilon}|^2 - 2t (Xa)\cdot \hat{\varepsilon} + t^2 |Xa|^2.
\end{align*}
Thus $\varphi_a$ is differentiable and
\begin{align*}
\varphi_a'(0) = -2(Xa)\cdot \hat{\varepsilon}.
\end{align*}
Because $t=0$ is a minimum of the differentiable function $\varphi_a$, we have $\varphi_a'(0)=0$. Therefore
\begin{align*}
(Xa)\cdot \hat{\varepsilon}=0.
\end{align*}
Since $a \in \mathbb{R}^p$ was arbitrary, this identity holds for every $a \in \mathbb{R}^p$.[/step]
custom_env
admin
[guided]The goal is to convert the global minimization statement for $Q$ into an algebraic equation. We do this by testing the minimizer in every possible direction.
Define
\begin{align*}
Q: \mathbb{R}^p &\to \mathbb{R} \\
\beta &\mapsto |y - X\beta|^2.
\end{align*}
The hypothesis says that $\hat{\beta}$ minimizes this function over all of $\mathbb{R}^p$. Now fix an arbitrary direction $a \in \mathbb{R}^p$. Moving from $\hat{\beta}$ in the direction $a$ gives the affine line $\hat{\beta} + ta$, where $t \in \mathbb{R}$. Define
\begin{align*}
\varphi_a: \mathbb{R} &\to \mathbb{R} \\
t &\mapsto Q(\hat{\beta} + ta).
\end{align*}
Since $\hat{\beta}$ is a global minimizer of $Q$, the scalar $t=0$ is a global minimizer of $\varphi_a$.
We now compute this scalar quadratic. By the definition $\hat{\varepsilon} := y - X\hat{\beta}$,
\begin{align*}
\varphi_a(t)
&= |y - X(\hat{\beta} + ta)|^2 \\
&= |y - X\hat{\beta} - tXa|^2 \\
&= |\hat{\varepsilon} - tXa|^2 \\
&= |\hat{\varepsilon}|^2 - 2t (Xa)\cdot \hat{\varepsilon} + t^2 |Xa|^2.
\end{align*}
The function $\varphi_a$ is therefore differentiable, and differentiating the displayed quadratic gives
\begin{align*}
\varphi_a'(0) = -2(Xa)\cdot \hat{\varepsilon}.
\end{align*}
A differentiable real-valued function has derivative $0$ at an interior minimum, so $\varphi_a'(0)=0$. Hence
\begin{align*}
(Xa)\cdot \hat{\varepsilon}=0.
\end{align*}
Because the direction $a \in \mathbb{R}^p$ was arbitrary, this proves that the residual is orthogonal to every vector of the form $Xa$.[/guided]
custom_env
admin
[step:Translate directional orthogonality into the matrix equation $X^\top\hat{\varepsilon}=0$]
For every $a \in \mathbb{R}^p$, the previous step gives
\begin{align*}
0 = (Xa)\cdot \hat{\varepsilon} = a^\top X^\top \hat{\varepsilon}.
\end{align*}
Define $w \in \mathbb{R}^p$ by
\begin{align*}
w := X^\top \hat{\varepsilon}.
\end{align*}
Then $a^\top w = 0$ for every $a \in \mathbb{R}^p$. Taking $a=w$ gives
\begin{align*}
|w|^2 = w^\top w = 0.
\end{align*}
Therefore $w=0$, which is precisely
\begin{align*}
X^\top \hat{\varepsilon}=0.
\end{align*}
[/step]
custom_env
admin
[step:Identify the equivalent orthogonality statement for $\operatorname{Range}(X)$]
Let $v \in \operatorname{Range}(X)$. By definition of the range of $X$, there exists $a \in \mathbb{R}^p$ such that
\begin{align*}
v = Xa.
\end{align*}
Using $X^\top \hat{\varepsilon}=0$, we obtain
\begin{align*}
v \cdot \hat{\varepsilon}
= (Xa)\cdot \hat{\varepsilon}
= a^\top X^\top \hat{\varepsilon}
= a^\top 0
= 0.
\end{align*}
Thus $\hat{\varepsilon}$ is orthogonal to every vector in $\operatorname{Range}(X)$.
Conversely, if $\hat{\varepsilon}$ is orthogonal to every vector in $\operatorname{Range}(X)$, then $(Xa)\cdot \hat{\varepsilon}=0$ for every $a \in \mathbb{R}^p$. Hence $a^\top X^\top\hat{\varepsilon}=0$ for every $a \in \mathbb{R}^p$, and the argument with $w := X^\top\hat{\varepsilon}$ gives $X^\top\hat{\varepsilon}=0$. Therefore the two formulations are equivalent.
[/step]