[proofplan]
View $X$ as a [linear map](/page/Linear%20Map) from coefficient vectors to fitted vectors in $\mathbb{R}^n$. First construct the closest point to $y$ in the finite-dimensional subspace $\operatorname{Range}(X)$ by [orthogonal projection](/theorems/437), and then choose any coefficient vector mapping to that closest point. For uniqueness under full column rank, derive the normal equations from the least-squares minimization problem, prove that $X^\top X$ is invertible, and solve the normal equations explicitly.
[/proofplan]
[step:Construct the orthogonal projection of $y$ onto the fitted-value subspace]
Define the linear map
\begin{align*}
T_X: \mathbb{R}^p &\to \mathbb{R}^n \\
b &\mapsto Xb.
\end{align*}
Let $M := \operatorname{Range}(T_X) \subset \mathbb{R}^n$. Since $M$ is a finite-dimensional subspace of the Euclidean [inner product space](/page/Inner%20Product%20Space) $\mathbb{R}^n$, choose an orthonormal basis $u_1,\dots,u_r \in M$ for $M$, where $r := \dim M$. If $r = 0$, interpret the following sum as $0$. Define
\begin{align*}
P_M y := \sum_{i=1}^r (y \cdot u_i)u_i \in M.
\end{align*}
For every $z \in M$, write $z = \sum_{i=1}^r a_i u_i$ with coefficients $a_1,\dots,a_r \in \mathbb{R}$. Then
\begin{align*}
y - z
&= y - P_M y + \sum_{i=1}^r \bigl((y \cdot u_i)-a_i\bigr)u_i.
\end{align*}
The vector $y-P_M y$ is orthogonal to each $u_i$, because
\begin{align*}
(y-P_M y)\cdot u_i
&= y \cdot u_i - \sum_{j=1}^r (y\cdot u_j)(u_j\cdot u_i) \\
&= y \cdot u_i - (y\cdot u_i) \\
&= 0.
\end{align*}
Therefore the [Pythagorean theorem](/theorems/3266) gives
\begin{align*}
|y-z|^2
&= |y-P_M y|^2 + \sum_{i=1}^r \bigl((y\cdot u_i)-a_i\bigr)^2 \\
&\geq |y-P_M y|^2.
\end{align*}
Thus $P_M y$ is a closest point in $M$ to $y$.
[guided]
The least-squares problem only depends on the fitted vector $Xb$, not directly on the coefficient vector $b$. So we first solve the geometric problem: find the point of the fitted-value subspace closest to $y$.
Define the linear map
\begin{align*}
T_X: \mathbb{R}^p &\to \mathbb{R}^n \\
b &\mapsto Xb.
\end{align*}
Its range
\begin{align*}
M := \operatorname{Range}(T_X)
\end{align*}
is the set of all fitted vectors $Xb$. Since $M$ is a finite-dimensional subspace of $\mathbb{R}^n$, it has an orthonormal basis $u_1,\dots,u_r$, where $r := \dim M$. If $r=0$, then $M=\{0\}$ and all sums below are interpreted as $0$.
Define the vector
\begin{align*}
P_M y := \sum_{i=1}^r (y \cdot u_i)u_i.
\end{align*}
This vector lies in $M$ because it is a linear combination of vectors in the basis of $M$. We now verify that it is the closest point of $M$ to $y$.
Let $z \in M$. Since $u_1,\dots,u_r$ is a basis of $M$, there are unique coefficients $a_1,\dots,a_r \in \mathbb{R}$ such that
\begin{align*}
z = \sum_{i=1}^r a_i u_i.
\end{align*}
Then
\begin{align*}
y - z
&= y - P_M y + P_M y - z \\
&= y - P_M y + \sum_{i=1}^r \bigl((y\cdot u_i)-a_i\bigr)u_i.
\end{align*}
The point of choosing the coefficients $y\cdot u_i$ is that $y-P_M y$ becomes orthogonal to $M$. Indeed, for each $i \in \{1,\dots,r\}$,
\begin{align*}
(y-P_M y)\cdot u_i
&= y \cdot u_i - \sum_{j=1}^r (y\cdot u_j)(u_j\cdot u_i) \\
&= y \cdot u_i - (y\cdot u_i) \\
&= 0,
\end{align*}
where orthonormality gives $u_j\cdot u_i=0$ for $j\neq i$ and $u_i\cdot u_i=1$.
Thus the decomposition of $y-z$ above is an [orthogonal decomposition](/theorems/436). Applying the Pythagorean theorem,
\begin{align*}
|y-z|^2
&= |y-P_M y|^2 + \sum_{i=1}^r \bigl((y\cdot u_i)-a_i\bigr)^2 \\
&\geq |y-P_M y|^2.
\end{align*}
So no vector $z \in M$ is closer to $y$ than $P_M y$. Hence $P_M y$ is a closest fitted vector.
[/guided]
[/step]
[step:Choose a coefficient vector producing the closest fitted value]
Because $P_M y \in M=\operatorname{Range}(T_X)$, there exists $\hat{\beta} \in \mathbb{R}^p$ such that
\begin{align*}
X\hat{\beta} = P_M y.
\end{align*}
For every $b \in \mathbb{R}^p$, the fitted vector $Xb$ lies in $M$. By the closest-point property of $P_M y$,
\begin{align*}
|y-X\hat{\beta}|^2
= |y-P_M y|^2
\leq |y-Xb|^2.
\end{align*}
Therefore $\hat{\beta}$ is an ordinary least squares estimator.
[/step]
[step:Derive the normal equations for every least squares estimator]
Let $\beta \in \mathbb{R}^p$. Define the objective function
\begin{align*}
Q: \mathbb{R}^p &\to \mathbb{R} \\
b &\mapsto |y-Xb|^2.
\end{align*}
For every $h \in \mathbb{R}^p$,
\begin{align*}
Q(\beta+h)
&= |y-X\beta-Xh|^2 \\
&= |y-X\beta|^2 - 2h^\top X^\top (y-X\beta) + |Xh|^2.
\end{align*}
If $\beta$ minimizes $Q$, then for every $h \in \mathbb{R}^p$ and every $t \in \mathbb{R}$,
\begin{align*}
Q(\beta+th)-Q(\beta)
= -2t\,h^\top X^\top(y-X\beta)+t^2|Xh|^2
\geq 0.
\end{align*}
A quadratic polynomial in $t$ that is nonnegative for all $t \in \mathbb{R}$ must have zero linear coefficient, so
\begin{align*}
h^\top X^\top(y-X\beta)=0
\end{align*}
for every $h \in \mathbb{R}^p$. Taking $h=X^\top(y-X\beta)$ gives
\begin{align*}
X^\top(y-X\beta)=0.
\end{align*}
Thus every least squares estimator satisfies the normal equations
\begin{align*}
X^\top X\beta = X^\top y.
\end{align*}
Conversely, if $\beta \in \mathbb{R}^p$ satisfies $X^\top(y-X\beta)=0$, then for every $h \in \mathbb{R}^p$,
\begin{align*}
Q(\beta+h)-Q(\beta)
= |Xh|^2
\geq 0.
\end{align*}
Hence $\beta$ minimizes $Q$. Therefore the least squares estimators are exactly the solutions of
\begin{align*}
X^\top X\beta = X^\top y.
\end{align*}
[guided]
We now translate the geometric minimization condition into an algebraic equation. Define
\begin{align*}
Q: \mathbb{R}^p &\to \mathbb{R} \\
b &\mapsto |y-Xb|^2.
\end{align*}
A vector $\beta \in \mathbb{R}^p$ is an ordinary least squares estimator exactly when it minimizes $Q$.
Fix an arbitrary direction $h \in \mathbb{R}^p$. Expanding the square gives
\begin{align*}
Q(\beta+h)
&= |y-X(\beta+h)|^2 \\
&= |y-X\beta-Xh|^2 \\
&= |y-X\beta|^2 - 2(y-X\beta)\cdot Xh + |Xh|^2 \\
&= |y-X\beta|^2 - 2h^\top X^\top (y-X\beta) + |Xh|^2.
\end{align*}
The last equality uses the identity $(y-X\beta)\cdot Xh = h^\top X^\top(y-X\beta)$.
Assume first that $\beta$ minimizes $Q$. Then moving from $\beta$ in any direction $h$ by any scalar amount $t \in \mathbb{R}$ cannot decrease $Q$, so
\begin{align*}
Q(\beta+th)-Q(\beta)
= -2t\,h^\top X^\top(y-X\beta)+t^2|Xh|^2
\geq 0
\end{align*}
for every $t \in \mathbb{R}$. The left-hand side is a quadratic polynomial in $t$ with constant term $0$. If its linear coefficient were nonzero, choosing $t$ with the opposite sign and sufficiently small absolute value would make the expression negative. Hence the linear coefficient must vanish:
\begin{align*}
h^\top X^\top(y-X\beta)=0
\end{align*}
for every $h \in \mathbb{R}^p$. Taking
\begin{align*}
h := X^\top(y-X\beta) \in \mathbb{R}^p
\end{align*}
gives
\begin{align*}
|X^\top(y-X\beta)|^2 = 0,
\end{align*}
and therefore
\begin{align*}
X^\top(y-X\beta)=0.
\end{align*}
Equivalently,
\begin{align*}
X^\top X\beta = X^\top y.
\end{align*}
These are the normal equations.
Conversely, suppose $\beta \in \mathbb{R}^p$ satisfies
\begin{align*}
X^\top(y-X\beta)=0.
\end{align*}
Then for every $h \in \mathbb{R}^p$, the expansion above simplifies to
\begin{align*}
Q(\beta+h)-Q(\beta)
= |Xh|^2
\geq 0.
\end{align*}
Thus no perturbation of $\beta$ decreases $Q$, so $\beta$ is a minimizer. We have proved that ordinary least squares estimators are precisely the solutions of the normal equations
\begin{align*}
X^\top X\beta = X^\top y.
\end{align*}
[/guided]
[/step]
[step:Invert $X^\top X$ under the full column rank hypothesis]
Assume that $X$ has full column rank. Let $b \in \mathbb{R}^p$ satisfy
\begin{align*}
X^\top Xb = 0.
\end{align*}
Multiplying on the left by $b^\top$ gives
\begin{align*}
0
= b^\top X^\top Xb
= |Xb|^2.
\end{align*}
Hence $Xb=0$. Since $X$ has full column rank, the linear map $T_X:\mathbb{R}^p \to \mathbb{R}^n$ is injective, so $b=0$. Therefore $\ker(X^\top X)=\{0\}$. Because $X^\top X \in \mathbb{R}^{p\times p}$ is a square matrix with trivial kernel, it is invertible.
[/step]
[step:Solve the normal equations and prove uniqueness]
Since $X^\top X$ is invertible, the normal equations have the unique solution
\begin{align*}
\hat{\beta} = (X^\top X)^{-1}X^\top y.
\end{align*}
By the equivalence between minimizers and solutions of the normal equations, this vector is the unique ordinary least squares estimator. This proves existence in general and uniqueness with the stated formula when $X$ has full column rank.
[/step]