[proofplan]
We first verify that the formula for $H$ is well-defined by proving that $X^\top X$ is invertible. We then compute directly that $H$ is symmetric and idempotent, and identify its range with $\operatorname{Range}(X)$. Finally, instead of citing an abstract [projection theorem](/theorems/1985), we prove the closest-point property directly by decomposing $y-v$ into an orthogonal sum.
[/proofplan]
[step:Show that $X^\top X$ is invertible]
Since $X$ has full column rank, the [linear map](/page/Linear%20Map)
\begin{align*}
X:\mathbb{R}^p &\to \mathbb{R}^n \\
a &\mapsto Xa
\end{align*}
is injective. Let $a \in \mathbb{R}^p$ satisfy $X^\top Xa=0$. Taking the Euclidean inner product with $a$ gives
\begin{align*}
0
= a^\top X^\top Xa
= (Xa)^\top (Xa)
= |Xa|^2.
\end{align*}
Hence $Xa=0$, and injectivity of $X$ gives $a=0$. Therefore $\ker(X^\top X)=\{0\}$. Since $X^\top X \in \mathbb{R}^{p \times p}$ is a square matrix with trivial kernel, it is invertible. Thus $H=X(X^\top X)^{-1}X^\top$ is well-defined.
[/step]
[step:Verify that $H$ is symmetric and idempotent]
The matrix $X^\top X$ is symmetric, so its inverse is also symmetric because
\begin{align*}
\bigl((X^\top X)^{-1}\bigr)^\top
=
\bigl((X^\top X)^\top\bigr)^{-1}
=
(X^\top X)^{-1}.
\end{align*}
Using $(ABC)^\top=C^\top B^\top A^\top$, we compute
\begin{align*}
H^\top
&=
\bigl(X(X^\top X)^{-1}X^\top\bigr)^\top \\
&=
X\bigl((X^\top X)^{-1}\bigr)^\top X^\top \\
&=
X(X^\top X)^{-1}X^\top \\
&=H.
\end{align*}
Thus $H$ is symmetric.
Next,
\begin{align*}
H^2
&=
X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\
&=
X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\
&=
X(X^\top X)^{-1}X^\top \\
&=H.
\end{align*}
Thus $H$ is idempotent.
[/step]
[step:Identify the range of $H$ with the column space of $X$]
For every $z \in \mathbb{R}^n$,
\begin{align*}
Hz = X\bigl((X^\top X)^{-1}X^\top z\bigr).
\end{align*}
The vector $(X^\top X)^{-1}X^\top z$ belongs to $\mathbb{R}^p$, so $Hz \in \operatorname{Range}(X)$. Hence
\begin{align*}
\operatorname{Range}(H) \subseteq \operatorname{Range}(X).
\end{align*}
Conversely, let $v \in \operatorname{Range}(X)$. Then there exists $b \in \mathbb{R}^p$ such that $v=Xb$. Applying $H$ gives
\begin{align*}
Hv
&=
HXb \\
&=
X(X^\top X)^{-1}X^\top Xb \\
&=
Xb \\
&=
v.
\end{align*}
Therefore every vector in $\operatorname{Range}(X)$ lies in $\operatorname{Range}(H)$, because $v=Hv$ with $v \in \mathbb{R}^n$. Hence
\begin{align*}
\operatorname{Range}(H)=\operatorname{Range}(X).
\end{align*}
[/step]
[step:Prove orthogonality of the residual to the column space]
Fix $y \in \mathbb{R}^n$, and define $\hat y:=Hy$ and $\hat\varepsilon:=y-\hat y=(I_n-H)y$, where $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix. Let $w \in \operatorname{Range}(X)$. From the previous step, $Hw=w$. Since $H^\top=H$, the matrix $I_n-H$ is symmetric. Therefore
\begin{align*}
\langle \hat\varepsilon,w\rangle
&=
\langle (I_n-H)y,w\rangle \\
&=
\langle y,(I_n-H)w\rangle \\
&=
\langle y,w-Hw\rangle \\
&=
0.
\end{align*}
Thus $\hat\varepsilon$ is orthogonal to every $w \in \operatorname{Range}(X)$.
[guided]
Fix $y \in \mathbb{R}^n$. We define the fitted vector and residual by
\begin{align*}
\hat y:=Hy,\qquad \hat\varepsilon:=y-\hat y=(I_n-H)y,
\end{align*}
where $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix. We want to prove that $\hat\varepsilon$ is orthogonal to the whole subspace $\operatorname{Range}(X)$, so take an arbitrary vector $w \in \operatorname{Range}(X)$.
The previous step showed that $H$ fixes every vector in $\operatorname{Range}(X)$, so $Hw=w$. Also, because $H^\top=H$, the matrix $I_n-H$ is symmetric. We can therefore move $I_n-H$ from the first slot of the Euclidean inner product to the second slot:
\begin{align*}
\langle \hat\varepsilon,w\rangle
&=
\langle (I_n-H)y,w\rangle \\
&=
\langle y,(I_n-H)w\rangle \\
&=
\langle y,w-Hw\rangle \\
&=
\langle y,w-w\rangle \\
&=
0.
\end{align*}
Since $w \in \operatorname{Range}(X)$ was arbitrary, $\hat\varepsilon$ is orthogonal to $\operatorname{Range}(X)$.
[/guided]
[/step]
[step:Derive the unique closest-point property]
Since $\hat y=Hy$, we have $\hat y \in \operatorname{Range}(H)=\operatorname{Range}(X)$. Let $v \in \operatorname{Range}(X)$. Then $\hat y-v \in \operatorname{Range}(X)$, while $\hat\varepsilon$ is orthogonal to $\operatorname{Range}(X)$. Hence $\hat\varepsilon$ is orthogonal to $\hat y-v$. Using $y-\hat y=\hat\varepsilon$, we obtain
\begin{align*}
|y-v|^2
&=
|y-\hat y+\hat y-v|^2 \\
&=
|\hat\varepsilon+(\hat y-v)|^2 \\
&=
|\hat\varepsilon|^2+|\hat y-v|^2 \\
&\ge |\hat\varepsilon|^2 \\
&=
|y-\hat y|^2.
\end{align*}
Thus $\hat y$ minimizes $|y-v|$ over $v \in \operatorname{Range}(X)$. Equality holds exactly when $|\hat y-v|^2=0$, which is equivalent to $v=\hat y$. Therefore $\hat y$ is the unique closest vector in $\operatorname{Range}(X)$ to $y$.
[guided]
We already know two facts: $\hat y=Hy$ lies in $\operatorname{Range}(X)$, and the residual $\hat\varepsilon=y-\hat y$ is orthogonal to every vector in $\operatorname{Range}(X)$. To prove the minimizing property, take any competing vector $v \in \operatorname{Range}(X)$.
Because both $\hat y$ and $v$ lie in $\operatorname{Range}(X)$, their difference $\hat y-v$ also lies in $\operatorname{Range}(X)$. Since $\hat\varepsilon$ is orthogonal to $\operatorname{Range}(X)$, we have
\begin{align*}
\langle \hat\varepsilon,\hat y-v\rangle=0.
\end{align*}
Now decompose the error from $y$ to $v$ as
\begin{align*}
y-v=(y-\hat y)+(\hat y-v)=\hat\varepsilon+(\hat y-v).
\end{align*}
The two summands are orthogonal, so the Pythagorean identity gives
\begin{align*}
|y-v|^2
&=
|\hat\varepsilon+(\hat y-v)|^2 \\
&=
|\hat\varepsilon|^2+|\hat y-v|^2 \\
&\ge |\hat\varepsilon|^2 \\
&=
|y-\hat y|^2.
\end{align*}
Therefore no vector $v \in \operatorname{Range}(X)$ is closer to $y$ than $\hat y$.
The equality condition is also determined by this identity. Equality holds if and only if $|\hat y-v|^2=0$, and the Euclidean norm vanishes only at the zero vector. Hence equality holds if and only if $v=\hat y$. Thus $\hat y$ is the unique closest vector in $\operatorname{Range}(X)$ to $y$.
[/guided]
[/step]
[step:Conclude that $H$ is the orthogonal projection onto $\operatorname{Range}(X)$]
The matrix $H$ satisfies $H^2=H$, $H^\top=H$, and $\operatorname{Range}(H)=\operatorname{Range}(X)$. For every $y \in \mathbb{R}^n$, the vector $Hy$ belongs to $\operatorname{Range}(X)$, the residual $y-Hy$ is orthogonal to $\operatorname{Range}(X)$, and $Hy$ is the unique closest vector in $\operatorname{Range}(X)$ to $y$. Hence $H$ is precisely the [orthogonal projection](/theorems/437) of $\mathbb{R}^n$ onto $\operatorname{Range}(X)$, completing the proof.
[/step]