[proofplan]
We decompose each deviation $Y_i - \bar Y$ as the sum of a fitted deviation $\hat Y_i - \bar Y$ and a residual $R_i := Y_i - \hat Y_i$, square, and sum. The decomposition reduces to showing that the cross-term $\sum_i (\hat Y_i - \bar Y) R_i$ vanishes. This splits into two pieces: the residuals sum to zero (from the first normal equation, which provides an intercept column in $X$ or is imposed directly), and $\sum_i \hat Y_i R_i = 0$ because the fitted values lie in the column space of $X$ while the residual vector $\mathbf R$ is orthogonal to that column space — a fact captured by $X^\top \mathbf R = \mathbf{0}$, the [normal equations](/theorems/501).
[/proofplan]
[step:Set up the algebraic decomposition $Y_i - \bar Y = (\hat Y_i - \bar Y) + R_i$]
Consider the linear model $Y = X\beta + \varepsilon$ with design matrix $X \in \mathbb{R}^{n \times p}$ whose first column is the all-ones vector (so the model contains an intercept). Let $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ be the [least squares estimator](/theorems/501), let $\hat{Y} = X\hat{\beta}$ be the vector of fitted values, and let $R = Y - \hat{Y}$ be the residual vector. Define the total, explained, and residual sums of squares
\begin{align*}
\mathrm{TSS} &:= \sum_{i=1}^n (Y_i - \bar Y)^2, \\
\mathrm{ESS} &:= \sum_{i=1}^n (\hat Y_i - \bar Y)^2, \\
\mathrm{RSS} &:= \sum_{i=1}^n R_i^2.
\end{align*}
For each $i$, write the identity
\begin{align*}
Y_i - \bar Y = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i) = (\hat Y_i - \bar Y) + R_i.
\end{align*}
Squaring and summing over $i$,
\begin{align*}
\mathrm{TSS} = \sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (\hat Y_i - \bar Y)^2 + \sum_{i=1}^n R_i^2 + 2\sum_{i=1}^n (\hat Y_i - \bar Y) R_i.
\end{align*}
The proof reduces to showing that the last sum — the cross-term — equals zero.
[/step]
[step:Split the cross-term as $\sum_i \hat Y_i R_i - \bar Y \sum_i R_i$]
Distributing the sum over the difference,
\begin{align*}
\sum_{i=1}^n (\hat Y_i - \bar Y) R_i = \sum_{i=1}^n \hat Y_i R_i - \bar Y \sum_{i=1}^n R_i.
\end{align*}
We show each of the two terms vanishes separately. The next step handles $\sum_i R_i = 0$; the step after handles $\sum_i \hat Y_i R_i = 0$.
[/step]
[step:Show that $\sum_i R_i = 0$ using the intercept column in $X$]
Since the first column of $X$ is the all-ones vector $\mathbf{1} = (1, \dots, 1)^\top \in \mathbb{R}^n$, the first coordinate of the vector $X^\top R \in \mathbb{R}^p$ is
\begin{align*}
(X^\top R)_1 = \mathbf{1}^\top R = \sum_{i=1}^n R_i.
\end{align*}
The [normal equations](/theorems/501) assert that $X^\top R = \mathbf{0}$, so in particular the first coordinate vanishes:
\begin{align*}
\sum_{i=1}^n R_i = 0.
\end{align*}
Therefore $\bar Y \sum_{i=1}^n R_i = 0$.
[guided]
This step is where the intercept hypothesis is consumed. If the model did **not** contain an intercept — that is, if $X$ had no all-ones column — then we could not conclude $\sum_i R_i = 0$, and the sum-of-squares decomposition would fail in general. Let us see this in detail.
The residual vector $R = Y - \hat{Y}$ satisfies the [normal equations](/theorems/501)
\begin{align*}
X^\top R = \mathbf{0} \in \mathbb{R}^p.
\end{align*}
Writing $X = [\mathbf{1} \mid X_{(-1)}]$ where $\mathbf{1}$ is the column of ones and $X_{(-1)}$ holds the remaining $p-1$ columns, the first row of $X^\top$ is $\mathbf{1}^\top$. The first coordinate of $X^\top R$ is therefore
\begin{align*}
\mathbf{1}^\top R = \sum_{i=1}^n R_i = 0.
\end{align*}
Consequently $\bar Y \sum_i R_i = 0$ regardless of the value of $\bar Y$.
Why does the intercept matter? Geometrically, $\bar Y$ itself is the orthogonal projection of $Y$ onto the one-dimensional subspace $\operatorname{span}(\mathbf{1})$. The sum-of-squares decomposition compares the full fit $\hat{Y}$ against this baseline mean. If $\mathbf{1}$ is not in the column space of $X$, then $\bar Y$ is not part of the projection $\hat{Y}$, and the decomposition need not hold.
[/guided]
[/step]
[step:Show that $\sum_i \hat Y_i R_i = 0$ via $\hat{Y} \in \operatorname{Range}(X)$]
Since $\hat{Y} = X\hat{\beta}$, we can write
\begin{align*}
\sum_{i=1}^n \hat Y_i R_i = \hat{Y}^\top R = (X\hat{\beta})^\top R = \hat{\beta}^\top (X^\top R).
\end{align*}
By the [normal equations](/theorems/501), $X^\top R = \mathbf{0}$, so
\begin{align*}
\sum_{i=1}^n \hat Y_i R_i = \hat{\beta}^\top \mathbf{0} = 0.
\end{align*}
[guided]
This is a geometric statement dressed in algebra. The fitted vector $\hat{Y} = X\hat{\beta}$ lies in the column space $\operatorname{Range}(X) \subset \mathbb{R}^n$, while the residual $R$ is orthogonal to $\operatorname{Range}(X)$ — this is exactly what the [normal equations](/theorems/501) $X^\top R = \mathbf{0}$ say. An element of $\operatorname{Range}(X)$ paired with an element of $\operatorname{Range}(X)^\perp$ has inner product zero, which is our assertion.
Algebraically:
\begin{align*}
\sum_{i=1}^n \hat Y_i R_i = \hat{Y}^\top R = (X\hat{\beta})^\top R = \hat{\beta}^\top (X^\top R) = \hat{\beta}^\top \mathbf{0} = 0,
\end{align*}
where the manipulation $(X\hat{\beta})^\top R = \hat{\beta}^\top X^\top R$ is just the associativity of matrix multiplication combined with $(AB)^\top = B^\top A^\top$. Note this step does **not** require the intercept column — it uses only that $\hat{Y}$ is in the column space of $X$.
[/guided]
[/step]
[step:Assemble the decomposition $\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$]
Combining the previous two steps, the cross-term satisfies
\begin{align*}
2\sum_{i=1}^n (\hat Y_i - \bar Y) R_i = 2\sum_{i=1}^n \hat Y_i R_i - 2\bar Y \sum_{i=1}^n R_i = 0 - 0 = 0.
\end{align*}
Substituting back into the expansion of $\mathrm{TSS}$,
\begin{align*}
\mathrm{TSS} = \sum_{i=1}^n (\hat Y_i - \bar Y)^2 + \sum_{i=1}^n R_i^2 = \mathrm{ESS} + \mathrm{RSS}.
\end{align*}
This is the announced decomposition of the total sum of squares into an explained part and a residual part.
[/step]