Total Sum of Squares Decomposition — Statement & Proof

Total Sum of Squares Decomposition (Theorem # 1439)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We decompose each deviation $Y_i - \bar Y$ as the sum of a fitted deviation $\hat Y_i - \bar Y$ and a residual $R_i := Y_i - \hat Y_i$, square, and sum. The decomposition reduces to showing that the cross-term $\sum_i (\hat Y_i - \bar Y) R_i$ vanishes. This splits into two pieces: the residuals sum to zero (from the first normal equation, which provides an intercept column in $X$ or is imposed directly), and $\sum_i \hat Y_i R_i = 0$ because the fitted values lie in the column space of $X$ while the residual vector $\mathbf R$ is orthogonal to that column space — a fact captured by $X^\top \mathbf R = \mathbf{0}$, the [normal equations](/theorems/501). [/proofplan] [step:Set up the algebraic decomposition $Y_i - \bar Y = (\hat Y_i - \bar Y) + R_i$] Consider the linear model $Y = X\beta + \varepsilon$ with design matrix $X \in \mathbb{R}^{n \times p}$ whose first column is the all-ones vector (so the model contains an intercept). Let $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ be the [least squares estimator](/theorems/501), let $\hat{Y} = X\hat{\beta}$ be the vector of fitted values, and let $R = Y - \hat{Y}$ be the residual vector. Define the total, explained, and residual sums of squares \begin{align*} \mathrm{TSS} &:= \sum_{i=1}^n (Y_i - \bar Y)^2, \\ \mathrm{ESS} &:= \sum_{i=1}^n (\hat Y_i - \bar Y)^2, \\ \mathrm{RSS} &:= \sum_{i=1}^n R_i^2. \end{align*} For each $i$, write the identity \begin{align*} Y_i - \bar Y = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i) = (\hat Y_i - \bar Y) + R_i. \end{align*} Squaring and summing over $i$, \begin{align*} \mathrm{TSS} = \sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (\hat Y_i - \bar Y)^2 + \sum_{i=1}^n R_i^2 + 2\sum_{i=1}^n (\hat Y_i - \bar Y) R_i. \end{align*} The proof reduces to showing that the last sum — the cross-term — equals zero. [/step] [step:Split the cross-term as $\sum_i \hat Y_i R_i - \bar Y \sum_i R_i$] Distributing the sum over the difference, \begin{align*} \sum_{i=1}^n (\hat Y_i - \bar Y) R_i = \sum_{i=1}^n \hat Y_i R_i - \bar Y \sum_{i=1}^n R_i. \end{align*} We show each of the two terms vanishes separately. The next step handles $\sum_i R_i = 0$; the step after handles $\sum_i \hat Y_i R_i = 0$. [/step] [step:Show that $\sum_i R_i = 0$ using the intercept column in $X$] Since the first column of $X$ is the all-ones vector $\mathbf{1} = (1, \dots, 1)^\top \in \mathbb{R}^n$, the first coordinate of the vector $X^\top R \in \mathbb{R}^p$ is \begin{align*} (X^\top R)_1 = \mathbf{1}^\top R = \sum_{i=1}^n R_i. \end{align*} The [normal equations](/theorems/501) assert that $X^\top R = \mathbf{0}$, so in particular the first coordinate vanishes: \begin{align*} \sum_{i=1}^n R_i = 0. \end{align*} Therefore $\bar Y \sum_{i=1}^n R_i = 0$. [guided] This step is where the intercept hypothesis is consumed. If the model did **not** contain an intercept — that is, if $X$ had no all-ones column — then we could not conclude $\sum_i R_i = 0$, and the sum-of-squares decomposition would fail in general. Let us see this in detail. The residual vector $R = Y - \hat{Y}$ satisfies the [normal equations](/theorems/501) \begin{align*} X^\top R = \mathbf{0} \in \mathbb{R}^p. \end{align*} Writing $X = [\mathbf{1} \mid X_{(-1)}]$ where $\mathbf{1}$ is the column of ones and $X_{(-1)}$ holds the remaining $p-1$ columns, the first row of $X^\top$ is $\mathbf{1}^\top$. The first coordinate of $X^\top R$ is therefore \begin{align*} \mathbf{1}^\top R = \sum_{i=1}^n R_i = 0. \end{align*} Consequently $\bar Y \sum_i R_i = 0$ regardless of the value of $\bar Y$. Why does the intercept matter? Geometrically, $\bar Y$ itself is the orthogonal projection of $Y$ onto the one-dimensional subspace $\operatorname{span}(\mathbf{1})$. The sum-of-squares decomposition compares the full fit $\hat{Y}$ against this baseline mean. If $\mathbf{1}$ is not in the column space of $X$, then $\bar Y$ is not part of the projection $\hat{Y}$, and the decomposition need not hold. [/guided] [/step] [step:Show that $\sum_i \hat Y_i R_i = 0$ via $\hat{Y} \in \operatorname{Range}(X)$] Since $\hat{Y} = X\hat{\beta}$, we can write \begin{align*} \sum_{i=1}^n \hat Y_i R_i = \hat{Y}^\top R = (X\hat{\beta})^\top R = \hat{\beta}^\top (X^\top R). \end{align*} By the [normal equations](/theorems/501), $X^\top R = \mathbf{0}$, so \begin{align*} \sum_{i=1}^n \hat Y_i R_i = \hat{\beta}^\top \mathbf{0} = 0. \end{align*} [guided] This is a geometric statement dressed in algebra. The fitted vector $\hat{Y} = X\hat{\beta}$ lies in the column space $\operatorname{Range}(X) \subset \mathbb{R}^n$, while the residual $R$ is orthogonal to $\operatorname{Range}(X)$ — this is exactly what the [normal equations](/theorems/501) $X^\top R = \mathbf{0}$ say. An element of $\operatorname{Range}(X)$ paired with an element of $\operatorname{Range}(X)^\perp$ has inner product zero, which is our assertion. Algebraically: \begin{align*} \sum_{i=1}^n \hat Y_i R_i = \hat{Y}^\top R = (X\hat{\beta})^\top R = \hat{\beta}^\top (X^\top R) = \hat{\beta}^\top \mathbf{0} = 0, \end{align*} where the manipulation $(X\hat{\beta})^\top R = \hat{\beta}^\top X^\top R$ is just the associativity of matrix multiplication combined with $(AB)^\top = B^\top A^\top$. Note this step does **not** require the intercept column — it uses only that $\hat{Y}$ is in the column space of $X$. [/guided] [/step] [step:Assemble the decomposition $\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$] Combining the previous two steps, the cross-term satisfies \begin{align*} 2\sum_{i=1}^n (\hat Y_i - \bar Y) R_i = 2\sum_{i=1}^n \hat Y_i R_i - 2\bar Y \sum_{i=1}^n R_i = 0 - 0 = 0. \end{align*} Substituting back into the expansion of $\mathrm{TSS}$, \begin{align*} \mathrm{TSS} = \sum_{i=1}^n (\hat Y_i - \bar Y)^2 + \sum_{i=1}^n R_i^2 = \mathrm{ESS} + \mathrm{RSS}. \end{align*} This is the announced decomposition of the total sum of squares into an explained part and a residual part. [/step]

Explore Further

Independence of $\bar{X}$ and $S^2$ for Normal Samples Statistics MLE Invariance Statistics Characterisation of Covariance and Correlation Matrices Statistics Importance Sampling Estimator Statistics Uniform Convergence Under Weak Convergence Statistics Rao–Blackwell Theorem Statistics Wilks' Theorem Statistics Frequentist Coverage of Bayesian Credible Sets Statistics

What brings you to Androma?

Start with a route through the knowledge graph.

Total Sum of Squares Decomposition (Theorem # 1439)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Total Sum of Squares Decomposition (Theorem # 1439)

Discussion

Proof

Explore Further