Ordinary Least Squares Projection Theorem — Statement & Proof

Ordinary Least Squares Projection Theorem (Theorem # 4431)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We first verify that the formula for $H$ is well-defined by proving that $X^\top X$ is invertible. We then compute directly that $H$ is symmetric and idempotent, and identify its range with $\operatorname{Range}(X)$. Finally, instead of citing an abstract [projection theorem](/theorems/1985), we prove the closest-point property directly by decomposing $y-v$ into an orthogonal sum. [/proofplan] [step:Show that $X^\top X$ is invertible] Since $X$ has full column rank, the [linear map](/page/Linear%20Map) \begin{align*} X:\mathbb{R}^p &\to \mathbb{R}^n \\ a &\mapsto Xa \end{align*} is injective. Let $a \in \mathbb{R}^p$ satisfy $X^\top Xa=0$. Taking the Euclidean inner product with $a$ gives \begin{align*} 0 = a^\top X^\top Xa = (Xa)^\top (Xa) = |Xa|^2. \end{align*} Hence $Xa=0$, and injectivity of $X$ gives $a=0$. Therefore $\ker(X^\top X)=\{0\}$. Since $X^\top X \in \mathbb{R}^{p \times p}$ is a square matrix with trivial kernel, it is invertible. Thus $H=X(X^\top X)^{-1}X^\top$ is well-defined. [/step] [step:Verify that $H$ is symmetric and idempotent] The matrix $X^\top X$ is symmetric, so its inverse is also symmetric because \begin{align*} \bigl((X^\top X)^{-1}\bigr)^\top = \bigl((X^\top X)^\top\bigr)^{-1} = (X^\top X)^{-1}. \end{align*} Using $(ABC)^\top=C^\top B^\top A^\top$, we compute \begin{align*} H^\top &= \bigl(X(X^\top X)^{-1}X^\top\bigr)^\top \\ &= X\bigl((X^\top X)^{-1}\bigr)^\top X^\top \\ &= X(X^\top X)^{-1}X^\top \\ &=H. \end{align*} Thus $H$ is symmetric. Next, \begin{align*} H^2 &= X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top \\ &= X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top \\ &= X(X^\top X)^{-1}X^\top \\ &=H. \end{align*} Thus $H$ is idempotent. [/step] [step:Identify the range of $H$ with the column space of $X$] For every $z \in \mathbb{R}^n$, \begin{align*} Hz = X\bigl((X^\top X)^{-1}X^\top z\bigr). \end{align*} The vector $(X^\top X)^{-1}X^\top z$ belongs to $\mathbb{R}^p$, so $Hz \in \operatorname{Range}(X)$. Hence \begin{align*} \operatorname{Range}(H) \subseteq \operatorname{Range}(X). \end{align*} Conversely, let $v \in \operatorname{Range}(X)$. Then there exists $b \in \mathbb{R}^p$ such that $v=Xb$. Applying $H$ gives \begin{align*} Hv &= HXb \\ &= X(X^\top X)^{-1}X^\top Xb \\ &= Xb \\ &= v. \end{align*} Therefore every vector in $\operatorname{Range}(X)$ lies in $\operatorname{Range}(H)$, because $v=Hv$ with $v \in \mathbb{R}^n$. Hence \begin{align*} \operatorname{Range}(H)=\operatorname{Range}(X). \end{align*} [/step] [step:Prove orthogonality of the residual to the column space] Fix $y \in \mathbb{R}^n$, and define $\hat y:=Hy$ and $\hat\varepsilon:=y-\hat y=(I_n-H)y$, where $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix. Let $w \in \operatorname{Range}(X)$. From the previous step, $Hw=w$. Since $H^\top=H$, the matrix $I_n-H$ is symmetric. Therefore \begin{align*} \langle \hat\varepsilon,w\rangle &= \langle (I_n-H)y,w\rangle \\ &= \langle y,(I_n-H)w\rangle \\ &= \langle y,w-Hw\rangle \\ &= 0. \end{align*} Thus $\hat\varepsilon$ is orthogonal to every $w \in \operatorname{Range}(X)$. [guided] Fix $y \in \mathbb{R}^n$. We define the fitted vector and residual by \begin{align*} \hat y:=Hy,\qquad \hat\varepsilon:=y-\hat y=(I_n-H)y, \end{align*} where $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix. We want to prove that $\hat\varepsilon$ is orthogonal to the whole subspace $\operatorname{Range}(X)$, so take an arbitrary vector $w \in \operatorname{Range}(X)$. The previous step showed that $H$ fixes every vector in $\operatorname{Range}(X)$, so $Hw=w$. Also, because $H^\top=H$, the matrix $I_n-H$ is symmetric. We can therefore move $I_n-H$ from the first slot of the Euclidean inner product to the second slot: \begin{align*} \langle \hat\varepsilon,w\rangle &= \langle (I_n-H)y,w\rangle \\ &= \langle y,(I_n-H)w\rangle \\ &= \langle y,w-Hw\rangle \\ &= \langle y,w-w\rangle \\ &= 0. \end{align*} Since $w \in \operatorname{Range}(X)$ was arbitrary, $\hat\varepsilon$ is orthogonal to $\operatorname{Range}(X)$. [/guided] [/step] [step:Derive the unique closest-point property] Since $\hat y=Hy$, we have $\hat y \in \operatorname{Range}(H)=\operatorname{Range}(X)$. Let $v \in \operatorname{Range}(X)$. Then $\hat y-v \in \operatorname{Range}(X)$, while $\hat\varepsilon$ is orthogonal to $\operatorname{Range}(X)$. Hence $\hat\varepsilon$ is orthogonal to $\hat y-v$. Using $y-\hat y=\hat\varepsilon$, we obtain \begin{align*} |y-v|^2 &= |y-\hat y+\hat y-v|^2 \\ &= |\hat\varepsilon+(\hat y-v)|^2 \\ &= |\hat\varepsilon|^2+|\hat y-v|^2 \\ &\ge |\hat\varepsilon|^2 \\ &= |y-\hat y|^2. \end{align*} Thus $\hat y$ minimizes $|y-v|$ over $v \in \operatorname{Range}(X)$. Equality holds exactly when $|\hat y-v|^2=0$, which is equivalent to $v=\hat y$. Therefore $\hat y$ is the unique closest vector in $\operatorname{Range}(X)$ to $y$. [guided] We already know two facts: $\hat y=Hy$ lies in $\operatorname{Range}(X)$, and the residual $\hat\varepsilon=y-\hat y$ is orthogonal to every vector in $\operatorname{Range}(X)$. To prove the minimizing property, take any competing vector $v \in \operatorname{Range}(X)$. Because both $\hat y$ and $v$ lie in $\operatorname{Range}(X)$, their difference $\hat y-v$ also lies in $\operatorname{Range}(X)$. Since $\hat\varepsilon$ is orthogonal to $\operatorname{Range}(X)$, we have \begin{align*} \langle \hat\varepsilon,\hat y-v\rangle=0. \end{align*} Now decompose the error from $y$ to $v$ as \begin{align*} y-v=(y-\hat y)+(\hat y-v)=\hat\varepsilon+(\hat y-v). \end{align*} The two summands are orthogonal, so the Pythagorean identity gives \begin{align*} |y-v|^2 &= |\hat\varepsilon+(\hat y-v)|^2 \\ &= |\hat\varepsilon|^2+|\hat y-v|^2 \\ &\ge |\hat\varepsilon|^2 \\ &= |y-\hat y|^2. \end{align*} Therefore no vector $v \in \operatorname{Range}(X)$ is closer to $y$ than $\hat y$. The equality condition is also determined by this identity. Equality holds if and only if $|\hat y-v|^2=0$, and the Euclidean norm vanishes only at the zero vector. Hence equality holds if and only if $v=\hat y$. Thus $\hat y$ is the unique closest vector in $\operatorname{Range}(X)$ to $y$. [/guided] [/step] [step:Conclude that $H$ is the orthogonal projection onto $\operatorname{Range}(X)$] The matrix $H$ satisfies $H^2=H$, $H^\top=H$, and $\operatorname{Range}(H)=\operatorname{Range}(X)$. For every $y \in \mathbb{R}^n$, the vector $Hy$ belongs to $\operatorname{Range}(X)$, the residual $y-Hy$ is orthogonal to $\operatorname{Range}(X)$, and $Hy$ is the unique closest vector in $\operatorname{Range}(X)$ to $y$. Hence $H$ is precisely the [orthogonal projection](/theorems/437) of $\mathbb{R}^n$ onto $\operatorname{Range}(X)$, completing the proof. [/step]

Explore Further

Hilbert Projection Theorem for Closed Linear Subspaces of $L^2$ Probability & Statistics Strong Markov Property of Brownian Motion Brownian Motion Radon-Nikodym Theorem (Probabilistic) Martingale Theory Multiplication Rule for Conditional Probability Probability Theory Doob-Dynkin Representation Probability & Statistics Density of a Transformed Variable Probability Theory Weak Stirling Probability Theory Moments from the MGF Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.

Ordinary Least Squares Projection Theorem (Theorem # 4431)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Ordinary Least Squares Projection Theorem (Theorem # 4431)

Discussion

Proof

Explore Further