Two vectors point in entirely different directions. How different? In many problems the answer is not about the angle between them; it is about whether they share any component at all. The condition that makes two vectors completely independent in a directional sense is orthogonality: the inner product vanishes. This deceptively simple requirement turns out to organize much of linear algebra. Systems of orthogonal vectors are easy to work with because coordinates decouple. Projections onto a subspace with a known orthogonal basis can be computed without solving a system of equations. Symmetric matrices have orthogonal eigenvectors, which is the reason diagonalization is so clean in that setting. And when a basis is not merely orthogonal but orthonormal, every coefficient in a linear combination is given by a single inner product rather than a matrix equation.
Orthogonality is not just a geometric curiosity. It is the mechanism behind least-squares regression, Fourier series, the Gram–Schmidt process, the spectral theorem, and the QR decomposition. The purpose of this chapter is to build the theory from the ground up: define the inner product, extract its geometry, establish orthogonal complements and projections, prove the spectral theorem for real symmetric matrices, and show how to construct orthonormal bases algorithmically.
[example: Two Non-orthogonal Vectors That Look Perpendicular]
Consider the vectors $v = (1, 1)$ and $w = (1, -1)$ in $\mathbb{R}^2$ with the standard Euclidean inner product. Their inner product is
\begin{align*}
\langle v, w \rangle &= (1)(1) + (1)(-1) = 0.
\end{align*}
So $v$ and $w$ are orthogonal. Now replace $w$ with $w' = (2, -1)$. Their inner product is
\begin{align*}
\langle v, w' \rangle &= (1)(2) + (1)(-1) = 1 \neq 0.
\end{align*}
Even though $w'$ still "looks nearly perpendicular" to $v$, it is not orthogonal to $v$: it has a nonzero projection onto $v$. This illustrates that orthogonality is a metric condition — it depends entirely on the inner product, not on visual angle in a drawing.
[/example]
## Definition
The foundation of everything in this chapter is the inner product. Before stating the definition, recall what we need it to do: we want a notion of "angle" between vectors, and in particular a notion of "no component in common." We need the operation to be symmetric (so the angle from $v$ to $w$ equals the angle from $w$ to $v$), bilinear (so it interacts correctly with scaling and addition), and positive-definite (so the "length" of a nonzero vector is always positive). These three requirements are exactly the axioms of an inner product.
[definition: Inner Product]
Let $V$ be a real vector space. An **inner product** on $V$ is a map
\begin{align*}
\langle \cdot, \cdot \rangle: V \times V &\to \mathbb{R}
\end{align*}
satisfying the following three axioms for all $u, v, w \in V$ and $\alpha \in \mathbb{R}$:
1. **Symmetry:** $\langle u, v \rangle = \langle v, u \rangle$.
2. **Bilinearity:** $\langle \alpha u + v, w \rangle = \alpha \langle u, w \rangle + \langle v, w \rangle$ (linearity in the first argument; symmetry gives linearity in the second).
3. **Positive-definiteness:** $\langle v, v \rangle \geq 0$, with equality if and only if $v = 0$.
[/definition]
A vector space equipped with an inner product is called an **inner product space**.
The standard example is $\mathbb{R}^n$ with the dot product $\langle v, w \rangle = \sum_{i=1}^n v_i w_i$. But the definition accommodates far more: the space of continuous functions $C([0, 1])$ carries the $L^2$ inner product
\begin{align*}
\langle f, g \rangle_{L^2} &= \int_0^1 f(t)\, g(t)\, d\mathcal{L}^1(t),
\end{align*}
and this is the inner product that underlies Fourier series. The key point is that once you have an inner product, you have a complete geometric framework.
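As a rough numerical illustration of the $L^2$ inner product, the sketch below approximates the integral with a midpoint rule and checks that $\sin(2\pi t)$ and $\cos(2\pi t)$ are orthogonal on $[0, 1]$. The helper name `l2_inner` and the choice of 10,000 subintervals are assumptions made for this example only.

```python
import math

def l2_inner(f, g, n=10_000):
    """Approximate the L^2 inner product of f and g on [0, 1] by the midpoint rule."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) * g((k + 0.5) * h) for k in range(n))

def sin1(t):
    return math.sin(2 * math.pi * t)

def cos1(t):
    return math.cos(2 * math.pi * t)

inner_sc = l2_inner(sin1, cos1)    # approximately 0: the functions are orthogonal
norm_sq_s = l2_inner(sin1, sin1)   # approximately 1/2, the squared L^2 norm of sin1
```

The quadrature is only an approximation of the integral; the exact values are $0$ and $1/2$.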
Every inner product induces a norm by $\|v\| = \sqrt{\langle v, v \rangle}$. This norm satisfies the triangle inequality by the Cauchy–Schwarz inequality, which we state next. The Cauchy–Schwarz inequality is not merely a useful estimate — it is the precise statement that the cosine of the angle between two vectors is always at most one in absolute value, which is what makes it legitimate to define angles in an abstract inner product space.
[quotetheorem:432]
The Cauchy–Schwarz inequality makes angles meaningful in any inner product space. The case of a right angle, where the cosine vanishes, gets its own name.
[definition: Orthogonality]
Let $V$ be a real inner product space. Two vectors $v, w \in V$ are **orthogonal**, written $v \perp w$, if $\langle v, w \rangle = 0$.
More generally, a vector $v \in V$ is orthogonal to a subset $S \subset V$ if $\langle v, s \rangle = 0$ for every $s \in S$.
[/definition]
[remark: Zero Vector Is Orthogonal to Everything]
The zero vector $0 \in V$ satisfies $\langle 0, v \rangle = 0$ for all $v \in V$ by bilinearity. So $0$ is orthogonal to every vector in $V$, including itself. This is the only vector orthogonal to itself: if $\langle v, v \rangle = 0$ then $v = 0$ by positive-definiteness.
[/remark]
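The definition translates directly into a numerical test. The following is a minimal sketch for $\mathbb{R}^n$ with the standard inner product; the function names and the tolerance `tol` are choices made for this illustration.

```python
import math

def dot(v, w):
    """Standard inner product on R^n."""
    return sum(a * b for a, b in zip(v, w))

def norm(v):
    """Norm induced by the inner product."""
    return math.sqrt(dot(v, v))

def is_orthogonal(v, w, tol=1e-12):
    """True when <v, w> vanishes up to a small floating-point tolerance."""
    return abs(dot(v, w)) <= tol

v = (1, 1)
first = is_orthogonal(v, (1, -1))    # True:  <v, w> = 0
second = is_orthogonal(v, (2, -1))   # False: <v, w'> = 1
zero = is_orthogonal((0, 0), v)      # True:  the zero vector is orthogonal to everything
```

With exact integer inputs the tolerance is unnecessary; it matters once the entries are floating-point numbers.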
The geometric content of orthogonality is captured by a generalization of the Pythagorean theorem.
[quotetheorem:3266]
[quotetheorem:3269]
This is why orthogonal bases are so powerful: coordinates are computed by inner products, not by solving linear systems.
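To make this concrete, here is a small sketch that expands a vector in an orthogonal (not necessarily unit-length) basis of $\mathbb{R}^2$ using the coefficients $\langle v, b_i \rangle / \|b_i\|^2$. The basis, the vector, and the helper name `coords_orthogonal` are chosen for illustration. Exact rational arithmetic is used so the checks hold without rounding.

```python
from fractions import Fraction

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def coords_orthogonal(v, basis):
    """Coefficients of v in an orthogonal basis: <v, b> / <b, b> for each b."""
    return [Fraction(dot(v, b), dot(b, b)) for b in basis]

basis = [(1, 1), (1, -1)]            # orthogonal: <b1, b2> = 0
v = (3, 1)

c = coords_orthogonal(v, basis)      # [2, 1]
rebuilt = tuple(sum(ci * b[k] for ci, b in zip(c, basis)) for k in range(2))

# Pythagorean identity: ||v||^2 equals the sum of c_i^2 * ||b_i||^2.
pythagoras = sum(ci * ci * dot(b, b) for ci, b in zip(c, basis))
```

Here `rebuilt` recovers $v = 2 b_1 + 1 b_2$, and `pythagoras` equals $\|v\|^2 = 10$.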
[example: What Fails Without Orthogonality]
To see the failure mode concretely, suppose we try to use the projection formula $\langle v, b_i \rangle \, b_i$ to compute coordinates in a basis that is not orthogonal. Let $V = \mathbb{R}^2$, $b_1 = (1, 0)^\top$, $b_2 = (1, 1)^\top$ — a valid but non-orthogonal basis since $\langle b_1, b_2 \rangle = 1 \neq 0$. Take $v = (1, 1)^\top = b_2$.
The naive formula gives "coordinate" $\langle v, b_1 \rangle = 1$ along $b_1$ and $\langle v, b_2 \rangle = 2$ along $b_2$. But $1 \cdot b_1 + 2 \cdot b_2 = (1, 0) + (2, 2) = (3, 2) \neq v$. The formula collapses: taking inner products with basis vectors does not recover coordinates when the basis is not orthonormal.
The correct decomposition is $v = 0 \cdot b_1 + 1 \cdot b_2$, which must be found by solving the system $\alpha b_1 + \beta b_2 = v$. The moment the chosen basis fails to be orthogonal, every decomposition problem becomes a linear system, and the elegant structure that makes projections and expansions cheap dissolves entirely. This is the problem orthogonality solves.
[/example]
Now comes the key structural object: the orthogonal complement.
## Orthogonal Complements and Projections
Given a subspace $W$ of an inner product space $V$, one natural question is: what part of $V$ is "invisible" to $W$? More precisely, which vectors have zero inner product with everything in $W$? This is the orthogonal complement. It captures the directions that $W$ cannot see, and it turns out that $V$ splits cleanly into two orthogonal pieces.
[definition: Orthogonal Complement]
Let $V$ be a real inner product space and let $W \subset V$ be a subset. The **orthogonal complement** of $W$ is
\begin{align*}
W^\perp &:= \{v \in V : \langle v, w \rangle = 0 \text{ for all } w \in W\}.
\end{align*}
[/definition]
[explanation: Why the Orthogonal Complement Is Always a Subspace]
Even if $W$ is not a subspace, $W^\perp$ is always a subspace. If $v_1, v_2 \in W^\perp$ and $\alpha \in \mathbb{R}$, then for any $w \in W$:
\begin{align*}
\langle \alpha v_1 + v_2, w \rangle &= \alpha \langle v_1, w \rangle + \langle v_2, w \rangle = \alpha \cdot 0 + 0 = 0.
\end{align*}
So $\alpha v_1 + v_2 \in W^\perp$; moreover $0 \in W^\perp$, so $W^\perp$ is nonempty. The intersection $W \cap W^\perp$ contains at most the zero vector (and equals $\{0\}$ when $W$ is a subspace), since any $v$ in this intersection satisfies $\langle v, v \rangle = 0$, forcing $v = 0$.
[/explanation]
The main theorem about orthogonal complements in finite-dimensional spaces says that they give a clean decomposition of $V$.
[quotetheorem:241]
This decomposition is the precise sense in which two orthogonal subspaces "tile" the ambient space. It also defines the orthogonal projection, which is the most important map arising from orthogonality.
What does it mean to decompose $v = w + w^\perp$? The vector $w$ is the "shadow" of $v$ onto $W$ — it is the closest point in $W$ to $v$. The vector $w^\perp$ is the error, the part of $v$ that lies orthogonal to $W$. This closest-point property makes projections the correct tool for least-squares problems.
[definition: Orthogonal Projection]
Let $V$ be a finite-dimensional real inner product space and $W \subset V$ a subspace. The **orthogonal projection** onto $W$ is the linear map
\begin{align*}
P_W: V &\to V
\end{align*}
defined by $P_W(v) = w$, where $v = w + w^\perp$ is the unique decomposition with $w \in W$ and $w^\perp \in W^\perp$.
[/definition]
[quotetheorem:86]
[example: Projection onto a Non-Axis Line]
Let $V = \mathbb{R}^2$ with the standard inner product and let $W = \operatorname{span}(w_0)$ where $w_0 = (1, 2)^\top$. Given $v = (3, 1)^\top$, we compute $P_W(v)$ — the closest point to $v$ on the line through the origin in the direction $(1, 2)$.
The projection formula onto $\operatorname{span}(w_0)$ is
\begin{align*}
P_W(v) &= \frac{\langle v, w_0 \rangle}{\|w_0\|^2}\, w_0.
\end{align*}
Compute: $\langle v, w_0 \rangle = (3)(1) + (1)(2) = 5$ and $\|w_0\|^2 = 1^2 + 2^2 = 5$. Therefore
\begin{align*}
P_W(v) &= \frac{5}{5}(1, 2)^\top = (1, 2)^\top.
\end{align*}
The error vector is
\begin{align*}
v - P_W(v) &= (3, 1)^\top - (1, 2)^\top = (2, -1)^\top.
\end{align*}
We check orthogonality: $\langle (2, -1), (1, 2) \rangle = 2 \cdot 1 + (-1) \cdot 2 = 0$. The error is perpendicular to $W$, as required.
The distance from $v$ to the line $W$ is $\|(2, -1)\| = \sqrt{5}$. To confirm this is the minimum, take any other point $tw_0 \in W$ with $t \neq 1$:
\begin{align*}
\|v - tw_0\|^2 &= \|(3 - t, 1 - 2t)\|^2 = (3 - t)^2 + (1 - 2t)^2 \\
&= 9 - 6t + t^2 + 1 - 4t + 4t^2 = 5t^2 - 10t + 10 = 5(t - 1)^2 + 5.
\end{align*}
This is minimized uniquely at $t = 1$, giving $\|v - P_W(v)\|^2 = 5$, confirming $P_W(v) = (1, 2)^\top$ is the nearest point.
[/example]
[illustration:orthogonal-projection-r2]
When $W$ is one-dimensional, spanned by a single unit vector $u$, the projection takes a particularly clean form:
\begin{align*}
P_W(v) &= \langle v, u \rangle\, u.
\end{align*}
For a nonzero vector $w$ (not necessarily of unit length), replace $u$ by $w / \|w\|$:
\begin{align*}
P_W(v) &= \frac{\langle v, w \rangle}{\|w\|^2}\, w.
\end{align*}
This formula will be the building block of the Gram–Schmidt process.
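The one-dimensional projection formula is short enough to write out directly. The sketch below assumes the standard inner product on $\mathbb{R}^n$; the name `project_onto_line` is ours.

```python
def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def project_onto_line(v, w):
    """Orthogonal projection of v onto span(w); w must be nonzero."""
    scale = dot(v, w) / dot(w, w)
    return tuple(scale * wi for wi in w)

v, w0 = (3, 1), (1, 2)
p = project_onto_line(v, w0)                     # (1.0, 2.0)
residual = tuple(a - b for a, b in zip(v, p))    # (2.0, -1.0)
check = dot(residual, w0)                        # 0.0: the error is orthogonal to w0
```

The values reproduce the projection of $(3, 1)^\top$ onto the line spanned by $(1, 2)^\top$.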
[remark: Idempotence of Projection]
The orthogonal projection $P_W$ satisfies $P_W \circ P_W = P_W$, i.e., it is idempotent. This is geometrically obvious: projecting a vector already in $W$ onto $W$ leaves it unchanged. In matrix terms, if $P$ is the matrix of an orthogonal projection, then $P^2 = P$ and $P = P^\top$.
[/remark]
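Both matrix identities can be checked on a small case. The sketch below builds the matrix $w w^\top / \langle w, w \rangle$ of the projection onto $\operatorname{span}(w)$ for $w = (1, 2)^\top$ in exact rational arithmetic; the helper names are chosen for this illustration.

```python
from fractions import Fraction

def projection_matrix(w):
    """Matrix of the orthogonal projection onto span(w): w w^T / <w, w>."""
    n2 = sum(x * x for x in w)
    return [[Fraction(a * b, n2) for b in w] for a in w]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

P = projection_matrix((1, 2))        # [[1/5, 2/5], [2/5, 4/5]]
idempotent = matmul(P, P) == P       # P^2 = P
symmetric = transpose(P) == P        # P = P^T
```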
## Orthonormal Bases and the Gram–Schmidt Process
An orthogonal set of nonzero vectors is already linearly independent, as we noted. But there is a stronger condition: an **orthonormal** set, where each vector also has unit length. Orthonormal bases make every computation involving coordinates explicit and inexpensive. If you do not have an orthonormal basis, the Gram–Schmidt process gives you one.
Why is the lack of orthonormality inconvenient? Suppose $\{b_1, \ldots, b_n\}$ is a basis for $V$ that is not orthogonal. To express a vector $v$ in this basis, you must solve the linear system $Bc = v$ where $B$ is the matrix with columns $b_i$. With Gaussian elimination this takes $O(n^3)$ operations. But if $\{b_1, \ldots, b_n\}$ is orthonormal, the coefficients are just $c_i = \langle v, b_i \rangle$: no system, no elimination, only $n$ inner products costing $O(n^2)$ operations in total. The structural simplification is not cosmetic; it is computational.
To see concretely what breaks without orthonormality, consider $V = \mathbb{R}^2$ with the skewed basis $\{b_1, b_2\} = \{(1, 0), (1/2, 1)\}$. This is a perfectly valid basis: every vector in $\mathbb{R}^2$ can be expressed as a linear combination. But the coordinates are not given by inner products. For $v = (2, 3)$, we need $\alpha(1, 0) + \beta(1/2, 1) = (2, 3)$, which gives $\beta = 3$ and $\alpha = 2 - 3/2 = 1/2$. Now contrast with $\langle v, b_1 \rangle = 2$ and $\langle v, b_2 \rangle = 2 \cdot (1/2) + 3 \cdot 1 = 4$: these are not the coordinates $1/2$ and $3$. The inner products do not yield the coordinates because the basis is not orthonormal. This is not a minor inconvenience — it means that every coordinate computation requires solving a linear system, which is exactly what orthonormality lets you bypass.
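The comparison can be replayed in a few lines. This sketch solves the $2 \times 2$ system by Cramer's rule (a choice made here for brevity, not a general-purpose method) and contrasts the result with the naive inner products.

```python
from fractions import Fraction

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def solve_2x2(b1, b2, v):
    """Solve alpha*b1 + beta*b2 = v by Cramer's rule; b1, b2 must be a basis of R^2."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    alpha = (v[0] * b2[1] - b2[0] * v[1]) / det
    beta = (b1[0] * v[1] - v[0] * b1[1]) / det
    return alpha, beta

b1 = (Fraction(1), Fraction(0))
b2 = (Fraction(1, 2), Fraction(1))
v = (Fraction(2), Fraction(3))

true_coords = solve_2x2(b1, b2, v)     # (1/2, 3)
naive = (dot(v, b1), dot(v, b2))       # (2, 4): not the coordinates
```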
There is a further structural distinction worth making precise. An orthonormal *set* $\{u_1, \ldots, u_k\}$ — where $k$ may be less than $\dim V$ — is already a powerful object: it spans a subspace equipped with a perfect coordinate system, and projection onto that subspace costs only $k$ inner products. An orthonormal *basis* is the special case $k = \dim V$, where the subspace is all of $V$ and the coordinate formula covers every vector without remainder. The definition below pins down the condition on the set alone; the requirement that it span $V$ is what upgrades a set to a basis.
[definition: Orthonormal Set]
A set of vectors $\{u_1, \ldots, u_k\}$ in an inner product space $V$ is **orthonormal** if:
\begin{align*}
\langle u_i, u_j \rangle &= \delta_{ij} := \begin{cases} 1 & i = j, \\ 0 & i \neq j, \end{cases}
\end{align*}
where $\delta_{ij}$ is the Kronecker delta. An orthonormal set that is also a basis for $V$ is called an **orthonormal basis**.
[/definition]
[quotetheorem:3267]
The Gram–Schmidt process constructs an orthonormal basis from any basis. The idea is sequential: take each basis vector in turn, subtract off its projections onto all previously constructed orthonormal vectors, then normalize. What remains is orthogonal to all previous vectors and can safely be normalized.
[definition: Gram–Schmidt Process]
Let $\{a_1, \ldots, a_k\}$ be a linearly independent set in a real inner product space $V$. The **Gram–Schmidt process** produces an orthonormal set $\{u_1, \ldots, u_k\}$ with $\operatorname{span}(u_1, \ldots, u_j) = \operatorname{span}(a_1, \ldots, a_j)$ for each $j = 1, \ldots, k$, as follows:
Set $\tilde{u}_1 = a_1$ and $u_1 = \tilde{u}_1 / \|\tilde{u}_1\|$. For $j = 2, \ldots, k$, define
\begin{align*}
\tilde{u}_j &= a_j - \sum_{i=1}^{j-1} \langle a_j, u_i \rangle\, u_i,
\end{align*}
and set $u_j = \tilde{u}_j / \|\tilde{u}_j\|$.
[/definition]
[explanation: Why Gram–Schmidt Works]
At each step, $\tilde{u}_j$ is constructed by subtracting from $a_j$ its projections onto the already-orthonormal vectors $u_1, \ldots, u_{j-1}$. What remains is exactly the component of $a_j$ orthogonal to $\operatorname{span}(u_1, \ldots, u_{j-1})$, that is, $a_j$ minus its orthogonal projection onto that span. To verify $\langle \tilde{u}_j, u_\ell \rangle = 0$ for $\ell < j$:
\begin{align*}
\langle \tilde{u}_j, u_\ell \rangle &= \left\langle a_j - \sum_{i=1}^{j-1} \langle a_j, u_i \rangle\, u_i,\; u_\ell \right\rangle \\
&= \langle a_j, u_\ell \rangle - \sum_{i=1}^{j-1} \langle a_j, u_i \rangle \langle u_i, u_\ell \rangle \\
&= \langle a_j, u_\ell \rangle - \langle a_j, u_\ell \rangle \cdot 1 = 0,
\end{align*}
where the sum collapses to a single term because $\langle u_i, u_\ell \rangle = \delta_{i\ell}$. Since $\{a_1, \ldots, a_k\}$ is linearly independent, $a_j \notin \operatorname{span}(a_1, \ldots, a_{j-1}) = \operatorname{span}(u_1, \ldots, u_{j-1})$, which ensures $\tilde{u}_j \neq 0$ and the normalization is valid.
[/explanation]
[illustration:gram-schmidt-r3]
[example: Gram–Schmidt in $\mathbb{R}^3$]
Let $a_1 = (1, 1, 0)$, $a_2 = (1, 0, 1)$, $a_3 = (0, 1, 1)$ in $\mathbb{R}^3$ with the standard inner product. We apply Gram–Schmidt.
**Step 1.** $\tilde{u}_1 = a_1 = (1, 1, 0)$. Then $\|\tilde{u}_1\| = \sqrt{2}$, so
\begin{align*}
u_1 &= \frac{1}{\sqrt{2}}(1, 1, 0).
\end{align*}
**Step 2.** Compute $\langle a_2, u_1 \rangle = \frac{1}{\sqrt{2}}(1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0) = \frac{1}{\sqrt{2}}$. Then
\begin{align*}
\tilde{u}_2 &= a_2 - \langle a_2, u_1 \rangle\, u_1 = (1, 0, 1) - \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}(1, 1, 0) = (1, 0, 1) - \frac{1}{2}(1, 1, 0) = \left(\frac{1}{2}, -\frac{1}{2}, 1\right).
\end{align*}
We verify: $\langle \tilde{u}_2, u_1 \rangle = \frac{1}{\sqrt{2}}\left(\frac{1}{2} - \frac{1}{2} + 0\right) = 0$. Good. Now $\|\tilde{u}_2\|^2 = \frac{1}{4} + \frac{1}{4} + 1 = \frac{3}{2}$, so $\|\tilde{u}_2\| = \sqrt{3/2}$ and
\begin{align*}
u_2 &= \sqrt{\frac{2}{3}}\left(\frac{1}{2}, -\frac{1}{2}, 1\right) = \frac{1}{\sqrt{6}}(1, -1, 2).
\end{align*}
**Step 3.** Compute:
\begin{align*}
\langle a_3, u_1 \rangle &= \frac{1}{\sqrt{2}}(0 + 1 + 0) = \frac{1}{\sqrt{2}}, \\
\langle a_3, u_2 \rangle &= \frac{1}{\sqrt{6}}(0 - 1 + 2) = \frac{1}{\sqrt{6}}.
\end{align*}
Then:
\begin{align*}
\tilde{u}_3 &= a_3 - \langle a_3, u_1 \rangle\, u_1 - \langle a_3, u_2 \rangle\, u_2 \\
&= (0, 1, 1) - \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}(1, 1, 0) - \frac{1}{\sqrt{6}} \cdot \frac{1}{\sqrt{6}}(1, -1, 2) \\
&= (0, 1, 1) - \frac{1}{2}(1, 1, 0) - \frac{1}{6}(1, -1, 2) \\
&= (0, 1, 1) - \left(\frac{1}{2} + \frac{1}{6}, \frac{1}{2} - \frac{1}{6}, 0 + \frac{1}{3}\right) \\
&= \left(0 - \frac{2}{3}, 1 - \frac{1}{3}, 1 - \frac{1}{3}\right) = \left(-\frac{2}{3}, \frac{2}{3}, \frac{2}{3}\right).
\end{align*}
We have $\|\tilde{u}_3\|^2 = \frac{4}{9} + \frac{4}{9} + \frac{4}{9} = \frac{4}{3}$, so $\|\tilde{u}_3\| = \frac{2}{\sqrt{3}}$ and
\begin{align*}
u_3 &= \frac{\sqrt{3}}{2} \cdot \left(-\frac{2}{3}, \frac{2}{3}, \frac{2}{3}\right) = \frac{1}{\sqrt{3}}(-1, 1, 1).
\end{align*}
We check that $u_3$ is orthogonal to $u_1$ and $u_2$. Since $u_1 = \frac{1}{\sqrt{2}}(1, 1, 0)$ and $u_3 = \frac{1}{\sqrt{3}}(-1, 1, 1)$:
\begin{align*}
\langle u_3, u_1 \rangle &= \frac{1}{\sqrt{6}}\bigl((-1)(1) + (1)(1) + (1)(0)\bigr) = \frac{1}{\sqrt{6}}(0) = 0.
\end{align*}
Since $u_2 = \frac{1}{\sqrt{6}}(1, -1, 2)$ and $u_3 = \frac{1}{\sqrt{3}}(-1, 1, 1)$:
\begin{align*}
\langle u_3, u_2 \rangle &= \frac{1}{\sqrt{18}}\bigl((-1)(1) + (1)(-1) + (1)(2)\bigr) = \frac{1}{\sqrt{18}}(0) = 0.
\end{align*}
The resulting orthonormal basis is $\{u_1, u_2, u_3\}$.
[/example]
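The process is easy to transcribe into code. The following is a minimal sketch of the classical Gram–Schmidt recurrence exactly as defined above, assuming the standard inner product and linearly independent input. In floating-point practice a modified variant is usually preferred for stability, which this sketch does not attempt.

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def gram_schmidt(vectors):
    """Classical Gram-Schmidt; assumes the input vectors are linearly independent."""
    ortho = []
    for a in vectors:
        u = list(a)
        for q in ortho:
            c = dot(a, q)                                  # coefficient <a_j, u_i>
            u = [ui - c * qi for ui, qi in zip(u, q)]      # subtract the projection
        length = math.sqrt(dot(u, u))                      # nonzero by independence
        ortho.append([ui / length for ui in u])
    return ortho

U = gram_schmidt([(1, 1, 0), (1, 0, 1), (0, 1, 1)])
# Up to rounding, U[2] is (-1, 1, 1) / sqrt(3), matching the hand computation.
```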
Gram–Schmidt also underlies the QR decomposition: if the columns $a_1, \ldots, a_n$ of a matrix $A$ are linearly independent, applying Gram–Schmidt produces orthonormal vectors $u_1, \ldots, u_n$, which can be assembled as the columns of a matrix $Q$ (an orthogonal matrix when $A$ is square). The original matrix satisfies $A = QR$, where $R$ has entries $R_{ij} = \langle a_j, u_i \rangle$ and is upper triangular because $a_j$ is expressed in terms of $u_1, \ldots, u_j$ alone. The QR decomposition is foundational for numerical linear algebra.
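A sketch of this construction follows, recording the coefficients $\langle a_j, u_i \rangle$ as the entries of $R$ while Gram–Schmidt runs. It assumes linearly independent columns and stores matrices as lists of columns; the function name `qr_gram_schmidt` is ours, and production code would normally call a library routine based on Householder reflections instead.

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def qr_gram_schmidt(columns):
    """Return (Q_cols, R) with R[i][j] = <a_j, u_i>; columns must be independent."""
    n = len(columns)
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(columns):
        u = list(a)
        for i, q in enumerate(q_cols):
            R[i][j] = dot(a, q)
            u = [uk - R[i][j] * qk for uk, qk in zip(u, q)]
        R[j][j] = math.sqrt(dot(u, u))           # length of the orthogonal remainder
        q_cols.append([uk / R[j][j] for uk in u])
    return q_cols, R

cols = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
Q_cols, R = qr_gram_schmidt(cols)

# Rebuild each column a_j = sum over i <= j of R[i][j] * u_i and compare with the input.
rebuilt = [[sum(R[i][j] * Q_cols[i][k] for i in range(j + 1)) for k in range(3)]
           for j in range(3)]
```

Entries of `R` below the diagonal are never written, so they stay zero, which is the upper-triangular structure.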
## Orthogonal Matrices
Why isolate matrices with orthonormal columns as a class deserving their own name? Because the condition $Q^\top Q = I_n$ does something remarkable: it turns abstract orthonormality into a single algebraic equation. Any matrix satisfying this equation automatically preserves inner products, preserves lengths, and has its inverse handed to you for free — $Q^{-1} = Q^\top$ costs nothing beyond a transpose. These matrices are precisely the linear maps that represent rigid motions in $\mathbb{R}^n$: rotations and reflections leave lengths and angles intact, and $Q^\top Q = I_n$ is exactly the algebraic encoding of that rigidity. Without a dedicated name, one would have to unpack this entire constellation of properties each time, obscuring the structural unity that a single equation makes visible.
The name "orthogonal matrix" is slightly misleading: an orthogonal matrix has columns that are not merely orthogonal but orthonormal. The terminology is historical and universal, so we accept it.
[definition: Orthogonal Matrix]
A matrix $Q \in \mathbb{R}^{n \times n}$ is **orthogonal** if its columns form an orthonormal set.
[/definition]
Equivalently, $Q$ is orthogonal if and only if
\begin{align*}
Q^\top Q &= I_n.
\end{align*}
The equation $Q^\top Q = I_n$ says that $Q^\top$ is a left inverse of $Q$. For a square matrix a left inverse is also a right inverse, so $Q$ is invertible with $Q^{-1} = Q^\top$ and $QQ^\top = I_n$. Hence the rows of $Q$ also form an orthonormal set.
[quotetheorem:3268]
[example: Rotation Matrices in $\mathbb{R}^2$]
For any $\theta \in \mathbb{R}$, the matrix
\begin{align*}
Q_\theta &= \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\end{align*}
is orthogonal. Its columns are $(\cos\theta, \sin\theta)^\top$ and $(-\sin\theta, \cos\theta)^\top$. Their norms are both $1$, and their inner product is $\cos\theta(-\sin\theta) + \sin\theta\cos\theta = 0$. So the columns form an orthonormal set.
We verify $Q_\theta^\top Q_\theta = I_2$:
\begin{align*}
Q_\theta^\top Q_\theta &= \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} \cos^2\theta + \sin^2\theta & 0 \\ 0 & \sin^2\theta + \cos^2\theta \end{pmatrix} = I_2.
\end{align*}
Moreover $\det Q_\theta = \cos^2\theta + \sin^2\theta = 1$, confirming this is a rotation.
For the vector $v = (1, 0)^\top$, the image is $Q_\theta v = (\cos\theta, \sin\theta)^\top$, which is the vector at angle $\theta$ from the $x$-axis. The length is preserved: $\|Q_\theta v\| = 1 = \|v\|$.
[/example]
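These properties can be checked numerically for a particular angle. The sketch below uses $\theta = 0.7$ and the vector $(3, 4)^\top$, both arbitrary choices for illustration; the equalities hold only up to floating-point rounding.

```python
import math

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def apply(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

Q = rotation(0.7)
gram = matmul(transpose(Q), Q)                  # approximately the 2x2 identity
v = [3.0, 4.0]
Qv = apply(Q, v)
lengths = (math.hypot(*v), math.hypot(*Qv))     # both approximately 5
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]     # approximately 1
```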
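The $n \times n$ orthogonal matrices form a group under matrix multiplication, the orthogonal group $O(n)$, which is a compact Lie group. Every orthogonal matrix has $\det Q = \pm 1$, since $\det(Q^\top Q) = (\det Q)^2 = 1$. The connected component of $O(n)$ containing the identity is the special orthogonal group $SO(n)$, consisting of rotations ($\det = 1$). The orthogonal matrices with $\det = -1$ form the other connected component and correspond to reflections and improper rotations.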
## The Spectral Theorem
The most powerful consequence of orthogonality in linear algebra is the spectral theorem for symmetric matrices. It says that every real symmetric matrix is diagonalizable by an orthogonal matrix; that is, it admits a complete orthonormal basis of real eigenvectors. This is not true for general matrices. A $2 \times 2$ rotation by $90^\circ$ has no real eigenvalues at all. A Jordan block of size at least $2$ is not diagonalizable. Symmetry is the key.
Why should symmetry force orthogonal eigenvectors? The argument is elegant: if $Av = \lambda v$ and $Aw = \mu w$ with $\lambda \neq \mu$ and $A = A^\top$, then
\begin{align*}
\lambda \langle v, w \rangle &= \langle Av, w \rangle = \langle v, A^\top w \rangle = \langle v, Aw \rangle = \mu \langle v, w \rangle.
\end{align*}
Since $\lambda \neq \mu$, we conclude $\langle v, w \rangle = 0$. Eigenvectors for different eigenvalues of a symmetric matrix are automatically orthogonal.
Why are all eigenvalues real? Suppose $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ with eigenvector $z \in \mathbb{C}^n$, $z \neq 0$. Extend the inner product to the Hermitian inner product $\langle z, w \rangle = \bar{z}^\top w$, which is conjugate-linear in the first argument and linear in the second. Because $A$ is real, $\overline{Az}^\top z = \bar{z}^\top A^\top z$, so
\begin{align*}
\bar{\lambda} \langle z, z \rangle &= \langle Az, z \rangle = \langle z, A^\top z \rangle = \langle z, Az \rangle = \lambda \langle z, z \rangle.
\end{align*}
Since $\langle z, z \rangle = \|z\|^2 > 0$, we get $\lambda = \bar{\lambda}$, so $\lambda \in \mathbb{R}$.
The existence of a complete orthonormal eigenbasis follows by induction on the dimension $n$. For $n = 1$ the result is immediate. For the inductive step: the characteristic polynomial of $A$ has a root $\lambda_1 \in \mathbb{C}$ by the fundamental theorem of algebra, and $\lambda_1$ is real by the argument above. The real matrix $A - \lambda_1 I$ is then singular, so it has a nonzero real null vector, which we normalize to a real eigenvector $u_1$ with $\|u_1\| = 1$. Let $W = u_1^\perp$. Because $A$ is symmetric, $W$ is $A$-invariant: if $w \perp u_1$ then $\langle Aw, u_1 \rangle = \langle w, Au_1 \rangle = \lambda_1 \langle w, u_1 \rangle = 0$, so $Aw \in W$. The restriction $A|_W$ is a self-adjoint operator on the $(n-1)$-dimensional space $W$, represented by a real symmetric matrix in any orthonormal basis of $W$, so by induction it has an orthonormal eigenbasis $\{u_2, \ldots, u_n\}$ for $W$. Together $\{u_1, u_2, \ldots, u_n\}$ is the required orthonormal eigenbasis for $\mathbb{R}^n$.
[quotetheorem:925]
The decomposition $A = Q\Lambda Q^\top$ is the **eigendecomposition** or **spectral decomposition** of $A$. It says that in the coordinate system given by the eigenvectors, $A$ acts simply as stretching along each axis by the corresponding eigenvalue.
[explanation: The Spectral Decomposition as a Sum of Rank-One Projections]
The spectral decomposition can be written as a sum:
\begin{align*}
A &= Q\Lambda Q^\top = \sum_{i=1}^n \lambda_i\, u_i u_i^\top.
\end{align*}
Each term $u_i u_i^\top = u_i \otimes u_i$ is the orthogonal projection onto $\operatorname{span}(u_i)$, since for any $v \in \mathbb{R}^n$:
\begin{align*}
(u_i u_i^\top) v &= u_i (u_i^\top v) = \langle v, u_i \rangle\, u_i = P_{\operatorname{span}(u_i)}(v).
\end{align*}
So $A$ is a weighted sum of orthogonal projections, each weighted by the corresponding eigenvalue. This decomposition is the source of the name "spectral theorem": the set $\{\lambda_1, \ldots, \lambda_n\}$ is the **spectrum** of $A$.
[/explanation]
[example: Spectral Decomposition of a $2 \times 2$ Symmetric Matrix]
Let
\begin{align*}
A &= \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.
\end{align*}
This matrix is symmetric. Its characteristic polynomial is
\begin{align*}
\det(A - \lambda I) &= (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3),
\end{align*}
so the eigenvalues are $\lambda_1 = 1$ and $\lambda_2 = 3$, both real.
For $\lambda_1 = 1$: $(A - I)v = 0$ gives the system with matrix $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, so $v_1 + v_2 = 0$. A solution is $(1, -1)^\top$, with unit vector $u_1 = \frac{1}{\sqrt{2}}(1, -1)^\top$.
For $\lambda_2 = 3$: $(A - 3I)v = 0$ gives the matrix $\begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$, so $v_1 = v_2$. A solution is $(1, 1)^\top$, with unit vector $u_2 = \frac{1}{\sqrt{2}}(1, 1)^\top$.
We verify orthogonality: $\langle u_1, u_2 \rangle = \frac{1}{2}(1 \cdot 1 + (-1) \cdot 1) = 0$. The orthogonal matrix is
\begin{align*}
Q &= \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}, \quad \Lambda = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}.
\end{align*}
We compute $Q\Lambda Q^\top$ step by step. First,
\begin{align*}
Q\Lambda &= \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 3 \\ -1 & 3 \end{pmatrix}.
\end{align*}
Then,
\begin{align*}
Q\Lambda Q^\top &= \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 3 \\ -1 & 3 \end{pmatrix} \cdot \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} 1\cdot1 + 3\cdot1 & 1\cdot(-1) + 3\cdot1 \\ (-1)\cdot1 + 3\cdot1 & (-1)\cdot(-1) + 3\cdot1 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} 4 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = A.
\end{align*}
The spectral decomposition is:
\begin{align*}
A &= 1 \cdot u_1 u_1^\top + 3 \cdot u_2 u_2^\top = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} + \frac{3}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.
\end{align*}
[/example]
The spectral theorem has far-reaching consequences. For a symmetric positive-definite matrix ($\lambda_i > 0$ for all $i$), one can define $A^{1/2} = Q\Lambda^{1/2}Q^\top$ and $A^{-1} = Q\Lambda^{-1}Q^\top$ with ease. With the eigenvalues ordered as $\lambda_1 \leq \cdots \leq \lambda_n$, the condition number $\kappa(A) = \lambda_n / \lambda_1$ (the ratio of largest to smallest eigenvalue) controls how sensitive the solution of $Ax = b$ is to perturbations. The spectral theorem is also the foundation of the singular value decomposition, which extends the spirit of eigendecomposition to arbitrary matrices, including non-square ones.
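As an illustration, the sketch below applies a scalar function to the eigenvalues of the $2 \times 2$ example $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, whose eigenpairs were computed by hand above, and reassembles the matrix as a weighted sum of rank-one projections. The helper name `spectral_function` is an assumption of this example, and the eigenpairs are hard-coded rather than computed.

```python
import math

def spectral_function(eigenpairs, f):
    """Build sum_i f(lambda_i) u_i u_i^T from (eigenvalue, unit eigenvector) pairs."""
    n = len(eigenpairs[0][1])
    M = [[0.0] * n for _ in range(n)]
    for lam, u in eigenpairs:
        for i in range(n):
            for j in range(n):
                M[i][j] += f(lam) * u[i] * u[j]
    return M

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

s = 1 / math.sqrt(2)
eigenpairs = [(1.0, (s, -s)), (3.0, (s, s))]     # eigenpairs of [[2, 1], [1, 2]]

A = spectral_function(eigenpairs, lambda x: x)            # approximately [[2, 1], [1, 2]]
root = spectral_function(eigenpairs, math.sqrt)           # A^(1/2)
inverse = spectral_function(eigenpairs, lambda x: 1 / x)  # A^(-1)
kappa = max(lam for lam, _ in eigenpairs) / min(lam for lam, _ in eigenpairs)  # 3.0
```

Multiplying `root` by itself should return $A$, and multiplying `A` by `inverse` should return the identity, both up to rounding.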
## Least Squares and Applications
Orthogonality is the key to solving overdetermined systems. Given a matrix $A \in \mathbb{R}^{m \times n}$ with $m > n$ and a vector $b \in \mathbb{R}^m$, the system $Ax = b$ typically has no solution: $b$ does not lie in the column space $\operatorname{Range}(A)$. The least-squares problem asks for the best approximate solution, the vector $x^* \in \mathbb{R}^n$ that minimizes $\|Ax - b\|$.
The answer is provided by orthogonal projection. The minimizer $Ax^*$ is the orthogonal projection of $b$ onto $\operatorname{Range}(A)$:
\begin{align*}
Ax^* &= P_{\operatorname{Range}(A)}\, b.
\end{align*}
For this to hold, the residual $b - Ax^*$ must be orthogonal to every vector in $\operatorname{Range}(A)$, i.e., $A^\top(b - Ax^*) = 0$. This gives the **normal equations**:
[quotetheorem:501]
The matrix $(A^\top A)^{-1} A^\top$ appearing in the formula for $x^*$ is called the **Moore–Penrose pseudoinverse** of $A$ (when $A$ has full column rank). It generalizes matrix inversion to the overdetermined setting.
[example: Linear Regression via Least Squares]
Suppose we observe three data points $(t, y)$: $(0, 1)$, $(1, 2)$, $(2, 2)$, and we want to fit a line $y = \alpha + \beta t$. This asks for $\alpha, \beta \in \mathbb{R}$ minimizing
\begin{align*}
\|Ax - b\|^2, \quad A = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}, \quad x = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix}.
\end{align*}
The matrix $A^\top A$ is:
\begin{align*}
A^\top A &= \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}.
\end{align*}
And $A^\top b$ is:
\begin{align*}
A^\top b &= \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix} = \begin{pmatrix} 5 \\ 6 \end{pmatrix}.
\end{align*}
The normal equations are $\begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} 5 \\ 6 \end{pmatrix}$.
From the first equation, $3\alpha + 3\beta = 5$, so $\alpha + \beta = 5/3$. From the second, $3\alpha + 5\beta = 6$. Subtracting $3\alpha + 3\beta = 5$ gives $2\beta = 1$, so $\beta = 1/2$, and $\alpha = 5/3 - 1/2 = 7/6$.
The best-fit line is $y = 7/6 + t/2$. The residual vector is
\begin{align*}
b - Ax^* &= \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix} - \begin{pmatrix} 7/6 \\ 7/6 + 1/2 \\ 7/6 + 1 \end{pmatrix} = \begin{pmatrix} 1 - 7/6 \\ 2 - 5/3 \\ 2 - 13/6 \end{pmatrix} = \begin{pmatrix} -1/6 \\ 1/3 \\ -1/6 \end{pmatrix}.
\end{align*}
We verify orthogonality to the column space: $A^\top(b - Ax^*) = \begin{pmatrix} -1/6 + 1/3 - 1/6 \\ 0 + 1/3 - 1/3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. The residual is indeed orthogonal to both columns of $A$.
[/example]
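For a straight-line fit the normal equations are a $2 \times 2$ system that can be solved in closed form. The sketch below does this in exact rational arithmetic for integer data; the function name `normal_equations_line` is ours, and the closed form is specific to the model $y = \alpha + \beta t$.

```python
from fractions import Fraction

def normal_equations_line(ts, ys):
    """Fit y = alpha + beta*t by solving the 2x2 normal equations exactly (integer data)."""
    m = len(ts)
    s_t = sum(ts)
    s_tt = sum(t * t for t in ts)
    s_y = sum(ys)
    s_ty = sum(t * y for t, y in zip(ts, ys))
    # A^T A = [[m, s_t], [s_t, s_tt]] and A^T b = (s_y, s_ty).
    det = m * s_tt - s_t * s_t           # nonzero unless all t-values coincide
    alpha = Fraction(s_tt * s_y - s_t * s_ty, det)
    beta = Fraction(m * s_ty - s_t * s_y, det)
    return alpha, beta

ts, ys = [0, 1, 2], [1, 2, 2]
alpha, beta = normal_equations_line(ts, ys)                  # (7/6, 1/2)
residual = [y - (alpha + beta * t) for t, y in zip(ts, ys)]  # [-1/6, 1/3, -1/6]

# Orthogonality of the residual to both columns of A (ones and t-values).
checks = (sum(residual), sum(t * r for t, r in zip(ts, residual)))  # (0, 0)
```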
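The normal equations can be numerically unstable when $A^\top A$ is ill-conditioned, and forming $A^\top A$ squares the condition number of $A$. In practice, the QR decomposition is preferred: for $A$ of full column rank, write $A = QR$ with $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns and $R \in \mathbb{R}^{n \times n}$ upper triangular and invertible. Since $A^\top A = R^\top Q^\top Q R = R^\top R$, the normal equations become $R^\top R x^* = R^\top Q^\top b$, and cancelling the invertible factor $R^\top$ leaves $Rx^* = Q^\top b$, which is solved by back-substitution without forming $A^\top A$.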
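A sketch of the QR route for the same three data points $(0, 1)$, $(1, 2)$, $(2, 2)$ is given below. It builds the thin factorization by classical Gram–Schmidt, which keeps the example short but is an assumption of this illustration; numerical libraries typically use Householder reflections. The matrix $A$ is stored as a list of columns, and full column rank is assumed.

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def thin_qr(columns):
    """Thin QR by classical Gram-Schmidt; columns must be linearly independent."""
    n = len(columns)
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(columns):
        u = list(a)
        for i, q in enumerate(q_cols):
            R[i][j] = dot(a, q)
            u = [uk - R[i][j] * qk for uk, qk in zip(u, q)]
        R[j][j] = math.sqrt(dot(u, u))
        q_cols.append([uk / R[j][j] for uk in u])
    return q_cols, R

def back_substitute(R, y):
    """Solve R x = y for upper-triangular R with nonzero diagonal."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

cols = [(1, 1, 1), (0, 1, 2)]        # columns of A for the line fit
b = (1, 2, 2)

Q_cols, R = thin_qr(cols)
qtb = [dot(q, b) for q in Q_cols]    # Q^T b
x_star = back_substitute(R, qtb)     # approximately [7/6, 1/2]
```

The result agrees, up to rounding, with the solution $\alpha = 7/6$, $\beta = 1/2$ obtained from the normal equations.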
[remark: Orthogonality in Statistics]
In statistics, the least-squares estimator $\hat{\beta} = (A^\top A)^{-1} A^\top b$ is the ordinary least-squares (OLS) regression estimator. The Gauss–Markov theorem states that when the errors in the linear model have zero mean, equal variance, and are uncorrelated, this estimator is the best linear unbiased estimator (BLUE), where "best" means minimum variance among all linear unbiased estimators. The orthogonality of the residual $b - A\hat{\beta}$ to the column space of $A$ is what makes the geometry of regression transparent.
[/remark]
## References
Strang, G., *Linear Algebra and Its Applications* (2006).
Horn, R. A. and Johnson, C. R., *Matrix Analysis* (2012).
Trefethen, L. N. and Bau, D., *Numerical Linear Algebra* (1997).
Axler, S., *Linear Algebra Done Right* (2015).