Suppose you want to solve the system of linear equations
\begin{align*}
2x_1 + 3x_2 - x_3 &= 7 \\
x_1 - x_2 + 4x_3 &= 2 \\
5x_1 + 2x_2 + x_3 &= 11.
\end{align*}
You could try substituting and eliminating variables one at a time, but after two eliminations you've lost track of which operations you applied and in what order. Now imagine you have a hundred variables, or a thousand, or that the right-hand side changes and you need to resolve the system. The bookkeeping becomes impossible. What you need is a way to represent, manipulate, and reason about a whole collection of linear equations as a single mathematical object. That object is a matrix.
But matrices are not merely a notation trick. They encode linear maps between vector spaces: the coefficient matrix of the system above describes a linear transformation $T: \mathbb{R}^3 \to \mathbb{R}^3$, and solving the system is equivalent to inverting $T$. Once you see matrices as representations of linear maps, the entire subject opens up — row operations become composition with elementary transformations, the determinant measures how $T$ distorts volume, eigenvalues reveal directions the map stretches without rotating, and matrix decompositions expose the geometric core of $T$. This chapter builds that picture from the ground up.
[example: Encoding a Linear System]
Consider the system above. Write the coefficients as a rectangular array and the right-hand side as a column:
\begin{align*}
A &= \begin{pmatrix} 2 & 3 & -1 \\ 1 & -1 & 4 \\ 5 & 2 & 1 \end{pmatrix}, \qquad
b = \begin{pmatrix} 7 \\ 2 \\ 11 \end{pmatrix}.
\end{align*}
Define the matrix-vector product so that entry $i$ of $Ax$ is $\sum_{j=1}^{3} A_{ij} x_j$. Then the system $2x_1 + 3x_2 - x_3 = 7$, etc., becomes the single equation $Ax = b$. The coefficient matrix $A$ captures all the information in the system; the right-hand side $b$ is separate. Later, if we swap the rows of $A$ and perform the same swap on $b$, the solution does not change — we have just reordered the equations. Row operations on $A$ correspond to multiplying on the left by elementary matrices.
[/example]
## Definition
The right level of generality for the definition is a field $\mathbb{F}$, which for us will almost always be $\mathbb{R}$ or $\mathbb{C}$. A matrix is simply a finite rectangular array of elements of $\mathbb{F}$, but the definition must record the dimensions precisely so that operations are well-defined.
[definition: Matrix]
Let $\mathbb{F}$ be a field and let $m, n \in \mathbb{N}$. An $m \times n$ matrix over $\mathbb{F}$ is a function
\begin{align*}
A: \{1, \dots, m\} \times \{1, \dots, n\} &\to \mathbb{F} \\
(i, j) &\mapsto A_{ij},
\end{align*}
where $A_{ij}$ is called the $(i,j)$-entry of $A$. We display $A$ as a rectangular array with $m$ rows and $n$ columns:
\begin{align*}
A &= \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{pmatrix}.
\end{align*}
The set of all $m \times n$ matrices over $\mathbb{F}$ is denoted $\mathbb{F}^{m \times n}$.
[/definition]
The index convention matters: $A_{ij}$ always means row $i$, column $j$. This is consistent with the notation for the Jacobian matrix $Jf_a \in \mathbb{R}^{n \times m}$ where $(Jf_a)_{ij} = \partial_j f_i(a)$ — the first index labels the output component (row), the second the input component (column). Never use superscripts for matrix entries: write $A_{ij}$, not $A^i_j$ or $A^{ij}$.
[remark: Square Matrices]
When $m = n$, we call $A$ a square matrix of size $n$, or an $n \times n$ matrix. Square matrices are the setting for most of the deeper theory: determinants, eigenvalues, and diagonalisation all require $m = n$. For $m \neq n$, the map $x \mapsto Ax$ cannot be invertible simply for dimensional reasons.
[/remark]
The set $\mathbb{F}^{m \times n}$ carries the structure of a vector space. Scalar multiplication and addition are defined entry-wise: $(cA)_{ij} = c A_{ij}$ and $(A + B)_{ij} = A_{ij} + B_{ij}$ for $A, B \in \mathbb{F}^{m \times n}$ and $c \in \mathbb{F}$. Since entries are in $\mathbb{F}$ and all operations are inherited from $\mathbb{F}$, all the vector space axioms hold automatically. The zero matrix $0 \in \mathbb{F}^{m \times n}$ has all entries zero, and $\dim \mathbb{F}^{m \times n} = mn$.
The deeper structure, which distinguishes matrices from general vector spaces, is the product. When $A \in \mathbb{F}^{m \times p}$ and $B \in \mathbb{F}^{p \times n}$, their product $AB \in \mathbb{F}^{m \times n}$ is defined by
\begin{align*}
(AB)_{ij} &= \sum_{k=1}^{p} A_{ik} B_{kj}.
\end{align*}
This formula encodes composition of linear maps: if $A$ represents a map $\mathbb{F}^p \to \mathbb{F}^m$ and $B$ represents a map $\mathbb{F}^n \to \mathbb{F}^p$, then $AB$ represents their composition $\mathbb{F}^n \to \mathbb{F}^m$. The dimension constraint — the number of columns of $A$ must equal the number of rows of $B$ — is exactly the condition that the maps compose.
The problem that the matrix product solves is this: given two linear maps applied in sequence, how do you write the composed map as a single matrix? If you apply $B$ first (a map $\mathbb{F}^n \to \mathbb{F}^p$) and then $A$ (a map $\mathbb{F}^p \to \mathbb{F}^m$), you want a single $m \times n$ matrix $C$ such that $C x = A(Bx)$ for all $x \in \mathbb{F}^n$. The formula $(AB)_{ij} = \sum_k A_{ik} B_{kj}$ is the unique answer: it computes the $i$-th output component of $A(Bx)$ by dotting the $i$-th row of $A$ against the $j$-th column of $B$, which is exactly how composition of linear maps works at the entry level.
[definition: Matrix Product]
Let $A \in \mathbb{F}^{m \times p}$ and $B \in \mathbb{F}^{p \times n}$. The matrix product $AB \in \mathbb{F}^{m \times n}$ is defined by
\begin{align*}
(AB)_{ij} &= \sum_{k=1}^{p} A_{ik} B_{kj}, \qquad 1 \le i \le m, \quad 1 \le j \le n.
\end{align*}
[/definition]
[remark: Non-Commutativity]
Matrix multiplication is not commutative in general. Even for square matrices, $AB = BA$ is a special condition — it says that the corresponding linear maps commute, which is geometrically restrictive. For example, rotation by $90°$ followed by reflection across the $x$-axis gives a different result than reflection first, then rotation.
[/remark]
The identity matrix $I_n \in \mathbb{F}^{n \times n}$ has $(I_n)_{ij} = 1$ if $i = j$ and $0$ otherwise. It satisfies $AI_n = A$ and $I_m A = A$ for any $A \in \mathbb{F}^{m \times n}$: $I_n$ is the identity on $\mathbb{F}^n$ (inputs), $I_m$ is the identity on $\mathbb{F}^m$ (outputs). The matrix $I_n$ represents the identity linear map $\operatorname{id}: \mathbb{F}^n \to \mathbb{F}^n$.
[definition: Transpose]
Let $A \in \mathbb{F}^{m \times n}$. The transpose of $A$ is the matrix $A^\top \in \mathbb{F}^{n \times m}$ defined by
\begin{align*}
(A^\top)_{ij} &= A_{ji}.
\end{align*}
Rows of $A$ become columns of $A^\top$, and columns become rows.
[/definition]
The transpose satisfies $(AB)^\top = B^\top A^\top$ — note the reversal of order. This reflects the fact that transposing is adjoint-taking for maps between Euclidean spaces: if $A$ represents $T: \mathbb{R}^n \to \mathbb{R}^m$, then $A^\top$ represents the adjoint $T^*: \mathbb{R}^m \to \mathbb{R}^n$ defined by $\langle Tx, y \rangle = \langle x, T^* y \rangle$.
Transposing a matrix and comparing it to the original immediately raises a natural question: what if $A^\top = A$? This means the map $T$ is self-adjoint — the adjoint of $T$ is $T$ itself. Such matrices arise in a startling number of contexts: the coefficient matrix of the normal equations in least squares, the Hessian of any twice-differentiable scalar function, the covariance matrix of any random vector, and the Laplacian matrix of any undirected graph. In each case, the self-adjoint condition encodes a physical or geometric symmetry — no preferred direction, a reciprocal relationship between input and output. At the other extreme, what if $A^\top = -A$? This is the skew-symmetric condition, and it captures pure rotation: the infinitesimal generators of rotations in $\mathbb{R}^n$ are skew-symmetric matrices. Isolating these two extremes is worth a definition.
[definition: Symmetric and Skew-Symmetric Matrices]
A square matrix $A \in \mathbb{F}^{n \times n}$ is called symmetric if $A^\top = A$, and skew-symmetric if $A^\top = -A$.
[/definition]
Every square matrix decomposes uniquely into symmetric and skew-symmetric parts:
\begin{align*}
A &= \frac{A + A^\top}{2} + \frac{A - A^\top}{2}.
\end{align*}
The first summand is symmetric, the second is skew-symmetric.
## Gaussian Elimination and Row Echelon Form
Systems of linear equations were the original motivation for the matrix, and Gaussian elimination is the original algorithm for solving them. The algorithm works by performing row operations that simplify $A$ without changing the solution set of $Ax = b$. To understand what operations are legitimate and why, we need to see them as multiplications by elementary matrices.
There are three types of elementary row operations: swap two rows, multiply a row by a nonzero scalar, and add a scalar multiple of one row to another. Each corresponds to left-multiplication by an elementary matrix, which is invertible. Since we apply the same operations to $b$ (forming the augmented matrix $[A \mid b]$), the solution set is preserved at every step.
But after applying row operations, what should we aim for? We want a form that is easy to solve by back-substitution: each equation should involve fewer unknowns than the one above it, so we can solve the last equation first, substitute upward, and resolve each variable in turn. The precise condition that makes back-substitution work is that the first nonzero entry of each row lies strictly to the right of the first nonzero entry of the row above. This staircase shape — the row echelon form — is the canonical target of Gaussian elimination. It is not just a convenient stopping point; it is the minimal structure that makes the solution readable and the rank visible (the rank equals the number of nonzero rows).
[definition: Row Echelon Form]
A matrix $A \in \mathbb{F}^{m \times n}$ is in row echelon form if:
1. All rows of zeros, if any, appear at the bottom.
2. The first nonzero entry (called the pivot or leading entry) in each nonzero row lies strictly to the right of the pivot in the row above.
If additionally every pivot equals $1$ and every other entry in the pivot's column is $0$, then $A$ is in reduced row echelon form (RREF).
[/definition]
[example: Gaussian Elimination]
We solve the system $Ax = b$ with
\begin{align*}
[A \mid b] &= \left(\begin{array}{ccc|c} 2 & 3 & -1 & 7 \\ 1 & -1 & 4 & 2 \\ 5 & 2 & 1 & 11 \end{array}\right).
\end{align*}
Step 1: Swap rows 1 and 2 to bring a leading $1$ to position $(1,1)$:
\begin{align*}
\left(\begin{array}{ccc|c} 1 & -1 & 4 & 2 \\ 2 & 3 & -1 & 7 \\ 5 & 2 & 1 & 11 \end{array}\right).
\end{align*}
Step 2: Eliminate the first column below the pivot. Replace row 2 by row $2 - 2 \cdot$ row $1$, and row 3 by row $3 - 5 \cdot$ row $1$:
\begin{align*}
\left(\begin{array}{ccc|c} 1 & -1 & 4 & 2 \\ 0 & 5 & -9 & 3 \\ 0 & 7 & -19 & 1 \end{array}\right).
\end{align*}
Step 3: Eliminate the second column below the pivot in row 2. Replace row 3 by row $3 - \frac{7}{5} \cdot$ row $2$:
\begin{align*}
\left(\begin{array}{ccc|c} 1 & -1 & 4 & 2 \\ 0 & 5 & -9 & 3 \\ 0 & 0 & -\frac{32}{5} & -\frac{16}{5} \end{array}\right).
\end{align*}
This is row echelon form. Back-substitution: from row 3, $-\frac{32}{5} x_3 = -\frac{16}{5}$, so $x_3 = \frac{1}{2}$. From row 2, $5x_2 - 9 \cdot \frac{1}{2} = 3$, so $5x_2 = 3 + \frac{9}{2} = \frac{15}{2}$, giving $x_2 = \frac{3}{2}$. From row 1, $x_1 - \frac{3}{2} + 2 = 2$, so $x_1 = \frac{3}{2}$.
We verify: $2 \cdot \frac{3}{2} + 3 \cdot \frac{3}{2} - \frac{1}{2} = 3 + \frac{9}{2} - \frac{1}{2} = 3 + 4 = 7$. Correct.
[/example]
The number of pivots produced by Gaussian elimination is an invariant of the matrix — it does not depend on the choices made during elimination. This invariant is the rank.
[definition: Rank]
The rank of a matrix $A \in \mathbb{F}^{m \times n}$, denoted $\operatorname{rank}(A)$, is the number of pivots in any row echelon form of $A$
[/definition]
Equivalently the dimension of the column space of $A$:
\begin{align*}
\operatorname{rank}(A) &= \dim \operatorname{span}\{A_1, \dots, A_n\},
\end{align*}
where $A_j \in \mathbb{F}^m$ denotes the $j$-th column of $A$.
[quotetheorem:916]
The rank counts the "true degrees of freedom" in the output: it is the dimension of the column space, i.e., the image of $T$. The nullity $\dim \ker(T)$ counts the dimensions collapsed to zero. Their sum must equal the number of input dimensions $n$, because every input either contributes to the output or gets killed.
[remark: Consistency of Linear Systems]
The system $Ax = b$ has a solution if and only if $b$ lies in the column space of $A$, equivalently $\operatorname{rank}(A) = \operatorname{rank}([A \mid b])$. If a solution exists, the solution set is an affine subspace of dimension $\dim \ker(T) = n - \operatorname{rank}(A)$. In particular, the solution is unique if and only if $\operatorname{rank}(A) = n$.
[/remark]
## Invertibility and the Determinant
Not every square matrix corresponds to an invertible linear map. The question of whether $Ax = b$ has a unique solution for every $b$ is exactly the question of whether $A$ is invertible. There is a single number that answers this question: the determinant.
Before defining the determinant, recall what we want from it. We want a number $\det A$ such that $\det A \neq 0$ exactly when $A$ is invertible. We also want $\det(AB) = \det(A)\det(B)$, so that determinants interact well with composition. And geometrically, $|\det A|$ should measure the factor by which $A$ scales volumes.
[definition: Determinant]
The determinant is the unique function $\det: \mathbb{F}^{n \times n} \to \mathbb{F}$ satisfying:
1. (Multilinearity) $\det A$ is linear in each row of $A$ separately.
2. (Alternating) If two rows of $A$ are equal, then $\det A = 0$.
3. (Normalisation) $\det I_n = 1$.
For $n = 2$:
\begin{align*}
\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} &= ad - bc.
\end{align*}
For general $n$, expanding along the first row:
\begin{align*}
\det A &= \sum_{j=1}^{n} (-1)^{1+j} A_{1j} \det A_{1j}^{\min},
\end{align*}
where $A_{1j}^{\min} \in \mathbb{F}^{(n-1) \times (n-1)}$ is the submatrix obtained by deleting row $1$ and column $j$.
[/definition]
[explanation: Why Multilinearity and Alternating Imply the Formula]
The multilinearity and alternating conditions together force the determinant to be the unique function measuring signed volume of the parallelepiped spanned by the rows. Alternating means that swapping two rows negates the determinant — which is consistent with orientation-reversal. Multilinearity means scaling a row scales the determinant — consistent with scaling a side of the parallelepiped. The normalisation pins down the scale.
From these axioms one derives that $\det A = 0$ whenever two rows are equal (alternating), which extends to: $\det A = 0$ whenever the rows are linearly dependent. This is the key connection to invertibility: a matrix is singular (not invertible) precisely when its rows are linearly dependent, hence precisely when $\det A = 0$.
[/explanation]
[example: Determinant of a $3 \times 3$ Matrix]
Compute $\det A$ for
\begin{align*}
A &= \begin{pmatrix} 1 & -1 & 4 \\ 0 & 5 & -9 \\ 0 & 0 & -\frac{32}{5} \end{pmatrix}.
\end{align*}
This is upper triangular (result of elimination), so the determinant equals the product of diagonal entries:
\begin{align*}
\det A &= 1 \cdot 5 \cdot \left(-\frac{32}{5}\right) = -32.
\end{align*}
Since $\det A = -32 \neq 0$, the matrix is invertible, confirming the system $Ax = b$ has a unique solution. The sign flip came from the row swap in Step 1 of elimination (which negates the determinant), so the determinant of the original matrix $A$ in the problem statement is $+32$.
[/example]
[quotetheorem:917]
The multiplicativity $\det(AB) = \det(A)\det(B)$ is perhaps the most important property. It says that determinants are multiplicative on the group of invertible matrices under composition. Geometrically: if $A$ scales volumes by $|\det A|$ and $B$ scales volumes by $|\det B|$, then $AB$ scales by $|\det A||\det B|$.
[definition: Inverse Matrix]
Let $A \in \mathbb{F}^{n \times n}$ with $\det A \neq 0$. The inverse of $A$ is the unique matrix $A^{-1} \in \mathbb{F}^{n \times n}$ satisfying
\begin{align*}
A A^{-1} &= A^{-1} A = I_n.
\end{align*}
[/definition]
The inverse exists if and only if $\det A \neq 0$. For $n = 2$:
\begin{align*}
\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} &= \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.
\end{align*}
For larger $n$, computing $A^{-1}$ directly via the cofactor formula is expensive (cost $O(n!)$). In practice one uses Gaussian elimination on the augmented matrix $[A \mid I_n]$ to obtain $[I_n \mid A^{-1}]$ in $O(n^3)$ operations.
## Eigenvalues and Diagonalisation
Gaussian elimination solves $Ax = b$, but it tells us little about the intrinsic geometry of the linear map $x \mapsto Ax$. To understand the map itself — what it does to space, which directions it stretches, which it compresses — we study eigenvalues and eigenvectors.
An eigenvector is a nonzero vector that the matrix merely stretches or compresses (possibly with a sign flip), without any rotation. If $v$ is an eigenvector with eigenvalue $\lambda$, then $Av = \lambda v$: the image of $v$ is a scalar multiple of $v$ itself. Not every matrix has eigenvectors (over $\mathbb{R}$), but over $\mathbb{C}$, every square matrix has at least one.
[definition: Eigenvalue and Eigenvector]
Let $A \in \mathbb{F}^{n \times n}$. A scalar $\lambda \in \mathbb{F}$ is an eigenvalue of $A$ if there exists a nonzero vector $v \in \mathbb{F}^n$ such that
\begin{align*}
Av &= \lambda v.
\end{align*}
Such a vector $v$ is called an eigenvector of $A$ corresponding to $\lambda$. The set of all eigenvectors with eigenvalue $\lambda$, together with $0$, is the eigenspace
\begin{align*}
E_\lambda &= \ker(A - \lambda I_n) = \{v \in \mathbb{F}^n : Av = \lambda v\}.
\end{align*}
[/definition]
The equation $Av = \lambda v$ can be rewritten $(A - \lambda I_n)v = 0$. This has a nonzero solution $v$ if and only if $A - \lambda I_n$ is not invertible, i.e., $\det(A - \lambda I_n) = 0$. This gives the algebraic condition for finding eigenvalues.
To find all eigenvalues at once, we want to understand the set of all $\lambda$ making $A - \lambda I_n$ singular. The determinant $\det(A - \lambda I_n)$ is a function of $\lambda$, and it equals zero precisely on this set. Crucially, this function turns out to be a polynomial in $\lambda$ of degree exactly $n$ — which means we can leverage the full theory of polynomials. Packaging the eigenvalue condition into a polynomial object lets us count eigenvalues with multiplicity (via roots), apply the Fundamental Theorem of Algebra over $\mathbb{C}$, and define algebraic multiplicity — the number of times a root is repeated. Without the polynomial perspective, we would have to hunt for eigenvalues case by case with no global handle on how many there are or how they behave.
[definition: Characteristic Polynomial]
Let $A \in \mathbb{F}^{n \times n}$. The characteristic polynomial of $A$ is
\begin{align*}
p_A(\lambda) &= \det(A - \lambda I_n).
\end{align*}
This is a degree $n$ polynomial in $\lambda$. Its roots are exactly the eigenvalues of $A$.
[/definition]
Over $\mathbb{C}$, the Fundamental Theorem of Algebra guarantees that $p_A$ has exactly $n$ roots (counted with multiplicity), so every complex square matrix has at least one eigenvalue. Over $\mathbb{R}$, the matrix
\begin{align*}
R &= \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
\end{align*}
represents rotation by $90°$ and has characteristic polynomial $\lambda^2 + 1$, with no real roots — rotation by $90°$ has no real eigenvectors, which is geometrically obvious: no nonzero vector points in the same direction after rotation.
[example: Computing Eigenvalues]
Find the eigenvalues of
\begin{align*}
A &= \begin{pmatrix} 4 & 1 \\ 2 & 3 \end{pmatrix}.
\end{align*}
The characteristic polynomial is:
\begin{align*}
p_A(\lambda) &= \det(A - \lambda I_2) = \det \begin{pmatrix} 4 - \lambda & 1 \\ 2 & 3 - \lambda \end{pmatrix} \\
&= (4 - \lambda)(3 - \lambda) - 1 \cdot 2 \\
&= \lambda^2 - 7\lambda + 12 - 2 \\
&= \lambda^2 - 7\lambda + 10 \\
&= (\lambda - 2)(\lambda - 5).
\end{align*}
The eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 5$.
For $\lambda_1 = 2$: solve $(A - 2I)v = 0$. We have $A - 2I = \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}$, so $2v_1 + v_2 = 0$, giving eigenvector $v_1 = \begin{pmatrix} 1 \\ -2 \end{pmatrix}$.
For $\lambda_2 = 5$: solve $(A - 5I)v = 0$. We have $A - 5I = \begin{pmatrix} -1 & 1 \\ 2 & -2 \end{pmatrix}$, so $-v_1 + v_2 = 0$, giving eigenvector $v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
Check: $A v_1 = \begin{pmatrix} 4+(-2) \\ 2+(-6) \end{pmatrix} = \begin{pmatrix} 2 \\ -4 \end{pmatrix} = 2 \begin{pmatrix} 1 \\ -2 \end{pmatrix} = 2v_1$. Correct.
[/example]
Knowing the eigenvalues and eigenvectors suggests a strategy: choose a basis of eigenvectors, then in that basis the matrix is diagonal. This is diagonalisation.
[definition: Diagonalisable Matrix]
A matrix $A \in \mathbb{F}^{n \times n}$ is diagonalisable if there exists an invertible matrix $P \in \mathbb{F}^{n \times n}$ and a diagonal matrix $D \in \mathbb{F}^{n \times n}$ such that
\begin{align*}
A &= P D P^{-1}.
\end{align*}
The columns of $P$ are eigenvectors of $A$, and the diagonal entries of $D$ are the corresponding eigenvalues.
[/definition]
[illustration:diagonalisation-change-of-basis]
[quotetheorem:921]
The theorem fails without the linear independence condition, even when all eigenvalues are real. Consider
\begin{align*}
J &= \begin{pmatrix} \lambda & 1 \\ 0 & \lambda \end{pmatrix}
\end{align*}
for any $\lambda \in \mathbb{F}$. The characteristic polynomial is $(\lambda - \mu)^2$ where $\mu$ is the variable, so $\lambda$ is an eigenvalue of algebraic multiplicity $2$. But $J - \lambda I = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ has rank $1$, so $\dim E_\lambda = 1$. The eigenspace is one-dimensional, not two-dimensional, and $J$ cannot be diagonalised. Matrices of this form are Jordan blocks, and they require a more refined normal form — the Jordan normal form — to analyse fully.
[quotetheorem:403]
This claim implies that a matrix with $n$ distinct eigenvalues is automatically diagonalisable. Most matrices encountered in applications satisfy this condition.
## Symmetric Matrices and the Spectral Theorem
The general diagonalisation story is complicated by Jordan blocks and complex eigenvalues. For real symmetric matrices, however, everything simplifies beautifully: eigenvalues are always real, eigenspaces for distinct eigenvalues are orthogonal, and the matrix is always orthogonally diagonalisable. This is the spectral theorem.
Why restrict to symmetric matrices? Because they arise naturally everywhere: the Hessian of a smooth function, covariance matrices in statistics, the stiffness matrix in structural mechanics, and the adjacency matrix of an undirected graph are all symmetric. The spectral theorem gives these matrices a completely transparent structure.
[definition: Orthogonal Matrix]
A matrix $Q \in \mathbb{R}^{n \times n}$ is orthogonal if
\begin{align*}
Q^\top Q &= Q Q^\top = I_n.
\end{align*}
[/definition]
Equivalently, the columns of $Q$ form an orthonormal basis of $\mathbb{R}^n$, and $Q^{-1} = Q^\top$. Orthogonal matrices preserve inner products: $\langle Qx, Qy \rangle = \langle x, y \rangle$ for all $x, y \in \mathbb{R}^n$.
Orthogonal matrices are the matrix representations of isometries of $\mathbb{R}^n$: rotations (if $\det Q = 1$) and reflections (if $\det Q = -1$). They preserve lengths and angles, hence the geometry of $\mathbb{R}^n$.
[quotetheorem:925]
The decomposition $A = Q D Q^\top$ has a striking geometric interpretation. Write $Q = [q_1 \mid \cdots \mid q_n]$ with columns $q_1, \dots, q_n \in \mathbb{R}^n$. Then
\begin{align*}
A &= \sum_{i=1}^{n} \lambda_i \, q_i q_i^\top = \sum_{i=1}^{n} \lambda_i \, (q_i \otimes q_i).
\end{align*}
Each term $q_i \otimes q_i$ is the orthogonal projection onto $\operatorname{span}(q_i)$ (since $q_i$ is a unit vector). So $A$ is a weighted sum of orthogonal projections: it stretches space by $\lambda_i$ in the direction $q_i$, with all these directions mutually orthogonal. The eigenvalues $\lambda_i$ are often called the spectrum of $A$.
[example: Spectral Decomposition]
Let
\begin{align*}
A &= \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}.
\end{align*}
The characteristic polynomial is $(3 - \lambda)^2 - 1 = \lambda^2 - 6\lambda + 8 = (\lambda - 2)(\lambda - 4)$, so eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 4$.
For $\lambda_1 = 2$: $A - 2I = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, so $v_1 + v_2 = 0$, giving (normalized) $q_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}$.
For $\lambda_2 = 4$: $A - 4I = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$, so $-v_1 + v_2 = 0$, giving (normalized) $q_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
Check orthogonality: $\langle q_1, q_2 \rangle = \frac{1}{2}(1 \cdot 1 + (-1) \cdot 1) = 0$. Correct.
The spectral decomposition is:
\begin{align*}
A &= 2 \cdot q_1 q_1^\top + 4 \cdot q_2 q_2^\top \\
&= 2 \cdot \frac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} + 4 \cdot \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \\
&= \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} + \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}.
\end{align*}
Correct.
[/example]
## Positive Definite Matrices and the Cholesky Factorisation
Among symmetric matrices, the positive definite ones occupy a particularly important position. They arise as covariance matrices, as Hessians of strictly convex functions at minima, and as the Gram matrices of linearly independent vectors. A positive definite matrix is in some sense the "nicest" possible symmetric matrix: all its eigenvalues are strictly positive, and it defines an inner product.
[definition: Positive Definite Matrix]
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is positive definite if for every nonzero vector $x \in \mathbb{R}^n$,
\begin{align*}
x^\top A x &> 0.
\end{align*}
It is positive semidefinite if $x^\top A x \ge 0$ for all $x \in \mathbb{R}^n$.
[/definition]
The quadratic form $x \mapsto x^\top A x$ is the key object. For $A = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ diagonal, $x^\top A x = \sum_{i=1}^n \lambda_i x_i^2$, which is positive for all $x \neq 0$ if and only if all $\lambda_i > 0$. The general case reduces to this via the spectral theorem: $x^\top A x = (Q^\top x)^\top D (Q^\top x)$, so $A$ is positive definite iff all eigenvalues are positive.
[quotetheorem:3289]
The Cholesky factorisation $A = LL^\top$ is the symmetric analogue of Gaussian elimination. It is numerically stable (no pivoting needed), twice as fast as general LU factorisation, and reveals that $A$ defines an inner product $\langle x, y \rangle_A = x^\top A y = \langle Lx, Ly \rangle$. Solving $Ax = b$ reduces to two triangular solves $Lz = b$ then $L^\top x = z$, each $O(n^2)$.
[example: Cholesky Factorisation]
Compute the Cholesky factorisation of
\begin{align*}
A &= \begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix}.
\end{align*}
First verify positive definiteness: $\det A_1 = 4 > 0$ and $\det A_2 = 4 \cdot 2 - 2 \cdot 2 = 4 > 0$. So $A$ is positive definite.
Write $L = \begin{pmatrix} \ell_{11} & 0 \\ \ell_{21} & \ell_{22} \end{pmatrix}$ and solve $LL^\top = A$:
\begin{align*}
LL^\top &= \begin{pmatrix} \ell_{11}^2 & \ell_{11}\ell_{21} \\ \ell_{11}\ell_{21} & \ell_{21}^2 + \ell_{22}^2 \end{pmatrix} = \begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix}.
\end{align*}
From position $(1,1)$: $\ell_{11}^2 = 4$, so $\ell_{11} = 2$ (positive diagonal). From position $(2,1)$: $2 \ell_{21} = 2$, so $\ell_{21} = 1$. From position $(2,2)$: $1 + \ell_{22}^2 = 2$, so $\ell_{22} = 1$.
Thus $L = \begin{pmatrix} 2 & 0 \\ 1 & 1 \end{pmatrix}$ and $A = LL^\top = \begin{pmatrix} 2 & 0 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix}$. Correct.
[/example]
## Singular Value Decomposition
Diagonalisation and the spectral theorem apply only to square (and symmetric) matrices. For a rectangular matrix $A \in \mathbb{R}^{m \times n}$ representing a linear map between spaces of different dimensions, neither framework applies directly. The singular value decomposition (SVD) is the correct generalisation: it provides an orthonormal basis for the input space and an orthonormal basis for the output space, related by stretching factors called singular values.
The SVD is not just a generalisation for its own sake. It reveals the four fundamental subspaces of $A$ (column space, null space, row space, left null space), provides the best low-rank approximation to $A$ (Eckart-Young theorem), and underlies principal component analysis, pseudoinverse computation, and least-squares problems.
[definition: Singular Value Decomposition]
Let $A \in \mathbb{R}^{m \times n}$ with $r = \operatorname{rank}(A)$. The singular value decomposition of $A$ is a factorisation
\begin{align*}
A &= U \Sigma V^\top,
\end{align*}
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is a matrix with nonnegative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots$ on the main diagonal and zeros elsewhere. The values $\sigma_1, \dots, \sigma_r$ are the singular values of $A$, the columns $u_1, \dots, u_m$ of $U$ are the left singular vectors, and the columns $v_1, \dots, v_n$ of $V$ are the right singular vectors.
[/definition]
[illustration:svd-geometric-decomposition]
The singular values relate to eigenvalues of $A^\top A$: the matrix $A^\top A \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite, and its eigenvalues are $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_r^2 > 0 = \cdots = 0$. The right singular vectors $v_i$ are the corresponding eigenvectors, and $u_i = \frac{1}{\sigma_i} A v_i$ for $i = 1, \dots, r$.
[quotetheorem:3290]
The Eckart-Young theorem explains why the SVD is so central in data science: given a data matrix $A$ (e.g., $m$ observations of $n$ features), the rank-$k$ truncation $A_k$ captures the $k$ "most significant" directions of variation (measured by singular values) and discards noise or redundancy. This is the mathematical foundation of principal component analysis.
[example: Low-Rank Approximation via SVD]
Let
\begin{align*}
A &= \begin{pmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{pmatrix}.
\end{align*}
We compute the SVD. First form $A^\top A \in \mathbb{R}^{3 \times 3}$:
\begin{align*}
A^\top A &= \begin{pmatrix} 3 & 2 \\ 2 & 3 \\ 2 & -2 \end{pmatrix} \begin{pmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{pmatrix} = \begin{pmatrix} 13 & 12 & 2 \\ 12 & 13 & -2 \\ 2 & -2 & 8 \end{pmatrix}.
\end{align*}
The eigenvalues of $A^\top A$ are $\sigma_1^2 = 25$, $\sigma_2^2 = 9$, and $\sigma_3^2 = 0$ (since $\operatorname{rank}(A) = 2$), giving singular values $\sigma_1 = 5$, $\sigma_2 = 3$, $\sigma_3 = 0$.
The right singular vectors (eigenvectors of $A^\top A$) are:
\begin{align*}
v_1 &= \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\\0\end{pmatrix}, \quad v_2 = \frac{1}{\sqrt{18}}\begin{pmatrix}1\\-1\\4\end{pmatrix}, \quad v_3 = \frac{1}{\sqrt{9}}\begin{pmatrix}2\\-2\\-1\end{pmatrix}.
\end{align*}
The left singular vectors are $u_i = \frac{1}{\sigma_i} A v_i$ for $i = 1, 2$:
\begin{align*}
u_1 &= \frac{1}{5} \begin{pmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{pmatrix} \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\\0\end{pmatrix} = \frac{1}{5\sqrt{2}}\begin{pmatrix}5\\5\end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix}, \\
u_2 &= \frac{1}{3} \begin{pmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{pmatrix} \frac{1}{\sqrt{18}}\begin{pmatrix}1\\-1\\4\end{pmatrix} = \frac{1}{3\sqrt{18}}\begin{pmatrix}9\\-9\end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix}.
\end{align*}
Now form the rank-$1$ truncation $A_1 = \sigma_1 u_1 v_1^\top$:
\begin{align*}
A_1 &= 5 \cdot \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix} \cdot \frac{1}{\sqrt{2}}\begin{pmatrix}1&1&0\end{pmatrix} = \frac{5}{2}\begin{pmatrix}1&1&0\\1&1&0\end{pmatrix}.
\end{align*}
The Eckart-Young theorem says $A_1$ is the closest rank-$1$ matrix to $A$:
\begin{align*}
\|A - A_1\|_2 &= \sigma_2 = 3, \qquad \|A - A_1\|_F = \sqrt{\sigma_2^2 + \sigma_3^2} = \sqrt{9 + 0} = 3.
\end{align*}
By contrast, $\|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2} = \sqrt{25 + 9} = \sqrt{34} \approx 5.83$. The rank-$1$ approximation captures $\sigma_1^2 / (\sigma_1^2 + \sigma_2^2) = 25/34 \approx 74\%$ of the Frobenius energy. Adding the rank-$2$ term $\sigma_2 u_2 v_2^\top$ recovers $A$ exactly, since $\operatorname{rank}(A) = 2$.
[/example]
[example: When Invertibility Fails — a Singular Matrix]
Consider the matrix
\begin{align*}
A &= \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}.
\end{align*}
Gaussian elimination reveals the failure immediately. Subtract $4$ times row $1$ from row $2$, and $7$ times row $1$ from row $3$:
\begin{align*}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -12 \end{pmatrix}.
\end{align*}
Now subtract $2$ times row $2$ from row $3$:
\begin{align*}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{pmatrix}.
\end{align*}
The bottom row is all zero: $\operatorname{rank}(A) = 2 < 3$. Therefore $\det A = 0$, the map $x \mapsto Ax$ is not injective (the null space has dimension $n - \operatorname{rank}(A) = 1$), and the system $Ax = b$ either has no solution or infinitely many solutions — never a unique one. The specific failure mode here is that the rows of $A$ lie in an arithmetic progression: row $k$ equals $(3k-2, 3k-1, 3k)$, so every row is a linear combination of $(1,1,1)$ and $(1,0,-1)$. A linear map with such a coefficient matrix collapses the $3$-dimensional input space onto a $2$-dimensional subspace, discarding all information in the null direction. No amount of algebra can recover what was lost.
[/example]
## References
Gilbert Strang, *Introduction to Linear Algebra* (2016).
Roger A. Horn and Charles R. Johnson, *Matrix Analysis* (2012).
Lloyd N. Trefethen and David Bau III, *Numerical Linear Algebra* (1997).
Carl D. Meyer, *Matrix Analysis and Applied Linear Algebra* (2000).