Linear Discriminant Analysis Generalized Eigenvalue Theorem

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We convert the generalized Rayleigh quotient for $(B,W)$ into an ordinary Rayleigh quotient for the symmetric positive semidefinite matrix $A = W^{-1/2}BW^{-1/2}$. The ordinary symmetric eigenvalue decomposition of $A$ gives an orthonormal basis of eigenvectors, and transforming back by $W^{-1/2}$ gives a $W$-orthonormal basis of generalized eigenvectors for $(B,W)$. The successive maximization property follows from the same change of variables, because $W$-orthogonality of the original directions becomes Euclidean orthogonality after the transformation. Finally, when $B$ is a weighted scatter matrix of $g$ centered group means, its range is contained in the span of $g$ centered vectors whose weighted sum is zero, so its rank is at most $g-1$. [/proofplan] [step:Transform the generalized quotient into an ordinary Rayleigh quotient] Since $W$ is symmetric positive definite, let $S \in \mathbb{R}^{p \times p}$ denote the unique symmetric positive definite square root of $W$, so that $S^2 = W$, and let $S^{-1}: \mathbb{R}^p \to \mathbb{R}^p$ denote its inverse [linear map](/page/Linear%20Map). Define \begin{align*} A := S^{-1}BS^{-1} \in \mathbb{R}^{p \times p}. \end{align*} The matrix $A$ is symmetric because $B$ and $S^{-1}$ are symmetric: \begin{align*} A^\top = (S^{-1}BS^{-1})^\top = S^{-1}BS^{-1} = A. \end{align*} It is positive semidefinite because, for every $y \in \mathbb{R}^p$, \begin{align*} y^\top Ay = y^\top S^{-1}BS^{-1}y = (S^{-1}y)^\top B(S^{-1}y) \geq 0. \end{align*} For a non-zero vector $v \in \mathbb{R}^p$, define $y := Sv \in \mathbb{R}^p$. Since $S$ is invertible, $v \neq 0$ iff $y \neq 0$, and $v = S^{-1}y$. Hence \begin{align*} R(v) &= \frac{v^\top Bv}{v^\top Wv} \\ &= \frac{(S^{-1}y)^\top B(S^{-1}y)}{(S^{-1}y)^\top S^2(S^{-1}y)} \\ &= \frac{y^\top S^{-1}BS^{-1}y}{y^\top y} \\ &= \frac{y^\top Ay}{y^\top y}. \end{align*} Thus maximizing $R(v)$ over non-zero $v \in \mathbb{R}^p$ is equivalent, under the bijection $y = Sv$, to maximizing the ordinary Rayleigh quotient of $A$ over non-zero $y \in \mathbb{R}^p$. [guided] The role of the positive definite matrix $W$ is to define the denominator of the quotient and the relevant notion of orthogonality. To remove it, we use its symmetric positive definite square root. Let $S \in \mathbb{R}^{p \times p}$ be the unique symmetric positive definite matrix satisfying $S^2 = W$, and let $S^{-1}: \mathbb{R}^p \to \mathbb{R}^p$ be its inverse. Define the transformed matrix \begin{align*} A := S^{-1}BS^{-1}. \end{align*} This matrix is symmetric because both $B$ and $S^{-1}$ are symmetric: \begin{align*} A^\top = (S^{-1}BS^{-1})^\top = S^{-1}BS^{-1} = A. \end{align*} It is positive semidefinite because for every $y \in \mathbb{R}^p$, \begin{align*} y^\top Ay = y^\top S^{-1}BS^{-1}y = (S^{-1}y)^\top B(S^{-1}y) \geq 0, \end{align*} using the positive semidefiniteness of $B$. Now take any non-zero $v \in \mathbb{R}^p$ and set $y := Sv$. This is a bijective change of variables because $S$ is invertible. Since $v = S^{-1}y$, the numerator becomes \begin{align*} v^\top Bv = (S^{-1}y)^\top B(S^{-1}y) = y^\top S^{-1}BS^{-1}y = y^\top Ay, \end{align*} while the denominator becomes \begin{align*} v^\top Wv = (S^{-1}y)^\top S^2(S^{-1}y) = y^\top y. \end{align*} Therefore \begin{align*} R(v) = \frac{y^\top Ay}{y^\top y}. \end{align*} So the generalized optimization problem for $(B,W)$ is exactly the ordinary Rayleigh quotient problem for the real symmetric positive semidefinite matrix $A$. [/guided] [/step] [step:Diagonalize the transformed symmetric matrix] Because $A$ is real symmetric and positive semidefinite, the finite-dimensional spectral theorem for real symmetric matrices gives an orthonormal basis $u_1,\dots,u_p \in \mathbb{R}^p$ and real eigenvalues $\lambda_1,\dots,\lambda_p \in \mathbb{R}$ such that \begin{align*} Au_i &= \lambda_i u_i, \\ u_i^\top u_j &= \delta_{ij} \end{align*} for all $1 \leq i,j \leq p$. Since $A$ is positive semidefinite, \begin{align*} \lambda_i = u_i^\top Au_i \geq 0. \end{align*} Relabel the eigenvectors so that \begin{align*} \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0. \end{align*} Define vectors $v_i \in \mathbb{R}^p$ by \begin{align*} v_i := S^{-1}u_i \end{align*} for $1 \leq i \leq p$. Then \begin{align*} v_i^\top Wv_j &= (S^{-1}u_i)^\top S^2(S^{-1}u_j) \\ &= u_i^\top u_j \\ &= \delta_{ij}. \end{align*} Also, \begin{align*} Bv_i &= BS^{-1}u_i \\ &= S(S^{-1}BS^{-1})u_i \\ &= SAu_i \\ &= \lambda_i Su_i \\ &= \lambda_i S^2S^{-1}u_i \\ &= \lambda_i Wv_i. \end{align*} Thus $v_1,\dots,v_p$ are $W$-orthonormal generalized eigenvectors of $(B,W)$. [guided] We now use the ordinary spectral structure of the symmetric matrix $A$. Since $A$ is real symmetric, it admits an orthonormal eigenbasis $u_1,\dots,u_p \in \mathbb{R}^p$ with eigenvalues $\lambda_1,\dots,\lambda_p \in \mathbb{R}$. Since $A$ is positive semidefinite, each eigenvalue is non-negative: indeed, \begin{align*} \lambda_i = \lambda_i u_i^\top u_i = u_i^\top Au_i \geq 0. \end{align*} We order these eigenvalues as \begin{align*} \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0. \end{align*} The eigenvectors of $A$ must now be transformed back to the original coordinates. Define \begin{align*} v_i := S^{-1}u_i \end{align*} for each $1 \leq i \leq p$. First we check the orthogonality condition. The relevant inner product in the original variables is not the Euclidean one, but the $W$-inner product. For $1 \leq i,j \leq p$, \begin{align*} v_i^\top Wv_j &= (S^{-1}u_i)^\top S^2(S^{-1}u_j) \\ &= u_i^\top u_j \\ &= \delta_{ij}. \end{align*} Thus Euclidean orthonormality of the $u_i$ becomes $W$-orthonormality of the $v_i$. Next we verify the generalized eigenvalue equation. Since $A = S^{-1}BS^{-1}$, multiplying by $S$ on the left gives $SA = BS^{-1}$. Therefore \begin{align*} Bv_i &= BS^{-1}u_i \\ &= SAu_i \\ &= \lambda_i Su_i. \end{align*} Because $v_i = S^{-1}u_i$, we have $Wv_i = S^2S^{-1}u_i = Su_i$. Hence \begin{align*} Bv_i = \lambda_i Wv_i. \end{align*} So the transformed eigenvectors are precisely generalized eigenvectors of the original matrix pair $(B,W)$. [/guided] [/step] [step:Identify the successive maximizers under $W$-orthogonality] Let $1 \leq k \leq p$. Consider the constraint subspace \begin{align*} V_k := \{v \in \mathbb{R}^p : v^\top Wv_i = 0 \text{ for } 1 \leq i < k\}. \end{align*} Under the bijection $y = Sv$, this subspace corresponds to \begin{align*} Y_k := \{y \in \mathbb{R}^p : y^\top u_i = 0 \text{ for } 1 \leq i < k\}. \end{align*} Indeed, \begin{align*} v^\top Wv_i = v^\top S^2S^{-1}u_i = (Sv)^\top u_i = y^\top u_i. \end{align*} Every non-zero $y \in Y_k$ has a unique expansion \begin{align*} y = \sum_{i=k}^p c_i u_i \end{align*} with real coefficients $c_k,\dots,c_p$, not all zero. Therefore \begin{align*} \frac{y^\top Ay}{y^\top y} &= \frac{\sum_{i=k}^p \lambda_i c_i^2}{\sum_{i=k}^p c_i^2} \\ &\leq \lambda_k, \end{align*} because $\lambda_i \leq \lambda_k$ for every $i \geq k$. Equality is attained when $y = u_k$. Since $v_k = S^{-1}u_k$, the vector $v_k$ maximizes $R(v)$ over non-zero $v \in V_k$. [guided] The first eigenvector gives the first maximizer. For later discriminant directions, we impose $W$-orthogonality to the directions already chosen. Fix $k$ with $1 \leq k \leq p$, and define \begin{align*} V_k := \{v \in \mathbb{R}^p : v^\top Wv_i = 0 \text{ for } 1 \leq i < k\}. \end{align*} This is the set of admissible vectors for the $k$-th discriminant direction. The change of variables $y = Sv$ turns this into ordinary Euclidean orthogonality. Indeed, since $v_i = S^{-1}u_i$, \begin{align*} v^\top Wv_i = v^\top S^2S^{-1}u_i = (Sv)^\top u_i = y^\top u_i. \end{align*} Thus $v \in V_k$ if and only if $y = Sv$ belongs to \begin{align*} Y_k := \{y \in \mathbb{R}^p : y^\top u_i = 0 \text{ for } 1 \leq i < k\}. \end{align*} Because $u_1,\dots,u_p$ is an orthonormal basis, every $y \in Y_k$ has the form \begin{align*} y = \sum_{i=k}^p c_i u_i \end{align*} for real coefficients $c_k,\dots,c_p$. If $y \neq 0$, then not all these coefficients vanish. Using $Au_i = \lambda_i u_i$ and $u_i^\top u_j = \delta_{ij}$, we compute \begin{align*} y^\top Ay &= \left(\sum_{i=k}^p c_i u_i\right)^\top A\left(\sum_{j=k}^p c_j u_j\right) \\ &= \sum_{i=k}^p \lambda_i c_i^2, \end{align*} and \begin{align*} y^\top y = \sum_{i=k}^p c_i^2. \end{align*} Hence \begin{align*} \frac{y^\top Ay}{y^\top y} = \frac{\sum_{i=k}^p \lambda_i c_i^2}{\sum_{i=k}^p c_i^2} \leq \lambda_k, \end{align*} because each $\lambda_i$ with $i \geq k$ satisfies $\lambda_i \leq \lambda_k$. Equality occurs at $y = u_k$. Translating back through $v = S^{-1}y$, equality occurs at $v = v_k$. Therefore $v_k$ is the $k$-th discriminant direction obtained by maximizing the between-to-within quotient subject to $W$-orthogonality to the preceding directions. [/guided] [/step] [step:Bound the number of non-zero directions by the rank of $B$] The non-zero discriminant directions are exactly those $v_i$ whose eigenvalue satisfies $\lambda_i > 0$. Since $A$ is symmetric, the number of positive eigenvalues of $A$ equals $\operatorname{rank}(A)$. Because $S^{-1}$ is invertible, \begin{align*} \operatorname{rank}(A) = \operatorname{rank}(S^{-1}BS^{-1}) = \operatorname{rank}(B). \end{align*} Thus the number of non-zero discriminant directions is at most $\operatorname{rank}(B)$, and hence at most $p$. [guided] A discriminant direction is non-zero in the sense of contributing positive between-group variance relative to within-group variance exactly when its generalized eigenvalue is positive. For the directions constructed above, \begin{align*} R(v_i) = \frac{v_i^\top Bv_i}{v_i^\top Wv_i} = \lambda_i. \end{align*} Thus the number of non-zero discriminant directions is the number of positive eigenvalues among $\lambda_1,\dots,\lambda_p$. Since $A$ is symmetric positive semidefinite, its eigenvalues are non-negative, and the number of positive eigenvalues equals $\operatorname{rank}(A)$. The transformation from $B$ to $A$ does not change rank, because multiplication by an invertible matrix on the left or right preserves rank: \begin{align*} \operatorname{rank}(A) = \operatorname{rank}(S^{-1}BS^{-1}) = \operatorname{rank}(B). \end{align*} Therefore the number of non-zero discriminant directions is at most $\operatorname{rank}(B)$. Since $B$ is a $p \times p$ matrix, $\operatorname{rank}(B) \leq p$, so the ambient dimension also bounds the number of such directions. [/guided] [/step] [step:Use the centered group means to obtain the $g-1$ rank bound] Assume now that $B$ arises from $g$ group means. Define centered mean vectors $m_k \in \mathbb{R}^p$ by \begin{align*} m_k := \mu_k - \bar{\mu} \end{align*} for $1 \leq k \leq g$. The definition of $\bar{\mu}$ gives \begin{align*} \sum_{k=1}^g a_k m_k = \sum_{k=1}^g a_k(\mu_k-\bar{\mu}) = \sum_{k=1}^g a_k\mu_k - \left(\sum_{k=1}^g a_k\right)\bar{\mu} = 0. \end{align*} Since each $a_k > 0$, this is a non-trivial linear dependence among $m_1,\dots,m_g$. Hence \begin{align*} \dim \operatorname{span}\{m_1,\dots,m_g\} \leq g-1. \end{align*} For every $x \in \mathbb{R}^p$, \begin{align*} Bx = \sum_{k=1}^g a_k m_k m_k^\top x = \sum_{k=1}^g a_k(m_k^\top x)m_k. \end{align*} Thus $\operatorname{Range}(B) \subseteq \operatorname{span}\{m_1,\dots,m_g\}$, and therefore \begin{align*} \operatorname{rank}(B) = \dim \operatorname{Range}(B) \leq \dim \operatorname{span}\{m_1,\dots,m_g\} \leq g-1. \end{align*} Combining this with the ambient bound $\operatorname{rank}(B) \leq p$, the number of non-zero discriminant directions is at most \begin{align*} \min(g-1,p). \end{align*} This proves the theorem. [guided] We now use the special structure of the between-group scatter matrix. Define the centered group mean vectors \begin{align*} m_k := \mu_k - \bar{\mu} \end{align*} for $1 \leq k \leq g$. The weighted mean $\bar{\mu}$ was defined by \begin{align*} \bar{\mu} = \frac{\sum_{k=1}^g a_k\mu_k}{\sum_{k=1}^g a_k}. \end{align*} Therefore the centered vectors satisfy the weighted cancellation identity \begin{align*} \sum_{k=1}^g a_k m_k &= \sum_{k=1}^g a_k(\mu_k-\bar{\mu}) \\ &= \sum_{k=1}^g a_k\mu_k - \left(\sum_{k=1}^g a_k\right)\bar{\mu} \\ &= 0. \end{align*} Because every weight $a_k$ is positive, not all coefficients in this linear relation are zero. Hence the $g$ vectors $m_1,\dots,m_g$ are linearly dependent, and their span has dimension at most $g-1$: \begin{align*} \dim \operatorname{span}\{m_1,\dots,m_g\} \leq g-1. \end{align*} Next we show that $B$ cannot have range larger than this span. For any $x \in \mathbb{R}^p$, \begin{align*} Bx &= \left(\sum_{k=1}^g a_k m_km_k^\top\right)x \\ &= \sum_{k=1}^g a_k m_k(m_k^\top x). \end{align*} This is a linear combination of $m_1,\dots,m_g$, so \begin{align*} \operatorname{Range}(B) \subseteq \operatorname{span}\{m_1,\dots,m_g\}. \end{align*} Taking dimensions gives \begin{align*} \operatorname{rank}(B) = \dim \operatorname{Range}(B) \leq \dim \operatorname{span}\{m_1,\dots,m_g\} \leq g-1. \end{align*} From the previous step, the number of positive generalized eigenvalues, and therefore the number of non-zero discriminant directions, is at most $\operatorname{rank}(B)$. We also have $\operatorname{rank}(B) \leq p$ because $B$ is a $p \times p$ matrix. Combining the two bounds gives \begin{align*} \#\{i : \lambda_i > 0\} \leq \min(g-1,p). \end{align*} This completes the proof. [/guided] [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Orthogonality

What brings you to Androma?

Start with a route through the knowledge graph.