[proofplan]
We convert the generalized Rayleigh quotient for $(B,W)$ into an ordinary Rayleigh quotient for the symmetric positive semidefinite matrix $A = W^{-1/2}BW^{-1/2}$. The ordinary symmetric eigenvalue decomposition of $A$ gives an orthonormal basis of eigenvectors, and transforming back by $W^{-1/2}$ gives a $W$-orthonormal basis of generalized eigenvectors for $(B,W)$. The successive maximization property follows from the same change of variables, because $W$-orthogonality of the original directions becomes Euclidean orthogonality after the transformation. Finally, when $B$ is a weighted scatter matrix of $g$ centered group means, its range is contained in the span of $g$ centered vectors whose weighted sum is zero, so its rank is at most $g-1$.
[/proofplan]
[step:Transform the generalized quotient into an ordinary Rayleigh quotient]
Since $W$ is symmetric positive definite, let $S \in \mathbb{R}^{p \times p}$ denote the unique symmetric positive definite square root of $W$, so that $S^2 = W$, and let $S^{-1}: \mathbb{R}^p \to \mathbb{R}^p$ denote its inverse [linear map](/page/Linear%20Map). Define
\begin{align*}
A := S^{-1}BS^{-1} \in \mathbb{R}^{p \times p}.
\end{align*}
The matrix $A$ is symmetric because $B$ and $S^{-1}$ are symmetric:
\begin{align*}
A^\top = (S^{-1}BS^{-1})^\top = S^{-1}BS^{-1} = A.
\end{align*}
It is positive semidefinite because, for every $y \in \mathbb{R}^p$,
\begin{align*}
y^\top Ay = y^\top S^{-1}BS^{-1}y = (S^{-1}y)^\top B(S^{-1}y) \geq 0.
\end{align*}
For a non-zero vector $v \in \mathbb{R}^p$, define $y := Sv \in \mathbb{R}^p$. Since $S$ is invertible, $v \neq 0$ iff $y \neq 0$, and $v = S^{-1}y$. Hence
\begin{align*}
R(v)
&= \frac{v^\top Bv}{v^\top Wv} \\
&= \frac{(S^{-1}y)^\top B(S^{-1}y)}{(S^{-1}y)^\top S^2(S^{-1}y)} \\
&= \frac{y^\top S^{-1}BS^{-1}y}{y^\top y} \\
&= \frac{y^\top Ay}{y^\top y}.
\end{align*}
Thus maximizing $R(v)$ over non-zero $v \in \mathbb{R}^p$ is equivalent, under the bijection $y = Sv$, to maximizing the ordinary Rayleigh quotient of $A$ over non-zero $y \in \mathbb{R}^p$.
[guided]
The role of the positive definite matrix $W$ is to define the denominator of the quotient and the relevant notion of orthogonality. To remove it, we use its symmetric positive definite square root. Let $S \in \mathbb{R}^{p \times p}$ be the unique symmetric positive definite matrix satisfying $S^2 = W$, and let $S^{-1}: \mathbb{R}^p \to \mathbb{R}^p$ be its inverse.
Define the transformed matrix
\begin{align*}
A := S^{-1}BS^{-1}.
\end{align*}
This matrix is symmetric because both $B$ and $S^{-1}$ are symmetric:
\begin{align*}
A^\top = (S^{-1}BS^{-1})^\top = S^{-1}BS^{-1} = A.
\end{align*}
It is positive semidefinite because for every $y \in \mathbb{R}^p$,
\begin{align*}
y^\top Ay
= y^\top S^{-1}BS^{-1}y
= (S^{-1}y)^\top B(S^{-1}y)
\geq 0,
\end{align*}
using the positive semidefiniteness of $B$.
Now take any non-zero $v \in \mathbb{R}^p$ and set $y := Sv$. This is a bijective change of variables because $S$ is invertible. Since $v = S^{-1}y$, the numerator becomes
\begin{align*}
v^\top Bv = (S^{-1}y)^\top B(S^{-1}y) = y^\top S^{-1}BS^{-1}y = y^\top Ay,
\end{align*}
while the denominator becomes
\begin{align*}
v^\top Wv
= (S^{-1}y)^\top S^2(S^{-1}y)
= y^\top y.
\end{align*}
Therefore
\begin{align*}
R(v) = \frac{y^\top Ay}{y^\top y}.
\end{align*}
So the generalized optimization problem for $(B,W)$ is exactly the ordinary Rayleigh quotient problem for the real symmetric positive semidefinite matrix $A$.
[/guided]
[/step]
[step:Diagonalize the transformed symmetric matrix]
Because $A$ is real symmetric and positive semidefinite, the finite-dimensional spectral theorem for real symmetric matrices gives an orthonormal basis $u_1,\dots,u_p \in \mathbb{R}^p$ and real eigenvalues $\lambda_1,\dots,\lambda_p \in \mathbb{R}$ such that
\begin{align*}
Au_i &= \lambda_i u_i, \\
u_i^\top u_j &= \delta_{ij}
\end{align*}
for all $1 \leq i,j \leq p$. Since $A$ is positive semidefinite,
\begin{align*}
\lambda_i = u_i^\top Au_i \geq 0.
\end{align*}
Relabel the eigenvectors so that
\begin{align*}
\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0.
\end{align*}
Define vectors $v_i \in \mathbb{R}^p$ by
\begin{align*}
v_i := S^{-1}u_i
\end{align*}
for $1 \leq i \leq p$. Then
\begin{align*}
v_i^\top Wv_j
&= (S^{-1}u_i)^\top S^2(S^{-1}u_j) \\
&= u_i^\top u_j \\
&= \delta_{ij}.
\end{align*}
Also,
\begin{align*}
Bv_i
&= BS^{-1}u_i \\
&= S(S^{-1}BS^{-1})u_i \\
&= SAu_i \\
&= \lambda_i Su_i \\
&= \lambda_i S^2S^{-1}u_i \\
&= \lambda_i Wv_i.
\end{align*}
Thus $v_1,\dots,v_p$ are $W$-orthonormal generalized eigenvectors of $(B,W)$.
[guided]
We now use the ordinary spectral structure of the symmetric matrix $A$. Since $A$ is real symmetric, it admits an orthonormal eigenbasis $u_1,\dots,u_p \in \mathbb{R}^p$ with eigenvalues $\lambda_1,\dots,\lambda_p \in \mathbb{R}$. Since $A$ is positive semidefinite, each eigenvalue is non-negative: indeed,
\begin{align*}
\lambda_i = \lambda_i u_i^\top u_i = u_i^\top Au_i \geq 0.
\end{align*}
We order these eigenvalues as
\begin{align*}
\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0.
\end{align*}
The eigenvectors of $A$ must now be transformed back to the original coordinates. Define
\begin{align*}
v_i := S^{-1}u_i
\end{align*}
for each $1 \leq i \leq p$. First we check the orthogonality condition. The relevant inner product in the original variables is not the Euclidean one, but the $W$-inner product. For $1 \leq i,j \leq p$,
\begin{align*}
v_i^\top Wv_j
&= (S^{-1}u_i)^\top S^2(S^{-1}u_j) \\
&= u_i^\top u_j \\
&= \delta_{ij}.
\end{align*}
Thus Euclidean orthonormality of the $u_i$ becomes $W$-orthonormality of the $v_i$.
Next we verify the generalized eigenvalue equation. Since $A = S^{-1}BS^{-1}$, multiplying by $S$ on the left gives $SA = BS^{-1}$. Therefore
\begin{align*}
Bv_i
&= BS^{-1}u_i \\
&= SAu_i \\
&= \lambda_i Su_i.
\end{align*}
Because $v_i = S^{-1}u_i$, we have $Wv_i = S^2S^{-1}u_i = Su_i$. Hence
\begin{align*}
Bv_i = \lambda_i Wv_i.
\end{align*}
So the transformed eigenvectors are precisely generalized eigenvectors of the original matrix pair $(B,W)$.
[/guided]
[/step]
[step:Identify the successive maximizers under $W$-orthogonality]
Let $1 \leq k \leq p$. Consider the constraint subspace
\begin{align*}
V_k := \{v \in \mathbb{R}^p : v^\top Wv_i = 0 \text{ for } 1 \leq i < k\}.
\end{align*}
Under the bijection $y = Sv$, this subspace corresponds to
\begin{align*}
Y_k := \{y \in \mathbb{R}^p : y^\top u_i = 0 \text{ for } 1 \leq i < k\}.
\end{align*}
Indeed,
\begin{align*}
v^\top Wv_i
= v^\top S^2S^{-1}u_i
= (Sv)^\top u_i
= y^\top u_i.
\end{align*}
Every non-zero $y \in Y_k$ has a unique expansion
\begin{align*}
y = \sum_{i=k}^p c_i u_i
\end{align*}
with real coefficients $c_k,\dots,c_p$, not all zero. Therefore
\begin{align*}
\frac{y^\top Ay}{y^\top y}
&= \frac{\sum_{i=k}^p \lambda_i c_i^2}{\sum_{i=k}^p c_i^2} \\
&\leq \lambda_k,
\end{align*}
because $\lambda_i \leq \lambda_k$ for every $i \geq k$. Equality is attained when $y = u_k$. Since $v_k = S^{-1}u_k$, the vector $v_k$ maximizes $R(v)$ over non-zero $v \in V_k$.
[guided]
The first eigenvector gives the first maximizer. For later discriminant directions, we impose $W$-orthogonality to the directions already chosen. Fix $k$ with $1 \leq k \leq p$, and define
\begin{align*}
V_k := \{v \in \mathbb{R}^p : v^\top Wv_i = 0 \text{ for } 1 \leq i < k\}.
\end{align*}
This is the set of admissible vectors for the $k$-th discriminant direction.
The change of variables $y = Sv$ turns this into ordinary Euclidean orthogonality. Indeed, since $v_i = S^{-1}u_i$,
\begin{align*}
v^\top Wv_i
= v^\top S^2S^{-1}u_i
= (Sv)^\top u_i
= y^\top u_i.
\end{align*}
Thus $v \in V_k$ if and only if $y = Sv$ belongs to
\begin{align*}
Y_k := \{y \in \mathbb{R}^p : y^\top u_i = 0 \text{ for } 1 \leq i < k\}.
\end{align*}
Because $u_1,\dots,u_p$ is an orthonormal basis, every $y \in Y_k$ has the form
\begin{align*}
y = \sum_{i=k}^p c_i u_i
\end{align*}
for real coefficients $c_k,\dots,c_p$. If $y \neq 0$, then not all these coefficients vanish. Using $Au_i = \lambda_i u_i$ and $u_i^\top u_j = \delta_{ij}$, we compute
\begin{align*}
y^\top Ay
&= \left(\sum_{i=k}^p c_i u_i\right)^\top A\left(\sum_{j=k}^p c_j u_j\right) \\
&= \sum_{i=k}^p \lambda_i c_i^2,
\end{align*}
and
\begin{align*}
y^\top y = \sum_{i=k}^p c_i^2.
\end{align*}
Hence
\begin{align*}
\frac{y^\top Ay}{y^\top y}
= \frac{\sum_{i=k}^p \lambda_i c_i^2}{\sum_{i=k}^p c_i^2}
\leq \lambda_k,
\end{align*}
because each $\lambda_i$ with $i \geq k$ satisfies $\lambda_i \leq \lambda_k$. Equality occurs at $y = u_k$. Translating back through $v = S^{-1}y$, equality occurs at $v = v_k$. Therefore $v_k$ is the $k$-th discriminant direction obtained by maximizing the between-to-within quotient subject to $W$-orthogonality to the preceding directions.
[/guided]
[/step]
[step:Bound the number of non-zero directions by the rank of $B$]
The non-zero discriminant directions are exactly those $v_i$ whose eigenvalue satisfies $\lambda_i > 0$. Since $A$ is symmetric, the number of positive eigenvalues of $A$ equals $\operatorname{rank}(A)$. Because $S^{-1}$ is invertible,
\begin{align*}
\operatorname{rank}(A)
= \operatorname{rank}(S^{-1}BS^{-1})
= \operatorname{rank}(B).
\end{align*}
Thus the number of non-zero discriminant directions is at most $\operatorname{rank}(B)$, and hence at most $p$.
[guided]
A discriminant direction is non-zero in the sense of contributing positive between-group variance relative to within-group variance exactly when its generalized eigenvalue is positive. For the directions constructed above,
\begin{align*}
R(v_i) = \frac{v_i^\top Bv_i}{v_i^\top Wv_i} = \lambda_i.
\end{align*}
Thus the number of non-zero discriminant directions is the number of positive eigenvalues among $\lambda_1,\dots,\lambda_p$.
Since $A$ is symmetric positive semidefinite, its eigenvalues are non-negative, and the number of positive eigenvalues equals $\operatorname{rank}(A)$. The transformation from $B$ to $A$ does not change rank, because multiplication by an invertible matrix on the left or right preserves rank:
\begin{align*}
\operatorname{rank}(A)
= \operatorname{rank}(S^{-1}BS^{-1})
= \operatorname{rank}(B).
\end{align*}
Therefore the number of non-zero discriminant directions is at most $\operatorname{rank}(B)$. Since $B$ is a $p \times p$ matrix, $\operatorname{rank}(B) \leq p$, so the ambient dimension also bounds the number of such directions.
[/guided]
[/step]
[step:Use the centered group means to obtain the $g-1$ rank bound]
Assume now that $B$ arises from $g$ group means. Define centered mean vectors $m_k \in \mathbb{R}^p$ by
\begin{align*}
m_k := \mu_k - \bar{\mu}
\end{align*}
for $1 \leq k \leq g$. The definition of $\bar{\mu}$ gives
\begin{align*}
\sum_{k=1}^g a_k m_k
= \sum_{k=1}^g a_k(\mu_k-\bar{\mu})
= \sum_{k=1}^g a_k\mu_k - \left(\sum_{k=1}^g a_k\right)\bar{\mu}
= 0.
\end{align*}
Since each $a_k > 0$, this is a non-trivial linear dependence among $m_1,\dots,m_g$. Hence
\begin{align*}
\dim \operatorname{span}\{m_1,\dots,m_g\} \leq g-1.
\end{align*}
For every $x \in \mathbb{R}^p$,
\begin{align*}
Bx
= \sum_{k=1}^g a_k m_k m_k^\top x
= \sum_{k=1}^g a_k(m_k^\top x)m_k.
\end{align*}
Thus $\operatorname{Range}(B) \subseteq \operatorname{span}\{m_1,\dots,m_g\}$, and therefore
\begin{align*}
\operatorname{rank}(B)
= \dim \operatorname{Range}(B)
\leq \dim \operatorname{span}\{m_1,\dots,m_g\}
\leq g-1.
\end{align*}
Combining this with the ambient bound $\operatorname{rank}(B) \leq p$, the number of non-zero discriminant directions is at most
\begin{align*}
\min(g-1,p).
\end{align*}
This proves the theorem.
[guided]
We now use the special structure of the between-group scatter matrix. Define the centered group mean vectors
\begin{align*}
m_k := \mu_k - \bar{\mu}
\end{align*}
for $1 \leq k \leq g$. The weighted mean $\bar{\mu}$ was defined by
\begin{align*}
\bar{\mu} = \frac{\sum_{k=1}^g a_k\mu_k}{\sum_{k=1}^g a_k}.
\end{align*}
Therefore the centered vectors satisfy the weighted cancellation identity
\begin{align*}
\sum_{k=1}^g a_k m_k
&= \sum_{k=1}^g a_k(\mu_k-\bar{\mu}) \\
&= \sum_{k=1}^g a_k\mu_k - \left(\sum_{k=1}^g a_k\right)\bar{\mu} \\
&= 0.
\end{align*}
Because every weight $a_k$ is positive, not all coefficients in this linear relation are zero. Hence the $g$ vectors $m_1,\dots,m_g$ are linearly dependent, and their span has dimension at most $g-1$:
\begin{align*}
\dim \operatorname{span}\{m_1,\dots,m_g\} \leq g-1.
\end{align*}
Next we show that $B$ cannot have range larger than this span. For any $x \in \mathbb{R}^p$,
\begin{align*}
Bx
&= \left(\sum_{k=1}^g a_k m_km_k^\top\right)x \\
&= \sum_{k=1}^g a_k m_k(m_k^\top x).
\end{align*}
This is a linear combination of $m_1,\dots,m_g$, so
\begin{align*}
\operatorname{Range}(B) \subseteq \operatorname{span}\{m_1,\dots,m_g\}.
\end{align*}
Taking dimensions gives
\begin{align*}
\operatorname{rank}(B)
= \dim \operatorname{Range}(B)
\leq \dim \operatorname{span}\{m_1,\dots,m_g\}
\leq g-1.
\end{align*}
From the previous step, the number of positive generalized eigenvalues, and therefore the number of non-zero discriminant directions, is at most $\operatorname{rank}(B)$. We also have $\operatorname{rank}(B) \leq p$ because $B$ is a $p \times p$ matrix. Combining the two bounds gives
\begin{align*}
\#\{i : \lambda_i > 0\} \leq \min(g-1,p).
\end{align*}
This completes the proof.
[/guided]
[/step]