[proofplan]
We first whiten and recenter the observations so that the problem becomes the corresponding statement for independent Gaussian vectors with identity covariance and common mean $\delta=\Sigma^{-1/2}(\mu-\mu_0)$. An orthogonal change of coordinates in the sample index separates the transformed sample mean from the transformed residuals, giving an independent noncentral Gaussian vector and a central Wishart matrix. The remaining distributional reduction is a block decomposition of a Wishart matrix: for $Z\sim \mathcal N_p(\theta,I_p)$ independent of $W\sim W_p(I_p,n-1)$, the quadratic form $Z^\top W^{-1}Z$ has the same law as $\chi^2_p(|\theta|^2)/\chi^2_{n-p}$ with independent numerator and denominator. This is exactly the defining representation of the noncentral $F$ distribution after multiplying by $(n-p)/p$.
[/proofplan]
[step:Whiten the observations and rewrite Hotelling's statistic]
Let $\Sigma^{1/2}$ denote the unique symmetric positive definite square root of $\Sigma$, and let $\Sigma^{-1/2}$ denote its inverse. Define
\begin{align*}
\delta &= \Sigma^{-1/2}(\mu-\mu_0) \in \mathbb{R}^p,
\end{align*}
and define transformed random vectors
\begin{align*}
Y_i : \Omega &\to \mathbb{R}^p \\
\omega &\mapsto \Sigma^{-1/2}(X_i(\omega)-\mu_0)
\end{align*}
for $i=1,\dots,n$. Since affine transformations of multivariate Gaussian random vectors are Gaussian, the random vectors $Y_1,\dots,Y_n$ are independent with common distribution $\mathcal N_p(\delta,I_p)$.
Define
\begin{align*}
\bar Y &= \frac{1}{n}\sum_{i=1}^n Y_i, \\
S_Y &= \frac{1}{n-1}\sum_{i=1}^n (Y_i-\bar Y)(Y_i-\bar Y)^\top.
\end{align*}
Then
\begin{align*}
\bar Y = \Sigma^{-1/2}(\bar X-\mu_0),
\end{align*}
and
\begin{align*}
S_Y = \Sigma^{-1/2}S\Sigma^{-1/2}.
\end{align*}
Since $S_Y^{-1}=\Sigma^{1/2}S^{-1}\Sigma^{1/2}$ whenever $S$ is invertible, Hotelling's statistic may be written as
\begin{align*}
T^2
&= n(\bar X-\mu_0)^\top S^{-1}(\bar X-\mu_0) \\
&= n\bar Y^\top S_Y^{-1}\bar Y.
\end{align*}
Thus it suffices to determine the distribution of $n\bar Y^\top S_Y^{-1}\bar Y$ in the standardized model.
[guided]
The first move is to remove the covariance matrix from the problem. Because $\Sigma$ is symmetric positive definite, it has a unique symmetric positive definite square root $\Sigma^{1/2}$, and $\Sigma^{-1/2}$ is well-defined. We introduce the vector
\begin{align*}
\delta = \Sigma^{-1/2}(\mu-\mu_0),
\end{align*}
which is the mean shift measured in covariance-standardized coordinates.
For each observation, define the whitened random vector
\begin{align*}
Y_i : \Omega &\to \mathbb{R}^p \\
\omega &\mapsto \Sigma^{-1/2}(X_i(\omega)-\mu_0).
\end{align*}
The transformation $x\mapsto \Sigma^{-1/2}(x-\mu_0)$ is affine. Therefore each $Y_i$ is Gaussian with mean
\begin{align*}
\mathbb{E}[Y_i] = \Sigma^{-1/2}(\mu-\mu_0)=\delta
\end{align*}
and covariance
\begin{align*}
\Sigma^{-1/2}\Sigma(\Sigma^{-1/2})^\top = I_p,
\end{align*}
because $\Sigma^{-1/2}$ is symmetric. Independence is preserved because each $Y_i$ is a measurable function of $X_i$ alone. Hence $Y_1,\dots,Y_n$ are independent with common distribution $\mathcal N_p(\delta,I_p)$.
Now define
\begin{align*}
\bar Y &= \frac{1}{n}\sum_{i=1}^n Y_i, \\
S_Y &= \frac{1}{n-1}\sum_{i=1}^n (Y_i-\bar Y)(Y_i-\bar Y)^\top.
\end{align*}
From the definition of $Y_i$,
\begin{align*}
\bar Y=\Sigma^{-1/2}(\bar X-\mu_0).
\end{align*}
Also,
\begin{align*}
Y_i-\bar Y=\Sigma^{-1/2}(X_i-\bar X),
\end{align*}
so
\begin{align*}
S_Y
&= \frac{1}{n-1}\sum_{i=1}^n \Sigma^{-1/2}(X_i-\bar X)(X_i-\bar X)^\top \Sigma^{-1/2} \\
&= \Sigma^{-1/2}S\Sigma^{-1/2}.
\end{align*}
Therefore, on the almost sure event where $S$ is invertible,
\begin{align*}
S_Y^{-1}=\Sigma^{1/2}S^{-1}\Sigma^{1/2}.
\end{align*}
Substituting these identities into Hotelling's statistic gives
\begin{align*}
T^2
&= n(\bar X-\mu_0)^\top S^{-1}(\bar X-\mu_0) \\
&= n\bar Y^\top S_Y^{-1}\bar Y.
\end{align*}
Thus the original statistic is unchanged by whitening, provided we express it using the whitened sample covariance matrix. The noncentrality parameter also becomes
\begin{align*}
\lambda
&= n(\mu-\mu_0)^\top \Sigma^{-1}(\mu-\mu_0) \\
&= n|\delta|^2.
\end{align*}
[/guided]
[/step]
[step:Separate the sample mean from the residual directions]
Choose an orthogonal matrix $Q\in \mathbb{R}^{n\times n}$ whose first row is
\begin{align*}
\left(\frac{1}{\sqrt n},\dots,\frac{1}{\sqrt n}\right).
\end{align*}
Define random vectors $Z_1,\dots,Z_n : \Omega \to \mathbb{R}^p$ by
\begin{align*}
Z_a = \sum_{i=1}^n Q_{ai}Y_i
\end{align*}
for $a=1,\dots,n$. Since $(Y_1,\dots,Y_n)$ is jointly Gaussian with independent components across the sample index and common covariance $I_p$, orthogonality of $Q$ implies that $Z_1,\dots,Z_n$ are independent Gaussian random vectors. Their means are
\begin{align*}
\mathbb{E}[Z_1] &= \sqrt n\,\delta, \\
\mathbb{E}[Z_a] &= 0 \qquad \text{for } a=2,\dots,n,
\end{align*}
and each has covariance matrix $I_p$. Hence
\begin{align*}
Z_1 \sim \mathcal N_p(\sqrt n\,\delta,I_p),
\end{align*}
while $Z_2,\dots,Z_n$ are independent $\mathcal N_p(0,I_p)$ random vectors, independent of $Z_1$.
Moreover,
\begin{align*}
Z_1=\sqrt n\,\bar Y,
\end{align*}
and orthogonality of $Q$ gives the residual sum-of-squares identity
\begin{align*}
\sum_{i=1}^n (Y_i-\bar Y)(Y_i-\bar Y)^\top
=
\sum_{a=2}^n Z_aZ_a^\top.
\end{align*}
Define
\begin{align*}
W = \sum_{a=2}^n Z_aZ_a^\top.
\end{align*}
Then $W$ is independent of $Z_1$, has the central Wishart distribution $W_p(I_p,n-1)$, and
\begin{align*}
S_Y = \frac{1}{n-1}W.
\end{align*}
Consequently,
\begin{align*}
T^2
= n\bar Y^\top S_Y^{-1}\bar Y
= (n-1)Z_1^\top W^{-1}Z_1.
\end{align*}
[guided]
The purpose of this step is to isolate the part of the data containing the sample mean from the part containing the sample covariance. We do this by rotating the $n$ observations in the sample-index direction.
Choose an orthogonal matrix $Q\in \mathbb{R}^{n\times n}$ whose first row is
\begin{align*}
\left(\frac{1}{\sqrt n},\dots,\frac{1}{\sqrt n}\right).
\end{align*}
For each $a=1,\dots,n$, define
\begin{align*}
Z_a=\sum_{i=1}^n Q_{ai}Y_i.
\end{align*}
This is a linear transformation of the jointly Gaussian vector $(Y_1,\dots,Y_n)$, so $(Z_1,\dots,Z_n)$ is jointly Gaussian. Orthogonality of $Q$ preserves the covariance in the sample-index direction. More explicitly, for $a,b\in\{1,\dots,n\}$,
\begin{align*}
\operatorname{Cov}(Z_a,Z_b)
&= \sum_{i=1}^n\sum_{j=1}^n Q_{ai}Q_{bj}\operatorname{Cov}(Y_i,Y_j) \\
&= \sum_{i=1}^n Q_{ai}Q_{bi} I_p \\
&= \delta_{ab}I_p,
\end{align*}
where $\delta_{ab}$ denotes the Kronecker delta. Since the vector is jointly Gaussian, zero covariance between distinct $Z_a$ and $Z_b$ implies independence.
The means are computed from $\mathbb{E}[Y_i]=\delta$:
\begin{align*}
\mathbb{E}[Z_a]
&= \sum_{i=1}^n Q_{ai}\delta \\
&= \left(\sum_{i=1}^n Q_{ai}\right)\delta.
\end{align*}
For $a=1$, the row sum is $\sqrt n$, so $\mathbb{E}[Z_1]=\sqrt n\,\delta$. For $a\ge 2$, the row of $Q$ is orthogonal to the first row, hence its entries sum to $0$, so $\mathbb{E}[Z_a]=0$. Therefore
\begin{align*}
Z_1\sim \mathcal N_p(\sqrt n\,\delta,I_p),
\end{align*}
and $Z_2,\dots,Z_n$ are independent central $\mathcal N_p(0,I_p)$ random vectors, independent of $Z_1$.
The first transformed vector is the scaled sample mean:
\begin{align*}
Z_1=\sum_{i=1}^n \frac{1}{\sqrt n}Y_i=\sqrt n\,\bar Y.
\end{align*}
The remaining transformed vectors span the orthogonal complement of the constant sample direction, so they contain exactly the centered residual information. Orthogonality of $Q$ gives
\begin{align*}
\sum_{i=1}^n Y_iY_i^\top
=
\sum_{a=1}^n Z_aZ_a^\top.
\end{align*}
Since $Z_1Z_1^\top=n\bar Y\bar Y^\top$, subtracting this rank-one mean part gives
\begin{align*}
\sum_{i=1}^n (Y_i-\bar Y)(Y_i-\bar Y)^\top
=
\sum_{a=2}^n Z_aZ_a^\top.
\end{align*}
Define
\begin{align*}
W=\sum_{a=2}^n Z_aZ_a^\top.
\end{align*}
Because $Z_2,\dots,Z_n$ are independent $\mathcal N_p(0,I_p)$ random vectors, $W$ has the central Wishart distribution $W_p(I_p,n-1)$. It is independent of $Z_1$ because it is a measurable function of $Z_2,\dots,Z_n$, which are independent of $Z_1$.
Finally,
\begin{align*}
S_Y=\frac{1}{n-1}W,
\end{align*}
and therefore
\begin{align*}
T^2
&= n\bar Y^\top S_Y^{-1}\bar Y \\
&= Z_1^\top \left(\frac{1}{n-1}W\right)^{-1}Z_1 \\
&= (n-1)Z_1^\top W^{-1}Z_1.
\end{align*}
This reduces the theorem to a distributional statement about one noncentral Gaussian vector independent of one central Wishart matrix.
[/guided]
[/step]
[step:Reduce the inverse Wishart quadratic form to independent chi-square variables]
Let $m=n-1$ and let $\nu=m-p+1=n-p$. We prove the following distributional fact in the present setting: if $Z:\Omega\to\mathbb{R}^p$ satisfies $Z\sim\mathcal N_p(\theta,I_p)$, if $W\sim W_p(I_p,m)$ is independent of $Z$, and if $m\ge p$, then
\begin{align*}
Z^\top W^{-1}Z \sim \frac{U}{V},
\end{align*}
where $U\sim\chi^2_p(|\theta|^2)$, $V\sim\chi^2_{\nu}$, and $U,V$ are independent.
Condition on the value $Z=z\in\mathbb{R}^p$. If $z=0$, then $z^\top W^{-1}z=0$. If $z\neq 0$, choose an orthogonal matrix $R_z\in\mathbb{R}^{p\times p}$ such that $R_zz=|z|e_1$, where $e_1=(1,0,\dots,0)^\top$. Since $W\sim W_p(I_p,m)$ is orthogonally invariant, $R_zWR_z^\top$ has the same distribution as $W$. Therefore the conditional distribution of $z^\top W^{-1}z$ equals the distribution of
\begin{align*}
|z|^2 e_1^\top W^{-1}e_1.
\end{align*}
Represent $W$ as
\begin{align*}
W=\sum_{r=1}^m G_rG_r^\top,
\end{align*}
where $G_1,\dots,G_m:\Omega\to\mathbb{R}^p$ are independent $\mathcal N_p(0,I_p)$ random vectors. Let $A\in\mathbb{R}^{m\times p}$ be the random matrix whose $r$-th row is $G_r^\top$, so $W=A^\top A$. Write $A=(a,B)$, where $a:\Omega\to\mathbb{R}^m$ is the first column and $B:\Omega\to\mathbb{R}^{m\times(p-1)}$ is the matrix of the remaining columns. Then $a$ is a standard Gaussian vector in $\mathbb{R}^m$, $B$ is independent of $a$, and almost surely $B$ has rank $p-1$.
The Schur complement formula for the inverse of $W=A^\top A$ gives
\begin{align*}
e_1^\top W^{-1}e_1
=
\frac{1}{a^\top (I_m-P_B)a},
\end{align*}
where $I_m$ is the identity matrix on $\mathbb{R}^m$ and $P_B:\mathbb{R}^m\to\mathbb{R}^m$ is the [orthogonal projection](/theorems/437) onto the column space of $B$. Conditional on $B$, the matrix $I_m-P_B$ is the orthogonal projection onto a subspace of dimension
\begin{align*}
m-(p-1)=m-p+1=\nu.
\end{align*}
Since $a\sim\mathcal N_m(0,I_m)$ and $a$ is independent of $B$, the quadratic form
\begin{align*}
a^\top(I_m-P_B)a
\end{align*}
has the $\chi^2_{\nu}$ distribution, conditionally on $B$ and hence unconditionally. Denote this random variable by $V$. Its conditional distribution does not depend on $z$, and $W$ is independent of $Z$, so $V$ may be chosen independent of $Z$.
Thus, conditionally on $Z=z$,
\begin{align*}
Z^\top W^{-1}Z \sim \frac{|z|^2}{V},
\end{align*}
with $V\sim\chi^2_{\nu}$ independent of $Z$. Since $Z\sim\mathcal N_p(\theta,I_p)$, the squared norm
\begin{align*}
U=|Z|^2
\end{align*}
has the noncentral chi-square distribution $\chi^2_p(|\theta|^2)$. Hence
\begin{align*}
Z^\top W^{-1}Z \sim \frac{U}{V},
\end{align*}
with $U$ and $V$ independent.
[guided]
We now need the exact distribution of the quadratic form $Z^\top W^{-1}Z$, where $Z$ is Gaussian and $W$ is an independent central Wishart matrix. The key point is that, after conditioning on $Z=z$, the Wishart matrix is rotationally symmetric. This lets us rotate $z$ onto the first coordinate axis and compute only the upper-left entry of $W^{-1}$.
Let $m=n-1$ and $\nu=m-p+1=n-p$. Consider the general situation where
\begin{align*}
Z\sim\mathcal N_p(\theta,I_p),
\end{align*}
where $\theta\in\mathbb{R}^p$, and where $W\sim W_p(I_p,m)$ is independent of $Z$. Since $m=n-1\ge p$, the Wishart matrix $W$ is invertible almost surely.
Fix $z\in\mathbb{R}^p$ and condition on the event $Z=z$ in the regular conditional distribution sense. If $z=0$, then $z^\top W^{-1}z=0$, so there is nothing to rotate. If $z\neq 0$, choose an orthogonal matrix $R_z\in\mathbb{R}^{p\times p}$ satisfying
\begin{align*}
R_zz=|z|e_1,
\end{align*}
where $e_1=(1,0,\dots,0)^\top$. The central Wishart distribution with scale matrix $I_p$ is invariant under orthogonal conjugation: $R_zWR_z^\top$ has the same distribution as $W$. Therefore
\begin{align*}
z^\top W^{-1}z
&= (R_zz)^\top (R_zWR_z^\top)^{-1}(R_zz) \\
&\sim |z|^2 e_1^\top W^{-1}e_1
\end{align*}
conditionally on $Z=z$.
It remains to compute $e_1^\top W^{-1}e_1$. Represent the Wishart matrix as
\begin{align*}
W=\sum_{r=1}^m G_rG_r^\top,
\end{align*}
where $G_1,\dots,G_m$ are independent $\mathcal N_p(0,I_p)$ random vectors. Equivalently, if $A\in\mathbb{R}^{m\times p}$ is the matrix whose $r$-th row is $G_r^\top$, then
\begin{align*}
W=A^\top A.
\end{align*}
Write the columns of $A$ as
\begin{align*}
A=(a,B),
\end{align*}
where $a:\Omega\to\mathbb{R}^m$ is the first column and $B:\Omega\to\mathbb{R}^{m\times(p-1)}$ contains the remaining columns. The entries of $A$ are independent standard normal random variables, so $a$ is a standard Gaussian vector in $\mathbb{R}^m$, $B$ is independent of $a$, and $B$ has rank $p-1$ almost surely because $m\ge p$.
With respect to this block decomposition,
\begin{align*}
W=A^\top A
=
\begin{pmatrix}
a^\top a & a^\top B \\
B^\top a & B^\top B
\end{pmatrix}.
\end{align*}
The Schur complement formula says that the upper-left entry of $W^{-1}$ is the reciprocal of the Schur complement of $B^\top B$:
\begin{align*}
e_1^\top W^{-1}e_1
=
\frac{1}{a^\top a-a^\top B(B^\top B)^{-1}B^\top a}.
\end{align*}
Let $P_B:\mathbb{R}^m\to\mathbb{R}^m$ be the orthogonal projection onto the column space of $B$. Since
\begin{align*}
P_B=B(B^\top B)^{-1}B^\top
\end{align*}
almost surely, the denominator becomes
\begin{align*}
a^\top(I_m-P_B)a.
\end{align*}
Conditional on $B$, the operator $I_m-P_B$ is the orthogonal projection onto the orthogonal complement of the column space of $B$. Since $B$ has rank $p-1$ almost surely, this orthogonal complement has dimension
\begin{align*}
m-(p-1)=m-p+1=\nu.
\end{align*}
Because $a\sim\mathcal N_m(0,I_m)$ and is independent of $B$, the squared Euclidean length of the projection of $a$ onto this $\nu$-dimensional subspace has the $\chi^2_\nu$ distribution. Thus
\begin{align*}
a^\top(I_m-P_B)a\sim\chi^2_\nu.
\end{align*}
Call this denominator $V$.
The conditional distribution of $V$ does not depend on $z$, and the entire Wishart construction is independent of $Z$. Therefore $V$ is independent of $Z$. We have shown that, conditionally on $Z=z$,
\begin{align*}
Z^\top W^{-1}Z \sim \frac{|z|^2}{V},
\end{align*}
where $V\sim\chi^2_\nu$ is independent of $Z$. Finally, since $Z\sim\mathcal N_p(\theta,I_p)$, its squared norm
\begin{align*}
U=|Z|^2
\end{align*}
has the noncentral chi-square distribution $\chi^2_p(|\theta|^2)$. Hence
\begin{align*}
Z^\top W^{-1}Z \sim \frac{U}{V},
\end{align*}
where $U\sim\chi^2_p(|\theta|^2)$, $V\sim\chi^2_\nu$, and $U,V$ are independent.
[/guided]
[/step]
[step:Identify the noncentral F distribution]
Apply the previous step with
\begin{align*}
Z=Z_1,\qquad \theta=\sqrt n\,\delta,\qquad m=n-1,\qquad \nu=n-p.
\end{align*}
Then
\begin{align*}
|\theta|^2
=
n|\delta|^2
=
n(\mu-\mu_0)^\top\Sigma^{-1}(\mu-\mu_0)
=
\lambda.
\end{align*}
Therefore there exist independent random variables
\begin{align*}
U &\sim \chi^2_p(\lambda), \\
V &\sim \chi^2_{n-p}
\end{align*}
such that
\begin{align*}
Z_1^\top W^{-1}Z_1 \sim \frac{U}{V}.
\end{align*}
Using $T^2=(n-1)Z_1^\top W^{-1}Z_1$, we obtain
\begin{align*}
\frac{n-p}{p(n-1)}T^2
&= \frac{n-p}{p}Z_1^\top W^{-1}Z_1 \\
&\sim \frac{U/p}{V/(n-p)}.
\end{align*}
By definition, the distribution of $(U/p)/(V/(n-p))$, with $U\sim\chi^2_p(\lambda)$ independent of $V\sim\chi^2_{n-p}$, is the noncentral $F$ distribution $F_{p,n-p}(\lambda)$. Hence
\begin{align*}
\frac{n-p}{p(n-1)}T^2 \sim F_{p,n-p}(\lambda).
\end{align*}
This proves the theorem.
[guided]
We now substitute the objects obtained from the data into the distributional reduction. In our case,
\begin{align*}
Z=Z_1,\qquad \theta=\sqrt n\,\delta,\qquad m=n-1.
\end{align*}
The denominator degrees of freedom from the previous step are
\begin{align*}
\nu=m-p+1=(n-1)-p+1=n-p.
\end{align*}
The noncentrality parameter of $|Z_1|^2$ is the squared length of its mean:
\begin{align*}
|\theta|^2
&= |\sqrt n\,\delta|^2 \\
&= n|\delta|^2 \\
&= n(\mu-\mu_0)^\top\Sigma^{-1}(\mu-\mu_0) \\
&= \lambda.
\end{align*}
Therefore the previous step gives independent random variables
\begin{align*}
U &\sim \chi^2_p(\lambda), \\
V &\sim \chi^2_{n-p}
\end{align*}
such that
\begin{align*}
Z_1^\top W^{-1}Z_1 \sim \frac{U}{V}.
\end{align*}
From the sample-mean and residual decomposition, we already proved
\begin{align*}
T^2=(n-1)Z_1^\top W^{-1}Z_1.
\end{align*}
Multiplying by the normalizing factor in the theorem gives
\begin{align*}
\frac{n-p}{p(n-1)}T^2
&= \frac{n-p}{p}Z_1^\top W^{-1}Z_1 \\
&\sim \frac{n-p}{p}\frac{U}{V} \\
&= \frac{U/p}{V/(n-p)}.
\end{align*}
The defining representation of the noncentral $F$ distribution with numerator degrees of freedom $p$, denominator degrees of freedom $n-p$, and noncentrality parameter $\lambda$ is precisely
\begin{align*}
\frac{U/p}{V/(n-p)},
\end{align*}
where $U\sim\chi^2_p(\lambda)$ and $V\sim\chi^2_{n-p}$ are independent. Hence
\begin{align*}
\frac{n-p}{p(n-1)}T^2\sim F_{p,n-p}(\lambda).
\end{align*}
This is the desired distributional identity.
[/guided]
[/step]