[proofplan]
We maximize the log-likelihood, since the logarithm is strictly increasing on $(0,\infty)$. First, for fixed positive definite covariance matrix $\Sigma$, we complete the square in $\mu$ around the sample mean and show that the unique maximizer in $\mu$ is $\bar{x}$. Substituting $\mu=\bar{x}$ reduces the problem to maximizing a function of $\Sigma$ alone. Finally, we whiten by the positive definite square root of the scatter matrix and reduce the covariance optimization to a scalar inequality over the eigenvalues of a positive definite matrix.
[/proofplan]
[step:Write the log-likelihood as a quadratic optimization problem]
Let $\mathbb{S}_{++}^p$ denote the set of symmetric positive definite $p \times p$ real matrices. Since $L(\mu,\Sigma)>0$ for all $(\mu,\Sigma)\in \mathbb{R}^p \times \mathbb{S}_{++}^p$ and the logarithm is strictly increasing, the maximizers of $L$ are exactly the maximizers of the log-likelihood function
\begin{align*}
\ell:\mathbb{R}^p \times \mathbb{S}_{++}^p &\to \mathbb{R},\\
(\mu,\Sigma) &\mapsto \log L(\mu,\Sigma).
\end{align*}
Expanding the product defining $L$ gives
\begin{align*}
\ell(\mu,\Sigma)
&= -\frac{np}{2}\log(2\pi)-\frac{n}{2}\log\det\Sigma
-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu).
\end{align*}
Thus maximizing $\ell$ is equivalent to minimizing the quadratic term in $\mu$ for fixed $\Sigma$, and then maximizing the remaining expression in $\Sigma$.
[guided]
We first replace the likelihood by its logarithm. This loses no information about maximizers because $L(\mu,\Sigma)$ is strictly positive and $\log:(0,\infty)\to\mathbb{R}$ is strictly increasing. Define
\begin{align*}
\ell:\mathbb{R}^p \times \mathbb{S}_{++}^p &\to \mathbb{R},\\
(\mu,\Sigma) &\mapsto \log L(\mu,\Sigma).
\end{align*}
Using the definition of $L$, the logarithm turns the product over observations into a sum and gives
\begin{align*}
\ell(\mu,\Sigma)
&= \sum_{i=1}^n \left[
-\frac{p}{2}\log(2\pi)-\frac{1}{2}\log\det\Sigma
-\frac{1}{2}(x_i-\mu)^\top \Sigma^{-1}(x_i-\mu)
\right] \\
&= -\frac{np}{2}\log(2\pi)-\frac{n}{2}\log\det\Sigma
-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu).
\end{align*}
The first term is constant, the second term depends only on $\Sigma$, and the last term couples $\mu$ and $\Sigma$. The next step isolates the dependence on $\mu$ by expanding around the sample mean.
[/guided]
[/step]
[step:Maximize over the mean by completing the square around $\bar{x}$]
Fix $\Sigma \in \mathbb{S}_{++}^p$, and define $A:=\Sigma^{-1} \in \mathbb{S}_{++}^p$. For each $i\in\{1,\dots,n\}$,
\begin{align*}
x_i-\mu = (x_i-\bar{x})+(\bar{x}-\mu).
\end{align*}
Therefore
\begin{align*}
\sum_{i=1}^n (x_i-\mu)^\top A(x_i-\mu)
&= \sum_{i=1}^n (x_i-\bar{x})^\top A(x_i-\bar{x}) \\
&\quad +2\sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu)
+n(\bar{x}-\mu)^\top A(\bar{x}-\mu).
\end{align*}
Since
\begin{align*}
\sum_{i=1}^n (x_i-\bar{x})
= \sum_{i=1}^n x_i-n\bar{x}
=0,
\end{align*}
the cross term vanishes:
\begin{align*}
\sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu)
=
\left(\sum_{i=1}^n (x_i-\bar{x})\right)^\top A(\bar{x}-\mu)
=0.
\end{align*}
Hence
\begin{align*}
\sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu)
&= \sum_{i=1}^n (x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x})
+n(\bar{x}-\mu)^\top \Sigma^{-1}(\bar{x}-\mu).
\end{align*}
Because $\Sigma^{-1}$ is positive definite, the final term is nonnegative and equals $0$ exactly when $\mu=\bar{x}$. Thus, for each fixed $\Sigma\in\mathbb{S}_{++}^p$, the unique maximizer of $\ell(\mu,\Sigma)$ over $\mu\in\mathbb{R}^p$ is $\mu=\bar{x}$.
[guided]
Fix a covariance matrix $\Sigma \in \mathbb{S}_{++}^p$. Its inverse
\begin{align*}
A:=\Sigma^{-1}
\end{align*}
is again symmetric positive definite. We want to find the value of $\mu$ that makes the quadratic penalty
\begin{align*}
\sum_{i=1}^n (x_i-\mu)^\top A(x_i-\mu)
\end{align*}
as small as possible.
The natural reference point is the sample mean $\bar{x}$, because the centred residuals sum to zero. For each observation,
\begin{align*}
x_i-\mu = (x_i-\bar{x})+(\bar{x}-\mu).
\end{align*}
Substituting this into the quadratic form and using bilinearity gives
\begin{align*}
\sum_{i=1}^n (x_i-\mu)^\top A(x_i-\mu)
&= \sum_{i=1}^n (x_i-\bar{x})^\top A(x_i-\bar{x}) \\
&\quad +2\sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu)
+n(\bar{x}-\mu)^\top A(\bar{x}-\mu).
\end{align*}
The middle term vanishes because
\begin{align*}
\sum_{i=1}^n (x_i-\bar{x})
= \sum_{i=1}^n x_i-n\bar{x}
=0.
\end{align*}
Therefore
\begin{align*}
\sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu)
=
\left(\sum_{i=1}^n (x_i-\bar{x})\right)^\top A(\bar{x}-\mu)
=0.
\end{align*}
We have proved the decomposition
\begin{align*}
\sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu)
&= \sum_{i=1}^n (x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x})
+n(\bar{x}-\mu)^\top \Sigma^{-1}(\bar{x}-\mu).
\end{align*}
The first term does not depend on $\mu$. The second term is nonnegative because $\Sigma^{-1}$ is positive definite, and it is zero exactly when $\bar{x}-\mu=0$. Hence the quadratic penalty is uniquely minimized at $\mu=\bar{x}$, so the log-likelihood is uniquely maximized in the mean variable at $\mu=\bar{x}$.
[/guided]
[/step]
[step:Reduce the covariance problem to a trace and determinant expression]
Set $\mu=\bar{x}$. Since
\begin{align*}
W=\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top,
\end{align*}
the cyclic trace identity gives
\begin{align*}
\sum_{i=1}^n (x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x})
&= \sum_{i=1}^n \operatorname{tr}\left((x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x})\right) \\
&= \sum_{i=1}^n \operatorname{tr}\left(\Sigma^{-1}(x_i-\bar{x})(x_i-\bar{x})^\top\right) \\
&= \operatorname{tr}(\Sigma^{-1}W).
\end{align*}
Thus, up to the additive constant $-\frac{np}{2}\log(2\pi)$, the covariance log-likelihood is
\begin{align*}
\ell(\bar{x},\Sigma)
&= -\frac{np}{2}\log(2\pi)
-\frac{n}{2}\log\det\Sigma
-\frac{1}{2}\operatorname{tr}(\Sigma^{-1}W).
\end{align*}
It remains to maximize
\begin{align*}
\Phi:\mathbb{S}_{++}^p &\to \mathbb{R},\\
\Sigma &\mapsto -\frac{n}{2}\log\det\Sigma
-\frac{1}{2}\operatorname{tr}(\Sigma^{-1}W).
\end{align*}
[/step]
[step:Whiten the covariance variable using the positive definite scatter matrix]
Because $W$ is positive definite, there is a symmetric positive definite matrix $W^{1/2}\in\mathbb{R}^{p\times p}$ with $W^{1/2}W^{1/2}=W$. Define
\begin{align*}
Y:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\
\Sigma &\mapsto W^{-1/2}\Sigma W^{-1/2}.
\end{align*}
This map is bijective, with inverse
\begin{align*}
Y^{-1}:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\
B &\mapsto W^{1/2}BW^{1/2}.
\end{align*}
Let $B:=W^{-1/2}\Sigma W^{-1/2}$. Then $\Sigma=W^{1/2}BW^{1/2}$, and [determinant multiplicativity](/theorems/395) gives
\begin{align*}
\det\Sigma
&= \det(W^{1/2})\det B\det(W^{1/2})
= \det W\,\det B.
\end{align*}
Also,
\begin{align*}
\Sigma^{-1}
&= W^{-1/2}B^{-1}W^{-1/2},
\end{align*}
so the cyclic trace identity gives
\begin{align*}
\operatorname{tr}(\Sigma^{-1}W)
&= \operatorname{tr}(W^{-1/2}B^{-1}W^{-1/2}W) \\
&= \operatorname{tr}(W^{-1/2}B^{-1}W^{1/2}) \\
&= \operatorname{tr}(B^{-1}).
\end{align*}
Therefore maximizing $\Phi(\Sigma)$ over $\Sigma\in\mathbb{S}_{++}^p$ is equivalent to maximizing
\begin{align*}
\Psi:\mathbb{S}_{++}^p &\to \mathbb{R},\\
B &\mapsto -\frac{n}{2}\log\det B-\frac{1}{2}\operatorname{tr}(B^{-1}),
\end{align*}
because the term $-\frac{n}{2}\log\det W$ is constant in $B$.
[guided]
The covariance expression contains $W$ in the trace term. Since $W$ is positive definite by hypothesis, we can remove it by a change of variables. Let $W^{1/2}$ be the symmetric positive definite square root of $W$, so that $W^{1/2}W^{1/2}=W$, and define
\begin{align*}
Y:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\
\Sigma &\mapsto W^{-1/2}\Sigma W^{-1/2}.
\end{align*}
This is a bijection because its inverse is
\begin{align*}
Y^{-1}:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\
B &\mapsto W^{1/2}BW^{1/2}.
\end{align*}
Thus no covariance matrices are lost and none are counted twice.
Put
\begin{align*}
B:=W^{-1/2}\Sigma W^{-1/2}.
\end{align*}
Then $\Sigma=W^{1/2}BW^{1/2}$. For the determinant term, multiplicativity of determinant gives
\begin{align*}
\det\Sigma
&= \det(W^{1/2})\det B\det(W^{1/2})
= \det W\,\det B.
\end{align*}
For the trace term, first invert $\Sigma=W^{1/2}BW^{1/2}$:
\begin{align*}
\Sigma^{-1}
&= W^{-1/2}B^{-1}W^{-1/2}.
\end{align*}
Then use cyclic invariance of trace:
\begin{align*}
\operatorname{tr}(\Sigma^{-1}W)
&= \operatorname{tr}(W^{-1/2}B^{-1}W^{-1/2}W) \\
&= \operatorname{tr}(W^{-1/2}B^{-1}W^{1/2}) \\
&= \operatorname{tr}(B^{-1}W^{1/2}W^{-1/2}) \\
&= \operatorname{tr}(B^{-1}).
\end{align*}
Therefore the part of the log-likelihood depending on covariance becomes
\begin{align*}
-\frac{n}{2}\log\det W
-\frac{n}{2}\log\det B
-\frac{1}{2}\operatorname{tr}(B^{-1}).
\end{align*}
The first term is constant in $B$, so the covariance maximization is equivalent to maximizing
\begin{align*}
\Psi:\mathbb{S}_{++}^p &\to \mathbb{R},\\
B &\mapsto -\frac{n}{2}\log\det B-\frac{1}{2}\operatorname{tr}(B^{-1}).
\end{align*}
This whitening step is useful because the remaining expression depends only on the eigenvalues of $B$.
[/guided]
[/step]
[step:Maximize the whitened covariance objective eigenvalue by eigenvalue]
Let $B\in\mathbb{S}_{++}^p$. Choose an orthogonal matrix $Q\in\mathbb{R}^{p\times p}$ and positive [real numbers](/page/Real%20Numbers) $\lambda_1,\dots,\lambda_p$ such that
\begin{align*}
B=Q\operatorname{diag}(\lambda_1,\dots,\lambda_p)Q^\top.
\end{align*}
Then
\begin{align*}
\det B &= \prod_{j=1}^p \lambda_j, &
\operatorname{tr}(B^{-1}) &= \sum_{j=1}^p \frac{1}{\lambda_j}.
\end{align*}
Hence
\begin{align*}
\Psi(B)
&= \sum_{j=1}^p \left(-\frac{n}{2}\log\lambda_j-\frac{1}{2\lambda_j}\right).
\end{align*}
Define
\begin{align*}
\psi:(0,\infty)&\to\mathbb{R},\\
t&\mapsto -\frac{n}{2}\log t-\frac{1}{2t}.
\end{align*}
Then
\begin{align*}
\psi'(t)
&= -\frac{n}{2t}+\frac{1}{2t^2}
= \frac{1-nt}{2t^2},
\end{align*}
so the only critical point is $t=1/n$. Moreover,
\begin{align*}
\psi'(t)>0 \quad \text{for } 0<t<\frac{1}{n},
\qquad
\psi'(t)<0 \quad \text{for } t>\frac{1}{n}.
\end{align*}
Thus $\psi$ has a unique global maximum at $t=1/n$. Therefore $\Psi(B)$ has a unique global maximum when
\begin{align*}
\lambda_1=\cdots=\lambda_p=\frac{1}{n},
\end{align*}
that is, when
\begin{align*}
B=\frac{1}{n}I_p.
\end{align*}
[guided]
Because $B$ is symmetric positive definite, it has an orthogonal eigenvalue decomposition. Thus there exist an orthogonal matrix $Q\in\mathbb{R}^{p\times p}$ and positive eigenvalues $\lambda_1,\dots,\lambda_p$ such that
\begin{align*}
B=Q\operatorname{diag}(\lambda_1,\dots,\lambda_p)Q^\top.
\end{align*}
The positivity of the eigenvalues is exactly where positive definiteness of $B$ is used.
The determinant and trace of the inverse are then
\begin{align*}
\det B &= \prod_{j=1}^p \lambda_j, &
\operatorname{tr}(B^{-1}) &= \sum_{j=1}^p \frac{1}{\lambda_j}.
\end{align*}
Substituting these formulas into $\Psi$ gives
\begin{align*}
\Psi(B)
&= -\frac{n}{2}\sum_{j=1}^p \log\lambda_j
-\frac{1}{2}\sum_{j=1}^p \frac{1}{\lambda_j} \\
&= \sum_{j=1}^p \left(-\frac{n}{2}\log\lambda_j-\frac{1}{2\lambda_j}\right).
\end{align*}
So the matrix maximization has separated into $p$ identical one-variable maximization problems.
Define
\begin{align*}
\psi:(0,\infty)&\to\mathbb{R},\\
t&\mapsto -\frac{n}{2}\log t-\frac{1}{2t}.
\end{align*}
Differentiating gives
\begin{align*}
\psi'(t)
&= -\frac{n}{2t}+\frac{1}{2t^2}
= \frac{1-nt}{2t^2}.
\end{align*}
Since $2t^2>0$ for $t>0$, the sign of $\psi'(t)$ is the sign of $1-nt$. Hence
\begin{align*}
\psi'(t)>0 \quad \text{for } 0<t<\frac{1}{n},
\qquad
\psi'(t)<0 \quad \text{for } t>\frac{1}{n}.
\end{align*}
Thus $\psi$ strictly increases up to $1/n$ and strictly decreases after $1/n$. It has a unique global maximum at $t=1/n$.
Since
\begin{align*}
\Psi(B)=\sum_{j=1}^p \psi(\lambda_j),
\end{align*}
the sum is maximized uniquely when every eigenvalue satisfies $\lambda_j=1/n$. Therefore
\begin{align*}
B=Q\operatorname{diag}\left(\frac{1}{n},\dots,\frac{1}{n}\right)Q^\top
=\frac{1}{n}I_p.
\end{align*}
[/guided]
[/step]
[step:Transform back to obtain the covariance estimator]
The unique maximizer of $\Psi$ is $B=\frac{1}{n}I_p$. Since $\Sigma=W^{1/2}BW^{1/2}$, the corresponding covariance matrix is
\begin{align*}
\hat{\Sigma}
&= W^{1/2}\left(\frac{1}{n}I_p\right)W^{1/2}
= \frac{1}{n}W.
\end{align*}
The assumption that $W$ is positive definite implies $\hat{\Sigma}=\frac{1}{n}W\in\mathbb{S}_{++}^p$, so this matrix is admissible in the nonsingular multivariate normal model. Combining this with the unique maximization over $\mu$ gives the unique maximum likelihood estimators
\begin{align*}
\hat{\mu} &= \bar{x}, &
\hat{\Sigma} &= \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top.
\end{align*}
This proves the theorem.
[/step]