Maximum Likelihood Estimators for the Multivariate Normal Distribution

Maximum Likelihood Estimators for the Multivariate Normal Distribution (Theorem # 4011)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We maximize the log-likelihood, since the logarithm is strictly increasing on $(0,\infty)$. First, for fixed positive definite covariance matrix $\Sigma$, we complete the square in $\mu$ around the sample mean and show that the unique maximizer in $\mu$ is $\bar{x}$. Substituting $\mu=\bar{x}$ reduces the problem to maximizing a function of $\Sigma$ alone. Finally, we whiten by the positive definite square root of the scatter matrix and reduce the covariance optimization to a scalar inequality over the eigenvalues of a positive definite matrix. [/proofplan] [step:Write the log-likelihood as a quadratic optimization problem] Let $\mathbb{S}_{++}^p$ denote the set of symmetric positive definite $p \times p$ real matrices. Since $L(\mu,\Sigma)>0$ for all $(\mu,\Sigma)\in \mathbb{R}^p \times \mathbb{S}_{++}^p$ and the logarithm is strictly increasing, the maximizers of $L$ are exactly the maximizers of the log-likelihood function \begin{align*} \ell:\mathbb{R}^p \times \mathbb{S}_{++}^p &\to \mathbb{R},\\ (\mu,\Sigma) &\mapsto \log L(\mu,\Sigma). \end{align*} Expanding the product defining $L$ gives \begin{align*} \ell(\mu,\Sigma) &= -\frac{np}{2}\log(2\pi)-\frac{n}{2}\log\det\Sigma -\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu). \end{align*} Thus maximizing $\ell$ is equivalent to minimizing the quadratic term in $\mu$ for fixed $\Sigma$, and then maximizing the remaining expression in $\Sigma$. [guided] We first replace the likelihood by its logarithm. This loses no information about maximizers because $L(\mu,\Sigma)$ is strictly positive and $\log:(0,\infty)\to\mathbb{R}$ is strictly increasing. Define \begin{align*} \ell:\mathbb{R}^p \times \mathbb{S}_{++}^p &\to \mathbb{R},\\ (\mu,\Sigma) &\mapsto \log L(\mu,\Sigma). \end{align*} Using the definition of $L$, the logarithm turns the product over observations into a sum and gives \begin{align*} \ell(\mu,\Sigma) &= \sum_{i=1}^n \left[ -\frac{p}{2}\log(2\pi)-\frac{1}{2}\log\det\Sigma -\frac{1}{2}(x_i-\mu)^\top \Sigma^{-1}(x_i-\mu) \right] \\ &= -\frac{np}{2}\log(2\pi)-\frac{n}{2}\log\det\Sigma -\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu). \end{align*} The first term is constant, the second term depends only on $\Sigma$, and the last term couples $\mu$ and $\Sigma$. The next step isolates the dependence on $\mu$ by expanding around the sample mean. [/guided] [/step] [step:Maximize over the mean by completing the square around $\bar{x}$] Fix $\Sigma \in \mathbb{S}_{++}^p$, and define $A:=\Sigma^{-1} \in \mathbb{S}_{++}^p$. For each $i\in\{1,\dots,n\}$, \begin{align*} x_i-\mu = (x_i-\bar{x})+(\bar{x}-\mu). \end{align*} Therefore \begin{align*} \sum_{i=1}^n (x_i-\mu)^\top A(x_i-\mu) &= \sum_{i=1}^n (x_i-\bar{x})^\top A(x_i-\bar{x}) \\ &\quad +2\sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu) +n(\bar{x}-\mu)^\top A(\bar{x}-\mu). \end{align*} Since \begin{align*} \sum_{i=1}^n (x_i-\bar{x}) = \sum_{i=1}^n x_i-n\bar{x} =0, \end{align*} the cross term vanishes: \begin{align*} \sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu) = \left(\sum_{i=1}^n (x_i-\bar{x})\right)^\top A(\bar{x}-\mu) =0. \end{align*} Hence \begin{align*} \sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu) &= \sum_{i=1}^n (x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x}) +n(\bar{x}-\mu)^\top \Sigma^{-1}(\bar{x}-\mu). \end{align*} Because $\Sigma^{-1}$ is positive definite, the final term is nonnegative and equals $0$ exactly when $\mu=\bar{x}$. Thus, for each fixed $\Sigma\in\mathbb{S}_{++}^p$, the unique maximizer of $\ell(\mu,\Sigma)$ over $\mu\in\mathbb{R}^p$ is $\mu=\bar{x}$. [guided] Fix a covariance matrix $\Sigma \in \mathbb{S}_{++}^p$. Its inverse \begin{align*} A:=\Sigma^{-1} \end{align*} is again symmetric positive definite. We want to find the value of $\mu$ that makes the quadratic penalty \begin{align*} \sum_{i=1}^n (x_i-\mu)^\top A(x_i-\mu) \end{align*} as small as possible. The natural reference point is the sample mean $\bar{x}$, because the centred residuals sum to zero. For each observation, \begin{align*} x_i-\mu = (x_i-\bar{x})+(\bar{x}-\mu). \end{align*} Substituting this into the quadratic form and using bilinearity gives \begin{align*} \sum_{i=1}^n (x_i-\mu)^\top A(x_i-\mu) &= \sum_{i=1}^n (x_i-\bar{x})^\top A(x_i-\bar{x}) \\ &\quad +2\sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu) +n(\bar{x}-\mu)^\top A(\bar{x}-\mu). \end{align*} The middle term vanishes because \begin{align*} \sum_{i=1}^n (x_i-\bar{x}) = \sum_{i=1}^n x_i-n\bar{x} =0. \end{align*} Therefore \begin{align*} \sum_{i=1}^n (x_i-\bar{x})^\top A(\bar{x}-\mu) = \left(\sum_{i=1}^n (x_i-\bar{x})\right)^\top A(\bar{x}-\mu) =0. \end{align*} We have proved the decomposition \begin{align*} \sum_{i=1}^n (x_i-\mu)^\top \Sigma^{-1}(x_i-\mu) &= \sum_{i=1}^n (x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x}) +n(\bar{x}-\mu)^\top \Sigma^{-1}(\bar{x}-\mu). \end{align*} The first term does not depend on $\mu$. The second term is nonnegative because $\Sigma^{-1}$ is positive definite, and it is zero exactly when $\bar{x}-\mu=0$. Hence the quadratic penalty is uniquely minimized at $\mu=\bar{x}$, so the log-likelihood is uniquely maximized in the mean variable at $\mu=\bar{x}$. [/guided] [/step] [step:Reduce the covariance problem to a trace and determinant expression] Set $\mu=\bar{x}$. Since \begin{align*} W=\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top, \end{align*} the cyclic trace identity gives \begin{align*} \sum_{i=1}^n (x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x}) &= \sum_{i=1}^n \operatorname{tr}\left((x_i-\bar{x})^\top \Sigma^{-1}(x_i-\bar{x})\right) \\ &= \sum_{i=1}^n \operatorname{tr}\left(\Sigma^{-1}(x_i-\bar{x})(x_i-\bar{x})^\top\right) \\ &= \operatorname{tr}(\Sigma^{-1}W). \end{align*} Thus, up to the additive constant $-\frac{np}{2}\log(2\pi)$, the covariance log-likelihood is \begin{align*} \ell(\bar{x},\Sigma) &= -\frac{np}{2}\log(2\pi) -\frac{n}{2}\log\det\Sigma -\frac{1}{2}\operatorname{tr}(\Sigma^{-1}W). \end{align*} It remains to maximize \begin{align*} \Phi:\mathbb{S}_{++}^p &\to \mathbb{R},\\ \Sigma &\mapsto -\frac{n}{2}\log\det\Sigma -\frac{1}{2}\operatorname{tr}(\Sigma^{-1}W). \end{align*} [/step] [step:Whiten the covariance variable using the positive definite scatter matrix] Because $W$ is positive definite, there is a symmetric positive definite matrix $W^{1/2}\in\mathbb{R}^{p\times p}$ with $W^{1/2}W^{1/2}=W$. Define \begin{align*} Y:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\ \Sigma &\mapsto W^{-1/2}\Sigma W^{-1/2}. \end{align*} This map is bijective, with inverse \begin{align*} Y^{-1}:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\ B &\mapsto W^{1/2}BW^{1/2}. \end{align*} Let $B:=W^{-1/2}\Sigma W^{-1/2}$. Then $\Sigma=W^{1/2}BW^{1/2}$, and [determinant multiplicativity](/theorems/395) gives \begin{align*} \det\Sigma &= \det(W^{1/2})\det B\det(W^{1/2}) = \det W\,\det B. \end{align*} Also, \begin{align*} \Sigma^{-1} &= W^{-1/2}B^{-1}W^{-1/2}, \end{align*} so the cyclic trace identity gives \begin{align*} \operatorname{tr}(\Sigma^{-1}W) &= \operatorname{tr}(W^{-1/2}B^{-1}W^{-1/2}W) \\ &= \operatorname{tr}(W^{-1/2}B^{-1}W^{1/2}) \\ &= \operatorname{tr}(B^{-1}). \end{align*} Therefore maximizing $\Phi(\Sigma)$ over $\Sigma\in\mathbb{S}_{++}^p$ is equivalent to maximizing \begin{align*} \Psi:\mathbb{S}_{++}^p &\to \mathbb{R},\\ B &\mapsto -\frac{n}{2}\log\det B-\frac{1}{2}\operatorname{tr}(B^{-1}), \end{align*} because the term $-\frac{n}{2}\log\det W$ is constant in $B$. [guided] The covariance expression contains $W$ in the trace term. Since $W$ is positive definite by hypothesis, we can remove it by a change of variables. Let $W^{1/2}$ be the symmetric positive definite square root of $W$, so that $W^{1/2}W^{1/2}=W$, and define \begin{align*} Y:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\ \Sigma &\mapsto W^{-1/2}\Sigma W^{-1/2}. \end{align*} This is a bijection because its inverse is \begin{align*} Y^{-1}:\mathbb{S}_{++}^p &\to \mathbb{S}_{++}^p,\\ B &\mapsto W^{1/2}BW^{1/2}. \end{align*} Thus no covariance matrices are lost and none are counted twice. Put \begin{align*} B:=W^{-1/2}\Sigma W^{-1/2}. \end{align*} Then $\Sigma=W^{1/2}BW^{1/2}$. For the determinant term, multiplicativity of determinant gives \begin{align*} \det\Sigma &= \det(W^{1/2})\det B\det(W^{1/2}) = \det W\,\det B. \end{align*} For the trace term, first invert $\Sigma=W^{1/2}BW^{1/2}$: \begin{align*} \Sigma^{-1} &= W^{-1/2}B^{-1}W^{-1/2}. \end{align*} Then use cyclic invariance of trace: \begin{align*} \operatorname{tr}(\Sigma^{-1}W) &= \operatorname{tr}(W^{-1/2}B^{-1}W^{-1/2}W) \\ &= \operatorname{tr}(W^{-1/2}B^{-1}W^{1/2}) \\ &= \operatorname{tr}(B^{-1}W^{1/2}W^{-1/2}) \\ &= \operatorname{tr}(B^{-1}). \end{align*} Therefore the part of the log-likelihood depending on covariance becomes \begin{align*} -\frac{n}{2}\log\det W -\frac{n}{2}\log\det B -\frac{1}{2}\operatorname{tr}(B^{-1}). \end{align*} The first term is constant in $B$, so the covariance maximization is equivalent to maximizing \begin{align*} \Psi:\mathbb{S}_{++}^p &\to \mathbb{R},\\ B &\mapsto -\frac{n}{2}\log\det B-\frac{1}{2}\operatorname{tr}(B^{-1}). \end{align*} This whitening step is useful because the remaining expression depends only on the eigenvalues of $B$. [/guided] [/step] [step:Maximize the whitened covariance objective eigenvalue by eigenvalue] Let $B\in\mathbb{S}_{++}^p$. Choose an orthogonal matrix $Q\in\mathbb{R}^{p\times p}$ and positive [real numbers](/page/Real%20Numbers) $\lambda_1,\dots,\lambda_p$ such that \begin{align*} B=Q\operatorname{diag}(\lambda_1,\dots,\lambda_p)Q^\top. \end{align*} Then \begin{align*} \det B &= \prod_{j=1}^p \lambda_j, & \operatorname{tr}(B^{-1}) &= \sum_{j=1}^p \frac{1}{\lambda_j}. \end{align*} Hence \begin{align*} \Psi(B) &= \sum_{j=1}^p \left(-\frac{n}{2}\log\lambda_j-\frac{1}{2\lambda_j}\right). \end{align*} Define \begin{align*} \psi:(0,\infty)&\to\mathbb{R},\\ t&\mapsto -\frac{n}{2}\log t-\frac{1}{2t}. \end{align*} Then \begin{align*} \psi'(t) &= -\frac{n}{2t}+\frac{1}{2t^2} = \frac{1-nt}{2t^2}, \end{align*} so the only critical point is $t=1/n$. Moreover, \begin{align*} \psi'(t)>0 \quad \text{for } 0<t<\frac{1}{n}, \qquad \psi'(t)<0 \quad \text{for } t>\frac{1}{n}. \end{align*} Thus $\psi$ has a unique global maximum at $t=1/n$. Therefore $\Psi(B)$ has a unique global maximum when \begin{align*} \lambda_1=\cdots=\lambda_p=\frac{1}{n}, \end{align*} that is, when \begin{align*} B=\frac{1}{n}I_p. \end{align*} [guided] Because $B$ is symmetric positive definite, it has an orthogonal eigenvalue decomposition. Thus there exist an orthogonal matrix $Q\in\mathbb{R}^{p\times p}$ and positive eigenvalues $\lambda_1,\dots,\lambda_p$ such that \begin{align*} B=Q\operatorname{diag}(\lambda_1,\dots,\lambda_p)Q^\top. \end{align*} The positivity of the eigenvalues is exactly where positive definiteness of $B$ is used. The determinant and trace of the inverse are then \begin{align*} \det B &= \prod_{j=1}^p \lambda_j, & \operatorname{tr}(B^{-1}) &= \sum_{j=1}^p \frac{1}{\lambda_j}. \end{align*} Substituting these formulas into $\Psi$ gives \begin{align*} \Psi(B) &= -\frac{n}{2}\sum_{j=1}^p \log\lambda_j -\frac{1}{2}\sum_{j=1}^p \frac{1}{\lambda_j} \\ &= \sum_{j=1}^p \left(-\frac{n}{2}\log\lambda_j-\frac{1}{2\lambda_j}\right). \end{align*} So the matrix maximization has separated into $p$ identical one-variable maximization problems. Define \begin{align*} \psi:(0,\infty)&\to\mathbb{R},\\ t&\mapsto -\frac{n}{2}\log t-\frac{1}{2t}. \end{align*} Differentiating gives \begin{align*} \psi'(t) &= -\frac{n}{2t}+\frac{1}{2t^2} = \frac{1-nt}{2t^2}. \end{align*} Since $2t^2>0$ for $t>0$, the sign of $\psi'(t)$ is the sign of $1-nt$. Hence \begin{align*} \psi'(t)>0 \quad \text{for } 0<t<\frac{1}{n}, \qquad \psi'(t)<0 \quad \text{for } t>\frac{1}{n}. \end{align*} Thus $\psi$ strictly increases up to $1/n$ and strictly decreases after $1/n$. It has a unique global maximum at $t=1/n$. Since \begin{align*} \Psi(B)=\sum_{j=1}^p \psi(\lambda_j), \end{align*} the sum is maximized uniquely when every eigenvalue satisfies $\lambda_j=1/n$. Therefore \begin{align*} B=Q\operatorname{diag}\left(\frac{1}{n},\dots,\frac{1}{n}\right)Q^\top =\frac{1}{n}I_p. \end{align*} [/guided] [/step] [step:Transform back to obtain the covariance estimator] The unique maximizer of $\Psi$ is $B=\frac{1}{n}I_p$. Since $\Sigma=W^{1/2}BW^{1/2}$, the corresponding covariance matrix is \begin{align*} \hat{\Sigma} &= W^{1/2}\left(\frac{1}{n}I_p\right)W^{1/2} = \frac{1}{n}W. \end{align*} The assumption that $W$ is positive definite implies $\hat{\Sigma}=\frac{1}{n}W\in\mathbb{S}_{++}^p$, so this matrix is admissible in the nonsingular multivariate normal model. Combining this with the unique maximization over $\mu$ gives the unique maximum likelihood estimators \begin{align*} \hat{\mu} &= \bar{x}, & \hat{\Sigma} &= \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top. \end{align*} This proves the theorem. [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Determinant

Explore Further

Determinant Definition Law of the Unconscious Statistician probability Herglotz Representation Theorem probability Kalman Filter Recursion Theorem probability Covariance-Structure Distinction Between Factor Analysis and Principal Component Analysis probability Orthogonal Equivariance of Covariance Principal Components probability Bonferroni Simultaneous Confidence Intervals for Linear Contrasts of a Multivariate Normal Mean probability Conditional Distribution of a Multivariate Normal Vector probability Hypothesis and Error SSP Matrices for Multivariate General Linear Model Contrasts probability

What brings you to Androma?

Start with a route through the knowledge graph.