Maximum Likelihood Estimator of the Coefficient Matrix in the Multivariate Linear Model

Maximum Likelihood Estimator of the Coefficient Matrix in the Multivariate Linear Model (Theorem # 4062)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] For fixed positive definite $\Sigma$, the part of the log-likelihood depending on $B$ is the negative of a weighted residual sum of squares. We show that the ordinary least-squares matrix $\hat{B} = (X^\top X)^{-1}X^\top Y$ satisfies the normal equations. Then we expand the weighted residual criterion around $\hat{B}$ and prove that every deviation $B-\hat{B}$ increases the criterion by a non-negative term, which vanishes only when $B=\hat{B}$. [/proofplan] [step:Write the likelihood as a weighted residual criterion] For $B \in \mathbb{R}^{p \times m}$, define the residual matrix \begin{align*} R(B) := Y - XB \in \mathbb{R}^{n \times m}. \end{align*} Since the rows of $E$ are independent Gaussian vectors in $\mathbb{R}^m$ with common covariance matrix $\Sigma$, the log-likelihood as a function of $B$ and fixed $\Sigma$ differs from a constant independent of $B$ by \begin{align*} \ell(B;\Sigma) = -\frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}R(B)^\top R(B)\right). \end{align*} The term $-\frac{n}{2}\log\det\Sigma$ is independent of $B$, so maximizing $\ell(B;\Sigma)$ over $B \in \mathbb{R}^{p \times m}$ is equivalent to minimizing the function \begin{align*} Q_\Sigma:\mathbb{R}^{p \times m} &\to \mathbb{R},\\ B &\mapsto \operatorname{tr}\!\left(\Sigma^{-1}(Y-XB)^\top(Y-XB)\right). \end{align*} [/step] [step:Verify that the least-squares matrix satisfies the normal equations] Because $X$ has full column rank, $X^\top X \in \mathbb{R}^{p \times p}$ is invertible. Define \begin{align*} \hat{B} := (X^\top X)^{-1}X^\top Y \in \mathbb{R}^{p \times m}. \end{align*} Then \begin{align*} X^\top(Y-X\hat{B}) &= X^\top Y - X^\top X (X^\top X)^{-1}X^\top Y\\ &= X^\top Y - X^\top Y\\ &= 0. \end{align*} Thus $\hat{B}$ satisfies the normal equations \begin{align*} X^\top R(\hat{B}) = 0. \end{align*} [guided] The candidate estimator is the same matrix that appears in ordinary least squares. The full column rank hypothesis is used exactly here: since the columns of $X$ are linearly independent, the Gram matrix $X^\top X$ is invertible, so the formula \begin{align*} \hat{B} := (X^\top X)^{-1}X^\top Y \end{align*} defines a matrix in $\mathbb{R}^{p \times m}$. Now compute the residual normal equations. The residual at $\hat{B}$ is \begin{align*} R(\hat{B}) = Y-X\hat{B}. \end{align*} Multiplying on the left by $X^\top$ gives \begin{align*} X^\top R(\hat{B}) &= X^\top(Y-X\hat{B})\\ &= X^\top Y - X^\top X (X^\top X)^{-1}X^\top Y\\ &= X^\top Y - X^\top Y\\ &= 0. \end{align*} This orthogonality identity is the algebraic reason the covariance matrix $\Sigma$ will not affect the minimizing value of $B$. [/guided] [/step] [step:Expand the criterion around the normal-equation solution] Let $B \in \mathbb{R}^{p \times m}$ be arbitrary and define \begin{align*} D := B-\hat{B} \in \mathbb{R}^{p \times m}. \end{align*} Then \begin{align*} R(B) = Y-XB = Y-X\hat{B}-XD = R(\hat{B})-XD. \end{align*} Expanding the residual product gives \begin{align*} R(B)^\top R(B) &= \bigl(R(\hat{B})-XD\bigr)^\top\bigl(R(\hat{B})-XD\bigr)\\ &= R(\hat{B})^\top R(\hat{B}) - R(\hat{B})^\top XD - D^\top X^\top R(\hat{B}) + D^\top X^\top XD. \end{align*} Since $X^\top R(\hat{B})=0$, also $R(\hat{B})^\top X=0$, so the two cross terms vanish. Therefore \begin{align*} Q_\Sigma(B) = Q_\Sigma(\hat{B}) + \operatorname{tr}\!\left(\Sigma^{-1}D^\top X^\top XD\right). \end{align*} [/step] [step:Show that the remainder term is non-negative and vanishes only at $\hat{B}$] Let $A := X^\top X \in \mathbb{R}^{p \times p}$. Since $X$ has full column rank, $A$ is positive definite. Since $\Sigma$ is positive definite, $\Sigma^{-1}$ is positive definite. Let $A^{1/2} \in \mathbb{R}^{p \times p}$ and $\Sigma^{-1/2} \in \mathbb{R}^{m \times m}$ denote the positive definite square roots of $A$ and $\Sigma^{-1}$, respectively. Using cyclic invariance of trace for finite matrices, \begin{align*} \operatorname{tr}\!\left(\Sigma^{-1}D^\top A D\right) &= \operatorname{tr}\!\left(\Sigma^{-1/2}D^\top A D\Sigma^{-1/2}\right)\\ &= \operatorname{tr}\!\left((A^{1/2}D\Sigma^{-1/2})^\top(A^{1/2}D\Sigma^{-1/2})\right)\\ &= \sum_{i=1}^{p}\sum_{j=1}^{m} \left(A^{1/2}D\Sigma^{-1/2}\right)_{ij}^{2} \geq 0. \end{align*} Equality holds if and only if $A^{1/2}D\Sigma^{-1/2}=0$. Since $A^{1/2}$ and $\Sigma^{-1/2}$ are invertible, this is equivalent to $D=0$, hence to $B=\hat{B}$. [guided] The expansion from the previous step reduced the problem to the sign of one matrix trace: \begin{align*} \operatorname{tr}\!\left(\Sigma^{-1}D^\top X^\top XD\right). \end{align*} Set \begin{align*} A := X^\top X. \end{align*} The full column rank of $X$ implies that $A$ is positive definite, because for every nonzero $v \in \mathbb{R}^p$, \begin{align*} v^\top A v = v^\top X^\top Xv = |Xv|^2 > 0. \end{align*} Also, because $\Sigma$ is positive definite, its inverse $\Sigma^{-1}$ is positive definite. We now rewrite the trace as a sum of squares. Let $A^{1/2}$ be the positive definite square root of $A$, and let $\Sigma^{-1/2}$ be the positive definite square root of $\Sigma^{-1}$. Then cyclic invariance of trace gives \begin{align*} \operatorname{tr}\!\left(\Sigma^{-1}D^\top A D\right) &= \operatorname{tr}\!\left(\Sigma^{-1/2}D^\top A D\Sigma^{-1/2}\right)\\ &= \operatorname{tr}\!\left((A^{1/2}D\Sigma^{-1/2})^\top(A^{1/2}D\Sigma^{-1/2})\right). \end{align*} The final trace is the sum of the squares of all entries of the matrix $A^{1/2}D\Sigma^{-1/2}$: \begin{align*} \operatorname{tr}\!\left((A^{1/2}D\Sigma^{-1/2})^\top(A^{1/2}D\Sigma^{-1/2})\right) = \sum_{i=1}^{p}\sum_{j=1}^{m} \left(A^{1/2}D\Sigma^{-1/2}\right)_{ij}^{2} \geq 0. \end{align*} This sum is zero exactly when every entry of $A^{1/2}D\Sigma^{-1/2}$ is zero. Since both $A^{1/2}$ and $\Sigma^{-1/2}$ are invertible, this happens exactly when $D=0$, that is, exactly when $B=\hat{B}$. [/guided] [/step] [step:Conclude uniqueness of the maximum likelihood estimator] For every $B \in \mathbb{R}^{p \times m}$, \begin{align*} Q_\Sigma(B) = Q_\Sigma(\hat{B}) + \operatorname{tr}\!\left(\Sigma^{-1}(B-\hat{B})^\top X^\top X(B-\hat{B})\right) \geq Q_\Sigma(\hat{B}), \end{align*} with equality if and only if $B=\hat{B}$. Thus $\hat{B}$ is the unique minimizer of $Q_\Sigma$ and therefore the unique maximizer of the log-likelihood over $B$ for fixed $\Sigma$. The formula \begin{align*} \hat{B} = (X^\top X)^{-1}X^\top Y \end{align*} contains only $X$ and $Y$. Hence the maximum likelihood estimator of the coefficient matrix does not depend on the positive definite covariance matrix $\Sigma$. [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Orthogonality

Explore Further

Orthogonality Definition Beveridge–Nelson Decomposition probability Wishart Mean and Entrywise Covariance Formula probability Sampling Distribution of the Least Squares Coefficient Matrix in the Multivariate Linear Model probability One-Sample Gaussian Mean Test with Known Covariance probability Birkhoff Ergodic Theorem for Sample Means probability One-Sample Hotelling Confidence Ellipsoid probability Prediction Error Decomposition for the Linear Gaussian State Space Likelihood probability Wishart Distribution of the Sample Covariance Matrix probability

What brings you to Androma?

Start with a route through the knowledge graph.