Normal Equal-Covariance Bayes Rule — Statement & Proof

Normal Equal-Covariance Bayes Rule (Theorem # 4048)

Theorem

Proof

[proofplan]
The Bayes rule under zero-one loss compares the posterior scores $\pi_k f_k(x)$, where $f_k$ is the conditional Gaussian density of $X$ given $Y=k$; strict preference for class $1$ is the strict inequality $\pi_1 f_1(x)>\pi_2 f_2(x)$, while equality is a tie to be resolved by a separate tie-breaking convention. Because both classes have the same covariance matrix, the normalising constants and determinant factors in $f_1$ and $f_2$ cancel after taking logarithms. Expanding the two quadratic forms cancels the common $x^\top\Sigma^{-1}x$ term and leaves exactly the stated linear inequality in $x$. When the priors are equal, the prior-odds term vanishes and the equality set is the locus of equal squared Mahalanobis distance from the two means; this is a hyperplane if the means are distinct and all of $\mathbb R^p$ if the means coincide.
[/proofplan]

[step:Write the Bayes comparison using the two Gaussian densities]
Since $\Sigma$ is symmetric positive definite, the inverse matrix $A:=\Sigma^{-1}\in\mathbb R^{p\times p}$ exists and is symmetric positive definite. For each $k\in\{1,2\}$, define the conditional density
\begin{align*}
f_k:\mathbb R^p&\to(0,\infty)\\
x&\mapsto (2\pi)^{-p/2}(\det\Sigma)^{-1/2} \exp\left(-\frac{1}{2}(x-\mu_k)^\top A(x-\mu_k)\right).
\end{align*}
Under zero-one loss, class $1$ has strictly smaller conditional risk than class $2$ at $x$ exactly when its posterior score is strictly larger, namely when
\begin{align*}
\pi_1 f_1(x)>\pi_2 f_2(x).
\end{align*}
On the equality set $\pi_1 f_1(x)=\pi_2 f_2(x)$, the two actions have the same conditional risk, so a single-valued [Bayes classifier](/theorems/1941) may break the tie arbitrarily.

Since $\pi_1,\pi_2>0$ and $f_1(x),f_2(x)>0$, taking logarithms preserves the strict inequality. Thus the comparison is equivalent to
\begin{align*}
\log\pi_1+\log f_1(x)>\log\pi_2+\log f_2(x).
\end{align*}

[guided]
The Bayes classifier under zero-one loss chooses the class with the larger posterior probability. For two classes with prior probabilities $\pi_k$ and conditional densities $f_k$, this posterior comparison is equivalent to comparing the unnormalised posterior scores $\pi_k f_k(x)$, because the common marginal density of $X$ at $x$ is positive and cancels from both sides.

Since $\Sigma$ is symmetric positive definite, its inverse
\begin{align*}
A:=\Sigma^{-1}
\end{align*}
exists and is also symmetric positive definite. For each class $k\in\{1,2\}$, the conditional Gaussian density is the map
\begin{align*}
f_k:\mathbb R^p&\to(0,\infty)\\
x&\mapsto (2\pi)^{-p/2}(\det\Sigma)^{-1/2} \exp\left(-\frac{1}{2}(x-\mu_k)^\top A(x-\mu_k)\right).
\end{align*}
Therefore class $1$ is strictly preferred to class $2$ exactly when
\begin{align*}
\pi_1 f_1(x)>\pi_2 f_2(x).
\end{align*}
If equality holds instead, the two classes have equal posterior score and hence equal conditional risk under zero-one loss; a Bayes classifier is then not uniquely determined by the risk comparison and may use any tie-breaking convention.

All quantities in the strict inequality are positive: $\pi_1,\pi_2>0$ by hypothesis, and Gaussian densities are strictly positive on $\mathbb R^p$. Hence the logarithm is strictly increasing on the relevant domain, so the same comparison is equivalent to
\begin{align*}
\log\pi_1+\log f_1(x)>\log\pi_2+\log f_2(x).
\end{align*}
[/guided]
[/step]

[step:Cancel the common Gaussian normalising terms]
For each $k\in\{1,2\}$, the logarithm of $f_k(x)$ is
\begin{align*}
\log f_k(x) = -\frac{p}{2}\log(2\pi)-\frac{1}{2}\log\det\Sigma -\frac{1}{2}(x-\mu_k)^\top A(x-\mu_k).
\end{align*}
Substituting this expression into the logarithmic Bayes comparison, the terms
\begin{align*}
-\frac{p}{2}\log(2\pi)-\frac{1}{2}\log\det\Sigma
\end{align*}
appear on both sides and cancel. Hence class $1$ is strictly preferred exactly when
\begin{align*}
\log\pi_1-\frac{1}{2}(x-\mu_1)^\top A(x-\mu_1) > \log\pi_2-\frac{1}{2}(x-\mu_2)^\top A(x-\mu_2).
\end{align*}
[/step]

[step:Expand the quadratic forms and isolate the linear term in $x$]
Because $A$ is symmetric, for each $k\in\{1,2\}$ we have
\begin{align*}
(x-\mu_k)^\top A(x-\mu_k) = x^\top Ax-2\mu_k^\top Ax+\mu_k^\top A\mu_k.
\end{align*}
Substituting these expansions gives
\begin{align*}
\log\pi_1-\frac{1}{2}x^\top Ax+\mu_1^\top Ax-\frac{1}{2}\mu_1^\top A\mu_1 > \log\pi_2-\frac{1}{2}x^\top Ax+\mu_2^\top Ax-\frac{1}{2}\mu_2^\top A\mu_2.
\end{align*}
The common term $-\frac{1}{2}x^\top Ax$ cancels. Moving all remaining terms involving $x$ to the left and all constant terms to the right yields
\begin{align*}
(\mu_1-\mu_2)^\top Ax > \frac{1}{2}\left(\mu_1^\top A\mu_1-\mu_2^\top A\mu_2\right)-\log\frac{\pi_1}{\pi_2}.
\end{align*}
Since $A=\Sigma^{-1}$, this is exactly
\begin{align*}
(\mu_1-\mu_2)^\top\Sigma^{-1}x > \frac{1}{2}\left(\mu_1^\top\Sigma^{-1}\mu_1-\mu_2^\top\Sigma^{-1}\mu_2\right)-\log\frac{\pi_1}{\pi_2}.
\end{align*}

[guided]
The only algebraic point that needs care is the expansion of the quadratic form. Since $A$ is symmetric, the two mixed terms agree:
\begin{align*}
x^\top A\mu_k=\mu_k^\top A^\top x=\mu_k^\top Ax.
\end{align*}
Thus, for each $k\in\{1,2\}$,
\begin{align*}
(x-\mu_k)^\top A(x-\mu_k) = x^\top Ax-2\mu_k^\top Ax+\mu_k^\top A\mu_k.
\end{align*}
Substituting this into the logarithmic Bayes comparison gives
\begin{align*}
\log\pi_1-\frac{1}{2}x^\top Ax+\mu_1^\top Ax-\frac{1}{2}\mu_1^\top A\mu_1 > \log\pi_2-\frac{1}{2}x^\top Ax+\mu_2^\top Ax-\frac{1}{2}\mu_2^\top A\mu_2.
\end{align*}
The term $-\frac{1}{2}x^\top Ax$ is present on both sides because both classes have the same covariance matrix. This is the exact algebraic reason the Bayes rule becomes linear in $x$ rather than quadratic.

Canceling the common term and rearranging gives
\begin{align*}
\mu_1^\top Ax-\mu_2^\top Ax > \frac{1}{2}\mu_1^\top A\mu_1-\frac{1}{2}\mu_2^\top A\mu_2-\log\pi_1+\log\pi_2.
\end{align*}
Combining the left-hand side and rewriting the logarithms as a prior-odds term,
\begin{align*}
(\mu_1-\mu_2)^\top Ax > \frac{1}{2}\left(\mu_1^\top A\mu_1-\mu_2^\top A\mu_2\right)-\log\frac{\pi_1}{\pi_2}.
\end{align*}
Finally, $A=\Sigma^{-1}$ by definition, so this is precisely the stated decision inequality.
[/guided]
[/step]

[step:Identify the equal-prior boundary as the Mahalanobis bisector]
Assume $\pi_1=\pi_2$. Then $\log(\pi_1/\pi_2)=0$, and the decision boundary is the set of $x\in\mathbb R^p$ satisfying
\begin{align*}
(\mu_1-\mu_2)^\top Ax = \frac{1}{2}\left(\mu_1^\top A\mu_1-\mu_2^\top A\mu_2\right).
\end{align*}
For $v\in\mathbb R^p$, define $\|v\|_A^2:=v^\top Av$. Then
\begin{align*}
\|x-\mu_1\|_A^2=\|x-\mu_2\|_A^2
\end{align*}
is equivalent, after expanding both sides, to
\begin{align*}
x^\top Ax-2\mu_1^\top Ax+\mu_1^\top A\mu_1 = x^\top Ax-2\mu_2^\top Ax+\mu_2^\top A\mu_2,
\end{align*}
which is equivalent to
\begin{align*}
(\mu_1-\mu_2)^\top Ax = \frac{1}{2}\left(\mu_1^\top A\mu_1-\mu_2^\top A\mu_2\right).
\end{align*}
Thus the equal-prior equality set is exactly the locus of points with equal squared Mahalanobis distance from $\mu_1$ and $\mu_2$.

If $\mu_1\ne\mu_2$, then $A(\mu_1-\mu_2)\ne 0$ because $A$ is invertible, so this locus is defined by one nonconstant affine equation in $x$ and is therefore a hyperplane. If $\mu_1=\mu_2$, the equality above reduces to $0=0$, the two class-conditional densities coincide, and the equality set is all of $\mathbb R^p$ rather than a proper hyperplane. This completes the proof.
[/step]
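As a numerical sanity check of the argument above, the following minimal sketch compares the log posterior scores with the linear decision inequality in dimension $p=2$. All concrete values (`Sigma`, `mu1`, `mu2`, `pi1`, `pi2`) and the helper names (`inv2`, `quad`, `log_score`, `linear_margin`, `mahal_sq`) are illustrative assumptions, not part of the theorem. It uses only the Python standard library.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(u, A, v):
    """Bilinear form u^T A v for 2-vectors."""
    return sum(u[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

def mahal_sq(x, mu, A):
    """Squared Mahalanobis distance (x - mu)^T A (x - mu)."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    return quad(d, A, d)

def log_score(x, mu, A, log_det, prior):
    """log(prior * f(x)) for a Gaussian with precision matrix A, p = 2."""
    return (math.log(prior) - math.log(2 * math.pi)
            - 0.5 * log_det - 0.5 * mahal_sq(x, mu, A))

def linear_margin(x, mu1, mu2, A, pi1, pi2):
    """Left side minus right side of the linear decision inequality."""
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    lhs = quad(diff, A, x)
    rhs = 0.5 * (quad(mu1, A, mu1) - quad(mu2, A, mu2)) - math.log(pi1 / pi2)
    return lhs - rhs

# Illustrative (assumed) parameters: shared covariance, two means, two priors.
Sigma = [[2.0, 0.5], [0.5, 1.0]]
A = inv2(Sigma)
log_det = math.log(Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0])
mu1, mu2 = [1.0, 0.0], [-1.0, 2.0]
pi1, pi2 = 0.3, 0.7

# The log-score difference should equal the linear margin at every point.
grid = [(-3 + 0.5 * i, -3 + 0.5 * j) for i in range(13) for j in range(13)]
max_gap = max(
    abs(log_score(x, mu1, A, log_det, pi1)
        - log_score(x, mu2, A, log_det, pi2)
        - linear_margin(x, mu1, mu2, A, pi1, pi2))
    for x in grid
)

# Equal priors: build a point on the hyperplane by moving from the midpoint
# of the means along a direction orthogonal to n = A (mu1 - mu2).
n = [sum(A[i][j] * (mu1[j] - mu2[j]) for j in range(2)) for i in range(2)]
mid = [(mu1[i] + mu2[i]) / 2 for i in range(2)]
x_b = [mid[0] - 1.7 * n[1], mid[1] + 1.7 * n[0]]
boundary_margin = linear_margin(x_b, mu1, mu2, A, 0.5, 0.5)
dist_gap = mahal_sq(x_b, mu1, A) - mahal_sq(x_b, mu2, A)

print(max_gap, boundary_margin, dist_gap)
```

Under these assumptions all three printed quantities should vanish up to floating-point rounding. The first confirms that the sign of the linear margin agrees with the strict comparison $\pi_1 f_1(x)>\pi_2 f_2(x)$, and the other two confirm that a point on the equal-prior boundary is equidistant from both means in squared Mahalanobis distance.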

Prerequisites

Theorems

Bayes Classifier

Definitions & Concepts

Determinant

Explore Further

Determinant (Definition)
Bayes Classifier (Theorem #1941)
Kolmogorov Isomorphism Theorem for Purely Nondeterministic Stationary Processes (probability)
Wishart Principal Block Marginal Theorem (probability)
Innovations Algorithm (probability)
Linear Pairwise Decision Boundaries for Sample LDA (probability)
Anderson's Asymptotic Normality Theorem for Sample Covariance Eigenvalues (probability)
Bonferroni Simultaneous Confidence Intervals for Linear Contrasts of a Multivariate Normal Mean (probability)
Birkhoff Ergodic Theorem for Stationary Processes (probability)
Eckart–Young–Mirsky Theorem (probability)
