Two-Class Equal-Prior Gaussian Bayes Error with Common Covariance (Theorem # 4052)

Proof

[proofplan]
We derive the Bayes rule by comparing the two equal-prior Gaussian class-conditional densities. Because the covariance matrices are equal, the quadratic terms cancel and the decision rule reduces to thresholding the one-dimensional Fisher score $(\mu_1-\mu_2)^\top\Sigma^{-1}X$ at the value it takes at the midpoint $(\mu_1+\mu_2)/2$. Under class $1$ or class $2$, the centered score is a univariate normal random variable with mean $\Delta^2/2$ or $-\Delta^2/2$, respectively, and variance $\Delta^2$, so each conditional misclassification probability equals $\Phi(-\Delta/2)$. Averaging with the equal prior weights then gives the same value for the total Bayes error.
[/proofplan]

[step:Handle the case where the two Gaussian laws coincide]
Assume first that $\Delta=0$, where $\Delta^2=(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2)$. Since $\Sigma$ is positive definite, so is $\Sigma^{-1}$; that is, the quadratic form $v \mapsto v^\top \Sigma^{-1}v$ is positive definite, so $\Delta=0$ implies $\mu_1=\mu_2$. Hence the two conditional laws of $X$ are identical. Let $\delta:\mathbb{R}^p\to\{1,2\}$ be any deterministic (measurable) classifier, and denote the common conditional law of $X$ by $\nu$. The error probability of $\delta$ is
\begin{align*}
\mathbb{P}(\delta(X)\neq Y) &= \frac{1}{2}\mathbb{P}(\delta(X)=2\mid Y=1) + \frac{1}{2}\mathbb{P}(\delta(X)=1\mid Y=2)\\
&= \frac{1}{2}\nu(\{x\in\mathbb{R}^p:\delta(x)=2\}) + \frac{1}{2}\nu(\{x\in\mathbb{R}^p:\delta(x)=1\})\\
&= \frac{1}{2},
\end{align*}
because the two sets partition $\mathbb{R}^p$ and $\nu$ is a probability measure. Every classifier therefore has error probability $1/2$, so the Bayes error is $1/2$. Since $\Phi(0)=1/2$, this equals $\Phi(-\Delta/2)$ when $\Delta=0$.
[/step]

[step:Compute the equal-prior Bayes decision rule when the means differ]
Assume now that $\Delta>0$. Define the mean difference vector $d\in\mathbb{R}^p$, the Fisher direction $a\in\mathbb{R}^p$, and the midpoint $m\in\mathbb{R}^p$ by
\begin{align*}
d := \mu_1-\mu_2, \qquad a := \Sigma^{-1}d, \qquad m := \frac{\mu_1+\mu_2}{2}.
\end{align*}
For $k\in\{1,2\}$, let $f_k:\mathbb{R}^p\to(0,\infty)$ be the density of $\mathcal{N}_p(\mu_k,\Sigma)$ with respect to $\mathcal{L}^p$:
\begin{align*}
f_k(x) := \frac{1}{(2\pi)^{p/2}(\det\Sigma)^{1/2}} \exp\left( -\frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k) \right).
\end{align*}
With equal priors, the [Bayes classifier](/theorems/1941) chooses class $1$ exactly where $f_1(x)\ge f_2(x)$ and class $2$ otherwise. Taking logarithms and cancelling the common normalizing constant, this condition is equivalent to
\begin{align*}
(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1) \le (x-\mu_2)^\top\Sigma^{-1}(x-\mu_2).
\end{align*}
Expanding both quadratic forms and cancelling the common term $x^\top\Sigma^{-1}x$ gives
\begin{align*}
-2\mu_1^\top\Sigma^{-1}x+\mu_1^\top\Sigma^{-1}\mu_1 \le -2\mu_2^\top\Sigma^{-1}x+\mu_2^\top\Sigma^{-1}\mu_2.
\end{align*}
Rearranging yields
\begin{align*}
d^\top\Sigma^{-1}x \ge \frac{1}{2}\left(\mu_1^\top\Sigma^{-1}\mu_1-\mu_2^\top\Sigma^{-1}\mu_2\right).
\end{align*}
Since $\Sigma^{-1}$ is symmetric, the right-hand side is $d^\top\Sigma^{-1}m$. Therefore the Bayes classifier $\delta_*:\mathbb{R}^p\to\{1,2\}$ is
\begin{align*}
\delta_*(x) = \begin{cases} 1, & a^\top(x-m)\ge 0,\\ 2, & a^\top(x-m)<0. \end{cases}
\end{align*}

[guided]
The equal-prior Bayes rule compares posterior probabilities. Because the priors are equal, comparing posterior probabilities is the same as comparing the class-conditional densities $f_1$ and $f_2$. For each $k\in\{1,2\}$, the conditional density of $X$ given $Y=k$ is the map $f_k:\mathbb{R}^p\to(0,\infty)$ defined by
\begin{align*}
f_k(x) := \frac{1}{(2\pi)^{p/2}(\det\Sigma)^{1/2}} \exp\left( -\frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k) \right).
\end{align*}
The normalizing constants are identical because the covariance matrix is the same in both classes. Therefore $f_1(x)\ge f_2(x)$ holds exactly when
\begin{align*}
(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1) \le (x-\mu_2)^\top\Sigma^{-1}(x-\mu_2).
\end{align*}
The key cancellation is the common quadratic term in $x$. Expanding gives
\begin{align*}
x^\top\Sigma^{-1}x -2\mu_1^\top\Sigma^{-1}x +\mu_1^\top\Sigma^{-1}\mu_1 \le x^\top\Sigma^{-1}x -2\mu_2^\top\Sigma^{-1}x +\mu_2^\top\Sigma^{-1}\mu_2.
\end{align*}
After cancelling $x^\top\Sigma^{-1}x$ and rearranging, we obtain
\begin{align*}
(\mu_1-\mu_2)^\top\Sigma^{-1}x \ge \frac{1}{2}\left(\mu_1^\top\Sigma^{-1}\mu_1-\mu_2^\top\Sigma^{-1}\mu_2\right).
\end{align*}
Now define
\begin{align*}
d := \mu_1-\mu_2, \qquad a := \Sigma^{-1}d, \qquad m := \frac{\mu_1+\mu_2}{2}.
\end{align*}
Since $\Sigma^{-1}$ is symmetric, the cross terms $\mu_1^\top\Sigma^{-1}\mu_2$ and $\mu_2^\top\Sigma^{-1}\mu_1$ cancel, and
\begin{align*}
d^\top\Sigma^{-1}m &= \frac{1}{2}(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1+\mu_2)\\
&= \frac{1}{2}\left(\mu_1^\top\Sigma^{-1}\mu_1-\mu_2^\top\Sigma^{-1}\mu_2\right).
\end{align*}
Thus the decision rule is the hyperplane rule
\begin{align*}
\delta_*(x) = \begin{cases} 1, & a^\top(x-m)\ge 0,\\ 2, & a^\top(x-m)<0. \end{cases}
\end{align*}
This is the Fisher linear discriminant rule: project $x$ onto the direction $a=\Sigma^{-1}(\mu_1-\mu_2)$ and compare with the projected midpoint.
[/guided]
[/step]

[step:Compute the class one misclassification probability from the projected score]
Define the score map $S:\mathbb{R}^p\to\mathbb{R}$ by
\begin{align*}
S(x) := a^\top(x-m).
\end{align*}
Conditionally on $Y=1$, the random variable $S(X):\Omega\to\mathbb{R}$ is an affine function of the Gaussian vector $X\sim\mathcal{N}_p(\mu_1,\Sigma)$, so it is normally distributed with mean
\begin{align*}
a^\top(\mu_1-m) &= a^\top\left(\frac{\mu_1-\mu_2}{2}\right) = \frac{1}{2}d^\top\Sigma^{-1}d = \frac{\Delta^2}{2}
\end{align*}
and variance
\begin{align*}
a^\top\Sigma a &= d^\top\Sigma^{-1}\Sigma\Sigma^{-1}d = d^\top\Sigma^{-1}d = \Delta^2.
\end{align*}
Therefore, under $Y=1$,
\begin{align*}
\frac{S(X)-\Delta^2/2}{\Delta}\sim \mathcal{N}(0,1).
\end{align*}
The classifier assigns class $2$ exactly when $S(X)<0$, so
\begin{align*}
\mathbb{P}(\delta_*(X)=2\mid Y=1) &= \mathbb{P}(S(X)<0\mid Y=1)\\
&= \mathbb{P}\left(\frac{S(X)-\Delta^2/2}{\Delta}<-\frac{\Delta}{2}\,\middle|\,Y=1\right)\\
&= \Phi\left(-\frac{\Delta}{2}\right).
\end{align*}
[/step]

[step:Compute the class two misclassification probability by the same score]
Conditionally on $Y=2$, the same score $S(X)$ is normally distributed with mean
\begin{align*}
a^\top(\mu_2-m) &= a^\top\left(\frac{\mu_2-\mu_1}{2}\right) = -\frac{1}{2}d^\top\Sigma^{-1}d = -\frac{\Delta^2}{2}
\end{align*}
and variance
\begin{align*}
a^\top\Sigma a=\Delta^2.
\end{align*}
The classifier assigns class $1$ exactly when $S(X)\ge 0$. Hence
\begin{align*}
\mathbb{P}(\delta_*(X)=1\mid Y=2) &= \mathbb{P}(S(X)\ge 0\mid Y=2)\\
&= \mathbb{P}\left(\frac{S(X)+\Delta^2/2}{\Delta}\ge \frac{\Delta}{2}\,\middle|\,Y=2\right)\\
&= 1-\Phi\left(\frac{\Delta}{2}\right)\\
&= \Phi\left(-\frac{\Delta}{2}\right),
\end{align*}
where the final equality uses the symmetry identity $\Phi(-t)=1-\Phi(t)$ for the standard normal distribution.
[/step]

[step:Average the two conditional errors using the equal priors]
The Bayes error rate is the error probability of $\delta_*$. Since $\mathbb{P}(Y=1)=\mathbb{P}(Y=2)=1/2$, the [law of total probability](/theorems/1113) gives
\begin{align*}
\mathbb{P}(\delta_*(X)\neq Y) &= \frac{1}{2}\mathbb{P}(\delta_*(X)=2\mid Y=1) + \frac{1}{2}\mathbb{P}(\delta_*(X)=1\mid Y=2)\\
&= \frac{1}{2}\Phi\left(-\frac{\Delta}{2}\right) + \frac{1}{2}\Phi\left(-\frac{\Delta}{2}\right)\\
&= \Phi\left(-\frac{\Delta}{2}\right).
\end{align*}
Together with the already handled case $\Delta=0$, this proves that the Bayes error rate is $\Phi(-\Delta/2)$ for all $\mu_1,\mu_2\in\mathbb{R}^p$.
[/step]
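
As an informal numerical check of the result, the sketch below compares the closed form $\Phi(-\Delta/2)$ with a Monte Carlo estimate of the error of the rule $a^\top(x-m)\ge 0$. It is an illustration only, restricted to $p=2$ to stay within the Python standard library. The function names (`phi`, `solve2`, `closed_form_error`, `simulated_error`) and the example parameters are assumptions made for this sketch and do not come from the theorem page.

```python
import math
import random


def phi(t):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def solve2(s, v):
    """Solve the 2x2 linear system s w = v by Cramer's rule."""
    (s11, s12), (s21, s22) = s
    det = s11 * s22 - s12 * s21
    return [(s22 * v[0] - s12 * v[1]) / det, (s11 * v[1] - s21 * v[0]) / det]


def closed_form_error(mu1, mu2, s):
    """Phi(-Delta/2) with Delta^2 = d^T s^{-1} d and d = mu1 - mu2."""
    d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    a = solve2(s, d)  # Fisher direction a = s^{-1} d
    delta = math.sqrt(d[0] * a[0] + d[1] * a[1])
    return phi(-delta / 2.0)


def simulated_error(mu1, mu2, s, n=200_000, seed=0):
    """Monte Carlo error of the rule 'class 1 iff a^T (x - m) >= 0'."""
    rng = random.Random(seed)
    d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    a = solve2(s, d)
    m = [(mu1[0] + mu2[0]) / 2.0, (mu1[1] + mu2[1]) / 2.0]
    # Lower-triangular Cholesky factor of s, used to sample N(mu, s).
    l11 = math.sqrt(s[0][0])
    l21 = s[1][0] / l11
    l22 = math.sqrt(s[1][1] - l21 * l21)
    errors = 0
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 2  # equal priors
        mu = mu1 if y == 1 else mu2
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)
        score = a[0] * (x[0] - m[0]) + a[1] * (x[1] - m[1])
        pred = 1 if score >= 0 else 2
        errors += pred != y
    return errors / n


if __name__ == "__main__":
    # Hypothetical example parameters chosen only for illustration.
    mu1, mu2 = (1.0, 0.0), (-1.0, 1.0)
    sigma = [[2.0, 0.5], [0.5, 1.0]]
    print("closed form:", closed_form_error(mu1, mu2, sigma))
    print("simulated:  ", simulated_error(mu1, mu2, sigma))
```

With a few hundred thousand samples, the two printed values should agree to roughly two or three decimal places. The simulation only estimates the error of the linear rule derived above; it does not by itself show that no other classifier does better, which is what the comparison of densities in the proof establishes.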