Quadratic Discriminant Analysis Bayes Rule — Statement & Proof

Quadratic Discriminant Analysis Bayes Rule (Theorem # 4049)

Theorem

Proof

[proofplan] The posterior probability of class $k$ at an observation $x$ is proportional to the prior probability $\pi_k$ multiplied by the class-conditional Gaussian density at $x$. Taking logarithms preserves the maximizers because the logarithm is strictly increasing. The logarithm of $\pi_k$ times the Gaussian density is the sum of the log-prior $\log \pi_k$, a class-independent constant, a determinant term, and a Mahalanobis quadratic term; removing the class-independent constant leaves exactly the quadratic score $q_k(x) = \log \pi_k - \frac{1}{2}\log \det \Sigma_k - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)$. [/proofplan] [step:Write the posterior probabilities in terms of the class-conditional densities] Let $\mathcal{L}^p$ denote Lebesgue measure on $\mathbb{R}^p$. For each $k \in \{1,\dots,g\}$, define the class-conditional density \begin{align*} f_k : \mathbb{R}^p &\to (0,\infty) \\ x &\mapsto (2\pi)^{-p/2}(\det \Sigma_k)^{-1/2} \exp\left(-\frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)\right). \end{align*} This formula is valid because $\Sigma_k$ is symmetric positive definite, so $\det \Sigma_k > 0$ and $\Sigma_k^{-1}$ exists. Define the marginal density of $X$ by \begin{align*} m : \mathbb{R}^p &\to (0,\infty) \\ x &\mapsto \sum_{j=1}^g \pi_j f_j(x). \end{align*} Since $\pi_j > 0$ and $f_j(x) > 0$ for every $j$ and every $x \in \mathbb{R}^p$, we have $m(x) > 0$ for every $x \in \mathbb{R}^p$. Since each $f_j$ is a probability density with respect to $\mathcal{L}^p$ and $\sum_{j=1}^g \pi_j = 1$, the mixture density satisfies \begin{align*} \int_{\mathbb{R}^p} m(x)\,d\mathcal{L}^p(x) &= \sum_{j=1}^g \pi_j \int_{\mathbb{R}^p} f_j(x)\,d\mathcal{L}^p(x) = \sum_{j=1}^g \pi_j = 1. \end{align*} Thus $m$ is a probability density, and by the law of total probability it is the density of the marginal law of $X$. For each $k$, choose the regular conditional probability version \begin{align*} \eta_k : \mathbb{R}^p &\to [0,1] \\ x &\mapsto \mathbb{P}(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{m(x)}. \end{align*} This is a posterior version defined for every $x \in \mathbb{R}^p$; changing it on an $X$-null set would give the same regular conditional distribution. 
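The construction of $f_k$, $m$, and $\eta_k$ in this step can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names `gaussian_density` and `posterior` and the two-class parameters are hypothetical choices made for illustration and do not come from the theorem.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Density f_k(x) of N(mu, sigma), assuming sigma is symmetric positive definite."""
    p = x.shape[0]
    diff = x - mu
    # Mahalanobis quadratic form (x - mu)^T sigma^{-1} (x - mu), without forming the inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    norm = (2.0 * np.pi) ** (-p / 2.0) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * maha)

def posterior(x, priors, means, covs):
    """eta_k(x) = pi_k f_k(x) / m(x), where m(x) = sum_j pi_j f_j(x)."""
    weighted = np.array([
        pi_k * gaussian_density(x, mu_k, sigma_k)
        for pi_k, mu_k, sigma_k in zip(priors, means, covs)
    ])
    m = weighted.sum()  # mixture (marginal) density at x
    return weighted / m

# Hypothetical two-class example in R^2 (illustrative values only)
priors = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2), 4.0 * np.eye(2)]

eta = posterior(np.array([1.0, 0.5]), priors, means, covs)
print(eta, eta.sum())
```

The posterior values lie in $[0,1]$ and sum to $1$ up to rounding. In floating point the densities can underflow to zero far in the tails, which makes the ratio numerically unreliable there; this is one practical reason to compare log scores rather than the ratio itself.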
[guided] Let $\mathcal{L}^p$ denote Lebesgue measure on $\mathbb{R}^p$. This is the measure with respect to which the Gaussian densities below are integrated. For each class $k$, the conditional distribution of $X$ given $Y=k$ is multivariate normal with mean $\mu_k$ and covariance matrix $\Sigma_k$. Because $\Sigma_k$ is symmetric positive definite, the determinant is positive and the inverse exists, so the Gaussian density is the well-defined map \begin{align*} f_k : \mathbb{R}^p &\to (0,\infty) \\ x &\mapsto (2\pi)^{-p/2}(\det \Sigma_k)^{-1/2} \exp\left(-\frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)\right). \end{align*} The unconditional density of $X$ is obtained by mixing the class-conditional densities with weights given by the priors. Thus we define \begin{align*} m : \mathbb{R}^p &\to (0,\infty) \\ x &\mapsto \sum_{j=1}^g \pi_j f_j(x). \end{align*} Each summand $\pi_j f_j(x)$ is positive because $\pi_j > 0$ and $f_j(x) > 0$. Therefore $m(x) > 0$ for every $x \in \mathbb{R}^p$, so division by $m(x)$ is legitimate. The prior normalization is used here: since each $f_j$ integrates to $1$ with respect to $\mathcal{L}^p$ and $\sum_{j=1}^g \pi_j = 1$, \begin{align*} \int_{\mathbb{R}^p} m(x)\,d\mathcal{L}^p(x) &= \sum_{j=1}^g \pi_j \int_{\mathbb{R}^p} f_j(x)\,d\mathcal{L}^p(x) = 1. \end{align*} Thus $m$ is the marginal density of $X$. There is a technical point in the notation $\mathbb{P}(Y=k \mid X=x)$: when $X$ has a density, the event $\{X=x\}$ has probability zero. Therefore this expression means a chosen regular conditional probability version. [Bayes' formula](/theorems/1114) for densities gives the posterior version \begin{align*} \eta_k : \mathbb{R}^p &\to [0,1] \\ x &\mapsto \mathbb{P}(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{m(x)}. \end{align*} The codomain is $[0,1]$ because in the edge case $g=1$ the unique posterior probability is $1$. 
The denominator $m(x)$ is the same for every class, so the only class-dependent quantity in the posterior comparison is $\pi_k f_k(x)$. [/guided] [/step] [step:Reduce posterior maximization to log-density maximization] Fix $x \in \mathbb{R}^p$. Since $m(x) > 0$ and does not depend on $k$, \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} \eta_k(x) = \operatorname*{arg\,max}_{1 \le k \le g} \pi_k f_k(x). \end{align*} Since $\pi_k f_k(x) > 0$ for every $k$ and the logarithm is strictly increasing on $(0,\infty)$, \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} \pi_k f_k(x) = \operatorname*{arg\,max}_{1 \le k \le g} \log(\pi_k f_k(x)). \end{align*} Expanding the logarithm gives \begin{align*} \log(\pi_k f_k(x)) &= \log \pi_k - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log \det \Sigma_k - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k). \end{align*} [guided] Fix an observation $x \in \mathbb{R}^p$. The posterior probability of class $k$ is \begin{align*} \eta_k(x) = \frac{\pi_k f_k(x)}{m(x)}. \end{align*} The denominator $m(x)$ is positive and is the same for all classes $k$. Therefore multiplying each posterior probability by the same positive number $m(x)$ does not change which classes maximize it: \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} \eta_k(x) = \operatorname*{arg\,max}_{1 \le k \le g} \pi_k f_k(x). \end{align*} Next, every number $\pi_k f_k(x)$ is positive. The logarithm is strictly increasing on $(0,\infty)$, so applying $\log$ to each positive score also preserves the set of maximizers: \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} \pi_k f_k(x) = \operatorname*{arg\,max}_{1 \le k \le g} \log(\pi_k f_k(x)). 
\end{align*} Now substitute the explicit Gaussian density: \begin{align*} \log(\pi_k f_k(x)) &= \log \pi_k + \log\left((2\pi)^{-p/2}\right) + \log\left((\det \Sigma_k)^{-1/2}\right) + \log\left( \exp\left(-\frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)\right) \right) \\ &= \log \pi_k - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log \det \Sigma_k - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k). \end{align*} This separates the log posterior numerator into the prior contribution, the determinant contribution, the quadratic Mahalanobis contribution, and the class-independent normalizing constant. [/guided] [/step] [step:Remove the class-independent constant and identify a pointwise Bayes rule] The term $-\frac{p}{2}\log(2\pi)$ is independent of $k$. Therefore adding or subtracting it from every class score does not change the set of maximizers. Hence \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} \log(\pi_k f_k(x)) &= \operatorname*{arg\,max}_{1 \le k \le g} \left[ \log \pi_k - \frac{1}{2}\log \det \Sigma_k - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) \right] \\ &= \operatorname*{arg\,max}_{1 \le k \le g} q_k(x). \end{align*} For $0$-$1$ loss, the conditional risk of assigning class $a \in \{1,\dots,g\}$ at the fixed observation $x$ is \begin{align*} R_x(a) := \mathbb{P}(Y \neq a \mid X=x) = 1 - \eta_a(x). \end{align*} Thus a pointwise conditional-risk minimizer is exactly a maximizer of $\eta_a(x)$, and the preceding identities show that one may choose such a minimizer in \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} q_k(x). \end{align*} Define the smallest-index tie-breaking classifier \begin{align*} \delta : \mathbb{R}^p &\to \{1,\dots,g\} \\ x &\mapsto \min \operatorname*{arg\,max}_{1 \le k \le g} q_k(x). 
\end{align*} This map is measurable: for each $a \in \{1,\dots,g\}$, the set $\{x \in \mathbb{R}^p : \delta(x)=a\}$ is the finite intersection of the sets $\{x : q_a(x) \ge q_j(x)\}$ for $1 \le j \le g$ and the sets $\{x : q_i(x) < q_a(x)\}$ for $i < a$, and these sets are Borel because each $q_k$ is continuous. Thus $\delta$ is a measurable version of the [Bayes classifier](/theorems/1941). A global Bayes classifier is determined only up to changes on $X$-null sets, so the conclusion is the existence of this measurable pointwise maximizing version, not that every representative must agree at every individual $x$. [guided] The logarithmic score contains one term that does not depend on the class: \begin{align*} -\frac{p}{2}\log(2\pi). \end{align*} If the same real number is added to every candidate score, the set of maximizers is unchanged. Removing this class-independent term gives \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} \log(\pi_k f_k(x)) &= \operatorname*{arg\,max}_{1 \le k \le g} \left[ \log \pi_k - \frac{1}{2}\log \det \Sigma_k - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) \right] \\ &= \operatorname*{arg\,max}_{1 \le k \le g} q_k(x). \end{align*} Now connect the score maximization to Bayes risk. For $0$-$1$ loss, if we decide class $a \in \{1,\dots,g\}$ after observing $x$, the conditional probability of making an error is \begin{align*} R_x(a) := \mathbb{P}(Y \neq a \mid X=x) = 1 - \mathbb{P}(Y=a \mid X=x) = 1 - \eta_a(x). \end{align*} Therefore minimizing $R_x(a)$ over $a$ is the same as maximizing $\eta_a(x)$ over $a$. The earlier reductions showed that the maximizers of $\eta_a(x)$ are precisely the maximizers of $q_a(x)$, so a pointwise Bayes decision at $x$ may be chosen from \begin{align*} \operatorname*{arg\,max}_{1 \le k \le g} q_k(x). \end{align*} To turn the pointwise maximizing statement into a classifier, we must choose ties measurably. Define \begin{align*} \delta : \mathbb{R}^p &\to \{1,\dots,g\} \\ x &\mapsto \min \operatorname*{arg\,max}_{1 \le k \le g} q_k(x). 
\end{align*} This is a measurable rule. Indeed, for a fixed class $a$, the set where $\delta(x)=a$ is described by requiring $q_a(x) \ge q_j(x)$ for every $j$ and requiring $q_i(x) < q_a(x)$ for every $i<a$. These are finitely many Borel conditions because all the score functions $q_k$ are continuous functions of $x$. Therefore $\delta$ is a measurable classifier that chooses a maximizing class at every $x \in \mathbb{R}^p$. This pointwise statement should still be read with the regular conditional probability convention from the first step. Since posterior probabilities are only determined up to $X$-null sets, a global Bayes classifier is also only determined up to such null-set modifications. Thus the theorem proves that there exists a measurable Bayes classifier version that uses the displayed maximizing rule for every $x \in \mathbb{R}^p$. [/guided] [/step]
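
The rule $\delta(x) = \min \operatorname*{arg\,max}_k q_k(x)$ from the last step can be sketched in code. This is a minimal illustration, assuming NumPy; the names `qda_scores`, `qda_classify`, and `conditional_risk` and all parameter values are hypothetical. It relies on the documented behavior of `np.argmax`, which returns the first index attaining the maximum, to realize smallest-index tie-breaking.

```python
import numpy as np

def qda_scores(x, priors, means, covs):
    """q_k(x) = log pi_k - 0.5 * log det Sigma_k - 0.5 * (x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
    scores = []
    for pi_k, mu_k, sigma_k in zip(priors, means, covs):
        # slogdet returns (sign, log|det|); the sign is +1 for positive definite matrices
        _, logdet = np.linalg.slogdet(sigma_k)
        diff = x - mu_k
        maha = diff @ np.linalg.solve(sigma_k, diff)
        scores.append(np.log(pi_k) - 0.5 * logdet - 0.5 * maha)
    return np.array(scores)

def qda_classify(x, priors, means, covs):
    """Smallest-index maximizer of q_k(x), returned as a 1-based class label."""
    return int(np.argmax(qda_scores(x, priors, means, covs))) + 1

def conditional_risk(x, priors, means, covs):
    """R_x(a) = 1 - eta_a(x); eta is recovered from q because the common factors cancel."""
    q = qda_scores(x, priors, means, covs)
    w = np.exp(q - q.max())  # subtracting the maximum avoids overflow and underflow
    return 1.0 - w / w.sum()

# Hypothetical two-class example in R^2 (illustrative values only)
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2), 4.0 * np.eye(2)]

near = np.array([0.0, 0.0])
far = np.array([6.0, 0.0])
print(qda_classify(near, priors, means, covs))  # class 1: x sits at the first mean
print(qda_classify(far, priors, means, covs))   # class 2: the wider class dominates far out

# Two identical classes give equal scores, so the rule falls back to the smaller index
tie = qda_classify(near, [0.5, 0.5], [means[0], means[0]], [covs[0], covs[0]])
```

In this sketch the class minimizing `conditional_risk` coincides with the class chosen by `qda_classify`, mirroring the identity $R_x(a) = 1 - \eta_a(x)$. Working with `slogdet` and `solve` instead of `det` and an explicit inverse is a numerical design choice, not something the proof requires.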

