Bayes Classifier — Statement & Proof

Bayes Classifier (Theorem # 1941)

Theorem

Let $(X, Y)$ be a random pair with $X$ taking values in $\mathcal{X}$ and $Y \in \{-1, 1\}$, and let $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$. Then the Bayes classifier $h_0$, defined by $h_0(x) = 1$ if $\eta(x) > 1/2$ and $h_0(x) = -1$ otherwise, minimises the misclassification risk $R(h) = \mathbb{P}(Y \neq h(X))$ over all measurable classifiers $h : \mathcal{X} \to \{-1, 1\}$.

Proof

[proofplan]
We express the misclassification risk $R(h) = \mathbb{P}(Y \neq h(X))$ as an iterated expectation using the tower property, then decompose the inner conditional expectation into contributions from each label. Because $h(X)$ is determined by $X$, minimising the outer expectation reduces to a pointwise minimisation of the conditional misclassification probability at each $x$. Comparing the two cases $h(x) = 1$ and $h(x) = -1$ shows that the optimal decision boundary occurs at $\eta(x) = 1/2$.
[/proofplan]

[step:Express the risk as an iterated expectation using the tower property]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be the underlying probability space and let $(X, Y)$ be a random pair with $X$ taking values in $\mathcal{X}$ and $Y \in \{-1, 1\}$. Define the regression function $\eta : \mathcal{X} \to [0,1]$ by $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$. For any measurable classifier $h : \mathcal{X} \to \{-1, 1\}$, the misclassification risk is
\begin{align*}
R(h) = \mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}}].
\end{align*}
By the tower property of conditional expectation, conditioning on the $\sigma$-algebra $\sigma(X)$ generated by $X$, we obtain
\begin{align*}
R(h) = \mathbb{E}\bigl[\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X]\bigr].
\end{align*}

[guided]
The risk $R(h) = \mathbb{P}(Y \neq h(X))$ is the expectation of the indicator $\mathbb{1}_{\{Y \neq h(X)\}}$. The tower property of conditional expectation states that for any integrable random variable $Z$ and any sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$,
\begin{align*}
\mathbb{E}[Z] = \mathbb{E}[\mathbb{E}[Z \mid \mathcal{G}]].
\end{align*}
We apply this with $Z = \mathbb{1}_{\{Y \neq h(X)\}}$ and $\mathcal{G} = \sigma(X)$, the $\sigma$-algebra generated by $X$. Since $\mathbb{1}_{\{Y \neq h(X)\}} \in \{0, 1\}$, it is bounded and hence integrable, so the tower property applies. This gives
\begin{align*}
R(h) = \mathbb{E}\bigl[\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X]\bigr].
\end{align*}

Why condition on $X$? Because $h(X)$ is a function of $X$ alone, conditioning on $X$ makes $h(X)$ a known constant inside the inner expectation. The only remaining randomness comes from $Y$, which is governed by the conditional distribution $\mathbb{P}(Y = \cdot \mid X)$. This separation is what allows us to optimise $h$ pointwise.
[/guided]
[/step]

[step:Decompose the inner conditional expectation into contributions from each label]
Conditioning on $X = x$ makes $h(X) = h(x)$ a constant. The event $\{Y \neq h(X)\}$ then depends only on $Y$, which takes values in $\{-1, 1\}$. We decompose by cases:
\begin{align*}
\mathbb{E}[\mathbb{1}_{\{Y \neq h(X)\}} \mid X = x] &= \mathbb{P}(Y \neq h(x) \mid X = x) \\
&= \mathbb{1}_{\{h(x) = -1\}} \cdot \mathbb{P}(Y = 1 \mid X = x) + \mathbb{1}_{\{h(x) = 1\}} \cdot \mathbb{P}(Y = -1 \mid X = x) \\
&= \mathbb{1}_{\{h(x) = -1\}} \cdot \eta(x) + \mathbb{1}_{\{h(x) = 1\}} \cdot (1 - \eta(x)).
\end{align*}
The second equality holds because $h(x) \in \{-1, 1\}$, so exactly one of the indicators $\mathbb{1}_{\{h(x) = -1\}}$ and $\mathbb{1}_{\{h(x) = 1\}}$ equals $1$, and the misclassification event $\{Y \neq h(x)\}$ selects the opposite class from the one predicted. The third equality uses $\mathbb{P}(Y = -1 \mid X = x) = 1 - \eta(x)$.

[guided]
Since $h(x)$ is either $-1$ or $1$, exactly one of the two indicators is active at each $x$. If $h(x) = -1$, then $Y \neq h(x)$ iff $Y = 1$, which has conditional probability $\eta(x)$. If $h(x) = 1$, then $Y \neq h(x)$ iff $Y = -1$, which has conditional probability $1 - \eta(x)$. This gives the decomposition
\begin{align*}
\mathbb{P}(Y \neq h(x) \mid X = x) = \mathbb{1}_{\{h(x) = -1\}} \cdot \eta(x) + \mathbb{1}_{\{h(x) = 1\}} \cdot (1 - \eta(x)).
\end{align*}

The key observation is that this quantity depends on $x$ only through $\eta(x)$ and the choice of $h(x)$. Since the outer expectation $\mathbb{E}[\cdot]$ integrates this over the marginal distribution of $X$, minimising $R(h)$ reduces to choosing $h(x)$ to minimise this expression at each $x$ separately. This pointwise reduction is possible precisely because $h(X)$ is $\sigma(X)$-measurable, so the choice at one value of $X$ does not affect the conditional expectation at another.
[/guided]
[/step]

[step:Minimise pointwise by comparing the two label choices at each $x$]
At each $x \in \mathcal{X}$, the conditional misclassification probability takes one of two values:
\begin{align*}
\mathbb{P}(Y \neq h(x) \mid X = x) = \begin{cases} \eta(x) & \text{if } h(x) = -1, \\ 1 - \eta(x) & \text{if } h(x) = 1. \end{cases}
\end{align*}
To minimise $R(h) = \mathbb{E}[\mathbb{P}(Y \neq h(X) \mid X)]$, it suffices to minimise the integrand pointwise, because the expectation is monotone in the integrand. At each $x$:

- If $\eta(x) > 1/2$, then $1 - \eta(x) < \eta(x)$, so setting $h(x) = 1$ yields the smaller conditional error $1 - \eta(x)$.
- If $\eta(x) < 1/2$, then $\eta(x) < 1 - \eta(x)$, so setting $h(x) = -1$ yields the smaller conditional error $\eta(x)$.
- If $\eta(x) = 1/2$, then both choices give conditional error $1/2$, so either label is optimal.

Therefore the Bayes classifier
\begin{align*}
h_0(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2, \\ -1 & \text{otherwise} \end{cases}
\end{align*}
minimises $\mathbb{P}(Y \neq h(X) \mid X = x)$ at every $x \in \mathcal{X}$, where it attains the value $\min\{\eta(x), 1 - \eta(x)\}$. Because $\eta$ is measurable, the set $\{x : \eta(x) > 1/2\}$ is measurable, so $h_0$ is itself a measurable classifier. By monotonicity of expectation, a pointwise minimiser of the integrand also minimises its expectation, so $h_0$ minimises $R(h)$ over all measurable classifiers $h : \mathcal{X} \to \{-1, 1\}$. The tie-breaking rule at $\eta(x) = 1/2$ is arbitrary because both labels achieve the same conditional misclassification probability $1/2$.

[guided]
We are choosing $h(x) \in \{-1, 1\}$ to minimise a quantity that depends only on $\eta(x)$ and $h(x)$. This is a discrete optimisation over two values, so we simply compare:

- Choosing $h(x) = 1$ incurs conditional error $1 - \eta(x)$.
- Choosing $h(x) = -1$ incurs conditional error $\eta(x)$.

Setting $h(x) = 1$ is better when $1 - \eta(x) < \eta(x)$, i.e., when $\eta(x) > 1/2$. Setting $h(x) = -1$ is better when $\eta(x) < 1/2$. When $\eta(x) = 1/2$, both choices give the same error of $1/2$.

Why does pointwise minimisation imply global minimisation of $R(h)$? Because $R(h) = \mathbb{E}[\mathbb{P}(Y \neq h(X) \mid X)]$ is the expectation (with respect to the marginal distribution of $X$) of a function of $X$ taking values in $[0, 1]$, and expectation is monotone: if $g_1(x) \leq g_2(x)$ for all $x$, then $\mathbb{E}[g_1(X)] \leq \mathbb{E}[g_2(X)]$. Since the Bayes classifier $h_0$ achieves the smallest possible value of $\mathbb{P}(Y \neq h(X) \mid X = x)$ at every $x$, no other classifier can have smaller risk.

The conclusion is that
\begin{align*}
h_0(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2, \\ -1 & \text{otherwise} \end{cases}
\end{align*}
minimises $R(h)$ over all measurable classifiers, and the choice at the boundary $\eta(x) = 1/2$ is immaterial.
[/guided]
[/step]
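The pointwise argument can be checked numerically on a small finite feature space. The following Python sketch is illustrative only: the three-point space, the marginal `p_x` and the values of `eta` are hypothetical choices that do not come from the theorem. It computes $R(h)$ exactly for every classifier $h : \mathcal{X} \to \{-1, 1\}$ and compares the minimum with the risk of $h_0$ and with $\mathbb{E}[\min\{\eta(X), 1 - \eta(X)\}]$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy distribution (illustrative values, not from the source):
# p_x[x] = P(X = x) and eta[x] = P(Y = 1 | X = x) on a three-point space.
p_x = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
eta = {"a": Fraction(9, 10), "b": Fraction(1, 5), "c": Fraction(1, 2)}


def risk(h):
    """Exact risk R(h) = sum over x of P(X = x) * P(Y != h(x) | X = x)."""
    total = Fraction(0)
    for x, px in p_x.items():
        # Conditional error is eta(x) if h(x) = -1 and 1 - eta(x) if h(x) = 1.
        cond_err = eta[x] if h[x] == -1 else 1 - eta[x]
        total += px * cond_err
    return total


def bayes_classifier():
    """h_0(x) = 1 if eta(x) > 1/2, and -1 otherwise (ties go to -1)."""
    return {x: 1 if eta[x] > Fraction(1, 2) else -1 for x in p_x}


def all_classifiers():
    """Enumerate every map from the feature space to {-1, 1}."""
    xs = list(p_x)
    for labels in product((-1, 1), repeat=len(xs)):
        yield dict(zip(xs, labels))


h0 = bayes_classifier()
bayes_risk = risk(h0)
min_risk = min(risk(h) for h in all_classifiers())
pointwise_bound = sum(px * min(eta[x], 1 - eta[x]) for x, px in p_x.items())

# Flipping the label at the tie point "c" (eta = 1/2) should not change the risk.
h0_flipped = dict(h0, c=1)
tie_risk = risk(h0_flipped)

print(bayes_risk, min_risk, pointwise_bound, tie_risk)
```

With these toy values, all four printed quantities equal 21/100, which matches the claim that $h_0$ attains the minimum and that the label chosen where $\eta(x) = 1/2$ is immaterial. Exact rational arithmetic is used so that the comparison does not depend on floating-point rounding.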
