Rao–Blackwell Theorem — Statement & Proof

Theorem

Proof

[proofplan] We produce a new estimator $\hat\theta := \mathbb{E}_\theta[\tilde\theta \mid T]$ by conditioning on a sufficient statistic $T$, and show it improves on $\tilde\theta$ in mean squared error. Three facts are used. First, sufficiency guarantees that $\hat\theta$ is a statistic — it does not depend on $\theta$ — even though the conditional expectation is taken under $\mathbb{P}_\theta$. Second, the tower property gives $\mathbb{E}_\theta[\hat\theta] = \mathbb{E}_\theta[\tilde\theta]$, so the new estimator has the same bias. Third, the conditional Jensen inequality applied to the convex function $u \mapsto u^2$ gives $\mathbb{E}_\theta[\hat\theta^2] \le \mathbb{E}_\theta[\tilde\theta^2]$, which, combined with the equality of means, gives $\operatorname{Var}_\theta(\hat\theta) \le \operatorname{Var}_\theta(\tilde\theta)$. The bias-variance decomposition delivers the MSE comparison. [/proofplan] [step:Define the Rao-Blackwellised estimator and verify it is a statistic] Let $X = (X_1, \ldots, X_n)$ be a sample from a model $\{\mathbb{P}_\theta : \theta \in \Theta\}$, let $T: \mathcal{X}^n \to \mathcal{T}$ be a [sufficient statistic](/page/Sufficient%20Statistic) for $\theta$, and let $\tilde\theta: \mathcal{X}^n \to \mathbb{R}$ be an estimator of $\theta$ with $\mathbb{E}_\theta[\tilde\theta^2] < \infty$ for every $\theta \in \Theta$. Define \begin{align*} \hat\theta: \mathcal{X}^n &\to \mathbb{R} \\ x &\mapsto \mathbb{E}_\theta\bigl[\tilde\theta(X) \,\big|\, T(X) = T(x)\bigr]. \end{align*} Because $T$ is sufficient for $\theta$, the [conditional distribution](/page/Conditional%20Distribution) of $X$ given $T(X)$ does not depend on $\theta$. Hence the conditional expectation $\mathbb{E}_\theta[\tilde\theta(X) \mid T(X)]$ is the same function of $T(X)$ for every $\theta$, and $\hat\theta$ is a measurable function of the data alone. In particular, $\hat\theta$ is a legitimate estimator. 
[guided] **Notation.** The theorem statement uses $\tilde\theta$ for the original estimator and $\hat\theta$ for the improved (Rao-Blackwellised) estimator. We follow that convention throughout this proof: $\tilde\theta$ is the input, $\hat\theta := \mathbb{E}_\theta[\tilde\theta \mid T]$ is the output. The idea behind the construction is simple: if $T$ is sufficient, then averaging any estimator over the conditional distribution given $T$ cannot hurt, because that conditional distribution carries no information about $\theta$. The averaging smooths out noise that was informative only about $X$ itself, not about $\theta$. There is a subtle issue, however: conditional expectation is defined with respect to a probability measure, and here we wrote $\mathbb{E}_\theta$, suggesting the answer might depend on $\theta$. If it did, $\hat\theta$ would not be a statistic at all: it would depend on the unknown parameter, which is forbidden. This is exactly where sufficiency of $T$ is essential. [Sufficiency](/page/Sufficient%20Statistic) of $T$ means that the conditional distribution of $X$ given $T(X) = t$ is the same for every $\theta \in \Theta$. Write $Q_t$ for this common conditional distribution on $\mathcal{X}^n$. Then \begin{align*} \mathbb{E}_\theta[\tilde\theta(X) \mid T(X) = t] = \int \tilde\theta(y)\, Q_t(dy), \end{align*} and since $Q_t$ does not depend on $\theta$ (by sufficiency), neither does the integral. In the discrete case the integral is the sum $\sum_y \tilde\theta(y)\, \mathbb{P}(X = y \mid T(X) = t)$. The conditional expectation is a function of $t$ alone; call it $\psi(t)$. Then $\hat\theta(x) = \psi(T(x))$, which is a statistic: a measurable function of the data, with no hidden parameter dependence. 
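The claim that $\psi(t)$ does not depend on $\theta$ can be checked exactly in a small discrete model. The sketch below is an illustration only and is not part of the proof: it assumes an i.i.d. Bernoulli($\theta$) sample, the crude estimator $\tilde\theta(X) = X_1$, and the sufficient statistic $T(X) = \sum_i X_i$. The function name `conditional_mean_given_sum` is hypothetical. Under these assumptions the conditional expectation should equal $t/n$ for every $\theta \in (0,1)$.

```python
from fractions import Fraction
from itertools import product


def conditional_mean_given_sum(n, theta, t):
    """E_theta[X_1 | X_1 + ... + X_n = t] for i.i.d. Bernoulli(theta),
    computed by brute-force enumeration of {0,1}^n with exact fractions."""
    numerator = Fraction(0)    # sum of x_1 * P_theta(X = x) over {x : T(x) = t}
    denominator = Fraction(0)  # P_theta(T = t)
    for x in product((0, 1), repeat=n):
        if sum(x) != t:
            continue
        p = theta ** sum(x) * (1 - theta) ** (n - sum(x))
        numerator += x[0] * p
        denominator += p
    return numerator / denominator


n = 4
thetas = [Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)]

# For each value t of the sufficient statistic, evaluate psi(t) under several theta.
table = {t: [conditional_mean_given_sum(n, th, t) for th in thetas]
         for t in range(n + 1)}

for t, values in table.items():
    print(t, values)  # every entry in row t is the same number, t/n
```

The $\theta$-dependent factor $\theta^t (1-\theta)^{n-t}$ is common to every sample point with $T(x) = t$, so it cancels between numerator and denominator. That cancellation is what sufficiency looks like in this example.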
The integrability assumption $\mathbb{E}_\theta[\tilde\theta^2] < \infty$ ensures that $\tilde\theta \in L^2(\mathbb{P}_\theta)$, so the conditional expectation exists and can be characterised as the orthogonal projection of $\tilde\theta$ onto the closed subspace of $\sigma(T)$-measurable elements of $L^2(\mathbb{P}_\theta)$. [/guided] [/step] [step:Verify that $\hat\theta$ has the same bias as $\tilde\theta$] By the [tower property of conditional expectation](/theorems/1150) applied to $(\tilde\theta, T)$ under $\mathbb{P}_\theta$, \begin{align*} \mathbb{E}_\theta[\hat\theta] = \mathbb{E}_\theta\bigl[\mathbb{E}_\theta[\tilde\theta \mid T]\bigr] = \mathbb{E}_\theta[\tilde\theta]. \end{align*} In particular, $\operatorname{bias}_\theta(\hat\theta) = \mathbb{E}_\theta[\hat\theta] - \theta = \mathbb{E}_\theta[\tilde\theta] - \theta = \operatorname{bias}_\theta(\tilde\theta)$. If $\tilde\theta$ is unbiased, so is $\hat\theta$. [guided] The [tower property of conditional expectation](/theorems/1150) (also called the law of iterated expectations) says $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}]] = \mathbb{E}[Y]$ for any integrable $Y$ and any sub-$\sigma$-algebra $\mathcal{G}$. Applying it with $Y = \tilde\theta$ and $\mathcal{G} = \sigma(T)$ under the measure $\mathbb{P}_\theta$, \begin{align*} \mathbb{E}_\theta[\hat\theta] = \mathbb{E}_\theta\bigl[\mathbb{E}_\theta[\tilde\theta \mid T]\bigr] = \mathbb{E}_\theta[\tilde\theta]. \end{align*} The hypothesis of the tower property is integrability of $\tilde\theta$, which we have since $\mathbb{E}_\theta[\tilde\theta^2] < \infty$ and $L^2 \subseteq L^1$ on a probability space. The consequence: $\hat\theta$ and $\tilde\theta$ have identical means under every $\mathbb{P}_\theta$, so they have identical bias. Rao-Blackwellisation does not change the bias, and this is a feature: an unbiased estimator stays unbiased, and a biased estimator retains its bias exactly. All the improvement happens in the variance term. 
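As a small numerical illustration of the equality of means (a sketch under assumptions, not part of the proof), take an i.i.d. Bernoulli($\theta$) sample with $\tilde\theta = X_1$ and $T = \sum_i X_i$. In that example the Rao-Blackwellised estimator is $\hat\theta = T/n$ and $T \sim \operatorname{Binomial}(n, \theta)$. The helper names `binomial_pmf` and `mean_of_rao_blackwellised` are hypothetical. Since $\mathbb{E}_\theta[X_1] = \theta$, the tower property predicts $\mathbb{E}_\theta[T/n] = \theta$ exactly.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, theta, t):
    """P_theta(T = t) for T ~ Binomial(n, theta), as an exact fraction."""
    return comb(n, t) * theta ** t * (1 - theta) ** (n - t)


def mean_of_rao_blackwellised(n, theta):
    """E_theta[psi(T)] with psi(t) = t / n, summed over the law of T."""
    return sum(Fraction(t, n) * binomial_pmf(n, theta, t) for t in range(n + 1))


n = 5
thetas = [Fraction(1, 10), Fraction(1, 3), Fraction(3, 4)]

# E_theta[X_1] = theta, so the tower property predicts the same mean for T / n.
means = [mean_of_rao_blackwellised(n, th) for th in thetas]
print(means)  # each entry equals the corresponding theta
```

Exact rational arithmetic is used so that the comparison is an identity rather than an approximate floating-point match.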
[/guided] [/step] [step:Apply conditional Jensen to compare second moments] The function $u \mapsto u^2$ is convex on $\mathbb{R}$. The conditional Jensen inequality applied to $(\tilde\theta, T)$ under $\mathbb{P}_\theta$ gives \begin{align*} \bigl(\mathbb{E}_\theta[\tilde\theta \mid T]\bigr)^2 \le \mathbb{E}_\theta[\tilde\theta^2 \mid T]. \end{align*} Taking unconditional expectation under $\mathbb{P}_\theta$ and using the [tower property of conditional expectation](/theorems/1150), \begin{align*} \mathbb{E}_\theta\bigl[\hat\theta^2\bigr] = \mathbb{E}_\theta\bigl[(\mathbb{E}_\theta[\tilde\theta \mid T])^2\bigr] \le \mathbb{E}_\theta\bigl[\mathbb{E}_\theta[\tilde\theta^2 \mid T]\bigr] = \mathbb{E}_\theta[\tilde\theta^2]. \end{align*} [guided] The core inequality is a conditional version of Jensen. Recall that for any convex $\phi: \mathbb{R} \to \mathbb{R}$ and any integrable random variable $Y$ with $\phi(Y)$ integrable, the unconditional Jensen inequality states $\phi(\mathbb{E}[Y]) \le \mathbb{E}[\phi(Y)]$. The conditional Jensen inequality upgrades this to \begin{align*} \phi\bigl(\mathbb{E}[Y \mid \mathcal{G}]\bigr) \le \mathbb{E}[\phi(Y) \mid \mathcal{G}] \quad \text{a.s.,} \end{align*} for any sub-$\sigma$-algebra $\mathcal{G}$. We apply this with $\phi(u) = u^2$ (which is convex and continuous, hence measurable), $Y = \tilde\theta$, $\mathcal{G} = \sigma(T)$, under $\mathbb{P}_\theta$. The integrability hypotheses are satisfied because $\mathbb{E}_\theta[\tilde\theta^2] < \infty$ by assumption (this gives integrability of both $\tilde\theta$ and $\tilde\theta^2$). The inequality is \begin{align*} \hat\theta^2 = \bigl(\mathbb{E}_\theta[\tilde\theta \mid T]\bigr)^2 \le \mathbb{E}_\theta[\tilde\theta^2 \mid T] \quad \mathbb{P}_\theta\text{-a.s.} \end{align*} Take expectations under $\mathbb{P}_\theta$ on both sides. The left side becomes $\mathbb{E}_\theta[\hat\theta^2]$. 
The right side, by the [tower property of conditional expectation](/theorems/1150), becomes \begin{align*} \mathbb{E}_\theta\bigl[\mathbb{E}_\theta[\tilde\theta^2 \mid T]\bigr] = \mathbb{E}_\theta[\tilde\theta^2]. \end{align*} Therefore $\mathbb{E}_\theta[\hat\theta^2] \le \mathbb{E}_\theta[\tilde\theta^2]$. The second moment of $\hat\theta$ is no larger than that of $\tilde\theta$. [/guided] [/step] [step:Combine the second-moment and first-moment inequalities to obtain the variance and MSE comparison] Using $\operatorname{Var}_\theta(Y) = \mathbb{E}_\theta[Y^2] - (\mathbb{E}_\theta[Y])^2$ together with the equality of first moments and the inequality of second moments from the previous steps, \begin{align*} \operatorname{Var}_\theta(\hat\theta) = \mathbb{E}_\theta[\hat\theta^2] - (\mathbb{E}_\theta[\hat\theta])^2 \le \mathbb{E}_\theta[\tilde\theta^2] - (\mathbb{E}_\theta[\tilde\theta])^2 = \operatorname{Var}_\theta(\tilde\theta). \end{align*} By the [Bias-Variance Decomposition](/theorems/1424), \begin{align*} \operatorname{MSE}_\theta(\hat\theta) = \operatorname{Var}_\theta(\hat\theta) + \operatorname{bias}_\theta(\hat\theta)^2, \qquad \operatorname{MSE}_\theta(\tilde\theta) = \operatorname{Var}_\theta(\tilde\theta) + \operatorname{bias}_\theta(\tilde\theta)^2. \end{align*} Since $\operatorname{bias}_\theta(\hat\theta) = \operatorname{bias}_\theta(\tilde\theta)$ and $\operatorname{Var}_\theta(\hat\theta) \le \operatorname{Var}_\theta(\tilde\theta)$, \begin{align*} \operatorname{MSE}_\theta(\hat\theta) \le \operatorname{MSE}_\theta(\tilde\theta) \quad \text{for every } \theta \in \Theta. \end{align*} This is the Rao–Blackwell improvement. [guided] We have $\mathbb{E}_\theta[\hat\theta] = \mathbb{E}_\theta[\tilde\theta]$, so $(\mathbb{E}_\theta[\hat\theta])^2 = (\mathbb{E}_\theta[\tilde\theta])^2$. 
Combined with $\mathbb{E}_\theta[\hat\theta^2] \le \mathbb{E}_\theta[\tilde\theta^2]$, \begin{align*} \operatorname{Var}_\theta(\hat\theta) &= \mathbb{E}_\theta[\hat\theta^2] - (\mathbb{E}_\theta[\hat\theta])^2 \\ &\le \mathbb{E}_\theta[\tilde\theta^2] - (\mathbb{E}_\theta[\tilde\theta])^2 \\ &= \operatorname{Var}_\theta(\tilde\theta). \end{align*} The two terms being subtracted are equal (same first moment), so subtracting them does not alter the direction of the inequality between the second moments. The conclusion is exactly that the variance of $\hat\theta$ is at most the variance of $\tilde\theta$. Now we convert this to an MSE statement. By the [Bias-Variance Decomposition](/theorems/1424), $\operatorname{MSE}_\theta = \operatorname{Var}_\theta + \operatorname{bias}_\theta^2$. Since the bias terms are identical and the variance term has decreased (or stayed the same), \begin{align*} \operatorname{MSE}_\theta(\hat\theta) = \operatorname{Var}_\theta(\hat\theta) + \operatorname{bias}_\theta(\hat\theta)^2 \le \operatorname{Var}_\theta(\tilde\theta) + \operatorname{bias}_\theta(\tilde\theta)^2 = \operatorname{MSE}_\theta(\tilde\theta). \end{align*} [/guided] [/step] [step:Determine the case of equality] Since the bias terms and the first moments agree, equality $\operatorname{MSE}_\theta(\hat\theta) = \operatorname{MSE}_\theta(\tilde\theta)$ holds iff $\mathbb{E}_\theta[\hat\theta^2] = \mathbb{E}_\theta[\tilde\theta^2]$. By the Jensen step, $\mathbb{E}_\theta[\tilde\theta^2 \mid T] - (\mathbb{E}_\theta[\tilde\theta \mid T])^2$ is a nonnegative random variable, and a nonnegative random variable has expectation zero iff it vanishes almost surely. Hence equality holds iff \begin{align*} \bigl(\mathbb{E}_\theta[\tilde\theta \mid T]\bigr)^2 = \mathbb{E}_\theta[\tilde\theta^2 \mid T] \quad \mathbb{P}_\theta\text{-a.s.} \end{align*} This is the equality case of the conditional Jensen inequality applied to the strictly convex function $u \mapsto u^2$. Concretely, the difference of the two sides is the conditional variance $\mathbb{E}_\theta[(\tilde\theta - \hat\theta)^2 \mid T]$, which vanishes a.s. iff $\tilde\theta = \hat\theta$ $\mathbb{P}_\theta$-a.s., that is, iff $\tilde\theta$ is itself a function of $T$ up to a $\mathbb{P}_\theta$-null set. 
Thus Rao–Blackwellisation strictly improves MSE unless $\tilde\theta$ is already a function of the sufficient statistic $T$. [guided] We trace when the various inequalities are tight. The bias comparison is always an equality (by the step on means), so MSE reduction comes entirely from variance reduction, which in turn comes from the conditional Jensen inequality applied to $u \mapsto u^2$. The equality case of conditional Jensen for a strictly convex $\phi$ and integrable $Y$ is: $\phi(\mathbb{E}[Y \mid \mathcal{G}]) = \mathbb{E}[\phi(Y) \mid \mathcal{G}]$ a.s. iff $Y = \mathbb{E}[Y \mid \mathcal{G}]$ a.s., that is, iff $Y$ agrees almost surely with a $\mathcal{G}$-measurable random variable. With $\phi(u) = u^2$ (which is strictly convex), $Y = \tilde\theta$, $\mathcal{G} = \sigma(T)$, the equality case reads: $\tilde\theta = \mathbb{E}_\theta[\tilde\theta \mid T]$ $\mathbb{P}_\theta$-a.s. Translating: $\tilde\theta$ is a function of $T$ up to a $\mathbb{P}_\theta$-null set. In that case $\hat\theta = \tilde\theta$ a.s., and the two estimators coincide: no improvement, but also no loss. Otherwise (if $\tilde\theta$ is not $\mathbb{P}_\theta$-a.s. equal to a function of $T$), the inequality is strict for that $\theta$: $\operatorname{Var}_\theta(\hat\theta) < \operatorname{Var}_\theta(\tilde\theta)$ and hence $\operatorname{MSE}_\theta(\hat\theta) < \operatorname{MSE}_\theta(\tilde\theta)$. This gives the Rao–Blackwell Theorem in full: conditioning on a sufficient statistic never increases MSE, and strictly decreases it unless the original estimator was already a function of the sufficient statistic. [/guided] [/step]
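The full statement, including the equality case, can be illustrated by brute-force enumeration in a toy model. The sketch below assumes an i.i.d. Bernoulli($\theta$) sample of size $n = 4$, the estimator $\tilde\theta = X_1$ and the sufficient statistic $T = \sum_i X_i$. None of these choices come from the theorem itself, and the helper names `prob`, `rao_blackwellise` and `mse` are hypothetical. It computes $\psi(t) = \mathbb{E}_\theta[\tilde\theta \mid T = t]$ directly from the definition, compares the two mean squared errors, and then Rao-Blackwellises a second time to show that an estimator which is already a function of $T$ is left unchanged.

```python
from fractions import Fraction
from itertools import product


def prob(x, theta):
    """P_theta(X = x) for an i.i.d. Bernoulli(theta) sample x in {0,1}^n."""
    k = sum(x)
    return theta ** k * (1 - theta) ** (len(x) - k)


def rao_blackwellise(estimator, statistic, n, theta):
    """Return psi as a dict: t -> E_theta[estimator(X) | statistic(X) = t]."""
    num, den = {}, {}
    for x in product((0, 1), repeat=n):
        t = statistic(x)
        p = prob(x, theta)
        num[t] = num.get(t, Fraction(0)) + estimator(x) * p
        den[t] = den.get(t, Fraction(0)) + p
    return {t: num[t] / den[t] for t in num}


def mse(estimator, n, theta):
    """E_theta[(estimator(X) - theta)^2], by enumeration over {0,1}^n."""
    return sum(prob(x, theta) * (estimator(x) - theta) ** 2
               for x in product((0, 1), repeat=n))


n, theta = 4, Fraction(3, 10)


def first_obs(x):
    """The crude estimator: the first observation only."""
    return Fraction(x[0])


psi = rao_blackwellise(first_obs, sum, n, theta)


def improved(x):
    """The Rao-Blackwellised estimator psi(T(x))."""
    return psi[sum(x)]


mse_original = mse(first_obs, n, theta)  # theta * (1 - theta)     = 21/100
mse_improved = mse(improved, n, theta)   # theta * (1 - theta) / n = 21/400

# Equality case: `improved` is already a function of T, so conditioning
# on T a second time returns exactly the same psi.
psi_again = rao_blackwellise(improved, sum, n, theta)

print(mse_original, mse_improved, psi_again == psi)
```

Both estimators are unbiased in this example, so the two MSE values are also the two variances, and the reduction by a factor of $n$ is the familiar gap between a single observation and the sample mean.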
