Multivariate Cramér-Rao Bound — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] We reduce to the single-observation case using the tensorization of Fisher information, then apply the Cauchy-Schwarz inequality in $L^2(\mathbb{P}_\theta)$ pairing the centred estimator against each component of the score vector. This produces a covariance identity between the estimator and the score, and the Cauchy-Schwarz bound delivers the information inequality. The gradient of $\Phi$ enters through differentiation under the integral sign of the unbiasedness condition. [/proofplan] [step:Reduce to a single-observation Fisher information via tensorization] Since $X_1, \ldots, X_n$ are i.i.d. with common density $f(\cdot, \theta)$, the Fisher information for the full sample is $I_n(\theta) = nI(\theta)$, where $I(\theta)$ is the single-observation Fisher information matrix. It therefore suffices to prove the bound for a single observation and replace $I(\theta)$ by $I_n(\theta) = nI(\theta)$ at the end. For notational clarity, we work with a single observation $X \sim f(\cdot, \theta)$ and prove \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta). \end{align*} The result for $n$ observations then follows by replacing $I(\theta)$ with $nI(\theta)$. [guided] Why reduce to $n = 1$? The joint density of $n$ i.i.d. observations factorizes as $\prod_{i=1}^n f(x_i, \theta)$, so the log-likelihood is a sum and the Fisher information matrix of the joint model is $I_n(\theta) = nI(\theta)$ by the [Fisher Information Tensorizes for I.I.D. Samples](/theorems/1841). Rather than carrying the product density through the entire argument, we prove the single-observation bound and substitute $nI(\theta)$ for $I(\theta)$ at the conclusion. We work with a single observation $X \sim f(\cdot, \theta)$ and aim to show \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta). \end{align*} [/guided] [/step] [step:Differentiate the unbiasedness condition to obtain a covariance identity] Define the score vector $S(X, \theta) = \nabla_\theta \log f(X, \theta) \in \mathbb{R}^p$. By the regularity assumption (interchange of differentiation and integration), for each $j \in \{1, \ldots, p\}$, \begin{align*} \frac{\partial}{\partial \theta_j} \mathbb{E}_\theta[\tilde{\Phi}(X)] = \frac{\partial}{\partial \theta_j} \int_{\mathcal{X}} \tilde{\Phi}(x) f(x, \theta) \, d\mu(x) = \int_{\mathcal{X}} \tilde{\Phi}(x) \frac{\partial}{\partial \theta_j} f(x, \theta) \, d\mu(x). \end{align*} Writing $\frac{\partial}{\partial \theta_j} f(x, \theta) = f(x, \theta) \cdot \frac{\partial}{\partial \theta_j} \log f(x, \theta) = f(x, \theta) \cdot S_j(x, \theta)$, the right-hand side becomes $\mathbb{E}_\theta[\tilde{\Phi}(X) S_j(X, \theta)]$. Since $\tilde{\Phi}$ is unbiased, $\mathbb{E}_\theta[\tilde{\Phi}(X)] = \Phi(\theta)$, so the left-hand side is $\frac{\partial \Phi}{\partial \theta_j}(\theta)$. Since $\mathbb{E}_\theta[S_j(X, \theta)] = 0$ (the [Score Has Zero Expectation](/theorems/1839)), we obtain the covariance identity \begin{align*} \operatorname{Cov}_\theta(\tilde{\Phi}(X), S_j(X, \theta)) = \frac{\partial \Phi}{\partial \theta_j}(\theta), \quad j = 1, \ldots, p. \end{align*} In vector form, $\operatorname{Cov}_\theta(\tilde{\Phi}, S) = \nabla_\theta \Phi(\theta) \in \mathbb{R}^p$. [guided] The key mechanism is the same as in the scalar [Cramér-Rao Lower Bound](/theorems/1843): differentiate the unbiasedness identity $\mathbb{E}_\theta[\tilde{\Phi}] = \Phi(\theta)$ with respect to each parameter component. The regularity assumption permits interchange of $\partial/\partial \theta_j$ and $\int$; this is the same condition used to prove that the score has mean zero. For each $j \in \{1, \ldots, p\}$: \begin{align*} \frac{\partial \Phi}{\partial \theta_j}(\theta) = \frac{\partial}{\partial \theta_j} \int_{\mathcal{X}} \tilde{\Phi}(x) f(x, \theta) \, d\mu(x) = \int_{\mathcal{X}} \tilde{\Phi}(x) \frac{\partial f}{\partial \theta_j}(x, \theta) \, d\mu(x). \end{align*} Now rewrite $\frac{\partial f}{\partial \theta_j} = f \cdot \frac{\partial \log f}{\partial \theta_j} = f \cdot S_j$. Substituting: \begin{align*} \frac{\partial \Phi}{\partial \theta_j}(\theta) = \int_{\mathcal{X}} \tilde{\Phi}(x) S_j(x, \theta) f(x, \theta) \, d\mu(x) = \mathbb{E}_\theta[\tilde{\Phi}(X) S_j(X, \theta)]. \end{align*} Since $\mathbb{E}_\theta[S_j] = 0$, we have $\mathbb{E}_\theta[\tilde{\Phi} \cdot S_j] = \operatorname{Cov}_\theta(\tilde{\Phi}, S_j)$. Stacking all $p$ components into a vector gives $\operatorname{Cov}_\theta(\tilde{\Phi}, S) = \nabla_\theta \Phi(\theta)$. [/guided] [/step] [step:Apply the Cauchy-Schwarz inequality to obtain the information bound] For any deterministic vector $a \in \mathbb{R}^p$, form the scalar random variable $a^\top S(X, \theta)$. The Cauchy-Schwarz inequality in $L^2(\mathbb{P}_\theta)$ applied to $\tilde{\Phi}(X) - \Phi(\theta)$ and $a^\top S(X, \theta)$ gives \begin{align*} \left(\operatorname{Cov}_\theta(\tilde{\Phi}, a^\top S)\right)^2 \leq \operatorname{Var}_\theta(\tilde{\Phi}) \cdot \operatorname{Var}_\theta(a^\top S). \end{align*} The left-hand side equals $(a^\top \nabla_\theta \Phi(\theta))^2$ by the covariance identity from the previous step. The right-hand side involves $\operatorname{Var}_\theta(a^\top S) = a^\top I(\theta) a$, since $\operatorname{Cov}_\theta(S) = I(\theta)$ by definition of the Fisher information matrix. Thus for every $a \in \mathbb{R}^p$, \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{(a^\top \nabla_\theta \Phi(\theta))^2}{a^\top I(\theta) a}. \end{align*} [guided] Why introduce an arbitrary vector $a$? In the scalar case ($p = 1$), the Cauchy-Schwarz inequality directly pairs $\tilde{\Phi}$ against the single score. In the multivariate case, the score is a vector in $\mathbb{R}^p$ while $\tilde{\Phi}$ is scalar, so we cannot directly apply Cauchy-Schwarz between objects of different dimension. The remedy is to project the score onto a direction $a$ and then optimize over $a$. For any $a \in \mathbb{R}^p$, the random variable $a^\top S$ is scalar. The Cauchy-Schwarz inequality for covariance states \begin{align*} \left(\operatorname{Cov}_\theta(\tilde{\Phi}, a^\top S)\right)^2 \leq \operatorname{Var}_\theta(\tilde{\Phi}) \cdot \operatorname{Var}_\theta(a^\top S). \end{align*} By linearity, $\operatorname{Cov}_\theta(\tilde{\Phi}, a^\top S) = a^\top \operatorname{Cov}_\theta(\tilde{\Phi}, S) = a^\top \nabla_\theta \Phi(\theta)$. The variance of the projection is $\operatorname{Var}_\theta(a^\top S) = a^\top \operatorname{Cov}_\theta(S, S) a = a^\top I(\theta) a$, where $\operatorname{Cov}_\theta(S) = I(\theta)$ by definition. This gives \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{(a^\top \nabla_\theta \Phi(\theta))^2}{a^\top I(\theta) a} \end{align*} for every nonzero $a \in \mathbb{R}^p$. [/guided] [/step] [step:Optimize over $a$ to extract the quadratic form $\nabla_\theta \Phi^\top I^{-1} \nabla_\theta \Phi$] We maximize the right-hand side over $a \in \mathbb{R}^p \setminus \{0\}$. Write $b = \nabla_\theta \Phi(\theta)$. The ratio $(a^\top b)^2 / (a^\top I(\theta) a)$ is maximized by setting $a = I(\theta)^{-1} b$. This is valid because $I(\theta)$ is positive definite by the regularity assumptions, so $I(\theta)^{-1}$ exists. Substituting $a = I(\theta)^{-1} b$: \begin{align*} \frac{(b^\top I(\theta)^{-1} b)^2}{b^\top I(\theta)^{-1} I(\theta) I(\theta)^{-1} b} = \frac{(b^\top I(\theta)^{-1} b)^2}{b^\top I(\theta)^{-1} b} = b^\top I(\theta)^{-1} b = \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta). \end{align*} Therefore \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta). \end{align*} For the full $n$-observation model, $I(\theta)$ is replaced by $I_n(\theta) = nI(\theta)$, giving \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{1}{n} \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta). \end{align*} [guided] The optimization step is a standard fact from linear algebra: for a positive definite matrix $M$ and a nonzero vector $b$, the maximum of $(a^\top b)^2 / (a^\top M a)$ over $a \neq 0$ equals $b^\top M^{-1} b$, attained at $a = M^{-1} b$. To verify this, note that by the substitution $c = M^{1/2} a$, the ratio becomes $(c^\top M^{-1/2} b)^2 / (c^\top c)$, which by Cauchy-Schwarz in $\mathbb{R}^p$ is at most $|M^{-1/2} b|^2 = b^\top M^{-1} b$. Applying this with $M = I(\theta)$ and $b = \nabla_\theta \Phi(\theta)$, the optimal choice $a = I(\theta)^{-1} \nabla_\theta \Phi(\theta)$ yields \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta). \end{align*} Finally, for $n$ i.i.d. observations the Fisher information of the joint model is $nI(\theta)$, so the bound becomes \begin{align*} \operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{1}{n} \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta), \end{align*} which completes the proof. [/guided] [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Multivariate Cramér-Rao Bound (Theorem # 1844)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Multivariate Cramér-Rao Bound (Theorem # 1844)

Discussion

Proof

Explore Further