[proofplan]
We reduce to the single-observation case using the tensorization of Fisher information, then apply the Cauchy-Schwarz inequality in $L^2(\mathbb{P}_\theta)$ pairing the centred estimator against each component of the score vector. This produces a covariance identity between the estimator and the score, and the Cauchy-Schwarz bound delivers the information inequality. The gradient of $\Phi$ enters through differentiation under the integral sign of the unbiasedness condition.
[/proofplan]
[step:Reduce to a single-observation Fisher information via tensorization]
Since $X_1, \ldots, X_n$ are i.i.d. with common density $f(\cdot, \theta)$, the Fisher information for the full sample is $I_n(\theta) = nI(\theta)$, where $I(\theta)$ is the single-observation Fisher information matrix. It therefore suffices to prove the bound for a single observation and replace $I(\theta)$ by $I_n(\theta) = nI(\theta)$ at the end.
For notational clarity, we work with a single observation $X \sim f(\cdot, \theta)$ and prove
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta).
\end{align*}
The result for $n$ observations then follows by replacing $I(\theta)$ with $nI(\theta)$.
[guided]
Why reduce to $n = 1$? The joint density of $n$ i.i.d. observations factorizes as $\prod_{i=1}^n f(x_i, \theta)$, so the log-likelihood is a sum and the Fisher information matrix of the joint model is $I_n(\theta) = nI(\theta)$ by the [Fisher Information Tensorizes for I.I.D. Samples](/theorems/1841). Rather than carrying the product density through the entire argument, we prove the single-observation bound and substitute $nI(\theta)$ for $I(\theta)$ at the conclusion.
We work with a single observation $X \sim f(\cdot, \theta)$ and aim to show
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta).
\end{align*}
[/guided]
[/step]
[step:Differentiate the unbiasedness condition to obtain a covariance identity]
Define the score vector $S(X, \theta) = \nabla_\theta \log f(X, \theta) \in \mathbb{R}^p$. By the regularity assumption (interchange of differentiation and integration), for each $j \in \{1, \ldots, p\}$,
\begin{align*}
\frac{\partial}{\partial \theta_j} \mathbb{E}_\theta[\tilde{\Phi}(X)] = \frac{\partial}{\partial \theta_j} \int_{\mathcal{X}} \tilde{\Phi}(x) f(x, \theta) \, d\mu(x) = \int_{\mathcal{X}} \tilde{\Phi}(x) \frac{\partial}{\partial \theta_j} f(x, \theta) \, d\mu(x).
\end{align*}
Writing $\frac{\partial}{\partial \theta_j} f(x, \theta) = f(x, \theta) \cdot \frac{\partial}{\partial \theta_j} \log f(x, \theta) = f(x, \theta) \cdot S_j(x, \theta)$, the right-hand side becomes $\mathbb{E}_\theta[\tilde{\Phi}(X) S_j(X, \theta)]$. Since $\tilde{\Phi}$ is unbiased, $\mathbb{E}_\theta[\tilde{\Phi}(X)] = \Phi(\theta)$, so the left-hand side is $\frac{\partial \Phi}{\partial \theta_j}(\theta)$. Since $\mathbb{E}_\theta[S_j(X, \theta)] = 0$ (the [Score Has Zero Expectation](/theorems/1839)), we obtain the covariance identity
\begin{align*}
\operatorname{Cov}_\theta(\tilde{\Phi}(X), S_j(X, \theta)) = \frac{\partial \Phi}{\partial \theta_j}(\theta), \quad j = 1, \ldots, p.
\end{align*}
In vector form, $\operatorname{Cov}_\theta(\tilde{\Phi}, S) = \nabla_\theta \Phi(\theta) \in \mathbb{R}^p$.
[guided]
The key mechanism is the same as in the scalar [Cramér-Rao Lower Bound](/theorems/1843): differentiate the unbiasedness identity $\mathbb{E}_\theta[\tilde{\Phi}] = \Phi(\theta)$ with respect to each parameter component. The regularity assumption permits interchange of $\partial/\partial \theta_j$ and $\int$; this is the same condition used to prove that the score has mean zero.
For each $j \in \{1, \ldots, p\}$:
\begin{align*}
\frac{\partial \Phi}{\partial \theta_j}(\theta) = \frac{\partial}{\partial \theta_j} \int_{\mathcal{X}} \tilde{\Phi}(x) f(x, \theta) \, d\mu(x) = \int_{\mathcal{X}} \tilde{\Phi}(x) \frac{\partial f}{\partial \theta_j}(x, \theta) \, d\mu(x).
\end{align*}
Now rewrite $\frac{\partial f}{\partial \theta_j} = f \cdot \frac{\partial \log f}{\partial \theta_j} = f \cdot S_j$. Substituting:
\begin{align*}
\frac{\partial \Phi}{\partial \theta_j}(\theta) = \int_{\mathcal{X}} \tilde{\Phi}(x) S_j(x, \theta) f(x, \theta) \, d\mu(x) = \mathbb{E}_\theta[\tilde{\Phi}(X) S_j(X, \theta)].
\end{align*}
Since $\mathbb{E}_\theta[S_j] = 0$, we have $\mathbb{E}_\theta[\tilde{\Phi} \cdot S_j] = \operatorname{Cov}_\theta(\tilde{\Phi}, S_j)$. Stacking all $p$ components into a vector gives $\operatorname{Cov}_\theta(\tilde{\Phi}, S) = \nabla_\theta \Phi(\theta)$.
[/guided]
[/step]
[step:Apply the Cauchy-Schwarz inequality to obtain the information bound]
For any deterministic vector $a \in \mathbb{R}^p$, form the scalar random variable $a^\top S(X, \theta)$. The Cauchy-Schwarz inequality in $L^2(\mathbb{P}_\theta)$ applied to $\tilde{\Phi}(X) - \Phi(\theta)$ and $a^\top S(X, \theta)$ gives
\begin{align*}
\left(\operatorname{Cov}_\theta(\tilde{\Phi}, a^\top S)\right)^2 \leq \operatorname{Var}_\theta(\tilde{\Phi}) \cdot \operatorname{Var}_\theta(a^\top S).
\end{align*}
The left-hand side equals $(a^\top \nabla_\theta \Phi(\theta))^2$ by the covariance identity from the previous step. The right-hand side involves $\operatorname{Var}_\theta(a^\top S) = a^\top I(\theta) a$, since $\operatorname{Cov}_\theta(S) = I(\theta)$ by definition of the Fisher information matrix. Thus for every $a \in \mathbb{R}^p$,
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{(a^\top \nabla_\theta \Phi(\theta))^2}{a^\top I(\theta) a}.
\end{align*}
[guided]
Why introduce an arbitrary vector $a$? In the scalar case ($p = 1$), the Cauchy-Schwarz inequality directly pairs $\tilde{\Phi}$ against the single score. In the multivariate case, the score is a vector in $\mathbb{R}^p$ while $\tilde{\Phi}$ is scalar, so we cannot directly apply Cauchy-Schwarz between objects of different dimension. The remedy is to project the score onto a direction $a$ and then optimize over $a$.
For any $a \in \mathbb{R}^p$, the random variable $a^\top S$ is scalar. The Cauchy-Schwarz inequality for covariance states
\begin{align*}
\left(\operatorname{Cov}_\theta(\tilde{\Phi}, a^\top S)\right)^2 \leq \operatorname{Var}_\theta(\tilde{\Phi}) \cdot \operatorname{Var}_\theta(a^\top S).
\end{align*}
By linearity, $\operatorname{Cov}_\theta(\tilde{\Phi}, a^\top S) = a^\top \operatorname{Cov}_\theta(\tilde{\Phi}, S) = a^\top \nabla_\theta \Phi(\theta)$. The variance of the projection is $\operatorname{Var}_\theta(a^\top S) = a^\top \operatorname{Cov}_\theta(S, S) a = a^\top I(\theta) a$, where $\operatorname{Cov}_\theta(S) = I(\theta)$ by definition.
This gives
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{(a^\top \nabla_\theta \Phi(\theta))^2}{a^\top I(\theta) a}
\end{align*}
for every nonzero $a \in \mathbb{R}^p$.
[/guided]
[/step]
[step:Optimize over $a$ to extract the quadratic form $\nabla_\theta \Phi^\top I^{-1} \nabla_\theta \Phi$]
We maximize the right-hand side over $a \in \mathbb{R}^p \setminus \{0\}$. Write $b = \nabla_\theta \Phi(\theta)$. The ratio $(a^\top b)^2 / (a^\top I(\theta) a)$ is maximized by setting $a = I(\theta)^{-1} b$. This is valid because $I(\theta)$ is positive definite by the regularity assumptions, so $I(\theta)^{-1}$ exists. Substituting $a = I(\theta)^{-1} b$:
\begin{align*}
\frac{(b^\top I(\theta)^{-1} b)^2}{b^\top I(\theta)^{-1} I(\theta) I(\theta)^{-1} b} = \frac{(b^\top I(\theta)^{-1} b)^2}{b^\top I(\theta)^{-1} b} = b^\top I(\theta)^{-1} b = \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta).
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta).
\end{align*}
For the full $n$-observation model, $I(\theta)$ is replaced by $I_n(\theta) = nI(\theta)$, giving
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{1}{n} \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta).
\end{align*}
[guided]
The optimization step is a standard fact from linear algebra: for a positive definite matrix $M$ and a nonzero vector $b$, the maximum of $(a^\top b)^2 / (a^\top M a)$ over $a \neq 0$ equals $b^\top M^{-1} b$, attained at $a = M^{-1} b$. To verify this, note that by the substitution $c = M^{1/2} a$, the ratio becomes $(c^\top M^{-1/2} b)^2 / (c^\top c)$, which by Cauchy-Schwarz in $\mathbb{R}^p$ is at most $|M^{-1/2} b|^2 = b^\top M^{-1} b$.
Applying this with $M = I(\theta)$ and $b = \nabla_\theta \Phi(\theta)$, the optimal choice $a = I(\theta)^{-1} \nabla_\theta \Phi(\theta)$ yields
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta).
\end{align*}
Finally, for $n$ i.i.d. observations the Fisher information of the joint model is $nI(\theta)$, so the bound becomes
\begin{align*}
\operatorname{Var}_\theta(\tilde{\Phi}) \geq \frac{1}{n} \nabla_\theta \Phi(\theta)^\top I(\theta)^{-1} \nabla_\theta \Phi(\theta),
\end{align*}
which completes the proof.
[/guided]
[/step]