MLE Invariance — Statement & Proof

Theorem



Proof

[proofplan] The proof is a direct application of the change-of-variables property for maximisation over a set. Given a bijective reparametrisation $h: \Theta \to \Phi$, the likelihood of the new parameter $\phi$ is defined as the likelihood of the corresponding original parameter $h^{-1}(\phi)$. Maximising over $\phi \in \Phi$ is therefore equivalent to maximising the original likelihood over $\theta \in \Theta$ via the substitution $\theta = h^{-1}(\phi)$. Because $h$ is a bijection, the maximum of $L$ over $\Theta$ is attained at $\hat\theta$ exactly when the maximum of $L^*$ over $\Phi$ is attained at $h(\hat\theta)$. We make this rigorous by reasoning about the suprema and the argmax sets. [/proofplan] [step:Fix the statistical model, the reparametrisation, and the induced likelihood] Let $\{f(\cdot; \theta) : \theta \in \Theta\}$ be a statistical model and let $x \in \mathcal{X}^n$ be an observed sample. The [likelihood function](/page/Likelihood) is \begin{align*} L : \Theta &\to [0, \infty) \\ \theta &\mapsto f(x; \theta). \end{align*} Let $h: \Theta \to \Phi$ be a bijection (the reparametrisation) with inverse $h^{-1}: \Phi \to \Theta$. The induced likelihood in the new parametrisation is \begin{align*} L^* : \Phi &\to [0, \infty) \\ \phi &\mapsto L(h^{-1}(\phi)) = f(x; h^{-1}(\phi)). \end{align*} Assume that $L$ attains its maximum on $\Theta$ at some $\hat\theta \in \Theta$ (the maximum likelihood estimator). We must show that $L^*$ attains its maximum on $\Phi$ at $\hat\phi := h(\hat\theta)$, i.e., that the [MLE](/page/Maximum%20Likelihood%20Estimator) of $\phi$ is $h$ applied to the MLE of $\theta$. [guided] The statement of the theorem is a claim about how maximisation interacts with a bijective change of variables. Before proving anything, we have to be precise about what "the likelihood of $\phi$" means. The original model parametrises densities by $\theta$: for each $\theta \in \Theta$, we have a density $f(\cdot; \theta)$. 
Given data $x$, the [likelihood function](/page/Likelihood) $L(\theta) = f(x; \theta)$ is a function of $\theta \in \Theta$. If we reparametrise by a bijection $h: \Theta \to \Phi$, then each $\phi \in \Phi$ corresponds to a unique $\theta = h^{-1}(\phi) \in \Theta$, which in turn corresponds to a density $f(\cdot; h^{-1}(\phi))$. This is the density of the model at parameter $\phi$ in the new parametrisation, and it is the same density as before, just indexed differently. So the likelihood in the new parametrisation is defined as \begin{align*} L^*(\phi) := L(h^{-1}(\phi)) = f(x; h^{-1}(\phi)). \end{align*} This is not a computation; it is the definition of the likelihood under reparametrisation. The only substantive content of the definition is that the probabilistic model (the densities) does not change; only the labels on the parameter space change. The theorem claims: if $\hat\theta$ maximises $L$ over $\Theta$, then $\hat\phi = h(\hat\theta)$ maximises $L^*$ over $\Phi$. Note the hypothesis: $h$ is bijective. This is essential. Without injectivity, a single $\phi$ could correspond to several $\theta$, and $L^*(\phi)$ would not be well-defined (or would have to be defined via a $\sup$ or $\inf$ over the preimage, changing the theorem). Without surjectivity, some $\phi$ would correspond to no $\theta$ at all. We return to this in the final step. [/guided] [/step] [step:Show that the suprema of $L$ and $L^*$ are equal] Because $h: \Theta \to \Phi$ is a bijection, so is its inverse $h^{-1}: \Phi \to \Theta$; in particular $h^{-1}$ is surjective, so $\{h^{-1}(\phi) : \phi \in \Phi\} = \Theta$. Therefore \begin{align*} \sup_{\phi \in \Phi} L^*(\phi) = \sup_{\phi \in \Phi} L(h^{-1}(\phi)) = \sup_{\theta \in \Theta} L(\theta), \end{align*} where in the last equality we used that $\phi \mapsto h^{-1}(\phi)$ traces out all of $\Theta$ as $\phi$ ranges over $\Phi$. [guided] The key observation is a general fact about suprema under a bijection: if $h: A \to B$ is a bijection and $g: A \to \mathbb{R}$ is any function, then \begin{align*} \sup_{a \in A} g(a) = \sup_{b \in B} g(h^{-1}(b)). 
\end{align*} Indeed, as $b$ ranges over $B$, $h^{-1}(b)$ ranges over all of $A$ exactly once (because $h^{-1}$ is also a bijection). So the set of values $\{g(h^{-1}(b)) : b \in B\}$ is identical to the set $\{g(a) : a \in A\}$, and the suprema of these two sets coincide. Applying this with $A = \Theta$, $B = \Phi$, $g = L$, we get \begin{align*} \sup_{\phi \in \Phi} L^*(\phi) = \sup_{\phi \in \Phi} L(h^{-1}(\phi)) = \sup_{\theta \in \Theta} L(\theta). \end{align*} Note how bijectivity enters. Surjectivity of $h$ makes $h^{-1}$ defined on all of $\Phi$, and surjectivity of $h^{-1}$ onto $\Theta$ (every $\theta$ equals $h^{-1}(h(\theta))$) ensures that $h^{-1}$ exhausts $\Theta$. That $h^{-1}$ hits each $\theta$ only once is not needed here, since repeated values do not affect a supremum. So the two suprema are equal. The remaining question is where they are attained. [/guided] [/step] [step:Identify the argmax of $L^*$ as $h(\hat\theta)$] Since $L$ attains its supremum over $\Theta$ at $\hat\theta$, we have $L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta)$. Set $\hat\phi := h(\hat\theta) \in \Phi$. Then \begin{align*} L^*(\hat\phi) = L(h^{-1}(\hat\phi)) = L(h^{-1}(h(\hat\theta))) = L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta) = \sup_{\phi \in \Phi} L^*(\phi), \end{align*} where the third equality uses $h^{-1} \circ h = \operatorname{id}_\Theta$, and the last equality is the supremum identity from the previous step. Therefore $\hat\phi$ attains the supremum of $L^*$ over $\Phi$, i.e., $\hat\phi = h(\hat\theta)$ is an MLE of $\phi$, which is the claim. [guided] We have two facts in hand: \begin{align*} &(i) \quad L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta) \quad \text{(hypothesis)}, \\ &(ii) \quad \sup_{\phi \in \Phi} L^*(\phi) = \sup_{\theta \in \Theta} L(\theta) \quad \text{(previous step)}. \end{align*} We want to show: $L^*(\hat\phi) = \sup_{\phi \in \Phi} L^*(\phi)$, where $\hat\phi = h(\hat\theta)$. Compute $L^*(\hat\phi)$ directly from the definition: \begin{align*} L^*(\hat\phi) = L(h^{-1}(\hat\phi)). 
\end{align*} Since $\hat\phi = h(\hat\theta)$, and $h^{-1}$ is the inverse of $h$ (bijectivity is used here — specifically, injectivity, which ensures $h^{-1}$ is a well-defined function and $h^{-1} \circ h = \operatorname{id}_\Theta$), \begin{align*} h^{-1}(\hat\phi) = h^{-1}(h(\hat\theta)) = \hat\theta. \end{align*} Substituting, \begin{align*} L^*(\hat\phi) = L(\hat\theta). \end{align*} By hypothesis (i), $L(\hat\theta)$ equals $\sup_{\theta \in \Theta} L(\theta)$, and by (ii), this equals $\sup_{\phi \in \Phi} L^*(\phi)$. Chaining, \begin{align*} L^*(\hat\phi) = L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta) = \sup_{\phi \in \Phi} L^*(\phi). \end{align*} This says $\hat\phi$ achieves the supremum of $L^*$ on $\Phi$, i.e., $\hat\phi$ is an MLE of $\phi$. And $\hat\phi = h(\hat\theta)$ by construction — this is exactly the invariance statement: the MLE of a bijective function of $\theta$ is that function applied to the MLE of $\theta$. [/guided] [/step] [step:Record that the MLE is unique iff $L$ has a unique maximiser] The argument above shows that every maximiser of $L^*$ is of the form $h(\theta^*)$ for some maximiser $\theta^*$ of $L$, and conversely. Indeed, if $\phi^* \in \operatorname{argmax}_\phi L^*$, then by the same chain of equalities $L(h^{-1}(\phi^*)) = L^*(\phi^*) = \sup_\theta L(\theta)$, so $h^{-1}(\phi^*) \in \operatorname{argmax}_\theta L$. Therefore \begin{align*} \operatorname{argmax}_{\phi \in \Phi} L^*(\phi) = h\bigl(\operatorname{argmax}_{\theta \in \Theta} L(\theta)\bigr). \end{align*} In particular, the MLE of $\phi$ is unique iff the MLE of $\theta$ is unique, and in the unique case $\hat\phi = h(\hat\theta)$ is the identity that is usually quoted as "MLE invariance". This completes the proof. [guided] The cleanest way to state the result is that argmax sets are related by the bijection $h$: \begin{align*} \operatorname{argmax}_{\phi \in \Phi} L^*(\phi) = h\bigl(\operatorname{argmax}_{\theta \in \Theta} L(\theta)\bigr). 
\end{align*} We verify this is a set equality. Let $\phi^* \in \operatorname{argmax}_\phi L^*$. Then $L^*(\phi^*) = \sup_\phi L^*(\phi) = \sup_\theta L(\theta)$. Since $L^*(\phi^*) = L(h^{-1}(\phi^*))$, it follows that $L(h^{-1}(\phi^*)) = \sup_\theta L(\theta)$, so $h^{-1}(\phi^*) \in \operatorname{argmax}_\theta L$. Applying $h$, $\phi^* = h(h^{-1}(\phi^*)) \in h(\operatorname{argmax}_\theta L)$. This gives one inclusion. Conversely, if $\theta^* \in \operatorname{argmax}_\theta L$, the computation of the previous step shows $h(\theta^*) \in \operatorname{argmax}_\phi L^*$, giving the other inclusion. In the case of a unique MLE (either $L$ or $L^*$ has a unique maximiser, which by the above is equivalent), the argmax sets are singletons and we recover the classical formulation $\hat\phi = h(\hat\theta)$. **Why bijectivity matters.** If $h$ is not injective, two different $\theta$ can map to the same $\phi$, and $L^*(\phi)$ would not be well-defined by our formula. We would have to redefine $L^*(\phi) = \sup\{L(\theta) : h(\theta) = \phi\}$ (the "profile likelihood"), and the invariance statement becomes subtler: $h(\hat\theta)$ still maximises this $L^*$, but the backward direction need not hold, because a $\theta$ in the preimage of a maximiser $\hat\phi$ need not maximise $L$. If $h$ is not surjective, the image $h(\Theta)$ is a proper subset of $\Phi$ and $L^*$ is only defined on the image; outside it, there is no $\theta$ corresponding to the given $\phi$. Both failure modes are ruled out by assuming $h$ is a bijection, as we have done. [/guided] [/step]
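The invariance statement can be checked numerically on a small example. The sketch below is illustrative only and is not part of the proof: it assumes an exponential model with rate parameter $\theta$, the bijection $h(\theta) = 1/\theta$ from $(0, \infty)$ to itself (rate to mean), a made-up four-point sample, and a crude grid search in place of an exact maximiser. All function and variable names are hypothetical.

```python
import math

def log_likelihood_rate(rate, sample):
    """Log-likelihood of an i.i.d. Exponential(rate) sample (assumed model)."""
    return len(sample) * math.log(rate) - rate * sum(sample)

def h(rate):
    """Reparametrisation: rate -> mean, a bijection on (0, inf)."""
    return 1.0 / rate

def h_inverse(mean):
    """Inverse reparametrisation: mean -> rate."""
    return 1.0 / mean

def log_likelihood_mean(mean, sample):
    """Induced likelihood L*(phi) = L(h^{-1}(phi)), on the log scale."""
    return log_likelihood_rate(h_inverse(mean), sample)

# Made-up data for illustration; the sample mean is 2.0.
sample = [0.5, 1.5, 2.0, 4.0]

# Grid over the original parameter space, and its image under h.
rate_grid = [0.01 * k for k in range(1, 501)]
mean_grid = [h(r) for r in rate_grid]

# Maximise each likelihood over its own grid.
rate_hat = max(rate_grid, key=lambda r: log_likelihood_rate(r, sample))
mean_hat = max(mean_grid, key=lambda m: log_likelihood_mean(m, sample))

print("MLE of rate:", rate_hat)
print("MLE of mean:", mean_hat)
print("h(MLE of rate):", h(rate_hat))
```

Because the mean grid is the image of the rate grid under $h$, the two searches range over corresponding points, mirroring the substitution $\theta = h^{-1}(\phi)$ used in the proof. Taking the log does not move the maximiser, since $\log$ is strictly increasing.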

MLE Invariance (Theorem # 1428)