[proofplan]
The proof is a direct application of the change-of-variables property for maximisation over a set. Given a bijective reparametrisation $h: \Theta \to \Phi$, the likelihood of the new parameter $\phi$ is defined as the likelihood of the corresponding original parameter $h^{-1}(\phi)$. Maximising over $\phi \in \Phi$ is therefore equivalent to maximising the original likelihood over $\theta \in \Theta$ via the substitution $\theta = h^{-1}(\phi)$. Because $h$ is a bijection, the maximum of $L$ over $\Theta$ is attained at $\hat\theta$ exactly when the maximum of $L^*$ over $\Phi$ is attained at $h(\hat\theta)$. We make this rigorous by reasoning about the suprema and the argmax sets.
[/proofplan]
[step:Fix the statistical model, the reparametrisation, and the induced likelihood]
Let $\{f(\cdot; \theta) : \theta \in \Theta\}$ be a statistical model and let $x \in \mathcal{X}^n$ be an observed sample. The [likelihood function](/page/Likelihood) is
\begin{align*}
L : \Theta &\to [0, \infty) \\
\theta &\mapsto f(x; \theta).
\end{align*}
Let $h: \Theta \to \Phi$ be a bijection (the reparametrisation) with inverse $h^{-1}: \Phi \to \Theta$. The induced likelihood in the new parametrisation is
\begin{align*}
L^* : \Phi &\to [0, \infty) \\
\phi &\mapsto L(h^{-1}(\phi)) = f(x; h^{-1}(\phi)).
\end{align*}
Assume that $L$ attains its maximum on $\Theta$ at some $\hat\theta \in \Theta$ (a maximum likelihood estimate of $\theta$). We must show that $L^*$ attains its maximum on $\Phi$ at $\hat\phi := h(\hat\theta)$, i.e., that an [MLE](/page/Maximum%20Likelihood%20Estimator) of $\phi$ is obtained by applying $h$ to the MLE of $\theta$.
[guided]
The statement of the theorem is a claim about how maximisation interacts with a bijective change of variables. Before proving anything, we have to be precise about what "the likelihood of $\phi$" means.
The original model parametrises densities by $\theta$: for each $\theta \in \Theta$, we have a density $f(\cdot; \theta)$. Given data $x$, the [likelihood function](/page/Likelihood) $L(\theta) = f(x; \theta)$ is a function of $\theta \in \Theta$.
If we reparametrise by a bijection $h: \Theta \to \Phi$, then each $\phi \in \Phi$ corresponds to a unique $\theta = h^{-1}(\phi) \in \Theta$, which in turn corresponds to a density $f(\cdot; h^{-1}(\phi))$. This is the density of the model at parameter $\phi$ in the new parametrisation — and it is the same density as before, just indexed differently. So the likelihood in the new parametrisation is defined as
\begin{align*}
L^*(\phi) := L(h^{-1}(\phi)) = f(x; h^{-1}(\phi)).
\end{align*}
This is not a computation — it is the definition of the likelihood under reparametrisation. The only substantive content of the definition is that the probabilistic model (the densities) does not change; only the labels on the parameter space change.
The theorem claims: if $\hat\theta$ maximises $L$ over $\Theta$, then $\hat\phi = h(\hat\theta)$ maximises $L^*$ over $\Phi$.
Note the hypothesis: $h$ is bijective. This is essential. Without injectivity, a single $\phi$ could correspond to several $\theta$, and $L^*(\phi)$ would not be well-defined by the formula above; it would have to be redefined via a $\sup$ over the preimage of $\phi$, which changes the theorem. We return to this in the final step.
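As a concrete illustration (the exponential model is an assumption chosen for this example, not part of the theorem), let $x = (x_1, \dots, x_n)$ be an i.i.d. sample from an exponential distribution with rate $\lambda \in \Theta = (0, \infty)$, and write $s = \sum_{i=1}^n x_i$. Reparametrise by the mean $\mu = h(\lambda) = 1/\lambda$, so that $\Phi = (0, \infty)$ and $h^{-1}(\mu) = 1/\mu$. The two likelihoods are then
\begin{align*}
L(\lambda) &= \lambda^n e^{-\lambda s}, \\
L^*(\mu) &= L(1/\mu) = \mu^{-n} e^{-s/\mu}.
\end{align*}
No Jacobian factor appears, because the likelihood is a function of the parameter and not a density in it. Both expressions describe the same family of densities, labelled in two different ways.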
[/guided]
[/step]
[step:Show that the suprema of $L$ and $L^*$ are equal]
Because $h: \Theta \to \Phi$ is a bijection, so is its inverse $h^{-1}: \Phi \to \Theta$. In particular $h^{-1}$ is surjective, so $\{h^{-1}(\phi) : \phi \in \Phi\} = \Theta$. Therefore
\begin{align*}
\sup_{\phi \in \Phi} L^*(\phi) = \sup_{\phi \in \Phi} L(h^{-1}(\phi)) = \sup_{\theta \in \Theta} L(\theta),
\end{align*}
where in the last equality we used that $\phi \mapsto h^{-1}(\phi)$ traces out all of $\Theta$ as $\phi$ ranges over $\Phi$.
[guided]
The key observation is a general fact about suprema under a bijection: if $h: A \to B$ is a bijection and $g: A \to \mathbb{R}$ is any function, then
\begin{align*}
\sup_{a \in A} g(a) = \sup_{b \in B} g(h^{-1}(b)).
\end{align*}
Indeed, as $b$ ranges over $B$, $h^{-1}(b)$ ranges over all of $A$ exactly once (because $h^{-1}$ is also a bijection). So the set of values $\{g(h^{-1}(b)) : b \in B\}$ is identical to the set $\{g(a) : a \in A\}$, and the suprema of these two sets coincide.
Applying this with $A = \Theta$, $B = \Phi$, $g = L$, we get
\begin{align*}
\sup_{\phi \in \Phi} L^*(\phi) = \sup_{\phi \in \Phi} L(h^{-1}(\phi)) = \sup_{\theta \in \Theta} L(\theta).
\end{align*}
Note which part of bijectivity this step actually uses. Surjectivity of $h^{-1}$ onto $\Theta$ guarantees that every value $L(\theta)$ occurs among the values $L^*(\phi)$, and this is all the supremum identity needs. Injectivity of $h^{-1}$ only says that no value is listed twice, which has no effect on a supremum.
So the two suprema are equal. The remaining question is where they are attained.
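Attainment is a separate question, as a small example (supplied here for illustration) shows. Take a single Bernoulli observation $x = 1$ with success probability $\theta \in \Theta = (0, 1)$, and reparametrise by the odds $\phi = h(\theta) = \theta / (1 - \theta)$, so that $\Phi = (0, \infty)$ and $h^{-1}(\phi) = \phi / (1 + \phi)$. Then
\begin{align*}
L(\theta) = \theta, \qquad L^*(\phi) = \frac{\phi}{1 + \phi}, \qquad \sup_{\theta \in (0,1)} L(\theta) = 1 = \sup_{\phi \in (0,\infty)} L^*(\phi).
\end{align*}
Neither supremum is attained, so no MLE exists in either parametrisation, yet the two suprema agree. The equality of suprema therefore holds without the attainment hypothesis, which is needed only to locate the maximiser.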
[/guided]
[/step]
[step:Identify the argmax of $L^*$ as $h(\hat\theta)$]
Since $L$ attains its supremum over $\Theta$ at $\hat\theta$, we have $L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta)$. Set $\hat\phi := h(\hat\theta) \in \Phi$. Then
\begin{align*}
L^*(\hat\phi) = L(h^{-1}(\hat\phi)) = L(h^{-1}(h(\hat\theta))) = L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta) = \sup_{\phi \in \Phi} L^*(\phi),
\end{align*}
where the third equality uses $h^{-1} \circ h = \operatorname{id}_\Theta$, and the last equality is the supremum identity from the previous step. Therefore $\hat\phi$ attains the supremum of $L^*$ over $\Phi$, i.e., $\hat\phi = h(\hat\theta)$ is an MLE of $\phi$.
[guided]
We have two facts in hand:
\begin{align*}
&(i) \quad L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta) \quad \text{(hypothesis)}, \\
&(ii) \quad \sup_{\phi \in \Phi} L^*(\phi) = \sup_{\theta \in \Theta} L(\theta) \quad \text{(previous step)}.
\end{align*}
We want to show: $L^*(\hat\phi) = \sup_{\phi \in \Phi} L^*(\phi)$, where $\hat\phi = h(\hat\theta)$.
Compute $L^*(\hat\phi)$ directly from the definition:
\begin{align*}
L^*(\hat\phi) = L(h^{-1}(\hat\phi)).
\end{align*}
Since $\hat\phi = h(\hat\theta)$, and $h^{-1}$ is the inverse of $h$ (this is where injectivity of $h$ enters: it makes $h^{-1}$ single-valued, so that $h^{-1} \circ h = \operatorname{id}_\Theta$),
\begin{align*}
h^{-1}(\hat\phi) = h^{-1}(h(\hat\theta)) = \hat\theta.
\end{align*}
Substituting,
\begin{align*}
L^*(\hat\phi) = L(\hat\theta).
\end{align*}
By hypothesis (i), $L(\hat\theta)$ equals $\sup_{\theta \in \Theta} L(\theta)$, and by (ii), this equals $\sup_{\phi \in \Phi} L^*(\phi)$. Chaining,
\begin{align*}
L^*(\hat\phi) = L(\hat\theta) = \sup_{\theta \in \Theta} L(\theta) = \sup_{\phi \in \Phi} L^*(\phi).
\end{align*}
This says $\hat\phi$ achieves the supremum of $L^*$ on $\Phi$, i.e., $\hat\phi$ is an MLE of $\phi$. And $\hat\phi = h(\hat\theta)$ by construction — this is exactly the invariance statement: the MLE of a bijective function of $\theta$ is that function applied to the MLE of $\theta$.
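The invariance can also be checked numerically. The following is a minimal sketch, assuming an i.i.d. exponential sample with rate $\lambda$ reparametrised by its mean $\mu = 1/\lambda$. The data values, the search interval, and the helper `golden_max` are hypothetical choices made for this illustration. It maximises the two log-likelihoods independently and compares the results.

```python
import math

# Hypothetical data, chosen only for illustration.
data = [0.8, 1.9, 0.4, 2.7, 1.2]
n, total = len(data), sum(data)


def loglik_rate(lam):
    """Log-likelihood of an i.i.d. exponential sample in the rate lam > 0."""
    return n * math.log(lam) - lam * total


def h(lam):
    """Reparametrisation rate -> mean, a bijection of (0, inf) onto itself."""
    return 1.0 / lam


def h_inv(mu):
    """Inverse reparametrisation mean -> rate."""
    return 1.0 / mu


def loglik_mean(mu):
    """Induced log-likelihood: the original one evaluated at h_inv(mu)."""
    return loglik_rate(h_inv(mu))


def golden_max(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximiser of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximiser lies in [a, d]
        else:
            a = c  # the maximiser lies in [c, b]
    return (a + b) / 2


# Maximise each likelihood separately, without using the invariance result.
lam_hat = golden_max(loglik_rate, 1e-3, 50.0)
mu_hat = golden_max(loglik_mean, 1e-3, 50.0)

print("lam_hat    =", lam_hat)
print("mu_hat     =", mu_hat)
print("h(lam_hat) =", h(lam_hat))
```

Up to the tolerance of the search, `mu_hat` and `h(lam_hat)` should agree, and the two maximised log-likelihood values should coincide. Log-likelihoods are used for numerical stability. Since $\log$ is strictly increasing, this does not move the maximiser.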
[/guided]
[/step]
[step:Record that the MLE of $\phi$ is unique iff the MLE of $\theta$ is unique]
The argument of the previous step applies to any maximiser of $L$: if $\theta^* \in \operatorname{argmax}_\theta L$, then $h(\theta^*) \in \operatorname{argmax}_\phi L^*$. Conversely, if $\phi^* \in \operatorname{argmax}_\phi L^*$, then $L(h^{-1}(\phi^*)) = L^*(\phi^*) = \sup_\phi L^*(\phi) = \sup_\theta L(\theta)$, so $h^{-1}(\phi^*) \in \operatorname{argmax}_\theta L$ and $\phi^* = h(h^{-1}(\phi^*))$ is the image under $h$ of a maximiser of $L$. Therefore
\begin{align*}
\operatorname{argmax}_{\phi \in \Phi} L^*(\phi) = h\bigl(\operatorname{argmax}_{\theta \in \Theta} L(\theta)\bigr).
\end{align*}
In particular, the MLE of $\phi$ is unique iff the MLE of $\theta$ is unique, and in the unique case $\hat\phi = h(\hat\theta)$ is the identity that is usually quoted as "MLE invariance".
This completes the proof.
[guided]
The cleanest way to state the result is that argmax sets are related by the bijection $h$:
\begin{align*}
\operatorname{argmax}_{\phi \in \Phi} L^*(\phi) = h\bigl(\operatorname{argmax}_{\theta \in \Theta} L(\theta)\bigr).
\end{align*}
We verify this is a set equality.
Let $\phi^* \in \operatorname{argmax}_\phi L^*$. Then $L^*(\phi^*) = \sup_\phi L^*(\phi) = \sup_\theta L(\theta)$. Since $L^*(\phi^*) = L(h^{-1}(\phi^*))$, it follows that $L(h^{-1}(\phi^*)) = \sup_\theta L(\theta)$, so $h^{-1}(\phi^*) \in \operatorname{argmax}_\theta L$. Applying $h$, $\phi^* = h(h^{-1}(\phi^*)) \in h(\operatorname{argmax}_\theta L)$. This gives one inclusion.
Conversely, if $\theta^* \in \operatorname{argmax}_\theta L$, the computation of the previous step shows $h(\theta^*) \in \operatorname{argmax}_\phi L^*$, giving the other inclusion.
In the case of a unique MLE (either $L$ or $L^*$ has a unique maximiser, which by the above is equivalent), the argmax sets are singletons and we recover the classical formulation $\hat\phi = h(\hat\theta)$.
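The set equality can be checked on a toy case where the maximiser is not unique. The following is a minimal sketch, assuming a four-point parameter space with a tie for the maximum. The labels, the likelihood values, and the helper `argmax_set` are all hypothetical and serve only to illustrate the statement.

```python
# Hypothetical finite parameter space with a tie for the maximum.
Theta = ["a", "b", "c", "d"]
L = {"a": 0.2, "b": 0.5, "c": 0.5, "d": 0.1}  # illustrative likelihood values

# A bijection h: Theta -> Phi stored as a dict, together with its inverse.
h = {"a": 10, "b": 20, "c": 30, "d": 40}
h_inv = {phi: theta for theta, phi in h.items()}
Phi = list(h_inv)


def L_star(phi):
    """Induced likelihood: L evaluated at the preimage of phi."""
    return L[h_inv[phi]]


def argmax_set(f, domain):
    """Return the set of all points of `domain` at which f is maximal."""
    best = max(f(z) for z in domain)
    return {z for z in domain if f(z) == best}


argmax_theta = argmax_set(lambda t: L[t], Theta)
argmax_phi = argmax_set(L_star, Phi)
image = {h[t] for t in argmax_theta}  # h applied to the argmax set of L

print(argmax_theta, argmax_phi, image)
```

Here both argmax sets have two elements, and `argmax_phi` coincides with `image`, the image of `argmax_theta` under `h`.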
**Why bijectivity matters.** If $h$ is not injective, two different $\theta$ can map to the same $\phi$, and $L^*(\phi)$ is not well-defined by our formula. We would have to redefine $L^*(\phi) = \sup\{L(\theta) : h(\theta) = \phi\}$ (the induced, or "profile", likelihood), and the invariance statement becomes subtler: $h(\hat\theta)$ still maximises $L^*$, but a preimage of a maximiser of $L^*$ need not maximise $L$. If $h$ is not surjective, then $h(\Theta) \subsetneq \Phi$ and $L^*$ is defined only on the image, since for $\phi$ outside it there is no corresponding $\theta$. Both failure modes are ruled out by assuming $h$ is a bijection, as we have done.
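To illustrate the non-injective case (an example supplied here, not part of the theorem), let $x_1, \dots, x_n$ be i.i.d. $N(\theta, 1)$ with $\theta \in \Theta = \mathbb{R}$, so that $\hat\theta = \bar{x}$, and take $h(\theta) = \theta^2$ with $\Phi = [0, \infty)$. Each $\phi > 0$ has the two preimages $\pm\sqrt{\phi}$, and the supremum-based definition gives
\begin{align*}
L^*(\phi) = \max\bigl\{L(\sqrt{\phi}),\, L(-\sqrt{\phi})\bigr\}, \qquad \hat\phi = \bar{x}^2 = h(\hat\theta).
\end{align*}
When $\bar{x} \neq 0$, the preimage of $\hat\phi$ is $\{\bar{x}, -\bar{x}\}$, but only $\bar{x}$ maximises $L$. The forward statement survives, while the correspondence between argmax sets does not.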
[/guided]
[/step]