[proofplan]
The proof is a direct algebraic expansion of the squared error. We add and subtract the mean $\mathbb{E}_\theta[\hat\theta]$ inside the square, expand the resulting binomial, and identify the three terms. The cross term vanishes because the centred estimator $\hat\theta - \mathbb{E}_\theta[\hat\theta]$ has mean zero, and the factor it is paired with, $\mathbb{E}_\theta[\hat\theta] - \theta$, is a deterministic scalar that can be pulled out of the expectation. The two surviving terms are exactly the variance of $\hat\theta$ and the square of its bias.
[/proofplan]
[step:Decompose the error into a stochastic and a deterministic piece]
Let $\hat\theta: \mathcal{X}^n \to \Theta$ be a measurable estimator based on a sample $X_1, \ldots, X_n$ from a model parameterised by $\theta \in \Theta \subseteq \mathbb{R}$, and assume $\mathbb{E}_\theta[\hat\theta^2] < \infty$ so that all relevant moments exist. Define
\begin{align*}
m(\theta) &:= \mathbb{E}_\theta[\hat\theta], \\
Z(\theta) &:= \hat\theta - m(\theta), \\
b(\theta) &:= m(\theta) - \theta.
\end{align*}
Then $Z(\theta)$ is a random variable with $\mathbb{E}_\theta[Z(\theta)] = 0$ by construction, and $b(\theta)$ is the deterministic bias of $\hat\theta$ at $\theta$. The estimation error decomposes as
\begin{align*}
\hat\theta - \theta = Z(\theta) + b(\theta).
\end{align*}
[guided]
We begin with the mean squared error
\begin{align*}
\operatorname{MSE}_\theta(\hat\theta) = \mathbb{E}_\theta[(\hat\theta - \theta)^2].
\end{align*}
The expression $(\hat\theta - \theta)^2$ mixes two kinds of deviation: the random fluctuation of $\hat\theta$ around its own mean, and the systematic offset between that mean and the true parameter. We want to separate them. The standard device is to insert the mean of $\hat\theta$:
\begin{align*}
\hat\theta - \theta = \underbrace{\bigl(\hat\theta - \mathbb{E}_\theta[\hat\theta]\bigr)}_{\text{random, mean zero}} + \underbrace{\bigl(\mathbb{E}_\theta[\hat\theta] - \theta\bigr)}_{\text{deterministic bias}}.
\end{align*}
To keep notation clean, name the pieces. Let $\hat\theta: \mathcal{X}^n \to \Theta$ be the estimator and assume $\mathbb{E}_\theta[\hat\theta^2] < \infty$ so that the variance, the mean, and the MSE are all finite — this is the integrability hypothesis that makes every expectation below well-defined. Define
\begin{align*}
m(\theta) &:= \mathbb{E}_\theta[\hat\theta], \\
Z(\theta) &:= \hat\theta - m(\theta), \\
b(\theta) &:= m(\theta) - \theta.
\end{align*}
Here $m(\theta)$ is the expected value of $\hat\theta$ when the data are generated by $\mathbb{P}_\theta$, $Z(\theta)$ is the centred version of $\hat\theta$, and $b(\theta)$ is the bias. By linearity of expectation, $\mathbb{E}_\theta[Z(\theta)] = \mathbb{E}_\theta[\hat\theta] - m(\theta) = 0$. And since $m(\theta)$ and $\theta$ are both non-random (under the fixed true parameter $\theta$), $b(\theta)$ is deterministic. We have
\begin{align*}
\hat\theta - \theta = \bigl(\hat\theta - m(\theta)\bigr) + \bigl(m(\theta) - \theta\bigr) = Z(\theta) + b(\theta).
\end{align*}
[/guided]
[/step]
[step:Expand the square and take expectations term by term]
Squaring the decomposition from the previous step and taking the expectation under $\mathbb{P}_\theta$:
\begin{align*}
\mathbb{E}_\theta\bigl[(\hat\theta - \theta)^2\bigr]
&= \mathbb{E}_\theta\bigl[(Z(\theta) + b(\theta))^2\bigr] \\
&= \mathbb{E}_\theta\bigl[Z(\theta)^2\bigr] + 2\,b(\theta)\,\mathbb{E}_\theta\bigl[Z(\theta)\bigr] + b(\theta)^2.
\end{align*}
The second equality uses linearity of expectation and the fact that $b(\theta)$ is a constant (with respect to the randomness under $\mathbb{P}_\theta$), so it factors out of each expectation.
[guided]
We now compute $\mathbb{E}_\theta[(\hat\theta - \theta)^2]$ using the decomposition $\hat\theta - \theta = Z(\theta) + b(\theta)$ from the previous step. Squaring the binomial,
\begin{align*}
(\hat\theta - \theta)^2 = Z(\theta)^2 + 2\,Z(\theta)\,b(\theta) + b(\theta)^2.
\end{align*}
Next apply $\mathbb{E}_\theta[\cdot]$. Linearity of expectation lets us move it past the sum, and the key observation is that $b(\theta)$ is a deterministic scalar: it does not depend on the data, so it behaves like a constant inside the expectation. Therefore $\mathbb{E}_\theta[Z(\theta)\,b(\theta)] = b(\theta)\,\mathbb{E}_\theta[Z(\theta)]$, and $\mathbb{E}_\theta[b(\theta)^2] = b(\theta)^2$. Collecting,
\begin{align*}
\mathbb{E}_\theta\bigl[(\hat\theta - \theta)^2\bigr]
= \mathbb{E}_\theta[Z(\theta)^2] + 2\,b(\theta)\,\mathbb{E}_\theta[Z(\theta)] + b(\theta)^2.
\end{align*}
This has three terms; we examine each next.
[/guided]
[/step]
[step:Identify the cross term as zero]
By the construction of $Z(\theta) = \hat\theta - m(\theta)$ in the first step, $\mathbb{E}_\theta[Z(\theta)] = 0$. Therefore the middle term vanishes:
\begin{align*}
2\,b(\theta)\,\mathbb{E}_\theta[Z(\theta)] = 0.
\end{align*}
[guided]
The cross term is $2\,b(\theta)\,\mathbb{E}_\theta[Z(\theta)]$. Recall that $Z(\theta) = \hat\theta - m(\theta)$ and $m(\theta) = \mathbb{E}_\theta[\hat\theta]$, so
\begin{align*}
\mathbb{E}_\theta[Z(\theta)] = \mathbb{E}_\theta[\hat\theta] - m(\theta) = m(\theta) - m(\theta) = 0.
\end{align*}
This is the entire reason for centring: subtracting the mean makes the first moment vanish, and the cross term therefore drops out. Without this cancellation, MSE would not split cleanly into variance and squared bias — it would also carry a covariance-like interaction term. Here the cross term is exactly zero.
[/guided]
[/step]
[step:Identify the remaining terms as variance and squared bias]
The first surviving term is
\begin{align*}
\mathbb{E}_\theta\bigl[Z(\theta)^2\bigr] = \mathbb{E}_\theta\bigl[(\hat\theta - \mathbb{E}_\theta[\hat\theta])^2\bigr] = \operatorname{Var}_\theta(\hat\theta),
\end{align*}
by the definition of variance. The second surviving term is
\begin{align*}
b(\theta)^2 = \bigl(\mathbb{E}_\theta[\hat\theta] - \theta\bigr)^2 = \operatorname{bias}_\theta(\hat\theta)^2,
\end{align*}
by the definition of bias. Substituting into the expansion from the second step and using the vanishing of the cross term established in the third step,
\begin{align*}
\mathbb{E}_\theta\bigl[(\hat\theta - \theta)^2\bigr] = \operatorname{Var}_\theta(\hat\theta) + \operatorname{bias}_\theta(\hat\theta)^2.
\end{align*}
This is the bias-variance decomposition, and completes the proof.
[/step]