[proofplan]
We prove the variational principle by establishing the two inequalities separately. The lower bound is proved first for ergodic measures: Birkhoff's ergodic theorem controls the orbit averages of $\phi$ on a set of large measure, and Katok's separated-set entropy formula supplies exponentially many separated orbit segments inside that set; the ergodic decomposition then extends the bound to every invariant measure. The upper bound assigns to each weighted separated set a probability measure, averages it along orbits, extracts a weak limit, and uses a finite-partition entropy estimate to bound the exponential growth rate of the weighted sums by measure-theoretic entropy plus the potential integral. Compactness of $X$ supplies weak compactness of the space of probability measures, while continuity of $T$ and $\phi$ ensures that weak limits of the averaged measures are invariant and that the integrals of $\phi$ converge.
[/proofplan]
[step:Fix the dynamical notation and the pressure model]
Let $d:X\times X\to [0,\infty)$ denote the metric on $X$, and let $\mathcal B(X)$ denote the Borel $\sigma$-algebra generated by the metric topology on $X$. Define the iterates $T_k:X\to X$ recursively by $T_0=\operatorname{id}_X$ and $T_{k+1}=T\circ T_k$ for $k\in\mathbb N\cup\{0\}$. For $n\in\mathbb N$, define the Bowen metric $d_n:X\times X\to [0,\infty)$ by
\begin{align*}
d_n(x,y):=\max_{0\le k\le n-1} d(T_k(x),T_k(y)).
\end{align*}
Define the Birkhoff sum $S_n\phi:X\to\mathbb R$ by
\begin{align*}
S_n\phi(x):=\sum_{k=0}^{n-1}\phi(T_k(x)).
\end{align*}
For $\varepsilon>0$, a set $E\subset X$ is called $(n,\varepsilon)$-separated if $d_n(x,y)>\varepsilon$ for all distinct $x,y\in E$. Since $T$ is continuous, $d_n$ induces the same topology as $d$, so $(X,d_n)$ is compact and every $(n,\varepsilon)$-separated set is finite. Define
\begin{align*}
Z_n(\varepsilon):=\sup_E\sum_{x\in E} \exp(S_n\phi(x)),
\end{align*}
where the supremum is over all $(n,\varepsilon)$-separated subsets $E\subset X$. We use the separated-set definition of pressure:
\begin{align*}
P(T,\phi):=\lim_{\varepsilon\downarrow 0}\limsup_{n\to\infty}\frac{1}{n}\log Z_n(\varepsilon).
\end{align*}
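As a brief check that this definition is meaningful, the following sketch records why the limit in $\varepsilon$ exists; the notation $\|\phi\|_\infty:=\max_{x\in X}|\phi(x)|$ is introduced here for illustration. If $0<\varepsilon'<\varepsilon$, every $(n,\varepsilon)$-separated set is also $(n,\varepsilon')$-separated, and a singleton is always separated, so
\begin{align*}
\exp(-n\|\phi\|_\infty)\le Z_n(\varepsilon)\le Z_n(\varepsilon').
\end{align*}
Hence $\varepsilon\mapsto\limsup_{n\to\infty}\frac{1}{n}\log Z_n(\varepsilon)$ is nonincreasing in $\varepsilon$ and bounded below by $-\|\phi\|_\infty$, so the limit as $\varepsilon\downarrow0$ exists in $[-\|\phi\|_\infty,\infty]$.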
[/step]
[step:Prove the lower bound first for ergodic measures and then average over the ergodic decomposition]
Fix an ergodic measure $\nu\in\mathcal M_T(X)$. Let $\eta>0$. Since $X$ is compact and $\phi:X\to\mathbb R$ is continuous, $\phi\in L^1(X,\mathcal B(X),\nu)$. By Birkhoff's ergodic theorem applied to $\phi$, ergodicity gives
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}S_n\phi(x)=\int_X\phi(y)\,d\nu(y)
\end{align*}
for $\nu$-almost every $x\in X$. For each $m\in\mathbb N$, define the Borel tail set
\begin{align*}
B_m:=\left\{x\in X: \frac{1}{n}S_n\phi(x)\ge \int_X\phi(y)\,d\nu(y)-\eta \text{ for every } n\ge m\right\}.
\end{align*}
The almost-everywhere convergence implies $\nu(\bigcup_{m=1}^{\infty}B_m)=1$. Since the sets $B_m$ increase with $m$, continuity from below of the probability measure $\nu$ gives an integer $N_1\in\mathbb N$ such that $B:=B_{N_1}$ satisfies $\nu(B)>1/2$. Thus every $x\in B$ and every $n\ge N_1$ satisfy
\begin{align*}
\frac{1}{n}S_n\phi(x)\ge \int_X\phi(y)\,d\nu(y)-\eta.
\end{align*}
Here ergodicity is the point that turns the Birkhoff limit into the scalar integral rather than a [conditional expectation](/page/Conditional%20Expectation).
Use Katok's separated-set entropy formula in the compact-metric ergodic form. The hypotheses are satisfied because $X$ is compact metric, $T:X\to X$ is continuous, $\nu\in\mathcal M_T(X)$ is ergodic, and $B\in\mathcal B(X)$ has $\nu(B)>1/2$. If $h_\nu(T)<\infty$, then for every sufficiently small $\varepsilon>0$ there are, for all sufficiently large $n$, $(n,\varepsilon)$-separated sets $E_n\subset B$ satisfying
\begin{align*}
|E_n|\ge \exp(n(h_\nu(T)-\eta)).
\end{align*}
If $h_\nu(T)=\infty$, the same formula gives the corresponding statement with $h_\nu(T)-\eta$ replaced by any prescribed number $R>0$; letting $R\to\infty$ at the end gives the infinite lower bound. The formula is applied directly to separated sets inside the positive-measure set $B$, so no measurable partition is needed to infer metric separation. For these $n$, taken also to satisfy $n\ge N_1$ so that the Birkhoff bound holds on $E_n\subset B$,
\begin{align*}
Z_n(\varepsilon)\ge \sum_{x\in E_n}\exp(S_n\phi(x))\ge \exp(n(h_\nu(T)+\int_X\phi(y)\,d\nu(y)-2\eta)).
\end{align*}
Taking $n^{-1}\log$, then $\limsup_{n\to\infty}$, then $\varepsilon\downarrow0$, and finally $\eta\downarrow0$ gives
\begin{align*}
P(T,\phi)\ge h_\nu(T)+\int_X\phi(y)\,d\nu(y).
\end{align*}
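For the case $h_\nu(T)=\infty$, the argument can be sketched as follows, under the assumption (as stated above) that for any prescribed $R>0$ the separated sets satisfy $|E_n|\ge\exp(nR)$ for all sufficiently small $\varepsilon$ and all sufficiently large $n$. For such $n$ with $n\ge N_1$,
\begin{align*}
Z_n(\varepsilon)\ge |E_n|\exp\left(n\left(\int_X\phi(y)\,d\nu(y)-\eta\right)\right)\ge \exp\left(n\left(R+\int_X\phi(y)\,d\nu(y)-\eta\right)\right).
\end{align*}
Taking $n^{-1}\log$, then $\limsup_{n\to\infty}$, then $\varepsilon\downarrow0$ gives $P(T,\phi)\ge R+\int_X\phi\,d\nu-\eta$. Since the integral is finite and $R>0$ is arbitrary, $P(T,\phi)=\infty$, which is the displayed inequality in this case.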
Now let $\mu\in\mathcal M_T(X)$ be arbitrary. Let $\mathcal E_T(X)$ denote the Borel subset of $\mathcal M_T(X)$ consisting of ergodic invariant Borel probability measures, where $\mathcal M_T(X)$ carries the [weak topology](/page/Weak%20Topology). By the [ergodic decomposition theorem](/theorems/3453), there is a Borel probability measure $\tau$ on $\mathcal E_T(X)$ such that for every $f\in C(X)$,
\begin{align*}
\int_X f(x)\,d\mu(x)=\int_{\mathcal E_T(X)}\left(\int_X f(x)\,d\nu(x)\right)d\tau(\nu).
\end{align*}
The entropy-affinity theorem for the ergodic decomposition and the displayed affine identity for integrals give
\begin{align*}
h_\mu(T)+\int_X\phi(y)\,d\mu(y)=\int_{\mathcal E_T(X)}\left(h_\nu(T)+\int_X\phi(y)\,d\nu(y)\right)\,d\tau(\nu).
\end{align*}
Since each integrand is at most $P(T,\phi)$ by the ergodic case, the same inequality holds for the integral. Thus
\begin{align*}
P(T,\phi)\ge h_\mu(T)+\int_X\phi(y)\,d\mu(y).
\end{align*}
Taking the supremum over $\mu\in\mathcal M_T(X)$ proves the lower bound.
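The passage from the integrand bound to the integral bound can be spelled out as follows, assuming $P(T,\phi)<\infty$ (otherwise there is nothing to prove) and taking the measurability of $\nu\mapsto h_\nu(T)$ to be part of the entropy-affinity theorem:
\begin{align*}
\int_{\mathcal E_T(X)}\left(h_\nu(T)+\int_X\phi\,d\nu\right)d\tau(\nu)\le\int_{\mathcal E_T(X)}P(T,\phi)\,d\tau(\nu)=P(T,\phi)\,\tau(\mathcal E_T(X))=P(T,\phi).
\end{align*}
The last equality uses only that $\tau$ is a probability measure on $\mathcal E_T(X)$.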
[guided]
Fix an ergodic invariant Borel probability measure $\nu\in\mathcal M_T(X)$ and let $\eta>0$. The ergodicity assumption is essential in this first part: Birkhoff's ergodic theorem then says that the Birkhoff averages of the [continuous function](/page/Continuous%20Function) $\phi:X\to\mathbb R$ converge $\nu$-almost everywhere to the scalar integral $\int_X\phi\,d\nu$, not merely to a conditional expectation. Since $X$ is compact and $\phi$ is continuous, $\phi\in L^1(X,\mathcal B(X),\nu)$, so the theorem applies. Hence
\begin{align*}
\lim_{n\to\infty}\frac{1}{n}S_n\phi(x)=\int_X\phi(y)\,d\nu(y)
\end{align*}
for $\nu$-almost every $x\in X$.
Birkhoff's theorem gives pointwise convergence, so we still need to extract a single time threshold that works for every point in a positive-measure set. For each $m\in\mathbb N$, define the Borel tail set
\begin{align*}
B_m:=\left\{x\in X: \frac{1}{n}S_n\phi(x)\ge \int_X\phi(y)\,d\nu(y)-\eta \text{ for every } n\ge m\right\}.
\end{align*}
The almost-everywhere convergence says that almost every $x$ eventually belongs to one of these tail sets, so $\nu(\bigcup_{m=1}^{\infty}B_m)=1$. The sets are increasing in $m$, and continuity from below for the probability measure $\nu$ gives an integer $N_1\in\mathbb N$ such that $B:=B_{N_1}$ satisfies $\nu(B)>1/2$. By the definition of $B$, every $x\in B$ and every $n\ge N_1$ satisfy
\begin{align*}
\frac{1}{n}S_n\phi(x)\ge \int_X\phi(y)\,d\nu(y)-\eta.
\end{align*}
We now need many points in $B$ that are actually separated in the Bowen metric. A measurable partition with small atoms would not be enough, because different atoms can have points arbitrarily close to one another. Instead we use Katok's separated-set entropy formula in its compact-metric ergodic form. Its hypotheses are exactly the following: $X$ is compact metric, $T:X\to X$ is continuous, $\nu$ is an ergodic $T$-invariant Borel probability measure, and $B\in\mathcal B(X)$ has positive $\nu$-measure. These have already been verified: compactness and continuity are hypotheses of the theorem, $\nu$ was chosen ergodic in $\mathcal M_T(X)$, and the tail-set argument gave $\nu(B)>1/2$. If $h_\nu(T)<\infty$, then for every sufficiently small $\varepsilon>0$ and all sufficiently large $n$, the formula provides an $(n,\varepsilon)$-separated set $E_n\subset B$ such that
\begin{align*}
|E_n|\ge \exp(n(h_\nu(T)-\eta)).
\end{align*}
If $h_\nu(T)=\infty$, the same conclusion holds with $h_\nu(T)-\eta$ replaced by any finite number $R>0$, and letting $R\to\infty$ after the pressure estimate gives the desired infinite lower bound.
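One way to obtain such separated sets is from the covering form of Katok's formula; the following is a sketch, and the symbol $N_\nu(n,\varepsilon,\delta)$ is notation introduced here, not used elsewhere in this proof. For $\delta\in(0,1)$, let $N_\nu(n,\varepsilon,\delta)$ be the minimal number of open $d_n$-balls of radius $\varepsilon$ whose union has $\nu$-measure at least $1-\delta$. In this form the formula states, for ergodic $\nu$,
\begin{align*}
h_\nu(T)=\lim_{\varepsilon\downarrow0}\liminf_{n\to\infty}\frac{1}{n}\log N_\nu(n,\varepsilon,\delta).
\end{align*}
If $E_n\subset B$ is a maximal $(n,\varepsilon)$-separated subset of $B$, then every point of $B$ lies within $d_n$-distance $\varepsilon$ of $E_n$, so the open $d_n$-balls of radius $2\varepsilon$ centred at the points of $E_n$ cover $B$. Since $\nu(B)>1/2$, this gives
\begin{align*}
|E_n|\ge N_\nu(n,2\varepsilon,1/2),
\end{align*}
and the lower bound on $|E_n|$ follows for small $\varepsilon$ and large $n$.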
For every such $n$ with $n\ge N_1$ and every $x\in E_n\subset B$, the Birkhoff lower bound gives
\begin{align*}
\exp(S_n\phi(x))\ge \exp(n(\int_X\phi(y)\,d\nu(y)-\eta)).
\end{align*}
Therefore
\begin{align*}
Z_n(\varepsilon)\ge \sum_{x\in E_n}\exp(S_n\phi(x))\ge \exp(n(h_\nu(T)+\int_X\phi(y)\,d\nu(y)-2\eta)).
\end{align*}
Taking logarithms, dividing by $n$, passing to the limsup, letting $\varepsilon\downarrow0$, and then letting $\eta\downarrow0$ gives
\begin{align*}
P(T,\phi)\ge h_\nu(T)+\int_X\phi(y)\,d\nu(y).
\end{align*}
To pass from ergodic measures to an arbitrary invariant measure $\mu\in\mathcal M_T(X)$, use the ergodic decomposition. Let $\mathcal E_T(X)$ denote the Borel subset of $\mathcal M_T(X)$ consisting of ergodic invariant Borel probability measures, with $\mathcal M_T(X)$ equipped with the weak topology. The ergodic decomposition theorem gives a Borel probability measure $\tau$ on $\mathcal E_T(X)$ such that for every $f\in C(X)$,
\begin{align*}
\int_X f(x)\,d\mu(x)=\int_{\mathcal E_T(X)}\left(\int_X f(x)\,d\nu(x)\right)d\tau(\nu).
\end{align*}
The entropy-affinity theorem for this decomposition and the displayed barycentric identity imply that the entropy and potential terms are affine:
\begin{align*}
h_\mu(T)+\int_X\phi(y)\,d\mu(y)=\int_{\mathcal E_T(X)}\left(h_\nu(T)+\int_X\phi(y)\,d\nu(y)\right)\,d\tau(\nu).
\end{align*}
The ergodic case bounds every integrand by $P(T,\phi)$, so the integral is also bounded by $P(T,\phi)$. Hence
\begin{align*}
P(T,\phi)\ge h_\mu(T)+\int_X\phi(y)\,d\mu(y).
\end{align*}
Taking the supremum over invariant measures proves the lower bound.
[/guided]
[/step]
[step:Extract invariant measures from weighted separated sets]
Fix $\varepsilon>0$. For each $n\in\mathbb N$, choose an $(n,\varepsilon)$-separated set $E_n\subset X$ satisfying
\begin{align*}
\sum_{x\in E_n}\exp(S_n\phi(x))\ge \frac{1}{2}Z_n(\varepsilon).
\end{align*}
Define the probability measure $\nu_n$ on $X$ by
\begin{align*}
\nu_n:=\frac{1}{\sum_{x\in E_n}\exp(S_n\phi(x))}\sum_{x\in E_n}\exp(S_n\phi(x))\delta_x,
\end{align*}
where $\delta_x$ denotes the Dirac probability measure at $x$. For a Borel probability measure $\rho$ on $X$, define the pushforward $T_*\rho$ by
\begin{align*}
(T_*\rho)(A):=\rho(T^{-1}(A))
\end{align*}
for every Borel set $A\subset X$. For $k\in\{0,\dots,n-1\}$, write $(T_k)_*\rho$ for the pushforward by the iterate $T_k:X\to X$. Define the averaged measure $\mu_n$ by
\begin{align*}
\mu_n:=\frac{1}{n}\sum_{k=0}^{n-1}(T_k)_*\nu_n.
\end{align*}
The compactness of $X$ implies weak compactness of the space of Borel probability measures on $X$, so some subsequence $\mu_{n_j}$ converges weakly to a Borel probability measure $\mu$. For every $f\in C(X)$,
\begin{align*}
\int_X f(x)\,d(T_*\mu_n)(x)-\int_X f(x)\,d\mu_n(x)=\frac{1}{n}\left(\int_X f(T_n(x))\,d\nu_n(x)-\int_X f(x)\,d\nu_n(x)\right).
\end{align*}
The absolute value of the right-hand side is at most $2\|f\|_\infty/n$, so it tends to $0$. Along the subsequence $n_j$, [weak convergence](/page/Weak%20Convergence) of $\mu_{n_j}$ to $\mu$ gives
\begin{align*}
\lim_{j\to\infty}\int_X f(x)\,d\mu_{n_j}(x)=\int_X f(x)\,d\mu(x),
\end{align*}
because $f\in C(X)$. Since $T:X\to X$ is continuous, $f\circ T:X\to\mathbb R$ is also continuous, so weak convergence also gives
\begin{align*}
\lim_{j\to\infty}\int_X f(Tx)\,d\mu_{n_j}(x)=\int_X f(Tx)\,d\mu(x)=\int_X f(x)\,d(T_*\mu)(x).
\end{align*}
Combining these limits with the identity above, whose right-hand side tends to $0$, yields
\begin{align*}
\int_X f(x)\,d(T_*\mu)(x)=\int_X f(x)\,d\mu(x)
\end{align*}
for every $f\in C(X)$. Continuous functions determine Borel probability measures on the compact [metric space](/page/Metric%20Space) $X$, hence $T_*\mu=\mu$ and $\mu\in\mathcal M_T(X)$. Also, by continuity of $\phi$,
\begin{align*}
\lim_{j\to\infty}\int_X\phi(x)\,d\mu_{n_j}(x)=\int_X\phi(x)\,d\mu(x).
\end{align*}
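The identity behind the invariance argument can be checked by a telescoping computation, sketched here. Since $T_*\big((T_k)_*\nu_n\big)=(T\circ T_k)_*\nu_n=(T_{k+1})_*\nu_n$,
\begin{align*}
T_*\mu_n-\mu_n&=\frac{1}{n}\sum_{k=0}^{n-1}(T_{k+1})_*\nu_n-\frac{1}{n}\sum_{k=0}^{n-1}(T_k)_*\nu_n\\
&=\frac{1}{n}\left((T_n)_*\nu_n-(T_0)_*\nu_n\right)=\frac{1}{n}\left((T_n)_*\nu_n-\nu_n\right).
\end{align*}
Integrating $f\in C(X)$ against both sides and using $\int_X f\,d(T_n)_*\nu_n=\int_X f\circ T_n\,d\nu_n$ gives the displayed difference of integrals, whose absolute value is at most $2\|f\|_\infty/n$ because $\nu_n$ is a probability measure.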
[/step]
[step:Bound the weighted separated sums by the entropy distribution principle]
For each $m\in\mathbb N$, let $\Delta_m:=\{p=(p_1,\dots,p_m)\in[0,1]^m: \sum_{i=1}^m p_i=1\}$ denote the finite probability simplex. Define the Shannon entropy map $H_m:\Delta_m\to[0,\infty)$ by
\begin{align*}
H_m(p):=-\sum_{i=1}^m p_i\log p_i,
\end{align*}
with the convention $0\log0=0$. For a finite Borel partition $\alpha=\{A_1,\dots,A_m\}$ and a Borel probability measure $\rho$ on $X$, define the partition entropy $H_\rho(\alpha)\in[0,\infty)$ by
\begin{align*}
H_\rho(\alpha):=H_m((\rho(A_1),\dots,\rho(A_m))).
\end{align*}
For $n\in\mathbb N$, define the joined partition $\alpha[n]$ by
\begin{align*}
\alpha[n]:=\bigvee_{k=0}^{n-1}T_k^{-1}\alpha.
\end{align*}
Here $T_k^{-1}\alpha:=\{T_k^{-1}(A):A\in\alpha\}$ is the inverse-image partition under the iterate $T_k:X\to X$.
For a $T$-invariant Borel probability measure $\rho$ on $X$ and a finite Borel partition $\alpha$ of $X$, define the partition entropy rate $h_\rho(T,\alpha)\in[0,\infty)$ by
\begin{align*}
h_\rho(T,\alpha):=\lim_{q\to\infty}\frac{1}{q}H_\rho(\alpha[q]).
\end{align*}
The limit exists because $T$-invariance of $\rho$ makes $q\mapsto H_\rho(\alpha[q])$ subadditive, so Fekete's lemma applies. Define the measure-theoretic entropy $h_\rho(T)\in[0,\infty]$ by
\begin{align*}
h_\rho(T):=\sup_\alpha h_\rho(T,\alpha),
\end{align*}
where the supremum is over finite Borel partitions $\alpha$ of $X$.
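As a quick illustration of these definitions (a sketch, with $M$ denoting the number of nonempty atoms of $\alpha[q]$, a symbol introduced here), concavity of $t\mapsto-t\log t$ and Jensen's inequality give $H_m(p)\le\log m$, with equality at the uniform vector $p=(1/m,\dots,1/m)$. Since $\alpha[q]$ has $M\le m^q$ nonempty atoms when $\alpha$ has $m$ atoms,
\begin{align*}
H_\rho(\alpha[q])\le\log M\le q\log m,\qquad\text{so}\qquad h_\rho(T,\alpha)\le\log m.
\end{align*}
In particular each $h_\rho(T,\alpha)$ is finite, while the supremum $h_\rho(T)$ over all finite partitions may be infinite.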
We use the Misiurewicz entropy-distribution estimate for pressure as an auxiliary result; its proof is independent of the variational principle itself. It is the standard finite-partition entropy estimate obtained from partitions with $\mu$-null boundaries, the log-sum inequality, and upper semicontinuity of partition entropy along weak convergence for such partitions. Assume that $X$ is a compact metric space, that $T:X\to X$ and $\phi:X\to\mathbb R$ are continuous, that $\varepsilon>0$ is fixed, that $n_j\to\infty$, that each $E_{n_j}$ is $(n_j,\varepsilon)$-separated, that $\nu_{n_j}$ is the weighted probability measure
\begin{align*}
\nu_{n_j}:=\frac{1}{\sum_{x\in E_{n_j}}\exp(S_{n_j}\phi(x))}\sum_{x\in E_{n_j}}\exp(S_{n_j}\phi(x))\delta_x,
\end{align*}
and the averaged measures
\begin{align*}
\mu_{n_j}:=\frac{1}{n_j}\sum_{k=0}^{n_j-1}(T_k)_*\nu_{n_j}
\end{align*}
converge weakly to a $T$-invariant Borel probability measure $\mu$. Then
\begin{align*}
\limsup_{j\to\infty}\frac{1}{n_j}\log\sum_{x\in E_{n_j}}\exp(S_{n_j}\phi(x))\le h_\mu(T)+\int_X\phi(x)\,d\mu(x).
\end{align*}
All hypotheses of this estimate have been verified in the preceding step: compactness of $X$, continuity of $T$ and $\phi$, separatedness of $E_{n_j}$, the displayed weighted definition of $\nu_{n_j}$, weak convergence of $\mu_{n_j}$, and invariance of the limit $\mu$. The finite-partition proof uses the entropy quantities $H_\rho(\alpha)$ and $h_\rho(T,\alpha)$ defined above. It chooses a partition $\alpha$ whose atoms have diameter less than $\varepsilon$ and boundaries of $\mu$-measure zero, so that each atom of $\alpha[n_j]$ contains at most one point of $E_{n_j}$; it applies the log-sum inequality to the atomic weights on $E_{n_j}$; and it passes to the weak limit using boundary-null continuity of atom masses. Thus the estimate controls the moving-length joined entropies directly and does not replace $H_{\nu_{n_j}}$ by $H_{\mu_{n_j}}$ without proof. Since the chosen sets satisfy
\begin{align*}
\sum_{x\in E_{n_j}}\exp(S_{n_j}\phi(x))\ge \frac{1}{2}Z_{n_j}(\varepsilon),
\end{align*}
the constant $\log 2$ disappears after division by $n_j$, and therefore
\begin{align*}
\limsup_{j\to\infty}\frac{1}{n_j}\log Z_{n_j}(\varepsilon)\le h_\mu(T)+\int_X\phi(x)\,d\mu(x).
\end{align*}
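The core of the finite-partition argument is an exact identity, sketched here under the assumption that $\alpha$ is a finite Borel partition with atoms of diameter less than $\varepsilon$; the abbreviations $n=n_j$, $a_x$, and $p_x$ are introduced only for this sketch. If $x,y$ lie in one atom of $\alpha[n]$, then $d(T_k(x),T_k(y))<\varepsilon$ for $0\le k\le n-1$, so $d_n(x,y)<\varepsilon$; hence each atom of $\alpha[n]$ contains at most one point of $E_n$. Writing $a_x:=\exp(S_n\phi(x))$ and $p_x:=\nu_n(\{x\})=a_x/\sum_{y\in E_n}a_y$,
\begin{align*}
H_{\nu_n}(\alpha[n])+\int_X S_n\phi\,d\nu_n=\sum_{x\in E_n}p_x\left(\log a_x-\log p_x\right)=\log\sum_{x\in E_n}\exp(S_n\phi(x)),
\end{align*}
and moreover
\begin{align*}
\int_X S_n\phi\,d\nu_n=\sum_{k=0}^{n-1}\int_X\phi\,d(T_k)_*\nu_n=n\int_X\phi\,d\mu_n.
\end{align*}
The remaining work in the estimate is to compare $\frac{1}{n}H_{\nu_n}(\alpha[n])$ with $\frac{1}{q}H_{\mu_n}(\alpha[q])$ for fixed $q$, which is where the null-boundary partitions and the weak limit enter.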
[/step]
[step:Pass to the pressure limit and conclude equality]
Choose a sequence $\varepsilon_m\downarrow0$. For each $m\in\mathbb N$, choose integers $n_{m,j}\to\infty$ such that
\begin{align*}
\lim_{j\to\infty}\frac{1}{n_{m,j}}\log Z_{n_{m,j}}(\varepsilon_m)=\limsup_{n\to\infty}\frac{1}{n}\log Z_n(\varepsilon_m).
\end{align*}
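Apply the preceding construction and entropy-distribution bound along this subsequence, passing to a further subsequence on which the averaged measures converge weakly. Since the limit in the previous display is unchanged along any further subsequence, this produces a measure $\mu_m\in\mathcal M_T(X)$ such that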
\begin{align*}
\limsup_{n\to\infty}\frac{1}{n}\log Z_n(\varepsilon_m)\le h_{\mu_m}(T)+\int_X\phi(x)\,d\mu_m(x).
\end{align*}
Thus, for every $m$,
\begin{align*}
\limsup_{n\to\infty}\frac{1}{n}\log Z_n(\varepsilon_m)\le \sup_{\mu\in\mathcal M_T(X)}\left(h_\mu(T)+\int_X\phi(x)\,d\mu(x)\right).
\end{align*}
Taking $m\to\infty$ and using the separated-set definition of $P(T,\phi)$ gives
\begin{align*}
P(T,\phi)\le \sup_{\mu\in\mathcal M_T(X)}\left(h_\mu(T)+\int_X\phi(x)\,d\mu(x)\right).
\end{align*}
Together with the lower bound already proved, this gives
\begin{align*}
P(T,\phi)=\sup_{\mu\in\mathcal M_T(X)}\left(h_\mu(T)+\int_X\phi(x)\,d\mu(x)\right).
\end{align*}
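As a sanity check, consider a constant potential $\phi\equiv c$; here $s_n(\varepsilon)$ and $h_{\mathrm{top}}(T)$ are notation not used elsewhere in this proof, denoting the maximal cardinality of an $(n,\varepsilon)$-separated set and Bowen's separated-set topological entropy. Then $S_n\phi\equiv nc$ and $Z_n(\varepsilon)=e^{nc}s_n(\varepsilon)$, so
\begin{align*}
P(T,c)=\lim_{\varepsilon\downarrow0}\limsup_{n\to\infty}\frac{1}{n}\log s_n(\varepsilon)+c=h_{\mathrm{top}}(T)+c,
\end{align*}
and the equality just proved reduces to $h_{\mathrm{top}}(T)=\sup_{\mu\in\mathcal M_T(X)}h_\mu(T)$, the classical variational principle for entropy.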
[/step]