[proofplan]
We compute the composite by running a polynomial-time machine for $f$ on the original input and then running a polynomial-time machine for $g$ on the resulting output. The only point needing care is that the running time of the second machine is controlled by the length of $f(x)$, so we first bound $|f(x)|$ by the running time of the machine computing $f$. Since a polynomial composed with a polynomial is again polynomial, the total running time is polynomial in $|x|$.
[/proofplan]
[step:Choose polynomial-time machines for the two functions]
Since $f$ is polynomial-time computable, there exist a deterministic Turing machine $M_f$ and a polynomial $p: \mathbb{N} \to \mathbb{N}$ such that, for every input word $x \in \Sigma^*$, the machine $M_f$ halts with output $f(x)$ in at most $p(|x|)$ steps.
Since $g$ is polynomial-time computable, there exist a deterministic Turing machine $M_g$ and a polynomial $q: \mathbb{N} \to \mathbb{N}$ such that, for every input word $y \in \Gamma^*$, the machine $M_g$ halts with output $g(y)$ in at most $q(|y|)$ steps.
[/step]
[step:Build the machine that computes the composite]
Define a deterministic Turing machine $M$ as follows. On input $x \in \Sigma^*$, first run $M_f$ on $x$ and store the output word $y := f(x) \in \Gamma^*$. Then run $M_g$ on input $y$ and output the resulting word $g(y) \in \Delta^*$.
For every $x \in \Sigma^*$, this machine halts and outputs
\begin{align*}
g(y) = g(f(x)) = (g \circ f)(x).
\end{align*}
Thus $M$ computes $g \circ f$.
[/step]
[step:Bound the intermediate output length by the first running time]
For every $x \in \Sigma^*$, the word $f(x)$ has length at most the number of steps used by $M_f$ on input $x$, up to an additive machine-dependent constant. Absorbing this constant into the polynomial bound, replace $p$ if necessary by a polynomial $p_1: \mathbb{N} \to \mathbb{N}$ such that $M_f$ runs in at most $p_1(|x|)$ steps and
\begin{align*}
|f(x)| \leq p_1(|x|)
\end{align*}
for every $x \in \Sigma^*$.
[guided]
The second stage runs $M_g$ on the word produced by the first stage, so its time bound depends on $|f(x)|$, not directly on $|x|$. We therefore need a length bound for the intermediate word.
A Turing machine can write only a bounded amount of output per computation step; in the usual single-symbol-per-step model, it can increase the output length by at most one per step. If the chosen model has a fixed finite output convention with a different constant overhead, that overhead is independent of $x$ and can be absorbed into a polynomial. Therefore there is a polynomial $p_1: \mathbb{N} \to \mathbb{N}$ such that, for every $x \in \Sigma^*$, the first computation both halts within $p_1(|x|)$ steps and produces an output satisfying
\begin{align*}
|f(x)| \leq p_1(|x|).
\end{align*}
This is the key estimate: it converts the running-time guarantee for $M_g$, which is stated in terms of the length of its own input, into a polynomial bound in the length of the original input $x$.
[/guided]
[/step]
[step:Estimate the total running time by a polynomial]
Let $T: \mathbb{N} \to \mathbb{N}$ be the function defined by
\begin{align*}
T(n) = p_1(n) + q(p_1(n)).
\end{align*}
Since $p_1$ and $q$ are polynomials, the composite $q \circ p_1$ is a polynomial, and hence $T$ is a polynomial.
For input $x \in \Sigma^*$, the first stage costs at most $p_1(|x|)$ steps. The second stage costs at most
\begin{align*}
q(|f(x)|) \leq q(p_1(|x|)),
\end{align*}
after replacing $q$ by a nondecreasing polynomial upper bound if necessary. Therefore the total running time of $M$ on input $x$ is at most
\begin{align*}
p_1(|x|) + q(p_1(|x|)) = T(|x|).
\end{align*}
Thus $M$ computes $g \circ f$ in polynomial time.
[/step]
[step:Conclude polynomial-time computability of the composite]
We have constructed a deterministic Turing machine $M$ that computes $g \circ f: \Sigma^* \to \Delta^*$ and halts on every input $x \in \Sigma^*$ within $T(|x|)$ steps, where $T$ is a polynomial. Hence $g \circ f$ is polynomial-time computable.
[/step]