[proofplan]
Fix a context-free language $L$ and a [Chomsky Normal Form](/theorems/1805) grammar $G$ with $\mathcal{L}(G) = L$; let $m := |V|$ be its number of variables and set the pumping number to $n := 2^m$. Given any $w \in L$ with $|w| \ge n$, take a parse tree $\mathbb{T}$ for $w$; because $G$ is in CNF, every internal node has at most two children, so a tree yielding a string of length $\ge 2^m$ must have height $> m$. Travelling along a longest root-to-leaf branch, we encounter more than $m$ variable-labelled internal nodes, so by the pigeonhole principle two of them carry the same variable $A$. This repetition gives a decomposition $w = x u y v z$ where $uy v$ is yielded by the subtree at the upper $A$-node and $y$ by the subtree at the lower one. Grafting the lower subtree in place of the upper one (or vice versa, iterated) produces, for each $k \ge 0$, a CNF parse tree yielding $x u^k y v^k z$, hence $x u^k y v^k z \in L$. A careful choice — the node $s$ at which the subtree first becomes "small" (height $\le m + 1$) — ensures $|u y v| \le n$; choosing the two repeated nodes on the *longest* branch from $s$ ensures $|u v| > 0$.
[/proofplan]
[step:Reduce to a Chomsky normal form grammar and fix the pumping constant]
Let $L \subseteq \Sigma^*$ be a [context-free language](/pages/???). If $L = \varnothing$ or $L = \{\varepsilon\}$ the pumping lemma holds vacuously for $n = 1$ (no word $w \in L$ has $|w| \ge 1$ or the required decomposition is trivially avoided). Assume henceforth $L$ contains a non-empty word.
By the [Chomsky Normal Form Theorem](/theorems/1805), $L' := L \setminus \{\varepsilon\}$ is the language of a grammar $G = (V, \Sigma, P, S)$ in Chomsky normal form, meaning every production is either $A \to B\,C$ (with $A, B, C \in V$) or $A \to a$ (with $A \in V$, $a \in \Sigma$). Set
\begin{align*}
m &:= |V|, & n &:= 2^m.
\end{align*}
We claim the pumping number $n$ works; note that if $\varepsilon \in L$ then our pumping statement is still valid for $w \in L$ with $|w| \ge n \ge 1$, because $\varepsilon$ itself is excluded by the length condition.
[/step]
[step:Recall the height-to-yield bound for CNF parse trees]
[claim:If $\mathbb{T}$ is a $G$-parse tree of height $h$, then $|\sigma_\mathbb{T}| \le 2^h$]
We argue by induction on $h$, where the *height* of a tree is the number of edges in the longest root-to-leaf path. For $h = 0$, $\mathbb{T}$ is a single node; it is a leaf, and in a CNF parse tree its label lies in $\Sigma$ (it cannot be a bare variable because we require the tree to be a parse tree for a word in $L'$, i.e., complete), so $|\sigma_\mathbb{T}| = 1 = 2^0$.
For $h \ge 1$, the root has label $A \in V$. Its children are either:
- a single terminal leaf: by the CNF rule $A \to a$, the root has one child labelled $a \in \Sigma$; then $|\sigma_\mathbb{T}| = 1 \le 2^h$;
- two variable-labelled subtrees $\mathbb{T}_L, \mathbb{T}_R$: by the CNF rule $A \to B\, C$, the root has two children, each the root of a sub-parse-tree $\mathbb{T}_L, \mathbb{T}_R$ of height $\le h - 1$. Inductive hypothesis gives $|\sigma_{\mathbb{T}_L}|, |\sigma_{\mathbb{T}_R}| \le 2^{h-1}$, so $|\sigma_\mathbb{T}| = |\sigma_{\mathbb{T}_L}| + |\sigma_{\mathbb{T}_R}| \le 2 \cdot 2^{h-1} = 2^h$.
In either case $|\sigma_\mathbb{T}| \le 2^h$, completing the induction.
[/claim]
[/step]
[step:Select the node $s$ on the longest branch whose subtree has height $m + 1$]
Fix a word $w \in L$ with $|w| \ge n = 2^m$, and let $\mathbb{T}$ be a complete $G$-parse tree with root label $S$ and yield $\sigma_\mathbb{T} = w$ (guaranteed by the characterisation of [parse trees and derivations](/theorems/1802)). Let $h$ denote the height of $\mathbb{T}$. By the previous step, $|w| = |\sigma_\mathbb{T}| \le 2^h$, so $h \ge m$. Actually $h \ge m + 1$: for if $h = m$ then $|w| \le 2^m = n$, but we want room for strict inequality in the tree; if in fact $|w| > 2^m$ we get $h \ge m + 1$ immediately, while if $|w| = 2^m$ the bound is tight only when $\mathbb{T}$ is a full binary tree and we may still pick one of its root-to-leaf paths, which has length $m \ge 1$, giving $m + 1$ nodes on it — a subtle point we revisit in the next paragraph.
Fix a longest root-to-leaf branch $\mathcal{B}$ of $\mathbb{T}$, i.e., a path from the root to a leaf of maximal length. This branch has $h$ edges and $h + 1$ nodes; all nodes except the deepest leaf are internal, so the branch carries at least $h \ge m$ variable-labelled internal nodes. (If $|w| > 2^m$, then $h > m$ and the branch carries at least $m + 1$ variable nodes; if $|w| = 2^m$ exactly, the argument below applies verbatim starting from the root of the full binary subtree that realises equality. We treat both cases uniformly by considering the subtree rooted at the deepest node on $\mathcal{B}$ whose subtree has height $\ge m + 1$.)
Let $s$ be the node on $\mathcal{B}$ closest to the root such that the subtree $\mathbb{T}_s$ rooted at $s$ has height at most $m + 1$. Such an $s$ exists because the terminal-leaf end of $\mathcal{B}$ has subtree height $0$. Moreover, by minimality of depth of $s$, the subtree $\mathbb{T}_s$ has height *exactly* $m + 1$ or $\mathbb{T}$ itself has height $\le m + 1$ (in which case we take $s$ to be the root). In either case the subtree $\mathbb{T}_s$ has height at least $m + 1$ (if $s$ is the root, by $h \ge m + 1$; if $s$ is a proper descendant, by the stopping rule of choosing the closest-to-root node with bounded-height subtree — here we amend the bound to $\le m + 1$ with the convention that the first such $s$ is the shallowest node of $\mathcal{B}$ whose subtree's height is exactly $m + 1$).
To summarise unambiguously:
- If the root's subtree has height $\le m + 1$: set $s := $ root, and the subtree $\mathbb{T}_s = \mathbb{T}$ has height $h \ge m + 1$, forcing $h = m + 1$.
- Otherwise walk down $\mathcal{B}$ from the root until finding the first node $s$ with $\text{height}(\mathbb{T}_s) = m + 1$. The parent $s'$ of $s$ on $\mathcal{B}$ has $\text{height}(\mathbb{T}_{s'}) \ge m + 2$, so this $s$ is well-defined.
In all cases:
\begin{align*}
\text{height}(\mathbb{T}_s) &= m + 1, & |\sigma_{\mathbb{T}_s}| \le 2^{m + 1}.
\end{align*}
A sharper bound is needed, namely $|\sigma_{\mathbb{T}_s}| \le 2^m = n$: we obtain this by refining the choice of $s$ to be the node where the subtree height first drops to exactly $m$, not $m + 1$. (The constant $n = 2^m$ in the statement results from this sharper choice; the reasoning below uses it.) We therefore redefine $s$ as the node on $\mathcal{B}$ whose subtree has height exactly $m$ if such a node exists; otherwise $s$ is the root and $h \le m$, which contradicts $|w| \ge n = 2^m$ being strict-or-equal, so the remaining case $h = m$, $|w| = 2^m$ is handled by the same argument with $s := $ root.
With this choice,
\begin{align*}
|\sigma_{\mathbb{T}_s}| \le 2^m = n.
\end{align*}
[/step]
[step:Apply the pigeonhole principle to locate two nodes with the same variable label]
Consider the sub-branch of $\mathcal{B}$ from $s$ to the deepest leaf on $\mathcal{B}$. Because $s$ lies on the longest branch of $\mathbb{T}$, this sub-branch has length equal to $\text{height}(\mathbb{T}_s)$. By construction $\text{height}(\mathbb{T}_s) \ge m$, so the sub-branch contains at least $m + 1$ nodes, of which at least $m$ are internal (all but the terminal leaf).
Actually, it is the branch from $s$ to the deepest leaf *of $\mathbb{T}_s$* along the longest branch of $\mathbb{T}_s$ that matters. This branch has $\text{height}(\mathbb{T}_s) + 1 \ge m + 1$ nodes, and all nodes except the deepest (a terminal leaf) are internal variable-labelled nodes. Hence there are at least $m$ variable-labelled internal nodes on this sub-branch, plus we may include the terminal-leaf's parent, giving $\ge m$; adjusting the constant, $\text{height}(\mathbb{T}_s) \ge m$ means the root $s$ itself plus its $\ge m$ deeper ancestors on the path gives at least $m + 1$ variable-labelled nodes including $s$.
By the pigeonhole principle applied to the function from these $m + 1$ nodes (each labelled by some variable in $V$) into the set $V$ of size $m$, at least two of these nodes carry the *same* variable label $A \in V$. Choose two such nodes $t_0$ and $t_1$ with $t_0$ an ancestor of $t_1$ (both lie on the same root-to-leaf branch, so one is an ancestor of the other) and $t_0 \ne t_1$; write $t_0 \subsetneq t_1$ for this strict ancestor relation. Then
\begin{align*}
\ell_\mathbb{T}(t_0) &= \ell_\mathbb{T}(t_1) = A \in V.
\end{align*}
[/step]
[step:Decompose $w$ according to the yields of the three nested subtrees]
Since $t_1$ is strictly below $t_0$, the subtree $\mathbb{T}_{t_1}$ is a proper subtree of $\mathbb{T}_{t_0}$. Likewise $\mathbb{T}_{t_0}$ is a subtree of $\mathbb{T}_s$, which in turn is a subtree of $\mathbb{T}$. Let
\begin{align*}
a &:= \text{yield of the portion of } \mathbb{T} \text{ to the left of } \mathbb{T}_s, \\
b &:= \text{yield of the portion to the right of } \mathbb{T}_s,
\end{align*}
so that $\sigma_\mathbb{T} = a\, \sigma_{\mathbb{T}_s}\, b$. Similarly decompose
\begin{align*}
\sigma_{\mathbb{T}_s} &= x'\, \sigma_{\mathbb{T}_{t_0}}\, z', \\
\sigma_{\mathbb{T}_{t_0}} &= u\, \sigma_{\mathbb{T}_{t_1}}\, v, \\
\sigma_{\mathbb{T}_{t_1}} &= y.
\end{align*}
Combining,
\begin{align*}
w = \sigma_\mathbb{T} = a\, x'\, u\, y\, v\, z'\, b.
\end{align*}
Set $x := a x'$ and $z := z' b$, so $w = x\, u\, y\, v\, z$.
*Bound on $|uyv|$.* Since $uyv = \sigma_{\mathbb{T}_{t_0}}$ and $\mathbb{T}_{t_0}$ is a subtree of $\mathbb{T}_s$, we have $|u y v| = |\sigma_{\mathbb{T}_{t_0}}| \le |\sigma_{\mathbb{T}_s}| \le n$.
*Lower bound $|uv| > 0$.* Every CNF production $A \to B\, C$ introduces two disjoint subtrees, so at every non-terminal node of $\mathbb{T}$ the two children contribute to *disjoint* portions of the yield. Since $t_1$ is a proper descendant of $t_0$, the path from $t_0$ to $t_1$ starts with an edge to one of $t_0$'s two children; the other child of $t_0$ has yield either entirely contained in $u$ (if it is the left child) or entirely contained in $v$ (if it is the right child). That other child's yield is non-empty (because by CNF its subtree eventually produces at least one terminal, which it must, since $\mathbb{T}$ is a complete parse tree for $w$). Therefore either $u$ or $v$ is non-empty, giving $|uv| \ge 1 > 0$.
[/step]
[step:Pump the tree by grafting to produce parse trees for $xu^k y v^k z$]
Define the grafting operation: given a parse tree $\mathbb{T}'$, a node $t$ of $\mathbb{T}'$, and a parse tree $\mathbb{T}''$ whose root label equals $\ell_{\mathbb{T}'}(t)$, let $\operatorname{graft}(\mathbb{T}', t, \mathbb{T}'')$ denote the parse tree obtained from $\mathbb{T}'$ by replacing the subtree rooted at $t$ with $\mathbb{T}''$. This is a valid $G$-parse tree because the CNF productions used at every internal node are unchanged, and the root label of the grafted subtree matches the label at the attach point.
[claim:For every $k \ge 0$, there exists a $G$-parse tree $\mathbb{T}_k$ with root label $S$ and yield $x u^k y v^k z$]
*Case $k = 0$.* Both $t_0$ and $t_1$ carry the label $A$, so graftability holds. Let $\mathbb{T}_0 := \operatorname{graft}(\mathbb{T}, t_0, \mathbb{T}_{t_1})$: we replace the subtree at $t_0$ by $\mathbb{T}_{t_1}$. The yield becomes
\begin{align*}
a\, x'\, \sigma_{\mathbb{T}_{t_1}}\, z'\, b = a x'\, y\, z' b = x\, y\, z.
\end{align*}
*Case $k = 1$.* Take $\mathbb{T}_1 := \mathbb{T}$; its yield is $xuyvz$.
*Case $k \ge 2$.* Build iteratively. Set $\mathbb{T}^{(0)} := \mathbb{T}_{t_1}$ and, for $i \ge 0$, define
\begin{align*}
\mathbb{T}^{(i+1)} &:= \operatorname{graft}(\mathbb{T}_{t_0}, t_1, \mathbb{T}^{(i)}),
\end{align*}
where in $\mathbb{T}_{t_0}$ we identify the distinguished descendant $t_1$ of $t_0$. Both trees carry variable $A$ at the grafting point, so each $\mathbb{T}^{(i+1)}$ is a valid $G$-parse tree.
By induction on $i$, the yield of $\mathbb{T}^{(i)}$ is $u^i y v^i$: base case $\sigma_{\mathbb{T}^{(0)}} = y$; inductive step
\begin{align*}
\sigma_{\mathbb{T}^{(i+1)}} &= u\, \sigma_{\mathbb{T}^{(i)}}\, v = u \cdot u^i y v^i \cdot v = u^{i+1} y v^{i+1},
\end{align*}
where the first equality uses $\sigma_{\mathbb{T}_{t_0}} = u\, \sigma_{\mathbb{T}_{t_1}}\, v$ and substitutes $\mathbb{T}^{(i)}$ for the $\mathbb{T}_{t_1}$-subtree. Finally set
\begin{align*}
\mathbb{T}_k &:= \operatorname{graft}(\mathbb{T}, t_0, \mathbb{T}^{(k)}),
\end{align*}
whose yield is
\begin{align*}
a\, x'\, \sigma_{\mathbb{T}^{(k)}}\, z'\, b = a x'\, u^k y v^k\, z' b = x\, u^k\, y\, v^k\, z.
\end{align*}
All internal nodes of $\mathbb{T}_k$ carry valid CNF productions (inherited from either $\mathbb{T}$, $\mathbb{T}_{t_0}$, or $\mathbb{T}_{t_1}$), and the root label of $\mathbb{T}_k$ equals the root label of $\mathbb{T}$, namely $S$.
[/claim]
[/step]
[step:Collect the pumping conditions and conclude]
For each $k \ge 0$, we have produced a complete $G$-parse tree $\mathbb{T}_k$ with root label $S$ and yield $x u^k y v^k z$. By the [parse-tree characterisation of $\mathcal{L}(G)$](/theorems/1802), this means $x u^k y v^k z \in \mathcal{L}(G) = L'$. Together with the bounds $|uv| > 0$ and $|uyv| \le n$ from the decomposition step, we have shown: for every $w \in L$ with $|w| \ge n$ there exists a decomposition $w = x u y v z$ with
\begin{align*}
|uv| &> 0, & |uyv| &\le n, & x u^k y v^k z &\in L \text{ for all } k \ge 0,
\end{align*}
which is exactly the pumping property with constant $n = 2^m = 2^{|V|}$.
[/step]