This course explores the fundamental question: what can be computed, and what can be described? We begin with formal languages — infinite sets of strings captured by finite descriptions — and the automata and grammars that recognise or generate them. Through rewrite systems, we see how simple substitution rules can encode complex behaviour. Regular languages introduce the first automata model (finite state machines) and show that some languages fall outside their reach. Context-free languages extend our expressiveness through grammars, capturing phenomena like nested parentheses that regularity cannot handle. Yet even these have limits. Register machines provide a model of computation closer to real computers, manipulating words in unbounded memory. Finally, computability theory crowns the journey: we prove that some problems — like the halting problem — are fundamentally undecidable, no matter how powerful our machines.
The course is organised around the Chomsky hierarchy, a pyramid of language classes of increasing power. Each level asks: what can be computed or recognised at this tier? Yet power comes with a cost. We discover a critical gap between describing a language (giving a grammar or recognising device) and deciding it (answering "is this word in the language?" algorithmically). Regular languages sit squarely in the decidable zone. Context-free languages mostly do too, though some decision problems become hard. But as we climb toward register machines and general computation, we hit the ceiling: some properties — like whether a machine halts — are provably undecidable. This tension between finite description and infinite behaviour, between what we can express and what we can decide, is the heartbeat of the course.
Each chapter builds naturally on the last. Rewrite systems give us a grammar-like tool to reason about transformations. Regular languages and finite automata establish our baseline — expressive enough for lexical scanning, but too weak for parsing. Context-free languages let us handle recursion and nesting, the grammar of programming languages themselves. Register machines bridge to general computation, making the step to computability tangible. Computability theory then reveals the fundamental limits: through Rice's theorem, we see that almost all non-trivial questions about what programs compute are undecidable. By the end, the reader sees why some problems are hard in practice (complexity), while others are impossible in principle (uncomputability).
# 1. Introduction
This chapter establishes the conceptual and notational foundations of the course. The central question sounds simple but has no naive answer: what does it mean for something to be *computable*? The answer turns out to require a precise mathematical framework — one built from strings over finite alphabets — and the combinatorial properties of these string sets illuminate both the scope and the limits of computation. This chapter motivates the need for formal definitions, introduces the basic objects of the theory, and records the set-theoretic infrastructure that the rest of the course depends on.
## Why Formal Definitions of Computation?
There is a fundamental asymmetry between positive and negative results in mathematics. A positive result — showing that some algorithm exists — does not require a formal definition of "algorithm", because we can agree on a concrete procedure when we see one. A negative result — showing that *no* algorithm can solve a given problem — requires ruling out every conceivable algorithm, and that demands a rigorous definition.
[motivation]
### The Gap Between Existence and Computability
Consider two statements about polynomials $p \in \mathbb{Z}[X]$ of degree $n$:
1. Every polynomial of degree $n$ has a root (in $\mathbb{C}$).
2. There is an algorithm that, given a polynomial of degree $n$, finds a root.
The first statement is the Fundamental Theorem of Algebra, and its proof is purely existential. The second statement is a claim about algorithmic access, and it demands specifying what an algorithm actually is. In many areas of mathematics, existence proofs are known while constructive algorithms are not — and sometimes provably cannot — exist.
### Hilbert's Programme and the Entscheidungsproblem
In 1900, David Hilbert presented his celebrated list of mathematical problems at the International Congress in Paris. The tenth problem asked for an algorithm to decide whether a Diophantine equation — a polynomial equation $p(x_1, \ldots, x_n) = 0$ with coefficients in $\mathbb{Z}$ — has integer solutions. Hilbert expected the answer to be yes. The problem was eventually resolved negatively by Matiyasevich in 1970, building on work of Davis, Putnam, and Robinson: no such algorithm exists.
The Entscheidungsproblem (decision problem) was formulated in 1928 by Ackermann in the book *Grundzüge der theoretischen Logik*: given a first-order logical formula $\varphi$, determine whether $\varphi$ is a tautology — that is, whether it is true under every interpretation of its variables. Again, Hilbert expected a positive solution. Church and Turing independently showed in 1936 that no such algorithm exists. Their proofs required, for the first time, a rigorous mathematical definition of "algorithm".
[/motivation]
The lesson of these examples is clear: to prove a problem is algorithmically unsolvable, one must first pin down exactly what counts as an algorithm. The rest of this course is devoted to making that definition precise, and to understanding the landscape of what is and is not computable.
## Alphabets and String Spaces
Computation operates on data. Before defining what a computation *does*, we must define what it operates *on*. Modern computation encodes all objects — numbers, polynomials, logical formulas, programs — as finite strings over a finite set of symbols, just as digital computers encode everything in bits.
[definition: Alphabet and String Space]
An **alphabet** (or set of symbols) is a set $\Omega$, which we typically assume to be finite. The **set of $\Omega$-strings** $\Omega^\star$ is the set of all finite sequences of elements of $\Omega$, including the empty sequence $\varepsilon$.
More precisely, using the von Neumann identification of $n = \{0, 1, \ldots, n-1\}$, we define $\Omega^n := (n \to \Omega)$ as the set of functions from $n$ to $\Omega$ — equivalently, sequences of length $n$. Then:
\begin{align*}
\Omega^\star = \bigcup_{n \in \mathbb{N}} \Omega^n.
\end{align*}
The **length** of a string $\alpha \in \Omega^\star$ is written $|\alpha| = \operatorname{domain}(\alpha)$.
[/definition]
The choice to work with strings rather than numbers is not a loss of generality. Any finite mathematical object can be encoded as a string: the encoding is part of the model. Binary strings (where $\Omega = \{0,1\}$) suffice for everything, but working with a general alphabet $\Omega$ keeps the formalism cleaner in many settings.
[example: Concrete Alphabets]
Take $\Omega = \{0, 1\}$. Then $\Omega^3$ consists of the eight bit-strings $000, 001, 010, 011, 100, 101, 110, 111$. The empty string $\varepsilon \in \Omega^0$ is the unique element of $\Omega^0 = (0 \to \Omega)$.
Take $\Omega = \{a, b, \ldots, z\}$, the lower-case English letters. Then $\Omega^\star$ contains the empty string, all single letters, all pairs $aa, ab, \ldots, zy, zz$, and so on. The string "hello" is an element of $\Omega^5$.
Now consider the **boundary**: what happens when $\Omega = \varnothing$? The definition gives $\Omega^0 = (0 \to \varnothing)$, which contains exactly one element — the empty function $\varepsilon$. But $\Omega^1 = (1 \to \varnothing) = \varnothing$, since there is no function from a one-element set to the empty set. So $\Omega^\star = \{\varepsilon\}$ when $\Omega = \varnothing$: the empty alphabet admits only the empty string. This is not merely an edge case — it is why the Countability Theorem for $X^\star$ requires $X \neq \varnothing$ (the empty alphabet produces a finite, not infinite, string space). Any algorithm that processes strings over a nonempty alphabet can be contrasted with this degenerate case to test whether a result really depends on having at least one symbol available.
[/example]
## Notational Conventions for Strings
The course uses a carefully set up notation for manipulating strings. These conventions are worth recording explicitly, as they recur throughout.
[definition: String Operations]
Fix an alphabet $\Omega$. The following operations and conventions apply:
- **Natural numbers**: $\mathbb{N} := \{1, 2, 3, \ldots\}$. For indexing string positions and lengths, we use the von Neumann ordinal construction extended to include $0$: write $\mathbb{N}_0 := \{0, 1, 2, \ldots\} = \mathbb{N} \cup \{0\}$, and identify each $n \in \mathbb{N}_0$ with the set $\{0, 1, \ldots, n-1\}$ of all smaller non-negative integers. The string space is then $\Omega^\star = \bigcup_{n \in \mathbb{N}_0} \Omega^n$.
- **String length**: For $\alpha \in \Omega^n$, we write $|\alpha| = n$.
- **Empty string**: $\Omega^0 = (0 \to \Omega)$ has exactly one element, $\varepsilon$, the empty sequence.
- **Truncation**: For $\alpha \in \Omega^n$ and $k \leq n$, the truncation $\alpha|_k \in \Omega^k$ is the unique sequence of length $k$ such that $\alpha|_k \subseteq \alpha$ (viewing sequences as functions, $\alpha|_k$ is the restriction of $\alpha$ to $\{0, \ldots, k-1\}$).
- **Concatenation**: For $\alpha \in \Omega^m$ and $\beta \in \Omega^n$, their concatenation $\alpha\beta \in \Omega^{m+n}$ is defined piecewise:
\begin{align*}
(\alpha\beta)(i) = \begin{cases} \alpha(i) & i < m \\ \beta(i - m) & m \leq i < m + n. \end{cases}
\end{align*}
- **Powers of strings**: By recursion, $\alpha^0 := \varepsilon$ and $\alpha^{n+1} := \alpha \cdot \alpha^n$.
- **Singletons**: An element $x \in \Omega$ is identified with the string $(x) \in \Omega^1$.
- **Language concatenation**: For sets $Y, Z \subseteq \Omega^\star$, write $YZ := \{\alpha\beta \mid \alpha \in Y,\, \beta \in Z\}$. For a singleton $Y = \{\alpha\}$, write $\alpha Z := \{\alpha\beta \mid \beta \in Z\}$.
- **Functorial lift**: A function $f: \Omega \to \Lambda$ lifts to a function $\hat{f}: \Omega^\star \to \Lambda^\star$ by applying $f$ componentwise. The hat is often omitted when context is clear.
[/definition]
The definitions above are more elaborate than they might initially appear — the identification of natural numbers with sets, the uniform treatment of $\Omega^0$, and the explicit piecewise formula for concatenation all serve a purpose. When we come to define Turing machines and register machines, these conventions allow us to encode programs as strings and to reason about string operations algebraically rather than informally. The notation pays for itself.
[remark: Why Von Neumann Ordinals?]
The identification $n = \{0, 1, \ldots, n-1\}$ lets us write $\Omega^n = (n \to \Omega)$ uniformly, treating the length as a set (the domain of the function). This unifies the definitions of $\Omega^0$, $\Omega^n$, and $\Omega^\star$ without introducing separate notation for each case, and makes truncation ($\alpha|_k = \alpha\upharpoonright k$) a standard restriction of functions.
[/remark]
## Countability of String Spaces
A key structural observation is that string spaces over countable alphabets are themselves countable. This has a direct implication for the theory of computation: any object that can be encoded as a string can in principle be indexed by a natural number, and there are only countably many such objects. The uncountability of certain sets — like the power set of an infinite set — then serves as a fundamental source of undecidability results.
Recall the standard definitions: a set $X$ is **countable** if there is a surjection $\mathbb{N} \to X$, and **infinite** if there is an injection $\mathbb{N} \to X$.
[quotetheorem:1774]
[citeproof:1774]
The proof of countability is worth appreciating: it relies on Gödel-style numbering — encoding finite sequences of natural numbers as single natural numbers via prime factorisations. This technique reappears throughout the course when encoding programs as natural numbers (or strings).
[quotetheorem:621]
[citeproof:621]
Cantor's Theorem reveals what the hypothesis of infiniteness does and does not buy us. The theorem **requires** $X$ to be infinite: if $X$ is finite with $|X| = n$, then $|\mathcal{P}(X)| = 2^n$, which is finite and certainly countable. The argument also does **not** say that $\mathcal{P}(X)$ is the only uncountable set, nor that all uncountable sets have the same cardinality — the theorem is a strict lower bound ($|\mathcal{P}(X)| > |X|$), not an upper bound. Looking ahead, the same diagonal argument is the engine behind the undecidability of the Halting Problem: we will construct a diagonal language that cannot be in the range of any computable enumeration of decidable languages.
[remark: Significance For Computability]
Since $\Omega^\star$ is countable but $\mathcal{P}(\Omega^\star)$ is uncountable, there are uncountably many subsets of $\Omega^\star$ — called *languages* — but only countably many algorithms (because algorithms can themselves be encoded as strings). It follows immediately that most languages cannot be decided by any algorithm. This cardinality argument is the deepest single reason why undecidability is pervasive.
[/remark]
The cardinality argument in the remark is worth dwelling on. It is not just a curiosity — it is the single most powerful non-constructive tool in computability theory. It tells us that undecidability is not a pathological exception but the norm: a randomly chosen language is almost certainly undecidable. The theorems that follow sharpen the arithmetic of this observation by identifying which specific sets remain countable.
[quotetheorem:1775]
[citeproof:1775]
Taken together, these three results establish the countability landscape: $X^\star$ and $\operatorname{Fin}(X)$ are countable when $X$ is, but $\mathcal{P}(X)$ escapes to the uncountable. The hypothesis that $X$ is **finite** in the last theorem is essential: if we allowed infinite subsets, the result would fail — $\mathcal{P}(\mathbb{N})$ is uncountable, and even the collection of infinite subsets of $\mathbb{N}$ is uncountable. Restricting to finite subsets is exactly what allows us to mirror the bijection between $X^\star$ and $\operatorname{Fin}(X)$ given by the forgetful map. This finiteness theme will recur: finite automata, context-free grammars, and register machines are all finitary objects, and their finitude is precisely what makes them encodable as strings — and hence indexable. This is exactly the setting in which formal language theory and computability live.
Rewrite systems provide the algebraic framework for generating strings through rule application, formalising the intuition behind grammars. The finite, deterministic nature of these rewrite rules makes them the natural vehicle for studying how strings can be systematically constructed and transformed.
# 2. Rewrite Systems
This chapter introduces rewrite systems, the algebraic machinery underlying formal grammars. Where Chapter 1 established the combinatorial objects — strings, alphabets, and counting arguments — this chapter asks how strings can be transformed into other strings by local substitution rules. The central application is grammars: formal systems that generate languages by iteratively rewriting a start symbol. By the end of the chapter we have the Chomsky hierarchy, the four nested classes of grammars that will organise the rest of the course, together with the first results on which decision problems are solvable within each class.
## Rewrite Rules and Rewrite Systems
Suppose we want to describe how one string can be transformed into another by replacing a piece of it in place — the way a compiler substitutes a macro, or a grammar expands a phrase. The obstacle is finding the right level of generality: the replacement must be local (it acts on a contiguous substring, leaving the rest untouched) and the full system must be finitely described. Rewrite systems are precisely the formalism that packages these two requirements.
Fix a finite symbol set $\Omega$ and let $\Omega^\star$ denote the set of all finite strings over $\Omega$, including the empty string $\varepsilon$.
[definition: Rewrite Rule]
Let $\Omega$ be a finite set of symbols. A **rewrite rule** (or **production rule**) over $\Omega$ is a pair $(\alpha, \beta) \in \Omega^\star \times \Omega^\star$, written $\alpha \to \beta$.
The informal reading of $\alpha \to \beta$ is: find an occurrence of the substring $\alpha$ inside a larger string and replace it with $\beta$.
[/definition]
[definition: Rewrite System]
A **rewrite system** is a pair $R = (\Omega, P)$ where $\Omega$ is a finite set of symbols and $P$ is a finite set of rewrite rules over $\Omega$.
[/definition]
[quotetheorem:1776]
[citeproof:1776]
Countability has an immediate and important consequence: rewrite systems are a meagre resource. There are only countably many of them, yet — as we will see shortly — there are uncountably many languages. Any attempt to describe every language by a grammar is therefore doomed from the start. Most languages cannot be captured by any rewrite system, no matter how complex. This gap between "describable" and "all" will resurface when we reach the undecidability results later in the course.
The key operation is applying a rewrite rule to a string. The rule $\gamma \to \delta$ fires on a string $\sigma$ by finding an occurrence of $\gamma$ as a contiguous substring, and replacing it with $\delta$, leaving the surrounding context unchanged.
[definition: One-Step Rewriting]
Let $R = (\Omega, P)$ be a rewrite system and $\sigma, \tau \in \Omega^\star$. We write $\sigma \xrightarrow{R}_1 \tau$, and say **$R$ rewrites $\sigma$ to $\tau$ in one step**, if there exist strings $\alpha, \beta, \gamma, \delta \in \Omega^\star$ such that
\begin{align*}
\sigma &= \alpha\gamma\beta, \quad \tau = \alpha\delta\beta, \quad \gamma \to \delta \in P.
\end{align*}
The relation $\xrightarrow{R}$ denotes the reflexive-transitive closure of $\xrightarrow{R}_1$.
[/definition]
A sequence $\sigma_0 \xrightarrow{R}_1 \sigma_1 \xrightarrow{R}_1 \cdots \xrightarrow{R}_1 \sigma_n$ is called an **$R$-derivation of length $n$** of $\sigma_n$ from $\sigma_0$. The reflexive-transitive closure $\xrightarrow{R}$ encodes the idea of rewriting in zero or more steps.
[definition: Derivation Set]
For a rewrite system $R = (\Omega, P)$ and a string $\sigma \in \Omega^\star$, the **derivation set** of $\sigma$ is
\begin{align*}
\mathcal{D}(R, \sigma) &= \{\tau \in \Omega^\star \mid \sigma \xrightarrow{R} \tau\}.
\end{align*}
This is the set of all strings reachable from $\sigma$ via the rules of $R$.
[/definition]
## Rewrite Systems and Natural Language
Before proceeding to the formal theory, it is worth pausing to understand why rewrite systems are a natural model for language. The symbols $\Omega$ can represent letters (with $\Omega^\star$ representing words), or words (with $\Omega^\star$ representing sentences), or sentences (with $\Omega^\star$ representing texts). The rewrite rules then encode how valid structures at one level are assembled from components.
Natural languages are in practice finite — there are finitely many words, and sentences cannot be arbitrarily long in everyday use. Nevertheless, Chomsky observed that modelling a language as a finite set forces an awkward choice: one must impose an arbitrary upper bound on sentence length. The phenomenon that makes this unattractive is **linguistic recursion**: if $X$ is a grammatical sentence in English, then "E observes that $X$" is also a grammatical sentence. No finite bound can accommodate all instances of this construction.
Chomsky therefore proposed modelling languages as potentially infinite sets, generated by finitely many rewrite rules. An important distinction must be kept in mind throughout: a sentence being **grammatical** (derivable from the rules) is not the same as being **meaningful**. The sentence "colourless green ideas sleep furiously" is grammatically well-formed but semantically empty, while "furiously sleep ideas green colourless" is neither. Rewrite systems capture grammar, not semantics.
[example: English Phrase-Structure Grammar]
The following rewrite rules form a small generative grammar fragment for English.
\begin{align*}
S &\to \mathrm{NP}\ \mathrm{VP} \\
\mathrm{NP} &\to \mathrm{Adj}\ \mathrm{NP} \\
\mathrm{NP} &\to \mathrm{Noun} \\
\mathrm{VP} &\to \mathrm{Verb} \\
\mathrm{VP} &\to \mathrm{Verb}\ \mathrm{Adv}
\end{align*}
Starting from $S$ and using a dictionary to replace parts of speech with specific words, one can derive "colourless green ideas sleep furiously":
\begin{align*}
S \xrightarrow{}_1 \mathrm{NP}\ \mathrm{VP} \xrightarrow{}_1 \mathrm{Adj}\ \mathrm{NP}\ \mathrm{VP} \xrightarrow{}_1 \cdots \xrightarrow{}_1 \text{colourless green ideas sleep furiously}.
\end{align*}
[/example]
## Grammars
A rewrite system on a single uniform symbol set cannot distinguish the "output" strings — the actual words of a language — from the intermediate working strings produced during a derivation. Without this distinction, we cannot say when a derivation is finished. The fix is to split the symbols into two disjoint classes: terminal symbols, which represent the letters of the target language and can never be rewritten further, and nonterminal symbols, which are auxiliary variables that drive the derivation but must eventually be eliminated. A derivation is complete precisely when no nonterminal symbols remain. This terminal/nonterminal split is the defining feature that turns a bare rewrite system into a grammar.
[definition: Alphabet and Words]
Let $\Sigma$ be a nonempty **alphabet** of **terminal symbols** (or **letters**), and let $V$ be a nonempty set of **variables** (or **nonterminal symbols**), with $\Sigma \cap V = \varnothing$. Set $\Omega = \Sigma \cup V$. Convention: lowercase letters $a, b, c, \ldots$ denote elements of $\Sigma$, uppercase letters $A, B, C, \ldots$ denote elements of $V$, and $u, v, w, \ldots$ denote elements of $\mathbb{W} = \Sigma^\star$.
Elements of $\mathbb{W} = \Sigma^\star$ are called **words**. A subset of $\mathbb{W}$ is called a **language**. Write $\mathbb{W}^+ = \Sigma^\star \setminus \{\varepsilon\}$ for the set of nonempty words.
[/definition]
[remark: Uncountably Many Languages]
There are uncountably many languages over any nonempty alphabet $\Sigma$. This is because the collection of all languages is $\mathcal{P}(\mathbb{W})$, and $\mathbb{W} = \Sigma^\star$ is countably infinite, so $|\mathcal{P}(\mathbb{W})| = 2^{\aleph_0}$.
[/remark]
[definition: Grammar]
A **grammar** is a tuple $G = (\Sigma, V, P, S)$ where $\Sigma$ and $V$ are nonempty disjoint sets of terminal and nonterminal symbols respectively, $R = (\Omega, P)$ (with $\Omega = \Sigma \cup V$) is a rewrite system, and $S \in V$ is the **start symbol**.
The rewriting notation for $R$ applies unchanged to $G$: write $\mathcal{D}(G, \sigma) = \mathcal{D}(R, \sigma)$ and $\sigma \xrightarrow{G} \tau \iff \sigma \xrightarrow{R} \tau$.
The **language generated by $G$** is
\begin{align*}
\mathcal{L}(G) &= \mathcal{D}(G, S) \cap \mathbb{W}.
\end{align*}
That is, $\mathcal{L}(G)$ is the set of all words (pure terminal strings) derivable from the start symbol.
[/definition]
The intersection with $\mathbb{W}$ is essential: the derivation set $\mathcal{D}(G, S)$ will typically contain intermediate strings with variables still present, and these are not words. Only strings over $\Sigma$ alone count as outputs of the grammar.
[example: Empty Language]
If $P$ contains no rule of the form $S \to \alpha$, then $\mathcal{D}(G, S) = \{S\}$, and since $S \notin \mathbb{W}$, we get $\mathcal{L}(G) = \varnothing$. Similarly, if $P$ contains no rule of the form $\alpha \to w$ with $w \in \mathbb{W}$, then no word is ever reachable, and again $\mathcal{L}(G) = \varnothing$.
[/example]
[example: Odd-Length Words]
Let $\Sigma = \{a\}$, $V = \{S\}$, and $P_0 = \{S \to aaS,\ S \to a\}$, giving the grammar $G_0 = (\Sigma, V, P_0, S)$. We claim $\mathcal{L}(G_0) = \{a^{2n+1} \mid n \in \mathbb{N}\}$, the set of odd-length strings over $\{a\}$.
Every production rule preserves the parity of the total symbol count: $S \to aaS$ replaces one variable with two terminals and one variable (net change $+2$ to a count that starts at 1), and $S \to a$ replaces a variable with one terminal. An induction on derivation length shows every derivable terminal word has odd length.
Conversely, given $a^{2n+1}$, applying $S \to aaS$ exactly $n$ times and then $S \to a$ once yields $a^{2(n)+1}$ as required.
Notably, many different rule sets generate the same language. For instance, $P_1 = \{S \to aSa, S \to a\}$, $P_2 = \{S \to Saa, S \to a\}$, and $P_3 = \{S \to aaS, S \to aaS aa, S \to a\}$ all generate $\{a^{2n+1} \mid n \in \mathbb{N}\}$ by the same parity argument. This motivates the notion of grammar equivalence.
[/example]
## Grammar Equivalence and Isomorphism
Two grammars are equivalent if they generate the same language, regardless of their internal structure. Because the variable names in a grammar are purely symbolic, renaming them should not change the language.
[definition: Grammar Equivalence and Isomorphism]
Two grammars $G = (\Sigma, V, P, S)$ and $G' = (\Sigma, V', P', S')$ on the same terminal alphabet $\Sigma$ are **equivalent** if $\mathcal{L}(G) = \mathcal{L}(G')$.
A function $f: \Omega \to \Omega' = \Sigma \cup V'$ (where $\Omega = \Sigma \cup V$) is an **isomorphism** from $G$ to $G'$ if:
1. $f|_\Sigma = \operatorname{id}$ (terminals are fixed);
2. $f(S) = S'$ (start symbols correspond);
3. $f|_V: V \to V'$ is a bijection;
4. $\alpha \to \beta \in P \iff \hat{f}(\alpha) \to \hat{f}(\beta) \in P'$, where $\hat{f}: \Omega^\star \to (\Omega')^\star$ is the extension of $f$ to strings.
[/definition]
[quotetheorem:1777]
[citeproof:1777]
The isomorphism theorem says that the language generated by a grammar depends only on its structure, not on the names of its variables. Two consequences are worth noting. First, the hypotheses cannot be weakened: if $f$ does not fix terminals (condition 1), the generated languages over different terminal alphabets are simply incomparable. If $f$ does not send start symbol to start symbol (condition 2), the derivation trees are rooted differently and the generated languages can diverge. Second — and this is the key subtlety — the converse fails: equivalent grammars need not be isomorphic. Two grammars can generate the same language with completely different numbers of variables, different rule structures, or different derivation strategies. Isomorphism is a sufficient condition for equivalence, not a necessary one; the example with $P_0, P_1, P_2, P_3$ in the Odd-Length Words example already shows this.
The isomorphism criterion has a useful corollary: the particular names chosen for variables are irrelevant.
[quotetheorem:1778]
[citeproof:1778]
Isomorphism allows us to assume without loss of generality that variable sets of two given grammars are disjoint, a hypothesis we will exploit when combining grammars. The next result shows that, even without restricting to isomorphism classes, only countably many languages arise from grammars.
[quotetheorem:1779]
[citeproof:1779]
[remark: Grammar Languages vs All Languages]
The set of languages generated by grammars is countable, while the set of all languages over a nonempty $\Sigma$ is $\mathcal{P}(\mathbb{W})$, which is uncountable. Therefore, most languages are not generated by any grammar. This is an important structural gap and foreshadows the undecidability results later in the course.
[/remark]
## The Chomsky Hierarchy
Production rules can be classified by their structural form. Restricting to rules of a particular form gives rise to a class of grammars, and in turn a class of languages. Chomsky's key insight was that four natural structural restrictions nest cleanly and correspond to increasing computational power.
[definition: Types of Production Rules]
Let $\alpha \to \beta$ be a production rule in a grammar $G = (\Sigma, V, P, S)$ with $\Omega = \Sigma \cup V$. The rule is:
1. **noncontracting** if $|\alpha| \leq |\beta|$;
2. **context-sensitive** if there exist $A \in V$, $\gamma, \delta \in \Omega^\star$, and $\eta \in \Omega^+$ such that $\alpha = \gamma A \delta$ and $\beta = \gamma \eta \delta$ (the variable $A$ is replaced by $\eta$ in the context $\gamma \cdot \delta$);
3. **context-free** if $\alpha = A \in V$ and $|\beta| > 0$ (a single variable rewrites to a nonempty string);
4. **regular** if $\alpha = A \in V$ and $\beta \in \Sigma \cup \Sigma V$ (a variable rewrites to a single terminal, or a terminal followed by a variable).
[/definition]
The hierarchy between these types is strict in one direction:
[quotetheorem:1780]
[citeproof:1780]
A grammar is said to be of type $\mathcal{Q}$ (for $\mathcal{Q}$ any of the above four properties) if all its production rules satisfy $\mathcal{Q}$, and a language is of type $\mathcal{Q}$ if it is generated by some grammar of type $\mathcal{Q}$. Chomsky's original labelling uses numerals:
- **Type 0**: generated by some grammar (no restriction);
- **Type 1**: generated by a context-sensitive grammar;
- **Type 2**: generated by a context-free grammar;
- **Type 3**: generated by a regular grammar.
These four classes are the **Chomsky hierarchy**, and the course will develop machinery for each level. A central theme is that containment is strict: the inclusion Type 3 $\subsetneq$ Type 2 $\subsetneq$ Type 1 $\subsetneq$ Type 0 is genuine. Finding language-theoretic witnesses — a language in one class that cannot be described in a smaller class — requires tools we will develop in later chapters.
A natural question is whether the distinction between noncontracting and context-sensitive grammars matters at the level of languages, or whether it is merely an artefact of rule syntax. Chomsky's theorem resolves this: the two notions coincide for languages.
[quotetheorem:1781]
The forward direction — that every context-sensitive language is noncontracting — is immediate from the rule-level inclusion proved above. The nontrivial direction is the converse: starting from a noncontracting grammar, one must systematically rewrite each rule to fit the context-sensitive pattern $\gamma A \delta \to \gamma \eta \delta$. This requires introducing auxiliary variables to simulate the multi-symbol left-hand sides that a noncontracting grammar can have. The construction proceeds entirely at the grammar level, replacing each noncontracting rule by a sequence of context-sensitive rules acting through the auxiliary variables. What Chomsky's theorem buys us is a clean characterisation of Type 1: it says that the class of context-sensitive languages is exactly the class of languages that a monotone rewriting process — one that never shrinks intermediate strings — can generate.
## Decision Problems
Having classified grammars into four types, the next question is what we can compute about them. Three decision problems arise naturally, and their solvability pattern is surprising: for unrestricted grammars, all three are unsolvable — no algorithm can answer them in general. Yet as soon as we impose the noncontracting condition, the word problem becomes solvable. The emptiness and equivalence problems remain unsolvable even for context-free grammars. This sharp boundary between the solvable and the unsolvable, running through the Chomsky hierarchy, is one of the central themes of the course.
With the hierarchy established, we turn to three fundamental algorithmic questions about grammars.
[definition: Decision Problems for Grammars]
Let $G$ be a grammar and $w \in \mathbb{W}$ a word.
- The **word problem**: given $G$ and $w$, decide whether $w \in \mathcal{L}(G)$.
- The **emptiness problem**: given $G$, decide whether $\mathcal{L}(G) = \varnothing$.
- The **equivalence problem**: given grammars $G$ and $G'$, decide whether $\mathcal{L}(G) = \mathcal{L}(G')$.
A problem is **solvable** if there exists an algorithm that always terminates and gives the correct answer. Otherwise it is **unsolvable**.
[/definition]
Posed for arbitrary (Type 0) grammars, all three problems are unsolvable. This will be proved much later in the course using the theory of register machines and the halting problem. For now, we note a positive result for the restricted class of noncontracting grammars.
[quotetheorem:1782]
[citeproof:1782]
[quotetheorem:1783]
[citeproof:1783]
This is a decidability result, not an efficiency one — the naive algorithm is exponential. Efficient algorithms for the word problem are a topic in their own right: for context-free grammars, the CYK algorithm runs in cubic time; for regular languages, membership can be tested in linear time via automata.
## Closure Properties
The Chomsky hierarchy is only useful if the language classes it defines are robust — they should not collapse or escape when we perform natural operations on languages. Suppose we have a context-free grammar for noun phrases and a context-free grammar for verb phrases. Can we always combine them into a context-free grammar for sentences? If context-free languages were not closed under concatenation, the answer would be no, and the hierarchy would be an awkward tool for linguistic modelling. The closure theorems below show that the answer is yes for union and concatenation across all four types — with one notable exception that already hints at the hierarchy's depth: regular languages need special handling because the new start rule $T \to SS'$ introduces a step beyond the regular pattern.
The operations of interest are concatenation $LM$, union $L \cup M$, intersection $L \cap M$, set difference $L \setminus M$, and relative complement $\mathbb{W}^+ \setminus L$.
[remark: Why $\mathbb{W}^+$ and Not $\mathbb{W}$]
Noncontracting grammars cannot derive the empty word $\varepsilon$, since any derivation from $S$ (a string of length 1) can only produce strings of length $\geq 1$. For this reason, when discussing noncontracting and context-sensitive languages, we work with $\mathbb{W}^+ = \Sigma^\star \setminus \{\varepsilon\}$ and use $\mathbb{W}^+ \setminus L$ as the notion of complement rather than $L^c$ (which would include $\varepsilon$).
[/remark]
To combine two grammars, we need their variable sets to be disjoint. The renaming theorem guarantees we can always arrange this.
[definition: Concatenation and Union Grammars]
Let $G = (\Sigma, V, P, S)$ and $G' = (\Sigma, V', P', S')$ be grammars on the same alphabet $\Sigma$. The **concatenation grammar** is $H = (\Sigma, V \cup V' \cup \{T\}, P^\star, T)$ where $T$ is a new variable and
\begin{align*}
P^\star &= P \cup P' \cup \{T \to SS'\}.
\end{align*}
The **union grammar** is $H' = (\Sigma, V \cup V' \cup \{T\}, P^{\star\star}, T)$ where
\begin{align*}
P^{\star\star} &= P \cup P' \cup \{T \to S,\ T \to S'\}.
\end{align*}
[/definition]
If $V$ and $V'$ share variables, the interaction between the two rule sets $P$ and $P'$ could allow derivations that combine rules from both grammars in unintended ways. Assuming $V \cap V' = \varnothing$ eliminates this interference.
[remark: Disjointness and Interaction]
Without the disjointness condition $V \cap V' = \varnothing$, we only get the inclusions $\mathcal{L}(G)\mathcal{L}(G') \subset \mathcal{L}(H)$ and $\mathcal{L}(G) \cup \mathcal{L}(G') \subset \mathcal{L}(H')$, not equality. The issue is that terminal symbols from $\Sigma$ cannot be relabelled, so overlapping variables in $V, V'$ can enable derivations that mix rules from $G$ and $G'$ in ways not present in either alone.
[/remark]
Before proving the main closure theorem, we need a technical lemma that simplifies the analysis.
[definition: Variable-Based Grammar]
A production rule $\alpha \to \beta$ is **variable-based** if every symbol appearing in $\alpha$ is a variable (i.e., $\alpha \in V^\star$). A grammar is **variable-based** if all its rules are variable-based.
[/definition]
Regular and context-free rules are already variable-based (since $\alpha = A \in V$). Context-sensitive rules are not necessarily variable-based, because the context $\gamma, \delta$ in $\alpha = \gamma A \delta$ may include terminals.
[quotetheorem:1784]
[citeproof:1784]
[remark: Stability Under Variable-Based Conversion]
The variable-based conversion preserves the context-free, context-sensitive, and noncontracting classes, but not regularity in general. The rules added to $P'$ are obtained by replacing terminals with fresh variables in the right-hand sides; this substitution does not change the structural type of a context-free or higher-type rule (a context-free rule $A \to \beta$ becomes $A \to \hat{X}(\beta)$, still context-free). The additional rules $X_a \to a$ map a single variable to a single terminal, which is a regular rule and hence lies in every class. For a concrete illustration of why regularity can fail: a regular rule $A \to ab$ becomes $A \to X_a X_b$ in $P'$, whose right-hand side has two variables and no trailing terminal, so it is no longer regular.
[/remark]
[quotetheorem:1785]
[citeproof:1785]
The closure theorem says nothing about intersection or complementation — and for good reason. Context-free languages are not closed under intersection: the languages $\{a^n b^n c^m \mid n, m \geq 1\}$ and $\{a^m b^n c^n \mid n, m \geq 1\}$ are both context-free, but their intersection $\{a^n b^n c^n \mid n \geq 1\}$ is not. Context-sensitive languages are closed under intersection and complementation, but those proofs use more delicate constructions than the ones developed here. Regular languages are closed under all Boolean operations; this will follow from the automaton characterisation in Chapter 3.
## Handling the Empty Word
Adding the empty word $\varepsilon$ to a language looks like a minor bookkeeping step, but it can silently break a grammar. Suppose we want to convert a grammar $G$ for a nonempty language $L$ into one generating $L \cup \{\varepsilon\}$. The naive approach is to add the rule $S \to \varepsilon$. The problem is that if $S$ appears on the right-hand side of any other rule — say $A \to \alpha S \beta$ — then after adding $S \to \varepsilon$ we can derive $A \to \alpha \beta$ without warning, potentially collapsing derivations and introducing words that were never in $L$. The $\varepsilon$-adequacy condition is designed to block exactly this failure. Noncontracting grammars cannot derive $\varepsilon$, because any derivation of $\varepsilon$ would require a contracting step (reducing a non-empty string to $\varepsilon$). It is sometimes convenient to add $\varepsilon$ to a language. The naive approach — adding the rule $S \to \varepsilon$ — can cause problems if $S$ appears in the right-hand side of other rules, since those rules could then silently introduce or suppress the start symbol in unexpected ways.
[definition: $\varepsilon$-Adequate Grammar]
A rule $\alpha \to \beta$ is **$S$-safe** if $S$ does not occur in $\beta$. A grammar $G = (\Sigma, V, P, S)$ is **$\varepsilon$-adequate** if all its rules are $S$-safe.
In an $\varepsilon$-adequate grammar, adding the rule $S \to \varepsilon$ converts $\mathcal{L}(G)$ into $\mathcal{L}(G) \cup \{\varepsilon\}$ without introducing side effects.
[/definition]
Any grammar can be converted to an equivalent $\varepsilon$-adequate grammar: introduce a new start symbol $T$ and add the single rule $T \to S$. Then $S$ is no longer the start symbol and cannot be introduced by other rules in a problematic way. The new grammar $G' = (\Sigma, V \cup \{T\}, P \cup \{T \to S\}, T)$ has $\mathcal{L}(G') = \mathcal{L}(G)$ and is $\varepsilon$-adequate, so one can subsequently add $T \to \varepsilon$ to include the empty word cleanly.
Regular languages, characterised by finite automata and regular expressions, form the base of the Chomsky hierarchy precisely because their finiteness allows complete mechanisation. Moving beyond regular languages requires mechanisms that can maintain memory across the input — a capability that rewrite systems and context-free grammars provide.
# 3. Regular Languages
This chapter develops the theory of **regular languages** — the simplest and most tractable family in the Chomsky hierarchy. Having established grammars and rewrite systems in the previous chapters, we now characterise regular languages from three equivalent perspectives: grammars with a rigid linear structure, finite automata (deterministic and nondeterministic), and regular expressions. Along the way we prove the pumping lemma (a powerful tool for showing non-regularity), establish closure properties, and show that the emptiness and equivalence problems are decidable for regular grammars.
## Regular Derivations
A regular grammar, by definition, permits only rules of the form $A \to aB$ (a nonterminal rule) or $A \to a$ (a terminal rule). The structural discipline imposed by these two rule types forces every derivation to follow a predictable pattern, which we now make precise.
[definition: Terminal and Nonterminal Rules]
Let $G$ be a regular grammar with variables $V$ and terminals $\Sigma$. A rule of the form $A \to a$ with $A \in V$ and $a \in \Sigma$ is called a **terminal rule**. A rule of the form $A \to aB$ with $A, B \in V$ is called a **nonterminal rule**.
[/definition]
The key structural fact is that every sentential form in a regular derivation has a particularly simple shape: a (possibly empty) string of terminals followed by at most one variable.
[quotetheorem:1786]
[citeproof:1786]
The shape theorem is the reason regular languages are easy to work with computationally. Because every sentential form has at most one variable, and that variable always sits at the rightmost position, a derivation carries no hidden "branching" information — the only choice at each step is which rule to apply to the single current variable. This is precisely the structure a finite automaton can track: the current variable corresponds to the current state. The theorem also explains why regular grammars are sometimes called *right-linear* grammars — the linearity is not just a syntactic restriction but has the semantic consequence proved here.
[quotetheorem:1787]
[citeproof:1787]
[remark: Non-Uniqueness of Derivations]
The derivation of a word $w$ in a regular grammar is not in general unique: multiple sequences of rule applications can produce the same terminal string. This is why we cannot directly build a deterministic automaton from a regular grammar — the nondeterminism in rule choice must be handled separately.
[/remark]
These structural lemmas already have consequences for closure. The following result will be used repeatedly when building up the closure theory.
[quotetheorem:1788]
[citeproof:1788]
This concatenation closure is the key structural fact behind the Kleene operations: once we know how to chain two grammars, iterating the construction yields the Kleene plus. Notice also what the theorem does *not* say: it provides no bound on the size of the combined grammar, and the construction does not preserve the minimality of either $G$ or $G'$. The grammar $H$ may generate duplicated derivation paths if $S'$ appears multiple times — the correctness proof above sidesteps this by tracking where $S'$ is first introduced.
## Deterministic Automata
The structural analysis of regular derivations motivates a more computational model. Rather than generating words by grammar rules, we can recognise them by reading symbols one at a time and updating a finite state.
[definition: Deterministic Automaton]
Let $\Sigma$ be an alphabet. A **deterministic automaton** is a tuple $D = (\Sigma, Q, \delta, q_0, F)$ where:
- $Q$ is a finite set of **states**,
- $q_0 \in Q$ is the **start state**,
- $F \subseteq Q \setminus \{q_0\}$ is the set of **accept states**,
- $\delta : Q \times \Sigma \to Q$ is the **transition function**.
[/definition]
[remark: Convention on the Start State]
The restriction $F \subseteq Q \setminus \{q_0\}$ — forbidding the start state from being an accept state — is a convention of this course. It reflects the standing assumption that languages are subsets of $\mathbb{W}^+ = \Sigma^+$, so the empty word $\varepsilon$ is never accepted. Standard definitions (e.g., in Sipser or Hopcroft–Ullman) allow $q_0 \in F$, and the theory goes through unchanged in that more general setting.
[/remark]
We visualise a deterministic automaton as a labelled directed graph: nodes represent states, accept states are drawn with a double circle, and each state has exactly $|\Sigma|$ outgoing edges labelled by the letters of $\Sigma$.
To formalise what it means for an automaton to "read" a word, we extend $\delta$ from single letters to words by recursion.
[definition: Extended Transition Function]
The **extended transition function** $\hat{\delta} : Q \times \mathbb{W} \to Q$ is defined by $\hat{\delta}(q, \varepsilon) = q$ and $\hat{\delta}(q, aw) = \hat{\delta}(\delta(q, a), w)$. The **language accepted by $D$** is
\begin{align*}
\mathcal{L}(D) = \{ w \mid \hat{\delta}(q_0, w) \in F \}.
\end{align*}
The sequence of states $q_0, q_1, \ldots, q_{|w|}$ encountered while reading $w$ is called the **state sequence** of the computation.
[/definition]
[example: Automaton for Words Containing a Zero]
Over the alphabet $\Sigma = \{0, 1\}$, consider the automaton with states $\{q_0, q_1, q_2\}$, start state $q_0$, accept state $F = \{q_1\}$, and transitions $\delta(q_0, 0) = \delta(q_1, 0) = \delta(q_2, 0) = q_1$, $\delta(q_0, 1) = q_2$, $\delta(q_1, 1) = q_1$, $\delta(q_2, 1) = q_2$. Then $\mathcal{L}(D) = \{w \mid w \text{ contains at least one } 0\}$.
To see this: $\hat{\delta}(q_0, w) = q_0$ only when $w = \varepsilon$. Any word containing a $0$ transitions to $q_1$, and all transitions from $q_1$ return to $q_1$, so such words end in $q_1 \in F$. Words of the form $1^n$ end in $q_2 \notin F$.
[/example]
The right notion of "equivalence" between automata is a structure-preserving map.
[definition: Homomorphism of Automata]
Let $D = (\Sigma, Q, \delta, q_0, F)$ and $D' = (\Sigma, Q', \delta', q_0', F')$. A map $f : Q \to Q'$ is a **homomorphism** from $D$ to $D'$ if:
1. $\delta'(f(q), a) = f(\delta(q, a))$ for all $q \in Q$, $a \in \Sigma$ (transitions are preserved),
2. $f(q_0) = q_0'$ (start states correspond),
3. $q \in F \iff f(q) \in F'$ (accept states correspond).
A bijective homomorphism is called an **isomorphism**.
[/definition]
An induction shows that $\hat{\delta}'(f(q), w) = f(\hat{\delta}(q, w))$ for all words $w$, so homomorphisms preserve the entire computation.
[quotetheorem:1789]
[citeproof:1789]
The central result connecting automata to grammars is that the two models define exactly the same class of languages.
[quotetheorem:1790]
[citeproof:1790]
This direction — automaton to grammar — relies on a perfect dictionary between transitions and grammar rules. The hypothesis that $D$ is deterministic is used in the converse: because every state has exactly one outgoing edge per letter, the only derivations in $G$ are those that faithfully trace state sequences of $D$. A nondeterministic automaton would produce a grammar in which a variable $p$ has multiple rules of the form $p \to aq$ for different $q$, making the derivation non-unique — which is exactly what the next section handles. The theorem also shows that the class of automaton languages and the class of regular languages coincide *upward*: every automaton language is regular. The other inclusion (every regular language is accepted by some automaton) follows from the grammar-to-NFA construction and determinisation.
## Nondeterministic Automata
The preceding theorem gives one direction. To go the other way — from a regular grammar to an automaton — we face an obstacle: a grammar may have multiple rules of the form $A \to aB$ and $A \to aB'$ applicable at the same state, so a straightforward translation produces a machine with non-unique transitions. This motivates nondeterminism.
[definition: Nondeterministic Automaton]
A **nondeterministic automaton** is a tuple $N = (\Sigma, Q, \delta, q_0, F)$ where $Q$, $q_0$, $F$ are as before, but the transition function now maps $\delta : Q \times \Sigma \to \mathcal{P}(Q)$. We interpret $\delta(q, a)$ as the set of states the machine may transition into from $q$ upon reading $a$.
[/definition]
The extended transition function now maps to sets of states:
\begin{align*}
\hat{\delta}(q, \varepsilon) = \{q\}; \qquad \hat{\delta}(q, wa) = \bigcup_{p \in \hat{\delta}(q, w)} \delta(p, a).
\end{align*}
A word is accepted if the resulting set contains at least one accept state:
\begin{align*}
\mathcal{L}(N) = \{ w \mid \hat{\delta}(q_0, w) \cap F \neq \varnothing \}.
\end{align*}
[remark: Deterministic Automata as a Special Case]
A deterministic automaton is a nondeterministic automaton in which $|\delta(q, a)| = 1$ for all $q, a$, i.e., the transition function happens to be single-valued.
[/remark]
Despite the apparent extra power of nondeterminism, the two models are equivalent. The proof uses a **subset construction** that trades nondeterminism for a potentially exponential blowup in state count.
[quotetheorem:1791]
[citeproof:1791]
[remark: State Complexity of Subset Construction]
The construction transforms an $n$-state nondeterministic automaton into a deterministic automaton with up to $2^n$ states. In the worst case this exponential blowup cannot be avoided, but in practice many of the $2^n$ subsets are inaccessible from the start state $\{q_0\}$.
[/remark]
Finally, every regular grammar can be translated into a nondeterministic automaton, completing the circle.
[quotetheorem:1792]
[citeproof:1792]
Combining these three results — grammars produce nondeterministic automata, nondeterministic automata can be determinised, and deterministic automata produce regular grammars — we obtain the equivalence of all three models for the class of regular languages.
## The Pumping Lemma for Regular Languages
The pumping lemma provides the main technique for proving that a language is *not* regular. The idea is that any sufficiently long word accepted by a finite automaton must contain a "loop" in the state sequence, and this loop can be repeated any number of times.
[definition: Pumping Lemma Condition]
A language $L$ **satisfies the pumping lemma with pumping number $n$** if every word $w \in L$ with $|w| \geq n$ can be written as $w = xyz$ such that:
- $|y| > 0$ (the pump is nonempty),
- $|xy| \leq n$ (the pump lies near the start),
- $xy^k z \in L$ for all $k \in \mathbb{N}$ (the pump can be iterated any number of times, including zero).
We call $y$ a **pump** for the word $xyz$.
[/definition]
[quotetheorem:1793]
[citeproof:1793]
[remark: Pumping Implies Infinitely Many Words]
If any word in $L$ can be pumped (i.e., the pump can be repeated arbitrarily), then $L$ contains infinitely many words. In particular, any finite language satisfies the pumping lemma vacuously: choosing $n$ larger than the longest word in $L$ means no word has length $\geq n$, so the condition applies to no word at all.
[/remark]
The pumping lemma is most useful as a contrapositive: if a language fails the pumping lemma, it is not regular.
[example: The Language $\{0^k 1^k : k > 0\}$ Is Not Regular]
Let $L = \{0^k 1^k \mid k > 0\}$. Suppose for contradiction that $L$ is regular with pumping number $N$. The word $0^N 1^N \in L$ has length $2N \geq N$, so it can be pumped: write $0^N 1^N = xyz$ with $|y| > 0$ and $|xy| \leq N$. Since $|xy| \leq N$, both $x$ and $y$ consist entirely of zeros. So $y = 0^\ell$ for some $\ell \geq 1$. Pumping down ($k = 0$) gives $xz = 0^{N-\ell} 1^N$, which has strictly fewer zeros than ones and thus $xz \notin L$ — a contradiction.
This language is context-free (generated by $S \to 0S1 \mid 01$), so this demonstrates that the inclusion of regular languages in context-free languages is proper.
[/example]
[example: The Pumping Lemma Does Not Characterise Regularity]
The converse of the pumping lemma is false. Over $\Sigma = \{0, 1\}$, define the *tail* of a word containing at least one zero to be the number of ones following the last zero. For any set $X \subseteq \mathbb{N}$, let $L_X$ be the set of words over $\Sigma$ that either contain no zeros, or have a tail in $X$.
The map $X \mapsto L_X$ is injective ($X \neq Y \implies L_X \neq L_Y$), so there are uncountably many languages of this form. Since there are only countably many regular languages, some $L_X$ must be non-regular. However, all $L_X$ satisfy the pumping lemma with pumping number 2. Indeed, for $w \in L_X$ with $|w| \geq 2$: if $w = 0z$, take $x = \varepsilon$, $y = 0$; pumping does not change the tail, so all pumped words remain in $L_X$. If $w = 1z$, take $x = \varepsilon$, $y = 1$; pumping does not affect any zero, so again all pumped words remain in $L_X$.
This example shows that the pumping lemma is a necessary but not sufficient condition for regularity.
[/example]
[example: Lower Bound on Automaton Size]
Let $n > 0$ and $L = \{0^n w \mid w \in \mathbb{W}\}$, the set of words beginning with at least $n$ zeros. This language is regular: it is generated by the grammar with rules $S \to 0X_0$, $X_0 \to 0X_1$, $\ldots$, $X_{n-2} \to 0X_{n-1}$, $X_{n-1} \to 0 \mid 1 \mid 0X_{n-1} \mid 1X_{n-1}$. However, any deterministic automaton accepting $L$ must have more than $n$ states. If it had at most $n$ states, the pumping lemma would apply with pumping number $\leq n$, allowing the word $0^n$ to be pumped down, yielding a word $0^{n-\ell} \notin L$ — a contradiction.
[/example]
## Closure Properties of Regular Languages
Closure properties are not just an aesthetic feature of the theory — they are the engine of decision procedures. For instance, to test whether two regular languages $L$ and $L'$ are equivalent, one can ask whether $(L \setminus L') \cup (L' \setminus L) = \varnothing$. This reduces equivalence to emptiness, provided we know the class is closed under complement and intersection. Establishing these closures is therefore the prerequisite for the decidability results at the end of the chapter.
We have already established closure under concatenation and union (from Chapter 2 and the grammar constructions above). We now add complement, intersection, and set difference.
For complement, let $D = (\Sigma, Q, \delta, q_0, F)$ be a deterministic automaton. Define the **complement automaton** $D' = (\Sigma, Q, \delta, q_0, (Q \setminus F) \setminus \{q_0\})$. Since $D$ is deterministic, every word $w \in \mathbb{W}^+$ leads to a unique state $\hat{\delta}(q_0, w)$; this state is in $F$ if and only if it is not in $Q \setminus F$. Because the course convention excludes $q_0$ from accept states, we also remove $q_0$ from the complement accept set to preserve that convention. The result is $\mathcal{L}(D') = \mathbb{W}^+ \setminus \mathcal{L}(D)$ — the complement within the space of nonempty words.
For union and intersection simultaneously, the **product automaton construction** is elegant. Given $D = (\Sigma, Q, \delta, q_0, F)$ and $D' = (\Sigma, Q', \delta', q_0', F')$, form
\begin{align*}
D'' = (\Sigma, Q \times Q', \delta \times \delta', (q_0, q_0'), F'')
\end{align*}
where $(\delta \times \delta')((q, q'), a) = (\delta(q, a), \delta'(q', a))$. The automaton $D''$ tracks both computations in parallel. Choosing $F'' = \{(q, q') \mid q \in F \text{ and } q' \in F'\}$ gives $\mathcal{L}(D'') = \mathcal{L}(D) \cap \mathcal{L}(D')$; choosing $F'' = \{(q, q') \mid q \in F \text{ or } q' \in F'\}$ gives $\mathcal{L}(D'') = \mathcal{L}(D) \cup \mathcal{L}(D')$.
Together these constructions show that regular languages are closed under union, intersection, concatenation, and complement — and therefore also under set difference (since $A \setminus B = A \cap B^c$).
## The Emptiness Problem
One of the fundamental decision problems for a class of languages is the **emptiness problem**: given a grammar, is the language it generates empty? For regular languages this is solvable, thanks to the following lemma relating the pumping number to the minimum word length.
[quotetheorem:1794]
[citeproof:1794]
The theorem makes a sharp, quantitative claim: $n = |Q|$ is not just a bound on pumping lengths but also a bound on how large any minimal witness of non-emptiness can be. Notice what the theorem does *not* say: it does not say there is a word of length exactly $n - 1$, nor does it bound the length of any particular word in $L$. A language could have its shortest word at length 1, or at length $n - 1$, and both are consistent with the bound. The practical upshot is the decidability argument that follows: since short words are finitely many and the automaton can check each one, the search terminates.
[quotetheorem:1795]
[citeproof:1795]
## Regular Expressions
Grammars generate languages by rewriting, and automata recognise them by computation. Both perspectives are operational: to specify a language, one must describe a process. But is there a purely *compositional*, algebraic syntax for regular languages — one where complex languages are built from simple ones by explicit operations, with no machine or derivation in sight? Regular expressions answer this question affirmatively: they provide a closed-form description of exactly the regular languages, analogous to how arithmetic expressions describe integers without appealing to the Peano axioms.
[definition: Kleene Star and Plus]
The **Kleene star** of a language $L$ is
\begin{align*}
L^\star = \{ w \mid \exists \text{ a finite sequence } w_1, w_2, \ldots, w_k \in L \text{ with } w = w_1 w_2 \cdots w_k \}.
\end{align*}
In particular, $\varepsilon \in L^\star$ (the empty concatenation). The **Kleene plus** is $L^+ = L^\star \setminus \{\varepsilon\}$.
[/definition]
[definition: Regular Expression]
A **regular expression** over an alphabet $\Sigma$ is defined inductively:
1. $\varnothing$ is a regular expression;
2. $\varepsilon$ is a regular expression;
3. For each $a \in \Sigma$, $a$ is a regular expression;
4. If $R$, $S$ are regular expressions, so is $(R + S)$ (union);
5. If $R$, $S$ are regular expressions, so is $(RS)$ (concatenation);
6. If $R$ is a regular expression, so is $R^\star$ (Kleene star);
7. If $R$ is a regular expression, so is $R^+$ (Kleene plus).
Nothing else is a regular expression.
[/definition]
By recursion on the structure of a regular expression $E$, we assign a language $\mathcal{L}(E)$:
\begin{align*}
\mathcal{L}(\varnothing) &= \varnothing, & \mathcal{L}(\varepsilon) &= \{\varepsilon\}, & \mathcal{L}(a) &= \{a\}, \\
\mathcal{L}(R + S) &= \mathcal{L}(R) \cup \mathcal{L}(S), & \mathcal{L}(RS) &= \mathcal{L}(R)\mathcal{L}(S), \\
\mathcal{L}(R^\star) &= \mathcal{L}(R)^\star, & \mathcal{L}(R^+) &= \mathcal{L}(R)^+. &&
\end{align*}
Convention: concatenation $RS$ binds more tightly than union $R + S$, so $RS + T$ means $(RS) + T$.
A language $L$ is **essentially regular** if either $L$ is regular, or $L = L' \cup \{\varepsilon\}$ for some regular language $L'$. The subtle distinction arises because some regular grammars do not generate $\varepsilon$, so the Kleene star (which always includes $\varepsilon$) requires this slight relaxation.
[quotetheorem:1796]
[citeproof:1796]
[remark: Kleene's Algorithm]
The converse — that every regular language is the language of some regular expression — is also true. This is sometimes called Kleene's theorem or Kleene's algorithm. Its proof, which constructs a regular expression from an automaton by eliminating states one by one, is not required for this course.
[/remark]
## Minimisation of Deterministic Automata
Given a regular language $L$, there may be many deterministic automata that accept it. The minimisation theory shows there is a unique smallest one, up to isomorphism. Two concepts are central: accessibility (can we reach a state at all?) and distinguishability (can two states be told apart?).
[definition: Accessible and Inaccessible States]
A state $q \in Q$ is **accessible** if there exists a word $w \in \mathbb{W}$ such that $\hat{\delta}(q_0, w) = q$. States that are not accessible are called **inaccessible**.
[/definition]
[definition: Distinguishable and Indistinguishable States]
States $q$ and $q'$ are **distinguished by a word $w$** if $\hat{\delta}(q, w) \in F$ and $\hat{\delta}(q', w) \notin F$, or vice versa. States distinguished by some word are called **distinguishable**; states for which no distinguishing word exists are called **indistinguishable**, written $q \sim q'$.
[/definition]
The indistinguishability relation $\sim$ is an equivalence relation. An automaton is irreducible when it has neither redundant states (inaccessibility) nor redundant distinctions (all states are distinguishable from each other).
[definition: Irreducible Automaton]
A deterministic automaton $D$ is called **irreducible** if:
- all states are accessible, and
- every pair of distinct states is distinguishable.
[/definition]
[remark: Homomorphisms Between Irreducible Automata]
If $f$ is a homomorphism from $D$ to $D'$ and $D$ has all states distinguishable (so no two are identified by $f$), then $f$ is injective. If all states in $D'$ are accessible (all are reachable from $q_0'$), then $f$ is surjective. In particular, any homomorphism between irreducible automata is simultaneously injective and surjective, hence an isomorphism.
[/remark]
Every automaton can be reduced to an irreducible one by two steps: first discard inaccessible states, then collapse indistinguishable states.
The **quotient automaton** is built as follows. Since $\sim$ is an equivalence relation, write $[q]$ for the $\sim$-equivalence class of $q$. The transition $[\delta]([q], a) = [\delta(q, a)]$ is well-defined: if $q \sim q'$ then $\delta(q, a) \sim \delta(q', a)$, because if $\delta(q, a) \not\sim \delta(q', a)$ there would be a word $w$ distinguishing $\delta(q, a)$ from $\delta(q', a)$, and then $aw$ would distinguish $q$ from $q'$. Define
\begin{align*}
D / {\sim} = (\Sigma, Q/{\sim}, [\delta], [q_0], [F])
\end{align*}
where $[F] = \{[q] \mid q \in F\}$. The quotient map $q \mapsto [q]$ is a homomorphism from $D$ to $D/{\sim}$, so $\mathcal{L}(D) = \mathcal{L}(D/{\sim})$.
[quotetheorem:1797]
[citeproof:1797]
The construction proceeds in exactly two stages — pruning inaccessible states, then collapsing indistinguishable ones — and crucially the order matters: pruning first ensures that the indistinguishability quotient is taken only over reachable states, which is where the language-theoretic information lives. Doing the stages in the other order can collapse states that would have been pruned anyway, but does not produce a wrong result; however, the forward order is cleaner to reason about. The hypothesis that $D$ is deterministic is essential: for nondeterministic automata, the corresponding minimisation is more delicate (and the minimal automaton is not in general a quotient of the original). Finally, note that the theorem guarantees *existence* of an irreducible equivalent but says nothing about how to find it efficiently; the table-filling algorithm of the next section provides the computational procedure.
[remark: Size of the Irreducible Automaton]
The number of states in $I$ is at most the number of states in $D$, since we only remove or merge states in constructing $I$.
[/remark]
The payoff of irreducibility is uniqueness: there is essentially only one irreducible automaton for each regular language.
[quotetheorem:1798]
[citeproof:1798]
[remark: The Minimal Automaton]
The irreducible automaton is the unique (up to isomorphism) minimal automaton for a given regular language: it has strictly fewer states than any other deterministic automaton accepting the same language that is not itself irreducible.
[/remark]
## The Equivalence Problem
We have now established solvability of the word problem for noncontracting grammars and the emptiness problem for regular grammars. The final decision problem to address is the **equivalence problem**: given two regular grammars $G$ and $G'$, do they generate the same language?
The strategy is to reduce the equivalence problem to isomorphism checking of minimal automata.
[quotetheorem:1799]
[citeproof:1799]
[quotetheorem:1800]
[citeproof:1800]
[example: Table-Filling Algorithm in Action]
Consider the four-state automaton with $Q = \{q_0, q_1, q_2, q_3\}$, alphabet $\Sigma = \{a, b\}$, transitions
\begin{align*}
\delta(q_0, a) = q_1,\quad \delta(q_0, b) = q_3,\quad \delta(q_1, a) = q_1,\quad \delta(q_1, b) = q_2,\\
\delta(q_2, a) = q_2,\quad \delta(q_2, b) = q_2,\quad \delta(q_3, a) = q_3,\quad \delta(q_3, b) = q_3,
\end{align*}
and accept states $F = \{q_1, q_2\}$. Here $q_2$ and $q_3$ are sink states: once entered, all transitions loop back.
**Step 0 (initialisation).** Mark every pair $(q, q')$ with $q \in F$, $q' \notin F$ (or vice versa) with the distinguishing word $\varepsilon$. The accept/non-accept split is $\{q_1, q_2\}$ vs $\{q_0, q_3\}$, giving:
| | $q_0$ | $q_1$ | $q_2$ | $q_3$ |
|---|---|---|---|---|
| $q_0$ | — | $\varepsilon$ | $\varepsilon$ | |
| $q_1$ | | — | | $\varepsilon$ |
| $q_2$ | | | — | $\varepsilon$ |
| $q_3$ | | | | — |
Two entries remain unfilled: $(q_0, q_3)$ and $(q_1, q_2)$.
**Step 1 (first iteration).** For each unfilled pair and each letter, look up the entry for the letter-successors.
- *Pair $(q_0, q_3)$ via $a$:* $\delta(q_0, a) = q_1$ and $\delta(q_3, a) = q_3$. The entry $(q_1, q_3)$ is $\varepsilon$, so $a \cdot \varepsilon = a$ distinguishes $q_0$ from $q_3$. Mark $(q_0, q_3) \leftarrow a$.
- *Pair $(q_0, q_3)$ via $b$:* $\delta(q_0, b) = q_3$ and $\delta(q_3, b) = q_3$ — same state, no information.
- *Pair $(q_1, q_2)$ via $a$:* $\delta(q_1, a) = q_1$ and $\delta(q_2, a) = q_2$. Entry $(q_1, q_2)$ is still unfilled — no help yet.
- *Pair $(q_1, q_2)$ via $b$:* $\delta(q_1, b) = q_2$ and $\delta(q_2, b) = q_2$ — same state, no information.
After Step 1:
| | $q_0$ | $q_1$ | $q_2$ | $q_3$ |
|---|---|---|---|---|
| $q_0$ | — | $\varepsilon$ | $\varepsilon$ | $a$ |
| $q_1$ | | — | | $\varepsilon$ |
| $q_2$ | | | — | $\varepsilon$ |
| $q_3$ | | | | — |
**Step 2 (second iteration).** Only $(q_1, q_2)$ remains unfilled. Both letters send the pair to $(q_1, q_2)$ or $(q_2, q_2)$ — no previously filled entry is found. The algorithm terminates with $(q_1, q_2)$ unfilled, so $q_1 \sim q_2$.
**Conclusion.** The indistinguishable pairs are exactly $\{q_1, q_2\}$. The quotient automaton has three states: $[q_0]$, $[q_1] = \{q_1, q_2\}$ (the merged accept state), and $[q_3]$ (the dead state).
[/example]
[quotetheorem:1801]
[citeproof:1801]
This completes the decision-problem picture for regular languages: the word problem is solvable (the automaton reads the input), the emptiness problem is solvable (by the short-word lemma), and the equivalence problem is solvable (by minimal automaton isomorphism). The class of regular languages also enjoys a rich closure theory — union, intersection, complement, concatenation, difference — making it the most well-behaved family in the Chomsky hierarchy.
Context-free languages extend regular languages by permitting recursive rewriting rules that can express nested and balanced structures. The trade-off is direct: while context-free languages gain expressive power, they lose the decidability and closure properties that made regular languages so tractable.
# 4. Context-Free Languages
Context-free languages sit strictly between regular languages and the full class of type-0 languages: they can express balanced nesting — such as matching brackets or equal counts of symbols — that no finite automaton can track, yet they remain decidable and admit efficient parsing algorithms. The central challenge in this chapter is to understand how context-free derivations are organised spatially rather than linearly. Where a regular derivation progresses as a sequence $wA$ with one active variable, a context-free derivation fans out into a tree. This chapter develops the theory of parse trees, establishes a canonical normal form for context-free grammars, proves the pumping lemma that characterises what context-free languages cannot express, and examines closure and decidability properties of the class.
## Parse Trees and the Structure of Derivations
Every regular derivation can be read off as a linear sequence of rewrites. Context-free grammars, by contrast, allow any non-terminal to be expanded regardless of its position in the sentential form, so the same word may have many derivation sequences corresponding to different expansion orders. The right object to study is not a derivation sequence but the underlying tree that records which rule was applied where.
To make this precise, the course first sets up abstract trees as combinatorial objects before attaching grammar labels to them.
[definition: Finitely-Branching Tree]
A set $T \subset \mathbb{N}^\star$ is a **finitely-branching tree** if it is closed under initial segments and, for every $t \in T$, there is a branching number $n \in \mathbb{N}$ such that $tk \in T$ if and only if $k < n$. A node $t \in T$ with no successors is called a **leaf**; the empty sequence $\varepsilon$ is the **root**. The **level** of $t$ is $|t|$, and the **height** of a finite tree is the maximum level. For $t \in T$, the sequence $\varepsilon, t|_1, t|_2, \ldots, t = t|_{|t|}$ is the **branch** leading to $t$.
[/definition]
[example: A Concrete Finitely-Branching Tree]
The set whose nodes at levels 0, 1, 2, 3 are (reading left to right): $\varepsilon$; $0, 1, 2$; $00, 01, 10, 20, 21, 22$; $000, 001, 100, 101, 210, 220, 221$; and then $0010$ at level 4, is a finitely-branching tree. Node $0$ has branching number 2 (successors $00, 01$), node $1$ has branching number 1 (successor $10$), and node $2$ has branching number 3 (successors $20, 21, 22$). This example shows that different nodes can have different branching numbers, and that a tree need not be balanced. The leaves are $001, 101, 210, 221, 0010$ (among others) — they are the nodes with branching number 0.
[/example]
[definition: Subtree]
Let $T$ be a tree and $t \in T$. The **subtree rooted at $t$** is $T_t = \{s \mid ts \in T\}$.
[/definition]
[definition: Left-to-Right Order]
The **left-to-right order** on $T$ is defined by $t < s$ if $t \neq s$ and, letting $k_0$ be the smallest index where $t(k_0) \neq s(k_0)$, we have $t(k_0) < s(k_0)$. This makes the leaves of any finite tree totally ordered, even though the relation is only a partial order on $T$ overall (it does not compare nodes on the same branch, such as $0$ and $00$).
[/definition]
With tree structure in hand, parse trees attach grammar labels to nodes so that the tree records a derivation.
[definition: Parse Tree]
Let $G = (\Sigma, V, P, S)$ be a context-free grammar, and let $\Omega = \Sigma \cup V$. A pair $\mathbb{T} = (T, \ell)$ is a **$G$-parse tree** if $T$ is a finitely-branching tree and $\ell : T \to \Omega$ is a labelling function such that:
1. $\ell(\varepsilon) \in V$ (the root carries a variable);
2. if $\ell(t) \in \Sigma$, then $t$ is a leaf (terminals are leaves);
3. if $t$ has $n + 1$ successors and $\ell(t) = A \in V$, then $A \to \ell(t0)\ell(t1)\cdots\ell(tn) \in P$.
If $t_0, \ldots, t_m$ are the leaves of $\mathbb{T}$ in left-to-right order, the **string parsed by $\mathbb{T}$** is $\sigma_\mathbb{T} = \ell(t_0)\cdots\ell(t_m)$.
[/definition]
[remark: Subtree Decomposition]
If $t \in T$, then writing $\mathbb{T}_t = (T_t, \ell_t)$ for the subtree at $t$ (with $\ell_t(s) = \ell(ts)$), we have the factorisation $\sigma_\mathbb{T} = \alpha\,\sigma_{\mathbb{T}_t}\,\beta$ for some strings $\alpha, \beta$ that come from the rest of the tree. Every subword of $\sigma_\mathbb{T}$ derivable from some variable $A$ corresponds to the subtree rooted at the unique node labelled $A$ that generates it.
[/remark]
The key proposition confirms that parse trees and derivation sequences are two ways of describing the same information.
[quotetheorem:1802]
[citeproof:1802]
The parse tree representation makes it possible to manipulate derivations by surgery on trees. The central operation is grafting.
[definition: Grafting]
Let $\mathbb{T} = (T, \ell)$ be a $G$-parse tree, $t \in T$ a node with $\ell(t) = A$, and $\mathbb{T}' = (T', \ell')$ another $G$-parse tree starting from $A$. Define
\begin{align*}
\operatorname{graft}(\mathbb{T}, t, \mathbb{T}') &= (S, \ell^\star)
\end{align*}
where $S = \{s \in T \mid t \not\subseteq s\} \cup \{tu \mid u \in T'\}$ and $\ell^\star(s) = \ell(s)$ when $t \not\subseteq s$, and $\ell^\star(tu) = \ell'(u)$ for $u \in T'$. The subtree rooted at $t$ is replaced wholesale by $\mathbb{T}'$.
[/definition]
[remark: Grafting Formula]
If $\sigma_\mathbb{T} = \alpha\,\sigma_{\mathbb{T}_t}\,\beta$, then $\sigma_{\operatorname{graft}(\mathbb{T},t,\mathbb{T}')} = \alpha\,\sigma_{\mathbb{T}'}\,\beta$. Grafting thus replaces the substring derived from $t$ by the string derived by $\mathbb{T}'$ — without touching any other part of the derivation. This locality is what makes the pumping lemma work: we can swap out the piece of a long derivation that repeats a variable.
[/remark]
## Chomsky Normal Form
Every context-free grammar can be converted to a particularly clean shape — Chomsky normal form — in which all rules either produce exactly two variables or a single terminal. Having a canonical form is not merely aesthetic: the normal form gives precise structural control over parse trees (their heights and sizes), which underpins both the pumping lemma proof and efficient parsing algorithms.
[definition: Chomsky Normal Form]
A context-free grammar $G$ is in **Chomsky normal form (CNF)** if every production rule has one of two shapes:
- **binary**: $A \to BC$ where $B, C \in V$;
- **unary**: $A \to a$ where $a \in \Sigma$.
[/definition]
A grammar in CNF has especially regular parse trees: every internal node has exactly two children (binary rule) or one child that is a leaf (unary rule). This makes the tree a perfect vehicle for counting arguments.
[quotetheorem:1803]
[citeproof:1803]
The formula $2n - 1$ is tighter than it may first appear: it depends crucially on CNF discipline. Without CNF, rules of arbitrary length can produce many terminals in a single step, and unary rules $A \to B$ can chain indefinitely — both of which break the direct correspondence between tree shape and derivation length. The CNF constraint buys us a simple, exact accounting of how much work each derivation step contributes, which is what makes the downstream analysis (parsing algorithms, the pumping lemma) possible.
[explanation: What CNF Buys]
The length formula $2n - 1$ means that every CNF parse tree of a word of length $n$ has exactly $n$ leaves (each produced by a unary rule) and $n - 1$ internal binary nodes. The tree has at most $2n - 1$ nodes total, and its height satisfies $h \geq \lceil \log_2 n \rceil$ (a binary tree with $n$ leaves has height at least $\log_2 n$). In the other direction, a tree of height $h$ can have at most $2^h$ leaves, giving $|w| \leq 2^h$. This inequality is the backbone of the pumping lemma: a long word forces a tall tree, and a tall tree must repeat a variable along some branch.
[/explanation]
The theorem that every context-free grammar can be converted to CNF is constructive. There are exactly three kinds of obstruction, each removable by a different local transformation.
**Removing letters from long rules.** If $A \to \alpha$ has $|\alpha| \geq 2$ and $\alpha$ contains a terminal $a$, introduce a fresh variable $X_a$ with rule $X_a \to a$, and replace every occurrence of $a$ in long rules with $X_a$. All such rules now contain only variables.
**Handling unit rules.** A unit rule is a production $A \to B$ with $B \in V$. First, close the grammar under unit rules: for every $A \to B \in P$ and $B \to \alpha \in P$, add $A \to \alpha$ to $P$. The resulting **unit-closed** grammar is equivalent and has at most $|V| \cdot |P|$ additional rules. Now remove all unit rules from the unit-closed grammar.
[quotetheorem:1804]
[citeproof:1804]
**Binarising long rules.** If $A \to A_1 A_2 \cdots A_n$ with $n > 2$, introduce fresh variables $X_1, \ldots, X_{n-2}$ and replace the rule by the chain:
\begin{align*}
A \to A_1 X_1,\quad X_1 \to A_2 X_2,\quad \ldots,\quad X_{n-2} \to A_{n-1} A_n.
\end{align*}
[quotetheorem:1805]
[citeproof:1805]
The practical payoff of this theorem is that any result proved about CNF grammars — for instance, the derivation-length formula and the pumping lemma below — can be lifted to results about context-free languages as a whole, by first converting to CNF. The cost is a modest blow-up in grammar size, but the structural simplification is usually worth it. The next explanation block highlights one of the most important applications: practical parsing.
[explanation: CNF and Parsing Algorithms]
The CYK (Cocke–Younger–Kasami) parsing algorithm — which decides in $O(n^3)$ time whether $w \in \mathcal{L}(G)$ for a grammar in CNF — relies critically on the two-child structure of CNF parse trees. Each cell in the CYK table asks: can the substring $w[i..j]$ be derived from variable $A$? The binary structure means this reduces to checking all splits $w[i..k]$ and $w[k+1..j]$. Without CNF, the split structure would be irregular and the dynamic programming would not apply.
[/explanation]
## The Pumping Lemma for Context-Free Languages
The question this section answers is: which languages are provably not context-free? For regular languages the tool was the regular pumping lemma, which exploited path-repetition in DFA runs. For context-free languages, the analogous tool uses repeated variables in long parse trees — and the grafting operation to manipulate them.
[definition: Context-Free Pumping Lemma Condition]
A language $L$ satisfies the **context-free pumping lemma** with pumping number $n \in \mathbb{N}$ if: for every $w \in L$ with $|w| \geq n$, there exist strings $x, u, y, v, z$ with $w = xuyvz$ such that:
1. $|uyv| \leq n$;
2. $|uv| > 0$ (the pump is non-empty);
3. $xu^k yv^k z \in L$ for all $k \geq 0$.
The strings $u, v$ are called the **pump**.
[/definition]
[remark: Comparison with the Regular Pumping Lemma]
In the regular pumping lemma the pump is a single block $u$ with a constraint on its position ($|xy| \leq n$). In the context-free version the pump splits into two separate pieces $u$ and $v$ that must be inflated simultaneously. There is no position constraint on $x$ or $z$, only a constraint $|uyv| \leq n$ on the combined middle segment. This reflects the two-sided nature of context-free derivations: a variable $A$ appearing twice along a branch generates an outer ring $u(\cdot)v$ of context, and both sides must be pumped together to stay in the language. The context-free pumping lemma is strictly weaker than the regular pumping lemma in the sense that the regular one implies the context-free one (any regular language with pumping number $n$ satisfies the CFL version with the same $n$, by taking $y = $ the pumped block and $u = v = \varepsilon$). Because there are uncountably many languages satisfying the context-free pumping lemma, the lemma cannot characterise the (countable) class of context-free languages.
[/remark]
[quotetheorem:1806]
[citeproof:1806]
The proof is not just an existence argument: it produces the decomposition constructively from the parse tree, by choosing the deepest variable-repetition along the longest branch. This constructive character is what makes the pumping number $n = 2^{|V|}$ explicit rather than an unspecified bound. Before applying the lemma, however, it is worth being careful about what it does and does not claim — the converse, in particular, is false, as the next explanation clarifies.
[explanation: What the Pumping Lemma Says and Does Not Say]
The pumping lemma gives a necessary condition for a language to be context-free: if $L$ is context-free then there exists a pumping number $n$ such that every long word can be pumped. Its contrapositive is the useful direction — to show $L$ is not context-free, take an arbitrary $n$, exhibit a word $w \in L$ with $|w| \geq n$, and show that no split $w = xuyvz$ satisfying $|uyv| \leq n$ and $|uv| > 0$ permits all pumped strings $xu^kyv^kz$ to remain in $L$. The lemma does not provide a sufficient condition: there are non-context-free languages that satisfy the pumping lemma for every $n$. More powerfully, the pumping number $n = 2^{|V|}$ is explicit — it is not some abstract existence — and the proof shows exactly which structural feature of the grammar forces pumpability: the finiteness of the variable set and the pigeonhole principle on long branches.
[/explanation]
[example: The Language $\{a^n b^n c^n\}$ Is Not Context-Free]
Consider $L = \{a^n b^n c^n \mid n > 0\}$. Suppose for contradiction that $L$ is context-free with pumping number $N$. Take $w = a^N b^N c^N \in L$, so $|w| = 3N \geq N$. Write $w = xuyvz$ with $|uyv| \leq N$ and $|uv| > 0$.
Since $|uyv| \leq N$, the segment $uyv$ spans at most $N$ consecutive characters of $w = a^N b^N c^N$. The word $w$ has three blocks of length $N$ each, and any segment of length $\leq N$ can intersect at most two of these blocks (it cannot straddle all three since it would need length at least $N + 1$ just to touch the boundary between $a$'s and $c$'s). Therefore $u$ and $v$ together contain characters from at most two of the three letter types.
Pumping up to $k = 2$: the total count of each letter in $xu^2yv^2z$ increases by $|u|$ and $|v|$ respectively for whichever letters appear in $u$ and $v$. Since at least one letter type (say $c$) is absent from both $u$ and $v$, the count of $c$'s stays at $N$, while the counts of $a$'s and/or $b$'s increase. The pumped word therefore has unequal counts and does not lie in $L$. This contradiction shows $L$ is not context-free.
This example demonstrates a structural boundary: the language $\{a^n b^n\}$ is context-free (generated by $S \to aSb \mid ab$), but the three-letter analogue is not. The extra constraint that all three counts agree simultaneously cannot be maintained under two-piece pumping.
[/example]
## Closure Properties and the Pushdown Automaton Connection
The pumping lemma reveals what context-free languages cannot express, but closure properties reveal how context-free languages interact with the Boolean operations that regular languages handle gracefully.
The closure failures come from a single example. The languages $L_0 = \{a^n b^n c^k \mid n, k > 0\}$ and $L_1 = \{a^k b^n c^n \mid n, k > 0\}$ are both context-free: $L_0$ is the concatenation of $\{a^n b^n \mid n > 0\}$ (context-free) with $\{c^k \mid k > 0\}$ (regular), and $L_1$ is symmetric. But their intersection is exactly $L = \{a^n b^n c^n \mid n > 0\}$, which the previous example showed is not context-free.
[quotetheorem:1807]
[citeproof:1807]
The asymmetry between positive closures (union, concatenation, Kleene star — all provable by direct grammar constructions) and the negative closures (intersection, complement — provably impossible) has a deep model-theoretic explanation. For regular languages, the product automaton gives intersection for free; the question is why no analogous construction works for context-free grammars.
[explanation: Why No Product Construction Exists]
For DFAs, the product construction $M_1 \times M_2$ computes $\mathcal{L}(M_1) \cap \mathcal{L}(M_2)$ by running both automata in parallel. The closure failure for context-free languages means no analogous product construction can exist for whatever automaton model corresponds to CFLs. Running two such automata in parallel would yield a machine for the intersection, but intersections of CFLs need not be context-free — the product would violate our result. This provides a model-theoretic constraint: any automaton model for context-free languages cannot support a product.
The appropriate machine model is the **pushdown automaton (PDA)**: a nondeterministic finite automaton augmented with a stack. The stack allows unbounded memory, but only in a last-in first-out discipline — unlike the random-access memory of a Turing machine. The transition function reads the current state, the current input symbol (or $\varepsilon$), and the top of the stack, then moves to a new state and optionally pushes or pops a symbol. The nondeterminism is essential: deterministic PDAs recognise a strictly smaller class than nondeterministic PDAs.
[/explanation]
[quotetheorem:1808]
The proof of this equivalence, while not covered in full detail in the course, follows a standard pattern: given a grammar, simulate leftmost derivations using the stack; conversely, given a PDA, construct a grammar whose variables encode pairs of states. The equivalence confirms that the stack, with its restricted memory access, is exactly the right amount of computational power to handle the balanced-nesting structure characteristic of context-free languages.
## Decision Problems for Context-Free Languages
Having established the structure of the class, we turn to what can be algorithmically determined about a given context-free language.
The **word problem** — given a grammar $G$ and a word $w$, decide whether $w \in \mathcal{L}(G)$ — is solvable for context-free languages. Converting $G$ to Chomsky normal form, every derivation of $w$ has length exactly $2|w| - 1$, so there are finitely many derivations to check. More efficiently, the CYK algorithm solves the word problem in $O(|G| \cdot |w|^3)$ time by dynamic programming on substrings.
The **emptiness problem** — given $G$, decide whether $\mathcal{L}(G) = \varnothing$ — is also solvable, and the pumping lemma provides the key insight.
[explanation: Emptiness via the Pumping Lemma]
Let $n$ be the pumping number of $G$ (which can be computed from the CNF of $G$). If $\mathcal{L}(G)$ is non-empty, it must contain a word of length less than $n$. The reason: any word $w \in \mathcal{L}(G)$ with $|w| \geq n$ can be pumped down ($k = 0$) to a shorter word $xyz$ still in $\mathcal{L}(G)$. Repeating, we obtain a word of length $< n$ in $\mathcal{L}(G)$. Therefore, $\mathcal{L}(G) \neq \varnothing$ if and only if some word of length less than $n$ lies in $\mathcal{L}(G)$. There are finitely many such words, and each can be checked with the word problem algorithm. This argument is identical in structure to the emptiness test for regular languages via the regular pumping lemma, and the same observation applies: the choice of pumping lemma (regular vs context-free) does not matter for the argument, only the existence of a bound below which short witnesses must exist.
The argument also demonstrates that the pumping lemma is not only a tool for non-membership proofs: here it provides an upper bound on the length of shortest words, turning an infinite search into a finite one.
[/explanation]
The **equivalence problem** — given two context-free grammars $G_1$ and $G_2$, decide whether $\mathcal{L}(G_1) = \mathcal{L}(G_2)$ — is undecidable for context-free languages, in contrast to the decidable equivalence problem for regular languages. This reflects the fact that context-free languages, unlike regular languages, do not have unique minimal representations, and the class is not closed under the Boolean operations needed to reduce equivalence to emptiness.
Register machines abandon the constraint of processing input in a single pass and instead allow arbitrary access to finitely many memory cells, moving closer to the practical computers we use. This shift from finite or stack-bounded memory to general-purpose mutable storage marks the crossing from language classes to full computation.
# 5. Register Machines
Register machines are the course's first concrete model of general computation. Where finite automata and pushdown automata process input in a single pass with bounded or stack-structured memory, a register machine stores words in finitely many last-in first-out registers and can push, inspect, or pop letters in any order — making it expressive enough to capture the informal notion of an algorithm. This chapter defines register machines precisely, establishes a countability result for them, and then systematically builds up a library of computable operations and answerable questions. The central observation is that by combining simple building blocks through the concatenation and case-distinction lemmas, a rich class of register-manipulating operations becomes reachable.
## The Architecture of Register Machines
How should we model a computation that can store and retrieve data in an unbounded but structured way? Finite automata fail because their state space is fixed and finite; a pushdown automaton adds a single stack but still exposes only the top element for inspection. A register machine generalises this by giving the machine access to finitely many registers, each holding a word $w \in \mathbb{W}$ (the set of all finite words over the alphabet $\Sigma$), manipulated via last-in first-out operations. The key design question is which operations to allow: too many operations would make the model unphysical, too few would be underpowered.
We denote the set of all finite words over $\Sigma$ by $\mathbb{W} = \Sigma^\star$. Throughout this chapter, $Q$ is a finite set of states with two distinguished elements: the start state $q_S$ and the halt state $q_H$, with $q_S \neq q_H$.
[definition: Register Machine Instruction]
Let $\Sigma$ be an alphabet and $Q$ a nonempty finite set of states. A **$(\Sigma, Q)$-instruction** is a tuple of one of the following four forms:
\begin{align*}
+(k, a, q) &= (0, k, a, q) \in \mathbb{N} \times \mathbb{N} \times \Sigma \times Q \\
?(k, a, q, q') &= (1, k, a, q, q') \in \mathbb{N} \times \mathbb{N} \times \Sigma \times Q \times Q \\
?(k, \varepsilon, q, q') &= (2, k, q, q') \in \mathbb{N} \times \mathbb{N} \times Q \times Q \\
-(k, q, q') &= (3, k, q, q') \in \mathbb{N} \times \mathbb{N} \times Q \times Q
\end{align*}
The intuitive meanings are:
- $+(k, a, q)$: push the letter $a$ onto register $k$, then advance to state $q$;
- $?(k, a, q, q')$: check if the final letter of register $k$ is $a$; if so, advance to state $q$, otherwise advance to state $q'$;
- $?(k, \varepsilon, q, q')$: check if register $k$ is empty; if so, advance to state $q$, otherwise advance to state $q'$;
- $-(k, q, q')$: pop the topmost letter from register $k$ if it is nonempty and advance to $q'$; if register $k$ is already empty, advance to $q$.
We write $\operatorname{Instr}(\Sigma, Q)$ for the set of all $(\Sigma, Q)$-instructions. With $k$ bounded, this set is finite.
[/definition]
The four instruction types correspond to the only operations one naturally wants on a stack: push (write), peek (branch on top), check emptiness (branch on empty), and pop (read and remove). Any richer set of operations can be built from these by composition, as the API section of this chapter demonstrates.
[definition: Register Machine]
A **$\Sigma$-register machine** is a triple $M = (\Sigma, Q, P)$ where $\Sigma$ is an alphabet, $Q$ is a finite set of states containing distinguished start state $q_S$ and halt state $q_H$ with $q_S \neq q_H$, and $P : Q \to \operatorname{Instr}(\Sigma, Q)$ is the **program**. If $Q = \{q_0, \ldots, q_n\}$, the program is described by finitely many **program lines** $q_i \mapsto P(q_i)$. The **upper register index** of $M$ is the largest $k$ appearing in any instruction in the image of $P$.
[/definition]
[definition: Configuration and Computation Sequence]
Let $M$ be a register machine with upper register index $n$. A **configuration** (or **snapshot**) of $M$ is a tuple $(q, w_0, \ldots, w_n) \in Q \times \mathbb{W}^{n+1}$. For configurations $C, C'$, we say **$M$ transforms $C$ into $C'$** if, writing $C = (q, w_0, \ldots, w_n)$, one of the following holds:
- $P(q) = +(k, a, q')$ and $C' = (q', w_0, \ldots, w_{k-1}, w_k a, w_{k+1}, \ldots, w_n)$;
- $P(q) = ?(k, a, q', q'')$ and either $w_k = wa$ for some $w$ (the top letter is $a$) and $C' = (q', w_0, \ldots, w_n)$, or $w_k \neq wa$ for all $w$ and $C' = (q'', w_0, \ldots, w_n)$;
- $P(q) = ?(k, \varepsilon, q', q'')$ and either $w_k = \varepsilon$ and $C' = (q', w_0, \ldots, w_n)$, or $w_k \neq \varepsilon$ and $C' = (q'', w_0, \ldots, w_n)$;
- $P(q) = -(k, q', q'')$ and either $w_k = \varepsilon$ and $C' = (q', w_0, \ldots, w_n)$, or $w_k = wa$ (with top letter $a$) and $C' = (q'', w_0, \ldots, w_{k-1}, w, w_{k+1}, \ldots, w_n)$.
Given an initial register content $w = (w_0, \ldots, w_n) \in \mathbb{W}^{n+1}$, the **computation sequence** of $M$ with input $w$ is defined by $C(0, M, w) = (q_S, w)$ and $C(k+1, M, w) = C'$ where $M$ transforms $C(k, M, w)$ into $C'$.
If $w$ has fewer than $n+1$ components, pad with copies of $\varepsilon$.
[/definition]
[remark: Computations Are Always Infinite Sequences]
As defined, every configuration has a unique successor, so computation sequences are always infinite. The machine does not intrinsically stop. Halting must be detected externally: we declare the computation to have halted when it first reaches the halt state $q_H$, and the content of the registers at that moment is the output.
[/remark]
[definition: Halting and Output]
We say that **the computation of $M$ with input $w$ halts at time $k$** (or **in $k$ steps**) if $k$ is the least natural number such that $C(k, M, w) = (q_H, v)$ for some $v \in \mathbb{W}^{n+1}$. In this case, $v$ is the **output** (or **register content at time of halting**). If no such $k$ exists, the computation **does not halt**.
[/definition]
[example: Halting vs Non-Halting Programs]
Consider the program over any alphabet $\Sigma$ with $a \in \Sigma$: $q_S \mapsto +(0, a, q_S)$ and $q_H \mapsto +(0, a, q_S)$. This machine forever pushes $a$ onto register $0$, cycling between $q_S$ and back, never reaching $q_H$ in the sense required. The computation sequence visits only states in $\{q_S\}$ from the program's perspective (the halt state rule is never the exit condition of the machine), so this machine never halts on any input. In contrast, the single-line program $q_S \mapsto ?(0, a, q_H, q_H)$ halts after exactly one step on every input, regardless of the content of register $0$, because both branches of the test lead to $q_H$.
[/example]
## Counting Register Machines
Having defined register machines, a natural question is: how many are there? The answer is that the collection is countable, once we identify machines that differ only in the labelling of their states.
[definition: Strong Equivalence]
Register machines $M$ and $M'$ are **strongly equivalent** if for all $k$ and all $w$, the configurations $C(k, M, w)$ and $C(k, M', w)$ have the same register content, and $M$ halts after $k$ steps with input $w$ if and only if $M'$ halts after $k$ steps with input $w$.
[/definition]
Strong equivalence is more demanding than operational equivalence: it requires that both machines perform the same register transformations at every step, not merely that they agree on what the output is when they halt. In particular, if $|Q| = |Q'|$, any $(\Sigma, Q, P)$ machine has a strongly equivalent $(\Sigma, Q', P')$ machine obtained by relabelling states in $P$. This relabelling freedom means that only the cardinality of the state set — not the actual names of states — matters up to strong equivalence.
[quotetheorem:1809]
[citeproof:1809]
The padding lemma tells us that within any strong-equivalence class there are machines of arbitrarily many states. Despite this abundance, there are only countably many equivalence classes.
[quotetheorem:1810]
[citeproof:1810]
This countability result has an important consequence: because there are only countably many register machines, there are only countably many computable partial functions. Since the set of all partial functions $\mathbb{W} \rightharpoonup \mathbb{W}$ is uncountable, most functions are not computable — a theme the course will develop in subsequent chapters.
## Operations and Questions
How should we describe what a register machine computes? The natural notion is that of an **operation**: a partial function on register contents that the machine may or may not succeed in producing.
[definition: Operation and Performance]
An **operation** is a partial function $f : \mathbb{W}^{n+1} \rightharpoonup \mathbb{W}^{n+1}$. We write $f(w) \downarrow$ if $w$ lies in the domain of $f$ (the operation **converges** or is **defined**) and $f(w) \uparrow$ otherwise (the operation **diverges** or is **undefined**). A register machine $M$ **performs** $f$ if for all $w$: $f(w) \downarrow$ if and only if $M$ halts on input $w$, and in that case the register content at time of halting equals $f(w)$.
[/definition]
The two simplest operations are the extremes of behaviour.
[example: The Never-Halt Operation]
The operation that never converges is the empty function $f$ with $\operatorname{dom} f = \varnothing$. Any machine that never references $q_H$ in the right-hand side of a program line performs this operation. For instance, the program $q_S \mapsto +(0, a, q_S)$ pushes the letter $a$ indefinitely and never halts.
[/example]
[example: The Identity Operation]
The operation that halts immediately without modifying any register is the identity function $f(w) = w$ with $\operatorname{dom} f = \mathbb{W}^{n+1}$. The single-line program $q_S \mapsto ?(0, a, q_H, q_H)$ performs this: for any register content, it checks whether register $0$ ends in $a$ (both outcomes lead to $q_H$), halts after one step, and leaves the registers unchanged. Many other programs also perform the identity operation, demonstrating that the relationship between machines and operations is many-to-one.
[/example]
Alongside operations, machines can also **answer questions**: decide which part of a partition the register content belongs to.
[definition: Question and Answer]
A **question with $k+1$ answers** is a partition of $\mathbb{W}^{n+1}$ into $k+1$ sets $A_0, \ldots, A_k$. A register machine **answers** the question if it has $k+1$ designated **answer states** $q_0, \ldots, q_k$ such that, upon input $w \in A_i$, after finitely many steps the machine's configuration is $(q_i, w)$.
[/definition]
[example: Elementary Questions]
The question "is register $i$ empty?" has two answers, yes and no, and is answered by the single-line program $q_S \mapsto ?(i, \varepsilon, q_Y, q_N)$: the machine transitions to $q_Y$ if register $i$ is empty and to $q_N$ otherwise, without touching the register contents. Similarly, the question "does register $i$ end in letter $a$?" is answered by $q_S \mapsto ?(i, a, q_Y, q_N)$.
[/example]
The reason the distinction between operations and questions is useful is that they can be assembled together. The next two lemmas are the main composition tools.
## Composing Register Machines
Individual register machine programs are simple. Real computations arise by chaining machines together: running one operation then another, or dispatching to different machines depending on the answer to a question.
[quotetheorem:1811]
The composition $F' \circ F$ diverges whenever $F$ diverges, and also whenever $F$ converges but $F'$ diverges on the output. Thus $\hat{M}$ halts precisely when both machines would have halted in sequence.
[proof]
Assume without loss of generality that the state sets $Q$ and $Q'$ are disjoint (they can be relabelled if not, using strong equivalence). Define the combined state set $\hat{Q} = Q \cup Q' \setminus \{q_H\}$, removing the halt state of $M$. Let $P^\star$ be the program $P$ with the rule $q_H \mapsto P(q_H)$ removed and every occurrence of $q_H$ in the remaining rules replaced by $q_S'$ (the start state of $M'$). Set $\hat{P} = P^\star \cup P'$. The machine $\hat{M} = (\Sigma, \hat{Q}, \hat{P})$ first simulates $M$ until what would have been a halt, then seamlessly continues as $M'$ from that register content, halting only when $M'$ halts. This performs $F' \circ F$.
[/proof]
[explanation: The Role of the Concatenation Lemma]
The concatenation lemma is the core modularity result of register machine theory. It allows us to treat any register machine as a subroutine: design a machine for a simple operation, verify it is correct, and then invoke it as a building block within larger computations. Each entry in the API catalogue in the next section is an application of this principle: complex operations like "copy the content of register $i$ into register $j$" are built by chaining elementary push, pop, and test machines together. The lemma is constructive — it gives an explicit program for $\hat{M}$ — and the construction scales without additional cost, since the state space of $\hat{M}$ is just $|Q| + |Q'| - 1$.
[/explanation]
[quotetheorem:1812]
[citeproof:1812]
The two lemmas together give register machines the power of structured programs: sequential composition (concatenation) and conditional branching (case distinction). These are exactly the two control flow primitives that suffice for describing any algorithm.
## The Register Machine API
Having established the composition lemmas, we can catalogue the operations and questions that register machines can perform. A register is **unused** if no program line references it; it is **empty** if its content is $\varepsilon$. Registers used only for intermediate computation (not contributing to output) are called **scratch registers**.
The following list builds progressively. Each entry is either a single program line or an explicit combination of earlier entries via the concatenation or case distinction lemmas.
**Deletion of a single letter.** The operation "delete the final letter of register $i$ if it is nonempty" is performed by $q_S \mapsto -(i, q_H, q_H)$: both branches of the pop instruction lead to the halt state, so the machine halts after one step regardless, having popped a letter if one existed.
**Pushing a fixed letter.** The operation "push letter $a$ onto register $i$" is performed by $q_S \mapsto +(i, a, q_H)$. As a side effect, this guarantees register $i$ is nonempty after the step.
**Clearing a register.** The operation "delete the entire content of register $i$" is performed by the loop $q_S \mapsto -(i, q_H, q_S)$: if register $i$ is empty, halt; otherwise pop and return to $q_S$ to repeat. This is a loop of at most $|w_i|$ steps, where $|w_i|$ is the current length of register $i$.
**Writing a fixed word.** The operation "push the fixed word $w = a_0 a_1 \cdots a_\ell$ onto register $i$" is performed by concatenating $\ell + 1$ push operations, one per letter $a_j$, applied in order. By the concatenation lemma, this is a valid single machine.
**Setting a register to a fixed word.** The operation "set register $i$ to the fixed word $w$" combines clearing register $i$ and then writing $w$: two applications of the concatenation lemma.
**Reading the final letter.** The question "what is the final letter of register $i$?" has $|\Sigma| + 1$ answers: one for each letter $a_j \in \Sigma$ and one for the empty case. For each $a_j$, ask the question "does register $i$ end in $a_j$?"; if yes, move to answer state $q_j$; if no, proceed to ask about $a_{j+1}$. If all letter queries fail, register $i$ is empty, so move to answer state $q_\varepsilon$. This uses $|\Sigma|$ sequential case distinctions.
**Copying the final letter.** The operation "copy the final letter of register $i$ into register $j$ (if it exists)" is performed by first reading the final letter (the question above) and then, in each answer branch, pushing the corresponding letter onto register $j$ (or doing nothing if register $i$ was empty). This is an application of the case distinction lemma with $|\Sigma|$ operation branches.
**Moving the final letter.** The operation "move the final letter of register $i$ into register $j$ (if it exists)" extends the copy operation: after copying, apply the deletion operation to pop the letter from register $i$.
**Moving the content in reverse order.** The operation "move the content of register $i$ into register $j$ in reverse order" repeatedly applies "move the final letter of $i$ into $j$" until register $i$ is empty. Each step moves one letter; the letters are deposited in $j$ in the reverse of their order in $i$, since the first letter moved is the last letter of $i$.
**Moving the content in the correct order.** The operation "move the content of register $i$ into register $j$ in the correct order" uses a scratch register $k$: first move the content of $i$ into $k$ in reverse order, then move the content of $k$ into $j$ in reverse order. Two reversals produce the original order.
**Reversing a register.** The operation "reverse the content of register $i$" moves $i$ into a scratch register $j$ in reverse order, then moves $j$ back into $i$ in the correct order (using a second scratch register if needed). The net effect reverses the word in register $i$.
**Duplicating into two registers.** The operation "move the content of register $i$ into registers $j$ and $k$ simultaneously, in reverse order" is performed by iteratively copying the final letter of $i$ into both $j$ and $k$, then removing it from $i$, until $i$ is empty.
**Copying a full register.** The operation "copy the content of register $i$ into register $j$ in reverse order" is accomplished by moving the content of $i$ into both $j$ and a scratch register $k$ simultaneously (the duplication above), then moving the content of $k$ back into $i$ in reverse order to restore the original. Combining this with a reversal of $j$ gives the operation "copy the content of register $i$ into register $j$ in the correct order".
These building blocks are enough to implement arbitrary register-content manipulation. The catalogue is important not just for its own sake: in the next chapter, the course uses these operations to simulate more powerful machines and to encode programs as data, which leads to the undecidability results that are the ultimate goal of the course.
[example: Checking Whether a Register Holds a Specific Word]
Consider the question "is the content of register $i$ equal to the word $w = a_0 a_1 \cdots a_k$?". We design a subroutine $S_\ell$ that answers "is $a_\ell$ the final letter of register $i$?". If no, move to a rejection state $q_N$. If yes, move the final letter to a scratch register $r$ and run subroutine $S_{\ell-1}$; if $\ell = 0$ there are no more letters to check, so move to an acceptance state $q_Y$.
At state $q_N$: move the accumulated content of $r$ back into $i$ (to restore the register) and answer no. At state $q_Y$: move the content of $r$ back into $i$ and answer yes.
This subroutine proceeds from $S_k$ down to $S_0$, checking letters from right to left and using the scratch register to temporarily store matched letters. It uses $k+1$ question steps, one move-letter operation per step, and a restore operation at the end. The structural insight is that register machines can inspect the content of a register letter by letter without losing it, provided they use scratch space to buffer what they read and restore it afterwards.
[/example]
The systematic nature of the API — each operation defined in terms of simpler ones — demonstrates a key property of register machines: they are closed under composition in a very strong sense. Any finite procedure that manipulates words in a last-in first-out manner can be implemented by chaining together the primitives above. This places register machines firmly in the territory of universal computation, a claim the course will make precise when it shows that register machines compute exactly the class of partial recursive functions.
Register machines compute exactly the partial recursive functions, establishing that the model-theoretic notion of computability aligns with the algebraic machinery developed throughout the course. Computability theory takes this equivalence as its foundation, asking not what classes of strings can be generated, but what problems can be solved by mechanical process at all.
# 6. Computability Theory
The preceding chapters developed a hierarchy of languages — regular, context-free, context-sensitive, and type 0 — and showed that each class is strictly larger than the one before. This chapter asks the complementary question: what can be computed at all? We introduce the formal notion of a computable function and a computable set, relate computability to the Chomsky hierarchy, and then prove the central undecidability results that mark the limits of algorithmic reasoning. The main surprise is that the halting problem — asking whether a given program terminates on a given input — cannot be decided by any algorithm, and from this single obstruction a rich theory of unsolvability flows.
## Computable Functions and Sets
How should we define what it means to compute something? Register machines give a concrete operational answer: a function is computable if some register machine realises it. To handle scratch space cleanly, only the content of register 0 counts as output; all other registers are auxiliary.
[definition: Computable Function]
Let $M$ be a register machine and $k \in \mathbb{N}$. Define $f_{M,k} : \mathbb{W}^k \rightharpoonup \mathbb{W}$ by setting $f_{M,k}(w) \uparrow$ when $M$ does not halt on input $w$, and $f_{M,k}(w) = v_0$ when $M$ halts on input $w$ with register content $v$ (so $v_0$ is the content of register 0). A partial function $f : \mathbb{W}^k \rightharpoonup \mathbb{W}$ is **computable** if there exists a register machine $M$ such that $f = f_{M,k}$.
[/definition]
If $M$ and $M'$ are strongly equivalent, then $f_{M,k} = f_{M',k}$ for every $k$. The converse fails: two machines can compute the same function by entirely different paths. For the special case $k = 1$ we write $W_M = \operatorname{dom} f_{M,1}$, the domain of the one-input function computed by $M$.
Since there are only countably many register machines up to strong equivalence, there are only countably many computable functions. The power set of $\mathbb{W}$ is uncountable, so most functions $\mathbb{W} \to \mathbb{W}$ are not computable. This is not a deficiency of register machines specifically — it is an unavoidable consequence of any effective enumeration of programs.
[example: Basic Computable Functions]
The identity function on $\mathbb{W}$ is computable: the register machine that does nothing computes it. Every constant function $c : \mathbb{W}^k \to \mathbb{W}$ defined by $c(w) = v$ for a fixed $v \in \mathbb{W}$ is computable: empty register 0, then write $v$ to register 0, then halt. The projection $\pi_i : \mathbb{W}^k \to \mathbb{W}$ defined by $\pi_i(w) = w_i$ is computable: empty register 0, copy the content of register $i$ into register 0, then halt.
[/example]
To talk about computability of sets, we need a way to recognise membership. The characteristic function is the standard device.
[definition: Characteristic Function and Computable Set]
Let $X \subseteq \mathbb{W}^k$ and fix a letter $a \in \Sigma$. The **characteristic function** $\mathbb{1}_X : \mathbb{W}^k \to \mathbb{W}$ is defined by $\mathbb{1}_X(w) = a$ if $w \in X$ and $\mathbb{1}_X(w) = \varepsilon$ otherwise. A set $X \subseteq \mathbb{W}^k$ is **computable** if $\mathbb{1}_X$ is computable.
[/definition]
A language is a set of words, so we may now ask whether a language is computable: it is computable when a register machine can decide membership. There is a weaker notion in which the machine only needs to halt on members, with no requirement about non-members.
[definition: Pseudocharacteristic Function and Computably Enumerable Set]
Let $X \subseteq \mathbb{W}^k$. The **pseudocharacteristic function** $\psi_X : \mathbb{W}^k \rightharpoonup \mathbb{W}$ is the partial function with $\operatorname{dom} \psi_X = X$ defined by $\psi_X(w) = a$ if $w \in X$. A set $X \subseteq \mathbb{W}^k$ is **computably enumerable** if $\psi_X$ is computable.
[/definition]
The two notions are related but distinct. Every computable set is computably enumerable, but the converse fails: the halting set will later furnish the canonical counterexample. A computably enumerable set is one for which membership is eventually confirmed; non-membership may never be confirmed. These two classes correspond exactly to two levels of the Chomsky hierarchy, a connection made precise in Section 6.8.
## Computability of Languages
The relationship between computability and the Chomsky hierarchy begins with a simple observation about complements and domains, then connects back to automata.
[quotetheorem:1813]
[citeproof:1813]
Part (i) says that the computable sets are closed under complement: a machine can decide $X$ if and only if it can decide its complement. This may seem automatic, but it fails for computably enumerable sets — the complement of a computably enumerable set need not be computably enumerable, and indeed the halting set provides a counterexample.
[quotetheorem:1814]
[citeproof:1814]
This result confirms that the class of computable languages lies above the regular languages in expressiveness. The fact that computably enumerable sets correspond exactly to type 0 languages — and computable sets lie strictly between type 1 and type 0 — will be established in Section 6.8.
## The Shortlex Ordering
Register machines work over words $\mathbb{W} = \Sigma^\star$, while classical computability theory works over $\mathbb{N}$. To transfer results and intuitions between these settings, we need an order-isomorphism $(\mathbb{N}, <) \cong (\mathbb{W}, <)$.
[definition: Shortlex Ordering]
Fix a total order $<$ on $\Sigma$. The **shortlex ordering** on $\mathbb{W}$ is defined by $w < v$ when either (i) $|w| < |v|$, or (ii) $|w| = |v|$ and $w \neq v$ and the leftmost position where $w$ and $v$ differ has $w$'s letter smaller than $v$'s letter in the fixed ordering on $\Sigma$.
[/definition]
The shortlex ordering is a total order on $\mathbb{W}$: it is irreflexive, transitive, and trichotomous. The empty word $\varepsilon$ is the unique least element. Crucially, every proper initial segment $\{v \mid v < w\}$ is finite, which is exactly what is needed for an isomorphism with $(\mathbb{N}, <)$.
[example: Binary Shortlex Order]
Let $\Sigma = \{0, 1\}$ with $0 < 1$. The shortlex ordering begins:
\begin{align*}
\varepsilon,\ 0,\ 1,\ 00,\ 01,\ 10,\ 11,\ 000,\ 001,\ 010,\ 011,\ 100,\ 101,\ 110,\ 111,\ 0000, \ldots
\end{align*}
Each word is identified with its index in this sequence, counting from zero. There are $2^k$ words of length $k$, so the word $0^k$ has index $2^k - 1$. The shortlex indices allow us to define addition and multiplication on words by acting on indices: for instance, $10 + 01 = 010$ because the indices are $5$, $4$, and $9$ respectively. This gives $\mathbb{W}$ the structure of a commutative semiring.
[/example]
[quotetheorem:1815]
[citeproof:1815]
The isomorphism $(\mathbb{N}, <) \cong (\mathbb{W}, <)$ means we can freely translate between the natural-number and word-based formulations of computability. The computability of the shortlex ordering and successor function ensures that this translation is itself effective.
## Church's Recursive Functions
Register machines are operational: they describe how computation proceeds step by step. An alternative, purely definitional approach characterises the computable functions as the closure of a small set of basic functions under three inductive operations.
[definition: Basic Functions]
The **basic functions** are:
- the **projection functions** $\pi_{k,i} : \mathbb{W}^k \to \mathbb{W}$ defined by $\pi_{k,i}(w) = w_i$,
- the **constant functions** $c_{k,\varepsilon} : \mathbb{W}^k \to \mathbb{W}$ defined by $c_{k,\varepsilon}(w) = \varepsilon$,
- the **successor function** $s : \mathbb{W} \to \mathbb{W}$ with $\#(s(w)) = \#(w) + 1$.
[/definition]
From the basic functions, three closure operations build up the recursive functions.
**Composition.** Given $f : \mathbb{W}^m \rightharpoonup \mathbb{W}$ and $g_1, \ldots, g_m : \mathbb{W}^k \rightharpoonup \mathbb{W}$, their composition is $h(w) = f(g_1(w), \ldots, g_m(w))$.
**Recursion.** Given $f : \mathbb{W}^k \rightharpoonup \mathbb{W}$ and $g : \mathbb{W}^{k+2} \rightharpoonup \mathbb{W}$, the function $h : \mathbb{W}^{k+1} \rightharpoonup \mathbb{W}$ defined by $h(w, \varepsilon) = f(w)$ and $h(w, s(v)) = g(w, v, h(w, v))$ is defined by **recursion**.
**Minimisation.** Given $f : \mathbb{W}^{k+1} \rightharpoonup \mathbb{W}$, the function $h : \mathbb{W}^k \rightharpoonup \mathbb{W}$ defined by
\begin{align*}
h(w) = \begin{cases} v & \text{if for all } u \leq v,\ f(w, u) \downarrow\ \text{and } v \text{ is } <\text{-minimal with } f(w, v) = \varepsilon \\ \uparrow & \text{if no such } v \text{ exists} \end{cases}
\end{align*}
is defined by **minimisation**.
[definition: Recursive and Primitive Recursive Functions]
A partial function is **recursive** (equivalently, **partial recursive**) if it lies in the smallest class containing the basic functions and closed under composition, recursion, and minimisation. A partial function is **primitive recursive** if it lies in the smallest such class closed only under composition and recursion (with no minimisation).
[/definition]
If a class contains the basic functions and is closed under composition, it contains every constant function $c_{k,v}(w) = v$: since $v = s^n(\varepsilon)$ for $n = \#(v)$, we have $c_{k,v} = s^n \circ c_{k,\varepsilon}$.
[example: Addition and Multiplication are Primitive Recursive]
The identity $\pi_{1,0} : \mathbb{W}^1 \to \mathbb{W}$ is primitive recursive (it is a basic function). Addition is the function $h$ defined by $h(w, \varepsilon) = \pi_{1,0}(w) = w$ and $h(w, s(v)) = s(\pi_{3,2}(w, v, h(w, v))) = s(h(w,v))$, so $\#h(m, n) = \#m + \#n$ in index arithmetic. This is defined by recursion from $\pi_{1,0}$ and $s \circ \pi_{3,2}$, both primitive recursive. Multiplication and exponentiation are similarly defined by recursion and are primitive recursive.
[/example]
The recursive functions admit a tree-based encoding. A **recursion tree** $(T, \ell)$ is a finitely branching tree whose nodes are labelled by the operations and basic functions, with branching number and arity constraints matching the table:
| Label | Arity | Branching |
|---|---|---|
| $B^\pi_{k,i}$ | $k$ | 0 |
| $B^c_{k,\varepsilon}$ | $k$ | 0 |
| $B^s$ | 1 | 0 |
| $C_{n,k}$ (composition) | $k$ | $n+1$ |
| $R_k$ (recursion) | $k+1$ | 2 |
| $M_k$ (minimisation) | $k$ | 1 |
Arity and branching constraints: for a composition node $C_{n,k}$, the first child has arity $n$ and all others have arity $k$; for a recursion node $R_k$, the first child has arity $k$ and the second has arity $k+2$; for a minimisation node $M_k$, the single child has arity $k$.
[quotetheorem:1816]
[citeproof:1816]
[quotetheorem:1817]
[citeproof:1817]
The proof does more than prove the implication: it shows the computable functions are themselves closed under recursion and minimisation, so these operations may be used directly in register machine constructions.
## Merging and Splitting Words
Computability theory often requires pairing two inputs into one and splitting one output into two. The Cantor zigzag function provides an effective bijection $\mathbb{N} \times \mathbb{N} \to \mathbb{N}$, which we transfer to words via the shortlex isomorphism.
The **Cantor zigzag function** $z : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ is defined by
\begin{align*}
z(i, j) = \frac{(i+j)(i+j+1)}{2} + j.
\end{align*}
This is a bijection, and all arithmetic involved is computable by register machines.
[definition: Merge and Split]
For words $v, w \in \mathbb{W}$, define their **merge** $v * w$ as the unique word with $\#(v * w) = z(\#v, \#w)$. For a word $w \in \mathbb{W}$, define its **split** by $w^{(0)}$ and $w^{(1)}$, the unique words with $\#w = z(\#w^{(0)}, \#w^{(1)})$.
[/definition]
Splitting is technically not a computable function from $\mathbb{W}$ to $\mathbb{W}$ (since a computable function has codomain $\mathbb{W}$, not $\mathbb{W} \times \mathbb{W}$), but it is an operation that a register machine can perform: compute $z^{-1}(\#w)$ to obtain the pair $(\#u, \#v)$, then convert back to words. Merging and splitting allow us to encode pairs of words as single words, which is essential for the universality construction and the zigzag argument in subsequent sections.
## Universality
Can a single register machine simulate every other register machine? The answer is yes: a universal register machine exists. The key step is encoding machines, inputs, and configurations as words.
Let $\Sigma$ be the working alphabet. Extend it to $\Sigma'$ by adding symbols $\mathbf{0}, \mathbf{1}, +, -, ?, (, ), {,}, \mapsto, \square$. Encodings are defined as follows:
- Natural numbers are encoded in binary using $\mathbf{0}$ and $\mathbf{1}$: $\operatorname{code}(n)$ is the binary representation.
- States $q_k$ are encoded as $\operatorname{code}(q_k) = \operatorname{code}(k)$.
- Instructions in $\operatorname{Instr}(\Sigma, Q)$ are encoded using the punctuation symbols, e.g. $\operatorname{code}(+(k, a, \ell)) = +(\operatorname{code}(k), a, \operatorname{code}(\ell))$.
- Program lines $q \mapsto I$ are encoded as $\operatorname{code}(q) \mapsto \operatorname{code}(I)$.
- A register machine with program $P$ and states $q_0, \ldots, q_n$ is encoded as the concatenation $\operatorname{code}(q_0 \mapsto P(q_0)), \ldots, \operatorname{code}(q_n \mapsto P(q_n))$.
- A sequence of words $w = (w_0, \ldots, w_k)$ is encoded as $\square w_0 \square \cdots \square w_k \square$.
- A configuration $(q, w)$ is encoded as $\operatorname{code}(q)\, \operatorname{code}(w)$.
[quotetheorem:1818]
[citeproof:1818]
[quotetheorem:1819]
[citeproof:1819]
[quotetheorem:1820]
[citeproof:1820]
A register machine $U$ that computes $g$ is called a **universal register machine**. Despite having only finitely many states and registers, $U$ can simulate any register machine with any number of states and registers, by working with encoded descriptions.
The software principle streamlines notation: for $v \in \mathbb{W}$, write $f_{v,k}(w) = f_{U,2}(v, \operatorname{code}(w)) = f_{M,k}(w)$ when $\operatorname{code}(M) = v$. Similarly, write $W_v = \operatorname{dom} f_{v,1}$, so that $\{W_v \mid v \in \mathbb{W}\}$ is exactly the collection of all computably enumerable sets.
[quotetheorem:1821]
[citeproof:1821]
The $s$-$m$-$n$ theorem formalises the idea of **currying**: a function of $k+1$ inputs, once its last argument is fixed to $v$, becomes a function of $k$ inputs whose program code is a computable function of $v$. This is the key tool for constructing reduction functions in the sections that follow.
## The Halting Problem
Does every program eventually stop? For a specific program given a specific input, this question might seem answerable by running the program and waiting. The difficulty is that if the program never stops, we wait forever without ever knowing the answer.
Define the halting sets:
\begin{align*}
\mathbb{K}_0 &= \{(w, v) \mid f_{w,1}(v) \downarrow\}, & \mathbb{K} &= \{w \mid f_{w,1}(w) \downarrow\}.
\end{align*}
Here $\mathbb{K}_0$ is the **general halting set** (machine $w$ halts on input $v$) and $\mathbb{K}$ is the **diagonal halting set** (machine $w$ halts when given its own code as input).
[quotetheorem:1822]
[citeproof:1822]
[quotetheorem:1823]
[citeproof:1823]
The undecidability of the halting problem is one of the most important theorems in mathematics. It demonstrates that there are natural, simply-stated questions about programs to which no algorithm can give a correct yes/no answer. The diagonalisation argument is the same in spirit as Cantor's proof that the real numbers are uncountable, and as Gödel's incompleteness argument: in each case, a self-referential construction produces a contradiction.
## Sets with Quantifiers
The computably enumerable sets are characterised by an existential condition: $w \in X$ if and only if there exists a witness $y$ such that a computable test involving $w$ and $y$ is satisfied. This leads to a stratification of sets by the logical complexity of their membership condition.
[definition: The Arithmetical Hierarchy at Level 1]
A set $X \subseteq \mathbb{W}^k$ is called **$\Sigma_1$** if there exists a computable set $Y \subseteq \mathbb{W}^{k+1}$ such that $w \in X \iff \exists y,\, (w, y) \in Y$. We call $X = p(Y) = \{w \mid \exists y,\, (w, y) \in Y\}$ the **projection** of $Y$.
A set $X$ is called **$\Pi_1$** if $X^c$ is $\Sigma_1$ (equivalently, $w \in X \iff \forall y,\, (w, y) \in Z$ for some computable $Z$). A set is **$\Delta_1$** if it is both $\Sigma_1$ and $\Pi_1$.
[/definition]
The notation is suggestive: $\Sigma$ evokes the existential quantifier (as sums range over indices, existentials range over witnesses), $\Pi$ evokes the universal quantifier, and $\Delta$ (from the German *Durchschnitt*, intersection) marks the class that is the intersection of $\Sigma_1$ and $\Pi_1$.
[quotetheorem:1824]
[citeproof:1824]
This equivalence is fundamental: the computably enumerable sets are exactly those definable by a computable predicate with a single existential quantifier. The following example shows how to handle multiple existential quantifiers using the merge operation.
[example: Merging Existential Quantifiers]
Let $f : \mathbb{W}^2 \rightharpoonup \mathbb{W}$ be partial computable and let $X = \{w \mid \exists v,\, f(w, v) \downarrow\}$. Let $M$ be such that $f = f_{M,2}$. Define $Z = \{(w, v_0, v_1) \mid t_{M,2}(w, v_0, v_1) = a\}$, which is computable. Define $Y = \{(w, u) \mid (w, u^{(0)}, u^{(1)}) \in Z\}$, also computable by the computability of the split operation. Then $\exists v,\, f(w, v) \downarrow \iff \exists v_0, \exists v_1,\, (w, v_0, v_1) \in Z \iff \exists u,\, (w, u) \in Y$, so $X = p(Y)$ is $\Sigma_1$. The two existential quantifiers over $v_0$ and $v_1$ are merged into a single existential over $u = v_0 * v_1$ using the Cantor pairing.
[/example]
This technique is called the **zigzag argument**: by merging words, finitely many existential quantifiers collapse into one, and a sequence of parallel computations can be interleaved into a single computation. The argument extends to show that $\Sigma_1$ is closed under existential quantification over computable predicates.
[quotetheorem:1825]
[citeproof:1825]
[quotetheorem:1826]
[citeproof:1826]
[quotetheorem:1827]
[citeproof:1827]
The converse — every computably enumerable set is a type 0 language — also holds. A proof sketch appears in Section 6.10 using the Church–Turing thesis and Turing machines.
## Closure Properties
The computably enumerable sets and the computable sets each form a class with its own closure properties, which determine what set-theoretic operations preserve decidability or semi-decidability.
[quotetheorem:1828]
[citeproof:1828]
The closure under intersection mirrors the product construction for DFAs: two computable functions can be evaluated in parallel since they always terminate, and their outputs combined.
[quotetheorem:1829]
[citeproof:1829]
[quotetheorem:1830]
[citeproof:1830]
A stronger version holds: $X$ is computably enumerable if and only if $X$ is the image of a **total** computable function. The total version justifies the name "computably enumerable": there is a total program that, when run indefinitely, enumerates all elements of $X$ (and nothing else).
## The Church–Turing Thesis
Register machines, recursive functions, Turing machines, and while programs all give rise to notions of computability. Remarkably, these notions coincide.
[quotetheorem:1831]
The proof that (i) $\iff$ (ii) has already been carried out. The equivalences with Turing machines and while programs follow from similar mutual simulation arguments: each model can simulate the others by encoding states and transitions appropriately.
The **Church–Turing thesis** asserts that these four classes capture the intuitive notion of "computability by any effective procedure." Every formalism for computation that has ever been proposed and studied — lambda calculus, random access machines, cellular automata — computes exactly the same class of functions. The thesis is not a mathematical theorem (it cannot be, since "effective procedure" is not formally defined) but an empirically robust statement about the nature of computation. It licenses us to use whichever model of computation is most convenient for a given argument.
### Computably Enumerable Sets Are Type 0 Languages
Using the Church–Turing thesis, we sketch why every computably enumerable set is a type 0 language.
**Proof sketch.** Let $M$ be a Turing machine computing $\psi_X$. Without loss of generality, assume that upon halting the read-write head is moved to the front of the tape, so that $q_S \square w \square \xrightarrow{M} q_H \square a \square$ for each $w \in X$. The transitions of $M$ define a rewrite system on configurations. Define a type 0 grammar that starts with the production $S \to q_H \square a \square$ and then applies all Turing machine transitions **backwards**: each step of the Turing machine computation corresponds to a production rule applied in reverse. When the starting state $q_S$ is reached, additional productions delete everything except the input word $w$.
This grammar derives exactly those words $w$ for which the Turing machine halts on $w$ with output $a$, namely the elements of $X$. A detailed verification that all productions can be encoded as context-sensitive and type 0 rules is carried out in Salomaa, *Formal Languages* (1973).
## Solvability of Decision Problems
We can now give mathematically precise formulations of classical decision problems. First, encode every grammar $G$ as a word $\operatorname{code}(G) \in \mathbb{W}$, with the convention that every word $w$ is the code of some grammar $G_w$, so the map $w \mapsto G_w$ is a surjection from $\mathbb{W}$ to the class of grammars.
[definition: Decision Problems for Grammars]
The three fundamental decision problems are the sets:
- **Word problem**: $\{(w, v) \mid w \in \mathcal{L}(G_v)\}$.
- **Emptiness problem**: $\{w \mid \mathcal{L}(G_w) = \varnothing\}$.
- **Equivalence problem**: $\{(w, v) \mid \mathcal{L}(G_w) = \mathcal{L}(G_v)\}$.
A decision problem is **solvable** if the corresponding set is computable.
[/definition]
[quotetheorem:1832]
[citeproof:1832]
This is another diagonalisation argument, mirroring the halting problem proof. The connection is more than superficial: the word problem for type 0 grammars reduces to the halting problem (both are $\Sigma_1$-complete), and they can be reduced to each other.
## Reduction Functions
Reduction functions provide the fundamental tool for comparing the complexity of decision problems. The intuition is that if problem $A$ is reducible to problem $B$, then any algorithm for $B$ can be converted into an algorithm for $A$.
[definition: Many-One Reduction]
Let $A, B \subseteq \mathbb{W}$. A total computable function $f : \mathbb{W} \to \mathbb{W}$ is a **many-one reduction** from $A$ to $B$ if $w \in A \iff f(w) \in B$ for all $w \in \mathbb{W}$. We write $A \leq_m B$ if such a reduction exists.
[/definition]
The subscript $m$ stands for "many-one": $f$ need not be injective, so multiple elements of $A$ may map to the same element of $B$. The name reflects that the reduction is a function from all of $\mathbb{W}$, potentially many-to-one.
The relation $\leq_m$ satisfies:
- **Reflexivity**: $A \leq_m A$ (use the identity function).
- **Transitivity**: if $A \leq_m B$ and $B \leq_m C$ then $A \leq_m C$ (compose the reduction functions).
- **Respects complements**: $A \leq_m B$ implies $\mathbb{W} \setminus A \leq_m \mathbb{W} \setminus B$ (use the same reduction function).
- The relation is not antisymmetric in general, so $\leq_m$ is a preorder, not a partial order. However, the quotient by the equivalence $A \sim B \iff A \leq_m B \text{ and } B \leq_m A$ is a partial order on equivalence classes.
The key monotonicity properties are: if $A \leq_m B$ and $B$ is computable, then $A$ is computable ($\mathbb{1}_A = \mathbb{1}_B \circ f$); and if $A \leq_m B$ and $B$ is computably enumerable, then $A$ is computably enumerable ($\psi_A = \psi_B \circ f$). Contrapositively, if $A \leq_m B$ and $A$ is not computable, then $B$ is not computable.
[quotetheorem:1833]
[citeproof:1833]
This means that every non-trivial computable set sits at the bottom of the $\leq_m$ preorder within its own complexity class. The non-reducibilities $\mathbb{W} \setminus \mathbb{K} \not\leq_m \mathbb{K}$ and $\mathbb{K} \not\leq_m \mathbb{W} \setminus \mathbb{K}$ show that $\mathbb{K}$ and $\mathbb{W} \setminus \mathbb{K}$ lie in strictly different equivalence classes, establishing that there are multiple distinct degrees of unsolvability above the computable sets.
The **Turing join** provides further degrees: for $\{0, 1\} \subseteq \Sigma$ and sets $A, B$, define $A \oplus B = 0A \cup 1B$. Then $A \leq_m A \oplus B$ and $B \leq_m A \oplus B$, and $A \oplus B$ is the least upper bound of $A$ and $B$ in the preorder. The set $\mathbb{K} \oplus (\mathbb{W} \setminus \mathbb{K})$ is a new degree that is neither $\Sigma_1$ nor $\Pi_1$.
[definition: Hardness and Completeness]
Let $\mathcal{C}$ be a class of sets. A set $A$ is **$\mathcal{C}$-hard** if $B \leq_m A$ for every $B \in \mathcal{C}$. A set $A$ is **$\mathcal{C}$-complete** if it is $\mathcal{C}$-hard and $A \in \mathcal{C}$.
[/definition]
A $\mathcal{C}$-complete set is in a precise sense the most complicated set in $\mathcal{C}$: every other set in $\mathcal{C}$ reduces to it. The following results identify the complete sets at the first level of the hierarchy.
[quotetheorem:1834]
[citeproof:1834]
Similarly, every non-trivial $\Delta_1$ set is $\Delta_1$-complete, as every computable set reduces to it by the proposition above.
## Rice's Theorem
The most sweeping undecidability result in computability theory says that no non-trivial property of the computably enumerable sets is decidable. To state this precisely, we work with index sets: sets of program codes whose membership depends only on what the program computes, not on how it does so.
Two register machines $M$ and $M'$ are **weakly equivalent** if $W_M = W_{M'} = \operatorname{dom} f_{M,1}$. Two words $v$ and $u$ are weakly equivalent, written $v \sim u$, if $W_v = W_u$.
[definition: Index Set]
A set $I \subseteq \mathbb{W}$ is an **index set** if it is closed under weak equivalence: $v \in I$ and $v \sim u$ implies $u \in I$.
[/definition]
Index sets are precisely the unions of $\sim$-equivalence classes. The trivial index sets are $\varnothing$ and $\mathbb{W}$. Every non-trivial property of the domain of the function computed by a program gives rise to a non-trivial index set.
[example: Standard Index Sets]
- $\operatorname{Emp} = \{v \mid W_v = \varnothing\}$: the set of codes of programs that never halt on any input.
- $\operatorname{Fin} = \{v \mid W_v \text{ is finite}\}$: codes of programs that halt on only finitely many inputs.
- $\operatorname{Inf} = \{v \mid W_v \text{ is infinite}\}$: codes of programs that halt on infinitely many inputs.
- $\operatorname{Tot} = \{v \mid W_v = \mathbb{W}\}$: codes of programs that halt on every input.
All four are index sets. The emptiness problem for type 0 grammars (in the coded formulation) is precisely $\operatorname{Emp}$.
[/example]
Before the proof, we set up the key reduction. Fix $w \in \mathbb{W}$ and define
\begin{align*}
g(u, v) = \begin{cases} f_{w,1}(v) & \text{if } u \in \mathbb{K} \\ \uparrow & \text{if } u \notin \mathbb{K}. \end{cases}
\end{align*}
This function is computable even though the case distinction $u \in \mathbb{K}$ is not computable: when $u \in \mathbb{K}$ we compute $f_{w,1}(v)$, and when $u \notin \mathbb{K}$ we diverge — a machine that first checks whether $u \in \mathbb{K}$ by running $f_{u,1}(u)$, and only if it halts proceeds to compute $f_{w,1}(v)$, realises this. By the $s$-$m$-$n$ theorem, there is a total computable $h : \mathbb{W} \to \mathbb{W}$ such that $f_{h(u),1}(v) = g(u, v)$. The domain satisfies: $W_{h(u)} = W_w$ if $u \in \mathbb{K}$, and $W_{h(u)} = \varnothing$ if $u \notin \mathbb{K}$.
[quotetheorem:1835]
[citeproof:1835]
Rice's theorem deserves careful reflection. It does not say that no property of programs is decidable — properties like "does this program have exactly 5 states?" are perfectly decidable, but they concern the syntactic description of the program, not the function it computes. The theorem says that no non-trivial **semantic** property — a property that depends only on the input-output behaviour — is decidable.
The proof actually gives stronger information about the complexity of the index sets involved.
[remark: Finer Complexity of Index Sets]
Rice's theorem's proof actually shows more than non-computability: depending on whether the witness $e$ lies in $I$ or not, a specific reduction is obtained. If $e \in I$ then the reduction produces $\mathbb{W} \setminus \mathbb{K} \leq_m I$, which shows that $I$ is at least as hard as the complement of the halting set and therefore is not $\Sigma_1$. If $e \notin I$ then the reduction produces $\mathbb{K} \leq_m I$, which shows that $I$ is at least as hard as the halting set and therefore is not $\Pi_1$. For the specific cases: $e \in \operatorname{Emp}$ and $e \in \operatorname{Fin}$, so $\mathbb{W} \setminus \mathbb{K} \leq_m \operatorname{Emp}$ and $\mathbb{W} \setminus \mathbb{K} \leq_m \operatorname{Fin}$. Since $e \notin \operatorname{Inf}$ and $e \notin \operatorname{Tot}$, we get $\mathbb{K} \leq_m \operatorname{Inf}$ and $\mathbb{K} \leq_m \operatorname{Tot}$. The set $\operatorname{Emp}$ is many-one equivalent to $\mathbb{W} \setminus \mathbb{K}$, hence $\Pi_1$-complete.
[/remark]
[quotetheorem:1836]
[citeproof:1836]
The set $\operatorname{Fin}$ provides a particularly illuminating example of a set that lies outside both $\Sigma_1$ and $\Pi_1$ — it sits at a strictly higher level of the arithmetical hierarchy.
[quotetheorem:1837]
[citeproof:1837]
The set $\operatorname{Fin}$, along with $\operatorname{Inf}$, $\operatorname{Tot}$, sits at the level $\Sigma_2 \setminus (\Sigma_1 \cup \Pi_1)$ or $\Pi_2 \setminus (\Sigma_1 \cup \Pi_1)$ of the arithmetical hierarchy. This demonstrates that the hierarchy is strict: there are ever more complex decision problems that require more and more alternating quantifiers to express their membership conditions.