Regular Expressions Produce Essentially Regular Languages

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The set of regular expressions is defined inductively by seven formation rules, so we proceed by structural induction. The base cases ($\varnothing$, $\varepsilon$, $a \in \Sigma$) each denote a singleton-or-empty language for which we exhibit a regular grammar directly. For the inductive cases — union, concatenation, Kleene star, Kleene plus — we assume the operand languages are essentially regular and show the combined language is too. Union and concatenation use the standard grammar-combining constructions. Kleene star reduces to Kleene plus via $\mathcal{L}(E^\star) = \mathcal{L}(E^+) \cup \{\varepsilon\}$, so the substantive work is closure under Kleene plus. Given an $\varepsilon$-adequate regular grammar $G = (\Sigma, V, P, S)$ for $\mathcal{L}(G)$, we adjoin to $P$ the rules $\{A \to aS \mid A \to a \in P\}$ to form $G^+$; intuitively, every rule that terminates a derivation gains a twin that loops back to the start symbol. A two-sided induction — on block count for $\supseteq$ and on occurrences of $S$ for $\subseteq$ — establishes $\mathcal{L}(G^+) = \mathcal{L}(G)^+$. [/proofplan] [step:Structure the argument by induction on the formation of the regular expression] Regular expressions over $\Sigma$ are defined by the inductive formation rules [Regular Expression](/pages/Regular%20Expression): \begin{align*} E &::= \varnothing \mid \varepsilon \mid a \;(a \in \Sigma) \mid (E_1 + E_2) \mid (E_1 E_2) \mid E^\star \mid E^+. \end{align*} We prove: for every regular expression $E$, the language $\mathcal{L}(E)$ is essentially regular (where *essentially regular* means: either regular, or the union of a regular language with $\{\varepsilon\}$; equivalently, $L$ is essentially regular iff $L \setminus \{\varepsilon\}$ is regular). By structural induction on $E$, it suffices to verify seven clauses — three base cases and four inductive cases corresponding to the operations $+$, concatenation, $\star$, $+$. [/step] [step:Handle the base cases $\varnothing$, $\varepsilon$, and $a \in \Sigma$] *Case $E = \varnothing$.* Then $\mathcal{L}(\varnothing) = \varnothing$. The grammar $G_\varnothing = (\Sigma, \{S\}, \varnothing, S)$ with empty production set generates no word (no rule is applicable to $S$). Hence $\mathcal{L}(G_\varnothing) = \varnothing$, and $\varnothing$ is regular (hence essentially regular). *Case $E = \varepsilon$.* Then $\mathcal{L}(\varepsilon) = \{\varepsilon\}$. By the definition of essentially regular, $\{\varepsilon\} = \varnothing \cup \{\varepsilon\}$ is the union of the regular language $\varnothing$ with $\{\varepsilon\}$, hence essentially regular. *Case $E = a$ for $a \in \Sigma$.* Then $\mathcal{L}(a) = \{a\}$. The grammar $G_a = (\Sigma, \{S\}, \{S \to a\}, S)$ generates exactly the single word $a$: the only derivation is $S \Rightarrow a$. Hence $\mathcal{L}(G_a) = \{a\}$, and $\{a\}$ is regular. [/step] [step:Handle the inductive cases for union and concatenation via grammar combination] Assume, as structural inductive hypotheses, that $\mathcal{L}(E_1)$ and $\mathcal{L}(E_2)$ are essentially regular. *Case $E = E_1 + E_2$.* By definition $\mathcal{L}(E_1 + E_2) = \mathcal{L}(E_1) \cup \mathcal{L}(E_2)$. By [Essentially Regular Languages Are Closed Under Union](/theorems/1786), the union of two essentially regular languages is essentially regular. Both inductive hypotheses supply essentially regular languages, so the hypotheses of that theorem are met. *Case $E = E_1 E_2$.* By definition $\mathcal{L}(E_1 E_2) = \mathcal{L}(E_1) \cdot \mathcal{L}(E_2) := \{u v : u \in \mathcal{L}(E_1), v \in \mathcal{L}(E_2)\}$. By [Essentially Regular Languages Are Closed Under Concatenation](/theorems/1787), the concatenation of two essentially regular languages is essentially regular. Both inductive hypotheses supply essentially regular languages, so the hypotheses of that theorem are met. [/step] [step:Reduce Kleene star to Kleene plus] Assume as structural inductive hypothesis that $\mathcal{L}(E)$ is essentially regular. The Kleene star of a language $L$ is $L^\star := \bigcup_{k \geq 0} L^k$ where $L^0 := \{\varepsilon\}$ and $L^{k+1} := L \cdot L^k$; the Kleene plus is $L^+ := \bigcup_{k \geq 1} L^k$. By separating off the $k = 0$ term, \begin{align*} L^\star &= \{\varepsilon\} \cup L^+. \end{align*} Applying this with $L := \mathcal{L}(E)$, \begin{align*} \mathcal{L}(E^\star) &= \mathcal{L}(E)^\star = \{\varepsilon\} \cup \mathcal{L}(E)^+ = \{\varepsilon\} \cup \mathcal{L}(E^+). \end{align*} Hence if $\mathcal{L}(E^+)$ is essentially regular, then so is $\mathcal{L}(E^\star)$: essentially regular languages are by definition closed under adjoining $\{\varepsilon\}$, since $\{\varepsilon\} \cup (L' \cup \{\varepsilon\}) = L' \cup \{\varepsilon\}$ and $\{\varepsilon\} \cup L'$ is essentially regular whenever $L'$ is regular or essentially regular. It therefore suffices to prove closure under Kleene plus; this is the next step. [/step] [step:Construct the grammar $G^+$ and reduce Kleene-plus closure to verifying $\mathcal{L}(G^+) = \mathcal{L}(G)^+$] We prove: for every essentially regular language $L$, the language $L^+$ is essentially regular. By definition, $L$ essentially regular means $L \setminus \{\varepsilon\}$ is regular, say $L \setminus \{\varepsilon\} = \mathcal{L}(G)$ for some regular grammar $G = (\Sigma, V, P, S)$. Note \begin{align*} L^+ &= (L \setminus \{\varepsilon\})^+ \cup ((L \cap \{\varepsilon\}) \text{ adjustment}) \end{align*} and more cleanly: $w \in L^+$ iff $w = w_1 \cdots w_k$ with $k \geq 1$ and each $w_i \in L$; deleting any $w_i$ equal to $\varepsilon$ from the concatenation leaves the same $w$, so \begin{align*} L^+ &= (L \setminus \{\varepsilon\})^+ \cup (\{\varepsilon\} \cap L). \end{align*} The second summand is $\{\varepsilon\}$ or $\varnothing$ depending on whether $\varepsilon \in L$, which costs at most one $\{\varepsilon\}$ — a legal adjustment within essentially regular. Hence closure under Kleene plus reduces to showing $(L \setminus \{\varepsilon\})^+ = \mathcal{L}(G)^+$ is essentially regular (in fact regular). *Epsilon-adequacy.* By the [$\varepsilon$-adequacy normal form for regular grammars](/theorems/1788), we may assume without loss of generality that $G$ is $\varepsilon$-adequate: the start symbol $S$ does not appear on the right-hand side of any rule of $P$. (If the original grammar is not $\varepsilon$-adequate, we first convert it by introducing a fresh start symbol $S'$ with rule $S' \to \alpha$ for each $S \to \alpha$ in $P$; the converted grammar generates the same language and satisfies the adequacy condition.) *The construction.* Define \begin{align*} P^+ &:= P \cup \{A \to aS \mid A \to a \in P\}, & G^+ &:= (\Sigma, V, P^+, S). \end{align*} Here the added rules take each *terminal* rule $A \to a$ (producing a single letter and no nonterminal) and add a twin $A \to aS$ (producing the same letter followed by the start symbol). Both $G^+$ and $G$ use the same nonterminal set, alphabet, and start symbol; only the production set differs. Since $P \subseteq P^+$, and each rule of $P^+ \setminus P$ has the form $A \to aS$ with $A \in V$, $a \in \Sigma$, $S \in V$ — matching the allowed regular-grammar form $A \to aB$ — the grammar $G^+$ is also a regular grammar. To finish the Kleene-plus case, it suffices to prove \begin{align*} \mathcal{L}(G^+) &= \mathcal{L}(G)^+. \end{align*} We establish both inclusions in the next two steps; $\mathcal{L}(G)^+$ will then be regular (generated by $G^+$), so $(L \setminus \{\varepsilon\})^+ = \mathcal{L}(G)^+$ is regular, whence $L^+$ is essentially regular by the $\{\varepsilon\}$ adjustment above. [/step] [step:Prove $\mathcal{L}(G)^+ \subseteq \mathcal{L}(G^+)$ by induction on the number of concatenated blocks] [claim:For every $m \geq 1$ and every $w_0, w_1, \dots, w_m \in \mathcal{L}(G)$, the word $w_0 w_1 \cdots w_m$ lies in $\mathcal{L}(G^+)$] We induct on $m$. *Base*, $m = 0$: we must show $w_0 \in \mathcal{L}(G^+)$. Since $w_0 \in \mathcal{L}(G)$, there is a $G$-derivation $S \Rightarrow_G^\star w_0$. Every rule used in this derivation belongs to $P \subseteq P^+$, so the same derivation is a $G^+$-derivation: $S \Rightarrow_{G^+}^\star w_0$. Hence $w_0 \in \mathcal{L}(G^+)$. *Inductive step*: assume $w_0 w_1 \cdots w_{m-1} \in \mathcal{L}(G^+)$ for every choice of $w_0, \dots, w_{m-1} \in \mathcal{L}(G)$, and consider $w_0 w_1 \cdots w_m$ with each $w_i \in \mathcal{L}(G)$. By the inductive hypothesis applied to $w_0, \dots, w_{m-1}$, there is a $G^+$-derivation \begin{align*} S \Rightarrow_{G^+}^\star w_0 w_1 \cdots w_{m-1}. \end{align*} This derivation is a sequence of sentential forms, each either a terminal string or a string of the form $u A$ with $u \in \Sigma^\star$ and $A \in V$ (the shape is preserved by every regular rule). At the final step, the sentential form must be $u_{m-1} A$ for some $A \in V$ and must be rewritten by a *terminal* rule $A \to a$ (since the result $w_0 \cdots w_{m-1}$ is purely terminal). Thus $w_0 \cdots w_{m-1} = u_{m-1} a$ with the final step being $u_{m-1} A \Rightarrow_{G^+} u_{m-1} a$ via $A \to a \in P^+$. Since $A \to a \in P \subseteq P^+$, the twin rule $A \to aS \in P^+ \setminus P$ is also available in $G^+$. Replace the final step with the twin: $u_{m-1} A \Rightarrow_{G^+} u_{m-1} a S$, giving a $G^+$-derivation \begin{align*} S \Rightarrow_{G^+}^\star u_{m-1} a S = w_0 w_1 \cdots w_{m-1} S. \end{align*} Now, $w_m \in \mathcal{L}(G)$, so there is a $G$-derivation $S \Rightarrow_G^\star w_m$. Since $P \subseteq P^+$, this is also a $G^+$-derivation. Applying it to the nonterminal $S$ at the right end of $w_0 \cdots w_{m-1} S$ (note: $S$ appears in $w_0 \cdots w_{m-1} S$ exactly once by construction), \begin{align*} w_0 w_1 \cdots w_{m-1} S \Rightarrow_{G^+}^\star w_0 w_1 \cdots w_{m-1} w_m. \end{align*} Concatenating the two derivations yields $S \Rightarrow_{G^+}^\star w_0 w_1 \cdots w_m$. Hence $w_0 \cdots w_m \in \mathcal{L}(G^+)$, closing the induction. [/claim] [proof] See the induction stated in the claim. [/proof] Every element of $\mathcal{L}(G)^+$ has the form $w_0 w_1 \cdots w_m$ with $m \geq 0$ and each $w_i \in \mathcal{L}(G)$, so the claim gives $\mathcal{L}(G)^+ \subseteq \mathcal{L}(G^+)$. [/step] [step:Prove $\mathcal{L}(G^+) \subseteq \mathcal{L}(G)^+$ by induction on the number of occurrences of $S$ in the derivation] Let $w \in \mathcal{L}(G^+)$; fix a $G^+$-derivation \begin{align*} S &= \alpha_0 \Rightarrow_{G^+} \alpha_1 \Rightarrow_{G^+} \cdots \Rightarrow_{G^+} \alpha_N = w, \end{align*} where each $\alpha_i$ is a sentential form. Let $k$ denote the number of times a rule from $P^+ \setminus P = \{A \to aS \mid A \to a \in P\}$ is applied in this derivation — equivalently, the number of times $S$ is re-introduced on the right-hand side (the initial $S = \alpha_0$ is counted as the *starting* $S$, not a re-introduced one). [claim:For every $k \geq 0$, if there is a $G^+$-derivation of $w$ using exactly $k$ applications of rules from $P^+ \setminus P$, then $w \in \mathcal{L}(G)^{k+1}$] We induct on $k$. *Base*, $k = 0$: the derivation uses only rules in $P$. Then it is a valid $G$-derivation of $w$, so $w \in \mathcal{L}(G) = \mathcal{L}(G)^1$. *Inductive step*: assume the claim holds for derivations with fewer than $k$ applications of rules from $P^+ \setminus P$, and consider a $G^+$-derivation of $w$ with exactly $k \geq 1$ such applications. Locate the *last* application of a rule from $P^+ \setminus P$ in the derivation. Say this is step $j$, which rewrites a sentential form $\alpha_{j-1} = u B \Rightarrow_{G^+} u a S = \alpha_j$ using the rule $B \to aS \in P^+ \setminus P$. (By $\varepsilon$-adequacy of $G$, $S$ never appears on the right-hand side of any rule in $P$, so after step $j$ the only rules applied are from $P$; hence the $S$ introduced at step $j$ persists as the *unique* $S$ in all subsequent sentential forms until it is itself rewritten. Furthermore no new $S$ can be introduced after step $j$, since that would be another application of a rule from $P^+ \setminus P$ beyond the "last" one.) After step $j$, the derivation has sentential form $\alpha_j = uaS$, and the remaining steps use only rules in $P$. These remaining steps act on the single $S$ in $\alpha_j$ and eventually rewrite it to a terminal string $w'$, leaving the prefix $ua$ untouched (regular-grammar rules only rewrite the unique nonterminal at the right end of the sentential form). Hence $w = u a w'$, and the subderivation from $S$ to $w'$ uses only rules in $P$, so it is a $G$-derivation: $S \Rightarrow_G^\star w'$, giving $w' \in \mathcal{L}(G)$. Now consider the prefix derivation $S \Rightarrow_{G^+}^\star \alpha_{j-1} = uB$, followed by the single $G^+$-step $uB \Rightarrow_{G^+} u a S$. Replace that last $G^+$-step with the *twin* $G$-step $uB \Rightarrow_G ua$ using the rule $B \to a \in P$ (which exists in $P$ precisely because its twin $B \to aS$ was in $P^+ \setminus P$). The result is a $G^+$-derivation \begin{align*} S \Rightarrow_{G^+}^\star uB \Rightarrow_G ua \end{align*} of the terminal word $ua$, and this derivation uses $k - 1$ applications of rules from $P^+ \setminus P$ (one fewer than the original, since we replaced the $k$-th such step with a $P$-step and the prefix contained only $k - 1$). By the inductive hypothesis applied to the word $ua$, $ua \in \mathcal{L}(G)^k$. Combining with $w' \in \mathcal{L}(G) = \mathcal{L}(G)^1$ and $w = (ua) \cdot w'$, \begin{align*} w &= (ua) w' \in \mathcal{L}(G)^k \cdot \mathcal{L}(G)^1 = \mathcal{L}(G)^{k+1}, \end{align*} closing the induction. [/claim] [proof] See the induction stated in the claim. [/proof] Since every $w \in \mathcal{L}(G^+)$ has *some* derivation with *some* count $k \geq 0$ of $P^+ \setminus P$-applications, the claim gives $w \in \mathcal{L}(G)^{k+1} \subseteq \mathcal{L}(G)^+$. Hence $\mathcal{L}(G^+) \subseteq \mathcal{L}(G)^+$. [guided] The grammar construction for Kleene plus is the cleanest of the regular-expression closure arguments, and it is worth unpacking why it succeeds. *The design decision.* A regular grammar derives a word left-to-right, rewriting a single right-most nonterminal at each step. Every derivation ends by applying a *terminal* rule $A \to a$ to kill off the last nonterminal and produce a pure-terminal sentential form. Thus terminal rules are exactly the "finish" rules; everything else keeps the derivation going. To express $L^+$ — meaning "one or more concatenated copies of words in $L$" — we need a grammar that can *optionally* restart the derivation after finishing a word. The twin rule $A \to aS$ realises this: it acts like $A \to a$ (producing the same letter and closing off the current nonterminal's work) but also reintroduces the start symbol $S$, forcing the derivation to continue generating another word of $L$. The grammar chooses nondeterministically at each terminal step whether to finish ($A \to a$) or to restart ($A \to aS$). *Why the $\varepsilon$-adequacy normal form?* The argument for $\mathcal{L}(G^+) \subseteq \mathcal{L}(G)^+$ relies on a clean factoring: "the last occurrence of $S$ in the derivation was introduced by a twin rule, and everything after it is a $G$-subderivation". For this to work, $S$ must not be re-introduced accidentally by a $P$-rule. $\varepsilon$-adequacy guarantees this: if $S$ does not appear on the right-hand side of any $P$-rule, then the only way $S$ can appear in a sentential form is via the initial $\alpha_0 = S$ or via a twin rule. Without $\varepsilon$-adequacy, a $P$-rule like $A \to aS$ (already in $P$, not added) would confuse the counting; $S$ might re-appear spontaneously, breaking the match between "occurrences of $S$" and "blocks of $L$". *The two inductions carry the two directions.* For $\supseteq$: we induct on the number of blocks $m$, gluing one extra word $w_m$ onto an already-handled concatenation $w_0 \cdots w_{m-1}$. The gluing trick is to "restart" the derivation at the last step of the first $m$ blocks — take the terminal rule that finishes $w_{m-1}$ and replace it with its twin to reintroduce $S$, then append a fresh $G$-derivation of $w_m$. The proof exhibits an explicit $G^+$-derivation of $w_0 \cdots w_m$. For $\subseteq$: we induct on the number of twin-rule applications $k$, peeling off the last one. The peeled-off application marks a seam: before the twin rule lives a $G^+$-derivation with one fewer block, and after the twin rule lives a pure-$G$-derivation of the final block. The induction hypothesis handles the prefix; the final block is in $\mathcal{L}(G)$ directly. Combining the two gives a concatenation in $\mathcal{L}(G)^{k+1}$. *Failure mode without $\varepsilon$-adequacy.* Consider a grammar with $S \to aSb$ and $S \to \varepsilon$-like rules (not regular, but for illustration): if $S$ can appear on the RHS of rules in $P$, a derivation might have many $S$-occurrences without using the twin rule $A \to aS$. The correspondence "each $S$ marks a block boundary" breaks. $\varepsilon$-adequacy closes this gap by ensuring every $S$ in a derivation is either the initial one or a twin-rule introduction — hence every $S$ marks a block. *Closing $\{\varepsilon\}$ arithmetic.* The move from "essentially regular" to "regular" via deleting $\{\varepsilon\}$ and adjoining it back requires some care: $\{\varepsilon\}$ is not regular in the strict sense used here (regular grammars cannot produce $\varepsilon$ unless the start symbol appears on no right-hand side *and* has a rule to a single letter — a special case), but essentially regular absorbs the $\{\varepsilon\}$ ambiguity. The definition of essentially regular is precisely what makes this argument go through cleanly without case-splitting at every step. [/guided] [/step] [step:Combine the cases to conclude structural induction] By the base cases, $\mathcal{L}(\varnothing)$, $\mathcal{L}(\varepsilon)$, and $\mathcal{L}(a)$ are essentially regular for every $a \in \Sigma$. By the union and concatenation cases, the set of regular expressions whose language is essentially regular is closed under $+$ and concatenation. By the Kleene-plus case (via the $G^+$ construction) and the reduction of star to plus, the same set is closed under $E^+$ and $E^\star$. These are exactly the seven formation rules of regular expressions, so by structural induction every regular expression $E$ satisfies: $\mathcal{L}(E)$ is essentially regular. This completes the proof. [/step]

What brings you to Androma?

Start with a route through the knowledge graph.