Written as an exercise in Rust and a refresher of my university classes on finite automata.
The supported syntax is restricted to the following operators:
* `*` - the Kleene star. The preceding item may appear zero or more times.
* Concatenation - implied between adjacent items. `ab*` matches an `a` followed by any number of `b`s
* `+` or `|` - a union of several alternatives. Both symbols are equivalent.
* `_` - a lambda, the empty string
* `()` - parentheses, grouping several tokens together. For example `a|b*` matches a single `a` or any number of `b`s. `(a|b)*` matches any string of `a`s and `b`s
* Any other symbol matches itself literally. There is no escaping, so the operator characters themselves cannot be matched.
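As a rough illustration of what the parser (step 1 of the pipeline described below) might produce, here is a hypothetical sketch of a pattern represented as a tree. The class names are made up for this example and are not the ones used in this repository.

```python
# Hypothetical parse-tree node classes, for illustration only.
from dataclasses import dataclass


@dataclass
class Symbol:       # a literal character, e.g. `a`
    char: str


@dataclass
class Lambda:       # `_`, the empty string
    pass


@dataclass
class Star:         # `x*`, zero or more repetitions of x
    inner: object


@dataclass
class Concat:       # `xy`, x followed by y
    left: object
    right: object


@dataclass
class Alternative:  # `x+y` / `x|y`, either x or y
    left: object
    right: object


# The pattern `(a|b)*c` parses into this tree:
tree = Concat(Star(Alternative(Symbol("a"), Symbol("b"))), Symbol("c"))
```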
Licensed under GNU GPLv3 or newer.
## Usage
The Python code supports Python 3.9 and higher. The Rust code is written for Rust edition 2021.
# Match a string against a pattern (returns true or false)
python regular-expresso.py match PATTERN STRING
# or
./target/release/regular-expresso match PATTERN STRING
# Compare two regexps and return None if they are equivalent, or the shortest string that can distinguish them
# That is, it gets accepted by one, but not the other
1. The pattern gets parsed into a tree of its parts.
2. We determine which symbols can appear at the beginning of a matching string, which can follow each other, and which can appear at the end. This information is used to construct a non-deterministic finite automaton (NFA).
3. *Determinization:* We explore the possible paths through the NFA, recording the sets of states visited. Each such set is then converted to a single state of a deterministic finite automaton (DFA). (A code sketch of steps 3-5 follows this list.)
4. *Reduction (minimization):* We iteratively check which states of the DFA are equivalent - they can't be distinguished from each other by any string. Each equivalence class can be collapsed to a single state, compressing and simplifying the DFA.
5. *Normalization:* Finally we normalize the DFA by visiting its states in a breadth-first search (BFS) order, with edges ordered by the alphabet. Any unreachable states get removed here. (1)
6. Two regular expressions are equivalent if and only if their normalized DFAs are equal.
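The following is a minimal sketch of steps 3-5, under simplifying assumptions rather than the repository's actual code: the NFA and DFA are plain dicts mapping `(state, symbol)` to a target (a set of states for the NFA, a single state for the DFA), there is one initial state, and the function names `determinize`, `minimize` and `normalize` are made up for this example.

```python
from collections import deque


def determinize(nfa, start, accepting, alphabet):
    """Subset construction: each reachable set of NFA states becomes one DFA state."""
    initial = frozenset([start])
    dfa, dfa_accepting = {}, set()
    seen, queue = {initial}, deque([initial])
    while queue:
        state_set = queue.popleft()
        if state_set & accepting:
            dfa_accepting.add(state_set)
        for symbol in alphabet:
            target = frozenset(t for s in state_set for t in nfa.get((s, symbol), ()))
            dfa[(state_set, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, initial, dfa_accepting, seen


def minimize(dfa, states, accepting, alphabet):
    """Group states that no string can tell apart (naive partition refinement)."""
    # Start from two classes: accepting and non-accepting states.
    classes = {s: s in accepting for s in states}
    while True:
        # Two states stay in the same class only if every symbol leads them to the same class.
        refined = {s: (classes[s], tuple(classes[dfa[(s, a)]] for a in alphabet))
                   for s in states}
        if len(set(refined.values())) == len(set(classes.values())):
            return classes  # the partition is stable; equal labels mean equivalent states
        classes = refined


def normalize(dfa, start, alphabet):
    """Number the reachable DFA states 0, 1, 2, ... in BFS order, edges sorted by symbol."""
    order, queue = {start: 0}, deque([start])
    while queue:
        state = queue.popleft()
        for symbol in sorted(alphabet):
            target = dfa[(state, symbol)]
            if target not in order:
                order[target] = len(order)
                queue.append(target)
    return order  # states that never receive a number are unreachable and can be dropped
```

With a numbering like this, step 6 amounts to renaming each DFA's states and comparing the resulting transition tables and accepting sets.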
### Looking for a distinguishing string
1. We construct DFAs for both patterns as described just above.
2. We merge them into a product automaton, whose states are the Cartesian product of the original DFAs' states and whose transition function is constructed accordingly. The state `(qa, qb)` is accepting if exactly one of `qa`, `qb` was accepting in the original automata. (2)
3. We search through the product automaton's states in BFS order, recording our path, until we find an accepting state. The path from the start to the accepting state spells out a distinguishing word for the two input automata, and because the search is breadth-first, it is a shortest one. A code sketch of this search follows.
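The following is a minimal sketch of this product construction and search, using the same hypothetical data layout as in the sketch above (DFA transitions as dicts keyed by `(state, symbol)`); the function name and signature are made up for this example.

```python
from collections import deque


def find_distinguishing_word(dfa_a, start_a, accept_a,
                             dfa_b, start_b, accept_b, alphabet):
    """Return the shortest string accepted by exactly one of the DFAs, or None."""
    start = (start_a, start_b)
    # For every visited product state, remember the word leading to it from the start.
    words = {start: ""}
    queue = deque([start])
    while queue:
        qa, qb = queue.popleft()
        # A product state is accepting iff exactly one of its components is accepting.
        if (qa in accept_a) != (qb in accept_b):
            return words[(qa, qb)]
        for symbol in sorted(alphabet):
            target = (dfa_a[(qa, symbol)], dfa_b[(qb, symbol)])
            if target not in words:
                words[target] = words[(qa, qb)] + symbol
                queue.append(target)
    return None  # every reachable product state agrees, so the two languages are equal
```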