Written as an exercise in Rust and a refreshment of my university classes on finite automatons.
The supported syntax is restricted to the following operators:
* `*` - the Kleene star. The previous item can be present zero or more times.
* Concatenation - is implied between items following each other. `ab*` matches an `a` followed by any number of `b`s
* `+` or `|` - a union of several alternatives. Both symbols are equivalent.
* `_` - a lambda, an empty string
* `()` - parentheses, grouping several tokens together. For example `a|b*` matches a single `a` or any number of `b`s. `(a|b)*` matches any string of `a`s and `b`s
* Symbols not listed above are used as they are. There is no escaping, so matching for the operator literals is not possible.
@@ -28,12 +29,25 @@ python regular-expresso.py match PATTERN
# That is, it gets accepted by one, but not the other
1. The pattern gets parsed into a tree of its parts.
2. We check which symbols can stand at the beginning, which can follow each other and which can stand at the end of the string. These data are used to construct a non-deterministic finite automaton (NFA).
3. *Determinization:* We explore the possible paths through the NFA, recording the state sets visited. Each set is then converted to a single state of a deterministic finite automaton (DFA).
4. *Reduction (minimization):* We iteratively check which states of the DFA are equivalent - they can't be distinguished from each other by any string. Each equivalence class can be collapsed to a single state, compressing and simplifying the DFA.
5. *Normalization:* Finally we normalize the DFA by visiting its states in a breadth-first search (BFS) order, with edges ordered by the alphabet. Any unreachable states get removed here. (1)