# Regular expresso

Written as an exercise in Rust and a refresher of my university classes on finite automata.

The supported syntax is restricted to the following operators:
* `*` - the Kleene star. The preceding item may be present zero or more times.
* Concatenation - implied between items that follow each other. `ab*` matches an `a` followed by any number of `b`s.
* `+` or `|` - a union of several alternatives. The two symbols are equivalent.
* `_` - a lambda, the empty string.
* `()` - parentheses, grouping several tokens together. For example, `a|b*` matches a single `a` or any number of `b`s, while `(a|b)*` matches any string of `a`s and `b`s.
* Symbols not listed above match themselves. There is no escaping, so matching the operator literals is not possible.

## Usage
The Python code supports Python 3.9 and higher. The Rust code is written for Rust edition 2021.

```
# Match a string against a pattern (returns true or false)
python regexp.py match PATTERN STRING
# or
cargo run match PATTERN STRING

# Compare two regexps and return None if they are equivalent,
# or the shortest string that distinguishes them - that is,
# a string accepted by one pattern but not the other
python regexp.py compare PATTERN1 PATTERN2
# or
cargo run compare PATTERN1 PATTERN2
```

## Theory
1. The pattern gets parsed into a tree of its parts.
2. We compute which symbols can stand at the beginning of a match, which can follow each other, and which can stand at the end of the string. These data are used to construct a non-deterministic finite automaton (NFA).
3. *Determinization:* We explore the possible paths through the NFA, recording the sets of states visited. Each such set is then converted into a single state of a deterministic finite automaton (DFA).
4. *Reduction (minimization):* We iteratively check which states of the DFA are equivalent, i.e. cannot be distinguished from each other by any string. Each equivalence class is collapsed into a single state, compressing and simplifying the DFA.
5. *Normalization:* Finally, we normalize the DFA by renumbering its states in breadth-first search (BFS) order, taking the edges out of each state in alphabet order. Any unreachable states get removed here. (1)
6. Two regular expressions are equivalent if and only if their normalized DFAs are equal.

Illustrative Python sketches of steps 2-5, and of the product construction described next, are given at the end of this README.

### Looking for a distinguishing string
1. We construct DFAs for both patterns as described above.
2. We merge them into a product automaton whose states are the Cartesian product of the original DFAs' states, with the transition function constructed accordingly. A state `(qa, qb)` is accepting if exactly one of `qa`, `qb` was accepting in the original automata. (2)
3. We search the product automaton's states in BFS order, recording our path, until we find an accepting state. The path from the start to that accepting state spells out a distinguishing word for the two input automata.

### Notes
(1) This step is formally described as a part of reduction, but the implementation leaves it as a side effect of normalization.

Also note that the algorithm itself never produces unreachable states, so this removal only matters in case (2).

(2) The construction of the product automaton can create plenty of states that cannot actually be reached, so their removal is useful.
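### Python sketches of the algorithms
The snippets below are minimal, illustrative Python sketches of the steps above. None of them is the repository's actual code: the function names, the dict-based automaton representation, and the example automata are all assumptions made for the sake of illustration.

First, step 2: computing which symbol occurrences (positions) can start a match, which can follow each other, and which can end one, over a toy expression tree whose node classes are likewise invented here (a Glushkov-style construction):

```python
# Hedged sketch of step 2; all class and function names are made up.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Sym:
    pos: int    # positions number each symbol occurrence, keeping duplicates distinct
    char: str

@dataclass(frozen=True)
class Cat:      # concatenation of two sub-patterns
    left: "Node"
    right: "Node"

@dataclass(frozen=True)
class Alt:      # union (`+` / `|`)
    left: "Node"
    right: "Node"

@dataclass(frozen=True)
class Star:     # Kleene star
    inner: "Node"

Node = Union[Sym, Cat, Alt, Star]

def nullable(n):
    """Can this sub-pattern match the empty string?"""
    if isinstance(n, Sym):
        return False
    if isinstance(n, Cat):
        return nullable(n.left) and nullable(n.right)
    if isinstance(n, Alt):
        return nullable(n.left) or nullable(n.right)
    return True  # Star

def first(n):
    """Positions that can stand at the beginning of a match."""
    if isinstance(n, Sym):
        return {n.pos}
    if isinstance(n, Cat):
        return first(n.left) | (first(n.right) if nullable(n.left) else set())
    if isinstance(n, Alt):
        return first(n.left) | first(n.right)
    return first(n.inner)  # Star

def last(n):
    """Positions that can stand at the end of a match (mirror of `first`)."""
    if isinstance(n, Sym):
        return {n.pos}
    if isinstance(n, Cat):
        return last(n.right) | (last(n.left) if nullable(n.right) else set())
    if isinstance(n, Alt):
        return last(n.left) | last(n.right)
    return last(n.inner)  # Star

def follow(n):
    """Pairs (p, q) of positions that can be adjacent in a match."""
    if isinstance(n, Cat):
        return (follow(n.left) | follow(n.right)
                | {(p, q) for p in last(n.left) for q in first(n.right)})
    if isinstance(n, Alt):
        return follow(n.left) | follow(n.right)
    if isinstance(n, Star):
        return follow(n.inner) | {(p, q) for p in last(n.inner) for q in first(n.inner)}
    return set()  # Sym

# The tree of `(a|b)*b`, with symbol occurrences numbered 1, 2, 3:
tree = Cat(Star(Alt(Sym(1, "a"), Sym(2, "b"))), Sym(3, "b"))
assert first(tree) == {1, 2, 3} and last(tree) == {3}
assert follow(tree) == {(p, q) for p in (1, 2) for q in (1, 2, 3)}
```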
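Step 3, determinization by the classic subset construction. An NFA is assumed to be a dict mapping each state to `{symbol: set of successor states}`:

```python
# Hedged sketch of step 3 (subset construction); the representation is an assumption.
def determinize(nfa, start, accepting):
    """Convert an NFA into a DFA whose states are frozensets of NFA states."""
    dfa = {}
    dfa_accepting = set()
    initial = frozenset([start])
    queue = [initial]
    while queue:
        state_set = queue.pop()
        if state_set in dfa:
            continue  # already explored
        if state_set & accepting:
            dfa_accepting.add(state_set)
        # The DFA successor on a symbol is the union of the NFA successors.
        targets = {}
        for q in state_set:
            for symbol, succs in nfa.get(q, {}).items():
                targets.setdefault(symbol, set()).update(succs)
        dfa[state_set] = {s: frozenset(t) for s, t in targets.items()}
        queue.extend(dfa[state_set].values())
    return dfa, initial, dfa_accepting

# Hypothetical NFA for "any string over {a, b} that ends in b":
nfa = {0: {"a": {0}, "b": {0, 1}}}
dfa, dfa_start, dfa_accepting = determinize(nfa, 0, accepting={1})
```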
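Steps 4 and 5 on the resulting DFA: `minimize` is Moore-style partition refinement, and `normalize` renames states in BFS order, which silently drops unreachable ones (both names are made up here):

```python
from collections import deque

# Hedged sketches of steps 4 and 5; the DFA is assumed to be a dict of
# state -> {symbol -> state}, as produced by the sketch above.
def minimize(dfa, start, accepting):
    """Collapse indistinguishable states by iterative partition refinement."""
    # Start from the accepting / non-accepting split and keep splitting
    # classes whose members disagree on where some symbol leads.
    partition = {q: (q in accepting) for q in dfa}
    while True:
        signature = {q: (partition[q],
                         tuple(sorted((s, partition[t]) for s, t in dfa[q].items())))
                     for q in dfa}
        classes = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        refined = {q: classes[signature[q]] for q in dfa}
        if len(set(refined.values())) == len(set(partition.values())):
            break  # nothing split any more, the partition is stable
        partition = refined
    quotient = {partition[q]: {s: partition[t] for s, t in dfa[q].items()}
                for q in dfa}
    return quotient, partition[start], {partition[q] for q in accepting}

def normalize(dfa, start, accepting, alphabet):
    """Rename states in BFS order, taking edges in alphabet order.
    Unreachable states are never visited and thus silently dropped."""
    names = {start: 0}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for symbol in sorted(alphabet):
            target = dfa[q].get(symbol)
            if target is not None and target not in names:
                names[target] = len(names)
                queue.append(target)
    renamed = {names[q]: {s: names[t] for s, t in dfa[q].items() if t in names}
               for q in names}
    return renamed, 0, {names[q] for q in accepting if q in names}
```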
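Finally, a sketch of the product-automaton search for a distinguishing word. Missing transitions are routed to an implicit dead state spelled `None` here, so the two DFAs need neither identical state sets nor total transition functions:

```python
from collections import deque

# Hedged sketch of the product-automaton search; the DFA representation
# matches the sketches above and is an assumption.
def distinguish(dfa1, start1, acc1, dfa2, start2, acc2, alphabet):
    """Return the shortest string accepted by exactly one of the two DFAs,
    or None when the automata are equivalent."""
    start = (start1, start2)
    parents = {start: None}  # product state -> (predecessor, symbol)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        q1, q2 = state
        if (q1 in acc1) != (q2 in acc2):
            # Walk back up the BFS tree, collecting the symbols on the path.
            word = []
            while parents[state] is not None:
                state, symbol = parents[state]
                word.append(symbol)
            return "".join(reversed(word))
        for symbol in sorted(alphabet):
            succ = (dfa1.get(q1, {}).get(symbol), dfa2.get(q2, {}).get(symbol))
            if succ not in parents:
                parents[succ] = (state, symbol)
                queue.append(succ)
    return None

# Hypothetical example: words ending in "b" vs. words containing "b".
ends_in_b = {0: {"a": 0, "b": 1}, 1: {"a": 0, "b": 1}}
contains_b = {0: {"a": 0, "b": 1}, 1: {"a": 1, "b": 1}}
print(distinguish(ends_in_b, 0, {1}, contains_b, 0, {1}, "ab"))  # prints: ba
```

Because the search is breadth-first, the first accepting product state found yields a shortest distinguishing word, matching the behaviour of `compare` described above.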