# Regular expresso
Written as an exercise in Rust and a refreshment of my university classes on finite automatons.
The supported syntax is restricted to the following operators:

- `*` - the Kleene star. The previous item can be present zero or more times.
- Concatenation - implied between items following each other. `ab*` matches an `a` followed by any number of `b`s.
- `+` or `|` - a union of several alternatives. Both symbols are equivalent.
- `_` - a lambda, matching an empty string.
- `()` - parentheses, grouping several tokens together. For example `a|b*` matches a single `a` or any number of `b`s, while `(a|b)*` matches any string of `a`s and `b`s.
- Symbols not listed above are used as they are. There is no escaping, so matching the operator literals is not possible.
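As a rough illustration of how such a pattern can be read (a standalone sketch, not the repository's parser; the node names are my own), a recursive-descent parser for this grammar might look like:

```python
def parse(pattern):
    """Parse a pattern into a nested-tuple tree: ('alt', ...), ('cat', ...),
    ('star', x), ('sym', c) or ('lambda',). Hypothetical node names."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def expr():  # union: concat (('+'|'|') concat)*
        nonlocal pos
        branches = [concat()]
        while peek() in ('+', '|'):
            pos += 1
            branches.append(concat())
        return branches[0] if len(branches) == 1 else ('alt', *branches)

    def concat():  # implied concatenation: item+
        nonlocal pos
        items = []
        while peek() not in (None, ')', '+', '|'):
            items.append(item())
        if not items:
            return ('lambda',)
        return items[0] if len(items) == 1 else ('cat', *items)

    def item():  # atom followed by any number of Kleene stars
        nonlocal pos
        node = atom()
        while peek() == '*':
            pos += 1
            node = ('star', node)
        return node

    def atom():  # '(' expr ')', the lambda '_', or a literal symbol
        nonlocal pos
        c = peek()
        if c == '(':
            pos += 1
            node = expr()
            assert peek() == ')', "unbalanced parentheses"
            pos += 1
            return node
        pos += 1
        return ('lambda',) if c == '_' else ('sym', c)

    return expr()
```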
Licensed under GNU GPLv3 or newer.
## Usage
The Python code supports Python 3.9 and higher. The Rust code is written for Rust edition 2021.
```shell
# If you want to use Rust, compile it first
cargo build --release

# Match a string against a pattern (prints true or false)
python regular-expresso.py match PATTERN STRING
# or
./target/release/regular-expresso match PATTERN STRING

# Compare two regexps and print None if they are equivalent, or the shortest
# string that can distinguish them - that is, one accepted by one pattern,
# but not the other
python regexp.py compare PATTERN1 PATTERN2
# or
./target/release/regular-expresso compare PATTERN1 PATTERN2
```
### Example
```
> ./regular-expresso match "(ab)*" abab
true
> ./regular-expresso match "(ab)*" aba
false
> ./regular-expresso compare "ab(ab)*" "a(ba)*b"
None
> ./regular-expresso compare "a*" "(_|a|aa)"
aaa
```
## Theory
- The pattern is parsed into a tree of its parts.
- We determine which symbols can stand at the beginning of the string, which can follow each other, and which can stand at the end. These data are used to construct a non-deterministic finite automaton (NFA).
- Determinization: We explore the possible paths through the NFA, recording the state sets visited. Each set is then converted to a single state of a deterministic finite automaton (DFA).
- Reduction (minimization): We iteratively check which states of the DFA are equivalent, i.e. cannot be distinguished from each other by any string. Each equivalence class can be collapsed to a single state, compressing and simplifying the DFA.
- Normalization: Finally we normalize the DFA by visiting its states in breadth-first search (BFS) order, with edges ordered by the alphabet. Any unreachable states are removed here. (1)
- Two regular expressions are equivalent if and only if their normalized DFAs are equal.
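The normalization step can be sketched as follows (a toy standalone sketch, not the repository's code; the function and argument names are my own). States are relabeled 0, 1, 2, ... in BFS discovery order, so two DFAs that differ only in state naming normalize to identical structures, and unreachable states drop out because BFS never visits them:

```python
from collections import deque

def normalize(start, transitions, accepting, alphabet):
    """Relabel DFA states in BFS order, with edges taken in alphabet order.

    `transitions` maps (state, symbol) -> state. Returns the relabeled
    transition map and accepting set; unreachable states are never visited
    and therefore disappear from the result.
    """
    order = {start: 0}            # old state -> new BFS index
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for symbol in sorted(alphabet):       # deterministic edge order
            target = transitions.get((state, symbol))
            if target is not None and target not in order:
                order[target] = len(order)
                queue.append(target)
    norm_transitions = {
        (order[s], c): order[t]
        for (s, c), t in transitions.items()
        if s in order and t in order          # keep reachable states only
    }
    norm_accepting = {order[s] for s in accepting if s in order}
    return norm_transitions, norm_accepting
```

With this, the equivalence test from the last step is just an equality check on the two normalized results.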
## Looking for a distinguishing string
- We construct DFAs for both patterns as described just above.
- We merge them into a product automaton, whose states are the Cartesian product of the original DFAs' states, with the transition function constructed accordingly. A state `(qa, qb)` is accepting if exactly one of `qa`, `qb` was accepting in the original automata. (2)
- We search through the product automaton's states in BFS order, recording our path, until we find an accepting state. The path from the start to the accepting state defines a distinguishing word for the two input automata.
## Notes
(1) This is formally described as a part of reduction, but the implementation leaves it as a side effect of normalization.
Also note that the algorithm doesn't actually produce any unreachable states, so this is only relevant in (2).
(2) Construction of the product automaton can create plenty of states that can't actually be reached, so their removal becomes useful.