# Regular expresso
Written as an exercise in Rust and a refreshment of my university classes on finite automatons.
The supported syntax is restricted to the following operators:

- `*` - the Kleene star. The previous item can be present zero or more times.
- Concatenation - implied between items following each other. `ab*` matches an `a` followed by any number of `b`s.
- `+` or `|` - a union of several alternatives. Both symbols are equivalent.
- `_` - a lambda, matching an empty string.
- `()` - parentheses, grouping several tokens together. For example `a|b*` matches a single `a` or any number of `b`s, while `(a|b)*` matches any string of `a`s and `b`s.
- Symbols not listed above are used as they are. There is no escaping, so matching the operator literals is not possible.
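As a rough illustration of how such a pattern can be read (a standalone sketch, not the repository's parser; the node names are my own), a recursive-descent parser for this grammar might look like:

```python
def parse(pattern):
    """Parse a pattern into a nested-tuple tree: ('alt', ...), ('cat', ...),
    ('star', x), ('sym', c) or ('lambda',). Hypothetical node names."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def expr():  # union: concat (('+'|'|') concat)*
        nonlocal pos
        branches = [concat()]
        while peek() in ('+', '|'):
            pos += 1
            branches.append(concat())
        return branches[0] if len(branches) == 1 else ('alt', *branches)

    def concat():  # implied concatenation: item+
        nonlocal pos
        items = []
        while peek() not in (None, ')', '+', '|'):
            items.append(item())
        if not items:
            return ('lambda',)
        return items[0] if len(items) == 1 else ('cat', *items)

    def item():  # atom followed by any number of Kleene stars
        nonlocal pos
        node = atom()
        while peek() == '*':
            pos += 1
            node = ('star', node)
        return node

    def atom():  # '(' expr ')', the lambda '_', or a literal symbol
        nonlocal pos
        c = peek()
        if c == '(':
            pos += 1
            node = expr()
            assert peek() == ')', "unbalanced parentheses"
            pos += 1
            return node
        pos += 1
        return ('lambda',) if c == '_' else ('sym', c)

    return expr()
```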
Licensed under GNU GPLv3 or newer.
## Usage
The Python code supports Python 3.9 and higher. The Rust code is written for Rust edition 2021.
```shell
# If you want to use Rust, compile it first
cargo build --release

# Match a string against a pattern (prints true or false)
python regular-expresso.py match PATTERN STRING
# or
./target/release/regular-expresso match PATTERN STRING

# Compare two regexps and print None if they are equivalent, or the shortest
# string that can distinguish them - that is, one accepted by one pattern,
# but not the other
python regexp.py compare PATTERN1 PATTERN2
# or
./target/release/regular-expresso compare PATTERN1 PATTERN2
```
### Example
```
> ./regular-expresso match "(ab)*" abab
true
> ./regular-expresso match "(ab)*" aba
false
> ./regular-expresso compare "ab(ab)*" "a(ba)*b"
None
> ./regular-expresso compare "a*" "(_|a|aa)"
aaa
```
## Theory
- The pattern is parsed into a tree of its parts.
- We determine which symbols can stand at the beginning of the string, which can follow each other, and which can stand at the end. These data are used to construct a non-deterministic finite automaton (NFA).
- Determinization: We explore the possible paths through the NFA, recording the state sets visited. Each set is then converted to a single state of a deterministic finite automaton (DFA).
- Reduction (minimization): We iteratively check which states of the DFA are equivalent, i.e. cannot be distinguished from each other by any string. Each equivalence class can be collapsed to a single state, compressing and simplifying the DFA.
- Normalization: Finally we normalize the DFA by visiting its states in breadth-first search (BFS) order, with edges ordered by the alphabet. Any unreachable states are removed here. (1)
- Two regular expressions are equivalent if and only if their normalized DFAs are equal.
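The normalization step can be sketched as follows (a toy standalone sketch, not the repository's code; the function and argument names are my own). States are relabeled 0, 1, 2, ... in BFS discovery order, so two DFAs that differ only in state naming normalize to identical structures, and unreachable states drop out because BFS never visits them:

```python
from collections import deque

def normalize(start, transitions, accepting, alphabet):
    """Relabel DFA states in BFS order, with edges taken in alphabet order.

    `transitions` maps (state, symbol) -> state. Returns the relabeled
    transition map and accepting set; unreachable states are never visited
    and therefore disappear from the result.
    """
    order = {start: 0}            # old state -> new BFS index
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for symbol in sorted(alphabet):       # deterministic edge order
            target = transitions.get((state, symbol))
            if target is not None and target not in order:
                order[target] = len(order)
                queue.append(target)
    norm_transitions = {
        (order[s], c): order[t]
        for (s, c), t in transitions.items()
        if s in order and t in order          # keep reachable states only
    }
    norm_accepting = {order[s] for s in accepting if s in order}
    return norm_transitions, norm_accepting
```

With this, the equivalence test from the last step is just an equality check on the two normalized results.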
## Looking for a distinguishing string
- We construct DFAs for both patterns as described just above.
- We merge them into a product automaton, whose states are the Cartesian product of the original DFAs' states, with the transition function constructed accordingly. A state `(qa, qb)` is accepting if exactly one of `qa`, `qb` was accepting in the original automata. (2)
- We search through the product automaton's states in BFS order, recording our path, until we find an accepting state. The path from the start to the accepting state defines a distinguishing word for the two input automata.
## Notes
(1) This is formally described as a part of reduction, but the implementation leaves it as a side effect of normalization.
Also note that the algorithm doesn't actually produce any unreachable states, so this is only relevant in (2).
(2) Construction of the product automaton can create plenty of states that can't actually be reached, so their removal becomes useful.