Written as an exercise in Rust and a refresher of my university classes on finite automata.
The supported syntax is restricted to the following operators:
* `*` - the Kleene star. The preceding item may appear zero or more times.
* Concatenation - implied between adjacent items. `ab*` matches an `a` followed by any number of `b`s
* `+` or `|` - a union of several alternatives. Both symbols are equivalent.
* `_` - a lambda, the empty string
* `()` - parentheses, grouping several tokens together. For example `a|b*` matches a single `a` or any number of `b`s. `(a|b)*` matches any string of `a`s and `b`s
* Any other symbol matches itself literally. There is no escaping, so the operator characters themselves cannot be matched.
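As a rough illustration of what the parser (step 1 of the pipeline described below) might produce, here is a hypothetical sketch of a pattern represented as a tree. The class names are made up for this example and are not the ones used in this repository.

```python
# Hypothetical parse-tree node classes, for illustration only.
from dataclasses import dataclass


@dataclass
class Symbol:       # a literal character, e.g. `a`
    char: str


@dataclass
class Lambda:       # `_`, the empty string
    pass


@dataclass
class Star:         # `x*`, zero or more repetitions of x
    inner: object


@dataclass
class Concat:       # `xy`, x followed by y
    left: object
    right: object


@dataclass
class Alternative:  # `x+y` / `x|y`, either x or y
    left: object
    right: object


# The pattern `(a|b)*c` parses into this tree:
tree = Concat(Star(Alternative(Symbol("a"), Symbol("b"))), Symbol("c"))
```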
Licensed under GNU GPLv3 or newer.
## Usage
The Python code supports Python 3.9 and higher. The Rust code is written for Rust edition 2021.
# Match a string against a pattern (returns true or false)
python regular-expresso.py match PATTERN STRING
# or
./target/release/regular-expresso match PATTERN STRING
# Compare two regexps and return None if they are equivalent, or the shortest string that can distinguish them
# That is, it gets accepted by one, but not the other
1. The pattern gets parsed into a tree of its parts.
2. We determine which symbols can appear at the beginning of a matching string, which can follow each other, and which can appear at the end. This information is used to construct a non-deterministic finite automaton (NFA).
3. *Determinization:* We explore the possible paths through the NFA, recording the sets of states visited. Each such set is then converted to a single state of a deterministic finite automaton (DFA). (A code sketch of steps 3-5 follows this list.)
4. *Reduction (minimization):* We iteratively check which states of the DFA are equivalent - they can't be distinguished from each other by any string. Each equivalence class can be collapsed to a single state, compressing and simplifying the DFA.
5. *Normalization:* Finally we normalize the DFA by visiting its states in a breadth-first search (BFS) order, with edges ordered by the alphabet. Any unreachable states get removed here. (1)
6. Two regular expressions are equivalent if and only if their normalized DFAs are equal.
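The following is a minimal sketch of steps 3-5, under simplifying assumptions rather than the repository's actual code: the NFA and DFA are plain dicts mapping `(state, symbol)` to a target (a set of states for the NFA, a single state for the DFA), there is one initial state, and the function names `determinize`, `minimize` and `normalize` are made up for this example.

```python
from collections import deque


def determinize(nfa, start, accepting, alphabet):
    """Subset construction: each reachable set of NFA states becomes one DFA state."""
    initial = frozenset([start])
    dfa, dfa_accepting = {}, set()
    seen, queue = {initial}, deque([initial])
    while queue:
        state_set = queue.popleft()
        if state_set & accepting:
            dfa_accepting.add(state_set)
        for symbol in alphabet:
            target = frozenset(t for s in state_set for t in nfa.get((s, symbol), ()))
            dfa[(state_set, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, initial, dfa_accepting, seen


def minimize(dfa, states, accepting, alphabet):
    """Group states that no string can tell apart (naive partition refinement)."""
    # Start from two classes: accepting and non-accepting states.
    classes = {s: s in accepting for s in states}
    while True:
        # Two states stay in the same class only if every symbol leads them to the same class.
        refined = {s: (classes[s], tuple(classes[dfa[(s, a)]] for a in alphabet))
                   for s in states}
        if len(set(refined.values())) == len(set(classes.values())):
            return classes  # the partition is stable; equal labels mean equivalent states
        classes = refined


def normalize(dfa, start, alphabet):
    """Number the reachable DFA states 0, 1, 2, ... in BFS order, edges sorted by symbol."""
    order, queue = {start: 0}, deque([start])
    while queue:
        state = queue.popleft()
        for symbol in sorted(alphabet):
            target = dfa[(state, symbol)]
            if target not in order:
                order[target] = len(order)
                queue.append(target)
    return order  # states that never receive a number are unreachable and can be dropped
```

With a numbering like this, step 6 amounts to renaming each DFA's states and comparing the resulting transition tables and accepting sets.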
### Looking for a distinguishing string
1. We construct DFAs for both patterns as described just above.
2. We merge them into a product automaton, whose states are the Cartesian product of the original DFAs' states and whose transition function is constructed accordingly. The state `(qa, qb)` is accepting if exactly one of `qa`, `qb` was accepting in the original automata. (2)
3. We search through the product automaton's states in BFS order, recording our path, until we find an accepting state. The path from the start to the accepting state spells out a distinguishing word for the two input automata, and because the search is breadth-first, it is a shortest one. A code sketch of this search follows.
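The following is a minimal sketch of this product construction and search, using the same hypothetical data layout as in the sketch above (DFA transitions as dicts keyed by `(state, symbol)`); the function name and signature are made up for this example.

```python
from collections import deque


def find_distinguishing_word(dfa_a, start_a, accept_a,
                             dfa_b, start_b, accept_b, alphabet):
    """Return the shortest string accepted by exactly one of the DFAs, or None."""
    start = (start_a, start_b)
    # For every visited product state, remember the word leading to it from the start.
    words = {start: ""}
    queue = deque([start])
    while queue:
        qa, qb = queue.popleft()
        # A product state is accepting iff exactly one of its components is accepting.
        if (qa in accept_a) != (qb in accept_b):
            return words[(qa, qb)]
        for symbol in sorted(alphabet):
            target = (dfa_a[(qa, symbol)], dfa_b[(qb, symbol)])
            if target not in words:
                words[target] = words[(qa, qb)] + symbol
                queue.append(target)
    return None  # every reachable product state agrees, so the two languages are equal
```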