alphabet.rs |
!
This module provides APIs for dealing with the alphabets of finite state
machines.
There are two principal types in this module, [`ByteClasses`] and [`Unit`].
The former defines the alphabet of a finite state machine while the latter
represents an element of that alphabet.
To a first approximation, the alphabet of all automata in this crate is just
a `u8`. Namely, every distinct byte value. All 256 of them. In practice, this
can be quite wasteful when building a transition table for a DFA, since it
requires storing a state identifier for each element in the alphabet. Instead,
we collapse the alphabet of an automaton down into equivalence classes, where
every byte in the same equivalence class never discriminates between a match or
a non-match from any other byte in the same class. For example, in the regex
`[a-z]+`, then you could consider it having an alphabet consisting of two
equivalence classes: `a-z` and everything else. In terms of the transitions on
an automaton, it doesn't actually require representing every distinct byte.
Just the equivalence classes.
The downside of equivalence classes is that, of course, searching a haystack
deals with individual byte values. Those byte values need to be mapped to
their corresponding equivalence class. This is what `ByteClasses` does. In
practice, doing this for every state transition has negligible impact on modern
CPUs. Moreover, it helps make more efficient use of the CPU cache by (possibly
considerably) shrinking the size of the transition table.
One last hiccup concerns `Unit`. Namely, because of look-around and how the
DFAs in this crate work, we need to add a sentinel value to our alphabet
of equivalence classes that represents the "end" of a search. We call that
sentinel [`Unit::eoi`] or "end of input." Thus, a `Unit` is either an
equivalence class corresponding to a set of bytes, or it is a special "end of
input" sentinel.
In general, you should not expect to need either of these types unless you're
doing lower level shenanigans with DFAs, or even building your own DFAs.
(Although, you don't have to use these types to build your own DFAs of course.)
For example, if you're walking a DFA's state graph, it's probably useful to
make use of [`ByteClasses`] to visit each element in the DFA's alphabet instead
of just visiting every distinct `u8` value. The latter isn't necessarily wrong,
but it could be potentially very wasteful.
|
40709 |
captures.rs |
!
Provides types for dealing with capturing groups.
Capturing groups refer to sub-patterns of regexes that some regex engines can
report matching offsets for. For example, matching `[a-z]([0-9]+)` against
`a789` would give `a789` as the overall match (for the implicit capturing group
at index `0`) and `789` as the match for the capturing group `([0-9]+)` (an
explicit capturing group at index `1`).
Not all regex engines can report match offsets for capturing groups. Indeed,
to a first approximation, regex engines that can report capturing group offsets
tend to be quite a bit slower than regex engines that can't. This is because
tracking capturing groups at search time usually requires more "power" that
in turn adds overhead.
Other regex implementations might call capturing groups "submatches."
# Overview
The main types in this module are:
[`Captures`] records the capturing group offsets found during a search. It
provides convenience routines for looking up capturing group offsets by either
index or name.
[`GroupInfo`] records the mapping between capturing groups and "slots,"
where the latter are how capturing groups are recorded during a regex search.
This also keeps a mapping from capturing group name to index, and capture
group index to name. A `GroupInfo` is used by `Captures` internally to
provide a convenient API. It is unlikely that you'll use a `GroupInfo`
directly, but for example, if you've compiled an Thompson NFA, then you can use
[`thompson::NFA::group_info`](crate::nfa::thompson::NFA::group_info) to get its
underlying `GroupInfo`.
|
101364 |
determinize |
|
|
empty.rs |
|
13516 |
escape.rs |
!
Provides convenience routines for escaping raw bytes.
Since this crate tends to deal with `&[u8]` everywhere and the default
`Debug` implementation just shows decimal integers, it makes debugging those
representations quite difficult. This module provides types that show `&[u8]`
as if it were a string, with invalid UTF-8 escaped into its byte-by-byte hex
representation.
|
2996 |
int.rs |
!
This module provides several integer oriented traits for converting between
both fixed size integers and integers whose size varies based on the target
(like `usize`).
The driving design principle of this module is to attempt to centralize as many
`as` casts as possible here. And in particular, we separate casts into two
buckets:
Casts that we use for their truncating behavior. In this case, we use more
descriptive names, like `low_u32` and `high_u32`.
Casts that we use for converting back-and-forth between `usize`. These
conversions are generally necessary because we often store indices in different
formats to save on memory, which requires converting to and from `usize`. In
this case, we very specifically do not want to overflow, and so the methods
defined here will panic if the `as` cast would be lossy in debug mode. (A
normal `as` cast will never panic!)
For `as` casts between raw pointers, we use `cast`, so `as` isn't needed there.
For regex engines, floating point is just never used, so we don't have to worry
about `as` casts for those.
Otherwise, this module pretty much covers all of our `as` needs except for one
thing: const contexts. There are a select few places in this crate where we
still need to use `as` because const functions on traits aren't stable yet.
If we wind up significantly expanding our const footprint in this crate, it
might be worth defining free functions to handle those cases. But at the time
of writing, that just seemed like too much ceremony. Instead, I comment each
such use of `as` in a const context with a "fixme" notice.
NOTE: for simplicity, we don't take target pointer width into account here for
`usize` conversions. Since we currently only panic in debug mode, skipping the
check when it can be proven it isn't needed at compile time doesn't really
matter. Now, if we wind up wanting to do as many checks as possible in release
mode, then we would want to skip those when we know the conversions are always
non-lossy.
NOTE: this module isn't an exhaustive API. For example, we still use things
like `u64::from` where possible, or even `usize::try_from()` for when we do
explicitly want to panic or when we want to return an error for overflow.
|
6518 |
interpolate.rs |
!
Provides routines for interpolating capture group references.
That is, if a replacement string contains references like `$foo` or `${foo1}`,
then they are replaced with the corresponding capture values for the groups
named `foo` and `foo1`, respectively. Similarly, syntax like `$1` and `${1}`
is supported as well, with `1` corresponding to a capture group index and not
a name.
This module provides the free functions [`string`] and [`bytes`], which
interpolate Rust Unicode strings and byte strings, respectively.
# Format
These routines support two different kinds of capture references: unbraced and
braced.
For the unbraced format, the format supported is `$ref` where `name` can be
any character in the class `[0-9A-Za-z_]`. `ref` is always the longest
possible parse. So for example, `$1a` corresponds to the capture group named
`1a` and not the capture group at index `1`. If `ref` matches `^[0-9]+$`, then
it is treated as a capture group index itself and not a name.
For the braced format, the format supported is `${ref}` where `ref` can be any
sequence of bytes except for `}`. If no closing brace occurs, then it is not
considered a capture reference. As with the unbraced format, if `ref` matches
`^[0-9]+$`, then it is treated as a capture group index and not a name.
The braced format is useful for exerting precise control over the name of the
capture reference. For example, `${1}a` corresponds to the capture group
reference `1` followed by the letter `a`, where as `$1a` (as mentioned above)
corresponds to the capture group reference `1a`. The braced format is also
useful for expressing capture group names that use characters not supported by
the unbraced format. For example, `${foo[bar].baz}` refers to the capture group
named `foo[bar].baz`.
If a capture group reference is found and it does not refer to a valid capture
group, then it will be replaced with the empty string.
To write a literal `$`, use `$$`.
To be clear, and as exhibited via the type signatures in the routines in this
module, it is impossible for a replacement string to be invalid. A replacement
string may not have the intended semantics, but the interpolation procedure
itself can never fail.
|
17309 |
iter.rs |
!
Generic helpers for iteration of matches from a regex engine in a haystack.
The principle type in this module is a [`Searcher`]. A `Searcher` provides
its own lower level iterator-like API in addition to methods for constructing
types that implement `Iterator`. The documentation for `Searcher` explains a
bit more about why these different APIs exist.
Currently, this module supports iteration over any regex engine that works
with the [`HalfMatch`], [`Match`] or [`Captures`] types.
|
37773 |
lazy.rs |
!
A lazily initialized value for safe sharing between threads.
The principal type in this module is `Lazy`, which makes it easy to construct
values that are shared safely across multiple threads simultaneously.
|
19128 |
look.rs |
!
Types and routines for working with look-around assertions.
This module principally defines two types:
[`Look`] enumerates all of the assertions supported by this crate.
[`LookSet`] provides a way to efficiently store a set of [`Look`] values.
[`LookMatcher`] provides routines for checking whether a `Look` or a
`LookSet` matches at a particular position in a haystack.
|
65075 |
memchr.rs |
!
This module defines simple wrapper routines for the memchr functions from the
`memchr` crate. Basically, when the `memchr` crate is available, we use it,
otherwise we use a naive implementation which is still pretty fast.
|
2868 |
mod.rs |
!
A collection of modules that provide APIs that are useful across many regex
engines.
While one should explore the sub-modules directly to get a sense of what's
there, here are some highlights that tie the sub-modules to higher level
use cases:
`alphabet` contains APIs that are useful if you're doing low level things
with the DFAs in this crate. For example, implementing determinization or
walking its state graph directly.
`captures` contains APIs for dealing with capture group matches and their
mapping to "slots" used inside an NFA graph. This is also where you can find
iterators over capture group names.
`escape` contains types for pretty-printing raw byte slices as strings.
`iter` contains API helpers for writing regex iterators.
`lazy` contains a no-std and no-alloc variant of `lazy_static!` and
`once_cell`.
`look` contains APIs for matching and configuring look-around assertions.
`pool` provides a way to reuse mutable memory allocated in a thread safe
manner.
`prefilter` provides APIs for building prefilters and using them in searches.
`primitives` are what you might use if you're doing lower level work on
automata, such as walking an NFA state graph.
`syntax` provides some higher level convenience functions for interacting
with the `regex-syntax` crate.
`wire` is useful if you're working with DFA serialization.
|
1970 |
pool.rs |
|
43396 |
prefilter |
|
|
primitives.rs |
!
Lower level primitive types that are useful in a variety of circumstances.
# Overview
This list represents the principle types in this module and briefly describes
when you might want to use them.
[`PatternID`] - A type that represents the identifier of a regex pattern.
This is probably the most widely used type in this module (which is why it's
also re-exported in the crate root).
[`StateID`] - A type the represents the identifier of a finite automaton
state. This is used for both NFAs and DFAs, with the notable exception of
the hybrid NFA/DFA. (The hybrid NFA/DFA uses a special purpose "lazy" state
identifier.)
[`SmallIndex`] - The internal representation of both a `PatternID` and a
`StateID`. Its purpose is to serve as a type that can index memory without
being as big as a `usize` on 64-bit targets. The main idea behind this type
is that there are many things in regex engines that will, in practice, never
overflow a 32-bit integer. (For example, like the number of patterns in a regex
or the number of states in an NFA.) Thus, a `SmallIndex` can be used to index
memory without peppering `as` casts everywhere. Moreover, it forces callers
to handle errors in the case where, somehow, the value would otherwise overflow
either a 32-bit integer or a `usize` (e.g., on 16-bit targets).
[`NonMaxUsize`] - Represents a `usize` that cannot be `usize::MAX`. As a
result, `Option<NonMaxUsize>` has the same size in memory as a `usize`. This
useful, for example, when representing the offsets of submatches since it
reduces memory usage by a factor of 2. It is a legal optimization since Rust
guarantees that slices never have a length that exceeds `isize::MAX`.
|
27691 |
search.rs |
!
Types and routines that support the search APIs of most regex engines.
This sub-module isn't exposed directly, but rather, its contents are exported
at the crate root due to the universality of most of the types and routines in
this module.
|
73095 |
sparse_set.rs |
!
This module defines a sparse set data structure. Its most interesting
properties are:
They preserve insertion order.
Set membership testing is done in constant time.
Set insertion is done in constant time.
Clearing the set is done in constant time.
The cost for doing this is that the capacity of the set needs to be known up
front, and the elements in the set are limited to state identifiers.
These sets are principally used when traversing an NFA state graph. This
happens at search time, for example, in the PikeVM. It also happens during DFA
determinization.
|
8019 |
start.rs |
!
Provides some helpers for dealing with start state configurations in DFAs.
[`Start`] represents the possible starting configurations, while
[`StartByteMap`] represents a way to retrieve the `Start` configuration for a
given position in a haystack.
|
11337 |
syntax.rs |
!
Utilities for dealing with the syntax of a regular expression.
This module currently only exposes a [`Config`] type that
itself represents a wrapper around the configuration for a
[`regex-syntax::ParserBuilder`](regex_syntax::ParserBuilder). The purpose of
this wrapper is to make configuring syntax options very similar to how other
configuration is done throughout this crate. Namely, instead of duplicating
syntax options across every builder (of which there are many), we instead
create small config objects like this one that can be passed around and
composed.
|
17706 |
unicode_data |
|
|
utf8.rs |
!
Utilities for dealing with UTF-8.
This module provides some UTF-8 related helper routines, including an
incremental decoder.
|
7290 |
wire.rs |
!
Types and routines that support the wire format of finite automata.
Currently, this module just exports a few error types and some small helpers
for deserializing [dense DFAs](crate::dfa::dense::DFA) using correct alignment.
|
36213 |