Name Description Size
alphabet.rs ! This module provides APIs for dealing with the alphabets of finite state machines. There are two principal types in this module, [`ByteClasses`] and [`Unit`]. The former defines the alphabet of a finite state machine while the latter represents an element of that alphabet. To a first approximation, the alphabet of all automata in this crate is just a `u8`. Namely, every distinct byte value. All 256 of them. In practice, this can be quite wasteful when building a transition table for a DFA, since it requires storing a state identifier for each element in the alphabet. Instead, we collapse the alphabet of an automaton down into equivalence classes, where every byte in the same equivalence class never discriminates between a match or a non-match from any other byte in the same class. For example, the regex `[a-z]+` can be viewed as having an alphabet consisting of two equivalence classes: `a-z` and everything else. The transitions of an automaton then don't require representing every distinct byte, just the equivalence classes. The downside of equivalence classes is that, of course, searching a haystack deals with individual byte values. Those byte values need to be mapped to their corresponding equivalence class. This is what `ByteClasses` does. In practice, doing this for every state transition has negligible impact on modern CPUs. Moreover, it helps make more efficient use of the CPU cache by (possibly considerably) shrinking the size of the transition table. One last hiccup concerns `Unit`. Namely, because of look-around and how the DFAs in this crate work, we need to add a sentinel value to our alphabet of equivalence classes that represents the "end" of a search. We call that sentinel [`Unit::eoi`] or "end of input." Thus, a `Unit` is either an equivalence class corresponding to a set of bytes, or it is a special "end of input" sentinel. 
In general, you should not expect to need either of these types unless you're doing lower level shenanigans with DFAs, or even building your own DFAs. (Although, you don't have to use these types to build your own DFAs of course.) For example, if you're walking a DFA's state graph, it's probably useful to make use of [`ByteClasses`] to visit each element in the DFA's alphabet instead of just visiting every distinct `u8` value. The latter isn't necessarily wrong, but it could be potentially very wasteful. 40709
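The equivalence-class idea described for `alphabet.rs` can be sketched in a few lines. This is only an illustration for the regex `[a-z]+`, not the crate's `ByteClasses` API: the names `build_classes`, `ALPHABET_LEN`, `TABLE` and `is_match` are hypothetical, and the hand-written three-state DFA is an assumption made for the example.

```rust
/// Hypothetical byte-to-class map for the regex `[a-z]+`.
/// Class 1 holds `a-z`; class 0 holds every other byte.
fn build_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    for b in b'a'..=b'z' {
        classes[usize::from(b)] = 1;
    }
    classes
}

/// Number of equivalence classes in this sketch (no end-of-input sentinel).
const ALPHABET_LEN: usize = 2;

/// Row-major transition table: `TABLE[state * ALPHABET_LEN + class]`.
/// State 0 is the start state, state 1 is the match state, state 2 is dead.
const TABLE: [usize; 3 * ALPHABET_LEN] = [
    2, 1, // start: other => dead, a-z => match
    2, 1, // match: other => dead, a-z => match
    2, 2, // dead:  stays dead on every class
];

/// Returns true if the entire haystack matches `[a-z]+`.
fn is_match(classes: &[u8; 256], haystack: &[u8]) -> bool {
    let mut state = 0;
    for &byte in haystack {
        // Map the raw byte to its equivalence class before the table lookup.
        let class = usize::from(classes[usize::from(byte)]);
        state = TABLE[state * ALPHABET_LEN + class];
    }
    state == 1
}
```

With two classes the table holds 6 state identifiers instead of the 768 that a full 256-byte alphabet would need for the same three states.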
captures.rs ! Provides types for dealing with capturing groups. Capturing groups refer to sub-patterns of regexes that some regex engines can report matching offsets for. For example, matching `[a-z]([0-9]+)` against `a789` would give `a789` as the overall match (for the implicit capturing group at index `0`) and `789` as the match for the capturing group `([0-9]+)` (an explicit capturing group at index `1`). Not all regex engines can report match offsets for capturing groups. Indeed, to a first approximation, regex engines that can report capturing group offsets tend to be quite a bit slower than regex engines that can't. This is because tracking capturing groups at search time usually requires more "power" that in turn adds overhead. Other regex implementations might call capturing groups "submatches." # Overview The main types in this module are: [`Captures`] records the capturing group offsets found during a search. It provides convenience routines for looking up capturing group offsets by either index or name. [`GroupInfo`] records the mapping between capturing groups and "slots," where the latter are how capturing groups are recorded during a regex search. This also keeps a mapping from capturing group name to index, and capture group index to name. A `GroupInfo` is used by `Captures` internally to provide a convenient API. It is unlikely that you'll use a `GroupInfo` directly, but for example, if you've compiled a Thompson NFA, then you can use [`thompson::NFA::group_info`](crate::nfa::thompson::NFA::group_info) to get its underlying `GroupInfo`. 101364
determinize
empty.rs 13516
escape.rs ! Provides convenience routines for escaping raw bytes. Since this crate tends to deal with `&[u8]` everywhere and the default `Debug` implementation just shows decimal integers, it makes debugging those representations quite difficult. This module provides types that show `&[u8]` as if it were a string, with invalid UTF-8 escaped into its byte-by-byte hex representation. 2996
int.rs ! This module provides several integer oriented traits for converting between both fixed size integers and integers whose size varies based on the target (like `usize`). The driving design principle of this module is to attempt to centralize as many `as` casts as possible here. And in particular, we separate casts into two buckets: Casts that we use for their truncating behavior. In this case, we use more descriptive names, like `low_u32` and `high_u32`. Casts that we use for converting back-and-forth between `usize`. These conversions are generally necessary because we often store indices in different formats to save on memory, which requires converting to and from `usize`. In this case, we very specifically do not want to overflow, and so the methods defined here will panic in debug mode if the `as` cast would be lossy. (A normal `as` cast will never panic!) For `as` casts between raw pointers, we use `cast`, so `as` isn't needed there. For regex engines, floating point is just never used, so we don't have to worry about `as` casts for those. Otherwise, this module pretty much covers all of our `as` needs except for one thing: const contexts. There are a select few places in this crate where we still need to use `as` because const functions on traits aren't stable yet. If we wind up significantly expanding our const footprint in this crate, it might be worth defining free functions to handle those cases. But at the time of writing, that just seemed like too much ceremony. Instead, I comment each such use of `as` in a const context with a "fixme" notice. NOTE: for simplicity, we don't take target pointer width into account here for `usize` conversions. Since we currently only panic in debug mode, skipping the check when it can be proven it isn't needed at compile time doesn't really matter. Now, if we wind up wanting to do as many checks as possible in release mode, then we would want to skip those when we know the conversions are always non-lossy. 
NOTE: this module isn't an exhaustive API. For example, we still use things like `u64::from` where possible, or even `usize::try_from()` for when we do explicitly want to panic or when we want to return an error for overflow. 6518
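The two buckets of casts described for `int.rs` might look roughly like the following. This is a sketch under assumptions: the trait name `U64Ext` and the method `as_usize` are hypothetical (only `low_u32` and `high_u32` are named in the description), and the real module may organize its traits differently.

```rust
/// Hypothetical extension trait that gives truncating and lossless
/// conversions distinct, descriptive names instead of bare `as` casts.
trait U64Ext {
    /// Truncates on purpose: keeps only the low 32 bits.
    fn low_u32(self) -> u32;
    /// Truncates on purpose: keeps only the high 32 bits.
    fn high_u32(self) -> u32;
    /// Expected to be lossless; panics in debug builds if it is not.
    fn as_usize(self) -> usize;
}

impl U64Ext for u64 {
    fn low_u32(self) -> u32 {
        self as u32
    }

    fn high_u32(self) -> u32 {
        (self >> 32) as u32
    }

    fn as_usize(self) -> usize {
        // The check is compiled out in release builds, so the cast costs
        // nothing there but still catches lossy conversions during testing.
        debug_assert!(usize::try_from(self).is_ok(), "u64 overflows usize");
        self as usize
    }
}
```

Keeping every `as` behind a named method means a reader can tell at the call site whether truncation is intended.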
interpolate.rs ! Provides routines for interpolating capture group references. That is, if a replacement string contains references like `$foo` or `${foo1}`, then they are replaced with the corresponding capture values for the groups named `foo` and `foo1`, respectively. Similarly, syntax like `$1` and `${1}` is supported as well, with `1` corresponding to a capture group index and not a name. This module provides the free functions [`string`] and [`bytes`], which interpolate Rust Unicode strings and byte strings, respectively. # Format These routines support two different kinds of capture references: unbraced and braced. For the unbraced format, the format supported is `$ref` where `ref` can be any sequence of characters in the class `[0-9A-Za-z_]`. `ref` is always the longest possible parse. So for example, `$1a` corresponds to the capture group named `1a` and not the capture group at index `1`. If `ref` matches `^[0-9]+$`, then it is treated as a capture group index itself and not a name. For the braced format, the format supported is `${ref}` where `ref` can be any sequence of bytes except for `}`. If no closing brace occurs, then it is not considered a capture reference. As with the unbraced format, if `ref` matches `^[0-9]+$`, then it is treated as a capture group index and not a name. The braced format is useful for exerting precise control over the name of the capture reference. For example, `${1}a` corresponds to the capture group reference `1` followed by the letter `a`, whereas `$1a` (as mentioned above) corresponds to the capture group reference `1a`. The braced format is also useful for expressing capture group names that use characters not supported by the unbraced format. For example, `${foo[bar].baz}` refers to the capture group named `foo[bar].baz`. If a capture group reference is found and it does not refer to a valid capture group, then it will be replaced with the empty string. To write a literal `$`, use `$$`. 
To be clear, and as exhibited via the type signatures in the routines in this module, it is impossible for a replacement string to be invalid. A replacement string may not have the intended semantics, but the interpolation procedure itself can never fail. 17309
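The format rules described for `interpolate.rs` can be illustrated with a simplified, self-contained sketch. It is not the module's [`string`] or [`bytes`] function: the name `interpolate` and its `by_index` / `by_name` parameters are hypothetical stand-ins for a real captures type, and only `&str` input is handled.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified interpolation over `&str`.
fn interpolate(
    replacement: &str,
    by_index: &[&str],
    by_name: &HashMap<&str, &str>,
) -> String {
    // All-digit references are indices, everything else is a name.
    // A reference to an unknown group resolves to the empty string.
    let lookup = |r: &str| -> String {
        if !r.is_empty() && r.bytes().all(|b| b.is_ascii_digit()) {
            r.parse::<usize>()
                .ok()
                .and_then(|i| by_index.get(i))
                .map_or(String::new(), |s| s.to_string())
        } else {
            by_name.get(r).map_or(String::new(), |s| s.to_string())
        }
    };
    let mut out = String::new();
    let mut rest = replacement;
    while let Some(i) = rest.find('$') {
        out.push_str(&rest[..i]);
        rest = &rest[i + 1..];
        if let Some(after) = rest.strip_prefix('$') {
            // `$$` is a literal dollar sign.
            out.push('$');
            rest = after;
        } else if let Some(braced) = rest.strip_prefix('{') {
            match braced.find('}') {
                // `${ref}`: everything up to the closing brace is the reference.
                Some(end) => {
                    out.push_str(&lookup(&braced[..end]));
                    rest = &braced[end + 1..];
                }
                // No closing brace: not a capture reference, keep the `$`.
                None => out.push('$'),
            }
        } else {
            // `$ref`: the longest possible run of `[0-9A-Za-z_]`.
            let end = rest
                .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
                .unwrap_or(rest.len());
            if end == 0 {
                // A lone `$` with no reference after it is kept as-is.
                out.push('$');
            } else {
                out.push_str(&lookup(&rest[..end]));
                rest = &rest[end..];
            }
        }
    }
    out.push_str(rest);
    out
}
```

Note that the function returns a `String` rather than a `Result`, mirroring the point that interpolation itself cannot fail even when a reference does not resolve.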
iter.rs ! Generic helpers for iteration of matches from a regex engine in a haystack. The principal type in this module is a [`Searcher`]. A `Searcher` provides its own lower level iterator-like API in addition to methods for constructing types that implement `Iterator`. The documentation for `Searcher` explains a bit more about why these different APIs exist. Currently, this module supports iteration over any regex engine that works with the [`HalfMatch`], [`Match`] or [`Captures`] types. 37773
lazy.rs ! A lazily initialized value for safe sharing between threads. The principal type in this module is `Lazy`, which makes it easy to construct values that are shared safely across multiple threads simultaneously. 19128
look.rs ! Types and routines for working with look-around assertions. This module principally defines three types: [`Look`] enumerates all of the assertions supported by this crate. [`LookSet`] provides a way to efficiently store a set of [`Look`] values. [`LookMatcher`] provides routines for checking whether a `Look` or a `LookSet` matches at a particular position in a haystack. 65075
memchr.rs ! This module defines simple wrapper routines for the memchr functions from the `memchr` crate. Basically, when the `memchr` crate is available, we use it, otherwise we use a naive implementation which is still pretty fast. 2868
mod.rs ! A collection of modules that provide APIs that are useful across many regex engines. While one should explore the sub-modules directly to get a sense of what's there, here are some highlights that tie the sub-modules to higher level use cases: `alphabet` contains APIs that are useful if you're doing low level things with the DFAs in this crate. For example, implementing determinization or walking its state graph directly. `captures` contains APIs for dealing with capture group matches and their mapping to "slots" used inside an NFA graph. This is also where you can find iterators over capture group names. `escape` contains types for pretty-printing raw byte slices as strings. `iter` contains API helpers for writing regex iterators. `lazy` contains a no-std and no-alloc variant of `lazy_static!` and `once_cell`. `look` contains APIs for matching and configuring look-around assertions. `pool` provides a way to reuse mutable memory allocated in a thread safe manner. `prefilter` provides APIs for building prefilters and using them in searches. `primitives` are what you might use if you're doing lower level work on automata, such as walking an NFA state graph. `syntax` provides some higher level convenience functions for interacting with the `regex-syntax` crate. `wire` is useful if you're working with DFA serialization. 1970
pool.rs 43396
prefilter
primitives.rs ! Lower level primitive types that are useful in a variety of circumstances. # Overview This list represents the principal types in this module and briefly describes when you might want to use them. [`PatternID`] - A type that represents the identifier of a regex pattern. This is probably the most widely used type in this module (which is why it's also re-exported in the crate root). [`StateID`] - A type that represents the identifier of a finite automaton state. This is used for both NFAs and DFAs, with the notable exception of the hybrid NFA/DFA. (The hybrid NFA/DFA uses a special purpose "lazy" state identifier.) [`SmallIndex`] - The internal representation of both a `PatternID` and a `StateID`. Its purpose is to serve as a type that can index memory without being as big as a `usize` on 64-bit targets. The main idea behind this type is that there are many things in regex engines that will, in practice, never overflow a 32-bit integer. (For example, the number of patterns in a regex or the number of states in an NFA.) Thus, a `SmallIndex` can be used to index memory without peppering `as` casts everywhere. Moreover, it forces callers to handle errors in the case where, somehow, the value would otherwise overflow either a 32-bit integer or a `usize` (e.g., on 16-bit targets). [`NonMaxUsize`] - Represents a `usize` that cannot be `usize::MAX`. As a result, `Option<NonMaxUsize>` has the same size in memory as a `usize`. This is useful, for example, when representing the offsets of submatches since it reduces memory usage by a factor of 2. It is a legal optimization since Rust guarantees that slices never have a length that exceeds `isize::MAX`. 27691
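The size claim made for `NonMaxUsize` can be demonstrated with a small stand-in type. The name `NonMax` is hypothetical, and storing `n + 1` inside a `std::num::NonZeroUsize` is one plausible way to get the niche; this sketch does not claim to mirror the crate's own definition.

```rust
use std::num::NonZeroUsize;

/// Hypothetical `usize` wrapper that can never hold `usize::MAX`.
/// It stores `n + 1` in a `NonZeroUsize`, which leaves the all-zero bit
/// pattern free for `Option` to use as its `None` representation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NonMax(NonZeroUsize);

impl NonMax {
    /// Returns `None` only when `n == usize::MAX`, since `n + 1` wraps to 0.
    fn new(n: usize) -> Option<NonMax> {
        NonZeroUsize::new(n.wrapping_add(1)).map(NonMax)
    }

    /// Recovers the original value by undoing the `+ 1` offset.
    fn get(self) -> usize {
        self.0.get() - 1
    }
}
```

Because valid haystack offsets never reach `usize::MAX`, an optional offset stored this way needs no separate discriminant.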
search.rs ! Types and routines that support the search APIs of most regex engines. This sub-module isn't exposed directly, but rather, its contents are exported at the crate root due to the universality of most of the types and routines in this module. 73095
sparse_set.rs ! This module defines a sparse set data structure. Its most interesting properties are: They preserve insertion order. Set membership testing is done in constant time. Set insertion is done in constant time. Clearing the set is done in constant time. The cost for doing this is that the capacity of the set needs to be known up front, and the elements in the set are limited to state identifiers. These sets are principally used when traversing an NFA state graph. This happens at search time, for example, in the PikeVM. It also happens during DFA determinization. 8019
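The properties listed for `sparse_set.rs` follow from the classic dense/sparse array layout, sketched below. The struct and method names here are hypothetical, and plain `usize` ids are used in place of state identifiers to keep the example self-contained.

```rust
/// Hypothetical sparse set over ids in `0..capacity`.
struct SparseSet {
    /// Number of ids currently in the set.
    len: usize,
    /// The first `len` entries are the members, in insertion order.
    dense: Vec<usize>,
    /// `sparse[id]` is the position of `id` in `dense`, if it is a member.
    sparse: Vec<usize>,
}

impl SparseSet {
    /// The capacity must be known up front; both arrays are sized by it.
    fn new(capacity: usize) -> SparseSet {
        SparseSet { len: 0, dense: vec![0; capacity], sparse: vec![0; capacity] }
    }

    /// Constant-time membership test. Panics if `id >= capacity`.
    fn contains(&self, id: usize) -> bool {
        let i = self.sparse[id];
        // Stale entries are rejected because they either point past `len`
        // or point at a slot that now holds a different id.
        i < self.len && self.dense[i] == id
    }

    /// Constant-time insertion. Returns false if `id` was already present.
    fn insert(&mut self, id: usize) -> bool {
        if self.contains(id) {
            return false;
        }
        self.dense[self.len] = id;
        self.sparse[id] = self.len;
        self.len += 1;
        true
    }

    /// Constant-time clear: old entries simply become unreachable.
    fn clear(&mut self) {
        self.len = 0;
    }

    /// Iterates over members in insertion order.
    fn iter(&self) -> impl Iterator<Item = usize> + '_ {
        self.dense[..self.len].iter().copied()
    }
}
```

Clearing only resets `len`, which is why it costs the same regardless of how many ids were inserted.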
start.rs ! Provides some helpers for dealing with start state configurations in DFAs. [`Start`] represents the possible starting configurations, while [`StartByteMap`] represents a way to retrieve the `Start` configuration for a given position in a haystack. 11337
syntax.rs ! Utilities for dealing with the syntax of a regular expression. This module currently only exposes a [`Config`] type that itself represents a wrapper around the configuration for a [`regex-syntax::ParserBuilder`](regex_syntax::ParserBuilder). The purpose of this wrapper is to make configuring syntax options very similar to how other configuration is done throughout this crate. Namely, instead of duplicating syntax options across every builder (of which there are many), we instead create small config objects like this one that can be passed around and composed. 17706
unicode_data
utf8.rs ! Utilities for dealing with UTF-8. This module provides some UTF-8 related helper routines, including an incremental decoder. 7290
wire.rs ! Types and routines that support the wire format of finite automata. Currently, this module just exports a few error types and some small helpers for deserializing [dense DFAs](crate::dfa::dense::DFA) using correct alignment. 36213