onepass.rs - mozsearch

mozilla-central/third_party/rust/regex-automata/src/dfa/onepass.rs (file symbol)

Enable keyboard shortcuts

Source code

File a bug in Firefox Build System :: General

Revision control

Copy as Markdown

Other Tools

/*!

A DFA that can return spans for matching capturing groups.

This module is the home of a [one-pass DFA](DFA).

This module also contains a [`Builder`] and a [`Config`] for building and

configuring a one-pass DFA.

*/

// A note on naming and credit:

//

// As far as I know, Russ Cox came up with the practical vision and

// implementation of a "one-pass regex engine." He mentions and describes it

// briefly in the third article of his regexp article series:

// https://swtch.com/~rsc/regexp/regexp3.html

//

// Cox's implementation is in RE2, and the implementation below is most

// heavily inspired by RE2's. The key thing they have in common is that

// their transitions are defined over an alphabet of bytes. In contrast,

// Go's regex engine also has a one-pass engine, but its transitions are

// more firmly rooted on Unicode codepoints. The ideas are the same, but the

// implementations are different.

//

// RE2 tends to call this a "one-pass NFA." Here, we call it a "one-pass DFA."

// They're both true in their own ways:

//

// * The "one-pass" criterion is generally a property of the NFA itself. In

// particular, it is said that an NFA is one-pass if, after each byte of input

// during a search, there is at most one "VM thread" remaining to take for the

// next byte of input. That is, there is never any ambiguity as to the path to

// take through the NFA during a search.

//

// * On the other hand, once a one-pass NFA has its representation converted

// to something where a constant number of instructions is used for each byte

// of input, the implementation looks a lot more like a DFA. It's technically

// more powerful than a DFA since it has side effects (storing offsets inside

// of slots activated by a transition), but it is far closer to a DFA than an

// NFA simulation.

//

// Thus, in this crate, we call it a one-pass DFA.

use alloc::{vec, vec::Vec};

use crate::{

    dfa::{remapper::Remapper, DEAD},

    nfa::thompson::{self, NFA},

    util::{

        alphabet::ByteClasses,

        captures::Captures,

        escape::DebugByte,

        int::{Usize, U32, U64, U8},

        look::{Look, LookSet, UnicodeWordBoundaryError},

        primitives::{NonMaxUsize, PatternID, StateID},

        search::{Anchored, Input, Match, MatchError, MatchKind, Span},

        sparse_set::SparseSet,

},

};

/// The configuration used for building a [one-pass DFA](DFA).

///

/// A one-pass DFA configuration is a simple data object that is typically used

/// with [`Builder::configure`]. It can be cheaply cloned.

///

/// A default configuration can be created either with `Config::new`, or

/// perhaps more conveniently, with [`DFA::config`].

#[derive(Clone, Debug, Default)]

pub struct Config {

    match_kind: Option<MatchKind>,

    starts_for_each_pattern: Option<bool>,

    byte_classes: Option<bool>,

    size_limit: Option<Option<usize>>,

impl Config {

    /// Return a new default one-pass DFA configuration.

    pub fn new() -> Config {

        Config::default()

    /// Set the desired match semantics.

///

    /// The default is [`MatchKind::LeftmostFirst`], which corresponds to the

    /// match semantics of Perl-like regex engines. That is, when multiple

    /// patterns would match at the same leftmost position, the pattern that

    /// appears first in the concrete syntax is chosen.

///

    /// Currently, the only other kind of match semantics supported is

    /// [`MatchKind::All`]. This corresponds to "classical DFA" construction

    /// where all possible matches are visited.

///

    /// When it comes to the one-pass DFA, it is rarer for preference order and

    /// "longest match" to actually disagree. Since if they did disagree, then

    /// the regex typically isn't one-pass. For example, searching `Samwise`

    /// for `Sam|Samwise` will report `Sam` for leftmost-first matching and

    /// `Samwise` for "longest match" or "all" matching. However, this regex is

    /// not one-pass if taken literally. The equivalent regex, `Sam(?:|wise)`

    /// is one-pass and `Sam|Samwise` may be optimized to it.

///

    /// The other main difference is that "all" match semantics don't support

    /// non-greedy matches. "All" match semantics always try to match as much

    /// as possible.

    pub fn match_kind(mut self, kind: MatchKind) -> Config {

        self.match_kind = Some(kind);

        self

    /// Whether to compile a separate start state for each pattern in the

    /// one-pass DFA.

///

    /// When enabled, a separate **anchored** start state is added for each

    /// pattern in the DFA. When this start state is used, then the DFA will

    /// only search for matches for the pattern specified, even if there are

    /// other patterns in the DFA.

///

    /// The main downside of this option is that it can potentially increase

    /// the size of the DFA and/or increase the time it takes to build the DFA.

///

    /// You might want to enable this option when you want to both search for

    /// anchored matches of any pattern or to search for anchored matches of

    /// one particular pattern while using the same DFA. (Otherwise, you would

    /// need to compile a new DFA for each pattern.)

///

    /// By default this is disabled.

///

    /// # Example

///

    /// This example shows how to build a multi-regex and then search for

    /// matches for a any of the patterns or matches for a specific pattern.

///

    /// ```

    /// use regex_automata::{

    ///     dfa::onepass::DFA, Anchored, Input, Match, PatternID,

    /// };

///

    /// let re = DFA::builder()

    ///     .configure(DFA::config().starts_for_each_pattern(true))

    ///     .build_many(&["[a-z]+", "[0-9]+"])?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

    /// let haystack = "123abc";

    /// let input = Input::new(haystack).anchored(Anchored::Yes);

///

    /// // A normal multi-pattern search will show pattern 1 matches.

    /// re.try_search(&mut cache, &input, &mut caps)?;

    /// assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());

///

    /// // If we only want to report pattern 0 matches, then we'll get no

    /// // match here.

    /// let input = input.anchored(Anchored::Pattern(PatternID::must(0)));

    /// re.try_search(&mut cache, &input, &mut caps)?;

    /// assert_eq!(None, caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    pub fn starts_for_each_pattern(mut self, yes: bool) -> Config {

        self.starts_for_each_pattern = Some(yes);

        self

    /// Whether to attempt to shrink the size of the DFA's alphabet or not.

///

    /// This option is enabled by default and should never be disabled unless

    /// one is debugging a one-pass DFA.

///

    /// When enabled, the DFA will use a map from all possible bytes to their

    /// corresponding equivalence class. Each equivalence class represents a

    /// set of bytes that does not discriminate between a match and a non-match

    /// in the DFA. For example, the pattern `[ab]+` has at least two

    /// equivalence classes: a set containing `a` and `b` and a set containing

    /// every byte except for `a` and `b`. `a` and `b` are in the same

    /// equivalence class because they never discriminate between a match and a

    /// non-match.

///

    /// The advantage of this map is that the size of the transition table

    /// can be reduced drastically from (approximately) `#states * 256 *

    /// sizeof(StateID)` to `#states * k * sizeof(StateID)` where `k` is the

    /// number of equivalence classes (rounded up to the nearest power of 2).

    /// As a result, total space usage can decrease substantially. Moreover,

    /// since a smaller alphabet is used, DFA compilation becomes faster as

    /// well.

///

    /// **WARNING:** This is only useful for debugging DFAs. Disabling this

    /// does not yield any speed advantages. Namely, even when this is

    /// disabled, a byte class map is still used while searching. The only

    /// difference is that every byte will be forced into its own distinct

    /// equivalence class. This is useful for debugging the actual generated

    /// transitions because it lets one see the transitions defined on actual

    /// bytes instead of the equivalence classes.

    pub fn byte_classes(mut self, yes: bool) -> Config {

        self.byte_classes = Some(yes);

        self

    /// Set a size limit on the total heap used by a one-pass DFA.

///

    /// This size limit is expressed in bytes and is applied during

    /// construction of a one-pass DFA. If the DFA's heap usage exceeds

    /// this configured limit, then construction is stopped and an error is

    /// returned.

///

    /// The default is no limit.

///

    /// # Example

///

    /// This example shows a one-pass DFA that fails to build because of

    /// a configured size limit. This particular example also serves as a

    /// cautionary tale demonstrating just how big DFAs with large Unicode

    /// character classes can get.

///

    /// ```

    /// # if cfg!(miri) { return Ok(()); } // miri takes too long

    /// use regex_automata::{dfa::onepass::DFA, Match};

///

    /// // 6MB isn't enough!

    /// DFA::builder()

    ///     .configure(DFA::config().size_limit(Some(6_000_000)))

    ///     .build(r"\w{20}")

    ///     .unwrap_err();

///

    /// // ... but 7MB probably is!

    /// // (Note that DFA sizes aren't necessarily stable between releases.)

    /// let re = DFA::builder()

    ///     .configure(DFA::config().size_limit(Some(7_000_000)))

    ///     .build(r"\w{20}")?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

    /// let haystack = "A".repeat(20);

    /// re.captures(&mut cache, &haystack, &mut caps);

    /// assert_eq!(Some(Match::must(0, 0..20)), caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

///

    /// While one needs a little more than 3MB to represent `\w{20}`, it

    /// turns out that you only need a little more than 4KB to represent

    /// `(?-u:\w{20})`. So only use Unicode if you need it!

    pub fn size_limit(mut self, limit: Option<usize>) -> Config {

        self.size_limit = Some(limit);

        self

    /// Returns the match semantics set in this configuration.

    pub fn get_match_kind(&self) -> MatchKind {

        self.match_kind.unwrap_or(MatchKind::LeftmostFirst)

    /// Returns whether this configuration has enabled anchored starting states

    /// for every pattern in the DFA.

    pub fn get_starts_for_each_pattern(&self) -> bool {

        self.starts_for_each_pattern.unwrap_or(false)

    /// Returns whether this configuration has enabled byte classes or not.

    /// This is typically a debugging oriented option, as disabling it confers

    /// no speed benefit.

    pub fn get_byte_classes(&self) -> bool {

        self.byte_classes.unwrap_or(true)

    /// Returns the DFA size limit of this configuration if one was set.

    /// The size limit is total number of bytes on the heap that a DFA is

    /// permitted to use. If the DFA exceeds this limit during construction,

    /// then construction is stopped and an error is returned.

    pub fn get_size_limit(&self) -> Option<usize> {

        self.size_limit.unwrap_or(None)

    /// Overwrite the default configuration such that the options in `o` are

    /// always used. If an option in `o` is not set, then the corresponding

    /// option in `self` is used. If it's not set in `self` either, then it

    /// remains not set.

    pub(crate) fn overwrite(&self, o: Config) -> Config {

        Config {

            match_kind: o.match_kind.or(self.match_kind),

            starts_for_each_pattern: o

                .starts_for_each_pattern

                .or(self.starts_for_each_pattern),

            byte_classes: o.byte_classes.or(self.byte_classes),

            size_limit: o.size_limit.or(self.size_limit),

/// A builder for a [one-pass DFA](DFA).

///

/// This builder permits configuring options for the syntax of a pattern, the

/// NFA construction and the DFA construction. This builder is different from a

/// general purpose regex builder in that it permits fine grain configuration

/// of the construction process. The trade off for this is complexity, and

/// the possibility of setting a configuration that might not make sense. For

/// example, there are two different UTF-8 modes:

///

/// * [`syntax::Config::utf8`](crate::util::syntax::Config::utf8) controls

/// whether the pattern itself can contain sub-expressions that match invalid

/// UTF-8.

/// * [`thompson::Config::utf8`] controls whether empty matches that split a

/// Unicode codepoint are reported or not.

///

/// Generally speaking, callers will want to either enable all of these or

/// disable all of these.

///

/// # Example

///

/// This example shows how to disable UTF-8 mode in the syntax and the NFA.

/// This is generally what you want for matching on arbitrary bytes.

///

/// ```

/// # if cfg!(miri) { return Ok(()); } // miri takes too long

/// use regex_automata::{

///     dfa::onepass::DFA,

///     nfa::thompson,

///     util::syntax,

///     Match,

/// };

///

/// let re = DFA::builder()

///     .syntax(syntax::Config::new().utf8(false))

///     .thompson(thompson::Config::new().utf8(false))

///     .build(r"foo(?-u:[^b])ar.*")?;

/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

///

/// let haystack = b"foo\xFFarzz\xE2\x98\xFF\n";

/// re.captures(&mut cache, haystack, &mut caps);

/// // Notice that `(?-u:[^b])` matches invalid UTF-8,

/// // but the subsequent `.*` does not! Disabling UTF-8

/// // on the syntax permits this.

/// //

/// // N.B. This example does not show the impact of

/// // disabling UTF-8 mode on a one-pass DFA Config,

/// //  since that only impacts regexes that can

/// // produce matches of length 0.

/// assert_eq!(Some(Match::must(0, 0..8)), caps.get_match());

///

/// # Ok::<(), Box<dyn std::error::Error>>(())

/// ```

#[derive(Clone, Debug)]

pub struct Builder {

    config: Config,

    #[cfg(feature = "syntax")]

    thompson: thompson::Compiler,

impl Builder {

    /// Create a new one-pass DFA builder with the default configuration.

    pub fn new() -> Builder {

        Builder {

            config: Config::default(),

            #[cfg(feature = "syntax")]

            thompson: thompson::Compiler::new(),

    /// Build a one-pass DFA from the given pattern.

///

    /// If there was a problem parsing or compiling the pattern, then an error

    /// is returned.

    #[cfg(feature = "syntax")]

    pub fn build(&self, pattern: &str) -> Result<DFA, BuildError> {

        self.build_many(&[pattern])

    /// Build a one-pass DFA from the given patterns.

///

    /// When matches are returned, the pattern ID corresponds to the index of

    /// the pattern in the slice given.

    #[cfg(feature = "syntax")]

    pub fn build_many<P: AsRef<str>>(

        &self,

        patterns: &[P],

    ) -> Result<DFA, BuildError> {

        let nfa =

            self.thompson.build_many(patterns).map_err(BuildError::nfa)?;

        self.build_from_nfa(nfa)

    /// Build a DFA from the given NFA.

///

    /// # Example

///

    /// This example shows how to build a DFA if you already have an NFA in

    /// hand.

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, nfa::thompson::NFA, Match};

///

    /// // This shows how to set non-default options for building an NFA.

    /// let nfa = NFA::compiler()

    ///     .configure(NFA::config().shrink(true))

    ///     .build(r"[a-z0-9]+")?;

    /// let re = DFA::builder().build_from_nfa(nfa)?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

    /// re.captures(&mut cache, "foo123bar", &mut caps);

    /// assert_eq!(Some(Match::must(0, 0..9)), caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    pub fn build_from_nfa(&self, nfa: NFA) -> Result<DFA, BuildError> {

        // Why take ownership if we're just going to pass a reference to the

        // NFA to our internal builder? Well, the first thing to note is that

        // an NFA uses reference counting internally, so either choice is going

        // to be cheap. So there isn't much cost either way.

//

        // The real reason is that a one-pass DFA, semantically, shares

        // ownership of an NFA. This is unlike other DFAs that don't share

        // ownership of an NFA at all, primarily because they want to be

        // self-contained in order to support cheap (de)serialization.

//

        // But then why pass a '&nfa' below if we want to share ownership?

        // Well, it turns out that using a '&NFA' in our internal builder

        // separates its lifetime from the DFA we're building, and this turns

        // out to make code a bit more composable. e.g., We can iterate over

        // things inside the NFA while borrowing the builder as mutable because

        // we know the NFA cannot be mutated. So TL;DR --- this weirdness is

        // "because borrow checker."

        InternalBuilder::new(self.config.clone(), &nfa).build()

    /// Apply the given one-pass DFA configuration options to this builder.

    pub fn configure(&mut self, config: Config) -> &mut Builder {

        self.config = self.config.overwrite(config);

        self

    /// Set the syntax configuration for this builder using

    /// [`syntax::Config`](crate::util::syntax::Config).

///

    /// This permits setting things like case insensitivity, Unicode and multi

    /// line mode.

///

    /// These settings only apply when constructing a one-pass DFA directly

    /// from a pattern.

    #[cfg(feature = "syntax")]

    pub fn syntax(

        &mut self,

        config: crate::util::syntax::Config,

    ) -> &mut Builder {

        self.thompson.syntax(config);

        self

    /// Set the Thompson NFA configuration for this builder using

    /// [`nfa::thompson::Config`](crate::nfa::thompson::Config).

///

    /// This permits setting things like whether additional time should be

    /// spent shrinking the size of the NFA.

///

    /// These settings only apply when constructing a DFA directly from a

    /// pattern.

    #[cfg(feature = "syntax")]

    pub fn thompson(&mut self, config: thompson::Config) -> &mut Builder {

        self.thompson.configure(config);

        self

/// An internal builder for encapsulating the state necessary to build a

/// one-pass DFA. Typical use is just `InternalBuilder::new(..).build()`.

///

/// There is no separate pass for determining whether the NFA is one-pass or

/// not. We just try to build the DFA. If during construction we discover that

/// it is not one-pass, we bail out. This is likely to lead to some undesirable

/// expense in some cases, so it might make sense to try an identify common

/// patterns in the NFA that make it definitively not one-pass. That way, we

/// can avoid ever trying to build a one-pass DFA in the first place. For

/// example, '\w*\s' is not one-pass, and since '\w' is Unicode-aware by

/// default, it's probably not a trivial cost to try and build a one-pass DFA

/// for it and then fail.

///

/// Note that some (immutable) fields are duplicated here. For example, the

/// 'nfa' and 'classes' fields are both in the 'DFA'. They are the same thing,

/// but we duplicate them because it makes composition easier below. Otherwise,

/// since the borrow checker can't see through method calls, the mutable borrow

/// we use to mutate the DFA winds up preventing borrowing from any other part

/// of the DFA, even though we aren't mutating those parts. We only do this

/// because the duplication is cheap.

#[derive(Debug)]

struct InternalBuilder<'a> {

    /// The DFA we're building.

    dfa: DFA,

    /// An unordered collection of NFA state IDs that we haven't yet tried to

    /// build into a DFA state yet.

///

    /// This collection does not ultimately wind up including every NFA state

    /// ID. Instead, each ID represents a "start" state for a sub-graph of the

    /// NFA. The set of NFA states we then use to build a DFA state consists

    /// of that "start" state and all states reachable from it via epsilon

    /// transitions.

    uncompiled_nfa_ids: Vec<StateID>,

    /// A map from NFA state ID to DFA state ID. This is useful for easily

    /// determining whether an NFA state has been used as a "starting" point

    /// to build a DFA state yet. If it hasn't, then it is mapped to DEAD,

    /// and since DEAD is specially added and never corresponds to any NFA

    /// state, it follows that a mapping to DEAD implies the NFA state has

    /// no corresponding DFA state yet.

    nfa_to_dfa_id: Vec<StateID>,

    /// A stack used to traverse the NFA states that make up a single DFA

    /// state. Traversal occurs until the stack is empty, and we only push to

    /// the stack when the state ID isn't in 'seen'. Actually, even more than

    /// that, if we try to push something on to this stack that is already in

    /// 'seen', then we bail out on construction completely, since it implies

    /// that the NFA is not one-pass.

    stack: Vec<(StateID, Epsilons)>,

    /// The set of NFA states that we've visited via 'stack'.

    seen: SparseSet,

    /// Whether a match NFA state has been observed while constructing a

    /// one-pass DFA state. Once a match state is seen, assuming we are using

    /// leftmost-first match semantics, then we don't add any more transitions

    /// to the DFA state we're building.

    matched: bool,

    /// The config passed to the builder.

///

    /// This is duplicated in dfa.config.

    config: Config,

    /// The NFA we're building a one-pass DFA from.

///

    /// This is duplicated in dfa.nfa.

    nfa: &'a NFA,

    /// The equivalence classes that make up the alphabet for this DFA>

///

    /// This is duplicated in dfa.classes.

    classes: ByteClasses,

impl<'a> InternalBuilder<'a> {

    /// Create a new builder with an initial empty DFA.

    fn new(config: Config, nfa: &'a NFA) -> InternalBuilder {

        let classes = if !config.get_byte_classes() {

            // A one-pass DFA will always use the equivalence class map, but

            // enabling this option is useful for debugging. Namely, this will

            // cause all transitions to be defined over their actual bytes

            // instead of an opaque equivalence class identifier. The former is

            // much easier to grok as a human.

            ByteClasses::singletons()

        } else {

            nfa.byte_classes().clone()

};

        // Normally a DFA alphabet includes the EOI symbol, but we don't need

        // that in the one-pass DFA since we handle look-around explicitly

        // without encoding it into the DFA. Thus, we don't need to delay

        // matches by 1 byte. However, we reuse the space that *would* be used

        // by the EOI transition by putting match information there (like which

        // pattern matches and which look-around assertions need to hold). So

        // this means our real alphabet length is 1 fewer than what the byte

        // classes report, since we don't use EOI.

        let alphabet_len = classes.alphabet_len().checked_sub(1).unwrap();

        let stride2 = classes.stride2();

        let dfa = DFA {

            config: config.clone(),

            nfa: nfa.clone(),

            table: vec![],

            starts: vec![],

            // Since one-pass DFAs have a smaller state ID max than

            // StateID::MAX, it follows that StateID::MAX is a valid initial

            // value for min_match_id since no state ID can ever be greater

            // than it. In the case of a one-pass DFA with no match states, the

            // min_match_id will keep this sentinel value.

            min_match_id: StateID::MAX,

            classes: classes.clone(),

            alphabet_len,

            stride2,

            pateps_offset: alphabet_len,

            // OK because PatternID::MAX*2 is guaranteed not to overflow.

            explicit_slot_start: nfa.pattern_len().checked_mul(2).unwrap(),

};

        InternalBuilder {

            dfa,

            uncompiled_nfa_ids: vec![],

            nfa_to_dfa_id: vec![DEAD; nfa.states().len()],

            stack: vec![],

            seen: SparseSet::new(nfa.states().len()),

            matched: false,

            config,

            nfa,

            classes,

    /// Build the DFA from the NFA given to this builder. If the NFA is not

    /// one-pass, then return an error. An error may also be returned if a

    /// particular limit is exceeded. (Some limits, like the total heap memory

    /// used, are configurable. Others, like the total patterns or slots, are

    /// hard-coded based on representational limitations.)

    fn build(mut self) -> Result<DFA, BuildError> {

        self.nfa.look_set_any().available().map_err(BuildError::word)?;

        for look in self.nfa.look_set_any().iter() {

            // This is a future incompatibility check where if we add any

            // more look-around assertions, then the one-pass DFA either

            // needs to reject them (what we do here) or it needs to have its

            // Transition representation modified to be capable of storing the

            // new assertions.

            if look.as_repr() > Look::WordUnicodeNegate.as_repr() {

                return Err(BuildError::unsupported_look(look));

        if self.nfa.pattern_len().as_u64() > PatternEpsilons::PATTERN_ID_LIMIT

            return Err(BuildError::too_many_patterns(

                PatternEpsilons::PATTERN_ID_LIMIT,

));

        if self.nfa.group_info().explicit_slot_len() > Slots::LIMIT {

            return Err(BuildError::not_one_pass(

                "too many explicit capturing groups (max is 16)",

));

        assert_eq!(DEAD, self.add_empty_state()?);

        // This is where the explicit slots start. We care about this because

        // we only need to track explicit slots. The implicit slots---two for

        // each pattern---are tracked as part of the search routine itself.

        let explicit_slot_start = self.nfa.pattern_len() * 2;

        self.add_start_state(None, self.nfa.start_anchored())?;

        if self.config.get_starts_for_each_pattern() {

            for pid in self.nfa.patterns() {

                self.add_start_state(

                    Some(pid),

                    self.nfa.start_pattern(pid).unwrap(),

)?;

        // NOTE: One wonders what the effects of treating 'uncompiled_nfa_ids'

        // as a stack are. It is really an unordered *set* of NFA state IDs.

        // If it, for example, in practice led to discovering whether a regex

        // was or wasn't one-pass later than if we processed NFA state IDs in

        // ascending order, then that would make this routine more costly in

        // the somewhat common case of a regex that isn't one-pass.

        while let Some(nfa_id) = self.uncompiled_nfa_ids.pop() {

            let dfa_id = self.nfa_to_dfa_id[nfa_id];

            // Once we see a match, we keep going, but don't add any new

            // transitions. Normally we'd just stop, but we have to keep

            // going in order to verify that our regex is actually one-pass.

            self.matched = false;

            // The NFA states we've already explored for this DFA state.

            self.seen.clear();

            // The NFA states to explore via epsilon transitions. If we ever

            // try to push an NFA state that we've already seen, then the NFA

            // is not one-pass because it implies there are multiple epsilon

            // transition paths that lead to the same NFA state. In other

            // words, there is ambiguity.

            self.stack_push(nfa_id, Epsilons::empty())?;

            while let Some((id, epsilons)) = self.stack.pop() {

                match *self.nfa.state(id) {

                    thompson::State::ByteRange { ref trans } => {

                        self.compile_transition(dfa_id, trans, epsilons)?;

                    thompson::State::Sparse(ref sparse) => {

                        for trans in sparse.transitions.iter() {

                            self.compile_transition(dfa_id, trans, epsilons)?;

                    thompson::State::Dense(ref dense) => {

                        for trans in dense.iter() {

                            self.compile_transition(dfa_id, &trans, epsilons)?;

                    thompson::State::Look { look, next } => {

                        let looks = epsilons.looks().insert(look);

                        self.stack_push(next, epsilons.set_looks(looks))?;

                    thompson::State::Union { ref alternates } => {

                        for &sid in alternates.iter().rev() {

                            self.stack_push(sid, epsilons)?;

                    thompson::State::BinaryUnion { alt1, alt2 } => {

                        self.stack_push(alt2, epsilons)?;

                        self.stack_push(alt1, epsilons)?;

                    thompson::State::Capture { next, slot, .. } => {

                        let slot = slot.as_usize();

                        let epsilons = if slot < explicit_slot_start {

                            // If this is an implicit slot, we don't care

                            // about it, since we handle implicit slots in

                            // the search routine. We can get away with that

                            // because there are 2 implicit slots for every

                            // pattern.

                            epsilons

                        } else {

                            // Offset our explicit slots so that they start

                            // at index 0.

                            let offset = slot - explicit_slot_start;

                            epsilons.set_slots(epsilons.slots().insert(offset))

};

                        self.stack_push(next, epsilons)?;

                    thompson::State::Fail => {

                        continue;

                    thompson::State::Match { pattern_id } => {

                        // If we found two different paths to a match state

                        // for the same DFA state, then we have ambiguity.

                        // Thus, it's not one-pass.

                        if self.matched {

                            return Err(BuildError::not_one_pass(

                                "multiple epsilon transitions to match state",

));

                        self.matched = true;

                        // Shove the matching pattern ID and the 'epsilons'

                        // into the current DFA state's pattern epsilons. The

                        // 'epsilons' includes the slots we need to capture

                        // before reporting the match and also the conditional

                        // epsilon transitions we need to check before we can

                        // report a match.

                        self.dfa.set_pattern_epsilons(

                            dfa_id,

                            PatternEpsilons::empty()

                                .set_pattern_id(pattern_id)

                                .set_epsilons(epsilons),

);

                        // N.B. It is tempting to just bail out here when

                        // compiling a leftmost-first DFA, since we will never

                        // compile any more transitions in that case. But we

                        // actually need to keep going in order to verify that

                        // we actually have a one-pass regex. e.g., We might

                        // see more Match states (e.g., for other patterns)

                        // that imply that we don't have a one-pass regex.

                        // So instead, we mark that we've found a match and

                        // continue on. When we go to compile a new DFA state,

                        // we just skip that part. But otherwise check that the

                        // one-pass property is upheld.

        self.shuffle_states();

        Ok(self.dfa)

    /// Shuffle all match states to the end of the transition table and set

    /// 'min_match_id' to the ID of the first such match state.

///

    /// The point of this is to make it extremely cheap to determine whether

    /// a state is a match state or not. We need to check on this on every

    /// transition during a search, so it being cheap is important. This

    /// permits us to check it by simply comparing two state identifiers, as

    /// opposed to looking for the pattern ID in the state's `PatternEpsilons`.

    /// (Which requires a memory load and some light arithmetic.)

    fn shuffle_states(&mut self) {

        let mut remapper = Remapper::new(&self.dfa);

        let mut next_dest = self.dfa.last_state_id();

        for i in (0..self.dfa.state_len()).rev() {

            let id = StateID::must(i);

            let is_match =

                self.dfa.pattern_epsilons(id).pattern_id().is_some();

            if !is_match {

                continue;

            remapper.swap(&mut self.dfa, next_dest, id);

            self.dfa.min_match_id = next_dest;

            next_dest = self.dfa.prev_state_id(next_dest).expect(

                "match states should be a proper subset of all states",

);

        remapper.remap(&mut self.dfa);

    /// Compile the given NFA transition into the DFA state given.

///

    /// 'Epsilons' corresponds to any conditional epsilon transitions that need

    /// to be satisfied to follow this transition, and any slots that need to

    /// be saved if the transition is followed.

///

    /// If this transition indicates that the NFA is not one-pass, then

    /// this returns an error. (This occurs, for example, if the DFA state

    /// already has a transition defined for the same input symbols as the

    /// given transition, *and* the result of the old and new transitions is

    /// different.)

    fn compile_transition(

        &mut self,

        dfa_id: StateID,

        trans: &thompson::Transition,

        epsilons: Epsilons,

    ) -> Result<(), BuildError> {

        let next_dfa_id = self.add_dfa_state_for_nfa_state(trans.next)?;

        for byte in self

            .classes

            .representatives(trans.start..=trans.end)

            .filter_map(|r| r.as_u8())

            let oldtrans = self.dfa.transition(dfa_id, byte);

            let newtrans =

                Transition::new(self.matched, next_dfa_id, epsilons);

            // If the old transition points to the DEAD state, then we know

            // 'byte' has not been mapped to any transition for this DFA state

            // yet. So set it unconditionally. Otherwise, we require that the

            // old and new transitions are equivalent. Otherwise, there is

            // ambiguity and thus the regex is not one-pass.

            if oldtrans.state_id() == DEAD {

                self.dfa.set_transition(dfa_id, byte, newtrans);

            } else if oldtrans != newtrans {

                return Err(BuildError::not_one_pass(

                    "conflicting transition",

));

        Ok(())

    /// Add a start state to the DFA corresponding to the given NFA starting

    /// state ID.

///

    /// If adding a state would blow any limits (configured or hard-coded),

    /// then an error is returned.

///

    /// If the starting state is an anchored state for a particular pattern,

    /// then callers must provide the pattern ID for that starting state.

    /// Callers must also ensure that the first starting state added is the

    /// start state for all patterns, and then each anchored starting state for

    /// each pattern (if necessary) added in order. Otherwise, this panics.

    fn add_start_state(

        &mut self,

        pid: Option<PatternID>,

        nfa_id: StateID,

    ) -> Result<StateID, BuildError> {

        match pid {

            // With no pid, this should be the start state for all patterns

            // and thus be the first one.

            None => assert!(self.dfa.starts.is_empty()),

            // With a pid, we want it to be at self.dfa.starts[pid+1].

            Some(pid) => assert!(self.dfa.starts.len() == pid.one_more()),

        let dfa_id = self.add_dfa_state_for_nfa_state(nfa_id)?;

        self.dfa.starts.push(dfa_id);

        Ok(dfa_id)

    /// Add a new DFA state corresponding to the given NFA state. If adding a

    /// state would blow any limits (configured or hard-coded), then an error

    /// is returned. If a DFA state already exists for the given NFA state,

    /// then that DFA state's ID is returned and no new states are added.

///

    /// It is not expected that this routine is called for every NFA state.

    /// Instead, an NFA state ID will usually correspond to the "start" state

    /// for a sub-graph of the NFA, where all states in the sub-graph are

    /// reachable via epsilon transitions (conditional or unconditional). That

    /// sub-graph of NFA states is ultimately what produces a single DFA state.

    fn add_dfa_state_for_nfa_state(

        &mut self,

        nfa_id: StateID,

    ) -> Result<StateID, BuildError> {

        // If we've already built a DFA state for the given NFA state, then

        // just return that. We definitely do not want to have more than one

        // DFA state in existence for the same NFA state, since all but one of

        // them will likely become unreachable. And at least some of them are

        // likely to wind up being incomplete.

        let existing_dfa_id = self.nfa_to_dfa_id[nfa_id];

        if existing_dfa_id != DEAD {

            return Ok(existing_dfa_id);

        // If we don't have any DFA state yet, add it and then add the given

        // NFA state to the list of states to explore.

        let dfa_id = self.add_empty_state()?;

        self.nfa_to_dfa_id[nfa_id] = dfa_id;

        self.uncompiled_nfa_ids.push(nfa_id);

        Ok(dfa_id)

    /// Unconditionally add a new empty DFA state. If adding it would exceed

    /// any limits (configured or hard-coded), then an error is returned. The

    /// ID of the new state is returned on success.

///

    /// The added state is *not* a match state.

    fn add_empty_state(&mut self) -> Result<StateID, BuildError> {

        let state_limit = Transition::STATE_ID_LIMIT;

        // Note that unlike dense and lazy DFAs, we specifically do NOT

        // premultiply our state IDs here. The reason is that we want to pack

        // our state IDs into 64-bit transitions with other info, so the fewer

        // the bits we use for state IDs the better. If we premultiply, then

        // our state ID space shrinks. We justify this by the assumption that

        // a one-pass DFA is just already doing a fair bit more work than a

        // normal DFA anyway, so an extra multiplication to compute a state

        // transition doesn't seem like a huge deal.

        let next_id = self.dfa.table.len() >> self.dfa.stride2();

        let id = StateID::new(next_id)

            .map_err(|_| BuildError::too_many_states(state_limit))?;

        if id.as_u64() > Transition::STATE_ID_LIMIT {

            return Err(BuildError::too_many_states(state_limit));

        self.dfa

            .table

            .extend(core::iter::repeat(Transition(0)).take(self.dfa.stride()));

        // The default empty value for 'PatternEpsilons' is sadly not all

        // zeroes. Instead, a special sentinel is used to indicate that there

        // is no pattern. So we need to explicitly set the pattern epsilons to

        // the correct "empty" PatternEpsilons.

        self.dfa.set_pattern_epsilons(id, PatternEpsilons::empty());

        if let Some(size_limit) = self.config.get_size_limit() {

            if self.dfa.memory_usage() > size_limit {

                return Err(BuildError::exceeded_size_limit(size_limit));

        Ok(id)

    /// Push the given NFA state ID and its corresponding epsilons (slots and

    /// conditional epsilon transitions) on to a stack for use in a depth first

    /// traversal of a sub-graph of the NFA.

///

    /// If the given NFA state ID has already been pushed on to the stack, then

    /// it indicates the regex is not one-pass and this correspondingly returns

    /// an error.

    fn stack_push(

        &mut self,

        nfa_id: StateID,

        epsilons: Epsilons,

    ) -> Result<(), BuildError> {

        // If we already have seen a match and we are compiling a leftmost

        // first DFA, then we shouldn't add any more states to look at. This is

        // effectively how preference order and non-greediness is implemented.

        // if !self.config.get_match_kind().continue_past_first_match()

        // && self.matched

        // {

        // return Ok(());

        // }

        if !self.seen.insert(nfa_id) {

            return Err(BuildError::not_one_pass(

                "multiple epsilon transitions to same state",

));

        self.stack.push((nfa_id, epsilons));

        Ok(())

/// A one-pass DFA for executing a subset of anchored regex searches while

/// resolving capturing groups.

///

/// A one-pass DFA can be built from an NFA that is one-pass. An NFA is

/// one-pass when there is never any ambiguity about how to continue a search.

/// For example, `a*a` is not one-pass becuase during a search, it's not

/// possible to know whether to continue matching the `a*` or to move on to

/// the single `a`. However, `a*b` is one-pass, because for every byte in the

/// input, it's always clear when to move on from `a*` to `b`.

///

/// # Only anchored searches are supported

///

/// In this crate, especially for DFAs, unanchored searches are implemented by

/// treating the pattern as if it had a `(?s-u:.)*?` prefix. While the prefix

/// is one-pass on its own, adding anything after it, e.g., `(?s-u:.)*?a` will

/// make the overall pattern not one-pass. Why? Because the `(?s-u:.)` matches

/// any byte, and there is therefore ambiguity as to when the prefix should

/// stop matching and something else should start matching.

///

/// Therefore, one-pass DFAs do not support unanchored searches. In addition

/// to many regexes simply not being one-pass, it implies that one-pass DFAs

/// have limited utility. With that said, when a one-pass DFA can be used, it

/// can potentially provide a dramatic speed up over alternatives like the

/// [`BoundedBacktracker`](crate::nfa::thompson::backtrack::BoundedBacktracker)

/// and the [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM). In particular,

/// a one-pass DFA is the only DFA capable of reporting the spans of matching

/// capturing groups.

///

/// To clarify, when we say that unanchored searches are not supported, what

/// that actually means is:

///

/// * The high level routines, [`DFA::is_match`] and [`DFA::captures`], always

/// do anchored searches.

/// * Since iterators are most useful in the context of unanchored searches,

/// there is no `DFA::captures_iter` method.

/// * For lower level routines like [`DFA::try_search`], an error will be

/// returned if the given [`Input`] is configured to do an unanchored search or

/// search for an invalid pattern ID. (Note that an [`Input`] is configured to

/// do an unanchored search by default, so just giving a `Input::new` is

/// guaranteed to return an error.)

///

/// # Other limitations

///

/// In addition to the [configurable heap limit](Config::size_limit) and

/// the requirement that a regex pattern be one-pass, there are some other

/// limitations:

///

/// * There is an internal limit on the total number of explicit capturing

/// groups that appear across all patterns. It is somewhat small and there is

/// no way to configure it. If your pattern(s) exceed this limit, then building

/// a one-pass DFA will fail.

/// * If the number of patterns exceeds an internal unconfigurable limit, then

/// building a one-pass DFA will fail. This limit is quite large and you're

/// unlikely to hit it.

/// * If the total number of states exceeds an internal unconfigurable limit,

/// then building a one-pass DFA will fail. This limit is quite large and

/// you're unlikely to hit it.

///

/// # Other examples of regexes that aren't one-pass

///

/// One particularly unfortunate example is that enabling Unicode can cause

/// regexes that were one-pass to no longer be one-pass. Consider the regex

/// `(?-u)\w*\s` for example. It is one-pass because there is exactly no

/// overlap between the ASCII definitions of `\w` and `\s`. But `\w*\s`

/// (i.e., with Unicode enabled) is *not* one-pass because `\w` and `\s` get

/// translated to UTF-8 automatons. And while the *codepoints* in `\w` and `\s`

/// do not overlap, the underlying UTF-8 encodings do. Indeed, because of the

/// overlap between UTF-8 automata, the use of Unicode character classes will

/// tend to vastly increase the likelihood of a regex not being one-pass.

///

/// # How does one know if a regex is one-pass or not?

///

/// At the time of writing, the only way to know is to try and build a one-pass

/// DFA. The one-pass property is checked while constructing the DFA.

///

/// This does mean that you might potentially waste some CPU cycles and memory

/// by optimistically trying to build a one-pass DFA. But this is currently the

/// only way. In the future, building a one-pass DFA might be able to use some

/// heuristics to detect common violations of the one-pass property and bail

/// more quickly.

///

/// # Resource usage

///

/// Unlike a general DFA, a one-pass DFA has stricter bounds on its resource

/// usage. Namely, construction of a one-pass DFA has a time and space

/// complexity of `O(n)`, where `n ~ nfa.states().len()`. (A general DFA's time

/// and space complexity is `O(2^n)`.) This smaller time bound is achieved

/// because there is at most one DFA state created for each NFA state. If

/// additional DFA states would be required, then the pattern is not one-pass

/// and construction will fail.

///

/// Note though that currently, this DFA uses a fully dense representation.

/// This means that while its space complexity is no worse than an NFA, it may

/// in practice use more memory because of higher constant factors. The reason

/// for this trade off is two-fold. Firstly, a dense representation makes the

/// search faster. Secondly, the bigger an NFA, the more unlikely it is to be

/// one-pass. Therefore, most one-pass DFAs are usually pretty small.

///

/// # Example

///

/// This example shows that the one-pass DFA implements Unicode word boundaries

/// correctly while simultaneously reporting spans for capturing groups that

/// participate in a match. (This is the only DFA that implements full support

/// for Unicode word boundaries.)

///

/// ```

/// # if cfg!(miri) { return Ok(()); } // miri takes too long

/// use regex_automata::{dfa::onepass::DFA, Match, Span};

///

/// let re = DFA::new(r"\b(?P<first>\w+)[[:space:]]+(?P<last>\w+)\b")?;

/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

///

/// re.captures(&mut cache, "Шерлок Холмс", &mut caps);

/// assert_eq!(Some(Match::must(0, 0..23)), caps.get_match());

/// assert_eq!(Some(Span::from(0..12)), caps.get_group_by_name("first"));

/// assert_eq!(Some(Span::from(13..23)), caps.get_group_by_name("last"));

/// # Ok::<(), Box<dyn std::error::Error>>(())

/// ```

///

/// # Example: iteration

///

/// Unlike other regex engines in this crate, this one does not provide

/// iterator search functions. This is because a one-pass DFA only supports

/// anchored searches, and so iterator functions are generally not applicable.

///

/// However, if you know that all of your matches are

/// directly adjacent, then an iterator can be used. The

/// [`util::iter::Searcher`](crate::util::iter::Searcher) type can be used for

/// this purpose:

///

/// ```

/// # if cfg!(miri) { return Ok(()); } // miri takes too long

/// use regex_automata::{

///     dfa::onepass::DFA,

///     util::iter::Searcher,

///     Anchored, Input, Span,

/// };

///

/// let re = DFA::new(r"\w(\d)\w")?;

/// let (mut cache, caps) = (re.create_cache(), re.create_captures());

/// let input = Input::new("a1zb2yc3x").anchored(Anchored::Yes);

///

/// let mut it = Searcher::new(input).into_captures_iter(caps, |input, caps| {

///     Ok(re.try_search(&mut cache, input, caps)?)

/// }).infallible();

/// let caps0 = it.next().unwrap();

/// assert_eq!(Some(Span::from(1..2)), caps0.get_group(1));

///

/// # Ok::<(), Box<dyn std::error::Error>>(())

/// ```

#[derive(Clone)]

pub struct DFA {

    /// The configuration provided by the caller.

    config: Config,

    /// The NFA used to build this DFA.

///

    /// NOTE: We probably don't need to store the NFA here, but we use enough

    /// bits from it that it's convenient to do so. And there really isn't much

    /// cost to doing so either, since an NFA is reference counted internally.

    nfa: NFA,

    /// The transition table. Given a state ID 's' and a byte of haystack 'b',

    /// the next state is `table[sid + classes[byte]]`.

///

    /// The stride of this table (i.e., the number of columns) is always

    /// a power of 2, even if the alphabet length is smaller. This makes

    /// converting between state IDs and state indices very cheap.

///

    /// Note that the stride always includes room for one extra "transition"

    /// that isn't actually a transition. It is a 'PatternEpsilons' that is

    /// used for match states only. Because of this, the maximum number of

    /// active columns in the transition table is 257, which means the maximum

    /// stride is 512 (the next power of 2 greater than or equal to 257).

    table: Vec<Transition>,

    /// The DFA state IDs of the starting states.

///

    /// `starts[0]` is always present and corresponds to the starting state

    /// when searching for matches of any pattern in the DFA.

///

    /// `starts[i]` where i>0 corresponds to the starting state for the pattern

    /// ID 'i-1'. These starting states are optional.

    starts: Vec<StateID>,

    /// Every state ID >= this value corresponds to a match state.

///

    /// This is what a search uses to detect whether a state is a match state

    /// or not. It requires only a simple comparison instead of bit-unpacking

    /// the PatternEpsilons from every state.

    min_match_id: StateID,

    /// The alphabet of this DFA, split into equivalence classes. Bytes in the

    /// same equivalence class can never discriminate between a match and a

    /// non-match.

    classes: ByteClasses,

    /// The number of elements in each state in the transition table. This may

    /// be less than the stride, since the stride is always a power of 2 and

    /// the alphabet length can be anything up to and including 256.

    alphabet_len: usize,

    /// The number of columns in the transition table, expressed as a power of

    /// 2.

    stride2: usize,

    /// The offset at which the PatternEpsilons for a match state is stored in

    /// the transition table.

///

    /// PERF: One wonders whether it would be better to put this in a separate

    /// allocation, since only match states have a non-empty PatternEpsilons

    /// and the number of match states tends be dwarfed by the number of

    /// non-match states. So this would save '8*len(non_match_states)' for each

    /// DFA. The question is whether moving this to a different allocation will

    /// lead to a perf hit during searches. You might think dealing with match

    /// states is rare, but some regexes spend a lot of time in match states

    /// gobbling up input. But... match state handling is already somewhat

    /// expensive, so maybe this wouldn't do much? Either way, it's worth

    /// experimenting.

    pateps_offset: usize,

    /// The first explicit slot index. This refers to the first slot appearing

    /// immediately after the last implicit slot. It is always 'patterns.len()

    /// * 2'.

///

    /// We record this because we only store the explicit slots in our DFA

    /// transition table that need to be saved. Implicit slots are handled

    /// automatically as part of the search.

    explicit_slot_start: usize,

impl DFA {

    /// Parse the given regular expression using the default configuration and

    /// return the corresponding one-pass DFA.

///

    /// If you want a non-default configuration, then use the [`Builder`] to

    /// set your own configuration.

///

    /// # Example

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Match};

///

    /// let re = DFA::new("foo[0-9]+bar")?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

///

    /// re.captures(&mut cache, "foo12345barzzz", &mut caps);

    /// assert_eq!(Some(Match::must(0, 0..11)), caps.get_match());

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[cfg(feature = "syntax")]

    #[inline]

    pub fn new(pattern: &str) -> Result<DFA, BuildError> {

        DFA::builder().build(pattern)

    /// Like `new`, but parses multiple patterns into a single "multi regex."

    /// This similarly uses the default regex configuration.

///

    /// # Example

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Match};

///

    /// let re = DFA::new_many(&["[a-z]+", "[0-9]+"])?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

///

    /// re.captures(&mut cache, "abc123", &mut caps);

    /// assert_eq!(Some(Match::must(0, 0..3)), caps.get_match());

///

    /// re.captures(&mut cache, "123abc", &mut caps);

    /// assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[cfg(feature = "syntax")]

    #[inline]

    pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA, BuildError> {

        DFA::builder().build_many(patterns)

    /// Like `new`, but builds a one-pass DFA directly from an NFA. This is

    /// useful if you already have an NFA, or even if you hand-assembled the

    /// NFA.

///

    /// # Example

///

    /// This shows how to hand assemble a regular expression via its HIR,

    /// compile an NFA from it and build a one-pass DFA from the NFA.

///

    /// ```

    /// use regex_automata::{

    ///     dfa::onepass::DFA,

    ///     nfa::thompson::NFA,

    ///     Match,

    /// };

    /// use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};

///

    /// let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![

    ///     ClassBytesRange::new(b'0', b'9'),

    ///     ClassBytesRange::new(b'A', b'Z'),

    ///     ClassBytesRange::new(b'_', b'_'),

    ///     ClassBytesRange::new(b'a', b'z'),

    /// ])));

///

    /// let config = NFA::config().nfa_size_limit(Some(1_000));

    /// let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;

///

    /// let re = DFA::new_from_nfa(nfa)?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

    /// let expected = Some(Match::must(0, 0..1));

    /// re.captures(&mut cache, "A", &mut caps);

    /// assert_eq!(expected, caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    pub fn new_from_nfa(nfa: NFA) -> Result<DFA, BuildError> {

        DFA::builder().build_from_nfa(nfa)

    /// Create a new one-pass DFA that matches every input.

///

    /// # Example

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Match};

///

    /// let dfa = DFA::always_match()?;

    /// let mut cache = dfa.create_cache();

    /// let mut caps = dfa.create_captures();

///

    /// let expected = Match::must(0, 0..0);

    /// dfa.captures(&mut cache, "", &mut caps);

    /// assert_eq!(Some(expected), caps.get_match());

    /// dfa.captures(&mut cache, "foo", &mut caps);

    /// assert_eq!(Some(expected), caps.get_match());

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    pub fn always_match() -> Result<DFA, BuildError> {

        let nfa = thompson::NFA::always_match();

        Builder::new().build_from_nfa(nfa)

    /// Create a new one-pass DFA that never matches any input.

///

    /// # Example

///

    /// ```

    /// use regex_automata::dfa::onepass::DFA;

///

    /// let dfa = DFA::never_match()?;

    /// let mut cache = dfa.create_cache();

    /// let mut caps = dfa.create_captures();

///

    /// dfa.captures(&mut cache, "", &mut caps);

    /// assert_eq!(None, caps.get_match());

    /// dfa.captures(&mut cache, "foo", &mut caps);

    /// assert_eq!(None, caps.get_match());

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    pub fn never_match() -> Result<DFA, BuildError> {

        let nfa = thompson::NFA::never_match();

        Builder::new().build_from_nfa(nfa)

    /// Return a default configuration for a DFA.

///

    /// This is a convenience routine to avoid needing to import the `Config`

    /// type when customizing the construction of a DFA.

///

    /// # Example

///

    /// This example shows how to change the match semantics of this DFA from

    /// its default "leftmost first" to "all." When using "all," non-greediness

    /// doesn't apply and neither does preference order matching. Instead, the

    /// longest match possible is always returned. (Although, by construction,

    /// it's impossible for a one-pass DFA to have a different answer for

    /// "preference order" vs "longest match.")

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Match, MatchKind};

///

    /// let re = DFA::builder()

    ///     .configure(DFA::config().match_kind(MatchKind::All))

    ///     .build(r"(abc)+?")?;

    /// let mut cache = re.create_cache();

    /// let mut caps = re.create_captures();

///

    /// re.captures(&mut cache, "abcabc", &mut caps);

    /// // Normally, the non-greedy repetition would give us a 0..3 match.

    /// assert_eq!(Some(Match::must(0, 0..6)), caps.get_match());

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn config() -> Config {

        Config::new()

    /// Return a builder for configuring the construction of a DFA.

///

    /// This is a convenience routine to avoid needing to import the

    /// [`Builder`] type in common cases.

///

    /// # Example

///

    /// This example shows how to use the builder to disable UTF-8 mode.

///

    /// ```

    /// # if cfg!(miri) { return Ok(()); } // miri takes too long

    /// use regex_automata::{

    ///     dfa::onepass::DFA,

    ///     nfa::thompson,

    ///     util::syntax,

    ///     Match,

    /// };

///

    /// let re = DFA::builder()

    ///     .syntax(syntax::Config::new().utf8(false))

    ///     .thompson(thompson::Config::new().utf8(false))

    ///     .build(r"foo(?-u:[^b])ar.*")?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

///

    /// let haystack = b"foo\xFFarzz\xE2\x98\xFF\n";

    /// let expected = Some(Match::must(0, 0..8));

    /// re.captures(&mut cache, haystack, &mut caps);

    /// assert_eq!(expected, caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn builder() -> Builder {

        Builder::new()

    /// Create a new empty set of capturing groups that is guaranteed to be

    /// valid for the search APIs on this DFA.

///

    /// A `Captures` value created for a specific DFA cannot be used with any

    /// other DFA.

///

    /// This is a convenience function for [`Captures::all`]. See the

    /// [`Captures`] documentation for an explanation of its alternative

    /// constructors that permit the DFA to do less work during a search, and

    /// thus might make it faster.

    #[inline]

    pub fn create_captures(&self) -> Captures {

        Captures::all(self.nfa.group_info().clone())

    /// Create a new cache for this DFA.

///

    /// The cache returned should only be used for searches for this

    /// DFA. If you want to reuse the cache for another DFA, then you

    /// must call [`Cache::reset`] with that DFA (or, equivalently,

    /// [`DFA::reset_cache`]).

    #[inline]

    pub fn create_cache(&self) -> Cache {

        Cache::new(self)

    /// Reset the given cache such that it can be used for searching with the

    /// this DFA (and only this DFA).

///

    /// A cache reset permits reusing memory already allocated in this cache

    /// with a different DFA.

///

    /// # Example

///

    /// This shows how to re-purpose a cache for use with a different DFA.

///

    /// ```

    /// # if cfg!(miri) { return Ok(()); } // miri takes too long

    /// use regex_automata::{dfa::onepass::DFA, Match};

///

    /// let re1 = DFA::new(r"\w")?;

    /// let re2 = DFA::new(r"\W")?;

    /// let mut caps1 = re1.create_captures();

    /// let mut caps2 = re2.create_captures();

///

    /// let mut cache = re1.create_cache();

    /// assert_eq!(

    ///     Some(Match::must(0, 0..2)),

    ///     { re1.captures(&mut cache, "Δ", &mut caps1); caps1.get_match() },

    /// );

///

    /// // Using 'cache' with re2 is not allowed. It may result in panics or

    /// // incorrect results. In order to re-purpose the cache, we must reset

    /// // it with the one-pass DFA we'd like to use it with.

    /// //

    /// // Similarly, after this reset, using the cache with 're1' is also not

    /// // allowed.

    /// re2.reset_cache(&mut cache);

    /// assert_eq!(

    ///     Some(Match::must(0, 0..3)),

    ///     { re2.captures(&mut cache, "☃", &mut caps2); caps2.get_match() },

    /// );

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn reset_cache(&self, cache: &mut Cache) {

        cache.reset(self);

    /// Return the config for this one-pass DFA.

    #[inline]

    pub fn get_config(&self) -> &Config {

        &self.config

    /// Returns a reference to the underlying NFA.

    #[inline]

    pub fn get_nfa(&self) -> &NFA {

        &self.nfa

    /// Returns the total number of patterns compiled into this DFA.

///

    /// In the case of a DFA that contains no patterns, this returns `0`.

    #[inline]

    pub fn pattern_len(&self) -> usize {

        self.get_nfa().pattern_len()

    /// Returns the total number of states in this one-pass DFA.

///

    /// Note that unlike dense or sparse DFAs, a one-pass DFA does not expose

    /// a low level DFA API. Therefore, this routine has little use other than

    /// being informational.

    #[inline]

    pub fn state_len(&self) -> usize {

        self.table.len() >> self.stride2()

    /// Returns the total number of elements in the alphabet for this DFA.

///

    /// That is, this returns the total number of transitions that each

    /// state in this DFA must have. The maximum alphabet size is 256, which

    /// corresponds to each possible byte value.

///

    /// The alphabet size may be less than 256 though, and unless

    /// [`Config::byte_classes`] is disabled, it is typically must less than

    /// 256. Namely, bytes are grouped into equivalence classes such that no

    /// two bytes in the same class can distinguish a match from a non-match.

    /// For example, in the regex `^[a-z]+$`, the ASCII bytes `a-z` could

    /// all be in the same equivalence class. This leads to a massive space

    /// savings.

///

    /// Note though that the alphabet length does _not_ necessarily equal the

    /// total stride space taken up by a single DFA state in the transition

    /// table. Namely, for performance reasons, the stride is always the

    /// smallest power of two that is greater than or equal to the alphabet

    /// length. For this reason, [`DFA::stride`] or [`DFA::stride2`] are

    /// often more useful. The alphabet length is typically useful only for

    /// informational purposes.

///

    /// Note also that unlike dense or sparse DFAs, a one-pass DFA does

    /// not have a special end-of-input (EOI) transition. This is because

    /// a one-pass DFA handles look-around assertions explicitly (like the

    /// [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM)) and does not build

    /// them into the transitions of the DFA.

    #[inline]

    pub fn alphabet_len(&self) -> usize {

        self.alphabet_len

    /// Returns the total stride for every state in this DFA, expressed as the

    /// exponent of a power of 2. The stride is the amount of space each state

    /// takes up in the transition table, expressed as a number of transitions.

    /// (Unused transitions map to dead states.)

///

    /// The stride of a DFA is always equivalent to the smallest power of

    /// 2 that is greater than or equal to the DFA's alphabet length. This

    /// definition uses extra space, but possibly permits faster translation

    /// between state identifiers and their corresponding offsets in this DFA's

    /// transition table.

///

    /// For example, if the DFA's stride is 16 transitions, then its `stride2`

    /// is `4` since `2^4 = 16`.

///

    /// The minimum `stride2` value is `1` (corresponding to a stride of `2`)

    /// while the maximum `stride2` value is `9` (corresponding to a stride

    /// of `512`). The maximum in theory should be `8`, but because of some

    /// implementation quirks that may be relaxed in the future, it is one more

    /// than `8`. (Do note that a maximal stride is incredibly rare, as it

    /// would imply that there is almost no redundant in the regex pattern.)

///

    /// Note that unlike dense or sparse DFAs, a one-pass DFA does not expose

    /// a low level DFA API. Therefore, this routine has little use other than

    /// being informational.

    #[inline]

    pub fn stride2(&self) -> usize {

        self.stride2

    /// Returns the total stride for every state in this DFA. This corresponds

    /// to the total number of transitions used by each state in this DFA's

    /// transition table.

///

    /// Please see [`DFA::stride2`] for more information. In particular, this

    /// returns the stride as the number of transitions, where as `stride2`

    /// returns it as the exponent of a power of 2.

///

    /// Note that unlike dense or sparse DFAs, a one-pass DFA does not expose

    /// a low level DFA API. Therefore, this routine has little use other than

    /// being informational.

    #[inline]

    pub fn stride(&self) -> usize {

        1 << self.stride2()

    /// Returns the memory usage, in bytes, of this DFA.

///

    /// The memory usage is computed based on the number of bytes used to

    /// represent this DFA.

///

    /// This does **not** include the stack size used up by this DFA. To

    /// compute that, use `std::mem::size_of::<onepass::DFA>()`.

    #[inline]

    pub fn memory_usage(&self) -> usize {

        use core::mem::size_of;

        self.table.len() * size_of::<Transition>()

            + self.starts.len() * size_of::<StateID>()

impl DFA {

    /// Executes an anchored leftmost forward search, and returns true if and

    /// only if this one-pass DFA matches the given haystack.

///

    /// This routine may short circuit if it knows that scanning future

    /// input will never lead to a different result. In particular, if the

    /// underlying DFA enters a match state, then this routine will return

    /// `true` immediately without inspecting any future input. (Consider how

    /// this might make a difference given the regex `a+` on the haystack

    /// `aaaaaaaaaaaaaaa`. This routine can stop after it sees the first `a`,

    /// but routines like `find` need to continue searching because `+` is

    /// greedy by default.)

///

    /// The given `Input` is forcefully set to use [`Anchored::Yes`] if the

    /// given configuration was [`Anchored::No`] (which is the default).

///

    /// # Panics

///

    /// This routine panics if the search could not complete. This can occur

    /// in the following circumstances:

///

    /// * When the provided `Input` configuration is not supported. For

    /// example, by providing an unsupported anchor mode. Concretely,

    /// this occurs when using [`Anchored::Pattern`] without enabling

    /// [`Config::starts_for_each_pattern`].

///

    /// When a search panics, callers cannot know whether a match exists or

    /// not.

///

    /// Use [`DFA::try_search`] if you want to handle these panics as error

    /// values instead.

///

    /// # Example

///

    /// This shows basic usage:

///

    /// ```

    /// use regex_automata::dfa::onepass::DFA;

///

    /// let re = DFA::new("foo[0-9]+bar")?;

    /// let mut cache = re.create_cache();

///

    /// assert!(re.is_match(&mut cache, "foo12345bar"));

    /// assert!(!re.is_match(&mut cache, "foobar"));

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

///

    /// # Example: consistency with search APIs

///

    /// `is_match` is guaranteed to return `true` whenever `captures` returns

    /// a match. This includes searches that are executed entirely within a

    /// codepoint:

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Input};

///

    /// let re = DFA::new("a*")?;

    /// let mut cache = re.create_cache();

///

    /// assert!(!re.is_match(&mut cache, Input::new("☃").span(1..2)));

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

///

    /// Notice that when UTF-8 mode is disabled, then the above reports a

    /// match because the restriction against zero-width matches that split a

    /// codepoint has been lifted:

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, nfa::thompson::NFA, Input};

///

    /// let re = DFA::builder()

    ///     .thompson(NFA::config().utf8(false))

    ///     .build("a*")?;

    /// let mut cache = re.create_cache();

///

    /// assert!(re.is_match(&mut cache, Input::new("☃").span(1..2)));

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn is_match<'h, I: Into<Input<'h>>>(

        &self,

        cache: &mut Cache,

        input: I,

    ) -> bool {

        let mut input = input.into().earliest(true);

        if matches!(input.get_anchored(), Anchored::No) {

            input.set_anchored(Anchored::Yes);

        self.try_search_slots(cache, &input, &mut []).unwrap().is_some()

    /// Executes an anchored leftmost forward search, and returns a `Match` if

    /// and only if this one-pass DFA matches the given haystack.

///

    /// This routine only includes the overall match span. To get access to the

    /// individual spans of each capturing group, use [`DFA::captures`].

///

    /// The given `Input` is forcefully set to use [`Anchored::Yes`] if the

    /// given configuration was [`Anchored::No`] (which is the default).

///

    /// # Panics

///

    /// This routine panics if the search could not complete. This can occur

    /// in the following circumstances:

///

    /// * When the provided `Input` configuration is not supported. For

    /// example, by providing an unsupported anchor mode. Concretely,

    /// this occurs when using [`Anchored::Pattern`] without enabling

    /// [`Config::starts_for_each_pattern`].

///

    /// When a search panics, callers cannot know whether a match exists or

    /// not.

///

    /// Use [`DFA::try_search`] if you want to handle these panics as error

    /// values instead.

///

    /// # Example

///

    /// Leftmost first match semantics corresponds to the match with the

    /// smallest starting offset, but where the end offset is determined by

    /// preferring earlier branches in the original regular expression. For

    /// example, `Sam|Samwise` will match `Sam` in `Samwise`, but `Samwise|Sam`

    /// will match `Samwise` in `Samwise`.

///

    /// Generally speaking, the "leftmost first" match is how most backtracking

    /// regular expressions tend to work. This is in contrast to POSIX-style

    /// regular expressions that yield "leftmost longest" matches. Namely,

    /// both `Sam|Samwise` and `Samwise|Sam` match `Samwise` when using

    /// leftmost longest semantics. (This crate does not currently support

    /// leftmost longest semantics.)

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Match};

///

    /// let re = DFA::new("foo[0-9]+")?;

    /// let mut cache = re.create_cache();

    /// let expected = Match::must(0, 0..8);

    /// assert_eq!(Some(expected), re.find(&mut cache, "foo12345"));

///

    /// // Even though a match is found after reading the first byte (`a`),

    /// // the leftmost first match semantics demand that we find the earliest

    /// // match that prefers earlier parts of the pattern over later parts.

    /// let re = DFA::new("abc|a")?;

    /// let mut cache = re.create_cache();

    /// let expected = Match::must(0, 0..3);

    /// assert_eq!(Some(expected), re.find(&mut cache, "abc"));

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn find<'h, I: Into<Input<'h>>>(

        &self,

        cache: &mut Cache,

        input: I,

    ) -> Option<Match> {

        let mut input = input.into();

        if matches!(input.get_anchored(), Anchored::No) {

            input.set_anchored(Anchored::Yes);

        if self.get_nfa().pattern_len() == 1 {

            let mut slots = [None, None];

            let pid =

                self.try_search_slots(cache, &input, &mut slots).unwrap()?;

            let start = slots[0].unwrap().get();

            let end = slots[1].unwrap().get();

            return Some(Match::new(pid, Span { start, end }));

        let ginfo = self.get_nfa().group_info();

        let slots_len = ginfo.implicit_slot_len();

        let mut slots = vec![None; slots_len];

        let pid = self.try_search_slots(cache, &input, &mut slots).unwrap()?;

        let start = slots[pid.as_usize() * 2].unwrap().get();

        let end = slots[pid.as_usize() * 2 + 1].unwrap().get();

        Some(Match::new(pid, Span { start, end }))

    /// Executes an anchored leftmost forward search and writes the spans

    /// of capturing groups that participated in a match into the provided

    /// [`Captures`] value. If no match was found, then [`Captures::is_match`]

    /// is guaranteed to return `false`.

///

    /// The given `Input` is forcefully set to use [`Anchored::Yes`] if the

    /// given configuration was [`Anchored::No`] (which is the default).

///

    /// # Panics

///

    /// This routine panics if the search could not complete. This can occur

    /// in the following circumstances:

///

    /// * When the provided `Input` configuration is not supported. For

    /// example, by providing an unsupported anchor mode. Concretely,

    /// this occurs when using [`Anchored::Pattern`] without enabling

    /// [`Config::starts_for_each_pattern`].

///

    /// When a search panics, callers cannot know whether a match exists or

    /// not.

///

    /// Use [`DFA::try_search`] if you want to handle these panics as error

    /// values instead.

///

    /// # Example

///

    /// This shows a simple example of a one-pass regex that extracts

    /// capturing group spans.

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Match, Span};

///

    /// let re = DFA::new(

    ///     // Notice that we use ASCII here. The corresponding Unicode regex

    ///     // is sadly not one-pass.

    ///     "(?P<first>[[:alpha:]]+)[[:space:]]+(?P<last>[[:alpha:]]+)",

    /// )?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

///

    /// re.captures(&mut cache, "Bruce Springsteen", &mut caps);

    /// assert_eq!(Some(Match::must(0, 0..17)), caps.get_match());

    /// assert_eq!(Some(Span::from(0..5)), caps.get_group(1));

    /// assert_eq!(Some(Span::from(6..17)), caps.get_group_by_name("last"));

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn captures<'h, I: Into<Input<'h>>>(

        &self,

        cache: &mut Cache,

        input: I,

        caps: &mut Captures,

) {

        let mut input = input.into();

        if matches!(input.get_anchored(), Anchored::No) {

            input.set_anchored(Anchored::Yes);

        self.try_search(cache, &input, caps).unwrap();

    /// Executes an anchored leftmost forward search and writes the spans

    /// of capturing groups that participated in a match into the provided

    /// [`Captures`] value. If no match was found, then [`Captures::is_match`]

    /// is guaranteed to return `false`.

///

    /// The differences with [`DFA::captures`] are:

///

    /// 1. This returns an error instead of panicking if the search fails.

    /// 2. Accepts an `&Input` instead of a `Into<Input>`. This permits reusing

    /// the same input for multiple searches, which _may_ be important for

    /// latency.

    /// 3. This does not automatically change the [`Anchored`] mode from `No`

    /// to `Yes`. Instead, if [`Input::anchored`] is `Anchored::No`, then an

    /// error is returned.

///

    /// # Errors

///

    /// This routine errors if the search could not complete. This can occur

    /// in the following circumstances:

///

    /// * When the provided `Input` configuration is not supported. For

    /// example, by providing an unsupported anchor mode. Concretely,

    /// this occurs when using [`Anchored::Pattern`] without enabling

    /// [`Config::starts_for_each_pattern`].

///

    /// When a search returns an error, callers cannot know whether a match

    /// exists or not.

///

    /// # Example: specific pattern search

///

    /// This example shows how to build a multi-regex that permits searching

    /// for specific patterns. Note that this is somewhat less useful than

    /// in other regex engines, since a one-pass DFA by definition has no

    /// ambiguity about which pattern can match at a position. That is, if it

    /// were possible for two different patterns to match at the same starting

    /// position, then the multi-regex would not be one-pass and construction

    /// would have failed.

///

    /// Nevertheless, this can still be useful if you only care about matches

    /// for a specific pattern, and want the DFA to report "no match" even if

    /// some other pattern would have matched.

///

    /// Note that in order to make use of this functionality,

    /// [`Config::starts_for_each_pattern`] must be enabled. It is disabled

    /// by default since it may result in higher memory usage.

///

    /// ```

    /// use regex_automata::{

    ///     dfa::onepass::DFA, Anchored, Input, Match, PatternID,

    /// };

///

    /// let re = DFA::builder()

    ///     .configure(DFA::config().starts_for_each_pattern(true))

    ///     .build_many(&["[a-z]+", "[0-9]+"])?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

    /// let haystack = "123abc";

    /// let input = Input::new(haystack).anchored(Anchored::Yes);

///

    /// // A normal multi-pattern search will show pattern 1 matches.

    /// re.try_search(&mut cache, &input, &mut caps)?;

    /// assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());

///

    /// // If we only want to report pattern 0 matches, then we'll get no

    /// // match here.

    /// let input = input.anchored(Anchored::Pattern(PatternID::must(0)));

    /// re.try_search(&mut cache, &input, &mut caps)?;

    /// assert_eq!(None, caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

///

    /// # Example: specifying the bounds of a search

///

    /// This example shows how providing the bounds of a search can produce

    /// different results than simply sub-slicing the haystack.

///

    /// ```

    /// # if cfg!(miri) { return Ok(()); } // miri takes too long

    /// use regex_automata::{dfa::onepass::DFA, Anchored, Input, Match};

///

    /// // one-pass DFAs fully support Unicode word boundaries!

    /// // A sad joke is that a Unicode aware regex like \w+\s is not one-pass.

    /// // :-(

    /// let re = DFA::new(r"\b[0-9]{3}\b")?;

    /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

    /// let haystack = "foo123bar";

///

    /// // Since we sub-slice the haystack, the search doesn't know about

    /// // the larger context and assumes that `123` is surrounded by word

    /// // boundaries. And of course, the match position is reported relative

    /// // to the sub-slice as well, which means we get `0..3` instead of

    /// // `3..6`.

    /// let expected = Some(Match::must(0, 0..3));

    /// let input = Input::new(&haystack[3..6]).anchored(Anchored::Yes);

    /// re.try_search(&mut cache, &input, &mut caps)?;

    /// assert_eq!(expected, caps.get_match());

///

    /// // But if we provide the bounds of the search within the context of the

    /// // entire haystack, then the search can take the surrounding context

    /// // into account. (And if we did find a match, it would be reported

    /// // as a valid offset into `haystack` instead of its sub-slice.)

    /// let expected = None;

    /// let input = Input::new(haystack).range(3..6).anchored(Anchored::Yes);

    /// re.try_search(&mut cache, &input, &mut caps)?;

    /// assert_eq!(expected, caps.get_match());

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn try_search(

        &self,

        cache: &mut Cache,

        input: &Input<'_>,

        caps: &mut Captures,

    ) -> Result<(), MatchError> {

        let pid = self.try_search_slots(cache, input, caps.slots_mut())?;

        caps.set_pattern(pid);

        Ok(())

    /// Executes an anchored leftmost forward search and writes the spans

    /// of capturing groups that participated in a match into the provided

    /// `slots`, and returns the matching pattern ID. The contents of the

    /// slots for patterns other than the matching pattern are unspecified. If

    /// no match was found, then `None` is returned and the contents of all

    /// `slots` is unspecified.

///

    /// This is like [`DFA::try_search`], but it accepts a raw slots slice

    /// instead of a `Captures` value. This is useful in contexts where you

    /// don't want or need to allocate a `Captures`.

///

    /// It is legal to pass _any_ number of slots to this routine. If the regex

    /// engine would otherwise write a slot offset that doesn't fit in the

    /// provided slice, then it is simply skipped. In general though, there are

    /// usually three slice lengths you might want to use:

///

    /// * An empty slice, if you only care about which pattern matched.

    /// * A slice with

    /// [`pattern_len() * 2`](crate::dfa::onepass::DFA::pattern_len)

    /// slots, if you only care about the overall match spans for each matching

    /// pattern.

    /// * A slice with

    /// [`slot_len()`](crate::util::captures::GroupInfo::slot_len) slots, which

    /// permits recording match offsets for every capturing group in every

    /// pattern.

///

    /// # Errors

///

    /// This routine errors if the search could not complete. This can occur

    /// in the following circumstances:

///

    /// * When the provided `Input` configuration is not supported. For

    /// example, by providing an unsupported anchor mode. Concretely,

    /// this occurs when using [`Anchored::Pattern`] without enabling

    /// [`Config::starts_for_each_pattern`].

///

    /// When a search returns an error, callers cannot know whether a match

    /// exists or not.

///

    /// # Example

///

    /// This example shows how to find the overall match offsets in a

    /// multi-pattern search without allocating a `Captures` value. Indeed, we

    /// can put our slots right on the stack.

///

    /// ```

    /// use regex_automata::{dfa::onepass::DFA, Anchored, Input, PatternID};

///

    /// let re = DFA::new_many(&[

    ///     r"[a-zA-Z]+",

    ///     r"[0-9]+",

    /// ])?;

    /// let mut cache = re.create_cache();

    /// let input = Input::new("123").anchored(Anchored::Yes);

///

    /// // We only care about the overall match offsets here, so we just

    /// // allocate two slots for each pattern. Each slot records the start

    /// // and end of the match.

    /// let mut slots = [None; 4];

    /// let pid = re.try_search_slots(&mut cache, &input, &mut slots)?;

    /// assert_eq!(Some(PatternID::must(1)), pid);

///

    /// // The overall match offsets are always at 'pid * 2' and 'pid * 2 + 1'.

    /// // See 'GroupInfo' for more details on the mapping between groups and

    /// // slot indices.

    /// let slot_start = pid.unwrap().as_usize() * 2;

    /// let slot_end = slot_start + 1;

    /// assert_eq!(Some(0), slots[slot_start].map(|s| s.get()));

    /// assert_eq!(Some(3), slots[slot_end].map(|s| s.get()));

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    #[inline]

    pub fn try_search_slots(

        &self,

        cache: &mut Cache,

        input: &Input<'_>,

        slots: &mut [Option<NonMaxUsize>],

    ) -> Result<Option<PatternID>, MatchError> {

        let utf8empty = self.get_nfa().has_empty() && self.get_nfa().is_utf8();

        if !utf8empty {

            return self.try_search_slots_imp(cache, input, slots);

        // See PikeVM::try_search_slots for why we do this.

        let min = self.get_nfa().group_info().implicit_slot_len();

        if slots.len() >= min {

            return self.try_search_slots_imp(cache, input, slots);

        if self.get_nfa().pattern_len() == 1 {

            let mut enough = [None, None];

            let got = self.try_search_slots_imp(cache, input, &mut enough)?;

            // This is OK because we know `enough_slots` is strictly bigger

            // than `slots`, otherwise this special case isn't reached.

            slots.copy_from_slice(&enough[..slots.len()]);

            return Ok(got);

        let mut enough = vec![None; min];

        let got = self.try_search_slots_imp(cache, input, &mut enough)?;

        // This is OK because we know `enough_slots` is strictly bigger than

        // `slots`, otherwise this special case isn't reached.

        slots.copy_from_slice(&enough[..slots.len()]);

        Ok(got)

    #[inline(never)]

    fn try_search_slots_imp(

        &self,

        cache: &mut Cache,

        input: &Input<'_>,

        slots: &mut [Option<NonMaxUsize>],

    ) -> Result<Option<PatternID>, MatchError> {

        let utf8empty = self.get_nfa().has_empty() && self.get_nfa().is_utf8();

        match self.search_imp(cache, input, slots)? {

            None => return Ok(None),

            Some(pid) if !utf8empty => return Ok(Some(pid)),

            Some(pid) => {

                // These slot indices are always correct because we know our

                // 'pid' is valid and thus we know that the slot indices for it

                // are valid.

                let slot_start = pid.as_usize().wrapping_mul(2);

                let slot_end = slot_start.wrapping_add(1);

                // OK because we know we have a match and we know our caller

                // provided slots are big enough (which we make true above if

                // the caller didn't). Namely, we're only here when 'utf8empty'

                // is true, and when that's true, we require slots for every

                // pattern.

                let start = slots[slot_start].unwrap().get();

                let end = slots[slot_end].unwrap().get();

                // If our match splits a codepoint, then we cannot report is

                // as a match. And since one-pass DFAs only support anchored

                // searches, we don't try to skip ahead to find the next match.

                // We can just quit with nothing.

                if start == end && !input.is_char_boundary(start) {

                    return Ok(None);

                Ok(Some(pid))

impl DFA {

    fn search_imp(

        &self,

        cache: &mut Cache,

        input: &Input<'_>,

        slots: &mut [Option<NonMaxUsize>],

    ) -> Result<Option<PatternID>, MatchError> {

        // PERF: Some ideas. I ran out of steam after my initial impl to try

        // many of these.

//

        // 1) Try doing more state shuffling. Right now, all we do is push

        // match states to the end of the transition table so that we can do

        // 'if sid >= self.min_match_id' to know whether we're in a match

        // state or not. But what about doing something like dense DFAs and

        // pushing dead, match and states with captures/looks all toward the

        // beginning of the transition table. Then we could do 'if sid <=

        // self.max_special_id', in which case, we need to do some special

        // handling of some sort. Otherwise, we get the happy path, just

        // like in a DFA search. The main argument against this is that the

        // one-pass DFA is likely to be used most often with capturing groups

        // and if capturing groups are common, then this might wind up being a

        // pessimization.

//

        // 2) Consider moving 'PatternEpsilons' out of the transition table.

        // It is only needed for match states and usually a small minority of

        // states are match states. Therefore, we're using an extra 'u64' for

        // most states.

//

        // 3) I played around with the match state handling and it seems like

        // there is probably a lot left on the table for improvement. The

        // key tension is that the 'find_match' routine is a giant mess, but

        // splitting it out into a non-inlineable function is a non-starter

        // because the match state might consume input, so 'find_match' COULD

        // be called quite a lot, and a function call at that point would trash

        // perf. In theory, we could detect whether a match state consumes

        // input and then specialize our search routine based on that. In that

        // case, maybe an extra function call is OK, but even then, it might be

        // too much of a latency hit. Another idea is to just try and figure

        // out how to reduce the code size of 'find_match'. RE2 has a trick

        // here where the match handling isn't done if we know the next byte of

        // input yields a match too. Maybe we adopt that?

//

        // This just might be a tricky DFA to optimize.

        if input.is_done() {

            return Ok(None);

        // We unfortunately have a bit of book-keeping to do to set things

        // up. We do have to setup our cache and clear all of our slots. In

        // particular, clearing the slots is necessary for the case where we

        // report a match, but one of the capturing groups didn't participate

        // in the match but had a span set from a previous search. That would

        // be bad. In theory, we could avoid all this slot clearing if we knew

        // that every slot was always activated for every match. Then we would

        // know they would always be overwritten when a match is found.

        let explicit_slots_len = core::cmp::min(

            Slots::LIMIT,

            slots.len().saturating_sub(self.explicit_slot_start),

);

        cache.setup_search(explicit_slots_len);

        for slot in cache.explicit_slots() {

            *slot = None;

        for slot in slots.iter_mut() {

            *slot = None;

        // We set the starting slots for every pattern up front. This does

        // increase our latency somewhat, but it avoids having to do it every

        // time we see a match state (which could be many times in a single

        // search if the match state consumes input).

        for pid in self.nfa.patterns() {

            let i = pid.as_usize() * 2;

            if i >= slots.len() {

                break;

            slots[i] = NonMaxUsize::new(input.start());

        let mut pid = None;

        let mut next_sid = match input.get_anchored() {

            Anchored::Yes => self.start(),

            Anchored::Pattern(pid) => self.start_pattern(pid)?,

            Anchored::No => {

                // If the regex is itself always anchored, then we're fine,

                // even if the search is configured to be unanchored.

                if !self.nfa.is_always_start_anchored() {

                    return Err(MatchError::unsupported_anchored(

                        Anchored::No,

));

                self.start()

};

        let leftmost_first =

            matches!(self.config.get_match_kind(), MatchKind::LeftmostFirst);

        for at in input.start()..input.end() {

            let sid = next_sid;

            let trans = self.transition(sid, input.haystack()[at]);

            next_sid = trans.state_id();

            let epsilons = trans.epsilons();

            if sid >= self.min_match_id {

                if self.find_match(cache, input, at, sid, slots, &mut pid) {

                    if input.get_earliest()

                        || (leftmost_first && trans.match_wins())

                        return Ok(pid);

            if sid == DEAD

                || (!epsilons.looks().is_empty()

                    && !self.nfa.look_matcher().matches_set_inline(

                        epsilons.looks(),

                        input.haystack(),

at,

))

                return Ok(pid);

            epsilons.slots().apply(at, cache.explicit_slots());

        if next_sid >= self.min_match_id {

            self.find_match(

                cache,

                input,

                input.end(),

                next_sid,

                slots,

                &mut pid,

);

        Ok(pid)

    /// Assumes 'sid' is a match state and looks for whether a match can

    /// be reported. If so, appropriate offsets are written to 'slots' and

    /// 'matched_pid' is set to the matching pattern ID.

///

    /// Even when 'sid' is a match state, it's possible that a match won't

    /// be reported. For example, when the conditional epsilon transitions

    /// leading to the match state aren't satisfied at the given position in

    /// the haystack.

    #[cfg_attr(feature = "perf-inline", inline(always))]

    fn find_match(

        &self,

        cache: &mut Cache,

        input: &Input<'_>,

        at: usize,

        sid: StateID,

        slots: &mut [Option<NonMaxUsize>],

        matched_pid: &mut Option<PatternID>,

    ) -> bool {

        debug_assert!(sid >= self.min_match_id);

        let pateps = self.pattern_epsilons(sid);

        let epsilons = pateps.epsilons();

        if !epsilons.looks().is_empty()

            && !self.nfa.look_matcher().matches_set_inline(

                epsilons.looks(),

                input.haystack(),

at,

            return false;

        let pid = pateps.pattern_id_unchecked();

        // This calculation is always correct because we know our 'pid' is

        // valid and thus we know that the slot indices for it are valid.

        let slot_end = pid.as_usize().wrapping_mul(2).wrapping_add(1);

        // Set the implicit 'end' slot for the matching pattern. (The 'start'

        // slot was set at the beginning of the search.)

        if slot_end < slots.len() {

            slots[slot_end] = NonMaxUsize::new(at);

        // If the caller provided enough room, copy the previously recorded

        // explicit slots from our scratch space to the caller provided slots.

        // We *also* need to set any explicit slots that are active as part of

        // the path to the match state.

        if self.explicit_slot_start < slots.len() {

            // NOTE: The 'cache.explicit_slots()' slice is setup at the

            // beginning of every search such that it is guaranteed to return a

            // slice of length equivalent to 'slots[explicit_slot_start..]'.

            slots[self.explicit_slot_start..]

                .copy_from_slice(cache.explicit_slots());

            epsilons.slots().apply(at, &mut slots[self.explicit_slot_start..]);

        *matched_pid = Some(pid);

        true

impl DFA {

    /// Returns the anchored start state for matching any pattern in this DFA.

    fn start(&self) -> StateID {

        self.starts[0]

    /// Returns the anchored start state for matching the given pattern. If

    /// 'starts_for_each_pattern'

    /// was not enabled, then this returns an error. If the given pattern is

    /// not in this DFA, then `Ok(None)` is returned.

    fn start_pattern(&self, pid: PatternID) -> Result<StateID, MatchError> {

        if !self.config.get_starts_for_each_pattern() {

            return Err(MatchError::unsupported_anchored(Anchored::Pattern(

                pid,

            )));

        // 'starts' always has non-zero length. The first entry is always the

        // anchored starting state for all patterns, and the following entries

        // are optional and correspond to the anchored starting states for

        // patterns at pid+1. Thus, starts.len()-1 corresponds to the total

        // number of patterns that one can explicitly search for. (And it may

        // be zero.)

        Ok(self.starts.get(pid.one_more()).copied().unwrap_or(DEAD))

    /// Returns the transition from the given state ID and byte of input. The

    /// transition includes the next state ID, the slots that should be saved

    /// and any conditional epsilon transitions that must be satisfied in order

    /// to take this transition.

    fn transition(&self, sid: StateID, byte: u8) -> Transition {

        let offset = sid.as_usize() << self.stride2();

        let class = self.classes.get(byte).as_usize();

        self.table[offset + class]

    /// Set the transition from the given state ID and byte of input to the

    /// transition given.

    fn set_transition(&mut self, sid: StateID, byte: u8, to: Transition) {

        let offset = sid.as_usize() << self.stride2();

        let class = self.classes.get(byte).as_usize();

        self.table[offset + class] = to;

    /// Return an iterator of "sparse" transitions for the given state ID.

    /// "sparse" in this context means that consecutive transitions that are

    /// equivalent are returned as one group, and transitions to the DEAD state

    /// are ignored.

///

    /// This winds up being useful for debug printing, since it's much terser

    /// to display runs of equivalent transitions than the transition for every

    /// possible byte value. Indeed, in practice, it's very common for runs

    /// of equivalent transitions to appear.

    fn sparse_transitions(&self, sid: StateID) -> SparseTransitionIter<'_> {

        let start = sid.as_usize() << self.stride2();

        let end = start + self.alphabet_len();

        SparseTransitionIter {

            it: self.table[start..end].iter().enumerate(),

            cur: None,

    /// Return the pattern epsilons for the given state ID.

///

    /// If the given state ID does not correspond to a match state ID, then the

    /// pattern epsilons returned is empty.

    fn pattern_epsilons(&self, sid: StateID) -> PatternEpsilons {

        let offset = sid.as_usize() << self.stride2();

        PatternEpsilons(self.table[offset + self.pateps_offset].0)

    /// Set the pattern epsilons for the given state ID.

    fn set_pattern_epsilons(&mut self, sid: StateID, pateps: PatternEpsilons) {

        let offset = sid.as_usize() << self.stride2();

        self.table[offset + self.pateps_offset] = Transition(pateps.0);

    /// Returns the state ID prior to the one given. This returns None if the

    /// given ID is the first DFA state.

    fn prev_state_id(&self, id: StateID) -> Option<StateID> {

        if id == DEAD {

            None

        } else {

            // CORRECTNESS: Since 'id' is not the first state, subtracting 1

            // is always valid.

            Some(StateID::new_unchecked(id.as_usize().checked_sub(1).unwrap()))

    /// Returns the state ID of the last state in this DFA's transition table.

    /// "last" in this context means the last state to appear in memory, i.e.,

    /// the one with the greatest ID.

    fn last_state_id(&self) -> StateID {

        // CORRECTNESS: A DFA table is always non-empty since it always at

        // least contains a DEAD state. Since every state has the same stride,

        // we can just compute what the "next" state ID would have been and

        // then subtract 1 from it.

        StateID::new_unchecked(

            (self.table.len() >> self.stride2()).checked_sub(1).unwrap(),

    /// Move the transitions from 'id1' to 'id2' and vice versa.

///

    /// WARNING: This does not update the rest of the transition table to have

    /// transitions to 'id1' changed to 'id2' and vice versa. This merely moves

    /// the states in memory.

    pub(super) fn swap_states(&mut self, id1: StateID, id2: StateID) {

        let o1 = id1.as_usize() << self.stride2();

        let o2 = id2.as_usize() << self.stride2();

        for b in 0..self.stride() {

            self.table.swap(o1 + b, o2 + b);

    /// Map all state IDs in this DFA (transition table + start states)

    /// according to the closure given.

    pub(super) fn remap(&mut self, map: impl Fn(StateID) -> StateID) {

        for i in 0..self.state_len() {

            let offset = i << self.stride2();

            for b in 0..self.alphabet_len() {

                let next = self.table[offset + b].state_id();

                self.table[offset + b].set_state_id(map(next));

        for i in 0..self.starts.len() {

            self.starts[i] = map(self.starts[i]);

impl core::fmt::Debug for DFA {

    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {

        fn debug_state_transitions(

            f: &mut core::fmt::Formatter,

            dfa: &DFA,

            sid: StateID,

        ) -> core::fmt::Result {

            for (i, (start, end, trans)) in

                dfa.sparse_transitions(sid).enumerate()

                let next = trans.state_id();

                if i > 0 {

                    write!(f, ", ")?;

                if start == end {

                    write!(

f,

                        "{:?} => {:?}",

                        DebugByte(start),

                        next.as_usize(),

)?;

                } else {

                    write!(

f,

                        "{:?}-{:?} => {:?}",

                        DebugByte(start),

                        DebugByte(end),

                        next.as_usize(),

)?;

                if trans.match_wins() {

                    write!(f, " (MW)")?;

                if !trans.epsilons().is_empty() {

                    write!(f, " ({:?})", trans.epsilons())?;

            Ok(())

        writeln!(f, "onepass::DFA(")?;

        for index in 0..self.state_len() {

            let sid = StateID::must(index);

            let pateps = self.pattern_epsilons(sid);

            if sid == DEAD {

                write!(f, "D ")?;

            } else if pateps.pattern_id().is_some() {

                write!(f, "* ")?;

            } else {

                write!(f, "  ")?;

            write!(f, "{:06?}", sid.as_usize())?;

            if !pateps.is_empty() {

                write!(f, " ({:?})", pateps)?;

            write!(f, ": ")?;

            debug_state_transitions(f, self, sid)?;

            write!(f, "\n")?;

        writeln!(f, "")?;

        for (i, &sid) in self.starts.iter().enumerate() {

            if i == 0 {

                writeln!(f, "START(ALL): {:?}", sid.as_usize())?;

            } else {

                writeln!(

f,

                    "START(pattern: {:?}): {:?}",

                    i - 1,

                    sid.as_usize(),

)?;

        writeln!(f, "state length: {:?}", self.state_len())?;

        writeln!(f, "pattern length: {:?}", self.pattern_len())?;

        writeln!(f, ")")?;

        Ok(())

/// An iterator over groups of consecutive equivalent transitions in a single

/// state.

#[derive(Debug)]

struct SparseTransitionIter<'a> {

    it: core::iter::Enumerate<core::slice::Iter<'a, Transition>>,

    cur: Option<(u8, u8, Transition)>,

impl<'a> Iterator for SparseTransitionIter<'a> {

    type Item = (u8, u8, Transition);

    fn next(&mut self) -> Option<(u8, u8, Transition)> {

        while let Some((b, &trans)) = self.it.next() {

            // Fine because we'll never have more than u8::MAX transitions in

            // one state.

            let b = b.as_u8();

            let (prev_start, prev_end, prev_trans) = match self.cur {

                Some(t) => t,

                None => {

                    self.cur = Some((b, b, trans));

                    continue;

};

            if prev_trans == trans {

                self.cur = Some((prev_start, b, prev_trans));

            } else {

                self.cur = Some((b, b, trans));

                if prev_trans.state_id() != DEAD {

                    return Some((prev_start, prev_end, prev_trans));

        if let Some((start, end, trans)) = self.cur.take() {

            if trans.state_id() != DEAD {

                return Some((start, end, trans));

        None

/// A cache represents mutable state that a one-pass [`DFA`] requires during a

/// search.

///

/// For a given one-pass DFA, its corresponding cache may be created either via

/// [`DFA::create_cache`], or via [`Cache::new`]. They are equivalent in every

/// way, except the former does not require explicitly importing `Cache`.

///

/// A particular `Cache` is coupled with the one-pass DFA from which it was

/// created. It may only be used with that one-pass DFA. A cache and its

/// allocations may be re-purposed via [`Cache::reset`], in which case, it can

/// only be used with the new one-pass DFA (and not the old one).

#[derive(Clone, Debug)]

pub struct Cache {

    /// Scratch space used to store slots during a search. Basically, we use

    /// the caller provided slots to store slots known when a match occurs.

    /// But after a match occurs, we might continue a search but ultimately

    /// fail to extend the match. When continuing the search, we need some

    /// place to store candidate capture offsets without overwriting the slot

    /// offsets recorded for the most recently seen match.

    explicit_slots: Vec<Option<NonMaxUsize>>,

    /// The number of slots in the caller-provided 'Captures' value for the

    /// current search. This is always at most 'explicit_slots.len()', but

    /// might be less than it, if the caller provided fewer slots to fill.

    explicit_slot_len: usize,

impl Cache {

    /// Create a new [`onepass::DFA`](DFA) cache.

///

    /// A potentially more convenient routine to create a cache is

    /// [`DFA::create_cache`], as it does not require also importing the

    /// `Cache` type.

///

    /// If you want to reuse the returned `Cache` with some other one-pass DFA,

    /// then you must call [`Cache::reset`] with the desired one-pass DFA.

    pub fn new(re: &DFA) -> Cache {

        let mut cache = Cache { explicit_slots: vec![], explicit_slot_len: 0 };

        cache.reset(re);

        cache

    /// Reset this cache such that it can be used for searching with a

    /// different [`onepass::DFA`](DFA).

///

    /// A cache reset permits reusing memory already allocated in this cache

    /// with a different one-pass DFA.

///

    /// # Example

///

    /// This shows how to re-purpose a cache for use with a different one-pass

    /// DFA.

///

    /// ```

    /// # if cfg!(miri) { return Ok(()); } // miri takes too long

    /// use regex_automata::{dfa::onepass::DFA, Match};

///

    /// let re1 = DFA::new(r"\w")?;

    /// let re2 = DFA::new(r"\W")?;

    /// let mut caps1 = re1.create_captures();

    /// let mut caps2 = re2.create_captures();

///

    /// let mut cache = re1.create_cache();

    /// assert_eq!(

    ///     Some(Match::must(0, 0..2)),

    ///     { re1.captures(&mut cache, "Δ", &mut caps1); caps1.get_match() },

    /// );

///

    /// // Using 'cache' with re2 is not allowed. It may result in panics or

    /// // incorrect results. In order to re-purpose the cache, we must reset

    /// // it with the one-pass DFA we'd like to use it with.

    /// //

    /// // Similarly, after this reset, using the cache with 're1' is also not

    /// // allowed.

    /// re2.reset_cache(&mut cache);

    /// assert_eq!(

    ///     Some(Match::must(0, 0..3)),

    ///     { re2.captures(&mut cache, "☃", &mut caps2); caps2.get_match() },

    /// );

///

    /// # Ok::<(), Box<dyn std::error::Error>>(())

    /// ```

    pub fn reset(&mut self, re: &DFA) {

        let explicit_slot_len = re.get_nfa().group_info().explicit_slot_len();

        self.explicit_slots.resize(explicit_slot_len, None);

        self.explicit_slot_len = explicit_slot_len;

    /// Returns the heap memory usage, in bytes, of this cache.

///

    /// This does **not** include the stack size used up by this cache. To

    /// compute that, use `std::mem::size_of::<Cache>()`.

    pub fn memory_usage(&self) -> usize {

        self.explicit_slots.len() * core::mem::size_of::<Option<NonMaxUsize>>()

    fn explicit_slots(&mut self) -> &mut [Option<NonMaxUsize>] {

        &mut self.explicit_slots[..self.explicit_slot_len]

    fn setup_search(&mut self, explicit_slot_len: usize) {

        self.explicit_slot_len = explicit_slot_len;

/// Represents a single transition in a one-pass DFA.

///

/// The high 24 bits corresponds to the state ID. The low 48 bits corresponds

/// to the transition epsilons, which contains the slots that should be saved

/// when this transition is followed and the conditional epsilon transitions

/// that must be satisfied in order to follow this transition.

#[derive(Clone, Copy, Eq, PartialEq)]

struct Transition(u64);

impl Transition {

    const STATE_ID_BITS: u64 = 21;

    const STATE_ID_SHIFT: u64 = 64 - Transition::STATE_ID_BITS;

    const STATE_ID_LIMIT: u64 = 1 << Transition::STATE_ID_BITS;

    const MATCH_WINS_SHIFT: u64 = 64 - (Transition::STATE_ID_BITS + 1);

    const INFO_MASK: u64 = 0x000003FF_FFFFFFFF;

    /// Return a new transition to the given state ID with the given epsilons.

    fn new(match_wins: bool, sid: StateID, epsilons: Epsilons) -> Transition {

        let match_wins =

            if match_wins { 1 << Transition::MATCH_WINS_SHIFT } else { 0 };

        let sid = sid.as_u64() << Transition::STATE_ID_SHIFT;

        Transition(sid | match_wins | epsilons.0)

    /// Returns true if and only if this transition points to the DEAD state.

    fn is_dead(self) -> bool {

        self.state_id() == DEAD

    /// Return whether this transition has a "match wins" property.

///

    /// When a transition has this property, it means that if a match has been

    /// found and the search uses leftmost-first semantics, then that match

    /// should be returned immediately instead of continuing on.

///

    /// The "match wins" name comes from RE2, which uses a pretty much

    /// identical mechanism for implementing leftmost-first semantics.

    fn match_wins(&self) -> bool {

        (self.0 >> Transition::MATCH_WINS_SHIFT & 1) == 1

    /// Return the "next" state ID that this transition points to.

    fn state_id(&self) -> StateID {

        // OK because a Transition has a valid StateID in its upper bits by

        // construction. The cast to usize is also correct, even on 16-bit

        // targets because, again, we know the upper bits is a valid StateID,

        // which can never overflow usize on any supported target.

        StateID::new_unchecked(

            (self.0 >> Transition::STATE_ID_SHIFT).as_usize(),

    /// Set the "next" state ID in this transition.

    fn set_state_id(&mut self, sid: StateID) {

        *self = Transition::new(self.match_wins(), sid, self.epsilons());

    /// Return the epsilons embedded in this transition.

    fn epsilons(&self) -> Epsilons {

        Epsilons(self.0 & Transition::INFO_MASK)

impl core::fmt::Debug for Transition {

    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {

        if self.is_dead() {

            return write!(f, "0");

        write!(f, "{}", self.state_id().as_usize())?;

        if self.match_wins() {

            write!(f, "-MW")?;

        if !self.epsilons().is_empty() {

            write!(f, "-{:?}", self.epsilons())?;

        Ok(())

/// A representation of a match state's pattern ID along with the epsilons for

/// when a match occurs.

///

/// A match state in a one-pass DFA, unlike in a more general DFA, has exactly

/// one pattern ID. If it had more, then the original NFA would not have been

/// one-pass.

///

/// The "epsilons" part of this corresponds to what was found in the epsilon

/// transitions between the transition taken in the last byte of input and the

/// ultimate match state. This might include saving slots and/or conditional

/// epsilon transitions that must be satisfied before one can report the match.

///

/// Technically, every state has room for a 'PatternEpsilons', but it is only

/// ever non-empty for match states.

#[derive(Clone, Copy)]

struct PatternEpsilons(u64);

impl PatternEpsilons {

    const PATTERN_ID_BITS: u64 = 22;

    const PATTERN_ID_SHIFT: u64 = 64 - PatternEpsilons::PATTERN_ID_BITS;

    // A sentinel value indicating that this is not a match state. We don't

    // use 0 since 0 is a valid pattern ID.

    const PATTERN_ID_NONE: u64 = 0x00000000_003FFFFF;

    const PATTERN_ID_LIMIT: u64 = PatternEpsilons::PATTERN_ID_NONE;

    const PATTERN_ID_MASK: u64 = 0xFFFFFC00_00000000;

    const EPSILONS_MASK: u64 = 0x000003FF_FFFFFFFF;

    /// Return a new empty pattern epsilons that has no pattern ID and has no

    /// epsilons. This is suitable for non-match states.

    fn empty() -> PatternEpsilons {

        PatternEpsilons(

            PatternEpsilons::PATTERN_ID_NONE

                << PatternEpsilons::PATTERN_ID_SHIFT,

    /// Whether this pattern epsilons is empty or not. It's empty when it has

    /// no pattern ID and an empty epsilons.

    fn is_empty(self) -> bool {

        self.pattern_id().is_none() && self.epsilons().is_empty()

    /// Return the pattern ID in this pattern epsilons if one exists.

    fn pattern_id(self) -> Option<PatternID> {

        let pid = self.0 >> PatternEpsilons::PATTERN_ID_SHIFT;

        if pid == PatternEpsilons::PATTERN_ID_LIMIT {

            None

        } else {

            Some(PatternID::new_unchecked(pid.as_usize()))

    /// Returns the pattern ID without checking whether it's valid. If this is

    /// called and there is no pattern ID in this `PatternEpsilons`, then this

    /// will likely produce an incorrect result or possibly even a panic or

    /// an overflow. But safety will not be violated.

///

    /// This is useful when you know a particular state is a match state. If

    /// it's a match state, then it must have a pattern ID.

    fn pattern_id_unchecked(self) -> PatternID {

        let pid = self.0 >> PatternEpsilons::PATTERN_ID_SHIFT;

        PatternID::new_unchecked(pid.as_usize())

    /// Return a new pattern epsilons with the given pattern ID, but the same

    /// epsilons.

    fn set_pattern_id(self, pid: PatternID) -> PatternEpsilons {

        PatternEpsilons(

            (pid.as_u64() << PatternEpsilons::PATTERN_ID_SHIFT)

                | (self.0 & PatternEpsilons::EPSILONS_MASK),

    /// Return the epsilons part of this pattern epsilons.

    fn epsilons(self) -> Epsilons {

        Epsilons(self.0 & PatternEpsilons::EPSILONS_MASK)

    /// Return a new pattern epsilons with the given epsilons, but the same

    /// pattern ID.

    fn set_epsilons(self, epsilons: Epsilons) -> PatternEpsilons {

        PatternEpsilons(

            (self.0 & PatternEpsilons::PATTERN_ID_MASK)

                | u64::from(epsilons.0),

impl core::fmt::Debug for PatternEpsilons {

    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {

        if self.is_empty() {

            return write!(f, "N/A");

        if let Some(pid) = self.pattern_id() {

            write!(f, "{}", pid.as_usize())?;

        if !self.epsilons().is_empty() {

            if self.pattern_id().is_some() {

                write!(f, "/")?;

            write!(f, "{:?}", self.epsilons())?;

        Ok(())

/// Epsilons represents all of the NFA epsilons transitions that went into a

/// single transition in a single DFA state. In this case, it only represents

/// the epsilon transitions that have some kind of non-consuming side effect:

/// either the transition requires storing the current position of the search

/// into a slot, or the transition is conditional and requires the current

/// position in the input to satisfy an assertion before the transition may be

/// taken.

///

/// This folds the cumulative effect of a group of NFA states (all connected

/// by epsilon transitions) down into a single set of bits. While these bits

/// can represent all possible conditional epsilon transitions, it only permits

/// storing up to a somewhat small number of slots.

///

/// Epsilons is represented as a 42-bit integer. For example, it is packed into

/// the lower 42 bits of a `Transition`. (Where the high 22 bits contains a

/// `StateID` and a special "match wins" property.)

#[derive(Clone, Copy)]

struct Epsilons(u64);

impl Epsilons {

    const SLOT_MASK: u64 = 0x000003FF_FFFFFC00;

    const SLOT_SHIFT: u64 = 10;

    const LOOK_MASK: u64 = 0x00000000_000003FF;

    /// Create a new empty epsilons. It has no slots and no assertions that

    /// need to be satisfied.

    fn empty() -> Epsilons {

        Epsilons(0)

    /// Returns true if this epsilons contains no slots and no assertions.

    fn is_empty(self) -> bool {

        self.0 == 0

    /// Returns the slot epsilon transitions.

    fn slots(self) -> Slots {

        Slots((self.0 >> Epsilons::SLOT_SHIFT).low_u32())

    /// Set the slot epsilon transitions.

    fn set_slots(self, slots: Slots) -> Epsilons {

        Epsilons(

            (u64::from(slots.0) << Epsilons::SLOT_SHIFT)

                | (self.0 & Epsilons::LOOK_MASK),

    /// Return the set of look-around assertions in these epsilon transitions.

    fn looks(self) -> LookSet {

        LookSet { bits: (self.0 & Epsilons::LOOK_MASK).low_u16() }

    /// Set the look-around assertions on these epsilon transitions.

    fn set_looks(self, look_set: LookSet) -> Epsilons {

        Epsilons((self.0 & Epsilons::SLOT_MASK) | u64::from(look_set.bits))

impl core::fmt::Debug for Epsilons {

    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {

        let mut wrote = false;

        if !self.slots().is_empty() {

            write!(f, "{:?}", self.slots())?;

            wrote = true;

        if !self.looks().is_empty() {

            if wrote {

                write!(f, "/")?;

            write!(f, "{:?}", self.looks())?;

            wrote = true;

        if !wrote {

            write!(f, "N/A")?;

        Ok(())

/// The set of epsilon transitions indicating that the current position in a

/// search should be saved to a slot.

///

/// This *only* represents explicit slots. So for example, the pattern

/// `[a-z]+([0-9]+)([a-z]+)` has:

///

/// * 3 capturing groups, thus 6 slots.

/// * 1 implicit capturing group, thus 2 implicit slots.

/// * 2 explicit capturing groups, thus 4 explicit slots.

///

/// While implicit slots are represented by epsilon transitions in an NFA, we

/// do not explicitly represent them here. Instead, implicit slots are assumed

/// to be present and handled automatically in the search code. Therefore,

/// that means we only need to represent explicit slots in our epsilon

/// transitions.

///

/// Its representation is a bit set. The bit 'i' is set if and only if there

/// exists an explicit slot at index 'c', where 'c = (#patterns * 2) + i'. That

/// is, the bit 'i' corresponds to the first explicit slot and the first

/// explicit slot appears immediately following the last implicit slot. (If

/// this is confusing, see `GroupInfo` for more details on how slots works.)

///

/// A single `Slots` represents all the active slots in a sub-graph of an NFA,

/// where all the states are connected by epsilon transitions. In effect, when

/// traversing the one-pass DFA during a search, all slots set in a particular

/// transition must be captured by recording the current search position.

///

/// The API of `Slots` requires the caller to handle the explicit slot offset.

/// That is, a `Slots` doesn't know where the explicit slots start for a

/// particular NFA. Thus, if the callers see's the bit 'i' is set, then they

/// need to do the arithmetic above to find 'c', which is the real actual slot

/// index in the corresponding NFA.

#[derive(Clone, Copy)]

struct Slots(u32);

impl Slots {

    const LIMIT: usize = 32;

    /// Insert the slot at the given bit index.

    fn insert(self, slot: usize) -> Slots {

        debug_assert!(slot < Slots::LIMIT);

        Slots(self.0 | (1 << slot.as_u32()))

    /// Remove the slot at the given bit index.

    fn remove(self, slot: usize) -> Slots {

        debug_assert!(slot < Slots::LIMIT);

        Slots(self.0 & !(1 << slot.as_u32()))

    /// Returns true if and only if this set contains no slots.

    fn is_empty(self) -> bool {

        self.0 == 0

    /// Returns an iterator over all of the set bits in this set.

    fn iter(self) -> SlotsIter {

        SlotsIter { slots: self }

    /// For the position `at` in the current haystack, copy it to

    /// `caller_explicit_slots` for all slots that are in this set.

///

    /// Callers may pass a slice of any length. Slots in this set bigger than

    /// the length of the given explicit slots are simply skipped.

///

    /// The slice *must* correspond only to the explicit slots and the first

    /// element of the slice must always correspond to the first explicit slot

    /// in the corresponding NFA.

    fn apply(

        self,

        at: usize,

        caller_explicit_slots: &mut [Option<NonMaxUsize>],

) {

        if self.is_empty() {

            return;

        let at = NonMaxUsize::new(at);

        for slot in self.iter() {

            if slot >= caller_explicit_slots.len() {

                break;

            caller_explicit_slots[slot] = at;

impl core::fmt::Debug for Slots {

    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {

        write!(f, "S")?;

        for slot in self.iter() {

            write!(f, "-{:?}", slot)?;

        Ok(())

/// An iterator over all of the bits set in a slot set.

///

/// This returns the bit index that is set, so callers may need to offset it

/// to get the actual NFA slot index.

#[derive(Debug)]

struct SlotsIter {

    slots: Slots,

impl Iterator for SlotsIter {

    type Item = usize;

    fn next(&mut self) -> Option<usize> {

        // Number of zeroes here is always <= u8::MAX, and so fits in a usize.

        let slot = self.slots.0.trailing_zeros().as_usize();

        if slot >= Slots::LIMIT {

            return None;

        self.slots = self.slots.remove(slot);

        Some(slot)

/// An error that occurred during the construction of a one-pass DFA.

///

/// This error does not provide many introspection capabilities. There are

/// generally only two things you can do with it:

///

/// * Obtain a human readable message via its `std::fmt::Display` impl.

/// * Access an underlying [`thompson::BuildError`] type from its `source`

/// method via the `std::error::Error` trait. This error only occurs when using

/// convenience routines for building a one-pass DFA directly from a pattern

/// string.

///

/// When the `std` feature is enabled, this implements the `std::error::Error`

/// trait.

#[derive(Clone, Debug)]

pub struct BuildError {

    kind: BuildErrorKind,

/// The kind of error that occurred during the construction of a one-pass DFA.

#[derive(Clone, Debug)]

enum BuildErrorKind {

    NFA(crate::nfa::thompson::BuildError),

    Word(UnicodeWordBoundaryError),

    TooManyStates { limit: u64 },

    TooManyPatterns { limit: u64 },

    UnsupportedLook { look: Look },

    ExceededSizeLimit { limit: usize },

    NotOnePass { msg: &'static str },

impl BuildError {

    fn nfa(err: crate::nfa::thompson::BuildError) -> BuildError {

        BuildError { kind: BuildErrorKind::NFA(err) }

    fn word(err: UnicodeWordBoundaryError) -> BuildError {

        BuildError { kind: BuildErrorKind::Word(err) }

    fn too_many_states(limit: u64) -> BuildError {

        BuildError { kind: BuildErrorKind::TooManyStates { limit } }

    fn too_many_patterns(limit: u64) -> BuildError {

        BuildError { kind: BuildErrorKind::TooManyPatterns { limit } }

    fn unsupported_look(look: Look) -> BuildError {

        BuildError { kind: BuildErrorKind::UnsupportedLook { look } }

    fn exceeded_size_limit(limit: usize) -> BuildError {

        BuildError { kind: BuildErrorKind::ExceededSizeLimit { limit } }

    fn not_one_pass(msg: &'static str) -> BuildError {

        BuildError { kind: BuildErrorKind::NotOnePass { msg } }

#[cfg(feature = "std")]

impl std::error::Error for BuildError {

    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {

        use self::BuildErrorKind::*;

        match self.kind {

            NFA(ref err) => Some(err),

            Word(ref err) => Some(err),

            _ => None,

impl core::fmt::Display for BuildError {

    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {

        use self::BuildErrorKind::*;

        match self.kind {

            NFA(_) => write!(f, "error building NFA"),

            Word(_) => write!(f, "NFA contains Unicode word boundary"),

            TooManyStates { limit } => write!(

f,

                "one-pass DFA exceeded a limit of {:?} for number of states",

                limit,

),

            TooManyPatterns { limit } => write!(

f,

                "one-pass DFA exceeded a limit of {:?} for number of patterns",

                limit,

),

            UnsupportedLook { look } => write!(

f,

                "one-pass DFA does not support the {:?} assertion",

                look,

),

            ExceededSizeLimit { limit } => write!(

f,

                "one-pass DFA exceeded size limit of {:?} during building",

                limit,

),

            NotOnePass { msg } => write!(

f,

                "one-pass DFA could not be built because \

                 pattern is not one-pass: {}",

                msg,

),

#[cfg(all(test, feature = "syntax"))]

mod tests {

    use alloc::string::ToString;

    use super::*;

    #[test]

    fn fail_conflicting_transition() {

        let predicate = |err: &str| err.contains("conflicting transition");

        let err = DFA::new(r"a*[ab]").unwrap_err().to_string();

        assert!(predicate(&err), "{}", err);

    #[test]

    fn fail_multiple_epsilon() {

        let predicate = |err: &str| {

            err.contains("multiple epsilon transitions to same state")

};

        let err = DFA::new(r"(^|$)a").unwrap_err().to_string();

        assert!(predicate(&err), "{}", err);

    #[test]

    fn fail_multiple_match() {

        let predicate = |err: &str| {

            err.contains("multiple epsilon transitions to match state")

};

        let err = DFA::new_many(&[r"^", r"$"]).unwrap_err().to_string();

        assert!(predicate(&err), "{}", err);

    // This test is meant to build a one-pass regex with the maximum number of

    // possible slots.

//

    // NOTE: Remember that the slot limit only applies to explicit capturing

    // groups. Any number of implicit capturing groups is supported (up to the

    // maximum number of supported patterns), since implicit groups are handled

    // by the search loop itself.

    #[test]

    fn max_slots() {

        // One too many...

        let pat = r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)(q)";

        assert!(DFA::new(pat).is_err());

        // Just right.

        let pat = r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)";

        assert!(DFA::new(pat).is_ok());

    // This test ensures that the one-pass DFA works with all look-around

    // assertions that we expect it to work with.

//

    // The utility of this test is that each one-pass transition has a small

    // amount of space to store look-around assertions. Currently, there is

    // logic in the one-pass constructor to ensure there aren't more than ten

    // possible assertions. And indeed, there are only ten possible assertions

    // (at time of writing), so this is okay. But conceivably, more assertions

    // could be added. So we check that things at least work with what we

    // expect them to work with.

    #[test]

    fn assertions() {

        // haystack anchors

        assert!(DFA::new(r"^").is_ok());

        assert!(DFA::new(r"$").is_ok());

        // line anchors

        assert!(DFA::new(r"(?m)^").is_ok());

        assert!(DFA::new(r"(?m)$").is_ok());

        assert!(DFA::new(r"(?Rm)^").is_ok());

        assert!(DFA::new(r"(?Rm)$").is_ok());

        // word boundaries

        if cfg!(feature = "unicode-word-boundary") {

            assert!(DFA::new(r"\b").is_ok());

            assert!(DFA::new(r"\B").is_ok());

        assert!(DFA::new(r"(?-u)\b").is_ok());

        assert!(DFA::new(r"(?-u)\B").is_ok());

    #[cfg(not(miri))] // takes too long on miri

    #[test]

    fn is_one_pass() {

        use crate::util::syntax;

        assert!(DFA::new(r"a*b").is_ok());

        if cfg!(feature = "unicode-perl") {

            assert!(DFA::new(r"\w").is_ok());

        assert!(DFA::new(r"(?-u)\w*\s").is_ok());

        assert!(DFA::new(r"(?s:.)*?").is_ok());

        assert!(DFA::builder()

            .syntax(syntax::Config::new().utf8(false))

            .build(r"(?s-u:.)*?")

            .is_ok());

    #[test]

    fn is_not_one_pass() {

        assert!(DFA::new(r"a*a").is_err());

        assert!(DFA::new(r"(?s-u:.)*?").is_err());

        assert!(DFA::new(r"(?s:.)*?a").is_err());

    #[cfg(not(miri))]

    #[test]

    fn is_not_one_pass_bigger() {

        assert!(DFA::new(r"\w*\s").is_err());