mozilla-central/third_party/rust/regex-automata/src/util/look.rs (file symbol)

/*!

Types and routines for working with look-around assertions.

This module principally defines three types:

* [`Look`] enumerates all of the assertions supported by this crate.

* [`LookSet`] provides a way to efficiently store a set of [`Look`] values.

* [`LookMatcher`] provides routines for checking whether a `Look` or a

`LookSet` matches at a particular position in a haystack.

*/

// LAMENTATION: Sadly, a lot of the API of `Look` and `LookSet` was basically

// copied verbatim from the regex-syntax crate. I would have no problems using

// the regex-syntax types and defining the matching routines (only found

// in this crate) as free functions, except the `Look` and `LookSet` types

// are used in lots of places. Including in places we expect to work when

// regex-syntax is *not* enabled, such as in the definition of the NFA itself.

//

// Thankfully the code we copy is pretty simple and there isn't much of it.

// Otherwise, the rest of this module deals with *matching* the assertions,

// which is not something that regex-syntax handles.

use crate::util::{escape::DebugByte, utf8};

/// A look-around assertion.

///

/// An assertion matches at a position between characters in a haystack.

/// Namely, it does not actually "consume" any input as most parts of a regular

/// expression do. Assertions are a way of stating that some property must be

/// true at a particular point during matching.

///

/// For example, `(?m)^[a-z]+$` is a pattern that:

///

/// * Scans the haystack for a position at which `(?m:^)` is satisfied. That

/// occurs at either the beginning of the haystack, or immediately following

/// a `\n` character.

/// * Looks for one or more occurrences of `[a-z]`.

/// * Once `[a-z]+` has matched as much as it can, an overall match is only

/// reported when `[a-z]+` stops just before a `\n`.

///

/// So in this case, `abc` and `\nabc\n` match, but `\nabc1\n` does not.

///

/// Assertions are also called "look-around," "look-behind" and "look-ahead."

/// Specifically, some assertions are look-behind (like `^`), other assertions

/// are look-ahead (like `$`) and yet other assertions are both look-ahead and

/// look-behind (like `\b`).

///

/// # Assertions in an NFA

///

/// An assertion in a [`thompson::NFA`](crate::nfa::thompson::NFA) can be

/// thought of as a conditional epsilon transition. That is, a matching engine

/// like the [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM) only permits

/// moving through conditional epsilon transitions when their condition

/// is satisfied at whatever position the `PikeVM` is currently at in the

/// haystack.

///

/// How assertions are handled in a `DFA` is trickier, since a DFA does not

/// have epsilon transitions at all. In this case, they are compiled into the

/// automaton itself, at the expense of more states than what would be required

/// without an assertion.

#[derive(Clone, Copy, Debug, Eq, PartialEq)]

pub enum Look {

    /// Match the beginning of text. Specifically, this matches at the starting

    /// position of the input.

    Start = 1 << 0,

    /// Match the end of text. Specifically, this matches at the ending

    /// position of the input.

    End = 1 << 1,

    /// Match the beginning of a line or the beginning of text. Specifically,

    /// this matches at the starting position of the input, or at the position

    /// immediately following a `\n` character.

    StartLF = 1 << 2,

    /// Match the end of a line or the end of text. Specifically, this matches

    /// at the end position of the input, or at the position immediately

    /// preceding a `\n` character.

    EndLF = 1 << 3,

    /// Match the beginning of a line or the beginning of text. Specifically,

    /// this matches at the starting position of the input, or at the position

    /// immediately following either a `\r` or `\n` character, but never after

    /// a `\r` when a `\n` follows.

    StartCRLF = 1 << 4,

    /// Match the end of a line or the end of text. Specifically, this matches

    /// at the end position of the input, or at the position immediately

    /// preceding a `\r` or `\n` character, but never before a `\n` when a `\r`

    /// precedes it.

    EndCRLF = 1 << 5,

    /// Match an ASCII-only word boundary. That is, this matches a position

    /// where the left adjacent character and right adjacent character

    /// correspond to a word and non-word or a non-word and word character.

    WordAscii = 1 << 6,

    /// Match an ASCII-only negation of a word boundary.

    WordAsciiNegate = 1 << 7,

    /// Match a Unicode-aware word boundary. That is, this matches a position

    /// where the left adjacent character and right adjacent character

    /// correspond to a word and non-word or a non-word and word character.

    WordUnicode = 1 << 8,

    /// Match a Unicode-aware negation of a word boundary.

    WordUnicodeNegate = 1 << 9,

impl Look {

    /// Flip the look-around assertion to its equivalent for reverse searches.

    /// For example, `StartLF` gets translated to `EndLF`.

///

    /// Some assertions, such as `WordUnicode`, remain the same since they

    /// match the same positions regardless of the direction of the search.

    #[inline]

    pub const fn reversed(self) -> Look {

        match self {

            Look::Start => Look::End,

            Look::End => Look::Start,

            Look::StartLF => Look::EndLF,

            Look::EndLF => Look::StartLF,

            Look::StartCRLF => Look::EndCRLF,

            Look::EndCRLF => Look::StartCRLF,

            Look::WordAscii => Look::WordAscii,

            Look::WordAsciiNegate => Look::WordAsciiNegate,

            Look::WordUnicode => Look::WordUnicode,

            Look::WordUnicodeNegate => Look::WordUnicodeNegate,
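// Illustrative sketch (not part of this crate): the reversal mapping above
// swaps each start anchor with its end counterpart and leaves word boundaries
// untouched, so applying it twice should give back the original assertion.
// `MiniLook` below is a hypothetical, trimmed-down stand-in for `Look` that
// exists only so the example is self-contained.

```rust
/// Hypothetical stand-in for `Look` with only six variants.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
enum MiniLook {
    Start,
    End,
    StartLF,
    EndLF,
    WordAscii,
    WordAsciiNegate,
}

impl MiniLook {
    /// Every variant, so that properties can be checked exhaustively.
    const ALL: [MiniLook; 6] = [
        MiniLook::Start,
        MiniLook::End,
        MiniLook::StartLF,
        MiniLook::EndLF,
        MiniLook::WordAscii,
        MiniLook::WordAsciiNegate,
    ];

    /// Same shape as the mapping above: anchors swap, word boundaries stay.
    const fn reversed(self) -> MiniLook {
        match self {
            MiniLook::Start => MiniLook::End,
            MiniLook::End => MiniLook::Start,
            MiniLook::StartLF => MiniLook::EndLF,
            MiniLook::EndLF => MiniLook::StartLF,
            MiniLook::WordAscii => MiniLook::WordAscii,
            MiniLook::WordAsciiNegate => MiniLook::WordAsciiNegate,
        }
    }
}

/// Returns true if reversing twice yields the original for every variant.
fn reversal_is_involution() -> bool {
    MiniLook::ALL.iter().all(|&look| look.reversed().reversed() == look)
}
```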

    /// Return the underlying representation of this look-around enumeration

    /// as an integer. Giving the return value to the [`Look::from_repr`]

    /// constructor is guaranteed to return the same look-around variant that

    /// one started with within a semver compatible release of this crate.

    #[inline]

    pub const fn as_repr(self) -> u16 {

        // AFAIK, 'as' is the only way to zero-cost convert an int enum to an

        // actual int.

        self as u16

    /// Given the underlying representation of a `Look` value, return the

    /// corresponding `Look` value if the representation is valid. Otherwise

    /// `None` is returned.

    #[inline]

    pub const fn from_repr(repr: u16) -> Option<Look> {

        match repr {

            0b00_0000_0001 => Some(Look::Start),

            0b00_0000_0010 => Some(Look::End),

            0b00_0000_0100 => Some(Look::StartLF),

            0b00_0000_1000 => Some(Look::EndLF),

            0b00_0001_0000 => Some(Look::StartCRLF),

            0b00_0010_0000 => Some(Look::EndCRLF),

            0b00_0100_0000 => Some(Look::WordAscii),

            0b00_1000_0000 => Some(Look::WordAsciiNegate),

            0b01_0000_0000 => Some(Look::WordUnicode),

            0b10_0000_0000 => Some(Look::WordUnicodeNegate),

            _ => None,
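// Illustrative sketch (not part of this crate): the round trip between an
// enum with one-hot discriminants and its integer representation. `ReprLook`
// is a hypothetical three-variant stand-in for `Look`; the real type has more
// variants but appears to follow the same pattern.

```rust
/// Hypothetical stand-in for `Look`; each variant occupies exactly one bit.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
enum ReprLook {
    Start = 1 << 0,
    End = 1 << 1,
    StartLF = 1 << 2,
}

impl ReprLook {
    /// Convert to the underlying integer with an `as` cast.
    const fn as_repr(self) -> u16 {
        self as u16
    }

    /// Accept only values with exactly one known bit set.
    const fn from_repr(repr: u16) -> Option<ReprLook> {
        match repr {
            0b001 => Some(ReprLook::Start),
            0b010 => Some(ReprLook::End),
            0b100 => Some(ReprLook::StartLF),
            _ => None,
        }
    }
}

/// Returns true if `from_repr(as_repr(look))` gives back `look`.
fn round_trips(look: ReprLook) -> bool {
    ReprLook::from_repr(look.as_repr()) == Some(look)
}
```

// Values that are zero, combine several bits, or use an unknown bit are
// rejected rather than mapped to some default variant.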

    /// Returns a convenient single codepoint representation of this

    /// look-around assertion. Each assertion is guaranteed to be represented

    /// by a distinct character.

///

    /// This is useful for succinctly representing a look-around assertion in

    /// human friendly but succinct output intended for a programmer working on

    /// regex internals.

    #[inline]

    pub const fn as_char(self) -> char {

        match self {

            Look::Start => 'A',

            Look::End => 'z',

            Look::StartLF => '^',

            Look::EndLF => '$',

            Look::StartCRLF => 'r',

            Look::EndCRLF => 'R',

            Look::WordAscii => 'b',

            Look::WordAsciiNegate => 'B',

            Look::WordUnicode => '𝛃',

            Look::WordUnicodeNegate => '𝚩',

/// LookSet is a memory-efficient set of look-around assertions.

///

/// This is useful for efficiently tracking look-around assertions. For

/// example, a [`thompson::NFA`](crate::nfa::thompson::NFA) provides properties

/// that return `LookSet`s.

#[derive(Clone, Copy, Default, Eq, PartialEq)]

pub struct LookSet {

    /// The underlying representation of this set is exposed to make it possible

    /// to store it somewhere efficiently. The representation is that

    /// of a bitset, where each assertion occupies the single bit that is set
    /// in the value returned by `Look::as_repr()`.

///

    /// Note that users of this internal representation must permit the full

    /// range of `u16` values to be represented. For example, even if the

    /// current implementation only makes use of the 10 least significant bits,

    /// it may use more bits in a future semver compatible release.

    pub bits: u16,

impl LookSet {

    /// Create an empty set of look-around assertions.

    #[inline]

    pub fn empty() -> LookSet {

        LookSet { bits: 0 }

    /// Create a full set of look-around assertions.

///

    /// This set contains all possible look-around assertions.

    #[inline]

    pub fn full() -> LookSet {

        LookSet { bits: !0 }

    /// Create a look-around set containing the look-around assertion given.

///

    /// This is a convenience routine for creating an empty set and inserting

    /// one look-around assertion.

    #[inline]

    pub fn singleton(look: Look) -> LookSet {

        LookSet::empty().insert(look)

    /// Returns the total number of look-around assertions in this set.

    #[inline]

    pub fn len(self) -> usize {

        // OK because max value always fits in a u8, which in turn always

        // fits in a usize, regardless of target.

        usize::try_from(self.bits.count_ones()).unwrap()

    /// Returns true if and only if this set is empty.

    #[inline]

    pub fn is_empty(self) -> bool {

        self.len() == 0

    /// Returns true if and only if the given look-around assertion is in this

    /// set.

    #[inline]

    pub fn contains(self, look: Look) -> bool {

        self.bits & look.as_repr() != 0

    /// Returns true if and only if this set contains any anchor assertions.

    /// This includes both "start/end of haystack" and "start/end of line."

    #[inline]

    pub fn contains_anchor(&self) -> bool {

        self.contains_anchor_haystack() || self.contains_anchor_line()

    /// Returns true if and only if this set contains any "start/end of

    /// haystack" anchors. This doesn't include "start/end of line" anchors.

    #[inline]

    pub fn contains_anchor_haystack(&self) -> bool {

        self.contains(Look::Start) || self.contains(Look::End)

    /// Returns true if and only if this set contains any "start/end of line"

    /// anchors. This doesn't include "start/end of haystack" anchors. This

    /// includes both `\n` line anchors and CRLF (`\r\n`) aware line anchors.

    #[inline]

    pub fn contains_anchor_line(&self) -> bool {

        self.contains(Look::StartLF)

            || self.contains(Look::EndLF)

            || self.contains(Look::StartCRLF)

            || self.contains(Look::EndCRLF)

    /// Returns true if and only if this set contains any "start/end of line"

    /// anchors that only treat `\n` as line terminators. This does not include

    /// haystack anchors or CRLF aware line anchors.

    #[inline]

    pub fn contains_anchor_lf(&self) -> bool {

        self.contains(Look::StartLF) || self.contains(Look::EndLF)

    /// Returns true if and only if this set contains any "start/end of line"

    /// anchors that are CRLF-aware. This doesn't include "start/end of

    /// haystack" or "start/end of line-feed" anchors.

    #[inline]

    pub fn contains_anchor_crlf(&self) -> bool {

        self.contains(Look::StartCRLF) || self.contains(Look::EndCRLF)

    /// Returns true if and only if this set contains any word boundary or

    /// negated word boundary assertions. This includes both Unicode and ASCII

    /// word boundaries.

    #[inline]

    pub fn contains_word(self) -> bool {

        self.contains_word_unicode() || self.contains_word_ascii()

    /// Returns true if and only if this set contains any Unicode word boundary

    /// or negated Unicode word boundary assertions.

    #[inline]

    pub fn contains_word_unicode(self) -> bool {

        self.contains(Look::WordUnicode)

            || self.contains(Look::WordUnicodeNegate)

    /// Returns true if and only if this set contains any ASCII word boundary

    /// or negated ASCII word boundary assertions.

    #[inline]

    pub fn contains_word_ascii(self) -> bool {

        self.contains(Look::WordAscii) || self.contains(Look::WordAsciiNegate)

    /// Returns an iterator over all of the look-around assertions in this set.

    #[inline]

    pub fn iter(self) -> LookSetIter {

        LookSetIter { set: self }

    /// Return a new set that is equivalent to the original, but with the given

    /// assertion added to it. If the assertion is already in the set, then the

    /// returned set is equivalent to the original.

    #[inline]

    pub fn insert(self, look: Look) -> LookSet {

        LookSet { bits: self.bits | look.as_repr() }

    /// Updates this set in place with the result of inserting the given

    /// assertion into this set.

    #[inline]

    pub fn set_insert(&mut self, look: Look) {

        *self = self.insert(look);

    /// Return a new set that is equivalent to the original, but with the given

    /// assertion removed from it. If the assertion is not in the set, then the

    /// returned set is equivalent to the original.

    #[inline]

    pub fn remove(self, look: Look) -> LookSet {

        LookSet { bits: self.bits & !look.as_repr() }

    /// Updates this set in place with the result of removing the given

    /// assertion from this set.

    #[inline]

    pub fn set_remove(&mut self, look: Look) {

        *self = self.remove(look);

    /// Returns a new set that is the result of subtracting the given set from

    /// this set.

    #[inline]

    pub fn subtract(self, other: LookSet) -> LookSet {

        LookSet { bits: self.bits & !other.bits }

    /// Updates this set in place with the result of subtracting the given set

    /// from this set.

    #[inline]

    pub fn set_subtract(&mut self, other: LookSet) {

        *self = self.subtract(other);

    /// Returns a new set that is the union of this and the one given.

    #[inline]

    pub fn union(self, other: LookSet) -> LookSet {

        LookSet { bits: self.bits | other.bits }

    /// Updates this set in place with the result of unioning it with the one

    /// given.

    #[inline]

    pub fn set_union(&mut self, other: LookSet) {

        *self = self.union(other);

    /// Returns a new set that is the intersection of this and the one given.

    #[inline]

    pub fn intersect(self, other: LookSet) -> LookSet {

        LookSet { bits: self.bits & other.bits }

    /// Updates this set in place with the result of intersecting it with the

    /// one given.

    #[inline]

    pub fn set_intersect(&mut self, other: LookSet) {

        *self = self.intersect(other);
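// Illustrative sketch (not part of this crate): the set operations above all
// reduce to single bitwise instructions on a `u16`. `MiniSet` and the mask
// constants are hypothetical names used only for this example; the masks are
// chosen to mirror the discriminants of `Look::Start`, `Look::End` and
// `Look::WordAscii`.

```rust
/// Hypothetical stand-in for `LookSet`, operating on raw bit masks.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
struct MiniSet {
    bits: u16,
}

const START: u16 = 1 << 0;
const END: u16 = 1 << 1;
const WORD_ASCII: u16 = 1 << 6;

impl MiniSet {
    fn insert(self, mask: u16) -> MiniSet {
        MiniSet { bits: self.bits | mask }
    }
    fn remove(self, mask: u16) -> MiniSet {
        MiniSet { bits: self.bits & !mask }
    }
    fn contains(self, mask: u16) -> bool {
        self.bits & mask != 0
    }
    fn union(self, other: MiniSet) -> MiniSet {
        MiniSet { bits: self.bits | other.bits }
    }
    fn intersect(self, other: MiniSet) -> MiniSet {
        MiniSet { bits: self.bits & other.bits }
    }
    fn subtract(self, other: MiniSet) -> MiniSet {
        MiniSet { bits: self.bits & !other.bits }
    }
    fn len(self) -> usize {
        // `count_ones` returns at most 16 here, so the cast is lossless.
        self.bits.count_ones() as usize
    }
}
```

// Because every operation takes and returns the set by value, the
// non-mutating and `set_*` variants above can share one implementation.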

    /// Return a `LookSet` from the slice given as a native endian 16-bit

    /// integer.

///

    /// # Panics

///

    /// This panics if `slice.len() < 2`.

    #[inline]

    pub fn read_repr(slice: &[u8]) -> LookSet {

        let bits = u16::from_ne_bytes(slice[..2].try_into().unwrap());

        LookSet { bits }

    /// Write a `LookSet` as a native endian 16-bit integer to the beginning

    /// of the slice given.

///

    /// # Panics

///

    /// This panics if `slice.len() < 2`.

    #[inline]

    pub fn write_repr(self, slice: &mut [u8]) {

        let raw = self.bits.to_ne_bytes();

        slice[0] = raw[0];

        slice[1] = raw[1];
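// Illustrative sketch (not part of this crate): writing a 16-bit set to the
// front of a byte buffer in native endian order and reading it back. The free
// functions below are hypothetical stand-ins for the two methods above and
// work on a bare `u16`. The read side builds the array by indexing instead of
// using `try_into`, which is an equivalent choice made only to keep the
// example independent of the Rust edition.

```rust
/// Write `bits` to the first two bytes of `slice` (panics if it is shorter).
fn write_repr(bits: u16, slice: &mut [u8]) {
    let raw = bits.to_ne_bytes();
    slice[0] = raw[0];
    slice[1] = raw[1];
}

/// Read a `u16` from the first two bytes of `slice` (panics if it is shorter).
fn read_repr(slice: &[u8]) -> u16 {
    u16::from_ne_bytes([slice[0], slice[1]])
}

/// Serialize into a scratch buffer that is larger than needed, then decode.
fn round_trip(bits: u16) -> u16 {
    let mut buf = [0xFFu8; 4];
    write_repr(bits, &mut buf);
    read_repr(&buf)
}
```

// Native endian output is only meaningful on the machine (or at least the
// endianness) that produced it, so the bytes are not a portable format.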

    /// Checks that all assertions in this set can be matched.

///

    /// Some assertions, such as Unicode word boundaries, require optional (but

    /// enabled by default) tables that may not be available. If there are

    /// assertions in this set that require tables that are not available, then

    /// this will return an error.

///

    /// Specifically, this returns an error when the

    /// `unicode-word-boundary` feature is _not_ enabled _and_ this set

    /// contains a Unicode word boundary assertion.

///

    /// It can be useful to use this on the result of

    /// [`NFA::look_set_any`](crate::nfa::thompson::NFA::look_set_any)

    /// when building a matcher engine to ensure methods like

    /// [`LookMatcher::matches_set`] do not panic at search time.

    pub fn available(self) -> Result<(), UnicodeWordBoundaryError> {

        if self.contains_word_unicode() {

            UnicodeWordBoundaryError::check()?;

        Ok(())

impl core::fmt::Debug for LookSet {

    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {

        if self.is_empty() {

            return write!(f, "∅");

        for look in self.iter() {

            write!(f, "{}", look.as_char())?;

        Ok(())

/// An iterator over all look-around assertions in a [`LookSet`].

///

/// This iterator is created by [`LookSet::iter`].

#[derive(Clone, Debug)]

pub struct LookSetIter {

    set: LookSet,

impl Iterator for LookSetIter {

    type Item = Look;

    #[inline]

    fn next(&mut self) -> Option<Look> {

        if self.set.is_empty() {

            return None;

        // We'll never have more than u8::MAX distinct look-around assertions,

        // so 'repr' will always fit into a u16.

        let repr = u16::try_from(self.set.bits.trailing_zeros()).unwrap();

        let look = Look::from_repr(1 << repr)?;

        self.set = self.set.remove(look);

        Some(look)
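// Illustrative sketch (not part of this crate): the iterator above repeatedly
// peels off the lowest set bit. `BitIter` is a hypothetical stand-in that
// yields raw one-bit masks instead of `Look` values, so it has no notion of an
// "unknown" bit, whereas the real iterator stops once `Look::from_repr`
// returns `None`.

```rust
/// Hypothetical iterator over the set bits of a `u16`, lowest bit first.
struct BitIter {
    bits: u16,
}

impl Iterator for BitIter {
    type Item = u16;

    fn next(&mut self) -> Option<u16> {
        if self.bits == 0 {
            return None;
        }
        // `trailing_zeros` is the index of the lowest set bit. It is always
        // below 16 here because `bits` is non-zero.
        let index = self.bits.trailing_zeros();
        let mask = 1u16 << index;
        // Clear the bit so that the next call moves on to the next one.
        self.bits &= !mask;
        Some(mask)
    }
}

/// Collect all one-bit masks contained in `bits`.
fn masks(bits: u16) -> Vec<u16> {
    BitIter { bits }.collect()
}
```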

/// A matcher for look-around assertions.

///

/// This matcher permits configuring aspects of how look-around assertions are

/// matched.

///

/// # Example

///

/// A `LookMatcher` can change the line terminator used for matching multi-line

/// anchors such as `(?m:^)` and `(?m:$)`.

///

/// ```

/// use regex_automata::{

///     nfa::thompson::{self, pikevm::PikeVM},

///     util::look::LookMatcher,

///     Match, Input,

/// };

///

/// let mut lookm = LookMatcher::new();

/// lookm.set_line_terminator(b'\x00');

///

/// let re = PikeVM::builder()

///     .thompson(thompson::Config::new().look_matcher(lookm))

///     .build(r"(?m)^[a-z]+$")?;

/// let mut cache = re.create_cache();

///

/// // Multi-line assertions now use NUL as a terminator.

/// assert_eq!(

///     Some(Match::must(0, 1..4)),

///     re.find(&mut cache, b"\x00abc\x00"),

/// );

/// // ... and \n is no longer recognized as a terminator.

/// assert_eq!(

///     None,

///     re.find(&mut cache, b"\nabc\n"),

/// );

///

/// # Ok::<(), Box<dyn std::error::Error>>(())

/// ```

#[derive(Clone, Debug)]

pub struct LookMatcher {

    lineterm: DebugByte,

impl LookMatcher {

    /// Creates a new default matcher for look-around assertions.

    pub fn new() -> LookMatcher {

        LookMatcher { lineterm: DebugByte(b'\n') }

    /// Sets the line terminator for use with `(?m:^)` and `(?m:$)`.

///

    /// Namely, instead of `^` matching after `\n` and `$` matching immediately

    /// before a `\n`, this will cause it to match after and before the byte

    /// given.

///

    /// It can occasionally be useful to use this to configure the line

    /// terminator to the NUL byte when searching binary data.

///

    /// Note that this does not apply to CRLF-aware line anchors such as

    /// `(?Rm:^)` and `(?Rm:$)`. CRLF-aware line anchors are hard-coded to

    /// use `\r` and `\n`.

    pub fn set_line_terminator(&mut self, byte: u8) -> &mut LookMatcher {

        self.lineterm.0 = byte;

        self

    /// Returns the line terminator that was configured for this matcher.

///

    /// If no line terminator was configured, then this returns `\n`.

///

    /// Note that the line terminator should only be used for matching `(?m:^)`

    /// and `(?m:$)` assertions. It specifically should _not_ be used for

    /// matching the CRLF aware assertions `(?Rm:^)` and `(?Rm:$)`.

    pub fn get_line_terminator(&self) -> u8 {

        self.lineterm.0

    /// Returns true when the position `at` in `haystack` satisfies the given

    /// look-around assertion.

///

    /// # Panics

///

    /// This panics when testing a Unicode word boundary assertion while the
    /// Unicode word data is not available. Specifically, this

    /// only occurs when the `unicode-word-boundary` feature is not enabled.

///

    /// Since it's generally expected that this routine is called inside of

    /// a matching engine, callers should check the error condition when

    /// building the matching engine. If there is a Unicode word boundary

    /// in the matcher and the data isn't available, then the matcher should

    /// fail to build.

///

    /// Callers can check the error condition with [`LookSet::available`].

///

    /// This also may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn matches(&self, look: Look, haystack: &[u8], at: usize) -> bool {

        self.matches_inline(look, haystack, at)

    /// Like `matches`, but forcefully inlined.

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(crate) fn matches_inline(

        &self,

        look: Look,

        haystack: &[u8],

        at: usize,

    ) -> bool {

        match look {

            Look::Start => self.is_start(haystack, at),

            Look::End => self.is_end(haystack, at),

            Look::StartLF => self.is_start_lf(haystack, at),

            Look::EndLF => self.is_end_lf(haystack, at),

            Look::StartCRLF => self.is_start_crlf(haystack, at),

            Look::EndCRLF => self.is_end_crlf(haystack, at),

            Look::WordAscii => self.is_word_ascii(haystack, at),

            Look::WordAsciiNegate => self.is_word_ascii_negate(haystack, at),

            Look::WordUnicode => self.is_word_unicode(haystack, at).unwrap(),

            Look::WordUnicodeNegate => {

                self.is_word_unicode_negate(haystack, at).unwrap()

    /// Returns true when _all_ of the assertions in the given set match at the

    /// given position in the haystack.

///

    /// # Panics

///

    /// This panics when testing any Unicode word boundary assertion in this

    /// set and when the Unicode word data is not available. Specifically, this

    /// only occurs when the `unicode-word-boundary` feature is not enabled.

///

    /// Since it's generally expected that this routine is called inside of

    /// a matching engine, callers should check the error condition when

    /// building the matching engine. If there is a Unicode word boundary

    /// in the matcher and the data isn't available, then the matcher should

    /// fail to build.

///

    /// Callers can check the error condition with [`LookSet::available`].

///

    /// This also may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn matches_set(

        &self,

        set: LookSet,

        haystack: &[u8],

        at: usize,

    ) -> bool {

        self.matches_set_inline(set, haystack, at)

    /// Like `matches_set`, but forcefully inlined for perf.

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(crate) fn matches_set_inline(

        &self,

        set: LookSet,

        haystack: &[u8],

        at: usize,

    ) -> bool {

        // This used to use LookSet::iter with Look::matches on each element,
        // but that proved to be quite disastrous for perf. The manual "if

        // the set has this assertion, check it" turns out to be quite a bit

        // faster.

        if set.contains(Look::Start) {

            if !self.is_start(haystack, at) {

                return false;

        if set.contains(Look::End) {

            if !self.is_end(haystack, at) {

                return false;

        if set.contains(Look::StartLF) {

            if !self.is_start_lf(haystack, at) {

                return false;

        if set.contains(Look::EndLF) {

            if !self.is_end_lf(haystack, at) {

                return false;

        if set.contains(Look::StartCRLF) {

            if !self.is_start_crlf(haystack, at) {

                return false;

        if set.contains(Look::EndCRLF) {

            if !self.is_end_crlf(haystack, at) {

                return false;

        if set.contains(Look::WordAscii) {

            if !self.is_word_ascii(haystack, at) {

                return false;

        if set.contains(Look::WordAsciiNegate) {

            if !self.is_word_ascii_negate(haystack, at) {

                return false;

        if set.contains(Look::WordUnicode) {

            if !self.is_word_unicode(haystack, at).unwrap() {

                return false;

        if set.contains(Look::WordUnicodeNegate) {

            if !self.is_word_unicode_negate(haystack, at).unwrap() {

                return false;

        true

    /// Split up the given byte classes into equivalence classes in a way that

    /// is consistent with this look-around assertion.

    #[cfg(feature = "alloc")]

    pub(crate) fn add_to_byteset(

        &self,

        look: Look,

        set: &mut crate::util::alphabet::ByteClassSet,

) {

        match look {

            Look::Start | Look::End => {}

            Look::StartLF | Look::EndLF => {

                set.set_range(self.lineterm.0, self.lineterm.0);

            Look::StartCRLF | Look::EndCRLF => {

                set.set_range(b'\r', b'\r');

                set.set_range(b'\n', b'\n');

            Look::WordAscii

            | Look::WordAsciiNegate

            | Look::WordUnicode

            | Look::WordUnicodeNegate => {

                // We need to mark all ranges of bytes whose pairs result in

                // evaluating \b differently. This isn't technically correct

                // for Unicode word boundaries, but DFAs can't handle those

                // anyway, and thus, the byte classes don't need to either

                // since they are themselves only used in DFAs.

//

                // FIXME: It seems like the calls to 'set_range' here are

                // completely invariant, which means we could just hard-code

                // them here without needing to write a loop. And we only need

                // to do this dance at most once per regex.

//

                // FIXME: Is this correct for \B?

                let iswb = utf8::is_word_byte;

                // This unwrap is OK because we guard every use of 'asu8' with

                // a check that the input is <= 255.

                let asu8 = |b: u16| u8::try_from(b).unwrap();

                let mut b1: u16 = 0;

                let mut b2: u16;

                while b1 <= 255 {

                    b2 = b1 + 1;

                    while b2 <= 255 && iswb(asu8(b1)) == iswb(asu8(b2)) {

                        b2 += 1;

                    // The guards above guarantee that b2 can never get any

                    // bigger.

                    assert!(b2 <= 256);

                    // Subtracting 1 from b2 is always OK because it is always

                    // at least 1 greater than b1, and the assert above

                    // guarantees that the asu8 conversion will succeed.

                    set.set_range(asu8(b1), asu8(b2.checked_sub(1).unwrap()));

                    b1 = b2;
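// Illustrative sketch (not part of this crate): the loop above partitions the
// byte values 0..=255 into maximal runs that agree on "is a word byte". The
// version below collects those runs into a `Vec` instead of feeding them to a
// `ByteClassSet`. The local `is_word_byte` is an assumption about what
// `utf8::is_word_byte` accepts, namely `[0-9A-Za-z_]`.

```rust
/// Assumed definition of an ASCII word byte: `[0-9A-Za-z_]`.
fn is_word_byte(b: u8) -> bool {
    b == b'_' || b.is_ascii_alphanumeric()
}

/// Return the inclusive ranges of bytes that share the same word-ness.
fn word_byte_ranges() -> Vec<(u8, u8)> {
    let mut ranges = Vec::new();
    // `u16` counters let the loop reach 256 without overflowing a `u8`.
    let mut b1: u16 = 0;
    while b1 <= 255 {
        let mut b2 = b1 + 1;
        // Casts are lossless: both values are at most 255 when evaluated.
        while b2 <= 255 && is_word_byte(b1 as u8) == is_word_byte(b2 as u8) {
            b2 += 1;
        }
        // `b2` is one past the end of the run, so `b2 - 1` is at most 255.
        ranges.push((b1 as u8, (b2 - 1) as u8));
        b1 = b2;
    }
    ranges
}
```

// Under the stated assumption this yields nine runs, which supports the FIXME
// above: the result never depends on the regex and could be hard-coded.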

    /// Returns true when [`Look::Start`] is satisfied `at` the given position

    /// in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_start(&self, _haystack: &[u8], at: usize) -> bool {

        at == 0

    /// Returns true when [`Look::End`] is satisfied `at` the given position in

    /// `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_end(&self, haystack: &[u8], at: usize) -> bool {

        at == haystack.len()

    /// Returns true when [`Look::StartLF`] is satisfied `at` the given

    /// position in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_start_lf(&self, haystack: &[u8], at: usize) -> bool {

        self.is_start(haystack, at) || haystack[at - 1] == self.lineterm.0

    /// Returns true when [`Look::EndLF`] is satisfied `at` the given position

    /// in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_end_lf(&self, haystack: &[u8], at: usize) -> bool {

        self.is_end(haystack, at) || haystack[at] == self.lineterm.0
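// Illustrative sketch (not part of this crate): how the two line anchor checks
// above behave with a configurable terminator. The free functions take the
// terminator as a parameter instead of reading it from a `LookMatcher`, and
// `line_starts` / `line_ends` are hypothetical helpers added for this example.

```rust
/// Start of haystack, or immediately after the terminator byte.
fn is_start_lf(haystack: &[u8], at: usize, lineterm: u8) -> bool {
    at == 0 || haystack[at - 1] == lineterm
}

/// End of haystack, or immediately before the terminator byte.
fn is_end_lf(haystack: &[u8], at: usize, lineterm: u8) -> bool {
    at == haystack.len() || haystack[at] == lineterm
}

/// All positions (including `haystack.len()`) where a line may start.
fn line_starts(haystack: &[u8], lineterm: u8) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| is_start_lf(haystack, at, lineterm))
        .collect()
}

/// All positions (including `haystack.len()`) where a line may end.
fn line_ends(haystack: &[u8], lineterm: u8) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| is_end_lf(haystack, at, lineterm))
        .collect()
}
```

// The short-circuiting `||` is what keeps `haystack[at - 1]` and
// `haystack[at]` in bounds at the two ends of the haystack.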

    /// Returns true when [`Look::StartCRLF`] is satisfied `at` the given

    /// position in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_start_crlf(&self, haystack: &[u8], at: usize) -> bool {

        self.is_start(haystack, at)

            || haystack[at - 1] == b'\n'

            || (haystack[at - 1] == b'\r'

                && (at >= haystack.len() || haystack[at] != b'\n'))

    /// Returns true when [`Look::EndCRLF`] is satisfied `at` the given

    /// position in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_end_crlf(&self, haystack: &[u8], at: usize) -> bool {

        self.is_end(haystack, at)

            || haystack[at] == b'\r'

            || (haystack[at] == b'\n'

                && (at == 0 || haystack[at - 1] != b'\r'))
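// Illustrative sketch (not part of this crate): the CRLF-aware checks above,
// copied into free functions so their behaviour around `\r\n` can be tried in
// isolation. The key property is that neither anchor matches at the position
// between `\r` and `\n`.

```rust
/// Start of haystack, after `\n`, or after a `\r` that is not followed by `\n`.
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        || haystack[at - 1] == b'\n'
        || (haystack[at - 1] == b'\r'
            && (at >= haystack.len() || haystack[at] != b'\n'))
}

/// End of haystack, before `\r`, or before a `\n` that is not preceded by `\r`.
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        || haystack[at] == b'\r'
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}
```

// For the haystack `a\r\nb`, position 2 sits between `\r` and `\n`: both
// functions return false there, while position 1 is a line end and position 3
// is a line start. A lone `\r` or a lone `\n` still acts as a terminator.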

    /// Returns true when [`Look::WordAscii`] is satisfied `at` the given

    /// position in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_word_ascii(&self, haystack: &[u8], at: usize) -> bool {

        let word_before = at > 0 && utf8::is_word_byte(haystack[at - 1]);

        let word_after =

            at < haystack.len() && utf8::is_word_byte(haystack[at]);

        word_before != word_after

    /// Returns true when [`Look::WordAsciiNegate`] is satisfied `at` the given

    /// position in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

    #[inline]

    pub fn is_word_ascii_negate(&self, haystack: &[u8], at: usize) -> bool {

        !self.is_word_ascii(haystack, at)
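// Illustrative sketch (not part of this crate): the ASCII word boundary check
// above as a free function, plus a hypothetical `word_boundaries` helper that
// lists every matching position. The local `is_word_byte` is an assumption
// about what `utf8::is_word_byte` accepts, namely `[0-9A-Za-z_]`.

```rust
/// Assumed definition of an ASCII word byte: `[0-9A-Za-z_]`.
fn is_word_byte(b: u8) -> bool {
    b == b'_' || b.is_ascii_alphanumeric()
}

/// True when exactly one of the two adjacent bytes is a word byte. Positions
/// outside the haystack count as non-word.
fn is_word_ascii(haystack: &[u8], at: usize) -> bool {
    let word_before = at > 0 && is_word_byte(haystack[at - 1]);
    let word_after = at < haystack.len() && is_word_byte(haystack[at]);
    word_before != word_after
}

/// All positions (including `haystack.len()`) where `\b` would match.
fn word_boundaries(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| is_word_ascii(haystack, at))
        .collect()
}
```

// The negated assertion is the plain logical complement, so on an empty
// haystack `\b` matches nowhere and `\B` matches at position 0.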

    /// Returns true when [`Look::WordUnicode`] is satisfied `at` the given

    /// position in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

///

    /// # Errors

///

    /// This returns an error when Unicode word boundary tables

    /// are not available. Specifically, this only occurs when the

    /// `unicode-word-boundary` feature is not enabled.

    #[inline]

    pub fn is_word_unicode(

        &self,

        haystack: &[u8],

        at: usize,

    ) -> Result<bool, UnicodeWordBoundaryError> {

        let word_before = is_word_char::rev(haystack, at)?;

        let word_after = is_word_char::fwd(haystack, at)?;

        Ok(word_before != word_after)

    }

    /// Returns true when [`Look::WordUnicodeNegate`] is satisfied `at` the

    /// given position in `haystack`.

///

    /// # Panics

///

    /// This may panic when `at > haystack.len()`. Note that `at ==

    /// haystack.len()` is legal and guaranteed not to panic.

///

    /// # Errors

///

    /// This returns an error when Unicode word boundary tables

    /// are not available. Specifically, this only occurs when the

    /// `unicode-word-boundary` feature is not enabled.

    #[inline]

    pub fn is_word_unicode_negate(

        &self,

        haystack: &[u8],

        at: usize,

    ) -> Result<bool, UnicodeWordBoundaryError> {

        // This is pretty subtle. Why do we need to do UTF-8 decoding here?

        // Well... at time of writing, the is_word_char_{fwd,rev} routines will

        // only return true if there is a valid UTF-8 encoding of a "word"

        // codepoint, and false in every other case (including invalid UTF-8).

        // This means that in regions of invalid UTF-8 (which might be a

        // subset of valid UTF-8!), it would result in \B matching. While this

        // would be questionable in the context of truly invalid UTF-8, it is

        // *certainly* wrong to report match boundaries that split the encoding

        // of a codepoint. So to work around this, we ensure that we can decode

        // a codepoint on either side of `at`. If either direction fails, then

        // we don't permit \B to match at all.

//

        // Now, this isn't exactly optimal from a perf perspective. We could

        // try and detect this in is_word_char::{fwd,rev}, but it's not clear

        // if it's worth it. \B is, after all, rarely used. Even worse,

        // is_word_char::{fwd,rev} could do its own UTF-8 decoding, and so this

        // will wind up doing UTF-8 decoding twice. Owch. We could fix this

        // with more code complexity, but it just doesn't feel worth it for \B.

//

        // And in particular, we do *not* have to do this with \b, because \b

        // *requires* that at least one side of `at` be a "word" codepoint,

        // which in turn implies one side of `at` must be valid UTF-8. This in

        // turn implies that \b can never split a valid UTF-8 encoding of a

        // codepoint. In the case where one side of `at` is truly invalid UTF-8

        // and the other side IS a word codepoint, then we want \b to match

        // since it represents a valid UTF-8 boundary. It also makes sense. For

        // example, you'd want \b\w+\b to match 'abc' in '\xFFabc\xFF'.

//

        // Note also that this is not just '!is_word_unicode(..)' like it is

        // for the ASCII case. For example, neither \b nor \B is satisfied

        // within invalid UTF-8 sequences.

        let word_before = at > 0

            && match utf8::decode_last(&haystack[..at]) {

                None | Some(Err(_)) => return Ok(false),

                Some(Ok(_)) => is_word_char::rev(haystack, at)?,

            };

        let word_after = at < haystack.len()

            && match utf8::decode(&haystack[at..]) {

                None | Some(Err(_)) => return Ok(false),

                Some(Ok(_)) => is_word_char::fwd(haystack, at)?,

            };

        Ok(word_before == word_after)

    }

}
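// The reasoning in `is_word_unicode_negate` can be sketched in a standalone
// form. Everything below is a hypothetical stand-in: `decode` and
// `decode_last` approximate this crate's `utf8` helpers using only `std`, and
// `is_word_char` approximates Unicode `\w` with `char::is_alphanumeric` plus
// `_`, which is NOT the exact `\w` definition. The point is only to show that
// `\B` is not simply the negation of `\b` inside invalid UTF-8.

```rust
/// Decodes the first codepoint. `None` = empty, `Err` = invalid UTF-8.
fn decode(bytes: &[u8]) -> Option<Result<char, u8>> {
    let first = *bytes.first()?;
    let len = match first {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => return Some(Err(first)),
    };
    if bytes.len() < len {
        return Some(Err(first));
    }
    match std::str::from_utf8(&bytes[..len]) {
        Ok(s) => Some(Ok(s.chars().next().unwrap())),
        Err(_) => Some(Err(first)),
    }
}

/// Decodes the last codepoint by backing up over continuation bytes.
fn decode_last(bytes: &[u8]) -> Option<Result<char, u8>> {
    let last = *bytes.last()?;
    let limit = bytes.len().saturating_sub(4);
    let mut start = bytes.len() - 1;
    while start > limit && (bytes[start] & 0xC0) == 0x80 {
        start -= 1;
    }
    match decode(&bytes[start..]) {
        Some(Ok(ch)) if start + ch.len_utf8() == bytes.len() => Some(Ok(ch)),
        _ => Some(Err(last)),
    }
}

/// Approximation of Unicode `\w` (an assumption made for this sketch).
fn is_word_char(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_'
}

/// `\b`: invalid UTF-8 on one side simply counts as "not a word".
fn is_word_unicode(h: &[u8], at: usize) -> bool {
    let before = matches!(decode_last(&h[..at]), Some(Ok(c)) if is_word_char(c));
    let after = matches!(decode(&h[at..]), Some(Ok(c)) if is_word_char(c));
    before != after
}

/// `\B`: refuse to match when either side fails to decode.
fn is_word_unicode_negate(h: &[u8], at: usize) -> bool {
    let before = at > 0
        && match decode_last(&h[..at]) {
            None | Some(Err(_)) => return false,
            Some(Ok(c)) => is_word_char(c),
        };
    let after = at < h.len()
        && match decode(&h[at..]) {
            None | Some(Err(_)) => return false,
            Some(Ok(c)) => is_word_char(c),
        };
    before == after
}

fn main() {
    // \b still matches next to truly invalid bytes.
    let hay = b"\xFFabc\xFF";
    assert!(is_word_unicode(hay, 1));
    assert!(is_word_unicode(hay, 4));
    // Inside the two-byte encoding of 'é', neither \b nor \B matches.
    let e = "é".as_bytes();
    assert!(!is_word_unicode(e, 1));
    assert!(!is_word_unicode_negate(e, 1));
    // Ordinary cases behave like plain negation.
    assert!(is_word_unicode_negate(b"ab", 1));
    assert!(is_word_unicode_negate(b"", 0));
}
```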

impl Default for LookMatcher {

    fn default() -> LookMatcher {

        LookMatcher::new()

    }

}

/// An error that occurs when the Unicode-aware `\w` class is unavailable.

///

/// This error can occur when the data tables necessary for the Unicode aware

/// Perl character class `\w` are unavailable. The `\w` class is used to

/// determine whether a codepoint is considered a word character or not when

/// determining whether a Unicode aware `\b` (or `\B`) matches at a particular

/// position.

///

/// This error can only occur when the `unicode-word-boundary` feature is

/// disabled.

#[derive(Clone, Debug)]

pub struct UnicodeWordBoundaryError(());

impl UnicodeWordBoundaryError {

    #[cfg(not(feature = "unicode-word-boundary"))]

    pub(crate) fn new() -> UnicodeWordBoundaryError {

        UnicodeWordBoundaryError(())

    }

    /// Returns an error if and only if Unicode word boundary data is

    /// unavailable.

    pub fn check() -> Result<(), UnicodeWordBoundaryError> {

        is_word_char::check()

    }

}

#[cfg(feature = "std")]

impl std::error::Error for UnicodeWordBoundaryError {}

impl core::fmt::Display for UnicodeWordBoundaryError {

    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {

        write!(

            f,

            "Unicode-aware \\b and \\B are unavailable because the \

             requisite data tables are missing, please enable the \

             unicode-word-boundary feature"

        )

    }

}

// Below are FOUR different ways for checking whether a "word"

// codepoint exists at a particular position in the haystack. The four

// different approaches are, in order of preference:

//

// 1. Parse '\w', convert to an NFA, convert to a fully compiled DFA on the

// first call, and then use that DFA for all subsequent calls.

// 2. Do UTF-8 decoding and use regex_syntax::is_word_character if available.

// 3. Do UTF-8 decoding and use our own 'perl_word' table.

// 4. Return an error.

//

// The reason for all of these approaches is a combination of perf and

// permitting one to build regex-automata without the Unicode data necessary

// for handling Unicode-aware word boundaries. (In which case, '(?-u:\b)' would

// still work.)

//

// The DFA approach is the fastest, but it requires the regex parser, the

// NFA compiler, the DFA builder and the DFA search runtime. That's a lot to

// bring in, but if it's available, it's (probably) the best we can do.

//

// Approaches (2) and (3) are effectively equivalent, but (2) reuses the

// data in regex-syntax and avoids duplicating it in regex-automata.

//

// Finally, (4) unconditionally returns an error since the requisite data isn't

// available anywhere.

//

// There are actually more approaches possible that we didn't implement. For

// example, if the DFA builder is available but the syntax parser is not, we

// could technically hand construct our own NFA from the 'perl_word' data

// table. But to avoid some pretty hairy code duplication, we would in turn

// need to pull the UTF-8 compiler out of the NFA compiler. Yikes.

//

// A possibly more sensible alternative is to use a lazy DFA when the full

// DFA builder isn't available...

//

// Yet another choice would be to build the full DFA and then embed it into the

// source. Then we'd only need to bring in the DFA search runtime, which is

// considerably smaller than the DFA builder code. The problem here is that the

// Debian people have spooked me[1] into avoiding cyclic dependencies. Namely,

// we'd need to build regex-cli, which depends on regex-automata in order to

// build some part of regex-automata. But to be honest, something like this has

// to be allowed somehow? I just don't know what the right process is.

//

// There are perhaps other choices as well. Why did I stop at these 4? Because

// I wanted to preserve my sanity. I suspect I'll wind up adding the lazy DFA

// approach eventually, as the benefits of the DFA approach are somewhat

// compelling. The 'boundary-words-holmes' benchmark tests this:

//

//   $ regex-cli bench measure -f boundary-words-holmes -e pikevm > dfa.csv

//

// Then I changed the code below so that the util/unicode_data/perl_word table

// was used and re-ran the benchmark:

//

//   $ regex-cli bench measure -f boundary-words-holmes -e pikevm > table.csv

//

// And compared them:

//

//   $ regex-cli bench diff dfa.csv table.csv

//   benchmark                             engine                 dfa        table

//   ---------                             ------                 ---        -----

//   internal/count/boundary-words-holmes  regex/automata/pikevm  18.6 MB/s  12.9 MB/s

//

// Which is a nice improvement.

//

// UPDATE: It turns out that it takes approximately 22ms to build the reverse

// DFA for \w. (And about 3ms for the forward DFA.) It's probably not much in

// the grand scheme of things, but that is a significant latency cost. So I'm not

// sure that's a good idea. I then tried using a lazy DFA instead, and that

// eliminated the overhead, but since the lazy DFA requires mutable working

// memory, that requires introducing a 'Cache' for every simultaneous call.

//

// I ended up deciding for now to just keep the "UTF-8 decode and check the

// table" approach. The DFA and lazy DFA approaches are still below, but

// commented out.

//

// [1]: https://github.com/BurntSushi/ucd-generate/issues/11
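
// The "UTF-8 decode and check the table" approach boils down to a binary
// search over a sorted table of inclusive codepoint ranges. A minimal sketch
// follows. `WORD_RANGES` is a tiny hypothetical table invented for this
// example (the real 'perl_word' table is far larger), and `std` is used in
// place of `core` so the block runs on its own.

```rust
use std::cmp::Ordering;

/// Hypothetical, heavily truncated stand-in for a `\w` range table. The
/// ranges must be sorted and non-overlapping for binary search to work.
const WORD_RANGES: &[(char, char)] = &[
    ('0', '9'),
    ('A', 'Z'),
    ('_', '_'),
    ('a', 'z'),
    ('\u{3B1}', '\u{3C9}'), // Greek small letters alpha through omega
];

/// Returns true if `c` falls inside any range of the table.
fn is_word_character(c: char) -> bool {
    WORD_RANGES
        .binary_search_by(|&(start, end)| {
            if start <= c && c <= end {
                // `c` is inside this range: report a hit.
                Ordering::Equal
            } else if start > c {
                // This range lies entirely above `c`: search lower ranges.
                Ordering::Greater
            } else {
                // This range lies entirely below `c`: search higher ranges.
                Ordering::Less
            }
        })
        .is_ok()
}

fn main() {
    assert!(is_word_character('q'));
    assert!(is_word_character('_'));
    assert!(is_word_character('β'));
    assert!(!is_word_character(' '));
    assert!(!is_word_character('-'));
}
```

// Each lookup costs O(log n) comparisons and needs no mutable state, which is
// why this approach avoids both the DFA build latency and the per-call
// 'Cache' described above.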

/*

/// A module that looks for word codepoints using lazy DFAs.

#[cfg(all(

    feature = "unicode-word-boundary",

    feature = "syntax",

    feature = "unicode-perl",

    feature = "hybrid"

))]

mod is_word_char {

    use alloc::vec::Vec;

    use crate::{

        hybrid::dfa::{Cache, DFA},

        nfa::thompson::NFA,

        util::{lazy::Lazy, pool::Pool, primitives::StateID},

        Anchored, Input,

};

    pub(super) fn check() -> Result<(), super::UnicodeWordBoundaryError> {

        Ok(())

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn fwd(

        haystack: &[u8],

        mut at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        static WORD: Lazy<DFA> = Lazy::new(|| DFA::new(r"\w").unwrap());

        static CACHE: Lazy<Pool<Cache>> =

            Lazy::new(|| Pool::new(|| WORD.create_cache()));

        let dfa = Lazy::get(&WORD);

        let mut cache = Lazy::get(&CACHE).get();

        let mut sid = dfa

            .start_state_forward(

                &mut cache,

                &Input::new("").anchored(Anchored::Yes),

            .unwrap();

        while at < haystack.len() {

            let byte = haystack[at];

            sid = dfa.next_state(&mut cache, sid, byte).unwrap();

            at += 1;

            if sid.is_tagged() {

                if sid.is_match() {

                    return Ok(true);

                } else if sid.is_dead() {

                    return Ok(false);

        Ok(dfa.next_eoi_state(&mut cache, sid).unwrap().is_match())

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn rev(

        haystack: &[u8],

        mut at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        static WORD: Lazy<DFA> = Lazy::new(|| {

            DFA::builder()

                .thompson(NFA::config().reverse(true))

                .build(r"\w")

                .unwrap()

});

        static CACHE: Lazy<Pool<Cache>> =

            Lazy::new(|| Pool::new(|| WORD.create_cache()));

        let dfa = Lazy::get(&WORD);

        let mut cache = Lazy::get(&CACHE).get();

        let mut sid = dfa

            .start_state_reverse(

                &mut cache,

                &Input::new("").anchored(Anchored::Yes),

            .unwrap();

        while at > 0 {

            at -= 1;

            let byte = haystack[at];

            sid = dfa.next_state(&mut cache, sid, byte).unwrap();

            if sid.is_tagged() {

                if sid.is_match() {

                    return Ok(true);

                } else if sid.is_dead() {

                    return Ok(false);

        Ok(dfa.next_eoi_state(&mut cache, sid).unwrap().is_match())

*/

/*

/// A module that looks for word codepoints using fully compiled DFAs.

#[cfg(all(

    feature = "unicode-word-boundary",

    feature = "syntax",

    feature = "unicode-perl",

    feature = "dfa-build"

))]

mod is_word_char {

    use alloc::vec::Vec;

    use crate::{

        dfa::{dense::DFA, Automaton, StartKind},

        nfa::thompson::NFA,

        util::{lazy::Lazy, primitives::StateID},

        Anchored, Input,

};

    pub(super) fn check() -> Result<(), super::UnicodeWordBoundaryError> {

        Ok(())

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn fwd(

        haystack: &[u8],

        mut at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        static WORD: Lazy<(DFA<Vec<u32>>, StateID)> = Lazy::new(|| {

            let dfa = DFA::builder()

                .configure(DFA::config().start_kind(StartKind::Anchored))

                .build(r"\w")

                .unwrap();

            // OK because our regex has no look-around.

            let start_id = dfa.universal_start_state(Anchored::Yes).unwrap();

            (dfa, start_id)

});

        let &(ref dfa, mut sid) = Lazy::get(&WORD);

        while at < haystack.len() {

            let byte = haystack[at];

            sid = dfa.next_state(sid, byte);

            at += 1;

            if dfa.is_special_state(sid) {

                if dfa.is_match_state(sid) {

                    return Ok(true);

                } else if dfa.is_dead_state(sid) {

                    return Ok(false);

        Ok(dfa.is_match_state(dfa.next_eoi_state(sid)))

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn rev(

        haystack: &[u8],

        mut at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        static WORD: Lazy<(DFA<Vec<u32>>, StateID)> = Lazy::new(|| {

            let dfa = DFA::builder()

                .configure(DFA::config().start_kind(StartKind::Anchored))

                // From ad hoc measurements, it looks like setting

                // shrink==false is slightly faster than shrink==true. I kind

                // of feel like this indicates that shrinking is probably a

                // failure, although it can help in some cases. Sigh.

                .thompson(NFA::config().reverse(true).shrink(false))

                .build(r"\w")

                .unwrap();

            // OK because our regex has no look-around.

            let start_id = dfa.universal_start_state(Anchored::Yes).unwrap();

            (dfa, start_id)

});

        let &(ref dfa, mut sid) = Lazy::get(&WORD);

        while at > 0 {

            at -= 1;

            let byte = haystack[at];

            sid = dfa.next_state(sid, byte);

            if dfa.is_special_state(sid) {

                if dfa.is_match_state(sid) {

                    return Ok(true);

                } else if dfa.is_dead_state(sid) {

                    return Ok(false);

        Ok(dfa.is_match_state(dfa.next_eoi_state(sid)))

*/

/// A module that looks for word codepoints using regex-syntax's data tables.

#[cfg(all(

    feature = "unicode-word-boundary",

    feature = "syntax",

    feature = "unicode-perl",

))]

mod is_word_char {

    use regex_syntax::try_is_word_character;

    use crate::util::utf8;

    pub(super) fn check() -> Result<(), super::UnicodeWordBoundaryError> {

        Ok(())

    }

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn fwd(

        haystack: &[u8],

        at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        Ok(match utf8::decode(&haystack[at..]) {

            None | Some(Err(_)) => false,

            Some(Ok(ch)) => try_is_word_character(ch).expect(

                "since unicode-word-boundary, syntax and unicode-perl \

                 are all enabled, it is expected that \

                 try_is_word_character succeeds",

            ),

        })

    }

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn rev(

        haystack: &[u8],

        at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        Ok(match utf8::decode_last(&haystack[..at]) {

            None | Some(Err(_)) => false,

            Some(Ok(ch)) => try_is_word_character(ch).expect(

                "since unicode-word-boundary, syntax and unicode-perl \

                 are all enabled, it is expected that \

                 try_is_word_character succeeds",

            ),

        })

    }

}

/// A module that looks for word codepoints using regex-automata's data tables

/// (which are only compiled when regex-syntax's tables aren't available).

///

/// Note that the cfg should match the one in src/util/unicode_data/mod.rs for

/// perl_word.

#[cfg(all(

    feature = "unicode-word-boundary",

    not(all(feature = "syntax", feature = "unicode-perl")),

))]

mod is_word_char {

    use crate::util::utf8;

    pub(super) fn check() -> Result<(), super::UnicodeWordBoundaryError> {

        Ok(())

    }

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn fwd(

        haystack: &[u8],

        at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        Ok(match utf8::decode(&haystack[at..]) {

            None | Some(Err(_)) => false,

            Some(Ok(ch)) => is_word_character(ch),

        })

    }

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn rev(

        haystack: &[u8],

        at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        Ok(match utf8::decode_last(&haystack[..at]) {

            None | Some(Err(_)) => false,

            Some(Ok(ch)) => is_word_character(ch),

        })

    }

    #[cfg_attr(feature = "perf-inline", inline(always))]

    fn is_word_character(c: char) -> bool {

        use crate::util::{unicode_data::perl_word::PERL_WORD, utf8};

        // MSRV(1.59): Use 'u8::try_from(c)' instead.

        if u8::try_from(u32::from(c)).map_or(false, utf8::is_word_byte) {

            return true;

        }

        PERL_WORD

            .binary_search_by(|&(start, end)| {

                use core::cmp::Ordering;

                if start <= c && c <= end {

                    Ordering::Equal

                } else if start > c {

                    Ordering::Greater

                } else {

                    Ordering::Less

                }

            })

            .is_ok()

    }

}

/// A module that always returns an error if Unicode word boundaries are

/// disabled. When this feature is disabled, then regex-automata will not

/// include its own data tables even if regex-syntax is disabled.

#[cfg(not(feature = "unicode-word-boundary"))]

mod is_word_char {

    pub(super) fn check() -> Result<(), super::UnicodeWordBoundaryError> {

        Err(super::UnicodeWordBoundaryError::new())

    }

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn fwd(

        _bytes: &[u8],

        _at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        Err(super::UnicodeWordBoundaryError::new())

    }

    #[cfg_attr(feature = "perf-inline", inline(always))]

    pub(super) fn rev(

        _bytes: &[u8],

        _at: usize,

    ) -> Result<bool, super::UnicodeWordBoundaryError> {

        Err(super::UnicodeWordBoundaryError::new())

    }

}

#[cfg(test)]

mod tests {

    use super::*;

    macro_rules! testlook {

        ($look:expr, $haystack:expr, $at:expr) => {

            LookMatcher::default().matches($look, $haystack.as_bytes(), $at)

        };

    }

    #[test]

    fn look_matches_start_line() {

        let look = Look::StartLF;

        assert!(testlook!(look, "", 0));

        assert!(testlook!(look, "\n", 0));

        assert!(testlook!(look, "\n", 1));

        assert!(testlook!(look, "a", 0));

        assert!(testlook!(look, "\na", 1));

        assert!(!testlook!(look, "a", 1));

        assert!(!testlook!(look, "a\na", 1));

    }

    #[test]

    fn look_matches_end_line() {

        let look = Look::EndLF;

        assert!(testlook!(look, "", 0));

        assert!(testlook!(look, "\n", 1));

        assert!(testlook!(look, "\na", 0));

        assert!(testlook!(look, "\na", 2));

        assert!(testlook!(look, "a\na", 1));

        assert!(!testlook!(look, "a", 0));

        assert!(!testlook!(look, "\na", 1));

        assert!(!testlook!(look, "a\na", 0));

        assert!(!testlook!(look, "a\na", 2));

    }

    #[test]

    fn look_matches_start_text() {

        let look = Look::Start;

        assert!(testlook!(look, "", 0));

        assert!(testlook!(look, "\n", 0));

        assert!(testlook!(look, "a", 0));

        assert!(!testlook!(look, "\n", 1));

        assert!(!testlook!(look, "\na", 1));

        assert!(!testlook!(look, "a", 1));

        assert!(!testlook!(look, "a\na", 1));

    }

    #[test]

    fn look_matches_end_text() {

        let look = Look::End;

        assert!(testlook!(look, "", 0));

        assert!(testlook!(look, "\n", 1));

        assert!(testlook!(look, "\na", 2));

        assert!(!testlook!(look, "\na", 0));

        assert!(!testlook!(look, "a\na", 1));

        assert!(!testlook!(look, "a", 0));

        assert!(!testlook!(look, "\na", 1));

        assert!(!testlook!(look, "a\na", 0));

        assert!(!testlook!(look, "a\na", 2));

    }

    #[test]

    #[cfg(all(not(miri), feature = "unicode-word-boundary"))]

    fn look_matches_word_unicode() {

        let look = Look::WordUnicode;

        // \xF0\x9D\x9B\x83 = 𝛃 (in \w)

        // \xF0\x90\x86\x80 = 𐆀 (not in \w)

        // Simple ASCII word boundaries.

        assert!(testlook!(look, "a", 0));

        assert!(testlook!(look, "a", 1));

        assert!(testlook!(look, "a ", 1));

        assert!(testlook!(look, " a ", 1));

        assert!(testlook!(look, " a ", 2));

        // Unicode word boundaries with a non-ASCII codepoint.

        assert!(testlook!(look, "𝛃", 0));

        assert!(testlook!(look, "𝛃", 4));

        assert!(testlook!(look, "𝛃 ", 4));

        assert!(testlook!(look, " 𝛃 ", 1));

        assert!(testlook!(look, " 𝛃 ", 5));

        // Unicode word boundaries between non-ASCII codepoints.

        assert!(testlook!(look, "𝛃𐆀", 0));

        assert!(testlook!(look, "𝛃𐆀", 4));

        // Non word boundaries for ASCII.

        assert!(!testlook!(look, "", 0));

        assert!(!testlook!(look, "ab", 1));

        assert!(!testlook!(look, "a ", 2));

        assert!(!testlook!(look, " a ", 0));

        assert!(!testlook!(look, " a ", 3));

        // Non word boundaries with a non-ASCII codepoint.

        assert!(!testlook!(look, "𝛃b", 4));

        assert!(!testlook!(look, "𝛃 ", 5));

        assert!(!testlook!(look, " 𝛃 ", 0));

        assert!(!testlook!(look, " 𝛃 ", 6));

        assert!(!testlook!(look, "𝛃", 1));

        assert!(!testlook!(look, "𝛃", 2));

        assert!(!testlook!(look, "𝛃", 3));

        // Non word boundaries with non-ASCII codepoints.

        assert!(!testlook!(look, "𝛃𐆀", 1));

        assert!(!testlook!(look, "𝛃𐆀", 2));

        assert!(!testlook!(look, "𝛃𐆀", 3));

        assert!(!testlook!(look, "𝛃𐆀", 5));

        assert!(!testlook!(look, "𝛃𐆀", 6));

        assert!(!testlook!(look, "𝛃𐆀", 7));

        assert!(!testlook!(look, "𝛃𐆀", 8));

    }

    #[test]

    fn look_matches_word_ascii() {

        let look = Look::WordAscii;

        // \xF0\x9D\x9B\x83 = 𝛃 (in \w)

        // \xF0\x90\x86\x80 = 𐆀 (not in \w)

        // Simple ASCII word boundaries.

        assert!(testlook!(look, "a", 0));

        assert!(testlook!(look, "a", 1));

        assert!(testlook!(look, "a ", 1));

        assert!(testlook!(look, " a ", 1));

        assert!(testlook!(look, " a ", 2));

        // Unicode word boundaries with a non-ASCII codepoint. Since this is

        // an ASCII word boundary, none of these match.

        assert!(!testlook!(look, "𝛃", 0));

        assert!(!testlook!(look, "𝛃", 4));

        assert!(!testlook!(look, "𝛃 ", 4));

        assert!(!testlook!(look, " 𝛃 ", 1));

        assert!(!testlook!(look, " 𝛃 ", 5));

        // Unicode word boundaries between non-ASCII codepoints. Again, since

        // this is an ASCII word boundary, none of these match.

        assert!(!testlook!(look, "𝛃𐆀", 0));

        assert!(!testlook!(look, "𝛃𐆀", 4));

        // Non word boundaries for ASCII.

        assert!(!testlook!(look, "", 0));

        assert!(!testlook!(look, "ab", 1));

        assert!(!testlook!(look, "a ", 2));

        assert!(!testlook!(look, " a ", 0));

        assert!(!testlook!(look, " a ", 3));

        // Non word boundaries with a non-ASCII codepoint.

        assert!(testlook!(look, "𝛃b", 4));

        assert!(!testlook!(look, "𝛃 ", 5));

        assert!(!testlook!(look, " 𝛃 ", 0));

        assert!(!testlook!(look, " 𝛃 ", 6));

        assert!(!testlook!(look, "𝛃", 1));

        assert!(!testlook!(look, "𝛃", 2));

        assert!(!testlook!(look, "𝛃", 3));

        // Non word boundaries with non-ASCII codepoints.

        assert!(!testlook!(look, "𝛃𐆀", 1));

        assert!(!testlook!(look, "𝛃𐆀", 2));

        assert!(!testlook!(look, "𝛃𐆀", 3));

        assert!(!testlook!(look, "𝛃𐆀", 5));

        assert!(!testlook!(look, "𝛃𐆀", 6));

        assert!(!testlook!(look, "𝛃𐆀", 7));

        assert!(!testlook!(look, "𝛃𐆀", 8));

    }

    #[test]

    #[cfg(all(not(miri), feature = "unicode-word-boundary"))]

    fn look_matches_word_unicode_negate() {

        let look = Look::WordUnicodeNegate;

        // \xF0\x9D\x9B\x83 = 𝛃 (in \w)

        // \xF0\x90\x86\x80 = 𐆀 (not in \w)

        // Simple ASCII word boundaries.

        assert!(!testlook!(look, "a", 0));

        assert!(!testlook!(look, "a", 1));

        assert!(!testlook!(look, "a ", 1));

        assert!(!testlook!(look, " a ", 1));

        assert!(!testlook!(look, " a ", 2));

        // Unicode word boundaries with a non-ASCII codepoint.

        assert!(!testlook!(look, "𝛃", 0));

        assert!(!testlook!(look, "𝛃", 4));

        assert!(!testlook!(look, "𝛃 ", 4));

        assert!(!testlook!(look, " 𝛃 ", 1));

        assert!(!testlook!(look, " 𝛃 ", 5));

        // Unicode word boundaries between non-ASCII codepoints.

        assert!(!testlook!(look, "𝛃𐆀", 0));

        assert!(!testlook!(look, "𝛃𐆀", 4));

        // Non word boundaries for ASCII.

        assert!(testlook!(look, "", 0));

        assert!(testlook!(look, "ab", 1));

        assert!(testlook!(look, "a ", 2));

        assert!(testlook!(look, " a ", 0));

        assert!(testlook!(look, " a ", 3));

        // Non word boundaries with a non-ASCII codepoint.

        assert!(testlook!(look, "𝛃b", 4));

        assert!(testlook!(look, "𝛃 ", 5));

        assert!(testlook!(look, " 𝛃 ", 0));

        assert!(testlook!(look, " 𝛃 ", 6));

        // These don't match because they could otherwise return an offset that

        // splits the UTF-8 encoding of a codepoint.

        assert!(!testlook!(look, "𝛃", 1));

        assert!(!testlook!(look, "𝛃", 2));

        assert!(!testlook!(look, "𝛃", 3));

        // Non word boundaries with non-ASCII codepoints. These also don't

        // match because they could otherwise return an offset that splits the

        // UTF-8 encoding of a codepoint.

        assert!(!testlook!(look, "𝛃𐆀", 1));

        assert!(!testlook!(look, "𝛃𐆀", 2));

        assert!(!testlook!(look, "𝛃𐆀", 3));

        assert!(!testlook!(look, "𝛃𐆀", 5));

        assert!(!testlook!(look, "𝛃𐆀", 6));

        assert!(!testlook!(look, "𝛃𐆀", 7));

        // But this one does, since 𐆀 isn't a word codepoint, and 8 is the end

        // of the haystack. So the "end" of the haystack isn't a word and 𐆀

        // isn't a word, thus, \B matches.

        assert!(testlook!(look, "𝛃𐆀", 8));

    }

    #[test]

    fn look_matches_word_ascii_negate() {

        let look = Look::WordAsciiNegate;

        // \xF0\x9D\x9B\x83 = 𝛃 (in \w)

        // \xF0\x90\x86\x80 = 𐆀 (not in \w)

        // Simple ASCII word boundaries.

        assert!(!testlook!(look, "a", 0));

        assert!(!testlook!(look, "a", 1));

        assert!(!testlook!(look, "a ", 1));

        assert!(!testlook!(look, " a ", 1));

        assert!(!testlook!(look, " a ", 2));

        // Unicode word boundaries with a non-ASCII codepoint. Since this is

        // a negated ASCII word boundary, all of these match.

        assert!(testlook!(look, "𝛃", 0));

        assert!(testlook!(look, "𝛃", 4));

        assert!(testlook!(look, "𝛃 ", 4));

        assert!(testlook!(look, " 𝛃 ", 1));

        assert!(testlook!(look, " 𝛃 ", 5));

        // Unicode word boundaries between non-ASCII codepoints. Again, since

        // this is a negated ASCII word boundary, all of these match.

        assert!(testlook!(look, "𝛃𐆀", 0));

        assert!(testlook!(look, "𝛃𐆀", 4));

        // Non word boundaries for ASCII.

        assert!(testlook!(look, "", 0));

        assert!(testlook!(look, "ab", 1));

        assert!(testlook!(look, "a ", 2));

        assert!(testlook!(look, " a ", 0));

        assert!(testlook!(look, " a ", 3));

        // Non word boundaries with a non-ASCII codepoint.

        assert!(!testlook!(look, "𝛃b", 4));

        assert!(testlook!(look, "𝛃 ", 5));

        assert!(testlook!(look, " 𝛃 ", 0));

        assert!(testlook!(look, " 𝛃 ", 6));

        assert!(testlook!(look, "𝛃", 1));

        assert!(testlook!(look, "𝛃", 2));

        assert!(testlook!(look, "𝛃", 3));

        // Non word boundaries with non-ASCII codepoints.

        assert!(testlook!(look, "𝛃𐆀", 1));

        assert!(testlook!(look, "𝛃𐆀", 2));

        assert!(testlook!(look, "𝛃𐆀", 3));

        assert!(testlook!(look, "𝛃𐆀", 5));

        assert!(testlook!(look, "𝛃𐆀", 6));

        assert!(testlook!(look, "𝛃𐆀", 7));

        assert!(testlook!(look, "𝛃𐆀", 8));

    }

    #[test]

    fn look_set() {

        let mut f = LookSet::default();

        assert!(!f.contains(Look::Start));

        assert!(!f.contains(Look::End));

        assert!(!f.contains(Look::StartLF));

        assert!(!f.contains(Look::EndLF));

        assert!(!f.contains(Look::WordUnicode));

        assert!(!f.contains(Look::WordUnicodeNegate));

        assert!(!f.contains(Look::WordAscii));

        assert!(!f.contains(Look::WordAsciiNegate));

        f = f.insert(Look::Start);

        assert!(f.contains(Look::Start));

        f = f.remove(Look::Start);

        assert!(!f.contains(Look::Start));

        f = f.insert(Look::End);

        assert!(f.contains(Look::End));

        f = f.remove(Look::End);

        assert!(!f.contains(Look::End));

        f = f.insert(Look::StartLF);

        assert!(f.contains(Look::StartLF));

        f = f.remove(Look::StartLF);

        assert!(!f.contains(Look::StartLF));

        f = f.insert(Look::EndLF);

        assert!(f.contains(Look::EndLF));

        f = f.remove(Look::EndLF);

        assert!(!f.contains(Look::EndLF));

        f = f.insert(Look::StartCRLF);

        assert!(f.contains(Look::StartCRLF));

        f = f.remove(Look::StartCRLF);

        assert!(!f.contains(Look::StartCRLF));

        f = f.insert(Look::EndCRLF);

        assert!(f.contains(Look::EndCRLF));

        f = f.remove(Look::EndCRLF);

        assert!(!f.contains(Look::EndCRLF));

        f = f.insert(Look::WordUnicode);

        assert!(f.contains(Look::WordUnicode));

        f = f.remove(Look::WordUnicode);

        assert!(!f.contains(Look::WordUnicode));

        f = f.insert(Look::WordUnicodeNegate);

        assert!(f.contains(Look::WordUnicodeNegate));

        f = f.remove(Look::WordUnicodeNegate);

        assert!(!f.contains(Look::WordUnicodeNegate));

        f = f.insert(Look::WordAscii);

        assert!(f.contains(Look::WordAscii));

        f = f.remove(Look::WordAscii);

        assert!(!f.contains(Look::WordAscii));

        f = f.insert(Look::WordAsciiNegate);

        assert!(f.contains(Look::WordAsciiNegate));

        f = f.remove(Look::WordAsciiNegate);

        assert!(!f.contains(Look::WordAsciiNegate));

    }

    #[test]

    fn look_set_iter() {

        let set = LookSet::empty();

        assert_eq!(0, set.iter().count());

        let set = LookSet::full();

        assert_eq!(10, set.iter().count());

        let set =

            LookSet::empty().insert(Look::StartLF).insert(Look::WordUnicode);

        assert_eq!(2, set.iter().count());

        let set = LookSet::empty().insert(Look::StartLF);

        assert_eq!(1, set.iter().count());

        let set = LookSet::empty().insert(Look::WordAsciiNegate);

        assert_eq!(1, set.iter().count());

    }

    #[test]

    #[cfg(feature = "alloc")]

    fn look_set_debug() {

        let res = alloc::format!("{:?}", LookSet::empty());

        assert_eq!("∅", res);

        let res = alloc::format!("{:?}", LookSet::full());

        assert_eq!("Az^$rRbB𝛃𝚩", res);

    }

}