Converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes.
This sub-module is useful for constructing byte-based automatons that need
to embed UTF-8 decoding. The most common use of this module is in conjunction
with the [`hir::ClassUnicodeRange`](crate::hir::ClassUnicodeRange) type.
See the documentation on the `Utf8Sequences` iterator for more details and
an example.
# Wait, what is this?
This is simplest to explain with an example. Let's say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is `[0400-04FF]`. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:
```text
[D0-D3][80-BF]
```
This is simple enough: encode the boundaries (`0400` encodes to
`D0 80` and `04FF` encodes to `D3 BF`), then create a range from each
corresponding pair of bytes: `D0` to `D3` and `80` to `BF`.
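As a concrete check, here's a small sketch using this module's
`Utf8Sequences` iterator (assuming `regex-syntax` as a dependency); the
output noted in the comments is the expected rendering of the byte ranges
shown above:
```rust
use regex_syntax::utf8::Utf8Sequences;

fn main() {
    // The boundaries encode as described: U+0400 => D0 80, U+04FF => D3 BF.
    let mut buf = [0u8; 4];
    assert_eq!('\u{0400}'.encode_utf8(&mut buf).as_bytes(), &[0xD0, 0x80]);
    assert_eq!('\u{04FF}'.encode_utf8(&mut buf).as_bytes(), &[0xD3, 0xBF]);

    // The whole scalar value range collapses into a single sequence
    // of byte ranges.
    for seq in Utf8Sequences::new('\u{0400}', '\u{04FF}') {
        println!("{:?}", seq); // expected: [D0-D3][80-BF]
    }
}
```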
However, what if you wanted to add the Cyrillic Supplement characters to
your range? Your range would then become `[0400-052F]`. The same procedure
as above doesn't quite work, because `052F` encodes to `D4 AF`. The byte
ranges you'd get from the previous transformation would be `[D0-D4][80-AF]`,
but that range fails to match many characters; for example, `04FF` (encoded
as `D3 BF`) is excluded because its last byte, `BF`, isn't in the range
`80-AF`. Instead, you need multiple sequences of byte ranges:
```text
[D0-D3][80-BF] # matches codepoints 0400-04FF
[D4][80-AF] # matches codepoints 0500-052F
```
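The iterator produces exactly this split, and its `matches` method confirms
that `04FF` is covered; a sketch, with the two-line output in the comments
being the expectation based on the ranges above:
```rust
use regex_syntax::utf8::Utf8Sequences;

fn main() {
    for seq in Utf8Sequences::new('\u{0400}', '\u{052F}') {
        println!("{:?}", seq);
    }
    // Expected:
    // [D0-D3][80-BF]
    // [D4][80-AF]

    // Unlike the naive [D0-D4][80-AF], the encoding of U+04FF (D3 BF)
    // is matched by one of the two sequences.
    let seqs: Vec<_> = Utf8Sequences::new('\u{0400}', '\u{052F}').collect();
    assert!(seqs.iter().any(|seq| seq.matches(&[0xD3, 0xBF])));
}
```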
This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequences of byte
ranges for the Basic Multilingual Plane (`[0000-FFFF]`) look like this:
```text
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
```
Note that the byte ranges above will *not* match any erroneous encoding of
UTF-8, including encodings of surrogate codepoints.
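For instance, `ED A0 80` is the UTF-8-style (and therefore invalid) encoding
of the surrogate codepoint `D800`; a quick sketch checking that none of the
BMP sequences admit it:
```rust
use regex_syntax::utf8::Utf8Sequences;

fn main() {
    // ED A0 80 encodes the surrogate codepoint U+D800, which valid UTF-8
    // must reject. The [ED][80-9F][80-BF] sequence stops at 9F, so no
    // sequence for the BMP matches these bytes.
    let surrogate = [0xED, 0xA0, 0x80];
    for seq in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        assert!(!seq.matches(&surrogate));
    }
}
```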
And, of course, for all of Unicode (`[000000-10FFFF]`):
```text
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
[F0][90-BF][80-BF][80-BF]
[F1-F3][80-BF][80-BF][80-BF]
[F4][80-8F][80-BF][80-BF]
```
This module automates the process of creating these byte ranges from ranges of
Unicode scalar values.
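For example, the full table above falls out of a single iterator call; a
sketch, where the expected count of nine sequences is taken from the
`[000000-10FFFF]` listing above:
```rust
use regex_syntax::utf8::Utf8Sequences;

fn main() {
    // One call covers all of Unicode; each item is one sequence of
    // byte ranges from the table above.
    let seqs: Vec<_> = Utf8Sequences::new('\u{0}', '\u{10FFFF}').collect();
    assert_eq!(seqs.len(), 9);
    for seq in &seqs {
        println!("{:?}", seq);
    }
}
```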
# Lineage
I got the idea and general implementation strategy from Russ Cox in his
[article on regexps](https://web.archive.org/web/20160404141123/https://swtch.com/~rsc/regexp/regexp3.html) and RE2.
Russ Cox got it from Ken Thompson's `grep` (no source, folklore?).
I also got the idea from
[Lucene](https://github.com/apache/lucene-solr/blob/ae93f4e7ac6a3908046391de35d4f50a0d3c59ca/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java),
which uses it for executing automata on their term index.