Syntax Overview
This chapter gives you a focused overview of the regular expression syntax supported by this library. Whenever you want to dive deeper, each item links to the relevant detail sections so you can explore the topics at your own pace.
You will notice that this syntax is intentionally strict: only explicitly supported escape sequences are accepted. Unknown sequences produce parser errors, which helps you spot mistakes early and keeps your patterns predictable and safe.
Quoting
The following escape sequences let you insert literal characters that would otherwise have a special meaning in pattern syntax. Use them whenever you want the regex engine to treat these characters exactly as written.
\\ \. \" \' \# \ \+ \* \? \{ \} \$ \^ \( \| \) \[ \] |
These escape sequences produce the corresponding literal character [1]. |
Legacy and Compatibility Expressions
The following constructs are accepted for compatibility with older syntaxes. For new code, prefer the recommended approach to keep your patterns clean and explicit.
| Expression | Details | Please Use |
|---|---|---|
| \Q…\E | Treats all characters inside the | Escaping via API |
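The quoting rules above can be sketched as a small escaping helper. This chapter does not show the library's own escaping API, so `escape_literal` below is a hypothetical stand-in written in plain Python. It only prefixes the characters from the quoting list with a backslash, and it assumes that the bare `\ ` entry in that list is an escaped space.

```python
# Characters the Quoting section lists as escapable with a backslash.
# Assumption: the bare "\ " entry in that list is an escaped space.
SPECIAL_CHARS = set("\\.\"'# +*?{}$^(|)[]")


def escape_literal(text: str) -> str:
    """Hypothetical helper: prefix every listed special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    print(escape_literal("a.b (c)"))
```

Escaping the text before it is inserted into a pattern keeps the pattern free of quoted blocks, which is the intent of the recommendation in the table.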
Special Characters
The escape sequences below let you insert common control characters and Unicode code points directly into your patterns. Using them keeps your regular expressions both readable and intentional, especially when working with non-printable or invisible characters.
\n |
Newline character U+000A. |
\r |
Carriage return character U+000D. |
\t |
Horizontal tab character U+0009. |
\uhhhh |
Unicode code point given as a four-digit hexadecimal value |
\u{hh…} |
Unicode code point given as a hexadecimal value of one to eight digits |
Legacy and Compatibility Expressions
These expressions are supported for compatibility with older regex syntaxes. For clarity and consistency, you should prefer the recommended forms whenever possible.
| Expression | Details | Please Use |
|---|---|---|
| \a | Alarm (BEL) character U+0007. | \u{7} |
| \cX | ASCII control character from U+0001 to U+001F. | \u{hh} |
| \e | Escape character U+001B. | \u{1b} |
| \f | Form feed character U+000C. | \u{c} |
| \N | Matches any character except the newline. Useful in legacy patterns but directly equivalent to a simple negated class. | [^\n] |
| \o{ddd…} | Character with an octal code. | \u{hh…} |
| \xhh | Character with a two-digit hexadecimal code. | \u{hh…} |
| \x{hh…} | Character with a hexadecimal code. | \u{hh…} |
| \Uhhhhhhhh | Unicode code point given as an eight-digit hexadecimal number. | \u{hh…} |
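If you maintain older patterns, the replacements in the table above can be applied mechanically. The following is a minimal sketch in plain Python, not part of the library: `modernize_escapes` is a hypothetical helper that rewrites the legacy code point escapes into the `\u{hh…}` form. It handles `\cX` only for letters, and it leaves `\N` alone because its replacement `[^\n]` is not valid inside a character class.

```python
import re

# Code points for the single-letter legacy escapes from the table.
_SIMPLE = {"a": 0x07, "e": 0x1B, "f": 0x0C}

# Match every backslash escape so that an escaped backslash ("\\") is
# consumed as a pair and never misread as the start of a legacy escape.
_ESCAPE = re.compile(
    r"\\(?:U([0-9A-Fa-f]{8})"   # \Uhhhhhhhh
    r"|x\{([0-9A-Fa-f]+)\}"     # \x{hh…}
    r"|x([0-9A-Fa-f]{2})"       # \xhh
    r"|o\{([0-7]+)\}"           # \o{ddd…}
    r"|c([A-Za-z])"             # \cX (letters only in this sketch)
    r"|([aef])"                 # \a \e \f
    r"|.)",                     # any other escape: left unchanged
    re.DOTALL,
)


def modernize_escapes(pattern: str) -> str:
    """Hypothetical helper: rewrite legacy code point escapes as \\u{hh…}."""

    def repl(m: re.Match) -> str:
        big_u, x_braced, x_two, octal, ctrl, simple = m.groups()
        hex_digits = big_u or x_braced or x_two
        if hex_digits:
            code = int(hex_digits, 16)
        elif octal:
            code = int(octal, 8)
        elif ctrl:
            code = ord(ctrl.upper()) - 0x40  # \cA is U+0001, \cZ is U+001A
        elif simple:
            code = _SIMPLE[simple]
        else:
            return m.group(0)
        return "\\u{%x}" % code

    return _ESCAPE.sub(repl, pattern)


if __name__ == "__main__":
    print(modernize_escapes(r"\a\x41\o{101}\cA"))
```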
Character Types
Character type escapes let you quickly match common categories of characters without writing long character classes. They are especially helpful when working with Unicode text, where “letters” and “digits” extend far beyond the ASCII ranges.
. |
Matches any character except the newline (U+000A) [5]. This is your general-purpose wildcard. |
\d |
Matches any Unicode digit character [6]. Equivalent to \p{Nd}. |
\D |
Matches anything except Unicode digit characters [6]. Equivalent to \P{Nd}. |
\s |
Matches any Unicode space separator character [7]. Equivalent to \p{Zs}. |
\S |
Matches anything except Unicode space characters [7]. Equivalent to \P{Zs}. |
\w |
Matches any “word” character: all Unicode letters, all Unicode digits, and the underscore [8]. Equivalent to [\p{Nd}\p{L}_]. |
\W |
Matches anything except “word” characters [8]. Equivalent to [^\p{Nd}\p{L}_]. |
\p{category} |
Matches all characters belonging to a specific Unicode category. This is ideal when you want precision and full Unicode awareness. |
\P{category} |
Matches all characters not belonging to the given Unicode category. |
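To get a feeling for what these equivalences mean in practice, you can model them with Unicode general categories. The sketch below uses Python's standard `unicodedata` module, not this library, and takes the equivalences listed above literally. The function names are made up for this illustration, and the referenced footnotes may qualify the behavior.

```python
import unicodedata


def is_digit(ch: str) -> bool:
    """Models \\d as \\p{Nd}: any Unicode decimal digit."""
    return unicodedata.category(ch) == "Nd"


def is_space(ch: str) -> bool:
    """Models \\s as \\p{Zs}: any Unicode space separator."""
    return unicodedata.category(ch) == "Zs"


def is_word(ch: str) -> bool:
    """Models \\w as [\\p{Nd}\\p{L}_]: digits, letters, and the underscore."""
    category = unicodedata.category(ch)
    return category == "Nd" or category.startswith("L") or ch == "_"


if __name__ == "__main__":
    for ch in ["7", "\u0663", "\u00b2", " ", "\u00a0", "\t", "\u00e9", "_", "-"]:
        print(repr(ch), unicodedata.category(ch),
              is_digit(ch), is_space(ch), is_word(ch))
```

Read this way, the Arabic-Indic digit U+0663 counts as a digit, while the superscript two U+00B2 (category No) does not. A tab (category Cc) falls outside `\p{Zs}`, so it is worth testing whitespace handling explicitly when you port patterns from ASCII-oriented engines.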
Legacy and Compatibility Expressions
These escape sequences stem from older regex engines such as PCRE. They are still recognized, but they represent fixed, narrow character sets and should not be used in modern Unicode-aware patterns.
| Expression | Details | Please Use |
|---|---|---|
| \h \H | Matches a small, predefined set of “horizontal whitespace” characters from PCRE. These sets are not Unicode-complete. | […] |
| \v \V | Matches a small, predefined set of “vertical whitespace” characters from PCRE. Also not Unicode-complete and best avoided in new patterns. | […] |
Character Classes
Character classes give you fine-grained control over which characters your pattern should match. They are one of the most powerful building blocks of regular expressions, especially when you want to define your own sets rather than rely on predefined categories.
[…] |
A positive character class. It matches any character listed inside the brackets. |
[^…] |
A negative character class. It matches any character not listed inside the brackets. |
Legacy and Compatibility Expressions
These constructs are recognized for compatibility with older regex syntaxes. While they may look familiar if you have used POSIX or PCRE in the past, the modern Unicode-aware forms are more expressive and easier to read.
| Expression | Details | Please Use |
|---|---|---|
| [\Q…\E] | A quoted literal inside a character class. This often obscures intent and is unnecessary in Unicode-aware patterns. | Avoid this. |
| [[:name:]] | Positive POSIX named set (e.g. | \p{name} |
| [[:^name:]] | Negative POSIX named set. Matches everything not in the POSIX set. | \P{name} |
| [[:…:][:…:]] | Union of two or more POSIX named sets. While functional, it is harder to maintain and less explicit than Unicode category classes. | [\p{…}\p{…}] |
Quantifiers
Quantifiers let you control how many times a preceding element may occur. Each quantifier comes in three variants:
Greedy – tries to match as much as possible.
Lazy – tries to match as little as possible.
Possessive – matches as much as possible and never backtracks.
Use these intentionally to make your patterns both fast and predictable.
? |
Match 0 or 1 occurrence (greedy). |
?+ |
Match 0 or 1 occurrence (possessive). |
?? |
Match 0 or 1 occurrence (lazy). |
* |
Match 0 or more occurrences (greedy). |
*+ |
Match 0 or more occurrences (possessive). |
*? |
Match 0 or more occurrences (lazy). |
+ |
Match 1 or more occurrences (greedy). |
++ |
Match 1 or more occurrences (possessive). |
+? |
Match 1 or more occurrences (lazy). |
{n} |
Match exactly n occurrences. |
{n,m} |
Match at least n and at most m occurrences (greedy). |
{n,m}+ |
Match at least n and at most m occurrences (possessive). |
{n,m}? |
Match at least n and at most m occurrences (lazy). |
{n,} |
Match at least n occurrences (greedy). |
{n,}+ |
Match at least n occurrences (possessive). |
{n,}? |
Match at least n occurrences (lazy). |
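The difference between the three variants is easiest to see on a small input. The sketch below uses Python's standard `re` module as a stand-in engine, because this chapter does not show the library's API. The quantifier syntax is the same, but other details of the two engines may differ. Python supports possessive quantifiers only from version 3.11 on, so that part is guarded.

```python
import re
import sys


def first_match(pattern: str, text: str):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


text = "<a><b>"

# Greedy: .+ takes as much as possible, then backtracks to the last '>'.
greedy = first_match(r"<.+>", text)

# Lazy: .+? takes as little as possible and stops at the first '>'.
lazy = first_match(r"<.+?>", text)

# Possessive: .++ consumes the rest of the text and never gives anything
# back, so the final '>' cannot match and the search finds nothing.
possessive = first_match(r"<.++>", text) if sys.version_info >= (3, 11) else None

if __name__ == "__main__":
    print(greedy, lazy, possessive)
```

The possessive variant shows why these quantifiers deserve deliberate use: they can reject a text that the greedy form would accept, in exchange for never backtracking.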
Anchors
Anchors match positions in the text rather than actual characters. Use them to precisely define where a match may occur.
\A |
Match only at the start of the entire text. |
\b |
Match at a word boundary (between a word and a non-word character). |
\B |
Match anywhere except at a word boundary. |
\Z |
Match at the end of the text [9]. Note that this does not match before a trailing newline. |
^ |
Match at the start of the text, or after a newline when multi-line mode is active. |
$ |
Match at the end of the text, or before a newline when multi-line mode is active. |
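A few of these anchors can be tried out with Python's standard `re` module, used here only as a stand-in because the library's own API is not shown in this chapter. The sketch limits itself to `^`, `\b`, and `\Z`, where Python's behavior matches the descriptions above. Other anchors, `$` in particular, follow Python's own rules and are left out on purpose.

```python
import re

text = "one\ntwo\nthree"

# Without multi-line mode, ^ matches only at the start of the text.
first_only = re.findall(r"^\w+", text)

# With (?m) at the very start of the pattern, ^ also matches after each newline.
every_line = re.findall(r"(?m)^\w+", text)

# \b matches between a word character and a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat")

# \Z matches at the very end of the text, not before a trailing newline.
at_end = re.search(r"end\Z", "the end") is not None
before_newline = re.search(r"end\Z", "the end\n") is not None

if __name__ == "__main__":
    print(first_only, every_line, whole_words, at_end, before_newline)
```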
Legacy and Compatibility Expressions
The following anchor is kept for compatibility, but its behavior duplicates \Z.
| Expression | Details | Please Use |
|---|---|---|
| \z | Behaves the same as \Z and also does not match before the final newline. | \Z |
Alternatives
Alternatives allow you to express “match one of several options” directly inside your pattern. They are particularly useful when several text variants should be treated equally.
A|B|C |
Match any of the alternatives |
(?:A|B|C) |
Match any of the given alternatives without creating a capturing group. Prefer this form when you do not need to extract the matched part. |
Groups
Groups give structure to your regular expressions. They help you control scope, apply quantifiers to sequences, extract matched parts, or influence how the regex engine performs backtracking.
(…) |
A capturing group. Use this when you want to extract or refer to the matched content. |
(?<name>…) |
A named capturing group. This makes your patterns more readable and avoids relying on numeric group indices. |
(?:…) |
A non-capturing group. Ideal when grouping is needed only for precedence or quantifiers. |
(?>…) |
An atomic group. The content is matched without backtracking, which can significantly improve performance in specific situations and prevents ambiguous matches. |
Legacy and Compatibility Expressions
The following group syntaxes exist for compatibility with older regex engines. Prefer the modern (?<name>…) syntax and verbose-mode comments whenever possible.
| Expression | Details | Please Use |
|---|---|---|
| (?'name'…) | A legacy form of named capturing group. | (?<name>…) |
| (?P<name>…) | Another legacy syntax for named capturing groups. | (?<name>…) |
| (?#…) | An inline comment. It cannot be nested, but literal \) is allowed inside. Prefer verbose mode for more readable, maintainable patterns. | Verbose Mode |
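Both legacy named-group forms differ from the recommended one only in their opening delimiter, so converting existing patterns is a simple rewrite. The following is a minimal sketch in plain Python. `modernize_groups` is a hypothetical helper and not part of the library. It skips escaped parentheses, but it does not look inside character classes, so a class that happens to contain such a sequence would be rewritten as well.

```python
import re

# Match any backslash escape first so that an escaped "\(" is never mistaken
# for a group opener, then the two legacy named-group openers.
_GROUP_OPENER = re.compile(
    r"\\."                          # any escape: left unchanged
    r"|\(\?P<([A-Za-z_]\w*)>"       # (?P<name>
    r"|\(\?'([A-Za-z_]\w*)'",       # (?'name'
    re.DOTALL,
)


def modernize_groups(pattern: str) -> str:
    """Hypothetical helper: rewrite legacy named groups as (?<name>…)."""

    def repl(m: re.Match) -> str:
        name = m.group(1) or m.group(2)
        return "(?<%s>" % name if name else m.group(0)

    return _GROUP_OPENER.sub(repl, pattern)


if __name__ == "__main__":
    print(modernize_groups(r"(?P<year>\d{4})-(?'month'\d{2})"))
```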
Modes and Flags
Flags let you change how the regular expression engine behaves. You can enable or disable them globally (only allowed at the very start of the pattern [10]) or locally for a specific group. Local flags apply only inside that group and restore the previous settings afterward.
(?i) |
Enable case-insensitive matching (global only). |
(?m) |
Enable multi-line mode ( |
(?s) |
Enable dot-all mode ( |
(?x) |
Enable verbose mode (whitespace and comments ignored; global only). |
(?a) |
Enable ASCII mode (character escapes follow ASCII semantics; global only). |
(?u) |
Enable Unicode mode and disable ASCII mode (global only). |
(?flags) |
Enable multiple flags at once (global only).
Example: |
(?-flags) |
Disable the listed flags ( |
(?flags:…) |
Apply or remove flags only inside this group.
Example: |
Unsupported Expressions
The expressions listed below are intentionally not supported by this library. Most of them introduce ambiguity, reduce safety, or add complexity that conflicts with the design goals of a predictable and Unicode-aware regex engine. When encountering one of these constructs, the parser will raise an error.
Expression |
Reason |
|---|---|
\0dd |
Zero-prefixed octal escapes conflict with backreferences and replacement patterns. Using \0 produces a parser error. |
\ddd |
Ambiguous octal syntax that conflicts with backreferences and replacement patterns. Using \ddd produces a parser error. |
\C |
Unsafe: would break proper UTF-8 parsing and could lead to invalid byte sequences. |
\R |
Redundant. Because the engine can normalize CRLF to LF automatically, \R would collapse to \n. To avoid surprising behavior, it is disabled. |
\X |
Too similar to \x and easily confused with it. Unicode category unions are already supported via character classes like [\p{L}\p{N}]. |
\n \gn \g{n} \g{-n} \k<name> \k'name' \g{name} \k{name} (?P=name) |
Backreferences are not supported in this version of the library. |
[[:^…:][:^…:]] |
Combining multiple negative POSIX sets, or mixing negative and positive sets, adds unnecessary complexity to this compatibility feature. |
\G |
Not meaningful for this implementation; omitted for clarity. |
\K |
Extremely rare in practice and not worth the added complexity. |
(?|…) |
Introduces unnecessary complexity; can be replaced with straightforward program logic. |
(?C) (?Cn) |
These debugging constructs are not needed; the library exposes clearer ways to inspect its behavior. |
(?J) |
Allows duplicate group names. This breaks clarity and maintainability. |
(?U) |
Adds unnecessary mode complexity and is easily confused with (?u). |
(*…) |
Parser tuning directives are provided via the API, not as inline syntax. |
(?=…) (?!…) (?<=…) (?<!…) |
Lookahead and lookbehind assertions are not supported in this version. |
(?R) (?n) (?+n) (?-n) (?&name) (?P>name) \g<name> \g'name' \gn \g'n' |
These recursion and subroutine constructs cannot be implemented safely. |
(?(cond)yes) (?(cond)yes|no) |
Conditional patterns are not supported in this version of the library. |
See chapter “Unsupported Expressions” for all details →
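Because the parser rejects these constructs, it can help to scan patterns from other engines before you migrate them. The sketch below is a hypothetical pre-check in plain Python, not part of the library. It covers only a subset of the table (a few escapes, `\k` backreferences, lookarounds, `(?|…)`, conditionals, and `(*…)` directives), and it does not look inside character classes, so it may report false positives there. The labels are informal descriptions chosen for this example.

```python
import re

# Subset of the unsupported constructs, each with an informal label.
_CHECKS = [
    (r"\\[CRXGK]", "unsupported escape"),
    (r"\\k[<'{]", "backreference"),
    (r"\(\?<?[=!]", "lookaround assertion"),
    (r"\(\?\|", "(?|…) group"),
    (r"\(\?\(", "conditional pattern"),
    (r"\(\*", "inline directive"),
]

# One alternation: named groups for the checks, followed by a generic escape
# that swallows harmless sequences such as "\\" or "\(".
_SCANNER = re.compile(
    "|".join("(?P<c%d>%s)" % (i, rx) for i, (rx, _) in enumerate(_CHECKS))
    + r"|\\.",
    re.DOTALL,
)


def find_unsupported(pattern: str):
    """Return (offset, text, label) for each unsupported construct found."""
    findings = []
    for m in _SCANNER.finditer(pattern):
        if m.lastgroup is not None:
            label = _CHECKS[int(m.lastgroup[1:])][1]
            findings.append((m.start(), m.group(0), label))
    return findings


if __name__ == "__main__":
    for offset, text, label in find_unsupported(r"(?<name>a)(?=b)\K"):
        print(offset, text, label)
```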
Footnotes