Syntax Overview

This chapter gives you a focused overview of the regular expression syntax supported by this library. Each section ends with a link to the chapter that covers the topic in full detail.

You will notice that this syntax is intentionally strict: only explicitly supported escape sequences are accepted. Unknown sequences produce parser errors, which helps you spot mistakes early and keeps your patterns predictable and safe.

Quoting

The following escape sequences let you insert literal characters that would otherwise have a special meaning in pattern syntax. Use them whenever you want the regex engine to treat these characters exactly as written.

\\

\.

\"

\'

\#

\

\+

\*

\?

\{

\}

\$

\^

\(

\|

\)

\[

\]

These escape sequences produce the corresponding literal character [1].
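As a rough illustration of how these escapes are used, the sketch below prefixes every special character from the list above with a backslash. The helper name quote_literal is hypothetical and not part of this library; the library's own escaping API is not shown in this overview. Python's re module serves only as a stand-in engine to check the result, since it accepts the same punctuation escapes.

```python
import re

# Characters taken from the quoting list above.
SPECIAL_CHARACTERS = set('\\."\'#+*?{}$^(|)[]')


def quote_literal(text: str) -> str:
    # Hypothetical helper: put a backslash in front of each special character.
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in text)


literal = "price: $5.00 (approx.)"
pattern = quote_literal(literal)
print(pattern)

# Stand-in check with Python's re: the quoted pattern matches the literal text.
print(bool(re.fullmatch(pattern, literal)))
```

Quoting a literal this way keeps user-supplied text from being interpreted as pattern syntax.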

Legacy and Compatibility Expressions

The following constructs are accepted for compatibility with older syntaxes. For new code, prefer the recommended approach to keep your patterns clean and explicit.

| Expression | Details | Please Use |
| --- | --- | --- |
| \Q…\E | Treats all characters inside the \Q…\E block as literal. Even backslashes lose their special meaning until \E is encountered. | Escaping via API |

See chapter “Quoting” for all details →

Special Characters

The escape sequences below let you insert common control characters and Unicode code points directly into your patterns. Using them keeps your regular expressions both readable and intentional, especially when working with non-printable or invisible characters.

\n

Newline character U+000A.

\r

Carriage return character U+000D.

\t

Horizontal tab character U+0009.

\uhhhh

Unicode code point given as a four-digit hexadecimal value hhhh [2].

\u{hh…}

Unicode code point given as a hexadecimal value of one to eight digits hh… [2].

Legacy and Compatibility Expressions

These expressions are supported for compatibility with older regex syntaxes. For clarity and consistency, you should prefer the recommended forms whenever possible.

| Expression | Details | Please Use |
| --- | --- | --- |
| \a | Alarm (BEL) character U+0007. | \u{7} |
| \cX | ASCII control character from U+0001 to U+001F. X must be a single ASCII letter and is interpreted case-insensitively. | \u{hh} |
| \e | Escape character U+001B. | \u{1b} |
| \f | Form feed character U+000C. | \u{c} |
| \N | Matches any character except the newline. Useful in legacy patterns but directly equivalent to a simple negated class. | [^\n] |
| \o{ddd…} | Character with an octal code ddd… [3]. Octal escape sequences are still parsed but can be confusing in modern Unicode patterns. | \u{hh…} |
| \xhh | Character with a two-digit hexadecimal code hh [4]. This form cannot express code points above U+00FF. | \u{hh…} |
| \x{hh…} | Character with a hexadecimal code hh… [4]. Supports more digits than the compact \xhh form, but is still superseded by \u{hh…}. | \u{hh…} |
| \Uhhhhhhhh | Unicode code point given as an eight-digit hexadecimal number hhhhhhhh [2]. Works, but is unnecessarily rigid. | \u{hh…} |

See chapter “Special Characters” for all details →
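The following sketch shows how the legacy forms listed above translate into the recommended \u{hh…} notation. All function names are hypothetical helpers for illustration, not part of this library. The \cX conversion assumes the conventional mapping where \cA stands for U+0001, \cB for U+0002, and so on.

```python
def braced_escape(code_point: int) -> str:
    # Format a code point in the recommended braced form, lowercase hex.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return "\\u{%x}" % code_point


def from_control(letter: str) -> str:
    # Legacy \cX: X is a single ASCII letter, interpreted case-insensitively.
    # Assumption: \cA maps to U+0001, \cB to U+0002, and so on.
    if len(letter) != 1 or not letter.isascii() or not letter.isalpha():
        raise ValueError("X must be a single ASCII letter")
    return braced_escape(ord(letter.upper()) - ord("A") + 1)


def from_octal(digits: str) -> str:
    # Legacy \o{ddd…}: the digits are read as an octal number.
    return braced_escape(int(digits, 8))


def from_hex(digits: str) -> str:
    # Legacy \xhh, \x{hh…} and \Uhhhhhhhh: the digits are read as hexadecimal.
    return braced_escape(int(digits, 16))


print(braced_escape(0x07))       # replacement for \a
print(braced_escape(0x1B))       # replacement for \e
print(from_control("c"))         # replacement for \cC
print(from_octal("101"))         # replacement for \o{101}
print(from_hex("0001F600"))      # replacement for \U0001F600
```

Because every legacy form reduces to a single code point, one braced notation is enough to express all of them.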

Character Types

Character type escapes let you quickly match common categories of characters without writing long character classes. They are especially helpful when working with Unicode text, where “letters” and “digits” extend far beyond the ASCII ranges.

.

Matches any character except the newline (U+000A) [5]. This is your general-purpose wildcard.

\d

Matches any Unicode digit character [6]. Equivalent to \p{Nd}.

\D

Matches anything except Unicode digit characters [6]. Equivalent to \P{Nd}.

\s

Matches any Unicode space separator character [7]. Equivalent to \p{Zs}.

\S

Matches anything except Unicode space characters [7]. Equivalent to \P{Zs}.

\w

Matches any “word” character: all Unicode letters, all Unicode digits, and the underscore [8]. Equivalent to [\p{Nd}\p{L}_].

\W

Matches anything except “word” characters [8]. Equivalent to [^\p{Nd}\p{L}_].

\p{category}

Matches all characters belonging to a specific Unicode category. This is ideal when you want precision and full Unicode awareness.

\P{category}

Matches all characters not belonging to the given Unicode category.
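To make the stated equivalences concrete, here is a minimal sketch that models \d, \s and \w with Python's unicodedata module. It takes the equivalences above at face value and is not the library's implementation; the function names are hypothetical, and the Unicode version bundled with Python may differ from the one the library uses.

```python
import unicodedata


def is_digit(ch: str) -> bool:
    # \d is described as equivalent to \p{Nd}.
    return unicodedata.category(ch) == "Nd"


def is_space(ch: str) -> bool:
    # \s is described as equivalent to \p{Zs}.
    return unicodedata.category(ch) == "Zs"


def is_word(ch: str) -> bool:
    # \w is described as equivalent to [\p{Nd}\p{L}_].
    category = unicodedata.category(ch)
    return category == "Nd" or category.startswith("L") or ch == "_"


print(is_digit("7"), is_digit("\u0663"))    # ASCII digit, ARABIC-INDIC DIGIT THREE
print(is_space(" "), is_space("\u00a0"))    # space, NO-BREAK SPACE
print(is_word("\u00e9"), is_word("_"), is_word("-"))
```

Under this reading, a tab character (category Cc) is not covered by \s, which is worth keeping in mind when porting patterns from other engines.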

Legacy and Compatibility Expressions

These escape sequences stem from older regex engines such as PCRE. They are still recognized, but they represent fixed, narrow character sets and should not be used in modern Unicode-aware patterns.

| Expression | Details | Please Use |
| --- | --- | --- |
| \h \H | Matches a small, predefined set of “horizontal whitespace” characters from PCRE. These sets are not Unicode-complete. | […] |
| \v \V | Matches a small, predefined set of “vertical whitespace” characters from PCRE. Also not Unicode-complete and best avoided in new patterns. | […] |

See chapter “Character Types” for all details →

Character Classes

Character classes give you fine-grained control over which characters your pattern should match. They are one of the most powerful building blocks of regular expressions, especially when you want to define your own sets rather than rely on predefined categories.

[…]

A positive character class. It matches any character listed inside the brackets.

[^…]

A negative character class. It matches any character not listed inside the brackets.
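A short example of both class forms, using Python's re module as a stand-in engine. The basic bracket syntax shown here is shared by most regex dialects; this library's own API is not shown.

```python
import re

text = "regex 101"

vowels = re.findall(r"[aeiou]", text)         # positive class: only listed characters
non_vowels = re.findall(r"[^aeiou]", text)    # negative class: everything else

print(vowels)
print(non_vowels)
```

Note that the negative class also matches the space and the digits, because it accepts every character that is not listed.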

Legacy and Compatibility Expressions

These constructs are recognized for compatibility with older regex syntaxes. While they may look familiar if you have used POSIX or PCRE in the past, the modern Unicode-aware forms are more expressive and easier to read.

| Expression | Details | Please Use |
| --- | --- | --- |
| [\Q…\E] | A quoted literal inside a character class. This often obscures intent and is unnecessary in Unicode-aware patterns. | Avoid this. |
| [[:name:]] | Positive POSIX named set (e.g. alpha, digit). Works, but stems from a legacy naming scheme that predates modern Unicode categories. | \p{name} |
| [[:^name:]] | Negative POSIX named set. Matches everything not in the POSIX set. | \P{name} |
| [[:…:][:…:]] | Union of two or more POSIX named sets. While functional, it is harder to maintain and less explicit than Unicode category classes. | [\p{…}\p{…}] |

See chapter “Character Classes” for all details →

Quantifiers

Quantifiers let you control how many times a preceding element may occur. Each quantifier comes in three variants:

  • Greedy – tries to match as much as possible.

  • Lazy – tries to match as little as possible.

  • Possessive – matches as much as possible and never backtracks.

Use these intentionally to make your patterns both fast and predictable.

?

Match 0 or 1 occurrence (greedy).

?+

Match 0 or 1 occurrence (possessive).

??

Match 0 or 1 occurrence (lazy).

*

Match 0 or more occurrences (greedy).

*+

Match 0 or more occurrences (possessive).

*?

Match 0 or more occurrences (lazy).

+

Match 1 or more occurrences (greedy).

++

Match 1 or more occurrences (possessive).

+?

Match 1 or more occurrences (lazy).

{n}

Match exactly n occurrences.

{n,m}

Match at least n and at most m occurrences (greedy).

{n,m}+

Match at least n and at most m occurrences (possessive).

{n,m}?

Match at least n and at most m occurrences (lazy).

{n,}

Match at least n occurrences (greedy).

{n,}+

Match at least n occurrences (possessive).

{n,}?

Match at least n occurrences (lazy).
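The difference between the variants is easiest to see on a small input. The sketch below uses Python's re module as a stand-in engine, not this library. Python only understands possessive quantifiers from version 3.11 on, so that part is guarded by a version check.

```python
import re
import sys

text = "<a><b>"

greedy = re.match(r"<.+>", text).group()    # takes as much as possible
lazy = re.match(r"<.+?>", text).group()     # takes as little as possible
print(greedy)   # <a><b>
print(lazy)     # <a>

# Greedy a* gives back one "a" so that the final "a" can still match.
backtracking = re.match(r"a*a", "aaa").group()
print(backtracking)   # aaa

if sys.version_info >= (3, 11):
    # Possessive a*+ consumes every "a" and never gives one back,
    # so the final "a" has nothing left to match.
    possessive = re.match(r"a*+a", "aaa")
    print(possessive)   # None
```

The possessive form fails fast instead of retrying shorter matches, which is where its speed and predictability come from.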

See chapter “Quantifiers” for all details →

Anchors

Anchors match positions in the text rather than actual characters. Use them to precisely define where a match may occur.

\A

Match only at the start of the entire text.

\b

Match at a word boundary (between a word and a non-word character).

\B

Match anywhere except at a word boundary.

\Z

Match at the end of the text [9]. Note that this does not match before a trailing newline.

^

Match at the start of the text, or after a newline when multi-line mode is active.

$

Match at the end of the text, or before a newline when multi-line mode is active.
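The following sketch shows the anchors at work, with Python's re module standing in for this library. The anchors used here behave the same way in Python as described above; the multi-line flag is written inline as (?m).

```python
import re

text = "one cat\ntwo cats"

first_word = re.findall(r"^\w+", text)         # ^ matches only at the start of the text
line_starts = re.findall(r"(?m)^\w+", text)    # multi-line: ^ also matches after a newline
line_ends = re.findall(r"(?m)\w+$", text)      # multi-line: $ also matches before a newline
whole_word = re.findall(r"\bcat\b", text)      # "cat" as a word, not the start of "cats"
inside_word = re.findall(r"\Bat", text)        # "at" that does not begin a word

print(first_word)
print(line_starts)
print(line_ends)
print(whole_word)
print(inside_word)
```

None of these anchors consume characters; they only restrict where the surrounding pattern is allowed to match.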

Legacy and Compatibility Expressions

The following anchor is kept for compatibility, but its behavior duplicates \Z.

| Expression | Details | Please Use |
| --- | --- | --- |
| \z | Behaves the same as \Z and also does not match before the final newline. | \Z |

See chapter “Anchors” for all details →

Alternatives

Alternatives allow you to express “match one of several options” directly inside your pattern. They are particularly useful when several text variants should be treated equally.

A|B|C

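Match any of the alternatives A, B or C. The alternation applies to the whole enclosing group or pattern. Wrapping it in plain parentheses, (A|B|C), limits its scope and creates a capturing group; use the non-capturing form to limit the scope without capturing.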

(?:A|B|C)

Match any of the given alternatives without creating a capturing group. Prefer this form when you do not need to extract the matched part.
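A brief sketch of how grouping affects alternatives, using Python's re module as a stand-in engine. The comments describe Python's behavior, which follows the common regex convention; this library's API is not shown.

```python
import re

# Without a group, | splits the whole pattern: "gra" or "ey".
loose = re.compile(r"gra|ey")
# Inside a group, | only chooses between "a" and "e".
scoped = re.compile(r"gr(?:a|e)y")

print(bool(loose.fullmatch("grey")))
print(bool(scoped.fullmatch("grey")))

# Plain parentheses capture the chosen alternative, (?:…) does not.
capturing = re.compile(r"(cat|dog) food")
non_capturing = re.compile(r"(?:cat|dog) food")
print(capturing.groups, non_capturing.groups)
print(capturing.fullmatch("dog food").group(1))
```

Scoping the alternatives with a group is usually what is intended when only part of the pattern varies.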

See chapter “Alternatives” for all details →

Groups

Groups give structure to your regular expressions. They help you control scope, apply quantifiers to sequences, extract matched parts, or influence how the regex engine performs backtracking.

(…)

A capturing group. Use this when you want to extract or refer to the matched content.

(?<name>…)

A named capturing group. This makes your patterns more readable and avoids relying on numeric group indices.

(?:…)

A non-capturing group. Ideal when grouping is needed only for precedence or quantifiers.

(?>…)

An atomic group. The content is matched without backtracking, which can significantly improve performance in specific situations and prevents ambiguous matches.
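The example below sketches capturing, named and non-capturing groups with Python's re module as a stand-in. One caveat: Python only accepts the (?P<name>…) spelling for named groups, which this overview lists as a legacy form; in this library the recommended spelling would be (?<name>…). Atomic groups are left out because Python supports them only from version 3.11 on.

```python
import re

# Named groups (Python spelling; the recommended form here is (?<name>…)).
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
date_match = date.fullmatch("2024-05")
print(date_match.group("year"), date_match.group("month"))

# A non-capturing group scopes the quantifier without adding a capture,
# so only the plain (…) group shows up in the result.
repeated = re.compile(r"(?:ab)+(c)")
repeated_match = repeated.fullmatch("ababc")
print(repeated_match.groups())
```

Named groups keep the extraction code stable when groups are later added to or removed from the pattern.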

Legacy and Compatibility Expressions

The following group syntaxes exist for compatibility with older regex engines. Prefer the modern (?<name>…) syntax and verbose-mode comments whenever possible.

| Expression | Details | Please Use |
| --- | --- | --- |
| (?'name'…) | A legacy form of named capturing group. | (?<name>…) |
| (?P<name>…) | Another legacy syntax for named capturing groups. | (?<name>…) |
| (?#…) | An inline comment. It cannot be nested, but literal \) is allowed inside. Prefer verbose mode for more readable, maintainable patterns. | Verbose Mode |

See chapter “Groups” for all details →

Modes and Flags

Flags let you change how the regular expression engine behaves. You can enable or disable them globally (only allowed at the very start of the pattern [10]) or locally for a specific group. Local flags apply only inside that group and restore the previous settings afterward.

(?i)

Enable case-insensitive matching (global only).

(?m)

Enable multi-line mode (^ and $ match at line boundaries; global only).

(?s)

Enable dot-all mode (. also matches newlines; global only).

(?x)

Enable verbose mode (whitespace and comments ignored; global only).

(?a)

Enable ASCII mode (character escapes follow ASCII semantics; global only).

(?u)

Enable Unicode mode and disable ASCII mode (global only).

(?flags)

Enable multiple flags at once (global only). Example: (?ims) activates i, m and s.

(?-flags)

Disable the listed flags (i, m, s or x) globally. Example: (?-i) turns off case-insensitive matching.

(?flags:…)

Apply or remove flags only inside this group. Example: (?i:abc) matches abc case-insensitively, but the rest of the pattern is unaffected. Flags can also be mixed: (?im-s:...) enables i and m and disables s for the enclosed group.
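Global and local flags can be compared side by side in a short sketch. Python's re module is used as a stand-in engine here; it happens to share the inline flag syntax described above, but it is not this library's API.

```python
import re

# Global flag at the very start: the whole pattern ignores case.
global_flag = re.fullmatch(r"(?i)abc", "ABC")

# Local flag: only the group ignores case, "def" stays case-sensitive.
local = re.compile(r"(?i:abc)def")
print(bool(local.fullmatch("ABCdef")))
print(bool(local.fullmatch("ABCDEF")))

# Mixed local flags: dot-all is on globally but switched off inside the group,
# so the inner dot rejects a newline while the outer dot accepts one.
mixed = re.compile(r"(?s)(?i-s:a.c).")
print(bool(mixed.fullmatch("Abc\n")))
print(bool(mixed.fullmatch("A\nc\n")))

# Verbose mode: whitespace and the trailing comment are ignored.
verbose = re.fullmatch(r"(?x) a b c  # spaces and this comment are ignored", "abc")
print(bool(verbose))
```

Local flags are the safer choice when only one part of a pattern needs different behavior, because the surrounding pattern keeps its settings.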

See chapter “Modes and Flags” for all details →

Unsupported Expressions

The expressions listed below are intentionally not supported by this library. Most of them introduce ambiguity, reduce safety, or add complexity that conflicts with the design goals of a predictable and Unicode-aware regex engine. When encountering one of these constructs, the parser will raise an error.

Expression

Reason

\0dd

Zero-prefixed octal escapes conflict with backreferences and replacement patterns. Using \0 produces a parser error.

\ddd

Ambiguous octal syntax that conflicts with backreferences and replacement patterns. Using \ddd produces a parser error.

\C

Unsafe: would break proper UTF-8 parsing and could lead to invalid byte sequences.

\R

Redundant. Because the engine can normalize CRLF to LF automatically, \R would collapse to \n. To avoid surprising behavior, it is disabled.

\X

Too similar to \x and easily confused with it. Unicode category unions are already supported via character classes like [\p{L}\p{N}].

\n \gn \g{n} \g{-n} \k<name> \k'name' \g{name} \k{name} (?P=name)

Backreferences are not supported in this version of the library. In these forms, n stands for a group number, not the newline escape.

[[:^…:][:^…:]]

Combining multiple negative POSIX sets, or mixing negative and positive sets, adds unnecessary complexity to this compatibility feature.

\G

Not meaningful for this implementation; omitted for clarity.

\K

Extremely rare in practice and not worth the added complexity.

(?|…)

Introduces unnecessary complexity; can be replaced with straightforward program logic.

(?C) (?Cn)

These debugging constructs are not needed; the library exposes clearer ways to inspect its behavior.

(?J)

Allows duplicate group names. This breaks clarity and maintainability.

(?U)

Adds unnecessary mode complexity and is easily confused with (?u).

(*…)

Parser tuning directives are provided via the API, not as inline syntax.

(?=…) (?!…) (?<=…) (?<!…)

Lookahead and lookbehind assertions are not supported in this version.

(?R) (?n) (?+n) (?-n) (?&name) (?P>name) \g<name> \g'name' \gn \g'n'

These recursion and subroutine constructs cannot be implemented safely.

(?(cond)yes) (?(cond)yes|no)

Conditional patterns are not supported in this version of the library.
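Several of the unsupported constructs can be replaced with a little program logic around a simpler pattern. As one possible sketch, the check that a backreference pattern such as (\w+)=\1 would perform can be done by capturing both sides separately and comparing them in code. The function name is hypothetical, and Python's re module is used only as a stand-in engine.

```python
import re

# Both sides are captured on their own; no backreference is needed.
PAIR = re.compile(r"(\w+)=(\w+)")


def is_self_assignment(text: str) -> bool:
    # Hypothetical helper: true when both sides of "=" are the same word.
    match = PAIR.fullmatch(text)
    return match is not None and match.group(1) == match.group(2)


print(is_self_assignment("total=total"))
print(is_self_assignment("total=count"))
```

The comparison in code is explicit and easy to extend, for example to compare the two sides case-insensitively.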

See chapter “Unsupported Expressions” for all details →

Footnotes