Character Types

Character type escapes match common character categories without requiring you to write explicit character classes. Like character classes, they always match exactly one character.

Use character type escapes for readability and intent, especially when working with Unicode text or when the exact set of characters is secondary to their semantic meaning.

.

Matches any single character.

By default, the dot matches any character except the line feed character (\n, LF, U+000A).

When the Flag::DotAll flag is enabled, the dot also matches line feed characters.

You can enable DotAll mode either programmatically or inline using (?s).
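The effect of DotAll is easy to see in a small experiment. The sketch below uses Python's standard re module purely as a stand-in, because its dot also excludes only \n by default and it accepts the same (?s) inline flag. It is not this engine's API, and the sample text is made up for illustration.

```python
import re

text = "first\nsecond"

# Default mode: the dot stops at the line feed, so the match cannot cross lines.
default_match = re.search(r"first.second", text)

# Inline (?s) enables DotAll, so the dot also matches the line feed.
dotall_match = re.search(r"(?s)first.second", text)

# Python's programmatic flag behaves like the inline form.
flag_match = re.search(r"first.second", text, re.DOTALL)

print(default_match)                 # None
print(dotall_match.group() == text)  # True
print(flag_match.group() == text)    # True
```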

\d

Matches a digit character.

The exact set depends on the active ASCII or Unicode mode:

In Unicode mode (default), it matches Unicode decimal digits (equivalent to \p{Nd}).
In ASCII mode, it matches only ASCII digits (equivalent to [0-9]).

Enable ASCII mode programmatically via Flag::Ascii or inline using (?a).
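As an illustration of the two modes, the following sketch again borrows Python's re module, whose \d has the same Unicode and ASCII definitions and which accepts the same (?a) inline flag. Treat it as an analogy rather than as this engine's API.

```python
import re
import unicodedata

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# Unicode mode (Python's default for str patterns): any Nd character is a digit.
unicode_hit = re.fullmatch(r"\d", arabic_three) is not None

# ASCII mode via the inline (?a) flag: only 0-9 count as digits.
ascii_hit = re.fullmatch(r"(?a)\d", arabic_three) is not None
ascii_plain = re.fullmatch(r"(?a)\d", "7") is not None

print(unicodedata.category(arabic_three))   # Nd
print(unicode_hit, ascii_hit, ascii_plain)  # True False True
```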

\D

Matches any character that is not a digit as defined by \d.

This is the negated form of \d.

\s

Matches a whitespace character.

The exact set depends on the active ASCII or Unicode mode.

When the Flag::DotAll flag is enabled, the whitespace set is extended to include the line feed character. This keeps the behavior of \s consistent with . when matching across lines.

You can enable DotAll mode programmatically or inline using (?s). ASCII mode can be enabled via (?a) or Flag::Ascii.

\S

Matches any character that is not whitespace as defined by \s.

This is the negated form of \s.

\w

Matches a word character.

The exact definition depends on the active ASCII or Unicode mode:

In Unicode mode (default), it matches Unicode letters and digits, plus underscore.
In ASCII mode, it matches only ASCII letters, digits, and underscore.

The word-boundary anchor \b uses the same definition of a word character.

Enable ASCII mode programmatically using Flag::Ascii or inline with (?a).
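To get a feel for how the mode changes what counts as a word, the sketch below uses Python's re module as a stand-in. Python's Unicode \w is close to, but not guaranteed to be identical with, the definition given here, so read the output as an approximation.

```python
import re

# "naïve_café 42" written with precomposed accented letters.
text = "na\u00efve_caf\u00e9 42"

# Unicode mode: accented letters are word characters.
unicode_words = re.findall(r"\w+", text)

# ASCII mode: accented letters split the words.
ascii_words = re.findall(r"(?a)\w+", text)

print(unicode_words)  # ['naïve_café', '42']
print(ascii_words)    # ['na', 've_caf', '42']
```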

\W

Matches any character that is not a word character as defined by \w.

This is the negated form of \w.

\p{category}

Matches a character belonging to the given Unicode general category.

The category name is normalized before matching:

Matching is case-insensitive.
Underscores are ignored.
Only category name characters (a-z, A-Z, _) are accepted.
Unknown or invalid category names raise a parse error.

Examples:

\p{Lu} matches any uppercase letter.
\p{Uppercase_Letter} is accepted as well.
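The normalization rules can be sketched in a few lines of Python. The function name normalize_category and the abbreviated lookup table are hypothetical and exist only for this illustration; the engine's actual implementation is not shown here.

```python
# Hypothetical, abbreviated lookup table keyed by normalized name.
# Only a few entries of the full category table are included.
_CATEGORIES = {
    "l": "L", "letter": "L",
    "lu": "Lu", "uppercaseletter": "Lu",
    "ll": "Ll", "lowercaseletter": "Ll",
    "nd": "Nd", "decimalnumber": "Nd",
    "zs": "Zs", "spaceseparator": "Zs",
}


def normalize_category(name: str) -> str:
    """Return the short category name for `name`, or raise ValueError."""
    # Only ASCII letters and underscores are accepted.
    if not name or any(
        not (ch.isascii() and (ch.isalpha() or ch == "_")) for ch in name
    ):
        raise ValueError(f"invalid category name: {name!r}")
    # Underscores are ignored and case does not matter.
    key = name.replace("_", "").lower()
    try:
        return _CATEGORIES[key]
    except KeyError:
        raise ValueError(f"unknown category name: {name!r}") from None


print(normalize_category("Lu"))                # Lu
print(normalize_category("Uppercase_Letter"))  # Lu
print(normalize_category("UPPERCASELETTER"))   # Lu
```

Names such as "L1" or "Foo" raise ValueError in this sketch, mirroring the parse error described above.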

Unicode category matching is not affected by case-insensitive matching, ASCII mode, or DotAll mode.

For example, even in case-insensitive mode, \p{UppercaseLetter} still matches only uppercase letters.

\P{category}

Matches any character that is not in the given Unicode general category.

This is the negated form of \p{category}.
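A minimal model of category matching and its negation, written with Python's unicodedata module. The helper names are hypothetical, the category is assumed to be already normalized to its short name, and the result depends on the Unicode version bundled with Python, which may differ from the one this engine uses.

```python
import unicodedata


def in_category(ch: str, category: str) -> bool:
    # Model of \p{category} for a short category name such as "Lu" or "L".
    actual = unicodedata.category(ch)  # always a two-letter code such as "Lu"
    # A one-letter name such as "L" stands for the whole group (Lu, Ll, Lt, ...).
    if len(category) == 1:
        return actual.startswith(category)
    return actual == category


def not_in_category(ch: str, category: str) -> bool:
    # Model of \P{category}: the exact negation of in_category.
    return not in_category(ch, category)


print(in_category("A", "Lu"), in_category("a", "Lu"))  # True False
print(in_category("a", "L"))                           # True
print(not_in_category("a", "Lu"))                      # True
```

The helpers take no flag parameter at all, which reflects the rule that case-insensitive matching and other flags do not change the result.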

Supported Unicode Categories

Short Name	Long Name
C	Other
L	Letter
M	Mark
N	Number
P	Punctuation
S	Symbol
Z	Separator
Cc	Control
Cf	Format
Cn	Unassigned
Co	PrivateUse
Cs	Surrogate
Ll	LowercaseLetter
Lm	ModifierLetter
Lo	OtherLetter
Lt	TitlecaseLetter
Lu	UppercaseLetter
Mc	SpacingMark
Me	EnclosingMark
Mn	NonspacingMark
Nd	DecimalNumber
Nl	LetterNumber
No	OtherNumber
Pc	ConnectorPunctuation
Pd	DashPunctuation
Pe	ClosePunctuation
Pf	FinalPunctuation
Pi	InitialPunctuation
Po	OtherPunctuation
Ps	OpenPunctuation
Sc	CurrencySymbol
Sk	ModifierSymbol
Sm	MathSymbol
So	OtherSymbol
Zl	LineSeparator
Zp	ParagraphSeparator
Zs	SpaceSeparator
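If you are unsure which category a character belongs to, you can look it up before writing the pattern. The sketch below uses Python's unicodedata module for that purpose; it is a convenience outside this engine, and the sample characters are chosen only for illustration.

```python
import unicodedata

samples = [
    "A", "a", "7",
    "\u2167",  # ROMAN NUMERAL EIGHT
    "\u00bd",  # VULGAR FRACTION ONE HALF
    "_", "-", "(", "+",
    "\u20ac",  # EURO SIGN
    " ",
    "\u2028",  # LINE SEPARATOR
    "\n",
]

# Map each sample character to its two-letter general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} -> {cat}")
```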

Legacy and Compatibility Expressions

The following escapes exist for compatibility with other regular expression engines. They are supported as legacy syntax and should not be used in new patterns unless compatibility is required.

Prefer using \s / \S and explicit Unicode categories via \p{category} instead.

\h / \H

\h matches a horizontal whitespace character; \H matches any character that is not horizontal whitespace.

This escape sequence exists for compatibility and may be disabled via the feature set.

\v / \V

\v matches a vertical whitespace character; \V matches any character that is not vertical whitespace.

This escape sequence exists for compatibility and may be disabled via the feature set.
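The exact character sets behind these escapes are not listed here. For orientation only, the sketch below models them with the sets that PCRE documents for \h and \v. That choice is an assumption, the helper names are hypothetical, and this engine's sets may differ.

```python
# Assumption: these sets follow PCRE's documented \h and \v definitions.
# This engine's own sets are not specified here and may differ.
HORIZONTAL = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E, 0x202F, 0x205F, 0x3000]
    + list(range(0x2000, 0x200B))  # U+2000 through U+200A
)
VERTICAL = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def is_horizontal_space(ch: str) -> bool:
    # Model of \h; its negation models \H.
    return ord(ch) in HORIZONTAL


def is_vertical_space(ch: str) -> bool:
    # Model of \v; its negation models \V.
    return ord(ch) in VERTICAL


print(is_horizontal_space("\t"), is_horizontal_space("\n"))  # True False
print(is_vertical_space("\n"), is_vertical_space(" "))       # True False
```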

Differences from Common Regex Engines

If you are migrating from commonly used regular expression engines such as PCRE, PCRE2, ECMAScript, RE2, or std::regex, you may notice a few deliberate differences in how character type escapes behave.

These differences are intentional and aim to make matching semantics explicit, predictable, and independent of context, even when working with Unicode text.

Dot does not silently match newlines

In many engines, whether . matches a line terminator depends on configuration, and the default differs from engine to engine.

In this engine:

. never matches a line feed by default.
Line feed is matched only when Flag::DotAll is explicitly enabled.

This strict separation avoids accidental cross-line matches and makes it immediately clear when a pattern is intended to span multiple lines.

Whitespace handling is aligned with DotAll

When Flag::DotAll is enabled, the definition of whitespace matched by \s is extended to include the line feed character.

Some engines treat dot and whitespace independently, which can lead to inconsistent behavior when switching between . and \s. This engine keeps both in sync to avoid surprising differences when refactoring patterns.
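The coupling between the two can be modelled as a pair of predicates. This is a sketch of the rule as described, not the engine's code: the function names are hypothetical, and Python's str.isspace() stands in for the base whitespace set, which is an assumption.

```python
def dot_matches(ch: str, dot_all: bool = False) -> bool:
    # The dot matches everything except LF, unless DotAll is enabled.
    return ch != "\n" or dot_all


def whitespace_matches(ch: str, dot_all: bool = False) -> bool:
    # LF counts as whitespace only when DotAll is enabled.
    if ch == "\n":
        return dot_all
    # Assumption: str.isspace() approximates the engine's base whitespace set.
    return ch.isspace()


for dot_all in (False, True):
    print(dot_all, dot_matches("\n", dot_all), whitespace_matches("\n", dot_all))
# False False False
# True True True
```

In both modes the two predicates agree on the line feed, so swapping . for \s in a pattern does not change whether it can cross a line break.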

Unicode categories are not affected by flags

Unicode category matching via \p{category} and \P{category} is completely independent of matching flags such as:

case-insensitive matching
ASCII mode
DotAll mode

For example, even when case-insensitive matching is enabled, \p{UppercaseLetter} still matches only uppercase letters.

In contrast, some engines implicitly fold or reinterpret categories depending on flags, which can make patterns harder to reason about. This engine keeps Unicode semantics stable and explicit.

Word characters are explicitly defined

The definition of word characters used by \w, \W, and word-boundary anchors is tightly defined:

In Unicode mode, word characters are Unicode letters and digits plus underscore.
In ASCII mode, only ASCII letters, digits, and underscore are included.

Some engines expand word characters to include combining marks or additional categories. This engine intentionally keeps the definition conservative to ensure predictable word boundary behavior across scripts.
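One practical consequence concerns decomposed text. The model below is a sketch under stated assumptions: "letters" is read as the L* categories and "digits" as Nd, and the helper name is_word_char is hypothetical.

```python
import unicodedata


def is_word_char(ch: str, ascii_mode: bool = False) -> bool:
    if ch == "_":
        return True
    if ascii_mode:
        # ASCII letters and digits only.
        return ch.isascii() and ch.isalnum()
    category = unicodedata.category(ch)
    # Assumption: "letters" means the L* categories and "digits" means Nd.
    return category.startswith("L") or category == "Nd"


decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT (category Mn)

print([is_word_char(ch) for ch in decomposed])  # [True, False]
print(is_word_char("\u00e9"))                   # True (precomposed é)
print(is_word_char("\u00e9", ascii_mode=True))  # False
```

Under this model a combining mark is not a word character, so a word boundary can fall between a base letter and its accent in decomposed input. Normalizing text to a precomposed form such as NFC before matching is one way to avoid that.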

Legacy escapes can be disabled

Escapes such as \h, \H, \v, and \V exist solely for compatibility with other engines.

They are clearly marked as legacy and may be disabled via the feature set. For new patterns, prefer \s, \S, and explicit Unicode categories for clarity and long-term stability.

Summary

In short, this engine favors stable Unicode semantics, explicit intent, and consistent behavior across flags.