Character Types

Character type escapes match common character categories without requiring you to write explicit character classes. Like character classes, they always match exactly one character.

Use character type escapes for readability and intent, especially when working with Unicode text or when the exact set of characters is secondary to their semantic meaning.

.

Matches any single character.

By default, the dot matches any character except the line feed character (\n, LF, U+000A).

When Flag::DotAll is enabled, the dot also matches line feed characters.

Enable DotAll mode programmatically via Flag::DotAll or inline using (?s).
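As a hedged illustration, the snippet below uses Python's built-in re module as a stand-in, because its default dot and its inline (?s) flag follow the same line feed rule described above. It does not call this engine's API, and Flag::DotAll itself is not shown.

```python
import re

text = "first\nsecond"

# Default: the dot stops at the line feed, so only the first line is consumed.
default_match = re.match(r".+", text).group()

# Inline (?s) enables DotAll: the dot now crosses the line feed.
dotall_match = re.match(r"(?s).+", text).group()

print(repr(default_match))  # 'first'
print(repr(dotall_match))   # 'first\nsecond'
```

Making the flag visible in the pattern itself is what signals that a match is meant to span lines.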

\d

Matches a digit character.

The exact set depends on the active ASCII or Unicode mode:

  • In Unicode mode (default), it matches Unicode decimal digits (equivalent to \p{Nd}).

  • In ASCII mode, it matches only ASCII digits (equivalent to [0-9]).

Enable ASCII mode programmatically via Flag::Ascii or inline using (?a).
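The difference between the two modes can be sketched with Python's re module, which is used here only as a stand-in: for text patterns it also defaults to Unicode decimal digits and also accepts the inline (?a) flag. The sample string is an assumption chosen for illustration.

```python
import re

# ASCII digits "42", then ARABIC-INDIC DIGIT FOUR and TWO (both category Nd).
sample = "42 \u0664\u0662"

unicode_digits = re.findall(r"\d+", sample)    # Unicode mode (default)
ascii_digits = re.findall(r"(?a)\d+", sample)  # ASCII mode via inline flag

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
```

If the input is expected to contain only ASCII digits, ASCII mode (or an explicit [0-9]) states that intent directly.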

\D

Matches any character that is not a digit as defined by \d.

This is the negated form of \d.

\s

Matches a whitespace character.

The exact set depends on the active ASCII or Unicode mode.

When Flag::DotAll is enabled, the whitespace set is extended to include the line feed character. This keeps the behavior of \s consistent with . when matching across lines.

Enable DotAll mode programmatically via Flag::DotAll or inline using (?s). Enable ASCII mode programmatically via Flag::Ascii or inline using (?a).

\S

Matches any character that is not whitespace as defined by \s.

This is the negated form of \s.

\w

Matches a word character.

The exact definition depends on the active ASCII or Unicode mode:

  • In Unicode mode (default), it matches Unicode letters and digits, plus underscore.

  • In ASCII mode, it matches only ASCII letters, digits, and underscore.

The word-boundary anchor \b uses the same definition of a word character.

Enable ASCII mode programmatically using Flag::Ascii or inline with (?a).
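A short sketch of how the two modes split the same text, again using Python's re module as a stand-in rather than this engine's API. Python's exact Unicode word set may differ in edge cases, so treat the output as illustrative only.

```python
import re

# "naïve café_1", written with escapes so the accented letters are precomposed.
text = "na\u00efve caf\u00e9_1"

unicode_words = re.findall(r"\w+", text)    # accented letters are word characters
ascii_words = re.findall(r"(?a)\w+", text)  # accented letters split the words

# \b follows the same definition: "ve" is only a standalone word in ASCII mode.
unicode_ve = re.findall(r"\bve\b", text)
ascii_ve = re.findall(r"(?a)\bve\b", text)

print(len(unicode_words))  # 2
print(ascii_words)         # ['na', 've', 'caf', '_1']
print(unicode_ve)          # []
print(ascii_ve)            # ['ve']
```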

\W

Matches any character that is not a word character as defined by \w.

This is the negated form of \w.

\p{category}

Matches a character belonging to the given Unicode general category.

The category name is normalized before it is looked up:

  • Name lookup is case-insensitive.

  • Underscores are ignored.

  • Only ASCII letters and underscores (a-z, A-Z, _) are accepted in a category name.

  • Unknown or invalid category names raise a parse error.

Examples:

  • \p{Lu} matches any uppercase letter.

  • \p{Uppercase_Letter} is accepted as well.
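The normalization rules above can be sketched as a small lookup function. This is a minimal model, not the engine's parser: the function name normalize_category, the ValueError standing in for a parse error, and the abbreviated table are all assumptions made for illustration.

```python
import re

# Hypothetical, abbreviated table: normalized name -> short category name.
CATEGORIES = {
    "l": "L", "letter": "L",
    "lu": "Lu", "uppercaseletter": "Lu",
    "ll": "Ll", "lowercaseletter": "Ll",
    "nd": "Nd", "decimalnumber": "Nd",
}

def normalize_category(name: str) -> str:
    """Return the short category name, or raise ValueError (our stand-in parse error)."""
    # Only ASCII letters and underscores are accepted.
    if not re.fullmatch(r"[A-Za-z_]+", name):
        raise ValueError(f"invalid category name: {name!r}")
    # Underscores are ignored and the lookup is case-insensitive.
    key = name.replace("_", "").lower()
    if key not in CATEGORIES:
        raise ValueError(f"unknown category name: {name!r}")
    return CATEGORIES[key]

print(normalize_category("Lu"))                # Lu
print(normalize_category("Uppercase_Letter"))  # Lu
print(normalize_category("uppercaseletter"))   # Lu
```

Under this model, a name such as Lu1 is rejected as invalid and a name such as Foo is rejected as unknown.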

Unicode category matching is not affected by the following flags:

  • case-insensitive matching

  • ASCII mode

  • DotAll mode

For example, even in case-insensitive mode, \p{UppercaseLetter} still matches only uppercase letters.

\P{category}

Matches any character that is not in the given Unicode general category.

This is the negated form of \p{category}.
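Python's re module has no \p escape, so the sketch below models category membership with the standard unicodedata module instead. The helper names in_category and not_in_category are hypothetical. The model relies on the Unicode convention that a one-letter category (such as L) groups every two-letter category that starts with that letter.

```python
import unicodedata

def in_category(ch: str, short_name: str) -> bool:
    """Model of \\p{...}: short_name is a valid one- or two-letter category."""
    # unicodedata.category returns a two-letter code such as "Lu" or "Nd".
    return unicodedata.category(ch).startswith(short_name)

def not_in_category(ch: str, short_name: str) -> bool:
    """Model of \\P{...}: the plain negation of in_category."""
    return not in_category(ch, short_name)

print(in_category("A", "Lu"))      # True
print(in_category("a", "Lu"))      # False: lowercase stays lowercase, no case folding
print(in_category("a", "L"))       # True: Ll belongs to the L group
print(not_in_category("1", "L"))   # True
```

The model takes no flags at all, which mirrors the rule that category matching is independent of case-insensitive, ASCII, and DotAll modes.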

Supported Unicode Categories

  Short Name    Long Name
  ----------    --------------------
  C             Other
  L             Letter
  M             Mark
  N             Number
  P             Punctuation
  S             Symbol
  Z             Separator
  Cc            Control
  Cf            Format
  Cn            Unassigned
  Co            PrivateUse
  Cs            Surrogate
  Ll            LowercaseLetter
  Lm            ModifierLetter
  Lo            OtherLetter
  Lt            TitlecaseLetter
  Lu            UppercaseLetter
  Mc            SpacingMark
  Me            EnclosingMark
  Mn            NonspacingMark
  Nd            DecimalNumber
  Nl            LetterNumber
  No            OtherNumber
  Pc            ConnectorPunctuation
  Pd            DashPunctuation
  Pe            ClosePunctuation
  Pf            FinalPunctuation
  Pi            InitialPunctuation
  Po            OtherPunctuation
  Ps            OpenPunctuation
  Sc            CurrencySymbol
  Sk            ModifierSymbol
  Sm            MathSymbol
  So            OtherSymbol
  Zl            LineSeparator
  Zp            ParagraphSeparator
  Zs            SpaceSeparator

Legacy and Compatibility Expressions

The following escapes exist for compatibility with other regular expression engines. They are supported as legacy syntax and should not be used in new patterns unless compatibility is required.

Prefer using \s / \S and explicit Unicode categories via \p{category} instead.

\h / \H

\h matches a horizontal whitespace character; \H matches any character that is not horizontal whitespace.

These escape sequences exist for compatibility and may be disabled via the feature set.

\v / \V

\v matches a vertical whitespace character; \V matches any character that is not vertical whitespace.

These escape sequences exist for compatibility and may be disabled via the feature set.
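When migrating a pattern away from \h and \v, one option is to spell the sets out as explicit character classes. The sketch below assumes the PCRE2 definitions of horizontal and vertical whitespace; this section does not list the engine's own sets, so verify them before relying on these classes. Python's re module is used only to show that the classes behave as intended.

```python
import re

# ASSUMPTION: sets taken from the PCRE2 definitions of \h and \v,
# not from this engine's documentation.
HORIZONTAL = r"[\t \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]"
VERTICAL = r"[\n\x0B\f\r\u0085\u2028\u2029]"

horizontal = re.compile(HORIZONTAL)
vertical = re.compile(VERTICAL)

print(bool(horizontal.fullmatch("\t")))    # True
print(bool(horizontal.fullmatch("\n")))    # False
print(bool(vertical.fullmatch("\u2028")))  # True
print(vertical.split("a\u2028b\nc"))       # ['a', 'b', 'c']
```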

Differences from Common Regex Engines

If you are migrating from commonly used regular expression engines such as PCRE, PCRE2, ECMAScript, RE2, or std::regex, you may notice a few deliberate differences in how character type escapes behave.

These differences are intentional and aim to make matching semantics explicit, predictable, and independent of context, even when working with Unicode text.

Dot does not silently match newlines

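In many engines, the set of characters that . excludes depends on configuration or differs between engines. PCRE2, for example, ties it to the newline convention the pattern was compiled with, while ECMAScript's dot also excludes carriage return (U+000D), U+2028, and U+2029.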

In this engine:

  • . never matches a line feed by default.

  • Line feed is matched only when Flag::DotAll is explicitly enabled.

This strict separation avoids accidental cross-line matches and makes it immediately clear when a pattern is intended to span multiple lines.

Whitespace handling is aligned with DotAll

When Flag::DotAll is enabled, the definition of whitespace matched by \s is extended to include the line feed character.

Some engines treat dot and whitespace independently, which can lead to inconsistent behavior when switching between . and \s. This engine keeps both in sync to avoid surprising differences when refactoring patterns.
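The coupling described above can be expressed as a toy model. The functions dot_matches and space_matches are hypothetical, and the base whitespace set (Python's str.isspace() minus the line feed) is an assumption made only so the model runs; it is not this engine's definition. Only the treatment of the line feed reflects the rule stated here.

```python
def dot_matches(ch: str, dot_all: bool) -> bool:
    # "." matches everything except LF unless DotAll is enabled.
    return dot_all or ch != "\n"

def space_matches(ch: str, dot_all: bool) -> bool:
    # LF is whitespace only when DotAll is enabled.
    if ch == "\n":
        return dot_all
    # ASSUMPTION: every other character uses Python's notion of whitespace.
    return ch.isspace()

for flag in (False, True):
    # Both constructs always agree on the line feed.
    print(flag, dot_matches("\n", flag), space_matches("\n", flag))
```

Because both functions consult the same flag for the line feed, swapping . for \s in a pattern never changes whether the pattern can cross a line boundary.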

Unicode categories are not affected by flags

Unicode category matching via \p{category} and \P{category} is completely independent of matching flags such as:

  • case-insensitive matching

  • ASCII mode

  • DotAll mode

For example, even when case-insensitive matching is enabled, \p{UppercaseLetter} still matches only uppercase letters.

In contrast, some engines implicitly fold or reinterpret categories depending on flags, which can make patterns harder to reason about. This engine keeps Unicode semantics stable and explicit.

Word characters are explicitly defined

The definition of word characters used by \w, \W, and word-boundary anchors is tightly defined:

  • In Unicode mode, word characters are Unicode letters and digits plus underscore.

  • In ASCII mode, only ASCII letters, digits, and underscore are included.

Some engines expand word characters to include combining marks or additional categories. This engine intentionally keeps the definition conservative to ensure predictable word boundary behavior across scripts.
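The effect of excluding combining marks can be seen with Python's re module, which happens to make the same choice for \w and is used here purely as a stand-in for this engine. The sample text is an assumption chosen for illustration.

```python
import re

# "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn), not a precomposed "é".
text = "cafe\u0301 au lait"

# The combining mark is not a word character, so it ends the first word.
words = re.findall(r"\w+", text)

# For the same reason, a word boundary exists between "e" and the mark.
boundary = re.search(r"cafe\b", text)

print(words)                 # ['cafe', 'au', 'lait']
print(boundary is not None)  # True
```

If accented text should be treated as a single word, normalizing the input to a precomposed form (for example NFC) before matching is one way to get that result under a conservative definition.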

Legacy escapes are opt-in

Escapes such as \h, \H, \v, and \V exist solely for compatibility with other engines.

They are clearly marked as legacy and may be disabled via the feature set. For new patterns, prefer \s, \S, and explicit Unicode categories for clarity and long-term stability.

Summary

In short, this engine favors stable Unicode semantics, explicit intent, and consistent behavior across flags.