.. index::
    single: RegEx
    single: Flag
    single: Flags

********************************
The Regular Expression Interface
********************************

Creating New Instances
======================

Your main entry point into the Erbsland Regular Expression library is the :cpp:class:`RegEx` class. The very first step is always to compile a regular expression pattern into a :cpp:class:`RegEx` instance.

.. code-block:: cpp

    auto reEmail = RegEx::compile(
        R"(([a-zA-Z0-9\._%\+\-]+)@([a-zA-Z0-9\.\-]+\.[a-zA-Z]{2,}))"
    );

If the :doc:`pattern <../../syntax/overview>` contains an error, an :cpp:class:`Error` exception is thrown with detailed information about the problem. When you work with static or hard-coded expressions, it is usually best *not* to catch this exception and let the application fail fast. This makes it easy to locate faulty patterns during development using a debugger and avoids silently shipping broken expressions.

The instance created by the :cpp:any:`compile` function stores the compiled program of your expression, including all data required for efficient matching.

A :cpp:class:`RegEx` instance is **thread-safe**, which means you can safely call all of its methods from multiple threads in parallel.

Standard Flags
--------------

The :cpp:any:`compile` function accepts an optional parameter for :cpp:any:`flags`. These flags mirror common inline modifiers from the regular expression syntax:

* :cpp:any:`IgnoreCase`

  Ignores character case during matching. This is equivalent to placing :expression:`(?i)` at the beginning of a pattern.

* :cpp:any:`Multiline`

  Allows :expression:`^` and :expression:`$` to match at the beginning and end of individual lines. This corresponds to :expression:`(?m)`.

* :cpp:any:`DotAll`

  Makes :expression:`.` and :esc_code:`s` match line-break characters as well. This equals :expression:`(?s)`.

* :cpp:any:`Ascii`

  Restricts certain character classes to the ASCII range only. This matches the behaviour of :expression:`(?a)`.
* :cpp:any:`Verbose`

  Enables whitespace, line breaks and comments inside the pattern for better readability. This corresponds to :expression:`(?x)`.

All of these flags can be specified either directly in the pattern or via the compile-time flags parameter.

Enable Line-Break Folding
-------------------------

The flag :cpp:any:`CRLF` can only be set at compile time and enables line-break folding. When this feature is active, the library treats any sequence of :unicode:`d` followed by :unicode:`a` (CRLF) as a single logical new-line character. As a result, a pattern that matches :expression:`\n` will automatically match both ``LF`` *and* ``CRLF``.

Although a ``CRLF`` sequence behaves like a single line break during matching, its original representation is preserved internally. If you capture the sequence, the result will either contain a single line-feed character or two characters (carriage return and line feed), depending on the original input.

Settings
--------

You can further customize the behaviour of the regular expression parser and engine using a :cpp:any:`Settings` instance. This allows you to:

* Enable or disable specific syntax features
* Tighten internal limits
* Configure timeouts for matching operations

These settings are especially useful when your application accepts patterns from untrusted or external sources, such as user-provided configuration files.

Using a Regular Expression Instance
===================================

A :cpp:class:`RegEx` instance provides several groups of matching and transformation functions:

* **match**

  Matches the pattern at the beginning of a given text.

* **fullMatch**

  Matches the pattern against the entire text and only succeeds if the whole input matches.

* **findFirst**

  Searches for the first location in the text where the pattern matches.

* **findAll**

  Finds all matching locations in the text. These functions return a coroutine-based generator, allowing you to stop iteration early.
* **collectAll**

  Similar to **findAll**, but eagerly collects all matches into a ``std::vector``.

* **replaceAll**

  Replaces all matches in a text using either a replacement pattern or a custom replacement function.

Return Values
-------------

All matching methods return one or more instances of a match type wrapped in a shared pointer:

* Non-view methods return :cpp:any:`Match`
* ``...View`` methods return :cpp:any:`MatchView`

For **match**, **fullMatch** and **findFirst**, a return value of ``nullptr`` indicates that no match was found.

The **findAll**, **collectAll** and **replaceAll** methods never use ``nullptr`` to represent matches. If no matches are found, they instead return an empty result (for example, an empty generator or an empty ``std::vector``).

This design allows you to distinguish cleanly between *no match* and *an exceptional condition* without additional state checks.

The ``...View`` Variants
------------------------

Many functions also provide a ``...View`` variant. For example, :cpp:any:`match` has a corresponding :cpp:any:`matchView` method. These variants operate exclusively on views of the input text and return views into the original string for all match results.

This approach avoids allocations and string copies, making it the most efficient option. However, you are responsible for ensuring that the original text remains alive for as long as the match object is in use.

.. rubric:: Pros:

* No unnecessary allocations or string copies
* Ideal for nested matching scenarios (e.g. matching inside a captured group)

.. rubric:: Cons:

* You must ensure that the original text outlives all match results

.. code-block:: cpp

    auto text = std::string("...");
    auto reHeader = el::re::RegEx::compile(R"((?i)]*>(.*?))");
    auto match = reHeader->findFirstView(text);
    if (match != nullptr) {
        auto title = std::string(match->contentView(1));
        // ...
    }

.. code-block:: cpp

    // !!! BAD EXAMPLE !!!
    auto match = reHeader->findFirstView(std::string("..."));
    // The matched text no longer exists.
    auto title = std::string(match->contentView(1)); // undefined behavior!

If you work with temporary strings or if match results need to outlive the original text, use the non-view variants. These methods create internal copies of the matched text and are safe in such scenarios.

Replacement Patterns
--------------------

Replacement patterns may contain placeholders of the form ``{}`` or ``{}``, which reference numeric or named capture groups. To insert literal ``{`` or ``}`` characters, escape them by doubling: use ``{{`` or ``}}``.

Error Handling
==============

Errors can occur not only while compiling a pattern, but also during the matching process itself:

* If the matched text contains invalid UTF-8 sequences, an :cpp:any:`Encoding` error is thrown.
* If a built-in or manually configured resource or time limit is exceeded, a :cpp:any:`Timeout` or :cpp:any:`Limit` error is thrown.

For this reason, we recommend enclosing your matching code in a ``try { ... } catch (...) {}`` block at the earliest convenient location where you can properly handle these errors.

Parser errors during compilation usually indicate programming mistakes and should typically *not* be caught. Letting the application fail fast makes such issues easier to detect during development.

.. code-block:: cpp

    auto text = std::string("...");
    // Do not catch parser errors here – let the application fail fast.
    auto reHeader = el::re::RegEx::compile(R"((?i)]*>(.*?))");
    std::string title;
    try {
        auto match = reHeader->findFirstView(text);
        if (match != nullptr) {
            title = std::string(match->contentView(1));
        }
    } catch (const el::re::Error &error) {
        title = std::format("", error);
    }
    // Process the matching result.

UTF-16 and UTF-32 Strings
=========================

Most matching functions in the :cpp:class:`RegEx` API provide overloads for UTF-16 and UTF-32 encoded strings.
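
The three encodings represent the same text with different numbers of code units, so offsets and lengths depend on the string type you pass in. As a minimal sketch using only the C++20 standard library (it does not call this library; the sample text and the variable names are illustrative assumptions), the following program counts the code units of one text in each encoding:

.. code-block:: cpp

    #include <iostream>
    #include <string_view>

    // The same seven code points in three encodings: "Gr", U+00FC, U+00DF,
    // "e", a space and U+1F600. Universal character names keep the source ASCII.
    constexpr std::u8string_view textUtf8 = u8"Gr\u00FC\u00DFe \U0001F600";
    constexpr std::u16string_view textUtf16 = u"Gr\u00FC\u00DFe \U0001F600";
    constexpr std::u32string_view textUtf32 = U"Gr\u00FC\u00DFe \U0001F600";

    int main() {
        // size() reports code units of the respective encoding, not characters.
        std::cout << "UTF-8 code units:  " << textUtf8.size() << '\n';
        std::cout << "UTF-16 code units: " << textUtf16.size() << '\n';
        std::cout << "UTF-32 code units: " << textUtf32.size() << '\n';
        return 0;
    }

The sample text has seven code points, yet it occupies 12 UTF-8 code units, 8 UTF-16 code units and 7 UTF-32 code units. An offset computed for one string type therefore cannot be reused with another.
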
These overloads behave exactly like their UTF-8 counterparts in terms of matching semantics, flags, and error handling. The only difference is the type of match object they return, which corresponds to the underlying string type.

Depending on the input, the methods return:

* UTF-8 strings → :cpp:class:`Match` or :cpp:class:`MatchView`
* UTF-16 strings → :cpp:class:`Match16` or :cpp:class:`Match16View`
* UTF-32 strings → :cpp:class:`Match32` or :cpp:class:`Match32View`

This ensures that positional information and content access are always expressed in units appropriate for the input string, while keeping the overall API consistent across all supported encodings.

.. button-ref:: match
    :ref-type: doc
    :color: success
    :align: center
    :expand:
    :class: sd-fs-5 sd-font-weight-bold sd-p-2 sd-my-4

    The Match Interface →

Interfaces and Types
====================

.. doxygenclass:: erbsland::re::RegEx
    :members:

.. doxygenenum:: erbsland::re::Flag

.. doxygenclass:: erbsland::re::Flags
    :members: