The Input Interface

The Input interface allows you to provide custom input sources to the regular expression engine.

Because the Erbsland Regular Expression library is based on a Thompson NFA, input is consumed sequentially and processed in a highly efficient streaming fashion. This makes it possible to match patterns not only against in-memory strings, but also against data sources such as files, network streams, or custom iterators.

The input interface is a low-level extension point intended for advanced use cases. If you only need to match against strings, the built-in string overloads are usually the better and simpler choice.

How to Implement Your Source

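To implement a custom input source, derive from one of the following classes:

  • Input or InputForView

  • Input16 or Input16ForView (UTF-16)

  • Input32 or Input32ForView (UTF-32)

The ForView variants return view-based match objects; the other classes return owning match objects that hold a copy of the matched text.
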
The chosen base class determines the character type and the type of match objects returned by the matching API.

Your implementation must override the abstract methods defined by InputBase (read, peek, and skip) and the createMatch method of the chosen base class. These methods form the contract between your input source and the matching engine.

Implementation Requirements

The most important method is read. It is invoked in the hot loop of the matching engine and must therefore be implemented as efficiently as possible.

When implementing an input source, the following rules must be respected:

  • read must return the next character together with its position.

  • When the end of the input is reached, a zero character must be returned.

  • Repeated calls to read after the end of the input must continue to succeed and keep returning a zero character.

  • The returned position must advance monotonically and must uniquely identify the character within the input stream.
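The rules above can be illustrated without the library. The following is a minimal sketch, not the library's implementation: CharAndPositionSketch, StringReaderSketch, and countCharacters are hypothetical names, and the member names of the real CharAndPosition are an assumption, since they are not shown here.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the library's CharAndPosition (member names assumed).
struct CharAndPositionSketch {
    char32_t character;
    std::size_t position;
};

// A reader over a UTF-32 string that follows the rules listed above.
class StringReaderSketch {
public:
    explicit StringReaderSketch(std::u32string text) : _text{std::move(text)} {}

    auto read() -> CharAndPositionSketch {
        if (_position >= _text.size()) {
            // End of input: always succeed and keep returning a zero character.
            return {U'\0', _position};
        }
        // The index is unique per character and only ever increases.
        const auto position = _position++;
        return {_text[position], position};
    }

private:
    std::u32string _text;
    std::size_t _position = 0;
};

// Drains the reader the way a consumer would: stop at the first zero character.
inline auto countCharacters(StringReaderSketch &reader) -> std::size_t {
    std::size_t count = 0;
    while (reader.read().character != U'\0') {
        ++count;
    }
    return count;
}
```

Because read keeps returning a zero character after the end, a consumer that reads one step too far does not need any special error handling.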

If line-break folding (CRLF handling) is enabled for the regular expression, your input source must additionally provide working implementations of:

  • peek to look ahead without consuming input

  • skip to advance the input by a given number of characters

These methods allow the engine to treat CRLF sequences as a single logical line break while preserving correct positional information.
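How the engine folds line breaks internally is not described here. The sketch below only shows one plausible way a consumer could combine read, peek, and skip to report CRLF as a single line break. All names (CharAndPositionSketch, StringInputSketch, readFolded) are hypothetical, and reporting the folded line break at the position of the CR is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the library's CharAndPosition (member names assumed).
struct CharAndPositionSketch {
    char32_t character;
    std::size_t position;
};

// A small in-memory input that offers read, peek, and skip.
class StringInputSketch {
public:
    explicit StringInputSketch(std::u32string text) : _text{std::move(text)} {}

    auto read() -> CharAndPositionSketch {
        const auto result = peek();
        skip(1);
        return result;
    }

    auto peek() const -> CharAndPositionSketch {
        if (_position >= _text.size()) {
            return {U'\0', _position}; // end of input: zero character
        }
        return {_text[_position], _position};
    }

    void skip(std::size_t count) {
        // Never move past the end of the text.
        const auto remaining = _text.size() - _position;
        _position += (count < remaining) ? count : remaining;
    }

private:
    std::u32string _text;
    std::size_t _position = 0;
};

// Reads one logical character: CR followed by LF is reported as a single LF
// located at the position of the CR (an assumption made for this sketch).
inline auto readFolded(StringInputSketch &input) -> CharAndPositionSketch {
    auto current = input.read();
    if (current.character == U'\r' && input.peek().character == U'\n') {
        input.skip(1); // consume the LF that belongs to this line break
        current.character = U'\n';
    }
    return current;
}
```

The character after a folded line break still reports its real position in the input, so positions stay unique and increasing even though one character was skipped.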

Incorrect or incomplete implementations may lead to incorrect matches or undefined behavior.

Example Implementation

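The following example shows a complete implementation of a custom input source that reads characters from a std::vector<char32_t>. While simplified, it demonstrates all required methods and lifetime rules: the input only references the vector, while each match object keeps its own copy of the matched text.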

#pragma once


#include <erbsland/re/Input.hpp>
#include <erbsland/re/Match.hpp>
#include <erbsland/re/RegEx.hpp>

#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>


// A match object that owns a copy of the matched text.
class VectorMatch : public erbsland::re::Match32 {
public:
    using Match32::Match32;

public:
    VectorMatch(
        const el::re::ConstRegExPtr &regEx,
        const el::re::CaptureGroupList &captureGroupList,
        std::u32string &&capturedContent,
        const std::size_t offset)
    :
        Match32(regEx, captureGroupList),
        _capturedContent{std::move(capturedContent)},
        _offset{offset} {

    }
    ~VectorMatch() override = default;

protected:
    [[nodiscard]] auto getContentForGroup(
        const el::re::CaptureGroup &group) const noexcept -> std::u32string_view override {

        // Group positions are absolute; the stored copy starts at `_offset`.
        return std::u32string_view{_capturedContent}.substr(
            group.begin() - _offset, group.size());
    }

private:
    std::u32string _capturedContent; // copy of the matched text
    std::size_t _offset;             // absolute position of the first copied character
};


// An input that reads from a vector. The vector must outlive the input.
class VectorInput : public erbsland::re::Input32 {
public:
    explicit VectorInput(const std::vector<char32_t> &textVector) : _vectorRef(textVector) {}
    ~VectorInput() override = default;

public:
    [[nodiscard]] auto read() -> el::re::CharAndPosition override {
        if (_position >= _vectorRef.size()) {
            return {0, _position}; // end of input: keep returning a zero character
        }
        const auto position = _position;
        const auto character = _vectorRef.at(position);
        _position += 1;
        return {character, position};
    }
    [[nodiscard]] auto peek() -> el::re::CharAndPosition override {
        if (_position >= _vectorRef.size()) {
            return {0, _position}; // same end-of-input behavior as read()
        }
        const auto character = _vectorRef.at(_position);
        return {character, _position};
    }
    void skip(const std::size_t characterCount) override {
        _position += characterCount;
        if (_position > _vectorRef.size()) {
            _position = _vectorRef.size(); // never move past the end
        }
    }
    [[nodiscard]] auto createMatch(
        el::re::ConstRegExPtr regEx,
        el::re::CaptureGroupList captureGroupList) -> el::re::Match32Ptr override {

        // Assumes the first capture group spans the whole match.
        const auto &fullMatch = captureGroupList.front();
        const std::size_t offset = fullMatch.begin();
        const std::size_t size = fullMatch.size();
        // Copy the matched characters, so the match stays valid without the vector.
        std::u32string capturedContent;
        capturedContent.reserve(size);
        for (std::size_t i = 0; i < size; ++i) {
            capturedContent.push_back(_vectorRef.at(offset + i));
        }
        return std::make_shared<VectorMatch>(
            std::move(regEx),
            std::move(captureGroupList),
            std::move(capturedContent), offset);
    }

private:
    const std::vector<char32_t> &_vectorRef;
    std::size_t _position = 0;
};
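getContentForGroup in the example above subtracts the match offset from the group position, because capture group positions refer to the whole input while the stored copy only contains the matched range. The following standalone sketch isolates that arithmetic. GroupSketch and contentForGroup are hypothetical names that do not exist in the library.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a capture group: absolute start position and length.
struct GroupSketch {
    std::size_t begin;
    std::size_t size;
};

// Returns the text of `group` from a buffer that holds only the matched range.
// `matchOffset` is the absolute input position of the first buffered character.
inline auto contentForGroup(
    std::u32string_view capturedContent,
    std::size_t matchOffset,
    GroupSketch group) -> std::u32string_view {

    // Translate the absolute group position into an index into the buffer.
    return capturedContent.substr(group.begin - matchOffset, group.size);
}
```

For example, if the input is "xx2024-05yy" and the match "2024-05" starts at position 2, a group that begins at position 7 with size 2 maps to index 5 of the copied text and yields "05".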

Interfaces and Types

class InputBase : public std::enable_shared_from_this<InputBase>

The abstract base class for inputs for regular expression matching.

Subclassed by erbsland::re::Input, erbsland::re::Input16, erbsland::re::Input16ForView, erbsland::re::Input32, erbsland::re::Input32ForView, erbsland::re::InputForView

Public Functions

virtual CharAndPosition read() = 0

Read the next character from the input and advance the position.

Returns:

1. The next character from the input, or an end-of-input character at the end of the stream.

2. The start position of the read character (the position of the first byte of the read character).

virtual CharAndPosition peek() = 0

Peek at the next character from the input, do not advance the position.

Returns:

1. The next character from the input, or an end-of-input character at the end of the stream.

2. The start position of the peeked character (the position of the first byte of the peeked character).

virtual void skip(std::size_t characterCount) = 0

Skip a number of characters.

inline std::pair<impl::Char, InputPosition> readChar()

Read the next character from the input and advance the position. Helper to return a Char instance instead of char32_t.

inline std::pair<impl::Char, InputPosition> peekChar()

Peek at the next character from the input, do not advance the position. Helper to return a Char instance instead of char32_t.

class Input : public erbsland::re::InputBase

An abstract input for regular expression matching.

Public Functions

virtual MatchPtr createMatch(ConstRegExPtr regEx, CaptureGroupList captureGroupList) = 0

Create a match object for this input.

Parameters:
  • regEx – The regular expression object that created the match.

  • captureGroupList – The list of capture groups.

Returns:

An owning match result that holds a copy of the matched text.

class Input16 : public erbsland::re::InputBase

An abstract UTF-16 input for regular expression matching.

Public Functions

virtual Match16Ptr createMatch(ConstRegExPtr regEx, CaptureGroupList captureGroupList) = 0

Create a match object for this input.

Parameters:
  • regEx – The regular expression object that created the match.

  • captureGroupList – The list of capture groups.

Returns:

An owning match result that holds a copy of the matched text.

class Input32 : public erbsland::re::InputBase

An abstract UTF-32 input for regular expression matching.

Public Functions

virtual Match32Ptr createMatch(ConstRegExPtr regEx, CaptureGroupList captureGroupList) = 0

Create a match object for this input.

Parameters:
  • regEx – The regular expression object that created the match.

  • captureGroupList – The list of capture groups.

Returns:

An owning match result that holds a copy of the matched text.

class InputForView : public erbsland::re::InputBase

An abstract input for regular expression matching for view-based methods.

Public Functions

virtual MatchViewPtr createMatch(ConstRegExPtr regEx, CaptureGroupList captureGroupList) = 0

Create a match object for this input.

Parameters:
  • regEx – The regular expression object that created the match.

  • captureGroupList – The list of capture groups.

Returns:

A view-based match result.

class Input16ForView : public erbsland::re::InputBase

An abstract UTF-16 input for regular expression matching for view-based methods.

Public Functions

virtual Match16ViewPtr createMatch(ConstRegExPtr regEx, CaptureGroupList captureGroupList) = 0

Create a match object for this input.

Parameters:
  • regEx – The regular expression object that created the match.

  • captureGroupList – The list of capture groups.

Returns:

A view-based match result.

class Input32ForView : public erbsland::re::InputBase

An abstract UTF-32 input for regular expression matching for view-based methods.

Public Functions

virtual Match32ViewPtr createMatch(ConstRegExPtr regEx, CaptureGroupList captureGroupList) = 0

Create a match object for this input.

Parameters:
  • regEx – The regular expression object that created the match.

  • captureGroupList – The list of capture groups.

Returns:

A view-based match result.

struct CharAndPosition

A read character and its start position.

using erbsland::re::InputPosition = std::size_t

A position in the input stream. This is a valid position in the input. The concrete meaning depends on the used Input implementation.

  • UTF-8: the position of the start byte (char, char8_t) of a character.

  • UTF-16: the position of the start word (char16_t) of a character.

  • UTF-32: the position of the character (char32_t).
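For UTF-8 input, the position of a character is the index of its start byte, so positions are not consecutive when the text contains multi-byte characters. The following sketch illustrates this under the assumption of well-formed UTF-8. utf8StartPositions is a hypothetical helper, not part of the library.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Returns the start-byte position of every character in a UTF-8 string.
// Assumes well-formed UTF-8; no validation is performed.
inline auto utf8StartPositions(std::string_view text) -> std::vector<std::size_t> {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const auto byte = static_cast<unsigned char>(text[i]);
        // Continuation bytes have the bit pattern 10xxxxxx;
        // every other byte starts a new character.
        if ((byte & 0xC0U) != 0x80U) {
            positions.push_back(i);
        }
    }
    return positions;
}
```

For the text "aäb", where "ä" is encoded as the two bytes C3 A4, the characters start at the positions 0, 1, and 3.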