Write the Main Functionality

Now you’ll implement the core logic of our small command-line tool:

Read an HTML file into memory.
Use the Erbsland Regular Expression Library to find HTML tags.
Extract and count attribute names inside these tags.
Print the results as a simple summary table.

Create `main.cpp`

Create and open the file:
```
$ nano htmlstats/src/main.cpp
```

Add the following code:

#include <erbsland/all_re.hpp>

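#include <algorithm>
#include <cstddef>
#include <format>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>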

using namespace el::re;
using CountMap = std::unordered_map<std::string, std::size_t>;
using Rows = std::vector<std::pair<std::string, std::size_t>>;

auto readFileToString(const std::string &path) -> std::string {
    std::ifstream in{path, std::ios::in};
    if (!in) {
        throw std::runtime_error(std::format("Failed to open file: {}", path));
    }
    return {std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}};
}

auto sortedRowsFromCounts(const CountMap &tagCounts) -> Rows {
    Rows rows;
    rows.reserve(tagCounts.size());
    for (const auto &[tag, count] : tagCounts) {
        rows.emplace_back(tag, count);
    }
    std::ranges::sort(rows, [](const auto &a, const auto &b) {
        return a.second != b.second ? (a.second > b.second) : (a.first < b.first);
    });
    return rows;
}

// Prints a title line followed by one "count name" row per entry.
void printStats(const Rows &rows, std::string_view title) {
    std::cout << std::format("{}\n", title);
    for (const auto &[name, count] : rows) {
        std::cout << std::format("{:<10} {:>}\n", count, name);
    }
}

int main(int argc, char **argv) {
    try {
        if (argc < 2) {
            throw std::runtime_error(std::format("Usage: {} <html-file>", argv[0]));
        }
        const auto reTag = RegEx::compile(R"((?is)<([a-z]+)([^>]*)>)");
        const auto reAttribute = RegEx::compile(R"((?is)\s+([a-z]+)=("[^"]*"|\S+)?)");
        auto html = readFileToString(argv[1]);
        auto tagCounts = CountMap{};
        auto attributeCounts = CountMap{};
        std::cout << "Scanning file\n";
        for (const auto &match : reTag->findAllView(html)) {
            tagCounts[std::string(match->contentView(1))] += 1;
            for (const auto &attributeMatch : reAttribute->findAllView(match->contentView(2))) {
                attributeCounts[std::string(attributeMatch->contentView(1))] += 1;
            }
        }
        const auto tagRows = sortedRowsFromCounts(tagCounts);
        const auto attributeRows = sortedRowsFromCounts(attributeCounts);
        printStats(tagRows, "Tag Statistics:");
        printStats(attributeRows, "Attribute Statistics:");
        return 0;
    } catch (const std::runtime_error &error) {
        std::cerr << error.what() << '\n';
        return 1;
    }
}

Code Breakdown

This example intentionally uses regular expressions for a pragmatic task: gathering quick statistics from HTML. It is not a full HTML parser (HTML is not a regular language), but for “count tags and attributes” it works well and keeps the example compact and easy to understand.

Include and namespace setup

At the top of the file you include the library’s public API:

#include <erbsland/all_re.hpp>

Then you make the regular expression types directly available:

using namespace el::re;

This allows you to use types such as RegEx directly, without prefixing them with the full namespace.

Two compiled expressions, reused many times

The key performance and design idea is: compile once, then reuse.

Inside main you compile two regular expression patterns:

A tag expression:

const auto reTag = RegEx::compile(R"((?is)<([a-z]+)([^>]*)>)");

An attribute expression:

const auto reAttribute = RegEx::compile(R"((?is)\s+([a-z]+)=("[^"]*"|\S+)?)");

Both expressions are stored in const auto variables so they can be applied repeatedly while scanning the input text.

The inline flags (?is) are important here:

i enables case-insensitive matching, so DIV and div are treated the same.
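s enables dot-matches-newline mode. Neither pattern currently contains a dot, so the flag changes nothing today; it is a defensive default in case a pattern is later extended with . to cover attributes that span multiple lines.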

Capturing exactly what you need

The tag pattern <([a-z]+)([^>]*)> uses capture groups to isolate the relevant parts of each match:

Group 1: ([a-z]+) — the tag name (for example div, a, span).
Group 2: ([^>]*) — the attribute region (everything between the tag name and >).

When iterating over matches, you access these capture groups via contentView(n), which returns a view into the scanned text:

match->contentView(1) → tag name
match->contentView(2) → attribute text

This “compile → match → access capture groups” workflow is central to how the Erbsland Regular Expression Library is used throughout this example.

Scanning with `findAllView` (and why it matters)

The main scan loop uses:

reTag->findAllView(html)

This call returns a coroutine-based generator that iterates over all matches in the input string. The *View variant is used here, which means all match results reference the original string instead of creating copies. This avoids unnecessary allocations and provides excellent performance when scanning large inputs.

Inside the loop, a nested scan is performed over the attribute region of each tag:

reAttribute->findAllView(match->contentView(2))

Here, the attribute expression operates only on the view returned from the tag match. This keeps responsibilities clean: one expression finds tags, the other extracts attributes.

Counting tags and attribute names

The counting logic is intentionally minimal and easy to read:

tagCounts[std::string(match->contentView(1))] += 1;

At this point, the tag name view is converted into a std::string key for use in a std::unordered_map. This copies every name, but it keeps the example simple and makes the map independent of the input buffer. In a real application, you could avoid the copies, for example by keying the map on std::string_view for as long as the input string stays alive and unmodified.

The same approach is used for attribute names:

attributeCounts[std::string(attributeMatch->contentView(1))] += 1;

Note

While the *View variants (such as findAllView) are the most efficient way to work with compiled regular expressions, you must ensure that the original input string remains alive for as long as you use the match results.

In contrast, the non-view methods (for example findAll) create copies of the matched text. This allows you to discard the original input string while continuing to work with the results, at the cost of additional allocations.

From raw counts to readable output

After scanning, sortedRowsFromCounts converts the unordered maps into sortable row lists. The results are then printed using printStats, which formats aligned output via std::format.

This separation keeps the matching logic independent from presentation and makes the code easier to extend later.

Error handling and user experience

Error handling in the Erbsland Regular Expression Library is intentionally simple. The library uses a single Error type, derived from std::runtime_error. As a result, this example wraps main in a single try/catch block.

In real-world applications with static regular expression patterns, you may decide not to catch errors during pattern compilation. Letting the program terminate on an invalid pattern often simplifies debugging and makes errors immediately visible.

For runtime operations such as match, findFirst, or findAll, it is important to catch Error, as errors may occur due to:

Invalid UTF-8 sequences in the input text.
Timeouts (when configured or triggered by malformed patterns).

Note

This application is intentionally small and serves as an example for using the Erbsland Regular Expression Library in a real-world-style workflow.

The regular expression patterns used here are heuristics, not a complete HTML grammar. They will not correctly handle all cases, such as:

Boolean attributes (for example checked or disabled)
Self-closing tags
Comments, scripts, or malformed markup

For production-grade HTML processing, a dedicated HTML parser should be used. In this tutorial, the focus is on demonstrating how to compile expressions, iterate matches, and work with capture groups efficiently.

The Current Project State

At this point, your project directory structure should look like this, with the newly added components marked:

htmlstats
    ├── erbsland-cpp-re
    ├── htmlstats
    │   ├── src
    │   │   └── main.cpp             # [new] The main entry point
    │   └── CMakeLists.txt
    └── CMakeLists.txt
