Write the Main Functionality

Now you’ll implement the core logic of our small command-line tool:

  • Read an HTML file into memory.

  • Use the Erbsland Regular Expression Library to find HTML tags.

  • Extract and count attribute names inside these tags.

  • Print the results as a simple summary table.

Create main.cpp

  1. Create and open the file:

    $ nano htmlstats/src/main.cpp
    
  2. Add the following code:

    #include <erbsland/all_re.hpp>

    #include <algorithm>
    #include <cstddef>
    #include <format>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <stdexcept>
    #include <string>
    #include <string_view>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using namespace el::re;
    using CountMap = std::unordered_map<std::string, std::size_t>;
    using Rows = std::vector<std::pair<std::string, std::size_t>>;

    // Read the whole file into a string; throws if the file cannot be opened.
    auto readFileToString(const std::string &path) -> std::string {
        std::ifstream in{path, std::ios::in};
        if (!in) {
            throw std::runtime_error(std::format("Failed to open file: {}", path));
        }
        return {std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}};
    }

    // Convert the map into rows sorted by count (descending), then by name (ascending).
    auto sortedRowsFromCounts(const CountMap &counts) -> Rows {
        Rows rows;
        rows.reserve(counts.size());
        for (const auto &[name, count] : counts) {
            rows.emplace_back(name, count);
        }
        std::ranges::sort(rows, [](const auto &a, const auto &b) {
            return a.second != b.second ? (a.second > b.second) : (a.first < b.first);
        });
        return rows;
    }

    // Print a title line followed by one "count name" line per row.
    void printStats(const Rows &rows, std::string_view title) {
        std::cout << std::format("{}\n", title);
        for (const auto &[name, count] : rows) {
            std::cout << std::format("{:<10} {:>}\n", count, name);
        }
    }

    int main(int argc, char **argv) {
        try {
            if (argc < 2) {
                throw std::runtime_error(std::format("Usage: {} <html-file>", argv[0]));
            }
            // Compile both patterns once; they are reused for every match below.
            const auto reTag = RegEx::compile(R"((?is)<([a-z]+)([^>]*)>)");
            const auto reAttribute = RegEx::compile(R"((?is)\s+([a-z]+)=("[^"]*"|\S+)?)");
            auto html = readFileToString(argv[1]);
            auto tagCounts = CountMap{};
            auto attributeCounts = CountMap{};
            std::cout << "Scanning file\n";
            for (const auto &match : reTag->findAllView(html)) {
                // Group 1 is the tag name, group 2 the attribute region.
                tagCounts[std::string(match->contentView(1))] += 1;
                for (const auto &attributeMatch : reAttribute->findAllView(match->contentView(2))) {
                    attributeCounts[std::string(attributeMatch->contentView(1))] += 1;
                }
            }
            const auto tagRows = sortedRowsFromCounts(tagCounts);
            const auto attributeRows = sortedRowsFromCounts(attributeCounts);
            printStats(tagRows, "Tag Statistics:");
            printStats(attributeRows, "Attribute Statistics:");
            return 0;
        } catch (const std::runtime_error &error) {
            // Report errors on stderr so they do not mix with the statistics output.
            std::cerr << error.what() << '\n';
            return 1;
        }
    }
    

Code Breakdown

This example intentionally uses regular expressions for a pragmatic task: gathering quick statistics from HTML. It is not a full HTML parser (HTML is not a regular language), but for “count tags and attributes” it works well and keeps the example compact and easy to understand.

Include and namespace setup

At the top of the file you include the library’s public API:

#include <erbsland/all_re.hpp>

Then you make the regular expression types directly available:

using namespace el::re;

This allows you to use types such as RegEx directly, without prefixing them with the full namespace.

Two compiled expressions, reused many times

The key performance and design idea is: compile once, then reuse.

Inside main you compile two regular expression patterns:

  1. A tag expression:

    const auto reTag = RegEx::compile(R"((?is)<([a-z]+)([^>]*)>)");
    
  2. An attribute expression:

    const auto reAttribute = RegEx::compile(R"((?is)\s+([a-z]+)=("[^"]*"|\S+)?)");
    

Both expressions are stored in const auto variables so they can be applied repeatedly while scanning the input text.

The inline flags (?is) are important here:

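  • i enables case-insensitive matching, so both DIV and div match. The captured name keeps its original case, so the two are still counted under separate keys unless you normalize them.

  • s enables dot-matches-newline mode. Neither pattern contains a dot, so the flag has no effect here; it is a defensive default in case the patterns are later extended to match across lines.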

Capturing exactly what you need

The tag pattern <([a-z]+)([^>]*)> uses capture groups to isolate the relevant parts of each match:

  • Group 1: ([a-z]+) — the tag name (for example div, a, span).

  • Group 2: ([^>]*) — the attribute region (everything between the tag name and >).

When iterating over matches, you access these capture groups via contentView(n):

  • match->contentView(1) → tag name

  • match->contentView(2) → attribute text

This “compile → match → access capture groups” workflow is central to how the Erbsland Regular Expression Library is used throughout this example.
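To see the two capture groups in isolation, here is a minimal sketch that applies the same tag pattern with std::regex from the standard library. It is a stand-in, not the Erbsland API: std::regex has no inline flag syntax, so case-insensitivity is passed as a constructor option, and splitFirstTag is a hypothetical helper name.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Stand-in sketch using std::regex, NOT the Erbsland API.
// Returns {tag name, attribute region} for the first opening tag in `text`,
// or std::nullopt if no opening tag is found.
auto splitFirstTag(const std::string &text)
    -> std::optional<std::pair<std::string, std::string>> {
    // Same pattern as reTag; icase replaces the inline (?i) flag.
    static const std::regex reTag("<([a-z]+)([^>]*)>", std::regex::icase);
    std::smatch match;
    if (!std::regex_search(text, match, reTag)) {
        return std::nullopt;
    }
    // match[0] is the whole tag; groups 1 and 2 are the captures.
    return std::make_pair(match[1].str(), match[2].str());
}
```

For the input `see <A href="x.html">link</A>`, group 1 is `A` and group 2 is ` href="x.html"`, including the leading space that the attribute pattern later relies on.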

Scanning with findAllView (and why it matters)

The main scan loop uses:

reTag->findAllView(html)

This call returns a coroutine-based generator that iterates over all matches in the input string. The *View variant is used here, so every match result references the original string instead of copying it. This avoids per-match allocations, which matters when scanning large inputs.

Inside the loop, a nested scan is performed over the attribute region of each tag:

reAttribute->findAllView(match->contentView(2))

Here, the attribute expression operates only on the view returned from the tag match. This keeps responsibilities clean: one expression finds tags, the other extracts attributes.
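The nested structure of the two scans can be sketched with the standard library alone. The following uses std::regex as a stand-in (it is not the Erbsland API), and the names ScanResult and scanHtml are hypothetical. The inner iterator runs only over the iterator range of group 2, which plays the role of the view in the original loop.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

using Counts = std::map<std::string, std::size_t>;

struct ScanResult {
    Counts tags;
    Counts attributes;
};

// Stand-in sketch using std::regex, NOT the Erbsland API.
auto scanHtml(const std::string &html) -> ScanResult {
    // Compiled once (static), then reused for every call and every match.
    static const std::regex reTag("<([a-z]+)([^>]*)>", std::regex::icase);
    static const std::regex reAttribute(R"(\s+([a-z]+)=("[^"]*"|\S+)?)", std::regex::icase);
    ScanResult result;
    const std::sregex_iterator end;
    for (std::sregex_iterator tag(html.begin(), html.end(), reTag); tag != end; ++tag) {
        result.tags[(*tag)[1].str()] += 1;
        // Inner scan: restricted to the attribute region (group 2) of this tag.
        // The sub-match iterators point into `html`, so no text is copied here.
        const auto &region = (*tag)[2];
        for (std::sregex_iterator attr(region.first, region.second, reAttribute); attr != end; ++attr) {
            result.attributes[(*attr)[1].str()] += 1;
        }
    }
    return result;
}
```

Because the attribute pattern requires an equals sign, an attribute such as disabled in `<input disabled>` is not counted, and closing tags such as `</div>` do not match the tag pattern at all.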

Counting tags and attribute names

The counting logic is intentionally minimal and easy to read:

tagCounts[std::string(match->contentView(1))] += 1;

At this point, the tag name view is converted into a std::string key for use in a std::unordered_map. While this introduces a copy, it keeps the example simple. In a real application, you could avoid the copy, for example by keying the map with std::string_view for as long as the input string stays alive.

The same approach is used for attribute names:

attributeCounts[std::string(attributeMatch->contentView(1))] += 1;
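One consequence of this counting scheme: the patterns match case-insensitively, but the captured text keeps its original case, so DIV and div end up under different map keys. If you want them merged, you can normalize the key before counting. The helper below is a hypothetical addition, not part of the tutorial code, and it only handles ASCII letters, which is all the patterns capture.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper: returns an ASCII-lowercased copy of `name`.
auto toLowerKey(std::string_view name) -> std::string {
    std::string key{name};
    std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
        // std::tolower expects a value representable as unsigned char.
        return static_cast<char>(std::tolower(c));
    });
    return key;
}
```

Assuming the value returned by contentView converts to std::string_view, the counting line would then read `tagCounts[toLowerKey(match->contentView(1))] += 1;`.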

Note

While the *View variants (such as findAllView) are the most efficient way to work with compiled regular expressions, you must ensure that the original input string remains alive for as long as you use the match results.

In contrast, the non-view methods (for example findAll) create copies of the matched text. This allows you to discard the original input string while continuing to work with the results, at the cost of additional allocations.

From raw counts to readable output

After scanning, sortedRowsFromCounts converts the unordered maps into sortable row lists. The results are then printed using printStats, which formats aligned output via std::format.

This separation keeps the matching logic independent from presentation and makes the code easier to extend later.
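As an illustration of that extensibility, here is a hypothetical addition that limits the output to the most frequent entries. It assumes the rows are already sorted, as sortedRowsFromCounts returns them, and it does not touch the matching code at all. The name topRows is not part of the tutorial.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Rows = std::vector<std::pair<std::string, std::size_t>>;

// Hypothetical extension: keep only the first `limit` rows of an
// already sorted list. Takes `rows` by value so the caller's list is unchanged.
auto topRows(Rows rows, std::size_t limit) -> Rows {
    if (rows.size() > limit) {
        rows.resize(limit);
    }
    return rows;
}
```

A call such as `printStats(topRows(tagRows, 10), "Top 10 Tags:");` would then slot into main without further changes.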

Error handling and user experience

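Error handling in the Erbsland Regular Expression Library is intentionally simple. The library uses a single Error type, derived from std::runtime_error. As a result, this example wraps the body of main in a single try/catch block that catches std::runtime_error, which covers both the library's errors and the file and usage errors thrown by the example itself.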

In real-world applications with static regular expression patterns, you may decide not to catch errors during pattern compilation. Letting the program terminate on an invalid pattern often simplifies debugging and makes errors immediately visible.

For runtime operations such as match, findFirst, or findAll, it is important to catch Error, as errors may occur due to:

  • Invalid UTF-8 sequences in the input text.

  • Timeouts (when configured or triggered by malformed patterns).
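If you want to guard only the runtime operations rather than all of main, one option is a small wrapper around the scanning step. The sketch below is an assumption about how you might structure this, not a library facility: runGuarded is a hypothetical helper, and it catches std::runtime_error, which, per the description above, is the base class of the library's Error type.

```cpp
#include <iostream>
#include <stdexcept>
#include <utility>

// Hypothetical helper: runs `operation` and converts runtime errors
// into a process exit code (0 on success, 1 on failure).
template <typename Operation>
auto runGuarded(Operation &&operation) -> int {
    try {
        std::forward<Operation>(operation)();
        return 0;
    } catch (const std::runtime_error &error) {
        // Any exception type derived from std::runtime_error lands here.
        std::cerr << "Error: " << error.what() << '\n';
        return 1;
    }
}
```

In main, the scan loop could be passed as a lambda, for example `return runGuarded([&] { /* scan and print */ });`, while pattern compilation stays outside the guard.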

Note

This application is intentionally small and serves as an example for using the Erbsland Regular Expression Library in a real-world-style workflow.

The regular expression patterns used here are heuristics, not a complete HTML grammar. They will not correctly handle all cases, such as:

  • Boolean attributes (for example checked or disabled)

  • Self-closing tags

  • Comments, scripts, or malformed markup

For production-grade HTML processing, a dedicated HTML parser should be used. In this tutorial, the focus is on demonstrating how to compile expressions, iterate matches, and work with capture groups efficiently.
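To make the first limitation concrete, the sketch below shows one way the attribute pattern could be relaxed so that the value part becomes optional. It uses std::regex as a stand-in, not the Erbsland API, and both the relaxed pattern and the helper name attributeNames are assumptions for illustration. It is still a heuristic and shares the remaining weaknesses listed above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Stand-in sketch using std::regex, NOT the Erbsland API.
// Returns all attribute names found in the attribute region of one tag.
auto attributeNames(const std::string &region) -> std::vector<std::string> {
    // Differences from the tutorial pattern (all assumptions):
    //  - the "=value" part is optional, so boolean attributes match;
    //  - names may contain digits and hyphens (for example data-id);
    //  - single-quoted values are accepted.
    static const std::regex reAttribute(
        R"re(\s+([a-z][a-z0-9-]*)(?:=(?:"[^"]*"|'[^']*'|\S+))?)re",
        std::regex::icase);
    std::vector<std::string> names;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(region.begin(), region.end(), reAttribute); it != end; ++it) {
        names.push_back((*it)[1].str());
    }
    return names;
}
```

For the region ` type="checkbox" checked data-id=5 disabled`, this yields the four names type, checked, data-id, and disabled, whereas the tutorial pattern would report only type.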

The Current Project State

At this point, your project directory structure should look like this, with the newly added components marked:

htmlstats
    ├── erbsland-cpp-re
    ├── htmlstats
    │   ├── src
    │   │   └── main.cpp             # [new] The main entry point
    │   └── CMakeLists.txt
    └── CMakeLists.txt
