Write the Main Functionality
Now you’ll implement the core logic of our small command-line tool:
Read an HTML file into memory.
Use the Erbsland Regular Expression Library to find HTML tags.
Extract and count attribute names inside these tags.
Print the results as a simple summary table.
Create main.cpp
Create and open the file:
$ nano htmlstats/src/main.cpp
Add the following code:
#include <erbsland/all_re.hpp>

#include <algorithm>
#include <format>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

using namespace el::re;
using CountMap = std::unordered_map<std::string, std::size_t>;
using Rows = std::vector<std::pair<std::string, std::size_t>>;

// Read the whole file into a string; throws if the file cannot be opened.
auto readFileToString(const std::string &path) -> std::string {
    std::ifstream in{path, std::ios::in};
    if (!in) {
        throw std::runtime_error(std::format("Failed to open file: {}", path));
    }
    return {std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}};
}

// Convert the map into rows sorted by count (descending), then by name.
auto sortedRowsFromCounts(const CountMap &tagCounts) -> Rows {
    Rows rows;
    rows.reserve(tagCounts.size());
    for (const auto &[tag, count] : tagCounts) {
        rows.emplace_back(tag, count);
    }
    std::ranges::sort(rows, [](const auto &a, const auto &b) {
        return a.second != b.second ? (a.second > b.second) : (a.first < b.first);
    });
    return rows;
}

// Print a title followed by one "count name" line per row.
void printStats(const Rows &rows, const std::string_view &title) {
    std::cout << std::format("{}\n", title);
    for (const auto &[tag, count] : rows) {
        std::cout << std::format("{:<10} {:>}\n", count, tag);
    }
}

int main(int argc, char **argv) {
    try {
        if (argc < 2) {
            throw std::runtime_error(std::format("Usage: {} <html-file>\n", argv[0]));
        }
        // Compile both patterns once, then reuse them for every match.
        const auto reTag = RegEx::compile(R"((?is)<([a-z]+)([^>]*)>)");
        const auto reAttribute = RegEx::compile(R"((?is)\s+([a-z]+)=("[^"]*"|\S+)?)");
        auto html = readFileToString(argv[1]);
        auto tagCounts = CountMap{};
        auto attributeCounts = CountMap{};
        std::cout << "Scanning file\n";
        for (const auto &match : reTag->findAllView(html)) {
            tagCounts[std::string(match->contentView(1))] += 1;
            // Scan only the attribute region of the current tag.
            for (const auto &attributeMatch : reAttribute->findAllView(match->contentView(2))) {
                attributeCounts[std::string(attributeMatch->contentView(1))] += 1;
            }
        }
        const auto tagRows = sortedRowsFromCounts(tagCounts);
        const auto attributeRows = sortedRowsFromCounts(attributeCounts);
        printStats(tagRows, "Tag Statistics:");
        printStats(attributeRows, "Attribute Statistics:");
        return 0;
    } catch (const std::runtime_error &error) {
        std::cout << error.what() << '\n';
        return 1;
    }
}
Code Breakdown
This example intentionally uses regular expressions for a pragmatic task: gathering quick statistics from HTML. It is not a full HTML parser (HTML is not a regular language), but for “count tags and attributes” it works well and keeps the example compact and easy to understand.
Include and namespace setup
At the top of the file you include the library’s public API:
#include <erbsland/all_re.hpp>
Then you make the regular expression types directly available:
using namespace el::re;
This allows you to use types such as RegEx directly, without prefixing them with the full namespace.
Two compiled expressions, reused many times
The key performance and design idea is: compile once, then reuse.
Inside main you compile two regular expression patterns:
A tag expression:
const auto reTag = RegEx::compile(R"((?is)<([a-z]+)([^>]*)>)");
An attribute expression:
const auto reAttribute = RegEx::compile(R"((?is)\s+([a-z]+)=("[^"]*"|\S+)?)");
Both expressions are stored in const auto variables so they can be applied repeatedly while scanning the input text.
The inline flags (?is) are important here:
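i enables case-insensitive matching, so DIV and div are treated the same.
s enables dot-matches-newline mode, a defensive choice in case attributes span multiple lines. Neither pattern currently contains a dot, so this flag only takes effect if you extend the patterns.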
Capturing exactly what you need
The tag pattern <([a-z]+)([^>]*)> uses capture groups to isolate the relevant parts of each match:
Group 1: ([a-z]+) is the tag name (for example div, a, span).
Group 2: ([^>]*) is the attribute region (everything between the tag name and >).
When iterating over matches, you access these capture groups via contentView(n), as in the listing above:
match->contentView(1) → tag name
match->contentView(2) → attribute text
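To see what the two groups hold for a concrete tag, the following sketch applies a pattern of the same shape with the standard library's std::regex. This is a stand-in for illustration only, not the Erbsland API: std::regex has no inline (?is) flags, so case-insensitivity is requested with std::regex::icase instead, and the helper name splitTag is hypothetical.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {tag name, attribute region} for the first tag
// found in `text`, or std::nullopt if there is none.
// std::regex is used as a stand-in here; it is not the Erbsland API.
auto splitTag(const std::string &text)
    -> std::optional<std::pair<std::string, std::string>> {
    // Same structure as the tutorial's tag pattern, minus the inline flags.
    static const std::regex reTag(R"(<([a-z]+)([^>]*)>)", std::regex::icase);
    std::smatch match;
    if (!std::regex_search(text, match, reTag)) {
        return std::nullopt;
    }
    // Group 1 is the tag name, group 2 is everything up to the closing '>'.
    return std::make_pair(match[1].str(), match[2].str());
}
```

For the input <a href="x.html">, group 1 is a and group 2 is the text href="x.html" with its leading space preserved. That leading space matters, because the attribute pattern starts with \s+.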
This “compile → match → access capture groups” workflow is central to how the Erbsland Regular Expression Library is used throughout this example.
Scanning with findAllView (and why it matters)
The main scan loop uses:
reTag->findAllView(html)
This call returns a coroutine-based generator that iterates over all matches in the input string.
The *View variant is used here, which means all match results reference the original string
instead of creating copies. This avoids unnecessary allocations and provides excellent performance
when scanning large inputs.
Inside the loop, a nested scan is performed over the attribute region of each tag:
reAttribute->findAllView(match->contentView(2))
Here, the attribute expression operates only on the view returned from the tag match. This keeps responsibilities clean: one expression finds tags, the other extracts attributes.
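What "referencing the original string" means can be shown with the standard library alone. The sketch below is an illustration under stated assumptions, not library code: the hypothetical helper attributeRegion locates the attribute region with plain string searches (no regular expression) and returns it as a std::string_view, and pointsInto checks that the view really lies inside the caller's buffer.

```cpp
#include <functional>
#include <string>
#include <string_view>

// Hypothetical helper: returns the text between the tag name and '>' of the
// first tag in `html` as a view into `html`. No characters are copied.
auto attributeRegion(std::string_view html) -> std::string_view {
    const auto open = html.find('<');
    if (open == std::string_view::npos) {
        return {};
    }
    const auto close = html.find('>', open);
    if (close == std::string_view::npos) {
        return {};
    }
    // The tag name ends at the first space, or at '>' if there is none.
    auto nameEnd = html.find(' ', open);
    if (nameEnd == std::string_view::npos || nameEnd > close) {
        nameEnd = close;
    }
    return html.substr(nameEnd, close - nameEnd);
}

// True if `view` refers to memory inside `owner`, i.e. it is not a copy.
auto pointsInto(std::string_view view, const std::string &owner) -> bool {
    const std::less_equal<const char *> lessEqual;
    return lessEqual(owner.data(), view.data())
        && lessEqual(view.data() + view.size(), owner.data() + owner.size());
}
```

A view is only valid while the string it refers to is alive and unmodified. By the same reasoning, the html string in main should outlive every match obtained from findAllView, which is why the listing copies names into std::string before using them as map keys.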
From raw counts to readable output
After scanning, sortedRowsFromCounts converts the unordered maps into sortable row lists.
The results are then printed using printStats, which formats aligned output via
std::format.
This separation keeps the matching logic independent from presentation and makes the code easier to extend later.
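The ordering produced by sortedRowsFromCounts is easiest to see on a small input. The sketch below repeats the comparator from the listing, using std::sort so that it depends only on the standard library. The function names sortedRows and sampleRows and the sample counts are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using CountMap = std::unordered_map<std::string, std::size_t>;
using Rows = std::vector<std::pair<std::string, std::size_t>>;

// Same ordering as in the listing: highest count first, ties broken by name.
auto sortedRows(const CountMap &counts) -> Rows {
    Rows rows(counts.begin(), counts.end());
    std::sort(rows.begin(), rows.end(), [](const auto &a, const auto &b) {
        return a.second != b.second ? (a.second > b.second) : (a.first < b.first);
    });
    return rows;
}

// Made-up sample data: two tags share the count 3 to show the tie-break.
auto sampleRows() -> Rows {
    const CountMap counts{{"p", 3}, {"a", 7}, {"div", 3}, {"span", 1}};
    return sortedRows(counts);
}
```

With this sample, the rows come out as a (7), div (3), p (3), span (1). The tie between div and p is resolved alphabetically, which keeps the output stable even though the unordered map has no defined iteration order.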
Error handling and user experience
Error handling in the Erbsland Regular Expression Library is intentionally simple.
The library uses a single Error type, derived from
std::runtime_error. As a result, this example wraps main in a single
try/catch block.
In real-world applications with static regular expression patterns, you may decide not to catch errors during pattern compilation. Letting the program terminate on an invalid pattern often simplifies debugging and makes errors immediately visible.
For runtime operations such as match, findFirst, or findAll, it is important
to catch Error, as errors may occur due to:
Invalid UTF-8 sequences in the input text.
Timeouts (when configured or triggered by malformed patterns).
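Why a single handler is enough follows from the inheritance relationship alone. The sketch below uses a stand-in class rather than the library's Error type: ScanError, scan, and runScan are hypothetical names, and the "invalid byte" check merely mimics a runtime failure such as invalid UTF-8.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for a library error type derived from std::runtime_error.
// The name ScanError is hypothetical; it is not the library's class.
class ScanError : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

// Hypothetical scan step that fails on the byte 0xFF, which never occurs in
// valid UTF-8, to mimic a runtime failure on bad input.
void scan(const std::string &text) {
    if (text.find('\xff') != std::string::npos) {
        throw ScanError{"invalid byte in input"};
    }
}

// Returns a process-style exit code: 0 on success, 1 on any runtime_error.
// Because ScanError derives from std::runtime_error, this one handler covers
// both the stand-in library error and errors such as a failed file open.
auto runScan(const std::string &text) -> int {
    try {
        scan(text);
        return 0;
    } catch (const std::runtime_error &) {
        return 1;
    }
}
```

If you later need to treat library errors differently from other failures, a more specific catch clause for the derived type can be placed before the std::runtime_error handler.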
Note
This application is intentionally small and serves as an example for using the Erbsland Regular Expression Library in a real-world-style workflow.
The regular expression patterns used here are heuristics, not a complete HTML grammar. They will not correctly handle all cases, such as:
Boolean attributes (for example checked or disabled)
Self-closing tags
Comments, scripts, or malformed markup
For production-grade HTML processing, a dedicated HTML parser should be used. In this tutorial, the focus is on demonstrating how to compile expressions, iterate matches, and work with capture groups efficiently.
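The boolean-attribute limitation can be reproduced with a short standard-library sketch. As before, std::regex with std::regex::icase stands in for the tutorial's attribute pattern and is not the Erbsland API; the helper name countAttributeMatches is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts attribute matches in an attribute region, using std::regex as a
// stand-in for the tutorial's attribute pattern (not the Erbsland API).
auto countAttributeMatches(const std::string &attributeRegion) -> std::size_t {
    // Same structure as the tutorial's attribute pattern, minus inline flags.
    static const std::regex reAttribute(
        R"(\s+([a-z]+)=("[^"]*"|\S+)?)", std::regex::icase);
    const std::sregex_iterator begin(
        attributeRegion.begin(), attributeRegion.end(), reAttribute);
    const std::sregex_iterator end;
    return static_cast<std::size_t>(std::distance(begin, end));
}
```

For the region type="checkbox" checked (with a leading space), this counts one attribute: the pattern requires an = after the name, so checked is skipped.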
The Current Project State
At this point, your project directory structure should look like this, with the newly added components marked:
htmlstats
├── erbsland-cpp-re
├── htmlstats
│   ├── src
│   │   └── main.cpp        # [new] The main entry point
│   └── CMakeLists.txt
└── CMakeLists.txt