****************************
Write the Main Functionality
****************************

Now you’ll implement the core logic of our small command-line tool:

* Read an HTML file into memory.
* Use the *Erbsland Regular Expression Library* to find HTML tags.
* Extract and count attribute names inside these tags.
* Print the results as a simple summary table.

Create ``main.cpp``
===================

#. Create and open the file:

   .. code-block:: console

       $ nano htmlstats/src/main.cpp

#. Add the following code:

   .. literalinclude:: files/main.cpp
       :language: cpp
       :linenos:

Code Breakdown
==============

This example intentionally uses regular expressions for a pragmatic task: gathering quick statistics from HTML. It is *not* a full HTML parser (HTML is not a regular language), but for “count tags and attributes” it works well and keeps the example compact and easy to understand.

Include and namespace setup
---------------------------

At the top of the file you include the library’s public API:

.. code-block:: cpp

    #include

Then you make the regular expression types directly available:

.. code-block:: cpp

    using namespace el::re;

This allows you to use types such as ``RegEx`` directly, without prefixing them with the full namespace.

Two compiled expressions, reused many times
-------------------------------------------

The key performance and design idea is: **compile once, then reuse**.

Inside ``main`` you compile two regular expression patterns:

#. A tag expression:

   .. code-block:: cpp

       const auto reTag = RegEx::compile(R"((?is)<([a-z]+)([^>]*)>)");

#. An attribute expression:

   .. code-block:: cpp

       const auto reAttribute = RegEx::compile(R"((?is)\s+([a-z]+)=("[^"]*"|\S+)?)");

Both expressions are stored in ``const auto`` variables so they can be applied repeatedly while scanning the input text.

The inline flags ``(?is)`` are important here:

* ``i`` enables case-insensitive matching, so ``DIV`` and ``div`` are treated the same.
* ``s`` enables *dot-matches-newline* mode. Neither pattern currently contains a ``.``, so this is purely a defensive choice for attributes that span multiple lines.

Capturing exactly what you need
-------------------------------

The tag pattern ``<([a-z]+)([^>]*)>`` uses capture groups to isolate the relevant parts of each match:

* Group ``1``: ``([a-z]+)``, the *tag name* (for example ``div``, ``a``, ``span``).
* Group ``2``: ``([^>]*)``, the *attribute region* (everything between the tag name and ``>``).

When iterating over matches, you access these capture groups via ``content(n)``:

* ``match->content(1)`` → tag name
* ``match->content(2)`` → attribute text

This “compile → match → access capture groups” workflow is central to how the *Erbsland Regular Expression Library* is used throughout this example.

Scanning with ``findAllView`` (and why it matters)
--------------------------------------------------

The main scan loop uses:

.. code-block:: cpp

    reTag->findAllView(html)

This call returns a coroutine-based generator that iterates over all matches in the input string. The ``*View`` variant is used here, which means all match results reference the original string instead of creating copies. This avoids per-match allocations, which matters most when scanning large inputs.

Inside the loop, a nested scan is performed over the attribute region of each tag:

.. code-block:: cpp

    reAttribute->findAllView(match->content(2))

Here, the attribute expression operates only on the *view* returned from the tag match. This keeps responsibilities clean: one expression finds tags, the other extracts attributes.

Counting tags and attribute names
---------------------------------

The counting logic is intentionally minimal and easy to read:

.. code-block:: cpp

    tagCounts[std::string(match->content(1))] += 1;

At this point, the tag name view is converted into a ``std::string`` key for use in a ``std::unordered_map``. While this introduces a copy, it keeps the example simple.
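
As a rough, self-contained sketch of this counting step, the following uses ``std::regex`` from the C++ standard library as a stand-in; it is *not* the *Erbsland Regular Expression Library* API. The helper name ``countTagNames`` is hypothetical and does not appear in ``main.cpp``, and the pattern spells out ``[A-Za-z]`` instead of relying on a case-insensitive flag.

.. code-block:: cpp

    #include <cstddef>
    #include <regex>
    #include <string>
    #include <unordered_map>

    // Hypothetical helper for illustration only; not part of the tutorial's main.cpp.
    // std::regex is used as a stand-in for the compiled tag expression.
    std::unordered_map<std::string, std::size_t> countTagNames(const std::string &html) {
        // Compiled once (function-local static), then reused on every call.
        static const std::regex reTag{R"(<([A-Za-z]+)([^>]*)>)"};
        std::unordered_map<std::string, std::size_t> tagCounts;
        for (auto it = std::sregex_iterator(html.begin(), html.end(), reTag);
             it != std::sregex_iterator(); ++it) {
            // Copy capture group 1 (the tag name) into an owning std::string key.
            tagCounts[(*it)[1].str()] += 1;
        }
        return tagCounts;
    }

Closing tags such as ``</p>`` are skipped by this pattern because ``/`` is not a letter, so only opening tags are counted.
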
In a real application, you would likely use a more efficient approach that better leverages the lifetime of the underlying views.

The same approach is used for attribute names:

.. code-block:: cpp

    attributeCounts[std::string(attributeMatch->content(1))] += 1;

.. note::

    While the ``*View`` variants (such as ``findAllView``) are the most efficient way to work with compiled regular expressions, you must ensure that the original input string remains alive for as long as you use the match results.

    In contrast, the non-view methods (for example ``findAll``) create copies of the matched text. This allows you to discard the original input string while continuing to work with the results, at the cost of additional allocations.

From raw counts to readable output
----------------------------------

After scanning, ``sortedRowsFromCounts`` converts the unordered maps into sortable row lists. The results are then printed using ``printStats``, which formats aligned output via ``std::format``.

This separation keeps the matching logic independent from presentation and makes the code easier to extend later.

Error handling and user experience
----------------------------------

Error handling in the *Erbsland Regular Expression Library* is intentionally simple. The library uses a single :cpp:class:`Error` type, derived from ``std::runtime_error``. As a result, this example wraps ``main`` in a single ``try``/``catch`` block.

In real-world applications with static regular expression patterns, you may decide *not* to catch errors during pattern compilation. Letting the program terminate on an invalid pattern often simplifies debugging and makes errors immediately visible.

For runtime operations such as ``match``, ``findFirst``, or ``findAll``, it is important to catch :cpp:class:`Error`, as errors may occur due to:

* Invalid UTF-8 sequences in the input text.
* Timeouts (when configured or triggered by malformed patterns).

.. note::

    This application is intentionally small and serves as an example for using the *Erbsland Regular Expression Library* in a real-world-style workflow.

    The regular expression patterns used here are **heuristics**, not a complete HTML grammar. They will **not** correctly handle all cases, such as:

    * Boolean attributes (for example ``checked`` or ``disabled``)
    * Self-closing tags
    * Comments, scripts, or malformed markup

    For production-grade HTML processing, a dedicated HTML parser should be used. In this tutorial, the focus is on demonstrating how to compile expressions, iterate matches, and work with capture groups efficiently.

The Current Project State
=========================

At this point, your project directory structure should look like this, with the newly added components marked:

.. code-block:: none
    :emphasize-lines: 5

    htmlstats
    ├── erbsland-cpp-re
    ├── htmlstats
    │   ├── src
    │   │   └── main.cpp         # [new] The main entry point
    │   └── CMakeLists.txt
    └── CMakeLists.txt

.. button-ref:: 04-compile-and-run
    :ref-type: doc
    :color: success
    :align: center
    :expand:
    :class: sd-fs-5 sd-font-weight-bold sd-p-2 sd-my-4

    Compile and Run the Application →