Extract Regular Expression Matches from Scraped HTML
Source:R/UrlScraper_utils_html.R
dot-extract_regex.RdApplies a regular expression to previously scraped HTML documents, optionally
restricted to a specific capture group.
Each document is first cleaned using parse_HTML() to remove non‑text
content, ensuring reliable pattern extraction.
Arguments
- docs
Character vector or list of HTML source documents.
- urls
Character vector of URLs corresponding to
docs.- pattern
A regular expression to search for.
- group
Optional capture group name or index to extract. If
NULL, the full match is returned.- ignore_cases
Logical; if
TRUE, performs case‑insensitive matching.
Value
A data.table where each row corresponds to a match and includes:
url– The originating document URLpattern(or the given group name) – Extracted values
Missing matches are returned as NA_character_.
Details
The function:
Cleans and normalizes each HTML document
Converts text to lowercase when
ignore_cases = TRUEExtracts all regex matches using
stringr::str_match_all()Supports named or numbered capture groups
Returns a unified
data.tableindexed by URL
Named groups allow meaningful column labeling in the result.