Applies a regular expression to previously scraped HTML documents, optionally restricted to a specific capture group. Each document is first cleaned using parse_HTML() to remove non‑text content, ensuring reliable pattern extraction.

Usage

.extract_regex(docs, urls, pattern, group = NULL, ignore_cases = TRUE)

Arguments

docs

Character vector or list of HTML source documents.

urls

Character vector of URLs corresponding to docs.

pattern

A regular expression to search for.

group

Optional capture group name or index to extract. If NULL, the full match is returned.

ignore_cases

Logical; if TRUE, performs case‑insensitive matching.

Value

A data.table where each row corresponds to a match and includes:

  • url – The originating document URL

  • pattern (or the given group name) – Extracted values

Missing matches are returned as NA_character_.

Details

The function:

  • Cleans and normalizes each HTML document

  • Converts text to lowercase when ignore_cases = TRUE

  • Extracts all regex matches using stringr::str_match_all()

  • Supports named or numbered capture groups

  • Returns a unified data.table indexed by URL

Named groups allow meaningful column labeling in the result.
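
The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation. `strip_tags()` and `extract_sketch()` are hypothetical names, `strip_tags()` is only a crude stand-in for parse_HTML(), and the sketch handles a numeric `group` only. It also assumes that a document with no match contributes a single NA_character_ row.

```r
library(stringr)
library(data.table)

# Hypothetical stand-in for parse_HTML(): crudely drops tags with a regex
# and collapses whitespace. The real cleaner is likely more thorough.
strip_tags <- function(doc) {
  str_squish(gsub("<[^>]+>", " ", doc))
}

# Hypothetical sketch of the extraction steps (numeric `group` only).
extract_sketch <- function(docs, urls, pattern, group = NULL, ignore_cases = TRUE) {
  rows <- lapply(seq_along(docs), function(i) {
    text <- strip_tags(docs[[i]])
    if (ignore_cases) text <- tolower(text)
    # str_match_all() returns one matrix per input string:
    # column 1 is the full match, column k + 1 is capture group k.
    m <- str_match_all(text, pattern)[[1]]
    col <- if (is.null(group)) 1L else as.integer(group) + 1L
    # Assumption: no match yields one NA row for that URL.
    values <- if (nrow(m) == 0) NA_character_ else unname(m[, col])
    data.table(url = urls[[i]], pattern = values)
  })
  rbindlist(rows)
}

docs <- c("<p>Mail Info@Example.com or sales@example.com</p>", "<p>nothing here</p>")
urls <- c("https://a.example", "https://b.example")

full   <- extract_sketch(docs, urls, "[a-z0-9._%+-]+@[a-z0-9.-]+")
domain <- extract_sketch(docs, urls, "([a-z0-9._%+-]+)@([a-z0-9.-]+)", group = 2)
```

Because the text is lowercased before matching in this sketch, the pattern is written with lowercase character classes; a pattern that only lists uppercase letters would find nothing under that approach.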

Examples

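if (FALSE) { # \dontrun{
## Hypothetical inputs: HTML source strings and their URLs
docs <- c("<p>Contact: info@example.com</p>")
urls <- c("https://example.com")
## Extract email-like patterns:
.extract_regex(docs, urls, pattern = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+")
} # }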