Extracts all valid hyperlinks from an HTML document and returns them as a
cleaned and normalized data.table.
The function parses <a>, <area>, <base>, and <link> elements,
resolves relative URLs, removes invalid or unwanted links, and enriches the
output with metadata such as the source URL, extraction level, and timestamp.
Value
A data.table containing the following columns:
href – Cleaned and validated absolute URLs
label – Link text extracted from the anchor element
source_url – The originating page from which links were extracted
level – Extraction depth (always 0 for this function)
scraped_at – Timestamp of extraction
Duplicate URLs are automatically removed.
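As an illustration of the shape described above, the following sketch builds a table with the same five columns and removes duplicate URLs. The input vectors and the example URLs are hypothetical, and deduplicating with unique(..., by = "href") is an assumption about how uniqueness might be enforced, not a statement about the function's internals.

```r
library(data.table)

# Hypothetical inputs: links that are already cleaned and absolute.
hrefs  <- c("https://example.com/a", "https://example.com/b", "https://example.com/a")
labels <- c("Page A", "Page B", "Page A again")

links <- data.table(
  href       = hrefs,
  label      = labels,
  source_url = "https://example.com/",  # recycled to every row
  level      = 0L,                      # extraction depth is always 0 here
  scraped_at = Sys.time()               # one timestamp for the whole extraction
)

# Keep only the first row for each URL (assumed deduplication rule).
links <- unique(links, by = "href")
```

With this rule, the label kept for a repeated URL is the one from its first occurrence.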
Details
This extractor is designed for web‑scraping pipelines where only meaningful, navigable hyperlinks are desired. The function:
Converts inputs to an XML document when necessary
Extracts link text and normalizes whitespace
Resolves relative URLs against the provided baseurl
Forces all URLs to use https://
Removes invalid links using check_links()
Ensures uniqueness of extracted links
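The steps listed above can be sketched with the xml2 package. This is a minimal illustration, not the function's actual implementation: extract_links_sketch() and is_navigable() are hypothetical names, is_navigable() is only a simplified stand-in for check_links() (whose real rules are not shown here), only <a> and <area> elements are handled, and a plain data.frame is returned for brevity.

```r
library(xml2)

# Hypothetical stand-in for check_links(): keep only http(s) URLs.
is_navigable <- function(urls) {
  !is.na(urls) & grepl("^https?://", urls)
}

extract_links_sketch <- function(doc, baseurl) {
  # Convert the input to an XML document when necessary.
  if (is.character(doc)) doc <- read_html(doc)

  nodes <- xml_find_all(doc, "//a[@href] | //area[@href]")
  href  <- xml_attr(nodes, "href")

  # Extract link text and collapse runs of whitespace.
  label <- trimws(gsub("\\s+", " ", xml_text(nodes)))

  # Resolve relative URLs against baseurl.
  href <- url_absolute(href, baseurl)

  # Force the https:// scheme.
  href <- sub("^http://", "https://", href)

  # Drop invalid links, then keep the first occurrence of each URL.
  keep <- is_navigable(href) & !duplicated(href)
  data.frame(href = href[keep], label = label[keep], stringsAsFactors = FALSE)
}

# Usage with an inline HTML string (example.com is a placeholder).
html <- '<html><body>
  <a href="/about">  About
     us </a>
  <a href="http://example.com/about">Duplicate</a>
  <a href="mailto:someone@example.com">Mail</a>
</body></html>'

res <- extract_links_sketch(html, "http://example.com/")
```

Rewriting the scheme before deduplicating means that http:// and https:// variants of the same address collapse into a single row.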