The UrlScraper R6 class provides a high‑level framework for scraping
a list of URLs using multiple parallel Selenium (or non‑Selenium) workers.
It manages scraping state, progress, snapshots, logs, and respects
robots.txt rules. Results and logs are stored in an internal DuckDB
database.
Overview
The UrlScraper class is designed for robust, resumable web scraping
workflows. Its key features include:
Parallel scraping of URLs via multiple Selenium workers
Persistent storage of results, logs, and extracted links in DuckDB
Automatic snapshotting and recovery of partially processed chunks
Respecting robots.txt rules via pre-checks on domains
Convenience helpers for querying results, logs, and extracted links
Regex‑based extraction of text from previously scraped HTML
Configuration
A configuration object (typically created via paramsScraper) is expected to contain at least the following entries:
db_file – path to the DuckDB database file
snapshot_dir – directory for temporary snapshot files
progress_dir – directory for progress/log files
stop_file – path to a file used to signal workers to stop
urls / urls_todo – vectors of URLs and URLs still to scrape
selenium – list with Selenium-related settings, such as:
  use_selenium – logical, whether to use Selenium
  workers – number of parallel Selenium workers
  host, port, browser, verbose – Selenium connection settings
  ecaps – list with Chrome options (args, prefs, excludeSwitches)
  snapshot_every – number of URLs after which a snapshot is taken
robots – list with robots.txt handling options, such as:
  check – logical, whether to check robots.txt
  snapshot_every – snapshot frequency for robots checks
  workers – number of workers for robots.txt checks
  robots_user_agent – user agent string used for robots queries
exclude_social_links– logical, whether to exclude social media links
The exact structure depends on paramsScraper and related helpers.
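As an illustrative sketch, a configuration might be created and adjusted as follows. The $set() accessor and the nested "selenium$..." paths mirror the Examples section below; the top-level setter paths are assumptions based on the field list above and may differ in your version of paramsScraper().
cfg <- paramsScraper()
# Storage locations (field names as listed above; top-level setter paths assumed)
cfg$set("db_file", "scraper.duckdb")
cfg$set("snapshot_dir", "snapshots")
cfg$set("progress_dir", "progress")
cfg$set("stop_file", "STOP")
# Selenium and robots.txt behaviour
cfg$set("selenium$use_selenium", TRUE)
cfg$set("selenium$workers", 2)
cfg$set("robots$check", TRUE)
cfg$show_config()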
Methods
initialize(config) – create a new UrlScraper instance
scrape() – scrape all remaining URLs in parallel
update_urls(urls, force = FALSE) – add new URLs to the queue
results(filter = NULL) – extract scraping results
logs(filter = NULL) – extract log entries
links(filter = NULL) – extract discovered links
query(q) – run custom SQL queries on the internal DuckDB database
regex_extract(pattern, group = NULL, filter_links = NULL, ignore_cases = TRUE) – extract text via regex from scraped HTML
stop() – create a stop-file so workers can exit gracefully
close() – clean up snapshots and close database connections
Methods
Method new()
Create a new UrlScraper object.
This constructor initializes the internal storage (DuckDB database, snapshot and progress directories), restores previous snapshots/logs if present, and configures progress handlers.
Usage
UrlScraper$new(config)
Arguments
config
A list (or configuration object) of settings, typically created by paramsScraper(). It should include:
  db_file – path to the DuckDB database file.
  snapshot_dir – directory for snapshot files.
  progress_dir – directory for progress/log files.
  stop_file – path to the stop signal file.
  urls_todo – character vector of URLs still to be scraped.
  selenium – list of Selenium settings (host, port, workers, etc.).
  robots – list of robots.txt handling options.
  any additional options required by helper functions.
Method scrape()
Scrape all remaining URLs using parallel workers.
Details
This method orchestrates the parallel scraping process:
Re‑initializes storage and processes any existing snapshots or logs.
Computes the set of URLs still to scrape.
Optionally performs robots.txt checks on new domains.
Sets up a parallel plan via the future framework.
Starts multiple Selenium (or non-Selenium) sessions.
Distributes URLs across workers and tracks global progress.
Cleans up snapshots/logs and updates internal URL state after scraping.
If a stop‑file is detected (see stop()), scraping is aborted
before starting. Workers themselves will also honor the stop‑file to
terminate gracefully after finishing the current URL.
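In practice the orchestration above reduces to a single blocking call (sketch, assuming the scraper created earlier):
scraper$scrape()
# Workers check the configured stop_file between URLs; creating it
# (for example via stop() in a separate R session, or by hand) lets
# each worker finish its current URL and then exit.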
Method update_urls()
Update the list of URLs to be scraped.
Arguments
urls
A character vector of new URLs to add.
force
A logical flag. If TRUE, all given URLs are kept except for duplicates within urls itself (no check against already scraped URLs). If FALSE (default), URLs already in the database and duplicates in urls are removed.
Details
This method updates the internal URL queue based on the given input
vector urls. Depending on force:
If force = FALSE (default), URLs that have already been scraped (i.e. present in the results database) are removed, as well as duplicates within the urls vector itself.
If force = TRUE, only duplicates within the given urls vector are removed; URLs that are already present in the database are kept.
Summary information about how many URLs were added, were already known, or were duplicates is printed via cli.
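A short sketch of both modes, assuming a running scraper instance:
new_urls <- c("https://example.com/a",
              "https://example.com/a",   # in-vector duplicate
              "https://example.com/b")
scraper$update_urls(urls = new_urls)                # drops the duplicate and already-scraped URLs
scraper$update_urls(urls = new_urls, force = TRUE)  # drops only the in-vector duplicate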
Method results()
Extract scraping results from the internal database.
Method logs()
Extract log entries from the internal database.
Method links()
Extract scraped links from the internal database.
Method query()
Execute a custom SQL query against the internal DuckDB database.
This is a low‑level helper for advanced use cases. It assumes that
the user is familiar with the schema of the internal database
(tables such as results, logs, links, and any others created
by helper functions).
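For example (a sketch; the table names come from the description above, while the column names status and url are assumed from the regex_extract() documentation):
n_ok <- scraper$query("SELECT COUNT(*) AS n FROM results WHERE status = TRUE")
recent_logs <- scraper$query("SELECT * FROM logs LIMIT 10")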
Method regex_extract()
Extract text from scraped HTML using a regular expression.
Arguments
pattern
A character string containing a regular expression. Named capture groups are supported.
group
Either:
  A character string naming a capture group (e.g. "name" if the pattern contains (?<name>...)), or
  An integer specifying the index of the capture group to return.
If NULL (default), the behavior is delegated to .extract_regex() and may return all groups depending on its implementation.
filter_links
A character vector containing keywords or partial words used to filter the set of URLs from which pattern will be extracted. For example, filter_links = "imprint" restricts the extraction to URLs whose href or label contains "imprint".
ignore_cases
Logical. If TRUE (default), case is ignored when matching pattern. If FALSE, the pattern is matched in a case-sensitive way.
Details
This helper performs a post‑processing step on the stored HTML
sources in the results table:
It first selects links from the links table whose href or label match the provided filter_links terms.
It then identifies those documents (rows in results) whose url is among the selected links and that have status == TRUE.
Finally, it applies a regular expression to the HTML source of those documents and returns the extracted matches.
This is particularly useful for extracting structured information such as email addresses, phone numbers, or IDs from a subset of pages (e.g. contact or imprint pages).
Returns
A data.table (or similar object) returned by
.extract_regex(), typically containing the matched text and the
corresponding URLs.
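As a complement to the e-mail example in the Examples section below, a named capture group can be selected explicitly (sketch; the VAT-ID pattern is purely illustrative):
vat_dt <- scraper$regex_extract(
  pattern      = "(?<vat>DE[0-9]{9})",  # named group "vat"
  group        = "vat",
  filter_links = "imprint",
  ignore_cases = FALSE
)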
Method stop()
Create a stop‑file to signal running workers to terminate gracefully.
Method close()
Clean up resources, including snapshots and database connections.
Details
This method performs the following clean‑up steps:
Processes any remaining snapshots and logs.
Deletes the snapshot directory (if it exists).
Opens a DuckDB connection to the configured db_file and disconnects it with shutdown = TRUE.
It is good practice to call close() once you are done with a
UrlScraper instance.
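One way to make this automatic in scripts is to register the clean-up as soon as the scraper is created (a sketch using base R's on.exit() inside a wrapper function):
run_scraper <- function(cfg) {
  scraper <- UrlScraper$new(config = cfg)
  on.exit(scraper$close(), add = TRUE)  # always release snapshots and the DuckDB handle
  scraper$scrape()
  scraper$results()
}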
Examples
if (FALSE) { # \dontrun{
# Create a default configuration object
cfg <- paramsScraper()
# Example Selenium settings
cfg$set("selenium$host", "localhost")
cfg$set("selenium$workers", 2)
cfg$show_config()
# Initialize the scraper
scraper <- UrlScraper$new(config = cfg)
# Start scraping remaining URLs
scraper$scrape()
# Retrieve results as a data.table
results_dt <- scraper$results()
# Retrieve logs and links
logs_dt <- scraper$logs()
links_dt <- scraper$links()
# Add new URLs to be scraped (only those not already in the DB)
scraper$update_urls(urls = c("https://example.com/"))
# Force adding URLs (ignores duplicates against already scraped ones)
scraper$update_urls(urls = c("https://example.com/"), force = TRUE)
# Regex extraction from scraped HTML
emails_dt <- scraper$regex_extract(
pattern = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
filter_links = c("contact", "imprint")
)
# Stop ongoing workers after they finish the current URL
scraper$stop()
# Clean up resources
scraper$close()
} # }