The UrlScraper R6 class provides a high‑level framework for scraping
a list of URLs using multiple parallel Selenium (or non‑Selenium) workers.
It manages scraping state, progress, snapshots, logs, and respects
robots.txt rules. Results and logs are stored in an internal DuckDB
database.
Overview
The UrlScraper class is designed for robust, resumable web scraping
workflows. Its key features include:
Parallel scraping of URLs via multiple Selenium workers
Persistent storage of results, logs, and extracted links in DuckDB
Automatic snapshotting and recovery of partially processed chunks
Respecting robots.txt rules via pre-checks on domains
Convenience helpers for querying results, logs, and extracted links
Regex‑based extraction of text from previously scraped HTML
Configuration
A configuration object (typically created via paramsScraper) is expected to contain at least the following entries:
db_file – path to the DuckDB database file
snapshot_dir – directory for temporary snapshot files
progress_dir – directory for progress/log files
stop_file – path to a file used to signal workers to stop
urls / urls_todo – vectors of URLs and URLs still to scrape
selenium – list with Selenium-related settings, such as:
  use_selenium – logical, whether to use Selenium
  workers – number of parallel Selenium workers
  host, port, browser, verbose – Selenium connection settings
  ecaps – list with Chrome options (args, prefs, excludeSwitches)
  snapshot_every – number of URLs after which a snapshot is taken
robots – list with robots.txt handling options, such as:
  check – logical, whether to check robots.txt
  snapshot_every – snapshot frequency for robots checks
  workers – number of workers for robots.txt checks
  robots_user_agent – user agent string used for robots queries
exclude_social_links– logical, whether to exclude social media links
The exact structure depends on paramsScraper and related helpers.
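As an illustrative sketch, a configuration might be created and adjusted as follows. The $set() accessor and the nested "selenium$..." paths mirror the Examples section below; the top-level setter paths are assumptions based on the field list above and may differ in your version of paramsScraper().
cfg <- paramsScraper()
# Storage locations (field names as listed above; top-level setter paths assumed)
cfg$set("db_file", "scraper.duckdb")
cfg$set("snapshot_dir", "snapshots")
cfg$set("progress_dir", "progress")
cfg$set("stop_file", "STOP")
# Selenium and robots.txt behaviour
cfg$set("selenium$use_selenium", TRUE)
cfg$set("selenium$workers", 2)
cfg$set("robots$check", TRUE)
cfg$show_config()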
Methods
initialize(config) – create a new UrlScraper instance
scrape() – scrape all remaining URLs in parallel
update_urls(urls, force = FALSE) – add new URLs to the queue
results(filter = NULL) – extract scraping results
logs(filter = NULL) – extract log entries
links(filter = NULL) – extract discovered links
query(q) – run custom SQL queries on the internal DuckDB database
regex_extract(pattern, group = NULL, filter_links = NULL, ignore_cases = TRUE) – extract text via regex from scraped HTML
stop() – create a stop-file so workers can exit gracefully
close() – clean up snapshots and close database connections
Methods
Method new()
Create a new UrlScraper object.
This constructor initializes the internal storage (DuckDB database, snapshot and progress directories), restores previous snapshots/logs if present, and configures progress handlers.
Usage
UrlScraper$new(config)
Arguments
config
A list (or configuration object) of settings, typically created by paramsScraper(). It should include:
  db_file – path to the DuckDB database file.
  snapshot_dir – directory for snapshot files.
  progress_dir – directory for progress/log files.
  stop_file – path to the stop signal file.
  urls_todo – character vector of URLs still to be scraped.
  selenium – list of Selenium settings (host, port, workers, etc.).
  robots – list of robots.txt handling options.
  any additional options required by helper functions.
Method scrape()
Scrape all remaining URLs using parallel workers.
Details
This method orchestrates the parallel scraping process:
Re‑initializes storage and processes any existing snapshots or logs.
Computes the set of URLs still to scrape.
Optionally performs robots.txt checks on new domains.
Sets up a parallel plan via the future framework.
Starts multiple Selenium (or non-Selenium) sessions.
Distributes URLs across workers and tracks global progress.
Cleans up snapshots/logs and updates internal URL state after scraping.
If a stop‑file is detected (see stop()), scraping is aborted
before starting. Workers themselves will also honor the stop‑file to
terminate gracefully after finishing the current URL.
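In practice the orchestration above reduces to a single blocking call (sketch, assuming the scraper created earlier):
scraper$scrape()
# Workers check the configured stop_file between URLs; creating it
# (for example via stop() in a separate R session, or by hand) lets
# each worker finish its current URL and then exit.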
Method update_urls()
Update the list of URLs to be scraped.
Arguments
urls
A character vector of new URLs to add.
force
A logical flag. If TRUE, all given URLs are kept except for duplicates within urls itself (no check against already scraped URLs). If FALSE (default), URLs already in the database and duplicates in urls are removed.
Details
This method updates the internal URL queue based on the given input
vector urls. Depending on force:
If force = FALSE (default), URLs that have already been scraped (i.e. present in the results database) are removed, as well as duplicates within the urls vector itself.
If force = TRUE, only duplicates within the given urls vector are removed; URLs that are already present in the database are kept.
Summary information about how many URLs were added, were already known, or were duplicates is printed via cli.
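A short sketch of both modes, assuming a running scraper instance:
new_urls <- c("https://example.com/a",
              "https://example.com/a",   # in-vector duplicate
              "https://example.com/b")
scraper$update_urls(urls = new_urls)                # drops the duplicate and already-scraped URLs
scraper$update_urls(urls = new_urls, force = TRUE)  # drops only the in-vector duplicate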
Method results()
Extract scraping results from the internal database.
Method logs()
Extract log entries from the internal database.
Method links()
Extract scraped links from the internal database.
Method query()
Execute a custom SQL query against the internal DuckDB database.
This is a low‑level helper for advanced use cases. It assumes that
the user is familiar with the schema of the internal database
(tables such as results, logs, links, and any others created
by helper functions).
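For example (a sketch; the table names come from the description above, while the column names status and url are assumed from the regex_extract() documentation):
n_ok <- scraper$query("SELECT COUNT(*) AS n FROM results WHERE status = TRUE")
recent_logs <- scraper$query("SELECT * FROM logs LIMIT 10")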
Method regex_extract()
Extract text from scraped HTML using a regular expression.
Arguments
pattern
A character string containing a regular expression. Named capture groups are supported.
group
Either:
  A character string naming a capture group (e.g. "name" if the pattern contains (?<name>...)), or
  An integer specifying the index of the capture group to return.
If NULL (default), the behavior is delegated to .extract_regex() and may return all groups depending on its implementation.
filter_links
A character vector containing keywords or partial words used to filter the set of URLs from which pattern will be extracted. For example, filter_links = "imprint" restricts the extraction to URLs whose href or label contains "imprint".
ignore_cases
Logical. If TRUE (default), case is ignored when matching pattern. If FALSE, the pattern is matched in a case-sensitive way.
Details
This helper performs a post‑processing step on the stored HTML
sources in the results table:
It first selects links from the links table whose href or label match the provided filter_links terms.
It then identifies those documents (rows in results) whose url is among the selected links and that have status == TRUE.
Finally, it applies a regular expression to the HTML source of those documents and returns the extracted matches.
This is particularly useful for extracting structured information such as email addresses, phone numbers, or IDs from a subset of pages (e.g. contact or imprint pages).
Returns
A data.table (or similar object) returned by
.extract_regex(), typically containing the matched text and the
corresponding URLs.
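As a complement to the e-mail example in the Examples section below, a named capture group can be selected explicitly (sketch; the VAT-ID pattern is purely illustrative):
vat_dt <- scraper$regex_extract(
  pattern      = "(?<vat>DE[0-9]{9})",  # named group "vat"
  group        = "vat",
  filter_links = "imprint",
  ignore_cases = FALSE
)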
Method stop()
Create a stop‑file to signal running workers to terminate gracefully.
Method close()
Clean up resources, including snapshots and database connections.
Details
This method performs the following clean‑up steps:
Processes any remaining snapshots and logs.
Deletes the snapshot directory (if it exists).
Opens a DuckDB connection to the configured db_file and disconnects it with shutdown = TRUE.
It is good practice to call close() once you are done with a
UrlScraper instance.
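One way to make this automatic in scripts is to register the clean-up as soon as the scraper is created (a sketch using base R's on.exit() inside a wrapper function):
run_scraper <- function(cfg) {
  scraper <- UrlScraper$new(config = cfg)
  on.exit(scraper$close(), add = TRUE)  # always release snapshots and the DuckDB handle
  scraper$scrape()
  scraper$results()
}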
Examples
if (FALSE) { # \dontrun{
# Create a default configuration object
cfg <- paramsScraper()
# Example Selenium settings
cfg$set("selenium$host", "localhost")
cfg$set("selenium$workers", 2)
cfg$show_config()
# Initialize the scraper
scraper <- UrlScraper$new(config = cfg)
# Start scraping remaining URLs
scraper$scrape()
# Retrieve results as a data.table
results_dt <- scraper$results()
# Retrieve logs and links
logs_dt <- scraper$logs()
links_dt <- scraper$links()
# Add new URLs to be scraped (only those not already in the DB)
scraper$update_urls(urls = c("https://example.com/"))
# Force adding URLs (ignores duplicates against already scraped ones)
scraper$update_urls(urls = c("https://example.com/"), force = TRUE)
# Regex extraction from scraped HTML
emails_dt <- scraper$regex_extract(
pattern = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
filter_links = c("contact", "imprint")
)
# Stop ongoing workers after they finish the current URL
scraper$stop()
# Clean up resources
scraper$close()
} # }