The taRantula package provides a production-grade
framework for high-volume statistical web scraping. It focuses on
persistence, parallelism, and
compliance.
Prerequisites: Selenium Grid
Before scraping, you need a running Selenium Grid. Docker is the
recommended and easiest way to manage the browser overhead outside of
R.
Docker-Compose Setup
Save the following as docker-compose.yaml and run
docker-compose up -d. This configuration is optimized for
the memory demands of modern browsers.
version: "3"
services:
  hub:
    container_name: selenium-hub
    image: selenium/hub:4.8
    ports:
      - 4442-4444:4442-4444
    networks:
      - net-selenium-grid
    environment:
      - NODE_MAX_INSTANCES=3
      - NODE_MAX_SESSION=3
  node-chrome:
    image: selenium/node-chrome:138.0
    container_name: selenium-grid-chrome
    depends_on:
      - hub
    networks:
      - net-selenium-grid
    environment:
      - SE_EVENT_BUS_HOST=hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_OVERRIDE_MAX_SESSIONS=true
      - SE_NODE_MAX_SESSIONS=3
      - SE_ENABLE_BROWSER_LEFTOVERS_CLEANUP=true
      - SE_BROWSER_LEFTOVERS_INTERVAL_SECS=600
      - SE_BROWSER_LEFTOVERS_TEMPFILES_DAYS=2
    shm_size: 2g
networks:
  net-selenium-grid:
    external: true
Note: If the external network does not exist yet, create it first with
docker network create net-selenium-grid.
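Once the containers are up, you can check from R that the Grid is reachable before starting a scraping run. The following is a minimal sketch using httr against the status endpoint exposed by Selenium Grid 4; it assumes the default port 4444 on localhost, as configured above.
# Quick health check: ask the Selenium Grid for its status
# (assumes the Grid from the docker-compose file above, listening on localhost:4444)
library(httr)
resp <- GET("http://localhost:4444/status")
status <- content(resp, as = "parsed")
# TRUE once the hub and at least one node are ready
isTRUE(status$value$ready)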
Basic Workflow
The package covers two main processes:
- Define Google Custom Search parameters via an R6-based configuration and run a Google Custom Search.
- Define scraping parameters via an R6-based configuration and initialize the scraper.
Google custom search
Configuration
Set the following environment variables:
- SCRAPING_APIKEY_GOOGLE for the Google Custom Search API key
- SCRAPING_ENGINE_GOOGLE for the Google Custom Search engine ID
Use paramsGoogleSearch() to manage settings. This
ensures all nested parameters are validated before the Google Custom
Search starts.
Custom search
Run the Google Custom Search API with queries built from enterprise names and addresses.
library(taRantula)
library(data.table)
Sys.setenv(
  SCRAPING_APIKEY_GOOGLE = "My_ApiKey",
  SCRAPING_ENGINE_GOOGLE = "My_Engine"
)
cfg <- paramsGoogleSearch(
  path = "~/path/to/my/project",
  scrape_attributes = c("title", "link", "displayLink", "snippet")
)
dat <- data.table(
  ID = c(1, 2, 3),
  EnterpriseName = c("Name1", "Name2", "Name3"),
  EnterpriseAddress = c("Address1", "Address2", "Address3")
)
# build queries
dat[, Query1 := buildQuery(.SD), .SDcols = c(
  "EnterpriseName",
  "EnterpriseAddress"
)]
dat[, Query2 := buildQuery(.SD), .SDcols = c("EnterpriseAddress")]
# update keys in config
cfg$set(key = "query_col", c("Query1", "Query2"))
cfg$set(key = "file", c("File1.csv", "File2.csv")) # <- files to save results in
# run google custom search
runGoogleSearch(
  cfg = cfg,
  data = dat
)
Scraping
Configuration
Use paramsScraper() to manage settings. This ensures all
nested parameters are validated before the scraper starts. The
UrlScraper class manages the state, connection to the
DuckDB database, and the parallel worker pool.
library(taRantula)
# Initialize scraping parameters
cfg <- paramsScraper()
cfg$set("selenium$host", "localhost")
cfg$set("selenium$workers", 3)
cfg$set("urls", c("https://nytimes.com", "https://whitehouse.gov"))
# Initialize the scraper
s <- UrlScraper$new(cfg)
# Start scraping
s$scrape()
Scraping Engines: Selenium vs. httr
The scrape() implementation is engine-agnostic. It
detects whether the session is a full browser or a set of HTTP headers
and adjusts its strategy accordingly.
Selenium Mode (JavaScript Support)
By default, the scraper expects a running Selenium Grid. This is necessary for modern websites that require JavaScript to render their content.
- Redirect Detection: The scraper automatically detects if a URL redirects (e.g., from http to https) and stores the url_redirect pointer in the database.
- Content Persistence: The full rendered HTML source is captured after the DOM has stabilized.
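For example, after a Selenium run you could list the URLs that were redirected. This is a hedged sketch: it assumes the data.table returned by s$results() contains url and url_redirect columns, as suggested above; adjust the column names to the actual schema.
# List redirected URLs after a scrape
# (assumes s$results() returns a data.table with `url` and `url_redirect` columns)
res <- s$results()
res[!is.na(url_redirect) & url != url_redirect, .(url, url_redirect)]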
httr Fallback (High Speed)
If you disable Selenium, the scraper falls back to
httr::GET(). This is significantly faster and uses fewer
resources, but it cannot capture dynamically rendered content.
- When to use: For static sites or APIs where JavaScript rendering is not required.
- Custom Headers: In this mode, the scraper passes your configured user-agent and metadata via HTTP headers.
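For reference, the fallback behaves roughly like a plain httr request per URL. The following is a simplified sketch, not the package's internal code; the user-agent string and the extra header are placeholders.
# Roughly what the httr fallback does for a single URL (simplified sketch)
library(httr)
resp <- GET(
  "https://example.com",
  user_agent("my-scraper/1.0"),              # placeholder user-agent
  add_headers("X-Scraper-Project" = "demo")  # placeholder metadata header
)
html <- content(resp, as = "text", encoding = "UTF-8")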
# To disable Selenium and use the httr fallback:
cfg$set("selenium$use_selenium", FALSE)
# The same UrlScraper$new(cfg) will now use httr internally
Fault Tolerance and Progress Protection
One of the core strengths of taRantula is its
resilience. Scraping thousands of URLs is prone to interruptions
(network timeouts, hardware crashes, or OS updates).
Snapshotting Mechanism
The configuration includes a snapshot_every parameter
(default is 10).
How it works: Every N URLs, each worker writes its current progress to a temporary “snapshot” file in the project directory.
Crash Recovery: If the R session or the Selenium
server crashes, simply re-initialize the UrlScraper with
the same configuration. On startup, the class detects these snapshots
and the existing DuckDB database.
Persistence: At most the URLs processed since the last snapshot are lost, so nearly all of your progress is preserved after a crash.
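As an illustration, recovering after a crash could look like the following. This is a sketch under two assumptions: that the snapshot interval is set through the configuration key snapshot_every (the parameter name comes from the text above, but the exact key path is an assumption) and that re-running scrape() resumes with the URLs not yet stored in DuckDB.
# Take a snapshot more often, e.g. every 5 URLs per worker
# (parameter name from the docs; the exact key path is an assumption)
cfg$set("snapshot_every", 5)
# After a crash: re-initialize with the same configuration.
# Existing snapshots and the DuckDB database are detected on startup,
# and scraping resumes with the remaining URLs.
s <- UrlScraper$new(cfg)
s$scrape()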
Data Extraction and Iteration
Data is persisted in DuckDB, meaning you can access it
via helper methods or raw SQL.
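Besides the helper methods shown below, you can also query the DuckDB file directly with DBI. This is a hedged sketch: the database file name, table name, and column names are placeholders that depend on your project configuration.
# Query the scraper's DuckDB database directly
# (file, table, and column names below are placeholders)
library(DBI)
library(duckdb)
con <- dbConnect(duckdb(), dbdir = "~/path/to/my/project/scraper.duckdb", read_only = TRUE)
dbGetQuery(con, "SELECT url, scraped_at FROM results LIMIT 10")
dbDisconnect(con, shutdown = TRUE)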
# Access results as a data.table
results_dt <- s$results()
# Access all discovered links
links_dt <- s$links()
# Inspect logs for errors or redirects
logs_dt <- s$logs()
Advanced Extraction: Regex
You can also perform targeted data mining across your results. For example, to find an Austrian UID (VAT number) only on pages that are likely to be “Imprints”:
# Pattern for Austrian VAT numbers
uid_pattern <- "ATU[0-9]{8}"
# Extract only from pages where the URL or the link label matches 'imprint'
vat_results <- s$regex_extract(
pattern = uid_pattern,
filter_links = "imprint",
ignore_cases = TRUE
)
Controlling the Scraper
Adding New URLs
You can feed newly discovered links back into the scraper.
taRantula automatically handles duplicate detection.
# Add new URLs and scrape again
s$update_urls(urls = c("https://google.com", "https://www.statistik.at"))
s$scrape()
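To feed links discovered during scraping back in, you can combine s$links() with s$update_urls(). The sketch below assumes the data.table returned by s$links() has a url column; adjust the column name to the actual schema.
# Feedback loop: queue links discovered so far for the next scraping pass
# (assumes s$links() returns a data.table with a `url` column)
links_dt <- s$links()
imprint_urls <- links_dt[grepl("imprint", url, ignore.case = TRUE), unique(url)]
s$update_urls(urls = imprint_urls)
s$scrape()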