cfg_googlesearch is an R6 class that inherits from
params_manager and provides configuration management for
performing Google Custom Search API queries.
It handles:
Definition of default parameters for Google search jobs
YAML-based configuration overrides
Programmatic overrides via
...Validation of all relevant configuration fields
This utility function simplifies the creation of a cfg_googlesearch object. It supports optional configuration file loading and programmatic overrides.
Usage
paramsGoogleSearch(config_file = NULL, path = tempdir(), ...)Format
An R6::R6Class generator object.
Arguments
- config_file
Optional path to a YAML configuration file.
- path
Path to a directory used for storing downloaded data. Defaults to
tempdir().- ...
Additional named configuration overrides. These take precedence over defaults and YAML configuration.
Key Features
Built-in defaults suitable for most scraping workflows
Support for API credentials (key + engine ID)
Control over query frequency and batching
Control over which metadata fields to keep from API responses
Optional saving of results to disk
Super class
taRantula::params_manager -> cfg_googlesearch
Methods
Inherited methods
Method new()
Initialize a new cfg_googlesearch configuration object.
The load precedence is:
Defaults
YAML configuration file (if provided)
Programmatic overrides via
...
Usage
cfg_googlesearch$new(config_file = NULL, path = tempdir(), ...)Arguments
config_file(Optional) Path to a YAML configuration file. Supported settings include:
path: Directory where output data are stored.id_col: Column name serving as a unique identifier for each entity (default:"kz_z").query_col: Column name containing the search queries. Must be created viabuildQuery()(default:NULL).print_every_n: Positive integer. Interval for displaying progress messages (default:100).save_every_n: Positive integer. Interval for saving intermediate results (default:500).scrape_attributes: Character vector. Specifies which data to extract from results. One or more of:"title","link","displayLink","snippet"(default:c("link", "displayLink")).verbose: Logical. Should progress updates be printed to the console? (default:TRUE).max_queries: Maximum queries allowed per 24-hour period. If reached, the process will pause until the 24-hour window resets (default:10000).max_query_rate: Numeric. Maximum number of queries allowed per 100 seconds (default:100).file: Filename (relative topath) for saving results. IfNULL, results are not written to disk. Usesdata.table::fwrite()internally.overwrite: Logical. IfTRUE, existing files are overwritten. IfFALSE, existing data are loaded viadata.table::fread()and new results are appended. Ensure column names match when appending (default:FALSE).credentials: A named list containing Google API credentials. Use"key"or"SCRAPING_APIKEY_GOOGLE"for the API Key, and"engine"or"SCRAPING_ENGINE_GOOGLE"for the Search Engine ID. If omitted, environment variables are used. See alsogetGoogleCreds().
pathPath to the directory where project data are stored. Overrides the
pathsetting inconfig_file....Named arguments used to override specific configuration settings. These take precedence over both
config_fileand default values.
Method defaults()
Return the default configuration settings for Google Custom Search.
Returns
A named list containing default values for:
path– directory to store outputid_col– identifier columnquery_col– column containing query stringsprint_every_n– progress message intervalsave_every_n– save intervalscrape_attributes– which CSE fields to keepverbose– print progress messagesmax_queries– maximum queries per 24hmax_query_rate– queries per 100 secondsfile– output file (orNULL)overwrite– overwrite output file or appendcredentials– list withkeyandengine
Examples
# Create with defaults
# in this case, Environment-Variables `SCRAPING_APIKEY_GOOGLE` and `SCRAPING_ENGINE_GOOGLE`
# need to be set beforehand
cfg <- paramsGoogleSearch()
#> ℹ No configuration file provided for 'cfg_googlesearch'. Using default configuration.
# Create with overrides
cfg <- paramsGoogleSearch(
path = getwd(),
credentials = list(
key = "my_google_apikey",
engine = "my-search-engine-id"
),
verbose = FALSE
)
#> Set/Updated environment variables for SCRAPING_APIKEY_GOOGLE and SCRAPING_ENGINE_GOOGLE.
#> ℹ No configuration file provided for 'cfg_googlesearch'. Using default configuration.
# Return the current configuration
cfg$show_config()
#> $path
#> [1] "/tmp/RtmppVykHU/file1b7447f41c24/reference"
#>
#> $id_col
#> [1] "ID"
#>
#> $query_col
#> NULL
#>
#> $print_every_n
#> [1] 100
#>
#> $save_every_n
#> [1] 500
#>
#> $scrape_attributes
#> [1] "link" "displayLink"
#>
#> $verbose
#> [1] FALSE
#>
#> $max_queries
#> [1] 10000
#>
#> $max_query_rate
#> [1] 100
#>
#> $file
#> NULL
#>
#> $overwrite
#> [1] FALSE
#>
#> $credentials
#> $credentials$engine
#> [1] "my-search-engine-id"
#>
#> $credentials$key
#> [1] "my_google_apikey"
#>
#>
# Write current configuration to file
f <- file.path(tempdir(), "config.yaml")
cfg$export(f)
#> ℹ Current configuration for 'cfg_googlesearch' written to '/tmp/RtmppVykHU/config.yaml'.
# Load from exported config-file and override
cfg <- paramsGoogleSearch(config_file = f, verbose = TRUE)
#> ℹ Configuration loaded from '/tmp/RtmppVykHU/config.yaml' for cfg_googlesearch
try(file.remove(f))
#> [1] TRUE
# Return the current configuration
cfg$show_config()
#> $path
#> [1] "/tmp/RtmppVykHU"
#>
#> $id_col
#> [1] "ID"
#>
#> $print_every_n
#> [1] 100
#>
#> $save_every_n
#> [1] 500
#>
#> $scrape_attributes
#> [1] "link" "displayLink"
#>
#> $verbose
#> [1] TRUE
#>
#> $max_queries
#> [1] 10000
#>
#> $max_query_rate
#> [1] 100
#>
#> $overwrite
#> [1] FALSE
#>
#> $credentials
#> $credentials$engine
#> [1] "my-search-engine-id"
#>
#> $credentials$key
#> [1] "my_google_apikey"
#>
#>
# Or a specific setting
cfg$get("max_query_rate")
#> [1] 100
# Update the configuration
cfg$set("max_query_rate", 200)
cfg$get("max_query_rate")
#> [1] 200