This article explains how and where STATcubeR caches resources from data.statistik.gv.at in the local file system. Understanding this behavior will allow you to enable persistent caches and directly use the cached resources.
Overview
By default, STATcubeR caches all accessed resources from data.statistik.gv.at in the temporary directory of the current R session.
#> [1] "/tmp/RtmpHB89Xm/STATcubeR/open_data/"
Let’s examine, for example, what happens when the data from the Structure of Earnings Survey (SES) is requested.
earnings <- od_table("OGD_veste309_Veste309_1")
First, STATcubeR will grab a JSON file with metadata about this dataset from https://data.statistik.gv.at/ogd/json?dataset=OGD_veste309_Veste309_1 and check which resources belong to it. For each resource, the attributes name and last_modified are extracted from the JSON. They are also included in the od_table object under $resources.
earnings$resources
# A data frame: 7 × 6
  name                             last_modified       cached               size
  <chr>                            <dttm>              <dttm>              <dbl>
1 OGD_veste309_Veste309_1.json     2022-03-24 11:29:48 2024-04-18 10:04:54  4062
2 OGD_veste309_Veste309_1.csv      2022-03-24 11:29:48 2024-04-18 10:04:54  4931
3 OGD_veste309_Veste309_1_HEADER.… 2022-03-24 11:29:48 2024-04-18 10:04:54   516
4 OGD_veste309_Veste309_1_C-A11-0… 2022-03-24 11:29:48 2024-04-18 10:04:54   159
5 OGD_veste309_Veste309_1_C-STAAT… 2022-03-24 11:29:48 2024-04-18 10:04:54   697
6 OGD_veste309_Veste309_1_C-VEBDL… 2022-03-24 11:29:48 2024-04-18 10:04:54   518
7 OGD_veste309_Veste309_1_C-BESCH… 2022-03-24 11:29:48 2024-04-18 10:04:54   641
# ℹ 2 more variables: download <dbl>, parsed <dbl>
last_modified tells us when the resource was last changed on the fileserver. If a resource does not exist in the cache, or if the last_modified entry in the JSON is newer than the cached file, it will be downloaded from the server. Otherwise, the cached version is reused.
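The refresh rule described above can be sketched in a few lines of base R. This is only an illustration of the decision logic, not STATcubeR's actual code: the helper name needs_download() and its arguments are assumptions made for this example.

```r
# Hypothetical helper, not part of STATcubeR: decide whether a cached
# file has to be refreshed, given the server's last_modified timestamp.
needs_download <- function(cache_path, last_modified) {
  if (!file.exists(cache_path)) {
    return(TRUE)  # nothing cached yet
  }
  # file.mtime() returns the modification time of the cached copy
  last_modified > file.mtime(cache_path)
}

cached <- tempfile(fileext = ".csv")
needs_download(cached, Sys.time())         # TRUE: file is not cached
writeLines("a;b", cached)
needs_download(cached, Sys.time() - 3600)  # FALSE: server copy is older
needs_download(cached, Sys.time() + 3600)  # TRUE: server copy is newer
```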
Access and Updates
Cached files can be accessed with od_cache_file(). If the specified file exists in the cache, a path to the file will be returned. Otherwise, the file is downloaded to the cache and then the path is returned. The files use the same naming conventions as the open data fileserver.
od_cache_file("OGD_veste309_Veste309_1")
#> [1] "/tmp/RtmpHB89Xm/STATcubeR/open_data/OGD_veste309_Veste309_1.csv"
od_cache_file("OGD_veste309_Veste309_1", "C-A11-0")
#> [1] "/tmp/RtmpHB89Xm/STATcubeR/open_data/OGD_veste309_Veste309_1_C-A11-0.csv"
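The naming convention visible in the two paths above (dataset id, an optional suffix joined with an underscore, then the file extension) can be reproduced with a small helper. The function cache_path() and its parameters are hypothetical and only mirror the pattern seen in the output; STATcubeR's own implementation may differ.

```r
# Hypothetical helper illustrating the file naming pattern shown above.
# Not part of STATcubeR.
cache_path <- function(id, suffix = NULL, ext = "csv", dir = tempdir()) {
  # Join id and suffix with "_" when a suffix is given
  stem <- if (is.null(suffix)) id else paste(id, suffix, sep = "_")
  file.path(dir, paste0(stem, ".", ext))
}

basename(cache_path("OGD_veste309_Veste309_1"))
#> [1] "OGD_veste309_Veste309_1.csv"
basename(cache_path("OGD_veste309_Veste309_1", "C-A11-0"))
#> [1] "OGD_veste309_Veste309_1_C-A11-0.csv"
basename(cache_path("OGD_veste309_Veste309_1", ext = "json"))
#> [1] "OGD_veste309_Veste309_1.json"
```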
To read files from the cache as data.frames, use od_resource() with the same parameters as in od_cache_file(). This will apply a special parser to the dataset which drops unneeded columns and normalizes column names.
od_resource("OGD_veste309_Veste309_1", "C-A11-0")
# A data frame: 3 × 5
  code  label label_de  label_en  parent
* <chr> <chr> <chr>     <chr>     <fct>
1 A11-1 NA    insgesamt Sum total NA
2 A11-2 NA    männlich  Male      NA
3 A11-3 NA    weiblich  Female    NA
The parser behaves differently for header files, data files and fields. JSON files can be accessed with od_json().
#> [1] "Staatsangehörigkeit" "Bundesland"
#> [3] "Beschäftigungsverhältnis"
Clearing and Changing
od_cache_clear(id) removes all files belonging to the given dataset id from the cache. We saw that earnings$resources contains 7 rows, so 7 files will be deleted during cleanup.
od_cache_clear("OGD_veste309_Veste309_1")
#> deleted 7 files from '/tmp/RtmpHB89Xm/STATcubeR/open_data/'
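To make the idea of "all files belonging to a dataset id" concrete, here is a minimal base R sketch that deletes files by filename prefix. The helper clear_by_id() is hypothetical, and matching on the id followed by "." or "_" is an assumption derived from the filenames shown in this article, not a description of how od_cache_clear() is implemented.

```r
# Hypothetical helper, not part of STATcubeR: delete every file in `dir`
# whose name is the dataset id followed by "." or "_".
clear_by_id <- function(dir, id) {
  files <- list.files(dir, full.names = TRUE)
  name <- basename(files)
  hit <- startsWith(name, paste0(id, ".")) | startsWith(name, paste0(id, "_"))
  sum(file.remove(files[hit]))  # number of files deleted
}

# Demo in a throwaway directory with made-up dataset ids
demo_dir <- file.path(tempdir(), "cache_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("X_1.json", "X_1.csv", "X_1_HEADER.csv", "X_10.csv")))
clear_by_id(demo_dir, "X_1")  # 3: X_10.csv belongs to another dataset and is kept
```

Requiring a "." or "_" after the id keeps a dataset such as X_10 from being removed when X_1 is cleared.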
If you want to use a persistent directory like ~/.cache/STATcubeR/open_data/ for caching, the directory can be changed with od_cache_dir(new).
od_cache_dir("~/.cache/STATcubeR/open_data/")
The resources field
Let’s go back to the $resources field of earnings.
earnings$resources
# A data frame: 7 × 6
  name                             last_modified       cached               size
  <chr>                            <dttm>              <dttm>              <dbl>
1 OGD_veste309_Veste309_1.json     2022-03-24 11:29:48 2024-04-18 10:04:54  4062
2 OGD_veste309_Veste309_1.csv      2022-03-24 11:29:48 2024-04-18 10:04:54  4931
3 OGD_veste309_Veste309_1_HEADER.… 2022-03-24 11:29:48 2024-04-18 10:04:54   516
4 OGD_veste309_Veste309_1_C-A11-0… 2022-03-24 11:29:48 2024-04-18 10:04:54   159
5 OGD_veste309_Veste309_1_C-STAAT… 2022-03-24 11:29:48 2024-04-18 10:04:54   697
6 OGD_veste309_Veste309_1_C-VEBDL… 2022-03-24 11:29:48 2024-04-18 10:04:54   518
7 OGD_veste309_Veste309_1_C-BESCH… 2022-03-24 11:29:48 2024-04-18 10:04:54   641
# ℹ 2 more variables: download <dbl>, parsed <dbl>
We already looked at name and last_modified. The remaining columns can be interpreted as follows:
- cached tells us the last time the cache file for the resource was modified.
- size is the file size in bytes.
- download contains the number of milliseconds used to retrieve the resource when it was last updated.
- parsed reports the number of milliseconds it took od_resource() to convert the file contents into a data.frame() format. For the JSON file, the parsing time is always reported as NA.
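The following base R sketch shows how values like cached, size and parsed could be derived for a single file. The helpers file_info_row() and time_ms() are hypothetical names introduced for this example, and read.csv2() stands in for the package's own parser; STATcubeR may compute these columns differently.

```r
# Hypothetical helpers, not part of STATcubeR.
file_info_row <- function(path) {
  data.frame(
    name   = basename(path),
    cached = file.mtime(path),  # last modification of the cache file
    size   = file.size(path)    # file size in bytes
  )
}

# Elapsed wall-clock time of an expression, in milliseconds
time_ms <- function(expr) {
  system.time(expr)[["elapsed"]] * 1000
}

path <- tempfile(fileext = ".csv")
writeBin(charToRaw("a;b\n1;2\n"), path)  # 8 bytes of semicolon-separated data
info <- file_info_row(path)
info$parsed <- time_ms(read.csv2(path))  # stand-in for the real parser
info$size
#> [1] 8
```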
What’s in the cache?
od_cache_summary() will give an overview of all files that are available in the cache directory. The returned table contains one row for every dataset.
- The column updated contains the last modified date for the dataset's JSON file.
- json, data and header give the file sizes in bytes for the corresponding files.
- fields is the total size of all fields and n_fields is the number of classification files available.
This gives a clear picture of how much disk space is used for each dataset.
#> NULL
Note that od_cache_summary() only gathers information from the local file system based on filenames, file.mtime() and file.size().
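As a rough illustration of a summary built purely from filenames and file.size(), the sketch below groups the files of a directory by dataset id. The function cache_usage(), its output columns, and the regular expression used to recover the id are assumptions based on the filenames shown in this article; this is not the implementation of od_cache_summary().

```r
# Hypothetical helper, not part of STATcubeR: per-dataset disk usage,
# derived only from filenames and file sizes.
cache_usage <- function(dir) {
  files <- list.files(dir, pattern = "\\.(csv|json)$", full.names = TRUE)
  if (length(files) == 0) {
    return(data.frame(id = character(0), total_bytes = numeric(0),
                      n_fields = integer(0)))
  }
  name <- basename(files)
  # Strip "_HEADER" or "_C-..." plus the extension to recover the dataset id
  id <- sub("(_HEADER|_C-.*)?\\.(csv|json)$", "", name)
  is_field <- grepl("_C-.*\\.csv$", name)  # classification (field) files
  total <- tapply(file.size(files), id, sum)
  n_fields <- tapply(is_field, id, sum)
  data.frame(id = names(total), total_bytes = as.numeric(total),
             n_fields = as.integer(n_fields), row.names = NULL)
}

# Usage: pass any directory that follows the naming convention
cache_usage(tempdir())
```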
Download history
To get a history of all files that have been downloaded from the server, use od_downloads(). For each file, a timestamp for the download is recorded as well as the download time in milliseconds.
# A data frame: 7 × 3
  time                file                                   downloaded
* <dttm>              <chr>                                       <dbl>
1 2024-04-18 10:04:54 OGD_veste309_Veste309_1_C-BESCHV-0.csv       104.
2 2024-04-18 10:04:54 OGD_veste309_Veste309_1_C-VEBDL-0.csv        103.
3 2024-04-18 10:04:54 OGD_veste309_Veste309_1_C-STAATS-0.csv       103.
4 2024-04-18 10:04:54 OGD_veste309_Veste309_1_C-A11-0.csv          103.
5 2024-04-18 10:04:54 OGD_veste309_Veste309_1_HEADER.csv           102.
6 2024-04-18 10:04:54 OGD_veste309_Veste309_1.csv                  103.
7 2024-04-18 10:04:54 OGD_veste309_Veste309_1.json                 335.