This article explains how and where STATcubeR caches resources from data.statistik.gv.at in the local file system. Understanding this behavior will allow you to enable persistent caches and directly use the cached resources.
Overview
By default, STATcubeR caches all accessed resources from data.statistik.gv.at in the temporary directory of the current R session.
od_cache_dir()
#> [1] "/tmp/RtmpATgdnl/STATcubeR/open_data/"
As an example, let’s examine what happens when the data from the Structure of Earnings Survey (SES) is requested.
earnings <- od_table("OGD_veste309_Veste309_1")
First, STATcubeR will grab a JSON with metadata about this dataset from https://data.statistik.gv.at/ogd/json?dataset=OGD_veste309_Veste309_1 and check which resources belong to it. For each resource, the attributes name and last_modified are extracted from the JSON. They are also included in the od_table object under $resources.
earnings$resources
# A data frame: 7 × 6
  name           last_modified       cached               size download parsed
  <chr>          <dttm>              <dttm>              <dbl>    <dbl>  <dbl>
1 meta.json      2022-03-24 11:29:48 2024-11-29 09:55:33  4062     530. NA
2 data.csv       2022-03-24 11:29:48 2024-11-29 09:55:33  4931     167.  0.773
3 HEADER.csv     2022-03-24 11:29:48 2024-11-29 09:55:34   516     168.  0.480
4 C-A11-0.csv    2022-03-24 11:29:48 2024-11-29 09:55:34   159     168.  0.448
5 C-STAATS-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   697     168.  0.466
6 C-VEBDL-0.csv  2022-03-24 11:29:48 2024-11-29 09:55:34   518     167.  0.476
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   641     167.  0.470
last_modified tells us when the resource was changed on the fileserver. If a resource does not exist in the cache or if the last modified entry in the JSON is newer than the cached file, it will be downloaded from the server. Otherwise, the cached version is reused.
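The decision rule described above can be sketched in a few lines of base R. This is only an illustration of the rule, not STATcubeR's actual code: the helper needs_download() and its arguments are hypothetical names, and a temporary file stands in for a cached resource.

```r
# Hypothetical helper, not part of STATcubeR: decide whether a cached file
# has to be (re-)downloaded, following the rule described in the text.
needs_download <- function(cache_path, last_modified) {
  # A resource that was never cached always has to be fetched
  if (!file.exists(cache_path)) return(TRUE)
  # Otherwise compare the server timestamp with the local modification time
  last_modified > file.mtime(cache_path)
}

# Demo with a temporary file standing in for a cached resource
cached <- tempfile(fileext = ".csv")
needs_download(cached, Sys.time())         # file is missing
writeLines("a;b", cached)
needs_download(cached, Sys.time() - 3600)  # server copy is older than the cache
needs_download(cached, Sys.time() + 3600)  # server copy is newer than the cache
```

The first and last calls report that a download is needed, while the second one would reuse the cached file.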
Access and Updates
Cached files can be accessed with od_cache_file(). If the specified file exists in the cache, a path to the file will be returned. Otherwise, the file is downloaded to the cache and then the path is returned. The files use the same naming conventions as the open data fileserver.
od_cache_file("OGD_veste309_Veste309_1")
#> [1] "/tmp/RtmpATgdnl/STATcubeR/open_data/OGD_veste309_Veste309_1.csv"
od_cache_file("OGD_veste309_Veste309_1", "C-A11-0")
#> [1] "/tmp/RtmpATgdnl/STATcubeR/open_data/OGD_veste309_Veste309_1_C-A11-0.csv"
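The naming pattern visible in these paths (dataset id, an optional suffix joined with an underscore, then the file extension) can be reproduced with a small base R helper. The function cache_file_name() is a hypothetical name introduced only for illustration, and the pattern is inferred from the file names shown in this article rather than taken from STATcubeR's source.

```r
# Hypothetical helper that mirrors the file names shown in this article;
# it is not STATcubeR's implementation.
cache_file_name <- function(id, suffix = NULL, ext = "csv") {
  # Without a suffix the file is named after the dataset id alone
  stem <- if (is.null(suffix)) id else paste(id, suffix, sep = "_")
  paste0(stem, ".", ext)
}

# A throwaway directory laid out like the default cache location
cache_dir <- file.path(tempdir(), "STATcubeR", "open_data")
file.path(cache_dir, cache_file_name("OGD_veste309_Veste309_1"))
file.path(cache_dir, cache_file_name("OGD_veste309_Veste309_1", "C-A11-0"))
file.path(cache_dir, cache_file_name("OGD_veste309_Veste309_1", ext = "json"))
```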
To read files from the cache as data.frames, use od_resource() with the same parameters as in od_cache_file(). This will apply a special parser to the dataset which drops unneeded columns and normalizes column names.
od_resource("OGD_veste309_Veste309_1", "C-A11-0")
# A data frame: 3 × 7
  code  label label_de  label_en  parent de_desc en_desc
* <chr> <chr> <chr>     <chr>     <fct>  <lgl>   <lgl>
1 A11-1 NA    insgesamt Sum total NA     NA      NA
2 A11-2 NA    männlich  Male      NA     NA      NA
3 A11-3 NA    weiblich  Female    NA     NA      NA
The parser behaves differently for header files, data files and fields. JSON files can be accessed with od_json().
#> [1] "Staatsangehörigkeit" "Bundesland"
#> [3] "Beschäftigungsverhältnis"
Clearing and Changing
od_cache_clear(id) can be used to clear the cache of all files belonging to the passed dataset id. We saw that earnings$resources contains 7 rows, therefore 7 files will be deleted during cleanup.
od_cache_clear("OGD_veste309_Veste309_1")
#> deleted 7 files from '/tmp/RtmpATgdnl/STATcubeR/open_data/'
If you want to use a persistent directory like ~/.cache/STATcubeR/open_data/ for caching, the directory can be changed with od_cache_dir(new).
od_cache_dir("~/.cache/STATcubeR/open_data/")
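Before pointing the cache at a new location, it can be useful to make sure the directory exists. The sketch below does this with base R only. The helper ensure_dir() is a hypothetical name, and whether od_cache_dir() already creates missing directories on its own is not covered in this article, so treat the extra step as a precaution rather than a requirement.

```r
# Hypothetical helper, not part of STATcubeR: create a directory if it is
# missing and return its expanded path.
ensure_dir <- function(path) {
  path <- path.expand(path)
  if (!dir.exists(path)) dir.create(path, recursive = TRUE)
  path
}

# A throwaway location is used here so the sketch has no side effects;
# replace it with e.g. "~/.cache/STATcubeR/open_data/" for real use.
persistent <- ensure_dir(file.path(tempdir(), "persistent", "open_data"))
dir.exists(persistent)

# With STATcubeR attached, the cache could then be pointed at the directory:
# od_cache_dir(persistent)
```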
The resources field
Let’s go back to the $resources field of earnings.
earnings$resources
# A data frame: 7 × 6
  name           last_modified       cached               size download parsed
  <chr>          <dttm>              <dttm>              <dbl>    <dbl>  <dbl>
1 meta.json      2022-03-24 11:29:48 2024-11-29 09:55:33  4062     530. NA
2 data.csv       2022-03-24 11:29:48 2024-11-29 09:55:33  4931     167.  0.773
3 HEADER.csv     2022-03-24 11:29:48 2024-11-29 09:55:34   516     168.  0.480
4 C-A11-0.csv    2022-03-24 11:29:48 2024-11-29 09:55:34   159     168.  0.448
5 C-STAATS-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   697     168.  0.466
6 C-VEBDL-0.csv  2022-03-24 11:29:48 2024-11-29 09:55:34   518     167.  0.476
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   641     167.  0.470
We already looked at name and last_modified. The remaining columns can be interpreted as follows:
- cached tells us the last time the cache file for the resource was modified.
- size is the file size in bytes.
- download contains the number of milliseconds used to retrieve the resource when it was last updated.
- parsed reports the number of milliseconds it took od_resource() to convert the file contents into a data.frame() format. For the JSON file, the parsing time is always reported as NA.
What’s in the cache?
od_cache_summary() will give an overview of all files that are available in the cache directory. The returned table contains one row for every dataset.
- The column updated contains the last modified date for the dataset's JSON file.
- json, data and header give the file sizes in bytes for the corresponding files.
- fields is the total size of all fields and n_fields is the number of classification files available.
We can get a clear picture of how much disk space is used for each dataset.
od_cache_summary()
#> NULL
Note that od_cache_summary() only gathers information from the local file system based on filenames, file.mtime() and file.size().
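To illustrate how such a summary can be derived from the file system alone, here is a minimal base R sketch. It is not STATcubeR's implementation: summarise_cache() is a hypothetical name, the returned columns are simplified, and the regular expression assumes the file naming pattern shown earlier in this article (id.json, id.csv, id_HEADER.csv, id_C-....csv).

```r
# Hypothetical sketch, not STATcubeR's code: summarise a cache directory
# using nothing but file names, file.size() and file.mtime().
summarise_cache <- function(dir) {
  files <- list.files(dir, pattern = "\\.(csv|json)$", full.names = TRUE)
  if (length(files) == 0) return(NULL)
  name <- basename(files)
  # Strip "_HEADER", "_C-..." and the extension to recover the dataset id
  id <- sub("(_HEADER|_C-.*)?\\.(csv|json)$", "", name)
  # Classification (field) files carry a "_C-" suffix
  is_field <- grepl("_C-.*\\.csv$", name)
  # Latest modification time per dataset, kept numeric while aggregating
  latest <- as.vector(tapply(as.numeric(file.mtime(files)), id, max))
  data.frame(
    id       = sort(unique(id)),
    n_files  = as.vector(tapply(name, id, length)),
    n_fields = as.vector(tapply(is_field, id, sum)),
    bytes    = as.vector(tapply(file.size(files), id, sum)),
    updated  = as.POSIXct(latest, origin = "1970-01-01")
  )
}

# Demo with dummy files for a made-up dataset id "DS_1"
demo_dir <- file.path(tempdir(), "cache_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (f in c("DS_1.json", "DS_1.csv", "DS_1_HEADER.csv", "DS_1_C-A11-0.csv")) {
  writeLines("x", file.path(demo_dir, f))
}
summarise_cache(demo_dir)
```

Because only file metadata is read, a summary of this kind needs no network access and never has to open the cached files.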
Download history
To get a history of all files that have been downloaded from the server, use od_downloads(). For each file, a timestamp for the download is recorded as well as the download time in milliseconds.
od_downloads()
# A data frame: 7 × 3
  time                file                                   downloaded
  <dttm>              <chr>                                       <dbl>
1 2024-11-29 09:55:33 OGD_veste309_Veste309_1.json                 530.
2 2024-11-29 09:55:33 OGD_veste309_Veste309_1.csv                  167.
3 2024-11-29 09:55:34 OGD_veste309_Veste309_1_HEADER.csv           168.
4 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-A11-0.csv          168.
5 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-STAATS-0.csv       168.
6 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-VEBDL-0.csv        167.
7 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-BESCHV-0.csv       167.