This article explains how and where STATcubeR caches resources from data.statistik.gv.at in the local file system. Understanding this behavior will allow you to enable persistent caches and directly use the cached resources.
Overview
By default, STATcubeR caches all accessed resources from data.statistik.gv.at in the temporary directory of the current R session.
od_cache_dir()
#> [1] "/tmp/RtmpjlfDXE/STATcubeR/open_data/"
Let’s examine, for example, what happens when data from the Structure of Earnings Survey (SES) is requested.
earnings <- od_table("OGD_veste309_Veste309_1")
First, STATcubeR will grab a json with metadata about this dataset from https://data.statistik.gv.at/ogd/json?dataset=OGD_veste309_Veste309_1 and check which resources belong to it. For each resource, the attributes name and last_modified are extracted from the json. They are also included in the od_table object under $resources.
earnings$resources
# A data frame: 7 × 6
name last_modified cached size download parsed
<chr> <dttm> <dttm> <dbl> <dbl> <dbl>
1 meta.json 2022-03-24 11:29:48 2022-12-20 11:35:48 4062 80.8 NA
2 data.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 4931 8.17 1.61
3 HEADER.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 516 6.44 20.9
4 C-A11-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 159 8.16 0.544
5 C-STAATS-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 697 4.79 0.541
6 C-VEBDL-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 518 8.02 0.536
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 641 5.05 0.520
last_modified tells us when the resource was last changed on the file server. If a resource does not exist in the cache, or if the last_modified entry in the json is newer than the cached file, it will be downloaded from the server. Otherwise, the cached version is reused.
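The freshness rule described above can be sketched in base R. This is only an illustration under stated assumptions, not STATcubeR's actual implementation: the helper name needs_download() and its arguments are hypothetical, and the local file's modification time stands in for the cached timestamp.

```r
# Hypothetical helper (not part of STATcubeR): decide whether a cached
# file has to be downloaded again.
#   cache_path:    path of the cached file on disk
#   last_modified: POSIXct timestamp taken from the dataset's json
needs_download <- function(cache_path, last_modified) {
  if (!file.exists(cache_path)) {
    return(TRUE)  # nothing cached yet
  }
  # stale if the server copy is newer than the local file
  last_modified > file.mtime(cache_path)
}

# Usage with a throwaway file
path <- tempfile(fileext = ".csv")
missing_file <- needs_download(path, Sys.time())         # no cache file yet
writeLines("a;b", path)
older_server <- needs_download(path, Sys.time() - 3600)  # server copy is older
newer_server <- needs_download(path, Sys.time() + 3600)  # server copy is newer
```

Only the third case, where the server reports a newer timestamp than the cached file, triggers a fresh download for a file that is already cached.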
Access and Updates
Cached files can be accessed with od_cache_file(). If the specified file exists in the cache, a path to the file will be returned. Otherwise, the file is downloaded to the cache and then the path is returned. The files use the same naming conventions as the open data file server.
od_cache_file("OGD_veste309_Veste309_1")
#> [1] "/tmp/RtmpjlfDXE/STATcubeR/open_data/OGD_veste309_Veste309_1.csv"
od_cache_file("OGD_veste309_Veste309_1", "C-A11-0")
#> [1] "/tmp/RtmpjlfDXE/STATcubeR/open_data/OGD_veste309_Veste309_1_C-A11-0.csv"
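The naming convention visible in these paths can be reproduced with a small helper. The function cache_file_name() below is hypothetical and only mirrors the file names shown in this article; it is not how STATcubeR builds its paths internally.

```r
# Hypothetical helper (not part of STATcubeR): build the file name used
# for a resource, following the pattern seen in the cache paths above.
cache_file_name <- function(id, suffix = NULL, ext = ".csv") {
  # the main data file has no suffix, all other files append "_<suffix>"
  stem <- if (is.null(suffix)) id else paste0(id, "_", suffix)
  paste0(stem, ext)
}

cache_file_name("OGD_veste309_Veste309_1")
#> [1] "OGD_veste309_Veste309_1.csv"
cache_file_name("OGD_veste309_Veste309_1", "C-A11-0")
#> [1] "OGD_veste309_Veste309_1_C-A11-0.csv"
cache_file_name("OGD_veste309_Veste309_1", ext = ".json")
#> [1] "OGD_veste309_Veste309_1.json"
```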
To read files from the cache as data.frames, use od_resource() with the same parameters as in od_cache_file(). This applies a special parser to the dataset, which drops unneeded columns and normalizes column names.
od_resource("OGD_veste309_Veste309_1", "C-A11-0")
# A data frame: 3 × 7
code label label_de label_en parent de_desc en_desc
* <chr> <chr> <chr> <chr> <fct> <lgl> <lgl>
1 A11-1 NA insgesamt Sum total NA NA NA
2 A11-2 NA männlich Male NA NA NA
3 A11-3 NA weiblich Female NA NA NA
The parser behaves differently for header files, data files and fields. Json files can be accessed with od_json().
#> [1] "Staatsangehörigkeit" "Bundesland"
#> [3] "Beschäftigungsverhältnis"
Clearing and Changing
od_cache_clear(id) can be used to remove all files belonging to the passed dataset id from the cache. We saw that earnings$resources contains 7 rows, therefore 7 files will be deleted during cleanup.
od_cache_clear("OGD_veste309_Veste309_1")
#> deleted 7 files from '/tmp/RtmpjlfDXE/STATcubeR/open_data/'
If you want to use a persistent directory like ~/.cache/STATcubeR/open_data/ for caching, the directory can be changed with od_cache_dir(new).
od_cache_dir("~/.cache/STATcubeR/open_data/")
The resources field
Let’s go back to the $resources field of earnings.
earnings$resources
# A data frame: 7 × 6
name last_modified cached size download parsed
<chr> <dttm> <dttm> <dbl> <dbl> <dbl>
1 meta.json 2022-03-24 11:29:48 2022-12-20 11:35:48 4062 80.8 NA
2 data.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 4931 8.17 1.61
3 HEADER.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 516 6.44 20.9
4 C-A11-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 159 8.16 0.544
5 C-STAATS-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 697 4.79 0.541
6 C-VEBDL-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 518 8.02 0.536
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48 641 5.05 0.520
We already looked at name and last_modified. The remaining columns can be interpreted as follows:
- cached tells us the last time the cache file for the resource was modified.
- size is the file size in bytes.
- download contains the number of milliseconds used to retrieve the resource when it was last updated.
- parsed reports the number of milliseconds it took od_resource() to convert the file contents into a data.frame() format. For the json file, the parsing time is always reported as NA.
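As a small worked example, the columns can be aggregated to see how much space and time a single dataset costs. The data frame below is rebuilt by hand from the printout above so that the snippet runs without network access; with a real od_table object, earnings$resources could be used in its place.

```r
# Values copied from the earnings$resources printout above
resources <- data.frame(
  name = c("meta.json", "data.csv", "HEADER.csv", "C-A11-0.csv",
           "C-STAATS-0.csv", "C-VEBDL-0.csv", "C-BESCHV-0.csv"),
  size = c(4062, 4931, 516, 159, 697, 518, 641),
  download = c(80.8, 8.17, 6.44, 8.16, 4.79, 8.02, 5.05),
  parsed = c(NA, 1.61, 20.9, 0.544, 0.541, 0.536, 0.520)
)

total_size <- sum(resources$size)                    # bytes on disk
total_download <- sum(resources$download)            # milliseconds
total_parsed <- sum(resources$parsed, na.rm = TRUE)  # json has no parse time
slowest <- resources$name[which.max(resources$download)]
```

Here the seven files occupy 11524 bytes in total, and the json metadata file accounts for most of the download time.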
What’s in the cache?
od_cache_summary() gives an overview of all files that are available in the cache directory. The returned table contains one row for every dataset.
- The column updated contains the last modified date for the dataset's json file.
- json, data and header give the file sizes in bytes for the corresponding files.
- fields is the total size of all fields and n_fields is the number of classification files available.
This gives a clear picture of how much disk space is used by each dataset.
od_cache_summary()
# A data frame: 315 × 7
id updated json data header fields
<chr> <dttm> <dbl> <dbl> <dbl> <dbl>
1 OGD__steuer_est_ab_2008_altge… 2022-09-26 12:08:54 9540 1.67e6 10936 1698
2 OGD__steuer_est_ab_2008_bl_ei… 2022-09-26 12:08:54 9622 3.48e6 10949 2194
3 OGD__steuer_est_ab_2008_blges… 2022-09-26 12:08:54 9586 1.69e6 10952 1585
4 OGD__steuer_kst_KST_1 2022-09-26 12:08:54 5009 2.06e5 1396 1823
5 OGD__steuer_kst_KST_2 2022-09-26 12:08:54 5017 3.26e5 1419 2961
6 OGD__steuer_lst_ab_2008_2_LST… 2022-09-26 12:08:54 7037 4.47e5 5307 2725
7 OGD__steuer_lst_ab_2008_4_LST… 2022-09-26 12:08:54 6943 6.19e2 5305 1476
8 OGD__steuer_lst_ab_2015_3_LSt… 2022-09-26 12:08:54 6528 8.37e4 4259 1695
9 OGD__steuer_lst_ab_2015_6_LSt… 2022-09-26 12:08:54 6208 3.05e5 4297 2653
10 OGD__steuer_lst_ab_2017_3_LSt… 2022-09-26 12:08:54 6513 4.20e4 4259 1676
# … with 305 more rows, and 1 more variable: n_fields <int>
Note that od_cache_summary() only gathers information from the local file system based on filenames, file.mtime() and file.size().
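The same kind of information can be collected by hand with base R. The sketch below is not the implementation of od_cache_summary(); it uses a throwaway directory with two dummy files (the directory name cache_demo and the file names are made up) to show what file.size() and file.mtime() report.

```r
# Hypothetical stand-in for a cache directory, filled with two dummy files
cache_dir <- file.path(tempdir(), "cache_demo")
dir.create(cache_dir, showWarnings = FALSE)
writeLines("a;b;c", file.path(cache_dir, "demo_1.csv"))
writeLines("{}", file.path(cache_dir, "demo_1.json"))

# One row per file: name, size in bytes and last modification time
files <- list.files(cache_dir, full.names = TRUE)
overview <- data.frame(
  file = basename(files),
  size = file.size(files),
  modified = file.mtime(files)
)
overview
```

Because only file system metadata is read, a summary like this requires no network access.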
Download history
To get a history of all files that have been downloaded from the server, use od_downloads(). For each file, a timestamp for the download is recorded as well as the download time in milliseconds.
od_downloads()
# A data frame: 7 × 3
time file downloaded
* <dttm> <chr> <dbl>
1 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-BESCHV-0.csv 5.05
2 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-VEBDL-0.csv 8.02
3 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-STAATS-0.csv 4.79
4 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-A11-0.csv 8.16
5 2022-12-20 11:35:48 OGD_veste309_Veste309_1_HEADER.csv 6.44
6 2022-12-20 11:35:48 OGD_veste309_Veste309_1.csv 8.17
7 2022-12-20 11:35:48 OGD_veste309_Veste309_1.json 80.8