File Management • STATcubeR

This article explains how and where STATcubeR caches resources from data.statistik.gv.at in the local file system. Understanding this behavior will allow you to enable persistent caches and directly use the cached resources.

Overview

By default, STATcubeR caches all accessed resources from data.statistik.gv.at in the temporary directory of the current R session.

od_cache_dir()

#> [1] "/tmp/RtmpjlfDXE/STATcubeR/open_data/"

Let’s examine, for example, what happens when the data from the Structure of Earnings Survey (SES) is requested.

earnings <- od_table("OGD_veste309_Veste309_1")

First, STATcubeR fetches a JSON file with metadata about this dataset from https://data.statistik.gv.at/ogd/json?dataset=OGD_veste309_Veste309_1 and checks which resources belong to it. For each resource, the attributes name and last_modified are extracted from the JSON. They are also included in the od_table object under $resources.

earnings$resources

# A data frame: 7 × 6
  name           last_modified       cached               size download parsed
  <chr>          <dttm>              <dttm>              <dbl>    <dbl>  <dbl>
1 meta.json      2022-03-24 11:29:48 2022-12-20 11:35:48  4062    80.8  NA    
2 data.csv       2022-03-24 11:29:48 2022-12-20 11:35:48  4931     8.17  1.61 
3 HEADER.csv     2022-03-24 11:29:48 2022-12-20 11:35:48   516     6.44 20.9  
4 C-A11-0.csv    2022-03-24 11:29:48 2022-12-20 11:35:48   159     8.16  0.544
5 C-STAATS-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48   697     4.79  0.541
6 C-VEBDL-0.csv  2022-03-24 11:29:48 2022-12-20 11:35:48   518     8.02  0.536
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48   641     5.05  0.520

last_modified tells us when the resource was last changed on the fileserver. If a resource does not exist in the cache, or if the last_modified entry in the JSON is newer than the cached file, it will be downloaded from the server. Otherwise, the cached version is reused.
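The decision rule described above can be sketched in base R. This is only an illustration under assumptions: needs_download() is a hypothetical helper and not part of STATcubeR, and the package's internal check may differ in detail.

```r
# Hypothetical helper, not part of STATcubeR: decide whether a resource
# has to be fetched again from the server.
needs_download <- function(cache_path, last_modified) {
  # Cache miss: the file was never downloaded
  if (!file.exists(cache_path)) {
    return(TRUE)
  }
  # Stale cache: the server copy changed after the local copy was written
  file.mtime(cache_path) < last_modified
}

# A throwaway file stands in for a cached resource
cached_file <- tempfile(fileext = ".csv")
writeLines("a;b", cached_file)

needs_download(cached_file, Sys.time() - 3600)  # server copy is older
needs_download(cached_file, Sys.time() + 3600)  # server copy is newer
needs_download(tempfile(), Sys.time())          # file is not in the cache
```

Only the second and third calls report that a download is needed.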

Access and Updates

Cached files can be accessed with od_cache_file(). If the specified file exists in the cache, the path to the file is returned. Otherwise, the file is downloaded to the cache first and then the path is returned. The files use the same naming conventions as the open data fileserver.

od_cache_file("OGD_veste309_Veste309_1")

#> [1] "/tmp/RtmpjlfDXE/STATcubeR/open_data/OGD_veste309_Veste309_1.csv"

od_cache_file("OGD_veste309_Veste309_1", "C-A11-0")

#> [1] "/tmp/RtmpjlfDXE/STATcubeR/open_data/OGD_veste309_Veste309_1_C-A11-0.csv"
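The naming pattern visible in these paths can be written down as a small helper. This is a sketch inferred from the file names shown in this article; cache_file_name() is a hypothetical function, not part of STATcubeR.

```r
# Hypothetical helper, not part of STATcubeR: build the file name a
# resource gets in the cache, following the pattern of the paths above.
cache_file_name <- function(id, suffix = NULL, ext = "csv") {
  # The data file and the json use the bare id; other resources append
  # their name after an underscore
  stem <- if (is.null(suffix)) id else paste(id, suffix, sep = "_")
  paste0(stem, ".", ext)
}

cache_file_name("OGD_veste309_Veste309_1")
#> [1] "OGD_veste309_Veste309_1.csv"
cache_file_name("OGD_veste309_Veste309_1", "C-A11-0")
#> [1] "OGD_veste309_Veste309_1_C-A11-0.csv"
cache_file_name("OGD_veste309_Veste309_1", ext = "json")
#> [1] "OGD_veste309_Veste309_1.json"
```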

To read files from the cache as data.frames, use od_resource() with the same parameters as in od_cache_file(). This applies a special parser to the file, which drops unneeded columns and normalizes column names.

od_resource("OGD_veste309_Veste309_1", "C-A11-0")

# A data frame: 3 × 7
  code  label label_de  label_en  parent de_desc en_desc
* <chr> <chr> <chr>     <chr>     <fct>  <lgl>   <lgl>  
1 A11-1 NA    insgesamt Sum total NA     NA      NA     
2 A11-2 NA    männlich  Male      NA     NA      NA     
3 A11-3 NA    weiblich  Female    NA     NA      NA

The parser behaves differently for header files, data files and fields. JSON files can be accessed with od_json().

json <- od_json("OGD_veste309_Veste309_1")
unlist(json$tags)

#> [1] "Staatsangehörigkeit"      "Bundesland"              
#> [3] "Beschäftigungsverhältnis"

Clearing and Changing

od_cache_clear(id) removes all files belonging to the given dataset id from the cache. We saw that earnings$resources contains 7 rows, so 7 files will be deleted during cleanup.

od_cache_clear("OGD_veste309_Veste309_1")

#> deleted 7 files from '/tmp/RtmpjlfDXE/STATcubeR/open_data/'

If you want to use a persistent directory like ~/.cache/STATcubeR/open_data/ for caching, the directory can be changed with od_cache_dir(new).

od_cache_dir("~/.cache/STATcubeR/open_data/")
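If it is unclear whether the target directory already exists, it can be created before the cache is pointed at it. The wrapper below is a sketch: use_cache_dir() is a hypothetical name, and the only STATcubeR call it relies on is od_cache_dir(new) as described above.

```r
# Hypothetical wrapper, not part of STATcubeR: make sure the directory
# exists before pointing the cache at it.
use_cache_dir <- function(path) {
  path <- path.expand(path)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  # od_cache_dir(new) is the setter described above; the guard lets the
  # sketch run even when STATcubeR is not installed
  if (requireNamespace("STATcubeR", quietly = TRUE)) {
    STATcubeR::od_cache_dir(path)
  }
  invisible(path)
}
```

A call such as use_cache_dir("~/.cache/STATcubeR/open_data/") could then be made at the start of a session. If the setting needs to be applied in every session, one option is to place the call in a startup file such as .Rprofile.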

The resources field

Let’s go back to the $resources field of earnings.

earnings$resources

# A data frame: 7 × 6
  name           last_modified       cached               size download parsed
  <chr>          <dttm>              <dttm>              <dbl>    <dbl>  <dbl>
1 meta.json      2022-03-24 11:29:48 2022-12-20 11:35:48  4062    80.8  NA    
2 data.csv       2022-03-24 11:29:48 2022-12-20 11:35:48  4931     8.17  1.61 
3 HEADER.csv     2022-03-24 11:29:48 2022-12-20 11:35:48   516     6.44 20.9  
4 C-A11-0.csv    2022-03-24 11:29:48 2022-12-20 11:35:48   159     8.16  0.544
5 C-STAATS-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48   697     4.79  0.541
6 C-VEBDL-0.csv  2022-03-24 11:29:48 2022-12-20 11:35:48   518     8.02  0.536
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2022-12-20 11:35:48   641     5.05  0.520

We already looked at name and last_modified. The remaining columns can be interpreted as follows:

cached tells us the last time the cache file for the resource was modified.
size is the file size in bytes.
download contains the number of milliseconds it took to retrieve the resource when it was last updated.
parsed reports the number of milliseconds it took od_resource() to convert the file contents into a data.frame. For the JSON file, the parsing time is always reported as NA.
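Because $resources is an ordinary data frame, these columns can be aggregated directly. The sketch below rebuilds the relevant columns from the printout above, so it runs without network access. With a real od_table object, earnings$resources can be used in place of the hand-built data frame. The printed values are rounded, so totals computed from them are approximate.

```r
# Values copied from the earnings$resources printout above
resources <- data.frame(
  name = c("meta.json", "data.csv", "HEADER.csv", "C-A11-0.csv",
           "C-STAATS-0.csv", "C-VEBDL-0.csv", "C-BESCHV-0.csv"),
  size = c(4062, 4931, 516, 159, 697, 518, 641),
  download = c(80.8, 8.17, 6.44, 8.16, 4.79, 8.02, 5.05),
  parsed = c(NA, 1.61, 20.9, 0.544, 0.541, 0.536, 0.520)
)

# Total disk space used by the dataset, in bytes
sum(resources$size)
#> [1] 11524

# Total download time in milliseconds
sum(resources$download)

# Total parsing time in milliseconds; the json row is NA and is skipped
sum(resources$parsed, na.rm = TRUE)
```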

What’s in the cache?

od_cache_summary() gives an overview of all files that are available in the cache directory. The returned table contains one row for every dataset.

The column updated contains the last modified date of the dataset’s JSON file.
json, data and header give the file sizes in bytes for the corresponding files.
fields is the total size of all fields and n_fields is the number of classification files available.

This gives a clear picture of how much disk space is used by each dataset.

od_cache_summary()

# A data frame: 315 × 7
   id                             updated              json   data header fields
   <chr>                          <dttm>              <dbl>  <dbl>  <dbl>  <dbl>
 1 OGD__steuer_est_ab_2008_altge… 2022-09-26 12:08:54  9540 1.67e6  10936   1698
 2 OGD__steuer_est_ab_2008_bl_ei… 2022-09-26 12:08:54  9622 3.48e6  10949   2194
 3 OGD__steuer_est_ab_2008_blges… 2022-09-26 12:08:54  9586 1.69e6  10952   1585
 4 OGD__steuer_kst_KST_1          2022-09-26 12:08:54  5009 2.06e5   1396   1823
 5 OGD__steuer_kst_KST_2          2022-09-26 12:08:54  5017 3.26e5   1419   2961
 6 OGD__steuer_lst_ab_2008_2_LST… 2022-09-26 12:08:54  7037 4.47e5   5307   2725
 7 OGD__steuer_lst_ab_2008_4_LST… 2022-09-26 12:08:54  6943 6.19e2   5305   1476
 8 OGD__steuer_lst_ab_2015_3_LSt… 2022-09-26 12:08:54  6528 8.37e4   4259   1695
 9 OGD__steuer_lst_ab_2015_6_LSt… 2022-09-26 12:08:54  6208 3.05e5   4297   2653
10 OGD__steuer_lst_ab_2017_3_LSt… 2022-09-26 12:08:54  6513 4.20e4   4259   1676
# … with 305 more rows, and 1 more variable: n_fields <int>

Note that od_cache_summary() only gathers information from the local file system based on filenames, file.mtime() and file.size().
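To illustrate how such a summary can be derived from the file system alone, here is a simplified sketch in base R. It is not the code used by od_cache_summary(): summarise_cache() is a hypothetical function, it only reports the total size and the number of field files per dataset, and it assumes that everything before an optional "_HEADER" or "_C-..." suffix is the dataset id. The dataset id OGD_demo_1 is made up for the demonstration.

```r
# Hypothetical re-implementation, not the code used by od_cache_summary()
summarise_cache <- function(dir) {
  files <- list.files(dir, pattern = "\\.(csv|json)$")
  size <- file.size(file.path(dir, files))
  # Assumption: the dataset id is the file name without the extension and
  # without an optional "_HEADER" or "_C-..." suffix
  id <- sub("(_HEADER|_C-[^_]+)?\\.(csv|json)$", "", files)
  is_field <- grepl("_C-[^_]+\\.csv$", files)
  ids <- unique(id)
  data.frame(
    id = ids,
    total_size = as.numeric(tapply(size, id, sum)[ids]),
    n_fields = as.integer(tapply(is_field, id, sum)[ids])
  )
}

# Demo with placeholder files for a made-up dataset id
demo_dir <- file.path(tempdir(), "cache_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("OGD_demo_1.json", "OGD_demo_1.csv",
                "OGD_demo_1_HEADER.csv", "OGD_demo_1_C-A11-0.csv")
for (f in demo_files) {
  # 11 bytes per file
  writeBin(charToRaw("placeholder"), file.path(demo_dir, f))
}

summarise_cache(demo_dir)
```

The result has one row for OGD_demo_1 with a total size of 44 bytes and one field file.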

Download history

To get a history of all files that have been downloaded from the server, use od_downloads(). For each file, a timestamp for the download is recorded as well as the download time in milliseconds.

od_downloads()

# A data frame: 7 × 3
  time                file                                   downloaded
* <dttm>              <chr>                                       <dbl>
1 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-BESCHV-0.csv       5.05
2 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-VEBDL-0.csv        8.02
3 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-STAATS-0.csv       4.79
4 2022-12-20 11:35:48 OGD_veste309_Veste309_1_C-A11-0.csv          8.16
5 2022-12-20 11:35:48 OGD_veste309_Veste309_1_HEADER.csv           6.44
6 2022-12-20 11:35:48 OGD_veste309_Veste309_1.csv                  8.17
7 2022-12-20 11:35:48 OGD_veste309_Veste309_1.json                80.8