File Management • STATcubeR

This article explains how and where STATcubeR caches resources from data.statistik.gv.at in the local file system. Understanding this behavior will allow you to enable persistent caches and directly use the cached resources.

Overview

By default, STATcubeR caches all accessed resources from data.statistik.gv.at in the temporary directory of the current R session.

od_cache_dir()

#> [1] "/tmp/RtmpATgdnl/STATcubeR/open_data/"

As an example, let’s examine what happens when data from the Structure of Earnings Survey (SES) is requested.

earnings <- od_table("OGD_veste309_Veste309_1")

First, STATcubeR fetches a JSON file with metadata about this dataset from https://data.statistik.gv.at/ogd/json?dataset=OGD_veste309_Veste309_1 and checks which resources belong to it. For each resource, the attributes name and last_modified are extracted from the JSON. They are also included in the od_table object under $resources.

earnings$resources

# A data frame: 7 × 6
  name           last_modified       cached               size download parsed
  <chr>          <dttm>              <dttm>              <dbl>    <dbl>  <dbl>
1 meta.json      2022-03-24 11:29:48 2024-11-29 09:55:33  4062     530. NA    
2 data.csv       2022-03-24 11:29:48 2024-11-29 09:55:33  4931     167.  0.773
3 HEADER.csv     2022-03-24 11:29:48 2024-11-29 09:55:34   516     168.  0.480
4 C-A11-0.csv    2022-03-24 11:29:48 2024-11-29 09:55:34   159     168.  0.448
5 C-STAATS-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   697     168.  0.466
6 C-VEBDL-0.csv  2022-03-24 11:29:48 2024-11-29 09:55:34   518     167.  0.476
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   641     167.  0.470

last_modified tells us when the resource was last changed on the fileserver. If a resource does not exist in the cache, or if the last_modified entry in the JSON is newer than the cached file, it will be downloaded from the server. Otherwise, the cached version is reused.
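The update rule described above can be sketched in base R. This is only an illustration of the decision, not STATcubeR's actual implementation; the helper name needs_download() and its arguments are hypothetical.

```r
# Hypothetical helper: decide whether a cached file must be (re)downloaded.
# cache_path:    path of the file in the local cache
# last_modified: POSIXct timestamp taken from the server's metadata
needs_download <- function(cache_path, last_modified) {
  # A resource that is not cached yet always has to be fetched
  if (!file.exists(cache_path)) {
    return(TRUE)
  }
  # Otherwise, fetch again only if the server copy is newer than the cache
  last_modified > file.mtime(cache_path)
}

# Usage with a throwaway file standing in for a cached resource
tmp <- tempfile(fileext = ".csv")
needs_download(tmp, Sys.time())          # TRUE: nothing is cached yet
writeLines("a;b", tmp)
needs_download(tmp, Sys.time() - 3600)   # FALSE: the cache is newer
needs_download(tmp, Sys.time() + 3600)   # TRUE: the server copy is newer
```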

Access and Updates

Cached files can be accessed with od_cache_file(). If the specified file exists in the cache, a path to the file will be returned. Otherwise, the file is downloaded to the cache and then the path is returned. The files use the same naming conventions as the open data fileserver.

od_cache_file("OGD_veste309_Veste309_1")

#> [1] "/tmp/RtmpATgdnl/STATcubeR/open_data/OGD_veste309_Veste309_1.csv"

od_cache_file("OGD_veste309_Veste309_1", "C-A11-0")

#> [1] "/tmp/RtmpATgdnl/STATcubeR/open_data/OGD_veste309_Veste309_1_C-A11-0.csv"
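The naming pattern visible in these paths can be reproduced with a few lines of base R. The helper cache_file_name() below is hypothetical and only mirrors the file names shown in this article; it is not part of STATcubeR.

```r
# Hypothetical helper that rebuilds the file names used in the cache.
# id:     dataset id, e.g. "OGD_veste309_Veste309_1"
# suffix: optional resource name such as "HEADER" or "C-A11-0"
# ext:    file extension, "csv" or "json"
cache_file_name <- function(id, suffix = NULL, ext = "csv") {
  # Resource-specific files append "_<suffix>" to the dataset id
  stem <- if (is.null(suffix)) id else paste(id, suffix, sep = "_")
  paste0(stem, ".", ext)
}

cache_file_name("OGD_veste309_Veste309_1")
#> [1] "OGD_veste309_Veste309_1.csv"
cache_file_name("OGD_veste309_Veste309_1", "C-A11-0")
#> [1] "OGD_veste309_Veste309_1_C-A11-0.csv"
cache_file_name("OGD_veste309_Veste309_1", ext = "json")
#> [1] "OGD_veste309_Veste309_1.json"
```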

To read files from the cache as data.frames, use od_resource() with the same parameters as in od_cache_file(). This will apply a special parser to the dataset which drops unneeded columns and normalizes column names.

od_resource("OGD_veste309_Veste309_1", "C-A11-0")

# A data frame: 3 × 7
  code  label label_de  label_en  parent de_desc en_desc
* <chr> <chr> <chr>     <chr>     <fct>  <lgl>   <lgl>  
1 A11-1 NA    insgesamt Sum total NA     NA      NA     
2 A11-2 NA    männlich  Male      NA     NA      NA     
3 A11-3 NA    weiblich  Female    NA     NA      NA

The parser behaves differently for header files, data files and fields. JSON files can be accessed with od_json().

json <- od_json("OGD_veste309_Veste309_1")
unlist(json$tags)

#> [1] "Staatsangehörigkeit"      "Bundesland"              
#> [3] "Beschäftigungsverhältnis"

Clearing and Changing

od_cache_clear(id) removes all files belonging to the passed dataset id from the cache. We saw that earnings$resources contains 7 rows, so 7 files will be deleted during cleanup.

od_cache_clear("OGD_veste309_Veste309_1")

#> deleted 7 files from '/tmp/RtmpATgdnl/STATcubeR/open_data/'

If you want to use a persistent directory like ~/.cache/STATcubeR/open_data/ for caching, the directory can be changed with od_cache_dir(new).

od_cache_dir("~/.cache/STATcubeR/open_data/")
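This article does not say whether od_cache_dir() creates a missing directory on its own. Assuming it might not, one cautious option is to create the directory first. The helper prepare_cache_dir() below is hypothetical and uses only base R.

```r
# Hypothetical helper: make sure a directory exists and return its path.
prepare_cache_dir <- function(path) {
  path <- path.expand(path)
  if (!dir.exists(path)) {
    # recursive = TRUE also creates missing parent directories
    dir.create(path, recursive = TRUE)
  }
  path
}

# Demonstrated on a temporary location so nothing is written to the home
# directory. For a persistent cache, pass "~/.cache/STATcubeR/open_data/"
# instead and hand the result to od_cache_dir().
demo_dir <- prepare_cache_dir(file.path(tempdir(), "demo_cache", "open_data"))
dir.exists(demo_dir)
```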

The resources field

Let’s go back to the $resources field of earnings.

earnings$resources

# A data frame: 7 × 6
  name           last_modified       cached               size download parsed
  <chr>          <dttm>              <dttm>              <dbl>    <dbl>  <dbl>
1 meta.json      2022-03-24 11:29:48 2024-11-29 09:55:33  4062     530. NA    
2 data.csv       2022-03-24 11:29:48 2024-11-29 09:55:33  4931     167.  0.773
3 HEADER.csv     2022-03-24 11:29:48 2024-11-29 09:55:34   516     168.  0.480
4 C-A11-0.csv    2022-03-24 11:29:48 2024-11-29 09:55:34   159     168.  0.448
5 C-STAATS-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   697     168.  0.466
6 C-VEBDL-0.csv  2022-03-24 11:29:48 2024-11-29 09:55:34   518     167.  0.476
7 C-BESCHV-0.csv 2022-03-24 11:29:48 2024-11-29 09:55:34   641     167.  0.470

We already looked at name and last_modified. The remaining columns can be interpreted as follows:

cached tells us the last time the cache file for the resource was modified.
size is the file size in bytes.
download contains the number of milliseconds used to retrieve the resource when it was last updated.
parsed reports the number of milliseconds it took od_resource() to convert the file contents into a data.frame. For the JSON file, the parsing time is always reported as NA.
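These columns are easy to aggregate. The sketch below rebuilds part of the table by hand, with values copied from the printout above, so that it runs without network access. With a real od_table object, earnings$resources would take the place of the resources data frame.

```r
# Values copied from the earnings$resources printout in this article
resources <- data.frame(
  name = c("meta.json", "data.csv", "HEADER.csv", "C-A11-0.csv",
           "C-STAATS-0.csv", "C-VEBDL-0.csv", "C-BESCHV-0.csv"),
  size = c(4062, 4931, 516, 159, 697, 518, 641),
  parsed = c(NA, 0.773, 0.480, 0.448, 0.466, 0.476, 0.470)
)

# Total disk usage of the dataset in bytes
sum(resources$size)
#> [1] 11524

# Total parsing time in milliseconds; the JSON row is NA and is skipped
sum(resources$parsed, na.rm = TRUE)
#> [1] 3.113
```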

What’s in the cache?

od_cache_summary() gives an overview of all files that are available in the cache directory. The returned table contains one row for every dataset.

The column updated contains the last modified date for the dataset’s JSON file.
json, data and header give the file sizes in bytes for the corresponding files.
fields is the total size of all fields and n_fields is the number of classification files available.

This gives a clear picture of how much disk space is used for each dataset.

od_cache_summary()

#> NULL

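Note that od_cache_summary() only gathers information from the local file system based on file names, file.mtime() and file.size(). The NULL result shown here presumably reflects an empty cache after the cleanup and directory change in the previous section.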
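To illustrate how a summary can be derived from file names and file sizes alone, here is a minimal base R sketch. It is not the package's implementation: summarise_cache() is a hypothetical helper, its output columns are simplified, and returning NULL for an empty directory is an assumption. The file contents are dummy bytes.

```r
# Hypothetical helper: summarise a cache directory by dataset id.
summarise_cache <- function(dir) {
  files <- list.files(dir, pattern = "\\.(csv|json)$")
  if (length(files) == 0) {
    return(NULL)
  }
  # Drop the extension and an optional "_HEADER" or "_C-..." suffix
  id <- sub("(_HEADER|_C-.*)?\\.(csv|json)$", "", files)
  size <- file.size(file.path(dir, files))
  counts <- table(id)
  data.frame(
    id = names(counts),
    n_files = as.vector(counts),
    bytes = as.vector(tapply(size, id, sum)[names(counts)])
  )
}

# Usage on a throwaway directory with dummy files of known size
demo <- file.path(tempdir(), "cache_summary_demo")
dir.create(demo, showWarnings = FALSE)
writeBin(raw(10), file.path(demo, "OGD_veste309_Veste309_1.json"))
writeBin(raw(20), file.path(demo, "OGD_veste309_Veste309_1.csv"))
writeBin(raw(30), file.path(demo, "OGD_veste309_Veste309_1_C-A11-0.csv"))
summarise_cache(demo)
```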

Download history

To get a history of all files that have been downloaded from the server, use od_downloads(). For each file, a timestamp for the download is recorded as well as the download time in milliseconds.

od_downloads()

# A data frame: 7 × 3
  time                file                                   downloaded
  <dttm>              <chr>                                       <dbl>
1 2024-11-29 09:55:33 OGD_veste309_Veste309_1.json                 530.
2 2024-11-29 09:55:33 OGD_veste309_Veste309_1.csv                  167.
3 2024-11-29 09:55:34 OGD_veste309_Veste309_1_HEADER.csv           168.
4 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-A11-0.csv          168.
5 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-STAATS-0.csv       168.
6 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-VEBDL-0.csv        167.
7 2024-11-29 09:55:34 OGD_veste309_Veste309_1_C-BESCHV-0.csv       167.