od_table()
makes it easy to import datasets from
data.statistik.gv.at
into your R sessions. This function downloads csv sources from the
fileserver. This means that no API key is required to use STATcubeR with
datasets from the OGD portal.
In this example, we will use a
data set about cancer statistics.
The dataset id "OGD_krebs_ext_KREBS_1"
can be extracted
from the url and will be used in the data import.
Import and overview
To import a dataset, provide the dataset id as an argument.
table <- od_table("OGD_krebs_ext_KREBS_1")
This returns an object of class od_table
, which
bundles all the data from the OGD portal that correspons to this
dataset. Printing the object will show a summary of the contents.
table
#> Cancer statistics by reporting year, province of residence and
#> localisation of cancer
#>
#> Dataset: OGD_krebs_ext_KREBS_1 (data.statistik.gv.at)
#> Measures: Number of records F-KRE
#> Fields: Tumore ICD/10 3-Steller <95>, Reporting year <37>, Province of
#> residence <9>, Sex <2>
#>
#> Request: [2022-12-20 11:36:01]
#> STATcubeR: 0.5.0.1
The dataset contains the number of cancer patients by several classification fields
- tumor type differentiates <95> types of cancers
- The reporting period spans <37> years (1983 to 2019).
- The regional variable contains the <9> NUTS-2 regions of austria.
- The demographic variable “Sex” is reported with <2> levels
Convert to a data frame
The method $tabulate()
can be used to turn the object
into a data.frame
in long format, which contains labled
data.
table$tabulate()
# A STATcubeR tibble: 46,479 x 5
`Tumore ICD/10 3-Steller` Reportin…¹ Province of r…² Sex Numbe…³
* <fct> <date> <fct> <fct> <int>
1 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Burgenland " male 2
2 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Carinthia" male 8
3 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Carinthia" female 2
4 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Lower Austria" male 6
5 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Lower Austria" female 2
6 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Upper Austria" male 12
7 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Upper Austria" female 2
8 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Salzburg" male 4
9 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Styria" male 5
10 <C00> Bösartige Neubildung der Lip… 1983-01-01 "Styria" female 2
# … with 46,469 more rows, and abbreviated variable names ¹`Reporting year`,
# ²`Province of residence`, ³`Number of records F-KRE`
The dataset contains 46479 rows. If every combination of tumor type, year, region and sex would contain a seperate row the number of rows would be the following.
\[ 95\times37\times9\times2 = 63270 \]
This means that the table is fairly dense. But this might not be the case for other OGD datasets.
Metadata
This section will show the different metadata components contained in the table object and how they relate to the resources on the OGD server.
table$resources$name
#> [1] "OGD_krebs_ext_KREBS_1.json"
#> [2] "OGD_krebs_ext_KREBS_1.csv"
#> [3] "OGD_krebs_ext_KREBS_1_HEADER.csv"
#> [4] "OGD_krebs_ext_KREBS_1_C-TUM_ICD10_3ST-0.csv"
#> [5] "OGD_krebs_ext_KREBS_1_C-BERJ-0.csv"
#> [6] "OGD_krebs_ext_KREBS_1_C-BUNDESLAND-0.csv"
#> [7] "OGD_krebs_ext_KREBS_1_C-KRE_GESCHLECHT-0.csv"
#> attr(,"class")
#> [1] "ogd_file" "character"
Header
The labels for the columns of the data.frame representation are
generated from
OGD_krebs_ext_KREBS_1_HEADER.csv
and can be extracted from the table object via $header
.
table$header
# STATcubeR metadata: 5 x 6
code label label_de label_en
<chr> <chr> <chr> <chr>
1 F-KRE NA Anzahl der Datensätze F-KRE Number of records F-KRE
2 C-TUM_ICD10_3ST-0 NA Tumore ICD/10 3-Steller NA
3 C-BERJ-0 NA Berichtsjahr Reporting year
4 C-BUNDESLAND-0 NA Bundesland Province of residence
5 C-KRE_GESCHLECHT-0 NA Geschlecht Sex
# … with 2 more columns: 'de_desc', 'en_desc'
Additional metadata for the columns can be obtained via
$meta
. See the
STATcubeR data article
for more details.
Field infos
The method table$field()
can be used to get information
about specific classification fields. Thise contain data from
{dataset_id}_{field_code}.csv
. Unlike the metadata in
sc_table
, the od_table
class always contains
german and english labels. Both can be used to label the dataset.
The following call gives access to the german and english labels for
the 95 differen tumor types in the "cancer type"
classification. Click "Year"
above to see information about
the years.
table$field("C-TUM_ICD10_3ST-0")
# STATcubeR metadata: 95 x 10
code label
<chr> <chr>
1 TUM_ICD10_3ST-C00 <C00> Bösartige Neubildung der Lippe
2 TUM_ICD10_3ST-C01 <C01> Bösartige Neubildung des Zungengrundes
3 TUM_ICD10_3ST-C02 <C02> Bösartige Neubildung sonstiger und nicht näher bezeic…
4 TUM_ICD10_3ST-C03 <C03> Bösartige Neubildung des Zahnfleisches
5 TUM_ICD10_3ST-C04 <C04> Bösartige Neubildung des Mundbodens
# … with 90 more rows, and 1 more variable: parsed <chr>
# … with 7 more columns: 'label_de', 'label_en', 'parent', 'de_desc', 'en_desc', 'visible', 'order'
The reporting period spans 37 years (1983 to 2019). The
classification elements are parsed into a <date>
format for the <data.frame>
representation.
table$field("C-BERJ-0")
# STATcubeR metadata: 37 x 10
code label parsed
<chr> <chr> <date>
1 BERJ-1983 1983 1983-01-01
2 BERJ-1984 1984 1984-01-01
3 BERJ-1985 1985 1985-01-01
4 BERJ-1986 1986 1986-01-01
5 BERJ-1987 1987 1987-01-01
# … with 32 more rows
# … with 7 more columns: 'label_de', 'label_en', 'parent', 'de_desc', 'en_desc', 'visible', 'order'
The regional classification contains 9 elements which correspond to the NUTS2 regions (“Bundesländer”) of austria.
table$field("C-BUNDESLAND-0")
# STATcubeR metadata: 9 x 10
code label parsed
<chr> <chr> <chr>
1 BUNDESLAND-1 "Burgenland " "Burgenland "
2 BUNDESLAND-2 "Carinthia" "Carinthia"
3 BUNDESLAND-3 "Lower Austria" "Lower Austria"
4 BUNDESLAND-4 "Upper Austria" "Upper Austria"
5 BUNDESLAND-5 "Salzburg" "Salzburg"
# … with 4 more rows
# … with 7 more columns: 'label_de', 'label_en', 'parent', 'de_desc', 'en_desc', 'visible', 'order'
Sex is coded as a ditochome variable with the classification elements
"male"
and "female"
.
table$field("C-KRE_GESCHLECHT-0")
# STATcubeR metadata: 2 x 10
code label parsed
<chr> <chr> <chr>
1 GESCHLECHT-1 male male
2 GESCHLECHT-2 female female
# … with 7 more columns: 'label_de', 'label_en', 'parent', 'de_desc', 'en_desc', 'visible', 'order'
json Metadata
The json metadatafile
OGD_krebs_ext_KREBS_1.json
is available via the $json
binding.
table$json
#> Krebsstatistik
#>
#> Krebsstatistik nach Krebslokalisation (ICD10), Geschlecht und
#> Wohnbundesland
#>
#> Measures: Anzahl der Datensätze F-KRE
#> Fields: Tumore ICD/10 3-Steller, Berichtsjahr, Bundesland, Geschlecht
#> Updated: 2022-09-06 14:51:39
#> Tags: Krebsstatistik, Krebslokalisation-ICD10, Geschlecht, Wohnbundesland
#> Categories: Gesundheit
od_json("OGD_veste309_Veste309_1")
#> Verdienststrukturerhebung 2018 Bruttostundenverdienste in EUR nach
#> Staatsangehörigkeit, Bundesland und Beschäftigungsverhältnis
#>
#> Verdienststruktur nach Geschlecht, Staatsangehörigkeit, Bundesland
#> und Beschäftigungsverhältnis
#>
#> Measures: Arithmetisches Mittel, 1. Quartil, 2. Quartil (Median), 3.
#> Quartil, Zahl d unselbst Beschäftigten
#> Fields: Geschlecht, Staatsangehörigkeit, Bundesland (NUTS 2), Form des
#> Beschäftigungsverhältnisses
#> Updated: 2022-03-24 11:29:48
#> Tags: Staatsangehörigkeit, Bundesland, Beschäftigungsverhältnis
#> Categories: Arbeit, Bevölkerung
od_json("OGD_konjunkturmonitor_KonMon_1")
#> Konjunkturmonitor
#>
#> Measures: Produktionsindex Industrie (at; 2015=100), Technische
#> Gesamtproduktion Industrie in Tsd. € (KJE), Umsatzindex Industrie
#> (2015=100), Umsatz Industrie inTsd.€ (KJE), Auftragseingangsindex
#> Industrie (2015=100), Beschäftigtenindex Industrie (2015=100),
#> Beschäftigte Industrie gesamt (KJE), Produktivitätsindex Industrie je
#> unselbständig Beschäftigtem (2015=100), Produktivitätsindex Industrie je
#> geleisteter Arbeitsstunde (2015=100), Erzeugerpreisindex für den
#> Produzierenden Bereich (2021=100; NACE B-E), … (76 more)
#> Fields: Berichtszeitraum, Wertangabe
#> Updated: 2022-12-20 09:25:23
#> Tags: Konjunkturdaten
#> Categories: Wirtschaft und Tourismus, Arbeit, Bevölkerung, Finanzen und
#> Rechnungswesen
This print method only shows part of the metadata. More information can be extracted by using the keys of the json object.
table$json$extras$publisher
#> [1] "Statistik Austria, Guglgasse 13, 1110 Wien, Austria"
table$json$extras$update_frequency
#> [1] "jährlich"
table$json$resources[[1]]$url
#> [1] "https://data.statistik.gv.at/data/OGD_krebs_ext_KREBS_1.csv"
Table Contents
To get the raw microdata from
OGD_krebs_ext_KREBS_1.csv
,
use table$data
. The output is similar to what is returned
from read.csv2("OGD_krebs_ext_KREBS_1.csv")
.
table$data
# A STATcubeR tibble: 46,479 x 5
`C-TUM_ICD10_3ST-0` `C-BERJ-0` `C-BUNDESLAND-0` `C-KRE_GESCHLECHT-0` `F-KRE`
* <fct> <fct> <fct> <fct> <int>
1 TUM_ICD10_3ST-C00 BERJ-1983 BUNDESLAND-1 GESCHLECHT-1 2
2 TUM_ICD10_3ST-C00 BERJ-1983 BUNDESLAND-2 GESCHLECHT-1 8
3 TUM_ICD10_3ST-C00 BERJ-1983 BUNDESLAND-2 GESCHLECHT-2 2
4 TUM_ICD10_3ST-C00 BERJ-1983 BUNDESLAND-3 GESCHLECHT-1 6
5 TUM_ICD10_3ST-C00 BERJ-1983 BUNDESLAND-3 GESCHLECHT-2 2
# … with 46,474 more rows
od_table()
makes sure that the levels of all factor
columns are in the same order as in the metadata.
levels(table$data$`C-BUNDESLAND-0`) == table$field("C-BUNDESLAND-0")$code
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
As mentioned above, a labelled version of the data can be obtained
via table$tabulate()
. The labelling is done by taking the
raw dataset and then joining the labes from $header
and
$field()
.
table$tabulate()
# A STATcubeR tibble: 46,479 x 5
`Tumore ICD/10 3-Steller` Reportin…¹ Province of r…² Sex Numbe…³
* <fct> <date> <fct> <fct> <int>
1 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Burgenland " male 2
2 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Carinthia" male 8
3 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Carinthia" female 2
4 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Lower Austria" male 6
5 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Lower Austria" female 2
# … with 46,474 more rows, and abbreviated variable names ¹`Reporting year`,
# ²`Province of residence`, ³`Number of records F-KRE`
Time variables are converted into a <date>
format
if they satisfy certain STATcube standards. You can read more about
$tabulate()
in the
tabulation article.
A Trip to Germany
It is possible to switch the language used for labelling the dataset
using the $language
field. This field can be used to get
and set the language. Allowed options are "en"
for english
and "de"
for german.
table$language
#> [1] "en"
table$language <- "de"
table$language
#> [1] "de"
This option affects the print()
method as well as the
output of $tabulate()
. If no english labels are available,
the german labels are used as a fallback mechanism.
table
#> Krebsstatistik
#>
#> Dataset: OGD_krebs_ext_KREBS_1 (data.statistik.gv.at)
#> Measures: Anzahl der Datensätze F-KRE
#> Fields: Tumore ICD/10 3-Steller <95>, Berichtsjahr <37>, Bundesland <9>,
#> Geschlecht <2>
#>
#> Request: [2022-12-20 11:36:01]
#> STATcubeR: 0.5.0.1
table$tabulate()
# A STATcubeR tibble: 46,479 x 5
`Tumore ICD/10 3-Steller` Berichts…¹ Bundesland Geschl…² Anzah…³
* <fct> <date> <fct> <fct> <int>
1 <C00> Bösartige Neubildung der L… 1983-01-01 Burgenland männlich 2
2 <C00> Bösartige Neubildung der L… 1983-01-01 Kärnten männlich 8
3 <C00> Bösartige Neubildung der L… 1983-01-01 Kärnten weiblich 2
4 <C00> Bösartige Neubildung der L… 1983-01-01 Niederösterreich männlich 6
5 <C00> Bösartige Neubildung der L… 1983-01-01 Niederösterreich weiblich 2
# … with 46,474 more rows, and abbreviated variable names ¹Berichtsjahr,
# ²Geschlecht, ³`Anzahl der Datensätze F-KRE`
Further reading
- See the
available datasets article
to list all datasets that are compatible with
od_table()
. - Open data datasets often contain a large amount of rows. Check out the tabulation article to see how they can be summarized into a more compact form.
- STATcubeR caches all files requested from the server under the hood. The caching article explains where and how those caches are stored.